Forecasting currency covariances using machine learning tree-based algorithms with low and high prices

We combine machine learning tree-based algorithms with the usage of low and high prices and suggest a new approach to forecasting currency covariances. We apply three algorithms: Random Forest Regression, Gradient Boosting Regression Trees and Extreme Gradient Boosting with a tree learner. We conduct an empirical evaluation of this procedure on the three most heavily traded currency pairs in the Forex market: EUR/USD, USD/JPY and GBP/USD. The covariance forecasts obtained from the three applied algorithms are predominantly more accurate than those from the Dynamic Conditional Correlation model based on closing prices. The results of the analyses indicate that the GBRT algorithm is the best-performing method.


Introduction
Multivariate volatility models can be used in many financial applications, such as asset pricing, portfolio optimisation, risk management, the estimation of systemic risk in banking, Value-at-Risk estimation or asset allocation. Commonly used volatility models of financial instruments are largely based on closing prices only. However, the use of daily low and high prices leads to more accurate estimates and forecasts of variances (e.g. Chou, 2005; Fiszeder & Perczak, 2016; Lin et al., 2012; Molnár, 2016) and covariances (e.g. Chou et al., 2009; Fiszeder, 2018). Daily low and high prices are almost always available alongside closing prices in financial series. Therefore, making use of them in volatility models is important from a practical viewpoint. A review of multivariate range-based models can be found in Petropoulos et al. (in press).
Recently, the importance of machine learning (ML) algorithms in the forecasting of financial time series has increased considerably (see e.g. de Prado, 2018). ML models, unlike classic (nonlinear) time series analysis, do not require prior assumptions about the underlying structure of data (Zhang, 2003) and are able to capture recurring nonlinear patterns in time series (see e.g. Fischer et al., 2019). These factors cause machine learning algorithms to outperform most traditional stochastic methods in financial market forecasting (Fiszeder & Orzeszko, 2021; Ryll & Seidens, 2019). The most popular ML approaches in the field of finance are the Artificial Neural Network (ANN) and the Support Vector Machine (SVM). On average, recurrent neural networks outperform feed-forward neural networks as well as support vector machines (Ryll & Seidens, 2019). However, most models based on ANN and SVM are treated as 'black box' algorithms. A black box model is a system that does not reveal its internal mechanisms; in machine learning, the term describes models that cannot be understood by looking at their parameters (e.g. an artificial neural network). Interpretable or Explainable Machine Learning refers to methods and models that make the behaviour and predictions of machine learning systems understandable to humans, which is essential in business forecasting and decision-making (Bejger & Elster, 2020). As an alternative to ANN and SVM models, we want to evaluate the performance of tree-based ensemble algorithms in the forecasting of financial time series. Although ensemble learning algorithms are also referred to as black boxes, an ensemble built from decision trees is considerably easier to interpret, both as a model and in its predictions. The decision mechanism (model) of a single decision tree is entirely transparent and interpretable (Barredo Arrieta et al., 2019) due to its intrinsic properties. Among other things, it enables the ranking of the relative significance of predictor variables through variable importance metrics (VIMs) (Biau & Scornet, 2016; Breiman, 2001). In an ensemble of trees, these measures can be used for the global and local interpretability of predictions.
The aim of the paper is to suggest a new approach to forecasting currency covariances based on the combination of machine learning tree-based algorithms with the use of low and high prices. The methods we selected are based on the regression tree concept and the Classification and Regression Trees (CART) split criterion (Breiman et al., 1984). We apply the Random Forest Regression (RFR) algorithm (Breiman, 2001), the Gradient Boosting Regression Trees (GBRT) algorithm (Friedman, 2001), and the Extreme Gradient Boosting with tree learner (XGBoost, described in Chen & Guestrin, 2016). Although the most popular ML approaches in the field of finance are ANN and SVM learning methods (Henrique et al., 2019; Ryll & Seidens, 2019), we decided to apply the three above-mentioned algorithms instead, for the following reasons:
• models based on ANN and SVM are treated as 'black box' algorithms with no clear interpretation of hyperparameters and the importance of features, while tree-based methods are interpretable through VIMs (an intrinsic property of tree-based models) and a well-defined hyperparameter meaning;
• they can handle heterogeneous data (ordered or categorical variables, or a mix of both) with minimal preprocessing;
• they can handle highly correlated predictor variables;
• like all tree-based methods, they intrinsically implement feature selection;
• they either do not overfit (RFR) or are easy to control against overfitting (GBRT, XGBoost);
• tree-based methods are still rarely used in forecasting financial markets, despite the fact that they proved to be among the best in competitions such as M4, M5, or those organised by the Kaggle portal.
The applications of the RFR, GBRT, and XGBoost algorithms in the forecasting of financial markets are presented in the works of Ghosh et al. (in press), Islam et al. (2021), Khaidem et al. (2016), Krauss et al. (2017), Kumar and Thenmozhi (2006), Waldow et al. (2021), Yang (2021) and Yang et al. (2021). However, most of those studies are devoted to the forecasting of stock prices or exchange rates. To the best of our knowledge, this study presents the first application of these algorithms to the forecasting of currency covariances. We empirically evaluate the usability of the algorithms on the three most heavily traded currency pairs in the Forex market: EUR/USD, USD/JPY, and GBP/USD. The covariance forecasts obtained from the three applied algorithms are predominantly more accurate than those from the Dynamic Conditional Correlation benchmark model based on closing prices.
The remaining part of the paper is organised as follows: Section 2 describes the applied models and methods; in Section 3 we present the data and an outline of the study; Section 4 evaluates the forecasts of the covariance of returns from the RFR, GBRT and XGBoost algorithms and the DCC model; and Section 5 contains the conclusions of the study.

Tree-based ensemble algorithms
The building block of the machine learning algorithms which we use is a weak learner in the form of a regression tree (e.g. Breiman et al., 1984; Quinlan, 1992). A regression tree is a supervised learning method that learns a function combining a set of variables in order to predict another variable. The general idea of a tree learner is to partition the feature space into a set of rectangles and then fit a simple model (like a constant) in each one (Hastie et al., 2009).
The prediction function of a tree is defined as:
$$f(x) = \sum_{j=1}^{J} c_j \, I(x \in R_j), \tag{1}$$
where $J$ is the number of leaves in the tree, $R_j$ is a region in the feature space (corresponding to leaf $j$), $c_j$ is a constant corresponding to region $R_j$, and $I(\cdot)$ is the indicator function (returning 1 if $x \in R_j$, 0 otherwise). The value of $c_j$ is determined in the training phase of the tree. For regression, we partition the predictor space to find a set of regions $R_1, \ldots, R_J$ that minimise the residual sum of squares (RSS), given by:
$$\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2,$$
where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$-th region.
As it is infeasible to consider every possible partition of the feature space into regions, a top-down, greedy algorithm known as recursive partitioning (e.g., CART for a binary tree, Breiman et al., 1984) is used to train the single tree. The essential element of CART is a split criterion, dependent on an impurity measure. A regression tree split criterion computes the (renormalised) difference between the empirical variance in the node before and after a cut is performed.
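The greedy split search described above can be sketched for a single predictor as follows. This is only an illustration under simplifying assumptions: one feature, one split (a "stump"), and the hypothetical helper names `rss`, `best_split` and `stump_predict`. Minimising the RSS after the split is equivalent to maximising the reduction in empirical variance within the node, because the two quantities differ only by the fixed pre-split RSS and the node size.

```python
from statistics import fmean


def rss(values):
    """Residual sum of squares of `values` around their own mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Greedy search for the threshold on one feature that minimises the RSS.

    Returns (threshold, rss_after_split); threshold is None if no split helps.
    """
    pairs = sorted(zip(x, y))
    best_threshold, best_rss = None, rss(y)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_rss = total
    return best_threshold, best_rss


def stump_predict(x_new, threshold, x, y):
    """Piecewise-constant prediction: mean response of the region containing x_new."""
    region = [t for f, t in zip(x, y) if (f <= threshold) == (x_new <= threshold)]
    return fmean(region)


# Toy data (hypothetical): two clearly separated groups of responses.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, rss_after = best_split(x, y)
```

A full CART tree would apply `best_split` recursively to each resulting region and over all predictors; the sketch stops after the first cut.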

Random Forest ensemble algorithm
A single regression tree is typically a weak prediction model which is unstable (a high-variance learner). To reduce prediction variance and prevent bias from increasing, one can combine the predictions of many weak learners (Schapire, 1990), creating an ensemble of learners. A Random Forest (Breiman, 2001; Ho, 1998) is an ensemble (or forest) of decision trees grown from a randomised variant of a tree induction algorithm.
The Random Forest exploits two sources of randomness to reduce the correlation of residuals of base learners, which decreases the general prediction error. The first is the bootstrap: each tree is built on a bootstrap-resampled training data set, so the trees differ from one another. The second is split-variable randomisation: each time a split is to be performed, the search for the split variable is limited to a random subset of $m$ of the $p$ predictors, which decorrelates the trees. When the forest's element is a regression tree, such a learning mechanism is called Random Forest Regression.
The RFR prediction is the unweighted average over predictions (1) of the set of trees:
$$f_{\mathrm{RFR}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x),$$
where $B$ is the number of trees and $f_b$ is the prediction function of the $b$-th tree. If $m = p$, we have an ancestor of random forests, bootstrap aggregation, or a bagging ensemble method (Breiman, 2004).
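The two sources of randomness and the unweighted averaging can be illustrated with a deliberately small sketch. It is not the implementation used in the study: the base learner is a one-split stump, and the names `fit_stump`, `fit_forest`, `predict_forest`, the parameter `m` (number of candidate predictors per split) and the toy data are all assumptions made for the example. Because a stump has a single split, drawing the predictor subset once per tree coincides with drawing it at every split.

```python
import random
from statistics import fmean


def fit_stump(rows, y, features):
    """Best single split over the candidate features, using the RSS criterion."""
    base = fmean(y)
    best = {"feature": None, "threshold": None, "left": base, "right": base}
    best_rss = sum((t - base) ** 2 for t in y)
    for f in features:
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(rows, y) if row[f] <= thr]
            right = [t for row, t in zip(rows, y) if row[f] > thr]
            lm, rm = fmean(left), fmean(right)
            total = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if total < best_rss:
                best_rss = total
                best = {"feature": f, "threshold": thr, "left": lm, "right": rm}
    return best


def predict_stump(stump, row):
    """Constant prediction of the leaf that `row` falls into."""
    if stump["feature"] is None:
        return stump["left"]  # no useful split was found: predict the sample mean
    goes_left = row[stump["feature"]] <= stump["threshold"]
    return stump["left"] if goes_left else stump["right"]


def fit_forest(rows, y, n_trees, m, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and m random predictors."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        features = rng.sample(range(p), m)          # split-variable randomisation
        forest.append(fit_stump([rows[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Unweighted average of the individual tree predictions."""
    return fmean([predict_stump(stump, row) for stump in forest])


# Toy data (hypothetical): the response depends on the first predictor only.
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
bagging = fit_forest(rows, y, n_trees=25, m=2)  # m = p reduces to bagging
low_pred = predict_forest(bagging, [2.0, 0.0])
high_pred = predict_forest(bagging, [17.0, 0.0])
```

Setting `m` below the number of predictors forces some trees to split on weaker variables, which is the price paid for decorrelating the ensemble members.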

Gradient Boosting Regression Trees
Boosting (Schapire, 1999) is also a technique which additively combines many weak learners into an ensemble. It is a form of a more general concept of additive expansion learning. Boosting algorithms evolved from Adaboost, the first successful boosting algorithm (Freund, 1995; Freund & Schapire, 1997), to its generalisation as Gradient Boosting, which handles various loss functions (Friedman, 2001; Friedman et al., 2000). The GBRT algorithm involves two main steps: fitting a (shallow) decision tree $h$ to the 'residuals' of the current model $f$, and then updating $f$ by adding $h$ and updating the residuals. Those steps are repeated until the error on the test set starts to rise. The natural idea is to generalise boosting to any differentiable loss function (for example, one that is not sensitive to outliers). In our study, we use the Huber loss function of the following form:
$$L_\delta(y, f) = \begin{cases} \frac{1}{2}(y - f)^2, & |y - f| \le \delta, \\ \delta \left( |y - f| - \frac{\delta}{2} \right), & \text{otherwise}, \end{cases}$$
where $y$ is the observed value, $f$ is the prediction and $\delta > 0$ is the threshold beyond which the loss grows linearly rather than quadratically. The most important differences between GBRT and RFR are the following: trees are grown sequentially, which means that each tree is grown using information from the previously grown trees; the method is more sensitive to overfitting; and the number of trees should be controlled ex post.
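To make the role of the loss function concrete, the sketch below evaluates the Huber loss and its negative gradient with respect to the prediction. In gradient boosting, this negative gradient plays the part of the 'residuals' to which the next tree is fitted. The function names and the default threshold `delta=1.0` are assumptions chosen for the example, not values taken from the study.

```python
def huber_loss(y, f, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = y - f
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)


def huber_negative_gradient(y, f, delta=1.0):
    """Negative derivative of the Huber loss with respect to the prediction f.

    Inside the threshold it is the ordinary residual; outside, it is clipped
    to +/- delta, which limits the influence of outliers on the next tree.
    """
    r = y - f
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


# Pseudo-residuals for a few (observation, current prediction) pairs.
observations = [0.5, 3.0, -3.0]
predictions = [0.0, 0.0, 0.0]
pseudo_residuals = [
    huber_negative_gradient(y, f) for y, f in zip(observations, predictions)
]
```

With the square loss the pseudo-residuals would be 0.5, 3.0 and -3.0; under the Huber loss the two large values are clipped to the threshold.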

Extreme Gradient Boosting algorithm
XGBoost is a scalable machine learning system for tree boosting, implemented and described by Chen and Guestrin (2016). It is widely recognised by practitioners (e.g. Kaggle competitors) and has implementations in many programming languages (R, Python, Java, Scala, Julia, Perl, and others). The method is based on the GBRT idea, but the computational implementation offers more hyperparameters to tune. There is a technical difference between GBRT and XGBoost in how the loss function is optimised: GBRT divides the optimisation problem into two parts (determining the direction of the minimisation step and then optimising the step length), whereas XGBoost tries to determine both directly in one step. At each iteration, both algorithms need to calculate the gradient at the current estimate, but XGBoost also needs to calculate the Hessian matrix, so the XGBoost loss function must be twice differentiable.
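The one-step, second-order update can be illustrated for a single leaf. Following the derivation in Chen and Guestrin (2016), the optimal leaf weight is the negative sum of gradients divided by the sum of second derivatives plus an L2 penalty. The sketch below assumes the square loss $\frac{1}{2}(y - \hat{y})^2$, for which the gradient is $\hat{y} - y$ and the second derivative is 1; the penalty name `reg_lambda` and the helper names are assumptions for the example.

```python
def squared_error_grad_hess(y, pred):
    """First and second derivatives of 0.5 * (y - pred)**2 with respect to pred."""
    return pred - y, 1.0


def optimal_leaf_weight(ys, preds, reg_lambda=1.0):
    """Second-order leaf weight: -sum(gradients) / (sum(hessians) + reg_lambda)."""
    g_sum, h_sum = 0.0, 0.0
    for y, pred in zip(ys, preds):
        g, h = squared_error_grad_hess(y, pred)
        g_sum += g
        h_sum += h
    return -g_sum / (h_sum + reg_lambda)


# Three observations in one leaf, current prediction 0 for each (toy values).
ys = [1.0, 2.0, 3.0]
preds = [0.0, 0.0, 0.0]
unpenalised = optimal_leaf_weight(ys, preds, reg_lambda=0.0)
penalised = optimal_leaf_weight(ys, preds, reg_lambda=1.0)
```

Without the penalty the leaf weight equals the mean residual in the leaf, which matches what a plain regression tree fitted to residuals would predict; a positive penalty shrinks the weight towards zero.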

Range-based covariance estimator for exchange rates
In the suggested approach to covariance forecasting, we apply the estimator of the covariance of returns calculated on the basis of low and high prices. This estimator has an advantage over that based only on the closing prices, because it uses information about the price changes during the day. Let us consider two exchange rates of currencies $A$ and $B$ in terms of currency $C$, denoted as $A/C$ and $B/C$, respectively. In the absence of triangular arbitrage opportunities, the return of the cross rate $A/B$ can be written as:
$$r_{A/B,t} = r_{A/C,t} - r_{B/C,t}.$$
Then, the range-based estimator of covariance for the currency pairs $A/C$ and $B/C$ can be represented as:
$$\widehat{\mathrm{cov}}_t = \frac{1}{8 \ln 2} \left[ \ln^2\!\left( \frac{H_{A/C,t}}{L_{A/C,t}} \right) + \ln^2\!\left( \frac{H_{B/C,t}}{L_{B/C,t}} \right) - \ln^2\!\left( \frac{H_{A/B,t}}{L_{A/B,t}} \right) \right], \tag{6}$$
where $H_t$ and $L_t$ are the daily high and low prices, respectively. More details about the applied range-based covariance estimator and its properties can be found in Fiszeder and Orzeszko (2021), who employ this estimator in a new methodology for dynamic modelling and forecasting of covariance matrices based on support vector regression.
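A minimal numerical sketch of such an estimator is given below. It rests on two assumptions that should be checked against Fiszeder and Orzeszko (2021): that each variance is estimated from the daily range with the Parkinson formula $\ln^2(H/L) / (4 \ln 2)$, and that daily high and low prices of the cross rate $A/B$ are available. The covariance then follows from the identity $\mathrm{var}(r_{A/B}) = \mathrm{var}(r_{A/C}) + \mathrm{var}(r_{B/C}) - 2\,\mathrm{cov}(r_{A/C}, r_{B/C})$ implied by the cross-rate relation. Function names and the price values are hypothetical.

```python
import math


def parkinson_variance(high, low):
    """Range-based daily variance estimate from the high and low prices."""
    return math.log(high / low) ** 2 / (4.0 * math.log(2.0))


def range_based_covariance(high_ac, low_ac, high_bc, low_bc, high_ab, low_ab):
    """Covariance of A/C and B/C returns recovered from three range-based variances."""
    var_ac = parkinson_variance(high_ac, low_ac)
    var_bc = parkinson_variance(high_bc, low_bc)
    var_ab = parkinson_variance(high_ab, low_ab)
    return 0.5 * (var_ac + var_bc - var_ab)


# Toy quotes for one day (hypothetical numbers, not market data).
cov_today = range_based_covariance(
    high_ac=1.1050, low_ac=1.0950,   # A/C
    high_bc=1.2580, low_bc=1.2460,   # B/C
    high_ab=0.8810, low_ab=0.8770,   # cross rate A/B
)
```

If the cross rate did not move at all during the day, the third term vanishes and the estimate reduces to the average of the two individual variances, which is the case of perfectly co-moving pairs.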

The DCC model
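In this section, we describe the DCC model of Engle (2002). It is one of the most popular multivariate volatility models (see e.g. Bauwens et al., 2012) and is often used as a benchmark model in empirical studies. Let us assume that $\boldsymbol{\varepsilon}_t$ ($N \times 1$ vector) is the innovation process for the conditional mean and can be written as:
$$\boldsymbol{\varepsilon}_t \mid \psi_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{H}_t),$$
where $\psi_{t-1}$ is the set of all information available at time $t-1$, $\mathcal{N}$ is the multivariate normal distribution, and $\mathbf{H}_t$ is the $N \times N$ symmetric conditional covariance matrix.
The DCC($I$, $J$) model can be presented as:
$$\mathbf{H}_t = \mathbf{D}_t \mathbf{R}_t \mathbf{D}_t,$$
$$\mathbf{R}_t = \mathbf{Q}_t^{*-1} \mathbf{Q}_t \mathbf{Q}_t^{*-1},$$
$$\mathbf{Q}_t = \left( 1 - \sum_{i=1}^{I} \zeta_i - \sum_{j=1}^{J} \theta_j \right) \bar{\mathbf{Q}} + \sum_{i=1}^{I} \zeta_i \, \mathbf{z}_{t-i} \mathbf{z}_{t-i}' + \sum_{j=1}^{J} \theta_j \mathbf{Q}_{t-j},$$
where $\mathbf{D}_t = \mathrm{diag}(h_{1t}^{1/2}, h_{2t}^{1/2}, \ldots, h_{Nt}^{1/2})$, the conditional variances $h_{kt}$ (for $k = 1, 2, \ldots, N$) are described as univariate GARCH models (equations (12-13)), $\mathbf{z}_t$ is the standardised $N \times 1$ residual vector, assumed to be serially independently distributed and given as $\mathbf{z}_t = \mathbf{D}_t^{-1} \boldsymbol{\varepsilon}_t$, $\mathbf{R}_t$ is the time-varying $N \times N$ conditional correlation matrix of $\boldsymbol{\varepsilon}_t$, $\bar{\mathbf{Q}}$ is the unconditional $N \times N$ covariance matrix of $\mathbf{z}_t$ (it can also be estimated with other parameters of the model, but this makes estimation more difficult) and $\mathbf{Q}_t^{*}$ is the diagonal $N \times N$ matrix composed of the square roots of the diagonal elements of $\mathbf{Q}_t$. Parameters $\zeta_i$ (for $i = 1, 2, \ldots, I$) and $\theta_j$ (for $j = 1, 2, \ldots, J$) are nonnegative and satisfy the condition $\sum_{i=1}^{I} \zeta_i + \sum_{j=1}^{J} \theta_j < 1$.
The univariate GARCH($p$, $q$) model applied in the DCC model can be written as:
$$\varepsilon_{kt} \mid \psi_{t-1} \sim \mathcal{N}(0, h_{kt}), \tag{12}$$
$$h_{kt} = \alpha_{k0} + \sum_{i=1}^{q} \alpha_{ki} \varepsilon_{k,t-i}^2 + \sum_{j=1}^{p} \beta_{kj} h_{k,t-j}, \tag{13}$$
where $\alpha_{k0} > 0$, $\alpha_{ki} \ge 0$, $\beta_{kj} \ge 0$ (for $k = 1, 2, \ldots, N$; $i = 1, 2, \ldots, q$; $j = 1, 2, \ldots, p$); weaker conditions for the nonnegativity of the conditional variance can be assumed (see Nelson & Cao, 1992). The requirement for covariance stationarity of $\varepsilon_{kt}$ is $\sum_{i=1}^{q} \alpha_{ki} + \sum_{j=1}^{p} \beta_{kj} < 1$.
Parameters of the DCC model can be estimated by the quasi-maximum likelihood method using a two-stage approach. Let the parameters of the model be written in two groups, i.e. $\phi' = (\phi_1', \phi_2')$, where $\phi_1$ is the vector of the parameters of conditional means and variances, and $\phi_2$ is the vector of the parameters of the correlation part of the model. The log-likelihood function can be written as the sum of two parts:
$$L(\phi) = L_{\mathrm{Vol}}(\phi_1) + L_{\mathrm{Corr}}(\phi_2 \mid \phi_1),$$
where $L_{\mathrm{Vol}}(\phi_1)$ represents the volatility part:
$$L_{\mathrm{Vol}}(\phi_1) = -\frac{1}{2} \sum_{t=1}^{T} \left( N \ln(2\pi) + \ln |\mathbf{D}_t|^2 + \boldsymbol{\varepsilon}_t' \mathbf{D}_t^{-2} \boldsymbol{\varepsilon}_t \right),$$
while $L_{\mathrm{Corr}}(\phi_2 \mid \phi_1)$ can be viewed as the correlation component:
$$L_{\mathrm{Corr}}(\phi_2 \mid \phi_1) = -\frac{1}{2} \sum_{t=1}^{T} \left( \ln |\mathbf{R}_t| + \mathbf{z}_t' \mathbf{R}_t^{-1} \mathbf{z}_t - \mathbf{z}_t' \mathbf{z}_t \right),$$
and $T$ is the number of observations. $L_{\mathrm{Vol}}(\phi_1)$ can be written as the sum of the log-likelihood functions of univariate GARCH models:
$$L_{\mathrm{Vol}}(\phi_1) = -\frac{1}{2} \sum_{k=1}^{N} \sum_{t=1}^{T} \left( \ln(2\pi) + \ln h_{kt} + \frac{\varepsilon_{kt}^2}{h_{kt}} \right).$$
In the first stage, the parameters of univariate GARCH models can be estimated separately for each of the assets and the estimates of $h_{kt}$ can be obtained. In the second stage, residuals transformed by their estimated standard deviations are used to estimate the parameters of the correlation part ($\phi_2$), conditioning on the parameters estimated in the first stage ($\hat{\phi}_1$).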
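The mechanics of the two building blocks can be sketched for the simplest case of two series with GARCH(1,1) variances and a DCC(1,1) correlation recursion. The code only filters variances and correlations for given parameter values; it does not perform the quasi-maximum likelihood estimation. The parameter names `omega`, `alpha`, `beta`, `zeta` and `theta`, the initialisation at unconditional moments, and the use of sample second moments of the standardised residuals as the unconditional matrix are assumptions made for this illustration.

```python
import math


def garch11_variances(eps, omega, alpha, beta):
    """Conditional variances of a GARCH(1,1) model for one residual series."""
    h = [omega / (1.0 - alpha - beta)]  # start at the unconditional variance
    for t in range(1, len(eps)):
        h.append(omega + alpha * eps[t - 1] ** 2 + beta * h[t - 1])
    return h


def dcc11_correlations(z1, z2, zeta, theta):
    """Conditional correlations from a DCC(1,1) recursion for two series."""
    n = len(z1)
    # Sample second moments of the standardised residuals (unconditional part).
    s11 = sum(a * a for a in z1) / n
    s22 = sum(b * b for b in z2) / n
    s12 = sum(a * b for a, b in zip(z1, z2)) / n
    q11, q22, q12 = s11, s22, s12  # initialise the recursion at those moments
    rho = [q12 / math.sqrt(q11 * q22)]
    w = 1.0 - zeta - theta
    for t in range(1, n):
        q11 = w * s11 + zeta * z1[t - 1] ** 2 + theta * q11
        q22 = w * s22 + zeta * z2[t - 1] ** 2 + theta * q22
        q12 = w * s12 + zeta * z1[t - 1] * z2[t - 1] + theta * q12
        rho.append(q12 / math.sqrt(q11 * q22))  # rescale to a correlation
    return rho


def conditional_covariances(eps1, eps2, garch1, garch2, zeta, theta):
    """Combine both stages: cov_t = rho_t * sqrt(h1_t * h2_t)."""
    h1 = garch11_variances(eps1, *garch1)
    h2 = garch11_variances(eps2, *garch2)
    z1 = [e / math.sqrt(v) for e, v in zip(eps1, h1)]
    z2 = [e / math.sqrt(v) for e, v in zip(eps2, h2)]
    rho = dcc11_correlations(z1, z2, zeta, theta)
    return [r * math.sqrt(a * b) for r, a, b in zip(rho, h1, h2)]


# Toy residuals and parameter values (hypothetical, for illustration only).
eps1 = [0.010, -0.012, 0.004, -0.006, 0.015]
eps2 = [0.008, -0.010, 0.005, -0.002, 0.011]
covs = conditional_covariances(
    eps1, eps2, (1e-6, 0.05, 0.90), (1e-6, 0.05, 0.90), zeta=0.05, theta=0.90
)
```

The rescaling step in `dcc11_correlations` is the two-asset version of dividing by the square roots of the diagonal elements, which guarantees a correlation between -1 and 1.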

Data and description of the research
We evaluate the accuracy of the proposed covariance forecasting procedure on data from the Forex market, examining the three most heavily traded currency pairs, namely EUR/USD, USD/JPY and GBP/USD. Daily data for the period from 2 January 2004 to 30 December 2016 are used. A total sample of 3,365 observations is split into a training set (period: 14 January 2004 to 31 December 2014; size: 2,846 observations; the first eight observations are truncated during the construction of the analytical dataset) and a test set (period: 2 January 2015 to 30 December 2016; holdout set size: 519 observations).
The target variable is the covariance of returns of currency pairs given in (6). This estimator is more efficient than the one based on closing prices only. The set of predictors contains the following time series: target_lag_1 to target_lag_8, min_A/C_lag1, min_B/C_lag1, max_A/C_lag1, max_B/C_lag1, close_A/C_lag1, close_B/C_lag1, lnzwr_A/C_lag1 and lnzwr_B/C_lag1. We also add categorical predictors commonly used in training time series models: month, weekofyear, dayofweek, dayofyear, dayofmonth. Categorical variables are integer-coded (which is a better option for tree-based methods than one-hot encoding).
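How such a predictor set might be assembled can be sketched with the standard library alone. The example builds only the eight target lags and the five integer-coded calendar variables; the price-based predictors would be added in the same way. The function name `build_rows` and the toy series are hypothetical, and the study itself reports using pandas rather than this hand-written loop. Dropping the first `n_lags` rows mirrors the truncation of the first eight observations mentioned above.

```python
from datetime import date, timedelta


def build_rows(dates, target, n_lags=8):
    """One dict per usable day: lagged targets, calendar codes and the target."""
    rows = []
    for t in range(n_lags, len(target)):  # the first n_lags days have no full history
        d = dates[t]
        row = {f"target_lag_{k}": target[t - k] for k in range(1, n_lags + 1)}
        row.update(
            month=d.month,
            weekofyear=d.isocalendar()[1],   # ISO week number
            dayofweek=d.weekday(),           # Monday = 0
            dayofyear=d.timetuple().tm_yday,
            dayofmonth=d.day,
            target=target[t],
        )
        rows.append(row)
    return rows


# Toy series on consecutive calendar days (real data would skip non-trading days).
dates = [date(2004, 1, 2) + timedelta(days=i) for i in range(12)]
target = [float(i) for i in range(12)]
rows = build_rows(dates, target)
first = rows[0]
```

Keeping the calendar variables as plain integers lets a tree split them with a single threshold, whereas one-hot encoding would spread the same information over many sparse columns.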
We implement machine learning pipelines for the random forest, GBRT and XGBoost in the Python 3.6.3 environment, and use standard libraries for data processing and machine learning, i.e. numpy, scikit-learn, pandas and datetime. Additionally, we use the XGBoost library and the scikit-learn wrapper interface for XGBoost. The training set, containing about 85% of the samples, is used to tune hyperparameters using the time series k-fold cross-validator (TimeSeriesSplit method). Standard cross-validation techniques assume that samples are independent and identically distributed; on time series data they would produce an unreasonable correlation between training and testing instances and hence poor estimates of the generalisation error. In the TimeSeriesSplit method, successive training sets are supersets of those that come before them. The method also adds all surplus data to the first training partition, which is always used to train the model. The preliminary tuning of hyperparameters is done by searching the parameter space (RandomizedSearchCV or GridSearchCV methods) with the above-described k-fold cross-validation. For the random forest, the min_samples_split, max_depth, min_impurity_decrease and min_samples_leaf split hyperparameters are additionally calibrated. In the case of boosting (GBRT), overfitting is controlled ex post through the value of the n_estimators hyperparameter (see the Figure). For tree-based supervised learning, the critical elements of the model are the impurity measure, which determines the split quality, and the loss function, which influences the quality of predictions. In regression on time series data, the MSE and RMSE impurity measures are applied. As a loss function, we utilise the Huber loss function in the GBRT model, and the square loss in XGBoost.
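The expanding-window scheme described above can be written out by hand to show which observations end up in each fold. The generator below is a sketch that mirrors the described behaviour (growing training sets, equally sized test folds, surplus observations assigned to the first training partition); it is not the scikit-learn code itself, and the function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) for an expanding-window time series CV."""
    test_size = n_samples // (n_splits + 1)
    if test_size < 1:
        raise ValueError("too many splits for the number of samples")
    # Observations left over after cutting n_splits test folds go to the first
    # training set, so every later training set is a superset of the earlier ones.
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, start))
        test_idx = list(range(start, start + test_size))
        yield train_idx, test_idx


# Eleven observations and three folds (toy sizes).
folds = list(expanding_window_splits(n_samples=11, n_splits=3))
train_sizes = [len(train) for train, _ in folds]
test_folds = [test for _, test in folds]
```

Every test fold lies strictly after its training set, so a model is never evaluated on observations that precede the data it was fitted on.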
The three ensemble models are tuned, trained, and applied on the out-of-sample instances to generate forecasts. Parameters of the DCC model are estimated each day on a rolling sample of a fixed size of 500 observations (approximately 2 years).

Comparison of covariance forecasts
This section compares out-of-sample one-day-ahead forecasts of the covariance of returns from three machine learning algorithms (RFR, GBRT and XGBoost), with the forecasts from the DCC model. We evaluate forecasts for a two-year period from 2 January 2015 to 30 December 2016.
The sum of products of 15-minute returns (the realised covariance) is employed as a proxy of the daily covariance for the evaluation of the forecasts. We assess the forecasts from the models based on two primary measures, i.e. the mean squared error (MSE) and the mean absolute error (MAE). In order to evaluate the statistical significance of the results, the Diebold-Mariano test (Diebold & Mariano, 1995) is applied. We perform a pairwise comparison with respect to the DCC benchmark model. The forecasting performance results are presented in the Table. Under the MSE criterion, the forecasts of covariance from all three machine learning algorithms are more accurate than the forecasts based on the DCC model. According to the Diebold-Mariano test, the advantage of these algorithms is statistically significant at the 10% level, except for the EUR/USD-GBP/USD relation for RFR. Under the MAE measure, the forecasts based on the analysed machine learning methods are again significantly more accurate than the forecasts from the DCC model for the EUR/USD-USD/JPY and USD/JPY-GBP/USD relations. For EUR/USD-GBP/USD, the lowest value of the criterion occurs for the GBRT algorithm, but this result is not statistically significant. Predominantly, both loss functions indicate the GBRT algorithm as the best-performing method.
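The evaluation steps can be sketched as follows. The realised covariance is the sum of products of intraday returns, MSE and MAE are the usual averages, and the Diebold-Mariano statistic is computed here in its simplest form for one-step-ahead forecasts: the mean loss differential divided by its standard error, without autocovariance corrections, with a two-sided p-value from the standard normal distribution. Whether the study used exactly this variant is not stated, so the details (and all function names and numbers) are assumptions.

```python
import math
from statistics import fmean


def realised_covariance(returns_1, returns_2):
    """Sum of products of intraday (e.g. 15-minute) returns of two series."""
    return sum(a * b for a, b in zip(returns_1, returns_2))


def mse(actual, forecast):
    return fmean([(a - f) ** 2 for a, f in zip(actual, forecast)])


def mae(actual, forecast):
    return fmean([abs(a - f) for a, f in zip(actual, forecast)])


def diebold_mariano(actual, forecast_a, forecast_b, loss="squared"):
    """DM statistic and two-sided p-value; negative values favour forecast_a."""
    if loss == "squared":
        d = [(y - fa) ** 2 - (y - fb) ** 2
             for y, fa, fb in zip(actual, forecast_a, forecast_b)]
    else:  # absolute-error loss
        d = [abs(y - fa) - abs(y - fb)
             for y, fa, fb in zip(actual, forecast_a, forecast_b)]
    n = len(d)
    d_bar = fmean(d)
    var_d = sum((x - d_bar) ** 2 for x in d) / n
    stat = d_bar / math.sqrt(var_d / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # 2 * (1 - Phi(|stat|))
    return stat, p_value


# Toy proxy values and two competing forecast series (hypothetical numbers).
actual = [1.0, 2.0, 3.0, 4.0]
model_a = [1.1, 1.9, 3.1, 3.9]
model_b = [1.5, 2.6, 2.3, 4.8]
dm_stat, dm_p = diebold_mariano(actual, model_a, model_b)
```

In this toy case the statistic is negative because `model_a` has the smaller squared errors on every day; with real data the sign and the p-value together indicate which model is favoured and how strongly.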

Conclusions
An ensemble method in machine learning combines a set of weak learners to create a (more potent) learner that performs better than any of the individual ones. Ensemble methods help reduce bias and/or variance. We use a decision tree as the base, weak learner. We examine the performance of three popular tree-based ensemble algorithms: random forest (regression), GBRT and XGBoost. These algorithms exploit two different approaches to ensemble learning. Random forest trains individual estimators independently over bootstrapped subsets of data (bagging) and incorporates a second level of randomness: when optimising each node split, only a random subsample (without replacement) of the attributes is evaluated, in order to further decorrelate the estimators. Both GBRT and XGBoost utilise a boosting technique that is different from a random forest. In boosting, individual trees are fitted sequentially, observations are weighted differently in each iteration, and poor-performing trees are excluded. All three algorithms belong to a group of machine learning algorithms which are most popular and widely used in many fields. It is worth noting that a variant of the gradient boosting algorithm, LightGBM, won the M5 forecasting competition (Makridakis et al., 2020). Tree-based ensembles are also becoming increasingly popular in financial forecasting (Henrique et al., 2019; de Prado, 2018).
Daily low and high prices contain important information about the variability of the prices of financial instruments, but they are very seldom used for the estimation of volatility models. We combine machine learning tree-based algorithms with the usage of low and high prices and suggest a new approach to forecasting currency covariances. We conduct an empirical evaluation of this procedure on the three most heavily traded currency pairs in the Forex market: EUR/USD, USD/JPY and GBP/USD. The covariance forecasts obtained from the three applied algorithms are for the most part more accurate than those from the DCC model, which is used as a benchmark model based on closing prices. The results of the analyses indicate that the GBRT algorithm is the best-performing method.
Research on tree-based machine learning methods in covariance forecasting can be further developed, for example in the area of the analysis of the importance of predictors or studies on the interpretability of the optimal values of hyperparameters. Other issues, such as modifying a loss function in boosting-based methods and examining the performance of random forest and gradient boosting variants (e.g. quantile regression forests, dynamic random forests, and the LightGBM algorithm) seem worth further investigation as well.