Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Nicolas Chapados and Yoshua Bengio
Department of Computer Science and Operations Research
University of Montréal
Montréal, Québec, Canada H3C 3J7
{chapados,bengioy}@iro.umontreal.ca

Advances in Neural Information Processing Systems, pp. 265–272.

Abstract

We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads.

1 Introduction

Classical time-series forecasting models, such as ARMA models [6], assume that forecasting is performed at a fixed horizon, which is implicit in the model. An overlaying deterministic time trend may be fit to the data, but is generally of fixed and relatively simple functional form (e.g. linear, quadratic, or sinusoidal for periodic data). To forecast beyond the fixed horizon, it is necessary to iterate forecasts in a multi-step fashion. These models are good at representing the short-term dynamics of the time series, but degrade rapidly when longer-term forecasts must be made, usually converging quickly to the unconditional expectation of the process after removal of the deterministic time trend. This is a major issue in applications that require a forecast over a complete future trajectory, rather than at a single (or restricted) horizon.
These models are also constrained to deal with regularly-sampled data, and make it difficult to condition the time trend on explanatory variables, especially when iteration of short-term forecasts has to be performed. To a large extent, the same problems are present with non-linear generalizations of such models, such as time-delay or recurrent neural networks [1], which simply allow the short-term dynamics to become nonlinear but leave open the question of forecasting complete future trajectories.

Functional Data Analysis (FDA) [10] has been proposed in the statistical literature as an answer to some of these concerns. The central idea is to consider a whole curve as an example (specified by a finite number of samples ⟨t, y_t⟩), which can be represented by coefficients in a non-parametric basis expansion such as splines. This implies learning about complete trajectories as a function of time, hence the “functional” designation. Since time is viewed as an independent variable, the approach can forecast at arbitrary horizons and handle irregularly-sampled data. Typically, FDA is used without explanatory time-dependent variables, which are important for the kind of applications we shall be considering. Furthermore, the question remains of how to integrate a progressively-revealed information set in order to make increasingly more precise forecasts of the same future trajectory. To incorporate conditioning information, we consider here the output of a prediction to be a whole forecasting curve (as a function of t).

The motivation for this work comes from forecasting and actively trading price spreads between commodity futures contracts (see, e.g., [7], for an introduction). Since futures contracts expire and have a finite duration, this problem is characterized by the presence of a large number of separate historical time series, all of which can be of relevance in forecasting a new series.
For example, we expect seasonalities to affect all the series similarly. Furthermore, conditioning information, in the form of macroeconomic variables, can be important, but exhibits the cumbersome property of being released periodically, with explanatory power that varies across the forecasting horizon. In other words, when making a very long-horizon forecast, the model should not incorporate conditioning information in the same way as when making a short- or medium-term forecast. A possible solution to this problem is to have multiple models for forecasting each time series, one for each time scale. However, this is hard to work with, requires a high degree of skill on the part of the modeler, and is not amenable to robust automation when one wants to process hundreds of time series. In addition, in order to measure the risk associated with a particular trade (buying at time t and selling at time t′), we need to estimate the covariance of the price predictions associated with these two points in the trajectory.

These considerations motivate the use of Gaussian processes, which naturally provide a covariance matrix between forecasts made at several points. To tackle the challenging task of forecasting and trading spreads between commodity futures, we introduce here a form of functional data analysis in which the function to be forecast is indexed both by the date of availability of the information set and by the forecast horizon. The predicted trajectory is thus represented as a functional object associated with a distribution, a Gaussian process, from which the risk of different trading decisions can readily be estimated.
This approach allows incorporating input variables that cannot be assumed to remain constant over the forecast horizon, such as statistics of the short-term dynamics.

Previous Work. Gaussian processes for time-series forecasting have been considered before. Multi-step forecasts are explicitly tackled by [4], wherein uncertainty about the intermediate values is formally incorporated into the predictive distribution to obtain more realistic uncertainty bounds at longer horizons. However, this approach, while well suited to purely autoregressive processes, does not appear amenable to the explicit handling of exogenous input variables. Furthermore, it suffers from the restriction of only dealing with regularly-sampled data. Our approach is inspired by the CO2 model of [11] as an example of application-specific covariance function engineering.

2 The Model

We consider a set of N real time series, each of length M_i: {y^i_t}, i = 1, …, N and t = 1, …, M_i. In our application each i represents a different year, and the series is the sequence of commodity spread prices during the period where it is traded. The lengths of the series are not necessarily identical, but we shall assume that the time periods they span are “comparable” (e.g. the same range of days within a year if the series follow an annual cycle), so that knowledge from past series can be transferred to a new one to be forecast. The forecasting problem is: given observations from the complete series i = 1, …, N−1 and from a partial last series, {y^N_t}, t = 1, …, M_N, we want to extrapolate the last series until a predetermined endpoint, i.e. characterize the joint distribution of {y^N_τ}, τ = M_N+1, …, M_N+H. We are also given a set of non-stochastic explanatory variables specific to each series, {x^i_t}, where x^i_t ∈ R^d.
Our objective is to find an effective representation of P({y^N_τ}_{τ=M_N+1,…,M_N+H} | {x^i_t, y^i_t}_{i=1,…,N; t=1,…,M_i}), with τ, i and t ranging, respectively, over the forecasting horizon, the available series and the observations within a series.

Gaussian Processes. Assuming that we are willing to accept a normally-distributed posterior, Gaussian processes [8, 11, 14] have proved a general and flexible tool for nonlinear regression in a Bayesian framework. Given a training set of M input–output pairs ⟨X ∈ R^{M×d}, y ∈ R^M⟩, a set of M′ test point locations X* ∈ R^{M′×d} and a positive semi-definite covariance function k : R^d × R^d → R, the joint posterior distribution of the test outputs y* is normal with mean and covariance given by

    E[y* | X, X*, y] = K(X*, X) Λ⁻¹ y,                                      (1)
    Cov[y* | X, X*, y] = K(X*, X*) − K(X*, X) Λ⁻¹ K(X, X*),                 (2)

where we have set Λ = K(X, X) + σ²_n I_M, with K the matrix of covariance evaluations, K(U, V)_{i,j} ≜ k(U_i, V_j), and σ²_n the assumed process noise level. The specific form of the covariance function used in our application is described below, after introducing the representation used for forecasting.

Functional Representation for Forecasting. In the spirit of functional data analysis, a first attempt at solving the forecasting problem is to set it forth in terms of regression from the input variables to the series values, adding to the inputs an explicit time index t and series identity i:

    E[y^i_t | I^i_{t₀}] = f(i, t, x^i_{t|t₀}),
    Cov[y^i_t, y^{i′}_{t′} | I^i_{t₀}] = g(i, t, x^i_{t|t₀}, i′, t′, x^{i′}_{t′|t₀}),      (3)

these expressions being conditioned on the information set I^i_{t₀} containing information up to time t₀ of series i (we assume that all prior series i′ < i are also included in their entirety in I^i_{t₀}). The notation x^i_{t|t₀} denotes a forecast of x^i_t given information available at t₀. Functions f and g result from Gaussian process training, eqs. (1) and (2), using information in I^i_{t₀}. To extrapolate over the unknown horizon, one simply evaluates f and g with the series identity index i set to N and the time index t within a series ranging over the elements of τ (the forecasting period). Owing to the smoothness properties of an adequate covariance function, one can expect the last time series (whose starting portion is present in the training data) to be smoothly extended, with the Gaussian process borrowing from prior series, i < N, to guide the extrapolation as the time index reaches far enough beyond the available data in the last series.

The principal difficulty with this method resides in handling the exogenous inputs x^N_{t|t₀} over the forecasting period: the realizations of these variables, x^N_t, are not usually known at the time the forecast is made and must be extrapolated with some reasonableness.
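Equations (1) and (2) reduce to a few lines of linear algebra. The sketch below is our own illustration, not the paper's code: it uses a Cholesky factorization of Λ for numerical stability instead of an explicit inverse, and any valid covariance function can supply the K matrices.

```python
import numpy as np

def gp_posterior(K, K_star, K_star_star, y, noise_var):
    """Predictive mean and covariance of a GP, following eqs. (1)-(2).

    K           -- (M, M)   train/train covariance  K(X, X)
    K_star      -- (M', M)  test/train covariance   K(X*, X)
    K_star_star -- (M', M') test/test covariance    K(X*, X*)
    """
    Lam = K + noise_var * np.eye(K.shape[0])       # Lambda = K(X,X) + sigma_n^2 I
    L = np.linalg.cholesky(Lam)                    # stable alternative to inverting Lambda
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star @ alpha                          # K(X*,X) Lambda^{-1} y
    V = np.linalg.solve(L, K_star.T)
    cov = K_star_star - V.T @ V                    # K(X*,X*) - K(X*,X) Lambda^{-1} K(X,X*)
    return mean, cov
```

With a smooth covariance and near-zero noise, the posterior mean interpolates the training targets at the training inputs, and the predictive variance there collapses toward the noise level.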
For slow-moving variables that represent a “level” (as opposed to a “difference” or a “return”), one can conceivably hold their value constant at the last known realization across the forecasting period. However, this solution is restrictive, problem-dependent, and precludes the incorporation of short-term dynamics variables (e.g. the first differences over the last few time-steps) if desired.

Augmenting the Functional Representation. We propose in this paper to augment the functional representation with an additional input variable that expresses the time at which the forecast is being made, in addition to the time for which the forecast is made. We shall denote the former the operation time and the latter the target time. The distinction is as follows: operation time represents the time at which the other input variables are observed and the time at which, conceptually, a forecast of the entire future trajectory is performed. In contrast, target time represents time at a point of the predicted target series (beyond the operation time), given the information known at the operation time. As previously, the time series index i remains part of the inputs. In this framework, forecasting is performed by holding the time-series index constant at N, the operation time constant at the time M_N of the last observation, the other input variables constant at their last observed values x^N_{M_N}, and varying the target time over the forecasting period τ. Since we are not attempting to extrapolate the inputs beyond their intended range of validity, this approach admits general input variables, without restriction as to their type or whether they can themselves be forecast.

It can be convenient to represent the target time as a positive delta ∆ from the operation time t₀. In contrast to eq.
(3), this yields the representation

    E[y^i_{t₀+∆} | I^i_{t₀}] = f(i, t₀, ∆, x^i_{t₀}),
    Cov[y^i_{t₀+∆}, y^{i′}_{t′₀+∆′} | I^i_{t₀}] = g(i, t₀, ∆, x^i_{t₀}, i′, t′₀, ∆′, x^{i′}_{t′₀}),      (4)

where we have assumed the operation time to coincide with the end of the information set. Note that this augmentation allows us to dispense with the problematic extrapolation x^i_{t|t₀} of the inputs, instead allowing a direct use of the last available values x^i_{t₀}. Moreover, from a given information set, nothing precludes forecasting the same trajectory from several operation times t′₀ < t₀, which can be used as a means of evaluating the stability of the obtained forecast.

The obvious downside to augmentation lies in the greater computational cost it entails. In particular, the training set must contain sufficient information to represent the output variable for the many combinations of operation and target times that can be provided as input. In the worst case, this implies that the number of training examples grows quadratically with the length of the training time series. In practice, a downsampling scheme is used wherein only a fixed number of target-time points is sampled for every operation-time point.¹

¹ This number was 15 in our experiments, and the points were not regularly spaced, with longer horizons spaced farther apart.
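To make the augmented training-set construction concrete, the following sketch builds (series index, operation time, delta, inputs) → target rows from a list of price series. The function name, the geometric spacing of target offsets and the use of the last price as the sole explanatory input are our own illustrative assumptions; the paper only specifies roughly 15 irregularly spaced target times per operation time (longer horizons spaced farther apart) and roughly weekly operation times.

```python
import numpy as np

def augment_series(series_prices, n_deltas=15, op_stride=5):
    """Build augmented (i, t0, delta, x_t0) -> y_{t0+delta} training rows.

    series_prices -- list of 1-D price arrays, one per trading year.
    Operation times are subsampled every `op_stride` steps; for each one,
    up to `n_deltas` target offsets are drawn with geometric spacing, so
    that short horizons are sampled densely and long horizons sparsely.
    """
    rows, targets = [], []
    for i, y in enumerate(series_prices):
        M = len(y)
        for t0 in range(0, M - 1, op_stride):
            H = M - 1 - t0                                   # remaining horizon
            deltas = np.unique(
                np.geomspace(1, H, num=min(n_deltas, H)).astype(int))
            for d in deltas:
                rows.append((i, t0, d, y[t0]))               # x_t0: last price only
                targets.append(y[t0 + d])
    return np.array(rows, float), np.array(targets)
```

Each row then feeds the Gaussian process of eq. (4) as one training example, with the target time encoded as the offset d from the operation time t0.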
Furthermore, the original daily frequency of the data was reduced to keep approximately one operation-time point per week.

Covariance Function. We used a modified form of the rational quadratic covariance function with hyperparameters for automatic relevance determination [11], which is expressed as

    k_AUG-RQ(u, v; ℓ, α, σ_f, σ_TS) = σ²_f (1 + (1/2α) Σ_{k=1}^{d} (u_k − v_k)² / ℓ²_k)^{−α} + σ²_TS δ_{i_u, i_v},      (5)

where δ_{j,k} ≜ I[j = k] is the Kronecker delta. The variables u and v are values in the augmented representation introduced previously, containing the three variables representing time (current time-series index or year, operation time, target time) as well as the additional explanatory variables. The notation i_u denotes the time-series index component i of input variable u. The last term of the covariance function, the Kronecker delta, is used to induce an increased similarity among points that belong to the same time series (e.g. the same spread trading year). By allowing a series-specific average level to be maintained into the extrapolated portion, the presence of this term was found to bring better forecasting performance. The hyperparameters ℓ_i, α, σ_f, σ_TS, σ_n are found by maximizing the marginal likelihood on the training set by standard conjugate gradient optimization [11].
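A direct transcription of eq. (5) for a single pair of augmented inputs might look as follows. This is a sketch under our own conventions (the first component of each input vector carries the series index i; the remaining components are the ARD-weighted variables); vectorized Gram-matrix evaluation and marginal-likelihood optimization are omitted.

```python
import numpy as np

def k_aug_rq(u, v, ell, alpha, sigma_f, sigma_ts):
    """Modified rational-quadratic covariance of eq. (5).

    u, v  -- augmented input vectors; u[0], v[0] are the series indices i_u, i_v,
             the remaining components the time and explanatory variables.
    ell   -- ARD length-scales, one per non-index component.
    """
    du = (u[1:] - v[1:]) / ell                        # ARD-scaled differences
    rq = (1.0 + du @ du / (2.0 * alpha)) ** (-alpha)  # rational-quadratic term
    same_series = 1.0 if u[0] == v[0] else 0.0        # Kronecker delta on i
    return sigma_f ** 2 * rq + sigma_ts ** 2 * same_series
```

The σ²_TS term only fires when both points come from the same trading year, which is what lets a series-specific level persist into the extrapolated portion.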
For tractability, we rely on a two-stage training procedure, wherein hyperparameter optimization is performed on a fairly small training set (M = 500) and final training is done on a larger set (M = 2250), keeping the hyperparameters fixed.

3 Evaluating Forecasting Performance

To establish the benefits of the proposed functional representation for forecasting commodity spread prices, we compared it against other likely models on three common grain and grain-related spreads:² the January–July Soybeans, the May–September Soybean Meal, and the March–July Chicago Hard Red Wheat. The forecasting task is to predict the complete future trajectory of each spread (taken individually), from 200 days before maturity until maturity.

Methodology. Realized prices in the previous trading years are provided from 250 days to maturity, using data going back to 1989. The first test year is 1994. Within a given trading year, the time variables represent the number of calendar days to maturity of the near leg; since no data is observed on week-ends, training examples are sampled on an irregular time scale. Performance evaluation proceeds through a sequential validation procedure [2]: within a trading year, we first train models 200 days before maturity and obtain a first forecast of the future price trajectory. We then retrain the models every 25 days, obtaining revised portions of the remainder of the trajectory. Proceeding sequentially, this operation is repeated for succeeding trading years. All forecasts are compared among models on squared-error and negative log-likelihood criteria (see “Assessing Significance”, below). Input variables are subject to minimal preprocessing: we standardize them to zero mean and unit standard deviation.
The price targets require additional treatment: since the price level of a spread can vary significantly from year to year, we normalize the price trajectories to start at zero at the beginning of every trading year, by subtracting the first price. Furthermore, in order to obtain slightly better-behaved optimization, we divide the price targets by their overall standard deviation.

Models Compared. The “complete” model to be compared against the others is based on the augmented-input-representation Gaussian process with the modified rational quadratic covariance function of eq. (5). In addition to the three variables required for the representation of time, the following inputs were provided to the model: (i) the current spread price and the price of the three nearest futures contracts on the underlying commodity term structure, and (ii) economic variables (the stock-to-use ratio and the year-over-year difference in total ending stocks) provided for the underlying commodity by the U.S. Department of Agriculture [13]. This model is denoted AugRQ/all-inp. An example of the sequence of forecasts made by this model, repeated every 25 time-steps, is shown in the upper panel of Figure 1.

To determine the value added by each type of input variable, we include in the comparison two models based on exactly the same architecture, but provided with fewer inputs: AugRQ/less-inp does

² Our convention is to first give the short leg of the spread, followed by the long leg. Hence, Soybeans 1–7 should be interpreted as taking a short position (i.e. selling) in the January Soybeans contract and an offsetting long position (i.e. buying) in the July contract.
Traditionally, intra-commodity spread positions are taken so as to match the number of contracts on both legs (the number of short contracts equals the number of long ones), not the dollar value of the long and short sides.

Figure 1: Top panel: illustration of multiple forecasts, repeated every 25 days, of the 1996 March–July Wheat spread (dashed lines); the realized price is in gray. Although the first forecast (smooth solid blue, with confidence bands) mistakes the overall price level, it approximately correctly identifies local price maxima and minima, which is sufficient for trading purposes. Bottom panel: position taken by the trading model (in red: short, then neutral, then long), and cumulative profit of that trade (gray).

not include the economic variables. AugRQ/no-inp further removes the price inputs, leaving only the time-representation inputs. Moreover, to quantify the performance gain of the augmented representation of time, the model StdRQ/no-inp implements a “standard time representation” that would likely be used in a functional data analysis model; as described in eq. (3), this uses a single time variable instead of splitting the representation of time between the operation and target times.

Finally, we compare against simpler models: Linear/all-inp uses a dot-product covariance function to implement Bayesian linear regression, using the full set of input variables described above, and AR(1) is a simple linear autoregressive model. The predictive mean and covariance matrix for this last model are established as follows (see, e.g., [6]).
We consider the scalar data-generating process

    y_t = φ y_{t−1} + ε_t,    ε_t iid∼ N(0, σ²),      (6)

where the process {y_t} has an unconditional mean of zero.³ Given the information available at time t, I_t, the h-step-ahead forecast from time t under this model has conditional expectation and covariance (with the h′-step-ahead forecast) expressed as

    E[y_{t+h} | I_t] = φ^h y_t,
    Cov[y_{t+h|t}, y_{t+h′|t} | I_t] = σ² φ^{h+h′} (1 − φ^{−2 min(h,h′)}) / (φ² − 1).

³ In our experiments, we estimate an independent empirical mean for each trading year, which is subtracted from the prices before proceeding with the analysis.

Assessing the Significance of Forecasting Performance Differences. For each trajectory forecast, we measure the squared error (SE) made at each time-step, along with the negative log-likelihood (NLL) of the realized price under the predictive distribution. To account for differences in target-variable distribution throughout the years, we normalize the SE by dividing it by the standard deviation of the test targets in a given year. Similarly, we normalize the NLL by subtracting the likelihood of a univariate Gaussian distribution estimated on the test targets of the year.

Due to the serial correlation it exhibits, the time series of performance differences (either SE or NLL) between two models cannot directly be subjected to a standard t-test of the null hypothesis of no difference in forecasting performance. The well-known Diebold–Mariano test [3] corrects for this correlation structure in the case where a single time series of performance differences is available. This test is usually expressed as follows. Let {d_t} be the sequence of error differences between the two models to be compared, and let d̄ = (1/M) Σ_t d_t be the mean difference. The sample variance of d̄ is readily shown [3] to be

    v̂_DM ≜ Var[d̄] = (1/M) Σ_{k=−K}^{K} γ̂_k,

where M is the sequence length and γ̂_k is an estimator of the lag-k autocovariance of the d_t's. The maximum lag order K is a parameter of the test and must be determined empirically. The statistic DM = d̄ / √v̂_DM is then asymptotically distributed as N(0, 1), and a classical test of the null hypothesis d̄ = 0 can be performed.

Table 1: Forecast performance difference between AugRQ/all-inp and all other models (AugRQ/less-inp, AugRQ/no-inp, Linear/all-inp, AR(1), StdRQ/no-inp), for the three spreads studied (Soybeans 1–7, Soybean Meal 5–9, Wheat 3–7). For both the squared-error and NLL criteria, the value of the cross-correlation-corrected statistic (CCC) is listed along with its p-value under the null hypothesis. A negative CCC statistic indicates that AugRQ/all-inp beats the other model on average. (The numeric entries of the table were garbled in extraction and are omitted here.)

Unfortunately, even the Diebold–Mariano correction for autocorrelation is not sufficient to compare models in the present case. Due to the repeated forecasts made for the same time-step across several iterations of sequential validation, the error sequences are likely to be cross-correlated, since they result from models estimated on strongly overlapping training sets. This suggests that an additional correction should be applied to account for this cross-correlation across test sets, expressed as

    v̂_CCC−DM = (1/M²) Σ_i ( M_i Σ_{k=−K}^{K} γ̂^i_k + Σ_{j≠i} M_{i∩j} Σ_{k=−K′}^{K′} γ̂^{i,j}_k ),      (7)

where M_i is the number of examples in test set i, M = Σ_i M_i is the total number of examples, M_{i∩j} is the number of time-steps where test sets i and j overlap, γ̂^i_k denotes the estimated lag-k autocovariance within test set i, and γ̂^{i,j}_k denotes the estimated lag-k cross-covariance between test sets i and j. The maximum lag order for cross-covariances, K′, is possibly different from K (our experiments used K = K′ = 15). This revised variance estimator was used in place of the usual Diebold–Mariano statistic in the results presented below.

Results. The forecasting performance differences between AugRQ/all-inp and all other models are shown in Table 1. We observe that AugRQ/all-inp generally beats the others on both the SE and NLL criteria, often statistically significantly so. In particular, the augmented representation of time is shown to be of value (i.e. comparing against StdRQ/no-inp). Moreover, the Gaussian process is capable of making good use of the additional price and economic input variables, although not always at the traditionally accepted levels of significance.

4 Application: Trading a Portfolio of Spreads

We applied this forecasting methodology based on an augmented representation of time to trading a portfolio of spreads.
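Equation (7) can be implemented directly once the per-test-set loss differences and their time indices are available. The sketch below uses our own naming and biased covariance estimators, and assumes overlapping test sets are sorted by time; with a single test set it reduces to the ordinary Diebold–Mariano variance.

```python
import numpy as np

def xcov(a, b, k):
    """Biased lag-k cross-covariance of two demeaned, time-aligned series."""
    if k < 0:
        a, b, k = b, a, -k
    n = len(a)
    return float(a[k:] @ b[:n - k]) / n if k < n else 0.0

def ccc_dm_variance(diff_sets, K=15, Kp=15):
    """Cross-correlation-corrected variance of the mean loss difference, eq. (7).

    diff_sets -- list of (times, diffs) pairs, one per test set; `times` are the
                 forecast time-step indices, `diffs` the loss differences.
    """
    M = sum(len(d) for _, d in diff_sets)
    demeaned = [(np.asarray(t), d - d.mean()) for t, d in diff_sets]
    v = 0.0
    for i, (ti, di) in enumerate(demeaned):
        # within-set autocovariances, lags -K..K, weighted by M_i
        v += len(di) * sum(xcov(di, di, k) for k in range(-K, K + 1))
        for j, (tj, dj) in enumerate(demeaned):
            if j == i:
                continue
            common = np.intersect1d(ti, tj)        # overlap of test sets i and j
            if len(common) == 0:
                continue
            ai = di[np.isin(ti, common)]
            aj = dj[np.isin(tj, common)]
            # cross-set cross-covariances, lags -K'..K', weighted by the overlap size
            v += len(common) * sum(xcov(ai, aj, k) for k in range(-Kp, Kp + 1))
    return v / M ** 2
```

The corrected statistic is then d̄ / √v̂_CCC−DM, compared against a standard normal as in the original test.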
Within a given trading year, we apply an information-ratio criterion to greedily determine the best trade to enter, based on the entire price forecast (until the end of the year) produced by the Gaussian process. More specifically, let {p_t} be the future prices forecast by the model at some operation time (presumably the time of the last available element in the training set). The expected forecast dollar profit of buying at t₁ and selling at t₂ is simply given by p_{t₂} − p_{t₁}. Of course, a prudent investor would take trade risk into consideration. A simple approximation of risk is given by the trade profit volatility. This yields the forecast information ratio⁴ of the trade

    ÎR(t₁, t₂) = E[p_{t₂} − p_{t₁} | I_{t₀}] / √Var[p_{t₂} − p_{t₁} | I_{t₀}],      (8)

where Var[p_{t₂} − p_{t₁} | I_{t₀}] can be computed as Var[p_{t₁} | I_{t₀}] + Var[p_{t₂} | I_{t₀}] − 2 Cov[p_{t₁}, p_{t₂} | I_{t₀}], each quantity being separately obtainable from the Gaussian process forecast, cf. eq. (2). The trade decision is made in one of two ways, depending on whether a position has already been opened: (i) when making a decision at time t₀, if a position has not yet been entered for the spread in a given trading year, eq. (8) is maximized with respect to unconstrained t₁, t₂ ≥ t₀. An illustration of this criterion is given in Figure 2, which corresponds to the first decision made when trading the spread shown in Figure 1. (ii) In contrast, if a position has already been opened, eq. (8) is maximized with respect to t₂ only, keeping t₁ fixed at t₀; this corresponds to revising the exit point of an existing position. Simple additional filters are used to avoid entering marginal trades: we impose a trade duration of at least four days, a minimum forecast IR of 0.25 and a forecast standard deviation of the price sequence of at least 0.075. These thresholds have not been tuned extensively; they were used only to avoid trading on an approximately flat price forecast.

⁴ An information ratio is defined as the average return of a portfolio in excess of a benchmark, divided by the standard deviation of the excess-return distribution; see [5] for more details.

Figure 2: After a price trajectory forecast, all possible pairs of buy-day/sell-day are evaluated on the trade information-ratio criterion, whose results are shown by the level plot. The best trade is selected, here shorting 235 days before maturity, with the forecast price at a local maximum, and covering 100 days later at a local minimum.

Table 2: Financial performance statistics for the 30-spread portfolio on the 1994–2007 (until April 30) period, and two disjoint sub-periods. All returns are expressed in excess of the risk-free rate. The information-ratio statistics are annualized. Skewness and excess kurtosis are on the monthly return distributions. Drawdown duration is expressed in calendar days. The model displays good performance for moderate risk.

                         Full Period   1994/01–2002/12   2003/01–2007/04
    Avg Annual Return        7.3%            5.9%             10.1%
    Avg Annual Stddev        4.1%            4.1%              4.0%
    Information Ratio        1.77            1.45              2.44
    Skewness                 0.68            0.65              0.76
    Excess Kurtosis          3.40            4.60              1.26
    Best Month               6.0%            6.0%              4.8%
    Worst Month             −3.4%           −3.4%             −1.8%
    Percent Months Up         71%             67%               77%
    Max. Drawdown           −7.7%           −7.7%             −4.0%
    Drawdown Duration         653             653                23
    Drawdown From         1997/02         1997/02           2004/06
    Drawdown Until        1998/11         1998/11           2004/07

We applied these ideas to trading a portfolio of 30 spreads, selected among the following commodities: Cotton (2 spreads), Feeder Cattle (2), Gasoline (1), Lean Hogs (7), Live Cattle (1), Natural Gas (2), Soybean Meal (5), Soybeans (5), Wheat (5). The spreads were selected on the basis of their good performance over the 1994–2002 period.
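The greedy entry decision of eq. (8) amounts to scanning all buy-day/sell-day pairs of the forecast trajectory. A minimal sketch follows, under our own naming; for brevity it applies the minimum-duration and minimum-IR filters but omits the 0.075 minimum forecast price volatility filter. Negative information ratios correspond to short trades (sell at t₁, buy back at t₂).

```python
import numpy as np

def best_trade(mean, cov, min_duration=4, min_ir=0.25):
    """Greedy trade selection by forecast information ratio, eq. (8).

    mean -- forecast price trajectory p_t;  cov -- its forecast covariance.
    Returns (t1, t2, ir) for the pair maximizing |IR|, or None if no trade
    passes the filters.
    """
    H = len(mean)
    best = None
    for t1 in range(H):
        for t2 in range(t1 + min_duration, H):
            # Var[p_t2 - p_t1] = Var[p_t1] + Var[p_t2] - 2 Cov[p_t1, p_t2]
            var = cov[t1, t1] + cov[t2, t2] - 2.0 * cov[t1, t2]
            if var <= 0:
                continue
            ir = (mean[t2] - mean[t1]) / np.sqrt(var)
            if best is None or abs(ir) > abs(best[2]):
                best = (t1, t2, ir)
    if best is None or abs(best[2]) < min_ir:
        return None
    return best
```

Revising the exit of an open position corresponds to the same scan with t₁ pinned to the current time, maximizing over t₂ only.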
Our simulations were carried out over the 1994–2007 period, using historical data (for Gaussian process training) dating back to 1989. Transaction costs were assumed to be 5 basis points per spread leg traded. Spreads were never traded later than 25 calendar days before maturity of the near leg. Relative returns are computed using as notional amount half the total exposure incurred by both legs of the spread.⁵ Financial performance results on the complete test period and on two disjoint sub-periods (which correspond, until end-2002, to the model selection period, and after 2003 to a true out-of-sample evaluation) are shown in Table 2. In all sub-periods, but particularly since 2003, the portfolio exhibits a very favorable risk-return profile, including positive skewness and acceptable excess kurtosis.⁶ A plot of cumulative returns, number of open positions and monthly returns appears in Figure 3.

⁵ This is a conservative assumption, since most exchanges impose considerably reduced margin requirements on recognized spreads.

⁶ By way of comparison, over the period 1 Jan. 1994–30 Apr. 2007, the S&P 500 index has an information ratio of approximately 0.37 against U.S. three-month treasury bills.

Figure 3: Top panel: cumulative excess return after transaction costs of a portfolio of 30 spreads traded according to the maximum information-ratio criterion; the bottom part plots the number of positions open at a time (right axis). Bottom panel: monthly portfolio relative excess returns; we observe the significant positive skewness in the distribution.

5 Future Work and Conclusions

We introduced a flexible functional representation of time series, capable of making long-term forecasts from progressively-revealed information sets and of handling multiple irregularly-sampled series as training examples.
We demonstrated the approach on a challenging commodity spread trading application, making use of a Gaussian process' ability to compute a complete covariance matrix between several test outputs. Future work includes making more systematic use of approximation methods for Gaussian processes (see [9] for a survey). The specific usage pattern of the Gaussian process may guide the approximation: in particular, since we know the test inputs in advance, the problem is intrinsically one of transduction, and the Bayesian Committee Machine [12] could prove beneficial.

References

[1] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] N. Chapados and Y. Bengio. Cost functions and model combination for VaR-based asset allocation using neural networks. IEEE Transactions on Neural Networks, 12(4):890–906, July 2001.
[3] F. X. Diebold and R. S. Mariano. Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3):253–263, July 1995.
[4] A. Girard, C. E. Rasmussen, J. Quiñonero-Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs: application to multiple-step ahead time series forecasting. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 529–536. MIT Press, 2003.
[5] R. C. Grinold and R. N. Kahn. Active Portfolio Management. McGraw-Hill, 1999.
[6] J. D. Hamilton. Time Series Analysis. Princeton University Press, 1994.
[7] J. C. Hull. Options, Futures and Other Derivatives. Prentice Hall, Englewood Cliffs, NJ, sixth edition, 2005.
[8] A. O'Hagan. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society B, 40:1–42, 1978. (With discussion.)
[9] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[10] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, second edition, 2005.
[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[13] U.S. Department of Agriculture. Economic Research Service data sets. Available at http://www.ers.usda.gov/Data/.
[14] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520. MIT Press, 1996.