{"title": "High-dimensional multivariate forecasting with low-rank Gaussian Copula Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 6827, "page_last": 6837, "abstract": "Predicting the dependencies between observations from multiple time series is critical for applications such as anomaly detection, financial risk management, causal analysis, or demand forecasting.\nHowever, the computational and numerical difficulties of estimating time-varying and high-dimensional covariance matrices often limits existing methods to handling at most a few hundred dimensions or requires making strong assumptions on the dependence between series.\nWe propose to combine an RNN-based time series model with a Gaussian copula process output model with a low-rank covariance structure to reduce the computational complexity and handle non-Gaussian marginal distributions.\nThis permits to drastically reduce the number of parameters and consequently allows the modeling of time-varying correlations of thousands of time series. We show on several real-world datasets that our method provides significant accuracy improvements over state-of-the-art baselines and perform an ablation study analyzing the contributions of the different components of our model.", "full_text": "High-Dimensional Multivariate Forecasting with\n\nLow-Rank Gaussian Copula Processes\n\nDavid Salinas\nNaverlabs \u2217\n\ndavid.salinas@naverlabs.com\n\nMichael Bohlke-Schneider\n\nAmazon Research\n\nbohlkem@amazon.com\n\nLaurent Callot\nAmazon Research\n\nlcallot@amazon.com\n\nRoberto Medico\nGhent University \u2217\n\nroberto.medico91@gmail.com\n\nAbstract\n\nJan Gasthaus\n\nAmazon Research\n\ngasthaus@amazon.com\n\nPredicting the dependencies between observations from multiple time series is\ncritical for applications such as anomaly detection, \ufb01nancial risk management,\ncausal analysis, or demand forecasting. 
However, the computational and numerical difficulties of estimating time-varying, high-dimensional covariance matrices often limit existing methods to at most a few hundred dimensions, or require strong assumptions on the dependence between series. We propose to combine an RNN-based time series model with a Gaussian copula process output model with a low-rank covariance structure, which reduces the computational complexity and handles non-Gaussian marginal distributions. This drastically reduces the number of parameters and consequently allows the modeling of time-varying correlations of thousands of time series. We show on several real-world datasets that our method provides significant accuracy improvements over state-of-the-art baselines, and we perform an ablation study analyzing the contributions of the different components of our model.

1 Introduction

The goal of forecasting is to predict the distribution of future time series values. Forecasting tasks frequently require predicting several related time series, such as multiple metrics for a compute fleet or multiple products of the same category in demand forecasting. While these time series are often dependent, they are commonly assumed to be (conditionally) independent in high-dimensional settings because of the hurdle of estimating large covariance matrices.

Assuming independence, however, makes such methods unsuited for applications in which the correlations between time series play an important role. This is the case in finance, where risk-minimizing portfolios cannot be constructed without a forecast of the covariance of assets. In retail, a method providing a probabilistic forecast for different sellers should take competition relationships and cannibalization effects into account. 
In anomaly detection, observing several nodes deviating from their expected behavior can be cause for alarm even if no single node exhibits clear signs of anomalous behavior.

Multivariate forecasting has been an important topic in the statistics and econometrics literature. Several multivariate extensions of classical univariate methods are widely used, such as vector autoregressions (VAR) extending autoregressive models [19], multivariate state-space models [7], or

*Work done while being at Amazon Research

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Top: covariance matrix predicted by our model for taxi traffic time series at 1214 locations in New York at 4 different hours of a Sunday (only a neighborhood of 103 series is shown here, for clearer visualization). Bottom: correlation graph obtained by keeping only pairs with covariance above a fixed threshold at the same hours. Both spatial and temporal relations are learned from the data, as the covariance evolves over time and edges connect locations that are close to each other.

multivariate generalized autoregressive conditional heteroskedasticity (MGARCH) models [2]. The rapid growth in the number of parameters with the dimension of the problem makes these models increasingly difficult to estimate, which has confined them to low-dimensional cases. To alleviate these limitations, researchers have used dimensionality reduction methods and regularization, see for instance [3, 5] for VAR models and [34, 9] for MGARCH models, but these models remain unsuited for applications with more than a few hundred dimensions [23].

Forecasting can be seen as an instance of sequence modeling, a topic which has been intensively studied by the machine learning community. 
Deep learning-based sequence models have been successfully applied to audio signals [33], language modeling [13, 30], and general density estimation of univariate sequences [22, 21]. Similar sequence modeling techniques have also been used in the context of forecasting to make probabilistic predictions for collections of real- or integer-valued time series [26, 36, 16]. These approaches fit a global (i.e. shared) sequence-to-sequence model to a collection of time series, but generate statistically independent predictions. Outside the forecasting domain, similar methods have also been applied to (low-dimensional) multivariate dependent time series, e.g. two-dimensional time series of drawing trajectories [13, 14].

Two main issues prevent the estimation of high-dimensional multivariate time series models. The first one is the O(N^2) scaling of the number of parameters required to express the covariance matrix, where N denotes the dimension. Using dimensionality reduction techniques like PCA as a preprocessing step is a common approach to alleviate this problem, but it separates the estimation of the model from the preprocessing step, leading to decreased performance. This motivated [27] to perform such a factorization jointly with the model estimation. In this paper we show how the low-rank-plus-diagonal covariance structure of the factor analysis model [29, 25, 10, 24] can be combined with Gaussian copula processes [37] and an LSTM-RNN [15] to jointly learn the temporal dynamics and the (time-varying) covariance structure, while significantly reducing the number of parameters that need to be estimated.

The second issue affects not only multivariate models, but all global time series models, i.e. 
models that estimate a single model for a collection of time series: in real-world data, the magnitudes of the time series can vary drastically between different series of the same data set, often spanning several orders of magnitude. In online retail demand forecasting, for example, item sales follow a power-law distribution, with a large number of items selling only a few units throughout the year, and a few popular items selling thousands of units per day [26]. The challenge this poses for estimating global models across time series has been noted in previous work [26, 37, 27]. Several approaches have been proposed to alleviate this problem, including simple, fixed invertible transformations such as the square-root or logarithmic transformation, and the data-adaptive Box-Cox transform [4], which aims to map a potentially heavy-tailed distribution to a Normal distribution. Other approaches include removing scale with mean-scaling [26], or with a separate network [27].

Here, we propose to address this problem by modeling each time series' marginal distribution separately using a non-parametric estimate of its cumulative distribution function (CDF). Using this CDF estimate as the marginal transformation in a Gaussian copula (following [18, 17, 1]) effectively addresses the challenges posed by scaling, as it decouples the estimation of marginal distributions from the temporal dynamics and the dependency structure.

The work most closely related to ours is the recent work [32], which also proposes to combine deep autoregressive models with copulas to model correlated time series. Their approach uses a nonparametric estimate of the copula, whereas we employ a Gaussian copula with low-rank structure that is learned jointly with the rest of the model. 
The nonparametric copula estimate requires splitting an N-dimensional cube into ε^{-N} pieces (where N is the time series dimension and ε is a desired precision), making it difficult to scale that approach to large dimensions. The method also requires the marginal distributions and the dependency structure to be time-invariant, an assumption which is often violated in practice, as shown in Fig. 1. A concurrent approach was proposed in [35], which also uses copulas, estimates marginal quantile functions with the approach proposed in [11], and models the Cholesky factor as the output of a neural network. One important difference is that this approach requires estimating O(N^2) parameters to model the covariance matrix, instead of O(N) with the low-rank approach that we propose; another difference is the use of a non-parametric estimator for the marginal quantile functions.

The main contributions of this paper are:

• a method for probabilistic high-dimensional multivariate forecasting (scaling to dimensions up to an order of magnitude larger than previously reported in [23]),

• a parametrization of the output distribution based on a low-rank-plus-diagonal covariance matrix, enabling this scaling by significantly reducing the number of parameters,

• a copula-based approach for handling different scales and modeling non-Gaussian data,

• an empirical study on artificial and six real-world datasets showing how this method improves accuracy over the state of the art while scaling to large dimensions.

The rest of the paper is structured as follows: In Section 2 we introduce the probabilistic forecasting problem and describe the overall structure of our model. We then describe how we can use the empirical marginal distributions in a Gaussian copula to address the scaling problem and handle non-Gaussian distributions in Section 3. 
In Section 4 we describe the parametrization of the covariance matrix with low-rank-plus-diagonal structure, and how the resulting model can be viewed as a low-rank Gaussian copula process. Finally, we report experiments with real-world datasets that demonstrate how these contributions combine to allow our model to generate correlated predictions that outperform state-of-the-art methods in terms of accuracy.

2 Autoregressive RNN Model for Probabilistic Multivariate Forecasting

Let us denote the values of a multivariate time series by zi,t ∈ D, where i ∈ {1, 2, . . . , N} indexes the individual univariate component time series, and t indexes time. The domain D is assumed to be either R or N. We will denote the multivariate observation vector at time t by zt ∈ D^N. Given T observations z1, . . . , zT, we are interested in forecasting the future values for τ time units, i.e. we want to estimate the joint conditional distribution P(zT+1, . . . , zT+τ | z1, . . . , zT). In a nutshell, our model takes the form of a non-linear, deterministic state-space model whose state hi,t ∈ R^k evolves independently for each time series i according to transition dynamics ϕ,

hi,t = ϕθh(hi,t−1, zi,t−1),   i = 1, . . . , N,   (1)

where the transition dynamics ϕ are parametrized using an LSTM-RNN [15]. Note that the LSTM is unrolled for each time series separately, but parameters are tied across time series. Given the state values hi,t for all time series i = 1, 2, . . . , N, and denoting by ht the collection of state values for all series at time t, we parametrize the joint emission distribution using a Gaussian copula,

p(zt | ht) = N([f1(z1,t), f2(z2,t), . . .
, fN(zN,t)]^T | μ(ht), Σ(ht)).   (2)

Figure 2: Illustration of our model parametrization. During training, dimensions are sampled at random and a local LSTM is unrolled on each of them individually (series 1, 2, 4, then 1, 3, 4 in the example); each unrolled series i contributes its parameters μi,t, di,t, vi,t to a low-rank normal distribution N(μ, D + VV^T) over the sampled subset. The parameters governing the state updates and parameter projections are shared for all time series. This parametrization can express the low-rank Gaussian distribution on sets of series that vary during training or prediction.

The transformations fi : D → R here are invertible mappings of the form Φ^{-1} ◦ F̂i, where Φ denotes the cumulative distribution function of the standard normal distribution, and F̂i denotes an estimate of the marginal distribution of the i-th time series zi,1, . . . , zi,T. The role of these functions fi is to transform the data for the i-th time series such that marginally it follows a standard normal distribution. The functions μ(·) and Σ(·) map the state ht to the mean and covariance of a Gaussian distribution over the transformed observations (described in more detail in Section 4).

Under this model, we can factorize the joint distribution of the observations as

p(z1, . . .
, zT+τ) = ∏_{t=1}^{T+τ} p(zt | z1, . . . , zt−1) = ∏_{t=1}^{T+τ} p(zt | ht).   (3)

Both the state update function ϕ and the mappings μ(·) and Σ(·) have free parameters that are learned from the data. We denote by θ the vector of all free parameters, consisting of the parameters of the state update θh as well as θμ and θΣ, which denote the free parameters in μ(·) and Σ(·). Given θ and hT+1, we can produce Monte Carlo samples from the joint predictive distribution

p(zT+1, . . . , zT+τ | z1, . . . , zT) = p(zT+1, . . . , zT+τ | hT+1) = ∏_{t=T+1}^{T+τ} p(zt | ht)   (4)

by sequentially sampling from p(zt | ht) and updating ht using ϕ.² We learn the parameters θ from the observed data z1, . . . , zT using maximum likelihood estimation, i.e. by minimizing the loss function

−log p(z1, z2, . . . , zT) = −∑_{t=1}^{T} log p(zt | ht),   (5)

using stochastic gradient descent-based optimization. To handle long time series, we employ a data augmentation strategy which randomly samples fixed-size slices of length T′ + τ from the time series during training, where we fix the context length hyperparameter T′ to τ. During prediction, only the last T′ time steps are used in computing the initial state for prediction.

3 Gaussian Copula

A copula function C : [0, 1]^N → [0, 1] is the CDF of a joint distribution of a collection of real random variables U1, . . . , UN with uniform marginal distributions [8], i.e.

C(u1, . . . , uN) = P(U1 ≤ u1, . . . , UN ≤ uN).

Sklar's theorem [28] states that any joint cumulative distribution F admits a representation in terms of its univariate marginals Fi and a copula function C,

F(z1, . . . , zN) = C(F1(z1), . . .
, FN(zN)).

² Note that the model complexity scales linearly with the number of Monte Carlo samples.

When the marginals are continuous the copula C is unique and is given by the joint distribution of the probability integral transforms of the original variables, i.e. u ∼ C where ui = Fi(zi). Furthermore, if zi is continuous then ui ∼ U(0, 1).

A common modeling choice for C is the Gaussian copula, defined by

C(F1(z1), . . . , FN(zN)) = φμ,Σ(Φ^{-1}(F1(z1)), . . . , Φ^{-1}(FN(zN))),

where Φ : R → [0, 1] is the CDF of the standard normal and φμ,Σ is a multivariate normal distribution parametrized with μ ∈ R^N and Σ ∈ R^{N×N}. In this model, the observations z, the marginally uniform random variables u, and the Gaussian random variables x are related as follows: x ↦ u = Φ(x) ↦ z = F^{-1}(u), and conversely z ↦ u = F(z) ↦ x = Φ^{-1}(u).

Setting fi = Φ^{-1} ◦ F̂i results in the model in Eq. (2).

The marginal distributions Fi are not given a priori and need to be estimated from data. We use the non-parametric approach of [18], proposed in the context of estimating high-dimensional distributions with sparse covariance structure. In particular, they use the empirical CDF of the marginal distributions,

F̂i(v) = (1/m) ∑_{t=1}^{m} 1_{zi,t ≤ v},

where m observations are considered. As we require the transformations fi to be differentiable, we use a linearly-interpolated version of the empirical CDF, resulting in a piecewise-constant derivative F̂′(u). 
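To make the marginal transformation concrete, the following is a minimal sketch (our own illustration, not the paper's implementation; the function name and the clipping constant `eps` are our choices) of fi = Φ^{-1} ◦ F̂i with a linearly interpolated empirical CDF:

```python
import numpy as np
from scipy.stats import norm


def empirical_cdf_transform(train_values, z, eps=1e-4):
    """Map observations z to standard-normal space via f = Phi^{-1} o F_hat.

    F_hat is a linearly interpolated empirical CDF estimated from
    `train_values` (the m past observations of one series); `eps` clips
    the CDF away from 0 and 1 so that Phi^{-1} stays finite.
    """
    xs = np.sort(np.asarray(train_values, dtype=float))
    m = len(xs)
    # empirical CDF levels 1/m, 2/m, ..., 1 at the sorted sample points
    levels = np.arange(1, m + 1) / m
    # linear interpolation between sample points gives a differentiable
    # (piecewise-linear) CDF estimate with piecewise-constant derivative
    u = np.clip(np.interp(z, xs, levels, left=eps, right=1 - eps), eps, 1 - eps)
    return norm.ppf(u)  # Phi^{-1}(u): approximately N(0, 1) marginally


# usage: transform one heavy-tailed series into Gaussian space
rng = np.random.default_rng(0)
series = rng.gamma(shape=2.0, scale=10.0, size=100)
x = empirical_cdf_transform(series, series)
```

Because the transform is monotone, it preserves the ordering of the observations while removing their scale, which is exactly what decouples the marginals from the dependency structure.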
This allows us to write the log-density of the original observations under our model as

log p(z; μ, Σ) = log φμ,Σ(Φ^{-1}(F̂(z))) + log (d/dz) Φ^{-1}(F̂(z))
              = log φμ,Σ(Φ^{-1}(F̂(z))) + log (d/du) Φ^{-1}(u)|_{u=F̂(z)} + log F̂′(z)
              = log φμ,Σ(Φ^{-1}(F̂(z))) − log φ(Φ^{-1}(F̂(z))) + log F̂′(z),

which are the individual terms in the total loss (5), where φ is the probability density function of the standard normal and we have used (d/du) Φ^{-1}(u) = 1/φ(Φ^{-1}(u)).

The number of past observations m used to estimate the empirical CDFs is a hyperparameter and is kept constant in our experiments with m = 100.³

4 Low-rank Gaussian Process Parametrization

After applying the marginal transformations fi(·), our model assumes a joint multivariate Gaussian distribution over the transformed data. In this section we describe how the parameters μ(ht) and Σ(ht) of this emission distribution are obtained from the LSTM state ht.

We begin by describing how a low-rank-plus-diagonal parametrization of the covariance matrix can be used to keep the computational complexity and the number of parameters manageable as the number of time series N grows. We then show how, by viewing the emission distribution as a time-varying low-rank Gaussian process gt ∼ GP(μ̃t(·), kt(·,·)), we can train the model by considering only a subset of time series in each mini-batch, further alleviating memory constraints and allowing the model to be applied to very high-dimensional sets of time series.

Let us denote the vector of transformed observations by

xt = f(zt) = [f1(z1,t), f2(z2,t), . . . , fN(zN,t)]^T,

so that p(xt | ht) = N(xt | μ(ht), Σ(ht)). The covariance matrix Σ(ht) is an N × N symmetric positive definite matrix with O(N^2) free parameters. 
Evaluating the Gaussian likelihood naïvely requires O(N^3) operations. Using a structured parametrization of the covariance matrix as the sum of a diagonal matrix and a low-rank matrix, Σ = D + VV^T, where D ∈ R^{N×N} is diagonal and V ∈ R^{N×r}, results in a compact representation with O(N × r) parameters. This allows the likelihood to be evaluated using O(Nr^2 + r^3) operations. As the rank hyperparameter r can typically be chosen much smaller than N, this leads to a significant speedup. In all our low-rank experiments we use r = 10. We investigate the sensitivity of accuracy and speed to this parameter in the Appendix.

³ This makes the underlying assumption that the marginal distributions are stationary, which is violated e.g. in the case of a linear trend. Standard time series techniques such as de-trending or differencing can be used to pre-process the data such that this assumption is satisfied.

Recall from Eq. (1) that hi,t represents the state of an LSTM unrolled with values preceding zi,t. In order to define the mapping Σ(ht), we define mappings for its components:

Σ(ht) = diag(d1(h1,t), . . . , dN(hN,t)) + [v1(h1,t), . . . , vN(hN,t)] [v1(h1,t), . . . , vN(hN,t)]^T = Dt + VtVt^T.

Note that the component mappings di and vi depend only on the state hi,t for the i-th component time series, but not on the states of the other time series. Instead of learning separate mappings di, vi, and μi for each time series, we parametrize them in terms of the shared functions d̃, ṽ, and μ̃, respectively. These functions depend on an E-dimensional feature vector ei ∈ R^E for each individual time series. 
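To illustrate where the O(Nr^2 + r^3) cost comes from (a sketch under our own naming, not the authors' code), the Gaussian log-likelihood under Σ = D + VV^T can be evaluated with the Woodbury identity and the matrix determinant lemma, without ever forming the dense N × N covariance:

```python
import numpy as np


def lowrank_gaussian_logpdf(x, mu, d, V):
    """log N(x | mu, D + V V^T) with D = diag(d) and V of shape (N, r).

    Uses the Woodbury identity for the quadratic form and the matrix
    determinant lemma for the log-determinant, so the cost is
    O(N r^2 + r^3) instead of the O(N^3) of a dense solve.
    """
    N, r = V.shape
    diff = x - mu
    Dinv_diff = diff / d                       # D^{-1} (x - mu), O(N)
    DinvV = V / d[:, None]                     # D^{-1} V, O(N r)
    K = np.eye(r) + V.T @ DinvV                # r x r capacitance matrix
    VtDinv_diff = DinvV.T @ diff               # V^T D^{-1} (x - mu)
    # Woodbury: Sigma^{-1} = D^{-1} - D^{-1} V K^{-1} V^T D^{-1}
    quad = diff @ Dinv_diff - VtDinv_diff @ np.linalg.solve(K, VtDinv_diff)
    # matrix determinant lemma: log|D + V V^T| = log|K| + sum(log d)
    logdet = np.linalg.slogdet(K)[1] + np.sum(np.log(d))
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)
```

For N in the thousands and r = 10, only the r × r matrix K is ever factorized, which is what makes the full multivariate likelihood tractable at this scale.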
The vectors ei can either be features of the time series that are known a priori, or can be embeddings that are learned with the rest of the model (or a combination of both). Define the vector yi,t = [hi,t; ei]^T ∈ R^{p×1}, which concatenates the state for time series i at time t with the features ei of the i-th time series, and use the following parametrization:

μi(hi,t) = μ̃(yi,t) = wμ^T yi,t,
di(hi,t) = d̃(yi,t) = s(wd^T yi,t),
vi(hi,t) = ṽ(yi,t) = Wv yi,t,

where s(x) = log(1 + e^x) maps to positive values, and wμ ∈ R^{p×1}, wd ∈ R^{p×1}, Wv ∈ R^{r×p} are parameters.

All parameters θμ = {wμ, w̃μ} and θΣ = {wd, Wv, w̃d}, as well as the LSTM update parameters θh, are learned by optimizing Eq. (5). These parameters are shared for all time series and can therefore be used to parametrize a GP. We can view the distribution of xt as a Gaussian process evaluated at points yi,t, i.e. xi,t = gt(yi,t), where gt ∼ GP(μ̃(·), k(·,·)) with k(y, y′) = 1_{y=y′} d̃(y) + ṽ(y)^T ṽ(y′). Using this view it becomes apparent that we can train the model by evaluating the Gaussian terms in the loss only on random subsets of the time series in each iteration, i.e. we can train the model using batches of size B ≪ N, as illustrated in Figure 2 (in our experiments we use B = 20). Further, if prior information about the covariance structure is available (e.g. in the case of spatial data the covariance might be directly related to the distance between points), this information can be easily incorporated directly into the kernel, either by exclusively using a pre-specified kernel or by combining it with the learned, time-varying kernel specified above.

5 Experiments

Synthetic experiment. 
We first perform an experiment on synthetic data demonstrating that our approach can recover complex time-varying low-rank covariance patterns from multi-dimensional observations. An artificial dataset is generated by drawing T observations from a normal distribution with time-varying mean and covariance matrix, zt ∼ N(ρt u, Σt), where ρt = sin(t), Σt = U St U^T, and

St = [ σ1^2, ρt σ1 σ2 ; ρt σ1 σ2, σ2^2 ].

The coefficients of u ∈ R^{N×1} and U ∈ R^{N×r} are drawn uniformly in [a, b], and σ1, σ2 are fixed constants. By construction, the rank of Σt is 2. Both the mean and the correlation coefficient of the two underlying latent variables oscillate through time as ρt oscillates between −1 and 1. In our experiments, the constants are set to σ1 = σ2 = 0.1, a = −0.5, b = 0.5, and T = 24,000.

In Figure 3, we compare the one-step-ahead predicted covariance given by our model, i.e. the lower triangle of Σ(ht), to the true covariance, showing that the model is able to recover the complex underlying pattern of the covariance matrix.

Figure 3: True (solid line) and predicted (dashed line) covariance after training with N = 4 (left) and N = 8 (right) time series. Each line corresponds to an entry in the lower triangle of Σt (including the diagonal, i.e. 10 lines in the left plot, 28 in the right).

Experiments with real-world datasets. 
The following publicly-available datasets are used to compare the accuracy of different multivariate forecasting models.

• Exchange rate: daily exchange rates between 8 currencies, as used in [16]
• Solar: hourly photo-voltaic production of 137 stations in Alabama, as used in [16]
• Electricity: hourly time series of the electricity consumption of 370 customers [6]
• Traffic: hourly occupancy rate, between 0 and 1, of 963 San Francisco car lanes [6]
• Taxi: spatio-temporal traffic time series of New York taxi rides [31], taken at 1214 locations every 30 minutes in the months of January 2015 (training set) and January 2016 (test set)
• Wikipedia: daily page views of 2000 Wikipedia pages, as used in [11]

Each dataset is split into a training and test set by using all data prior to a fixed date for training and by using rolling windows for the test set. We measure accuracy on forecasts starting at time points equally spaced after the last point seen in training. For hourly datasets, accuracy is measured on 7 rolling time windows; for all other datasets we use 5 time windows, except for Taxi, where 57 windows are used in order to cover the full test set. The number of predicted steps τ, the domain, the time frequency, the dimension N, and the number of time steps available for training T are given in the appendix for all datasets.

Evaluation against baselines and ablation study. As we are looking into modeling correlated time series, only methods that are able to produce correlated samples are considered in our comparisons. The first baseline is VAR, a multivariate linear vector autoregressive model using lag 1 and a lag corresponding to the periodicity of the data. The second is GARCH, a multivariate conditional heteroskedasticity model proposed by [34], with the implementation from [12]. 
More details about these methods can be found in the supplement.

We also compare with different RNN architectures, distributions, and data transformation schemes to show the benefit of the low-rank Gaussian copula process that we propose. The most straightforward alternative to our approach is a single global LSTM that receives and predicts all target dimensions at once. We refer to this architecture as Vec-LSTM. We compare this architecture with the GP approach described in Section 4, where the LSTM is unrolled on each dimension separately before reconstructing the joint distribution. For the output distribution in the Vec-LSTM architecture, we compare independent⁴, low-rank, and full-rank normal distributions. For the data transformation, we compare the copula approach that we propose, the mean-scaling operation proposed in [26], and no transformation.

⁴ Note that samples are still correlated with a diagonal noise due to the conditioning on the LSTM state.

baseline                   | architecture | data transformation | distribution | CRPS ratio | CRPS-Sum ratio | num params ratio
VAR                        | -            | -                   | -            | 10.0       | 10.9           | 35.0
GARCH                      | -            | -                   | -            | 7.8        | 6.3            | 6.2
Vec-LSTM-ind               | Vec-LSTM     | None                | Independent  | 3.6        | 6.8            | 13.9
Vec-LSTM-ind-scaling       | Vec-LSTM     | Mean-scaling        | Independent  | 1.4        | 1.4            | 13.9
Vec-LSTM-fullrank          | Vec-LSTM     | None                | Full-rank    | 29.1       | 44.4           | 103.4
Vec-LSTM-fullrank-scaling  | Vec-LSTM     | Mean-scaling        | Full-rank    | 22.5       | 37.6           | 103.4
Vec-LSTM-lowrank-Copula    | Vec-LSTM     | Copula              | Low-rank     | 1.1        | 1.7            | 20.3
GP                         | GP           | None                | Low-rank     | 4.5        | 9.5            | 1.0
GP-scaling                 | GP           | Mean-scaling        | Low-rank     | 2.0        | 3.4            | 1.0
GP-Copula                  | GP           | Copula              | Low-rank     | 1.0        | 1.0            | 1.0

Table 1: Baselines summary and average ratio compared to GP-Copula for CRPS, CRPS-Sum, and number of parameters on all datasets.

The description of the baselines as well as their average performance 
compared to GP-Copula are given in Table 1. For evaluation, we generate 400 samples from each model and evaluate multi-step accuracy using the continuous ranked probability score (CRPS) metric [20], which measures the accuracy of the predicted distribution (see the supplement for details). We compute the CRPS metric on each time series individually (CRPS) as well as on the sum of all time series (CRPS-Sum). Both metrics are averaged over the prediction horizon and over the evaluation time points. RNN models are trained only once on the dates preceding the first rolling time point, and the same trained model is then used for all rolling evaluations.

Table 2 reports the CRPS-Sum accuracy of all methods (some entries are missing due to models requiring too much memory or having divergent losses). The individual time series CRPS as well as the mean squared error are also reported in the supplement. Models that do not have data transformations are generally less accurate and more unstable. We believe this to be caused by the large scale variation between series, also noted in [26, 27]. In particular, the copula transformation performs better than mean-scaling for GP, where GP-Copula significantly outperforms GP-scaling.

The GP-Copula model that we propose provides significant accuracy improvements on most datasets. In our comparison, CRPS and CRPS-Sum are improved by on average 10% and 40% (respectively) compared to the second best models for those metrics, Vec-LSTM-lowrank-Copula and Vec-LSTM-ind-scaling. One factor might be that the training is made more robust by adding randomness, as GP models need to predict different groups of series for each training example, making it harder to overfit. Note also that the number of parameters is drastically smaller compared to Vec-LSTM architectures. 
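For reference, the sample-based CRPS estimator used in this kind of evaluation can be sketched as follows (our own minimal version, not the evaluation code of the paper; CRPS-Sum simply applies the same estimator to the sum over all series):

```python
import numpy as np


def crps_from_samples(samples, y):
    """Sample-based estimate of CRPS(F, y) = E|X - y| - 0.5 E|X - X'|.

    `samples` holds draws from the predictive distribution for a single
    time point and series (or for the sum over series, giving CRPS-Sum);
    lower is better.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # pairwise term via broadcasting; fine for a few hundred samples
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2


# CRPS-Sum usage sketch: with sample_paths of shape (num_samples, N)
# and target of shape (N,), evaluate
#   crps_from_samples(sample_paths.sum(axis=1), target.sum())
```

The second term rewards sharp (low-spread) predictive distributions, so CRPS penalizes both miscalibration and unnecessary uncertainty.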
For the traf\ufb01c dataset, the GP models have 44K parameters to estimate\ncompared to 1.1M in a Vec-LSTM with a low-rank distribution and 38M parameters with a full-rank\ndistribution. The complexity of the number of parameters are also given in Table 3.\nWe also qualitatively assess the covariance structure predicted by our model. In Fig. 1, we plot the\npredicted correlation matrix for several time steps after training on the Taxi dataset. We following\nthe approach in [18] and reconstruct the covariance graph by truncating edges whose correlation\ncoef\ufb01cient is less than a threshold kept constant over time. Fig. 1 shows the spatio-temporal cor-\nrelation graph obtained at different hours. The predicted correlation matrices show how the model\nreconstructs the evolving topology of spatial relationships in the city traf\ufb01c. Covariance matrices\npredicted over time by our model can also be found in the appendix for other datasets.\nAdditional details concerning the processing of the datasets, hyper-parameter optimization, eval-\nuations, and model are given in the supplement. The code to perform the evaluations of\nour methods and different baselines is available at https://github.com/mbohlkeschneider/gluon-\nts/tree/mv_release.\n\n6 Conclusion\n\nWe presented an approach to obtain probabilistic forecast of high-dimensional multivariate time\nseries. By using a low-rank approximation, we can avoid the potentially very large number of pa-\nrameters of a full covariate matrix and by using a low-rank Gaussian Copula process we can stably\noptimize directly parameters of an autoregressive model. 
We believe that such techniques for estimating high-dimensional time-varying covariance matrices may open the door to several applications in anomaly detection, imputation, or graph analysis for time series data.

CRPS-Sum

estimator                   exchange        solar           elec            traffic         taxi            wiki
VAR                         0.010+/-0.000   0.524+/-0.001   0.031+/-0.000   0.144+/-0.000   0.292+/-0.000   3.400+/-0.003
GARCH                       0.020+/-0.000   0.869+/-0.000   0.278+/-0.000   0.368+/-0.000   -               -
Vec-LSTM-ind                0.009+/-0.000   0.470+/-0.039   0.731+/-0.007   0.110+/-0.020   0.429+/-0.000   0.801+/-0.029
Vec-LSTM-ind-scaling        0.008+/-0.001   0.391+/-0.017   0.025+/-0.001   0.087+/-0.041   0.506+/-0.005   0.133+/-0.002
Vec-LSTM-fullrank           0.646+/-0.114   0.956+/-0.000   0.999+/-0.000   -               -               -
Vec-LSTM-fullrank-scaling   0.394+/-0.174   0.920+/-0.035   0.747+/-0.020   -               -               -
Vec-LSTM-lowrank-Copula     0.007+/-0.000   0.319+/-0.011   0.064+/-0.008   0.103+/-0.006   0.326+/-0.007   0.241+/-0.033
GP                          0.011+/-0.001   0.828+/-0.010   0.947+/-0.016   2.198+/-0.774   0.425+/-0.199   0.933+/-0.003
GP-scaling                  0.009+/-0.000   0.368+/-0.012   0.022+/-0.000   0.079+/-0.000   0.183+/-0.395   1.483+/-1.034
GP-Copula                   0.007+/-0.000   0.337+/-0.024   0.024+/-0.002   0.078+/-0.002   0.208+/-0.183   0.086+/-0.004

Table 2: CRPS-Sum accuracy comparison (lower is better, best two methods are in bold). Mean and std are obtained by rerunning each method three times.

model      input   output (independent)   output (low-rank)   output (full-rank)
Vec-LSTM   O(Nk)   O(Nk)                  O(Nrk)              O(N^2 k)
GP         O(k)    O(k)                   O(rk)               O(N^2 k)

Table 3: Number of parameters for the input and output projection of different models. We recall that N and k denote the dimension and the size of the LSTM state, and r the rank.

References
[1] Wolfgang Aussenegg and Christian Cech. A new copula approach for high-dimensional real world portfolios. University of Applied Sciences bfi Vienna, Austria.
Working paper series, 68(2012):1–26, 2012.

[2] Luc Bauwens, Sébastien Laurent, and Jeroen VK Rombouts. Multivariate GARCH models: a survey. Journal of Applied Econometrics, 21(1):79–109, 2006.

[3] Ben S Bernanke, Jean Boivin, and Piotr Eliasz. Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics, 120(1):387–422, 2005.

[4] G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2):211–252, 1964.

[5] Laurent AF Callot and Anders B Kock. Oracle efficient estimation and forecasting with the adaptive lasso and the adaptive group lasso in vector autoregressions. Essays in Nonlinear Time Series Econometrics, pages 238–268, 2014.

[6] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.

[7] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods, volume 38. Oxford University Press, 2012.

[8] Gal Elidan. Copulas in machine learning. In Piotr Jaworski, Fabrizio Durante, and Wolfgang Karl Härdle, editors, Copulae in Mathematical and Quantitative Finance, pages 39–60, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[9] Robert Engle. Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models. Journal of Business & Economic Statistics, 20(3):339–350, 2002.

[10] B S Everitt. An Introduction to Latent Variable Models. Chapman and Hall, 1984.

[11] Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama S. Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Probabilistic forecasting with spline quantile function RNNs. AISTATS, 2019.

[12] Alexios Ghalanos. rmgarch: Multivariate GARCH models, 2019. R package version 1.3-6.

[13] Alex Graves.
Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[14] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[16] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. CoRR, abs/1703.07015, 2017.

[17] Han Liu, Fang Han, Ming Yuan, John Lafferty, Larry Wasserman, et al. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.

[18] Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.

[19] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.

[20] James E Matheson and Robert L Winkler. Scoring rules for continuous probability distributions. Management Science, 22(10):1087–1096, 1976.

[21] Junier Oliva, Avinava Dubey, Manzil Zaheer, Barnabas Poczos, Ruslan Salakhutdinov, Eric Xing, and Jeff Schneider. Transformation autoregressive networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3898–3907, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[22] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2338–2347.
Curran Associates, Inc., 2017.

[23] Andrew J Patton. A review of copula models for economic time series. Journal of Multivariate Analysis, 110:4–18, 2012.

[24] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.

[25] Donald B Rubin and Dorothy T Thayer. EM algorithms for ML factor analysis. Psychometrika, 47(1):69–76, 1982.

[26] David Salinas, Valentin Flunkert, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017.

[27] Rajat Sen, Hsiang-Fu Yu, and Inderjit Dhillon. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. arXiv preprint arXiv:1905.03806, 2019.

[28] M Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.

[29] C. Spearman. "General intelligence," objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.

[30] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[31] NYC Taxi and Limousine Commission. TLC trip record data. https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, 2015.

[32] J. Toubeau, J. Bottieau, F. Vallée, and Z. De Grève. Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets. IEEE Transactions on Power Systems, 34(2):1203–1215, March 2019.

[33] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[34] Roy Van der Weide.
GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal of Applied Econometrics, 17(5):549–564, 2002.

[35] Ruofeng Wen and Kari Torkkola. Deep generative quantile-copula models for probabilistic forecasting. arXiv preprint arXiv:1907.10697, 2019.

[36] Ruofeng Wen, Kari Torkkola, and Balakrishnan Narayanaswamy. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.

[37] Andrew Gordon Wilson and Zoubin Ghahramani. Copula processes. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, NIPS'10, pages 2460–2468, USA, 2010. Curran Associates Inc.