{"title": "Copula Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2460, "page_last": 2468, "abstract": "We define a copula process which describes the dependencies between arbitrarily many random variables independently of their marginal distributions. As an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), to predict the latent standard deviations of a sequence of random variables. To make predictions we use Bayesian inference, with the Laplace approximation, and with Markov chain Monte Carlo as an alternative. We find our model can outperform GARCH on simulated and financial data. And unlike GARCH, GCPV can easily handle missing data, incorporate covariates other than time, and model a rich class of covariance structures.", "full_text": "Copula Processes\n\nAndrew Gordon Wilson\u2217\nDepartment of Engineering\nUniversity of Cambridge\nagw38@cam.ac.uk\n\nZoubin Ghahramani\u2020\nDepartment of Engineering\nUniversity of Cambridge\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nWe de\ufb01ne a copula process which describes the dependencies between arbitrarily\nmany random variables independently of their marginal distributions. As an exam-\nple, we develop a stochastic volatility model, Gaussian Copula Process Volatility\n(GCPV), to predict the latent standard deviations of a sequence of random vari-\nables. To make predictions we use Bayesian inference, with the Laplace approxi-\nmation, and with Markov chain Monte Carlo as an alternative. We \ufb01nd our model\ncan outperform GARCH on simulated and \ufb01nancial data. And unlike GARCH,\nGCPV can easily handle missing data, incorporate covariates other than time, and\nmodel a rich class of covariance structures.\n\nImagine measuring the distance of a rocket as it leaves Earth, and wanting to know how these mea-\nsurements correlate with one another. 
How much does the value of the measurement at fifteen minutes depend on the measurement at five minutes? Once we\u2019ve learned this correlation structure, suppose we want to compare it to the dependence between measurements of the rocket\u2019s velocity. To do this, it is convenient to separate dependence from the marginal distributions of our measurements. At any given time, a rocket\u2019s distance from Earth could have a Gamma distribution, while its velocity could have a Gaussian distribution. And separating dependence from marginal distributions is precisely what a copula function does.\nWhile copulas have recently become popular, especially in financial applications [1, 2], as Nelsen [3] writes, \u201cthe study of copulas and the role they play in probability, statistics, and stochastic processes is a subject still in its infancy. There are many open problems. . . \u201d Typically only bivariate (and recently trivariate) copulas are being used and studied. In our introductory example, we are interested in learning the correlations in different stochastic processes, and comparing them. It would therefore be useful to have a copula process, which can describe the dependencies between arbitrarily many random variables independently of their marginal distributions. We define such a process. And as an example, we develop a stochastic volatility model, Gaussian Copula Process Volatility (GCPV). In doing so, we provide a Bayesian framework for learning the marginal distributions and dependency structure of what we call a Gaussian copula process.\nThe volatility of a random variable is its standard deviation. Stochastic volatility models are used to predict the volatilities of a heteroscedastic sequence \u2013 a sequence of random variables with different variances, like distance measurements of a rocket as it leaves the Earth. As the rocket gets further away, the variance on the measurements increases. 
Heteroscedasticity is especially impor-\ntant in econometrics; the returns on equity indices, like the S&P 500, or on currency exchanges, are\nheteroscedastic. Indeed, in 2003, Robert Engle won the Nobel Prize in economics \u201cfor methods of\nanalyzing economic time series with time-varying volatility\u201d. GARCH [4], a generalized version of\nEngle\u2019s ARCH, is arguably unsurpassed for predicting the volatility of returns on equity indices and\ncurrency exchanges [5, 6, 7]. GCPV can outperform GARCH, and is competitive on \ufb01nancial data\nthat especially suits GARCH [8, 9, 10]. Before discussing GCPV, we \ufb01rst introduce copulas and the\ncopula process. For a review of Gaussian processes, see Rasmussen and Williams [11].\n\n\u2217http://mlg.eng.cam.ac.uk/andrew\n\u2020Also at the machine learning department at Carnegie Mellon University.\n\n1\n\n\f1 Copulas\n\nCopulas are important because they separate the dependency structure between random variables\nfrom their marginal distributions.\nIntuitively, we can describe the dependency structure of any\nmultivariate joint distribution H(x1, . . . , xn) = P (X1 \u2264 x1, . . . Xn \u2264 xn) through a two step\nprocess. First we take each univariate random variable Xi and transform it through its cumulative\ndistribution function (cdf) Fi to get Ui = Fi(Xi), a uniform random variable. We then express\nthe dependencies between these transformed variables through the n-copula C(u1, . . . , un).\nFormally, an n-copula C : [0, 1]n \u2192 [0, 1] is a multivariate cdf with uniform univariate marginals:\nC(u1, u2, . . . , un) = P (U1 \u2264 u1, U2 \u2264 u2, . . . , Un \u2264 un), where U1, U2, . . . , Un are standard\nuniform random variables. Sklar [12] precisely expressed our intuition in the theorem below.\n\nTheorem 1.1. Sklar\u2019s theorem\nLet H be an n-dimensional distribution function with marginal distribution functions F1, F2, . . . , Fn.\nThen there exists an n-copula C such that for all (x1, x2, . . 
. , xn) \u2208 [\u2212\u221e, \u221e]n,\n\nH(x1, x2, . . . , xn) = C(F1(x1), F2(x2), . . . , Fn(xn)) = C(u1, u2, . . . , un). (1)\n\nIf F1, F2, . . . , Fn are all continuous then C is unique; otherwise C is uniquely determined on Range F1 \u00d7 Range F2 \u00d7 \u00b7\u00b7\u00b7 \u00d7 Range Fn. Conversely, if C is an n-copula and F1, F2, . . . , Fn are distribution functions, then the function H is an n-dimensional distribution function with marginal distribution functions F1, F2, . . . , Fn.\nAs a corollary, if Fi^(\u22121)(u) = inf{x : Fi(x) \u2265 u} is the quasi-inverse of Fi, then for all (u1, u2, . . . , un) \u2208 [0, 1]n,\n\nC(u1, u2, . . . , un) = H(F1^(\u22121)(u1), F2^(\u22121)(u2), . . . , Fn^(\u22121)(un)). (2)\n\nIn other words, (2) can be used to construct a copula. For example, the bivariate Gaussian copula is defined as\n\nC(u, v) = \u03a6\u03c1(\u03a6\u22121(u), \u03a6\u22121(v)), (3)\n\nwhere \u03a6\u03c1 is a bivariate Gaussian cdf with correlation coefficient \u03c1, and \u03a6 is the standard univariate Gaussian cdf. Li [2] popularised the bivariate Gaussian copula, by showing how it could be used to study financial risk and default correlation, using credit derivatives as an example.\nBy substituting F(x) for u and G(y) for v in equation (3), we have a bivariate distribution H(x, y), with a Gaussian dependency structure, and marginals F and G. Regardless of F and G, the resulting H(x, y) can still be uniquely expressed as a Gaussian copula, so long as F and G are continuous. It is then a copula itself that captures the underlying dependencies between random variables, regardless of their marginal distributions. For this reason, copulas have been called dependence functions [13, 14]. 
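The construction in equations (1)-(3) can be illustrated with a minimal sketch, assuming NumPy and SciPy; the function name and the particular Gamma/Gaussian marginals (echoing the rocket example) are our illustrative choices, not part of the paper:

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(rho, marginal_ppfs, n=1000, seed=0):
    """Draw from a bivariate Gaussian copula (eq. 3) with given marginal
    quantile functions, via Sklar's theorem (eq. 1): correlated Gaussians
    -> uniforms through the Gaussian cdf -> arbitrary marginals through
    inverse cdfs."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # Gaussian dependence
    u = stats.norm.cdf(z)                                  # uniform marginals
    return np.column_stack([ppf(u[:, i]) for i, ppf in enumerate(marginal_ppfs)])

# e.g. a Gamma-distributed distance and a Gaussian-distributed velocity,
# coupled by a Gaussian copula with rho = 0.8:
xy = gaussian_copula_sample(0.8, [stats.gamma(a=2.0).ppf, stats.norm(0, 1).ppf])
```

Whatever inverse cdfs are supplied, the dependence structure of the output is the same Gaussian copula.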
Nelsen [3] contains an extensive discussion of copulas.\n\n2 Copula Processes\n\nImagine choosing a covariance function, and then drawing a sample function at some \ufb01nite number\nof points from a Gaussian process. The result is a sample from a collection of Gaussian random\nvariables, with a dependency structure encoded by the speci\ufb01ed covariance function. Now, suppose\nwe transform each of these values through a univariate Gaussian cdf, such that we have a sample\nfrom a collection of uniform random variables. These uniform random variables also have this\nunderlying Gaussian process dependency structure. One might call the resulting values a draw from\na Gaussian-Uniform Process. We could subsequently put these values through an inverse beta cdf,\nto obtain a draw from what could be called a Gaussian-Beta Process: the values would be a sample\nfrom beta random variables, again with an underlying Gaussian process dependency structure. We\ncould also transform the uniform values with different inverse cdfs, which would give a sample from\ndifferent random variables, with dependencies encoded by the Gaussian process.\nThe above procedure is a means to generate samples from arbitrarily many random variables, with\narbitrary marginal distributions, and desired dependencies. It is an example of how to use what we\ncall a copula process \u2013 in this case, a Gaussian copula process, since a Gaussian copula describes\nthe dependency structure of a \ufb01nite number of samples. We now formally de\ufb01ne a copula process.\n\n2\n\n\fDe\ufb01nition 2.1. Copula Process\nLet {Wt} be a collection of random variables indexed by t \u2208 T , with marginal distribution functions\nFt, and let Qt = Ft(Wt). Further, let \u00b5 be a stochastic process measure with marginal distribution\nfunctions Gt, and joint distribution function H. 
Then Wt is copula process distributed with base measure \u00b5, or Wt \u223c CP(\u00b5), if and only if for all n \u2208 N, ai \u2208 R,\n\nP(\u22c2_{i=1}^{n} {G_{ti}^(\u22121)(Q_{ti}) \u2264 ai}) = H_{t1,t2,...,tn}(a1, a2, . . . , an). (4)\n\nEach Q_{ti} \u223c Uniform(0, 1), and G_{ti}^(\u22121) is the quasi-inverse of G_{ti}, as previously defined.\nDefinition 2.2. Gaussian Copula Process\nWt is Gaussian copula process distributed if it is copula process distributed and the base measure \u00b5 is a Gaussian process. If there is a mapping \u03a8 such that \u03a8(Wt) \u223c GP(m(t), k(t, t\u2032)), then we write Wt \u223c GCP(\u03a8, m(t), k(t, t\u2032)).\nFor example, if we have Wt \u223c GCP with m(t) = 0 and k(t, t) = 1, then in the definition of a copula process, Gt = \u03a6, the standard univariate Gaussian cdf, and H is the usual GP joint distribution function. Supposing this GCP is a Gaussian-Beta process, then \u03a8 = \u03a6\u22121 \u25e6 FB, where FB is a univariate Beta cdf. One could similarly define other copula processes.\nWe described generally how a copula process can be used to generate samples of arbitrarily many random variables with desired marginals and dependencies. We now develop a specific and practical application of this framework. We introduce a stochastic volatility model, Gaussian Copula Process Volatility (GCPV), as an example of how to learn the joint distribution of arbitrarily many random variables, the marginals of these random variables, and to make predictions. To do this, we fit a Gaussian copula process by using a type of Warped Gaussian Process [15]. However, our methodology varies substantially from Snelson et al. 
[15], since we are doing inference on latent variables as opposed to observations, which is a much greater undertaking that involves approximations, and we are doing so in a different context.\n\n3 Gaussian Copula Process Volatility\n\nAssume we have a sequence of observations y = (y1, . . . , yn)\u22a4 at times t = (t1, . . . , tn)\u22a4. The observations are random variables with different latent standard deviations. We therefore have n unobserved standard deviations, \u03c31, . . . , \u03c3n, and want to learn the correlation structure between these standard deviations, and also to predict the distribution of \u03c3\u2217 at some unrealised time t\u2217. We model the standard deviation function as a Gaussian copula process:\n\n\u03c3t \u223c GCP(g\u22121, 0, k(t, t\u2032)). (5)\n\nSpecifically,\n\nf(t) \u223c GP(m(t) = 0, k(t, t\u2032)) (6)\n\u03c3(t) = g(f(t), \u03c9) (7)\ny(t) \u223c N(0, \u03c3^2(t)), (8)\n\nwhere g is a monotonic warping function, parametrized by \u03c9. For each of the observations y = (y1, . . . , yn)\u22a4 we have corresponding GP latent function values f = (f1, . . . , fn)\u22a4, where \u03c3(ti) = g(fi, \u03c9), using the shorthand fi to mean f(ti).\n\u03c3t \u223c GCP, because any finite sequence (\u03c31, . . . , \u03c3p) is distributed as a Gaussian copula:\n\nP(\u03c31 \u2264 a1, . . . , \u03c3p \u2264 ap) = P(g\u22121(\u03c31) \u2264 g\u22121(a1), . . . , g\u22121(\u03c3p) \u2264 g\u22121(ap)) = \u03a6\u0393(g\u22121(a1), . . . , g\u22121(ap)) = \u03a6\u0393(\u03a6\u22121(F(a1)), . . . , \u03a6\u22121(F(ap))) = \u03a6\u0393(\u03a6\u22121(u1), . . . , \u03a6\u22121(up)) = C(u1, . . . , up), (9)\n\nwhere \u03a6 is the standard univariate Gaussian cdf (supposing k(t, t) = 1), \u03a6\u0393 is a multivariate Gaussian cdf with covariance matrix \u0393ij = cov(g\u22121(\u03c3i), g\u22121(\u03c3j)), and F is the marginal distribution of each \u03c3i. 
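The generative model in equations (6)-(8) can be sketched directly, assuming NumPy; the function name, jitter constant, and choice of squared exponential covariance (the one used later in the experiments) are our illustrative assumptions:

```python
import numpy as np

def sample_gcpv(t, warp, lengthscale=1.0, seed=0):
    """Draw one heteroscedastic sequence from the GCPV generative model
    (eqs. 6-8): f ~ GP(0, k), sigma = g(f), y ~ N(0, sigma^2).
    `warp` is a monotonic map from R to R+ (the function g)."""
    rng = np.random.default_rng(seed)
    # Squared exponential covariance k(t, t') = exp(-(t - t')^2 / l^2), A = 1.
    K = np.exp(-np.subtract.outer(t, t) ** 2 / lengthscale ** 2)
    f = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)))
    sigma = warp(f)                 # latent standard deviations (eq. 7)
    y = rng.normal(0.0, sigma)      # observations (eq. 8)
    return y, sigma

t = np.linspace(0, 4, 50)
y, sigma = sample_gcpv(t, warp=np.exp)  # g(x) = e^x, as in the GP-EXP variant
```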
In (5), we have \u03a8 = g\u22121, because it is g\u22121 which maps \u03c3t to a GP. The specification in (5) is equivalently expressed by (6) and (7). With GCPV, the form of g is learned so that g\u22121(\u03c3t) is best modelled by a GP. By learning g, we learn the marginal of each \u03c3: F(a) = \u03a6(g\u22121(a)) for a \u2208 R. Recently, a different sort of \u2018kernel copula process\u2019 has been used, where the marginals of the variables being modelled are not learned [16].1 Further, we also consider a more subtle and flexible form of our model, where the function g itself is indexed by time: g = gt(f(t), \u03c9). We only assume that the marginal distributions of \u03c3t are stationary over \u2018small\u2019 time periods, and for each of these time periods (5)-(7) hold true. We return to this in the final discussion section.\nHere we have assumed that each observation, conditioned on knowing its variance, is normally distributed with zero mean. This is a common assumption in heteroscedastic models. The zero mean and normality assumptions can be relaxed and are not central to this paper.\n\n4 Predictions with GCPV\n\nUltimately, we wish to infer p(\u03c3(t\u2217)|y, z), where z = {\u03b8, \u03c9}, and \u03b8 are the hyperparameters of the GP covariance function. To do this, we sample from\n\np(f\u2217|y, z) = \u222b p(f\u2217|f, \u03b8) p(f|y, z) df (10)\n\nand then transform these samples by g. 
Letting (Cf)ij = \u03b4ij g(fi, \u03c9)^2, where \u03b4ij is the Kronecker delta, Kij = k(ti, tj), and (k\u2217)i = k(t\u2217, ti), we have\n\np(f|y, z) = N(f; 0, K) N(y; 0, Cf) / p(y|z), (11)\np(f\u2217|f, \u03b8) = N(k\u2217\u22a4 K\u22121 f, k(t\u2217, t\u2217) \u2212 k\u2217\u22a4 K\u22121 k\u2217). (12)\n\nWe also wish to learn z, which we can do by finding the \u02c6z that maximizes the marginal likelihood,\n\np(y|z) = \u222b p(y|f, \u03c9) p(f|\u03b8) df. (13)\n\nUnfortunately, for many functions g, (10) and (13) are intractable. Our methods of dealing with this can be used in very general circumstances, where one has a Gaussian process prior, but an (optionally parametrized) non-Gaussian likelihood. We use the Laplace approximation to estimate p(f|y, z) as a Gaussian. Then we can integrate (10) for a Gaussian approximation to p(f\u2217|y, z), which we sample from to make predictions of \u03c3\u2217. Using Laplace, we can also find an expression for an approximate marginal likelihood, which we maximize to determine z. Once we have found z with Laplace, we use Markov chain Monte Carlo to sample from p(f\u2217|y, z), and compare that to using Laplace to sample from p(f\u2217|y, z). In the supplement we relate this discussion to (9).\n\n4.1 Laplace Approximation\n\nThe goal is to approximate (11) with a Gaussian, so that we can evaluate (10) and (13) and make predictions. In doing so, we follow Rasmussen and Williams [11] in their treatment of Gaussian process classification, except we use a parametrized likelihood, and modify Newton\u2019s method. First, consider as an objective function the logarithm of an unnormalized (11):\n\ns(f|y, z) = log p(y|f, \u03c9) + log p(f|\u03b8). (14)\n\nThe Laplace approximation uses a second order Taylor expansion about the \u02c6f which maximizes (14), to find an approximate objective \u02dcs(f|y, z). 
So the first step is to find \u02c6f, for which we use Newton\u2019s method. The Newton update is f_new = f \u2212 (\u2207\u2207s(f))\u22121 \u2207s(f). Differentiating (14),\n\n\u2207s(f|y, z) = \u2207 log p(y|f, \u03c9) \u2212 K\u22121 f (15)\n\u2207\u2207s(f|y, z) = \u2207\u2207 log p(y|f, \u03c9) \u2212 K\u22121 = \u2212W \u2212 K\u22121, (16)\n\nwhere W is the diagonal matrix \u2212\u2207\u2207 log p(y|f, \u03c9).\n\n1 Note added in proof: Also, for a very recent related model, see Rodr\u00edguez et al. [17].\n\nIf the likelihood function p(y|f, \u03c9) is not log concave, then W may have negative entries. Vanhatalo et al. [18] found this to be problematic when doing Gaussian process regression with a Student-t likelihood. They instead use an expectation-maximization (EM) algorithm for finding \u02c6f, and iterate ordered rank one Cholesky updates to evaluate the Laplace approximate marginal likelihood. But EM can converge slowly, especially near a local optimum, and each of the rank one updates is vulnerable to numerical instability. With a small modification of Newton\u2019s method, we often get close to quadratic convergence for finding \u02c6f, and can evaluate the Laplace approximate marginal likelihood in a numerically stable fashion, with no approximate Cholesky factors, and optimal computational requirements. Some comments are in the supplementary material but, in short, we use an approximate negative Hessian, \u2212\u2207\u2207s \u2248 M + K\u22121, which is guaranteed to be positive definite, since M is formed on each iteration by zeroing the negative entries of W. For stability, we reformulate our optimization in terms of B = I + M^(1/2) K M^(1/2), and let Q = M^(1/2) B\u22121 M^(1/2), b = M f + \u2207 log p(y|f), and a = b \u2212 QKb. Since (K\u22121 + M)\u22121 = K \u2212 KQK, the Newton update becomes f_new = Ka. With these updates we find \u02c6f and get an expression for \u02dcs which we use to approximate (13) and 
(11). The approximate marginal likelihood q(y|z) is given by \u222b exp(\u02dcs) df. Taking its logarithm,\n\nlog q(y|z) = \u2212(1/2) \u02c6f\u22a4 a_\u02c6f + log p(y|\u02c6f) \u2212 (1/2) log |B_\u02c6f|, (17)\n\nwhere B_\u02c6f is B evaluated at \u02c6f, and a_\u02c6f is a numerically stable evaluation of K\u22121\u02c6f.\nTo learn the parameters z, we use conjugate gradient descent to maximize (17) with respect to z. Since \u02c6f is a function of z, we initialize z, and update \u02c6f every time we vary z. Once we have found an optimum \u02c6z, we can make predictions. By exponentiating \u02dcs, we find a Gaussian approximation to the posterior (11), q(f|y, z) = N(\u02c6f, K \u2212 KQK). The product of this approximate posterior with p(f\u2217|f) is Gaussian. Integrating this product, we approximate p(f\u2217|y, z) as\n\nq(f\u2217|y, z) = N(k\u2217\u22a4 \u2207 log p(y|\u02c6f), k(t\u2217, t\u2217) \u2212 k\u2217\u22a4 Q k\u2217). (18)\n\nGiven n training observations, the cost of each Newton iteration is dominated by computing the Cholesky decomposition of B, which takes O(n^3) operations. The objective function typically changes by less than 10^\u22126 after 3 iterations. Once Newton\u2019s method has converged, it takes only O(1) operations to draw from q(f\u2217|y, z) and make predictions.\n\n4.2 Markov chain Monte Carlo\n\nWe use Markov chain Monte Carlo (MCMC) to sample from (11), so that we can later sample from p(\u03c3\u2217|y, z) to make predictions. Sampling from (11) is difficult, because the variables f are strongly coupled by a Gaussian process prior. We use a new technique, Elliptical Slice Sampling [19], and find it extremely effective for this purpose. It was specifically designed to sample from posteriors with correlated Gaussian priors. 
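A single elliptical slice sampling update, as defined in [19], can be sketched as follows, assuming NumPy; the function signature is our illustrative choice:

```python
import numpy as np

def elliptical_slice(f, chol_K, log_lik, rng):
    """One elliptical slice sampling transition (Murray et al. [19]) for a
    posterior proportional to lik(f) * N(f; 0, K); jointly updates every
    element of f, with no free parameters. `chol_K` is the lower Cholesky
    factor of the prior covariance K."""
    nu = chol_K @ rng.standard_normal(len(f))   # auxiliary draw from N(0, K)
    log_y = log_lik(f) + np.log(rng.uniform())  # slice height under current f
    theta = rng.uniform(0.0, 2.0 * np.pi)       # initial angle on the ellipse
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        f_prop = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_prop) > log_y:
            return f_prop                       # proposal lies on the slice
        # Shrink the angle bracket towards theta = 0 and retry; the loop
        # must terminate, since theta -> 0 recovers the current state f.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

For GCPV one would pass `log_lik(f) = sum_i log N(y_i; 0, g(f_i)^2)`, so the chain targets (11).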
It has no free parameters, and jointly updates every element of f. For our setting, it is over 100 times as fast as axis aligned slice sampling with univariate updates.\nTo make predictions, we take J samples of p(f|y, z), {f_1, . . . , f_J}, and then approximate (10) as a mixture of J Gaussians:\n\np(f\u2217|y, z) \u2248 (1/J) \u2211_{i=1}^{J} p(f\u2217|f_i, \u03b8). (19)\n\nEach of the Gaussians in this mixture has equal weight. So for each sample of f\u2217|y, we uniformly choose a random p(f\u2217|f_i, \u03b8) and draw a sample. In the limit J \u2192 \u221e, we are sampling from the exact p(f\u2217|y, z). Mapping these samples through g gives samples from p(\u03c3\u2217|y, z). After one O(n^3) and one O(J) operation, a draw from (19) takes O(1) operations.\n\n4.3 Warping Function\n\nThe warping function, g, maps fi, a GP function value, to \u03c3i, a standard deviation. Since fi can take any value in R, and \u03c3i can take any non-negative real value, g : R \u2192 R+. For each fi to correspond to a unique deviation, g must also be one-to-one. We use\n\ng(x, \u03c9) = \u2211_{j=1}^{K} aj log[exp[bj(x + cj)] + 1], aj, bj > 0. (20)\n\nThis is monotonic, positive, infinitely differentiable, asymptotic towards zero as x \u2192 \u2212\u221e, and tends to (\u2211_{j=1}^{K} aj bj) x as x \u2192 \u221e. In practice, it is useful to add a small constant to (20), to avoid rare situations where the parameters \u03c9 are trained to make g extremely small for certain inputs, at the expense of a good overall fit; this can happen when the parameters \u03c9 are learned by optimizing a likelihood. A suitable constant could be one tenth the absolute value of the smallest nonzero observation.\nBy inferring the parameters of the warping function, or distributions of these parameters, we are learning a transformation which will best model \u03c3t with a Gaussian process. 
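The warping function (20) can be written in a few lines, assuming NumPy; the function name and the optional `const` argument (the small additive constant suggested above) are our illustrative choices:

```python
import numpy as np

def warp(x, a, b, c, const=0.0):
    """The GCPV warping function of eq. (20):
    g(x) = sum_j a_j * log(exp(b_j (x + c_j)) + 1) + const, with a_j, b_j > 0.
    np.logaddexp(0, t) evaluates log(exp(t) + 1) without overflow."""
    x = np.asarray(x, dtype=float)[..., None]      # broadcast over the K terms
    return np.sum(a * np.logaddexp(0.0, b * (x + c)), axis=-1) + const
```

Each softplus term is monotonic and positive, so the sum is a valid one-to-one map from R to R+, with the stated asymptotes.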
The more flexible the warping function, the more potential there is to improve the GCPV fit \u2013 in other words, the better we can estimate the \u2018perfect\u2019 transformation. To test the importance of this flexibility, we also try a simple unparametrized warping function, g(x) = e^x. In related work, Goldberg et al. [20] place a GP prior on the log noise level in a standard GP regression model on observations, except for inference they use Gibbs sampling, and a high level of \u2018jitter\u2019 for conditioning.\nOnce g is trained, we can infer the marginal distribution of each \u03c3: F(a) = \u03a6(g\u22121(a)), for a \u2208 R. This suggests an alternate way to initialize g: we can initialize F as a mixture of Gaussians, and then map through \u03a6\u22121 to find g\u22121. Since mixtures of Gaussians are dense in the set of probability distributions, we could in principle find the \u2018perfect\u2019 g using an infinite mixture of Gaussians [21].\n\n5 Experiments\n\nIn our experiments, we predict the latent standard deviations \u03c3 of observations y at times t, and also \u03c3\u2217 at unobserved times t\u2217. To do this, we use two versions of GCPV. The first variant, which we simply refer to as GCPV, uses the warping function (20) with K = 1, and squared exponential covariance function, k(t, t\u2032) = A exp(\u2212(t \u2212 t\u2032)^2/l^2), with A = 1. The second variant, which we call GP-EXP, uses the unparametrized warping function e^x, and the same covariance function, except the amplitude A is a trained hyperparameter. The other hyperparameter l is called the lengthscale of the covariance function. The greater l, the greater the covariance between \u03c3t and \u03c3t+a for a \u2208 R. We train hyperparameters by maximizing the Laplace approximate log marginal likelihood (17). We then sample from p(f\u2217|y) using the Laplace approximation (18). 
We also do this using MCMC (19) with J = 10000, after discarding a previous 10000 samples of p(f|y) as burn-in. We pass these samples of f\u2217|y through g and g^2 to draw from p(\u03c3\u2217|y) and p(\u03c3\u2217^2|y), and compute the sample mean and variance of \u03c3\u2217|y. We use the sample mean as a point predictor, and the sample variance for error bounds on these predictions, and we use 10000 samples to compute these quantities. For GCPV we use Laplace and MCMC for inference, but for GP-EXP we only use Laplace. We compare predictions to GARCH(1,1), which has been shown in extensive and recent reviews to be competitive with other GARCH variants, and more sophisticated models [5, 6, 7]. GARCH(p,q) specifies y(t) \u223c N(0, \u03c3^2(t)), and lets the variance be a deterministic function of the past:\n\n\u03c3^2_t = a0 + \u2211_{i=1}^{p} ai y^2_{t\u2212i} + \u2211_{j=1}^{q} bj \u03c3^2_{t\u2212j}.\n\nWe use the Matlab Econometrics Toolbox implementation of GARCH, where the parameters a0, ai and bj are estimated using a constrained maximum likelihood.\nWe make forecasts of volatility, and we predict historical volatility. By \u2018historical volatility\u2019 we mean the volatility at observed time points, or between these points. Uncovering historical volatility is important. It could, for instance, be used to study what causes fluctuations in the stock market, or to understand physical systems.\nTo evaluate our model, we use the Mean Squared Error (MSE) between the true variance, or proxy for the truth, and the predicted variance. Although likelihood has advantages, we are limited in space, and we wish to harmonize with the econometrics literature, and other assessments of volatility models, where MSE is the standard. In a similar assessment of volatility models, Brownlees et al. [7] found that MSE and quasi-likelihood rankings were comparable.\nWhen the true variance is unknown we follow Brownlees et al. 
[7] and use squared observations\nas a proxy for the truth, to compare our model to GARCH.2 The more observations, the more\nreliable these performance estimates will be. However, not many observations (e.g. 100) are needed\nfor a stable ranking of competing models; in Brownlees et al. [7], the rankings derived from high\nfrequency squared observations are similar to those derived using daily squared observations.\n\n2Since each observation y is assumed to have zero mean and variance \u03c32, E[y2] = \u03c32.\n\n6\n\n\f5.1 Simulations\nWe simulate observations from N (0, \u03c32(t)), using \u03c3(t) = sin(t) cos(t2) + 1, at t =\n(0, 0.02, 0.04, . . . , 4)(cid:62). We call this data set TRIG. We also simulate using a standard deviation\nthat jumps from 0.1 to 7 and back, at times t = (0, 0.1, 0.2, . . . , 6)(cid:62). We call this data set JUMP.\nTo forecast, we use all observations up until the current time point, and make 1, 7, and 30 step\nahead predictions. So, for example, in TRIG we start by observing t = 0, and make forecasts at\nt = 0.02, 0.14, 0.60. Then we observe t = 0, 0.02 and make forecasts at t = 0.04, 0.16, 0.62, and\nso on, until all data points have been observed. For historical volatility, we predict the latent \u03c3t at\nthe observation times, which is safe since we are comparing to the true volatility, which is not used\nin training; the results are similar if we interpolate. Figure 1 panels a) and b) show the true volatil-\nity for TRIG and JUMP respectively, alongside GCPV Laplace, GCPV MCMC, GP-EXP Laplace,\nand GARCH(1,1) predictions of historical volatility. 
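The TRIG simulation described above can be reproduced with a short sketch, assuming NumPy; the function name is our illustrative choice, and the JUMP set is analogous, with the standard deviation jumping between 0.1 and 7:

```python
import numpy as np

def simulate_trig(seed=0):
    """Simulate the TRIG data set of Section 5.1:
    y(t) ~ N(0, sigma(t)^2) with sigma(t) = sin(t) cos(t^2) + 1,
    at t = 0, 0.02, 0.04, ..., 4."""
    rng = np.random.default_rng(seed)
    t = np.arange(0.0, 4.0 + 1e-9, 0.02)
    sigma = np.sin(t) * np.cos(t ** 2) + 1.0   # true latent volatility
    y = rng.normal(0.0, sigma)                 # heteroscedastic observations
    return t, y, sigma
```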
Table 1 shows the results for forecasting and historical volatility.\nIn panel a) we see that GCPV more accurately captures the dependencies between \u03c3 at different time points than GARCH: if we manually decrease the lengthscale in the GCPV covariance function, we can replicate the erratic GARCH behaviour, which inaccurately suggests that the covariance between \u03c3t and \u03c3t+a decreases quickly with increases in a. We also see that GCPV with an unparametrized exponential warping function tends to overestimate peaks and underestimate troughs. In panel b), the volatility is extremely difficult to reconstruct or forecast \u2013 with no warning it will immediately and dramatically increase or decrease. This behaviour is not suited to a smooth squared exponential covariance function. Nevertheless, GCPV outperforms GARCH, especially in regions of low volatility. We also see this in panel a) for t \u2208 (1.5, 2). GARCH is known to respond slowly to large returns, and to overpredict volatility [22]. In JUMP, the greater the peaks, and the smaller the troughs, the more GARCH suffers, while GCPV is mostly robust to these changes.\n\n5.2 Financial Data\n\nThe returns on the daily exchange rate between the Deutschmark (DM) and the Great Britain Pound (GBP) from 1984 to 1992 have become a benchmark for assessing the performance of GARCH models [8, 9, 10]. This exchange data, which we refer to as DMGBP, can be obtained from www.datastream.com, and the returns are calculated as rt = log(Pt+1/Pt), where Pt is the number of DM to GBP on day t. The returns are assumed to have a zero mean function.\nWe use a rolling window of the previous 120 days of returns to make 1, 7, and 30 day ahead volatility forecasts, starting at the beginning of January 1988, and ending at the beginning of January 1992 (659 trading days). Every 7 days, we retrain the parameters of GCPV and GARCH. 
Every time\nwe retrain parameters, we predict historical volatility over the past 120 days. The average MSE\nfor these historical predictions is given in Table 1, although they should be observed with caution;\nunlike with the simulations, the DMGBP historical predictions are trained using the same data they\nare assessed on. In Figure 1c), we see that the GARCH one day ahead forecasts are lifted above\nthe GCPV forecasts, but unlike in the simulations, they are now operating on a similar lengthscale.\nThis suggests that GARCH could still be overpredicting volatility, but that GCPV has adapted its\nestimation of how \u03c3t and \u03c3t+a correlate with one another. Since GARCH is suited to this \ufb01nancial\ndata set, it is reassuring that GCPV predictions have a similar time varying structure. Overall, GCPV\nand GARCH are competitive with one another for forecasting currency exchange returns, as seen\nin Table 1. Moreover, a learned warping function g outperforms an unparametrized one, and a full\nLaplace solution is comparable to using MCMC for inference, in accuracy and speed. This is also\ntrue for the simulations. Therefore we recommend whichever is more convenient to implement.\n\n6 Discussion\n\nWe de\ufb01ned a copula process, and as an example, developed a stochastic volatility model, GCPV,\nwhich can outperform GARCH. With GCPV, the volatility \u03c3t is distributed as a Gaussian Copula\nProcess, which separates the modelling of the dependencies between volatilities at different times\nfrom their marginal distributions \u2013 arguably the most useful property of a copula. Further, GCPV \ufb01ts\nthe marginals in the Gaussian copula process by learning a warping function. 
Table 1: MSE for predicting volatility.\n\nData set | Model | Historical | 1 step | 7 step | 30 step\nTRIG | GCPV (LA) | 0.0953 | 0.588 | 0.951 | 1.71\nTRIG | GCPV (MCMC) | 0.0760 | 0.622 | 0.979 | 1.76\nTRIG | GP-EXP | 0.193 | 0.646 | 1.36 | 1.15\nTRIG | GARCH | 0.938 | 1.04 | 1.79 | 5.12\nJUMP (\u00d710^3) | GCPV (LA) | 0.588 | 0.891 | 1.38 | 1.35\nJUMP (\u00d710^3) | GCPV (MCMC) | 1.21 | 0.951 | 1.37 | 1.35\nJUMP (\u00d710^3) | GP-EXP | 1.43 | 1.76 | 6.95 | 14.7\nJUMP (\u00d710^3) | GARCH | 1.88 | 1.58 | 3.43 | 5.65\nDMGBP (\u00d710^\u22129) | GCPV (LA) | 2.43 | 3.00 | 3.08 | 3.17\nDMGBP (\u00d710^\u22129) | GCPV (MCMC) | 2.39 | 3.00 | 3.08 | 3.17\nDMGBP (\u00d710^\u22129) | GP-EXP | 2.52 | 3.20 | 3.46 | 5.14\nDMGBP (\u00d710^\u22129) | GARCH | 2.83 | 3.03 | 3.12 | 3.32\n\nFigure 1: Predicting volatility and learning its marginal pdf. For a) and b), the true volatility, and GCPV (MCMC), GCPV (LA), GP-EXP, and GARCH predictions, are shown respectively by a thick green line, a dashed thick blue line, a dashed black line, a cyan line, and a red line. a) shows predictions of historical volatility for TRIG, where the shade is a 95% confidence interval about GCPV (MCMC) predictions. b) shows predictions of historical volatility for JUMP. In c), a black line and a dashed red line respectively show GCPV (LA) and GARCH one day ahead volatility forecasts for DMGBP. In d), a black line and a dashed blue line respectively show the GCPV learned marginal pdf of \u03c3t in DMGBP and a Gamma(4.15,0.00045) pdf.\n\nIf we had simply chosen an unparametrized exponential warping function, we would incorrectly be assuming that the log volatilities are marginally Gaussian distributed. Indeed, for the DMGBP data, we trained the warping function g over a 120 day period, and mapped its inverse through the univariate standard Gaussian cdf \u03a6, and differenced, to estimate the marginal probability density function (pdf) of \u03c3t over this period. The learned marginal pdf, shown in Figure 1d), is similar to a Gamma(4.15,0.00045) distribution. 
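The marginal-pdf estimate just described (map the inverse warping through \u03a6, then difference) can be sketched as follows, assuming NumPy and SciPy; the function name, grid, and the use of bisection for the numerical inverse are our illustrative choices, since the paper does not specify how g is inverted:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def marginal_pdf_of_sigma(g, a_grid):
    """Estimate the marginal pdf of sigma implied by a learned warping g,
    using F(a) = Phi(g^{-1}(a)) from Section 3, differenced on a grid.
    g^{-1} is found by bisection, which is valid since g is monotonic."""
    g_inv = np.array([brentq(lambda x, a=a: g(x) - a, -50.0, 50.0)
                      for a in a_grid])
    F = stats.norm.cdf(g_inv)          # marginal cdf of sigma
    return np.gradient(F, a_grid)      # finite differences give the pdf

# With g(x) = e^x, sigma is marginally lognormal:
a = np.linspace(0.05, 5.0, 200)
pdf = marginal_pdf_of_sigma(np.exp, a)
```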
However, in using a rolling window to retrain the parameters of g, we do not assume that the marginals of σt are stationary; we have a time-changing warping function.

While GARCH is successful, and its simplicity is attractive, our model is also simple and has a number of advantages. We can effortlessly handle missing data, we can easily incorporate covariates other than time (like interest rates) in our covariance function, and we can choose from a rich class of covariance functions – squared exponential, Brownian motion, Matérn, periodic, etc. In fact, the volatility of high frequency intradaily returns on equity indices and currency exchanges is cyclical [23], and GCPV with a periodic covariance function is uniquely well suited to this data. And the parameters of GCPV, like the covariance function lengthscale, or the learned warping function, provide insight into the underlying source of volatility, unlike the parameters of GARCH.

Finally, copulas are rapidly becoming popular in applications, but often only bivariate copulas are being used. With our copula process one can learn the dependencies between arbitrarily many random variables independently of their marginal distributions. We hope the Gaussian Copula Process Volatility model will encourage other applications of copula processes. More generally, we hope our work will help bring together the machine learning and econometrics communities.

Acknowledgments: Thanks to Carl Edward Rasmussen and Ferenc Huszár for helpful conversations. AGW is supported by an NSERC grant.

References

[1] Paul Embrechts, Alexander McNeil, and Daniel Straumann. Correlation and dependence in risk management: Properties and pitfalls. In Risk Management: Value at Risk and Beyond, pages 176–223.
Cambridge University Press, 1999.

[2] David X. Li. On default correlation: A copula function approach. Journal of Fixed Income, 9(4):43–54, 2000.

[3] Roger B. Nelsen. An Introduction to Copulas. Springer Series in Statistics, second edition, 2006.

[4] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.

[5] Ser-Huang Poon and Clive W.J. Granger. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005.

[6] Peter Reinhard Hansen and Asger Lunde. A forecast comparison of volatility models: Does anything beat a GARCH(1,1)? Journal of Applied Econometrics, 20(7):873–889, 2005.

[7] Christian T. Brownlees, Robert F. Engle, and Bryan T. Kelly. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915.

[8] T. Bollerslev and E. Ghysels. Periodic autoregressive conditional heteroscedasticity. Journal of Business and Economic Statistics, 14:139–151, 1996.

[9] B.D. McCullough and C.G. Renfro. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998.

[10] C. Brooks, S.P. Burke, and G. Persand. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001.

[11] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[12] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.

[13] P. Deheuvels. Caractérisation complète des lois extrêmes multivariées et de la convergence des types extrêmes. Publications de l'Institut de Statistique de l'Université de Paris, 23:1–36, 1978.

[14] G. Kimeldorf and A. Sampson. Uniform representations of bivariate distributions. Communications in Statistics, 4:617–627, 1982.

[15] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In NIPS, 2003.

[16] Sebastian Jaimungal and Eddie K.H. Ng. Kernel-based copula processes. In ECML PKDD, 2009.

[17] A. Rodríguez, D.B. Dunson, and A.E. Gelfand. Latent stick-breaking processes. Journal of the American Statistical Association, 105(490):647–659, 2010.

[18] Jarno Vanhatalo, Pasi Jylänki, and Aki Vehtari. Gaussian process regression with Student-t likelihood. In NIPS, 2009.

[19] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. Elliptical slice sampling. In AISTATS, 2010.

[20] Paul W. Goldberg, Christopher K.I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In NIPS, 1998.

[21] Carl Edward Rasmussen. The infinite Gaussian mixture model. In NIPS, 2000.

[22] Ruey S. Tsay. Analysis of Financial Time Series. John Wiley & Sons, 2002.

[23] Torben G. Andersen and Tim Bollerslev. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158, 1997.