{"title": "Infinite-Horizon Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 3486, "page_last": 3495, "abstract": "Gaussian processes provide a flexible framework for forecasting, removing noise, and interpreting long temporal datasets. State space modelling (Kalman filtering) enables these non-parametric models to be deployed on long datasets by reducing the complexity to linear in the number of data points. The complexity is still cubic in the state dimension m which is an impediment to practical application. In certain special cases (Gaussian likelihood, regular spacing) the GP posterior will reach a steady posterior state when the data are very long. We leverage this and formulate an inference scheme for GPs with general likelihoods, where inference is based on single-sweep EP (assumed density filtering). The infinite-horizon model tackles the cubic cost in the state dimensionality and reduces the cost in the state dimension m to O(m^2) per data point. The model is extended to online-learning of hyperparameters. We show examples for large finite-length modelling problems, and present how the method runs in real-time on a smartphone on a continuous data stream updated at 100 Hz.", "full_text": "In\ufb01nite-Horizon Gaussian Processes\n\nArno Solin\u2217\nAalto University\n\nJames Hensman\n\nPROWLER.io\n\nRichard E. Turner\n\nUniversity of Cambridge\n\narno.solin@aalto.fi\n\njames@prowler.io\n\nret26@cam.ac.uk\n\nAbstract\n\nGaussian processes provide a \ufb02exible framework for forecasting, removing noise,\nand interpreting long temporal datasets. State space modelling (Kalman \ufb01ltering)\nenables these non-parametric models to be deployed on long datasets by reducing\nthe complexity to linear in the number of data points. The complexity is still\ncubic in the state dimension m which is an impediment to practical application. 
In certain special cases (Gaussian likelihood, regular spacing) the GP posterior will reach a steady posterior state when the data are very long. We leverage this and formulate an inference scheme for GPs with general likelihoods, where inference is based on single-sweep EP (assumed density filtering). The infinite-horizon model tackles the cubic cost in the state dimensionality and reduces the cost in the state dimension m to O(m^2) per data point. The model is extended to online learning of hyperparameters. We show examples for large finite-length modelling problems, and present how the method runs in real-time on a smartphone on a continuous data stream updated at 100 Hz.

1 Introduction

Gaussian process (GP, [25]) models provide a plug & play interpretable approach to probabilistic modelling, and would perhaps be more widely applied if not for their associated computational complexity: naïve implementations of GPs require the construction and decomposition of a kernel matrix at cost O(n^3), where n is the number of data. In this work, we consider GP time series (i.e., GPs with one input dimension). In this case, construction of the kernel matrix can be avoided by exploiting the (approximate) Markov structure of the process and re-writing the model as a linear Gaussian state space model, which can then be solved using Kalman filtering (see, e.g., [27]). The Kalman filter costs O(m^3 n), where m is the dimension of the state space. We propose the Infinite-Horizon GP approximation (IHGP), which reduces the cost to O(m^2 n).

As m grows with the number of kernel components in the GP prior, this cost saving can be significant for many GP models where m can reach hundreds. For example, the automatic statistician [6] searches for kernels (on 1D datasets) using sums and products of kernels.
The summing of two kernels results in the concatenation of the state space (sum of the ms) and a product of kernels results in the Kronecker sum of their state spaces (product of ms). This quickly results in very high state dimensions; we show results with a similarly constructed kernel in our experiments.

We are concerned with real-time processing of long (or streaming) time-series with short and long length-scale components, and non-Gaussian noise/likelihood and potential non-stationary structure. We show how the IHGP can be applied in the streaming setting, including efficient estimation of the marginal likelihood and associated gradients, enabling on-line learning of hyper (kernel) parameters. We demonstrate this by applying our approach to a streaming dataset of two million points, as well as providing an implementation of the method on an iPhone, allowing on-line learning of a GP model of the phone's acceleration.

∗This work was undertaken whilst AS was a Visiting Research Fellow with University of Cambridge.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[Figure 1 appears here; axes: output y vs. input t (left), −log p(y | θ) vs. length-scale ℓ (right).]

Figure 1: (Left) GP regression with n = 100 observations and a Matérn covariance function. The IHGP is close to exact far from boundaries, where the constant marginal variance assumption shows. (Right) Hyperparameters θ = (σ²_n, σ², ℓ) optimised independently for both models.

For data where a Gaussian noise assumption may not be appropriate, many approaches have been proposed for approximation (see, e.g., [21] for an overview). Here we show how to combine Assumed Density Filtering (ADF, a.k.a.
single-sweep Expectation Propagation, EP [5, 12, 19]) with the IHGP. We are motivated by the application to Log-Gaussian Cox Processes (LGCP, [20]). Usually the LGCP model uses binning to avoid a doubly-intractable model; in this case it is desirable to have more bins in order to capture short-lengthscale effects, leading to more time points. Additionally, the desire to capture long- and short-term effects means that the state space dimension m can be large. We show that our approach is effective on standard benchmarks (coal-mining disasters) as well as a much larger dataset (airline accidents).

The structure of the paper is as follows. Sec. 2 covers the necessary background and notation related to GPs and state space solutions. Sec. 3 leverages the idea of steady-state filtering to derive IHGP. Sec. 4 illustrates the approach on several problems, and the supplementary material contains additional examples and a nomenclature for easier reading. Code implementations in MATLAB/C++/Objective-C and video examples of real-time operation are available at https://github.com/AaltoML/IHGP.

2 Background

We are concerned with GP models [25] admitting the form: f(t) ∼ GP(µ(t), κ(t, t′)) and y | f ∼ ∏_{i=1}^n p(y_i | f(t_i)), where the data D = {(t_i, y_i)}_{i=1}^n are input–output pairs, µ(t) the mean function, and κ(t, t′) the covariance function of the GP prior. The likelihood factorizes over the observations. This family covers many standard modelling problems, including regression and classification tasks. Without loss of generality, we present the methodology for zero-mean (µ(t) := 0) GP priors. We approximate posteriors of the form (see [24] for an overview):

q(f | D) = N(f | Kα, (K⁻¹ + W)⁻¹),    (1)

where K_{i,j} = κ(t_i, t_j) is the prior covariance matrix, α ∈ Rⁿ, and the (likelihood precision) matrix is diagonal, W = diag(w).
Elements of w ∈ Rⁿ are non-negative for log-concave likelihoods. The predictive mean and marginal variance for a test input t∗ are µ_{f,∗} = k∗ᵀ α and σ²_{f,∗} = k∗∗ − k∗ᵀ (K + W⁻¹)⁻¹ k∗. A probabilistic way of learning the hyperparameters θ of the covariance function (such as magnitude and scale) and the likelihood model (such as noise scale) is by maximizing the (log) marginal likelihood function p(y | θ) [25].

Numerous methods have been proposed for dealing with the prohibitive computational complexity of the matrix inverse in dealing with the latent function in Eq. (1). While general-purpose methods such as inducing inputs [4, 23, 30, 33], basis function projections [11, 17, 32], interpolation approaches [37], or stochastic approximations [10, 14] do not pose restrictions on the input dimensionality, they scale poorly in long time-series models by still needing to fill the extending domain (see discussion in [3]). For certain problems tree-structured approximations [3] or band-structured matrices can be leveraged. However, [8, 22, 26, 29] have shown that for one-dimensional GPs with high-order Markovian structure, an optimal representation (without approximations) is rewriting the GP in terms of a state space model and solving inference in linear time by sequential Kalman filtering methods. We will therefore focus on building upon the state space methodology.

2.1 State space GPs

In one-dimensional GPs (time-series) the data points feature the special property of having a natural ordering. If the GP prior itself admits a Markovian structure, the GP model can be reformulated as a state space model.
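To make this reformulation concrete, the following is a minimal sketch (in Python/NumPy, not the paper's MATLAB/C++ implementation) of the Matérn (ν = 3/2) state space model discussed in Appendix B, together with the discretisation A_i = exp(F Δt_i) and Q_i = P∞ − A_i P∞ A_iᵀ used in the stationary case below; the parametrisation follows the standard Matérn (ν = 3/2) results and should be checked against Appendix B.

```python
import numpy as np
from scipy.linalg import expm

def matern32_ss(magnitude=1.0, lengthscale=1.0):
    """Standard state space form of the Matern (nu = 3/2) covariance:
    returns feedback F, noise effect L, spectral density Qc, measurement
    vector h, and stationary state covariance Pinf."""
    lam = np.sqrt(3.0) / lengthscale
    F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])   # feedback matrix
    L = np.array([[0.0], [1.0]])                        # noise effect matrix
    Qc = np.array([[4.0 * lam**3 * magnitude]])         # white-noise spectral density
    h = np.array([1.0, 0.0])                            # measurement vector
    Pinf = np.array([[magnitude, 0.0],
                     [0.0, lam**2 * magnitude]])        # stationary covariance
    return F, L, Qc, h, Pinf

def discretize(F, Pinf, dt):
    """Discrete-time transition A = expm(F dt) and process noise
    Q = Pinf - A Pinf A^T (valid for stationary covariance functions)."""
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T
    return A, Q
```

A quick sanity check of such a construction is that hᵀ A(τ) P∞ h reproduces the Matérn (ν = 3/2) kernel value σ²(1 + λτ) e^{−λτ} at lag τ.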
Recent work has focused on showing how many widely used covariance functions can be either exactly (e.g., the half-integer Matérn class, polynomial, noise, constant) or approximately (e.g., the squared-exponential/RBF, rational quadratic, periodic, etc.) converted into state space models. In continuous time, a simple dynamical system able to represent these covariance functions is given by the following linear time-invariant stochastic differential equation (see [28]):

ḟ(t) = F f(t) + L w(t),    y_i ∼ p(y_i | hᵀ f(t_i)),    (2)

where w(t) is an s-dimensional white noise process, and F ∈ R^{m×m}, L ∈ R^{m×s}, h ∈ R^{m×1} are the feedback, noise effect, and measurement matrices, respectively. The driving process w(t) ∈ Rˢ is a multivariate white noise process with spectral density matrix Q_c ∈ R^{s×s}. The initial state is distributed according to f₀ ∼ N(0, P₀). For discrete input values t_i, this translates into

f_i ∼ N(A_{i−1} f_{i−1}, Q_{i−1}),    y_i ∼ p(y_i | hᵀ f_i),    (3)

with f₀ ∼ N(0, P₀). The discrete-time dynamical model is solved through a matrix exponential A_i = exp(F Δt_i), where Δt_i = t_{i+1} − t_i ≥ 0. For stationary covariance functions, κ(t, t′) = κ(t − t′), the process noise covariance is given by Q_i = P∞ − A_i P∞ A_iᵀ. The stationary state (corresponding to the initial state P₀) is distributed by f∞ ∼ N(0, P∞) and the stationary covariance can be found by solving the Lyapunov equation Ṗ∞ = F P∞ + P∞ Fᵀ + L Q_c Lᵀ = 0. Appendix B shows an example of representing the Matérn (ν = 3/2) covariance function as a state space model. Other covariance functions have been listed in [31].

2.2 Bayesian filtering

The closed-form solution to the linear Bayesian filtering problem—Eq.
(3) with a Gaussian likelihood N(y_i | hᵀ f_i, σ²_n)—is known as the Kalman filter [27]. The interest is in the following marginal distributions: p(f_i | y_{1:i−1}) = N(f_i | m^p_i, P^p_i) (predictive distribution), p(f_i | y_{1:i}) = N(f_i | m^f_i, P^f_i) (filtering distribution), and p(y_i | y_{1:i−1}) = N(y_i | v_i, s_i) (decomposed marginal likelihood). The predictive state mean and covariance are given by m^p_i = A_i m^f_{i−1} and P^p_i = A_i P^f_{i−1} A_iᵀ + Q_i. The so-called 'innovation' mean and variances v_i and s_i are

v_i = y_i − hᵀ m^p_i   and   s_i = hᵀ P^p_i h + σ²_n.    (4)

The log marginal likelihood can be evaluated during the filter update steps by log p(y) = −∑_{i=1}^n ½ (log 2πs_i + v_i²/s_i). The filter mean and covariances are given by

k_i = P^p_i h / s_i,   m^f_i = m^p_i + k_i v_i,   P^f_i = P^p_i − k_i hᵀ P^p_i,    (5)

where k_i ∈ Rᵐ represents the filter gain term. In batch inference, we are actually interested in the so-called smoothing solution, p(f | D), corresponding to marginals p(f_i | y_{1:n}) = N(f_i | m^s_i, P^s_i). The smoother mean and covariance are solved by the backward recursion, from i = n − 1 backwards to 1:

m^s_i = m^f_i + G_i (m^s_{i+1} − m^p_{i+1}),   P^s_i = P^f_i + G_i (P^s_{i+1} − P^p_{i+1}) G_iᵀ,    (6)

where G_i = P^f_i A_iᵀ [P^p_{i+1}]⁻¹ is the smoother gain at t_i. The computational complexity is clearly linear in the number of data n (recursion repetitions), and cubic in the state dimension m due to matrix–matrix multiplications, and the matrix inverse in the calculation of G_i.

3 Infinite-horizon Gaussian processes

We now tackle the cubic computational complexity in the state dimensionality by seeking infinite-horizon approximations to the Gaussian process. In Sec.
3.1 we revisit traditional steady-state Kalman filtering (for Gaussian likelihood, equidistant data) from quadratic filter design (see, e.g., [18] and [7] for an introduction), and extend it to provide approximations to the marginal likelihood and its gradients. Finally, we present an infinite-horizon framework for non-Gaussian likelihoods.

[Figure 2 appears here; axes: elements of diag(P^p) (left) and gain k (right) vs. likelihood variance γ over 10⁻²–10³.]

Figure 2: (Left) Interpolation of P^p (dots solved, solid interpolated). The dashed lines show elements in P∞ (prior stationary state covariance). (Right) The Kalman gain k evaluated for the P^p s.

3.1 Steady-state Kalman filter for t → ∞

In steady-state Kalman filtering (see [7], Ch. 8.4, or [1], Ch. 4, for the traditional perspective) we assume t ≫ ℓ_eff, where ℓ_eff is the longest time scale in the covariance function, and equidistant observations in time (A_i := A and Q_i := Q). After several ℓ_eff (as t → ∞), the filter gain converges to the stationary limiting Kalman filter gain k. The resulting filter becomes time-invariant, which introduces approximation errors near the boundaries (cf. Fig. 1).

In practice, we seek a stationary filter state covariance (corresponding to the stationary Kalman gain) P̂^f. Solving for this matrix thus corresponds to seeking a covariance that is equal between two consecutive filter recursions. Directly from the Kalman filtering forward prediction and update steps (in Eq.
5), we recover the recursion (by dropping dependency on the time step):

P̂^p = A P̂^p Aᵀ − A P̂^p h (hᵀ P̂^p h + σ²_n)⁻¹ hᵀ P̂^p Aᵀ + Q.    (7)

This equation is of the form of a discrete algebraic Riccati equation (DARE, see, e.g., [15]), which is a type of nonlinear matrix equation that often arises in the context of infinite-horizon optimal control problems. Since σ²_n > 0, Q is P.S.D., and the associated state space model is both stabilizable and observable, the DARE has a unique stabilising solution for P̂^p that can be found either by iterating the Riccati equation or by matrix decompositions. The Schur method by Laub [16] solves the DARE in O(m³), is numerically stable, and widely available in matrix libraries (Python scipy.linalg.solve_discrete_are, MATLAB Control System Toolbox DARE, see also SLICOT routine SB02OD).

The corresponding stationary gain is k = P̂^p h / (hᵀ P̂^p h + σ²_n). Re-deriving the filter recursion with the stationary gain gives a simplified iteration for the filter mean (the covariance is now time-invariant):

m̂^f_i = (A − k hᵀA) m̂^f_{i−1} + k y_i   and   P̂^f = P̂^p − k hᵀ P̂^p,    (8)

for all i = 1, 2, . . . , n. This recursive iteration has a computational cost associated with one m × m matrix–vector multiplication, so the overall computational cost for the forward iteration is O(n m²) (as opposed to the O(n m³) in the Kalman filter).

Marginal likelihood evaluation: The approximative log marginal likelihood comes out as a by-product of the filter forward recursion: log p(y) ≈ −(n/2) log 2πŝ − ∑_{i=1}^n v̂_i²/(2ŝ), where the stationary innovation covariance is given by ŝ = hᵀ P̂^p h + σ²_n and the innovation mean by v̂_i = y_i − hᵀA m̂^f_{i−1}.

Steady-state backward pass: To obtain the complete infinite-horizon solution, we formally derive the solution corresponding to the smoothing distribution p(f_i | y_{1:n}) ≈ N(f_i | m̂^s_i, P̂^s), where P̂^s is the stationary state covariance. Establishing the backward recursion does not require taking any additional limits, as the smoother gain is only a function of consecutive filtering steps. Re-deriving the backward pass in Equation (6) gives the time-invariant smoother gain and posterior state covariance

G = P̂^f Aᵀ [A P̂^f Aᵀ + Q]⁻¹   and   P̂^s = G P̂^s Gᵀ + P̂^f − G (A P̂^f Aᵀ + Q) Gᵀ,    (9)

where P̂^s is implicitly defined in terms of the solution to a DARE. The backward iteration for the state mean is m̂^s_i = m̂^f_i + G (m̂^s_{i+1} − A m̂^f_i). Even this recursion scales as O(n m²).

Algorithm 1 Infinite-horizon Gaussian process (IHGP) inference. The GP prior is specified in terms of a state space model.
After the setup cost on line 2, all operations are at most O(m²).

1:  Input: {y_i}, {A, Q, h, P₀}, p(y | f)    ▷ targets, model, likelihood
2:  Set up P^p(γ), P^s(γ), and G(γ) for γ_{1:K}    ▷ solve DAREs for a set of likelihood variances, cost O(K m³)
3:  m^f_0 ← 0; P^p_0 ← P₀; γ₀ = ∞    ▷ initialize
4:  for i = 1 to n do
5:    Evaluate P^p_i ← P^p(γ_{i−1})    ▷ find predictive covariance
6:    µ̃_{f,i} ← hᵀA m^f_{i−1}; σ̃²_{f,i} = hᵀP^p_i h    ▷ latent
7:    if Gaussian likelihood then
8:      η_i ← y_i; γ_i ← σ²_{n,i}    ▷ if σ²_{n,i} := σ²_n, k_i and P^f_i become time-invariant
9:    else
10:     Match exp(ν_i f_i − τ_i f_i²/2) N(f_i | µ̃_{f,i}, σ̃²_{f,i}) mom= p(y_i | f_i) N(f_i | µ̃_{f,i}, σ̃²_{f,i})    ▷ match moments
11:     γ_i ← τ_i⁻¹; η_i ← ν_i/τ_i    ▷ equivalent update
12:   end if
13:   k_i ← P^p_i h/(σ̃²_{f,i} + γ_i)    ▷ gain
14:   m^f_i ← (A − k_i hᵀA) m^f_{i−1} + k_i η_i; P^f_i ← P^p_i − k_i γ_i k_iᵀ    ▷ mean and covariance
15: end for
16: m^s_n ← m^f_n; P^s_n ← P^s(γ_n)    ▷ initialize backward pass
17: for i = n − 1 to 1 do
18:   m^s_i ← m^f_i + G(γ_i) (m^s_{i+1} − A m^f_i); P^s_i ← P^s(γ_i)    ▷ mean and covariance
19: end for
20: Return: µ_{f,i} = hᵀm^s_i, σ²_{f,i} = hᵀP^s_i h, log p(y)    ▷ mean, variance, evidence

3.2 Infinite-horizon GPs for general likelihoods

In IHGP, instead of using the true predictive covariance for propagation, we use the one obtained from the stationary state of a system with measurement noise fixed to the current measurement noise and regular spacing. The Kalman filter iterations can be used in solving approximate posteriors for models with general likelihoods in the form of Eq.
(1) by manipulating the innovation v_i and s_i (see [22]). We derive a generalization of the steady-state iteration allowing for time-dependent measurement noise and non-Gaussian likelihoods.

We re-formulate the DARE in Eq. (7) as an implicit function P̂^p : R₊ → R^{m×m} of the likelihood variance, 'measurement noise', γ ∈ R₊:

P^p(γ) = A P^p(γ) Aᵀ − A P^p(γ) h (hᵀP^p(γ) h + γ)⁻¹ hᵀP^p(γ) Aᵀ + Q.    (10)

The elements in P^p are smooth functions in γ, and we set up an interpolation scheme—inspired by Wilson and Nickisch [37] who use cubic convolutional interpolation [13] in their KISS-GP framework—over a log-spaced one-dimensional grid of K points in γ for evaluation of P̂^p(γ). Fig. 2 shows results of K = 32 grid points (as dots) over γ = 10⁻², . . . , 10³ (this grid is used throughout the experiments). In the limit of γ → ∞ the measurement has no effect, and the predictive covariance returns to the stationary covariance of the GP prior (dashed). Similarly, the corresponding gain terms k show the gains going to zero in the same limit. We set up a similar interpolation scheme for evaluating G(γ) and P^s(γ) following Eq. (9). Now, solving the DAREs and the smoother gain has been replaced by computationally cheap (one-dimensional) kernel interpolation.

Alg. 1 presents the recursion in IHGP inference by considering a locally steady-state GP model derived from the previous section. As can be seen in Sec. 3.1, the predictive state on step i only depends on γ_{i−1}. For non-Gaussian inference we set up an EP [5, 12, 19] scheme which only requires one forward pass (assumed density filtering, see also unscented filtering [27]), and is thus well suited for streaming applications.
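A minimal sketch of this setup step (line 2 of Alg. 1) might look as follows; it assumes the SciPy DARE solver mentioned in Sec. 3.1 (applied via the standard filtering-to-control duality, i.e., passing Aᵀ as the state matrix) and ordinary cubic spline interpolation in log γ rather than the cubic convolutional interpolation of [13].

```python
import numpy as np
from scipy.linalg import solve_discrete_are
from scipy.interpolate import interp1d

def build_dare_interpolant(A, Q, h, K=32, gamma_range=(1e-2, 1e3)):
    """Solve the DARE of Eq. (10) on a log-spaced grid of K likelihood
    variances gamma (one O(m^3) solve each) and return a cheap cubic
    interpolant gamma -> Pp(gamma)."""
    gammas = np.logspace(np.log10(gamma_range[0]), np.log10(gamma_range[1]), K)
    b = h.reshape(-1, 1)
    # Filtering Riccati equation via duality: pass A^T as the state matrix.
    Pps = np.stack([solve_discrete_are(A.T, b, Q, np.array([[g]]))
                    for g in gammas])                      # shape (K, m, m)
    interp = interp1d(np.log(gammas), Pps, axis=0, kind='cubic')
    return lambda g: interp(np.log(g))

def stationary_gain(Pp, h, gamma):
    """Stationary Kalman gain k = Pp h / (h^T Pp h + gamma)."""
    return Pp @ h / (h @ Pp @ h + gamma)
```

At run time, each step of Alg. 1 then only evaluates the interpolant (and the analogous ones for G(γ) and P^s(γ)) instead of solving a fresh O(m³) Riccati equation.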
We match the first two moments of p(y_i | f_i) and exp(ν_i f_i − τ_i f_i²/2) w.r.t. the latent values N(f_i | µ̃_{f,i}, σ̃²_{f,i}) (denoted by • mom= •, implemented by quadrature). The steps of the backward pass are also only dependent on the local steady-state model, and are thus evaluated in terms of the γ_i.

Missing observations correspond to γ_i = ∞, and the model could be generalized to non-equidistant time sampling by the scheme in Nickisch et al. [22] for calculating A(Δt_i) and Q(Δt_i).

Table 1: Mean absolute error of IHGP w.r.t. SS, negative log-likelihoods, and running times. Mean over 10 repetitions reported; n = 1000.

                 Regression   Count data   Classification
Likelihood       Gaussian     Poisson      Logit     Probit
MAE E[f(t∗)]     0.0095       0.0415       0.0741    0.0351
MAE V[f(t∗)]     0.0008       0.0024       0.0115    0.0079
NLL-FULL         1452.5       2645.5       618.9     614.4
NLL-SS           1452.5       2693.5       617.5     613.9
NLL-IHGP         1456.0       2699.3       625.1     618.2
t_full           0.18 s       6.17 s       11.78 s   9.93 s
t_ss             0.04 s       0.13 s       0.13 s    0.11 s
t_IHGP           0.01 s       0.14 s       0.13 s    0.10 s

[Figure 3 appears here; running time (seconds) vs. state dimensionality m for the state space method and IHGP.]

Figure 3: Empirical running time comparison for GP regression on n = 10,000 data points. Maximum RMSE in IHGP E[f(t∗)] < 0.001.

3.3 Online hyperparameter estimation

Even though IHGP can be used in a batch setting, it is especially well suited for continuous data streams. In such applications, it is not practical to require several iterations over the data for optimising the hyperparameters—as new data would arrive before the optimisation terminates.
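Since IHGP evaluates log p(y | θ) and its gradient in a single forward pass, each newly arrived mini-batch can instead trigger one incremental gradient ascent step on the hyperparameters. A minimal sketch, with `log_marginal_and_grad` a hypothetical stand-in for the IHGP forward recursion of Sec. 3.1:

```python
import numpy as np

def online_hyperparameter_step(theta, batch, log_marginal_and_grad, lr=1e-3):
    """One incremental gradient ascent step on the (approximate) log
    marginal likelihood of the newest mini-batch. `log_marginal_and_grad`
    is a placeholder returning (log p(batch | theta), gradient)."""
    _, grad = log_marginal_and_grad(batch, theta)
    return theta + lr * grad
```

Repeated over the stream, this follows a slowly moving optimum rather than converging to a fixed point, which is exactly what a non-stationary stream calls for.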
We propose a practical extension of IHGP for online estimation of hyperparameters θ by leveraging that (i) new batches of data are guaranteed to be available from the stream, (ii) IHGP only requires seeing each data point once for evaluating the marginal likelihood and its gradient, and (iii) data can be non-stationary, requiring the hyperparameters to adapt.

We formulate the hyperparameter optimisation problem as an incremental gradient descent (e.g., [2]) resembling stochastic gradient descent, but without the assumption of finding a stationary optimum. Starting from some initial set of hyperparameters θ₀, for each new (mini) batch j of data y^(j) in a window of size n_mb, iterate

θ_j = θ_{j−1} + η ∇ log p(y^(j) | θ_{j−1}),    (11)

where η is a learning-rate (step-size) parameter, and the gradient of the marginal likelihood is evaluated by the IHGP forward recursion. In a vanilla GP the windowing would introduce boundary effects due to growing marginal variance towards the boundaries, while in IHGP no edge effects are present as the data stream is seen to continue beyond any boundaries (cf. Fig. 1).

4 Experiments

We provide extensive evaluation of the IHGP both in terms of simulated benchmarks and four real-world experiments in batch and online modes.

[Figure 4 appears here; two panels of accident intensity λ(t) over years 1860–1960: (a) Intensity (EP vs. ADF), (b) Intensity (state space vs. infinite-horizon).]

Figure 4: A small-scale comparison study on the coal mining accident data (191 accidents in n = 200 bins).
The data set is sufficiently small that full EP with naïve handling of the latent function can be conducted. Full EP is shown to work similarly to ADF (single-sweep EP) by state space modelling. We then compare ADF on state space (exact handling of the latent function) to ADF with the IHGP.

[Figure 5 appears here; the panels decompose the intensity into components log λ(t) = f_trend(t) + f_year(t) + f_week(t): the trend over years 1920–2010, the time-of-year effect (Jan–Dec), and the day-of-week effect (Sun–Sat).]

Figure 5: Explanatory analysis of the aircraft accident data set (1210 accidents predicted in n = 35,959 daily bins) between years 1919–2018 by a log-Gaussian Cox process (Poisson likelihood).

4.1 Experimental validation

In the toy examples, the data were simulated from y_i = sinc(x_i − 6) + ε_i, ε_i ∼ N(0, 0.1) (see Fig. 1 for a visualization). The same function with thresholding was used in the classification examples in the Appendix. Table 1 shows comparisons for different log-concave likelihoods over a simulated data set with n = 1000. Example functions can be seen in Fig. 1 and Appendix E. The results are shown for a Matérn (ν = 3/2) with a full GP (naïve handling of latent, full EP as in [24]), state space (SS, exact state space model, ADF as in [22]), and IHGP. With m only 2, IHGP is not faster than SS, but approximation errors remain small. Fig.
3 shows experimental results for the computational benefits in a regression study, with state dimensionality m = 2, . . . , 100. Experiments run in Mathworks MATLAB (R2017b) on an Apple MacBook Pro (2.3 GHz Intel Core i5, 16 GB RAM). Both methods have linear time complexity in the number of data points, so the number of data points is fixed to n = 10,000. The GP prior is set up as an increasing-length sum of Matérn (ν = 3/2) kernels with different characteristic length-scales. The state space scheme follows O(m³) and IHGP is O(m²).

4.2 Log-Gaussian Cox processes

A log-Gaussian Cox process is an inhomogeneous Poisson process model for count data. The unknown intensity function λ(t) is modelled with a log-Gaussian process such that f(t) = log λ(t). The likelihood of the unknown function f is p({t_j} | f) = exp(−∫ exp(f(t)) dt + ∑_{j=1}^N f(t_j)). The likelihood requires non-trivial integration over the exponentiated GP, and thus instead the standard approach [20] is to consider locally constant intensity in subregions by discretising the interval into bins. This approximation corresponds to having a Poisson model for each bin. The likelihood becomes p({t_j} | f) ≈ ∏_{i=1}^n Poisson(y_i({t_j}) | exp(f(t̂_i))), where t̂_i is the bin coordinate and y_i the number of data points in it. This model reaches posterior consistency in the limit of the bin width going to zero [34]. Thus it is expected that the accuracy improves with tighter binning.

Coal mining disasters dataset: The data (available, e.g., in [35]) contain the dates of 191 coal mine explosions that killed ten or more people in Britain between years 1851–1962, which we discretize into n = 200 bins. We use a GP prior with a Matérn (ν = 5/2) covariance function that has an exact state space representation (state dimensionality m = 3) and thus no approximations regarding handling the latent are required.
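The discretisation step above (event times to Poisson counts) is straightforward; a sketch of a possible `bin_events` helper (hypothetical, not from the paper's codebase):

```python
import numpy as np

def bin_events(event_times, n_bins, t_min, t_max):
    """Discretise a point process into equal-width bins: returns bin
    centres t_hat_i and event counts y_i, i.e., the observations of the
    binned Poisson model described in the text."""
    edges = np.linspace(t_min, t_max, n_bins + 1)
    counts, _ = np.histogram(event_times, bins=edges)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, counts
```

The resulting (centres, counts) pairs are then fed to IHGP with a Poisson likelihood; tighter binning (larger n_bins) improves the approximation at the cost of more time points, which is exactly the regime the O(m²) per-point cost is aimed at.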
We optimise the characteristic length-scale and magnitude hyperparameters w.r.t. marginal likelihood in each model. Fig. 4 shows that full EP and state space ADF produce almost equivalent results, and IHGP ADF and state space ADF produce similar results. In IHGP the edge effects are clear around 1850–1860.

[Figure 6 appears here; four screenshots: (a) Holding in hand, (b) Shake, (c) Swinging, (d) On table.]

Figure 6: Screenshots of online adaptive IHGP running in real-time on an iPhone. The lower plot shows current hyperparameters (measurement noise is fixed to σ²_n = 1 for easier visualization) of the prior covariance function, with a trail of previous hyperparameters. The top part shows the last 2 seconds of accelerometer data (red), the GP mean, and 95% quantiles. The refresh rate for updating the hyperparameters and re-prediction is 10 Hz. Video examples are in the supplementary material.

Airline accident dataset: As a more challenging regression problem we explain the time-dependent intensity of accidents and incidents of commercial aircraft. The data [22] consists of dates of 1210 incidents over the time-span of years 1919–2017. We use a bin width of one day and start from year 1900 to ensure no edge effects (n = 43,099), and a prior covariance function (similar to [6, 36])

κ(t, t′) = κ_Mat.^{ν=5/2}(t, t′) + κ_per^{1 year}(t, t′) κ_Mat.^{ν=3/2}(t, t′) + κ_per^{1 week}(t, t′) κ_Mat.^{ν=3/2}(t, t′)    (12)

capturing a trend, time-of-year variation (with decay), and day-of-week variation (with decay). This model has a state space representation of dimension m = 3 + 28 + 28 = 59. All hyperparameters (except time periods) were optimised w.r.t. marginal likelihood. Fig.
5 shows that we reproduce the time-of-year results from [22] and additionally recover a high-frequency time-of-week effect.

4.3 Electricity consumption

We do explorative analysis of electricity consumption for one household [9] recorded every minute (in log kW) over 1,442 days (n = 2,075,259, with 25,979 missing observations). We assign the model a GP prior with a covariance function accounting for slow variation and daily periodicity (with decay). We fit a GP to the entire data with 2M data points by optimising the hyperparameters w.r.t. marginal likelihood (results shown in Appendix F) using BFGS. Total running time 624 s.

The data is, however, inherently non-stationary due to the long time-horizon, where use of electricity has varied. We therefore also run IHGP online in a rolling window of 10 days (n_mb = 14,400, η = 0.001, window step size of 1 hr) and learn the hyperparameters online during the 34,348 incremental gradient steps (evaluation time per step 0.26±0.05 s). This leads to a non-stationary adaptive GP model which, e.g., learns to dampen the periodic component when the house is left vacant for days. Results are shown in Appendix F in the supplement.

4.4 Real-time GPs for adaptive model fitting

In the final experiment we implement the IHGP in C++ with wrappers in Objective-C for running as an app on an Apple iPhone 6s (iOS 11.3). We use the phone accelerometer x channel (sampled at 100 Hz) as an input and fit a GP to a window of 2 s with Gaussian likelihood and a Matérn (ν = 3/2) prior covariance function. We fix the measurement noise to σ²_n = 1 and use separate learning rates η = (0.1, 0.01) in online estimation of the magnitude scale and length-scale hyperparameters. The GP is re-estimated every 0.1 s. Fig. 6 shows examples of various modes of data and how the GP has adapted to it.
A video of the app in action is included in the web material together with the code.

5 Discussion and conclusion

We have presented Infinite-Horizon GPs, a novel approximation scheme for state space Gaussian processes, which reduces the time-complexity to O(m²n). There is a clear intuition to the approximation: as is widely known, in GP regression the posterior marginal variance only depends on the distance between observations and the likelihood variance. If both of these are fixed, and t is larger than the largest length-scale in the prior, the posterior marginal variance reaches a stationary state. The intuition behind IHGP is that for every time instance, we adapt to the current likelihood variance, discard the Markov trail, and start over by adapting to the current steady-state marginal posterior distribution.

This approximation scheme is especially important for long data (number of data points in the thousands to millions), for streaming data (n growing without limit), and/or when the GP prior has several components (m large). We showed examples of regression, count data, and classification tasks, and showed how IHGP can be used in interpreting non-stationary data streams both off-line (Sec. 4.3) and on-line (Sec. 4.4).

Acknowledgments

We thank the anonymous reviewers as well as Mark Rowland and Will Tebbutt for their comments on the manuscript. AS acknowledges funding from the Academy of Finland (grant number 308640).

References

[1] B. D. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, NJ, 1979.

[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Cambridge, MA, 1999.

[3] T. D. Bui and R. E. Turner. Tree-structured Gaussian process approximations. In Advances in Neural Information Processing Systems (NIPS), pages 2213–2221, 2014.

[4] T. D. Bui, J. Yan, and R. E. Turner.
A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research (JMLR), 18(104):1–72, 2017.

[5] L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.

[6] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning (ICML), volume 28 of PMLR, pages 1166–1174, 2013.

[7] F. Gustafsson. Adaptive Filtering and Change Detection. John Wiley & Sons, 2000.

[8] J. Hartikainen and S. Särkkä. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 379–384, 2010.

[9] G. Hébrail and A. Bérard. Individual household electric power consumption data set, 2012. URL https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. Online: UCI Machine Learning Repository.

[10] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), pages 282–290. AUAI Press, 2013.

[11] J. Hensman, N. Durrande, and A. Solin. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research (JMLR), 18(151):1–52, 2018.

[12] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), pages 216–223. Morgan Kaufmann Publishers Inc., 2002.

[13] R. G. Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1153–1160, 1981.

[14] K. Krauth, E. V. Bonilla, K. Cutajar, and M. Filippone.
AutoGP: Exploring the capabilities and limitations of Gaussian process models. In Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2017.

[15] P. Lancaster and L. Rodman. Algebraic Riccati Equations. Clarendon Press, 1995.

[16] A. Laub. A Schur method for solving algebraic Riccati equations. IEEE Transactions on Automatic Control, 24(6):913–921, 1979.

[17] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research (JMLR), 11:1865–1881, Jun 2010.

[18] P. S. Maybeck. Stochastic Models, Estimation and Control, volume 1. Academic Press, New York, 1979.

[19] T. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence (UAI), volume 17, pages 362–369, 2001.

[20] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.

[21] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research (JMLR), 9(10):2035–2078, 2008.

[22] H. Nickisch, A. Solin, and A. Grigorievskiy. State space Gaussian processes with non-Gaussian likelihood. In International Conference on Machine Learning (ICML), volume 80 of PMLR, pages 3789–3798, 2018.

[23] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6(Dec):1939–1959, 2005.

[24] C. E. Rasmussen and H. Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research (JMLR), 11:3011–3015, 2010. Software package: http://www.gaussianprocess.org/gpml/code.

[25] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[26] S.
Reece and S. Roberts. An introduction to Gaussian processes for the Kalman filter expert. In Proceedings of the 13th Conference on Information Fusion (FUSION). IEEE, 2010.

[27] S. Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

[28] S. Särkkä and A. Solin. Applied Stochastic Differential Equations. Cambridge University Press, Cambridge, in press.

[29] S. Särkkä, A. Solin, and J. Hartikainen. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Processing Magazine, 30(4):51–61, 2013.

[30] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), pages 1257–1264. Curran Associates, Inc., 2006.

[31] A. Solin. Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression. Doctoral dissertation, Aalto University, Helsinki, Finland, 2016.

[32] A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. arXiv preprint arXiv:1401.5508, 2014.

[33] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5 of PMLR, pages 567–574, 2009.

[34] S. T. Tokdar and J. K. Ghosh. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.

[35] J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research (JMLR), 14(Apr):1175–1179, 2013. Software package: http://research.cs.aalto.fi/pml/software/gpstuff.

[36] A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation.
In International Conference on Machine Learning (ICML), volume 28 of PMLR, pages 1067–1075, 2013.

[37] A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning (ICML), volume 37 of PMLR, pages 1775–1784, 2015.