{"title": "Learning Stationary Time Series using Gaussian Processes with Nonparametric Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 3501, "page_last": 3509, "abstract": "We introduce the Gaussian Process Convolution Model (GPCM), a two-stage nonparametric generative procedure to model stationary signals as the convolution between a continuous-time white-noise process and a continuous-time linear filter drawn from a Gaussian process. The GPCM is a continuous-time nonparametric-window moving average process and, conditionally, is itself a Gaussian process with a nonparametric kernel defined in a probabilistic fashion. The generative model can be equivalently considered in the frequency domain, where the power spectral density of the signal is specified using a Gaussian process. One of the main contributions of the paper is to develop a novel variational free-energy approach based on inter-domain inducing variables that efficiently learns the continuous-time linear filter and infers the driving white-noise process. In turn, this scheme provides closed-form probabilistic estimates of the covariance kernel and the noise-free signal in both denoising and prediction scenarios. Additionally, the variational inference procedure provides closed-form expressions for the approximate posterior of the spectral density given the observed data, leading to new Bayesian nonparametric approaches to spectrum estimation. The proposed GPCM is validated using synthetic and real-world signals.", "full_text": "Learning Stationary Time Series using Gaussian Processes with Nonparametric Kernels\n\nFelipe Tobar (ftobar@dim.uchile.cl), Center for Mathematical Modeling, Universidad de Chile\nThang D. Bui (tdb40@cam.ac.uk), Department of Engineering, University of Cambridge\nRichard E. 
Turner (ret26@cam.ac.uk), Department of Engineering, University of Cambridge\n\nAbstract\n\nWe introduce the Gaussian Process Convolution Model (GPCM), a two-stage nonparametric generative procedure to model stationary signals as the convolution between a continuous-time white-noise process and a continuous-time linear filter drawn from a Gaussian process. The GPCM is a continuous-time nonparametric-window moving average process and, conditionally, is itself a Gaussian process with a nonparametric kernel defined in a probabilistic fashion. The generative model can be equivalently considered in the frequency domain, where the power spectral density of the signal is specified using a Gaussian process. One of the main contributions of the paper is to develop a novel variational free-energy approach based on inter-domain inducing variables that efficiently learns the continuous-time linear filter and infers the driving white-noise process. In turn, this scheme provides closed-form probabilistic estimates of the covariance kernel and the noise-free signal in both denoising and prediction scenarios. Additionally, the variational inference procedure provides closed-form expressions for the approximate posterior of the spectral density given the observed data, leading to new Bayesian nonparametric approaches to spectrum estimation. The proposed GPCM is validated using synthetic and real-world signals.\n\n1 Introduction\n\nGaussian process (GP) regression models have become a standard tool in Bayesian signal estimation due to their expressiveness, robustness to overfitting and tractability [1]. GP regression begins with a prior distribution over functions that encapsulates a priori assumptions, such as smoothness, stationarity or periodicity. The prior is then updated by incorporating information from observed data points via their likelihood functions. 
The result is a posterior distribution over functions that can be used for prediction. Critically for this work, the posterior, and therefore the resultant predictions, is sensitive to the choice of prior distribution. The form of the prior covariance function (or kernel) of the GP is arguably the central modelling choice. Employing a simple form of covariance will limit the GP's capacity to generalise. The ubiquitous radial basis function or squared exponential kernel, for example, implies that prediction is just a local smoothing operation [2, 3]. Expressive kernels are needed [4, 5], but although kernel design is widely acknowledged as pivotal, it typically proceeds via a "black art" in which a particular functional form is hand-crafted using intuitions about the application domain, building a kernel from simpler primitive kernels (e.g. [6]). Recently, some sophisticated automated approaches to kernel design have been developed that construct kernel mixtures on the basis of incorporating different measures of similarity [7, 8], or more generally by both adding and multiplying kernels, thus mimicking the way in which a human would search for the best kernel [5]. Alternatively, a flexible parametric kernel can be used, as in the case of spectral mixture kernels, where the power spectral density (PSD) of the GP is parametrised by a mixture of Gaussians [4].\n\nWe see two problems with this general approach. The first is that computational tractability limits the complexity of the kernels that can be designed in this way. Such constraints are problematic when searching over kernel combinations and, to a lesser extent, when fitting potentially large numbers of kernel hyperparameters. Indeed, many naturally occurring signals contain more complex structure than can comfortably be entertained using current methods, time series with complex spectra, such as sounds, being a case in point [9, 10]. 
The second limitation is that the hyperparameters of the kernel are typically fit by maximisation of the model marginal likelihood. For complex kernels with large numbers of hyperparameters, this can easily result in overfitting rearing its ugly head once more (see sec. 4.2).\n\nThis paper attempts to remedy the existing limitations of GPs in the time-series setting using the same rationale by which GPs were originally developed. That is, kernels themselves are treated nonparametrically to enable flexible forms whose complexity can grow as more structure is revealed in the data. Moreover, approximate Bayesian inference is used for estimation, thus side-stepping problems with model-structure search and protecting against overfitting. These benefits are achieved by modelling time series as the output of a linear and time-invariant system defined by a convolution between a white-noise process and a continuous-time linear filter. By considering the filter to be drawn from a GP, the expected second-order statistics (and, as a consequence, the spectral density) of the output signal are defined in a nonparametric fashion. The next section presents the proposed model, its relationship to GPs and how to sample from it. In Section 3 we develop an analytic approximate inference method using state-of-the-art variational free-energy approximations for performing inference and learning. Section 4 shows simulations using both synthetic and real-world datasets. Finally, Section 5 presents a discussion of our findings.\n\n2 Regression model: Convolving a linear filter and a white-noise process\n\nWe introduce the Gaussian Process Convolution Model (GPCM), which can be viewed as constructing a distribution over functions f(t) using a two-stage generative model. In the first stage, a continuous filter function h(t) : R → R is drawn from a GP with covariance function K_h(t1, t2). 
In the second stage, the function f(t) is produced by convolving the filter with continuous-time white-noise x(t). The white-noise can be treated informally as a draw from a GP with a delta-function covariance,¹\n\nh(t) ∼ GP(0, K_h(t1, t2)),  x(t) ∼ GP(0, σ²_x δ(t1 − t2)),  f(t) = ∫_R h(t − τ) x(τ) dτ.  (1)\n\nThis family of models can be motivated from several different perspectives due to the ubiquity of continuous-time linear systems.\n\nFirst, the model relates to linear time-invariant (LTI) systems [12]. The process x(t) is the input to the LTI system, the function h(t) is the system's impulse response (which is modelled as a draw from a GP) and f(t) is its output. In this setting, as an LTI system is entirely characterised by its impulse response [12], model design boils down to identifying a suitable function h(t). A second perspective views the model through the lens of differential equations, in which case h(t) can be considered to be the Green's function of a system defined by a linear differential equation that is driven by white-noise. In this way, the prior over h(t) implicitly defines a prior over the coefficients of linear differential equations of potentially infinite order [13]. Third, the GPCM can be thought of as a continuous-time generalisation of the discrete-time moving average process, in which the window is potentially infinite in extent and is produced by a GP prior [14]. A fourth perspective relates the GPCM to standard GP models. Consider the filter h(t) to be known. In this case the process f(t)|h is distributed according to a GP, since f(t) is a linear combination of Gaussian random variables. 
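The two-stage procedure in eq. (1) is straightforward to mimic on a discrete grid, which helps build intuition before the exact treatment developed later. The following is a minimal numpy sketch; the grid spacing, Gaussian window and hyperparameter values are illustrative choices, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretisation of eq. (1) (illustrative grid, not from the paper).
dt = 0.05
t = np.arange(-5.0, 5.0, dt)
n = len(t)

# Stage 1: draw the filter h from a GP whose SE covariance is multiplied by a
# Gaussian window, mimicking the decaying-SE prior introduced in sec. 2.1.
gamma, alpha, sigma_h = 2.0, 0.5, 1.0
T1, T2 = np.meshgrid(t, t, indexing="ij")
K_h = sigma_h**2 * np.exp(-alpha * T1**2 - gamma * (T1 - T2)**2 - alpha * T2**2)
h = rng.multivariate_normal(np.zeros(n), K_h + 1e-8 * np.eye(n))

# Stage 2: discretised white noise; variance sigma_x^2 / dt so that the
# Riemann sum below approximates the stochastic integral.
sigma_x = 1.0
x = rng.normal(0.0, sigma_x / np.sqrt(dt), size=n)

# f(t) = ∫ h(t − τ) x(τ) dτ  ≈  discrete convolution scaled by dt.
f = np.convolve(h, x, mode="same") * dt
```

Exact sampling is of course not possible this way (the continuous-time processes are infinite-dimensional); the paper's own sampling scheme in sec. 2.2 instead works with the conditional kernel.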
The mean function m_{f|h}(t) and covariance function K_{f|h}(t1, t2) of the random variable f|h, t ∈ R, are then stationary and given by m_{f|h}(t) = E[f(t)|h] = ∫_R h(t − τ) E[x(τ)] dτ = 0 and\n\nK_{f|h}(t1, t2) = K_{f|h}(t) = σ²_x ∫_R h(s) h(s + t) ds = σ²_x (h(t) ∗ h(−t))(t),  (2)\n\nwhere t = t1 − t2; that is, the convolution between the filter h(t) and its mirrored version with respect to t = 0 — see sec. 1 of the supplementary material for the full derivation.\n\n¹Here we use informal notation common in the GP literature. A more formal treatment would use stochastic integral notation [11], which replaces the differential element x(τ)dτ = dW(τ), so that eq. (1) becomes a stochastic integral equation (w.r.t. the Brownian motion W).\n\nSince h(t) is itself drawn from a nonparametric prior, the presented model (through the relationship above) induces a prior over nonparametric kernels. A particular case is obtained when h(t) is chosen as the basis expansion of a reproducing kernel Hilbert space [15] with parametric kernel (e.g., the squared exponential kernel), whereby K_{f|h} becomes such a kernel.\n\nA fifth perspective considers the model in the frequency domain rather than the time domain. Here the continuous-time linear filter shapes the spectral content of the input process x(t). As x(t) is white-noise, it has positive PSD at all frequencies, which can potentially influence f(t). 
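The identity in eq. (2) can be checked numerically for a fixed filter: with unit-variance white noise, the conditional kernel is just the autocorrelation of h. A small numpy sketch (the Gaussian filter below is an illustrative choice, not a draw from the prior):

```python
import numpy as np

# Numerical check of eq. (2) with sigma_x = 1: K_{f|h}(t) = ∫ h(s) h(s + t) ds.
dt = 0.01
s = np.arange(-5.0, 5.0, dt)
h = np.exp(-s**2)  # illustrative smooth, decaying filter

# Autocorrelation via np.correlate; lags run over (-(N-1) .. N-1) * dt.
K = np.correlate(h, h, mode="full") * dt
lags = np.arange(-(len(s) - 1), len(s)) * dt

# K is an even function of t = t1 - t2 (stationarity), peaking at zero lag.
mid = len(K) // 2
assert np.allclose(K, K[::-1]) and K.argmax() == mid

# For h(s) = exp(-s^2), K(0) = ∫ exp(-2 s^2) ds = sqrt(pi / 2) ≈ 1.2533.
assert abs(K[mid] - np.sqrt(np.pi / 2)) < 1e-3
```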
More precisely, since the PSD of f|h is given by the Fourier transform of the covariance function (by the Wiener–Khinchin theorem [12]), the model places a nonparametric prior over the PSD, given by F(K_{f|h}(t))(ω) = ∫_R K_{f|h}(t) e^{−jωt} dt = σ²_x |h̃(ω)|², where h̃(ω) = ∫_R h(t) e^{−jωt} dt is the Fourier transform of the filter.\n\nArmed with these different theoretical perspectives on the GPCM generative model, we next focus on how to design appropriate covariance functions for the filter.\n\n2.1 Sensible and tractable priors over the filter function\n\nReal-world signals have finite power (which relates to the stability of the system) and potentially complex spectral content. How can such knowledge be built into the filter covariance function K_h(t1, t2)? To fulfil these conditions, we model the linear filter h(t) as a draw from a squared exponential GP that is multiplied by a Gaussian window (centred on zero) in order to restrict its extent. The resulting decaying squared exponential (DSE) covariance function is given by a squared exponential (SE) covariance pre- and post-multiplied by e^{−αt1²} and e^{−αt2²} respectively, that is,\n\nK_h(t1, t2) = K_DSE(t1, t2) = σ²_h e^{−αt1²} e^{−γ(t1−t2)²} e^{−αt2²},  α, γ, σ_h > 0.  (3)\n\nWith the GP priors for x(t) and h(t), f(t) is zero-mean, stationary and has a variance E[f²(t)] = σ²_x σ²_h √(π/(2α)). Consequently, by Chebyshev's inequality, f(t) is stochastically bounded, that is, Pr(|f(t)| ≥ T) ≤ σ²_x σ²_h √(π/(2α)) T⁻², T ∈ R. Hence, the exponential decay of K_DSE (controlled by α) plays a key role in the finiteness of the integral in eq. (1) — and, consequently, of f(t). Additionally, the DSE model for the filter h(t) provides a flexible prior distribution over linear systems, where the hyperparameters have physical meaning: σ²_h controls the power of the output f(t); 1/√γ is the characteristic timescale over which the filter varies which, in turn, determines the typical frequency content of the system; finally, 1/√α is the temporal extent of the filter, which controls the length of time correlations in the output signal and, equivalently, the bandwidth characteristics in the frequency domain.\n\nAlthough the covariance function is flexible, its Gaussian form facilitates analytic computation that will be leveraged when (approximately) sampling from the DSE-GPCM and performing inference. In principle, it is also possible in the framework that follows to add causal structure into the covariance function so that only causal filters receive non-zero prior probability density, but we leave that extension for future work.\n\n2.2 Sampling from the model\n\nExact sampling from the proposed model in eq. (1) is not possible, since it requires computation of the convolution between the infinite-dimensional processes h(t) and x(t). It is possible to make some analytic progress by considering, instead, the GP formulation of the GPCM in eq. (2) and noting that sampling f(t)|h ∼ GP(0, K_{f|h}) only requires knowledge of K_{f|h} = σ²_x h(t) ∗ h(−t) and therefore avoids explicit representation of the troublesome white-noise process x(t). Further progress requires approximation. The first key insight is that h(t) can be sampled at a finite number of locations h = h(t) = [h(t1), . . . , h(t_{N_h})] using a multivariate Gaussian, and then exact analytic inference can be performed to infer the entire function h(t) (via noiseless GP regression). 
Moreover, since the filter is drawn from the DSE kernel, h(t) ∼ GP(0, K_DSE), it is, with high probability, temporally limited in extent and smoothly varying. Therefore, a relatively small number of samples N_h can potentially enable accurate estimates of h(t). The second key insight is that it is possible, when using the DSE kernel, to analytically compute the expected value of the covariance of f(t)|h, K̄_{f|h} = E[K_{f|h}|h] = σ²_x E[h(t) ∗ h(−t)|h], as well as the uncertainty in this quantity. The more values of the latent process h we consider, the lower the uncertainty in h and, as a consequence, K̄_{f|h} → K_{f|h} almost surely. This is an example of a Bayesian numerical integration method, since the approach maintains knowledge of its own inaccuracy [16].\n\nIn more detail, the kernel approximation K̄_{f|h}(t1, t2) is given by:\n\nE[K_{f|h}(t1, t2)|h] = σ²_x E[∫_R h(t1 − τ) h(t2 − τ) dτ | h] = σ²_x ∫_R E[h(t1 − τ) h(t2 − τ) | h] dτ = σ²_x ∫_R K_DSE(t1 − τ, t2 − τ) dτ + σ²_x Σ_{r,s=1}^{N_h} M_{r,s} ∫_R K_DSE(t1 − τ, t_r) K_DSE(t_s, t2 − τ) dτ,\n\nwhere M_{r,s} is the (r, s)th entry of the matrix (K⁻¹ h hᵀ K⁻¹ − K⁻¹) and K = K_DSE(t, t). The kernel approximation and its Fourier transform, i.e., the PSD, can be calculated in closed form (see sec. 2 of the supplementary material). Fig. 1 illustrates the generative process of the proposed model.\n\nFigure 1: Sampling from the proposed regression model. 
From left to right: filter, kernel, power spectral density and a sample of the output f(·).\n\n3 Inference and learning using variational methods\n\nOne of the main contributions of this paper is to devise a computationally tractable method for learning the filter h(t) (known as system identification in the control community [17]) and inferring the white-noise process x(t) from a noisy dataset y ∈ Rᴺ produced by their convolution and additive Gaussian noise, y(t) = f(t) + ε(t) = ∫_R h(t − τ) x(τ) dτ + ε(t), ε(t) ∼ N(0, σ²_ε). Performing inference and learning is challenging for three reasons. First, the convolution means that each observed datapoint depends on the entire unknown filter and white-noise process, which are infinite-dimensional functions. Second, the model is non-linear in the unknown functions, since the filter and the white-noise multiply one another in the convolution. Third, continuous-time white-noise must be handled with care, since formally it is only well-behaved inside integrals.\n\nWe propose a variational approach that addresses these three problems. First, the convolution is made tractable by using variational inducing variables that summarise the infinite-dimensional latent functions into finite-dimensional inducing points. This is the same approach that is used for scaling GP regression [18]. Second, the product non-linearity is made tractable by using a structured mean-field approximation and leveraging the fact that the posterior is conditionally a GP when x(t) or h(t) is fixed. Third, the direct representation of the white-noise process is avoided by considering a set of inducing variables instead, which are related to x(t) via an integral transformation (so-called inter-domain inducing variables [19]). 
We outline the approach below.\n\nIn order to form the variational inter-domain approximation, we first expand the model with additional variables. We use X to denote the set of all integral transformations of x(t), with members u_x(t) = ∫ w(t, τ) x(τ) dτ (which includes the original white-noise process when w(t, τ) = δ(t − τ)), and identically define the set H with members u_h(t) = ∫ w(t, τ) h(τ) dτ. The variational lower bound of the model evidence can be applied to this augmented model² using Jensen's inequality,\n\nL = log p(y) = log ∫ p(y, H, X) dH dX ≥ ∫ q(H, X) log [p(y, H, X) / q(H, X)] dH dX = F,  (4)\n\nwhere q(H, X) is any variational distribution over the sets of processes X and H. The bound can be written as the difference between the model evidence and the KL divergence between the variational distribution over all integral-transformed processes and the true posterior, F = L − KL[q(H, X)||p(X, H|y)]. The bound is therefore saturated when q(H, X) = p(X, H|y), but this is intractable. Instead, we choose a simpler parameterised form, similar in spirit to that used in the approximate sampling procedure, that allows us to side-step these difficulties.\n\n²This formulation can be made technically rigorous for latent functions [20], but we do not elaborate on that here to simplify the exposition.\n\n
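When the bound is evaluated, its divergence terms reduce to KL divergences between multivariate Gaussians, which are available in closed form. A minimal numpy sketch of that formula (the function name and toy values are illustrative, not from the paper):

```python
import numpy as np

def kl_mvn(mu_q, S_q, mu_p, S_p):
    """Closed-form KL[q || p] between multivariate Gaussians
    q = N(mu_q, S_q) and p = N(mu_p, S_p)."""
    d = len(mu_q)
    S_p_inv = np.linalg.inv(S_p)
    diff = mu_p - mu_q
    _, logdet_p = np.linalg.slogdet(S_p)
    _, logdet_q = np.linalg.slogdet(S_q)
    return 0.5 * (np.trace(S_p_inv @ S_q) + diff @ S_p_inv @ diff
                  - d + logdet_p - logdet_q)

# Sanity checks on toy inputs: the KL vanishes iff q and p coincide.
mu, S = np.zeros(3), np.eye(3)
assert np.isclose(kl_mvn(mu, S, mu, S), 0.0)
assert kl_mvn(mu + 1.0, S, mu, S) > 0.0
```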
In order to construct the variational distribution, we first partition the set X into the original white-noise process x, a finite set of variables u_x called inter-domain inducing points that will be used to parameterise the approximation, and the remaining variables X_{≠x,u_x}, so that X = {x, u_x, X_{≠x,u_x}}. The set H is partitioned identically, H = {h, u_h, H_{≠h,u_h}}. We then choose a variational distribution q(H, X) that mirrors the form of the joint distribution,\n\np(y, H, X) = p(x, X_{≠x,u_x}|u_x) p(h, H_{≠h,u_h}|u_h) p(u_x) p(u_h) p(y|h, x),\nq(H, X) = p(x, X_{≠x,u_x}|u_x) p(h, H_{≠h,u_h}|u_h) q(u_x) q(u_h) = q(H) q(X).\n\nThis is a structured mean-field approximation [21]. The approximating distribution over the inducing points q(u_x) q(u_h) is chosen to be a multivariate Gaussian (the optimal parametric form given the assumed factorisation). Intuitively, the variational approximation implicitly constructs a surrogate GP regression problem, whose posterior q(u_x) q(u_h) induces a predictive distribution that best captures the true posterior distribution, as measured by the KL divergence.\n\nCritically, the resulting bound is now tractable, as we will now show. First, note that the shared prior terms in the joint and the approximation cancel, leading to an elegant form,\n\nF = ∫ q(h, x, u_h, u_x) log [p(y|h, x) p(u_h) p(u_x) / (q(u_h) q(u_x))] dh dx du_h du_x  (5)\n  = E_q[log p(y|h, x)] − KL[q(u_h)||p(u_h)] − KL[q(u_x)||p(u_x)].  (6)\n\nThe last two terms in the bound are simple to compute, being KL divergences between multivariate Gaussians. 
The first term, the average of the log-likelihood terms with respect to the variational distribution, is more complex:\n\nE_q[log p(y|h, x)] = −(N/2) log(2π σ²_ε) − (1/(2σ²_ε)) Σ_{i=1}^{N} E_q[(y(t_i) − ∫_R h(t_i − τ) x(τ) dτ)²].\n\nComputation of the variational bound therefore requires the first and second moments of the convolution under the variational approximation. However, these can be computed analytically for particular choices of covariance function, such as the DSE, by taking the expectations inside the integral (this is analogous to variational inference for the Gaussian Process Latent Variable Model [22]). For example, the first moment of the convolution is\n\nE_q[∫_R h(t_i − τ) x(τ) dτ] = ∫_R E_{q(h,u_h)}[h(t_i − τ)] E_{q(x,u_x)}[x(τ)] dτ,  (7)\n\nwhere the expectations take the form of the predictive mean in GP regression, E_{q(h,u_h)}[h(t_i − τ)] = K_{h,u_h}(t_i − τ) K⁻¹_{u_h,u_h} μ_{u_h} and E_{q(x,u_x)}[x(τ)] = K_{x,u_x}(τ) K⁻¹_{u_x,u_x} μ_{u_x}, where {K_{h,u_h}, K_{u_h,u_h}, K_{x,u_x}, K_{u_x,u_x}} are the covariance functions and {μ_{u_h}, μ_{u_x}} are the means of the approximate variational posterior. Crucially, the integral is tractable if the covariance functions can be convolved analytically, ∫_R K_{h,u_h}(t_i − τ) K_{x,u_x}(τ) dτ, which is the case for the SE and DSE covariances — see sec. 
4 of the supplementary material for the derivation of the variational lower bound.\n\nThe fact that it is possible to compute the first and second moments of the convolution under the approximate posterior means that it is also tractable to compute the mean of the posterior distribution over the kernel, E_q[K_{f|h}(t1, t2)] = σ²_x E_q[∫_R h(t1 − τ) h(t2 − τ) dτ], and the associated error-bars. The method therefore supports full probabilistic inference and learning for nonparametric kernels, in addition to extrapolation, interpolation and denoising in a tractable manner. The next section discusses sensible choices for the integral transforms that define the inducing variables u_h and u_x.\n\n3.1 Choice of the inducing variables u_h and u_x\n\nIn order to choose the domain of the inducing variables, it is useful to consider inference for the white-noise process given a fixed window h(t). Typically, we assume that the window h(t) is smoothly varying, in which case the data y(t) are only determined by the low-frequency content of the white-noise; conversely, in inference, the data can only reveal the low frequencies in x(t). In fact, since a continuous-time white-noise process contains power at all frequencies and infinite power in total, most of the white-noise content will be undeterminable, as it is suppressed by the filter (or filtered out). However, for the same reason, these components do not affect prediction of f(t). Since we can only learn the low-frequency content of the white-noise, and this is all that is important for making predictions, we consider inter-domain inducing points formed by a Gaussian integral transform, u_x = ∫_R exp(−(t_x − τ)²/(2l²)) x(τ) dτ. 
These inducing variables represent a local estimate of the white-noise process x around the inducing location t_x under a Gaussian window, and have a squared exponential covariance by construction (these covariances are shown in sec. 3 of the supplementary material). In spectral terms, the process u_x is a low-pass version of the true process x. The variational parameters l and t_x affect the approximate posterior and can be optimised using the free-energy, although this was not investigated here so as to minimise computational overhead. For the inducing variables u_h we chose not to use the flexibility of the inter-domain parameterisation and, instead, place the points in the same domain as the window.\n\n4 Experiments\n\nThe DSE-GPCM was tested using synthetic data with known statistical properties, as well as real-world signals. The aim of these experiments was to validate the new approach to learning covariance functions and PSDs while also providing error bars for the estimates, and to compare it against alternative parametric and nonparametric approaches.\n\n4.1 Learning known parametric kernels\n\nWe considered Gaussian processes with standard, parametric covariance kernels and verified that our method is able to infer such kernels. Gaussian processes with squared exponential (GP-SE) and spectral mixture (GP-SM) kernels, both of unit variance, were used to generate two time series on the region [−44, 44], uniformly sampled at 10 Hz (i.e., 880 samples). We then constructed the observation signal by adding unit-variance white-noise. The experiment then consisted of (i) learning the underlying kernel, (ii) estimating the latent process and (iii) performing imputation after removing the observations in the region [−4.4, 4.4] (10% of the observations).\n\nFig. 2 shows the results for the GP-SE case. We chose 88 inducing points for u_x, that is, 1/10 of the samples to be recovered, and 30 for u_h; the hyperparameters in eq. 
(3) were set to γ = 0.45 and α = 0.1, so as to allow for an uninformative prior on h(t). The variational objective F was optimised with respect to the hyperparameter σ_h and the variational parameters μ_h, μ_x (means) and the Cholesky factors of C_h, C_x (covariances) using conjugate gradients. The true SE kernel was reconstructed from the noisy data with an accuracy of 5%, while the estimation mean squared error (MSE) was within 1% of the (unit) noise variance for both the true GP-SE and the proposed model.\n\nFig. 3 shows the results for the GP-SM time series. Along the lines of the GP-SE case, the reconstruction of the true kernel and spectrum is remarkably accurate, and the estimate of the latent process has virtually the same MSE as the true GP-SM model. These toy results indicate that the variational inference procedure can work well, in spite of known biases [23].\n\n4.2 Learning the spectrum of real-world signals\n\nThe ability of the DSE-GPCM to provide Bayesian estimates of the PSD of real-world signals was verified next. This was achieved through a comparison of the proposed model to (i) the spectral mixture kernel (GP-SM) [4], (ii) tracking the Fourier coefficients using a Kalman filter (Kalman-Fourier [24]), (iii) the Yule-Walker method and (iv) the periodogram [25].\n\nWe first analysed the Mauna Loa monthly CO2 concentration (de-trended). We considered the GP-SM with 4 and 10 components, Kalman-Fourier with a partition of 500 points between zero and the Nyquist frequency, Yule-Walker with 250 lags and the raw periodogram. All methods used all the data, and each PSD estimate was normalised w.r.t. its maximum (shown in fig. 4). 
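The raw-periodogram baseline used throughout this comparison can be sketched in a few lines of numpy; the toy sinusoid and its parameters below are illustrative, not data from the experiments:

```python
import numpy as np

def raw_periodogram(y, fs):
    """Raw periodogram of a uniformly sampled signal, normalised w.r.t. its
    maximum, mirroring the normalisation used in the comparison above."""
    Y = np.fft.rfft(y - y.mean())          # drop the mean (DC) component
    psd = np.abs(Y)**2 / (fs * len(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)
    return freqs, psd / psd.max()

# Toy check: a 5 Hz sinusoid in noise should yield a peak near 5 Hz.
fs = 100.0
t = np.arange(0.0, 10.0, 1.0 / fs)
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * 5.0 * t) + 0.3 * rng.standard_normal(len(t))
freqs, psd = raw_periodogram(y, fs)
assert abs(freqs[psd.argmax()] - 5.0) < 0.2
```

Unlike the GPCM, this estimator requires uniform sampling and provides no error bars, which is precisely the gap the proposed model addresses.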
All methods identified the three main frequency peaks at [0, year⁻¹, 2 year⁻¹]; however, notice that the Kalman-Fourier method does not provide sharp peaks and that GP-SM places Gaussians on frequencies with negligible power — this is a known drawback of the GP-SM approach: it is sensitive to initialisation and gets trapped in noisy frequency peaks (in this experiment, the centres of the GP-SM were initialised as multiples of one tenth of the Nyquist frequency). This example shows that the GP-SM can overfit noise in the training data. Conversely, observe how the proposed DSE-GPCM approach (with N_h = 300 and N_x = 150) not only captured the first three peaks but also the spectral floor, and placed meaningful error bars (90%) where the raw periodogram lay.\n\nFigure 2: Joint learning of an SE kernel and data imputation using the proposed DSE-GPCM approach. Top: filter h(t) and inducing points u_h (left), filtered white-noise process u_x (centre) and learnt kernel (right). Bottom: latent signal and its estimates using both the DSE-GPCM and the true model (GP-SE). Confidence intervals are shown in light blue (DSE-GPCM) and in between dashed red lines (GP-SE); they correspond to 99.7% for the kernel and 95% otherwise.\n\nFigure 3: Joint learning of an SM kernel and data imputation using a nonparametric kernel. True and learnt kernel (left), true and learnt spectra (centre) and data imputation region (right).\n\nFigure 4: Spectral estimation of the Mauna Loa CO2 concentration. DSE-GPCM with error bars (90%) is shown with the periodogram at the left and all other methods at the right for clarity.\n\nThe next experiment consisted of recovering the spectrum of an audio signal from the TIMIT corpus, composed of 1750 samples (at 16 kHz), using only an irregularly-sampled 20% of the available data. 
We compared the proposed DSE-GPCM method to GP-SM (again with 4 and 10 components) and Kalman-Fourier; we used the periodogram and the Yule-Walker method as benchmarks, since these methods cannot handle unevenly-sampled data (they therefore used all the data). Besides the PSD, we also computed the learnt kernel, shown alongside the autocorrelation function in fig. 5 (left). Due to its sensitivity to initial conditions, the centres of the GP-SM were initialised every 100 Hz (the harmonics of the signal are approximately every 114 Hz); however, it was only with 10 components that the GP-SM was able to find the four main lobes of the PSD. Notice also how the DSE-GPCM accurately finds the main lobes, both in location and width, together with the 90% error bars.\n\nFigure 5: Audio signal from TIMIT. Induced kernel of DSE-GPCM and GP-SM alongside the autocorrelation function (left). PSD estimate using DSE-GPCM and raw periodogram (centre). 
PSD estimate using GP-SM, Kalman-Fourier, Yule-Walker and the raw periodogram (right).

5 Discussion

The Gaussian Process Convolution Model (GPCM) has been proposed as a generative model for stationary time series, based on the convolution between a filter function and a white-noise process. Learning the model from data is achieved via a novel variational free-energy approximation, which in turn allows us to perform predictions and inference on both the covariance kernel and the spectrum in a probabilistic, analytically and computationally tractable manner. The GPCM approach was validated in the recovery of the spectral density from non-uniformly sampled time series; to our knowledge, this is the first probabilistic approach that places a nonparametric prior over the spectral density itself and recovers a posterior distribution over that density directly from the time series.

The encouraging results for both synthetic and real-world data shown in sec. 4 serve as a proof of concept for the nonparametric design of covariance kernels and PSDs using convolution processes. In this regard, the presented model can be extended in the following directions. First, for the proposed GPCM to perform as desired, the number of inducing points uh and ux needs to be increased with (i) the high-frequency content and (ii) the range of correlations of the data; therefore, to avoid the computational overhead associated with large numbers of inducing points, the filter prior or the inter-domain transformation can be designed to have a specific harmonic structure and thereby focus on a target spectrum. Second, the algorithm can be adapted to handle longer time series, for instance through the use of tree-structured approximations [26].
Third, the method can also be extended beyond time series to operate on higher-dimensional input spaces; this can be achieved by means of a factorisation of the latent kernel, whereby the number of inducing points for the filter increases only linearly with the dimension, rather than exponentially.

Acknowledgements

Part of this work was carried out when F.T. was with the University of Cambridge. F.T. thanks CONICYT-PAI grant 82140061 and the Basal-CONICYT Center for Mathematical Modeling (CMM). R.T. thanks EPSRC grants EP/L000776/1 and EP/M026957/1. T.B. thanks Google. We thank Mark Rowland, Shane Gu and the anonymous reviewers for insightful feedback.

References

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.

[2] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[3] D. J. C. MacKay, “Introduction to Gaussian processes,” in Neural Networks and Machine Learning (C. M. Bishop, ed.), NATO ASI Series, pp. 133–166, Kluwer Academic Press, 1998.

[4] A. G. Wilson and R. P. Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in Proc. of International Conference on Machine Learning, 2013.

[5] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani, “Structure discovery in nonparametric regression through compositional kernel search,” in Proc.
of International Conference on Machine Learning, pp. 1166–1174, 2013.

[6] D. Duvenaud, H. Nickisch, and C. E. Rasmussen, “Additive Gaussian processes,” in Advances in Neural Information Processing Systems 24, pp. 226–234, 2011.

[7] M. Gönen and E. Alpaydin, “Multiple kernel learning algorithms,” The Journal of Machine Learning Research, vol. 12, pp. 2211–2268, 2011.

[8] F. Tobar, S.-Y. Kung, and D. Mandic, “Multikernel least mean square algorithm,” IEEE Trans. on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 265–277, 2014.

[9] R. E. Turner, Statistical Models for Natural Sounds. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

[10] R. Turner and M. Sahani, “Time-frequency analysis as probabilistic inference,” IEEE Trans. on Signal Processing, vol. 62, no. 23, pp. 6171–6183, 2014.

[11] B. Oksendal, Stochastic Differential Equations. Springer, 2003.

[12] A. V. Oppenheim and A. S. Willsky, Signals and Systems. Prentice-Hall, 1997.

[13] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor, “Gaussian process approximations of stochastic differential equations,” Journal of Machine Learning Research Workshop and Conference Proceedings, vol. 1, pp. 1–16, 2007.

[14] S. F. Gull, “Developments in maximum entropy data analysis,” in Maximum Entropy and Bayesian Methods (J. Skilling, ed.), vol. 36, pp. 53–71, Springer Netherlands, 1989.

[15] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[16] T. P. Minka, “Deriving quadrature rules from Gaussian processes,” tech. rep., Statistics Department, Carnegie Mellon University, 2000.

[17] A. H. Jazwinski, Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.

[18] M. K.
Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in Proc. of International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.

[19] A. Figueiras-Vidal and M. Lázaro-Gredilla, “Inter-domain Gaussian processes for sparse inference using inducing features,” in Advances in Neural Information Processing Systems, pp. 1087–1095, 2009.

[20] A. G. d. G. Matthews, J. Hensman, R. E. Turner, and Z. Ghahramani, “On sparse variational methods and the Kullback-Leibler divergence between stochastic processes,” arXiv preprint arXiv:1504.07027, 2015.

[21] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[22] M. K. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” in Proc. of International Conference on Artificial Intelligence and Statistics, pp. 844–851, 2010.

[23] R. E. Turner and M. Sahani, “Two problems with variational expectation maximisation for time-series models,” in Bayesian Time Series Models (D. Barber, T. Cemgil, and S. Chiappa, eds.), ch. 5, pp. 109–130, Cambridge University Press, 2011.

[24] Y. Qi, T. Minka, and R. W. Picard, “Bayesian spectrum estimation of unevenly sampled nonstationary data,” in Proc. of IEEE ICASSP, vol. 2, pp. II–1473–II–1476, 2002.

[25] D. B. Percival and A. T. Walden, Spectral Analysis for Physical Applications. Cambridge University Press, 1993.

[26] T. D. Bui and R. E. Turner, “Tree-structured Gaussian process approximations,” in Advances in Neural Information Processing Systems 27, pp.
2213–2221, 2014.