{"title": "Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 3156, "page_last": 3164, "abstract": "State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once an approximation of the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.", "full_text": "Bayesian Inference and Learning in Gaussian Process\n\nState-Space Models with Particle MCMC\n\nRoger Frigola1, Fredrik Lindsten2, Thomas B. Sch\u00a8on2,3 and Carl E. Rasmussen1\n\n1. Dept. of Engineering, University of Cambridge, UK, {rf342,cer54}@cam.ac.uk\n2. Div. of Automatic Control, Link\u00a8oping University, Sweden, lindsten@isy.liu.se\n3. Dept. of Information Technology, Uppsala University, Sweden, thomas.schon@it.uu.se\n\nAbstract\n\nState-space models are successfully used in many areas of science, engineering\nand economics to model time series and dynamical systems. We present a fully\nBayesian approach to inference and learning (i.e. state estimation and system\nidenti\ufb01cation) in nonlinear nonparametric state-space models. 
We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity.

1 Introduction

State-space models (SSMs) constitute a popular and general class of models in the context of time series and dynamical systems. Their main feature is the presence of a latent variable, the state x_t ∈ X ≜ R^{n_x}, which condenses all aspects of the system that can have an impact on its future. A discrete-time SSM with nonlinear dynamics can be represented as

x_{t+1} = f(x_t, u_t) + v_t,   (1a)
y_t = g(x_t, u_t) + e_t,   (1b)

where u_t denotes a known external input, y_t denotes the measurements, and v_t and e_t denote i.i.d. noises acting on the dynamics and the measurements, respectively. The function f encodes the dynamics and g describes the relationship between the observation and the unobserved states.

We are primarily concerned with the problem of learning general nonlinear SSMs. The aim is to find a model that can adaptively increase its complexity when more data is available. To this effect, we employ a Bayesian nonparametric model for the dynamics (1a). This provides a flexible model that is not constrained by any limiting assumptions caused by postulating a particular functional form. More specifically, we place a Gaussian process (GP) prior [1] over the unknown function f. The resulting model is a generalization of the standard parametric SSM.
The functional form of the observation model g is assumed to be known, possibly parameterized by a finite-dimensional parameter. This is often a natural assumption, for instance in engineering applications where g corresponds to a sensor model: we typically know what the sensors are measuring, at least up to some unknown parameters. Furthermore, using overly flexible models for both f and g can result in problems of non-identifiability.

We adopt a fully Bayesian approach whereby we find a posterior distribution over all the latent entities of interest, namely the state transition function f, the hidden state trajectory x_{0:T} ≜ {x_i}_{i=0}^{T} and any hyper-parameters θ of the model. This is in contrast with existing approaches for using GPs to model SSMs, which tend to model the GP using a finite set of target points, in effect making the model parametric [2]. Inferring the distribution over the state trajectory p(x_{0:T} | y_{0:T}, u_{0:T}) is an important problem in itself, known as smoothing. We use a tailored particle Markov chain Monte Carlo (PMCMC) algorithm [3] to efficiently sample from the smoothing distribution whilst marginalizing over the state transition function. This contrasts with conventional approaches to smoothing, which require a fixed model of the transition dynamics. Once we have obtained an approximation of the smoothing distribution, with the dynamics of the model marginalized out, learning the function f is straightforward since its posterior is available in closed form given the state trajectory. Our only approximation is that of the sampling algorithm. We report very good mixing enabled by the use of recently developed PMCMC samplers [4] and the exact marginalization of the transition dynamics.

There is by now a rich literature on GP-based SSMs. For instance, Deisenroth et al.
[5, 6] presented refined approximation methods for filtering and smoothing in cases where GP dynamics and measurement functions have already been learned. In fact, the method proposed in the present paper provides a vital component needed by these inference methods, namely a way of learning the GP model in the first place. Turner et al. [2] applied the EM algorithm to obtain a maximum likelihood estimate of parametric models which had the form of GPs where both inputs and outputs were parameters to be optimized. This type of approach can be traced back to [7], where Ghahramani and Roweis applied EM to learn models based on radial basis functions. Wang et al. [8] learn an SSM with GPs by finding a MAP estimate of the latent variables and hyper-parameters. They apply the learning in cases where the dimension of the observation vector is much higher than that of the latent state, in what becomes a form of dynamic dimensionality reduction. This procedure would risk overfitting in the common situation where the state space is high-dimensional and there is significant uncertainty in the smoothing distribution.

2 Gaussian Process State-Space Model

We describe the generative probabilistic model of the Gaussian process SSM (GP-SSM), represented in Figure 1b, by

f(x_t) \sim \mathcal{GP}\big(m_{\theta_x}(x_t), k_{\theta_x}(x_t, x_t')\big),   (2a)
x_{t+1} \mid f_t \sim \mathcal{N}(x_{t+1} \mid f_t, Q),   (2b)
y_t \mid x_t \sim p(y_t \mid x_t, \theta_y),   (2c)

and x_0 ∼ p(x_0), where we avoid notational clutter by omitting the conditioning on the known inputs u_t. In addition, we put a prior p(θ) over the various hyper-parameters θ = {θ_x, θ_y, Q}. Also, note that the measurement model (2c) and the prior on x_0 can take any form, since we do not rely on their properties for efficient inference.

The GP is fully described by its mean function and its covariance function.
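To make the generative process concrete, a trajectory can be drawn from the GP-SSM prior by sequentially conditioning the transition function on the input/output pairs generated so far. The sketch below is a minimal 1-D illustration under assumed choices not taken from the paper: a squared-exponential covariance, an identity mean function, and a toy linear-Gaussian observation model standing in for the generic (2c).

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def sample_gpssm_trajectory(T, q=0.1, r=0.1, mean_fn=lambda x: x, rng=None):
    """Draw (x_{0:T}, y_{0:T}) from a 1-D GP-SSM prior: at each step the GP
    transition function is conditioned on the (x, f) pairs generated so far,
    then x_{t+1} ~ N(f_t, q) as in (2b). The observation model used here is
    a toy linear-Gaussian stand-in for the generic (2c)."""
    rng = np.random.default_rng(rng)
    xs = [rng.normal()]          # x_0 ~ p(x_0), assumed standard normal
    fs = []                      # latent function values f_t = f(x_t)
    for t in range(T):
        X, F = np.array(xs[:-1]), np.array(fs)
        xt = np.array([xs[-1]])
        if F.size == 0:          # no conditioning data yet: prior GP moments
            m, v = float(mean_fn(xt)[0]), float(rbf(xt, xt)[0, 0])
        else:                    # standard GP posterior at the new input
            Kxx = rbf(X, X) + 1e-9 * np.eye(len(X))
            kxt = rbf(xt, X)[0]
            m = float(mean_fn(xt)[0] + kxt @ np.linalg.solve(Kxx, F - mean_fn(X)))
            v = float(rbf(xt, xt)[0, 0] - kxt @ np.linalg.solve(Kxx, kxt))
        ft = rng.normal(m, np.sqrt(max(v, 0.0)))
        fs.append(ft)
        xs.append(ft + rng.normal(0.0, np.sqrt(q)))   # (2b)
    xs = np.array(xs)
    ys = xs + rng.normal(0.0, np.sqrt(r), size=T + 1) # toy stand-in for (2c)
    return xs, ys
```

Note that the sequential conditioning above is exactly what makes the marginal prior over trajectories non-Markovian once f is integrated out.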
An interesting property of the GP-SSM is that any a priori insight into the dynamics of the system can be readily encoded in the mean function. This is useful, since it is often possible to capture the main properties of the dynamics, e.g. by using a simple parametric model or a model based on first principles. Such simple models may be insufficient on their own, but useful together with the GP-SSM, as the GP is flexible enough to model complex departures from the mean function. If no specific prior model is available, the linear mean function m(x_t) = x_t is a good generic choice. Interestingly, the prior information encoded in this model will normally be more vague than the prior information encoded in parametric models. The measurement model (2c) implicitly contains the observation function g and the distribution of the i.i.d. measurement noise e_t.

Figure 1: Graphical models for (a) standard GP regression and (b) the GP-SSM. The thick horizontal bars represent sets of fully connected nodes.

3 Inference over States and Hyper-parameters

Direct learning of the function f in (2a) from input/output data {u_{0:T-1}, y_{0:T}} is challenging since the states x_{0:T} are not observed. Most (if not all) previous approaches attack this problem by reverting to a parametric representation of f which is learned alongside the states. We address this problem in a fundamentally different way, by marginalizing out f; this allows us to respect the nonparametric nature of the model. A challenge with this approach is that marginalization of f introduces dependencies across time for the state variables, which leads to the loss of the Markovian structure of the state process.
However, recently developed inference methods combining sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC) allow us to tackle this problem. We discuss marginalization of f in Section 3.1 and present the inference algorithms in Sections 3.2 and 3.3.

3.1 Marginalizing out the State Transition Function

Targeting the joint posterior distribution of the hyper-parameters, the latent states and the latent function f is problematic due to the strong dependencies between x_{0:T} and f. We therefore marginalize the dynamical function from the model, and instead target the distribution p(θ, x_{0:T} | y_{1:T}) (recall that conditioning on u_{0:T-1} is implicit). In the MCMC literature, this is referred to as collapsing [9]. Hence, we first need to find an expression for the marginal prior p(θ, x_{0:T}) = p(x_{0:T} | θ)p(θ). Focusing on p(x_{0:T} | θ) we note that, although this distribution is not Gaussian, it can be represented as a product of Gaussians. Omitting the dependence on θ in the notation, we obtain

p(x_{1:T} \mid \theta, x_0) = \prod_{t=1}^{T} p(x_t \mid \theta, x_{0:t-1}) = \prod_{t=1}^{T} \mathcal{N}\big(x_t \mid \mu_t(x_{0:t-1}), \Sigma_t(x_{0:t-1})\big),   (3a)

with

\mu_t(x_{0:t-1}) = m_{t-1} + K_{t-1,0:t-2} \widetilde{K}_{0:t-2}^{-1} (x_{1:t-1} - m_{0:t-2}),   (3b)
\Sigma_t(x_{0:t-1}) = \widetilde{K}_{t-1} - K_{t-1,0:t-2} \widetilde{K}_{0:t-2}^{-1} K_{t-1,0:t-2}^{\top},   (3c)

for t ≥ 2, and \mu_1(x_0) = m_0, \Sigma_1(x_0) = \widetilde{K}_0. Equation (3) follows from the fact that, conditioned on x_{0:t-1}, a one-step prediction for the state variable is a standard GP prediction. Here, we have defined the mean vector m_{0:t-1} ≜ [m(x_0)^⊤ ... m(x_{t-1})^⊤]^⊤ and the (n_x t) × (n_x t) positive definite matrix K_{0:t-1} with block entries [K_{0:t-1}]_{i,j} = k(x_{i-1}, x_{j-1}). We use two sets of indices, as in K_{t-1,0:t-2}, to refer to the off-diagonal blocks of K_{0:t-1}. We also define \widetilde{K}_{0:t-1} = K_{0:t-1} + I_t ⊗ Q. We can also express (3a) more succinctly as

p(x_{1:t} \mid \theta, x_0) = |(2\pi)^{n_x t} \widetilde{K}_{0:t-1}|^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2} (x_{1:t} - m_{0:t-1})^{\top} \widetilde{K}_{0:t-1}^{-1} (x_{1:t} - m_{0:t-1})\big).   (4)

This expression looks very much like a multivariate Gaussian density function. However, we emphasize that this is not the case, since both m_{0:t-1} and \widetilde{K}_{0:t-1} depend (nonlinearly) on the argument x_{1:t}. In fact, (4) will typically be very far from Gaussian.

3.2 Sequential Monte Carlo

With the prior (4) in place, we now turn to posterior inference, and we start by considering the joint smoothing distribution p(x_{0:T} | θ, y_{0:T}). The sequential nature of the proposed model suggests the use of SMC. Though most well known for filtering in Markovian SSMs (see [10, 11] for an introduction), SMC is applicable also to non-Markovian latent variable models. We seek to approximate the sequence of distributions p(x_{0:t} | θ, y_{0:t}), for t = 0, ..., T. Let {x^i_{0:t-1}, w^i_{t-1}}_{i=1}^{N} be a collection of weighted particles approximating p(x_{0:t-1} | θ, y_{0:t-1}) by the empirical distribution \hat{p}(x_{0:t-1} | θ, y_{0:t-1}) ≜ \sum_{i=1}^{N} w^i_{t-1} \delta_{x^i_{0:t-1}}(x_{0:t-1}). Here, \delta_z(x) is a point mass located at z. To propagate this sample to time t, we introduce the auxiliary variables {a^i_t}_{i=1}^{N}, referred to as ancestor indices. The variable a^i_t is the index of the ancestor particle, at time t−1, of particle x^i_t. Hence, x^i_t is generated by first sampling a^i_t with P(a^i_t = j) = w^j_{t-1}. Then, x^i_t is generated as

x^i_t \sim p(x_t \mid \theta, x^{a^i_t}_{0:t-1}),   (5)

for i = 1, ..., N. The particle trajectories are then augmented according to x^i_{0:t} = {x^{a^i_t}_{0:t-1}, x^i_t}. Sampling from the one-step predictive density is a simple (and sensible) choice, but we may also consider other proposal distributions. In the above formulation the resampling step is implicit and corresponds to sampling the ancestor indices (cf. the auxiliary particle filter [12]). Finally, the particles are weighted according to the measurement model, w^i_t ∝ p(y_t | θ, x^i_t) for i = 1, ..., N, where the weights are normalized to sum to 1.

3.3 Particle Markov Chain Monte Carlo

There are two shortcomings of SMC: (i) it does not handle inference over the hyper-parameters; (ii) despite the fact that the sampler targets the joint smoothing distribution, it does not, in general, provide an accurate approximation of the full joint distribution, due to path degeneracy. That is, the successive resampling steps cause the particle diversity to be very low for time points t far from the final time instant T.

To address these issues, we propose to use a particle Markov chain Monte Carlo (PMCMC, [3, 13]) sampler. PMCMC relies on SMC to generate samples of the highly correlated state trajectory within an MCMC sampler. We employ a specific PMCMC sampler referred to as particle Gibbs with ancestor sampling (PGAS, [4]), given in Algorithm 1. PGAS uses Gibbs-like steps for the state trajectory x_{0:T} and the hyper-parameters θ, respectively. That is, we first sample x_{0:T} given θ, then θ given x_{0:T}, etc. However, the full conditionals are not explicitly available. Instead, we draw samples from specially tailored Markov kernels, leaving these conditionals invariant.
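A single propagation step of the SMC sweep of Section 3.2 can be sketched as follows. Here `predictive` (the GP one-step predictive (3) evaluated on a full past trajectory) and `loglik` (the measurement log-density) are assumed model-supplied callables; the names are illustrative, not from the paper.

```python
import numpy as np

def smc_step(trajs, logw, y_t, predictive, loglik, rng):
    """One SMC propagation step for a non-Markovian model: sample ancestor
    indices (implicit resampling), propagate each particle from the one-step
    predictive conditioned on its whole past, then reweight by the
    measurement model, cf. (5)."""
    N = len(trajs)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    a = rng.choice(N, size=N, p=w)          # ancestor indices a_t^i
    new_trajs, new_logw = [], np.empty(N)
    for i in range(N):
        past = trajs[a[i]]                  # x_{0:t-1}^{a_t^i}
        mu, var = predictive(past)          # GP one-step predictive moments
        x = rng.normal(mu, np.sqrt(var))
        new_trajs.append(past + [x])        # x_{0:t}^i = {x_{0:t-1}^{a_t^i}, x_t^i}
        new_logw[i] = loglik(y_t, x)        # w_t^i proportional to p(y_t | theta, x_t^i)
    return new_trajs, new_logw
```

Note that, unlike in a Markovian particle filter, `predictive` must see the whole past trajectory, which is what drives the computational considerations of Section 4.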
We address these steps in the subsequent sections.

Algorithm 1 Particle Gibbs with ancestor sampling (PGAS)

1. Set θ[0] and x_{1:T}[0] arbitrarily.
2. For ℓ ≥ 1 do
   (a) Draw θ[ℓ] conditionally on x_{0:T}[ℓ−1] and y_{0:T} as discussed in Section 3.3.2.
   (b) Run CPF-AS (see [4]) targeting p(x_{0:T} | θ[ℓ], y_{0:T}), conditionally on x_{0:T}[ℓ−1].
   (c) Sample k with P(k = i) = w^i_T and set x_{1:T}[ℓ] = x^k_{1:T}.
3. end

3.3.1 Sampling the State Trajectories

To sample the state trajectory, PGAS makes use of an SMC-like procedure referred to as a conditional particle filter with ancestor sampling (CPF-AS). This approach is particularly suitable for non-Markovian latent variable models, as it relies only on a forward recursion (see [4]). The difference between a standard particle filter (PF) and the CPF-AS is that, for the latter, one particle at each time step is specified a priori. Let these particles be denoted \tilde{x}_{0:T} = {\tilde{x}_0, ..., \tilde{x}_T}. We then sample according to (5) only for i = 1, ..., N−1. The Nth particle is set deterministically: x^N_t = \tilde{x}_t. To be able to construct the Nth particle trajectory, x^N_t has to be associated with an ancestor particle at time t−1. This is done by sampling a value for the corresponding ancestor index a^N_t. Following [4], the ancestor sampling probabilities are computed as

\tilde{w}^i_{t-1|T} \propto w^i_{t-1} \frac{p(\{x^i_{0:t-1}, \tilde{x}_{t:T}\}, y_{0:T})}{p(x^i_{0:t-1}, y_{0:t-1})} \propto w^i_{t-1} \frac{p(\{x^i_{0:t-1}, \tilde{x}_{t:T}\})}{p(x^i_{0:t-1})} = w^i_{t-1}\, p(\tilde{x}_{t:T} \mid x^i_{0:t-1}),   (6)

where the first ratio is between the unnormalized target densities up to time T and up to time t−1, respectively. The second proportionality follows from the mutual conditional independence of the observations, given the states. Here, {x^i_{0:t-1}, \tilde{x}_{t:T}} refers to a path in X^{T+1} formed by concatenating the two partial trajectories. The above expression can be computed by using the prior over state trajectories given by (4). The ancestor sampling weights {\tilde{w}^i_{t-1|T}}_{i=1}^{N} are then normalized to sum to 1, and the ancestor index a^N_t is sampled with P(a^N_t = j) = \tilde{w}^j_{t-1|T}.

The conditioning on a prespecified collection of particles implies an invariance property of CPF-AS which is key to our development. More precisely, given \tilde{x}_{0:T}, let \tilde{x}'_{0:T} be generated as follows:

1. Run CPF-AS from time t = 0 to time t = T, conditionally on \tilde{x}_{0:T}.
2. Set \tilde{x}'_{0:T} to one of the resulting particle trajectories according to P(\tilde{x}'_{0:T} = x^i_{0:T}) = w^i_T.

For any N ≥ 2, this procedure defines an ergodic Markov kernel M^N_θ(\tilde{x}'_{0:T} | \tilde{x}_{0:T}) on X^{T+1}, leaving the exact smoothing distribution p(x_{0:T} | θ, y_{0:T}) invariant [4]. Note that this invariance holds for any N ≥ 2, i.e. the number of particles only affects the mixing rate of the kernel M^N_θ. However, it has been observed in practice that the autocorrelation drops sharply as N increases [4, 14], and for many models a moderate N is enough to obtain a rapidly mixing kernel.

3.3.2 Sampling the Hyper-parameters

Next, we consider sampling the hyper-parameters given a state trajectory and sequence of observations, i.e. from p(θ | x_{0:T}, y_{0:T}).
In the following, we consider the common situation where there are distinct hyper-parameters for the likelihood p(y_{0:T} | x_{0:T}, θ_y) and for the prior over trajectories p(x_{0:T} | θ_x). If the prior over the hyper-parameters factorizes between those two groups, we obtain p(θ | x_{0:T}, y_{0:T}) ∝ p(θ_y | x_{0:T}, y_{0:T}) p(θ_x | x_{0:T}). We can thus proceed to sample the two groups of hyper-parameters independently. Sampling θ_y will be straightforward in most cases, in particular if conjugate priors for the likelihood are used. Sampling θ_x will, nevertheless, be harder, since the covariance function hyper-parameters enter the expression in a non-trivial way. However, we note that once the state trajectory is fixed, we are left with a problem analogous to Gaussian process regression, where x_{0:T−1} are the inputs, x_{1:T} are the outputs and Q is the likelihood covariance matrix. Given that the latent dynamics can be marginalized out analytically, sampling the hyper-parameters with slice sampling is straightforward [15].

4 A Sparse GP-SSM Construction and Implementation Details

A naive implementation of the CPF-AS algorithm will give rise to O(T^4) computational complexity, since at each time step t = 1, ..., T, a matrix of size T × T needs to be factorized. However, it is possible to update and reuse the factors from the previous time step, bringing the total computational complexity down to the familiar O(T^3). Furthermore, by introducing a sparse GP model, we can reduce the complexity to O(M^2 T), where M ≪ T. In Section 4.1 we introduce the sparse GP model, and in Section 4.2 we provide insight into the efficient implementation of both the vanilla GP and the sparse GP.

4.1 FIC Prior over the State Trajectory

An important alternative to the GP-SSM is given by exchanging the vanilla GP prior over f for a sparse counterpart.
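The FIC construction described in this section amounts to substituting a modified covariance function for the original one (equation (7)). As a sketch of that substitution, assuming a generic base covariance `k(A, B)` that returns a matrix and 1-D inducing inputs `Z` (both illustrative names, not notation from the paper):

```python
import numpy as np

def k_fic(Xa, Xb, Z, k):
    """FIC covariance: the low-rank term s(xi, xj) = k(xi,Z) k(Z,Z)^{-1} k(Z,xj)
    plus a Kronecker-delta correction that restores the exact variance on the
    diagonal."""
    Kaz, Kzb = k(Xa, Z), k(Z, Xb)
    Kzz = k(Z, Z) + 1e-9 * np.eye(len(Z))      # jitter for numerical stability
    S = Kaz @ np.linalg.solve(Kzz, Kzb)
    if Xa is Xb or np.array_equal(Xa, Xb):
        # delta_ij term: replace the diagonal by the exact base-kernel variances
        S = S + np.diag(np.diag(k(Xa, Xb) - S))
    return S
```

By construction, the diagonal of `k_fic(X, X, Z, k)` matches that of the base covariance, while the off-diagonal entries have rank at most M, the number of inducing inputs.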
We do not consider the resulting model to be an approximation to the GP-SSM; it is still a GP-SSM, but with a different prior over functions. As a result, we expect it to sometimes outperform its non-sparse version, in the same way as happens with their regression siblings [16].

Most sparse GP methods can be formulated in terms of a set of so-called inducing variables [17]. These variables live in the space of the latent function and have a set I of corresponding inducing inputs. The assumption is that, conditionally on the inducing variables, the latent function values are mutually independent. Although the inducing variables are marginalized analytically (this is key for the model to remain nonparametric), the inducing inputs have to be chosen in such a way that they, informally speaking, cover the same region of the input space covered by the data. Crucially, in order to achieve computational gains, the number M of inducing variables is selected to be smaller than the original number of data points. In the following, we will use the fully independent conditional (FIC) sparse GP prior as defined in [17], due to its very good empirical performance [16].

As shown in [17], the FIC prior can be obtained by replacing the covariance function k(·,·) by

k_{FIC}(x_i, x_j) = s(x_i, x_j) + \delta_{ij}\big(k(x_i, x_j) - s(x_i, x_j)\big),   (7)

where s(x_i, x_j) ≜ k(x_i, I) k(I, I)^{-1} k(I, x_j), \delta_{ij} is Kronecker's delta, and we use the convention whereby, when k takes a set as one of its arguments, it generates a matrix of covariances. Using the Woodbury matrix identity, we can express the one-step predictive density as in (3), with

\mu^{FIC}_t(x_{0:t-1}) = m_{t-1} + K_{t-1,I}\, P\, K_{I,0:t-2}\, \Lambda^{-1}_{0:t-2} (x_{1:t-1} - m_{0:t-2}),   (8a)
\Sigma^{FIC}_t(x_{0:t-1}) = \widetilde{K}_{t-1} - S_{t-1} + K_{t-1,I}\, P\, K_{I,t-1},   (8b)

where P ≜ (K_{I,I} + K_{I,0:t-2} \Lambda^{-1}_{0:t-2} K_{0:t-2,I})^{-1}, \Lambda_{0:t-2} ≜ diag[\widetilde{K}_{0:t-2} - S_{0:t-2}] and S_{A,B} ≜ K_{A,I} K_{I,I}^{-1} K_{I,B}. Here, diag refers to a block diagonalization if Q is not diagonal. Despite its apparent cumbersomeness, the computational complexity involved in computing the above mean and covariance is O(M^2 t), as opposed to O(t^3) for (3). The same idea can be used to express (4) in a form which allows for efficient computation.

We do not address the problem of choosing the inducing inputs, but note that one option is to use greedy methods (e.g. [18]). The fast forward selection algorithm is appealing due to its very low computational complexity [18]. Moreover, its potential drawback of interference between hyper-parameter learning and active set selection is not an issue in our case, since the hyper-parameters are fixed for a given run of the particle filter.

4.2 Implementation Details

As pointed out above, it is crucial to reuse computations across time to attain the O(T^3) or O(M^2 T) computational complexity for the vanilla GP and the FIC prior, respectively. We start by discussing the vanilla GP and then briefly comment on the implementation aspects of FIC.

There are two costly operations in the CPF-AS algorithm: (i) sampling from the prior (5), which requires the computation of (3b) and (3c), and (ii) evaluating the ancestor sampling probabilities according to (6). Both of these operations can be carried out efficiently by keeping track of a Cholesky factorization

\widetilde{K}(\{x^i_{0:t-1}, \tilde{x}_{t:T-1}\}) = L^i_t L^{i\top}_t

for each particle i = 1, ..., N. Here, \widetilde{K}(\{x^i_{0:t-1}, \tilde{x}_{t:T-1}\}) is a matrix defined analogously to \widetilde{K}_{0:T-1}, but where the covariance function is evaluated for the concatenated state trajectory \{x^i_{0:t-1}, \tilde{x}_{t:T-1}\}. From L^i_t, it is possible to identify sub-matrices corresponding to the Cholesky factors for the covariance matrix \Sigma_t(x^i_{0:t-1}), as well as for the matrices needed to efficiently evaluate the ancestor sampling probabilities (6).

It remains to find an efficient update of the Cholesky factor to obtain L^i_{t+1}. As we move from time t to t+1 in the algorithm, \tilde{x}_t will be replaced by x^i_t in the concatenated trajectory. Hence, the matrix \widetilde{K}(\{x^i_{0:t}, \tilde{x}_{t+1:T-1}\}) can be obtained from \widetilde{K}(\{x^i_{0:t-1}, \tilde{x}_{t:T-1}\}) by replacing n_x rows and columns, corresponding to a rank 2n_x update. It follows that we can compute L^i_{t+1} by making n_x successive rank one updates and downdates on L^i_t. In summary, all the operations at a specific time step can be done in O(T^2) computations, leading to a total computational complexity of O(T^3).

For the FIC prior, a naive implementation will give rise to O(M^2 T^2) computational complexity. This can be reduced to O(M^2 T) by keeping track of a factorization for the matrix P. However, to reach the O(M^2 T) cost, all intermediate operations scaling with T have to be avoided, requiring us to reuse not only the matrix factorizations, but also intermediate matrix-vector multiplications.

5 Learning the Dynamics

Algorithm 1 gives us a tool to compute p(x_{0:T}, θ | y_{1:T}). We now discuss how this can be used to find an explicit model for f.
The goal of learning the state transition dynamics is equivalent to that of obtaining a predictive distribution over f^* = f(x^*), evaluated at an arbitrary test point x^*,

p(f^* \mid x^*, y_{1:T}) = \int p(f^* \mid x^*, x_{0:T}, \theta)\, p(x_{0:T}, \theta \mid y_{1:T})\, dx_{0:T}\, d\theta.   (9)

Using a sample-based approximation of p(x_{0:T}, \theta \mid y_{1:T}), this integral can be approximated by

p(f^* \mid x^*, y_{1:T}) \approx \frac{1}{L} \sum_{\ell=1}^{L} p(f^* \mid x^*, x_{0:T}[\ell], \theta[\ell]) = \frac{1}{L} \sum_{\ell=1}^{L} \mathcal{N}(f^* \mid \mu_\ell(x^*), \Sigma_\ell(x^*)),   (10)

where L is the number of samples, and \mu_\ell(x^*) and \Sigma_\ell(x^*) follow the expressions for the predictive distribution in standard GP regression if x_{0:T-1}[\ell] are treated as inputs, x_{1:T}[\ell] are treated as outputs and Q is the likelihood covariance matrix. This mixture of Gaussians is an expressive representation of the predictive density which can, for instance, correctly take into account multimodality arising from ambiguity in the measurements. Although factorized covariance matrices can be pre-computed, the overall computational cost will increase linearly with L. The computational cost can be reduced by thinning the Markov chain using e.g. random sub-sampling or kernel herding [19].

In some situations it can be useful to replace the mixture of Gaussians by an approximation consisting of a single GP. This is the case in applications such as control or real-time filtering, where the cost of evaluating the mixture of Gaussians can be prohibitive. In those cases one can opt for a pragmatic approach and learn the mapping x^* \mapsto f^* from a cloud of points \{x_{0:T}[\ell], f_{0:T}[\ell]\}_{\ell=1}^{L} using sparse GP regression.
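Evaluating the sample-based predictive (10) then reduces to averaging per-sample GP predictive moments. A minimal sketch, assuming each retained sample ℓ has been reduced to a pair of predictive mean/variance callables (an illustrative representation, not notation from the paper):

```python
import numpy as np

def mixture_predictive(x_star, components):
    """Mean and variance of the equally weighted Gaussian mixture (10),
    where `components` is a list of (mu_fn, var_fn) pairs, one per
    retained MCMC sample."""
    mus = np.array([mu_fn(x_star) for mu_fn, _ in components])
    vrs = np.array([var_fn(x_star) for _, var_fn in components])
    m = mus.mean(axis=0)
    # law of total variance for an equally weighted mixture
    v = vrs.mean(axis=0) + (mus ** 2).mean(axis=0) - m ** 2
    return m, v
```

The returned variance accounts for both the per-sample GP uncertainty and the spread of the per-sample means, which is how disagreement between MCMC samples shows up in the predictive.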
The latent function values f_{0:T}[\ell] can be easily sampled from the normally distributed p(f_{0:T}[\ell] \mid x_{0:T}[\ell], \theta[\ell]).

6 Experiments

6.1 Learning a Nonlinear System Benchmark

Consider a system with dynamics given by x_{t+1} = a x_t + b x_t/(1 + x_t^2) + c u_t + v_t, v_t ∼ N(0, q), and observations given by y_t = d x_t^2 + e_t, e_t ∼ N(0, r), with parameters (a, b, c, d, q, r) = (0.5, 25, 8, 0.05, 10, 1) and a known input u_t = cos(1.2(t+1)). One of the difficulties of this system is that the smoothing density p(x_{0:T} | y_{0:T}) is multimodal, since no information about the sign of x_t is available in the observations. The system is simulated for T = 200 time steps, using log-normal priors for the hyper-parameters, and the PGAS sampler is then run for 50 iterations using N = 20 particles. To illustrate the capability of the GP-SSM to make use of a parametric model as baseline, we use a mean function with the same parametric form as the true system, but parameters (a, b, c) = (0.3, 7.5, 0). This function, denoted model B, is manifestly different from the actual state transition (green vs. black surfaces in Figure 2), which also demonstrates the flexibility of the GP-SSM.

Figure 2 (left) shows the samples of x_{0:T} (red). It is apparent that the distribution covers two alternative state trajectories at particular times (e.g. t = 10). In fact, it is always the case that this bi-modal distribution covers the two states of opposite signs that could have led to the same observation (cyan). In Figure 2 (right) we plot samples from the smoothing distribution, where each circle corresponds to (x_t, u_t, E[f_t]). Although the parametric model used in the mean function of the GP (green) is clearly not representative of the true dynamics (black), the samples from the smoothing distribution accurately portray the underlying system.
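The benchmark above is straightforward to reproduce; the sketch below simulates it with an assumed initialization x_0 = 0 (the paper does not specify the initial state).

```python
import numpy as np

def simulate_benchmark(T=200, seed=0):
    """Simulate the Section 6.1 benchmark:
    x_{t+1} = a*x_t + b*x_t/(1+x_t^2) + c*u_t + v_t,  v_t ~ N(0, q),
    y_t     = d*x_t^2 + e_t,                          e_t ~ N(0, r),
    with (a, b, c, d, q, r) = (0.5, 25, 8, 0.05, 10, 1) and
    u_t = cos(1.2*(t+1))."""
    a, b, c, d, q, r = 0.5, 25.0, 8.0, 0.05, 10.0, 1.0
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)                    # assumed initialization x_0 = 0
    for t in range(T):
        u = np.cos(1.2 * (t + 1))
        x[t + 1] = (a * x[t] + b * x[t] / (1 + x[t] ** 2) + c * u
                    + rng.normal(0.0, np.sqrt(q)))
    y = d * x ** 2 + rng.normal(0.0, np.sqrt(r), size=T + 1)
    return x, y
```

Since y_t depends on x_t only through x_t^2, the sign of the state is unobserved, which is exactly the source of the bimodal smoothing distribution discussed above.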
The smoothness prior embodied by the GP allows for accurate sampling from the smoothing distribution even when the parametric model of the dynamics fails to capture important features.

To measure the predictive capability of the learned transition dynamics, we generate a new dataset consisting of 10 000 time steps and report the RMSE between the predicted value of f(x_t, u_t) and the actual one. We compare the results from the GP-SSM with the predictions obtained from two parametric models (one with the true model structure and one linear model) and two known models (the ground truth model and model B). We also report results for the sparse GP-SSM using an FIC prior with 40 inducing points. Table 1 summarizes the results, averaged over 10 independent training and test datasets. We also report the RMSE from the joint smoothing sample to the ground truth trajectory.

Table 1: RMSE to ground truth values over 10 independent runs.

                                                   prediction of           smoothing
                                                   f*|x*_t, u*_t, data     x_{0:T}|data
Ground truth model (known parameters)              --                      2.7 ± 0.5
GP-SSM (proposed, model B mean function)           1.7 ± 0.2               3.2 ± 0.5
Sparse GP-SSM (proposed, model B mean function)    1.8 ± 0.2               2.7 ± 0.4
Model B (fixed parameters)                         7.1 ± 0.0               13.6 ± 1.1
Ground truth model, learned parameters             0.5 ± 0.2               3.0 ± 0.4
Linear model, learned parameters                   5.5 ± 0.1               6.0 ± 0.5

Figure 2: Left: Smoothing distribution. Right: State transition function (black: actual transition function, green: mean function (model B), red: smoothing samples).

Figure 3: One step ahead predictive distribution for each of the states of the cart and pole system. Black: ground truth.
Colored band: one standard deviation from the mixture of Gaussians predictive.

6.2 Learning a Cart and Pole System

We apply our approach to learn a model of a cart and pole system used in reinforcement learning. The system consists of a cart, with a free-spinning pendulum, rolling on a horizontal track. An external force is applied to the cart. The system's dynamics can be described by four states and a set of nonlinear ordinary differential equations [20]. We learn a GP-SSM based on 100 observations of the state corrupted with Gaussian noise. Although the training set only explores a small region of the 4-dimensional state space, we can learn a model of the dynamics which can produce one-step-ahead predictions such as the ones in Figure 3. We obtain a predictive distribution in the form of a mixture of Gaussians, from which we display the first and second moments. Crucially, the learned model reports different amounts of uncertainty in different regions of the state space. For instance, note the narrower error bars on some states between t = 320 and t = 350. This is due to the model being more confident in its predictions in areas that are closer to the training data.

7 Conclusions

We have shown an efficient way to perform fully Bayesian inference and learning in the GP-SSM. A key contribution is that our approach retains the full nonparametric expressivity of the model. This is made possible by marginalizing out the state transition function, which results in a non-trivial inference problem that we solve using a tailored PGAS sampler.

A particular characteristic of our approach is that the latent states can be sampled from the smoothing distribution even when the state transition function is unknown. Assumptions about smoothness and parsimony of this function embodied by the GP prior suffice to obtain high-quality smoothing distributions.
Once samples from the smoothing distribution are available, they can be used to describe a posterior over the state transition function. This contrasts with the conventional approach to inference in dynamical systems, where smoothing is performed conditioned on a model of the state transition dynamics.

References

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.

[2] R. Turner, M. P. Deisenroth, and C. E. Rasmussen, "State-space inference and learning with Gaussian processes," in 13th International Conference on Artificial Intelligence and Statistics, ser. W&CP, Y. W. Teh and M. Titterington, Eds., vol. 9, Chia Laguna, Sardinia, Italy, May 13-15 2010, pp. 868-875.

[3] C. Andrieu, A. Doucet, and R. Holenstein, "Particle Markov chain Monte Carlo methods," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 3, pp. 269-342, 2010.

[4] F. Lindsten, M. Jordan, and T. B. Schön, "Ancestor sampling for particle Gibbs," in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 2600-2608.

[5] M. Deisenroth, R. Turner, M. Huber, U. Hanebeck, and C. Rasmussen, "Robust filtering and smoothing with Gaussian processes," IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1865-1871, July 2012.

[6] M. Deisenroth and S. Mohamed, "Expectation Propagation in Gaussian process dynamical systems," in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 2618-2626.

[7] Z. Ghahramani and S. Roweis, "Learning nonlinear dynamical systems using an EM algorithm," in Advances in Neural Information Processing Systems 11, M. J. Kearns, S. A. Solla, and D. A. Cohn, Eds. MIT Press, 1999.

[8] J. Wang, D. Fleet, and A. Hertzmann, "Gaussian process dynamical models," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 1441-1448.

[9] J. S. Liu, Monte Carlo Strategies in Scientific Computing. Springer, 2001.

[10] A. Doucet and A. Johansen, "A tutorial on particle filtering and smoothing: Fifteen years later," in The Oxford Handbook of Nonlinear Filtering, D. Crisan and B. Rozovsky, Eds. Oxford University Press, 2011.

[11] F. Gustafsson, "Particle filter theory and practice with positioning applications," IEEE Aerospace and Electronic Systems Magazine, vol. 25, no. 7, pp. 53-82, 2010.

[12] M. K. Pitt and N. Shephard, "Filtering via simulation: Auxiliary particle filters," Journal of the American Statistical Association, vol. 94, no. 446, pp. 590-599, 1999.

[13] F. Lindsten and T. B. Schön, "Backward simulation methods for Monte Carlo statistical inference," Foundations and Trends in Machine Learning, vol. 6, no. 1, pp. 1-143, 2013.

[14] F. Lindsten and T. B. Schön, "On the use of backward simulation in the particle Gibbs sampler," in Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, Mar. 2012.

[15] D. K. Agarwal and A. E. Gelfand, "Slice sampling for simulation based fitting of spatial data models," Statistics and Computing, vol. 15, no. 1, pp. 61-69, 2005.

[16] E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 1257-1264.

[17] J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," Journal of Machine Learning Research, vol. 6, pp. 1939-1959, 2005.

[18] M. Seeger, C. Williams, and N. Lawrence, "Fast forward selection to speed up sparse Gaussian process regression," in Artificial Intelligence and Statistics 9, 2003.

[19] Y. Chen, M. Welling, and A. Smola, "Super-samples from kernel herding," in Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), P. Grünwald and P. Spirtes, Eds. AUAI Press, 2010.

[20] M. Deisenroth, "Efficient reinforcement learning using Gaussian processes," Ph.D. dissertation, Karlsruher Institut für Technologie, 2010.