{"title": "Identification of Gaussian Process State Space Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5309, "page_last": 5319, "abstract": "The Gaussian process state space model (GPSSM) is a non-linear dynamical system, where unknown transition and/or measurement mappings are described by GPs. Most research in GPSSMs has focussed on the state estimation problem, i.e., computing a posterior of the latent state given the model. However, the key challenge in GPSSMs has not been satisfactorily addressed yet: system identification, i.e., learning the model. To address this challenge, we impose a structured Gaussian variational posterior distribution over the latent states, which is parameterised by a recognition model in the form of a bi-directional recurrent neural network. Inference with this structure allows us to recover a posterior smoothed over sequences of data. We provide a practical algorithm for efficiently computing a lower bound on the marginal likelihood using the reparameterisation trick. This further allows for the use of arbitrary kernels within the GPSSM. We demonstrate that the learnt GPSSM can efficiently generate plausible future trajectories of the identified system after only observing a small number of episodes from the true system.", "full_text": "Identification of Gaussian Process State Space Models\n\nStefanos Eleftheriadis\u2020, Thomas F.W. Nicholson\u2020, Marc P. Deisenroth\u2020\u2021, James Hensman\u2020\n\n{stefanos, tom, marc, james}@prowler.io\n\n\u2020PROWLER.io,\n\n\u2021Imperial College London\n\nAbstract\n\nThe Gaussian process state space model (GPSSM) is a non-linear dynamical\nsystem, where unknown transition and/or measurement mappings are described by\nGPs. Most research in GPSSMs has focussed on the state estimation problem,\ni.e., computing a posterior of the latent state given the model. 
However, the key\nchallenge in GPSSMs has not been satisfactorily addressed yet: system\nidentification, i.e., learning the model. To address this challenge, we impose a structured\nGaussian variational posterior distribution over the latent states, which is\nparameterised by a recognition model in the form of a bi-directional recurrent neural\nnetwork. Inference with this structure allows us to recover a posterior smoothed\nover sequences of data. We provide a practical algorithm for efficiently computing\na lower bound on the marginal likelihood using the reparameterisation trick. This\nfurther allows for the use of arbitrary kernels within the GPSSM. We demonstrate\nthat the learnt GPSSM can efficiently generate plausible future trajectories of the\nidentified system after only observing a small number of episodes from the true\nsystem.\n\n1 Introduction\n\nState space models can effectively address the problem of learning patterns and predicting behaviour\nin sequential data. Due to their modelling power, they are widely applicable in various domains of\nscience and engineering, such as robotics, finance, neuroscience, etc. (Brown et al., 1998).\nMost research and applications have focussed on linear state space models for which solutions for\ninference (state estimation) and learning (system identification) are well established (Kalman, 1960;\nLjung, 1999). In this work, we are interested in non-linear state space models. In particular, we\nconsider the case where a Gaussian process (GP) (Rasmussen and Williams, 2006) is responsible for\nmodelling the underlying dynamics. This is widely known as the Gaussian process state space model\n(GPSSM). We choose to build upon GPs for a number of reasons. First, they are non-parametric,\nwhich makes them effective in learning from small datasets. 
This can be advantageous over\nwell-known parametric models (e.g., recurrent neural networks, RNNs), especially in situations where\ndata are not abundant. Second, we want to take advantage of the probabilistic properties of GPs.\nBy using a GP for the latent transitions, we can get away with an approximate model and learn a\ndistribution over functions. This allows us to account for model errors whilst quantifying uncertainty,\nas discussed and empirically shown by Schneider (1997) and Deisenroth et al. (2015). Consequently,\nthe system will not become overconfident in regions of the space where data are scarce.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nSystem identification with the GPSSM is a challenging task. This is due to un-identifiability issues:\nboth states and transition functions are unknown. Most work so far has focussed only on state\nestimation of the GPSSM. In this paper, we focus on addressing the challenge of system identification\nand, building on recent work by Frigola et al. (2014), we propose a novel inference method for learning\nthe GPSSM. We approximate the entire process of the state transition function by employing the\nframework of variational inference. We assume a Markov-structured Gaussian posterior distribution\nover the latent states. The variational posterior can be naturally combined with a recognition model\nbased on bi-directional recurrent neural networks, which facilitate smoothing of the state posterior\nover the data sequences. We present an efficient algorithm based on the reparameterisation trick for\ncomputing the lower bound on the marginal likelihood. 
This significantly accelerates learning of the\nmodel and allows for arbitrary kernel functions.\n\n2 Gaussian process state space models\n\nWe consider the dynamical system\n\n$$x_t = f(x_{t-1}, a_{t-1}) + \\epsilon_f, \\qquad y_t = g(x_t) + \\epsilon_g, \\tag{1}$$\n\nwhere $t$ indexes time, $x \\in \\mathbb{R}^D$ is a latent state, $a \\in \\mathbb{R}^P$ are control signals (actions) and $y \\in \\mathbb{R}^O$\nare measurements/observations. We assume i.i.d. Gaussian system/measurement noise\n$\\epsilon_{(\\cdot)} \\sim \\mathcal{N}(0, \\sigma^2_{(\\cdot)} I)$. The state-space model in eq. (1) can be fully described by the measurement and\ntransition functions, $g$ and $f$.\nThe key idea of a GPSSM is to model the transition function $f$ and/or the measurement function $g$\nin eq. (1) using GPs, which are distributions over functions. A GP is fully specified by a mean function $\\eta(\\cdot)$\nand a covariance/kernel function $k(\\cdot,\\cdot)$; see, e.g., Rasmussen and Williams (2006). The covariance\nfunction allows us to encode basic structural assumptions of the class of functions we want to model,\ne.g., smoothness, periodicity or stationarity. A common choice for a covariance function is the radial\nbasis function (RBF).\nLet $f(\\cdot)$ denote a GP random function, and $X = [x_i]_{i=1}^N$ be a series of points in the domain of\nthat function. Then, any finite subset of function evaluations, $\\mathbf{f} = [f(x_i)]_{i=1}^N$, is jointly Gaussian\ndistributed\n\n$$p(\\mathbf{f} \\mid X) = \\mathcal{N}(\\mathbf{f} \\mid \\boldsymbol{\\eta}, K_{xx}), \\tag{2}$$\n\nwhere the matrix $K_{xx}$ contains evaluations of the kernel function at all pairs of datapoints in $X$, and\n$\\boldsymbol{\\eta} = [\\eta(x_i)]_{i=1}^N$ is the prior mean function evaluated at $X$. This property leads to the widely used GP regression\nmodel: if Gaussian noise is assumed, the marginal likelihood can be computed in closed form,\nenabling learning of the kernel parameters. By definition, the conditional distribution of a GP is\nanother GP. 
If we are to observe the values $\\mathbf{f}$ at the input locations $X$, then we predict the values\nelsewhere on the GP using the conditional\n\n$$f(\\cdot) \\mid \\mathbf{f} \\sim \\mathcal{GP}\\big(\\eta(\\cdot) + k(\\cdot, X) K_{xx}^{-1} (\\mathbf{f} - \\boldsymbol{\\eta}),\\; k(\\cdot,\\cdot) - k(\\cdot, X) K_{xx}^{-1} k(X, \\cdot)\\big). \\tag{3}$$\n\nUnlike the supervised setting, in the GPSSM we are presented with neither values of the function on\nwhich to condition, nor inputs to the function, since the hidden states $x_t$ are latent. The challenge\nof inference in the GPSSM lies in jointly inferring the latent variables $x$ and fitting the Gaussian\nprocess dynamics $f(\\cdot)$.\nIn the GPSSM, we place independent GP priors on the transition function $f$ in eq. (1) for each output\ndimension of $x_{t+1}$, and collect realisations of those functions in the random variables $\\mathbf{f}$, such that\n\n$$f_d(\\cdot) \\sim \\mathcal{GP}\\big(\\eta_d(\\cdot), k_d(\\cdot,\\cdot)\\big), \\qquad \\mathbf{f}_t = [f_d(\\tilde{x}_{t-1})]_{d=1}^D, \\quad \\text{and} \\quad p(x_t \\mid \\mathbf{f}_t) = \\mathcal{N}(x_t \\mid \\mathbf{f}_t, \\sigma_f^2 I), \\tag{4}$$\n\nwhere we used the short-hand notation $\\tilde{x}_t = [x_t, a_t]$ to collect the state-action pair at time $t$. In this\nwork, we use a mean function that keeps the state constant, so $\\eta_d(\\tilde{x}_t) = x_t^{(d)}$.\nTo reduce some of the un-identifiability problems of GPSSMs, we assume a linear measurement\nmapping $g$ so that the data conditional is\n\n$$p(y_t \\mid x_t) = \\mathcal{N}(y_t \\mid W_g x_t + b_g, \\sigma_g^2 I). \\tag{5}$$\n\nThe linear observation model $y_t = W_g x_t + b_g + \\epsilon_g$ is not limiting since a non-linear $g$ could be\nreplaced by additional dimensions in the state space (Frigola, 2015).\n\n2.1 Related work\nState estimation in GPSSMs has been proposed by Ko and Fox (2009a) and Deisenroth et al. (2009)\nfor filtering and by Deisenroth et al. (2012) and Deisenroth and Mohamed (2012) for smoothing\nusing both deterministic (e.g., linearisation) and stochastic (e.g., particles) approximations. 
These approaches focused only on inference in learnt GPSSMs and not on system identification, since\nlearning of the state transition function f without observing the system\u2019s true state x is challenging.\nTo this end, Wang et al. (2008), Ko and Fox (2009b) and Turner et al. (2010) proposed\nmethods for learning GPSSMs based on maximum likelihood estimation. Frigola et al. (2013)\nfollowed a Bayesian treatment of the problem and proposed an inference mechanism based on\nparticle Markov chain Monte Carlo. Specifically, they first obtain sample trajectories from the\nsmoothing distribution that could be used to define a predictive density via Monte Carlo integration.\nThen, conditioned on this trajectory they sample the model\u2019s hyper-parameters. This approach\nscales proportionally to the length of the time series and the number of particles. To tackle\nthis inefficiency, Frigola et al. (2014) suggested a hybrid inference approach combining variational\ninference and sequential Monte Carlo. Using the sparse variational framework from (Titsias, 2009) to\napproximate the GP led to a tractable distribution over the state transition function that is independent\nof the length of the time series.\nAn alternative to learning a state-space model is to follow an autoregressive strategy (as in\nMurray-Smith and Girard, 2001; Likar and Kocijan, 2007; Turner, 2011; Roberts et al., 2013; Kocijan, 2016),\nto directly model the mapping from previous to current observations. This can be problematic since\nnoise is propagated through the system during inference. To alleviate this, Mattos et al. (2015)\nproposed the recurrent GP, a non-linear dynamical model that resembles a deep GP mapping from\nobserved inputs to observed outputs, with an autoregressive structure on the intermediate latent states.\nThey further followed the idea by Dai et al. 
(2015) and introduced an RNN-based recognition model\nto approximate the true posterior of the latent state. A downside is the requirement to feed future\nactions forward into the RNN during inference, in order to propagate uncertainty towards the outputs.\nAnother issue stems from the model\u2019s inefficiency in analytically computing expectations of the kernel\nfunctions under the approximate posterior when dealing with high-dimensional latent states. Recently,\nAl-Shedivat et al. (2016) introduced a recurrent structure to the manifold GP (Calandra et al., 2016).\nThey proposed to use an LSTM in order to map the observed inputs onto a non-linear manifold,\non which the GP actually operates. For efficiency, they followed an approximate inference scheme\nbased on Kronecker products over Toeplitz-structured kernels.\n\n3 Inference\n\nOur inference scheme uses variational Bayes (see e.g., Beal, 2003; Blei et al., 2017). We first define\nthe form of the approximation to the posterior, q(\u00b7). Then we derive the evidence lower bound\n(ELBO) with respect to which the posterior approximation is optimised in order to minimise the\nKullback-Leibler divergence between the approximate and true posterior. We detail how the ELBO is\nestimated in a stochastic fashion and optimised using gradient-based methods, and describe how the\nform of the approximate posterior is given by a recurrent neural network. The graphical models of\nthe GPSSM and our proposed approximation are shown in Figure 1.\n\n3.1 Posterior approximation\n\nFollowing the work by Frigola et al. (2014), we adopt a variational approximation to the posterior,\nassuming factorisation between the latent functions f(\u00b7) and the state trajectories X. 
However,\nunlike Frigola et al.\u2019s work, we do not run particle MCMC to approximate the state trajectories, but\ninstead assume that the posterior over states is given by a Markov-structured Gaussian distribution\nparameterised by a recognition model (see section 3.3). In concordance with Frigola et al. (2014), we\nadopt a sparse variational framework to approximate the GP. The sparse approximation allows us to\ndeal with both (a) the unobserved nature of the GP inputs and (b) any potential computational scaling\nissues with the GP by controlling the number of inducing points in the approximation.\nThe variational approximation to the GP posterior is formed as follows: Let $Z = [z_1, \\ldots, z_M]$ be\nsome points in the same domain as $\\tilde{x}$. For each Gaussian process $f_d(\\cdot)$, we define the inducing\nvariables $\\mathbf{u}_d = [f_d(z_m)]_{m=1}^M$, so that the density of $\\mathbf{u}_d$ under the GP prior is $\\mathcal{N}(\\boldsymbol{\\eta}_d, K_{zz})$, with\n$\\boldsymbol{\\eta}_d = [\\eta_d(z_m)]_{m=1}^M$. We make a mean-field variational approximation to the posterior for $U$, taking\nthe form $q(U) = \\prod_{d=1}^D \\mathcal{N}(\\mathbf{u}_d \\mid \\boldsymbol{\\mu}_d, \\Sigma_d)$. The variational posterior of the rest of the points on the\nGP is assumed to be given by the same conditional distribution as the prior:\n\n$$f_d(\\cdot) \\mid \\mathbf{u}_d \\sim \\mathcal{GP}\\big(\\eta_d(\\cdot) + k(\\cdot, Z) K_{zz}^{-1} (\\mathbf{u}_d - \\boldsymbol{\\eta}_d),\\; k(\\cdot,\\cdot) - k(\\cdot, Z) K_{zz}^{-1} k(Z, \\cdot)\\big). \\tag{6}$$\n\nFigure 1: The GPSSM with the GP state transition functions (left), and the proposed approximation with the\nrecognition model in the form of a bi-RNN (right). Black arrows show conditional dependencies of the model,\nred arrows show the data-flow in the recognition.\n\nIntegrating this expression with respect to the prior distribution $p(\\mathbf{u}_d) = \\mathcal{N}(\\boldsymbol{\\eta}_d, K_{zz})$ gives the GP\nprior in eq. (4). 
Integrating with respect to the variational distribution $q(U)$ gives our approximation\nto the posterior process $f_d(\\cdot) \\sim \\mathcal{GP}\\big(\\mu_d(\\cdot), v_d(\\cdot,\\cdot)\\big)$, with\n\n$$\\mu_d(\\cdot) = \\eta_d(\\cdot) + k(\\cdot, Z) K_{zz}^{-1} (\\boldsymbol{\\mu}_d - \\boldsymbol{\\eta}_d), \\tag{7}$$\n$$v_d(\\cdot,\\cdot) = k(\\cdot,\\cdot) - k(\\cdot, Z) K_{zz}^{-1} [K_{zz} - \\Sigma_d] K_{zz}^{-1} k(Z, \\cdot). \\tag{8}$$\n\nThe approximation to the posterior of the state trajectory is assumed to have a Gauss-Markov structure:\n\n$$q(x_0) = \\mathcal{N}(x_0 \\mid m_0, L_0 L_0^\\top), \\qquad q(x_t \\mid x_{t-1}) = \\mathcal{N}(x_t \\mid A_t x_{t-1}, L_t L_t^\\top). \\tag{9}$$\n\nThis distribution is specified through a single mean vector $m_0$, a series of square matrices $A_t$, and\na series of lower-triangular matrices $L_t$. It serves as a locally linear approximation to an overall\nnon-linear posterior over the states. This is a good approximation provided that the $\\Delta t$ between the\ntransitions is sufficiently small.\nWith the approximating distributions for the variational posterior defined in eq. (7)\u2013(9), we are ready\nto derive the evidence lower bound (ELBO) on the model\u2019s true likelihood. Following (Frigola, 2015,\neq. (5.10)), the ELBO is given by\n\n$$\\begin{aligned} \\mathrm{ELBO} = {} & \\mathbb{E}_{q(x_0)}[\\log p(x_0)] + H[q(X)] - \\mathrm{KL}[q(U) \\,\\|\\, p(U)] \\\\\\\\ & + \\mathbb{E}_{q(X)}\\Big[ \\sum_{t=1}^{T} \\sum_{d=1}^{D} \\Big( -\\frac{1}{2\\sigma_f^2} v_d(\\tilde{x}_{t-1}, \\tilde{x}_{t-1}) + \\log \\mathcal{N}\\big(x_t^{(d)} \\mid \\mu_d(\\tilde{x}_{t-1}), \\sigma_f^2\\big) \\Big) \\Big] \\\\\\\\ & + \\mathbb{E}_{q(X)}\\Big[ \\sum_{t=1}^{T} \\log \\mathcal{N}\\big(y_t \\mid g(x_t), \\sigma_g^2 I_O\\big) \\Big], \\end{aligned} \\tag{10}$$\n\nwhere $\\mathrm{KL}[\\cdot \\,\\|\\, \\cdot]$ is the Kullback-Leibler divergence, and $H[\\cdot]$ denotes the entropy. Note that with\nthe above formulation we can naturally deal with multiple episodic data since the ELBO can be\nfactorised across independent episodes. We can now learn the GPSSM by optimising the ELBO\nw.r.t. the parameters of the model and the variational parameters. 
A full derivation is provided in the\nsupplementary material.\nThe form of the ELBO justifies the Markov-structure that we have assumed for the variational\ndistribution $q(X)$: we see that the latent states only interact over pairwise time steps $x_t$ and $x_{t-1}$;\nadding further structure to $q(X)$ is unnecessary.\n\n3.2 Efficient computation of the ELBO\nTo compute the ELBO in eq. (10), we need to compute expectations w.r.t. $q(X)$. Frigola et al.\n(2014) showed that for the RBF kernel the relevant expectations can be computed in closed form in\na similar way to Titsias and Lawrence (2010). To allow for general kernels we propose to use the\nreparameterisation trick (Kingma and Welling, 2014; Rezende et al., 2014) instead: by sampling\na single trajectory from $q(X)$ and evaluating the integrands in eq. (10), we obtain an unbiased\nestimate of the ELBO. To draw a sample from the Gauss-Markov structure in eq. (9), we first sample\n$\\epsilon_t \\sim \\mathcal{N}(0, I)$, $t = 0, \\ldots, T$, and then apply recursively the affine transformation\n\n$$x_0 = m_0 + L_0 \\epsilon_0, \\qquad x_t = A_t x_{t-1} + L_t \\epsilon_t. \\tag{11}$$\n\nThis simple estimator of the ELBO can then be used in optimisation using stochastic gradient methods;\nwe used the Adam optimizer (Kingma and Ba, 2015). It may seem initially counter-intuitive to use a\nstochastic estimate of the ELBO where one is available in closed form, but this approach offers two\ndistinct advantages. First, computation is dramatically reduced: our scheme requires $O(TD)$ storage\nin order to evaluate the integrand in eq. (10) at a single sample from $q(X)$. A scheme that computes\nthe integral in closed form requires $O(TM^2)$ (where $M$ is the number of inducing variables in the\nsparse GP) storage for the sufficient statistics of the kernel evaluations. 
The second advantage is that\nwe are no longer restricted to the RBF kernel, but can use any valid kernel for inference and learning\nin GPSSMs. The reparameterisation trick also allows us to perform batched updates of the model\nparameters, amounting to doubly stochastic variational inference (Titsias and L\u00e1zaro-Gredilla, 2014),\nwhich we experimentally found to improve run-time and sample-efficiency.\nSome of the elements of the ELBO in eq. (10) are still available in closed form. To reduce the\nvariance of the estimate of the ELBO we exploit this where possible: the entropy of the\nGauss-Markov structure is $H[q(X)] = \\frac{(T+1)D}{2} \\log(2\\pi e) + \\sum_{t=0}^{T} \\log(\\det(L_t))$; the expected likelihood (last\nterm in eq. (10)) can be computed easily given the marginals of $q(X)$, which are given by\n\n$$q(x_t) = \\mathcal{N}(m_t, \\Sigma_t), \\qquad m_t = A_t m_{t-1}, \\qquad \\Sigma_t = A_t \\Sigma_{t-1} A_t^\\top + L_t L_t^\\top, \\tag{12}$$\n\nand the necessary Kullback-Leibler divergences can be computed analytically: we use the\nimplementations from GPflow (Matthews et al., 2017).\n\n3.3 A recurrent recognition model\nThe variational distribution of the latent trajectories in eq. (9) has a large number of parameters\n($A_t$, $L_t$) that grows with the length of the dataset. Further, if we wish to train a model on multiple\nepisodes (independent data sequences sharing the same dynamics), then the number of parameters\ngrows further. To alleviate this, we propose to use a recognition model in the form of a bi-directional\nrecurrent neural network (bi-RNN), which is responsible for recovering the variational parameters\n$A_t$, $L_t$.\nA bi-RNN is a combination of two independent RNNs operating on opposite directions of the\nsequence. 
Each network is specified by two weight matrices $W$ acting on a hidden state $h$:\n\n$$h_t^{(f)} = \\phi\\big(W_h^{(f)} h_{t-1}^{(f)} + W_{\\tilde{y}}^{(f)} \\tilde{y}_t + b_h^{(f)}\\big), \\quad \\text{forward passing} \\tag{13}$$\n$$h_t^{(b)} = \\phi\\big(W_h^{(b)} h_{t+1}^{(b)} + W_{\\tilde{y}}^{(b)} \\tilde{y}_t + b_h^{(b)}\\big), \\quad \\text{backward passing} \\tag{14}$$\n\nwhere $\\tilde{y}_t = [y_t, a_t]$ denotes the concatenation of the observed data and control actions and the\nsuperscripts denote the direction (forward/backward) of the RNN. The activation function $\\phi$ (we use\nthe tanh function) acts on each element of its argument separately. In our experiments we found that\nusing gated recurrent units (Cho et al., 2014) improved performance of our model. We now make the\nparameters of the Gauss-Markov structure dependent on the sequences $h^{(f)}$, $h^{(b)}$, so that\n\n$$A_t = \\mathrm{reshape}\\big(W_A [h_t^{(f)}; h_t^{(b)}] + b_A\\big), \\qquad L_t = \\mathrm{reshape}\\big(W_L [h_t^{(f)}; h_t^{(b)}] + b_L\\big). \\tag{15}$$\n\nThe parameters of the Gauss-Markov structure $q(X)$ are now almost completely encapsulated in the\nrecurrent recognition model as $W_h^{(f,b)}$, $W_{\\tilde{y}}^{(f,b)}$, $W_A$, $W_L$, $b_h^{(f,b)}$, $b_A$, $b_L$. We only need to infer the\nparameters of the initial state, $m_0$, $L_0$ for each episode; this is where we utilise the functionality of the\nbi-RNN structure. Instead of directly learning the initial state $q(x_0)$, we can now obtain it indirectly\nvia the output state of the backward RNN. Another nice property of the proposed recognition model\nis that now $q(X)$ is recognised from both future and past observations, since the proposed bi-RNN\nrecognition model can be regarded as a forward and backward sequential smoother of our variational\nposterior. Finally, it is worth noting the interplay between the variational distribution $q(X)$ and the\nrecognition model. 
Recall that the variational distribution is a Bayesian linear approximation to the\nnon-linear posterior and is fully defined by the time-varying parameters, $A_t$, $L_t$; the recognition\nmodel has the role of recovering these parameters via the non-linear and time-invariant RNN.\n\n4 Experiments\n\nWe benchmark the proposed GPSSM approach on data from one illustrative example and three\nchallenging non-linear data sets of simulated and real data. Our aim is to demonstrate that we can: (i)\nbenefit from the use of non-smooth kernels with our approximate inference and accurately model\nnon-smooth transition functions; (ii) successfully learn non-linear dynamical systems even from\nnoisy and partially observed inputs; (iii) sample plausible future trajectories from the system even\nwhen trained with either a small number of episodes or long time sequences.\n\nFigure 2: The learnt state transition function with different kernels. The true function is given by eq. (16).\n[Panels: RBF, RBF + Matern, Arc-cosine, MGP; axes: $x_t$ (horizontal), $x_{t+1}$ (vertical); legend: GP posterior, inducing points, ground truth.]\n\n4.1 Non-linear system identification\nWe first apply our approach to a synthetic dataset generated broadly according to (Frigola et al.,\n2014). 
The data are created using a non-linear, non-smooth transition function with additive state and\nobservation noise according to: $p(x_{t+1} \\mid x_t) = \\mathcal{N}(f(x_t), \\sigma_f^2)$, and $p(y_t \\mid x_t) = \\mathcal{N}(x_t, \\sigma_g^2)$, where\n\n$$f(x_t) = \\begin{cases} x_t + 1, & \\text{if } x_t < 4, \\\\\\\\ 13 - 2x_t, & \\text{otherwise}. \\end{cases} \\tag{16}$$\n\nIn our experiments, we set the system and measurement noise variances to $\\sigma_f^2 = 0.01$ and $\\sigma_g^2 = 0.1$,\nrespectively, and generate 200 episodes of length 10 that were used as the observed data for training\nthe GPSSM. We used 20 inducing points (initialised uniformly across the range of the input data)\nfor approximating the GP and 20 hidden units for the recurrent recognition model. We evaluate the\nfollowing kernels: RBF, additive composition of the RBF (initial $\\ell = 10$) and Matern ($\\nu = \\frac{1}{2}$, initial\n$\\ell = 0.1$), 0-order arc-cosine (Cho and Saul, 2009), and the MGP kernel (Calandra et al., 2016) (depth\n5, hidden dimensions [3, 2, 3, 2, 3], tanh activation, Matern ($\\nu = \\frac{1}{2}$) compound kernel).\nThe learnt GP state transition functions are shown in Figure 2. With the non-smooth kernels we are\nable to learn accurate transitions and model the instantaneous dynamical change, as opposed to the\nsmooth transition learnt with the RBF. Note that all non-smooth kernels place inducing points directly\non the peak (at $x_t = 4$) to model the kink, whereas the RBF kernel explains this behaviour as a\nlonger-scale wiggliness of the posterior process. When using a kernel without the RBF component the GP\nposterior quickly reverts to the mean function ($\\eta(x) = x$) as we move away from the data: the short\nlength-scales that enable them to model the instantaneous change prevent them from extrapolating\ndownwards in the transition function. The composition of the RBF and Matern kernel benefits from\nlong and short length scales and can better extrapolate. 
The posteriors can be viewed across a longer\nrange of the function space in the supplementary material.\n\n4.2 Modelling cart-pole dynamics\n\nWe demonstrate the efficacy of the proposed GPSSM on learning the non-linear dynamics of the\ncart-pole system from (Deisenroth and Rasmussen, 2011). The system is composed of a cart running\non a track, with a freely swinging pendulum attached to it. The state of the system consists of the\ncart\u2019s position and velocity, and the pendulum\u2019s angle and angular velocity, while a horizontal force\n(action) $a \\in [-10, 10]$ N can be applied to the cart. We used the PILCO algorithm from (Deisenroth\nand Rasmussen, 2011) to learn a feedback controller that swings the pendulum and balances it in\nthe inverted position in the middle of the track. We collected trajectory data from 16 trials during\nlearning; each trajectory/episode was 4 s (40 time steps) long.\nWhen training the GPSSM for the cart-pole system we used data up to the first 15 episodes. We\nused 100 inducing points to approximate the GP function with a Matern ($\\nu = \\frac{1}{2}$) kernel and 50 hidden units\nfor the recurrent recognition model. The learning rate for the Adam optimiser was set to $10^{-3}$. 
We\nqualitatively assess the performance of our model by feeding the control sequence of the last episode\nto the GPSSM in order to generate future responses.\n\nFigure 3: Predicting the cart\u2019s position and pendulum\u2019s angle behaviour from the cart-pole dataset by applying\nthe control signal of the testing episode to sampled future trajectories from the proposed GPSSM. Learning of\nthe dynamics is demonstrated with observed (upper row) and hidden (lower row) velocities and with increasing\nnumber of training episodes. Ground truth is denoted with the marked lines.\n[Panels: 2 episodes (80 time steps in total), 8 episodes (320 time steps in total), 15 episodes (600 time steps in total); horizontal axis: time step; plotted quantities: cart position, angle, control signal.]\n\nIn Figure 3, we demonstrate the ability of the proposed GPSSM to learn the underlying dynamics\nof the system from different numbers of episodes with fully and partially observed data. In the top\nrow, the GPSSM observes the full 4D state, while in the bottom row, we train the GPSSM with only\nthe cart\u2019s position and the pendulum\u2019s angle observed (i.e., the true state is not fully observed since\nthe velocities are hidden). In both cases, sampling long-term trajectories based on only 2 episodes\nfor training does not result in plausible future trajectories. However, we could model part of the\ndynamics after training with only 8 episodes (320 time steps of interaction with the system), while\ntraining with 15 episodes (600 time steps in total) allowed the GPSSM to produce trajectories similar\nto the ground truth. 
It is worth emphasising the fact that the GPSSM could recover the unobserved\nvelocities in the latent states, which resulted in smooth transitions of the cart and swinging of the\npendulum. However, it seems that the recovered cart\u2019s velocity is overestimated. This is evidenced\nby the increased variance in the prediction of the cart\u2019s position around 0 (the centre of the track).\nDetailed fittings for each episode and learnt latent states with observed and hidden velocities are\nprovided in the supplementary material.\n\nTable 1: Average Euclidean distance between the true\nand the predicted trajectories, measured at the\npendulum\u2019s tip. The error is in pendulum\u2019s length units.\n\nMethod    2 episodes    8 episodes    15 episodes\nKalman    1.65          1.52          1.48\nARGP      1.22          1.03          0.80\nGPSSM     1.21          0.67          0.59\n\nFigure 4: Predictions with lagged actions.\n[Horizontal axis: time step; plotted quantities: cart position, angle, control signal.]\n\nIn Table 1, we provide the average Euclidean distance between the predicted and the true trajectories\nmeasured at the pendulum\u2019s tip, with fully observed states. We compare to two baselines: (i) the\nauto-regressive GP (ARGP) that maps the tuple $[y_{t-1}, a_{t-1}]$ to the next observation $y_t$ (as in PILCO\n(Deisenroth et al., 2015)), and (ii) a linear system for identification that uses the Kalman filtering\ntechnique (Kalman, 1960). We see that the GPSSM significantly outperforms the baselines on this\nhighly non-linear benchmark. The linear system cannot learn the dynamics at all, while the ARGP\nonly manages to produce sensible error (less than a pendulum\u2019s length) after seeing 15 episodes. 
Note that the GPSSM trained on 8 episodes produces trajectories with less error than the ARGP trained on\n15 episodes.\nWe also ran experiments using lagged actions where the partially observed state at time $t$ is affected\nby the action at $t - 2$. Figure 4 shows that we are able to sample future trajectories with an\naccuracy similar to time-aligned actions. This indicates that our model is able to learn a compressed\nrepresentation of the full state and previous inputs, essentially \u2018remembering\u2019 the lagged actions.\n\n4.3 Modelling double pendulum dynamics\n\nWe demonstrate the learning and modelling of the dynamics of the double pendulum system\nfrom (Deisenroth et al., 2015). The double pendulum is a two-link robot arm with two\nactuators. The state of the system consists of the angles and the corresponding angular velocities of the\ninner and outer link, respectively, while different torques $a_1, a_2 \\in [-2, 2]$ Nm can be applied to the\ntwo actuators. The task of swinging the double pendulum and balancing it in the upwards position\nis extremely challenging. First, it requires the interplay of two correlated control signals (i.e., the\ntorques). Second, the behaviour of the system, when operating at free will, is chaotic.\nWe learn the underlying dynamics from episodic data (15 episodes, 30 time steps long each). Training\nof the GPSSM was performed with data up to 14 episodes, while always demonstrating the learnt\nunderlying dynamics on the last episode, which serves as the test set. We used 200 inducing points to\napproximate the GP function with a Matern ($\\nu = \\frac{1}{2}$) kernel and 80 hidden units for the recurrent recognition\nmodel. The learning rate for the Adam optimiser was set to $10^{-3}$. The difficulty of the task is evident\nin Figure 5, where we can see that even after observing 14 episodes we cannot accurately predict\nthe system\u2019s future behaviour for more than 15 time steps (i.e., 1.5 s). 
It is worth noting that we can\ngenerate reliable simulation even though we observe only the pendulums\u2019 angles.\n\nFigure 5: Predicting the inner and outer pendulum\u2019s angle from the double pendulum dataset by\napplying the control signals of the testing episode to sampled future trajectories from the proposed\nGPSSM. Learning of the dynamics is demonstrated with observed (upper row) and hidden (lower\nrow) angular velocities and with increasing number of training episodes. Ground truth is denoted\nwith the marked lines. (Columns: 2, 8 and 14 training episodes; axes: inner angle, outer angle, inner torque and outer torque against time step.)\n\n4.4 Modelling actuator dynamics\n\nHere we evaluate the proposed GPSSM on real data from a hydraulic actuator that controls a robot\narm (Sj\u00f6berg et al., 1995). The input is the size of the actuator\u2019s valve opening and the output is\nits oil pressure. We train the GPSSM on half the sequence (512 steps) and evaluate the model on\nthe remaining half. We use 15 inducing points to approximate the GP function with a combination\nof an RBF and a Matern \u03bd = 1/2 and 15 hidden units for the recurrent recognition model. 
Figure 6\nshows the fitting on the train data along with sampled future predictions from the learnt system when\noperating in free simulation mode. It is worth noting that the model correctly captures the uncertainty\nat the points where the predictions are not accurate.\n\nFigure 6: Demonstration of the identified model that controls the non-linear dynamics of the actuator dataset.\nThe model\u2019s fitting on the train data and sampled future predictions, after applying the control signal to the\nsystem. Ground truth is denoted with the marked lines. (Legend: training, testing; lower panel: control signal; x-axis: time step.)\n\n5 Discussion and conclusion\n\nWe have proposed a novel inference mechanism for the GPSSM, in order to address the challenging\ntask of non-linear system identification. Since our inference is based on the variational framework,\nsuccessful learning of the model relies on defining good approximations to the posterior of the latent\nfunctions and states. Approximating the posterior over the dynamics with a sparse GP seems to be a\nreasonable choice given our assumptions over the transition function. However, the difficulty remains\nin the selection of the approximate posterior of the latent states. This is the key component that\nenables successful learning of the GPSSM.\nIn this work, we construct the variational posterior so that it follows the same Markov properties as\nthe true states. Furthermore, it is constrained to have a simple-to-learn, linear, time-varying structure. To\nensure, however, that this approximation has rich representational capacity, we proposed to recover the\nvariational parameters of the posterior via a non-linear recurrent recognition model. 
Consequently,\nthe joint approximate posterior resembles the behaviour of the true system, which facilitates the\neffective learning of the GPSSM.\nIn the experimental section we have provided evidence that the proposed approach is able to identify\nlatent dynamics in true and simulated data, even from partial and lagged observations, while requiring\nonly small data sets for this challenging task.\n\nAcknowledgement\nMarc P. Deisenroth has been supported by a Google faculty research award.\n\nReferences\nMaruan Al-Shedivat, Andrew G. Wilson, Yunus Saatchi, Zhiting Hu, and Eric P. Xing. Learning\nscalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936, 2016.\n\nMatthew J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University\nof London, London, UK, 2003.\n\nDavid M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians.\nJournal of the American Statistical Association, 112(518):859\u2013877, 2017.\n\nEmery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, and Matthew A. Wilson. A\nstatistical paradigm for neural spike train decoding applied to position prediction from ensemble\nfiring patterns of rat hippocampal place cells. Journal of Neuroscience, 18(18):7411\u20137425, 1998.\n\nRoberto Calandra, Jan Peters, Carl E. Rasmussen, and Marc P. Deisenroth. Manifold Gaussian\nprocesses for regression. In IEEE International Joint Conference on Neural Networks, 2016.\n\nKyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties\nof neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259,\n2014.\n\nYoungmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural\nInformation Processing Systems, pages 342\u2013350, 2009.\n\nZhenwen Dai, Andreas Damianou, Javier Gonz\u00e1lez, and Neil Lawrence. Variational auto-encoded\ndeep Gaussian processes. 
In International Conference on Learning Representations, 2015.\n\nMarc P. Deisenroth and Shakir Mohamed. Expectation propagation in Gaussian process dynamical\nsystems. In Advances in Neural Information Processing Systems, pages 2618\u20132626, 2012.\n\nMarc P. Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to\npolicy search. In International Conference on Machine Learning, pages 465\u2013472, 2011.\n\nMarc P. Deisenroth, Marco F. Huber, and Uwe D. Hanebeck. Analytic moment-based Gaussian\nprocess filtering. In International Conference on Machine Learning, pages 225\u2013232, 2009.\n\nMarc P. Deisenroth, Ryan D. Turner, Marco Huber, Uwe D. Hanebeck, and Carl E. Rasmussen.\nRobust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control,\n57(7):1865\u20131871, 2012.\n\nMarc P. Deisenroth, Dieter Fox, and Carl E. Rasmussen. Gaussian processes for data-efficient learning\nin robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence,\n37(2):408\u2013423, 2015.\n\nRoger Frigola. Bayesian time series learning with Gaussian processes. PhD thesis, University of\nCambridge, Cambridge, UK, 2015.\n\nRoger Frigola, Fredrik Lindsten, Thomas B. Sch\u00f6n, and Carl E. Rasmussen. Bayesian inference and\nlearning in Gaussian process state-space models with particle MCMC. In Advances in Neural\nInformation Processing Systems, pages 3156\u20133164, 2013.\n\nRoger Frigola, Yutian Chen, and Carl E. Rasmussen. Variational Gaussian process state-space models.\nIn Advances in Neural Information Processing Systems, pages 3680\u20133688, 2014.\n\nRudolf E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the\nAmerican Society of Mechanical Engineers, Journal of Basic Engineering, 82(D):35\u201345, 1960.\n\nDiederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
In International\nConference on Learning Representations, 2015.\n\nDiederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference\non Learning Representations, 2014.\n\nJonathan Ko and Dieter Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction\nand observation models. Autonomous Robots, 27(1):75\u201390, 2009a.\n\nJonathan Ko and Dieter Fox. Learning GP-BayesFilters via Gaussian process latent variable models.\nIn Robotics: Science and Systems, 2009b.\n\nJu\u0161 Kocijan. Modelling and control of dynamic systems using Gaussian process models. Springer,\n2016.\n\nBojan Likar and Ju\u0161 Kocijan. Predictive control of a gas-liquid separation plant based on a Gaussian\nprocess model. Computers & Chemical Engineering, 31(3):142\u2013152, 2007.\n\nLennart Ljung. System identification: Theory for the user. Prentice Hall, 1999.\n\nAlexander G. de G. Matthews. Scalable Gaussian process inference using variational methods. PhD\nthesis, University of Cambridge, Cambridge, UK, 2017.\n\nAlexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On\nsparse variational methods and the Kullback-Leibler divergence between stochastic processes. In\nInternational Conference on Artificial Intelligence and Statistics, volume 51 of JMLR W&CP,\npages 231\u2013239, 2016.\n\nAlexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas,\nPablo Le\u00f3n-Villagr\u00e1, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process\nlibrary using TensorFlow. Journal of Machine Learning Research, 18(40):1\u20136, 2017.\n\nC\u00e9sar Lincoln C. Mattos, Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A. Barreto,\nand Neil D. Lawrence. Recurrent Gaussian processes. In International Conference on Learning\nRepresentations, 2015.\n\nRoderick Murray-Smith and Agathe Girard. Gaussian process priors with ARMA noise models. 
In\nIrish Signals and Systems Conference, pages 147\u2013152, 2001.\n\nCarl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The\nMIT Press, Cambridge, MA, USA, 2006.\n\nDanilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate\ninference in deep generative models. In International Conference on Machine Learning,\npages 1278\u20131286, 2014.\n\nStephen Roberts, Michael Osborne, Mark Ebden, Steven Reece, Neale Gibson, and Suzanne Aigrain.\nGaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A,\n371(1984):20110550, 2013.\n\nJeff G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In\nAdvances in Neural Information Processing Systems, 1997.\n\nJonas Sj\u00f6berg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves\nGlorennec, H\u00e5kan Hjalmarsson, and Anatoli Juditsky. Nonlinear black-box modeling in system\nidentification: A unified overview. Automatica, 31(12):1691\u20131724, 1995.\n\nMichalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In\nInternational Conference on Artificial Intelligence and Statistics, volume 5 of JMLR W&CP, pages\n567\u2013574, 2009.\n\nMichalis K. Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In\nInternational Conference on Artificial Intelligence and Statistics, volume 9 of JMLR W&CP, pages\n844\u2013851, 2010.\n\nMichalis K. Titsias and Miguel L\u00e1zaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate\ninference. In International Conference on Machine Learning, pages 1971\u20131979, 2014.\n\nRyan D. Turner. Gaussian processes for state space models and change point detection. PhD thesis,\nUniversity of Cambridge, Cambridge, UK, 2011.\n\nRyan D. Turner, Marc P. Deisenroth, and Carl E. Rasmussen. 
State-space inference and learning with\nGaussian processes. In International Conference on Artificial Intelligence and Statistics, volume 9\nof JMLR W&CP, pages 868\u2013875, 2010.\n\nJack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human\nmotion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283\u2013298, 2008.\n", "award": [], "sourceid": 2748, "authors": [{"given_name": "Stefanos", "family_name": "Eleftheriadis", "institution": "PROWLER.io"}, {"given_name": "Tom", "family_name": "Nicholson", "institution": "PROWLER.IO"}, {"given_name": "Marc", "family_name": "Deisenroth", "institution": "Imperial College London"}, {"given_name": "James", "family_name": "Hensman", "institution": "PROWLER.io"}]}