{"title": "Variational Inference for Diffusion Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 24, "abstract": "Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multi-modal. We propose a variational treatment of diffusion processes, which allows us to estimate these parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. Furthermore, our parameter inference scheme does not break down when the time step gets smaller, unlike most current approaches. Finally, we show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy.", "full_text": "Variational Inference for Diffusion Processes\n\nC\u00e9dric Archambeau\n\nUniversity College London\n\nc.archambeau@cs.ucl.ac.uk\n\nManfred Opper\n\nTechnical University Berlin\n\nopperm@cs.tu-berlin.de\n\nYuan Shen\n\nAston University\n\ny.shen2@aston.ac.uk\n\nDan Cornford\nAston University\n\nd.cornford@aston.ac.uk\n\nAbstract\n\nJohn Shawe-Taylor\n\nUniversity College London\n\njst@cs.ucl.ac.uk\n\nDiffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multi-modal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. 
We also show how a cheap estimate of the posterior over the\nparameters can be constructed based on the variational free energy.\n\n1 Introduction\n\nContinuous-time diffusion processes, described by stochastic differential equations (SDEs), arise\nnaturally in a range of applications from environmental modelling to mathematical \ufb01nance [13]. In\nstatistics the problem of Bayesian inference for both the state and parameters, within partially ob-\nserved, non-linear diffusion processes has been tackled using Markov Chain Monte Carlo (MCMC)\napproaches based on data augmentation [17, 11], Monte Carlo exact simulation methods [6], or\nLangevin / hybrid Monte Carlo methods [1, 3]. Within the signal processing community solutions\nto the so called Zakai equation [12] based on particle \ufb01lters [8], a variety of extensions to the Kalman\n\ufb01lter/smoother [2, 5] and mean \ufb01eld analysis of the SDE together with moment closure methods [10]\nhave also been proposed. In this work we develop a novel variational approach to the problem of\napproximate inference in continuous-time diffusion processes, including a marginal likelihood (ev-\nidence) based inference technique for the forcing parameters. In general, joint parameter and state\ninference using naive methods is complicated due to dependencies between state and system noise\nparameters.\nWe work in continuous time, computing distributions over sample paths1, and discretise only in\nour posterior approximation, which has advantages over methods based on discretising the SDE\ndirectly [3]. The approximate inference approach we describe is more computationally ef\ufb01cient than\ncompeting Monte Carlo algorithms and could be further improved in speed by de\ufb01ning a variety\nof sub-optimal approximations. The approximation is also more accurate than existing Kalman\nsmoothing methods applied to non-linear systems [4]. 
Ultimately, we are motivated by the critical requirement to estimate parameters within large environmental models, where at present only a small number of Kalman filter/smoother based estimation algorithms have been attempted [2], and there have been no likelihood-based attempts to estimate the system noise forcing parameters.\n\n1A sample path is a continuous-time realisation of a stochastic process in a certain time interval. Hence, a sample path is an infinite-dimensional object.\n\nIn Sections 2 and 3, we introduce the formalism for a variational treatment of partially observed diffusion processes with measurement noise and we provide the tools to estimate the optimal variational posterior process [4]. Section 4 deals with the estimation of the drift and the system noise parameter, as well as the estimation of the optimal initial conditions. Finally, the approach is validated on a bi-stable nonlinear system in Section 5. In this context, we also discuss how to construct an estimate of the posterior distribution over parameters based on the variational free energy.\n\n2 Diffusion processes with measurement error\nConsider the continuous-time continuous-state stochastic process X = {Xt, t0 \u2264 t \u2264 tf}. We assume this process is a d-dimensional diffusion process. Its time evolution is described by the following SDE (to be interpreted as an Ito stochastic integral):\n\ndXt = f\u03b8(t, Xt) dt + \u03a3^{1/2} dWt,    dWt \u223c N (0, dt I).    (1)\n\nThe nonlinear vector function f\u03b8 defines the deterministic drift and the positive semi-definite matrix \u03a3 \u2208 Rd\u00d7d is the system noise covariance. The diffusion is modelled by a d-dimensional Wiener process W = {Wt, t0 \u2264 t \u2264 tf} (see e.g. [13] for a formal definition). Eq. (1) defines a process with additive system noise. This might seem restrictive at first sight. 
However, it can be shown [13, 17, 6] that a range of state-dependent stochastic forcings can be transformed into this form.\nIt is further assumed that only a small number of discrete-time latent states are observed and that the observations are subject to measurement error. We denote the set of observations at the discrete times {t_n}_{n=1}^N by Y = {y_n}_{n=1}^N and the corresponding latent states by {x_n}_{n=1}^N, with x_n = X_{t=t_n}. For simplicity, the measurement noise is modelled by a zero-mean multivariate Gaussian density, with covariance matrix R \u2208 Rd\u00d7d.\n\n3 Approximate inference for diffusion processes\n\nOur approximate inference scheme builds on [4] and is based on a variational inference approach (see for example [7]). The aim is to minimise the variational free energy, which is defined as follows:\n\nF_\u03a3(q, \u03b8) = \u2212\u27e8 ln [ p(Y, X|\u03b8, \u03a3) / q(X|\u03a3) ] \u27e9_q,    X = {Xt, t0 \u2264 t \u2264 tf},    (2)\n\nwhere q(X|\u03a3) is an approximate posterior process over sample paths in the interval [t0, tf] and \u03b8 are the parameters, excluding the stochastic forcing covariance matrix \u03a3. Hence, this quantity is an upper bound to the negative log-marginal likelihood:\n\n\u2212 ln p(Y|\u03b8, \u03a3) = F_\u03a3(q, \u03b8) \u2212 KL[q(X|\u03a3) || p(X|Y, \u03b8, \u03a3)] \u2264 F_\u03a3(q, \u03b8).    (3)\n\nAs noted in Appendix A, this bound is finite if the approximate process is another diffusion process with a system noise covariance chosen to be identical to that of the prior process induced by (1). The standard approach for learning the parameters in the presence of latent variables is to use an EM-type algorithm [9]. 
However, since the variational distribution is restricted to have the same system noise covariance (see Appendix A) as the true posterior, the EM algorithm would leave this covariance completely unchanged in the M step and cannot be used for learning this crucial parameter. Therefore, we adopt a different approach, which is based on a conjugate gradient method.\n\n3.1 Optimal approximate posterior process\n\nWe consider an approximate time-varying linear process with the same diffusion term, that is the same system noise covariance:\n\ndXt = g(t, Xt) dt + \u03a3^{1/2} dWt,    dWt \u223c N (0, dt I),    (4)\n\nwhere g(t, x) = \u2212A(t)x + b(t), with A(t) \u2208 Rd\u00d7d and b(t) \u2208 Rd. In other words, the approximate posterior process q(X|\u03a3) is restricted to be a Gaussian process [4]. The Gaussian marginal at time t is defined as follows:\n\nq(Xt|\u03a3) = N (Xt|m(t), S(t)),    t0 \u2264 t \u2264 tf,    (5)\n\nwhere m(t) \u2208 Rd and S(t) \u2208 Rd\u00d7d are respectively the marginal mean and the marginal covariance at time t. In the rest of the paper, we denote q(Xt|\u03a3) by the shorthand notation qt.\nFor fixed parameters \u03b8 and assuming that there is no observation at the initial time t0, the optimal approximate posterior process q(X|\u03a3) is the one minimizing the variational free energy, which is given by (see Appendix A)\n\nF_\u03a3(q, \u03b8) = \u222b_{t0}^{tf} E_sde(t) dt + \u222b_{t0}^{tf} \u2211_n E_obs(t) \u03b4(t \u2212 t_n) dt + KL[q0 || p0].    (6)\n\nThe function \u03b4(t) is Dirac\u2019s delta function. 
The energy functions Esde(t) and Eobs(t) are defined as follows:\n\nE_sde(t) = (1/2) \u27e8(f\u03b8(t, Xt) \u2212 g(t, Xt))^\u22a4 \u03a3^{-1} (f\u03b8(t, Xt) \u2212 g(t, Xt))\u27e9_{qt},    (7)\n\nE_obs(t) = (1/2) \u27e8(Yt \u2212 Xt)^\u22a4 R^{-1} (Yt \u2212 Xt)\u27e9_{qt} + (d/2) ln 2\u03c0 + (1/2) ln|R|,    (8)\n\nwhere {Yt, t0 \u2264 t \u2264 tf} is the underlying continuous-time observable process.\n\n3.2 Smoothing algorithm\n\nThe variational parameters to optimise in order to find the optimal Gaussian process approximation are A(t), b(t), m(t) and S(t). For a linear SDE with additive system noise, it can be shown that the time evolution of the means and the covariances are described by a set of ordinary differential equations [13, 4]:\n\ndm(t)/dt = \u2212A(t)m(t) + b(t),    (9)\ndS(t)/dt = \u2212A(t)S(t) \u2212 S(t)A^\u22a4(t) + \u03a3,    (10)\n\nwhere d/dt denotes the time derivative. These equations provide us with consistency constraints for the marginal means and the marginal covariances along sample paths. 
To enforce these constraints we formulate the Lagrangian\n\nL_{\u03b8,\u03a3} = F_\u03a3(q, \u03b8) \u2212 \u222b_{t0}^{tf} \u03bb^\u22a4(t) (dm(t)/dt + A(t)m(t) \u2212 b(t)) dt \u2212 \u222b_{t0}^{tf} tr{\u03a8(t) (dS(t)/dt + 2A(t)S(t) \u2212 \u03a3)} dt,    (11)\n\nwhere \u03bb(t) \u2208 Rd and \u03a8(t) \u2208 Rd\u00d7d are time-dependent Lagrange multipliers, with \u03a8(t) symmetric.\nFirst, taking the functional derivatives of L_{\u03b8,\u03a3} with respect to A(t) and b(t) results in the following gradient functions:\n\n\u2207_A L_{\u03b8,\u03a3}(t) = \u2207_A E_sde(t) \u2212 \u03bb(t)m^\u22a4(t) \u2212 2\u03a8(t)S(t),    (12)\n\u2207_b L_{\u03b8,\u03a3}(t) = \u2207_b E_sde(t) + \u03bb(t).    (13)\n\nThe gradients \u2207_A E_sde(t) and \u2207_b E_sde(t) are derived in Appendix B.\nSecondly, taking the functional derivatives of L_{\u03b8,\u03a3} with respect to m(t) and S(t), setting to zero and rearranging leads to a set of ordinary differential equations, which describe the time evolution of the Lagrange multipliers, along with jump conditions when there are observations:\n\nd\u03bb(t)/dt = \u2212\u2207_m E_sde(t) + A^\u22a4(t)\u03bb(t),    \u03bb_n^+ = \u03bb_n^- \u2212 \u2207_m E_obs(t)|_{t=t_n},    (14)\nd\u03a8(t)/dt = \u2212\u2207_S E_sde(t) + 2\u03a8(t)A(t),    \u03a8_n^+ = \u03a8_n^- \u2212 \u2207_S E_obs(t)|_{t=t_n}.    (15)\n\nThe optimal variational functions can be computed by means of a gradient descent technique, such as the conjugate gradient (see e.g., [16]). The explicit gradients with respect to A(t) and b(t) are given by (12) and (13). Since m(t), S(t), \u03bb(t) and \u03a8(t) are dependent on these parameters, one needs also to take the corresponding implicit derivatives into account. However, these implicit gradients vanish if the consistency constraints for the means (9) and the covariances (10), as well as the ones for the Lagrange multipliers (14-15), are satisfied. 
One way to achieve this is to perform a forward propagation of the means and the covariances, followed by a backward propagation of the Lagrange multipliers, and then to take a gradient step. The resulting algorithm for computing the optimal posterior q(X|\u03a3) over sample paths is detailed in Algorithm 1.\n\nAlgorithm 1 Compute the optimal q(X|\u03a3).\n1: input(m0, S0, \u03b8, \u03a3, t0, tf, \u2206t, \u03c9)\n2: K \u2190 (tf \u2212 t0)/\u2206t\n3: initialise {Ak, bk}k\u22650\n4: repeat\n5:   for k = 0 to K \u2212 1 do\n6:     mk+1 \u2190 mk \u2212 (Ak mk \u2212 bk)\u2206t\n7:     Sk+1 \u2190 Sk \u2212 (Ak Sk + Sk Ak^\u22a4 \u2212 \u03a3)\u2206t\n8:   end for {forward propagation}\n9:   for k = K to 1 do\n10:     \u03bbk\u22121 \u2190 \u03bbk + (\u2207_m E_sde|t=tk \u2212 Ak^\u22a4 \u03bbk)\u2206t\n11:     \u03a8k\u22121 \u2190 \u03a8k + (\u2207_S E_sde|t=tk \u2212 2\u03a8k Ak)\u2206t\n12:     if observation at tk\u22121 then\n13:       \u03bbk\u22121 \u2190 \u03bbk\u22121 + \u2207_m E_obs|t=tk\u22121\n14:       \u03a8k\u22121 \u2190 \u03a8k\u22121 + \u2207_S E_obs|t=tk\u22121\n15:     end if {jumps}\n16:   end for {backward sweep (adjoint operation)}\n17:   update {Ak, bk}k\u22650 using the gradient functions (12) and (13)\n18: until minimum of L_{\u03b8,\u03a3} is attained {optimisation loop}\n19: return {Ak, bk, mk, Sk, \u03bbk, \u03a8k}k\u22650\n\n4 Parameter estimation\n\nThe parameters to optimise include the parameters of the prior over the initial state, the drift function parameters and the system noise covariance. The estimation of the parameters related to the observable process is not discussed in this work, although it is a straightforward extension.\nThe smoothing algorithm described in the previous section computes the optimal posterior process by providing us with the stationary solution functions A(t) and b(t). 
Therefore, when subsequently optimising the parameters we only need to compute their explicit derivatives; the implicit ones vanish since \u2207_A L_{\u03b8,\u03a3} = 0 and \u2207_b L_{\u03b8,\u03a3} = 0. Before computing the gradients, we integrate (11) by parts to make the boundary conditions explicit. This leads to\n\nL_{\u03b8,\u03a3} = F_\u03a3(q, \u03b8) \u2212 \u222b_{t0}^{tf} {\u03bb^\u22a4(t)(A(t)m(t) \u2212 b(t)) \u2212 (d\u03bb(t)/dt)^\u22a4 m(t)} dt \u2212 \u03bb_f^\u22a4 m_f + \u03bb_0^\u22a4 m_0 \u2212 \u222b_{t0}^{tf} tr{\u03a8(t)(2A(t)S(t) \u2212 \u03a3) \u2212 (d\u03a8(t)/dt) S(t)} dt \u2212 tr{\u03a8_f S_f} + tr{\u03a8_0 S_0}.    (16)\n\nAt the final time tf, there are no consistency constraints, that is \u03bb_f and \u03a8_f are both equal to zero.\n\n4.1 Initial state\n\nThe initial variational posterior q(x0) is chosen equal to N (x0|m0, S0) to ensure that the approximate process is a Gaussian one. Taking the derivatives of (16) with respect to m0 and S0 results in the following expressions:\n\n\u2207_{m0} L_{\u03b8,\u03a3} = \u03bb_0 + \u03c4_0^{-1}(m_0 \u2212 \u00b5_0),    \u2207_{S0} L_{\u03b8,\u03a3} = \u03a8_0 + (1/2)(\u03c4_0^{-1} I \u2212 S_0^{-1}),    (17)\n\nwhere the prior p(x0) is assumed to be an isotropic Gaussian density with mean \u00b50. Its variance \u03c40 is taken sufficiently large to give a broad prior.\n\n4.2 Drift\n\nThe gradients for the drift function parameters \u03b8f only depend on the total energy associated with the SDE. Their general expression is given by\n\n\u2207_{\u03b8f} L_{\u03b8,\u03a3} = \u222b_{t0}^{tf} \u2207_{\u03b8f} E_sde(t) dt,    (18)\n\nwhere \u2207_{\u03b8f} E_sde(t) = \u27e8(f\u03b8(t, Xt) \u2212 g(t, Xt))^\u22a4 \u03a3^{-1} \u2207_{\u03b8f} f\u03b8(t, Xt)\u27e9_{qt}. Note that the observations do play a role in this gradient as they enter through g(t, Xt) and the expectation w.r.t. 
q(Xt|\u03a3).\n\n4.3 System noise\n\nEstimating the system noise covariance (or volatility) is essential as the system noise, together with the drift function, determines the dynamics. In general, this parameter is difficult to estimate using an MCMC approach because the efficiency is strongly dependent on the discrete approximation of the SDE and most methods break down when the time step \u2206t gets too small [11, 6]. For example, in a Bayesian MCMC approach, which alternates between sampling paths and parameters, the latent paths imputed between observations must have a system noise parameter which is arbitrarily close to its previous value in order to be accepted by a Metropolis sampler. Hence, the algorithm becomes extremely slow. Note that, for the same reason, a naive EM algorithm within our approach breaks down. However, in our method, we can simply compute approximations to the marginal likelihood and its gradient directly. In the next section, we will compare our results to a direct MCMC estimate of the marginal likelihood, which is a time-consuming method.\nThe gradient of (16) with respect to \u03a3 is given by\n\n\u2207_\u03a3 L_{\u03b8,\u03a3} = \u222b_{t0}^{tf} \u2207_\u03a3 E_sde(t) dt + \u222b_{t0}^{tf} \u03a8(t) dt,    (19)\n\nwhere \u2207_\u03a3 E_sde(t) = \u2212(1/2) \u03a3^{-1} \u27e8(f\u03b8(t, Xt) \u2212 g(t, Xt))(f\u03b8(t, Xt) \u2212 g(t, Xt))^\u22a4\u27e9_{qt} \u03a3^{-1}.\n\n5 Experimental validation on a bi-stable system\n\nIn order to validate the approach, we consider the one-dimensional double-well system:\n\nf\u03b8(t, x) = 4x(\u03b8 \u2212 x^2),    \u03b8 > 0,    (20)\n\nwhere f\u03b8(t, x) is the drift function. This dynamical system is highly nonlinear and its stationary distribution is multi-modal. 
It has two stable states, one in x = \u2212\u221a\u03b8 and one in x = +\u221a\u03b8. The system is driven by the system noise, which makes it occasionally flip from one well to the other.\nIn the experiments, we set the drift parameter \u03b8 to 1, the system noise standard deviation \u03c3 to 0.5 and the measurement error standard deviation r to 0.2. The time step for the variational approximation is set to \u2206t = 0.01, which is identical to the time resolution used to generate the original sample path. In this setting, the exit time from one of the wells is 4000 time units [15]. In other words, the transition from one well to the other is highly unlikely in the window of roughly 8 time units that we consider and where a transition occurs.\nFigure 1(a) compares the variational solution to the outcomes of a hybrid MCMC simulation of the posterior process using the true parameter values. The hybrid MCMC approach was proposed in [1]. At each step of the sampling process, an entire sample path is generated. In order to keep the acceptance of new paths sufficiently high, the basic MCMC algorithm is combined with ideas from Molecular Dynamics, such that the MCMC sampler moves towards regions of high probability in the state space. An important drawback of MCMC approaches is that it might be extremely difficult to monitor their convergence and that they may require a very large number of samples before actually converging. In particular, over 100,000 sample paths were necessary to reach convergence in the case of the double-well system.\nThe solution provided by the hybrid MCMC is here considered as the baseline solution. One can observe that the variational solution underestimates the uncertainty (smaller error bars). Nevertheless, the time of the transition is correctly located. 
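As a minimal sketch of the experimental setup above (not the authors' code), the double-well SDE (20) can be simulated with an Euler-Maruyama scheme and corrupted with Gaussian measurement noise; the observation spacing below is our own assumption, since the paper does not state it in this excerpt:

```python
import numpy as np

# Euler-Maruyama simulation of dX = 4*X*(theta - X**2) dt + sigma dW,
# using the experimental settings of the paper: theta=1, sigma=0.5,
# r=0.2, dt=0.01, over a window of roughly 8 time units.
rng = np.random.default_rng(0)
theta, sigma, r, dt = 1.0, 0.5, 0.2, 0.01
T = 8.0
K = int(T / dt)
x = np.empty(K + 1)
x[0] = 1.0                    # start in the right-hand well
for k in range(K):
    drift = 4.0 * x[k] * (theta - x[k] ** 2)
    x[k + 1] = x[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()

# Sparse noisy observations y_n = x(t_n) + N(0, r^2); one observation
# every 0.5 time units is an illustrative choice, not taken from the paper.
obs_every = 50
t_obs = np.arange(obs_every, K + 1, obs_every) * dt
y_obs = x[obs_every::obs_every] + r * rng.standard_normal(t_obs.size)
print(t_obs.size, "noisy observations of the latent path")
```

Paths generated this way stay near the wells at +/- sqrt(theta) and only rarely cross, which is what makes the smoothing problem multi-modal.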
Convergence of the smoothing algorithm was achieved in approximately 180 conjugate gradient steps, each one involving a forward and backward sweep. The optimal parameters and the optimal initial conditions for the variational solution are given by\n\n\u02c6\u03b8f = 0.85,    \u02c6\u03c3 = 0.72,    \u02c6m0 = 0.88,    \u02c6s0 = 0.45.    (21)\n\nConvergence of the outer optimization loop is typically reached after fewer than 10 conjugate gradient steps. While the estimated value for the drift parameter is within 15% of its true value, the deviation of the system noise is worse. Deviations may be explained by the fact that the number of observations is relatively small. Furthermore, we have chosen a sample path which contains a transition between the two wells within a small time interval and is thus highly untypical with respect to the prior distribution. This fact was experimentally assessed by estimating the parameters on a sample path without transition, in a time window of the same size. In this case, we obtained estimates roughly within 5% of the true parameter values: \u02c6\u03c3 = 0.46 and \u02c6\u03b8f = 0.92. Finally, it turns out that our estimate for \u02c6\u03c3 is close to the one obtained from the MCMC approach as discussed next.\n\nFigure 1: (a) Variational solution (solid) compared to the hybrid MCMC solution (dashed), using the true parameter values. The curves denote the mean paths and the shaded regions are the two-standard deviation noise tubes. (b) Posterior of the system noise variance (diffusion term). The plain curve and the dashed curve are respectively the approximations of the posterior shape based on the variational free energy and MCMC.\n\nPosterior distribution over the parameters\nInterestingly, minimizing the free energy F_{\u03c3^2} for different values of \u03c3 provides us with much more information than a single point estimate for the parameters [14]. 
Using a suitable prior p(\u03c3^2), we can approximate the posterior over the system noise variance via\n\np(\u03c3^2|Y) \u221d e^{\u2212F_{\u03c3^2}} p(\u03c3^2),    (22)\n\nwhere we take e^{\u2212F_{\u03c3^2}} (at its minimum) as an approximation to the marginal likelihood of the observations p(Y|\u03c3^2). To illustrate this point, we assume a non-informative Gamma prior p(\u03c3^2) = G(\u03b1, \u03b2), with \u03b1 = 10^{\u22123} and \u03b2 = 10^{\u22123}. A comparison with preliminary MCMC estimates for p(Y|\u03c3^2) for \u03b8 = 1 and a set of system noise variances indicates that the shape of our approximation is a reasonable indicator of the shape of the posterior. Figure 1(b) shows that at least the mean and the variance of the density come out fairly well.\n\n6 Conclusion\n\nWe have presented a variational approach to the approximate inference of stochastic differential equations from a finite set of noisy observations. So far, we have tested the method on a one-dimensional bi-stable system only. Comparison with a Monte Carlo approach suggests that our method can reproduce the posterior mean fairly well but underestimates the variance in the region of the transition. Parameter estimates also agree well with the MC predictions.\nIn the future, we will extend our method in various directions. Although our approach is based on a Gaussian approximation of the posterior process, we expect that one can improve on it and obtain non-Gaussian predictions at least for various marginal posterior distributions, including that of the latent variable Xt at a fixed time t. This should be possible by generalising our method for the computation of a non-Gaussian shaped probability density for the system noise parameter using the free energy. 
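As a toy illustration (our own sketch, not the paper's implementation) of how minimised free-energy values can be turned into the posterior approximation of Eq. (22): in practice each free-energy value comes from running the smoothing algorithm to convergence for a candidate sigma^2, whereas the quadratic profile below is a made-up placeholder.

```python
import numpy as np
from math import lgamma

# Eq. (22): p(sigma^2 | Y) is proportional to exp(-F(sigma^2)) * p(sigma^2).
sigma2_grid = np.linspace(0.1, 0.9, 9)
F = 0.5 * ((sigma2_grid - 0.5) / 0.15) ** 2   # placeholder free energies (assumption)

def log_gamma_prior(x, alpha=1e-3, beta=1e-3):
    # log of the non-informative Gamma prior G(alpha, beta) used in the paper
    return alpha * np.log(beta) - lgamma(alpha) + (alpha - 1.0) * np.log(x) - beta * x

log_post = -F + log_gamma_prior(sigma2_grid)
log_post -= log_post.max()                     # guard against underflow
post = np.exp(log_post)
dx = sigma2_grid[1] - sigma2_grid[0]
post /= post.sum() * dx                        # normalise on the grid
mean_sigma2 = float((sigma2_grid * post).sum() * dx)
print("approximate posterior mean of sigma^2:", mean_sigma2)
```

Working in log space and shifting by the maximum before exponentiating keeps the renormalisation numerically stable even when the free energies differ by hundreds of nats.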
An important extension of our method will be to systems with many degrees of freedom. We hope that the possibility of using simpler suboptimal parametrisations of the approximating Gaussian process will allow us to obtain a tractable inference method that scales well to higher dimensions.\n\nAcknowledgments\n\nThis work has been funded by the EPSRC as part of the Variational Inference for Stochastic Dynamic Environmental Models (VISDEM) project (EP/C005848/1).\n\nA The Kullback-Leibler divergence interpreted as a path integral\n\nIn this section, we show that the Kullback-Leibler divergence between the posterior process p(Xt|Y, \u03b8, \u03a3) and its approximation q(X|\u03a3) can be interpreted as a path integral over time. It is an average over all possible realisations, called sample paths, of the continuous-time (i.e., infinite-dimensional) random variable described by the SDE in the time interval under consideration.\nConsider the Euler-Maruyama discrete approximation (see for example [13]) of the SDE (1) and its linear approximation (4):\n\n\u2206xk = fk \u2206t + \u03a3^{1/2} \u2206wk,    (23)\n\u2206xk = gk \u2206t + \u03a3^{1/2} \u2206wk,    (24)\n\nwhere \u2206xk \u2261 xk+1 \u2212 xk and \u2206wk \u223c N (0, \u2206t I). The vectors fk and gk are shorthand notations for f\u03b8(tk, xk) and g(tk, xk). Hence, the joint distributions of discrete sample paths {xk}k\u22650 for the true process and its approximation follow from the Markov property:\n\np(x0, . . . , xK|\u03a3) = p(x0) \u220f_{k>0} N (xk+1|xk + fk\u2206t, \u03a3\u2206t),    (25)\nq(x0, . . . , xK|\u03a3) = q(x0) \u220f_{k>0} N (xk+1|xk + gk\u2206t, \u03a3\u2206t),    (26)\n\nwhere p(x0) is the prior on the initial state x0 and q(x0) is assumed to be Gaussian. 
Note that we do not restrict the variational posterior to factorise over the latent states.\nThe Kullback-Leibler divergence between the two discretized prior processes is given by\n\nKL[q || p] = KL[q(x0) || p(x0)] + \u2211_{k>0} \u222b q(xk) KL[q(xk+1|xk) || p(xk+1|xk)] dxk\n= KL[q(x0) || p(x0)] + (1/2) \u2211_{k>0} \u27e8(fk \u2212 gk)^\u22a4 \u03a3^{-1} (fk \u2212 gk)\u27e9_{q(xk)} \u2206t,\n\nwhere we omitted the conditional dependency on \u03a3 for simplicity. The second term on the right-hand side is a sum in \u2206t. As a result, taking limits for \u2206t \u2192 0 leads to a proper Riemann integral, which defines an integral over the average sample path:\n\nKL[q(X|\u03a3) || p(X|\u03b8, \u03a3)] = KL[q0 || p0] + (1/2) \u222b_{t0}^{tf} \u27e8(ft \u2212 gt)^\u22a4 \u03a3^{-1} (ft \u2212 gt)\u27e9_{qt} dt,    (27)\n\nwhere X = {Xt, t0 \u2264 t \u2264 tf} denotes the stochastic process in the interval [t0, tf]. The distribution qt = q(Xt|\u03a3) is the marginal at time t for a given system noise covariance \u03a3.\nIt is important to realise that the KL between the induced prior process and its approximation is finite because the system noise covariances are chosen to be identical. If this was not the case, the normalizing constants of p(xk+1|xk) and q(xk+1|xk) would not cancel. 
This would result in KL \u2192 \u221e when \u2206t \u2192 0.\nIf we assume that the observations are i.i.d., it follows also that\n\nF_\u03a3(q, \u03b8) = \u2212\u2211_n \u27e8ln p(yn|xn)\u27e9_{q(xn)} + KL[q(X|\u03a3) || p(X|\u03b8, \u03a3)].\n\nClearly, minimising this expression with respect to the variational parameters for a given system noise \u03a3 and for a fixed parameter vector \u03b8 is equivalent to minimising the KL between the variational posterior q(X|\u03a3) and the true posterior p(X|Y, \u03b8, \u03a3), since the normalizing constant is independent of sample paths.\n\nB The gradient functions\n\nThe general expressions for the gradients of Esde(t) with respect to the variational functions are given by\n\n\u2207_A E_sde(t) = \u03a3^{-1} {\u27e8\u2207_x f\u03b8(t, Xt)\u27e9_{qt} + A(t)} S(t) \u2212 \u2207_b E_sde(t) m^\u22a4(t),    (28)\n\u2207_b E_sde(t) = \u03a3^{-1} {\u2212\u27e8f\u03b8(t, Xt)\u27e9_{qt} \u2212 A(t)m(t) + b(t)},    (29)\n\nwhere the identity \u27e8f\u03b8(t, Xt)(Xt \u2212 m(t))^\u22a4\u27e9_{qt} = \u27e8\u2207_x f\u03b8(t, Xt)\u27e9_{qt} S(t) is invoked in order to obtain (28).\n\nReferences\n[1] F. J. Alexander, G. L. Eyink, and J. M. Restrepo. Accelerated Monte Carlo for optimal estimation of time series. Journal of Statistical Physics, 119:1331\u20131345, 2005.\n[2] J. D. Annan, J. C. Hargreaves, N. R. Edwards, and R. Marsh. Parameter estimation in an intermediate complexity earth system model using an ensemble Kalman filter. Ocean Modelling, 8:135\u2013154, 2005.\n[3] A. Apte, M. Hairer, A. Stuart, and J. Voss. Sampling the posterior: An approach to non-Gaussian data assimilation. Physica D, 230:50\u201364, 2007.\n[4] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximation of stochastic differential equations. 
Journal of Machine Learning Research: Workshop and Conference\nProceedings, 1:1\u201316, 2007.\n\n[5] D. Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal\n\nof Machine Learning Research, 7:2515\u20132540, 2006.\n\n[6] A. Beskos, O. Papaspiliopoulos, G. Roberts, and P. Fearnhead. Exact and computationally ef\ufb01cient\nlikelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of\nthe Royal Statistical Society B, 68(3):333\u2013382, 2006.\n\n[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.\n[8] D. Crisan and T. Lyons. A particle approximation of the solution of the Kushner-Stratonovitch equation.\n\nProbability Theory and Related Fields, 115(4):549\u2013578, 1999.\n\n[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via EM\n\nalgorithm. Journal of the Royal Statistical Society B, 39(1):1\u201338, 1977.\n\n[10] G. L. Eyink, J. L. Restrepo, and F. J. Alexander. A mean \ufb01eld approximation in data assimilation for\n\nnonlinear dynamics. Physica D, 194:347\u2013368, 2004.\n\n[11] A. Golightly and D. J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models observed\n\nwith error. Computational Statistics and Data Analysis, 2007. Accepted.\n\n[12] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.\n[13] Peter E. Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations. Springer,\n\nBerlin, 1999.\n\n[14] H. Lappalainen and J. W. Miskin. Ensemble learning. In M. Girolami, editor, Advances in Independent\n\nComponent Analysis, pages 76\u201392. Springer-Verlag, 2000.\n\n[15] R. N. Miller, M. Ghil, and F. Gauthiez. Advanced data assimilation in strongly nonlinear dynamical\n\nsystems. Journal of the Atmospheric Sciences, 51:1037\u20131056, 1994.\n\n[16] Jorge Nocedal and Stephen J. Wright. 
Numerical Optimization. Springer, 2000.\n[17] G. Roberts and O. Stramer. On inference for partially observed non-linear diffusion models using the Metropolis-Hastings algorithm. Biometrika, 88:603\u2013621, 2001.\n\n", "award": [], "sourceid": 1082, "authors": [{"given_name": "C\u00e9dric", "family_name": "Archambeau", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Yuan", "family_name": "Shen", "institution": null}, {"given_name": "Dan", "family_name": "Cornford", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}