{"title": "Efficient and Flexible Inference for Stochastic Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 6988, "page_last": 6998, "abstract": "Many real-world dynamical systems are described by stochastic differential equations, so parameter inference is a challenging and important problem in many disciplines. We provide a grid-free and flexible algorithm offering parameter and state inference for stochastic systems, and compare our approach, based on variational approximations, to state-of-the-art methods, showing significant advantages in both runtime and accuracy.", "full_text": "Efficient and Flexible Inference for Stochastic Systems

Stefan Bauer* (Department of Computer Science, ETH Zurich, bauers@inf.ethz.ch)
Nico S. Gorbach* (Department of Computer Science, ETH Zurich, ngorbach@inf.ethz.ch)
Đorđe Miladinović (Department of Computer Science, ETH Zurich, djordjem@inf.ethz.ch)
Joachim M. Buhmann (Department of Computer Science, ETH Zurich, jbuhmann@inf.ethz.ch)

Abstract

Many real-world dynamical systems are described by stochastic differential equations, so parameter inference is a challenging and important problem in many disciplines. We provide a grid-free and flexible algorithm offering parameter and state inference for stochastic systems, and compare our approach, based on variational approximations, to state-of-the-art methods, showing significant advantages in both runtime and accuracy.

1 Introduction

A dynamical system is represented by a set of K stochastic differential equations (SDEs) with model parameters θ that describe the evolution of K states X(t) = [x_1(t), x_2(t), ..., x_K(t)]^T such that

dX(t) = f(X(t), θ)dt + Σ dW_t,    (1)

where W_t is a Wiener process. 
A sequence of observations y(t) is usually contaminated by some measurement error, which we assume to be normally distributed with zero mean and variance σ_k² for each of the K states, i.e. E ∼ N(0, D) with D_ik = σ_k² δ_ik. Thus, for N distinct time points the overall system may be summarized as

Y = AX + E,

where

X = [x(t_1), ..., x(t_N)] = [x_1, ..., x_K]^T,
Y = [y(t_1), ..., y(t_N)] = [y_1, ..., y_K]^T,

and where x_k = [x_k(t_1), ..., x_k(t_N)]^T is the k-th state sequence and y_k = [y_k(t_1), ..., y_k(t_N)]^T are the corresponding observations. Given the observations Y and the description of the dynamical system (1), the aim is to estimate both the state variables X and the parameters θ.

Related Work. Classic approaches for solving the inverse problem, i.e. estimating the parameters given some noisy observations of the process, include the Kalman filter and its improvements [e.g. Evensen, 2003, Tornøe et al., 2005] and MCMC-based approaches [e.g. Lyons et al., 2012]. However,

*The first two authors contributed equally to this work.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

MCMC-based methods do not scale well, since the number of particles required for a given accuracy grows exponentially with the dimensionality of the inference problem [Snyder et al., 2008], which is why approximations to the inference problem have become increasingly popular in recent years. Archambeau et al. 
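To make the generative model concrete, the following sketch simulates one path of the SDE (1) with the Euler-Maruyama scheme and then applies the observation model Y = AX + E with A = I and diagonal D. This is a minimal illustration under our own naming conventions, not the paper's inference code.

```python
import numpy as np

def euler_maruyama(f, x0, theta, T, dt, Sigma, rng):
    """Simulate dX = f(X, theta) dt + Sigma dW_t on [0, T] with step dt."""
    K = len(x0)
    n = int(round(T / dt))
    X = np.empty((n + 1, K))
    X[0] = x0
    for i in range(n):
        dW = rng.normal(0.0, np.sqrt(dt), size=K)  # Wiener increments
        X[i + 1] = X[i] + f(X[i], theta) * dt + Sigma @ dW
    return X

def observe(X, sigma, rng):
    """Noisy observations Y = X + E with E ~ N(0, sigma^2 I) per state (A = I)."""
    return X + rng.normal(0.0, sigma, size=X.shape)
```

For example, with the scalar drift f(x, θ) = −θx this produces a noisy Ornstein-Uhlenbeck trajectory that can serve as synthetic data for the inference problem.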
[2008] proposed a variational formulation for parameter and state inference of stochastic diffusion processes using a linear dynamic approximation: in an iterated two-step approach, the first step computes the mean and covariance of the approximating process (forward propagation), and the second step computes the time evolution of the Lagrange multipliers that ensure the consistency constraints for mean and variance (backward propagation), in order to obtain a smooth estimate of the states. Both forward and backward smoothing require repeatedly solving ODEs. Good accuracy additionally requires a fine time grid, which makes the approach computationally expensive and infeasible for larger systems [Vrettas et al., 2015]. For parameter estimation, the smoothing algorithm is used in the inner loop of a conjugate gradient algorithm to obtain an estimate of the optimal approximating process (given a fixed set of parameters), while in the outer loop a gradient step is taken to improve the current estimate of the parameters. An extension of Archambeau et al. [2008] using local polynomial approximations and mean-field approximations was proposed by Vrettas et al. [2015]. The mean-field approximations remove the need for Lagrange multipliers, and thus for the backward propagation, while the polynomial approximations remove the need to solve ODEs iteratively in the forward propagation step. This makes the smoothing algorithm, and thus the inner loop of parameter estimation, feasible even for large systems, while achieving comparable accuracy [Vrettas et al., 2015].

Our contributions. 
While established methods often assume full observability of the stochastic system for parameter estimation, we solve the more difficult problem of inferring parameters in systems which include unobserved variables, by combining state and parameter estimation in one step. Despite the fact that we compare our approach to other methods which solve a simpler problem, we offer improved accuracy in parameter estimation at a fraction of the computational cost.

2 Random Ordinary Differential Equations

Compared to stochastic differential equations, random ordinary differential equations (RODEs) have been less popular, even though both frameworks are highly connected. RODEs are pathwise ordinary differential equations that contain a stochastic process in their vector field functions. In Kloeden and Jentzen [2007], RODEs have been studied to derive better numerical integration schemes for SDEs, which e.g. allows for stronger pathwise results compared to the L² results given in Itô stochastic calculus. Moreover, RODEs sometimes have an advantage over SDEs by allowing more realistic noise for some applications, e.g. correlated noise or noise with limited variance. Let (Ω, F, P) be a complete probability space, (ζ_t)_{t∈[0,T]} be an R^m-valued stochastic process with continuous sample paths, and f : R^m × R^d → R^d a continuous function. Then

dx(t)/dt = f(x(t), ζ_t(ω))    (2)

is a scalar RODE, that is, an ODE

dx(t)/dt = F_ω(t, x) := f(x(t), ω(t)),    (3)

for all ω ∈ Ω. Following Kloeden and Jentzen [2007], we likewise assume that f is arbitrarily smooth, i.e. f ∈ C^∞, and thus locally Lipschitz in x, so that the initial value problem (3) has a unique solution, which we assume to exist on the finite time interval [0, T]. A simple example of a RODE is

Example 1 (RODE).

dx(t)/dt = −x + sin(W_t(ω)),    (4)

where W_t is a Wiener process. 
Taylor-like schemes for directly solving RODEs (2) were derived e.g. in Grüne and Kloeden [2001] and Jentzen and Kloeden [2009]. One approach to solving the RODE (2) is to use sampling to obtain many ODEs (3), which can then be solved pathwise using deterministic calculus. However, this pathwise solution of RODEs implies that a massive number of deterministic ODEs has to be solved efficiently. A study with a high-performance focus was conducted in Riesinger et al. [2016], where parallelized pathwise inference for RODEs was implemented using GPUs. While in principle classic numerical schemes for deterministic systems, e.g. Runge-Kutta, can be used for each path, they will usually converge with a lower order, since the vector field is not smooth enough in time [Asai et al., 2013]. Since the driving stochastic process ζ_t has at most Hölder continuous sample paths, the sample paths of the solution t → x(t) are continuously differentiable, but the derivatives of the solution sample paths are at most Hölder continuous in time. This is caused by the fact that F_ω(t, x) of the ODE (3) is usually only continuous, but not differentiable, in t, no matter how smooth the function f is in its variables. RODEs offer the opportunity to use deterministic calculus (pathwise), yet remain highly connected with SDEs, since any RODE with a Wiener process can be written as an SDE [Jentzen and Kloeden, 2011]. To illustrate the point, Example 1 above can be rewritten as an SDE:

Example 2 (SDE transformed RODE).

d(X_t, Y_t)^T = (−X_t + sin(Y_t), 0)^T dt + (0, 1)^T dW_t.    (5)

It likewise holds that SDEs can be transformed into RODEs. This transformation was first described in Sussmann [1978] and Doss [1977], and generalized to all finite-dimensional stochastic differential equations by Imkeller and Schmalfuss [2001]. 
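The pathwise strategy just described can be sketched for Example 1: sample a Wiener path on a grid, freeze it, and integrate the resulting deterministic ODE. As noted above, higher-order schemes lose their classical convergence order on such paths, so a plain Euler step is used here. Function names are our own.

```python
import numpy as np

def sample_wiener(T, dt, rng):
    """One Wiener sample path W_t(omega) on a grid with step dt."""
    t = np.arange(0, int(round(T / dt)) + 1) * dt
    dW = rng.normal(0.0, np.sqrt(dt), size=len(t) - 1)
    W = np.concatenate(([0.0], np.cumsum(dW)))
    return t, W

def solve_example1_path(t, W, x0=0.0):
    """Pathwise Euler integration of dx/dt = -x + sin(W_t(omega)), Example 1."""
    x = np.empty_like(t)
    x[0] = x0
    for i in range(len(t) - 1):
        h = t[i + 1] - t[i]
        x[i + 1] = x[i] + h * (-x[i] + np.sin(W[i]))
    return x
```

Averaging such solutions over many sampled Wiener paths gives Monte Carlo estimates of the moments of the RODE solution; this is exactly the regime where a massive number of deterministic ODE solves is needed.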
RODEs can thus be used to find pathwise solutions for SDEs, and SDEs can likewise be used to find better solutions for RODEs [Asai and Kloeden, 2013]. Due to space limitations, and to circumvent the introduction of a large mathematical framework, we only show the transformation for additive SDEs, following [Jentzen and Kloeden, 2011, Chapter 2].

Proposition 1. Any finite-dimensional SDE can be transformed into a RODE and the other way round:

dx_t = f(x_t)dt + dW_t  ⟺  dz(t)/dt = f(z_t + O_t) + O_t,    (6)

where z(t) := x_t − O_t and O_t is the stationary Ornstein-Uhlenbeck stochastic process satisfying the linear SDE

dO_t = −O_t dt + dW_t.    (7)

Typically a stationary Ornstein-Uhlenbeck process is used to replace the white noise of the SDE in its transformation to a RODE. By continuity and the fundamental theorem of calculus, it then follows that z(t) is pathwise differentiable. While we only showed the transformation for additive SDEs, it generally holds that any RODE with a Wiener process can be transformed into an SDE, and any finite-dimensional SDE with regular coefficients can be transformed into a RODE. This includes nonlinear drifts and diffusions, and is true for univariate and multivariate processes [Han and Kloeden, 2017]. There are cases for which this does not hold, e.g. a RODE which includes fractional Brownian motion as the driving noise. While the presented method is thus even more general, since RODEs can be solved, we limit ourselves to the problem of solving additive SDEs by transforming them into a RODE.

Since the solution of a RODE is continuously differentiable in time (but not further differentiable in time), classic numerical methods for ODEs rarely achieve their traditional order, and thus efficiency [Kloeden and Jentzen, 2007]. 
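Proposition 1 can be illustrated numerically: simulate the OU path (7) with an Euler-Maruyama step, integrate the transformed equation dz/dt = f(z + O) + O pathwise, and recover x_t = z_t + O_t. A minimal sketch with our own function names; the discretisation is a plain Euler scheme, not a production integrator.

```python
import numpy as np

def ou_path(n, dt, rng, O0=0.0):
    """Euler-Maruyama discretisation of dO = -O dt + dW, equation (7)."""
    O = np.empty(n + 1)
    O[0] = O0
    for i in range(n):
        O[i + 1] = O[i] - O[i] * dt + rng.normal(0.0, np.sqrt(dt))
    return O

def solve_sde_via_rode(f, x0, dt, O):
    """Pathwise Euler for dz/dt = f(z + O) + O from (6); returns x = z + O."""
    z = np.empty_like(O)
    z[0] = x0 - O[0]  # z(0) := x_0 - O_0
    for i in range(len(O) - 1):
        z[i + 1] = z[i] + dt * (f(z[i] + O[i]) + O[i])
    return z + O  # recover the SDE solution path x_t = z_t + O_t
```

Note that the right-hand side of the z-equation is continuous in t through the sampled O_t, so z is pathwise differentiable even though x is not.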
In the following we describe a scalable variational formulation to infer states and parameters of stochastic differential equations, by providing an ensemble-learning-type algorithm for inferring the parameters of the corresponding random ordinary differential equation.

3 Variational Gradient Matching

Gradient matching with Gaussian processes was originally motivated in Calderhead et al. [2008] and offers a computationally efficient shortcut for parameter inference in deterministic systems. While the original formulation was based on sampling, Gorbach et al. [2017] proposed a variational formulation offering significant runtime and accuracy improvements.

Gradient matching assumes that the covariance kernel C_φk (with hyperparameters φ_k) of a Gaussian process prior on the state variables is once differentiable, to obtain a conditional distribution over state

Figure 1: Noise. The left plot shows three typical Wiener processes generated with mean zero, and the right plot the corresponding Ornstein-Uhlenbeck (OU) processes having the same Wiener processes in their diffusion. 
The scale on the y-axis shows the mean-reverting behaviour of the OU process (compared to the Wiener process).

derivatives using the closure property of Gaussian processes under differentiation:

p(Ẋ | X, φ) = ∏_k N(ẋ_k | m_k, A_k),    (8)

where the mean and covariance are given by

m_k := 'C_φk C_φk⁻¹ x_k,    A_k := C''_φk − 'C_φk C_φk⁻¹ C'_φk,

and where C''_φk denotes the auto-covariance for each state derivative, with C'_φk and 'C_φk denoting the cross-covariances between the state and its derivative.

The posterior distribution over state variables is

p(X | Y, φ, σ) = ∏_k N(μ_k(y_k), Σ_k),    (9)

where μ_k(y_k) := C_φk (C_φk + σ_k² I)⁻¹ y_k and Σ_k := σ_k² C_φk (C_φk + σ_k² I)⁻¹.

Inserting the GP-based prior into the right-hand side of a differential equation, and assuming additive, normally distributed noise with state-specific error variance γ_k, one obtains a second distribution over state derivatives,

p(Ẋ | X, θ, γ) = ∏_k N(ẋ_k | f_k(X, θ), γ_k I),    (10)

which is combined with the smoothed distribution obtained from the data fit (9) in a product-of-experts approach:

p(Ẋ | X, θ, φ, γ) ∝ p(Ẋ | X, φ) p(Ẋ | X, θ, γ).

After analytically integrating out the latent state derivatives,

p(θ | X, φ, γ) ∝ p(θ) ∏_k N(f_k(X, θ) | m_k, Λ_k⁻¹),    (11)

where Λ_k⁻¹ := A_k + γ_k I, one aims to determine the maximum a posteriori (MAP) estimate of the parameters

θ* := arg max_θ ln ∫ p(θ | X, φ, γ) p(X | Y, φ) dX.    (12)

Since the integral in (12) is in most cases 
analytically intractable (even for small systems, due to the nonlinearities and couplings induced by the drift function), a lower bound is established through the introduction of an auxiliary distribution Q:

ln ∫ p(θ | X, φ, γ) p(X | Y, φ) dX
  (a)= ln ∫ Q(X) [ p(θ | X, φ, γ) p(X | Y, φ) / Q(X) ] dX    (13)
  (b)≥ ∫ Q(X) ln [ p(θ | X, φ, γ) p(X | Y, φ) / Q(X) ] dX
  = H(Q) + E_Q ln p(θ | X, φ, γ) + E_Q ln p(X | Y, φ)
  =: L_Q(θ),    (14)

where H(Q) is the entropy. In (a) the auxiliary distribution Q(X), with ∫ Q(X) dX = 1, is introduced, and (b) uses Jensen's inequality. The lower bound holds with equality whenever

Q*(X) := p(θ | X, φ, γ) p(X | Y, φ) / ∫ p(θ | X, φ, γ) p(X | Y, φ) dX  (c)= p(X | Y, θ, φ, γ),

where in (c) Bayes' rule is used. Unfortunately, Q* is analytically intractable, because its normalization, given by the integral in the denominator, is in most cases analytically intractable due to the strong couplings induced by the nonlinear drift function f in (1). We therefore use the mean-field family

Q := { Q : Q(X, θ) = q(θ | λ) ∏_u q(x_u | ψ_u) },

where λ and ψ_u are the variational parameters. Assuming that the drift in (1) is linear in the parameters θ, and that states only appear as monomial factors in arbitrarily large products of states, the true conditionals p(θ | X, Y, φ) and p(x_u | θ, X_{−u}, Y, φ) are Gaussian distributed, where X_{−u} denotes all states excluding state x_u (i.e. 
X_{−u} := {x ∈ X | x ≠ x_u}), and thus q(θ | λ) and q(x_u | ψ_u) are designed to be Gaussian.

The posterior distribution over states is then approximated as p(X | Y, θ, φ, γ, σ) ≈ Q̂(X) = ∏_k ∏_t q̂_{ψ_kt}, and the log-transformed distribution over the ODE parameters given the observations as ln p(θ | Y, φ, γ, σ) ≈ L_Q̂(θ).

Algorithm 1 Ensemble-based parameter estimation for SDEs
1: Transform the SDE (1) into a RODE (2).
2: Simulate a maximum number N_max of OU processes and insert them into (2) to obtain N_max ODEs.
3: For each ODE, obtain approximate solutions using variational gradient matching [Gorbach et al., 2017].
4: Combine the solutions θ̂ to obtain an estimate of the parameters for the RODE (2).
5: Transform the solutions of the RODE (2) back into solutions of the SDE (1).

Gorbach et al. [2017] then use an EM-type approach, illustrated in Figure 2, iteratively optimizing the parameters and the variational lower bound L_Q̂(θ). The variational parameters can be derived analytically, and the algorithm scales linearly in the number of states of the differential equation; it is thus ideally suited to infer the solutions of the massive number of pathwise ODEs required for the pathwise solution of the RODE formulation of the SDE. Since solution paths of the RODE are only once differentiable, gradient matching (which only makes this assumption w.r.t. the solution paths) is ideally suited for estimating the parameters. Our approach is summarized in Algorithm 1.

However, the application of variational gradient matching [Gorbach et al., 2017] to the pathwise solution of the RODE is not straightforward, since e.g. in the case of scalar stochastic differential equations one has to solve

dz(t)/dt = f_θ(z_t + O_t) + O_t    (15)

for a sampled trajectory O_t of an Ornstein-Uhlenbeck process, rather than the classic ODE formulation dz(t)/dt = f(z_t). 
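The pathwise structure of Algorithm 1 can be illustrated with a deliberately simplified stand-in for step 3: instead of the full variational gradient matching machinery, a least-squares match between finite-difference slopes and the drift estimates θ for each path, here for the scalar drift f_θ(x) = −θx; the per-path estimates are then combined by averaging (step 4). This sketch, including all function names, is ours, and it is only exact in the idealized setting where the OU path shares the Wiener increments of the data, as in the correspondence (6).

```python
import numpy as np

def estimate_theta_path(t, x, O):
    """Least-squares gradient matching for dz/dt = -theta*(z + O) + O on one
    sampled OU path (a crude stand-in for the variational inner step)."""
    z = x - O                   # transformed state, z = x - O as in (6)
    dz = np.gradient(z, t)      # finite-difference slope estimate
    g = -(z + O)                # the feature multiplying theta in the drift
    return float(np.sum(g * (dz - O)) / np.sum(g * g))

def ensemble_estimate(t, pairs):
    """Step 4 of Algorithm 1: combine per-path estimates by averaging."""
    return float(np.mean([estimate_theta_path(t, x, O) for x, O in pairs]))
```

The variational treatment replaces the crude slope estimate and least-squares fit by the Gaussian process model (8)-(11), which also quantifies the uncertainty that this sketch ignores.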
We account for the increased uncertainty by assuming an additional state-specific Gaussian noise factor δ, i.e. assuming f(x + O_t) + O_t + δ for a sampled trajectory O_t in the gradient matching formulation (10).

Flexibility and Efficiency. Algorithm 1 offers a flexible framework for inference in stochastic dynamical systems: e.g. if the parameters θ̂ are known, they can be set to the true values in each iteration, and Algorithm 1 then simply corresponds to a smoothing algorithm. Compared to the smoothing algorithm in Archambeau et al. [2008], it does not require the computationally expensive forward and backward propagation using an ODE solver. If the parameters are not known, then Algorithm 1 offers a grid-free inference procedure for estimating them. Unlike Vrettas et al. [2011], who consider unobserved state variables in the case of smoothing but assume the system to be fully observed if parameters are estimated, the outlined approach offers an efficient inference framework for the much more complicated problem of inferring the parameters while not all states are observed, and it still scales linearly in the states if pathwise inference of the RODE is done in parallel.

The conceptual difference between the approaches of Vrettas et al. [2015] and Gorbach et al. [2017] is illustrated in Figure 3.

Figure 2: Illustration of the "hill climbing" algorithm in Gorbach et al. [2017]. The difference between the lower bound L_Q̂(θ) and the log integral is given by the Kullback-Leibler divergence (red line).

Figure 3: Conceptual Difference. The red line represents an artificial function which has to be approximated. Our approach (right) is grid-free and based on the minimization of the differences of the slopes. 
That is why convergence is vertical, with each iteration step corresponding to a dashed line (the thickness of the line indicating the convergence direction). Vrettas et al. [2015] approximate the true process by a linearized dynamic process which is discretized (left) and improved by iterated forward and backward smoothing.

4 Experiments

We compare our approach on two established benchmark models for stochastic systems, especially used for weather forecasts. Vrettas et al. [2011] provide an extensive comparison of the approach of Archambeau et al. [2008] and its improvements with classic Kalman filtering as well as with more advanced, state-of-the-art inference schemes like 4D-Var [Le Dimet and Talagrand, 1986]. We use the results reported there as a comparison measure.

The drift function of the Lorenz96 system consists of equations of the form

f_k(x(t), θ) = (x_{k+1} − x_{k−2}) x_{k−1} − x_k + θ,

where θ is a scalar forcing parameter, and x_{−1} = x_{K−1}, x_0 = x_K and x_{K+1} = x_1 (with K being the number of states in the stochastic system (1)). 
The Lorenz96 system can be seen as a minimalistic weather model [Lorenz and Emanuel, 1998].

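With the cyclic boundary conditions above, the Lorenz96 drift is a one-liner with np.roll; the vectorised form below is our own sketch.

```python
import numpy as np

def lorenz96_drift(x, theta):
    """Cyclic Lorenz96 drift: f_k = (x_{k+1} - x_{k-2}) * x_{k-1} - x_k + theta."""
    # np.roll(x, -1)[k] = x_{k+1}, np.roll(x, 2)[k] = x_{k-2}, np.roll(x, 1)[k] = x_{k-1}
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + theta
```

A quick sanity check: the constant state x_k = θ for all k is an equilibrium, since the advection term vanishes and the damping −x_k cancels the forcing θ.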
The three-dimensional Lorenz attractor is described by the parameter vector θ = (σ, ρ, β) and the following time evolution:

dX(t) = [ σ(x_2(t) − x_1(t)),  ρ x_1(t) − x_2(t) − x_1(t)x_3(t),  x_1(t)x_2(t) − β x_3(t) ]^T dt + Σ^{1/2} dW_t.

The runtime for state estimation using the approach of Vrettas et al. [2011] and our method is indicated in Table 1. While parameter and state estimation are combined in one step in our approach, parameter estimation using the approach of Vrettas et al. [2011] would imply the iterative use of the smoothing algorithm, and thus a multiple of the runtime indicated in Table 1. Although we solve a much more difficult problem by inferring parameters and states at the same time, our runtime is only a fraction of the runtime required for a single run of the inner loop for parameter estimation in Vrettas et al. [2011].

Method        | L63/D=3 | L96/D=40 | L96/D=1000
VGPA_MF       | 31s     | 6503s    | 17345s
Our approach  | 2.4s    | 14s      | 383s

Table 1: Runtime for one run of the smoothing algorithm of Vrettas et al. [2015] vs. the runtime of our approach in a parallel implementation (using 51 OU sample paths). 
While parameter estimation is done simultaneously in our approach, Vrettas et al. [2015] use the smoothing algorithm iteratively for state estimation in an inner loop, such that the runtime for parameter estimation is a multiple of the indicated runtime for just one run of the smoothing algorithm.

We use our method to infer the states and drift parameters for the Lorenz attractor where the dimension y is unobserved. The estimated state trajectories are shown in Figure 4.

Figure 4: Lorenz attractor. The Lorenz attractor trajectories are shown on the right-hand side for inferred solutions using an SDE solver, while the left-hand plot shows the inferred trajectory using our method. Our method was able to accurately resolve the typical "butterfly" pattern despite not knowing the drift parameters and not observing the dimension y; only the dimensions x and z were observed.

The estimated trajectories for one sample path are also shown in the time domain in Section 5.2 of the supplementary material.

Our approach offers an appealing shortcut to the inference problem for stochastic dynamical systems and is robust to the noise in the diffusion term. Figure 5 shows the dependence of the inferred parameters on the variance in the diffusion term of the stochastic differential equation.

Increasing the time interval of the observed process, e.g. from 10 to 60 seconds, leads to convergence towards the true parameters (Figure 6). This is in contrast to the results of Archambeau et al. [2008], as reported in Vrettas et al. [2011, Figure 29], and shows the asymptotic time consistency of our approach.

Figure 5 shows that in the near-noiseless scenario we identify σ approximately correctly. Estimating the σ term in Figure 6 is more difficult than the other two parameters in the drift

Figure 5: Lorenz attractor. 
Boxplots indicate the median of the inferred parameters over 51 generated OU sample paths. Using a low variance for the diffusion term when simulating one random sample path from the SDE, our approach infers approximately the correct parameters, and it does not completely deteriorate if the variance is increased by a factor of 30.

Figure 6: Lorenz attractor. Increasing the time interval of the observed process leads to convergence towards the true parameters, as opposed to the results in [Vrettas et al., 2011, Figure 29].

function of the Lorenz attractor system, since the variance of the diffusion and the observation noise unfortunately lead to an identifiability problem for the parameter σ, which is why longer time periods in Figure 6 do not improve the estimation accuracy for σ.

Figure 7: Lorenz96. The left-hand side shows the accuracy of the parameter estimation with increasing diffusion variance (right to left) for a 40-dimensional system, while the right-hand plot shows the accuracy with a decreasing number of observations. Red dots show the results of the approach of Archambeau et al. [2008], when available, as reported in Vrettas et al. [2011]. The correct parameter value is 8, and our approach performs significantly better, while having a lower runtime, and is furthermore able to include unobserved variables (right).

For the Lorenz96 system, our parameter estimation approach is likewise robust to the variance in the diffusion term (Figure 7). It furthermore outperforms the approach of Archambeau et al. [2008] in the cases where results were reported in Vrettas et al. [2011]. The performance level is equal when, for 
The performance level is on par when, for our approach, we assume that only one third of the variables are unobserved.\n\nThe estimated trajectories for one sample path of the Lorenz96 system are shown in section 5.3 of the supplementary material.\n\n5 Discussion\n\nParameter inference in stochastic systems is a challenging but important problem in many disciplines. Current approaches are based on exploration of the parameter space, which is computationally expensive and infeasible for larger systems. Using a gradient matching formulation and adapting it to the inference of random ordinary differential equations, our proposal is a flexible framework which allows the use of deterministic calculus for inference in stochastic systems. While our approach tackles a much more difficult problem by combining state and parameter estimation in one step, it offers improved accuracy and is orders of magnitude faster than current state-of-the-art methods based on variational inference.\n\nAcknowledgements\n\nThis research was partially supported by the Max Planck ETH Center for Learning Systems and the SystemsX.ch project SignalX.\n\nReferences\n\nC\u00e9dric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John S Shawe-Taylor. Variational inference for diffusion processes. Neural Information Processing Systems (NIPS), 2008.\n\nYusuke Asai and Peter E Kloeden. Numerical schemes for random ODEs via stochastic differential equations. Commun. Appl. Anal., 17(3):521\u2013528, 2013.\n\nYusuke Asai, Eva Herrmann, and Peter E Kloeden. Stable integration of stiff random ordinary differential equations. 
Stochastic Analysis and Applications, 31(2):293\u2013313, 2013.\n\nBen Calderhead, Mark Girolami, and Neil D. Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. Neural Information Processing Systems (NIPS), 2008.\n\nHalim Doss. Liens entre \u00e9quations diff\u00e9rentielles stochastiques et ordinaires [Links between stochastic and ordinary differential equations]. In Annales de l\u2019IHP Probabilit\u00e9s et statistiques, volume 13, pages 99\u2013125, 1977.\n\nGeir Evensen. The ensemble Kalman filter: theoretical formulation and practical implementation. Ocean Dynamics, 53(4):343\u2013367, 2003.\n\nNico S Gorbach, Stefan Bauer, and Joachim M Buhmann. Scalable variational inference for dynamical systems. arXiv preprint arXiv:1895944, 2017.\n\nLars Gr\u00fcne and Peter E Kloeden. Pathwise approximation of random ordinary differential equations. BIT Numerical Mathematics, 41(4):711\u2013721, 2001.\n\nXiaoying Han and Peter E Kloeden. Random ordinary differential equations and their numerical solution, 2017.\n\nPeter Imkeller and Bj\u00f6rn Schmalfuss. The conjugacy of stochastic and random differential equations and the existence of global attractors. Journal of Dynamics and Differential Equations, 13(2):215\u2013249, 2001.\n\nArnulf Jentzen and Peter E Kloeden. Pathwise Taylor schemes for random ordinary differential equations. BIT Numerical Mathematics, 49(1):113\u2013140, 2009.\n\nArnulf Jentzen and Peter E Kloeden. Taylor approximations for stochastic partial differential equations. SIAM, 2011.\n\nPeter E Kloeden and Arnulf Jentzen. Pathwise convergent higher order numerical schemes for random ordinary differential equations. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 463, pages 2929\u20132944. The Royal Society, 2007.\n\nFran\u00e7ois-Xavier Le Dimet and Olivier Talagrand. Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects. 
Tellus A: Dynamic Meteorology and Oceanography, 38(2):97\u2013110, 1986.\n\nEdward N Lorenz and Kerry A Emanuel. Optimal sites for supplementary weather observations: simulation with a small model. Journal of the Atmospheric Sciences, 55(3):399\u2013414, 1998.\n\nSimon Lyons, Amos J Storkey, and Simo S\u00e4rkk\u00e4. The coloured noise expansion and parameter estimation of diffusion processes. Neural Information Processing Systems (NIPS), 2012.\n\nChristoph Riesinger, Tobias Neckel, and Florian Rupp. Solving random ordinary differential equations on GPU clusters using multiple levels of parallelism. SIAM Journal on Scientific Computing, 38(4):C372\u2013C402, 2016.\n\nChris Snyder, Thomas Bengtsson, Peter Bickel, and Jeff Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629\u20134640, 2008.\n\nH\u00e9ctor J Sussmann. On the gap between deterministic and stochastic ordinary differential equations. The Annals of Probability, pages 19\u201341, 1978.\n\nChristoffer W Torn\u00f8e, Rune V Overgaard, Henrik Agers\u00f8, Henrik A Nielsen, Henrik Madsen, and E Niclas Jonsson. Stochastic differential equations in NONMEM\u00ae: implementation, application, and comparison with ordinary differential equations. Pharmaceutical Research, 22(8):1247\u20131258, 2005.\n\nMichail D Vrettas, Dan Cornford, and Manfred Opper. Estimating parameters in stochastic systems: a variational Bayesian approach. Physica D: Nonlinear Phenomena, 240(23):1877\u20131900, 2011.\n\nMichail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. 
Physical Review E, 91(1):012148, 2015.