{"title": "Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 217, "page_last": 224, "abstract": "Identification and comparison of nonlinear dynamical systems using noisy and sparse experimental data is a vital task in many fields; however, current methods are computationally expensive and prone to error due in part to the nonlinear nature of the likelihood surfaces induced. We present an accelerated sampling procedure which enables Bayesian inference of parameters in nonlinear ordinary and delay differential equations via the novel use of Gaussian processes (GP). Our method involves GP regression over time-series data, and the resulting derivative and time delay estimates make parameter inference possible without solving the dynamical system explicitly, resulting in dramatic savings of computational time. We demonstrate the speed and statistical accuracy of our approach using examples of both ordinary and delay differential equations, and provide a comprehensive comparison with current state of the art methods.", "full_text": "Accelerating Bayesian Inference over Nonlinear\nDifferential Equations with Gaussian Processes\n\nBen Calderhead\n\nDept. of Computing Sci.\nUniversity of Glasgow\nbc@dcs.gla.ac.uk\n\nMark Girolami\n\nDept. of Computing Sci.\nUniversity of Glasgow\n\ngirolami@dcs.gla.ac.uk\n\nNeil D. Lawrence\n\nSchool of Computer Sci.\nUniversity of Manchester\nneill@cs.man.ac.uk\n\nAbstract\n\nIdenti\ufb01cation and comparison of nonlinear dynamical system models using noisy\nand sparse experimental data is a vital task in many \ufb01elds; however, current meth-\nods are computationally expensive and prone to error due in part to the nonlinear\nnature of the likelihood surfaces induced. 
We present an accelerated sampling\nprocedure which enables Bayesian inference of parameters in nonlinear ordinary\nand delay differential equations via the novel use of Gaussian processes (GP). Our\nmethod involves GP regression over time-series data, and the resulting derivative\nand time delay estimates make parameter inference possible without solving the\ndynamical system explicitly, resulting in dramatic savings of computational time.\nWe demonstrate the speed and statistical accuracy of our approach using examples\nof both ordinary and delay differential equations, and provide a comprehensive\ncomparison with current state of the art methods.\n\n1 Introduction\n\nMechanistic system modeling employing nonlinear ordinary or delay differential equations1 (ODEs\nor DDEs) is oftentimes hampered by incomplete knowledge of the system structure or the spe-\nci\ufb01c parameter values de\ufb01ning the observed dynamics [16]. Bayesian, and indeed non-Bayesian,\napproaches for parameter estimation and model comparison [19] involve evaluating likelihood func-\ntions, which requires the explicit numerical solution of the differential equations describing the\nmodel. The computational cost of obtaining the required numerical solutions of the ODEs or DDEs\ncan result in extremely slow running times.\nIn this paper we present a method for performing\nBayesian inference over mechanistic models by the novel use of Gaussian processes (GP) to predict\nthe state variables of the model as well as their derivatives, thus avoiding the need to solve the sys-\ntem explicitly. This results in dramatically improved computational ef\ufb01ciency (up to four hundred\ntimes faster in the case of DDEs). We note that state space models offer an alternative approach\nfor performing parameter inference over dynamical models particularly for on-line analysis of data,\nsee [2]. 
Related to the work we present, we also note that in [6] the use of GPs has been proposed in obtaining the solution of fully parameterised linear operator equations such as ODEs. Likewise, in [12] GPs are employed as emulators of the posterior response to parameter values as a means of improving the computational ef\ufb01ciency of a hybrid Monte Carlo sampler.\nOur approach is different and builds signi\ufb01cantly upon previous work which has investigated the use of derivative estimates to directly approximate system parameters for models described by ODEs. A spline-based approach was \ufb01rst suggested in [18] for smoothing experimental data and obtaining derivative estimates, which could then be used to compute a measure of mismatch for derivative values obtained from the system of equations. More recent developments of this method are described in [11]. All of these approaches, however, are plagued by similar problems. The methods are all critically dependent on additional regularisation parameters to determine the level of data smoothing. They all exhibit the problem of providing sub-optimal point estimates; even [11] may not converge to a reasonable solution depending on the initial values selected, as we demonstrate in Section 5.1. Furthermore, it is not at all obvious how these methods can be extended for partially observed systems, which are typical in, e.g. systems biology [10, 1, 8, 19]. Finally, these methods only provide point estimates of the \u201ccorrect\u201d parameters and are unable to cope with multiple solutions. (Although it should be noted that [11] does offer a local estimate of uncertainty based on second derivatives, at additional computational cost.)\n\n1The methodology in this paper can also be straightforwardly extended to partial differential equations.\n\n
It is therefore unclear how objective model comparison could be implemented using these methods.\nIn contrast, we provide a Bayesian solution which is capable of sampling from multimodal distributions. We demonstrate its speed and statistical accuracy and provide comparisons with the current best methods. It should also be noted that the papers mentioned above have focussed only on parameter estimation for fully observed systems of ODEs; we additionally show how parameter inference over both fully and partially observed ODE systems, as well as DDEs, may be performed ef\ufb01ciently using our state derivative approach.\n\n2 Posterior Sampling by Explicit Integration of Differential Equations\n\nA dynamical system may be described by a collection of N ordinary differential equations and model parameters \u03b8 which de\ufb01ne a functional relationship between the process state, x(t), and its time derivative such that \u02d9x(t) = f(x, \u03b8, t). Likewise, delay differential equations can be used to describe certain dynamic systems, where now an explicit time delay \u03c4 is employed. A sequence of process observations, y(t), is usually contaminated with some measurement error, which is modeled as y(t) = x(t) + \u03b5(t), where \u03b5(t) de\ufb01nes an appropriate multivariate noise process, e.g. a zero-mean Gaussian with variance \u03c3n^2 for each of the N states. If observations are made at T distinct time points, the N \u00d7 T matrices summarise the overall observed system as Y = X + E. In order to obtain values for X the system of ODEs must be solved, so that in the case of an initial value problem X(\u03b8, x0) denotes the solution of the system of equations at the speci\ufb01ed time points for the parameters \u03b8 and initial conditions x0. Figure 1(a) illustrates graphically the conditional dependencies of the overall statistical model, and from this the posterior density follows by employing appropriate priors such that p(\u03b8, x0, \u03c3|Y) \u221d \u03c0(\u03b8)\u03c0(x0)\u03c0(\u03c3) \u220fn NYn,\u00b7(X(\u03b8, x0)n,\u00b7, I\u03c3n^2). The desired marginal p(\u03b8|Y) can be obtained from this joint posterior2.\nVarious sampling schemes can be devised to sample from the joint posterior. However, regardless of the sampling method, each proposal requires the speci\ufb01c solution of the system of differential equations, which, as will be demonstrated in the experimental sections, is the main computational bottleneck in running an MCMC scheme for models based on differential equations. The computational complexity of numerically solving such a system cannot be easily quanti\ufb01ed since it depends on many factors such as the type of model and its stiffness, which in turn depends on the speci\ufb01c parameter values used. A method to alleviate this bottleneck is the main contribution of this paper.\n\n3 Auxiliary Gaussian Processes on State Variables\nLet us assume independent3 Gaussian process priors on the state variables such that p(Xn,\u00b7|\u03d5n) = N(0, C\u03d5n), where C\u03d5n denotes the matrix of covariance function values with hyperparameters \u03d5n. With noise \u03b5n \u223c N(0, \u03c3n^2 IT), the state posterior p(Xn,\u00b7|Yn,\u00b7, \u03c3n, \u03d5n) follows as N(\u00b5n, \u03a3n), where \u00b5n = C\u03d5n(C\u03d5n + \u03c3n^2 I)^\u22121 Yn,\u00b7 and \u03a3n = \u03c3n^2 C\u03d5n(C\u03d5n + \u03c3n^2 I)^\u22121. Given priors \u03c0(\u03c3n) and \u03c0(\u03d5n), the corresponding posterior is p(\u03d5n, \u03c3n|Yn,\u00b7) \u221d \u03c0(\u03c3n)\u03c0(\u03d5n)NYn,\u00b7(0, \u03c3n^2 I + C\u03d5n), and from this we can obtain the joint posterior, p(X, \u03c3n=1\u00b7\u00b7\u00b7N, \u03d5n=1\u00b7\u00b7\u00b7N|Y), over a non-parametric GP model of the state variables. Note that a non-Gaussian noise model may alternatively be implemented using warped GPs [14]. 
The conditional distribution for the state derivatives is p(\u02d9Xn,\u00b7|Xn,\u00b7, \u03d5n, \u03c3n) = N(mn, Kn), where the mean and covariance are given by\n\nmn = 'C\u03d5n(C\u03d5n + \u03c3n^2 I)^\u22121 Xn,\u00b7 and Kn = C''\u03d5n \u2212 'C\u03d5n(C\u03d5n + \u03c3n^2 I)^\u22121 C'\u03d5n\n\nwhere C''\u03d5n denotes the auto-covariance for each state derivative, with C'\u03d5n and 'C\u03d5n denoting the cross-covariances between the state and its derivative [13, 15].\n\n2This distribution is implicitly conditioned on the numerical solver and associated error tolerances.\n3The dependencies between state variables can be modeled by de\ufb01ning the overall state vector as x = vec(X) and using a GP prior of the form x \u223c N(0, \u03a3 \u2297 C), where \u2297 denotes the Kronecker matrix product, \u03a3 is an N \u00d7 N positive semi-de\ufb01nite matrix specifying inter-state similarities and C is the T \u00d7 T matrix de\ufb01ning intra-state similarities [13].\n\nFigure 1: (a) Graphical model representing explicit solution of an ODE system, (b) graphical model representing the approach developed in this paper, with dashed lines showing how the two models are combined in product form, (c) likelihood surface for a simple oscillator model.\n\nThe main advantage of using the Gaussian process model now becomes apparent. The GP speci\ufb01es a jointly Gaussian distribution over the function and its derivatives ([13], pg. 191). This allows us to evaluate a posterior over parameters \u03b8 consistent with the differential equation, based on the smoothed state and state derivative estimates; see Figure 1(b). 
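As an illustrative sketch (not the authors' code), the derivative conditional N(mn, Kn) can be computed for a squared exponential covariance as below; the kernel, the hyperparameter values and the sin test signal are assumptions for illustration only:

```python
import numpy as np

def gp_derivative_conditional(t, x, ell=1.0, sf=1.0, sigma2=1e-4):
    """Mean m and covariance K of dx/dt given x under a squared exponential GP.

    Uses k(t, t') = sf^2 exp(-(t - t')^2 / (2 ell^2)) and its derivative
    cross-covariances, mirroring m_n = 'C (C + sigma^2 I)^-1 x and
    K_n = C'' - 'C (C + sigma^2 I)^-1 C'.
    """
    d = t[:, None] - t[None, :]
    C = sf**2 * np.exp(-0.5 * d**2 / ell**2)      # Cov(x, x)
    dC = -(d / ell**2) * C                        # Cov(dx/dt(t_i), x(t_j))
    ddC = (1.0 / ell**2 - d**2 / ell**4) * C      # Cov(dx/dt, dx/dt)
    A = np.linalg.inv(C + sigma2 * np.eye(len(t)))
    m = dC @ A @ x
    K = ddC - dC @ A @ dC.T
    return m, K

# Smoothed derivatives of a near-noiseless sin signal should track cos.
t = np.linspace(0, 2 * np.pi, 60)
m, K = gp_derivative_conditional(t, np.sin(t))
```

The same jointly Gaussian property underlies the cross-covariance matrices 'C, C' and C'' referred to in the text.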
Assuming Normal errors between the state derivatives \u02d9Xn,\u00b7 and the functional fn(X, \u03b8, t), evaluated at the GP-generated state values X corresponding to time points t = t1 \u00b7\u00b7\u00b7 tT, we have p(\u02d9Xn,\u00b7|X, \u03b8, \u03b3n) = N(fn(X, \u03b8, t), I\u03b3n), with \u03b3n a state-speci\ufb01c error variance. Both statistical models p(\u02d9Xn,\u00b7|Xn,\u00b7, \u03d5n, \u03c3n) and p(\u02d9Xn,\u00b7|X, \u03b8, \u03b3n) can be linked in the form of a Product of Experts [7] to de\ufb01ne the overall density p(\u02d9Xn,\u00b7|X, \u03b8, \u03b3n, \u03d5n, \u03c3n) \u221d N(mn, Kn)N(fn(X, \u03b8, t), I\u03b3n) [see e.g. 20]. Introducing priors \u03c0(\u03b8) and \u03c0(\u03b3) = \u220fn \u03c0(\u03b3n), we obtain\n\np(\u03b8, \u03b3|X, \u03d5, \u03c3) = \u222b p(\u02d9X, \u03b8, \u03b3|X, \u03d5, \u03c3) d\u02d9X\n\u221d \u03c0(\u03b8)\u03c0(\u03b3) \u220fn \u222b N(mn, Kn)N(fn(X, \u03b8, t), I\u03b3n) d\u02d9Xn,\u00b7\n\u221d (\u03c0(\u03b8)\u03c0(\u03b3) / \u220fn Z(\u03b3n)) exp{\u2212(1/2) \u2211n (fn \u2212 mn)^T(Kn + I\u03b3n)^\u22121(fn \u2212 mn)}\n\nwhere fn \u2261 fn(X, \u03b8, t), and Z(\u03b3n) = |2\u03c0(Kn + I\u03b3n)|^(1/2) is a normalizing constant. Since the gradients appear only linearly and their conditional distribution given X is Gaussian, they can be marginalized exactly. In other words, given observations Y, we can sample from the conditional distribution for X and marginalize the augmented derivative space. The differential equation need now never be explicitly solved; its implicit solution is integrated into the sampling scheme.\n\n4 Sampling Schemes for Fully and Partially Observed Systems\n\nThe introduction of the auxiliary model and its associated variables has enabled us to recast the differential equation as another component of the inference process. 
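The marginalized density above can be evaluated without ever sampling the derivatives. A minimal sketch of one term of the product follows; the function name and the toy inputs are illustrative assumptions, not the paper's code:

```python
import numpy as np

def log_marginal_term(f, m, K, gamma):
    """log N(f; m, K + gamma I): the quadratic form
    -(1/2)(f - m)^T (K + gamma I)^{-1} (f - m) minus log Z(gamma)."""
    S = K + gamma * np.eye(len(f))
    L = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L, f - m)
    # -log Z(gamma) = -(1/2) log|2 pi S|, computed via the Cholesky factor
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(f) * np.log(2 * np.pi))

# Toy evaluation with K = I, gamma = 1, f - m = (1, 1).
val = log_marginal_term(np.ones(2), np.zeros(2), np.eye(2), 1.0)
```

Using a Cholesky factorisation keeps the determinant and the quadratic form numerically stable for the Kn + I gamma_n matrices that recur throughout the scheme.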
The relationship between the auxiliary model and the physical process that we are modeling is shown in Figure 1(b), where the dotted lines represent a transfer of information between the models. This information transfer takes place through sampling candidate solutions for the system in the GP model. Inference is performed by combining these approximate solutions with the system dynamics from the differential equations. It now remains to de\ufb01ne an overall sampling scheme for the structural parameters. For brevity, we omit normalizing constants and assume that the system is de\ufb01ned in terms of ODEs. However, our scheme is easily extended for delay differential equations (DDEs), where now predictions at each time point t and the associated delay (t \u2212 \u03c4) are required \u2014 we present results for a DDE system in Section 5.2. We can now consider the complete sampling scheme by also inferring the hyperparameters and corresponding predictions of the state variables and derivatives using the GP framework described in Section 3. We can obtain samples \u03b8 from the desired marginal posterior p(\u03b8|Y)4 by sampling from the joint posterior p(\u03b8, \u03b3, X, \u03d5, \u03c3|Y) as follows\n\n\u03d5n, \u03c3n|Yn,\u00b7 \u223c p(\u03d5n, \u03c3n|Yn,\u00b7) \u221d \u03c0(\u03c3n)\u03c0(\u03d5n)NYn,\u00b7(0, \u03c3n^2 I + C\u03d5n) (1)\nXn,\u00b7|Yn,\u00b7, \u03c3n, \u03d5n \u223c p(Xn,\u00b7|Yn,\u00b7, \u03c3n, \u03d5n) = NXn,\u00b7(\u00b5n, \u03a3n) (2)\n\u03b8, \u03b3|X, \u03d5, \u03c3 \u223c p(\u03b8, \u03b3|X, \u03d5, \u03c3) \u221d \u03c0(\u03b8)\u03c0(\u03b3) exp{\u2212(1/2) \u2211n \u03b4n^T(Kn + I\u03b3n)^\u22121\u03b4n} (3)\n\nwhere \u03b4n \u2261 fn \u2212 mn. This requires two Metropolis sampling schemes; one for inferring the parameters of the GP, \u03d5 and \u03c3, and another for the parameters of the structural system, \u03b8 and \u03b3. 
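Step (3) can be made concrete with a toy sketch for a single state with dynamics dx/dt = -theta*x. The smoothed state, derivative mean and covariance below are stand-ins for GP outputs, and the true parameter, proposal width and flat prior are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: one state, dx/dt = -theta * x, true theta = 2.
t = np.linspace(0, 2, 20)
x = np.exp(-2.0 * t)                               # smoothed GP state estimate (assumed given)
m = -2.0 * x + 0.01 * rng.standard_normal(len(t))  # GP derivative estimate (assumed given)
K = 0.01 * np.eye(len(t))                          # derivative covariance (assumed given)
gamma = 0.01

def log_target(theta):
    """Log of Equation (3) for this toy model, with a flat positive prior."""
    if theta <= 0:
        return -np.inf
    f = -theta * x                                 # f_n(X, theta, t)
    d = f - m
    S = K + gamma * np.eye(len(t))
    return -0.5 * d @ np.linalg.solve(S, d)

# Random-walk Metropolis on theta; no ODE solver is ever called.
theta, samples = 1.0, []
for _ in range(5000):
    prop = theta + 0.1 * rng.standard_normal()
    if np.log(rng.random()) < log_target(prop) - log_target(theta):
        theta = prop
    samples.append(theta)
post = np.array(samples[1000:])
```

The point of the sketch is that each proposal costs one evaluation of f and one linear solve, rather than a numerical integration of the system.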
However, as a consequence of the system-induced dynamics, the corresponding likelihood surface de\ufb01ned by p(Y|\u03b8, x0, \u03c3) can present formidable challenges to standard sampling methods. As an example, Figure 1(c) illustrates the induced likelihood surface of a simple dynamic oscillator similar to that presented in the experimental section. Recent advances in MCMC methodology suggest solutions to this problem in the form of population-based MCMC methods [8], which we therefore implement to sample the structural parameters of our model. Population MCMC enables samples to be drawn from a target density p(\u03b8) by de\ufb01ning a product of annealed densities indexed by a temperature parameter \u03b2, such that p(\u03b8|\u03b2) = \u220fi p(\u03b8|\u03b2i), and the desired target density p(\u03b8) is de\ufb01ned for one value of \u03b2i. It is convenient to \ufb01x a geometric path between the prior and posterior, which we do in our implementation, although other sequences are possible [3]. A time-homogeneous Markov transition kernel which has p(\u03b8) as its stationary distribution can then be constructed from both local Metropolis proposal moves and global temperature-switching moves between the tempered chains of the population [8], allowing freer movement within the parameter space.\nThe computational scaling for each component of the sampler is now considered. Sampling of the GP covariance function parameters by a Metropolis step requires computation of a matrix determinant and its inverse, so for all N states in the system a dominant scaling of O(N T^3) will be obtained. This poses little problem for many applications in systems biology since T is often fairly small (T \u2248 10 to 100). 
For larger values of T, sparse approximations can offer much improved computational scaling of order O(N M^2 T), where M is the number of time points selected [9]. Sampling from a multivariate Normal whose covariance matrix and corresponding decompositions have already been computed therefore incurs no dominating additional computational overhead. The \ufb01nal Metropolis step (Equation 3) requires each of the Kn matrices to be constructed and the associated determinants and inverses computed, thus incurring a total O(N T^3) scaling per sample. An approximate scheme can be constructed by \ufb01rst obtaining the maximum a posteriori values for the GP hyperparameters and posterior mean state values, \u02c6\u03d5, \u02c6\u03c3, \u02c6Xn, and then employing these in Equation 3. This will provide samples from p(\u03b8, \u03b3|\u02c6X, \u02c6\u03d5, \u02c6\u03c3, Y), which may be a useful surrogate for the full joint posterior, incurring lower computational cost as all matrix operations will have been pre-computed, as will be demonstrated later in the paper.\nWe can also construct a sampling scheme for the important special case where some states are unobserved. We partition X into Xo and Xu. Let o index the observed states; then we may infer all the unknown variables as follows\n\np(\u03b8, \u03b3, Xu|Xo, \u03d5, \u03c3) \u221d \u03c0(\u03b8)\u03c0(\u03b3)\u03c0(Xu) exp{\u2212(1/2) \u2211n\u2208o (\u03b4o,un)^T(Kn + I\u03b3n)^\u22121 \u03b4o,un}\n\nwhere \u03b4o,un \u2261 fn(Xo, Xu, \u03b8, t) \u2212 mn and \u03c0(Xu) is an appropriately chosen prior. The values of unobserved species are obtained by propagating their sampled initial values using the corresponding discrete versions of the differential equations and the smoothed estimates of observed species. 
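The propagation step for unobserved species can be pictured with a toy two-state system; the oscillator equations, step size and initial value below are illustrative assumptions. The unobserved state is advanced with a discrete Euler version of its equation, driven by the smoothed observed state:

```python
import numpy as np

# Toy system: dx1/dt = -x2 (x1 observed), dx2/dt = x1 (x2 unobserved),
# with x1 = cos(t) and true x2 = sin(t).
t = np.linspace(0, 2 * np.pi, 2000)
dt = t[1] - t[0]
x1_smoothed = np.cos(t)          # stands in for the GP posterior mean of x1

x2 = np.zeros_like(t)            # sampled initial value of the unobserved state
for k in range(len(t) - 1):
    # Discrete (Euler) version of dx2/dt = x1, driven by the smoothed x1.
    x2[k + 1] = x2[k] + dt * x1_smoothed[k]
```

In this toy case the propagated trajectory tracks the true unobserved state to within the Euler discretisation error.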
The p53 transcriptional network example we include requires inference over unobserved protein species; see Section 5.3.\n\n4Note that this is implicitly conditioned on the class of covariance function chosen.\n\n5 Experimental Examples\n\nWe now demonstrate our GP-based method, using a standard squared exponential covariance function, on a variety of examples involving both ordinary and delay differential equations, and compare the accuracy and speed with other state-of-the-art methods.\n\n5.1 Example 1 - Nonlinear Ordinary Differential Equations\n\nWe \ufb01rst consider the FitzHugh-Nagumo model [11], which was originally developed to model the behaviour of spike potentials in the giant axon of squid neurons and is de\ufb01ned as \u02d9V = c(V \u2212 V^3/3 + R), \u02d9R = \u2212(V \u2212 a + bR)/c. Although consisting of only 2 equations and 3 parameters, this dynamical system exhibits a highly nonlinear likelihood surface [11], which is induced by the sharp changes in the properties of the limit cycle as the values of the parameters vary. Such a feature is common to many nonlinear systems, and so this model provides an excellent test for our GP-based parameter inference method.\nData are generated from the model, with parameters a = 0.2, b = 0.2, c = 3, at {40, 80, 120} time points with additive Gaussian noise, N(0, v) for v = 0.1 \u00d7 \u03c3n, where \u03c3n is the standard deviation for the nth species. The parameters were then inferred from these data sets using the full Bayesian sampling scheme and the approximate sampling scheme (Section 4), both employing population MCMC. Additionally, we inferred the parameters using 2 alternative methods, the pro\ufb01led estimation method of Ramsay et al. [11] and a population MCMC based sampling scheme in which the ODEs were solved explicitly (Section 2), to complete the comparative study. 
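A sketch of how such a data set can be generated; the RK4 integrator, the initial condition [-1, 1] and the 0-20 time window are assumptions for illustration (the paper does not specify them):

```python
import numpy as np

def fitzhugh_nagumo(state, a=0.2, b=0.2, c=3.0):
    """Time derivative of the FitzHugh-Nagumo system at the given state."""
    V, R = state
    return np.array([c * (V - V**3 / 3 + R), -(V - a + b * R) / c])

def rk4(f, x0, ts):
    """Classical fourth-order Runge-Kutta over the grid ts."""
    xs = [np.asarray(x0, dtype=float)]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h, x = t1 - t0, xs[-1]
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        xs.append(x + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

rng = np.random.default_rng(2)
nsub = 50                                         # fine sub-steps per observation
ts_fine = np.linspace(0.0, 20.0, 39 * nsub + 1)
X = rk4(fitzhugh_nagumo, [-1.0, 1.0], ts_fine)[::nsub]      # 40 observation points
Y = X + 0.1 * X.std(axis=0) * rng.standard_normal(X.shape)  # v = 0.1 x sigma_n
```

Integrating on a fine grid and subsampling keeps the fixed-step integrator accurate over the sharp limit-cycle transitions.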
All the algorithms\nwere coded in Matlab, and the population MCMC algorithms were run with 30 temperatures, and\nused a suitably diffuse \u0393(2, 1) prior distribution for all parameters, forming the base distribution for\nthe sampler. Two of these population MCMC samplers were run in parallel and the \u02c6R statistic [5]\nwas used to monitor convergence of all chains at all temperatures. The required numerical approxi-\nmations to the ODE were calculated using the Sundials ODE solver, which has been demonstrated to\nbe considerably (up to 100 times) faster than the standard ODE45/ODE15s solvers commonly used\nin Matlab. In our experiments the chains generally converged after around 5000 iterations, and 2000\nsamples were then drawn to form the posterior distributions. Ramsay\u2019s method [11] was imple-\nmented using the Matlab code which accompanies their paper. The optimal algorithm settings were\nused, tuned for the FitzHugh-Nagumo model (see [11] for details) which they also investigated. Each\nexperiment was repeated 100 times, and Table 1 shows summary statistics for each of the inferred\nparameters. All of the three sampling methods based on population MCMC produced low variance\nsamples from posteriors positioned close to the true parameters values. Most noticeable from the\nresults in Figure 2 is the dramatic speed advantage the GP based methods have over the more direct\napproach, whereby the differential equations are solved explicitly; the GP methods introduced in\nthis paper offer up to a 10-fold increase in speed, even for this relatively simple system of ODEs.\nWe found the performance of the pro\ufb01led estimation method [11] to be very sensitive to the initial\nparameter values. In practice parameter values are unknown, indeed little may be known even about\nthe range of possible values they may take. Thus it seems sensible to choose initial values from a\nwide prior distribution so as to explore as many regions of parameter space as possible. 
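The population MCMC machinery described above (a product of tempered densities, local Metropolis moves plus global exchange moves) can be sketched on a deliberately bimodal toy target; the 10-chain ladder, the power-law temperature schedule and the proposal width are illustrative assumptions, not the paper's settings (which used 30 temperatures):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_like(th):
    # Bimodal "likelihood": two well-separated modes at -3 and +3.
    return np.logaddexp(-0.5 * ((th - 3) / 0.5)**2, -0.5 * ((th + 3) / 0.5)**2)

def log_prior(th):
    return -0.5 * (th / 10.0)**2

betas = np.linspace(0, 1, 10)**5     # path from prior (beta=0) to posterior (beta=1)

def log_anneal(th, b):
    return log_prior(th) + b * log_like(th)

chains = np.zeros(len(betas))
cold = []
for _ in range(20000):
    # Local Metropolis move on each tempered chain.
    for j, b in enumerate(betas):
        prop = chains[j] + rng.standard_normal()
        if np.log(rng.random()) < log_anneal(prop, b) - log_anneal(chains[j], b):
            chains[j] = prop
    # Global exchange move between a random adjacent pair of temperatures.
    j = rng.integers(len(betas) - 1)
    a = (log_anneal(chains[j + 1], betas[j]) + log_anneal(chains[j], betas[j + 1])
         - log_anneal(chains[j], betas[j]) - log_anneal(chains[j + 1], betas[j + 1]))
    if np.log(rng.random()) < a:
        chains[j], chains[j + 1] = chains[j + 1], chains[j]
    cold.append(chains[-1])
cold = np.array(cold[2000:])
```

The hot chains move freely between modes and the exchange moves propagate that mobility down to the cold (beta = 1) chain, which a single random-walk sampler on this target would lack.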
FitzHugh-Nagumo ODE Model\n\nSamples Method a b c\n40 GP MAP 0.1930 \u00b1 0.0242 0.2070 \u00b1 0.0453 2.9737 \u00b1 0.0802\n40 GP Fully Bayesian 0.1983 \u00b1 0.0231 0.2097 \u00b1 0.0481 3.0133 \u00b1 0.0632\n40 Explicit ODE 0.2015 \u00b1 0.0107 0.2106 \u00b1 0.0385 3.0153 \u00b1 0.0247\n80 GP MAP 0.1950 \u00b1 0.0206 0.2114 \u00b1 0.0386 2.9801 \u00b1 0.0689\n80 GP Fully Bayesian 0.2068 \u00b1 0.0194 0.1947 \u00b1 0.0413 3.0139 \u00b1 0.0585\n80 Explicit ODE 0.2029 \u00b1 0.0121 0.1837 \u00b1 0.0304 3.0099 \u00b1 0.0158\n120 GP MAP 0.1918 \u00b1 0.0145 0.2088 \u00b1 0.0317 3.0137 \u00b1 0.0489\n120 GP Fully Bayesian 0.1971 \u00b1 0.0162 0.2081 \u00b1 0.0330 3.0069 \u00b1 0.0593\n120 Explicit ODE 0.2071 \u00b1 0.0112 0.2123 \u00b1 0.0286 3.0112 \u00b1 0.0139\n\nTable 1: Summary statistics for each of the inferred parameters of the FitzHugh-Nagumo model. Each experiment was repeated 100 times and the mean parameter values are shown. We observe that all three population-based MCMC methods converge close to the true parameter values, a = 0.2, b = 0.2 and c = 3.\n\nFigure 2: Summary statistics of the overall time taken for the algorithms to run to completion. Solid bars show mean time for 100 runs; superimposed boxplots display median results with upper and lower quartiles.\n\nEmploying pro\ufb01led estimation using initial parameter values drawn from a wide gamma prior, however, yielded highly biased results, with the algorithm often converging to local maxima far from the true parameter values. The parameter estimates become more biased as the variance of the prior is increased, i.e. as the starting points move further from the true parameter values. E.g., consider parameter a: for 40 data points, for initial values a, b, c \u223c N({0.2, 0.2, 3}, 0.2), the range of estimated values for \u02c6a was [Min, Median, Max] = [0.173, 0.203, 0.235]. 
For initial values a, b, c \u223c \u0393(1, 0.5), \u02c6a had a range [Min, Median, Max] = [\u22120.329, 0.205, 9.3 \u00d7 10^9], and for a wider prior, a, b, c \u223c \u0393(2, 1), \u02c6a had range [Min, Median, Max] = [\u22121.4 \u00d7 10^10, 0.195, 2.2 \u00d7 10^9]. Lack of robustness therefore seems to be a signi\ufb01cant problem with this pro\ufb01led estimation method. The speed of the pro\ufb01led estimation method was also extremely variable, and this was observed to be very dependent on the initial parameter values; e.g. for initial values a, b, c \u223c N({0.2, 0.2, 3}, 0.2), the times recorded were [Min, Mean, Max] = [193, 308, 475]. Using a different prior for initial values such that a, b, c \u223c \u0393(1, 0.5), the times were [Min, Mean, Max] = [200, 913, 3265], and similarly for a wider prior a, b, c \u223c \u0393(2, 1), [Min, Mean, Max] = [132, 4171, 37411]. Experiments performed with noise v = {0.05, 0.2} \u00d7 \u03c3n produced similar and consistent results; however, they are omitted due to lack of space.\n\n5.2 Example 2 - Nonlinear Delay Differential Equations\n\nThis example model describes the oscillatory behaviour of the concentration of mRNA and its corresponding protein level in a genetic regulatory network, introduced by Monk [10]. The translocation of mRNA from the nucleus to the cytosol is explicitly described by a delay differential equation:\n\nd\u00b5/dt = 1/(1 + (p(t \u2212 \u03c4)/p0)^n) \u2212 \u00b5m\u00b5\ndp/dt = \u00b5 \u2212 \u00b5p p\n\nwhere \u00b5m and \u00b5p are decay rates, p0 is the repression threshold, n is a Hill coef\ufb01cient and \u03c4 is the time delay. The application of our method to DDEs is of particular interest, since numerical solutions to DDEs are generally much more computationally expensive to obtain than ODEs. 
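A minimal sketch of simulating the Monk model with an Euler scheme and a delay buffer; the Hill coefficient n = 5, the zero initial history and the step size are illustrative assumptions (the text fixes only mu_m, mu_p, p0 and tau):

```python
import numpy as np

# mu_m = mu_p = 0.03, p0 = 100, tau = 25 as in the text; the Hill coefficient
# n = 5 and the zero initial history are illustrative assumptions.
mu_m = mu_p = 0.03
p0, tau, n = 100.0, 25.0, 5.0
dt = 0.05
steps = int(1000 / dt)
lag = int(tau / dt)            # delay expressed in Euler steps

mu = np.zeros(steps + 1)       # mRNA concentration
p = np.zeros(steps + 1)        # protein concentration
for k in range(steps):
    p_delayed = p[k - lag] if k >= lag else 0.0   # assumed zero history for t < 0
    mu[k + 1] = mu[k] + dt * (1.0 / (1.0 + (p_delayed / p0)**n) - mu_m * mu[k])
    p[k + 1] = p[k] + dt * (mu[k] - mu_p * p[k])
```

The delayed argument p(t - tau) is read from a history buffer, which is exactly the extra bookkeeping that makes DDE solvers costlier than ODE solvers; the GP approach instead predicts the state at the delayed time points directly.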
Thus inference of such models using MCMC methods and explicitly solving the system at each iteration becomes less feasible as the complexity of the system of DDEs increases.\nWe consider data generated from the above model, with parameters \u00b5m = 0.03, \u00b5p = 0.03, p0 = 100, \u03c4 = 25, at {40, 80, 120} time points with added random noise drawn from a Gaussian distribution, N(0, v) for v = 0.1 \u00d7 \u03c3n, where \u03c3n is the standard deviation of the time-series data for the nth species. The parameters were then inferred from these data sets using our GP-based population MCMC methods. Figure 3 shows a time comparison for 10 iterations of the GP sampling algorithms and compares it to explicitly solving the DDEs using the Matlab solver DDE23 (which is generally faster than the Sundials solver for DDEs). The GP methods are around 400 times faster for 40 data points. Using the GP methods, samples from the full posterior can be obtained in less than an hour. Solving the DDEs explicitly, the population MCMC algorithm would take in excess of two weeks of computation time, assuming the chains take a similar number of iterations to converge.\n\nMonk DDE Model\n\nSamples Method p0 \u00b5m \u00d710^\u22123 \u00b5p \u00d710^\u22123 \u03c4\n40 GP MAP 100.21 \u00b1 2.08 29.7 \u00b1 1.6 30.1 \u00b1 0.3 25.65 \u00b1 1.04\n40 GP Full Bayes 99.75 \u00b1 1.50 29.8 \u00b1 1.2 30.1 \u00b1 0.2 25.33 \u00b1 0.85\n80 GP MAP 99.48 \u00b1 1.29 29.5 \u00b1 0.9 30.1 \u00b1 0.1 24.81 \u00b1 0.59\n80 GP Full Bayes 100.26 \u00b1 1.03 30.1 \u00b1 0.6 30.1 \u00b1 0.1 24.87 \u00b1 0.44\n120 GP MAP 99.91 \u00b1 1.02 30.0 \u00b1 0.5 30.0 \u00b1 0.1 24.97 \u00b1 0.38\n120 GP Full Bayes 100.23 \u00b1 0.92 30.0 \u00b1 0.4 30.0 \u00b1 0.1 25.03 \u00b1 0.25\n\nTable 2: Summary statistics for each of the inferred parameters of the Monk model. 
Each experiment was repeated 100 times and we observe that both GP population-based MCMC methods converge close to the true parameter values, p0 = 100, \u00b5m = 30 \u00d7 10^\u22123 and \u00b5p = 30 \u00d7 10^\u22123. The time-delay parameter, \u03c4 = 25, is also successfully inferred.\n\nFigure 3: Summary statistics of the time taken for the algorithms to complete 10 iterations using the DDE model.\n\n5.3 Example 3 - The p53 Gene Regulatory Network with Unobserved Species\n\nOur third example considers a linear and a nonlinear model describing the regulation of 5 target genes by the tumour repressor transcription factor protein p53. We consider the following differential equations, which relate the expression level xj(t) of the jth gene at time t to the concentration of the transcription factor protein f(t) which regulates it, \u02d9xj = Bj + Sj g(f(t)) \u2212 Dj xj(t), where Bj is the basal rate of gene j, Sj is the sensitivity of gene j to the transcription factor and Dj is the decay rate of the mRNA. Letting g(f(t)) = f(t) gives us the linear model originally investigated in [1], and letting g(f(t)) = exp(f(t)) gives us the nonlinear model investigated in [4]. The transcription factor f(t) is unobserved and must be inferred along with the other structural parameters Bj, Sj and Dj using the sampling scheme for partially observed systems detailed in Section 4. In this experiment, the priors used on the unobserved species were f(t) \u223c \u0393(2, 1), with a log-Normal proposal. We test our method using the leukemia data set studied in [1], which comprises 3 measurements at each of 7 time points for each of the 5 genes. Figure 4 shows the inferred missing species and the results are in good accordance with recent biological studies. For this example, our GP sampling algorithms ran to completion in under an hour on a 2.2GHz Centrino laptop, with no difference in speed between using the linear and nonlinear models; indeed the equations describing this biological system could be made more complex with little additional computational cost.\n\nFigure 4: The predicted output of the p53 gene using data from Barenco et al. [1] and the accelerated GP inference method for (a) the linear model and (b) the nonlinear response model. Note that the asymmetric error bars in (b) are due to exp(y) being plotted, as opposed to just y in (a). Our results are compared to the results obtained by Barenco et al. [1] (shown as crosses) and are comparable to those obtained by Lawrence et al. [4].\n\n6 Conclusions\n\nExplicit solution of differential equations is a major bottleneck for the application of inferential methodology in a number of application areas, e.g. systems biology, nonlinear dynamic systems. We have addressed this problem and placed it within a Bayesian framework which tackles the main shortcomings of previous solutions to the problem of system identi\ufb01cation for nonlinear differential equations. Our methodology allows the possibility of model comparison via the use of Bayes factors, which may be straightforwardly calculated from the samples obtained from the population MCMC algorithm. Possible extensions to this method include more ef\ufb01cient sampling exploiting control variable methods [17], embedding characteristics of a dynamical system in the design of covariance functions and application of our method to models involving partial differential equations.\n\nAcknowledgments\n\nBen Calderhead is supported by Microsoft Research through its European PhD Scholarship Programme. Mark Girolami is supported by an EPSRC Advanced Research Fellowship EP/EO52029 and BBSRC Research Grant BB/G006997/1.\n\nReferences\n\n[1] Barenco, M., Tomescu, D., Brewer, D., Callard, D., Stark, J. and Hubank, M. (2006) Ranked prediction of p53 targets using hidden variable dynamic modeling, Genome Biology, 7 (3):R25.\n[2] Doucet, A., de Freitas, N. 
and Gordon, N., (2001) Sequential Monte Carlo Methods in Practice, Springer.\n[3] Friel, N. and Pettitt, A. N. (2008) Marginal Likelihood Estimation via Power Posteriors. Journal of the\nRoyal Statistical Society: Series B, 70 (3), 589-607.\n[4] Gao, P., Honkela, A., Rattray, M. and Lawrence, N.D. (2008) Gaussian Process Modelling of Latent\nChemical Species: Applications to Inferring Transcription Factor Activities, Bioinformatics, 24, i70-i75.\n[5] Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2004) Bayesian Data Analysis, Chapman & Hall.\n[6] Graepel, T., (2003) Solving noisy linear operator equations by Gaussian processes: application to ordinary\nand partial differential equations, Proc. ICML 2003.\n[7] Mayraz, G. and Hinton, G. (2001) Recognizing Hand-Written Digits Using Hierarchical Products of\nExperts, Proc. NIPS 13.\n[8] Jasra, A., Stephens, D.A. and Holmes, C.C., (2007) On population-based simulation for static inference,\nStatistics and Computing, 17, 263-279.\n[9] Lawrence, N.D., Seeger, M. and Herbrich, R. (2003) Fast sparse Gaussian process methods: the informative\nvector machine, Proc. NIPS 15.\n[10] Monk, N. (2003) Oscillatory Expression of Hes1, p53, and NF-kB Driven by Transcriptional Time Delays.\nCurrent Biology, 13 (16), 1409-1413.\n[11] Ramsay, J., Hooker, G., Campbell, D. and Cao, J. (2007) Parameter Estimation for Differential Equations:\nA Generalized Smoothing Approach. Journal of the Royal Statistical Society: Series B, 69 (5), 741-796.\n[12] Rasmussen, C, E., (2003) Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian\nintegrals, Bayesian Statistics, 7, 651-659.\n[13] Rasmussen, C.E. and Williams, C.K.I. (2006) Gaussian Processes for Machine Learning, The MIT Press.\n[14] Snelson, E., Rasmussen, C.E. and Ghahramani, Z. (2004), Warped Gaussian processes, Proc. NIPS 16.\n[15] Solak, E., Murray-Smith, R., Leithead, W.E., Leith, D.J. and Rasmussen, C.E. 
(2003) Derivative observations in Gaussian Process models of dynamic systems, Proc. NIPS 15.\n[16] Tarantola, A. (2005) Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM.\n[17] Titsias, M. and Lawrence, N. (2008) Ef\ufb01cient Sampling for Gaussian Process Inference using Control Variables, Proc. NIPS 22.\n[18] Varah, J.M. (1982) A spline least squares method for numerical parameter estimation in differential equations, SIAM J. Scient. Comput., 3, 28-46.\n[19] Vyshemirsky, V. and Girolami, M. (2008) Bayesian ranking of biochemical system models, Bioinformatics, 24, 833-839.\n[20] Williams, C.K.I., Agakov, F.V. and Felderof, S.N. (2002) Products of Gaussians, Proc. NIPS 14.\n", "award": [], "sourceid": 521, "authors": [{"given_name": "Ben", "family_name": "Calderhead", "institution": null}, {"given_name": "Mark", "family_name": "Girolami", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": null}]}