{"title": "Approximate inference in continuous time Gaussian-Jump processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1831, "page_last": 1839, "abstract": "We present a novel approach to inference in conditionally Gaussian continuous time stochastic processes, where the latent process is a Markovian jump process. We first consider the case of jump-diffusion processes, where the drift of a linear stochastic differential equation can jump at arbitrary time points. We derive partial differential equations for exact inference and present a very efficient mean field approximation. By introducing a novel lower bound on the free energy, we then generalise our approach to Gaussian processes with arbitrary covariance, such as the non-Markovian RBF covariance. We present results on both simulated and real data, showing that the approach is very accurate in capturing latent dynamics and can be useful in a number of real data modelling tasks.", "full_text": "Approximate inference in continuous time\n\nGaussian-Jump processes\n\nManfred Opper\n\nBerlin, Germany\n\nAndreas Ruttor\n\nBerlin, Germany\n\nFakult\u00a8at Elektrotechnik und Informatik\n\nTechnische Universit\u00a8at Berlin\n\nFakult\u00a8at Elektrotechnik und Informatik\n\nTechnische Universit\u00a8at Berlin\n\nopperm@cs.tu-berlin.de\n\nandreas.ruttor@tu-berlin.de\n\nGuido Sanguinetti\nSchool of Informatics\nUniversity of Edinburgh\n\nG.Sanguinetti@ed.ac.uk\n\nAbstract\n\nWe present a novel approach to inference in conditionally Gaussian continuous\ntime stochastic processes, where the latent process is a Markovian jump process.\nWe \ufb01rst consider the case of jump-diffusion processes, where the drift of a linear\nstochastic differential equation can jump at arbitrary time points. We derive partial\ndifferential equations for exact inference and present a very ef\ufb01cient mean \ufb01eld\napproximation. 
By introducing a novel lower bound on the free energy, we then\ngeneralise our approach to Gaussian processes with arbitrary covariance, such as\nthe non-Markovian RBF covariance. We present results on both simulated and\nreal data, showing that the approach is very accurate in capturing latent dynamics\nand can be useful in a number of real data modelling tasks.\n\nIntroduction\n\nContinuous time stochastic processes are receiving increasing attention within the statistical machine\nlearning community, as they provide a convenient and physically realistic tool for modelling and\ninference in a variety of real world problems. Both continuous state space [1, 2] and discrete state\nspace [3\u20135] systems have been considered, with applications ranging from systems biology [6] to\nmodelling motion capture [7]. Within the machine learning community, Gaussian processes (GPs)\n[8] have proved particularly popular, due to their appealing properties which allow to reduce the\nin\ufb01nite dimensional smoothing problem into a \ufb01nite dimensional regression problem. While GPs\nare indubitably a very successful tool in many pattern recognition tasks, their use is restricted to\nprocesses with continuously varying temporal behaviour, which can be a limit in many applications\nwhich exhibit inherently non-stationary or discontinuous behaviour.\nIn this contribution, we consider the state inference and parameter estimation problems in a wider\nclass of conditionally Gaussian (or Gaussian-Jump) processes, where the mean evolution of the GP is\ndetermined by the state of a latent (discrete) variable which evolves according to Markovian dynam-\nics. We \ufb01rst consider the special, but important, case where the GP is a Markovian process, i.e. an\nOrnstein-Uhlenbeck (OU) process. In this case, exact inference can be derived by using a forward-\nbackward procedure. 
This leads to partial differential equations, whose numerical solution can be\ncomputationally expensive; alternatively, a variational approximation leads to an iterative scheme\ninvolving only the numerical solution of ordinary differential equations, and which is extremely\nef\ufb01cient from a computational point of view. We then consider the case of general (non-Markov)\n\n1\n\n\fGPs coupled to a Markovian latent variable. Inference in this case is intractable, but, by means of a\nLegendre transform, we can derive a lower bound on the exact free energy, which can be optimised\nusing a saddle point procedure.\n\n1 Conditionally Gaussian Markov Processes\n\nWe consider a continuous state stochastic system governed by a linear stochastic differential equa-\ntion (SDE) with piecewise constant (in time) drift bias which can switch randomly with Markovian\ndynamics (see e.g. [9] for a good introduction to stochastic processes). For simplicity, we give the\nderivations in the case when there are only two states in the switching process (i.e. it is a random\ntelegraph process) and the diffusion system is one dimensional; generalisation to more dimensions\nor more latent states is straightforward. The system can be written as\ndx = (A\u00b5 + b \u2212 \u03bbx) dt + \u03c3dw(t),\n\n\u00b5(t) \u223c T P (f\u00b1) ,\n\n(1)\n\nwhere w is the Wiener process with variance \u03c32 and \u00b5(t) is a random telegraph process with switch-\ning rates f\u00b1. Our interest in this type of models is twofold: similar models have found applications\nin \ufb01elds like systems biology, where the rapid transitions of regulatory proteins make a switching\nlatent variable a plausible model [6]. 
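Model (1) is simple to simulate directly, which is also how the synthetic data of Section 3.1 are produced (Euler-Maruyama discretisation of the SDE, with Bernoulli thinning for the telegraph switching). A minimal sketch; the function name and all parameter values are illustrative, not taken from the paper:

```python
import numpy as np

def simulate_gaussian_jump(T=10.0, dt=1e-3, A=2.0, b=0.5, lam=1.0,
                           sigma=0.5, f_plus=0.3, f_minus=0.3, seed=0):
    """Euler-Maruyama simulation of dx = (A*mu + b - lam*x) dt + sigma dw,
    with mu(t) a two-state telegraph process switching 0->1 at rate f_plus
    and 1->0 at rate f_minus (illustrative parameters, not the paper's)."""
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    x = np.zeros(n)
    mu = np.zeros(n, dtype=int)
    for k in range(1, n):
        # telegraph process: switch with probability rate * dt per step
        rate = f_plus if mu[k - 1] == 0 else f_minus
        mu[k] = 1 - mu[k - 1] if rng.random() < rate * dt else mu[k - 1]
        # linear SDE with piecewise constant drift bias A*mu + b
        drift = A * mu[k] + b - lam * x[k - 1]
        x[k] = x[k - 1] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x, mu
```

Noisy observations for the smoothing problem are then just Gaussian perturbations of the sampled path at a handful of time indices.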
At the same time, at least intuitively, model (1) could be con-\nsidered as an approximation to more complex non-linear diffusion processes, where diffusion near\nlocal minima of the potential is approximated by linear diffusion.\nLet us assume that we observe the process x at a \ufb01nite number of time points with i.i.d. noise, giving\nvalues\n\nyi \u223c N(cid:0)x(ti), s2(cid:1) ,\n\ni = 1, . . . , N.\n\nFor simplicity, we have assumed that the process itself is observed; nothing would change in what\nfollows if we assumed that the variable y is linearly related to the process (except of course that\nwe would have more parameters to estimate). The problem we wish to address is the inference of\nthe joint posterior over both variables x and \u00b5 at any time within a certain interval, as well as the\ndetermination of (a subset of) the parameters and hyperparameters involved in equation (1) and in\nthe observation model.\n\n1.1 Exact state inference\n\nAs the system described by equation (1) is a Markovian process, the marginal probability distribution\nq\u00b5(x, t) for both state variables \u00b5 \u2208 {0, 1} and x of the posterior process can be calculated using\na smoothing algorithm similar to the one described in [6]. Based on the Markov property one can\nshow that\n\nq\u00b5(x, t) =\n\n(2)\nHere p\u00b5(x, t) denotes the marginal \ufb01ltering distribution, while \u03a8\u00b5(x, t) = p({yi|ti > t}|xt =\nx, \u00b5t = \u00b5) is the likelihood of all observations after time t under the condition that the process has\nstate (x, \u00b5) at time t (backward message). 
The time evolution of the backward message is described\nby the backward Chapman-Kolmogorov equation for \u00b5 \u2208 {0, 1} [9]:\n\np\u00b5(x, t)\u03a8\u00b5(x, t).\n\n1\nZ\n\n\u2202\u03a8\u00b5\n\u2202t\n\n+ (A\u00b5 + b \u2212 \u03bbx) \u2202\u03a8\u00b5\n\u2202x\n\n+ \u03c32\n2\n\n\u22022\u03a8\u00b5\n\u2202x2 = f1\u2212\u00b5(\u03a8\u00b5(x, t) \u2212 \u03a81\u2212\u00b5(x, t)).\n\n(3)\n\nThis PDE must be solved backward in time starting at the last observation yN using the initial\ncondition\n\n\u03a8\u00b5(x, tN ) = p(yN|x(tN ) = x).\nThe other observations are taken into account by jump conditions\n\n(4)\n\n\u03a8\u00b5(x, t\u2212\n\n(5)\nk ) being the values of \u03a8\u00b5(x, t) before and after the k-th observation and p(yj|x(tj) =\n\nj ) = \u03a8\u00b5(x, t+\n\nj ) p(yj|x(tj) = x),\n\nwhere \u03a8\u00b5(x, t\u2213\nx) is given by the noise model.\n\n2\n\n\fIn order to calculate q\u00b5(x, t) we need to calculate the \ufb01ltering distribution p\u00b5(x, t), too. Its time\nevolution is given by the forward Chapman-Kolmogorov equation [9]\n\n\u2202p\u00b5\n\u2202t\n\n+ \u2202\n\u2202x\n\n(A\u00b5 + b \u2212 \u03bbx)p\u00b5(x, t) \u2212 \u03c32\n2\n\n\u22022p\u00b5\n\u2202x2 = f\u00b5 p1\u2212\u00b5(x, t) \u2212 f1\u2212\u00b5 p\u00b5(x, t).\n\n(6)\n\nWe can show that the posterior process q\u00b5(x, t) ful\ufb01ls a similar PDE by calculating its time derivative\nand using both (3) and (6). 
By doing so we \ufb01nd\n(A\u00b5 + b\u2212 \u03bbx + c\u00b5(x, t))q\u00b5(x, t)\u2212 \u03c32\n2\n\n\u22022q\u00b5\n\u2202x2 = g\u00b5(x, t) q1\u2212\u00b5(x, t)\u2212 g1\u2212\u00b5(x, t) q\u00b5(x, t),\n\n+ \u2202\n\u2202x\n\n\u2202q\u00b5\n\u2202t\n\nwhere\n\n\u03a8\u00b5(x, t)\n\u03a81\u2212\u00b5(x, t) f\u00b5\nare time and state dependent posterior jump rates, while the drift\n\ng\u00b5(x, t) =\n\nc\u00b5(x, t) = \u03c32 \u2202\n\u2202x\n\nlog \u03a8\u00b5(x, t)\n\ntakes the observations into account.\nIt is clearly visible that (7) is also a forward Chapman-\nKolmogorov equation. Consequently, the only differences between prior and posterior process are\nthe jump rates for the telegraph process \u00b5 and the drift of the diffusion process x.\n\n1.2 Variational inference\n\nThe exact inference approach outlined above gives rise to PDEs which need to be solved numeri-\ncally in order to estimate the relevant posteriors. For one dimensional GPs this is expensive, but\nin principle feasible. This work will be deferred to a further publication. Of course, numerical so-\nlutions become computationally prohibitive for higher dimensional problems, leading to a need for\napproximations. We describe here a variational approximation to the joint posterior over the switch-\ning process \u00b5(t) and the diffusion process x(t) which gives an upper bound on the true free energy;\nit is obtained by making a factorised approximation to the probability over paths (x0:T , \u00b50:T ) of the\nform\n\nq (x0:T , \u00b50:T ) = qx (x0:T ) q\u00b5 (\u00b50:T ) ,\n\n(10)\nwhere qx is a pure diffusion process (which can be easily shown to be Gaussian) and q\u00b5 is a pure\njump process. 
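For one-dimensional problems, the exact forward Chapman-Kolmogorov equation (6) can also be integrated directly on a grid, which is useful as a baseline for the approximations that follow. A minimal explicit finite-difference sketch; the grid, the step sizes, and the convention that f_+ is the 0→1 switching rate are assumptions made for illustration:

```python
import numpy as np

def forward_ck_step(p, xs, dt, A, b, lam, sigma, f_plus, f_minus):
    """One explicit Euler step of the coupled forward equations (6) for
    p_mu(x, t), mu in {0, 1}.  p has shape (2, len(xs)); the explicit
    scheme needs the stability condition (sigma**2 / 2) * dt / dx**2 <= 1/2."""
    dx = xs[1] - xs[0]
    f_in = (f_minus, f_plus)  # f_in[mu]: rate into state mu (assumed convention)
    p_new = np.empty_like(p)
    for mu in (0, 1):
        drift_flux = (A * mu + b - lam * xs) * p[mu]
        d_flux = np.gradient(drift_flux, dx)                  # d/dx of the drift flux
        d2p = np.gradient(np.gradient(p[mu], dx), dx)         # second x-derivative
        switch = f_in[mu] * p[1 - mu] - f_in[1 - mu] * p[mu]  # telegraph switching
        p_new[mu] = p[mu] + dt * (-d_flux + 0.5 * sigma**2 * d2p + switch)
    return p_new
```

Between observations the filtering density is propagated with such steps; at each observation, p_mu is multiplied pointwise by the Gaussian likelihood and renormalised, mirroring the jump conditions (5).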
Considering the KL divergence between the original process (1) and the approxi-\nmating process, and keeping into account the conditional structure of the model and equation (10),\nwe obtain the following expression for the Kullback-Leibler (KL) divergence between the true and\napproximating posteriors:\n\n(7)\n\n(8)\n\n(9)\n\nN(cid:88)\n\ni=1\n\nKL [q(cid:107)p] = K0 +\n\n(cid:104)log p (yi|x(ti))(cid:105)qx\n\n+(cid:104)KL [qx(cid:107)p (x0:T|\u00b50:T )](cid:105)q\u00b5\n\n+ KL [q\u00b5(cid:107)p(\u00b50:T )] . (11)\n\nBy using the general formula for the KL divergence between two diffusion processes [1], we obtain\nthe following form for the third term in equation (11):\n\n2\u03c32{[\u03b1(t) + \u03bb]2(cid:2)c2(t) + m2(t)(cid:3) + [\u03b2(t) \u2212 b]2 +\n+ 2 [\u03b1(t) + \u03bb] [\u03b2(t) \u2212 b] m(t) +(cid:2)A2 \u2212 2A (\u03b1(t) + \u03bb) m(t) \u2212 2A (\u03b2(t) \u2212 b)(cid:3) q1\n\n(cid:104)KL [qx(cid:107)p (x0:T|\u00b50:T )](cid:105)q\u00b5\n\n(12)\n\n=\n\ndt\n\n1\n\n\u00b5(t)}.\n\n(cid:90)\n\nHere \u03b1 and \u03b2 are the gain and bias (coef\ufb01cients of the linear term and constant) of the drift of\nthe approximating diffusion process, m and c2 are the mean and variance of the approximating\n\u00b5(t) is the marginal probability at time t of the switch being on (computed using\nprocess, and q1\nthe approximating jump process). So the KL is the sum of an initial condition part (which can\nbe set to zero) and two other parts involving the KL between a Markovian Gaussian process and\na Markovian Gaussian process observed linearly with noise (second and third terms) and the KL\nbetween two telegraph processes. The variational E-step iteratively minimises these two parts using\nrecursions of the forward-backward type. Interleaved with this, variational M-steps can be carried\nout by optimising the variational free energy w.r.t. 
the parameters; the \ufb01xed point equations for this\nare easily derived and will be omitted here due to space constraints. Evaluation of the Hessian of\nthe free energy w.r.t. the parameters can be used to provide a measure of the uncertainty associated.\n\n3\n\n\f1.2.1 Computation of the approximating diffusion process\n\nMinimisation of the second and third term in equation (11) requires \ufb01nding an approximating Gaus-\nsian process. By inspection of equation (12), we see that we are trying to compute the posterior\n\u00b5(t)+b\u2212\u03bbx, with the obser-\nprocess for a discretely observed Gaussian process with (prior) drift Aq1\nvations being i.i.d. with Gaussian noise. Due to the Markovian nature of the process, its single time\nmarginals can be computed using the continuous time version of the well known forward-backward\nalgorithm [10, 11]. The single time posterior marginal can be decomposed as\n\nq (x(t)) = p (x(t)|y1, . . . , yN ) =\n\n1\nZ\n\n\u03c6 (x(t)) \u03be (x(t)) ,\n\n(13)\n\nwhere \u03c6 is the \ufb01ltered process or forward message, and \u03be is the backward message, i.e. the likelihood\nof future observations conditioned on time t. The recursions are based on the following general\nODEs linking mean \u02c6m and variance \u02c6c2 of a general Gaussian diffusion process with system noise \u03c32\nto the drift coef\ufb01cients \u02c6\u03b1 and \u02c6\u03b2 of the respective SDE, which are a consequence of the Fokker-Planck\nequation for Gaussian processes\n\nd \u02c6m\ndt\nd\u02c6c2\ndt\n\n= \u02c6\u03b1 \u02c6m + \u02c6\u03b2,\n\n= 2\u02c6\u03b1\u02c6c2 + \u03c32.\n\n(14)\n\nThe \ufb01ltered process outside the observations satis\ufb01es the forward Fokker-Planck equation of the\nprior process, so its mean and variance can be propagated using equations (14) with prior drift\ncoef\ufb01cients \u02c6\u03b1 = \u2212\u03bb and \u02c6\u03b2 = Aq1\nlim\nt\u2192t+\n\n\u00b5 + b. 
Observations are incorporated via the jump conditions\n\u03c6 (x(t)) \u221d p (yi|x(ti)) lim\nt\u2192t\n\n\u03c6 (x(t)) ,\n\n(15)\n\n\u2212\ni\n\ni\n\nwhence the recursions on the mean and variances easily follow. Notice that this is much simpler than\n(discrete time) Kalman \ufb01lter recursions as the prior gain is zero in continuous time. Computation of\nthe backward message (smoothing) is analogous; the reader is referred to [10,11] for further details.\n\n1.2.2 Jump process smoothing\n\nHaving computed the approximating diffusion process, we now turn to give the updates for the\napproximating jump process.The KL divergence in equation (11) involves the jump process in two\nterms: the last term is the KL divergence between the posterior jump process and the prior one, while\nthe third term, which gives the expectation of the KL between the two diffusion processes under the\nposterior jump, also contains terms involving the jump posterior. The KL divergence between two\ntelegraph processes was calculated in [4]; considering the jump terms coming from equation (12),\nand adding a Lagrange multiplier to take into account the Master equation ful\ufb01lled by the telegraph\nprocess, we end up with the following Lagrangian:\nL [q\u00b5, g\u00b1, \u03c8, \u03be] = KL [q\u00b5(cid:107)pprior] +\n+ (g\u2212 + g+)q1 \u2212 g+\n\n(cid:2)A2 \u2212 2A (\u03b1 + \u03bb) m \u2212 2A (\u03b2 \u2212 b)(cid:3) q1(t)+\n\n(cid:18) dq1\n\n1\n2\u03c32\n\ndt\u03c8(t)\n\n(cid:90)\n\ndt\n\n(cid:19)\n\n.\n\n(16)\n\n(cid:90)\n\ndt\n\nNotice we use q1(t) = q\u00b5(\u00b5(t) = 1) to lighten the notation. Functional derivatives w.r.t. to the\nposterior rates g\u00b1 allow to eliminate them in favour of the Lagrange multipliers; inserting this into\nthe functional derivatives w.r.t. 
to the marginals q1(t) gives ODEs involving the Lagrange multiplier\nand the prior rates only (as well as terms from the diffusion process), which can be solved backward\nin time from the condition \u03c8(T ) = 0. This allows to update the rates and then the posterior marginals\ncan be found in a forward propagation, in a manner similar to [4].\n\n2 Conditionally Gaussian Processes: general case\n\nIn this section, we would like to generalise our model to processes of the form\n\ndx = (\u2212\u03bbx + A\u00b5 + b)dt + df(t),\n\n(17)\n\n4\n\n\fwhere the white noise driving process \u03c3dw(t) in (1) is replaced by an arbitrary GP df(t) 1. The ap-\nplication of our variational approximation (11) requires the KL divergence KL [qx(cid:107)p (x0:T|\u00b50:T )]\nbetween a GP qx and a GP with a shifted mean function p (x0:T|\u00b50:T ). Assuming the same co-\nvariance this could in principle be computed using the Radon-Nykodym derivative between the two\nmeasures. Our preliminary results (based on the Cameron-Martin formula for GPs [12]) indicates\nthat even in simple cases (like Ornstein-Uhlenbeck noise) the measures are not absolutely continu-\nous and the KL divergence is in\ufb01nite. Hence, we have resorted to a different variational approach,\nwhich is based on a lower bound to the free energy.\nWe use the fact, that conditioned on the path of the switching process \u00b50:T , the prior of x(t)\nis a GP with a covariance kernel K(t, t(cid:48)) and can be marginalised out exactly. The kernel K\ncan be easily computed from the kernel of the driving noise process f(t) [2].\nIn the previous\ncase of white noise K is given by the (nonstationary) Ornstein-Uhlenbeck kernel KOU (t, t(cid:48)) =\n\u03c32\n. The mean function of the conditioned GP is obtained by solving the\n2\u03bb\nlinear ODE (17) without noise, i.e. with f = 0. 
This yields\n\n(cid:110)\ne\u2212\u03bb|t\u2212t(cid:48)| \u2212 e\u2212\u03bb(t+t(cid:48))(cid:111)\n\nEGP [x(t)|\u00b50:T ] =\n\ne\u2212\u03bb(t\u2212s)(A\u00b5(s) + b) ds .\n\n(18)\n\n(cid:90) t\n\n0\n\n(cid:20)\n\n(cid:26)\n\n\u22121\n2\n\nMarginalising out the conditional GP, the negative log marginal probability of observations (free\nenergy) F = \u2212 ln p(D) is represented as\n\nF = \u2212 ln E\u00b5 [p(D|\u00b50:T )] = \u03ba \u2212 ln E\u00b5\n\nexp\n\n(y \u2212 x\u00b5)(cid:62)(K + \u03c32I)\u22121(y \u2212 x\u00b5)\n\n.\n\n(19)\n\n(cid:27)(cid:21)\n\nHere E\u00b5 denotes expectation over the prior switching process p\u00b5, y is the vector of observations,\nand x\u00b5 = EGP [(x(t1), . . . , x(tN ))(cid:62) |\u00b50:T ] is the vector of conditional means at observation times\n2 ln(|2\u03c0K|). This intractable free energy contains a functional in\nti. K is the kernel matrix and \u03ba = 1\nthe exponent which is bilinear in the switching process \u00b5. In the spirit of other variational transfor-\nmations [13, 14] this can be linearised through a Legendre transform (or convex duality). Applying\n2 z(cid:62)A\u22121z = max\u03b8\n1\nand exchanging the max operation with the expectation over \u00b5, leads to the lower bound\n\n(cid:8)\u03b8(cid:62)z \u2212 1\n2 \u03b8(cid:62)A\u03b8(cid:9) to the vector z = (y \u2212 x\u00b5) and the matrix A = (K + \u03c32I),\n(cid:18)\n\u22121\n2 \u03b8(cid:62)(K + \u03c32I)\u03b8 \u2212 ln E\u00b5\n\n(cid:2)exp(cid:8)\u2212\u03b8(cid:62)(y \u2212 x\u00b5)(cid:9)(cid:3)(cid:19)\n\nF \u2265 \u03ba + max\n\n(20)\n\n.\n\n\u03b8\n\nA similar upper bound which is however harder to evaluate computationally will be presented else-\nIt can be shown that the lower bound (20) neglects the variance of the E\u00b5 [x\u00b5] process\nwhere.\n(intuitively, the two point expectations in (19) are dropped). 
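The duality step behind (20) is the convex-duality representation of a quadratic form, ½ zᵀA⁻¹z = max_θ {θᵀz − ½ θᵀAθ}, maximised at θ* = A⁻¹z. A small numerical check of this identity; the random matrix and vector are stand-ins for K + σ²I and the residual y − x_μ:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)     # symmetric positive definite, stand-in for K + sigma^2 I
z = rng.standard_normal(5)      # stand-in for the residual y - x_mu

def dual_objective(theta):
    # linearised form appearing inside the lower bound (20)
    return theta @ z - 0.5 * theta @ A @ theta

theta_star = np.linalg.solve(A, z)            # maximiser theta* = A^{-1} z
quadratic = 0.5 * z @ np.linalg.solve(A, z)   # (1/2) z^T A^{-1} z
```

At θ = θ* the two sides agree, while any fixed θ gives a lower bound; that is what makes exchanging the max with the expectation over μ in (20) legitimate.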
The second term in the bracket looks\nlike the free energy for a jump process model having a (pseudo) log likelihood of the data given by\n\u2212\u03b8(cid:62)(y\u2212x\u00b5). This auxiliary free energy can again be rewritten in terms of the \u201cstandard variational\u201d\nrepresentation\n\n(cid:2)exp(cid:8)\u2212\u03b8(cid:62)(y \u2212 x\u00b5)(cid:9)(cid:3) = min\n\n(cid:8)KL[q(cid:107)pprior] + \u03b8(cid:62)(y \u2212 Eq[x\u00b5])(cid:9) ,\n\n\u2212 ln E\u00b5\n\n(21)\n\nq\n\nwhere in the second line we have introduced an arbitrary process q over the switching variable and\nused standard variational manipulations. Inserting (18) into the last term in (21), we see that this KL\nminimisation is of the same structure as the one in equation (16) with a linear functional of q in the\n(pseudo) likelihood term. Therefore the minimiser q is an inhomogeneous Markov jump process,\nand we can use a backward and forward sweep to compute marginals q1(t) exactly for a \ufb01xed \u03b8!\nThese marginals are used to compute the gradient of the lower bound (K + \u03c32I)\u03b8 + (y \u2212 Eq[x\u00b5])\nand we iterate between gradient ascent steps and recomputations of Eq[x\u00b5]. Since the minimax\nproblem de\ufb01ned by (20) and (21) is concave in \u03b8 and convex in q the solution must be unique. Upon\nconvergence, we use the switching process marginals q1 for prediction. Statistics of the smoothed\nx process can then be computed by summing the conditional GP statistics (obtained by exact GP\nregression) and the x\u00b5 statistics, which can be computed using the same methods as in [6].\n\n1In case of a process with smooth sample paths, we can write df (t) = g(t)dt with an \u201cordinary\u201d GP g\n\n5\n\n\fFigure 1: Results on synthetic data. Variational Markovian Gaussian-Jump process on the left,\napproximate RBF Gaussian-Jump process on the right. 
Top row, inferred posterior jump means\n(solid line) and true jump pro\ufb01le (dotted black) Bottom row: inferred posterior mean x (solid) with\ncon\ufb01dence intervals (dotted red); data points are shown as red crosses, and the true sample pro\ufb01le\nis shown as black dots. Notice that the less con\ufb01dent jump prediction for the RBF process gives a\nmuch higher uncertainty in the x prediction (see text). The x axis units are the simulation time steps.\n\n3 Results\n\n3.1 Synthetic data\n\nTo evaluate the performance and identi\ufb01ability of our model, we experimented \ufb01rst with a simple\none-dimensional synthetic data set generated using a jump pro\ufb01le with only two jumps. A sample\nfrom the resulting conditional Gaussian process was then obtained by simulating the SDE using\nthe Euler-Maruyama method, and ten identically spaced points were then taken from the sample\npath and corrupted with Gaussian noise. Inference was then carried out using two procedures: a\nMarkovian Gaussian-Jump process as described in Section 1, using the variational algorithm, and\na \u201cRBF\u201d Gaussian-Jump process with slowly varying covariance, as described in Section 2. The\nparameters s2, \u03c32 and f\u00b1 were kept \ufb01xed, while the A, b and \u03bb hyperparameters were optimised\nusing type II ML.\nThe inference results are shown in Figure 1: the left column gives the results of the variational\nsmoothing, while the right column gives the results obtained by \ufb01tting a RBF Gaussian-Jump pro-\ncess. The top row shows the inferred posterior mean of the discrete state distribution, while the\nbottom row gives the conditionally Gaussian posterior. We notice that both approaches provide a\ngood smoothing of the GP and the jump process, although the second jump is inferred as being\nslightly later than in the true path. 
Notice that the uncertainties associated with the RBF process are much higher than in the Markovian one, and are dominated by the uncertainty in the posterior mean caused by the uncertainty in the jump process, which is less confident than in the Markovian case (top right figure). This is probably due to the fact that the lower bound (20) ignores the contributions of the variance of the x_μ term in the free energy, which is due to the variance of the jump process, and hence removes the penalty for having intermediate jump posteriors. A similar behaviour was already noted in a related context in [14]. In terms of computational efficiency, the variational Markovian algorithm converged in approximately 0.1 seconds on a standard laptop, while the RBF process took approximately two minutes. As a baseline, we used a standard discrete time Switching Kalman Filter in the implementation of [15], but did not manage to obtain good results. It is not clear whether the problem resided in the short time series or in our application of the model.

Figure 2: Results on double well diffusion. Left: inferred posterior switch mean; right: smoothed data, with confidence intervals. The x axis units are the simulation time steps.

Estimation of the parameters using the variational upper bound also gave very accurate results, with A = 3.1 ± 0.3 × 10−2 (true value 3 × 10−2), b = 1.0 ± 2 × 10−2 (true value 1 × 10−2) and λ = 1.1 ± 0.1 × 10−2 (true value 1 × 10−2). 
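The ± figures above come from the Hessian of the variational free energy at the optimum, as noted in Section 1.2. A generic finite-difference sketch of that recipe; the function name and the differencing scheme are illustrative, not the paper's implementation:

```python
import numpy as np

def laplace_std_errors(free_energy, params, eps=1e-4):
    """Standard errors from the curvature of a scalar objective:
    std = sqrt(diag(H^{-1})), with the Hessian H estimated by central
    finite differences (an illustration of the recipe in the text)."""
    p = np.asarray(params, dtype=float)
    n = p.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += eps; pp[j] += eps
            pm = p.copy(); pm[i] += eps; pm[j] -= eps
            mp = p.copy(); mp[i] -= eps; mp[j] += eps
            mm = p.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (free_energy(pp) - free_energy(pm)
                       - free_energy(mp) + free_energy(mm)) / (4 * eps**2)
    return np.sqrt(np.diag(np.linalg.inv(H)))
```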
It is interesting to note that, if the system noise\nparameter \u03c32 was set at a higher value, then the A parameter was always driven to zero, leading to a\ndecoupling of the Gaussian and jump processes. In fact, it can be shown that the true free energy has\nalways a local minimum for A = 0: heuristically, the GP is always a suf\ufb01ciently \ufb02exible model to \ufb01t\nthe data on its own. However, for small levels of system noise, the evidence of the data is such that\nthe more complex model involving a jump process is favoured, giving a type of automated Occam\nrazor, which is one of the main attractions of Bayesian modelling.\n\n3.2 Diffusion in a double-well potential\n\nTo illustrate the properties of the Gaussian-jump process as an approximator for non-linear stochas-\ntic models, we considered the benchmark problem of smoothing data generated from a SDE with\ndouble-well potential drift and constant diffusion coef\ufb01cient. Since the process we wish to approx-\nimate is a diffusion process, we use the variational upper bound method, which gave good results\nin the synthetic experiments. The data we use is the same as the one used in [1], where a non-\nstationary Gaussian approximation to the non-linear SDE was proposed by means of a variational\napproximation. The results are shown in Figure 2: as is evident the method both captures accurately\nthe transition time, and provides an excellent smoothing (very similar to the one reported in [1]);\nthese results were obtained in 0.07 seconds, while the Gaussian process approximation of [1] in-\nvolves gradient descent in a high dimensional space and takes approximately three to four orders of\nmagnitude longer. 
Naturally, our method cannot be used in this case to estimate the parameters of the true (double well) prior drift, as it only models the linear behaviour near the bottom of each well; however, for smoothing purposes it provides a very accurate and efficient alternative method.

3.3 Regulation of competence in B. subtilis

Regulation of gene expression at the transcriptional level provides an important application, as well as motivation for the class of models we have been considering. Transcription rates are modulated by the action of transcription factors (TFs), DNA binding proteins which can be activated fast in response to environmental signals. The activation state of a TF is a notoriously difficult quantity to measure experimentally; this has motivated a significant effort within the machine learning and systems biology community to provide models to infer TF activities from more easily measurable gene expression levels [2, 16, 17]. In this section, we apply our model to single cell fluorescence measurements of protein concentrations; the intrinsic stochasticity inherent in single cell data would make conditionally deterministic models such as [2, 6] an inappropriate tool, while our variational SDE model should be able to better capture the inherent fluctuations.

Figure 3: Results on competence circuit. Left: inferred posterior switch mean (ComK activity profile); right: smoothed ComS data, with confidence intervals. The y axis units in the right hand panel are arbitrary fluorescence units.

The data we use was obtained in [18] during a study of the genetic regulation of competence in B. subtilis: briefly, bacteria under food shortage can either enter a dormant stage (spore) or can continue to replicate their DNA without dividing (competence). 
Competence is essentially a bet that\nthe food shortage will be short-lived: in that case, the competent cell can immediately divide into\nmany daughter cells, giving an evolutionary advantage. The molecular mechanisms underpinning\ncompetence are quite complex, but the essential behaviour can be captured by a simple system\ninvolving only two components: the competence regulator ComK and the auxiliary protein ComS,\nwhich is controlled by ComK with a switch-like behaviour (Hill coef\ufb01cient 5).\nIn [18], ComK\nactivity was indirectly estimated using a gene reporter system (using the ComG promoter). Here,\nwe leave ComK as a latent switching variable, and use our model to smooth the ComS data. The\nresults are shown in Figure 3, showing a clear switch behaviour for ComK activity (as expected, and\nin agreement with the high Hill coef\ufb01cient), and a good smoothing of the ComS data. Analysis of the\noptimal parameters is also instructive: while the A and b parameters are not so informative due to the\nfact that \ufb02uorescence measurements are reported in arbitrary units, the ComS decay rate is estimated\nas 0.32 \u00b1 0.06h\u22121, corresponding to a half life of approximately 3 hours, which is clearly plausible\nfrom the data. It should be pointed out that, in the simulations in the supplementary material of [18],\na nominal value of 0.0014 s\u22121 was used, corresponding to a half life of only 20 minutes! While\nthe purpose of that simulation was to recreate the qualitative behaviour of the system, rather than to\nestimate its parameters, the use of such an implausible parameter value illustrates all too well the\nneed for appropriate data-driven tools in modelling complex systems.\n\n4 Discussion\n\nIn this contribution we proposed a novel inference methodology for continuous time conditionally\nGaussian processes. 
As well as being interesting in its own right as a method for inference in jump-diffusion processes (to our knowledge the first to be proposed), these models find a powerful motivation due to their relevance to fields such as systems biology, as well as being plausible approximations to non-linear diffusion processes. We presented both a method based on a variational upper bound in the case of Markovian processes, and a more general lower bound which holds also for non-Markovian Gaussian processes.

A natural question from the machine learning point of view is what are the advantages of continuous time over discrete time approaches. As well as providing a conceptually more correct description of the system, continuous time approaches have at least two significant advantages in our view: a computational advantage in the availability of more stable solvers (such as Runge-Kutta methods), and a communication advantage, as they are more immediately understandable to the large community of modellers who use differential equations but may not be familiar with statistical methods.

There are several possible extensions to the work we presented: a relatively simple task would be an extension to a factorial design such as the one proposed for conditionally deterministic systems in [14]. A theoretical task of interest would be a thorough investigation of the relationship between the upper and lower bounds we presented. This is possible, at least for Markovian GPs, but will be presented in other work.

References

[1] Cedric Archambeau, Dan Cornford, Manfred Opper, and John Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research Workshop and Conference Proceedings, 1(1):1–16, 2007.

[2] Neil D. Lawrence, Guido Sanguinetti, and Magnus Rattray. 
Modelling transcriptional regulation using Gaussian processes. In Advances in Neural Information Processing Systems 19, 2006.

[3] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Continuous time Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2002.

[4] Manfred Opper and Guido Sanguinetti. Variational inference for Markov jump processes. In Advances in Neural Information Processing Systems 20, 2007.

[5] Ido Cohn, Tal El-Hay, Nir Friedman, and Raz Kupferman. Mean field variational approximation for continuous-time Bayesian networks. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

[6] Guido Sanguinetti, Andreas Ruttor, Manfred Opper, and Cedric Archambeau. Switching regulatory models of cellular stress response. Bioinformatics, 25(10):1280–1286, 2009.

[7] Mauricio Alvarez, David Luengo, and Neil D. Lawrence. Latent force models. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

[8] Carl E. Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2005.

[9] C. W. Gardiner. Handbook of Stochastic Methods. Springer, Berlin, second edition, 1996.

[10] Andreas Ruttor and Manfred Opper. Efficient statistical inference for stochastic reaction processes. Phys. Rev. Lett., 103(23), 2009.

[11] Cedric Archambeau and Manfred Opper. Approximate inference for continuous-time Markov processes. In David Barber, Taylan Cemgil, and Silvia Chiappa, editors, Inference and Learning in Dynamic Models. Cambridge University Press, 2010.

[12] M. A. Lifshits. Gaussian Random Functions. Kluwer, Dordrecht, second edition, 1995.

[13] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. 
Machine Learning, 37:183\u2013233, 1999.\n\n[14] Manfred Opper and Guido Sanguinetti. Learning combinatorial transcriptional dynamics from\n\ngene expression data. Bioinformatics, 26(13):1623\u20131629, 2010.\n\n[15] David Barber. Expectation correction for smoothing in switching linear Gaussian state space\n\nmodels. Journal of Machine Learning Research, 7:2515\u20132540, 2006.\n\n[16] James C. Liao, Riccardo Boscolo, Young-Lyeol Yang, Linh My Tran, Chiara Sabatti, and\nVwani P. Roychowdhury. Network component analysis: Reconstruction of regulatory signals\nin biological systems. Proceedings of the National Academy of Sciences USA, 100(26):15522\u2013\n15527, 2003.\n\n[17] Martino Barenco, Daniela Tomescu, David Brewer, Robin Callard, Jaroslav Stark, and Michael\nHubank. Ranked prediction of p53 targets using hidden variable dynamical modelling. Genome\nBiology, 7(3), 2006.\n\n[18] G\u00a8urol M. Su\u00a8el, Jordi Garcia-Ojalvo, Louisa M. Liberman, and Michael B. Elowitz. An ex-\ncitable gene regulatory circuit induces transient cellular differentiation. Nature, 440:545\u201350,\n2006.\n\n9\n\n\f", "award": [], "sourceid": 1095, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Andreas", "family_name": "Ruttor", "institution": null}, {"given_name": "Guido", "family_name": "Sanguinetti", "institution": null}]}