{"title": "Efficient Sampling for Gaussian Process Inference using Control Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1681, "page_last": 1688, "abstract": "Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by continuously minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation.", "full_text": "Ef\ufb01cient Sampling for Gaussian Process Inference\n\nusing Control Variables\n\nMichalis K. Titsias, Neil D. Lawrence and Magnus Rattray\n\nSchool of Computer Science, University of Manchester\n\nManchester M13 9PL, UK\n\nAbstract\n\nSampling functions in Gaussian process (GP) models is challenging because of\nthe highly correlated posterior distribution. We describe an ef\ufb01cient Markov chain\nMonte Carlo algorithm for sampling from the posterior process of the GP model.\nThis algorithm uses control variables which are auxiliary function values that pro-\nvide a low dimensional representation of the function. At each iteration, the al-\ngorithm proposes new values for the control variables and generates the function\nfrom the conditional GP prior. The control variable input locations are found by\nminimizing an objective function. 
We demonstrate the algorithm on regression\nand classi\ufb01cation problems and we use it to estimate the parameters of a differen-\ntial equation model of gene regulation.\n\n1 Introduction\n\nGaussian processes (GPs) are used for Bayesian non-parametric estimation of unobserved or latent\nfunctions. In regression problems with Gaussian likelihoods, inference in GP models is analytically\ntractable, while for classi\ufb01cation deterministic approximate inference algorithms are widely used\n[16, 4, 5, 11]. However, in recent applications of GP models in systems biology [1] that require the\nestimation of ordinary differential equation models [2, 13, 8], the development of deterministic ap-\nproximations is dif\ufb01cult since the likelihood can be highly complex. Other applications of Gaussian\nprocesses where inference is intractable arise in spatio-temporal models and geostatistics and de-\nterministic approximations have also been developed there [14]. In this paper, we consider Markov\nchain Monte Carlo (MCMC) algorithms for inference in GP models. An advantage of MCMC over\ndeterministic approximate inference is that it provides an arbitrarily precise approximation to the\nposterior distribution in the limit of long runs. Another advantage is that the sampling scheme will\noften not depend on details of the likelihood function, and is therefore very generally applicable.\nIn order to bene\ufb01t from the advantages of MCMC it is necessary to develop an ef\ufb01cient sampling\nstrategy. This has proved to be particularly dif\ufb01cult in many GP applications, because the posterior\ndistribution describes a highly correlated high-dimensional variable. Thus simple MCMC sampling\nschemes such as Gibbs sampling can be very inef\ufb01cient. In this contribution we describe an ef\ufb01-\ncient MCMC algorithm for sampling from the posterior process of a GP model which constructs\nthe proposal distributions by utilizing the GP prior. 
This algorithm uses control variables which are auxiliary function values. At each iteration, the algorithm proposes new values for the control variables and samples the function by drawing from the conditional GP prior. The control variables are highly informative points that provide a low dimensional representation of the function. The control input locations are found by minimizing an objective function. The objective function used is the expected least squares error of reconstructing the function values from the control variables, where the expectation is over the GP prior.

We demonstrate the proposed MCMC algorithm on regression and classification problems and compare it with two Gibbs sampling schemes. We also apply the algorithm to inference in a systems biology model where a set of genes is regulated by a transcription factor protein [8]. This provides an example of a problem with a non-linear and non-factorized likelihood function.

2 Sampling algorithms for Gaussian Process models

In a GP model we assume a set of inputs (x_1, ..., x_N) and a set of function values f = (f_1, ..., f_N) evaluated at those inputs. A Gaussian process places a prior on f which is an N-dimensional Gaussian distribution, so that p(f) = N(f | \mu, K). The mean \mu is typically zero and the covariance matrix K is defined by the kernel function k(x_n, x_m) that depends on parameters \theta. GPs are widely used for supervised learning [11], in which case we have a set of observed pairs (y_i, x_i), where i = 1, ..., N, and we assume a likelihood model p(y|f) that depends on parameters \alpha. For regression or classification problems, the latent function values are evaluated at the observed inputs and the likelihood factorizes according to p(y|f) = \prod_{i=1}^N p(y_i | f_i). However, for other types of applications, such as modelling latent functions in ordinary differential equations, the above factorization is not applicable. Assuming that we have obtained suitable values for the model parameters (\theta, \alpha), inference over f is done by applying Bayes' rule:

p(f|y) \propto p(y|f) p(f).   (1)

For regression, where the likelihood is Gaussian, the above posterior is a Gaussian distribution that can be obtained using simple algebra. When the likelihood p(y|f) is non-Gaussian, computations become intractable and we need to carry out approximate inference.

The MCMC algorithm we consider is the general Metropolis-Hastings (MH) algorithm [12]. Suppose we wish to sample from the posterior in eq. (1). The MH algorithm forms a Markov chain. We initialize f^{(0)} and we consider a proposal distribution Q(f^{(t+1)} | f^{(t)}) that allows us to draw a new state given the current state. The new state is accepted with probability min(1, A) where

A = \frac{p(y|f^{(t+1)}) p(f^{(t+1)})}{p(y|f^{(t)}) p(f^{(t)})} \frac{Q(f^{(t)} | f^{(t+1)})}{Q(f^{(t+1)} | f^{(t)})}.   (2)

To apply this generic algorithm, we need to choose the proposal distribution Q. For GP models, finding a good proposal distribution is challenging since f is high dimensional and the posterior distribution can be highly correlated.

To motivate the algorithm presented in section 2.1, we discuss two extreme options for specifying the proposal distribution Q. One simple way to choose Q is to set it equal to the GP prior p(f). This gives us an independent MH algorithm [12]. However, sampling from the GP prior is very inefficient, as it is unlikely to obtain a sample that will fit the data. Thus the Markov chain will get stuck in the same state for thousands of iterations. On the other hand, sampling from the prior is appealing because any generated sample satisfies the smoothness requirement imposed by the covariance function.
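To make the independent MH baseline concrete, here is a minimal sketch (function names are ours; a squared-exponential kernel and a Gaussian regression likelihood are assumed purely for illustration). Because the proposal is the prior itself, Q(f'|f) = p(f'), the prior terms in eq. (2) cancel and the acceptance ratio reduces to the likelihood ratio:

```python
import numpy as np

def se_kernel(X1, X2, var=1.0, ell=0.2):
    # Squared-exponential kernel k(x, x') = var * exp(-||x - x'||^2 / (2 ell^2)).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def log_lik(f, y, noise_var=0.09):
    # Gaussian regression log-likelihood log p(y|f), up to an additive constant.
    return -0.5 * np.sum((y - f) ** 2) / noise_var

def independent_mh(y, K, n_iter=2000, seed=0):
    # Independence sampler with the GP prior N(0, K) as proposal.
    # Acceptance probability: min(1, p(y|f') / p(y|f)).
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(K)))
    f = L @ rng.standard_normal(len(K))
    accepted = 0
    for _ in range(n_iter):
        f_prop = L @ rng.standard_normal(len(K))
        if np.log(rng.uniform()) < log_lik(f_prop, y) - log_lik(f, y):
            f = f_prop
            accepted += 1
    return f, accepted / n_iter
```

On densely observed data the acceptance rate of this sampler collapses towards zero, which is exactly the inefficiency described above.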
Functions drawn from the posterior GP process should satisfy the same smoothness requirement as well.

The other extreme choice for the proposal, considered in [10], is to apply Gibbs sampling, where we iteratively draw samples from each posterior conditional density p(f_i | f_{-i}, y) with f_{-i} = f \setminus f_i. However, Gibbs sampling can be extremely slow for densely discretized functions, as in the regression problem of Figure 1, where the posterior GP process is highly correlated. To clarify this, note that the variance of the posterior conditional p(f_i | f_{-i}, y) is smaller than or equal to the variance of the conditional GP prior p(f_i | f_{-i}). However, p(f_i | f_{-i}) may already have a tiny variance caused by the conditioning on all remaining latent function values. For the one-dimensional example in Figure 1, Gibbs sampling is practically not applicable. We further study this issue in section 4.

An algorithm similar to Gibbs sampling can be expressed by using the sequence of conditional densities p(f_i | f_{-i}) as a proposal distribution for the MH algorithm (footnote 1: thus we replace the proposal distribution p(f_i | f_{-i}, y) with the prior conditional p(f_i | f_{-i})). We call this the Gibbs-like algorithm. This algorithm can exhibit a high acceptance rate, but it is inefficient for sampling highly correlated functions. A simple generalization of the Gibbs-like algorithm that is more appropriate for sampling from smooth functions is to divide the domain of the function into regions and sample the entire function within each region by conditioning on the remaining function regions. Local region sampling iteratively draws each block of function values f_k from the conditional GP prior p(f_k^{(t+1)} | f_{-k}^{(t)}), where f_{-k} = f \setminus f_k. However, this scheme is still inefficient for sampling highly correlated functions, since the variance of the proposal distribution can be very small close to the boundaries between neighbouring function regions. The description of this algorithm is given in the supplementary material. In the next section we discuss an algorithm using control variables that can efficiently sample from highly correlated functions.

2.1 Sampling using control variables

Let f_c be a set of M auxiliary function values that are evaluated at inputs X_c and drawn from the GP prior. We call f_c the control variables and their meaning is analogous to the auxiliary inducing variables used in sparse GP models [15]. To compute the posterior p(f|y) based on control variables we use the expression

p(f|y) = \int p(f | f_c, y) p(f_c | y) df_c.   (3)

Assuming that f_c is highly informative about f, so that p(f | f_c, y) \simeq p(f | f_c), we can approximately sample from p(f|y) in a two-stage manner: firstly sample the control variables from p(f_c | y) and then generate f from the conditional prior p(f | f_c). This scheme allows us to introduce an MH algorithm, where we need to specify only a proposal distribution q(f_c^{(t+1)} | f_c^{(t)}) that will mimic sampling from p(f_c | y), and always sample f from the conditional prior p(f | f_c). The whole proposal distribution takes the form

Q(f^{(t+1)}, f_c^{(t+1)} | f^{(t)}, f_c^{(t)}) = p(f^{(t+1)} | f_c^{(t+1)}) q(f_c^{(t+1)} | f_c^{(t)}).   (4)

Each proposed sample is accepted with probability min(1, A) where A is given by

A = \frac{p(y|f^{(t+1)}) p(f_c^{(t+1)})}{p(y|f^{(t)}) p(f_c^{(t)})} \frac{q(f_c^{(t)} | f_c^{(t+1)})}{q(f_c^{(t+1)} | f_c^{(t)})}.   (5)

The usefulness of the above sampling scheme stems from the fact that the control variables can form a low-dimensional representation of the function.
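The second stage of this scheme, generating f from the conditional GP prior p(f | f_c), has the standard Gaussian-conditioning form: mean K_{f,c} K_{c,c}^{-1} f_c and covariance K_{f,f} - K_{f,c} K_{c,c}^{-1} K_{c,f}. A minimal sketch, assuming a squared-exponential kernel (function names are ours):

```python
import numpy as np

def se_kernel(X1, X2, var=1.0, ell=0.2):
    # Squared-exponential kernel, assumed here for illustration.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def sample_f_given_fc(X, Xc, fc, rng, jitter=1e-6):
    # Draw f from the conditional GP prior p(f|fc) with
    # mean K_fc K_cc^{-1} fc and covariance K_ff - K_fc K_cc^{-1} K_cf.
    Kff = se_kernel(X, X)
    Kfc = se_kernel(X, Xc)
    Kcc = se_kernel(Xc, Xc) + jitter * np.eye(len(Xc))
    A = np.linalg.solve(Kcc, Kfc.T).T          # K_fc K_cc^{-1}
    mean = A @ fc
    cov = Kff - A @ Kfc.T
    L = np.linalg.cholesky(cov + jitter * np.eye(len(X)))
    return mean + L @ rng.standard_normal(len(X))
```

Functions generated this way automatically satisfy the smoothness imposed by the covariance function, and they interpolate the control values: at the control inputs the conditional variance is (numerically) zero.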
Assuming that these variables are much fewer than the points in f, the sampling is mainly carried out in the low dimensional space. In section 2.2 we describe how to select the number M of control variables and the inputs X_c so that f_c becomes highly informative about f. In the remainder of this section we discuss how we set the proposal distribution q(f_c^{(t+1)} | f_c^{(t)}).

A suitable choice for q is a Gaussian distribution with a diagonal or full covariance matrix. The covariance matrix can be adapted during the burn-in phase of MCMC in order to increase the acceptance rate. Although this scheme is general, it has practical limitations. Firstly, tuning a full covariance matrix is time consuming, and in our case this adaption process must be carried out simultaneously with searching for an appropriate set of control variables. Also, since the terms involving p(f_c) do not cancel out in the acceptance probability in eq. (5), using a diagonal covariance for the q distribution risks proposing control variables that do not satisfy the GP prior smoothness requirement. To avoid these problems, we define q by utilizing the GP prior. According to eq. (3), a suitable choice for q must mimic sampling from the posterior p(f_c | y). Given that the control points are far apart from each other, Gibbs sampling in the control variable space can be efficient. However, iteratively sampling f_{c_i} from the conditional posterior p(f_{c_i} | f_{c_{-i}}, y) \propto p(y | f_c) p(f_{c_i} | f_{c_{-i}}), where f_{c_{-i}} = f_c \setminus f_{c_i}, is intractable for non-Gaussian likelihoods (footnote 2: this is because we need to integrate out f in order to compute p(y | f_c)). An attractive alternative is a Gibbs-like algorithm where each f_{c_i} is drawn from the conditional GP prior p(f_{c_i}^{(t+1)} | f_{c_{-i}}^{(t)}) and is accepted using the MH step. More specifically, the proposal distribution draws a new f_{c_i}^{(t+1)} for a certain control variable i from p(f_{c_i}^{(t+1)} | f_{c_{-i}}^{(t)}) and generates the function f^{(t+1)} from p(f^{(t+1)} | f_{c_i}^{(t+1)}, f_{c_{-i}}^{(t)}). The sample (f_{c_i}^{(t+1)}, f^{(t+1)}) is accepted using the MH step. This scheme of sampling the control variables one-at-a-time and resampling f is iterated between different control variables. A complete iteration of the algorithm consists of a full scan over all control variables. The acceptance probability A in eq. (5) becomes the likelihood ratio and the prior smoothness requirement is always satisfied. The iteration between different control variables is illustrated in Figure 1.

Figure 1: Visualization of iterating between control variables. The red solid line is the current f^{(t)}, the blue line is the proposed f^{(t+1)}, the red circles are the current control variables f_c^{(t)}, and the diamond (in magenta) is the proposed control variable f_{c_i}^{(t+1)}. The blue solid vertical line represents the distribution p(f_{c_i}^{(t+1)} | f_{c_{-i}}^{(t)}) (with two-standard-error bars) and the shaded area shows the effective proposal p(f^{(t+1)} | f_{c_{-i}}^{(t)}).

Although the control variables are sampled one-at-a-time, f can still be drawn with a considerable variance. To clarify this, note that when the control variable f_{c_i} changes, the effective proposal distribution for f is

p(f^{(t+1)} | f_{c_{-i}}^{(t)}) = \int p(f^{(t+1)} | f_{c_i}^{(t+1)}, f_{c_{-i}}^{(t)}) p(f_{c_i}^{(t+1)} | f_{c_{-i}}^{(t)}) df_{c_i}^{(t+1)},

which is the conditional GP prior given all the control points apart from the current point f_{c_i}. This conditional prior can have considerable variance close to f_{c_i} and in all regions that are not close to the remaining control variables.
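One complete scan of this sampler can be sketched as follows (a minimal illustration under a Gaussian likelihood, with a squared-exponential kernel and helper names of our own choosing; the acceptance step is the likelihood ratio because the proposal is the conditional GP prior):

```python
import numpy as np

def se_kernel(X1, X2, var=1.0, ell=0.2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def gauss_cond(K, i, vals):
    # Mean and variance of component i of N(0, K) given the remaining components.
    idx = [j for j in range(len(K)) if j != i]
    k_ir = K[np.ix_([i], idx)].ravel()
    w = np.linalg.solve(K[np.ix_(idx, idx)], k_ir)
    return w @ vals[idx], K[i, i] - w @ k_ir

def control_scan(f, fc, X, Xc, y, noise_var, rng, jitter=1e-6):
    # One full iteration: for each control variable, propose from the
    # conditional GP prior, regenerate f from p(f|fc), and accept with
    # the likelihood ratio (eq. (5) with the prior terms cancelling).
    Kcc = se_kernel(Xc, Xc) + jitter * np.eye(len(Xc))
    Kfc = se_kernel(X, Xc)
    A = np.linalg.solve(Kcc, Kfc.T).T
    cov = se_kernel(X, X) - A @ Kfc.T + jitter * np.eye(len(X))
    L = np.linalg.cholesky(cov)
    def loglik(f):
        return -0.5 * np.sum((y - f) ** 2) / noise_var
    for i in range(len(fc)):
        m, v = gauss_cond(Kcc, i, fc)
        fc_prop = fc.copy()
        fc_prop[i] = m + np.sqrt(max(v, 0.0)) * rng.standard_normal()
        f_prop = A @ fc_prop + L @ rng.standard_normal(len(X))
        if np.log(rng.uniform()) < loglik(f_prop) - loglik(f):
            f, fc = f_prop, fc_prop
    return f, fc
```

Every proposed function here is a draw from the conditional GP prior, so the smoothness requirement is satisfied by construction.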
As illustrated in Figure 1, the iteration over different control variables allows f to be drawn with a considerable variance everywhere in the input space.

2.2 Selection of the control variables

To apply the previous algorithm we need to select the number M of control points and the associated inputs X_c. X_c must be chosen so that knowledge of f_c can determine f with small error. The prediction of f given f_c is equal to K_{f,c} K_{c,c}^{-1} f_c, which is the mean of the conditional prior p(f | f_c). A suitable way to search over X_c is to minimize the reconstruction error ||f - K_{f,c} K_{c,c}^{-1} f_c||^2 averaged over any possible value of (f, f_c):

G(X_c) = \int ||f - K_{f,c} K_{c,c}^{-1} f_c||^2 p(f | f_c) p(f_c) df df_c = Tr(K_{f,f} - K_{f,c} K_{c,c}^{-1} K_{f,c}^T).

The quantity inside the trace is the covariance of p(f | f_c), and thus G(X_c) is the total variance of this distribution. We can minimize G(X_c) w.r.t. X_c using continuous optimization, similarly to the approach in [15]. Note that when G(X_c) becomes zero, p(f | f_c) becomes a delta function.

To find the number M of control points, we minimize G(X_c) by incrementally adding control variables until the total variance of p(f | f_c) becomes smaller than a certain percentage of the total variance of the prior p(f); 5% was the threshold used in all our experiments. Then we start the simulation and observe the acceptance rate of the Markov chain. Standard heuristics [12] suggest that desirable acceptance rates of MH algorithms are around 1/4, so we require a full iteration of the algorithm (a complete scan over the control variables) to have an acceptance rate larger than 1/4.
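The trace objective and the incremental-addition loop can be sketched as follows. For illustration we use a greedy discrete search over the training inputs rather than the continuous optimization used in the paper, and a squared-exponential kernel; function names and the candidate set are our own assumptions:

```python
import numpy as np

def se_kernel(X1, X2, var=1.0, ell=0.2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell**2)

def G(X, Xc, jitter=1e-6):
    # G(Xc) = Tr(K_ff - K_fc K_cc^{-1} K_fc^T): total variance of p(f|fc).
    Kfc = se_kernel(X, Xc)
    Kcc = se_kernel(Xc, Xc) + jitter * np.eye(len(Xc))
    return np.trace(se_kernel(X, X)) - np.trace(Kfc @ np.linalg.solve(Kcc, Kfc.T))

def greedy_controls(X, frac=0.05):
    # Greedily add control inputs until the residual variance of p(f|fc)
    # drops below `frac` (5% in the paper) of the total prior variance.
    total = np.trace(se_kernel(X, X))
    chosen = []
    while True:
        if chosen and G(X, X[chosen]) < frac * total:
            return X[chosen]
        best = min((i for i in range(len(X)) if i not in chosen),
                   key=lambda i: G(X, X[chosen + [i]]))
        chosen.append(best)
```

For uniformly spread inputs and a stationary kernel this tends to spread the control inputs evenly over the data, consistent with the regular-grid behaviour noted below.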
When the chain has a low acceptance rate for the current set of control inputs X_c, it means that the variance of p(f | f_c) is still too high and we need to add more control points in order to further reduce G(X_c). The process of observing the acceptance rate and adding control variables is continued until we reach the desirable acceptance rate.

When the training inputs X are placed uniformly in the space and the kernel function is stationary, the minimization of G places X_c in a regular grid. In general, the minimization of G places the control inputs close to the clusters of the input data in such a way that the kernel function is taken into account. This suggests that G can also be used for learning inducing variables in sparse GP models in an unsupervised fashion, where the observed outputs y are not involved.

3 Applications

We consider two applications where exact inference is intractable due to a non-linear likelihood function: classification and parameter estimation in a differential equation model of gene regulation.

Classification: Deterministic inference methods for GP classification are described in [16, 4, 7]. Among these approaches, the Expectation-Propagation (EP) algorithm [9] is found to be the most efficient [6]. Our MCMC implementation confirms these findings, since sampling using control variables gave similar classification accuracy to EP.

Transcriptional regulation: We consider a small biological sub-system where a set of target genes is regulated by one transcription factor (TF) protein.
Ordinary differential equations (ODEs) can provide a useful framework for modelling the dynamics in these biological networks [1, 2, 13, 8]. The concentration of the TF and the gene specific kinetic parameters are typically unknown and need to be estimated by making use of a set of observed gene expression levels. We use a GP prior to model the unobserved TF activity, as proposed in [8], and apply full Bayesian inference based on the MCMC algorithm presented previously.

Barenco et al. [2] introduce a linear ODE model for gene activation from the TF. This approach was extended in [13, 8] to account for non-linear models. The general form of the ODE model for transcription regulation with a single TF is

dy_j(t)/dt = B_j + S_j g(f(t)) - D_j y_j(t),   (6)

where the changing level of a gene j's expression, y_j(t), is given by a combination of the basal transcription rate, B_j, the sensitivity, S_j, to its governing TF's activity, f(t), and the decay rate of the mRNA, D_j. The differential equation can be solved for y_j(t), giving

y_j(t) = B_j/D_j + A_j e^{-D_j t} + S_j e^{-D_j t} \int_0^t g(f(u)) e^{D_j u} du,   (7)

where the A_j term arises from the initial condition. Due to the non-linearity of the g function that transforms the TF, the integral in the above expression cannot be obtained analytically. However, numerical integration can be used to accurately approximate the integral, using a dense grid (u_i)_{i=1}^P of points in the time axis and evaluating the function at the grid points f_p = f(u_p). In this case the integral in the above equation can be written as \sum_{p=1}^{P_t} w_p g(f_p) e^{D_j u_p}, where the weights w_p arise from the numerical integration method used and, for example, can be given by the composite Simpson rule.

The TF concentration f(t) in the above system of ODEs is a latent function that needs to be estimated. Additionally, the kinetic parameters of each gene \alpha_j = (B_j, D_j, S_j, A_j) are unknown and also need to be estimated.
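Eq. (7) evaluated on a dense time grid can be sketched as below. For simplicity the weights w_p come from the trapezoidal rule rather than the composite Simpson rule mentioned above, and the function name is ours:

```python
import numpy as np

def gene_expression(t_grid, f_grid, B, S, D, A, g=lambda f: f):
    # Evaluate y_j(t) = B/D + A e^{-D t} + S e^{-D t} * int_0^t g(f(u)) e^{D u} du
    # on a dense grid, approximating the integral with cumulative trapezoidal weights.
    integrand = g(f_grid) * np.exp(D * t_grid)
    integral = np.concatenate(([0.0], np.cumsum(
        0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t_grid))))
    return B / D + A * np.exp(-D * t_grid) + S * np.exp(-D * t_grid) * integral
```

For a constant TF profile f(t) = c with g the identity, the integral is analytic, y(t) = B/D + A e^{-Dt} + S c (1 - e^{-Dt})/D, which gives a direct check of the quadrature.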
To infer these quantities we use mRNA measurements (obtained from microarray experiments) of N target genes at T different time steps. Let y_{jt} denote the observed gene expression level of gene j at time t and let y = {y_{jt}} collect together all these observations. Assuming Gaussian noise for the observed gene expressions, the likelihood of our data has the form

p(y | f, {\alpha_j}_{j=1}^N) = \prod_{j=1}^N \prod_{t=1}^T p(y_{jt} | f_{1 \le p \le P_t}, \alpha_j),   (8)

where each probability density in the above product is a Gaussian with mean given by eq. (7) and f_{1 \le p \le P_t} denotes the TF values up to time t. Notice that this likelihood is non-Gaussian due to the non-linearity of g. Further, this likelihood does not have a factorized form, as in the regression and classification cases, since an observed gene expression depends on the protein concentration activity at all previous time points. Also note that the discretization of the TF in P time points corresponds to a very dense grid, while the gene expression measurements are sparse, i.e. P >> T.

To apply full Bayesian inference in the above model, we need to define prior distributions over all unknown quantities. The protein concentration f is a positive quantity, thus a suitable prior is a GP prior for log f. The kinetic parameters of each gene are all positive scalars and are given vague gamma priors. Sampling the GP function is done exactly as described in section 2; we only have to plug the likelihood from eq. (8) into the MH step. Sampling the kinetic parameters is carried out using Gaussian proposal distributions with diagonal covariance matrices that sample the positive kinetic parameters in the log space.

Figure 2: (a) shows the evolution of the KL divergence (against the number of MCMC iterations) between the true posterior and the empirically estimated posteriors for a 5-dimensional regression dataset.
(b) shows the mean values, with one-standard-error bars, of the KL divergence (against the input dimension) between the true posterior and the empirically estimated posteriors. (c) plots the number of control variables together with the average correlation coefficient of the GP prior.

4 Experiments

In the first experiment we compare Gibbs sampling (Gibbs), sampling using local regions (region) (see the supplementary file) and sampling using control variables (control) on standard regression problems of varied input dimension. The performance of the algorithms can be accurately assessed by computing the KL divergences between the exact Gaussian posterior p(f|y) and the Gaussians obtained by MCMC. We fix the number of training points to N = 200 and vary the input dimension d from 1 to 10. The training inputs X were chosen randomly inside the unit hypercube [0, 1]^d. Thus, we can study the behavior of the algorithms w.r.t. the amount of correlation in the posterior GP process, which depends on how densely the function is sampled: the larger the dimension, the more sparsely the function is sampled. The outputs Y were generated by randomly producing a GP function using the squared-exponential kernel \sigma_f^2 \exp(-||x_m - x_n||^2 / (2 \ell^2)), where (\sigma_f^2, \ell^2) = (1, 100), and then adding noise with variance \sigma^2 = 0.09. The burn-in period was 10^4 iterations. For a certain dimension d, the algorithms were initialized to the same state, obtained by randomly drawing from the GP prior. The parameters (\sigma_f^2, \ell^2, \sigma^2) were fixed to the values that generated the data. The experimental setup was repeated 10 times so as to obtain confidence intervals. We used thinned samples (keeping one sample every 10 iterations) to calculate the means and covariances of the 200-dimensional posterior Gaussians.
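The KL divergence between the exact Gaussian posterior and a Gaussian fitted to the MCMC samples has the standard closed form for multivariate Gaussians; a minimal implementation (function name ours):

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    # KL(N(mu0, S0) || N(mu1, S1)) =
    # 0.5 * (tr(S1^{-1} S0) + (mu1-mu0)^T S1^{-1} (mu1-mu0) - k + ln(det S1 / det S0))
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k + ld1 - ld0)
```

The divergence is zero iff the two Gaussians coincide, so a sampler whose empirical mean and covariance drive this quantity to zero has recovered the exact posterior.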
Figure 2(a) shows the KL divergence against the number of MCMC iterations for the 5-dimensional input dataset. It seems that for 200 training points and 5 dimensions the function values are still highly correlated, and thus Gibbs takes much longer for the KL divergence to drop to zero. Figure 2(b) shows the KL divergence against the input dimension after fixing the number of iterations to 3 x 10^4. Clearly Gibbs is very inefficient in low dimensions because of the highly correlated posterior. As the dimension increases and the functions become sparsely sampled, Gibbs improves and eventually the KL divergence approaches zero. The region algorithm works better than Gibbs, but in low dimensions it also suffers from the problem of high correlation. For the control algorithm we observe that the KL divergence is very close to zero for all dimensions. Figure 2(c) shows the increase in the number of control variables used as the input dimension increases. The same plot shows the decrease of the average correlation coefficient of the GP prior as the input dimension increases. This is very intuitive, since one should expect the number of control variables to increase as the function values become more independent.

Next we consider two GP classification problems for which exact inference is intractable. We used the Wisconsin Breast Cancer (WBC) and the Pima Indians Diabetes (PID) binary classification datasets. The first consists of 683 examples (9 input dimensions) and the second of 768 examples (8 dimensions). 20% of the examples were used for testing in each case. The MCMC samplers were run for 5 x 10^4 iterations (thinned to one sample every five iterations) after a burn-in of 10^4 iterations. The hyperparameters were fixed to those obtained by EP.
(Footnote 3: For Gibbs we used 2 x 10^4 iterations, since the region and control algorithms require additional iterations during the adaption phase.)

Figure 3: Results for GP classification. Log-likelihood values are shown for MCMC samples obtained from (a) Gibbs and (b) control applied to the WBC dataset. In (c) we show the test errors (grey bars) and the average negative log likelihoods (black bars) on the WBC (left) and PID (right) datasets and compare with EP.

Figure 4: First row: The left plot shows the inferred TF concentration for p53; the small plot on the top-right shows the ground-truth protein concentration obtained by a Western blot experiment [2]. The middle plot shows the predicted expression of a gene obtained by the estimated ODE model; red crosses correspond to the actual gene expression measurements. The right-hand plot shows the estimated decay rates for all 5 target genes used to train the model. Grey bars display the parameters found by MCMC and black bars the parameters found in [2] using a linear ODE model. Second row: The left plot shows the inferred TF for LexA. Predicted expressions of two target genes are shown in the remaining two plots. Error bars in all plots correspond to 95% credibility intervals.

Figures 3(a) and (b) show the log-likelihood for MCMC samples on the WBC dataset, for the Gibbs and control algorithms respectively. It can be observed that mixing is far superior for the control algorithm, which has also converged to a much higher likelihood. In Figure 3(c) we compare the test error and the average negative log likelihood on the test data obtained by the two MCMC algorithms with the results from EP.
The proposed control algorithm shows similar classification performance to EP, while the Gibbs algorithm performs significantly worse on both datasets.

In the final two experiments we apply the control algorithm to infer the protein concentration of TFs that activate or repress a set of target genes. The latent function in these problems is always one-dimensional and densely discretized, and thus the control algorithm is the only one that can converge to the GP posterior process in a reasonable time.

We first consider the TF p53, a tumour suppressor activated during DNA damage. Seven samples of the expression levels of five target genes in three replicas are collected as the raw time course data. The non-linear activation of the protein follows the Michaelis-Menten-kinetics-inspired response [1] that allows saturation effects to be taken into account, so that g(f(t)) = f(t) / (\gamma_j + f(t)) in eq. (6), where the Michaelis constant for the jth gene is given by \gamma_j. Note that since f(t) is positive, the GP prior is placed on log f(t). To apply MCMC we discretize f using a grid of P = 121 points. During sampling, 7 control variables were needed to obtain the desirable acceptance rate. Running time was 4 hours for 5 x 10^5 sampling iterations plus 5 x 10^4 burn-in iterations. The first row of Figure 4 summarizes the estimated quantities obtained from the MCMC simulation.

Next we consider the TF LexA in E. coli that acts as a repressor.
In the repression case there is an analogous Michaelis-Menten model [1], where the non-linear function g takes the form g(f(t)) = 1 / (\gamma_j + f(t)). Again the GP prior is placed on the log of the TF activity. We applied our method to the same microarray data considered in [13], where mRNA measurements of 14 target genes are collected over T = 6 time points. The GP function f was discretized using 121 points. The results for the inferred TF profile along with predictions of two target genes are shown in the second row of Figure 4. Our inferred TF profile and reconstructed target gene profiles are similar to those obtained in [13]. However, for certain genes, our model provides a better fit to the gene profile.

5 Discussion

Gaussian processes allow for inference over latent functions using a Bayesian estimation framework. In this paper, we presented an MCMC algorithm that uses control variables. We showed that this sampling scheme can efficiently deal with highly correlated posterior GP processes. MCMC allows for full Bayesian inference in the transcription factor networks application. An important direction for future research will be scaling the models used to much larger systems of ODEs with multiple interacting transcription factors. In such large systems, where MCMC can become slow, a combination of our method with the fast sampling scheme in [3] could be used to speed up the inference.

Acknowledgments

This work is funded by EPSRC Grant No EP/F005687/1 "Gaussian Processes for Systems Identification with Applications in Systems Biology".

References

[1] U. Alon. An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman and Hall/CRC, 2006.

[2] M. Barenco, D. Tomescu, D. Brewer, J. Callard, R. Stark, and M. Hubank.
Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biology, 7(3), 2006.

[3] B. Calderhead, M. Girolami, and N. D. Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. In Advances in Neural Information Processing Systems, 22, 2008.

[4] L. Csato and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641-668, 2002.

[5] M. N. Gibbs and D. J. C. MacKay. Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464, 2000.

[6] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704, 2005.

[7] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the informative vector machine. In Advances in Neural Information Processing Systems, 13. MIT Press, 2002.

[8] N. D. Lawrence, G. Sanguinetti, and M. Rattray. Modelling transcriptional regulation using Gaussian processes. In Advances in Neural Information Processing Systems, 19. MIT Press, 2007.

[9] T. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362-369, 2001.

[10] R. M. Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report, Dept. of Statistics, University of Toronto, 1997.

[11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[12] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, 2nd edition, 2004.

[13] S. Rogers, R. Khanin, and M. Girolami. Bayesian model-based inference of transcription factor activity. BMC Bioinformatics, 8(2), 2006.

[14] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. NTNU Statistics Preprint, 2007.

[15] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 13. MIT Press, 2006.

[16] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.
", "award": [], "sourceid": 694, "authors": [{"given_name": "Neil", "family_name": "Lawrence", "institution": null}, {"given_name": "Magnus", "family_name": "Rattray", "institution": null}, {"given_name": "Michalis", "family_name": "Titsias", "institution": null}]}