{"title": "Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierarchical Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 656, "abstract": "We introduce a hierarchical Bayesian model for the discovery of putative regulators from gene expression data only. The hierarchy incorporates the knowledge that there are just a few regulators that by themselves only regulate a handful of genes. This is implemented through a so-called spike-and-slab prior, a mixture of Gaussians with different widths, with mixing weights from a hierarchical Bernoulli model. For efficient inference we implemented expectation propagation. Running the model on a malaria parasite data set, we found four genes with significant homology to transcription factors in an amoeba, one RNA regulator and three genes of unknown function (out of the top ten genes considered).", "full_text": "Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierarchical Approach\n\nJos\u00e9 Miguel Hern\u00e1ndez-Lobato\nEscuela Polit\u00e9cnica Superior\nUniversidad Aut\u00f3noma de Madrid, Madrid, Spain\nJosemiguel.hernandez@uam.es\n\nTjeerd Dijkstra\nLeiden Malaria Research Group\nLUMC, Leiden, The Netherlands\nt.dijkstra@lumc.nl\n\nTom Heskes\nInstitute for Computing and Information Sciences\nRadboud University Nijmegen, Nijmegen, The Netherlands\nt.heskes@science.ru.nl\n\nAbstract\n\nWe introduce a hierarchical Bayesian model for the discovery of putative regulators from gene expression data only. The hierarchy incorporates the knowledge that there are just a few regulators that by themselves only regulate a handful of genes. This is implemented through a so-called spike-and-slab prior, a mixture of Gaussians with different widths, with mixing weights from a hierarchical Bernoulli model. For efficient inference we implemented expectation propagation. 
Running the model on a malaria parasite data set, we found four genes with significant homology to transcription factors in an amoeba, one RNA regulator and three genes of unknown function (out of the top ten genes considered).\n\n1 Introduction\n\nBioinformatics provides a rich source for the application of techniques from machine learning. Especially the elucidation of regulatory networks underlying gene expression has led to a cornucopia of approaches: see [1] for a review. Here we focus on one aspect of network elucidation, the identification of the regulators of the causative agent of severe malaria, Plasmodium falciparum. Several properties of the parasite necessitate a tailored algorithm for regulator identification:\n\n\u2022 In most species gene regulation takes place at the first stage of gene expression, when a DNA template is transcribed into mRNA. This transcriptional control is mediated by specific transcription factors. Few specific transcription factors have been identified in Plasmodium based on sequence homology with other species [2, 3]. This could be due to Plasmodium possessing a unique set of transcription factors or due to other mechanisms of gene regulation, e.g. at the level of mRNA stability or post-transcriptional regulation.\n\u2022 Compared with yeast, gene expression in Plasmodium is hardly changed by perturbations, e.g. by adding chemicals or changing temperature [4]. The biological interpretation of this finding is that the parasite is so narrowly adapted to its environment inside a red blood cell that it follows a stereotyped gene expression program. From a machine learning point of view, this finding means that network elucidation techniques relying on perturbations of gene expression cannot be used.\n\u2022 Similar to yeast [5], data for three different strains of the parasite with time series of gene expression are publicly available [6]. 
These assay all of Plasmodium\u2019s 5,600 genes for about 50 time points. In contrast to yeast, there are no ChIP-chip data available and fewer than ten transcription factor binding motifs are known.\n\nTogether, these properties point to a vector autoregressive model making use of the gene expression time series. The model should not rely on sequence homology information, but it should be flexible enough to integrate sequence information in the future. This points to a Bayesian model as the favored approach.\n\n2 The model\n\nWe start with a semi-realistic model of transcription based on Michaelis-Menten kinetics [1] and subsequently simplify to obtain a linear model. Denoting the concentration of a certain mRNA transcript at time t by z(t) we write:\n\ndz(t)/dt = [V1 a1(t)^{M1} / (K1 + a1(t)^{M1})] \u00b7\u00b7\u00b7 [VN aN(t)^{MN} / (KN + aN(t)^{MN})] p(t) \u2212 z(t)/\u03c4z ,   (1)\n\nwith aj(t) the concentration of the j-th activator (positive regulator), p(t) the concentration of RNA polymerase and Vj, Kj, Mj and \u03c4z reaction constants. N denotes the number of potential activators. The activator is thought to bind to DNA motifs upstream of the transcription start site and to bind RNA polymerase, which reads the DNA template to produce an mRNA transcript. Mj can be thought of as the multiplicity of the motif; \u03c4z captures the characteristic life time of the transcript. While reasonably realistic, this equation harbors too many unknowns for reliable inference: 3N + 1 with N \u2248 1000. 
We proceed with several simplifications:\n\n\u2022 aj(t) \u226a Kj: activator concentration is low;\n\u2022 p(t) = p0 is constant;\n\u2022 dz(t)/dt \u2248 (z(t + \u0394) \u2212 z(t))/\u0394, with \u0394 the sampling period;\n\u2022 \u0394 \u2248 \u03c4z: sampling period roughly equal to transcript life time.\n\nCounting time in units of \u0394 and taking logarithms on both sides, Equation (1) then simplifies to\n\nlog z(t + 1) = C + M1 log a1(t) + \u00b7\u00b7\u00b7 + MN log aN(t),\n\nwith C = log(\u0394 V1 \u00b7\u00b7\u00b7 VN p0/(K1 \u00b7\u00b7\u00b7 KN)). This is a linear model for the gene expression level given the expression levels of a set of activators. With a similar derivation one can include repressors [1].\n\n2.1 A Bayesian model for sparse linear regression\n\nLet y be a vector with the log expression of the target gene and X = (x1, . . . , xN) a matrix whose columns contain the log expression of the candidate regulators. Assuming that the measurements are corrupted with additive Gaussian noise, we get y \u223c N(X\u03b2, \u03c3\u00b2I), where \u03b2 = (\u03b21, . . . , \u03b2N)^T is a vector of regression coefficients and \u03c3\u00b2 is the variance of the noise. Such a linear model is commonly used [7, 8, 9]. Both y and x1, . . . , xN are mean-centered vectors with T measurements. We specify an inverse gamma (IG) prior for \u03c3\u00b2 so that P(\u03c3\u00b2) = IG(\u03c3\u00b2, \u03bd/2, \u03bd\u03bb/2), where \u03bb is a prior estimate of \u03c3\u00b2 and \u03bd is the sample size associated with that estimate. We assume that a priori all components \u03b2i are independent and take a so-called \u201cspike and slab prior\u201d [10] for each of them. That is, we introduce binary latent variables \u03b3i, with \u03b3i = 1 if xi takes part in the regression of y and \u03b3i = 0 otherwise. 
Given \u03b3, the prior on \u03b2 then reads\n\nP(\u03b2|\u03b3) = \u220f_{i=1}^N P(\u03b2i|\u03b3i) = \u220f_{i=1}^N N(\u03b2i, 0, v1)^{\u03b3i} N(\u03b2i, 0, v0)^{1\u2212\u03b3i},\n\nwhere N(x, \u03bc, \u03c3\u00b2) denotes a Gaussian density with mean \u03bc and variance \u03c3\u00b2 evaluated at x. In order to enforce sparsity, the variance v1 of the slab should be larger than the variance v0 of the spike. Instead of picking the hyperparameters v1 and v0 directly, it is convenient to pick a threshold of practical significance \u03b4 so that P(\u03b3i = 1) gets more weight when |\u03b2i| > \u03b4 and P(\u03b3i = 0) gets more weight when |\u03b2i| < \u03b4 [10]. In this way, given \u03b4 and one of v1 or v0, we pick the other one such that\n\n\u03b4\u00b2 = log(v1/v0) / (v0^{\u22121} \u2212 v1^{\u22121}) .   (2)\n\nFinally, we assign independent Bernoulli priors to the components of the latent vector \u03b3:\n\nP(\u03b3) = \u220f_{i=1}^N Bern(\u03b3i, w) = \u220f_{i=1}^N w^{\u03b3i}(1 \u2212 w)^{1\u2212\u03b3i},\n\nso that each of the x1, . . . , xN can independently take part in the regression with probability w. 
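Equation (2) ties the threshold \u03b4 to the spike and slab variances; given v1 and \u03b4, the spike variance v0 can be recovered numerically, since the right-hand side of (2) increases monotonically from 0 to v1 as v0 grows toward v1. A minimal sketch (the function name `spike_variance` and the bisection approach are ours, not from the paper):

```python
import math

def spike_variance(v1, delta, tol=1e-12):
    """Solve Equation (2) for the spike variance v0, given the slab
    variance v1 and the threshold of practical significance delta.
    delta**2 = log(v1/v0) / (1/v0 - 1/v1) increases in v0 and ranges
    over (0, v1), so bisection on v0 in (0, v1) applies."""
    d2 = delta * delta
    assert 0.0 < d2 < v1, "threshold must satisfy 0 < delta^2 < v1"
    lo, hi = 1e-300, v1  # d2 -> 0 as v0 -> 0, and d2 -> v1 as v0 -> v1
    while hi - lo > tol * v1:
        mid = 0.5 * (lo + hi)
        val = math.log(v1 / mid) / (1.0 / mid - 1.0 / v1)
        if val < d2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, with the slab variance v1 = 1 and \u03b4 = 0.2 used later in the experiments, this yields a spike variance of roughly 8 \u00b7 10\u207b\u00b3, two orders of magnitude tighter than the slab.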
We can identify the candidate genes whose expression is more likely to be correlated with the target gene by means of the posterior distribution of \u03b3:\n\nP(\u03b3|y, X) = \u222b P(\u03b3, \u03b2, \u03c3\u00b2|y, X) d\u03b2 d\u03c3\u00b2 \u221d \u222b P(\u03b3, \u03b2, \u03c3\u00b2, y|X) d\u03b2 d\u03c3\u00b2 ,\n\nwhere\n\nP(\u03b3, \u03b2, \u03c3\u00b2, y|X) = N(y, X\u03b2, \u03c3\u00b2I) P(\u03b2|\u03b3) P(\u03b3) P(\u03c3\u00b2)\n= [\u220f_{t=1}^T N(yt, \u2211_{i=1}^N xi,t \u03b2i, \u03c3\u00b2)] [\u220f_{i=1}^N N(\u03b2i, 0, v1)^{\u03b3i} N(\u03b2i, 0, v0)^{1\u2212\u03b3i}] [\u220f_{i=1}^N Bern(\u03b3i, w)] IG(\u03c3\u00b2, \u03bd/2, \u03bd\u03bb/2) .   (3)\n\nUnfortunately, this posterior distribution cannot be computed exactly if the number N of candidate genes is larger than 25. An approximation based on Markov Chain Monte Carlo (MCMC) methods has been proposed in [11].\n\n2.2 A hierarchical model for gene regulation\n\nIn the section above we made use of the prior information that a target gene is typically regulated by a small number of regulators. We have not yet made use of the prior information that a regulator typically regulates more than one gene. We incorporate this information by a hierarchical extension of our previous model. We introduce a vector \u03c4 of binary latent variables where \u03c4i = 1 if gene i is a regulator and \u03c4i = 0 otherwise. 
The following joint distribution captures this idea:\n\nP(\u03c4, \u03b3, \u03b2, \u03c3\u00b2|X) = [\u220f_{j=1}^N \u220f_{t=1}^{T\u22121} N(xj,t+1, \u2211_{i=1, i\u2260j}^N xi,t \u03b2j,i, \u03c3\u00b2_j)] [\u220f_{j=1}^N \u220f_{i=1, i\u2260j}^N N(\u03b2j,i, 0, v1)^{\u03b3j,i} N(\u03b2j,i, 0, v0)^{1\u2212\u03b3j,i}] [\u220f_{j=1}^N \u220f_{i=1, i\u2260j}^N Bern(\u03b3j,i, w1)^{\u03c4i} Bern(\u03b3j,i, w0)^{1\u2212\u03c4i}] [\u220f_{j=1}^N IG(\u03c3\u00b2_j, \u03bdj/2, \u03bdj\u03bbj/2)] [\u220f_{i=1}^N Bern(\u03c4i, w)] .   (4)\n\nIn this hierarchical model, \u03b3 is a matrix of binary latent variables where \u03b3j,i = 1 if gene i takes part in the regression of gene j and \u03b3j,i = 0 otherwise. The relationship between regulators and regulatees suggests that P(\u03b3j,i = 1|\u03c4i = 1) should be bigger than P(\u03b3j,i = 1|\u03c4i = 0) and thus w1 > w0. Matrix \u03b2 contains regression coefficients, where \u03b2j,i is the regression coefficient between the expression of gene i and the delayed expression of gene j. Hyperparameter w represents the prior probability of any gene being a regulator and the elements \u03c3\u00b2_j of the vector \u03c3\u00b2 contain the variance of the noise in each of the N regressions. Hyperparameters \u03bbj and \u03bdj have the same meaning as in the model for sparse linear regression. The corresponding plate model is illustrated in Figure 1. We can identify the genes more likely to be regulators by means of the posterior distribution P(\u03c4|X). Compared with the sparse linear regression model we expanded the number of latent variables from O(N) to O(N\u00b2). 
In order to keep inference feasible we turn to an approximate inference technique.\n\nFigure 1: The hierarchical model for gene regulation.\n\n3 Expectation propagation\n\nThe Expectation Propagation (EP) algorithm [12] allows one to perform approximate Bayesian inference. In all Bayesian problems, the joint distribution of the model parameters \u03b8 and a data set D = {(xi, yi) : i = 1, . . . , n} with i.i.d. elements can be expressed as a product of terms\n\nP(\u03b8, D) = \u220f_{i=1}^n P(yi|xi, \u03b8) P(\u03b8) = \u220f_{i=1}^{n+1} ti(\u03b8) ,   (5)\n\nwhere tn+1(\u03b8) = P(\u03b8) is the prior distribution for \u03b8 and ti(\u03b8) = P(yi|xi, \u03b8) for i = 1, . . . , n. Expectation propagation proceeds to approximate (5) with a product of simpler terms\n\n\u220f_{i=1}^{n+1} ti(\u03b8) \u2248 \u220f_{i=1}^{n+1} \u02dcti(\u03b8) = Q(\u03b8) ,   (6)\n\nwhere all the term approximations \u02dcti are restricted to belong to the same family F of exponential distributions, but they do not have to integrate to 1. Note that Q will also be in F because F is closed under multiplication. Each term approximation \u02dcti is chosen so that\n\nQ(\u03b8) = \u02dcti(\u03b8) \u220f_{j\u2260i} \u02dctj(\u03b8) = \u02dcti(\u03b8) Q\\i(\u03b8)\n\nis as close as possible to\n\nti(\u03b8) \u220f_{j\u2260i} \u02dctj(\u03b8) = ti(\u03b8) Q\\i(\u03b8) ,\n\nin terms of the direct Kullback-Leibler (K-L) divergence. The pseudocode of the EP algorithm is:\n\n1. Initialize the term approximations \u02dcti and Q to be uniform.\n2. Repeat until all \u02dcti converge:\n(a) Choose a \u02dcti to refine and remove it from Q to get Q\\i (e.g. 
dividing Q by \u02dcti).\n(b) Update the term \u02dcti so that it minimizes the K-L divergence between tiQ\\i and \u02dctiQ\\i.\n(c) Re-compute Q so that Q = \u02dctiQ\\i.\n\nThe optimization problem in step (b) is solved by matching sufficient statistics between a distribution Q\u2032 within the F family and tiQ\\i; the new \u02dcti is then equal to Q\u2032/Q\\i. Because Q belongs to the exponential family it is generally trivial to calculate its normalization constant. Once Q is normalized it can approximate P(\u03b8|D). Finally, EP is not guaranteed to converge, although convergence can be improved by means of damped updates or double-loop algorithms [13].\n\n3.1 EP for sparse linear regression\n\nThe application of EP to the models of Section 2 introduces some nontrivial technicalities. Furthermore, we describe several techniques to speed up the EP algorithm. We approximate P(\u03b3, \u03b2, \u03c3\u00b2, y|X) for sparse linear regression by means of a factorized exponential distribution:\n\nP(\u03b3, \u03b2, \u03c3\u00b2, y|X) \u2248 [\u220f_{i=1}^N Bern(\u03b3i, qi) N(\u03b2i, \u03bci, si)] IG(\u03c3\u00b2, a, b) \u2261 Q(\u03b3, \u03b2, \u03c3\u00b2) ,   (7)\n\nwhere {qi, \u03bci, si : i = 1, . . . , N}, a and b are free parameters. Note that in the approximation Q(\u03b3, \u03b2, \u03c3\u00b2) all the components of the vectors \u03b3 and \u03b2 and the variable \u03c3\u00b2 are considered to be independent; this allows the approximation of P(\u03b3|y, X) by \u220f_{i=1}^N Bern(\u03b3i, qi). We tune the parameters of Q(\u03b3, \u03b2, \u03c3\u00b2) by means of EP over the unnormalized density P(\u03b3, \u03b2, \u03c3\u00b2, y|X). This density appears in (3) as a product of T + N terms (not counting the priors), which correspond to the ti terms in (5). This way, we have T + N term approximations with the same form as (7), which correspond to the term approximations \u02dcti in (6). 
The complexity is O(T N) per iteration, because updating any of the first T term approximations requires N operations. However, some of the EP update operations require computing integrals which do not have a closed-form expression. To avoid this, we employ the following simplifications when we update the first T term approximations:\n\n1. When updating the parameters {\u03bci, si : i = 1, . . . , N} of the Gaussians in the term approximations, we approximate a Student\u2019s t-distribution by means of a Gaussian distribution with the same mean and variance. This approximation becomes more accurate as the degrees of freedom of the t-distribution increase.\n\n2. When updating the parameters {a, b} of the IG in the term approximations, instead of propagating the sufficient statistics of an IG distribution we propagate the expectations of 1/\u03c3\u00b2 and 1/\u03c3\u2074. To achieve this, we have to perform two approximations like the one stated above. Note that in this case we are not minimizing the direct K-L divergence. However, at convergence, we expect the resulting IG in (7) to be sufficiently accurate.\n\nIn order to improve convergence, we re-update the last N term approximations each time one of the first T term approximations is updated. The computational complexity does not get worse than O(T N) and the resulting algorithm turns out to be faster. By comparison, the MCMC method in [11] takes O(N\u00b2) steps to generate a single sample from P(\u03b3|y, X). 
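The generic EP recipe of Section 3 (initialize, then iteratively remove a site, moment-match, and restore) can be exercised end to end on a one-dimensional toy posterior, N(\u03b8; 0, 1) \u220f_i \u03a6(y_i \u03b8) with probit sites, chosen as a stand-in because its tilted moments are available in closed form; this is not the paper's model, and all names are ours:

```python
import math

def ep_probit(ys, n_sweeps=30):
    """EP for the 1-D toy posterior p(theta) prop. to N(theta;0,1) * prod_i Phi(y_i*theta),
    y_i in {-1,+1}. Each site Phi(y_i*theta) gets an unnormalized Gaussian
    approximation in natural parameters (nu_i, tau_i); the prior N(0,1) is
    kept exact. Returns the mean and variance of Q."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    n = len(ys)
    nu, tau = [0.0] * n, [0.0] * n
    for _ in range(n_sweeps):
        for i, y in enumerate(ys):
            # (a) remove site i from Q to get the cavity Q\i
            prec_c = 1.0 + sum(tau) - tau[i]
            h_c = sum(nu) - nu[i]
            v_c, m_c = 1.0 / prec_c, h_c / prec_c
            # (b) moments of the tilted distribution Phi(y*theta) * Q\i(theta)
            z = y * m_c / math.sqrt(1.0 + v_c)
            r = phi(z) / Phi(z)
            m_new = m_c + y * v_c * r / math.sqrt(1.0 + v_c)
            v_new = v_c - v_c * v_c * r * (z + r) / (1.0 + v_c)
            # (c) new site = matched Gaussian divided by the cavity
            tau[i] = 1.0 / v_new - prec_c
            nu[i] = m_new / v_new - h_c
    prec = 1.0 + sum(tau)
    return sum(nu) / prec, 1.0 / prec
```

On this log-concave toy problem the EP mean agrees with brute-force quadrature to a few decimal places, which is the behavior the term-approximation machinery above relies on at much larger scale.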
On problems of much smaller size than we will consider in our experiments, one typically requires on the order of 10000 samples to obtain reasonably accurate estimates [10].\n\n3.2 EP for gene regulation\n\nWe approximate P(\u03c4, \u03b3, \u03b2, \u03c3\u00b2|X) by the factorized exponential distribution\n\nQ(\u03c4, \u03b3, \u03b2, \u03c3\u00b2) = [\u220f_{j=1}^N \u220f_{i=1, i\u2260j}^N Bern(\u03b3j,i, wj,i)] [\u220f_{j=1}^N \u220f_{i=1, i\u2260j}^N N(\u03b2j,i, \u03bcj,i, sj,i)] [\u220f_{j=1}^N IG(\u03c3\u00b2_j, aj, bj)] [\u220f_{i=1}^N Bern(\u03c4i, ti)] ,\n\nwhere {aj, bj, ti, wj,i, \u03bcj,i, sj,i : i = 1, . . . , N; j = 1, . . . , N; i \u2260 j} are free parameters. The posterior probability P(\u03c4|X) that indicates which genes are more likely to be regulators can then be approximated by \u220f_{i=1}^N Bern(\u03c4i, ti). Again, we fix the parameters in Q(\u03c4, \u03b3, \u03b2, \u03c3\u00b2) by means of EP over the joint density P(\u03c4, \u03b3, \u03b2, \u03c3\u00b2|X). It is trivial to adapt the EP algorithm used in the sparse linear regression model to this new case: the terms to be approximated are the same as before except for the new N(N \u2212 1) terms for the prior on \u03b3. As in the previous section and in order to improve convergence, we re-update all the N(N \u2212 1) term approximations corresponding to the prior on \u03b2 each time N of the N(T \u2212 1) term approximations corresponding to the regressions are updated. In order to reduce memory requirements, we collect all the N(N \u2212 1) terms for the prior on \u03b2 into a single term, which we can do because they are independent, so that we only store in memory one term approximation instead of N(N \u2212 1). 
We also group the N(N \u2212 1) terms for the prior on \u03b3 into N independent terms and the N(T \u2212 1) terms for the regressions into T \u2212 1 independent terms. Assuming a constant number of iterations (in our experiments, we need at most 20 iterations for EP to converge), the computational complexity and the memory requirements of the resulting algorithm are O(T N\u00b2). This indicates that it is feasible to analyze data sets which contain the expression pattern of thousands of genes. An MCMC algorithm would require O(N\u00b3) steps to generate just a single sample.\n\n4 Experiments with artificial data\n\nWe carried out experiments with artificially generated data in order to validate the EP algorithms. In the experiments for sparse linear regression we fixed the hyperparameters in (3) so that \u03bd = 3, \u03bb is the sample variance of the target vector y, v1 = 1, \u03b4 = N\u207b\u00b9, v0 is chosen according to (2) and w = N\u207b\u00b9. In the experiment for gene regulation we fixed the hyperparameters in (4) so that w = (N \u2212 1)\u207b\u00b9, \u03bdi = 3 and \u03bbi is the sample variance of the vector xi, w1 = 10\u207b\u00b9(N \u2212 1)\u207b\u00b9, w0 = 10\u207b\u00b2(N \u2212 1)\u207b\u00b9, v1 = 1, \u03b4 = 0.2 and v0 is chosen according to (2). Although the posterior probabilities are sensitive to some of the choices, the orderings of these probabilities, e.g., to determine the most likely regulators, are robust to even large changes.\n\n4.1 Sparse linear regression\n\nIn the first experiment we set T = 50 and generated candidate vectors x1, . . . , x6000 \u223c N(0, 3\u00b2I) and a target vector y = x1 \u2212 x2 + 0.5 x3 \u2212 0.5 x4 + \u03b5, where \u03b5 \u223c N(0, I). The EP algorithm assigned values close to 1 to w1 and w2, the parameters w3 and w4 obtained values 5.2 \u00b7 10\u207b\u00b3 and 0.5 respectively, and w5, . . . , w6000 were all smaller than 3 \u00b7 10\u207b\u2074. 
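The data-generating process of this first experiment is easy to reproduce. As an independent sanity check (ours, not part of the paper's procedure), ordinary least squares restricted to the four relevant covariates recovers the generating coefficients (1, \u22121, 0.5, \u22120.5) to within sampling error:

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations, solved by
    Gauss-Jordan elimination; fine for a handful of covariates."""
    n, p = len(X), len(X[0])
    A = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(p)]
         + [sum(X[t][i] * y[t] for t in range(n))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

random.seed(1)
T = 50
# the four covariates that actually enter the target, as in the experiment
X = [[random.gauss(0.0, 3.0) for _ in range(4)] for _ in range(T)]
beta_true = [1.0, -1.0, 0.5, -0.5]
y = [sum(b * x for b, x in zip(beta_true, X[t])) + random.gauss(0.0, 1.0)
     for t in range(T)]
beta_hat = ols(X, y)
```

The hard part of the experiment is of course not estimating these four coefficients but discovering them among 6000 candidates, which is what the spike-and-slab posterior handles.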
We repeated the experiment several times (each time using new data) and obtained similar results on each run. In the second experiment we set T = 50 and generated a target vector y \u223c N(0, 3\u00b2I) and candidate vectors x1, . . . , x500 so that xi = y + \u03b5i for i = 2, . . . , 500, where \u03b5i \u223c N(0, I). The candidate vector x1 is generated as x1 = y + 0.5 \u03b51, where \u03b51 \u223c N(0, I). This way, the noise in x1 is half as large as the noise in the other candidate vectors. Note that all the candidate vectors are highly correlated with each other and with the target vector. This is what happens in gene expression data sets, where many genes show similar expression patterns. We ran the EP algorithm 100 times (each time using new data) and it always assigned more or less the same value of 6 \u00b7 10\u207b\u2074 to all of w1, . . . , w500. However, w1 obtained the highest value on 54 of the runs and it was among the three ws with the highest values on 87 of the runs. Finally, we repeated these experiments setting N = 100, using the MCMC method of [11] and the EP algorithm for sparse linear regression. Both techniques produced results that are statistically indistinguishable (the approximations obtained through EP fall within the variation of the MCMC method), with EP requiring only a fraction of the time of MCMC.\n\n4.2 Gene regulation\n\nIn this experiment we set T = 50 and generated a vector z with T + 1 values from a sinusoid. We then generated 49 more vectors x2, . . . , x50 where xi,t = zt + \u03b5i,t for i = 2, . . . , 50 and t = 1, . . . , T, where \u03b5i,t \u223c N(0, \u03c3\u00b2) and \u03c3 is one fourth of the sample standard deviation of z. We also generated a vector x1 so that x1,t = zt+1 + \u03b5t, where t = 1, . . . , T and \u03b5t \u223c N(0, \u03c3\u00b2). In this way, x1 acts as a regulator for x2, . . . , x50. A single realization of the vectors x1, . . . , x50 is displayed on the left of Figure 2. 
We ran the EP algorithm for gene regulation over 100 different realizations of x1, . . . , x50. The algorithm assigned t1 the highest value on 33 of the runs and x1 was ranked among the top five on 74 of the runs. This indicates that the EP algorithm can successfully detect small differences in correlations and should be able to find new regulators in real microarray data.\n\n5 Experiments with real microarray data\n\nWe applied our algorithm to four data sets. The first is a yeast cell-cycle data set from [5] which is commonly used as a benchmark for regulator discovery. Data sets two through four are from three different Plasmodium strains [6]. Missing values were imputed by nearest neighbors [14] and the hyperparameters were fixed at the same values as in Section 4. The yeast cdc15 data set contains 23 measurements of 6178 genes. We singled out 751 genes which met a minimum criterion for cell cycle regulation [5]. The top ten genes with the highest values for \u03c4, along with their annotation from the Saccharomyces Genome Database, are listed in the table below: the top two genes are specific transcription factors and IOC2 is associated with transcription regulation. As 4% of the yeast genome is associated with transcription, the probability of this occurring by chance is 0.0062. 
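The quoted significance can be reproduced as a binomial tail probability, the chance of seeing 3 or more transcription-associated genes among 10 when each independently has a 4% background rate (our reconstruction of the calculation; the text does not spell it out):

```python
from math import comb

# P(at least 3 transcription-associated genes among the top 10 by chance),
# with a 4% genome-wide background rate
p, n = 0.04, 10
tail = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(3, n + 1))
print(round(tail, 4))  # -> 0.0062
```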
However, although the result is statistically significant, we were disappointed to find none of the known cell-cycle regulators (like ACE2, FKH* or SWI*) among the top ten.\n\nFigure 2: Left: Plot of the vectors x2, . . . , x50 in grey and the vector x1 in black. The vector x1 contains the expression of a regulator which would determine the expressions in x2, . . . , x50. Right: Expressions of gene PF11_321 (black) and the 100 genes which are more likely to be regulated by it (light and dark grey). Two clusters of positively and negatively regulated genes can be appreciated.\n\nrank  standard name  common name  annotation\n1  YLR098c  CHA4  DNA binding transcriptional activator\n2  YOR315w  SFG1  putative transcription factor for growth of superficial pseudohyphae\n3  YJL073w  JEM1  DNAJ-like chaperone\n4  YOR023c  AHC1  subunit of the ADA histone acetyl transferase complex\n5  YOR105w  -  dubious open reading frame\n6  YLR095w  IOC2  transcription elongation\n7  YOR321w  PMT3  protein O-mannosyl transferase\n8  YLR231c  BNA5  kynureninase\n9  YOR248w  -  dubious open reading frame\n10  YOR247w  SRL1  mannoprotein\n\nThe three data sets for the malaria parasite [6] contain 53 measurements (3D7), 50 measurements (Dd2) and 48 measurements (HB3). We focus on 3D7 as this is the sequenced reference strain. We singled out the 751 genes that showed the highest variation as quantified by the interquartile range of the expression measurements. 
The top ten genes with the highest values for \u03c4, along with their annotation from PlasmoDB, are listed in the table below. Recalling the motivation for our approach, the paucity of known transcription factors, we cannot expect to find many annotated regulators in PlasmoDB version 5.4. Thus, we list the BLASTP hits provided by PlasmoDB instead of the absent annotation. These hits were the highest scoring ones outside of the genus Plasmodium. We find four genes with a large identity to transcription factors in Dictyostelium (a recently sequenced social amoeba) and one annotated helicase, which typically functions in post-transcriptional regulation. Interestingly, three genes have no known function and could be regulators.\n\nrank  standard name  annotation or selected BLASTP hits\n1  PFC0950c  25% identity to GATA binding TF in Dictyostelium\n2  PF11 0321  25% identity to putative WRKY TF in Dictyostelium\n3  PFI1210w  no BLASTP matches outside Plasmodium genus\n4  MAL6P1.233  no BLASTP matches outside Plasmodium genus\n5  PFD0175c  32% identity to GATA binding TF in Dictyostelium\n6  MAL7P1.34  35% identity to GATA binding TF in Dictyostelium\n7  MAL6P1.182  N-acetylglucosaminyl-phosphatidylinositol de-N-acetylase\n8  PF13 0140  dihydrofolate synthase/folylpolyglutamate synthase\n9  PF13 0138  no BLASTP matches outside Plasmodium genus\n10  MAL13P1.14  DEAD box helicase\n\nResults for the HB3 strain were similar in that five putative regulators were found. Somewhat disappointingly, we found only one putative regulator (a helicase) among the top ten genes for Dd2.\n\n6 Conclusion and discussion\n\nOur approach enters a field full of methods enforcing sparsity ([15, 8, 7, 16, 9]). 
Our main contributions are: a hierarchical model to discover regulators, a tractable algorithm for fast approximate inference in models with many interacting variables, and the application to malaria.\n\nArguably most related is the hierarchical model in [15]. The covariates in this model are a dozen external variables, coding experimental conditions, instead of the hundreds of expression levels of other genes as in our model. Furthermore, the prior in [15] enforces sparsity on the \u201ccolumns\u201d of \u03b2 to implement the idea that some genes are not influenced by any of the experimental conditions. Our prior, on the other hand, enforces sparsity on the \u201crows\u201d in order to find regulators.\n\nFuture work could include more involved priors, e.g., enforcing sparsity on both \u201crows\u201d and \u201ccolumns\u201d, or incorporating information from DNA sequence data. The approximate inference techniques described in this paper make it feasible to evaluate such extensions in a fraction of the time required by MCMC methods.\n\nReferences\n\n[1] T.S. Gardner and J.J. Faith. Reverse-engineering transcription control networks. Physics of Life Reviews, 2:65\u201388, 2005.\n\n[2] R. Coulson, N. Hall, and C. Ouzounis. Comparative genomics of transcriptional control in the human malaria parasite Plasmodium falciparum. Genome Res., 14:1548\u20131554, 2004.\n\n[3] S. Balaji, M.M. Babu, L.M. Iyer, and L. Aravind. Discovery of the principal specific transcription factors of Apicomplexa and their implication for the evolution of the AP2-integrase DNA binding domains. Nucleic Acids Research, 33(13):3994\u20134006, 2005.\n\n[4] T. Sakata and E.A. Winzeler. Genomics, systems biology and drug development for infectious diseases. Molecular BioSystems, 3:841\u2013848, 2007.\n\n[5] P.T. Spellman, G. Sherlock, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, and D. 
Botstein. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273\u20133297, 1998.\n\n[6] M. Llin\u00e1s, Z. Bozdech, E.D. Wong, A.T. Adai, and J.L. DeRisi. Comparative whole genome transcriptome analysis of three Plasmodium falciparum strains. Nucleic Acids Research, 34(4):1166\u20131173, 2006.\n\n[7] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, UCL, 2003.\n\n[8] C. Sabatti and G.M. James. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics, 22(6):739\u2013746, 2006.\n\n[9] S.T. Jensen, G. Chen, and C.J. Stoeckert. Bayesian variable selection and data integration for biological regulatory networks. The Annals of Applied Statistics, 1:612\u2013633, 2007.\n\n[10] E.I. George and R.E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7:339\u2013374, 1997.\n\n[11] E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881\u2013889, 1993.\n\n[12] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.\n\n[13] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In UAI-2002, pages 216\u2013223, 2002.\n\n[14] O. Troyanskaya, M. Cantor, P. Brown, T. Hastie, R. Tibshirani, and D. Botstein. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520\u2013525, 2001.\n\n[15] J. Lucas, C. Carvalho, Q. Wang, A. Bild, J. Nevins, and M. West. Sparse statistical modelling in gene expression genomics. In K.A. Do, P. M\u00fcller, and M. Vannucci, editors, Bayesian Inference for Gene Expression and Proteomics. Springer, 2006.\n\n[16] M.Y. Park, T. Hastie, and R. Tibshirani. Averaged gene expressions for regression. 
Biostatis-\n\ntics, 8:212\u2013227, 2007.\n\n8\n\n\f", "award": [], "sourceid": 919, "authors": [{"given_name": "Jos\u00e9", "family_name": "Hern\u00e1ndez-lobato", "institution": null}, {"given_name": "Tjeerd", "family_name": "Dijkstra", "institution": null}, {"given_name": "Tom", "family_name": "Heskes", "institution": null}]}