{"title": "Efficient inference in matrix-variate Gaussian models with \\iid observation noise", "book": "Advances in Neural Information Processing Systems", "page_first": 630, "page_last": 638, "abstract": "", "full_text": "Ef\ufb01cient inference in matrix-variate Gaussian models\n\nwith iid observation noise\n\nOliver Stegle1\n\nMax Planck Institutes\nT\u00a8ubingen, Germany\n\nstegle@tuebingen.mpg.de\n\nChristoph Lippert1\nMax Planck Institutes\nT\u00a8ubingen, Germany\n\nclippert@tuebingen.mpg.de\n\nInstitute for Computing and Information Sciences\n\nDepartment of Computer Science\n\nJoris Mooij\n\nRadboud University\n\nNijmegen, The Netherlands\n\nj.mooij@cs.ru.nl\n\nNeil Lawrence\n\nUniversity of Shef\ufb01eld\n\nShef\ufb01eld, UK\n\nN.Lawrence@sheffield.ac.uk\n\nMax Planck Institutes & Eberhard Karls Universit\u00a8at\n\nKarsten Borgwardt\n\nT\u00a8ubingen, Germany\n\nkarsten.borgwardt@tuebingen.mpg.de\n\nAbstract\n\nInference in matrix-variate Gaussian models has major applications for multi-\noutput prediction and joint learning of row and column covariances from matrix-\nvariate data. Here, we discuss an approach for ef\ufb01cient inference in such models\nthat explicitly account for iid observation noise. Computational tractability can be\nretained by exploiting the Kronecker product between row and column covariance\nmatrices. Using this framework, we show how to generalize the Graphical Lasso\nin order to learn a sparse inverse covariance between features while accounting for\na low-rank confounding covariance between samples. We show practical utility on\napplications to biology, where we model covariances with more than 100,000 di-\nmensions. We \ufb01nd greater accuracy in recovering biological network structures\nand are able to better reconstruct the confounders.\n\n1\n\nIntroduction\n\nMatrix-variate normal (MVN) models have important applications in various \ufb01elds. 
These models have been used as regularizers for multi-output prediction, jointly modeling the similarity between tasks and samples [1]. In related work in Gaussian processes (GPs), generalizations of MVN distributions have been used for inference of vector-valued functions [2, 3]. These models with Kronecker-factored covariance have applications in geostatistics [4], statistical testing on matrix-variate data [5] and statistical genetics [6].

In prior work, different covariance functions for rows and columns have been combined in a flexible manner. For example, Dutilleul [7] and Zhang et al. [1] have performed estimation of free-form covariances with different norm penalties. In other applications for prediction [2] and dimension reduction [8], combinations of free-form covariances with squared exponential covariances have been used.

¹These authors contributed equally to this work.

In the absence of iid observation noise, an efficient inference scheme also known as the "flip-flop algorithm" can be derived. In this iterative approach, estimation of the respective covariances is decoupled by rotating the data with respect to one of the covariances in order to optimize the parameters of the other [7, 1]. While this simplifying assumption of noise-free matrix-variate data has been used with some success, there are clear motivations for including iid noise in the model. For example, Bonilla et al. [2] have shown that in multi-task regression a noise-free GP with Kronecker structure leads to a cancellation of information sharing between the various prediction tasks. This effect, also known from the geostatistics literature [4], eliminates any benefit of multivariate prediction compared to naïve approaches. Alternatively, when including observation noise in the model, computational tractability has been limited to smaller datasets.
The covariance matrix then no longer directly factorizes into a Kronecker product, thus rendering simple approaches such as the "flip-flop algorithm" inappropriate.

Here, we address these shortcomings and propose a general framework for efficient inference in matrix-variate normal models that include iid observation noise. Although in this model the covariance matrix no longer factorizes into a Kronecker product, we show how efficient parameter inference can still be done. To this end, we provide derivations of both the log-likelihood and its gradients with respect to hyperparameters that can be computed in the same asymptotic runtime as iterations of the "flip-flop algorithm" on a noise-free model. This allows for parameter learning of covariance matrices of size 10^5 × 10^5, or even bigger, which would not be possible if done naïvely.

First, we show how, for any combination of covariances, evaluation of the model likelihood and of gradients with respect to individual covariance parameters is tractable. Second, we apply this framework to structure learning in Gaussian graphical models, while accounting for a confounding non-iid sample structure. This generalization of the Graphical Lasso [9, 10] (GLASSO) allows us to jointly learn and account for a sparse inverse covariance matrix between features and a structured (non-diagonal) sample covariance. The low-rank component of the sample covariance is used to account for confounding effects, as is done in other models for genomics [11, 12].

We illustrate this generalization, called "Kronecker GLASSO", on synthetic datasets and on heterogeneous protein signaling and gene expression data, where the aim is to recover the hidden network structures.
We show that our approach is able to recover the confounding structure, when it is known, and reveals sparse biological networks that are in better agreement with known components of the latent network structure.

2 Efficient inference in Kronecker Gaussian processes

Assume we are given a data matrix Y ∈ R^{N×D} with N rows and D columns, where N is the number of samples with D features each. As an example, think of N as a number of micro-array experiments, where in each experiment the expression levels of the same D genes are measured; here, y_{rc} would be the expression level of gene c in experiment r. Alternatively, Y could represent multi-variate targets in a multi-task prediction setting, with rows corresponding to tasks and columns to features. This setting occurs in geostatistics, where the entries y_{rc} correspond to ecological measurements taken on a regular grid.

First we introduce some notation. For any L × M matrix A, we define vec(A) to be the vector obtained by concatenating the columns of A; further, let A ⊗ B denote the Kronecker product (or tensor product) between matrices A and B:

\mathrm{vec}(A) = \begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{LM} \end{pmatrix}; \qquad A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1M}B \\ a_{21}B & a_{22}B & \cdots & a_{2M}B \\ \vdots & & \ddots & \vdots \\ a_{L1}B & a_{L2}B & \cdots & a_{LM}B \end{pmatrix}.

For modeling Y as a matrix-variate normal distribution with iid observation noise, we first introduce N × D additional latent variables Z, which can be thought of as the noise-free observations.
The data Y is then given by Z plus iid Gaussian observation noise:

p(Y \mid Z, \sigma^2) = \mathcal{N}\left(\mathrm{vec}(Y) \mid \mathrm{vec}(Z), \sigma^2 I_{N \cdot D}\right).   (1)

If the covariance between rows and columns of the noise-free observations Z factorizes, we may assume a zero-mean matrix-variate normal model for Z:

p(Z \mid C, R) = \frac{\exp\{-\tfrac{1}{2}\mathrm{Tr}[C^{-1} Z^\mathsf{T} R^{-1} Z]\}}{(2\pi)^{ND/2} |R|^{D/2} |C|^{N/2}},

which can be equivalently formulated as a multivariate normal distribution:

= \mathcal{N}\left(\mathrm{vec}(Z) \mid 0_{N \cdot D}, C(\Theta_C) \otimes R(\Theta_R)\right).   (2)

Here, the matrix C is a D × D column covariance matrix and R is an N × N row covariance matrix; they may depend on hyperparameters Θ_C and Θ_R respectively. Marginalizing over the noise-free observations Z results in the Kronecker Gaussian process model of the observed data Y:

p(Y \mid C, R, \sigma^2) = \mathcal{N}\left(\mathrm{vec}(Y) \mid 0_{N \cdot D}, C(\Theta_C) \otimes R(\Theta_R) + \sigma^2 I_{N \cdot D}\right).   (3)

For notational convenience we will drop the dependency on the hyperparameters Θ_C, Θ_R and σ². Note that for σ² = 0, the likelihood model in Equation (3) reduces to the matrix-variate normal distribution in Equation (2).

2.1 Efficient parameter estimation

For efficient optimization of the log likelihood, L = ln p(Y | C, R, σ²), with respect to the hyperparameters, we exploit an identity that allows us to write a matrix product with a Kronecker product matrix in terms of ordinary matrix products:

(C \otimes R)\,\mathrm{vec}(Y) = \mathrm{vec}(R Y C^\mathsf{T}).   (4)

We also exploit the compatibility of a Kronecker product plus a constant diagonal term with the eigenvalue decomposition:

(C \otimes R + \sigma^2 I) = (U_C \otimes U_R)(S_C \otimes S_R + \sigma^2 I)(U_C^\mathsf{T} \otimes U_R^\mathsf{T}),   (5)

where C = U_C S_C U_C^\mathsf{T} is the eigenvalue decomposition of C, and similarly for R.

Likelihood evaluation   Using these identities, the log of the likelihood in Equation (3) follows as

L = -\frac{N \cdot D}{2} \ln(2\pi) - \frac{1}{2} \ln\left|S_C \otimes S_R + \sigma^2 I\right| - \frac{1}{2} \mathrm{vec}(U_R^\mathsf{T} Y U_C)^\mathsf{T} (S_C \otimes S_R + \sigma^2 I)^{-1} \mathrm{vec}(U_R^\mathsf{T} Y U_C).   (6)

This term can be interpreted as a multivariate normal distribution with diagonal covariance matrix (S_C ⊗ S_R + σ²I) on the rotated data vec(U_R^T Y U_C), similar to an approach that is used to speed up mixed models in genetics [13].

Gradient evaluation   Derivatives of the log marginal likelihood with respect to a particular covariance parameter θ_R ∈ Θ_R can be expressed as

\frac{d}{d\theta_R} L = -\frac{1}{2} \mathrm{diag}\left((S_C \otimes S_R + \sigma^2 I)^{-1}\right)^\mathsf{T} \mathrm{diag}\left(S_C \otimes \left(U_R^\mathsf{T} \frac{dR}{d\theta_R} U_R\right)\right) + \frac{1}{2} \mathrm{vec}(\tilde{Y})^\mathsf{T} \mathrm{vec}\left(U_R^\mathsf{T} \frac{dR}{d\theta_R} U_R \tilde{Y} S_C\right),   (7)

where vec(Ỹ) = (S_C ⊗ S_R + σ²I)^{-1} vec(U_R^T Y U_C). Analogous expressions follow for partial derivatives with respect to θ_C ∈ Θ_C and the noise level σ². Full details of all derivations, including derivatives w.r.t. σ², can be found in the supplementary material.

Runtime and memory complexity   A naïve implementation for optimizing the likelihood (3) with respect to the hyperparameters would have runtime complexity O(N³D³) and memory complexity O(N²D²). Using the likelihood and derivatives as expressed in Equations (6) and (7), each evaluation with new kernel parameters involves solving the symmetric eigenvalue problems of both R and C, together having a runtime complexity of O(N³ + D³). Explicit evaluation of any matrix Kronecker products is not necessary, resulting in a low memory complexity of O(N² + D²).

3 Graphical Lasso in the presence of confounders

Estimation of sparse inverse covariance matrices is widely used to identify undirected network structures from observational data.
However, non-iid observations due to hidden confounding variables may hinder accurate recovery of the true network structure. If not accounted for, confounders may lead to a large number of false positive edges. This is of particular relevance in biological applications, where observational data are often heterogeneous, combining measurements from different labs, data obtained under various perturbations or from a range of measurement platforms.

As an application of the framework described in Section 2, we here propose an approach to learning sparse inverse covariance matrices between features, while accounting for covariation between samples due to confounders. First, we briefly review the "orthogonal" approaches that account for the corresponding types of sample and feature covariance we set out to model.

3.1 Explaining feature dependencies using the Graphical Lasso

A common approach to model relationships between variables in a graphical model is the GLASSO. It has been used in the context of biological studies to recover the hidden network structure of gene-gene interrelationships [14], for instance. The GLASSO assumes a multivariate Gaussian distribution on features with a sparse precision (inverse covariance) matrix. The sparsity is induced by an L1 penalty on the entries of C^{-1}, the inverse of the feature covariance matrix.

Under the simplifying assumption of iid samples, the posterior distribution of Y under this model is proportional to

p(Y, C^{-1}) = p(C^{-1}) \prod_{r=1}^{N} \mathcal{N}\left(Y_{r,:} \mid 0_D, C\right).   (8)

Here, the prior on the precision matrix C^{-1} is

p(C^{-1}) \propto \exp\left(-\lambda \left\|C^{-1}\right\|_1\right) [C^{-1} \succ 0],   (9)

with ‖A‖₁ defined as the sum over all absolute values of the matrix entries. Note that this prior is only nonzero for positive-definite C^{-1}.

3.2 Modeling confounders using the Gaussian process latent variable model

Confounders are unobserved variables that can lead to spurious associations between observed variables and to covariation between samples. A possible approach to identify such confounders is dimensionality reduction. Here we briefly review two dimensionality reduction methods, dual probabilistic PCA and its generalization, the Gaussian process latent variable model (GPLVM) [15]. In the context of applications, these methods have previously been applied to identify regulatory processes [16], and to recover confounding factors with broad effects on many features [11, 12].

In dual probabilistic PCA [15], the observed data Y is explained as a linear combination of K latent variables ("factors"), plus independent observation noise. The model is as follows:

Y = XW + E,

where X ∈ R^{N×K} contains the values of the K latent variables ("factors"), and W ∈ R^{K×D} contains independent standard-normally distributed weights that specify the mapping between latent and observed variables. Finally, E ∈ R^{N×D} contains iid Gaussian noise with E_{rc} ~ N(0, σ²). Marginalizing over the weights W yields the data likelihood:

p(Y \mid X) = \prod_{c=1}^{D} \mathcal{N}\left(Y_{:,c} \mid 0_N, XX^\mathsf{T} + \sigma^2 I_N\right).   (10)

Learning the latent factors X and the observation noise variance σ² can be done by maximum likelihood. The more general GPLVM [15] is obtained by replacing XX^T in (10) with a more general Gram matrix R, with R_{rs} = κ((x_{r1}, ..., x_{rK}), (x_{s1}, ..., x_{sK})) for some covariance function κ : R^K × R^K → R.

3.3 Combining the two models

We propose to combine these two different explanations of the data into one coherent model.
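(As a reference point for the combination, the dual probabilistic PCA likelihood in Equation (10) above is cheap to evaluate, since all D columns share one N × N covariance. The following NumPy sketch is our own illustration under that model, with hypothetical names.)

```python
import numpy as np

def dual_ppca_loglik(Y, X, sigma2):
    """Log of Eq. (10): each column of Y is iid N(0, X X^T + sigma2*I_N)."""
    N, D = Y.shape
    K_rows = X @ X.T + sigma2 * np.eye(N)          # shared N x N covariance
    sign, logdet = np.linalg.slogdet(K_rows)
    # sum_c Y[:,c]^T K_rows^{-1} Y[:,c], done for all columns at once
    quad = np.sum(Y * np.linalg.solve(K_rows, Y))
    return -0.5 * D * (N * np.log(2 * np.pi) + logdet) - 0.5 * quad
```

Maximizing this quantity over X and σ² recovers the latent factors, as described in Section 3.2.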
Instead of treating either the samples or the features as being (conditionally) independent, we aim to learn a joint covariance for the observed data matrix Y. This model, called Kronecker GLASSO, is a special instance of the Kronecker Gaussian process model introduced in Section 2, as the data likelihood can be written as:

p(Y \mid R, C^{-1}) = \mathcal{N}\left(\mathrm{vec}(Y) \mid 0_{N \cdot D}, C \otimes R + \sigma^2 I_{N \cdot D}\right).   (11)

Here, we build on the model components introduced in Section 3.2 and Section 3.1. We use the sparse L1 penalty (9) for the feature inverse covariance C^{-1} and use a linear kernel for the covariance on rows, R = XX^T + ρ²I_N. Learning the model parameters proceeds via MAP inference, optimizing the log likelihood implied by Equation (11) with respect to X and C^{-1}, and the hyperparameters σ², ρ². By combining the GLASSO and GPLVM in this way, we can recover network structure in the presence of confounders.

An equivalent generative model can be obtained in a similar way as in dual probabilistic PCA. The main difference is that now, the rows of the weight matrix W are sampled from a N(0_D, C) distribution instead of a N(0_D, I_D) distribution.
This generative model for Y given latent variables X ∈ R^{N×K} and feature covariance C ∈ R^{D×D} is of the form Y = XW + ρV + E, where W ∈ R^{K×D}, V ∈ R^{N×D} and E ∈ R^{N×D} are jointly independent with distributions vec(W) ~ N(0_{KD}, C ⊗ I_K), vec(V) ~ N(0_{ND}, C ⊗ I_N) and vec(E) ~ N(0_{ND}, σ²I_{ND}).

3.4 Inference in the joint model

As already mentioned in Section 2, parameter inference in the Kronecker GLASSO model implied by Equation (11), when done naïvely, is intractable for all but very low dimensional data matrices Y. Even using the tricks discussed in Section 2, free-form sparse inverse covariance updates for C^{-1} are intractable under the L1 penalty when depending on gradient updates.

Similar as in Section 2, the first step towards efficient inference is to introduce N × D additional latent variables Z, which can be thought of as the noise-free observations:

p(Y \mid Z, \sigma^2) = \mathcal{N}\left(\mathrm{vec}(Y) \mid \mathrm{vec}(Z), \sigma^2 I_{N \cdot D}\right)   (12)
p(Z \mid R, C) = \mathcal{N}\left(\mathrm{vec}(Z) \mid 0_{N \cdot D}, C \otimes R\right).   (13)

We consider the latent variables Z as additional model parameters. We now optimize the distribution p(Y, C^{-1} | Z, R, σ²) = p(Y | Z, σ²) p(Z | R, C) p(C^{-1}) with respect to the unknown parameters Z, C^{-1}, σ², and R (which depends on X and kernel parameters Θ_R) by iterating through the following steps:

1. Optimize for σ², R after integrating out Z, for fixed C:

\operatorname*{argmax}_{\sigma^2, \Theta_R, X} p(Y \mid C, R(\Theta_R, X), \sigma^2) = \operatorname*{argmax}_{\sigma^2, \Theta_R, X} \mathcal{N}\left(\mathrm{vec}(Y) \mid 0_{N \cdot D}, C \otimes R(\Theta_R, X) + \sigma^2 I_{N \cdot D}\right)   (14)

2. Calculate the expectation of Z for fixed R, C, and σ²:

\mathrm{vec}(\hat{Z}) = (C \otimes R)(C \otimes R + \sigma^2 I_{N \cdot D})^{-1} \mathrm{vec}(Y)

3. Optimize \hat{C}^{-1} for fixed R and \hat{Z}:

\operatorname*{argmax}_{\hat{C}^{-1}} p(\hat{C}^{-1} \mid \hat{Z}, R) = \operatorname*{argmax}_{\hat{C}^{-1}} \mathcal{N}\left(\mathrm{vec}(\hat{Z}) \mid 0, \hat{C} \otimes R\right) p(\hat{C}^{-1}),

and set C = \hat{C}.

As a stopping criterion we consider the relative reduction of the negative log-marginal likelihood (Equation (11)) plus the regularizer on C^{-1}. The choice to optimize Ĉ^{-1} for fixed Ẑ is motivated by computational considerations, as this subproblem then reduces to conventional GLASSO; a full EM approach with latent variables Z does not seem feasible. Step 1 can be done using the efficient likelihood evaluations and gradients described in Section 2. We will now discuss step 3 in more detail.

(a) Precision-recall curve  (b) Ground truth  (c) GLASSO  (d) Kron GLASSO  (e) Ideal GLASSO

Figure 1: Network reconstruction on the simulated example. (a) Precision-recall curve, when varying the sparsity penalty λ. Compared are the standard GLASSO, our algorithm with Kronecker structure (Kronecker GLASSO) and, as a reference, an idealized setting, applying standard GLASSO to a similar dataset without confounding influences (Ideal GLASSO). The model that accounts for confounders approaches the performance of an idealized model, while standard GLASSO finds a large fraction of false positive edges. (b) Ground truth network. (c-e) Recovered networks for GLASSO, Kronecker GLASSO and Ideal GLASSO at 40% recall (star in (a)). False positive predicted edges are colored in red. Because of the effect of confounders, standard GLASSO predicted an excess of edges to 4 of the nodes.

Optimizing for Ĉ^{-1}   The third step, optimizing with respect to Ĉ^{-1}, can be done efficiently, using similar ideas as in Section 2. First consider:

\ln \mathcal{N}\left(\mathrm{vec}(\hat{Z}) \mid 0_{N \cdot D}, \hat{C} \otimes R\right) = -\frac{N \cdot D}{2} \ln(2\pi) - \frac{1}{2} \ln\left|\hat{C} \otimes R\right| - \frac{1}{2} \mathrm{vec}(\hat{Z})^\mathsf{T} (\hat{C} \otimes R)^{-1} \mathrm{vec}(\hat{Z}).

Now, using the Kronecker identity (4) and

\ln|A \otimes B| = \mathrm{rank}(B) \ln|A| + \mathrm{rank}(A) \ln|B|,

we can rewrite the log likelihood as:

\ln \mathcal{N}\left(\mathrm{vec}(\hat{Z}) \mid 0, \hat{C} \otimes R\right) = -\frac{N \cdot D}{2} \ln(2\pi) - \frac{D}{2} \ln|R| + \frac{N}{2} \ln\left|\hat{C}^{-1}\right| - \frac{1}{2} \mathrm{Tr}(\hat{Z}^\mathsf{T} R^{-1} \hat{Z} \hat{C}^{-1}).

Thus we obtain a standard GLASSO problem with covariance matrix \hat{Z}^\mathsf{T} R^{-1} \hat{Z}:

\operatorname*{argmax}_{\hat{C}^{-1}} p(\hat{C}^{-1} \mid \hat{Z}, R) = \operatorname*{argmax}_{\hat{C}^{-1} \succ 0} \left( -\frac{1}{2} \mathrm{Tr}(\hat{Z}^\mathsf{T} R^{-1} \hat{Z} \hat{C}^{-1}) + \frac{N}{2} \ln\left|\hat{C}^{-1}\right| - \lambda \left\|\hat{C}^{-1}\right\|_1 \right).   (15)

The inverse sample covariance R^{-1} in Equation (15) rotates the data covariance, similar as in the established flip-flop algorithm for inference in matrix-variate normal distributions [7, 1].

4 Experiments

In this section, we describe three experiments with the generalized GLASSO.

4.1 Simulation study

First, we considered an artificial dataset to illustrate the effect of confounding factors on the solution quality of sparse inverse covariance estimation. We created synthetic data, with N = 100 samples and D = 50 features, according to the generative model described in Section 3.3.
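(Sampling from the generative model of Section 3.3 can be sketched as follows. This is our own illustration; the function name and the dimensions used are hypothetical, not the paper's settings.)

```python
import numpy as np

def sample_kron_glasso(N, D, K, C, rho, sigma, rng):
    """Draw Y = X W + rho*V + E as in Section 3.3:
    rows of W and V are iid N(0, C); E has iid N(0, sigma^2) entries."""
    L = np.linalg.cholesky(C)                  # C = L L^T
    X = rng.standard_normal((N, K))            # latent confounding factors
    W = rng.standard_normal((K, D)) @ L.T      # rows of W ~ N(0, C)
    V = rng.standard_normal((N, D)) @ L.T      # rows of V ~ N(0, C)
    E = sigma * rng.standard_normal((N, D))    # iid observation noise
    return X, X @ W + rho * V + E
```

Right-multiplying standard-normal rows by L^T gives rows with covariance L L^T = C, which matches the stated distributions vec(W) ~ N(0, C ⊗ I_K) and vec(V) ~ N(0, C ⊗ I_N).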
We generated the sparse inverse column covariance C^{-1} by choosing edges at random with a sparsity level of 1%. Non-zero entries of the inverse covariance were drawn from a Gaussian with mean 1 and variance 2. The row covariance matrix R was created from K = 3 random factors x_k, each drawn from unit-variance iid Gaussian variables. The weighting between the confounders and the iid component ρ² was set such that the factors explained equal variance, which corresponds to a moderate extent of confounding influences. Finally, we added independent Gaussian observation noise, choosing a signal-to-noise ratio of 10%.

(a) Precision-recall curve  (b) Ground truth  (c) GLASSO  (d) Kron GLASSO

Figure 2: Network reconstruction of a protein signaling network from Sachs et al. (a) Precision-recall curve, when varying the sparsity penalty λ. Compared are the standard GLASSO, and our algorithm with Kronecker structure (Kronecker GLASSO).
Standard GLASSO, not accounting for confounders, found more false positive edges for a wide range of recall rates. (b) Ground truth network. (c-d) Recovered networks for GLASSO and Kronecker GLASSO at 40% recall (star in (a)). False positive edge predictions are colored in red.

Next, we applied different methods to reconstruct the true simulated network. We considered standard GLASSO and our Kronecker model that accounts for the confounding influence (Kronecker GLASSO). For reference, we also considered an idealized setting, applying GLASSO to a similar dataset without the confounding effects (Ideal GLASSO), obtained by setting X = 0_{N×K} in the generative model. To determine an appropriate latent dimensionality of Kronecker GLASSO, we used the BIC criterion on multiple restarts with K = 1 to K = 5 latent factors. For all models we varied the sparsity parameter of the graphical lasso, setting λ = 5^x, with x linearly interpolated between −8 and 3. The solution set of lasso-based algorithms is typically unstable and depends on slight variations of the data. To improve the stability of all methods, we employed stability selection [17], applying each algorithm for all regularization parameters 100 times to randomly drawn subsets containing 90% of the data. We then considered edges that were found in at least 50% of all 100 restarts.

Figure 1a shows the precision-recall curve for each algorithm. Kronecker GLASSO performed considerably better than standard GLASSO, approaching the performance of the ideal model without confounders. Figures 1b-e show the reconstructed networks at 40% recall. While Kronecker GLASSO reconstructed the same network as the ideal model, standard GLASSO found an excess of false positive edges.

4.2 Network reconstruction of protein-signaling networks

Important practical applications of the GLASSO include the reconstruction of gene and protein networks.
Here, we revisit the extensively studied protein signaling data from Sachs et al. [18]. The dataset provides observational data of the activations of 11 proteins under various external stimuli. We combined measurements from the first 3 experiments, yielding a heterogeneous mix of 2,666 samples that are not expected to be an iid sample set. To make the inference more difficult, we selected a random fraction of 10% of the samples, yielding a final data matrix of size 266 × 11. We used the directed ground truth network and moralized the graph structure to obtain an undirected ground truth network. Parameter choice and stability selection were done as in the simulation study.

Figure 2 shows the results. Analogous to the simulation setting, the Kronecker GLASSO model found true network links with greater accuracy than the standard graphical lasso. These results suggest that our model is suitable to account for confounding variation as it occurs in real settings.

4.3 Large-scale application to yeast gene expression data

Next, we considered an application to large-scale gene expression profiling data from yeast. We revisited the dataset from Smith et al. [19], consisting of 109 genetically diverse yeast strains, each of which has been expression profiled in two environmental conditions (glucose and ethanol). Because

(a) Confounder reconstruction  (b) GLASSO consistency (68%)  (c) Kron. GLASSO consistency (74%)

Figure 3: (a) Correlation coefficient between learned confounding factor and true environmental condition for different subsets of all features (genes). Compared are the standard GPLVM model with a linear covariance and our proposed model that accounts for low-rank confounders and sparse gene-gene relationships (Kronecker GLASSO).
Kronecker GLASSO is able to better recover the hidden confounder by accounting for the covariance structure between genes. (b,c) Consistency of edges on the largest network with 1,000 nodes learnt on the joint dataset, comparing the results when combining both conditions with those for a single condition (glucose).

the confounder in this dataset is known explicitly, we tested the ability of Kronecker GLASSO to recover it from observational data. Because complete ground truth information is missing, we could not evaluate the network reconstruction quality directly. An appropriate regularization parameter was selected by means of cross validation, evaluating the marginal likelihood on a test set (analogous to the procedure described in [10]). To simplify the comparison to the known confounding factor, we chose a fixed number of confounders that we set to K = 1.

Recovery of the known confounder   Figure 3a shows the r² correlation coefficient between the inferred factor and the true environmental condition for increasing numbers of features (genes) used for learning. In particular for small numbers of genes, accounting for the network structure between genes improved the ability to recover the true confounding effect.

Consistency of obtained networks   Next, we tested the consistency when applying GLASSO and Kronecker GLASSO to data that combines both conditions, glucose and ethanol, comparing to the recovered network from a single condition alone (glucose). The respective networks are shown in Figures 3b and 3c. The Kronecker GLASSO model identifies more consistent edges, which shows the susceptibility of standard GLASSO to the confounder, here the environmental influence.

5 Conclusions and Discussion

We have shown an efficient scheme for parameter learning in matrix-variate normal distributions with iid observation noise.
By exploiting some linear algebra tricks, we have shown how hyperparameter optimization for the row and column covariances can be carried out without evaluating the prohibitive full covariance, thereby greatly reducing computational and memory complexity. To the best of our knowledge, these measures have not previously been proposed, despite their general applicability.

As an application of our framework, we have proposed a method that accounts for confounding influences while estimating a sparse inverse covariance structure. Our approach extends the Graphical Lasso, generalizing the rigid assumption of iid samples to more general sample covariances. For this purpose, we employ a Kronecker product covariance structure and learn a low-rank covariance between samples, thereby accounting for potential confounding influences. We provided synthetic and real world examples where our method is of practical use, reducing the number of false positive edges learned.

Acknowledgments   This research was supported by the FP7 PASCAL II Network of Excellence. OS received funding from the Volkswagen Foundation. JM was supported by NWO, the Netherlands Organization for Scientific Research (VENI grant 639.031.036).

References

[1] Y. Zhang and J. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, 2010.
[2] E. Bonilla, K.M. Chai, and C. Williams. Multi-task Gaussian process prediction. Advances in Neural Information Processing Systems, 20:153-160, 2008.
[3] M.A. Alvarez and N.D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12:1425-1466, 2011.
[4] H. Wackernagel. Multivariate Geostatistics: An Introduction with Applications. Springer Verlag, 2003.
[5] G.I. Allen and R. Tibshirani. Inference with transposable data: Modeling the effects of row and column correlations. arXiv preprint arXiv:1004.0209, 2010.
[6] M. Lynch and B. Walsh. Genetics and Analysis of Quantitative Traits. Sinauer Associates Inc., U.S., 1998.
[7] P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105-123, 1999.
[8] K. Zhang, B. Schölkopf, and D. Janzing. Invariant Gaussian process latent variable models and application in causal discovery. In Uncertainty in Artificial Intelligence, 2010.
[9] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.
[10] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
[11] J.T. Leek and J.D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.
[12] O. Stegle, L. Parts, R. Durbin, and J. Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology, 6(5):e1000770, 2010.
[13] C. Lippert, J. Listgarten, Y. Liu, C.M. Kadie, R.I. Davidson, and D. Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8:833-835, 2011.
[14] P. Menéndez, Y.A.I. Kourmpetis, C.J.F. Ter Braak, and F.A. van Eeuwijk. Gene regulatory networks from multifactorial perturbations using graphical lasso: Application to the DREAM4 challenge. PLoS One, 5(12):e14147, 2010.
[15] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.
[16] K.Y. Yeung and W.L. Ruzzo. Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763, 2001.
[17] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417-473, 2010.
[18] K. Sachs, O. Perez, D. Pe'er, D.A. Lauffenburger, and G.P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523, 2005.
[19] E.N. Smith and L. Kruglyak. Gene-environment interaction in yeast gene expression. PLoS Biology, 6(4):e83, 2008.
", "award": [], "sourceid": 4281, "authors": [{"given_name": "Oliver", "family_name": "Stegle", "institution": null}, {"given_name": "Christoph", "family_name": "Lippert", "institution": null}, {"given_name": "Joris", "family_name": "Mooij", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": null}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": ""}]}