{"title": "Structured Learning of Gaussian Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 620, "page_last": 628, "abstract": "We consider estimation of multiple high-dimensional Gaussian graphical models corresponding to a single set of nodes under several distinct conditions. We assume that most aspects of the networks are shared, but that there are some structured differences between them. Specifically, the network differences are generated from node perturbations: a few nodes are perturbed across networks, and most or all edges stemming from such nodes differ between networks. This corresponds to a simple model for the mechanism underlying many cancers, in which the gene regulatory network is disrupted due to the aberrant activity of a few specific genes. We propose to solve this problem using the structured joint graphical lasso, a convex optimization problem that is based upon the use of a novel symmetric overlap norm penalty, which we solve using an alternating directions method of multipliers algorithm. Our proposal is illustrated on synthetic data and on an application to brain cancer gene expression data.", "full_text": "Structured Learning of Gaussian Graphical Models\n\nKarthik Mohan\u2217, Michael Jae-Yoon Chung\u2020, Seungyeop Han\u2020, Daniela Witten\u2021, Su-In Lee\u00a7, Maryam Fazel\u2217\n\nAbstract\n\nWe consider estimation of multiple high-dimensional Gaussian graphical models corresponding to a single set of nodes under several distinct conditions. We assume that most aspects of the networks are shared, but that there are some structured differences between them. Specifically, the network differences are generated from node perturbations: a few nodes are perturbed across networks, and most or all edges stemming from such nodes differ between networks. 
This corresponds to a simple model for the mechanism underlying many cancers, in which the gene regulatory network is disrupted due to the aberrant activity of a few specific genes. We propose to solve this problem using the perturbed-node joint graphical lasso, a convex optimization problem that is based upon the use of a row-column overlap norm penalty. We then solve the convex problem using an alternating directions method of multipliers algorithm. Our proposal is illustrated on synthetic data and on an application to brain cancer gene expression data.\n\n1 Introduction\n\nProbabilistic graphical models are widely used in a variety of applications, from computer vision to natural language processing to computational biology. As this modeling framework is used in increasingly complex domains, the problem of selecting from among the exponentially large space of possible network structures is of paramount importance. This problem is especially acute in the high-dimensional setting, in which the number of variables or nodes in the graphical model is much larger than the number of observations that are available to estimate it.\nAs a motivating example, suppose that we have access to gene expression measurements for n_1 lung cancer patients and n_2 brain cancer patients, and that we would like to estimate the gene regulatory networks underlying these two types of cancer. We can consider estimating a single network on the basis of all n_1 + n_2 patients. However, this approach is unlikely to be successful, due to fundamental differences between the true lung cancer and brain cancer gene regulatory networks that stem from tissue specificity of gene expression as well as differing etiology of the two diseases. As an alternative, we could simply estimate a gene regulatory network using the n_1 lung cancer patients and a separate gene regulatory network using the n_2 brain cancer patients. 
However, this approach fails to exploit the fact that the two underlying gene regulatory networks likely have substantial commonality, such as tumor-specific pathways. In order to effectively make use of the available data, we need a principled approach for jointly estimating the lung cancer and brain cancer networks in such a way that the two network estimates are encouraged to be quite similar to each other, while allowing for certain structured differences. In fact, these differences themselves may be of scientific interest.\nIn this paper, we propose a general framework for jointly learning the structure of K networks, under the assumption that the networks are similar overall, but may have certain structured differences. Specifically, we assume that the network differences result from node perturbation: certain nodes are perturbed across the conditions, and so all or most of the edges associated with those nodes differ across the K networks. We detect such differences through the use of a row-column overlap norm penalty. Figure 1 illustrates a toy example in which a pair of networks are identical to each other, except for a single perturbed node (X_2) that will be detected using our proposal.\n\n\u2217 Electrical Engineering, Univ. of Washington. {karna,mfazel}@uw.edu\n\u2020 Computer Science and Engineering, Univ. of Washington. {mjyc,syhan}@cs.washington.edu\n\u2021 Biostatistics, Univ. of Washington. dwitten@uw.edu\n\u00a7 Computer Science and Engineering, and Genome Sciences, Univ. of Washington. suinlee@uw.edu\n\nThe problem of estimating multiple networks that differ due to node perturbations arises in a number of applications. For instance, the gene regulatory networks in cancer patients and in normal individuals are likely to be similar to each other, with specific node perturbations that arise from a small set of genes with somatic (cancer-specific) mutations. 
Another example arises in the analysis of the conditional independence relationships among p stocks at two distinct points in time. We might be interested in detecting stocks that have differential connectivity with all other stocks across the two time points, as these likely correspond to companies that have undergone significant changes. Still another example can be found in the field of neuroscience, where we are interested in learning how the connectivity of neurons in the human brain changes over time.\n\nFigure 1: An example of two networks that differ due to node perturbation of X_2. (a) Network 1 and its adjacency matrix. (b) Network 2 and its adjacency matrix. (c) Left: Edges that differ between the two networks. Right: Shaded cells indicate edges that differ between Networks 1 and 2.\n\nOur proposal for estimating multiple networks in the presence of node perturbation can be formulated as a convex optimization problem, which we solve using an efficient alternating directions method of multipliers (ADMM) algorithm that significantly outperforms general-purpose optimization tools. We test our method on synthetic data generated from known graphical models, and on one real-world task that involves inferring gene regulatory networks from experimental data.\nThe rest of this paper is organized as follows. In Section 2, we present recent work in the estimation of Gaussian graphical models (GGMs). In Section 3, we present our proposal for structured learning of multiple GGMs using the row-column overlap norm penalty. In Section 4, we present an ADMM algorithm that solves the proposed convex optimization problem. Applications to synthetic and real data are in Section 5, and the discussion is in Section 6.\n\n2 Background\n\n2.1 The graphical lasso\n\nSuppose that we wish to estimate a GGM on the basis of n observations, X_1, ..., X_n \u2208 R^p, which are independent and identically distributed N(0, \u03a3). 
It is well known that this amounts to learning the sparsity structure of \u03a3^{-1} [1, 2]. When n > p, one can estimate \u03a3^{-1} by maximum likelihood, but when p > n this is not possible because the empirical covariance matrix is singular. Consequently, a number of authors [3, 4, 5, 6, 7, 8, 9] have considered maximizing the penalized log likelihood\n\n$$\\underset{\\Theta \\in S^p_{++}}{\\text{maximize}} \\; \\left\\{ \\log\\det \\Theta - \\operatorname{trace}(S\\Theta) - \\lambda \\|\\Theta\\|_1 \\right\\}, \\tag{1}$$\n\nwhere S is the empirical covariance matrix based on the n observations, \u03bb is a positive tuning parameter, S^p_{++} denotes the set of positive definite matrices of size p, and \u2225\u0398\u2225_1 is the entrywise \u2113_1 norm. The \u0398\u0302 that solves (1) serves as an estimate of \u03a3^{-1}. This estimate will be positive definite for any \u03bb > 0, and sparse when \u03bb is sufficiently large, due to the \u2113_1 penalty [10] in (1). We refer to (1) as the graphical lasso formulation. This formulation is convex, and efficient algorithms for solving it are available [6, 4, 5, 7, 11].\n\n2.2 The fused graphical lasso\n\nIn recent literature, convex formulations have been proposed for extending the graphical lasso (1) to the setting in which one has access to a number of observations from K distinct conditions. The goal of the formulations is to estimate a graphical model for each condition under the assumption that the K networks share certain characteristics [12, 13]. Suppose that X^k_1, ..., X^k_{n_k} \u2208 R^p are independent and identically distributed from a N(0, \u03a3^k) distribution, for k = 1, ..., K. Letting S^k denote the empirical covariance matrix for the kth class, one can maximize the penalized log likelihood\n\n$$\\underset{\\Theta^1, \\ldots, \\Theta^K \\in S^p_{++}}{\\text{maximize}} \\; \\left\\{ L(\\Theta^1, \\ldots, \\Theta^K) - \\lambda_1 \\sum_{k=1}^{K} \\|\\Theta^k\\|_1 - \\lambda_2 \\sum_{i \\neq j} P(\\Theta^1_{ij}, \\ldots, \\Theta^K_{ij}) \\right\\}, \\tag{2}$$\n\nwhere $L(\\Theta^1, \\ldots, \\Theta^K) = \\sum_{k=1}^{K} n_k \\left( \\log\\det \\Theta^k - \\operatorname{trace}(S^k \\Theta^k) \\right)$, \u03bb_1 and \u03bb_2 are nonnegative tuning parameters, and P(\u0398^1_{ij}, ..., \u0398^K_{ij}) is a penalty applied to each off-diagonal element of \u0398^1, ..., \u0398^K in order to encourage similarity among them. Then the \u0398\u0302^1, ..., \u0398\u0302^K that solve (2) serve as estimates for (\u03a3^1)^{-1}, ..., (\u03a3^K)^{-1}. In particular, [13] considered the use of\n\n$$P(\\Theta^1_{ij}, \\ldots, \\Theta^K_{ij}) = \\sum_{k < k'} |\\Theta^k_{ij} - \\Theta^{k'}_{ij}|, \\tag{3}$$\n\na fused lasso penalty [14] on the differences between pairs of network edges. When \u03bb_1 is large, the network estimates will be sparse, and when \u03bb_2 is large, pairs of network estimates will have identical edges. We refer to (2) with penalty (3) as the fused graphical lasso formulation (FGL).\nSolving the FGL formulation allows for much more accurate network inference than simply learning each of the K networks separately, because FGL borrows strength across all available observations in estimating each network. But in doing so, it implicitly assumes that differences among the K networks arise from edge perturbations. 
Therefore, this approach does not take full advantage of the structure of the learning problem, which is that differences between the K networks are driven by nodes that differ across networks, rather than differences in individual edges.\n\n3 The perturbed-node joint graphical lasso\n\n3.1 Why is detecting node perturbation challenging?\n\nAt first glance, the problem of detecting node perturbation seems simple: in the case K = 2, we could simply modify (2) as follows,\n\n$$\\underset{\\Theta^1, \\Theta^2 \\in S^p_{++}}{\\text{maximize}} \\; \\left\\{ L(\\Theta^1, \\Theta^2) - \\lambda_1 \\|\\Theta^1\\|_1 - \\lambda_1 \\|\\Theta^2\\|_1 - \\lambda_2 \\sum_{j=1}^{p} \\|\\Theta^1_j - \\Theta^2_j\\|_2 \\right\\}, \\tag{4}$$\n\nwhere \u0398^k_j is the jth column of the matrix \u0398^k. This amounts to applying a group lasso [15] penalty to the columns of \u0398^1 \u2212 \u0398^2. Since a group lasso penalty simultaneously shrinks all elements to which it is applied to zero, it appears that this will give the desired node perturbation structure. We will refer to this as the naive group lasso approach.\nUnfortunately, a problem arises due to the fact that the optimization problem (4) must be performed subject to a symmetry constraint on \u0398^1 and \u0398^2. This symmetry constraint effectively imposes overlap among the elements in the p group lasso penalties in (4), since the (i, j)th element of \u0398^1 \u2212 \u0398^2 is in both the ith (row) and jth (column) groups. In the presence of overlapping groups, the group lasso penalty yields estimates whose support is the complement of the union of groups [16, 17]. Figure 2 shows a simple example of (\u03a3^1)^{-1} \u2212 (\u03a3^2)^{-1} in the case of node perturbation, as well as the estimate obtained using (4). The figure reveals that (4) cannot be used to detect node perturbation, since this task requires a penalty that yields estimates whose support is the union of groups.\n\nFigure 2: A toy example with p = 6 variables, of which two are perturbed (in red). Each panel shows an estimate of (\u03a3^1)^{-1} \u2212 (\u03a3^2)^{-1}, displayed as a network and as an adjacency matrix. Shaded elements of the adjacency matrix indicate non-zero elements of \u0398\u0302^1 \u2212 \u0398\u0302^2, as do edges in the network. Results are shown for (a): PNJGL with q = 2, which gives the correct sparsity pattern; (b)-(c): the naive group lasso. The naive group lasso is unable to detect the pattern of node perturbation.\n\n3.2 Proposed approach\n\nA node perturbation in a GGM can be equivalently represented through a perturbation of the entries of a row and column of the corresponding precision matrix (Figure 1). In other words, we can detect a single node perturbation by looking for a row and a corresponding column of \u0398^1 \u2212 \u0398^2 that has nonzero elements. We define a row-column group as a group that consists of a row and the corresponding column in a matrix. Note that in a p \u00d7 p matrix, there exist p such groups, which overlap. If several nodes of a GGM are perturbed, then this will correspond to the union of the corresponding row-column groups in \u0398^1 \u2212 \u0398^2. Therefore, in order to detect node perturbations in a GGM (Figure 1), we must construct a regularizer that can promote estimates whose support is the union of row-column groups. For this task, we propose the row-column overlap norm as a penalty.\nDefinition 3.1. The row-column overlap norm (RCON) induced by a matrix norm f is defined as\n\n$$\\Omega_f(A) = \\min_{V : A = V + V^T} f(V). \\tag{5}$$\n\nRCON satisfies the following properties that are easy to check: (1) \u2126_f is indeed a norm. 
Consequently, it is convex. (2) When f is symmetric in its argument, i.e., f(V) = f(V^T), then \u2126_f(A) = f(A)/2.\nIn this paper, we are interested in the particular class of RCON penalty where f is given by\n\n$$f(V) = \\sum_{j=1}^{p} \\|V_j\\|_q, \\tag{6}$$\n\nwhere 1 \u2264 q \u2264 \u221e. The norm in (6) is known as the \u2113_1/\u2113_q norm since it can be interpreted as the \u2113_1 norm of the \u2113_q norms of the columns of a matrix. With a little abuse of notation, we will let \u2126_q denote \u2126_f with an \u2113_1/\u2113_q norm of the form (6). We note that \u2126_q is closely related to the overlap group lasso penalty [17, 16], and in fact can be derived from it (for the case of q = 2). However, our definition naturally and elegantly handles the grouping structure induced by the overlap of rows and columns, and can accommodate any \u2113_q norm with q \u2265 1, and more generally any norm f. As discussed in [17], when applied to \u0398^1 \u2212 \u0398^2, the penalty \u2126_q (with q = 2) will encourage the support of the matrix \u0398\u0302^1 \u2212 \u0398\u0302^2 to be the union of a set of rows and columns.\nNow, consider the task of jointly estimating two precision matrices by solving\n\n$$\\underset{\\Theta^1, \\Theta^2 \\in S^p_{++}}{\\text{maximize}} \\; \\left\\{ L(\\Theta^1, \\Theta^2) - \\lambda_1 \\|\\Theta^1\\|_1 - \\lambda_1 \\|\\Theta^2\\|_1 - \\lambda_2 \\Omega_q(\\Theta^1 - \\Theta^2) \\right\\}. \\tag{7}$$\n\nWe refer to the convex optimization problem (7) as the perturbed-node joint graphical lasso (PNJGL) formulation. In (7), \u03bb_1 and \u03bb_2 are nonnegative tuning parameters, and q \u2265 1. Note that f(V) = \u2225V\u2225_1 satisfies the symmetry condition of property 2 of the RCON penalty. Thus we have the following observation.\nRemark 3.1. The FGL formulation (2) is a special case of the PNJGL formulation (7) with q = 1.\n\nLet \u0398\u0302^1, \u0398\u0302^2 be the optimal solution to (7). 
Note that the FGL formulation is an edge-based approach that promotes many entries (or edges) in \u0398\u0302^1 \u2212 \u0398\u0302^2 to be set to zero. However, setting q = 2 or q = \u221e in (7) gives us a node-based approach, where the support of \u0398\u0302^1 \u2212 \u0398\u0302^2 is encouraged to be a union of a few rows and the corresponding columns [17, 16]. Thus the nodes that have been perturbed can be clearly detected using PNJGL with q = 2 or q = \u221e. An example of the sparsity structure detected by PNJGL with q = 2 is shown in the left-hand panel of Figure 2. We note that the above formulation can be easily extended to the estimation of K > 2 GGMs by including K(K \u2212 1)/2 RCON penalty terms in (7), one for each pair of models. However, we restrict ourselves to the case of K = 2 in this paper.\n\n4 An ADMM algorithm for the PNJGL formulation\n\nThe PNJGL optimization problem (7) is convex, and so can be directly solved in the modeling environment cvx [18], which calls conic interior-point solvers such as SeDuMi or SDPT3. However, such a general approach does not fully exploit the structure of the problem and will not scale well to large-scale instances. Other algorithms proposed for overlapping group lasso penalties [19, 20, 21] do not apply to our setting since the PNJGL formulation has a combination of Gaussian log-likelihood loss (instead of squared error loss) and the RCON penalty along with a positive-definite constraint. We also note that other first-order methods are not easily applied to solve the PNJGL formulation because the subgradient of the RCON is not easy to compute and, in addition, the proximal operator of the RCON is non-trivial to compute.\nIn this section we present a fast and scalable alternating directions method of multipliers (ADMM) algorithm [22] to solve the problem (7). 
We first reformulate (7) by introducing new variables, so as to decouple some of the terms in the objective function that are difficult to optimize jointly. This will result in a simple algorithm with closed-form updates. The reformulation is as follows:\n\n$$\\underset{\\Theta^1, \\Theta^2 \\in S^p_{++},\\, Z^1, Z^2, V, W}{\\text{minimize}} \\; \\left\\{ -L(\\Theta^1, \\Theta^2) + \\lambda_1 \\|Z^1\\|_1 + \\lambda_1 \\|Z^2\\|_1 + \\lambda_2 \\sum_{j=1}^{p} \\|V_j\\|_q \\right\\} \\quad \\text{subject to} \\quad \\Theta^1 - \\Theta^2 = V + W, \\; V = W^T, \\; \\Theta^1 = Z^1, \\; \\Theta^2 = Z^2. \\tag{8}$$\n\nAn ADMM algorithm can now be obtained in a standard fashion from the augmented Lagrangian to (8). We defer the details to a longer version of this paper. The complete algorithm for (8) is given in Algorithm 1, in which the operator Expand is given by\n\n$$\\text{Expand}(A, \\rho, n_k) = \\underset{\\Theta \\in S^p_{++}}{\\operatorname{argmin}} \\left\\{ -n_k \\log\\det(\\Theta) + \\rho \\|\\Theta - A\\|_F^2 \\right\\} = \\frac{1}{2} U \\left( D + \\sqrt{D^2 + \\frac{2 n_k}{\\rho} I} \\right) U^T,$$\n\nwhere U D U^T is the eigenvalue decomposition of A, and as mentioned earlier, n_k is the number of observations in the kth class. The operator T_q is given by\n\n$$T_q(A, \\lambda) = \\underset{X}{\\operatorname{argmin}} \\left\\{ \\frac{1}{2} \\|X - A\\|_F^2 + \\lambda \\sum_{j=1}^{p} \\|X_j\\|_q \\right\\},$$\n\nand is also known as the proximal operator corresponding to the \u2113_1/\u2113_q norm. For q = 1, 2, \u221e, T_q takes a simple form, which we omit here due to space constraints. A description of these operators can also be found in Section 5 of [25].\nAlgorithm 1 can be interpreted as an approximate dual gradient ascent method. 
The approximation is due to the fact that the gradient of the dual to the augmented Lagrangian in each iteration is computed inexactly, through a coordinate descent cycling through the primal variables.\nTypically ADMM algorithms iterate over only two groups of primal variables. For such algorithms, the convergence properties are well known (see e.g. [22]). However, in our case we cycle through more than two such groups. Although investigation of the convergence properties of ADMM algorithms for an arbitrary number of groups is an ongoing research area in the optimization literature [23, 24] and specific convergence results for our algorithm are not known, we empirically observe very good convergence behavior. Further study of this issue is a direction for future work.\nWe initialize the primal variables to the identity matrix, and the dual variables to the matrix of zeros. We set \u03bc = 5, and t_max = 1000. In our implementation, the stopping criterion is that the difference between consecutive iterates becomes smaller than a tolerance \u03f5. The ADMM algorithm is orders of magnitude faster than an interior point method and also comparable in accuracy. Note that the per-iteration complexity of the ADMM algorithm is O(p^3) (the complexity of computing an SVD). On the other hand, the complexity of an interior point method is O(p^6). When p = 30, the interior point method (using cvx, which calls SeDuMi) takes 7 minutes to run while ADMM takes only 10 seconds. When p = 50, the times are 3.5 hours and 2 minutes, respectively. 
Also, we observe that the average error between the cvx and ADMM solutions, when averaged over many random generations of the data, is of O(10^{-4}).\n\nAlgorithm 1: ADMM algorithm for the PNJGL optimization problem (7)\ninput: \u03c1 > 0, \u03bc > 1, t_max > 0, \u03f5 > 0;\nfor t = 1:t_max do\n    \u03c1 \u2190 \u03bc\u03c1;\n    while not converged do\n        \u0398^1 \u2190 Expand( (1/2)(\u0398^2 + V + W + Z^1) \u2212 (1/(2\u03c1))(Q^1 + n_1 S^1 + F), \u03c1, n_1 );\n        \u0398^2 \u2190 Expand( (1/2)(\u0398^1 \u2212 (V + W) + Z^2) \u2212 (1/(2\u03c1))(Q^2 + n_2 S^2 \u2212 F), \u03c1, n_2 );\n        Z^i \u2190 T_1( \u0398^i + Q^i/\u03c1, \u03bb_1/\u03c1 ) for i = 1, 2;\n        V \u2190 T_q( (1/2)(W^T \u2212 W + (\u0398^1 \u2212 \u0398^2)) + (1/(2\u03c1))(F \u2212 G), \u03bb_2/(2\u03c1) );\n        W \u2190 (1/2)(V^T \u2212 V + (\u0398^1 \u2212 \u0398^2)) + (1/(2\u03c1))(F + G^T);\n        F \u2190 F + \u03c1(\u0398^1 \u2212 \u0398^2 \u2212 (V + W));\n        G \u2190 G + \u03c1(V \u2212 W^T);\n        Q^i \u2190 Q^i + \u03c1(\u0398^i \u2212 Z^i) for i = 1, 2\n\n5 Experiments\n\nWe describe experiments and report results on both synthetically generated data and real data.\n\n5.1 Synthetic experiments\n\nSynthetic data generation. We generated two networks as follows. The networks share individual edges as well as hub nodes, or nodes that are highly connected to many other nodes. There are also perturbed nodes that differ between the networks. We first created a p \u00d7 p symmetric matrix A, with diagonal elements equal to one. For i < j, we set\n\n    A_ij = 0 with probability 0.98, and A_ij \u223c Unif([\u22120.6, \u22120.3] \u222a [0.3, 0.6]) otherwise (i.i.d. across entries),\n\nand then we set A_ji to equal A_ij. Next, we randomly selected seven hub nodes, and set the elements of the corresponding rows and columns to be i.i.d. from a Unif([\u22120.6, \u22120.3] \u222a [0.3, 0.6]) distribution. These steps resulted in a background pattern of structure common to both networks. 
Next, we copied A into two matrices, A1 and A2. We randomly selected m perturbed nodes that differ between A1 and A2, and set the elements of the corresponding row and column of either A1 or A2 (chosen at random) to be i.i.d. draws from a Unif([\u22121.0, \u22120.5] \u222a [0.5, 1.0]) distribution. Finally, we computed c = min(\u03bb_min(A1), \u03bb_min(A2)), the smaller of the smallest eigenvalues of A1 and A2. We then set (\u03a3^1)^{-1} equal to A1 + (0.1 \u2212 c)I and set (\u03a3^2)^{-1} equal to A2 + (0.1 \u2212 c)I. This last step is performed in order to ensure positive definiteness. We generated n independent observations each from a N(0, \u03a3^1) and a N(0, \u03a3^2) distribution, and used these to compute the empirical covariance matrices S^1 and S^2. We compared the performances of graphical lasso, FGL, and PNJGL with q = 2 with p = 100, m = 2, and n \u2208 {10, 25, 50, 200}.\n\nResults. Results (averaged over 100 iterations) are shown in Figure 3. Increasing n yields more accurate results for PNJGL with q = 2, FGL, and graphical lasso. Furthermore, PNJGL with q = 2 identifies non-zero edges and differing edges much more accurately than does FGL, which is in turn more accurate than graphical lasso. PNJGL also leads to the most accurate estimates of \u0398^1 and \u0398^2. The extent to which PNJGL with q = 2 outperforms others is more apparent when n is small.\n\nFigure 3: Simulation study results for PNJGL with q = 2, FGL, and the graphical lasso (GL), for (a) n = 10, (b) n = 25, (c) n = 50, (d) n = 200, when p = 100. Within each panel, each line corresponds to a fixed value of \u03bb_2 (for PNJGL with q = 2 and for FGL). Each plot\u2019s x-axis denotes the number of edges estimated to be non-zero. The y-axes are as follows. Left: Number of edges correctly estimated to be non-zero. Center: Number of edges correctly estimated to differ across networks, divided by the number of edges estimated to differ across networks. Right: The Frobenius norm of the error in the estimated precision matrices, i.e. $(\\sum_{i \\neq j} (\\theta^1_{ij} - \\hat{\\theta}^1_{ij})^2)^{1/2} + (\\sum_{i \\neq j} (\\theta^2_{ij} - \\hat{\\theta}^2_{ij})^2)^{1/2}$.\n\n5.2 Inferring biological networks\n\nWe applied the PNJGL method to a recently-published cancer gene expression data set [26], with mRNA expression measurements for 11,861 genes in 220 patients with glioblastoma multiforme (GBM), a brain cancer. Each patient has one of four distinct clinical subtypes: Proneural, Neural, Classical, and Mesenchymal. We selected two subtypes, Proneural (53 patients) and Mesenchymal (56 patients), for our analysis. In this experiment, we aim to reconstruct the gene regulatory networks of the two subtypes, as well as to identify genes whose interactions with other genes vary significantly between the subtypes. Such genes are likely to have many somatic (cancer-specific) mutations. Understanding the molecular basis of these subtypes will lead to better understanding of brain cancer, and eventually, improved patient treatment. We selected the 250 genes with the highest within-subtype variance, as well as 10 genes known to be frequently mutated across the four GBM subtypes [26]: TP53, PTEN, NF1, EGFR, IDH1, PIK3R1, RB1, ERBB2, PIK3CA, PDGFRA. Two of these genes (EGFR, PDGFRA) were in the initial list of 250 genes selected based on the within-subtype variance. This led to a total of 258 genes. We then applied PNJGL with q = 2 and FGL to the resulting 53 \u00d7 258 and 56 \u00d7 258 gene expression datasets, after standardizing each gene to have variance one. Tuning parameters were selected so that each approach results in a per-network estimate of approximately 6,000 non-zero edges, as well as approximately 4,000 edges that differ across the two network estimates. 
However, the results that follow persisted across a wide range of tuning parameter values.\n\nFigure 4: PNJGL with q = 2 and FGL were performed on the brain cancer data set corresponding to 258 genes in patients with Proneural and Mesenchymal subtypes. (a)-(b): NP_j is plotted for each gene, based on (a) the FGL estimates and (b) the PNJGL estimates. (c)-(d): A heatmap of \u0398\u0302^1 \u2212 \u0398\u0302^2 is shown for (c) FGL and (d) PNJGL; zero values are in white, and non-zero values are in black.\n\nWe quantify the extent of node perturbation (NP) in the network estimates as follows: $NP_j = \\sum_i |V_{ij}|$; for FGL, we obtain V from the PNJGL formulation as V = (1/2)(\u0398\u0302^1 \u2212 \u0398\u0302^2). If NP_j = 0 (using a zero-threshold of 10^{-6}), then the jth gene has the same edge weights in the two conditions. In Figure 4(a)-(b), we plotted the resulting values for each of the 258 genes in FGL and PNJGL. Although the network estimates resulting from PNJGL and FGL have approximately the same number of edges that differ across cancer subtypes, PNJGL results in estimates in which only 37 genes appear to have node perturbation. FGL results in estimates in which all 258 genes appear to have node perturbation. In Figure 4(c)-(d), the non-zero elements of \u0398\u0302^1 \u2212 \u0398\u0302^2 for FGL and for PNJGL are displayed. Clearly, the pattern of network differences resulting from PNJGL is far more structured. The genes known to be frequently mutated across GBM subtypes are somewhat enriched out of those that appear to be perturbed according to the PNJGL estimates (3 out of 10 mutated genes were detected by PNJGL; 37 out of 258 total genes were detected by PNJGL; hypergeometric p-value = 0.1594). In contrast, FGL detects every gene as having node perturbation (Figure 4(a)). The gene with the highest NP_j value (according to both FGL and PNJGL with q = 2) is CXCL13, a small cytokine that belongs to the CXC chemokine family. 
Together with its receptor CXCR5, it controls the organization of B-cells within follicles of lymphoid tissues. This gene was not identified as a frequently mutated gene in GBM [26]. However, there is recent evidence that CXCL13 plays a critical role in driving cancerous pathways in breast, prostate, and ovarian tissue [27, 28]. Our results suggest the possibility of a previously unknown role of CXCL13 in brain cancer.\n\n6 Discussion and future work\n\nWe have proposed the perturbed-node joint graphical lasso, a new approach for jointly learning Gaussian graphical models under the assumption that network differences result from node perturbations. We impose this structure using a novel RCON penalty, which encourages the differences between the estimated networks to be the union of just a few rows and columns. We solve the resulting convex optimization problem using ADMM, which is more efficient and scalable than standard interior point methods. Our proposed approach leads to far better performance on synthetic data than two alternative approaches: learning Gaussian graphical models assuming edge perturbation [13], or simply learning each model separately. Future work will involve other forms of structured sparsity beyond simply node perturbation. For instance, if certain subnetworks are known a priori to be related to the conditions under study, then the RCON penalty can be modified in order to encourage some subnetworks to be perturbed across the conditions. In addition, the ADMM algorithm described in this paper requires computation of the eigendecomposition of a p \u00d7 p matrix at each iteration; we plan to develop computational improvements that mirror recent results on related problems in order to reduce the computations involved in solving the FGL optimization problem [6, 13].\nAcknowledgments. D.W. was supported by NIH Grant DP5OD009145, M.F. 
was supported in part by NSF grant ECCS-0847077.\n\nReferences\n[1] K.V. Mardia, J. Kent, and J.M. Bibby. Multivariate Analysis. Academic Press, 1979.\n[2] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.\n[3] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(10):19\u201335, 2007.\n[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432\u2013441, 2007.\n[5] O. Banerjee, L. E. El Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9:485\u2013516, 2008.\n[6] D.M. Witten, J.H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892\u2013900, 2011.\n[7] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. Advances in Neural Information Processing Systems, 2010.\n[8] P. Ravikumar, M.J. Wainwright, G. Raskutti, and B. Yu. Model selection in Gaussian graphical models: high-dimensional consistency of l1-regularized MLE. Advances in NIPS, 2008.\n[9] C.J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance estimation using quadratic approximation. Advances in Neural Information Processing Systems, 2011.\n[10] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267\u2013288, 1996.\n[11] A. D\u2019Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56\u201366, 2008.\n[12] J. Guo, E. Levina, G. Michailidis, and J. Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1\u201315, 2011.\n[13] P. Danaher, P. Wang, and D. Witten. 
The joint graphical lasso for inverse covariance estimation across multiple classes, 2012. http://arxiv.org/abs/1111.0324.\n[14] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67:91\u2013108, 2005.\n[15] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49\u201367, 2007.\n[16] L. Jacob, G. Obozinski, and J.P. Vert. Group lasso with overlap and graph lasso. Proceedings of the 26th International Conference on Machine Learning, 2009.\n[17] G. Obozinski, L. Jacob, and J.P. Vert. Group lasso with overlaps: the latent group lasso approach. 2011. http://arxiv.org/abs/1110.0413.\n[18] M. Grant and S. Boyd. cvx version 1.21. http://cvxr.com/cvx, October 2010.\n[19] A. Argyriou, C.A. Micchelli, and M. Pontil. Efficient first order methods for linear composite regularizers. 2011. http://arxiv.org/pdf/1104.1436.\n[20] X. Chen, Q. Lin, S. Kim, J.G. Carbonell, and E.P. Xing. Smoothing proximal gradient method for general structured sparse learning. Proceedings of the conference on Uncertainty in Artificial Intelligence, 2011.\n[21] S. Mosci, S. Villa, A. Verri, and L. Rosasco. A primal-dual algorithm for group sparse regularization with overlapping groups. Neural Information Processing Systems, pages 2604\u20132612, 2010.\n[22] S.P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in ML, 3(1):1\u2013122, 2010.\n[23] M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. 2012. Available at arxiv.org/abs/1208.3922.\n[24] B. He, M. Tao, and X. Yuan. 
Alternating direction method with Gaussian back substitution for separable convex programming. SIAM Journal on Optimization, pages 313\u2013340, 2012.\n[25] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, pages 2899\u20132934, 2009.\n[26] Verhaak et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell, 17(1):98\u2013110, 2010.\n[27] Grosso et al. Chemokine CXCL13 is overexpressed in the tumour tissue and in the peripheral blood of breast cancer patients. British Journal of Cancer, 99(6):930\u2013938, 2008.\n[28] El-Haibi et al. CXCL13-CXCR5 interactions support prostate cancer cell migration and invasion in a PI3K p110-, SRC- and FAK-dependent fashion. The Journal of Immunology, 15(19):5968\u201373, 2009.\n", "award": [], "sourceid": 4499, "authors": [{"given_name": "Karthik", "family_name": "Mohan", "institution": null}, {"given_name": "Mike", "family_name": "Chung", "institution": null}, {"given_name": "Seungyeop", "family_name": "Han", "institution": null}, {"given_name": "Daniela", "family_name": "Witten", "institution": null}, {"given_name": "Su-in", "family_name": "Lee", "institution": null}, {"given_name": "Maryam", "family_name": "Fazel", "institution": null}]}