{"title": "On Sparse Gaussian Chain Graph Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3212, "page_last": 3220, "abstract": "In this paper, we address the problem of learning the structure of Gaussian chain graph models in a high-dimensional space. Chain graph models are generalizations of undirected and directed graphical models that contain a mixed set of directed and undirected edges. While the problem of sparse structure learning has been studied extensively for Gaussian graphical models and more recently for conditional Gaussian graphical models (CGGMs), there has been little previous work on the structure recovery of Gaussian chain graph models. We consider linear regression models and a re-parameterization of the linear regression models using CGGMs as building blocks of chain graph models. We argue that when the goal is to recover model structures, there are many advantages of using CGGMs as chain component models over linear regression models, including convexity of the optimization problem, computational efficiency, recovery of structured sparsity, and ability to leverage the model structure for semi-supervised learning. We demonstrate our approach on simulated and genomic datasets.", "full_text": "On Sparse Gaussian Chain Graph Models\n\nCalvin McCarter\n\nMachine Learning Department\nCarnegie Mellon University\n\ncalvinm@cmu.edu\n\nSeyoung Kim\n\nLane Center for Computational Biology\n\nCarnegie Mellon University\nsssykim@cs.cmu.edu\n\nAbstract\n\nIn this paper, we address the problem of learning the structure of Gaussian chain\ngraph models in a high-dimensional space. Chain graph models are generaliza-\ntions of undirected and directed graphical models that contain a mixed set of di-\nrected and undirected edges. 
While the problem of sparse structure learning has\nbeen studied extensively for Gaussian graphical models and more recently for\nconditional Gaussian graphical models (CGGMs), there has been little previous\nwork on the structure recovery of Gaussian chain graph models. We consider lin-\near regression models and a re-parameterization of the linear regression models\nusing CGGMs as building blocks of chain graph models. We argue that when the\ngoal is to recover model structures, there are many advantages of using CGGMs\nas chain component models over linear regression models, including convexity of\nthe optimization problem, computational ef\ufb01ciency, recovery of structured spar-\nsity, and ability to leverage the model structure for semi-supervised learning. We\ndemonstrate our approach on simulated and genomic datasets.\n\n1\n\nIntroduction\n\nProbabilistic graphical models have been extensively studied as a powerful tool for modeling a set\nof conditional independencies in a probability distribution [12]. In this paper, we are concerned with\na class of graphical models, called chain graph models, that has been proposed as a generalization of\nundirected graphical models and directed acyclic graphical models [4, 9, 14]. Chain graph models\nare de\ufb01ned over chain graphs that contain a mixed set of directed and undirected edges but no\npartially directed cycles.\nIn particular, we study the problem of learning the structure of Gaussian chain graph models in a\nhigh-dimensional setting. While the problem of learning sparse structures from high-dimensional\ndata has been studied extensively for other related models such as Gaussian graphical models\n(GGMs) [8] and more recently conditional Gaussian graphical models (CGGMs) [17, 20], to our\nknowledge, there is little previous work that addresses this problem for Gaussian chain graph mod-\nels. 
Even with a known chain graph structure, current methods for parameter estimation are hindered by the presence of multiple locally optimal solutions [1, 7, 21].

Since the seminal work on conditional random fields (CRFs) [13], a general recipe for constructing chain graph models [12] has been given as using CRFs as building blocks for the model. We employ this construction for Gaussian chain graph models and propose to use the recently-introduced sparse CGGMs [17, 20] as a Gaussian equivalent of general CRFs. When the goal is to learn the model structure, we show that this construction is superior to the popular alternative approach of using linear regression as component models. Some of the key advantages of our approach are due to the fact that sparse Gaussian chain graph models inherit the desirable properties of sparse CGGMs, such as convexity of the optimization problem and structured output prediction. In fact, our work is the first to introduce a joint estimation procedure for both the graph structure and parameters as a convex optimization problem, given the groups of variables for chain components. Another advantage of our approach is the ability to model a functional mapping from multiple related variables to other multiple related variables in a more natural way, via moralization in chain graphs, than other approaches that rely on complex penalty functions for inducing structured sparsity [11, 15].

Figure 1: Illustration of chain graph models. (a) A chain graph with two components, {x1, x2} and {x3}. (b) The moralized graph of the chain graph in (a). (c) After inference in the chain graph in (a), inferred indirect dependencies are shown as the dotted line. (d) A chain graph with three components, {x1, x2}, {x3}, and {x4}. (e) The moralized graph of the chain graph in (d). (f) After inference in the chain graph in (d), inferred indirect dependencies are shown as the dotted lines.

Our work on sparse Gaussian chain graphs is motivated by problems in integrative genomic data analyses [6, 18]. While sparse GGMs have been extremely popular for learning networks from datasets of a single modality, such as gene-expression levels [8], we propose that sparse Gaussian chain graph models with CGGM components can be used to learn a cascade of networks by integrating multiple types of genomic data in a single statistical analysis. We show that our approach can reveal the module structures as well as the functional mapping between modules in different types of genomic data effectively. Furthermore, as the cost of collecting each data type differs, we show that semi-supervised learning can be used to make effective use of both fully-observed and partially-observed data.

2 Sparse Gaussian Chain Graph Models

We consider a chain graph model for a probability distribution over J random variables x = {x1, . . . , xJ}. The chain graph model assumes that the random variables are partitioned into C chain components {x1, . . . , xC}, the τth component having size |τ|. In addition, it assumes a partially directed graph structure, where edges between variables within each chain component are undirected and edges across two chain components are directed. Given this chain graph structure, the joint probability distribution factorizes as follows:

p(x) = ∏_{τ=1}^{C} p(xτ | xpa(τ)),

where xpa(τ) is the set of variables that are parents of one or more variables in xτ.
Each factor p(xτ|xpa(τ)) models the conditional distribution of the chain component variables xτ given xpa(τ). This model can also be viewed as being constructed with CRFs for the p(xτ|xpa(τ))'s [13].

The conditional independence properties of undirected and directed graphical models have been extended to chain graph models [9, 14]. This can be easily seen by first constructing a moralized graph, where undirected edges are added between any pair of nodes in xpa(τ) for each chain component τ and all the directed edges are converted into undirected edges (Figure 1). Then, subsets of variables xa and xb are conditionally independent given xc if xa and xb are separated by xc in the moralized graph. This conditional independence criterion for a chain graph is called c-separation and generalizes d-separation for Bayesian networks [12].

In this paper, we focus on Gaussian chain graph models, where both p(x) and the p(xτ|xpa(τ))'s are Gaussian distributed. Below, we review linear regression models and CGGMs as chain component models, and introduce our approach for learning chain graph model structures.

2.1 Sparse Linear Regression as Chain Component Model

As the specific functional form of p(xτ|xpa(τ)) in Gaussian chain graph models, a linear regression model with multivariate responses has been widely considered [2, 3, 7]:

p(xτ|xpa(τ)) = N(Bτ xpa(τ), Θτ⁻¹),   (1)

where Bτ ∈ R^{|τ|×|pa(τ)|} is the matrix of regression coefficients and Θτ is the |τ| × |τ| inverse covariance matrix that models correlated noise. Then, the non-zero elements in Bτ indicate the presence of directed edges from xpa(τ) to xτ, and the non-zero elements in Θτ correspond to the undirected edges among the variables in xτ.
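Eq. (1) makes ancestral simulation through the chain straightforward: sample each component given its parents. A minimal numpy sketch, where all component sizes and parameter values are illustrative and not taken from the paper:

```python
import numpy as np

def sample_component(B, Theta, x_pa, rng):
    """Draw x_tau ~ N(B x_pa, Theta^{-1}) for one chain component, as in Eq. (1).

    B:     |tau| x |pa(tau)| regression coefficients (directed edges).
    Theta: |tau| x |tau| inverse covariance (undirected edges).
    """
    mean = B @ x_pa
    cov = np.linalg.inv(Theta)          # Theta is the precision (inverse covariance)
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(3)             # a root component with no parents
B = np.array([[1.0, 0.0, 0.5],
              [0.0, -1.0, 0.0]])        # directed edges from x1 into this component
Theta = np.array([[2.0, -0.5],
                  [-0.5, 2.0]])         # one undirected edge within the component
x2 = sample_component(B, Theta, x1, rng)
print(x2.shape)                         # (2,)
```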
When the graph structure is known, an iterative procedure has been proposed to estimate the model parameters, but it converges only to one of many locally-optimal solutions [7].

When the chain component model has the form of Eq. (1), in order to jointly estimate the sparse graph structure and the parameters, we adopt sparse multivariate regression with covariance estimation (MRCE) [16] for each chain component and solve the following optimization problem:

min Σ_{τ=1}^{C} [ tr((Xτ − Xpa(τ) Bτᵀ) Θτ (Xτ − Xpa(τ) Bτᵀ)ᵀ) − N log|Θτ| ] + λ Σ_{τ=1}^{C} ||Bτ||₁ + γ Σ_{τ=1}^{C} ||Θτ||₁,

where Xα ∈ R^{N×|α|} is a dataset for N samples, ||·||₁ is the sparsity-inducing L1 penalty, and λ and γ are the regularization parameters that control the amount of sparsity in the parameters. As in MRCE [16], the problem above is not convex, but only bi-convex.

2.2 Sparse Conditional Gaussian Graphical Model as Chain Component Model

As an alternative model for p(xτ|xpa(τ)) in Gaussian chain graph models, a re-parameterization of the linear regression model in Eq. (1) with natural parameters has been considered [14]. This model also has been called a CGGM [17] or Gaussian CRF [20] due to its equivalence to a CRF.
A CGGM for p(xτ|xpa(τ)) takes the standard form of undirected graphical models as a log-linear model:

p(xτ|xpa(τ)) = exp(−(1/2) xτᵀ Θτ xτ − xτᵀ Θτ,pa(τ) xpa(τ)) / A(xpa(τ)),   (2)

where Θτ ∈ R^{|τ|×|τ|} and Θτ,pa(τ) ∈ R^{|τ|×|pa(τ)|} are the parameters for the feature weights between pairs of variables within xτ and between pairs of variables across xτ and xpa(τ), respectively, and A(xpa(τ)) is the normalization constant. The non-zero elements of Θτ and Θτ,pa(τ) indicate edges among the variables in xτ and between xτ and xpa(τ), respectively.

The linear regression model in Eq. (1) can be viewed as the result of performing inference in the probabilistic graphical model given by the CGGM in Eq. (2). This relationship between the two models can be seen by re-writing Eq. (2) in the form of a Gaussian distribution:

p(xτ|xpa(τ)) = N(−Θτ⁻¹ Θτ,pa(τ) xpa(τ), Θτ⁻¹),   (3)

where marginalization in a CGGM involves computing Bτ xpa(τ) = −Θτ⁻¹ Θτ,pa(τ) xpa(τ) to obtain a linear regression model parameterized by Bτ.

In order to estimate the graph structure and parameters for Gaussian chain graph models with CGGMs as chain component models, we adopt the procedure for learning a sparse CGGM [17, 20] and minimize the negative log-likelihood of data along with sparsity-inducing L1 penalty:

min −L(X; Θ) + λ Σ_{τ=1}^{C} ||Θτ,pa(τ)||₁ + γ Σ_{τ=1}^{C} ||Θτ||₁,

where Θ = {Θτ, Θτ,pa(τ), τ = 1, . . . , C} and L(X; Θ) is the data log-likelihood for dataset X ∈ R^{N×J} for N samples.
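The equivalence between Eq. (2) and its Gaussian form Eq. (3) can be checked numerically: with the normalizer A(xpa(τ)) computed in closed form for the quadratic log-density, the two expressions agree pointwise. A sketch with made-up parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cggm_logpdf(x, x_pa, Theta, Theta_cross):
    """log p(x | x_pa) for the CGGM of Eq. (2); the Gaussian integral gives
    log A(x_pa) = (d/2) log 2*pi - (1/2) log|Theta| + (1/2) mu^T Theta mu."""
    mu = -np.linalg.solve(Theta, Theta_cross @ x_pa)   # conditional mean of Eq. (3)
    d = len(x)
    _, logdet = np.linalg.slogdet(Theta)
    logA = 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + 0.5 * mu @ Theta @ mu
    return -0.5 * x @ Theta @ x - x @ Theta_cross @ x_pa - logA

rng = np.random.default_rng(0)
Theta = np.array([[2.0, -0.5], [-0.5, 1.5]])           # illustrative values
Theta_cross = np.array([[0.3, 0.0, -0.2], [0.0, 0.1, 0.0]])
x_pa = rng.standard_normal(3)
x = rng.standard_normal(2)
mu = -np.linalg.solve(Theta, Theta_cross @ x_pa)
ref = multivariate_normal(mean=mu, cov=np.linalg.inv(Theta)).logpdf(x)
print(np.isclose(cggm_logpdf(x, x_pa, Theta, Theta_cross), ref))  # True
```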
Unlike MRCE, the optimization problem for a sparse CGGM is convex, and efficient algorithms have been developed to find the globally-optimal solution with substantially lower computation time than that for MRCE [17, 20].

While maximum likelihood estimation leads to equivalent parameter estimates for CGGMs and linear regression models via the transformation Bτ = −Θτ⁻¹ Θτ,pa(τ), imposing a sparsity constraint on each model leads to different estimates for the sparsity pattern of the parameters and the model structure [17]. The graph structure of a sparse CGGM directly encodes the probabilistic dependencies among the variables, whereas the sparsity pattern of Bτ = −Θτ⁻¹ Θτ,pa(τ) obtained after marginalization can be interpreted as the indirect influence of covariates xpa(τ) on responses xτ. As illustrated in Figures 1(c) and 1(f), the CGGM parameters Θτ,pa(τ) (directed edges with solid lines) can be interpreted as direct dependencies between pairs of variables across xτ and xpa(τ), whereas Bτ = −Θτ⁻¹ Θτ,pa(τ) obtained from inference can be viewed as indirect, inferred dependencies (directed edges with dotted lines).

We argue in this paper that when the goal is to learn the model structure, performing the estimation with CGGMs as chain component models can lead to a more meaningful representation of the underlying structure in the data than imposing a sparsity constraint on linear regression models. The corresponding linear regression model can then be inferred via marginalization.
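The contrast between the two sparsity patterns is easy to see numerically: a single direct edge in Θτ,pa(τ), combined with a connected Θτ, already produces several non-zeros in Bτ = −Θτ⁻¹ Θτ,pa(τ). A small sketch with made-up values:

```python
import numpy as np

# Three responses chained together in Theta_tau; one covariate with a single
# direct edge into the first response (all values made up for illustration).
Theta_tau = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])
Theta_cross = np.array([[1.0],
                        [0.0],
                        [0.0]])
B_tau = -np.linalg.solve(Theta_tau, Theta_cross)   # indirect effects after marginalization
print(np.count_nonzero(Theta_cross), np.count_nonzero(np.round(B_tau, 10)))  # 1 3
```

One direct edge fans out into three non-zero regression coefficients, which is exactly the "indirect influence" interpretation above.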
This approach also inherits many of the advantages of sparse CGGMs, such as the convexity of the optimization problem.

2.3 Markov Properties and Chain Component Models

When a CGGM is used as the component model, the overall chain graph model is known to have the Lauritzen-Wermuth-Frydenberg (LWF) Markov properties [9]. The LWF Markov properties also correspond to the standard probabilistic independencies in more general chain graphs constructed by using CRFs as building blocks [12].

Many previous works have noted that the LWF Markov properties do not hold for chain graph models with linear regression component models [2, 3]. The alternative Markov properties (AMP) were therefore introduced as the set of probabilistic independencies associated with chain graph models with linear regression component models [2, 3]. It has been shown that the LWF and AMP Markov properties are equivalent only for chain graph structures that do not contain the graph in Figure 1(a) as a subgraph [2, 3]. For example, according to the LWF Markov property, in the chain graph model in Figure 1(a), x1 ⊥ x3 | x2, as x1 and x3 are separated by x2 in the moralized graph in Figure 1(b). However, the corresponding AMP Markov property implies a different probabilistic independence relationship, x1 ⊥ x3.
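The moralization step used when reading off LWF independencies (marry the parents of each chain component, then drop edge directions) can be sketched as a small graph routine. The example graph below is hypothetical, not one of the figures, and for simplicity each listed parent is treated as a parent of every node in its child component:

```python
def moralize(components, parents, undirected):
    """Moralize a chain graph: connect ("marry") all parents of each chain
    component, convert directed edges into undirected ones, and keep the
    existing undirected edges.  Returns the undirected edge set."""
    edges = set(undirected)
    for comp, nodes in components.items():
        pa = parents.get(comp, [])
        for u in pa:                       # directed edge u -> component, undirected now
            for v in nodes:
                edges.add(frozenset({u, v}))
        for i, u in enumerate(pa):         # marry every pair of parents
            for v in pa[i + 1:]:
                edges.add(frozenset({u, v}))
    return edges

# Hypothetical chain graph: component {x1, x2} with undirected edge x1 - x2,
# and two singleton parent components {x3} and {x4} pointing into it.
edges = moralize(
    components={"C1": ["x1", "x2"], "C2": ["x3"], "C3": ["x4"]},
    parents={"C1": ["x3", "x4"]},
    undirected={frozenset({"x1", "x2"})},
)
print(frozenset({"x3", "x4"}) in edges)    # True: the two parents were married
```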
In the model in Figure 1(d), according to the LWF Markov property, we have x1 ⊥ x3 | {x2, x4}, whereas the AMP Markov property gives x1 ⊥ x3 | x4.

We observe that when using sparse CGGMs as chain component models, we estimate a model with the LWF Markov properties and perform marginalization in this model to obtain a model with linear-regression chain components that can be interpreted with the AMP Markov properties.

3 Sparse Two-Layer Gaussian Chain Graph Models for Structured Sparsity

Another advantage of using CGGMs as chain component models instead of linear regression is that the moralized graph, which is used to define the LWF Markov properties, can be leveraged to discover the underlying structure in a correlated functional mapping from multiple inputs to multiple outputs. In this section, we show that a sparse two-layer Gaussian chain graph model with CGGM components can be used to learn structured sparsity. The key idea behind our approach is that while inference in CGGMs within the chain graph model can reveal the shared sparsity patterns for multiple related outputs, a moralization of the chain graph can reveal those for multiple inputs.

Statistical methods for learning models with structured sparsity have been studied extensively in the multi-task learning literature, where the goal is to find input features that influence multiple related outputs simultaneously [5, 11, 15]. Most of the previous works assumed the output structure to be known a priori. They then constructed complex penalty functions that leverage this known output structure in order to induce structured sparsity patterns in the estimated parameters of linear regression models. In contrast, a sparse CGGM was proposed as an approach for performing a joint estimation of the output structure and structured sparsity for multi-task learning.
As was discussed in Section 2.2, once the CGGM structure is estimated, the inputs relevant for multiple related outputs could be revealed via probabilistic inference in the graphical model.

While sparse CGGMs focused on leveraging the output structure for improved predictions, another aspect of learning structured sparsity is to consider the input structure to discover multiple related inputs jointly influencing an output. As a CGGM is a discriminative model that does not model the input distribution, it is unable to capture input relatedness directly, although discriminative models in general are known to improve prediction accuracy. We address this limitation of CGGMs by embedding CGGMs within a chain graph and examining the moralized graph.

We set up a two-layer Gaussian chain graph model for inputs x and outputs y as follows:

p(y, x) = p(y|x) p(x) = ( exp(−(1/2) yᵀ Θyy y − xᵀ Θxy y) / A1(x) ) ( exp(−(1/2) xᵀ Θxx x) / A2 ),

where a CGGM is used for p(y|x) and a GGM for p(x), and A1(x) and A2 are normalization constants. As the full model factorizes into two factors p(y|x) and p(x) with distinct sets of parameters, a sparse graph structure and parameters can be learned by using the optimization methods for sparse CGGMs [20] and sparse GGMs [8, 10].

The estimated Gaussian chain graph model leads to a GGM over both the inputs and outputs, which reveals the structure of the moralized graph:

p(y, x) = N( 0, [[Θyy, Θxyᵀ], [Θxy, Θxx + Θxy Θyy⁻¹ Θxyᵀ]]⁻¹ ).

In the above GGM, we notice that the graph structure over inputs x consists of two components, one for Θxx describing the conditional dependencies within the input variables and another for Θxy Θyy⁻¹ Θxyᵀ that reflects the results of
moralization in the chain graph. If the graph Θyy contains connected components, the operation Θxy Θyy⁻¹ Θxyᵀ for moralization induces edges among those inputs influencing the outputs in each connected component.

Our approach is illustrated in Figure 2. Given the model in Figure 2(a), Figure 2(b) illustrates the inferred structured sparsity for a functional mapping from multiple inputs to multiple outputs. In Figure 2(b), the dotted edges correspond to inferred indirect dependencies introduced via marginalization in the CGGM p(y|x), which reveals how each input is influencing multiple related outputs. On the other hand, the additional edges among the xj's have been introduced by the moralization Θxy Θyy⁻¹ Θxyᵀ for multiple inputs jointly influencing each output. Combining the results of marginalization and moralization, the two connected components in Figure 2(b) represent the functional mappings from {x1, x2} to {y1, y2} and from {x3, x4, x5} to {y3, y4, y5}, respectively.

Figure 2: Illustration of sparse two-layer Gaussian chain graphs with CGGMs. (a) A two-layer Gaussian chain graph. (b) The results of performing inference and moralization in (a). The dotted edges correspond to indirect dependencies inferred by inference. The edges among the xj's represent the dependencies introduced by moralization.

4 Sparse Multi-layer Gaussian Chain Graph Models

In this section, we extend the two-layer Gaussian chain graph model from the previous section into a multi-layer model for data that are naturally organized into multiple layers.
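Before extending the model, the two-layer moralization above can be checked numerically: inputs that influence outputs in the same connected component of Θyy become linked in the moralized precision even when Θxx has no edge between them. A toy sketch, with all values illustrative:

```python
import numpy as np

# Toy two-layer model: 2 outputs y, 3 inputs x (values made up for illustration).
Theta_yy = np.array([[2.0, -1.0],
                     [-1.0, 2.0]])       # y1 - y2: one connected output component
Theta_xy = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])        # direct edges x1 -> y1 and x2 -> y2
Theta_xx = np.eye(3)                     # no direct edges among the inputs

# Input block of the moralized precision: Theta_xx + Theta_xy Theta_yy^{-1} Theta_xy^T
moral_xx = Theta_xx + Theta_xy @ np.linalg.inv(Theta_yy) @ Theta_xy.T
joint = np.block([[Theta_yy, Theta_xy.T],
                  [Theta_xy, moral_xx]])  # precision of the GGM over (y, x)

# x1 and x2 act on the same connected component of Theta_yy, so moralization
# links them; x3 influences no output and stays isolated.
print(moral_xx[0, 1] != 0.0, moral_xx[0, 2] != 0.0)  # True False
```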
Our approach is motivated by problems in integrative genomic data analysis. In order to study the genetic architecture of complex diseases, data are often collected for multiple data types, such as genotypes, gene expressions, and phenotypes, for a population of individuals [6, 18]. The primary goal of such studies is to identify the genotype features that influence gene expressions, which in turn influence phenotypes. In such problems, data can be naturally organized into multiple layers, where the influence of features in each layer propagates to the next layer in sequence. In addition, it is well known that the expressions of genes within the same functional module are correlated and influenced by common genotype features, and that the coordinated expressions of gene modules affect multiple related phenotypes jointly. These underlying structures in the genomic data can potentially be revealed by inference and moralization in sparse Gaussian chain graph models with CGGM components.

In addition, we explore the use of semi-supervised learning, where the top- and bottom-layer data are fully observed but the middle-layer data are collected only for a subset of samples. In our application, genotype data and phenotype data are relatively easy to collect from patients' blood samples and from observations. However, gene-expression data collection is more challenging, as an invasive procedure such as surgery or biopsy is required to obtain tissue samples.

4.1 Models

Given variables x = {x1, . . . , xJ}, y = {y1, . . . , yK}, and z = {z1, . . .
, zL} at each of the three layers, we set up a three-layer Gaussian chain graph model as follows:

p(z, y|x) = p(z|y) p(y|x) = ( exp(−(1/2) zᵀ Θzz z − yᵀ Θyz z) / C2(y) ) ( exp(−(1/2) yᵀ Θyy y − xᵀ Θxy y) / C1(x) ),   (4)

where C1(x) and C2(y) are the normalization constants. In our application, x, y, and z correspond to genotypes, gene-expression levels, and phenotypes, respectively. As the focus of such studies lies on discovering how the genotypic variability influences gene expressions and phenotypes rather than on the structure in genotype features, we do not model p(x) directly.

Given the estimated sparse model for Eq. (4), the structured sparsity pattern can be recovered via inference and moralization. Computing Bxy = −Θyy⁻¹ Θxyᵀ and Byz = −Θzz⁻¹ Θyzᵀ corresponds to performing inference to reveal how multiple related yk's in Θyy (or zl's in Θzz) are jointly influenced by a common set of relevant xj's (or yk's). On the other hand, the effects of moralization can be seen from the joint distribution p(z, y|x) derived from Eq. (4):

p(z, y|x) = N( −Θ(zz,yy)⁻¹ Θ(yz,xy)ᵀ x, Θ(zz,yy)⁻¹ ),

where Θ(yz,xy) = (0_{J×L}, Θxy) and Θ(zz,yy) = [[Θzz, Θyzᵀ], [Θyz, Θyy + Θyz Θzz⁻¹ Θyzᵀ]]. Θ(zz,yy) corresponds to the undirected graphical model over z and y conditional on x after moralization.

4.2 Semi-supervised Learning

Given a dataset D = {Do, Dh}, where Do = {Xo, Yo, Zo} for the fully-observed data and Dh = {Xh, Zh} for the samples with missing gene-expression levels, for semi-supervised learning we adopt an EM algorithm that iteratively maximizes the expected log-likelihood of complete data,

L(Do; Θ) + E[ L(Dh, Yh; Θ) ],

combined with L1-regularization, where L(Do; Θ) is the data log-likelihood with respect to the model in Eq. (4) and the expectation is taken with respect to:

p(y|z, x) = N(μy|x,z, Σy|x,z), with μy|x,z = −Σy|x,z (Θyz z + Θxyᵀ x) and Σy|x,z = (Θyy + Θyz Θzz⁻¹ Θyzᵀ)⁻¹.

5 Results

In this section, we empirically demonstrate that CGGMs are more effective components for sparse Gaussian chain graph models than linear regression for various tasks, using synthetic and real-world genomic datasets. We used the sparse three-layer structure for p(z, y|x) in all our experiments.

5.1 Simulation Study

In our simulation study, we considered two scenarios for the true models: CGGM-based and linear-regression-based Gaussian chain graph models. We evaluated the performance in terms of graph structure recovery and prediction accuracy in both supervised and semi-supervised settings.

In order to simulate data, we assumed a problem size of J=500, K=100, and L=50 for x, y, and z, respectively, and generated samples from known true models. Since we do not model p(x), we used an arbitrary choice of multinomial distribution to generate samples for x. The true parameters for the CGGM-based simulation were set as follows. We set the graph structure in Θyy to a randomly-generated scale-free network with a community structure [19] with six communities. The edge weights were drawn randomly from a uniform distribution [0.8, 1.2]. We then set Θyy to the graph Laplacian of this network plus small positive values along the diagonal so that Θyy is positive definite. We generated Θzz using a similar strategy, assuming four communities. Θxy was set to a sparse random matrix, where 0.4% of the elements have non-zero values drawn from a uniform distribution [-1.2, -0.8].
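The precision-construction recipe above (weighted graph Laplacian plus a small positive diagonal) can be sketched as follows; the scale-free community generator of [19] itself is not reproduced here, and the adjacency matrix below is just a placeholder:

```python
import numpy as np

def precision_from_graph(A, low=0.8, high=1.2, eps=0.1, rng=None):
    """Positive definite precision matrix from an undirected adjacency matrix A:
    a weighted graph Laplacian plus a small positive diagonal (eps)."""
    if rng is None:
        rng = np.random.default_rng()
    W = np.triu(A, 1) * rng.uniform(low, high, size=A.shape)
    W = W + W.T                                # symmetric random edge weights
    L = np.diag(W.sum(axis=1)) - W             # graph Laplacian (positive semidefinite)
    return L + eps * np.eye(A.shape[0])        # shift eigenvalues up by eps > 0

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])                      # a 3-node chain as a stand-in network
Theta = precision_from_graph(A, rng=np.random.default_rng(0))
print(bool(np.all(np.linalg.eigvalsh(Theta) > 0)))   # True
```

The zero pattern of the resulting precision matrix matches the input graph, which is what makes the construction suitable as a ground-truth Θyy or Θzz.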
Θyz was generated using a similar strategy, with a sparsity level of 0.5%. We set the sparsity pattern of Θyz so that it roughly respects the functional mapping from communities in y to communities in z. Specifically, after reordering the variables in y and z by performing hierarchical clustering on each of the two networks Θyy and Θzz, the non-zero elements were selected randomly around the diagonal of Θyz.

We set the true parameters for the linear-regression-based models using the same strategy as the CGGM-based simulation above for Θyy and Θzz. We set Bxy so that 50% of the variables in x have non-zero influence on five randomly chosen variables in y in one randomly chosen community in Θyy. We set Byz in a similar manner, assuming 80% of the variables in y are relevant to eight randomly-chosen variables in z from a randomly-chosen community in Θzz.

Figure 4: Precision/recall curves for graph structure recovery in CGGM-based simulation study. (a) Θyy, (b) Θzz, (c) Bxy, (d) Byz, and (e) Θxy. (CG: CGGM-based models with supervised learning, CG-semi: CG with semi-supervised learning, LR: linear-regression-based models with supervised learning, LR-semi: LR with semi-supervised learning.)

Figure 5: Prediction errors in CGGM-based simulation study.
The same estimated models as in Figure 4 were used to predict (a) y given x, z, (b) z given x, (c) y given x, and (d) z given y.

Figure 6: Performance for graph structure recovery in linear-regression-based simulation study. Precision/recall curves are shown for (a) Θyy, (b) Θzz, (c) Bxy, and (d) Byz.

Each dataset consisted of 600 samples, of which 400 and 200 samples were used as training and test sets. To select the regularization parameters, we estimated a model using 300 samples, evaluated prediction errors on the other 100 samples in the training set, and selected the values with the lowest prediction errors. We used the optimization methods in [20] for the CGGM-based models and the MRCE procedure [16] for the linear-regression-based models.

Figure 3 illustrates how the model with CGGM chain components can be used to discover structured sparsity via inference and moralization. In each panel, black and bright pixels correspond to zero and non-zero values, respectively. While Figure 3(a) shows how the variables in z are related in Θzz, Figure 3(b) shows Byz = −Θzz⁻¹ Θyzᵀ obtained via marginalization within the CGGM p(z|y), where functional mappings from variables in y to multiple related variables in z can be seen as white vertical bars. In Figure 3(c), the effects of moralization Θyy + Θyz Θzz⁻¹ Θyzᵀ are shown, which further decompose into Θyy (Figure 3(d)) and Θyz Θzz⁻¹ Θyzᵀ (Figure 3(e)). The additional edges among the variables in y in Figure 3(e) correspond to the edges introduced via moralization and show the groupings of the variables in y as the block structure along the diagonal. By examining Figures 3(b) and 3(e), we can infer a functional mapping from modules in y to modules in z.

Figure 3: Illustration of the structured sparsity recovered by the model with CGGM components, simulated dataset. (a) Θzz. (b) Byz = −Θzz⁻¹ Θyzᵀ shows the effects of marginalization (white vertical bars). The effects of moralization are shown in (c) Θyy + Θyz Θzz⁻¹ Θyzᵀ, and its decomposition into (d) Θyy and (e) Θyz Θzz⁻¹ Θyzᵀ.

Figure 7: Prediction errors in linear-regression-based simulation study. The same estimated models as in Figure 6 were used to predict (a) y given x, z, (b) z given x, (c) y given x, and (d) z given y.

In order to systematically compare the performance of the two types of models, we examined the average performance over 30 randomly-generated datasets. We considered both supervised and semi-supervised settings. Assuming that 200 samples out of the total 400 training samples were missing data for y, for supervised learning we used only those samples with complete data; for semi-supervised learning, we used all samples, including the partially-observed cases.

The precision/recall curves for recovering the true graph structures are shown in Figure 4, using datasets simulated from the true models with CGGM components. Each curve was obtained as an average over 30 different datasets.
We observe that in both supervised and semi-supervised settings, the models with CGGM components outperform the ones with linear regression components. In addition, the performance of the CGGM-based models improves significantly when using the partially-observed data in addition to the fully-observed samples (the curve for CG-semi in Figure 4), compared to using only the fully-observed samples (the curve for CG in Figure 4). This improvement from using partially-observed data is substantially smaller for the linear-regression-based models. The average prediction errors from the same set of estimated models in Figure 4 are shown in Figure 5. The CGGM-based models outperform in all prediction tasks, because they can leverage the underlying structure in the data and estimate models more effectively.

For the simulation scenario using the linear-regression-based true models, we show the results for precision/recall curves and prediction errors in Figures 6 and 7, respectively. We find that even though the data were generated from chain graph models with linear regression components, the CGGM-based methods perform as well as or better than the other models.

5.2 Integrative Genomic Data Analysis

Table 1: Prediction errors, mouse diabetes data

Task      CG-semi  CG      LR-semi  LR
y | x, z  0.9070   0.9996  1.0958   0.9671
z | x     1.0661   1.0585  1.0505   1.0614
y | x     0.8989   0.9382  0.9332   0.9103
z | y     1.0712   1.0861  1.1095   1.0765

We applied the two types of three-layer chain graph models to single-nucleotide-polymorphism (SNP), gene-expression, and phenotype data from the pancreatic islets study for diabetic mice [18]. We selected 200 islet gene-expression traits after performing hierarchical clustering to find several gene modules. Our dataset also included 1000 SNPs and 100 pancreatic islet cell phenotypes.
Of the total 506 samples, we used 406 as training set, of which 100 were held out as a validation set to select regularization parameters, and used the remaining 100 samples as test set to evaluate prediction accuracies. We considered both supervised and semi-supervised settings, assuming gene expressions are missing for 150 mice. In supervised learning, only those samples without missing gene expressions were used.

As can be seen from the prediction errors in Table 1, the models with CGGM chain components are more accurate in various prediction tasks. In addition, the CGGM-based models can more effectively leverage the samples with partially-observed data than linear-regression-based models.

6 Conclusions

In this paper, we addressed the problem of learning the structure of Gaussian chain graph models in a high-dimensional space. We argued that when the goal is to recover the model structure, using sparse CGGMs as chain component models has many advantages such as recovery of structured sparsity, computational efficiency, globally-optimal solutions for parameter estimates, and superior performance in semi-supervised learning.

Acknowledgements

This material is based upon work supported by an NSF CAREER Award No. MCB-1149885, Sloan Research Fellowship, and Okawa Foundation Research Grant.

References

[1] F. Abegaz and E. Wit. Sparse time series chain graphical models for reconstructing genetic networks. Biostatistics, pages 586–599, 2013.

[2] S. Andersson, D. Madigan, and D. Perlman. An alternative Markov property for chain graphs. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pages 40–48. Morgan Kaufmann, 1996.

[3] S. Andersson, D. Madigan, and D. Perlman.
Alternative Markov properties for chain graphs. Scandinavian Journal of Statistics, 28:33–85, 2001.

[4] W. Buntine. Chain graphs for learning. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pages 46–54. Morgan Kaufmann, 1995.

[5] X. Chen, X. Shi, X. Xu, Z. Wang, R. Mills, C. Lee, and J. Xu. A two-graph guided multi-task lasso approach for eQTL mapping. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 16. JMLR W&CP, 2012.

[6] Y. Chen, J. Zhu, P.K. Lum, X. Yang, S. Pinto, D.J. MacNeil, C. Zhang, J. Lamb, S. Edwards, S.K. Sieberts, et al. Variations in DNA elucidate molecular networks that cause disease. Nature, 452(27):429–35, 2008.

[7] M. Drton and M. Eichler. Maximum likelihood estimation in Gaussian chain graph models under the alternative Markov property. Scandinavian Journal of Statistics, 33:247–57, 2006.

[8] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–41, 2008.

[9] M. Frydenberg. The chain graph Markov property. Scandinavian Journal of Statistics, 17:333–53, 1990.

[10] C.J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems (NIPS) 24, 2011.

[11] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[12] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.

[14] S.L. Lauritzen and N.
Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17(1):31–57, 1989.

[15] G. Obozinski, M.J. Wainwright, and M.J. Jordan. High-dimensional union support recovery in multivariate regression. In Advances in Neural Information Processing Systems 21, 2008.

[16] A. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.

[17] K.A. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 16. JMLR W&CP, 2012.

[18] Z. Tu, M.P. Keller, C. Zhang, M.E. Rabaglia, D.M. Greenawalt, X. Yang, I.M. Wang, H. Dai, M.D. Bruss, P.Y. Lum, Y.P. Zhou, D.M. Kemp, C. Kendziorski, B.S. Yandell, A.D. Attie, E.E. Schadt, and J. Zhu. Integrative analysis of a cross-loci regulation network identifies app as a gene regulating insulin secretion from pancreatic islets. PLoS Genetics, 8(12):e1003107, 2012.

[19] J. Wu, Z. Gao, and H. Sun. Cascade and breakdown in scale-free networks with community structure. Physical Review, 74:066111, 2006.

[20] M. Wytock and J.Z. Kolter. Sparse Gaussian conditional random fields: algorithms, theory, and application to energy forecasting. In Proceedings of the 30th International Conference on Machine Learning, volume 28. JMLR W&CP, 2013.

[21] J. Yin and H. Li. A sparse conditional Gaussian graphical model for analysis of genetical genomics data.
The Annals of Applied Statistics, 5(4):2630, 2011.