{"title": "Learning Identifiable Gaussian Bayesian Networks in Polynomial Time and Sample Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 6457, "page_last": 6466, "abstract": "Learning the directed acyclic graph (DAG) structure of a Bayesian network from observational data is a notoriously difficult problem for which many non-identifiability and hardness results are known. In this paper we propose a provably polynomial-time algorithm for learning sparse Gaussian Bayesian networks with equal noise variance --- a class of Bayesian networks for which the DAG structure can be uniquely identified from observational data --- under high-dimensional settings. We show that $O(k^4 \\log p)$ number of samples suffices for our method to recover the true DAG structure with high probability, where $p$ is the number of variables and $k$ is the maximum Markov blanket size. We obtain our theoretical guarantees under a condition called \\emph{restricted strong adjacency faithfulness} (RSAF), which is strictly weaker than strong faithfulness --- a condition that other methods based on conditional independence testing need for their success. The sample complexity of our method matches the information-theoretic limits in terms of the dependence on $p$. We validate our theoretical findings through synthetic experiments.", "full_text": "Learning Identi\ufb01able Gaussian Bayesian Networks in\n\nPolynomial Time and Sample Complexity\n\nDepartment of Computer Science, Purdue University, West Lafayette, IN - 47906\n\nAsish Ghoshal and Jean Honorio\n\n{aghoshal, jhonorio}@purdue.edu\n\nAbstract\n\nLearning the directed acyclic graph (DAG) structure of a Bayesian network from ob-\nservational data is a notoriously dif\ufb01cult problem for which many non-identi\ufb01ability\nand hardness results are known. 
In this paper we propose a provably polynomial-time algorithm for learning sparse Gaussian Bayesian networks with equal noise variance (a class of Bayesian networks for which the DAG structure can be uniquely identified from observational data) under high-dimensional settings. We show that $O(k^4 \\log p)$ samples suffice for our method to recover the true DAG structure with high probability, where $p$ is the number of variables and $k$ is the maximum Markov blanket size. We obtain our theoretical guarantees under a condition called restricted strong adjacency faithfulness (RSAF), which is strictly weaker than strong faithfulness, a condition that other methods based on conditional independence testing need for their success. The sample complexity of our method matches the information-theoretic limits in terms of the dependence on $p$. We validate our theoretical findings through synthetic experiments.\n\n1 Introduction and Related Work\n\nMotivation. The problem of learning the directed acyclic graph (DAG) structure of Bayesian networks (BNs) in general, and Gaussian Bayesian networks (GBNs), or equivalently linear Gaussian structural equation models (SEMs), in particular, from observational data has a long history in the statistics and machine learning community. This is, in part, motivated by the desire to uncover causal relationships between entities in domains as diverse as finance, genetics, medicine, neuroscience and artificial intelligence, to name a few. Although in general, the DAG structure of a GBN or linear Gaussian SEM cannot be uniquely identified from purely observational data (i.e., multiple structures can encode the same conditional independence relationships present in the observed data set), under certain restrictions on the generative model, the DAG structure can be uniquely determined. 
Furthermore, the problem of learning the structure of BNs exactly is known to be NP-complete even when the number of parents of a node is at most $q$, for $q > 1$ [1]. It is also known that approximating the log-likelihood to a constant factor, even when the model class is restricted to polytrees with at most two parents per node, is NP-hard [2].\nPeters and B\u00fchlmann [3] recently showed that if the noise variances are the same, then the structure of a GBN can be uniquely identified from observational data. As observed by them, this \u201cassumption of equal error variances seems natural for applications with variables from a similar domain and is commonly used in time series models\u201d. Unfortunately, even for the equal noise-variance case, no polynomial time algorithm is known.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nContribution. In this paper we develop a polynomial time algorithm for learning a subclass of BNs exactly: sparse GBNs with equal noise variance. This problem has been considered by [3] who proposed an exponential time algorithm based on $\\ell_0$-penalized maximum likelihood estimation (MLE), and a heuristic greedy search method without any guarantees. Our algorithm involves estimating a $p$-dimensional inverse covariance matrix and solving $2(p - 1)$ at-most-$k$-dimensional ordinary least squares problems, where $p$ is the number of nodes and $k$ is the maximum Markov blanket size of a variable. We show that $O((k^4/\\alpha^2) \\log(p/\\delta))$ samples suffice for our algorithm to recover the true DAG structure and to approximate the parameters to at most $\\alpha$ additive error, with probability at least $1 - \\delta$, for some $\\delta > 0$. The sample complexity of $O(k^4 \\log p)$ is close to the information-theoretic limit of $\\Omega(k \\log p)$ for learning sparse GBNs as obtained by [4]. 
The main assumption under which we obtain our theoretical guarantees is a condition that we refer to as $\\alpha$-restricted strong adjacency faithfulness (RSAF). We show that RSAF is a strictly weaker condition than strong faithfulness, which methods based on independence testing require for their success. In this identifiable regime, given enough samples, our method can recover the exact DAG structure of any Gaussian distribution. However, existing exact algorithms like the PC algorithm [5] can fail to recover the correct skeleton for distributions that are not faithful, and fail to orient a number of edges that are not covered by the Meek orientation rules [6, 7]. Of independent interest is our analysis of OLS regression under the random design setting for which we obtain $\\ell_\\infty$ error bounds.\nRelated Work. In this section, we first discuss some identifiability results for GBNs known in the literature and then survey relevant algorithms for learning GBNs and Gaussian SEMs.\n[3] proved identifiability of distributions drawn from a restricted SEM with additive noise, where in the restricted SEM the functions are assumed to be non-linear and thrice continuously differentiable. It is also known that SEMs with linear functions and strictly non-Gaussian noise are identifiable [8]. Identifiability of the DAG structure for the linear function and Gaussian noise case was proved by [9] when noise variables are assumed to have equal variance.\nAlgorithms for learning BNs typically fall into two distinct categories, namely: independence test based methods and score based methods. This dichotomy also extends to the Gaussian case. Score based methods assign a score to a candidate DAG structure based on how well it explains the observed data, and then attempt to find the highest scoring structure. 
Popular examples for the Gaussian distribution are the log-likelihood based BIC and AIC scores and the $\\ell_0$-penalized log-likelihood score by [10]. However, given that the number of DAGs and sparse DAGs is exponential in the number of variables [4, 11], exhaustively searching for the highest scoring DAG in the combinatorial space of all DAGs, which is a feature of existing exact search based algorithms, is prohibitive for all but a small number of variables. [12] propose a score-based method, based on concave penalization of a reparameterized negative log-likelihood function, which can learn a GBN over 1000 variables in an hour. However, the resulting optimization problem is neither convex (and is therefore not guaranteed to find a globally optimal solution) nor solvable in polynomial time. In light of these shortcomings, approximation algorithms have been proposed for learning BNs which can be used to learn GBNs in conjunction with a suitable score function; notable methods are Greedy Equivalence Search (GES) proposed by [13] and an LP-relaxation based method proposed by [14].\nAmong independence test based methods for learning GBNs, [15] extended the PC algorithm, originally proposed by [5], to learn the Markov equivalence class of GBNs from observational data. The computational complexity of the PC algorithm is bounded by $O(p^k)$ with high probability, where $k$ is the maximum neighborhood size of a node, and the algorithm is only efficient for learning very sparse DAGs. For the non-linear Gaussian SEM case, [3] developed a two-stage algorithm called RESIT, which works by first learning the causal ordering of the variables and then performing regressions to learn the DAG structure. As we formally show in Appendix C.1, RESIT does not work for the linear Gaussian case. Moreover, Peters et al. proved the correctness of RESIT only in the population setting. 
Lastly, [16] developed an algorithm, which is similar in spirit to our algorithm, for efficiently learning Poisson Bayesian networks. They exploit a property specific to the Poisson distribution called overdispersion to learn the causal ordering of variables.\nFinally, the max-min hill climbing (MMHC) algorithm by [17] is a state-of-the-art hybrid algorithm for BNs that combines ideas from constraint-based and score-based learning. While MMHC works well in practice, it is inherently a heuristic algorithm and is not guaranteed to recover the true DAG structure even when it is uniquely identifiable.\n\n2 Preliminaries\n\nIn this section, we formalize the problem of learning Gaussian Bayesian networks from observational data. First, we introduce some notation and definitions.\nWe denote the set $\\{1, \\ldots, p\\}$ by $[p]$. Vectors and matrices are denoted by lowercase and uppercase bold faced letters respectively. Random variables (including random vectors) are denoted by italicized uppercase letters. Let $s_r, s_c \\subseteq [p]$ be any two non-empty index sets. Then for any matrix $A \\in \\mathbb{R}^{p \\times p}$, we denote the $\\mathbb{R}^{|s_r| \\times |s_c|}$ sub-matrix, formed by selecting the $s_r$ rows and $s_c$ columns of $A$, by $A_{s_r,s_c}$. With a slight abuse of notation, we will allow the index sets $s_r$ and $s_c$ to be a single index, e.g., $i$, and we will denote the index set of all rows (or columns) by $*$. Thus, $A_{*,i}$ and $A_{i,*}$ denote the $i$-th column and row of $A$ respectively. For any vector $v \\in \\mathbb{R}^p$, we will denote its support set by $S(v) = \\{i \\in [p] \\mid |v_i| > 0\\}$. Vector $\\ell_p$-norms are denoted by $\\|\\cdot\\|_p$. For matrices, $\\|\\cdot\\|_p$ denotes the induced (or operator) $\\ell_p$-norm and $|\\cdot|_p$ denotes the element-wise $\\ell_p$-norm, i.e., $|A|_p \\overset{\\text{def}}{=} (\\sum_{i,j} |A_{i,j}|^p)^{1/p}$. Finally, we denote the set $[p] \\setminus \\{i\\}$ by $-i$.\nLet $G = (V, E)$ be a directed acyclic graph (DAG) where the vertex set $V = [p]$ and $E$ is the set of directed edges, where $(i, j) \\in E$ implies the edge $i \\leftarrow j$. 
We denote by $\\pi_G(i)$ and $\\phi_G(i)$ the parent set and the set of children of the $i$-th node, respectively, in the graph $G$, and drop the subscript $G$ when the intended graph is clear from context. A vertex $i \\in [p]$ is a terminal vertex in $G$ if $\\phi_G(i) = \\varnothing$. For each $i \\in [p]$ we have a random variable $X_i \\in \\mathbb{R}$, $X = (X_1, \\ldots, X_p)$ is the $p$-dimensional vector of random variables, and $x = (x_1, \\ldots, x_p)$ is a joint assignment to $X$. Without loss of generality, we assume that $\\mathbb{E}[X_i] = 0$, $\\forall i \\in [p]$. Every DAG $G = (V, E)$ defines a set of topological orderings $\\mathcal{T}_G$ over $[p]$ that are compatible with the DAG $G$, i.e., $\\mathcal{T}_G = \\{\\tau \\in S_p \\mid \\tau(j) < \\tau(i) \\text{ if } (i, j) \\in E\\}$, where $S_p$ is the set of all possible permutations of $[p]$.\nA Gaussian Bayesian network (GBN) is a tuple $(G, \\mathcal{P}(W, S))$, where $G = (V, E)$ is a DAG structure, $W = \\{w_{i,j} \\in \\mathbb{R} \\mid (i, j) \\in E \\wedge |w_{i,j}| > 0\\}$ is the set of edge weights, $S = \\{\\sigma_i^2 \\in \\mathbb{R}_+\\}_{i=1}^p$ is the set of noise variances, and $\\mathcal{P}$ is a multivariate Gaussian distribution over $X = (X_1, \\ldots, X_p)$ that is Markov with respect to the DAG $G$ and is parameterized by $W$ and $S$. In other words, $\\mathcal{P} = \\mathcal{N}(x; 0, \\Sigma)$ factorizes as follows:\n\n$$\\mathcal{P}(x; W, S) = \\prod_{i=1}^p \\mathcal{P}_i(x_i; w_i, x_{\\pi(i)}, \\sigma_i^2), \\tag{1}$$\n\n$$\\mathcal{P}_i(x_i; w_i, x_{\\pi(i)}, \\sigma_i^2) = \\mathcal{N}(x_i; w_i^T x_{\\pi(i)}, \\sigma_i^2), \\tag{2}$$\n\nwhere $w_i \\in \\mathbb{R}^{|\\pi(i)|} \\overset{\\text{def}}{=} (w_{i,j})_{j \\in \\pi(i)}$ is the weight vector for the $i$-th node, $0$ is a vector of zeros of appropriate dimension (in this case $p$), $x_{\\pi(i)} = \\{x_j \\mid j \\in \\pi(i)\\}$, $\\Sigma$ is the covariance matrix for $X$, and $\\mathcal{P}_i$ is the conditional distribution of $X_i$ given its parents, which is also Gaussian.\nWe will also extensively use an alternative, but equivalent, view of a GBN: the linear structural equation model (SEM). Let $B = (w_{i,j} \\mathbf{1}[(i, j) \\in E])_{(i,j) \\in [p] \\times [p]}$ be the matrix of weights created from the set of edge weights $W$. 
A GBN $(G, \\mathcal{P}(W, S))$ corresponds to a SEM where each variable $X_i$ can be written as follows:\n\n$$X_i = \\sum_{j \\in \\pi(i)} B_{i,j} X_j + N_i, \\quad \\forall i \\in [p], \\tag{3}$$\n\nwith $N_i \\sim \\mathcal{N}(0, \\sigma_i^2)$ (for all $i \\in [p]$) being independent noise variables and $|B_{i,j}| > 0$ for all $j \\in \\pi(i)$. The joint distribution of $X$ as given by the SEM corresponds to the distribution $\\mathcal{P}$ in (1) and the graph associated with the SEM, where we have a directed edge $(i, j)$ if $j \\in \\pi(i)$, corresponds to the DAG $G$. Denoting $N = (N_1, \\ldots, N_p)$ as the noise vector, (3) can be rewritten in vector form as: $X = BX + N$.\nGiven a GBN $(G, \\mathcal{P}(W, S))$, with $B$ being the weight matrix corresponding to $W$, we denote the effective influence between two nodes $i, j \\in [p]$ by\n\n$$\\widetilde{w}_{i,j} \\overset{\\text{def}}{=} B_{*,i}^T B_{*,j} - B_{i,j} - B_{j,i}. \\tag{4}$$\n\nThe effective influence $\\widetilde{w}_{i,j}$ between two nodes $i$ and $j$ is zero if: (a) $i$ and $j$ do not have an edge between them and do not have common children, or (b) $i$ and $j$ have an edge between them but the dot product between the weights to the children ($B_{*,i}^T B_{*,j}$) exactly equals the edge weight between $i$ and $j$ ($B_{i,j} + B_{j,i}$). The effective influence determines the Markov blanket of each node, i.e., $\\forall i \\in [p]$, the Markov blanket is given as: $S_i = \\{j \\mid j \\in -i \\wedge \\widetilde{w}_{i,j} \\neq 0\\}$ $^1$. Furthermore, a node is conditionally independent of all other nodes not in its Markov blanket, i.e., $\\Pr\\{X_i \\mid X_{-i}\\} = \\Pr\\{X_i \\mid X_{S_i}\\}$. Next, we present a few definitions that will be useful later.\n\n$^1$ Our definition of Markov blanket differs from the commonly used graph-theoretic definition in that the latter includes the parents, children and all the co-parents of the children of node $i$ in the Markov blanket $S_i$.\n\nDefinition 1 (Causal Minimality [18]). A distribution $\\mathcal{P}$ is causal minimal with respect to a DAG structure $G$ if it is not Markov with respect to a proper subgraph of $G$.\nDefinition 2 (Faithfulness [5]). 
Given a GBN $(G, \\mathcal{P})$, $\\mathcal{P}$ is faithful to the DAG $G = (V, E)$ if for any $i, j \\in V$ and any $V' \\subseteq V \\setminus \\{i, j\\}$:\n\n$$i \\text{ d-separated from } j \\mid V' \\iff \\mathrm{corr}(X_i, X_j \\mid X_{V'}) = 0,$$\n\nwhere $\\mathrm{corr}(X_i, X_j \\mid X_{V'})$ is the partial correlation between $X_i$ and $X_j$ given $X_{V'}$.\nDefinition 3 (Strong Faithfulness [19]). Given a GBN $(G, \\mathcal{P})$ the multivariate Gaussian distribution $\\mathcal{P}$ is $\\lambda$-strongly faithful to the DAG $G$, for some $\\lambda \\in (0, 1)$, if\n\n$$\\min\\{|\\mathrm{corr}(X_i, X_j \\mid X_{V'})| : i \\text{ is not d-separated from } j \\mid V', \\; \\forall i, j \\in [p] \\wedge \\forall V' \\subseteq V \\setminus \\{i, j\\}\\} \\geq \\lambda.$$\n\nStrong faithfulness is a stronger version of the faithfulness assumption that requires that for all triples $(X_i, X_j, X_{V'})$ such that $i$ is not d-separated from $j$ given $V'$, the partial correlation $\\mathrm{corr}(X_i, X_j \\mid X_{V'})$ is bounded away from 0. It is known that while the set of distributions $\\mathcal{P}$ that are Markov to a DAG $G$ but not faithful to it has Lebesgue measure zero, the set of distributions $\\mathcal{P}$ that are not strongly faithful to $G$ has nonzero Lebesgue measure, and in fact can be quite large [20].\nThe problem of learning a GBN from observational data corresponds to recovering the DAG structure $G$ and parameters $W$ from a matrix $\\mathbf{X} \\in \\mathbb{R}^{n \\times p}$ of $n$ i.i.d. samples drawn from $\\mathcal{P}(W, S)$. In this paper we consider the problem of learning GBNs over $p$ variables where the size of the Markov blanket of a node is at most $k$. This is in general not possible without making additional assumptions on the GBN $(G, \\mathcal{P}(W, S))$ and the distribution $\\mathcal{P}$ as we describe next.\nAssumptions. Here, we enumerate our technical assumptions.\nAssumption 1 (Causal Minimality). Let $(G, \\mathcal{P}(W, S))$ be a GBN, then $\\forall w_{i,j} \\in W$, $|w_{i,j}| > 0$.\nThe above assumption ensures that all edge weights are strictly nonzero, which results in each variable $X_i$ being a non-constant function of its parents $X_{\\pi(i)}$. Given Assumption 1, the distribution $\\mathcal{P}$ is causal minimal with respect to $G$ [3] and therefore identifiable under equal noise variances [9], i.e., $\\sigma_1 = \\ldots = \\sigma_p = \\sigma$. 
Throughout the rest of the paper, we will denote such Bayesian networks by $(G, \\mathcal{P}(W, \\sigma^2))$.\nAssumption 2 (Restricted Strong Adjacency Faithfulness). Let $(G, \\mathcal{P}(W, \\sigma^2))$ be a GBN with $G = (V, E)$. For every $\\tau \\in \\mathcal{T}_G$, consider the sequence of graphs $G[m, \\tau] = (V[m, \\tau], E[m, \\tau])$ indexed by $(m, \\tau)$, where $G[m, \\tau]$ is the induced subgraph of $G$ over the first $m$ vertices in the topological ordering $\\tau$, i.e., $V[m, \\tau] \\overset{\\text{def}}{=} \\{i \\in [p] \\mid \\tau(i) \\leq m\\}$ and $E[m, \\tau] \\overset{\\text{def}}{=} \\{(i, j) \\in E \\mid i \\in V[m, \\tau] \\wedge j \\in V[m, \\tau]\\}$. The multivariate Gaussian distribution $\\mathcal{P}$ is restricted $\\alpha$-strongly adjacency faithful to $G$, provided that:\n\n(i) $\\min\\{|w_{i,j}| \\mid (i, j) \\in E\\} > 3\\alpha$,\n\n(ii) $|\\widetilde{w}_{i,j}| > \\frac{3\\alpha}{\\kappa(\\alpha)}$, $\\forall i \\in V[m, \\tau] \\wedge j \\in S_i[m, \\tau] \\wedge m \\in [p] \\wedge \\tau \\in \\mathcal{T}_G$,\n\nwhere $\\alpha > 0$ is a constant, $\\widetilde{w}_{i,j}$ is the effective influence between $i$ and $j$ in the induced subgraph $G[m, \\tau]$ as defined in (4), and $S_i[m, \\tau]$ denotes the Markov blanket of node $i$ in $G[m, \\tau]$. The constant $\\kappa(\\alpha) = 1 - 2/(1 + 9|\\phi_{G[m,\\tau]}(i)|\\alpha^2)$ if $i$ is a non-terminal vertex in $G[m, \\tau]$, where $|\\phi_{G[m,\\tau]}(i)|$ is the number of children of $i$ in $G[m, \\tau]$, and $\\kappa(\\alpha) = 1$ if $i$ is a terminal vertex.\n\nSimply stated, the RSAF assumption requires that the absolute value of each edge weight is at least $3\\alpha$ and the absolute value of the effective influence between two nodes, whenever it is non-zero, is at least $3\\alpha$ for terminal nodes and $3\\alpha/\\kappa(\\alpha)$ for non-terminal nodes. Moreover, the above should hold not only for the original DAG, but also for each DAG obtained by sequentially removing terminal vertices. The constant $\\alpha$ is related to the statistical error and decays as $\\Omega(k^2 \\sqrt{\\log p / n})$. Note that in the regime $\\alpha \\in (0, 1/(3\\sqrt{|\\phi_{G[m,\\tau]}(i)|}))$, which happens for sufficiently large $n$, the condition on $\\widetilde{w}_{i,j}$ is satisfied trivially. As we will show later, Assumption 2 is equivalent to the following, for some constant $\\alpha'$,\n\n$$\\min\\{|\\mathrm{corr}(X_i, X_j \\mid X_{V[m,\\tau] \\setminus \\{i,j\\}})| \\mid i \\in V[m, \\tau] \\wedge j \\in S_i[m, \\tau] \\wedge m \\in [p] \\wedge \\tau \\in \\mathcal{T}_G\\} \\geq \\alpha'.$$\n\n$^1$ (continued) Both the definitions are equivalent under faithfulness. However, since we allow non-faithful distributions, our definition of Markov blanket is more appropriate.\n\nFigure 1: A GBN, with noise variance set to 1, that is RSAF, but is neither faithful, nor strongly faithful, nor adjacency faithful to the DAG structure. This GBN is not faithful because $\\mathrm{corr}(X_4, X_5 \\mid X_2, X_3) = 0$ even though $(2, 3)$ do not d-separate 4 and 5. Other violations of faithfulness include $\\mathrm{corr}(X_1, X_4 \\mid \\varnothing) = 0$ and $\\mathrm{corr}(X_1, X_5 \\mid \\varnothing) = 0$. Therefore, a CI test based method will fail to recover the true structure. In Appendix B.1, we show that the PC algorithm fails to recover the structure of this GBN while our method recovers the structure exactly.\n\nAt this point, it is worthwhile to compare our assumptions with those made by other methods for learning GBNs. Methods based on conditional independence (CI) tests, e.g., the PC algorithm for learning the equivalence class of GBNs developed by [15], require strong faithfulness. While strong faithfulness requires that for a node pair $(i, j)$ that are adjacent in the DAG, the partial correlation $\\mathrm{corr}(X_i, X_j \\mid X_S)$ is bounded away from zero for all sets $S \\in \\{S \\subseteq [p] \\setminus \\{i, j\\}\\}$, RSAF only requires non-zero partial correlations with respect to a subset of sets in $\\{S \\subseteq [p] \\setminus \\{i, j\\}\\}$. Thus, RSAF is strictly weaker than strong faithfulness. The set of non-zero partial correlations needed by RSAF is also a strict subset of those needed by the faithfulness condition. 
Figure 1 shows a GBN which is RSAF but neither faithful, nor strongly faithful, nor adjacency faithful (see [20] for a definition).\nWe conclude this section with one last remark. At first glance, it might appear that the assumption of equal variance together with our assumptions implies a simple causal ordering of variables in which the marginal variance of the variables increases strictly monotonically with the causal ordering. However, this is not the case. For instance, in the GBN shown in Figure 1 the marginal variance of the causally ordered nodes $(1, 2, 3, 4, 5)$ is $(1, 2, 2, 2, 2.125)$. We also perform extensive simulation experiments to further investigate this case in Appendix B.6.\n\n3 Results\n\nWe start by characterizing the covariance and precision matrix of a GBN $(G, \\mathcal{P}(W, \\sigma^2))$. Let $B$ be the weight matrix corresponding to the edge weights $W$, then from (3) it follows that the covariance and precision matrix are, respectively:\n\n$$\\Sigma = \\sigma^2 (I - B)^{-1} (I - B)^{-T}, \\qquad \\Omega = \\frac{1}{\\sigma^2} (I - B)^T (I - B), \\tag{5}$$\n\nwhere $I$ is the $p \\times p$ identity matrix.\nRemark 1. The elements of the inverse covariance matrix are related to the partial correlations as follows: $\\mathrm{corr}(X_i, X_j \\mid X_{V \\setminus \\{i,j\\}}) = -\\Omega_{i,j}/\\sqrt{\\Omega_{i,i}\\Omega_{j,j}}$. Therefore, $|\\widetilde{w}_{i,j}| \\geq c\\alpha$, for some constant $c$ (Assumption 2), implies that $|\\mathrm{corr}(X_i, X_j \\mid X_{V \\setminus \\{i,j\\}})| \\geq c\\alpha/\\sqrt{\\Omega_{i,i}\\Omega_{j,j}} > 0$.\nNext, we describe a key property of homoscedastic noise GBNs in the lemma below, which will be the driving force behind our algorithm.\nLemma 1. Let $(G, \\mathcal{P}(W, \\sigma^2))$ be a GBN, with $\\Omega$ being the inverse covariance matrix over $X$ and $\\mathbb{E}[X_i \\mid (X_{-i} = x_{-i})] = \\theta_i^T x_{-i}$, with $\\theta_i$ being the $i$-th regression coefficient. Under Assumption 1, we have that\n\n$$i \\text{ is a terminal vertex in } G \\iff \\theta_{ij} = -\\sigma^2 \\Omega_{i,j}, \\; \\forall j \\in -i.$$\n\nDetailed proofs can be found in Appendix A in the supplementary material. 
Lemma 1 states that, in the population setting, one can identify the terminal vertex, and therefore the causal ordering, just by assuming causal minimality (Assumption 1). However, to identify terminal vertices from a finite number of samples, one needs additional assumptions. We use Lemma 1 to develop our algorithm for learning GBNs which, at a high level, works as follows. Given data $\\mathbf{X}$ drawn from a GBN, we first estimate the inverse covariance matrix $\\widehat{\\Omega}$. Then we perform a series of ordinary least squares (OLS) regressions to compute the estimators $\\widehat{\\theta}_i$ $\\forall i \\in [p]$. We then identify terminal vertices using the property described in Lemma 1 and remove the corresponding variables (columns) from $\\mathbf{X}$. We repeat the process of identifying and removing terminal vertices and obtain the causal ordering of vertices. Then, we perform a final set of OLS regressions to learn the structure and parameters of the DAG.\nThe two main operations performed by our algorithm are: (a) estimating the inverse covariance matrix, and (b) estimating the regression coefficients $\\theta_i$. In what follows, we discuss these two steps in more detail and obtain theoretical guarantees for our algorithm.\n\nInverse covariance matrix estimation. The first part of our algorithm requires an estimate $\\widehat{\\Omega}$ of the true inverse covariance matrix $\\Omega^*$. Due in part to its role in undirected graphical model selection, the problem of inverse covariance matrix estimation has received significant attention over the years. A popular approach for inverse covariance estimation, under high-dimensional settings, is the $\\ell_1$-penalized Gaussian MLE studied by [21\u201328], among others. While, technically, these algorithms can be used in the first phase of our algorithm to estimate the inverse covariance matrix, in this paper, we use the method called CLIME, developed by Cai et al. 
[29], since its theoretical guarantees do not require a quite restrictive edge-based mutual incoherence condition as in [24]. Further, CLIME is computationally attractive because it computes $\\widehat{\\Omega}$ columnwise by solving $p$ independent linear programs. Even though the CLIME estimator $\\widehat{\\Omega}$ is not guaranteed to be positive-definite (it is positive-definite with high probability) it is suitable for our purpose since we use $\\widehat{\\Omega}$ only for identifying terminal vertices. Next, we briefly describe the CLIME method for inverse covariance estimation and instantiate the theoretical results of [29] for our purpose.\nThe CLIME estimator $\\widehat{\\Omega}$ is obtained as follows. First, we compute a potentially non-symmetric estimate $\\bar{\\Omega} = (\\bar{\\omega}_{i,j})$ by solving the following:\n\n$$\\bar{\\Omega} = \\operatorname*{argmin}_{\\Omega \\in \\mathbb{R}^{p \\times p}} |\\Omega|_1 \\quad \\text{s.t.} \\quad |\\Sigma^n \\Omega - I|_\\infty \\leq \\lambda_n, \\tag{6}$$\n\nwhere $\\lambda_n > 0$ is the regularization parameter, $\\Sigma^n \\overset{\\text{def}}{=} (1/n)\\mathbf{X}^T \\mathbf{X}$ is the empirical covariance matrix. Finally, the symmetric estimator is obtained by selecting the smaller entry among $\\bar{\\omega}_{i,j}$ and $\\bar{\\omega}_{j,i}$, i.e., $\\widehat{\\Omega} = (\\widehat{\\omega}_{i,j})$, where $\\widehat{\\omega}_{i,j} = \\bar{\\omega}_{i,j} \\mathbf{1}[|\\bar{\\omega}_{i,j}| < |\\bar{\\omega}_{j,i}|] + \\bar{\\omega}_{j,i} \\mathbf{1}[|\\bar{\\omega}_{j,i}| \\leq |\\bar{\\omega}_{i,j}|]$. It is easy to see that (6) can be decomposed into $p$ linear programs as follows. Let $\\bar{\\Omega} = (\\bar{\\omega}_1, \\ldots, \\bar{\\omega}_p)$, then\n\n$$\\bar{\\omega}_i = \\operatorname*{argmin}_{\\omega \\in \\mathbb{R}^p} \\|\\omega\\|_1 \\quad \\text{s.t.} \\quad |\\Sigma^n \\omega - e_i|_\\infty \\leq \\lambda_n, \\tag{7}$$\n\nwhere $e_i = (e_{i,j})$ such that $e_{i,j} = 1$ for $j = i$ and $e_{i,j} = 0$ otherwise. The following lemma, which follows from the results of [29] and [24], bounds the maximum elementwise difference between $\\widehat{\\Omega}$ and the true precision matrix $\\Omega^*$.\nLemma 2. Let $(G^*, \\mathcal{P}(W^*, \\sigma^2))$ be a GBN satisfying Assumption 1, with $\\Sigma^*$ and $\\Omega^*$ being the \u201ctrue\u201d covariance and inverse covariance matrix over $X$, respectively. Given a data matrix $\\mathbf{X} \\in \\mathbb{R}^{n \\times p}$ of $n$ i.i.d. samples drawn from $\\mathcal{P}(W^*, \\sigma^2)$, compute $\\widehat{\\Omega}$ by solving (6). Then, if the regularization parameter and number of samples satisfy:\n\n$$\\lambda_n \\geq \\|\\Omega^*\\|_1 \\sqrt{(C_1/n) \\log(4p^2/\\delta)}, \\qquad n \\geq ((16\\sigma^4 \\|\\Omega^*\\|_1^4 C_1)/\\alpha^2) \\log((4p^2)/\\delta),$$\n\nwith probability at least $1 - \\delta$ we have that $|\\Omega^* - \\widehat{\\Omega}|_\\infty \\leq \\alpha/\\sigma^2$, where $C_1 = 3200 \\max_i (\\Sigma^*_{i,i})^2$ and $\\delta \\in (0, 1)$. Further, thresholding $\\widehat{\\Omega}$ at the level $4\\|\\Omega^*\\|_1 \\lambda_n$, we have $S(\\Omega^*) = S(\\widehat{\\Omega})$.\nRemark 2. Note that in each column of the true precision matrix $\\Omega^*$, at most $k$ entries are non-zero, where $k$ is the maximum Markov blanket size of a node in $G$. Therefore, the $\\ell_1$ induced (or operator) norm $\\|\\Omega^*\\|_1 = O(k)$, and the sufficient number of samples required for the estimator $\\widehat{\\Omega}$ to be within $\\alpha$ distance from $\\Omega^*$, elementwise, with probability at least $1 - \\delta$ is $O((1/\\alpha^2) k^4 \\log(p/\\delta))$.\nEstimating regression coefficients. Given a GBN $(G, \\mathcal{P}(W, \\sigma^2))$ with the covariance and precision matrix over $X$ being $\\Sigma$ and $\\Omega$ respectively, the conditional distribution of $X_i$ given the variables in its Markov blanket is: $X_i \\mid (X_{S_i} = x) \\sim \\mathcal{N}((\\theta_i)_{S_i}^T x, 1/\\Omega_{i,i})$. Let $\\theta^i_{S_i} \\overset{\\text{def}}{=} (\\theta_i)_{S_i}$. This leads to the following generative model for $\\mathbf{X}_{*,i}$:\n\n$$\\mathbf{X}_{*,i} = (\\mathbf{X}_{*,S_i}) \\theta^i_{S_i} + \\varepsilon'_i, \\tag{8}$$\n\nwhere $\\varepsilon'_i \\sim \\mathcal{N}(0, 1/\\Omega_{i,i})$ and $\\mathbf{X}_{l,S_i} \\sim \\mathcal{N}(0, \\Sigma_{S_i,S_i})$ for all $l \\in [n]$. Therefore, for all $i \\in [p]$, we obtain the estimator $\\widehat{\\theta}^i_{S_i}$ of $\\theta^i_{S_i}$ by solving the following ordinary least squares (OLS) problem:\n\n$$\\widehat{\\theta}^i_{S_i} = \\operatorname*{argmin}_{\\beta \\in \\mathbb{R}^{|S_i|}} \\frac{1}{2n} \\|\\mathbf{X}_{*,i} - (\\mathbf{X}_{*,S_i}) \\beta\\|_2^2 = (\\Sigma^n_{S_i,S_i})^{-1} \\Sigma^n_{S_i,i}. \\tag{9}$$\n\nThe following lemma bounds the approximation error between the true regression coefficients and those obtained by solving the OLS problem. OLS regression has been previously analyzed by [30] under the random design setting. However, they obtain bounds on the prediction error, i.e., $(\\theta^i_{S_i} - \\widehat{\\theta}^i_{S_i})^T \\Sigma^* (\\theta^i_{S_i} - \\widehat{\\theta}^i_{S_i})$, while the following lemma bounds $\\|\\theta^i_{S_i} - \\widehat{\\theta}^i_{S_i}\\|_\\infty$.\nLemma 3. Let $(G^*, \\mathcal{P}(W^*, \\sigma^2))$ be a GBN with $\\Sigma^*$ and $\\Omega^*$ being the true covariance and inverse covariance matrix over $X$. Let $\\mathbf{X} \\in \\mathbb{R}^{n \\times p}$ be the data matrix of $n$ i.i.d. samples drawn from $\\mathcal{P}(W^*, \\sigma^2)$. Let $\\mathbb{E}[X_i \\mid (X_{S_i} = x)] = x^T \\theta^i_{S_i}$, and let $\\widehat{\\theta}^i_{S_i}$ be the OLS solution obtained by solving (9) for some $i \\in [p]$. Then, assuming $\\Sigma^*$ is non-singular, and if the number of samples satisfies:\n\n$$n \\geq \\frac{c |S_i|^{3/2} (\\|\\theta^i_{S_i}\\|_\\infty + 1/|S_i|)}{\\lambda_{\\min}(\\Sigma^*_{S_i,S_i}) \\alpha} \\log\\left(\\frac{4|S_i|}{\\delta}\\right),$$\n\nwe have that $\\|\\theta^i_{S_i} - \\widehat{\\theta}^i_{S_i}\\|_\\infty \\leq \\alpha$ with probability at least $1 - \\delta$, for some $\\delta \\in (0, 1)$, with $c$ being an absolute constant.\nOur algorithm. Algorithm 1 presents our algorithm for learning GBNs. Throughout the algorithm we use as indices the true label of a node. We first estimate the inverse covariance matrix, $\\widehat{\\Omega}$, in line 5. In line 7 we estimate the Markov blanket of each node. Then, we estimate $\\widehat{\\theta}_{i,j}$ for all $i$ and $j \\in \\widehat{S}_i$, and compute the maximum per-node ratios $r_i = \\max_{j \\in \\widehat{S}_i} |\\widehat{\\Omega}_{i,j}/\\widehat{\\theta}_{i,j}|$ in lines 8 \u2013 11. We then identify as terminal vertex the node for which $r_i$ is minimum and remove it from the collection of variables (lines 13 and 14). Each time a variable is removed, we perform a rank-1 update of the precision matrix (line 15) and also update the regression coefficients of the nodes in its Markov blanket (lines 16 \u2013 20). We repeat this process of identifying and removing terminal vertices until the causal order has been completely determined. Finally, we compute the DAG structure and parameters by regressing each variable against variables that are in its Markov blanket which also precede it in the causal order (lines 23 \u2013 29).\n\nAlgorithm 1 Gaussian Bayesian network structure learning algorithm.\nInput: Data matrix $\\mathbf{X} \\in \\mathbb{R}^{n \\times p}$.\nOutput: $(\\widehat{G}, \\widehat{W})$.\n1: $\\widehat{B} \\leftarrow 0 \\in \\mathbb{R}^{p \\times p}$.\n2: $z \\leftarrow \\varnothing$, $r \\leftarrow \\varnothing$. $\\triangleright$ $z$ stores the causal order.\n3: $V \\leftarrow [p]$. $\\triangleright$ Remaining vertices.\n4: $\\Sigma^n \\leftarrow (1/n)\\mathbf{X}^T \\mathbf{X}$.\n5: Compute $\\widehat{\\Omega}$ using the CLIME estimator.\n6: $\\widehat{\\Omega}^0 = \\widehat{\\Omega}$.\n7: Compute $\\widehat{S}_i = \\{j \\in -i \\mid |\\widehat{\\Omega}_{i,j}| > 0\\}$, $\\forall i \\in [p]$.\n8: for $i \\in 1, \\ldots, p$ do\n9:     Compute $\\widehat{\\theta}^i_{\\widehat{S}_i} = (\\Sigma^n_{\\widehat{S}_i,\\widehat{S}_i})^{-1} \\Sigma^n_{\\widehat{S}_i,i}$.\n10:     $r_i \\leftarrow \\max\\{|\\widehat{\\Omega}_{i,j}/\\widehat{\\theta}_{i,j}| \\mid j \\in \\widehat{S}_i\\}$.\n11: end for\n12: for $t \\in 1 \\ldots p - 1$ do\n13:     $i \\leftarrow \\operatorname{argmin}(r)$. $\\triangleright$ $i$ is a terminal vertex.\n14:     Append $i$ to $z$; $V \\leftarrow V \\setminus \\{i\\}$; $r_i \\leftarrow +\\infty$.\n15:     $\\widehat{\\Omega} \\leftarrow \\widehat{\\Omega}_{-i,-i} - (1/\\widehat{\\Omega}_{i,i})(\\widehat{\\Omega}_{-i,i})(\\widehat{\\Omega}_{i,-i})$.\n16:     for $j \\in \\widehat{S}_i$ do\n17:         $\\widehat{S}_j \\leftarrow \\{l \\neq j \\mid |\\widehat{\\Omega}_{j,l}| > 0\\}$.\n18:         Compute $\\widehat{\\theta}^j_{\\widehat{S}_j} = (\\Sigma^n_{\\widehat{S}_j,\\widehat{S}_j})^{-1} \\Sigma^n_{\\widehat{S}_j,j}$.\n19:         $r_j \\leftarrow \\max\\{|\\widehat{\\Omega}_{j,l}/\\widehat{\\theta}_{j,l}| \\mid l \\in \\widehat{S}_j\\}$.\n20:     end for\n21: end for\n22: Append the remaining vertex in $V$ to $z$.\n23: for $i \\in 2, \\ldots, p$ do\n24:     $\\widehat{S}_{z_i} \\leftarrow \\{z_j \\mid j \\in [i - 1]\\} \\cap \\{j \\in [p] \\mid j \\neq z_i \\wedge |\\widehat{\\Omega}^0_{z_i,j}| > 0\\}$.\n25:     Compute $\\widehat{\\theta} = (\\Sigma^n_{\\widehat{S}_{z_i},\\widehat{S}_{z_i}})^{-1} \\Sigma^n_{\\widehat{S}_{z_i},z_i}$.\n26:     $\\widehat{\\pi}(z_i) \\leftarrow S(\\widehat{\\theta})$.\n27:     $\\widehat{B}_{z_i,\\widehat{\\pi}(z_i)} \\leftarrow \\widehat{\\theta}_{\\widehat{\\pi}(z_i)}$.\n28: end for\n29: $\\widehat{E} \\leftarrow \\{(i, j) \\mid \\widehat{B}_{i,j} \\neq 0\\}$, $\\widehat{W} \\leftarrow \\{\\widehat{B}_{i,j} \\mid (i, j) \\in \\widehat{E}\\}$, and $\\widehat{G} \\leftarrow ([p], \\widehat{E})$.\n\nIn order to obtain our main result for learning GBNs we first derive the following technical lemma which states that if the data comes from a GBN that satisfies Assumptions 1 \u2013 2, then removing a terminal vertex results in a GBN that still satisfies Assumptions 1 \u2013 2.\nLemma 4. 
Let $(G, \\mathcal{P}(W, \\sigma^2))$ be a GBN satisfying Assumptions 1 \u2013 2, and let $\\Sigma$, $\\Omega$ be the (non-singular) covariance and precision matrix respectively. Let $\\mathbf{X} \\in \\mathbb{R}^{n \\times p}$ be a data matrix of $n$ i.i.d. samples drawn from $\\mathcal{P}(W, \\sigma^2)$, and let $i$ be a terminal vertex in $G$. Denote by $G' = (V', E')$ and $W' = \\{w_{i,j} \\in W \\mid (i, j) \\in E'\\}$ the graph and set of edge weights, respectively, obtained by removing the node $i$ from $G$. Then, $\\mathbf{X}_{j,-i} \\sim \\mathcal{P}(W', \\sigma^2)$ $\\forall j \\in [n]$, and the GBN $(G', \\mathcal{P}(W', \\sigma^2))$ satisfies Assumptions 1 \u2013 2. Further, the inverse covariance matrix $\\Omega'$ and the covariance matrix $\\Sigma'$ for the GBN $(G', \\mathcal{P}(W', \\sigma^2))$ satisfy (respectively): $\\Omega' = \\Omega - (1/\\Omega_{i,i})\\,\\Omega_{*,i}\\,\\Omega_{i,*}$ and $\\Sigma' = \\Sigma_{-i,-i}$.\n\nTheorem 1. Let $\\hat{G} = ([p], \\hat{E})$ and $\\hat{W}$ be the DAG and edge weights, respectively, returned by Algorithm 1. Under the assumption that the data matrix $\\mathbf{X}$ was drawn from a GBN $(G^*, \\mathcal{P}(W^*, \\sigma^2))$ with $G^* = ([p], E^*)$, $\\Sigma^*$ and $\\Omega^*$ being the \u201ctrue\u201d covariance and inverse covariance matrix respectively, and satisfying Assumptions 1 \u2013 2; if the regularization parameter is set according to Lemma 2, and if the number of samples satisfies the condition:\n\n$$n \\geq c\\left(\\frac{\\sigma^4 \\|\\Omega^*\\|_1^4\\, C_{\\max}}{\\alpha^2} + \\frac{k^{3/2}(\\tilde{w}_{\\max} + 1/k)}{C_{\\min}\\,\\alpha}\\right) \\log\\left(\\frac{24p^2(p - 1)}{\\delta}\\right),$$\n\nwhere $c$ is an absolute constant, $\\tilde{w}_{\\max} \\overset{\\mathrm{def}}{=} \\max\\{|\\tilde{w}_{i,j}| \\mid i \\in V[m, \\tau] \\wedge j \\in S_i[m, \\tau] \\wedge m \\in [p] \\wedge \\tau \\in \\mathcal{T}_G\\}$ with $\\tilde{w}_{i,j}$ being the effective influence between $i$ and $j$ (4), $C_{\\max} = \\max_{i \\in [p]} (\\Sigma^*_{i,i})^2$, and $C_{\\min} = \\min_{i \\in [p]} \\lambda_{\\min}(\\Sigma^*_{S_i,S_i})$, then, $\\hat{E} \\supseteq E^*$ and $\\forall (i, j) \\in \\hat{E}$, $|\\hat{w}_{i,j} - w^*_{i,j}| \\leq \\alpha$ with probability at least $1 - \\delta$ for some $\\delta \\in (0, 1)$ and $\\alpha > 0$. Further, thresholding $\\hat{W}$ at the level $\\alpha$ we get $\\hat{E} = E^*$.\n\nThe CLIME estimator of the precision matrix can be computed in polynomial time and the OLS steps take $O(pk^3)$ time. 
Therefore our algorithm is polynomial time (please see Appendix C.2).\n\n4 Experiments\n\nIn this section, we validate our theoretical findings through synthetic experiments. We use a class of Erd\u0151s-R\u00e9nyi GBNs, with edge weights set to $\\pm 1/2$ with probability 1/2, and noise variance $\\sigma^2 = 0.8$. For each value of $p \\in \\{50, 100, 150, 200\\}$, we sampled 30 random GBNs and estimated the probability $\\Pr\\{G^* = \\hat{G}\\}$ by computing the fraction of times the learned DAG structure $\\hat{G}$ matched the true DAG structure $G^*$ exactly. The number of samples was set to $Ck^2 \\log p$, where $C$ was the control parameter, and $k$ was the maximum Markov blanket size (please see Appendix B.2 for more details). Figure 2 shows the results of the structure and parameter recovery experiments. We can see that the $\\log p$ scaling as prescribed by Theorem 1 holds in practice.\n\nOur method outperforms various state-of-the-art methods such as PC, GES and MMHC on this class of Erd\u0151s-R\u00e9nyi GBNs (Appendix B.3), works when the noise variables have unequal, but similar, variance (Appendix B.4), and also works for high-dimensional gene expression data (Appendix B.5).\n\nConcluding Remarks. There are several ways of extending our current work. While the algorithm developed in the paper is specific to the equal noise-variance case, we believe our theoretical analysis can be extended to the non-identifiable case to show that our algorithm, under some suitable conditions, can recover one of the Markov-equivalent DAGs. It would also be interesting to explore whether some of the ideas developed herein can be extended to binary or discrete Bayesian networks.\n\nFigure 2: (Left) Probability of correct structure recovery vs. number of samples, where the latter is set to $Ck^2 \\log p$ with $C$ being the control parameter and $k$ being the maximum Markov blanket size. (Right) The maximum absolute difference between the true parameters and the learned parameters vs. 
number of samples.\n\nReferences\n\n[1] David Maxwell Chickering. Learning Bayesian networks is NP-complete. In Learning from data, pages 121\u2013130. Springer, 1996.\n\n[2] Sanjoy Dasgupta. Learning polytrees. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 134\u2013141. Morgan Kaufmann Publishers Inc., 1999.\n\n[3] Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Sch\u00f6lkopf. Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research, 15(June):2009\u20132053, 2014.\n\n[4] Asish Ghoshal and Jean Honorio. Information-theoretic limits of Bayesian network structure learning. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 767\u2013775, Fort Lauderdale, FL, USA, 20\u201322 Apr 2017. PMLR.\n\n[5] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.\n\n[6] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 403\u2013410. Morgan Kaufmann Publishers Inc., 1995.\n\n[7] Christopher Meek. Strong completeness and faithfulness in Bayesian networks. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 411\u2013418. Morgan Kaufmann Publishers Inc., 1995.\n\n[8] Shohei Shimizu, Patrik O Hoyer, Aapo Hyv\u00e4rinen, and Antti Kerminen. A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n\n[9] J. Peters and P. B\u00fchlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219\u2013228, 2014.\n\n[10] Sara Van De Geer and Peter B\u00fchlmann. 
L0-Penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536\u2013567, 2013.\n\n[11] R W Robinson. Counting unlabeled acyclic digraphs. Combinatorial Mathematics V, 622:28\u201343, 1977.\n\n[12] Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse Gaussian Bayesian networks. Journal of Machine Learning Research, 16:2273\u20132328, 2015.\n\n[13] David Maxwell Chickering. Optimal Structure Identification with Greedy Search. J. Mach. Learn. Res., 3:507\u2013554, March 2003.\n\n[14] Tommi S. Jaakkola, David Sontag, Amir Globerson, Marina Meila, and others. Learning Bayesian Network Structure using LP Relaxations. In AISTATS, pages 358\u2013365, 2010.\n\n[15] Markus Kalisch and Peter B\u00fchlmann. Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research, 8:613\u2013636, 2007.\n\n[16] Gunwoong Park and Garvesh Raskutti. Learning large-scale Poisson DAG models based on overdispersion scoring. In Advances in Neural Information Processing Systems, pages 631\u2013639, 2015.\n\n[17] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning, 65(1):31\u201378, 2006.\n\n[18] Jiji Zhang and Peter Spirtes. Detection of unfaithfulness and robust causal inference. Minds and Machines, 18(2):239\u2013271, 2008.\n\n[19] Jiji Zhang and Peter Spirtes. Strong faithfulness and uniform consistency in causal inference. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 632\u2013639. Morgan Kaufmann Publishers Inc., 2002.\n\n[20] Caroline Uhler, Garvesh Raskutti, Peter B\u00fchlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference. Annals of Statistics, 41(2):436\u2013463, 2013.\n\n[21] Ming Yuan and Yi Lin. 
Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19\u201335, 2007.\n\n[22] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d\u2019Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485\u2013516, 2008.\n\n[23] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441, 2008.\n\n[24] Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing $\\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5(0):935\u2013980, 2011.\n\n[25] Cho-Jui Hsieh, M\u00e1ty\u00e1s A Sustik, Inderjit S Dhillon, Pradeep Ravikumar, and Russell Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. In Advances in Neural Information Processing Systems, volume 26, pages 3165\u20133173, 2013.\n\n[26] Cho-Jui Hsieh, Arindam Banerjee, Inderjit S Dhillon, and Pradeep K Ravikumar. A divide-and-conquer method for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 2330\u20132338, 2012.\n\n[27] Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, and Arian Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 1574\u20131582, 2012.\n\n[28] Christopher C Johnson, Ali Jalali, and Pradeep Ravikumar. High-dimensional sparse inverse covariance estimation using greedy methods. In AISTATS, volume 22, pages 574\u2013582, 2012.\n\n[29] Tony Cai, Weidong Liu, and Xi Luo. A Constrained L1 Minimization Approach to Sparse Precision Matrix Estimation. Journal of the American Statistical Association, 106(494):594\u2013607, 2011.\n\n[30] Daniel Hsu, Sham M Kakade, and Tong Zhang. 
An analysis of random design linear regression. In Proc. COLT. Citeseer, 2011.\n\n[31] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.\n\n[32] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 [cs, math], November 2010. arXiv: 1011.3027.\n\n[33] Rahul Mazumder and Trevor Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13(Mar):781\u2013794, 2012.\n\n[34] Y. Lu, Y. Yi, P. Liu, W. Wen, M. James, D. Wang, and M. You. Common human cancer genes discovered by integrated gene-expression analysis. Public Library of Science ONE, 2(11):e1149, 2007.\n\n[35] E. Shubbar, A. Kovacs, S. Hajizadeh, T. Parris, S. Nemes, K. Gunnarsdottir, Z. Einbeigi, P. Karlsson, and K. Helou. Elevated cyclin B2 expression in invasive breast carcinoma is associated with unfavorable clinical outcome. BioMedCentral Cancer, 13(1), 2013.", "award": [], "sourceid": 3226, "authors": [{"given_name": "Asish", "family_name": "Ghoshal", "institution": "Purdue University"}, {"given_name": "Jean", "family_name": "Honorio", "institution": "Purdue University"}]}