{"title": "Globally optimal score-based learning of directed acyclic graphs in high-dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 4450, "page_last": 4462, "abstract": "We prove that $\\Omega(s\\log p)$ samples suffice to learn a sparse Gaussian directed acyclic graph (DAG) from data, where $s$ is the maximum Markov blanket size. This improves upon recent results that require $\\Omega(s^{4}\\log p)$ samples in the equal variance case. To prove this, we analyze a popular score-based estimator that has been the subject of extensive empirical inquiry in recent years and is known to achieve state-of-the-art results. Furthermore, the approach we study does not require strong assumptions such as faithfulness that existing theory for score-based learning crucially relies on. The resulting estimator is based around a difficult nonconvex optimization problem, and its analysis may be of independent interest given recent interest in nonconvex optimization in machine learning. Our analysis overcomes the drawbacks of existing theoretical analyses, which either fail to guarantee structure consistency in high-dimensions (i.e. learning the correct graph with high probability), or rely on restrictive assumptions. In contrast, we give explicit finite-sample bounds that are valid in the important $p\\gg n$ regime.", "full_text": "Globally optimal score-based learning of directed\n\nacyclic graphs in high-dimensions\n\nBryon Aragam1 Arash A. Amini2 Qing Zhou2\n1University of Chicago\nbryon@chicagobooth.edu\n\n2University of California, Los Angeles\n\n{aaamini,zhou}@stat.ucla.edu\n\nAbstract\n\nWe prove that \u2326(s log p) samples suf\ufb01ce to learn a sparse Gaussian directed acyclic\ngraph (DAG) from data, where s is the maximum Markov blanket size. This\nimproves upon recent results that require \u2326(s4 log p) samples in the equal variance\ncase. 
To prove this, we analyze a popular score-based estimator that has been the subject of extensive empirical inquiry in recent years and is known to achieve state-of-the-art results. Furthermore, the approach we study does not require strong assumptions such as faithfulness that existing theory for score-based learning crucially relies on. The resulting estimator is based around a difficult nonconvex optimization problem, and its analysis may be of independent interest given recent interest in nonconvex optimization in machine learning. Our analysis overcomes the drawbacks of existing theoretical analyses, which either fail to guarantee structure consistency in high-dimensions (i.e. learning the correct graph with high probability), or rely on restrictive assumptions. In contrast, we give explicit finite-sample bounds that are valid in the important p ≫ n regime.

1 Introduction

With the growing importance of explainability and interpretability in modern machine learning [11, 64, 65], graphical models continue to play an important role in applications including genomics [72], health care [41], and finance [50] owing to their natural interpretability and simplicity. For this reason, rigorous theoretical understanding of graphical models is an important challenge in modern machine learning. Although estimating undirected graphical models can be formulated as a convex program, DAG models cannot be [15], which has limited our understanding of their finite-sample properties.
Despite impressive progress in our understanding of nonconvex models across a spectrum of problems including dictionary learning [58], tensor decomposition [16, 18], deep neural networks [13, 14], and regression [36, 37], learning DAGs remains an important problem with many open questions, particularly in the high-dimensional (p ≫ n) setting.

Among the many strategies for learning DAGs from data, score-based learning is a classical approach that is popular in practice. While much is known about greedy search algorithms [6, 40], much less is known regarding the statistical properties of methods that find a global minimizer of a score function. One of the advantages of the latter approach is a potential relaxation of assumptions such as faithfulness [61]. In this paper, we prove that a score-based method requires only O(s log p) samples, where s is the maximum Markov blanket size, at the cost of being difficult to compute since it requires solving a nonconvex, NP-hard optimization problem. This is a well-known drawback of score-based methods, although recent work has demonstrated that approximate methods can outperform state-of-the-art methods [1, 25, 70], and even come close to finding the global minimum in practice [77].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

More specifically, we characterize the finite-sample, high-dimensional behaviour of the following score-based DAG estimator, formulated as the solution of a constrained, nonsmooth, nonconvex optimization problem:

B̂ ∈ arg min_{B ∈ D} Q(B),   Q(B) = (1/2n) ‖X − XB‖²_F + ρ(B),   (1)

where D is the set of p × p matrices representing the weighted adjacency matrix of a DAG, X ∈ R^{n×p} is the data, and ρ is a suitably chosen regularizer (Section 2.3). In the literature on learning DAGs, Q is called a score function.
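To make the objective in (1) concrete, here is a minimal numpy sketch of the score Q(B) with the ℓ1 regularizer; the function name, the penalty weight lam, and the example values are our own illustrations, not the paper's.

```python
import numpy as np

def score(X, B, lam):
    """Penalized least-squares score from Eq. (1):
        Q(B) = (1/2n) * ||X - XB||_F^2 + rho(B),
    here with the convex l1 regularizer rho(B) = lam * sum_{i,j} |B_ij|.
    (Section 2.3 also allows the nonconvex MCP in place of l1.)
    """
    n = X.shape[0]
    residual = X - X @ B                     # (1/2n) ||X - XB||_F^2
    loss = (residual ** 2).sum() / (2 * n)
    penalty = lam * np.abs(B).sum()          # rho(B) = lam * ||B||_1
    return loss + penalty
```

Evaluating Q is cheap; the difficulty lies entirely in the outer minimization over the combinatorial acyclicity constraint B ∈ D, which is what makes computing B̂ NP-hard.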
This estimator has been the subject of extensive empirical inquiry [e.g. 1, 26, 51, 54, 68, 77], and outperforms classical approaches such as the PC algorithm [57] and greedy equivalence search [GES, 6] on high-dimensional data. Moreover, although computation of B̂ is NP-hard [7], it can be computed exactly using dynamic programming [43, 44, 55, 56] and mixed integer programs [9, 10], and approximate algorithms for computing this estimator scale to modern problem sizes with tens of thousands of variables [1, 3].

Contributions  In this paper we provide a comprehensive portrait of the behaviour of B̂, providing much needed justification (and caution) for its use in applications. Specifically, our main contributions are as follows:

1. We provide explicit, finite-sample structure recovery guarantees for the score-based estimator (1) that are valid when p ≫ n. This is in contrast to recent work on score-based methods that either studies asymptotic properties of specific algorithms under faithfulness [40], or does not prove exact structure recovery [35, 62].

2. We develop a new proof technique in order to simplify the analysis of score-based estimators, based on a novel lattice construction and a reduction to neighbourhood regression. This construction allows us to provide uniform control over the superexponential family of neighbourhood regression problems that define (1), a result that is potentially interesting in its own right.

3. We use this construction to prove an Ω(s log p) sample complexity under which B̂ recovers the true DAG with high probability, which improves upon existing results. We also generalize existing results on estimating identifiable DAGs with equal error variances to what we call minimum-trace DAGs.

4. We discuss the more general, nonidentifiable case.
In this setting, there is no "truth" to approximate; however, we show that B̂ still estimates a sufficiently sparse representative of the underlying distribution.

We anticipate these results will be of interest not only to the graphical modeling community, but also to the broader machine learning community in the way they analyze a difficult nonconvex optimization problem head-on.

Previous work  It was recently shown that it is possible to learn DAGs in high-dimensions [21-23, 67]. These papers prove a lower bound of Ω(k log p) on the sample complexity, where k is the maximum number of parents in the true DAG, and provide a polynomial-time algorithm that requires O(s^4 log p) samples to recover this DAG. These papers are based on a new approach, distinct from traditional score-based or constraint-based learning, that uses second-order information to find a node ordering. Once this ordering is found, estimation is straightforward. Earlier work on the linear non-Gaussian case uses independent component analysis to identify the true DAG model [52, 53] but requires n > p and as such is not high-dimensional.

Perhaps surprisingly, despite score-based methods being very popular in practice, none of these papers consider score-based methods. Asymptotically, consistency of the score-based GES algorithm is well-known [6, 40]; however, to the best of our knowledge, finite-sample complexity results are not available for GES. Furthermore, these results assume strong faithfulness, which, as the name suggests, is an even stronger version of faithfulness that is known to be very stringent and may not hold in practice [34, 61]. By assuming faithfulness, the Markov equivalence class, and hence the CPDAG, of a distribution becomes identified, which greatly simplifies the theoretical analysis.
Only a few recent papers have studied finite-sample properties of score-based estimators: van de Geer and Bühlmann [62] establish ℓ2-consistency of a restricted ℓ0-regularized MLE, Loh and Bühlmann [35] analyze the empirical score of DAGs that are consistent with an estimated moral graph, and Yuan et al. [71] analyze a constrained MLE. Unfortunately, the practical implications of these interesting theoretical results have been limited by certain aspects of their analysis. Although van de Geer and Bühlmann [62] and Yuan et al. [71] avoid the faithfulness assumption, their structure consistency results require p ≤ n and thus do not provide a direct theory for the high-dimensional structure learning problem. Loh and Bühlmann [35] do not consider the problem of structure recovery, and one of our contributions is to show that by properly regularizing the score in high-dimensions, structure recovery is possible when p ≫ n.

Perhaps surprisingly, proving consistency for the global minimizer of (1) turns out to be a unique challenge: Despite a growing literature on theory for nonconvex problems [5, 8, 13, 14, 16, 18, 19, 28-30, 33, 38, 58, 60], existing techniques from the graphical modeling literature fail to capture the essence of the program (1). Classical arguments such as the basic inequality can be used to prove ℓ2-rates of convergence as in [63], but translating these rates into structure learning (e.g. by thresholding) requires n = Ω(p). By assuming strong faithfulness, one can simplify the problem substantially by reducing it to a constraint-based method as in [40]. The latter work in particular sidesteps all of the difficulties in analyzing the nonconvex program (1), which constitute arguably some of the most interesting theoretical aspects of this problem.
More discussion on these points can be found in Section 6.

2 Background

Our approach is based on the structural equation model (SEM) interpretation of Gaussian DAGs. Suppose X = (X_1, ..., X_p) is a random vector satisfying

X = B̃ᵀX + ε̃,   ε̃ ∼ N_p(0, Ω̃),   (2)

where B̃ ∈ D and Ω̃ is a p × p positive diagonal matrix of variances. One can interpret B̃ as the weighted adjacency matrix of a graph. Given an n × p random matrix X whose rows are i.i.d. drawn according to the model (2), we define a penalized least-squares (PLS) score function by (1). It follows from (2) that X ∼ N_p(0, Σ(B̃, Ω̃)), where

Σ(B̃, Ω̃) := (I − B̃)^{-T} Ω̃ (I − B̃)^{-1}.   (3)

We will assume that Σ ≻ 0, and moreover that λ_min(Σ) ≍ λ_max(Σ) ≍ 1, i.e. the eigenvalues of Σ are bounded away from 0 and ∞. This is purely to simplify the theorem statements; explicit constants depending on Σ and its eigenvalues can be found in the supplement.

Notation  We write a ≳ b (resp. a ≲ b) to mean that a ≥ C·b (resp. a ≤ C·b) for some constant C > 0. In all cases, exact values for these constants can be found in the supplement.

2.1 Identifiability

The map (B̃, Ω̃) ↦ Σ(B̃, Ω̃) is not one-to-one, i.e. without further assumptions the model (2) is nonidentifiable. Recent work [12, 21, 62] assumes equivariance, i.e. Σ = Σ(B̃, ω̃²I) for some ω̃² > 0, which ensures that B̃ is identifiable [47]. We generalize this condition as follows: Let R^p_+ denote the space of p × p positive diagonal matrices and define the equivalence class of Σ by

D(Σ) = {(B̃, Ω̃) ∈ D × R^p_+ : Σ = Σ(B̃, Ω̃)},   (4)

and call B̃_min a minimum-trace DAG if (B̃_min, Ω̃_min) ∈ arg min{tr Ω̃ : (B̃, Ω̃) ∈ D(Σ)}. In other words, B̃_min minimizes the total conditional variance amongst all of the DAGs that represent Σ. We will sometimes abuse notation by writing B̃ ∈ D(Σ) or Ω̃ ∈ D(Σ) for short.
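As a sanity check on (2) and (3), the following sketch (our own, with purely illustrative edge weights and sample size) draws i.i.d. rows from the SEM and compares their sample covariance against Σ(B̃, Ω̃).

```python
import numpy as np

def sem_covariance(B, Omega):
    """Population covariance from Eq. (3): (I - B)^{-T} Omega (I - B)^{-1}."""
    M = np.linalg.inv(np.eye(B.shape[0]) - B)
    return M.T @ Omega @ M

def sample_sem(B, Omega, n, rng):
    """Draw n i.i.d. rows of X = B^T X + eps, eps ~ N_p(0, Omega) (Eq. 2).
    Solving for X row-wise gives X = eps (I - B)^{-1}."""
    p = B.shape[0]
    eps = rng.standard_normal((n, p)) * np.sqrt(np.diag(Omega))
    return eps @ np.linalg.inv(np.eye(p) - B)

# Illustrative 3-node chain X1 -> X2 -> X3 with unit error variances.
B = np.array([[0.0, 0.8,  0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0,  0.0]])
Omega = np.eye(3)
X = sample_sem(B, Omega, n=200_000, rng=np.random.default_rng(0))
empirical = X.T @ X / X.shape[0]
print(np.abs(empirical - sem_covariance(B, Omega)).max())  # shrinks as n grows
```

Note that B must be permutable to strict upper-triangular form (a DAG in topological order) for the model to be well-defined.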
The following lemma connects equivariant DAGs to minimum-trace DAGs:

Lemma 2.1. Suppose Σ is given and Σ = Σ(B̃, ω̃²I) for some ω̃² > 0. Then B̃ is the unique minimum-trace DAG in D(Σ).

In general, minimum-trace DAGs are not unique, so this lemma shows that the concept of minimum-trace provides a convenient generalization of known identifiability results for equivariant DAGs. Beyond their connection with equivariant DAGs, it is important to address why minimum-trace DAGs should be of interest in the sequel. As discussed previously, despite a lack of theoretical justification, the estimator B̂ is popular in practice [e.g. 1, 26, 51, 54, 68, 77]. Our motivation is to answer fundamental questions such as: does B̂ converge, and if so, to what? We note that even the former question is surprisingly tricky; see Section 4. The results presented in this paper will show that not only does B̂ converge, we can pinpoint what it converges to, namely a minimum-trace DAG. The importance of this result lies not in the fact that we might be interested in minimum-trace DAGs, but perhaps that we might not be: Whether or not one would be interested in a minimum-trace (or equivariant) DAG depends on the application.

2.2 Superstructures

In addition to (1), we will also study a restricted version of B̂ defined as follows: Given an undirected graph G = (V, E), define D_G = {B ∈ D : B ⊆ G}, i.e. the subset of D that are subgraphs of G, and

B̂(G) ∈ arg min_{B ∈ D_G} Q(B),   (5)

where Q(B) is defined as in (1). The graph G is called a superstructure, and reduces both the computational and statistical complexity of score-based methods [42, 46].
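The feasible set D_G of (5) combines two checks: acyclicity and containment in the superstructure. A small sketch of both (function names and the cycle-detection strategy are ours, assuming the convention that a nonzero B_ij encodes an edge i → j):

```python
import numpy as np

def is_dag(B, tol=1e-12):
    """Acyclicity check for a weighted adjacency matrix B by repeatedly
    removing nodes with no remaining incoming edges (Kahn's algorithm)."""
    A = np.abs(B) > tol
    active = list(range(B.shape[0]))
    while active:
        roots = [j for j in active if not any(A[i, j] for i in active)]
        if not roots:  # every remaining node has a parent: there is a cycle
            return False
        active = [j for j in active if j not in roots]
    return True

def in_DG(B, G, tol=1e-12):
    """Membership in D_G from Eq. (5): B must be a DAG whose skeleton is a
    subgraph of the undirected superstructure G (symmetric 0/1 matrix)."""
    subgraph = np.all((np.abs(B) > tol) <= (G > 0))
    return bool(subgraph) and is_dag(B, tol)
```

Restricting the search in (5) to D_G shrinks the space of candidate DAGs, which is what drives the computational and statistical savings noted above.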
We recall here also the moral graph m(B) of a DAG B, defined as the undirected graph that results from ignoring edge orientation in B and adding an undirected edge between the parents of each node in B. Clearly, m(B) is a superstructure of B.

2.3 Regularizer

Traditionally, score functions use ℓ0-regularization, i.e. ρ(B) = λ² Σ_{i,j} 1(β_ij ≠ 0) [6, 20, 62]. This penalty leads to good theoretical properties but is difficult to optimize due to its combinatorial nature. For this reason, we consider the ℓ1-regularizer, ρ(B) = λ Σ_{i,j} |β_ij|, which is a convex surrogate of the ℓ0-regularizer that is easier to optimize [77], as well as the minimax concave penalty (MCP) [73], which is a continuous, nonconvex interpolant between ℓ0 and ℓ1 regularization. Although easier to compute with, ℓ1 regularization is known to require strong incoherence conditions for consistent variable selection [39, 66, 76], whereas the MCP does not require these conditions. More details can be found in Appendix A.2 of the supplement.

The following condition formalizes the assumptions we place on ρ. Let N_j(G) denote the neighbourhood of X_j in G, i.e. the set of all vertices adjacent to X_j.

Condition 2.1 (Regularizer). The regularizer ρ is either ℓ1 or the MCP. If ℓ1 regularization is used, then additionally assume that ζ(G) < 1, where

ζ(G) := sup_{1 ≤ j ≤ p} sup_{S ⊆ N_j(G)} ‖Σ_{S^c S} (Σ_{SS})^{-1}‖_{∞,∞}.   (6)

Here, ‖A‖_{∞,∞} = max_i Σ_j |a_ij|. Crucially, if ρ is the MCP, then we are left with a continuous optimization problem without requiring any incoherence conditions.

3 The identifiable case: Recovery of minimum-trace DAGs

3.1 Assumptions

We begin with the identifiable case, i.e.
B̃_min is unique.

Given a minimum-trace DAG B̃_min, define for η > 0

Δ(η) := inf_{Ω̃ ∈ D(Σ), Ω̃ ≠ Ω̃_min} [(1 − η) tr Ω̃ − (1 + η) tr Ω̃_min − ρ(B̃_min)].   (7)

Given a superstructure G, let s = s(G) denote the maximum degree of G, and define

δ_1 = δ_1(G) := 4 √((s log[3ep/s] + log p)/n),   (8)

δ_2 = δ_2(G) := (1 + 3√2 √(s log(ep/s)/n))².   (9)

Condition 3.1 (Identifiability). Σ ≻ 0, and
(a) There exists a unique minimum-trace DAG B̃_min ∈ D(Σ);
(b) δ_1(G) ≤ 1 and Δ(η) > 0, where η := δ_1[1 + 6κ(Σ; s)²] and κ(Σ; s) is a constant that depends on Σ and s.

See (47) in the supplement for an exact expression of κ(Σ; s), which is roughly the maximum condition number of the principal submatrices of Σ of size O(s). Condition 3.1(a) is an identifiability condition on B̃_min, and Condition 3.1(b) is needed to recover B̃_min from finite samples. By Lemma 2.1, Condition 3.1(a) is strictly weaker than equivariance. Under this condition, we can speak of "the" minimum-trace DAG, which will be denoted in the sequel by (B̃_min, Ω̃_min). Condition 3.1(b) is closely related to gap conditions that have appeared previously [35, 62], and is discussed in detail in Section 3.2.

3.2 First main result: Identifiable DAGs

For any A ∈ R^{p×p}, let τ∗(A) := min{|a_ij| : a_ij ≠ 0}. The quantity τ∗(B̃_min) measures the smallest nonzero weight in B̃_min, which is a measure of the signal strength in the problem.

Theorem 3.1. Suppose that Conditions 2.1 and 3.1 hold and that B̃_min ⊆ G. If n ≳ s log p, λ ≳ √(log p/n), and τ∗(B̃_min) ≳ λ, then

supp(B̂(G)) = supp(B̃_min)

with probability 1 − O(e^{−k log p}), where k is the maximum in-degree of B̃_min.

In fact, even if Condition 3.1(a) fails, i.e. B̃_min is not identifiable, the conclusions of Theorem 3.1 continue to hold for some minimum-trace DAG. In the next section, we consider the nonidentifiable case in even greater detail (see Theorem 4.1).

The previous theorem assumes that a consistent superstructure G is known, i.e. that B̃_min ⊆ G. A standard approach is to define G by the support of a consistent estimate of the precision matrix Θ = Σ^{-1}. The following assumption encodes the minimal requirement we need on Σ and B̃_min:

Condition 3.2 (Superstructure). If (i, j) is an edge in m(B̃_min), then θ_ij ≠ 0.

The results in Loh and Bühlmann [35] show that as long as the entries of B̃_min are drawn from a continuous distribution, Condition 3.2 is satisfied except on a set of measure zero. For details, see Theorem 2 and Assumption 1 therein. Under Condition 3.2, it suffices to use a consistent estimate of the support of Θ, which can be estimated using known results [39]. Let Θ̂ denote such an estimate and, with some abuse of notation, denote the resulting DAG estimator by B̂(Θ̂).

Corollary 3.1. Suppose that Conditions 2.1, 3.1, and 3.2 hold. If n ≳ s log p, λ ≳ √(log p/n), τ∗(Θ) ≳ λ, and τ∗(B̃_min) ≳ λ, then

supp(B̂(Θ̂)) = supp(B̃_min)

with probability 1 − O(e^{−k log p}).

Corollary 3.1 implies that there is a score-based estimator with sample complexity Ω(s log p). In contrast to Ghoshal and Honorio [21], who require an element-wise consistent estimate of Θ (i.e. in ℓ∞-norm), our result only requires the support of Θ. The former approach leads to an Ω(s^4 log p) sample complexity, whereas our approach requires only Ω(s log p) samples. Both of these results are a significant improvement over existing results on score-based methods, e.g. Theorem 5.1 in [62], which requires p ≲ n/log n and hence n ≳ p.

Faithfulness and the beta-min condition  Theorem 3.1 does not require the faithfulness assumption, which is a standard assumption in the literature on learning DAGs for both score-based [6, 40] and constraint-based methods [31], and is known to be very strong in practice [34, 61]. Assuming faithfulness, the Markov equivalence class becomes identified, which simplifies the problem by restricting the number of equivalent DAGs that must be controlled. Recent work has also relaxed this assumption [21, 23, 45, 62]; however, to the best of our knowledge, our result is the first such result for score-based estimators in high-dimensions. Instead, we require a beta-min condition on the true DAG B̃_min, which is typical in the statistical literature on model selection.

Gap condition  Condition 3.1(b) imposes an implicit assumption on the degree of G through the requirement δ_1(G) ≤ 1, which roughly translates to s log(p/s) + log p ≲ n. The assumption on Δ(η), on the other hand, is a type of identifiability condition on B̃_min. Whereas Condition 3.1(a) requires B̃_min to be identifiable in the infinite sample limit, Condition 3.1(b) requires that there is a "gap" on the order √(s log p/n) between the expected loss of B̃_min and the expected loss of any other DAG in D(Σ).
To see this, note that E‖X − XB̃‖²_F / n = tr Ω̃ for any (B̃, Ω̃) ∈ D(Σ), and define

gap(Σ) := inf{tr Ω̃ − tr Ω̃_min : Ω̃ ≠ Ω̃_min, Ω̃ ∈ D(Σ)}.   (10)

When ρ is the MCP and λ ≳ η, a straightforward calculation shows that the following two conditions are sufficient to guarantee Condition 3.1(b), in addition to δ_1 ≤ 1: There exists an a ≥ 0 such that

gap(Σ) ≳ [(s log(p/s) + log p)/n]^a · p,   (11)

‖B̃_min‖_0 ≲ [(s log(p/s) + log p)/n]^{1−a} · p².   (12)

Thus, Condition 3.1(b) allows one to trade off the size of the "gap" in (11) with a sparsity condition (12) on B̃_min. For example, taking a ∈ (0, 2) and s log(p/s) + log p ≪ n allows gap(Σ) = o(p) while simultaneously tolerating an average degree ‖B̃_min‖_0/p that grows without bound (cf. (12)). Since the problem considered here is at least as hard as p separate regression problems, this scaling in terms of p is expected. Similar conditions with a similar scaling have appeared in previous work [35, 62].

4 The general case: Recovery of sparse representations

In the previous section, we leveraged strong prior information, namely identifiability and a consistent superstructure, in order to analyze the sample complexity of learning a minimum-trace DAG. In practice, such prior information may not be available, and in general it is well-known that Gaussian DAGs are not identifiable [2, 62]. The estimator (1), of course, is well-defined whether or not Condition 3.1 holds, and in practice, one typically computes B̂ and "hopes for the best". Is it possible to say more in the general setting? Surprisingly, even if there is no DAG B̃ ∈ D(Σ) that is identifiable, we can still provide guarantees. The idea is to first show that B̂ converges to some DAG B̃ ∈ D(Σ), and then show that B̃ is well-behaved compared to other representative DAGs in D(Σ).
Specifically, we will show that B̃ is roughly as sparse as a minimum-trace DAG.

4.1 Assumptions

Definition 4.1. Let β̃_j denote the jth column of B̃ ∈ D. For any Σ, let

d(D(Σ)) := sup_{B̃ ∈ D(Σ)} max_j ‖β̃_j‖_0,   τ∗(D(Σ)) := inf_{B̃ ∈ D(Σ)} τ∗(B̃).   (13)

We will write d = d(D(Σ)) to simplify the notation in the sequel.

Condition 4.1 (Minimum-trace DAG). Σ ≻ 0, and there is a minimum-trace DAG B̃_min such that

ρ(B̃_min) / tr Ω̃_min ≥ a_2 √((d + 1) log p / n)

for some a_2 > 0.

Condition 4.1 can be interpreted as putting a soft lower bound on the weights in B̃_min, as measured by the regularizer ρ and Ω̃_min. For comparison, recall that the usual beta-min condition in regression is min_j |β_j| ≳ √(log p/n).

4.2 Second main result: The nonidentifiable case

Our second result shows that even in the absence of identifiability assumptions, we can still guarantee that B̂ recovers the support of a DAG B̃ ∈ D(Σ), and that B̃ must also be sparse. In fact, we note that even without the sparsity conclusion, it is not obvious (and indeed nontrivial to show) that B̂ approaches any particular member of D(Σ).

Theorem 4.1. Suppose that Conditions 2.1 (with G the complete graph in the case of ℓ1) and 4.1
If n & d log p, &p(d + 1) log p/n, and \u2327\u21e4(D(\u2303)) & then there exists eB 2 D(\u2303) and a\nminimum-trace DAG eBmin 2 D(\u2303) such that\nsupp(bB) = supp(eB)\nwith probability at least 1 O(ed log p).\nThis is similar to the approach taken in van de Geer and B\u00fchlmann [62] with some key differences:\n1) Their Theorem 3.1 does not establish structure consistency, and 2) Their `0-regularized MLE\ninvolves a thresholded parameter space that is much more dif\ufb01cult to compute in practice, whereas\nour estimator (1) is de\ufb01ned over the full parameter space and involves continuous optimization.\nIn contrast to Theorem 3.1, Theorem 4.1 no longer requires the identi\ufb01ability condition (Condi-\n\nand \u21e2(eB) . \u21e2(bB) . \u21e2(eBmin)\n\ntion 3.1), which is replaced by Condition 4.1 on eBmin. The tradeoffs are 1) The estimator bB is no\n\nlonger guaranteed to recover an exact minimum-trace DAG, and 2) The beta-min condition and\nsample complexity now depend on the sparsity parameter d, which may be larger than s and can\nbe large for general covariance matrices. This result also emphasizes the advantages of noncon-\nvex regularization: When `1-regularization is used, the incoherence condition (6) is imposed over\nevery neighbourhood, which is a very severe restriction. With the MCP, there are no incoherence\nassumptions whatsoever.\n\ninterpreted as a \u201csoft\u201d notion of sparsity.\n\nStrong faithfulness and the beta-min condition In contrast to Theorem 3.1, which only requires\n\nSparsity A key conclusion in Theorem 4.1 is that \u21e2(eB) . \u21e2(bB) . \u21e2(eBmin): This says that bB\nis consistent with a parsimonious DAG. It is easy to show that this implies keBk0 . kbBk0 . keBmink0\nfor the MCP regularizer. For the `1 penalty, we have keBk1 . kbBk1 . 
keBmink1, which can be\na beta-min condition on the true DAG eBmin, Theorem 4.1 requires a much stronger condition on the\n\nsmallest weight of any DAG in the equivalence class D(\u2303) (cf. (13)). This is reminiscent of\u2014but\nnot the same as\u2014the strong faithfulness condition, which roughly asserts that the minimum partial\ncorrelation between any pair of d-separated variables in the true DAG is bounded away from zero. We\nleave it to future work to study this connection more carefully, however, we note here that previous\nwork on this problem [61] has noted the dif\ufb01culty of establishing such an explicit relationship, and to\nthe best of our knowledge this remains an open problem. Nonetheless, the novelty of Theorem 4.1 is\nin establishing \ufb01nite-sample structure recovery without imposing any identi\ufb01ability requirement, so\nit is natural to expect that stronger assumptions will be needed.\n\n5 Proof outline\n\nOur basic strategy is to reduce the analysis of bB to a family of neighbourhood regression problems,\nusing a similar approach as in our preprint [2]. This is similar to undirected models, for which the\nanalysis can be reduced to p different regression problems, namely the regression of Xj onto Xj\n[39, 69]. Unfortunately, for DAGs, there are p2p possible regression problems (the regression of Xj\nonto any subset S \u21e2 [p]j), which quickly become intractable to control uniformly. The manner in\nwhich these problems are controlled highlights the main technical difference between the proofs of\nTheorems 3.1 and 4.1.\nTo prove Theorem 3.1, we \ufb01rst prove a uniform concentration result for the score Q(B). 
Specifically, letting ℓ(B) = ‖X − XB‖²_F / (2n), we show that the following upper bound holds with high probability over D_G (Proposition B.7):

|ℓ(B) − Eℓ(B)| ≤ δ_1 [1 + 6κ(Σ; s)²] Eℓ(B)  for all B ∈ D_G.   (14)

Based on this result, we show that B̂ has the same topological sort as B̃_min. This topological sort identifies candidate parent sets for each node X_j, and reduces the problem to p regression problems. The main technical device here is uniform score concentration via (14), which is an interesting result in its own right due to its uniform control of an unbounded, subexponential empirical process. We note here that the requirement that δ_1(G) ≤ 1 in Condition 3.1(b) is precisely the condition needed to ensure uniform concentration is possible over the restricted space D_G.

The proof of Theorem 4.1 is more subtle and involved. Since we no longer assume we can restrict to a superstructure, uniform score concentration (i.e. over the full space D) is no longer readily viable. As a result, we must obtain uniform control over all p2^p neighbourhood regression problems. Let β_j(S) = Σ_{SS}^{-1} Σ_{Sj} denote the population regression coefficients of X_j onto X_S, where S ⊆ [p]\{j}. It is not hard to show that B̂ reduces to estimating β_j(S) for p random sets S that depend on X, with the penalized least-squares estimator

β̂_j(S) ∈ arg min_{θ ∈ R^p, supp(θ) ⊆ S} (1/2n) ‖x_j − Xθ‖²_2 + ρ(θ).

It turns out that these estimators have a great deal of redundancy, and in order to control all p2^p such estimators, it suffices to control at most O(p^d) of them. In order to prove this, we show that the following set system has a largest element M_j(S) (Lemma B.2):

T_j(S) = {T ⊆ [p]\{j} : β_j(T) = β_j(S)}.

Let M_j(S) be this largest element, i.e. T ∈ T_j(S) ⟹ T ⊆ M_j(S).
Then there are at most O(p^d) such sets, and we show that in order to control β_j(S) for all S, it suffices to control each β_j(M_j(S)) (Corollary B.4). The final piece of the proof is to establish control over ρ(B̂); this follows from a somewhat lengthy but straightforward Gaussian concentration argument.

6 Discussion

We have established that a score-based estimator achieves Ω(s log p) sample complexity for learning a sparse, minimum-trace DAG, and extended these results to the nonidentifiable setting. The proof technique is novel, leveraging the lattice structure of Gaussian conditional independence. Compared to recent theoretical work on DAG learning that sidesteps optimization altogether, our approach directly attacks a difficult nonconvex optimization problem. To conclude this paper, we discuss some limitations, extensions, and directions for future research.

Computation  Since (1) is a nonconvex program, computation of B̂ is challenging and in fact NP-hard [7]. Fortunately, there are fast algorithms via dynamic programming for finding globally optimal Bayesian networks [43, 55, 56]. For example, by combining dynamic programming with A* search, Xiang and Kim [68] propose an exact algorithm to compute the ℓ1-regularized version of B̂ that is tractable on problems with hundreds of nodes. More recently, a mixed-integer formulation has also been proposed [9, 10]. Recent work [77] has also shown that the program (1) can be solved approximately with second-order methods, and the resulting solutions are often very close to the true global minimum in practice. Given the NP-hardness of computing B̂, an important direction for future work is to determine whether or not there exists a polynomial-time estimator that can achieve s log p sample complexity or better.
As such, the current work provides important theoretical justification for this inquiry.

Comparison to existing methods  Despite the long history of score-based methods for learning DAGs, very little is known about the explicit, finite-sample behaviour of these methods. We have already acknowledged that the estimator (1) has appeared previously in the literature without a rigorous theoretical analysis [e.g. 26, 51, 54, 68, 77]. The well-known GES algorithm, on the other hand, has asymptotic consistency guarantees in both the low- [6] and high-dimensional [40] settings. We do not pursue a detailed experimental comparison of these two popular approaches here for the simple reason that this has already been done, see e.g. [1, 68, 70, 77]. These papers indicate that even approximate algorithms for B̂ outperform GES (along with other algorithms such as PC and MMHC) on a wide variety of settings and graphs.

Comparison to nonconvex models in ML  Much of the interest in the current work stems not only from providing explicit finite-sample guarantees for the DAG learning problem, but also from its analysis of a highly nonconvex optimization problem. For this reason, it is worth comparing our results with recent work on nonconvex models in the ML literature [5, 8, 13, 14, 16, 18, 19, 28–30, 33, 38, 58, 60]. In particular, we note the spate of recent papers on so-called "benign nonconvexity", which is the idea that although a problem may be nonconvex, its geometry is such that the nonconvexity is not a practical issue. Conditions ensuring this include the Polyak-Łojasiewicz condition [32], restricted strong convexity [36], and "strict" or "rideable" saddle points [17, 59]. Unfortunately, this approach of benign nonconvexity does not apply to optimizing (1), since this problem is easily shown to violate these properties; in particular, there exist local minima that are not global.
While this may seem discouraging, we note that recent work [77] has shown that second-order algorithms often find the global minimum in practice. We leave it to future work to study this behaviour in more detail.

Acknowledgments

We thank the anonymous reviewers for their feedback. The authors acknowledge the support of the NSF via IIS-1546098.

References

[1] B. Aragam and Q. Zhou. Concave penalized estimation of sparse Gaussian Bayesian networks. Journal of Machine Learning Research, 16:2273–2328, 2015.

[2] B. Aragam, A. A. Amini, and Q. Zhou. Learning directed acyclic graphs with penalized neighbourhood regression. arXiv:1511.08963, 2015. URL https://arxiv.org/abs/1511.08963.

[3] B. Aragam, J. Gu, and Q. Zhou. Learning large-scale Bayesian networks with the sparsebn package. To appear, Journal of Statistical Software, arXiv:1703.04025, 2017.

[4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

[5] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

[6] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2003.

[7] D. M. Chickering, D. Heckerman, and C. Meek. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.

[8] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[9] J. Cussens. Bayesian network learning with cutting planes. arXiv preprint arXiv:1202.3713, 2012.

[10] J. Cussens, D. Haws, and M. Studený. Polyhedral aspects of score equivalence in Bayesian network structure learning.
Mathematical Programming, 164(1–2):285–324, 2017.

[11] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[12] M. Drton, W. Chen, and Y. S. Wang. On causal discovery with equal variance assumption. arXiv preprint arXiv:1807.03419, 2018.

[13] S. S. Du, J. D. Lee, and Y. Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.

[14] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

[15] R. J. Evans. Model selection and local geometry. arXiv preprint arXiv:1801.08364, 2018.

[16] R. Ge and T. Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3653–3663, 2017.

[17] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points — online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[18] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points — online stochastic gradient for tensor decomposition. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 797–842, Paris, France, 03–06 Jul 2015. PMLR. URL http://proceedings.mlr.press/v40/Ge15.html.

[19] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[20] D. Geiger and D. Heckerman. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Annals of Statistics, 30:1412–1440, 2002.

[21] A. Ghoshal and J. Honorio.
Learning identifiable Gaussian Bayesian networks in polynomial time and sample complexity. arXiv:1703.01196, 2017. URL https://arxiv.org/abs/1703.01196.

[22] A. Ghoshal and J. Honorio. Information-theoretic limits of Bayesian network structure learning. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 767–775, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/ghoshal17a.html.

[23] A. Ghoshal and J. Honorio. Learning linear structural equation models in polynomial time and sample complexity. arXiv:1707.04673, 2017. URL https://arxiv.org/abs/1707.04673.

[24] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in Rⁿ. Geometric Aspects of Functional Analysis, pages 84–106, 1988.

[25] J. Gu, F. Fu, and Q. Zhou. Penalized estimation of directed acyclic graphs from discrete data. Statistics and Computing, DOI: 10.1007/s11222-018-9801-y, 2018.

[26] S. W. Han, G. Chen, M.-S. Cheon, and H. Zhong. Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. Journal of the American Statistical Association, 111(515):1004–1019, 2016.

[27] J. Huang, P. Breheny, and S. Ma. A selective review of group selection in high-dimensional models. Statistical Science: a review journal of the Institute of Mathematical Statistics, 27(4), 2012.

[28] C. Jin, Y. Zhang, S. Balakrishnan, M. Wainwright, and M. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4123–4131, 2016.

[29] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently.
arXiv preprint arXiv:1703.00887, 2017.

[30] C. Jin, L. T. Liu, R. Ge, and M. I. Jordan. Minimizing nonconvex population risk from rough empirical risk. arXiv e-prints, Mar. 2018.

[31] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613–636, 2007.

[32] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. ECML, 2016.

[33] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[34] S. Lin, C. Uhler, B. Sturmfels, and P. Bühlmann. Hypersurfaces and their singularities in partial correlation testing. Foundations of Computational Mathematics, 14(5):1079–1116, 2014.

[35] P.-L. Loh and P. Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15:3065–3105, 2014.

[36] P.-L. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. arXiv preprint arXiv:1412.5632, 2014.

[37] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616, 2015.

[38] S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.

[39] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[40] P. Nandy, A. Hauser, and M. H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning. arXiv preprint arXiv:1507.02608, 2018.

[41] A. Nicholson, F. Cozman, M. Velikova, J. T. van Scheltinga, P. J.
Lucas, and M. Spaanderman. Applications of Bayesian networks exploiting causal functional relationships in Bayesian network modelling for personalised healthcare. International Journal of Approximate Reasoning, 55(1):59–73, 2014. ISSN 0888-613X. doi: http://dx.doi.org/10.1016/j.ijar.2013.03.016. URL http://www.sciencedirect.com/science/article/pii/S0888613X13000777.

[42] S. Ordyniak and S. Szeider. Parameterized complexity results for exact Bayesian network structure learning. Journal of Artificial Intelligence Research, 46:263–302, 2013.

[43] S. Ott and S. Miyano. Finding optimal gene networks using biological constraints. Genome Informatics, 14:124–133, 2003.

[44] S. Ott, S. Imoto, and S. Miyano. Finding optimal models for small gene networks. In Pacific Symposium on Biocomputing, volume 9, pages 557–567. Citeseer, 2004.

[45] G. Park and G. Raskutti. Learning quadratic variance function (QVF) DAG models via overdispersion scoring (ODS). arXiv:1704.08783, 2017. URL https://arxiv.org/abs/1704.08783.

[46] E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research, 9(Oct):2251–2286, 2008.

[47] J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2013.

[48] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[49] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. Information Theory, IEEE Transactions on, 57(10):6976–6994, 2011.

[50] A. D. Sanford and I. A. Moosa. A Bayesian network structure for operational risk modelling in structured finance operations.
Journal of the Operational Research Society, 63(4):431–444, 2012.

[51] M. Schmidt, A. Niculescu-Mizil, and K. Murphy. Learning graphical model structure using L1-regularization paths. In AAAI, volume 7, pages 1278–1283, 2007.

[52] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

[53] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248, 2011.

[54] A. Shojaie and G. Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, 2010.

[55] T. Silander and P. Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.

[56] A. P. Singh and A. W. Moore. Finding optimal Bayesian networks by dynamic programming. 2005.

[57] P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.

[58] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery using nonconvex optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2351–2360, 2015.

[59] J. Sun, Q. Qu, and J. Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.

[60] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.

[61] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference.
Annals of Statistics, 41(2):436–463, 2013.

[62] S. van de Geer and P. Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.

[63] S. van de Geer et al. On the uniform convergence of empirical norms and inner products, with application to causal inference. Electronic Journal of Statistics, 8:543–574, 2014.

[64] K. R. Varshney, P. Khanduri, P. Sharma, S. Zhang, and P. K. Varshney. Why interpretability in machine learning? An answer using distributed detection and data fusion theory. arXiv preprint arXiv:1806.09710, 2018.

[65] S. Wachter, B. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. 2017.

[66] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.

[67] Y. S. Wang and M. Drton. High-dimensional causal discovery under non-Gaussianity. arXiv:1803.11273, 2018. URL https://arxiv.org/pdf/1803.11273.

[68] J. Xiang and S. Kim. A* Lasso for learning a sparse Bayesian network structure for continuous variables. In Advances in Neural Information Processing Systems, pages 2418–2426, 2013.

[69] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16:3813–3847, 2015.

[70] Q. Ye, A. A. Amini, and Q. Zhou. Optimizing regularized Cholesky score for order-based learning of Bayesian networks. arXiv preprint arXiv:1904.12360, 2019.

[71] Y. Yuan, X. Shen, W. Pan, and Z. Wang. Constrained likelihood for reconstructing a directed acyclic Gaussian graph. Biometrika, 106(1):109–125, 2018.

[72] B. Zhang, C. Gaiteri, L.-G. Bodea, Z. Wang, J. McElwee, A. A. Podtelezhnikov, C. Zhang, T. Xie, L. Tran, R. Dobrin, E.
Fluder, B. Clurman, S. Melquist, M. Narayanan, C. Suver, H. Shah, M. Mahajan, T. Gillis, J. Mysore, M. E. MacDonald, J. R. Lamb, D. A. Bennett, C. Molony, D. J. Stone, V. Gudnason, A. J. Myers, E. E. Schadt, H. Neumann, J. Zhu, and V. Emilsson. Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153(3):707–720, April 2013. ISSN 0092-8674. doi: 10.1016/j.cell.2013.03.030. URL http://europepmc.org/articles/PMC3677161.

[73] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.

[74] C.-H. Zhang and J. Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics, pages 1567–1594, 2008.

[75] C.-H. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.

[76] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

[77] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9472–9483. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8157-dags-with-no-tears-continuous-optimization-for-structure-learning.pdf.