{"title": "High-Dimensional Graphical Model Selection Using $\\ell_1$-Regularized Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1465, "page_last": 1472, "abstract": null, "full_text": "High-Dimensional Graphical Model Selection\n\nUsing `1-Regularized Logistic Regression\n\nMartin J. Wainwright\nDepartment of Statistics\n\nDepartment of EECS\n\nUniv. of California, Berkeley\n\nBerkeley, CA 94720\n\nJohn D. La\ufb00erty\n\nPradeep Ravikumar\nMachine Learning Dept. Computer Science Dept.\nCarnegie Mellon Univ. Machine Learning Dept.\nCarnegie Mellon Univ.\nPittsburgh, PA 15213\nPittsburgh, PA 15213\n\nAbstract\n\nWe focus on the problem of estimating the graph structure associated\nwith a discrete Markov random \ufb01eld. We describe a method based on `1-\nregularized logistic regression, in which the neighborhood of any given node\nis estimated by performing logistic regression subject to an `1-constraint.\nOur framework applies to the high-dimensional setting, in which both the\nnumber of nodes p and maximum neighborhood sizes d are allowed to grow\nas a function of the number of observations n. Our main result is to estab-\nlish su\ufb03cient conditions on the triple (n, p, d) for the method to succeed in\nconsistently estimating the neighborhood of every node in the graph simul-\ntaneously. Under certain mutual incoherence conditions analogous to those\nimposed in previous work on linear regression, we prove that consistent\nneighborhood selection can be obtained as long as the number of observa-\ntions n grows more quickly than 6d6 log d + 2d5 log p, thereby establishing\nthat logarithmic growth in the number of samples n relative to graph size\np is su\ufb03cient to achieve neighborhood consistency.\n\nKeywords: Graphical models; Markov random \ufb01elds; structure learning; `1-regularization;\nmodel selection; convex risk minimization; high-dimensional asymptotics; concentration.\n\n1 Introduction\n\n1 , . . . 
, x(i)\n\np }n\n\npoints {x(i) = (x(i)\n\nConsider a p-dimensional discrete random variable X = (X1, X2, . . . , Xp) where the dis-\ntribution of X is governed by an unknown undirected graphical model. In this paper, we\ninvestigate the problem of estimating the graph structure from an i.i.d. sample of n data\ni=1. This structure learning problem plays an important role\nin a broad range of applications where graphical models are used as a probabilistic repre-\nsentation tool, including image processing, document analysis and medical diagnosis. Our\napproach is to perform an `1-regularized logistic regression of each variable on the remaining\nvariables, and to use the sparsity pattern of the regression vector to infer the underlying\nneighborhood structure. The main contribution of the paper is a theoretical analysis show-\ning that, under suitable conditions, this procedure recovers the true graph structure with\nprobability one, in the high-dimensional setting in which both the sample size n and graph\nsize p = p(n) increase to in\ufb01nity.\n\nThe problem of structure learning for discrete graphical models\u2014due to both its importance\nand di\ufb03culty\u2014has attracted considerable attention. Constraint based approaches use hy-\npothesis testing to estimate the set of conditional independencies in the data, and then\ndetermine a graph that most closely represents those independencies [8]. An alternative ap-\nproach is to view the problem as estimation of a stochastic model, combining a scoring metric\non candidate graph structures with a goodness of \ufb01t measure to the data. The scoring met-\n\n\fric approach must be used together with a search procedure that generates candidate graph\nstructures to be scored. The combinatorial space of graph structures is super-exponential,\nhowever, and Chickering [1] shows that this problem is in general NP-hard. 
The space of candidate structures in scoring-based approaches is typically restricted to directed models (Bayesian networks), since the computation of typical score metrics involves computing the normalization constant of the graphical model distribution, which is intractable for general undirected models. Estimation of graph structures in undirected models has thus largely been restricted to simple graph classes such as trees [2], polytrees [3] and hypertrees [9].

The technique of ℓ1-regularization for estimation of sparse models or signals has a long history in many fields; we refer to Tropp [10] for a recent survey. A surge of recent work has shown that ℓ1-regularization can lead to practical algorithms with strong theoretical guarantees (e.g., [4, 5, 6, 10, 11, 12]). In this paper, we adapt the technique of ℓ1-regularized logistic regression to the problem of inferring graph structure. The technique is computationally efficient and thus well-suited to high-dimensional problems, since it involves the solution only of standard convex programs. Our main result establishes conditions on the sample size n, graph size p and maximum neighborhood size d under which the true neighborhood structure can be inferred with probability one as (n, p, d) increase. Our analysis, though asymptotic in nature, leads to growth conditions that are sufficiently weak so as to require only that the number of observations n grow logarithmically in the graph size. Consequently, our results establish that graphical structure can be learned from relatively sparse data. Our analysis and results are similar in spirit to the recent work of Meinshausen and Bühlmann [5] on covariance selection in Gaussian graphical models, but focus instead on the case of discrete models.

The remainder of this paper is organized as follows.
In Section 2, we formulate the problem and establish notation, before moving on to a precise statement of our main result and a high-level proof outline in Section 3. Sections 4 and 5 detail the proof, with some technical details deferred to the full-length version. Finally, we provide experimental results and a concluding discussion in Section 6.

2 Problem Formulation and Notation

Let G = (V, E) denote a graph with vertex set V of size |V| = p and edge set E. We denote by N(s) the set of neighbors of a vertex s ∈ V; that is, N(s) = {t ∈ V : (s, t) ∈ E}. A pairwise graphical model with graph G is a family of probability distributions for a random variable X = (X1, X2, . . . , Xp) given by p(x) ∝ Π_{(s,t)∈E} ψ_st(x_s, x_t). In this paper, we restrict our attention to the case where each x_s ∈ {0, 1} is binary, and the family of probability distributions is given by the Ising model

p(x; θ) = exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t − Ψ(θ) ).   (1)

Given such an exponential family in a minimal representation, the log partition function Ψ(θ) is strictly convex, which ensures that the parameter matrix θ is identifiable.

We address the following problem of graph learning. Given n samples x^(i) ∈ {0, 1}^p drawn from an unknown distribution p(x; θ*) of the form (1), let Ê_n be an estimated set of edges. Our set-up includes the important situation in which the number of variables p may be large relative to the sample size n. In particular, we allow the graph G_n = (V_n, E_n) to vary with n, so that the number of variables p = |V_n| and the sizes of the neighborhoods d_s := |N(s)| may vary with sample size. (For notational clarity we will sometimes omit subscripts indicating a dependence on n.) The goal is to construct an estimator Ê_n for which P[Ê_n = E_n] → 1 as n → ∞.
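For small p, the Ising model (1) can be handled exactly by enumerating all 2^p configurations, which is also one way to draw the exact samples used later in the experiments. The following sketch (function names are ours, not the paper's) computes the log partition function Ψ(θ) and the induced distribution:

```python
import itertools
import numpy as np

def ising_log_potential(x, theta_node, theta_edge):
    """Unnormalized log-probability from equation (1):
    sum_s theta_s x_s + sum_{(s,t)} theta_st x_s x_t.
    theta_edge is symmetric with zero diagonal; the 0.5 counts each edge once."""
    return float(theta_node @ x + 0.5 * x @ theta_edge @ x)

def ising_distribution(theta_node, theta_edge):
    """Exact probabilities over {0,1}^p by enumeration (small p only)."""
    p = len(theta_node)
    configs = np.array(list(itertools.product([0, 1], repeat=p)), dtype=float)
    log_pot = np.array([ising_log_potential(x, theta_node, theta_edge) for x in configs])
    log_psi = np.logaddexp.reduce(log_pot)   # log partition function Psi(theta)
    probs = np.exp(log_pot - log_psi)
    return configs, probs

def sample_ising(theta_node, theta_edge, n, rng):
    """Draw n i.i.d. exact samples from the enumerated distribution."""
    configs, probs = ising_distribution(theta_node, theta_edge)
    idx = rng.choice(len(configs), size=n, p=probs)
    return configs[idx]
```

Enumeration costs O(2^p), so this is only a reference implementation for toy graphs, not a substitute for the exact sampling machinery used at larger scale.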
Equivalently, we consider the problem of estimating neighborhoods N̂_n(s) ⊂ V_n so that P[N̂_n(s) = N(s), ∀ s ∈ V_n] → 1. For many problems of interest, the graphical model provides a compact representation where the sizes of the neighborhoods are typically small, say d_s ≪ p for all s ∈ V_n. Our goal is to use ℓ1-regularized logistic regression to estimate these neighborhoods; for this paper, the actual values of the parameters θ_ij are a secondary concern.

Given input data {(z^(i), y^(i))}, where z^(i) is a p-dimensional covariate and y^(i) ∈ {0, 1} is a binary response, logistic regression involves minimizing the negative log likelihood

f_s(θ; x) = (1/n) Σ_{i=1}^n { log(1 + exp(θ^T z^(i))) − y^(i) θ^T z^(i) }.

We focus on a regularized version of this regression problem, involving an ℓ1 constraint on (a subset of) the parameter vector θ. For convenience, we assume that z^(i)_1 = 1 is a constant, so that θ_1 is a bias term, which is not regularized; we denote by θ∖s the vector of all coefficients of θ except the one in position s. For the graph learning task, we regress each variable X_s onto the remaining variables, sharing the same data x^(i) across problems. This leads to the following collection of optimization problems (p in total, one for each graph node):

θ̂^(s,λ) = argmin_{θ∈R^p} { (1/n) Σ_{i=1}^n [ log(1 + exp(θ^T z^(i,s))) − x^(i)_s θ^T z^(i,s) ] + λ_n ‖θ∖s‖_1 },   (2)

where s ∈ V, and z^(i,s) ∈ {0, 1}^p denotes the vector with z^(i,s)_t = x^(i)_t for t ≠ s and z^(i,s)_s = 1. The parameter θ_s acts as a bias term, and is not regularized. Thus, the quantity θ̂^(s,λ)_t can be thought of as a penalized conditional likelihood estimate of θ_{s,t}.
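The per-node convex program above can be solved by any ℓ1-capable solver. As a minimal illustrative sketch (proximal gradient descent with soft-thresholding, not the authors' customized primal-dual algorithm; all names are ours), leaving the bias coordinate unpenalized:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_logistic(Z, y, lam, step=0.1, iters=5000):
    """Proximal gradient (ISTA) for
        min_theta (1/n) sum_i [log(1+exp(theta^T z_i)) - y_i theta^T z_i] + lam * ||theta[1:]||_1,
    where column 0 of Z is the all-ones bias column (left unregularized)."""
    n, p = Z.shape
    theta = np.zeros(p)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-Z @ theta))   # logistic probabilities
        grad = Z.T @ (mu - y) / n               # gradient of the negative log likelihood
        theta = theta - step * grad
        theta[1:] = soft_threshold(theta[1:], step * lam)  # penalize all but the bias
    return theta
```

For node s one would take y to be column s of the data matrix and Z the data with column s replaced by ones; the soft-thresholding step produces exact zeros, so the sparsity pattern of the returned vector directly yields the neighborhood estimate.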
Our estimate of the neighborhood N(s) is then given by

N̂_n(s) = { t ∈ V, t ≠ s : θ̂^(s,λ)_t ≠ 0 }.   (3)

Our goal is to provide conditions on the graphical model, in particular relations among the number of nodes p, number of observations n and maximum node degree d, that ensure that the collection of neighborhood estimates (3), one for each node s of the graph, is consistent with high probability.

We conclude this section with some additional notation that is used throughout the sequel. Defining the probability p(z^(i,s); θ) := [1 + exp(−θ^T z^(i,s))]^{−1}, straightforward calculations yield the gradient and Hessian, respectively, of the negative log likelihood (2):

∇_θ f_s(θ; x) = (1/n) Σ_{i=1}^n [ p(z^(i,s); θ) − x^(i)_s ] z^(i,s),   (4a)

∇²_θ f_s(θ; x) = (1/n) Σ_{i=1}^n p(z^(i,s); θ) [1 − p(z^(i,s); θ)] z^(i,s) (z^(i,s))^T.   (4b)

Finally, for ease of notation, we make frequent use of the shorthand Q_s(θ) = ∇²f_s(θ; x).

3 Main Result and Outline of Analysis

In this section, we begin with a precise statement of our main result, and then provide a high-level overview of the key steps involved in its proof.

3.1 Statement of main result

We begin by stating the assumptions that underlie our main result. A subset of the assumptions involve the Fisher information matrix associated with the logistic regression model, defined for each node s ∈ V as

Q*_s = E[ p_s(Z; θ*) {1 − p_s(Z; θ*)} Z Z^T ].   (5)

Note that Q*_s is the population average of the Hessian Q_s(θ*). For ease of notation we use S to denote the neighborhood N(s), and S^c to denote the complement V − N(s).
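The gradient and Hessian expressions (4a)-(4b) are simple averages and can be checked numerically; the sketch below (names ours) implements them and is easily validated against finite differences of the negative log likelihood:

```python
import numpy as np

def neg_log_lik(theta, Z, y):
    """Negative log likelihood f_s(theta; x), averaged over samples."""
    u = Z @ theta
    return float(np.mean(np.logaddexp(0.0, u) - y * u))

def gradient(theta, Z, y):
    """Equation (4a): (1/n) sum_i (p(z_i; theta) - y_i) z_i."""
    mu = 1.0 / (1.0 + np.exp(-Z @ theta))
    return Z.T @ (mu - y) / len(y)

def hessian(theta, Z):
    """Equation (4b): (1/n) sum_i p(1-p) z_i z_i^T; a weighted Gram matrix,
    hence always positive semidefinite."""
    mu = 1.0 / (1.0 + np.exp(-Z @ theta))
    w = mu * (1.0 - mu)
    return (Z * w[:, None]).T @ Z / len(Z)
```

Averaging the Hessian over the data-generating distribution, rather than over a sample, gives the population Fisher information Q*_s of equation (5).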
Our first two assumptions (A1 and A2) place restrictions on the dependency and coherence structure of this Fisher information matrix. We note that these first two assumptions are analogous to conditions imposed in previous work [5, 10, 11, 12] on linear regression. Our third assumption is a growth rate condition on the triple (n, p, d).

[A1] Dependency condition: We require that the subset of the Fisher information matrix corresponding to the relevant covariates has bounded eigenvalues: namely, there exist constants C_min > 0 and C_max < +∞ such that

C_min ≤ Λ_min(Q*_SS)  and  Λ_max(Q*_SS) ≤ C_max.   (6)

These conditions ensure that the relevant covariates do not become overly dependent, and can be guaranteed (for instance) by assuming that θ̂^(s,λ) lies within a compact set.

[A2] Incoherence condition: Our next assumption captures the intuition that the large number of irrelevant covariates (i.e., non-neighbors of node s) cannot exert an overly strong effect on the subset of relevant covariates (i.e., neighbors of node s). To formalize this intuition, we require the existence of an ε ∈ (0, 1] such that

‖Q*_{S^c S} (Q*_SS)^{−1}‖_∞ ≤ 1 − ε.   (7)

Analogous conditions are required for the success of the Lasso in the case of linear regression [5, 10, 11, 12].

[A3] Growth rates: Our second set of assumptions involves the growth rates of the number of observations n, the graph size p, and the maximum node degree d. In particular, we require that

n/d^5 − 6d log(d) − 2 log(p) → +∞.   (8)

Note that this condition allows the graph size p to grow exponentially with the number of observations (i.e., p(n) = exp(n^α) for some α ∈ (0, 1)).
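For a toy model where the distribution can be enumerated, the quantities appearing in A1 and A2 can be evaluated directly. A sketch (names ours; the ∞-norm in (7) is the maximum absolute row sum):

```python
import numpy as np

def fisher_info(configs, probs, theta, s):
    """Population Fisher information Q*_s of equation (5): E[p(1-p) Z Z^T],
    where Z is a configuration with coordinate s overwritten by the bias value 1."""
    p = configs.shape[1]
    Q = np.zeros((p, p))
    for x, pr in zip(configs, probs):
        z = x.copy()
        z[s] = 1.0
        mu = 1.0 / (1.0 + np.exp(-theta @ z))
        Q += pr * mu * (1.0 - mu) * np.outer(z, z)
    return Q

def check_assumptions(Q, S, Sc):
    """Return (min eigenvalue, max eigenvalue) of Q_SS for A1, and the
    incoherence quantity ||Q_{S^c S} (Q_SS)^{-1}||_inf for A2."""
    Qss = Q[np.ix_(S, S)]
    eigs = np.linalg.eigvalsh(Qss)
    incoh = np.abs(Q[np.ix_(Sc, S)] @ np.linalg.inv(Qss)).sum(axis=1).max()
    return eigs.min(), eigs.max(), incoh
```

In practice one would plug in the true conditional distribution for node s; the point of the sketch is only that (6) and (7) are concrete, checkable matrix quantities.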
Moreover, it is worthwhile noting that for model selection in graphical models, one is typically interested in node degrees d that remain bounded (e.g., d = O(1)) or grow only weakly with graph size (say d = o(log p)).

With these assumptions, we now state our main result:

Theorem 1. Given a graphical model and triple (n, p, d) such that conditions A1 through A3 are satisfied, suppose that the regularization parameter λ_n is chosen such that (a) nλ_n² − 2 log(p) → +∞, and (b) dλ_n → 0. Then P[N̂_n(s) = N(s), ∀ s ∈ V_n] → 1 as n → +∞.

3.2 Outline of analysis

We now provide a high-level roadmap of the main steps involved in our proof of Theorem 1. Our approach is based on the notion of a primal witness: in particular, focusing our attention on a fixed node s ∈ V, we define a constructive procedure for generating a primal vector θ̂ ∈ R^p, as well as a corresponding subgradient ẑ ∈ R^p, that together satisfy the zero-subgradient optimality conditions associated with the convex program (2). We then show that this construction succeeds with probability converging to one under the stated conditions. A key fact is that the convergence rate is sufficiently fast that a simple union bound over all graph nodes shows that we achieve consistent neighborhood estimation for all nodes simultaneously.

To provide some insight into the nature of our construction, the analysis in Section 4 shows that the neighborhood N(s) is correctly recovered if and only if the pair (θ̂, ẑ) satisfies the following four conditions: (a) θ̂_{S^c} = 0; (b) |θ̂_t| > 0 for all t ∈ S; (c) ẑ_S = sgn(θ*_S); and (d) ‖ẑ_{S^c}‖_∞ < 1. The first step in our construction is to choose the pair (θ̂, ẑ) such that both conditions (a) and (c) hold.
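The four witness conditions are mechanical to verify for any candidate pair; a small sketch (names ours) makes the logic explicit:

```python
import numpy as np

def check_witness(theta_hat, z_hat, theta_star, S):
    """Check conditions (a)-(d) for a candidate primal-dual pair:
    (a) theta_hat vanishes off S; (b) theta_hat is nonzero on S;
    (c) z_hat matches sgn(theta_star) on S;
    (d) strict dual feasibility: ||z_hat restricted to S^c||_inf < 1."""
    p = len(theta_hat)
    Sc = [t for t in range(p) if t not in S]
    a = np.all(theta_hat[Sc] == 0)
    b = np.all(np.abs(theta_hat[S]) > 0)
    c = np.array_equal(z_hat[S], np.sign(theta_star[S]))
    d = (np.max(np.abs(z_hat[Sc])) < 1) if Sc else True
    return bool(a and b and c and d)
```

Condition (d) is the strict dual feasibility that rules out false inclusions; the proof's burden is precisely to show that (b) and (d) hold with high probability once (a) and (c) are enforced by construction.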
The remainder of the analysis is then devoted to establishing that properties (b) and (d) hold with high probability.

In the first part of our analysis, we assume that the dependence (A1) and mutual incoherence (A2) conditions hold for the sample Fisher information matrices Q_s(θ*) defined below equation (4b). Under this assumption, we then show that the conditions on λ_n in the theorem statement suffice to guarantee that properties (b) and (d) hold for the constructed pair (θ̂, ẑ). The remainder of the analysis, provided in the full-length version of this paper, is devoted to showing that under the specified growth conditions (A3), imposing incoherence and dependence assumptions on the population version of the Fisher information Q*(θ*) guarantees (with high probability) that analogous conditions hold for the sample quantities Q_s(θ*). While it follows immediately from the law of large numbers that the empirical Fisher information Q^n_AA(θ*) converges to the population version Q*_AA for any fixed subset A, the delicacy is that we require controlling this convergence over subsets of increasing size. Our analysis therefore requires the use of uniform laws of large numbers [7].

4 Primal-Dual Relations for ℓ1-Regularized Logistic Regression

Basic convexity theory can be used to characterize the solutions of ℓ1-regularized logistic regression. We assume in this section that θ_1 corresponds to the unregularized bias term, and omit the dependence on sample size n in the notation.
The objective is to compute

min_{θ∈R^p} L(θ, λ) = min_{θ∈R^p} { f(θ; x) + λ(‖θ∖1‖_1 − b) } = min_{θ∈R^p} { f(θ; x) + λ‖θ∖1‖_1 }.   (9)

The function L(θ, λ) is the Lagrangian function for the problem of minimizing f(θ; x) subject to ‖θ∖1‖_1 ≤ b for some b. The dual function is h(λ) = inf_θ L(θ, λ).

If p ≤ n then f(θ; x) is a strictly convex function of θ. Since the ℓ1-norm is convex, it follows that L(θ, λ) is convex in θ, and strictly convex in θ for p ≤ n. Therefore the set of solutions to (9) is convex. If θ̂ and θ̂' are two solutions, then by convexity θ̂ + ρ(θ̂' − θ̂) is also a solution for any ρ ∈ [0, 1]. Since the solutions minimize f(θ; x) subject to ‖θ∖1‖_1 ≤ b, the value of f(θ̂ + ρ(θ̂' − θ̂)) is independent of ρ, and ∇_θ f(θ̂; x) is independent of the particular solution θ̂. These facts are summarized below.

Lemma 1. If p ≤ n then a unique solution to (9) exists. If p ≥ n then the set of solutions is convex, with the value of ∇_θ f(θ̂; x) constant across all solutions. In particular, if p ≥ n and |∇_{θ_t} f(θ̂; x)| < λ for some solution θ̂, then θ̂_t = 0 for all solutions.

The subgradient ∂‖θ∖1‖_1 ⊂ R^p is the collection of all vectors z satisfying |z_t| ≤ 1, with z_t = 0 for t = 1, and z_t = sign(θ_t) if θ_t ≠ 0. Any optimum of (9) must satisfy

∂_θ L(θ̂, λ) = ∇_θ f(θ̂; x) + λz = 0   (10)

for some z ∈ ∂‖θ∖1‖_1. The analysis in the following sections shows that, with high probability, a primal-dual pair (θ̂, ẑ) can be constructed so that |ẑ_t| < 1, and therefore θ̂_t = 0, whenever θ*_t = 0 in the true model θ* from which the data are generated.

5 Constructing a Primal-Dual Pair

We now fix a variable X_s for the logistic regression, denoting the set of variables in its neighborhood by S. From the results of the previous section we observe that the ℓ1-regularized regression recovers the sparsity pattern if and only if there exists a primal-dual solution pair (θ̂, ẑ) satisfying the zero-subgradient condition, and the conditions (a) θ̂_{S^c} = 0; (b) |θ̂_t| > 0 for all t ∈ S; (c) ẑ_S = sgn(θ*_S); and (d) ‖ẑ_{S^c}‖_∞ < 1.

Our proof proceeds by showing the existence (with high probability) of a primal-dual pair (θ̂, ẑ) that satisfies these conditions. We begin by setting θ̂_{S^c} = 0, so that (a) holds, and also setting ẑ_S = sgn(θ̂_S), so that (c) holds. We first establish a consistency result when incoherence conditions are imposed on the sample Fisher information Q^n. The remaining analysis, deferred to the full-length version, establishes that the incoherence assumption (A2) on the population version ensures that the sample version also obeys the property with probability converging to one exponentially fast.

Theorem 2. Suppose that

‖Q^n_{S^c S} (Q^n_SS)^{−1}‖_∞ ≤ 1 − ε   (11)

for some ε ∈ (0, 1]. Assume that λ_n → 0 is chosen such that nλ_n² − log(p) → +∞ and λ_n d → 0. Then P(N̂(s) = N(s)) = 1 − O(exp(−cn^γ)) for some γ > 0.

Proof.
Let us introduce the notation

W^n := (1/n) Σ_{i=1}^n z^(i,s) ( x^(i)_s − exp(θ*^T z^(i,s)) / [1 + exp(θ*^T z^(i,s))] ).   (12)

Substituting into the subgradient optimality condition (10) yields the equivalent condition

∇f(θ̂; x) − ∇f(θ*; x) − W^n + λ_n ẑ = 0.   (13)

By a Taylor series expansion, this condition can be re-written as

∇²f(θ*; x) [θ̂ − θ*] = W^n − λ_n ẑ + R^n,

where the remainder R^n is a term of order ‖R^n‖_2 = O(‖θ̂ − θ*‖²). Using our shorthand Q^n = ∇²_θ f(θ*; x), we write the zero-subgradient condition (13) in block form as:

Q^n_{S^c S} [θ̂^(s,λ)_S − θ*_S] = W^n_{S^c} − λ_n ẑ_{S^c} + R^n_{S^c},   (14a)
Q^n_{SS} [θ̂^(s,λ)_S − θ*_S] = W^n_S − λ_n ẑ_S + R^n_S.   (14b)

It can be shown that the matrix Q^n_SS is invertible w.p. one, so that these conditions can be rewritten as

Q^n_{S^c S} (Q^n_SS)^{−1} [W^n_S − λ_n ẑ_S + R^n_S] = W^n_{S^c} − λ_n ẑ_{S^c} + R^n_{S^c}.   (15)

Re-arranging yields the condition

Q^n_{S^c S} (Q^n_SS)^{−1} [W^n_S − R^n_S] − [W^n_{S^c} − R^n_{S^c}] + λ_n Q^n_{S^c S} (Q^n_SS)^{−1} ẑ_S = λ_n ẑ_{S^c}.   (16)

Analysis of condition (d): We now demonstrate that ‖ẑ_{S^c}‖_∞ < 1. Using the triangle inequality and the sample incoherence bound (11), we have that

‖ẑ_{S^c}‖_∞ ≤ [(2 − ε)/λ_n] [‖W^n‖_∞ + ‖R^n‖_∞] + (1 − ε).   (17)

We complete the proof that ‖ẑ_{S^c}‖_∞ < 1 with the following two lemmas, proved in the full-length version.

Lemma 2. If nλ_n² − log(p) → +∞, then

P( [(2 − ε)/λ_n] ‖W^n‖_∞ ≥ ε/4 ) → 0   (18)

at rate O(exp(−nλ_n² + log(p))).

Lemma 3. If nλ_n² − log(p) → +∞ and dλ_n → 0, then we have

P( [(2 − ε)/λ_n] ‖R^n‖_∞ ≥ ε/4 ) → 0   (19)

at rate O(exp(−nλ_n² + log(p))).

We apply these two lemmas to the bound (17) to obtain that, with probability converging to one at rate O(exp(−nλ_n² + log(p))), we have

‖ẑ_{S^c}‖_∞ ≤ ε/4 + ε/4 + (1 − ε) = 1 − ε/2.

Analysis of condition (b): We next show that condition (b) can be satisfied, so that sgn(θ̂_S) = sgn(θ*_S). Define ρ_n := min_{i∈S} |θ*_i|. From equation (14b), we have

θ̂^(s,λ)_S = θ*_S − (Q^n_SS)^{−1} [W_S − λ_n ẑ_S + R_S].

Therefore, in order to establish that |θ̂^(s,λ)_i| > 0 for all i ∈ S, and moreover that sign(θ̂^(s,λ)_S) = sign(θ*_S), it suffices to show that

‖(Q^n_SS)^{−1} [W_S − λ_n ẑ_S + R_S]‖_∞ ≤ ρ_n/2.   (20)

Using our eigenvalue bounds, we have

‖(Q^n_SS)^{−1} [W_S − λ_n ẑ_S + R_S]‖_∞ ≤ ‖(Q^n_SS)^{−1}‖_∞ [‖W_S‖_∞ + λ_n + ‖R_S‖_∞]
  ≤ √d ‖(Q^n_SS)^{−1}‖_2 [‖W_S‖_∞ + λ_n + ‖R_S‖_∞]
  ≤ (√d / C_min) [‖W_S‖_∞ + λ_n + ‖R_S‖_∞].

In fact, the right-hand side tends to zero from our earlier results on W and R, and the assumption that λ_n d → 0.
Together with the exponential rates of convergence established by the stated lemmas, this completes the proof of the result.

6 Experimental Results

We briefly describe some experimental results that demonstrate the practical viability and performance of our proposed method. We generated random Ising models (1) using the following procedure: for a given graph size p and maximum degree d, we started with a graph with disconnected cliques of size less than or equal to ten, and, for each node, removed edges randomly until the sparsity condition (degree less than d) was satisfied. For all edges (s, t) present in the resulting random graph, we chose the edge weight θ_st ∼ U[−3, 3]. We drew n i.i.d. samples from the resulting random Ising model by exact methods. We implemented the ℓ1-regularized logistic regression by setting the ℓ1 penalty as λ_n = O((log p)^3/√n), and solved the convex program using a customized primal-dual algorithm (described in more detail in the full-length version of this paper). We considered various sparsity regimes, including constant (d = Ω(1)), logarithmic (d = α log(p)), or linear (d = αp). In each case, we evaluate a given method in terms of its average precision (one minus the fraction of falsely included edges) and its recall (one minus the fraction of falsely excluded edges).
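Turning the p per-node neighborhood estimates into a single edge set, and scoring it by precision and recall as defined above, can be sketched as follows (names ours):

```python
def combine_neighborhoods(neighborhoods, rule="AND"):
    """Combine per-node neighborhood estimates into an undirected edge set.
    AND: keep (s,t) only if t is in N(s) and s is in N(t); OR: if either holds."""
    p = len(neighborhoods)
    edges = set()
    for s in range(p):
        for t in neighborhoods[s]:
            if rule == "AND":
                if s in neighborhoods[t]:
                    edges.add((min(s, t), max(s, t)))
            else:
                edges.add((min(s, t), max(s, t)))
    return edges

def precision_recall(est_edges, true_edges):
    """Precision: one minus the fraction of falsely included edges;
    recall: one minus the fraction of falsely excluded edges."""
    tp = len(est_edges & true_edges)
    prec = tp / len(est_edges) if est_edges else 1.0
    rec = tp / len(true_edges) if true_edges else 1.0
    return prec, rec
```

The AND rule trades recall for precision relative to the OR rule, which matches the qualitative gap visible between the two methods in Figure 1.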
Figure 1 shows results for the case of constant degrees (d ≤ 4) and graph sizes p ∈ {100, 200, 400}, for the AND method (respectively the OR method), in which an edge (s, t) is included if and only if it is included in the local regressions at both node s and (respectively or) node t. Note that both the precision and recall tend to one as the number of samples n is increased.

7 Conclusion

We have shown that a technique based on ℓ1-regularization, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint, can be used for consistent model selection in discrete graphical models. Our analysis applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Whereas the current analysis provides sufficient conditions on the triple (n, p, d) that ensure consistent neighborhood selection, it remains to establish necessary conditions as well [11].
Finally, the ideas described here, while specialized in this paper to the binary case, should be more broadly applicable to discrete graphical models.

Acknowledgments

Research supported in part by NSF grants IIS-0427206, CCF-0625879 and DMS-0605165.

Figure 1. Precision/recall plots using the AND method (top), and the OR method (bottom). Each panel shows precision/recall versus n, for graph sizes p ∈ {100, 200, 400}.

References

[1] D. Chickering. Learning Bayesian networks is NP-complete. Proceedings of AI and Statistics, 1995.

[2] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. Info. Theory, 14(3):462-467, 1968.

[3] S. Dasgupta. Learning polytrees. In Uncertainty in Artificial Intelligence, pages 134-141, 1999.

[4] D. Donoho and M. Elad. Maximal sparsity representation via ℓ1 minimization. Proc. Natl. Acad. Sci., 100:2197-2202, March 2003.

[5] N. Meinshausen and P. Bühlmann.
High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462, 2006.

[6] A. Y. Ng. Feature selection, ℓ1 vs. ℓ2 regularization, and rotational invariance. In International Conference on Machine Learning, 2004.

[7] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.

[8] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT Press, 2000.

[9] N. Srebro. Maximum likelihood bounded tree-width Markov networks. Artificial Intelligence, 143(1):123-138, 2003.

[10] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Trans. Info. Theory, 52(3):1030-1051, March 2006.

[11] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programs. In Proc. Allerton Conference on Communication, Control and Computing, October 2006.

[12] P. Zhao and B. Yu. Model selection with the lasso. Technical report, UC Berkeley, Department of Statistics, March 2006. Accepted to Journal of Machine Learning Research.", "award": [], "sourceid": 3138, "authors": [{"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}