{"title": "On model selection consistency of penalized M-estimators: a geometric theory", "book": "Advances in Neural Information Processing Systems", "page_first": 342, "page_last": 350, "abstract": "Penalized M-estimators are used in diverse areas of science and engineering to fit high-dimensional models with some low-dimensional structure. Often, the penalties are \\emph{geometrically decomposable}, \\ie\\ can be expressed as a sum of (convex) support functions. We generalize the notion of irrepresentable to geometrically decomposable penalties and develop a general framework for establishing consistency and model selection consistency of M-estimators with such penalties. We then use this framework to derive results for some special cases of interest in bioinformatics and statistical learning.", "full_text": "On model selection consistency of M-estimators with\n\ngeometrically decomposable penalties\n\nJason D. Lee, Yuekai Sun\n\nInstitute for Computational and Mathematical Engineering\n\nStanford University\n\n{jdl17,yuekai}@stanford.edu\n\nJonathan E. Taylor\n\nDepartment of Statisticis\n\nStanford University\n\njonathan.taylor@stanford.edu\n\nAbstract\n\nPenalized M-estimators are used in diverse areas of science and engineering to \ufb01t\nhigh-dimensional models with some low-dimensional structure. Often, the penal-\nties are geometrically decomposable, i.e. can be expressed as a sum of support\nfunctions over convex sets. We generalize the notion of irrepresentable to geomet-\nrically decomposable penalties and develop a general framework for establishing\nconsistency and model selection consistency of M-estimators with such penalties.\nWe then use this framework to derive results for some special cases of interest in\nbioinformatics and statistical learning.\n\n1\n\nIntroduction\n\nThe principle of parsimony is used in many areas of science and engineering to promote \u201csimple\u201d\nmodels over more complex ones.\nIn machine learning, signal processing, and high-dimensional\nstatistics, this principle motivates the use of sparsity inducing penalties for model selection and\nsignal recovery from incomplete/noisy measurements. In this work, we consider M-estimators of\nthe form:\n\nminimize\n\n\u03b8\u2208Rp\n\n(cid:96)(n)(\u03b8) + \u03bb\u03c1(\u03b8), subject to \u03b8 \u2208 S,\n\n(1.1)\n\nwhere (cid:96)(n) is a convex, twice continuously differentiable loss function, \u03c1 is a penalty function, and\nS \u2286 Rp is a subspace. Many commonly used penalties are geometrically decomposable, i.e. can\nbe expressed as a sum of support functions over convex sets. We describe this notion of decompos-\nable in Section 2 and then develop a general framework for analyzing the consistency and model\nselection consistency of M-estimators with geometrically decomposable penalties. When special-\nized to various statistical models, our framework yields some known and some new model selection\nconsistency results.\nThis paper is organized as follows: First, we review existing work on consistency and model selec-\ntion consistency of penalized M-estimators. Then, in Section 2, we describe the notion of geomet-\nrically decomposable and give some examples of geometrically decomposable penalties. In Section\n3, we generalize the notion of irrepresentable to geometrically decomposable penalties and state our\nmain result (Theorem 3.4). 
We prove our main result in the Supplementary Material and develop a\nconverse result concerning the necessity of the irrepresentable condition in the Supplementary Ma-\nterial. In Section 4, we use our main result to derive consistency and model selection consistency\nresults for the generalized lasso (total variation) and maximum likelihood estimation in exponential\nfamilies.\n\n1\n\n\f1.1 Consistency of penalized M-estimators\n\nThe consistency of penalized M-estimators has been studied extensively. The three most well-\nstudied problems are (i) the lasso [2, 26], (ii) generalized linear models (GLM) with the lasso\npenalty [10], and (iii) inverse covariance estimators with sparsity inducing penalties (equivalent\nto sparse maximum likelihood estimation for a Gaussian graphical model) [21, 20]. There are also\nconsistency results for M-estimators with group and structured variants of the lasso penalty [1, 7].\nNegahban et al. [17] proposes a uni\ufb01ed framework for establishing consistency and convergence\nrates for M-estimators with penalties \u03c1 that are decomposable with respect to a pair of subspaces\nM, \u00afM:\n\n\u03c1(x + y) = \u03c1(x) + \u03c1(y), for all x \u2208 M, y \u2208 \u00afM\u22a5.\n\nMany commonly used penalties such as the lasso, group lasso, and nuclear norm are decomposable\nin this sense. Negahban et al. prove a general result that establishes the consistency of M-estimators\nwith decomposable penalties. Using their framework, they derive consistency results for special\ncases like sparse and group sparse regression. The current work is in a similar vein as Negahban et\nal. [17], but we focus on establishing the more stringent result of model selection consistency rather\nthan consistency. See Section 3 for a comparison of the two notions of consistency.\nThe model selection consistency of penalized M-estimators has also been extensively studied. The\nmost commonly studied problems are (i) the lasso [30, 26], (ii) GLM\u2019s with the lasso penalty [4, 19,\n28], (iii) covariance estimation [15, 12, 20] and (more generally) structure learning [6, 14]. There are\nalso general results concerning M-estimators with sparsity inducing penalties [29, 16, 11, 22, 8, 18,\n24]. Despite the extensive work on model selection consistency, to our knowledge, this is the \ufb01rst\nwork to establish a general framework for model selection consistency for penalized M-estimators.\n\n2 Geometrically decomposable penalties\nLet C \u2282 Rp be a closed convex set. Then the support function over C is\n\n(2.1)\nSupport functions are sublinear and should be thought of as semi-norms. If C is a norm ball, i.e.\nC = {x | (cid:107)x(cid:107) \u2264 1}, then hC is the dual norm:\n\nhC(x) = supy {yT x | y \u2208 C}.\n\n(cid:107)y(cid:107)\u2217\n\n= supx {xT y | (cid:107)x(cid:107) \u2264 1}.\n\nThe support function is a supremum of linear functions, hence the subdifferential consists of the\nlinear functions that attain the supremum:\n\n\u2202hC(x) = {y \u2208 C | yT x = hC(x)}.\n\nThe support function (as a function of the convex set C) is also additive over Minkowski sums, i.e.\nif C and D are convex sets, then\n\nhC+D(x) = hC(x) + hD(x).\n\nWe use this property to express penalty functions as sums of support functions. E.g. if \u03c1 is a norm\nand the dual norm ball can be expressed as a (Minkowski) sum of convex sets C1, . . . 
, Ck, then \u03c1\ncan be expressed as a sum of support functions:\n\n\u03c1(x) = hC1 (x) + \u00b7\u00b7\u00b7 + hCk (x).\n\nIf a penalty \u03c1 can be expressed as\n\n(2.2)\nwhere A and I are closed convex sets and S is a subspace, then we say \u03c1 is a geometrically de-\ncomposable penalty. This form is general; if \u03c1 can be expressed as a sum of support functions,\ni.e.\n\n\u03c1(\u03b8) = hA(\u03b8) + hI (\u03b8) + hS\u22a5 (\u03b8),\n\n\u03c1(\u03b8) = hC1 (\u03b8) + \u00b7\u00b7\u00b7 + hCk (\u03b8),\n\nthen we can set A, I, and S\u22a5 to be sums of the sets C1, . . . , Ck to express \u03c1 in geometrically\ndecomposable form (2.2). In many cases of interest, A + I is a norm ball and hA+I = hA + hI is\nthe dual norm. In our analysis, we assume\n\n1Given the extensive work on consistency of penalized M-estimators, our review and referencing is neces-\n\nsarily incomplete.\n\n2\n\n\f1. A and I are bounded.\n2. I contains a relative neighborhood of the origin, i.e. 0 \u2208 relint(I).\n\nWe do not require A + I to contain a neighborhood of the origin. This generality allows for unpe-\nnalized variables.\nThe notation A and I should be as read as \u201cactive\u201d and \u201cinactive\u201d: span(A) should contain the true\nparameter vector and span(I) should contain deviations from the truth that we want to penalize. E.g.\nif we know the sparsity pattern of the unknown parameter vector, then A should span the subspace\nof all vectors with the correct sparsity pattern.\nThe third term enforces a subspace constraint \u03b8 \u2208 S because the support function of a subspace is\nthe (convex) indicator function of the orthogonal complement:\nx \u2208 S\n\n(cid:26)0\n\nhS\u22a5 (x) = 1S(x) =\n\n\u221e otherwise.\n\nSuch subspace constraints arise in many problems, either naturally (e.g. the constrained lasso [9]) or\nafter reformulation (e.g. group lasso with overlapping groups). We give three examples of penalized\nM-estimators with geometrically decomposable penalties, i.e.\n\nminimize\n\n\u03b8\u2208Rp\n\n(cid:96)(n)(\u03b8) + \u03bb\u03c1(\u03b8),\n\n(2.3)\n\nwhere \u03c1 is a geometrically decomposable penalty. We also compare our notion of geometrically\ndecomposable to two other notions of decomposable penalties by Negahban et al. [17] and Van De\nGeer [25] in the Supplementary Material.\n\n2.1 The lasso and group lasso penalties\nTwo geometrically decomposable penalties are the lasso and group lasso penalties. Let A and I\nbe complementary subsets of {1, . . . , p}. 
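Before writing the decomposition formally, a small numerical sketch (with a toy dimension p = 4 and an arbitrary index split, both assumptions made only for illustration) of the support-function facts above: definition (2.1), the dual-norm identity for the l-infinity ball, and additivity over Minkowski sums. The sets B_inf,A and B_inf,I used here are those written out just below.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 4
    x = rng.standard_normal(p)

    def h_box(x, mask):
        # Support function (2.1) of C = {y : ||y||_inf <= 1, y_j = 0 for j outside mask};
        # the supremum of y^T x is attained at y_j = sign(x_j) on the allowed coordinates.
        return np.abs(x[mask]).sum()

    full = np.ones(p, dtype=bool)
    # Dual-norm identity: the support function of the full l_inf ball is the l_1 norm.
    assert np.isclose(h_box(x, full), np.linalg.norm(x, 1))

    # Minkowski additivity: B_inf = B_inf,A + B_inf,I for complementary coordinate blocks A and I,
    # so h_{B_inf,A}(x) + h_{B_inf,I}(x) = ||x||_1, the component-wise lasso decomposition below.
    A = np.array([True, True, False, False])
    assert np.isclose(h_box(x, A) + h_box(x, ~A), np.linalg.norm(x, 1))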
We can decompose the lasso penalty component-wise to\nobtain\n\nwhere hB\u221e,A and hB\u221e,I are support functions of the sets\n\n(cid:107)\u03b8(cid:107)1 = hB\u221e,A(\u03b8) + hB\u221e,I (\u03b8),\n\nB\u221e,A =(cid:8)\u03b8 \u2208 Rp | (cid:107)\u03b8(cid:107)\u221e \u2264 1 and \u03b8I = 0(cid:9)\nB\u221e,I =(cid:8)\u03b8 \u2208 Rp | (cid:107)\u03b8(cid:107)\u221e \u2264 1 and \u03b8A = 0(cid:9).\n\nIf the groups do not overlap, then we can also decompose the group lasso penalty group-wise (A\n\nand I are now sets of groups) to obtain(cid:88)\n\n(cid:107)\u03b8g(cid:107)2 = hB(2,\u221e),A (\u03b8) + hB(2,\u221e),I (\u03b8).\n\ng\u2208G\n\nhB(2,\u221e),A and hB(2,\u221e),I are support functions of the sets\n\nB(2,\u221e),A =(cid:8)\u03b8 \u2208 Rp | max\nB(2,\u221e),I =(cid:8)\u03b8 \u2208 Rp | max\n\ng\u2208G (cid:107)\u03b8g(cid:107)2 \u2264 1 and \u03b8g = 0, g \u2208 A(cid:9)\ng\u2208G (cid:107)\u03b8g(cid:107)2 \u2264 1 and \u03b8g = 0, g \u2208 I(cid:9).\n\nIf the groups overlap, then we can duplicate the parameters in overlapping groups and enforce equal-\nity constraints.\n\n2.2 The generalized lasso penalty\nAnother geometrically decomposable penalty is the generalized lasso penalty [23]. Let D \u2208 Rm\u00d7p\nbe a matrix and A and I be complementary subsets of {1, . . . , m}. We can express the generalized\nlasso penalty in decomposable form:\n\n(cid:107)D\u03b8(cid:107)1 = hDT B\u221e,A (\u03b8) + hDT B\u221e,I (\u03b8).\n\n(2.4)\n\n3\n\n\fhDT B\u221e,A and hDT B\u221e,I are support functions of the sets\n\nDT B\u221e,A = {x \u2208 Rp | x = DTAy,(cid:107)y(cid:107)\u221e \u2264 1}\nDT B\u221e,I = {x \u2208 Rp | x = DTI y,(cid:107)y(cid:107)\u221e \u2264 1}.\n\n(2.5)\n(2.6)\nWe can also formulate any generalized lasso penalized M-estimator as a linearly constrained, lasso\npenalized M-estimator. After a change of variables, a generalized lasso penalized M-estimator is\nequivalent to\n\nminimize\n\u03b8\u2208Rk,\u03b3\u2208Rp\n\n(cid:96)(n)(D\u2020\u03b8 + \u03b3) + \u03bb(cid:107)\u03b8(cid:107)1 , subject to \u03b3 \u2208 N (D),\n\nwhere N (D) is the nullspace of D. The lasso penalty can then be decomposed component-wise to\nobtain\n\n(cid:107)\u03b8(cid:107)1 = hB\u221e,A(\u03b8) + hB\u221e,I (\u03b8).\n\nWe enforce the subspace constraint \u03b8 \u2208 N (D) with the support function of R(D)\u22a5. This yields the\nconvex optimization problem\n\nminimize\n\u03b8\u2208Rk,\u03b3\u2208Rp\n\n(cid:96)(n)(D\u2020\u03b8 + \u03b3) + \u03bb(hB\u221e,A (\u03b8) + hB\u221e,I (\u03b8) + hN (D)\u22a5 (\u03b3)).\n\nThere are many interesting applications of the generalized lasso in signal processing and statistical\nlearning. We refer to Section 2 in [23] for some examples.\n\n2.3\n\n\u201cHybrid\u201d penalties\n\nA large class of geometrically decomposable penalties are so-called \u201chybrid\u201d penalties:\nin\ufb01mal\nconvolutions of penalties to promote solutions that are sums of simple components, e.g. \u03b8 = \u03b81 + \u03b82,\nwhere \u03b81 and \u03b82 are simple. If the constituent simple penalties are geometrically decomposable, then\nthe resulting hybrid penalty is also geometrically decomposable.\nFor example, let \u03c11 and \u03c12 be geometrically decomposable penalties, i.e. 
there are sets A1, I1, S1\nand A2, I2, S2 such that\n\n\u03c11(\u03b8) = hA1 (\u03b8) + hI1(\u03b8) + hS\u22a5\n\u03c12(\u03b8) = hA2 (\u03b8) + hI2(\u03b8) + hS\u22a5\n\n1\n\n(\u03b8)\n\n(\u03b8)\n\n2\n\nThe M-estimator with penalty \u03c1(\u03b8) = inf \u03b3 {\u03c11(\u03b3) + \u03c12(\u03b8 \u2212 \u03b3)} is equivalent to the solution to the\nconvex optimization problem\n\nminimize\n\n\u03b8\u2208R2p\n\n(cid:96)(n)(\u03b81 + \u03b82) + \u03bb(\u03c11(\u03b81) + \u03c12(\u03b82)).\n\n(2.7)\n\nThis is an M-estimator with a geometrically decomposable penalty:\n\nminimize\n\n\u03b8\u2208R2p\n\n(cid:96)(n)(\u03b81 + \u03b82) + \u03bb(hA(\u03b8) + hI (\u03b8) + hS\u22a5 (\u03b8)).\n\nhA, hI and hS\u22a5 are support functions of the sets\n\nA = {(\u03b81, \u03b82) | \u03b81 \u2208 A1 \u2282 Rp, \u03b82 \u2208 A2 \u2282 Rp}\nI = {(\u03b81, \u03b82) | \u03b81 \u2208 I1 \u2282 Rp, \u03b82 \u2208 I2 \u2282 Rp}\nS = {(\u03b81, \u03b82) | \u03b81 \u2208 S1 \u2282 Rp, \u03b82 \u2208 S2 \u2282 Rp}.\n\nThere are many interesting applications of the hybrid penalties in signal processing and statistical\nlearning. Two examples are the huber function, \u03c1(\u03b8) = inf \u03b8=\u03b31+\u03b32 (cid:107)\u03b31(cid:107)1 +(cid:107)\u03b32(cid:107)2\n2, and the multitask\ngroup regularizer, \u03c1(\u0398) = inf \u0398=B+S (cid:107)B(cid:107)1,\u221e +(cid:107)S(cid:107)1. See [27] for recent work on model selection\nconsistency in hybrid penalties.\n\n3 Main result\n\nWe assume the unknown parameter vector \u03b8(cid:63) is contained in the model subspace\n\nM := span(I)\u22a5 \u2229 S,\n\n(3.1)\n\n4\n\n\fand we seek estimates of \u03b8(cid:63) that are \u201ccorrect\u201d. We consider two notions of correctness: (i) an\nestimate \u02c6\u03b8 is consistent (in the (cid:96)2 norm) if the estimation error in the (cid:96)2 norm decays to zero in\n\nprobability as sample size grows: (cid:13)(cid:13)(cid:13)\u02c6\u03b8 \u2212 \u03b8(cid:63)(cid:13)(cid:13)(cid:13)2\n\np\u2192 0 as n \u2192 \u221e,\n\nand (ii) \u02c6\u03b8 is model selection consistent if the estimator selects the correct model with probability\ntending to one as sample size grows:\n\nPr(\u02c6\u03b8 \u2208 M ) \u2192 1 as n \u2192 \u221e.\n\nNOTATION: We use PC to denote the orthogonal projector onto span(C) and \u03b3C to denote the\ngauge function of a convex set C containing the origin:\n\n\u03b3C(x) = inf\nx\n\n{\u03bb \u2208 R+ | x \u2208 \u03bbC}.\n\nFurther, we use \u03ba(\u03c1) to denote the compatibility constant between a semi-norm \u03c1 and the (cid:96)2 norm\nover the model subspace:\n\n\u03ba(\u03c1) := sup\n\nFinally, we choose a norm (cid:107)\u00b7(cid:107)\u03b5 to make(cid:13)(cid:13)\u2207(cid:96)(n)(\u03b8(cid:63))(cid:13)(cid:13)\u03b5 small. This norm is usually the dual norm to\n\n{\u03c1(x) | (cid:107)x(cid:107)2 \u2264 1, x \u2208 M}.\n\nthe penalty.\nBefore we state our main result, we state our assumptions on the problem. Our two main assump-\ntions are stated in terms of the Fisher information matrix:\n\nx\n\nAssumption 3.1 (Restricted strong convexity). 
We assume the loss function (cid:96)(n) is locally strongly\nconvex with constant m over the model subspace, i.e.\n\nQ(n) = \u22072(cid:96)(n)(\u03b8(cid:63)).\n\n(cid:96)(n)(\u03b81) \u2212 (cid:96)(n)(\u03b82) \u2265 \u2207(cid:96)(n)(\u03b82)T (\u03b81 \u2212 \u03b82) +\n\n(cid:107)\u03b81 \u2212 \u03b82(cid:107)2\n\n2\n\n(3.2)\n\nm\n2\n\nfor some m > 0 and all \u03b81, \u03b82 \u2208 Br(\u03b8(cid:63)) \u2229 M.\nWe require this assumption to make the maximum likelihood estimate unique over the model sub-\nspace. Otherwise, we cannot hope for consistency. This assumption requires the loss function to be\ncurved along certain directions in the model subspace and is very similar to Negahban et al.\u2019s notion\nof restricted strong convexity [17] and Buhlmann and van de Geer\u2019s notion of compatibility [3].\nIntuitively, this assumption means the \u201cactive\u201d predictors are not overly dependent on each other.\nWe also require \u22072(cid:96)(n) to be locally Lipschitz continuous, i.e.\n\n(cid:107)\u22072(cid:96)(n)(\u03b81) \u2212 \u22072(cid:96)(n)(\u03b82)(cid:107)2 \u2264 L(cid:107)\u03b81 \u2212 \u03b82(cid:107)2 .\n\nfor some L > 0 and all \u03b81, \u03b82 \u2208 Br(\u03b8(cid:63)) \u2229 M. This condition automatically holds for all twice-\ncontinuously differentiable loss functions, hence we do not state this condition as an assumption.\nTo obtain model selection consistency results, we must \ufb01rst generalize the key notion of irrepre-\nsentable to geometrically decomposable penalties.\nAssumption 3.2 (Irrepresentability). There exist \u03c4 \u2208 (0, 1) such that\n\nsup\n\nz\n\n{V (PM\u22a5 (Q(n)PM (PM Q(n)PM )\u2020PM z \u2212 z)) | z \u2208 \u2202hA(Br(\u03b8(cid:63)) \u2229 M )}\n< 1 \u2212 \u03c4,\n\nwhere V is the in\ufb01mal convolution of \u03b3I and 1S\u22a5\n\nV (z) = inf\nu\n\n{\u03b3I (u) + 1S\u22a5 (z \u2212 u)}.\n\nIf uI (z) and uS\u22a5 (u) achieve V (z) (i.e. V (z) = \u03b3I (uI (z))), then V (u) < 1, means uI (z) \u2208\nrelint(I). Hence the irrepresentable condition requires any z \u2208 M\u22a5 to be decomposable into\nuI + uS\u22a5, where uI \u2208 relint(I) and uS\u22a5 \u2208 S\u22a5.\n\n5\n\n\fLemma 3.3. V is a bounded semi-norm over M\u22a5, i.e. V is \ufb01nite and sublinear over M\u22a5.\n\nLet (cid:107)\u00b7(cid:107)\u03b5 be an error norm, usually chosen to make(cid:13)(cid:13)\u2207(cid:96)(n)(\u03b8(cid:63))(cid:13)(cid:13)\u03b5 small. V is a bounded semi-norm\n\nover M\u22a5, hence there exists some \u00af\u03c4 such that\n\nV (PM\u22a5 (Q(n)PM (PM Q(n)PM )\u2020PM x \u2212 x)) \u2264 \u00af\u03c4 (cid:107)x(cid:107)\u03b5\n\n(3.3)\n\u00af\u03c4 surely exists because (i) (cid:107)\u00b7(cid:107)\u03b5 is a norm, so the set {x \u2208 Rp | (cid:107)x(cid:107)\u03b5 \u2264 1} is compact, and (ii) V is\n\ufb01nite over M\u22a5, so the left side of (3.3) is a continuous function of x. Intuitively, \u00af\u03c4 quanti\ufb01es how\nlarge the irrepresentable term can be compared to the error norm.\nThe irrepresentable condition is a standard assumption for model selection consistency and has\nbeen shown to be almost necessary for sign consistency of the lasso [30, 26]. Intuitively, the ir-\nrepresentable condition requires the active predictors to be not overly dependent on the inactive\npredictors. In Supplementary Material, we show our (generalized) irrepresentable condition is also\nnecessary for model selection consistency with some geometrically decomposable penalties.\nTheorem 3.4. Suppose Assumption 3.1 and 3.2 are satis\ufb01ed. 
If we select \u03bb such that\n\nand\n\n(cid:107)\u2207(cid:96)(n)(\u03b8(cid:63))(cid:107)\u03b5\n\n2\u00af\u03c4\n\u03c4\n\n2\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u03b5)(2\u03ba(hA)+ \u03c4\n\n\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u2217\n\n\u03b5 ))2\n\n\u03c4\n\nmr\n2\u03ba(hA)+ \u03c4\n\n\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u2217\n\n\u03b5 ) ,\n\n\u03bb >\n\nL\n\n\u03bb < min\n\n\uf8f1\uf8f2\uf8f3 m2\n(cid:0)\u03ba(hA) + \u03c4\n\n\u03b5)(cid:1) \u03bb,\n\n2\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u2217\n\n(cid:13)(cid:13)(cid:13)\u02c6\u03b8 \u2212 \u03b8(cid:63)(cid:13)(cid:13)(cid:13)2\n\n\u2264 2\n\n1.\n2. \u02c6\u03b8 \u2208 M := span(I)\u22a5 \u2229 S.\n\nm\n\nthen the penalized M-estimator is unique, consistent (in the (cid:96)2 norm), and model selection consistent,\ni.e. the optimal solution to (2.3) satis\ufb01es\n\nRemark 1. Theorem 3.4 makes a deterministic statement about the optimal solution to (2.3). To\nuse this result to derive consistency and model selection consistency results for a statistical model,\nwe must \ufb01rst verify Assumptions (3.1) and (3.2) are satis\ufb01ed with high probability. Then, we must\nchoose an error norm (cid:107)\u00b7(cid:107)\u03b5 and select \u03bb such that\n2\u00af\u03c4\n\u03c4\n\n(cid:107)\u2207(cid:96)(n)(\u03b8(cid:63))(cid:107)\u03b5\n\n\u03bb >\n\nand\n\n\u03bb < min\n\nwith high probability.\n\n\uf8f1\uf8f2\uf8f3 m2\n\nL\n\n2\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u03b5)(2\u03ba(hA)+ \u03c4\n\n\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u2217\n\n\u03b5 ))2\n\n\u03c4\n\nmr\n2\u03ba(hA)+ \u03c4\n\n\u00af\u03c4 \u03ba((cid:107)\u00b7(cid:107)\u2217\n\u03b5 )\n\nIn Section 4, we use this theorem to derive consistency and model selection consistency results for\nthe generalized lasso and penalized likelihood estimation for exponential families.\n\n4 Examples\n\nWe use Theorem 3.4 to establish the consistency and model selection consistency of the generalized\nlasso and a group lasso penalized maximum likelihood estimator in the high-dimensional setting.\nOur results are nonasymptotic, i.e. we obtain bounds in terms of sample size n and problem dimen-\nsion p that hold with high probability.\n\n4.1 The generalized lasso\nConsider the linear model y = X T \u03b8(cid:63) + \u0001, where X \u2208 Rn\u00d7p is the design matrix, and \u03b8(cid:63) \u2208 Rp\nare unknown regression parameters. We assume the columns of X are normalized so (cid:107)xi(cid:107)2 \u2264 \u221a\nn.\n\u0001 \u2208 Rn is i.i.d., zero mean, sub-Gaussian noise with parameter \u03c32.\n\n6\n\n\fWe seek an estimate of \u03b8(cid:63) with the generalized lasso:\n\nminimize\n\n\u03b8\u2208Rp\n\n1\n2n\n\n(cid:107)y \u2212 X\u03b8(cid:107)2\n\n2 + \u03bb(cid:107)D\u03b8(cid:107)1 ,\n\n(4.1)\n\nwhere D \u2208 Rm\u00d7p. The generalized lasso penalty is geometrically decomposable:\n\n(cid:107)D\u03b8(cid:107)1 = hDT B\u221e,A (\u03b8) + hDT B\u221e,I (\u03b8).\n\nhDT B\u221e,A and hDT B\u221e,I are support functions of the sets\n\nDT B\u221e,A = {x \u2208 Rp | x = DT y, yI = 0,(cid:107)y(cid:107)\u221e \u2264 1}\nDT B\u221e,I = {x \u2208 Rp | x = DT y, yA = 0,(cid:107)y(cid:107)\u221e \u2264 1}.\n\nThe sample \ufb01sher information matrix is Q(n) = 1\nLipschitz constant of Q(n) is zero. The restricted strong convexity constant is\n\nn X T X. 
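As a purely illustrative companion to (4.1), not the authors' code, one can solve the generalized lasso with cvxpy and read off Q(n) and m directly; the synthetic data, the choice D = I, and the constant in lambda below are assumptions made only for the sketch.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(1)
    n, p = 200, 20
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)       # normalize columns so ||x_i||_2 <= sqrt(n)
    theta_true = np.concatenate([np.ones(5), np.zeros(p - 5)])
    y = X @ theta_true + 0.5 * rng.standard_normal(n)

    D = np.eye(p)                                     # D = I recovers the ordinary lasso
    lam = 2 * np.sqrt(np.log(p) / n)                  # lambda of order sqrt(log p / n), constants omitted

    theta = cp.Variable(p)
    objective = cp.sum_squares(y - X @ theta) / (2 * n) + lam * cp.norm1(D @ theta)
    cp.Problem(cp.Minimize(objective)).solve()

    Q = X.T @ X / n                                    # sample Fisher information Q^(n)
    m = np.linalg.eigvalsh(Q).min()                    # smallest eigenvalue; here n > p, so m > 0
    print(np.round(theta.value, 2), round(m, 3))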
Q(n) does not depend on \u03b8, hence the\n\nm = \u03bbmin(Q(n)) = inf\nx\n\n{xT Q(n)x | (cid:107)x(cid:107)2 = 1}.\n\nThe model subspace is the set\n\nspan(DT B\u221e,I)\u22a5 = R(DTI )\u22a5 = N (DI),\n\nwhere I is a subset of the row indices of D. The compatibility constants \u03ba((cid:96)1), \u03ba(hA) are\n\n\u03ba((cid:96)1) = sup\nx\n\u03ba(hA) = sup\nx\n\n{(cid:107)x(cid:107)1 | (cid:107)x(cid:107)2 \u2264 1, x \u2208 N (DI)}\n\n(cid:8)hDT B\u221e,A(x) | (cid:107)x(cid:107)2 \u2264 1, x \u2208 M(cid:9) \u2264 (cid:107)DA(cid:107)2\n(cid:113) log p\nn , then there exists c such that Pr(cid:0)\u03bb \u2265 2\u00af\u03c4\n\n\u221a\nIf we select \u03bb > 2\n\n(cid:1) \u2264 1 \u2212\n2 exp(cid:0)\u2212c\u03bb2n(cid:1). Thus the assumptions of Theorem 3.4 are satis\ufb01ed with probability at least 1 \u2212\nthen, with probability at least 1 \u2212 2 exp(cid:0)\u2212c\u03bb2n(cid:1), the solution to the generalized\n\n2 exp(\u2212c\u03bb2n), and we deduce the generalized lasso is consistent and model selection consistent.\nCorollary 4.1. Suppose y = X\u03b8(cid:63) + \u0001, where X \u2208 Rn\u00d7p is the design matrix, \u03b8(cid:63) are unknown\ncoef\ufb01cients, and \u0001 is i.i.d., zero mean, sub-Gaussian noise with parameter \u03c32. If we select \u03bb >\n\u221a\n2\nlasso is unique, consistent, and model selection consistent, i.e. the optimal solution to (4.1) satis\ufb01es\n\n(cid:112)|A|.\n(cid:13)(cid:13)\u2207(cid:96)(n)(\u03b8(cid:63))(cid:13)(cid:13)\u221e\n\n2\u03c3 \u00af\u03c4\n\u03c4\n\n2\u03c3 \u00af\u03c4\n\u03c4\n\n\u03c4\n\nn\n\n(cid:113) log p\n(cid:13)(cid:13)(cid:13)\u02c6\u03b8 \u2212 \u03b8(cid:63)(cid:13)(cid:13)(cid:13)2\n(cid:112)|A| + \u03c4\ni = 0, for any i such that(cid:0)D\u03b8(cid:63)(cid:1)\n2. (cid:0)D \u02c6\u03b8(cid:1)\n\n(cid:16)(cid:107)DA(cid:107)2\n\n\u2264 2\n\n1.\n\nm\n\n(cid:17)\n\ni = 0.\n\n2\u00af\u03c4 \u03ba((cid:96)1)\n\n\u03bb,\n\n4.2 Learning exponential families with redundant representations\n\nSuppose X is a random vector, and let \u03c6 be a vector of suf\ufb01cient statistics. The exponential family\nassociated with these suf\ufb01cient statistics is the set of distributions with the form\n\nPr(x; \u03b8) = exp(cid:0)\u03b8T \u03c6(x) \u2212 A(\u03b8)(cid:1) ,\n\nSuppose we are given samples x(1), . . . , x(n) drawn i.i.d. from an exponential family with unknown\nparameters \u03b8(cid:63) \u2208 Rp. We seek a maximum likelihood estimate (MLE) of the unknown parameters:\n(4.2)\n\nML (\u03b8) + \u03bb(cid:107)\u03b8(cid:107)2,1 , subject to \u03b8 \u2208 S.\n(cid:96)(n)\n\nminimize\n\n\u03b8\u2208Rp\n\nwhere (cid:96)(n)\n\nML is the (negative) log-likelihood function\n\nn(cid:88)\n\ni=1\n\nML (\u03b8) = \u2212 1\n(cid:96)(n)\nn\n\nlog Pr(x(i); \u03b8) = \u2212 1\nn\n\n7\n\nn(cid:88)\n\ni=1\n\n\u03b8T \u03c6(x(i)) + A(\u03b8)\n\n\fand (cid:107)\u03b8(cid:107)2,1 is the group lasso penalty\n\n(cid:88)\n\ng\u2208G\n\n(cid:107)\u03b8g(cid:107)2 .\n\n(cid:107)\u03b8(cid:107)2,1 =\n\nIt is also straightforward to change the maximum likelihood estimator to the more computationally\ntractable pseudolikelihood estimator [13, 6], the neighborhood selection procedure [15], and include\ncovariates [13]. For brevity, we only explain the details for the maximum likelihood estimator.\nMany undirected graphical models can be naturally viewed as exponential families. 
Thus estimat-\ning the parameters of exponential families is equivalent to learning undirected graphical models, a\nproblem of interest in many application areas such as bioinformatics.\nBelow, we state a corollary that results from applying Theorem 3.4 to exponential families. Please\nsee the supplementary material for the proof and de\ufb01nitions of the quantities involved.\nCorollary 4.2. Suppose we are given samples x(1), . . . , x(n) drawn i.i.d. from an exponential family\nwith unknown parameters \u03b8(cid:63). If we select\n\n\u221a\n2\n\n2L1 \u00af\u03c4\n\u03c4\n\n(cid:114)\n(cid:1)4\n(cid:0)2 + \u03c4\n\n\u03bb >\n\n(cid:40) 32L1L2\n\n2 \u00af\u03c4 2\n\n(maxg\u2208G |g|) log |G|\n\nn\n\nand the sample size n is larger than\n\nm4\u03c4 4\n\nmax\n\nthen, with probability at least 1 \u2212 2(cid:0) maxg\u2208G |g|(cid:1) exp(\u2212c\u03bb2n), the penalized maximum likelihood\n\n\u00af\u03c4 )2(maxg\u2208G |g|)|A| log |G|,\n\nestimator is unique, consistent, and model selection consistent, i.e. the optimal solution to (4.2)\nsatis\ufb01es\n\n16L1\nm2r2 (2 + \u03c4\n\n\u00af\u03c4\n\n(maxg\u2208G |g|)|A|2 log |G|\n\n(cid:13)(cid:13)(cid:13)\u02c6\u03b8 \u2212 \u03b8(cid:63)(cid:13)(cid:13)(cid:13)2\n(cid:1)(cid:112)|A|\u03bb,\n(cid:13)(cid:13)2\n2. \u02c6\u03b8g = 0, g \u2208 I and \u02c6\u03b8g (cid:54)= 0 if (cid:13)(cid:13)\u03b8(cid:63)\n\n(cid:0)1 + \u03c4\n\n\u2264 2\n\n1.\n\n2\u00af\u03c4\n\nm\n\ng\n\n(cid:0)1 + \u03c4\n\n2\u00af\u03c4\n\n(cid:1)(cid:112)|A|\u03bb.\n\n> 1\nm\n\n5 Conclusion\n\nWe proposed the notion of geometrically decomposable and generalized the irrepresentable con-\ndition to geometrically decomposable penalties. This notion of decomposability builds on those\nby Negahban et al. [17] and Cand\u00b4es and Recht [5] and includes many common sparsity inducing\npenalties. This notion of decomposability also allows us to enforce linear constraints.\nWe developed a general framework for establishing the model selection consistency of M-estimators\nwith geometrically decomposable penalties. Our main result gives deterministic conditions on the\nproblem that guarantee consistency and model selection consistency; in this sense, it extends the\nwork of [17] from estimation consistency to model selection consistency. We combine our main\nresult with probabilistic analysis to establish the consistency and model selection consistency of the\ngeneralized lasso and group lasso penalized maximum likelihood estimators.\n\nAcknowledgements\n\nWe thank Trevor Hastie and three anonymous reviewers for their insightful comments. J. Lee was\nsupported by a National Defense Science and Engineering Graduate Fellowship (NDSEG) and an\nNSF Graduate Fellowship. Y. Sun was supported by the NIH, award number 1U01GM102098-01.\nJ.E. Taylor was supported by the NSF, grant DMS 1208857, and by the AFOSR, grant 113039.\n\nReferences\n[1] F. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179\u20131225,\n\n2008.\n\n8\n\n\f[2] P.J. Bickel, Y. Ritov, and A.B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. Ann. Statis.,\n\n37(4):1705\u20131732, 2009.\n\n[3] P. B\u00a8uhlmann and S. van de Geer. Statistics for high-dimensional data: Methods, theory and applications.\n\n2011.\n\n[4] F. Bunea. Honest variable selection in linear and logistic regression models via (cid:96)1 and (cid:96)1+(cid:96)2 penalization.\n\nElectron. J. Stat., 2:1153\u20131194, 2008.\n\n[5] E. Cand`es and B. Recht. 
Simple bounds for recovering low-complexity models. Math. Prog. Ser. A, pages\n\n1\u201313, 2012.\n\n[6] J. Guo, E. Levina, G. Michailidis, and J. Zhu. Asymptotic properties of the joint neighborhood selection\n\nmethod for estimating categorical markov networks. arXiv preprint.\n\n[7] L. Jacob, G. Obozinski, and J. Vert. Group lasso with overlap and graph lasso. In Int. Conf. Mach. Learn.\n\n(ICML), pages 433\u2013440. ACM, 2009.\n\n[8] A. Jalali, P. Ravikumar, V. Vasuki, S. Sanghavi, and UT ECE. On learning discrete graphical models\n\nusing group-sparse regularization. In Int. Conf. Artif. Intell. Stat. (AISTATS), 2011.\n\n[9] G.M. James, C. Paulson, and P. Rusmevichientong. The constrained lasso. Technical report, University\n\nof Southern California, 2012.\n\n[10] S.M. Kakade, O. Shamir, K. Sridharan, and A. Tewari. Learning exponential families in high-dimensions:\n\nStrong convexity and sparsity. In Int. Conf. Artif. Intell. Stat. (AISTATS), 2010.\n\n[11] M. Kolar, L. Song, A. Ahmed, and E. Xing. Estimating time-varying networks. Ann. Appl. Stat., 4(1):94\u2013\n\n123, 2010.\n\n[12] C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Ann.\n\nStatis., 37(6B):4254, 2009.\n\n[13] J.D. Lee and T. Hastie. Learning mixed graphical models. arXiv preprint arXiv:1205.5012, 2012.\n[14] P.L. Loh and M.J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance\n\nmatrices and their inverses. arXiv:1212.0478, 2012.\n\n[15] N. Meinshausen and P. B\u00a8uhlmann. High-dimensional graphs and variable selection with the lasso. Ann.\n\nStatis., 34(3):1436\u20131462, 2006.\n\n[16] Y. Nardi and A. Rinaldo. On the asymptotic properties of the group lasso estimator for linear models.\n\nElectron. J. Stat., 2:605\u2013633, 2008.\n\n[17] S.N. Negahban, P. Ravikumar, M.J. Wainwright, and B. Yu. A uni\ufb01ed framework for high-dimensional\n\nanalysis of m-estimators with decomposable regularizers. Statist. Sci., 27(4):538\u2013557, 2012.\n\n[18] G. Obozinski, M.J. Wainwright, and M.I. Jordan. Support union recovery in high-dimensional multivariate\n\nregression. Ann. Statis., 39(1):1\u201347, 2011.\n\n[19] P. Ravikumar, M.J. Wainwright, and J.D. Lafferty. High-dimensional ising model selection using (cid:96)1-\n\nregularized logistic regression. Ann. Statis., 38(3):1287\u20131319, 2010.\n\n[20] P. Ravikumar, M.J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by\n\nminimizing (cid:96)1-penalized log-determinant divergence. Electron. J. Stat., 5:935\u2013980, 2011.\n\n[21] A.J. Rothman, P.J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation.\n\nElectron. J. Stat., 2:494\u2013515, 2008.\n\n[22] Y. She. Sparse regression with exact clustering. Electron. J. Stat., 4:1055\u20131096, 2010.\n[23] R.J. Tibshirani and J.E. Taylor. The solution path of the generalized lasso. Ann. Statis., 39(3):1335\u20131371,\n\n2011.\n\n[24] S. Vaiter, G. Peyr\u00b4e, C. Dossal, and J. Fadili. Robust sparse analysis regularization. IEEE Trans. Inform.\n\nTheory, 59(4):2001\u20132016, 2013.\n\n[25] S. van de Geer. Weakly decomposable regularization penalties and structured sparsity. arXiv preprint\n\narXiv:1204.4813, 2012.\n\n[26] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using (cid:96)1-constrained\n\nquadratic programming (lasso). IEEE Trans. Inform. Theory, 55(5):2183\u20132202, 2009.\n\n[27] E. Yang and P. Ravikumar. 
Dirty statistical models. In Adv. Neural Inf. Process. Syst. (NIPS), pages 827\u2013835, 2013.\n\n[28] E. Yang, P. Ravikumar, G.I. Allen, and Z. Liu. On graphical models via univariate exponential family\n\ndistributions. arXiv:1301.4183, 2013.\n\n[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc.\n\nSer. B Stat. Methodol., 68(1):49\u201367, 2006.\n\n[30] P. Zhao and B. Yu. On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541\u20132563, 2006.\n\n", "award": [], "sourceid": 245, "authors": [{"given_name": "Jason", "family_name": "Lee", "institution": "Stanford University"}, {"given_name": "Yuekai", "family_name": "Sun", "institution": "Stanford University"}, {"given_name": "Jonathan", "family_name": "Taylor", "institution": "Stanford University"}]}