{"title": "Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition", "book": "Advances in Neural Information Processing Systems", "page_first": 3012, "page_last": 3020, "abstract": "In the high-dimensional regression model a response variable is linearly related to $p$ covariates, but the sample size $n$ is smaller than $p$. We assume that only a small subset of covariates is `active' (i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ($\\ell_1$-regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so called `irrepresentability' condition. In this paper we study the `Gauss-Lasso' selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate `generalized irrepresentability condition' (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set.", "full_text": "Model Selection for High-Dimensional Regression\nunder the Generalized Irrepresentability Condition\n\nAdel Javanmard\nStanford University\nStanford, CA 94305\n\nadelj@stanford.edu\n\nAndrea Montanari\nStanford University\nStanford, CA 94305\n\nmontanar@stanford.edu\n\nAbstract\n\nIn the high-dimensional regression model a response variable is linearly related to\np covariates, but the sample size n is smaller than p. 
We assume that only a small subset of covariates is `active' (i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ($\ell_1$-regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so-called `irrepresentability' condition. In this paper we study the `Gauss-Lasso' selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate the `generalized irrepresentability condition' (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set.

1 Introduction

In linear regression, we wish to estimate an unknown but fixed vector of parameters $\theta_0 \in R^p$ from $n$ pairs $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$, with vectors $X_i$ taking values in $R^p$ and response variables $Y_i$ given by

$Y_i = \langle \theta_0, X_i \rangle + W_i$,   $W_i \sim N(0, \sigma^2)$,   (1)

where $\langle \cdot, \cdot \rangle$ is the standard scalar product. In matrix form, letting $Y = (Y_1, \ldots, Y_n)^T$ and denoting by $X$ the design matrix with rows $X_1^T, \ldots, X_n^T$, we have

$Y = X\theta_0 + W$,   $W \sim N(0, \sigma^2 I_{n \times n})$.   (2)

In this paper, we consider the high-dimensional setting in which the number of parameters exceeds the sample size, i.e., $p > n$, but the number of non-zero entries of $\theta_0$ is smaller than $p$. We denote by $S \equiv supp(\theta_0) \subseteq [p]$ the support of $\theta_0$, and let $s_0 \equiv |S|$.
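For concreteness, the observation model (1)-(2) can be simulated as follows (a minimal Python sketch; the dimensions $n$, $p$, $s_0$ and the noise level $\sigma$ are hypothetical choices, not values from the paper):

```python
import numpy as np

# A minimal simulation of the observation model (2), Y = X theta_0 + W.
rng = np.random.default_rng(0)
n, p, s0 = 100, 300, 5              # high-dimensional regime: p > n
sigma = 0.5

theta0 = np.zeros(p)
theta0[:s0] = 1.0                   # active set S = {0, ..., s0 - 1}

X = rng.standard_normal((n, p))     # design matrix with rows X_i
W = sigma * rng.standard_normal(n)  # noise W ~ N(0, sigma^2 I)
Y = X @ theta0 + W

S = np.flatnonzero(theta0)          # true support, |S| = s0
```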
We are interested in the `model selection' problem, namely the problem of identifying $S$ from the data $Y, X$. In words, there exists a `true' low-dimensional linear model that explains the data, and we want to identify the set $S$ of covariates that are `active' within this model. This problem has motivated a large body of research, because of its relevance to several modern data analysis tasks, ranging from signal processing [9, 5] to genomics [15, 16]. A crucial step forward has been the development of model-selection techniques based on convex optimization formulations [17, 8, 6]. These formulations have led to computationally efficient algorithms that can be applied to large-scale problems. Such developments pose the following theoretical question: for which vectors $\theta_0$, designs $X$, and noise levels $\sigma$ can the support $S$ be identified, with high probability, through computationally efficient procedures? The same question can be asked for random designs $X$ and, in this case, `high probability' refers both to the noise realization $W$ and to the design realization $X$. In the rest of this introduction we shall focus, for the sake of simplicity, on the deterministic setting, and refer to Section 3 for a treatment of Gaussian random designs.

The analysis of computationally efficient methods has largely focused on $\ell_1$-regularized least squares, a.k.a. the Lasso [17]. The Lasso estimator is defined by

$\hat\theta^n(Y, X; \lambda) \equiv \arg\min_{\theta \in R^p} \{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1 \}$.   (3)

In case the right-hand side has more than one minimizer, one of them can be selected arbitrarily for our purposes. It is worth noting that when the columns of $X$ are in general position (e.g.
when the entries of $X$ are drawn from a continuous probability distribution), the Lasso solution is unique [18]. We will often omit the arguments $Y, X$, as they are clear from the context. (A closely related method is the so-called Dantzig selector [6]; it would be interesting to explore whether our results can be generalized to that approach.)

It was understood early on that, even in the large-sample, low-dimensional limit $n \to \infty$ at $p$ constant, $supp(\hat\theta^n) \neq S$ unless the columns of $X$ with index in $S$ are roughly orthogonal to the ones with index outside $S$ [12]. This assumption is formalized by the so-called `irrepresentability condition', which can be stated in terms of the empirical covariance matrix $\hat\Sigma = (X^T X/n)$. Letting $\hat\Sigma_{A,B}$ be the submatrix $(\hat\Sigma_{i,j})_{i \in A, j \in B}$, irrepresentability requires

$\|\hat\Sigma_{S^c,S} \hat\Sigma_{S,S}^{-1} sign(\theta_{0,S})\|_\infty \le 1 - \eta$,   (4)

for some $\eta > 0$ (here $sign(u)_i = +1, 0, -1$ if, respectively, $u_i > 0$, $= 0$, $< 0$). In an early breakthrough, Zhao and Yu [23] proved that, if this condition holds with $\eta$ uniformly bounded away from 0, it guarantees correct model selection also in the high-dimensional regime $p \gg n$. Meinshausen and Bühlmann [14] independently established the same result for random Gaussian designs, with applications to learning Gaussian graphical models. These papers applied to very sparse models, requiring in particular $s_0 = O(n^c)$, $c < 1$, and parameter vectors with large coefficients: scaling the columns of $X$ such that $\hat\Sigma_{i,i} \le 1$ for $i \in [p]$, they require $\theta_{min} \equiv \min_{i \in S} |\theta_{0,i}| \ge c\sqrt{s_0/n}$. Wainwright [21] strengthened these results considerably by allowing for general scalings of $s_0, p, n$ and proving that much smaller non-zero coefficients can be detected: he showed that for a broad class of empirical covariances it is only necessary that $\theta_{min} \ge c\sigma\sqrt{(\log p)/n}$.
This scaling of the minimum non-zero entry is optimal up to constants. Moreover, for specific classes of random Gaussian designs (including $X$ with i.i.d. standard Gaussian entries), the analysis of [21] provides tight bounds on the minimum sample size for correct model selection: there exist $c_\ell, c_u > 0$ such that the Lasso fails with high probability if $n < c_\ell s_0 \log p$ and succeeds with high probability if $n \ge c_u s_0 \log p$.

While, thanks to these recent works [23, 14, 21], we understand model selection via the Lasso reasonably well, it is fundamentally unknown what model-selection performance can be achieved with general computationally practical methods. Two aspects of the above theory cannot be improved substantially: (i) The non-zero entries must satisfy $\theta_{min} \ge c\sigma/\sqrt{n}$ to be detected with high probability. Even if $n = p$ and the measurement directions $X_i$ are orthogonal, e.g., $X = \sqrt{n}\, I_{n \times n}$, one would need $|\theta_{0,i}| \ge c\sigma/\sqrt{n}$ to distinguish the $i$-th entry from noise. For instance, in [10] the authors prove a general upper bound on the minimax power of tests for the hypotheses $H_{0,i} = \{\theta_{0,i} = 0\}$. Specializing this bound to the case of standard Gaussian designs, the analysis of [10] shows formally that no test can detect $\theta_{0,i} \neq 0$, with a fixed degree of confidence, unless $|\theta_{0,i}| \ge c\sigma/\sqrt{n}$. (ii) The sample size must satisfy $n \ge s_0$.
Indeed, if this is not the case, then for each $\theta_0$ with support of size $|S| = s_0$ there is a one-parameter family $\{\theta_0(t) = \theta_0 + t v\}_{t \in R}$ with $supp(\theta_0(t)) \subseteq S$, $X\theta_0(t) = X\theta_0$ and, for specific values of $t$, the support of $\theta_0(t)$ strictly contained in $S$.

On the other hand, there is no fundamental reason to assume the irrepresentability condition (4). This condition arises from the requirement that a specific method (the Lasso) succeed, but it is unclear why it should be necessary in general. In this paper we prove that the Gauss-Lasso selector has nearly optimal model selection properties under a condition that is strictly weaker than irrepresentability. We call this condition the generalized irrepresentability condition (GIC).

GAUSS-LASSO SELECTOR: model selector for high-dimensional problems.
Input: measurement vector $y$, design matrix $X$, regularization parameter $\lambda$, support size $s_0$.
Output: estimated support $\hat S$.
1: Let $T = supp(\hat\theta^n)$ be the support of the Lasso estimator $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ given by
   $\hat\theta^n(Y, X; \lambda) \equiv \arg\min_{\theta \in R^p} \{ \frac{1}{2n} \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1 \}$.
2: Construct the estimator $\hat\theta^{GL}$ as follows: $\hat\theta^{GL}_T = (X_T^T X_T)^{-1} X_T^T y$, $\hat\theta^{GL}_{T^c} = 0$.
3: Find the $s_0$-th largest entry (in modulus) of $\hat\theta^{GL}_T$, denoted by $\hat\theta^{GL}_{(s_0)}$, and let $\hat S \equiv \{ i \in [p] : |\hat\theta^{GL}_i| \ge |\hat\theta^{GL}_{(s_0)}| \}$.

The Gauss-Lasso procedure uses the Lasso to estimate a first model $T \subseteq \{1, \ldots, p\}$.
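The three steps of the Gauss-Lasso selector in the box above can be sketched in Python (an illustrative reimplementation, not the authors' code; scikit-learn's Lasso objective $(1/2n)\|y - X\theta\|_2^2 + \alpha\|\theta\|_1$ matches Eq. (3) with $\alpha = \lambda$, and the synthetic data at the end are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

def gauss_lasso(y, X, lam, s0):
    """Sketch of the Gauss-Lasso selector.

    Stage 1: Lasso with regularization lam (sklearn alpha = lam).
    Stage 2: ordinary least squares restricted to the Lasso support T.
    Stage 3: keep the s0 refitted coefficients of largest magnitude.
    """
    p = X.shape[1]
    lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    T = np.flatnonzero(lasso.coef_)                      # T = supp(theta_hat^n)
    theta_gl = np.zeros(p)
    if T.size > 0:
        theta_gl[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    S_hat = np.sort(np.argsort(np.abs(theta_gl))[-s0:])  # s0 largest entries
    return theta_gl, S_hat

# Example on easy synthetic data (zero noise, so the OLS refit is exact):
rng = np.random.default_rng(0)
n, p, s0 = 100, 20, 3
theta0 = np.zeros(p)
theta0[:s0] = 2.0
X = rng.standard_normal((n, p))
y = X @ theta0
theta_gl, S_hat = gauss_lasso(y, X, lam=0.1, s0=s0)
```

Note the design choice in stage 2: as long as the Lasso support $T$ contains the true support and $X_T$ has full column rank, the least-squares refit removes the shrinkage bias of the Lasso coefficients.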
It then constructs a new estimator by ordinary least squares regression of the data $Y$ onto the model $T$. We prove that the estimated model is, with high probability, correct (i.e., $\hat S = S$) under conditions comparable to the ones assumed in [14, 23, 21], while replacing irrepresentability by the weaker generalized irrepresentability condition. In the case of random Gaussian designs, our analysis further assumes the restricted eigenvalue property in order to establish a nearly optimal scaling of the sample size $n$ with the sparsity parameter $s_0$.

In order to build some intuition about the difference between irrepresentability and generalized irrepresentability, it is convenient to consider the Lasso cost function at `zero noise':

$G(\theta; \xi) \equiv \frac{1}{2n} \|X(\theta - \theta_0)\|_2^2 + \xi\|\theta\|_1 = \frac{1}{2} \langle (\theta - \theta_0), \hat\Sigma (\theta - \theta_0) \rangle + \xi\|\theta\|_1$.

Let $\hat\theta^{ZN}(\xi)$ be the minimizer of $G(\cdot\,; \xi)$ and $v \equiv \lim_{\xi \to 0^+} sign(\hat\theta^{ZN}(\xi))$. The limit is well defined by Lemma 2.2 below. The KKT conditions for $\hat\theta^{ZN}$ imply, for $T \equiv supp(v)$,

$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} v_T\|_\infty \le 1$.

Since $G(\cdot\,; \xi)$ always has at least one minimizer, this condition is always satisfied. Generalized irrepresentability requires that the above inequality hold with some slack $\eta > 0$ bounded away from zero, i.e.,

$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} v_T\|_\infty \le 1 - \eta$.

Notice that this assumption reduces to standard irrepresentability, cf. Eq. (4), if in addition we ask that $v = sign(\theta_0)$. In other words, earlier work [14, 23, 21] required generalized irrepresentability plus sign-consistency in zero noise, and established sign consistency in non-zero noise.
In this paper the former condition alone is shown to be sufficient.

From a different point of view, GIC demands that irrepresentability hold for a superset of the true support $S$. It was indeed argued in the literature that such a relaxation of irrepresentability allows one to cover a significantly broader set of cases (see for instance [3, Section 7.7.6]). However, it was never clarified why such a superset irrepresentability condition should be significantly more general than simple irrepresentability. Further, no precise prescription existed for the superset of the true support. Our contributions can therefore be summarized as follows:

• By tying it to the KKT conditions for the zero-noise problem, we justify the expectation that generalized irrepresentability should hold for a broad class of design matrices.

• We thus provide a specific formulation of superset irrepresentability, prescribing both the superset $T$ and the sign vector $v_T$, that is, by itself, significantly more general than simple irrepresentability.

• We show that, under GIC, exact support recovery can be guaranteed using the Gauss-Lasso, and formulate the appropriate `minimum coefficient' conditions that guarantee this. As a side remark, even when simple irrepresentability holds, our results somewhat strengthen the estimates of [21] (see below for details).

The paper is organized as follows. In the rest of the introduction we illustrate the range of applicability of GIC through a simple example and discuss further related work. We finally introduce the basic notations to be used throughout the paper. Section 2 treats the case of deterministic designs $X$, and develops our main results on the basis of the GIC. Section 3 extends our analysis to the case of random designs.
In this case GIC is required to hold for the population covariance, and the analysis is more technical, as it requires controlling the randomness of the design matrix. We refer the reader to the long version of the paper [11] for the proofs of our main results and the technical steps.

1.1 An example

In order to illustrate the range of new cases covered by our results, it is instructive to consider a simple example. A detailed discussion of this calculation can be found in [11]. The example corresponds to a Gaussian random design, i.e., the rows $X_1^T, \ldots, X_n^T$ are i.i.d. realizations of a $p$-variate normal distribution with mean zero. We write $X_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,p})^T$ for the components of $X_i$. The response variable is linearly related to the first $s_0$ covariates:

$Y_i = \theta_{0,1} X_{i,1} + \theta_{0,2} X_{i,2} + \cdots + \theta_{0,s_0} X_{i,s_0} + W_i$,

where $W_i \sim N(0, \sigma^2)$ and we assume $\theta_{0,i} > 0$ for all $i \le s_0$. In particular, $S = \{1, \ldots, s_0\}$. As for the design matrix, the first $p-1$ covariates are orthogonal at the population level, i.e., the $X_{i,j} \sim N(0,1)$ are independent for $1 \le j \le p-1$ (and $1 \le i \le n$). However, the $p$-th covariate is correlated with the $s_0$ relevant ones:

$X_{i,p} = a X_{i,1} + a X_{i,2} + \cdots + a X_{i,s_0} + b \tilde X_{i,p}$.

Here $\tilde X_{i,p} \sim N(0,1)$ is independent of $\{X_{i,1}, \ldots, X_{i,p-1}\}$ and represents the orthogonal component of the $p$-th covariate. We choose the coefficients $a, b \ge 0$ such that $s_0 a^2 + b^2 = 1$, whence $E\{X_{i,p}^2\} = 1$ and hence the $p$-th covariate is normalized like the first $p-1$ ones. In other words, the rows of $X$ are i.i.d. Gaussian, $X_i \sim N(0, \Sigma)$, with covariance given by

$\Sigma_{ij} = 1$ if $i = j$;  $\Sigma_{ij} = a$ if $i = p, j \in S$ or $i \in S, j = p$;  $\Sigma_{ij} = 0$ otherwise.
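The covariance structure just described can be generated as follows (an illustrative Python sketch; the sample size, dimensions, and the particular value of $a$ are hypothetical choices satisfying $s_0 a^2 \le 1$):

```python
import numpy as np

# Covariates 1, ..., p-1 are independent N(0,1); the p-th covariate equals
# a * (X_1 + ... + X_{s0}) + b * Z with Z ~ N(0,1) independent, and b chosen
# so that s0 * a^2 + b^2 = 1 (unit variance for the p-th covariate).
rng = np.random.default_rng(1)
n, p, s0 = 2000, 50, 5
a = 0.5 / np.sqrt(s0)              # any a with s0 * a^2 <= 1 is admissible
b = np.sqrt(1.0 - s0 * a ** 2)

X = rng.standard_normal((n, p))
X[:, p - 1] = a * X[:, :s0].sum(axis=1) + b * X[:, p - 1]

# Empirical check of the population covariance Sigma:
# Sigma[p-1, j] = a for j in S, and Sigma[p-1, p-1] = 1.
Sigma_hat = (X.T @ X) / n
```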
For $a = 0$, this is the standard i.i.d. design and irrepresentability holds. The Lasso correctly recovers the support $S$ from $n \ge c s_0 \log p$ samples, provided $\theta_{min} \ge c' \sqrt{(\log p)/n}$. It follows from [21] that this remains true as long as $a \le (1-\eta)/s_0$ for some $\eta > 0$ bounded away from 0. However, as soon as $a > 1/s_0$, the Lasso includes the $p$-th covariate in the estimated model, with high probability. On the other hand, the Gauss-Lasso is successful for a significantly larger set of values of $a$: if

$a \in [0, (1-\eta)/s_0] \cup (1/s_0, (1-\eta)/\sqrt{s_0}]$,

then it recovers $S$ from $n \ge c s_0 \log p$ samples, provided $\theta_{min} \ge c' \sqrt{(\log p)/n}$. While the interval $((1-\eta)/s_0, 1/s_0]$ is not covered by this result, we expect this to be due to the proof technique rather than to an intrinsic limitation of the Gauss-Lasso selector.

1.2 Further related work

The restricted isometry property [7, 6] (or the related restricted eigenvalue [2] and compatibility [19] conditions) has been used to establish guarantees on the estimation and model-selection errors of the Lasso and similar approaches. In particular, Bickel, Ritov and Tsybakov [2] show that, under such conditions, with high probability,

$\|\hat\theta - \theta_0\|_2^2 \le C \sigma^2 \frac{s_0 \log p}{n}$.

The same conditions can be used to prove model-selection guarantees. In particular, Zhou [24] studies a multi-step thresholding procedure whose first steps coincide with the Gauss-Lasso. While the main objective of that work is to prove high-dimensional $\ell_2$ consistency with a sparse estimated model, the author also proves partial model-selection guarantees: the method correctly recovers a subset of large coefficients $S_L \subseteq S$, provided $|\theta_{0,i}| \ge c\sigma\sqrt{s_0 (\log p)/n}$ for $i \in S_L$.
This\n\nmeans that the coef\ufb01cients that are guaranteed to be detected must be a factor\nis required by our results.\nAn alternative approach to establishing model-selection guarantees assumes a suitable mutual\nincoherence conditions. Lounici [13] proves correct model selection under the assumption\n\nmaxi(cid:54)=j |(cid:98)\u03a3ij| = O(1/s0). This assumption is however stronger than irrepresentability [19]. Cand\u00b4es\nmaxi(cid:54)=j |(cid:98)\u03a3ij| = O(1/(log p)). Under this condition, they establish model selection guarantees\nfor an ideal scaling of the non-zero coef\ufb01cients \u03b8min \u2265 c\u03c3(cid:112)(log p)/n. However, this result only\n\nand Plan [4] also assume mutual incoherence, albeit with a much weaker requirement, namely\n\n\u221a\n\ns0 larger than what\n\nholds with high probability for a \u2018random signal model\u2019 in which the non-zero coef\ufb01cients \u03b80,i have\nuniformly random signs.\nThe authors in [22] consider the variable selection problem, and under the same assumptions on\nthe non-zero coef\ufb01cients as in the present paper, guarantee support recovery under a cone condition.\nThe latter condition however is stronger than the generalized irrepresentability condition. In partic-\nular, for the example in Section 1.1 it yields no improvement over the standard irrepresentability.\nThe work [20] studies the adaptive and the thresholded Lasso estimators and proves correct model\nselection assuming the non-zero coef\ufb01cients are of order s0\nFinally, model selection consistency can be obtained without irrepresentability through other meth-\nods. For instance [25] develops the adaptive Lasso, using a data-dependent weighted (cid:96)1 regular-\nization, and [1] proposes the Bolasso, a resampling-based techniques. 
Unfortunately, both of these approaches are only guaranteed to succeed in the low-dimensional regime of $p$ fixed and $n \to \infty$.

1.3 Notations

We provide a brief summary of the notations used throughout the paper. For a matrix $A$ and sets of indices $I, J$, we let $A_J$ denote the submatrix containing just the columns in $J$ and $A_{I,J}$ denote the submatrix formed by the rows in $I$ and columns in $J$. Likewise, for a vector $v$, $v_I$ is the restriction of $v$ to indices in $I$. Further, the notation $A_{I,I}^{-1}$ represents the inverse of $A_{I,I}$, i.e., $A_{I,I}^{-1} = (A_{I,I})^{-1}$. The maximum and minimum singular values of $A$ are denoted by $\sigma_{max}(A)$ and $\sigma_{min}(A)$, respectively. We write $\|v\|_p$ for the standard $\ell_p$ norm of a vector $v$; in particular, $\|v\|_0$ denotes the number of nonzero entries in $v$. Also, $\|A\|_p$ refers to the induced operator norm on a matrix $A$. We use $e_i$ to refer to the $i$-th standard basis element, e.g., $e_1 = (1, 0, \ldots, 0)$. For a vector $v$, $supp(v)$ represents the positions of nonzero entries of $v$. Throughout, we denote the rows of the design matrix $X$ by $X_1, \ldots, X_n \in R^p$ and its columns by $x_1, \ldots, x_p \in R^n$. Further, $sign(v)$ is the vector with entries $sign(v)_i = +1$ if $v_i > 0$, $sign(v)_i = -1$ if $v_i < 0$, and $sign(v)_i = 0$ otherwise.

2 Deterministic designs

An outline of this section is as follows. (1) We first consider the zero-noise problem $W = 0$, and prove several useful properties of the Lasso estimator in this case. In particular, we show that there exists a threshold for the regularization parameter below which the support of the Lasso estimator remains the same and contains $supp(\theta_0)$. Moreover, the Lasso estimator support is not much larger than $supp(\theta_0)$.
(2) We then turn to the noisy problem, and introduce the generalized irrepresentability condition (GIC), motivated by the properties of the Lasso in the zero-noise case. We prove that under GIC (and other technical conditions), with high probability, the signed support of the Lasso estimator is the same as that in the zero-noise problem. (3) We show that the Gauss-Lasso selector correctly recovers the signed support of $\theta_0$.

2.1 Zero-noise problem

Recall that $\hat\Sigma \equiv (X^T X/n)$ denotes the empirical covariance of the rows of the design matrix. Given $\hat\Sigma \in R^{p \times p}$, $\hat\Sigma \succeq 0$, $\theta_0 \in R^p$ and $\xi \in R_+$, we define the zero-noise Lasso estimator as

$\hat\theta^{ZN}(\xi) \equiv \arg\min_{\theta \in R^p} \{ \frac{1}{2} \langle (\theta - \theta_0), \hat\Sigma (\theta - \theta_0) \rangle + \xi\|\theta\|_1 \}$.   (5)

Note that $\hat\theta^{ZN}(\xi)$ is obtained by letting $Y = X\theta_0$ in the definition of $\hat\theta^n(Y, X; \xi)$. Following [2], we introduce a restricted eigenvalue constant for the empirical covariance matrix $\hat\Sigma$:

$\hat\kappa(s, c_0) \equiv \min_{J \subseteq [p],\, |J| \le s} \ \min_{u \in R^p,\, \|u_{J^c}\|_1 \le c_0 \|u_J\|_1} \frac{\langle u, \hat\Sigma u \rangle}{\|u\|_2^2}$.   (6)

Our first result states that $supp(\hat\theta^{ZN}(\xi))$ is not much larger than the support of $\theta_0$, for any $\xi > 0$.

Lemma 2.1. Let $\hat\theta^{ZN} = \hat\theta^{ZN}(\xi)$ be defined as per Eq. (5), with $\xi > 0$. Then, if $s_0 = \|\theta_0\|_0$,

$\|\hat\theta^{ZN}\|_0 \le \Big(1 + \frac{4\|\hat\Sigma\|_2}{\hat\kappa(s_0, 1)}\Big) s_0$.   (7)

Lemma 2.2. Let $\hat\theta^{ZN} = \hat\theta^{ZN}(\xi)$ be defined as per Eq. (5), with $\xi > 0$.
Then there exist $\xi_0 = \xi_0(\hat\Sigma, S, \theta_0) > 0$, $T_0 \subseteq [p]$, $v_0 \in \{-1, 0, +1\}^p$, such that the following happens. For all $\xi \in (0, \xi_0)$, $sign(\hat\theta^{ZN}(\xi)) = v_0$ and $supp(\hat\theta^{ZN}(\xi)) = supp(v_0) = T_0$. Further, $T_0 \supseteq S$, $v_{0,S} = sign(\theta_{0,S})$, and $\xi_0 = \min_{i \in S} |\theta_{0,i}/[\hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i|$.

Finally, we have the following standard characterization of the solution of the zero-noise problem.

Lemma 2.3. Let $\hat\theta^{ZN} = \hat\theta^{ZN}(\xi)$ be defined as per Eq. (5), with $\xi > 0$. Let $T \supseteq S$ and $v \in \{+1, 0, -1\}^p$ be such that $supp(v) = T$. Then $sign(\hat\theta^{ZN}) = v$ if and only if

$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} v_T\|_\infty \le 1$,   (8)

$v_T = sign(\theta_{0,T} - \xi \hat\Sigma_{T,T}^{-1} v_T)$.   (9)

Further, if the above holds, $\hat\theta^{ZN}$ is given by $\hat\theta^{ZN}_{T^c} = 0$ and $\hat\theta^{ZN}_T = \theta_{0,T} - \xi \hat\Sigma_{T,T}^{-1} v_T$.

Motivated by this result, we introduce the generalized irrepresentability condition (GIC) for deterministic designs.

Generalized irrepresentability (deterministic designs). The pair $(\hat\Sigma, \theta_0)$, $\hat\Sigma \in R^{p \times p}$, $\theta_0 \in R^p$, satisfies the generalized irrepresentability condition with parameter $\eta > 0$ if the following happens. Let $v_0, T_0$ be defined as per Lemma 2.2. Then

$\|\hat\Sigma_{T_0^c,T_0} \hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}\|_\infty \le 1 - \eta$.   (10)

In other words, we require the dual feasibility condition (8), which always holds, to hold with a positive slack $\eta$.

2.2 Noisy problem

Consider the noisy linear observation model as described in (2), and let $\hat r \equiv (X^T W/n)$.
We begin with a standard characterization of $sign(\hat\theta^n)$, the signed support of the Lasso estimator (3).

Lemma 2.4. Let $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ be defined as per Eq. (3), and let $z \in \{+1, 0, -1\}^p$ with $supp(z) = T$. Further assume $T \supseteq S$. Then the signed support of the Lasso estimator is given by $sign(\hat\theta^n) = z$ if and only if

$\|\hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} z_T + \frac{1}{\lambda}(\hat r_{T^c} - \hat\Sigma_{T^c,T} \hat\Sigma_{T,T}^{-1} \hat r_T)\|_\infty \le 1$,   (11)

$z_T = sign(\theta_{0,T} - \hat\Sigma_{T,T}^{-1}(\lambda z_T - \hat r_T))$.   (12)

Theorem 2.5. Consider the deterministic design model with empirical covariance matrix $\hat\Sigma \equiv (X^T X)/n$, and assume $\hat\Sigma_{i,i} \le 1$ for $i \in [p]$. Let $T_0 \subseteq [p]$, $v_0 \in \{+1, 0, -1\}^p$ be the set and vector defined in Lemma 2.2. Assume that: (i) $\sigma_{min}(\hat\Sigma_{T_0,T_0}) \ge C_{min} > 0$; (ii) the pair $(\hat\Sigma, \theta_0)$ satisfies the generalized irrepresentability condition with parameter $\eta$. Consider the Lasso estimator $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ defined as per Eq. (3), with $\lambda = (\sigma/\eta)\sqrt{2 c_1 \log p/n}$ for some constant $c_1 > 1$, and suppose that for some $c_2 > 0$:

$|\theta_{0,i}| \ge c_2 \lambda + \lambda |[\hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i|$  for all $i \in S$,   (13)

$|[\hat\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i| \ge c_2$  for all $i \in T_0 \setminus S$.   (14)

We further assume, without loss of generality, $\eta \le c_2 \sqrt{C_{min}}$. Then the following holds true:

$P\{sign(\hat\theta^n(\lambda)) = v_0\} \ge 1 - 4 p^{1-c_1}$.   (15)

Note that even under standard irrepresentability, this result improves over [21, Theorem 1.(b)], in that the required lower bound for $|\theta_{0,i}|$, $i \in S$, does not depend on $\|\hat\Sigma_{S,S}^{-1}\|_\infty$.

Remark 2.6.
Condition (i) in Theorem 2.5 requires the submatrix $\hat\Sigma_{T_0,T_0}$ to have minimum singular value bounded away from zero. Assuming $\hat\Sigma_{S,S}$ to be non-singular is necessary for identifiability. Requiring the minimum singular value of $\hat\Sigma_{T_0,T_0}$ to be bounded away from zero is not much more restrictive, since $T_0$ is comparable in size with $S$, as stated in Lemma 2.1.

We next show that the Gauss-Lasso selector correctly recovers the support of $\theta_0$.

Theorem 2.7. Consider the deterministic design model with empirical covariance matrix $\hat\Sigma \equiv (X^T X)/n$, and assume that $\hat\Sigma_{i,i} \le 1$ for $i \in [p]$. Under the assumptions of Theorem 2.5,

$P(\|\hat\theta^{GL} - \theta_0\|_\infty \ge \mu) \le 4 p^{1-c_1} + 2 p e^{-n C_{min} \mu^2/(2\sigma^2)}$.

In particular, if $\hat S$ is the model selected by the Gauss-Lasso, then $P(\hat S = S) \ge 1 - 6 p^{1-c_1/4}$.

3 Random Gaussian designs

In the previous section, we studied the case of deterministic design models, which allowed for a straightforward analysis. Here we consider the random design model, which requires a more involved analysis. Within the random Gaussian design model, the rows $X_i$ are distributed as $X_i \sim N(0, \Sigma)$ for some (unknown) covariance matrix $\Sigma \succ 0$. In order to study the performance of the Gauss-Lasso selector in this case, we first define the population-level estimator.
Given $\Sigma \in R^{p \times p}$, $\Sigma \succ 0$, $\theta_0 \in R^p$ and $\xi \in R_+$, the population-level estimator $\hat\theta^\infty(\xi) = \hat\theta^\infty(\xi; \theta_0, \Sigma)$ is defined as

$\hat\theta^\infty(\xi) \equiv \arg\min_{\theta \in R^p} \{ \frac{1}{2} \langle (\theta - \theta_0), \Sigma (\theta - \theta_0) \rangle + \xi\|\theta\|_1 \}$.   (16)

In fact, the population-level estimator is obtained by assuming that the response vector $Y$ is noiseless and $n = \infty$, hence replacing the empirical covariance $(X^T X/n)$ with the exact covariance $\Sigma$ in the Lasso optimization problem (3). Note that the population-level estimator $\hat\theta^\infty$ is deterministic, albeit $X$ is a random design. We show that, under some conditions on the covariance $\Sigma$ and the vector $\theta_0$, $T \equiv supp(\hat\theta^n) = supp(\hat\theta^\infty)$, i.e., the population-level estimator and the Lasso estimator share the same (signed) support; further, $T \supseteq S$. Since $\hat\theta^\infty$ (and hence $T$) is deterministic, $X_T$ is a Gaussian matrix with rows drawn independently from $N(0, \Sigma_{T,T})$. This observation allows for a simple analysis of the Gauss-Lasso selector $\hat\theta^{GL}$.

An outline of the section is as follows. (1) We begin by noting that the population-level estimator $\hat\theta^\infty(\xi)$ has properties similar to those of $\hat\theta^{ZN}(\xi)$ stated in Section 2.1. In particular, there exists a threshold $\xi_0$ such that, for all $\xi \in (0, \xi_0)$, $supp(\hat\theta^\infty(\xi))$ remains the same and contains $supp(\theta_0)$; moreover, $supp(\hat\theta^\infty(\xi))$ is not much larger than $supp(\theta_0)$. (2) We show that under GIC for the covariance matrix $\Sigma$ (and other sufficient conditions), with high probability, the signed support of the Lasso estimator is the same as the signed support of the population-level estimator.
(3) Following the previous steps, we show that the Gauss-Lasso selector correctly recovers the signed support of $\theta_0$.

3.1 The $n = \infty$ problem

Comparing Eqs. (5) and (16), the estimators $\hat\theta^{ZN}(\xi)$ and $\hat\theta^\infty(\xi)$ are defined in a very similar manner (the former is defined with respect to $\hat\Sigma$ and the latter with respect to $\Sigma$). It is easy to see that $\hat\theta^\infty$ satisfies the properties stated in Section 2.1 once we replace $\hat\Sigma$ with $\Sigma$.

3.2 The high-dimensional problem

We now consider the Lasso estimator (3). Recall the notations $\hat\Sigma \equiv (X^T X)/n$ and $\hat r \equiv (X^T W)/n$; note that $\hat\Sigma \in R^{p \times p}$ and $\hat r \in R^p$ are both random quantities in the case of random designs.

Theorem 3.1. Consider the Gaussian random design model with covariance matrix $\Sigma \succ 0$, and assume that $\Sigma_{i,i} \le 1$ for $i \in [p]$. Let $T_0 \subseteq [p]$, $v_0 \in \{+1, 0, -1\}^p$ be the deterministic set and vector defined in Lemma 2.2 (replacing $\hat\Sigma$ with $\Sigma$), and let $t_0 \equiv |T_0|$. Assume that: (i) $\sigma_{min}(\Sigma_{T_0,T_0}) \ge C_{min} > 0$; (ii) the pair $(\Sigma, \theta_0)$ satisfies the generalized irrepresentability condition with parameter $\eta$. Consider the Lasso estimator $\hat\theta^n = \hat\theta^n(y, X; \lambda)$ defined as per Eq. (3), with $\lambda = (4\sigma/\eta)\sqrt{c_1 \log p/n}$ for some constant $c_1 > 1$, and suppose that for some $c_2 > 0$:

$|\theta_{0,i}| \ge c_2 \lambda + \frac{3}{2} \lambda |[\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i|$  for all $i \in S$,   (17)

$|[\Sigma_{T_0,T_0}^{-1} v_{0,T_0}]_i| \ge 2 c_2$  for all $i \in T_0 \setminus S$.   (18)

We further assume, without loss of generality, $\eta \le c_2 \sqrt{C_{min}}$.
If n \u2265 max(M1, M2)t0 log p with\n\n3\n2\n\nT0,T0\n\nv0,T0]i\n\n|\u03b80,i| \u2265 c2\u03bb +\n\n\u03bb(cid:12)(cid:12)[\u03a3\u22121\n(cid:12)(cid:12)[\u03a3\u22121\nP(cid:110)\nsign((cid:98)\u03b8n(\u03bb)) = v0\n\n(cid:12)(cid:12)\n(cid:12)(cid:12) \u2265 2c2\n(cid:111) \u2265 1 \u2212 pe\u2212 n\n\nv0,T0]i\n\nT0,T0\n\nfor all i \u2208 S,\nfor all i \u2208 T0 \\ S.\n\n10 \u2212 6e\u2212 t0\n\n2 \u2212 8p1\u2212c1 .\n\nNote that even under standard irrepresentability, this result improves over [21, Theorem 3.(ii)], in\nthat the required lower bound for |\u03b80,i|, i \u2208 S, does not depend on (cid:107)\u03a3\nRemark 3.2. Condition (i) follows readily from the restricted eigenvalue constraint,\ni.e.,\n\u03ba\u221e(t0, 0) > 0. This is a reasonable assumption since T0 is not much larger than S0, as stated\n\n\u22121/2\nS,S (cid:107)\u221e.\nin Lemma 2.1 (replacing(cid:98)\u03a3 with \u03a3). Namely, s0 \u2264 t0 \u2264 (1 + 4(cid:107)\u03a3(cid:107)2/\u03ba(s0, 1))s0.\n\nBelow, we show that the Gauss-Lasso selector correctly recovers the signed support of \u03b80.\nTheorem 3.3. Consider the random Gaussian design model with covariance matrix \u03a3 (cid:31) 0,\nand assume that \u03a3i,i \u2264 1 for i \u2208 [p]. Under the assumptions of Theorem 3.1, and for\nn \u2265 max(M1, M2)t0 log p, we have\n\n10 + 6e\u2212 s0\n\n2 + 8p1\u2212c1 + 2pe\u2212nCmin\u00b52/2\u03c32\n\n.\n\n(17)\n\n(19)\n\n(20)\n\nP(cid:16)(cid:107)(cid:98)\u03b8GL \u2212 \u03b80(cid:107)\u221e \u2265 \u00b5\n\n(cid:17) \u2264 pe\u2212 n\nP((cid:98)S = S) \u2265 1 \u2212 p e\u2212 n\n\nMoreover, letting \u02c6S be the model returned by the Gauss-Lasso selector, we have\n\n10 \u2212 6 e\u2212 s0\n\n2 \u2212 10 p1\u2212c1 .\n\nRemark 3.4. [Detection level] Let \u03b8min \u2261 mini\u2208S |\u03b80,i| be the minimum magnitude of the non-\nzero entries of vector \u03b80. 
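As a concrete illustration of the two-stage procedure analyzed in Theorems 3.1 and 3.3, the following is a minimal numerical sketch, not the paper's exact construction: it solves the Lasso by plain cyclic coordinate descent, refits by ordinary least squares on the Lasso active set, and extracts the final model by thresholding. The toy problem sizes, the solver, and the threshold `mu` are illustrative assumptions; the theory above prescribes $\lambda$ and the detection level through the constants of the theorems.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=300):
    """Lasso via cyclic coordinate descent on
    (1/2n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_norm = (X ** 2).sum(axis=0) / n        # diagonal of X^T X / n
    r = y.copy()                               # running residual y - X theta
    for _ in range(n_iter):
        for j in range(p):
            # partial correlation for coordinate j, holding the others fixed
            rho = X[:, j] @ r / n + col_norm[j] * theta[j]
            new = soft_threshold(rho, lam) / col_norm[j]
            if new != theta[j]:
                r -= X[:, j] * (new - theta[j])
                theta[j] = new
    return theta

def gauss_lasso(X, y, lam, mu):
    """Two-stage Gauss-Lasso sketch: (1) Lasso gives the active set T;
    (2) ordinary least squares restricted to T; the returned model S_hat
    keeps the coordinates whose refitted magnitude exceeds mu."""
    theta_lasso = lasso_cd(X, y, lam)
    T = np.flatnonzero(theta_lasso != 0.0)     # Lasso active set
    theta_gl = np.zeros(X.shape[1])
    if T.size > 0:
        theta_gl[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    S_hat = np.flatnonzero(np.abs(theta_gl) >= mu)
    return theta_gl, S_hat

# Toy Gaussian design: identity covariance, s0 strong coefficients.
rng = np.random.default_rng(0)
n, p, s0, sigma = 200, 100, 5, 0.1
theta0 = np.zeros(p)
theta0[:s0] = 1.0
X = rng.standard_normal((n, p))
y = X @ theta0 + sigma * rng.standard_normal(n)
lam = 4 * sigma * np.sqrt(np.log(p) / n)       # same order as the theorem's lambda
theta_gl, S_hat = gauss_lasso(X, y, lam, mu=0.5)
```

On this well-conditioned instance the refitted coefficients on the true support are close to one, while spurious Lasso coordinates are shrunk to noise level by the OLS step, so thresholding recovers $S$ exactly.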
By Theorem 3.3, the Gauss-Lasso selector correctly recovers $\mathrm{supp}(\theta_0)$, with probability greater than $1 - p\, e^{-n/10} - 6\, e^{-s_0/2} - 10\, p^{1-c_1}$, if $n \ge \max(M_1, M_2)\, t_0 \log p$ and

$$\theta_{\min} \ge C \sigma \sqrt{\frac{\log p}{n}}\, \Big( 1 + \big\|\Sigma_{T_0,T_0}^{-1}\big\|_\infty \Big)\,, \qquad (20)$$

for some constant $C$. Note that Eq. (20) follows from Eqs. (17) and (18).

References

[1] F. R. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33–40. ACM, 2008.

[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37:1705–1732, 2009.

[3] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer-Verlag, 2011.

[4] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.

[5] E. Candès, J. K. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Inform. Theory, 52:489–509, 2006.

[6] E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35:2313–2351, 2007.

[7] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. on Inform. Theory, 51:4203–4215, 2005.

[8] S. Chen and D. Donoho. Examples of basis pursuit. In Proceedings of Wavelet Applications in Signal and Image Processing III, San Diego, CA, 1995.

[9] D. L. Donoho. Compressed sensing. IEEE Trans. on Inform. Theory, 52:489–509, April 2006.

[10] A. Javanmard and A. Montanari. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. arXiv preprint arXiv:1301.4240, 2013.

[11] A.
Javanmard and A. Montanari. Model selection for high-dimensional regression under the generalized irrepresentability condition. arXiv:1305.0355, 2013.

[12] K. Knight and W. Fu. Asymptotics for Lasso-type estimators. Annals of Statistics, pages 1356–1378, 2000.

[13] K. Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.

[14] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Ann. Statist., 34:1436–1462, 2006.

[15] J. Peng, J. Zhu, A. Bergamaschi, W. Han, D.-Y. Noh, J. R. Pollack, and P. Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4(1):53–77, 2010.

[16] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

[17] R. Tibshirani. Regression shrinkage and selection with the Lasso. J. Royal. Statist. Soc. B, 58:267–288, 1996.

[18] R. J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

[19] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electron. J. Statist., 3:1360–1392, 2009.

[20] S. van de Geer, P. Bühlmann, and S. Zhou. The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electron. J. Stat., 5:688–749, 2011.

[21] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming. IEEE Trans. on Inform. Theory, 55:2183–2202, 2009.

[22] F. Ye and C.-H. Zhang.
Rate minimaxity of the Lasso and Dantzig selector for the $\ell_q$ loss in $\ell_r$ balls. Journal of Machine Learning Research, 11:3519–3540, 2010.

[23] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

[24] S. Zhou. Thresholded Lasso for high dimensional variable selection and statistical estimation. arXiv:1002.1583v2, 2010.

[25] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.