Sparse recovery by thresholded non-negative least squares

Advances in Neural Information Processing Systems, pp. 1926-1934

Martin Slawski and Matthias Hein
Department of Computer Science
Saarland University
Campus E 1.1, Saarbrücken, Germany
{ms,hein}@cs.uni-saarland.de

Abstract

Non-negative data are commonly encountered in numerous fields, making non-negative least squares regression (NNLS) a frequently used tool. At least relative to its simplicity, it often performs rather well in practice. Serious doubts about its usefulness arise for modern high-dimensional linear models. Even in this setting, contrary to what first intuition may suggest, we show that for a broad class of designs, NNLS is resistant to overfitting and works excellently for sparse recovery when combined with thresholding, experimentally even outperforming $\ell_1$-regularization.
Since NNLS also circumvents the delicate choice of a regularization parameter, our findings suggest that NNLS may be the method of choice.

1 Introduction

Consider the linear regression model
$$y = X\beta^* + \varepsilon, \qquad (1)$$
where $y$ is a vector of observations, $X \in \mathbb{R}^{n \times p}$ a design matrix, $\varepsilon$ a vector of noise and $\beta^*$ a vector of coefficients to be estimated. Throughout this paper, we are concerned with a high-dimensional setting in which the number of unknowns $p$ is at least of the same order of magnitude as the number of observations $n$, i.e. $p = O(n)$ or even $p \gg n$, in which case one cannot hope to recover the target $\beta^*$ if it does not satisfy one of various kinds of sparsity constraints, the simplest being that $\beta^*$ is supported on $S = \{j : \beta^*_j \neq 0\}$, $|S| = s < n$. In this paper, we additionally assume that $\beta^*$ is non-negative, i.e. $\beta^* \in \mathbb{R}^p_+$. This constraint is particularly relevant, since non-negative data occur frequently, e.g. in the form of pixel intensity values of an image, time measurements, histograms or count data, and economic quantities such as prices, incomes and growth rates. Non-negativity constraints emerge in numerous deconvolution and unmixing problems in diverse fields such as acoustics [1], astronomical imaging [2], computer vision [3], genomics [4], proteomics [5] and spectroscopy [6]; see [7] for a survey. Sparse recovery of non-negative signals in a noiseless setting ($\varepsilon = 0$) has been studied in a series of recent papers [8, 9, 10, 11]. One important finding of this body of work is that non-negativity constraints alone may suffice for sparse recovery, without the need to employ sparsity-promoting $\ell_1$-regularization as is usually done.
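This noiseless phenomenon is easy to reproduce numerically. Below is a minimal sketch (assuming NumPy and SciPy are available; the dimensions and the uniform random design are illustrative choices, not a setup taken from the cited papers): plain NNLS, with no sparsity-promoting penalty at all, recovers a sparse non-negative target exactly from noiseless underdetermined observations.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5              # underdetermined system: p > n
X = rng.uniform(size=(n, p))       # entry-wise non-negative design
beta_star = np.zeros(p)
beta_star[:s] = 1.0 + rng.uniform(size=s)  # sparse, non-negative target
y = X @ beta_star                  # noiseless observations (eps = 0)

# min ||y - X b||_2 subject to b >= 0 -- no l1 regularization involved
beta_hat, _ = nnls(X, y)
```

With these dimensions the non-negativity constraint alone pins down the solution, and `beta_hat` matches `beta_star` up to numerical tolerance.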
The main contribution of the present paper is a transfer of this intriguing result to a more realistic noisy setup, contradicting the well-established paradigm that regularized estimation is necessary to cope with high dimensionality and to prevent over-adaptation to noise. More specifically, we study non-negative least squares (NNLS)
$$\min_{\beta \succeq 0} \frac{1}{n} \|y - X\beta\|_2^2 \qquad (2)$$
with minimizer $\hat{\beta}$, and its counterpart after hard thresholding, $\hat{\beta}(\lambda)$,
$$\hat{\beta}_j(\lambda) = \begin{cases} \hat{\beta}_j, & \hat{\beta}_j > \lambda, \\ 0, & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, p, \qquad (3)$$
where $\lambda \geq 0$ is a threshold, and state conditions under which it is possible to infer the support $S$ by $\hat{S}(\lambda) = \{j : \hat{\beta}_j(\lambda) > 0\}$. Classical work on the problem [12] gives a positive answer for fixed $p$, while if one follows the modern statistical trend, one would add a regularizer to (2) in order to encourage sparsity: the most popular approach is $\ell_1$-regularized least squares (the lasso, [13]), which is easy to implement and comes with strong theoretical guarantees with regard to prediction and estimation of $\beta^*$ in the $\ell_2$-norm over a broad range of designs (see [14] for a review). On the other hand, the rather restrictive 'irrepresentable condition' on the design is essentially necessary in order to infer the support $S$ from the sparsity pattern of the lasso [15, 16]. In view of its tendency to assign non-zero weights to elements of the off-support $S^c = \{1, \ldots, p\} \setminus S$, several researchers, e.g. [17, 18, 19], suggest applying hard thresholding to the lasso solution to achieve support recovery. In light of this, thresholding a non-negative least squares solution, provided it is close to the target w.r.t. the $\ell_\infty$-norm, is more attractive for at least two reasons: first, there is no need to carefully tune the amount of $\ell_1$-regularization prior to thresholding; second, one may hope to detect relatively small non-zero coefficients whose recovery is negatively affected by the bias of $\ell_1$-regularization.

Outline. We first prove a bound on the mean squared prediction error of the NNLS estimator, demonstrating that it may be resistant to overfitting. Section 3 contains our main results on sparse recovery with noise. Experiments providing strong support for our theoretical findings are presented in Section 4. Most of the proofs as well as technical definitions are relegated to the supplement.

Notation. Let $J, K$ be index sets. For a matrix $A \in \mathbb{R}^{n \times m}$, $A_J$ denotes the matrix obtained by extracting the columns corresponding to $J$. For $j = 1, \ldots, m$, $A_j$ denotes the $j$-th column of $A$. The matrix $A_{JK}$ is the sub-matrix of $A$ obtained by extracting the rows in $J$ and the columns in $K$. For $v \in \mathbb{R}^m$, $v_J$ is the sub-vector corresponding to $J$. The identity matrix is denoted by $I$ and vectors of ones by $\mathbf{1}$. The symbols $\preceq$ ($\prec$) and $\succeq$ ($\succ$) denote entry-wise (strict) inequalities. Lower- and uppercase $c$'s denote positive universal constants (not depending on $n$, $p$, $s$) whose values may differ from line to line.

Assumptions. We here fix what is assumed throughout the paper unless stated otherwise. Model (1) is assumed to hold. The matrix $X$ is assumed to be non-random and scaled s.t. $\|X_j\|_2^2 = n$ for all $j$. We assume that $\varepsilon$ has i.i.d. zero-mean sub-Gaussian entries with parameter $\sigma > 0$, cf.
supplement.

2 Prediction error and uniqueness of the solution

In the following, the quantity of interest is the mean squared prediction error (MSE) $\frac{1}{n}\|X\beta^* - X\hat{\beta}\|_2^2$.

NNLS does not necessarily overfit. It is well known that the MSE of ordinary least squares (OLS), as well as that of ridge regression, in general does not vanish unless $p/n \to 0$. Can one do better with non-negativity constraints? Obviously, the answer is negative for general $X$. To make this clear, let a design matrix $\widetilde{X}$ be given and set $X = [\widetilde{X} \ {-\widetilde{X}}]$ by concatenating $\widetilde{X}$ and $-\widetilde{X}$ column-wise. The non-negativity constraint is then vacuous in the sense that $X\hat{\beta} = X\hat{\beta}^{\mathrm{ols}}$, where $\hat{\beta}^{\mathrm{ols}}$ is any OLS solution. However, non-negativity constraints on $\beta$ can be strong when coupled with the following condition imposed on the Gram matrix $\Sigma = \frac{1}{n} X^\top X$.

Self-regularizing property. We call a design self-regularizing with universal constant $\kappa \in (0, 1]$ if
$$\beta^\top \Sigma \beta \geq \kappa (\mathbf{1}^\top \beta)^2 \quad \forall \beta \succeq 0. \qquad (4)$$
The term 'self-regularizing' refers to the fact that the quadratic form in $\Sigma$ restricted to the non-negative orthant acts like a regularizer arising from the design itself. Let us consider two examples:
(1) If $\Sigma \succeq \kappa_0 > 0$, i.e. all entries of the Gram matrix are at least $\kappa_0$, then (4) holds with $\kappa = \kappa_0$.
(2) If the Gram matrix is entry-wise non-negative and if the set of predictors indexed by $\{1, \ldots, p\}$ can be partitioned into subsets $B_1, \ldots, B_B$ such that $\min_{1 \leq b \leq B} \frac{1}{n} X_{B_b}^\top X_{B_b} \succeq \kappa_0$ entry-wise, then for all $\beta \succeq 0$
$$\beta^\top \Sigma \beta \geq \sum_{b=1}^{B} \beta_{B_b}^\top \frac{1}{n} X_{B_b}^\top X_{B_b}\, \beta_{B_b} \geq \kappa_0 \sum_{b=1}^{B} (\mathbf{1}^\top \beta_{B_b})^2 \geq \frac{\kappa_0}{B} (\mathbf{1}^\top \beta)^2.$$
In particular, this applies to design matrices whose entries $X_{ij} = \phi_j(u_i)$ contain the evaluations of non-negative functions $\{\phi_j\}_{j=1}^p$ traditionally used for data smoothing, such as splines, Gaussians and related 'localized' functions, at points $\{u_i\}_{i=1}^n$ in some fixed interval; see Figure 1.

For self-regularizing designs, the MSE of NNLS can be controlled as follows.

Theorem 1. Let $\Sigma$ fulfill the self-regularizing property with constant $\kappa$. Then, with probability no less than $1 - 2/p$, the NNLS estimator obeys
$$\frac{1}{n} \|X\beta^* - X\hat{\beta}\|_2^2 \leq \frac{8\sigma}{\kappa} \sqrt{\frac{2 \log p}{n}}\, \|\beta^*\|_1 + \frac{8\sigma^2}{\kappa} \frac{\log p}{n}.$$
The statement implies that for self-regularizing designs, NNLS is consistent in the sense that its MSE, which is of the order $O(\sqrt{\log(p)/n}\, \|\beta^*\|_1)$, may vanish as $n \to \infty$ even if the number of predictors $p$ scales up to sub-exponentially in $n$. It is important to note that exact sparsity of $\beta^*$ is not needed for Theorem 1 to hold. The rate is the same as for the lasso if no further assumptions on the design are made, a result that is essentially obtained in the pioneering work [20].

Figure 1: Block partitioning of 15 Gaussians into $B = 5$ blocks. The right part shows the corresponding pattern of the Gram matrix.

Figure 2: A polyhedral cone in $\mathbb{R}^3$ and its intersection with the simplex (right). The point $y$ is contained in a face (bold) with normal vector $w$, whereas $y'$ is not.

Uniqueness of the solution.
Considerable insight can be gained by looking at the NNLS problem (2) from the perspective of convex geometry. Denote by $C = X\mathbb{R}^p_+$ the polyhedral cone generated by the columns $\{X_j\}_{j=1}^p$ of $X$, which are henceforth assumed to be in general position in $\mathbb{R}^n$. As visualized in Figure 2, sparse recovery by non-negativity constraints can be analyzed by studying the face lattice of $C$ [9, 10, 11]. For $F \subseteq \{1, \ldots, p\}$, we say that $X_F \mathbb{R}^{|F|}_+$ is a face of $C$ if there exists a separating hyperplane with normal vector $w$ passing through the origin such that $\langle X_j, w \rangle > 0$ for $j \notin F$ and $\langle X_j, w \rangle = 0$ for $j \in F$. Sparse recovery in a noiseless setting ($\varepsilon = 0$) can then be characterized concisely by the following statement, which can essentially be found in prior work [9, 10, 11, 21].

Proposition 1. Let $y = X\beta^*$, where $\beta^* \succeq 0$ has support $S$, $|S| = s$. If $X_S \mathbb{R}^s_+$ is a face of $C$ and the columns of $X$ are in general position in $\mathbb{R}^n$, then the constrained linear system $X\beta = y$ s.t. $\beta \succeq 0$ has $\beta^*$ as its unique solution.

Proof. By definition, since $X_S \mathbb{R}^s_+$ is a face of $C$, there exists a $w \in \mathbb{R}^n$ s.t. $\langle X_j, w \rangle = 0$, $j \in S$, and $\langle X_j, w \rangle > 0$, $j \in S^c$. Assume that there is a second solution $\beta^* + \delta$, $\delta \neq 0$. Expand $X_S(\beta^*_S + \delta_S) + X_{S^c} \delta_{S^c} = y$. Since $\beta^*_{S^c} = 0$, feasibility requires $\delta_j \geq 0$, $j \in S^c$. Multiplying both sides by $w^\top$ yields $\sum_{j \in S^c} \langle X_j, w \rangle \delta_j = 0$. All inner products within the sum are positive, so we conclude that $\delta_{S^c} = 0$. General position then implies $\delta_S = 0$.

Given Theorem 1 and Proposition 1, we turn to uniqueness in the noisy case.

Corollary 1.
In the setting of Theorem 1, if $\|\beta^*\|_1 = o(\sqrt{n/\log(p)})$, then the NNLS solution $\hat{\beta}$ is unique with high probability.

Proof. Suppose first that $y \notin C = X\mathbb{R}^p_+$. Then $X\hat{\beta}$, the projection of $y$ on $C$, is contained in its boundary, i.e. in a lower-dimensional face. Using general position of the columns of $X$, Proposition 1 implies that $\hat{\beta}$ is unique. If $y$ were already contained in $C$, one would have $y = X\hat{\beta}$ and hence
$$\frac{1}{n} \|X\beta^* - X\hat{\beta}\|_2^2 = \frac{1}{n} \|X\beta^* - y\|_2^2 = \frac{1}{n} \|\varepsilon\|_2^2 = O(1), \text{ with high probability}, \qquad (5)$$
using concentration of measure of the norm of the sub-Gaussian random vector $\varepsilon$. With the assumed scaling for $\|\beta^*\|_1$, $\frac{1}{n}\|X\beta^* - X\hat{\beta}\|_2^2 = o(1)$ in view of Theorem 1, which contradicts (5).

3 Sparse recovery in the presence of noise

Proposition 1 states that support recovery requires $X_S \mathbb{R}^s_+$ to be a face of $X\mathbb{R}^p_+$, which is equivalent to the existence of a hyperplane separating $X_S \mathbb{R}^s_+$ from the rest of $C$. For the noisy case, mere separation is not enough; a quantification is needed, which is provided by the following two incoherence constants that are of central importance for our main result. Both are specific to NNLS and have not been used previously in the literature on sparse recovery.

Definition 1. For some fixed $S \subset \{1, \ldots, p\}$, the separating hyperplane constant is defined as
$$\hat{\tau}(S) = \max_{\tau, w}\ \tau \quad \text{s.t.} \quad \frac{1}{\sqrt{n}} X_S^\top w = 0, \quad \frac{1}{\sqrt{n}} X_{S^c}^\top w \succeq \tau \mathbf{1}, \quad \|w\|_2 \leq 1, \qquad (6)$$
which by duality equals
$$\hat{\tau}(S) = \min_{\theta \in \mathbb{R}^s,\ \lambda \in T^{p-s-1}} \frac{1}{\sqrt{n}} \|X_S \theta - X_{S^c} \lambda\|_2, \qquad (7)$$
where $T^{m-1} = \{v \in \mathbb{R}^m : v \succeq 0,\ \mathbf{1}^\top v = 1\}$ denotes the simplex in $\mathbb{R}^m$, i.e. $\hat{\tau}(S)$ equals the distance between the subspace spanned by $\{X_j\}_{j \in S}$ and the convex hull of $\{X_j\}_{j \in S^c}$. We denote by $\Pi_S$ and $\Pi_S^\perp$ the orthogonal projections on the subspace spanned by $\{X_j\}_{j \in S}$ and on its orthogonal complement, respectively, and set $Z = \Pi_S^\perp X_{S^c}$. One can equivalently express (7) as
$$\hat{\tau}^2(S) = \min_{\lambda \in T^{p-s-1}} \lambda^\top \frac{1}{n} Z^\top Z\, \lambda. \qquad (8)$$

The second incoherence constant we need can be traced back to the KKT optimality conditions of the NNLS problem. The role of the following quantity is best understood from (13) below.

Definition 2. For some fixed $S \subset \{1, \ldots, p\}$ and $Z = \Pi_S^\perp X_{S^c}$, $\hat{\omega}(S)$ is defined as
$$\hat{\omega}(S) = \min_{\emptyset \neq F \subseteq \{1, \ldots, p-s\}}\ \min_{v \in V(F)} \left\| \frac{1}{n} Z_F^\top Z_F\, v \right\|_\infty, \qquad V(F) = \{v \in \mathbb{R}^{|F|} : \|v\|_\infty = 1,\ v \succeq 0\}. \qquad (9)$$

In the supplement, we show that i) $\hat{\omega}(S) > 0 \Leftrightarrow \hat{\tau}(S) > 0 \Leftrightarrow X_S \mathbb{R}^s_+$ is a face of $C$, and ii) $\hat{\omega}(S) \leq 1$, with equality if $\{X_j\}_{j \in S}$ and $\{X_j\}_{j \in S^c}$ are orthogonal and $\frac{1}{n} X_{S^c}^\top X_{S^c}$ is entry-wise non-negative. Denoting the entries of $\Sigma = \frac{1}{n} X^\top X$ by $\sigma_{jk}$, $1 \leq j, k \leq p$, our main result additionally involves the constants
$$\mu(S) = \max_{j \in S} \max_{k \in S^c} |\sigma_{jk}|, \qquad \mu^+(S) = \max_{j \in S} \sum_{k \in S^c} |\sigma_{jk}|, \qquad \beta_{\min}(S) = \min_{j \in S} \beta^*_j,$$
$$K(S) = \max_{v : \|v\|_\infty = 1} \|\Sigma_{SS}^{-1} v\|_\infty, \qquad \phi_{\min}(S) = \min_{v : \|v\|_2 = 1} \|\Sigma_{SS} v\|_2. \qquad (10)$$

Theorem 2. Consider the thresholded NNLS estimator $\hat{\beta}(\lambda)$ defined in (3) with support $\hat{S}(\lambda)$.
(i) If $\lambda > \frac{2\sigma}{\hat{\tau}^2(S)} \sqrt{\frac{2 \log p}{n}}$ and $\beta_{\min}(S) > \tilde{\lambda}$, $\tilde{\lambda} = \lambda(1 + K(S)\mu(S)) + \frac{2\sigma}{\{\phi_{\min}(S)\}^{1/2}} \sqrt{\frac{2 \log p}{n}}$,
(ii) or if $\lambda > \frac{2\sigma}{\hat{\omega}(S)} \sqrt{\frac{2 \log p}{n}}$ and $\beta_{\min}(S) > \tilde{\lambda}$, $\tilde{\lambda} = \lambda(1 + K(S)\mu^+(S)) + \frac{2\sigma}{\{\phi_{\min}(S)\}^{1/2}} \sqrt{\frac{2 \log p}{n}}$,
then $\|\hat{\beta}(\lambda) - \beta^*\|_\infty \leq \tilde{\lambda}$ and $\hat{S}(\lambda) = S$ with probability no less than $1 - 10/p$.

Remark. The concept of a separating functional as in (6) is also used to show support recovery for the lasso [15, 16] as well as for orthogonal matching pursuit [22, 23].
The 'irrepresentable condition' employed in these works requires the existence of a separation constant $\gamma(S) > 0$ such that
$$\max_{j \in S^c} |X_j^\top X_S (X_S^\top X_S)^{-1} \operatorname{sign}(\beta^*_S)| \leq 1 - \gamma(S), \quad \text{while} \quad |X_j^\top X_S (X_S^\top X_S)^{-1} \operatorname{sign}(\beta^*_S)| = 1,\ j \in S,$$
hence $\{X_j\}_{j \in S}$ and $\{X_j\}_{j \in S^c}$ are separated by the functional $|\langle \cdot,\ X_S (X_S^\top X_S)^{-1} \operatorname{sign}(\beta^*_S) \rangle|$.

In order to prove Theorem 2, we need two lemmas first. The first one is immediate from the KKT optimality conditions of the NNLS problem.

Lemma 1. $\hat{\beta}$ is a minimizer of (2) if and only if there exists $F \subseteq \{1, \ldots, p\}$ such that
$$\frac{1}{n} X_j^\top (y - X\hat{\beta}) = 0 \text{ and } \hat{\beta}_j > 0,\ j \in F, \qquad \frac{1}{n} X_j^\top (y - X\hat{\beta}) \leq 0 \text{ and } \hat{\beta}_j = 0,\ j \in F^c.$$

The next lemma is crucial, since it permits us to decouple $\hat{\beta}_S$ from $\hat{\beta}_{S^c}$.

Lemma 2. Consider the two non-negative least squares problems
$$(P1): \min_{\beta^{(P1)} \succeq 0} \frac{1}{n} \|\Pi_S^\perp (\varepsilon - X_{S^c} \beta^{(P1)})\|_2^2, \qquad (P2): \min_{\beta^{(P2)} \succeq 0} \frac{1}{n} \|\Pi_S y - X_S \beta^{(P2)} - \Pi_S X_{S^c} \hat{\beta}^{(P1)}\|_2^2,$$
with minimizers $\hat{\beta}^{(P1)}$ of $(P1)$ and $\hat{\beta}^{(P2)}$ of $(P2)$, respectively. If $\hat{\beta}^{(P2)} \succ 0$, then setting $\hat{\beta}_S = \hat{\beta}^{(P2)}$ and $\hat{\beta}_{S^c} = \hat{\beta}^{(P1)}$ yields a minimizer $\hat{\beta}$ of the non-negative least squares problem (2).

Proof of Theorem 2. The proofs of parts (i) and (ii) overlap to a large extent. Steps specific to one of the two parts are preceded by '(i)' or '(ii)'. Consider problem $(P1)$ of Lemma 2.

Step 1: Controlling $\|\hat{\beta}^{(P1)}\|_1$ via $\hat{\tau}^2(S)$, controlling $\|\hat{\beta}^{(P1)}\|_\infty$ via $\hat{\omega}(S)$.
(i) With $\xi = \Pi_S^\perp \varepsilon$, since $\hat{\beta}^{(P1)}$ is a minimizer, it satisfies
$$\frac{1}{n} \|\xi - Z\hat{\beta}^{(P1)}\|_2^2 \leq \frac{1}{n} \|\xi\|_2^2 \ \Rightarrow\ (\hat{\beta}^{(P1)})^\top \frac{1}{n} Z^\top Z\, \hat{\beta}^{(P1)} \leq \|\hat{\beta}^{(P1)}\|_1 M, \quad M = \max_{1 \leq j \leq p-s} \frac{2}{n} |Z_j^\top \xi|. \qquad (11)$$
As observed in (8), $\hat{\tau}^2(S) = \min_{\lambda \in T^{p-s-1}} \lambda^\top \frac{1}{n} Z^\top Z\, \lambda$, s.t. the l.h.s. can be lower bounded via
$$(\hat{\beta}^{(P1)})^\top \frac{1}{n} Z^\top Z\, \hat{\beta}^{(P1)} \geq \min_{\lambda \in T^{p-s-1}} \left( \lambda^\top \frac{1}{n} Z^\top Z\, \lambda \right) \|\hat{\beta}^{(P1)}\|_1^2 = \hat{\tau}^2(S)\, \|\hat{\beta}^{(P1)}\|_1^2. \qquad (12)$$
Combining (11) and (12), we have $\|\hat{\beta}^{(P1)}\|_1 \leq M / \hat{\tau}^2(S)$.
(ii) In view of Lemma 1, there exists a set $F \subseteq \{1, \ldots, p-s\}$ (we may assume $F \neq \emptyset$, otherwise $\hat{\beta}^{(P1)} = 0$) such that $\hat{\beta}^{(P1)}_{F^c} = 0$ and $\frac{1}{n} Z_F^\top Z_F\, \hat{\beta}^{(P1)}_F = \frac{1}{n} Z_F^\top \xi$, which implies
$$\hat{\omega}(S)\, \|\hat{\beta}^{(P1)}\|_\infty \leq \min_{v \in V(F)} \left\| \frac{1}{n} Z_F^\top Z_F\, v \right\|_\infty \|\hat{\beta}^{(P1)}\|_\infty \leq \left\| \frac{1}{n} Z_F^\top Z_F\, \hat{\beta}^{(P1)}_F \right\|_\infty = \left\| \frac{1}{n} Z_F^\top \xi \right\|_\infty \leq M, \qquad (13)$$
where we have used Definition 2. We conclude that $\|\hat{\beta}^{(P1)}\|_\infty \leq M / \hat{\omega}(S)$.

Step 2: Back-substitution into $(P2)$. Equipped with the bounds just derived, we insert $\hat{\beta}^{(P1)}$ into problem $(P2)$ of Lemma 2, and show that, in conjunction with the assumptions made for the minimum support coefficient $\beta_{\min}(S)$, the ordinary least squares estimator corresponding to $(P2)$,
$$\bar{\beta}^{(P2)} = \operatorname*{argmin}_{\beta^{(P2)}} \frac{1}{n} \|\Pi_S y - X_S \beta^{(P2)} - \Pi_S X_{S^c} \hat{\beta}^{(P1)}\|_2^2,$$
has only positive components. Lemma 2 then yields $\bar{\beta}^{(P2)} = \hat{\beta}^{(P2)} = \hat{\beta}_S$. Using the closed-form expression for the ordinary least squares estimator, one obtains
$$\bar{\beta}^{(P2)} = \Sigma_{SS}^{-1} \frac{1}{n} X_S^\top (X_S \beta^*_S + \Pi_S \varepsilon - \Pi_S X_{S^c} \hat{\beta}^{(P1)}) = \beta^*_S + \Sigma_{SS}^{-1} \frac{1}{n} X_S^\top \varepsilon - \Sigma_{SS}^{-1} \Sigma_{S S^c} \hat{\beta}^{(P1)}.$$
It remains to control the deviation terms $\bar{M} = \|\Sigma_{SS}^{-1} \frac{1}{n} X_S^\top \varepsilon\|_\infty$ and $\|\Sigma_{SS}^{-1} \Sigma_{S S^c} \hat{\beta}^{(P1)}\|_\infty$. We have
$$\|\Sigma_{SS}^{-1} \Sigma_{S S^c} \hat{\beta}^{(P1)}\|_\infty \leq \max_{v : \|v\|_\infty = 1} \|\Sigma_{SS}^{-1} v\|_\infty\, \|\Sigma_{S S^c} \hat{\beta}^{(P1)}\|_\infty \stackrel{(10)}{\leq} K(S) \cdot \begin{cases} \mu(S)\, \|\hat{\beta}^{(P1)}\|_1 & \text{for (i)}, \\ \mu^+(S)\, \|\hat{\beta}^{(P1)}\|_\infty & \text{for (ii)}. \end{cases} \qquad (14)$$

Step 3: Putting together the pieces. The two random terms $M$ and $\bar{M}$ are maxima of a finite collection of sub-Gaussian random variables, which can be controlled using standard techniques. Since $\|Z_j\|_2 \leq \|X_j\|_2$ and $\|e_j^\top \Sigma_{SS}^{-1} X_S^\top / \sqrt{n}\|_2 \leq \{\phi_{\min}(S)\}^{-1/2}$ for all $j$, the sub-Gaussian parameters of these collections are upper bounded by $\sigma/\sqrt{n}$ and $\sigma/(\{\phi_{\min}(S)\}^{1/2} \sqrt{n})$, respectively. It follows that the two events $\{M \leq 2\sigma \sqrt{\frac{2 \log p}{n}}\}$ and $\{\bar{M} \leq \frac{2\sigma}{\{\phi_{\min}(S)\}^{1/2}} \sqrt{\frac{2 \log p}{n}}\}$ both hold with probability no less than $1 - 10/p$, cf. supplement. Subsequently, we work conditional on these two events. For the choice of $\lambda$ made for (i) and (ii), respectively, it follows that
$$\|\beta^* - \bar{\beta}^{(P2)}\|_\infty \leq \frac{2\sigma}{\{\phi_{\min}(S)\}^{1/2}} \sqrt{\frac{2 \log p}{n}} + \lambda K(S) \cdot \begin{cases} \mu(S) & \text{for (i)}, \\ \mu^+(S) & \text{for (ii)}, \end{cases}$$
and hence, using the lower bound on $\beta_{\min}(S)$, that $\bar{\beta}^{(P2)} = \hat{\beta}_S \succ 0$ and thus also that $\hat{\beta}^{(P1)} = \hat{\beta}_{S^c}$. Subsequent thresholding with the respective choices made for $\lambda$ yields the assertion. □

In the sequel, we apply Theorem 2 to specific classes of designs commonly studied in the literature, for which thresholded NNLS achieves an $\ell_\infty$-error of the optimal order $O(\sqrt{\log(p)/n})$. We here only provide sketches; detailed derivations are relegated to the supplement.

Example 1: Power decay. Let the entries of the Gram matrix $\Sigma$ be given by $\sigma_{jk} = \rho^{|j-k|}$, $1 \leq j, k \leq p$, $0 \leq \rho < 1$, so that the $\{X_j\}_{j=1}^p$ form a Markov random field in which $X_j$ is conditionally independent of $\{X_k\}_{k \notin \{j-1, j, j+1\}}$ given $\{X_{j-1}, X_{j+1}\}$, cf. [24]. The conditional independence structure implies that all entries of $Z^\top Z$ are non-negative, such that, using the definition of $\hat{\omega}(S)$,
$$\hat{\omega}(S) \geq \min_{1 \leq j \leq p-s}\ \min_{v \succeq 0,\, \|v\|_\infty = 1} \left| \frac{1}{n} Z_j^\top Z v \right| = \min_{1 \leq j \leq p-s} \left( \frac{1}{n} (Z^\top Z)_{jj} + \sum_{k \neq j} \min\left\{ \frac{1}{n} (Z^\top Z)_{jk},\ 0 \right\} \right),$$
where the sum on the r.h.s. vanishes; thus one computes $\hat{\omega}(S) \geq \min_{1 \leq j \leq p-s} \frac{1}{n} (Z^\top Z)_{jj} \geq 1 - \frac{2\rho^2}{1 + \rho^2}$ for all $S$. For the remaining constants in (10), one can show that $\Sigma_{SS}^{-1}$ is a band matrix of bandwidth no more than 3 for all choices of $S$, such that $\phi_{\min}(S)$ and $K(S)$ are uniformly lower and upper bounded, respectively, by constants depending on $\rho$ only. By the geometric series formula, $\mu^+(S) \leq \frac{2\rho}{1 - \rho}$.
In total, for a constant $C_\rho > 0$ depending on $\rho$ only, one obtains an $\ell_\infty$-error of the form
$$\|\hat{\beta}(\lambda) - \beta^*\|_\infty \leq C_\rho\, \sigma \sqrt{2 \log(p)/n}. \qquad (15)$$

Example 2: Equi-correlation. Suppose that $\sigma_{jk} = \rho$, $0 < \rho < 1$, for all $j \neq k$, and $\sigma_{jj} = 1$ for all $j$. For any $S$, one computes that the matrix $\frac{1}{n} Z^\top Z$ is of the same regular structure, with diagonal entries all equal to $1 - \delta$ and off-diagonal entries all equal to $\rho - \delta$, where $\delta = \rho^2 s / (1 + (s-1)\rho)$. Therefore, using (8), the separating hyperplane constant (7) can be computed in closed form:
$$\hat{\tau}^2(S) = \frac{(1 - \rho)\rho}{(s - 1)\rho + 1} + \frac{1 - \rho}{p - s} = O(s^{-1}). \qquad (16)$$
Arguing as in (12) in the proof of Theorem 2, this allows one to show that, with high probability,
$$\|\hat{\beta}_{S^c}\|_1 \leq \frac{2\sigma \sqrt{2 \log(p)/n}}{\hat{\tau}^2(S)} \leq \frac{((s - 1)\rho + 1)\, 2\sigma \sqrt{2 \log(p)/n}}{(1 - \rho)\rho}. \qquad (17)$$
On the other hand, using the same reasoning as in Example 1, $\hat{\omega}(S) \geq 1 - \delta = c_\rho > 0$, say. Choosing the threshold $\lambda = \frac{2\sigma}{\hat{\omega}(S)} \sqrt{\frac{2 \log p}{n}}$ as in part (ii) of Theorem 2 and combining the strong $\ell_1$-bound (17) on the off-support coefficients with a slight modification of the bound (14), together with $\phi_{\min}(S) = 1 - \rho$, yields again the desired optimal bound of the form (15).

Random designs. So far, the design matrix $X$ has been assumed to be fixed. Consider the following ensemble of random matrices:
$$\mathrm{Ens}_+ = \{X = (x_{ij}) : \{x_{ij},\ 1 \leq i \leq n,\ 1 \leq j \leq p\} \text{ i.i.d. from a sub-Gaussian distribution on } \mathbb{R}_+\}.$$
Among others, the class of sub-Gaussian distributions on $\mathbb{R}_+$ encompasses all distributions on a bounded subset of $\mathbb{R}_+$, e.g. the family of beta distributions (with the uniform distribution as a special case) on $[0, 1]$, Bernoulli distributions on $\{0, 1\}$, or more generally distributions on counts $\{0, 1, \ldots, K\}$ for some positive integer $K$. The ensemble $\mathrm{Ens}_+$ is well amenable to analysis, since after suitable re-scaling the corresponding population Gram matrix $\Sigma^* = \mathbb{E}[\frac{1}{n} X^\top X]$ has equi-correlation structure (Example 2): denoting the mean of the entries and of their squares by $\mu$ and $\mu_2$, respectively, we have $\Sigma^* = (\mu_2 - \mu^2) I + \mu^2 \mathbf{1}\mathbf{1}^\top$, such that re-scaling by $1/\sqrt{\mu_2}$ leads to equi-correlation with $\rho = \mu^2/\mu_2$. As shown above, the incoherence constant $\hat{\tau}^2(S)$, which gives rise to a strong bound on $\|\hat{\beta}_{S^c}\|_1$, scales favourably and can be computed in closed form. For random designs from $\mathrm{Ens}_+$, one additionally has to take into account the deviation between $\Sigma$ and $\Sigma^*$. Using tools from random matrix theory, we show that the deviation is moderate, of the order $O(\sqrt{\log(p)/n})$.

Theorem 3. Let $X$ be a random matrix from $\mathrm{Ens}_+$, scaled s.t. $\mathbb{E}[\frac{1}{n} X^\top X] = (1 - \rho) I + \rho \mathbf{1}\mathbf{1}^\top$ for some $\rho \in (0, 1)$. Fix an $S \subset \{1, \ldots, p\}$, $|S| \leq s$. Then there exist constants $c, c_1, c_2, c_3, C, C' > 0$ such that for all $n \geq C \log(p) s^2$,
$$\hat{\tau}^2(S) \geq c s^{-1} - C' \sqrt{\log(p)/n}$$
with probability no less than $1 - 3/p - \exp(-c_1 n) - 2\exp(-c_2 \log p) - \exp(-c_3 \log^{1/2}(p)\, s)$.

4 Experiments

Setup.
We randomly generate data $y = X\beta^* + \varepsilon$, where $\varepsilon$ has i.i.d. standard Gaussian entries. We consider two choices for the design $X$. For one set of experiments, the rows of $X$ are drawn i.i.d. from a Gaussian distribution whose covariance matrix has the power-decay structure of Example 1 with parameter $\rho = 0.7$. For the second set, we pick a representative of the class $\mathrm{Ens}_+$ by drawing each entry of $X$ uniformly from $[0, 1]$ and re-scaling s.t. the population Gram matrix $\Sigma^*$ has equi-correlation structure with $\rho = 3/4$. The target $\beta^*$ is generated by selecting its support $S$ uniformly at random and then setting $\beta^*_j = b \cdot \beta_{\min}(S)(1 + U_j)$, $j \in S$, where $\beta_{\min}(S) = C_\rho \sigma \sqrt{2 \log(p)/n}$, using upper bounds for the constant $C_\rho$ as used for Examples 1 and 2; the $\{U_j\}_{j \in S}$ are drawn i.i.d. uniformly from $[0, 1]$, and $b$ is a parameter controlling the signal strength. The experiments can be divided into two parts. In the first part, the parameter $b$ is kept fixed while the aspect ratio $p/n$ of $X$ and the fraction of sparsity $s/n$ vary. In the second part, $s/n$ is fixed to $0.2$, while $p/n$ and $b$ vary. When not fixed, $s/n \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$. The grid used for $b$ is chosen specific to the designs, calibrated such that the sparse recovery problems are sufficiently challenging. For the design from $\mathrm{Ens}_+$, $p/n \in \{2, 3, 5, 10\}$, whereas for power decay $p/n \in \{1.5, 2, 2.5, 3, 3.5, 4\}$, for reasons that become clear from the results. Each configuration is replicated 100 times for $n = 500$.

Comparison. Across these runs, we compare the probability of 'success' of thresholded NNLS (tNNLS), the non-negative lasso (NN$\ell_1$), the thresholded non-negative lasso (tNN$\ell_1$) and orthogonal matching pursuit (OMP, [22, 23]).
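Before turning to the comparison criteria, the data-generating recipe of the setup can be sketched for the $\mathrm{Ens}_+$ design as follows (a simplified sketch, assuming NumPy; the value of `C_rho` below is a placeholder, since the actual constant is design-dependent, and column-wise re-scaling to $\|X_j\|_2^2 = n$ stands in for the population-level re-scaling described above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 500, 1000, 100          # n = 500, p/n = 2, s/n = 0.2
sigma, b = 1.0, 0.5               # noise level and signal-strength parameter
C_rho = 1.0                       # placeholder for the design-dependent constant

# Ens_+ design: i.i.d. Uniform[0,1] entries, columns scaled to ||X_j||_2^2 = n
X = rng.uniform(size=(n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)

# Sparse non-negative target: support S uniform at random,
# beta*_j = b * beta_min(S) * (1 + U_j) with U_j ~ Uniform[0,1]
S = rng.choice(p, size=s, replace=False)
beta_min = C_rho * sigma * np.sqrt(2.0 * np.log(p) / n)
beta_star = np.zeros(p)
beta_star[S] = b * beta_min * (1.0 + rng.uniform(size=s))

y = X @ beta_star + sigma * rng.standard_normal(n)
```

A full replication would sweep $b$, $p/n$ and $s/n$ over the grids listed above and repeat each configuration 100 times.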
For a regularization parameter \u00b5 \u2265 0, NN(cid:96)1 is de\ufb01ned as a minimizer\n2 + \u00b51(cid:62)\u03b2. We also compare against the ordinary lasso (replacing 1(cid:62)\u03b2\nby (cid:107)\u03b2(cid:107)1 and removing the non-negativity constraint); since its performance is mostly nearly equal,\nability. \u2019Success\u2019 is de\ufb01ned as follows. For tNNLS, we have \u2019success\u2019 if minj\u2208S(cid:98)\u03b2j > maxj\u2208Sc (cid:98)\u03b2j,\npartially considerably worse than that of its non-negative counterpart (see the bottom right panel of\nFigure 4 for an example), the results are not shown in the remaining plots for the sake of better read-\ni.e. there exists a threshold that permits support recovery. For NN(cid:96)1, we set (cid:98)\u00b5 = 2(cid:107)X(cid:62)\u03b5/n(cid:107)\u221e,\nwhich is the empirical counterpart to \u00b50 = 2(cid:112)2 log(p)/n, the choice for the regularization param-\nthe whole set of solutions {(cid:98)\u03b2(\u00b5), \u00b5 \u2265(cid:98)\u00b5} using the non-negative lasso modi\ufb01cation of LARS [26]\n{(cid:98)\u03b2(\u00b5) : \u00b5 \u2208 [\u00b50 \u2227(cid:98)\u00b5, \u00b50 \u2228(cid:98)\u00b5]} and check whether minj\u2208S(cid:98)\u03b2j(\u00b5) > maxj\u2208Sc (cid:98)\u03b2j(\u00b5) holds for one\n\neter advocated in [14] to achieve the optimal rate for estimating \u03b2\u2217 in the (cid:96)2-norm, and compute\n\nand check whether the sparsity pattern of one of these solutions recovers S. For tNN(cid:96)1, we inspect\n\nof these solutions. For OMP, we check whether the support S is recovered in the \ufb01rst s steps. Note\nthat, when comparing tNNLS and tNN(cid:96)1, the lasso is given an advantage, since we optimize over a\nrange of solutions.\nRemark: We have circumvented the choice of the threshold \u03bb, which is crucial in practice. In a\nspeci\ufb01c application [5] the threshold is chosen in a signal-dependent way allowing domain experts\nto interpret \u03bb as signal-to-noise ratio. 
Alternatively, one can exploit that, under the conditions of Theorem 2, the s largest coefficients of β̂ are those of the support. Given a suitable data-driven estimate for s, e.g. that proposed in [25], λ can be chosen automatically.

Figure 3: Comparison of thresholded NNLS (red) and the thresholded non-negative lasso (blue) for the experiments with constant s/n, while b (abscissa) and p/n (symbols) vary.

Figure 4: Top: Comparison of thresholded NNLS (red) and the thresholded non-negative lasso (blue) for the experiments with constant b, while s/n (abscissa) and p/n (symbols) vary. Bottom left: Non-negative lasso without thresholding (blue) and orthogonal matching pursuit (magenta). Bottom right: Thresholded non-negative lasso (blue) and thresholded ordinary lasso (green).

Results. The approaches NNℓ1 and OMP are not competitive: both work only with rather moderate levels of sparsity, with a breakdown at s/n = 0.15 for power decay, as displayed in the bottom left panel of Figure 4. For the second design, the results are even worse. This is in accordance with the literature, where thresholding is proposed as a remedy [17, 18, 19]. Yet, for a wide range of configurations, tNNLS visibly outperforms tNNℓ1, a notable exception being power decay with larger values of p/n. This is in contrast to the design from Ens+, where even p/n = 10 can be handled. This difference requires further research.

Conclusion. To deal with higher levels of sparsity, thresholding seems to be inevitable. Thresholding the biased solution obtained by ℓ1-regularization requires a proper choice of the regularization parameter and is likely to be inferior to thresholded NNLS with regard to the detection of small signals.
The experimental results provide strong support for the central message of the paper: even in high-dimensional, noisy settings, non-negativity constraints can be unexpectedly powerful when interacting with 'self-regularizing' properties of the design. While this has previously been observed empirically, our results provide a solid theoretical understanding of this phenomenon. A natural question is whether this finding can be transferred to other kinds of 'simple constraints' (e.g. box constraints) that are commonly imposed.

References
[1] Y. Lin, D. Lee, and L. Saul. Nonnegative deconvolution for time of arrival estimation. In ICASSP, 2004.
[2] J. Bardsley and J. Nagy. Covariance-preconditioned iterative methods for nonnegatively constrained astronomical imaging. SIAM Journal on Matrix Analysis and Applications, 27:1184–1198, 2006.
[3] A. Szlam, Z. Guo, and S. Osher. A split Bregman method for non-negative sparsity penalized least squares with applications to hyperspectral demixing. In IEEE International Conference on Image Processing, 2010.
[4] L. Li and T. Speed. Parametric deconvolution of positive spike trains. The Annals of Statistics, 28:1279–1301, 2000.
[5] M. Slawski and M. Hein.
Sparse recovery for protein mass spectrometry data. In NIPS Workshop on Practical Applications of Sparse Modelling, 2010.
[6] D. Donoho, I. Johnstone, J. Hoch, and A. Stern. Maximum entropy and the nearly black object. Journal of the Royal Statistical Society Series B, 54:41–81, 1992.
[7] D. Chen and R. Plemmons. Nonnegativity constraints in numerical analysis. In Symposium on the Birth of Numerical Analysis, 2007.
[8] A. Bruckstein, M. Elad, and M. Zibulevsky. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Transactions on Information Theory, 54:4813–4820, 2008.
[9] D. Donoho and J. Tanner. Counting the faces of randomly-projected hypercubes and orthants, with applications. Discrete and Computational Geometry, 43:522–541, 2010.
[10] M. Wang and A. Tang. Conditions for a unique non-negative solution to an underdetermined system. In Proceedings of the Allerton Conference on Communication, Control, and Computing, 2009.
[11] M. Wang, W. Xu, and A. Tang. A unique nonnegative solution to an underdetermined system: from vectors to matrices. IEEE Transactions on Signal Processing, 59:1007–1016, 2011.
[12] C. Liew. Inequality constrained least-squares estimation. Journal of the American Statistical Association, 71:746–751, 1976.
[13] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267–288, 1996.
[14] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. The Electronic Journal of Statistics, 3:1360–1392, 2009.
[15] P. Zhao and B. Yu. On model selection consistency of the lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.
[16] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso).
IEEE Transactions on Information Theory, 55:2183–2202, 2009.
[17] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37:246–270, 2009.
[18] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37:2109–2144, 2009.
[19] S. Zhou. Thresholding procedures for high dimensional variable selection and statistical estimation. In NIPS, 2009.
[20] E. Greenshtein and Y. Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10:971–988, 2004.
[21] D. Donoho and J. Tanner. Sparse nonnegative solution of underdetermined linear equations by linear programming. Proceedings of the National Academy of Sciences, 102:9446–9451, 2005.
[22] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50:2231–2242, 2004.
[23] T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.
[24] H. Rue and L. Held. Gaussian Markov Random Fields. Chapman and Hall/CRC, Boca Raton, 2005.
[25] C. Genovese, J. Jin, and L. Wasserman. Revisiting marginal regression. Technical report, Carnegie Mellon University, 2009. http://arxiv.org/abs/0911.4080.
[26] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–499, 2004.