{"title": "The LASSO risk: asymptotic results and real world examples", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 153, "abstract": "We consider the problem of learning a coefficient vector x0 from noisy linear observation y=Ax0+w. In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator. In this case, a popular approach consists in solving an l1-penalized least squares problem known as the LASSO or BPDN.  For sequences of matrices A of increasing dimensions, with iid gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic risk of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm, that is inspired from graphical models ideas.   Through simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.", "full_text": "The LASSO risk: asymptotic results and real world examples\n\nMohsen Bayati\nStanford University\nbayati@stanford.edu\n\nJos\u00e9 Bento\nStanford University\njbento@stanford.edu\n\nAndrea Montanari\nStanford University\nmontanar@stanford.edu\n\nAbstract\n\nWe consider the problem of learning a coefficient vector x0 \u2208 RN from noisy linear observation y = Ax0 + w \u2208 Rn. In many contexts (ranging from model selection to image processing) it is desirable to construct a sparse estimator $\\hat{x}$. In this case, a popular approach consists in solving an \u21131-penalized least squares problem known as the LASSO or Basis Pursuit DeNoising (BPDN).\nFor sequences of matrices A of increasing dimensions, with independent gaussian entries, we prove that the normalized risk of the LASSO converges to a limit, and we obtain an explicit expression for this limit. Our result is the first rigorous derivation of an explicit formula for the asymptotic mean square error of the LASSO for random instances. The proof technique is based on the analysis of AMP, a recently developed efficient algorithm that is inspired by ideas from graphical models.\nThrough simulations on real data matrices (gene expression data and hospital medical records) we observe that these results can be relevant in a broad array of practical applications.\n\n1 Introduction\n\nLet x0 \u2208 RN be an unknown vector, and assume that a vector y \u2208 Rn of noisy linear measurements of x0 is available. The problem of reconstructing x0 from such measurements arises in a number of disciplines, ranging from statistical learning to signal processing. In many contexts the measurements are modeled by\n\n$$y = A x_0 + w , \\qquad (1.1)$$\n\nwhere A \u2208 Rn\u00d7N is a known measurement matrix, and w is a noise vector.\nThe LASSO or Basis Pursuit Denoising (BPDN) is a method for reconstructing the unknown vector x0 given y, A, and is particularly useful when one seeks sparse solutions. For given A, y, one considers the cost functions CA,y : RN \u2192 R defined by\n\n$$C_{A,y}(x) = \\frac{1}{2} \\|y - Ax\\|^2 + \\lambda \\|x\\|_1 , \\qquad (1.2)$$\n\nwith \u03bb > 0. The original signal is estimated by\n\n$$\\hat{x}(\\lambda; A, y) = \\mathrm{argmin}_x \\, C_{A,y}(x) . \\qquad (1.3)$$\n\nIn what follows we shall often omit the arguments A, y (and occasionally \u03bb) from the above notations. We will also use $\\hat{x}(\\lambda; N)$ to emphasize the N-dependence. Further $\\|v\\|_p \\equiv (\\sum_{i=1}^m |v_i|^p)^{1/p}$ denotes the \u2113p-norm of a vector v \u2208 Rm (the subscript p will often be omitted if p = 2).\nA large and rapidly growing literature is devoted to (i) developing fast algorithms for solving the optimization problem (1.3); (ii) characterizing the performance and optimality of the estimator $\\hat{x}$. We refer to Section 1.3 for an unavoidably incomplete overview.\n\nDespite such substantial effort, and many remarkable achievements, our understanding of (1.3) is not even comparable to the one we have of more classical topics in statistics and estimation theory. For instance, the best bound on the mean square error (MSE) of the estimator (1.3), i.e. on the quantity $N^{-1} \\|\\hat{x} - x_0\\|^2$, was proved by Candes, Romberg and Tao [CRT06] (who in fact did not consider the LASSO but a related optimization problem). Their result estimates the mean square error only up to an unknown numerical multiplicative factor. Work by Candes and Tao [CT07] on the analogous Dantzig selector upper bounds the mean square error up to a factor C log N, under somewhat different assumptions.\n\nThe objective of this paper is to complement this type of \u2018rough but robust\u2019 bounds by proving asymptotically exact expressions for the mean square error. 
Our asymptotic result holds almost surely for sequences of random matrices A with fixed aspect ratio and independent gaussian entries. While this setting is admittedly specific, the careful study of such matrix ensembles has a long tradition both in statistics and communications theory and has spurred many insights [Joh06, Tel99]. Further, our main result provides asymptotically exact expressions for other operating characteristics of the LASSO as well (e.g., False Positive Rate and True Positive Rate). We carried out simulations on real data matrices with continuous entries (gene expression data) and binary feature matrices (hospital medical records). The results appear to be quite encouraging.\n\nAlthough our rigorous results are asymptotic in the problem dimensions, numerical simulations have shown that they are accurate already on problems with a few hundreds of variables. Further, they seem to enjoy a remarkable universality property and to hold for a fairly broad family of matrices [DMM10]. Both these phenomena are analogous to ones in random matrix theory, where delicate asymptotic properties of gaussian ensembles were subsequently proved to hold for much broader classes of random matrices. Also, asymptotic statements in random matrix theory have been replaced over time by concrete probability bounds in finite dimensions. Of course the optimization problem (1.2) is not immediately related to spectral properties of the random matrix A. As a consequence, universality and non-asymptotic results in random matrix theory cannot be directly exported to the present problem. Nevertheless, we expect such developments to be foreseeable.\n\nOur proof is based on the analysis of an efficient iterative algorithm first proposed by [DMM09], and called AMP, for approximate message passing. 
The algorithm is inspired by belief-propagation on graphical models, although the resulting iteration is significantly simpler (and scales linearly in the number of nodes). Extensive simulations [DMM10] showed that, in a number of settings, the performance of AMP is statistically indistinguishable from that of the LASSO, while its complexity is essentially as low as that of the simplest greedy algorithms.\n\nThe proof technique just described is new. Earlier literature analyzes the convex optimization problem (1.3), or similar problems, by a clever construction of an approximate optimum, or of a dual witness. Such constructions are largely explicit. Here instead we prove an asymptotically exact characterization of a rather non-trivial iterative algorithm. The algorithm is then proved to converge to the exact optimum. Due to limited space in this paper we only state the main steps of the proof. More details are available in [BM10b].\n\n1.1 Definitions\n\nIn order to define the AMP algorithm, we denote by \u03b7 : R \u00d7 R+ \u2192 R the soft thresholding function\n\n$$\\eta(x; \\theta) = \\begin{cases} x - \\theta & \\text{if } x > \\theta, \\\\ 0 & \\text{if } -\\theta \\le x \\le \\theta, \\\\ x + \\theta & \\text{otherwise.} \\end{cases} \\qquad (1.4)$$\n\nThe algorithm constructs a sequence of estimates xt \u2208 RN , and residuals zt \u2208 Rn, according to the iteration\n\n$$x^{t+1} = \\eta(A^* z^t + x^t; \\theta_t), \\qquad z^t = y - A x^t + \\frac{\\|x^t\\|_0}{n} z^{t-1} , \\qquad (1.5)$$\n\ninitialized with $x^0 = 0$. Here A\u2217 denotes the transpose of matrix A, and $\\|x^t\\|_0$ is the number of non-zero entries of xt. Given a scalar function f and a vector u \u2208 Rm, we let f (u) denote the vector (f (u1), . . . , f (um)) \u2208 Rm obtained by applying f componentwise. 
Finally $\\langle u \\rangle \\equiv m^{-1} \\sum_{i=1}^m u_i$ is the average of the vector u \u2208 Rm.\n\nAs already mentioned, we will consider sequences of instances of increasing sizes, along which the LASSO behavior has a non-trivial limit.\nDefinition 1. The sequence of instances {x0(N ), w(N ), A(N )}N \u2208N indexed by N is said to be a converging sequence if x0(N ) \u2208 RN , w(N ) \u2208 Rn, A(N ) \u2208 Rn\u00d7N with n = n(N ) such that n/N \u2192 \u03b4 \u2208 (0,\u221e), and in addition the following conditions hold:\n(a) The empirical\u00b9 distribution of the entries of x0(N ) converges weakly to a probability measure pX0 on R with bounded second moment. Further $N^{-1} \\sum_{i=1}^N x_{0,i}(N)^2 \\to E_{p_{X_0}}\\{X_0^2\\}$.\n(b) The empirical distribution of the entries of w(N ) converges weakly to a probability measure pW on R with bounded second moment. Further $n^{-1} \\sum_{i=1}^n w_i(N)^2 \\to E_{p_W}\\{W^2\\}$.\n(c) If {ei}1\u2264i\u2264N , ei \u2208 RN denotes the standard basis, then $\\max_{i \\in [N]} \\|A(N) e_i\\|_2$, $\\min_{i \\in [N]} \\|A(N) e_i\\|_2 \\to 1$, as N \u2192 \u221e where [N ] \u2261 {1, 2, . . . , N}.\nFor a converging sequence of instances, and an arbitrary sequence of thresholds {\u03b8t}t\u22650 (independent of N ), the asymptotic behavior of the recursion (1.5) can be characterized as follows.\nDefine the sequence $\\{\\tau_t^2\\}_{t \\ge 0}$ by setting $\\tau_0^2 = \\sigma^2 + E\\{X_0^2\\}/\\delta$ (for X0 \u223c pX0 and $\\sigma^2 \\equiv E\\{W^2\\}$, W \u223c pW ) and letting, for all t \u2265 0: $\\tau_{t+1}^2 = F(\\tau_t^2, \\theta_t)$ with\n\n$$F(\\tau^2, \\theta) \\equiv \\sigma^2 + \\frac{1}{\\delta} E\\{ [\\eta(X_0 + \\tau Z; \\theta) - X_0]^2 \\} ,$$\n\nwhere Z \u223c N(0, 1) is independent of X0. Notice that the function F depends on the law pX0.\nWe say a function \u03c8 : R2 \u2192 R is pseudo-Lipschitz if there exists a constant L > 0 such that for all x, y \u2208 R2: $|\\psi(x) - \\psi(y)| \\le L(1 + \\|x\\|_2 + \\|y\\|_2) \\|x - y\\|_2$. 
(This is a special case of the definition used in [BM10a] where such a function is called pseudo-Lipschitz of order 2.)\n\nThe next theorem was conjectured in [DMM09] and proved in [BM10a]. It shows that the behavior of AMP can be tracked by the above one dimensional recursion. We often refer to this prediction as state evolution.\nTheorem 1 ([BM10a]). Let {x0(N ), w(N ), A(N )}N \u2208N be a converging sequence of instances with the entries of A(N ) iid normal with mean 0 and variance 1/n and let \u03c8 : R \u00d7 R \u2192 R be a pseudo-Lipschitz function. Then, almost surely\n\n$$\\lim_{N \\to \\infty} \\frac{1}{N} \\sum_{i=1}^N \\psi(x_i^{t+1}, x_{0,i}) = E\\{ \\psi(\\eta(X_0 + \\tau_t Z; \\theta_t), X_0) \\} , \\qquad (1.6)$$\n\nwhere Z \u223c N(0, 1) is independent of X0 \u223c pX0.\nIn order to establish the connection with the LASSO, a specific policy has to be chosen for the thresholds {\u03b8t}t\u22650. Throughout this paper we will take \u03b8t = \u03b1\u03c4t with \u03b1 fixed. In other words, the sequence {\u03c4t}t\u22650 is given by the recursion $\\tau_{t+1}^2 = F(\\tau_t^2, \\alpha \\tau_t)$. This choice enjoys several convenient properties [DMM09].\n\n1.2 Main result\n\nBefore stating our results, we have to describe a calibration mapping between \u03b1 and \u03bb that was introduced in [DMM10] (Propositions 2, 3 and Corollary 4). Their proofs are presented in [BM10b].\n\nLet us start by stating some convenient properties of the state evolution recursion.\nProposition 2 ([DMM09]). Let \u03b1min = \u03b1min(\u03b4) be the unique non-negative solution of the equation\n\n$$(1 + \\alpha^2) \\Phi(-\\alpha) - \\alpha \\phi(\\alpha) = \\frac{\\delta}{2} ,$$\n\nwith $\\phi(z) \\equiv e^{-z^2/2}/\\sqrt{2\\pi}$ the standard gaussian density and $\\Phi(z) \\equiv \\int_{-\\infty}^z \\phi(x) \\, dx$.\nFor any \u03c32 > 0, \u03b1 > \u03b1min(\u03b4), the fixed point equation $\\tau^2 = F(\\tau^2, \\alpha \\tau)$ admits a unique solution. Denoting by \u03c4\u2217 = \u03c4\u2217(\u03b1) this solution, we have limt\u2192\u221e \u03c4t = \u03c4\u2217(\u03b1). Further the convergence takes place for any initial condition and is monotone. Finally $|\\frac{dF}{d\\tau^2}(\\tau^2, \\alpha \\tau)| < 1$ at \u03c4 = \u03c4\u2217.\n\n\u00b9 The probability distribution that puts a point mass 1/N at each of the N entries of the vector.\n\nWe then define the function \u03b1 \u21a6 \u03bb(\u03b1) on (\u03b1min(\u03b4),\u221e), by\n\n$$\\lambda(\\alpha) \\equiv \\alpha \\tau_* \\Big[ 1 - \\frac{1}{\\delta} P\\{ |X_0 + \\tau_* Z| \\ge \\alpha \\tau_* \\} \\Big] .$$\n\nThis function defines a correspondence (calibration) between the sequence of thresholds {\u03b8t}t\u22650 and the regularization parameter \u03bb. It should be intuitively clear that larger \u03bb corresponds to larger thresholds and hence larger \u03b1 since both cases yield smaller estimates of x0.\nIn the following we will need to invert this function. We thus define \u03b1 : (0,\u221e) \u2192 (\u03b1min,\u221e) in such a way that \u03b1(\u03bb) \u2208 { a \u2208 (\u03b1min,\u221e) : \u03bb(a) = \u03bb }.\nThe next result implies that the set on the right-hand side is non-empty and therefore the function \u03bb \u21a6 \u03b1(\u03bb) is well defined.\nProposition 3 ([DMM10]). The function \u03b1 \u21a6 \u03bb(\u03b1) is continuous on the interval (\u03b1min,\u221e) with \u03bb(\u03b1min+) = \u2212\u221e and lim\u03b1\u2192\u221e \u03bb(\u03b1) = \u221e.\nTherefore the function \u03bb \u21a6 \u03b1(\u03bb) satisfying \u03b1(\u03bb) \u2208 { a \u2208 (\u03b1min,\u221e) : \u03bb(a) = \u03bb } exists.\nWe will denote by A = \u03b1((0,\u221e)) the image of the function \u03b1. Notice that the definition of \u03b1 is a priori not unique. 
We will see that uniqueness follows from our main theorem.\nExamples of the mappings $\\tau^2 \\mapsto F(\\tau^2, \\alpha \\tau)$, \u03b1 \u21a6 \u03c4\u2217(\u03b1) and \u03b1 \u21a6 \u03bb(\u03b1) are presented in [BM10b].\nWe can now state our main result.\nTheorem 2. Let {x0(N ), w(N ), A(N )}N \u2208N be a converging sequence of instances with the entries of A(N ) iid normal with mean 0 and variance 1/n. Denote by $\\hat{x}(\\lambda; N)$ the LASSO estimator for instance (x0(N ), w(N ), A(N )), with \u03c32, \u03bb > 0, P{X0 \u2260 0} > 0 and let \u03c8 : R \u00d7 R \u2192 R be a pseudo-Lipschitz function. Then, almost surely\n\n$$\\lim_{N \\to \\infty} \\frac{1}{N} \\sum_{i=1}^N \\psi(\\hat{x}_i, x_{0,i}) = E\\{ \\psi(\\eta(X_0 + \\tau_* Z; \\theta_*), X_0) \\} , \\qquad (1.7)$$\n\nwhere Z \u223c N(0, 1) is independent of X0 \u223c pX0, \u03c4\u2217 = \u03c4\u2217(\u03b1(\u03bb)) and \u03b8\u2217 = \u03b1(\u03bb)\u03c4\u2217(\u03b1(\u03bb)).\nAs a corollary, the function \u03bb \u21a6 \u03b1(\u03bb) is indeed uniquely defined.\nCorollary 4. For any \u03bb, \u03c32 > 0 there exists a unique \u03b1 > \u03b1min such that \u03bb(\u03b1) = \u03bb (with the function \u03b1 \u21a6 \u03bb(\u03b1) defined by $\\lambda(\\alpha) = \\alpha \\tau_* [1 - \\frac{1}{\\delta} P\\{ |X_0 + \\tau_* Z| \\ge \\alpha \\tau_* \\}]$).\nHence the function \u03bb \u21a6 \u03b1(\u03bb) is continuous non-decreasing with \u03b1((0,\u221e)) \u2261 A = (\u03b10,\u221e).\nThe assumption of a converging problem-sequence is important for the result to hold, while the hypothesis of gaussian measurement matrices A(N ) is necessary for the proof technique to be correct. On the other hand, the restrictions \u03bb, \u03c32 > 0, and P{X0 \u2260 0} > 0 (whence \u03c4\u2217 \u2260 0 using $\\lambda(\\alpha) = \\alpha \\tau_* [1 - \\frac{1}{\\delta} P\\{ |X_0 + \\tau_* Z| \\ge \\alpha \\tau_* \\}]$) are made in order to avoid technical complications due to degenerate cases. Such cases can be resolved by continuity arguments.\n\n1.3 Related work\n\nThe LASSO was introduced in [Tib96, CD95]. Several papers provide performance guarantees for the LASSO or similar convex optimization methods [CRT06, CT07], by proving upper bounds on the resulting mean square error. These works assume an appropriate \u2018isometry\u2019 condition to hold for A. While such a condition holds with high probability for some random matrices, it is often difficult to verify it explicitly. Further, it is only applicable to very sparse vectors x0. These restrictions are intrinsic to the worst-case point of view developed in [CRT06, CT07].\n\nGuarantees have been proved for correct support recovery in [ZY06], under an appropriate \u2018irrepresentability\u2019 assumption on A. While support recovery is an interesting conceptualization for some applications (e.g. model selection), the metric considered in the present paper (mean square error) provides complementary information and is quite standard in many different fields.\n\nCloser to the spirit of this paper, [RFG09] derived expressions for the mean square error under the same model considered here. Similar results were presented recently in [KWT09, GBS09]. These papers argue that a sharp asymptotic characterization of the LASSO risk can provide valuable guidance in practical applications. 
For instance, it can be used to evaluate competing optimization methods on large scale applications, or to tune the regularization parameter \u03bb.\nUnfortunately, these results were non-rigorous and were obtained through the famously powerful \u2018replica method\u2019 from statistical physics [MM09].\n\nLet us emphasize that the present paper offers two advantages over these recent developments: (i) It is completely rigorous, thus putting this line of research on a firmer basis; (ii) It is algorithmic in that the LASSO mean square error is shown to be equivalent to the one achieved by a low-complexity message passing algorithm.\n\n2 Numerical illustrations\n\nTheorem 2 assumes that the entries of matrix A are iid gaussians. We expect, however, our predictions to be robust and to hold for a much larger family of matrices. Rigorous evidence in this direction is presented in [KM10] where the normalized cost $C(\\hat{x})/N$ is shown to have a limit as N \u2192 \u221e which is universal with respect to random matrices A with iid entries. (More precisely, it is universal if $E\\{A_{ij}\\} = 0$, $E\\{A_{ij}^2\\} = 1/n$ and $E\\{A_{ij}^6\\} \\le C/n^3$ for a uniform constant C.)\nFurther, our result is asymptotic, and one might wonder how accurate it is for instances of moderate dimensions.\n\nNumerical simulations were carried out in [DMM10] and suggest that the result is robust and relevant already for N of the order of a few hundreds. As an illustration, we present in Figures 1-3 the outcome of such simulations for four types of real data and random matrices. We generated the signal vector randomly with entries in {+1, 0,\u22121} and P(x0,i = +1) = P(x0,i = \u22121) = 0.05. The noise vector w was generated by using i.i.d. N(0, 0.2) entries.\n\nWe obtained the optimum estimator $\\hat{x}$ using OWLQN and l1_ls, packages for solving large-scale l1-regularized regressions [KKL+07], [AJ07]. We used 40 values of \u03bb between .05 and 2 and N equal to 500, 1000, and 2000. For each case, the point (\u03bb, MSE) was plotted and the results are shown in the figures. Continuous lines correspond to the asymptotic prediction by Theorem 2 for $\\psi(a, b) = (a - b)^2$, namely\n\n$$\\mathrm{MSE} = \\lim_{N \\to \\infty} N^{-1} \\|\\hat{x} - x_0\\|^2 = E\\{ [\\eta(X_0 + \\tau_* Z; \\theta_*) - X_0]^2 \\} = \\delta (\\tau_*^2 - \\sigma^2) .$$\n\nThe agreement is remarkably good already for N, n of the order of a few hundreds, and deviations are consistent with statistical fluctuations.\n\nThe four figures correspond to measurement matrices A:\nFigure 1(a): Data consist of 2253 measurements of expression level of 7077 genes (these data were provided to us by the Broad Institute). From this matrix we took sub-matrices A of aspect ratio \u03b4 for each N . The entries were continuous variables. We standardized all columns of A to have mean 0 and variance 1.\n\nFigure 1(b): From a data set of 1932 patient records we extracted 4833 binary features describing demographic information, medical history, lab results, medications etc. The 0-1 matrix was sparse (with only 3.1% non-zero entries). As with the gene data, for each N , the sub-matrices A with aspect ratio \u03b4 were selected and standardized.\n\nFigure 2(a): Random \u00b11 matrices with aspect ratio \u03b4. Each entry is independently equal to +1/\u221an or \u22121/\u221an with equal probability.\n\nFigure 2(b): Random gaussian matrices with aspect ratio \u03b4 and iid N(0, 1/n) entries (as in Theorem 2).\n\nNotice the behavior appears to be essentially indistinguishable. Also the asymptotic prediction has a minimum as a function of \u03bb. The location of this minimum can be used to select the regularization parameter. 
Further empirical analysis is presented in [BBM10].\n\n[Figure 1, panels (a) Gene expression data and (b) Hospital records: MSE versus \u03bb for N = 500, 1000, 2000, together with the asymptotic prediction.]\n\nFigure 1: Mean square error (MSE) as a function of the regularization parameter \u03bb compared to the asymptotic prediction for \u03b4 = .5 and \u03c32 = .2. In plot (a) the measurement matrix A is a real valued (standardized) matrix of gene expression data and in plot (b) A is a (standardized) 0-1 feature matrix of hospital records. Each point in these plots is generated by finding the LASSO predictor $\\hat{x}$ using a measurement vector y = Ax0 + w for an independent signal vector x0 and an independent noise vector w.\n\n[Figure 2, panels (a) \u00b11 matrices and (b) Gaussian matrices: MSE versus \u03bb for N = 500, 1000, 2000, together with the asymptotic prediction.]\n\nFigure 2: As in Figure 1, but the measurement matrix A has iid entries that are equal to \u00b11/\u221an with equal probabilities in plot (a), and has iid N(0, 1/n) entries in plot (b). Additionally, each point in these plots uses an independent matrix A.\n\nFor the second data set (patient records) we repeated the simulation 20 times (each time with fresh x0 and w) and obtained the average and standard error for MSE, False Positive Rate (FPR) and True Positive Rate (TPR). The results with error bars are shown in Figure 3. The length of each error bar is equal to twice the standard error (in each direction). 
FPR and TPR are calculated using\n\n$$\\mathrm{FPR} \\equiv \\frac{\\sum_{i=1}^N I_{\\{\\hat{x}_i \\neq 0\\}} I_{\\{x_{0,i} = 0\\}}}{\\sum_{i=1}^N I_{\\{x_{0,i} = 0\\}}} , \\qquad \\mathrm{TPR} \\equiv \\frac{\\sum_{i=1}^N I_{\\{\\hat{x}_i \\neq 0\\}} I_{\\{x_{0,i} \\neq 0\\}}}{\\sum_{i=1}^N I_{\\{x_{0,i} \\neq 0\\}}} , \\qquad (2.1)$$\n\nwhere $I_{\\{S\\}} = 1$ if statement S holds and $I_{\\{S\\}} = 0$ otherwise. The predictions for FPR and TPR are obtained by applying Theorem 2 to $\\psi_{fpr}(a, b) \\equiv I_{\\{a \\neq 0\\}} I_{\\{b = 0\\}}$ and $\\psi_{tpr}(a, b) \\equiv I_{\\{a \\neq 0\\}} I_{\\{b \\neq 0\\}}$, which yields\n\n$$\\lim_{N \\to \\infty} \\mathrm{FPR} = 2 \\Phi(-\\alpha), \\qquad \\lim_{N \\to \\infty} \\mathrm{TPR} = \\Phi\\Big(-\\alpha + \\frac{1}{\\tau_*}\\Big) + \\Phi\\Big(-\\alpha - \\frac{1}{\\tau_*}\\Big) , \\qquad (2.2)$$\n\nwhere \u03a6 is defined in Proposition 2. Note that functions $\\psi_{fpr}(a, b)$ and $\\psi_{tpr}(a, b)$ are not pseudo-Lipschitz but the limits (2.2) follow from Theorem 2 via standard weak-convergence arguments.\n\n3 A structural property and proof of the main theorem\n\nWe will prove the following theorem which implies our main result, Theorem 2.\nTheorem 3. Assume the hypotheses of Theorem 2. Denote by {xt(N )}t\u22650 the sequence of estimates produced by AMP. Then $\\lim_{t \\to \\infty} \\lim_{N \\to \\infty} N^{-1} \\|x^t(N) - \\hat{x}(\\lambda; N)\\|_2^2 = 0$, almost surely.\n\n[Figure 3, three panels: MSE, False Positive Rate and True Positive Rate versus \u03bb for N = 500, 1000, 2000, together with the asymptotic prediction.]\n\nFigure 3: Average of MSE, FPR and TPR versus \u03bb for medical data, using 20 samples per \u03bb and N . All parameters are similar to Figure 1(b). 
Error bars are twice the standard errors (in each direction).\n\nThe rest of the paper is devoted to the proof of this theorem. Section 3.1 proves a structural property that is the key tool in this proof. Section 3.2 uses this property together with a few lemmas to prove Theorem 3. Proofs of lemmas and more details can be found in [BM10b].\n\nThe proof of Theorem 2 follows immediately. Indeed, when \u03c8 is Lipschitz there is a constant B such that $|\\frac{1}{N} \\sum_{i=1}^N \\psi(x_i^{t+1}, x_{0,i}) - \\frac{1}{N} \\sum_{i=1}^N \\psi(\\hat{x}_i, x_{0,i})| \\le B \\|x^{t+1} - \\hat{x}\\|_2$. We then obtain\n\n$$\\lim_{N \\to \\infty} \\frac{1}{N} \\sum_{i=1}^N \\psi(\\hat{x}_i, x_{0,i}) = \\lim_{t \\to \\infty} \\lim_{N \\to \\infty} \\frac{1}{N} \\sum_{i=1}^N \\psi(x_i^{t+1}, x_{0,i}) = E\\{ \\psi(\\eta(X_0 + \\tau_* Z; \\theta_*), X_0) \\} ,$$\n\nwhere we used Theorem 1 and Proposition 2. The case of pseudo-Lipschitz \u03c8 is a straightforward generalization.\n\nSome notations. For any non-empty subset S of [m] and any k \u00d7 m matrix M we refer by MS to the k by |S| sub-matrix of M that contains only the columns of M corresponding to S. Also define the scalar product $\\langle u, v \\rangle \\equiv \\frac{1}{m} \\sum_{i=1}^m u_i v_i$ for u, v \u2208 Rm. Finally, the subgradient of a convex function f : Rm \u2192 R at point x \u2208 Rm is denoted by \u2202f (x). In particular, remember that the subgradient of the \u21131 norm, $x \\mapsto \\|x\\|_1$, is given by\n\n$$\\partial \\|x\\|_1 = \\{ v \\in R^m \\text{ such that } |v_i| \\le 1 \\ \\forall i \\text{ and } x_i \\neq 0 \\Rightarrow v_i = \\mathrm{sign}(x_i) \\} . \\qquad (3.1)$$\n\n3.1 A structural property of the LASSO cost function\n\nOne main challenge in the proof of Theorem 2 lies in the fact that the function x \u21a6 CA,y(x) is not, in general, strictly convex. Hence there can be, in principle, vectors x of cost very close to the optimum and nevertheless far from the optimum. The following Lemma provides conditions under which this does not happen.\nLemma 1. There exists a function \u03be(\u03b5, c1, . . . 
, c5) such that the following happens. If x, r \u2208 RN satisfy the following conditions:\n(1) $\\|r\\|_2 \\le c_1 \\sqrt{N}$;\n(2) C(x + r) \u2264 C(x);\n(3) There exists a subgradient sg(C, x) \u2208 \u2202C(x) with $\\|sg(C, x)\\|_2 \\le \\sqrt{N} \\varepsilon$;\n(4) Let $v \\equiv (1/\\lambda)[A^*(y - Ax) + sg(C, x)] \\in \\partial \\|x\\|_1$, and $S(c_2) \\equiv \\{ i \\in [N] : |v_i| \\ge 1 - c_2 \\}$. Then, for any S\u2032 \u2286 [N ], |S\u2032| \u2264 c3N , we have $\\sigma_{\\min}(A_{S(c_2) \\cup S'}) \\ge c_4$;\n(5) The maximum and minimum non-zero singular value of A satisfy $c_5^{-1} \\le \\sigma_{\\min}(A)^2 \\le \\sigma_{\\max}(A)^2 \\le c_5$.\nThen $\\|r\\|_2 \\le \\sqrt{N} \\xi(\\varepsilon, c_1, \\ldots, c_5)$. Further for any c1, . . . , c5 > 0, \u03be(\u03b5, c1, . . . , c5) \u2192 0 as \u03b5 \u2192 0. Further, if ker(A) = {0}, the same conclusion holds under conditions 1, 2, 3, and 5.\n\n3.2 Proof of Theorem 3\n\nThe proof is based on a series of Lemmas that are used to check the assumptions of Lemma 1.\n\nThe next lemma implies that submatrices of A constructed using the first t iterations of the AMP algorithm are non-singular (more precisely, have singular values bounded away from 0).\nLemma 2. Let S \u2286 [N ] be measurable on the \u03c3-algebra St generated by {z0, . . . , zt\u22121} and {x0 + A\u2217z0, . . . , xt\u22121 + A\u2217zt\u22121} and assume |S| \u2264 N (\u03b4 \u2212 c) for some c > 0. Then there exists a1 = a1(c) > 0 (independent of t) and a2 = a2(c, t) > 0 (depending on t and c) such that $\\min_{S'} \\{ \\sigma_{\\min}(A_{S \\cup S'}) : S' \\subseteq [N], |S'| \\le a_1 N \\} \\ge a_2$, with probability converging to 1 as N \u2192 \u221e.\nWe will apply this lemma to a specific choice of the set S. Namely, defining\n\n$$v^t \\equiv \\frac{1}{\\theta_{t-1}} (x^{t-1} + A^* z^{t-1} - x^t) , \\qquad (3.2)$$\n\nour last lemma shows convergence of a particular sequence of sets provided by vt.\n\nLemma 3. Fix \u03b3 \u2208 (0, 1) and let the sequence {St(\u03b3)}t\u22650 be defined by $S_t(\\gamma) \\equiv \\{ i \\in [N] : |v_i^t| \\ge 1 - \\gamma \\}$. For any \u03be > 0 there exists t\u2217 = t\u2217(\u03be, \u03b3) < \u221e such that, for all t2 \u2265 t1 \u2265 t\u2217: $\\lim_{N \\to \\infty} P\\{ |S_{t_2}(\\gamma) \\setminus S_{t_1}(\\gamma)| \\ge N \\xi \\} = 0$.\nThe last two lemmas imply the following.\nProposition 5. There exist constants \u03b31 \u2208 (0, 1), \u03b32, \u03b33 > 0 and tmin < \u221e such that, for any t \u2265 tmin, $\\min \\{ \\sigma_{\\min}(A_{S_t(\\gamma_1) \\cup S'}) : S' \\subseteq [N], |S'| \\le \\gamma_2 N \\} \\ge \\gamma_3$ with probability converging to 1 as N \u2192 \u221e.\nProof of Theorem 3. We apply Lemma 1 to x = xt, the AMP estimate, and $r = \\hat{x} - x^t$, the distance from the LASSO optimum. The thesis follows by checking conditions 1\u20135. Namely we need to show that there exist constants c1, . . . , c5 > 0 and, for each \u03b5 > 0 some t = t(\u03b5) such that 1\u20135 hold with probability going to 1 as N \u2192 \u221e.\n\nCondition 1 holds since $\\lim_{N \\to \\infty} \\langle \\hat{x}, \\hat{x} \\rangle$ and $\\lim_{N \\to \\infty} \\langle x^t, x^t \\rangle$ for all t are finite.\nCondition 2 is immediate since $x + r = \\hat{x}$ minimizes C(\u00b7 ).\n\nConditions 3-4. Take v = vt as defined in Eq. (3.2). Using the definition (1.5), it is easy to check that $v^t \\in \\partial \\|x^t\\|_1$. Further it can be shown that $v^t = (1/\\lambda)[A^*(y - A x^t) + sg(C, x^t)]$, with sg(C, xt) a subgradient satisfying $\\lim_{t \\to \\infty} \\lim_{N \\to \\infty} N^{-1} \\|sg(C, x^t)\\|^2 = 0$. This proves condition 3 and condition 4 holds by Proposition 5.\n\nCondition 5 follows from standard limit theorems on the singular values of Wishart matrices.\n\nAcknowledgement\n\nThis work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211 and a Portuguese Doctoral FCT fellowship.\n\nReferences\n\n[AJ07] G. Andrew and G. 
Jianfeng, Scalable training of l1-regularized log-linear models, Proceedings of the 24th international conference on Machine learning, 2007, pp. 33\u201340.\n[BBM10] M. Bayati, J. A. Bento, and A. Montanari, The LASSO risk: asymptotic results and real world examples, Long version (in preparation), 2010.\n[BM10a] M. Bayati and A. Montanari, The dynamics of message passing on dense graphs, with applications to compressed sensing, Proceedings of IEEE International Symposium on Inform. Theory (ISIT), 2010, Longer version in http://arxiv.org/abs/1001.3448.\n[BM10b] M. Bayati and A. Montanari, The LASSO risk for gaussian matrices, 2010, preprint available in http://arxiv.org/abs/1008.2581.\n[CD95] S.S. Chen and D.L. Donoho, Examples of basis pursuit, Proceedings of Wavelet Applications in Signal and Image Processing III (San Diego, CA), 1995.\n[CRT06] E. Candes, J. K. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics 59 (2006), 1207\u20131223.\n[CT07] E. Candes and T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Annals of Statistics 35 (2007), 2313\u20132351.\n[DMM09] D. L. Donoho, A. Maleki, and A. Montanari, Message Passing Algorithms for Compressed Sensing, Proceedings of the National Academy of Sciences 106 (2009), 18914\u201318919.\n[DMM10] D.L. Donoho, A. Maleki, and A. Montanari, The Noise Sensitivity Phase Transition in Compressed Sensing, Preprint, 2010.\n[GBS09] D. Guo, D. Baron, and S. Shamai, A single-letter characterization of optimal noisy compressed sensing, 47th Annual Allerton Conference (Monticello, IL), September 2009.\n[Joh06] I. Johnstone, High Dimensional Statistical Inference and Random Matrices, Proc. International Congress of Mathematicians (Madrid), 2006.\n[KKL+07] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, A method for large-scale \u21131-regularized least squares, IEEE Journal on Selected Topics in Signal Processing 1 (2007), 606\u2013617.\n[KM10] S. Korada and A. Montanari, Applications of Lindeberg Principle in Communications and Statistical Learning, preprint available in http://arxiv.org/abs/1004.0557, 2010.\n[KWT09] Y. Kabashima, T. Wadayama, and T. Tanaka, A typical reconstruction limit for compressed sensing based on lp-norm minimization, J. Stat. Mech. (2009), L09003.\n[MM09] M. M\u00e9zard and A. Montanari, Information, Physics and Computation, Oxford University Press, Oxford, 2009.\n[RFG09] S. Rangan, A. K. Fletcher, and V. K. Goyal, Asymptotic analysis of map estimation via the replica method and applications to compressed sensing, PUT NIPS REF, 2009.\n[Tel99] E. Telatar, Capacity of Multi-antenna Gaussian Channels, European Transactions on Telecommunications 10 (1999), 585\u2013595.\n[Tib96] R. Tibshirani, Regression shrinkage and selection with the lasso, J. Royal. Statist. Soc B 58 (1996), 267\u2013288.\n[ZY06] P. Zhao and B. Yu, On model selection consistency of Lasso, The Journal of Machine Learning Research 7 (2006), 2541\u20132563.", "award": [], "sourceid": 1182, "authors": [{"given_name": "Mohsen", "family_name": "Bayati", "institution": null}, {"given_name": "Jos\u00e9", "family_name": "Pereira", "institution": null}, {"given_name": "Andrea", "family_name": "Montanari", "institution": null}]}