{"title": "LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements", "book": "Advances in Neural Information Processing Systems", "page_first": 3420, "page_last": 3428, "abstract": "Consider estimating an unknown, but structured (e.g. sparse, low-rank, etc.), signal $x_0\\in R^n$ from a vector $y\\in R^m$ of measurements of the form $y_i=g_i(a_i^Tx_0)$, where the $a_i$'s are the rows of a known measurement matrix $A$, and, $g$ is a (potentially unknown) nonlinear and random link-function. Such measurement functions could arise in applications where the measurement device has nonlinearities and uncertainties. They could also arise by design, e.g., $g_i(x)=sign(x+z_i)$ corresponds to noisy 1-bit quantized measurements. Motivated by the classical work of Brillinger, and more recent work of Plan and Vershynin, we estimate $x_0$ via solving the Generalized-LASSO, i.e., $\\hat x=\\arg\\min_{x}\\|y-Ax\\|_2+\\lambda f(x)$ for some regularization parameter $\\lambda >0$ and some (typically non-smooth) convex regularizer $f$ that promotes the structure of $x_0$, e.g. $\\ell_1$-norm, nuclear-norm. While this approach seems to naively ignore the nonlinear function $g$, both Brillinger and Plan and Vershynin have shown that, when the entries of $A$ are iid standard normal, this is a good estimator of $x_0$ up to a constant of proportionality $\\mu$, which only depends on $g$. In this work, we considerably strengthen these results by obtaining explicit expressions for $\\|\\hat x-\\mu x_0\\|_2$, for the regularized Generalized-LASSO, that are asymptotically precise when $m$ and $n$ grow large. A main result is that the estimation performance of the Generalized LASSO with non-linear measurements is asymptotically the same as one whose measurements are linear $y_i=\\mu a_i^Tx_0+\\sigma z_i$, with $\\mu=E[\\gamma g(\\gamma)]$ and $\\sigma^2=E[(g(\\gamma)-\\mu\\gamma)^2]$, and, $\\gamma$ standard normal.
The derived expressions on the estimation performance are the first-known precise results in this context. One interesting consequence of our result is that the optimal quantizer of the measurements that minimizes the estimation error of the LASSO is the celebrated Lloyd-Max quantizer.", "full_text": "LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements\n\nChristos Thrampoulidis, Ehsan Abbasi, Babak Hassibi \u2217\nDepartment of Electrical Engineering, Caltech\ncthrampo@caltech.edu, eabbasi@caltech.edu, hassibi@caltech.edu\n\nAbstract\n\nConsider estimating an unknown, but structured (e.g. sparse, low-rank, etc.), signal x0 \u2208 Rn from a vector y \u2208 Rm of measurements of the form $y_i = g_i(a_i^T x_0)$, where the ai\u2019s are the rows of a known measurement matrix A, and, g(\u00b7) is a (potentially unknown) nonlinear and random link-function. Such measurement functions could arise in applications where the measurement device has nonlinearities and uncertainties. They could also arise by design, e.g., gi(x) = sign(x + zi) corresponds to noisy 1-bit quantized measurements. Motivated by the classical work of Brillinger, and more recent work of Plan and Vershynin, we estimate x0 via solving the Generalized-LASSO, i.e., $\\hat{x} := \\arg\\min_x \\|y - Ax\\|_2 + \\lambda f(x)$ for some regularization parameter \u03bb > 0 and some (typically non-smooth) convex regularizer f(\u00b7) that promotes the structure of x0, e.g. \u21131-norm, nuclear-norm, etc. While this approach seems to naively ignore the nonlinear function g(\u00b7), both Brillinger (in the non-constrained case) and Plan and Vershynin have shown that, when the entries of A are iid standard normal, this is a good estimator of x0 up to a constant of proportionality \u00b5, which only depends on g(\u00b7).
In this work, we considerably strengthen these results by obtaining explicit expressions for $\\|\\hat{x} - \\mu x_0\\|_2$, for the regularized Generalized-LASSO, that are asymptotically precise when m and n grow large. A main result is that the estimation performance of the Generalized LASSO with non-linear measurements is asymptotically the same as one whose measurements are linear $y_i = \\mu a_i^T x_0 + \\sigma z_i$, with $\\mu = E[\\gamma g(\\gamma)]$ and $\\sigma^2 = E[(g(\\gamma) - \\mu\\gamma)^2]$, and, \u03b3 standard normal. To the best of our knowledge, the derived expressions on the estimation performance are the first-known precise results in this context. One interesting consequence of our result is that the optimal quantizer of the measurements that minimizes the estimation error of the Generalized LASSO is the celebrated Lloyd-Max quantizer.\n\n\u2217This work was supported in part by the National Science Foundation under grants CNS-0932428, CCF-1018927, CCF-1423663 and CCF-1409204, by a grant from Qualcomm Inc., by NASA\u2019s Jet Propulsion Laboratory through the President and Directors Fund, by King Abdulaziz University, and by King Abdullah University of Science and Technology.\n\n1 Introduction\nNon-linear Measurements. Consider the problem of estimating an unknown signal vector x0 \u2208 Rn from a vector y = (y1, y2, . . . , ym)T of m measurements taking the following form:\n\n$y_i = g_i(a_i^T x_0), \\quad i = 1, 2, \\ldots, m.$ (1)\n\nHere, each ai represents a (known) measurement vector. The gi\u2019s are independent copies of a (generically random) link function g. For instance, gi(x) = x + zi, with say zi being normally distributed, recovers the standard linear regression setup with gaussian noise. In this paper, we are particularly interested in scenarios where g is non-linear.
Notable examples include g(x) = sign(x) (or gi(x) = sign(x + zi)) and g(x) = (x)+, corresponding to 1-bit quantized (noisy) measurements, and, to the censored Tobit model, respectively. Depending on the situation, g might be known or unspecified. In the statistics and econometrics literature, the measurement model in (1) is popular under the name single-index model and several aspects of it have been well-studied, e.g. [4,5,14,15] (footnote 1).\nStructured Signals. It is typical that the unknown signal x0 obeys some sort of structure. For instance, it might be sparse, i.e. only a few, k \u226a n, of its entries are non-zero; or, it might be that x0 = vec(X0), where $X_0 \\in R^{\\sqrt{n}\\times\\sqrt{n}}$ is a matrix of low rank r \u226a n. To exploit this information it is typical to associate with the structure of x0 a properly chosen function f : Rn \u2192 R, which we refer to as the regularizer. Of particular interest are convex and non-smooth such regularizers, e.g. the \u21131-norm for sparse signals, the nuclear-norm for low-rank ones, etc. Please refer to [1, 6, 13] for further discussions.\nAn Algorithm for Linear Measurements: The Generalized LASSO. When the link function is linear, i.e. gi(x) = x + zi, perhaps the most popular way of estimating x0 is via solving the Generalized LASSO algorithm:\n\n$\\hat{x} := \\arg\\min_x \\|y - Ax\\|_2 + \\lambda f(x).$ (2)\n\nHere, A = [a1, a2, . . . , am]T \u2208 Rm\u00d7n is the known measurement matrix and \u03bb > 0 is a regularizer parameter. This is often referred to as the \u21132-LASSO or the square-root-LASSO [3] to distinguish from the one solving $\\min_x \\frac{1}{2}\\|y - Ax\\|_2^2 + \\lambda f(x)$, instead. Our results can be adapted to this latter version, but for concreteness, we restrict attention to (2) throughout.
The acronym LASSO for (2) was introduced in [22] for the special case of \u21131-regularization; (2) is a natural generalization to other kinds of structures and includes the group-LASSO [25] and the fused-LASSO [23] as special cases. We often drop the term \u201cGeneralized\u201d and refer to (2) simply as the LASSO.\nOne popular measure of estimation performance of (2) is the squared-error $\\|\\hat{x} - x_0\\|_2^2$. Recently, there have been significant advances on establishing tight bounds and even precise characterizations of this quantity, in the presence of linear measurements [2, 10, 16, 18, 19, 21]. Such precise results have been core to building a better understanding of the behavior of the LASSO, and, in particular, on the exact role played by the choice of the regularizer f (in accordance with the structure of x0), by the number of measurements m, by the value of \u03bb, etc. In certain cases, they even provide us with useful insights into practical matters such as the tuning of the regularizer parameter.\nUsing the LASSO for Non-linear Measurements? The LASSO is by nature tailored to a linear model for the measurements. Indeed, the first term of the objective function in (2) tries to fit Ax to the observed vector y presuming that this is of the form $y_i = a_i^T x_0 + \\text{noise}$. Of course, no one stops us from continuing to use it even in cases where $y_i = g(a_i^T x_0)$ with g being non-linear (footnote 2). But, the question then becomes: Can there be any guarantees that the solution \u02c6x of the Generalized LASSO is still a good estimate of x0?\nThe question just posed was first studied back in the early 80\u2019s by Brillinger [5] who provided answers in the case of solving (2) without a regularizer term. This, of course, corresponds to standard Least Squares (LS).
Interestingly, he showed that when the measurement vectors are Gaussian, then the LS solution is a consistent estimate of x0, up to a constant of proportionality \u00b5, which only depends on the link-function g. The result is sharp, but only under the assumption that the number of measurements m grows large, while the signal dimension n stays fixed, which was the typical setting of interest at the time. In the world of structured signals and high-dimensional measurements, the problem was only very recently revisited by Plan and Vershynin [17]. They consider a constrained version of the Generalized LASSO, in which the regularizer is essentially replaced by a constraint, and derive upper bounds on its performance. The bounds are not tight (they involve absolute constants), but they demonstrate some key features: i) the solution to the constrained LASSO \u02c6x is a good estimate of x0 up to the same constant of proportionality \u00b5 that appears in Brillinger\u2019s result. ii) Thus, $\\|\\hat{x} - \\mu x_0\\|_2^2$ is a natural measure of performance. iii) Estimation is possible even with m < n measurements by taking advantage of the structure of x0.\n\nFootnote 1: The single-index model is a classical topic and can also be regarded as a special case of what is known as the sufficient dimension reduction problem. There is extensive literature on both subjects; unavoidably, we only refer to the directly relevant works here.\n\nFootnote 2: Note that the Generalized LASSO in (2) does not assume knowledge of g. All that is assumed is the availability of the measurements yi. Thus, the link-function might as well be unknown or unspecified.\n\nFigure 1: Squared error of the \u21131-regularized LASSO with non-linear measurements (\u25a1) and with corresponding linear ones (\u22c6) as a function of the regularizer parameter \u03bb; both compared to the asymptotic prediction. Here, gi(x) = sign(x + 0.3zi) with zi \u223c N (0, 1).
The unknown signal x0 is of dimension n = 768 and has \u23080.15n\u2309 non-zero entries (see Sec. 2.2.2 for details). The different curves correspond to \u23080.75n\u2309 and \u23081.2n\u2309 number of measurements, respectively. Simulation points are averages over 20 problem realizations.\n\n1.1 Summary of Contributions\nInspired by the work of Plan and Vershynin [17], and, motivated by recent advances on the precise analysis of the Generalized LASSO with linear measurements, this paper extends these latter results to the case of non-linear measurements. When the measurement matrix A has entries i.i.d. Gaussian (henceforth, we assume this to be the case without further reference), and the estimation performance is measured in a mean-squared-error sense, we are able to precisely predict the asymptotic behavior of the error. The derived expression accurately captures the role of the link function g, the particular structure of x0, the role of the regularizer f, and, the value of the regularizer parameter \u03bb. Further, it holds for all values of \u03bb, and for a wide class of functions f and g.\nInterestingly, our result shows in a very precise manner that in large dimensions, modulo the information about the magnitude of x0, the LASSO treats non-linear measurements exactly as if they were scaled and noisy linear measurements with scaling factor \u00b5 and noise variance \u03c3\u00b2 defined as\n\n$\\mu := E[\\gamma g(\\gamma)], \\quad \\text{and} \\quad \\sigma^2 := E[(g(\\gamma) - \\mu\\gamma)^2], \\quad \\text{for } \\gamma \\sim N(0, 1),$ (3)\n\nwhere the expectation is with respect to both \u03b3 and g. In particular, when g is such that \u00b5 \u2260 0 (footnote 3), then,\n\nthe estimation performance of the Generalized LASSO with measurements of the form $y_i = g_i(a_i^T x_0)$ is asymptotically the same as if the measurements were rather of the form $y_i = \\mu a_i^T x_0 + \\sigma z_i$, with \u00b5, \u03c3\u00b2 as in (3) and zi standard gaussian noise.\n\nRecent analysis of the squared-error of the LASSO, when used to recover structured signals from noisy linear observations, provides us with either precise predictions (e.g. [2,20]), or in other cases, with tight upper bounds (e.g. [10, 16]). Owing to the established relation between non-linear and (corresponding) linear measurements, such results also characterize the performance of the LASSO in the presence of nonlinearities. We remark that some of the error formulae derived here in the general context of non-linear measurements, have not been previously known even under the prism of linear measurements. Figure 1 serves as an illustration; the error with non-linear measurements matches well with the error of the corresponding linear ones and both are accurately predicted by our analytic expression.\nUnder the generic model in (1), which allows for g to even be unspecified, x0 can, in principle, be estimated only up to a constant of proportionality [5, 15, 17]. For example, if g is unknown then any information about the norm $\\|x_0\\|_2$ could be absorbed in the definition of g. The same is true when g(x) = sign(x), even though g might be known here. In these cases, what becomes important is the direction of x0. Motivated by this, and, in order to simplify the presentation, we have assumed throughout that x0 has unit Euclidean norm (footnote 4), i.e. $\\|x_0\\|_2 = 1$.\n\nFootnote 3: This excludes for example link functions g that are even, but also some other not so obvious cases [11, Sec. 2.2]. For a few special cases, e.g.
sparse recovery with binary measurements yi [24], different methodologies than the LASSO have been recently proposed that do not require \u00b5 \u2260 0.\n\nFootnote 4: In [17, Remark 1.8], they note that their results can be easily generalized to the case when $\\|x_0\\|_2 \\neq 1$ by simply redefining $\\bar{g}(x) = g(\\|x_0\\|_2 x)$ and accordingly adjusting the values of the parameters \u00b5 and \u03c3\u00b2 in (3). The very same argument is also true in our case.\n\n[Figure 1 plot: x-axis \u03bb; y-axis $\\|\\mu^{-1}\\hat{x} - x_0\\|_2^2$; legend: Non-linear, Linear, Prediction; curves for m > n and m < n.]\n\n1.2 Discussion of Relevant Literature\nExtending an Old Result. Brillinger [5] identified the asymptotic behavior of the estimation error of the LS solution $\\hat{x}_{LS} = (A^T A)^{-1} A^T y$ by showing that, when n (the dimension of x0) is fixed,\n\n$\\lim_{m\\to\\infty} \\sqrt{m}\\,\\|\\hat{x}_{LS} - \\mu x_0\\|_2 = \\sigma\\sqrt{n},$ (4)\n\nwhere \u00b5 and \u03c3\u00b2 are the same as in (3). Our result can be viewed as a generalization of the above in several directions. First, we extend (4) to the regime where m/n = \u03b4 \u2208 (1,\u221e) and both grow large by showing that\n\n$\\lim_{n\\to\\infty} \\|\\hat{x}_{LS} - \\mu x_0\\|_2 = \\frac{\\sigma}{\\sqrt{\\delta - 1}}.$ (5)\n\nSecond, and most importantly, we consider solving the Generalized LASSO instead, of which LS is only a very special case. This allows versions of (5) where the error is finite even when \u03b4 < 1 (e.g., see (8)). Note the additional challenges faced when considering the LASSO: i) \u02c6x no longer has a closed-form expression, ii) the result needs to additionally capture the role of x0, f, and, \u03bb.\nMotivated by Recent Work. Plan and Vershynin consider a constrained Generalized LASSO:\n\n$\\hat{x}_{C\\text{-}LASSO} = \\arg\\min_{x\\in K} \\|y - Ax\\|_2,$ (6)\n\nwith y as in (1) and K \u2282 Rn some known set (not necessarily convex). In its simplest form, their result shows that when $m \\gtrsim D_K(\\mu x_0)$ then with high probability,\n\n$\\|\\hat{x}_{C\\text{-}LASSO} - \\mu x_0\\|_2 \\lesssim \\frac{\\sigma\\sqrt{D_K(\\mu x_0)} + \\zeta}{\\sqrt{m}}.$ (7)\n\nHere, $D_K(\\mu x_0)$ is the Gaussian width, a specific measure of complexity of the constrained set K when viewed from \u00b5x0. For our purposes, it suffices to remark that if K is properly chosen, and, if \u00b5x0 is on the boundary of K, then $D_K(\\mu x_0)$ is less than n. Thus, estimation is, in principle, possible with m < n measurements. The parameters \u00b5 and \u03c3 that appear in (7) are the same as in (3) and $\\zeta := E[(g(\\gamma) - \\mu\\gamma)^2\\gamma^2]$. Observe that, in contrast to (4) and to the setting of this paper, the result in (7) is non-asymptotic. Also, it suggests the critical role played by \u00b5 and \u03c3. On the other hand, (7) is only an upper bound on the error, and also, it suffers from unknown absolute proportionality constants (hidden in $\\lesssim$).\nMoving the analysis into an asymptotic setting, our work expands upon the result of [17]. First, we consider the regularized LASSO instead, which is more commonly used in practice. Most importantly, we improve the loose upper bounds into precise expressions. In turn, this proves in an exact manner the role played by \u00b5 and \u03c3\u00b2, of which (7) is only indicative. For a direct comparison with (7) we mention the following result which follows from our analysis (we omit the proof for brevity). Assume K is convex, m/n = \u03b4 \u2208 (0,\u221e), $D_K(\\mu x_0)/n = \\rho \\in (0, 1]$ and n \u2192 \u221e. Also, \u03b4 > \u03c1. Then, (7) yields an upper bound $C\\sigma\\sqrt{\\rho/\\delta}$ to the error, for some constant C > 0. Instead, we show\n\n$\\|\\hat{x}_{C\\text{-}LASSO} - \\mu x_0\\|_2 \\le \\sigma\\frac{\\sqrt{\\rho}}{\\sqrt{\\delta - \\rho}}.$ (8)\n\nPrecise Analysis of the LASSO With Linear Measurements. The first precise error formulae were established in [2, 10] for the $\\ell_2^2$-LASSO with \u21131-regularization. The analysis was based on the Approximate Message Passing (AMP) framework [9]. A more general line of work studies the problem using a recently developed framework termed the Convex Gaussian Min-max Theorem (CGMT) [19], which is a tight version of a classical Gaussian comparison inequality by Gordon [12]. The CGMT framework was initially used by Stojnic [18] to derive tight upper bounds on the constrained LASSO with \u21131-regularization; [16] generalized those to general convex regularizers and also to the \u21132-LASSO; the $\\ell_2^2$-LASSO was studied in [21]. Those bounds hold for all values of SNR, but they become tight only in the high-SNR regime. A precise error expression for all values of SNR was derived in [20] for the \u21132-LASSO with \u21131-regularization under a gaussianity assumption on the distribution of the non-zero entries of x0. When measurements are linear, our Theorem 2.3 generalizes this assumption. Moreover, our Theorem 2.2 provides error predictions for regularizers going beyond the \u21131-norm, e.g. $\\ell_{1,2}$-norm, nuclear norm, which appear to be novel. When it comes to non-linear measurements, to the best of our knowledge, this paper is the first to derive asymptotically precise results on the performance of any LASSO-type program.\n2 Results\n2.1 Modeling Assumptions\nUnknown structured signal. We let x0 \u2208 Rn represent the unknown signal vector. We assume that $x_0 = \\bar{x}_0/\\|\\bar{x}_0\\|_2$, with $\\bar{x}_0$ sampled from a probability density $p_{\\bar{x}_0}$ in Rn. Thus, x0 is deterministically of unit Euclidean-norm (this is mostly to simplify the presentation, see Footnote 4). Information about the structure of x0 (and correspondingly of $\\bar{x}_0$) is encoded in $p_{\\bar{x}_0}$. E.g., to study an x0 which is sparse, it is typical to assume that its entries are i.i.d. $\\bar{x}_{0,i} \\sim (1 - \\rho)\\delta_0 + \\rho q_{\\bar{X}_0}$, where \u03c1 \u2208 (0, 1) becomes the normalized sparsity level, $q_{\\bar{X}_0}$ is a scalar p.d.f. and \u03b40 is the Dirac delta function (footnote 5).\nRegularizer. We consider convex regularizers f : Rn \u2192 R.\nMeasurement matrix. The entries of A \u2208 Rm\u00d7n are i.i.d. N (0, 1).\nMeasurements and Link-function. We observe $y = \\vec{g}(Ax_0)$ where $\\vec{g}$ is a (possibly random) map from Rm to Rm and $\\vec{g}(u) = [g_1(u_1), \\ldots, g_m(u_m)]^T$. Each gi is i.i.d. from a real valued random function g for which \u00b5 and \u03c3\u00b2 are defined in (3). We assume that \u00b5 and \u03c3\u00b2 are nonzero and bounded.\nAsymptotics. We study a linear asymptotic regime. In particular, we consider a sequence of problem instances $\\{\\bar{x}_0^{(n)}, A^{(n)}, f^{(n)}, m^{(n)}\\}_{n\\in N}$ indexed by n such that $A^{(n)} \\in R^{m\\times n}$ has entries i.i.d. N (0, 1), $f^{(n)} : R^n \\to R$ is proper convex, and, m := m(n) with m = \u03b4n, \u03b4 \u2208 (0,\u221e). We further require that the following conditions hold:\n(a) $\\bar{x}_0^{(n)}$ is sampled from a probability density $p^{(n)}_{\\bar{x}_0}$ in Rn with one-dimensional marginals that are independent of n and have bounded second moments. Furthermore, $n^{-1}\\|\\bar{x}_0^{(n)}\\|_2^2 \\xrightarrow{P} \\sigma_x^2 = 1$.\n(b) For any n \u2208 N and any $\\|x\\|_2 \\le C$, it holds $n^{-1/2} f(x) \\le c_1$ and $n^{-1/2}\\max_{s\\in\\partial f^{(n)}(x)} \\|s\\|_2 \\le c_2$, for constants c1, c2, C \u2265 0 independent of n.\nIn (a), we used \u201c$\\xrightarrow{P}$\u201d to denote convergence in probability as n \u2192 \u221e. The assumption $\\sigma_x^2 = 1$ holds without loss of generality, and, is only necessary to simplify the presentation. In (b), \u2202f (x) denotes the subdifferential of f at x. The condition itself is no more than a normalization condition on f.\nEvery such sequence $\\{\\bar{x}_0^{(n)}, A^{(n)}, f^{(n)}\\}_{n\\in N}$ generates a sequence $\\{x_0^{(n)}, y^{(n)}\\}_{n\\in N}$ where $x_0^{(n)} := \\bar{x}_0^{(n)}/\\|\\bar{x}_0^{(n)}\\|_2$ and $y^{(n)} := \\vec{g}^{(n)}(Ax_0)$. When clear from the context, we drop the superscript (n).\n2.2 Precise Error Prediction\nLet $\\{\\bar{x}_0^{(n)}, A^{(n)}, f^{(n)}, y^{(n)}\\}_{n\\in N}$ be a sequence of problem instances satisfying all the conditions above. With these, define the sequence $\\{\\hat{x}^{(n)}\\}_{n\\in N}$ of solutions to the corresponding LASSO problems for fixed \u03bb > 0:\n\n$\\hat{x}^{(n)} := \\arg\\min_x \\frac{1}{\\sqrt{n}}\\left\\{\\|y^{(n)} - A^{(n)}x\\|_2 + \\lambda f^{(n)}(x)\\right\\}.$ (9)\n\nThe main contribution of this paper is a precise evaluation of $\\lim_{n\\to\\infty} \\|\\mu^{-1}\\hat{x}^{(n)} - x_0^{(n)}\\|_2^2$ with high probability over the randomness of A, of x0, and of g.\n\n2.2.1 General Result\nTo state the result in a general framework, we require a further assumption on $p^{(n)}_{\\bar{x}_0}$ and $f^{(n)}$. Later in this section we illustrate how this assumption can be naturally met. We write f\u2217 for the Fenchel conjugate of f, i.e., $f^*(v) := \\sup_x x^T v - f(x)$; also, we denote the Moreau envelope of f at v with index \u03c4 to be $e_{f,\\tau}(v) := \\min_x \\{\\frac{1}{2}\\|v - x\\|_2^2 + \\tau f(x)\\}$.\nAssumption 1. We say Assumption 1 holds if for all non-negative constants c1, c2, c3 \u2208 R the point-wise limit of $\\frac{1}{n} e_{\\sqrt{n}(f^*)^{(n)}, c_3}(c_1 h + c_2 \\bar{x}_0)$ exists with probability one over $h \\sim N(0, I_n)$ and $\\bar{x}_0 \\sim p^{(n)}_{\\bar{x}_0}$. Then, we denote the limiting value as F (c1, c2, c3).\nTheorem 2.1 (Non-linear=Linear). Consider the asymptotic setup of Section 2.1 and let Assumption 1 hold. Recall \u00b5 and \u03c3\u00b2 as in (3) and let \u02c6x be the minimizer of the Generalized LASSO in (9) for fixed \u03bb > 0 and for measurements given by (1). Further let \u02c6xlin be the solution to the Generalized LASSO when used with linear measurements of the form ylin = A(\u00b5x0) + \u03c3z, where z has entries i.i.d. standard normal. Then, in the limit of n \u2192 \u221e, with probability one,\n\n$\\|\\hat{x} - \\mu x_0\\|_2^2 = \\|\\hat{x}_{lin} - \\mu x_0\\|_2^2.$\n\nFootnote 5: Such models have been widely used in the relevant literature, e.g. [7,8,10]. In fact, the results here continue to hold as long as the marginal distribution of x0 converges to a given distribution (as in [2]).\n\nTheorem 2.1 relates in a very precise manner the error of the Generalized LASSO under non-linear measurements to the error of the same algorithm when used under appropriately scaled noisy linear measurements. Theorem 2.2 below derives an asymptotically exact expression for the error.\nTheorem 2.2 (Precise Error Formula). Under the same assumptions of Theorem 2.1 and \u03b4 := m/n, it holds, with probability one,\n\n$\\lim_{n\\to\\infty} \\|\\hat{x} - \\mu x_0\\|_2^2 = \\alpha_*^2,$\n\nwhere \u03b1\u2217 is the unique optimal solution to the convex program\n\n$\\min_{\\alpha\\ge 0}\\ \\max_{0\\le\\beta\\le 1,\\ \\tau\\ge 0}\\ \\beta\\sqrt{\\delta}\\sqrt{\\alpha^2 + \\sigma^2} - \\frac{\\alpha\\tau}{2} + \\frac{\\mu^2\\tau}{2\\alpha} - \\frac{\\alpha\\lambda^2}{\\tau} F\\left(\\frac{\\beta}{\\lambda}, \\frac{\\mu\\tau}{\\lambda\\alpha}, \\frac{\\tau}{\\lambda\\alpha}\\right).$ (10)\n\nAlso, the optimal cost of the LASSO in (9) converges to the optimal cost of the program in (10).\nUnder the stated conditions, Theorem 2.2 proves that the limit of $\\|\\hat{x} - \\mu x_0\\|_2$ exists and is equal to the unique solution of the optimization program in (10). Notice that this is a deterministic and convex optimization, which only involves three scalar optimization variables. Thus, the optimal \u03b1\u2217 can, in principle, be efficiently numerically computed. In many specific cases of interest, with some extra effort, it is possible to yield simpler expressions for \u03b1\u2217, e.g. see Theorem 2.3 below.
The role of the normalized number of measurements \u03b4 = m/n, of the regularizer parameter \u03bb, and, that of g, through \u00b5 and \u03c3\u00b2, are explicit in (10); the structure of x0 and the choice of the regularizer f are implicit in F . Figures 1-2 illustrate the accuracy of the prediction of the theorem in a number of different settings. The proofs of both the Theorems are deferred to Appendix A. In the next sections, we specialize Theorem 2.2 to the cases of sparse, group-sparse and low-rank signal recovery.\n2.2.2 Sparse Recovery\nAssume each entry $\\bar{x}_{0,i}$, i = 1, . . . , n is sampled i.i.d. from a distribution\n\n$p_{\\bar{X}_0}(x) = (1 - \\rho)\\cdot\\delta_0(x) + \\rho\\cdot q_{\\bar{X}_0}(x),$ (11)\n\nwhere \u03b40 is the Dirac delta function, \u03c1 \u2208 (0, 1) and $q_{\\bar{X}_0}$ is a probability density function with second moment normalized to 1/\u03c1 so that condition (a) of Section 2.1 is satisfied. Then, $x_0 = \\bar{x}_0/\\|\\bar{x}_0\\|_2$ is \u03c1n-sparse on average and has unit Euclidean norm. Letting $f(x) = \\|x\\|_1$ also satisfies condition (b). Let us now check Assumption 1. The Fenchel conjugate of the \u21131-norm is simply the indicator function of the $\\ell_\\infty$ unit ball. Hence, without much effort,\n\n$\\frac{1}{n} e_{\\sqrt{n}(f^*)^{(n)}, c_3}(c_1 h + c_2 \\bar{x}_0) = \\frac{1}{2n}\\sum_{i=1}^n \\min_{|v_i|\\le 1} (v_i - (c_1 h_i + c_2 \\bar{x}_{0,i}))^2 = \\frac{1}{2n}\\sum_{i=1}^n \\eta^2(c_1 h_i + c_2 \\bar{x}_{0,i}; 1),$ (12)\n\nwhere we have denoted\n\n$\\eta(x; \\tau) := (x/|x|)(|x| - \\tau)_+$ (13)\n\nfor the soft thresholding operator. An application of the weak law of large numbers shows that the limit of the expression in (12) equals $F(c_1, c_2, c_3) := \\frac{1}{2} E[\\eta^2(c_1 h + c_2 \\bar{X}_0; 1)]$, where the expectation is over h \u223c N (0, 1) and $\\bar{X}_0 \\sim p_{\\bar{X}_0}$. With all these, Theorem 2.2 is applicable. We have put extra effort in order to obtain the following equivalent but more insightful characterization of the error, as stated below and proved in Appendix B.\nTheorem 2.3 (Sparse Recovery). If \u03b4 > 1, then define \u03bbcrit = 0. Otherwise, let \u03bbcrit, \u03bacrit be the unique pair of solutions to the following set of equations:\n\n$\\kappa^2\\delta = \\sigma^2 + E[(\\eta(\\kappa h + \\mu\\bar{X}_0; \\kappa\\lambda) - \\mu\\bar{X}_0)^2],$ (14)\n$\\kappa\\delta = E[\\eta(\\kappa h + \\mu\\bar{X}_0; \\kappa\\lambda)\\cdot h],$ (15)\n\nwhere h \u223c N (0, 1) and is independent of $\\bar{X}_0 \\sim p_{\\bar{X}_0}$. Then, for any \u03bb > 0, with probability one,\n\n$\\lim_{n\\to\\infty} \\|\\hat{x} - \\mu x_0\\|_2^2 = \\begin{cases} \\delta\\kappa_{crit}^2 - \\sigma^2, & \\lambda \\le \\lambda_{crit}, \\\\ \\delta\\kappa_*^2(\\lambda) - \\sigma^2, & \\lambda \\ge \\lambda_{crit}, \\end{cases}$\n\nwhere $\\kappa_*^2(\\lambda)$ is the unique solution to (14).\n\nFigure 2: Squared error of the LASSO as a function of the regularizer parameter compared to the asymptotic predictions. Simulation points represent averages over 20 realizations. (a) Illustration of Thm. 2.3 for g(x) = sign(x), n = 512, $p_{\\bar{X}_0}(+1) = p_{\\bar{X}_0}(-1) = 0.05$, $p_{\\bar{X}_0}(0) = 0.9$ and two values of \u03b4, namely 0.75 and 1.2. (b) Illustration of Thm. 2.2 for x0 being group-sparse as in Section 2.2.3 and gi(x) = sign(x + 0.3zi). In particular, x0 is composed of t = 512 blocks of block size b = 3. Each block is zero with probability 0.95, otherwise its entries are i.i.d. N (0, 1). Finally, \u03b4 = 0.75.\n[Figure 2 plots: (a) Sparse signal recovery, x-axis \u03bb, y-axis $\\|\\mu^{-1}\\hat{x} - x_0\\|_2^2$, Simulation vs. Thm. 2.3 for \u03b4 = 0.75 and \u03b4 = 1.2, with \u03bbcrit marked; (b) Group-sparse signal recovery, x-axis \u03bb, y-axis $\\|\\hat{x} - \\mu x_0\\|_2^2$, Simulation vs. Thm. 2.2.]\n\nFigures 1 and 2(a) validate the prediction of the theorem, for different signal distributions, namely $q_{\\bar{X}_0}$ being Gaussian and Bernoulli, respectively. For the case of compressed (\u03b4 < 1) measurements, observe the two different regimes of operation, one for \u03bb \u2264 \u03bbcrit and the other for \u03bb \u2265 \u03bbcrit, precisely as they are predicted by the theorem (see also [16, Sec. 8]). The special case of Theorem 2.3 for which $q_{\\bar{X}_0}$ is Gaussian has been previously studied in [20]. Otherwise, to the best of our knowledge, this is the first precise analysis result for the \u21132-LASSO stated in that generality. An analogous result, but via different analysis tools, has only been known for the $\\ell_2^2$-LASSO as appears in [2].\n\n2.2.3 Group-Sparse Recovery\nLet $\\bar{x}_0 \\in R^n$ be composed of t non-overlapping blocks of constant size b each such that n = t \u00b7 b. Each block $[\\bar{x}_0]_i$, i = 1, . . . , t is sampled i.i.d. from a probability density in Rb: $p_{\\bar{X}_0}(x) = (1 - \\rho)\\cdot\\delta_0(x) + \\rho\\cdot q_{\\bar{X}_0}(x)$, x \u2208 Rb, where \u03c1 \u2208 (0, 1). Thus, $\\bar{x}_0$ is \u03c1t-block-sparse on average. We operate in the regime of linear measurements m/n = \u03b4 \u2208 (0,\u221e). As is common we use the $\\ell_{1,2}$-norm to induce block-sparsity, i.e., $f(x) = \\sum_{i=1}^t \\|[x]_i\\|_2$; with this, (9) is often referred to as group-LASSO in the literature [25]. It is not hard to show that Assumption 1 holds with $F(c_1, c_2, c_3) := \\frac{1}{2b} E[\\|\\vec{\\eta}(c_1 h + c_2 \\bar{X}_0; 1)\\|_2^2]$, where $\\vec{\\eta}(x; \\tau) = (x/\\|x\\|_2)(\\|x\\|_2 - \\tau)_+$, x \u2208 Rb, is the vector soft thresholding operator and $h \\sim N(0, I_b)$, $\\bar{X}_0 \\sim p_{\\bar{X}_0}$ are independent. Thus Theorem 2.2 is applicable in this setting; Figure 2(b) illustrates the accuracy of the prediction.\n2.2.4 Low-rank Matrix Recovery\nLet $\\bar{X}_0 \\in R^{d\\times d}$ be an unknown matrix of rank r, in which case, $\\bar{x}_0 = vec(\\bar{X}_0)$ with n = d\u00b2. Assume m/d\u00b2 = \u03b4 \u2208 (0,\u221e) and r/d = \u03c1 \u2208 (0, 1). As usual in this setting, we consider nuclear-norm regularization; in particular, we choose $f(x) = \\sqrt{d}\\|X\\|_*$. Each subgradient S \u2208 \u2202f (X) then satisfies $\\|S\\|_F \\le d$ in agreement with assumption (b) of Section 2.1. Furthermore, for this choice of regularizer, we have\n\n$\\frac{1}{n} e_{\\sqrt{n}(f^*)^{(n)}, c_3}(c_1 H + c_2 \\bar{X}_0) = \\frac{1}{2d^2}\\min_{\\|V\\|_2\\le\\sqrt{d}} \\|V - (c_1 H + c_2 \\bar{X}_0)\\|_F^2 = \\frac{1}{2d}\\min_{\\|V\\|_2\\le 1} \\|V - d^{-1/2}(c_1 H + c_2 \\bar{X}_0)\\|_F^2 = \\frac{1}{2d}\\sum_{i=1}^d \\eta^2\\left(s_i\\left(d^{-1/2}(c_1 H + c_2 \\bar{X}_0)\\right); 1\\right),$\n\nwhere \u03b7(\u00b7;\u00b7) is as in (13), si(\u00b7) denotes the ith singular value of its argument and H \u2208 Rd\u00d7d has entries N (0, 1). If conditions are met such that the empirical distribution of the singular values of (the sequence of random matrices) $c_1 H + c_2 \\bar{X}_0$ converges asymptotically to a limiting distribution, say q(c1, c2), then $F(c_1, c_2, c_3) := \\frac{1}{2} E_{x\\sim q(c_1,c_2)}[\\eta^2(x; 1)]$, and Theorems 2.1\u20132.2 apply. For instance, this will be the case if $d^{-1/2}\\bar{X}_0 = USV^t$, where U, V are unitary matrices and S is a diagonal matrix whose entries have a given marginal distribution with bounded moments (in particular, independent of d).
We leave the details and the problem of (numerically) evaluating $F$ for future work.\n\n2.3 An Application to q-bit Compressive Sensing\n2.3.1 Setup\nConsider recovering a sparse unknown signal $x_0 \\in R^n$ from scalar $q$-bit quantized linear measurements. Let $t := \\{t_0 = 0, t_1, \\ldots, t_{L-1}, t_L = +\\infty\\}$ represent a (symmetric with respect to 0) set of decision thresholds and $\\ell := \\{\\pm\\ell_1, \\pm\\ell_2, \\ldots, \\pm\\ell_L\\}$ the corresponding representation points, such that $L = 2^{q-1}$. Then, quantization of a real number $x$ into $q$ bits can be represented as\n\n$$Q_q(x, \\ell, t) = \\mathrm{sign}(x) \\sum_{i=1}^{L} \\ell_i 1_{\\{t_{i-1} \\le |x| \\le t_i\\}},$$\n\nwhere $1_S$ is the indicator function of a set $S$. For example, 1-bit quantization with level $\\ell$ corresponds to $Q_1(x, \\ell) = \\ell \\cdot \\mathrm{sign}(x)$. The measurement vector $y = [y_1, y_2, \\ldots, y_m]^T$ takes the form\n\n$$y_i = Q_q(a_i^T x_0, \\ell, t), \\quad i = 1, 2, \\ldots, m, \\qquad (16)$$\n\nwhere the $a_i^T$'s are the rows of a measurement matrix $A \\in R^{m\\times n}$, which is henceforth assumed i.i.d. standard Gaussian. We use the LASSO to obtain an estimate $\\hat{x}$ of $x_0$ as\n\n$$\\hat{x} := \\arg\\min_{x} \\|y - Ax\\|_2 + \\lambda\\|x\\|_1. \\qquad (17)$$\n\nHenceforth, we assume for simplicity that $\\|x_0\\|_2 = 1$. Also, in our case, $\\mu$ is known since $g = Q_q$ is known; thus, it is reasonable to scale the solution of (17) as $\\mu^{-1}\\hat{x}$ and consider the error quantity $\\|\\mu^{-1}\\hat{x} - x_0\\|_2$ as a measure of estimation performance. Clearly, the error depends (among other things) on the number of bits $q$, on the choice of the decision thresholds $t$ and on the quantization levels $\\ell$. An interesting question of practical importance is how to choose these optimally so as to reduce the error. 
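The quantizer Q_q and the measurement model (16) described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the paper: the function names quantize and measure, the array layout (thresholds stored as [t_0, ..., t_L] and the positive levels as [l_1, ..., l_L]) and the tie-breaking rule at a threshold (the upper cell wins) are assumptions made here.

```python
import numpy as np

def quantize(x, levels, thresholds):
    # Hypothetical helper: symmetric scalar quantizer Q_q(x, l, t).
    # levels     = [l_1, ..., l_L]           (positive representation points)
    # thresholds = [t_0 = 0, t_1, ..., t_L]  with t_L = +inf
    x = np.asarray(x, dtype=float)
    levels = np.asarray(levels, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    mag = np.abs(x)
    # Cell index i (0-based) such that t_i <= |x| < t_{i+1}.
    # Assumption: a value exactly on a threshold is assigned to the upper cell.
    idx = np.searchsorted(thresholds, mag, side='right') - 1
    idx = np.clip(idx, 0, len(levels) - 1)
    return np.sign(x) * levels[idx]

def measure(x0, m, levels, thresholds, rng):
    # Hypothetical helper: draw an i.i.d. standard Gaussian A (m x n)
    # and return it together with y_i = Q_q(a_i^T x0, l, t).
    A = rng.standard_normal((m, x0.size))
    return A, quantize(A @ x0, levels, thresholds)
```

For instance, with q = 2 one may take levels [0.5, 1.5] and thresholds [0, 1, inf]; with q = 1, levels [l] and thresholds [0, inf] reduce the quantizer to l times the sign of x.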
As a running example for this section, we seek optimal quantization thresholds and corresponding levels\n\n$$(t^*, \\ell^*) = \\arg\\min_{t, \\ell} \\|\\mu^{-1}\\hat{x} - x_0\\|_2, \\qquad (18)$$\n\nwhile keeping all other parameters, such as the number of bits $q$ and of measurements $m$, fixed.\n\n2.3.2 Consequences of Precise Error Prediction\nTheorem 2.1 shows that $\\|\\mu^{-1}\\hat{x} - x_0\\|_2 = \\|\\hat{x}_{\\mathrm{lin}} - x_0\\|_2$, where $\\hat{x}_{\\mathrm{lin}}$ is the solution to (17), except that the measurement vector is now $y_{\\mathrm{lin}} = Ax_0 + \\frac{\\sigma}{\\mu} z$, where $\\mu, \\sigma$ are as in (20) and $z$ has entries i.i.d. standard normal. Thus, lower values of the ratio $\\sigma^2/\\mu^2$ correspond to lower values of the error, and the design problem posed in (18) is equivalent to the following simplified one:\n\n$$(t^*, \\ell^*) = \\arg\\min_{t, \\ell} \\frac{\\sigma^2(t, \\ell)}{\\mu^2(t, \\ell)}. \\qquad (19)$$\n\nTo be explicit, $\\mu$ and $\\sigma^2$ above can be easily expressed from (3) after setting $g = Q_q$ as follows:\n\n$$\\mu := \\mu(\\ell, t) = \\sqrt{\\frac{2}{\\pi}} \\sum_{i=1}^{L} \\ell_i \\cdot \\big(e^{-t_{i-1}^2/2} - e^{-t_i^2/2}\\big) \\quad \\text{and} \\quad \\sigma^2 := \\sigma^2(\\ell, t) := \\tau^2 - \\mu^2, \\qquad (20)$$\n\nwhere\n\n$$\\tau^2 := \\tau^2(\\ell, t) = 2 \\sum_{i=1}^{L} \\ell_i^2 \\cdot (Q(t_{i-1}) - Q(t_i)) \\quad \\text{and} \\quad Q(x) = \\frac{1}{\\sqrt{2\\pi}} \\int_{x}^{\\infty} \\exp(-u^2/2)\\,du.$$\n\n2.3.3 An Algorithm for Finding Optimal Quantization Levels and Thresholds\nIn contrast to the initial problem in (18), the optimization involved in (19) is explicit in terms of the variables $\\ell$ and $t$, but is still hard to solve in general. Interestingly, we show in Appendix C that the popular Lloyd-Max (LM) algorithm can be an effective algorithm for solving (19), since the values to which it converges are stationary points of the objective in (19). 
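As a numerical companion to (19) and (20), the following Python sketch evaluates mu and sigma^2 for a given symmetric quantizer and runs the textbook Lloyd-Max iteration for a standard normal input. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (gaussian_tail, gaussian_pdf, mu_sigma2, lloyd_max), the initialization of the thresholds at 0, 1, 2, ... and the fixed iteration count are choices made here for illustration.

```python
import math

def gaussian_tail(x):
    # Q(x) = P(gamma > x) for a standard normal gamma, as defined after (20).
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gaussian_pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mu_sigma2(levels, thresholds):
    # Evaluate mu and sigma^2 = tau^2 - mu^2 from (20).
    # levels = [l_1, ..., l_L]; thresholds = [t_0 = 0, ..., t_L = inf].
    mu = 0.0
    tau2 = 0.0
    for i, level in enumerate(levels):
        lo, hi = thresholds[i], thresholds[i + 1]
        mu += level * (math.exp(-0.5 * lo * lo) - math.exp(-0.5 * hi * hi))
        tau2 += level * level * (gaussian_tail(lo) - gaussian_tail(hi))
    mu *= math.sqrt(2.0 / math.pi)
    tau2 *= 2.0
    return mu, tau2 - mu * mu

def lloyd_max(num_levels, iters=500):
    # Textbook Lloyd-Max iteration for a standard normal source, restricted
    # to quantizers that are symmetric about 0 (num_levels = L positive levels).
    # Assumption: thresholds start at 0, 1, 2, ... and iters is fixed.
    thresholds = [float(i) for i in range(num_levels)] + [math.inf]
    levels = []
    for _ in range(iters):
        # Each level becomes the conditional mean of its cell.
        levels = [
            (gaussian_pdf(thresholds[i]) - gaussian_pdf(thresholds[i + 1]))
            / (gaussian_tail(thresholds[i]) - gaussian_tail(thresholds[i + 1]))
            for i in range(num_levels)
        ]
        # Each interior threshold becomes the midpoint of adjacent levels.
        inner = [0.5 * (levels[i] + levels[i + 1]) for i in range(num_levels - 1)]
        thresholds = [0.0] + inner + [math.inf]
    return levels, thresholds
```

A possible use is to call lloyd_max(2) for a 2-bit quantizer and pass the result to mu_sigma2 to obtain the ratio sigma^2 / mu^2 appearing in (19). For 1-bit quantization the ratio equals pi/2 - 1 for any level, since it is invariant to rescaling the levels.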
Note that this stationarity property is not directly obvious, since the classical objective of the LM algorithm is minimizing the quantity $E[\\|y - Ax_0\\|_2^2]$ rather than $E[\\|\\mu^{-1}\\hat{x} - x_0\\|_2^2]$.\n\nReferences\n[1] Francis R. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118\u2013126, 2010.\n\n[2] Mohsen Bayati and Andrea Montanari. The lasso risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997\u20132017, 2012.\n\n[3] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791\u2013806, 2011.\n\n[4] David R. Brillinger. The identification of a particular nonlinear time series system. Biometrika, 64(3):509\u2013515, 1977.\n\n[5] David R. Brillinger. A generalized linear model with \u201cGaussian\u201d regressor variables. A Festschrift For Erich L. Lehmann, page 97, 1982.\n\n[6] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805\u2013849, 2012.\n\n[7] David L. Donoho and Iain M. Johnstone. Minimax risk over $\\ell_p$-balls for $\\ell_p$-error. Probability Theory and Related Fields, 99(2):277\u2013303, 1994.\n\n[8] David L. Donoho, Iain Johnstone, and Andrea Montanari. Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising. IEEE Transactions on Information Theory, 59(6):3396\u20133433, 2013.\n\n[9] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914\u201318919, 2009.\n\n[10] David L. Donoho, Arian Maleki, and Andrea Montanari. The noise-sensitivity phase transition in compressed sensing. 
IEEE Transactions on Information Theory, 57(10):6920\u20136941, 2011.\n\n[11] Alexandra L. Garnham and Luke A. Prendergast. A note on least squares sensitivity in single-index model estimation and the benefits of response transformations. Electronic J. of Statistics, 7:1983\u20132004, 2013.\n\n[12] Yehoram Gordon. On Milman\u2019s inequality and random subspaces which escape through a mesh in $R^n$. Springer, 1988.\n\n[13] Marwa El Halabi and Volkan Cevher. A totally unimodular view of structured sparsity. arXiv preprint arXiv:1411.1990, 2014.\n\n[14] Hidehiko Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1):71\u2013120, 1993.\n\n[15] Ker-Chau Li and Naihua Duan. Regression analysis under link violation. The Annals of Statistics, pages 1009\u20131052, 1989.\n\n[16] Samet Oymak, Christos Thrampoulidis, and Babak Hassibi. The squared-error of generalized lasso: A precise analysis. arXiv preprint arXiv:1311.0830, 2013.\n\n[17] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. arXiv preprint arXiv:1502.04071, 2015.\n\n[18] Mihailo Stojnic. A framework to characterize performance of lasso algorithms. arXiv preprint arXiv:1303.7291, 2013.\n\n[19] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Proceedings of The 28th Conference on Learning Theory, pages 1683\u20131709, 2015.\n\n[20] Christos Thrampoulidis, Ashkan Panahi, Daniel Guo, and Babak Hassibi. Precise error analysis of the lasso. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3467\u20133471.\n\n[21] Christos Thrampoulidis, Ashkan Panahi, and Babak Hassibi. Asymptotically exact error analysis for the generalized $\\ell_2^2$-lasso. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 2021\u20132025. 
IEEE, 2015.\n\n[22] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267\u2013288, 1996.\n\n[23] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91\u2013108, 2005.\n\n[24] Xinyang Yi, Zhaoran Wang, Constantine Caramanis, and Han Liu. Optimal linear estimation under unknown nonlinear transform. arXiv preprint arXiv:1505.03257, 2015.\n\n[25] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49\u201367, 2006.\n", "award": [], "sourceid": 1888, "authors": [{"given_name": "CHRISTOS", "family_name": "THRAMPOULIDIS", "institution": "Caltech"}, {"given_name": "Ehsan", "family_name": "Abbasi", "institution": "Caltech"}, {"given_name": "Babak", "family_name": "Hassibi", "institution": "Caltech"}]}