{"title": "A Universal Analysis of Large-Scale Regularized Least Squares Solutions", "book": "Advances in Neural Information Processing Systems", "page_first": 3381, "page_last": 3390, "abstract": "A problem that has been of recent interest in statistical inference, machine learning and signal processing is that of understanding the asymptotic behavior of regularized least squares solutions under random measurement matrices (or dictionaries). The Least Absolute Shrinkage and Selection Operator (LASSO or least-squares with $\\ell_1$ regularization) is perhaps one of the most interesting examples. Precise expressions for the asymptotic performance of LASSO have been obtained for a number of different cases, in particular when the elements of the dictionary matrix are sampled independently from a Gaussian distribution. It has also been empirically observed that the resulting expressions remain valid when the entries of the dictionary matrix are independently sampled from certain non-Gaussian distributions. In this paper, we confirm these observations theoretically when the distribution is sub-Gaussian. We further generalize the previous expressions for a broader family of regularization functions and under milder conditions on the underlying random, possibly non-Gaussian, dictionary matrix. 
In particular, we establish the universality of the asymptotic statistics (e.g., the average quadratic risk) of LASSO with non-Gaussian dictionaries.", "full_text": "A Universal Analysis of Large-Scale Regularized Least Squares Solutions

Ashkan Panahi
Department of Electrical and Computer Engineering
North Carolina State University
Raleigh, NC 27606
apanahi@ncsu.edu

Babak Hassibi
Department of Electrical Engineering
California Institute of Technology
Pasadena, CA 91125
hassibi@caltech.edu

Abstract

A problem that has been of recent interest in statistical inference, machine learning and signal processing is that of understanding the asymptotic behavior of regularized least squares solutions under random measurement matrices (or dictionaries). The Least Absolute Shrinkage and Selection Operator (LASSO, or least squares with ℓ1 regularization) is perhaps one of the most interesting examples. Precise expressions for the asymptotic performance of LASSO have been obtained for a number of different cases, in particular when the elements of the dictionary matrix are sampled independently from a Gaussian distribution. It has also been empirically observed that the resulting expressions remain valid when the entries of the dictionary matrix are independently sampled from certain non-Gaussian distributions. In this paper, we confirm these observations theoretically when the distribution is sub-Gaussian. We further generalize the previous expressions for a broader family of regularization functions and under milder conditions on the underlying random, possibly non-Gaussian, dictionary matrix. 
In particular, we establish the universality of the asymptotic statistics (e.g., the average quadratic risk) of LASSO with non-Gaussian dictionaries.

1 Introduction

During the last few decades, retrieving structured data from an incomplete set of linear observations has received enormous attention in a wide range of applications. This problem is especially interesting when the ambient dimension of the data is very large, so that the data cannot be directly observed and manipulated. One of the main approaches is to solve a regularized least squares optimization problem tied to the underlying data model, which can generally be expressed as

min_x (1/2) ‖y − Ax‖₂² + f(x),   (1)

where x ∈ R^n and y ∈ R^m are the desired data and observation vectors, respectively. The matrix A ∈ R^{m×n} is the sensing matrix, representing the observation process, and the regularization function f: R^n → R imposes the desired structure on the observed data. When f is convex, the optimization in (1) can be solved reliably with a reasonable amount of computation. In particular, the case where f is the ℓ1 norm is known as the LASSO, which has been extremely successful in retrieving sparse data vectors. In recent years, random sensing matrices A have been widely used and studied in the context of convex regularized least squares problems. 
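As a concrete illustration of how (1) is typically solved in practice (a minimal sketch, not part of the paper's analysis; all names are illustrative), the LASSO case f(x) = λ‖x‖₁ admits the proximal gradient (ISTA) iteration, whose proximal step is the soft-thresholding operator:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    # Proximal gradient (ISTA) for min_x 0.5*||y - A x||_2^2 + lam*||x||_1.
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)         # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

With step size 1/L the objective value is non-increasing, and each iteration costs two matrix-vector products.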
From the perspective of data retrieval, this choice is supported by a number of studies in the so-called Compressed Sensing (CS) literature, which show that under reasonable assumptions, random matrices lead to good performance [1, 2, 3].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

An interesting topic, addressed in the recent compressed sensing literature (and also considered here), is to understand the behavior of the regularized least squares solution in the asymptotic regime, where m and n grow to infinity with a constant ratio γ = m/n. For this purpose, a widely considered scenario is one where y is generated by the linear model

y = Ax₀ + ν,   (2)

where x₀ is the true structured vector and ν is the noise vector, here assumed to consist of independent centered Gaussian entries with equal variances σ². It is then desired to characterize the statistical behavior of the optimal solution x̂ of (1), also called the estimate, and of the error w = x̂ − x₀. More specifically, we are interested in the asymptotic empirical distribution¹ of the estimate x̂ and the error w when the sensing matrix is also randomly generated with independent and identically distributed entries. Familiar examples of such matrices are Gaussian and Bernoulli matrices.

1.1 Previous Work

Analyzing linear least squares problems with random matrices has a long history. The behavior of the unregularized solution, or that of ridge regression (i.e., ℓ2 regularization), is characterized by the singular values of A, which are well understood in random matrix theory [4, 5]. However, a general study of regularized solutions became prominent with the advent of compressed sensing, where considerable effort has been directed toward an analysis in the sense explained above. 
In compressed sensing, early works focused on the LASSO (ℓ1 regularization), sparse vectors x₀, and the case where ν = 0 [6]. These works aimed at providing conditions that guarantee perfect recovery, meaning w = 0, and established the Restricted Isometry Property (RIP) as a deterministic perfect recovery condition. This condition is generally difficult to verify [7, 8]. It was soon observed that under mild conditions, random matrices satisfy the RIP condition with high probability when the dimensions grow to infinity with a proper ratio γ [9, 10]. Later, it was discovered that the RIP condition is unnecessary for the analysis of random matrices; in [11], an "RIP-less" theory of perfect recovery was introduced.

Despite some earlier attempts [12, 13], a successful error analysis of the LASSO for Gaussian matrices was not obtained until the important paper [14], where it was shown, through the analysis of so-called approximate message passing (AMP), that for any pseudo-Lipschitz function ψ: R² → R, and denoting by x̂_i and x_{0i} the ith elements of x̂ and x₀, respectively, the sample risk

(1/n) Σ_{i=1}^{n} ψ(x̂_i, x_{0i})

converges to a value that can be precisely computed. As a special case, the asymptotic value of the scaled ℓ2 norm of the error w is obtained by taking ψ(x̂_i, x_{0i}) = (x̂_i − x_{0i})². In [15], similar results are obtained for M-estimators using AMP. Fundamental bounds for linear estimation with Gaussian matrices have also recently been provided in [16]. Another remarkable direction of progress was made in a series of papers revolving around an approach first developed by Gordon [17] and introduced to the compressed sensing literature by Stojnic [18]. Employing Gordon's approach, [19] provides the analysis of a broad range of convex regularized least squares problems for Gaussian sensing matrices. 
Exact expressions are provided in that work only for asymptotically small noise. In [20], this result is utilized to provide an exact analysis of the LASSO for general noise variance, confirming the earlier results in [14]. Some further investigations have recently been provided in [21] and [22].

When there is no measurement noise, universal (non-Gaussian) results on the phase transition for the number of measurements that allows perfect recovery of the signal have recently been obtained in [23]. Another special case, ridge (ℓ2) regression, is studied in [24]. The technical approach in [23] is different from ours. Furthermore, the current paper considers measurement noise and is concerned with the performance of the algorithm, not with the phase transitions for perfect recovery. In [25], the so-called Lindeberg approach is proposed to study the asymptotic behavior of the LASSO. This is similar in spirit to our approach. However, the study in [25] is more limited than ours, in the sense that it only establishes universality of the expected value of the optimal cost when the LASSO is restricted to an arbitrary rectangular ball. Some stronger bounds on the error risk of the LASSO are established in [23, 26], which are sharp for asymptotically small noise or large m. However, to the best of our knowledge, there have not been any exact universal results of the same generality as ours in the literature. It is also noteworthy that our results can be predicted by the replica symmetry (RS) method, as partially developed in [27]. 

¹ The empirical distribution of a vector x is a measure ν where ν(A) is the fraction of entries of x valued in A.
Another recent area where the connection between RS and the performance of estimators has been rigorously established is low-rank matrix estimation [28, 29].

2 Main Results

Our contributions are twofold. First, we generalize the expressions in [21] and [20] to the more general case of arbitrary separable regularization functions f(x), where, with an abuse of notation,

f(x) = Σ_{i=1}^{n} f(x_i)   (3)

and the function f on the right-hand side is a real function f: R → R. Second, we show that our result is universal, which precisely means that our expressions are independent of the distribution (law) of the i.i.d. sensing matrix.

In general, we tie the asymptotic behavior of the optimization in (1) to the following two-dimensional optimization, which we refer to as the essential optimization:

C_f(γ, σ) = max_{β≥0} min_{p>0} { pβ(γ − 1)/2 − γβ²/2 + E[ S_f(β/p, pΓ + X) ] + γσ²β/(2p) },   (4)

where X and Γ are two independent random variables, distributed by an arbitrary distribution ξ and the standard Gaussian p.d.f., respectively. Further, S_f(·, ·) denotes the proximal function of f, which is defined by

S_f(q, y) = min_x (q/2)(x − y)² + f(x),   (5)

with the minimum located at x̂_f(q, y). If the solution p̂ = p̂(γ, σ), β̂ = β̂(γ, σ) of (4) is unique, then we define the random variables

X̂ = X̂_{f,ξ,σ,γ} = x̂_f(β̂/p̂, p̂Γ + X),   W = X̂ − X.

Our result can be expressed by the following theorem:

Theorem 1 Suppose that the entries of A are first generated independently by a proper distribution² μ and then scaled by 1/√m. 
Moreover, assume that the true vector x₀ is randomly generated with i.i.d. entries following some distribution ξ. Then,

• the optimal cost in (1), scaled by 1/n, converges in probability to C_f(γ, σ);
• the empirical distributions of the solution x̂ and of the error w weakly converge to the distributions of X̂ and W, respectively,

if one of the following holds:

1. The real function f is strongly convex.
2. The real function f equals λ|x| for some λ > 0, μ is further σ_s²-sub-Gaussian³, and the "effective sparsity" M₀ = Pr(X̂ ≠ 0) is smaller than a constant depending on μ, λ, γ, σ. For example, M₀ ≤ ρ/2 works whenever

ρ log 9 + H(ρ) ≤ min{ 1, 9/(8σ_s²) + (1/2) log(8σ_s²/9) },

where H(ρ) = −ρ log ρ − (1 − ρ) log(1 − ρ) is the binary entropy function⁴.

² Here, a proper distribution is one with vanishing first, third and fifth moments, unit variance, and finite fourth and sixth moments.
³ A centered random variable Z is σ_s²-sub-Gaussian if E(e^{rZ}) ≤ e^{σ_s² r²/2} holds for every r ∈ R.
⁴ In this paper, all logarithms are to the natural base (e).

We include more detailed and general results, as well as the proofs, in the supplementary material. In the rest of this paper, we discuss the consequences of Theorem 1, especially for the case of the LASSO, and give a sketch of the proof of our results.

3 Remarks and Numerical Results

In this section, we discuss a few issues arising from our analysis.

3.1 Evaluation of Asymptotic Values

A crucial question related to Theorem 1 and the essential optimization is how to calculate the optimal parameters in (4). Here, our purpose is to provide a simple recipe for solving the optimization in (4). 
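Concretely, the min-max in (4) can be attacked by nested one-dimensional searches, as formalized next. The toy sketch below is a hypothetical stand-in for (4) itself, with ternary search used in place of a dichotomous search as the derivative-free line-search oracle: the outer concave maximization over β repeatedly calls an inner convex minimization over p.

```python
def ternary_min(phi, lo, hi, tol=1e-9):
    # Derivative-free line search: minimize a unimodal (convex) phi on [lo, hi].
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    x = 0.5 * (lo + hi)
    return x, phi(x)

# Toy convex-concave objective (NOT the essential optimization itself):
# phi(p, beta) = beta*p + 1/p - beta**2, convex in p for each beta >= 0.
def phi(p, beta):
    return beta * p + 1.0 / p - beta ** 2

def psi(beta):
    # Inner line search over p; analytically psi(beta) = 2*sqrt(beta) - beta**2.
    return ternary_min(lambda p: phi(p, beta), 1e-6, 1e3)[1]

# Outer line search maximizes the concave psi by minimizing -psi.
beta_star, neg_val = ternary_min(lambda b: -psi(b), 1e-6, 10.0)
saddle_value = -neg_val  # analytic optimum: beta* = 2**(-2/3), value 3*2**(-4/3)
```

Each outer function evaluation is itself a full inner line search, exactly mirroring the alg(ψ(β)) construction in the text.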
Notice that (4) is a min-max optimization over the pair (p, β) of positive real numbers. We observe that this optimization has an appealing structure, which substantially simplifies its numerical solution:

Theorem 2 For any fixed β > 0, the objective function in (4) is convex in p. For any fixed β, denote the optimal value of the inner optimization (over p) in (4) by ψ(β). Then, ψ is a concave function of β.

Using Theorem 2, we may reduce the problem of solving (4) to a sequence of one-dimensional convex optimization problems (line searches). We assume a derivative-free⁵ algorithm alg(φ), such as dichotomous search (see the supplement for more details), which receives as input (an oracle of) a convex function φ and returns its optimal value and optimal point over [0, ∞). Denote the cost function of (4) by φ(p, β), so that ψ(β) = min_p φ(p, β). If φ(p, β) is easy to calculate, then alg(φ(·, β)) for a fixed β is an oracle of ψ(β). Since ψ(β) is now easy to calculate, we may execute alg(ψ) to obtain the optimal parameters.

3.1.1 Derivation for LASSO

To apply the above technique, we require a fast method to evaluate the objective function in (4). Here, we provide the expressions for the case of the LASSO with f(x) = λ|x|, as originally formulated in [30]. For this case, we assume that each entry of the true vector x₀ is, independently, standard Gaussian with probability κ and zero with probability 1 − κ, for 0 ≤ κ ≤ 1. In other words, ξ = κN + (1 − κ)δ₀, where N and δ₀ are the standard Gaussian and Dirac measures on R, respectively. 
Then, we have that

E[ S_f(β/p, pΓ + X) ] = κ √(1 + p²) F( (β/p)√(1 + p²) ) + (1 − κ) p F(β),

where

F(q) = λ e^{−λ²/(2q²)} / (2√(2π)) − (q/2)(1 + λ²/q²) Q(λ/q) + q/4.

The function Q(·) is the Gaussian tail Q-function. We may substitute the above expression into the definition of the essential optimization to obtain p̂, β̂ and the random variables X̂, W of Section 2. Now, let us calculate ‖w‖₂²/n by taking the expectation over the empirical distribution of w. Using Theorem 1, we obtain the following expression for the asymptotic value of ‖w‖₂²/n:

E(W²) = κ J(λp/β, p, 1) + (1 − κ) J(λp/β, p, 0),

where

J(ε, p, α) = α² + 2(p² + ε² − α²) Q( ε/√(α² + p²) ) − 2ε √((α² + p²)/(2π)) exp( −ε²/(2(α² + p²)) ).

Figure 1a depicts the average value of ‖w‖₂²/n over 50 independent realizations of the LASSO, with independent Gaussian sensing matrices with γ = 0.5, sparse true vectors with κ = 0.2, and Gaussian 

⁵ It is also possible to use derivative-based algorithms, but this requires calculating the derivatives of S_f and ψ. We do not study this case.

Figure 1: a) The sample mean of the quadratic risk for different values of λ, compared to its theoretical value. The average is taken over 50 trials. b) The sample variance of the quadratic risk for different values of λ. 
The average is taken over 1000 trials.

Figure 2: Asymptotic error ℓ1 and squared ℓ2 norms, as well as the solution sparsity. Their corresponding optimal λ values are depicted by vertical lines.

noise realizations with σ² = 0.1. We consider two different problem sizes, n = 200 and n = 500. As seen, the sample mean, which approximates the statistical mean E(‖w‖₂²/n), agrees with the theoretical results above.

Figure 1b examines the convergence of the error 2-norm by depicting the sample variance of ‖w‖₂²/n for the two cases above, n = 200 and n = 500. Each data point is obtained from 1000 independent realizations. As seen, the case n = 500 has a smaller variance, which indicates that as the dimensions grow, the variance of the quadratic risk vanishes and the risk converges in probability to its mean. Another interesting phenomenon in Figure 1b is that larger values of λ are associated with larger uncertainty (variance), especially for smaller problem sizes.

The asymptotic analysis allows us to choose an optimal value of the regularization parameter λ. Figure 2 shows a few possibilities. It depicts the theoretical values of the error squared ℓ2 and ℓ1 norms, as well as the sparsity of the solution. The (effective⁶) sparsity can be calculated as

M₀ = Pr(X̂ ≠ 0) = 2(1 − κ) Q(λ/β) + 2κ Q( λp/(β√(1 + p²)) ).

The expression for the ℓ1 norm can be calculated similarly to the ℓ2 norm, but it does not have a closed form and is evaluated by a Monte Carlo method. We observe that at the minimal error, in both the ℓ2 and ℓ1 senses, the solution is sparser than the true vector (κ = 0.2). Conversely, adjusting the sparsity to the true one slightly increases the error ℓ2 norm. 
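The expressions above for E(W²) and M₀ are straightforward to evaluate numerically. The sketch below (an illustration, not the authors' code) transcribes J, E(W²) and M₀, implementing the Gaussian tail Q through the complementary error function; the inputs p and beta are placeholders for the solution (p̂, β̂) of (4), which must be obtained separately, e.g., by the nested line searches of Section 3.1:

```python
import math

def Q(x):
    # Gaussian tail function Q(x) = P(N(0,1) > x).
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def J(eps, p, alpha):
    # J(eps, p, alpha) as displayed in the text.
    s2 = alpha ** 2 + p ** 2
    return (alpha ** 2
            + 2.0 * (p ** 2 + eps ** 2 - alpha ** 2) * Q(eps / math.sqrt(s2))
            - 2.0 * eps * math.sqrt(s2 / (2.0 * math.pi))
            * math.exp(-eps ** 2 / (2.0 * s2)))

def risk_and_sparsity(lam, kappa, p, beta):
    # Asymptotic quadratic risk E(W^2) and effective sparsity M0,
    # with p and beta standing in for the solution of (4).
    eps = lam * p / beta
    ew2 = kappa * J(eps, p, 1.0) + (1.0 - kappa) * J(eps, p, 0.0)
    m0 = (2.0 * (1.0 - kappa) * Q(lam / beta)
          + 2.0 * kappa * Q(lam * p / (beta * math.sqrt(1.0 + p ** 2))))
    return ew2, m0
```

For instance, with the hypothetical values λ = 1, κ = 0.2, p = β = 1, the returned effective sparsity lies strictly between 0 and 1 and decreases as λ grows, matching the monotone behavior discussed here.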
As expected, the sparsity of the solution decreases monotonically with increasing λ.

⁶ Since we establish weak convergence of the empirical distribution, this number does not necessarily reflect the number of exactly zero elements in x̂, but rather the "infinitesimally" small ones.

Figure 3: (a) The average LASSO error ℓ2 norm and (b) the sample variance of the LASSO error ℓ2 norm for different matrices.

3.2 Universality and Heavy-tailed Distributions

In the previous section, we demonstrated numerical results generated with Gaussian matrices. Here, we focus on universality. In Section 2, our results hold under some regularity assumptions on the sensing matrix. For the LASSO, we require sub-Gaussian entries and low sparsity, which corresponds to a large regularization parameter. Here, we examine these conditions in three cases: first, a centered Bernoulli matrix, where each entry is −1 or 1 with probability 1/2; second, a matrix with entries drawn from Student's t-distribution with ν = 3 degrees of freedom, scaled to unit variance; third, an asymmetric Bernoulli matrix, where each entry is either 3 or −1/3 with probabilities 0.1 and 0.9, respectively. Figure 3 shows the error ℓ2 norm and its variance for the LASSO. As seen, all cases follow the predicted asymptotic result. However, the results for the t-distributed and asymmetric matrices lie beyond our analysis, since the t-distribution does not possess finite statistical moments of order larger than 2, and the asymmetric case possesses a non-vanishing third moment. 
This indicates that our universality results hold beyond the limits assumed in this paper; however, we are not able to prove this with our current technique.

3.3 Remarks on More General Universality Results

As we explain in the supplementary document, Theorem 1 is specialized from a more general result. Here, we briefly discuss the main aspects of this general result. When the regularization is not separable (f is non-separable), our analysis may still guarantee universality of its behavior; however, we are no longer able to evaluate the asymptotic values. Instead, we relate the behavior of a general sensing matrix to a reference choice, e.g., a Gaussian matrix. For example, we are able to show that if the optimal objective value in (1) converges to a particular value for the reference matrix, then it converges to exactly the same value for the other suitable matrices in Theorem 1. The asymptotic optimal value itself may remain unknown to us. The universality of the optimal value holds for a much broader family of regularizations than the separable ones in (3). For example, "weakly separable" functions of the form

f(x) = (1/n) Σ_{i,j} f(x_i, x_j)

or the generalized fused LASSO [31, 32] are easily seen to possess universal optimal values. One important property of our generalized result is that if we are able to establish optimal-value universality for a particular regularization function f(x), then we automatically have a similar result for f(Ψx), where Ψ is a fixed matrix satisfying certain regularity conditions⁷. This connects our analysis to the analysis of the generalized LASSO [33, 34]. Moreover, substituting f(Ψx) in (1) and changing the optimization variable to x′ = Ψx, we obtain (1) with A replaced by AΨ⁻¹. Hence, our approach enables us to obtain further results on sensing matrices of the form AΨ⁻¹, where A is i.i.d. and Ψ is deterministic. 
We postpone a more careful analysis to future papers.

⁷ More precisely, we require Ψ to have a strictly positive smallest singular value and a bounded third operator norm.

It is worth mentioning that we obtain Theorem 1 on separable functions in light of the same principle: we simply connect the behavior of the error for an arbitrary matrix to that for a Gaussian one. In this particular case, we are able to carry out the calculations over Gaussian matrices with well-known techniques, developed for example in [18] and briefly explained below.

4 Technical Discussion

4.1 An Overview of the Approach

In this section, we present a rough sketch of our mathematical analysis. Our aim is to convey the main ideas without getting involved in mathematical subtleties. There are four main elements in our analysis, which we address in the following.

4.1.1 From the Optimal Cost to the Characteristics of the Optimal Point

In essence, we study the optimal values of optimizations such as the one in (1). Studying the optimal solution directly is much more difficult. Hence, we employ an indirect method, where we connect an arbitrary real-valued characteristic (function) g of the optimal point to the optimal values of a set of related optimizations. This is possible through the following simple observation:

Lemma 1 Suppose we minimize a convex function φ(x) on a convex domain D and that x* is a minimal solution. Further, let g(x) be such that the function φ + εg remains convex for ε in a symmetric interval [−e, e]. Define Φ(ε) as the minimal value of φ + εg on D. 
Then, Φ(ε) is concave on [−e, e], and g(x*) is a subgradient of it at ε = 0.

As a consequence of Lemma 1, the increments Φ(ε) − Φ(0) and Φ(0) − Φ(−ε) for positive values of ε provide lower and upper bounds on g(x*), respectively. We use these bounds to prove convergence in probability. However, Lemma 1 requires φ + εg to remain convex for both positive and negative values of ε. It is simple to see that choosing a strongly convex function for f and a convex function with bounded second derivative for g ensures this requirement.

4.1.2 Lindeberg's Approach

With the above approach, we only need to focus on the universality of the optimal values. To obtain universal results, we adopt the method that Lindeberg famously developed to obtain a strong version of the central limit theorem [35]. Lindeberg's approach requires a reference case, where the asymptotic properties are simple to deduce. Then, similar results are proved for an arbitrary case by considering a finite chain of intermediate problems, starting from the reference case and ending at the desired case. In each step, we are able to analyze the change in the optimal value and show that the sum of these changes cannot be substantial for asymptotically large cases. In our study, we take the optimization in (1) with a Gaussian matrix A as the reference. In each step of the chain, we replace one Gaussian row of A with another one following the target distribution. After m steps, we arrive at the desired case. At each step, we analyze the change in the optimal value by a Taylor expansion, which shows that the change is of second order and is o(1/m) (in fact, O(1/m^{5/4})) with high probability, so that the total change is bounded by o(1). For this, we require strong convexity and bounded third derivatives. This shows universality of the optimal value.

4.1.3 Asymptotic Results for Gaussian Matrices

Since we take Gaussian matrices as the reference in Lindeberg's approach, we require different machinery to analyze the Gaussian case. The analysis of (1) for Gaussian matrices is considered in [19]. Here, we briefly review this approach and specialize it to some particular cases. Let us start by defining the following so-called key optimization, associated with (1):

φ_n(g, x₀) = max_{β>0} min_{v∈R^n} { (mβ/n) √(σ² + ‖v‖₂²/m) + β gᵀv/n − (m/(2n)) β² + f_n(v + x₀)/n },   (6)

where g is an n-dimensional standard normal random vector, independent of the other variables. It is shown in [19] that if A is generated with standard Gaussian entries and φ_n(g, x₀) converges in probability to a value C, then the optimal value in (1) also converges to C. The consequences of this observation are thoroughly discussed in [20]. Here, we focus on the case where f(x) is separable as in (3). In this case, the key optimization in (6) can be simplified and stated as in the following theorem (see [30]).

Theorem 3 Suppose that A is generated by a Gaussian distribution, x₀ is i.i.d. with distribution ξ, and f(x) is separable as in (3). Furthermore, m/n → γ ∈ R≥0. Then, the optimal value of the optimization in (1) converges in probability to C_f(γ, σ) defined in Section 2.

Now, we may put the above steps together to obtain the desired result for strongly convex functions: Lindeberg's approach shows that the optimal cost is universal. On the other hand, the optimal cost for Gaussian matrices is given by C_f(γ, σ). We conclude that C_f(γ, σ) is the universal limit of the optimal cost. 
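A scalar illustration of the Lemma 1 mechanism (a hypothetical toy example, not from the paper): take φ(x) = (x − 1)², so x* = 1, and g(x) = x, so that Φ(ε) = min_x (x − 1)² + εx = ε − ε²/4. The sketch below checks numerically that Φ is concave and that its slope at ε = 0 recovers g(x*) = 1.

```python
def Phi(eps, lo=-10.0, hi=10.0, n=20001):
    # Perturbed optimal value Phi(eps) = min_x (x - 1)**2 + eps*x,
    # approximated by brute force over a fine grid (toy example only).
    step = (hi - lo) / (n - 1)
    return min((lo + i * step - 1.0) ** 2 + eps * (lo + i * step)
               for i in range(n))

# Finite-difference slope of Phi at 0 approximates the subgradient g(x*) = 1.
h = 1e-2
slope_at_zero = (Phi(h) - Phi(-h)) / (2.0 * h)
```

Here the central difference of Φ at 0 plays the role of the lower/upper increment bounds described above.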
Now, we may use the argument of Lemma 1 to obtain a characteristic g of the optimal point. For this, we may consider, for example, regularizations of the form f + εg, whose optimal cost, by the previous discussion, converges to C_{f+εg}. Then, g(x̂) equals dC_{f+εg}/dε at ε = 0, which after further calculations leads to the result in Theorem 1⁸.

4.1.4 Final Step: The LASSO

The above argument fails for the LASSO with f(x) = λ|x| because it lacks strong convexity. Our remedy is to start from an "augmented approximation" of the LASSO, with f(x) = λ|x| + εx²/2, and to show that the solution of the approximation is stable, in the sense that removing the term εx²/2 does not substantially change the optimal point. We employ a slightly modified version of the argument in [12], which requires two assumptions: (a) the solution is sparse; (b) the matrix A is sufficiently restricted-isometric. The restricted isometry condition is satisfied by assuming sub-Gaussian distributions [36], while the sparsity of the solution is given by M₀. The assumption that M₀ is sufficiently small allows the argument in [12] to hold in our case, which ensures that the LASSO solution remains close to the solution of the augmented LASSO, so that the claims of Theorem 1 can be established for the LASSO. We are, however, able to show that the optimal value of the LASSO is close to that of the augmented LASSO without any sparsity requirement. Details can be found in the supplementary material.

5 Conclusion

The main purpose of this study was to extend the existing results on convex regularized least squares problems in two different directions, namely more general regularization functions and non-Gaussian sensing matrices. In the first direction, we tied the asymptotic properties for general separable convex regularization functions to a two-dimensional optimization that we called the essential optimization. 
We also provided a simple way to calculate asymptotic characteristics of the solution from the essential optimization. In the second direction, we showed that the asymptotic behavior of regularization functions satisfying certain regularity conditions is independent of the distribution (law) of the sensing matrix. We presented a few numerical experiments that validated our results; these experiments also suggest that the universality of the asymptotic behavior holds beyond our assumptions.

5.1 Future Research

Having established the convergence results, a natural further question concerns the rate of convergence. The properties of regularized least squares solutions of finite size are not well studied, even for Gaussian matrices. Another interesting subject for future research is to consider random sensing matrices that are not necessarily identically distributed. We believe that our technique can be generalized to the case of independent rows or columns instead of independent elements. A similar generalization can be obtained by considering true vectors with a different structure. Moreover, we introduced a number of cases, such as the generalized LASSO [34] and the generalized fused LASSO [32], where our analysis shows universality but the asymptotic performance cannot be calculated. Calculating the asymptotic values of these problems for a reference choice, such as Gaussian matrices, is an interesting subject of future study.

⁸ The expression for g(w) is found in a similar way, but requires some mathematical preparations, which we present later.

References

[1] E. J. Candes and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?," Information Theory, IEEE Transactions on, vol. 52, no. 12, pp. 5406–5425, 2006.

[2] D. L. 
Donoho, "For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.

[3] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

[4] Z. Bai, "Methodologies in spectral analysis of large dimensional random matrices, a review," Statistica Sinica, pp. 611–662, 1999.

[5] Z. Bai and J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, vol. 20. Springer, 2010.

[6] D. L. Donoho, "Compressed sensing," Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289–1306, 2006.

[7] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.

[8] E. J. Candes and T. Tao, "Decoding by linear programming," Information Theory, IEEE Transactions on, vol. 51, no. 12, pp. 4203–4215, 2005.

[9] E. J. Candès, "The restricted isometry property and its implications for compressed sensing," Comptes Rendus Mathematique, vol. 346, no. 9, pp. 589–592, 2008.

[10] R. G. Baraniuk, "Compressive sensing," IEEE Signal Processing Magazine, vol. 24, no. 4, 2007.

[11] E. J. Candes and Y. Plan, "A probabilistic and RIPless theory of compressed sensing," Information Theory, IEEE Transactions on, vol. 57, no. 11, pp. 7235–7254, 2011.

[12] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.

[13] D. L. Donoho, M. Elad, and V. N.
Temlyakov, "Stable recovery of sparse overcomplete representations in the presence of noise," Information Theory, IEEE Transactions on, vol. 52, no. 1, pp. 6–18, 2006.

[14] M. Bayati and A. Montanari, "The LASSO risk for Gaussian matrices," Information Theory, IEEE Transactions on, vol. 58, no. 4, pp. 1997–2017, 2012.

[15] D. Donoho and A. Montanari, "High dimensional robust M-estimation: Asymptotic variance via approximate message passing," Probability Theory and Related Fields, vol. 166, no. 3-4, pp. 935–969, 2016.

[16] J. Barbier, M. Dia, N. Macris, and F. Krzakala, "The mutual information in random linear estimation," in Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pp. 625–632, IEEE, 2016.

[17] Y. Gordon, On Milman's Inequality and Random Subspaces Which Escape Through a Mesh in R^n. Springer, 1988.

[18] M. Stojnic, "A framework to characterize performance of LASSO algorithms," arXiv preprint arXiv:1303.7291, 2013.

[19] S. Oymak, C. Thrampoulidis, and B. Hassibi, "The squared-error of generalized LASSO: A precise analysis," in Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on, pp. 1002–1009, IEEE, 2013.

[20] C. Thrampoulidis, A. Panahi, D. Guo, and B. Hassibi, "Precise error analysis of the ℓ2-LASSO," arXiv preprint arXiv:1502.04977, 2015.

[21] C. Thrampoulidis, E. Abbasi, and B. Hassibi, "The LASSO with non-linear measurements is equivalent to one with linear measurements," Advances in Neural Information Processing Systems, 2015.

[22] C. Thrampoulidis, E. Abbasi, and B. Hassibi, "Precise error analysis of regularized M-estimators in high-dimension," arXiv preprint arXiv:1601.06233, 2016.

[23] S. Oymak and J. A.
Tropp, "Universality laws for randomized dimension reduction, with applications," arXiv preprint arXiv:1511.09433, 2015.

[24] N. E. Karoui, "Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results," arXiv preprint arXiv:1311.2445, 2013.

[25] A. Montanari and S. B. Korada, "Applications of Lindeberg principle in communications and statistical learning," tech. rep., 2010.

[26] N. Zerbib, Y.-H. Li, Y.-P. Hsieh, and V. Cevher, "Estimation error of the LASSO," tech. rep., 2016.

[27] Y. Kabashima, T. Wadayama, and T. Tanaka, "A typical reconstruction limit for compressed sensing based on ℓp-norm minimization," Journal of Statistical Mechanics: Theory and Experiment, vol. 2009, no. 09, p. L09003, 2009.

[28] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, and L. Zdeborová, "Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula," in Advances in Neural Information Processing Systems, pp. 424–432, 2016.

[29] M. Lelarge and L. Miolane, "Fundamental limits of symmetric low-rank matrix estimation," arXiv preprint arXiv:1611.03888, 2016.

[30] C. Thrampoulidis, A. Panahi, and B. Hassibi, "Asymptotically exact error analysis for the generalized ℓ2^2-LASSO," in 2015 IEEE International Symposium on Information Theory (ISIT), pp. 2021–2025, IEEE, 2015.

[31] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, "Sparsity and smoothness via the fused lasso," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.

[32] B. Xin, Y. Kawahara, Y. Wang, and W. Gao, "Efficient generalized fused lasso and its application to the diagnosis of Alzheimer's disease," in AAAI, pp. 2163–2169, 2014.

[33] J. Liu, L. Yuan, and J.
Ye, "Guaranteed sparse recovery under linear transformation," in ICML (3), pp. 91–99, 2013.

[34] R. J. Tibshirani, J. E. Taylor, E. J. Candes, and T. Hastie, The Solution Path of the Generalized Lasso. Stanford University, 2011.

[35] J. W. Lindeberg, "Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung," Mathematische Zeitschrift, vol. 15, no. 1, pp. 211–225, 1922.

[36] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.