{"title": "Algorithmic Analysis and Statistical Estimation of SLOPE via Approximate Message Passing", "book": "Advances in Neural Information Processing Systems", "page_first": 9366, "page_last": 9376, "abstract": "SLOPE is a relatively new convex optimization procedure for high-dimensional linear regression via the sorted $\\ell_1$ penalty: the larger the rank of the fitted coefficient, the larger the penalty. This non-separable penalty renders many existing techniques invalid or inconclusive in analyzing the SLOPE solution. In this paper, we develop an asymptotically exact characterization of the SLOPE solution under Gaussian random designs through solving the SLOPE problem using approximate message passing (AMP). This algorithmic approach allows us to approximate the SLOPE solution via the much more amenable AMP iterates. Explicitly, we characterize the asymptotic dynamics of the AMP iterates relying on a recently developed state evolution analysis for non-separable penalties, thereby overcoming the difficulty caused by the sorted $\\ell_1$ penalty. Moreover, we prove that the AMP iterates converge to the SLOPE solution in an asymptotic sense, and numerical simulations show that the convergence is surprisingly fast. Our proof rests on a novel technique that specifically leverages the SLOPE problem. In contrast to prior literature, our work not only yields an asymptotically sharp analysis but also offers an algorithmic, flexible, and constructive approach to understanding the SLOPE problem.", "full_text": "Algorithmic Analysis and Statistical Estimation of\n\nSLOPE via Approximate Message Passing\n\nZhiqi Bu\u2217\n\nJason M. Klusowski\u2020\n\nCynthia Rush\u2021\n\nWeijie Su\u00a7\n\nAbstract\n\nSLOPE is a relatively new convex optimization procedure for high-dimensional\nlinear regression via the sorted (cid:96)1 penalty: the larger the rank of the \ufb01tted coef\ufb01cient,\nthe larger the penalty. 
This non-separable penalty renders many existing techniques invalid or inconclusive in analyzing the SLOPE solution. In this paper, we develop an asymptotically exact characterization of the SLOPE solution under Gaussian random designs through solving the SLOPE problem using approximate message passing (AMP). This algorithmic approach allows us to approximate the SLOPE solution via the much more amenable AMP iterates. Explicitly, we characterize the asymptotic dynamics of the AMP iterates relying on a recently developed state evolution analysis for non-separable penalties, thereby overcoming the difficulty caused by the sorted ℓ1 penalty. Moreover, we prove that the AMP iterates converge to the SLOPE solution in an asymptotic sense, and numerical simulations show that the convergence is surprisingly fast. Our proof rests on a novel technique that specifically leverages the SLOPE problem. In contrast to prior literature, our work not only yields an asymptotically sharp analysis but also offers an algorithmic, flexible, and constructive approach to understanding the SLOPE problem.

1 Introduction

Consider observing linear measurements y ∈ R^n that are modeled by the equation

y = Xβ + w,   (1.1)

where X ∈ R^{n×p} is a known measurement matrix, β ∈ R^p is an unknown signal, and w ∈ R^n is the measurement noise. Among numerous methods that seek to recover the signal β from the observed data, especially in the setting where β is sparse and p is larger than n, SLOPE has recently emerged as a useful procedure that allows for estimation and model selection [9]. This method reconstructs the signal by solving the minimization problem

β̂ := argmin_b (1/2)‖y − Xb‖^2 + Σ_{i=1}^p λ_i |b|_(i),   (1.2)

PA 19104, USA.
Email: zbu@sas.upenn.edu

where ‖·‖ denotes the ℓ2 norm, λ_1 ≥ ··· ≥ λ_p ≥ 0 (with at least one strict inequality) is a sequence of thresholds, and |b|_(1) ≥ ··· ≥ |b|_(p) are the order statistics of the fitted coefficients

∗Department of Applied Mathematics and Computational Science, University of Pennsylvania, Philadelphia,
†Department of Statistics, Rutgers University, New Brunswick, NJ 08854, USA. Email: jason.klusowski@rutgers.edu. Supported in part by NSF DMS #1915932.
‡Department of Statistics, Columbia University, New York, NY 10027, USA. Email: cynthia.rush@columbia.edu. Supported in part by NSF CCF #1849883.
§Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104, USA. Email: suw@wharton.upenn.edu. Supported in part by NSF DMS CAREER #1847415 and NSF CCF #1763314.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: First iteration t for which there is zero set difference, or for which the optimization error ‖β^t − β̂‖^2/p falls below a threshold.

         Set Diff   10^-2   10^-3   10^-4   10^-5   10^-6
ISTA     60         4048    7326    8569    9007    9161
FISTA    47         275     374     412     593     604
AMP      30         6       13      22      32      40

Figure 1: Optimization errors, ‖β^t − β̂‖^2/p, and (symmetric) set difference of supp(β^t) and supp(β̂).

in absolute value. The regularizer Σ_i λ_i |b|_(i) is a sorted ℓ1 norm (denoted as J_λ(b) henceforth),

Setting of Figure 1 and Table 1: Design X is 500 × 1000 and has i.i.d. N(0, 1/500) entries. True signal β is elementwise i.i.d. Gaussian-Bernoulli: N(0, 1) with probability 0.1 and 0 otherwise. Noise variance σ_w^2 = 0.
A careful calibration between the thresholds θ_t in AMP and λ in SLOPE is used. Details in Section 2.

which is non-separable due to the sorting operation involved in its calculation. Notably, SLOPE has two attractive features that are not simultaneously present in other methods for linear regression, including the LASSO [34] and knockoffs [1]. Explicitly, on the estimation side, SLOPE achieves minimax estimation properties under certain random designs without requiring any knowledge of the sparsity degree of β [32, 7]. On the testing side, SLOPE controls the false discovery rate in the case of independent predictors [9, 12]. For completeness, we remark that [10, 35, 20] proposed similar non-separable regularizers to encourage grouping of correlated predictors.

This work is concerned with the algorithmic aspects of SLOPE through the lens of approximate message passing (AMP) [2, 17, 22, 24, 27]. AMP is a class of computationally efficient and easy-to-implement algorithms for a broad range of statistical estimation problems, including compressed sensing and the LASSO [3]. When applied to SLOPE, AMP takes the following form: at initial iteration t = 0, assign β^0 = 0, z^0 = y, and for t ≥ 0,

β^{t+1} = prox_{J_{θ_t}}(X^T z^t + β^t),   (1.3a)

z^{t+1} = y − Xβ^{t+1} + (z^t/n) [∇ prox_{J_{θ_t}}(X^T z^t + β^t)].   (1.3b)

The non-increasing sequence θ_t is proportional to λ and will be given explicitly in Section 2. Here, prox_{J_θ} is the proximal operator of the sorted ℓ1 norm, that is,

prox_{J_θ}(x) := argmin_b { (1/2)‖x − b‖^2 + J_θ(b) },

and ∇ prox_{J_θ} denotes the divergence of the proximal operator (see an equivalent but more explicit form of this algorithm in Section 2 and other preliminaries on SLOPE and the prox operator defined above in Appendix A).
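The proximal operator of the sorted ℓ1 norm admits an exact O(p log p) evaluation: sort the magnitudes, subtract the thresholds, and project onto the nonincreasing nonnegative cone with a stack-based pool-adjacent-violators pass, in the style of the FastProxSL1 algorithm of [9]. A minimal sketch (the function name and structure are ours, not from the paper):

```python
import numpy as np

def prox_sorted_l1(v, lam):
    """Prox of J_lam(b) = sum_i lam_i * |b|_(i), for nonincreasing lam >= 0.

    Sort |v| in decreasing order, subtract the thresholds, project onto the
    nonincreasing cone by averaging adjacent violating blocks, clip at zero,
    then undo the sort and restore the signs.
    """
    v, lam = np.asarray(v, float), np.asarray(lam, float)
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order] - lam
    blocks = []                               # entries: [start, end, block average]
    for i, zi in enumerate(z):
        blocks.append([i, i, zi])
        # merge while running block averages violate monotonicity
        while len(blocks) > 1 and blocks[-2][2] <= blocks[-1][2]:
            s2, e2, a2 = blocks.pop()
            s1, e1, a1 = blocks.pop()
            n1, n2 = e1 - s1 + 1, e2 - s2 + 1
            blocks.append([s1, e2, (n1 * a1 + n2 * a2) / (n1 + n2)])
    x = np.zeros_like(z)
    for s, e, a in blocks:
        x[s:e + 1] = max(a, 0.0)
    out = np.zeros_like(x)
    out[order] = x
    return np.sign(v) * out
```

A convenient sanity check: with all thresholds equal to θ, J_θ reduces to θ‖·‖_1, so the prox must agree with elementwise soft-thresholding.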
Compared to proximal gradient descent (ISTA) [15, 16, 26], AMP has an extra correction term in its residual step that adjusts the iteration in a non-trivial way and seeks to provide improved convergence performance [17, 11].

The empirical performance of AMP in solving SLOPE is illustrated in Figure 1 and Table 1, which suggest the superiority of AMP over ISTA and FISTA [6], perhaps the two most popular proximal gradient descent methods, in terms of speed of convergence. However, the vast AMP literature thus far remains silent on whether AMP provably solves SLOPE and, if so, whether one can leverage AMP to gain insight into the statistical properties of SLOPE. This vacuum in the literature is due to the non-separability of the SLOPE regularizer, which makes it a major challenge to apply AMP to SLOPE directly. In stark contrast, AMP theory has been rigorously applied to the LASSO [3], showing both good empirical performance and nice theoretical properties of solving the LASSO using AMP. Moreover, AMP in this setting allows for an asymptotically exact statistical characterization of its output, which converges to the LASSO solution, thereby providing a powerful tool in fine-grained analyses of the LASSO [4, 33, 25].

In this work, we prove that the AMP algorithm (1.3) solves the SLOPE problem in an asymptotically exact sense under independent Gaussian random designs. Our proof uses the recently extended AMP theory for non-separable denoisers [8] and applies this tool to derive the state evolution that describes the asymptotically exact behavior of the AMP iterates β^t in (1.3). The next step, which is the core of our proof, is to relate the AMP estimates to the SLOPE solution. This presents several challenges that cannot be resolved only within the AMP framework.
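The contrast between the ISTA and AMP updates can be made concrete in the separable special case where all thresholds are equal, so the sorted ℓ1 prox reduces to elementwise soft-thresholding and the divergence to a count of nonzeros. This is an illustrative sketch, not the paper's experimental code; the fixed threshold `theta` stands in for the calibrated sequence θ_t:

```python
import numpy as np

def soft(x, t):
    # elementwise soft-thresholding: the sorted-l1 prox when all thresholds equal t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(X, y, lam, iters=200):
    # plain proximal gradient: no correction term in the residual
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the quadratic part
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft(b + X.T @ (y - X @ b) / L, lam / L)
    return b

def amp(X, y, theta, iters=10):
    # the AMP residual step carries the Onsager correction (z/n) * divergence;
    # for soft-thresholding the divergence is the number of nonzeros
    # (for general SLOPE it is the number of unique nonzero magnitudes)
    n, p = X.shape
    b, z = np.zeros(p), y.copy()
    for _ in range(iters):
        b = soft(b + X.T @ z, theta)
        z = y - X @ b + (z / n) * np.count_nonzero(b)
    return b
```

The only difference between the two residual computations is the correction term, yet (as Table 1 shows for SLOPE) it changes the convergence behavior dramatically.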
In particular, unlike the LASSO,\nthe number of nonzeros in the SLOPE solution can exceed the number of observations. This fact\nimposes substantially more dif\ufb01culties on showing that the distance between the SLOPE solution\nand the AMP iterates goes to zero than in the LASSO case due to the possible non-strong convexity\nof the SLOPE problem, even restricted to the solution support. To overcome these challenges, we\ndevelop novel techniques that are tailored to the characteristics of the SLOPE solution. For example,\nour proof relies on the crucial property of SLOPE that the unique nonzero components of its solution\nnever outnumber the observation units.\nAs a byproduct, our analysis gives rise to an exact asymptotic characterization of the SLOPE solution\nunder independent Gaussian random designs through leveraging the statistical aspect of the AMP\ntheory. In slightly more detail, the probability distribution of the SLOPE solution is completely\nspeci\ufb01ed by a few parameters that are the solution to a certain \ufb01xed-point equation in an asymptotic\nsense. This provides a powerful tool for \ufb01ne-grained statistical analysis of SLOPE as it was for the\nLASSO problem. We note that a recent paper [21]\u2014which takes an entirely different path\u2014gives\nan asymptotic characterization of the SLOPE solution that matches our asymptotic analysis that is\ndeduced from our AMP theory for SLOPE. However, our AMP-based approach is more algorithmic\nin nature and offers a more concrete connection between the \ufb01nite-sample behaviors of the SLOPE\nproblem and its asymptotic distribution via the computationally ef\ufb01cient AMP algorithm.\n\n2 Algorithmic Development\n\nIn this section we develop an AMP algorithm for \ufb01nding the SLOPE estimator in (1.2). Recall the\nAMP algorithm we study is (1.3). Speci\ufb01cally, it is through the threshold values \u03b8t that one can ensure\nthe AMP estimates converge to the SLOPE estimator with parameter \u03bb. 
In this section we present how one should calibrate the thresholds of the AMP iterations in (1.3) in order for the algorithm to solve the SLOPE cost in (1.2). Then in Section 3, we prove rigorously that the AMP algorithm solves the SLOPE optimization asymptotically, and we leverage theoretical guarantees for the AMP algorithm to exactly characterize the mean square error of the SLOPE estimator in the large system limit. This is done by applying recent theoretical results for AMP algorithms that use a non-separable non-linearity [8], like the one in (1.3).

We first note that the analysis we pursue in this work makes the following assumptions about the linear model (1.1) and the parameter vector in (A.1):

(A1) The measurement matrix X has independent and identically-distributed (i.i.d.) Gaussian entries that have mean 0 and variance 1/n.

(A2) The signal β has elements that are i.i.d. B, with E(B^2 max{0, log B}) < ∞.

(A3) The noise w is elementwise i.i.d. W, with σ_w^2 := E(W^2) < ∞.

(A4) The vector λ(p) = (λ_1, . . . , λ_p) is elementwise i.i.d. Λ, with E(Λ^2) < ∞.

(A5) The ratio n/p approaches a constant δ ∈ (0, ∞) in the large system limit, as n, p → ∞.

Remark: (A4) can be relaxed to λ_1, . . . , λ_p having an empirical distribution that converges weakly to a probability measure Λ on R with E(Λ^2) < ∞ and ‖λ(p)‖^2/p → E(Λ^2). A similar relaxation can be made for assumptions (A2) and (A3).

2.1 SLOPE Preliminaries

For a vector v ∈ R^p, the divergence of the proximal operator, ∇ prox_f(v), is given by the following:

∇ prox_f(v) := Σ_{i=1}^p ∂[prox_f(v)]_i / ∂v_i = (∂/∂v_1, ∂/∂v_2, . . . , ∂/∂v_p) · prox_f(v),   (2.1)

where, as given in [32], proof of Fact 3.4,

∂[prox_{J_λ}(v)]_i / ∂v_j = sign([prox_{J_λ}(v)]_i) · sign([prox_{J_λ}(v)]_j) / #{1 ≤ k ≤ p : |[prox_{J_λ}(v)]_k| = |[prox_{J_λ}(v)]_j|}, if |[prox_{J_λ}(v)]_j| = |[prox_{J_λ}(v)]_i|, and 0 otherwise.   (2.2)

Hence the divergence is simplified to

∇ prox_{J_λ}(v) = ‖prox_{J_λ}(v)‖*_0,   (2.3)

where ‖·‖*_0 counts the unique non-zero magnitudes in a vector, e.g. ‖(0, 1, −2, 0, 2)‖*_0 = 2. This explicit form of the divergence not only waives the need to use an approximation in the calculation but also speeds up the recursion, since it depends only on the proximal operator as a whole rather than on θ_{t−1}, X, z^{t−1}, β^{t−1}. Therefore, we have

Lemma 2.1. In AMP, (1.3b) is equivalent to

z^{t+1} = y − Xβ^{t+1} + (z^t/(δp)) ‖β^{t+1}‖*_0.

Other preliminary ideas and background on SLOPE and the prox operator are found in Appendix A.

2.2 AMP Background

An attractive feature of AMP is that its statistical properties can be exactly characterized at each iteration t, at least asymptotically, via a one-dimensional recursion known as state evolution [2, 8]. Specifically, it can be shown that the pseudo-data, meaning the input X^T z^t + β^t for the estimate of the unknown signal in (1.3a), is asymptotically equal in distribution to the true signal plus independent Gaussian noise, i.e. β + τ_t Z, where the noise variance τ_t^2 is defined by the state evolution. For this reason, the function used to update the estimate in (1.3a), in our case the proximal operator prox_{J_{θ_t}}(·), is usually referred to as a 'denoiser' in the AMP literature.

This statistical characterization of the pseudo-data was first rigorously shown to be true in the case of 'separable' denoisers by Bayati and Montanari [2], and an analysis of the rate of this convergence was given in [31]. A 'separable' denoiser is one that applies the same (possibly non-linear) function to each element of its input. Recent work, which we make use of in this paper, proves that asymptotically the pseudo-data has distribution β + τ_t Z when non-separable 'denoisers' are used in the AMP algorithm. The dynamics of the AMP iterations are tracked by a recursive sequence referred to as the state evolution, defined below. For B elementwise i.i.d. B independent of Z ∼ N(0, I_p), let τ_0^2 = σ_w^2 + E[B^2]/δ and for t ≥ 0,

τ_{t+1}^2 = σ_w^2 + lim_p (1/(δp)) E‖prox_{J_{θ_t}}(B + τ_t Z) − B‖^2.   (2.4)

Below we make rigorous the way that the recursion in (2.4) relates to the AMP iteration (1.3a)-(1.3b). We note that throughout, we let N(µ, σ^2) denote the Gaussian density with mean µ and variance σ^2, and we use I_p to indicate a p × p identity matrix.

2.3 Analysis of the AMP State Evolution

As mentioned previously, it is through the sequence of thresholds θ_t that one is able to relate the AMP algorithm to the SLOPE estimator in (1.2) for certain λ. Specifically, we will choose θ_t = ατ_t(p) for every iteration t, where the vector α is fixed via a calibration made explicit below and τ_t^2(p) is defined using an approximation to the state evolution in (2.4) given in (2.5) below.
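The expectation in the state-evolution recursion (2.4) can be estimated by straightforward Monte Carlo. Below is a sketch for the separable special case where all entries of θ_t are equal, so the prox reduces to elementwise soft-thresholding; the Bernoulli-Gaussian prior and parameter values in the usage note are illustrative assumptions, not choices from the paper:

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def state_evolution(alpha, delta, sigma_w2, sample_B, iters=30, mc=100_000, seed=0):
    """Monte Carlo sketch of the recursion (2.4) with theta_t = alpha * tau_t,
    in the separable case where prox_{J_theta} is soft-thresholding.

    tau_0^2 = sigma_w^2 + E[B^2]/delta, then
    tau_{t+1}^2 = sigma_w^2 + E(prox(B + tau_t Z) - B)^2 / delta.
    """
    rng = np.random.default_rng(seed)
    B = sample_B(rng, mc)
    tau2 = sigma_w2 + np.mean(B ** 2) / delta          # tau_0^2
    for _ in range(iters):
        Z = rng.standard_normal(mc)
        tau = np.sqrt(tau2)
        tau2 = sigma_w2 + np.mean((soft(B + tau * Z, alpha * tau) - B) ** 2) / delta
    return tau2
```

For instance, a Bernoulli-Gaussian prior can be passed as `sample_B = lambda rng, m: np.where(rng.random(m) < 0.1, rng.standard_normal(m), 0.0)`; the returned value then approximates the fixed point τ*^2 that Theorem 1 below characterizes.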
We can interpret this to mean that, within the AMP algorithm, α plays the role of the regularizer λ. The calibration is motivated by a careful analysis of the following approximation (when p is large) to the state evolution iteration in (2.4). Namely,

τ_{t+1}^2(p) = σ_w^2 + (1/(δp)) E‖prox_{J_{ατ_t(p)}}(β + τ_t(p)Z) − β‖^2,   (2.5)

where the difference between (2.5) and the state evolution (2.4) is via the large system limit in p. When we refer to the recursion in (2.5) we will always specify the p dependence explicitly as τ_t(p). Before we introduce this calibration, however, we give the following result, which motivates why the AMP iteration should relate at all to the SLOPE estimator.

Lemma 2.2. Any stationary point β̂ (with corresponding ẑ) of the AMP algorithm (1.3a)-(1.3b) with θ* = ατ* is a minimizer of the SLOPE cost function in (1.2) with

λ = θ* (1 − (1/n) ‖prox_{J_{θ*}}(β̂ + X^T ẑ)‖*_0) = θ* (1 − (1/(δp)) ∇ prox_{J_{θ*}}(β̂ + X^T ẑ)).

Proof of Lemma 2.2. By stationarity,

β̂ = prox_{J_{θ*}}(β̂ + X^T ẑ)  and  ẑ = y − Xβ̂ + (ẑ/(δp)) ∇ prox_{J_{θ*}}(β̂ + X^T ẑ).   (2.6)

Denote ω := (1/(δp)) ∇ prox_{J_{θ*}}(β̂ + X^T ẑ). Then, from (2.6), ẑ = (y − Xβ̂)/(1 − ω), and by (2.6) along with Fact A.1, X^T ẑ ∈ ∂J_{θ*}(β̂).
Clearly, X(cid:62)(cid:98)z = X(cid:62)(y\u2212X(cid:98)\u03b2)\n\u2208 J\u03b8\u2217 ((cid:98)\u03b2), which tells us X(cid:62)(y\u2212X(cid:98)\u03b2) \u2208\nJ\u03b8\u2217(1\u2212\u03c9)((cid:98)\u03b2) which is exactly the stationary condition of SLOPE with \u03bb = (1\u2212 \u03c9)\u03b8\u2217 as desired.\n(cid:111)\n\nResults about the recursion (2.5) are summarized in the following theorem and the theorem\u2019s proof is\ngiven in Appendix C. We \ufb01rst introduce some useful notations: let Amin(\u03b4) be the set of solutions to\n\n1\u2212\u03c9 , and by (2.6) along with\n\n(cid:17)\n\n1\u2212\u03c9\n\np(cid:88)\n\nE(cid:110)(cid:16)\n1 \u2212(cid:12)(cid:12)[proxJ\u03b1\n\n\u03b4 = f (\u03b1), where f (\u03b1) :=\n\n\u03b1j\n\n/[D(proxJ\u03b1\n\n(Z))]i\n\n(Z)]i\n\n(2.7)\n\n(cid:12)(cid:12)(cid:88)\n\nj\u2208Ii\n\n1\np\n\ni=1\n\n(cid:104)u(cid:105) :=(cid:80)m\n\nHere (cid:12) represents elementwise multiplication of vectors and for a vector v \u2208 Rp, D is de\ufb01ned\nelementwise as [D(v)]i = #{j : |vj| = |vi|} if vi (cid:54)= 0 and \u221e otherwise. For u \u2208 Rm, the notation\ni=1 ui/m and we say a vector u is larger than v if \u2200i, ui > vi. The expectation in (2.7) is\n\ntaken with respect to Z, a p-length vector of i.i.d. standard Gaussians.\nTheorem 1. For any \u03b1 strictly larger than at least one element in the set Amin(\u03b4), the recursion in\n(2.5) has a unique \ufb01xed point and denoting this \ufb01xed point by \u03c4 2\u2217 (p). Then \u03c4t(p) \u2192 \u03c4\u2217(p) for any\ninitial condition and monotonically. Moreover, de\ufb01ning a function F : R \u00d7 Rp \u2192 R as\n\nF(\u03c4 2(p), \u03b1\u03c4 (p)) := \u03c32 +\n\nE(cid:107)proxJ\u03b1\u03c4 (p)\n\n1\n\u03b4p\n\n(B + \u03c4 (p)Z) \u2212 B(cid:107)2,\n\n(2.8)\n\nwhere B is elementwise i.i.d. 
B independent of Z ∼ N(0, I_p), so that τ_{t+1}^2(p) = F(τ_t^2(p), ατ_t(p)), then |∂F/∂τ^2(p) (τ^2(p), ατ(p))| < 1 at τ(p) = τ*(p). Moreover, for f(α) defined in (2.7), we show that f(α) = δ lim_{τ(p)→∞} dF/dτ^2(p).

Notice that Theorem 1 gives necessary conditions on the calibration vector α under which the recursion in (2.5), and equivalently the calibration given below, are well-defined.

2.4 Threshold Calibration

Motivated by Lemma 2.2 and Lemma B.1, we define a calibration from the regularization parameter λ to the corresponding threshold α used to define the AMP algorithm. Such calibration is asymptotically exact when p = ∞.

In practice, we will be given a finite-length λ and then we want to design the AMP iteration to solve the corresponding SLOPE cost. We do this by choosing α as the vector that solves λ = λ(α), where

λ(α) := ατ*(p) (1 − (1/n) E‖prox_{J_{ατ*(p)}}(B + τ*(p)Z)‖*_0),   (2.9)

Figure 2: A_min (black curve) when p = 2 and δ = 0.6, separating the feasible region for α = (a_1, a_2) from the infeasible one.

where B is elementwise i.i.d. B independent of Z ∼ N(0, I_p) and τ*(p) is the limiting value defined in Theorem 1. We note the fact that the calibration in (2.9) sets α as a vector in the same direction as λ, but scaled by a constant value (for each p), where the constant value is given by τ*(p)(1 − E‖prox_{J_{ατ*(p)}}(B + τ*(p)Z)‖*_0 / n).

We claim that the calibration (2.9) and its inverse λ ↦ α(λ) are well-defined.
In [3, Proposition 1.4 (first introduced in [18]) and Corollary 1.7] this is proved rigorously for the LASSO calibration, and we claim that this proof can be adapted to the present case without many difficulties, though we don't pursue this in the current document.

Proposition 2.3. The function α ↦ λ(α) defined in (2.9) is continuous on {α : f(α) < δ} for f(·) defined in (2.7), with λ(A_min) = −∞ and lim_{α→∞} λ(α) = ∞ (where the limit is taken elementwise). Therefore the inverse function λ ↦ α(λ) exists and is continuous non-decreasing for any λ > 0.

This proposition motivates Algorithm 1, which uses the bisection method to find the unique α for each λ. It suffices to find two guesses of α parallel to λ that, when mapped via (2.9), sandwich the true λ. The proof of this proposition can be found in [13, Appendix A.2].

Algorithm 1 Calibration from λ → α

1. Initialize α_1 = α_min such that α_min ℓ ∈ A_min, where ℓ := λ/λ_1; initialize α_2 = 2α_1.
2. While L(α_2) < 0, where L : R → R; α ↦ sign(λ(αℓ) − λ): set α_1 = α_2 and α_2 = 2α_2.
3. Return BISECTION(L(α), α_1, α_2).

Remark: sign(λ(·) − λ) ∈ R is well-defined since λ(·) ∥ λ implies all entries share the same sign. The function "BISECTION(L, a, b)" finds the root of L in [a, b] via the bisection method.

As noted previously, the calibration in (2.9) is exact when p → ∞, so we study the mapping between α and λ in this limit. Recall from (A4) that the sequence of vectors {λ(p)}_{p≥0} are drawn i.i.d. from distribution Λ.
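Algorithm 1's structure (an outer doubling loop to bracket the target, then bisection) is generic: along a fixed direction ℓ it only needs the scalar map α ↦ λ(α) to be continuous and increasing, which Proposition 2.3 provides. A sketch with the map left as a callable; in practice `lambda_of_alpha` would evaluate (2.9) by Monte Carlo, and the function names here are ours:

```python
def calibrate(lambda_of_alpha, target, a_lo=1e-6, tol=1e-10):
    """Find alpha with lambda_of_alpha(alpha) == target for an increasing map.

    Doubles the upper bracket until the target is exceeded (the while loop,
    step 2 of Algorithm 1), then bisects. Assumes a_lo starts below the root.
    """
    a_hi = 2.0 * a_lo
    while lambda_of_alpha(a_hi) < target:      # corresponds to L(alpha_2) < 0
        a_lo, a_hi = a_hi, 2.0 * a_hi
    while a_hi - a_lo > tol:
        mid = 0.5 * (a_lo + a_hi)
        if lambda_of_alpha(mid) < target:
            a_lo = mid
        else:
            a_hi = mid
    return 0.5 * (a_lo + a_hi)
```

Because the map is monotone, the bracket [α_1, α_2] always contains the unique root, so the bisection converges at the usual geometric rate regardless of how crude the initial guess is.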
It follows that the sequence {α(p)}_{p≥0}, defined for each p by the finite-sample calibration (2.9), are i.i.d. from a distribution A, where A satisfies E(A^2) < ∞ and is defined via

Λ = Aτ* (1 − lim_p (1/(δp)) E‖prox_{J_{A(p)τ*}}(B + τ*Z)‖*_0),   (2.10)

where A(p) ∈ R^p are the order statistics of p i.i.d. draws from A given by (2.10), and τ* is defined as the large-t limit of (2.4). We note that the calibrations presented in this section are well-defined:

Fact 2.4. The limits in (2.4) and (2.10) exist.

This fact is proven in Appendix E. One idea used in the proof is that the prox operator is asymptotically separable, a result shown by [21, Proposition 1]. Specifically, for sequences of inputs, {v(p)}, and thresholds, {λ(p)}, both having empirical distributions that weakly converge to distributions V and Λ, respectively, there exists a limiting scalar function h(·) := h(v(p); V, Λ) (determined by V and Λ) of the proximal operator prox_{J_λ}(v(p)).
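The asymptotic separability of [21, Proposition 1] can be eyeballed numerically: for large p with i.i.d. thresholds, each output coordinate of the prox behaves like a fixed scalar function of the corresponding input coordinate. The sketch below (prox implementation ours, a standard stack-based evaluation) checks a weaker but exact property that underlies this: the prox preserves the ordering of input magnitudes, so the output magnitudes are a nonincreasing function of the input's magnitude ranks.

```python
import numpy as np

def prox_sorted_l1(v, lam):
    # exact sorted-l1 prox via a stack-based pool-adjacent-violators pass
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order] - lam
    blocks = []                               # [start, end, block average]
    for i, zi in enumerate(z):
        blocks.append([i, i, zi])
        while len(blocks) > 1 and blocks[-2][2] <= blocks[-1][2]:
            s2, e2, a2 = blocks.pop()
            s1, e1, a1 = blocks.pop()
            n1, n2 = e1 - s1 + 1, e2 - s2 + 1
            blocks.append([s1, e2, (n1 * a1 + n2 * a2) / (n1 + n2)])
    x = np.zeros_like(z)
    for s, e, a in blocks:
        x[s:e + 1] = max(a, 0.0)
    out = np.zeros_like(x)
    out[order] = x
    return np.sign(v) * out

rng = np.random.default_rng(1)
p = 2000
v = rng.standard_normal(p) * 2.0
lam = np.sort(rng.random(p))[::-1]            # i.i.d. Uniform(0,1) thresholds, sorted
out = prox_sorted_l1(v, lam)
# output magnitudes, listed in decreasing order of the input magnitudes,
# form a nonincreasing sequence: |out_i| depends on v_i through its rank
mags = np.abs(out)[np.argsort(-np.abs(v))]
```

Plotting `|v|` against `|out|` for such draws traces out a single increasing curve, which is exactly the limiting scalar function h described above.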
Further details are shown in Appendix E, Lemma E.1. Using h(·) := h(·; B + τ*Z, Aτ*), this argument implies that (2.4) can be represented as

τ*^2 := σ^2 + E(h(B + τ*Z) − B)^2 / δ,

and if we denote by m the Lebesgue measure, then the limit in (2.10) can be represented as

P(B + τ*Z ∈ {x | h(x) ≠ 0 and m{z | |h(z)| = |h(x)|} = 0}).

In other words, the limit in (2.10) is the Lebesgue measure of the domain of the quantile function of h for which the quantile of h assumes unique values (i.e., is not flat).

3 Asymptotic Characterization of SLOPE

3.1 AMP Recovers the SLOPE Estimate

Here we show that the AMP algorithm converges in ℓ2 to the SLOPE estimator, implying that the AMP iterates can be used as a surrogate for the global optimum of the SLOPE cost function. The schema of the proof is similar to [3, Lemma 3.1]; however, major differences lie in the fact that the proximal operator used in the AMP updates (1.3a)-(1.3b) is non-separable. We sketch the proof here, and a forthcoming article will be devoted to giving a complete and detailed argument.

Theorem 2. Under assumptions (A1)-(A5), for the output of the AMP algorithm in (1.3a) and the SLOPE estimate (1.2),

plim_{p→∞} (1/p)‖β̂ − β^t‖^2 = c_t, where lim_{t→∞} c_t = 0.   (3.1)

Proof.
The proof requires dealing carefully with the fact that the SLOPE cost function given in (1.2) is not necessarily strongly convex, meaning that we could encounter the undesirable situation where C(β̂) is close to C(β) but β̂ is not close to β, in which case the statistical recovery of β would be poor.

In the LASSO case, one works around this challenge by showing that the (LASSO) cost function does have nice properties when considering just the elements of the non-zero support of β^t at any (large) iteration t. In the LASSO case, the non-zero support of β has size no larger than n < p.

In the SLOPE problem, however, it is possible that the support set has size exceeding n, and therefore the LASSO analysis is not immediately applicable. Our proof develops novel techniques that are tailored to the characteristics of the SLOPE solution. Specifically, when considering the SLOPE problem, one can show nice properties (similar to those in the LASSO case) by considering a support-like set, namely the unique non-zeros in the estimate β^t at any (large) iteration t. In other words, if we define an equivalence relation x ∼ y when |x| = |y|, then the entries of the AMP estimate at any iteration t are partitioned into equivalence classes. We then observe from (2.9), and the non-negativity of λ, that the number of equivalence classes is no larger than n. We see an analogy between SLOPE's equivalence classes (or 'maximal atoms' as described in Appendix A) and LASSO's support set. This approach allows us to deal with the lack of a strongly convex cost.

3.2 Exact Asymptotic Characterization of the SLOPE Estimate

Theorem 2 ensures that the AMP algorithm solves the SLOPE problem in an asymptotic sense. To better appreciate the convergence guarantee, it calls for elaboration on (3.1).
First, it implies that ‖β̂ − β^t‖^2/p converges in probability to a constant, say c_t. Next, (3.1) says that c_t → 0 as t → ∞. A consequence of Theorem B.1 is that the SLOPE estimator β̂ inherits performance guarantees provided by the AMP state evolution, in the sense of Theorem 3 below. Theorem 3 provides an asymptotic characterization of pseudo-Lipschitz loss between β̂ and the truth β.

Definition 3.1. Uniformly pseudo-Lipschitz functions [8]: For k ∈ N_{>0}, a function φ : R^d → R is pseudo-Lipschitz of order k if there exists a constant L such that for a, b ∈ R^d,

‖φ(a) − φ(b)‖ ≤ L (1 + (‖a‖/√d)^{k−1} + (‖b‖/√d)^{k−1}) (‖a − b‖/√d).   (3.2)

A sequence (in p) of pseudo-Lipschitz functions {φ_p}_{p∈N_{>0}} is uniformly pseudo-Lipschitz of order k if, denoting by L_p the pseudo-Lipschitz constant of φ_p, L_p < ∞ for each p and lim sup_{p→∞} L_p < ∞.

Theorem 3. Under assumptions (A1)-(A5), for any uniformly pseudo-Lipschitz sequence of functions ψ_p : R^p × R^p → R and for Z ∼ N(0, I_p),

plim_p ψ_p(β̂, β) = lim_t plim_p E_Z[ψ_p(prox_{J_{α(p)τ_t}}(β + τ_t Z), β)],

where τ_t is defined in (2.4) and the expectation is taken with respect to Z.

Theorem 3 tells us that under uniformly pseudo-Lipschitz loss, in the large system limit, distributionally the SLOPE optimizer acts as a 'denoised' version of the truth corrupted by additive Gaussian noise, where the denoising function is given by the proximal operator, i.e.
within uniformly pseudo-Lipschitz loss, β̂ can be replaced with prox_{J_{α(p)τ_t}}(β + τ_t Z) for large p, t.

We note that the result [21, Theorem 1] follows by Theorem 3 and their separability result [21, Proposition 1]. To see this, in Theorem 3 consider the special case where ψ_p(x, y) = (1/p) Σ_i ψ(x_i, y_i) for a function ψ : R × R → R that is pseudo-Lipschitz of order k = 2. Then it is easy to show that ψ_p(·,·) is uniformly pseudo-Lipschitz of order k = 2. The result of Theorem 3 then says that

plim_p (1/p) Σ_{i=1}^p ψ(β̂_i, β_i) = lim_t plim_p (1/p) Σ_{i=1}^p E_Z[ψ([prox_{J_{α(p)τ_t}}(β + τ_t Z)]_i, β_i)].

Then by [21, Proposition 1], restated in Lemma E.1, which says that the proximal operator becomes asymptotically separable as p → ∞, the result of [21, Theorem 1] follows by the Law of Large Numbers and Theorem 1. Namely, for some limiting scalar function h_t,

lim_t plim_p (1/p) Σ_{i=1}^p E_Z[ψ([prox_{J_{α(p)τ_t}}(β + τ_t Z)]_i, β_i)] (a)= lim_t plim_p (1/p) Σ_{i=1}^p E_Z[ψ(h_t([β + τ_t Z]_i), β_i)] = lim_t E_{Z,B}[ψ(h_t(B + τ_t Z), B)].

We note that in step (a) above, we apply Lemma E.1, using that α(p)τ_t has an empirical distribution that converges weakly to Aτ_t for A defined by (2.10). The rigorous argument for justifying step (a) by Lemma E.1 requires a bit more technical detail. We give such a rigorous argument, for a similar but different limiting operation, in Appendix D for proving limiting properties of the prox operator (namely, property (P2) stated in Appendix B).

We highlight that our Theorem 3 allows the consideration of a non-asymptotic case in t.
While Theorem 1 motivates an algorithmic way to find a value $\tau_t(p)$ that approximates $\tau_*(p)$ well, Theorem 3 guarantees the accuracy of such an approximation for use in practice. One particular use of Theorem 3 is to design the optimal sequence $\lambda$ that achieves the minimum $\tau_*$ and, equivalently, the minimum error [21], though a concrete algorithm for doing so is still under investigation.

We prove Theorem 3 in Appendix B. We show that Theorem 3 follows from Theorem 2 and Lemma B.1, which demonstrates that the state evolution given in (2.4) characterizes the performance of the SLOPE AMP (1.3) via pseudo-Lipschitz loss functions. Finally, we show how we use Theorem 3 to study the asymptotic mean-square error between the SLOPE estimator and the truth.

Corollary 3.2. Under assumptions (A1)–(A5), $\operatorname*{plim}_p \|\hat\beta - \beta\|^2/p = \delta(\tau_*^2 - \sigma_w^2)$.

Proof. Applying Theorem 3 to the pseudo-Lipschitz loss function $\psi_1 : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, defined as $\psi_1(x, y) = \|x - y\|^2/p$, we find
$$\operatorname*{plim}_p \frac{1}{p}\|\hat\beta - \beta\|^2 = \lim_t \operatorname*{plim}_p \frac{1}{p}\mathbb{E}_Z\big[\|\mathrm{prox}_{J_{\alpha\tau_t}}(\beta + \tau_t Z) - \beta\|^2\big].$$
The desired result follows since $\lim_t \operatorname*{plim}_p \frac{1}{p}\mathbb{E}_Z[\|\mathrm{prox}_{J_{\alpha\tau_t}}(\beta + \tau_t Z) - \beta\|^2] = \delta(\tau_*^2 - \sigma_w^2)$. To see this, note that $\lim_t \delta(\tau_{t+1}^2 - \sigma_w^2) = \delta(\tau_*^2 - \sigma_w^2)$ and
$$\lim_t \operatorname*{plim}_p \frac{1}{p}\mathbb{E}_Z\big[\|\mathrm{prox}_{J_{\alpha\tau_t}}(\beta + \tau_t Z) - \beta\|^2\big] = \lim_t \operatorname*{plim}_p \frac{1}{p}\mathbb{E}_{Z,B}\big[\|\mathrm{prox}_{J_{\alpha\tau_t}}(B + \tau_t Z) - B\|^2\big] = \lim_t \delta(\tau_{t+1}^2 - \sigma_w^2),$$
for $B$ with elementwise i.i.d. entries, independent of $Z \sim \mathcal{N}(0, I_p)$.
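The recursion underlying this proof can be illustrated numerically. The sketch below iterates $\tau_{t+1}^2 = \sigma_w^2 + \frac{1}{\delta}\,\mathbb{E}\|\mathrm{prox}(B + \tau_t Z) - B\|^2/p$ by Monte Carlo and reports $\delta(\tau_*^2 - \sigma_w^2)$ as in Corollary 3.2. To keep it self-contained we take all entries of $\alpha$ equal, so the sorted-$\ell_1$ prox reduces to coordinatewise soft thresholding (the LASSO special case); the values of `delta`, `sigma_w`, `alpha`, and the Bernoulli-Gaussian prior are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
p, delta, sigma_w, alpha, eps = 5000, 0.64, 0.5, 1.5, 0.1

def soft(x, thresh):
    # Soft thresholding: the sorted-l1 prox when all weights are equal.
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def sample_prior(p):
    # Bernoulli-Gaussian prior: a coordinate is nonzero with probability eps.
    return rng.standard_normal(p) * (rng.random(p) < eps)

# Initialize the state evolution at tau_0^2 = sigma_w^2 + E[B^2]/delta.
tau2 = sigma_w**2 + np.mean(sample_prior(p)**2) / delta
for t in range(50):
    B, Z = sample_prior(p), rng.standard_normal(p)
    tau = np.sqrt(tau2)
    # Monte Carlo estimate of (1/p) E || prox(B + tau Z) - B ||^2.
    mse = np.mean((soft(B + tau * Z, alpha * tau) - B)**2)
    tau2_new = sigma_w**2 + mse / delta
    if abs(tau2_new - tau2) < 1e-6:
        break
    tau2 = tau2_new

# Corollary 3.2 (with these stand-in parameters): asymptotic MSE prediction.
predicted_mse = delta * (tau2 - sigma_w**2)
print(predicted_mse)
```

The fixed point $\tau_*^2$ is reached quickly here because the map $\tau_t^2 \mapsto \tau_{t+1}^2$ is contractive for these parameter choices; the printed value is the state-evolution prediction of $\operatorname*{plim}_p \|\hat\beta - \beta\|^2/p$ in this special case.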
A rigorous argument for the preceding display follows similarly to that used to prove property (P2), stated in Appendix B and proved in Appendix D.

4 Discussion and Future Work

This work develops and analyzes the dynamics of an approximate message passing (AMP) algorithm with the purpose of solving the SLOPE convex optimization procedure for high-dimensional linear regression. By employing recent theoretical analysis of AMP when the non-linearities used in the algorithm are non-separable [8], as is the case for the SLOPE problem, we provide a rigorous proof that the proposed AMP algorithm finds the SLOPE solution asymptotically. Moreover, empirical evidence suggests that the AMP estimate is already very close to the SLOPE solution after only a few iterations. By leveraging our analysis showing that AMP provably solves SLOPE, we provide an exact asymptotic characterization of the $\ell_2$ risk of the SLOPE estimator relative to the underlying truth, as well as insight into other statistical properties of the SLOPE estimator. Though this asymptotic analysis of the SLOPE solution has been demonstrated in other recent work [21] using a different proof strategy, we have a clear, rigorous statement of where it applies. That is, the analysis in [21] applies if the state evolution has a unique fixed point, whereas our Theorem 1 states precise conditions under which this is true. Moreover, we believe that our algorithmic approach offers a more concrete connection between the finite-sample behavior of the SLOPE estimator and its asymptotic distribution.

We now briefly discuss some potential improvements and directions for future research.

[Figure 3 panels: (a) i.i.d. $\pm 1$ Bernoulli design matrix (top) and i.i.d. shifted exponential design matrix (bottom); (b) i.i.d. Gaussian design matrix (top) and non-i.i.d. right rotationally-invariant design matrix where AMP diverges (bottom).]

i.i.d. Gaussian measurement matrix assumption. A limitation of vanilla AMP is that the theory assumes an i.i.d. Gaussian measurement matrix, and moreover, the AMP algorithm can become unstable when the measurement matrix is far from i.i.d., creating the need for heuristic techniques to ensure convergence in applications where the measurement matrix is generated by nature (i.e., a real-world experiment or observational study). While, in general, AMP theory provides performance guarantees only for i.i.d. sub-Gaussian data [2, 5], in practice, favorable performance of AMP seems to be more universal. For example, in Fig. 3a, we illustrate the performance of AMP for i.i.d. zero-mean, $1/n$-variance design matrices that are not Gaussian (one i.i.d. $\pm 1$ Bernoulli (top) and one i.i.d. shifted exponential (bottom)). In particular, we note that the exponential prior is not sub-Gaussian, so the performance here is not supported by theory. In both cases, AMP converges very fast, thus demonstrating its robustness to distributional assumptions.

On the theoretical side, recent work proposes a variant of AMP, called vector-AMP or VAMP [28], which is a computationally-efficient algorithm that provably works for a wide range of design matrices, namely, those that are right rotationally-invariant. For example, [23] studies VAMP for a similar setting as SLOPE. However, the type of non-separability considered in that work requires the penalty to be separable on subsets of an affine transformation of its input. As such, the setting does not directly apply to SLOPE. To address this, we have built a hybrid, 'SLOPE VAMP', based on code generously shared by the authors of the referenced work [23], which performs very well in the (non-)i.i.d. (non-)Gaussian regime (see Fig. 3a and 3b).
Motivated by these promising empirical results, we feel that theoretically understanding SLOPE dynamics with VAMP is an exciting direction for future work.

Known signal prior assumption. There is a possibility that, by using EM- or SURE-based AMP strategies, one can remove the known signal prior assumption. Developing such strategies alongside our SLOPE VAMP would provide a quite general framework for recovery of the SLOPE estimator.

Comparison to 'Bayes-AMP'. In general, the (statistical) motivation for using methods like the LASSO or SLOPE is to perform variable selection and, in addition for SLOPE, to control the false discovery rate. Both methods are therefore biased and, consequently, 'Bayes-AMP' strategies that are designed to be optimal in terms of MSE will outperform them if performance is measured by MSE. In particular, [14] proves that 'Bayes-AMP' always has smaller MSE than that of methods employing convex regularization, for a wide class of convex penalties and Gaussian designs. Nevertheless, Fig. 3c suggests that SLOPE AMP has MSE that is not much worse than that of MMSE AMP.

Sampling regime. The asymptotic regime studied here, $n/p \to \delta \in (0, \infty)$, requires that the number of columns of the measurement matrix, $p$, grow at the same rate as the number of rows, $n$. It is of practical interest to extend the results to high-dimensional settings where $p$ grows faster than $n$.

[Figure 3: Performance of AMP variants in different settings with Bernoulli-Gaussian prior, dimension = 1000, and sample size = 300. Panel (c): i.i.d. Gaussian design matrix; curves compare optimization error and MSE of AMP, ISTA, FISTA, VAMP, LASSO AMP, MMSE AMP, and SLOPE AMP.]

References

[1] R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.

[2] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.

[3] M. Bayati and A. Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017, 2011.

[4] M. Bayati, M. A. Erdogdu, and A. Montanari. Estimating LASSO risk and noise level. In Advances in Neural Information Processing Systems, pages 944–952, 2013.

[5] M. Bayati, M. Lelarge, A. Montanari, et al. Universality in polytope phase transitions and message passing algorithms. The Annals of Applied Probability, 25(2):753–822, 2015.

[6] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[7] P. C. Bellec, G. Lecué, and A. B. Tsybakov. SLOPE meets LASSO: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018.

[8] R. Berthier, A. Montanari, and P.-M. Nguyen. State evolution for approximate message passing with non-separable functions. arXiv preprint arXiv:1708.03950, 2017.

[9] M.
Bogdan, E. Van Den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103, 2015.

[10] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.

[11] M. Borgerding and P. Schniter. Onsager-corrected deep learning for sparse linear inverse problems. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 227–231, 2016.

[12] D. Brzyski, A. Gossmann, W. Su, and M. Bogdan. Group SLOPE—adaptive selection of groups of predictors. Journal of the American Statistical Association, pages 1–15, 2018.

[13] Z. Bu, J. Klusowski, C. Rush, and W. Su. Algorithmic analysis and statistical estimation of SLOPE via approximate message passing. arXiv preprint arXiv:1907.07502, 2019.

[14] M. Celentano and A. Montanari. Fundamental barriers to high-dimensional regression with convex penalties. arXiv preprint arXiv:1903.10603, 2019.

[15] A. Chambolle, R. A. De Vore, N.-Y. Lee, and B. J. Lucier. Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Transactions on Image Processing, 7(3):319–335, 1998.

[16] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

[17] D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.

[18] D. L. Donoho, A. Maleki, and A. Montanari. The noise-sensitivity phase transition in compressed sensing.
IEEE Transactions on Information Theory, 57(10):6920–6941, 2011.

[19] J. L. Doob. Stochastic Processes, volume 101. Wiley, New York, 1953.

[20] M. Figueiredo and R. Nowak. Ordered weighted $\ell_1$ regularized regression with strongly correlated covariates: Theoretical aspects. In Artificial Intelligence and Statistics, pages 930–938, 2016.

[21] H. Hu and Y. M. Lu. Asymptotics and optimal designs of SLOPE for sparse linear regression. arXiv preprint arXiv:1903.11582, 2019.

[22] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová. Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, (8), 2012.

[23] A. Manoel, F. Krzakala, G. Varoquaux, B. Thirion, and L. Zdeborová. Approximate message-passing for convex optimization with non-separable penalties. arXiv preprint arXiv:1809.06304, 2018.

[24] A. Montanari. Graphical models concepts in compressed sensing. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing, pages 394–438. Cambridge University Press, 2012. URL http://dx.doi.org/10.1017/CBO9780511794308.010.

[25] A. Mousavi, A. Maleki, R. G. Baraniuk, et al. Consistent parameter estimation for LASSO and approximate message passing. The Annals of Statistics, 46(1):119–148, 2018.

[26] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[27] S. Rangan. Generalized approximate message passing for estimation with random linear mixing. In Proc. IEEE Int. Symp. Inf. Theory, pages 2168–2172, 2011.

[28] S. Rangan, P. Schniter, and A. K. Fletcher. Vector approximate message passing. IEEE Transactions on Information Theory, 2019.

[29] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[30] H. L. Royden. Real Analysis.
Krishna Prakashan Media, 1968.

[31] C. Rush and R. Venkataramanan. Finite sample analysis of approximate message passing algorithms. IEEE Transactions on Information Theory, 64(11):7264–7286, 2018.

[32] W. Su and E. Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3):1038–1068, 2016.

[33] W. Su, M. Bogdan, and E. Candès. False discoveries occur early on the lasso path. The Annals of Statistics, 45(5):2133–2150, 2017.

[34] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996.

[35] X. Zeng and M. A. Figueiredo. Decreasing weighted sorted $\ell_1$ regularization. IEEE Signal Processing Letters, 21(10):1240–1244, 2014.