{"title": "An equivalence between high dimensional Bayes optimal inference and M-estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 3378, "page_last": 3386, "abstract": "Due to the computational difficulty of performing MMSE (minimum mean squared error) inference, maximum a posteriori (MAP) is often used as a surrogate. However, the accuracy of MAP is suboptimal for high dimensional inference, where the number of model parameters is of the same order as the number of samples. In this work we demonstrate how MMSE performance is asymptotically achievable via optimization with an appropriately selected convex penalty and regularization function which are a smoothed version of the widely applied MAP algorithm. Our findings provide a new derivation and interpretation for recent optimal M-estimators discovered by El Karoui, et. al. PNAS 2013 as well as extending to non-additive noise models. We demonstrate the performance of these optimal M-estimators with numerical simulations. Overall, at the heart of our work is the revelation of a remarkable equivalence between two seemingly very different computational problems: namely that of high dimensional Bayesian integration, and high dimensional convex optimization. In essence we show that the former computationally difficult integral may be computed by solving the latter, simpler optimization problem.", "full_text": "An equivalence between high dimensional Bayes\n\noptimal inference and M-estimation\n\nMadhu Advani\n\nSurya Ganguli\n\nDepartment of Applied Physics, Stanford University\n\nmsadvani@stanford.edu\n\nsganguli@stanford.edu\n\nand\n\nAbstract\n\nWhen recovering an unknown signal from noisy measurements, the computational\ndif\ufb01culty of performing optimal Bayesian MMSE (minimum mean squared error)\ninference often necessitates the use of maximum a posteriori (MAP) inference,\na special case of regularized M-estimation, as a surrogate. 
However, MAP is\nsuboptimal in high dimensions, when the number of unknown signal components\nis similar to the number of measurements. In this work we demonstrate, when\nthe signal distribution and the likelihood function associated with the noise are\nboth log-concave, that optimal MMSE performance is asymptotically achievable\nvia another M-estimation procedure. This procedure involves minimizing convex\nloss and regularizer functions that are nonlinearly smoothed versions of the widely\napplied MAP optimization problem. Our \ufb01ndings provide a new heuristic derivation\nand interpretation for recent optimal M-estimators found in the setting of linear\nmeasurements and additive noise, and further extend these results to nonlinear\nmeasurements with non-additive noise. We numerically demonstrate superior\nperformance of our optimal M-estimators relative to MAP. Overall, at the heart\nof our work is the revelation of a remarkable equivalence between two seemingly\nvery different computational problems: namely that of high dimensional Bayesian\nintegration underlying MMSE inference, and high dimensional convex optimization\nunderlying M-estimation. In essence we show that the former dif\ufb01cult integral may\nbe computed by solving the latter, simpler optimization problem.\n\n1\n\nIntroduction\n\nModern technological advances now enable scientists to simultaneously record hundreds or thousands\nof variables in \ufb01elds ranging from neuroscience and genomics to health care and economics. For\nexample, in neuroscience, we can simultaneously record P = O(1000) neurons in behaving animals.\nHowever, the number of measurements N we can make of these P dimensional neural activity\npatterns can be limited in any given experimental condition due to constraints on recording time.\nThus a critical parameter is the measurement density \u03b1 = N\nP . 
Classical statistics focuses on the limit of few variables and many measurements, so P is finite, N is large, and α → ∞. Here, we instead consider the modern high dimensional limit where the measurement density α remains finite as N, P → ∞. In this important limit, we ask: what is the optimal way to recover signal from noise?

More precisely, we wish to recover an unknown signal vector s0 ∈ R^P given N noisy measurements

y_μ = r(x_μ · s0, ε_μ),  where x_μ ∈ R^P and y_μ ∈ R,  for μ = 1, . . . , N.   (1)

Here, x_μ and y_μ are input-output pairs for measurement μ, r is a measurement nonlinearity, and ε_μ is a noise realization. For example, in a brain machine interface, x_μ could be a neural activity pattern, y_μ a behavioral covariate, and s0 the unknown regression coefficients of a decoder relating neural activity to behavior. Alternatively, in sensory neuroscience, x_μ could be an external stimulus, y_μ a single neuron's response to that stimulus, and s0 the unknown receptive field relating stimulus to neural response. We assume the noise ε_μ is independent and identically distributed (iid) across measurements, implying the outputs y_μ are drawn iid from a noise distribution Py|z(y_μ|z_μ), where z_μ = x_μ · s0. Similarly, we assume the signal components s0_i are drawn iid from a prior signal distribution Ps(s0), whose variance we denote below by σs². Finally, we denote by X ∈ R^{N×P} the input or measurement matrix, whose μ'th row is x_μ, and by y ∈ R^N the measurement output vector whose μ'th component is y_μ.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
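As a concrete illustration of the generative model (1), the following sketch samples a signal and measurements in the high dimensional regime. The function name and the particular choices of r, Ps, and Py|z (a logistic channel and Laplacian prior, matching the running example used in later sections) are our own illustrative assumptions, as is the dense iid Gaussian measurement ensemble introduced below.

```python
import numpy as np

def generate_data(N, P, gamma=1.0, seed=0):
    """Sample from the generative model y_mu = r(x_mu . s0, eps_mu) of (1),
    with a logistic output channel and a Laplacian signal prior
    (illustrative choices; the framework allows any r, Ps, Py|z).
    Measurement vectors are iid Gaussian, normalized so <x_mu . x_mu> = gamma."""
    rng = np.random.default_rng(seed)
    s0 = rng.laplace(loc=0.0, scale=1.0, size=P)          # signal components ~ (1/2) e^{-|s|}
    X = rng.normal(0.0, np.sqrt(gamma / P), size=(N, P))  # rows satisfy <x_mu . x_nu> = gamma delta_{mu,nu}
    z = X @ s0                                            # linear part of each measurement
    y = (rng.random(N) < 1.0 / (1.0 + np.exp(-z))).astype(int)  # logistic channel P(y=1|z)
    return X, y, s0

X, y, s0 = generate_data(N=500, P=250)  # measurement density alpha = N/P = 2
```

Here the measurement density is α = N/P = 2, squarely in the high dimensional regime where N and P are comparable.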
In this paper, we will focus on the case of dense iid random Gaussian measurements, normalized so that ⟨x_μ · x_ν⟩ = γ δ_{μ,ν}. In the case of systems identification in sensory neuroscience, this choice would correspond to an oft-used white noise stimulus at contrast γ.

Now given measurement data (X, y), as well as knowledge of the nonlinearity r(·) and the signal Ps and noise Py|z distributions, what is the best way to infer an estimate ŝ of the unknown signal s0? We characterize the performance of an estimate ŝ by its mean squared error (MSE), ‖ŝ − s0‖²₂, averaged over noise realizations and measurements. The best minimal MSE (MMSE) estimator is given by optimal Bayesian integration to compute the posterior mean:

ŝ_MMSE = ∫ s P(s|X, y) ds.   (2)

Unfortunately, this integral is generally intractable in high dimensions, at large P; both numerical integration and Monte Carlo methods for estimating the integral require computational time growing exponentially in P for high accuracy. Consequently, an often used surrogate for MMSE inference is maximum a posteriori (MAP) inference, which computes the mode rather than the mean of the posterior distribution. Thus MAP relies on optimization rather than integration:

ŝ_MAP = arg max_s P(s|X, y) = arg min_s [ − log P(s|X, y) ].   (3)

Assuming inputs X are independent of the unknown signal s0, the above expression becomes

ŝ_MAP = arg min_s [ Σ_{μ=1}^N − log Py|z(y_μ|x_μ · s) + Σ_{i=1}^P − log Ps(s_i) ].   (4)

A related algorithm is maximum likelihood (ML), which seeks to maximize the likelihood of the data given a candidate signal s. ML is equivalent to MAP in (4) but without the second sum, i.e.
without prior information on the signal.

While ML is typically optimal amongst unbiased estimators in the classical statistical limit α → ∞ (see e.g. [1]), neither MAP nor ML is optimal in high dimensions, at finite α. Therefore, we consider a broader class of estimators known as regularized M-estimators, corresponding to the optimization problem

ŝ = arg min_s [ Σ_{μ=1}^N L(y_μ, x_μ · s) + Σ_{i=1}^P σ(s_i) ].   (5)

Here L(y, η) is a loss function and σ is a regularizer; we assume both are convex functions in η and s respectively. Note that MAP inference corresponds to the choice L(y, η) = − log Py|z(y|η) and σ(s) = − log Ps(s). ML inference corresponds to the same loss function but without regularization: σ(s) = 0. Other well known M-estimators include LASSO [2], corresponding to the choice L(y, η) = ½(y − η)² and σ(s) ∝ |s|, or the elastic net [3], which adds an additional quadratic term to the LASSO regularizer. Such M-estimators are heuristically motivated as convex relaxations of MAP inference for sparse signal distributions, and have been found to be very useful in such settings. However, a general theory for how to select the optimal M-estimator in (5) given the generative model of data in (1) remains elusive. This is the central problem we address in this work.

1.1 Related work and Outline

Seminal work [4] found the optimal unregularized M-estimator using variational methods in the special case of linear measurements and additive noise, i.e. r(z, ε) = z + ε in (1). In this same setting, [5] characterized unregularized M-estimator performance via approximate message passing (AMP) [6].
Following this, the performance of regularized M-estimators in the linear additive setting\nwas characterized in [7], using non-rigorous statistical physics methods based on replica theory, and\n\n2\n\n\fin [8], using rigorous methods different from [4, 5]. Moreover, [7] found the optimal regularized\nM-estimator and demonstrated, surprisingly, zero performance gap relative to MMSE. The goals of\nthis paper are to (1) interpret and extend previous work by deriving an equivalence between optimal\nM-estimation and Bayesian MMSE inference via AMP and (2) to derive the optimal M-estimator in\nthe more general setting of nonlinear measurements and non-additive noise.\nTo address these goals, we begin in section 2 by describing a pair of AMP algorithms, derived\nheuristically via approximations of belief propagation (BP). The \ufb01rst algorithm, mAMP, is designed\nto solve M-estimation in (5), while the second, bAMP, is designed to solve Bayesian MMSE inference\nin (2). In section 3 we derive a connection, via AMP, between M-estimation and MMSE inference:\nwe \ufb01nd, for a particular choice of optimal M-estimator, that mAMP and bAMP have the same \ufb01xed\npoints. To quantitatively determine the optimal M-estimator, which depends on some smoothing\nparameters, we must quantitatively characterize the performance of AMP, which we do in section\n4. We thereby recover optimal M-estimators found in recent works in the linear additive setting,\nwithout using variational methods, and moreover \ufb01nd optimal M-estimators in the nonlinear, non-\nadditive setting. Our non-variational approach through AMP also provides an intuitive explanation\nfor the form of the optimal M-estimator in terms of Bayesian inference. Intriguingly, the optimal\nM-estimator resembles a smoothed version of MAP, with lower measurement density requiring\nmore smoothing. 
In Section 4, we also demonstrate, through numerical simulations, a substantial performance improvement in inference accuracy achieved by the optimal M-estimator over MAP under nonlinear measurements with non-additive noise. We end with a discussion in section 5.

2 Formulations of Bayesian inference and M-estimation through AMP

Both mAMP and bAMP, heuristically derived in the supplementary material¹ (SM) sections 2.2-2.4 through approximate BP applied to (5) and (2) respectively, can be expressed as special cases of a generalized AMP (gAMP) algorithm [9], which we first describe. gAMP is a set of iterative equations,

η^t = X ŝ^t + λ^t_η Gy(λ^{t−1}_η, y, η^{t−1}),   ŝ^{t+1} = Gs(λ^t_h, ŝ^t − λ^t_h X^T Gy(λ^t_η, y, η^t)),   (6)

λ^t_h = ( (γα/N) Σ_{ν=1}^N (∂/∂η) Gy(λ^t_η, y_ν, η^t_ν) )^{−1},   λ^{t+1}_η = (γ λ^t_h / P) Σ_{j=1}^P (∂/∂h) Gs(λ^t_h, ŝ^t_j − λ^t_h X^T_j Gy(λ^t_η, y, η^t)),   (7)

that depend on the scalar functions Gy(λ_η, y, η) and Gs(λ_h, h) which, in our notation, act componentwise on vectors, so that the μth component Gy(λ_η, y, η)_μ = Gy(λ_η, y_μ, η_μ) and the ith component Gs(λ_h, h)_i = Gs(λ_h, h_i). Initial conditions are given by ŝ^{t=0} ∈ R^P, λ^{t=0}_η ∈ R+ and η^{t=−1} ∈ R^N.

Intuitively, one can think of η^t as related to the linear part of the measurement outcome predicted by the current guess ŝ^t, and Gy is a measurement correction map that uses the actual measurement data y to correct η^t.
Also, intuitively, we can think of Gs as taking an input ŝ^t − λ^t_h X^T Gy(λ^t_η, y, η^t), which is a measurement based correction to ŝ^t, and yielding as output a further, measurement independent correction ŝ^{t+1} that could depend on either a regularizer or prior. We thus refer to the functions Gy and Gs as the measurement and signal correctors respectively. gAMP is thus alternating measurement and signal correction, with time dependent parameters λ^t_h and λ^t_η. These equations were described in [9], and special cases of them were studied in various works (see e.g. [5, 10]).

2.1 From M-estimation to mAMP

Now, applying approximate BP to (5) when the input vectors x_μ are iid Gaussian, again with normalization ⟨x_μ · x_μ⟩ = γ, we find (SM Sec. 2.3) that the resulting mAMP equations are a special case of the gAMP equations, where the functions Gy and Gs are related to the loss L and regularizer σ through

G^M_y(λ_η, y, η) = M_{λ_η}[ L(y,·) ]′(η),   G^M_s(λ_h, h) = P_{λ_h}[ σ ](h).   (8)

¹Please see https://ganguli-gang.stanford.edu/pdf/16.Bayes.Mestimation.Supp.pdf for the supplementary material.

The functional mappings M and P, the Moreau envelope and proximal map [11], are defined as

M_λ[ f ](x) = min_y [ (x − y)²/(2λ) + f(y) ],   P_λ[ f ](x) = arg min_y [ (x − y)²/(2λ) + f(y) ].   (9)

The proximal map maps a point x to another point that minimizes f while remaining close to x, as determined by a scale λ. This can be thought of as a proximal descent step on f starting from x with step length λ.
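For scalar convex f, both mappings in (9) can be evaluated numerically by brute-force minimization over a grid; the helper below is our own illustrative sketch (it assumes the minimizer lies inside the grid), checked against the closed-form soft-thresholding operator discussed next for f(z) = |z|.

```python
import numpy as np

def moreau_prox(f, x, lam, grid):
    """Numerically evaluate the Moreau envelope M_lam[f](x) and proximal map
    P_lam[f](x) of (9) for a scalar convex f, by grid search.
    Returns (envelope value, proximal point)."""
    vals = (x - grid) ** 2 / (2.0 * lam) + f(grid)
    i = np.argmin(vals)
    return vals[i], grid[i]

# Check against the closed-form soft-thresholding operator for f(z) = |z|.
grid = np.linspace(-10.0, 10.0, 200001)
lam = 0.7
for x in [-2.0, -0.3, 0.0, 0.5, 3.0]:
    env, p = moreau_prox(np.abs, x, lam, grid)
    soft = np.sign(x) * max(abs(x) - lam, 0.0)  # P_lam[|.|](x)
    assert abs(p - soft) < 1e-3
    assert env <= abs(x) + 1e-9                 # Moreau envelope lower-bounds f
```

The final assertion illustrates the smoothed-lower-bound property of the Moreau envelope noted below.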
Perhaps the most ubiquitous example of a proximal map occurs for f(z) = |z|, in which case the proximal map is known as the soft thresholding operator and takes the form P_λ[ f ](x) = 0 for |x| ≤ λ and P_λ[ f ](x) = x − sign(x)λ for |x| ≥ λ. This soft thresholding is prominent in AMP approaches to compressed sensing (e.g. [10]). The Moreau envelope is a minimum convolution of f with a quadratic, and as such, M_λ[ f ](x) is a smoothed lower bound on f with the same minima [11]. Moreover, differentiating M with respect to x yields [11] the relation

P_λ[ f ](x) = x − λ M_λ[ f ]′(x).   (10)

Thus a proximal descent step on f is equivalent to a gradient descent step on the Moreau envelope of f, with the same step length λ. This equality is also useful in proving (SM Sec. 2.1) that the fixed points of mAMP satisfy

X^T (∂/∂η) L(y, Xŝ) + σ′(ŝ) = 0.   (11)

Thus fixed points of mAMP are local minima of M-estimation in (5).

To develop intuition for the mAMP algorithm, we note that the ŝ update step in (6) is similar to the more intuitive proximal gradient descent algorithm [11], which seeks to solve the M-estimation problem in (5) by alternately performing a gradient descent step on the loss term and a proximal descent step on the regularization term, both with the same step length.
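A minimal sketch of this alternating scheme for the LASSO instance of (5) (squared loss plus an l1 regularizer, whose proximal map is soft thresholding); the function and its parameter choices are our own illustration, not the paper's algorithm:

```python
import numpy as np

def prox_gradient_lasso(X, y, lam_reg, step, iters=500):
    """Proximal gradient descent (ISTA) for the LASSO instance of (5):
    L(y, eta) = (1/2)(y - eta)^2 and sigma(s) = lam_reg * |s|.
    Each iteration takes a gradient step on the loss, then a proximal
    (soft-thresholding) step on the regularizer, both with length `step`."""
    s = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ s - y)   # gradient of the summed squared-error loss
        u = s - step * g        # gradient descent step on the loss term
        s = np.sign(u) * np.maximum(np.abs(u) - step * lam_reg, 0.0)  # proximal step
    return s
```

For convergence the step length should be below 1/‖X‖², e.g. `step = 0.9 / np.linalg.norm(X, 2) ** 2`.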
Thus one iteration of gradient descent on L followed by proximal descent on σ in (5), with both steps using step length λ_h, yields

ŝ^{t+1} = P_{λ_h}[ σ ](ŝ^t − λ_h X^T (∂/∂η) L(y, Xŝ^t)).   (12)

By inserting (8) into (6)-(7), we see that mAMP closely resembles proximal gradient descent, but with three main differences: 1) the loss function is replaced with its Moreau envelope, 2) the loss is evaluated at η^t, which includes an additional memory term, and 3) the step size λ^t_h is time dependent. Interestingly, this additional memory term and step size evolution have been found to speed up convergence relative to proximal gradient descent in certain special cases, like LASSO [10].

In summary, in mAMP the measurement corrector Gy implements a gradient descent step on the Moreau smoothed loss, while the signal corrector Gs implements a proximal descent step on the regularizer. But because of (10), this latter step can also be thought of as a gradient descent step on the Moreau smoothed regularizer. Thus overall, the mAMP approach to M-estimation is intimately related to Moreau smoothing of both the loss and regularizer.

2.2 From Bayesian integration to bAMP

Now, applying approximate BP to (2) when again the input vectors x_μ are iid Gaussian, we find (SM Sec. 2.2) that the resulting bAMP equations are a special case of the gAMP equations, where the functions Gy and Gs are related to the noise Py|z and signal Ps distributions through

G^B_y(λ_η, y, η) = −(∂/∂η) log(Py(y|η, λ_η)),   G^B_s(λ_h, h) = ŝ_mmse(λ_h, h),   (13)

where

Py(y|η, λ) ∝ ∫ Py|z(y|z) e^{−(η−z)²/(2λ)} dz,   (14)

as derived in SM section 2.2.
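The Gaussian-smoothed likelihood in (14) is a one-dimensional integral and is easy to evaluate by quadrature. A sketch for the logistic channel Py|z(y = 1|z) = 1/(1 + e^{−z}) of the running example (the helper name, grid size, and integration window are our own illustrative choices):

```python
import numpy as np

def smoothed_likelihood(eta, lam, num=4001, width=10.0):
    """Gaussian-smoothed logistic likelihood Py(y=1|eta, lam) of (14):
    convolve Py|z(1|z) = 1/(1+e^{-z}) with a Gaussian of variance lam,
    by quadrature on a grid spanning +/- `width` standard deviations."""
    sd = np.sqrt(lam)
    z = np.linspace(eta - width * sd, eta + width * sd, num)
    dz = z[1] - z[0]
    kernel = np.exp(-(eta - z) ** 2 / (2.0 * lam)) / np.sqrt(2.0 * np.pi * lam)
    return float(np.sum(kernel / (1.0 + np.exp(-z)) * dz))
```

For y = 0 one can use Py(y=0|η, λ) = 1 − Py(y=1|η, λ). As λ → 0 the smoothed likelihood reduces to the bare logistic channel.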
Here Py(y|η, λ) is a convolution of the likelihood with a Gaussian of variance λ (normalized so that it is a probability density in y), and ŝ_mmse denotes the posterior mean ⟨s0|h⟩, where h = s0 + √λ w is a corrupted signal, w is a standard Gaussian random variable, and s0 is a random variable drawn from Ps:

ŝ_mmse(λ, h) = ∫ s Ps(s) e^{−(s−h)²/(2λ)} ds / ∫ Ps(s) e^{−(s−h)²/(2λ)} ds.

Inserting these equations into (6)-(7), we see that bAMP performs a measurement correction step through Gy that corresponds to a gradient descent step on the negative log of a Gaussian-smoothed likelihood function. The subsequent signal correction step through Gs is simply the computation of a posterior mean, assuming the input is drawn from the prior and corrupted by additive Gaussian noise with a time-dependent variance λ^t_h.

3 An AMP equivalence between Bayesian inference and M-estimation

In the previous section, we saw intriguing parallels between mAMP and bAMP, both special cases of gAMP. While mAMP performs its measurement and signal correction through a gradient descent step on a Moreau smoothed loss and a Moreau smoothed regularizer respectively, bAMP performs its measurement correction through a gradient descent step on the minus log of a Gaussian smoothed likelihood, and its signal correction through an MMSE estimation problem. These parallels suggest we may be able to find a loss L and regularizer σ such that the corresponding mAMP becomes equivalent to bAMP.
If so, then assuming the correctness of bAMP as a solution to (2), the resulting Lopt and σopt will yield the optimal mAMP dynamics, achieving MMSE inference.

By comparing (8) and (13), we see that bAMP and mAMP will have the same Gy if the Moreau-smoothed loss equals the minus log of the Gaussian-smoothed likelihood function:

M_{λ_η}[ Lopt(y,·) ](η) = − log(Py(y|η, λ_η)).   (15)

Before describing how to invert the above expression to determine Lopt, we would also like to find a relation between the two signal correction functions G^M_s and G^B_s. This is a little more challenging because the former implements a proximal descent step while the latter implements an MMSE posterior mean computation. However, we can express the MMSE computation as gradient ascent on the log of a Gaussian smoothed signal distribution (see SM):

ŝ_mmse(λ_h, h) = h + λ_h (∂/∂h) log(Ps(h, λ_h)),   where   Ps(h, λ) ∝ ∫ Ps(s) e^{−(s−h)²/(2λ)} ds.   (16)

Moreover, by applying (10) to the definition of G^M_s in (8), we can write G^M_s as gradient descent on a Moreau smoothed regularizer. Then, comparing these modified forms of G^B_s with G^M_s, we find a similar condition for σopt, namely that its Moreau smoothing should equal the minus log of the Gaussian smoothed signal distribution:

M_{λ_h}[ σopt ](h) = − log(Ps(h, λ_h)).   (17)

Our goal is now to compute the optimal loss and regularizer by inverting the Moreau envelope relations (15, 17) to solve for Lopt, σopt. A sufficient condition [4] to invert these Moreau envelopes to determine the optimal mAMP dynamics is that Py|z(y|z) and Ps(s) are log concave with respect to z and s respectively.
Under this condition the Moreau envelope will be invertible via the relation M_q[ −M_q[ −f ](·) ](·) = f(·) (see SM Appendix A.3 for a derivation), which yields:

Lopt(y, η) = −M_{λ_η}[ log(Py(y|·, λ_η)) ](η),   σopt(h) = −M_{λ_h}[ log(Ps(·, λ_h)) ](h).   (18)

This optimal loss and regularizer form resembles smoothed MAP inference, with λ_η and λ_h being scalar parameters that modify MAP through both Gaussian and Moreau smoothing. An example of such a family of smoothed loss and regularizer functions is given in Fig. 1 for the case of a logistic output channel with Laplacian distributed signal. Additionally, one can show that the optimal loss and regularizer are convex when the signal and noise distributions are log-concave. Overall, this analysis yields a dynamical equivalence between mAMP and bAMP as long as, at each iteration time t, the optimal loss and regularizer for mAMP are chosen through the smoothing operation in (18), but using time-dependent smoothing parameters λ^t_η and λ^t_h whose evolution is governed by (7).

4 Determining optimal smoothing parameters via state evolution of AMP

In the previous section, we have shown that mAMP and bAMP have the same dynamics, as long as, at each iteration t of mAMP, we choose a time dependent optimal loss Lopt_t and regularizer σopt_t through (18), where the time dependence is inherited from the time dependent smoothing parameters λ^t_η and λ^t_h. However, mAMP was motivated as an algorithmic solution to the M-estimation problem in (5) for a fixed loss and regularizer, while bAMP was motivated as a method of performing the Bayesian integral in (2).
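Both smoothing operations in (18) are one-dimensional, so the optimal regularizer can be evaluated numerically. The sketch below computes σopt(h) for the Laplacian prior Ps(s) = ½ e^{−|s|} by quadrature (for the Gaussian smoothing) and grid search (for the Moreau envelope); the helper name and grid sizes are our own illustrative choices. As λ_h → 0 it should recover the MAP regularizer −log Ps(h) = |h| + log 2.

```python
import numpy as np

def sigma_opt(h, lam, grid):
    """Numerically evaluate the optimal regularizer of (18),
    sigma_opt(h) = -M_lam[ log Ps(., lam) ](h), for a Laplacian prior.
    The Gaussian smoothing Ps(u, lam) is computed by quadrature on `grid`,
    and the Moreau envelope by brute-force grid search (a sketch, assuming
    the relevant minimizer lies well inside the grid)."""
    ds = grid[1] - grid[0]
    prior = 0.5 * np.exp(-np.abs(grid))  # Laplacian prior Ps(s) = (1/2) e^{-|s|}
    # Gaussian-smoothed prior Ps(u, lam), evaluated on the same grid.
    kern = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / (2.0 * lam))
    kern /= np.sqrt(2.0 * np.pi * lam)
    log_ps = np.log(kern @ prior * ds)
    # Moreau envelope of log Ps(., lam) at h, then negate per (18).
    return float(-np.min((h - grid) ** 2 / (2.0 * lam) + log_ps))
```

Evaluating this on a grid of h values for increasing λ_h reproduces the qualitative picture of Fig. 1B: the regularizer is a progressively smoothed, convex deformation of the MAP choice |h| + log 2.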
Figure 1: Here we plot the optimal loss (A) and regularizer (B) in (18), for a logistic output y ∈ {0, 1} with Py|z(y = 1|z) = 1/(1 + e^{−z}), and Laplacian signal s with Ps(s) = ½ e^{−|s|}. In (A) we plot the loss for the measurement y = 1: Lopt(y = 1, ·). Both sets of curves from red to black (and bottom to top) correspond to smoothing parameters λ_η = (0, 2, 4, 6) in (A) and λ_h = (0, 1/2, 1, 2) in (B). With zero smoothing, the red curves at the bottom correspond to the MAP loss and regularizer.

This then raises the question: is there a fixed, optimal choice of Lopt and σopt in (5) such that the corresponding M-estimation problem yields the same answer as the Bayesian integral in (2)? The answer is yes: simply choose a fixed Lopt and σopt through (18), where the smoothing parameters λ_η and λ_h are chosen to be those found at the fixed points of bAMP. To see this, note that fixed points of mAMP with time dependent choices of Lopt_t and σopt_t are equivalent to the minima of the M-estimation problem in (5), with the choice of loss and regularizer that this time dependent sequence converges to: Lopt_∞ and σopt_∞ (this follows from an extension of the argument that led to (11)). In turn, the fixed points of mAMP are equivalent to those of bAMP under the choice (18).
These equivalences then imply that, if the bAMP dynamics for (ŝ^t, λ^t_η, λ^t_h) approaches the fixed point (ŝ^∞, λ^∞_η, λ^∞_h), then ŝ^∞ is the solution to both Bayesian inference in (2) and optimal M-estimation in (5), with optimal loss and regularizer given by (18) with the choice of smoothing parameters λ^∞_η and λ^∞_h.

We now discuss how to determine λ^∞_η and λ^∞_h analytically, thereby completing our heuristic derivation of an optimal M-estimator that matches Bayesian MMSE inference. An essential tool is state evolution (SE), which characterizes the gAMP dynamics [12] as follows. First, let z = Xs0 be related to the true measurements. Then (6) implies that η^t − z is a time-dependent residual. Remarkably, the gAMP equations ensure that the components of the residual η^t − z, as well as of h^t = ŝ^t − λ^t_h X^T Gy(λ^t_η, y, η^t) − s0, are Gaussian distributed; the history term in the update of η^t in (6) crucially cancels out non-Gaussian structure that would otherwise develop as the vectors η^t and h^t propagate through the nonlinear measurement and signal correction steps induced by Gy and Gs. We denote by q^t_η and q^t_h the variance of the components of η^t − z and h^t respectively. Additionally, we denote by q^t_s = (1/P)⟨‖ŝ^t − s0‖²⟩ the per component MSE at iteration t. SE is a set of analytical evolution equations for the quantities (q^t_η, q^t_s, q^t_h) that characterize the state of gAMP. Rigorous derivations, both for dense Gaussian measurements [12] and sparse measurements [13], reveal that the SE equations accurately track the gAMP dynamical state in the high dimensional limit N, P → ∞ with α = N/P of order one that we consider here.

We derive the specific form of the mAMP SE equations, yielding a set of 5 update equations (see SM section 3.1 for further details). We also derive the SE equations for bAMP, which are simpler.
First, we find the relations λ^t_η = q^t_η and λ^t_h = q^t_h. Thus SE for bAMP reduces to a pair of update equations:

q^{t+1}_η = γ ⟨( G^B_s(q^t_h, s0 + √(q^t_h) w) − s0 )²⟩_{w,s0},   q^t_h = ( αγ ⟨( G^B_y(q^t_η, y, η^t) )²⟩_{y,z,η^t} )^{−1}.   (19)

Here w is a zero mean, unit variance Gaussian and s0 is a scalar signal drawn from the signal distribution Ps. Thus the computation of the next residual q^{t+1}_η on the LHS of (19) involves computing the MSE in estimating a signal s0 corrupted by Gaussian noise of variance q^t_h, using MMSE inference as an estimation procedure via the function G^B_s defined in (13). The RHS involves an average over the joint distribution of scalar versions of the output y, true measurement z, and estimated measurement η^t. These three scalars are the SE analogs of the gAMP variables y, z, and η^t, and they model the joint distribution of single components of these vectors. Their joint distribution is given by P(y, z, η^t) = Py|z(y|z)P(z, η^t). In the special case of bAMP, z and η^t are jointly zero mean Gaussian with second moments given by ⟨(η^t)²⟩ = γσs² − q^t_η, ⟨z²⟩ = γσs², and ⟨zη^t⟩ = γσs² − q^t_η (see SM 3.2 for derivations). These moments imply the residual variance ⟨(z − η^t)²⟩ = q^t_η.
Intuitively, when gAMP works well, that is reflected in the SE equations by the reduction of the residual variance q^t_η over time, as the time dependent estimated measurement η^t converges to the true measurement z. The actual measurement outcome y, after the nonlinear part of the measurement process, is always conditionally independent of the estimated measurement η^t, given the true linear part of the measurement, z. Finally, the joint distribution of a single component of ŝ^{t+1} and s0 in gAMP is predicted by SE to be the same as that of ŝ^{t+1} = G^B_s(q^t_h, s0 + √(q^t_h) w) and s0, after marginalizing out w. Comparing with the LHS of (19) then yields that the MSE per component satisfies q^t_s = q^t_η/γ.

Now, bAMP performance, upon convergence, is characterized by the fixed point of SE, which satisfies

q_s = MMSE(s0|s0 + √(q_h) w),   q_h = 1/( αγ J[ Py(y|η, γq_s) ] ).   (20)

Here, the MMSE function denotes the minimal error in estimating the scalar signal s0 from a measurement of s0 corrupted by additive Gaussian noise of variance q_h, via computation of the posterior mean ⟨s0|s0 + √(q_h) w⟩:

MMSE(s0|s0 + √(q_h) w) = ⟨( ⟨s0|s0 + √(q_h) w⟩ − s0 )²⟩_{s0,w}.   (21)

Also, the function J on the RHS of (20) denotes the average Fisher information that y retains about an input, with some additional Gaussian input noise of variance q:

J[ Py(y|η, q) ] = −⟨ (∂²/∂η²) log Py(y|η, q) ⟩_{η,y}.   (22)

These equations characterize the performance of bAMP, through q_s. Furthermore, they yield the optimal smoothing parameters λ_η = γq_s and λ_h = q_h. This choice of smoothing parameters, when used in (18), yields a fixed optimal loss Lopt and regularizer σopt.
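For a Gaussian signal prior and additive Gaussian noise, both the scalar MMSE function (21) and the average Fisher information (22) have closed forms, so the SE fixed point (20) can be found by simple iteration. The sketch below is our own illustrative special case (the general setting requires numerical integration for (21) and (22)): with s0 ~ N(0, σs²), MMSE(s0|s0 + √(q_h) w) = σs² q_h/(σs² + q_h), and with additive Gaussian noise of variance σε², J[Py(y|η, q)] = 1/(σε² + q).

```python
import numpy as np

def se_fixed_point(alpha, gamma=1.0, sig_s2=1.0, sig_e2=0.5, iters=200):
    """Iterate the bAMP state evolution to its fixed point (20) for a
    Gaussian prior N(0, sig_s2) and additive Gaussian noise of variance
    sig_e2, where both (21) and (22) are analytic. Returns the fixed-point
    per-component MSE q_s."""
    qs = sig_s2  # initialize at the prior variance (no information)
    for _ in range(iters):
        qh = (sig_e2 + gamma * qs) / (alpha * gamma)  # q_h = 1/(alpha*gamma*J), J = 1/(sig_e2 + gamma*q_s)
        qs = sig_s2 * qh / (sig_s2 + qh)              # scalar Gaussian MMSE under corruption of variance q_h
    return qs

mses = [se_fixed_point(a) for a in (0.5, 1.0, 2.0, 4.0)]
```

As expected, the fixed-point MSE decreases monotonically with the measurement density α, and the fixed point (q_s, q_h) directly supplies the smoothing parameters λ_η = γq_s and λ_h = q_h of the optimal M-estimator in this special case.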
When this optimal loss and regularizer are used in the M-estimation problem in (5), the resulting M-estimator should have performance equivalent to that of MMSE inference in (2). This completes our heuristic derivation of an equivalence between optimal M-estimation and Bayesian inference through message passing.

In Figure 2 we demonstrate numerically that the optimal M-estimator substantially outperforms MAP, especially at low measurement density α, and has performance equivalent to MMSE inference, as theoretically predicted by SE for bAMP.

Figure 2: For logistic output and Laplacian signal, as in Fig. 1, we plot the per component MSE, normalized by signal variance. Smooth curves are theoretical predictions based on SE fixed points for mAMP for MAP inference (red) and bAMP for MMSE inference (black). Error bars reflect standard deviation in performance obtained by solving (5), via mAMP, for MAP inference (red) and optimal M-estimation (black), using simulated data generated as in (1), with dense i.i.d. Gaussian measurements. For these finite simulated data sets, we varied α = N/P while holding √(NP) ≈ 250. These results demonstrate that optimal M-estimation both significantly outperforms MAP (black below red) and matches Bayesian MMSE inference as predicted by SE for bAMP (black error bars consistent with black curve).

5 Discussion

Overall we have derived an optimal M-estimator, or a choice of optimal loss and regularizer, such that the M-estimation problem in (5) has performance equivalent to that of Bayes optimal MMSE inference in (2), in the case of log-concave signal distribution and noise likelihood.
Our derivation is heuristic in that it employs the formalism of gAMP, and as such depends on the correctness of a few statements. First, we assume that two special cases of the gAMP dynamics in (6), namely mAMP in (8) and bAMP in (13), correctly solve the M-estimation problem in (5) and Bayesian MMSE inference in (2), respectively. We provide a heuristic derivation of both of these assumptions in the SM based on approximations of BP. Second, we require that SE in (19) correctly tracks the performance of gAMP in (13). We note that under mild conditions, the correctness of SE as a description of gAMP was rigorously proven in [12].

While we have not presented a rigorous derivation that the bAMP dynamics correctly solves the MMSE inference problem, we note several related rigorous results. First, it has been shown that bAMP is equivalent to MMSE inference in the limit of large sparse measurement matrices in [13, 14]. Also, in this same large sparse limit, the corresponding mAMP algorithm was shown to be equivalent to MAP inference with additive Gaussian noise [15]. In the setting of dense measurements, the correctness of bAMP has not yet been rigorously proven, but the associated SE is believed to be exact in the dense i.i.d. Gaussian measurement setting, based on replica arguments from statistical physics (see e.g. section 4.3 in [16] for further discussion). For this reason, similar arguments have been used to determine theoretical bounds on inference algorithms in compressed sensing [16] and matrix factorization [17].

There are further rigorous results in the setting of M-estimation: mAMP and its associated SE are also provably correct in the large sparse measurement limit, and have additionally been rigorously proven to converge in special cases [5, 6] for dense i.i.d. Gaussian measurements. 
We further expect these results to generalize to a universality class of measurement matrices with i.i.d. elements and a suitable condition on their moments. Indeed this generalization was demonstrated rigorously for a subclass of M-estimators in [18]. In the setting of dense measurements, due to the current absence of rigorous results demonstrating the correctness of bAMP in solving MMSE inference, we have also provided numerical experiments in Fig. 2. This figure demonstrates that optimal M-estimation can significantly outperform MAP for high dimensional inference problems, again for the case of log-concave signal and noise.

Additionally, we note that the per-iteration time complexity of the gAMP algorithms (6, 7) scales linearly in both the number of measurements and signal dimensions. Therefore the optimal algorithms we describe are applicable to large-scale problems. Moreover, at lower measurement densities, the optimal loss and regularizer are smoother. Such smoothing may accelerate convergence time. Indeed smoother convex functions, with smaller Lipschitz constants on their derivative, can be minimized faster via gradient descent. It would be interesting to explore whether a similar result may hold for gAMP dynamics.

Another interesting future direction is the optimal estimation of sparse signals, which typically do not have log-concave distributions. One potential strategy in such scenarios would be to approximate the signal distribution with the best log-concave fit and apply optimal smoothing to determine a good regularizer. Alternatively, for any practical problem, one could choose the precise smoothing parameters through any model selection procedure, for example cross-validation on held-out data. Thus the combined Moreau and Gaussian smoothing in (18) could yield a family of optimization problems, where one member of this family could potentially yield better performance in practice on held-out data. 
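To make the smoothing concrete: a standard fact from convex analysis (see e.g. [11]) is that the Moreau envelope of the L1 penalty is exactly the Huber function. The sketch below is an illustrative aside, not the paper's full construction in (18), which additionally involves Gaussian smoothing; it verifies the Moreau-to-Huber identity numerically by brute-force minimization over a grid:

```python
import numpy as np

def moreau_envelope(f, x, lam, grid):
    # Moreau envelope: M_lam f(x) = min_y [ f(y) + (x - y)^2 / (2*lam) ],
    # approximated here by minimizing over a fine grid of y values
    return np.min(f(grid) + (x - grid) ** 2 / (2.0 * lam))

def huber(x, lam):
    # closed form of the Moreau envelope of the absolute value:
    # quadratic near the origin, linear (with reduced slope offset) beyond
    return np.where(np.abs(x) <= lam, x ** 2 / (2.0 * lam),
                    np.abs(x) - lam / 2.0)

grid = np.linspace(-10.0, 10.0, 200001)
lam = 1.5
xs = np.array([-3.0, -0.5, 0.0, 0.7, 2.5])
numeric = np.array([moreau_envelope(np.abs, x, lam, grid) for x in xs])
assert np.allclose(numeric, huber(xs, lam), atol=1e-4)
```

This illustrates how smoothing an L1 regularizer flattens its kink at the origin while keeping it convex, which is why smoother members of such a family can be easier to minimize by gradient methods.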
For example, while LASSO performs very well for sparse signals, as demonstrated by its success in compressed sensing [19, 20], the popular elastic net [3], which sometimes outperforms pure LASSO by combining L1 and L2 penalties, resembles a specific type of smoothing of an L1 regularizer. It would be interesting to see if the combined Moreau and Gaussian smoothing underlying our optimal M-estimators could significantly outperform LASSO and elastic net in practice, when our distributional assumptions about signal and noise need not precisely hold. However, finding optimal M-estimators for known sparse signal distributions, and characterizing the gap between their performance and that of MMSE inference, remains a fundamental open question.

Acknowledgements

The authors would like to thank Lenka Zdeborová and Stephen Boyd for useful discussions and also Chris Stock and Ben Poole for comments on the manuscript. M.A. thanks the Stanford MBC and SGF for support. S.G. thanks the Burroughs Wellcome, Simons, Sloan, McKnight, and McDonnell foundations, and the Office of Naval Research for support.

References

[1] P. Huber and E. Ronchetti. Robust Statistics. Wiley, 2009.

[2] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58:267–288, 1996.

[3] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67:301–320, 2005.

[4] D. Bean, P. J. Bickel, N. El Karoui, and B. Yu. Optimal M-estimation in high-dimensional regression. PNAS, 110(36):14563–8, 2013.

[5] D. Donoho and A. Montanari. High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, pages 1–35, 2013.

[6] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. 
IEEE Transactions on Information Theory, 57(2):764–785, 2011.

[7] M. Advani and S. Ganguli. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6:031034, 2016.

[8] C. Thrampoulidis, E. Abbasi, and B. Hassibi. Precise high-dimensional error analysis of regularized M-estimators. 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 410–417, 2015.

[9] S. Rangan. Generalized approximate message passing for estimation with random linear mixing. Information Theory Proceedings (ISIT), 2011 IEEE International Symposium, pages 2168–2172, 2011.

[10] D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, pages 18914–18919, 2009.

[11] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.

[12] A. Javanmard and A. Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference, page iat004, 2013.

[13] S. Rangan. Estimation with random linear mixing, belief propagation and compressed sensing. Information Sciences and Systems (CISS), 2010 44th Annual Conference, 2010.

[14] D. Guo and C. C. Wang. Random sparse linear systems observed via arbitrary channels: a decoupling principle. IEEE International Symposium on Information Theory (ISIT), 2007.

[15] C. C. Wang and D. Guo. Belief propagation is asymptotically equivalent to MAP estimation for sparse linear systems. Proc. Allerton Conf., pages 926–935, 2006.

[16] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová. Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, (08):P08009, 2012.

[17] Y. Kabashima, F. Krzakala, M. Mézard, A. 
Sakata, and L. Zdeborová. Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Transactions on Information Theory, 62:4228–4265, 2016.

[18] M. Bayati, M. Lelarge, and A. Montanari. Universality in polytope phase transitions and message passing algorithms. The Annals of Applied Probability, 25:753–822, 2015.

[19] E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.

[20] A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.