{"title": "Analytic solution and stationary phase approximation for the Bayesian lasso and elastic net", "book": "Advances in Neural Information Processing Systems", "page_first": 2770, "page_last": 2780, "abstract": "The lasso and elastic net linear regression models impose a double-exponential prior distribution on the model parameters to achieve   regression shrinkage and variable selection,  allowing the inference of robust models from large data sets.  However, there has been limited success in deriving estimates for the full posterior distribution of regression coefficients in these models, due to a need to evaluate analytically intractable partition function integrals. Here, the Fourier transform is used to express these integrals as complex-valued oscillatory integrals over \"regression frequencies\". This results in an analytic expansion and stationary phase approximation for the partition functions of the Bayesian lasso and elastic net, where the non-differentiability of the double-exponential prior has so far eluded such an approach. 
Use of this approximation leads to highly accurate numerical estimates for the expectation values and marginal posterior distributions of the regression coefficients, and allows for Bayesian inference of much higher dimensional models than previously possible.", "full_text": "Analytic solution and stationary phase approximation\n\nfor the Bayesian lasso and elastic net\n\nComputational Biology Unit, Department of Informatics, University of Bergen, Norway\n\nThe Roslin Institute, The University of Edinburgh, UK\n\nTom Michoel\n\ntom.michoel@uib.no\n\nAbstract\n\nThe lasso and elastic net linear regression models impose a double-exponential\nprior distribution on the model parameters to achieve regression shrinkage and\nvariable selection, allowing the inference of robust models from large data sets.\nHowever, there has been limited success in deriving estimates for the full posterior\ndistribution of regression coef\ufb01cients in these models, due to a need to evaluate\nanalytically intractable partition function integrals. Here, the Fourier transform\nis used to express these integrals as complex-valued oscillatory integrals over\n\u201cregression frequencies\u201d. This results in an analytic expansion and stationary\nphase approximation for the partition functions of the Bayesian lasso and elastic\nnet, where the non-differentiability of the double-exponential prior has so far\neluded such an approach. Use of this approximation leads to highly accurate\nnumerical estimates for the expectation values and marginal posterior distributions\nof the regression coef\ufb01cients, and allows for Bayesian inference of much higher\ndimensional models than previously possible.\n\n1\n\nIntroduction\n\nStatistical modelling of high-dimensional data sets where the number of variables exceeds the\nnumber of experimental samples may result in over-\ufb01tted models that do not generalize well to\nunseen data. 
Prediction accuracy in these situations can often be improved by shrinking regression\ncoef\ufb01cients towards zero [1]. Bayesian methods achieve this by imposing a prior distribution on\nthe regression coef\ufb01cients whose mass is concentrated around zero. For linear regression, the most\npopular methods are ridge regression [2], which has a normally distributed prior; lasso regression [3],\nwhich has a double-exponential or Laplace distribution prior; and elastic net regression [4], whose\nprior interpolates between the lasso and ridge priors. The lasso and elastic net are of particular interest,\nbecause in their maximum-likelihood solutions, a subset of regression coef\ufb01cients are exactly zero.\nHowever, maximum-likelihood solutions only provide a point estimate for the regression coef\ufb01cients.\nA fully Bayesian treatment that takes into account uncertainty due to data noise and limited sample\nsize, and provides posterior distributions and con\ufb01dence intervals, is therefore of great interest.\nUnsurprisingly, Bayesian inference for the lasso and elastic net involves analytically intractable\npartition function integrals and requires the use of numerical Gibbs sampling techniques [5\u20138].\nHowever, Gibbs sampling is computationally expensive and, particularly in high-dimensional settings,\nconvergence may be slow and dif\ufb01cult to assess or remedy [9\u201312]. An alternative to Gibbs sampling\nfor Bayesian inference is to use asymptotic approximations to the intractable integrals based on\nLaplace\u2019s method [13, 14]. However, the Laplace approximation requires twice differentiable log-\nlikelihood functions, and cannot be applied to the lasso and elastic net models as they contain a\nnon-differentiable term proportional to the sum of absolute values (i.e. 
$\\ell_1$-norm) of the regression\ncoefficients.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAlternatives to the Laplace approximation have been considered for statistical models where the Fisher\ninformation matrix is singular, and no asymptotic approximation using normal distributions is feasible\n[15, 16]. However, in $\\ell_1$-penalized models, the singularity originates from the prior distributions\non the model parameters, and the Fisher information matrix remains positive definite. Here we\nshow that in such models, approximate Bayesian inference is in fact possible using a Laplace-like\napproximation, more precisely the stationary phase or saddle point approximation for complex-valued\noscillatory integrals [17]. This is achieved by rewriting the partition function integrals in terms\nof \u201cfrequencies\u201d instead of regression coefficients, through the use of the Fourier transform. The\nappearance of the Fourier transform in this context should not come as a big surprise. The stationary\nphase approximation can be used to obtain or invert characteristic functions, which are of course\nFourier transforms [18]. More to the point of this paper, there is an intimate connection between the\nFourier transform of the exponential of a convex function and the Legendre-Fenchel transform of\nthat convex function, which plays a fundamental role in physics by linking microscopic statistical\nmechanics to macroscopic thermodynamics and quantum to classical mechanics [19]. 
In particular,\nconvex duality [20, 21], which maps the solution of a convex optimization problem to that of its\ndual, is essentially equivalent to writing the partition function of a Gibbs probability distribution in\ncoordinate or frequency space (Appendix A).\nConvex duality principles have been essential to characterize analytical properties of the\nmaximum-likelihood solutions of the lasso and elastic net regression models [22\u201327]. This paper shows that\nequally powerful duality principles exist to study Bayesian inference problems.\n\n2 Analytic results\n\nWe consider the usual setup for linear regression where there are $n$ observations of $p$ predictor\nvariables and one response variable, and the effects of the predictors on the response are to be\ndetermined by minimizing the least squares cost function $\\|y - Ax\\|^2$ subject to additional constraints,\nwhere $y \\in \\mathbb{R}^n$ are the response data, $A \\in \\mathbb{R}^{n \\times p}$ are the predictor data, $x \\in \\mathbb{R}^p$ are the regression\ncoefficients which need to be estimated and $\\|v\\| = (\\sum_{i=1}^n |v_i|^2)^{1/2}$ is the $\\ell_2$-norm. Without loss of\ngenerality, it is assumed that the response and predictors are centred and standardized,\n\n$$\\sum_{i=1}^n y_i = \\sum_{i=1}^n A_{ij} = 0 \\quad \\text{and} \\quad \\sum_{i=1}^n y_i^2 = \\sum_{i=1}^n A_{ij}^2 = n \\quad \\text{for } j \\in \\{1, 2, \\dots, p\\}. \\tag{1}$$\n\nIn a Bayesian setting, a hierarchical model is assumed where each sample $y_i$ is drawn independently\nfrom a normal distribution with mean $A_{i\\bullet} x$ and variance $\\sigma^2$, where $A_{i\\bullet}$ denotes the $i$th row of $A$, or\nmore succinctly,\n\n$$p(y \\mid A, x) = \\mathcal{N}(Ax, \\sigma^2 \\mathbb{1}), \\tag{2}$$\n\nwhere $\\mathcal{N}$ denotes a multivariate normal distribution, and the regression coefficients $x$ are assumed to\nhave a prior distribution\n\n$$p(x) \\propto \\exp\\Bigl[ -\\frac{n}{\\sigma^2} \\bigl( \\lambda \\|x\\|^2 + 2\\mu \\|x\\|_1 \\bigr) \\Bigr], \\tag{3}$$\n\nwhere $\\|x\\|_1 = \\sum_{j=1}^p |x_j|$ is the $\\ell_1$-norm, and the prior distribution is defined up to a normalization\nconstant. 
The apparent dependence of the prior distribution on the data via the dimension parameter\n$n$ only serves to simplify notation, allowing the posterior distribution of the regression coefficients to\nbe written, using Bayes\u2019 theorem, as\n\n$$p(x \\mid y, A) \\propto p(y \\mid x, A) \\, p(x) \\propto e^{-\\frac{n}{\\sigma^2} L(x \\mid y, A)}, \\tag{4}$$\n\nwhere\n\n$$L(x \\mid y, A) = \\frac{1}{2n} \\|y - Ax\\|^2 + \\lambda \\|x\\|^2 + 2\\mu \\|x\\|_1 \\tag{5}$$\n\n$$= x^T \\Bigl( \\frac{A^T A}{2n} + \\lambda \\mathbb{1} \\Bigr) x - 2 \\Bigl( \\frac{A^T y}{2n} \\Bigr)^T x + 2\\mu \\|x\\|_1 + \\frac{1}{2n} \\|y\\|^2 \\tag{6}$$\n\nis minus the posterior log-likelihood function. The maximum-likelihood solutions of the lasso ($\\lambda = 0$)\nand elastic net ($\\lambda > 0$) models are obtained by minimizing $L$, where the relative scaling of the penalty\nparameters to the sample size $n$ corresponds to the notational conventions of [28] (see footnote 1). 
In the current\nsetup, it is assumed that the parameters $\\lambda \\geq 0$, $\\mu > 0$ and $\\sigma^2 > 0$ are given a priori.\n\nFootnote 1: To be precise, in [28] the penalty term is written as $\\tilde{\\lambda} (\\frac{1 - \\alpha}{2} \\|x\\|_2^2 + \\alpha \\|x\\|_1)$, which is obtained from (5) by\nsetting $\\tilde{\\lambda} = 2(\\lambda + \\mu)$ and $\\alpha = \\frac{\\mu}{\\lambda + \\mu}$.\n\nTo facilitate notation, a slightly more general class of cost functions is defined as\n\n$$H(x \\mid C, w, \\mu) = x^T C x - 2 w^T x + 2\\mu \\|x\\|_1, \\tag{7}$$\n\nwhere $C \\in \\mathbb{R}^{p \\times p}$ is a positive-definite matrix, $w \\in \\mathbb{R}^p$ is an arbitrary vector and $\\mu > 0$. After\ndiscarding a constant term, $L(x \\mid y, A)$ is of this form, as is the so-called \u201cnon-naive\u201d elastic net,\nwhere $C = (\\frac{1}{2n} A^T A + \\lambda \\mathbb{1})/(\\lambda + 1)$ [4]. More importantly perhaps, eq. (7) also covers linear\nmixed models, where samples need not be independent [29]. In this case, eq. (2) is replaced by\n$p(y \\mid A, x) = \\mathcal{N}(Ax, \\sigma^2 K)$, for some covariance matrix $K \\in \\mathbb{R}^{n \\times n}$, resulting in a posterior minus\nlog-likelihood function with $C = \\frac{1}{2n} A^T K^{-1} A + \\lambda \\mathbb{1}$ and $w = \\frac{1}{2n} A^T K^{-1} y$. The requirement that $C$\nis positive definite, and hence invertible, implies that $H$ is strictly convex and hence has a unique\nminimizer. For the lasso ($\\lambda = 0$) this only holds without further assumptions if $n \\geq p$ [26]; for the\nelastic net ($\\lambda > 0$) there is no such constraint.\nThe Gibbs distribution on $\\mathbb{R}^p$ for the cost function $H(x \\mid C, w, \\mu)$ with inverse temperature $\\tau$ is\ndefined as\n\n$$p(x \\mid C, w, \\mu) = \\frac{e^{-\\tau H(x \\mid C, w, \\mu)}}{Z(C, w, \\mu)}. \\tag{8}$$\n\nFor ease of notation we will henceforth drop explicit reference to $C$, $w$ and $\\mu$. The normalization\nconstant $Z = \\int_{\\mathbb{R}^p} e^{-\\tau H(x)} dx$ is called the partition function. There is no known analytic solution\nfor the partition function integral. 
However, in the posterior distribution (4), the inverse temperature\n$\\tau = \\frac{n}{\\sigma^2}$ is large, firstly because we are interested in high-dimensional problems where $n$ is large\n(even if it may be small compared to $p$), and secondly because we assume a priori that (some\nof) the predictors are informative for the response variable and that therefore $\\sigma^2$, the amount of\nvariance of $y$ unexplained by the predictors, must be small. It therefore makes sense to seek an\nanalytic approximation to the partition function for large values of $\\tau$. However, the usual approach\nto approximate $e^{-\\tau H(x)}$ by a Gaussian in the vicinity of the minimizer of $H$ and apply a Laplace\napproximation [17] is not feasible, because $H$ is not twice differentiable. Instead we observe that\n$e^{-\\tau H(x)} = e^{-2\\tau f(x)} e^{-2\\tau g(x)}$ where\n\n$$f(x) = \\frac{1}{2} x^T C x - w^T x, \\qquad g(x) = \\mu \\sum_{j=1}^p |x_j|. \\tag{9}$$\n\nUsing Parseval\u2019s identity for Fourier transforms (Appendix A.1), it follows that (Appendix A.3)\n\n$$Z = \\int_{\\mathbb{R}^p} e^{-2\\tau f(x)} e^{-2\\tau g(x)} dx = \\frac{1}{(\\pi\\tau)^{p/2} \\sqrt{\\det(C)}} \\int_{\\mathbb{R}^p} e^{-\\tau (k - iw)^T C^{-1} (k - iw)} \\prod_{j=1}^p \\frac{\\mu}{k_j^2 + \\mu^2} \\, dk. \\tag{10}$$\n\nAfter a change of variables $z = -ik$, $Z$ can be written as a $p$-dimensional complex contour integral\n\n$$Z = \\frac{(-i\\mu)^p}{(\\pi\\tau)^{p/2} \\sqrt{\\det(C)}} \\int_{-i\\infty}^{i\\infty} \\cdots \\int_{-i\\infty}^{i\\infty} e^{\\tau (z - w)^T C^{-1} (z - w)} \\prod_{j=1}^p \\frac{1}{\\mu^2 - z_j^2} \\, dz_1 \\dots dz_p. \\tag{11}$$\n\nCauchy\u2019s theorem [30, 31] states that this integral remains invariant if the integration contours are\ndeformed, as long as we remain in a domain where the integrand does not diverge (Appendix A.4).\nThe analogue of Laplace\u2019s approximation for complex contour integrals, known as the stationary\nphase, steepest descent or saddle point approximation, then states that an integral of the form (11) can\nbe approximated by a Gaussian integral along a steepest descent contour passing through the saddle\npoint of the argument of the exponential function [17]. 
Here, the function $(z - w)^T C^{-1} (z - w)$\nhas a saddle point at $z = w$. If $|w_j| < \\mu$ for all $j$, the standard stationary phase approximation can\nbe applied directly, but this only covers the uninteresting situation where the maximum-likelihood\nsolution $\\hat{x} = \\operatorname{argmin}_x H(x) = 0$ (Appendix A.5). As soon as $|w_j| > \\mu$ for at least one $j$, the standard\nargument breaks down, since to deform the integration contours from the imaginary axes to parallel\ncontours passing through the saddle point $z_0 = w$, we would have to pass through a pole (divergence)\nof the function $\\prod_j (\\mu^2 - z_j^2)^{-1}$ (Figure S1). Motivated by similar, albeit one-dimensional, analyses\nin non-equilibrium physics [32, 33], we instead consider a temperature-dependent function\n\n$$H^*_\\tau(z) = (z - w)^T C^{-1} (z - w) - \\frac{1}{\\tau} \\sum_{j=1}^p \\ln(\\mu^2 - z_j^2), \\tag{12}$$\n\nwhich is well-defined on the domain $\\mathcal{D} = \\{z \\in \\mathbb{C}^p : |\\Re z_j| < \\mu, \\; j = 1, \\dots, p\\}$, where $\\Re$ denotes\nthe real part of a complex number. 
This function has a unique saddle point in $\\mathcal{D}$, regardless of whether\n$|w_j| < \\mu$ or not (Figure S1). Our main result is a steepest descent approximation of the partition\nfunction around this saddle point.\nTheorem 1 Let $C \\in \\mathbb{R}^{p \\times p}$ be a positive definite matrix, $w \\in \\mathbb{R}^p$ and $\\mu > 0$. Then the complex\nfunction $H^*_\\tau$ defined in eq. (12) has a unique saddle point $\\hat{u}_\\tau$ that is real, $\\hat{u}_\\tau \\in \\mathcal{D} \\cap \\mathbb{R}^p$, and is a\nsolution of the set of third order equations\n\n$$(\\mu^2 - u_j^2) [C^{-1}(w - u)]_j - \\frac{u_j}{\\tau} = 0, \\qquad u \\in \\mathbb{R}^p, \\; j \\in \\{1, \\dots, p\\}. \\tag{13}$$\n\nFor $Q(z)$ a complex analytic function of $z \\in \\mathbb{C}^p$, the generalized partition function\n\n$$Z[Q] = \\frac{1}{(\\pi\\tau)^{p/2} \\sqrt{\\det(C)}} \\int_{\\mathbb{R}^p} e^{-\\tau (k - iw)^T C^{-1} (k - iw)} Q(-ik) \\prod_{j=1}^p \\frac{\\mu}{k_j^2 + \\mu^2} \\, dk$$\n\ncan be analytically expressed as\n\n$$Z[Q] = \\Bigl( \\frac{\\mu}{\\sqrt{\\tau}} \\Bigr)^p e^{\\tau (w - \\hat{u}_\\tau)^T C^{-1} (w - \\hat{u}_\\tau)} \\prod_{j=1}^p \\frac{1}{\\sqrt{\\mu^2 + \\hat{u}_{\\tau,j}^2}} \\, \\frac{1}{\\sqrt{\\det(C + D_\\tau)}} \\exp\\Bigl\\{ \\frac{1}{4\\tau^2} \\Delta_\\tau \\Bigr\\} e^{R_\\tau(ik)} Q(\\hat{u}_\\tau + ik) \\Bigr|_{k=0}, \\tag{14}$$\n\nwhere $D_\\tau$ is a diagonal matrix with diagonal elements\n\n$$(D_\\tau)_{jj} = \\frac{\\tau (\\mu^2 - \\hat{u}_{\\tau,j}^2)^2}{\\mu^2 + \\hat{u}_{\\tau,j}^2}, \\tag{15}$$\n\n$\\Delta_\\tau$ is the differential operator\n\n$$\\Delta_\\tau = \\sum_{i,j=1}^p \\bigl[ \\tau D_\\tau (C + D_\\tau)^{-1} C \\bigr]_{ij} \\frac{\\partial^2}{\\partial k_i \\partial k_j} \\tag{16}$$\n\nand\n\n$$R_\\tau(z) = \\sum_{j=1}^p \\sum_{m \\geq 3} \\frac{1}{m} \\Bigl[ \\frac{1}{(\\mu - \\hat{u}_{\\tau,j})^m} + \\frac{(-1)^m}{(\\mu + \\hat{u}_{\\tau,j})^m} \\Bigr] z_j^m. \\tag{17}$$\n\nThis results in an analytic approximation\n\n$$Z[Q] \\sim \\Bigl( \\frac{\\mu}{\\sqrt{\\tau}} \\Bigr)^p e^{\\tau (w - \\hat{u}_\\tau)^T C^{-1} (w - \\hat{u}_\\tau)} \\prod_{j=1}^p \\frac{1}{\\sqrt{\\mu^2 + \\hat{u}_{\\tau,j}^2}} \\, \\frac{Q(\\hat{u}_\\tau)}{\\sqrt{\\det(C + D_\\tau)}}. \\tag{18}$$\n\nThe analytic expression in eq. (14) follows by changing the integration contours to pass through\nthe saddle point $\\hat{u}_\\tau$, and using a Taylor expansion of $H^*_\\tau(z)$ around the saddle point along the\nsteepest descent contour. 
However, because $\\Delta_\\tau$ and $R_\\tau$ depend on $\\tau$, it is not a priori evident\nthat (18) holds. A detailed proof is given in Appendix B. The analytic approximation in eq. (18)\ncan be simplified further by expanding $\\hat{u}_\\tau$ around its leading term, resulting in an expression that\nrecognizably converges to the sparse maximum-likelihood solution (Appendix C). While eq. (18)\nis computationally more expensive to calculate than the corresponding expression in terms of the\nmaximum-likelihood solution, it was found to be numerically more accurate (Section 3).\n\nVarious quantities derived from the posterior distribution can be expressed in terms of generalized\npartition functions. 
The most important of these are the expectation values of the regression coefficients,\nwhich, using elementary properties of the Fourier transform (Appendix A.6), can be expressed\nas\n\n$$E(x) = \\frac{1}{Z} \\int_{\\mathbb{R}^p} x \\, e^{-\\tau H(x)} dx = \\frac{Z[C^{-1}(w - z)]}{Z} \\sim C^{-1}(w - \\hat{u}_\\tau).$$\n\nThe leading term,\n\n$$\\hat{x}_\\tau \\equiv C^{-1}(w - \\hat{u}_\\tau), \\tag{19}$$\n\ncan be interpreted as an estimator for the regression coefficients in its own right, which interpolates\nsmoothly (as a function of $\\tau$) between the ridge regression estimator $\\hat{x}_{\\text{ridge}} = C^{-1} w$ at $\\tau = 0$ and\nthe maximum-likelihood elastic net estimator $\\hat{x} = C^{-1}(w - \\hat{u})$ at $\\tau = \\infty$, where $\\hat{u} = \\lim_{\\tau \\to \\infty} \\hat{u}_\\tau$\nsatisfies a box-constrained optimization problem (Appendix C).\nThe marginal posterior distribution for a subset $I \\subset \\{1, \\dots, p\\}$ of regression coefficients is defined as\n\n$$p(x_I) = \\frac{1}{Z(C, w, \\mu)} \\int_{\\mathbb{R}^{|I^c|}} e^{-\\tau H(x \\mid C, w, \\mu)} dx_{I^c},$$\n\nwhere $I^c = \\{1, \\dots, p\\} \\setminus I$ is the complement of $I$, $|I|$ denotes the size of a set $I$, and we have\nreintroduced temporarily the dependency on $C$, $w$ and $\\mu$ as in eq. (7). A simple calculation shows\nthat the remaining integral is again a partition function of the same form, more precisely:\n\n$$p(x_I) = e^{-\\tau (x_I^T C_I x_I - 2 w_I^T x_I + 2\\mu \\|x_I\\|_1)} \\, \\frac{Z(C_{I^c}, w_{I^c} - x_I^T C_{I,I^c}, \\mu)}{Z(C, w, \\mu)}, \\tag{20}$$\n\nwhere the subscripts $I$ and $I^c$ indicate sub-vectors and sub-matrices on their respective coordinate\nsets. 
Hence the analytic approximation in eq. (14) can be used to approximate numerically each term\nin the partition function ratio and obtain an approximation to the marginal posterior distributions.\nThe posterior predictive distribution [1] for a new sample $a \\in \\mathbb{R}^p$ of predictor data can also be written\nas a ratio of partition functions:\n\n$$p(y) = \\int_{\\mathbb{R}^p} p(y \\mid a, x) \\, p(x \\mid C, w, \\mu) \\, dx = \\Bigl( \\frac{\\tau}{2\\pi n} \\Bigr)^{1/2} e^{-\\frac{\\tau}{2n} y^2} \\, \\frac{Z(C + \\frac{1}{2n} a a^T, w + \\frac{y}{2n} a, \\mu)}{Z(C, w, \\mu)},$$\n\nwhere $C \\in \\mathbb{R}^{p \\times p}$ and $w \\in \\mathbb{R}^p$ are obtained from the training data as before, $n$ is the number of\ntraining samples, and $y \\in \\mathbb{R}$ is the unknown response to $a$ with distribution $p(y)$. Note that\n\n$$E(y) = \\int_{\\mathbb{R}} y \\, p(y) \\, dy = \\int_{\\mathbb{R}^p} \\Bigl[ \\int_{\\mathbb{R}} y \\, p(y \\mid a, x) \\, dy \\Bigr] p(x \\mid C, w, \\mu) \\, dx = \\int_{\\mathbb{R}^p} a^T x \\, p(x \\mid C, w, \\mu) \\, dx = a^T E(x) \\sim a^T \\hat{x}_\\tau. \\tag{21}$$\n\n3 Numerical experiments\n\nTo test the accuracy of the stationary phase approximation, we implemented algorithms to solve the\nsaddle point equations and compute the partition function and marginal posterior distribution, as well\nas an existing Gibbs sampler algorithm [8] in Matlab (see Appendix E for algorithm details, source\ncode available from https://github.com/tmichoel/bayonet/). 
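
As an illustration only, the single-predictor case (p = 1) can be sketched in a few lines of Python. This is not the Matlab code from the repository above: the function names are hypothetical, the saddle point is found by plain bisection rather than by the algorithm of Appendix E, and the exact partition function is obtained here by splitting the integral at x = 0 and completing the square on each half-line, which overflows for large tau. Under those assumptions, the sketch solves eq. (13) for p = 1, evaluates eq. (18) with Q = 1, and compares the result with the exact integral.

```python
import math


def saddle_point_1d(c, w, mu, tau, iters=200):
    # Root of (mu^2 - u^2) * (w - u) / c - u / tau = 0 inside (-mu, mu),
    # i.e. the p = 1 case of the saddle point equations (13).
    def g(u):
        return (mu * mu - u * u) * (w - u) / c - u / tau

    lo, hi = -mu, mu  # g(lo) = mu / tau > 0 and g(hi) = -mu / tau < 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def z_stationary_phase_1d(c, w, mu, tau):
    # Leading-order approximation, eq. (18) with Q = 1 and p = 1.
    u = saddle_point_1d(c, w, mu, tau)
    d = tau * (mu * mu - u * u) ** 2 / (mu * mu + u * u)  # eq. (15)
    prefactor = mu / math.sqrt(tau)
    return prefactor * math.exp(tau * (w - u) ** 2 / c) / (
        math.sqrt(mu * mu + u * u) * math.sqrt(c + d))


def z_exact_1d(c, w, mu, tau):
    # Exact integral of exp(-tau * (c x^2 - 2 w x + 2 mu |x|)) over the real
    # line, split at x = 0 (own derivation, not taken from Appendix D).
    a = tau * c
    half = 0.5 * math.sqrt(math.pi / a)
    b_pos = tau * (w - mu)  # linear coefficient on the half-line x > 0
    b_neg = tau * (w + mu)  # linear coefficient on the half-line x < 0
    right = math.exp(b_pos ** 2 / a) * half * math.erfc(-b_pos / math.sqrt(a))
    left = math.exp(b_neg ** 2 / a) * half * math.erfc(b_neg / math.sqrt(a))
    return right + left


c, w, mu, tau = 1.0, 0.5, 0.2, 50.0
u_hat = saddle_point_1d(c, w, mu, tau)
print(u_hat, (w - u_hat) / c)
print(z_stationary_phase_1d(c, w, mu, tau), z_exact_1d(c, w, mu, tau))
```

For these parameter values the two partition function estimates should agree to within about one percent, and the leading-order estimate (w - u) / C of eq. (19) lies between the maximum-likelihood value 0.3 and the ridge value 0.5.
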
Results were first evaluated\nfor independent predictors (or equivalently, one predictor) and two commonly used data sets: the\n\u201cdiabetes data\u201d ($n = 442$, $p = 10$) [34] and the \u201cleukemia data\u201d ($n = 72$, $p = 3571$) [4] (see\nAppendix F for further experimental details and data sources).\nFirst we tested the rate of convergence in the asymptotic relation (see Appendix C)\n\n$$\\lim_{\\tau \\to \\infty} -\\frac{1}{\\tau} \\log Z = H_{\\min} = \\min_{x \\in \\mathbb{R}^p} H(x).$$\n\nFor independent predictors ($p = 1$), the partition function can be calculated analytically using\nthe error function (Appendix D), and rapid convergence to $H_{\\min}$ is observed (Figure 1a). After\nscaling by the number of predictors $p$, a similar rate of convergence is observed for the stationary\nphase approximation to the partition function for both the diabetes and leukemia data (Figure 1b,c).\nHowever, convergence of the posterior expectation values $\\hat{x}_\\tau$ to the maximum-likelihood coefficients\n$\\hat{x}$, as measured by the $\\ell_\\infty$-norm difference $\\|\\hat{x}_\\tau - \\hat{x}\\|_\\infty = \\max_j |\\hat{x}_{\\tau,j} - \\hat{x}_j|$, is noticeably slower,\nparticularly in the $p \\gg n$ setting of the leukemia data (Figure 1d\u2013f).\n\nFigure 1: Convergence to the minimum-energy solution. Top row: $(-\\frac{1}{\\tau} \\log Z - H_{\\min})/p$ vs. $\\tau$ and\n$\\mu$ for the exact partition function for independent predictors ($p = 1$) (a), and for the stationary phase\napproximation for the diabetes (b) and leukemia (c) data. Bottom row: $\\|\\hat{x}_\\tau - \\hat{x}\\|_\\infty$ for the exact\nexpectation value for independent predictors (d), and using the stationary phase approximation for\nthe diabetes (e) and leukemia (f) data. Parameter values were $C = 1.0$, $w = 0.5$, and $\\mu$ ranging from\n0.05 to 5 in geometric steps (a), and $\\lambda = 0.1$ and $\\mu$ ranging from $0.01 \\mu_{\\max}$ up to, but not including,\n$\\mu_{\\max} = \\max_j |w_j|$ in geometric steps (b,c).\n\nFigure 2: Marginal posterior distributions for the diabetes data ($\\lambda = 0.1$, $\\mu = 0.0397$, $\\tau = 682.3$) (a\u2013c)\nand leukemia data ($\\lambda = 0.1$, $\\mu = 0.1835$, $\\tau = 9943.9$) (d\u2013f). In blue, Gibbs sampling histogram\n($10^4$ samples). In red, stationary phase approximation for the marginal posterior distribution of\nselected predictors. 
In yellow, maximum-likelihood-based approximation for the same distributions.\nThe distributions for a zero, transition and non-zero maximum-likelihood predictor are shown (from\nleft to right). The $*$ symbols on the x-axes indicate the locations of the maximum-likelihood and posterior\nexpectation values.\n\nNext, the accuracy of the stationary phase approximation at finite $\\tau$ was determined by comparing\nthe marginal distributions for single predictors [i.e. where $I$ is a singleton in eq. (20)] to results\nobtained from Gibbs sampling. For simplicity, representative results are shown for specific\nhyper-parameter values (Appendix F.2). Application of the stationary phase approximation resulted in\nmarginal posterior distributions which were indistinguishable from those obtained by Gibbs sampling\n(Figure 2). In view of the convergence of the log-partition function to the minimum-energy value\n(Figure 1), an approximation to eq. 
(20) of the form\n\n$$p(x_I) \\approx e^{-\\tau (x_I^T C_I x_I - 2 w_I^T x_I + 2\\mu \\|x_I\\|_1)} \\, e^{-\\tau [H_{\\min}(C_{I^c}, w_{I^c} - x_I^T C_{I,I^c}, \\mu) - H_{\\min}(C, w, \\mu)]} \\tag{22}$$\n\nwas also tested. However, while eq. (22) is indistinguishable from eq. (20) for predictors with zero\neffect size in the maximum-likelihood solution, it resulted in distributions that were squeezed towards\nzero for transition predictors, and often wildly inaccurate for non-zero predictors (Figure 2). This\nis because eq. (22) is maximized at $x_I = \\hat{x}_I$, the maximum-likelihood value, whereas for non-zero\ncoordinates, eq. (20) is (approximately) symmetric around its expectation value $E(x_I) = \\hat{x}_{\\tau,I}$. Hence,\naccurate estimation of the marginal posterior distributions requires using the full stationary phase\napproximations [eq. (18)] to the partition functions in eq. (20).\nThe stationary phase approximation can be particularly advantageous in prediction problems, where\nthe response value $\\hat{y} \\in \\mathbb{R}$ for a newly measured predictor sample $a \\in \\mathbb{R}^p$ is obtained using regression\ncoefficients learned from training data $(y_t, A_t)$. In Bayesian inference, $\\hat{y}$ is set to the expectation\nvalue of the posterior predictive distribution, $\\hat{y} = E(y) = a^T \\hat{x}_\\tau$ [eq. (21)]. Computation of the\nposterior expectation values $\\hat{x}_\\tau$ [eq. 
(19)] using the stationary phase approximation requires solving\nonly one set of saddle point equations, and hence can be performed ef\ufb01ciently across a range of\nhyper-parameter values, in contrast to Gibbs sampling, where the full posterior needs to be sampled\neven if only expectation values are needed.\nTo illustrate how this bene\ufb01ts large-scale applications of the Bayesian elastic net, its prediction\nperformance was compared to state-of-the-art Gibbs sampling implementations of Bayesian horseshoe\nand Bayesian lasso regression [35], as well as to maximum-likelihood elastic net and ridge regression,\nusing gene expression and drug sensitivity data for 17 anticancer drugs in 474 human cancer cell lines\nfrom the Cancer Cell Line Encyclopedia [36]. Ten-fold cross-validation was used, using p = 1000 pre-\nselected genes and 427 samples for training regression coef\ufb01cients and 47 for validating predictions\nin each fold. To obtain unbiased predictions at a single choice for the hyper-parameters, \u00b5 and\n\u03c4 were optimized over a 10 \u00d7 13 grid using an additional internal 10-fold cross-validation on the\ntraining data only (385 samples for training, 42 for testing); BayReg\u2019s lasso and horseshoe methods\nsample hyper-parameter values from their posteriors and do not require an additional cross-validation\nloop (see Appendix F.3 for complete experimental details and data sources). Despite evaluating\na much greater number of models (in each cross-validation fold, 10\u00d7 cross-validation over 130\nhyper-parameter combinations vs. 1 model per fold), the overall computation time was still much\nlower than BayReg\u2019s Gibbs sampling approach (on average 30 sec. per fold, i.e. 0.023 sec. per model,\nvs. 44 sec. per fold for BayReg). 
In terms of predictive performance, Bayesian methods tended to\nperform better than maximum-likelihood methods, in particular for the most \u2018predictable\u2019 responses,\nwith little variation between the three Bayesian methods (Figure 3a).\nWhile the difference in optimal performance between Bayesian and maximum-likelihood elastic net\nwas not always large, Bayesian elastic net tended to be optimized at larger values of $\\mu$ (i.e. at sparser\nmaximum-likelihood solutions), and at these values the performance improvement over\nmaximum-likelihood elastic net was more pronounced (Figure 3b). As expected, $\\tau$ acts as a tuning parameter\nthat allows the solution to vary smoothly from the maximum-likelihood solution at large $\\tau$ (here, $\\tau \\sim 10^6$) to the\nsolution with best cross-validation performance (here, $\\tau \\sim 10^3$\u2013$10^4$) (Figure 3c). The improved\nperformance at sparsity-inducing values of $\\mu$ suggests that the Bayesian elastic net is uniquely able to\nidentify the dominant predictors for a given response (the non-zero maximum-likelihood coefficients),\nwhile still accounting for the cumulative contribution of predictors with small effects. Comparison\nwith the unpenalized ($\\mu = 0$) ridge regression coefficients shows that the Bayesian expectation\nvalues are strongly shrunk towards zero, except for the non-zero maximum-likelihood coefficients,\nwhich remain relatively unchanged (Figure 3d), resulting in a double-exponential distribution for\nthe regression coefficients. This contrasts with ridge regression, where regression coefficients are\nnormally distributed leading to over-estimation of small effects, and maximum-likelihood elastic net,\nwhere small effects become identically zero and don\u2019t contribute to the predicted value at all.\n\nFigure 3: Predictive accuracy on the Cancer Cell Line Encyclopedia. a. Median correlation coefficient\nbetween predicted and true drug sensitivities over 10-fold cross-validation, using Bayesian posterior\nexpectation values from the analytic approximation for elastic net (red) and from BayReg\u2019s lasso\n(blue) and horseshoe (yellow) implementations, and maximum-likelihood elastic net (purple) and\nridge regression (green) values for the regression coefficients. 
See main text for details on\nhyper-parameter optimization. b. Median 10-fold cross-validation value for the correlation coefficient\nbetween predicted and true sensitivities for the compound PD-0325901 vs. $\\mu$, for the Bayesian elastic\nnet at optimal $\\tau$ (red), maximum-likelihood elastic net (blue) and ridge regression (dashed). c. Median\n10-fold cross-validation value for the correlation coefficient between predicted and true sensitivities\nfor PD-0325901 for the Bayesian elastic net vs. $\\tau$ and $\\mu$; the black dots show the overall maximum\nand the ML maximum. d. Scatter plot of expected regression coefficients in the Bayesian elastic net\nfor PD-0325901 at $\\mu = 0.055$ and optimal $\\tau = 3.16 \\cdot 10^3$ vs. ridge regression coefficient estimates;\ncoefficients with non-zero maximum-likelihood elastic net value at the same $\\mu$ are indicated in red.\nSee Supp. Figures S2 and S3 for the other 16 compounds.\n\n4 Conclusions\n\nThe application of Bayesian methods to infer expected effect sizes and marginal posterior distributions\nin $\\ell_1$-penalized models has so far required the use of computationally expensive Gibbs sampling\nmethods. Here it was shown that highly accurate inference in these models is also possible using\nan analytic stationary phase approximation to the partition function integrals. This approximation\nexploits the fact that the Fourier transform of the non-differentiable double-exponential prior\ndistribution is a well-behaved exponential of a log-barrier function, which is intimately related to the\nLegendre-Fenchel transform of the $\\ell_1$-penalty term. 
Thus, the Fourier transform plays the same role\nfor Bayesian inference problems as convex duality plays for maximum-likelihood approaches.\n\n[Figure 3 panels: (a) compounds PD-0325901, Topotecan, AZD6244, Paclitaxel, Panobinostat, Lapatinib, Sorafenib, TKI258, 17-AAG, AEW541, TAE684, Erlotinib, PF2341066, ZD-6474, AZD0530, Nutlin-3 and LBW242, with legend entries BAY Elastic Net (analytic approx), BAYREG Lasso, BAYREG Horseshoe, ML Elastic Net and ML Ridge; (b) legend entries BAY Elastic Net, ML Elastic Net and ML Ridge; axes labelled corr(ytrue, ypred) in (a\u2013c) and xRidge, xBayELN in (d).]\n\nFor simplicity, we have focused on the linear regression model, where the invariance of multivariate\nnormal distributions under the Fourier transform greatly facilitates the analytic derivations.\nPreliminary work shows that the results can probably be extended to generalized linear models (or any\nmodel with convex energy function) with $\\ell_1$ penalty, using the argument sketched in Appendix A.2.\nIn such models, the predictor correlation matrix $C$ will need to be replaced by the Hessian matrix\nof the energy function evaluated at the saddle point. Numerically, this will require updates of the\nHessian during the coordinate descent algorithm for solving the saddle point equations. How to\nbalance the accuracy of the approximation and the frequency of the Hessian updates will require\nfurther in-depth investigation. 
In principle, the same analysis can also be performed using other\nnon-twice-differentiable sparse penalty functions, but if their Fourier transform is not known analytically,\nor not twice differentiable either, the analysis and implementation will become more complicated\nstill.\nA limitation of the current approach may be that values of the hyper-parameters need to be specified\nin advance, whereas in complete hierarchical models, these are subject to their own prior distributions.\nIncorporation of such priors will require careful attention to the interchange between taking the limit\nof and integrating over the inverse temperature parameter. However, in many practical situations the\n\u21131- and \u21132-penalty parameters are pre-determined by cross-validation. Setting the residual variance\nparameter to its maximum a-posteriori value then allows one to evaluate the maximum-likelihood solution\nin the context of the posterior distribution of which it is the mode [8]. Alternatively, if the posterior\nexpectation values of the regression coefficients are used instead of their maximum-likelihood values\nto predict unmeasured responses, the optimal inverse-temperature parameter can be determined by\nstandard cross-validation on the training data, as in the drug response prediction experiments.\nNo attempt was made to optimize the efficiency of the coordinate descent algorithm to solve the\nsaddle point equations. However, comparison to the Gibbs sampling algorithm shows that one cycle\nthrough all coordinates in the coordinate descent algorithm is approximately equivalent to one cycle\nin the Gibbs sampler, i.e. to adding one more sample. The coordinate descent algorithm typically\nconverges in 5-10 cycles starting from the maximum-likelihood solution, and 1-2 cycles when starting\nfrom a neighbouring solution in the estimation of marginal distributions. 
In contrast, Gibbs sampling\ntypically requires 10^3-10^5 coordinate cycles to obtain stable distributions. Hence, if only the posterior\nexpectation values or the posterior distributions for a limited number of coordinates are sought, the\ncomputational advantage of the stationary phase approximation is vast. On the other hand, each\nevaluation of the marginal distribution functions requires the solution of a separate set of saddle point\nequations. Hence, computing these distributions for all predictors at a very large number of points\nwith the current algorithm could become as expensive as Gibbs sampling.\nIn summary, expressing intractable partition function integrals as complex-valued oscillatory integrals\nthrough the Fourier transform is a powerful approach for performing Bayesian inference in the lasso\nand elastic net regression models, and \u21131-penalized models more generally. Use of the stationary\nphase approximation to these integrals results in highly accurate estimates for the posterior expectation\nvalues and marginal distributions at a much reduced computational cost compared to Gibbs sampling.\n\nAcknowledgments\n\nThis research was supported by the BBSRC (grant numbers BB/J004235/1 and BB/M020053/1).\n\nReferences\n\n[1] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning.\nSpringer series in statistics. Springer, Berlin, 2001.\n\n[2] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal\nproblems. Technometrics, 12(1):55\u201367, 1970.\n\n[3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical\nSociety. Series B (Methodological), pages 267\u2013288, 1996.\n\n[4] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the\nRoyal Statistical Society: Series B (Statistical Methodology), 67(2):301\u2013320, 2005.\n\n[5] Trevor Park and George Casella. The Bayesian lasso. 
Journal of the American Statistical\nAssociation, 103(482):681\u2013686, 2008.\n\n[6] Chris Hans. Bayesian lasso regression. Biometrika, pages 835\u2013845, 2009.\n\n[7] Qing Li, Nan Lin, et al. The Bayesian elastic net. Bayesian Analysis, 5(1):151\u2013170, 2010.\n\n[8] Chris Hans. Elastic net regression modeling with the orthant normal prior. Journal of the\nAmerican Statistical Association, 106(496):1383\u20131393, 2011.\n\n[9] Jun S Liu. Monte Carlo strategies in scientific computing. Springer, 2004.\n\n[10] Himel Mallick and Nengjun Yi. Bayesian methods for high dimensional linear models. Journal\nof Biometrics & Biostatistics, 1:005, 2013.\n\n[11] Bala Rajaratnam and Doug Sparks. Fast Bayesian lasso for high-dimensional regression. arXiv\npreprint arXiv:1509.03697, 2015.\n\n[12] Bala Rajaratnam and Doug Sparks. MCMC-based inference in the era of big data: A fundamental\nanalysis of the convergence complexity of high-dimensional chains. arXiv preprint\narXiv:1508.00947, 2015.\n\n[13] Robert E Kass and Duane Steffey. Approximate Bayesian inference in conditionally independent\nhierarchical models (parametric empirical Bayes models). Journal of the American Statistical\nAssociation, 84(407):717\u2013726, 1989.\n\n[14] H\u00e5vard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent\nGaussian models by using integrated nested Laplace approximations. Journal of the Royal\nStatistical Society: Series B (Statistical Methodology), 71(2):319\u2013392, 2009.\n\n[15] Sumio Watanabe. A widely applicable Bayesian information criterion. Journal of Machine\nLearning Research, 14(Mar):867\u2013897, 2013.\n\n[16] Mathias Drton and Martyn Plummer. A Bayesian information criterion for singular models.\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):323\u2013380,\n2017.\n\n[17] Roderick Wong. Asymptotic approximation of integrals, volume 34. SIAM, 2001.\n\n[18] Henry E Daniels. 
Saddlepoint approximations in statistics. The Annals of Mathematical\nStatistics, pages 631\u2013650, 1954.\n\n[19] Grigory Lazarevich Litvinov. The Maslov dequantization, idempotent and tropical mathematics:\na brief introduction. arXiv preprint math/0507014, 2005.\n\n[20] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.\n\n[21] R Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.\n\n[22] M.R. Osborne, B. Presnell, and B.A. Turlach. On the lasso and its dual. Journal of\nComputational and Graphical Statistics, 9(2):319\u2013337, 2000.\n\n[23] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection\nin least squares problems. IMA Journal of Numerical Analysis, 20(3):389\u2013403, 2000.\n\n[24] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination for the lasso and sparse\nsupervised learning problems. Pacific Journal of Optimization, 8:667\u2013698, 2012.\n\n[25] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. Strong\nrules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society:\nSeries B (Statistical Methodology), 74(2):245\u2013266, 2012.\n\n[26] R.J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics,\n7:1456\u20131490, 2013.\n\n[27] Tom Michoel. Natural coordinate descent algorithm for L1-penalised regression in generalised\nlinear models. Computational Statistics & Data Analysis, 97:60\u201370, 2016.\n\n[28] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models\nvia coordinate descent. Journal of Statistical Software, 33(1):1, 2010.\n\n[29] Barbara Rakitsch, Christoph Lippert, Oliver Stegle, and Karsten Borgwardt. A lasso multi-marker\nmixed model for association mapping with population structure correction. Bioinformatics,\n29(2):206\u2013214, 2012.\n\n[30] Serge Lang. 
Complex Analysis, volume 103 of Graduate Texts in Mathematics. Springer, 2002.\n\n[31] Volker Scheidemann. Introduction to Complex Analysis in Several Variables. Birkh\u00e4user\nVerlag, 2005.\n\n[32] R Van Zon and EGD Cohen. Extended heat-fluctuation theorems for a system with deterministic\nand stochastic forces. Physical Review E, 69(5):056121, 2004.\n\n[33] Jae Sung Lee, Chulan Kwon, and Hyunggyu Park. Modified saddle-point integral near a\nsingularity for the large deviation function. Journal of Statistical Mechanics: Theory and\nExperiment, 2013(11):P11002, 2013.\n\n[34] B. Efron, T. Hastie, I.M. Johnstone, and R. Tibshirani. Least angle regression. The Annals of\nStatistics, 32(2):407\u2013499, 2004.\n\n[35] Enes Makalic and Daniel F Schmidt. High-dimensional Bayesian regularised regression with\nthe BayesReg package. arXiv:1611.06649, 2016.\n\n[36] J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A.A. Margolin, S. Kim, C.J. Wilson,\nJ. Leh\u00e1r, G.V. Kryukov, D. Sonkin, et al. The Cancer Cell Line Encyclopedia enables predictive\nmodelling of anticancer drug sensitivity. Nature, 483(7391):603\u2013607, 2012.\n", "award": [], "sourceid": 1467, "authors": [{"given_name": "Tom", "family_name": "Michoel", "institution": "University of Bergen"}]}