{"title": "Dual-Space Analysis of the Sparse Linear Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1745, "page_last": 1753, "abstract": "Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from Wipf et al. (2011), which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular L1 reconstructions. 
It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations.", "full_text": "Dual-Space Analysis of the Sparse Linear Model

David Wipf and Yi Wu
Visual Computing Group, Microsoft Research Asia
davidwipf@gmail.com, jxwuyi@gmail.com

Abstract

Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transferring natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II.
For example, this allows us to prove that Type II-inspired techniques can succeed in recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations.

1 Introduction

We begin with the likelihood model

y = \Phi x + \epsilon,   (1)

where Φ ∈ R^{n×m} is a dictionary of unit ℓ2-norm basis vectors, x ∈ R^m is a vector of unknown coefficients we would like to estimate, y ∈ R^n is the observed signal, and ε is noise distributed as N(ε; 0, λI) (later we consider more general likelihood models). In many practical situations where large numbers of features are present relative to the signal dimension, the problem of estimating x given y becomes ill-posed. A Bayesian framework is intuitively appealing for formulating these types of problems because prior assumptions must be incorporated, whether explicitly or implicitly, to regularize the solution space.

Recently, there has been growing interest in models that employ sparse priors p(x) to encourage solutions x with mostly small or zero-valued coefficients and a few large or unrestricted values, i.e., we are assuming the generative x is a sparse vector. Such solutions can be favored by using

p(x) \propto \prod_i \exp\left[-\frac{1}{2} g(x_i)\right] = \prod_i \exp\left[-\frac{1}{2} h(x_i^2)\right],   (2)

with h concave and non-decreasing on [0, ∞) [15, 16]. Virtually all sparse priors of interest can be expressed in this manner, including the popular Laplacian, Jeffreys, Student's t, and generalized Gaussian distributions.
Roughly speaking, the 'more concave' h, the more sparse we expect x to be. For example, with h(z) = z, we recover a Gaussian, which is not sparse at all, while h(z) = √z gives a Laplacian distribution, with characteristic heavy tails and a sharp peak at zero.

All sparse priors of the form (2) can be conveniently framed in terms of a collection of non-negative latent variables or hyperparameters γ ≜ [γ1, …, γm]^T for purposes of optimization, approximation, and/or inference. The hyperparameters dictate the structure of the prior via

p(x) = \prod_i p(x_i), \quad p(x_i) = \max_{\gamma_i \geq 0} \mathcal{N}(x_i; 0, \gamma_i)\,\varphi(\gamma_i),   (3)

where φ(γi) is some non-negative function that is sometimes treated as a hyperprior, although it will not generally integrate to one. For the purpose of obtaining sparse point estimates of x, which will be our primary focus herein, models with latent variable sparse priors are frequently handled in one of two ways. First, the latent structure afforded by (3) offers a very convenient means of obtaining (possibly local) maximum a posteriori (MAP) estimates of x by iteratively solving

x^{(I)} = \arg\min_x -\log p(y|x)p(x) = \arg\min_{x;\, \gamma \succeq 0} \|y - \Phi x\|_2^2 + \lambda \sum_i \left[ \frac{x_i^2}{\gamma_i} + \log \gamma_i + f(\gamma_i) \right],   (4)

where f(γi) ≜ −2 log φ(γi) and x^{(I)} is commonly referred to as a Type I estimator.
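To make the iterative scheme behind (4) concrete, note that when f(γ) = 0 the inner minimization over each γi of x_i²/γi + log γi is attained in closed form at γi = x_i², while for fixed γ the x-step is a weighted ridge regression. The following is a minimal, hypothetical sketch of this alternation (a FOCUSS-style update under the stated f(γ) = 0 assumption; function and variable names are ours, not from the paper):

```python
import numpy as np

def type1_map(y, Phi, lam, iters=50, eps=1e-10):
    """Alternating minimization of the Type I cost (4) with f(gamma) = 0.

    gamma-step: min_{g > 0} x_i^2/g + log g  is attained at  g = x_i^2.
    x-step:     weighted ridge regression, written in its stable
                n x n form  x = G Phi^T (lam I + Phi G Phi^T)^{-1} y.
    """
    n, m = Phi.shape
    gamma = np.ones(m)
    for _ in range(iters):
        G = np.diag(gamma)
        x = G @ Phi.T @ np.linalg.solve(lam * np.eye(n) + Phi @ G @ Phi.T, y)
        gamma = x ** 2 + eps  # eps keeps gamma positive for numerical stability
    return x

# toy usage: a 2-sparse vector is typically recovered in the noiseless regime
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 40))
Phi /= np.linalg.norm(Phi, axis=0)        # unit l2-norm columns, as in (1)
x_true = np.zeros(40)
x_true[[3, 17]] = [1.0, -2.0]
y = Phi @ x_true
x_hat = type1_map(y, Phi, lam=1e-6)
```

With λ small and noiseless y, the iteration typically converges to a sparse local minimum; no global guarantee is implied.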
Examples include minimum ℓp-norm approaches [4, 11, 16], Jeffreys prior-based methods sometimes called FOCUSS [7, 6, 9], algorithms for computing the basis pursuit (BP) or Lasso solution [6, 16, 18], and iterative reweighted ℓ1 methods [3].

Secondly, instead of maximizing over both x and γ as in (4), Type II methods first integrate out (marginalize) the unknown x and then solve the empirical Bayesian problem [19]

\gamma^{(II)} = \arg\max_\gamma p(\gamma|y) = \arg\max_\gamma \int p(y|x) \prod_i \mathcal{N}(x_i; 0, \gamma_i)\,\varphi(\gamma_i)\, dx
             = \arg\min_\gamma\; y^T \Sigma_y^{-1} y + \log|\Sigma_y| + \sum_{i=1}^m f(\gamma_i),   (5)

where Σ_y ≜ λI + ΦΓΦ^T and Γ ≜ diag[γ]. Once γ^{(II)} is obtained, the conditional distribution p(x|y; γ^{(II)}) is Gaussian, and a point estimate for x naturally emerges as the posterior mean

x^{(II)} = E[x|y; \gamma^{(II)}] = \Gamma^{(II)} \Phi^T \left( \lambda I + \Phi \Gamma^{(II)} \Phi^T \right)^{-1} y.   (6)

Pertinent examples include sparse Bayesian learning and the relevance vector machine (RVM) [19], automatic relevance determination (ARD) [14], methods for learning overcomplete dictionaries [8], and large-scale experimental design [17].

While initially these two approaches may seem vastly different, both can be directly compared using a dual-space view [22] of the underlying cost functions. In brief, this involves expressing both the Type I and Type II objectives solely in terms of either x or γ, as reviewed in Section 2. The dual-space view is advantageous for several reasons, such as establishing connections between algorithms, developing efficient update rules, or handling more general (non-Gaussian) likelihood functions.
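For reference, the posterior mean (6) amounts to a single n × n linear solve once γ^{(II)} is in hand. A minimal numpy sketch (variable names are ours, not the paper's):

```python
import numpy as np

def type2_posterior_mean(y, Phi, gamma, lam):
    """Posterior mean x = Gamma Phi^T (lam I + Phi Gamma Phi^T)^{-1} y from (6).

    Works in the n x n space, which is cheap when n << m.
    """
    n = Phi.shape[0]
    Sigma_y = lam * np.eye(n) + (Phi * gamma) @ Phi.T  # lam I + Phi Gamma Phi^T
    return gamma * (Phi.T @ np.linalg.solve(Sigma_y, y))

# toy usage with an arbitrary fixed gamma
rng = np.random.default_rng(1)
Phi = rng.standard_normal((10, 25))
gamma = rng.random(25) + 0.1   # keep variances bounded away from zero
y = rng.standard_normal(10)
x2 = type2_posterior_mean(y, Phi, gamma, lam=0.1)
```

By the matrix inversion lemma this equals the equivalent m × m form (Φ^TΦ + λΓ^{-1})^{-1}Φ^T y, which is preferable when m < n.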
In Section 3, we utilize γ-space cost functions to develop a principled method for choosing the trade-off parameter λ (which accompanies the Gaussian likelihood model and essentially balances sparsity and data fit) and demonstrate its effectiveness via simulations. Section 4 then derives a new Type II-inspired algorithm in x-space that can compute maximally sparse (minimal ℓ0-norm) solutions even with highly coherent dictionaries, proving a result for clustered dictionaries that previously had only been shown empirically [21]. Finally, Section 5 leverages duality to address Type II methods with generalized likelihood functions that previously were rendered untenable because of intractable integrals. In general, some tasks and analyses are easier to undertake in γ-space (Section 3), while others are more transparent in x-space (Sections 4 and 5). Here we consider both with the goal of advancing the proper understanding and full utilization of the sparse linear model.

2 Dual-Space View of the Sparse Linear Model

Type I is based on a natural cost function in x-space, p(x|y), while Type II involves an analogous function in γ-space, p(γ|y). The dual-space view defines a corresponding γ-space cost function for Type I and an x-space cost function for Type II to complete the symmetry.

Type II in x-Space: Using the relationship

y^T \Sigma_y^{-1} y = \min_x \frac{1}{\lambda} \|y - \Phi x\|_2^2 + x^T \Gamma^{-1} x   (7)

as in [22], it can be shown that the Type II coefficients from (6) satisfy x^{(II)} = \arg\min_x L^{(II)}(x), where

L^{(II)}(x) \triangleq \|y - \Phi x\|_2^2 + \lambda g^{(II)}(x),   (8)

and

g^{(II)}(x) \triangleq \min_{\gamma \succeq 0} \sum_i \frac{x_i^2}{\gamma_i} + \log|\Sigma_y| + \sum_i f(\gamma_i).   (9)

This reformulation of Type II in x-space is revealing for multiple reasons (Sections 4 and 5 will address additional reasons in detail).
For many applications of the sparse linear model, the primary goal is simply a point estimate that exhibits some degree of sparsity, meaning many elements of x̂ near zero and a few relatively large coefficients. This requires a penalty function g(x) that is concave and non-decreasing in x² ≜ [x₁², …, x_m²]^T. In the context of Type I, any prior p(x) expressible via (2) will satisfy this condition by definition; such priors are said to be strongly super-Gaussian and will always have positive kurtosis [15]. Regarding Type II, because the associated x-space penalty (9) is represented as a minimum of upper-bounding hyperplanes with respect to x² (and the slopes are all non-negative given γ ⪰ 0), it must therefore be concave and non-decreasing in x² [1].

For compression, interpretability, or other practical reasons, it is sometimes desirable to have exactly sparse point estimates, with many (or most) elements of x equal to exactly zero. This then necessitates a penalty function g(x) that is concave and non-decreasing in |x| ≜ [|x₁|, …, |x_m|]^T, a much stronger condition. In the case of Type I, if log γ + f(γ) is concave and non-decreasing in γ, then g(x) = Σ_i g(x_i) satisfies this condition. The Type II analog, which emerges by further inspection of (9), stipulates that if

\log|\Sigma_y| + \sum_i f(\gamma_i) = \log\left| \lambda^{-1}\Phi^T \Phi + \Gamma^{-1} \right| + \log|\Gamma| + \sum_i f(\gamma_i)   (10)

is a concave and non-decreasing function of γ, then g^{(II)}(x) will be a concave, non-decreasing function of |x|. For this purpose it is sufficient, but not necessary, that f be a concave and non-decreasing function. Note that this is a somewhat stronger criterion than for Type I, since the first term on the righthand side of (10) (which is absent from Type I) is actually convex in γ.
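The variational penalty (9) can also be evaluated numerically. Setting the derivative of its objective with respect to γi to zero (taking f = 0) gives −x_i²/γi² + φ_{·i}^T Σ_y^{-1} φ_{·i} = 0, i.e., γi = |x_i| / (φ_{·i}^T Σ_y^{-1} φ_{·i})^{1/2}, which can be iterated to (approximately) solve the inner minimization. A minimal sketch under the f(γ) = 0 assumption (names are ours):

```python
import numpy as np

def g_type2(x, Phi, lam, iters=100):
    """Numerically evaluate the Type II penalty g^(II)(x) from (9) with f = 0.

    Iterates the stationary condition
        gamma_i = |x_i| / sqrt(phi_i^T Sigma_y^{-1} phi_i),
    where Sigma_y = lam I + Phi Gamma Phi^T, then returns the attained objective.
    """
    n, m = Phi.shape
    gamma = np.abs(x) + 1e-6
    for _ in range(iters):
        Sigma_y = lam * np.eye(n) + (Phi * gamma) @ Phi.T
        # q_i = phi_i^T Sigma_y^{-1} phi_i for every column i
        q = np.einsum('ij,ji->i', Phi.T, np.linalg.solve(Sigma_y, Phi))
        gamma = np.abs(x) / np.sqrt(q) + 1e-12
    Sigma_y = lam * np.eye(n) + (Phi * gamma) @ Phi.T
    return np.sum(x ** 2 / gamma) + np.linalg.slogdet(Sigma_y)[1]

# toy usage; the penalty should be non-decreasing in |x|, per the discussion above
rng = np.random.default_rng(3)
Phi = rng.standard_normal((5, 12))
Phi /= np.linalg.norm(Phi, axis=0)
x = rng.standard_normal(12)
val = g_type2(x, Phi, lam=0.1)
```

This is only a fixed-point heuristic for the inner problem, not a certified global minimization; it suffices to illustrate the shape of g^{(II)}.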
Regardless, it is now very transparent how Type II may promote sparsity akin to Type I.

The dual-space view also leads to efficient, convergent algorithms such as iterative reweighted ℓ1 minimization and its variants, as discussed in [22]. However, building on these ideas, we can demonstrate here that it also elucidates the original, widely applied update procedures developed for implementing the relevance vector machine (RVM), a popular Type II method for regression and classification that assumes f(γ) = 0 [19]. In fact these updates, which were inspired by a fixed-point heuristic from [12], have been widely used for a number of Bayesian inference tasks without any formal analyses or justification.¹ The dual-space formulation can be leveraged to show that these updates are in fact executing a coordinate-wise, iterative min-max procedure in search of a saddle point. Specifically we have the following result (all proofs are in the supplementary material):

Theorem 1. The original RVM update rule from [19, Equation (16)] is equivalent to a closed-form, coordinate-wise optimization of

\min_{x;\, \gamma \succeq 0}\; \max_{z \succeq 0} \left[ \|y - \Phi x\|_2^2 + \sum_i \left( \frac{x_i^2}{\gamma_i} + z_i \log \gamma_i \right) - \vartheta(z) \right]   (11)

over x, γ, and z, where ϑ(z) is the convex conjugate function [1] of log|λI + Φ diag[exp(u)] Φ^T| with respect to u.

¹Although a more recent, step-wise variant of the RVM has been shown to be substantially faster [20], the original version is still germane since it can easily be extended to handle more general structured sparsity problems. The step-wise method cannot without introducing additional approximations [10].

Type I in γ-Space: Similar methodology and the expansion of y^T Σ_y^{-1} y can be used to express the Type I optimization problem in γ-space, which serves several useful purposes.
Let γ^{(I)} ≜ arg min_{γ⪰0} L^{(I)}(γ), with

L^{(I)}(\gamma) \triangleq y^T \Sigma_y^{-1} y + \log|\Gamma| + \sum_{i=1}^m f(\gamma_i).   (12)

Then the Type I coefficients obtained from (4) satisfy

x^{(I)} = \Gamma^{(I)} \Phi^T \left( \lambda I + \Phi \Gamma^{(I)} \Phi^T \right)^{-1} y.   (13)

Section 3 will use γ-space cost functions to derive well-motivated approaches for learning the trade-off parameter λ.

3 Choosing the Trade-off Parameter λ

The trade-off parameter is crucial for obtaining good estimates of x. In general, if λ is too large, x̂ → 0; too small, and x̂ is overfitted to the noise. In practice, either expensive cross-validation or some heuristic procedure is often required. However, because λ can be interpreted as a variance, it is useful to address its estimation in γ-space, in which the existing unknowns (i.e., γ) are also variances.

Learning λ with Type I: Consider the Type I cost function L^{(I)}(γ). The data-dependent term can be shown to be a convex, non-increasing function of γ, which encourages each element to be large. The second term is a penalty factor that regulates the size of γ. It is here that a convenient regularizer for λ can be incorporated.

This can be accomplished as follows. First we expand Σ_y via Σ_y = Σ_{i=1}^m γ_i φ_{·i} φ_{·i}^T + Σ_{j=1}^n λ e_j e_j^T, where φ_{·i} denotes the i-th column of Φ and e_j is a column vector of zeros with a '1' in the j-th location. Thus we observe that λ is embedded in the data-dependent term in the exact same fashion as each γ_i.
This motivates a penalty on λ with similar correspondence, leading to the objective

L^{(I)}(\gamma, \lambda) \triangleq y^T \Sigma_y^{-1} y + \sum_{i=1}^m \left[ \log \gamma_i + f(\gamma_i) \right] + \sum_{j=1}^n \left[ \log \lambda + f(\lambda) \right]
                         = y^T \Sigma_y^{-1} y + \sum_{i=1}^m \left[ \log \gamma_i + f(\gamma_i) \right] + n \log \lambda + n f(\lambda).   (14)

While admittedly simple, this construction is appealing because, regardless of how each γ_i is penalized, λ is penalized in a proportional manner, so both γ and λ have a properly balanced chance of explaining the observed data. This is important because the optimal λ will be highly dependent on both the true noise level and, crucially, the particular sparse prior p(x) assumed (as reflected by f).

For analysis or implementational purposes, we may convert L^{(I)}(γ, λ) back to x-space, with the λ-dependency now removed. It can then be shown that solving (4), with λ fixed to the value that minimizes (14), is equivalent to solving

\min_{x, u} \sum_i g(x_i) + n\, g\!\left( \frac{1}{\sqrt{n}} \|u\|_2 \right), \quad \text{s.t. } y = \Phi x + u.   (15)

If x* and u* minimize (15), then we can demonstrate using [15] that the corresponding λ estimate, which also minimizes (14), is given by λ* = ∂h(z)/∂z evaluated at z = (1/n)‖u*‖₂². Note that if we were just performing maximum likelihood estimation of λ given x*, the optimal value would reduce to simply λ* = (1/n)‖u*‖₂², with no influence from the prior on x.
This is a fundamental weakness.

Solving (15), or equivalently (14), can be accomplished using simple iterative reweighted least squares or, if g is concave in |x_i|, an iterative reweighted second-order-cone (SOC) minimization.

Learning λ with Type II: The same procedure can be adopted for Type II, yielding the cost function

L^{(II)}(\gamma, \lambda) = y^T \Sigma_y^{-1} y + \log|\Sigma_y| + \sum_i f(\gamma_i) + n f(\lambda),   (16)

where we note that, unlike in the Type I case above, the log-based term is already naturally balanced between λ and γ by virtue of the symmetric embedding in Σ_y. It is important to stress that this Type II prescription for learning λ is not the same as originally proposed in the literature for Type II models of this genre. In that context, φ(γ_i) is interpreted as a hyperprior on γ_i, and an equivalent distribution is assumed on the noise variance λ. Importantly, these assumptions leave out the factor of n in (16), and so an asymmetry is created.

Simulation Examples: Empirical tests help to illustrate the efficacy of this procedure. As in many applications of sparse reconstruction, here we are only concerned with accurately estimating x, whose nonzero entries may have physical significance (e.g., source localization [16], compressive sensing [2], etc.), as opposed to predicting new values of y. Therefore, automatically learning the value of λ is particularly relevant, since cross-validation is often not possible.² Simulations are helpful for evaluation purposes since we then have access to the true sparse generating vector.

Figure 1 compares the estimation performance obtained by minimizing (15) with two different selections for g: g(x) = ‖x‖_p^p = Σ_i |x_i|^p, with p = 0.01 and p = 1.0.
Data generation proceeds as follows: we create a random 100 × 50 dictionary Φ with ℓ2-normalized, iid Gaussian columns. x is randomly generated with 10 unit Gaussian nonzero elements. We then compute y = Φx + ε, where ε is iid Gaussian noise producing an SNR of 0 dB. To determine which λ values lead to optimal performance, we solve (4) with the appropriate g over a range of fixed λ values (10⁻⁴ to 10¹) and then compute the error between x and x̂. The minimum of this curve reflects the best performance we can hope to achieve when learning λ blindly. In Figure 1 (Left) we plot these curves for both Type I methods averaged over 1000 independent trials.

Next we solve (15), which produces an estimate of both x and λ. We mark with a '+' the learned λ versus the corresponding error of x̂. In both cases the learned λ's (averaged across trials) perform just as well as if we knew the optimal value a priori. Results using other noise levels, problem dimensions n and m, sparsity levels ‖x‖₀, and sparsity penalties g are similar. See the supplementary material for more examples.

Figure 1 (Right) shows the average sparsity of the estimates x̂, as quantified by the ℓ0 norm ‖x̂‖₀, across λ values (‖x‖₀ returns a count of the number of nonzero elements in x). The '+' indicates the average sparsity of each x̂ for the learned λ as before. In general, the ℓ(0.01) penalty produces a much sparser estimate, very near the true value of ‖x‖₀ = 10 at the optimal λ.
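The data-generation protocol above can be sketched as follows (a minimal numpy version; function and variable names are ours):

```python
import numpy as np

def generate_trial(n=100, m=50, k=10, snr_db=0.0, rng=None):
    """One synthetic trial: y = Phi x + eps at the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    Phi = rng.standard_normal((n, m))
    Phi /= np.linalg.norm(Phi, axis=0)       # l2-normalized iid Gaussian columns
    x = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)
    x[support] = rng.standard_normal(k)      # k unit Gaussian nonzero elements
    signal = Phi @ x
    eps = rng.standard_normal(n)
    # scale noise so that 20*log10(||signal|| / ||eps||) = snr_db
    eps *= np.linalg.norm(signal) / (np.linalg.norm(eps) * 10 ** (snr_db / 20))
    return Phi, x, signal + eps

Phi, x, y = generate_trial(rng=np.random.default_rng(0))
```

Averaging any estimator's error over many such trials against the known x reproduces the evaluation curves described above.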
The ℓ1 penalty, which is substantially less concave/sparsity-inducing, still sets some elements to exactly zero, but also substantially shrinks the nonzero coefficients in achieving a similar overall reconstruction error. This highlights the importance of learning λ via a penalty that is properly matched to the prior on x: if we instead tried to force a particular sparsity value (in this case 10), then the ℓ1 solution would be very suboptimal. Finally, we note that maximum likelihood (ML) estimation of λ performs very poorly (not shown), except in the special case where the ML estimate is equivalent to solving (14), as occurs when f(γ) = 0 (see [6]). The proposed method can be viewed as adding a principled hyperprior on λ, properly matched to p(x), that compensates for this shortcoming of standard ML.

Type II λ estimation has been explored elsewhere for the special case where f(γ) = 0 [19], which renders the factor of n in (16) irrelevant; however, for other selections we have found this factor to improve performance (not shown). For space considerations we have focused our attention here on Type I, which has frequently been noted for not lending itself well to λ estimation (or that of related parameters) [6, 13]. In fact, the symmetry afforded by the dual-space perspective reveals that Type I is just as natural a candidate for this task as Type II, and may be preferred in high-dimensional settings where computational resources are at a premium.

4 Maximally Sparse Estimation

With the advent of compressive sensing and other related applications, there has been growing interest in finding maximally sparse signal representations from redundant dictionaries (m ≫ n) [3, 5]. The canonical form of this problem involves solving

x_0 \triangleq \arg\min_x \|x\|_0, \quad \text{s.t. } y = \Phi x.   (17)

²For example, in non-stationary environments, the values of both x and λ may be completely different for any new y, which then necessitates that we estimate both jointly.

Figure 1: Left: Normalized mean-squared error (MSE), given by ⟨‖x − x̂‖₂² / ‖x‖₂²⟩ (where the average is across 1000 trials), plotted versus λ for two different Type I approaches. Each black '+' represents the estimated value of λ (averaged across trials) and the associated MSE produced with this estimate. In both cases the estimated value achieves the lowest possible MSE (it can actually be slightly lower than the curve because its value is allowed to fluctuate from trial to trial). Right: Solution sparsity ‖x̂‖₀ versus λ. Even though they both lead to similar MSE, the ℓ(0.01) penalty produces a much sparser estimate at the optimal λ value.

While (17) is NP-hard, whenever the dictionary Φ satisfies a restricted isometry property (RIP) [2] or a related structural assumption, meaning that every set of ‖x₀‖₀ columns of Φ is sufficiently close to orthonormal (i.e., mutually uncorrelated), replacing ℓ0 with ℓ1 in (17) leads to a convex problem with an equivalent global solution.
Unfortunately, however, in many situations (e.g., feature selection, source localization) these RIP equivalence conditions are grossly violated, implying that the ℓ1 solution may deviate substantially from x₀.

An alternative is to instead replace (17) with minimization of (8) and then take the limit as λ → 0. (Note that the extension to the noisy case with λ > 0 is straightforward, but the analysis is more difficult.) In this regime the optimization problem reduces to

x^{(II)} = \lim_{\lambda \to 0} \arg\min_x g^{(II)}(x), \quad \text{s.t. } y = \Phi x.   (18)

If log|Σ_y| + Σ_i f(γ_i) is concave, then (18) can be minimized using reweighted ℓ1 minimization. With initial weight vector w^{(0)} = 1, the (k+1)-th iteration involves computing

x^{(k+1)} \leftarrow \arg\min_{x:\, y = \Phi x} \sum_i w_i^{(k)} |x_i|, \quad w_i^{(k+1)} \leftarrow \left. \frac{\partial g^{(II)}(x)}{\partial |x_i|} \right|_{x = x^{(k+1)}}.   (19)

With f(γ) = 0, iterating (19) will provably lead to an estimate of x₀ that is as good as or better than the ℓ1 solution [21], in particular when Φ has highly correlated columns. Additionally, the assumption f(γ) = 0 leads to a closed-form expression for the weights w^{(k+1)}. Let

\eta_i(x; \alpha, q) \triangleq \left[ \phi_{\cdot i}^T \left( \alpha I + \Phi |X^{(k+1)}|^2 \Phi^T \right)^{-1} \phi_{\cdot i} \right]^q,   (20)

where |X^{(k+1)}| denotes a diagonal matrix with i-th diagonal entry given by |x_i^{(k+1)}|. Then w^{(k+1)} can be computed via w_i^{(k+1)} = η_i(x; 0, 1/2), ∀i. It remains unclear, however, in what circumstances this type of update can lead to guaranteed improvement, nor whether the functions η_i(x; 0, 1/2) are even the optimal choice.
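A minimal numpy sketch of the weight update (20) (names are ours; the α = 0 case is replaced by a small jitter so the n × n solve stays well-posed when x is sparse):

```python
import numpy as np

def eta_weights(x, Phi, alpha=1e-8, q=0.5):
    """Reweighting weights w_i = [phi_i^T (alpha I + Phi |X|^2 Phi^T)^{-1} phi_i]^q from (20)."""
    n = Phi.shape[0]
    A = alpha * np.eye(n) + (Phi * x ** 2) @ Phi.T   # alpha I + Phi |X|^2 Phi^T
    # quadratic forms phi_i^T A^{-1} phi_i for all columns at once
    return np.einsum('ij,ji->i', Phi.T, np.linalg.solve(A, Phi)) ** q

# toy usage
rng = np.random.default_rng(4)
Phi = rng.standard_normal((5, 12))
Phi /= np.linalg.norm(Phi, axis=0)
w = eta_weights(rng.standard_normal(12), Phi)
```

With q = 1/2 this reproduces the f(γ) = 0 weights described above; Theorem 2 below instead takes q ≥ 1 with α sufficiently small.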
We will now demonstrate that for certain selections of α and q, reweighted ℓ1 minimization using η_i(x; α, q) is guaranteed to recover x₀ exactly whenever Φ is drawn from what we call a clustered dictionary model.

Definition 1. Clustered Dictionary Model: Let Φ_uncorr^{(d)} denote any dictionary such that ℓ1 minimization succeeds in solving (17) for all ‖x₀‖₀ ≤ d. Let Φ_corr^{(d,ε)} denote any dictionary obtained by replacing each column of Φ_uncorr^{(d)} with a "cluster" of m_i basis vectors such that the angle between any two vectors within a cluster is less than some ε > 0. We also define the cluster support Ω₀ ⊂ {1, 2, …, m} as the set of cluster indices whereby x₀ has at least one nonzero element. Finally, we assume that the resulting Φ_corr^{(d,ε)} is such that every n × n submatrix is full rank.

Theorem 2. For any sparse vector x₀ and any dictionary Φ_corr^{(d,ε)} obtained from the clustered dictionary model with ε sufficiently small, reweighted ℓ1 minimization using weights η_i(x; α, q) with some q ≥ 1 and α sufficiently small will recover x₀ exactly, provided that |Ω₀| ≤ d, Σ_{i∈Ω₀} m_i ≤ n, and within each cluster k ∈ Ω₀ the coefficients do not sum to zero.

Theorem 2 implies that even though ℓ1 may fail to find the maximally sparse x₀ because of severe RIP violations (high correlations between groups of dictionary columns, as dictated by ε, lead directly to a poor RIP), a Type II-inspired method can still be successful. Moreover, because whenever ℓ1 does succeed Type II will always succeed as well (assuming a reweighted ℓ1 implementation), the converse (an RIP violation leading to Type II failure but not ℓ1 failure) can never happen.
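A hypothetical construction in the spirit of Definition 1: each column of an uncorrelated base dictionary is expanded into a cluster of slightly perturbed, renormalized copies, so that within-cluster angles fall below a chosen ε. This is only an illustrative sketch (all names are ours), not the paper's experimental setup:

```python
import numpy as np

def clustered_dictionary(Phi_uncorr, cluster_size=3, eps_angle=0.05, rng=None):
    """Expand each column of Phi_uncorr into `cluster_size` unit-norm vectors
    whose pairwise angles stay (with high probability) below eps_angle radians,
    mimicking the clustered dictionary model of Definition 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = Phi_uncorr.shape[0]
    # perturbation scale chosen so within-cluster angles land well under eps_angle
    sigma = eps_angle / (4.0 * np.sqrt(n))
    cols = []
    for phi in Phi_uncorr.T:
        for _ in range(cluster_size):
            v = phi + sigma * rng.standard_normal(n)
            cols.append(v / np.linalg.norm(v))
    return np.column_stack(cols)

# toy usage: 5 clusters of 3 highly correlated columns each
rng = np.random.default_rng(5)
base = rng.standard_normal((20, 5))
base /= np.linalg.norm(base, axis=0)
D = clustered_dictionary(base, cluster_size=3, eps_angle=0.05, rng=rng)
```

Dictionaries built this way grossly violate RIP by design, which is exactly the regime Theorem 2 addresses.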
Recent work from [21] has argued that Type II may be useful for addressing the sparse recovery problem with correlated dictionaries, and empirical evidence is provided showing vastly superior performance on clustered dictionaries. However, we stress that no results proving global convergence to the correct, maximally sparse solution had previously been shown in the case of structured dictionaries (except in special cases with strong, unverifiable constraints on the coefficient magnitudes [21]). Moreover, the proposed weighting strategy η_i(x; α, q) accomplishes this without any particular tuning to the clustered dictionary model under consideration, and thus likely succeeds in many other cases as well.

5 Generalized Likelihood Functions

Type I methods naturally accommodate alternative likelihood functions. We simply must replace the quadratic data fit term from (4) with some preferred function, and then coordinate-wise optimization may proceed provided we have an efficient means of computing a weighted ℓ2-norm penalized solution. In contrast, generalizing Type II is substantially more complicated because it is no longer possible to compute the marginalization (5) or the posterior distribution p(x|y; γ^{(II)}). Therefore, to obtain a tractable estimate x^{(II)}, additional heuristics are required. For example, the RVM classifier from [19] employs a Laplace approximation for this purpose; however, it is not clear what cost function is being minimized, nor what rigorous properties the estimated solutions possess.

Fortunately, the dual x-space view provides a natural mechanism for generalizing the basic Type II methodology to address alternative likelihood functions in a more principled manner.
In the case of classification problems, we might want to replace the Gaussian likelihood p(y|x) implied by (1) with a multivariate Bernoulli distribution p(y|x) ∝ exp[−ψ(y, x)], where ψ(y, x) is the function

\psi(y, x) \triangleq -\sum_j \left( y_j \log\left[\sigma_j(x)\right] + (1 - y_j) \log\left[1 - \sigma_j(x)\right] \right).   (21)

Here y_j ∈ {0, 1} and σ_j(x) ≜ 1/[1 + exp(−φ_{j·}^T x)], with φ_{j·} denoting the j-th row of Φ. This function may be naturally substituted into the x-space Type II cost function (8), giving us the candidate penalized logistic regression function

\min_x\; \psi(y, x) + \lambda g^{(II)}(x).   (22)

Importantly, recasting Type II classification using x-space in this way, with its attendant well-specified cost function, facilitates more concrete analyses (see below) regarding properties of global and local minima that were previously rendered inaccessible because of intractable integrals and compensatory approximations. Moreover, we retain a tight connection with the original Type II marginalization process as follows.

Consider the strict upper bound on the function ψ(y, x) (obtained by a Taylor series approximation and a Hessian bound) given by

\psi(y, x) \leq \pi(y, x, v) \triangleq \psi(y, v) + (v - x)^T \Phi^T t + \frac{1}{8} (v - x)^T \Phi^T \Phi (v - x),   (23)

where t = [t₁, …, t_n]^T with t_j ≜ y_j − σ_j(v). This bound holds for all v, with equality when v = x. Using this result we obtain the lower bound on the marginal likelihood given by ∫ exp[−ψ(y, x)] p(x) dx ≥ ∫ exp[−π(y, x, v)] p(x) dx. The dual-space framework can then be used to derive the following result:

Theorem 3.
Minimization of (22) with λ = 4 is equivalent to solving

    max_{v; γ ⪰ 0}  ∫ exp[−π(y, x, v)] Π_i N(x_i; 0, γ_i) ϕ(γ_i) dx    (24)

and then computing x^(II) by plugging the resulting γ into (6).

Thus we may conclude that (22) provides a principled approximation to (5) when a Bernoulli likelihood function is used for classification purposes. In empirical tests on benchmark data sets (see supplementary material) using f(γ) = 0, it performs nearly identically to the original RVM (which also implicitly assumes f(γ) = 0), but nonetheless provides a more solid theoretical justification for Type II classifiers because of the underlying similarities and identical generative model. But while the RVM and its attendant approximations are difficult to analyze, (22) is relatively transparent. Additionally, for other sparse priors, or equivalently other selections for f, we can still perform optimization and analyze cost functions without any conjugacy requirements on the implicit p(x).

Theorem 4. If log|Σ_y| + Σ_i f(γ_i) is a concave, non-decreasing function of γ (as will be the case if f is concave and non-decreasing), then every local optimum of (24) is achieved at a solution with at most n nonzero elements in γ, and therefore in x^(II). In contrast, if −log p(x) is convex, then (24) can be globally solved via a convex program.

Despite the practical success of the RVM and related Bayesian techniques, and empirical evidence of sparse solutions, there is currently no proof that the standard variants of these classification methods will always produce exactly sparse estimates.
Thus Theorem 4 provides some analytical validation of these types of classifiers.

Finally, if we take (22) as our starting point, we may naturally consider modifications tailored to specific sparse classification tasks (which may or may not retain an explicit connection with the original Type II probabilistic model). For example, suppose we would like to obtain a maximally sparse classifier, where regularization is provided by a ‖x‖_0 penalty. Direct optimization is combinatorial because of what we call the global zero attraction property: whenever any individual coefficient x_i goes to zero, we are necessarily at a local minimum with respect to that coefficient because of the infinite slope (discontinuity) of the ℓ0 norm at zero. However, (22) can be modified to approximate the ℓ0 norm without this property as follows.

Theorem 5. Consider the Type II-inspired minimization problem

    x̂, γ̂ = arg min_{x; γ ⪰ 0}  ψ(y, x) + α_1 Σ_i x_i²/γ_i + log|α_2 I + ΦΓΦ^T|,    (25)

which is equivalent to (22) with f(γ) = 0 when α_1 = α_2 = λ. For some α_1 and α_2 sufficiently small (but not necessarily equal), the support³ of x̂ will match the support of arg min_x ψ(y, x) + λ‖x‖_0. Moreover, (25) does not satisfy the global zero attraction property.

Thus Type II affords the possibility of mimicking the ℓ0 norm in the presence of generalized likelihoods, but with the advantageous potential for drastically fewer local minima. This is a direction for future research. Additionally, while here we have focused our attention on classification via logistic regression, these ideas can presumably be extended to other likelihood functions provided certain conditions are met.
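The relevant condition is essentially the availability of a tight quadratic upper bound like (23). For the logistic case this is easy to check numerically; the following minimal sketch (our own code, using only the definitions of ψ, σ_j, and t from (21) and (23)) verifies the bound and the equality at v = x on random data:

```python
import numpy as np

def psi(y, x, Phi):
    # Logistic negative log-likelihood psi(y, x) from (21).
    s = 1.0 / (1.0 + np.exp(-(Phi @ x)))
    return -np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

def pi_bound(y, x, v, Phi):
    # Quadratic upper bound pi(y, x, v) from (23): Taylor expansion at v
    # plus the Hessian bound  Phi^T S Phi <= (1/4) Phi^T Phi.
    s_v = 1.0 / (1.0 + np.exp(-(Phi @ v)))
    t = y - s_v                       # t_j = y_j - sigma_j(v)
    d = v - x
    return psi(y, v, Phi) + d @ Phi.T @ t + 0.125 * d @ Phi.T @ Phi @ d

rng = np.random.default_rng(0)
Phi = rng.standard_normal((15, 6))
y = rng.integers(0, 2, 15).astype(float)
for _ in range(100):
    x, v = rng.standard_normal(6), rng.standard_normal(6)
    assert psi(y, x, Phi) <= pi_bound(y, x, v, Phi) + 1e-9   # bound holds
    assert abs(psi(y, x, Phi) - pi_bound(y, x, x, Phi)) < 1e-9  # tight at v = x
```

Any likelihood admitting such a majorizing quadratic (i.e., a bounded Hessian) supports the same marginalization argument used in Theorem 3.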
To the best of our knowledge, while already demonstrably successful in an empirical setting, Type II classifiers and other related Bayesian generalized likelihood models have never been analyzed in the context of sparse estimation as we have done in this section.

6 Conclusion

The dual-space view of sparse linear or generalized linear models naturally allows us to transition x-space ideas originally developed for Type I and apply them to Type II, and conversely to apply γ-space techniques from Type II to Type I. The resulting symmetry promotes a mutual understanding of both methodologies and helps ensure that neither is underutilized.

³Support refers to the index set of the nonzero elements.

References

[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[2] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Information Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[3] E. Candès, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2008.
[4] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2008.
[5] D.L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proc. National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, March 2003.
[6] M.A.T. Figueiredo, "Adaptive sparseness using Jeffreys prior," Advances in Neural Information Processing Systems 14, pp. 697–704, 2002.
[7] C. Févotte and S.J. Godsill, "Blind separation of sparse sources using Jeffreys inverse prior and the EM algorithm," Proc. 6th Int. Conf. Independent Component Analysis and Blind Source Separation, Mar. 2006.
[8] M. Girolami, "A variational method for learning sparse and overcomplete representations," Neural Computation, vol. 13, no. 11, pp. 2517–2532, 2001.
[9] I.F. Gorodnitsky and B.D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. Signal Processing, vol. 45, no. 3, pp. 600–616, March 1997.
[10] S. Ji, D. Dunson, and L. Carin, "Multi-task compressive sensing," IEEE Trans. Signal Processing, vol. 57, no. 1, pp. 92–106, Jan. 2009.
[11] K. Kreutz-Delgado, J.F. Murray, B.D. Rao, K. Engan, T.-W. Lee, and T.J. Sejnowski, "Dictionary learning algorithms for sparse representation," Neural Computation, vol. 15, no. 2, pp. 349–396, February 2003.
[12] D.J.C. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.
[13] J. Mattout, C. Phillips, W.D. Penny, M.D. Rugg, and K.J. Friston, "MEG source localization under multiple constraints: An extended Bayesian framework," NeuroImage, vol. 30, pp. 753–767, 2006.
[14] R.M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996.
[15] J.A. Palmer, D.P. Wipf, K. Kreutz-Delgado, and B.D. Rao, "Variational EM algorithms for non-Gaussian latent variable models," Advances in Neural Information Processing Systems 18, pp. 1059–1066, 2006.
[16] B.D. Rao, K. Engan, S.F. Cotter, J. Palmer, and K. Kreutz-Delgado, "Subset selection in noise based on diversity measure minimization," IEEE Trans. Signal Processing, vol. 51, no. 3, pp. 760–770, March 2003.
[17] M. Seeger and H. Nickisch, "Large scale Bayesian inference and experimental design for sparse linear models," SIAM J. Imaging Sciences, vol. 4, no. 1, pp. 166–199, 2011.
[18] R. Tibshirani, "Regression shrinkage and selection via the Lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[19] M.E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[20] M.E. Tipping and A.C. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," Ninth Int. Workshop on Artificial Intelligence and Statistics, Jan. 2003.
[21] D.P. Wipf, "Sparse estimation with structured dictionaries," Advances in Neural Information Processing Systems 24, 2011.
[22] D.P. Wipf, B.D. Rao, and S. Nagarajan, "Latent variable Bayesian models for promoting sparsity," IEEE Trans. Information Theory, vol. 57, no. 9, Sept. 2011.