{"title": "A New View of Automatic Relevance Determination", "book": "Advances in Neural Information Processing Systems", "page_first": 1625, "page_last": 1632, "abstract": "Automatic relevance determination (ARD), and the closely-related sparse Bayesian learning (SBL) framework, are effective tools for pruning large numbers of irrelevant features. However, popular update rules used for this process are either prohibitively slow in practice and/or heuristic in nature without proven convergence properties. This paper furnishes an alternative means of optimizing a general ARD cost function using an auxiliary function that can naturally be solved using a series of re-weighted L1 problems. The result is an efficient algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a stationary point unlike existing methods. The analysis also leads to additional insights into the behavior of previous ARD updates as well as the ARD cost function. For example, the standard fixed-point updates of MacKay (1992) are shown to be iteratively solving a particular min-max problem, although they are not guaranteed to lead to a stationary point. The analysis also reveals that ARD is exactly equivalent to performing MAP estimation using a particular feature- and noise-dependent \\textit{non-factorial} weight prior with several desirable properties over conventional priors with respect to feature selection. In particular, it provides a tighter approximation to the L0 quasi-norm sparsity measure than the L1 norm. 
Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions.", "full_text": "A New View of Automatic Relevance Determination\n\nDavid Wipf and Srikantan Nagarajan*\nBiomagnetic Imaging Lab, UC San Francisco\n{david.wipf, sri}@mrsc.ucsf.edu\n\nAbstract\n\nAutomatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features, leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted $\\ell_1$ problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. 
In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation.\n\n*This research was supported by NIH grants R01DC04855 and R01DC006435.\n\n1 Introduction\n\nHere we will be concerned with the generative model\n\n$$y = \\Phi x + \\epsilon, \\tag{1}$$\n\nwhere $\\Phi \\in \\mathbb{R}^{n \\times m}$ is a dictionary of features, $x \\in \\mathbb{R}^m$ is a vector of unknown weights, $y$ is an observation vector, and $\\epsilon$ is uncorrelated noise distributed as $\\mathcal{N}(\\epsilon; 0, \\lambda I)$. When large numbers of features are present relative to the signal dimension, the estimation problem is fundamentally ill-posed. Automatic relevance determination (ARD) addresses this problem by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [10]. Here we will describe a special case of ARD called sparse Bayesian learning (SBL) that has been very successful in a variety of applications [15]. Later in Section 4 we will address extensions to more general models.\n\nThe basic ARD prior incorporated by SBL is $p(x; \\gamma) = \\mathcal{N}(x; 0, \\mathrm{diag}[\\gamma])$, where $\\gamma \\in \\mathbb{R}^m_+$ is a vector of $m$ non-negative hyperparameters governing the prior variance of each unknown coefficient. These hyperparameters are estimated from the data by first marginalizing over the coefficients $x$ and then performing what is commonly referred to as evidence maximization or type-II maximum likelihood [7, 10, 15]. Mathematically, this is equivalent to minimizing\n\n$$\\mathcal{L}(\\gamma) \\triangleq -\\log \\int p(y|x) p(x; \\gamma) dx = -\\log p(y; \\gamma) \\equiv \\log |\\Sigma_y| + y^T \\Sigma_y^{-1} y, \\tag{2}$$\n\nwhere a flat hyperprior on $\\gamma$ is assumed, $\\Sigma_y \\triangleq \\lambda I + \\Phi \\Gamma \\Phi^T$, and $\\Gamma \\triangleq \\mathrm{diag}[\\gamma]$. 
Once some $\\gamma_* = \\arg\\min_\\gamma \\mathcal{L}(\\gamma)$ is computed, an estimate of the unknown coefficients can be obtained by setting $x_{ARD}$ to the posterior mean computed using $\\gamma_*$:\n\n$$x_{ARD} = E[x|y; \\gamma_*] = \\Gamma_* \\Phi^T \\Sigma_{y*}^{-1} y. \\tag{3}$$\n\nNote that if any $\\gamma_{*,i} = 0$, as often occurs during the learning process, then $x_{ARD,i} = 0$ and the corresponding feature is effectively pruned from the model. The resulting weight vector $x_{ARD}$ is therefore sparse, with nonzero elements corresponding to the \u2018relevant\u2019 features.\n\nThere are (at least) two outstanding issues related to this model which we consider to be significant. First, while several methods exist for optimizing (2), limitations remain in each case. For example, an EM version operates by treating the unknown $x$ as hidden data, leading to the E-step\n\n$$\\mu \\triangleq E[x|y; \\gamma] = \\Gamma \\Phi^T \\Sigma_y^{-1} y, \\qquad \\Sigma \\triangleq \\mathrm{Cov}[x|y; \\gamma] = \\Gamma - \\Gamma \\Phi^T \\Sigma_y^{-1} \\Phi \\Gamma, \\tag{4}$$\n\nand the M-step\n\n$$\\gamma_i \\to \\mu_i^2 + \\Sigma_{ii}, \\quad \\forall i = 1, \\ldots, m. \\tag{5}$$\n\nWhile convenient to implement, the convergence can be prohibitively slow in practice. In contrast, the MacKay update rules are considerably faster to converge [15]. The idea here is to form the gradient of (2), equate to zero, and then form the fixed-point update\n\n$$\\gamma_i \\to \\frac{\\mu_i^2}{1 - \\gamma_i^{-1} \\Sigma_{ii}}, \\quad \\forall i = 1, \\ldots, m. \\tag{6}$$\n\nHowever, neither the EM nor MacKay updates are guaranteed to converge to a local minimum or even a saddle point of $\\mathcal{L}(\\gamma)$; both have fixed points whenever a $\\gamma_i = 0$, whether at a minimizing solution or not. Finally, a third algorithm has recently been proposed that optimally updates a single hyperparameter $\\gamma_i$ at a time, which can be done very efficiently in closed form [16]. 
While extremely fast to implement, as a greedy-like method it can sometimes be more prone to becoming trapped in local minima when the number of features is large, e.g., $m > n$ (results will be presented in a forthcoming publication). Additionally, none of these methods are easily extended to more general problems such as non-negative sparse coding, covariance component estimation, and classification without introducing additional approximations.\n\nA second issue pertaining to the ARD model involves its connection with more traditional maximum a posteriori (MAP) estimation methods for extracting sparse, relevant features using fixed, sparsity-promoting prior distributions (i.e., heavy-tailed and peaked). Presently, it is unclear how ARD, which invokes a parameterized prior and transfers the estimation problem to hyperparameter space, relates to MAP approaches which operate directly in $x$ space. Nor is it intuitively clear why ARD often works better in selecting optimal feature sets.\n\nThis paper introduces an alternative formulation of the ARD cost function using auxiliary functions that naturally addresses the above issues. In Section 2, the proposed reformulation of ARD is conveniently optimized by solving a series of re-weighted $\\ell_1$ problems. The result is an efficient algorithm that can be implemented using standard convex programming methods and is guaranteed to converge to a local minimum (or saddle point) of $\\mathcal{L}(\\gamma)$. Section 3 then demonstrates that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then show that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Additionally, these results suggest modifications of ARD for selecting relevant features and promoting sparse solutions in a variety of general situations. 
In particular, the methodology readily extends to handle problems involving non-negative sparse coding, covariance component estimation, and classification as discussed in Section 4.\n\n2 ARD/SBL Optimization via Iterative Re-Weighted Minimum $\\ell_1$\n\nIn this section we re-express $\\mathcal{L}(\\gamma)$ using auxiliary functions, which leads to an alternative update procedure that circumvents the limitations of current approaches. In fact, a wide variety of alternative update rules can be derived by decoupling $\\mathcal{L}(\\gamma)$ using upper bounding functions that are more conveniently optimized. Here we focus on a particular instantiation of this idea that leads to an iterative minimum $\\ell_1$ procedure. The utility of this selection is that many powerful convex programming toolboxes have already been developed for solving these types of problems, especially when structured dictionaries $\\Phi$ are being used.\n\n2.1 Algorithm Derivation\n\nTo start we note that the log-determinant term of $\\mathcal{L}(\\gamma)$ is concave in $\\gamma$ (see Section 3.1.5 of [1]), and so can be expressed as a minimum over upper-bounding hyperplanes via\n\n$$\\log |\\Sigma_y| = \\min_z z^T \\gamma - g^*(z), \\tag{7}$$\n\nwhere $g^*(z)$ is the concave conjugate of $\\log |\\Sigma_y|$ that is defined by the duality relationship [1]\n\n$$g^*(z) = \\min_\\gamma z^T \\gamma - \\log |\\Sigma_y|, \\tag{8}$$\n\nalthough for our purposes we will never actually compute $g^*(z)$. This leads to the following upper-bounding auxiliary cost function\n\n$$\\mathcal{L}(\\gamma, z) \\triangleq z^T \\gamma - g^*(z) + y^T \\Sigma_y^{-1} y \\geq \\mathcal{L}(\\gamma). \\tag{9}$$\n\nFor any fixed $\\gamma$, the optimal (tightest) bound can be obtained by minimizing over $z$. The optimal value of $z$ equals the slope at the current $\\gamma$ of $\\log |\\Sigma_y|$. Therefore, we have\n\n$$z_{opt} = \\nabla_\\gamma \\log |\\Sigma_y| = \\mathrm{diag}[\\Phi^T \\Sigma_y^{-1} \\Phi]. \\tag{10}$$\n\nThis formulation naturally admits the following optimization scheme:\n\nStep 1: Initialize each $z_i$, e.g., $z_i = 1, \\forall i$.\nStep 2: Solve the minimization problem\n\n$$\\gamma \\to \\arg\\min_\\gamma \\mathcal{L}_z(\\gamma) \\triangleq z^T \\gamma + y^T \\Sigma_y^{-1} y. \\tag{11}$$\n\nStep 3: Compute the optimal $z$ using (10).\nStep 4: Iterate Steps 2 and 3 until convergence to some $\\gamma_*$.\nStep 5: Compute $x_{ARD} = E[x|y; \\gamma_*] = \\Gamma_* \\Phi^T \\Sigma_{y*}^{-1} y$.\n\nLemma 1. The objective function in (11) is convex.\n\nThis can be shown using Example 3.4 and Section 3.2.2 in [1]. Lemma 1 implies that many standard optimization procedures can be used for the minimization required by Step 2. For example, one attractive option is to convert the problem to an equivalent least absolute shrinkage and selection operator or \u2018Lasso\u2019 [14] optimization problem according to the following:\n\nLemma 2. The objective function in (11) can be minimized by solving the weighted convex $\\ell_1$-regularized cost function\n\n$$x_* = \\arg\\min_x \\|y - \\Phi x\\|_2^2 + 2\\lambda \\sum_i z_i^{1/2} |x_i| \\tag{12}$$\n\nand then setting $\\gamma_i \\to z_i^{-1/2} |x_{*,i}|$ for all $i$ (note that each $z_i$ will always be positive).\n\nThe proof of Lemma 2 can be briefly summarized using a re-expression of the data dependent term in (11) using\n\n$$y^T \\Sigma_y^{-1} y = \\min_x \\frac{1}{\\lambda} \\|y - \\Phi x\\|_2^2 + \\sum_i \\frac{x_i^2}{\\gamma_i}. \\tag{13}$$\n\nThis leads to an upper-bounding auxiliary function for $\\mathcal{L}_z(\\gamma)$ given by\n\n$$\\mathcal{L}_z(\\gamma, x) \\triangleq \\sum_i \\left( z_i \\gamma_i + \\frac{x_i^2}{\\gamma_i} \\right) + \\frac{1}{\\lambda} \\|y - \\Phi x\\|_2^2 \\geq \\mathcal{L}_z(\\gamma), \\tag{14}$$\n\nwhich is jointly convex in $x$ and $\\gamma$ (see Example 3.4 in [1]) and can be globally minimized by solving over $\\gamma$ and then $x$. For any $x$, $\\gamma_i = z_i^{-1/2} |x_i|$ minimizes $\\mathcal{L}_z(\\gamma, x)$. When substituted into (14) we obtain (12). When solved for $x$, the global minimum of (14) yields the global minimum of (11) via the stated transformation.\n\nIn summary then, by iterating the above algorithm using Lemma 2 to implement Step 2, a convenient optimization method is obtained. Moreover, we do not even need to globally solve for $x$ (or equivalently $\\gamma$) at each iteration as long as we strictly reduce (11) at each iteration. This is readily achievable using a variety of simple strategies. Additionally, if $z$ is initialized to a vector of ones, then the starting point (assuming Step 2 is computed in full) is the exact Lasso estimator. The algorithm then refines this estimate through the specified re-weighting procedure.\n\n2.2 Global Convergence Analysis\n\nLet $A(\\cdot)$ denote a mapping that assigns to every point in $\\mathbb{R}^m_+$ the subset of $\\mathbb{R}^m_+$ which satisfies Steps 2 and 3 of the proposed algorithm. Such a mapping can be implemented via the methodology described above. 
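As a rough illustration of Steps 1 through 5, with Step 2 carried out through the weighted Lasso of Lemma 2, the following NumPy sketch may help. It is not the authors' implementation: the coordinate-descent solver, the function names (ard_cost, weighted_lasso, ard_reweighted_l1), the sweep count, and the toy data are all assumptions made here for illustration, and any convex solver for (12) could be substituted.

```python
import numpy as np

def ard_cost(gamma, Phi, y, lam):
    # L(gamma) = log|Sigma_y| + y^T Sigma_y^{-1} y, as in Eq. (2)
    Sigma_y = lam * np.eye(len(y)) + (Phi * gamma) @ Phi.T
    _, logdet = np.linalg.slogdet(Sigma_y)
    return logdet + y @ np.linalg.solve(Sigma_y, y)

def weighted_lasso(Phi, y, lam, w, x_init, sweeps=500, tol=1e-12):
    # Hypothetical helper: coordinate descent for Eq. (12),
    # min_x ||y - Phi x||^2 + 2 * lam * sum_i w_i * |x_i|, with w_i = sqrt(z_i)
    x = x_init.copy()
    col_sq = np.sum(Phi ** 2, axis=0)
    r = y - Phi @ x
    for _ in range(sweeps):
        max_step = 0.0
        for j in range(Phi.shape[1]):
            rho = Phi[:, j] @ r + col_sq[j] * x[j]
            # Soft-thresholding solves the one-dimensional subproblem exactly
            new = np.sign(rho) * max(abs(rho) - lam * w[j], 0.0) / col_sq[j]
            r += Phi[:, j] * (x[j] - new)
            max_step = max(max_step, abs(new - x[j]))
            x[j] = new
        if max_step < tol:
            break
    return x

def ard_reweighted_l1(Phi, y, lam, iters=10):
    n, m = Phi.shape
    z = np.ones(m)                                  # Step 1
    x = np.zeros(m)
    costs = []
    for _ in range(iters):                          # Step 4 (fixed iteration count)
        x = weighted_lasso(Phi, y, lam, np.sqrt(z), x)   # Step 2 via Lemma 2
        gamma = np.abs(x) / np.sqrt(z)
        Sigma_y = lam * np.eye(n) + (Phi * gamma) @ Phi.T
        z = np.sum(Phi * np.linalg.solve(Sigma_y, Phi), axis=0)  # Step 3, Eq. (10)
        costs.append(ard_cost(gamma, Phi, y, lam))
    x_ard = gamma * (Phi.T @ np.linalg.solve(Sigma_y, y))        # Step 5, Eq. (3)
    return x_ard, gamma, costs

# Toy problem (sizes, support, and noise level are arbitrary choices)
rng = np.random.default_rng(0)
n, m = 10, 20
Phi = rng.standard_normal((n, m))
x_true = np.zeros(m)
x_true[[2, 7, 15]] = [1.5, -2.0, 1.0]
y = Phi @ x_true + 0.01 * rng.standard_normal(n)
x_ard, gamma, costs = ard_reweighted_l1(Phi, y, lam=0.1)
```

The list costs records the value of (2) after each pass, so it can be inspected to see how the cost evolves; the first entry corresponds to the plain Lasso starting point obtained with all weights equal to one.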
We allow $A(\\cdot)$ to be a point-to-set mapping to handle the case where the global minimum of (11) is not unique, which could occur, for example, if two columns of $\\Phi$ are identical.\n\nTheorem 1. From any initialization point $\\gamma^{(0)} \\in \\mathbb{R}^m_+$ the sequence of hyperparameter estimates $\\{\\gamma^{(k)}\\}$ generated via $\\gamma^{(k+1)} \\in A(\\gamma^{(k)})$ is guaranteed to converge monotonically to a local minimum (or saddle point) of (2).\n\nThe proof is relatively straightforward and stems directly from the Global Convergence Theorem (see for example [6]). A sketch is as follows: First, it must be shown that the mapping $A(\\cdot)$ is compact. This condition is satisfied because if any element of $\\gamma$ is unbounded, $\\mathcal{L}(\\gamma)$ diverges to infinity. In fact, for any fixed $y$, $\\Phi$ and $\\lambda$, there will always exist a radius $r$ such that for any $\\|\\gamma^{(0)}\\| \\leq r$, $\\|\\gamma^{(k)}\\| \\leq r$ for all $k$. Second, we must show that for any non-minimizing point of $\\mathcal{L}(\\gamma)$ denoted $\\gamma'$, $\\mathcal{L}(\\gamma'') < \\mathcal{L}(\\gamma')$ for all $\\gamma'' \\in A(\\gamma')$. At any non-minimizing $\\gamma'$ the auxiliary cost function $\\mathcal{L}_{z'}(\\gamma)$ obtained from Step 3 will be strictly tangent to $\\mathcal{L}(\\gamma)$ at $\\gamma'$. It will therefore necessarily have a minimum elsewhere since the slope at $\\gamma'$ is nonzero by definition. Moreover, because the $\\log |\\cdot|$ function is strictly concave, at this minimum the actual cost function will be reduced still further. Consequently, the proposed updates represent a valid descent function. Finally, it must be shown that $A(\\cdot)$ is closed at all non-stationary points. This follows from related arguments. The algorithm could of course theoretically converge to a saddle point, but this is rare and any minimal perturbation leads to escape.\n\nBoth EM and MacKay updates provably fail to satisfy one or more of the above criteria and so global convergence cannot be guaranteed. With EM, the failure occurs because the associated updates do not always strictly reduce $\\mathcal{L}(\\gamma)$. Rather, they only ensure that $\\mathcal{L}(\\gamma'') \\leq \\mathcal{L}(\\gamma')$ at all points. In contrast, the MacKay updates do not even guarantee cost function decrease. Consequently, both methods can become trapped at a solution such as $\\gamma = 0$, a fixed point of the updates but not a stationary point or local minimum of $\\mathcal{L}(\\gamma)$. However, in practice this seems to be more of an issue with the MacKay updates. Related shortcomings of EM in this regard can be found in [19]. Finally, the fast Tipping updates could potentially satisfy the conditions for global convergence, although this matter is not discussed in [16].\n\n3 Relating ARD to MAP Estimation\n\nIn hierarchical models such as ARD and SBL there has been considerable debate over how to best perform estimation and inference [8]. Do we add a hyperprior and then integrate out $\\gamma$ and perform MAP estimation directly on $x$? Or is it better to marginalize over the coefficients $x$ and optimize the hyperparameters $\\gamma$ as we have described in this paper? In specific cases, arguments have been made for the merits of one over the other based on intuition or heuristic arguments [8, 15]. But we would argue that this distinction is somewhat tenuous because, as we will now show using ideas from the previous section, the weights obtained from the ARD type-II ML procedure can equivalently be viewed as arising from an explicit MAP estimate in $x$ space. This notion is made precise as follows:\n\nTheorem 2. Let $x^2 \\triangleq [x_1^2, \\ldots, x_m^2]^T$ and $\\gamma^{-1} \\triangleq [\\gamma_1^{-1}, \\ldots, \\gamma_m^{-1}]^T$. Then the ARD coefficients from (3) solve the MAP problem\n\n$$x_{ARD} = \\arg\\min_x \\|y - \\Phi x\\|_2^2 + \\lambda h^*(x^2), \\tag{15}$$\n\nwhere $h^*(x^2)$ is the concave conjugate of $h(\\gamma^{-1}) \\triangleq -\\log |\\Sigma_y|$ and is a concave, non-decreasing function of $x$.\n\nThis result can be established using much of the same analysis used in previous sections. Omitting some details for the sake of brevity, using (13) we can create a strict upper bounding auxiliary function on $\\mathcal{L}(\\gamma)$:\n\n$$\\mathcal{L}(\\gamma, x) = \\frac{1}{\\lambda} \\|y - \\Phi x\\|_2^2 + \\sum_i \\frac{x_i^2}{\\gamma_i} + \\log |\\Sigma_y|. \\tag{16}$$\n\nIf we optimize first over $\\gamma$ instead of $x$ (allowable), the last two terms form the stated concave conjugate function $h^*(x^2)$. In turn, the minimizing $x$, which solves (15), is identical to that obtained by ARD. The concavity of $h^*(x^2)$ with respect to each $|x_i|$ follows from similar ideas.\n\nCorollary 1. The regularization term in (15), and hence the implicit prior distribution on $x$ given by $p(x) \\propto \\exp[-\\frac{1}{2} h^*(x^2)]$, is not generally factorable, meaning $p(x) \\neq \\prod_i p_i(x_i)$. Additionally, unlike traditional MAP procedures (e.g., Lasso, ridge regression, etc.), this prior is explicitly dependent on both the dictionary $\\Phi$ and the regularization term $\\lambda$.\n\nThis result stems directly from the fact that $h(\\gamma^{-1})$ is non-factorable and is dependent on $\\Phi$ and $\\lambda$. The only exception occurs when $\\Phi^T \\Phi = I$; here $h^*(x^2)$ factors and can be expressed in closed form independently of $\\Phi$, although $\\lambda$ dependency remains.\n\n3.1 Properties of the implicit ARD prior\n\nTo begin at the most superficial level, the $\\Phi$ dependency of the ARD prior leads to scale invariant solutions, meaning the value of $x_{ARD}$ is not affected if we rescale $\\Phi$, i.e., $\\Phi \\to \\Phi D$, where $D$ is a diagonal matrix. Rather, any rescaling $D$ only affects the implicit initialization of the algorithm, not the shape of the cost function.\n\nMore significantly, the ARD prior is particularly well-designed for finding sparse solutions. We should note that concave, non-decreasing regularization functions are well-known to encourage sparse representations. Since $h^*(x^2)$ is such a function, it should therefore not be surprising that it promotes sparsity to some degree. However, when selecting highly sparse subsets of features, the factorial $\\ell_0$ quasi-norm is often invoked as the ideal regularization term given unlimited computational resources. It is expressed via $\\|x\\|_0 \\triangleq \\sum_i I[x_i \\neq 0]$, where $I[\\cdot]$ denotes the indicator function, and so represents a count of the number of nonzero coefficients (and therefore features). By applying an $\\exp[-\\frac{1}{2}(\\cdot)]$ transformation, we obtain the implicit (improper) prior distribution. The associated MAP estimation problem (assuming the same standard Gaussian likelihood) involves solving\n\n$$\\min_x \\|y - \\Phi x\\|_2^2 + \\lambda \\|x\\|_0. \\tag{17}$$\n\nThe difficulty here is that (17) is nearly impossible to solve in general; it is NP-hard owing to a combinatorial number of local minima and so the traditional idea is to replace $\\| \\cdot \\|_0$ with a tractable approximation. 
For this purpose, the $\\ell_1$ norm is the optimal or tightest convex relaxation of the $\\ell_0$ quasi-norm, and therefore it is commonly used, leading to the Lasso algorithm [14]. However, the $\\ell_1$ norm need not be the best relaxation in general. In Sections 3.2 and 3.3 we demonstrate that the non-factorable, $\\lambda$-dependent $h^*(x^2)$ provides a tighter, albeit non-convex, approximation that promotes greater sparsity than $\\|x\\|_1$ while conveniently producing many fewer local minima than using $\\|x\\|_0$ directly. We also show that, in certain settings, no $\\lambda$-independent, factorial regularization term can achieve similar results. Consequently, the widely used family of $\\ell_p$ quasi-norms, i.e., $\\|x\\|_p \\triangleq \\sum_i |x_i|^p$, $p < 1$ [2], or the Gaussian entropy measure $\\sum_i \\log |x_i|$ based on the Jeffreys prior [4] provably fail in this regard.\n\n3.2 Benefits of $\\lambda$ dependency\n\nTo explore the properties of $h^*(x^2)$ regarding $\\lambda$ dependency alone, we adopt the simplifying assumption $\\Phi^T \\Phi = I$. (Later we investigate the benefits of a non-factorial prior.) In this special case, $h^*(x^2)$ is factorable and can be expressed in closed form via\n\n$$h^*(x^2) = \\sum_i h^*(x_i^2) \\propto \\sum_i \\left( \\frac{2|x_i|}{|x_i| + \\sqrt{x_i^2 + 4\\lambda}} + \\log\\left( 2\\lambda + x_i^2 + |x_i| \\sqrt{x_i^2 + 4\\lambda} \\right) \\right), \\tag{18}$$\n\nwhich is independent of $\\Phi$. A plot of $h^*(x_i^2)$ is shown in Figure 1 (left) below.\n\nThe $\\lambda$ dependency is retained however and contributes two very desirable properties: (i) As a strictly concave function of each $|x_i|$, $h^*(x^2)$ more closely approximates the $\\ell_0$ quasi-norm than the $\\ell_1$ norm while, (ii) The associated cost function (15) is unimodal unlike when $\\lambda$-independent approximations, e.g., the $\\ell_p$ quasi-norm, are used. This can be explained as follows. 
When $\\lambda$ is small, the Gaussian likelihood is highly restrictive, constraining most of its relative mass to a very localized region of $x$ space. Therefore, a tighter prior more closely resembling the $\\ell_0$ quasi-norm can be used without the risk of local minima, which occur when the spines of a sparse prior overlap non-negligible portions of the likelihood (see Figure 6 in [15] for a good 2D visual of a sparse prior with characteristic spines running along the coordinate axes). In the limit as $\\lambda \\to 0$, $h^*(x^2)$ converges to a scaled version of the $\\ell_0$ quasi-norm, yet no local minima exist because the likelihood in this case only permits a single feasible solution with $x = \\Phi^T y$. In contrast, when $\\lambda$ is large, the likelihood is less constrained and a looser prior is required to avoid local minima troubles, which will arise whenever the now relatively diffuse likelihood intersects the sharp spines of a highly sparse prior. In this situation $h^*(x^2)$ more closely resembles a scaled version of the $\\ell_1$ norm. The implicit ARD prior naturally handles this transition, becoming sparser as $\\lambda$ decreases and vice versa. Hence the following property, which is easy to show [18]:\n\nLemma 3. When $\\Phi^T \\Phi = I$, (15) has no local minima whereas (17) has $2^m$ local minima.\n\nUse of the $\\ell_1$ norm in place of $h^*(x^2)$ also yields no local minima; however, it is a much looser approximation of $\\ell_0$ and penalizes coefficients linearly unlike $h^*(x^2)$. The benefits of $\\lambda$ dependency in this regard can be formalized and will be presented in a subsequent paper. 
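The closed form (18) can be sanity-checked numerically. In the orthonormal case the penalty decouples, so each term should equal the minimum over a scalar hyperparameter of x^2/gamma + log(lam + gamma), up to the additive constant log 2 that the proportionality sign absorbs. The sketch below makes that comparison on a dense grid; the function names, the grid, and the test values are assumptions chosen for illustration only.

```python
import numpy as np

def h_star_closed_form(x, lam):
    # Per-coefficient penalty from Eq. (18), for the case Phi^T Phi = I
    a = np.abs(x)
    s = np.sqrt(x ** 2 + 4.0 * lam)
    return 2.0 * a / (a + s) + np.log(2.0 * lam + x ** 2 + a * s)

def h_star_numeric(x, lam, grid):
    # Brute-force minimum over gamma of x^2 / gamma + log(lam + gamma)
    return np.min(x ** 2 / grid + np.log(lam + grid))

lam = 0.1
grid = np.logspace(-8, 4, 200001)
xs = np.array([0.05, 0.3, 1.0, 3.0])
closed = h_star_closed_form(xs, lam)
numeric = np.array([h_star_numeric(x, lam, grid) for x in xs])
offset = closed - numeric          # expected to be the constant log(2)

# Second differences on a uniform grid of |x| probe concavity in |x|
curvature = np.diff(h_star_closed_form(np.linspace(0.0, 5.0, 101), lam), 2)
```

Negative second differences are consistent with property (i), strict concavity in each coefficient magnitude, while the constant offset confirms that (18) is the decoupled minimum over the hyperparameters.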
As a final point of comparison, the actual weight estimate obtained from solving (15) when $\\Phi^T \\Phi = I$ is equivalent to the non-negative garrote estimator that has been advocated for wavelet shrinkage [5, 18].\n\n[Figure 1. Left panel: $-\\log p(x_i)$ versus $x_i$ for $I[x_i \\neq 0]$, $|x_i|$, and ARD. Right panel: $-\\log p(x)$ (normalized) versus $\\alpha$ for ARD and $\\sum_i |x_i|^{0.01}$, with the maximally sparse solution marked.]\n\nFigure 1: Left: 1D example of the implicit ARD prior. The $\\ell_1$ and $\\ell_0$ norms are included for comparison. Right: Plot of the ARD prior across the feasible region as parameterized by $\\alpha$. A factorial prior given by $-\\log p(x) \\propto \\sum_i |x_i|^{0.01} \\approx \\|x\\|_0$ is included for comparison. Both approximations to the $\\ell_0$ norm retain the correct global minimum, but only ARD smooths out local minima.\n\n3.3 Benefits of a non-factorial prior\n\nIn contrast, the benefits of the typically non-factorial nature of $h^*(x^2)$ are most pronounced when $m > n$, meaning there are more features than the dimension of the signal $y$. In a noiseless setting (with $\\lambda \\to 0$), we can explicitly quantify the potential of this property of the implicit ARD prior. In this limiting situation, the canonical sparse MAP estimation problem (17) reduces to finding\n\n$$x_0 \\triangleq \\arg\\min_x \\|x\\|_0 \\quad \\text{s.t. } y = \\Phi x. \\tag{19}$$\n\nBy simple extension of results in [18], the global minimum of (15) in the limit as $\\lambda \\to 0$ will equal $x_0$, assuming the latter is unique. 
The real distinction then is regarding the number of local minima. In this capacity the ARD MAP problem is superior to any possible factorial variant:\n\nTheorem 3. In the limit as $\\lambda \\to 0$ and assuming $m > n$, no factorial prior $p(x) = \\prod_i \\exp[-\\frac{1}{2} f_i(x_i)]$ exists such that the corresponding MAP problem $\\min_x \\|y - \\Phi x\\|_2^2 + \\lambda \\sum_i f_i(x_i)$ is: (i) Always globally minimized by a maximally sparse solution $x_0$ and, (ii) Has fewer local minima than when solving (15).\n\nA sketch of the proof is as follows. First, for any factorial prior and associated regularization term $\\sum_i f_i(x_i)$, the only way to satisfy (i) is if $\\partial f_i(x_i) / \\partial x_i \\to \\infty$ as $x_i \\to 0$. Otherwise, it will always be possible to have a $\\Phi$ and $y$ such that $x_0$ is not the global minimum. It is then straightforward to show that any $f_i(x_i)$ with this property will necessarily have between $\\left[ \\binom{m-1}{n} + 1, \\binom{m}{n} \\right]$ local minima. Using results from [18], this is provably an upper bound on the number of local minima of (15). Moreover, with the exception of very contrived situations, the number of ARD local minima will be considerably less. In general, this result speaks directly to the potential limitations of restricting oneself to factorial priors when maximal feature pruning is paramount.\n\nWhile generally difficult to visualize, in restricted situations it is possible to explicitly illustrate the type of smoothing over local minima that is possible using non-factorial priors. For example, consider the case where $m = n + 1$ and $\\mathrm{Rank}(\\Phi) = n$, implying that $\\Phi$ has a null-space dimension of one. Consequently, any feasible solution to $y = \\Phi x$ can be expressed as $x = x' + \\alpha v$, where $v \\in \\mathrm{Null}(\\Phi)$, $\\alpha$ is any real-valued scalar, and $x'$ is any fixed, feasible solution (e.g., the minimum norm solution). 
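The one-dimensional parameterization of the feasible set can be sketched as follows, using the problem sizes the paper reports for Figure 1 (right). This is only an illustrative construction: the random seed, the use of the pseudoinverse for the minimum norm solution, the SVD for the null-space direction, and the helper names are assumptions, and only the factorial penalty with p = 0.01 is evaluated here, since the non-factorial ARD penalty requires a separate minimization over the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 11
Phi = rng.standard_normal((n, m))

# Generating vector with 9 nonzeros (support positions chosen at random here)
x0 = np.zeros(m)
support = rng.choice(m, size=9, replace=False)
x0[support] = rng.standard_normal(9)
y = Phi @ x0

# Fixed feasible point: the minimum norm solution
x_mn = np.linalg.pinv(Phi) @ y

# Null-space direction: Phi has rank n, so the last right singular vector spans Null(Phi)
_, _, Vt = np.linalg.svd(Phi)
v = Vt[-1]

def feasible_point(alpha):
    # Every solution of y = Phi x has this form for some scalar alpha
    return x_mn + alpha * v

def lp_penalty(x, p=0.01):
    # Factorial penalty sum_i |x_i|^p, close to a count of nonzeros for small p
    return np.sum(np.abs(x) ** p)

alphas = np.linspace(-8.0, 8.0, 801)
profile = np.array([lp_penalty(feasible_point(a)) for a in alphas])

# x0 lies on the same line; its coordinate follows by projecting onto the unit vector v
alpha0 = v @ (x0 - x_mn)
```

Plotting profile against alphas gives the local minima profile of the factorial penalty along the feasible line, with alpha0 marking where the generating vector sits.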
We can now plot any prior distribution $p(x)$, or equivalently $-\\log p(x)$, over the 1D feasible region of $x$ space as a function of $\\alpha$ to view the local minima profile.\n\nTo demonstrate this idea, we chose $n = 10$, $m = 11$ and generated a $\\Phi$ matrix using iid $\\mathcal{N}(0, 1)$ entries. We then computed $y = \\Phi x_0$, where $\\|x_0\\|_0 = 9$ and nonzero entries are also iid unit Gaussian. Figure 1 (right) displays the plots of two example priors in the feasible region of $y = \\Phi x$: (i) the non-factorial implicit ARD prior, and (ii) the prior $p(x) \\propto \\exp(-\\frac{1}{2} \\sum_i |x_i|^p)$, $p = 0.01$. The latter is a factorial prior which converges to the ideal sparsity penalty when $p \\to 0$. From the figure, we observe that, while both priors peak at $x_0$, the ARD prior has substantially smoothed away local minima. While the implicit Lasso prior (which is equivalent to the assumption $p = 1$) also smooths out local minima, the global minimum may be biased away from the maximally sparse solution in many situations, unlike the ARD prior which provides a non-convex approximation with its global minimum anchored at $x_0$.\n\n4 Extensions\n\nThus far we have restricted attention to one particularly useful ARD-based model. But much of the analysis can be extended to handle a variety of alternative data likelihoods and priors. A particularly useful adaptation relevant to compressed sensing [17], manifold learning [13], and neuroimaging [12, 18] is as follows. First, the data $y$ can be replaced with an $n \\times t$ observation matrix $Y$ which is generated via an unknown coefficient matrix $X$. 
The assumed likelihood model and prior are\n\n$$p(Y|X) \\propto \\exp\\left( -\\frac{1}{2\\lambda} \\|Y - \\Phi X\\|_F^2 \\right), \\qquad p(X) \\propto \\exp\\left( -\\frac{1}{2} \\mathrm{trace}\\left[ X^T \\Sigma_x^{-1} X \\right] \\right), \\qquad \\Sigma_x \\triangleq \\sum_{i=1}^{d_\\gamma} \\gamma_i C_i. \\tag{20}$$\n\nHere each of the $d_\\gamma$ matrices $C_i$ is a known covariance component, of which the irrelevant ones are pruned by minimizing the analogous type-II likelihood function\n\n$$\\mathcal{L}(\\gamma) = \\log |\\lambda I + \\Phi \\Sigma_x \\Phi^T| + \\mathrm{trace}\\left[ \\frac{1}{t} Y Y^T \\left( \\lambda I + \\Phi \\Sigma_x \\Phi^T \\right)^{-1} \\right]. \\tag{21}$$\n\nWith minimal effort, this extension can be solved using the methodology described herein. The primary difference is that Step 2 becomes a second-order cone (SOC) optimization problem, for which a variety of minimization techniques exist [2, 9].\n\nAnother very useful adaptation involves adding a non-negativity constraint on the coefficients $x$, e.g., non-negative sparse coding. This is easily incorporated into the MAP cost function (15) and optimization problem (12); performance is often significantly better than the non-negative Lasso. Results will be presented in a subsequent paper. It may also be possible to develop an effective variant for handling classification problems that avoids additional approximations such as those introduced in [15].\n\n5 Discussion\n\nWhile ARD-based approaches have enjoyed remarkable success in a number of disparate fields, they remain hampered to some degree by implementational limitations and a lack of clarity regarding the nature of the cost function and existing update rules. This paper addresses these issues by presenting a principled alternative algorithm based on auxiliary functions and a dual representation of the ARD objective. 
The resulting algorithm is initialized at the well-known Lasso solution and then iterates via a globally convergent re-weighted $\\ell_1$ procedure that in many ways approximates ideal subset selection using the $\\ell_0$ norm. Preliminary results using this methodology on toy problems as well as large neuroimaging simulations with $m \\approx 100{,}000$ are very promising (and will be reported in future papers). A good (highly sparse) solution is produced at every iteration, so early stopping is always feasible if desired. This produces a highly efficient, global competition among features that is potentially superior to the sequential (greedy) updates of [16] in terms of local minima avoidance in certain cases when $\\Phi$ is highly overcomplete (i.e., $m \\gg n$). Moreover, it is also easily extended to handle additional constraints (e.g., non-negativity) or model complexity as occurs with general covariance component estimation. A related optimization strategy has also been reported in [3].\n\nThe analysis used in deriving this algorithm reveals that ARD is exactly equivalent to performing MAP estimation in x space using a principled, sparsity-inducing prior that is non-factorable and dependent on both the feature set and noise parameter. We have shown that these qualities allow it to promote maximally sparse solutions at the global minimum while retaining drastically fewer local minima than competing priors. This may explain the superior performance of ARD/SBL over Lasso in a variety of disparate disciplines where sparsity is crucial [11, 12, 18]. These ideas raise a key question: If we do not limit ourselves to factorable, $\\Phi$- and $\\lambda$-independent regularization terms/priors as is commonly done, then what is the optimal prior $p(x)$ in the context of feature selection? Perhaps there is a better choice that does not neatly fit into current frameworks linked to empirical priors based on the Gaussian distribution. 
Note that the $\\ell_1$ re-weighting scheme for optimization can be applied to a broad family of non-factorial, sparsity-inducing priors.\n\nReferences\n\n[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.\n[2] S.F. Cotter, B.D. Rao, K. Engan, and K. Kreutz-Delgado, “Sparse solutions to linear inverse problems with multiple measurement vectors,” IEEE Trans. Signal Processing, vol. 53, no. 7, pp. 2477–2488, April 2005.\n[3] M. Fazel, H. Hindi, and S. Boyd, “Log-Det Heuristic for Matrix Rank Minimization with Applications to Hankel and Euclidean Distance Matrices,” Proc. American Control Conf., vol. 3, pp. 2156–2162, June 2003.\n[4] M.A.T. Figueiredo, “Adaptive sparseness using Jeffreys prior,” Advances in Neural Information Processing Systems 14, pp. 697–704, 2002.\n[5] H. Gao, “Wavelet shrinkage denoising using the nonnegative garrote,” Journal of Computational and Graphical Statistics, vol. 7, no. 4, pp. 469–488, 1998.\n[6] D.G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, Massachusetts, 2nd ed., 1984.\n[7] D.J.C. MacKay, “Bayesian interpolation,” Neural Comp., vol. 4, no. 3, pp. 415–447, 1992.\n[8] D.J.C. MacKay, “Comparison of approximate methods for handling hyperparameters,” Neural Comp., vol. 11, no. 5, pp. 1035–1068, 1999.\n[9] D.M. Malioutov, M. Çetin, and A.S. Willsky, “Sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Trans. Signal Processing, vol. 53, no. 8, pp. 3010–3022, August 2005.\n[10] R.M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996.\n[11] R. Pique-Regi, E.S. Tsau, A. Ortega, R.C. Seeger, and S. Asgharzadeh, “Wavelet footprints and sparse Bayesian learning for DNA copy number change analysis,” Int. Conf. Acoustics Speech and Signal Processing, April 2007.\n[12] R.R. Ramírez, Neuromagnetic Source Imaging of Spontaneous and Evoked Human Brain Dynamics, PhD Thesis, New York University, 2005.\n[13] J.G. Silva, J.S. Marques, and J.M. Lemos, “Selecting landmark points for sparse manifold learning,” Advances in Neural Information Processing Systems 18, pp. 1241–1248, 2006.\n[14] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.\n[15] M.E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.\n[16] M.E. Tipping and A.C. Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” Ninth Int. Workshop Artificial Intelligence and Statistics, Jan. 2003.\n[17] M.B. Wakin, M.F. Duarte, S. Sarvotham, D. Baron, and R.G. Baraniuk, “Recovery of jointly sparse signals from a few random projections,” Advances in Neural Information Processing Systems 18, pp. 1433–1440, 2006.\n[18] D.P. Wipf, “Bayesian Methods for Finding Sparse Representations,” PhD Thesis, UC San Diego, 2006.\n[19] C.F. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, pp. 95–103, 1983.", "award": [], "sourceid": 976, "authors": [{"given_name": "David", "family_name": "Wipf", "institution": null}, {"given_name": "Srikantan", "family_name": "Nagarajan", "institution": null}]}