{"title": "Perspectives on Sparse Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 249, "page_last": 256, "abstract": "", "full_text": "Perspectives on Sparse Bayesian Learning\n\nDavid Wipf, Jason Palmer, and Bhaskar Rao\nDepartment of Electrical and Computer Engineering\nUniversity of California, San Diego, CA 92092\ndwipf,japalmer@ucsd.edu, brao@ece.ucsd.edu\n\nAbstract\n\nRecently, relevance vector machines (RVM) have been fashioned from a sparse Bayesian learning (SBL) framework to perform supervised learning using a weight prior that encourages sparsity of representation. The methodology incorporates an additional set of hyperparameters governing the prior, one for each weight, and then adopts a specific approximation to the full marginalization over all weights and hyperparameters. Despite its empirical success, however, no rigorous motivation for this particular approximation is currently available. To address this issue, we demonstrate that SBL can be recast as the application of a rigorous variational approximation to the full model by expressing the prior in a dual form. This formulation obviates the necessity of assuming any hyperpriors and leads to natural, intuitive explanations of why sparsity is achieved in practice.\n\n1 Introduction\n\nIn an archetypical regression situation, we are presented with a collection of N regressor/target pairs {φ_i ∈ R^M, t_i ∈ R}_{i=1}^N and the goal is to find a vector of weights w such that t ≈ Φw, where t ≜ [t_1, ..., t_N]^T and Φ ≜ [φ_1, ..., φ_N]^T. SBL adopts the Gaussian likelihood model p(t|w) = N(t; Φw, σ²I) and a zero-mean Gaussian prior with variance γ_i on each weight w_i, i.e., p(w; γ) = ∏_{i=1}^M N(w_i; 0, γ_i), together with an independent hyperprior on each variance of the form p(γ_i) ∝ γ_i^{−(a+1)} exp(−b/γ_i), where a, b > 0 are constants. The crux of the actual learning procedure presented in [7] is to find some MAP estimate of γ (or more accurately, a function of γ). In practice, we find that many of the estimated γ_i's converge to zero, leading to sparse solutions since the corresponding weights, and therefore columns of Φ, can effectively be pruned from the model. 
The Gaussian assumptions, both on p(t|w) and p(w; γ), then facilitate direct, analytic computation of (6).\n\n1.2 Ambiguities in the Current SBL Derivation\n\nModern Bayesian analysis is primarily concerned with finding distributions and locations of significant probability mass, not just modes of distributions, which can be very misleading in many cases [6]. With SBL, the justification for the additional level of sophistication (i.e., the inclusion of hyperparameters) is that the adoption of the plugin rule (i.e., the approximation p(w, t) ≈ p(w, t; γ_MAP)) is reflective of the true mass, at least sufficiently so for predictive purposes. However, no rigorous motivation for this particular claim is currently available, nor is it immediately obvious exactly how the mass of this approximate distribution relates to the true mass.\n\nA more subtle difficulty arises because MAP estimation, and hence the plugin rule, is not invariant under a change in parameterization. Specifically, for an invertible function f(·),\n\n[f(γ)]_MAP ≠ f(γ_MAP). (9)\n\nDifferent transformations lead to different modes and, ultimately, different approximations to p(w, t) and therefore to p(t*|t). So how do we decide which one to use? The canonical form of SBL, and the one that has displayed remarkable success in the literature, does not in fact find a mode of p(γ|t), but a mode of p(−log γ|t). But again, why should this mode necessarily be more reflective of the desired mass than any other?\n\nAs already mentioned, SBL often leads to sparse results in practice; namely, the approximation p(w, t; γ_MAP) is typically nonzero only on a small subspace of the M-dimensional w space. 
The question remains, however: why should an approximation to the full Bayesian treatment necessarily lead to sparse results in practice?\n\nTo address all of these ambiguities, we will herein demonstrate that the sparse Bayesian learning procedure outlined above can be recast as the application of a rigorous variational approximation to the distribution p(w, t).^2 This will allow us to quantify the exact relationship between the true mass and the approximate mass of this distribution. In effect, we will demonstrate that SBL is attempting to directly capture significant portions of the probability mass of p(w, t), while still allowing us to perform the required integrations. This framework also obviates the necessity of assuming any hyperprior p(γ) and is independent of the (subjective) parameterization (e.g., γ or −log γ, etc.). Moreover, this perspective leads to natural, intuitive explanations of why sparsity is observed in practice and why, in general, this need not be the case.\n\n2 A Variational Interpretation of Sparse Bayesian Learning\n\nTo begin, we note that the ultimate goal of this analysis is to find a well-motivated approximation to the distribution\n\np(t*|t; H) ∝ ∫ p(t*|w) p(w, t; H) dw = ∫ p(t*|w) p(t|w) p(w; H) dw, (10)\n\nwhere we have explicitly noted by H the hypothesis of a model with a sparsity-inducing (possibly improper) weight prior. As already mentioned, the integration required by this form is analytically intractable and we must resort to some form of approximation. To accomplish this, we appeal to variational methods to find a viable approximation to p(w, t; H) [5]. We may then substitute this approximation into (10), leading to tractable integrations and analytic posterior distributions. To find a class of suitable approximations, we first express p(w; H) in its dual form by introducing a set of variational parameters. 
This is similar to a procedure outlined in [4] in the context of independent component analysis.\n\n^2 We note that the analysis in this paper is different from [1], which derives an alternative SBL algorithm based on variational methods.\n\n2.1 Dual Form Representation of p(w; H)\n\nAt the heart of this methodology is the ability to represent a convex function in its dual form. For example, given a convex function f(y): R → R, the dual form is given by\n\nf(y) = sup_λ [λy − f*(λ)], (11)\n\nwhere f*(λ) denotes the conjugate function. Geometrically, this can be interpreted as representing f(y) as the upper envelope, or supremum, of a set of lines parameterized by λ. The selection of f*(λ) as the intercept term ensures that each line is tangent to f(y). If we drop the maximization in (11), we obtain the bound\n\nf(y) ≥ λy − f*(λ). (12)\n\nThus, for any given λ, we have a lower bound on f(y); we may then optimize over λ to find the optimal or tightest bound in a region of interest.\n\nTo apply this theory to the problem at hand, we specify the form of our sparse prior p(w; H) = ∏_{i=1}^M p(w_i; H). Using (7) and (8), we obtain the prior\n\np(w_i; H) = ∫ p(w_i|γ_i) p(γ_i) dγ_i = C (b + w_i²/2)^{−(a+1/2)}, (13)\n\nwhich for a, b > 0 is proportional to a Student-t density. The constant C is not chosen to enforce proper normalization; rather, it is chosen to facilitate the variational analysis below. This density function can also be seen to encourage sparsity since it has heavy tails and a sharp peak at zero. Clearly p(w_i; H) is not convex in w_i; however, if we let y_i ≜ w_i² as suggested in [5] and define\n\nf(y_i) ≜ log p(w_i; H) = log C − (a + 1/2) log(b + y_i/2), (14)\n\nwe see that we now have a convex function in y_i amenable to dual representation. 
By computing the conjugate function f*(λ_i), constructing the dual, and then transforming back to p(w_i; H), we obtain the representation (see Appendix for details)\n\np(w_i; H) = max_{γ_i ≥ 0} (2πγ_i)^{−1/2} exp(−w_i²/(2γ_i)) exp(−b/γ_i) γ_i^{−a}. (15)\n\nAs a, b → 0, it is readily apparent from (15) that what were straight lines in the y_i domain are now Gaussian functions with variance γ_i in the w_i domain. Figure 1 illustrates this connection. When we drop the maximization, we obtain a lower bound on p(w_i; H) of the form\n\np(w_i; H) ≥ p(w_i; Ĥ) ≜ (2πγ_i)^{−1/2} exp(−w_i²/(2γ_i)) exp(−b/γ_i) γ_i^{−a}, (16)\n\nwhich serves as our approximate prior to p(w; H). From this relationship, we see that p(w_i; Ĥ) does not integrate to one, except in the special case when a, b → 0. We will now incorporate these results into an algorithm for finding a good Ĥ, or more accurately Ĥ(γ), since each candidate hypothesis is characterized by a different set of variational parameters.\n\n2.2 Variational Approximation to p(w, t; H)\n\nSo now that we have a variational approximation to the problematic weight prior, we must return to our original problem of estimating p(t*|t; H). Since the integration is intractable under model hypothesis H, we will instead compute p(t*|t; Ĥ) using p(w, t; Ĥ) = p(t|w) p(w; Ĥ), with p(w; Ĥ) defined as in (16). 
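The dual representation (15) can be verified numerically against the closed form (13). The sketch below is ours, not from the paper: it uses the constant C derived in the Appendix, checks that the Gaussian-shaped bound (16) never exceeds the Student-t prior, and confirms tightness at the maximizing value γ_i* = (b + w_i²/2)/(a + 1/2), which we obtained by differentiating the log of the bound.

```python
import numpy as np

def student_t_prior(w, a, b):
    # Closed form (13): p(w; H) = C (b + w^2/2)^(-(a+1/2)),
    # with C = (2*pi)^(-1/2) exp(-(a+1/2)) (a+1/2)^(a+1/2) as in (26)
    C = (2 * np.pi) ** -0.5 * np.exp(-(a + 0.5)) * (a + 0.5) ** (a + 0.5)
    return C * (b + w ** 2 / 2) ** (-(a + 0.5))

def dual_bound(w, gamma, a, b):
    # Variational lower bound (16): a scaled Gaussian in w for each gamma
    return (2 * np.pi * gamma) ** -0.5 * np.exp(-w ** 2 / (2 * gamma)) \
        * np.exp(-b / gamma) * gamma ** -a

a, b, w = 0.5, 0.1, 1.3
gammas = np.logspace(-4, 4, 2000)
bounds = dual_bound(w, gammas, a, b)

# (16) lower-bounds (13) for every gamma ...
assert np.all(bounds <= student_t_prior(w, a, b) + 1e-12)

# ... and the bound is tight at gamma* = (b + w^2/2)/(a + 1/2)
gamma_star = (b + w ** 2 / 2) / (a + 0.5)
print(np.isclose(dual_bound(w, gamma_star, a, b), student_t_prior(w, a, b)))  # prints True
```

Maximizing the bound over γ_i thus recovers (13) exactly, which is the content of (15).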
Figure 1: Variational approximation example in both y_i space and w_i space for a, b → 0. Left: Dual forms in y_i space. The solid line represents the plot of f(y_i), while the dotted lines represent variational lower bounds in the dual representation for three different values of λ_i. Right: Dual forms in w_i space. The solid line represents the plot of p(w_i; H), while the dotted lines represent Gaussian distributions with three different variances.\n\nHow do we choose this approximate model? In other words, given that different Ĥ are distinguished by different sets of variational parameters γ, how do we choose the most appropriate γ? Consistent with modern Bayesian analysis, we concern ourselves not with matching modes of distributions, but with aligning regions of significant probability mass. In choosing p(w, t; Ĥ), we would therefore like to match, where possible, significant regions of probability mass in the true model p(w, t; H). For a given t, an obvious way to do this is to select Ĥ by minimizing the sum of the misaligned mass, i.e.,\n\nĤ = arg min_Ĥ ∫ |p(w, t; H) − p(w, t; Ĥ)| dw = arg max_Ĥ ∫ p(t|w) p(w; Ĥ) dw, (17)\n\nwhere the variational assumptions have allowed us to remove the absolute value (since the bound ensures that the argument is always nonnegative). We also note that (17) is tantamount to selecting the variational approximation with maximal Bayesian evidence [6]. 
In other words, we are selecting the Ĥ, out of a class of variational approximations to H, that most probably explains the training data t, marginalized over the weights.\n\nFrom an implementational standpoint, (17) can be reexpressed using (16) as\n\nγ = arg max_γ log ∫ p(t|w) ∏_{i=1}^M p(w_i; Ĥ(γ_i)) dw = arg max_γ −(1/2)[log|Σ_t| + t^T Σ_t^{−1} t] + Σ_{i=1}^M (−b/γ_i − a log γ_i), (18)\n\nwhere Σ_t ≜ σ²I + Φ diag(γ) Φ^T. This is the same cost function as in [7], only without terms resulting from a prior on σ², which we will address later. Thus, the end result of this analysis is an evidence maximization procedure equivalent to the one in [7]. The difference is that, where before we were optimizing over a somewhat arbitrary model parameterization, now we see that it is actually optimization over the space of variational approximations to a model with a sparse, regularizing prior. Also, we know from (17) that this procedure is effectively matching, as much as possible, the mass of the full model p(w, t; H).\n\n3 Analysis\n\nWhile the variational perspective is interesting, two pertinent questions still remain:\n\n1. Why should approximating a sparse prior p(w; H) lead to sparse representations in practice?\n2. How do we extend these results to handle an unknown, random variance σ²?\n\nWe first treat Question (1). In Figure 2 below, we have illustrated a 2D example of evidence maximization within the context of variational approximations to the sparse prior p(w; H). For now, we will assume a, b → 0, which from (13) implies that p(w_i; H) ∝ 1/|w_i| for each i. On the left, the shaded area represents the region of w space where both p(w; H) and p(t|w) (and therefore p(w, t; H)) have significant probability mass. 
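The pruning behavior implied by (18) can be reproduced numerically. The following is a minimal sketch (our own toy problem, not from the paper), assuming a, b → 0, a known σ², and the EM-style hyperparameter update γ_i ← μ_i² + Σ_ii commonly used for evidence maximization in [7]:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma2 = 50, 20, 0.01

# Toy problem: t is generated from only 3 active columns of Phi
Phi = rng.standard_normal((N, M))
w_true = np.zeros(M)
w_true[[2, 7, 11]] = [1.0, -1.5, 0.8]
t = Phi @ w_true + np.sqrt(sigma2) * rng.standard_normal(N)

def evidence(gamma):
    # Cost function (18) with a, b -> 0: -1/2 [log|Sigma_t| + t' Sigma_t^-1 t]
    Sigma_t = sigma2 * np.eye(N) + Phi @ np.diag(gamma) @ Phi.T
    _, logdet = np.linalg.slogdet(Sigma_t)
    return -0.5 * (logdet + t @ np.linalg.solve(Sigma_t, t))

gamma = np.ones(M)
prev = evidence(gamma)
for _ in range(50):
    # Posterior moments of w given gamma, then EM update gamma_i = mu_i^2 + Sigma_ii
    Sigma_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(1.0 / gamma))
    mu = Sigma_w @ Phi.T @ t / sigma2
    gamma = mu ** 2 + np.diag(Sigma_w)
    cur = evidence(gamma)
    assert cur >= prev - 1e-6  # EM never decreases the evidence
    prev = cur

# Most gamma_i collapse toward zero; a few dominant ones explain the data
print(np.sort(gamma)[-3:], gamma.min())
```

Running this, the evidence is nondecreasing at every iteration and the bulk of the γ_i shrink toward zero, leaving a handful of dominant hyperparameters, mirroring the pruning discussion above.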
Maximization of (17) involves finding an approximate distribution p(w, t; Ĥ) with a substantial percentage of its mass in this region.\n\nFigure 2: Comparison between the full model and the approximate models with a, b → 0. Left: Contours of equal probability density for p(w; H) and of constant likelihood p(t|w); the prominent density and likelihood lie within each region, respectively. The shaded region represents the area where both have significant mass. Right: Here we have added the contours of p(w; Ĥ) for two different values of γ, i.e., two approximate hypotheses denoted Ĥ_a and Ĥ_b. The shaded region represents the area where both the likelihood and the approximate prior Ĥ_a have significant mass. Note that by the variational bound, each p(w; Ĥ) must lie within the contours of p(w; H).\n\nIn the plot on the right, we have graphed two approximate priors that satisfy the variational bounds, i.e., they must lie within the contours of p(w; H). We see that the narrow prior that aligns with the horizontal spine of p(w; H) places the largest percentage of its mass (and therefore the mass of p(w, t; Ĥ_a)) in the shaded region. This corresponds with a prior of\n\np(w; Ĥ_a) = p(w_1, w_2; γ_1 ≫ 0, γ_2 ≈ 0). (19)\n\nThis creates a long, narrow prior since there is minimal variance along the w_2 axis. In fact, it can be shown that, owing to the infinite density of the variational constraint along each axis (which is allowed as a and b go to zero), the maximum evidence is obtained when γ_2 is strictly equal to zero, giving the approximate prior infinite density along this axis as well. This implies that w_2 also equals zero and can be pruned from the model. In contrast, a model with significant prior variance along both axes, Ĥ_b, is hampered because it cannot extend directly out (due to the dotted variational boundary) along the spine to penetrate the likelihood.\n\nSimilar effective weight pruning occurs in higher-dimensional problems, as evidenced by simulation studies and the analysis in [3]. In higher dimensions, the algorithm only retains those weights associated with the prior spines that span a subspace penetrating the most prominent portion of the likelihood mass (i.e., a higher-dimensional analog to the shaded region already mentioned). The prior p(w; Ĥ) navigates the variational constraints, placing as much as possible of its mass in this region, driving many of the γ_i's to zero.\n\nIn contrast, when a, b > 0, the situation is somewhat different. It is not difficult to show that, assuming a noise variance σ² > 0, the variational approximation to p(w, t; H) with maximal evidence cannot have any γ_i = w_i = 0. Intuitively, this occurs because the now finite spines of the prior p(w; H), which bound the variational approximation, do not allow us to place infinite prior density in any region of weight space (as occurred previously when any γ_i → 0). Consequently, if any γ_i goes to zero with a, b > 0, the associated approximate prior mass, and therefore the approximate evidence, must also fall to zero by (16). 
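This boundary effect is easy to see in a scalar sketch (our own illustration, not from the paper): with a single basis vector and weak data correlation, the a, b → 0 evidence from (18) is maximized on the γ = 0 boundary, while any b > 0 sends the cost to −∞ as γ → 0, forcing an interior maximizer.

```python
import numpy as np

# Scalar model (M = 1): evidence cost from (18) reduces to
# L(gamma) = -1/2 [log(sigma2 + gamma*phi^2) + t^2/(sigma2 + gamma*phi^2)]
#            - b/gamma - a*log(gamma)
sigma2, phi, t = 1.0, 1.0, 0.5   # |t| below the noise floor: weak correlation

def L(gamma, a, b):
    s = sigma2 + gamma * phi ** 2
    pen = -b / gamma - a * np.log(gamma) if (a or b) else 0.0
    return -0.5 * (np.log(s) + t ** 2 / s) + pen

gammas = np.linspace(1e-6, 5, 100000)

# a, b -> 0: the maximizer is max(0, (t^2 - sigma2)/phi^2) = 0 here, so the
# evidence is maximized at the gamma -> 0 boundary and the weight is pruned
g0 = gammas[np.argmax(L(gammas, 0.0, 0.0))]

# b > 0: the -b/gamma term drives L -> -inf as gamma -> 0, so the maximizer
# stays strictly interior and no pruning occurs
g1 = gammas[np.argmax(L(gammas, 0.01, 0.01))]

print(g0, g1)   # g0 at the grid's lower edge, g1 bounded away from zero
```

The contrast between g0 and g1 is exactly the dichotomy described above: sparsity at the γ = 0 boundary when a, b → 0, and strictly nonzero hyperparameters when a, b > 0.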
As such, models with all nonzero weights will now be favored when we form the variational approximation. We therefore cannot assume that an approximation to a sparse prior will necessarily give us sparse results in practice.\n\nWe now address Question (2). Thus far, we have considered a known, fixed noise variance σ²; however, what if σ² is unknown? SBL assumes it is unknown and random with prior distribution p(1/σ²) ∝ (σ²)^{1−c} exp(−d/σ²), with c, d > 0. After integrating out the unknown σ², we arrive at the implicit likelihood equation\n\np(t|w) = ∫ p(t|w, σ²) p(σ²) dσ² ∝ (d + (1/2)‖t − Φw‖²)^{−(c̄+1/2)}, (20)\n\nwhere c̄ ≜ c + (N − 1)/2. We may then form a variational approximation to the likelihood in a similar manner as before (with w_i being replaced by ‖t − Φw‖), giving us\n\np(t|w) ≥ (2π)^{−N/2} (σ²)^{−1/2} exp(−(1/(2σ²))‖t − Φw‖²) exp(−d/σ²) (σ²)^{−c̄} = (2πσ²)^{−N/2} exp(−(1/(2σ²))‖t − Φw‖²) exp(−d/σ²) (σ²)^{−c}, (21)\n\nwhere the second step follows by substituting back in for c̄. By replacing p(t|w) with the lower bound from (21), we then maximize over the variational parameters γ and σ² via\n\nγ, σ² = arg max_{γ,σ²} −(1/2)[log|Σ_t| + t^T Σ_t^{−1} t] + Σ_{i=1}^M (−b/γ_i − a log γ_i) − d/σ² − c log σ², (22)\n\nwhich is the exact SBL optimization procedure. Thus, we see that the entire SBL framework, including noise variance estimation, can be seen in variational terms.\n\n4 Conclusions\n\nThe end result of this analysis is an evidence maximization procedure that is equivalent to the one originally formulated in [7]. 
The difference is that, where before we were optimizing over a somewhat arbitrary model parameterization, we now see that SBL is actually searching a space of variational approximations to find an alternative distribution that captures the significant mass of the full model. Moreover, from the vantage point afforded by this new perspective, we can better understand the sparsity properties of SBL and the relationship between sparse priors and approximations to sparse priors.\n\nAppendix: Derivation of the Dual Form of p(w_i; H)\n\nTo accommodate the variational analysis of Sec. 2.1, we require the dual representation of p(w_i; H). As an intermediate step, we must find the dual representation of f(y_i), where y_i ≜ w_i² and\n\nf(y_i) ≜ log p(w_i; H) = log[C (b + y_i/2)^{−(a+1/2)}]. (23)\n\nTo accomplish this, we find the conjugate function f*(λ_i) using the duality relation\n\nf*(λ_i) = max_{y_i} [λ_i y_i − f(y_i)] = max_{y_i} [λ_i y_i − log C + (a + 1/2) log(b + y_i/2)]. (24)\n\nTo find the maximizing y_i, we take the gradient of the argument of the max and set it to zero, giving us\n\ny_i^max = −a/λ_i − 1/(2λ_i) − 2b. (25)\n\nSubstituting this value into the expression for f*(λ_i) and selecting\n\nC = (2π)^{−1/2} exp[−(a + 1/2)] (a + 1/2)^{(a+1/2)}, (26)\n\nwe arrive at\n\nf*(λ_i) = (a + 1/2) log(−1/(2λ_i)) + (1/2) log 2π − 2bλ_i. (27)\n\nWe are now ready to represent f(y_i) in its dual form, observing first that we need only consider maximization over λ_i ≤ 0, since f(y_i) is a monotonically decreasing function (i.e., all tangent lines will have negative slope). 
Proceeding forward, we have\n\nf(y_i) = max_{λ_i ≤ 0} [λ_i y_i − f*(λ_i)] = max_{γ_i ≥ 0} [−y_i/(2γ_i) − (a + 1/2) log γ_i − (1/2) log 2π − b/γ_i], (28)\n\nwhere we have used the monotonically increasing transformation λ_i = −1/(2γ_i), γ_i ≥ 0. The attendant dual representation of p(w_i; H) can then be obtained by exponentiating both sides of (28) and substituting y_i = w_i²:\n\np(w_i; H) = max_{γ_i ≥ 0} (2πγ_i)^{−1/2} exp(−w_i²/(2γ_i)) exp(−b/γ_i) γ_i^{−a}. (29)\n\nAcknowledgments\n\nThis research was supported by DiMI grant #22-8376 sponsored by Nissan.\n\nReferences\n\n[1] C. Bishop and M. Tipping, “Variational relevance vector machines,” Proc. 16th Conf. Uncertainty in Artificial Intelligence, pp. 46–53, 2000.\n[2] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed., Wiley, New York, 2001.\n[3] A.C. Faul and M.E. Tipping, “Analysis of sparse Bayesian learning,” Advances in Neural Information Processing Systems 14, pp. 383–389, 2002.\n[4] M. Girolami, “A variational method for learning sparse and overcomplete representations,” Neural Computation, vol. 13, no. 11, pp. 2517–2532, 2001.\n[5] M.I. Jordan, Z. Ghahramani, T. Jaakkola, and L.K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.\n[6] D.J.C. MacKay, “Bayesian interpolation,” Neural Computation, vol. 4, no. 3, pp. 415–447, 1992.\n[7] M.E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 
211\u2013244, 2001.\n", "award": [], "sourceid": 2393, "authors": [{"given_name": "Jason", "family_name": "Palmer", "institution": null}, {"given_name": "Bhaskar", "family_name": "Rao", "institution": null}, {"given_name": "David", "family_name": "Wipf", "institution": null}]}