{"title": "Affine Independent Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2186, "page_last": 2194, "abstract": "We present a method for approximate inference for a broad class of non-conjugate probabilistic models. In particular, for the family of generalized linear model target densities we describe a rich class of variational approximating densities which can be best fit to the target by minimizing the Kullback-Leibler divergence. Our approach is based on using the Fourier representation which we show results in efficient and scalable inference.", "full_text": "Affine Independent Variational Inference\n\nEdward Challis\nDavid Barber\nDepartment of Computer Science\nUniversity College London, UK\n{edward.challis,david.barber}@cs.ucl.ac.uk\n\nAbstract\n\nWe consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating \u2018variational\u2019 density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, this class including many well known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables, for example, the fitting of skew variational densities to the target density.\n\n1 Introduction\n\nWhilst Bayesian methods have played a significant role in machine learning and related areas (see [1] for an introduction), improving the class of distributions for which inference is either tractable or can be well approximated remains an ongoing challenge. 
Within this broad field of research, variational methods have played a key role by enabling mathematical guarantees on inferences (see [28] for an overview). Our contribution is to extend the class of approximating distributions beyond classical forms to approximations that can possess skewness or other non-Gaussian characteristics, while maintaining computational efficiency.\nWe consider approximating the normalisation constant Z of a probability density function p(w)\n\n$$p(w) = \\frac{1}{Z} \\prod_{n=1}^{N} f_n(w) \\quad \\text{with} \\quad Z = \\int \\prod_{n=1}^{N} f_n(w) \\, dw \\qquad (1.1)$$\n\nwhere $w \\in \\mathbb{R}^D$ and $f_n : \\mathbb{R}^D \\to \\mathbb{R}^+$ are potential functions. Apart from special cases, evaluating Z and other marginal quantities of p(w) is difficult due to the assumed high dimensionality D of the integral. To address this we may find an approximating density q(w) to the target p(w) by minimising the Kullback-Leibler (KL) divergence\n\n$$\\mathrm{KL}(q(w)|p(w)) = \\langle \\log q(w) \\rangle_{q(w)} - \\langle \\log p(w) \\rangle_{q(w)} = -H[q(w)] - \\langle \\log p(w) \\rangle_{q(w)} \\qquad (1.2)$$\n\nwhere $\\langle f(x) \\rangle_{p(x)}$ refers to taking the expectation of f(x) with respect to the distribution p(x) and H[q(w)] is the differential entropy of the distribution q(w). The non-negativity of the KL divergence provides the lower bound\n\n$$\\log Z \\geq H[q(w)] + \\sum_{n=1}^{N} \\langle \\log f_n(w) \\rangle := B. \\qquad (1.3)$$\n\nFinding the best parameters \u03b8 of the approximating density q(w|\u03b8) is then equivalent to maximising the lower bound on log Z. This KL bounding method is constrained by the class of distributions p(w) and q(w) for which (1.3) can be efficiently evaluated. We therefore specialise on models of the form\n\n$$p(w) \\propto \\mathcal{N}(w|\\mu, \\Sigma) \\prod_{n=1}^{N} f_n(w^{T} x_n) \\qquad (1.4)$$\n\nwhere $\\{x_n\\}_{n=1}^{N}$ is a collection of fixed D dimensional real vectors and $f_n : \\mathbb{R} \\to \\mathbb{R}^+$; $\\mathcal{N}(w|\\mu, \\Sigma)$ denotes a multivariate Gaussian in w with mean \u00b5 and covariance \u03a3. This class includes Bayesian generalised linear models, latent linear models, independent components analysis and sparse linear models amongst others\u00b9. Many problems have posteriors that possess non-Gaussian characteristics resulting from strongly non-Gaussian priors or likelihood terms. For example, in financial risk modelling it is crucial that skewness and heavy tailed properties of the data are accurately captured [27]; similarly in inverse modelling, sparsity inducing priors can lead to highly non-Gaussian posteriors. It is therefore important to extend the class of tractable approximating distributions beyond standard forms such as the multivariate Gaussian [20, 12, 2, 13]. Whilst mixtures of Gaussians [4, 10, 5] have previously been developed, these typically require additional bounds. Our interest here is to consider alternative multivariate distribution classes for which the KL method is more directly applicable\u00b2.\n\nFigure 1: Two dimensional Bayesian sparse linear regression given a single data pair x, y using a Laplace prior $f_d(w) \\equiv \\frac{1}{2\\tau} e^{-|w_d|/\\tau}$ with $\\tau = 0.16$ and Gaussian likelihood $\\mathcal{N}(y|w^{T}x, \\sigma_l^2)$, $\\sigma_l^2 = 0.05$. (a) True posterior with log Z = \u22121.4026. (b) Optimal Gaussian approximation with bound value $B_G$ = \u22121.4399. (c) Optimal AI generalised-normal approximation with bound value $B_{AI}$ = \u22121.4026.\n\n\u00b9 For p(w) in this model class and Gaussian $q(w) = \\mathcal{N}(w|m, C^{T}C)$, B is tighter than \u2018local\u2019 bounds [14, 11, 22, 18, 16]. For log-concave f, B is jointly concave in (m, C) for C the Cholesky matrix [7].\n\u00b2 The skew-normal q(w) recently discussed in [21] possesses skew in one direction of parameter space only and is a special case of the AI skew-normal densities used in section 4.2.\n\u00b3 This construction is equivalent to a form of square noiseless Independent Components Analysis. See [9] and [25] for similar constructions.\n\n2 Affine independent densities\n\nWe first consider independently distributed latent variables $v \\sim q_v(v|\\theta) = \\prod_{d=1}^{D} q_{v_d}(v_d|\\theta_d)$ with \u2018base\u2019 distributions $q_{v_d}$. To enrich the representation, we form the affine transformation w = Av + b where $A \\in \\mathbb{R}^{D \\times D}$ is invertible and $b \\in \\mathbb{R}^D$. The distribution on w is then\u00b3\n\n$$q_w(w|A, b, \\theta) = \\int \\delta(w - Av - b) \\, q_v(v|\\theta) \\, dv = \\frac{1}{|\\det(A)|} \\prod_{d} q_{v_d}\\left(\\left[A^{-1}(w - b)\\right]_d \\,|\\, \\theta_d\\right) \\qquad (2.1)$$\n\nwhere $\\delta(x) = \\prod_i \\delta(x_i)$ is the Dirac delta function, $\\theta = [\\theta_1, ..., \\theta_D]$ and $[x]_d$ refers to the dth element of the vector x. Typically we assume the base distributions are homogeneous, $q_{v_d} \\equiv q_v$. For instance, if we constrain each factor $q_{v_d}(v_d|\\theta_d)$ to be the standard normal $\\mathcal{N}(v_d|0, 1)$ then $q_w(w) = \\mathcal{N}(w|b, AA^{T})$. By using, for example, Student\u2019s t, Laplace, logistic, generalised-normal or skew-normal base distributions, equation (2.1) parameterises multivariate extensions of these univariate distributions. This class of multivariate distributions has the important property that, unlike the Gaussian, they can approximate skew and/or heavy-tailed p(w). 
See figures 1, 2 and 3 for examples of two dimensional distributions $q_w(w|A, b, \\theta)$ with skew-normal and generalised-normal base distributions used to approximate toy machine learning problems.\nProvided we choose a base distribution class that includes the Gaussian as a special case (for example generalised-normal, skew-normal and asymptotically Student\u2019s t) we are guaranteed to perform at least as well as classical multivariate Gaussian KL approximate inference.\nWe note that we may arbitrarily permute the indices of v. Furthermore, since every invertible matrix is expressible as LUP for L lower, U upper and P permutation matrices, without loss of generality, we may use an LU decomposition A = LU; this reduces the complexity of subsequent computations.\nWhilst defining such Affine Independent (AI) distributions is straightforward, critically we require that the bound, equation (1.3), is fast to compute. As we explain below, this can be achieved using the Fourier transform both for the bound and its gradients. Full derivations, including formulae for skew-normal and generalised-normal base distributions, are given in the supplementary material.\n\n2.1 Evaluating the KL bound\n\nThe KL bound can be readily decomposed as\n\n$$B = \\underbrace{\\log |\\det(A)| + \\sum_{d=1}^{D} H[q(v_d|\\theta_d)]}_{\\text{Entropy}} + \\underbrace{\\langle \\log \\mathcal{N}(w|\\mu, \\Sigma) \\rangle + \\sum_{n=1}^{N} \\langle \\log f_n(w^{T} x_n) \\rangle}_{\\text{Energy}} \\qquad (2.2)$$\n\nwhere we used $H[q_w(w)] = \\log |\\det(A)| + \\sum_d H[q_{v_d}(v_d|\\theta_d)]$ (see for example [8]). For many standard base distributions the entropy $H[q_{v_d}(v_d|\\theta_d)]$ is available in closed form. When the entropy of a univariate base distribution is not analytically available, we assume it can be cheaply evaluated numerically. The energy contribution to the KL bound is the sum of the expectation of the log Gaussian term (which requires only first and second order moments) and the nonlinear \u2018site projections\u2019. The non-linear site projections (and their gradients) can be evaluated using the methods described below.\n\n2.1.1 Marginal site densities\n\nDefining $y := w^{T}x$, the expectation of the site projection for any function g and fixed vector x is equivalent to a one-dimensional expectation, $\\langle g(w^{T}x) \\rangle_{q_w(w)} = \\langle g(y) \\rangle_{q_y(y)}$ with\n\n$$q_y(y) = \\int \\delta(y - x^{T}w) \\, q_w(w) \\, dw = \\int \\delta(y - \\alpha^{T}v - \\beta) \\, q_v(v) \\, dv \\qquad (2.3)$$\n\nwhere w = Av + b and $\\alpha := A^{T}x$, $\\beta := b^{T}x$. We can rewrite this D-dimensional integral as a one dimensional integral using the integral transform $\\delta(x) = \\int e^{2\\pi i t x} dt$:\n\n$$q_y(y) = \\int \\int e^{2\\pi i t (y - \\alpha^{T}v - \\beta)} \\prod_{d=1}^{D} q_{v_d}(v_d) \\, dv \\, dt = \\int e^{2\\pi i t (y - \\beta)} \\prod_{d=1}^{D} \\tilde{q}_{u_d}(t) \\, dt \\qquad (2.4)$$\n\nwhere $\\tilde{f}(t)$ denotes the Fourier transform of f(x) and $q_{u_d}(u_d|\\theta_d)$ is the density of the random variable $u_d := \\alpha_d v_d$ so that $q_{u_d}(u_d|\\theta_d) = \\frac{1}{|\\alpha_d|} q_{v_d}(\\frac{u_d}{\\alpha_d}|\\theta_d)$. Equation (2.4) can be interpreted as the (shifted) inverse Fourier transform of the product of the Fourier transforms of $\\{q_{u_d}(u_d|\\theta_d)\\}$.\nUnfortunately, most distributions do not have Fourier transforms that admit compact analytic forms for the product $\\prod_{d=1}^{D} \\tilde{q}_{u_d}(t)$. 
The notable exception is the family of stable distributions for which linear combinations of random variables are also stable distributed (see [19] for an introduction). With the exception of the Gaussian (the only stable distribution with finite variance), Levy and Cauchy distributions, stable distributions do not have analytic forms for their density functions and are analytically expressible only in the Fourier domain. Nevertheless, when $q_v(v)$ is stable distributed, marginal quantities of w such as y can be computed analytically in the Fourier domain [3].\n\nFigure 2: Two dimensional Bayesian logistic regression with Gaussian prior $\\mathcal{N}(w|0, 10I)$ and likelihood $f_n(w) = \\sigma(\\tau_l c_n w^{T} x_n)$, $\\tau_l = 5$. Here $\\sigma(x)$ is the logistic sigmoid and $c_n \\in \\{-1, +1\\}$ the class labels; N = 4 data points. (a) True posterior with log Z = \u22121.13. (b) Optimal Gaussian approximation with bound value $B_G$ = \u22121.42. (c) Optimal AI skew-normal approximation with bound value $B_{AI}$ = \u22121.17.\n\nIn general, therefore, we need to resort to numerical methods to compute $q_y(y)$ and expectations with respect to it. To achieve this we discretise the base distributions and, by choosing a sufficiently fine discretisation, limit the maximal error that can be incurred. As such, the KL bound can be computed to any specified accuracy.\nFirst we define the set of discrete approximations to $\\{q_{u_d}(u_d|\\theta_d)\\}_{d=1}^{D}$ for $u_d := \\alpha_d v_d$. These \u2018lattice\u2019 approximations are a weighted sum of K delta functions\n\n$$q_{u_d}(u_d|\\theta_d) \\approx \\hat{q}_{u_d}(u_d) := \\sum_{k=1}^{K} \\pi_{dk} \\, \\delta(u_d - l_k) \\quad \\text{where} \\quad \\pi_{dk} = \\int_{l_k - \\frac{1}{2}\\Delta}^{l_k + \\frac{1}{2}\\Delta} q_{u_d}(u_d|\\theta_d) \\, du_d. \\qquad (2.5)$$\n\nThe lattice points $\\{l_k\\}_{k=1}^{K}$ are spaced uniformly over the domain $[l_1, l_K]$ with $\\Delta := l_{k+1} - l_k$. The weighting for each delta spike is the mass that the distribution $q_{u_d}(u_d|\\theta_d)$ assigns to the interval $[l_k - \\frac{1}{2}\\Delta, l_k + \\frac{1}{2}\\Delta]$.\nGiven the lattice approximations to the densities $\\{q_{u_d}(u_d|\\theta_d)\\}_{d=1}^{D}$ the fast Fourier transform (FFT) can be used to evaluate the convolution of the lattice distributions. Doing so we obtain the lattice approximation to the marginal $y = w^{T}x$ such that (see supplementary section 2.2)\n\n$$q_y(y) \\approx \\hat{q}_y(y) = \\sum_{k=1}^{K} \\delta(y - l_k - \\beta) \\, \\rho_k \\quad \\text{where} \\quad \\rho = \\mathrm{ifft}\\left[\\prod_{d=1}^{D} \\mathrm{fft}[\\pi'_d]\\right] \\qquad (2.6)$$\n\nwhere $\\pi_d$ is padded with (D \u2212 1)K zeros, $\\pi'_d := [\\pi_d, 0]$. The only approximation used in finding the marginal density is then the discretisation of the base distributions, with the remaining FFT calculations being exact. The time complexity for the above procedure scales $O(D^2 K \\log KD)$.\n\n2.1.2 Efficient site derivative computation\n\nWhilst we have shown that the expectation of the site projections can be accurately computed using the FFT, how to cheaply evaluate the derivative of this term is less clear. The complication can be seen by inspecting the partial derivative of $\\langle g(w^{T}x) \\rangle$ with respect to $A_{mn}$\n\n$$\\frac{\\partial}{\\partial A_{mn}} \\langle g(w^{T}x) \\rangle_{q(w)} = x_n \\int q_v(v) \\, g'(x^{T}Av + b^{T}x) \\, v_m \\, dv, \\qquad (2.7)$$\n\nwhere $g'(y) = \\frac{d}{dy} g(y)$. Naively, this can be readily reduced to a (relatively expensive) two dimensional integral. Critically, however, the computation can be simplified to a one dimensional integral. To see this we can write\n\n$$\\frac{\\partial}{\\partial A_{mn}} \\langle g(w^{T}x) \\rangle = x_n \\int g'(y) \\, d_m(y) \\, dy, \\quad \\text{where} \\quad d_m(y) := \\int v_m \\, q_v(v) \\, \\delta(y - \\alpha^{T}v - \\beta) \\, dv.$$\n\nFigure 3: Two dimensional robust linear regression with Gaussian prior $\\mathcal{N}(w|0, I)$, Laplace likelihood $f_n(w) = \\frac{1}{2\\tau_l} e^{-|y_n - w^{T}x_n|/\\tau_l}$ with $\\tau_l = 0.1581$ and 2 data pairs $x_n, y_n$. (a) True posterior with log Z = \u22123.5159. (b) Optimal Gaussian approximation with bound value $B_G$ = \u22123.6102. (c) Optimal AI generalised-normal approximation with bound value $B_{AI}$ = \u22123.5167.\n\nHere $d_m(y)$ is a univariate weighting function with Fourier transform:\n\n$$\\tilde{d}_m(t) = e^{-2\\pi i t \\beta} \\, \\tilde{e}_m(t) \\prod_{d \\neq m} \\tilde{q}_{u_d}(t), \\quad \\text{where} \\quad \\tilde{e}_m(t) := \\int \\frac{u_m}{\\alpha_m} \\, q_{u_m}(u_m) \\, e^{-2\\pi i t u_m} \\, du_m.$$\n\nSince $\\{\\tilde{q}_{u_d}(t)\\}_{d=1}^{D}$ are required to compute the expectation $\\langle g(w^{T}x) \\rangle$, the only additional computations needed to evaluate all partial derivatives with respect to A are $\\{\\tilde{e}_d(t)\\}_{d=1}^{D}$. Thus the complexity of computing the site derivative\u2074 is equivalent to the complexity of the site expectation of section 2.1.1.\nEven for non-smooth functions g the site gradient has the additional property that it is smooth, provided the base distributions are smooth. Indeed, this property extends to the KL bound itself, which has continuous partial derivatives, see supplementary material section 1. This means that gradient optimisation for AI based KL approximate inference can be applied, even when the target density is non-smooth. 
In contrast, other deterministic approximate inference routines are not universally applicable to non-smooth target densities; for instance, the Laplace approximation and [10] both require the target to be smooth.\n\n2.2 Optimising the KL bound\n\nGiven fixed base distributions, we can optimise the KL bound with respect to the parameters A = LU and b. Provided $\\{f_n\\}_{n=1}^{N}$ are log-concave the KL bound is jointly concave with respect to b and either L or U. This follows from an application of the concavity result in [7]; see the supplementary material section 3.\nUsing a similar approach to that presented in section 2.1.2 we can also efficiently evaluate the gradients of the KL bound with respect to the parameters \u03b8 that define the base distribution. These parameters \u03b8 can control higher order moments of the approximating density q(w) such as skewness and kurtosis. We can therefore jointly optimise over all parameters {A, b, \u03b8}; this means that we can fully capitalise on the expressiveness of the AI distribution class, allowing us to capture non-Gaussian structure in p(w).\nIn many modelling scenarios the best choice for $q_v(v)$ will suggest itself naturally. For example, in section 4.1 we choose the skew-normal distribution to approximate Bayesian logistic regression posteriors. For heavy-tailed posteriors that arise for example in robust or sparse Bayesian linear regression models, one choice is to use the generalised-normal as base density, which includes the Laplace and Gaussian distributions as special cases. For other models, for instance mixed data factor analysis [15], different distributions for blocks of variables of $\\{v_d\\}_{d=1}^{D}$ may be optimal. 
However, in situations for which it is not clear how to select $q_v(v)$, several different distributions can be assessed and the one that achieves the greatest lower bound B is preferred.\n\n\u2074 Further derivations and computational scaling properties are provided in supplementary section 2.\n\n2.3 Numerical issues\n\nThe computational burden of the numerical marginalisation procedure described in section 2.1.1 depends on the number of lattice points used to evaluate the convolved density function $q_y(y)$. For the results presented we implemented a simple strategy for choosing the lattice points $[l_1, ..., l_K]$. Lattice end points were chosen\u2075 such that $[l_1, l_K] = [-6\\sigma_y, 6\\sigma_y]$ where $\\sigma_y$ is the standard deviation of the random variable y: $\\sigma_y^2 = \\sum_d \\alpha_d^2 \\, \\mathrm{var}(v_d)$. From Chebyshev\u2019s inequality, taking six standard deviation end points guarantees that we capture at least 97% of the mass of $q_y(y)$. In practice this proportion is often much higher since $q_y(y)$ is often close to Gaussian for $D \\gg 1$. We fix the number of lattice points used during optimisation to suit our computational budget. To compute the final bound value we apply the simple strategy of doubling the number of lattice points until the evaluated bound changes by less than $10^{-3}$ [6].\nFully characterising the overall accuracy of the approximation as a function of the number of lattice points is complex, see [24, 26] for a related discussion. One determining factor is the condition number (ratio of largest and smallest eigenvalues) of the posterior covariance. When the condition number is large many lattice points are needed to accurately discretise the set of distributions $\\{q_{u_d}(u_d|\\theta_d)\\}_{d=1}^{D}$, which increases the time and memory requirements.\nOne possible route to circumventing these issues is to use base densities that have analytic Fourier transforms (such as a mixture of Gaussians). In such cases the discrete Fourier transform of $q_y(y)$ can be directly evaluated by computing the product of the Fourier transforms of each $\\{q_{u_d}(u_d|\\theta_d)\\}_{d=1}^{D}$. The implementation and analysis of this procedure is left for future work.\nThe computational bottleneck for AI inference, assuming N > D, arises from computing the expectation and partial derivatives of the N site projections. For parameters $w \\in \\mathbb{R}^D$ this scales $O(N D^2 K \\log DK)$. Whilst this might appear expensive it is worth considering it within the broader scope of lower bound inference methods. It was shown in [7] that exact Gaussian KL approximate inference has bound and gradient computations which scale $O(N D^2)$. Similarly, local variational bounding methods (see below) scale $O(N D^2)$ when implemented exactly.\n\n3 Related methods\n\nAnother commonly applied technique to obtain a lower bound for densities of the form of equation (1.4) is the \u2018local\u2019 variational bounding procedure [14, 11, 22, 18]. Local bounding methods approximate the normalisation constant by bounding each non-conjugate term in the integrand, equation (1.1), with a form that renders the integral tractable. In [7] we showed that the Gaussian KL bound dominates the local bound in such models. Hence, when the base distribution class includes the Gaussian, the AI KL bound is at least as tight as both the local and Gaussian KL bounds.\nOther approaches increase the flexibility of the approximating distribution by expressing $q_w(w)$ as a mixture. However, computing the entropy of a mixture distribution is in general difficult. 
Whilst one may bound the entropy term [10, 4], employing such additional bounds is undesirable since it limits the gains from using a mixture. Another recently proposed method to approximate integrals using mixtures is split mean field which iterates between constructing soft partitions of the integral domain and bounding those partitioned integrals [5]. The partitioned integrals are approximated using local or Gaussian KL bounds. Our AI method is complementary to the split mean field method since one may use the AI technique to bound each of the partitioned integrals and so achieve an improved bound.\n\n4 Experiments\n\nFor the experiments below\u2076, AI KL bound optimisation is performed using L-BFGS\u2077. Gaussian KL inference is implemented in all experiments using our own package\u2078.\n\n\u2075 For symmetric densities $\\{q_{u_d}(u_d|\\theta_d)\\}$ we arranged that their mode coincides with the central lattice point.\n\u2076 All experiments are performed in Matlab 2009b on a 32 bit Intel Core 2 Quad 2.5 GHz processor.\n\u2077 We used the minFunc package (www.di.ens.fr/~mschmidt)\n\u2078 mloss.org/software/view/308/\n\nFigure 4: Gaussian KL and AI KL approximate inference comparison for a Bayesian logistic regression model with different training data set sizes $N_{trn}$. $w \\in \\mathbb{R}^{10}$; Gaussian prior $\\mathcal{N}(w|0, 5I)$; logistic sigmoid likelihood $f_n = \\sigma(\\tau_l c_n w^{T} x_n)$ with $\\tau_l = 5$; covariates $x_n$ sampled from the standard normal, $w_{true}$ sampled from the prior and class labels $c_n = \\pm 1$ sampled from the likelihood. (a) Bound differences, $B_{AI} - B_G$, achieved using Gaussian KL and AI KL approximate inference for different training dataset sizes $N_{trn}$. Mean and standard errors are presented from 15 randomly generated models. A logarithmic difference of 0.4 corresponds to 49% improvement in the bound on the marginal likelihood. 
(b) Mean and standard error averaged test set log probability (ATLP) differences obtained with the Gaussian and AI approximate posteriors for different training dataset sizes $N_{trn}$. ATLP values calculated using $10^4$ test data points sampled from each model.\n\n4.1 Toy problems\n\nWe compare the performance of Gaussian KL and AI KL approximate inference methods in three different two dimensional generalised linear models against the true posteriors and marginal likelihood values obtained numerically. See supplementary section 4 for derivations. Figure 1 presents results for a linear regression model with a sparse Laplace prior; the AI base density is chosen to be generalised-normal. Figure 2 demonstrates approximating a Bayesian logistic regression posterior, with the AI base distribution skew-normal. Figure 3 corresponds to a Bayesian linear regression model with the noise robust Laplace likelihood density and Gaussian prior; again the AI approximation uses the generalised-normal as the base distribution. The AI KL procedure achieves a consistently higher bound than the Gaussian case, with the AI bound nearly saturating at the true value of log Z in two of the models. In addition, the AI approximation captures significant non-Gaussian features of the posterior: the approximate densities are sparse in directions of sparsity of the posterior; their modes are approximately equal (where the Gaussian mode can differ significantly); tail behaviour is more accurately captured by the AI distribution than by the Gaussian.\n\n4.2 Bayesian logistic regression\n\nWe compare Gaussian KL and AI KL approximate inference for a synthetic Bayesian logistic regression model. The AI density has skew-normal base distribution with $\\theta_d$ parameterising the skewness of $v_d$. 
We optimised the AI KL bound jointly with respect to L, U, b and \u03b8, with convergence taking on average 8 seconds with D = N = 10, compared to 0.2 seconds for Gaussian KL (we note that split mean field approximate inference was reported to take approximately 100 seconds for a similar logistic regression model achieving comparable results [20]). In figure 4 we plot the performance of the KL bound for the Gaussian versus the skew-normal AI approximation as we vary the number of datapoints. In (a) we plot the mean and standard error bound differences $B_{AI} - B_G$. For a small number of datapoints the bound difference is small. This difference increases up to D = N, and then decreases for larger datasets. This behaviour can be explained by the fact that when there are few datapoints the Gaussian prior dominates, with little difference therefore between the Gaussian and optimal AI approximation (which becomes effectively Gaussian). As more data is introduced, the non-Gaussian likelihood terms have a stronger impact and the posterior becomes significantly non-Gaussian. However as even more data is introduced the central limit theorem effect takes hold and the posterior becomes increasingly Gaussian. In (b) we plot the mean and standard error differences for the averaged test set log probabilities (ATLP) calculated using the Gaussian and AI approximate posteriors obtained in each model. For each model and each training set size the ATLP is calculated using $10^4$ test points sampled from the model. The log test set probability of each test data pair $x^*, c^*$ is calculated as $\\log \\langle p(c^*|w, x^*) \\rangle_{q(w)}$ for q(w) the approximate posterior. 
The bound differences can be seen to be strongly correlated with test set log probability differences, confirming that tighter bound values correspond to improved predictive performance.\n\n4.3 Sparse robust kernel regression\n\nIn this experiment we consider sparse noise robust kernel regression. Sparsity is encoded using a Laplace prior on the weight vectors $\\prod_d f_d(w_d)$ where $f_d(w_d) = e^{-|w_d|/\\tau_p} / 2\\tau_p$. The Laplace distribution is also used as a noise robust likelihood $f_n(w) = p(y_n|w, k_n) = e^{-|y_n - w^{T}k_n|/\\tau_l} / 2\\tau_l$ where $k_n$ is the nth vector of the kernel matrix. The squared exponential kernel was used throughout with length scale 0.05 and additive noise 1, see [23]. In all experiments the prior and likelihood were fixed with $\\tau_p = \\tau_l = 0.16$. Three datasets are considered: Boston housing\u00b9\u2070 (D = 14, $N_{trn}$ = 100, $N_{tst}$ = 406); Concrete Slump Test\u00b9\u00b9 (D = 10, $N_{trn}$ = 100, $N_{tst}$ = 930); a synthetic dataset constructed as described in [17] \u00a75.6.1 (D = 10, $N_{trn}$ = 100, $N_{tst}$ = 406). Results are collected for each data set over 10 random training and test set partitions. All datasets are zero mean unit variance normalised based on the statistics of the training data.\n\nDataset | $\\bar{B}_G$ | $\\bar{B}_{AI}$ | $\\bar{B}_{AI} - \\bar{B}_G$ | $ATLP_G$ | $ATLP_{AI}$ | $ATLP_{AI} - ATLP_G$\nConc. CS. | \u22122.08 \u00b1 0.09 | \u22122.06 \u00b1 0.09 | 0.022 \u00b1 0.004 | \u22121.70 \u00b1 0.11 | \u22121.67 \u00b1 0.11 | 0.024 \u00b1 0.010\nBoston | \u22121.28 \u00b1 0.05 | \u22121.25 \u00b1 0.05 | 0.028 \u00b1 0.003 | \u22121.18 \u00b1 0.10 | \u22121.15 \u00b1 0.09 | 0.023 \u00b1 0.006\nSynthetic | \u22122.49 \u00b1 0.10 | \u22122.46 \u00b1 0.10 | 0.028 \u00b1 0.004 | \u22121.84 \u00b1 0.11 | \u22121.83 \u00b1 0.11 | 0.009 \u00b1 0.009\n\nAI KL inference is performed with a generalised-normal base distribution. 
The parameters $\\theta_d$ control the kurtosis of the base distributions $q(v_d|\\theta_d)$; for simplicity we fix $\\theta_d = 1.5$ and optimise jointly for L, U, b. Bound optimisation took roughly 250 seconds for the AI KL procedure, compared to 5 seconds for the Gaussian KL procedure. Averaged results and standard errors are presented in the table above where $\\bar{B}$ denotes the bound value divided by the number of points in the training dataset. Whilst the improvements for these particular datasets are modest, the AI bound dominates the Gaussian bound in all three datasets, with predictive log probabilities also showing consistent improvement.\nWhilst we have only presented experimental results for AI distributions with simple analytically expressible base distributions we note the method is applicable for any base distribution provided $\\{q_{v_d}(v_d)\\}_{d=1}^{D}$ are smooth. For example smooth univariate mixtures for $q_{v_d}(v_d)$ can be used.\n\n5 Discussion\n\nAffine independent KL approximate inference has several desirable properties compared to existing deterministic bounding methods. We have shown how it generalises classical multivariate Gaussian KL approximations and our experiments confirm that the method is able to capture non-Gaussian effects in posteriors. Since we optimise the KL divergence over a larger class of approximating densities than the multivariate Gaussian, the lower bound to the normalisation constant is also improved. This is particularly useful for model selection purposes where the normalisation constant plays the role of the model likelihood.\nThere are several interesting areas open for further research. The numerical procedures presented in section 2.1 provide a general and computationally efficient means for inference in non-Gaussian densities whose application could be useful for a range of probabilistic models. 
However, our current understanding of the best approach to discretise the base densities is limited and further study of this is required, particularly for application in very large systems $D \\gg 1$. It would also be useful to investigate using base densities that directly allow for efficient computation of the marginals $q_y(y)$ in the Fourier domain.\n\n\u00b9\u2070 archive.ics.uci.edu/ml/datasets/Housing\n\u00b9\u00b9 archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test\n\nReferences\n\n[1] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.\n[2] D. Barber and C. M. Bishop. Ensemble Learning for Multi-Layer Networks. In Advances in Neural Information Processing Systems, NIPS 10, 1998.\n[3] D. Bickson and C. Guestrin. Inference with Multivariate Heavy-Tails in Linear Models. In Advances in Neural Information Processing Systems, NIPS 23, 2010.\n[4] C. M. Bishop, N. Lawrence, T. Jaakkola, and M. I. Jordan. Approximating Posterior Distributions in Belief Networks Using Mixtures. In Advances in Neural Information Processing Systems, NIPS 10, 1998.\n[5] G. Bouchard and O. Zoeter. Split Variational Inference. In International Conference on Artificial Intelligence and Statistics, AISTATS, 2009.\n[6] R. N. Bracewell. The Fourier Transform and its Applications. McGraw-Hill Book Co, Singapore, 2000.\n[7] E. Challis and D. Barber. Concave Gaussian Variational Approximations for Inference in Large-Scale Bayesian Linear Models. In International Conference on Artificial Intelligence and Statistics, AISTATS, 2011.\n[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.\n[9] J. T. A. S. Ferreira and M. F. J. Steel. A New Class of Skewed Multivariate Distributions with Applications To Regression Analysis. Statistica Sinica, 17:505\u2013529, 2007.\n[10] S. Gershman, M. Hoffman, and D. Blei. Nonparametric Variational Inference. 
In International Conference on\nMachine Learning, ICML 29, 2012.\n[11] M. Girolami. A Variational Method for Learning Sparse and Overcomplete Representations. Neural\nComputation, 13:2517\u20132532, 2001.\n[12] A. Graves. Practical Variational Inference for Neural Networks. In Advances in Neural Information\nProcessing Systems, NIPS 24, 2011.\n[13] A. Honkela and H. Valpola. Unsupervised Variational Bayesian Learning of Nonlinear Models. In\nAdvances in Neural Information Processing Systems, NIPS 17, 2005.\n[14] T. Jaakkola and M. Jordan. A Variational Approach to Bayesian Logistic Regression Problems and their\nExtensions. In Artificial Intelligence and Statistics, AISTATS 6, 1996.\n[15] M. E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational Bounds for Mixed-Data Factor Analysis.\nIn Advances in Neural Information Processing Systems, NIPS 23, 2010.\n[16] D. A. Knowles and T. Minka. Non-conjugate Variational Message Passing for Multinomial and Binary\nRegression. In Advances in Neural Information Processing Systems, NIPS 23, 2011.\n[17] M. Kuss. Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning.\nPhD thesis, Technische Universit\u00e4t Darmstadt, Darmstadt, Germany, 2006.\n[18] H. Nickisch and M. Seeger. Convex Variational Bayesian Inference for Large Scale Generalized Linear\nModels. In International Conference on Machine Learning, ICML 26, 2009.\n[19] J. P. Nolan. Stable Distributions - Models for Heavy Tailed Data. Birkhauser, Boston, 2012. In progress,\nChapter 1 online at academic2.american.edu/~jpnolan.\n[20] M. Opper and C. Archambeau. The Variational Gaussian Approximation Revisited. Neural Computation,\n21(3):786\u2013792, 2009.\n[21] J. Ormerod. Skew-Normal Variational Approximations for Bayesian Inference. Technical Report\nCRG-TR-93-1, School of Mathematics and Statistics, University of Sydney, 2011.\n[22] A. Palmer, D. Wipf, K. 
Kreutz-Delgado, and B. Rao. Variational EM Algorithms for Non-Gaussian Latent\nVariable Models. In Advances in Neural Information Processing Systems, NIPS 20, 2006.\n[23] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[24] P. Ruckdeschel and M. Kohl. General Purpose Convolution Algorithm in S4-Classes by means of FFT.\nTechnical Report 1006.0764v2, arXiv.org, 2010.\n[25] S. K. Sahu, D. K. Dey, and M. D. Branco. A New Class of Multivariate Skew Distributions with\nApplications to Bayesian Regression Models. The Canadian Journal of Statistics / La Revue Canadienne de\nStatistique, 31(2):129\u2013150, 2003.\n[26] P. Schaller and G. Temnov. Efficient and precise computation of convolutions: applying FFT to heavy\ntailed distributions. Computational Methods in Applied Mathematics, 8(2):187\u2013200, 2008.\n[27] S. Chib, F. Nardari, and N. Shephard. Markov chain Monte Carlo methods for stochastic volatility\nmodels. Journal of Econometrics, 108(2):281\u2013316, 2002.\n[28] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference.\nFoundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.", "award": [], "sourceid": 1086, "authors": [{"given_name": "Edward", "family_name": "Challis", "institution": null}, {"given_name": "David", "family_name": "Barber", "institution": null}]}