{"title": "Automated Variational Inference for Gaussian Process Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1404, "page_last": 1412, "abstract": "We develop an automated variational method for approximate inference in Gaussian process (GP) models whose posteriors are often intractable. Using a mixture of Gaussians as the variational distribution, we show that (i) the variational objective and its gradients can be approximated efficiently via sampling from univariate Gaussian distributions and (ii) the gradients of the GP hyperparameters can be obtained analytically regardless of the model likelihood. We further propose two instances of the variational distribution whose covariance matrices can be parametrized linearly in the number of observations. These results allow gradient-based optimization to be done efficiently in a black-box manner. Our approach is thoroughly verified on 5 models using 6 benchmark datasets, performing as well as the exact or hard-coded implementations while running orders of magnitude faster than the alternative MCMC sampling approaches. Our method can be a valuable tool for practitioners and researchers to investigate new models with minimal effort in deriving model-specific inference algorithms.", "full_text": "Automated Variational Inference\n\nfor Gaussian Process Models\n\nTrung V. Nguyen\nANU & NICTA\n\nVanTrung.Nguyen@nicta.com.au\n\nEdwin V. Bonilla\n\nThe University of New South Wales\n\ne.bonilla@unsw.edu.au\n\nAbstract\n\nWe develop an automated variational method for approximate inference in Gaus-\nsian process (GP) models whose posteriors are often intractable. Using a mixture\nof Gaussians as the variational distribution, we show that (i) the variational objec-\ntive and its gradients can be approximated ef\ufb01ciently via sampling from univari-\nate Gaussian distributions and (ii) the gradients wrt the GP hyperparameters can\nbe obtained analytically regardless of the model likelihood. 
We further propose\ntwo instances of the variational distribution whose covariance matrices can be\nparametrized linearly in the number of observations. These results allow gradient-\nbased optimization to be done ef\ufb01ciently in a black-box manner. Our approach is\nthoroughly veri\ufb01ed on \ufb01ve models using six benchmark datasets, performing as\nwell as the exact or hard-coded implementations while running orders of magni-\ntude faster than the alternative MCMC sampling approaches. Our method can be\na valuable tool for practitioners and researchers to investigate new models with\nminimal effort in deriving model-speci\ufb01c inference algorithms.\n\n1\n\nIntroduction\n\nGaussian processes (GPs, [1]) are a popular choice in practical Bayesian non-parametric modeling.\nThe most straightforward application of GPs is the standard regression model with Gaussian like-\nlihood, for which the posterior can be computed in closed form. However, analytical tractability is\nno longer possible when having non-Gaussian likelihoods, and inference must be carried out via ap-\nproximate methods, among which Markov chain Monte Carlo (MCMC, see e.g. [2]) and variational\ninference [3] are arguably the two techniques most widely used.\nMCMC algorithms provide a \ufb02exible framework for sampling from complex posterior distributions\nof probabilistic models. However, their generality comes at the expense of very high computational\ncost as well as cumbersome convergence analysis. Furthermore, methods such as Gibbs sampling\nmay perform poorly when there are strong dependencies among the variables of interest. Other\nalgorithms such as the elliptical slice sampling (ESS) developed in [4] are more effective at drawing\nsamples from strongly correlated Gaussians. 
Nevertheless, while improving upon generic MCMC methods, the sampling cost of ESS remains a major challenge for practical use.\nAn alternative to MCMC is the deterministic approximation approach via variational inference, which has been used in numerous applications with some empirical success (see e.g. [5, 6, 7, 8, 9, 10, 11]). The main insight from variational methods is that optimizing is generally easier than integrating. Indeed, they approximate a posterior by optimizing a lower bound of the marginal likelihood, the so-called evidence lower bound (ELBO). While variational inference can be considerably faster than MCMC, it lacks MCMC’s broader applicability, as it requires derivations of the ELBO and its gradients on a model-by-model basis.\nThis paper develops an automated variational inference technique for GP models that not only reduces the overhead of the tedious mathematical derivations inherent to variational methods but also allows their application to a wide range of problems. In particular, we consider Gaussian process models that satisfy the following properties: (i) factorization across latent functions and (ii) factorization across observations. The former assumes that, when there is more than one latent function, the latent functions are generated from independent GPs. The latter assumes that, given the latent functions, the observations are conditionally independent. Existing GP models, such as regression [1], binary and multi-class classification [6, 12], warped GPs [13], log Gaussian Cox processes [14], and multi-output regression [15], all fall into this class of models. 
We note, however, that our approach goes beyond\nstandard settings for which elaborate learning machinery has been developed, as we only require\naccess to the likelihood function in a black-box manner.\nOur automated deterministic inference method uses a mixture of Gaussians as the approximating\nposterior distribution and exploits the decomposition of the ELBO into a KL divergence term and\nan expected log likelihood term. In particular, we derive an analytical lower bound for the KL term;\nand we show that the expected log likelihood term and its gradients can be computed ef\ufb01ciently by\nsampling from univariate Gaussian distributions, without explicitly requiring gradients of the likeli-\nhood. Furthermore, we optimize the GP hyperparameters within the same variational framework by\nusing their analytical gradients, irrespective of the speci\ufb01cs of the likelihood models.\nAdditionally, we exploit the ef\ufb01cient parametrization of the covariance matrices in the models, which\nis linear in the number of observations, along with variance-reduction techniques in order to pro-\nvide an automated inference framework that is useful in practice. We verify the effectiveness of\nour method with extensive experiments on 5 different GP settings using 6 benchmark datasets. We\nshow that our approach performs as well as exact GPs or hard-coded deterministic inference imple-\nmentations, and that it can be up to several orders of magnitude faster than state-of-the-art MCMC\napproaches.\nRelated work\n\nBlack box variational inference (BBVI, [16]) has recently been developed for general latent variable\nmodels. Due to this generality, it under-utilizes the rich amount of information available in GP mod-\nels that we previously discussed. For example, BBVI approximates the KL term of the ELBO, but\nthis is computed analytically in our method. 
A clear disadvantage of BBVI is that it does not provide an analytical or practical way of learning the covariance hyperparameters of GPs – in fact, these are set to fixed values. In principle, these values can be learned in BBVI using stochastic optimization, but experimentally we have found this to be problematic, ineffectual, and time-consuming. In contrast, our method optimizes the hyperparameters using their exact gradients.\nAn approach more closely related to ours is that of [17], which investigates variational inference for GP models with one latent function and a factorial likelihood. Their main result is an efficient parametrization when using a standard variational Gaussian distribution. Our method is more general in that it allows multiple latent functions, hence being applicable to settings such as multi-class classification and multi-output regression. Furthermore, our variational distribution is a mixture of Gaussians, with the full Gaussian distribution being a particular case. Another recent approach to deterministic approximate inference is the Integrated Nested Laplace Approximation (INLA, [18]). INLA uses numerical integration to approximate the marginal likelihood, which makes it unsuitable for GP models that contain a large number of hyperparameters.\n\n2 A family of GP models\n\nWe consider supervised learning problems with a dataset of N training inputs x = {xn}_{n=1}^{N} and their corresponding targets y = {yn}_{n=1}^{N}. The mapping from inputs to outputs is established via Q underlying latent functions, and our objective is to reason about these latent functions from the observed data. 
We specify a class of GP models for which the priors and the likelihoods have the following structure:\n\np(f | θ0) = ∏_{j=1}^{Q} p(f•j | θ0) = ∏_{j=1}^{Q} N(f•j; 0, Kj),   (1)\n\np(y | f, θ1) = ∏_{n=1}^{N} p(yn | fn•, θ1),   (2)\n\nwhere f is the set of all latent function values; f•j = {fj(xn)}_{n=1}^{N} denotes the values of latent function j; fn• = {fj(xn)}_{j=1}^{Q} is the set of latent function values upon which yn depends; Kj is the covariance matrix evaluated at every pair of inputs induced by the covariance function kj(·,·); and θ0 and θ1 are covariance hyperparameters and likelihood parameters, respectively.\nIn other words, the class of models specified by Equations (1) and (2) satisfies the following two criteria: (a) factorization of the prior over the latent functions and (b) factorization of the conditional likelihood over the observations. Existing GP models including GP regression [1], binary classification [6, 12], warped GPs [13], log Gaussian Cox processes [14], multi-class classification [12], and multi-output regression [15] all belong to this family of models.\n\n3 Automated variational inference for GP models\n\nThis section describes our automated inference framework for posterior inference of the latent functions for the given family of models. Apart from Equations (1) and (2), we only require access to the likelihood function in a black-box manner, i.e. specific knowledge of its shape or its gradient is not needed. Posterior inference for general (non-Gaussian) likelihoods is analytically intractable. We build our posterior approximation framework upon variational inference principles. This entails positing a tractable family of distributions and finding the member of the family that is “closest” to the true posterior in terms of their KL divergence. 
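To make the black-box requirement concrete, the following numpy sketch (our own illustration, not code from the paper) draws Q latent functions from the factorized prior of Equation (1) and evaluates a factorial likelihood of the form in Equation (2) supplied as an opaque function. The squared exponential kernel and the Bernoulli (logistic) likelihood are stand-in assumptions chosen only for the example.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    # Squared exponential covariance evaluated at every pair of inputs.
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
N, Q = 40, 2                              # observations, latent functions
x = np.linspace(0.0, 5.0, N)
K = rbf_kernel(x) + 1e-8 * np.eye(N)      # jitter for numerical stability

# Equation (1): each latent function is drawn from an independent GP prior.
f = np.stack([rng.multivariate_normal(np.zeros(N), K) for _ in range(Q)])

# Equation (2): the conditional likelihood factorizes over observations.
# The inference framework only ever calls log_lik as an opaque function;
# a Bernoulli likelihood on the sum of the latents is a stand-in example.
def log_lik(yn, fn):
    p = 1.0 / (1.0 + np.exp(-fn.sum()))   # yn depends on fn• = {f_j(x_n)}
    return yn * np.log(p) + (1 - yn) * np.log(1 - p)

y = (rng.random(N) < 1.0 / (1.0 + np.exp(-f.sum(axis=0)))).astype(int)
log_joint_lik = sum(log_lik(y[n], f[:, n]) for n in range(N))
```

Any likelihood satisfying the factorization in Equation (2) could replace `log_lik` without changing the rest of the pipeline, which is the sense in which inference is automated.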
Herein we choose the family of mixtures of Gaussians (MoG) with K components, defined as\n\nq(f | λ) = (1/K) ∑_{k=1}^{K} qk(f | mk, Sk) = (1/K) ∑_{k=1}^{K} ∏_{j=1}^{Q} N(f•j; mkj, Skj),   λ = {mkj, Skj},   (3)\n\nwhere qk(f | mk, Sk) is component k with variational parameters mk = {mkj}_{j=1}^{Q} and Sk = {Skj}_{j=1}^{Q}. Less general MoGs with isotropic covariances have been used with variational inference in [7, 19]. Note that within each component, the posteriors over the latent functions are independent.\nMinimizing the divergence KL[q(f | λ) || p(f | y)] is equivalent to maximizing the evidence lower bound (ELBO) given by:\n\nlog p(y) ≥ E_q[−log q(f | λ)] + E_q[log p(f)] + (1/K) ∑_{k=1}^{K} E_{qk}[log p(y | f)] ≜ L,   (4)\n\nwhere the first two terms together equal −KL[q(f | λ) || p(f)].\nObserve that the KL term in Equation (4) does not depend on the likelihood. The remaining term, called the expected log likelihood (ELL), is the only contribution of the likelihood to the ELBO. We can thus address the technical difficulties regarding each component and their derivatives separately using different approaches. In particular, we can obtain a lower bound of the first term (KL) and approximate the second term (ELL) via sampling. Due to the limited space, we only show the main results and refer the reader to the supplementary material for derivation details.\n\n3.1 A lower bound of −KL[q(f | λ) || p(f)]\n\nThe first component of the KL divergence term is the entropy of a Gaussian mixture, which is not analytically tractable. However, a lower bound of this entropy can be obtained using Jensen’s inequality (see e.g. 
[20]) giving:\n\nE_q[−log q(f | λ)] ≥ −(1/K) ∑_{k=1}^{K} log [ (1/K) ∑_{l=1}^{K} N(mk; ml, Sk + Sl) ].   (5)\n\nThe second component of the KL term is a negative cross-entropy between a Gaussian mixture and a Gaussian, which can be computed analytically giving:\n\nE_q[log p(f)] = −(1/(2K)) ∑_{k=1}^{K} ∑_{j=1}^{Q} [ N log 2π + log |Kj| + mkjᵀ Kj⁻¹ mkj + tr(Kj⁻¹ Skj) ].   (6)\n\nThe gradients of the two terms in Equations (5) and (6) with respect to the variational parameters can be computed analytically and are given in the supplementary material.\n\n3.2 An approximation to the expected log likelihood (ELL)\n\nIt is clear from Equation (4) that the ELL can be obtained via the ELLs of the individual mixture components E_{qk}[log p(y | f)]. Due to the factorial assumption on p(y | f), the expectation becomes:\n\nE_{qk}[log p(y | f)] = ∑_{n=1}^{N} E_{qk(n)}[log p(yn | fn•)],   (7)\n\nwhere qk(n) = qk(n)(fn• | λk(n)) is the marginal posterior with variational parameters λk(n) that correspond to fn•. The gradients of these individual ELL terms with respect to the variational parameters λk(n) are given by:\n\n∇_{λk(n)} E_{qk(n)}[log p(yn | fn•)] = E_{qk(n)}[ ∇_{λk(n)} log qk(n)(fn• | λk(n)) · log p(yn | fn•) ].   (8)\n\nUsing Equations (7) and (8) we establish the following theorem regarding the computation of the ELL and its gradients.\n\nTheorem 1. The expected log likelihood and its gradients can be approximated using samples from univariate Gaussian distributions.\n\nThe proof is in Section 1 of the supplementary material. A less general result, for the case of one latent function and a variational Gaussian posterior, was obtained in [17] using a different derivation. Note that when Q > 1, qk(n) is not a univariate marginal. 
Nevertheless, it has a diagonal covariance matrix due to the factorization of the latent posteriors, so the theorem still holds.\n\n3.3 Learning of the variational parameters and other model parameters\n\nIn order to learn the parameters of the model we use gradient-based optimization of the ELBO. For this we require the gradients of the ELBO with respect to all model parameters.\n\nVariational parameters. The noisy gradients of the ELBO w.r.t. the variational means mk(n) and variances Sk(n) corresponding to data point n are given by:\n\n∇_{mk(n)} L ≈ ∇_{mk(n)} L_ent + ∇_{mk(n)} L_cross + (1/(KS)) ∑_{i=1}^{S} s⁻¹_{k(n)} ∘ (f^i_{n•} − mk(n)) log p(yn | f^i_{n•}),   (9)\n\n∇_{Sk(n)} L ≈ ∇_{Sk(n)} L_ent + ∇_{Sk(n)} L_cross + (1/(2KS)) ∑_{i=1}^{S} dg( s⁻¹_{k(n)} ∘ s⁻¹_{k(n)} ∘ (f^i_{n•} − mk(n)) ∘ (f^i_{n•} − mk(n)) − s⁻¹_{k(n)} ) log p(yn | f^i_{n•}),   (10)\n\nwhere ∘ is the entrywise Hadamard product; {f^i_{n•}}_{i=1}^{S} are samples from qk(n)(fn• | mk(n), sk(n)); sk(n) is the diagonal of Sk(n) and s⁻¹_{k(n)} is the element-wise inverse of sk(n); dg turns a vector into a diagonal matrix; and L_ent = E_q[−log q(f | λ)] and L_cross = E_q[log p(f)] are given by Equations (5) and (6). The control variates technique described in [16] is also used to further reduce the variance of these estimators.\n\nCovariance hyperparameters. The ELBO in Equation (4) reveals a remarkable property: the hyperparameters depend only on the negative cross-entropy term E_q[log p(f)], whose exact expression was derived in Equation (6). This has a significant practical implication: despite using black-box inference, the hyperparameters are optimized with respect to the true evidence lower bound (given fixed variational parameters). 
This is an additional and crucial advantage of our automated inference method over other generic inference techniques [16], which seem incapable of hyperparameter learning, in part because there are not yet techniques for reducing the variance of the corresponding gradient estimators. The gradient of the ELBO with respect to any hyperparameter θ of the j-th covariance function is given by:\n\n∇_θ L = −(1/(2K)) ∑_{k=1}^{K} tr( Kj⁻¹ ∇_θ Kj − Kj⁻¹ ∇_θ Kj Kj⁻¹ (mkj mkjᵀ + Skj) ).   (11)\n\nLikelihood parameters. The noisy gradients w.r.t. the likelihood parameters can also be estimated via samples from univariate marginals:\n\n∇_{θ1} L ≈ (1/(KS)) ∑_{k=1}^{K} ∑_{n=1}^{N} ∑_{i=1}^{S} ∇_{θ1} log p(yn | f^{k,i}_{(n)}, θ1),   where f^{k,i}_{(n)} ∼ qk(n)(fn• | mk(n), sk(n)).   (12)\n\n3.4 Practical variational distributions\n\nThe gradients from the previous section may be used for automated variational inference in GP models. However, the mixture of Gaussians (MoG) requires O(N²) variational parameters for each covariance matrix, i.e. we need to estimate a total of O(QKN²) parameters. This causes difficulties for learning when these parameters are optimized simultaneously. This section introduces two special members of the MoG family that improve the practical tractability of our inference framework.\n\nFull Gaussian posterior. This instance is the mixture with only one component and is thus a Gaussian distribution. Its covariance matrix has a block-diagonal structure, where each block is a full covariance corresponding to that of a single latent function posterior. We thus refer to it as the full Gaussian posterior. As stated in the following theorem, full Gaussian posteriors can still be estimated efficiently in our variational framework.\n\nTheorem 2. 
Only O(QN) variational parameters are required to parametrize the latent posteriors with full covariance structure.\n\nThe proof is given in Section 2 of the supplementary material. This result has been stated previously (see e.g. [6, 7, 17]), but for specific models belonging to the class of GP models considered here.\n\nMixture of diagonal Gaussians posterior. Our second practical variational posterior is a Gaussian mixture with diagonal covariances, yielding two immediate benefits. Firstly, only O(QN) parameters are required for each mixture component. Secondly, computation is more efficient, as inverting a diagonal covariance can be done in linear time. Furthermore, as a result of the following theorem, optimization will typically converge faster when using a mixture of diagonal Gaussians.\n\nTheorem 3. The estimator of the gradients with respect to the variational parameters using the mixture of diagonal Gaussians has a lower variance than the full Gaussian posterior’s.\n\nThe proof is in Section 3 of the supplementary material and is based on the Rao-Blackwellization technique [21]. We note that this result is different from that in [16]. In particular, our variational distribution is a mixture, and thus multi-modal. The theorem is only made possible by the analytical tractability of the KL term in the ELBO.\nGiven the noisy gradients, we use off-the-shelf gradient-based optimizers, such as conjugate gradient, to learn the model parameters. 
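As a minimal illustration of how such a noisy gradient is formed (simplified to one latent function and a single Gaussian component, without the control variates of [16]; this is our own sketch, not the authors' implementation), the score-function identity of Equation (8) requires only univariate Gaussian samples and black-box evaluations of the log-likelihood:

```python
import numpy as np

def ell_grad_mean(m, s, log_lik, num_samples=100_000, seed=0):
    # Monte Carlo estimate of d/dm E_{N(f; m, s)}[log p(y | f)] via the
    # score-function identity E_q[(d/dm log q(f)) * log p(y | f)], where
    # d/dm log N(f; m, s) = (f - m) / s. No gradient of log_lik is needed.
    rng = np.random.default_rng(seed)
    f = m + np.sqrt(s) * rng.standard_normal(num_samples)
    return np.mean((f - m) / s * log_lik(f))

# Sanity check with a Gaussian likelihood log N(y; f, 1): here the exact
# gradient of the expected log likelihood with respect to m is (y - m).
y, m, s = 5.0, 0.0, 1.0
log_lik = lambda f: -0.5 * np.log(2 * np.pi) - 0.5 * (y - f) ** 2
grad_est = ell_grad_mean(m, s, log_lik)   # close to y - m = 5 for large S
```

Without variance reduction this estimator is noisy, which is why the framework pairs it with the analytically tractable KL terms and control variates.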
Note that stochastic optimization may also be used, but it may require significant time and effort in tuning the learning rates.\n\n3.5 Prediction\n\nGiven the MoG posterior, the predictive distribution for new test points x∗ is given by:\n\np(y∗ | x∗) = (1/K) ∑_{k=1}^{K} ∫ p(y∗ | f∗) [ ∫ p(f∗ | f) qk(f) df ] df∗.   (13)\n\nThe inner integral is the predictive distribution of the latent values f∗, and it is Gaussian since both qk(f) and p(f∗ | f) are Gaussian. The probability of the test points taking values y∗ (e.g. in classification) can thus be readily estimated via Monte Carlo sampling. The predictive means and variances of a MoG can be obtained from those of the individual mixture components, as described in Section 6 of the supplementary material.\n\nTable 1: Datasets, their statistics, and the corresponding likelihood functions and models used in the experiments, where Ntrain, Ntest, and D are the training size, testing size, and the input dimension, respectively. See text for a detailed description of the models.\n\nDataset           Ntrain  Ntest  D    Likelihood p(y|f)         Model\nMining disasters  811     0      1    λ^y exp(−λ)/y!            Log Gaussian Cox process\nBoston housing    300     206    13   N(y; f, σ²)               Standard regression\nCreep             800     1266   30   ∇y t(y) N(t(y); f, σ²)    Warped Gaussian processes\nAbalone           1000    3177   8    same as above             Warped Gaussian processes\nBreast cancer     300     383    9    1/(1 + exp(−f))           Binary classification\nUSPS              1233    1232   256  exp(fc)/∑i exp(fi)        Multi-class classification\n\n4 Experiments\n\nWe perform experiments with five GP models: standard regression [1], warped GPs [13], binary classification [6, 12], multi-class classification [12], and log Gaussian Cox processes [14] on six datasets (see Table 1), and repeat the experiments five times using different data subsets.\n\nExperimental settings. 
The squared exponential covariance function with automatic relevance de-\ntermination (see Ch. 4 in [1]) is used with the GP regression and warped GPs. The isotropic co-\nvariance is used with all other models. The noisy gradients of the ELBO are approximated with\n2000 samples and 200 samples are used with control variates to reduce the variance of the gradient\nestimators. The model parameters (variational, covariance hyperparameters and likelihood parame-\nters) are learned by iteratively optimizing one set while \ufb01xing the others until convergence, which is\ndetermined when changes are less than 1e-5 for the ELBO or 1e-3 for the variational parameters.\nEvaluation metrics. To assess the predictive accuracy, we use the standardized squared error (SSE)\nfor the regression tasks and the classi\ufb01cation error rates for the classi\ufb01cation tasks. The negative log\npredictive density (NLPD) is also used to evaluate the con\ufb01dence of the prediction. For all of the\nmetrics, smaller \ufb01gures are better.\nNotations. We call our method AGP and use AGP-FULL, AGP-MIX and AGP-MIX2 when\nusing the full Gaussian and the mixture of diagonal Gaussians with 1 and 2 components, respectively.\nDetails of these two posteriors were given in Section 3.4. On the plots, we use the shorter notations,\nFULL, MIX, and MIX2 due to the limited space.\nReading the box plots. We used box plots to give a more complete picture of the predictive per-\nformance. Each plot corresponds to the distribution of a particular metric evaluated at all test points\nfor a given task. The edges of a box are the q1 = 25th and q3 = 75th percentiles and the central\nmark is the median. The dotted line marks the limit of extreme points that are greater than the 97.5th\npercentile. The whiskers enclose the points in the range (q1\u2212 1.5(q3\u2212 q1), q3 +1.5(q3\u2212 q1)), which\namounts to approximately \u00b12.7\u03c3 if the data is normally distributed. 
The points outside the whiskers\nand below the dotted line are outliers and are plotted individually.\n\n4.1 Standard regression\n\nFirst we consider the standard Gaussian process regression for which the predictive distribution can\nbe computed analytically. We compare with this exact inference method (GPR) using the Boston\nhousing dataset [22]. The results in Figure 1 show that AGP-FULL achieves nearly identical per-\nformance as GPR. This is expected as the analytical posterior is a full Gaussian. AGP-MIX and\nAGP-MIX2 also give comparable performance in terms of the median SSE and NLPD.\n\n4.2 Warped Gaussian processes (WGP)\n\nThe WGP allows for non-Gaussian processes and non-Gaussian noises. The likelihood for each\ntarget yn is attained by warping it through a nonlinear monotonic transformation t(y) giving\np(yn|fn) = \u2207yn t(yn)N (t(yn)|fn, \u03c32). We used the same neural net style transformation as in\n[13]. We \ufb01xed the warp parameters and used the same procedure for making analytical approxima-\ntions to the predicted means and variances for all methods.\n\n6\n\n\fFigure 1: The distributions of SSE and NLPD of all methods on the regression task. Compared to the\nexact inference method GPR, the performance of AGP-FULL is identical while that of AGP-MIX\nand AGP-MIX2 are comparable.\n\nFigure 2: The distributions of SSE and NLPD of all methods on the regression task with warped\nGPs. The AGP methods (FULL, MIX and MIX2) give comparable performance to exact inference\nwith WGP and slightly outperform GPR which has narrower ranges of predictive variances.\n\nWe compare with the exact implementation of [13] and the standard GP regression (GPR) on the\nCreep [23] and Abalone [22] datasets. The results in Figure 2 show that the AGP methods give\ncomparable performance to the exact method WGP and slightly outperform GPR. 
The prediction by GPR exhibits characteristically narrower ranges of predictive variances, which can be attributed to its Gaussian noise assumption.\n\n4.3 Classification\n\nFor binary classification, we use the logistic likelihood and experiment with the Wisconsin breast cancer dataset [22]. We compare with the variational bounds (VBO) and expectation propagation (EP) methods. Details of VBO and EP can be found in [6]. All methods use the same analytical approximations when making predictions.\nFor multi-class classification, we use the softmax likelihood and experiment with a subset of the USPS dataset [1] containing the digits 4, 7, and 9. We compare with a variational inference method (VQ) which constructs the ELBO via a quadratic lower bound to the likelihood terms [5]. Prediction is made by squashing the samples from the predictive distributions of the latent values at test points through the softmax likelihood for all methods.\n\nFigure 3: Left plot: classification error rates averaged over 5 runs (the error bars show two standard deviations). The AGP methods have classification errors comparable to the hard-coded implementations. Middle and right plots: the distribution of NLPD of all methods on the binary and multi-class classification tasks, respectively. The hard-coded methods are slightly better than AGP.\n\nFigure 4: Left plot: the true event counts during the given time period. Middle plot: the posteriors (estimated intensities) inferred by all methods. 
For each method, the middle line is the posterior mean and the two remaining lines enclose the 90% interval. AGP-FULL infers the same posterior as HMC and ESS, while AGP-MIX obtains the same mean but underestimates the variance. Right plot: speed-up factors against the HMC method. The AGP methods run more than two orders of magnitude faster than the sampling methods.\n\nThe classification error rates and the NLPD are shown in Figure 3 for both tasks. For binary classification, the AGP methods give comparable performance to the hard-coded implementations, VBO and EP. The latter is often considered the best approximation method for this task [6]. Similar results can be observed for the multi-class classification problem.\nWe note that the running times of our methods are comparable to those of the hard-coded methods. For example, the average training times for VBO, EP, MIX, and FULL are 76s, 63s, 210s, and 480s, respectively, on the Wisconsin dataset.\n\n4.4 Log Gaussian Cox process (LGCP)\n\nThe LGCP is an inhomogeneous Poisson process with the log-intensity function being a shifted draw from a Gaussian process. Following [4], we use the likelihood p(yn | fn) = λn^{yn} exp(−λn) / yn!, where λn = exp(fn + m) is the mean of a Poisson distribution and m is the offset to the log mean. The data concern coal-mining disasters taken from a standard dataset for testing point processes [24]. The offset m and the covariance hyperparameters are set to the same values as in [4].\nWe compare AGP with the hybrid Monte Carlo (HMC, [25]) and elliptical slice sampling (ESS, [4]) methods, where the latter is designed specifically for GP models. We collected every 100th sample for a total of 10k samples after a burn-in period of 5k samples; the Gelman-Rubin potential scale reduction factors [26] are used to check for convergence. The middle plot of Figure 4 shows the posteriors learned by all methods. 
We see that the posterior inferred by AGP-FULL is similar to that of HMC and ESS. AGP-MIX obtains the same posterior mean but underestimates the variance. The right plot shows the speed-up factors of all methods against the slowest method, HMC. The AGP methods run more than two orders of magnitude faster than HMC, thus confirming the computational advantages of our method over the sampling approaches. Training time was measured on a desktop with an Intel(R) i7-2600 3.40GHz CPU and 8GB of RAM using Matlab R2012a.\n\n5 Discussion\n\nWe have developed automated variational inference for Gaussian process models (AGP). AGP performs as well as the exact or hard-coded implementations when tested on five models using six real-world datasets. AGP has the potential to be a powerful tool for GP practitioners and researchers when devising models for new or existing problems for which variational inference is not yet available. In the future we will address the scalability of AGP to very large datasets.\n\nAcknowledgements\n\nNICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.\n\nReferences\n\n[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.\n[2] Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, 1993.\n[3] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Springer, 1998.\n[4] Iain Murray, Ryan Prescott Adams, and David J.C. MacKay. 
Elliptical slice sampling. In AISTATS, 2010.\n[5] Mohammad E. Khan, Shakir Mohamed, Benjamin M. Marlin, and Kevin P. Murphy. A stick-breaking likelihood for categorical data analysis with latent Gaussian models. In AISTATS, pages 610–618, 2012.\n[6] Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(10), 2008.\n[7] Trung V. Nguyen and Edwin Bonilla. Efficient variational inference for Gaussian process regression networks. In AISTATS, pages 472–480, 2013.\n[8] Mohammad E. Khan, Shakir Mohamed, and Kevin P. Murphy. Fast Bayesian inference for non-conjugate Gaussian process regression. In NIPS, pages 3149–3157, 2012.\n[9] Miguel Lázaro-Gredilla. Bayesian warped Gaussian processes. In NIPS, pages 1628–1636, 2012.\n[10] Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.\n[11] Miguel Lázaro-Gredilla and Michalis Titsias. Variational heteroscedastic Gaussian process regression. In ICML, 2011.\n[12] Christopher K.I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.\n[13] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In NIPS, 2003.\n[14] Jesper Møller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.\n[15] Andrew G. Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In ICML, 2012.\n[16] Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In AISTATS, 2014.\n[17] Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. 
Neural Computation, 21(3):786–792, 2009.\n[18] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.\n[19] Samuel J. Gershman, Matthew D. Hoffman, and David M. Blei. Nonparametric variational inference. In ICML, 2012.\n[20] Marco F. Huber, Tim Bailey, Hugh Durrant-Whyte, and Uwe D. Hanebeck. On entropy approximation for Gaussian mixture random vectors. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, 2008.\n[21] George Casella and Christian P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 1996.\n[22] K. Bache and M. Lichman. UCI machine learning repository, 2013.\n[23] D. Cole, C. Martin-Moran, A.G. Sheard, H.K.D.H. Bhadeshia, and D.J.C. MacKay. Modelling creep rupture strength of ferritic steel welds. Science and Technology of Welding & Joining, 5(2):81–89, 2000.\n[24] R.G. Jarrett. A note on the intervals between coal-mining disasters. Biometrika, 66(1):191–193, 1979.\n[25] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.\n[26] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, pages 457–472, 1992.", "award": [], "sourceid": 776, "authors": [{"given_name": "Trung", "family_name": "Nguyen", "institution": "ANU; NICTA"}, {"given_name": "Edwin", "family_name": "Bonilla", "institution": "The University of New South Wales"}]}