{"title": "Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 340, "page_last": 346, "abstract": null, "full_text": "Gaussian Processes for Bayesian \n\nClassification via Hybrid Monte Carlo \n\nDavid Barber and Christopher K. I. Williams \n\nNeural Computing Research Group \n\nDepartment of Computer Science and Applied Mathematics \n\nAston University, Birmingham B4 7ET, UK \n\nd.barber~aston.ac.uk \n\nc.k.i.williams~aston.ac.uk \n\nAbstract \n\nThe full Bayesian method for applying neural networks to a pre(cid:173)\ndiction problem is to set up the prior/hyperprior structure for the \nnet and then perform the necessary integrals. However, these inte(cid:173)\ngrals are not tractable analytically, and Markov Chain Monte Carlo \n(MCMC) methods are slow, especially if the parameter space is \nhigh-dimensional. Using Gaussian processes we can approximate \nthe weight space integral analytically, so that only a small number \nof hyperparameters need be integrated over by MCMC methods. \nWe have applied this idea to classification problems, obtaining ex(cid:173)\ncellent results on the real-world problems investigated so far . \n\n1 \n\nINTRODUCTION \n\nTo make predictions based on a set of training data, fundamentally we need to \ncombine our prior beliefs about possible predictive functions with the data at hand. \nIn the Bayesian approach to neural networks a prior on the weights in the net induces \na prior distribution over functions. This leads naturally to the idea of specifying our \nbeliefs about functions more directly. Gaussian Processes (GPs) achieve just that, \nbeing examples of stochastic process priors over functions that allow the efficient \ncomputation of predictions. It is also possible to show that a large class of neural \nnetwork models converge to GPs in the limit of an infinite number of hidden units \n(Neal, 1996). 
In previous work (Williams and Rasmussen, 1996) we have applied GP priors over functions to the problem of predicting a real-valued output, and found that the method has comparable performance to other state-of-the-art methods. This paper extends the use of GP priors to classification problems.\n\nThe GPs we use have a number of adjustable hyperparameters that specify quantities like the length scale over which smoothing should take place. Rather than optimizing these parameters (e.g. by maximum likelihood or cross-validation methods) we place priors over them and use a Markov Chain Monte Carlo (MCMC) method to obtain a sample from the posterior which is then used for making predictions. An important advantage of using GPs rather than neural networks arises from the fact that the GPs are characterized by a few (say ten or twenty) hyperparameters, while the networks have a similar number of hyperparameters but many (e.g. hundreds) of weights as well, so that MCMC integrations for the networks are much more difficult.\n\nWe first briefly review the regression framework, as our strategy will be to transform the classification problem into a corresponding regression problem by dealing with the input values to the logistic transfer function. In section 2.1 we show how to use Gaussian processes for classification when the hyperparameters are fixed, and then describe the integration over hyperparameters in section 2.3. Results of our method as applied to some well known classification problems are given in section 3, followed by a brief discussion and directions for future research.\n\n1.1 Gaussian Processes for regression\n\nWe outline the GP method as applied to the prediction of a real-valued output y_* = y(x_*) for a new input value x_*, given a set of training data D = {(x_i, t_i), i = 1 ... 
n}.\n\nGiven a set of inputs x_*, x_1, ..., x_n, a GP allows us to specify how correlated we expect their corresponding outputs y = (y(x_1), y(x_2), ..., y(x_n)) to be. We denote this prior over functions as P(y), and similarly, P(y_*, y) for the joint distribution including y_*. If we also specify P(t|y), the probability of observing the particular values t = (t_1, ..., t_n)^T given actual values y (i.e. a noise model), then\n\nP(y_*|t) = ∫ P(y_*, y|t) dy = (1/P(t)) ∫ P(y_*, y) P(t|y) dy    (1)\n\nHence the predictive distribution for y_* is found from the marginalization of the product of the prior and the noise model. If P(t|y) and P(y_*, y) are Gaussian then P(y_*|t) is a Gaussian whose mean and variance can be calculated using matrix computations involving matrices of size n × n. Specifying P(y_*, y) to be a multidimensional Gaussian (for all values of n and placements of the points x_*, x_1, ..., x_n) means that the prior over functions is a GP. More formally, a stochastic process is a collection of random variables {Y(x) | x ∈ X} indexed by a set X. In our case X will be the input space with dimension d, the number of inputs. A GP is a stochastic process which can be fully specified by its mean function μ(x) = E[Y(x)] and its covariance function C(x, x') = E[(Y(x) − μ(x))(Y(x') − μ(x'))]; any finite set of Y-variables will have a joint multivariate Gaussian distribution. Below we consider GPs which have μ(x) ≡ 0.\n\n2 GAUSSIAN PROCESSES FOR CLASSIFICATION\n\nFor simplicity of exposition, we will present our method as applied to two-class problems as the extension to multiple classes is straightforward.\n\nBy using the logistic transfer function σ to produce an output which can be interpreted as π(x), the probability of the input x belonging to class 1, the job of specifying a prior over functions π can be transformed into that of specifying a prior over the input to the transfer function. 
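The regression computation of section 1.1 can be sketched in a few lines. This is a minimal illustration only (hypothetical function and variable names, a fixed scalar squared-exponential covariance, and an assumed Gaussian noise variance sigma2), not the authors' implementation:\n\n```python\nimport numpy as np\n\ndef se_cov(x1, x2, v0=1.0, w=1.0):\n    # squared-exponential covariance between two scalar inputs (assumed form)\n    return v0 * np.exp(-0.5 * w * (x1 - x2) ** 2)\n\ndef gp_regression_predict(X, t, x_star, sigma2=0.1):\n    \"\"\"Predictive mean and variance of y* at x_star from noisy targets t,\n    using only n x n matrix computations as described in section 1.1.\"\"\"\n    n = len(X)\n    K = np.array([[se_cov(xi, xj) for xj in X] for xi in X])\n    k = np.array([se_cov(xi, x_star) for xi in X])\n    A = K + sigma2 * np.eye(n)            # covariance of the observed targets\n    mean = k @ np.linalg.solve(A, t)      # k^T (K + sigma^2 I)^{-1} t\n    var = se_cov(x_star, x_star) - k @ np.linalg.solve(A, k)\n    return mean, var\n```\n\nWith very small noise the GP interpolates the training targets, so predicting at a training input recovers its target almost exactly.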
We call the input to the transfer function the activation, and denote it by y, with π(x) = σ(y(x)). For input x_i, we will denote the corresponding probability and activation by π_i and y_i respectively.\n\nTo make predictions when using fixed hyperparameters we would like to compute π̄_* = ∫ π_* P(π_*|t) dπ_*, which requires us to find P(π_*|t) = P(π(x_*)|t) for a new input x_*. This can be done by finding the distribution P(y_*|t) (y_* is the activation of π_*) and then using the appropriate Jacobian to transform the distribution. Formally the equations for obtaining P(y_*|t) are identical to equation 1. However, even if we use a GP prior so that P(y_*, y) is Gaussian, the usual expression for P(t|y) = Π_i π_i^{t_i} (1 − π_i)^{1−t_i} for classification data (where the t's take on values of 0 or 1) means that the marginalization to obtain P(y_*|t) is no longer analytically tractable. We will employ Laplace's approximation, i.e. we shall approximate the integrand P(y_*, y|t, θ) by a Gaussian distribution centred at a maximum of this function with respect to y_*, y, with an inverse covariance matrix given by −∇∇ log P(y_*, y|t, θ). The necessary integrations (marginalization) can then be carried out analytically (see, e.g. Green and Silverman (1994) §5.3) and we provide a derivation in the following section.\n\n2.1 Maximizing P(y_*, y|t)\n\nLet y_+ denote (y_*, y), the complete set of activations. By Bayes' theorem, log P(y_+|t) = log P(t|y) + log P(y_+) − log P(t), and let Ψ_+ = log P(t|y) + log P(y_+). As P(t) does not depend on y_+ (it is just a normalizing factor), the maximum of P(y_+|t) is found by maximizing Ψ_+ with respect to y_+. We define Ψ similarly in relation to P(y|t). 
Using log P(t_i|y_i) = t_i y_i − log(1 + e^{y_i}), we obtain\n\nΨ_+ = t^T y − Σ_{i=1}^n log(1 + e^{y_i}) − ½ y_+^T K_+^{−1} y_+ − ½ log|K_+| − ((n+1)/2) log 2π    (2)\n\nΨ = t^T y − Σ_{i=1}^n log(1 + e^{y_i}) − ½ y^T K^{−1} y − ½ log|K| − (n/2) log 2π    (3)\n\nwhere K_+ is the covariance matrix of the GP evaluated at x_1, ..., x_n, x_*. K_+ can be partitioned in terms of an n × n matrix K, an n × 1 vector k and a scalar k_*, viz.\n\nK_+ = ( K , k ; k^T , k_* )    (4)\n\nAs y_* only enters into equation 2 in the quadratic prior term and has no data point associated with it, maximizing Ψ_+ with respect to y_+ can be achieved by first maximizing Ψ with respect to y and then doing the further quadratic optimization to determine the posterior mean ȳ_*. To find a maximum of Ψ we use the Newton-Raphson (or Fisher scoring) iteration y^{new} = y − (∇∇Ψ)^{−1} ∇Ψ. Differentiating equation 3 with respect to y we find\n\n∇Ψ = (t − π) − K^{−1} y    (5)\n∇∇Ψ = −K^{−1} − W    (6)\n\nwhere W = diag(π_1(1 − π_1), ..., π_n(1 − π_n)), which gives the iterative equation¹\n\ny^{new} = (K^{−1} + W)^{−1} (W y + (t − π))    (7)\n\n¹The complexity of calculating each iteration using standard matrix methods is O(n³). In our implementation, however, we use conjugate gradient methods to avoid explicitly inverting matrices. In addition, by using the previous iterate y as an initial guess for the conjugate gradient solution to equation 7, the iterates are computed an order of magnitude faster than using standard algorithms.\n\nGiven a converged solution ŷ for y, ȳ_* can easily be found using ȳ_* = k^T K^{−1} ŷ = k^T (t − π̂). var(y_*) is given by [(K_+^{−1} + W_+)^{−1}]_{(n+1)(n+1)}, where W_+ is the W matrix with a zero appended in the (n+1)th diagonal position.\n\nGiven the (Gaussian) distribution of y_* we then wish to find the mean of the distribution P(π_*|t), which is found from π̄_* = ∫ σ(y_*) P(y_*|t) dy_*. 
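The iteration in equation 7 can be sketched directly. This is a toy illustration with hypothetical names; the authors' implementation uses conjugate gradients rather than the dense solve shown here:\n\n```python\nimport numpy as np\n\ndef sigmoid(y):\n    return 1.0 / (1.0 + np.exp(-y))\n\ndef laplace_mode(K, t, n_iter=20):\n    \"\"\"Newton-Raphson iteration y_new = (K^-1 + W)^-1 (W y + (t - pi))\n    for the mode of Psi, rewritten as (I + K W) y_new = K (W y + t - pi)\n    so that K^-1 is never formed explicitly.\"\"\"\n    n = len(t)\n    y = np.zeros(n)\n    for _ in range(n_iter):\n        pi = sigmoid(y)\n        W = np.diag(pi * (1.0 - pi))\n        y = np.linalg.solve(np.eye(n) + K @ W, K @ (W @ y + (t - pi)))\n    return y\n```\n\nAt convergence the gradient (t − π) − K^{−1} y of equation 5 vanishes, which gives a direct check on the solution.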
We calculate this by approximating the sigmoid by a set of five cumulative normal densities (erf) that interpolate the sigmoid at chosen points. This leads to a very fast and accurate analytic approximation for the mean class prediction.\n\nThe justification of Laplace's approximation in our case is somewhat different from the argument usually put forward, e.g. for asymptotic normality of the maximum likelihood estimator for a model with a finite number of parameters. This is because the dimension of the problem grows with the number of data points. However, if we consider the \"infill asymptotics\", where the number of data points in a bounded region increases, then a local average of the training data at any point x will provide a tightly localized estimate for π(x) and hence y(x), so we would expect the distribution P(y) to become more Gaussian with increasing data.\n\n2.2 Parameterizing the covariance function\n\nThere are many reasonable choices for the covariance function. Formally, we are required to specify functions which will generate a non-negative definite covariance matrix for any set of points (x_1, ..., x_k). From a modelling point of view we wish to specify covariances so that points with nearby inputs will give rise to similar predictions. We find that the following covariance function works well:\n\nC(x, x') = v_0 exp{ −½ Σ_{l=1}^d w_l (x_l − x'_l)² }    (8)\n\nwhere x_l is the lth component of x and θ = log(v_0, w_1, ..., w_d) plays the role of hyperparameters².\n\nWe define the hyperparameters to be the log of the variables in equation 8 since these are positive scale-parameters. This covariance function has been studied by Sacks et al (1989) and can be obtained from a network of Gaussian radial basis functions in the limit of an infinite number of hidden units (Williams, 1996).\n\nThe w_l parameters in equation 8 allow a different length scale on each input dimension. 
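Equation 8 transcribes directly to code. This is a sketch with hypothetical names; v0 and w here are the raw positive parameters, whose logs form θ:\n\n```python\nimport numpy as np\n\ndef ard_cov(X1, X2, v0=1.0, w=None):\n    \"\"\"C(x, x') = v0 * exp(-0.5 * sum_l w_l (x_l - x'_l)^2), equation 8.\n    X1 is (n1, d), X2 is (n2, d); returns the (n1, n2) covariance matrix.\"\"\"\n    d = X1.shape[1]\n    w = np.ones(d) if w is None else np.asarray(w, dtype=float)\n    diff = X1[:, None, :] - X2[None, :, :]   # pairwise differences per dimension\n    return v0 * np.exp(-0.5 * np.sum(w * diff ** 2, axis=2))\n```\n\nSetting w_l = 0 makes the covariance, and hence the predictions, ignore input dimension l, which is the ARD-style behaviour described in the text.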
For irrelevant inputs, the corresponding w_l will become small, and the model will ignore that input. This is closely related to the Automatic Relevance Determination (ARD) idea of MacKay and Neal (Neal, 1996). The v_0 variable gives the overall scale of the prior; in the classification case, this specifies if the π values will typically be pushed to 0 or 1, or will hover around 0.5.\n\n²We call θ the hyperparameters rather than parameters as they correspond closely to hyperparameters in neural networks.\n\n2.3 Integration over the hyperparameters\n\nGiven that the GP contains adjustable hyperparameters, how should they be adapted given the data? Maximum likelihood or (generalized) cross-validation methods are often used, but we will prefer a Bayesian solution. A prior distribution over the hyperparameters P(θ) is modified using the training data to obtain the posterior distribution P(θ|t) ∝ P(t|θ)P(θ). To make predictions we integrate the predicted probabilities over the posterior; for example, the mean value π̄(x_*) for test input x_* is given by\n\nπ̄(x_*) = ∫ π̂(x_*|θ) P(θ|t) dθ,    (9)\n\nwhere π̂(x_*|θ) is the mean prediction for a fixed value of the hyperparameters, as given in section 2.\n\nFor the regression problem P(t|θ) can be calculated exactly using P(t|θ) = ∫ P(t|y)P(y|θ) dy, but this integral is not analytically tractable for the classification problem. Again we use Laplace's approximation and obtain³\n\nlog P(t|θ) ≃ Ψ(ŷ) − ½ log|K^{−1} + W| + (n/2) log 2π    (10)\n\nwhere ŷ is the converged iterate of equation 7. We denote the right-hand side of equation 10 by log P_a(t|θ) (where a stands for approximate).\n\nThe integration over θ-space also cannot be done analytically, and we employ a Markov Chain Monte Carlo method. 
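Once MCMC has produced hyperparameter samples, the integral in equation 9 is approximated by a simple average of the fixed-θ predictions. A minimal sketch with hypothetical names, where predict_fn stands in for the fixed-hyperparameter mean prediction of section 2:\n\n```python\ndef predictive_mean(theta_samples, predict_fn, x_star):\n    \"\"\"Monte Carlo estimate of equation 9: average the fixed-hyperparameter\n    predictions pi_hat(x* | theta) over samples theta ~ P(theta | t).\"\"\"\n    preds = [predict_fn(x_star, theta) for theta in theta_samples]\n    return sum(preds) / len(preds)\n```\n\nThe same averaging applies to any other posterior-predictive quantity computed per sample.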
We have used the Hybrid Monte Carlo (HMC) method of Duane et al (1987), with broad Gaussian hyperpriors on the parameters.\n\nHMC works by creating a fictitious dynamical system in which the hyperparameters are regarded as position variables, and augmenting these with momentum variables p. The purpose of the dynamical system is to give the hyperparameters \"inertia\" so that random-walk behaviour in θ-space can be avoided. The total energy, H, of the system is the sum of the kinetic energy, K = p^T p/2, and the potential energy, E. The potential energy is defined such that P(θ|D)