{"title": "A Practical Monte Carlo Implementation of Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 604, "abstract": null, "full_text": "A Practical Monte Carlo Implementation \n\nof Bayesian Learning \n\nCarl Edward Rasmussen \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, Ontario, M5S 1A4, Canada \n\ncarl@cs.toronto.edu \n\nAbstract \n\nA practical method for Bayesian training of feed-forward neural \nnetworks using sophisticated Monte Carlo methods is presented \nand evaluated. In reasonably small amounts of computer time this \napproach outperforms other state-of-the-art methods on 5 data(cid:173)\nlimited tasks from real world domains. \n\n1 \n\nINTRODUCTION \n\nBayesian learning uses a prior on model parameters, combines this with information \nfrom a training set , and then integrates over the resulting posterior to make pre(cid:173)\ndictions. With this approach, we can use large networks without fear of overfitting, \nallowing us to capture more structure in the data, thus improving prediction accu(cid:173)\nracy and eliminating the tedious search (often performed using cross validation) for \nthe model complexity that optimises the bias/variance tradeoff. In this approach \nthe size of the model is limited only by computational considerations. \n\nThe application of Bayesian learning to neural networks has been pioneered by \nMacKay (1992), who uses a Gaussian approximation to the posterior weight distri(cid:173)\nbution. However, the Gaussian approximation is poor because of multiple modes in \nthe posterior. Even locally around a mode the accuracy of the Gaussian approxi(cid:173)\nmation is questionable, especially when the model is large compared to the amount \nof training data. \n\nHere I present and test a Monte Carlo method (Neal, 1995) which avoids the \nGaussian approximation. 
The implementation is complicated, but the user is not required to have extensive knowledge about the algorithm. Thus, the implementation represents a practical tool for learning in neural nets. \n\n1.1 THE PREDICTION TASK \n\nThe training data consists of n examples in the form of inputs x = {x^(i)} and corresponding outputs y = {y^(i)}, where i = 1...n. For simplicity we consider only real-valued scalar outputs. The network is parametrised by weights w, and hyperparameters h that control the distributions for the weights, playing a role similar to that of conventional weight decay. Weights and hyperparameters are collectively termed θ, and the network function is written as F_θ(x), although the function value is only indirectly dependent on the hyperparameters (through the weights). \n\nBayes' rule gives the posterior distribution for the parameters in terms of the likelihood, p(y|x, θ), and prior, p(θ): \n\np(θ|x, y) = p(θ) p(y|x, θ) / p(y|x) \n\nTo minimize the expected squared error on an unseen test case with input x^(n+1), we use the mean prediction \n\nŷ^(n+1) = ∫ F_θ(x^(n+1)) p(θ|x, y) dθ    (1) \n\n2 MONTE CARLO SAMPLING \n\nThe following implementation is due to Neal (1995). The network weights are updated using the hybrid Monte Carlo method (Duane et al. 1987). This method combines the Metropolis algorithm with dynamical simulation. This helps to avoid the random walk behavior of simple forms of Metropolis, which is essential if we wish to explore weight space efficiently. The hyperparameters are updated using Gibbs sampling. \n\n2.1 NETWORK SPECIFICATION \n\nThe networks used here are always of the same form: a single linear output unit, a single hidden layer of tanh units and a task dependent number of input units. All layers are fully connected in a feed forward manner (including direct connections from input to output).
The output and hidden units have biases. \n\nThe network priors are specified in a hierarchical manner in terms of hyperparameters; weights of different kinds are divided into groups, each group having its own prior. The output-bias is given a zero-mean Gaussian prior with a std. dev. of σ = 1000, so it is effectively unconstrained. \n\nThe hidden-biases are given a two layer prior: the bias b is given a zero-mean Gaussian prior b ~ N(0, σ^2); the value of σ is specified in terms of the precision τ = σ^-2, which is given a Gamma prior with mean μ = 400 (corresponding to σ = 0.05) and shape parameter α = 0.5; the Gamma density for τ ~ Gamma(μ, α) is given by p(τ) ∝ τ^(α/2 - 1) exp(-τα/(2μ)). Note that this type of prior introduces a dependency between the biases for different hidden units through the common τ. The prior for the hidden-to-output weights is identical to the prior for the hidden-biases, except that the variance of these weights under the prior is scaled down by the square root of the number of hidden units, such that the network output magnitude becomes independent of the number of hidden units. The noise variance is also given a Gamma prior with these parameters. \n\nThe input-to-hidden weights are given a three layer prior: again each weight is given a zero-mean Gaussian prior w ~ N(0, σ^2); the corresponding precision for the weights out of input unit i is given a Gamma prior with mean μ and shape parameter α_1 = 0.5: τ_i ~ Gamma(μ, α_1). The mean μ is determined on the top level by a Gamma distribution with mean 400 and shape parameter α_0 = 1: μ ~ Gamma(400, α_0). The direct input-to-output connections are also given this prior. \n\nThe above-mentioned 3 layer prior incorporates the idea of Automatic Relevance Determination (ARD), due to MacKay and Neal, and discussed in Neal (1995).
The hyperparameters τ_i associated with individual inputs can adapt according to the relevance of the input; for an unimportant input, τ_i can grow very large (governed by the top level prior), thus forcing σ_i and the associated weights to vanish. \n\n2.2 MONTE CARLO SPECIFICATION \n\nSampling from the posterior weight distribution is performed by iteratively updating the values of the network weights and hyperparameters. Each iteration involves two components: weight updates and hyperparameter updates. A cursory description of these steps follows. \n\n2.2.1 Weight Updates \n\nWeight updates are done using the hybrid Monte Carlo method. A fictitious dynamical system is generated by interpreting weights as positions, and augmenting the weights w with momentum variables p. The purpose of the dynamical system is to give the weights \"inertia\" so that slow random walk behaviour can be avoided during exploration of weight space. The total energy, H, of the system is the sum of the kinetic energy, K (a function of the momenta), and the potential energy, E. The potential energy is defined such that p(w) ∝ exp(-E). We sample from the joint distribution for w and p given by p(w, p) ∝ exp(-E - K), under which the marginal distribution for w is given by the posterior. A sample of weights from the posterior can therefore be obtained by simply ignoring the momenta. \n\nSampling from the joint distribution is achieved by two steps: 1) finding new points in phase space with near-identical energies H by simulating the dynamical system using a discretised approximation to Hamiltonian dynamics, and 2) changing the energy H by doing Gibbs sampling for the momentum variables. \n\nHamiltonian Dynamics. Hamilton's first order differential equations for H are approximated by a series of discrete first order steps (specifically by the leapfrog method).
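\n\nThe two steps above can be sketched in a few lines; this is an illustrative toy in plain NumPy (the caller-supplied energy E and gradient grad_E stand in for the network potential computed by backpropagation), not the implementation used in the paper, and it omits the windowed acceptance and per-parameter step-size heuristics: \n\n

```python
import numpy as np

def leapfrog(w, p, grad_E, eps, L):
    # Discretised Hamiltonian dynamics: L leapfrog steps of size eps.
    w, p = w.copy(), p.copy()
    p -= 0.5 * eps * grad_E(w)        # initial half step for momenta
    for _ in range(L - 1):
        w += eps * p                  # full step for positions (weights)
        p -= eps * grad_E(w)          # full step for momenta
    w += eps * p                      # last full step for positions
    p -= 0.5 * eps * grad_E(w)        # final half step for momenta
    return w, p

def hmc_step(w, E, grad_E, eps=0.2, L=200, rng=np.random):
    # Step 2: Gibbs sample fresh momenta (this changes the energy H) ...
    p = rng.standard_normal(w.shape)
    H_old = E(w) + 0.5 * np.sum(p ** 2)
    # Step 1: ... then simulate the dynamics at near-constant H ...
    w_new, p_new = leapfrog(w, p, grad_E, eps, L)
    H_new = E(w_new) + 0.5 * np.sum(p_new ** 2)
    # ... and accept or reject on the discretisation error in H.
    if rng.uniform() < np.exp(min(0.0, H_old - H_new)):
        return w_new
    return w
```

\n\nWith E(w) = w^2/2, so that p(w) ∝ exp(-E) is a standard normal, repeated calls to hmc_step produce samples with approximately unit variance. \n\n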
The first derivatives of the network error function enter through the derivative of the potential energy, and are computed using backpropagation. In the original version of the hybrid Monte Carlo method the final position is then accepted or rejected depending on the final energy H' (which is not necessarily equal to the initial energy H because of the discretisation). Here we use a modified version that uses an average over a window of states instead. The step size of the discrete dynamics should be as large as possible while keeping the rejection rate low. The step sizes are set individually using several heuristic approximations, and scaled by an overall parameter ε. We use L = 200 iterations, a window size of 20 and a step size of ε = 0.2 for all simulations. \n\nGibbs Sampling for Momentum Variables. The momentum variables are updated using a modified version of Gibbs sampling, allowing the energy H to change. A \"persistence\" of 0.95 is used; the new value of the momentum is a weighted sum of the previous value (weight 0.95) and the value obtained by Gibbs sampling (weight (1 - 0.95^2)^(1/2)). With this form of persistence, the momenta change approximately 20 times more slowly, thus increasing the \"inertia\" of the weights, so as to further help in avoiding random walks. Larger values of the persistence will further increase the weight inertia, but reduce the rate of exploration of H. The advantage of increasing the weight inertia in this way rather than by increasing L is that the hyperparameters are updated at shorter intervals, allowing them to adapt to the rapidly changing weights. \n\n2.2.2 Hyperparameter Updates \n\nThe hyperparameters are updated using Gibbs sampling.
The conditional distributions for the hyperparameters given the weights are of the Gamma form, for which efficient generators exist, except for the top-level hyperparameter in the case of the 3 layer priors used for the weights from the inputs; in this case the conditional distribution is more complicated and a form of rejection sampling is employed. \n\n2.3 NETWORK TRAINING AND PREDICTION \n\nThe network training consists of two levels of initialisation before sampling for networks used for prediction. At the first level of initialisation the hyperparameters (variances of the Gaussians) are kept constant at 1, allowing the weights to grow during 1000 leapfrog iterations. Neglecting this phase can cause the network to get caught for a long time in a state where weights and hyperparameters are both very small. \n\nThe scheme described above is then invoked and run for as long as desired, eventually producing networks from the posterior distribution. The initial 1/3 of these nets are discarded, since the algorithm may need time to reach regions of high posterior probability. Networks sampled during the remainder of the run are saved for making predictions. \n\nThe predictions are made using an average of the networks sampled from the posterior as an approximation to the integral in eq. (1). Since the output unit is linear the final prediction can be seen as coming from a huge (fully connected) ensemble net with appropriately scaled output weights. All the results reported here were for ensemble nets with 4000 hidden units. The size of the individual nets is given by the rule that we want at least as many network parameters as we have training examples (with a lower limit of 4 hidden units). We hope thereby to be well out of the underfitting region. Using even larger nets would probably not gain us much (in the face of the limited training data) and is avoided for computational reasons.
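\n\nThe burn-in discard and posterior averaging just described can be sketched as follows; a hypothetical illustration, where the names sampled_params and predict stand in for the saved networks and their forward pass F_θ(x): \n\n

```python
import numpy as np

def posterior_mean_prediction(sampled_params, predict, x, burn_in=1/3):
    # Discard the initial fraction of the run (here 1/3), since the
    # sampler may need time to reach regions of high posterior
    # probability, then approximate the integral of eq. (1) by the
    # average prediction of the remaining sampled networks.
    n_burn = int(len(sampled_params) * burn_in)
    kept = sampled_params[n_burn:]
    return float(np.mean([predict(theta, x) for theta in kept]))
```

\n\n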
\n\nAll runs used the parameter values given above. The only check that is necessary is that the rejection rate stays low, say below 5%; if not, the step size should be lowered. In all runs reported here, ε = 0.2 was adequate. The parameters concerning the Monte Carlo method and the network priors were all selected based on intuition and on experience with toy problems. Thus no parameters need to be set by the user. \n\n3 TESTS \n\nThe performance of the algorithm was evaluated by comparing it to other state-of-the-art methods on 5 real-world regression tasks. All 5 data sets have previously been studied using a 10-way cross-validation scheme (Quinlan 1993). The task in these domains is to predict the price or performance of an object from various discrete and real-valued attributes. For each domain the data is split into two sets of roughly equal size, one for training and one for testing. The training data is further subdivided into full-, half-, quarter- and eighth-sized subsets, 15 subsets in total. Networks are trained on each of these partitions, and evaluated on the large common test set. On the small training sets, the average performance and one std. dev. error bars on this estimate are computed. \n\n3.1 ALGORITHMS \n\nThe Monte Carlo method was compared to four other algorithms. For the three neural network methods, nets with a single hidden layer and direct input-output connections were used. The Monte Carlo method was run for 1 hour on each of the small training sets, and 2, 4 and 8 hours respectively on the larger training sets. All simulations were done on a 200 MHz MIPS R4400 processor. The Gaussian Process method is described in a companion paper (Williams & Rasmussen 1996).
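\n\nThe error measure used throughout, squared test error normalised by the variance of the test targets and averaged over the subsets of a given size, can be sketched as below; a minimal illustration, not the original evaluation code: \n\n

```python
import numpy as np

def normalised_squared_error(y_true, y_pred):
    # Mean squared test error divided by the variance of the test
    # targets; 1.0 corresponds to always predicting the test mean.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

def mean_and_error_bar(subset_errors):
    # Average performance over the subsets of one training-set size,
    # with a one-std.-dev. error bar on the estimate of the mean.
    e = np.asarray(subset_errors, dtype=float)
    return float(e.mean()), float(e.std(ddof=1) / np.sqrt(len(e)))
```

\n\n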
\nThe Evidence method (MacKay 1992) was used for a network with separate hyperparameters for the direct connections, the weights from individual inputs (ARD), hidden biases, and output biases. Nets were trained using a conjugate gradient method, allowing 10000 gradient evaluations (batch) before each of 6 updates of the hyperparameters. The network Hessian was computed analytically. The value of the evidence was computed without compensating for network symmetries, since this can lead to a vastly over-estimated evidence for big networks where the posterior Gaussians from different modes overlap. A large number of nets were trained for each task, with the number of hidden units computed from the results of previous nets by the following heuristic: the min and max numbers of hidden units among the 20% of nets with the highest evidence were found, and the new architecture is picked from a Gaussian (truncated at 0) with mean (max - min)/2 and std. dev. 2 + max - min, which is thought to give a reasonable trade-off between exploration and exploitation. This procedure is run for 1 hour of cpu time or until more than 1000 nets have been trained. The final predictions are made from an ensemble of the 20% (but a maximum of 100) nets with the highest evidence. \n\nAn ensemble method using cross-validation to search over a 2-dimensional grid for the number of hidden units and the value of a single weight decay parameter has been included, as an attempt at a thorough version of \"common practise\". The weight decay parameter takes on the values 0, 0.01, 0.04, 0.16, 0.64 and 2.56. Up to 6 sizes of nets are used, from 0 hidden units (a linear model) up to a number that gives as many weights as training examples.
Networks are trained with a conjugate gradient method for 10000 epochs on each of these up to 36 networks, and performance was monitored on a validation set containing 1/3 of the examples, selected at random. This was repeated 5 times with different random validation sets, and the architecture and weight decay that did best on average was selected. The predictions are made from an ensemble of 10 nets with this architecture, trained on the full training set. This algorithm took several hours of cpu time for the largest training sets. \n\nThe Multivariate Adaptive Regression Splines (MARS) method (Friedman 1991) was included as a non-neural network approach. It is possible to vary the maximum number of variables allowed to interact in the additive components of the model. It is common to allow either pairwise or full interactions. I do not have sufficient experience with MARS to make this choice. Therefore, I tried both options and reported for each partition on each domain the best performance based on the test error, so results as good as the ones reported here might not be obtainable in practise. All other parameters of MARS were left at their default values. MARS always required less than 1 minute of cpu time. \n\n[Figure 1 appears here: five panels (Auto price, Cpu, House, Mpg, Servo) plotting normalised squared test error against training set size for the five methods, with a legend giving the geometric mean of the errors: Monte Carlo 0.283, Gaussian Evidence 0.364, Backprop 0.339, MARS 0.371, Gaussian Process 0.304.] \n\nFigure 1: Squared error on test cases for the five algorithms applied to the five problems. Errors are normalized with respect to the variance on the test cases. The x-axis gives the number of training examples; four different set sizes were used on each domain. The error bars give one std. dev. for the distribution of the mean over training sets. No error bar is given for the largest size, for which only a single training set was available. Some of the large error bars are cut off at the top. MARS was unable to run on the smallest partitions from the Auto Price and Servo domains; in these cases the means of the four other methods were used in the reported geometric mean for MARS. \n\nTable 1: Data Sets \n\ndomain: # training cases, # test cases, # binary inputs, # real inputs \nAuto Price: 80, 79, 0, 16 \nCpu: 104, 105, 0, 6 \nHouse: 256, 250, 1, 12 \nMpg: 192, 200, 6, 3 \nServo: 88, 79, 10, 2 \n\n3.2 PERFORMANCE \n\nThe test results are presented in fig. 1. On the Servo domain the Monte Carlo method is uniformly better than all other methods, although the difference should probably not always be considered statistically significant. The Monte Carlo method generally does well for the smallest training sets. Note that no single method does well on all these tasks. The Monte Carlo method is never vastly out-performed by the other methods. \n\nThe geometric mean of the performances over all 5 domains for the 4 different training set sizes is computed. Assuming a Gaussian distribution of prediction errors, the log of the error variance can (apart from normalising constants) be interpreted as the amount of information unexplained by the models.
Thus, the logs of the geometric means in fig. 1 give the average information unexplained by the models. According to this measure the Monte Carlo method does best, closely followed by the Gaussian Process method. Note that MARS is the worst, even though the decision between pairwise and full interactions was made on the basis of the test errors. \n\n4 CONCLUSIONS \n\nI have outlined a black-box Monte Carlo implementation of Bayesian learning in neural networks, and shown that it has excellent performance. These results suggest that Monte Carlo based Bayesian methods are serious competitors for practical prediction tasks on data limited domains. \n\nAcknowledgements \n\nI am grateful to Radford Neal for his generosity with insight and software. This research was funded by a grant to G. Hinton from the Institute for Robotics and Intelligent Systems. \n\nReferences \n\nS. Duane, A. D. Kennedy, B. J. Pendleton & D. Roweth (1987) \"Hybrid Monte Carlo\", Physics Letters B, vol. 195, pp. 216-222. \nJ. H. Friedman (1991) \"Multivariate adaptive regression splines\" (with discussion), Annals of Statistics, vol. 19, pp. 1-141 (March). Source: http://lib.stat.cmu.edu/general/mars3.5. \nD. J. C. MacKay (1992) \"A practical Bayesian framework for backpropagation networks\", Neural Computation, vol. 4, pp. 448-472. \nR. M. Neal (1995) Bayesian Learning for Neural Networks, PhD thesis, Dept. of Computer Science, University of Toronto, ftp: pub/radford/thesis.ps.Z from ftp.cs.toronto.edu. \nJ. R. Quinlan (1993) \"Combining instance-based and model-based learning\", Proc. ML'93 (ed. P. E. Utgoff), San Mateo: Morgan Kaufmann. \nC. K. I. Williams & C. E. Rasmussen (1996) \"Regression with Gaussian processes\", NIPS 8, editors D. Touretzky, M. Mozer and M. Hasselmo (this volume). \n", "award": [], "sourceid": 1029, "authors": [{"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}