{"title": "Bayesian Learning via Stochastic Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 475, "page_last": 482, "abstract": null, "full_text": "Bayesian Learning via Stochastic Dynamics \n\nRadford M. Neal \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, Ontario, Canada M5S 1A4 \n\nAbstract \n\nThe attempt to find a single \"optimal\" weight vector in conventional network training can lead to overfitting and poor generalization. Bayesian methods avoid this, without the need for a validation set, by averaging the outputs of many networks with weights sampled from the posterior distribution given the training data. This sample can be obtained by simulating a stochastic dynamical system that has the posterior as its stationary distribution. \n\n1 CONVENTIONAL AND BAYESIAN LEARNING \n\nI view neural networks as probabilistic models, and learning as statistical inference. Conventional network learning finds a single \"optimal\" set of network parameter values, corresponding to maximum likelihood or maximum penalized likelihood inference. Bayesian inference instead integrates the predictions of the network over all possible values of the network parameters, weighting each parameter set by its posterior probability in light of the training data. \n\n1.1 NEURAL NETWORKS AS PROBABILISTIC MODELS \n\nConsider a network taking a vector of real-valued inputs, x, and producing a vector of real-valued outputs, ŷ, perhaps computed using hidden units. Such a network architecture corresponds to a function, f, with ŷ = f(x, w), where w is a vector of connection weights. 
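A minimal sketch of such a function f, assuming a single hidden layer of tanh (sigmoidal) units and a flat weight-vector layout; the layout and layer sizes here are illustrative choices, not taken from the paper:

```python
import numpy as np

def f(x, w, n_in=2, n_hidden=16, n_out=2):
    """Network output f(x, w) for a flat weight vector w.

    Assumed layout of w (an illustrative choice): input-to-hidden weights,
    hidden biases, hidden-to-output weights, output biases, concatenated.
    """
    i = 0
    W1 = w[i:i + n_in * n_hidden].reshape(n_in, n_hidden); i += n_in * n_hidden
    b1 = w[i:i + n_hidden]; i += n_hidden
    W2 = w[i:i + n_hidden * n_out].reshape(n_hidden, n_out); i += n_hidden * n_out
    b2 = w[i:i + n_out]
    h = np.tanh(x @ W1 + b1)   # sigmoidal hidden units
    return h @ W2 + b2         # linear output units
```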
If we assume the observed outputs, y, are equal to ŷ plus Gaussian noise of standard deviation σ, the network defines the conditional probability for an observed output vector given an input vector as follows: \n\nP(y | x, σ) ∝ exp(-|y - f(x, w)|² / 2σ²)   (1) \n\nThe probability of the outputs in a training set (x_1, y_1), ..., (x_n, y_n) given this fixed noise level is therefore \n\nP(y_1, ..., y_n | x_1, ..., x_n, σ) ∝ exp(-Σ_c |y_c - f(x_c, w)|² / 2σ²)   (2) \n\nOften σ is unknown. A Bayesian approach to handling this is to assign σ a vague prior distribution and then integrate it away, giving the following probability for the training set (see (Buntine and Weigend, 1991) or (Neal, 1992) for details): \n\nP(y_1, ..., y_n | x_1, ..., x_n) ∝ (s_0 + Σ_c |y_c - f(x_c, w)|²)^(-(m_0+n)/2)   (3) \n\nwhere s_0 and m_0 are parameters of the prior for σ. \n\n1.2 CONVENTIONAL LEARNING \n\nConventional backpropagation learning tries to find the weight vector that assigns the highest probability to the training data, or equivalently, that minimizes minus the log probability of the training data. When σ is assumed known, we can use (2) to obtain the following objective function to minimize: \n\nM(w) = Σ_c |y_c - f(x_c, w)|² / 2σ²   (4) \n\nWhen σ is unknown, we can instead minimize the following, derived from (3): \n\nM(w) = ((m_0+n)/2) log(s_0 + Σ_c |y_c - f(x_c, w)|²)   (5) \n\nConventional learning often leads to the network overfitting the training data - modeling the noise, rather than the true regularities. This can be alleviated by stopping learning when the performance of the network on a separate validation set begins to worsen, rather than improve. Another way to avoid overfitting is to include a weight decay term in the objective function, as follows: \n\nM'(w) = λ|w|² + M(w)   (6) \n\nHere, the data fit term, M(w), may come from either (4) or (5). 
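The objective functions (4), (5), and (6) can be sketched as follows; the helper names are hypothetical, and the form used for (5) assumes the exponent (m_0 + n)/2 appearing in (3):

```python
import numpy as np

def data_fit_known_sigma(w, X, Y, f, sigma):
    """M(w) of (4): sum of squared errors scaled by the known noise level."""
    resid = Y - np.array([f(x, w) for x in X])
    return np.sum(resid ** 2) / (2.0 * sigma ** 2)

def data_fit_unknown_sigma(w, X, Y, f, s0=0.1, m0=0.1):
    """M(w) of (5): minus log of (3), with sigma integrated away."""
    resid = Y - np.array([f(x, w) for x in X])
    n = len(X)
    return 0.5 * (m0 + n) * np.log(s0 + np.sum(resid ** 2))

def objective_with_weight_decay(w, X, Y, f, lam, sigma):
    """M'(w) of (6): weight decay term plus the data fit term from (4)."""
    return lam * np.sum(w ** 2) + data_fit_known_sigma(w, X, Y, f, sigma)
```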
We must somehow find an appropriate value for λ, perhaps, again, using a separate validation set. \n\n1.3 BAYESIAN LEARNING AND PREDICTION \n\nUnlike conventional training, Bayesian learning does not look for a single \"optimal\" set of network weights. Instead, the training data is used to find the posterior probability distribution over weight vectors. Predictions for future cases are made by averaging the outputs obtained with all possible weight vectors, with each contributing in proportion to its posterior probability. \n\nTo obtain the posterior, we must first define a prior distribution for weight vectors. We might, for example, give each weight a Gaussian prior of standard deviation ω: \n\nP(w) ∝ exp(-|w|² / 2ω²)   (7) \n\nWe can then obtain the posterior distribution over weight vectors given the training cases (x_1, y_1), ..., (x_n, y_n) using Bayes' Theorem: \n\nP(w | (x_1, y_1), ..., (x_n, y_n)) ∝ P(w) P(y_1, ..., y_n | x_1, ..., x_n, w)   (8) \n\nBased on the training data, the best prediction for the output vector in a test case with input vector x_*, assuming squared-error loss, is \n\nŷ_* = ∫ f(x_*, w) P(w | (x_1, y_1), ..., (x_n, y_n)) dw   (9) \n\nA full predictive distribution for the outputs in the test case can also be obtained, quantifying the uncertainty in the above prediction. \n\n2 INTEGRATION BY MONTE CARLO METHODS \n\nIntegrals such as that of (9) are difficult to evaluate. Buntine and Weigend (1991) and MacKay (1992) approach this problem by approximating the posterior distribution by a Gaussian. Instead, I evaluate such integrals using Monte Carlo methods. If we randomly select weight vectors, w_0, ..., w_{N-1}, each distributed according to the posterior, the prediction for a test case can be found by approximating the integral of (9) by the average output of networks with these weights: \n\nŷ_* ≈ (1/N) Σ_t f(x_*, w_t)   (10) \n\nThis formula is valid even if the w_t are dependent, though a larger sample may then be needed to achieve a given error bound. Such a sample can be obtained by simulating an ergodic Markov chain that has the posterior as its stationary distribution. The early part of the chain, before the stationary distribution has been reached, is discarded. Subsequent vectors are used to estimate the integral. \n\n2.1 FORMULATING THE PROBLEM IN TERMS OF ENERGY \n\nConsider the general problem of obtaining a sample of (dependent) vectors, q_t, with probabilities given by P(q). For Bayesian network learning, q will be the weight vector, or other parameters from which the weights can be obtained, and the distribution of interest will be the posterior. \n\nIt will be convenient to express this probability distribution in terms of a potential energy function, E(q), chosen so that \n\nP(q) ∝ exp(-E(q))   (11) \n\nA momentum vector, p, of the same dimensions as q, is also introduced, and defined to have a kinetic energy of ½|p|². The sum of the potential and kinetic energies is the Hamiltonian: \n\nH(q, p) = E(q) + ½|p|²   (12) \n\nFrom the Hamiltonian, we define a joint probability distribution over q and p (phase space) as follows: \n\nP(q, p) ∝ exp(-H(q, p))   (13) \n\nThe marginal distribution for q in (13) is that of (11), from which we wish to sample. We can therefore proceed by sampling from this joint distribution for q and p, and then just ignoring the values obtained for p. \n\n2.2 HAMILTONIAN DYNAMICS \n\nSampling from the distribution (13) can be split into two subproblems - first, to sample uniformly from a surface where H, and hence the probability, is constant, and second, to visit points of differing H with the correct probabilities. The solutions to these subproblems can then be interleaved to give an overall solution. 
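As a sketch of this formulation, assuming E is supplied as a Python callable (for Bayesian learning, E(q) would be minus the log of the unnormalized posterior, i.e. an objective like M'(w)):

```python
import numpy as np

def hamiltonian(q, p, E):
    """H(q, p) of (12): potential energy plus kinetic energy |p|^2 / 2."""
    return E(q) + 0.5 * np.dot(p, p)

def joint_density(q, p, E):
    """Unnormalized P(q, p) of (13); its q-marginal is P(q) of (11)."""
    return np.exp(-hamiltonian(q, p, E))
```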
\n\nThe first subproblem can be solved by simulating the Hamiltonian dynamics of the system, in which q and p evolve through a fictitious time, τ, according to the following equations: \n\ndq/dτ = ∂H/∂p = p,   dp/dτ = -∂H/∂q = -∇E(q)   (14) \n\nThis dynamics leaves H constant, and preserves the volumes of regions of phase space. It therefore visits points on a surface of constant H with uniform probability. \n\nWhen simulating this dynamics, some discrete approximation must be used. The leapfrog method exactly maintains the preservation of phase space volume. Given a size for the time step, ε, an iteration of the leapfrog method goes as follows: \n\np(τ + ε/2) = p(τ) - (ε/2) ∇E(q(τ)) \nq(τ + ε) = q(τ) + ε p(τ + ε/2) \np(τ + ε) = p(τ + ε/2) - (ε/2) ∇E(q(τ + ε))   (15) \n\n2.3 THE STOCHASTIC DYNAMICS METHOD \n\nTo create a Markov chain that converges to the distribution of (13), we must interleave leapfrog iterations, which keep H (approximately) constant, with steps that can change H. It is convenient for the latter to affect only p, since it enters into H in a simple way. This general approach is due to Andersen (1980). I use stochastic steps of the following form to change H: \n\np' = αp + (1 - α²)^(1/2) n   (16) \n\nwhere 0 < α < 1, and n is a random vector with components picked independently from Gaussian distributions of mean zero and standard deviation one. One can show that these steps leave the distribution of (13) invariant. Alternating these stochastic steps with dynamical leapfrog steps will therefore sample values for q and p with close to the desired probabilities. In so far as the discretized dynamics does not keep H exactly constant, however, there will be some degree of bias, which will be eliminated only in the limit as ε goes to zero. \n\nIt is best to use a value of α close to one, as this reduces the random walk aspect of the dynamics. 
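A sketch of one leapfrog iteration (15) and one stochastic momentum step (16); the function names are illustrative, and grad_E is assumed to compute ∇E:

```python
import numpy as np

def leapfrog(q, p, grad_E, eps):
    """One leapfrog iteration (15); exactly volume-preserving and reversible."""
    p = p - 0.5 * eps * grad_E(q)   # half step for the momentum
    q = q + eps * p                 # full step for the position
    p = p - 0.5 * eps * grad_E(q)   # half step for the momentum
    return q, p

def refresh_momentum(p, alpha, rng):
    """Stochastic step (16): p' = alpha * p + sqrt(1 - alpha^2) * n."""
    n = rng.standard_normal(p.shape)
    return alpha * p + np.sqrt(1.0 - alpha ** 2) * n
```

For a quadratic E, repeated leapfrog steps keep H nearly constant, which is what makes the later Metropolis acceptance rates high.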
If the random term in (16) is omitted, the procedure is equivalent to ordinary batch mode backpropagation learning with momentum. \n\n2.4 THE HYBRID MONTE CARLO METHOD \n\nThe bias introduced into the stochastic dynamics method by using an approximation to the dynamics is eliminated in the Hybrid Monte Carlo method of Duane, Kennedy, Pendleton, and Roweth (1987). \n\nThis method is a variation on the algorithm of Metropolis, et al. (1953), which generates a Markov chain by considering randomly-selected changes to the state. A change is always accepted if it lowers the energy (H), or leaves it unchanged. If it increases the energy, it is accepted with probability exp(-ΔH), and is rejected otherwise, with the old state then being repeated. \n\nIn the Hybrid Monte Carlo method, candidate changes are produced by picking a random value for p from its distribution given by (13) and then performing some predetermined number of leapfrog steps. If the leapfrog method were exact, H would be unchanged, and these changes would always be accepted. Since the method is actually only approximate, H sometimes increases, and changes are sometimes rejected, exactly cancelling the bias introduced by the approximation. \n\nOf course, if the errors are very large, the acceptance probability will be very low, and it will take a long time to reach and explore the stationary distribution. To avoid this, we need to choose a step size (ε) that is small enough. \n\n3 RESULTS ON A TEST PROBLEM \n\nI use the \"robot arm\" problem of MacKay (1992) for testing. The task is to learn the mapping from two real-valued inputs, x_1 and x_2, to two real-valued outputs, y_1 and y_2, given by \n\nỹ_1 = 2.0 cos(x_1) + 1.3 cos(x_1 + x_2)   (17) \nỹ_2 = 2.0 sin(x_1) + 1.3 sin(x_1 + x_2)   (18) \n\nGaussian noise of mean zero and standard deviation 0.05 is added to (ỹ_1, ỹ_2) to give the observed position, (y_1, y_2). 
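The data-generating process of (17)-(18) can be sketched as follows, using the input ranges given for the training and test sets below; how cases are split between the two x_1 ranges is not stated here, so an equal-probability choice is assumed:

```python
import numpy as np

def robot_arm_data(n_cases, rng):
    """Generate robot-arm cases per (17)-(18), with noise of sd 0.05.

    x1 is drawn from [-1.932, -0.453] or [+0.453, +1.932] (chosen with
    equal probability, an assumption), x2 from [0.534, 3.142].
    """
    sign = rng.choice([-1.0, 1.0], size=n_cases)
    x1 = sign * rng.uniform(0.453, 1.932, size=n_cases)
    x2 = rng.uniform(0.534, 3.142, size=n_cases)
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2)   # (17), noise-free
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2)   # (18), noise-free
    Y = np.stack([y1, y2], axis=1) + rng.normal(0.0, 0.05, size=(n_cases, 2))
    return np.stack([x1, x2], axis=1), Y
```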
The training and test sets each consist of 200 cases, with x_1 picked randomly from the ranges [-1.932, -0.453] and [+0.453, +1.932], and x_2 from the range [0.534, 3.142]. \n\nA network with 16 sigmoidal hidden units was used. The output units were linear. Like MacKay, I group weights into three categories - input to hidden, bias to hidden, and hidden/bias to output. MacKay gives separate priors to weights in each category, finding an appropriate value of ω for each. I fix ω to one, but multiply each weight by a scale factor associated with its category before using it, giving an equivalent effect. For conventional training with weight decay, I use an analogous scheme with three weight decay constants (λ in (6)). \n\nIn all cases, I assume that the true value of σ is not known. I therefore use (3) for the training set probability, and (5) for the data fit term in conventional training. I set s_0 = m_0 = 0.1, which corresponds to a very vague prior for σ. \n\n3.1 PERFORMANCE OF CONVENTIONAL LEARNING \n\nConventional backpropagation learning was tested on the robot arm problem to gauge how difficult it is to obtain good generalization with standard methods. 
\n\nFigure 1: Conventional backpropagation learning - (a) with no weight decay, (b) with carefully-chosen weight decay constants. The solid lines give the squared error on the training data, the dotted lines the squared error on the test data. (Horizontal axes: iterations x 1000.) \n\nFig. 1(a) shows results obtained without using weight decay. Error on the test set declined initially, but then increased with further training. To achieve good results, the point where the test error reaches its minimum would have to be identified using a separate validation set. \n\nFig. 1(b) shows results using good weight decay constants, one for each category of weights, taken from the Bayesian runs described below. In this case there is no need to stop learning early, but finding the proper weight decay constants by non-Bayesian methods would be a problem. Again, a validation set seems necessary, as well as considerable computation. \n\nUse of a validation set is wasteful, since data that could otherwise be included in the training set must be excluded. Standard techniques for avoiding this, such as \"N-fold\" cross-validation, are difficult to apply to neural networks. \n\n3.2 PERFORMANCE OF BAYESIAN LEARNING \n\nBayesian learning was first tested using the unbiased Hybrid Monte Carlo method. The parameter vector in the simulations (q) consisted of the unscaled network weights together with the scale factors for the three weight categories. The actual weight vector (w) was obtained by multiplying each unscaled weight by the scale factor for its category. \n\nEach Hybrid Monte Carlo run consisted of 500 Metropolis steps. For each step, a trajectory consisting of 1000 leapfrog iterations with ε = 0.00012 was computed, and accepted or rejected based on the change in H at its end-point. 
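One such step can be sketched as follows; this is a generic Hybrid Monte Carlo update under the acceptance rule described above, not the paper's exact implementation (names and the per-step leapfrog arrangement are illustrative):

```python
import numpy as np

def hmc_step(q, E, grad_E, eps, n_leapfrog, rng):
    """One Hybrid Monte Carlo update: fresh momentum, a leapfrog
    trajectory, then accept or reject based on the change in H."""
    p0 = rng.standard_normal(q.shape)      # p drawn from its distribution in (13)
    q_new, p_new = q.copy(), p0.copy()
    for _ in range(n_leapfrog):            # leapfrog trajectory, as in (15)
        p_new = p_new - 0.5 * eps * grad_E(q_new)
        q_new = q_new + eps * p_new
        p_new = p_new - 0.5 * eps * grad_E(q_new)
    h0 = E(q) + 0.5 * np.dot(p0, p0)
    h1 = E(q_new) + 0.5 * np.dot(p_new, p_new)
    if rng.uniform() < np.exp(min(0.0, h0 - h1)):   # Metropolis rule
        return q_new, True
    return q, False
```

Run repeatedly, with the early steps discarded, the retained q values form the sample used in (10).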
Each run therefore required 500,000 batch gradient evaluations, and took approximately four hours on a machine rated at about 25 MIPS. \n\nFig. 2(a) shows the training and test error for the early portion of one Hybrid Monte Carlo run. After initially declining, these values fluctuate about an average. Though not apparent in the figure, some quantities (notably the scale factors) require a hundred or more steps to reach their final distribution. The first 250 steps of each run were therefore discarded as not being from the stationary distribution. \n\nFig. 2(b) shows the training and test set errors produced by networks with weight vectors taken from the last 250 steps of the same run. Also shown is the error on the test set using the average of the outputs of all these networks - that is, the estimate given by (10) for the Bayesian prediction of (9). \n\nFigure 2: Bayesian learning using Hybrid Monte Carlo - (a) early portion of run, (b) last 250 iterations. The solid lines give the squared error on the training set, the dotted lines the squared error on the test set, for individual networks. The dashed line in (b) is the test error when using the average of the outputs of all 250 networks. (Horizontal axes: iterations x 1000.) \n\nFigure 3: Predictive distribution for outputs. The two regions from which training data was drawn are outlined. Circles indicate the true, noise-free outputs for a grid of cases in the input space. The dots in the vicinity of each circle (often piled on top of it) are the outputs of every fifth network from the last 250 iterations of a Hybrid Monte Carlo run. \n\nFor the run shown, the test set error using averaged outputs is 0.00559, which is (slightly) better than any results obtained using conventional training. Note that with Bayesian training no validation set is necessary. The analogues of the weight decay constants - the weight scale factors - are found during the course of the simulation. \n\nAnother advantage of the Bayesian approach is that it can provide an indication of how uncertain the predictions for test cases are. Fig. 3 demonstrates this. 
As one would expect, the uncertainty is greater for test cases with inputs outside the region where training data was supplied. \n\n3.3 STOCHASTIC DYNAMICS VS. HYBRID MONTE CARLO \n\nThe uncorrected stochastic dynamics method will have some degree of systematic bias, due to inexact simulation of the dynamics. Is the amount of bias introduced of any practical importance, however? \n\nFigure 4: Bayesian learning using uncorrected stochastic dynamics - (a) training and test error for the last 250 iterations of a run with ε = 0.00012, (b) potential energy (E) for a run with ε = 0.00030. Note the two peaks where the dynamics became unstable. (Horizontal axes: iterations x 1000.) \n\nTo help answer this question, the stochastic dynamics method was run with parameters analogous to those used in the Hybrid Monte Carlo runs. The step size of ε = 0.00012 used in those runs was chosen to be as large as possible while keeping the number of trajectories rejected low (about 10%). A smaller step size would not give competitive results, so this value was used for the stochastic dynamics runs as well. A value of 0.999 for α in (16) was chosen as being (loosely) equivalent to the use of trajectories 1000 iterations long in the Hybrid Monte Carlo runs. \n\nThe results shown in Fig. 4(a) are comparable to those obtained using Hybrid Monte Carlo in Fig. 2(b). Fig. 4(b) shows that with a larger step size the uncorrected stochastic dynamics method becomes unstable. Large step sizes also cause problems for the Hybrid Monte Carlo method, however, as they lead to high rejection rates. 
\n\nThe Hybrid Monte Carlo method may be the more robust choice in some circumstances, but uncorrected stochastic dynamics can also give good results. As it is simpler, the stochastic dynamics method may be better for hardware implementation, and is a more plausible starting point for any attempt to relate Bayesian methods to biology. Numerous other variations on these methods are possible as well, some of which are discussed in (Neal, 1992). \n\nReferences \n\nAndersen, H. C. (1980) \"Molecular dynamics simulations at constant pressure and/or temperature\", Journal of Chemical Physics, vol. 72, pp. 2384-2393. \n\nBuntine, W. L. and Weigend, A. S. (1991) \"Bayesian back-propagation\", Complex Systems, vol. 5, pp. 603-643. \n\nDuane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987) \"Hybrid Monte Carlo\", Physics Letters B, vol. 195, pp. 216-222. \n\nMacKay, D. J. C. (1992) \"A practical Bayesian framework for backpropagation networks\", Neural Computation, vol. 4, pp. 448-472. \n\nMetropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) \"Equation of state calculations by fast computing machines\", Journal of Chemical Physics, vol. 21, pp. 1087-1092. \n\nNeal, R. M. (1992) \"Bayesian training of backpropagation networks by the hybrid Monte Carlo method\", Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto. \n", "award": [], "sourceid": 613, "authors": [{"given_name": "Radford", "family_name": "Neal", "institution": null}]}