{"title": "Manifold Stochastic Dynamics for Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 694, "page_last": 700, "abstract": "", "full_text": "Manifold Stochastic Dynamics \n\nfor Bayesian Learning \n\nMark Zlochin \n\nDepartment of Computer Science \n\nTechnion - Israel Institute of Technology \n\nTechnion City, Haifa 32000, Israel \n\nzmark@cs.technion.ac.il \n\nYoramBaram \n\nDepartment of Computer Science \n\nTechnion - Israel Institute of Technology \n\nTechnion City, Haifa 32000, Israel \n\nbaram@cs.technion.ac.il \n\nAbstract \n\nWe propose a new Markov Chain Monte Carlo algorithm which is a gen(cid:173)\neralization of the stochastic dynamics method. The algorithm performs \nexploration of the state space using its intrinsic geometric structure, facil(cid:173)\nitating efficient sampling of complex distributions. Applied to Bayesian \nlearning in neural networks, our algorithm was found to perform at least \nas well as the best state-of-the-art method while consuming considerably \nless time. \n\n1 Introduction \n\nIn the Bayesian framework predictions are made by integrating the function of interest \nover the posterior parameter distribution, the lattt~r being the normalized product of the \nprior distribution and the likelihood. Since in most problems the integrals are too complex \nto be calculated analytically, approximations are needed. \n\nEarly works in Bayesian learning for nonlinear models [Buntineand Weigend 1991, \nMacKay 1992] used Gaussian approximations to the posterior parameter distribution. \nHowever, the Gaussian approximation may be poor, especially for complex models, be(cid:173)\ncause of the multi-modal character of the posterior distribution. \n\nHybrid Monte Carlo (HMC) [Duane et al. 1987] introduced to the neural network com(cid:173)\nmunity by [Neal 1996], deals more successfully with multi-modal distributions but is very \ntime consuming. 
One of the main causes of HMC inefficiency is the anisotropic character of the posterior distribution: the density changes rapidly in some directions while remaining almost constant in others.

We present a novel algorithm which overcomes this problem by using the intrinsic geometrical structure of the model space.

2 Hybrid Monte Carlo

Markov Chain Monte Carlo (MCMC) [Gilks et al. 1996] approximates the value

E[a] = \int a(\theta) Q(\theta) \, d\theta

by the mean

\bar{a} = \frac{1}{N} \sum_{t=1}^{N} a(\theta^{(t)})

where \theta^{(1)}, \ldots, \theta^{(N)} are successive states of an ergodic Markov chain with invariant distribution Q(\theta).

In addition to ergodicity and invariance of Q(\theta), another quality we would like the Markov chain to have is rapid exploration of the state space. While the first two qualities are rather easily attained, achieving rapid exploration of the state space is often nontrivial.

A state-of-the-art MCMC method, capable of sampling from complex distributions, is Hybrid Monte Carlo [Duane et al. 1987].

The algorithm is expressed in terms of sampling from the canonical distribution for the state, q, of a "physical" system, defined in terms of the energy function E(q)^1:

P(q) \propto \exp(-E(q))    (1)

To allow the use of dynamical methods, a "momentum" variable, p, is introduced, with the same dimensionality as q. The canonical distribution over the "phase space" is defined to be:

P(q, p) \propto \exp(-H(q, p))    (2)

where H(q, p) = E(q) + K(p) is the "Hamiltonian", which represents the total energy. K(p) is the "kinetic energy" due to momentum, defined as

K(p) = \sum_{i=1}^{n} \frac{p_i^2}{2 m_i}    (3)

where p_i, i = 1, \ldots, n are the momentum components and m_i is the "mass" associated with the i'th component, so that different components can be given different weight.
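As an illustration of the MCMC estimator above, the following is a minimal sketch using a random-walk Metropolis sampler (a simpler MCMC variant than HMC) targeting P(q) \propto \exp(-E(q)). The energy function, proposal width, and chain length here are illustrative assumptions, not the paper's neural-network posterior.

```python
import numpy as np

def E(q):
    """Energy of a standard Gaussian target: P(q) ∝ exp(-E(q))."""
    return 0.5 * q ** 2

def metropolis_mean(a, n_steps=20000, step=1.0, seed=0):
    """Estimate E[a] = ∫ a(q) Q(q) dq by the chain average (1/N) Σ a(q_t)."""
    rng = np.random.default_rng(seed)
    q, total = 0.0, 0.0
    for _ in range(n_steps):
        q_new = q + step * rng.normal()
        # Metropolis acceptance: keeps exp(-E(q)) invariant
        if rng.random() < np.exp(E(q) - E(q_new)):
            q = q_new
        total += a(q)
    return total / n_steps

est = metropolis_mean(lambda q: q)  # E[q] under N(0, 1) is 0
```

The same chain average with a different test function a(\cdot) estimates any posterior expectation; HMC replaces the random-walk proposal with a dynamical one to explore faster.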
Sampling from the canonical distribution can be done using the stochastic dynamics method [Andersen 1980], in which the task is split into two subtasks: sampling uniformly from values of q and p with a fixed total energy, H(q, p), and sampling states with different values of H. The first task is done by simulating the Hamiltonian dynamics of the system:

\frac{dq_i}{d\tau} = +\frac{\partial H}{\partial p_i} = \frac{p_i}{m_i}, \qquad \frac{dp_i}{d\tau} = -\frac{\partial H}{\partial q_i} = -\frac{\partial E}{\partial q_i}

Different energy levels are obtained by occasional stochastic Gibbs sampling [Geman and Geman 1984] of the momentum. Since q and p are independent, p may be updated without reference to q by drawing a value with probability density proportional to \exp(-K(p)), which, in the case of (3), can be easily done, since the p_i's have independent Gaussian distributions.

In practice, Hamiltonian dynamics cannot be simulated exactly, but can be approximated by some discretization using finite time steps. One common approximation is the leapfrog discretization [Neal 1996].

In the hybrid Monte Carlo method, stochastic dynamics transitions are used to generate candidate states for the Metropolis algorithm [Metropolis et al. 1953]. This eliminates certain drawbacks of the stochastic dynamics method, such as systematic errors due to the leapfrog discretization, since the Metropolis algorithm ensures that every transition keeps the canonical distribution invariant. However, an empirical comparison between uncorrected stochastic dynamics and HMC applied to Bayesian learning in neural networks [Neal 1996] showed that with an appropriate discretization stepsize there is no notable difference between the two methods.

^1 Note that any probability density that is nowhere zero can be put in this form, by simply defining E(q) = -\log P(q) - \log Z, for any convenient Z.

A modification proposed in [Horowitz 1991], instead of Gibbs sampling of the momentum, is to replace p each time by p
\cos(\theta) + \zeta \sin(\theta), where \theta is a small angle and \zeta is distributed according to N(0, 1). While keeping the canonical distribution invariant, this scheme, called momentum persistence, improves the rate of exploration.

3 Riemannian geometry

A Riemannian manifold [Amari 1997] is a set \Theta \subseteq R^n equipped with a metric tensor G, a positive semidefinite matrix defining the inner product between infinitesimal increments as:

\langle d\theta_1, d\theta_2 \rangle = d\theta_1^T \cdot G \cdot d\theta_2

Let us denote the entries of G by G_{i,j} and the entries of G^{-1} by G^{i,j}. This inner product naturally gives us the norm

\| d\theta \|_G^2 = \langle d\theta, d\theta \rangle = d\theta^T \cdot G \cdot d\theta.

The Jeffreys prior over \Theta is defined by the density function:

\pi(\theta) \propto \sqrt{|G(\theta)|}

where |\cdot| denotes the determinant.

3.1 Hamiltonian dynamics over a manifold

For a Riemannian manifold the dynamics take a more general form than the one described in section 2.

If the metric tensor is G and all masses are set to one, the Hamiltonian is given by:

H(q, p) = E(q) + \frac{1}{2} p^T \cdot G^{-1} \cdot p    (4)

The dynamics are governed by the following set of differential equations [Chavel 1993]:

\frac{d^2 q_i}{d\tau^2} = -\sum_{j,k} \Gamma_{j,k}^{i} \frac{dq_j}{d\tau} \frac{dq_k}{d\tau} - \sum_{j} G^{i,j} \frac{\partial E}{\partial q_j}

where \Gamma_{j,k}^{i} are the Christoffel symbols given by:

\Gamma_{j,k}^{i} = \frac{1}{2} \sum_{m} G^{i,m} \left( \frac{\partial G_{m,k}}{\partial q_j} + \frac{\partial G_{m,j}}{\partial q_k} - \frac{\partial G_{j,k}}{\partial q_m} \right)

and \dot{q} = \frac{dq}{d\tau} is related to p by \dot{q} = G^{-1} p.

3.2 Riemannian geometry of functions

In regression, the log-likelihood is proportional to the empirical error, which is simply the Euclidean distance between the target point, t, and the candidate function evaluated over the sample. Therefore, the most natural distance measure between models is the Euclidean seminorm:

d(\theta_1, \theta_2)^2 = \| f(\theta_1) - f(\theta_2) \|^2 = \sum_{i=1}^{l} (f(x_i, \theta_1) - f(x_i, \theta_2))^2    (5)

The resulting metric tensor is:

G = \sum_{i=1}^{l} \nabla_\theta f(x_i, \theta) \cdot \nabla_\theta f(x_i, \theta)^T = J^T \cdot J    (6)

where \nabla_\theta denotes the gradient and J = \left[ \frac{\partial f(x_i, \theta)}{\partial \theta_j} \right] is the Jacobian matrix.
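The metric tensor of (6) can be sketched numerically: the Jacobian is approximated by central finite differences and G is formed as J^T J. The model f(x, \theta) and the data below are illustrative assumptions; for a linear model f(x, \theta) = x^T \theta the Jacobian equals the design matrix X, so G = X^T X, which gives an easy check.

```python
import numpy as np

def jacobian(f, X, theta, eps=1e-6):
    """Central finite-difference Jacobian J[i, j] = d f(x_i, theta) / d theta_j."""
    l, n = len(X), len(theta)
    J = np.empty((l, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (np.array([f(x, theta + d) for x in X]) -
                   np.array([f(x, theta - d) for x in X])) / (2 * eps)
    return J

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # 5 sample points, 3 parameters (illustrative)
theta = rng.normal(size=3)

f_lin = lambda x, th: x @ th      # linear model: its Jacobian is exactly X
J = jacobian(f_lin, X, theta)
G = J.T @ J                       # metric tensor of (6)
```

For a nonlinear model the same `jacobian` call applies; G is then position-dependent, as the manifold dynamics of section 3.1 require.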
3.3 Bayesian geometry

A Bayesian approach would suggest the inclusion of prior assumptions about the parameters in the manifold geometry.

If, for example, a priori \theta \sim N(0, 1/\alpha), then the negative log-posterior can be written, up to an additive constant, as:

-\log p(\theta | x) = \beta \sum_{i=1}^{l} (f(x_i, \theta) - t_i)^2 + \alpha \sum_{k=1}^{n} \theta_k^2

where \beta is the inverse noise variance.

Therefore, the natural metric in the model space is

d(\theta_1, \theta_2)^2 = \beta \sum_{i=1}^{l} (f(x_i, \theta_1) - f(x_i, \theta_2))^2 + \alpha \sum_{k=1}^{n} (\theta_k^{(1)} - \theta_k^{(2)})^2

with the metric tensor:

G_B = \beta \cdot G + \alpha \cdot I = \tilde{J}^T \cdot \tilde{J}    (7)

where \tilde{J} is the "extended Jacobian":

\tilde{J}_{i,j} = \begin{cases} \sqrt{\beta} \, \frac{\partial f(x_i, \theta)}{\partial \theta_j} & i \le l \\ \sqrt{\alpha} \, \delta_{i-l,j} & i > l \end{cases}    (8)

where \delta_{i,j} is the Kronecker delta.

Note that as \alpha \to 0, G_B \to \beta G; hence, as the prior becomes vaguer, we approach a non-Bayesian paradigm. If, on the other hand, \alpha \to \infty or \beta \cdot G \to 0, the Bayesian geometry approaches the Euclidean geometry of the parameter space. These are the qualities that we would like the Bayesian geometry to have: if the prior is "strong" in comparison to the likelihood, the exact form of G should be of little importance.

The definitions above can be applied to any log-concave prior distribution, with the inverse Hessian of the log-prior, (\nabla \nabla \log p(\theta))^{-1}, replacing \alpha I in (7). The framework is not restricted to regression. For a general distribution class it is natural to use the Fisher information matrix, I, as a metric tensor [Amari 1997]. The Bayesian metric tensor then becomes:

G_B = I + (\nabla \nabla \log p(\theta))^{-1}    (9)

4 Manifold Stochastic Dynamics

As mentioned before, the energy landscape in many regression problems is anisotropic. This degrades the performance of HMC in two respects:

• The dynamics may not be optimal for efficient exploration of the posterior distribution, as suggested by studies of Gaussian diffusions [Hwang et al. 1993].
• The resulting differential equations are stiff [Gear 1971], leading to large discretization errors, which in turn necessitate small time steps, implying a high computational burden.

Both of these problems disappear if, instead of the Euclidean Hamiltonian dynamics used in HMC, we simulate dynamics over the manifold equipped with the metric tensor G_B proposed in the previous section.

In the context of regression, from the definition G_B = \tilde{J}^T \cdot \tilde{J} we obtain an alternative equation for \frac{d^2 q}{d\tau^2}, in matrix form:

\frac{d^2 q}{d\tau^2} = -G_B^{-1} \left( \nabla E + \tilde{J}^T \frac{d\tilde{J}}{d\tau} \dot{q} \right)    (10)

In the canonical distribution P(q, p) \propto \exp(-H(q, p)), the conditional distribution of p given q is a zero-mean Gaussian with covariance matrix G_B(q), and the marginal distribution over q is proportional to \exp(-E(q)) \pi(q). This is equivalent to multiplying the prior by the Jeffreys prior^2.

Sampling from the canonical distribution is two-fold:

• Simulate the Hamiltonian dynamics (3.1) for one time step using the leapfrog discretization.

• Replace p using momentum persistence. Unlike the HMC case, the momentum perturbation \zeta is distributed according to N(0, G_B).

The actual weights multiplying the matrices I and G in (7) may be chosen to be different from the specified \alpha and \beta, so as to improve numerical stability.

5 Empirical comparison

5.1 Robot arm problem

We compared the performance of the Manifold Stochastic Dynamics (MSD) algorithm with the standard HMC. The comparison was carried out using MacKay's robot arm problem, which is a common benchmark for Bayesian methods in neural networks [MacKay 1992, Neal 1996].
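Before turning to the experimental details, the Bayesian metric tensor and the MSD momentum step described in section 4 can be sketched numerically. The extended Jacobian of (8) stacks \sqrt{\beta} J on \sqrt{\alpha} I and reproduces G_B = \beta J^T J + \alpha I of (7); the perturbation \zeta \sim N(0, G_B) is drawn via a Cholesky factor. The small Jacobian and dimensions are illustrative assumptions, not the paper's 8-hidden-unit network.

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 6, 4                                # sample size and parameter count (illustrative)
J = rng.normal(size=(l, n))                # stand-in Jacobian of the model
alpha, beta = 1.0, 400.0                   # hyperparameters as in section 5.1

# Extended Jacobian of (8): sqrt(beta)*J stacked over sqrt(alpha)*I, shape (l+n, n)
J_ext = np.vstack([np.sqrt(beta) * J, np.sqrt(alpha) * np.eye(n)])
G_B = J_ext.T @ J_ext                      # equals beta * J^T J + alpha * I, as in (7)

# Momentum perturbation zeta ~ N(0, G_B): if z ~ N(0, I) then L z ~ N(0, L L^T)
L = np.linalg.cholesky(G_B)
zeta = L @ rng.normal(size=n)

# Persistence update with cos(theta) = 0.95: p <- p cos(theta) + zeta sin(theta)
theta_p = np.arccos(0.95)
p = L @ rng.normal(size=n)                 # current momentum ~ N(0, G_B)
p_new = p * np.cos(theta_p) + zeta * np.sin(theta_p)
```

Since cos^2(\theta) + sin^2(\theta) = 1 and p, \zeta are independent N(0, G_B) draws, p_new again has covariance G_B, which is why the update leaves the canonical distribution invariant.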
The robot arm problem is concerned with the mapping:

y_1 = 2.0 \cos x_1 + 1.3 \cos(x_1 + x_2) + e_1, \qquad y_2 = 2.0 \sin x_1 + 1.3 \sin(x_1 + x_2) + e_2

where e_1, e_2 are independent Gaussian noise variables of standard deviation 0.05. The dataset used by Neal and MacKay contained 200 examples in the training set and 400 in the test set.

^2 In fact, since the actual prior over the weights is unknown, a truly Bayesian approach would be to use a non-informative prior such as \pi(q). In this paper we kept the modified prior, which is the product of \pi(q) and a zero-mean Gaussian.

Figure 1: Average (over the 10 runs) autocorrelation of input-to-hidden (left) and hidden-to-output (right) weights for HMC with 100 and 30 leapfrog steps per iteration and MSD with a single leapfrog step per iteration. The horizontal axis gives the lags, measured in number of iterations.

We used a neural network with two input units, one hidden layer containing 8 tanh units and two linear output units. The hyperparameter \beta was set to its correct value of 400 and \alpha was chosen to be 1.

5.2 Algorithms

We compared MSD with two versions of HMC, with 30 and with 100 leapfrog steps per iteration, henceforth referred to as HMC30 and HMC100. MSD was run with a single leapfrog step per iteration. In all three algorithms the momentum was resampled using persistence with \cos(\theta) = 0.95.

A single iteration of HMC100 required about 4.8 \cdot 10^6 floating point operations (flops), HMC30 required 1.4 \cdot 10^6 flops and MSD required 0.5 \cdot 10^6 flops.
Hence the computational load of MSD was about one third of that of HMC30 and 10 times lower than that of HMC100.

The discretization stepsize for HMC was chosen so as to keep the rejection rate below 5%. An equivalent criterion, an average error in the Hamiltonian of around 0.05, was used for MSD.

All three sampling algorithms were run 10 times, each time for 3000 iterations, with the first 1000 samples discarded in order to allow the algorithms to reach the regions of high probability.

5.3 Results

One appropriate measure of the rate of state space exploration is the weights autocorrelation [Neal 1996]. As shown in Figure 1, the behavior of MSD was clearly superior to that of HMC.

Another value of interest is the total squared error over the test set. The predictions for the test set were made as follows. A subsample of 100 parameter vectors was generated by taking every twentieth sample vector, starting from sample 1001. The predicted value was the average over the empirical function distribution of this subsample.

The total squared errors, normalized with respect to the variance on the test cases, have the following statistics (over the 10 runs):

            average    standard deviation
  HMC30      1.314          0.074
  HMC100     1.167          0.044
  MSD        1.161          0.023

The average error of HMC30 is high, indicating that the algorithm failed to reach the region of high probability. The errors of HMC100 and MSD are comparable, but the standard deviation for MSD is half that for HMC100, meaning that the estimate obtained using MSD is more reliable.

6 Conclusion

We have described a new algorithm for efficient sampling from complex distributions such as those appearing in Bayesian learning with non-linear models.
The empirical comparison shows that our algorithm achieves results superior to the best achieved by existing algorithms in considerably smaller computation time.

References

[Amari 1997] Amari S., "Natural Gradient Works Efficiently in Learning", Neural Computation, vol. 10, pp. 251-276.

[Andersen 1980] Andersen H.C., "Molecular dynamics simulations at constant pressure and/or temperature", Journal of Chemical Physics, vol. 72, pp. 2384-2393.

[Buntine and Weigend 1991] Buntine W.L. and Weigend A.S., "Bayesian back-propagation", Complex Systems, vol. 5, pp. 603-643.

[Chavel 1993] Chavel I., Riemannian Geometry: A Modern Introduction, Cambridge University Press.

[Duane et al. 1987] Duane S., Kennedy A.D., Pendleton B.J. and Roweth D., "Hybrid Monte Carlo", Physics Letters B, vol. 195, pp. 216-222.

[Gear 1971] Gear C.W., Numerical Initial Value Problems in Ordinary Differential Equations, Prentice Hall.

[Geman and Geman 1984] Geman S. and Geman D., "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", IEEE Trans. PAMI-6, pp. 721-741.

[Gilks et al. 1996] Gilks W.R., Richardson S. and Spiegelhalter D.J., Markov Chain Monte Carlo in Practice, Chapman & Hall.

[Horowitz 1991] Horowitz A.M., "A generalized guided Monte Carlo algorithm", Physics Letters B, vol. 268, pp. 247-252.

[Hwang et al. 1993] Hwang C.-R., Hwang-Ma S.-Y. and Sheu S.-J., "Accelerating Gaussian diffusions", Annals of Applied Probability, vol. 3, pp. 897-913.

[MacKay 1992] MacKay D.J.C., Bayesian Methods for Adaptive Models, Ph.D. thesis, California Institute of Technology.

[Metropolis et al. 1953] Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H. and Teller E., "Equation of State Calculations by Fast Computing Machines", Journal of Chemical Physics, vol. 21, pp. 1087-1092.

[Neal 1996] Neal R.M., Bayesian Learning for Neural Networks, Springer.
", "award": [], "sourceid": 1757, "authors": [{"given_name": "Mark", "family_name": "Zlochin", "institution": null}, {"given_name": "Yoram", "family_name": "Baram", "institution": null}]}