{"title": "Bayesian Sampling Using Stochastic Gradient Thermostats", "book": "Advances in Neural Information Processing Systems", "page_first": 3203, "page_last": 3211, "abstract": "Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables in order to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory.", "full_text": "Bayesian Sampling Using Stochastic Gradient\n\nThermostats\n\nNan Ding\u2217\nGoogle Inc.\n\ndingnan@google.com\n\nYouhan Fang\u2217\nPurdue University\n\nyfang@cs.purdue.edu\n\nChangyou Chen\nDuke University\n\ncchangyou@gmail.com\n\nRobert D. Skeel\nPurdue University\n\nskeel@cs.purdue.edu\n\nRyan Babbush\n\nGoogle Inc.\n\nbabbush@google.com\n\nHartmut Neven\n\nGoogle Inc.\n\nneven@google.com\n\nAbstract\n\nDynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and\nLangevin dynamics (LD), are commonly used to sample target distributions. Re-\ncently, such approaches have been combined with stochastic gradient techniques\nto increase sampling ef\ufb01ciency when dealing with large datasets. An outstanding\nproblem with this approach is that the stochastic gradient introduces an unknown\namount of noise which can prevent proper sampling after discretization. To rem-\nedy this problem, we show that one can leverage a small number of additional\nvariables to stabilize momentum \ufb02uctuations induced by the unknown noise. 
Our method is inspired by the idea of a thermostat in statistical physics and is justi\ufb01ed by a general theory.\n\n1 Introduction\n\nThe generation of random samples from a posterior distribution is a pervasive problem in Bayesian statistics which has many important applications in machine learning. The Markov Chain Monte Carlo (MCMC) method, proposed by Metropolis et al. [16], generates unbiased samples from a desired distribution when the density function is known up to a normalizing constant. However, traditional MCMC methods are based on random walk proposals, which lead to highly correlated samples. On the other hand, dynamics-based sampling methods, e.g. Hybrid Monte Carlo (HMC) [6, 10], avoid this high degree of correlation by combining dynamic systems with the Metropolis step. The dynamic system uses information from the gradient of the log density to reduce the random walk effect, and the Metropolis step serves as a correction of the discretization error introduced by the numerical integration of the dynamic system.\n\nThe computational cost of HMC methods depends primarily on the gradient evaluation. In many machine learning problems, expensive gradient computations are a consequence of working with extremely large datasets. In such scenarios, methods based on stochastic gradients have been very successful. A stochastic gradient uses the gradient obtained from a random subset of the data to approximate the true gradient. This idea was first used in optimization [9, 19] but was recently adapted for sampling methods based on stochastic differential equations (SDEs) such as Brownian dynamics [1, 18, 24] and Langevin dynamics [5].\n\nDue to discretization, stochastic gradients introduce an unknown amount of noise into the dynamic system. Existing methods sample correctly only when the step size is small or when a good estimate of the noise is available. 
In this paper, we propose a method based on SDEs that self-adapts to the unknown noise with the help of a small number of additional variables. This allows for the use of a larger discretization step, a smaller diffusion factor, or a smaller minibatch to improve sampling efficiency without sacrificing accuracy.\n\n\u2217 indicates equal contribution.\n\nFrom the statistical physics perspective, all these dynamics-based sampling methods are approaches that use dynamics to approximate a canonical ensemble [23]. In a canonical ensemble, the distribution of the states follows the canonical distribution, which corresponds to the target posterior distribution of interest. In attempting to sample from the canonical ensemble, existing methods have neglected the condition that the system temperature must remain near a target temperature (Eq. (4) of Sec. 3). When this requirement is ignored, noise introduced by stochastic gradients may drive the system temperature away from the target temperature and cause inaccurate sampling. The additional variables in our method essentially play the role of a thermostat which controls the temperature and, as a consequence, handles the unknown noise. This approach can also be found by following a general recipe which helps in designing dynamic systems that produce correct samples.\n\nThe rest of the paper is organized as follows. Section 2 briefly reviews the related background. Section 3 proposes the stochastic gradient Nos\u00e9-Hoover thermostat method, which maintains the canonical ensemble. Section 4 presents the general recipe for finding proper SDEs and mathematically shows that the proposed method produces samples from the correct target distribution. Section 5 compares our method with previous methods on synthetic and real-world machine learning applications. 
The paper is concluded in Section 6.\n\n2 Background\n\nOur objective is to generate random samples from the posterior probability density p(\u03b8 | X) \u221d p(X | \u03b8)p(\u03b8), where \u03b8 represents an n-dimensional parameter vector and X represents the data. The canonical form is p(\u03b8 | X) = (1/Z) exp(\u2212U(\u03b8)), where U(\u03b8) = \u2212log p(X | \u03b8) \u2212 log p(\u03b8) is referred to as the potential energy and Z is the normalizing constant. Here, we briefly review a few dynamics-based sampling methods, including HMC, LD, stochastic gradient LD (SGLD) [24], and stochastic gradient HMC (SGHMC) [5], while relegating a more comprehensive review to Appendix A.\n\nHMC [17] works in an extended space \u0393 = (\u03b8, p), where \u03b8 and p simulate the positions and the momenta of particles in a system. Although some works, e.g. [7, 8], make use of variable mass, we assume that all particles have unit constant mass (i.e., m_i = 1). The joint density of \u03b8 and p can be written as \u03c1(\u03b8, p) \u221d exp(\u2212H(\u03b8, p)), where H(\u03b8, p) = U(\u03b8) + K(p) is called the Hamiltonian (the total energy). U(\u03b8) is called the potential energy and K(p) = p^T p / 2 is called the kinetic energy. Note that p has a standard normal distribution. The force on the system is defined as f(\u03b8) = \u2212\u2207U(\u03b8). It can be shown that the Hamiltonian dynamics\n\nd\u03b8 = p dt,   d p = f(\u03b8)dt,\n\nmaintain a constant total energy [17]. 
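As a concrete illustration (not part of the paper), the Hamiltonian dynamics above are typically solved with the leapfrog integrator; on a quadratic potential the total energy H stays nearly constant over many steps. The potential, step size, and step count below are illustrative choices, not the paper's.

```python
def leapfrog(theta, p, grad_U, h, n_steps):
    """Leapfrog discretization of d(theta) = p dt, dp = -grad_U(theta) dt (unit mass)."""
    p = p - 0.5 * h * grad_U(theta)          # initial half step for momentum
    for _ in range(n_steps - 1):
        theta = theta + h * p                # full step for position
        p = p - h * grad_U(theta)            # full step for momentum
    theta = theta + h * p
    p = p - 0.5 * h * grad_U(theta)          # final half step for momentum
    return theta, p

# Illustrative quadratic potential U(theta) = theta^2 / 2.
U = lambda th: 0.5 * th ** 2
grad_U = lambda th: th
H = lambda th, p: U(th) + 0.5 * p ** 2       # Hamiltonian H = U + K

theta0, p0 = 1.0, 0.0
theta1, p1 = leapfrog(theta0, p0, grad_U, h=0.01, n_steps=1000)
drift = abs(H(theta1, p1) - H(theta0, p0))
print(drift < 1e-3)                          # energy drift stays tiny
```

The near-conservation of H is what makes the subsequent Metropolis step cheap to accept in HMC.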
In each step of the HMC algorithm, one first randomizes p according to the standard normal distribution; then evolves (\u03b8, p) according to the Hamiltonian dynamics (solved by numerical integrators); and finally uses the Metropolis step to correct the discretization error.\n\nLangevin dynamics (with diffusion factor A) are described by the following SDE,\n\nd\u03b8 = p dt,   d p = f(\u03b8)dt \u2212 A p dt + \u221a(2A) d W,   (1)\n\nwhere W is a vector of n independent Wiener processes (see Appendix A), and d W can be informally written as N(0, I dt) or simply N(0, dt) as in [5]. Brownian dynamics\n\nd\u03b8 = f(\u03b8)dt + N(0, 2dt)\n\nis obtained from Langevin dynamics by rescaling time t \u2190 At and letting A \u2192 \u221e, i.e., on long time scales inertia effects can be neglected [11]. When the size of the dataset is big, the computation of the gradient of \u2212log p(X | \u03b8) = \u2212\u2211_{i=1}^{N} log p(x_i | \u03b8) can be very expensive. In such situations, one could use the likelihood of a random subset of the data x_i's to approximate the true likelihood,\n\n\u02dcU(\u03b8) = \u2212(N/\u02dcN) \u2211_{i=1}^{\u02dcN} log p(x_(i) | \u03b8) \u2212 log p(\u03b8),   (2)\n\nwhere {x_(i)} represents a random subset of {x_i} and \u02dcN \u226a N. Define the stochastic force \u02dcf(\u03b8) = \u2212\u2207\u02dcU(\u03b8). The SGLD algorithm [24] uses \u02dcf(\u03b8) and the Brownian dynamics to generate samples,\n\nd\u03b8 = \u02dcf(\u03b8)dt + N(0, 2dt).\n\nIn [5], the stochastic force with a discretization step h is approximated as h \u02dcf(\u03b8) \u2243 h f(\u03b8) + N(0, 2h B(\u03b8)) (note that the argument is not rigorous and that other significant artifacts of discretization may have been neglected). 
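A minimal sketch of the SGLD update with the minibatch potential of Eq. (2), on a toy 1-D Gaussian posterior (our own illustration; the dataset, minibatch size, step size, and iteration counts are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nb = 1000, 50                       # dataset size and minibatch size (illustrative)
x = rng.normal(0.5, 1.0, size=N)       # toy data; likelihood N(x_i | mu, 1), flat prior

def stochastic_force(mu):
    """f~(mu) = -grad U~(mu) from a random minibatch, following Eq. (2)."""
    batch = rng.choice(x, size=Nb, replace=False)
    return (N / Nb) * np.sum(batch - mu)

h = 1e-4                               # discretization step (illustrative)
mu, samples = 0.0, []
for t in range(20000):
    # SGLD / Brownian dynamics update: d(theta) = f~(theta) dt + N(0, 2 dt)
    mu = mu + h * stochastic_force(mu) + np.sqrt(2 * h) * rng.normal()
    if t >= 5000:                      # discard burn-in
        samples.append(mu)

# The exact posterior mean here is mean(x); the sampler should recover it.
print(abs(np.mean(samples) - x.mean()) < 0.05)
```

Note that for a larger h the minibatch noise term N(0, 2hB) becomes comparable to the injected N(0, 2h) noise, which is exactly the regime the paper's thermostat is designed to handle.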
The SGHMC algorithm uses a modified LD,\n\nd\u03b8 = p dt,   d p = \u02dcf(\u03b8)dt \u2212 A p dt + N(0, 2(A I \u2212 \u02c6B(\u03b8))dt),   (3)\n\nwhere \u02c6B(\u03b8) is intended to offset B(\u03b8), the noise from the stochastic force.\n\nHowever, \u02c6B(\u03b8) is hard to estimate in practice and cannot be omitted when the discretization step h is not small enough. Since poor estimation of \u02c6B(\u03b8) may lead to inaccurate sampling, we attempt to find a dynamic system which is able to adaptively fit to the noise without explicit estimation. The intuition comes from the practice of sampling a canonical ensemble in statistical physics.\n\nThe Metropolis step in SDE-based samplers with stochastic gradients is sometimes omitted on large datasets, because the evaluation of the potential energy requires using the entire dataset, which cancels the benefit of using stochastic gradients. There is some recent work [2, 3, 14] which attempts to estimate the Metropolis step using partial data. Although an interesting direction for future work, in this paper we do not consider applying the Metropolis step in conjunction with stochastic gradients.\n\n3 Stochastic Gradient Thermostats\n\nIn statistical physics, a canonical ensemble represents the possible states of a system in thermal equilibrium with a heat bath at fixed temperature T [23]. The probability of the states in a canonical ensemble follows the canonical distribution \u03c1(\u03b8, p) \u221d exp(\u2212H(\u03b8, p)/(k_B T)), where k_B is the Boltzmann constant. A critical characteristic of the canonical ensemble is that the system temperature, defined as the mean kinetic energy, satisfies the following thermal equilibrium condition,\n\nk_B T / 2 = (1/n) E[K(p)],   or equivalently,   k_B T = (1/n) E[p^T p].   (4)\n\nAll dynamics-based sampling methods approximate the canonical ensemble to generate samples. 
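With k_B T = 1, condition (4) simply says that the per-coordinate mean kinetic energy of momenta p ~ N(0, I) is 1/2, i.e. (1/n) E[p^T p] = 1. A quick numerical check of this identity (our own sketch; the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 20000                     # dimension and number of momentum draws (arbitrary)
p = rng.normal(size=(m, n))           # momenta from their marginal rho_p ~ N(0, I)

# Thermal equilibrium condition (4) with k_B T = 1:
#   (1/n) E[p^T p] = 1, i.e. the mean kinetic energy K(p) = p^T p / 2 equals n/2.
kBT_estimate = np.mean(np.sum(p * p, axis=1)) / n
print(abs(kBT_estimate - 1.0) < 0.01)
```

Monitoring this statistic along a sampler's trajectory is how one detects the temperature drift described next.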
In Bayesian statistics, n is the dimension of \u03b8, and k_B T = 1, so that \u03c1(\u03b8, p) \u221d exp(\u2212H(\u03b8, p)) and, more importantly, \u03c1_\u03b8(\u03b8) \u221d exp(\u2212U(\u03b8)). However, one key fact that was overlooked in previous methods is that dynamics that correctly simulate the canonical ensemble must maintain the thermal equilibrium condition (4). Besides its physical meaning, the condition is necessary for p to be distributed as its marginal canonical distribution \u03c1_p(p) \u221d exp(\u2212K(p)).\n\nIt can be verified that ordinary HMC and LD (1) with the true force both maintain (4). However, after combination with the stochastic force \u02dcf(\u03b8), the dynamics (3) may drift away from thermal equilibrium if \u02c6B(\u03b8) is poorly estimated. Therefore, to generate correct samples, one needs to introduce a proper thermostat, which adaptively controls the mean kinetic energy. To this end, we introduce an additional variable \u03be, and use the following dynamics (with diffusion factor A and k_B T = 1),\n\nd\u03b8 = p dt,   d p = \u02dcf(\u03b8)dt \u2212 \u03be p dt + \u221a(2A) N(0, dt),   (5)\n\nd\u03be = ((1/n) p^T p \u2212 1)dt.   (6)\n\nIntuitively, if the mean kinetic energy is higher than 1/2, then \u03be gets bigger and p experiences more friction in (5); on the other hand, if the mean kinetic energy is lower, then \u03be gets smaller and p experiences less friction. Because (6) appears to be the same as the Nos\u00e9-Hoover thermostat [13] in statistical physics, we call our method the stochastic gradient Nos\u00e9-Hoover thermostat (SGNHT, Algorithm 1). In Section 4, we will show that (6) is a simplified version of a more general SGNHT method that is able to handle high-dimensional non-isotropic noise from \u02dcf. 
But before that, let us first look at a 1-D illustration of SGNHT sampling in the presence of unknown noise.\n\nAlgorithm 1: Stochastic Gradient Nos\u00e9-Hoover Thermostat\nInput: Parameters h, A.\nInitialize \u03b8_(0) \u2208 R^n, p_(0) \u223c N(0, I), and \u03be_(0) = A;\nfor t = 1, 2, . . . do\n    Evaluate \u2207\u02dcU(\u03b8_(t\u22121)) from (2);\n    p_(t) = p_(t\u22121) \u2212 \u03be_(t\u22121) p_(t\u22121) h \u2212 \u2207\u02dcU(\u03b8_(t\u22121))h + \u221a(2A) N(0, h);\n    \u03b8_(t) = \u03b8_(t\u22121) + p_(t) h;\n    \u03be_(t) = \u03be_(t\u22121) + ((1/n) p_(t)^T p_(t) \u2212 1)h;\nend\n\nIllustrations of a Double-well Potential   To illustrate that the adaptive update (6) is able to control the mean kinetic energy and, more importantly, produce correct sampling with unknown noise on the gradient, we consider the following double-well potential,\n\nU(\u03b8) = (\u03b8 + 4)(\u03b8 + 1)(\u03b8 \u2212 1)(\u03b8 \u2212 3)/14 + 0.5.\n\nThe target distribution is \u03c1(\u03b8) \u221d exp(\u2212U(\u03b8)). To simulate the unknown noise, we let \u2207\u02dcU(\u03b8)h = \u2207U(\u03b8)h + N(0, 2Bh), where h = 0.01 and B = 1. In the interest of clarity we did not inject additional noise other than the noise from \u2207\u02dcU(\u03b8), namely A = 0. In Figure 1 we plot the estimated density based on 10^6 samples and the mean kinetic energy over iterations, when \u03be is fixed at 0.1, 1, 10 successively, as well as when \u03be follows our thermostat update in (6).\n\nFrom Figure 1, when \u03be = B = 1, the SDE is the ordinary Langevin dynamics. In this case, the sampling is accurate and the kinetic energy is controlled around 0.5. When \u03be > B, the kinetic energy drops to a low value, and the sampling gets stuck in one local minimum; this is what happens in SGD optimization with momentum. When \u03be < B, the kinetic energy gets too high, and the sampling looks like a random walk. 
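Algorithm 1 on the double-well example above can be sketched as follows (our own reading of the setup: h = 0.01, B = 1, A = 0 as in the text, but a shorter run than the paper's 10^6 iterations; the thermostat variable should settle near the unknown noise level B):

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_U(th):
    # Gradient of U(theta) = (th+4)(th+1)(th-1)(th-3)/14 + 0.5 by the product rule.
    return ((th + 1) * (th - 1) * (th - 3) + (th + 4) * (th - 1) * (th - 3)
            + (th + 4) * (th + 1) * (th - 3) + (th + 4) * (th + 1) * (th - 1)) / 14.0

h, A, B = 0.01, 0.0, 1.0               # step size, injected diffusion, unknown noise level
theta, p, xi = 0.0, rng.normal(), A    # n = 1 in this illustration
xis = []
for t in range(200000):
    # "Stochastic" gradient: true gradient plus unknown noise N(0, 2 B h).
    g = grad_U(theta) * h + np.sqrt(2 * B * h) * rng.normal()
    p = p - xi * p * h - g + np.sqrt(2 * A * h) * rng.normal()
    theta = theta + p * h
    xi = xi + (p * p - 1.0) * h        # thermostat update, Eq. (6) with n = 1
    xis.append(xi)

# The thermostat absorbs the unknown noise: xi hovers around B = 1.
print(abs(np.mean(xis[50000:]) - B) < 0.3)
```

Even with A = 0, the friction ξ self-adjusts until the kinetic energy balances the gradient noise.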
For SGNHT, the sampling looks as accurate as the one with \u03be = B, and the kinetic energy is also controlled around 0.5. In fact, as shown in Appendix B, the value of \u03be in SGNHT quickly converges to B = 1.\n\nFigure 1: The samples on \u03c1(\u03b8) and the mean kinetic energy over iterations K(p) with \u03be = 1 (1st), \u03be = 10 (2nd), \u03be = 0.1 (3rd), and the SGNHT (4th). The first three do not use a thermostat. The fourth column shows that the SGNHT method is able to sample accurately and maintain the mean kinetic energy with unknown noise.\n\n4 The General Recipe\n\nIn this section, we mathematically justify the proposed SGNHT method. We begin with a theorem showing why and how a sampler based on SDEs using stochastic gradients can produce the correct target distribution. The theorem serves two purposes. First, one can examine whether a given SDE sampler is correct or not. 
The theorem is more general than previous ones in [5, 24], which focus on justifying individual methods. Second, the theorem can serve as a general recipe for proposing new methods. As a concrete example of using this approach, we show how to obtain SGNHT from the main theorem.\n\n4.1 The Main Theorem\n\nConsider the following general stochastic differential equation that uses the stochastic force:\n\nd\u0393 = v(\u0393)dt + N(0, 2 D(\u03b8)dt),   (7)\n\nwhere \u0393 = (\u03b8, p, \u03be), and both p and \u03be are optional. v is a vector field that characterizes the deterministic part of the dynamics. D(\u03b8) = A + diag(0, B(\u03b8), 0), where the injected noise A is known and constant, whereas the noise of the stochastic gradient B(\u03b8) is unknown, may vary, and only appears in blocks corresponding to rows of the momentum. Both A and B are symmetric positive semidefinite. Taking the dynamics of SGHMC as an example, it has \u0393 = (\u03b8, p), v = (p, f \u2212 A p) and D(\u03b8) = diag(0, A I \u2212 \u02c6B(\u03b8) + B(\u03b8)).\n\nLet \u03c1(\u0393) = (1/Z) exp(\u2212H(\u0393)) be the joint probability density of all variables, and write H as H(\u0393) = U(\u03b8) + Q(\u03b8, p, \u03be). The marginal density for \u03b8 must equal the target density,\n\nexp(\u2212U(\u03b8)) \u221d \u222b\u222b exp(\u2212U(\u03b8) \u2212 Q(\u03b8, p, \u03be)) dp d\u03be,   (8)\n\nwhich will be referred to as the marginalization condition.\n\nMain Theorem. 
The stochastic process of \u03b8 generated by the stochastic differential equation (7) has the target distribution \u03c1_\u03b8(\u03b8) = (1/Z) exp(\u2212U(\u03b8)) as its stationary distribution, if \u03c1 \u221d exp(\u2212H) satisfies the marginalization condition (8), and\n\n\u2207 \u00b7 (\u03c1 v) = \u2207\u2207^T : (\u03c1 D),   (9)\n\nwhere we use the concise notation \u2207 = (\u2202/\u2202\u03b8, \u2202/\u2202p, \u2202/\u2202\u03be) for a column vector, \u00b7 for the vector inner product x \u00b7 y = x^T y, and : for the matrix double dot product X : Y = trace(X^T Y).\n\nProof. See Appendix C.\n\nRemark. The theorem implies that when the SDE is solved exactly (namely h \u2192 0), the noise of the stochastic force has no effect, because lim_{h\u21920} D = A [5]. In this case, any dynamics that produce the correct distribution with the true gradient, such as the original Langevin dynamics, can also produce the correct distribution with the stochastic gradient.\n\nHowever, when there is discretization error, one must find the proper H, v and A to ensure production of the correct distribution of \u03b8. Towards this end, the theorem provides a general recipe for finding proper dynamics that can sample correctly in the presence of stochastic forces. To use this prescription, one may freely select the dynamics characterized by v and A, as well as the joint stationary distribution for which the marginalization condition holds. Together, the selected v, A and \u03c1 must satisfy this main theorem.\n\nThe marginalization condition is important because for some stochastic differential equations there exists a \u03c1 that makes (9) hold even though the marginalized distribution is not the target distribution. Therefore, care must be taken when designing the dynamics. 
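As a sanity check of condition (9) (our own verification, not from the paper), one can confirm symbolically that ordinary Langevin dynamics (1) in one dimension satisfies it with H = U(\u03b8) + p^2/2 and D = diag(0, A):

```python
import sympy as sp

theta, p, A = sp.symbols('theta p A', real=True)
U = sp.Function('U')(theta)
H = U + p**2 / 2
rho = sp.exp(-H)                      # stationary density, up to normalization

# Langevin dynamics (1): v = (p, f - A p) with f = -U'(theta); D = diag(0, A).
f = -sp.diff(U, theta)
v = [p, f - A * p]

# LHS of (9): div(rho v).  RHS: sum_ij d_i d_j (rho D_ij) = d^2/dp^2 (rho A).
lhs = sp.diff(rho * v[0], theta) + sp.diff(rho * v[1], p)
rhs = sp.diff(rho * A, p, 2)
print(sp.simplify(lhs - rhs) == 0)    # both sides reduce to A*(p**2 - 1)*rho
```

Replacing the true force f by a noisy one changes D but not v, which is why a fixed friction A no longer satisfies (9) and an adaptive Ξ is needed.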
In the following subsection, we will use the proposed stochastic gradient Nos\u00e9-Hoover thermostat as an illustrative example of how our recipe may be used to discover new methods. We will show more examples in Appendix D.\n\n4.2 Revisiting the Stochastic Gradient Nos\u00e9-Hoover Thermostat\n\nLet us start from the following dynamics:\n\nd\u03b8 = p dt,   d p = f dt \u2212 \u039e p dt + N(0, 2 D dt),\n\nwhere both \u039e and D are n \u00d7 n matrices. Apparently, when \u039e \u2260 D, the dynamics will not generate the correct target distribution (see Appendix D). Now let us add dynamics for \u039e, denoted by d\u039e = v(\u039e) dt, and demonstrate application of the main theorem.\n\nLet \u03c1(\u03b8, p, \u039e) = (1/Z) exp(\u2212H(\u03b8, p, \u039e)) be our target distribution, where H(\u03b8, p, \u039e) = U(\u03b8) + Q(p, \u039e) and Q(p, \u039e) is also to be determined. Clearly, the marginalization condition is satisfied for such H(\u03b8, p, \u039e).\n\nLet R_z denote the gradient of a function R, and R_zz denote the Hessian. For simplicity, we constrain \u2207_\u039e \u00b7 v(\u039e) = 0, and assume that D is a constant matrix. Then the LHS and RHS of (9) become\n\nLHS = (\u2207 \u00b7 v \u2212 \u2207H \u00b7 v)\u03c1 = (\u2212trace(\u039e) + f^T p \u2212 Q_p^T f + Q_p^T \u039e p \u2212 Q_\u039e : v(\u039e))\u03c1,\nRHS = D : \u03c1_pp = D : (Q_p Q_p^T \u2212 Q_pp)\u03c1.\n\nEquating both sides, one gets\n\n\u2212trace(\u039e) + f^T p \u2212 Q_p^T f + Q_p^T \u039e p \u2212 Q_\u039e : v(\u039e) = D : (Q_p Q_p^T) \u2212 D : Q_pp.\n\nTo cancel the f terms, set Q_p = p; then Q(p, \u039e) = (1/2) p^T p + S(\u039e), which leaves S(\u039e) to be determined. The equation becomes\n\n\u2212\u039e : I + \u039e : (p p^T) \u2212 S_\u039e : v(\u039e) = D : (p p^T) \u2212 D : I.   (10)\n\nObviously, v(\u039e) must be a function of p p^T, since S_\u039e is independent of p. 
Also, D must only appear in S_\u039e, since we want v(\u039e) to be independent of the unknown D. Finally, v(\u039e) should be independent of \u039e, since we let \u2207_\u039e \u00b7 v(\u039e) = 0. Combining all three observations, we let v(\u039e) be a linear function of p p^T, and S_\u039e a linear function of \u039e. With some algebra, one finds that\n\nv(\u039e) = (p p^T \u2212 I)/\u00b5,   (11)\n\nand S_\u039e = (\u039e \u2212 D)\u00b5, which means Q(p, \u039e) = (1/2) p^T p + (1/2) \u00b5 (\u039e \u2212 D) : (\u039e \u2212 D). Equation (11) defines a general stochastic gradient Nos\u00e9-Hoover thermostat. When D = D I and \u039e = \u03be I (here D and \u03be are both scalars and I is the identity matrix), one can simplify (10) and obtain v(\u03be) = (p^T p \u2212 n)/\u00b5. It reduces to (6) of the SGNHT in Section 3 when \u00b5 = n.\n\nThe Nos\u00e9-Hoover thermostat without stochastic terms has \u03be \u223c N(0, \u00b5^\u22121). When there is a stochastic term N(0, 2 D dt), the distribution of \u039e changes to a matrix normal distribution MN(D, \u00b5^\u22121 I, I) (in the scalar case, N(D, \u00b5^\u22121)). This indicates that the thermostat absorbs the stochastic term D, since the expected value of \u039e is equal to D, and leaves the marginal distribution of \u03b8 invariant.\n\nIn the derivation above, we assumed that D is constant (by assuming B constant). This assumption is reasonable when the data size is large, so that the posterior of \u03b8 has small variance. In addition, the full dynamics of \u039e require an additional n \u00d7 n equations of motion, which is generally too costly. In practice, we found that Algorithm 1 with a single scalar \u03be works well.\n\n5 Experiments\n\n5.1 Gaussian Distribution Estimation Using Stochastic Gradient\n\nWe first demonstrate our method on a simple example: Bayesian inference on 1D normal distributions. 
The first part of the experiment tries to estimate the mean of the normal distribution with known variance and N = 100 random examples from N(0, 1). The likelihood is N(x_i | \u00b5, 1), and an improper uniform prior on \u00b5 is assigned. In each iteration we randomly select \u02dcN = 10 examples. The noise of the stochastic gradient is a constant given \u02dcN (Appendix E).\n\nFigure 2 shows the density of 10^6 samples obtained by SGNHT (1st plot) and SGHMC (2nd plot). As we can see, SGNHT samples accurately without knowing the variance of the noise of the stochastic force under all parameter settings, whereas SGHMC samples accurately only when h is small and A is large. The 3rd plot shows the mean of the \u03be values in SGNHT. When h = 0.001, \u03be and A are close. However, when h = 0.01, \u03be becomes much larger than A. This indicates that the discretization introduces a large noise from the stochastic gradient, and the \u03be variable effectively absorbs the noise.\n\nThe second part of the experiment is to estimate both the mean and the variance of the normal distribution. We use the likelihood function N(x_i | \u00b5, \u03b3^\u22121) and the Normal-Gamma distribution \u00b5, \u03b3 \u223c N(\u00b5 | 0, \u03b3) Gam(\u03b3 | 1, 1) as the prior. The variance of the stochastic gradient noise is no longer a constant and depends on the values of \u00b5 and \u03b3 (see Appendix E).\n\nSimilar density plots are available in Appendix E. Here we plot the Root Mean Square Error (RMSE) of the density estimation vs. the autocorrelation time of the observable \u00b5 + \u03b3 under various h and A in the 4th plot of Figure 2. We can see that SGNHT has a significantly lower autocorrelation time than SGHMC at similar sampling accuracy. More details about the h, A values which produce the plot are also available in Appendix E.\n\nFigure 2: Density of \u00b5 obtained by SGNHT with known variance (1st), density of \u00b5 obtained by SGHMC with known variance (2nd), mean of \u03be over iterations with known variance in SGNHT (3rd), RMSE vs. autocorrelation time for both methods with unknown variance (4th).\n\n5.2 Machine Learning Applications\n\nIn the following machine learning experiments, we used a reformulation of (5) and (6) similar to [5], by letting u = p h, \u03b7 = h^2, \u03b1 = \u03beh and a = Ah. The resulting Algorithm 2 is provided in Appendix F. In [5], SGHMC has been extensively compared with SGLD, SGD and SGD with momentum. Our experiments will focus on comparing SGHMC and SGNHT. Details of the experiment settings are described below. The test results over various parameters are reported in Figure 3.\n\nBayesian Neural Network   We first evaluate the benchmark MNIST dataset, using a Bayesian Neural Network (BNN) as in [5]. The MNIST dataset contains 50,000 training examples, 10,000 validation examples, and 10,000 test examples. To show that our algorithm is able to handle the large stochastic gradient noise due to a small minibatch, we chose a minibatch of size 20. 
Each algorithm is run for a total of 50k iterations with a burn-in of the first 10k iterations. The hidden layer size is 100, the parameter a is from {0.001, 0.01} and \u03b7 from {2, 4, 6, 8} \u00d7 10^\u22127.\n\nBayesian Matrix Factorization   Next, we evaluate our methods on two collaborative filtering tasks: the Movielens ml-1m dataset and the Netflix dataset, using the Bayesian probabilistic matrix factorization (BPMF) model [21]. The Movielens dataset contains 6,050 users and 3,883 movies with about 1M ratings, and the Netflix dataset contains 480,046 users and 17,000 movies with about 100M ratings. To conduct the experiments, each dataset is partitioned into training (80%) and testing (20%), and the training set is further partitioned for 5-fold cross validation. Each minibatch contains 400 ratings for Movielens1M and 40k ratings for Netflix. Each algorithm is run for 100k iterations with a burn-in of the first 20k iterations. The base number is chosen as 10, the parameter a is from {0.01, 0.1} and \u03b7 from {2, 4, 6, 8} \u00d7 10^\u22127.\n\nLatent Dirichlet Allocation   Finally, we evaluate our method on the ICML dataset using Latent Dirichlet Allocation [4]. The ICML dataset contains 765 documents from the abstracts of the ICML proceedings from 2007 to 2011. After simple stopword removal, we obtained a vocabulary size of about 2K and a total word count of about 44K. We used 80% of the documents for 5-fold cross validation and the remaining 20% for testing. Similar to [18], we used the semi-collapsed LDA whose posterior of \u03b8_kw is provided in Appendix H. The Dirichlet prior parameter for the topic distribution of each document is set to 0.1 and the Gaussian prior for \u03b8_kw is set as N(0.1, 1). Each minibatch contains 100 documents. 
Each algorithm is run for 50k iterations with the first 10k iterations as burn-in. The topic number is 30, the parameter a is from {0.01, 0.1} and \u03b7 from {2, 4, 6, 8} \u00d7 10^\u22125.\n\n5.2.1 Result Analysis\n\nFrom Figure 3, SGNHT is apparently more stable than SGHMC when the discretization step \u03b7 is larger. In all four datasets, especially with the smaller a, SGHMC gets worse and worse results as \u03b7 increases. With the largest \u03b7, SGHMC diverges (the green curve is beyond the plotted range) due to its failure to handle the large unknown noise with small a.\n\nFigure 3 also gives a comprehensive view of the critical role that a plays. On the one hand, a larger a may cause a stronger random walk effect, which slows down convergence (as in Movielens1M and Netflix). On the other hand, it helps to increase ergodicity and compensate for the unknown noise from the stochastic gradient (as in MNIST and ICML).\n\nThroughout the experiments, we find that the kinetic energy of SGNHT is always maintained around 0.5 while that of SGHMC is usually higher. 
Overall, SGNHT also achieves better test performance with the parameters selected by cross-validation (see Table 2 of Appendix G).

[Figure 3: a grid of plots; rows correspond to the MNIST, Movielens1M, Netflix, and ICML datasets, columns to the step sizes η = 2, 4, 6, 8 (×10⁻⁷ for the first three datasets, ×10⁻⁵ for ICML); each panel compares SGHMC and SGNHT at two values of a.]

Figure 3: The test error of MNIST (1st row), test RMSE of Movielens1M (2nd row), test RMSE of Netflix (3rd row), and test perplexity of ICML (4th row), with their standard deviations (close to 0 in rows 2 and 3), under various η and a.

6 Conclusion and Discussion

In this paper, we find proper dynamics that adaptively fit the noise introduced by stochastic gradients. Experiments show that our method is able to control the temperature, estimate the unknown noise, and perform competitively in practice. Our method can be justified in continuous time by a general theorem. The discretization of continuous SDEs, however, introduces bias. This issue has been studied extensively in previous work such as [20, 22, 15, 12]. The existence of an invariant measure has been proved (e.g., Theorem 3.2 of [22] and Proposition 2.5 of [12]), and bounds on the error have been obtained (e.g., O(h²) for a symmetric splitting scheme [12]). Due to space limitations, we leave a deeper discussion of this topic and a more rigorous justification to future work.

Acknowledgments

We acknowledge Kevin P. Murphy and Julien Cornebise for helpful discussions and comments.

References

[1] S. Ahn, A. K. Balan, and M. Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. Proceedings of the 29th International Conference on Machine Learning, pages 1591–1598, 2012.
[2] A. K. Balan, Y. Chen, and M. Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. Proceedings of the 31st International Conference on Machine Learning, 2014.
[3] R. Bardenet, A. Doucet, and C. Holmes. Towards Scaling up Markov Chain Monte Carlo: an Adaptive Subsampling Approach. Proceedings of the 31st International Conference on Machine Learning, pages 405–413, 2014.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. J. Mach. Learn.
Res., 3:993–1022, March 2003.
[5] T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. Proceedings of the 31st International Conference on Machine Learning, 2014.
[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Phys. Lett. B, 195:216–222, 1987.
[7] Y. Fang, J. M. Sanz-Serna, and R. D. Skeel. Compressible Generalized Hybrid Monte Carlo. J. Chem. Phys., 140:174108 (10 pages), 2014.
[8] M. Girolami and B. Calderhead. Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods. J. R. Statist. Soc. B, 73, Part 2:123–214 (with discussion), 2011.
[9] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic Variational Inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[10] A. M. Horowitz. A Generalized Guided Monte-Carlo Algorithm. Phys. Lett. B, 268:247–252, 1991.
[11] B. Leimkuhler and C. Matthews. Rational Construction of Stochastic Numerical Methods for Molecular Sampling. arXiv:1203.5428, 2012.
[12] B. Leimkuhler, C. Matthews, and G. Stoltz. The Computation of Averages from Equilibrium and Nonequilibrium Langevin Molecular Dynamics. IMA J. Num. Anal., 2014.
[13] B. Leimkuhler and S. Reich. A Metropolis Adjusted Nosé-Hoover Thermostat. Math. Modelling Numer. Anal., 43(4):743–755, 2009.
[14] D. Maclaurin and R. P. Adams. Firefly Monte Carlo: Exact MCMC with Subsets of Data. arXiv:1403.5693, 2014.
[15] J. C. Mattingly, A. M. Stuart, and M. Tretyakov. Convergence of Numerical Time-averaging and Stationary Measures via Poisson Equations. SIAM J. Num. Anal., 48:552–577, 2014.
[16] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21:1087–1092, 1953.
[17] R. M. Neal. MCMC Using Hamiltonian Dynamics. arXiv:1206.1901, 2012.
[18] S. Patterson and Y. W. Teh.
Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex. Advances in Neural Information Processing Systems 26, pages 3102–3110, 2013.
[19] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[20] G. Roberts and R. Tweedie. Exponential Convergence of Langevin Distributions and Their Discrete Approximations. Bernoulli, 2:341–363, 1996.
[21] R. Salakhutdinov and A. Mnih. Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo. Proceedings of the 25th International Conference on Machine Learning, pages 880–887, 2008.
[22] D. Talay. Second Order Discretization Schemes of Stochastic Differential Systems for the Computation of the Invariant Law. Stochastics and Stochastics Reports, 29:13–36, 1990.
[23] M. E. Tuckerman. Statistical Mechanics: Theory and Molecular Simulation. Oxford University Press, 2010.
[24] M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. Proceedings of the 28th International Conference on Machine Learning, 2011.