{"title": "A Complete Recipe for Stochastic Gradient MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 2917, "page_last": 2925, "abstract": "Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered since proving convergence in the presence of the stochastic gradient noise is non-trivial.  Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise.  In this paper, we provide a general recipe for constructing MCMC samplers--including stochastic gradient versions--based on continuous Markov processes specified via two matrices.  We constructively prove that the framework is complete. That is, any continuous Markov process that provides samples from the target distribution can be written in our framework.  We show how previous continuous-dynamic samplers can be trivially reinvented in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC).  Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC, with the scalability of stochastic gradient methods.", "full_text": "A Complete Recipe for Stochastic Gradient MCMC\n\nUniversity of Washington {yianma@u,tqchen@cs,ebfox@stat}.washington.edu\n\nYi-An Ma, Tianqi Chen, and Emily B. 
Fox\n\nAbstract\n\nMany recent Markov chain Monte Carlo (MCMC) samplers leverage continuous\ndynamics to define a transition kernel that efficiently explores a target distribution.\nIn tandem, a focus has been on devising scalable variants that subsample the data\nand use stochastic gradients in place of full-data gradients in the dynamic\nsimulations. However, such stochastic gradient MCMC samplers have lagged behind\ntheir full-data counterparts in terms of the complexity of dynamics considered\nsince proving convergence in the presence of the stochastic gradient noise is\nnon-trivial. Even with simple dynamics, significant physical intuition is often required\nto modify the dynamical system to account for the stochastic gradient noise. In this\npaper, we provide a general recipe for constructing MCMC samplers--including\nstochastic gradient versions--based on continuous Markov processes specified\nvia two matrices. We constructively prove that the framework is complete. That is,\nany continuous Markov process that provides samples from the target distribution\ncan be written in our framework. We show how previous continuous-dynamic\nsamplers can be trivially \u201creinvented\u201d in our framework, avoiding the complicated\nsampler-specific proofs. We likewise use our recipe to straightforwardly propose\na new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte\nCarlo (SGRHMC). Our experiments on simulated data and a streaming\nWikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits\nof Riemann HMC, with the scalability of stochastic gradient methods.\n\n1 Introduction\n\nMarkov chain Monte Carlo (MCMC) has become a de facto tool for Bayesian posterior inference.\nHowever, these methods notoriously mix slowly in complex, high-dimensional models and scale\npoorly to large datasets. 
The past decades have seen a rise in MCMC methods that provide more\nefficient exploration of the posterior, such as Hamiltonian Monte Carlo (HMC) [8, 12] and its Riemann\nmanifold variant [10]. This class of samplers is based on defining a potential energy function in\nterms of the target posterior distribution and then devising various continuous dynamics to explore\nthe energy landscape, enabling proposals of distant states. The gain in efficiency of exploration often\ncomes at the cost of a significant computational burden in large datasets.\nRecently, stochastic gradient variants of such continuous-dynamic samplers have proven quite useful\nin scaling the methods to large datasets [17, 1, 6, 2, 7]. At each iteration, these samplers use data\nsubsamples--or minibatches--rather than the full dataset. Stochastic gradient Langevin dynamics\n(SGLD) [17] innovated in this area by connecting stochastic optimization with a first-order Langevin\ndynamic MCMC technique, showing that adding the \u201cright amount\u201d of noise to stochastic gradient\nascent iterates leads to samples from the target posterior as the step size is annealed. Stochastic\ngradient Hamiltonian Monte Carlo (SGHMC) [6] builds on this idea, but importantly incorporates\nthe efficient exploration provided by the HMC momentum term. A key insight in that paper was that\nthe na\u00efve stochastic gradient variant of HMC actually leads to an incorrect stationary distribution\n(also see [4]); instead a modification to the dynamics underlying HMC is needed to account for\nthe stochastic gradient noise. 
Variants of both SGLD and SGHMC with further modifications to\nimprove efficiency have also recently been proposed [1, 13, 7].\nIn the plethora of past MCMC methods that explicitly leverage continuous dynamics--including\nHMC, Riemann manifold HMC, and the stochastic gradient methods--the focus has been on\nshowing that the intricate dynamics leave the target posterior distribution invariant. Innovating in this\narena requires constructing novel dynamics and simultaneously ensuring that the target distribution\nis the stationary distribution. This can be quite challenging, and often requires significant physical\nand geometrical intuition [6, 13, 7]. A natural question, then, is whether there exists a general recipe\nfor devising such continuous-dynamic MCMC methods that naturally lead to invariance of the target\ndistribution. In this paper, we answer this question in the affirmative. Furthermore, and quite\nimportantly, our proposed recipe is complete. That is, any continuous Markov process (with no jumps)\nwith the desired invariant distribution can be cast within our framework, including HMC, Riemann\nmanifold HMC, SGLD, SGHMC, their recent variants, and any future developments in this area.\nAs such, our method provides a unifying framework of past algorithms, as well as a practical tool for\ndevising new samplers and testing the correctness of proposed samplers.\nThe recipe involves defining a (stochastic) system parameterized by two matrices: a positive\nsemidefinite diffusion matrix, D(z), and a skew-symmetric curl matrix, Q(z), where z = (\u03b8, r)\nwith \u03b8 our model parameters of interest and r a set of auxiliary variables. The dynamics are then\nwritten explicitly in terms of the target stationary distribution and these two matrices. By varying\nthe choices of D(z) and Q(z), we explore the space of MCMC methods that maintain the correct\ninvariant distribution. 
We constructively prove the completeness of this framework by converting a\ngeneral continuous Markov process into the proposed dynamic structure.\nFor any given D(z), Q(z), and target distribution, we provide practical algorithms for\nimplementing either full-data or minibatch-based variants of the sampler. In Sec. 3.1, we cast many previous\ncontinuous-dynamic samplers in our framework, finding their D(z) and Q(z). We then show how\nthese existing D(z) and Q(z) building blocks can be used to devise new samplers; we leave the\nquestion of exploring the space of D(z) and Q(z) well-suited to the structure of the target\ndistribution as an interesting direction for future research. In Sec. 3.2 we demonstrate our ability to construct\nnew and relevant samplers by proposing stochastic gradient Riemann Hamiltonian Monte Carlo, the\nexistence of which was previously only speculated. We demonstrate the utility of this sampler on\nsynthetic data and in a streaming Wikipedia analysis using latent Dirichlet allocation [5].\n\n2 A Complete Stochastic Gradient MCMC Framework\nWe start with the standard MCMC goal of drawing samples from a target distribution, which we take\nto be the posterior p(\u03b8|S) of model parameters $\\theta \\in \\mathbb{R}^d$ given an observed dataset S. Throughout,\nwe assume i.i.d. data x \u223c p(x|\u03b8). We write p(\u03b8|S) \u221d exp(\u2212U(\u03b8)), with potential function\n$U(\\theta) = -\\sum_{x \\in S} \\log p(x|\\theta) - \\log p(\\theta)$. Algorithms like HMC [12, 10] further augment the space\nof interest with auxiliary variables r and sample from p(z|S) \u221d exp(\u2212H(z)), with Hamiltonian\n\n$$H(z) = H(\\theta, r) = U(\\theta) + g(\\theta, r), \\quad \\text{such that} \\quad \\int \\exp(-g(\\theta, r)) \\, dr = \\text{constant}. \\tag{1}$$\n\nMarginalizing the auxiliary variables gives us the desired distribution on \u03b8. 
In this paper, we\ngenerically consider z as the samples we seek to draw; z could represent \u03b8 itself, or an augmented state\nspace in which case we simply discard the auxiliary variables to perform the desired marginalization.\nAs in HMC, the idea is to translate the task of sampling from the posterior distribution to simulating\nfrom a continuous dynamical system which is used to define a Markov transition kernel. That is,\nover any interval h, the differential equation defines a mapping from the state at time t to the state\nat time t + h. One can then discuss the evolution of the distribution p(z, t) under the dynamics, as\ncharacterized by the Fokker-Planck equation for stochastic dynamics [14] or the Liouville equation\nfor deterministic dynamics [20]. This evolution can be used to analyze the invariant distribution of\nthe dynamics, $p^s(z)$. When considering deterministic dynamics, as in HMC, a jump process must\nbe added to ensure ergodicity. If the resulting stationary distribution is equal to the target posterior,\nthen simulating from the process can be equated with drawing samples from the posterior.\nIf the stationary distribution is not the target distribution, a Metropolis-Hastings (MH) correction\ncan often be applied. Unfortunately, such correction steps require a costly computation on the entire\ndataset. Even if one can compute the MH correction, if the dynamics do not nearly lead to the\ncorrect stationary distribution, then the rejection rate can be high even for short simulation periods\nh. Furthermore, for many stochastic gradient MCMC samplers, computing the probability of the\nreverse path is infeasible, precluding the use of MH. 
As such, a focus in the literature is on defining\ndynamics with the right target distribution, especially in large-data scenarios where MH corrections\nare computationally burdensome or infeasible.\n\n2.1 Devising SDEs with a Specified Target Stationary Distribution\nGenerically, all continuous Markov processes that one might consider for sampling can be written\nas a stochastic differential equation (SDE) of the form:\n\n$$dz = f(z) \\, dt + \\sqrt{2D(z)} \\, dW(t), \\tag{2}$$\n\nwhere f(z) denotes the deterministic drift and often relates to the gradient of H(z), W(t) is a\nd-dimensional Wiener process, and D(z) is a positive semidefinite diffusion matrix. Clearly, however,\nnot all choices of f(z) and D(z) yield the stationary distribution $p^s(z) \\propto \\exp(-H(z))$.\nWhen D(z) = 0, as in HMC, the dynamics of Eq. (2) become deterministic. Our exposition focuses\non SDEs, but our analysis applies to deterministic dynamics as well. In this case, our framework--\nusing the Liouville equation in place of Fokker-Planck--ensures that the deterministic dynamics\nleave the target distribution invariant. For ergodicity, a jump process must be added, which is not\nconsidered in our recipe, but tends to be straightforward (e.g., momentum resampling in HMC).\nTo devise a recipe for constructing SDEs with the correct stationary distribution, we propose writing\nf(z) directly in terms of the target distribution:\n\n$$f(z) = -\\big[D(z) + Q(z)\\big] \\nabla H(z) + \\Gamma(z), \\quad \\Gamma_i(z) = \\sum_{j=1}^{d} \\frac{\\partial}{\\partial z_j} \\big(D_{ij}(z) + Q_{ij}(z)\\big). \\tag{3}$$\n\nHere, Q(z) is a skew-symmetric curl matrix representing the deterministic traversing effects seen\nin HMC procedures. In contrast, the diffusion matrix D(z) determines the strength of the\nWiener-process-driven diffusion. Matrices D(z) and Q(z) can be adjusted to attain faster convergence to\nthe posterior distribution. A more detailed discussion on the interpretation of D(z) and Q(z) and\nthe influence of specific choices of these matrices is provided in the Supplement.\nImportantly, as we show in Theorem 1, sampling the stochastic dynamics of Eq. (2) (according\nto It\u00f4 integral) with f(z) as in Eq. (3) leads to the desired posterior distribution as the stationary\ndistribution: $p^s(z) \\propto \\exp(-H(z))$. That is, for any choice of positive semidefinite D(z) and\nskew-symmetric Q(z) parameterizing f(z), we know that simulating from Eq. (2) will provide samples\nfrom p(\u03b8 | S) (discarding any sampled auxiliary variables r) assuming the process is ergodic.\nTheorem 1. $p^s(z) \\propto \\exp(-H(z))$ is a stationary distribution of the dynamics of Eq. (2) if f(z) is\nrestricted to the form of Eq. (3), with D(z) positive semidefinite and Q(z) skew-symmetric. If D(z)\nis positive definite, or if ergodicity can be shown, then the stationary distribution is unique.\n\nProof. The equivalence of $p^s(z)$ and the target p(z | S) \u221d exp(\u2212H(z)) can be shown using the\nFokker-Planck description of the probability density evolution under the dynamics of Eq. (2):\n\n$$\\partial_t p(z, t) = -\\sum_i \\frac{\\partial}{\\partial z_i} \\big(f_i(z) p(z, t)\\big) + \\sum_{i,j} \\frac{\\partial^2}{\\partial z_i \\partial z_j} \\big(D_{ij}(z) p(z, t)\\big). \\tag{4}$$\n\nEq. (4) can be further transformed into a more compact form [19, 16]:\n\n$$\\partial_t p(z, t) = \\nabla^T \\cdot \\Big( \\big[D(z) + Q(z)\\big] \\big[p(z, t) \\nabla H(z) + \\nabla p(z, t)\\big] \\Big). \\tag{5}$$\n\nWe can verify that p(z | S) is invariant under Eq. (5) by calculating $e^{-H(z)} \\nabla H(z) + \\nabla e^{-H(z)} = 0$.\nIf the process is ergodic, this invariant distribution is unique. The equivalence of the compact form\nwas originally proved in [16]; we include a detailed proof in the Supplement for completeness.\n\nFigure 1: The red space represents the set of all continuous Markov\nprocesses. 
A point\nin the black space represents a continuous\nMarkov process defined by Eqs. (2)-(3) based on a specific choice of\nD(z), Q(z). By Theorem 1, each such point has stationary distribution\n$p^s(z) = p(z | S)$. The blue space represents all continuous Markov\nprocesses with $p^s(z) = p(z | S)$. Theorem 2 states that these blue and\nblack spaces are equivalent (there is no gap, and any point in the blue\nspace has a corresponding D(z), Q(z) in our framework).\n\n2.2 Completeness of the Framework\nAn important question is what portion of samplers defined by continuous Markov processes with\nthe target invariant distribution we can define by iterating over all possible D(z) and Q(z). In\nTheorem 2, we show that for any continuous Markov process with the desired stationary distribution,\n$p^s(z)$, there exists an SDE as in Eq. (2) with f(z) defined as in Eq. (3). We know from the\nChapman-Kolmogorov equation [9] that any continuous Markov process with stationary distribution $p^s(z)$ can\nbe written as in Eq. (2), which gives us the diffusion matrix D(z). Theorem 2 then constructively\ndefines the curl matrix Q(z). This result implies that our recipe is complete. That is, we cover all\npossible continuous Markov process samplers in our framework. See Fig. 1.\nTheorem 2. For the SDE of Eq. (2), suppose its stationary probability density function $p^s(z)$\nuniquely exists, and that $\\Big[ f_i(z) p^s(z) - \\sum_{j=1}^{d} \\frac{\\partial}{\\partial \\theta_j} \\big(D_{ij}(z) p^s(z)\\big) \\Big]$ is integrable with respect to the\nLebesgue measure, then there exists a skew-symmetric Q(z) such that Eq. 
(3) holds.\n\nThe integrability condition is usually satisfied when the probability density function uniquely exists.\nA constructive proof for the existence of Q(z) is provided in the Supplement.\n\n2.3 A Practical Algorithm\nIn practice, simulation relies on an $\\epsilon$-discretization of the SDE, leading to a full-data update rule\n\n$$z_{t+1} \\leftarrow z_t - \\epsilon_t \\Big[ \\big(D(z_t) + Q(z_t)\\big) \\nabla H(z_t) - \\Gamma(z_t) \\Big] + \\mathcal{N}\\big(0, 2 \\epsilon_t D(z_t)\\big). \\tag{6}$$\n\nNote that $\\Gamma(z_t)$ enters the update with a positive sign overall, as required by the drift f(z) of Eq. (3).\nCalculating the gradient of H(z) involves evaluating the gradient of U(\u03b8). For a stochastic gradient\nmethod, the assumption is that U(\u03b8) is too computationally intensive to compute as it relies on a sum\nover all data points (see Sec. 2). Instead, such stochastic gradient algorithms examine independently\nsampled data subsets $\\tilde{S} \\subset S$ and the corresponding potential for these data:\n\n$$\\tilde{U}(\\theta) = -\\frac{|S|}{|\\tilde{S}|} \\sum_{x \\in \\tilde{S}} \\log p(x|\\theta) - \\log p(\\theta); \\quad \\tilde{S} \\subset S. \\tag{7}$$\n\nThe specific form of Eq. (7) implies that $\\tilde{U}(\\theta)$ is an unbiased estimator of U(\u03b8). As such, a gradient\ncomputed based on $\\tilde{U}(\\theta)$--called a stochastic gradient [15]--is a noisy, but unbiased estimator\nof the full-data gradient. The key question in many of the existing stochastic gradient MCMC\nalgorithms is whether the noise injected by the stochastic gradient adversely affects the stationary\ndistribution of the modified dynamics (using $\\nabla \\tilde{U}(\\theta)$ in place of $\\nabla U(\\theta)$). One way to analyze the\nimpact of the stochastic gradient is to make use of the central limit theorem and assume\n\n$$\\nabla \\tilde{U}(\\theta) = \\nabla U(\\theta) + \\mathcal{N}(0, V(\\theta)), \\tag{8}$$\n\nresulting in a noisy Hamiltonian gradient $\\nabla \\tilde{H}(z) = \\nabla H(z) + [\\mathcal{N}(0, V(\\theta)), 0]^T$. Simply plugging\nin $\\nabla \\tilde{H}(z)$ in place of $\\nabla H(z)$ in Eq. (6) results in dynamics with an additional noise term\n$\\big(D(z_t) + Q(z_t)\\big) [\\mathcal{N}(0, V(\\theta)), 0]^T$. To counteract this, assume we have an estimate $\\hat{B}_t$ of the variance of this\nadditional noise satisfying $2D(z_t) - \\epsilon_t \\hat{B}_t \\succeq 0$ (i.e., positive semidefinite). With small $\\epsilon$, this is\nalways true since the stochastic gradient noise scales down faster than the added noise. Then, we\ncan attempt to account for the stochastic gradient noise by simulating\n\n$$z_{t+1} \\leftarrow z_t - \\epsilon_t \\Big[ \\big(D(z_t) + Q(z_t)\\big) \\nabla \\tilde{H}(z_t) - \\Gamma(z_t) \\Big] + \\mathcal{N}\\big(0, \\epsilon_t (2D(z_t) - \\epsilon_t \\hat{B}_t)\\big). \\tag{9}$$\n\nThis provides our stochastic gradient--or minibatch--variant of the sampler. In Eq. (9), the noise\nintroduced by the stochastic gradient is multiplied by $\\epsilon_t$ (and the compensation by $\\epsilon_t^2$), implying that\nthe discrepancy between these dynamics and those of Eq. (6) approaches zero as $\\epsilon_t$ goes to zero. As\nsuch, in this infinitesimal step size limit, since Eq. (6) yields the correct invariant distribution, so\ndoes Eq. (9). This avoids the need for a costly or potentially intractable MH correction. However,\nhaving to decrease $\\epsilon_t$ to zero comes at the cost of increasingly small updates. We can also use a finite,\nsmall step size in practice, resulting in a biased (but faster) sampler. 
A similar bias-speed tradeoff\nwas used in [11, 3] to construct MH samplers, in addition to being used in SGLD and SGHMC.\n\n3 Applying the Theory to Construct Samplers\n3.1 Casting Previous MCMC Algorithms within the Proposed Framework\nWe explicitly state how some recently developed MCMC methods fall within the proposed\nframework based on specific choices of D(z), Q(z) and H(z) in Eqs. (2) and (3). For the stochastic\ngradient methods, we show how our framework can be used to \u201creinvent\u201d the samplers by guiding\ntheir construction and avoiding potential mistakes or inefficiencies caused by na\u00efve implementations.\n\nHamiltonian Monte Carlo (HMC) The key ingredient in HMC [8, 12] is Hamiltonian dynamics,\nwhich simulate the physical motion of an object with position \u03b8, momentum r, and mass M on a\nfrictionless surface as follows (typically, a leapfrog simulation is used instead):\n\n$$\\begin{cases} \\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t M^{-1} r_t \\\\ r_{t+1} \\leftarrow r_t - \\epsilon_t \\nabla U(\\theta_t). \\end{cases} \\tag{10}$$\n\nEq. (10) is a special case of the proposed framework with z = (\u03b8, r), $H(\\theta, r) = U(\\theta) + \\frac{1}{2} r^T M^{-1} r$,\n$Q(\\theta, r) = \\begin{pmatrix} 0 & -I \\\\ I & 0 \\end{pmatrix}$ and D(\u03b8, r) = 0.\n\nStochastic Gradient Hamiltonian Monte Carlo (SGHMC) As discussed in [6], simply replacing\n$\\nabla U(\\theta)$ by the stochastic gradient $\\nabla \\tilde{U}(\\theta)$ in Eq. (10) results in the following updates:\n\n$$\\text{Naive}: \\begin{cases} \\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t M^{-1} r_t \\\\ r_{t+1} \\leftarrow r_t - \\epsilon_t \\nabla \\tilde{U}(\\theta_t) \\approx r_t - \\epsilon_t \\nabla U(\\theta_t) + \\mathcal{N}(0, \\epsilon_t^2 V(\\theta_t)), \\end{cases} \\tag{11}$$\n\nwhere the $\\approx$ arises from the approximation of Eq. (8). Careful study shows that Eq. (11) cannot be\nrewritten into our proposed framework, which hints that such a na\u00efve stochastic gradient version of\nHMC is not correct. Interestingly, the authors of [6] proved that this na\u00efve version indeed does not\nhave the correct stationary distribution. In our framework, we see that the noise term $\\mathcal{N}(0, 2 \\epsilon_t D(z))$\nis paired with a $D(z) \\nabla H(z)$ term, hinting that such a term should be added to Eq. (11). Here,\n$D(\\theta, r) = \\begin{pmatrix} 0 & 0 \\\\ 0 & \\epsilon V(\\theta) \\end{pmatrix}$, which means we need to add\n$D(z) \\nabla H(z) = \\epsilon V(\\theta) \\nabla_r H(\\theta, r) = \\epsilon V(\\theta) M^{-1} r$. Interestingly, this is the correction strategy proposed in [6], but through a physical\ninterpretation of the dynamics. In particular, the term $\\epsilon V(\\theta) M^{-1} r$ (or, generically, $C M^{-1} r$ where\n$C \\succeq \\epsilon V(\\theta)$) has an interpretation as friction and leads to second order Langevin dynamics:\n\n$$\\begin{cases} \\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t M^{-1} r_t \\\\ r_{t+1} \\leftarrow r_t - \\epsilon_t \\nabla \\tilde{U}(\\theta_t) - \\epsilon_t C M^{-1} r_t + \\mathcal{N}(0, \\epsilon_t (2C - \\epsilon_t \\hat{B}_t)). \\end{cases} \\tag{12}$$\n\nHere, $\\hat{B}_t$ is an estimate of $V(\\theta_t)$. This method now fits into our framework with H(\u03b8, r) and Q(\u03b8, r)\nas in HMC, but with $D(\\theta, r) = \\begin{pmatrix} 0 & 0 \\\\ 0 & C \\end{pmatrix}$. This example shows how our theory can be used to\nidentify invalid samplers and provide guidance on how to effortlessly correct the mistakes; this is\ncrucial when physical intuition is not available. Once the proposed sampler is cast in our framework\nwith a specific D(z) and Q(z), there is no need for sampler-specific proofs, such as those of [6].\n\nStochastic Gradient Langevin Dynamics (SGLD) SGLD [17] proposes to use the following first\norder (no momentum) Langevin dynamics to generate samples\n\n$$\\theta_{t+1} \\leftarrow \\theta_t - \\epsilon_t D \\nabla \\tilde{U}(\\theta_t) + \\mathcal{N}(0, 2 \\epsilon_t D). \\tag{13}$$\n\nThis algorithm corresponds to taking z = \u03b8 with H(\u03b8) = U(\u03b8), D(\u03b8) = D, Q(\u03b8) = 0, and $\\hat{B}_t = 0$.\nAs motivated by Eq. 
(9) of our framework, the variance of the stochastic gradient can be subtracted\nfrom the sampler-injected noise to make the finite stepsize simulation more accurate. This variant of\nSGLD leads to the stochastic gradient Fisher scoring algorithm [1].\n\nStochastic Gradient Riemannian Langevin Dynamics (SGRLD) SGLD can be generalized to\nuse an adaptive diffusion matrix D(\u03b8). Specifically, it is interesting to take $D(\\theta) = G^{-1}(\\theta)$, where\nG(\u03b8) is the Fisher information metric. The sampler dynamics are given by\n\n$$\\theta_{t+1} \\leftarrow \\theta_t - \\epsilon_t \\big[ G(\\theta_t)^{-1} \\nabla \\tilde{U}(\\theta_t) - \\Gamma(\\theta_t) \\big] + \\mathcal{N}\\big(0, 2 \\epsilon_t G(\\theta_t)^{-1}\\big). \\tag{14}$$\n\nTaking $D(\\theta) = G(\\theta)^{-1}$, Q(\u03b8) = 0, and $\\hat{B}_t = 0$, this SGRLD [13] method falls into our\nframework with correction term $\\Gamma_i(\\theta) = \\sum_j \\frac{\\partial D_{ij}(\\theta)}{\\partial \\theta_j}$, which by Eq. (3) enters the update with a positive sign.\nIt is interesting to note that in earlier literature [10],\n$\\Gamma_i(\\theta)$ was taken to be $2 |G(\\theta)|^{-1/2} \\sum_j \\frac{\\partial}{\\partial \\theta_j} \\big( G^{-1}_{ij}(\\theta) |G(\\theta)|^{1/2} \\big)$. More recently, it was found that\nthis correction term corresponds to the distribution function with respect to a non-Lebesgue\nmeasure [18]; for the Lebesgue measure, the revised $\\Gamma_i(\\theta)$ was as determined by our framework [18].\nAgain, we have an example of our theory providing guidance in devising correct samplers.\n\nStochastic Gradient Nos\u00e9-Hoover Thermostat (SGNHT) Finally, the SGNHT [7] method\nincorporates ideas from thermodynamics to further increase adaptivity by augmenting the SGHMC\nsystem with an additional scalar auxiliary variable, \u03be. 
The algorithm uses the following dynamics:\n\n$$\\begin{cases} \\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t r_t \\\\ r_{t+1} \\leftarrow r_t - \\epsilon_t \\nabla \\tilde{U}(\\theta_t) - \\epsilon_t \\xi_t r_t + \\mathcal{N}(0, \\epsilon_t (2A - \\epsilon_t \\hat{B}_t)) \\\\ \\xi_{t+1} \\leftarrow \\xi_t + \\epsilon_t \\left( \\frac{1}{d} r_t^T r_t - 1 \\right). \\end{cases} \\tag{15}$$\n\nWe can take z = (\u03b8, r, \u03be), $H(\\theta, r, \\xi) = U(\\theta) + \\frac{1}{2} r^T r + \\frac{d}{2} (\\xi - A)^2$,\n$D(\\theta, r, \\xi) = \\begin{pmatrix} 0 & 0 & 0 \\\\ 0 & A \\cdot I & 0 \\\\ 0 & 0 & 0 \\end{pmatrix}$, and $Q(\\theta, r, \\xi) = \\begin{pmatrix} 0 & -I & 0 \\\\ I & 0 & r/d \\\\ 0 & -r^T/d & 0 \\end{pmatrix}$\nto place these dynamics within our framework. With these choices, the r-component of the drift in Eq. (3) is\n$-\\nabla U(\\theta) - A r - \\frac{r}{d} \\cdot d (\\xi - A) = -\\nabla U(\\theta) - \\xi r$, matching Eq. (15).\n\nSummary In our framework, SGLD and SGRLD take Q(z) = 0 and instead stress the design of\nthe diffusion matrix D(z), with SGLD using a constant D(z) and SGRLD an adaptive, \u03b8-dependent\ndiffusion matrix to better account for the geometry of the space being explored. On the other hand,\nHMC takes D(z) = 0 and focuses on the curl matrix Q(z). SGHMC combines SGLD with HMC\nthrough non-zero D(\u03b8) and Q(\u03b8) matrices. SGNHT then extends SGHMC by taking Q(z) to be\nstate dependent. The relationships between these methods are depicted in the Supplement, which\nlikewise contains a discussion of the tradeoffs between these two matrices. In short, D(z) can\nguide escaping from local modes while Q(z) can enable rapid traversing of low-probability regions,\nespecially when state adaptation is incorporated. We readily see that most of the product space\nD(z) \u00d7 Q(z), defining the space of all possible samplers, has yet to be filled.\n\n3.2 Stochastic Gradient Riemann Hamiltonian Monte Carlo\nIn Sec. 3.1, we have shown how our framework unifies existing samplers. In this section, we now use\nour framework to guide the development of a new sampler. 
While SGHMC [6] inherits the\nmomentum term of HMC, making it easier to traverse the space of parameters, the underlying geometry of\nthe target distribution is still not utilized. Such information can usually be represented by the Fisher\ninformation metric [10], denoted as G(\u03b8), which can be used to precondition the dynamics. For our\nproposed system, we consider $H(\\theta, r) = U(\\theta) + \\frac{1}{2} r^T r$, as in HMC/SGHMC methods, and modify\nthe D(\u03b8, r) and Q(\u03b8, r) of SGHMC to account for the geometry as follows:\n\n$$D(\\theta, r) = \\begin{pmatrix} 0 & 0 \\\\ 0 & G(\\theta)^{-1} \\end{pmatrix}; \\quad Q(\\theta, r) = \\begin{pmatrix} 0 & -G(\\theta)^{-1/2} \\\\ G(\\theta)^{-1/2} & 0 \\end{pmatrix}.$$\n\nWe refer to this algorithm as stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC).\nOur theory holds for any positive definite G(\u03b8), yielding a generalized SGRHMC (gSGRHMC)\nalgorithm, which can be helpful when the Fisher information metric is hard to compute.\nA na\u00efve implementation of a state-dependent SGHMC algorithm might simply (i) precondition the\nHMC update, (ii) replace $\\nabla U(\\theta)$ by $\\nabla \\tilde{U}(\\theta)$, and (iii) add a state-dependent friction term on the\norder of the diffusion matrix to counterbalance the noise as in SGHMC, resulting in:\n\n$$\\text{Naive}: \\begin{cases} \\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t G(\\theta_t)^{-1/2} r_t \\\\ r_{t+1} \\leftarrow r_t - \\epsilon_t G(\\theta_t)^{-1/2} \\nabla_\\theta \\tilde{U}(\\theta_t) - \\epsilon_t G(\\theta_t)^{-1} r_t + \\mathcal{N}(0, \\epsilon_t (2 G(\\theta_t)^{-1} - \\epsilon_t \\hat{B}_t)). \\end{cases} \\tag{16}$$\n\nAlgorithm 1: Generalized Stochastic Gradient Riemann Hamiltonian Monte Carlo\ninitialize $(\\theta_0, r_0)$\nfor t = 0, 1, 2, ... do\n    optionally, periodically resample momentum r as $r^{(t)} \\sim \\mathcal{N}(0, I)$\n    $\\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t G(\\theta_t)^{-1/2} r_t$, $\\Sigma_t \\leftarrow \\epsilon_t (2 G(\\theta_t)^{-1} - \\epsilon_t \\hat{B}_t)$\n    $r_{t+1} \\leftarrow r_t - \\epsilon_t G(\\theta_t)^{-1/2} \\nabla_\\theta \\tilde{U}(\\theta_t) + \\epsilon_t \\nabla_\\theta (G(\\theta_t)^{-1/2}) - \\epsilon_t G(\\theta_t)^{-1} r_t + \\mathcal{N}(0, \\Sigma_t)$\nend\n\nFigure 2: Left: For two simulated 1D distributions defined by $U(\\theta) = \\theta^2/2$ (one peak) and $U(\\theta) = \\theta^4 - 2\\theta^2$\n(two peaks), we compare the KL divergence of methods: SGLD, SGHMC, the na\u00efve SGRHMC of Eq. (16), and\nthe gSGRHMC of Eq. (17) relative to the true distribution in each scenario (left and right bars labeled by 1 and\n2). Right: For a correlated 2D distribution with $U(\\theta_1, \\theta_2) = \\theta_1^4/10 + (4 \\cdot (\\theta_2 + 1.2) - \\theta_1^2)^2/2$, we see that\nour gSGRHMC most rapidly explores the space relative to SGHMC and SGLD. Contour plots of the distribution\nalong with paths of the first 10 sampled points are shown for each method.\n\nHowever, as we show in Sec. 4.1, samples from the na\u00efve dynamics of Eq. (16) do not converge to the desired\ndistribution. Indeed, this system cannot be written within our framework. Instead, we can simply\nfollow our framework and, as indicated by Eq. (9), consider the following update rule:\n\n$$\\begin{cases} \\theta_{t+1} \\leftarrow \\theta_t + \\epsilon_t G(\\theta_t)^{-1/2} r_t \\\\ r_{t+1} \\leftarrow r_t - \\epsilon_t \\big[ G(\\theta_t)^{-1/2} \\nabla_\\theta \\tilde{U}(\\theta_t) - \\nabla_\\theta (G(\\theta_t)^{-1/2}) + G(\\theta_t)^{-1} r_t \\big] + \\mathcal{N}(0, \\epsilon_t (2 G(\\theta_t)^{-1} - \\epsilon_t \\hat{B}_t)), \\end{cases} \\tag{17}$$\n\nwhich includes a correction term $\\nabla_\\theta (G(\\theta)^{-1/2})$, with i-th component $\\sum_j \\frac{\\partial}{\\partial \\theta_j} (G(\\theta)^{-1/2})_{ij}$. The signs inside the bracket agree with the $r_{t+1}$ update of Algorithm 1. The\npractical implementation of gSGRHMC is outlined in Algorithm 1.\n\n4 Experiments\n\nIn Sec. 4.1, we show that gSGRHMC can excel at rapidly exploring distributions with complex\nlandscapes. We then apply SGRHMC to sampling in a latent Dirichlet allocation (LDA) model on\na large Wikipedia dataset in Sec. 4.2. 
The Supplement contains details on the specific samplers\nconsidered and the parameter settings used in these experiments.\n\n4.1 Synthetic Experiments\nIn this section we aim to empirically (i) validate the correctness of our recipe and (ii) assess the\neffectiveness of gSGRHMC. In Fig. 2(left), we consider two univariate distributions (shown in the\nSupplement) and compare SGLD, SGHMC, the na\u00efve state-adaptive SGHMC of Eq. (16), and our\nproposed gSGRHMC of Eq. (17). See the Supplement for the form of G(\u03b8). As expected, the na\u00efve\nimplementation does not converge to the target distribution. In contrast, the gSGRHMC algorithm\nobtained via our recipe indeed has the correct invariant distribution and efficiently explores the\ndistributions. In the second experiment, we sample a bivariate distribution with strong correlation. The\nresults are shown in Fig. 2(right). The comparison between SGLD, SGHMC, and our gSGRHMC\nmethod shows that both a state-dependent preconditioner and Hamiltonian dynamics help to make\nthe sampler more efficient than either element on its own.\n\n[Figure 2 plots: K-L divergence by method (SGLD, SGHMC, Naive gSGRHMC, gSGRHMC) for distributions 1 and 2, and K-L divergence versus log3(Steps/100)+1 for SGLD, SGHMC, and gSGRHMC.]\n\n | Original LDA | Expanded Mean\nParameter \u03b8 | $\\beta_{kw} = \\theta_{kw}$ | $\\beta_{kw} = \\theta_{kw} / \\sum_w \\theta_{kw}$\nPrior p(\u03b8) | $p(\\theta_k) = \\mathrm{Dir}(\\alpha)$ | $p(\\theta_{kw}) = \\Gamma(\\alpha, 1)$\n\nMethod | Average Runtime per 100 Docs\nSGLD | 0.778s\nSGHMC | 0.815s\nSGRLD | 0.730s\nSGRHMC | 0.806s\n\nFigure 3: Upper Left: Expanded mean parameterization of the LDA model. Lower Left: Average runtime per\n100 Wikipedia entries for all methods. 
Right: Perplexity versus number of Wikipedia entries processed.\n\n4.2 Online Latent Dirichlet Allocation\nWe also applied SGRHMC (with G(\u03b8) = diag(\u03b8)^{-1}, the Fisher information metric) to an online\nlatent Dirichlet allocation (LDA) [5] analysis of topics present in Wikipedia entries. In LDA, each\ntopic is associated with a distribution over words, with \u03b2_kw the probability of word w under topic k.\nEach document is comprised of a mixture of topics, with \u03c0_k^(d) the probability of topic k in document\nd. Documents are generated by first selecting a topic z_j^(d) \u223c \u03c0^(d) for the jth word and then drawing\nthe specific word from the topic as x_j^(d) \u223c \u03b2_{z_j^(d)}. Typically, \u03c0^(d) and \u03b2_k are given Dirichlet priors.\n\nThe goal of our analysis here is inference of the corpus-wide topic distributions \u03b2_k. Since the\nWikipedia dataset is large and continually growing with new articles, it is not practical to carry out\nthis task over the whole dataset. Instead, we scrape the corpus from Wikipedia in a streaming manner\nand sample parameters based on minibatches of data. Following the approach in [13], we first\nanalytically marginalize the document distributions \u03c0^(d) and, to resolve the boundary issue posed by\nthe Dirichlet posterior of \u03b2_k defined on the probability simplex, use an expanded mean\nparameterization shown in Figure 3(upper left). Under this parameterization, we then compute \u2207 log p(\u03b8|x)\nand, in our implementation, use boundary reflection to ensure the positivity of parameters \u03b8_kw. The\nnecessary expectation over word-specific topic indicators z_j^(d) is approximated using Gibbs sampling\nseparately on each document, as in [13]. The Supplement contains further details.\nFor all the methods, we report results of three random runs. 
When sampling distributions with\nmass concentrated over small regions, as in this application, it is important to incorporate geometric\ninformation via a Riemannian sampler [13]. The results in Fig. 3(right) indeed demonstrate the\nimportance of Riemannian variants of the stochastic gradient samplers. However, there also appear to\nbe some benefits gained from the incorporation of the HMC term for both the Riemannian and\nnon-Riemannian samplers. The average runtimes for the different methods are similar (see Fig. 3(lower\nleft)) since the main computational bottleneck is the gradient evaluation. Overall, this application\nserves as an important example of where our newly proposed sampler can have impact.\n\n5 Conclusion\nWe presented a general recipe for devising MCMC samplers based on continuous Markov processes.\nOur framework constructs an SDE specified by two matrices, a positive semidefinite D(z) and a\nskew-symmetric Q(z). We prove that for any D(z) and Q(z), we can devise a continuous Markov\nprocess with a specified stationary distribution. We also prove that for any continuous Markov\nprocess with the target stationary distribution, there exist D(z) and Q(z) that cast the process in our\nframework. Our recipe is particularly useful in the more challenging case of devising stochastic\ngradient MCMC samplers. We demonstrate the utility of our recipe in \u201creinventing\u201d previous stochastic\ngradient MCMC samplers, and in proposing our SGRHMC method. The efficiency and scalability\nof the SGRHMC method were shown on simulated data and a streaming Wikipedia analysis.\n\nAcknowledgments\nThis work was supported in part by ONR Grant N00014-10-1-0746, NSF CAREER Award IIS-1350133, and\nthe TerraSwarm Research Center sponsored by MARCO and DARPA. We also thank Mr. 
Lei Wu for helping\nwith the proof of Theorem 2 and Professors Ping Ao and Hong Qian for many discussions.\n\n[Figure 3, right panel: perplexity versus number of documents for SGLD, SGHMC, SGRLD, and SGRHMC.]\n\nReferences\n[1] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient\nFisher scoring. In Proceedings of the 29th International Conference on Machine Learning\n(ICML\u201912), 2012.\n\n[2] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In Proceedings\nof the 31st International Conference on Machine Learning (ICML\u201914), 2014.\n\n[3] R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up Markov chain Monte Carlo:\nAn adaptive subsampling approach. In Proceedings of the 31st International Conference on\nMachine Learning (ICML\u201914), 2014.\n\n[4] M. Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and\nnaive data subsampling. In Proceedings of the 32nd International Conference on Machine\nLearning (ICML\u201915), 2015.\n\n[5] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning\nResearch, 3:993\u20131022, March 2003.\n\n[6] T. Chen, E.B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings\nof the 31st International Conference on Machine Learning (ICML\u201914), 2014.\n\n[7] N. Ding, Y. Fang, R. Babbush, C. Chen, R.D. Skeel, and H. Neven. Bayesian sampling using\nstochastic gradient thermostats. In Advances in Neural Information Processing Systems 27\n(NIPS\u201914), 2014.\n\n[8] S. Duane, A.D. Kennedy, B.J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters\nB, 195(2):216\u2013222, 1987.\n\n[9] W. Feller. Introduction to Probability Theory and its Applications. John Wiley & Sons, 1950.\n\n[10] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo\nmethods. 
Journal of the Royal Statistical Society Series B, 73(2):123\u2013214, March 2011.\n\n[11] A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the\nMetropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning\n(ICML\u201914), 2014.\n\n[12] R.M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo,\n54:113\u2013162, 2010.\n\n[13] S. Patterson and Y.W. Teh. Stochastic gradient Riemannian Langevin dynamics on the\nprobability simplex. In Advances in Neural Information Processing Systems 26 (NIPS\u201913), 2013.\n\n[14] H. Risken and T. Frank. The Fokker-Planck Equation: Methods of Solutions and Applications.\nSpringer, 1996.\n\n[15] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical\nStatistics, 22(3):400\u2013407, September 1951.\n\n[16] J. Shi, T. Chen, R. Yuan, B. Yuan, and P. Ao. Relation of a new interpretation of stochastic\ndifferential equations to It\u00f4 process. Journal of Statistical Physics, 148(3):579\u2013590, 2012.\n\n[17] M. Welling and Y.W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In\nProceedings of the 28th International Conference on Machine Learning (ICML\u201911), pages\n681\u2013688, June 2011.\n\n[18] T. Xifara, C. Sherlock, S. Livingstone, S. Byrne, and M. Girolami. Langevin diffusions and\nthe Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91:14\u201319, 2014.\n\n[19] L. Yin and P. Ao. Existence and construction of dynamical potential in nonequilibrium\nprocesses without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593,\n2006.\n\n[20] R. Zwanzig. Nonequilibrium Statistical Mechanics. 
Oxford University Press, 2001.", "award": [], "sourceid": 1660, "authors": [{"given_name": "Yi-An", "family_name": "Ma", "institution": "University of Washington"}, {"given_name": "Tianqi", "family_name": "Chen", "institution": "University of Washington"}, {"given_name": "Emily", "family_name": "Fox", "institution": "Washington"}]}