{"title": "Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond", "book": "Advances in Neural Information Processing Systems", "page_first": 7748, "page_last": 7760, "abstract": "Sampling with Markov chain Monte Carlo methods typically amounts to discretizing some continuous-time dynamics with numerical integration. In this paper, we establish the convergence rate of sampling algorithms obtained by discretizing smooth It\\^o diffusions exhibiting fast $2$-Wasserstein contraction, based on local deviation properties of the integration scheme. In particular, we study a sampling algorithm constructed by discretizing the overdamped Langevin diffusion with the method of stochastic Runge-Kutta. For strongly convex potentials that are smooth up to a certain order, its iterates \r\nconverge to the target distribution in $2$-Wasserstein distance in $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$ iterations. This improves upon the best-known rate for strongly log-concave sampling based on the overdamped Langevin equation using only the gradient oracle without adjustment. Additionally, we extend our analysis of stochastic Runge-Kutta methods to uniformly dissipative diffusions with possibly non-convex potentials and\r\nshow they achieve better rates compared to the Euler-Maruyama scheme on the dependence on tolerance $\\epsilon$. Numerical studies show that these algorithms lead to better stability and lower asymptotic errors.", "full_text": "Stochastic Runge-Kutta Accelerates\nLangevin Monte Carlo and Beyond\n\nXuechen Li1, 2, Denny Wu1, 2, Lester Mackey3, Murat A. Erdogdu1, 2\n\nUniversity of Toronto1, Vector Institute2, Microsoft Research3\n\n{lxuechen, dennywu, erdogdu}@cs.toronto.edu, lmackey@microsoft.com\n\nAbstract\n\nSampling with Markov chain Monte Carlo methods often amounts to discretizing\nsome continuous-time dynamics with numerical integration. 
In this paper, we\nestablish the convergence rate of sampling algorithms obtained by discretizing\nsmooth It\u00f4 diffusions exhibiting fast Wasserstein-2 contraction, based on local\ndeviation properties of the integration scheme. In particular, we study a sampling\nalgorithm constructed by discretizing the overdamped Langevin diffusion with the\nmethod of stochastic Runge-Kutta. For strongly convex potentials that are smooth\nup to a certain order, its iterates converge to the target distribution in 2-Wasserstein\ndistance in $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$ iterations. This improves upon the best-known rate for\nstrongly log-concave sampling based on the overdamped Langevin equation using\nonly the gradient oracle without adjustment. In addition, we extend our analysis of\nstochastic Runge-Kutta methods to uniformly dissipative diffusions with possibly\nnon-convex potentials and show they achieve better rates compared to the\nEuler-Maruyama scheme in terms of the dependence on tolerance $\\epsilon$. Numerical studies\nshow that these algorithms lead to better stability and lower asymptotic errors.\n\n1 Introduction\n\nSampling from a probability distribution is a fundamental problem that arises in machine learning,\nstatistics, and optimization. In many situations, the goal is to obtain samples from a target distribution\ngiven only the unnormalized density [2, 27, 40]. A prominent approach to this problem is the method\nof Markov chain Monte Carlo (MCMC), where an ergodic Markov chain is simulated so that iterates\nconverge exactly or approximately to the distribution of interest [43, 2].\nMCMC samplers based on numerically integrating continuous-time dynamics have proven very\nuseful due to their ability to accommodate a stochastic gradient oracle [65]. Moreover, when used\nas optimization algorithms, these methods can deliver strong theoretical guarantees in non-convex\nsettings [50].
A popular example in this regime is the unadjusted Langevin Monte Carlo (LMC)\nalgorithm [51]. Fast mixing of LMC is inherited from exponential Wasserstein decay of the Langevin\ndiffusion, and numerical integration using the Euler-Maruyama scheme with a sufficiently small\nstep size ensures the Markov chain tracks the diffusion. Asymptotic guarantees of this algorithm are\nwell-studied [51, 26, 42], and non-asymptotic analyses specifying explicit constants in convergence\nbounds were recently conducted [14, 11, 18, 7, 20, 9].\nTo the best of our knowledge, the best known rate of LMC in 2-Wasserstein distance is due to Durmus\nand Moulines [18]: $\\tilde{\\mathcal{O}}(d\\epsilon^{-1})$ iterations are required to reach $\\epsilon$ accuracy to d-dimensional target\ndistributions with strongly convex potentials under the additional Lipschitz Hessian assumption, where\n$\\tilde{\\mathcal{O}}$ hides insubstantial poly-logarithmic factors. Due to their simplicity and well-understood theoretical\nproperties, LMC and its derivatives have found numerous applications in statistics and machine\nlearning [65, 15]. However, from the numerical integration point of view, the Euler-Maruyama\nscheme is usually less preferred for many problems due to its inferior stability compared to implicit\nschemes [1] and large integration error compared to high-order schemes [46].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we study the convergence rate of MCMC samplers devised from discretizing It\u00f4\ndiffusions with exponential Wasserstein-2 contraction. Our result provides a general framework for\nestablishing convergence rates of existing numerical schemes in the SDE literature when used as\nsampling algorithms. In particular, we establish non-asymptotic convergence bounds for sampling\nwith stochastic Runge-Kutta (SRK) methods.
For strongly convex potentials, iterates of a variant of\nSRK applied to the overdamped Langevin diffusion have a convergence rate of $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$. Similar to\nLMC, the algorithm only queries the gradient oracle of the potential during each update and improves\nupon the best known rate of $\\tilde{\\mathcal{O}}(d\\epsilon^{-1})$ for strongly log-concave sampling based on the overdamped\nLangevin diffusion without Metropolis adjustment, under the mild extra assumption that the potential\nis smooth up to the third order. In addition, we extend our analysis to uniformly dissipative diffusions,\nwhich enables sampling from non-convex potentials by choosing a non-constant diffusion coefficient.\nWe study a different variant of SRK and obtain the convergence rate of $\\tilde{\\mathcal{O}}(d^{3/4}m^2\\epsilon^{-1})$ for general\nIt\u00f4 diffusions, where m is the dimensionality of the Brownian motion. This improves upon the\nconvergence rate of $\\tilde{\\mathcal{O}}(d\\epsilon^{-2})$ for the Euler-Maruyama scheme in terms of the tolerance $\\epsilon$, while\npotentially trading off dimension dependence.\nOur contributions can be summarized as follows:\n\u2022 We provide a broadly applicable theorem for establishing convergence rates of sampling algorithms\nbased on discretizing It\u00f4 diffusions exhibiting exponential Wasserstein-2 contraction to the target\ninvariant measure. The convergence rate is explicitly expressed in terms of the contraction rate of\nthe diffusion and local properties of the numerical scheme, both of which can be easily derived.\n\u2022 We show that, for strongly convex potentials, a variant of SRK applied to the overdamped Langevin\ndiffusion achieves the improved convergence rate of $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$ by accessing only the gradient\noracle, under mild additional smoothness conditions on the potential.\n\u2022 We establish the convergence rate of a different variant of SRK applied to uniformly dissipative\ndiffusions.
By choosing an appropriate diffusion coefficient, we show the corresponding algorithm\ncan sample from certain non-convex potentials and achieves the rate of $\\tilde{\\mathcal{O}}(d^{3/4}m^2\\epsilon^{-1})$.\n\u2022 We provide examples and numerical studies of sampling from both convex and non-convex\npotentials with SRK methods and show they lead to better stability and lower asymptotic errors.\n\n1.1 Additional Related Work\n\nHigh-Order Schemes. Numerically solving SDEs has been a research area for decades [46, 32].\nWe refer the reader to [3] for a review and to [32] for technical foundations. Chen et al. [5] studied the\nconvergence of smooth functions evaluated at iterates of sampling algorithms obtained by discretizing\nthe Langevin diffusion with high-order numerical schemes. Their focus was on convergence rates of\nfunction evaluations under a stochastic gradient oracle using asymptotic arguments. This convergence\nassessment pertains to analyzing numerical schemes in the weak sense. By contrast, we establish\nnon-asymptotic convergence bounds in the 2-Wasserstein metric, which covers a broader class of\nfunctions by the Kantorovich duality [28, 62], and our techniques are based on the mean-square\nconvergence analysis of numerical schemes. Notably, a key ingredient in the proofs by Chen et al. [5],\ni.e. moment bounds in the guise of a Lyapunov function argument, is assumed without justification,\nwhereas we derive this formally and obtain convergence bounds with explicit dimension-dependent\nconstants. Durmus et al. [19] considered convergence of function evaluations of schemes obtained\nusing Richardson-Romberg extrapolation. Sabanis and Zhang [53] introduced a numerical scheme\nthat queries the gradient of the Laplacian based on an integrator that accommodates superlinear\ndrifts [54]. In particular, for potentials with a Lipschitz gradient, they obtained the convergence\nrate of $\\tilde{\\mathcal{O}}(d^{4/3}\\epsilon^{-2/3})$.
In optimization, high-order ordinary differential equation (ODE) integration\nschemes were introduced to discretize a second-order ODE and achieved acceleration [68].\n\nTable 1: Convergence rates in $W_2$ for algorithms sampling from strongly convex potentials by\ndiscretizing the overdamped Langevin diffusion. \u201cOracle\u201d refers to the highest derivative used in the\nupdate. \u201cSmoothness\u201d refers to Lipschitz conditions. Note that faster algorithms exist by discretizing\nhigh-order Langevin equations [13, 8, 9, 47, 56] or applying Metropolis adjustment [21, 6].\n\nMethod | Convergence Rate | Oracle | Smoothness\nEuler-Maruyama [18] | $\\tilde{\\mathcal{O}}(d\\epsilon^{-2})$ | 1st order | gradient\nEuler-Maruyama [18] | $\\tilde{\\mathcal{O}}(d\\epsilon^{-1})$ | 1st order | gradient & Hessian\nOzaki\u2019s [11] $^1$ | $\\tilde{\\mathcal{O}}(d\\epsilon^{-1})$ | 2nd order | gradient & Hessian\nTamed Order 1.5 [53] $^2$ | $\\tilde{\\mathcal{O}}(d^{4/3}\\epsilon^{-2/3})$ | 3rd order | 1st to 3rd derivatives\nStochastic Runge-Kutta (this work) | $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$ | 1st order | 1st to 3rd derivatives\n\nNon-Convex Learning. The convergence analyses of sampling using the overdamped and underdamped\nLangevin diffusion were extended to the non-convex setting [9, 39]. For the Langevin\ndiffusion, the most common assumption on the potential is strong convexity outside a ball of finite\nradius, in addition to Lipschitz smoothness and twice differentiability [9, 38, 39]. More generally,\nVempala and Wibisono [61] showed that convergence in the KL divergence of LMC can be derived\nassuming that the target measure satisfies a log-Sobolev inequality with a positive log-Sobolev constant.\nFor general It\u00f4 diffusions, the notion of distant dissipativity [30, 22, 23] is used to study convergence\nto target measures with non-convex potentials in the 1-Wasserstein distance. Different from these\nworks, our non-convex convergence analysis, being conducted in $W_2$, requires the slightly stronger\nuniform dissipativity condition [30].
In optimization, non-asymptotic results for stochastic gradient\nLangevin dynamics and its variants have been established for non-convex objectives [50, 67, 24, 69].\n\nNotation. We denote the p-norm of a real vector $x \\in \\mathbb{R}^d$ by $\\|x\\|_p$. For a function $f : \\mathbb{R}^d \\to \\mathbb{R}$, we\ndenote its ith derivative by $\\nabla^i f(x)$ and its Laplacian by $\\Delta f = \\sum_{i=1}^d \\partial^2 f(x) / \\partial x_i^2$. For a vector-valued\nfunction $g : \\mathbb{R}^d \\to \\mathbb{R}^m$, we denote its vector Laplacian by $\\vec{\\Delta}(g)$, i.e. $\\vec{\\Delta}(g)_i = \\Delta(g_i)$. For a\ntensor $T \\in \\mathbb{R}^{d_1 \\times d_2 \\times \\cdots \\times d_m}$, we define its operator norm recursively as $\\|T\\|_{op} = \\sup_{\\|u\\|_2 \\le 1} \\|T[u]\\|_{op}$,\nwhere $T[u]$ denotes the tensor-vector product. For f sufficiently differentiable, we denote the\nLipschitz and polynomial coefficients of its ith order derivative as\n\n$$\\mu_0(f) = \\sup_{x \\in \\mathbb{R}^d} \\|f(x)\\|_{op}, \\quad \\mu_i(f) = \\sup_{x, y \\in \\mathbb{R}^d,\\, x \\neq y} \\frac{\\|\\nabla^{i-1} f(x) - \\nabla^{i-1} f(y)\\|_{op}}{\\|x - y\\|_2}, \\quad \\text{and} \\quad \\pi_{i,n}(f) = \\sup_{x \\in \\mathbb{R}^d} \\frac{\\|\\nabla^{i-1} f(x)\\|_{op}^n}{1 + \\|x\\|_2^n},$$\n\nwith the exception in Theorem 3, where $\\pi_{1,n}(\\sigma)$ is used for a sublinear growth condition. We denote\nLipschitz and growth coefficients under the Frobenius norm $\\|\\cdot\\|_F$ as $\\mu_1^F(\\cdot)$ and $\\pi_{1,n}^F(\\cdot)$, respectively.\n\nCoupling and Wasserstein Distance. We denote by $\\mathcal{B}(\\mathbb{R}^d)$ the Borel $\\sigma$-field of $\\mathbb{R}^d$. Given probability\nmeasures $\\nu$ and $\\nu'$ on $(\\mathbb{R}^d, \\mathcal{B}(\\mathbb{R}^d))$, we define a coupling (or transference plan) $\\zeta$ between\n$\\nu$ and $\\nu'$ as a probability measure on $(\\mathbb{R}^d \\times \\mathbb{R}^d, \\mathcal{B}(\\mathbb{R}^d \\times \\mathbb{R}^d))$ such that $\\zeta(A \\times \\mathbb{R}^d) = \\nu(A)$ and\n$\\zeta(\\mathbb{R}^d \\times A) = \\nu'(A)$ for all $A \\in \\mathcal{B}(\\mathbb{R}^d)$. Let couplings$(\\nu, \\nu')$ denote the set of all such couplings.\nWe define the 2-Wasserstein distance between a pair of probability measures $\\nu$ and $\\nu'$ as\n\n$$W_2(\\nu, \\nu') = \\inf_{\\zeta \\in \\mathrm{couplings}(\\nu, \\nu')} \\left( \\int \\|x - y\\|_2^2 \\, d\\zeta(x, y) \\right)^{1/2}.$$\n\n$^1$ We obtain a rate in $W_2$ from the discretization analysis in KL [11] via standard techniques [50, 61].\n$^2$ Sabanis and Zhang [53] use the Frobenius norm for matrices and the Euclidean norm of Frobenius norms\nfor 3-tensors. For a fair comparison, we convert their Lipschitz constants to be based on the operator norm.\n\n2 Sampling with Discretized Diffusions\n\nWe study the problem of sampling from a target distribution p(x) with the help of a candidate It\u00f4\ndiffusion [37, 44] given as the solution to the following stochastic differential equation (SDE):\n\n$$dX_t = b(X_t)\\,dt + \\sigma(X_t)\\,dB_t, \\quad \\text{with } X_0 = x_0,$$ (1)\n\nwhere $b : \\mathbb{R}^d \\to \\mathbb{R}^d$ and $\\sigma : \\mathbb{R}^d \\to \\mathbb{R}^{d \\times m}$ are termed the drift and diffusion coefficients,\nrespectively. Here, $\\{B_t\\}_{t \\ge 0}$ is an m-dimensional Brownian motion adapted to the filtration $\\{\\mathcal{F}_t\\}_{t \\ge 0}$,\nwhose ith dimension we denote by $\\{B_t^{(i)}\\}_{t \\ge 0}$. A candidate diffusion should be chosen so that (i)\nits invariant measure is the target distribution p(x) and (ii) it exhibits fast mixing properties. Under\nmild conditions, one can design a diffusion with the target invariant measure by choosing the drift\ncoefficient as (see e.g. [37, Thm. 2])\n\n$$b(x) = \\frac{1}{2p(x)} \\langle \\nabla, p(x) w(x) \\rangle, \\quad \\text{where } w(x) = \\sigma(x)\\sigma(x)^\\top + c(x),$$ (2)\n\n$c(x) \\in \\mathbb{R}^{d \\times d}$ is any skew-symmetric matrix and $\\langle \\nabla, \\cdot \\rangle$ is the divergence operator for a matrix-valued\nfunction, i.e. $\\langle \\nabla, w(x) \\rangle_i = \\sum_{j=1}^d \\partial w_{i,j}(x) / \\partial x_j$ for $w : \\mathbb{R}^d \\to \\mathbb{R}^{d \\times d}$. To guarantee that\nthis diffusion has fast convergence properties, we will require certain dissipativity conditions to\nbe introduced later. For example, if the target is the Gibbs measure of a strongly convex potential\n$f : \\mathbb{R}^d \\to$ 
$\\mathbb{R}$, i.e., $p(x) \\propto \\exp(-f(x))$, a popular candidate diffusion is the (overdamped) Langevin\ndiffusion which is the solution to the following SDE:\n\n$$dX_t = -\\nabla f(X_t)\\,dt + \\sqrt{2}\\,dB_t, \\quad \\text{with } X_0 = x_0.$$ (3)\n\nIt is straightforward to verify (2) for the above diffusion which implies that the target p(x) is its\ninvariant measure. Moreover, strong convexity of f implies uniform dissipativity and ensures that the\ndiffusion achieves fast convergence.\n\n2.1 Numerical Schemes and the It\u00f4-Taylor Expansion\n\nIn practice, the It\u00f4 diffusion (1) (similarly (3)) cannot be simulated in continuous time and is instead\napproximated by a discrete-time numerical integration scheme. Owing to its simplicity, a common\nchoice is the Euler-Maruyama (EM) scheme [32], which relies on the following update rule,\n\n$$\\tilde{X}_{k+1} = \\tilde{X}_k + h\\, b(\\tilde{X}_k) + \\sqrt{h}\\, \\sigma(\\tilde{X}_k)\\, \\xi_{k+1}, \\quad k = 0, 1, \\dots,$$ (4)\n\nwhere h is the step size and $\\xi_{k+1} \\overset{\\text{i.i.d.}}{\\sim} \\mathcal{N}(0, I_d)$ is independent of $\\tilde{X}_k$ for all $k \\in \\mathbb{N}$. The above\niteration defines a Markov chain and due to discretization error, its invariant measure $\\tilde{p}(x)$ is different\nfrom the target distribution p(x); yet, for a sufficiently small step size, the difference between $\\tilde{p}(x)$\nand p(x) can be characterized (see e.g. [42, Thm. 7.3]).\nAnalogous to ODE solvers, numerical schemes such as the EM scheme and SRK schemes are\nderived based on approximating the continuous-time dynamics locally. Similar to the standard Taylor\nexpansion, It\u00f4\u2019s lemma induces a stochastic version of the Taylor expansion of a smooth function\nevaluated at a stochastic process at time t. This is known as the It\u00f4-Taylor (or Wagner-Platen)\nexpansion [46], and one can also interpret the expansion as recursively applying It\u00f4\u2019s lemma to terms\nin the integral form of an SDE. Specifically, for $g : \\mathbb{R}^d \\to \\mathbb{R}^d$, we define the operators:\n\n$$L(g)(x) = \\nabla g(x) \\cdot b(x) + \\frac{1}{2} \\sum_{i=1}^m \\nabla^2 g(x)[\\sigma_i(x), \\sigma_i(x)], \\quad \\Lambda_j(g)(x) = \\nabla g(x) \\cdot \\sigma_j(x),$$ (5)\n\nwhere $\\sigma_i(x)$ denotes the ith column of $\\sigma(x)$. Then, applying It\u00f4\u2019s lemma to the integral form of the\nSDE (1) with the starting point $X_0$ yields the following expansion around $X_0$ [32, 46]:\n\n$$X_t = \\overbrace{\\underbrace{X_0 + t\\, b(X_0) + \\sigma(X_0) B_t}_{\\text{Euler-Maruyama update}} + \\sum_{i,j=1}^m \\int_0^t \\int_0^s \\Lambda_j(\\sigma_i)(X_u)\\, dB_u^{(j)}\\, dB_s^{(i)}}^{\\text{mean-square order 1.0 stochastic Runge-Kutta update}} + \\sum_{i=1}^m \\int_0^t \\int_0^s L(\\sigma_i)(X_u)\\, du\\, dB_s^{(i)} + \\sum_{i=1}^m \\int_0^t \\int_0^s \\Lambda_i(b)(X_u)\\, dB_u^{(i)}\\, ds + \\int_0^t \\int_0^s L(b)(X_u)\\, du\\, ds.$$ (6)\n\nThe expansion justifies the update rule of the EM scheme, since the discretization is nothing more\nthan taking the first three terms on the right hand side of (6). Similarly, a mean-square order 1.0\nSRK scheme for general It\u00f4 diffusions, introduced in Section 4.2, approximates the first four\nterms. In principle, one may recursively apply It\u00f4\u2019s lemma to terms in the expansion to obtain a more\nfine-grained approximation. However, the appearance of non-Gaussian terms in the guise of iterated\nBrownian integrals presents a challenge for simulation. Nevertheless, it is clear that the above SRK\nscheme will be a more accurate local approximation than the EM scheme, since it accounts for more\nterms in the expansion. As a result, the local deviation between the continuous-time process and\nthe Markov chain will be smaller. We characterize this property of a numerical scheme as follows.\n\nDefinition 2.1 (Uniform Local Deviation Orders). Let $\\{\\tilde{X}_k\\}_{k \\in \\mathbb{N}}$ denote the discretization of an It\u00f4\ndiffusion $\\{X_t\\}_{t \\ge 0}$ based on a numerical integration scheme with constant step size h, and its governing\nBrownian motion $\\{B_t\\}_{t \\ge 0}$ be adapted to the filtration $\\{\\mathcal{F}_t\\}_{t \\ge 0}$. Suppose $\\{X_s^{(k)}\\}_{s \\ge 0}$ is another\ninstance of the same diffusion starting from $\\tilde{X}_{k-1}$ at s = 0 and governed by the Brownian motion\n$\\{B_{s + h(k-1)}\\}_{s \\ge 0}$. Then, the numerical integration scheme has local deviation $D_h^{(k)} = \\tilde{X}_k - X_h^{(k)}$\nwith uniform orders $(p_1, p_2)$ if\n\n$$\\mathcal{E}_k^{(1)} = \\mathbb{E}\\big[\\mathbb{E}\\big[\\|D_h^{(k)}\\|_2^2 \\,\\big|\\, \\mathcal{F}_{t_{k-1}}\\big]\\big] \\le \\lambda_1 h^{2p_1}, \\qquad \\mathcal{E}_k^{(2)} = \\mathbb{E}\\big[\\big\\|\\mathbb{E}\\big[D_h^{(k)} \\,\\big|\\, \\mathcal{F}_{t_{k-1}}\\big]\\big\\|_2^2\\big] \\le \\lambda_2 h^{2p_2},$$ (7)\n\nfor all $k \\in \\mathbb{N}_+$ and $0 \\le h < C_h$, where constants $0 < \\lambda_1, \\lambda_2, C_h < \\infty$. We say that $\\mathcal{E}_k^{(1)}$ and $\\mathcal{E}_k^{(2)}$ are\nthe local mean-square deviation and the local mean deviation at iteration k, respectively.\n\nIn the SDE literature, local deviation orders are defined to derive the mean-square order (or strong\norder) of numerical schemes [46], where the mean-square order is defined as the maximum half-integer\np such that $\\mathbb{E}[\\|X_{t_k} - \\tilde{X}_k\\|_2^2] \\le C h^{2p}$ for a constant C independent of step size h and all\n$k \\in \\mathbb{N}$ where $t_k < T$. Here, $\\{X_t\\}_{t \\ge 0}$ is the continuous-time process, $\\tilde{X}_k$ (k = 0, 1, ...) is the\nMarkov chain with the same Brownian motion as the continuous-time process, and $T < \\infty$ is the\nterminal time. The key difference between our definition of uniform local deviation orders and local\ndeviation orders in the SDE literature is that we require the extra step of ensuring the expectations of $\\mathcal{E}_k^{(1)}$\nand $\\mathcal{E}_k^{(2)}$ are bounded across all iterations, instead of merely requiring the two deviation variables to\nbe bounded by a function of the previous iterate.\n\n3 Convergence Rates of Numerical Schemes for Sampling\n\nWe present a user-friendly and broadly applicable theorem that establishes the convergence rate\nof a diffusion-based sampling algorithm. We develop our explicit bounds in the 2-Wasserstein\ndistance based on two crucial steps. We first verify that the candidate diffusion exhibits exponential\nWasserstein-2 contraction and thereafter compute the uniform local deviation orders of the scheme.\n\nDefinition 3.1 (Wasserstein-2 rate).
A diffusion $X_t$ has Wasserstein-2 ($W_2$) rate $r : \\mathbb{R}_{\\ge 0} \\to \\mathbb{R}$ if for\ntwo instances of the diffusion $X_t$ initiated respectively from x and y, we have\n\n$$W_2(\\delta_x P_t, \\delta_y P_t) \\le r(t)\\, \\|x - y\\|_2, \\quad \\text{for all } x, y \\in \\mathbb{R}^d,\\ t \\ge 0,$$\n\nwhere $\\delta_x P_t$ denotes the distribution of the diffusion $X_t$ starting from x. Moreover, if $r(t) = e^{-\\alpha t}$ for\nsome $\\alpha > 0$, then we say the diffusion has exponential $W_2$-contraction.\n\nThe above condition guarantees fast mixing of the sampling algorithm. For It\u00f4 diffusions, uniform\ndissipativity suffices to ensure exponential $W_2$-contraction $r(t) = e^{-\\alpha t}$ [24, Prop. 3.3].\n\nDefinition 3.2 (Uniform Dissipativity). A diffusion defined by (1) is $\\alpha$-uniformly dissipative if\n\n$$\\langle b(x) - b(y), x - y \\rangle + \\frac{1}{2} \\|\\sigma(x) - \\sigma(y)\\|_F^2 \\le -\\alpha \\|x - y\\|_2^2, \\quad \\text{for all } x, y \\in \\mathbb{R}^d.$$\n\nFor It\u00f4 diffusions with a constant diffusion coefficient, uniform dissipativity is equivalent to one-sided\nLipschitz continuity of the drift with coefficient $2\\alpha$. In particular, for the overdamped Langevin\ndiffusion (3), this reduces to strong convexity of the potential. Moreover, for this special case,\nexponential $W_2$-contraction of the diffusion and strong convexity of the potential are equivalent [4].\nWe will ultimately verify uniform dissipativity for the candidate diffusions, but we first use\n$W_2$-contraction to derive the convergence rate of a diffusion-based sampling algorithm.\n\nTheorem 1 ($W_2$-rate of a numerical scheme). For a diffusion with invariant measure $\\nu^*$, exponentially\ncontracting $W_2$-rate $r(t) = e^{-\\alpha t}$, and Lipschitz drift and diffusion coefficients, suppose its\ndiscretization based on a numerical integration scheme has uniform local deviation orders $(p_1, p_2)$\nwhere $p_1 \\ge 1/2$ and $p_2 \\ge p_1 + 1/2$. Let $\\nu_k$ be the measure associated with the Markov chain\nobtained from the discretization after k steps starting from the Dirac measure $\\nu_0 = \\delta_{x_0}$.
Then, for\nconstant step size h satisfying\n\n$$h < 1 \\wedge C_h \\wedge \\frac{1}{2\\alpha} \\wedge \\frac{1}{8\\mu_1(b)^2 + 8\\mu_1^F(\\sigma)^2},$$\n\nwhere $C_h$ is the step size constraint for obtaining the uniform local deviation orders, we have\n\n$$W_2(\\nu_k, \\nu^*) \\le \\left(1 - \\frac{\\alpha h}{2}\\right)^k W_2(\\nu_0, \\nu^*) + \\left( \\frac{8 (16 \\mu_1(b) \\lambda_1 + \\lambda_2)}{\\alpha^2} + \\frac{2\\lambda_1}{\\alpha} \\right)^{1/2} h^{p_1 - 1/2}.$$ (8)\n\nMoreover, if $p_1 > 1/2$ and the step size additionally satisfies\n\n$$h < \\left( \\frac{2}{\\epsilon} \\sqrt{ \\frac{64 (16 \\lambda_1 \\mu_1(b) + \\lambda_2)}{\\alpha^2} + \\frac{2\\lambda_1}{\\alpha} } \\right)^{-1/(p_1 - 1/2)},$$\n\nthen $W_2(\\nu_k, \\nu^*)$ converges in $\\tilde{\\mathcal{O}}(\\epsilon^{-1/(p_1 - 1/2)})$ iterations within a sufficiently small positive error $\\epsilon$.\n\nTheorem 1 directly translates mean-square order results in the SDE literature to convergence rates of\nsampling algorithms in $W_2$. The proof, deferred to Appendix A, follows from an inductive argument\nover the local deviation at each step (see e.g. [46]), and the convergence is provided by the exponential\n$W_2$-contraction of the diffusion. To invoke the theorem and obtain convergence rates of a sampling\nalgorithm, it suffices to (i) show that the candidate diffusion is uniformly dissipative and (ii) derive\nthe local deviation orders for the underlying discretization. Below, we demonstrate this on both the\noverdamped Langevin and general It\u00f4 diffusions when the EM scheme is used for discretization,\nas well as the underdamped Langevin diffusion when a linearization is used for discretization [8].\nFor these schemes, local deviation orders are either well-known or straightforward to derive. Thus,\nconvergence rates for corresponding sampling algorithms can be easily obtained using Theorem 1.\n\nExample 1. Consider sampling from a target distribution whose potential is strongly convex using the\noverdamped Langevin diffusion (3) discretized by the EM scheme.
The scheme has local deviation of\norders (1.5, 2.0) for It\u00f4 diffusions with constant diffusion coefficients and drift coefficients that are\nsufficiently smooth $^3$ (see e.g. [46, Sec. 1.5.4]). Since the potential is strongly convex, the Langevin\ndiffusion is uniformly dissipative and achieves exponential $W_2$-contraction [18, Prop. 1]. Elementary\nalgebra shows that Markov chain moments are bounded [24, Lem. A.2]. Therefore, Theorem 1\nimplies that the rate of the sampling is $\\tilde{\\mathcal{O}}(d\\epsilon^{-1})$, where the dimension dependence can be extracted\nfrom the explicit bound. This recovers the result by Durmus and Moulines [18, Thm. 8].\n\nExample 2. If a general It\u00f4 diffusion (1) with Lipschitz smooth drift and diffusion coefficients is\nused for the sampling task, local deviation orders of the EM scheme reduce to (1.0, 1.5) due to the\napproximation of the diffusion term [46]; this term is exact for the Langevin diffusion. If we further\nhave uniform dissipativity, it can be shown that Markov chain moments are bounded [24, Lem. A.2].\nHence, Theorem 1 concludes that the convergence rate is $\\tilde{\\mathcal{O}}(d\\epsilon^{-2})$. We note that for the diffusion\ncoefficient, we use the Frobenius norm for the Lipschitz and growth constants which potentially hides\ndimension dependence factors. The dimension dependence worsens if one were to convert all bounds\nto be based on the operator norm using the pessimistic inequality $\\|\\sigma(x)\\|_F \\le (d^{1/2} + m^{1/2}) \\|\\sigma(x)\\|_{op}$.\nAppendix D provides a convergence bound with explicit constants.\n\nExample 3. Consider sampling from a target distribution whose potential is strongly convex using\nthe underdamped Langevin diffusion:\n\n$$dX_t = V_t\\,dt, \\qquad dV_t = -\\gamma V_t\\,dt - u \\nabla f(X_t)\\,dt + \\sqrt{2\\gamma u}\\,dB_t.$$\n\nCheng et al. [8] show that the continuous-time process $\\{(X_t, X_t + V_t)\\}_{t \\ge 0}$ exhibits exponential\n$W_2$-contraction when the coefficients $\\gamma$ and u are appropriately chosen [8, Thm. 5].
Moreover, the scheme\ndevised by linearizing the degenerate SDE has uniform local deviation orders (1.5, 2.0) $^4$ [8, Thm. 9].\nTheorem 1 implies that the convergence rate is $\\mathcal{O}(d^{1/2}\\epsilon^{-1})$, where the dimension dependence is\nextracted from explicit bounds. This recovers the result by Cheng et al. [8, Thm. 1].\n\nWhile computing the local deviation orders of a numerical scheme for a single step is often\nstraightforward, it is not immediately clear how one might verify them uniformly for each iteration. This\nrequires a uniform bound on moments of the Markov chain defined by the numerical scheme. As\nour second principal contribution, we explicitly bound the Markov chain moments of SRK schemes\nwhich, combined with Theorem 1, leads to improved rates by only accessing the first-order oracle.\n\n4 Sampling with Stochastic Runge-Kutta and Improved Rates\n\nWe show that convergence rates of sampling can be significantly improved if an It\u00f4 diffusion with\nexponential $W_2$-contraction is discretized using SRK methods. Compared to the EM scheme, the SRK\nschemes we consider query the same order oracle and improve on the deviation orders.\nTheorem 1 hints that one may expect the convergence rate of sampling to improve as more terms of\nthe It\u00f4-Taylor expansion are incorporated in the numerical integration scheme. However, in practice,\na challenge for simulation is the appearance of non-Gaussian terms in the form of iterated It\u00f4 integrals.\nFortunately, since the overdamped Langevin diffusion has a constant diffusion coefficient, efficient\nSRK methods can still be applied to accelerate convergence.\n\n$^3$ In fact, it suffices to ensure the drift is three-times differentiable with Lipschitz gradient and Hessian.\n$^4$ Cheng et al. [8] derive the uniform local mean-square deviation order. Jensen\u2019s inequality implies that the\nlocal mean deviation is of the same uniform order.
This entails that the uniform local deviation orders are (2.0, 2.0) and\nhence also (1.5, 2.0) when the step size constraint $C_h \\le 1$; note that $p_2 \\ge p_1 + 1/2$ is required to invoke Theorem 1.\n\n4.1 Sampling from Strongly Convex Potentials with the Langevin Diffusion\n\nWe provide a non-asymptotic analysis for integrating the overdamped Langevin diffusion based on a\nmean-square order 1.5 SRK scheme for SDEs with constant diffusion coefficients [46]. We refer to\nthe sampling algorithm as SRK-LD. Specifically, given a sample from the previous iteration $\\tilde{X}_k$,\n\n$$\\tilde{H}_1 = \\tilde{X}_k + \\sqrt{2h} \\left[ \\left( \\frac{1}{2} + \\frac{1}{\\sqrt{6}} \\right) \\xi_{k+1} + \\frac{1}{\\sqrt{12}}\\, \\eta_{k+1} \\right],$$\n$$\\tilde{H}_2 = \\tilde{X}_k - h \\nabla f(\\tilde{X}_k) + \\sqrt{2h} \\left[ \\left( \\frac{1}{2} - \\frac{1}{\\sqrt{6}} \\right) \\xi_{k+1} + \\frac{1}{\\sqrt{12}}\\, \\eta_{k+1} \\right],$$\n$$\\tilde{X}_{k+1} = \\tilde{X}_k - \\frac{h}{2} \\left( \\nabla f(\\tilde{H}_1) + \\nabla f(\\tilde{H}_2) \\right) + \\sqrt{2h}\\, \\xi_{k+1},$$ (9)\n\nwhere h is the step size and $\\xi_{k+1}, \\eta_{k+1} \\overset{\\text{i.i.d.}}{\\sim} \\mathcal{N}(0, I_d)$ are independent of $\\tilde{X}_k$ for all $k \\in \\mathbb{N}$. We refer\nthe reader to [46, Sec. 1.5] for a detailed derivation of the scheme and other background information.\n\nTheorem 2 (SRK-LD). Let $\\nu^*$ be the target distribution with a strongly convex potential that is\nfour-times differentiable with Lipschitz continuous first three derivatives. Let $\\nu_k$ be the distribution\nof the kth Markov chain iterate defined by (9) starting from the Dirac measure $\\nu_0 = \\delta_{x_0}$. Then, for\na sufficiently small step size, the 1.5 SRK scheme has uniform local deviation orders (2.0, 2.5), and\n$W_2(\\nu_k, \\nu^*)$ converges within $\\epsilon$ error in $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$ iterations.\n\nThe proof of this theorem is given in Appendix B where we provide explicit constants. The basic idea\nof the proof is to match up the terms in the It\u00f4-Taylor expansion to terms in the Taylor expansion of\nthe discretization scheme. However, extreme care is needed to ensure a tight dimension dependence.\n\nRemark.
For large-scale Bayesian inference, computing the full gradient of the potential can be costly.\nFortunately, SRK-LD can be easily adapted to use an unbiased stochastic oracle, provided that queries of\nthe latter do not have overly large variance. We provide an informal discussion in Appendix E.\nWe emphasize that the 1.5 SRK scheme (9) only queries the gradient of the potential and improves\nthe best available $W_2$-rate of LMC in the same setting from $\\tilde{\\mathcal{O}}(d\\epsilon^{-1})$ to $\\tilde{\\mathcal{O}}(d\\epsilon^{-2/3})$, with merely two\nextra gradient evaluations per iteration. Remarkably, the dimension dependence stays the same.\n\n4.2 Sampling from Non-Convex Potentials with It\u00f4 Diffusions\n\nFor the Langevin diffusion, the conclusions of Theorem 1 only apply to distributions with strongly\nconvex potentials, as exponential $W_2$-contraction of the Langevin diffusion is equivalent to strong\nconvexity of the potential. This shortcoming can be addressed using a non-constant diffusion coefficient,\nwhich allows us to sample from non-convex potentials using uniformly dissipative candidate\ndiffusions. Below, we use a mean-square order 1.0 SRK scheme for general diffusions [52] and\nachieve an improved convergence rate compared to sampling with the EM scheme.\nWe refer to the sampling algorithm as SRK-ID, which has the following update rule:\n\n$$\\tilde{H}_1^{(i)} = \\tilde{X}_k + \\sum_{j=1}^m \\sigma_j(\\tilde{X}_k) \\frac{I_{(j,i)}}{\\sqrt{h}}, \\qquad \\tilde{H}_2^{(i)} = \\tilde{X}_k - \\sum_{j=1}^m \\sigma_j(\\tilde{X}_k) \\frac{I_{(j,i)}}{\\sqrt{h}},$$\n$$\\tilde{X}_{k+1} = \\tilde{X}_k + h\\, b(\\tilde{X}_k) + \\sum_{i=1}^m \\sigma_i(\\tilde{X}_k) I_{(i)} + \\frac{\\sqrt{h}}{2} \\sum_{i=1}^m \\left( \\sigma_i(\\tilde{H}_1^{(i)}) - \\sigma_i(\\tilde{H}_2^{(i)}) \\right),$$ (10)\n\nwhere $I_{(i)} = \\int_{t_k}^{t_{k+1}} dB_s^{(i)}$ and $I_{(j,i)} = \\int_{t_k}^{t_{k+1}} \\int_{t_k}^{s} dB_u^{(j)}\\, dB_s^{(i)}$. We note that schemes of higher order exist\nfor general diffusions, but they typically require advanced approximations of iterated It\u00f4 integrals of\nthe form $\\int_0^{t_0} \\cdots \\int_0^{t_{n-1}} dB_{t_n}^{(k_n)} \\cdots dB_{t_1}^{(k_1)}$.\n\nTheorem 3 (SRK-ID). For a uniformly dissipative diffusion with invariant measure $\\nu^*$, Lipschitz\ndrift and diffusion coefficients that have Lipschitz gradients, assume that the diffusion coefficient\nfurther satisfies the sublinear growth condition $\\|\\sigma(x)\\|_{op} \\le \\pi_{1,1}(\\sigma) (1 + \\|x\\|_2^{1/2})$ for all $x \\in \\mathbb{R}^d$. Let\n$\\nu_k$ be the distribution of the kth Markov chain iterate defined by (10) starting from the Dirac measure\n$\\nu_0 = \\delta_{x_0}$. Then for a sufficiently small step size, iterates of the 1.0 SRK scheme have uniform local\ndeviation orders (1.5, 2.0), and $W_2(\\nu_k, \\nu^*)$ converges within $\\epsilon$ error in $\\tilde{\\mathcal{O}}(d^{3/4}m^2\\epsilon^{-1})$ iterations.\n\nThe proof is given in Appendix C where we present explicit constants. We note that the dimension\ndependence in this case is only better than that of EM due to the extra growth condition on the\ndiffusion. The extra m-dependence comes from the 2m evaluations of the diffusion coefficient at\n$\\tilde{H}_1^{(i)}$ and $\\tilde{H}_2^{(i)}$ (i = 1, ..., m). In the above theorem, we use the Frobenius norm for the Lipschitz and\ngrowth constants for the diffusion coefficient which potentially hides dimension dependence. One\nmay convert all bounds to be based on the operator norm with our constants given in the Appendix.\nIn practice, accurately simulating both the iterated It\u00f4 integrals $I_{(j,i)}$ and the Brownian motion\nincrements $I_{(i)}$ simultaneously is difficult. We comment on two possible approximations based on\ntruncating an infinite series in Appendix H.2.\n\n5 Examples and Numerical Studies\n\nWe provide examples of our theory and numerical studies showing SRK methods achieve lower\nasymptotic errors, are stable under large step sizes, and hence converge faster to a prescribed\ntolerance. We sample from strongly convex potentials with SRK-LD and non-convex potentials\nwith SRK-ID.
Since our theory is stated in $W_2$, we compare with EM in terms of $W_2$ and the mean squared error\n(MSE) between iterates of the Markov chain and the target. We do not compare to schemes that\nrequire computing derivatives of the drift and diffusion coefficients. Since directly computing $W_2$ is\ninfeasible, we estimate it using samples instead. However, sample-based estimators have a bias of\norder $\\Omega(n^{-1/d})$ [64], so we perform a heuristic correction, which is described in Appendix G.\n\nFigure 1: (a) Gaussian mixture: estimated asymptotic error against step size. (b) Bayesian logistic regression: estimated error against number of\niterations. (c) Non-convex potential: MSE against number of iterations. Legends of (a) and (c) denote \u201cscheme (dimensionality)\u201d.\nLegend of (b) denotes \u201cscheme (step size)\u201d.\n\n5.1 Strongly Convex Potentials\nGaussian Mixture. We consider sampling from a multivariate Gaussian mixture with density\n\n$$\\pi(\\theta) \\propto \\exp\\left(-\\frac{1}{2}\\|\\theta - a\\|_2^2\\right) + \\exp\\left(-\\frac{1}{2}\\|\\theta + a\\|_2^2\\right), \\quad \\theta \\in \\mathbb{R}^d,$$\n\nwhere $a \\in \\mathbb{R}^d$ is a parameter that measures the separation of the two modes. The potential is strongly\nconvex when $\\|a\\|_2 < 1$ and has Lipschitz gradient and Hessian [11]. Moreover, one can verify that\nits third derivative is also Lipschitz.\n\nBayesian Logistic Regression. We consider Bayesian logistic regression (BLR) [11]. Given data\nsamples $X = \\{x_i\\}_{i=1}^n \\in \\mathbb{R}^{n \\times d}$, $Y = \\{y_i\\}_{i=1}^n \\in \\mathbb{R}^n$, and parameter $\\theta \\in \\mathbb{R}^d$, logistic regression\nmodels the Bernoulli conditional distribution with probability $\\Pr(y_i = 1 | x_i) = 1/(1 + \\exp(-\\theta^\\top x_i))$.\nWe place a Gaussian prior on $\\theta$ with mean zero and covariance proportional to $\\Sigma_X^{-1}$, where $\\Sigma_X =\nX^\\top X/n$ is the sample covariance matrix. We sample from the posterior density\n\n$$\\pi(\\theta) \\propto \\exp(-f(\\theta)) = \\exp\\left(Y^\\top X\\theta - \\sum_{i=1}^n \\log(1 + \\exp(\\theta^\\top x_i)) - \\frac{\\alpha}{2}\\|\\Sigma_X^{1/2}\\theta\\|_2^2\\right).$$\n\nThe potential is strongly convex and has Lipschitz gradient and Hessian [11]. One can also verify\nthat it has a Lipschitz third derivative.\nTo obtain the potential, we generate data from the model with the parameter $\\theta^* = \\mathbf{1}_d$ following [11,\n21]. To obtain each $x_i$, we sample a vector whose components are independently drawn from the\nRademacher distribution and normalize it by the Frobenius norm of the sample matrix $X$ times\n$d^{-1/2}$. Note that our normalization scheme is different from that adopted in [11, 21], where each $x_i$\nis normalized by its Euclidean norm. We sample the corresponding $y_i$ from the model and fix the\nregularizer $\\alpha = 0.3d/\\pi^2$.\nTo characterize the true posterior, we sample 50k particles driven by EM with a step size of 0.001\nuntil convergence. We subsample from these particles 5k examples to represent the true posterior each\ntime we intend to estimate squared $W_2$. We monitor the kernel Stein discrepancy (KSD; see footnote 5) [29, 10, 36]\nusing the inverse multiquadric kernel [29] with hyperparameters $\\beta = -1/2$ and $c = 1$ to measure\nthe distance between the 100k particles and the true posterior. We confirm that these particles\nfaithfully approximate the true posterior with the squared KSD being less than 0.002 in all settings.\nWhen sampling from a Gaussian mixture and the posterior of BLR, we observe that SRK-LD leads to\na consistent improvement in the asymptotic error compared to the EM scheme when the same step size\nis used. In particular, Figure 1 (a) plots the estimated asymptotic error in squared $W_2$ for different step\nsizes for 2D and 20D Gaussian mixture problems and shows that SRK-LD is surprisingly stable for\nexceptionally large step sizes. Figure 1 (b) plots the estimated error in squared $W_2$ as the number of\niterations increases for 2D BLR. 
We include additional results on problems in 2D and 20D with error\nestimates in squared $W_2$ and the energy distance [58], along with a wall time analysis, in Appendix H.\n\n5.2 Non-Convex Potentials\nWe consider sampling from the non-convex potential\n\n$$f(x) = \\left(\\beta + \\|x\\|_2^2\\right)^{1/2} + \\gamma \\log\\left(\\beta + \\|x\\|_2^2\\right), \\quad x \\in \\mathbb{R}^d,$$\n\nwhere $\\beta, \\gamma > 0$ are scalar parameters of the distribution. The corresponding density is a simplified\nabstraction for the posterior distribution of Student\u2019s t regression with a pseudo-Huber prior [30]. One\ncan verify that when $\\beta + \\|x\\|_2^2 < 1$ and $(4\\gamma + 1)\\|x\\|_2^2 < (2\\gamma + 1)\\sqrt{\\beta + \\|x\\|_2^2}$, the Hessian has a negative\neigenvalue. The candidate diffusion, where the drift coefficient is given by (2) and the diffusion coefficient is\n$\\sigma(x) = g(x)^{1/2} I_d$ with $g(x) = \\left(\\beta + \\|x\\|_2^2\\right)^{1/2}$, is uniformly dissipative if $\\frac{1}{2} - \\left|\\gamma - \\frac{1}{2}\\right| \\frac{2}{\\beta^{1/2}} - \\frac{d}{8\\beta^{1/2}} > 0$.\nIndeed, one can verify that $\\mu_1(g) \\le 1$, $\\mu_2(g) \\le \\frac{2}{\\beta^{1/2}}$, and $\\mu_1(\\sigma) \\le \\frac{1}{2\\beta^{1/4}}$. Therefore,\n\n$$\\begin{aligned}\n\\langle b(x) - b(y), x - y \\rangle + \\frac{1}{2}\\|\\sigma(x) - \\sigma(y)\\|_F^2 &\\le -\\left(\\frac{1}{2} - \\left|\\gamma - \\frac{1}{2}\\right| \\mu_2(g) - \\frac{d}{2}\\mu_1(\\sigma)^2\\right)\\|x - y\\|_2^2\\\\\n&\\le -\\left(\\frac{1}{2} - \\left|\\gamma - \\frac{1}{2}\\right| \\frac{2}{\\beta^{1/2}} - \\frac{d}{8\\beta^{1/2}}\\right)\\|x - y\\|_2^2.\n\\end{aligned}$$\n\nMoreover, $b$ and $\\sigma$ have Lipschitz first two derivatives, and the latter satisfies the sublinear growth\ncondition in Theorem 3.\nTo study the behavior of SRK-ID, we simulate using both SRK-ID and EM. For both schemes, we\nsimulate with a step size of $10^{-3}$, initiated from the same 50k particles approximating the stationary\ndistribution obtained by simulating EM with a step size of $10^{-6}$ until convergence. We compute the\nMSE between the continuous-time process and the Markov chain driven by the same Brownian motion\nfor 300 iterations, by which point the MSE curve has plateaued. We approximate the continuous-time\nprocess by simulating using the EM scheme with a step size of $10^{-6}$, similar to the setting in [52]. To\nobtain final results, we average across ten independent runs. We note that the MSE upper bounds the squared $W_2$ distance\ndue to the latter being an infimum over all couplings. 
Hence, the MSE value serves as an indication\nof the convergence performance in $W_2$.\nFigure 1 (c) shows that for $\\beta = 0.33$, $\\gamma = 0.5$ and $d = 1$, when simulating from a good approximation\nto the target distribution with the same step size, the MSE of SRK-ID remains small, whereas the\nMSE of EM converges to a larger value. However, this improvement diminishes as the dimensionality\nof the sampling problem increases. We report additional results with other parameter settings in\nAppendix H.2.2. Notably, we did not observe significant differences in the estimated squared $W_2$\nvalues. We suspect this is due to the discrepancy being dominated by the bias of our estimator.\n\nAcknowledgments\nMAE is partially funded by NSERC [2019-06167] and the CIFAR AI Chairs program at the Vector\nInstitute.\n\nFootnote 5: Unfortunately, there appear to be two definitions for KSD and the energy distance in the literature, differing\nin whether a square root is taken or not. We adopt the version with the square root taken.\n\nReferences\n[1] David F Anderson and Jonathan C Mattingly. A weak trapezoidal method for a class of\nstochastic differential equations. arXiv preprint arXiv:0906.3475, 2009.\n\n[2] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain\nMonte Carlo. CRC press, 2011.\n\n[3] Kevin Burrage, PM Burrage, and Tianhai Tian. Numerical methods for strong solutions of\nstochastic differential equations: an overview. Proceedings of the Royal Society of London.\nSeries A: Mathematical, Physical and Engineering Sciences, 460(2041):373\u2013402, 2004.\n\n[4] Simone Calogero. Exponential convergence to equilibrium for kinetic Fokker-Planck equations.\nCommunications in Partial Differential Equations, 37(8):1357\u20131390, 2012.\n\n[5] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient\nMCMC algorithms with high-order integrators. 
In Advances in Neural Information Processing\nSystems, pages 2278\u20132286, 2015.\n\n[6] Yuansi Chen, Raaz Dwivedi, Martin J Wainwright, and Bin Yu. Fast mixing of Metropolized\nHamiltonian Monte Carlo: Benefits of multi-step gradients. arXiv preprint arXiv:1905.12247,\n2019.\n\n[7] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. arXiv\npreprint arXiv:1705.09048, 2017.\n\n[8] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped\nLangevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.\n\n[9] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I Jordan.\nSharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint\narXiv:1805.01648, 2018.\n\n[10] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit.\nJMLR: Workshop and Conference Proceedings, 2016.\n\n[11] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave\ndensities. Journal of the Royal Statistical Society: Series B (Statistical Methodology),\n79(3):651\u2013676, 2017.\n\n[12] Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte\nCarlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.\n\n[13] Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using\nkinetic Langevin diffusions. arXiv preprint arXiv:1807.09382, 2018.\n\n[14] Arnak S Dalalyan and Alexandre B Tsybakov. Sparse regression learning by aggregation and\nLangevin Monte-Carlo. Journal of Computer and System Sciences, 78(5):1423\u20131443, 2012.\n\n[15] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven.\nBayesian sampling using stochastic gradient thermostats. In Advances in neural information\nprocessing systems, pages 3203\u20133211, 2014.\n\n[16] V Dobri\u0107 and Joseph E Yukich. 
Asymptotics for transportation cost in high dimensions. Journal\nof Theoretical Probability, 8(1):97\u2013118, 1995.\n\n[17] RM Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical\nStatistics, 40(1):40\u201350, 1969.\n\n[18] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted\nLangevin algorithm. arXiv preprint arXiv:1605.01559, 2016.\n\n[19] Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, and Ga\u00ebl Richard. Stochastic\ngradient Richardson-Romberg Markov chain Monte Carlo. In Advances in Neural Information\nProcessing Systems, pages 2047\u20132055, 2016.\n\n[20] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by\nproximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging\nSciences, 11(1):473\u2013506, 2018.\n\n[21] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling:\nMetropolis-Hastings algorithms are fast! arXiv preprint arXiv:1801.02309, 2018.\n\n[22] Andreas Eberle. Reflection couplings and contraction rates for diffusions. Probability theory\nand related fields, 166(3-4):851\u2013886, 2016.\n\n[23] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative contraction\nrates for Langevin dynamics. arXiv preprint arXiv:1703.01617, 2017.\n\n[24] Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization with\ndiscretized diffusions. In Advances in Neural Information Processing Systems, pages 9694\u20139703,\n2018.\n\n[25] R\u00e9mi Flamary and Nicolas Courty. POT python optimal transport library, 2017. URL\nhttps://github.com/rflamary/POT.\n\n[26] Saul B Gelfand and Sanjoy K Mitter. Recursive stochastic algorithms for global optimization in\n$\\mathbb{R}^d$. 
SIAM Journal on Control and Optimization, 29(5):999\u20131018, 1991.\n\n[27] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B\nRubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.\n\n[28] Ivan Gentil, Christian L\u00e9onard, and Luigia Ripani. About the analogy between optimal transport\nand minimal entropy. arXiv preprint arXiv:1510.08230, 2015.\n\n[29] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In Proceedings of\nthe 34th International Conference on Machine Learning-Volume 70, pages 1292\u20131301. JMLR.org,\n2017.\n\n[30] Jackson Gorham, Andrew B Duncan, Sebastian J Vollmer, and Lester Mackey. Measuring\nsample quality with diffusions. arXiv preprint arXiv:1611.06972, 2016.\n\n[31] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch\u00f6lkopf, and Alexander\nSmola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723\u2013773,\n2012.\n\n[32] Peter E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations,\nvolume 23. Springer Science & Business Media, 2013.\n\n[33] Peter E Kloeden, Eckhard Platen, and IW Wright. The approximation of multiple stochastic\nintegrals. Stochastic analysis and applications, 10(4):431\u2013441, 1992.\n\n[34] Dmitriy F Kuznetsov. Explicit one-step strong numerical methods of order 2.5 for It\u00f4 stochastic\ndifferential equations, based on the unified Taylor-It\u00f4 and Taylor-Stratonovich expansions. arXiv\npreprint arXiv:1802.04844, 2018.\n\n[35] Adrien Laurent and Gilles Vilmart. Exotic aromatic B-series for the study of long time integrators\nfor a class of ergodic SDEs. arXiv preprint arXiv:1707.02877, 2017.\n\n[36] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit\ntests. In International Conference on Machine Learning, pages 276\u2013284, 2016.\n\n[37] Yi-An Ma, Tianqi Chen, and Emily Fox. 
A complete recipe for stochastic gradient MCMC. In\nAdvances in Neural Information Processing Systems, pages 2917\u20132925, 2015.\n\n[38] Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I Jordan. Sampling can be\nfaster than optimization. arXiv preprint arXiv:1811.08413, 2018.\n\n[39] Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and\nMichael I Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint\narXiv:1902.00996, 2019.\n\n[40] David JC MacKay. Information theory, inference and learning algorithms.\nCambridge university press, 2003.\n\n[41] Xuerong Mao. Stochastic differential equations and applications. Elsevier, 2007.\n\n[42] Jonathan C Mattingly, Andrew M Stuart, and Desmond J Higham. Ergodicity for SDEs and\napproximations: locally Lipschitz vector fields and degenerate noise. Stochastic processes and\ntheir applications, 101(2):185\u2013232, 2002.\n\n[43] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and\nEdward Teller. Equation of state calculations by fast computing machines. The journal of\nchemical physics, 21(6):1087\u20131092, 1953.\n\n[44] Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science\n& Business Media, 2012.\n\n[45] GN Milstein and Michael V Tretyakov. Quasi-symplectic methods for Langevin-type equations.\nIMA journal of numerical analysis, 23(4):593\u2013626, 2003.\n\n[46] Grigori Noah Milstein and Michael V Tretyakov. Stochastic numerics for mathematical physics.\nSpringer Science & Business Media, 2013.\n\n[47] Wenlong Mou, Yi-An Ma, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan.\nHigh-order Langevin diffusion yields an accelerated MCMC algorithm. arXiv preprint\narXiv:1908.10859, 2019.\n\n[48] Bernt \u00d8ksendal. Stochastic differential equations. In Stochastic differential equations, pages\n65\u201384. 
Springer, 2003.\n\n[49] Gabriel Peyr\u00e9, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends\nin Machine Learning, 11(5-6):355\u2013607, 2019.\n\n[50] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic\ngradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849,\n2017.\n\n[51] Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of Langevin distributions\nand their discrete approximations. Bernoulli, 2(4):341\u2013363, 1996.\n\n[52] Andreas R\u00f6\u00dfler. Runge\u2013Kutta methods for the strong approximation of solutions of stochastic\ndifferential equations. SIAM Journal on Numerical Analysis, 48(3):922\u2013952, 2010.\n\n[53] Sotirios Sabanis and Ying Zhang. Higher order Langevin Monte Carlo algorithm. arXiv preprint\narXiv:1808.00728, 2018.\n\n[54] Sotirios Sabanis and Ying Zhang. On explicit order 1.5 approximations with varying coefficients:\nthe case of super-linear diffusion coefficients. Journal of Complexity, 50:84\u2013115, 2019.\n\n[55] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu, et al. Equivalence\nof distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41\n(5):2263\u20132291, 2013.\n\n[56] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling.\narXiv preprint arXiv:1909.05503, 2019.\n\n[57] Marvin K Simon. Probability distributions involving Gaussian random variables: A handbook\nfor engineers and scientists. Springer Science & Business Media, 2007.\n\n[58] G\u00e1bor J Sz\u00e9kely. E-statistics: The energy of statistical samples. Bowling Green State University,\nDepartment of Mathematics and Statistics Technical Report, 3(05):1\u201318, 2003.\n\n[59] G\u00e1bor J Sz\u00e9kely and Maria L Rizzo. 
Energy statistics: A class of statistics based on distances.\nJournal of statistical planning and inference, 143(8):1249\u20131272, 2013.\n\n[60] Veeravalli S Varadarajan. On the convergence of sample probability distributions. Sankhy\u0101:\nThe Indian Journal of Statistics (1933-1960), 19(1/2):23\u201326, 1958.\n\n[61] Santosh S Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin\nalgorithm: Log-Sobolev suffices. arXiv preprint arXiv:1903.08568, 2019.\n\n[62] C\u00e9dric Villani. Optimal transport: old and new, volume 338. Springer Science & Business\nMedia, 2008.\n\n[63] Gilles Vilmart. Postprocessed integrators for the high order integration of ergodic SDEs. SIAM\nJournal on Scientific Computing, 37(1):A201\u2013A220, 2015.\n\n[64] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of\nempirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.\n\n[65] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics.\nIn Proceedings of the 28th international conference on machine learning (ICML-11), pages\n681\u2013688, 2011.\n\n[66] Magnus Wiktorsson et al. Joint characteristic function and simultaneous simulation of iterated\nIt\u00f4 integrals for multiple independent Brownian motions. The Annals of Applied Probability, 11\n(2):470\u2013487, 2001.\n\n[67] Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics\nbased algorithms for nonconvex optimization. In Advances in Neural Information Processing\nSystems, pages 3122\u20133133, 2018.\n\n[68] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization\nachieves acceleration. In Advances in Neural Information Processing Systems, pages\n3904\u20133913, 2018.\n\n[69] Difan Zou, Pan Xu, and Quanquan Gu. 
Sampling from non-log-concave distributions via\nvariance-reduced gradient Langevin dynamics. In The 22nd International Conference on\nArtificial Intelligence and Statistics, pages 2936\u20132945, 2019.\n", "award": [], "sourceid": 4202, "authors": [{"given_name": "Xuechen", "family_name": "Li", "institution": "Google"}, {"given_name": "Yi", "family_name": "Wu", "institution": "University of Toronto & Vector Institute"}, {"given_name": "Lester", "family_name": "Mackey", "institution": "Microsoft Research"}, {"given_name": "Murat", "family_name": "Erdogdu", "institution": "University of Toronto & Vector Institute"}]}