{"title": "Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices", "book": "Advances in Neural Information Processing Systems", "page_first": 8094, "page_last": 8106, "abstract": "We study the Unadjusted Langevin Algorithm (ULA) for sampling from a probability distribution $\\nu = e^{-f}$ on $\\R^n$. We prove a convergence guarantee in Kullback-Leibler (KL) divergence assuming $\\nu$ satisfies log-Sobolev inequality and $f$ has bounded Hessian. Notably, we do not assume convexity or bounds on higher derivatives. We also prove convergence guarantees in R\\'enyi divergence of order $q > 1$ assuming the limit of ULA satisfies either log-Sobolev or Poincar\\'e inequality.", "full_text": "Rapid Convergence of the Unadjusted Langevin\n\nAlgorithm: Isoperimetry Suf\ufb01ces\n\nSantosh S. Vempala\nCollege of Computing\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nvempala@gatech.edu\n\nAndre Wibisono\n\nCollege of Computing\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nwibisono@gatech.edu\n\nAbstract\n\nWe study the Unadjusted Langevin Algorithm (ULA) for sampling from a proba-\nbility distribution \u232b = ef on Rn. We prove a convergence guarantee in Kullback-\nLeibler (KL) divergence assuming \u232b satis\ufb01es log-Sobolev inequality and f has\nbounded Hessian. Notably, we do not assume convexity or bounds on higher deriva-\ntives. We also prove convergence guarantees in R\u00e9nyi divergence of order q > 1\nassuming the limit of ULA satis\ufb01es either log-Sobolev or Poincar\u00e9 inequality.\n\n1\n\nIntroduction\n\nSampling is a fundamental algorithmic task. Many applications require sampling from probability\ndistributions in high-dimensional spaces, and in modern applications the probability distributions\nare complicated and non-logconcave. 
While the setting of logconcave functions is well-studied, it is important to have efficient sampling algorithms with good convergence guarantees beyond the logconcavity assumption. There is a close interplay between sampling and optimization, either via optimization as a limit of sampling (annealing) [34, 55], or via sampling as optimization in the space of distributions [36, 62]. Motivated by the widespread use of non-convex optimization and sampling, there is resurgent interest in understanding non-logconcave sampling.\nIn this paper we study a simple algorithm, the Unadjusted Langevin Algorithm (ULA), for sampling from a target probability distribution ν = e^{-f} on R^n. ULA requires oracle access to the gradient ∇f of the negative log-density f = −log ν. In particular, ULA does not require knowledge of f itself, which makes it applicable in practice where we often only know ν up to a normalizing constant.\nAs the step size ε → 0, ULA recovers the Langevin dynamics, which is a continuous-time stochastic process in R^n that converges to ν. We recall the optimization interpretation of the Langevin dynamics for sampling as the gradient flow of the Kullback-Leibler (KL) divergence with respect to ν in the space of probability distributions with the Wasserstein metric [36]. When ν is strongly logconcave, the KL divergence is a strongly convex objective function, so the Langevin dynamics as gradient flow converges exponentially fast [6, 60]. From the classical theory of Markov chains and diffusion processes, there are several known conditions milder than logconcavity that are sufficient for rapid convergence in continuous time. These include isoperimetric inequalities such as the Poincaré inequality or the log-Sobolev inequality (LSI). 
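The ULA update just described is short enough to state in code. Below is a minimal, self-contained sketch for a one-dimensional Gaussian target (our own illustration, not code from the paper; the function name `ula` and the toy target are ours):

```python
import math
import random

def ula(grad_f, x0, step, n_iters, rng=random.Random(0)):
    """One ULA chain: x_{k+1} = x_k - step*grad_f(x_k) + sqrt(2*step)*z_k, z_k ~ N(0, I).
    The shared seeded rng (default argument) keeps repeated runs reproducible."""
    x = list(x0)
    for _ in range(n_iters):
        g = grad_f(x)
        x = [xi - step * gi + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x

# Toy target nu = N(0, (1/alpha) I) in one dimension: f(x) = alpha*x^2/2, grad f(x) = alpha*x.
alpha, eps = 1.0, 0.1
samples = [ula(lambda x: [alpha * x[0]], [3.0], eps, 400)[0] for _ in range(2000)]
```

With α = 1 and ε = 0.1 here, the empirical variance of the samples concentrates near 1/(α(1 − εα/2)) ≈ 1.05 rather than the target variance 1/α = 1, a first glimpse of the bias of the discretization discussed below.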
Along the Langevin dynamics in continuous time, the Poincaré inequality implies an exponential convergence rate in χ²-divergence, while LSI—which is stronger—implies an exponential convergence rate in KL divergence (as well as in Rényi divergence).\nHowever, in discrete time, sampling under the Poincaré inequality or LSI is a more challenging problem. ULA is an inexact discretization of the Langevin dynamics, and it converges to a biased limit ν_ε ≠ ν. When ν is strongly logconcave and smooth, it is known how to control the bias and prove a convergence guarantee on KL divergence along ULA [17, 21, 22, 24].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWhen ν is strongly logconcave, there are many sampling algorithms with provable rapid convergence; these include the ball walk and hit-and-run [37, 43, 44, 42] (which give truly polynomial algorithms), various discretizations of the overdamped or underdamped Langevin dynamics [21, 22, 24, 8, 26] (which have polynomial dependence on smoothness parameters but low dependence on dimension), and Hamiltonian Monte Carlo [47, 48, 25, 39, 16]. It is of great interest to extend these results to non-logconcave densities ν, where existing results require strong assumptions, with bounds that grow exponentially with the dimension or other parameters [2, 18, 45, 49]. There are recent works that analyze convergence of sampling using various techniques such as reflection coupling [28], kernel methods [29], and higher-order integrators [40], albeit still under some strong conditions such as distant dissipativity, which is similar to strong logconcavity outside a bounded domain.\nIn this paper we study the convergence along ULA under minimal (and necessary) isoperimetric assumptions, namely, LSI and the Poincaré inequality. 
These are sufficient for fast convergence in continuous time; moreover, in the case of logconcave distributions, the log-Sobolev and Poincaré constants can be bounded and lead to convergence guarantees for efficient sampling in discrete time. However, do they suffice on their own, without the assumption of logconcavity?\nWe note that LSI and the Poincaré inequality apply to a wider class of measures than logconcave distributions. In particular, LSI and the Poincaré inequality are preserved under bounded perturbation and Lipschitz mapping, whereas logconcavity is destroyed. Given these properties, it is easy to exhibit examples of non-logconcave distributions satisfying LSI or the Poincaré inequality. For example, we can take a small perturbation of a convex body so that it becomes nonconvex but still satisfies isoperimetry; then the uniform probability distribution (or a smooth version of it) on the body is not logconcave but satisfies LSI and the Poincaré inequality. Similarly, we can start with a strongly logconcave distribution and make bounded perturbations; then the resulting (normalized) probability distribution is not logconcave, but it satisfies LSI and the Poincaré inequality. See Figure 1 for an illustration.\n\nFigure 1: Illustrations of non-logconcave distributions satisfying isoperimetry: uniform distribution on a nonconvex set (left) and a perturbation of a logconcave distribution (right).\n\nWe measure the mode of convergence using KL divergence and Rényi divergence of order q ≥ 1, which is stronger. Our first main result says the only further assumption we need is smoothness. We say ν = e^{-f} is L-smooth if ∇f is L-Lipschitz. 
Here H_ν(ρ) is the KL divergence between ρ and ν. See Theorem 2 in Section 3.1 for more detail.\n\nTheorem 2. Assume ν = e^{-f} satisfies the log-Sobolev inequality with constant α > 0 and is L-smooth. ULA with step size 0 < ε ≤ α/(4L^2) satisfies\n\nH_ν(ρ_k) ≤ e^{-αεk} H_ν(ρ_0) + 8εnL^2/α.\n\nFor 0 < δ < 4n, ULA with step size ε ≤ αδ/(16L^2 n) reaches error H_ν(ρ_k) ≤ δ after k ≥ (1/(αε)) log(2H_ν(ρ_0)/δ) iterations.\n\nFor example, if we start with a Gaussian ρ_0 = N(x*, (1/L) I) where x* is a stationary point of f (which we can find, e.g., via gradient descent), then H_ν(ρ_0) = Õ(n) (see Lemma 1), and Theorem 2 gives an iteration complexity of k = Θ̃(L^2 n/(α^2 δ)) to achieve H_ν(ρ_k) ≤ δ using ULA with step size ε = Θ(αδ/(L^2 n)).\nThe result above matches previously known bounds for ULA when ν is strongly logconcave [17, 21, 22, 24]. Our result complements the recent work of Ma et al. [45], who study the underdamped version of the Langevin dynamics under LSI and show an iteration complexity for the discrete-time algorithm that has better dependence on the dimension (√(n/δ) in place of n/δ above for ULA), but under an additional smoothness assumption (f has bounded third derivatives) and with higher polynomial dependence on other parameters. Our result also complements the work of Mangoubi and Vishnoi [49], who study the Metropolis-adjusted version of ULA (MALA) for non-logconcave ν and show a log(1/δ) iteration complexity from a warm start, under the additional assumption that f has bounded third and fourth derivatives in an appropriate ∞-norm.\nWe note that in general some isoperimetry condition is needed for rapid mixing of Markov chains (such as the Langevin dynamics and ULA); otherwise there are bad regions in the state space from which the chains take arbitrarily long to escape. 
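The step-size and iteration-count prescriptions of Theorem 2 above can be packaged into a tiny planner; a sketch under the theorem's assumptions (the function name and interface are ours, and H_ν(ρ_0) is treated as a known input):

```python
import math

def ula_schedule(alpha, L, n, delta, H0):
    """Step size and iteration count sufficient for H_nu(rho_k) <= delta per Theorem 2.
    Requires 0 < delta < 4*n so that eps = alpha*delta/(16*L^2*n) <= alpha/(4*L^2)."""
    assert 0.0 < delta < 4.0 * n
    eps = alpha * delta / (16.0 * L * L * n)
    iters = math.ceil(math.log(2.0 * H0 / delta) / (alpha * eps))
    return eps, iters

# alpha = L = 1, n = 10, delta = 1, H0 = 10 (illustrative values)
eps, iters = ula_schedule(1.0, 1.0, 10, 1.0, 10.0)
```

With H_ν(ρ_0) = Õ(n) (as for the Gaussian start of Lemma 1), the returned count scales as Θ̃(L^2 n/(α^2 δ)), matching the discussion above.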
Smoothness or bounded Hessian is a common assumption needed for the analysis of discrete-time algorithms (such as gradient descent or ULA).\nIn the second part of this paper, we study the convergence of Rényi divergence of order q > 1 along ULA. Rényi divergence is a family of generalizations of KL divergence [56, 59, 11], which becomes stronger as the order q increases. There are physical and operational interpretations of Rényi divergence [31, 3]. Rényi divergence has been useful in many applications, including for the exponential mechanism in differential privacy [27, 1, 12, 52], lattice-based cryptography [4], information-theoretic encryption [35], variational inference [41], machine learning [32, 50], information theory and statistics [20, 53], and black hole physics [23].\nOur second result proves a convergence bound for the Rényi divergence of order q > 1. While Rényi divergence is a stronger measure of convergence than KL divergence, the situation is more complicated. First, we can hope to converge to the biased limit ν_ε only for finite q for any step size ε (as we illustrate with an example). Second, it is unclear how to bound the Rényi divergence between ν_ε and ν. We first show the following convergence guarantees of Rényi divergence along the continuous-time Langevin dynamics under LSI or the Poincaré inequality; see Theorem 3 and Theorem 5. Here R_{q,ν}(ρ) is the Rényi divergence of order q between ρ and ν.\n\nTheorem 3. Suppose ν satisfies LSI with constant α > 0. Let q ≥ 1. Along the Langevin dynamics,\n\nR_{q,ν}(ρ_t) ≤ e^{-2αt/q} R_{q,ν}(ρ_0).\n\nTheorem 5. Suppose ν satisfies the Poincaré inequality with constant α > 0. Let q ≥ 2. 
Along the Langevin dynamics,\n\nR_{q,ν}(ρ_t) ≤ R_{q,ν}(ρ_0) − 2αt/q if R_{q,ν}(ρ_0) ≥ 1, for as long as R_{q,ν}(ρ_t) ≥ 1; and R_{q,ν}(ρ_t) ≤ e^{-2αt/q} R_{q,ν}(ρ_0) if R_{q,ν}(ρ_0) ≤ 1.\n\nNotice that under the Poincaré inequality, compared to LSI, the convergence is slower in the beginning before it becomes exponential. For a reasonable starting distribution (such as a Gaussian centered at a stationary point), this leads to an extra factor of n compared to the convergence under LSI. We then turn to discrete time and show the convergence of Rényi divergence along ULA to the biased limit ν_ε under the assumption that ν_ε itself satisfies either LSI or the Poincaré inequality. We combine this with a decomposition result on Rényi divergence to derive a convergence guarantee for Rényi divergence to ν along ULA; see Theorem 4 and Theorem 6.\nIn what follows, we review KL divergence and its properties along the Langevin dynamics in Section 2, and prove a convergence guarantee for KL divergence along ULA under LSI in Section 3. We provide a review of Rényi divergence and its properties along the Langevin dynamics in Section 4. We then prove the convergence guarantee for Rényi divergence along ULA under LSI in Section 5, and under the Poincaré inequality in Section 6. We conclude with a discussion in Section 7.\n\n2 Review of KL divergence along Langevin dynamics\n\nIn this section we review the definition of Kullback-Leibler (KL) divergence, the log-Sobolev inequality, and the convergence of KL divergence along the Langevin dynamics in continuous time under the log-Sobolev inequality. See Appendix A.1 for a review of notation.\n\n2.1 KL divergence\n\nLet ρ, ν be probability distributions on R^n, represented via their probability density functions with respect to the Lebesgue measure on R^n. 
We assume ρ, ν have full support and smooth densities. Recall the Kullback-Leibler (KL) divergence of ρ with respect to ν is\n\nH_ν(ρ) = ∫_{R^n} ρ(x) log(ρ(x)/ν(x)) dx. (1)\n\nKL divergence is the relative form of Shannon entropy H(ρ) = −∫_{R^n} ρ(x) log ρ(x) dx. Whereas Shannon entropy can be positive or negative, KL divergence is nonnegative and minimized at ν: H_ν(ρ) ≥ 0 for all ρ, and H_ν(ρ) = 0 if and only if ρ = ν. Therefore, KL divergence serves as a measure of (albeit asymmetric) “distance” of a probability distribution ρ from a base distribution ν. KL divergence is a relatively strong measure of distance; for example, Pinsker’s inequality implies that KL divergence controls total variation distance. Furthermore, under the log-Sobolev (or Talagrand) inequality, KL divergence also controls the quadratic Wasserstein distance W_2, as we review below.\nWe say ν = e^{-f} is L-smooth if f has bounded Hessian: −LI ⪯ ∇^2 f(x) ⪯ LI for all x ∈ R^n. We provide the proof of Lemma 1 in Appendix B.1.1.\n\nLemma 1. Suppose ν = e^{-f} is L-smooth. Let ρ = N(x*, (1/L) I) where x* is a stationary point of f. Then H_ν(ρ) ≤ f(x*) + (n/2) log(L/(2π)).\n\n2.2 Log-Sobolev inequality\n\nRecall we say ν satisfies the log-Sobolev inequality (LSI) with a constant α > 0 if for all smooth functions g : R^n → 
R with E_ν[g^2] < ∞,\n\nE_ν[g^2 log g^2] − E_ν[g^2] log E_ν[g^2] ≤ (2/α) E_ν[‖∇g‖^2]. (2)\n\nRecall the relative Fisher information of ρ with respect to ν is\n\nJ_ν(ρ) = ∫_{R^n} ρ(x) ‖∇ log(ρ(x)/ν(x))‖^2 dx. (3)\n\nLSI is equivalent to the following relation between KL divergence and Fisher information for all ρ:\n\nH_ν(ρ) ≤ (1/(2α)) J_ν(ρ). (4)\n\nIndeed, to obtain (4) we choose g^2 = ρ/ν in (2); conversely, to obtain (2) we choose ρ = g^2 ν / E_ν[g^2] in (4).\nLSI is an isoperimetry condition and implies, among others, concentration of measure and the sub-Gaussian tail property [38]. LSI was first shown by Gross [30] for the case of Gaussian ν. It was extended by Bakry and Émery [6] to strongly log-concave ν; namely, when f = −log ν is α-strongly convex, then ν satisfies LSI with constant α. However, LSI applies more generally. For example, the classical perturbation result by Holley and Stroock [33] states that LSI is stable under bounded perturbation. Furthermore, LSI is preserved under a Lipschitz mapping. In one dimension, there is an exact characterization of when a probability distribution on R satisfies LSI [9]. Moreover, LSI satisfies a tensorization property [38]: if ν_1, ν_2 satisfy LSI with constants α_1, α_2 > 0, respectively, then ν_1 ⊗ ν_2 satisfies LSI with constant min{α_1, α_2} > 0. 
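As a concrete sanity check of the equivalent formulation (4), both sides can be computed by quadrature for explicit densities. A sketch for ν = N(0, 1) in one dimension (which satisfies LSI with α = 1) and a shifted ρ = N(m, 1), where the closed forms are H_ν(ρ) = m²/2 and J_ν(ρ) = m², so (4) holds with equality; the helper names are ours:

```python
import math

def npdf(x, m=0.0):
    """Density of N(m, 1)."""
    return math.exp(-(x - m) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def kl_and_fisher(m, lo=-12.0, hi=12.0, dx=1e-3):
    """Midpoint quadrature of H_nu(rho) and J_nu(rho) for rho = N(m, 1), nu = N(0, 1)."""
    H = J = 0.0
    for i in range(int(round((hi - lo) / dx))):
        x = lo + (i + 0.5) * dx
        r = npdf(x, m)
        logratio = math.log(npdf(x, m)) - math.log(npdf(x))
        # finite-difference derivative of x -> log(rho(x)/nu(x))
        d = ((math.log(npdf(x + dx, m)) - math.log(npdf(x + dx))) - logratio) / dx
        H += r * logratio * dx
        J += r * d * d * dx
    return H, J

H, J = kl_and_fisher(1.0)  # expect H ~ 0.5 and J ~ 1.0, so H = J/(2*alpha) with alpha = 1
```

For this translated pair the LSI relation (4) is tight, which reflects that Gaussians are the extremal case for Gross's inequality.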
Thus, there are many examples of non-logconcave distributions ν on R^n satisfying LSI (with a constant independent of dimension). There are also Lyapunov function criteria and exponential integrability conditions that can be used to verify when a probability distribution satisfies LSI; see for example [14, 15, 51, 61, 7].\n\n2.2.1 Talagrand inequality\n\nRecall the Wasserstein distance between ρ and ν is\n\nW_2(ρ, ν) = inf_Π E_Π[‖X − Y‖^2]^{1/2} (5)\n\nwhere the infimum is over joint distributions Π of (X, Y) with the correct marginals X ∼ ρ, Y ∼ ν. Recall we say ν satisfies Talagrand's inequality with a constant α > 0 if for all ρ:\n\n(α/2) W_2(ρ, ν)^2 ≤ H_ν(ρ). (6)\n\nTalagrand's inequality implies concentration of measure of Gaussian type. It was first studied by Talagrand [58] for Gaussian ν, and extended by Otto and Villani [54] to all ν satisfying LSI; namely, if ν satisfies LSI with constant α > 0, then ν also satisfies Talagrand's inequality with the same constant [54, Theorem 1]. 
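Talagrand's inequality (6) can likewise be checked against closed forms: for one-dimensional centered Gaussians, W_2(N(0, σ²), N(0, 1)) = |σ − 1| and H_ν(ρ) = (σ² − 1 − log σ²)/2. A small sketch (our own illustration, with α = 1 for the standard Gaussian target):

```python
import math

def w2_sq(sigma):
    """Squared W2 distance between N(0, sigma^2) and N(0, 1) (optimal coupling X = sigma*Z, Y = Z)."""
    return (sigma - 1.0) ** 2

def kl(sigma):
    """KL divergence H_nu(rho) for rho = N(0, sigma^2) against nu = N(0, 1)."""
    return 0.5 * (sigma ** 2 - 1.0 - math.log(sigma ** 2))

alpha = 1.0  # N(0, 1) satisfies LSI, hence Talagrand's inequality, with constant 1
for sigma in (0.25, 0.5, 0.9, 1.1, 2.0, 4.0):
    assert alpha / 2.0 * w2_sq(sigma) <= kl(sigma) + 1e-12
```

The inequality is strict for σ ≠ 1 here and tight only in the limit σ → 1, consistent with the Poincaré inequality arising as a linearization of Talagrand's inequality (Section 6).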
Therefore, under LSI, KL divergence controls the Wasserstein distance. Moreover, when ν is log-concave, LSI and Talagrand's inequality are equivalent [54, Corollary 3.1]. We recall in Appendix A.2 the geometric interpretation of LSI and Talagrand's inequality from [54].\n\n2.3 Langevin dynamics\n\nThe Langevin dynamics for target distribution ν = e^{-f} is a continuous-time stochastic process (X_t)_{t≥0} in R^n that evolves following the stochastic differential equation\n\ndX_t = −∇f(X_t) dt + √2 dW_t (7)\n\nwhere (W_t)_{t≥0} is the standard Brownian motion in R^n with W_0 = 0.\nIf (X_t)_{t≥0} evolves following the Langevin dynamics (7), then its probability density functions (ρ_t)_{t≥0} evolve following the Fokker-Planck equation\n\n∂ρ_t/∂t = ∇·(ρ_t ∇f) + Δρ_t = ∇·(ρ_t ∇ log(ρ_t/ν)). (8)\n\nHere ∇· is the divergence and Δ is the Laplacian operator. We provide a derivation in Appendix A.3. From (8), if ρ_t = ν, then ∂ρ_t/∂t = 0, so ν is the stationary distribution for the Langevin dynamics (7). Moreover, the Langevin dynamics brings any distribution X_t ∼ ρ_t closer to the target distribution ν, as the following lemma shows.\n\nLemma 2. Along the Langevin dynamics (7) (or equivalently, the Fokker-Planck equation (8)),\n\nd/dt H_ν(ρ_t) = −J_ν(ρ_t). (9)\n\nWe provide the proof of Lemma 2 in Appendix B.1.2. Since J_ν(ρ) ≥ 0, the identity (9) shows KL divergence is decreasing along the Langevin dynamics, so indeed the distribution ρ_t converges to ν.\n\n2.3.1 Exponential convergence of KL divergence along Langevin dynamics under LSI\n\nWhen ν satisfies LSI, KL divergence converges exponentially fast along the Langevin dynamics.\n\nTheorem 1. Suppose ν satisfies LSI with constant α > 0. 
Along the Langevin dynamics (7),\n\nH_ν(ρ_t) ≤ e^{-2αt} H_ν(ρ_0). (10)\n\nFurthermore, W_2(ρ_t, ν) ≤ √(2H_ν(ρ_0)/α) e^{-αt}.\n\nWe provide the proof of Theorem 1 in Appendix B.1.3. We also recall the optimization interpretation of the Langevin dynamics as the gradient flow of KL divergence in the space of distributions with the Wasserstein metric [36, 60, 54]. Then the exponential convergence rate in Theorem 1 is a manifestation of the general fact that gradient flow converges exponentially fast under a gradient domination condition. This provides a justification for using the Langevin dynamics for sampling from ν, as a natural steepest descent flow that minimizes the KL divergence H_ν.\n\n3 Unadjusted Langevin Algorithm\n\nSuppose we wish to sample from a smooth target probability distribution ν = e^{-f} on R^n. The Unadjusted Langevin Algorithm (ULA) with step size ε > 0 is the discrete-time algorithm\n\nx_{k+1} = x_k − ε∇f(x_k) + √(2ε) z_k (11)\n\nwhere z_k ∼ N(0, I) is an independent standard Gaussian random variable in R^n. Let ρ_k denote the probability distribution of x_k as it evolves following ULA.\nAs ε → 0, ULA recovers the Langevin dynamics (7) in continuous time. However, for fixed ε > 0, ULA converges to a biased limiting distribution ν_ε ≠ ν. Therefore, the KL divergence H_ν(ρ_k) does not tend to 0 along ULA, as it has an asymptotic bias H_ν(ν_ε) > 0.\n\nExample 1. Let ν = N(0, (1/α) I). The ULA iteration is x_{k+1} = (1 − εα)x_k + √(2ε) z_k. For 0 < ε < 2/α, the limit is ν_ε = N(0, [α(1 − εα/2)]^{-1} I) and the bias is H_ν(ν_ε) = (n/2) [ εα/(2(1 − εα/2)) + log(1 − εα/2) ]. 
In particular, H_ν(ν_ε) ≤ nε^2 α^2 / (16(1 − εα/2)^2) = O(ε^2).\n\n3.1 Convergence of KL divergence along ULA under LSI\n\nWhen ν satisfies LSI and a smoothness condition, we can prove a convergence guarantee in KL divergence along ULA. Recall we say ν = e^{-f} is L-smooth if −LI ⪯ ∇^2 f(x) ⪯ LI for all x ∈ R^n.\nA key part of our analysis is the following lemma, which bounds the decrease in KL divergence along one iteration of ULA. Here x_{k+1} ∼ ρ_{k+1} is the output of one step of ULA (11) from x_k ∼ ρ_k.\n\nLemma 3. Suppose ν satisfies LSI with constant α > 0 and is L-smooth. If 0 < ε ≤ α/(4L^2), then along each step of ULA (11),\n\nH_ν(ρ_{k+1}) ≤ e^{-αε} H_ν(ρ_k) + 6ε^2 nL^2. (12)\n\nWe provide the proof of Lemma 3 in Appendix B.2.1. The proof of Lemma 3 compares the evolution of KL divergence along one step of ULA with the evolution along the Langevin dynamics in continuous time (which converges exponentially fast under LSI), and bounds the discretization error; see Figure 2 for an illustration. This comparison technique has been used in many papers. Our proof structure is similar to that of Cheng and Bartlett [17], whose analysis needs ν to be strongly logconcave.\nWith Lemma 3, we can prove our main result on the convergence rate of ULA under LSI. We provide the proof of Theorem 2 in Appendix B.2.2.\n\nTheorem 2. 
Assume ν = e^{-f} satisfies the log-Sobolev inequality with constant α > 0 and is L-smooth. ULA with step size 0 < ε ≤ α/(4L^2) satisfies\n\nH_ν(ρ_k) ≤ e^{-αεk} H_ν(ρ_0) + 8εnL^2/α.\n\nFor 0 < δ < 4n, ULA with step size ε ≤ αδ/(16L^2 n) reaches error H_ν(ρ_k) ≤ δ after k ≥ (1/(αε)) log(2H_ν(ρ_0)/δ) iterations.\n\nIn particular, suppose δ < 4n and we choose the largest permissible step size ε = Θ(αδ/(L^2 n)). Suppose we start with a Gaussian ρ_0 = N(x*, (1/L) I), where x* is a stationary point of f (which we can find, e.g., via gradient descent), so that H_ν(ρ_0) ≤ f(x*) + (n/2) log(L/(2π)) = Õ(n) by Lemma 1. Theorem 2 states that to achieve H_ν(ρ_k) ≤ δ, ULA has iteration complexity k = Θ̃(L^2 n/(α^2 δ)). Since LSI implies Talagrand's inequality, Theorem 2 also yields a convergence guarantee in Wasserstein distance. As k → ∞, Theorem 2 implies the following bound on the bias of ULA under LSI. However, we note the bound O(ε) may be loose, since from Example 1 we see H_ν(ν_ε) = Θ(ε^2) in the Gaussian case.\n\nCorollary 1. Suppose ν satisfies LSI with constant α > 0 and is L-smooth. For 0 < ε ≤ α/(4L^2), the biased limit ν_ε of ULA with step size ε satisfies H_ν(ν_ε) ≤ 8nL^2 ε/α and W_2(ν, ν_ε)^2 ≤ 16nL^2 ε/α^2.\n\n4 Review of Rényi divergence along Langevin dynamics\n\n4.1 Rényi divergence\n\nRényi divergence [56] is a family of generalizations of KL divergence. 
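For centered Gaussians, the order-q Rényi divergence defined below has a closed form that a direct quadrature reproduces. The following sketch uses the convention R_{q,ν}(ρ) = (1/(q−1)) log E_ν[(ρ/ν)^q]; the closed-form expression is our own derivation for this special one-dimensional case (compare Example 2 below), not a general formula from the paper:

```python
import math

def renyi_gauss(q, s2, t2):
    """Closed-form R_q(N(0, s2) || N(0, t2)) in one dimension; infinite when q - (q-1)*s2/t2 <= 0."""
    c = q - (q - 1.0) * s2 / t2
    if c <= 0.0:
        return math.inf
    return 0.5 * math.log(t2 / s2) - 0.5 / (q - 1.0) * math.log(c)

def renyi_quad(q, s2, t2, lo=-30.0, hi=30.0, dx=1e-3):
    """R_q via midpoint quadrature of F_q = integral of rho(x)^q / nu(x)^(q-1) dx."""
    F = 0.0
    for i in range(int(round((hi - lo) / dx))):
        x = lo + (i + 0.5) * dx
        log_rho = -x * x / (2.0 * s2) - 0.5 * math.log(2.0 * math.pi * s2)
        log_nu = -x * x / (2.0 * t2) - 0.5 * math.log(2.0 * math.pi * t2)
        F += math.exp(q * log_rho - (q - 1.0) * log_nu) * dx
    return math.log(F) / (q - 1.0)
```

The quadrature and the closed form agree to high precision when the divergence is finite, and the finiteness threshold illustrates how R_q can jump to ∞ once q is large enough.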
See [59, 11] for properties of Rényi divergence.\nFor q > 0, q ≠ 1, the Rényi divergence of order q of a probability distribution ρ with respect to ν is\n\nR_{q,ν}(ρ) = (1/(q − 1)) log F_{q,ν}(ρ) (13)\n\nwhere\n\nF_{q,ν}(ρ) = E_ν[(ρ/ν)^q] = ∫_{R^n} ρ(x)^q / ν(x)^{q−1} dx. (14)\n\nRényi divergence is the relative form of Rényi entropy [56]: H_q(ρ) = (1/(1 − q)) log ∫ ρ(x)^q dx. The case q = 1 is defined via a limit, and recovers the KL divergence (1): R_{1,ν}(ρ) = lim_{q→1} R_{q,ν}(ρ) = H_ν(ρ). Rényi divergence has the property that R_{q,ν}(ρ) ≥ 0 for all ρ, and R_{q,ν}(ρ) = 0 if and only if ρ = ν. Furthermore, the map q ↦ R_{q,ν}(ρ) is increasing (see Section B.3.1). Therefore, Rényi divergence provides an alternative measure of “distance” of ρ from ν, which becomes stronger as q increases. In particular, R_{∞,ν}(ρ) = log ‖ρ/ν‖_∞ = log sup_x ρ(x)/ν(x) is finite if and only if ρ is warm relative to ν. It is possible that R_{q,ν}(ρ) = ∞ for large enough q, as the following example shows.\n\nExample 2. Let ρ = N(0, σ^2 I) and ν = N(0, τ^2 I). If σ^2 > τ^2 and q ≥ σ^2/(σ^2 − τ^2), then R_{q,ν}(ρ) = ∞. Otherwise, R_{q,ν}(ρ) = (n/2) log(τ^2/σ^2) − (n/(2(q − 1))) log(q − (q − 1)σ^2/τ^2).\n\nThe following is analogous to Lemma 1. We provide the proof of Lemma 4 in Appendix B.3.2.\n\nLemma 4. Suppose ν = e^{-f} is L-smooth. 
Let ρ = N(x*, (1/L) I) where x* is a stationary point of f. Then for all q ≥ 1, R_{q,ν}(ρ) ≤ f(x*) + (n/2) log(L/(2π)).\n\n4.1.1 Log-Sobolev inequality\n\nFor q > 0, we define the Rényi information of order q of ρ with respect to ν as\n\nG_{q,ν}(ρ) = E_ν[(ρ/ν)^q ‖∇ log(ρ/ν)‖^2] = E_ν[(ρ/ν)^{q−2} ‖∇(ρ/ν)‖^2] = (4/q^2) E_ν[‖∇(ρ/ν)^{q/2}‖^2]. (15)\n\nThe case q = 1 recovers the relative Fisher information (3): G_{1,ν}(ρ) = E_ν[(ρ/ν) ‖∇ log(ρ/ν)‖^2] = J_ν(ρ).\nWe have the following relation under the log-Sobolev inequality. Note the case q = 1 recovers LSI (4). We provide the proof of Lemma 5 in Appendix B.3.3.\n\nLemma 5. Suppose ν satisfies LSI with constant α > 0. Let q ≥ 1. For all ρ,\n\nG_{q,ν}(ρ)/F_{q,ν}(ρ) ≥ (2α/q^2) R_{q,ν}(ρ). (16)\n\n4.2 Langevin dynamics\n\nAlong the Langevin dynamics (7) for ν, we can compute the rate of change of the Rényi divergence.\n\nLemma 6. For all q > 0, along the Langevin dynamics (7),\n\nd/dt R_{q,ν}(ρ_t) = −q G_{q,ν}(ρ_t)/F_{q,ν}(ρ_t). (17)\n\nWe provide the proof of Lemma 6 in Appendix B.3.4. In particular, d/dt R_{q,ν}(ρ_t) ≤ 0, so Rényi divergence is always decreasing along the Langevin dynamics. 
Furthermore, analogous to how the Langevin dynamics is the gradient flow of KL divergence under the Wasserstein metric, one can also show that the Langevin dynamics is the gradient flow of Rényi divergence with respect to a suitably defined metric (which depends on ν) on the space of distributions; see [13].\n\n4.2.1 Convergence of Rényi divergence along Langevin dynamics under LSI\n\nWhen ν satisfies LSI, Rényi divergence converges exponentially fast along the Langevin dynamics. Note the case q = 1 recovers the exponential convergence rate of KL divergence from Theorem 1.\n\nTheorem 3. Suppose ν satisfies LSI with constant α > 0. Let q ≥ 1. Along the Langevin dynamics,\n\nR_{q,ν}(ρ_t) ≤ e^{-2αt/q} R_{q,ν}(ρ_0).\n\nWe provide the proof of Theorem 3 in Appendix B.3.5. Theorem 3 shows that if the initial Rényi divergence is finite, then it converges exponentially fast. However, even if initially the Rényi divergence is ∞, it will become finite along the Langevin dynamics, after which time Theorem 3 applies. This is because when ν satisfies LSI, the Langevin dynamics satisfies a hypercontractivity property [30, 10, 60]; see Section B.3.6. Furthermore, as shown in [13], we can combine the exponential convergence rate above with the hypercontractivity property to improve the exponential rate to 2α, independent of q, at the cost of some initial waiting time; here we leave the rate as above for simplicity.\n\n5 Rényi divergence along ULA\n\nIn this section we prove a convergence guarantee for Rényi divergence along ULA under the assumption that the biased limit satisfies LSI. As before, let ν = e^{-f}, and let ν_ε denote the biased limit of ULA (11) with step size ε > 0. We note that the bias R_{q,ν}(ν_ε) may be ∞ for large enough q.\n\nExample 3. 
As in Examples 1 and 2, let ν = N(0, (1/α) I), so ν_ε = N(0, [α(1 − εα/2)]^{-1} I). The bias is\n\nR_{q,ν}(ν_ε) = (n/(2(q − 1))) [ q log(1 − εα/2) − log(1 − qεα/2) ] if 1 < q < 2/(εα), and R_{q,ν}(ν_ε) = ∞ if q ≥ 2/(εα).\n\nThus, for each fixed q > 1, there is an asymptotic bias R_{q,ν}(ν_ε) which is finite for small enough ε. In Example 3 we have R_{q,ν}(ν_ε) = O(ε^2). In general, we assume for each q > 1 there is a growth function g_q(ε) that controls the bias: R_{q,ν}(ν_ε) ≤ g_q(ε) for small ε > 0, and lim_{ε→0} g_q(ε) = 0.\n\n5.1 Decomposition of Rényi divergence\n\nFor order q > 1, we have the following decomposition of Rényi divergence.\n\nLemma 7. Let q > 1. For all probability distributions ρ,\n\nR_{q,ν}(ρ) ≤ ((q − 1/2)/(q − 1)) R_{2q,ν_ε}(ρ) + R_{2q−1,ν}(ν_ε). (18)\n\nWe provide the proof of Lemma 7 in Appendix B.4.1. The first term in the bound above is the Rényi divergence with respect to the biased limit, which converges exponentially fast under LSI (see Lemma 8). The second term in (18) is the bias, which is controlled by the growth function g_{2q−1}(ε).\n\n5.2 Rapid convergence of Rényi divergence with respect to ν_ε along ULA\n\nWe show the Rényi divergence with respect to the biased limit ν_ε converges exponentially fast along ULA, assuming ν_ε itself satisfies LSI.\n\nAssumption 1. The probability distribution ν_ε satisfies LSI with a constant γ_ε > 0.\n\nWe can verify Assumption 1 in the Gaussian case. However, it is unclear how to verify Assumption 1 in general. One might hope to prove that if ν satisfies LSI, then Assumption 1 holds.\n\nExample 4. 
Let ν = N(0, (1/α) I), so ν_ε = N(0, [α(1 − εα/2)]^{-1} I) satisfies LSI with constant γ_ε = α(1 − εα/2).\n\nUnder Assumption 1, we can prove an exponential convergence rate to the biased limit ν_ε.\n\nLemma 8. Assume Assumption 1. Suppose ν = e^{-f} is L-smooth, and let 0 < ε ≤ min{1/(3L), 1/(9γ_ε)}. For q ≥ 1, along ULA (11),\n\nR_{q,ν_ε}(ρ_k) ≤ e^{-γ_ε εk/q} R_{q,ν_ε}(ρ_0). (19)\n\nWe provide the proof of Lemma 8 in Appendix B.4.2. In the proof of Lemma 8, we decompose each step of ULA as a sequence of two operations; see Figure 3 for an illustration. In the first part we take a gradient step, which is a deterministic bijective map, so it preserves Rényi divergence. In the second part we add an independent Gaussian, which is evolution along the heat flow, and we derive a formula on the decrease in Rényi divergence (which is similar to (17) along the Langevin dynamics).\n\n5.3 Convergence of Rényi divergence along ULA under LSI\n\nWe combine Lemma 7 and Lemma 8 to obtain the following characterization of the convergence of Rényi divergence along ULA under LSI. We provide the proof of Theorem 4 in Appendix B.4.3.\n\nTheorem 4. Assume Assumption 1. Suppose ν = e^{-f} is L-smooth, and let 0 < ε ≤ min{1/(3L), 1/(9γ_ε)}. Let q > 1, and suppose R_{2q,ν_ε}(ρ_0) < ∞. Then along ULA (11),\n\nR_{q,ν}(ρ_k) ≤ ((q − 1/2)/(q − 1)) R_{2q,ν_ε}(ρ_0) e^{-γ_ε εk/(2q)} + g_{2q−1}(ε). (20)\n\nFor δ > 0, let g_q^{-1}(δ) = sup{ε > 0 : g_q(ε) ≤ δ}. Theorem 4 states that to achieve R_{q,ν}(ρ_k) ≤ δ, it suffices to run ULA with step size ε = Θ(min{1/L, g_{2q−1}^{-1}(δ/2)}) for k = O((q/(γ_ε ε)) log(R_{2q,ν_ε}(ρ_0)/δ)) iterations. Suppose δ is small, so that g_{2q−1}^{-1}(δ/2) < 1/L. Note ν_ε is 1/
Note \u232b\u270f is 1\n\n2q + g2q1(\u270f).\n\nFor > 0, let g1\n\n9o.\n\nL , g1\n\n(20)\n\n2\n\n\n\n8\n\n\f1\n\ng1\n\nthen g1\nin Example 3, then g1\n\nto be a Gaussian with covariance 2\u270fI, we have R2q,\u232b\u270f(\u21e20) = \u02dcO(n) by Lemma 4. Therefore,\n\nTheorem 4 yields an iteration complexity of k = \u02dcO\n\nq () =\u2326( ), so the iteration complexity is k = \u02dcO 1\n\nq () =\u2326( p), so the iteration complexity is k = \u02dcO 1\n\n2q1(/2). For example, if gq(\u270f) = O(\u270f),\n with \u270f =\u21e5( ). If gq(\u270f) = O(\u270f2), as\np with \u270f =\u21e5( p).\n6 Poincar\u00e9 inequality\nWe recall \u232b satis\ufb01es Poincar\u00e9 inequality (PI) with a constant \u21b5> 0 if for all smooth g : Rn ! R,\n(21)\nwhere Var\u232b(g) = E\u232b[g2]E\u232b[g]2 is the variance of g under \u232b. Poincar\u00e9 inequality is an isoperimetry\ncondition which is weaker than LSI. LSI implies PI with the same constant; in fact, PI is a linearization\nof LSI (4), i.e., when \u21e2 = (1+\u2318g)\u232b as \u2318 ! 0 [57, 60]. Furthermore, it is known Talagrand\u2019s inequality\nimplies PI with the same constant, and PI is also a linearization of Talagrand\u2019s inequality [54].\nPoincar\u00e9 inequality is better behaved than LSI [15], and there are various Lyapunov criteria and\nintegrability conditions to verify when a distribution satis\ufb01es Poincar\u00e9 inequality [5, 51, 19].\n\nVar\u232b(g) \uf8ff 1\n\n\u21b5E\u232b[krgk2]\n\ne 2\u21b5t\n\nq\nq Rq,\u232b (\u21e20)\n\n6.1 Convergence of R\u00e9nyi divergence along Langevin dynamics under Poincar\u00e9 inequality\nWhen \u232b satis\ufb01es Poincar\u00e9 inequality, R\u00e9nyi divergence converges along the Langevin dynamics. The\nconvergence is initially linear, then becomes exponential once R\u00e9nyi divergence falls below 1.\nTheorem 5. Suppose \u232b satis\ufb01es Poincar\u00e9 inequality with constant \u21b5> 0. Let q 2. 
We provide the proof of Theorem 5 in Appendix B.5.2. Theorem 5 states that starting from $R_{q,\nu}(\rho_0) \ge 1$, the Langevin dynamics reaches $R_{q,\nu}(\rho_t) \le \delta$ in time $t \le O\big(\frac{q}{\alpha}\big(R_{q,\nu}(\rho_0) + \log\frac{1}{\delta}\big)\big)$.

6.2 Rapid convergence of Rényi divergence with respect to $\nu_\varepsilon$ along ULA

We assume the biased limit $\nu_\varepsilon$ satisfies Poincaré inequality.

Assumption 2. The distribution $\nu_\varepsilon$ satisfies Poincaré inequality with a constant $\eta_\varepsilon > 0$.

Under Assumption 2, we can show Rényi divergence with respect to $\nu_\varepsilon$ converges at a rate similar to that of the Langevin dynamics; see Lemma 18 in Appendix B.5.3.

6.3 Convergence of Rényi divergence along ULA under Poincaré inequality

We combine Lemma 7 and Lemma 18 to obtain the following convergence of Rényi divergence along ULA under Poincaré inequality. We provide the proof of Theorem 6 in Appendix B.5.4.

Theorem 6. Assume Assumption 2. Suppose $\nu = e^{-f}$ is $L$-smooth, and let $0 < \varepsilon \le \min\left\{\frac{1}{3L}, \frac{1}{9\eta_\varepsilon}\right\}$. Let $q > 1$, and assume $1 \le R_{2q,\nu_\varepsilon}(\rho_0) < \infty$. Along ULA (11), for $k \ge k_0 := \frac{2q}{\eta_\varepsilon \varepsilon}\big(R_{2q,\nu_\varepsilon}(\rho_0) - 1\big)$,
\[
R_{q,\nu}(\rho_k) \le \left( \frac{q - \frac{1}{2}}{q - 1} \right) e^{-\eta_\varepsilon \varepsilon (k - k_0) / (2q)} + g_{2q-1}(\varepsilon). \tag{22}
\]
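To see where the extra factor of $n$ relative to the LSI case arises, compare how the initial divergence $R_{2q,\nu_\varepsilon}(\rho_0) = \tilde O(n)$ enters the two bounds: logarithmically through the exponential decay under LSI, but linearly through the burn-in $k_0$ under PI. A toy comparison (the stand-in value $R_{2q,\nu_\varepsilon}(\rho_0) = n$ and the dropped constants are ours, not the paper's):

```python
import math

def k_lsi(n, q, eta, eps, delta):
    # LSI-style bound: the initial divergence (~ n) enters through a logarithm.
    return 2.0 * q / (eta * eps) * math.log(n / delta)

def k_pi(n, q, eta, eps, delta):
    # PI-style bound: burn-in k0 = (2q/(eta*eps)) * (R_{2q}(rho_0) - 1) is linear in n.
    k0 = 2.0 * q / (eta * eps) * (n - 1.0)
    return k0 + 2.0 * q / (eta * eps) * math.log(1.0 / delta)

n, q, eta, eps, delta = 1000, 2.0, 1.0, 0.01, 1e-3
print(k_pi(n, q, eta, eps, delta) / k_lsi(n, q, eta, eps, delta))  # roughly n / log(n/delta)
```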
This yields an iteration complexity for ULA under Poincaré inequality which is a factor of $n$ larger than the complexity under LSI; see Appendix B.5.5.

7 Discussion

In this paper we proved convergence guarantees on KL and Rényi divergence along ULA under isoperimetry and bounded Hessian, without assuming convexity or bounds on higher derivatives. It would be interesting to verify when Assumptions 1 and 2 hold, or whether they follow from isoperimetry and bounded Hessian of the target density. Another intriguing question is whether there is an affine-invariant version of the Langevin dynamics; this might lead to a sampling algorithm with logarithmic dependence on smoothness parameters, rather than the current polynomial dependence.

Acknowledgment

The first author was supported in part by NSF awards CCF-1563838 and CCF-1717349. The authors would like to thank Kunal Talwar for explaining the application of Rényi divergence to data privacy. The authors thank Yu Cao, Jianfeng Lu, and Yulong Lu for alerting us to their work [13] on Rényi divergence. The authors also thank Xiang Cheng and Peter Bartlett for helpful comments on an earlier version of this paper.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.

[2] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the Twenty-third Annual ACM Symposium on Theory of Computing, STOC '91, pages 156–163, New York, NY, USA, 1991. ACM.

[3] John C Baez. Rényi entropy and free energy.
arXiv preprint arXiv:1102.2098, 2011.

[4] Shi Bai, Tancrède Lepoint, Adeline Roux-Langlois, Amin Sakzad, Damien Stehlé, and Ron Steinfeld. Improved security proofs in lattice-based cryptography: Using the Rényi divergence rather than the statistical distance. Journal of Cryptology, 31(2):610–640, 2018.

[5] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin. A simple proof of the Poincaré inequality for a large class of probability measures. Electronic Communications in Probability, 13:60–66, 2008.

[6] Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages 177–206. Springer, 1985.

[7] Jean-Baptiste Bardet, Nathaël Gozlan, Florent Malrieu, and Pierre-André Zitt. Functional inequalities for Gaussian convolutions of compactly supported measures: Explicit bounds and dimension dependence. Bernoulli, 24(1):333–353, 2018.

[8] Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1777–1798, 2018.

[9] Sergej G Bobkov and Friedrich Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, 1999.

[10] Sergey G Bobkov, Ivan Gentil, and Michel Ledoux. Hypercontractivity of Hamilton–Jacobi equations. Journal de Mathématiques Pures et Appliquées, 80(7):669–696, 2001.

[11] SG Bobkov, GP Chistyakov, and Friedrich Götze. Rényi divergence and the central limit theorem. The Annals of Probability, 47(1):270–323, 2019.

[12] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.

[13] Yu Cao, Jianfeng Lu, and Yulong Lu.
Exponential decay of Rényi divergence under Fokker–Planck equations. Journal of Statistical Physics, pages 1–13, 2018.

[14] Djalil Chafaï. Entropies, convexity, and functional inequalities: On Φ-entropies and Φ-Sobolev inequalities. Journal of Mathematics of Kyoto University, 44(2):325–363, 2004.

[15] Djalil Chafaï and Florent Malrieu. On fine properties of mixtures with respect to concentration of measure and Sobolev type inequalities. In Annales de l'IHP Probabilités et statistiques, volume 46, pages 72–96, 2010.

[16] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. arXiv preprint arXiv:1905.02313, 2019.

[17] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors, Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 186–211. PMLR, 07–09 Apr 2018.

[18] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.

[19] Thomas A Courtade. Bounds on the Poincaré constant for convolution measures. arXiv preprint arXiv:1807.00027, 2018.

[20] Imre Csiszár. Generalized cutoff rates and Rényi's information measures. IEEE Transactions on Information Theory, 41(1):26–34, 1995.

[21] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689. PMLR, 07–10 Jul 2017.

[22] Arnak S Dalalyan and Avetik Karagulyan.
User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.

[23] Xi Dong. The gravity dual of Rényi entropy. Nature Communications, 7:12472, 2016.

[24] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018.

[25] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.

[26] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 793–797, 2018.

[27] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.

[28] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.

[29] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1292–1301, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[30] Leonard Gross. Logarithmic Sobolev inequalities. American Journal of Mathematics, 97(4):1061–1083, 1975.

[31] Peter Harremoës. Interpretations of Rényi entropies and divergences. Physica A: Statistical Mechanics and its Applications, 365(1):57–62, 2006.

[32] Yun He, A Ben Hamza, and Hamid Krim. A generalized divergence measure for robust image registration. IEEE Transactions on Signal Processing, 51(5):1211–1220, 2003.

[33] Richard Holley and Daniel Stroock.
Logarithmic Sobolev inequalities and stochastic Ising models. Journal of Statistical Physics, 46(5):1159–1194, 1987.

[34] Richard Holley and Daniel Stroock. Simulated annealing via Sobolev inequalities. Communications in Mathematical Physics, 115(4):553–569, 1988.

[35] Mitsugu Iwamoto and Junji Shikata. Information theoretic security for encryption based on conditional Rényi entropies. In International Conference on Information Theoretic Security, pages 103–121. Springer, 2013.

[36] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, January 1998.

[37] R. Kannan, L. Lovász, and M. Simonovits. Random walks and an $O^*(n^5)$ volume algorithm for convex bodies. Random Structures and Algorithms, 11:1–50, 1997.

[38] Michel Ledoux. Concentration of measure and logarithmic Sobolev inequalities. Séminaire de probabilités de Strasbourg, 33:120–216, 1999.

[39] Yin Tat Lee and Santosh S Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1115–1121. ACM, 2018.

[40] Xuechen Li, Denny Wu, Lester Mackey, and Murat A Erdogdu. Stochastic Runge–Kutta accelerates Langevin Monte Carlo and beyond. arXiv preprint arXiv:1906.07868, 2019.

[41] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1073–1081. Curran Associates, Inc., 2016.

[42] L. Lovász and S. Vempala. Fast algorithms for logconcave functions: sampling, rounding, integration and optimization. In FOCS, pages 57–68, 2006.

[43] L. Lovász and S. Vempala.
The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms, 30(3):307–358, 2007.

[44] László Lovász and Santosh S. Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35(4):985–1005, 2006.

[45] Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and Michael I Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.

[46] Michael C. Mackey. Time's Arrow: The Origins of Thermodynamic Behavior. Springer-Verlag, 1992.

[47] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.

[48] Oren Mangoubi and Nisheeth Vishnoi. Dimensionally tight bounds for second-order Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems 31, pages 6027–6037. Curran Associates, Inc., 2018.

[49] Oren Mangoubi and Nisheeth K Vishnoi. Nonconvex sampling with the Metropolis-adjusted Langevin algorithm. arXiv preprint arXiv:1902.08452, 2019.

[50] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009.

[51] Georg Menz and André Schlichting. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. The Annals of Probability, 42(5):1809–1884, 2014.

[52] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

[53] D Morales, L Pardo, and I Vajda. Rényi statistics in directed families of exponential experiments. Statistics: A Journal of Theoretical and Applied Statistics, 34(2):151–174, 2000.

[54] Felix Otto and Cédric Villani.
Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.

[55] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.

[56] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

[57] OS Rothaus. Diffusion on compact Riemannian manifolds and logarithmic Sobolev inequalities. Journal of Functional Analysis, 42(1):102–109, 1981.

[58] M Talagrand. Transportation cost for Gaussian and other product measures. Geometric and Functional Analysis, 6:587–600, 1996.

[59] Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

[60] Cédric Villani. Topics in optimal transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Society, 2003.

[61] Feng-Yu Wang and Jian Wang. Functional inequalities for convolution of probability measures. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 52, pages 898–914. Institut Henri Poincaré, 2016.

[62] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem.
In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 2093–3027, 2018.