{"title": "Global Non-convex Optimization with Discretized Diffusions", "book": "Advances in Neural Information Processing Systems", "page_first": 9671, "page_last": 9680, "abstract": "An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems.  We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.", "full_text": "Global Non-convex Optimization\n\nwith Discretized Diffusions\n\nMurat A. Erdogdu 1,2\n\nerdogdu@cs.toronto.edu\n\nLester Mackey 3\n\nlmackey@ microsoft.com\n\nOhad Shamir 4\n\nohad.shamir@weizmann.ac.il\n\n1University of Toronto 2Vector Institute 3Microsoft Research 4Weizmann Institute of Science\n\nAbstract\n\nAn Euler discretization of the Langevin diffusion is known to converge to the\nglobal minimizers of certain convex and non-convex optimization problems. We\nshow that this property holds for any suitably smooth diffusion and that different\ndiffusions are suitable for optimizing different classes of convex and non-convex\nfunctions. This allows us to design diffusions suitable for globally optimizing\nconvex and non-convex functions not covered by the existing Langevin theory. 
Our\nnon-asymptotic analysis delivers computable optimization and integration error\nbounds based on easily accessed properties of the objective and chosen diffusion.\nCentral to our approach are new explicit Stein factor bounds on the solutions\nof Poisson equations. We complement these results with improved optimization\nguarantees for targets other than the standard Gibbs measure.\n\n1 Introduction\n\nConsider the unconstrained and possibly non-convex optimization problem\n\n$$\\underset{x \\in \\mathbb{R}^d}{\\text{minimize}} \\; f(x).$$\n\nRecent studies have shown that the Langevin algorithm \u2013 in which an appropriately scaled isotropic\nGaussian vector is added to a gradient descent update \u2013 globally optimizes f whenever the objective is\ndissipative ($\\langle \\nabla f(x), x \\rangle \\ge \\alpha\\|x\\|_2^2 - \\beta$ for $\\alpha > 0$) with a Lipschitz gradient [14, 25, 29]. Remarkably,\nthese globally optimized objectives need not be convex and can even be multimodal. The intuition\nbehind the success of the Langevin algorithm is that the stochastic optimization method approximately\ntracks the continuous-time Langevin diffusion which admits the Gibbs measure \u2013 a distribution defined\nby $p_\\gamma(x) \\propto \\exp(-\\gamma f(x))$ \u2013 as its invariant distribution. Here, $\\gamma > 0$ is an inverse temperature\nparameter, and when $\\gamma$ is large, the Gibbs measure concentrates around its modes. As a result, for\nlarge values of $\\gamma$, a rapidly mixing Langevin algorithm will be close to a global minimum of f. In\nthis case, rapid mixing is ensured by the Lipschitz gradient and dissipativity. Due to its simplicity,\nefficiency, and well-understood theoretical properties, the Langevin algorithm and its derivatives have\nfound numerous applications in machine learning [see, e.g., 28, 6].\nIn this paper, we prove an analogous global optimization property for the Euler discretization of\nany smooth and dissipative diffusion and show that different diffusions are suitable for solving\ndifferent classes of convex and non-convex problems. 
Our non-asymptotic analysis, based on a\nmultidimensional version of Stein\u2019s method, establishes explicit bounds on both integration and\noptimization error. Our contributions can be summarized as follows:\n\n\u2022 For any function f, we provide explicit $O(1/\\epsilon^2)$ bounds on the numerical integration error\nof discretized dissipative diffusions. Our bounds depend only on simple properties of the\ndiffusion\u2019s coefficients and Stein factors, i.e., bounds on the derivatives of the associated\nPoisson equation solution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\u2022 For pseudo-Lipschitz f, we derive explicit first through fourth-order Stein factor bounds\nfor every fast-coupling diffusion with smooth coefficients. Since our bounds depend on\nWasserstein coupling rates, we provide user-friendly, broadly applicable tools for computing\nthese rates. The resulting computable integration error bounds recover the known Markov\nchain Monte Carlo convergence rates of the Langevin algorithm in both convex and non-convex\nsettings but apply more broadly.\n\u2022 We introduce new explicit bounds on the expected suboptimality of sampling from a diffusion.\nTogether with our integration error bounds, these yield computable and convergent bounds\non global optimization error. We demonstrate that improved optimization guarantees can be\nobtained by targeting distributions other than the standard Gibbs measure.\n\u2022 We show that different diffusions are appropriate for different objectives f and detail concrete\nexamples of global non-convex optimization enabled by our framework but not covered by the\nexisting Langevin theory. 
For example, while the Langevin diffusion is particularly appropriate\nfor dissipative and hence quadratic growth f [25, 29], we show alternative diffusions are\nappropriate for \u201cheavy-tailed\u201d f with subquadratic or sublinear growth.\n\nWe emphasize that, while past work has assumed the existence of \ufb01nite Stein factors [4, 29], focused\non deriving convergence rates with inexplicit constants [23, 26, 29], or concentrated singularly on\nthe Langevin diffusion [6, 9, 29, 25], the goals of this work are to provide the reader with tools to\n(a) check the appropriateness of a given diffusion for optimizing a given objective and (b) compute\nexplicit optimization and integration error bounds based on easily accessed properties of the objective\nand chosen diffusion. The rest of the paper is organized as follows. Section 1.1 surveys related\nwork. Section 2 provides an introduction to diffusions and their use in optimization and reviews our\nnotation. Section 3 provides explicit bounds on integration error in terms of Stein factors and on\nStein factors in terms of simple properties of f and the diffusion. In Section 4, we provide explicit\nbounds on optimization error by targeting Gibbs and non-Gibbs invariant measures and discuss how\nto obtain better optimization error using non-Gibbs invariant measures. We give concrete examples\nof applying these tools to non-convex optimization problems in Section 5 and conclude in Section 6.\n\n1.1 Related work\n\nThe Euler discretization of the Langevin diffusion is commonly termed the Langevin algorithm\nand has been studied extensively in the context of sampling from a log concave distribution. Non-\nasymptotic integration error bounds for the Langevin algorithm are studied in [8, 7, 9, 10]. 
A\nrepresentative bound follows from combining the ergodicity of the diffusion with a discretization\nerror analysis and yields $\\epsilon$ error in $O(\\frac{1}{\\epsilon^2}\\mathrm{poly}(\\log(\\frac{1}{\\epsilon})))$ steps for the strongly log concave case and\n$O(\\frac{1}{\\epsilon^4}\\mathrm{poly}(\\log(\\frac{1}{\\epsilon})))$ steps for the general log concave case [7, 9].\nOur work is motivated by a line of research that uses the Langevin algorithm to globally optimize\nnon-convex functions. Gelfand and Mitter [14] established the global convergence of an appropriate\nvariant of the algorithm, and Raginsky et al. [25] subsequently used optimal transport theory to prove\noptimization and integration error bounds. For example, [25] provides an integration error bound\nof $\\epsilon$ after $O(\\frac{1}{\\epsilon^4}\\mathrm{poly}(\\log(\\frac{1}{\\epsilon}))\\frac{1}{\\lambda_*})$ steps under the quadratic-growth assumptions of dissipativity and\na Lipschitz gradient; the estimate involves the inverse spectral gap parameter $\\frac{1}{\\lambda_*}$, a quantity that\nis often unknown and sometimes exponential in both inverse temperature and dimension. Gao et al.\n[13] obtained similar guarantees for stochastic Hamiltonian Monte Carlo algorithms for empirical\nand population risk minimization under a dissipativity assumption with rate estimates. In this work,\nwe accommodate \u201cheavy-tailed\u201d objectives that grow subquadratically and trade the often unknown\nand hence inexplicit spectral gap parameter of [25] for the more user-friendly distant dissipativity\ncondition (Prop. 3.4) which provides a straightforward and explicit certification of fast coupling\nand hence the fast mixing of a diffusion. For distantly dissipative diffusions, the size of our error\nbounds is driven primarily by a computable distance parameter; in the Langevin setting, an analogous\nquantity is studied in place of the spectral gap in the contemporaneous work of [5].\nCheng et al. 
[5] provide integration error bounds for sampling with the overdamped Langevin\nalgorithm under a distant strong convexity assumption (a special case of distant dissipativity). The\nauthors build on the results of [9, 11] and establish $\\epsilon$ error in $O(\\frac{1}{\\epsilon^2}\\log(\\frac{1}{\\epsilon}))$ steps. We consider general\ndistantly dissipative diffusions and establish an integration error of $\\epsilon$ in $O(\\frac{1}{\\epsilon^2})$ steps under mild\nassumptions on the objective function f and smoothness of the diffusion.\nVollmer et al. [26] used the solution of the Poisson equation in their analysis of stochastic Langevin\ngradient descent, invoking the bounds of Pardoux and Veretennikov [24, Thms. 1 and 2] to obtain\nStein factors. However, Thms. 1 and 2 of [24] yield only inexplicit constants and require bounded\ndiffusion coefficients, a strong assumption violated by the examples treated in Section 5. Chen\net al. [4] considered a broader range of diffusions but assumed, without verification, that Stein\nfactors and Markov chain moments were universally bounded by constants independent of all problem\nparameters. One of our principal contributions is a careful enumeration of the dependencies of\nthese Stein factors and Markov chain moments on the objective f and the candidate diffusion. Our\nconvergence analysis builds on the arguments of [23, 15], and our Stein factor bounds rely on distant\nand uniform dissipativity conditions for L1-Wasserstein rate decay [11, 15] and the smoothing effect\nof the Markov semigroup [3, 15]. Our Stein factor results significantly generalize the existing bounds\nof [15] by accommodating pseudo-Lipschitz objectives f and quadratic growth in the covariance\ncoefficient and deriving the first four Stein factors explicitly.\n\n2 Optimization with Discretized Diffusions: Preliminaries\n\nConsider a target objective function $f : \\mathbb{R}^d \\to \\mathbb{R}$. 
Our goal is to carry out unconstrained minimization\nof f with the aid of a candidate diffusion defined by the stochastic differential equation (SDE)\n\n$$dZ_t^z = b(Z_t^z)\\,dt + \\sigma(Z_t^z)\\,dB_t \\quad \\text{with} \\quad Z_0^z = z. \\tag{2.1}$$\n\nHere, $(B_t)_{t \\ge 0}$ is an $l$-dimensional Wiener process, and $b : \\mathbb{R}^d \\to \\mathbb{R}^d$ and $\\sigma : \\mathbb{R}^d \\to \\mathbb{R}^{d \\times l}$ represent\nthe drift and the diffusion coefficients, respectively. The diffusion $Z_t^z$ starts at a point $z \\in \\mathbb{R}^d$ and,\nunder the conditions of Section 3, admits a limiting invariant distribution $P$ with (Lebesgue) density\n$p$. To encourage sampling near the minima of f, we would like to choose $p$ so that the maximizers of\n$p$ correspond to minimizers of f. Fortunately, under mild conditions, one can construct a diffusion\nwith target invariant distribution $P$ (see, e.g., [20, 15, Thm. 2]), by selecting the drift coefficient\n\n$$b(x) = \\frac{1}{2p(x)}\\langle \\nabla, p(x)(a(x) + c(x)) \\rangle, \\tag{2.2}$$\n\nwhere $a(x) \\triangleq \\sigma(x)\\sigma(x)^\\top$ is the covariance coefficient, $c(x) = -c(x)^\\top \\in \\mathbb{R}^{d \\times d}$ is the skew-symmetric\nstream coefficient, and $\\langle \\nabla, m(x) \\rangle = \\sum_j e_j \\sum_k \\frac{\\partial m_{jk}(x)}{\\partial x_k}$ denotes the divergence operator\nwith $\\{e_j\\}_j$ as the standard basis of $\\mathbb{R}^d$. As an illustration, consider the (overdamped) Langevin\ndiffusion for the Gibbs measure with inverse temperature $\\gamma > 0$ and density\n\n$$p_\\gamma(x) \\propto \\exp(-\\gamma f(x)) \\tag{2.3}$$\n\nassociated with our objective f. Inserting $\\sigma(x) = \\sqrt{2/\\gamma}\\,I$ and $c(x) = 0$ into the formula (2.2) we\nobtain\n\n$$b_j(x) = \\frac{1}{2p_\\gamma(x)}\\langle \\nabla, p_\\gamma(x)(a(x) + c(x)) \\rangle_j = \\frac{1}{\\gamma p_\\gamma(x)}\\sum_k \\frac{\\partial (p_\\gamma(x) I_{jk})}{\\partial x_k} = \\frac{1}{\\gamma p_\\gamma(x)}\\frac{\\partial p_\\gamma(x)}{\\partial x_j} = -\\partial_j f(x),$$\n\nwhich reduces to $b = -\\nabla f$. We emphasize that the choice of the Gibbs measure is arbitrary, and we\nwill consider other measures that yield superior guarantees for certain minimization problems.\nIn practice, the diffusion (2.1) cannot be simulated in continuous time and is instead approximated by\na discrete-time numerical integrator. 
We will show that a particular discretization, the Euler method,\ncan be used as a global optimization algorithm for various families of convex and non-convex f.\nThe Euler method is the most commonly used discretization technique due to its explicit form and\nsimplicity; however, our analysis can be generalized to other numerical integrators as well. For\n$m = 0, 1, \\ldots$, the Euler discretization of the SDE (2.1) corresponds to the Markov chain updates\n\n$$X_{m+1} = X_m + \\eta\\,b(X_m) + \\sqrt{\\eta}\\,\\sigma(X_m)W_m,$$\n\nwhere $\\eta$ is the step size, and $W_m \\sim N_d(0, I)$ is an isotropic Gaussian vector that is independent\nof $X_m$. This update rule defines a Markov chain which typically has an invariant measure that\nis different from the invariant measure of the continuous time diffusion. However, when the step\nsize $\\eta$ is sufficiently small, the difference between the two invariant measures becomes small and can\nbe quantitatively characterized [see, e.g., 22]. Our optimization algorithm is simply to evaluate the\nfunction f at each Markov chain iterate $X_m$ and report the point with the smallest function value.\n\nDenoting by $p(f)$ the expectation of f under the density $p$ \u2013 i.e., $p(f) = \\mathbb{E}_{Z \\sim p}[f(Z)]$ \u2013 we decompose\nthe optimization error after $M$ steps of our Markov chain into two components,\n\n$$\\min_{m=1,\\ldots,M} \\mathbb{E}[f(X_m)] - \\min_x f(x) \\le \\underbrace{\\frac{1}{M}\\sum_{m=1}^M \\mathbb{E}[f(X_m) - p(f)]}_{\\text{integration error}} + \\underbrace{p(f) - \\min_x f(x)}_{\\text{expected suboptimality}}, \\tag{2.4}$$\n\nand bound each term on the right-hand side separately. The integration error, which captures both\nthe short-term non-stationarity of the chain and the long-term bias due to discretization, is the subject\nof Section 3; we develop explicit bounds using techniques that build upon [23, 15]. The expected\nsuboptimality quantifies how well exact samples from $p$ minimize f on average. In Section 4, we\nextend the Gibbs measure Langevin diffusion bound of Raginsky et al. 
[25] to more general invariant\nmeasures and associated diffusions and demonstrate the benefits of targeting non-Gibbs measures.\n\nNotation. We say a function $g$ is pseudo-Lipschitz continuous of order $n$ if it satisfies\n\n$$|g(x) - g(y)| \\le \\tilde\\mu_{1,n}(g)(1 + \\|x\\|_2^n + \\|y\\|_2^n)\\|x - y\\|_2 \\quad \\text{for all } x, y \\in \\mathbb{R}^d, \\tag{2.5}$$\n\nwhere $\\|\\cdot\\|_2$ denotes the Euclidean norm, and $\\tilde\\mu_{1,n}(g)$ is the smallest constant satisfying (2.5). This\nassumption, which relaxes the more stringent Lipschitz assumption, allows $g$ to exhibit polynomial\ngrowth of order $n$. For example, $g(x) = x^2$ is not Lipschitz but satisfies (2.5) with $\\tilde\\mu_{1,1}(g) \\le 1$. In\nall of our examples of interest, $n \\le 1$. For operator and Frobenius norms $\\|\\cdot\\|_{\\mathrm{op}}$ and $\\|\\cdot\\|_F$, we use\n\n$$\\mu_0(g) = \\sup_{x \\in \\mathbb{R}^d} \\|g(x)\\|_{\\mathrm{op}}, \\quad \\mu_1(g) = \\sup_{x, y \\in \\mathbb{R}^d, x \\ne y} \\frac{\\|g(x) - g(y)\\|_F}{\\|x - y\\|_2}, \\quad \\text{and} \\quad \\mu_i(g) = \\sup_{x, y \\in \\mathbb{R}^d, x \\ne y} \\frac{\\|\\nabla^{i-1}g(x) - \\nabla^{i-1}g(y)\\|_{\\mathrm{op}}}{\\|x - y\\|_2}$$\n\nfor the i-th order Lipschitz coefficients of a sufficiently differentiable function $g$. We denote the\ndegree $n$ polynomial coefficient of the i-th derivative of $g$ by $\\tilde\\pi_{i,n}(g) \\triangleq \\sup_{x \\in \\mathbb{R}^d} \\frac{\\|\\nabla^i g(x)\\|_{\\mathrm{op}}}{1 + \\|x\\|_2^n}$.\n\n3 Explicit Bounds on Integration Error\n\nWe develop our explicit bounds on integration error in three steps. In Theorem 3.1, we bound integration\nerror in terms of the polynomial growth and dissipativity of diffusion coefficients (Conditions 1\nand 2) and Stein factor bounds on the derivatives of solutions to the diffusion\u2019s Poisson equation\n(Condition 3). Condition 3 is a common assumption in the literature but is typically not verified. To\naddress this shortcoming, Theorem 3.2 shows that any smooth, fast-coupling diffusion admits finite\nStein factors expressed in terms of diffusion coupling rates (Condition 4). 
Finally, in Section 3.1, we\nprovide user-friendly tools for explicitly bounding those diffusion coupling rates. We begin with our\nconditions.\nCondition 1 (Polynomial growth of coefficients). For some $r \\in \\{1, 2\\}$ and $\\forall x \\in \\mathbb{R}^d$, the drift and\nthe diffusion coefficients of the diffusion (2.1) satisfy the growth condition\n\n$$\\|b(x)\\|_2 \\le \\tfrac{\\lambda_b}{4}(1 + \\|x\\|_2), \\quad \\|\\sigma(x)\\|_F \\le \\tfrac{\\lambda_\\sigma}{4}(1 + \\|x\\|_2), \\quad \\text{and} \\quad \\|\\sigma\\sigma^\\top(x)\\|_{\\mathrm{op}} \\le \\tfrac{\\lambda_a}{4}(1 + \\|x\\|_2^r).$$\n\nThe existence and uniqueness of the solution to the diffusion SDE (2.1) is guaranteed under Condition 1\n[19, Thm 3.5]. The cases r = 1 and r = 2 correspond to linear and quadratic growth of\n$\\|\\sigma\\sigma^\\top(x)\\|_{\\mathrm{op}}$, and we will explore examples of both r settings in Section 5. As we will see in each\nresult to follow, the quadratic growth case is far more delicate.\nCondition 2 (Dissipativity). For $\\alpha, \\beta > 0$, the diffusion (2.1) satisfies the dissipativity condition\n\n$$\\mathcal{A}\\|x\\|_2^2 \\le -\\alpha\\|x\\|_2^2 + \\beta \\quad \\text{for} \\quad \\mathcal{A}g(x) \\triangleq \\langle b(x), \\nabla g(x) \\rangle + \\tfrac{1}{2}\\langle \\sigma(x)\\sigma(x)^\\top, \\nabla^2 g(x) \\rangle. \\tag{3.1}$$\n\n$\\mathcal{A}$ is the generator of the diffusion with coefficients $b$ and $\\sigma$, and $\\mathcal{A}\\|x\\|_2^2 = 2\\langle b(x), x \\rangle + \\|\\sigma(x)\\|_F^2$.\nDissipativity is a standard assumption that ensures that the diffusion does not diverge but rather travels\ninward when far from the origin [22]. Notably, a linear growth bound on $\\|\\sigma(x)\\|_F$ and a quadratic\ngrowth bound on $\\|\\sigma\\sigma^\\top(x)\\|_{\\mathrm{op}}$ follow directly from the linear growth of $\\|b(x)\\|_2$ and Condition 2.\nHowever, in many examples, tighter growth constants can be obtained by inspection.\nOur final condition concerns the solution of the Poisson equation (also known as the Stein equation\nin the Stein\u2019s method literature) associated with our candidate diffusion.\n\nCondition 3 (Finite Stein factors). 
The function $u_f$ solves the Poisson equation with generator (3.1)\n\n$$f - p(f) = \\mathcal{A}u_f, \\tag{3.2}$$\n\nis pseudo-Lipschitz of order $n$ with constant $\\zeta_1$, and has i-th order derivative with degree-$n$ polynomial\ngrowth for i = 2, 3, 4, i.e.,\n\n$$\\|\\nabla^i u_f(x)\\|_{\\mathrm{op}} \\le \\zeta_i(1 + \\|x\\|_2^n) \\quad \\text{for } i \\in \\{2, 3, 4\\} \\text{ and all } x \\in \\mathbb{R}^d.$$\n\nIn other words, $\\tilde\\mu_{1,n}(u_f) = \\zeta_1$, and $\\tilde\\pi_{i,n}(u_f) = \\zeta_i$ for i = 2, 3, 4 with $\\max_i \\zeta_i < \\infty$.\nThe coefficients $\\zeta_i$ govern the regularity of the Poisson equation solution $u_f$ and are termed Stein\nfactors in the Stein\u2019s method literature. Although variants of Condition 3 have been assumed in\nprevious work [4, 26], we emphasize that this assumption is not easily verified, and frequently only\nempirical evidence is provided as justification for the assumption [4]. We will ultimately derive\nexplicit expressions for the Stein factors $\\zeta_i$ for a wide variety of diffusions and functions f, but first\nwe will use the Stein factors to bound the integration error of our discretized diffusion.\nTheorem 3.1 (Integration error of discretized diffusions). Let Conditions 1 to 3 hold for some $r \\in \\{1, 2\\}$.\nFor any even integer$^1$ $n_e \\ge n + 4$ and a step size satisfying $\\eta < 1 \\wedge \\frac{\\alpha}{2(n_e - 1)!!(1 + \\lambda_b/2 + \\lambda_\\sigma/2)^{n_e}}$,\n\n$$\\frac{1}{M}\\sum_{m=1}^M \\mathbb{E}[f(X_m)] - p(f) \\le \\Big(c_1\\frac{1}{\\eta M} + c_2\\eta + c_3\\eta^{1 + |1 \\wedge n/2|}\\Big)\\big(\\kappa_r(n_e) + \\mathbb{E}[\\|X_0\\|_2^{n_e}]\\big),$$\n\nwhere\n\n$c_1 = 6\\zeta_1$,\n\n16h2\u21e322\n48h\u21e333\n\nb + \u21e344\n\nc2 = 1\n\nc3 = 1\n\n\uf8ffr(n) = 2 + 2\n\n\u21b5 + na\n\n4\u21b5 + \u02dc\u21b5r\n\nb + \u21e33b2\n\nb(1 + 3n1) + 4\u21e34\n\n\u21b5\u21e3 na+6r\n2r \u02dc\u21b5r \u2318n\n\ni,\n\n + \u21e34(1 + 3n1)4\n1.5n\nn4 (4\n\n)i,\n\nb + n!!n\n\nb + n2\n\n)(n\n\ne4\n\nwith $\\tilde\\alpha_1 = \\alpha$, $\\tilde\\alpha_2 = [\\alpha - n_e\\lambda_a/4]_+$.\n\nThis integration error bound, proved in Appendix A, is $O(\\frac{1}{\\eta M} + \\eta)$ since the higher order term\n$c_3\\eta^{1 + |1 \\wedge n/2|}$ can be combined with the dominant term $c_2\\eta$ yielding $(c_2 + c_3)\\eta$ as $\\eta < 1$. 
We observe\nthat one needs $O(\\epsilon^{-2})$ steps to reach a tolerance of $\\epsilon$. Theorem 3.1 seemingly makes no assumptions\non the objective function f, but in fact the dependence on f is present in the growth parameters, the\nStein factors, and the polynomial degree of the Poisson equation solution. For example, we will show\nin Theorem 3.2 that the polynomial degree is upper bounded by that of the objective function f. To\ncharacterize the function classes covered by Theorem 3.1, we next turn to dissecting the Stein factors.\nWhile verifying Conditions 1 and 2 for a given diffusion is often straightforward, it is not immediately\nclear how one might verify Condition 3. As our second principal contribution, we derive explicit\nvalues for the Stein factors $\\zeta_i$ for any smooth and dissipative diffusion exhibiting fast L1-Wasserstein\ndecay:\nCondition 4 (Wasserstein rate). The diffusion $Z_t^x$ has $L^p$-Wasserstein rate $\\varrho_p : \\mathbb{R}_{\\ge 0} \\to \\mathbb{R}$ if\n\n$$\\inf_{\\text{couplings } (Z_t^x, Z_t^y)} \\mathbb{E}[\\|Z_t^x - Z_t^y\\|_2^p]^{1/p} \\le \\varrho_p(t)\\|x - y\\|_2 \\quad \\text{for all } x, y \\in \\mathbb{R}^d \\text{ and } t \\ge 0,$$\n\nwhere the infimum is taken over all couplings between $Z_t^x$ and $Z_t^y$. We further define the relative rates\n\n$$\\tilde\\varrho_1(t) = \\log(\\varrho_2(t)/\\varrho_1(t)) \\quad \\text{and} \\quad \\tilde\\varrho_2(t) = \\log(\\varrho_1(t)/[\\varrho_1(0)\\varrho_2(t)])/\\log(\\varrho_1(t)/\\varrho_1(0)).$$\n\n0 %1(t)!r(t + i  2)dt\n\nTheorem 3.2 (Finite Stein factors from Wasserstein decay). Assume that Conditions 1, 2 and 4\nhold and that f is pseudo-Lipschitz continuous of order n with, for i = 2, 3, 4, at most degree-n\npolynomial growth of its i-th order derivatives. 
Then, Condition 3 is satis\ufb01ed with Stein factors\n\n\u21e3i = \u2327i + \u21e0iR 1\n!r(t) = 1 + 4%1(t)11/r%1(0)1/21 + 2\nwith \u02dc\u21b51 = \u21b5, \u02dc\u21b52 = inf t0[\u21b5  na(1 _ \u02dc%2(t))]+, and\n\u23271 = 0\n\u21e01 = \u02dc\u00b51,n(f ) & \u21e0i = \u02dc\u00b51,n(f )\u02dc\u232b1:i(b)\u02dc\u232b1:i()\u02dc\u232b0:i2(1)%1(0)!r(1)\uf8ffr(6n)i1\nwhere \uf8ffr(n) is as in Theorem 3.1, \u02dc\u21e1a:b,n(f ) = maxi=a,..,b \u02dc\u21e1i,n(f ), and \u02dc\u232ba:b(g) is a constant, given\nexplicitly in the proof, depending only on the order a through b derivatives of g.\n\nr {[1 _ \u02dc%r(t)]2an + 3r}n,\n\n& \u2327i = \u02dc\u00b51,n(f )\u02dc\u21e12:i,n(f )\u02dc\u232b1:i(b)\u02dc\u232b1:i()\uf8ffr(6n)\n\nfor i = 2, 3, 4,\nfor i = 2, 3, 4,\n\nfor i = 1, 2, 3, 4, where\n\n\u02dc\u21b5n\n\n1In a typical example where f is bounded by a quadratic polynomial, we have n = 1 and ne = 6. We also\n\nremind the reader that the double factorial (ne  1)!! = 1 \u00b7 3 \u00b7 5\u00b7\u00b7\u00b7 (ne1) is of order pne!.\n\n5\n\n\fThe proof of Theorem 3.2 is given in Section B and relies on the explicit transition semigroup\nderivative bounds of [12]. We emphasize that, to provide \ufb01nite Stein factors, Theorem 3.2 only\nrequires L1-Wasserstein decay and allows the L2-Wasserstein rate to grow. An integrable Wasserstein\nrate is an indication that a diffusion mixes quickly to its stationary distribution. Hence, Theorem 3.2\nsuggests that, for a given f, one should select a diffusion that mixes quickly to a stationary measure\nthat, like the Gibbs measure (2.3), has modes at the minimizers of f. We explore user-friendly\nconditions implying fast Wasserstein decay in Section 3.1 and detailed examples deploying these\ntools in Section 5. 
Crucially for the \u201cheavy-tailed\u201d examples given in Section 5, Theorem 3.2 allows\nfor an unbounded diffusion coefficient $\\sigma$, unlike the classic results of [24].\n\n3.1 Sufficient conditions for Wasserstein decay\n\nA simple condition that leads to exponential L1 and L2-Wasserstein decay is uniform dissipativity\n(3.3). The next result from [27] (see also [2, Sec. 1], [15, Thm. 10]) makes the relationship precise.\nProposition 3.3 (Wasserstein decay from uniform dissipativity [27, Thm. 2.5]). A diffusion with\ndrift and diffusion coefficients $b$ and $\\sigma$ has Wasserstein rate $\\varrho_p(t) = e^{-kt/2}$ if, for all $x, y \\in \\mathbb{R}^d$,\n\n$$2\\langle b(x) - b(y), x - y \\rangle + \\|\\sigma(x) - \\sigma(y)\\|_F^2 + (p - 2)\\|\\sigma(x) - \\sigma(y)\\|_{\\mathrm{op}}^2 \\le -k\\|x - y\\|_2^2. \\tag{3.3}$$\n\nIn the Gibbs measure Langevin case, where $b = -\\nabla f$ and $\\sigma \\equiv \\sqrt{2/\\gamma}\\,I$, uniform dissipativity is\nequivalent to the strong convexity of f. As we will see in Section 5, the extra degree of freedom in\nthe diffusion coefficient $\\sigma$ will allow us to treat non-convex and non-strongly convex functions f.\nA more general condition leading to exponential L1-Wasserstein decay is the distant dissipativity\ncondition (3.4). The following result of [15] builds upon the pioneering analyses of Eberle [11, Cor.\n2] and Wang [27, Thm. 2.6] to provide explicit Wasserstein decay.\nProposition 3.4 (Wasserstein decay from distant dissipativity [15, Cor. 4.2]). 
A diffusion with drift\nand diffusion coefficients $b$ and $\\sigma$ satisfying $\\tilde\\sigma(x) \\triangleq (\\sigma(x)\\sigma(x)^\\top - s^2 I)^{1/2}$ and\n\n$$\\frac{\\langle b(x) - b(y), x - y \\rangle}{s^2\\|x - y\\|_2^2/2} + \\frac{\\|\\tilde\\sigma(x) - \\tilde\\sigma(y)\\|_F^2}{s^2\\|x - y\\|_2^2} - \\frac{\\|(\\tilde\\sigma(x) - \\tilde\\sigma(y))^\\top(x - y)\\|_2^2}{s^2\\|x - y\\|_2^4} \\le \\begin{cases} -K & \\text{if } \\|x - y\\|_2 > R \\\\\\\\ L & \\text{if } \\|x - y\\|_2 \\le R \\end{cases} \\tag{3.4}$$\n\nfor $R, L \\ge 0$, $K > 0$, and $s \\in (0, 1/\\mu_0(\\sigma^{-1}))$ has Wasserstein rate $\\varrho_1(t) = 2e^{LR^2/8}e^{-kt/2}$ for\n\n$$s^2 k^{-1} \\le \\begin{cases} \\frac{e - 1}{2}R^2 + e\\sqrt{8K^{-1}}\\,R + 4K^{-1} & \\text{if } LR^2 \\le 8 \\\\\\\\ 8\\sqrt{2\\pi}\\,R^{-1}L^{-1/2}(L^{-1} + K^{-1})\\exp(\\frac{LR^2}{8}) + 32R^{-2}K^{-2} & \\text{if } LR^2 > 8. \\end{cases}$$\n\nConveniently, both uniform and distant dissipativity imply our dissipativity condition, Condition 2.\nThe Prop. 3.4 rates feature the distance-dependent parameter $e^{LR^2/8}$. In the pre-conditioned Langevin\nGibbs setting ($b = -\\frac{1}{2}a\\nabla f$ and $\\sigma$ constant) when f is the negative log likelihood of a multimodal\nGaussian mixture, $R$ in (3.4) represents the maximum distance between modes [15]. When $R$ is\nrelatively small, the convergence of the diffusion towards its stationary distribution is rapid, and the\nnon-uniformity parameter is small; when $R$ is relatively large, the parameter grows exponentially in\n$R^2$, as would be expected due to infrequent diffusion transitions between modes.\nOur next result, proved in Appendix D, provides a user-friendly set of sufficient conditions for\nverifying distant dissipativity and hence exponential Wasserstein decay in practice.\nProposition 3.5 (User-friendly Wasserstein decay). Fix any diffusion and skew-symmetric stream\ncoefficients $\\sigma$ and $c$ satisfying $L_* \\triangleq F_1(\\tilde\\sigma)^2 + \\sup_x \\lambda_{\\max}(\\nabla\\langle \\nabla, m(x) \\rangle) < \\infty$ for $m(x) \\triangleq\n\\sigma(x)\\sigma(x)^\\top + c(x)$, $\\tilde\\sigma(x) \\triangleq (\\sigma(x)\\sigma(x)^\\top - s_0^2 I)^{1/2}$, and $s_0 \\in (0, 1/\\mu_0(\\sigma^{-1}))$. 
If\n\uf8ff\u21e2Km if kx  yk2 > Rm\nLm if kx  yk2 \uf8ff Rm,\nholds for Rm, Lm  0, Km > 0, then, for any inverse temperature > L \u21e4/Km, the diffusion with\ndrift and diffusion coef\ufb01cients b =  1\n2hr, mi and  = 1p  has stationary density\np(x) / ef (x) and satis\ufb01es (3.4) with s = s0p , K = KmL\u21e4\n\nhm(x)rf (x)  m(y)rf (y), x  yi\n\n2 mrf + 1\n\n, L = Lm+L\u21e4\n\n, and R = Rm.\n\nkx  yk2\n\n(3.5)\n\ns2\n0\n\ns2\n0\n\n2\n\n6\n\n\f4 Explicit Bounds on Optimization Error\n\nTo convert our integration error bounds into bounds on optimization error, we now turn our attention\nto bounding the expected suboptimality term of (2.4). To characterize the expected suboptimality\nof sampling from a measure with modes matching the minima of f, we generalize a result due\nto Raginsky et al. [25]. The original result [25, Prop. 3.4] was designed to analyze the Gibbs\nmeasure (2.3) and demanded that log p be smooth, in the sense that \u00b52(log p) < 1. Our next\nproposition, proved in Appendix C, is designed for more general measures p and importantly relaxes\nthe smoothness requirements on log p.\nProposition 4.1 (Expected suboptimality: Sampling yields near-optima). Suppose p is the stationary\ndensity of an (\u21b5, )-dissipative diffusion (Condition 2) with global maximizer x\u21e4. If p takes the\ngeneralized Gibbs form p,\u2713 (x) / exp((f (x)  f (x\u21e4))\u2713) for > 0 and rf (x\u21e4) = 0, we have\n(4.1)\n\n\u2713 log( 2\nMore generally, if log p(x\u21e4)  log p(x) \uf8ff Ckx  x\u21e4k2\u2713\nthen\n\np,\u2713 (f )  f (x\u21e4) \uf8ff \u2713q d\n\n2 ( 1\n\n2\u21b5\n\n)).\n\nd ) + log( e\u00b52(f )\n2 for some C > 0 and \u2713 2 (0, 1] and all x,\n\np(log p) + log p(x\u21e4) \uf8ff d\n\n2\u2713 log( 2C\n\nd ) + d\n\n2 log( e\n\n\u21b5 ).\n\n(4.2)\n\nWhen \u2713 = 1, p,\u2713 is the Gibbs measure, and the bound (4.1) exactly recovers [25, Prop. 3.4]. 
The\ngeneralized Gibbs measures with \u2713< 1 allow for improved dependence on the inverse temperature\nwhen   d/(2\u2713). Note however that, for \u2713< 1, the distributions p,\u2713 also require knowledge of the\noptimal value f (x\u21e4). In certain practical settings, such as neural network optimization, it is common\nto have f (x\u21e4) = 0. When f (x\u21e4) is unknown, a similar analysis can be carried out by replacing f (x\u21e4)\nwith an estimate, and the bound (4.1) still holds up to a controllable error factor.\nBy combining Prop. 4.1 with Theorem 3.1, we obtain a complete bound controlling the global\noptimization error of the best Markov chain iterate.\nCorollary 4.2 (Optimization error of discretized diffusions). Instantiate the assumptions and notation\nof Theorem 3.1 and Prop. 4.1. If the diffusion has the generalized Gibbs stationary density p,\u2713 (x) /\nexp((f (x)  f (x\u21e4))\u2713), then\n\nmin\n\nm=1,..,M\n\n\u2318M + (c2+c3)\u2318\u2318\uf8ffr(ne) + E[kX0kne\nE[f (Xm)]  f (x\u21e4) \uf8ff\u21e3c1\n2 ]\n+ \u2713q d\n\nd ) + log( e\u00b52(f )\n\n\u2713 log( 2\n\n2 ( 1\n\n)).\n\n2\u21b5\n\n1\n\nFinally, we demonstrate that, for quadratic functions, the generalized Gibbs expected suboptimality\nbound (4.1) can be further re\ufb01ned to remove the log(/d)1/\u2713 dependence.\nProposition 4.3 (Expected suboptimality: Quadratic f). Let f (x) = hx  b, A(x  b)i for a positive\nsemide\ufb01nite A 2 Rd\u21e5d and b 2 Rd. Then for p,\u2713 (x) / exp((f (x)  f (x\u21e4))\u2713) with \u2713> 0, and\nfor each positive integer k, we have\n\n(4.3)\n\n(4.4)\n\np,1/k(f )  f (x\u21e4) \uf8ff  k(1+ d\n\n2 )1\n\n\nk.\n\nThe bound (4.4) applies to any f with level set (i.e., {x : f (x) = \u21e2}) volume proportional to \u21e2d1.\n5 Applications to Non-convex Optimization\n\nWe next provide detailed examples of verifying that a given diffusion is appropriate for optimizing a\ngiven objective, using either uniform dissipativity (Prop. 
3.3) or our user-friendly distant dissipativity\nconditions (Prop. 3.5). When the Gibbs measure Langevin diffusion is used, our results yield global\noptimization when f is strongly convex (condition (3.3) with $b = -\\nabla f$ and $\\sigma \\equiv \\sqrt{2/\\gamma}\\,I$) or has\nstrongly convex tails (condition (3.5) with $m \\equiv I$). To highlight the value of non-constant diffusion\ncoefficients, we will focus on \u201cheavy-tailed\u201d examples that are not covered by the Langevin theory.\n\n[Figure 1 panels. Legend entries: Gradient Descent (first 7000 iters), Gradient Descent (next 3000 iters), Langevin Algorithm (300 iters), Designed Diffusion (15 iters). Axes: Iterations (horizontal), Function Value (vertical). Methods: Designed Diffusion, Langevin Algorithm, Gradient Descent.]\n\nFigure 1: The left plot shows the landscape of the non-convex, sublinear growth function $f(x) = c\\log(1 + \\frac{1}{2}\\|x\\|_2^2)$. The middle and right plots compare the optimization error of gradient descent, the\nLangevin algorithm, and the discretized diffusion designed in Section 5.1.\n\n5.1 A simple example with sublinear growth\n\n2kxk2\n\n2 a(x)rf (x) + 1\n\n2 and  = d/. In fact, this\n\ndiffusion satis\ufb01es uniform dissipativity,\n\n2 and consider f (x) = c log(1 + 1\n\n2hr, a(x)i and (x) = 1p (x) for (x) ,q1 + 1\n\nWe begin with a pedagogical example of selecting an appropriate diffusion and verifying our global\noptimization conditions. 
Fix c > d+3\n2), a simple non-convex\nobjective which exhibits sublinear growth in kxk2 and hence does not satisfy dissipativity (Con-\ndition 2) when paired with the Gibbs measure Langevin diffusion (b = rf,  = p2/I). To\ntarget the Gibbs measure (2.3) with inverse temperature   1, we choose the diffusion with coef-\n\ufb01cients b(x) =  1\n2kxk2\n2I\nand a(x) = (x)(x)>. This choice satis\ufb01es Condition 1 with b = O(1),  = O1/2, and\na = O1 with respect to  and Condition 2 with \u21b5 = c  d+3\n\u21e3q1 + 1\n2\u23182\n\u02dc%2(t) = 0. Hence, the i-th Stein factor in Theorem 3.2 satis\ufb01es \u21e3i = O(i1)/2. This implies that\nthe coef\ufb01cients ci in Corollary 4.2 scale with O\u21e3 1\n\u2318 and the \ufb01nal optimization\nerror bound (4.3) can be made of order \u270f by choosing the inverse temperature  = O\u270f1, the step\nsize \u2318 = O\u270f1.5, and the number of iterations M = O\u270f2.5.\n\n2hb(x)  b(y), x  yi + k(x)  (y)k2\nF,\n= (c  1\n\nFigure 1 illustrates the value of this designed diffusion over gradient descent and the standard\nLangevin algorithm. Here, d = 2, c = 10, the inverse temperature  = 1, the step size \u2318 = 0.1,\nand each algorithm is run from the initial point (90, 110). We observe that the Langevin algorithm\ndiverges, and gradient descent requires thousands of iterations to converge while the designed\ndiffusion converges to the region of interest after 15 iterations.\n\nyielding L1 and L2-Wasserstein rates %1(t) = %2(t) = et\u21b5/2 by Prop. 
3.3 and the relative rate\n\n2 q1 + 1\n\nM\u2318 +\u2303 3\n\ni=1\u2318ii/2 + 1\n\n2kyk2\n\n\uf8ff \u21b5kx  yk2\n2,\n\n )kx  yk2\n\n2 + d\n\n2kxk2\n\n5.2 Non-convex learning with linear growth\n\nNext consider the canonical learning problem of regularized loss minimization with\n\nf (x) = L(x) + R(x)\n\nLPL\nfor L(x) , 1\nl=1 l(hx, vli), l a datapoint-speci\ufb01c loss function, vl 2 Rd the l-th data-\n2) a regularizer with concave \u21e2 satisfying 3\u21e20(z) \npoint covariate vector, and R(x) = \u21e2( 1\npmax(0,\u21e20(0)z\u21e2000(z)) and 4g01(z)2\nz \uf8ff 1 for gs(z) , \u21e20(0)\n2 z)  s, some 1, 2, 3 > 0,\nand all z, s 2 R. Our aim is to select diffusion and stream coef\ufb01cients that satisfy the Wasserstein\ndecay preconditions of Prop. 3.5. To achieve this, we set c \u2318 0 and choose  with \u00b50(1) < 1 so\n\n2kxk2\n2 \uf8ff g1(z)\n\n\u21e20( 1\n\n8\n\n\fthat the regularization component of the drift is one-sided Lipschitz, i.e.,\n\n2\n\nfor some Ka > 0.\n\n2 and Lm, Rm suf\ufb01ciently large.\n\nha(x)rR(x)  a(y)rR(y), x  yi \uf8ff Kakx  yk2\n\n(5.1)\nWe then show that L\u21e4 from Prop. 3.5 is bounded and that, for suitable loss choices, a(x)rL(x) is\nbounded and Lipschitz so that (3.5) holds with Km = Ka\nr2 pgs(r2) for all s 2 [0, 1].\nFix any x, let r = kxk2, and de\ufb01ne \u02dc(s)(x) = p1  s(I  xx>\nWe choose  = \u02dc(0) so that a(x)rR(x) = \u21e20(0)x and (5.1) holds with Ka = \u21e20(0). Our constraints\non \u21e2 ensure that a(x) = I + xx>\nr2 g1(r2) is positive de\ufb01nite, that \u00b50(1) \uf8ff 1, and that  and a have\nat most linear and quadratic growth respectively, in satisfaction of Condition 1. Moreover,\nrhr, a(x)i = I( (d1)g1(r2)\nmax(rhr, a(x)i) = max( (d1)g1(r2)\nso that max(rhr, a(x)i) \uf8ff max((d 1)1 +p12, dp12 + 23). 
For any s0 2 (0, 1), we have\n\nr2 ((d  1)(g01(r2)  g1(r2)\n\n+ 2g01(r2), (d1)g1(r2)\n\n+ 2dg01(r2) + 4r2g001 (r2)),\n\n+ 2g01(r2)) + 2 xx>\n\n) + 2r2g001 (r2)), and\n\nr2 ) + xx>\n\nr2\n\nr2\n\nr2\n\nr2\n\npgs0 (r2)p1s0\n\n(r2)\n\nrg0s0\npgs0 (r2)\n\nr\n\n+ 2 xx>\n\n2kxk2\n\nr3 hx, vi\n\nr3 hx, vi)\n\nr  2 xx>\n\nr\u02dc(s0)(x)[v] = (I hx,vir + xv>\nfor each v 2 Rd, so, as |pgs0(r2)  p1  s0|\uf8ff pg1(r2), 1(\u02dc) \uf8ff dp1 + p2 for \u02dc = \u02dc(s0).\nFinally, to satisfy (3.5), it suf\ufb01ces to verify that a(x)rL(x) is bounded and Lipschitz. For example,\nin the case of a ridge regularizer, R(x) = \n2 for > 0, the coef\ufb01cient a(x) = I, and it suf\ufb01ces\nto check that L is Lipschitz with Lipschitz gradient. This strongly convex regularizer satis\ufb01es our\nassumptions, but strong convexity is by no means necessary. Consider instead the pseudo-Huber\nfunction, R(x) = (q1 + 1\n2  1), popularized in computer vision [17]. This convex but non-\nstrongly convex regularizer satis\ufb01es all of our criteria and yields a diffusion with a(x) = I + xx>\nr2 R(x)\n .\nLPl vlv>l 00l (hx, vli), a(x)rL(x)\nMoreover, since rL(x) = 1\n1+r for some 4, 5 > 0. Hence,\nis bounded and Lipschitz whenever | 0l(r)|\uf8ff 4\nProp. 3.5 guarantees exponential Wasserstein decay for a variety of non-convex L based on datapoint\noutcomes yl, including the sigmoid ( (r) = tanh((r  yl)2) for yl 2 R or (r) = 1  tanh(ylr)\nfor yl 2 {\u00b11}) [1], the Student\u2019s t negative log likelihood ( l(r) = log(1 + (r  yl)2)), and\nthe Blake-Zisserman ( (r) =  log(e(ryl)2 + \u270f),\u270f > 0) [17]. The reader can verify that all\nof these examples also satisfy the remaining global optimization pre-conditions of Corollary 4.2\nand Theorem 3.2. 
In contrast, these linear-growth examples do not satisfy dissipativity (Condition 2)\nwhen paired with the Gibbs measure Langevin diffusion.\n\nLPl vl 0l(hx, vli) and r2L(x) = 1\n1+r and | 00l (r)|\uf8ff 5\n\n2kxk2\n\n6 Conclusion\n\nIn this paper, we showed that the Euler discretization of any smooth and dissipative diffusion can\nbe used for global non-convex optimization. We established non-asymptotic bounds on global\noptimization error and integration error with convergence governed by Stein factors obtained from the\nsolution of the Poisson equation. We further provided explicit bounds on Stein factors for large classes\nof convex and non-convex objective functions, based on computable properties of the objective and the\ndiffusion. Using this \ufb02exibility, we designed suitable diffusions for optimizing non-convex functions\nnot covered by the existing Langevin theory. We also demonstrated that targeting distributions other\nthan the Gibbs measure can give rise to improved optimization guarantees.\n\nReferences\n[1] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classi\ufb01cation, and risk bounds. Journal of the\n\nAmerican Statistical Association, 101:138\u2013156, 2006.\n\n[2] P. Cattiaux and A. Guillin. Semi log-concave Markov diffusions. In S\u00e9minaire de Probabilit\u00e9s XLVI, volume\n2123 of Lecture Notes in Math., pages 231\u2013292. Springer, Cham, 2014. doi: 10.1007/978-3-319-11970-0_\n9.\n\n[3] S. Cerrai. Second order PDE\u2019s in \ufb01nite and in\ufb01nite dimension: a probabilistic approach, volume 1762.\n\nSpringer Science & Business Media, 2001.\n\n9\n\n\f[4] C. Chen, N. Ding, and L. Carin. On the convergence of stochastic gradient mcmc algorithms with\nhigh-order integrators. In Advances in Neural Information Processing Systems, pages 2278\u20132286, 2015.\n[5] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for\n\nlangevin dynamics in the nonconvex setting. 
arXiv preprint arXiv:1805.01648, 2018.\n\n[6] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin monte carlo\n\nand gradient descent. arXiv preprint arXiv:1704.04752, 2017.\n\n[7] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities.\n\nJournal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651\u2013676, 2017.\n\n[8] A. S. Dalalyan and A. Tsybakov. Sparse regression learning by aggregation and langevin monte-carlo.\n\nJournal of Computer and System Sciences, 78:1423\u20131443, 2012.\n\n[9] A. Durmus, E. Moulines, et al. Nonasymptotic convergence analysis for the unadjusted langevin algorithm.\n\nThe Annals of Applied Probability, 27(3):1551\u20131587, 2017.\n\n[10] R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-hastings algorithms\n\nare fast! arXiv preprint arXiv:1801.02309, 2018.\n\n[11] A. Eberle. Re\ufb02ection couplings and contraction rates for diffusions. Probability theory and related \ufb01elds,\n\n166(3-4):851\u2013886, 2016.\n\n[12] M. A. Erdogdu, L. Mackey, and O. Shamir. Multivariate Stein Factors from Wasserstein Decay. In\n\npreparation, 2019.\n\n[13] X. Gao, M. G\u00fcrb\u00fczbalaban, and L. Zhu. Global convergence of stochastic gradient hamiltonian monte\ncarlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based\nacceleration. arXiv preprint arXiv:1809.04618, 2018.\n\n[14] S. B. Gelfand and S. K. Mitter. Recursive stochastic algorithms for global optimization in r\u02c6d. SIAM\n\nJournal on Control and Optimization, 29(5):999\u20131018, 1991.\n\n[15] J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv\n\npreprint arXiv:1611.06972, 2016.\n\n[16] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Academic press, 2014.\n[17] R. Hartley and A. Zisserman. 
Multiple View Geometry in Computer Vision. Cambridge University Press,\n\nISBN: 0521540518, second edition, 2004.\n\n[18] G. J. O. Jameson. Inequalities for gamma function ratios. The American Mathematical Monthly, 120(10):\n936\u2013940, 2013. ISSN 00029890, 19300972. URL http://www.jstor.org/stable/10.4169/amer.\nmath.monthly.120.10.936.\n\n[19] R. Khasminskii. Stochastic stability of differential equations, volume 66. Springer Science & Business\n\nMedia, 2011.\n\n[20] Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural\n\nInformation Processing Systems, pages 2917\u20132925, 2015.\n\n[21] A. Mathai and S. Provost. Quadratic forms in random variables: Theory and applications. 1992.\n[22] J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for sdes and approximations: locally lipschitz\n\nvector \ufb01elds and degenerate noise. Stochastic processes and their applications, 101(2):185\u2013232, 2002.\n\n[23] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical time-averaging and stationary\n\nmeasures via poisson equations. SIAM Journal on Numerical Analysis, 48(2):552\u2013577, 2010.\n\n[24] E. Pardoux and A. Veretennikov. On the Poisson equation and diffusion approximation. i. Ann. Probab.,\n\npages 1061\u20131085, 2001.\n\n[25] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient langevin\n\ndynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.\n\n[26] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-) asymptotic bias and variance of\n\nstochastic gradient langevin dynamics. Journal of Machine Learning Research 17, pages 1\u201348, 2016.\n\n[27] F. Wang. Exponential Contraction in Wasserstein Distances for Diffusion Semigroups with Negative\n\nCurvature. arXiv:1608.04471, Mar. 2016.\n\n[28] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient langevin dynamics. 
In Proceedings of\n\nthe 28th International Conference on Machine Learning (ICML-11), pages 681\u2013688, 2011.\n\n[29] P. Xu, J. Chen, D. Zou, and Q. Gu. Global convergence of langevin dynamics based algorithms for\nnonconvex optimization. In Advances in Neural Information Processing Systems, pages 3126\u20133137, 2018.", "award": [], "sourceid": 6072, "authors": [{"given_name": "Murat", "family_name": "Erdogdu", "institution": "University of Toronto"}, {"given_name": "Lester", "family_name": "Mackey", "institution": "Microsoft Research"}, {"given_name": "Ohad", "family_name": "Shamir", "institution": "Weizmann Institute of Science"}]}