{"title": "Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3122, "page_last": 3133, "abstract": "We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the \\textit{almost minimizer}\\footnote{Following \\citet{raginsky2017non}, an almost minimizer is defined to be a point which is within the ball of the global minimizer with radius $O(d\\log(\\beta+1)/\\beta)$, where $d$ is the problem dimension and $\\beta$ is the inverse temperature parameter.} within $\\tilde O\\big(nd/(\\lambda\\epsilon) \\big)$\\footnote{$\\tilde O(\\cdot)$ notation hides polynomials of logarithmic terms and constants.} and $\\tilde O\\big(d^7/(\\lambda^5\\epsilon^5) \\big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\\lambda$ is the spectral gap of the Markov chain generated by GLD. Both results improve upon the best known gradient complexity\\footnote{Gradient complexity is defined as the total number of stochastic gradient evaluations of an algorithm, which is the number of stochastic gradients calculated per iteration times the total number of iterations.} results \\citep{raginsky2017non}. \nFurthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (VR-SGLD) to the almost minimizer within $\\tilde O\\big(\\sqrt{n}d^5/(\\lambda^4\\epsilon^{5/2})\\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. 
\nOur theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.", "full_text": "Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization\n\nPan Xu* (Department of Computer Science, UCLA, Los Angeles, CA 90095, panxu@cs.ucla.edu)\nJinghui Chen* (Department of Computer Science, University of Virginia, Charlottesville, VA 22903, jc4zg@virginia.edu)\nDifan Zou (Department of Computer Science, UCLA, Los Angeles, CA 90095, knowzou@cs.ucla.edu)\nQuanquan Gu (Department of Computer Science, UCLA, Los Angeles, CA 90095, qgu@cs.ucla.edu)\n\nAbstract\n\nWe present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with n component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the almost minimizer[2] within Õ(nd/(λε))[3] and Õ(d^7/(λ^5 ε^5)) stochastic gradient evaluations respectively, where d is the problem dimension and λ is the spectral gap of the Markov chain generated by GLD. Both results improve upon the best known gradient complexity[4] results [45]. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (SVRG-LD) to the almost minimizer within Õ(√n d^5/(λ^4 ε^{5/2})) stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. 
Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.\n\n1 Introduction\n\nWe consider the following nonconvex finite-sum optimization problem\n\nmin_x F_n(x) := (1/n) Σ_{i=1}^n f_i(x),    (1.1)\n\nwhere the f_i(x)'s are called component functions, and both F_n(x) and the f_i(·)'s can be nonconvex. Various first-order optimization algorithms, such as gradient descent [42], stochastic gradient descent [27] and, more recently, variance-reduced stochastic gradient descent [46, 3], have been proposed and analyzed for solving (1.1). However, all these algorithms are only guaranteed to converge to a stationary point, which can be a local minimum, a local maximum, or even a saddle point. This raises an important question in nonconvex optimization and machine learning: is there an efficient algorithm that is guaranteed to converge to the global minimum of (1.1)?\n\n*Equal contribution.\n[2] Following [45], an almost minimizer is defined to be a point within a ball around the global minimizer of radius O(d log(β + 1)/β), where d is the problem dimension and β is the inverse temperature parameter.\n[3] The Õ(·) notation hides polynomials of logarithmic terms and constants.\n[4] Gradient complexity is defined as the total number of stochastic gradient evaluations of an algorithm, i.e., the number of stochastic gradients calculated per iteration times the total number of iterations.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nRecent studies [17, 16] showed that sampling from a distribution which concentrates around the global minimum of F_n(x) is a task similar to minimizing F_n via certain optimization algorithms. This justifies the use of Langevin dynamics based algorithms for optimization. 
In detail, the first order Langevin dynamics is defined by the following stochastic differential equation (SDE)\n\ndX(t) = −∇F_n(X(t))dt + √(2β^{−1}) dB(t),    (1.2)\n\nwhere β > 0 is the inverse temperature parameter, which is treated as a constant throughout the analysis of this paper, and {B(t)}_{t≥0} is the standard Brownian motion in R^d. Under certain assumptions on the drift coefficient −∇F_n, it was shown that the distribution of the diffusion X(t) in (1.2) converges to its stationary distribution [14], a.k.a. the Gibbs measure π(dx) ∝ exp(−βF_n(x)), which concentrates on the global minimum of F_n [29, 26, 47]. Note that the above convergence result holds even when F_n(x) is nonconvex. This motivates the use of Langevin dynamics based algorithms for nonconvex optimization [45, 53, 50, 49]. However, unlike first order optimization algorithms [42, 27, 46, 3], which have been extensively studied, the non-asymptotic theoretical guarantee of applying Langevin dynamics based algorithms to nonconvex optimization is still understudied. In a seminal work, Raginsky et al. [45] provided a non-asymptotic analysis of stochastic gradient Langevin dynamics (SGLD) [52] for nonconvex optimization, which is a stochastic gradient based discretization of (1.2). They proved that SGLD converges to an almost minimizer, up to an error that scales with σ and d, within Õ(d/(λ*ε^4)) iterations, where σ^2 is the variance of the stochastic gradient and λ* is the so-called uniform spectral gap of the Langevin diffusion (1.2), which is of order e^{−Õ(d)}. In a concurrent work, Zhang et al. [53] analyzed the hitting time of SGLD and proved its convergence to an approximate local minimum. More recently, Tzen et al. 
[50] studied the local optimality and generalization performance of Langevin algorithms for nonconvex functions through the lens of metastability, and Simsekli et al. [49] developed an asynchronous-parallel stochastic L-BFGS algorithm for nonconvex optimization based on variants of SGLD. Erdogdu et al. [23] further developed a non-asymptotic analysis of global optimization based on a broader class of diffusions.\n\nIn this paper, we establish the global convergence for a family of Langevin dynamics based algorithms, including Gradient Langevin Dynamics (GLD) [17, 20, 16], Stochastic Gradient Langevin Dynamics (SGLD) [52] and Stochastic Variance Reduced Gradient Langevin Dynamics (SVRG-LD) [19], for solving the finite-sum nonconvex optimization problem in (1.1). Our analysis is built upon a direct analysis of the discrete-time Markov chain rather than the continuous-time Langevin diffusion, and therefore avoids the discretization error.\n\n1.1 Our Contributions\n\nThe major contributions of our work are summarized as follows:\n\n• We provide a unified analysis for a family of Langevin dynamics based algorithms via a new decomposition scheme of the optimization error, under which we directly analyze the ergodicity of numerical approximations to Langevin dynamics (see Figure 1).\n\n• Under our unified framework, we establish the global convergence of GLD for solving (1.1). In detail, GLD requires Õ(d/(λε)) iterations to converge to the almost minimizer of (1.1) up to precision ε, where λ is the spectral gap of the discrete-time Markov chain generated by GLD and is of order e^{−Õ(d)}. This improves the Õ(d/(λ*ε^4)) iteration complexity of GLD implied by [45], where λ* = e^{−Õ(d)} is the spectral gap of the Langevin diffusion (1.2).\n\n• We establish a faster convergence of SGLD to the almost minimizer of (1.1). 
In detail, it converges to the almost minimizer up to ε precision within Õ(d^7/(λ^5 ε^5)) stochastic gradient evaluations. This also improves the Õ(d^9/(λ*^5 ε^8)) gradient complexity proved in [45].\n\n• We also analyze the SVRG-LD algorithm and investigate its global convergence property. We show that SVRG-LD is guaranteed to converge to the almost minimizer of (1.1) within Õ(√n d^5/(λ^4 ε^{5/2})) stochastic gradient evaluations. It outperforms the gradient complexities of both GLD and SGLD when 1/ε^3 ≤ n ≤ 1/ε^5. To the best of our knowledge, this is the first global convergence guarantee of SVRG-LD for nonconvex optimization, while the original paper [19] only analyzed the posterior sampling property of SVRG-LD.\n\n1.2 Additional Related Work\n\nStochastic gradient Langevin dynamics (SGLD) [52] and its extensions [2, 39, 19] have been widely used in Bayesian learning. A large body of work has focused on analyzing the mean square error of Langevin dynamics based algorithms. In particular, Vollmer et al. [51] analyzed the non-asymptotic bias and variance of the SGLD algorithm by using Poisson equations. Chen et al. [12] showed the non-asymptotic bias and variance of MCMC algorithms with high order integrators. Dubey et al. [19] proposed variance-reduced algorithms based on stochastic gradient Langevin dynamics, namely SVRG-LD and SAGA-LD, for Bayesian posterior inference, and proved that their methods improve the mean square error upon SGLD. Li et al. [37] further improved the mean square error by applying variance reduction tricks to Hamiltonian Monte Carlo, which is also called underdamped Langevin dynamics.\n\nAnother line of research [17, 21, 16, 18, 22, 55] focused on characterizing the distance between the distributions generated by Langevin dynamics based algorithms and (strongly) log-concave target distributions. 
In detail, Dalalyan [17] proved that the distribution of the last step in GLD converges to\n\nthe stationary distribution in eO(d/\u270f2) iterations in terms of total variation distance and Wasserstein\n\ndistance respectively with a warm start and showed the similarities between posterior sampling and\noptimization. Later Durmus and Moulines [20] improved the results by showing this result holds for\nany starting point and established similar bounds for the Wasserstein distance. Dalalyan [16] further\nimproved the existing results in terms of the Wasserstein distance and provide further insights on\nthe close relation between approximate sampling and gradient descent. Cheng et al. [13] improved\nexisting 2-Wasserstein results by reducing the discretization error using underdamped Langevin\ndynamics. To improve the convergence rates in noisy gradient settings, Chatterji et al. [11], Zou et al.\n[56] presented convergence guarantees in 2-Wasserstein distance for SAGA-LD and SVRG-LD using\nvariance reduction techniques. Zou et al. [55] proposed the variance reduced Hamilton Monte Carlo\nto accelerate the convergence of Langevin dynamics based sampling algorithms. As to sampling from\ndistribution with compact support, Bubeck et al. [8] analyzed sampling from log-concave distributions\nvia projected Langevin Monte Carlo, and Brosse et al. [7] proposed a proximal Langevin Monte\nCarlo algorithm. 
This line of research is orthogonal to our work, since their analyses concern the convergence of the distribution of the iterates to the stationary distribution of the Langevin diffusion in total variation distance or 2-Wasserstein distance, rather than the expected function value gap.\n\nOn the other hand, many attempts have been made to escape from saddle points in nonconvex optimization, such as cubic regularization [43, 54], the trust region Newton method [15], Hessian-vector product based methods [1, 9, 10], noisy gradient descent [24, 31, 32] and normalized gradient [36]. Yet all these algorithms are only guaranteed to converge to an approximate local minimum rather than a global minimum. Global convergence for nonconvex optimization remains understudied.\n\n1.3 Notation and Preliminaries\n\nIn this section, we present the notation used in this paper and some preliminaries for SDEs. We use a lower case bold symbol x to denote a deterministic vector, and an upper case italicized bold symbol X to denote a random vector. For a vector x ∈ R^d, we denote by ‖x‖_2 its Euclidean norm. We use a_n = O(b_n) to denote that a_n ≤ C b_n for some constant C > 0 independent of n. We write a_n ≲ b_n (a_n ≳ b_n) if a_n is smaller (larger) than b_n up to a constant. We also use the Õ(·) notation to hide both polynomials of logarithmic terms and constants.\n\nKolmogorov Operator and Infinitesimal Generator\n\nSuppose X(t) is the solution to the diffusion process represented by the stochastic differential equation (1.2). For such a continuous time Markov process, let P = {P_t}_{t>0} be the corresponding Markov semi-group [4], and define the Kolmogorov operator [4] P_s as follows:\n\nP_s g(X(t)) = E[g(X(s + t))|X(t)],\n\nwhere g is a smooth test function. By the Markov property, we have P_{s+t} = P_s ∘ P_t. 
Further, we define the infinitesimal generator [4] L of the semi-group to describe the movement of the process in an infinitesimal time interval:\n\nLg(X(t)) := lim_{h→0+} ( E[g(X(t + h))|X(t)] − g(X(t)) ) / h = −∇F_n(X(t))·∇g(X(t)) + β^{−1} Δg(X(t)),\n\nwhere β is the inverse temperature parameter.\n\nPoisson Equation and the Time Average\n\nPoisson equations are widely used in the study of homogenization and ergodic theory to prove the desired limit of a time average. Let L be the infinitesimal generator and let ψ be defined as follows:\n\nLψ = g − ḡ,    (1.3)\n\nwhere g is a smooth test function and ḡ is the expectation of g over the Gibbs measure, i.e., ḡ := ∫ g(x)π(dx). The smooth function ψ is called the solution of the Poisson equation (1.3). Importantly, it has been shown [23] that the first and second order derivatives of the solution of the Poisson equation for the Langevin diffusion can be bounded by polynomial growth functions.\n\n2 Review of Langevin Dynamics Based Algorithms\n\nIn this section, we briefly review three recently proposed Langevin dynamics based algorithms. In practice, numerical methods (a.k.a. numerical integrators) are used to approximate the Langevin diffusion in (1.2). For example, by the Euler-Maruyama scheme [34], (1.2) can be discretized as follows:\n\nX_{k+1} = X_k − η∇F_n(X_k) + √(2ηβ^{−1}) · ε_k,    (2.1)\n\nwhere ε_k ∈ R^d is standard Gaussian noise and η > 0 is the step size. The update in (2.1) resembles the gradient descent update except for an additional injected Gaussian noise, whose magnitude is controlled by the inverse temperature parameter β. In our paper, we refer to this update as Gradient Langevin Dynamics (GLD) [17, 20, 16]. The details of GLD are shown in Algorithm 1. When n is large, the above Euler approximation can be infeasible due to the high computational cost of the full gradient ∇F_n(X_k) at each iteration. 
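As a concrete illustration, the Euler-Maruyama update (2.1) is a one-line modification of gradient descent. Below is a minimal NumPy sketch; the quadratic objective and all parameter values are our own illustrative choices, not from the paper:

```python
import numpy as np

def gld_step(x, grad_fn, eta, beta, rng):
    """One GLD update (2.1): a gradient step plus sqrt(2*eta/beta) Gaussian noise."""
    noise = rng.standard_normal(x.shape)
    return x - eta * grad_fn(x) + np.sqrt(2.0 * eta / beta) * noise

# Illustrative objective F(x) = 0.5 * ||x||^2, whose gradient is x (hypothetical example).
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(1000):
    x = gld_step(x, lambda v: v, eta=0.01, beta=10.0, rng=rng)
# For this strongly convex F, the iterates fluctuate around the minimizer 0
# with stationary variance on the order of 1/beta per coordinate.
```

The only difference from plain gradient descent is the injected noise term, whose scale √(2η/β) shrinks as the inverse temperature β grows.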
A natural idea is to use stochastic gradient\nto approximate the full gradient, which gives rise to Stochastic Gradient Langevin Dynamics (SGLD)\n[52] and its variants [2, 39, 12]. However, the high variance brought by the stochastic gradient can\nmake the convergence of SGLD slow. To reduce the variance of the stochastic gradient and accelerate\nthe convergence of SGLD, we use a mini-batch of stochastic gradients in the following update form:\n(2.2)\n\nXk+1 = Xk \u2318rFn(Xk) +p2\u23181 \u00b7 \u270fk,\n\nYk+1 = Yk \u2318/BPi2Ikrfi(Yk) +p2\u23181 \u00b7 \u270fk,\n\nwhere 1/BPi2Ik rfi(Yk) is the stochastic gradient, which is an unbiased estimator for rFn(Yk)\nand Ik is a subset of {1, . . . , n} with |Ik| = B. Algorithm 2 displays the details of SGLD.\nMotivated by recent advances in stochastic optimization, in particular, the variance reduction based\ntechniques [33, 46, 3], Dubey et al. [19] proposed the Stochastic Variance Reduced Gradient Langevin\nDynamics (SVRG-LD) for posterior sampling. The key idea is to use semi-stochastic gradient to\nreduce the variance of the stochastic gradient. SVRG-LD takes the following update form:\n\nZk+1 = Zk \u2318erk +p2\u23181 \u00b7 \u270fk,\n\nwhere erk = 1/BPik2Ikrfik (Zk) rfik (eZ(s)) + rFn(eZ(s)) is the semi-stochastic gradient,\neZ(s) is a snapshot of Zk at every L iteration such that k = sL + ` for some ` = 0, 1, . . . , L 1, and\nIk is a subset of {1, . . . , n} with |Ik| = B. SVRG-LD is summarized in Algorithm 3.\nNote that although all the three algorithms are originally proposed for posterior sampling or more\ngenerally, Bayesian learning, they can be applied for nonconvex optimization, as demonstrated in\nmany previous studies [2, 45, 53].\nAlgorithm 1 Gradient Langevin Dynamics (GLD)\n\n(2.3)\n\ninput: step size \u2318> 0; inverse temperature parameter > 0; X0 = 0\nfor k = 0, 1, . . . 
, K − 1 do\n    randomly draw ε_k ∼ N(0, I_{d×d})\n    X_{k+1} = X_k − η∇F_n(X_k) + √(2η/β) · ε_k\nend for\n\n3 Main Theory\n\nBefore we present our main results, we first lay out the following assumptions on the loss function.\n\nAssumption 3.1 (Smoothness). The function f_i(x) is M-smooth for some M > 0, i = 1, . . . , n, i.e., for any x, y ∈ R^d,\n\n‖∇f_i(x) − ∇f_i(y)‖_2 ≤ M‖x − y‖_2.\n\nAlgorithm 2 Stochastic Gradient Langevin Dynamics (SGLD)\n\ninput: step size η > 0; batch size B; inverse temperature parameter β > 0; Y_0 = 0\nfor k = 0, 1, . . . , K − 1 do\n    randomly pick a subset I_k of {1, . . . , n} with |I_k| = B; randomly draw ε_k ∼ N(0, I_{d×d})\n    Y_{k+1} = Y_k − (η/B) Σ_{i∈I_k} ∇f_i(Y_k) + √(2η/β) · ε_k\nend for\n\nAlgorithm 3 Stochastic Variance Reduced Gradient Langevin Dynamics (SVRG-LD)\n\ninput: step size η > 0; batch size B; epoch length L; inverse temperature parameter β > 0\ninitialization: Z_0 = 0, Z̃^{(0)} = Z_0\nfor s = 0, 1, . . . , (K/L) − 1 do\n    W̃ = ∇F_n(Z̃^{(s)})\n    for ℓ = 0, . . . , L − 1 do\n        k = sL + ℓ\n        randomly pick a subset I_k of {1, . . . , n} with |I_k| = B; draw ε_k ∼ N(0, I_{d×d})\n        ∇̃_k = (1/B) Σ_{i_k∈I_k} ( ∇f_{i_k}(Z_k) − ∇f_{i_k}(Z̃^{(s)}) ) + W̃\n        Z_{k+1} = Z_k − η∇̃_k + √(2η/β) · ε_k\n    end for\n    Z̃^{(s+1)} = Z_{(s+1)L}\nend for\n\nAssumption 3.1 immediately implies that F_n(x) = (1/n) Σ_{i=1}^n f_i(x) is also M-smooth.\n\nAssumption 3.2 (Dissipative). There exist constants m, b > 0 such that for all x ∈ R^d,\n\n⟨∇F_n(x), x⟩ ≥ m‖x‖_2^2 − b.\n\nAssumption 3.2 is a typical assumption for the convergence analysis of SDEs and diffusion approximations [40, 45, 53], and can be satisfied by enforcing a weight decay regularization [45]. It says that, starting from a position sufficiently far away from the origin, the Markov process defined by (1.2) moves towards the origin on average. 
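As a quick numerical sanity check of Assumption 3.2, consider an illustrative weight-decay-regularized nonconvex objective; this specific F, its gradient, and the constants m, b below are our own hypothetical example, not from the paper:

```python
import numpy as np

# Hypothetical objective (illustration only): F(x) = sum_i cos(x_i) + 0.5 * ||x||_2^2,
# so grad F(x) = -sin(x) + x and <grad F(x), x> = <x - sin(x), x>.
# Since -t*sin(t) >= -(t^2 + 1)/2 for all t, Assumption 3.2 holds with m = 0.5, b = d/2.
def grad_F(x):
    return -np.sin(x) + x

def dissipative_gap(x, m=0.5):
    """<grad F(x), x> - (m * ||x||^2 - b) with b = d/2; nonnegative iff 3.2 holds at x."""
    b = x.shape[0] / 2.0
    return float(grad_F(x) @ x - (m * (x @ x) - b))

rng = np.random.default_rng(0)
gaps = [dissipative_gap(10.0 * rng.standard_normal(5)) for _ in range(1000)]
assert min(gaps) >= 0.0  # the dissipative inequality holds at every sampled point
```

Here the weight decay term 0.5‖x‖² supplies the inward drift m‖x‖² while the bounded-gradient cosine part is absorbed into the constant b, mirroring the remark that weight decay regularization enforces dissipativity.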
It can also be noted that all critical points are\n\nwithin the ball of radius O(pb/m) centered at the origin under this assumption.\nLet x\u21e4 = argminx2Rd Fn(x) be the global minimizer of Fn. Our ultimate goal is to prove the\nconvergence of the optimization error in expectation, i.e., E[Fn(Xk)] Fn(x\u21e4). In the sequel, we\ndecompose the optimization error into two parts: (1) E[Fn(Xk)] E[Fn(X \u21e1)], which characterizes\nthe gap between the expected function value at the k-th iterate Xk and the expected function value\nat X \u21e1, where X \u21e1 follows the stationary distribution \u21e1(dx) of Markov process {X(t)}t0, and (2)\nE[Fn(X \u21e1)] Fn(x\u21e4). Note that the error in part (1) is algorithm dependent, while that in part (2)\nonly depends on the diffusion itself and hence is identical for all Langevin dynamics based algorithms.\nNow we are ready to present our main results regarding to the optimization error of each algorithm\nreviewed in Section 2. We \ufb01rst show the optimization error bound of GLD (Algorithm 1).\nTheorem 3.3 (GLD). Under Assumptions 3.1 and 3.2, consider XK generated by Algorithm 1 with\ninitial point X0 = 0. The optimization error is bounded by\n\nE[Fn(XK)] Fn(x\u21e4) \uf8ff \u21e5eK\u2318 +\n\nC \u2318\n\n\n\n+\n\nd\n2\n\n(3.1)\n\nlog\u2713 eM (b/d + 1)\n\nm\n\n|\n\nRM\n\n{z\n\n,\n\n\u25c6\n}\n\nwhere problem-dependent parameters \u21e5 and are de\ufb01ned as\n\n\u21e5=\n\nC0M (b + m + d)(m + em\u2318M (b + m + d))\n\nm2\u21e2d/2\n\n, =\n\n2m\u21e2d\n\nlog(2M (b + m + d)/m)\n\n,\n\nand \u21e2 2 (0, 1), C0, C > 0 are constants.\nIn the optimization error of GLD (3.1), we denote the upper bound of the error term E[Fn(X \u21e1)] \nFn(x\u21e4) by RM, which characterizes the distance between the expected function value at X \u21e1 and the\nglobal minimum of Fn. The stationary distribution of Langevin diffusion \u21e1 / eFn(x) is a Gibbs\n\n5\n\n\fdistribution, which concentrates around the minimizer x\u21e4 of Fn. 
Thus a random vector X \u21e1 following\nthe law of \u21e1 is called an almost minimizer of Fn within a neighborhood of x\u21e4 with radius RM [45].\nIt is worth noting that the \ufb01rst term in (3.1) vanishes at a exponential rate due to the ergodicity of\nMarkov chain {Xk}k=0,1.... Moreover, the exponential convergence rate is controlled by , the\nspectral gap of the discrete-time Markov chain generated by GLD, which is in the order of eeO(d).\nBy setting E[Fn(XK)] E[Fn(X \u21e1)] to be less than a precision \u270f, and solving for K, we have the\nfollowing corollary on the iteration complexity for GLD to converge to the almost minimizer X \u21e1.\nCorollary 3.4 (GLD). Under the same conditions as in Theorem 3.3, provided that \u2318 . \u270f, GLD\nachieves E[Fn(XK)] E[Fn(X \u21e1)] \uf8ff \u270f with K = Od\u270f11 \u00b7 log(1/\u270f).\n\nRemark 3.5. In a seminal work by [45], they provided a non-asymptotic analysis of SGLD for non-\nconvex optimization. By setting the variance of stochastic gradient to 0, their result immediately sug-\ngests an O(d/(\u270f4\u21e4) log5((1/\u270f))) iteration complexity for GLD to converge to the almost minimizer\nup to precision \u270f. Here the quantity \u21e4 is the so-called uniform spectral gap for continuous-time\nMarkov process {Xt}t0 generated by Langevin dynamics. They further proved that \u21e4 = eeO(d),\nwhich is in the same order of our spectral gap for the discrete-time Markov chain {Xk}k=0,1...\ngenerated by GLD. Both of them match the lower bound for metastable exit times of SDE for\nnonconvex functions that have multiple local minima and saddle points [6]. Although for some\nspeci\ufb01c function Fn, the spectral gap may be reduced to polynomial in d [25], in general, the spectral\ngap for continuous-time Markov processes is in the same order as the spectral gap for discrete-time\nMarkov chains. 
Thus, the iteration complexity of GLD suggested by Corollary 3.4 is better than that\nsuggested by [45] by a factor of O(1/\u270f3).\n\nWe now present the following theorem, which states the optimization error of SGLD (Algorithm 2).\nTheorem 3.6 (SGLD). Under Assumptions 3.1 and 3.2, consider YK generated by Algorithm 2 with\ninitial point Y0 = 0, the optimization error is bounded by\n\nE[Fn(YK)] Fn(x\u21e4) \uf8ff C1K\u2318\uf8ff (n B)(Mp+ G)2\n\nB(n 1)\n\n1/2\n\n+\u21e5 eK\u2318 +\n\nC \u2318\n\n\n\n+ RM ,\n\n(3.2)\nwhere C1 is an absolute constant, C ,, \u21e5 and RM are the same as in Theorem 3.3, B is the\nmini-batch size, G = maxi=1,...,n{krfi(x\u21e4)k2} + bM/m and = 2(1 + 1/m)(b + 2G2 + d/).\nSimilar to Corollary 3.4, by setting E[Fn(Yk)] E[Fn(X \u21e1)] \uf8ff \u270f, we obtain the following corollary.\nCorollary 3.7 (SGLD). Under the same conditions as in Theorem 3.6, if \u2318 . \u270f, SGLD achieves\n\nE[Fn(YK)] E[Fn(X \u21e1)] = Od3/2B1/41 \u00b7 log(1/\u270f) + \u270f,\nwith K = Od\u270f11 \u00b7 log(1/\u270f), where B is the mini-batch size of Algorithm 2.\n\nRemark 3.8. Corollary 3.7 suggests that if the mini-batch size B is chosen to be large enough to\noffset the divergent term log(1/\u270f), SGLD is able to converge to the almost minimizer in terms of\nexpected function value gap. This is also suggested by the result in [45]. More speci\ufb01cally, the result\nin [45] implies that SGLD achieves\n\n(3.3)\n\nE[Fn(YK)] E[Fn(X \u21e1)] = Od21/4\u21e41 \u00b7 log(1/\u270f) + \u270f\n\nwith K = O(d/(\u21e4\u270f4) \u00b7 log5(1/\u270f)), where 2 is the upper bound of stochastic variance in SGLD,\nwhich can be reduced with larger batch size B. Recall that the spectral gap \u21e4 in their work scales as\nO(eeO(d)), which is in the same order as in Corollary 3.7. 
In comparison, our result in Corollary\n3.7 indicates that SGLD can actually achieve the same order of error for E[Fn(YK)] E[Fn(X \u21e1)]\nwith substantially fewer number of iterations, i.e., O(d/(\u270f) log(1/\u270f)) .\nRemark 3.9. To ensure SGLD converges in Corollary 3.7, one may set a suf\ufb01ciently large batch size\nB to offset the divergent term. For example, if we choose B & d64\u270f4 log4(1/\u270f), SGLD achieves\nE[Fn(YK)] E[Fn(X \u21e1)] \uf8ff \u270f within K = O(d/(\u270f) log(1/\u270f)) stochastic gradient evaluations.\nIn what follows, we proceed to present our result on the optimization error bound of SVRG-LD.\n\n6\n\n\fTheorem 3.10 (SVRG-LD). Under Assumptions 3.1 and 3.2, consider ZK generated by Algorithm\n3 with initial point Z0 = 0. The optimization error is bounded by\nE[Fn(ZK)] Fn(x\u21e4)\n\uf8ff C1K3/4\u2318\uf8ff LM 2(n B)\n+ RM , (3.4)\nwhere constants C1, C ,, \u21e5, , G and RM are the same as in Theorem 3.6, B is the mini-batch\nsize and L is the length of inner loop of Algorithm 3.\n\nB(n 1) \u27139\u2318L(M 2+ G2) +\n\n\u25c61/4\n\n+\u21e5 eK\u2318 +\n\nC \u2318\n\n\n\nd\n\nSimilar to Corollaries 3.4 and 3.7, we have the following iteration complexity for SVRG-LD.\nCorollary 3.11 (SVRG-LD). Under the same conditions as in Theorem 3.10, if \u2318 . \u270f, SVRG-LD\niterations. In addition, if we choose B = pn\u270f3/2, L = pn\u270f3/2, the number of stochastic gradient\n\nachieves E[Fn(ZK)] E[Fn(X \u21e1)] \uf8ff \u270f within K = OLd5B14\u270f4 \u00b7 log4(1/\u270f) + 1/\u270f total\nevaluations needed for SVRG-LD to achieve \u270f precision is eOpn\u270f5/2 \u00b7 eeO(d).\n\nRemark 3.12. In Theorem 3.10 and Corollary 3.11, we establish the global convergence guarantee\nfor SVRG-LD to an almost minimizer of a nonconvex function Fn. To the best of our knowledge,\nthis is the \ufb01rst iteration/gradient complexity guarantee for SVRG-LD in nonconvex \ufb01nite-sum\noptimization. Dubey et al. 
[19] first proposed the SVRG-LD algorithm for posterior sampling, but only proved that the mean square error between the averaged sample path and the stationary distribution converges to ε within Õ(1/ε^{3/2}) iterations, which has no implication for nonconvex optimization.\n\nTable 1: Gradient complexities to converge to the almost minimizer.\n\nAlgorithm | [45] | This paper\nGLD | Õ(n/ε^4) · e^{Õ(d)} | Õ(n/ε) · e^{Õ(d)}\nSGLD[5] | Õ(1/ε^8) · e^{Õ(d)} | Õ(1/ε^5) · e^{Õ(d)}\nSVRG-LD | N/A | Õ(√n/ε^{5/2}) · e^{Õ(d)}\n\nIn large scale machine learning problems, the evaluation of the full gradient can be quite expensive, in which case the iteration complexity is no longer appropriate to reflect the efficiency of different algorithms. To perform a comprehensive comparison among the three algorithms, we present their gradient complexities for converging to the almost minimizer X^π with ε precision in Table 1. Recall that gradient complexity is defined as the total number of stochastic gradient evaluations needed to achieve ε precision. It can be seen from Table 1 that the gradient complexity of GLD has a worse dependence on the number of component functions n, while SVRG-LD has a worse dependence on the optimization precision ε. More specifically, when the number of component functions satisfies n ≤ 1/ε^5, SVRG-LD achieves better gradient complexity than SGLD. Additionally, if n ≥ 1/ε^3, SVRG-LD is better than both GLD and SGLD, and is therefore more favorable.\n\n[5] For SGLD in [45], the result in the table is obtained by choosing the exact batch size suggested by the authors that makes the stochastic variance small enough to cancel out the divergent term in their paper.\n\n4 Proof Sketch of the Main Results\n\nIn this section, we highlight the high level ideas in our analysis of GLD, SGLD and SVRG-LD.\n\n4.1 Roadmap of the Proof\n\nRecall the problem in (1.1) and denote the global minimizer by x* = argmin_x F_n(x). {X(t)}_{t≥0} and {X_k}_{k=0,...,K} are the continuous-time and discrete-time Markov processes generated by the Langevin diffusion (1.2) and GLD respectively. We propose to decompose the optimization error as follows:\n\nE[F_n(X_k)] − F_n(x*) = { E[F_n(X_k)] − E[F_n(X^μ)] }_{I_1} + { E[F_n(X^μ)] − E[F_n(X^π)] }_{I_2} + { E[F_n(X^π)] − F_n(x*) }_{I_3},    (4.1)\n\nwhere X^μ follows the stationary distribution μ(dx) of the Markov process {X_k}_{k=0,1,...,K}, and X^π follows the stationary distribution π(dx) of the Markov process {X(t)}_{t≥0}, a.k.a. the Gibbs distribution. Following the existing literature [40, 41, 12], here we assume the existence of the stationary distributions, i.e., the ergodicity, of the Langevin diffusion (1.2) and its numerical approximation (2.2). Note that the ergodicity property of an SDE is not trivially guaranteed in general, and establishing the existence of the stationary distribution is beyond the scope of our paper. Yet we will discuss the circumstances under which geometric ergodicity holds in the Appendix.\n\nFigure 1: Illustration of the analysis framework in our paper (depicting X(t), X_k, X^μ, X^π and x*).\n\nWe illustrate the decomposition (4.1) in Figure 1. Unlike existing optimization analyses of SGLD such as [45], which measure the approximation error between X_k and X(t) (blue arrows in the chart), we directly analyze the geometric convergence of the discretized Markov chain X_k to its stationary distribution (red arrows in the chart). 
Since the distance between Xk and X(t) is a slow-\nconvergence term in [45], and the distance between X(t) and X \u21e1\ndepends on the uniform spectral gap, our new roadmap of proof will\nbypass both of these two terms, hence leads to a faster convergence\nrate.\nBounding I1: Geometric Ergodicity of GLD\nTo bound the \ufb01rst term in (4.1), we need to analyze the convergence of the Markov chain generated by\nAlgorithm 1 to its stationary distribution, namely, the ergodic property of the numerical approximation\nof Langevin dynamics. In probability theory, ergodicity describes the long time behavior of Markov\nprocesses. For a \ufb01nite-state Markov Chain, this is also closely related to the mixing time and has\nbeen thoroughly studied in the literature of Markov processes [28, 35, 4]. Note that Durmus and\nMoulines [21] studied the convergence of the Euler-Maruyama discretization (also referred to as the\nunadjusted Langevin algorithm) towards its stationary distribution in total variation. Nevertheless,\nthey only focus on strongly convex functions which are less challenging than our nonconvex setting.\nThe following lemma ensures the geometric ergodicity of gradient Langevin dynamics.\nLemma 4.1. Under Assumptions 3.1 and 3.2, the gradient Langevin dynamics (GLD) in Algorithm\n1 has a unique invariant measure \u00b5 on Rd. It holds that\n\nFigure 1: Illustration of the anal-\nysis framework in our paper.\n\nX\u00b5\n\n|E[Fn(Xk)] E[Fn(X \u00b5)]|\uf8ff C\uf8ff\u21e2d/2(1 + \uf8ffem\u2318) exp\u2713 \n\n2mk\u2318\u21e2d\n\nlog(\uf8ff) \u25c6,\n\nwhere \u21e2 2 (0, 1),C > 0 are absolute constants, and \uf8ff = 2M (b + m + d)/b.\nLemma 4.1 establishes the exponential decay of function gap between Fn(Xk) and Fn(X \u21e1) using\ncoupling techniques. 
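The geometric ergodicity described above can be made tangible with a toy experiment; the following sketch runs GLD on an illustrative one-dimensional double well (the objective and all constants are hypothetical choices of ours, not from the paper):

```python
import numpy as np

# Illustrative 1D double well (not from the paper): F(x) = (x^2 - 1)^2 + 0.3*x.
# Its global minimizer sits near x = -1; the well near x = +1 is strictly shallower.
grad_F = lambda x: 4.0 * x * (x * x - 1.0) + 0.3

def gld(x0, eta, beta, n_steps, seed=0):
    """Run GLD (2.1) on the double well and return all iterates."""
    rng = np.random.default_rng(seed)
    xs = np.empty(n_steps)
    x = x0
    for k in range(n_steps):
        x = x - eta * grad_F(x) + np.sqrt(2.0 * eta / beta) * rng.standard_normal()
        xs[k] = x
    return xs

xs = gld(x0=0.0, eta=0.01, beta=5.0, n_steps=50_000)
# After burn-in, the time average settles near the deeper well around x = -1,
# illustrating convergence to an almost minimizer despite nonconvexity.
mean_tail = xs[len(xs) // 2 :].mean()
```

With inverse temperature β = 5, the Gibbs measure puts most of its mass in the deeper well, so the tail average of the iterates is markedly negative even though the chain starts on the barrier at x = 0.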
Note that the exponential dependence on the dimension $d$ is consistent with the result in [45] obtained via entropy methods.

Bounding I2: Convergence to the Stationary Distribution of Langevin Diffusion
We now bound the distance between the two invariant measures $\mu$ and $\pi$ in terms of their expectations of the objective function $F_n$. Our proof is inspired by [51, 12]. The key insight is that, after establishing the geometric ergodicity of GLD, the stationarity of $\mu$ gives
$$\int F_n(x)\,\mu(dx) = \int \mathbb{E}\big[F_n(X_k)\,\big|\,X_0 = x\big]\,\mu(dx).$$
This property says that after reaching the stationary distribution, any further transition (GLD update) will not change the distribution. Thus we can bound the difference between the two invariant measures.

Lemma 4.2. Under Assumptions 3.1 and 3.2, the invariant measures $\mu$ and $\pi$ satisfy
$$\mathbb{E}[F_n(X^\mu)] - \mathbb{E}[F_n(X^\pi)] \le C_\psi\,\beta\eta,$$
where $C_\psi > 0$ is a constant that dominates $\mathbb{E}[\|\nabla^p \psi(X_k)\|]$ ($p = 0, 1, 2$) and $\psi$ is the solution of the Poisson equation (1.3).

Lemma 4.2 suggests that the bound on the difference between the two invariant measures depends on the numerical approximation step size $\eta$, the inverse temperature parameter $\beta$, and the upper bound $C_\psi$. We emphasize that the dependence on $\beta$ is reasonable, since different $\beta$ results in different diffusions, and further leads to different stationary distributions of the SDE and its numerical approximations.

Bounding I3: Gap between Langevin Diffusion and the Global Minimum
Most existing studies of Langevin dynamics based algorithms [52, 48, 12] focus on the convergence of the averaged sample path to the stationary distribution.
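The role of the inverse temperature $\beta$ in the remaining gap $I_3$ can be illustrated on a toy one-dimensional example: the Gibbs distribution $\pi(x) \propto \exp(-\beta F(x))$ places increasing mass near the global minimizer as $\beta$ grows, even when $F$ has other local minima. A sketch with a double-well objective of our own choosing (not from the paper):

```python
import math

def gibbs_mass_near(F, beta, xs, x_star, radius):
    # Mass of the grid-discretized Gibbs distribution pi(x) ~ exp(-beta * F(x))
    # falling within `radius` of the global minimizer x_star.
    weights = [math.exp(-beta * F(x)) for x in xs]
    total = sum(weights)
    near = sum(w for x, w in zip(xs, weights) if abs(x - x_star) <= radius)
    return near / total

# Double well: global minimum near x = 1, strictly higher local minimum near x = -1.
F = lambda x: (x ** 2 - 1.0) ** 2 - 0.5 * x
grid = [-2.0 + 0.01 * i for i in range(401)]

# Mass near the global minimizer grows monotonically with beta.
masses = [gibbs_mass_near(F, beta, grid, x_star=1.0, radius=0.5)
          for beta in (1.0, 5.0, 25.0)]
```

This concentration is exactly why a sample from $\pi$ serves as an "almost minimizer," with the residual gap quantified in the lemma below.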
The property of Langevin diffusion asymptotically concentrating on the global minimum of $F_n$ is well understood [14, 26], which makes convergence to a global minimum possible even when the function $F_n$ is nonconvex. We give an explicit bound between the stationary distribution of the Langevin diffusion and the global minimizer of $F_n$, i.e., the last term $\mathbb{E}[F_n(X^\pi)] - F_n(x^*)$ in (4.1). For a nonconvex objective function, this has been proved in [45] using the concept of differential entropy and the smoothness of $F_n$. We formally summarize it as the following lemma.

Lemma 4.3 ([45]). Under Assumptions 3.1 and 3.2, the model error $I_3$ in (4.1) can be bounded by
$$\mathbb{E}[F_n(X^\pi)] - F_n(x^*) \le \frac{d}{2\beta}\log\bigg(\frac{eM(m\beta/d + 1)}{m}\bigg),$$
where $X^\pi$ is a random vector following the stationary distribution of the Langevin diffusion (1.2).

Lemma 4.3 suggests that the Gibbs density concentrates around the global minimizer of the objective function. Therefore, the random vector $X^\pi$ following the Gibbs distribution $\pi$ is also referred to as an almost minimizer of the nonconvex function $F_n$ in [45].

4.2 Proof of Theorems 3.3, 3.6 and 3.10
We now integrate the previous lemmas to prove our main theorems in Section 3. First, substituting the results of Lemmas 4.1, 4.2 and 4.3 into (4.1) immediately yields the optimization error bound in (3.1) for GLD, which proves Theorem 3.3. Second, for the optimization error of SGLD (Algorithm 2), we only need to bound the error between $\mathbb{E}[F_n(Y_K)]$ and $\mathbb{E}[F_n(X_K)]$ and then apply the results for GLD; this bound is given by the following lemma.

Lemma 4.4.
Under Assumptions 3.1 and 3.2, with a mini-batch of size $B$, the output $Y_K$ of SGLD in Algorithm 2 and the output $X_K$ of GLD in Algorithm 1 satisfy
$$\big|\mathbb{E}[F_n(Y_K)] - \mathbb{E}[F_n(X_K)]\big| \le C_1\sqrt{\Gamma}\,\big(M\sqrt{\Gamma} + G\big)\,K\eta\,\bigg[\frac{n-B}{B(n-1)}\bigg]^{1/4}, \qquad (4.2)$$
where $C_1$ is an absolute constant and $\Gamma = 2(1 + 1/m)(b + 2G^2 + d/\beta)$.

Combining Lemmas 4.1, 4.2, 4.3 and 4.4 yields the desired result in (3.6) for SGLD, which completes the proof of Theorem 3.6. Third, similar to the proof for SGLD, the proof for SVRG-LD requires an additional bound between $F_n(Z_K)$ and $F_n(X_K)$, stated in the following lemma.

Lemma 4.5. Under Assumptions 3.1 and 3.2, with a mini-batch of size $B$, the output $Z_K$ of SVRG-LD in Algorithm 3 and the output $X_K$ of GLD in Algorithm 1 satisfy
$$\mathbb{E}[F_n(Z_K)] - \mathbb{E}[F_n(X_K)] \le C_1 K^{3/4}\eta\,\bigg[\frac{L M^2\Gamma\,(n-B)\big(3L\eta(M^2\Gamma + G^2) + d/(2\beta)\big)}{B(n-1)}\bigg]^{1/4},$$
where $\Gamma = 2(1 + 1/m)(b + 2G^2 + d/\beta)$, $C_1$ is an absolute constant, and $L$ is the number of inner loops in SVRG-LD.

The optimization error bound in (3.4) for SVRG-LD then follows from Lemmas 4.1, 4.2, 4.3 and 4.5.

5 Conclusions and Future Work
In this work, we present a new framework for analyzing the convergence of Langevin dynamics based algorithms, and provide a non-asymptotic analysis of their convergence for nonconvex finite-sum optimization. Comparing Langevin dynamics based algorithms with standard first-order optimization algorithms, we see that the counterparts of GLD and SVRG-LD are gradient descent (GD) and the stochastic variance-reduced gradient (SVRG) method, respectively. It has been proved that SVRG outperforms GD universally for nonconvex finite-sum optimization [46, 3]. This raises a natural question: can SVRG-LD be universally better than GLD for nonconvex optimization? We will attempt to answer this question in future work.

Acknowledgement
We would like to thank the anonymous reviewers for their helpful comments.
We thank Maxim Raginsky for insightful comments and discussion on the first version of this paper. We also thank Tianhao Wang for discussion on this work. This research was sponsored in part by the National Science Foundation IIS-1652539. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1195–1199, 2017.

[2] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1771–1778, 2012.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.

[4] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and Geometry of Markov Diffusion Operators, volume 348. Springer Science & Business Media, 2013.

[5] François Bolley and Cédric Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse. Série VI. Mathématiques, 14, 2005. doi: 10.5802/afst.1095.

[6] Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399–424, 2004.

[7] Nicolas Brosse, Alain Durmus, Éric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo.
In Conference on Learning Theory, pages 319–342, 2017.

[8] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.

[9] Yair Carmon and John C. Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.

[10] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

[11] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Proceedings of the 35th International Conference on Machine Learning, pages 764–773, 2018.

[12] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286, 2015.

[13] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 300–323, 2018.

[14] Tzuu-Shuh Chiang, Chii-Ruey Hwang, and Shuenn Jyi Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.

[15] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Mathematical Programming, pages 1–32, 2014.

[16] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory, pages 678–689, 2017.

[17] Arnak S. Dalalyan.
Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

[18] Arnak S. Dalalyan and Avetik G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.

[19] Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabas Poczos, Alexander J. Smola, and Eric P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.

[20] Alain Durmus and Eric Moulines. Non-asymptotic convergence analysis for the unadjusted Langevin algorithm. arXiv preprint arXiv:1507.05021, 2015.

[21] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559, 2016.

[22] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Proceedings of the 31st Conference On Learning Theory, pages 793–797, 2018.

[23] Murat A. Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018.

[24] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.

[25] Rong Ge, Holden Lee, and Andrej Risteski. Beyond log-concavity: Provable guarantees for sampling multi-modal distributions using simulated tempering Langevin Monte Carlo. arXiv preprint arXiv:1710.02736, 2017.

[26] Saul B. Gelfand and Sanjoy K. Mitter. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.

[27] Saeed Ghadimi and Guanghui Lan.
Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[28] Martin Hairer and Jonathan C. Mattingly. Spectral gaps in Wasserstein distances and the 2D stochastic Navier-Stokes equations. The Annals of Probability, pages 2050–2091, 2008.

[29] Chii-Ruey Hwang. Laplace's method revisited: weak convergence of probability measures. The Annals of Probability, pages 1177–1182, 1980.

[30] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic Differential Equations and Diffusion Processes, volume 24. Elsevier, 2014.

[31] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017.

[32] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning Theory, pages 1042–1085, 2018.

[33] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[34] Peter E. Kloeden and Eckhard Platen. Higher-order implicit strong numerical schemes for stochastic differential equations. Journal of Statistical Physics, 66(1):283–314, 1992.

[35] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.

[36] Kfir Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.

[37] Zhize Li, Tianyi Zhang, and Jian Li. Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference. arXiv preprint arXiv:1803.11159, 2018.

[38] Robert S. Liptser and Albert N. Shiryaev. Statistics of Random Processes: I.
General Theory, volume 5. Springer Science & Business Media, 2013.

[39] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

[40] Jonathan C. Mattingly, Andrew M. Stuart, and Desmond J. Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications, 101(2):185–232, 2002.

[41] Jonathan C. Mattingly, Andrew M. Stuart, and Michael V. Tretyakov. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.

[42] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[43] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[44] Yury Polyanskiy and Yihong Wu. Wasserstein continuity of entropy and outer bounds for interference channels. IEEE Transactions on Information Theory, 62(7):3992–4002, 2016.

[45] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703, 2017.

[46] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.

[47] Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

[48] Issei Sato and Hiroshi Nakagawa. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Itô process.
In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 982–990, 2014.

[49] Umut Simsekli, Cagatay Yildiz, Than Huy Nguyen, Taylan Cemgil, and Gael Richard. Asynchronous stochastic quasi-Newton MCMC for non-convex optimization. In Proceedings of the 35th International Conference on Machine Learning, pages 4674–4683, 2018.

[50] Belinda Tzen, Tengyuan Liang, and Maxim Raginsky. Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. In Proceedings of the 31st Conference On Learning Theory, pages 857–875, 2018.

[51] Sebastian J. Vollmer, Konstantinos C. Zygalakis, and Yee Whye Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.

[52] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688, 2011.

[53] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pages 1980–2022, 2017.

[54] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized Newton methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5990–5999, 2018.

[55] Difan Zou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced Hamilton Monte Carlo methods. In Proceedings of the 35th International Conference on Machine Learning, pages 6028–6037, 2018.

[56] Difan Zou, Pan Xu, and Quanquan Gu. Subsampled stochastic variance-reduced gradient Langevin dynamics.
In Proceedings of International Conference on Uncertainty in Artificial Intelligence, 2018.