{"title": "Continuous-time Models for Stochastic Optimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 12610, "page_last": 12622, "abstract": "We propose new continuous-time formulations for first-order stochastic optimization algorithms such as mini-batch gradient descent and variance-reduced methods. We exploit these continuous-time models, together with simple Lyapunov analysis as well as tools from stochastic calculus, in order to derive convergence bounds for various types of non-convex functions. Guided by such analysis, we show that the same Lyapunov arguments hold in discrete-time, leading to matching rates. In addition, we use these models and Ito calculus to infer novel insights on the dynamics of SGD, proving that a decreasing learning rate acts as time warping or, equivalently, as landscape stretching.", "full_text": "Continuous-time Models\n\nfor Stochastic Optimization Algorithms\n\nAntonio Orvieto\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland \u21e4\n\nAurelien Lucchi\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland\n\nAbstract\n\nWe propose new continuous-time formulations for \ufb01rst-order stochastic optimiza-\ntion algorithms such as mini-batch gradient descent and variance-reduced methods.\nWe exploit these continuous-time models, together with simple Lyapunov analysis\nas well as tools from stochastic calculus, in order to derive convergence bounds\nfor various types of non-convex functions. Guided by such analysis, we show that\nthe same Lyapunov arguments hold in discrete-time, leading to matching rates.\nIn addition, we use these models and It\u00f4 calculus to infer novel insights on the\ndynamics of SGD, proving that a decreasing learning rate acts as time warping or,\nequivalently, as landscape stretching.\n\nIntroduction\n\n1\nWe consider the problem of \ufb01nding the minimizer of a smooth non-convex function f : Rd ! R:\nx\u21e4 := arg minx2Rd f (x). 
We are here specifically interested in a finite-sum setting, commonly encountered in machine learning, where f(·) can be written as a sum of individual functions over datapoints. In such settings, the optimization method of choice is mini-batch Stochastic Gradient Descent (MB-SGD), which iteratively computes stochastic gradients by averaging over sampled datapoints. The advantage of this approach is its cheap per-iteration complexity, which is independent of the size of the dataset. This is especially relevant given the rapid growth in the size of the datasets commonly used in machine learning applications. However, the steps of MB-SGD have a high variance, which can significantly slow down convergence [22, 36]. In the case where f(·) is a strongly-convex function, SGD with a decreasing learning rate achieves a sublinear rate of convergence in the number of iterations, while its deterministic counterpart (i.e. full Gradient Descent, GD) exhibits a linear rate of convergence. There are various ways to improve this rate. The first obvious alternative is to systematically increase the size of the mini-batch at each iteration: [20] showed that a controlled increase of the mini-batch size yields faster rates of convergence. An alternative, which has become popular recently, is to use variance reduction (VR) techniques such as SAG [56], SVRG [32], SAGA [16], etc. The high-level idea behind such algorithms is to re-use past gradients on top of MB-SGD in order to reduce the variance of the stochastic gradients. This idea leads to faster rates: for general L-smooth objectives, both SVRG and SAGA find an ε-approximate stationary point² in O(L n^{2/3}/ε) stochastic gradient computations [3, 53], compared to the O(Ln/ε) needed for GD [45] and the O(1/ε²) needed for MB-SGD [22].
As a consequence, most modern state-of-the-art optimizers designed for general smooth objectives (Natasha [2], SCSG [37], Katyusha [1], etc.) are based on such methods. The optimization algorithms discussed above are typically analyzed in their discrete form. One alternative that has recently become popular in machine learning is to view these methods as continuous-time processes. By doing so, one can take advantage of numerous tools from the field of differential equations and stochastic calculus. This has led to new insights about non-trivial phenomena in non-convex optimization [40, 31, 60] and has allowed for more compact proofs of convergence for gradient methods [57, 42, 34]. This perspective appears to be very fruitful, since it has also led to the development of new discrete algorithms [68, 9, 64, 65]. Finally, this connection goes beyond the study of algorithms, and can be used for neural network architecture design [14, 12].

Footnote *: Correspondence to orvietoa@ethz.ch
Footnote 2: A point x where ||∇f(x)|| ≤ ε.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This success is not surprising, given the impact of continuous-time models in various scientific fields including, e.g., mathematical finance, where these models are often used to get closed-form solutions for derivative prices that are not available for discrete models (see e.g. the celebrated Black–Scholes formula [10], which is derived from Itô's lemma [30]). Many other success stories come from statistical physics [18], biology [24] and engineering. Nonetheless, an important question, which has encouraged numerous debates (see e.g. [62]), is about the reason behind the effectiveness of continuous-time models. In optimization, this question is partially addressed for deterministic accelerated methods by the works of [63, 9, 57], which provide a link between continuous and discrete time.
However, we found that this problem has received less attention in the context of stochastic non-convex optimization and does not cover recent developments such as [32]. We therefore focus on the latter setting, for which we provide detailed comparisons and analysis of continuous- and discrete-time methods. The paper is organized as follows:

1. In Sec. 2 we build new continuous-time models for SVRG and mini-batch SGD, which include the effect of decaying learning rates and increasing batch sizes. We show existence and uniqueness of the solution to the corresponding stochastic differential equations.

2. In Sec. 3.1 we derive novel and interpretable non-asymptotic convergence rates for our models, using the elegant machinery provided by stochastic calculus. We focus on various classes of non-convex functions relevant for machine learning (see list in Sec. 3).

3. In Sec. 3.2 we complement each of our rates in continuous time with equivalent results for the algorithmic counterparts, using the same Lyapunov functions. This shows an algebraic equivalence between continuous and discrete time and proves the effectiveness of our modeling technique. To the best of our knowledge, most of these rates (in full generality) are novel³.

4. In Sec. 4.1 we provide a new interpretation of the distribution induced by SGD with decreasing stepsizes, based on Øksendal's time change formula, which reveals an underlying time warping phenomenon that can be used for designing Lyapunov functions.

5. In Sec. 4.2 we provide a dual interpretation of this last phenomenon as landscape stretching.

At a deeper level, our work proves that continuous-time models can adequately guide the analysis of stochastic gradient methods and provide new thought-provoking perspectives on their dynamics.

2 Unified models of stochastic gradient methods

Let {f_i}_{i=1}^N be a collection of functions s.t. f_i : R^d →
R for any i ∈ [N] and f(·) := (1/N) Σ_{i=1}^N f_i(·). In order to minimize f(·), first-order stochastic optimization algorithms rely on some noisy (but usually unbiased) estimator G(·) of the gradient ∇f(·). In its full generality, Stochastic Gradient Descent (SGD) builds a sequence of estimates of the solution x* in a recursive way:

x_{k+1} = x_k − η_k G({x_i}_{0≤i≤k}, k),        (SGD)

where (η_k)_{k≥0} is a non-increasing deterministic sequence of positive numbers called the learning-rate sequence. Since G(x_k, k) is stochastic, {x_k}_{k≥0} is a stochastic process on some countable probability space (Ω, F, P). Throughout this paper, we denote by {F_k}_{k≥0} the natural filtration induced by {x_k}_{k≥0}; by E the expectation over all the information F_∞; and by E_{F_k} the conditional expectation given the information at step k. We consider the two following popular designs for G(·).

i) MB gradient estimator. The mini-batch gradient estimator at iteration k is G_MB(x_k, k) := (1/b_k) Σ_{i_k∈Ω_k} ∇f_{i_k}(x_k), where b_k := |Ω_k| and the elements of Ω_k (the mini-batch) are sampled at each iteration k independently, uniformly and with replacement from [N]. Since Ω_k is random, G_MB(x_k) is a random variable with conditional (i.e. taking out randomness in x_k) mean and covariance

E_{F_{k−1}}[G_MB(x_k, k)] = ∇f(x_k),    Cov_{F_{k−1}}[G_MB(x_k, k)] = Σ_MB(x_k)/b_k,        (1)

where Σ_MB(x) := (1/N) Σ_{i=1}^N (∇f(x) − ∇f_i(x))(∇f(x) − ∇f_i(x))^T is the one-sample covariance.

Footnote 3: We derive these rates in App. E and summarize them in Tb. 3.

ii) VR gradient estimator. The basic idea of the original SVRG algorithm introduced in [32] is to compute the full gradient at some chosen pivot point and combine it with stochastic gradients computed at subsequent iterations.
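Before moving on, the mini-batch estimator G_MB of i) can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's code; the quadratic per-sample losses f_i(x) = ½||x − a_i||² and all names are illustrative assumptions.

```python
import numpy as np

def mb_gradient(grad_fi, x, n, batch_size, rng):
    # G_MB(x) = (1/b) * sum of per-sample gradients over a batch
    # sampled uniformly WITH replacement from [N]
    idx = rng.integers(0, n, size=batch_size)
    return np.mean([grad_fi(i, x) for i in idx], axis=0)

# Illustrative finite sum: f_i(x) = 0.5*||x - a_i||^2, so grad f_i(x) = x - a_i
rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.normal(size=(n, d))
grad_fi = lambda i, x: x - A[i]

x = np.zeros(d)
full_grad = x - A.mean(axis=0)   # grad f(x) = (1/N) sum_i grad f_i(x)
# Unbiasedness (Eq. 1): the average of many batch estimates approaches grad f(x)
avg = np.mean([mb_gradient(grad_fi, x, n, 32, rng) for _ in range(2000)], axis=0)
assert np.allclose(avg, full_grad, atol=0.05)
```

The with-replacement sampling is what makes the conditional covariance exactly Σ_MB(x)/b_k, as stated in Eq. (1).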
Combined with mini-batching [53], this gradient estimator is

G_VR(x_k, x̃_k, k) := (1/b_k) Σ_{i_k∈Ω_k} (∇f_{i_k}(x_k) − ∇f_{i_k}(x̃_k)) + ∇f(x̃_k),

where x̃_k ∈ {x_0, x_1, ..., x_{k−1}} is the pivot used at iteration k. This estimator is unbiased, i.e. E_{F_{k−1}}[G_VR(x_k, x̃_k, k)] = ∇f(x_k). Its covariance is Cov_{F_{k−1}}[G_VR(x_k, x̃_k, k)] = Σ_VR(x_k, x̃_k)/b_k, with Σ_VR(x, y) := (1/N) Σ_{i=1}^N (∇f_i(x) − ∇f_i(y) + ∇f(y) − ∇f(x))(∇f_i(x) − ∇f_i(y) + ∇f(y) − ∇f(x))^T.

2.1 Building the perturbed gradient flow model

We take inspiration from [38] and [27] and build continuous-time models for SGD with either the MB or the SVRG gradient estimators. The procedure has three steps.

(S1) We first define the discretization stepsize h := η_0: this variable is essential to provide a link between continuous and discrete time. We assume it to be fixed for the rest of this subsection. Next, we define the adjustment-factor sequence (ψ_k)_{k≥0} s.t. ψ_k = η_k/h (cf. Eq. 9 in [38]). In this way we decouple the two pieces of information contained in η_k: h controls the overall size of the learning rate, and ψ_k handles its variation⁴ during training.

(S2) Second, we write SGD as x_{k+1} = x_k − η_k(∇f(x_k) + V_k), where the error V_k has mean zero and covariance Σ_k. Next, letting Σ_k^{1/2} be the principal square root⁵ of Σ_k, we can write SGD as

x_{k+1} = x_k − η_k ∇f(x_k) − η_k Σ_k^{1/2} Z_k,        (PGD)

where Z_k is a random variable with zero mean and unit covariance⁶. In order to build simple continuous-time models, we assume that each Z_k is Gaussian distributed: Z_k ∼ N(0_d, I_d). To highlight this assumption, we will refer to the last recursion as Perturbed Gradient Descent (PGD) [15]. Later in this subsection we motivate why this assumption, which is commonly used in the literature [38], is not restrictive for our purposes.
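The PGD recursion of (S2) can be sketched directly. A minimal NumPy sketch under illustrative assumptions: f(x) = ½||x||² and a constant, state-independent Σ_k^{1/2} (in the paper's models Σ_k depends on x_k through Σ_MB or Σ_VR).

```python
import numpy as np

def pgd_step(x, grad_f, sigma_half, eta, rng):
    # x_{k+1} = x_k - eta * grad f(x_k) - eta * Sigma^{1/2} Z_k,  Z_k ~ N(0, I)
    z = rng.standard_normal(x.shape)
    return x - eta * grad_f(x) - eta * (sigma_half(x) @ z)

# Illustrative setup
grad_f = lambda x: x                     # f(x) = 0.5*||x||^2
sigma_half = lambda x: 0.1 * np.eye(2)   # placeholder principal square root
rng = np.random.default_rng(1)

x = np.array([1.0, -1.0])
for k in range(200):
    eta_k = 0.1 / (k + 1)                # non-increasing learning-rate sequence
    x = pgd_step(x, grad_f, sigma_half, eta_k, rng)
assert np.linalg.norm(x) < np.linalg.norm([1.0, -1.0])
```

With the decomposition h := η_0 and ψ_k := η_k/h of (S1), this same loop is exactly the MB-PGD / VR-PGD recursion once the appropriate covariance is plugged in.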
By plugging in either Σ_k = Σ_MB(x_k)/b_k or Σ_k = Σ_VR(x_k, x̃_k)/b_k, we get a discrete model for SGD with the MB or VR gradient estimators.

(S3) Finally, we lift these PGD models to continuous time. The first step is to rewrite them using ψ_k:

x_{k+1} = x_k − ψ_k ∇f(x_k) h + ψ_k √(h/b_k) σ_MB(x_k) √h Z_k,        (MB-PGD)
x_{k+1} = x_k − ψ_k ∇f(x_k) h + ψ_k √(h/b_k) σ_VR(x_k, x_{k−ξ_k}) √h Z_k,        (VR-PGD)

(in both recursions, the first correction term is the adjusted gradient drift and the second is the adjusted mini-batch, resp. variance-reduced, volatility), where σ_MB(x) := Σ_MB^{1/2}(x), σ_VR(x, y) := Σ_VR^{1/2}(x, y) and ξ_k ∈ [k] quantifies the pivot staleness. Readers familiar with stochastic analysis might recognize that MB-PGD and VR-PGD are the steps of a numerical integrator (with stepsize h) of an SDE and of an SDDE, respectively. For the convenience of the reader, we give a hands-on introduction to these objects in App. B.

Footnote 4: A popular choice (see e.g. [43]) is η_k = C k^{−α}, α ∈ [0, 1]. Here, h = C and ψ_k = k^{−α} ∈ [0, 1].
Footnote 5: The unique positive semidefinite matrix such that Σ_k = Σ_k^{1/2} Σ_k^{1/2}.
Footnote 6: Because Σ_k^{1/2} Z_k has the same distribution as V_k, conditioned on x_k.

The resulting continuous-time models, which we analyse in this paper, are

dX(t) = −ψ(t) ∇f(X(t)) dt + ψ(t) √(h/b(t)) σ_MB(X(t)) dB(t),        (MB-PGF)
dX(t) = −ψ(t) ∇f(X(t)) dt + ψ(t) √(h/b(t)) σ_VR(X(t), X(t − ξ(t))) dB(t),        (VR-PGF)

where
• ξ : R+ → [0, T], the staleness function, is s.t. ξ(hk) = ξ_k for all k ≥ 0;
• ψ(·) ∈ C¹(R+, [0, 1]), the adjustment function, is s.t. ψ(hk) = ψ_k for all k ≥ 0 and dψ(t)/dt ≤ 0;
• b(·) ∈ C¹(R+, R+), the mini-batch size function, is s.t.
b(hk) = b_k for all k ≥ 0 and b(t) ≥ 1;
• {B(t)}_{t≥0} is a d-dimensional Brownian Motion on some filtered probability space.

We conclude this subsection with some important remarks and clarifications on the procedure above.

On the Gaussian assumption. In (S2) we assumed that Z_k is Gaussian distributed. If the mini-batch size b_k is large enough and the gradients are sampled from a distribution with finite variance, then the assumption is sound: indeed, by the Berry–Esseen Theorem (see e.g. [17]), Z_k approaches N(0_d, I_d) in distribution at a rate O(1/√b_k). However, if b_k is small or the underlying variance is unbounded, the distribution of Z_k has heavy tails [58]. Nonetheless, in the large-scale optimization literature, the gradient variance is generally assumed to be bounded (see e.g. [22], [11]); hence, we keep this assumption, which is practical and reasonable for many problems (and likewise assumed in the related literature [51, 42, 34, 38, 39]). Also, taking a different yet enlightening perspective, it is easy to see (see Sec. 4 of [11]) that, if one cares only about expected convergence guarantees, only the first and second moments of the stochastic gradients have a quantitative effect on the rate.

Approximation guarantees. Recently, [28, 39] showed that for a special case of MB-PGF (ψ_k = 1 and b_k constant), its solution {X(t)}_{0≤t≤T} compares to SGD as follows: let K = ⌊T/h⌋ and consider the iterates {x_k}_{k∈[K]} of mini-batch SGD (i.e. without the Gaussian assumption) with fixed learning rate h. Under mild assumptions on f(·), there exists a constant C (independent of h) such that ||E[x_k] − E[X(kh)]|| ≤ Ch for all k ∈ [K]. Their proof argument relies on semi-group expansions of the solution to the Kolmogorov backward equation, and can be adapted to provide a similar result for our (more general) equations.
However, this approach to motivating the continuous-time formulation is very limited, as C depends exponentially on T (see also [57]). Nonetheless, under strong convexity, some uniform-in-time (a.k.a. shadowing) results were recently derived in [48, 19]. In this paper, we take a different approach (similarly to [57] for deterministic methods) and provide instead matching convergence rates in continuous and in discrete time using the same Lyapunov function. We note that this is still a very strong indication of the effectiveness of our model for studying SGD, since it shows an algebraic equivalence between the continuous and the discrete case.

Comparison to the "ODE method". A powerful technique in stochastic approximation [36] is to study SGD through the deterministic ODE Ẋ = −∇f(X). A key result is that SGD with decreasing learning rate, under the Robbins–Monro [55] conditions, behaves like this ODE in the limit. Hence the ODE can be used to characterize the asymptotic behaviour of SGD. In this work we instead take inspiration from more recent literature [39] and build stochastic models which include the effect of a decreasing learning rate in the drift and the volatility coefficients through the adjustment function ψ(·). In contrast to the ODE method⁷, this allows us to provide non-asymptotic arguments and convergence rates.

Local minima width. Our models confirm, as noted in [31], that the ratio of (initial) learning rate h to batch size b(t) is a determining factor of SGD dynamics. Compared to [31], our model is more general: indeed, we will see in Sec. 4.2 that the adjustment function also plays a fundamental role in determining the width of the final minima, since it acts like a "function stretcher".

2.2 Existence and uniqueness

Prior works that take an approach similar to ours [35, 27, 42] assume the one-sample volatility σ(·) to be Lipschitz continuous.
This makes the proof of existence and uniqueness straightforward (cf. a textbook like [41]), but we claim such an assumption is not trivial in our setting, where σ(·) is data-dependent. Indeed, σ(·) is the result of a square-root operation on the gradient covariance, and the square-root function is not Lipschitz around zero. App. C is dedicated to a rigorous proof of existence and uniqueness, which is verified under the following condition:

(H) Each f_i(·) is C³, with bounded third derivative and L-smooth.

Footnote 7: The ODE method is instead suitable to assess almost-sure convergence and convergence in probability, which are not considered in this paper for the sake of delivering convergence rates for population quantities.

This hypothesis is arguably not restrictive, as it is usually satisfied by many loss functions encountered in machine learning. As a result, under (H), with probability 1 the realizations of the stochastic processes {f(X(t))}_{t>0} and {X(t)}_{t>0} are continuous functions of time.

3 Matching convergence rates in continuous and discrete time

Even though convex functions are central objects of study in optimization, many interesting objectives found in machine learning are non-convex. However, most of the time such functions still exhibit some regularity. For instance, [25] showed that linear LSTMs induce weakly-quasi-convex objectives.

(HWQC) f(·) is C¹ and there exist τ > 0 and x* s.t. ⟨∇f(x), x − x*⟩ ≥ τ(f(x) − f(x*)) for all x ∈ R^d.

Intuitively, (HWQC) requires the negative gradient to always be aligned with the direction of a global minimum x*. Convex differentiable functions are weakly-quasi-convex (with τ = 1), but the WQC class is richer and actually allows functions to be locally concave.
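Such inequalities are easy to probe numerically. A small sketch with an illustrative convex test function (not from the paper): f(x) = ||x||², which satisfies (HWQC) with τ = 1 since ⟨∇f(x), x − 0⟩ = 2||x||² ≥ ||x||².

```python
import numpy as np

def wqc_margin(f, grad_f, x_star, tau, xs):
    # smallest slack of  <grad f(x), x - x*> - tau*(f(x) - f(x*))
    # over a set of test points xs; a negative value means (HWQC) is
    # violated at some test point
    return min(grad_f(x) @ (x - x_star) - tau * (f(x) - f(x_star)) for x in xs)

# Illustrative convex example: f(x) = ||x||^2, x* = 0, tau = 1
f = lambda x: x @ x
grad_f = lambda x: 2.0 * x
rng = np.random.default_rng(2)
xs = [rng.normal(size=3) for _ in range(100)]
assert wqc_margin(f, grad_f, np.zeros(3), tau=1.0, xs=xs) >= 0.0
```

Of course a finite grid can only refute, never prove, such a global property; the check is meant as intuition for the definitions, not as a verification procedure.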
Another important class of problems (e.g., under some assumptions, matrix completion [61]) satisfies the Polyak-Łojasiewicz property, which is the weakest known sufficient condition for GD to achieve linear convergence [50].

(HPŁ) f(·) is C¹ and there exists μ > 0 s.t. ||∇f(x)||² ≥ 2μ(f(x) − f(x*)) for all x ∈ R^d.

One can verify that if f(·) is strongly-convex, then it is PŁ. However, PŁ functions are not necessarily convex. What's more, a broad class of problems (dictionary learning [5], phase retrieval [13], two-layer MLPs [39]) is related to a stronger condition: the restricted secant inequality [66].

(HRSI) f(·) is C¹ and there exists μ > 0 s.t. ⟨∇f(x), x − x*⟩ ≥ (μ/2)||x − x*||² for all x ∈ R^d.

In [33] the authors prove strong-convexity ⇒ (HRSI) ⇒ (HPŁ) (with different constants).

3.1 Continuous-time analysis

First, we derive non-asymptotic rates for MB-PGF. For convenience, we define φ(t) := ∫₀ᵗ ψ(s) ds, which plays a fundamental role (see Sec. 4.1). As in [42, 34], we introduce a bound on the volatility.

(Hσ) σ*² := sup_{x∈R^d} ||σ_MB(x) σ_MB(x)^T||_s < ∞, where ||·||_s denotes the spectral norm.

Theorem 1. Assume (H), (Hσ). Let t > 0 and t̃ ∈ [0, t] be a random time point with distribution ψ(t̃)/φ(t) for t̃ ∈ [0, t] (and 0 otherwise). The solution to MB-PGF is s.t.

E[||∇f(X(t̃))||²] ≤ 2(f(x₀) − f(x*))/φ(t) + (h d L σ*²)/φ(t) ∫₀ᵗ (ψ(s)²/b(s)) ds.

Proof. We use the energy function E(x, t) := f(x) − f(x*). Details in App. D.2. □

Theorem 2. Assume (H), (Hσ), (HWQC). Let t̃ be as in Thm. 1. The solution to MB-PGF is s.t.

E[f(X(t̃)) − f(x*)] ≤ ||x₀ − x*||²/(2τφ(t)) + (h d σ*²)/(2τφ(t)) ∫₀ᵗ (ψ(s)²/b(s)) ds,        (W1)

E[f(X(t)) − f(x*)] ≤ ||x₀ − x*||²/(2τφ(t)) + (h d σ*²)/(2τφ(t)) ∫₀ᵗ (Lτφ(s) + 1)(ψ(s)²/b(s)) ds.        (W2)

Proof. We use the energy functions E₁, E₂ s.t.
E₁(x) := ½||x − x*||² for (W1) and E₂(x, t) := τφ(t)(f(x) − f(x*)) + ½||x − x*||² for (W2), respectively. Details in App. D.2. □

Theorem 3. Assume (H), (Hσ), (HPŁ). The solution to MB-PGF is s.t.

E[f(X(t)) − f(x*)] ≤ e^{−2μφ(t)}(f(x₀) − f(x*)) + (h d L σ*²/2) ∫₀ᵗ (ψ(s)²/b(s)) e^{−2μ(φ(t)−φ(s))} ds.

Proof. We use the energy function E(x, t) := e^{2μφ(t)}(f(x) − f(x*)). Details in App. D.2. □

Table 1: Asymptotic rates for MB-PGF under ψ(t) = O(t^{−a}), in the form O(t^{−β}); β is shown in the table as a function of a. "(~)" indicates randomization of the result; "×" indicates that no rate is given. Rates match Tb. 1 in [43].

a ∈        | (~),(H),(Hσ): Cor. 3 | (~),(H),(Hσ),(HWQC): Cor. 2 | (H),(Hσ),(HWQC): Cor. 2 | (H),(Hσ),(HPŁ): Cor. 1
(0, 1/2)   | a                    | ×                           | a                       | a
(1/2, 2/3) | a                    | 2a − 1                      | 1 − a                   | 1 − a
(2/3, 1)   | a                    | 1 − a                       | 1 − a                   | 1 − a

Decreasing mini-batch size. From Thms. 1, 2, 3 it is clear that, as is well known [11, 6], a simple way to converge to a local minimizer is to pick b(·) increasing as a function of time. However, this corresponds to dramatically increasing the complexity in terms of gradient computations. In continuous time, we can account for this by introducing ρ(t) := ∫₀ᵗ b(s) ds, proportional to the number of computed gradients at time t. The complexity in number of gradient computations can then be derived by substituting the new time variable ρ^{−1}(t) for t in the final rate. As we will see in Thm. 5, this concept extends to a more general setting and leads to valuable insights.

Asymptotic rates. Another way to guarantee convergence to a local minimizer is to decrease ψ(·). In App. D.3 we derive asymptotic rates for ψ(t) = O(t^{−a}) and report the results in Tb. 1. The results match exactly the corresponding known rates for SGD, stated under stronger assumptions in [43].
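These continuous-time predictions can also be probed by direct simulation of MB-PGF. A minimal Euler–Maruyama sketch under illustrative assumptions: one-dimensional f(x) = ½x² (which is PŁ with μ = 1), σ = 1, b(t) = 1, and ψ(t) = 1/(t + 1); the solver and all parameter values are ours, not the paper's experimental setup.

```python
import numpy as np

def euler_maruyama(grad_f, psi, b, h, x0, T, dt, rng):
    # integrates dX = -psi(t) grad f(X) dt + psi(t) sqrt(h/b(t)) dB  (sigma = I)
    x, t = np.array(x0, dtype=float), 0.0
    while t < T:
        dB = rng.standard_normal(x.shape) * np.sqrt(dt)
        x = x - psi(t) * grad_f(x) * dt + psi(t) * np.sqrt(h / b(t)) * dB
        t += dt
    return x

# PL example: f(x) = 0.5*x^2 (mu = 1).  Thm. 3 predicts decay like
# exp(-2*mu*phi(t)) = (t+1)^{-2}, plus a small noise-driven term.
finals = np.array([euler_maruyama(lambda x: x, lambda t: 1.0 / (t + 1.0),
                                  lambda t: 1.0, 0.05, [2.0], 20.0, 1e-2,
                                  np.random.default_rng(s))
                   for s in range(50)])
suboptimality = float(np.mean(0.5 * finals ** 2))  # estimate of E[f(X(T)) - f(x*)]
assert suboptimality < 0.5 * 2.0 ** 2              # well below the initial gap
```

With h = 0.05 and T = 20, the empirical suboptimality sits far below the initial gap of 2, consistent with the (t + 1)^{−2} decay predicted by Thm. 3.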
As for increasing b(·), decreasing ψ(·) can also be seen as performing a time warp (see Thm. 5).

Ball of convergence. For ψ(t) = 1, the sub-optimality gap derived in App. D.3.1 matches [11]. In contrast to G_MB(·), [3, 4] have shown that significant speed-ups are hard to obtain from parallel gradient computations (i.e. for b(t) > 1) using G_VR(·)⁸. Also, our results for MB-PGF, as well as prior work [67, 3, 4, 53], suggest that linear rates can only be obtained with ψ(t) = 1. Hence, for our analysis of VR-PGF, we focus on the case b(t) = ψ(t) = 1. The following result, in the spirit of [32, 4], relates to the so-called Option II of SVRG.

Theorem 4. Assume (H), (HRSI) and choose the staleness ξ(t) to be the sawtooth wave of period T (so that the pivot is reset at every time jT, j ∈ N). Let {X(t)}_{t≥0} be the solution to VR-PGF with additional jumps at times (jT)_{j∈N}: we pick X(jT + T) uniformly in {X(s)}_{jT≤s<(j+1)T}. Then,

E[||X(jT) − x*||²] = ((2hL²T + 1)/(T(μ − 2hL²)))^j ||x₀ − x*||².

Previous literature (SDEs for MB-PGF). [42] studied dual averaging using a similar SDE model in the convex setting, under vanishing and persistent volatility. Part of their results are similar, yet less general and not directly comparable. [51] studied a specific case of our equations, under constant volatility (see also [52] and references therein). [34, 65, 64] studied extensions of [42] including acceleration [45] and AC-SA [21]. To the best of our knowledge, there has not yet been any analysis of SVRG in continuous time in the literature.

3.2 Discrete-time analysis and algebraic equivalence

We provide matching algorithmic counterparts (using the same Lyapunov function) for all our non-asymptotic rates in App. D, along with Tb. 2 summarizing the results. We stress that the rates we prove in discrete time (i.e. for SGD with gradient estimators G_MB or G_VR) hold without the Gaussian noise assumption.
This is a key result of this paper: it indicates that the tools of Itô calculus [30], which are able to provide more compact proofs [42, 52], yield calculations equivalent to the ones used to analyze standard SGD. We invite the curious reader to go through the proofs in the appendix to appreciate this correspondence, and to inspect Tb. 3 in the appendix, which provides a comparison of the discrete-time rates with Thms. 1, 2, 3 and 4.

Footnote 8: See e.g. Thm. 7 in [53] for a counterexample.

Cond. | Rate (discrete time, no Gaussian assumption) | Thm.
(~),(H-),(Hσ) | 2(f(x₀) − f(x*))/(hφ_{k+1}) + (h d L σ*²)/(hφ_{k+1}) Σ_{i=0}^k h ψ_i²/b_i | E.1
(~),(H-),(Hσ),(HWQC) | ||x₀ − x*||²/(τ hφ_{k+1}) + (d h σ*²)/(τ hφ_{k+1}) Σ_{i=0}^k h ψ_i²/b_i | E.2
(H-),(Hσ),(HWQC) | ||x₀ − x*||²/(2τ hφ_{k+1}) + (h d σ*²)/(2τ hφ_{k+1}) Σ_{i=0}^k (1 + τφ_{i+1}L) h ψ_i²/b_i | E.2
(H-),(Hσ),(HPŁ) | Π_{i=0}^k (1 − μhψ_i)(f(x₀) − f(x*)) + (h d L σ*²/2) Σ_{i=0}^k [Π_{ℓ=0}^k (1 − μhψ_ℓ) / Π_{j=0}^i (1 − μhψ_j)] h ψ_i²/b_i | E.3
(H-),(HRSI) | ((1 + 2L²h²m)/(hm(μ − 2L²h)))^j ||x₀ − x*||² (under variance reduction) | E.4

Table 2: Summary of the rates we show in the appendix for SGD with mini-batch and VR, using a Lyapunov argument inspired by the continuous-time analysis. (~) indicates randomized output. The reader should compare the results with Thms. 1, 2, 3, 4 (explicit comparison on the first page of the appendix). For the definition of the quantities in the rates, see App. E.

Now we ask the simple question: why is this the case? Using the concept of derivation from abstract algebra, in App. A.2 we show that the discrete difference operator and the derivative operator enjoy similar algebraic properties.
Crucially, this is due to the smoothness of the underlying objective, which implies a chain rule⁹ for the difference operator. Hence, this equivalence is tightly linked with optimization and might lead to numerous insights. We leave the exploration of this fascinating direction to future work.

Literature comparison (algorithms). Even though partial¹⁰ results have been derived for the function classes described above in [25, 54], an in-depth non-asymptotic analysis was still missing. The rates in Tb. 3 (stated above in continuous time as theorems) generalize the results of [43] to the weaker function classes we considered (we never assume convexity). Regarding SVRG, the rate we report uses a proof similar¹¹ to [4, 53] and is comparable to [32] (under convexity).

4 Insights provided by continuous-time models

Building on the tools we used so far, we provide novel insights on the dynamics of SGD. First, in order to consider both MB-PGF and VR-PGF at the same time, we introduce a stochastic¹² matrix process {σ(t)}_{t≥0} adapted to the Brownian motion:

dX(t) = −ψ(t) ∇f(X(t)) dt + ψ(t) √(h/b(t)) σ(t) dB(t).        (PGF)

We show that annealing the learning rate through a decreasing ψ(·) can be viewed as performing a time dilation or, alternatively, as directly stretching the objective function. This view is inspired by the use of Girsanov's theorem [23] in finance: a deep result in stochastic analysis which is the formal concept underlying the change of measure from the real world to the "risk-neutral" world.

Footnote 9: This is a key formula in the continuous-time analysis to compute the derivative of a Lyapunov function.
Footnote 10: The convergence under weak-quasi-convexity using a learning rate C/√k and a randomized output is studied in [25] (Prop. 2.3 under Eq. 2.2 of their paper). Along the same line, [33] studied convergence for PŁ using a learning rate C/√k and assuming bounded stochastic gradients. These results are strictly contained
These results are strictly contained\nin our rates.\n\n11In particular, the lack of convexity causes the factor L2 in the linear rate.\n12For MB-PGF, {(t)}t0 := {(X(t))}t0. For VR-PGF, {(t)}t0 := {(X(t), X(t \u21e0(t)))}t0.\n\n7\n\n\f4.1 Time stretching through \u00d8ksendal\u2019s formula\nWe notice that, in Thm. 1,2,3, the time variable t is always \ufb01ltered through the map '(\u00b7). Hence, '(\u00b7)\nseems to act as a new time variable. We show this rigorously using \u00d8ksendal\u2019s time change formula.\nTheorem 5. Let {X(t)}t0 satisfy PGF and de\ufb01ne \u2327 (\u00b7) = '1(\u00b7), where '(t) =R t\n0 (s)ds.\nFor all t 0, X (\u2327 (t)) = Y (t) in distribution, where {Y (t)}t0 has the stochastic differential\n\ndY (t) = rf (Y (t))dt +ph (\u2327 (t))/b(\u2327 (t))(\u2327 (t)) dB(t).\n\nProof. We use the substitution formula for deterministic integrals combined with \u00d8ksendal\u2019s formula\n\u2305\nfor time change in stochastic integrals \u2014 a key result in SDE theory. Details in App. F.\n\nExample. We consider b(t) = 1, (s) = Id and (t) = 1/(t+\n1) (popular annealing procedure [11]); we have '(t) = log(t + 1)\nph\nand \u2327 (t) = et 1. dX(t) = 1\nt+1 dB(t) is\nt+1rf (X(t))dt \ns.t. the sped-up solution Y (t) = X(et 1) satis\ufb01es\ndY (t) = rf (X(t))dt + phetdB(t).\n\n(2)\n\nIn the example, Eq. (2) is the model for SGD with constant learning\nrates but rapidly vanishing noise \u2014 which is arguably easier to\nstudy compared to the original equation, that also includes time-\nvarying learning rates. Hence, this result draws a connection to\nSGLD [52] and to prior work on SDE models [42], which only\nconsidered (t) = 1. But, most importantly \u2014 Thm. 5 allows\nfor more \ufb02exibility in the analysis: to derive convergence rates13\none could work with either X(as we did in Sec. 
2) or with $Y$ (and slow down the rates afterwards).

We verify this result on a one-dimensional quadratic, under the choice of parameters in our example, using an Euler-Maruyama simulation (i.e. PGD) with $h = 10^{-3}$, $\lambda = 5$. In Fig. 1 we show the mean and standard deviation relative to 20 realizations of the Gaussian noise. Note that, in the case of variance reduction, the volatility is decreasing as a function of time [3] even with $\psi(t) = 1$; hence one gets a similar result without the change of variable.

Figure 1: Verification of Thm. 5 on a 1d quadratic (100 samples): empirically $X(t) \stackrel{d}{=} Y(\varphi(t))$.

4.2 Landscape stretching via solution feedback

Consider the (potentially non-convex) quadratic $f(x) = \langle x - x^\star, H(x - x^\star)\rangle$. WLOG we assume $x^\star = 0_d$ and that $H$ is diagonal. For simplicity, consider again the case $b(t) = 1$, $\sigma(s) = I_d$ and $\psi(t) = 1/(t+1)$. PGF reduces to a linear stochastic system:
$$dX(t) = -\frac{1}{t+1} H X(t)\,dt + \frac{\sqrt{h}}{t+1}\,dB(t).$$
By the variation-of-constants formula [41], the expectation evolves without bias: $d\,\mathbb{E}[X(t)] = -\frac{1}{t+1} H\,\mathbb{E}[X(t)]\,dt$. If we denote by $u_i(t)$ the $i$-th coordinate of $\mathbb{E}[X(t)]$, we have $\frac{d}{dt} u_i(t) = -\frac{\lambda_i}{t+1} u_i(t)$, where $\lambda_i$ is the eigenvalue relative to the $i$-th direction. Using separation of variables, we find $u_i(t) = (t+1)^{-\lambda_i} u_i^0$. Moreover, we can invert space and time: $t = (u_i^0/u_i(t))^{1/\lambda_i} - 1$. Feeding this equation back into the original differential, the system becomes autonomous:
$$\frac{d}{dt} u_i(t) = -\lambda_i\,(u_i^0)^{-1/\lambda_i}\, u_i(t)^{1 + 1/\lambda_i}.$$

Figure 2: Landscape stretching for an isotropic paraboloid.

13 The design of the Lyapunov function might be easier if we change the time variable. This is the case in our setting, where $\varphi(t)$ comes directly into the Lyapunov functions and would simply be $t$ for the transformed SDE.

From this simple derivation we get two important insights on the dynamics of PGF:

1. Comparing the solution $u_i(t) = (t+1)^{-\lambda_i} u_i^0$ with the solution $v_i(t) = e^{-\lambda_i t} u_i^0$ one would obtain with $\psi(t) = 1$, we notice that the dynamics in the first case is much slower: we get polynomial convergence and divergence (when $\lambda_i < 0$) as opposed to exponential. This quantitatively shows that decreasing the learning rate can slow down (from exponential to polynomial) the dynamics of SGD around saddle points. However, note that, even though the speeds differ, $u_i(\cdot)$ and $v_i(\cdot)$ move along the same path14 by Thm. 5.

2. Inspecting the equivalent formulation $\frac{d}{dt} u_i(t) = -\lambda_i (u_i^0)^{-1/\lambda_i} u_i(t)^{1+1/\lambda_i}$, we notice with surprise that this is a gradient system: the RHS can be written as $-C(\lambda_i, u_i^0)\,\nabla g_i(u_i(t))$, where $g_i(x) = x^{2 + 1/\lambda_i}$ is the equivalent landscape in the $i$-th direction. In particular, PGF on the simple quadratic $\frac{1}{2}\|x\|^2$ with learning rate decreasing as $1/t$ behaves in expectation like PGF with a constant learning rate on a cubic. This sheds new light on the fact that, as is well known from the literature [44], by decreasing the learning rate we can only achieve sublinear convergence rates on strongly convex stochastic problems. From our perspective, this happens simply because the equivalent stretched landscape has vanishing curvature; hence, it is not strongly convex. We illustrate this last example in Fig. 2 and note that the stretching effect is tangent to the expected solution (in solid line).

We believe the landscape stretching phenomenon we just outlined to be quite general and to also hold asymptotically under strong convexity15: indeed, it is well known that, by Taylor's theorem, in a neighborhood of the solution of a strongly convex problem the cost behaves like its quadratic approximation. In dynamical systems, this linearization argument can be made precise and goes under the name of the Hartman-Grobman theorem (see e.g. [49]).
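The two insights above admit a quick numerical sanity check. The following sketch is ours, not from the paper; the choices $\lambda = 1$, $u_i^0 = 1$ and the Euler step size are illustrative. It integrates the fed-back autonomous ODE $\frac{d}{dt}u = -\lambda (u^0)^{-1/\lambda} u^{1+1/\lambda}$ and confirms that it reproduces the polynomial decay $(t+1)^{-\lambda} u^0$, far slower than the exponential decay $e^{-\lambda t} u^0$ obtained with $\psi(t) = 1$:

```python
import math

def closed_form(lam, u0, t):
    # Expected PGF dynamics with psi(t) = 1/(t+1): u(t) = (t+1)^(-lam) * u0
    return (t + 1.0) ** (-lam) * u0

def feedback_ode(lam, u0, T, dt=1e-4):
    # Forward-Euler integration of the autonomous (fed-back) system
    #   du/dt = -lam * (u0)^(-1/lam) * u^(1 + 1/lam),
    # which should reproduce the closed-form solution above.
    u = u0
    for _ in range(int(T / dt)):
        u += dt * (-lam) * u0 ** (-1.0 / lam) * u ** (1.0 + 1.0 / lam)
    return u

lam, u0, T = 1.0, 1.0, 9.0
u_ode = feedback_ode(lam, u0, T)
u_exact = closed_form(lam, u0, T)       # (9+1)^(-1) = 0.1: polynomial decay
u_const_lr = math.exp(-lam * T) * u0    # exponential decay with psi(t) = 1, much smaller
print(u_ode, u_exact, u_const_lr)
```

Note that for $\lambda = 1$ the equivalent landscape is the cubic $g(x) = x^3$ (here $\nabla g(u) = 3u^2$ and $C = 1/3$), matching the remark in insight 2.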
Since the SDE we studied is memoryless (no momentum), at some point it will necessarily enter a neighborhood of the solution where the dynamics is described by the result in this section. We leave the verification and formalization of the argument we just outlined to future research.

5 Conclusion

We provided a detailed comparison and analysis of continuous- and discrete-time methods in the context of stochastic non-convex optimization. Notably, our analysis covers the variance-reduced method introduced in [32]. The continuous-time perspective allowed us to deliver new insights about how decreasing step-sizes lead to time and landscape stretching. There are many potentially interesting directions for future research, such as extending our analysis to mirror descent or accelerated gradient descent [35, 60], or studying state-of-the-art stochastic non-convex optimizers such as Natasha [2]. Finally, we believe it would be interesting to expand the work of [38, 39] to better characterize the convergence of MB-SGD and SVRG to the SDEs we studied here, perhaps with some asymptotic arguments similar to the ones used in mean-field theory [7, 8].

References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194-8244, 2017.

[2] Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning, pages 89-97. JMLR.org, 2017.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699-707, 2016.

[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives.
In International Conference on Machine Learning, pages 1080-1089, 2016.

[5] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Proceedings of Machine Learning Research, 2015.

14 One is the time-changed version of the other (consider Thm. 5 with $\sigma(t) = 0$); see also Fig. 1.
15 Perhaps also in the neighborhood of any hyperbolic fixed point, with implications for saddle-point evasion.

[6] Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.

[7] Michel Benaïm. Recursive algorithms, urn processes and chaining number of chain recurrent sets. Ergodic Theory and Dynamical Systems, 18(1):53-87, 1998.

[8] Michel Benaim and Jean-Yves Le Boudec. A class of mean field interaction models for computer and communication systems. Performance Evaluation, 65(11-12):823-838, 2008.

[9] Michael Betancourt, Michael I Jordan, and Ashia C Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.

[10] Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.

[11] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

[12] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, pages 6572-6583. Curran Associates, Inc., 2018.

[13] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems.
In Advances in Neural Information Processing Systems, pages 739-747, 2015.

[14] Marco Ciccone, Marco Gallieri, Jonathan Masci, Christian Osendorfer, and Faustino Gomez. NAIS-Net: Stable deep networks from non-autonomous differential equations. arXiv preprint arXiv:1804.07209, 2018.

[15] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.

[16] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.

[17] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.

[18] Albert Einstein et al. On the motion of small particles suspended in liquids at rest required by the molecular-kinetic theory of heat. Annalen der Physik, 17:549-560, 1905.

[19] Yuanyuan Feng, Tingran Gao, Lei Li, Jian-Guo Liu, and Yulong Lu. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation. arXiv preprint arXiv:1902.00635, 2019.

[20] Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380-A1405, 2012.

[21] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469-1492, 2012.

[22] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

[23] Igor Vladimirovich Girsanov. On transforming a certain class of stochastic processes by absolutely continuous substitution of measures.
Theory of Probability & Its Applications, 5(3):285-301, 1960.

[24] Narendra S Goel and Nira Richter-Dyn. Stochastic Models in Biology. Elsevier, 2016.

[25] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025-1068, 2018.

[26] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251-2259, 2015.

[27] Li He, Qi Meng, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. Differential equations for modeling asynchronous algorithms. arXiv preprint arXiv:1805.02991, 2018.

[28] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.

[29] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic Differential Equations and Diffusion Processes, volume 24. Elsevier, 2014.

[30] Kiyosi Itô. 109. Stochastic integral. Proceedings of the Imperial Academy, 20(8):519-524, 1944.

[31] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[32] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[33] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer, 2016.

[34] Walid Krichene and Peter L Bartlett.
Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796-6806, 2017.

[35] Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2845-2853, 2015.

[36] Harold Kushner and G George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

[37] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2348-2358, 2017.

[38] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101-2110, 2017.

[39] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597-607, 2017.

[40] Stephan Mandt, Matthew Hoffman, and David Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354-363, 2016.

[41] Xuerong Mao. Stochastic Differential Equations and Applications. Elsevier, 2007.

[42] Panayotis Mertikopoulos and Mathias Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163-197, 2018.

[43] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451-459, 2011.

[44] Arkadii Semenovich Nemirovsky and David Borisovich Yudin.
Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.

[45] Yurii Nesterov. Lectures on Convex Optimization. Springer, 2018.

[46] Bernt Øksendal. When is a stochastic integral a time change of a diffusion? Journal of Theoretical Probability, 3(2):207-226, 1990.

[47] Bernt Øksendal. Stochastic Differential Equations. Springer, 2003.

[48] Antonio Orvieto and Aurelien Lucchi. Shadowing properties of optimization algorithms. arXiv preprint, 2019.

[49] Lawrence Perko. Differential Equations and Dynamical Systems, volume 7. Springer Science & Business Media, 2013.

[50] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.

[51] Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 6793-6800. IEEE, 2012.

[52] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.

[53] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.

[54] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145-1153, 2016.

[55] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

[56] Nicolas L Roux, Mark Schmidt, and Francis R Bach.
A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663-2671, 2012.

[57] Bin Shi, Simon S Du, Weijie J Su, and Michael I Jordan. Acceleration via symplectic discretization of high-resolution differential equations. arXiv preprint arXiv:1902.03694, 2019.

[58] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053, 2019.

[59] Daniel W Stroock and SR Srinivasa Varadhan. Multidimensional Diffusion Processes. Springer, 2007.

[60] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510-2518, 2014.

[61] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535-6579, 2016.

[62] Eugene P Wigner. The unreasonable effectiveness of mathematics in the natural sciences. In Mathematics and Science, pages 291-306. World Scientific, 1990.

[63] Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

[64] Pan Xu, Tianhao Wang, and Quanquan Gu. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1087-1096, 2018.

[65] Pan Xu, Tianhao Wang, and Quanquan Gu. Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning, pages 5488-5497, 2018.

[66] Hui Zhang.
New analysis of linear convergence of gradient-type methods via unifying error bound conditions. Mathematical Programming, pages 1-46, 2016.

[67] Hui Zhang and Wotao Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.

[68] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521, 2018.