{"title": "First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 283, "abstract": "Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of `preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding on how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. 
We illustrate our results with simulations on a synthetic model and neural networks.", "full_text": "First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Thanh Huy Nguyen1, Umut Şimşekli1,2, Mert Gürbüzbalaban3, Gaël Richard1

1: LTCI, Télécom Paris, Institut Polytechnique de Paris, France
2: Department of Statistics, University of Oxford, UK
3: Dept. of Management Science and Information Systems, Rutgers Business School, NJ, USA

Abstract

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using α-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of 'preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding of how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system.
In this study, we provide a formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.

1 Introduction

Stochastic gradient descent (SGD) is one of the most popular algorithms in machine learning due to its scalability to large dimensional problems as well as favorable generalization properties. SGD algorithms are applicable to a broad set of convex and non-convex optimization problems arising in machine learning [1, 2], including deep learning where they have been particularly successful [3, 4, 5]. In deep learning, many key tasks can be formulated as the following non-convex optimization problem:

min_{w∈R^d} f(w) = (1/n) Σ_{i=1}^{n} f^(i)(w),  (1)

where w ∈ R^d contains the weights for the deep network to estimate, f^(i): R^d → R is the typically non-convex loss function corresponding to the i-th data point, and n is the number of data points [6, 7, 5]. SGD iterations consist of

W_{k+1} = W_k − η ∇f̃_k(W_k),  k ≥ 0,  (2)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Illustration of SαS (left), L^α_t (middle), wide-narrow minima (right).

where η is the step-size, k denotes the iterations, W_0 ∈ R^d is the initial point, and ∇f̃_k(W_k) is an unbiased estimator of the actual gradient ∇f(W_k), estimated from a subset of the component functions {f^(i)}_{i=1}^{n}.
In particular, the gradients of the objective are estimated as averages of the form

∇f̃_k(W_k) ≜ ∇f̃_{Ω_k}(W_k) ≜ (1/b) Σ_{i∈Ω_k} ∇f^(i)(W_k),  (3)

where Ω_k ⊂ {1, ..., n} is a random subset that is drawn with or without replacement at iteration k, and b = |Ω_k| denotes the number of elements in Ω_k [1].

The popularity and success of SGD in practice have motivated researchers to investigate and analyze the reasons behind them, a topic which has been an active research area [6, 4]. One well-known hypothesis [8] that has gained recent popularity (see e.g. [4, 9]) is that, among all the local minima lying on the non-convex energy landscape defined by the loss function (1), the local minima that lie in wide valleys generalize better than those in sharp valleys, and that SGD is able to converge to the "right local minimum" that generalizes better. This is visualized in Figure 1(right), where the local minimum on the right lies in a wide valley of width w2, compared to the local minimum on the left, which lies in a sharp valley of width w1 and depth h. Interpreting this hypothesis and the structure of the local minima found by SGD clearly requires a deeper understanding of the statistical properties of the gradient noise Z_k ≜ ∇f̃(W_k) − ∇f(W_k) and its implications on the dynamics of SGD. A number of papers in the literature argue that the noise has a Gaussian structure [10, 7, 11, 12, 13, 3]. Under the Gaussian noise assumption, the following continuous-time limit of SGD has been considered in the literature to analyze the behavior of SGD:

dW(t) = −∇f(W(t)) dt + √η σ dB(t),  (4)

where B(t) is the standard Brownian motion, σ is the noise variance, and η is the step-size. The Gaussianity of the gradient noise implicitly assumes that the gradient noise has a finite variance with light tails.
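As a concrete illustration, the minibatch estimator (3) and the SGD iteration (2) can be sketched as follows. This is a minimal sketch under illustrative assumptions: the quadratic least-squares components and all variable names are our own choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: least-squares components f^(i)(w) = 0.5 * (x_i^T w - y_i)^2.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def grad_component(w, i):
    """Gradient of the i-th component function f^(i)."""
    return (X[i] @ w - y[i]) * X[i]

def minibatch_grad(w, b):
    """Unbiased gradient estimate, Eq. (3): average over a random subset Omega_k of size b."""
    omega = rng.choice(n, size=b, replace=False)  # drawn without replacement
    return sum(grad_component(w, i) for i in omega) / b

# SGD iterations, Eq. (2): W_{k+1} = W_k - eta * grad_tilde_k(W_k).
eta = 0.05
w = np.zeros(d)
for k in range(300):
    w = w - eta * minibatch_grad(w, b=32)

print(np.linalg.norm(w - w_true))  # small for this convex toy problem
```

Each minibatch gradient is an unbiased estimate of ∇f(w); the gradient noise Z_k studied in the paper is exactly the gap between this estimate and the full gradient.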
In a recent study, [6] empirically illustrated that in various deep learning settings the gradient noise admits a heavy-tailed behavior, which suggests that the Gaussian-based approximation is not always appropriate, and furthermore, that the heavy-tailed noise can be modeled by a symmetric α-stable distribution (SαS(σ)). Here, α ∈ (0, 2] is called the tail-index and characterizes the heavy-tailedness of the distribution, and σ is a scale parameter that will be formally defined in Section 2. This α-stable model generalizes the Gaussian model in the sense that α = 2 reduces to the Gaussian model, whereas smaller values of α quantify the heavy-tailedness of the gradient noise (see Figure 1(left)). Under this noise model, the resulting continuous-time limit of SGD becomes [6]:

dW(t) = −∇f(W(t)) dt + η^{(α−1)/α} σ dL^α(t),  (5)

where L^α(t) is the d-dimensional α-stable Lévy motion with independent components (which will be formally defined in Section 2). This process has also been investigated for Bayesian posterior sampling [14] and global non-convex optimization [15].

The sample paths of the Lévy-driven SDE (5) have a fundamentally different behavior than those of the Brownian-motion-driven dynamics (4). This difference mainly originates from the fact that, unlike the Brownian motion, which has almost surely continuous sample paths, the Lévy motion can have discontinuities, which are also called 'jumps' [16] (cf. Figure 1(middle)). This fundamental difference becomes more prominent in the metastability properties of the SDE (5): consider a basin in which a particle is initialized and undergoes fluctuations continually. The particle persists in the basin for a long time before exiting it by the influence of the fluctuations.
This relative instability phenomenon is described by the term 'metastability'.

More formally, the metastability studies consider the case where W(0) is initialized in a basin and analyze the minimum time t such that W(t) exits that basin. It has been shown that when α < 2 (i.e. the noise has a heavy-tailed component), this so-called first exit time only depends on the width of the basin and the value of α, and it does not depend on the height of the basin [17, 18, 19].
The empirical results in [6] showed that, in various deep learning settings, the estimated tail index α is significantly smaller than 2, suggesting that the metastability results can be used as a proxy for understanding the dynamics of SGD in discrete time, especially to shed more light on the hypothesis that SGD prefers wide minima.

While this approach brings a new perspective for analyzing SGD, approximating SGD by a continuous-time process might not be accurate for every step-size η, and some theoretical concerns have already been raised about the validity of such approximations [20]. Intuitively, one can expect that the metastable behavior of SGD would be similar to the behavior of its continuous-time limit only when the discretization step-size is small enough. Even though some theoretical results have recently been established for the discretizations of SDEs driven by Brownian motion [21], it is not clear how the discretized Lévy SDEs behave in terms of metastability.

In this study, we provide formal theoretical analyses where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system (7) is guaranteed to be close to its continuous-time limit (6). More precisely, we consider a stochastic differential equation with both a Brownian term and a Lévy term, and its Euler discretization, as follows [22]:

dW(t) = −∇f(W(t−)) dt + εσ dB(t) + ε dL^α(t),  (6)
W_{k+1} = W_k − η ∇f(W_k) + εσ η^{1/2} ξ_k + ε η^{1/α} ζ_k,  (7)

with independent and identically distributed (i.i.d.) variables ξ_k ∼ N(0, I), where I is the identity matrix, the components of ζ_k i.i.d. with SαS(1) distribution, and ε the amplitude of the noise. This dynamics includes (4) and (5) as special cases.
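A minimal simulation of the Euler scheme (7) might look as follows. The symmetric α-stable draws use the Chambers-Mallows-Stuck formula, a standard sampling method for SαS variables; the quadratic objective and all parameter values below are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

def sas_sample(alpha, size, rng):
    """Draw SαS(1) samples via the Chambers-Mallows-Stuck formula (symmetric case)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    e = rng.exponential(1.0, size)                 # unit-mean exponential
    if alpha == 1.0:
        return np.tan(u)                           # Cauchy special case
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos((1 - alpha) * u) / e) ** ((1 - alpha) / alpha))

def euler_step(wk, grad_f, eta, eps, sigma, alpha, rng):
    """One step of Eq. (7):
    W_{k+1} = W_k - eta*grad f(W_k) + eps*sigma*eta^{1/2}*xi_k + eps*eta^{1/alpha}*zeta_k."""
    d = wk.shape[0]
    xi = rng.normal(size=d)              # Gaussian part, N(0, I)
    zeta = sas_sample(alpha, d, rng)     # heavy-tailed part, SαS(1) per coordinate
    return (wk - eta * grad_f(wk)
            + eps * sigma * eta ** 0.5 * xi
            + eps * eta ** (1 / alpha) * zeta)

# Illustrative run on f(w) = 0.5 * ||w||^2, so grad f(w) = w.
rng = np.random.default_rng(1)
w = np.zeros(2)
for _ in range(1000):
    w = euler_step(w, lambda v: v, eta=1e-3, eps=0.1, sigma=1.0, alpha=1.8, rng=rng)
```

For α = 2 the sampler reduces to N(0, 2), matching the SαS(σ) convention used in Section 2, so the same step function covers both the Gaussian dynamics (4) and the heavy-tailed dynamics (5) as special cases.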
Here, we choose σ as a scalar for convenience; however, our analyses can be extended to the case where σ is a function of W(t).

Understanding the metastability behavior of SGD modeled by these dynamics requires understanding how long it takes for the continuous-time process W(t) given by (6) and its discretization W_k given by (7) to exit a neighborhood of a local minimum w̄, if started in that neighborhood. For this purpose, for any given local minimum w̄ of f and a > 0, we define the following set:

A ≜ {(w_1, ..., w_K) ∈ R^d × ... × R^d : max_{k≤K} ‖w_k − w̄‖ ≤ a},  (8)

which is the set of K points in R^d, each at a distance of at most a from the local minimum w̄. We formally define the first exit times, respectively for W(t) and W_k, as follows:

τ_{ξ,a}(ε) ≜ inf{t ≥ 0 : ‖W(t) − w̄‖ ∉ [0, a + ξ]},  (9)
τ̄_{ξ,a}(ε) ≜ inf{k ∈ N : ‖W_k − w̄‖ ∉ [0, a + ξ]},  (10)

where the processes are initialized at W(0) ≡ W_0 such that ‖W(0) − w̄‖ ∈ [0, a]. Our main result (Theorem 2) shows that, with a sufficiently small discretization step η, the probability that the discretized process exits a given neighborhood of the local optimum by a fixed time t approximates that of the continuous process. This result also provides an explicit condition on the step-size, which explains how the other parameters of the problem, such as the dimension d, the noise amplitude ε, and the variance σ of the Gaussian noise, affect the similarity of the discretized and continuous processes. We validate our theory on a synthetic model and neural networks.

Notations. 
For z > 0, the gamma function is defined as Γ(z) ≜ ∫_0^∞ x^{z−1} e^{−x} dx. For any Borel probability measures µ and ν with domain Ω, the total variation (TV) distance is defined as follows: ‖µ − ν‖_TV ≜ 2 sup_{A∈B(Ω)} |µ(A) − ν(A)|, where B(Ω) denotes the Borel subsets of Ω.

2 Technical Background

Symmetric α-stable distributions. The SαS distribution is a generalization of a centered Gaussian distribution, where α ∈ (0, 2] is called the tail index, a parameter that determines the amount of heavy-tailedness. We say that X ∼ SαS(σ) if its characteristic function satisfies E[e^{iωX}] = e^{−|σω|^α}, where σ ∈ (0, ∞) is called the scale parameter. In the special case α = 2, SαS(σ) reduces to the normal distribution N(0, 2σ²). A crucial property of the α-stable distributions is that, when X ∼ SαS(σ) with α < 2, the moment E[|X|^p] is finite if and only if p < α, which implies that SαS has infinite variance as soon as α < 2. While the probability density function does not have a closed-form analytical expression except for a few special cases of SαS (e.g. α = 2: Gaussian, α = 1: Cauchy), it is computationally easy to draw random samples from it by using the method proposed in [23].

Lévy processes and SDEs driven by Lévy motions. The standard α-stable Lévy motion on the real line is the unique process satisfying the following properties [22]:

(i) L^α_0 = 0 almost surely.
(ii) For any 0 ≤ t_1 < t_2 < ··· < t_N, the increments L^α_{t_{i+1}} − L^α_{t_i} are independent for i = 1, 2, ..., N − 1, and L^α_t − L^α_s has the same distribution as L^α_{t−s}, namely SαS((t − s)^{1/α}), for any t > s.
(iii) L^α_t is continuous in probability: ∀δ > 0 and s ≥ 0, P(|L^α_t − L^α_s| > δ) → 0 as t → s.

When α = 2, L^α_t reduces to a scaled version of the standard Brownian motion, √2 B_t. Since L^α_t for α < 2 is only continuous in probability, it can incur a countable number of discontinuities at random times, which makes it fundamentally different from the Brownian motion, which has almost surely continuous paths.

The d-dimensional Lévy motion with independent components is a stochastic process on R^d where each coordinate corresponds to an independent scalar Lévy motion. Stochastic processes based on Lévy motion, such as (5), and their mathematical properties have also been studied in the literature; we refer the reader to [24, 16] for details.

First Exit Times of Continuous-Time Lévy Stable SDEs. Due to the discontinuities of the Lévy-driven SDEs, their metastability behaviors also differ significantly from their Brownian counterparts. In this section, we briefly mention important theoretical results about the SDE given in (6). For simplicity, let us consider the SDE (6) in dimension one, i.e. d = 1. In a relatively recent study [17], the authors considered this SDE, where the potential function f is required to have a non-degenerate global minimum at the origin, and they proved the following theorem.

Theorem 1 ([17]). Consider the SDE (6) in dimension d = 1 and assume that it has a unique strong solution. Assume further that the objective f has a global minimum at zero, satisfying the conditions f′(x)x ≥ 0, f(0) = 0, f′(x) = 0 if and only if x = 0, and f″(0) = M > 0. Then, there exist positive constants ε_0, γ, δ, and C > 0 such that for 0 < ε ≤ ε_0, the following holds:

e^{−u ε^α (θ/α)(1 + Cε^δ)} (1 − Cε^δ) ≤ P(τ_{0,a}(ε) > u) ≤ e^{−u ε^α (θ/α)(1 − Cε^δ)} (1 + Cε^δ),  (11)

for all W(0) initialized uniformly in [−a + ε^γ, a − ε^γ] and u ≥ 0, where θ = 2/a^α. Consequently,

E[τ_{0,a}(ε)] = (α/2) (a^α/ε^α) (1 + O(ε^δ)),  (12)

for all W(0) initialized uniformly in [−a + ε^γ, a − ε^γ].

This result indicates that the first exit time of W(t) needs only polynomial time with respect to the width of the basin and does not depend on the depth of the basin, whereas Brownian systems need time exponential in the height of the basin in order to exit from it [25, 18]. This difference is mainly due to the discontinuities of the Lévy motion, which enable it to 'jump out' of the basin, whereas the Brownian SDEs need to 'climb' the basin due to their continuity. Consequently, given that the gradient noise exhibits a heavy-tailed behavior similar to an α-stable distributed random variable, this result can be considered as a proxy to understand the wide-minima behavior of SGD.

We note that this result has already been extended to R^d in [19]. An extension to state-dependent noise has also been obtained in [26]. We also note that the metastability phenomenon is closely related to the spectral gap of the forward operator corresponding to the SDE dynamics (see e.g. [25]), and it is known that this quantity scales like O(ε^α) for small ε, which determines the dependency on ε in the first term of the exit time (12) due to Kramers' law [27, 28].
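The scaling in (12) can be made concrete with a few lines of arithmetic: the leading-order expected exit time grows polynomially, like a^α, in the basin width a and like ε^{−α} in the noise amplitude, with no dependence on the basin depth. The parameter values below are illustrative choices.

```python
# Leading-order expected first exit time from Eq. (12):
#   E[tau] ~ (alpha / 2) * a**alpha / eps**alpha.
def expected_exit_time(alpha, a, eps):
    return 0.5 * alpha * a ** alpha / eps ** alpha

alpha, eps = 1.5, 0.1
narrow = expected_exit_time(alpha, a=1.0, eps=eps)  # narrow basin
wide = expected_exit_time(alpha, a=4.0, eps=eps)    # 4x wider basin

# Polynomial dependence on width: the ratio is 4**alpha = 8,
# independent of how deep either basin is.
print(wide / narrow)
```

This is the quantitative content behind the wide-minima discussion: a Lévy-driven process stays roughly a^α times longer in a basin of width a, regardless of depth.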
Burghoff and Pavlyukevich [28] showed that a similar scaling in ε for the spectral gap would hold if we were to restrict the SDE dynamics to a discrete grid with a small enough grid size.

3 Assumptions and the Main Result

In this study, our main goal is to obtain an explicit condition on the step-size such that the first exit time of the continuous-time process, τ_{ξ,a}(ε) in (9), would be similar to the first exit time of its Euler discretization, τ̄_{ξ,a}(ε) in (10).

We first state our assumptions.
A 1. The SDE (6) admits a unique strong solution.
A 2. The process φ_t ≜ −(b(Ŵ) + ∇f(W(t)))/(εσ) satisfies E exp((1/2) ∫_0^T φ_t² dt) < ∞.
A 3. The gradient of f is γ-Hölder continuous with 1/2 < γ < min{1/√2, α/2}: ‖∇f(x) − ∇f(y)‖ ≤ M‖x − y‖^γ, ∀x, y ∈ R^d.
A 4. The gradient of f satisfies the following assumption: ‖∇f(0)‖ ≤ B.
A 5. For some m > 0 and b ≥ 0, f is (m, b, γ)-dissipative: ⟨x, ∇f(x)⟩ ≥ m‖x‖^{1+γ} − b, ∀x ∈ R^d.

We note that, as opposed to the theory of SDEs driven by Brownian motion, the theory of Lévy-driven SDEs is still an active research field, where even the existence of solutions with general drift functions is not well-established and the main contributions have appeared in the last decade [29, 30]. Therefore, A1 has been a common assumption in stochastic analysis, e.g. [17, 19, 31]. Nevertheless, existence and uniqueness results have very recently been established in [30] for SDEs with bounded Hölder drifts. Therefore, A1 and A2 hold directly for bounded gradients, and extending this result to Hölder and dissipative drifts is out of the scope of this study. On the other hand, the assumptions A3-A5 are standard conditions, which are often considered in non-convex optimization algorithms that are based on discretization of diffusions [32, 33, 34, 35, 36, 37, 38].

Now, we identify an explicit condition for the step-size, which is one of our main contributions.
A 6. For a given δ > 0, t = Kη, and for some C > 0, the step-size satisfies the following condition:

0 < η ≤ min{1, m/M², (δ²/(2K_1 t²))^{1/(γ²+2γ−1)}, (δ²/(2K_2 t²))^{1/(2γ)}, (δ²/(2K_3 t²))^{α/(2γ)}, (δ²/(2K_4 t²))^{1/γ}},

where ε is as in (7), the constants m, M, b are defined by A3-A5, and

K_1 = O(d ε^{2γ²−2}),  K_2 = O(ε^{−2}),  K_3 = O(d^{2γ} ε^{2γ−2}),  K_4 = O(d^{2γ} ε^{2γ−2}).

A 6 will be stated in more detail in the supplementary document. We now present our main result; its proof can be found in the supplementary material.

Theorem 2. Under assumptions A1-A6, the following inequality holds:

P[τ_{−ξ,a}(ε) > Kη] − C_{K,η,ε,d,ξ} − δ ≤ P[τ̄_{0,a}(ε) > K] ≤ P[τ_{ξ,a}(ε) > Kη] + C_{K,η,ε,d,ξ} + δ,

where

C_{K,η,ε,d,ξ} ≜ C_1 (Kη(dε + 1) + 1)^γ e^{Mη} (Mη/ξ) + 1 − (1 − C_α d^{1+α/2} η e^{αMη} ε^α ξ^{−α})^K + 1 − (1 − C d e^{−ξ² e^{−2Mη} (εσ)^{−2}/(16dη)})^K,

for some constants C_1, C_α, and C that do not depend on η or ε; M is given by A3 and ε is as in (6)-(7).

Remark. 
Theorem 2 enables the use of the metastability results for Lévy-driven SDEs for their discretized counterparts, which is our most important contribution.

Exit time versus problem parameters. In Theorem 2, if we let η go to zero for any fixed δ, the constant C_{K,η,ε,d,ξ} will also go to zero, and since δ can be chosen arbitrarily small, this implies that the first exit time probabilities of the discrete process and the continuous process approach each other as the step-size gets smaller, as expected. If instead we decrease d or ε, the quantity C_{K,η,ε,d,ξ} also decreases monotonically, but it does not go to zero, due to the first term in the expression of C_{K,η,ε,d,ξ}.

Exit time versus width of local minima. Popular activation functions used in deep learning, such as ReLU functions, are almost everywhere differentiable, and therefore the cost function has a well-defined Hessian almost everywhere (see e.g. [39]). The eigenvalues of the Hessian of the objective near local minima have also been studied in the literature (see e.g. [40, 41]). If the Hessian around a local minimum is positive definite, the conditions for the multi-dimensional version of Theorem 1 in [19] are satisfied locally around that minimum. For local minima lying in wider valleys, the parameter a can be taken larger, in which case the expected exit time E[τ_{0,a}(ε)] ∼ O(a^α) will be larger by the formula (12). In other words, the SDE (5) spends more time to exit wider valleys. Theorem 2 shows that SGD, modeled by the discretization of this SDE, will inherit a similar behavior if the step-size satisfies the conditions we provide.

4 Proof Overview

Relating the first exit times of W(t) and W_k often requires obtaining bounds on the distance between W(kη) and W_k.
In particular, if ‖W_k − W(kη)‖ is small with high probability, then we expect that their first exit times from the set A will also be close to each other with high probability.

For objective functions with bounded gradients, in order to relate τ_{ξ,a}(ε) to τ̄_{ξ,a}(ε), one can attempt to use the strong convergence of the Euler scheme (cf. [42], Proposition 1): lim_{η→0} E‖W_k − W(kη)‖ = 0. By Markov's inequality, this result implies convergence in probability: for any δ > 0 and ε > 0, there exists η such that P(‖W_k − W(kη)‖ > ε) < δ/2. Then, if W(kη) ∈ A, one of the following events must happen:

1. W_k ∈ A,
2. W_k ∉ A and ‖W_k − W(kη)‖ > ε (with probability less than δ/2),
3. W_k ∉ A and the distance from W_k to A is at most ε (with probability less than δ/2).

By using this observation, we obtain P[W(kη) ∈ A] ≤ P[W_k ∈ A] + δ. Even though we could use this result in order to relate τ_{ξ,a}(ε) to τ̄_{ξ,a}(ε), this approach would not yield a meaningful condition for η, since the bounds on the strong error E‖W_k − W(kη)‖ in general grow exponentially with k, which means that η would have to be chosen exponentially small for a given k. Therefore, in our strategy, we choose a different path, where we do not use the strong convergence of the Euler scheme.

Our proof strategy is inspired by the recent study [21], where the authors analyze the empirical metastability of the Langevin equation driven by a Brownian motion.
However, unlike the Brownian case that [21] was based on, some of the tools for analyzing Brownian SDEs do not exist for Lévy-driven SDEs, which increases the difficulty of our task.

We first define a linearly interpolated version of the discrete-time process {W_k}_{k∈N₊}, which will be useful in our analysis, given as follows:

dŴ(t) = b(Ŵ) dt + εσ dB(t) + ε dL^α(t),  (13)

where Ŵ ≡ {Ŵ(t)}_{t≥0} denotes the whole process and the drift function b(Ŵ) is chosen as follows:

b(Ŵ) ≜ − Σ_{k=0}^{∞} ∇f(Ŵ(kη)) I_{[kη,(k+1)η)}(t).

Here, I denotes the indicator function, i.e. I_S(x) = 1 if x ∈ S and I_S(x) = 0 if x ∉ S. It is easy to verify that Ŵ(kη) = W_k for all k ∈ N₊ [43, 32].

In our approach, we start by developing a Girsanov-like change of measures [24] to express the Kullback-Leibler (KL) divergence between µ_t and µ̂_t, which is defined as follows:

KL(µ̂_t, µ_t) ≜ ∫ log(dµ̂_t/dµ_t) dµ̂_t,

where µ_t denotes the law of {W(s)}_{s∈[0,t]}, µ̂_t denotes the law of {Ŵ(s)}_{s∈[0,t]}, and dµ̂_t/dµ_t is the Radon-Nikodym derivative of µ̂_t with respect to µ_t. Here, we require A2 for the existence of a Girsanov transform between µ̂_t and µ_t and for establishing an explicit formula for the transform. In the supplementary document, we show that the KL divergence between µ_t and µ̂_t can be written as:

KL(µ̂_t, µ_t) = (1/(2ε²σ²)) E[∫_0^t ‖b(Ŵ) + ∇f(Ŵ(s))‖² ds].  (14)

While this result has been known for SDEs driven by Brownian motion [16], none of the references we are aware of expressed the KL divergence as in (14).
We also note that one of the key reasons that allows us to obtain (14) is the presence of the Brownian motion in (6), i.e. σ > 0. For σ = 0, such a measure transformation cannot be performed [44].

In the next result, we show that if the step-size is chosen sufficiently small, the KL divergence between µ_t and µ̂_t is bounded.

Theorem 3. Assume that the conditions A1-A6 hold. Then the following inequality holds:

KL(µ̂_t, µ_t) ≤ 2δ².

The proof technique is similar to the approach of [43, 32, 15]: the idea is to divide the integral in (14) into smaller pieces and to bound each piece separately. Once we obtain a bound on the KL divergence, by using an optimal coupling argument, the data processing inequality, and Pinsker's inequality, we obtain a bound on the total variation (TV) distance between µ_t and µ̂_t as follows:

P_M[(W(η), ..., W(Kη)) ≠ (Ŵ(η), ..., Ŵ(Kη))] ≤ ‖µ_{Kη} − µ̂_{Kη}‖_TV ≤ ((1/2) KL(µ̂_{Kη}, µ_{Kη}))^{1/2},

where the TV distance is defined in Section 1. Here, M denotes the optimal coupling between {W(s)}_{s∈[0,Kη]} and {Ŵ(s)}_{s∈[0,Kη]}, i.e., the joint probability measure of {W(s)}_{s∈[0,Kη]} and {Ŵ(s)}_{s∈[0,Kη]} which satisfies the following identity [45]:

P_M[{W(s)}_{s∈[0,Kη]} ≠ {Ŵ(s)}_{s∈[0,Kη]}] = ‖µ_{Kη} − µ̂_{Kη}‖_TV.

Combined with Theorem 3, this inequality implies the following useful result:

P[(W(η), ..., W(Kη)) ∈ A] − δ ≤ P[τ̄_{0,a}(ε) > K] ≤ P[(W(η), ..., W(Kη)) ∈ A] + δ,  (15)

where we used the fact that the event (Ŵ(η), ...
, \u02c6W (K\u03b7)) \u2208 A is equivalent to the event (\u00af\u03c40,a(\u03b5) >\nK). The remaining task is to relate the probability P[(W (\u03b7), . . . , W (K\u03b7)) \u2208 A] to P[\u03c4\u03be,a(\u03b5) > K\u03b7].\nThe event (W (\u03b7), . . . , W (K\u03b7)) \u2208 A ensures that the process W (t) does not leave the set A when\nt = \u03b7, . . . , K\u03b7; however, it does not indicate that the process remains in A when t \u2208 (k\u03b7, (k + 1)\u03b7).\nIn order to have a control over the whole process, we introduce the following event:\n\nB (cid:44)(cid:110)\n\nmax\n\n0\u2264k\u2264K\u22121\n\nsup\n\nt\u2208[k\u03b7,(k+1)\u03b7]\n\n(cid:107)W (t) \u2212 W (k\u03b7)(cid:107) \u2264 \u03be\n\n(cid:111)\n\n,\n\nsuch that the event [(W (\u03b7), . . . , W (K\u03b7)) \u2208 A] \u2229 B ensures that the process stays close to A for the\nwhole time. By using this event, we can obtain the following inequalities:\nP[(W (\u03b7), . . . , W (K\u03b7)) \u2208 A] \u2264P[(W (\u03b7), . . . , W (K\u03b7)) \u2208 A \u2229 B] + P[(W (\u03b7), . . . , W (K\u03b7)) \u2208 Bc]\n\n=P[\u03c4\u03be,a(\u03b5) > K\u03b7] + P[(W (\u03b7), . . . , W (K\u03b7)) \u2208 Bc].\n\nBy using the same approach, we can obtain a lower bound on P[(W (\u03b7), . . . , W (K\u03b7)) \u2208 A] as well.\nHence, our \ufb01nal task reduces to bounding the term P[(W (\u03b7), . . . , W (K\u03b7)) \u2208 Bc], which we perform\nby using the weak re\ufb02ection principles of L\u00e9vy processes [46]. This \ufb01nally yields Theorem 2.\n\n5 Numerical Illustration\n\nSynthetic data. To illustrate our results, we \ufb01rst conduct experiments on a synthetic problem,\n2(cid:107)x(cid:107)2. This corresponds to an Ornstein-Uhlenbeck-type\nwhere the cost function is set to f (x) = 1\nprocess, which is commonly considered in metastability analyses [22]. 
This process locally satisfies the conditions A1-A5.

Since we cannot directly simulate the continuous-time process, we consider the stochastic process sampled from (7) with a sufficiently small step-size as an approximation of the continuous scheme. We therefore organize the experiments as follows. We first choose a very small step-size, i.e. η = 10^{-10}. Starting from an initial point W^0 satisfying ‖W^0‖ < a, we iterate (7) until we find the first K such that ‖W^K‖ > a. We repeat this experiment 100 times, then we take the average Kη as the 'ground-truth' first exit time. We continue the experiments by calculating the first exit times for larger step-sizes (each repeated 100 times), and compute their distances to the ground truth.

The results for this experiment are shown in Figure 2. By Theorem 2, the distance between the first exit times of the discretized and continuous processes depends on two terms, C_{K,η,ε,d,ε̄} and δ, which we use to explain our experimental results.

We observe from Figure 2(a) that the error with respect to the ground-truth first exit time is an increasing function of η, which directly matches our theoretical result. Figure 2(b) shows that, in the small-noise limit (e.g., in our setting, ε < 1 versus η ≈ 10^{-8}), the error decreases with the parameter ε. By A6, as ε increases, the term δ is reduced. On the other hand, C_{K,η,ε,d,ε̄} increases with ε. However, in the small-noise limit, this effect is dominated by the decrease of δ, which makes the error decrease overall. The rate of decrease then decelerates for larger ε, since the product εη becomes large enough that the increase of C_{K,η,ε,d,ε̄} starts to dominate the decrease of δ.
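This measurement procedure can be sketched in a few lines. The sketch below is our illustrative reconstruction, not the paper's code: it uses a far larger step-size, a smaller iteration budget, and fewer repetitions than the actual experiments, and the increment scalings in the discretization of (7) (√η for the Gaussian part, η^{1/α} for the stable part, drawn via the method of [23]) are our assumptions.

```python
import numpy as np

def first_exit_time(eta, a=1.0, eps=0.5, sigma=1.0, alpha=1.8, d=2,
                    max_iter=10**5, rng=None):
    """Iterate a discretization of (7) for f(x) = ||x||^2 / 2 (grad f(x) = x)
    until ||W_K|| > a, and return the exit time K * eta."""
    if rng is None:
        rng = np.random.default_rng()
    w = np.zeros(d)  # start inside the ball, i.e. ||W_0|| < a
    for k in range(1, max_iter + 1):
        # Chambers-Mallows-Stuck draw of a symmetric alpha-stable vector
        u = rng.uniform(-np.pi / 2, np.pi / 2, d)
        e = rng.exponential(1.0, d)
        s = (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
             * (np.cos(u - alpha * u) / e) ** ((1 - alpha) / alpha))
        w = (w - eta * w                                   # gradient step
             + eps * sigma * np.sqrt(eta) * rng.standard_normal(d)
             + eps * eta ** (1.0 / alpha) * s)             # heavy-tailed part
        if np.linalg.norm(w) > a:
            return k * eta
    return max_iter * eta  # censored if no exit within the budget

# Average the exit time over repetitions, as in the protocol above
# (repetition count reduced here for illustration).
rng = np.random.default_rng(0)
exits = [first_exit_time(eta=1e-3, rng=rng) for _ in range(10)]
mean_exit = float(np.mean(exits))
```

Repeating this for a grid of step-sizes and comparing against the smallest-η run reproduces the shape of the experiment, though with much higher Monte Carlo noise than the 100-repetition protocol.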
This suggests that, for a large ε, a very small step-size η would be required to reduce the distance between the first exit times of the two processes.

Figure 2: Results of the synthetic experiments (panels (a)-(d)).

In Figure 2(c), the error decreases when the variance σ increases. The reason is the same as in Figure 2(b), and can be explained by considering the expressions of δ and C_{K,η,ε,d,ε̄} in the conclusion of Theorem 2.

In Figure 2(d), for small dimensions with the same exit-time interval, increasing d makes both processes escape the interval earlier, i.e., with smaller exit times. Hence, the distance between their exit times becomes smaller. For larger d, the increasing effect of δ and C_{K,η,ε,d,ε̄} starts to dominate this 'early-escape' effect; thus, the rate of decrease of the error diminishes. We observe that the error even slightly increases when α = 1.2 and d grows from 70 to 100.

Neural networks. In our second set of experiments, we consider the real-data setting used in [6]: a multi-layer fully connected neural network with ReLU activations on the MNIST dataset. We adapted the code provided in [6] and we provide our version at https://github.com/umutsimsekli/sgd_first_exit_time. For this model, we followed a similar methodology: we monitored the first exit time while varying η, the number of layers (depth), and the number of neurons per layer (width). Since a local minimum is not analytically available, we first trained the networks with SGD until a vicinity of a local minimum was reached with at least 90% accuracy, then we measured the first exit times with a = 1 and ε = 0.1. In order to have a prominent level of gradient noise, we set the mini-batch size to b = 10 and we did not add explicit Gaussian or Lévy noise. The results are given in Figure 3.
We observe that, even with pure gradient noise, the error in the exit time behaves very similarly to the one that we observed in Figure 2(a), hence supporting our theory. We further observe that the error has a better dependency on η when the width and depth are relatively small, whereas the slope of the error increases for larger width and depth. This result shows that, to inherit the metastability properties of the continuous-time SDE, we need to use a smaller η as we increase the size of the network. Note that this result does not conflict with Figure 2(d), since changing the width and depth does not simply change d; it also changes the landscape of the problem.

Figure 3: Results of the neural network experiments.

6 Conclusion

We studied SGD under a heavy-tailed gradient noise model, which has been empirically justified for a variety of deep learning tasks. While a continuous-time limit of SGD can be used as a proxy for investigating the metastability of SGD under this model, the system might behave differently once discretized. Addressing this issue, we derived explicit conditions for the step-size such that the discrete-time system can inherit the metastability behavior of its continuous-time limit. We illustrated our results on a synthetic model and neural networks.
A natural next step for our study would be analyzing the generalization properties of SGD under such heavy-tailed gradient noise.

Acknowledgments

We are grateful to Peter Tankov for providing us the derivations for the Girsanov-like change of measures. This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project, and by the industrial chair Data Science & Artificial Intelligence from Télécom Paris. Mert Gürbüzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.

References

[1] Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421-436. Springer, 2012.

[2] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.

[3] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In International Conference on Learning Representations, 2018.

[4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

[5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

[6] U. Şimşekli, L. Sagun, and M. Gürbüzbalaban.
A tail-index analysis of stochastic gradient noise in deep neural networks. In ICML, 2019.

[7] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997.

[9] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[10] S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354-363, 2016.

[11] Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101-2110, 2017.

[12] W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.

[13] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.

[14] Umut Şimşekli. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov Chain Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pages 3200-3209, 2017.

[15] Thanh Huy Nguyen, Umut Şimşekli, and Gaël Richard. Non-asymptotic analysis of Fractional Langevin Monte Carlo for non-convex optimization. In International Conference on Machine Learning, 2019.

[16] Bernt Karsten Øksendal and Agnes Sulem. Applied Stochastic Control of Jump Diffusions, volume 498. Springer, 2005.

[17] Peter Imkeller and Ilya Pavlyukevich. First exit times of SDEs driven by stable Lévy processes. Stochastic Processes and their Applications, 116(4):611-642, 2006.

[18] P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211-222, 2010.

[19] Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in R^d perturbed by multifractal Lévy noise. Journal of Statistical Physics, 141(1):94-119, 2010.

[20] S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent. In International Conference on Learning Representations, 2019.

[21] B. Tzen, T. Liang, and M. Raginsky. Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. In Proceedings of the 2018 Conference on Learning Theory, 2018.

[22] J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015.

[23] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340-344, 1976.

[24] Peter Tankov. Financial Modelling with Jump Processes. Chapman and Hall/CRC, 2003.

[25] Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6(4):399-424, 2004.

[26] Ilya Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails. Stochastics and Dynamics, 11(02n03):495-519, 2011.

[27] Nils Berglund. Kramers' law: Validity, derivations and generalisations. arXiv preprint arXiv:1106.5799, 2011.

[28] Toralf Burghoff and Ilya Pavlyukevich. Spectral analysis for a discrete metastable system driven by Lévy flights. Journal of Statistical Physics, 161(1):171-196, 2015.

[29] Enrico Priola. Pathwise uniqueness for singular SDEs driven by stable processes. Osaka Journal of Mathematics, 49(2):421-447, 2012.

[30] Alexei M. Kulik. On weak uniqueness and distributional properties of a solution to an SDE with α-stable noise. Stochastic Processes and their Applications, 129(2):473-506, 2019.

[31] Mingjie Liang and Jian Wang. Gradient estimates and ergodicity for SDEs driven by multiplicative Lévy noises via coupling. arXiv preprint arXiv:1801.05936, 2018.

[32] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674-1703, 2017.

[33] Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3125-3136, 2018.

[34] M. A. Erdogdu, L. Mackey, and O. Shamir. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pages 9693-9702, 2018.

[35] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Breaking reversibility accelerates Langevin dynamics for global non-convex optimization. arXiv preprint arXiv:1812.07725, 2018.

[36] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv preprint arXiv:1809.04618, 2018.

[37] Umut Şimşekli, Çağatay Yıldız, Thanh Huy Nguyen, Gaël Richard, and A. Taylan Cemgil. Asynchronous stochastic quasi-Newton MCMC for non-convex optimization. In International Conference on Machine Learning, 2018.

[38] Antoine Liutkus, Umut Şimşekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, 2019.

[39] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems 30, pages 597-607. Curran Associates, Inc., 2017.

[40] Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

[41] Vardan Papyan. The full spectrum of deep net Hessians at scale: Dynamics with sample size. arXiv preprint arXiv:1811.07062, 2018.

[42] R. Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by Lévy processes. Stochastics, 90(4):569-604, 2018.

[43] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651-676, 2017.

[44] Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven SDEs with Hölder continuous coefficients. Journal of Functional Analysis, 264(8):1757-1778, 2013.

[45] Torgny Lindvall. Lectures on the Coupling Method. Courier Corporation, 2002.

[46] Erhan Bayraktar and Sergey Nadtochiy. Weak reflection principle for Lévy processes. The Annals of Applied Probability, 25(6):3251-3294, 2015.