{"title": "First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time", "book": "Advances in Neural Information Processing Systems", "page_first": 5530, "page_last": 5540, "abstract": "(This is a theory paper) In this paper, we consider first-order methods for solving stochastic non-convex optimization problems. The key building block of the proposed algorithms is first-order procedures to extract negative curvature from the Hessian matrix through a principled sequence starting from noise, which are referred to {\\it NEgative-curvature-Originated-from-Noise or NEON} and are of independent interest. Based on this building block, we design purely first-order stochastic algorithms for escaping from non-degenerate saddle points with a much better time complexity (almost linear time in the problem's dimensionality). In particular, we develop a general framework of {\\it first-order stochastic algorithms} with a second-order convergence guarantee based on our new technique and existing algorithms that may only converge to a first-order stationary point. For finding a nearly {\\it second-order stationary point} $\\x$ such that $\\|\\nabla F(\\x)\\|\\leq \\epsilon$ and $\\nabla^2 F(\\x)\\geq -\\sqrt{\\epsilon}I$ (in high probability), the best time complexity of the presented algorithms is $\\widetilde O(d/\\epsilon^{3.5})$, where $F(\\cdot)$ denotes the objective function and $d$ is the dimensionality of the problem. 
To the best of our knowledge, this is the first theoretical result of first-order stochastic algorithms with an almost linear time in terms of problem's dimensionality for finding second-order stationary points, which is even competitive with existing stochastic algorithms hinging on the second-order information.", "full_text": "First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time\n\nYi Xu\u2020, Rong Jin\u2021, Tianbao Yang\u2020\n\n\u2020 Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA\n\n\u2021 Machine Intelligence Technology, Alibaba Group, Bellevue, WA 98004, USA\n\n{yi-xu, tianbao-yang}@uiowa.edu, jinrong.jr@alibaba-inc.com\n\nAbstract\n\nIn this paper, we consider first-order methods for solving stochastic non-convex\noptimization problems. The key building block of the proposed algorithms is first-order\nprocedures to extract negative curvature from the Hessian matrix through a\nprincipled sequence starting from noise, which are referred to as\nNEgative-curvature-Originated-from-Noise or NEON and are of independent interest. Based on this\nbuilding block, we design purely first-order stochastic algorithms for escaping\nfrom non-degenerate saddle points with a much better time complexity (almost\nlinear time in the problem\u2019s dimensionality) under a bounded variance condition of\nstochastic gradients than previous first-order stochastic algorithms. In particular,\nwe develop a general framework of first-order stochastic algorithms with a second-order\nconvergence guarantee based on our new technique and existing algorithms\nthat may only converge to a first-order stationary point. 
For finding a nearly\nsecond-order stationary point x such that \u2016\u2207F(x)\u2016 \u2264 \u03f5 and \u2207\u00b2F(x) \u2265 \u2212\u221a\u03f5 I\n(in high probability), the best time complexity of the presented algorithms is\n\u00d5(d/\u03f5^{3.5}), where F(\u00b7) denotes the objective function and d is the dimensionality\nof the problem. To the best of our knowledge, this is the first theoretical result of\nfirst-order stochastic algorithms with an almost linear time in terms of problem\u2019s\ndimensionality for finding second-order stationary points, which is even competitive\nwith existing stochastic algorithms hinging on the second-order information.\n\n1 Introduction\n\nThe problem of interest in this paper is Stochastic Non-Convex Optimization given by\n\nmin_{x\u2208R^d} F(x) \u225c E_\u03be[f(x; \u03be)],    (1)\n\nwhere \u03be is a random variable and f(x; \u03be) is a random smooth non-convex function of x. The only\ninformation about F(x) available to us is the sampled stochastic functions f(x; \u03be) and their gradients.\nA popular choice of algorithms for solving (1) is (mini-batch) stochastic gradient descent (SGD)\nmethod and its variants [6]. However, these algorithms are not guaranteed to escape from a\nsaddle point (more precisely a non-degenerate saddle point) x, at which \u2207F(x) = 0 and the\nminimum eigen-value of \u2207\u00b2F(x) is less than 0. Recently, new variants of SGD that add isotropic\nnoise to the stochastic gradient were proposed (noisy SGD [5], stochastic gradient Langevin\ndynamics (SGLD) [23]). These two works provide rigorous analyses of the noise-injected update for\nescaping from a saddle point. 
Unfortunately, both variants suffer from a polynomial time complexity\nwith a super-linear dependence on the dimensionality d (at least a power of 4), which renders them\nimpractical for optimizing problems of high dimension.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTable 1: Comparison with existing stochastic algorithms for achieving an (\u03f5, \u03b3)-SSP to (1), where\np is a number at least 4, IFO (incremental first-order oracle) and ISO (incremental second-order\noracle) are terminologies borrowed from [20], representing \u2207f(x; \u03be) and \u2207\u00b2f(x; \u03be)v respectively,\nT_h denotes the runtime of ISO and T_g denotes the runtime of IFO. \u00d5(\u00b7) hides a poly-logarithmic\nfactor. SM refers to stochastic momentum methods. For \u03b3, we only consider values as low as \u03f5^{1/2}.\n\nAlgorithm | Oracle | Target | Time Complexity\nNoisy SGD [5] | IFO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g d^p \u03f5^{\u2212p})\nSGLD [23] | IFO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g d^p \u03f5^{\u22124})\nNatasha2 [1] | IFO + ISO | (\u03f5, \u03f5^{1/2})-SSP, expectation | \u00d5(T_g \u03f5^{\u22123.5} + T_h \u03f5^{\u22122.5})\nNatasha2 [1] | IFO + ISO | (\u03f5, \u03f5^{1/4})-SSP, expectation | \u00d5(T_g \u03f5^{\u22123.25} + T_h \u03f5^{\u22121.75})\nSNCG [17] | IFO + ISO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g \u03f5^{\u22124} + T_h \u03f5^{\u22122.5})\nSVRG-Hessian [20] (finite-sum objectives; n is number of components) | IFO + ISO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g(n^{2/3}\u03f5^{\u22122} + n\u03f5^{\u22121.5}) + T_h(n\u03f5^{\u22121.5} + n^{3/4}\u03f5^{\u22127/4}))\nNEON-SGD, NEON-SM (this work) | IFO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g \u03f5^{\u22124})\nNEON-SCSG (this work) | IFO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g \u03f5^{\u22123.5})\nNEON-SCSG (this work) | IFO | (\u03f5, \u03f5^{4/9})-SSP, high probability | \u00d5(T_g \u03f5^{\u22123.33})\nNEON-Natasha (this work) | IFO | (\u03f5, \u03f5^{1/2})-SSP, expectation | \u00d5(T_g \u03f5^{\u22123.5})\nNEON-Natasha (this work) | IFO | (\u03f5, \u03f5^{1/4})-SSP, expectation | \u00d5(T_g \u03f5^{\u22123.25})\nNEON-SVRG (this work) (finite sum) | IFO | (\u03f5, \u03f5^{1/2})-SSP, high probability | \u00d5(T_g(n^{2/3}\u03f5^{\u22122} + n\u03f5^{\u22121.5} + \u03f5^{\u22122.75}))\n\nOn the other hand, second-order information carried by the Hessian has been utilized to escape from\na saddle point, which usually yields an almost linear time complexity in terms of the dimensionality\nd under the assumption that the Hessian-vector product (HVP) can be performed in a linear time. In\npractice, HVP can be estimated by a finite difference approximation using two gradient evaluations.\nHowever, the rigorous analysis of algorithms using such noisy approximation for solving non-convex\noptimization remains unsolved, and heuristic approaches may suffer from numerical issues. 
Although\nfor some problems with special structures (e.g., neural networks), HVP can be efficiently computed\nusing gradients, an HVP-free method that can escape saddle points for a broader family of non-convex\nproblems is still desirable.\nThis paper aims to design HVP-free stochastic algorithms for solving (1), which can converge\nto second-order stationary points with a time complexity that is almost linear in the problem\u2019s\ndimensionality. Our main contributions are:\n\u2022 As a key building block of proposed algorithms, first-order procedures (NEON) are proposed\nto extract negative curvature from the Hessian using a principled sequence starting from noise.\nInterestingly, our perspective of NEON connects the existing two classes of methods\n(noise-based and HVP-based) for escaping from saddle points. We provide a formal analysis of simple\nprocedures based on gradient descent and accelerated gradient method for extracting a negative\ncurvature direction from the Hessian.\n\n\u2022 We develop a general framework of first-order algorithms for stochastic non-convex optimization\nby combining the proposed first-order NEON procedures to extract negative curvature with existing\nfirst-order stochastic algorithms that aim at a first-order critical point. We also establish the time\ncomplexities of several interesting instances of our general framework for finding a nearly\n(\u03f5, \u03b3)-second-order stationary point (SSP), i.e., \u2016\u2207F(x)\u2016 \u2264 \u03f5 and \u03bb_min(\u2207\u00b2F(x)) \u2265 \u2212\u03b3, where \u2016 \u00b7 \u2016\nrepresents Euclidean norm of a vector and \u03bb_min(\u00b7) denotes the minimum eigen-value. 
A summary\nof our results and existing results for Stochastic Non-Convex Optimization is presented in Table 1.\n\n2 Other Related Work\nSGD and its many variants (e.g., mini-batch SGD and stochastic momentum (SM) methods) have\nbeen analyzed for stochastic non-convex optimization [6, 7, 8, 22]. The iteration complexity\nof all these algorithms is O(1/\u03f5^4) for finding a first-order stationary point (FSP) (in expectation\nE[\u2016\u2207F(x)\u2016_2^2] \u2264 \u03f5\u00b2 or in high probability). Recently, there have been some improvements for stochastic\nnon-convex optimization. [14] proposed a first-order stochastic algorithm (named SCSG) using the\nvariance-reduction technique, which enjoys an iteration complexity of O(1/\u03f5^{10/3}) for finding an\nFSP (in expectation), i.e., E[\u2016\u2207F(x)\u2016_2^2] \u2264 \u03f5\u00b2. [1] proposed a variant of SCSG (named Natasha1.5)\nwith the same convergence and complexity. An important application of NEON is that previous\nstochastic algorithms that have a first-order convergence guarantee can be strengthened to enjoy a\nsecond-order convergence guarantee by leveraging the proposed first-order NEON procedures to\nescape from saddle points. We will analyze several algorithms by combining the updates of SGD,\nSM, and SCSG with the proposed NEON.\nSeveral recent works [17, 1, 20] propose to strengthen existing first-order stochastic algorithms to\nhave a second-order convergence guarantee by leveraging the second-order information. [17] used\nmini-batch SGD, [20] used SVRG for a finite-sum problem, and [1] used a similar algorithm to\nSCSG for their first-order algorithms. The second-order methods used in these studies for computing\nnegative curvature can be replaced by the proposed NEON procedures. 
It is notable that although a\ngeneric approach for stochastic non-convex optimization was proposed in [20], its requirement on the\nfirst-order stochastic algorithms precludes many interesting algorithms such as SGD, SM, and SCSG.\nStronger convergence guarantees (e.g., converging to a global minimum) of stochastic algorithms have\nbeen studied in [9] for a certain family of problems, which is beyond the setting of the present work.\nIt is also worth mentioning that the field of non-convex optimization is moving so fast that similar\nresults have appeared online after the preliminary version of this work [2]. Allen-Zhu and Li [2]\nproposed NEON2 for finding a negative curvature, which includes a stochastic version and a\ndeterministic version. We notice several differences between the two works: (i) they used Gaussian\nrandom noise with a variance proportional to d^{\u2212C}, where C is a large unknown constant; in contrast,\nour NEON and NEON+ procedures use random noise sampled from the sphere of a Euclidean ball\nwith radius proportional to log^{\u22122}(d); (ii) the update of their deterministic NEON2^det is constructed\nbased on the Chebyshev polynomial; in contrast, our NEON+ with a similar iteration complexity is\nbased on the well-known Nesterov\u2019s accelerated gradient method; (iii) we provide a general\nframework/analysis for promoting first-order algorithms to enjoy second-order convergence, which could\nbe useful for promoting new first-order stochastic algorithms; (iv) the reported iteration complexity\nof their NEON2^online is better than our stochastic variants of NEON. 
However, in most cases the total\ncomplexity for finding an (\u03f5, \u221a\u03f5)-SSP is dominated by the complexity for finding a stationary point,\nnot by the complexity of stochastic NEON for finding a negative curvature.\n\n3 Preliminaries\nLet \u2016 \u00b7 \u2016 denote the Euclidean norm of a vector and \u2016 \u00b7 \u2016_2 denote the spectral norm of a matrix.\nLet S_r^d denote the sphere of a Euclidean ball centered at zero with radius r, and [t] denote the\nset {0, . . . , t}. A function f(x) has an L_1-Lipschitz continuous gradient if it is differentiable and\nthere exists L_1 > 0 such that \u2016\u2207f(x) \u2212 \u2207f(y)\u2016 \u2264 L_1\u2016x \u2212 y\u2016 holds for any x, y \u2208 R^d. A\nfunction f(x) has an L_2-Lipschitz continuous Hessian if it is twice differentiable and there exists\nL_2 > 0 such that \u2016\u2207\u00b2f(x) \u2212 \u2207\u00b2f(y)\u2016_2 \u2264 L_2\u2016x \u2212 y\u2016 holds for any x, y \u2208 R^d. It implies that\n\n|f(x) \u2212 f(y) \u2212 \u2207f(y)\u22a4(x \u2212 y) \u2212 (1/2)(x \u2212 y)\u22a4\u2207\u00b2f(y)(x \u2212 y)| \u2264 (L_2/6)\u2016x \u2212 y\u2016\u00b3, and\n\n\u2016\u2207f(x + u) \u2212 \u2207f(x) \u2212 \u2207\u00b2f(x)u\u2016 \u2264 L_2\u2016u\u2016\u00b2/2.    (2)\n\nWe first make the following assumptions regarding the problem (1).\nAssumption 1. For the problem (1), we assume that\n(i). every random function f(x; \u03be) is twice differentiable, and it has L_1-Lipschitz continuous\ngradient and L_2-Lipschitz continuous Hessian.\n(ii). given an initial point x_0, there exists \u2206 < \u221e such that F(x_0) \u2212 F(x\u2217) \u2264 \u2206, where x\u2217 denotes\nthe global minimum of (1).\n(iii). there exists G > 0 such that E[exp(\u2016\u2207f(x; \u03be) \u2212 \u2207F(x)\u2016\u00b2/G\u00b2)] \u2264 exp(1) holds.\nRemark. 
(1) the analysis of NEON or NEON+ or their stochastic versions for extracting the negative\ncurvature only requires Assumption 1 (i). Indeed, the Lipschitz continuous Hessian can be relaxed\nto a locally Lipschitz continuous Hessian condition according to our analysis. (2) Assumptions 1 (ii)\nand (iii) are used in the analysis of Section 5, which are standard assumptions made in the literature\nof stochastic non-convex optimization [6, 7, 8]. Assumption 1 (iii) implies that E[\u2016\u2207f(x; \u03be) \u2212\n\u2207F(x)\u2016\u00b2] \u2264 V \u225c G\u00b2 holds. For stating our time complexities, we assume G is independent of d for\nfinding an approximate local minimum in Section 5. Nevertheless, our comparison of the proposed\nalgorithms with previous algorithms (e.g., SGLD [23], SNCG [17], Natasha2 [1]) in the stochastic\nsetting is fair because similar assumptions are also made. We also note that [5] makes a stronger\nassumption about the stochastic gradients, i.e., \u2016\u2207f(x; \u03be) \u2212 \u2207F(x)\u2016 \u2264 O(d), which leads to a\nworse dependence of time complexity on d, i.e., O(d^p) with p \u2265 4.\nNext, we discuss a second-order method based on HVPs to escape from a non-degenerate saddle\npoint x of a function f(x) that satisfies \u03bb_min(\u2207\u00b2f(x)) \u2264 \u2212\u03b3, which can be found in many previous\nstudies [21, 16, 4]. The method is based on a negative curvature (NC for short in the sequel)\ndirection v \u2208 R^d that satisfies \u2016v\u2016 = 1 and\n\nv\u22a4\u2207\u00b2f(x)v \u2264 \u2212c\u03b3,    (3)\n\nwhere c > 0 is a constant. Given such a vector v, we can update the solution according to\n\nx_+ = x \u2212 (c\u03b3/L_2) sign(v\u22a4\u2207f(x))v, or x\u2032_+ = x \u2212 (c\u03b3/L_2) \u03be\u0304v,    (4)\n\nwhere \u03be\u0304 \u2208 {1, \u22121} is a Rademacher random variable used when \u2207f(x) is not available. The\nfollowing lemma establishes that the objective value of x_+ or x\u2032_+ is less than that of x by a sufficient\namount, which makes it possible to escape from the saddle point x.\nLemma 1. For x satisfying \u03bb_min(\u2207\u00b2f(x)) \u2264 \u2212\u03b3 and v satisfying (3), let x_+, x\u2032_+ be given in (4),\nthen we have f(x) \u2212 f(x_+) \u2265 c\u00b3\u03b3\u00b3/(3L_2^2) and E[f(x) \u2212 f(x\u2032_+)] \u2265 c\u00b3\u03b3\u00b3/(3L_2^2).\n\nTo compute a NC direction v that satisfies (3), we can employ the Lanczos method or the Power\nmethod for computing the maximum eigen-vector of the matrix (I \u2212 \u03b7\u2207\u00b2f(x)), where \u03b7L_1 \u2264 1 such\nthat I \u2212 \u03b7\u2207\u00b2f(x) \u2ab0 0. The Power method starts with a random vector v_1 \u2208 R^d (e.g., drawn from a\nuniform distribution over the unit sphere) and iteratively computes v_{\u03c4+1} = (I \u2212 \u03b7\u2207\u00b2f(x))v_\u03c4, \u03c4 =\n1, . . . , t. Following the results in [13], it can be shown that if \u03bb_min(\u2207\u00b2f(x)) \u2264 \u2212\u03b3, then with at most\nlog(d/\u03b4\u00b2)L_1/\u03b3 HVPs, the Power method finds a vector v\u0302_t = v_t/\u2016v_t\u2016 such that v\u0302_t\u22a4\u2207\u00b2f(x)v\u0302_t \u2264 \u2212\u03b3/2\nholds with high probability 1 \u2212 \u03b4. Similarly, the Lanczos method (e.g., Lemma 11 in [21]) can find\nsuch a vector v\u0302_t with a lower number of HVPs, i.e., min(d, log(d/\u03b4\u00b2)\u221aL_1/(2\u221a(2\u03b5))).\n\n4 Key Building Block: Extracting NC From Noise\n\nOur HVP-free stochastic algorithms with provable guarantees for solving (1) presented in the next section\nare based on a key building block, i.e., extracting NC from noise using only first-order information.\nTo tackle the stochastic objective in (1), our method is to compute a NC based on a mini-batch of\nfunctions \u2211_{i=1}^m f(x; \u03be_i)/m for a sufficiently large number of samples. Thus, a key building block of\nthe proposed method is a first-order procedure to extract NC for a non-convex function f(x)\u00b9.\nBelow, we first propose a gradient descent based method for extracting NC, which achieves a similar\niteration complexity to the Power method. Second, we present an accelerated gradient method to\nextract the NC to match the iteration complexity of the Lanczos method. Finally, we discuss the\napplication of these procedures for stochastic non-convex optimization using mini-batch.\n\n\u00b9 We abuse the same notation f here.\n\n4.1 Extracting NC by NEON\nNEON is inspired by the perturbed gradient descent (PGD) method (a method for solving\ndeterministic non-convex problems) proposed in the seminal work [11] and its connection with the\nPower method as discussed shortly. Around a saddle point x, the PGD method first generates a\nrandom noise vector \u00ea from the sphere of a Euclidean ball with a proper radius; then, starting from a\nnoise-perturbed solution x_0 = x + \u00ea, it generates the following sequence of solutions:\n\nx_\u03c4 = x_{\u03c4\u22121} \u2212 \u03b7\u2207f(x_{\u03c4\u22121}).    (5)\n\nTo establish a connection with the Power method and motivate the proposed NEON, let us define\nanother sequence x\u0302_\u03c4 = x_\u03c4 \u2212 x. Then we have the recurrence x\u0302_\u03c4 = x\u0302_{\u03c4\u22121} \u2212 \u03b7\u2207f(x\u0302_{\u03c4\u22121} +\nx), \u03c4 = 1, . . . , t. It is clear that for \u03c4 = 1, . . . , t,\n\nx\u0302_\u03c4 = x\u0302_{\u03c4\u22121} \u2212 \u03b7\u2207f(x) \u2212 \u03b7(\u2207f(x\u0302_{\u03c4\u22121} + x) \u2212 \u2207f(x)).\n\nTo understand the above update, we adopt the following approximation: \u2207f(x) \u2248 0 for an\napproximate saddle point, and from the Lipschitz continuous Hessian condition (2), we can see that\n\u2207f(x\u0302_{\u03c4\u22121} + x) \u2212 \u2207f(x) \u2248 \u2207\u00b2f(x)x\u0302_{\u03c4\u22121} as long as \u2016x\u0302_{\u03c4\u22121}\u2016 is small. Then for \u03c4 = 1, . . . , t,\n\nx\u0302_\u03c4 \u2248 x\u0302_{\u03c4\u22121} \u2212 \u03b7\u2207\u00b2f(x)x\u0302_{\u03c4\u22121} = (I \u2212 \u03b7\u2207\u00b2f(x))x\u0302_{\u03c4\u22121}.\n\nIt is obvious that the above approximated recurrence is close to the sequence generated by the\nPower method with the same starting random vector \u00ea = v_1. This intuitively explains why the\nupdated solution x_t = x + x\u0302_t can decrease the objective value: x\u0302_t is close to a NC of the\nHessian \u2207\u00b2f(x).\n\nAlgorithm 1 NEON(f, x, t, \u2131, r)\n1: Input: f, x, t, \u2131, r\n2: Generate u_0 randomly from S_r^d\n3: for \u03c4 = 0, . . . , t do\n4:   u_{\u03c4+1} = u_\u03c4 \u2212 \u03b7(\u2207f(x + u_\u03c4) \u2212 \u2207f(x))\n5: end for\n6: if min_{i\u2208[t+1], \u2016u_i\u2016\u2264U} f\u0302_x(u_i) \u2264 \u22122.5\u2131 then\n7:   return u_{\u03c4\u2032}, \u03c4\u2032 = arg min_{i\u2208[t+1], \u2016u_i\u2016\u2264U} f\u0302_x(u_i)\n8: else return 0\n\nAlgorithm 2 NEON+(f, x, t, \u2131, U, \u03b6, r)\n1: Input: f, x, t, \u2131, U, \u03b6, r\n2: Generate y_0 = u_0 randomly from S_r^d\n3: for \u03c4 = 0, . . . , t do\n4:   if \u2206_x(y_\u03c4, u_\u03c4) < \u2212(\u03b3/2)\u2016y_\u03c4 \u2212 u_\u03c4\u2016\u00b2 then\n5:     return v = NCFind(y_{0:\u03c4}, u_{0:\u03c4})\n6:   end if\n7:   compute (y_{\u03c4+1}, u_{\u03c4+1}) by (8)\n8: end for\n9: if min_{i, \u2016y_i\u2016\u2264U} f\u0302_x(y_i) \u2264 \u22122\u2131 then\n10:   let \u03c4\u2032 = arg min_{i, \u2016y_i\u2016\u2264U} f\u0302_x(y_i)\n11:   return y_{\u03c4\u2032}\n12: else\n13:   return 0\n14: end if\n\nAlgorithm 3 NCFind(y_{0:\u03c4}, u_{0:\u03c4})\n1: if min_{j=0,...,\u03c4} \u2016y_j \u2212 u_j\u2016 \u2265 \u03b6\u221a(6\u03b7\u2131) then\n2:   return y_j, j = min{j\u2032 : \u2016y_{j\u2032} \u2212 u_{j\u2032}\u2016 \u2265 \u03b6\u221a(6\u03b7\u2131)}\n3: else return y_\u03c4 \u2212 u_\u03c4\n\nTo provide a formal analysis, we will first analyze the following recurrence:\n\nu_\u03c4 = u_{\u03c4\u22121} \u2212 \u03b7(\u2207f(x + u_{\u03c4\u22121}) \u2212 \u2207f(x)), \u03c4 = 1, . . .    (6)\n\nstarting with a random noise vector u_0, which is drawn from the sphere of a Euclidean ball with a\nproper radius r denoted by S_r^d. It is notable that the recurrence in (6) is slightly different from that\nin (5). We emphasize that this simple change is useful for extracting the NC at any point whose\nHessian has a negative eigen-value, not just at non-degenerate saddle points, which can be used in\nsome stochastic or deterministic algorithms [1, 4, 21, 16]. The proposed procedure NEON based on\nthe above sequence for finding a NC direction of \u2207\u00b2f(x) is presented in Algorithm 1, where f\u0302_x(u)\nis defined in (7). The following theorem states our result of NEON for extracting the NC.\nTheorem 1. Under Assumption 1 (i), let \u03b3 \u2208 (0, 1) and \u03b4 \u2208 (0, 1) be sufficiently small. 
For any\nconstant \u0109 \u2265 18, there exists a constant c_max that depends on \u0109, such that if NEON is called with\nt = \u0109 log(dL_1/(\u03b3\u03b4))/(\u03b7\u03b3), \u2131 = \u03b7\u03b3\u00b3L_1L_2^{\u22122} log^{\u22123}(dL_1/(\u03b3\u03b4)), r = \u221a\u03b7 \u03b3\u00b2L_1^{\u22121/2}L_2^{\u22121} log^{\u22122}(dL_1/(\u03b3\u03b4)),\nU = 4\u0109(\u221a(\u03b7L_1) \u2131/L_2)^{1/3} and a constant \u03b7 \u2264 c_max/L_1, then at a point x satisfying \u03bb_min(\u2207\u00b2f(x)) \u2264\n\u2212\u03b3 with high probability 1 \u2212 \u03b4 it returns u such that u\u22a4\u2207\u00b2f(x)u/\u2016u\u2016\u00b2 \u2264 \u2212\u03b3/(8\u0109\u00b2 log(dL_1/(\u03b3\u03b4))) \u2264 \u2212\u03a9\u0303(\u03b3). If\nNEON returns u \u2260 0, then the above inequality must hold; if NEON returns 0, we can conclude that\n\u03bb_min(\u2207\u00b2f(x)) \u2265 \u2212\u03b3 with high probability 1 \u2212 O(\u03b4).\nRemark: The above theorem shows that at any point x whose Hessian has a negative eigen-value\n(including non-degenerate saddle points), NEON can find a NC of \u2207\u00b2f(x) with high probability.\n\n4.2 Finding NC by Accelerated Gradient Method\nAlthough NEON provides a similar guarantee for extracting a NC as that provided by the Power\nmethod, its iteration complexity O(1/\u03b3) is worse than that of the Lanczos method, i.e., O(1/\u221a\u03b3).\nIn this subsection, we present a first-order method that matches O(1/\u221a\u03b3) of the Lanczos method.\nLet us recall the sequence (6), which is essentially an application of gradient descent (GD) method to\nthe following objective function:\n\nf\u0302_x(u) = f(x + u) \u2212 f(x) \u2212 \u2207f(x)\u22a4u.    (7)\n\nIn the sequel, we write f\u0302_x(u) = f\u0302(u), where the dependence on x should be clear from the context. By\nthe Lipschitz continuous Hessian condition, we have that\n\n(1/2)u\u22a4\u2207\u00b2f(x)u \u2212 (L_2/6)\u2016u\u2016\u00b3 \u2264 f\u0302(u).\n\nIt implies that if f\u0302(u) is sufficiently less than zero and \u2016u\u2016 is not too large, then u\u22a4\u2207\u00b2f(x)u/\u2016u\u2016\u00b2 will be\nsufficiently less than zero. Hence, NEON can be explained as using GD updates to decrease f\u0302(u).\nA natural question to ask is whether the convergence of GD updates of NEON can be accelerated by\naccelerated gradient (AG) methods. It is well-known from convex optimization literature that AG\nmethods can accelerate the convergence of GD method for smooth problems. Recently, several studies\nhave explored AG methods for non-convex optimization [15, 19, 3, 12]. Notably, [19] analyzed the\nbehavior of AG methods near strict saddle points and investigated the rate of divergence from a strict\nsaddle point for toy quadratic problems. [12] analyzed a single-loop algorithm based on Nesterov\u2019s\nAG method for deterministic non-convex optimization. However, none of these studies provide an\nexplicit complexity guarantee on extracting NC from the Hessian matrix for a general non-convex\nproblem. Inspired by these studies, we will show that Nesterov\u2019s AG (NAG) method [18] when\napplied to the function f\u0302(u) can find a NC with a complexity of \u00d5(1/\u221a\u03b3).\nThe updates of NAG method applied to the function f\u0302(u) at a given point x are given by\n\ny_{\u03c4+1} = u_\u03c4 \u2212 \u03b7\u2207f\u0302(u_\u03c4),\nu_{\u03c4+1} = y_{\u03c4+1} + \u03b6(y_{\u03c4+1} \u2212 y_\u03c4),    (8)\n\nwhere \u03b6(y_{\u03c4+1} \u2212 y_\u03c4) is the momentum term, and \u03b6 \u2208 (0, 1) is the momentum parameter. 
The\nproposed algorithm based on the NAG method (referred to as NEON+) for extracting NC of a\nHessian matrix \u2207\u00b2f(x) is presented in Algorithm 2, where\n\n\u2206_x(y_\u03c4, u_\u03c4) = f\u0302_x(y_\u03c4) \u2212 f\u0302_x(u_\u03c4) \u2212 \u2207f\u0302_x(u_\u03c4)\u22a4(y_\u03c4 \u2212 u_\u03c4),\n\nand NCFind is a procedure that returns a NC by searching over the history y_{0:\u03c4}, u_{0:\u03c4} shown in\nAlgorithm 3. The condition check in Step 4 is to detect easy cases such that NCFind can easily find\na NC in historical solutions without continuing the update, which is designed following a similar\nprocedure called Negative Curvature Exploitation (NCE) proposed in [12]. However, the difference\nis that NCFind is tailored to finding a negative curvature satisfying (3), while NCE in [12] is for\nensuring a decrease on a modified objective. The theoretical result of NEON+ is presented below.\nTheorem 2. Under Assumption 1 (i), let \u03b3 \u2208 (0, 1) and \u03b4 \u2208 (0, 1) be sufficiently small. For any\nconstant \u0109 \u2265 43, there exists a constant c_max that depends on \u0109, such that if NEON+ is called with\nt = \u221a(\u0109 log(dL_1/(\u03b3\u03b4))/(\u03b7\u03b3)), \u2131 = \u03b7\u03b3\u00b3L_1L_2^{\u22122} log^{\u22123}(dL_1/(\u03b3\u03b4)), r = \u221a\u03b7 \u03b3\u00b2L_1^{\u22121/2}L_2^{\u22121} log^{\u22122}(dL_1/(\u03b3\u03b4)),\nU = 12\u0109(\u221a(\u03b7L_1) \u2131/L_2)^{1/3}, a small constant \u03b7 \u2264 c_max/L_1, and a momentum parameter \u03b6 = 1 \u2212 \u221a(\u03b7\u03b3),\nthen at any point x satisfying \u03bb_min(\u2207\u00b2f(x)) \u2264 \u2212\u03b3 with high probability 1 \u2212 \u03b4 it returns u such that\nu\u22a4\u2207\u00b2f(x)u/\u2016u\u2016\u00b2 \u2264 \u2212\u03b3/(72\u0109\u00b2 log(dL_1/(\u03b3\u03b4))) \u2264 \u2212\u03a9\u0303(\u03b3). If NEON+ returns u \u2260 0, then the above inequality\nmust hold; if NEON+ returns 0, we can conclude that \u03bb_min(\u2207\u00b2f(x)) \u2265 \u2212\u03b3 with high probability\n1 \u2212 O(\u03b4).\n\n4.3 Stochastic Approach for Extracting NC\n\nIn this subsection, we present a stochastic approach for extracting NC for F(x) in (1). For simplicity,\nwe refer to both NEON and NEON+ as NEON. The challenge in employing NEON for finding a\nNC for the original function F(x) in (1) is that we cannot evaluate the gradient of F(x) exactly. To\naddress this issue, we resort to the mini-batching technique.\nLet S = {\u03be_1, . . . , \u03be_m} denote a set of random samples and define a sub-sampled function F_S(x) =\n(1/|S|) \u2211_{\u03be\u2208S} f(x; \u03be). Then we apply NEON to F_S(x) for finding an approximate NC u_S of \u2207\u00b2F_S(x).\nBelow, we show that as long as m is sufficiently large, u_S is also an approximate NC of \u2207\u00b2F(x).\nTheorem 3. Under Assumption 1 (i), for a sufficiently small \u03b4 \u2208 (0, 1) and \u0109 \u2265 43, let m \u2265\n16L_1^2 log(6d/\u03b4)/(s\u00b2\u03b3\u00b2), where s = log^{\u22121}(3dL_1/(2\u03b3\u03b4))/(12\u0109)\u00b2 is a proper small constant. If \u03bb_min(\u2207\u00b2F(x)) \u2264 \u2212\u03b3,\nthere exists c > 0 such that with probability 1 \u2212 \u03b4, NEON(F_S, x, t, \u2131, r) returns a vector u_S such\nthat u_S\u22a4\u2207\u00b2F(x)u_S/\u2016u_S\u2016\u00b2 \u2264 \u2212c\u03b3, where c = (12\u0109)^{\u22122} log^{\u22121}(3dL_1/(2\u03b3\u03b4)). If NEON(F_S, x, t, \u2131, r) returns\n0, then with high probability 1 \u2212 O(\u03b4) we have \u03bb_min(\u2207\u00b2F(x)) \u2265 \u22122\u03b3. In either case, NEON\nterminates with an IFO complexity of \u00d5(1/\u03b3\u00b3) or \u00d5(1/\u03b3^{2.5}) corresponding to Algorithm 1 and\nAlgorithm 2, respectively.\n\nAlgorithm 4 NEON-A\n1: Input: x_1, other parameters of algorithm A\n2: for j = 1, 2, . . . , do\n3:   Compute (y_j, z_j) = A(x_j)\n4:   if first-order condition of y_j not met then\n5:     let x_{j+1} = z_j\n6:   else\n7:     u_j = NEON(F_{S_2}, y_j, t, \u2131, r)\n8:     if u_j = 0 return y_j\n9:     else let x_{j+1} = y_j \u2212 (c\u03b3\u03be\u0304/L_2) u_j/\u2016u_j\u2016\n10:   end if\n11: end for\n\nAlgorithm 6 SCSG-epoch: (x, S_1, b)\n1: Input: x, an independent set of samples S_1 and b \u2264 |S_1|\n2: Set m_1 = |S_1|, \u03b7 = c\u2032(m_1/b)^{\u22122/3}, c\u2032 \u2264 1/6\n3: Compute \u2207F_S(x_{j\u22121}) and let x_0 = x\n4: Generate N \u223c Geom(m_1/(m_1 + b))\n5: for k = 1, 2, . . . , N do\n6:   Sample samples S_k of size b\n7:   v_k = \u2207F_{S_k}(x_{k\u22121}) \u2212 \u2207F_{S_k}(x_0) + \u2207F_S(x_0)\n8:   x_k = x_{k\u22121} \u2212 \u03b7v_k\n9: end for\n10: return x_N\n\n5 First-order Algorithms for Stochastic Non-Convex Optimization\n\nIn this section, we will first describe a general framework for promoting existing first-order stochastic\nalgorithms denoted by A to enjoy a second-order convergence, which is shown in Algorithm 4.\nHere, we require A(x_j) to return two points (y_j, z_j) that satisfy (9) and the mini-batch sample size\nm = |S_2| satisfies the condition in Lemma 3. The proposed NEON is used for escaping from a saddle\npoint. 
It should be noted that Algorithm 4 is abstract: it depends on how Step 3 is implemented, how the first-order condition is checked, and how the step size parameter $\\bar{\\xi}$ in Step 9 is set.\n\nFor theoretical interest, we will analyze Algorithm 4 with a Rademacher random variable $\\bar{\\xi} \\in \\{1, -1\\}$ and its three main components satisfying the following properties.\n\nProperty 1. (1) Step 7 - Step 9 guarantee that if $\\lambda_{\\min}(\\nabla^2 F(y_j)) \\leq -\\gamma$, there exists $C > 0$ such that $\\mathrm{E}[F(x_{j+1}) - F(y_j)] \\leq -C\\gamma^3$. Let the total IFO complexity of Step 7 - Step 9 be $T_n$. (2) There exists a first-order stochastic algorithm $(y_j, z_j) = \\mathcal{A}(x_j)$ that satisfies:\n\n$$\\begin{aligned} &\\text{if } \\|\\nabla F(y_j)\\| \\geq \\epsilon, \\text{ then } \\mathrm{E}[F(z_j) - F(x_j)] \\leq -\\varepsilon(\\epsilon, \\alpha), \\\\ &\\text{if } \\|\\nabla F(y_j)\\| \\leq \\epsilon, \\text{ then } \\mathrm{E}[F(y_j) - F(x_j)] \\leq C\\gamma^3/2, \\end{aligned} \\tag{9}$$\n\nwhere $\\varepsilon(\\epsilon, \\alpha)$ is a function of $\\epsilon$ and a parameter $\\alpha > 0$. Let the total IFO complexity of $\\mathcal{A}(x)$ be $T_a$. (3) The check of the first-order condition can be implemented by using a mini-batch of samples $\\mathcal{S}$, i.e., $\\|\\nabla F_{\\mathcal{S}}(y_j)\\| \\leq \\epsilon$, where $\\mathcal{S}$ is independent of $y_j$ such that $\\|\\nabla F(y_j) - \\nabla F_{\\mathcal{S}}(y_j)\\| \\leq \\epsilon/2$. Let the IFO complexity of checking the first-order condition be $T_c$.\n\nProperty (1) can be guaranteed by Theorem 3 and Lemma 1. When using NEON, $T_n = \\widetilde{O}(1/\\gamma^3)$, and when using NEON+, $T_n = \\widetilde{O}(1/\\gamma^{2.5})$. For Property (2), we will analyze several interesting algorithms. Property (3) can be guaranteed by Lemma 2 in the supplement under Assumption 1 (iii) with $T_c = \\widetilde{O}(1/\\epsilon^2)$. Based on the above properties, we have the following convergence of Algorithm 4.\n\nTheorem 4. Assume Property 1 holds. Then with high probability $1 - \\delta$, NEON-$\\mathcal{A}$ terminates with a total IFO complexity of $\\widetilde{O}\\left(\\max\\left(\\frac{1}{\\varepsilon(\\epsilon, \\alpha)}, \\frac{1}{\\gamma^3}\\right)(T_n + T_a + T_c)\\right)$. Upon termination, with high probability $\\|\\nabla F(y_j)\\| \\leq O(\\epsilon)$ and $\\lambda_{\\min}(\\nabla^2 F(y_j)) \\geq -2\\gamma$, where $\\widetilde{O}(\\cdot)$ hides logarithmic factors of $d$ and $1/\\delta$, and the problem\u2019s other constant parameters.\n\nNext, we present corollaries of Theorem 4 for several instances of $\\mathcal{A}$, including the stochastic gradient descent (SGD) method, stochastic momentum (SM) methods, mini-batch SGD (MSGD), and SCSG. SGD and its momentum variants (including the stochastic heavy-ball (SHB) method and the stochastic Nesterov\u2019s accelerated gradient (SNAG) method) are popular stochastic algorithms for solving a stochastic non-convex optimization problem. We will consider them in a unified framework as established in [22]. The updates of SM starting from $x_0$ are\n\n$$\\begin{aligned} \\hat{x}_{\\tau+1} &= x_\\tau - \\eta \\nabla f(x_\\tau; \\xi_\\tau), \\\\ \\hat{x}^s_{\\tau+1} &= x_\\tau - s\\eta \\nabla f(x_\\tau; \\xi_\\tau), \\\\ x_{\\tau+1} &= \\hat{x}_{\\tau+1} + \\beta(\\hat{x}^s_{\\tau+1} - \\hat{x}^s_\\tau), \\end{aligned} \\tag{10}$$\n\nfor $\\tau = 0, \\ldots, t$ and $\\hat{x}^s_0 = x_0$, where $\\beta \\in (0, 1)$ is a momentum constant, $\\eta$ is a step size, and $s = 0, 1, 1/(1 - \\beta)$ corresponds to SHB, SNAG and SGD, respectively. Let the sequence $x^+_\\tau$ with $x^+_0 = x_0$ be defined as\n\n$$x^+_\\tau = x_\\tau + p_\\tau, \\quad \\tau \\geq 1, \\qquad p_\\tau = \\frac{\\beta}{1 - \\beta}\\left(x_\\tau - x_{\\tau-1} - s\\eta \\nabla f(x_{\\tau-1}; \\xi_{\\tau-1})\\right). \\tag{11}$$\n\nAlgorithm 5 SM: $(x_0, \\eta, \\beta, s, t)$\n1: for $\\tau = 0, 1, 2, \\ldots, t$ do\n2:   Compute $x_{\\tau+1}$ according to (10)\n3:   Compute $x^+_{\\tau+1}$ according to (11)\n4: end for\n5: return $(x^+_{\\tau'}, x^+_{t+1})$, where $\\tau' \\in \\{0, \\ldots, t\\}$ is randomly generated.\n\nFigure 1: NEON vs Second-order Methods for Extracting NC\n\nWe can implement $\\mathcal{A}$ by Algorithm 5 and have the following result.\n\nCorollary 5. Let $\\mathcal{A}(x_j)$ be implemented by Algorithm 5 with $t = \\Theta(1/\\epsilon^2)$ iterations, $\\eta = \\Theta(\\epsilon^2)$, $\\beta \\in (0, 1)$, $s \\in (0, 1/(1 - \\beta))$. Then $T_a = O(1/\\epsilon^2)$ and $\\varepsilon(\\epsilon, \\alpha) = \\Theta(\\epsilon^2)$. Suppose that $\\gamma \\geq \\epsilon^{2/3}$ and $\\mathrm{E}[\\|\\nabla f(x; \\xi)\\|^2]$ is bounded for $s \\neq 1/(1 - \\beta)$. Then with high probability, NEON-SM finds an $(\\epsilon, \\gamma)$-SSP with a total IFO complexity of $\\widetilde{O}\\left(\\max\\left(\\frac{1}{\\epsilon^2}, \\frac{1}{\\gamma^3}\\right)\\left(T_n + \\frac{1}{\\epsilon^2}\\right)\\right)$, where $T_n = \\widetilde{O}(1/\\gamma^3)$ (NEON) or $T_n = \\widetilde{O}(1/\\gamma^{2.5})$ (NEON+).\n\nRemark: When $\\gamma = \\epsilon^{1/2}$, NEON-SM has an IFO complexity of $\\widetilde{O}(1/\\epsilon^4)$.\n\nMSGD computes $(y_j, z_j)$ by\n\n$$z_j = x_j - L_1^{-1} \\nabla F_{\\mathcal{S}_1}(x_j), \\qquad y_j = x_j, \\tag{12}$$\n\nwhere $\\mathcal{S}_1$ is a set of samples independent of $x_j$.\n\nCorollary 6. Let $\\mathcal{A}(x_j)$ be implemented by (12) with $|\\mathcal{S}_1| = \\widetilde{O}(1/\\epsilon^2)$. Then $T_a = \\widetilde{O}(1/\\epsilon^2)$ and $\\varepsilon(\\epsilon, \\alpha) = \\frac{\\epsilon^2}{4L_1}$. With high probability, NEON-MSGD finds an $(\\epsilon, \\gamma)$-SSP with a total IFO complexity of $\\widetilde{O}\\left(\\max\\left(\\frac{1}{\\epsilon^2}, \\frac{1}{\\gamma^3}\\right)(T_n + 1/\\epsilon^2)\\right)$.\n\nRemark: Compared to Corollary 5, there is no requirement that $\\gamma \\geq \\epsilon^{2/3}$, because MSGD guarantees that $\\mathrm{E}[F(y_j) - F(x_j)] \\leq 0$.\n\nSCSG was proposed in [14], which only provides a first-order convergence guarantee. 
SCSG runs with multiple epochs, and each epoch uses updates similar to those of SVRG, with three distinct features: (i) it is applied to a sub-sampled function $F_{\\mathcal{S}_1}$; (ii) it allows using a mini-batch of samples of size $b$ independent of $\\mathcal{S}_1$ to compute stochastic gradients; (iii) the number of updates in each epoch is a random number following a geometric distribution dependent on $b$ and $|\\mathcal{S}_1|$. These features make each SCSG epoch, denoted by SCSG-epoch$(x, \\mathcal{S}_1, b)$, have an expected IFO complexity of $T_a = O(|\\mathcal{S}_1|)$. We present SCSG-epoch$(x, \\mathcal{S}_1, b)$ in Algorithm 6. When using SCSG, $y_j$ and $z_j$ are\n\n$$y_j = \\text{SCSG-epoch}(x_j, \\mathcal{S}_1, b), \\qquad z_j = y_j. \\tag{13}$$\n\nCorollary 7. Let $\\mathcal{A}(x_j)$ be implemented by (13) with $|\\mathcal{S}_1| = \\widetilde{O}\\left(\\max(1/\\epsilon^2, 1/(\\gamma^{9/2} b^{1/2}))\\right)$. Then $\\varepsilon(\\epsilon, \\alpha) = \\Omega(\\epsilon^{4/3}/b^{1/3})$ and $\\mathrm{E}[T_a] = \\widetilde{O}\\left(\\max(1/\\epsilon^2, 1/(\\gamma^{9/2} b^{1/2}))\\right)$. With high probability, NEON-SCSG finds an $(\\epsilon, \\gamma)$-SSP with an expected total IFO complexity of $\\widetilde{O}\\left(\\max\\left(\\frac{b^{1/3}}{\\epsilon^{4/3}}, \\frac{1}{\\gamma^3}\\right)\\left(T_n + 1/\\epsilon^2 + 1/(\\gamma^{9/2} b^{1/2})\\right)\\right)$, where $T_n = \\widetilde{O}(1/\\gamma^3)$ (NEON) or $T_n = \\widetilde{O}(1/\\gamma^{2.5})$ (NEON+).\n\nRemark: When $\\gamma = \\epsilon^{1/2}$ and $b = 1/\\sqrt{\\epsilon}$, NEON-SCSG has an expected IFO complexity of $\\widetilde{O}(1/\\epsilon^{3.5})$. When $\\gamma \\geq \\epsilon^{4/9}$ and $b = 1$, NEON-SCSG has an expected IFO complexity of $\\widetilde{O}(1/\\epsilon^{3.33})$.\n\nFinally, we mention that the proposed NEON or NEON+ can be used in existing second-order stochastic algorithms that require a NC direction, as a substitute for second-order methods [1, 20]. [1] developed Natasha2, which uses the second-order online Oja\u2019s algorithm for finding the NC. [20] developed a stochastic algorithm for solving a finite-sum problem by using SVRG and a second-order stochastic algorithm for computing the NC. 
We can replace the second-order methods for computing a NC in these algorithms by the proposed NEON or NEON+, with the resulting algorithms referred to as NEON-Natasha and NEON-SVRG. It is a simple exercise to derive the convergence results in Table 1, which is left to interested readers.\n\n[Figure 1 plot. x-axis: #IFO (or #ISO) ($\\times 10^4$); y-axis: $v^\\top H v$; curves: Power, Lanczos, NEON, NEON+, NEON-st, NEON+-st, min-eig-val.]\n\n[Figure 2 plots. Panels: $d = 10^3$, $d = 10^4$, $d = 10^5$; x-axis: #IFO ($\\times 10^4$); y-axis: objective; curves: NEON+-SGD, NEON-SGD, Noisy SGD.]\n\nFigure 2: NEON-SGD vs Noisy SGD. (All algorithms converge to local minimum)\n\n6 Experiments\n\nExtracting NC. First, we present some simulations to verify the proposed NEON procedures for extracting NC. To this end, we consider minimizing a non-linear least-squares loss with a non-convex regularizer for classification, i.e., $F(x) = \\sum_{i=1}^{d} \\frac{x_i^2}{1 + x_i^2} + \\frac{\\lambda}{n} \\sum_{i=1}^{n} (b_i - \\sigma(x^\\top a_i))^2$, where $b_i \\in \\{0, 1\\}$ denotes the label and $a_i \\in \\mathbb{R}^d$ denotes the feature vector of the $i$-th data point, $\\lambda > 0$ is a trade-off parameter, and $\\sigma(\\cdot)$ is a sigmoid function. We generate a random vector $x \\sim N(0, I)$ as the target point to construct $\\hat{F}_x(u)$ and compute a NC of $\\nabla^2 F(x)$. We use a binary classification dataset named gisette from the libsvm data website that has $n = 6000$ examples and $d = 5000$ features, and set $\\lambda = 3$ in our simulation to ensure there is significant NC from the non-linear least-squares loss. The step size $\\eta$ and initial radius in the NEON procedures are set to 0.01 and the momentum parameter in NEON+ is set to 0.9. These values are tuned in a certain range.\n\nWe compare the two NEON procedures and their stochastic variants (denoted by NEON-st and NEON+-st in the figure) with second-order methods that use HVPs, namely the Power method and the Lanczos method, where the HVPs are calculated exactly. The result is shown in Figure 1, whose y-axis denotes the value of $\\hat{u}^\\top H \\hat{u}$, where $\\hat{u}$ represents the found normalized NC vector and $H = \\nabla^2 F(x)$ is the Hessian matrix. For NEON-st and NEON+-st, we use a sample size of 100. Please note that the solid red curve corresponding to NEON+-st terminates earlier because NCFind is executed. Several observations follow: (i) NEON performs similarly to the Power method (the two curves overlap in the figure); (ii) NEON+ converges faster than NEON; (iii) the stochastic versions of NEON and NEON+ find a good NC direction more quickly than their full versions in terms of IFO complexity and are even competitive with the Lanczos method. We include several more results in the supplement.\n\nEscaping Saddles. Second, we present some simulations to verify the proposed NEON and NEON+ based algorithms for minimizing a stochastic objective. We consider a non-convex optimization problem with $f(x; \\xi) = \\sum_{i=1}^{d} \\xi_i (x_i^4 - 4x_i^2)$, where the $\\xi_i$ are normal random variables with mean 1, so that the saddle points of the expected function are known [10]. Assuming the noise $\\xi$ is accessed only through a sampler, we compare NEON-SGD with a state-of-the-art algorithm, Noisy SGD [5], for different values of $d \\in \\{10^3, 10^4, 10^5\\}$. The step size of Noisy SGD is tuned in a wide range and the best one is used. The step size in the NEON procedures is set to the same value as in Noisy SGD. The radius in the NEON procedures is set to 0.01 and the momentum parameter in NEON+ is set to 0.9. The mini-batch size is tuned from $\\{50, 100, 200, 500\\}$. All algorithms are started from the same saddle point as the initial solution. The results are presented in Figure 2, showing that the two variants of NEON-SGD escape saddles faster than Noisy SGD. NEON+-SGD escapes saddle points the fastest among all algorithms for different values of $d$. In addition, increasing the dimensionality $d$ has a much larger effect on the IFO complexity of Noisy SGD than on that of the NEON-SGD methods, which is consistent with the theoretical results.\n\n7 Conclusions\n\nWe have proposed novel first-order procedures to extract negative curvature from a Hessian matrix by using a noise-initiated sequence, which are of independent interest. A general framework for promoting a first-order stochastic algorithm to enjoy second-order convergence is also proposed. Based on the proposed general framework, we designed several first-order stochastic algorithms with state-of-the-art second-order convergence guarantees.\n\nAcknowledgement\n\nThe authors thank the anonymous reviewers for their helpful comments. Y. Xu and T. Yang are partially supported by National Science Foundation (IIS-1545995).\n\nReferences\n\n[1] Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. CoRR, abs/1708.08694, 2017.\n\n[2] Z. Allen-Zhu and Y. Li. Neon2: Finding local minima via first-order oracles. CoRR, abs/1711.06673, 2017.\n\n[3] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. \"convex until proven guilty\": Dimension-free acceleration of gradient descent on non-convex functions. In ICML, pages 654\u2013663, 2017.\n\n[4] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751\u20131772, 2018.\n\n[5] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points \u2014 online stochastic gradient for tensor decomposition. 
In COLT, pages 797\u2013842, 2015.\n\n[6] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341\u20132368, 2013.\n\n[7] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59\u201399, 2016.\n\n[8] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., 155(1-2):267\u2013305, 2016.\n\n[9] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In ICML, pages 1833\u20131841, 2016.\n\n[10] P. Jain, P. Kar, et al. Non-convex optimization for machine learning. Foundations and Trends\u00ae in Machine Learning, 10(3-4):142\u2013336, 2017.\n\n[11] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In ICML, pages 1724\u20131732, 2017.\n\n[12] C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In COLT, pages 1042\u20131085, 2018.\n\n[13] J. Kuczynski and H. Wozniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094\u20131122, 1992.\n\n[14] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In NIPS, pages 2345\u20132355, 2017.\n\n[15] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In NIPS, pages 379\u2013387, 2015.\n\n[16] M. Liu and T. Yang. On noisy negative curvature descent: Competing with gradient descent for faster non-convex optimization. CoRR, abs/1709.08571, 2017.\n\n[17] M. Liu and T. Yang. Stochastic non-convex optimization with strong high probability second-order convergence. 
CoRR, abs/1710.09447, 2017.\n\n[18] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publ., 2004.\n\n[19] M. O\u2019Neill and S. J. Wright. Behavior of accelerated gradient methods near critical points of nonconvex problems. CoRR, abs/1706.07993, 2017.\n\n[20] S. Reddi, M. Zaheer, S. Sra, B. Poczos, F. Bach, R. Salakhutdinov, and A. Smola. A generic approach for escaping saddle points. In AISTATS, pages 1233\u20131242, 2018.\n\n[21] C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448\u20131477, 2018.\n\n[22] Y. Yan, T. Yang, Z. Li, Q. Lin, and Y. Yang. A unified analysis of stochastic momentum methods for deep learning. In IJCAI, pages 2955\u20132961, 2018.\n\n[23] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In COLT, pages 1980\u20132022, 2017.", "award": [], "sourceid": 2653, "authors": [{"given_name": "Yi", "family_name": "Xu", "institution": "The University of Iowa"}, {"given_name": "Rong", "family_name": "Jin", "institution": "Alibaba"}, {"given_name": "Tianbao", "family_name": "Yang", "institution": "The University of Iowa"}]}