{"title": "NEON2: Finding Local Minima via First-Order Oracles", "book": "Advances in Neural Information Processing Systems", "page_first": 3716, "page_last": 3726, "abstract": "We propose a reduction for non-convex optimization that can (1) turn an stationary-point finding algorithm into an local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm's performance.\n\nAs applications, our reduction turns Natasha2 into a first-order method without hurting its theoretical performance. It also converts SGD, GD, SCSG, and SVRG into algorithms finding approximate local minima, outperforming some best known results.", "full_text": "NEON2: Finding Local Minima\n\nvia First-Order Oracles\n\nZeyuan Allen-Zhu\u2217\nMicrosoft Research AI\nRedmond, WA 98052\n\nzeyuan@csail.mit.edu\n\nYuanzhi Li\u2217\n\nStanford University\nStanford, CA 94305\n\nyuanzhil@stanford.edu\n\nAbstract\n\nWe propose a reduction for non-convex optimization that can (1) turn an\nstationary-point \ufb01nding algorithm into an local-minimum \ufb01nding one, and (2) re-\nplace the Hessian-vector product computations with only gradient computations.\nIt works both in the stochastic and the deterministic settings, without hurting the\nalgorithm\u2019s performance.\nAs applications, our reduction turns Natasha2 into a \ufb01rst-order method with-\nout hurting its theoretical performance. It also converts SGD, GD, SCSG, and\nSVRG into algorithms \ufb01nding approximate local minima, outperforming some\nbest known results.\n\nn(cid:88)\n\ni=1\n\nf (x) =\n\n1\nn\n\nfi(x)\n\nIntroduction\n\n1\nNonconvex optimization has become increasingly popular due its ability to capture modern machine\nlearning tasks in large scale. 
For instance, training neural nets corresponds to minimizing a function

    f(x) = (1/n) ∑_{i=1}^n fi(x)

over x ∈ ℝ^d that is non-convex, where each training sample i corresponds to one loss function fi(·) in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random ∇fi(x) (corresponding to computing backpropagation once) to approximate ∇f(x) and performs descent updates.

Motivated by such large-scale machine learning applications, we wish to design faster first-order non-convex optimization methods that outperform gradient descent, both in the online and offline settings. In this paper, we say an algorithm is online if its complexity is independent of n (so n can be infinite), and offline otherwise. In recent years, researchers across different communities have gathered together to tackle this challenging question. By far, known theoretical approaches mostly fall into one of the following two categories.

First-order methods for stationary points. In analyzing first-order methods, we denote by gradient complexity T the number of computations of ∇fi(x). To achieve an ε-approximate stationary point, namely a point x with ‖∇f(x)‖ ≤ ε, it is folklore that gradient descent (GD) is offline and needs T ∝ O(n/ε²), while stochastic gradient descent (SGD) is online and needs T ∝ O(1/ε⁴). In recent years, the offline complexity has been improved to T ∝ O(n^{2/3}/ε²) by the SVRG method [4, 24], and the online complexity has been improved to T ∝ O(1/ε^{10/3}) by the SCSG method [19]. Both of them rely on the so-called variance-reduction technique, originally discovered for convex problems [12, 17, 27, 29].

∗Authors sorted in alphabetical order.
We acknowledge a parallel work of Xu and Yang [31] (which appeared online a few days before us), and have adopted their algorithm name Neon and called our new algorithm Neon2. Our algorithms are very different from theirs, and give better theoretical performance. The full version of this paper can be found on https://arxiv.org/abs/1711.06673.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

[Figure 1: Neon vs Neon2 for finding (ε, δ)-approximate local minima, in three panels: (a) Neon2+SGD vs Neon+SGD, (b) Neon2+SCSG vs Neon+SCSG, (c) Neon2+Natasha2 vs Neon+Natasha. We emphasize that Neon2 and Neon try to tackle the same problem, but are different algorithms.]

Both algorithms SVRG and SCSG are only capable of finding approximate stationary points, which may not necessarily be approximate local minima and are arguably bad solutions for deep neural nets [10, 11, 15]. Thus,

    can we turn stationary-point finding algorithms into local-minimum finding ones?

Hessian-vector methods for local minima. Using information about the Hessian, one can find ε-approximate local minima, namely points x with ‖∇f(x)‖ ≤ ε and also ∇²f(x) ⪰ −ε^{1/C}·I.² In 2006, Nesterov and Polyak [21] showed that one can find an ε-approximate local minimum in O(1/ε^{1.5}) iterations, but each iteration requires an (offline) computation as heavy as inverting the matrix ∇²f(x).

To fix this issue, researchers proposed to study the so-called "Hessian-free" methods that, in addition to gradient computations, also compute Hessian-vector products. That is, instead of using the full matrix ∇²fi(x) or ∇²f(x), these methods compute products ∇²fi(x)·v for indices i and vectors v.³ For Hessian-free methods, we denote by gradient complexity T the number of computations of ∇fi(x) plus that of ∇²fi(x)·v.
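Such a Hessian-vector product never needs the matrix ∇²fi(x) explicitly: it is the directional derivative of the gradient, and (as exploited in Section 1.1 below) can even be approximated by a difference of two gradient evaluations. A minimal numpy sketch, on a toy quadratic objective of our own choosing (the step size q is also an illustrative choice):

```python
import numpy as np

# Toy objective f(x) = 0.5 * x^T A x, whose Hessian is the fixed matrix A
# (an illustrative choice; only the gradient oracle is used below).
A = np.array([[2.0, 1.0],
              [1.0, -1.0]])

def grad(x):
    return A @ x          # one gradient evaluation (one backpropagation)

def hvp_approx(x, v, q=1e-5):
    """Approximate the Hessian-vector product (here A @ v) by a gradient difference."""
    return (grad(x + q * v) - grad(x)) / q

x = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
w = hvp_approx(x, v)
# For a quadratic the finite difference is exact up to floating-point error;
# for a general f with Lipschitz-continuous Hessian the error scales as q * ||v||^2.
assert np.allclose(w, A @ v, atol=1e-6)
```

For a quadratic the approximation is exact; the interesting regime, discussed in Section 1.1, is how small q must be for a general smooth f.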
The hope of using Hessian-vector products is to improve the complexity T as a function of ε. Such improvement was first shown possible independently by [1, 8] for the offline setting, with complexity T ∝ Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}), which is better than that of gradient descent. In the online setting, the first improvement was by Natasha2, which gives complexity T ∝ Õ(1/ε^{3.25}) [2].

Unfortunately, it is argued by some researchers that Hessian-vector products are not general enough and may not be as simple to implement as evaluating gradients [9]. Therefore,

    can we turn Hessian-free methods into first-order ones, without hurting their performance?

1.1 From Hessian-Vector Products to First-Order Methods

Recall that by the definition of the derivative we have

    ∇²fi(x)·v = lim_{q→0} { (∇fi(x + qv) − ∇fi(x)) / q } .

Given any Hessian-free method, at least at a high level, can we replace every occurrence of ∇²fi(x)·v with w = (∇fi(x + qv) − ∇fi(x))/q for some small q > 0?

Note the error introduced in this approximation is ‖∇²fi(x)·v − w‖ ∝ q‖v‖². However, the original algorithm might not be stable against adversarial noise; thus, an (inverse) exponentially small q might be required. One of our main contributions is to show how to implement these algorithms stably, so we can convert Hessian-free methods into first-order ones with an (inverse) polynomially small q.

In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search) subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art Hessian-free methods that have rigorous proofs [1, 2, 8].
It solves the following simple task:

negative-curvature search (NC-search)
given a point x0, decide if ∇²f(x0) ⪰ −δI, or find a unit vector v such that vᵀ∇²f(x0)v ≤ −δ/2.

²We say A ⪰ −δI if all the eigenvalues of A are no smaller than −δ. In this high-level introduction, we focus only on the case when δ = ε^{1/C} for some constant C.

³Hessian-free methods are useful because when fi(·) is explicitly given, computing its gradient is in the same complexity as computing its Hessian-vector product [23, 28], using backpropagation.

Online Setting. In the online setting, NC-search can be solved by Oja's algorithm [22], which costs Õ(1/δ²) computations of Hessian-vector products. This was first proved by Allen-Zhu and Li [7], and first applied to NC-search in Natasha2 [2].

In this paper, we propose a method Neon2online which solves the NC-search problem via only stochastic first-order updates. That is, starting from x1 = x0 + ξ where ξ is some random perturbation, we keep updating xt+1 = xt − η(∇fi(xt) − ∇fi(x0)). In the end, the vector xT − x0 gives us enough information about the negative curvature.

Theorem 1 (informal).
Our Neon2online algorithm solves NC-search using Õ(1/δ²) stochastic gradients, without Hessian-vector product computations.

This complexity Õ(1/δ²) matches that of Oja's algorithm, and is information-theoretically optimal (up to log factors); see the lower bound in [7].

The independent work Neon by Xu and Yang [31] is actually the first recorded theoretical result that proposed this approach. However, Neon needs Õ(1/δ³) stochastic gradients, because it uses full gradient descent to find negative curvature (on a sub-sampled objective), inspired by power method and [16]; instead, Neon2online uses stochastic gradients and is based on the recent result on Oja's algorithm [7].

Plugging Neon2online into Natasha2 [2], we achieve the following corollary (see Figure 1(c)):

Theorem 2 (informal). Neon2online turns Natasha2 into a stochastic first-order method, without hurting its performance. That is, it finds an (ε, δ)-approximate local minimum in T = Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) stochastic gradient computations, without Hessian-vector product computations.

(We say x is an (ε, δ)-approximate local minimum if ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI.)

Offline Deterministic Setting. There are a number of ways to solve the NC-search problem in the offline setting using Hessian-vector products. Most notably, power method uses Õ(n/δ) computations of Hessian-vector products, and Lanczos method [18] uses Õ(n/√δ) computations. In this paper, we convert (a variant of) Lanczos method into a first-order one:

Theorem 3 (informal). Our Neon2det algorithm solves NC-search using Õ(1/√δ) full gradients (or equivalently Õ(n/√δ) stochastic gradients).

The independent work Neon [31] also applies to the offline setting, and needs Õ(1/δ) full gradients. Their approach is inspired by [16], but our Neon2det is based on Chebyshev approximation theory.

By putting Neon2det into the CDHS method of Carmon et al. [8], we have

Theorem 4 (informal). Neon2det turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in Õ(1/ε^{1.75} + 1/δ^{3.5}) full gradient computations.

1.1.1 Offline Finite-Sum Setting

Recall one can also solve the NC-search problem in the offline setting by the (finite-sum) shift-and-invert [13] method, using Õ(n + n^{3/4}/√δ) computations of Hessian-vector products. We refer to this method as "finite-sum SI", and also convert it into a first-order method:

Theorem 5 (informal). Our Neon2finite algorithm solves NC-search using Õ(n + n^{3/4}/√δ) stochastic gradients.

Putting Neon2finite into the (finite-sum version of) CDHS method [8], we have⁴

Theorem 6 (informal). Neon2finite turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in T = Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75} + n/δ³ + n^{3/4}/δ^{3.5}) stochastic gradient computations.

Remark 1.1. All the cited works in Section 1.1 require the objective to have (1) Lipschitz-continuous Hessian and (2) Lipschitz-continuous gradient. One can argue that (1) and (2) are both necessary for finding approximate local minima, but if only finding approximate stationary points, then only (2) is necessary.
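To make the stochastic first-order update of Neon2online from Section 1.1 (xt+1 = xt − η(∇fi(xt) − ∇fi(x0))) concrete, here is a minimal numpy sketch on a toy finite sum of quadratics. The dimensions, step size η, escape radius r, and perturbation size σ are all illustrative choices of ours, and for simplicity we return the escaping iterate, whereas the actual algorithm returns a uniformly random prefix iterate and verifies the answer to boost confidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite sum: f_i(x) = 0.5 * x^T A_i x, so grad f_i(x) = A_i x.
# The average Hessian A_mean has minimum eigenvalue -1 (illustrative choice).
d, n = 3, 5
A_mean = np.diag([-1.0, 0.5, 0.8])
As = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    As.append(A_mean + 0.01 * (G + G.T) / 2)
shift = sum(As) / n - A_mean          # re-center so the average is exactly A_mean
As = [Ai - shift for Ai in As]

def grad_i(i, x):
    return As[i] @ x

def nc_search(x0, eta=0.05, T=5000, r=1.0, sigma=1e-6):
    """First-order negative-curvature search: only gradient differences are used."""
    x = x0 + sigma * rng.standard_normal(d)       # small random perturbation
    for _ in range(T):
        i = rng.integers(n)                       # random component i
        x = x - eta * (grad_i(i, x) - grad_i(i, x0))
        if np.linalg.norm(x - x0) >= r:           # escaped: negative curvature found
            y = x - x0   # (the paper returns a uniformly random prefix iterate instead)
            return y / np.linalg.norm(y)
    return None          # suggests no eigenvalue below -delta (certified separately)

v = nc_search(np.zeros(d))
```

On this instance the returned unit vector aligns with the bottom eigenvector of A_mean, so vᵀA_mean v is close to −1, i.e., well below the −δ/2 threshold the NC-search task asks for.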
We shall formally discuss our assumptions in Section 2.

⁴The original paper of CDHS only stated their algorithm in the deterministic setting, but it is easily verifiable to work in the finite-sum setting; see discussions in [1].

algorithm                       gradient complexity T                                       Hessian-vector products
-- online methods, stationary points --
SGD (folklore)                  O(1/ε⁴)                                                     no
SCSG [19]                       O(1/ε^{10/3})                                               no
-- online methods, local minima --
perturbed SGD [14]              Õ(poly(d)·(1/ε⁴ + 1/δ¹⁶))                                   no
Neon+SGD [31]                   Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁷)                                  no
Neon2+SGD                       Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵)                                  no
Neon+SCSG [31]                  O(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶)                             no
Neon2+SCSG                      O(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵)                             no
Natasha2 [2]                    Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵)                              needed
Neon+Natasha2 [31]              Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁶)                              no
Neon2+Natasha2                  Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵)                              no
-- offline methods, stationary points --
GD (folklore [20])              O(n/ε²)                                                     no
SVRG [4, 24]                    O(n^{2/3}/ε² + n)                                           no
"convex until guilty" [9]       Õ(n/ε^{1.75})                                               no
-- offline methods, local minima --
perturbed GD [16]               Õ(n/ε² + n/δ⁴)                                              no
Neon2+GD                        Õ(n/ε² + n/δ^{3.5})                                         no
Reddi et al. [25]               Õ(n^{2/3}/ε² + n/δ³ + n^{3/4}/δ^{3.5})                      needed
Neon2+SVRG                      Õ(n^{2/3}/ε² + n/δ³ + n^{3/4}/δ^{3.5})                      no
FastCubic [1] / CDHS [8]        Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75} + n/δ³ + n^{3/4}/δ^{3.5})    needed
Neon2+CDHS                      Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75} + n/δ³ + n^{3/4}/δ^{3.5})    no

Table 1: Complexity for finding ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI. Following tradition, in these complexity bounds, we assume variance and smoothness parameters as constants, and only show the dependency on n, d, ε. (The original table also records, per method, whether a variance bound, Lipschitz smoothness, and second-order smoothness are needed; this is summarized by the remarks below.)

Remark 1. A variance bound is needed for the online methods (first half of the table).
Remark 2. Lipschitz smoothness is needed even for finding approximate stationary points.
Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima.

1.2 From Stationary Points to Local Minima

Given any first-order method that finds stationary points (such as GD, SGD, SVRG or SCSG), we can hope to use the NC-search routine to identify whether or not its output x satisfies ∇²f(x) ⪰ −δI.
If so, then automatically x becomes an (ε, δ)-approximate local minimum, so we can terminate. If not, we can move in its negative-curvature direction to further decrease the objective.

In the independent work of Xu and Yang [31], they applied their Neon method for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local minima. In this paper, we use Neon2 instead. We show the following theorem:

Theorem 7 (informal). To find an (ε, δ)-approximate local minimum,
(a) Neon2+SGD needs T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;
(b) Neon2+SCSG needs T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;
(c) Neon2+GD needs T = Õ(n/ε² + n/δ^{3.5}) (so Õ(1/ε² + 1/δ^{3.5}) full gradients); and
(d) Neon2+SVRG needs T = Õ(n^{2/3}/ε² + n/δ³ + n^{3/4}/δ^{3.5}) stochastic gradients.

Algorithm 1 Neon2online_weak(f, x0, δ)
1: η ← δ/(C0² L² log(100d)),  T ← C0² log(100d)/(ηδ)    ▷ for a sufficiently large constant C0
2: ξ ← σ·ξ′/‖ξ′‖ where ξ′ ∼ N(0, I).    ▷ ξ is a Gaussian random vector with norm σ := (100d)^{−3C0}·η²δ³/L2
3: x1 ← x0 + ξ.
4: for t ← 1 to T do
5:    xt+1 ← xt − η(∇fi(xt) − ∇fi(x0)) where i ∈_R [n].
6:    if ‖xt+1 − x0‖ ≥ r then return v = (xs − x0)/‖xs − x0‖ for a uniformly random s ∈ [t].    ▷ r := (100d)^{C0}·σ
7: end for
8: return v = ⊥.

1.3 Roadmap

We introduce notions and formalize the problem in Section 2.
We introduce Neon2 in the online, deterministic, and finite-sum settings respectively in Section 3, Section 4 and Section 5. We apply Neon2 to SGD, GD, Natasha2, CDHS, SCSG and SVRG in Section 6. Most of the proofs are in the appendix.

2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, ..., n} uniformly at random. We denote by I[event] the indicator function of probabilistic events.

We denote by ‖A‖₂ the spectral norm of matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ −σI if and only if all eigenvalues of A are no less than −σ. We denote by λmin(A) and λmax(A) the minimum and maximum eigenvalues of a symmetric matrix A.

Definition 2.1. For a function f : ℝ^d → ℝ,
• f is L-Lipschitz smooth (or L-smooth for short) if ∀x, y ∈ ℝ^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
• f is second-order L2-Lipschitz smooth (or L2-second-order smooth for short) if ∀x, y ∈ ℝ^d, ‖∇²f(x) − ∇²f(y)‖₂ ≤ L2‖x − y‖.

2.1 Problem and Assumptions

Throughout the paper we study

    min_{x ∈ ℝ^d} { f(x) := (1/n) ∑_{i=1}^n fi(x) }    (2.1)

where both f(·) and each fi(·) can be nonconvex.
We wish to find (ε, δ)-approximate local minima, which are points x satisfying

    ‖∇f(x)‖ ≤ ε  and  ∇²f(x) ⪰ −δI .

We need the following three assumptions:
• Each fi(x) is L-Lipschitz smooth.
• Each fi(x) is second-order L2-Lipschitz smooth.
• Stochastic gradients have bounded variance: ∀x ∈ ℝ^d : E_{i∈R[n]} ‖∇f(x) − ∇fi(x)‖² ≤ V. (This assumption is needed only for online algorithms.)

Algorithm 2 Neon2online(f, x0, δ, p)    ▷ for boosting the confidence of Neon2online_weak
Input: function f(x) = (1/n)∑_{i=1}^n fi(x), vector x0, negative curvature δ > 0, confidence p ∈ (0, 1].
1: for j = 1, 2, ..., Θ(log 1/p) do    ▷ boost the confidence
2:    vj ← Neon2online_weak(f, x0, δ);
3:    if vj ≠ ⊥ then
4:       m ← Θ(L² log(1/p)/δ²),  v′ ← Θ(δ/L2)·vj.
5:       Draw i1, ..., im ∈_R [n].
6:       zj ← (1/(m‖v′‖²)) ∑_{k=1}^m (v′)ᵀ(∇f_{ik}(x0 + v′) − ∇f_{ik}(x0)).
7:       if zj ≤ −3δ/4 then return v = vj.
8:    end if
9: end for
10: return v = ⊥.

3 Neon2 in the Online Setting

We propose Neon2online as the online version of Neon2. It repeatedly invokes Neon2online_weak (Algorithm 1), whose goal is to solve the NC-search problem with confidence 2/3 only; Neon2online invokes Neon2online_weak repeatedly for log(1/p) times to boost the confidence to 1 − p. We prove the following theorem:

Theorem 1 (Neon2online). Let f(x) = (1/n)∑_{i=1}^n fi(x) where each fi is L-smooth and L2-second-order smooth.
For every point x0 ∈ ℝ^d, every δ ∈ (0, L], every p ∈ (0, 1), the output v = Neon2online(f, x0, δ, p) satisfies that, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x0) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖ = 1 and vᵀ∇²f(x0)v ≤ −δ/2.
Moreover, the total number of stochastic gradient evaluations is O(log²(d/p)·L²/δ²).

The proof of Theorem 1 immediately follows from Lemma 3.1 and Lemma 3.2 below.

Lemma 3.1 (Neon2online_weak). In the same setting as Theorem 1, the output v = Neon2online_weak(f, x0, δ) satisfies: if λmin(∇²f(x0)) ≤ −δ, then with probability at least 2/3, v ≠ ⊥ and vᵀ∇²f(x0)v ≤ −(3/4)δ.

Proof sketch of Lemma 3.1. We explain why Neon2online_weak works as follows. Starting from a randomly perturbed point x1 = x0 + ξ, it keeps updating xt+1 ← xt − η(∇fi(xt) − ∇fi(x0)) for some random index i ∈ [n], and stops either when T iterations are reached, or when ‖xt+1 − x0‖ > r. Therefore, we have ‖xt − x0‖ ≤ r throughout the iterations, and thus can approximate ∇²fi(x0)(xt − x0) using ∇fi(xt) − ∇fi(x0), up to error O(r²). This is a small term based on our choice of r.

Ignoring the error term, our updates look like xt+1 − x0 = (I − η∇²fi(x0))(xt − x0). This is exactly the same as Oja's algorithm [22], which is known to approximately compute the minimum eigenvector of ∇²f(x0) = (1/n)∑_{i=1}^n ∇²fi(x0). Using the recent optimal convergence analysis of Oja's algorithm [7], one can conclude that after T1 = Θ(log(r/σ)/(ηλ)) iterations, where λ = max{0, −λmin(∇²f(x0))}, not only is ‖xt+1 − x0‖ blown up, but it also aligns well with the minimum eigenvector of ∇²f(x0). In other words, if λ ≥ δ, then the algorithm must stop before T.

Finally, one has to carefully argue that the error does not blow up in this iterative process. We defer the proof details to Appendix B.3. □

Our Lemma 3.2 below tells us we can verify whether the output v of Neon2online_weak is indeed correct (up to additive δ/4), so we can boost the success probability to 1 − p. For completeness' sake, we summarize this procedure as Neon2online in Algorithm 2.

Lemma 3.2 (verification). In the same setting as Theorem 1, let x, v ∈ ℝ^d be vectors, let i1, ..., im ∈_R [n], and define

    z = (1/m) ∑_{j=1}^m vᵀ(∇f_{ij}(x + v) − ∇f_{ij}(x)) .

Then, if ‖v‖ ≤ δ/(8L2) and m = Θ(L² log(1/p)/δ²), with probability at least 1 − p,

    | z/‖v‖² − vᵀ∇²f(x)v/‖v‖² | ≤ δ/4 .

The simple proof of Lemma 3.2 can be found in Section B.4.

4 Neon2 in the Deterministic Setting

Algorithm 3 Neon2det(f, x0, δ, p)
Input: a function f, vector x0, negative curvature target δ > 0, failure probability p ∈ (0, 1].
1: T ← C1²·(√L/√δ)·log(d/p).    ▷ for a sufficiently large constant C1
2: ξ ← Gaussian random vector with norm σ;    ▷ σ := (d/p)^{−2C1}·δ/(T⁴L2)
3: x1 ← x0 + ξ.
y1 ← ξ, y0 ← 0.
4: for t ← 1 to T do
5:    yt+1 ← 2M(yt) − yt−1;    ▷ M(y) := −(1/L)(∇f(x0 + y) − ∇f(x0)) + (1 − 3δ/(4L))·y
6:    xt+1 ← x0 + yt+1 − M(yt).
7:    if ‖xt+1 − x0‖ ≥ r then return (xt+1 − x0)/‖xt+1 − x0‖.    ▷ r := (d/p)^{C1}·σ
8: end for
9: return ⊥.

We propose Neon2det formally in Algorithm 3 and prove:

Theorem 3 (Neon2det). Let f(x) be a function that is L-smooth and L2-second-order smooth. For every point x0 ∈ ℝ^d, every δ > 0, every p ∈ (0, 1], the output v = Neon2det(f, x0, δ, p) satisfies that, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x0) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖ = 1 and vᵀ∇²f(x0)v ≤ −δ/2.
Moreover, the total number of full gradient evaluations is O(log²(d/p)·√L/√δ).

Proof sketch of Theorem 3. We explain the high-level intuition of Neon2det and the proof of Theorem 3 as follows. Define M = −(1/L)∇²f(x0) + (1 − 3δ/(4L))·I. We immediately notice that
• all eigenvalues of ∇²f(x0) in [−3δ/4, L] are mapped to eigenvalues of M in [−1, 1], and
• any eigenvalue of ∇²f(x0) smaller than −δ is mapped to an eigenvalue of M greater than 1 + δ/(4L).

Therefore, as long as T ≥ Ω̃(L/δ), if we compute xT+1 = x0 + M^T·ξ for some random vector ξ, by the theory of power method, xT+1 − x0 must be a negative-curvature direction of ∇²f(x0) with value ≤ −δ/2.
There are two issues with this approach.

The first issue is that the degree T of this matrix polynomial M^T can be reduced to T = Ω̃(√L/√δ) if the so-called Chebyshev polynomial is used.

Claim 4.1. Let Tt(x) be the t-th Chebyshev polynomial of the first kind, defined as:

    T0(x) := 1,  T1(x) := x,  Tt+1(x) := 2x·Tt(x) − Tt−1(x) .

Then Tt(x) satisfies (see Trefethen [30]):

    Tt(x) = cos(t·arccos(x)) ∈ [−1, 1]                         if x ∈ [−1, 1];
    Tt(x) = (1/2)·[(x − √(x² − 1))^t + (x + √(x² − 1))^t]      if x > 1.

Since Tt(x) stays in [−1, 1] when x ∈ [−1, 1], and grows like ≈ (1 + √(x² − 1))^t for x ≥ 1, we can use T_T(M) in replacement of M^T. Then, any eigenvalue of M that is above 1 + δ/(4L) grows at a speed like (1 + √(δ/L))^T, so it suffices to choose T ≥ Ω̃(√L/√δ). This is quadratically faster than applying the power method, so in Neon2det we wish to compute xt+1 ≈ x0 + Tt(M)·ξ.

The second issue is that, since we cannot compute Hessian-vector products, we have to use a gradient difference to approximate them; that is, we can only use M(y) to approximate My, where

    M(y) := −(1/L)·(∇f(x0 + y) − ∇f(x0)) + (1 − 3δ/(4L))·y .

How does error propagate if we compute Tt(M)·ξ with M replaced by the operator M(·)?
Note that this is a very non-trivial question, because the coefficients of the polynomial Tt(x) can be as large as 2^{O(t)}.

It turns out that the way error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely

    y0 = 0,  y1 = ξ,  yt = 2M(yt−1) − yt−2 ,

and we set xT+1 = x0 + yT+1 − M(yT), then this xT+1 is sufficiently close to the exact value x0 + T_T(M)·ξ. This is known as the stability theory of computing Chebyshev polynomials, and is proved in [6]. We defer all the proof details to Appendix C.2. □

5 Neon2 in the Finite-Sum Setting

Let us recall how the shift-and-invert (SI) approach [26] solves the minimum eigenvector problem. Given a matrix A = ∇²f(x0) ∈ ℝ^{d×d}, suppose its eigenvalues are −L ≤ λ1 ≤ ··· ≤ λd ≤ L. At a high level, the SI approach
• chooses λ = δ − λ1,⁵
• defines the positive definite matrix B = (λI + A)^{−1}, and
• applies power method for a logarithmic number of rounds to B to find its approximate maximum eigenvector v.⁶
One can show that this unit vector v satisfies λ1 ≤ vᵀAv ≤ λ1 + O(δ) [13].

To apply power method to B, one needs to compute matrix inversions By = (λI + A)^{−1}y for arbitrary vectors y ∈ ℝ^d. The stability of SI ensures that it suffices to compute By to some sufficiently high accuracy.⁷

One efficient way to compute By to such high accuracy is by expressing A in a finite-sum form and then adopting convex optimization [13]. We call this approach finite-sum SI.
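The backward recurrence above can be sketched in a few lines of numpy. For testability we use an explicit symmetric matrix with spectrum inside [−1, 1] as a stand-in for the gradient-difference operator M(·) (the matrix, dimension, and degree T are illustrative), and compare y_{T+1} − M(y_T) against the closed form cos(T·arccos(x)) applied eigenvalue-wise:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 6, 30

# Symmetric stand-in for the operator M(.); the real algorithm only ever
# accesses M through gradient differences, never as a matrix.
B = rng.standard_normal((d, d))
M = (B + B.T) / 2
M /= np.abs(np.linalg.eigvalsh(M)).max() * 1.001   # spectrum strictly inside [-1, 1]

def M_op(y):
    return M @ y

xi = rng.standard_normal(d)

# backward recurrence: y_1 = xi, y_t = 2 M(y_{t-1}) - y_{t-2}
y_prev, y_cur = np.zeros(d), xi
for _ in range(T):
    y_prev, y_cur = y_cur, 2 * M_op(y_cur) - y_prev
result = y_cur - M_op(y_prev)        # equals T_T(M) xi

# closed form T_T(x) = cos(T arccos x), applied to the eigenvalues of M
lam, Q = np.linalg.eigh(M)
direct = Q @ (np.cos(T * np.arccos(lam)) * (Q.T @ xi))
assert np.allclose(result, direct)
```

With an exact linear map the two computations agree identically; the content of the stability result of [6] is that the agreement degrades gracefully when M(·) is only an approximate, gradient-difference version of the map.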
Consider a convex quadratic function that is a finite sum of non-convex functions:

    g(z) := (1/2)·zᵀ(λI + A)z + yᵀz = (1/n) ∑_{i=1}^n ( (1/2)·zᵀ(λI + ∇²fi(x0))z + yᵀz ) =: (1/n) ∑_{i=1}^n gi(z) .

Now, computing By is equivalent to minimizing g(z), and one can use a stochastic first-order method to minimize it.

One such method is KatyushaX, which directly accelerates the so-called SVRG method using momentum, and finds z using Õ(n + n^{3/4}·√(L/δ)) computations of stochastic gradients.⁸ Whenever a stochastic gradient ∇gi(z) = (λI + ∇²fi(x0))z + y is needed at some point z ∈ ℝ^d for some random i ∈ [n], instead of evaluating it exactly (which would require a Hessian-vector product), we use ∇fi(x0 + z) − ∇fi(x0) to approximate ∇²fi(x0)·z. We call this method Neon2finite.

⁵The precise SI approach needs to binary search λ because λ1 is unknown.

⁶More precisely, applying power method for O(log(d/p)) rounds, one can find a unit vector v such that vᵀBv ≥ (9/10)·λmax(B) with probability at least 1 − p. One can also prove that this vector v satisfies λ1 ≤ vᵀAv ≤ λ1 + O(δ).

⁷More precisely, it suffices to compute w ∈ ℝ^d so that ‖w − By‖ ≤ ε‖y‖, in a time complexity that depends polynomially on log(1/ε) [5, 13].

⁸Shalev-Shwartz [29] first discovered that one can apply SVRG to minimize sums of non-convex functions. It was also observed that applying APPA/Catalyst reductions to SVRG one can achieve accelerated convergence rates [13, 29], and this approach is commonly known as AccSVRG.
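The Hessian-free inner solve can be sketched with plain gradient descent standing in for KatyushaX (a toy quadratic f so the gradient difference is exact; the matrix, δ, λ, and step size are illustrative choices of ours):

```python
import numpy as np

# Toy objective f(x) = 0.5 x^T A x, so grad f(x0+z) - grad f(x0) = A z exactly.
A = np.diag([-0.5, 1.0, 2.0])        # lambda_1 = -0.5 (illustrative)
d = 3
delta = 0.1
lam = delta - (-0.5)                 # lam = delta - lambda_1, so lam*I + A is positive definite
y = np.array([1.0, -2.0, 0.5])
x0 = np.zeros(d)

def grad_f(x):
    return A @ x

def grad_g(z):
    # gradient of g(z) = 0.5 z^T (lam I + A) z + y^T z, formed Hessian-free:
    # the product (nabla^2 f(x0)) z is replaced by a gradient difference
    return lam * z + (grad_f(x0 + z) - grad_f(x0)) + y

z = np.zeros(d)
eta = 1.0 / (lam + 2.0)              # roughly a 1/L step size for this toy problem
for _ in range(3000):
    z -= eta * grad_g(z)

# minimizing g gives z* = -(lam I + A)^{-1} y = -B y, i.e. the SI inner solve
z_star = -np.linalg.solve(lam * np.eye(d) + A, y)
assert np.allclose(z, z_star, atol=1e-8)
```

Every iteration touches f only through two gradient evaluations, which is exactly the substitution Neon2finite performs inside the accelerated solver.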
However, AccSVRG requires some careful parameter tuning of its inner loops, and is thus a logarithmic factor slower than KatyushaX and also less practical [3].

Of course, one needs to show that KatyushaX is stable to noise. Using techniques similar to those of the previous two sections, one can show that the error term is proportional to O(‖z‖₂²); thus, as long as the norm of z is bounded (just as in the previous two sections), this does not affect the performance of the algorithm. We omit the detailed theoretical proof of this result, because it would complicate this paper.

Theorem 5 (Neon2finite). Let f(x) = (1/n) Σ_{i=1}^n fi(x) where each fi is L-smooth and L2-second-order smooth. For every point x0 ∈ R^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2finite(f, x0, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x0) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and v⊤∇²f(x0)v ≤ −δ/2.
Moreover, the total number of stochastic gradient evaluations is Õ(n + n^{3/4}·√(L/δ)), where the Õ notation hides logarithmic factors in d, 1/p and L/δ.

6 Applications of Neon2
We show how Neon2 can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha2, and CDHS. Unfortunately, we are unaware of a generic statement for applying Neon2 to an arbitrary algorithm; therefore, we have to prove each application individually.9
Throughout this section, we assume that some starting vector x0 ∈ R^d and an upper bound Δf are given to the algorithm, satisfying f(x0) − min_x{f(x)} ≤ Δf. This is only for the purpose of proving theoretical bounds.
Since Δf only appears in specifying the number of iterations, in practice one can simply run a sufficiently large number of iterations and then halt the algorithm, without knowing Δf.
6.1 Applying Neon2 to SGD and GD
To apply Neon2 to turn SGD into an algorithm finding approximate local minima, we propose the following process Neon2+SGD (see Algorithm 4). In each iteration t, it first applies SGD with mini-batch size O(1/ε²) (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon2online to decide whether the point has a negative-curvature direction; if so, we move in that direction (see Line 10). We have the following theorem:

Theorem 7a. With probability at least 1 − p, Neon2+SGD outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( (V/ε² + 1)·(L2²Δf/δ³ + LΔf/ε²) + (L²/δ²)·(L2²Δf/δ³) ).

Corollary 6.1. Treating Δf, V, L, L2 as constants, we have T = Õ( 1/ε⁴ + 1/(ε²δ³) + 1/δ⁵ ).

One can similarly (and more easily) give an algorithm Neon2+GD, which is the same as Neon2+SGD except that the mini-batch SGD is replaced with full gradient descent, and the use of Neon2online is replaced with Neon2det. We have the following theorem:
Theorem 7c.
With probability at least 1 − p, Neon2+GD outputs an (ε, δ)-approximate local minimum using Õ( LΔf/ε² + (L^{1/2}/δ^{1/2})·(L2²Δf/δ³) ) full gradient computations.

We prove only Theorem 7a, in Appendix D; the proof of Theorem 7c is analogous and simpler.
6.2 Other Applications
Due to space limitations, we defer the applications to Natasha2, CDHS, and SCSG to Appendix A. At a high level, the applications to Natasha2 and CDHS are trivial, because NC-search was already a subroutine required by both algorithms, so one can directly replace it with the Neon2 of this paper. The application to SCSG is less trivial, because one has to additionally take care of some probabilistic behavior of SCSG.
Acknowledgements
We would like to thank Tianbao Yang and Yi Xu for helpful feedback on this manuscript. This work was done when Yuanzhi Li was a summer intern at Microsoft Research in 2017.

9 This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in mini-batch SGD we have f(xt) − E[f(xt+1)] ≥ Ω(‖∇f(xt)‖²), but in SCSG we have f(xt) − E[f(xt+1)] ≥ Ω(E[‖∇f(xt+1)‖²]).

References
[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding Approximate Local Minima for Nonconvex Optimization in Linear Time. In STOC, 2017.
[2] Zeyuan Allen-Zhu. Natasha 2: Faster Non-Convex Optimization Than SGD. In NeurIPS, 2018.
[3] Zeyuan Allen-Zhu. Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization. In ICML, 2018.
[4] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. In ICML, 2016.
[5] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain. In NeurIPS, 2016.
[6] Zeyuan Allen-Zhu and Yuanzhi Li.
Faster Principal Component Regression and Stable Matrix Chebyshev Approximation. In ICML, 2017.
[7] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU. In ICML, 2017.
[8] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated Methods for Non-Convex Optimization. ArXiv e-prints, abs/1611.00756, November 2016.
[9] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. In ICML, 2017.
[10] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[11] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NeurIPS, 2014.
[12] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014.
[13] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.
[14] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In COLT, 2015.
[15] Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.
[16] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to Escape Saddle Points Efficiently.
In ICML, 2017.
[17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NeurIPS, 2013.
[18] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.
[19] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Nonconvex Finite-Sum Optimization Via SCSG Methods. In NeurIPS, 2017.
[20] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004. ISBN 1402075537.
[21] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[22] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.
[23] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[24] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.
[25] Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alexander J. Smola. A generic approach for escaping saddle points. ArXiv e-prints, abs/1709.01434, September 2017.
[26] Youcef Saad. Numerical methods for large eigenvalue problems. Manchester University Press, 1992.
[27] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. ArXiv e-prints, abs/1309.2388, September 2013.
[28] Nicol N. Schraudolph.
Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
[29] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.
[30] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2013.
[31] Yi Xu and Tianbao Yang. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time. ArXiv e-prints, abs/1711.01944, November 2017.