{"title": "SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 699, "abstract": "In this paper, we propose a new technique named \\textit{Stochastic Path-Integrated Differential EstimatoR} (SPIDER), which can be used to track many deterministic quantities of interests with significantly reduced computational cost. \nCombining SPIDER with the method of normalized gradient descent, we propose SPIDER-SFO that solve non-convex stochastic optimization problems using stochastic gradients only. \nWe provide a few error-bound results on its convergence rates.\nSpecially, we prove that the SPIDER-SFO algorithm achieves a gradient computation cost of $\\mathcal{O}\\left(  \\min( n^{1/2} \\epsilon^{-2}, \\epsilon^{-3} ) \\right)$ to find an $\\epsilon$-approximate first-order stationary point. \nIn addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary point under the gradient Lipschitz assumption in the finite-sum setting.\nOur SPIDER technique can be further applied to find an $(\\epsilon, \\mathcal{O}(\\ep^{0.5}))$-approximate second-order stationary point at a gradient computation cost of $\\tilde{\\mathcal{O}}\\left(  \\min( n^{1/2} \\epsilon^{-2}+\\epsilon^{-2.5}, \\epsilon^{-3} ) \\right)$.", "full_text": "SPIDER: Near-Optimal Non-Convex Optimization via\n\nStochastic Path Integrated Differential Estimator\n\nCong Fang1 \u2217 Chris Junchi Li2 Zhouchen Lin1\u2020 Tong Zhang2\n\n1Key Lab. 
of Machine Intelligence (MoE), School of EECS, Peking University
2Tencent AI Lab
{fangcong, zlin}@pku.edu.cn  junchi.li.duke@gmail.com  tongzhang@tongzhang-ml.org

Abstract

In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. Combining SPIDER with the method of normalized gradient descent, we propose SPIDER-SFO, which solves non-convex stochastic optimization problems using stochastic gradients only. We provide a few error-bound results on its convergence rates. Specifically, we prove that the SPIDER-SFO algorithm achieves a gradient computation cost of O(min(n^{1/2} ε^{-2}, ε^{-3})) to find an ε-approximate first-order stationary point. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary points under the gradient Lipschitz assumption in the finite-sum setting. Our SPIDER technique can be further applied to find an (ε, O(ε^{0.5}))-approximate second-order stationary point at a gradient computation cost of Õ(min(n^{1/2} ε^{-2} + ε^{-2.5}, ε^{-3})).

1 Introduction

In this paper, we study the optimization problem

    minimize_{x ∈ R^d}  f(x) ≡ E[F(x; ζ)]    (1.1)

where the stochastic component F(x; ζ), indexed by some random vector ζ, is smooth and possibly non-convex. The non-convex optimization problem of form (1.1) covers many large-scale statistical learning tasks and is gaining tremendous popularity due to its favorable computational and statistical efficiency [5–7]. Typical examples of form (1.1) include principal component analysis, estimation of graphical models, as well as training deep neural networks [17]. 
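To make the stochastic oracle concrete, the following minimal Python sketch (a hypothetical 1-D least-squares instance; all data and constants are illustrative, not from the paper) implements a finite-sum objective of the form studied here together with a mini-batch stochastic gradient oracle, and checks empirically that the mini-batch gradient is an unbiased estimate of the full gradient:

```python
import random

random.seed(0)

# Hypothetical finite-sum instance: f(x) = (1/n) * sum_i (a_i*x - b_i)^2 / 2.
n = 100
a = [random.uniform(0.5, 1.5) for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]

def grad_i(x, i):
    # stochastic gradient of the i-th component f_i(x) = (a_i*x - b_i)^2 / 2
    return a[i] * (a[i] * x - b[i])

def full_grad(x):
    # deterministic gradient of f(x); the quantity the stochastic oracle estimates
    return sum(grad_i(x, i) for i in range(n)) / n

# A mini-batch stochastic gradient (sampled with replacement) is an unbiased
# estimator of full_grad(x): averaging many independent draws recovers it.
x = 2.0
batches = [sum(grad_i(x, random.randrange(n)) for _ in range(10)) / 10
           for _ in range(2000)]
avg = sum(batches) / len(batches)
err = abs(avg - full_grad(x))
```

Unbiasedness alone is not enough for fast rates: the variance of each draw is what drives the ε^{-4} cost of plain SGD, and reducing it is the point of the estimator developed below.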
The expectation-minimization structure of the stochastic optimization problem (1.1) allows us to perform iterative updates and minimize the objective using the stochastic gradient ∇F(x; ζ) as an estimator of its deterministic counterpart. A special case of central interest is when the stochastic vector ζ is finitely sampled. In this finite-sum (or offline) case, we denote each component function by f_i(x), and (1.1) can be restated as

    minimize_{x ∈ R^d}  f(x) = (1/n) Σ_{i=1}^{n} f_i(x)    (1.2)

∗This work was done while Cong Fang was a Research Intern with Tencent AI Lab.
†Corresponding author.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

where n is the number of individual functions. Another case is when n is reasonably large or even infinite, so that a full pass over the whole dataset is expensive or impossible. We refer to this as the online (or streaming) case. For notational simplicity, we study the optimization problem in form (1.2), in both the finite-sum and online cases, for the rest of this paper.

One important task in non-convex optimization is to search for, given a prescribed accuracy ε > 0, an ε-approximate first-order stationary point x ∈ R^d, i.e. a point with ‖∇f(x)‖ ≤ ε. In this paper, we propose a new technique, called the Stochastic Path-Integrated Differential EstimatoR (SPIDER), which enables us to construct an estimator that tracks a deterministic quantity at significantly lower sampling cost. As the readers will see, the SPIDER technique further allows us to design an algorithm with a faster rate of convergence for the non-convex problem (1.2), in which we utilize the idea of Normalized Gradient Descent (NGD) [18, 26]. NGD is a variant of Gradient Descent (GD) in which the stepsize is picked to be inversely proportional to the norm of the full gradient. Compared to 
Compared to\nGD, NGD exempli\ufb01es faster convergence, especially in the neighborhood of stationary points [25].\nHowever, NGD has been less popular due to its requirement of accessing the full gradient and its\nnorm at each update. In this paper, we estimate and track the gradient and its norm via the SPIDER\ntechnique and then hybrid it with NGD. Measured by gradient cost which is the total number of\ncomputation of stochastic gradients, our proposed SPIDER-SFO algorithm achieves a faster rate of\nconvergence in O(min(n1/2\u0001\u22122, \u0001\u22123)) which outperforms the previous best-known results in both\n\ufb01nite-sum [3][32] and online cases [24] by a factor of O(min(n1/6, \u0001\u22120.333)).\n\nFor the task of \ufb01nding stationary points for which we already achieved a faster convergence\nrate via our proposed SPIDER-SFO algorithm, a follow-up question to ask is: is our proposed\nSPIDER-SFO algorithm optimal for an appropriate class of smooth functions? In this paper, we\nprovide an af\ufb01rmative answer to this question in the \ufb01nite-sum case. To be speci\ufb01c, inspired by a\ncounterexample proposed by Carmon et al. [10] we are able to prove that the gradient cost upper\nbound of SPIDER-SFO algorithm matches the algorithmic lower bound. To put it differently, the\ngradient cost of SPIDER-SFO cannot be further improved for \ufb01nding stationary points for some\nparticular non-convex functions.\n\n1.1 Related Works\n\nIn the recent years, there has been a surge of literatures in machine learning community that\nanalyze the convergence property of non-convex optimization algorithms. Limited by space and our\nknowledge, we have listed all literatures that we believe are mostly related to this work. We refer\nthe readers to the monograph by Jain et al. 
[19] and the references therein for recent general and model-specific convergence rate results in non-convex optimization.

SGD and Variance Reduction  For the general problem of finding approximate stationary points, under the smoothness condition on f(x), it is known that vanilla Gradient Descent (GD) and Stochastic Gradient Descent (SGD), which can be traced back to Cauchy [11] and Robbins & Monro [33], achieve an ε-approximate stationary point at a gradient cost of O(min(n ε^{-2}, ε^{-4})) [16, 26].

Recently, the convergence rates of GD and SGD have been improved by variance-reduction-type algorithms [22, 34]. In particular, the finite-sum Stochastic Variance-Reduced Gradient (SVRG) and online Stochastically Controlled Stochastic Gradient (SCSG) methods reduce the gradient cost to Õ(min(n^{2/3} ε^{-2}, ε^{-10/3})) [3, 24, 32].

First-order methods for finding approximate second-order stationary points  It has been shown that for machine learning methods such as deep learning, approximate stationary points that have at least one negative Hessian direction, including saddle points and local maximizers, are often unsatisfactory and need to be avoided or escaped from [12, 15]. Recently, many works have studied how to avoid or escape saddle points and achieve an (ε, δ)-approximate second-order stationary point at a polynomial gradient cost, i.e. an x ∈ R^d such that ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −δ [1, 2, 4, 8, 15, 18, 20, 21, 23, 25, 30, 31, 35, 38]. Among them, Ge et al. [15] and Jin et al. 
[20] proposed the noise-perturbed variants of Gradient Descent (PGD) and Stochastic Gradient Descent (SGD), which escape from all saddle points and achieve an ε-approximate second-order stationary point at a gradient cost of Õ(min(n ε^{-2}, poly(d) ε^{-4})) stochastic gradients. Levy [25] proposed a noise-perturbed variant of NGD which yields faster evasion of saddle points than GD.

The breakthrough in gradient cost for finding second-order stationary points was achieved in 2016/2017, when two recent lines of work, namely FastCubic [1] and CDHS [8], as well as their stochastic versions [2, 35], achieved a gradient cost of Õ(min(n ε^{-1.5} + n^{3/4} ε^{-1.75}, ε^{-3.5})), which served as the best-known gradient cost for finding an (ε, O(ε^{0.5}))-approximate second-order stationary point before the initial submission of this paper.3 4 In particular, Agarwal et al. [1] and Tripuraneni et al. [35] converted the cubic regularization method for finding second-order stationary points [27] to stochastic-gradient-based and stochastic-Hessian-vector-product-based methods, and Allen-Zhu [2] and Carmon et al. [8] used a Negative-Curvature Search method to avoid saddle points. See also the recent work by Reddi et al. [31] for related saddle-point-escaping methods that achieve similar rates for finding an approximate second-order stationary point.

Other concurrent works  As the current work was carried out in its final phase, the authors became aware that an idea of resemblance was earlier presented in an algorithm named the StochAstic Recursive grAdient algoritHm (SARAH) [28, 29]. 
Despite the fact that both our SPIDER-SFO and theirs adopt the recursive stochastic gradient update framework, and that SPIDER-SFO can be viewed as a variant of SARAH with normalization, our work differs from theirs in two aspects:

(i) Our analysis techniques are totally different from those for the version of SARAH proposed by Nguyen et al. [28, 29]. Their version can be seen as a variant of gradient descent, while ours combines the SPIDER technique with normalized gradient descent. Moreover, Nguyen et al. [28, 29] adopt a large-stepsize setting (in fact, their goal was to design a memory-saving variant of SAGA [13]), while our SPIDER-SFO algorithm adopts a small stepsize proportional to ε. All of these are essential elements of the improved convergence rates we obtain;

(ii) Our proposed SPIDER technique is a much more general variance-reduced estimation method for many quantities (not limited to gradients) and can be flexibly applied to numerous problems, e.g. the stochastic zeroth-order method.

Soon after the initial submission to NIPS and the arXiv release of this paper, we became aware that similar convergence rate results for stochastic first-order methods were also achieved independently by the so-called SNVRG algorithm [39, 40]. 
SNVRG [40] obtains a gradient complexity of Õ(min(n^{1/2} ε^{-2}, ε^{-3})) for finding an ε-approximate first-order stationary point and achieves an Õ(ε^{-3.5}) gradient cost for finding an (ε, O(ε^{0.5}))-approximate second-order stationary point [39]. By exploiting third-order smoothness, an SNVRG variant can also achieve an (ε, O(ε^{0.5}))-approximate second-order stationary point at an Õ(ε^{-3}) stochastic gradient cost [39].

1.2 Our Contributions

In this work, we propose the Stochastic Path-Integrated Differential Estimator (SPIDER) technique, which significantly avoids excessive access to the stochastic oracle and reduces the time complexity. This technique can potentially be applied to many stochastic estimation problems.

(i) We propose the SPIDER-SFO algorithm (Algorithm 1) for finding approximate first-order stationary points for the non-convex stochastic optimization problem (1.2), and prove the optimality of its rate in at least one case. Inspired by the recent works of Carmon et al. [8, 10] and Johnson & Zhang [22], and independently of Zhou et al. [39, 40], this is the first time that the gradient cost of O(min(n^{1/2} ε^{-2}, ε^{-3})), as both an upper and a (finite-sum only) lower bound for finding first-order stationary points of problem (1.2), has been obtained.

3Allen-Zhu [2] also obtains a gradient cost of Õ(ε^{-3.25}) to achieve a (modified and weakened) (ε, O(ε^{0.25}))-approximate second-order stationary point.

4Here and in many places afterwards, the gradient cost also includes the number of stochastic Hessian-vector product accesses, each of which has running time similar to that of one stochastic gradient access.

(ii) Following Allen-Zhu & Li [4], Carmon et al. [8], and Xu et al. 
[38], we propose the SPIDER-SFO+ algorithm for finding an approximate second-order stationary point for the non-convex stochastic optimization problem. To the best of our knowledge, this is also the first time that a gradient cost of Õ(min(n^{1/2} ε^{-2} + ε^{-2.5}, ε^{-3})) has been achieved under standard assumptions. We leave the details of SPIDER-SFO+ to the long version of our paper: https://arxiv.org/abs/1807.01695

(iii) We propose a new and simpler analysis framework for proving convergence to approximate stationary points. One can flexibly apply our proof techniques to analyze other algorithms, e.g. SGD, SVRG [22], and SAGA [13].

Notation.  Throughout this paper, we treat the parameters L, Δ, σ, and ρ, to be specified later, as global constants. Let ‖·‖ denote the Euclidean norm of a vector or the spectral norm of a square matrix. Denote p_n = O(q_n) for a sequence of vectors p_n and positive scalars q_n if there is a global constant C such that |p_n| ≤ C q_n, and p_n = Õ(q_n) if such a C hides a poly-logarithmic factor of the parameters. Denote p_n = Ω(q_n) if there is a global constant C such that |p_n| ≥ C q_n. Let λ_min(A) denote the least eigenvalue of a real symmetric matrix A. For fixed K ≥ k ≥ 0, let x^{k:K} denote the sequence {x^k, ..., x^K}. Let [n] = {1, ..., n}, and let S denote the cardinality of a multi-set S ⊂ [n] of samples (a generic set that allows elements of multiple instances). For simplicity, we further denote the averaged sub-sampled stochastic estimator B_S := (1/S) Σ_{i∈S} B_i and the averaged sub-sampled gradient ∇f_S := (1/S) Σ_{i∈S} ∇f_i. 
Other notations are explained at their first appearance.

2 Stochastic Path-Integrated Differential Estimator: Core Idea

In this section, we present in detail the underlying idea of our Stochastic Path-Integrated Differential Estimator (SPIDER) technique behind the algorithm design. As the readers will see, this technique significantly avoids excessive access to the stochastic oracle and reduces complexity; it is of independent interest and has potential applications in many stochastic estimation problems.

Let us consider an arbitrary deterministic vector quantity Q(x). Assume that we observe a sequence x̂^{0:K}, and we want to dynamically track Q(x̂^k) for k = 0, 1, ..., K. Assume further that we have an initial estimate Q̃(x̂^0) ≈ Q(x̂^0), and an unbiased estimate ξ_k(x̂^{0:k}) of Q(x̂^k) − Q(x̂^{k−1}) such that for each k = 1, ..., K,

    E[ξ_k(x̂^{0:k}) | x̂^{0:k}] = Q(x̂^k) − Q(x̂^{k−1}).

Then we can integrate (in the discrete sense) the stochastic differential estimate as

    Q̃(x̂^{0:K}) := Q̃(x̂^0) + Σ_{k=1}^{K} ξ_k(x̂^{0:k}).    (2.1)

We call the estimator Q̃(x̂^{0:K}) the Stochastic Path-Integrated Differential EstimatoR, or SPIDER for brevity. We conclude the following proposition, which bounds the error ‖Q̃(x̂^{0:K}) − Q(x̂^K)‖ of our estimator, in terms of both expectation and high probability:

Proposition 1. The martingale variance bound

    E‖Q̃(x̂^{0:K}) − Q(x̂^K)‖² = E‖Q̃(x̂^0) − Q(x̂^0)‖² + Σ_{k=1}^{K} E‖ξ_k(x̂^{0:k}) − (Q(x̂^k) − Q(x̂^{k−1}))‖²    (2.2)

holds.

Proposition 1 can be easily concluded using the properties of square-integrable martingales. 
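The discrete path integration (2.1) can be illustrated by a short, self-contained Python sketch (a hypothetical linear quantity Q and synthetic data, for illustration only): starting from an exact initial estimate, adding unbiased subsampled estimates of each difference Q(x̂^k) − Q(x̂^{k−1}) tracks Q along the path, with error controlled by the accumulated per-step variances as in Proposition 1.

```python
import random

random.seed(1)

# Hypothetical quantity to track: Q(x) = (mean of a) * x, where computing the
# exact mean of a is "expensive" and only subsamples of a are cheap.
n = 1000
a = [random.gauss(1.0, 1.0) for _ in range(n)]
mean_a = sum(a) / n

def Q(x):
    return mean_a * x  # the deterministic quantity we want to track

K, step = 50, 0.01
xs = [k * step for k in range(K + 1)]  # observed path x_0, ..., x_K

Q_tilde = Q(xs[0])  # exact initial estimate, as assumed above
for k in range(1, K + 1):
    S = [random.randrange(n) for _ in range(20)]  # mini-batch with replacement
    # unbiased estimate xi_k of the difference Q(x_k) - Q(x_{k-1})
    xi_k = (sum(a[i] for i in S) / len(S)) * (xs[k] - xs[k - 1])
    Q_tilde += xi_k  # discrete path integration, eq. (2.1)

err = abs(Q_tilde - Q(xs[-1]))  # small: each short step has small variance
```

Because each step of the path is short, every difference estimate has small variance, so the accumulated error in (2.2) stays small even though no fresh exact evaluation of Q is made after the first one.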
Now, let B map any x ∈ R^d to a random estimate B_i(x) such that, conditioning on the observed sequence x^{0:k}, we have for each k = 1, ..., K,

    E[B_i(x^k) − B_i(x^{k−1}) | x^{0:k}] = B(x^k) − B(x^{k−1}).    (2.3)

At each step k, let S∗ be a multi-set that samples S∗ elements of [n] with replacement, let the stochastic estimator B_{S∗} = (1/S∗) Σ_{i∈S∗} B_i satisfy

    E‖B_i(x) − B_i(y)‖² ≤ L_B² ‖x − y‖²,    (2.4)

and assume ‖x^k − x^{k−1}‖ ≤ ε₁ for all k = 1, ..., K. Finally, we set our estimator V^k of B(x^k) as

    V^k = B_{S∗}(x^k) − B_{S∗}(x^{k−1}) + V^{k−1}.

Applying Proposition 1 immediately yields the following lemma, which bounds the error of the estimator V^k in terms of the second moment of ‖V^k − B(x^k)‖:

Lemma 1. Under condition (2.4), for all k = 1, ..., K,

    E‖V^k − B(x^k)‖² ≤ (k L_B² ε₁²)/S∗ + E‖V^0 − B(x^0)‖².    (2.5)

It turns out that one can use SPIDER to track many quantities of interest, such as the stochastic gradient, function values, zeroth-order estimated gradients, functionals of Hessian matrices, etc. Our proposed SPIDER-based algorithms in this paper take B_i to be the stochastic gradient ∇f_i and the zeroth-order estimated gradient, respectively.

3 SPIDER for Stochastic First-Order Method

In this section, we apply SPIDER to the Stochastic First-Order (SFO) method. We introduce the basic settings and assumptions in §3.1 and propose the main error-bound theorems for finding an ε-approximate first-order stationary point in §3.2. 
We conclude this section with the corresponding lower-bound result in §3.3.

3.1 Settings and Assumptions

We first introduce the formal definition of an approximate first-order stationary point.

Definition 1. We call x ∈ R^d an ε-approximate first-order stationary point, or simply an FSP, if

    ‖∇f(x)‖ ≤ ε.    (3.1)

For our purpose of analysis, we also pose the following assumptions:

Assumption 1. We assume the following:
(i) Δ := f(x^0) − f∗ < ∞, where f∗ = inf_{x∈R^d} f(x) is the global infimum value of f(x);
(ii) each component function f_i(x) has an averaged L-Lipschitz gradient, i.e. for all x, y, E‖∇f_i(x) − ∇f_i(y)‖² ≤ L²‖x − y‖²;
(iii) (for the online case only) the stochastic gradient has a finite variance bounded by σ² < ∞, i.e. E‖∇f_i(x) − ∇f(x)‖² ≤ σ².

3.2 Upper Bound for Finding First-Order Stationary Points

Recall that NGD has the iteration update rule

    x^{k+1} = x^k − η · ∇f(x^k)/‖∇f(x^k)‖,    (3.2)

where η is a constant step size. 
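Putting the pieces together, the following Python sketch illustrates the kind of iteration that SPIDER-SFO (Algorithm 1) performs: a SPIDER estimate v of ∇f(x^k), refreshed periodically from a full gradient, combined with a normalized step of the form (3.2). The toy 1-D finite-sum problem, batch sizes, and stepsize here are hypothetical illustrations, not the theoretical settings of (3.3) or (3.6).

```python
import random

random.seed(2)

# Hypothetical toy problem: f(x) = (1/n) * sum_i a_i * (x - c_i)^2 / 2.
n = 500
a = [random.uniform(0.5, 1.5) for _ in range(n)]
c = [random.gauss(3.0, 1.0) for _ in range(n)]

def grad_i(x, i):
    # stochastic gradient of the component f_i(x) = a_i * (x - c_i)^2 / 2
    return a[i] * (x - c[i])

def full_grad(x):
    return sum(grad_i(x, i) for i in range(n)) / n

eta, q, S2, K = 0.05, 25, 10, 400   # illustrative constants only
x_prev, x = 0.0, 0.0
v = full_grad(x)                     # initial SPIDER estimate from a full gradient
for k in range(1, K + 1):
    if k % q == 0:
        v = full_grad(x)             # periodic full-gradient refresh
    else:
        S = [random.randrange(n) for _ in range(S2)]
        # SPIDER increment: subsampled estimate of grad f(x) - grad f(x_prev)
        v = sum(grad_i(x, i) - grad_i(x_prev, i) for i in S) / S2 + v
    # normalized (NGD-style) step of constant length eta, as in (3.2)
    x_prev, x = x, x - eta * v / (abs(v) + 1e-12)

final_grad = abs(full_grad(x))       # driven to a small value near a stationary point
```

Note how normalization keeps every step of length η, so the SPIDER error bound of Lemma 1, which requires ‖x^k − x^{k−1}‖ ≤ ε₁, holds by construction.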
The NGD update rule (3.2) ensures that ‖x^{k+1} − x^k‖ stays constantly equal to the stepsize η, and it may quickly escape from saddle points and converge to a second-order stationary point [25]. We propose SPIDER-SFO in Algorithm 1, which resembles a stochastic variant of NGD with the SPIDER technique applied, so that one can maintain an estimate of ∇f(x^k) at a higher accuracy under a limited gradient budget.

Algorithm 1 SPIDER-SFO: Input x^0, q, S1, S2, n0, ε, and ε̃ (for finding a first-order stationary point)
1: for k = 0 to K do
2:   if mod(k, q) = 0 then
3:     Draw S1 samples (or compute the full gradient in the finite-sum case), let v^k = ∇f_{S1}(x^k)
4:   else
5:     Draw S2 samples, and let v^k = ∇f_{S2}(x^k) − ∇f_{S2}(x^{k−1}) + v^{k−1}
6:   end if
7:   OPTION I    ▷ for convergence rates in high probability
8:     if ‖v^k‖ ≤ 2ε̃ then
9:       return x^k
10:    else
11:      x^{k+1} = x^k − η · (v^k/‖v^k‖), where η = ε/(L n0)
12:    end if
13:  OPTION II   ▷ for convergence rates in expectation
14:    x^{k+1} = x^k − η_k v^k, where η_k = min( ε/(L n0 ‖v^k‖), 1/(2L n0) )
15: end for
16: OPTION I: Return x^K    ▷ however, this line is not reached with high probability
17: OPTION II: Return x̃ chosen uniformly at random from {x^k}_{k=0}^{K−1}

To analyze the convergence rate of SPIDER-SFO, let us first consider the online case for Algorithm 1. We let the input parameters be

    S1 = 2σ²/ε²,  S2 = 2σ/(ε n0),  η = ε/(L n0),  η_k = min( ε/(L n0 ‖v^k‖), 1/(2L n0) ),  q = σ n0/ε,    (3.3)

where n0 ∈ [1, 2σ/ε] is a free parameter to choose.5 In this case, v^k in Line 5 of Algorithm 1 is a SPIDER for ∇f(x^k). To see this, recall that ∇f_i(x^{k−1}) is the stochastic gradient drawn at step k, and 
To see this, recall \u2207fi(xk\u22121) is the stochastic gradient drawn at step k and\n\nLn0(cid:107)vk(cid:107) ,\n\n2Ln0\n\nLn0\n\n\u03b7 =\n\nq =\n\nE(cid:2)\u2207fi(xk) \u2212 \u2207fi(xk\u22121) | x0:k\n\n(cid:3) = \u2207f (xk) \u2212 \u2207f (xk\u22121).\n\n(3.4)\nPlugging in V k = vk and Bi = \u2207fi in Lemma 1 of \u00a72, we can use vk in Algorithm 1 as the SPIDER\nand conclude the following lemma that is pivotal to our analysis.\nLemma 2. Set the parameters S1, S2, \u03b7, and q as in (3.3), and k0 = (cid:98)k/q(cid:99) \u00b7 q. Then under the\nAssumption 1, we have\n\nE(cid:2)(cid:107)vk \u2212 \u2207f (xk)(cid:107)2 | x0:k0\n\n(cid:3) \u2264 \u00012.\n\nHere we compute the conditional expectation over the randomness of x(k0+1):k.\n\nLemma 2 shows that our SPIDER vk of \u2207f (x) maintains an error of O(\u0001). Using this lemma,\nwe are ready to present the following results for Stochastic First-Order (SFO) method for \ufb01nding\n\ufb01rst-order stationary points of (1.2).\nTheorem 1 (First-order stationary point, online setting, in expectation). Assume we are in the\nonline case, let Assumption 1 holds, set the parameters S1, S2, \u03b7, and q as in (3.3), and set\n\nK =(cid:4)(4L\u2206n0)\u0001\u22122(cid:5) + 1. 
Then running Algorithm 1 with OPTION II for K iterations outputs a \u02dcx\n\n5When n0 = 1, the mini-batch size is 2\u03c3/\u0001, which is the largest mini-batch size that Algorithm 1 allows to\n\nchoose.\n\n6\n\n\fTurning to the \ufb01nite-sum case, analogous to the online case we let\n\n(cid:18)\n\n(cid:19)\n\nsatisfying\n\nE [(cid:107)\u2207f (\u02dcx)(cid:107)] \u2264 5\u0001.\n\nThe gradient cost is bounded by 24L\u2206\u03c3\u00b7 \u0001\u22123 + 2\u03c32\u0001\u22122 + 4\u03c3n\u22121\nTreating \u2206, L and \u03c3 as positive constants, the stochastic gradient complexity is O(\u0001\u22123).\n\n(3.5)\n0 \u0001\u22121 for any choice of n0 \u2208 [1, 2\u03c3/\u0001].\n\nThe relatively reduced minibatch size serves as the key ingredient for the superior performance\nof SPIDER-SFO. For illustrations, let us compare the sampling ef\ufb01ciency among SGD, SCSG and\nSPIDER-SFO in their special cases. With some involved analysis of the algorithms above, we can\nconclude that to ensure per-iteration suf\ufb01cient decrease of \u2126(\u00012/L), we have\n\n(i) for SGD the choice of mini-batch size is O(cid:0)\u03c32 \u00b7 \u0001\u22122(cid:1);\n(ii) for SCSG [24] and Natasha2 [2] the mini-batch size is O(cid:0)\u03c3 \u00b7 \u0001\u22121.333(cid:1);\n(iii) for our SPIDER-SFO, only a reduced mini-batch size of O(cid:0)\u03c3 \u00b7 \u0001\u22121(cid:1) is needed.\n\nS2 =\n\nn1/2\nn0\n\n,\n\n\u03b7 =\n\n\u0001\n\nLn0\n\n,\n\n\u03b7k = min\n\n\u0001\n\nLn0(cid:107)vk(cid:107) ,\n\n1\n\n2Ln0\n\n,\n\nq = n0n1/2,\n\n(3.6)\n\nwhere n0 \u2208 [1, n1/2]. In this case, one computes the full gradient vk = \u2207fS1 (xk) in Line 3 of\nAlgorithm 1. We conclude our second upper-bound result:\nTheorem 2 (First-order stationary point, \ufb01nite-sum setting, in expectation). Assume we are in the\n\ufb01nite-sum case, let Assumption 1 holds, set the parameters S2, \u03b7k, and q as in (3.6), set K =\n\n(cid:4)(4L\u2206n0)\u0001\u22122(cid:5) + 1, and let S1 = [n], i.e. 
we obtain the full gradient in Line 3. Then running Algorithm 1 with OPTION II for K iterations outputs an x̃ satisfying

    E‖∇f(x̃)‖ ≤ 5ε.

The gradient cost is bounded by n + 12(LΔ) · n^{1/2}ε^{-2} + 2n0^{-1}n^{1/2} for any choice of n0 ∈ [1, n^{1/2}]. Treating Δ, L, and σ as positive constants, the stochastic gradient complexity is O(n + n^{1/2}ε^{-2}).

3.3 Lower Bound for Finding First-Order Stationary Points

To conclude the optimality of our algorithm, we need an algorithmic lower-bound result [10, 37]. Consider the finite-sum case and any random algorithm A that maps functions f: R^d → R to a sequence of iterates in R^{d+1}, with

    [x^k; i^k] = A^{k−1}(ξ, ∇f_{i^0}(x^0), ∇f_{i^1}(x^1), ..., ∇f_{i^{k−1}}(x^{k−1})),  k ≥ 1,    (3.7)

where the A^k are measurable mappings into R^{d+1}, i^k is the index of the individual function chosen by A at iteration k, and ξ is a uniform random vector from [0, 1]; [x^0; i^0] = A^0(ξ), where A^0 is a measurable mapping. The lower-bound result for solving (1.2) is stated as follows:

Theorem 3 (Lower bound for SFO in the finite-sum setting). For any L > 0, Δ > 0, and 2 ≤ n ≤ O(Δ²L² · ε^{-4}), and for any algorithm A satisfying (3.7), there exist a dimension d = Õ(Δ²L²n²ε^{-4}) and a function f satisfying Assumption 1 in the finite-sum case, such that in order to find a point x̃ for which ‖∇f(x̃)‖ ≤ ε, A must cost at least Ω(LΔ · n^{1/2}ε^{-2}) stochastic gradient accesses.

Note that the condition n ≤ O(ε^{-4}) in Theorem 3 ensures that our lower bound Ω(n^{1/2}ε^{-2}) = Ω(n + n^{1/2}ε^{-2}), and hence our upper bound in Theorem 2 matches the lower bound in Theorem 3 up to a constant factor of the relevant parameters, and is hence near-optimal. Inspired by Carmon et al. [10], our proof of Theorem 3 utilizes a specific counterexample function that requires at least Ω(n^{1/2}ε^{-2}) stochastic gradient accesses. Note that Carmon et al. [10] analyzed such a counterexample in the deterministic case n = 1, and we generalize the analysis to the finite-sum case n ≥ 1.

Remark 1. Note that by setting n = O(ε^{-4}), the lower-bound complexity in Theorem 3 can be as large as Ω(ε^{-4}). We emphasize that this does not violate the O(ε^{-3}) upper bound in the online case [Theorem 1], since the counterexample established in the lower bound depends not on the stochastic gradient variance σ² specified in Assumption 1(iii), but on the component number n. To obtain a lower-bound result for the online case with the additional Assumption 1(iii), with more effort one might be able to construct a second counterexample that requires Ω(ε^{-3}) stochastic gradient accesses with knowledge of σ instead of n. 
We leave this as future work.

4 Further Extensions

Further extensions of our SPIDER technique can be applied to reduce the complexity of related problems. Limited by space, we leave the details of the following important extensions to the long version of our paper at https://arxiv.org/abs/1807.01695 .

Upper Bound for Finding First-Order Stationary Points, in High Probability  Under more stringent assumptions on the moments of the stochastic gradients, our Algorithm 1 with OPTION I achieves a gradient cost of Õ(min(n^{1/2}ε^{-2}, ε^{-3})) (note the additional poly-logarithmic factor) with high probability. We detail the theorems and their proofs in the long version of our paper.

Second-Order Stationary Point  To find a second-order stationary point, we can fuse our SPIDER-SFO in Algorithm 1 (with OPTION I taken) with a Negative-Curvature-Search (NC-Search) iteration. In the long version of our paper (and independently of [39]), we prove rigorously that a gradient cost of Õ(min(n^{1/2}ε^{-2} + ε^{-2.5}, ε^{-3})) can be achieved under standard assumptions:

Theorem 4 (Second-Order Stationary Point, Informal). There exists an algorithm such that, under appropriate assumptions, to find an (ε, √(ρε))-approximate second-order stationary point: in the online case, when ε ≤ ρσ², the total number of stochastic gradient computations is Õ(ε^{-3}); in the finite-sum case, when ε ≤ ρn, the total cost of gradient accesses is Õ(nε^{-1.5} + n^{1/2}ε^{-2} + ε^{-2.5}).

Zeroth-Order Stationary Point  After the NIPS submission of this work, we proposed a second application of our SPIDER technique, to the stochastic zeroth-order method for problem (1.2), which achieves an individual function access cost of O(min(dn^{1/2}ε^{-2}, dε^{-3})). 
To the best of our knowledge, this is also the first time that the complexity of individual function-value accesses for non-convex problems has been improved to the aforementioned complexity using variance-reduction techniques [22, 34].

5 Summary and Future Directions

We propose in this work the SPIDER method for non-convex optimization. Our SPIDER-type algorithms have update rules that are reasonably simple and achieve excellent convergence properties. However, some important questions are left open. For example, the lower-bound results for finding a second-order stationary point are not complete. Specifically, it is not yet clear whether our gradient cost upper bounds for finding a second-order stationary point, namely Õ(ε^{-3}) for the online case and Õ(n^{1/2}ε^{-2}) for the finite-sum case (when n ≥ Ω(ε^{-1})), are optimal, or whether the gradient cost can be further improved, assuming both Lipschitz gradient and Lipschitz Hessian.

Acknowledgement  The authors would like to thank Jeffrey Z. HaoChen for his help with the numerical experiments, an anonymous reviewer for pointing out a mistake in the original proof of Theorem 1, Zeyuan Allen-Zhu and Quanquan Gu for relevant discussions and for pointing out the references Zhou et al. [39, 40], Jianqiao Wangni for pointing out the references Nguyen et al. [28, 29], and Zebang Shen, Ruoyu Sun, Haishan Ye, and Pan Zhou for very helpful discussions and comments. Zhouchen Lin is supported by the National Basic Research Program of China (973 Program, grant no. 2015CB352502), the National Natural Science Foundation (NSF) of China (grant nos. 61625301 and 61731018), and Microsoft Research Asia.

References

[1] Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., & Ma, T. (2017). Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (pp. 1195–1199). ACM.

[2] Allen-Zhu, Z. 
(2018). Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems.

[3] Allen-Zhu, Z. & Hazan, E. (2016). Variance reduction for faster non-convex optimization. In International Conference on Machine Learning (pp. 699–707).

[4] Allen-Zhu, Z. & Li, Y. (2018). Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems.

[5] Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010 (pp. 177–186). Springer.

[6] Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223–311.

[7] Bubeck, S. et al. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4), 231–357.

[8] Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2016). Accelerated methods for non-convex optimization. To appear in SIAM Journal on Optimization.

[9] Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2017a). "Convex Until Proven Guilty": Dimension-free acceleration of gradient descent on non-convex functions. In International Conference on Machine Learning (pp. 654–663).

[10] Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2017b). Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606.

[11] Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus de l'Académie des Sciences, 25, 536–538.

[12] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (pp.
2933–2941).

[13] Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (pp. 1646–1654).

[14] Durrett, R. (2010). Probability: Theory and Examples (4th edition). Cambridge University Press.

[15] Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle points – online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory (pp. 797–842).

[16] Ghadimi, S. & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368.

[17] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

[18] Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015). Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems (pp. 1594–1602).

[19] Jain, P., Kar, P., et al. (2017). Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4), 142–336.

[20] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., & Jordan, M. I. (2017a). How to escape saddle points efficiently. In International Conference on Machine Learning (pp. 1724–1732).

[21] Jin, C., Netrapalli, P., & Jordan, M. I. (2017b). Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456.

[22] Johnson, R. & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (pp. 315–323).

[23] Lee, J. D., Simchowitz, M., Jordan, M. I., & Recht, B. (2016). Gradient descent only converges to minimizers.
In Proceedings of The 29th Conference on Learning Theory (pp. 1246–1257).

[24] Lei, L., Ju, C., Chen, J., & Jordan, M. I. (2017). Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems (pp. 2345–2355).

[25] Levy, K. Y. (2016). The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831.

[26] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer.

[27] Nesterov, Y. & Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1), 177–205.

[28] Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017a). SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research (pp. 2613–2621). PMLR.

[29] Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017b). Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261.

[30] Paquette, C., Lin, H., Drusvyatskiy, D., Mairal, J., & Harchaoui, Z. (2018). Catalyst for gradient-based nonconvex optimization. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (pp. 613–622).

[31] Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., & Smola, A. (2018). A generic approach for escaping saddle points. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research (pp. 1233–1242). PMLR.

[32] Reddi, S.
J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning (pp. 314–323).

[33] Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, (pp. 400–407).

[34] Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2), 83–112.

[35] Tripuraneni, N., Stern, M., Jin, C., Regier, J., & Jordan, M. I. (2018). Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems.

[36] Woodworth, B. & Srebro, N. (2017). Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594.

[37] Woodworth, B. E. & Srebro, N. (2016). Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems (pp. 3639–3647).

[38] Xu, Y., Jin, R., & Yang, T. (2017). First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944.

[39] Zhou, D., Xu, P., & Gu, Q. (2018a). Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782.

[40] Zhou, D., Xu, P., & Gu, Q. (2018b). Stochastic nested variance reduction for nonconvex optimization. arXiv preprint arXiv:1806.07811.