{"title": "Stochastic Nested Variance Reduction for Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3921, "page_last": 3932, "abstract": "We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with conventional stochastic variance reduced gradient (SVRG) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, the proposed algorithm converges to an $\\epsilon$-approximate first-order stationary point (i.e., $\\|\\nabla F(\\mathbf{x})\\|_2\\leq \\epsilon$) within $\\tilde O(n\\land \\epsilon^{-2}+\\epsilon^{-3}\\land n^{1/2}\\epsilon^{-2})$\\footnote{$\\tilde O(\\cdot)$ hides the logarithmic factors, and $a\\land b$ means $\\min(a,b)$.} number of stochastic gradient evaluations. This improves the best known gradient complexity of SVRG $O(n+n^{2/3}\\epsilon^{-2})$ and that of SCSG $O(n\\land \\epsilon^{-2}+\\epsilon^{-10/3}\\land n^{2/3}\\epsilon^{-2})$. For gradient dominated functions, our algorithm also achieves better gradient complexity than the state-of-the-art algorithms. 
Thorough experimental results on different nonconvex optimization problems back up our theory.", "full_text": "Stochastic Nested Variance Reduction for Nonconvex Optimization\n\nDongruo Zhou\nDepartment of Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\ndrzhou@cs.ucla.edu\n\nPan Xu\nDepartment of Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\npanxu@cs.ucla.edu\n\nQuanquan Gu\nDepartment of Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\nqgu@cs.ucla.edu\n\nAbstract\n\nWe study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with conventional stochastic variance reduced gradient (SVRG) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K + 1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, the proposed algorithm converges to an $\\epsilon$-approximate first-order stationary point (i.e., $\\|\\nabla F(\\mathbf{x})\\|_2 \\leq \\epsilon$) within $\\tilde O(n \\land \\epsilon^{-2} + \\epsilon^{-3} \\land n^{1/2}\\epsilon^{-2})$$^1$ number of stochastic gradient evaluations. This improves the best known gradient complexity of SVRG $O(n + n^{2/3}\\epsilon^{-2})$ and that of SCSG $O(n \\land \\epsilon^{-2} + \\epsilon^{-10/3} \\land n^{2/3}\\epsilon^{-2})$. For gradient dominated functions, our algorithm also achieves better gradient complexity than the state-of-the-art algorithms. Thorough experimental results on different nonconvex optimization problems back up our theory.\n\n1 Introduction\n\nWe study the following nonconvex finite-sum problem\n\n$\\min_{\\mathbf{x} \\in \\mathbb{R}^d} F(\\mathbf{x}) := \\frac{1}{n}\\sum_{i=1}^n f_i(\\mathbf{x}),$ (1.1)\n\nwhere each component function $f_i: \\mathbb{R}^d \\rightarrow$ 
$\\mathbb{R}$ has $L$-Lipschitz continuous gradient but may be nonconvex. Many machine learning problems fall into (1.1), such as empirical risk minimization (ERM) with nonconvex loss. Since finding the global minimum of (1.1) is in general NP-hard [17], we instead aim at finding an $\\epsilon$-approximate stationary point $\\mathbf{x}$, which satisfies $\\|\\nabla F(\\mathbf{x})\\|_2 \\leq \\epsilon$, where $\\nabla F(\\mathbf{x})$ is the gradient of $F(\\mathbf{x})$ at $\\mathbf{x}$, and $\\epsilon > 0$ is the accuracy parameter.\n\n$^1$ $\\tilde O(\\cdot)$ hides the logarithmic factors, and $a \\land b$ means $\\min(a, b)$.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, we mainly focus on first-order algorithms, which only need the function value and gradient evaluations. We use gradient complexity, the number of stochastic gradient evaluations, to measure the convergence of different first-order algorithms.$^2$ For nonconvex optimization, it is well known that Gradient Descent (GD) can converge to an $\\epsilon$-approximate stationary point with $O(n \\cdot \\epsilon^{-2})$ [32] stochastic gradient evaluations. GD needs to calculate the full gradient at each iteration, which is a heavy load when $n \\gg 1$. Stochastic Gradient Descent (SGD) has $O(\\epsilon^{-4})$ gradient complexity to an $\\epsilon$-approximate stationary point under the assumption that the stochastic gradient has a bounded variance [15]. While SGD only needs to calculate a mini-batch of stochastic gradients in each iteration, due to the noise brought by stochastic gradients, its gradient complexity has a worse dependency on $\\epsilon$. In order to improve the dependence of the gradient complexity of SGD on $n$ and $\\epsilon$ for nonconvex optimization, variance reduction techniques were first proposed in [41, 19, 46, 10, 30, 6, 45, 11, 16] for convex finite-sum optimization. 
Representative algorithms include Stochastic Average Gradient (SAG) [41], Stochastic Variance Reduced Gradient (SVRG) [19], SAGA [10], Stochastic Dual Coordinate Ascent (SDCA) [45], Finito [11] and Batching SVRG [16], to mention a few. The key idea behind variance reduction is that gradient evaluations can be saved if the algorithm uses history information as reference. For instance, one representative variance reduction method is SVRG, which is based on a semi-stochastic gradient that is defined by two reference points. Since the variance of this semi-stochastic gradient diminishes as the iterate gets closer to the minimizer, it accelerates the convergence of the stochastic gradient method. Later on, Harikandeh et al. [16] proposed Batching SVRG, which also enjoys the fast convergence of SVRG without computing the full gradient. The convergence of SVRG under the nonconvex setting was first analyzed in [13, 44], where $F$ is still convex but each component function $f_i$ can be nonconvex. The analysis for the general nonconvex function $F$ was done by [38, 5], which shows that SVRG can converge to an $\\epsilon$-approximate stationary point with $O(n^{2/3} \\cdot \\epsilon^{-2})$ stochastic gradient evaluations. This result is strictly better than that of GD. Recently, Lei et al. [26] proposed a Stochastically Controlled Stochastic Gradient (SCSG) based on variance reduction, which further reduces the gradient complexity of SVRG to $O(n \\land \\epsilon^{-2} + \\epsilon^{-10/3} \\land (n^{2/3}\\epsilon^{-2}))$. This result outperforms both GD and SGD strictly. To the best of our knowledge, this is the state-of-the-art gradient complexity under the smoothness (i.e., gradient Lipschitz) and bounded stochastic gradient variance assumptions. 
A natural and long-standing question is:\n\nIs there still room for improvement in nonconvex finite-sum optimization without making additional assumptions beyond smoothness and bounded stochastic gradient variance?\n\nIn this paper, we provide an affirmative answer to the above question, by showing that the dependence on $n$ in the gradient complexity of SVRG [38, 5] and SCSG [26] can be further reduced. We propose a novel algorithm, namely Stochastic Nested Variance-Reduced Gradient descent (SNVRG). Similar to SVRG and SCSG, our proposed algorithm works in a multi-epoch way. Nevertheless, the technique we developed is highly nontrivial. At the core of our algorithm is the multiple reference points-based variance reduction technique in each iteration. In detail, inspired by SVRG and SCSG, which use two reference points to construct a semi-stochastic gradient with diminishing variance, our algorithm uses $K + 1$ reference points to construct a semi-stochastic gradient, whose variance decays faster than that of the semi-stochastic gradient used in SVRG and SCSG.\n\n1.1 Our Contributions\n\nOur major contributions are summarized as follows:\n\u2022 We propose a stochastic nested variance reduction technique for stochastic gradient method, which reduces the dependence of the gradient complexity on $n$ compared with SVRG and SCSG.\n\u2022 We show that our proposed algorithm is able to achieve an $\\epsilon$-approximate stationary point with $\\tilde O(n \\land \\epsilon^{-2} + \\epsilon^{-3} \\land n^{1/2}\\epsilon^{-2})$ stochastic gradient evaluations, which outperforms all existing first-order algorithms such as GD, SGD, SVRG and SCSG.\n\u2022 As a by-product, when $F$ is a $\\tau$-gradient dominated function, a variant of our algorithm can achieve an $\\epsilon$-approximate global minimizer (i.e., $F(\\mathbf{x}) - \\min_{\\mathbf{y}} F(\\mathbf{y}) \\leq \\epsilon$) within $\\tilde O\\big(n \\land \\tau\\epsilon^{-1} + \\tau(n \\land \\tau\\epsilon^{-1})^{1/2}\\big)$ stochastic gradient evaluations, which also outperforms the state-of-the-art.\n\n$^2$ While we use gradient 
complexity as in [26] to present our result, it is basically the same if we use incremental first-order oracle (IFO) complexity used by [38]. In other words, these are directly comparable.\n\n1.2 Additional Related Work\n\nSince it is hardly possible to review the huge body of literature on convex and nonconvex optimization due to space limit, here we review some additional most related work on accelerating nonconvex (finite-sum) optimization.\n\nAcceleration by high-order smoothness assumption: With only Lipschitz continuous gradient assumption, Carmon et al. [9] showed that the lower bound for both deterministic and stochastic algorithms to achieve an $\\epsilon$-approximate stationary point is $\\Omega(\\epsilon^{-2})$. With high-order smoothness assumptions, i.e., Hessian Lipschitzness, Hessian smoothness etc., a series of works have shown the existence of acceleration. For instance, Agarwal et al. [1] gave an algorithm based on Fast-PCA which can achieve an $\\epsilon$-approximate stationary point with gradient complexity $\\tilde O(n\\epsilon^{-3/2} + n^{3/4}\\epsilon^{-7/4})$. Carmon et al. [7, 8] showed two algorithms based on finding exact or inexact negative curvature which can achieve an $\\epsilon$-approximate stationary point with gradient complexity $\\tilde O(n\\epsilon^{-7/4})$. In this work, we only consider gradient Lipschitz without assuming Hessian Lipschitz or Hessian smooth. Therefore, our result is not directly comparable to the methods in this category.\n\n[Figure 1: Comparison of gradient complexities. Curves: SVRG, SCSG, SNVRG; horizontal axis: $n$; vertical axis: gradient complexity.]\n\nAcceleration by momentum: The fact that using momentum is able to accelerate algorithms has been shown both in theory and practice in convex optimization [35, 31, 18, 23, 14, 32, 29, 2]. 
However, there is no evidence that such acceleration exists in nonconvex optimization with only Lipschitz continuous gradient assumption [15, 27, 34, 28, 24]. If $F$ is $\\sigma$-strongly nonconvex, i.e., $\\nabla^2 F \\succeq -\\sigma \\mathbf{I}$, Allen-Zhu [3] proved that Natasha 1, an algorithm based on nonconvex momentum, is able to find an $\\epsilon$-approximate stationary point in $\\tilde O(n^{2/3}L^{2/3}\\sigma^{1/3}\\epsilon^{-2})$. Later, Allen-Zhu [3] further showed that Natasha 2, an online version of Natasha 1, is able to achieve an $\\epsilon$-approximate stationary point within $\\tilde O(\\epsilon^{-3.25})$ stochastic gradient evaluations.$^3$\n\nAfter our paper was submitted to NIPS and released on arXiv, a paper [12] was released on arXiv, which independently proposes a different algorithm and achieves the same convergence rate for finding an $\\epsilon$-approximate stationary point.\n\nTo give a thorough comparison of our proposed algorithm with existing first-order algorithms for nonconvex finite-sum optimization, we summarize the gradient complexity of the most relevant algorithms in Table 1. We also plot the gradient complexities of different algorithms in Figure 1 for nonconvex smooth functions. Note that GD and SGD are always worse than SVRG and SCSG according to Table 1. In addition, GNC-AGD and Natasha 2 need the additional Hessian Lipschitz condition. Therefore, we only plot the gradient complexity of SVRG, SCSG and our proposed SNVRG in Figure 1.\n\nNotation: Let $\\mathbf{A} = [A_{ij}] \\in \\mathbb{R}^{d \\times d}$ be a matrix and $\\mathbf{x} = (x_1, ..., x_d)^\\top \\in \\mathbb{R}^d$ be a vector. $\\mathbf{I}$ denotes an identity matrix. We use $\\|\\mathbf{v}\\|_2$ to denote the 2-norm of vector $\\mathbf{v} \\in \\mathbb{R}^d$. We use $\\langle \\cdot, \\cdot \\rangle$ to represent the inner product of two vectors. Given two sequences $\\{a_n\\}$ and $\\{b_n\\}$, we write $a_n = O(b_n)$ if there exists a constant $0 < C < +\\infty$ such that $a_n \\leq C b_n$. We write $a_n = \\Omega(b_n)$ if there exists a constant $0 < C < +\\infty$, such that $a_n \\geq C b_n$. We use notation $\\tilde O(\\cdot)$ to hide logarithmic factors. We also make use of the notation $f_n \\lesssim$ 
$g_n$ ($f_n \\gtrsim g_n$) if $f_n$ is less than (larger than) $g_n$ up to a constant. We use the product symbol $\\prod_{i=a}^b c_i$ to denote $c_a c_{a+1} \\cdots c_b$. Moreover, if $a > b$, we take the product as 1. We use $\\lfloor \\cdot \\rfloor$ as the floor function. We use $\\log(x)$ to represent the logarithm of $x$ to base 2. $a \\land b$ is a shorthand notation for $\\min(a, b)$.\n\n2 Preliminaries\n\nIn this section, we present some definitions that will be used throughout our analysis.\n\n$^3$ In fact, Natasha 2 is guaranteed to converge to an $(\\epsilon, \\sqrt{\\epsilon})$-approximate second-order stationary point with $\\tilde O(\\epsilon^{-3.25})$ gradient complexity, which implies the convergence to an $\\epsilon$-approximate stationary point.\n\nTable 1: Comparisons on gradient complexity of different algorithms. The second column shows the gradient complexity for a nonconvex and smooth function to achieve an $\\epsilon$-approximate stationary point (i.e., $\\|\\nabla F(\\mathbf{x})\\|_2 \\leq \\epsilon$). The third column presents the gradient complexity for a gradient dominant function to achieve an $\\epsilon$-approximate global minimizer (i.e., $F(\\mathbf{x}) - \\min_{\\mathbf{x}} F(\\mathbf{x}) \\leq \\epsilon$). The last column indicates whether the algorithm needs the Hessian Lipschitz condition.\n\nAlgorithm | nonconvex | gradient dominant | Hessian Lipschitz\nGD | $O(n/\\epsilon^2)$ | $\\tilde O(\\tau n)$ | No\nSGD | $O(1/\\epsilon^4)$ | $O(1/\\epsilon^4)$ | No\nSVRG [38] | $O(n^{2/3}/\\epsilon^2)$ | $\\tilde O(n + \\tau n^{2/3})$ | No\nSCSG [26] | $O(1/\\epsilon^{10/3} \\land n^{2/3}/\\epsilon^2)$ | $\\tilde O\\big(n \\land \\tau/\\epsilon + \\tau(n \\land \\tau/\\epsilon)^{2/3}\\big)$ | No\nGNC-AGD [8] | $\\tilde O(n/\\epsilon^{1.75})$ | N/A | Needed\nNatasha 2 [3] | $\\tilde O(1/\\epsilon^{3.25})$ | N/A | Needed\nSNVRG (this paper) | $\\tilde O(1/\\epsilon^3 \\land n^{1/2}/\\epsilon^2)$ | $\\tilde O\\big(n \\land \\tau/\\epsilon + \\tau(n \\land \\tau/\\epsilon)^{1/2}\\big)$ | No\n\nDefinition 2.1. A function $f$ is $L$-smooth, if for any $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^d$, we have\n\n$\\|\\nabla f(\\mathbf{x}) - \\nabla f(\\mathbf{y})\\|_2 \\leq L\\|\\mathbf{x} - \\mathbf{y}\\|_2.$ (2.1)\n\nDefinition 2.1 implies that if $f$ is $L$-smooth, we have for any $\\mathbf{x}, \\mathbf{h} \\in \\mathbb{R}^d$\n\n$f(\\mathbf{x} + \\mathbf{h}) \\leq f(\\mathbf{x}) + \\langle \\nabla f(\\mathbf{x}), \\mathbf{h} \\rangle + \\frac{L}{2}\\|\\mathbf{h}\\|_2^2.$ (2.2)\n\nDefinition 2.2. 
A function $f$ is $\\lambda$-strongly convex, if for any $\\mathbf{x}, \\mathbf{h} \\in \\mathbb{R}^d$, we have\n\n$f(\\mathbf{x} + \\mathbf{h}) \\geq f(\\mathbf{x}) + \\langle \\nabla f(\\mathbf{x}), \\mathbf{h} \\rangle + \\frac{\\lambda}{2}\\|\\mathbf{h}\\|_2^2.$ (2.3)\n\nDefinition 2.3. A function $F$ with finite-sum structure in (1.1) is said to have stochastic gradients with bounded variance $\\sigma^2$, if for any $\\mathbf{x} \\in \\mathbb{R}^d$, we have\n\n$\\mathbb{E}_i\\|\\nabla f_i(\\mathbf{x}) - \\nabla F(\\mathbf{x})\\|_2^2 \\leq \\sigma^2,$ (2.4)\n\nwhere $i$ is a random index uniformly chosen from $[n]$ and $\\mathbb{E}_i$ denotes the expectation over such $i$.\n\n$\\sigma^2$ is called the upper bound on the variance of stochastic gradients [26].\n\nDefinition 2.4. A function $F$ with finite-sum structure in (1.1) is said to have averaged $L$-Lipschitz gradient, if for any $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^d$, we have\n\n$\\mathbb{E}_i\\|\\nabla f_i(\\mathbf{x}) - \\nabla f_i(\\mathbf{y})\\|_2^2 \\leq L^2\\|\\mathbf{x} - \\mathbf{y}\\|_2^2,$ (2.5)\n\nwhere $i$ is a random index uniformly chosen from $[n]$ and $\\mathbb{E}_i$ denotes the expectation over the choice.\n\nDefinition 2.5. We say a function $f$ is lower-bounded by $f^*$ if for any $\\mathbf{x} \\in \\mathbb{R}^d$, $f(\\mathbf{x}) \\geq f^*$.\n\nWe also consider a class of functions, namely gradient dominated functions [36], which is formally defined as follows:\n\nDefinition 2.6. We say function $f$ is $\\tau$-gradient dominated if for any $\\mathbf{x} \\in \\mathbb{R}^d$, we have\n\n$f(\\mathbf{x}) - f(\\mathbf{x}^*) \\leq \\tau \\cdot \\|\\nabla f(\\mathbf{x})\\|_2^2,$ (2.6)\n\nwhere $\\mathbf{x}^* \\in \\mathbb{R}^d$ is the global minimizer of $f$.\n\nNote that the gradient dominated condition is also known as the Polyak-Lojasiewicz (P-L) condition [36], and a gradient dominated function is not necessarily convex. 
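The condition is weaker than strong convexity as well as other popular conditions that appear in the optimization literature [20].\n\nAlgorithm 1 One-epoch-SNVRG($\\mathbf{x}_0, F, K, M, \\{T_l\\}, \\{B_l\\}, B$)\n1: Input: initial point $\\mathbf{x}_0$, function $F$, loop number $K$, step size parameter $M$, loop parameters $T_l, l \\in [K]$, batch parameters $B_l, l \\in [K]$, base batch size $B > 0$.\n2: $\\mathbf{x}_0^{(l)} \\leftarrow \\mathbf{x}_0$, $\\mathbf{g}_0^{(l)} \\leftarrow 0$, $0 \\leq l \\leq K$\n3: Uniformly generate index set $I \\subset [n]$ without replacement, $|I| = B$\n4: $\\mathbf{g}_0^{(0)} \\leftarrow 1/B \\sum_{i \\in I} \\nabla f_i(\\mathbf{x}_0)$\n5: $\\mathbf{v}_0 \\leftarrow \\sum_{l=0}^K \\mathbf{g}_0^{(l)}$\n6: $\\mathbf{x}_1 = \\mathbf{x}_0 - 1/(10M) \\cdot \\mathbf{v}_0$\n7: for $t = 1, ..., \\prod_{l=1}^K T_l - 1$ do\n8: $r = \\min\\{j : 0 = (t \\bmod \\prod_{l=j+1}^K T_l), 0 \\leq j \\leq K\\}$\n9: $\\{\\mathbf{x}_t^{(l)}\\} \\leftarrow$ Update_reference_points($\\{\\mathbf{x}_{t-1}^{(l)}\\}, \\mathbf{x}_t, r$), $0 \\leq l \\leq K$.\n10: $\\{\\mathbf{g}_t^{(l)}\\} \\leftarrow$ Update_reference_gradients($\\{\\mathbf{g}_{t-1}^{(l)}\\}, \\{\\mathbf{x}_t^{(l)}\\}, r$), $0 \\leq l \\leq K$.\n11: $\\mathbf{v}_t \\leftarrow \\sum_{l=0}^K \\mathbf{g}_t^{(l)}$\n12: $\\mathbf{x}_{t+1} \\leftarrow \\mathbf{x}_t - 1/(10M) \\cdot \\mathbf{v}_t$\n13: end for\n14: $\\mathbf{x}_{\\text{out}} \\leftarrow$ uniformly random choice from $\\{\\mathbf{x}_t\\}$, where $0 \\leq t < \\prod_{l=1}^K T_l$\n15: $T = \\prod_{l=1}^K T_l$\n16: Output: $[\\mathbf{x}_{\\text{out}}, \\mathbf{x}_T]$\n17: Function: Update_reference_points($\\{\\mathbf{x}_{\\text{old}}^{(l)}\\}, \\mathbf{x}, r$)\n18: $\\mathbf{x}_{\\text{new}}^{(l)} \\leftarrow \\mathbf{x}_{\\text{old}}^{(l)}$, $0 \\leq l \\leq r - 1$; $\\mathbf{x}_{\\text{new}}^{(l)} \\leftarrow \\mathbf{x}$, $r \\leq l \\leq K$\n19: return $\\{\\mathbf{x}_{\\text{new}}^{(l)}\\}$\n20: Function: Update_reference_gradients($\\{\\mathbf{g}_{\\text{old}}^{(l)}\\}, \\{\\mathbf{x}_{\\text{new}}^{(l)}\\}, r$)\n21: $\\mathbf{g}_{\\text{new}}^{(l)} \\leftarrow \\mathbf{g}_{\\text{old}}^{(l)}$, $0 \\leq l < r$\n22: for $r \\leq l \\leq K$ do\n23: Uniformly generate index set $I \\subset [n]$ without replacement, $|I| = B_l$\n24: $\\mathbf{g}_{\\text{new}}^{(l)} \\leftarrow 1/B_l \\sum_{i \\in I} \\big[\\nabla f_i(\\mathbf{x}_{\\text{new}}^{(l)}) - \\nabla f_i(\\mathbf{x}_{\\text{new}}^{(l-1)})\\big]$\n25: end for\n26: return $\\{\\mathbf{g}_{\\text{new}}^{(l)}\\}$.\n\n3 The Proposed Algorithm\n\nIn this section, we present our nested stochastic variance reduction algorithm, namely, SNVRG.\n\nOne-epoch-SNVRG: We first present the key component of our main algorithm, One-epoch-SNVRG, which is displayed in Algorithm 1. The most innovative part of Algorithm 1 attributes to the $K + 1$ reference points and $K + 1$ reference gradients. 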
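
The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the iterate is a scalar for brevity, and the names one_epoch_snvrg, grad, B_levels, as well as the toy least-squares problem, are hypothetical. With full batches the reference gradients telescope to the exact gradient at the current iterate, so the epoch reduces to gradient descent with step 1/(10M), which gives a simple sanity check.

```python
import random
from math import prod


def one_epoch_snvrg(x0, grad, n, T, B_levels, B, M, rng):
    # grad(x, idx) returns the average component gradient over the indices idx.
    # T = [T_1, ..., T_K] and B_levels = [B_1, ..., B_K] (illustrative names).
    K = len(T)
    ref_x = [x0] * (K + 1)   # reference points x^(0), ..., x^(K)
    ref_g = [0.0] * (K + 1)  # reference gradients g^(0), ..., g^(K)
    ref_g[0] = grad(x0, rng.sample(range(n), B))
    iterates = [x0]
    x = x0 - sum(ref_g) / (10 * M)
    for t in range(1, prod(T)):
        iterates.append(x)
        # Line 8: smallest level j whose period T_{j+1} * ... * T_K divides t.
        # Inside this loop r >= 1, so level 0 is never refreshed.
        r = min(j for j in range(K + 1) if t % prod(T[j:]) == 0)
        for l in range(r, K + 1):
            ref_x[l] = x
        for l in range(r, K + 1):
            idx = rng.sample(range(n), B_levels[l - 1])
            ref_g[l] = grad(ref_x[l], idx) - grad(ref_x[l - 1], idx)
        x = x - sum(ref_g) / (10 * M)
    # x_out is drawn uniformly from x_0, ..., x_{T-1}; x is x_T.
    return rng.choice(iterates), x


# Toy finite sum: f_i(x) = 0.5 * (a_i * x - b_i)^2 (an assumption for the demo).
a = [1.0, 2.0, 0.5, 1.5]
b = [1.0, 1.0, 2.0, 0.0]
n = len(a)


def grad(x, idx):
    return sum(a[i] * (a[i] * x - b[i]) for i in idx) / len(idx)


def F(x):
    return sum(0.5 * (a[i] * x - b[i]) ** 2 for i in range(n)) / n


L = max(ai * ai for ai in a)  # each f_i is (a_i^2)-smooth
x_out, x_T = one_epoch_snvrg(0.0, grad, n, T=[2, 2], B_levels=[n, n],
                             B=n, M=6 * L, rng=random.Random(0))
```

In the stochastic regime one would pass smaller batch sizes; the sketch recomputes the (identically zero) reference gradients for levels above r exactly as Lines 22 to 25 do, which a practical implementation could skip.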
Note that when $K = 1$, Algorithm 1 reduces to one epoch of the SVRG algorithm [19, 38, 5]. To better understand our One-epoch-SNVRG algorithm, it would be helpful to revisit the original SVRG, which is a special case of our algorithm. For the finite-sum optimization problem in (1.1), the original SVRG takes the following updating formula\n\n$\\mathbf{x}_{t+1} = \\mathbf{x}_t - \\eta \\mathbf{v}_t = \\mathbf{x}_t - \\eta\\big(\\nabla F(\\tilde{\\mathbf{x}}) + \\nabla f_{i_t}(\\mathbf{x}_t) - \\nabla f_{i_t}(\\tilde{\\mathbf{x}})\\big),$\n\nwhere $\\eta > 0$ is the step size, $i_t$ is a random index uniformly chosen from $[n]$ and $\\tilde{\\mathbf{x}}$ is a snapshot for $\\mathbf{x}_t$ after every $T_1$ iterations. There are two reference points in the update formula at $\\mathbf{x}_t$: $\\mathbf{x}_t^{(0)} = \\tilde{\\mathbf{x}}$ and $\\mathbf{x}_t^{(1)} = \\mathbf{x}_t$. Note that $\\tilde{\\mathbf{x}}$ is updated every $T_1$ iterations, namely, $\\tilde{\\mathbf{x}}$ is set to be $\\mathbf{x}_t$ only when $(t \\bmod T_1) = 0$. Moreover, in the semi-stochastic gradient $\\mathbf{v}_t$, there are also two reference gradients and we denote them by $\\mathbf{g}_t^{(0)} = \\nabla F(\\tilde{\\mathbf{x}})$ and $\\mathbf{g}_t^{(1)} = \\nabla f_{i_t}(\\mathbf{x}_t) - \\nabla f_{i_t}(\\tilde{\\mathbf{x}}) = \\nabla f_{i_t}(\\mathbf{x}_t^{(1)}) - \\nabla f_{i_t}(\\mathbf{x}_t^{(0)})$.\n\nBack to our One-epoch-SNVRG, we can define similar reference points and reference gradients as that in the special case of SVRG. Specifically, for $t = 0, ..., \\prod_{l=1}^K T_l - 1$, each point $\\mathbf{x}_t$ has $K + 1$ reference points $\\{\\mathbf{x}_t^{(l)}\\}$, $l = 0, ..., K$, which is set to be $\\mathbf{x}_t^{(l)} = \\mathbf{x}_{t^l}$ with index $t^l$ defined as\n\n$t^l = \\Big\\lfloor \\frac{t}{\\prod_{k=l+1}^K T_k} \\Big\\rfloor \\cdot \\prod_{k=l+1}^K T_k.$ (3.1)\n\nSpecially, note that we have $\\mathbf{x}_t^{(0)} = \\mathbf{x}_0$ and $\\mathbf{x}_t^{(K)} = \\mathbf{x}_t$ for all $t = 0, ..., \\prod_{l=1}^K T_l - 1$. Similarly, $\\mathbf{x}_t$ also has $K + 1$ reference gradients $\\{\\mathbf{g}_t^{(l)}\\}$, which can be defined based on the reference points $\\{\\mathbf{x}_t^{(l)}\\}$:\n\n$\\mathbf{g}_t^{(0)} = \\frac{1}{B}\\sum_{i \\in I} \\nabla f_i(\\mathbf{x}_0), \\quad \\mathbf{g}_t^{(l)} = \\frac{1}{B_l}\\sum_{i \\in I_l}\\big[\\nabla f_i(\\mathbf{x}_t^{(l)}) - \\nabla f_i(\\mathbf{x}_t^{(l-1)})\\big], \\quad l = 1, ..., K,$ (3.2)\n\nwhere $I, I_l$ are random index sets with $|I| = B$, $|I_l| = B_l$ and are uniformly generated from $[n]$ without replacement. 
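
As a small check of the indexing in (3.1), the following Python sketch compares the closed-form index with the refresh rule in Line 8 of Algorithm 1. The helper names refresh_level and reference_index and the loop parameters T = [2, 2, 4] are illustrative assumptions; the point is only that the closed form equals the last iteration at which level l was reset.

```python
from math import prod


def refresh_level(t, T):
    # r from Line 8: smallest j with t mod (T_{j+1} * ... * T_K) == 0.
    # T = [T_1, ..., T_K]; the empty product for j = K equals 1.
    return min(j for j in range(len(T) + 1) if t % prod(T[j:]) == 0)


def reference_index(t, l, T):
    # The index from (3.1): t rounded down to a multiple of T_{l+1} * ... * T_K.
    period = prod(T[l:])
    return (t // period) * period


T = [2, 2, 4]                      # hypothetical loop parameters, K = 3
last_refresh = [0] * (len(T) + 1)  # last iteration at which each level was reset
for t in range(1, prod(T)):
    r = refresh_level(t, T)
    for l in range(r, len(T) + 1):
        last_refresh[l] = t
    # The closed form (3.1) agrees with the bookkeeping of Algorithm 1.
    assert last_refresh == [reference_index(t, l, T) for l in range(len(T) + 1)]
```

Level 0 keeps index 0 throughout the epoch and level K always equals the current iteration, matching the remark that the coarsest reference point is the initial point and the finest one is the current iterate.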
Based on the reference points and reference gradients, we then update $\\mathbf{x}_{t+1} = \\mathbf{x}_t - 1/(10M) \\cdot \\mathbf{v}_t$, where $\\mathbf{v}_t = \\sum_{l=0}^K \\mathbf{g}_t^{(l)}$ and $M$ is the step size parameter. The illustration of reference points and gradients of SNVRG is displayed in Figure 2.\n\nWe remark that it would be a huge waste for us to re-evaluate $\\mathbf{g}_t^{(l)}$ at each iteration. Fortunately, due to the fact that each reference point is only updated after a long period, we can maintain $\\mathbf{g}_t^{(l)} = \\mathbf{g}_{t-1}^{(l)}$ and only need to update $\\mathbf{g}_t^{(l)}$ when $\\mathbf{x}_t^{(l)}$ has been updated, as is suggested by Line 24 in Algorithm 1.\n\nSNVRG: Using One-epoch-SNVRG (Algorithm 1) as a building block, we now present our main algorithm: Algorithm 2 for nonconvex finite-sum optimization to find an $\\epsilon$-approximate stationary point. At each iteration of Algorithm 2, it executes One-epoch-SNVRG (Algorithm 1) which takes $\\mathbf{z}_{s-1}$ as its input and outputs $[\\mathbf{y}_s, \\mathbf{z}_s]$. We choose $\\mathbf{y}_{\\text{out}}$ as the output of Algorithm 2 uniformly from $\\{\\mathbf{y}_s\\}$, for $s = 1, ..., S$.\n\nSNVRG-PL: In addition, when function $F$ in (1.1) is gradient dominated as defined in Definition 2.6 (P-L condition), it has been proved that the global minimum can be found by SGD [20], SVRG [38] and SCSG [26] very efficiently. Following a similar trick used in [38], we present Algorithm 3 on top of Algorithm 2, to find the global minimum in this setting. We call Algorithm 3 SNVRG-PL, because gradient dominated condition is also known as Polyak-Lojasiewicz (PL) condition [36].\n\nSpace complexity: We briefly compare the space complexity between our algorithms and other variance reduction based algorithms. SVRG and SCSG need $O(d)$ space to store one reference gradient, while SAGA [10] needs to store reference gradients for each component function, and its space complexity is $O(nd)$ without using any trick. For our algorithm SNVRG, we need to store $K$ reference gradients, thus its space complexity is $O(Kd)$. In our theory, we will show that $K = O(\\log \\log n)$. 
Therefore, the space complexity of our algorithm is actually $\\tilde O(d)$, which is almost comparable to that of SVRG and SCSG.\n\n[Figure 2: Illustration of reference points and gradients.]\n\nAlgorithm 2 SNVRG\n1: Input: initial point $\\mathbf{z}_0$, function $F$, $K$, $M$, $\\{T_l\\}$, $\\{B_l\\}$, batch $B$, $S$.\n2: for $s = 1, ..., S$ do\n3: denote $P = (F, K, M, \\{T_l\\}, \\{B_l\\}, B)$\n4: $[\\mathbf{y}_s, \\mathbf{z}_s] \\leftarrow$ One-epoch-SNVRG($\\mathbf{z}_{s-1}, P$)\n5: end for\n6: Output: Uniformly choose $\\mathbf{y}_{\\text{out}}$ from $\\{\\mathbf{y}_s\\}$.\n\nAlgorithm 3 SNVRG-PL\n1: Input: initial point $\\mathbf{z}_0$, function $F$, $K$, $M$, $\\{T_l\\}$, $\\{B_l\\}$, batch $B$, $S$, $U$.\n2: for $u = 1, ..., U$ do\n3: denote $Q = (F, K, M, \\{T_l\\}, \\{B_l\\}, B, S)$\n4: $\\mathbf{z}_u = $ SNVRG($\\mathbf{z}_{u-1}, Q$)\n5: end for\n6: Output: $\\mathbf{z}_{\\text{out}} = \\mathbf{z}_U$.\n\n4 Main Theory\n\nIn this section, we provide the convergence analysis of SNVRG.\n\n4.1 Convergence of SNVRG\n\nWe first analyze One-epoch-SNVRG (Algorithm 1) and provide a particular choice of parameters.\n\nLemma 4.1. Suppose that $F$ has averaged $L$-Lipschitz gradient. In Algorithm 1, suppose $B \\geq 2$ and let the number of nested loops be $K = \\log \\log B$. Choose the step size parameter as $M = 6L$. For the loop and batch parameters, let $T_1 = 2$, $B_1 = 6^K \\cdot B$ and\n\n$T_l = 2^{2^{l-2}}, \\quad B_l = 6^{K-l+1} \\cdot B / 2^{2^{l-1}},$\n\nfor all $2 \\leq l \\leq K$. Then the output of Algorithm 1 $[\\mathbf{x}_{\\text{out}}, \\mathbf{x}_T]$ satisfies\n\n$\\mathbb{E}\\|\\nabla F(\\mathbf{x}_{\\text{out}})\\|_2^2 \\leq C\\Big(\\frac{L}{B^{1/2}} \\cdot \\mathbb{E}\\big[F(\\mathbf{x}_0) - F(\\mathbf{x}_T)\\big] + \\frac{\\sigma^2}{B} \\cdot \\mathbb{1}(B < n)\\Big)$ (4.1)\n\nwithin $1 \\lor (7B \\log^3 B)$ stochastic gradient computations, where $T = \\prod_{l=1}^K T_l$, $C = 600$ is a constant and $\\mathbb{1}(\\cdot)$ is the indicator function.\n\nThe following theorem shows the gradient complexity for Algorithm 2 to find an $\\epsilon$-approximate stationary point with a constant base batch size $B$.\n\nTheorem 4.2. 
Suppose that $F$ has averaged $L$-Lipschitz gradient and stochastic gradients with bounded variance $\\sigma^2$. In Algorithm 2, let $B = n \\land (2C\\sigma^2/\\epsilon^2)$, $S = 1 \\lor (2CL\\Delta_F/(B^{1/2}\\epsilon^2))$ and $C = 600$. The rest parameters $(K, M, \\{B_l\\}, \\{T_l\\})$ are chosen the same as in Lemma 4.1. Then the output $\\mathbf{y}_{\\text{out}}$ of Algorithm 2 satisfies $\\mathbb{E}\\|\\nabla F(\\mathbf{y}_{\\text{out}})\\|_2^2 \\leq \\epsilon^2$ with less than\n\n$O\\bigg(\\log^3\\Big(\\frac{\\sigma^2}{\\epsilon^2} \\land n\\Big)\\bigg[\\frac{\\sigma^2}{\\epsilon^2} \\land n + \\frac{L\\Delta_F}{\\epsilon^2}\\Big[\\frac{\\sigma^2}{\\epsilon^2} \\land n\\Big]^{1/2}\\bigg]\\bigg)$ (4.2)\n\nstochastic gradient computations, where $\\Delta_F = F(\\mathbf{z}_0) - F^*$.\n\nRemark 4.3. If we treat $\\sigma^2$, $L$ and $\\Delta_F$ as constants, and assume $\\epsilon \\ll 1$, then (4.2) can be simplified to $\\tilde O(\\epsilon^{-3} \\land n^{1/2}\\epsilon^{-2})$. This gradient complexity is strictly better than $O(\\epsilon^{-10/3} \\land n^{2/3}\\epsilon^{-2})$, which is achieved by SCSG [26]. Specifically, when $n \\lesssim 1/\\epsilon^2$, our proposed SNVRG is faster than SCSG by a factor of $n^{1/6}$; when $n \\gtrsim 1/\\epsilon^2$, SNVRG is faster than SCSG by a factor of $\\epsilon^{-1/3}$. Moreover, SNVRG also outperforms Natasha 2 [3] which attains $\\tilde O(\\epsilon^{-3.25})$ gradient complexity and needs the additional Hessian Lipschitz condition.\n\n4.2 Convergence of SNVRG-PL\n\nWe now consider the case when $F$ is a $\\tau$-gradient dominated function. In general, we are able to find an $\\epsilon$-approximate global minimizer of $F$ instead of only an $\\epsilon$-approximate stationary point. Algorithm 3 uses Algorithm 2 as a component.\n\nTheorem 4.4. Suppose that $F$ has averaged $L$-Lipschitz gradient and stochastic gradients with bounded variance $\\sigma^2$, and $F$ is a $\\tau$-gradient dominated function. In Algorithm 3, let the base batch size $B = n \\land (4C_1\\tau\\sigma^2/\\epsilon)$, the number of epochs for SNVRG $S = 1 \\lor (2C_1\\tau L/B^{1/2})$ and the number of epochs $U = \\log(2\\Delta_F/\\epsilon)$. The rest parameters $(K, M, \\{B_l\\}, \\{T_l\\})$ are chosen as the same in Lemma 4.1. 
Then the output $\\mathbf{z}_{\\text{out}}$ of Algorithm 3 satisfies $\\mathbb{E}\\big[F(\\mathbf{z}_{\\text{out}}) - F^*\\big] \\leq \\epsilon$ within\n\n$O\\bigg(\\log^3\\Big(n \\land \\frac{\\tau\\sigma^2}{\\epsilon}\\Big) \\log\\frac{\\Delta_F}{\\epsilon}\\bigg[n \\land \\frac{\\tau\\sigma^2}{\\epsilon} + \\tau L\\Big[n \\land \\frac{\\tau\\sigma^2}{\\epsilon}\\Big]^{1/2}\\bigg]\\bigg)$ (4.3)\n\nstochastic gradient computations, where $\\Delta_F = F(\\mathbf{z}_0) - F^*$.\n\nRemark 4.5. If we treat $\\sigma^2$, $L$ and $\\Delta_F$ as constants, then the gradient complexity in (4.3) turns into $\\tilde O\\big(n \\land \\tau\\epsilon^{-1} + \\tau(n \\land \\tau\\epsilon^{-1})^{1/2}\\big)$. Compared with nonconvex SVRG [39] which achieves $\\tilde O(n + \\tau n^{2/3})$ gradient complexity, our SNVRG-PL is strictly better than SVRG in terms of the first summand and is faster than SVRG at least by a factor of $n^{1/6}$ in terms of the second summand. Compared with a more general variant of SVRG, namely, the SCSG algorithm [26], which attains $\\tilde O\\big(n \\land \\tau\\epsilon^{-1} + \\tau(n \\land \\tau\\epsilon^{-1})^{2/3}\\big)$ gradient complexity, SNVRG-PL also outperforms it by a factor of $(n \\land \\tau\\epsilon^{-1})^{1/6}$.\n\nIf we further assume that $F$ is $\\lambda$-strongly convex, then it is easy to verify that $F$ is also $1/(2\\lambda)$-gradient dominated. As a direct consequence, we have the following corollary:\n\nCorollary 4.6. Under the same conditions and parameter choices as Theorem 4.4, if we additionally assume that $F$ is $\\lambda$-strongly convex, then Algorithm 3 will output an $\\epsilon$-approximate global minimizer within\n\n$\\tilde O\\bigg(n \\land \\frac{\\sigma^2}{\\lambda\\epsilon} + \\kappa \\cdot \\Big[n \\land \\frac{\\sigma^2}{\\lambda\\epsilon}\\Big]^{1/2}\\bigg)$ (4.4)\n\nstochastic gradient computations, where $\\kappa = L/\\lambda$ is the condition number of $F$.\n\nRemark 4.7. Corollary 4.6 suggests that when we regard $\\lambda$ and $\\sigma^2$ as constants and set $\\epsilon \\ll 1$, Algorithm 3 is able to find an $\\epsilon$-approximate global minimizer within $\\tilde O(n + n^{1/2}\\kappa)$ stochastic gradient computations, which matches SVRG-lep in Katyusha X [4]. 
Using catalyst techniques [29] or Katyusha momentum [2], it can be further accelerated to $\\tilde O(n + n^{3/4}\\sqrt{\\kappa})$, which matches the best-known convergence rate [43, 4].\n\n5 Experiments\n\nIn this section, we compare our algorithm SNVRG with other baseline algorithms on training a convolutional neural network for image classification. We compare the performance of the following algorithms: SGD; SGD with momentum [37] (denoted by SGD-momentum); ADAM [21]; SCSG [26]. It is worth noting that SCSG is a special case of SNVRG when the number of nested loops $K = 1$. Due to the memory cost, we did not compare GD and SVRG which need to calculate the full gradient. Although our theoretical analysis holds for general $K$ nested loops, it suffices to choose $K = 2$ in SNVRG to illustrate the effectiveness of the nested structure while keeping the implementation simple. In this case, we have 3 reference points and gradients. All experiments are conducted on Amazon AWS p2.xlarge servers which come with Intel Xeon E5 CPU and NVIDIA Tesla K80 GPU (12G GPU RAM). All algorithms are implemented in Pytorch platform version 0.4.0 within Python 3.6.4.\n\nDatasets: We use three image datasets: (1) The MNIST dataset [42] consists of handwritten digits and has 50,000 training examples and 10,000 test examples. The digits have been size-normalized to fit the network, and each image is 28 pixels by 28 pixels. (2) CIFAR10 dataset [22] consists of images in 10 classes and has 50,000 training examples and 10,000 test examples. The images have been size-normalized to fit the network, and each image is 32 pixels by 32 pixels. (3) SVHN dataset [33] consists of images of digits and has 531,131 training examples and 26,032 test examples. 
The digits have been size-normalized to fit the network, and each image is 32 pixels by 32 pixels.\n\nCNN Architecture: We use the standard LeNet [25], which has two convolutional layers with 6 and 16 filters of size 5 respectively, followed by three fully-connected layers with output size 120, 84 and 10. We apply max pooling after each convolutional layer.\n\nImplementation Details & Parameter Tuning: We did not use the random data augmentation which is set as default by Pytorch, because it applies random transformations (e.g., clip and rotation) at the beginning of each epoch on the original image dataset, which would ruin the finite-sum structure of the loss function. We set our grid search rules for all three datasets as follows. For SGD, we search the batch size from {256, 512, 1024, 2048} and the initial step sizes from {1, 0.1, 0.01}. For SGD-momentum, we set the momentum parameter as 0.9. We search its batch size from {256, 512, 1024, 2048} and the initial learning rate from {1, 0.1, 0.01}. For ADAM, we search the batch size from {256, 512, 1024, 2048} and the initial learning rate from {0.01, 0.001, 0.0001}. For SCSG and SNVRG, we choose loop parameters $\\{T_l\\}$ which satisfy $B_l \\cdot \\prod_{j=1}^l T_j = B$ automatically. In addition, for SCSG, we set the batch sizes $(B, B_1) = (B, B/b)$, where $b$ is the batch size ratio parameter. We search $B$ from {256, 512, 1024, 2048} and we search $b$ from {2, 4, 8}. We search its initial learning rate from {1, 0.1, 0.01}. For our proposed SNVRG, we set the batch sizes $(B, B_1, B_2) = (B, B/b, B/b^2)$, where $b$ is the batch size ratio parameter. We search $B$ from {256, 512, 1024, 2048} and $b$ from {2, 4, 8}. We search its initial learning rate from {1, 0.1, 0.01}. Following the convention of deep learning practice, we apply learning rate decay schedule to each algorithm with the learning rate decayed by 0.1 every 20 epochs. 
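
For concreteness, the parameter coupling described above can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names loop_parameters and learning_rate are hypothetical, the batch sizes are assumed to divide evenly, and the decay is read as multiplying the learning rate by 0.1 once every 20 epochs.

```python
def loop_parameters(B, b, K):
    # Batch sizes B_l = B / b^l for l = 1..K (exact division is assumed).
    batches = [B // b ** l for l in range(1, K + 1)]
    # Solve B_l * T_1 * ... * T_l = B level by level.
    T, running = [], 1
    for B_l in batches:
        T_l = B // (B_l * running)
        T.append(T_l)
        running *= T_l
    return batches, T


def learning_rate(initial_lr, epoch, decay=0.1, every=20):
    # Step decay: multiply by decay once every fixed number of epochs.
    return initial_lr * decay ** (epoch // every)


batches, T = loop_parameters(B=512, b=4, K=2)
```

Under these assumptions every loop parameter equals the ratio b, so one value from the grid {2, 4, 8} fixes both the batch sizes and the loop lengths.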
We also conducted experiments based on plain implementation of different algorithms without learning rate decay, which is deferred to the appendix.\n\n[Figure 3 panels: (a) training loss (MNIST); (b) training loss (CIFAR10); (c) training loss (SVHN); (d) test error (MNIST); (e) test error (CIFAR10); (f) test error (SVHN).]\n\nFigure 3: Experiment results on different datasets with learning rate decay. (a) and (d) depict the training loss and test error (top-1 error) vs. data epochs for training LeNet on MNIST dataset. (b) and (e) depict the training loss and test error vs. data epochs for training LeNet on CIFAR10 dataset. (c) and (f) depict the training loss and test error vs. data epochs for training LeNet on SVHN dataset.\n\nWe plotted the training loss and test error for different algorithms on each dataset in Figure 3. The results on MNIST are presented in Figures 3(a) and 3(d); the results on CIFAR10 are in Figures 3(b) and 3(e); and the results on SVHN dataset are shown in Figures 3(c) and 3(f). It can be seen that with learning rate decay schedule, our algorithm SNVRG outperforms all baseline algorithms, which confirms that the use of nested reference points and gradients can accelerate the nonconvex finite-sum optimization.\n\nWe would like to emphasize that, while this experiment is on training convolutional neural networks, the major goal of this experiment is to illustrate the advantage of our algorithm and corroborate our theory, rather than claiming a state-of-the-art algorithm for training deep neural networks.\n\n6 Conclusions and Future Work\n\nIn this paper, we proposed a stochastic nested variance reduced gradient method for finite-sum nonconvex optimization. It achieves substantially better gradient complexity than existing first-order algorithms. This partially resolves a long-standing question of whether the dependence of gradient complexity on $n$ for nonconvex SVRG and SCSG can be further improved. 
There is still an open question: is $\\tilde O(n \\land \\epsilon^{-2} + \\epsilon^{-3} \\land n^{1/2}\\epsilon^{-2})$ the optimal gradient complexity for finite-sum and stochastic nonconvex optimization problems? For finite-sum nonconvex optimization problems, the lower bound has been proved in Fang et al. [12], which suggests that our algorithm is near optimal up to a logarithmic factor. However, for general stochastic problems, the lower bound is still unknown. We plan to derive such a lower bound in our future work. On the other hand, our algorithm can also be extended to deal with nonconvex nonsmooth finite-sum optimization using proximal gradient [40].\n\nAcknowledgement\n\nWe would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation IIS-1652539 and BIGDATA IIS-1855099. We also thank AWS for providing cloud computing credits associated with the NSF BIGDATA award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.\n\nReferences\n[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. 2017.\n\n[2] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200\u20131205. ACM, 2017.\n\n[3] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd. arXiv preprint arXiv:1708.08694, 2017.\n\n[4] Zeyuan Allen-Zhu. Katyusha x: Practical momentum method for stochastic sum-of-nonconvex optimization. arXiv preprint arXiv:1802.03866, 2018.\n\n[5] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699\u2013707, 2016.\n\n[6] Alberto Bietti and Julien Mairal. 
Stochastic optimization with variance reduction for infinite datasets with finite sum structure. In Advances in Neural Information Processing Systems, pages 1622\u20131632, 2017.\n\n[7] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. 2016.\n\n[8] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. \u201cconvex until proven guilty\u201d: Dimension-free acceleration of gradient descent on non-convex functions. In International Conference on Machine Learning, pages 654\u2013663, 2017.\n\n[9] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points of non-convex, smooth high-dimensional functions. 2017.\n\n[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646\u20131654, 2014.\n\n[11] Aaron Defazio, Justin Domke, et al. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning, pages 1125\u20131133, 2014.\n\n[12] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 686\u2013696, 2018.\n\n[13] Dan Garber and Elad Hazan. Fast and simple pca via convex optimization. arXiv preprint arXiv:1509.05647, 2015.\n\n[14] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469\u20131492, 2012.\n\n[15] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. 
Mathematical Programming, 156(1-2):59\u201399, 2016.\n\n[16] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Kone\u010dn\u00fd, and Scott Sallinen. Stop wasting my gradients: Practical svrg. In Advances in Neural Information Processing Systems, pages 2251\u20132259, 2015.\n\n[17] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are np-hard. Journal of the ACM (JACM), 60(6):45, 2013.\n\n[18] Chonghai Hu, Weike Pan, and James T Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781\u2013789, 2009.\n\n[19] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[20] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-\u0142ojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795\u2013811. Springer, 2016.\n\n[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.\n\n[23] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365\u2013397, 2012.\n\n[24] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, pages 1\u201349, 2017.\n\n[25] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[26] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via scsg methods. 
In Advances in Neural Information Processing Systems, pages 2345\u20132355, 2017.\n\n[27] Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379\u2013387, 2015.\n\n[28] Qunwei Li, Yi Zhou, Yingbin Liang, and Pramod K Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. arXiv preprint arXiv:1705.04925, 2017.\n\n[29] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384\u20133392, 2015.\n\n[30] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829\u2013855, 2015.\n\n[31] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127\u2013152, 2005.\n\n[32] Yurii Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2014.\n\n[33] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning.\n\n[34] Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for gradient-based non-convex optimization. arXiv preprint arXiv:1703.10993, 2017.\n\n[35] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1\u201317, 1964.\n\n[36] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel\u2019noi Matematiki i Matematicheskoi Fiziki, 3(4):643\u2013653, 1963.\n\n[37] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145\u2013151, 1999.\n\n[38] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. 
Stochastic variance reduction for nonconvex optimization. pages 314\u2013323, 2016.\n\n[39] Sashank J Reddi, Suvrit Sra, Barnab\u00e1s P\u00f3czos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 1971\u20131977. IEEE, 2016.\n\n[40] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145\u20131153, 2016.\n\n[41] Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663\u20132671, 2012.\n\n[42] Bernhard Sch\u00f6lkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.\n\n[43] Shai Shalev-Shwartz. Sdca without duality. arXiv preprint arXiv:1502.06177, 2015.\n\n[44] Shai Shalev-Shwartz. Sdca without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747\u2013754, 2016.\n\n[45] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567\u2013599, 2013.\n\n[46] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057\u20132075, 2014.\n", "award": [], "sourceid": 1937, "authors": [{"given_name": "Dongruo", "family_name": "Zhou", "institution": "UCLA"}, {"given_name": "Pan", "family_name": "Xu", "institution": "UCLA"}, {"given_name": "Quanquan", "family_name": "Gu", "institution": "UCLA"}]}