{"title": "The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares", "book": "Advances in Neural Information Processing Systems", "page_first": 14977, "page_last": 14988, "abstract": "Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD\u2019s final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? \n\n\nFirst, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGD\u2019s final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of $\\sqrt{T}$ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. 
the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the step size scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.", "full_text": "The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Rong Ge 1, Sham M. Kakade 2, Rahul Kidambi 3 and Praneeth Netrapalli 4
1 Duke University, 2 University of Washington, 3 Cornell University, 4 Microsoft Research, India.
rongge@cs.duke.edu, sham@cs.washington.edu, rkidambi@cornell.edu, praneeth@microsoft.com

Abstract

Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity?
First, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGD's final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of √T in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsize scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.

1 Introduction

Large scale machine learning relies almost exclusively on stochastic optimization methods [BB07], which include stochastic gradient descent (SGD) [RM51] and its variants [DHS11, JZ13]. In this work, we restrict our attention to the SGD algorithm, where we are concerned with the behavior of the final iterate (i.e. the last point when we terminate the algorithm). A majority of (minimax optimal) theoretical results for SGD focus on polynomially decaying stepsizes [DGBSX12, RSS12, LJSB12, Bub14] (or constant stepsizes [BM13, DB15a, JKK+16] for the case of least squares regression) coupled with iterate averaging [Rup88, PJ92] to achieve minimax optimal rates of convergence. However, practical SGD implementations typically return the final iterate of a stochastic gradient procedure.
This line of work in theory (based on iterate averaging) and its discrepancy with regards to practice leads to the question of how SGD's final iterate behaves.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Algorithm 1: Step Decay scheme
  Input: Initial vector w, starting learning rate η0, number of iterations T
  Output: w
  for ℓ ← 1 to log T do
      ηℓ ← η0 / 2^ℓ
      for t ← 1 to T / log T do
          w ← w − ηℓ · ∇̂f(w)
      end
  end

Figure 1: (Left) The Step Decay scheme for stochastic gradient descent. Note that the algorithm requires just two parameters: the starting learning rate η0 and the number of iterations T.
(Right) Plot of function value error vs. condition number for the final iterate of the polynomially decaying stepsizes of equations (5), (6) and the step decay schedule (Algorithm 1), compared against the minimax optimal suffix averaged iterate with a constant stepsize [JKK+16], on a synthetic two-dimensional least squares regression problem (1). The condition number κ is varied over {50, 100, 200, 400}. An exhaustive grid search is performed over the starting stepsize and decay parameters. The initial excess risk is dσ² and the algorithm is run for T = κ_max² = 400² steps (for all experiments); results are averaged over 5 random seeds. Observe that the final iterate's error grows linearly as a function of the condition number κ for the polynomially decaying stepsize schemes, whereas the error does not grow with κ for the geometric "step decay" stepsize scheme. See section E.1 in the supplementary material for details.
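As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python on a toy streaming least squares instance. The data distribution, dimensions, seed, and stepsize below are illustrative stand-ins, not the paper's experimental setup.

```python
import numpy as np

def step_decay_sgd(oracle, w0, eta0, T):
    """Algorithm 1: halve the stepsize after each of log2(T) equal phases.

    `oracle(w)` returns one stochastic gradient of the objective at w.
    """
    w = w0.copy()
    n_phases = max(1, int(np.log2(T)))
    phase_len = T // n_phases
    for ell in range(1, n_phases + 1):
        eta = eta0 / 2 ** ell          # eta_l = eta_0 / 2^l
        for _ in range(phase_len):     # T / log T steps per phase
            w = w - eta * oracle(w)
    return w

# Toy streaming least squares instance (hypothetical, for illustration only).
rng = np.random.default_rng(0)
d = 2
w_star = np.array([1.0, -2.0])

def oracle(w):
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()
    return -(y - w @ x) * x            # stochastic gradient of 0.5*(y - <w, x>)^2

w_final = step_decay_sgd(oracle, w0=np.zeros(d), eta0=0.5, T=4096)
print(np.linalg.norm(w_final - w_star))  # distance to the optimum
```

The geometric halving means later phases make tiny, low-variance steps, which is what drives the near-minimax behavior of the final iterate discussed below.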
Indeed, this question\nhas motivated several efforts in stochastic convex optimization literature as elaborated below.\nNon-Smooth Stochastic Optimization: The work of [Sha12] raised the question with regards to\nthe behavior of SGD\u2019s \ufb01nal iterate for non-smooth stochastic optimization (with/without strong\nconvexity). The work of [SZ12] answered this question, indicating that SGD\u2019s \ufb01nal iterate with\npolynomially decaying stepsizes achieves near minimax rates (up to log factors) in an anytime (i.e. in\na limiting) sense (when number of iterations SGD is run for is not known in advance). Under speci\ufb01c\nchoices of step size sequences, [SZ12]\u2019s result on SGD\u2019s \ufb01nal iterate is tight owing to the recent work\nof [HLPR18]. More recently [JNN19] presented an approach indicating that a more nuanced stepsize\nsequence serves to achieve minimax rates (up to constant factors) for the non-smooth stochastic\noptimization setting when the end time T is known in advance.\nLeast Squares Regression (LSR): In contrast to the non-smooth setting, the state of our under-\nstanding of SGD\u2019s \ufb01nal iterate for smooth stochastic convex optimization, or, say, the streaming\nleast squares regression setting is far less mature \u2212 this gap motivates our paper\u2019s contributions. In\nparticular, this paper studies SGD\u2019s \ufb01nal iterate behavior under various stepsize choices for least\nsquares regression (with and without strong convexity). The use of SGD\u2019s \ufb01nal iterate for the least\nmean squares objective has featured in several efforts [WH60, Pro74, WS85, RS90], but these results\ndo not achieve minimax rates of convergence, which leads to the following question:\n\u201c Can polynomially decaying stepsizes (known to achieve minimax rates when coupled with iterate\naveraging [Rup88, PJ92]) offer minimax optimal rates on SGD\u2019s \ufb01nal iterate when optimizing the\nstreaming least squares regression objective? 
If not, is there any other family of stepsizes that can guarantee minimax rates on the final iterate of stochastic gradient descent?"
This paper presents progress on answering the above question; refer to the contributions below for more details. Note that the oracle model employed by this work (to quantify SGD's final iterate behavior) has featured in a string of recent results that present a non-asymptotic understanding of SGD for least squares regression, with the caveat being that these results crucially rely on iterate averaging with constant stepsize sequences [BM13, DB15a, JKK+16, JKK+17b, JKK+17a, NR18].

Our contributions: This work establishes upper and lower bounds on the behavior of SGD's final iterate, as run with standard polynomially decaying stepsizes as well as step decay schedules, which cut the stepsize by a constant factor after every constant number of epochs (see Algorithm 1), by considering the streaming least squares regression problem (with and without strong convexity).

Table 1: Comparison of sub-optimality for the final iterate of SGD (i.e., E[f(w_T)] − f(w*)) for stochastic convex optimization problems. This paper's focus is on SGD's final iterate for streaming least squares regression. The minimax rate refers to the best possible worst case rate with access to stochastic gradients (typically achieved with iterate averaging methods [PJ92, DGBSX12, RSS12]); the red shows the multiplicative factor increase (over the minimax rate) using the final iterate, under two different learning rate schedules: the polynomial decay and the step decay (refer to Algorithm 1). Polynomial decay schedules are of the form ηt ∝ 1/t^α (for appropriate α ∈ [0.5, 1]). For the general convex cases below, the final iterate with a polynomial decay scheme is off minimax rates by a log T factor (in an anytime/limiting sense) [SZ12]. Here ∇̂f, ∇f = E[∇̂f], and ∇²f denote the stochastic gradient, the gradient and the Hessian of the function f. With regards to least squares, we assume equation (3), following recent efforts [BM13, DB15a, JKK+16]. While polynomially decaying stepsizes are nearly minimax optimal for general (strongly) convex functions, this paper indicates they are highly suboptimal on the final iterate for least squares. The geometrically decaying Step Decay schedule (Algorithm 1) provides marked improvements over any polynomial decay scheme on the final iterate for least squares. For simplicity of presentation, the results for least squares regression do not show dependence on initial error. See Theorems 1 and 2 for precise statements (and [NY83, SZ12, HLPR18] for precise statements of the general case).

  Problem class            | Assumptions                              | Minimax rate | Final iterate, best poly-decay                  | Final iterate, Step Decay
  General convex           | Diam(ConstraintSet) ≤ D, E[‖∇̂f‖²] ≤ G²   | GD/√T        | Θ((GD/√T) · log T)  [SZ12, HLPR18]              | -
  Non-strongly convex LSR  | Eq. (3)                                  | σ²d/T        | Ω((σ²d/T) · √T/log T)  (This work, Theorem 1)   | O((σ²d/T) · log T)  (This work, Theorem 2)
  General strongly convex  | E[‖∇̂f‖²] ≤ G², ∇²f ⪰ µI                 | G²/(µT)      | Θ((G²/(µT)) · log T)  [SZ12, HLPR18]            | -
  Strongly convex LSR      | Eq. (3), ∇²f ⪰ µI                        | σ²d/T        | Ω((σ²d/T) · κ)  (This work, Theorem 1)          | O((σ²d/T) · log T)  (This work, Theorem 2)

Our main result indicates that step decay schedules offer significant improvements in achieving near minimax rates over polynomially decaying stepsizes in the known horizon case (when the end time T is known in advance). Figure 1 illustrates that this difference is evident (empirically) even when optimizing a two-dimensional synthetic least squares objective. Table 1 provides a summary. Finally, we present results that indicate the subtle (yet significant) differences between the known time horizon case and the anytime (i.e. the limiting) behavior of SGD's final iterate (see below). Note that proofs of our main claims can be found in the supplementary material.
Our main contributions are as follows:

• Sub-optimality of polynomially decaying stepsizes: For the strongly convex least squares case, this work shows that the final iterate of a polynomially decaying stepsize scheme (i.e.
with ηt ∝ 1/t^α, with α ∈ [0.5, 1]) is off the minimax rate dσ²/T by a factor of the condition number of the problem. For the non-strongly convex case of least squares, we show that any polynomially decaying stepsize can achieve a rate no better than dσ²/√T (up to log factors), while the minimax rate is dσ²/T.

• Near-optimality of the step-decay scheme: Given a fixed end time T, the step-decay scheme (Algorithm 1) presents a final iterate that is off the statistical minimax rate by just a log(T) factor when optimizing the strongly convex and non-strongly convex least squares regression objectives 1, thus indicating vast improvements over polynomially decaying stepsize schedules. We note here that our Theorem 2 for the non-strongly convex case offers a rate on the initial error (i.e., the bias term) that is off the best known rate [BM13] (which employs iterate averaging) by a dimension factor. That said, Algorithm 1 is rather straightforward and requires knowledge of just an initial learning rate and the number of iterations for its implementation.

1 This dependence can be improved to the log of the condition number of the problem (for the strongly convex case) using a more refined stepsize decay scheme.

• SGD has to query bad iterates infinitely often: For the case of optimizing strongly convex least squares regression, this work shows that any stochastic gradient procedure (in a lim sup sense) must query sub-optimal iterates (off by nearly a condition number factor) infinitely often.

• Complementary to our theoretical results for stochastic linear regression, we evaluate the empirical performance of different learning rate schemes when training a residual network on the cifar-10 dataset and observe that the continuous variant of step decay schemes (i.e. an exponential decay) indeed compares favorably to polynomially decaying stepsizes.

While the upper bounds established
in this paper (section 3.2) merit extensions towards broader\nsmooth convex functions (with/without strong convexity), the lower bounds established in sections 3.1,\n3.3 present implications towards classes of smooth stochastic convex optimization. Even in terms of\nupper bounds, note that there are fewer results on non-asymptotic behavior of SGD (beyond least\nsquares) when working in the oracle model considered in this work (see below). [BM11, BM13,\nBac14, NSW16] are exceptions, yet they do not achieve minimax rates on appropriate problem\nclasses; [FGKS15] does not work in standard stochastic \ufb01rst order oracle model [NY83, ABRW12],\nso their work is not directly comparable to examine extensions towards broader function classes.\nAs a \ufb01nal note, this paper\u2019s result on the sub-optimality of standard polynomially decaying stepsizes\nfor classes of smooth and strongly convex optimization doesn\u2019t contradict the (minimax) optimality\nresults in stochastic approximation [PJ92]. Iterate averaging coupled with polynomially decaying\nlearning rates (or constant learning rates for least squares [BM13, DB15a, JKK+16]) does achieve\nminimax rates [Rup88, PJ92]. However, as mentioned previously, this work deals with SGD\u2019s \ufb01nal\niterate behavior (i.e. without iterate averaging), since this bears more relevance towards practice.\nRelated work: [RM51] introduced the stochastic approximation problem and Stochastic Gradient\nDescent (SGD). They present conditions on stepsize schemes satis\ufb01ed by asymptotically convergent\nalgorithms: these schemes are referred to as \u201cconvergent\u201d stepsize sequences. [Rup88, PJ92] proved\nthe asymptotic optimality of iterate averaged SGD with larger stepsize sequences. 
In terms of oracle\nmodels and notions of optimality, there exists two lines of thought (see also [JKK+17b]):\nTowards statistically optimal estimation procedures: The goal of this line of thought is to match\nthe excess risk of the statistically optimal estimator [Anb71, KC78, PJ92, LC98] on every problem\ninstance. Several efforts consider SGD in this oracle [BM11, Bac14, DB15b, FGKS15, NSW16]\npresenting non-asymptotic results, often with iterate averaging. With regards to least squares,\n[BM13, DB15a, FGKS15, JKK+16, JKK+17b, NR18] use constant step-size SGD with iterate\naveraging to achieve minimax rates (on a per-problem basis) in this oracle model. SGD\u2019s \ufb01nal\niterate behavior for least squares has featured in several efforts in the signal processing/controls\nliterature [WH60, NN67, Pro74, WS85, RS90, SSB98], without achieving minimax rates. This paper\nworks in this oracle model and analyzes SGD\u2019s \ufb01nal iterate behavior with various stepsize choices.\nTowards optimality under bounded noise assumptions: The other line of thought presents algorithms\nwith access to stochastic gradients satisfying bounded noise assumptions, aiming to match lower\nbounds provided in [NY83, RR11, ABRW12]. Asymptotic properties of \u201cconvergent\u201d stepsize\nschemes have been studied in great detail [KC78, BMP90, LPW92, BB99, KY03, Lai03, Bor08].\n[DGBSX12, LJSB12, RSS12, GL12, GL13a, HK14, Bub14, DFB16] use iterate averaged SGD to\nachieve minimax rates for various problem classes non-asymptotically. [AZ18] present an alternative\napproach towards minimizing the gradient norm with access to stochastic gradients. As noted, [SZ12]\nachieves anytime optimal rates (upto a log T factor) with the \ufb01nal iterate of an SGD procedure, and\nthis is shown to be tight with the recent work of [HLPR18]. 
[JNN19] achieve minimax rates on the final iterate using a nuanced stepsize scheme when the number of iterations is fixed in advance.
Geometrically Decaying Stepsize Schedules date to [Gof77]. [DD19] employ the step decay schedule to prove high-probability guarantees for SGD with strongly convex objectives. In stochastic optimization, several other works, including [GL13b, HK14, AFGO19, KM19], consider doubling argument based approaches, where the epoch length is doubled every time the stepsizes are halved. The step decay schedule is employed to yield faster rates of convergence under certain growth (and related) conditions, both in convex [XLY16] and non-convex settings [YYJ18, DDC19].
Paper organization: Section 2 describes notation and problem setup. Section 3 presents our results on the sub-optimality of polynomial decay schemes and the near optimality of the step decay scheme. Section 3.3 presents results on the anytime behavior of SGD (i.e. the asymptotic/infinite horizon case). Section 4 presents experimental results and Section 5 presents conclusions.

2 Problem Setup

Notation: We present the setup and associated notation in this section. We represent scalars with normal font a, b, L etc., vectors with boldface lowercase characters a, b etc., and matrices with boldface uppercase characters A, B etc. We represent positive semidefinite (PSD) ordering between two matrices using ⪰. The symbol ≳ represents that the inequality holds up to some universal constant. We consider here the minimization of the following expected square loss objective:

    min_w f(w),  where  f(w) := (1/2) · E_{(x,y)∼D}[(y − ⟨w, x⟩)²].    (1)

Note that the Hessian of the objective is H := ∇²f(w) = E[xx⊤]. We are provided access to stochastic gradients obtained by sampling a new example (x_t, y_t) ∼ D. These examples satisfy:

    y = ⟨w*, x⟩ + ε,

where ε is the noise on the example pair (x, y) ∼ D and w* is a minimizer of the objective f(w). Given an initial iterate w0 and a stepsize sequence {ηt}, the stochastic gradient update is:

    w_{t+1} ← w_t − ηt ∇̂f(w_t),  where  ∇̂f(w_t) = −(y_t − ⟨w_t, x_t⟩) · x_t.    (2)

We assume that the noise ε = y − ⟨w*, x⟩ for all (x, y) ∼ D satisfies the following condition:

    Σ := E[∇̂f(w*) ∇̂f(w*)⊤] = E_{(x,y)∼D}[(y − ⟨w*, x⟩)² xx⊤] ⪯ σ² H.    (3)

Next, assume that the covariates x satisfy the following fourth moment inequality:

    E[‖x‖² xx⊤] ⪯ R² H.    (4)

This assumption is satisfied, say, when the norm of the covariates satisfies sup ‖x‖² < R², but is true more generally. Finally, note that both (3) and (4) are general and are used in recent works [BM13, JKK+16] that present a sharp analysis of SGD for the streaming least squares problem. Next, we denote by

    µ := λmin(H),  L := λmax(H),  and  κ := R²/µ

the smallest eigenvalue, the largest eigenvalue and the condition number of H respectively. µ > 0 in the strongly convex case but not necessarily so in the non-strongly convex case (in section 3 and beyond, the non-strongly convex case is referred to as the "smooth" case). Let w* ∈ arg min_{w∈R^d} f(w). The excess risk of an estimator w is f(w) − f(w*). Given t accesses to the stochastic gradient oracle in equation (2), any algorithm that uses these stochastic gradients and outputs ŵt has sub-optimality that is lower bounded by σ²d/t.
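To make the oracle of equation (2) concrete, the sketch below checks numerically that the stochastic gradient is unbiased, i.e. that averaging ∇̂f(w) over many fresh samples approximates the true gradient ∇f(w) = H(w − w*). The diagonal covariance, point w*, and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
scales = np.array([1.0, 2.0, 0.5])       # x has covariance H = diag(scales**2)
w_star = np.array([0.5, -1.0, 2.0])
sigma = 0.5

def stoch_grad(w):
    """One call to the stochastic gradient oracle of equation (2)."""
    x = scales * rng.normal(size=d)
    y = x @ w_star + sigma * rng.normal()  # y = <w*, x> + noise
    return -(y - w @ x) * x

w = np.ones(d)
H = np.diag(scales ** 2)
true_grad = H @ (w - w_star)               # exact gradient: H (w - w*)
est = np.mean([stoch_grad(w) for _ in range(200_000)], axis=0)
print(np.linalg.norm(est - true_grad))     # close to 0: the oracle is unbiased
```

Unbiasedness is exactly what lets assumption (3) control the variance of the oracle at the optimum w*.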
More concretely, we have that [VdV00]

    lim_{t→∞}  (E[f(ŵt)] − f(w*)) / (σ²d/t)  ≥  1.

The rate of (1 + o(1)) · σ²d/t is achieved using iterate averaged SGD [Rup88, PJ92] with constant stepsizes [BM13, DB15a, JKK+16]. This rate of σ²d/t is called the statistical minimax rate.

3 Main Results

Sections 3.1 and 3.2 consider the fixed time horizon setting; the former presents the significant sub-optimality of polynomially decaying stepsizes on SGD's final iterate, while the latter presents the near-optimality of the step decay schedule. Section 3.3 presents negative results on SGD's final iterate behavior (irrespective of the stepsizes employed) in the anytime (i.e. limiting) sense.

3.1 Suboptimality of polynomial decay schemes

This section begins by showing that there exist problem instances where the polynomially decaying stepsizes considered in stochastic approximation theory [RM51, PJ92], i.e., those of the form a/(b + t^α) for any choice of a, b > 0 and α ∈ [0.5, 1], are significantly suboptimal (by a factor of the condition number of the problem, or by √T in the smooth case) compared to the statistical minimax rate [KC78].

Theorem 1. Under assumptions (3) and (4), there exists a class of problem instances where the following lower bounds on excess risk hold for SGD's final iterate with polynomially decaying stepsizes, when given access to the oracle as written in equation (2).

Strongly convex case: Suppose µ > 0. For any condition number κ, there exists a least squares problem instance with initial suboptimality f(w0) − f(w*) ≤ σ²d such that, for any T ≥ κ^(4/3), for all a, b ≥ 0 and 0.5 ≤ α ≤ 1, and for the learning rate scheme ηt = a/(b + t^α), we have

    E[f(w_T)] − f(w*)  ≥  exp(−T/(κ log T)) · (f(w0) − f(w*))  +  (σ²d/64) · (κ/T).

Smooth case: For any fixed T > 1, there exists a least squares problem instance such that, for all a, b ≥ 0 and 0.5 ≤ α ≤ 1, and for the learning rate scheme ηt = a/(b + t^α), we have

    E[f(w_T)] − f(w*)  ≥  (L · ‖w0 − w*‖² + σ²d) · 1/(√T log T).

For both cases (with/without strong convexity), the minimax rate is σ²d/T. In the strongly convex case, SGD's final iterate with polynomially decaying stepsizes pays a suboptimality factor of Ω(κ), whereas in the smooth case, SGD's final iterate pays a suboptimality factor of Ω(√T / log T).

3.2 Near optimality of Step Decay schemes

Given the knowledge of an end time T at which the algorithm is terminated, this section presents the step decay schedule (Algorithm 1), which offers significant improvements over standard polynomially decaying stepsize schemes, and obtains near minimax rates (off by only a log(T) factor).

Theorem 2. Suppose we are given access to the stochastic gradient oracle (2) satisfying Assumptions (3) and (4). Running Algorithm 1 with an initial stepsize of η1 = 1/(2R²) allows the algorithm to achieve the following excess risk guarantees.

• Strongly convex case: Suppose µ > 0.
We have:

    E[f(w_T)] − f(w*)  ≤  2 · exp(−T/(2κ log T log κ)) · (f(w0) − f(w*))  +  4σ²d · (log T)/T.

• Smooth case: We have:

    E[f(w_T)] − f(w*)  ≤  2 · (R²d · ‖w0 − w*‖² + 2σ²d) · (log T)/T.

While Theorem 2 presents significant improvements over polynomial decay schemes, as mentioned in the contributions, the above result presents a worse rate on the initial error (by a dimension factor) in the smooth case (i.e. the non-strongly convex case), compared to the best known result [BM13], which relies heavily on iterate averaging to remove this factor. It is an open question whether this factor can actually be improved or not. Furthermore, comparing the initial error dependence between the lower bound for the smooth case (Theorem 1) and the upper bound for the step decay scheme, we believe that the dependence on the smoothness L should be improvable to one on R².
In terms of the variance, however, note that the polynomial decay schemes are plagued by a polynomial dependence on the condition number κ (for the strongly convex case), and are off the minimax rate by a √T factor (for the smooth case). The step decay schedule, on the other hand, is off the minimax rate [Rup88, PJ92, VdV00] by only a log(T) factor. It is worth noting that Algorithm 1 admits an efficient implementation in that it requires knowledge only of R² (similar to iterate averaging results [BM13, JKK+16]) and the end time T. Finally, note that this log T factor can be improved to a log κ factor for the strongly convex case by using an additional polynomial decay scheme before switching to the Step Decay scheme.

Proposition 3. Suppose we are given access to the stochastic gradient oracle (2) satisfying Assumptions (3) and (4). Let µ > 0 and let κ ≥ 2. For any problem and any fixed time horizon T with T/log T > 5κ, there exists a learning rate scheme that achieves

    E[f(w_T)] − f(w*)  ≤  2 exp(−T/(6κ log κ)) · (f(w0) − f(w*))  +  100 log²κ · σ²d/T.

In order to improve the dependence on the variance from log(T) (in Theorem 2) to log(κ) (in Proposition 3), we require access to the strong convexity parameter µ = λmin(H), in addition to R² and knowledge of the end time T. This parallels results known for the general strongly convex setting [RSS12, LJSB12, SZ12, Bub14, JNN19].
As a final remark, note that this section's results (on step decay schemes) assumed knowledge of a fixed time horizon T. In contrast, most results on SGD's averaged iterate obtain anytime (i.e., limiting/infinite horizon) guarantees. Can we hope to achieve such guarantees with the final iterate?

3.3 SGD queries bad points infinitely often

This section shows that obtaining near statistical minimax rates with the final iterate is not possible without knowledge of the time horizon T. More concretely, we show that irrespective of the learning rate sequence employed (be it polynomially decaying or step decay), SGD has to query a point with sub-optimality at least Ω(κ/log κ) · σ²d/T for infinitely many time steps T.

Theorem 4. Suppose we are given access to a stochastic gradient oracle (2) satisfying Assumptions (3) and (4). There exists a universal constant C > 0 and a problem instance such that, for the SGD algorithm with any ηt ≤ 1/(2R²) for all t ², we have

    lim sup_{T→∞}  (E[f(w_T)] − f(w*)) / (σ²d/T)  ≥  C · κ / log(κ + 1).

The bad points guaranteed to exist by Theorem 4 are not rare. We show that such points occur at least once in O(κ/log κ) iterations.
Refer to Theorem 16 in appendix D in the supplementary material.

4 Experimental Results

We present experimental validation on the suitability of the step decay schedule (or, more precisely, its continuous counterpart, the exponentially decaying schedule), and compare it with the polynomially decaying stepsize schedules. In particular, we consider the use of:

    ηt = η0 / (1 + b·t)    (5)        ηt = η0 / (1 + b·√t)    (6)        ηt = η0 · exp(−b·t)    (7)

where we perform a systematic grid search on the parameters η0 and b. In the section below, we consider a real world non-convex optimization problem of training a residual network on the cifar-10 dataset, with an aim to illustrate the practical implications of the results described in the paper. Complete details of the setup are given in Appendix E in the supplementary material.

4.1 Non-Convex Optimization: Training a Residual Net on cifar-10

We consider training a 44-layer deep residual network [HZRS16a] with pre-activation blocks [HZRS16b] (dubbed preresnet-44) on the cifar-10 dataset. The code for implementing the network can be found here 3. For all experiments, we use Nesterov's momentum [Nes83] as implemented in pytorch 4, with a momentum of 0.9, batchsize 128, 100 training epochs, and ℓ2 regularization of 0.0005. Our experiments are based on grid searching for the best learning rate decay scheme over the parametric family of learning rate schemes described above, (5), (6), (7); all grid searches are performed on a separate validation set (obtained by setting aside one-tenth of the training dataset) and with models trained on the remaining 45000 samples. For presenting the final numbers in the plots/tables, we employ the best hyperparameters from the validation stage, train on the entire 50,000 samples, and average results run with 10 different random seeds.
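The three schedules (5)-(7) can be written directly as functions of the iteration count; η0 and b below stand for the grid-searched parameters from the text, with placeholder values.

```python
import math

def poly_decay(eta0, b, t):
    """Equation (5): eta_t = eta0 / (1 + b*t)."""
    return eta0 / (1 + b * t)

def sqrt_decay(eta0, b, t):
    """Equation (6): eta_t = eta0 / (1 + b*sqrt(t))."""
    return eta0 / (1 + b * math.sqrt(t))

def exp_decay(eta0, b, t):
    """Equation (7): eta_t = eta0 * exp(-b*t), the continuous analogue of step decay."""
    return eta0 * math.exp(-b * t)

print(poly_decay(0.1, 0.01, 100))  # 0.05
```

Note that (7) decays geometrically in t, mirroring the per-phase halving of Algorithm 1, while (5) and (6) decay only polynomially.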
The parameters for the grid searches and other details are presented in Appendix E. Furthermore, we always extend the grid so that the best performing grid search parameter lies in the interior of our grid search.
How does the step decay scheme compare with the polynomially decaying stepsizes? Figure 2 and Table 2 present a comparison of the performance of the three schemes (5)-(7). These results demonstrate that the exponential scheme convincingly outperforms the polynomial step-size schemes.

Footnote 2: A learning rate larger than 2/R² will make the algorithm diverge.
Footnote 3: https://github.com/D-X-Y/ResNeXt-DenseNet
Footnote 4: https://github.com/pytorch

Table 2: Comparing train cross-entropy and test 0/1 error of various learning rate decay schemes for the classification task on cifar-10 using a 44-layer residual net with pre-activations.

Decay Scheme             | Train Function Value | Test 0/1 Error
O(1/t) (equation (5))    | 0.0713 ± 0.015       | 10.20 ± 0.7%
O(1/√t) (equation (6))   | 0.1119 ± 0.036       | 11.6 ± 0.67%
exp(−t) (equation (7))   | 0.0053 ± 0.0015      | 7.58 ± 0.21%

Figure 2: Plot of the training function value (left) and test 0/1 error (right) comparing the two polynomial decay schemes (5), (6) and the exponential decay scheme (7).

Does suffix iterate averaging improve over the final iterate's behavior for polynomially decaying stepsizes? Towards answering this question, we first consider the best performing parameter values for equations (5) and (6), and then average the iterates of the algorithm starting from epochs 5, 10, 20, 40, 80, 85, 90, 95, and 99 when training the model for a total of 100 epochs. While iterate averaging (and its suffix variants) has strong theoretical support for (stochastic) convex optimization [Rup88, PJ92, RSS12, Bub14, JKK+16], its impact on non-convex optimization is largely debatable.
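Suffix iterate averaging, as employed in this experiment, simply averages the iterates from a chosen starting epoch onward. The following is a hypothetical sketch (with iterates abstracted as numbers; the actual experiment averages network weights):

```python
def suffix_average(iterates, start):
    """Average of the iterates from index `start` (inclusive) onward,
    e.g. averaging the per-epoch iterates from epoch 80 of a 100-epoch run."""
    tail = iterates[start:]
    return sum(tail) / len(tail)
```

Sweeping `start` over {5, 10, 20, 40, 80, 85, 90, 95, 99} reproduces the family of suffix-averaged estimators compared against the final iterate in Figure 3.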
Nevertheless, this experiment's results (Figure 3) indicate that suffix averaging tends to hurt the algorithm's generalization behavior (which is unsurprising given the non-convex nature of the objective). Note that Figure 3 also indicates that averaging the final few (≤ 5) epochs tends to offer nearly the same result as the final iterate, indicating that the gains of suffix iterate averaging are relatively limited in several such settings.

Figure 3: Performance of the suffix averaged iterate compared to the final iterate, when varying the epoch at which iterate averaging is begun over {5, 10, 20, 40, 80, 85, 90, 95, 99}, for the 1/t learning rate (5) (left) and the 1/√t learning rate (6) (right).

Does our result on "knowing" the time horizon (for the step decay schedule) have implications for hyper-parameter search methods that work based on results extracted from truncated runs? Towards answering this question, consider Figure 4 and Tables 3 and 4, which present a comparison of the performance of three exponential decay schemes, each of which is tuned to achieve the best performance at 33, 66 and 100 epochs respectively. The key point to note is that the best performing hyperparameters at 33 and 66 epochs are not the best performing at 100 epochs (which is made stark from the perspective of the validation error; refer to Table 4). This demonstrates that hyper-parameter selection methods that tend to discard hyper-parameters which don't perform well at earlier stages of the optimization (i.e.
based on comparing results from truncated runs), as is indeed the case with hyperband [LJD+17], will benefit from a round of rethinking.

Figure 4: Plot of the training function value (left) and test 0/1 error (right) comparing exponential decay schemes (equation (7)) with parameters optimized for 33, 66 and 100 epochs.

Table 3: Comparing the training (softmax) function value obtained by optimizing the exponential decay scheme (equation (7)) for end times of 33/66/100 epochs on cifar-10 using a 44-layer residual net.

Decay Scheme                          | Train FVal @33 | Train FVal @66  | Train FVal @100
exp(−t) [optimized for 33 epochs]     | 0.098 ± 0.006  | 0.0086 ± 0.002  | 0.0062 ± 0.0015
exp(−t) [optimized for 66 epochs]     | 0.107 ± 0.012  | 0.0088 ± 0.0014 | 0.0061 ± 0.0011
exp(−t) [optimized for 100 epochs]    | 0.3 ± 0.06     | 0.071 ± 0.017   | 0.0053 ± 0.0016

Table 4: Comparing the test 0/1 error obtained by optimizing the exponential decay scheme (equation (7)) for end times of 33/66/100 epochs for the classification task on cifar-10 using a 44-layer residual net.

Decay Scheme                          | Test 0/1 @33   | Test 0/1 @66  | Test 0/1 @100
exp(−t) [optimized for 33 epochs]     | 10.36 ± 0.235% | 8.6 ± 0.26%   | 8.57 ± 0.25%
exp(−t) [optimized for 66 epochs]     | 10.51 ± 0.45%  | 8.51 ± 0.13%  | 8.46 ± 0.19%
exp(−t) [optimized for 100 epochs]    | 14.42 ± 1.47%  | 9.8 ± 0.66%   | 7.58 ± 0.21%

5 Conclusions and Discussion

The main contribution of this work is to show that the behavior of SGD's final iterate for least squares regression is much more nuanced than what has been indicated by prior efforts, which have primarily considered non-smooth stochastic convex optimization.
The results of this paper point out the striking limitations of polynomially decaying stepsizes for SGD's final iterate, and shed light on the effectiveness of the starkly different schemes based on the step decay schedule. Somewhat coincidentally, practical implementations for certain classes of stochastic optimization do return the final iterate of SGD with a step decay schedule; this connection merits further understanding through future work.

Acknowledgments

Rong Ge acknowledges funding from NSF CCF-1704656, NSF CCF-1845171 (CAREER), a Sloan Fellowship and a Google Faculty Research Award. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, NSF Award 1740551, and ONR award N00014-18-1-2247. Rahul Kidambi acknowledges funding from NSF Award 1740822.

References

[ABRW12] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2012.
[AFGO19] Necdet Serhat Aybat, Alireza Fallah, Mert Gürbüzbalaban, and Asuman E. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. CoRR, abs/1901.08022, 2019.
[Anb71] Dan Anbar. On Optimal Estimation Methods Using Stochastic Approximation Procedures. University of California, 1971.
[AZ18] Zeyuan Allen-Zhu. How to make the gradients small stochastically. CoRR, abs/1801.02982, 2018.
[Bac14] Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research (JMLR), volume 15, 2014.
[BB99] B. Bharath and V. S. Borkar. Stochastic approximation algorithms: overview and recent trends. Sādhanā, 1999.
[BB07] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning.
In NIPS 20, 2007.
[BM11] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS 24, 2011.
[BM13] Francis R. Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In NIPS 26, 2013.
[BMP90] Albert Benveniste, Michel Metivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Springer Texts in Stochastic Modelling and Applied Probability, 1990.
[Bor08] Vivek Borkar. Stochastic approximation. Cambridge Books, 2008.
[Bub14] Sébastien Bubeck. Theory of convex optimization for machine learning. CoRR, abs/1405.4980, 2014.
[DB15a] Alexandre Défossez and Francis R. Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics (AISTATS), 2015.
[DB15b] Aymeric Dieuleveut and Francis R. Bach. Non-parametric stochastic approximation with large step sizes. The Annals of Statistics, 2015.
[DD19] Damek Davis and Dmitriy Drusvyatskiy. Robust stochastic optimization with the proximal point method. CoRR, abs/1907.13307, 2019.
[DDC19] Damek Davis, Dmitriy Drusvyatskiy, and Vasileios Charisopoulos. Stochastic algorithms with geometric step decay converge linearly on sharp functions. CoRR, abs/1907.09547, 2019.
[DFB16] Aymeric Dieuleveut, Nicolas Flammarion, and Francis R. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. CoRR, abs/1602.05419, 2016.
[DGBSX12] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research (JMLR), volume 13, 2012.
[DHS11] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 12:2121-2159, 2011.
[FGKS15] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In COLT, 2015.
[GL12] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework. SIAM Journal on Optimization, 2012.
[GL13a] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 2013.
[GL13b] Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4), 2013.
[Gof77] J. L. Goffin. On the convergence rates of subgradient optimization methods. Mathematical Programming, 13:329-347, 1977.
[HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research (JMLR), volume 15, 2014.
[HLPR18] Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. CoRR, 2018.
[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV (4), Lecture Notes in Computer Science, pages 630-645. Springer, 2016.
[JKK+16] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging.
arXiv preprint arXiv:1610.03774, 2016.
[JKK+17a] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, and Aaron Sidford. A Markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). CoRR, 2017.
[JKK+17b] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017.
[JNN19] Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of SGD information theoretically optimal. CoRR, 2019.
[JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 26, 2013.
[KC78] Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
[KM19] Andrei Kulunchakov and Julien Mairal. A generic acceleration framework for stochastic composite optimization. CoRR, abs/1906.01164, 2019.
[KY03] Harold J. Kushner and George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, 2003.
[Lai03] Tze Leung Lai. Stochastic approximation: invited paper, 2003.
[LC98] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer, 1998.
[LJD+17] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765-6816, 2017.
[LJSB12] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. CoRR, 2012.
[LPW92] Lennart Ljung, Georg Pflug, and Harro Walk. Stochastic Approximation and Optimization of Random Systems.
Birkhäuser Verlag, Basel, Switzerland, 1992.
[Nes83] Yurii E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269, 1983.
[NN67] Jin-Ichi Nagumo and Atsuhiko Noda. A learning method for system identification. IEEE Transactions on Automatic Control, 1967.
[NR18] Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient descent. CoRR, 2018.
[NSW16] Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 2016.
[NY83] Arkadi S. Nemirovsky and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
[PJ92] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, volume 30, 1992.
[Pro74] John G. Proakis. Channel identification for high speed digital communications. IEEE Transactions on Automatic Control, 1974.
[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, vol. 22, 1951.
[RR11] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dynamics in convex programming. IEEE Transactions on Information Theory, 2011.
[RS90] Sumit Roy and John J. Shynk. Analysis of the momentum LMS algorithm. IEEE Transactions on Acoustics, Speech and Signal Processing, 1990.
[RSS12] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[Rup88] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech. Report, ORIE, Cornell University, 1988.
[Sha12] Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent?
In COLT, 2012.
[SSB98] Rajesh Sharma, William A. Sethares, and James A. Bucklew. Analysis of momentum adaptive filtering algorithms. IEEE Transactions on Signal Processing, 1998.
[SZ12] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. CoRR, abs/1212.1824, 2012.
[VdV00] Aad W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[WH60] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. Defense Technical Information Center, 1960.
[WS85] Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.
[XLY16] Yi Xu, Qihang Lin, and Tianbao Yang. Accelerate stochastic subgradient method by leveraging local error bound. CoRR, abs/1607.01027, 2016.
[YYJ18] Tianbao Yang, Yan Yan, Zhuoning Yuan, and Rong Jin. Why does stagewise training accelerate convergence of testing error over SGD? CoRR, abs/1812.03934, 2018.