{"title": "Stagewise Training Accelerates Convergence of Testing Error Over SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 2608, "page_last": 2618, "abstract": "Stagewise training strategy is widely used for learning neural networks, which runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka learning rate) and geometrically decreasing the step size after a number of iterations. It has been observed that the stagewise SGD has much faster convergence than the vanilla SGD with a polynomially decaying step size in terms of both training error and testing error. {\\it But how to explain this phenomenon has been largely ignored by existing studies.} This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise training strategy for minimizing empirical risk that satisfies the Polyak-\\L ojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of ``nice-behaved" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise training than the vanilla SGD under the PL condition on both training error and testing error. 
Experiments on stagewise learning of deep residual networks exhibit that it satisfies one type of non-convexity assumption and therefore can be explained by our theory.", "full_text": "Stagewise Training Accelerates Convergence of\n\nTesting Error Over SGD\n\nZhuoning Yuan\u2020, Yan Yan\u2020, Rong Jin\u2021, Tianbao Yang\u2020\n\n\u2020Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA\n\n\u2021Machine Intelligence Technology, Alibaba Group, Bellevue, WA 98004, USA\n\n{zhuoning-yuan, yan-yan-2, tianbao-yang}@uiowa.edu, jinrong.jr@alibaba-inc.com\n\nAbstract\n\nStagewise training strategy is widely used for learning neural networks, which runs\na stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka\nlearning rate) and geometrically decreasing the step size after a number of iterations.\nIt has been observed that the stagewise SGD has much faster convergence than\nthe vanilla SGD with a polynomially decaying step size in terms of both training\nerror and testing error. But how to explain this phenomenon has been largely\nignored by existing studies. This paper provides some theoretical evidence for\nexplaining this faster convergence. In particular, we consider a stagewise training\nstrategy for minimizing empirical risk that satisfies the Polyak-\u0141ojasiewicz (PL)\ncondition, which has been observed/proved for neural networks and also holds for\na broad family of convex functions. For convex loss functions and two classes\nof \u201cnice-behaved\" non-convex objectives that are close to a convex function, we\nestablish faster convergence of stagewise training than the vanilla SGD under the\nPL condition on both training error and testing error. 
Experiments on stagewise\nlearning of deep neural networks exhibit that it satisfies one type of non-convexity\nassumption and therefore can be explained by our theory.\n\n1 Introduction\n\nIn this paper, we consider learning a predictive model by using a stochastic algorithm to minimize\nthe expected risk via solving the following empirical risk problem:\n\nmin_{w∈Ω} F_S(w) := (1/n) ∑_{i=1}^n f(w, z_i),    (1)\n\nwhere f(w, z) is a smooth loss function of the model w on the data z, Ω is a closed convex set,\nand S = {z_1, . . . , z_n} denotes a set of n observed data points that are sampled from an underlying\ndistribution P_z with support on Z.\nThere are tremendous studies devoted to solving this empirical risk minimization (ERM) problem\nin machine learning and related fields. Among all existing algorithms, stochastic gradient descent\n(SGD) is probably the simplest and attracts the most attention, which takes the following update:\n\nw_{t+1} = Π_Ω[w_t − η_t ∇f(w_t, z_{i_t})],    (2)\n\nwhere i_t ∈ {1, . . . , n} is randomly sampled, Π_Ω is the projection operator, and η_t is the step size,\nwhich usually decreases to 0. Convergence theories have been extensively studied for SGD with a\npolynomially decaying step size (e.g., 1/t, 1/√t) for an objective that satisfies various assumptions,\ne.g., convexity [25], non-convexity [12], strong convexity [15], local strong convexity [27], the Polyak-Łojasiewicz inequality [18], the Kurdyka-Łojasiewicz inequality [30], etc. The list of papers about SGD\nis so long that it cannot be exhausted here.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe success of deep learning is mostly driven by stochastic algorithms as simple as SGD running on\nbig data sets [20, 17]. 
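As a concrete reference point, the vanilla SGD update (2) with a polynomially decaying step size can be sketched in a few lines. The toy least-squares loss, the ball constraint serving as Ω, and the Θ(1/t) schedule below are illustrative assumptions for the sketch, not choices made in the paper.

```python
import numpy as np

def project_ball(w, radius=10.0):
    """Projection Pi_Omega onto a Euclidean ball, a simple example of a closed convex Omega."""
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def sgd(X, y, T=2000, seed=0):
    """Vanilla SGD (2) with the polynomially decaying step size eta_t = 1/t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                  # sample i_t uniformly from {1, ..., n}
        grad = (X[i] @ w - y[i]) * X[i]      # gradient of f(w, z_i) = (x_i^T w - y_i)^2 / 2
        w = project_ball(w - grad / t)       # w_{t+1} = Pi_Omega[w_t - eta_t * grad]
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = X @ w_true                               # noiseless labels, so w_true minimizes F_S
w_hat = sgd(X, y)
```

On this well-conditioned toy problem the iterates approach the minimizer; the slow-convergence issues the paper studies arise when the PL constant µ is small.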
However, an interesting phenomenon that can be observed in practice for deep\nlearning is that no one is actually using the vanilla SGD with a polynomially decaying step size that is\nwell studied in theory for non-convex optimization [18, 12, 9]. Instead, a common trick used to speed\nup the convergence of SGD is to use a stagewise step size strategy, i.e., starting from a relatively\nlarge step size and decreasing it geometrically after a number of iterations [20, 17]. Not only is the\nconvergence of the training error accelerated, but so is the convergence of the testing error. However,\nthere is still a lack of theory for explaining this phenomenon. Although a stagewise step size strategy\nhas been considered in some studies [16, 30, 18, 19, 7], none of them explains the benefit of stagewise\ntraining used in practice compared with standard SGD with a decreasing step size, especially on the\nconvergence of testing error for non-convex problems.\nOur Contributions. This paper aims to provide some theoretical evidence to show that an appropriate\nstagewise training algorithm can have faster convergence than SGD with a polynomially decaying\nstep size under some conditions. In particular, we analyze a stagewise training algorithm under the\nPolyak-Łojasiewicz (PL) condition [26]:\n\n2µ(F_S(w) − min_{w∈Ω} F_S(w)) ≤ ‖∇F_S(w)‖²,\n\nwhere µ is a constant. This property has been recently observed/proved for learning deep and shallow\nneural networks [13, 29, 24, 31, 5], and it also holds for a broad family of convex functions [30].\nWe will focus on the scenario where µ is a small positive value and n is large, which corresponds to\nill-conditioned big data problems and is indeed the case for many problems [13, 5]. We compare\nwith two popular vanilla SGD variants with Θ(1/t) or Θ(1/√t) step size schemes for both convex\nlosses and two classes of non-convex objectives that are close to a convex function. 
We show that the\nconsidered stagewise training algorithm has a better dependence on µ than the vanilla SGD with\nthe Θ(1/t) step size scheme for both the training error (under the same number of iterations) and the\ntesting error (under the same number of data and fewer iterations), while keeping the same\ndependence on the number of data for the testing error bound. Additionally, it has faster convergence\nand a smaller testing error bound than the vanilla SGD with the Θ(1/√t) step size scheme for big data.\nTo be fair in comparing the two algorithms, we adopt a unified approach that considers\nboth the optimization error and the generalization error, which together with the algorithm-independent\noptimal empirical risk constitute the testing error. In addition, we use the same tool for the analysis of\nthe generalization error - a key component in the testing error. The techniques we use to prove the\nconvergence of the optimization error and testing error are simple and standard. It is of great interest to\nus that a simple analysis of the widely used learning strategy can possibly explain its greater success in\npractice than using the standard SGD method with a polynomially decaying step size.\nBesides theoretical contributions, the considered algorithm also has additional features that come with\ntheoretical guarantees for the considered non-convex problems and help improve the generalization\nperformance, including allowing for explicit algorithmic regularization at each stage, using an\naveraged solution for restarting, and returning the last stagewise solution as the final solution. It is\nalso notable that the widely used stagewise SGD is covered by the proposed framework. We refer to\nthe considered algorithm as the stagewise regularized training algorithm, or START.\nOther closely related works. 
It is notable that many papers have proposed and analyzed deterministic/stochastic optimization algorithms under the PL condition, e.g., [18, 23, 28, 2]. This list could be\nlong if we consider its equivalent condition in the convex case. However, none of them exhibits the\nbenefit of the stagewise learning strategy used in practice. One may also notice that linear convergence\nof the optimization error was proved for a stochastic variance reduced gradient method [28]. Nevertheless, its uniform stability bound remains unclear for making a fair comparison with the considered\nalgorithms in this paper, and variance reduction methods are not widely used for deep learning.\nWe also notice that some recent studies [21, 32, 5] have used other techniques (e.g., data-dependent\nbounds, average stability, point-wise stability) to analyze the generalization error of a stochastic\nalgorithm. Nevertheless, we believe similar techniques can also be used for analyzing stagewise\nlearning algorithms, which is beyond the scope of this paper. The generalization error results in [32, 5]\nunder the PL condition are not directly comparable to ours because they have stronger assumptions (e.g.,\nthe global minimizer is unique for deriving uniform stability [5], the condition number is small [32]).\nFinally, it was brought to our attention, after a preliminary version of this paper was completed, that an\nindependent work [11] observes a similar advantage of stagewise SGD over SGD with a polynomially\ndecaying step size, namely a better dependence on the condition number. However, they only\nanalyze the strongly convex quadratic case and the training error of ERM.\n2 Preliminaries and Notations\nLet A denote a randomized algorithm, which returns a randomized solution w_S = A(S) based on\nthe given data set S. Denote by E_A the expectation over the randomness in the algorithm and by E_S the\nexpectation over the randomness in the data set. 
When it is clear from the context, we will omit the\nsubscripts S and A in the expectation notations. Let w*_S ∈ arg min_{w∈Ω} F_S(w) denote an empirical\nrisk minimizer, and F(w) = E_z[f(w, z)] denote the true risk of w (also called the testing error in this\npaper). We use ‖·‖ to denote the Euclidean norm, and use [n] = {1, . . . , n}.\nIn order to analyze the testing error convergence of a random solution, we use the following decomposition of the testing error:\n\nE_{A,S}[F(w_S)] = E_S[F_S(w*_S)] + E_S E_A[F_S(w_S) − F_S(w*_S)] + E_{A,S}[F(w_S) − F_S(w_S)],\n\nwhere the second term ε_opt := E_S E_A[F_S(w_S) − F_S(w*_S)] measures the optimization error, i.e., the difference between the empirical risk (also\ncalled the training error) of the returned solution w_S and the optimal value of the empirical risk, and the third term\nε_gen := E_{A,S}[F(w_S) − F_S(w_S)] measures the generalization error, i.e., the difference between the true risk of the returned solution\nand the empirical risk of the returned solution. The difference E_{A,S}[F(w_S)] − E_S[F_S(w*_S)] is an\nupper bound of the so-called excess risk in the literature, which is defined as E_{A,S}[F(w_S)] −\nmin_{w∈Ω} F(w). It is notable that the first term E_S[F_S(w*_S)] in the above bound is independent\nof the choice of randomized algorithm. Hence, in order to compare the performance of different\nrandomized algorithms, we can focus on analyzing ε_opt and ε_gen. For analyzing the generalization\nerror, we will leverage the uniform stability tool [4]. 
A randomized algorithm A is called ε-uniformly\nstable if for all data sets S, S′ ∈ Z^n that differ in at most one example the following holds:\n\nε_stab := sup_z E_A[f(A(S), z) − f(A(S′), z)] ≤ ε.\n\nA well-known result is that if A is ε-uniformly stable, then its generalization error is bounded by ε [4],\ni.e., if A is ε-uniformly stable, we have ε_gen ≤ ε. In light of the above discussion, in order to compare\nthe convergence of the testing error of different randomized algorithms, it suffices to analyze their\nconvergence in terms of optimization error and their uniform stability. We would like to emphasize\nthat the PL condition is not used in our generalization error analysis by uniform stability, which\nmakes our comparison to the results in [14] fair.\nA function f(w) is L-smooth if it is differentiable and its gradient is L-Lipschitz continuous, i.e.,\n‖∇f(w) − ∇f(u)‖ ≤ L‖w − u‖, ∀w, u ∈ Ω. A function f(w) is G-Lipschitz continuous if\n‖∇f(w)‖ ≤ G, ∀w ∈ Ω. We summarize the assumptions used below with some positive L, σ, G, µ\nand ε_0.\nAssumption 1. 
Assume that\n\n(i) f(w, z) is L-smooth in terms of w ∈ Ω for every z ∈ Z.\n(ii) f(w, z) is finite-valued and G-Lipschitz continuous in terms of w ∈ Ω for every z ∈ Z.\n(iii) there exists σ such that E_i[‖∇f(w, z_i) − ∇F_S(w)‖²] ≤ σ² for w ∈ Ω.\n(iv) F_S(w) satisfies the PL condition for any S of size n, i.e., there exists µ > 0 such that\n2µ(F_S(w) − F_S(w*_S)) ≤ ‖∇F_S(w)‖², ∀w ∈ Ω.\n(v) For an initial solution w_0 ∈ Ω, there exists ε_0 such that F_S(w_0) − F_S(w*_S) ≤ ε_0.\n\nRemark 1: The second assumption is imposed for the analysis of uniform stability of a randomized\nalgorithm. W.l.o.g. we assume |f(w, z)| ≤ 1, ∀w ∈ Ω. The third assumption is for the purpose of\nanalyzing the optimization error. It is notable that σ² ≤ 4G². For simplicity, we assume the PL condition\nof F_S(w) holds uniformly over S. It is known that the PL condition is much weaker than strong\nconvexity. If F_S is strongly convex, µ corresponds to the strong convexity parameter. In this paper,\nwe are particularly interested in the case when µ is small, i.e., the condition number L/µ is large.\nRemark 2: It is worth mentioning that we do not assume the PL condition holds in the whole space\nR^d. Hence, our analysis presented below can capture cases where the PL condition only holds in\na local region Ω that contains a global minimum. 
For example, recent papers [10, 29, 1] show\nthat the global minimum of learning two-layer and deep overparameterized neural networks resides\nin a ball centered around a random initial solution and the PL condition holds in that ball.\n\n3 Review: SGD under PL Condition\nIn this section, we review the training error convergence and the generalization error of SGD with a\ndecreasing step size for functions satisfying the PL condition, in order to derive its testing error bound.\nWe will focus on SGD using the step size Θ(1/t) and briefly mention the results corresponding to\nΘ(1/√t) at the end of this section. We would like to emphasize that the results presented in this section\nare mainly from existing works [18, 14]. The optimization error and the uniform stability of SGD\nhave been studied in these two papers separately. Since we are not aware of any studies that piece\nthem together, it is of our interest to summarize these results here for comparison with our new results\nestablished later in this paper. Let us first consider the optimization error convergence, which has\nbeen analyzed in [18] and is summarized below.\nTheorem 1. [18] Suppose Ω = R^d. Under Assumption 1 (i), (iv) and E_i[‖∇f(w, z_i)‖²] ≤ G², by\nsetting η_t = (2t+1)/(2µ(t+1)²) in the update of SGD (2), we have\n\nE[F_S(w_T) − F_S(w*_S)] ≤ LG²/(2Tµ²),    (3)\n\nand by setting η_t = η, we have E[F_S(w_T) − F_S(w*_S)] ≤ (1 − 2ηµ)^T (F_S(w_0) − F_S(w*_S)) + ηLG²/(4µ).\n\nRemark 3: In order to have an ε optimization error, one can set T = LG²/(2µ²ε) in the decreasing step\nsize setting. In the constant step size setting, one can set η = 2µε/(LG²) and T = LG²/(4µ²ε) · log(2ε_0/ε), where\nε_0 ≥ F_S(w_0) − F_S(w*_S) is the initial optimization error bound. 
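To make Theorem 1's decreasing-step schedule concrete, the sketch below runs SGD with η_t = (2t+1)/(2µ(t+1)²) on a separable quadratic that satisfies the PL condition with µ equal to its smallest eigenvalue. The problem instance, noise level σ, and horizon T are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sgd_pl_quadratic(lam, mu, sigma=0.1, T=20000, seed=0):
    """SGD with eta_t = (2t+1)/(2*mu*(t+1)^2) on F_S(w) = 0.5 * sum_j lam_j * w_j^2.

    This F_S satisfies the PL condition with constant mu = min(lam); additive
    Gaussian noise on the gradient stands in for sampling a single example.
    Returns the final optimization error F_S(w_T) - F_S(w*), with F_S(w*) = 0."""
    rng = np.random.default_rng(seed)
    w = np.ones_like(lam)
    for t in range(1, T + 1):
        eta = (2 * t + 1) / (2 * mu * (t + 1) ** 2)
        g = lam * w + sigma * rng.standard_normal(w.shape)
        w = w - eta * g
    return 0.5 * np.sum(lam * w * w)

lam = np.array([0.05, 0.5, 1.0])   # ill-conditioned instance: mu = 0.05, L = 1
err = sgd_pl_quadratic(lam, mu=0.05)
```

Consistent with the LG²/(2Tµ²) bound, the error decays only at a 1/(Tµ²) rate in µ, which is the dependence the stagewise analysis later improves.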
[18] also mentioned a stagewise\nstep size strategy based on the second result above. By starting with η_1 = ε_0µ/(LG²) and running for\nt_1 = LG² log 4/(2µ²ε_0) iterations, and restarting the second stage with η_2 = η_1/2 and t_2 = 2t_1, then after\nK = log(ε_0/ε) stages, we have optimization error less than ε, and the total iteration complexity is\nO(LG² log 4/(µ²ε)). We can see that the analysis of [18] cannot explain why the stagewise optimization strategy\nbrings any improvement compared with SGD with a decreasing step size of O(1/t). No matter which\nstep size strategy is used among the ones discussed above, the total iteration complexity is O(L/(µ²ε)).\nIt is also interesting to know that the above convergence result does not require the convexity of\nf(w, z). On the other hand, it is unclear how to directly analyze SGD with a polynomially decaying\nstep size for a convex loss to obtain a better convergence rate than (3).\nThe generalization error bounds by uniform stability for both convex and non-convex losses have been\nanalyzed in [14]. We just need to plug the step size of SGD in Theorem 1 into their results (Theorem\n3.7 and Theorem 3.8) to prove the uniform stability. For the sake of space, we summarize the uniform\nstability results in Theorem 11 in the supplement. Combining the optimization error and uniform\nstability, we obtain the convergence of the testing error of SGD for smooth loss functions under the PL\ncondition. By optimizing the value of T in the bounds, we obtain the following testing error bound\ndependent on n only.\nTheorem 2. 
Suppose Assumption 1 holds. If f(·, z) is convex for any z ∈ Z, with step size\nη_t = (2t+1)/(2µ(t+1)²) and T = nLG²/(4(L+2G²)µ) iterations, SGD returns a solution w_T satisfying\n\nE_{A,S}[F(w_T)] ≤ E_S[F_S(w*_S)] + 2(L + 2G²)/(nµ) + (L + 2G²) log(T + 1)/(nµ).\n\nIf f(·, z) is non-convex for any z ∈ Z, with step size η_t = (2t+1)/(2µ(t+1)²) and T =\nmax{√((n−1)LG)/(√8 µ^{3/4}), √((n−1)L)G/(2µ e^{2Ĝ})} iterations, SGD returns a solution w_T satisfying\n\nE_{A,S}[F(w_T)] ≤ E_S[F_S(w*_S)] + 2 min{√2 L^{1/2}G^{3/2}/(√(n−1) µ^{5/4}), √L e^{2Ĝ}G/(√(n−1) µ)}.\n\nRemark 4: If the loss is convex, the excess risk bound is in the order of O(L log(nL/µ)/(nµ)) by running\nSGD with T = O(nL/µ) iterations. It is notable that an O(1/n) excess risk bound is called the fast\nrate in the literature. If the loss is non-convex and 2G/√µ > e^{2Ĝ} (an interesting case¹), the excess\nrisk bound is in the order of O(√L/(√n µ)) by running SGD with T = O(√(nL)/µ) iterations. When µ is\nvery small, the convergence of testing error is very slow. In addition, the number of iterations is also\nscaled by 1/µ for achieving a minimal excess risk bound.\nRemark 5: Another possible choice of decreasing step size is O(1/√t), which yields an O(1/√T)\nconvergence rate for F_S(ŵ_T) − F_S(w*_S) in the convex case [25] or for ‖∇F_S(w_t)‖² in the non-convex case with a randomly sampled t [12]. In the latter case, it also implies a worse convergence\nrate of O(1/(µ√T)) for the optimization error F_S(w_t) − min_w F_S(w) under the PL condition².\nRegarding the uniform stability, the step size of O(1/√t) will also yield a worse growth rate in terms\nof T [14]. For example, if the loss function is convex, the generalization error by uniform stability\nscales as O(√T/n) and hence the testing error bound is in the order of O(1/√(nµ)), which is worse\nthan the above testing error bound Õ(1/(nµ)) for the big data setting µ ≥ Ω(1/n). Hence, below\nwe will focus on the comparison with the theoretical results in Theorem 2.\n\n¹We can scale up L such that e^{2Ĝ} is a small constant, which only scales up the bound by a constant factor.\n\nAlgorithm 1 START Algorithm: START(F_S, w_0, γ, K)\n1: Input: w_0, γ and K\n2: for k = 1, . . . , K do\n3:   Let F^γ_{w_{k−1}}(w) = F_S(w) + (1/2γ)‖w − w_{k−1}‖²\n4:   w_k = SGD(F^γ_{w_{k−1}}, w_{k−1}, η_k, T_k)\n5: end for\n6: Return: w_K\n\nAlgorithm 2 SGD(F^γ_{w_1}, w_1, η, T)\n1: for t = 1, . . . , T do\n2:   Sample a random data point z_{i_t} ∈ S\n3:   w_{t+1} = arg min_{w∈Ω} ∇f(w_t, z_{i_t})⊤w + (1/2η)‖w − w_t‖² + (1/2γ)‖w − w_1‖²\n4: end for\n5: Output: ŵ_T = O(w_1, . . . , w_{T+1})\n\n4 START for a Convex Function\nFirst, let us present the algorithm that we intend to analyze in Algorithm 1. At the k-th stage, a\nregularized function F^γ_{w_{k−1}}(w) is constructed that consists of the original objective F_S(w) and a\nquadratic regularizer (1/2γ)‖w − w_{k−1}‖². The reference point w_{k−1} is the returned solution from the\nprevious stage, which is also used as the initial solution for the current stage. Adding the strongly\nconvex regularizer at each stage is not essential but could be helpful for reducing the generalization\nerror, and it is also important for one class of non-convex losses considered in the next section. 
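Algorithms 1 and 2 can be sketched in a few lines for the unconstrained case Ω = R^d, where step 3 of Algorithm 2 is a quadratic minimization with a closed form. The halving/doubling schedule and the toy least-squares problem in the usage example are illustrative assumptions, chosen to mirror the theory's pattern (η_k ∝ ε_k, T_k ∝ 1/ε_k), not the paper's exact parameter settings.

```python
import numpy as np

def sgd_stage(stoch_grad, w1, eta, gamma, T, average=True, rng=None):
    """Algorithm 2 on F^gamma_{w1}(w) = F_S(w) + ||w - w1||^2 / (2*gamma), Omega = R^d.

    Step 3 minimizes a quadratic in w, which gives the closed-form update below."""
    rng = np.random.default_rng() if rng is None else rng
    w, acc = w1.copy(), np.zeros_like(w1)
    for _ in range(T):
        g = stoch_grad(w, rng)                                   # grad f(w_t, z_{i_t})
        w = (w / eta + w1 / gamma - g) / (1.0 / eta + 1.0 / gamma)
        acc += w
    return acc / T if average else w                             # O(w_1, ..., w_{T+1})

def start(stoch_grad, w0, eta1, gamma, T1, K, seed=0):
    """Algorithm 1: restart from w_{k-1}, halve the step size, double the budget."""
    rng = np.random.default_rng(seed)
    w, eta, T = w0.copy(), eta1, T1
    for _ in range(K):
        w = sgd_stage(stoch_grad, w, eta, gamma, T, rng=rng)
        eta, T = eta / 2.0, 2 * T
    return w

# Usage on a toy least-squares empirical risk (illustrative, not from the paper):
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_star = np.ones(3)
y = X @ w_star
def stoch_grad(w, r):
    i = r.integers(100)
    return (X[i] @ w - y[i]) * X[i]
w_out = start(stoch_grad, np.zeros(3), eta1=0.1, gamma=100.0, T1=200, K=5)
```

Setting gamma to a very large value recovers the widely used stagewise SGD (γ = ∞), and `average=False` returns the last iterate of each stage instead of the averaged one.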
For each\nregularized problem, SGD with a constant step size is employed for a number of iterations with an\nappropriately returned solution. We will specify the value of the step size, the number of iterations and the\nreturned solution for each class of problems separately. Note that the widely used stagewise SGD\nfalls into the framework of START when γ = ∞ and O(w_1, . . . , w_{T+1}) = w_{T+1}.\nIn this section, we will analyze the START algorithm for a convex function under the PL condition. We\nwould like to point out that similar algorithms have been proposed and analyzed in [16, 30] for convex\nproblems. They focus on analyzing the convergence of the optimization error for convex problems under\na quadratic growth condition or a more general local error bound condition. In the following, we will\nshow that the PL condition implies a quadratic growth condition. Hence, their algorithms can be used\nfor optimizing F_S as well, enjoying a similar convergence rate in terms of optimization error. However,\nthere are still slight differences between the analyzed algorithm and their algorithms. In\nparticular, the regularization term (1/2γ)‖w − w_{k−1}‖² is absent in [16], which corresponds to γ = ∞\nin our case. However, adding a small regularization (with a not too large γ) can possibly help reduce\nthe generalization error. In addition, their initial step size is scaled by 1/µ; the initial step size of\nour algorithm depends on the quality of the initial solution, which seems more natural and practical. A\nsimilar regularization at each stage is also used in [30]. But their algorithm will suffer from a large\ngeneralization error, which is due to the key difference between START and their algorithm (ASSG-r):\nin particular, they geometrically decrease the parameter γ_k starting from a relatively large\nvalue in the order of O(1/(µε)), with a total number of iterations T = O(1/(µε)). 
According to our\nanalysis of the generalization error (see Theorem 4), their algorithm has a generalization error in the\norder of O(T/n), in contrast to O(log T/n) for our algorithm.\nBelow, we summarize the convergence of the optimization error in Theorem 3 and the generalization error in\nTheorem 4. We need the following lemma for the optimization error analysis.\n\n²Note that here µ is not required for running the algorithm.\n\nLemma 1. If F_S(w) satisfies the PL condition, then for any w ∈ Ω we have\n\n‖w − w*_S‖² ≤ (1/(2µ))(F_S(w) − F_S(w*_S)),    (4)\n\nwhere w*_S is the closest optimal solution to w.\nRemark 6: The above result does not require the convexity of F_S. For a proof, please refer to [3, 18].\nIndeed, this error bound condition, instead of the PL condition, is enough to derive the results in\nSection 4 and Section 5.\nBelow, we let w^k_t denote the solution computed during the k-th stage at the t-th iteration.\nTheorem 3. (Optimization Error) Suppose Assumption 1 holds, and f(w, z) is a convex function of\nw. Then by setting γ ≥ 1.5/µ, T_k = 9σ²/(2µε_k α), η_k = ε_k α/(3σ²), and O(w^k_1, . . . , w^k_{T_k+1}) = (1/T_k) ∑_{t=1}^{T_k} w^k_{t+1},\nwhere ε_k = ε_0/2^k and α ≤ min(1, 3σ²/(ε_0 L)), after K = log(ε_0/ε) stages with a total iteration complexity of\nO(L/(µε)) we have E[F_S(w_K) − F_S(w*_S)] ≤ ε.\nRemark 7: Compared to the result in Theorem 1, the convergence rate of START is faster by a factor\nof O(1/µ). It is also notable that γ can be as large as ∞ in the convex case.\nBy showing sup_z E_A[f(w_K, z) − f(w′_K, z)] ≤ ε, we can show the generalization error is bounded\nby ε, where w_K is learned on a data set S and w′_K is learned on a different data set S′ that differs\nfrom S in at most one example. Our analysis closely follows the route in [14]. 
The difference is\nthat we have to consider the difference in the reference points w_{k−1} between two copies of our algorithm\nrun on the two data sets S, S′.\nTheorem 4. (Uniform Stability) After K stages, START satisfies uniform stability with\n\nε_stab ≤ (2γG²/n) ∑_{k=1}^K (1 − (γ/(η_k + γ))^{T_k}) ≤ (2G²/n) ∑_{k=1}^K η_k T_k  if γ < ∞,  and  ε_stab ≤ (2G²/n) ∑_{k=1}^K η_k T_k  otherwise.\n\nPutting Them Together. Finally, we have the following testing error bound of w_K returned by START.\nTheorem 5. (Testing Error) After K = log(ε_0/ε) stages with a total number of iterations T = 18σ²/(αµε),\nthe testing error of w_K is bounded by\n\nE_{A,S}[F(w_K)] ≤ E[F_S(w*_S)] + ε + 3G² log(ε_0/ε)/(nµ).\n\nRemark 8: Letting ε = 1/(nµ), the excess risk bound becomes O(log(nµ)/(nµ)) and the total iteration\ncomplexity is T = O(nL). This improves the convergence of the testing error of SGD stated in\nTheorem 2 for the convex case when µ ≪ 1, which needs T = O(nL/µ) iterations and has a testing\nerror bound of O(L log(nL/µ)/(nµ)).\n5 START for Non-Convex Functions\nNext, we will establish faster convergence of START than SGD for \u201cnice-behaved\" non-convex\nfunctions. In particular, we will consider two classes of non-convex functions that are close to\na convex function, namely one-point weakly quasi-convex and weakly convex functions. We first\nintroduce the definitions of these functions, followed by some discussion.\nDefinition 1 (One-point Weakly Quasi-Convex). 
A non-convex function F is called one-point θ-weakly quasi-convex for θ > 0 if there exists a global minimum w* such that\n\n∇F(w)⊤(w − w*) ≥ θ(F(w) − F(w*)), ∀w ∈ Ω.    (5)\n\nDefinition 2 (Weakly Convex). A non-convex function F is ρ-weakly convex for ρ > 0 if F(w) +\n(ρ/2)‖w‖² is convex.\nIt is interesting to connect one-point weak quasi-convexity to one-point strong convexity, which has\nbeen considered for non-convex optimization, especially for optimizing neural networks [24, 19].\nDefinition 3 (One-point Strongly Convex). A non-convex function F is one-point strongly convex\nwith respect to a global minimum w* if there exists µ_1 > 0 such that\n\n∇F(w)⊤(w − w*) ≥ µ_1‖w − w*‖².\n\nThe following lemma shows that one-point strong convexity implies both the PL condition and\none-point weak quasi-convexity.\nLemma 2. Suppose F is L-smooth and one-point strongly convex w.r.t. w* with µ_1 > 0 and\n∇F(w*) = 0; then\n\nmin{‖∇F(w)‖², ∇F(w)⊤(w − w*)} ≥ (2µ_1/L)(F(w) − F(w*)).\n\nFor \u201cnice-behaved\" one-point weakly quasi-convex functions F_S(w) that satisfy the PL condition,\nwe are interested in the case that θ is a constant close to or larger than 1. Note that a convex\nfunction has θ = 1 and a strongly convex function has θ > 1. For the case of µ ≪ 1 in the PL\ncondition, this indicates that ∇F(w)⊤(w − w*) is larger than ‖∇F(w)‖², which further implies that\n‖w − w*‖ ≥ ‖∇F(w)‖. 
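Definition 1 can be checked numerically on simple examples. The sketch below estimates the largest admissible θ for the classic non-convex PL test function f(w) = w² + 3 sin²(w), with global minimum w* = 0; this test function is an illustrative assumption, not an example used in the paper.

```python
import numpy as np

# f(w) = w^2 + 3*sin(w)^2 is a standard non-convex function satisfying the PL
# condition; its global minimum is w* = 0 with f(w*) = 0. (An illustrative
# example for Definition 1, not one taken from the paper.)
f = lambda w: w ** 2 + 3.0 * np.sin(w) ** 2
grad_f = lambda w: 2.0 * w + 3.0 * np.sin(2.0 * w)

def estimate_theta(f, grad_f, w_star=0.0, lo=-5.0, hi=5.0, num=10001):
    """Largest theta such that grad_f(w) * (w - w_star) >= theta * (f(w) - f(w_star))
    holds on a finite grid (points where f(w) = f(w_star) are skipped)."""
    w = np.linspace(lo, hi, num)
    gap = f(w) - f(w_star)
    mask = gap > 1e-12                       # exclude the minimizer itself
    ratios = grad_f(w[mask]) * (w[mask] - w_star) / gap[mask]
    return ratios.min()

theta_hat = estimate_theta(f, grad_f)
```

A strictly positive estimate (roughly 0.5 on this grid) indicates the function is one-point weakly quasi-convex on [−5, 5] w.r.t. w* = 0; the same grid check applies to an approximate optimum w*_S, which is how the experiments later verify the assumption.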
Intuitively, this inequality (illustrated in Figure 2 in the appendix) also connects\nitself to the flat minima that have been observed in deep learning experiments [6]. For \u201cnice-behaved\"\nweakly convex functions, we are interested in the case that ρ ≤ µ/4 is close to zero. Weakly convex\nfunctions with a small ρ have been considered in the literature on non-convex optimization [22]. In\nboth cases, we will establish faster convergence of the optimization error and testing error of START.\nConvergence of Optimization Error. The approach of the optimization error analysis for the considered\nnon-convex functions is similar to that for convex functions. We also first analyze the convergence\nof SGD for each stage and then extend it to K stages for START. Due to the limit of space, we only\nsummarize the final convergence results here for the two classes of non-convex functions separately.\nTheorem 6. Suppose F_S(w) is one-point θ-weakly quasi-convex w.r.t. w*_S and (4) holds for the\nsame w*_S. Then by setting γ ≥ 1.5/(θµ), η_k = 2ε_kθ/(3G²), T_k = 9G²/(4µε_kθ²), and O(w^k_1, . . . , w^k_{T_k+1}) = w^k_τ,\nwhere τ ∈ {1, . . . , T_k} is randomly sampled, after K = log(ε_0/ε) stages we have E[F_S(w_K) −\nF_S(w*_S)] ≤ ε. The total iteration complexity is O(1/(θ²µε)).\nTheorem 7. Suppose Assumption 1 holds, and F_S(w) is ρ-weakly convex with ρ ≤ µ/4. Then by\nsetting γ = 4/µ ≤ 1/ρ, η_k = ε_kα/(4σ²) ≤ 1/L, T_k = 4σ²/(µε_kα), and O(w^k_1, . . . , w^k_{T_k+1}) = (1/T_k) ∑_{t=1}^{T_k} w^k_{t+1},\nwhere α ≤ min(1, 2σ²/(ε_0L)), after K = log(ε_0/ε) stages we have E[F_S(w_K) − F_S(w*_S)] ≤ ε. The\ntotal iteration complexity is O(1/(αµε)).\nRemark 9: Several differences are noticeable between the two classes of non-convex functions:\n(i) γ in the weakly quasi-convex case can be as large as ∞; in contrast, it is required to be smaller\nthan 1/ρ in the weakly convex case; (ii) the returned solution by SGD at the end of each stage is a\nrandomly selected solution in the weakly quasi-convex case and is an averaged solution in the weakly\nconvex case. 
Finally, we note that the total iteration complexity for both cases is O(1/\u00b5\u0001) under\n\u03b8 \u2248 1 and \u03c1 \u2264 O(\u00b5), which is better than O(1/\u00b52\u0001) of SGD as in Theorem 1.\nGeneralization Error. The analysis of generalization error follows similarly to the non-convex case\nin [14]. Please note that the quasi- and weak-convexity are not directly leveraged in the generalization\nerror analysis. The only thing that matters here is the value of step size. Hence, we present a uni\ufb01ed\nresult for the two cases below.\n\n\u00010L ), and after K = log(\u00010/\u0001) stages we have E[FS (wK) \u2212 FS (w\u2217\n\nTk+1) =(cid:80)Tk\n\n4\u03c32 \u2264 1/L, Tk = 4\u03c32\n\n\u00b5\u0001k\u03b1 and O(wk\n\n3G2 , Tk = 9G2\n\n1 , . . . , wk\n\n1 , . . . , wk\n\n\u03b82\u00b5\u0001 ).\n\n\u03b1\u00b5\u0001 ).\n\n\u03c4\n\nTheorem 8. Let SK\u22121 =(cid:80)K\u22121\n\n\u00b5\u0001 \u00af\u03b1 ) and \u03b7K \u2264 c/(\u00b5TK), where \u00af\u03b1 = \u03b82, c = 1.5/\u03b8 in\nthe one-point weakly quasi-convex case and \u00af\u03b1 = \u03b1, c = 1 in the weakly convex case. Then we have\n\nk=1 Tk = O( 1\n\n\u03b5stab \u2264 SK\u22121\nn\n\n+\n\n1 + \u00b5/(Lc)\n\nn \u2212 1\n\n(2G2c/\u00b5)1/(1+Lc/\u00b5)T\nLc/\u00b5+1\nk\n\nLc/\u00b5\n\nBy putting the optimization error and generalization error together, we have the following testing\nerror bound.\nTheorem 9. Under the same assumptions as in Theorem 6 or 7 and \u00b5 (cid:28) 1. After K = log(\u00010/\u0001)\nstages with a total number of iterations T = O( 1\n\n\u00af\u03b1\u00b5\u0001 ), the testing error bound of wK is\n\nEA,S [F (wK)] \u2264 E[FS (w\u2217\n\nS )] + \u0001 + O(\n\n1\n\n).\n\nn\u00af\u03b1\u00b5\u0001\n\n\u221a\nRemark 10: We are mostly interested in the case when \u03b8 is constant close to or larger than 1. By\n1\u221a\nsetting \u0001 = \u0398(1/\nn \u00af\u03b1\u00b5 ) under the total iteration\n\ncomplexity T = O((cid:112)n/(\u00af\u03b1\u00b5)). 
This improves the testing error bound of SGD stated in Theorem 2\n\nn\u00af\u03b1\u00b5), we have the excess risk bounded by O(\n\n\u221a\nfor the non-convex case when \u00b5 \u2264 \u00af\u03b1, which needs T = O(\n\u221a\nerror bound of O(1/(\n\nn\u00b5)).\n\nn/\u00b5) iterations and suffers a testing\n\n7\n\n\fFigure 1: From left to right: training, generalization and testing error, and verifying assumptions for\nstagewise learning of ResNets.\nFinally, it is worth noting that our analysis is applicable to an approximate optimal solution w\u2217\nlong as the inequality (4) and (5) hold for that particular w\u2217\nassumptions in numerical experiments.\n\nS as\nS. This fact is helpful for us to verify the\n\n\u221a\n\n6 Numerical Experiments\nWe focus experiments on non-convex deep learning, and include in the supplement some experimental\nresults of START for convex functions that satisfy the PL condition. The numerical experiments\nmainly serve two purposes: (i) verifying that using different algorithmic choices in practice (e.g,\nregularization, averaged solution) is consistent with the provided theory; (ii) verifying the assumptions\nmade for non-convex objectives in our analysis in order to support our theory.\nWe compare stagewise learning with different algorithmic choices against SGD using two polynomi-\nally decaying step sizes (i.e., O(1/t) and O(1/\nt)). For stagewise learning, we consider the widely\nused version that corresponds to START with \u03b3 = \u221e and the returned solution at each stage being\nthe last solution, which is denoted as stagewise SGD (V1). We also implement other two variants of\nSTART that solves a regularized function at each stage (corresponding to \u03b3 < \u221e) and uses the last\nsolution or the averaged solution for the returned solution at each stage. 
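To make the algorithmic choices concrete, the following is a minimal sketch of a stagewise regularized SGD loop in the spirit of START. This is our own illustration, not the authors' released code: the function name `start_sgd`, the gradient oracle `grad_est`, and the exact halving/doubling schedule are assumptions, chosen to be consistent with η_k ∝ ε_k and T_k ∝ 1/ε_k in Theorems 6 and 7.

```python
import numpy as np

def start_sgd(grad_est, w0, eta0, T0, K, gamma=np.inf, average=False):
    """Illustrative stagewise regularized training (sketch, not the paper's code).

    grad_est(w) returns a stochastic gradient of F_S at w. Each stage halves
    the step size and doubles the stage length, mimicking eta_k ~ eps_k and
    T_k ~ 1/eps_k with eps_k decreasing geometrically.
    """
    w_ref = np.asarray(w0, dtype=float)  # reference point of the current stage
    eta, T = eta0, T0
    for k in range(K):
        w = w_ref.copy()
        iterates = []
        for _ in range(T):
            g = grad_est(w)
            if np.isfinite(gamma):
                # gamma < inf (V2/V3): gradient of the added term
                # (1 / (2 * gamma)) * ||w - w_ref||^2
                g = g + (w - w_ref) / gamma
            w = w - eta * g
            iterates.append(w.copy())
        # Returned solution of the stage: averaged iterate (weakly convex
        # case / V3) or last iterate (V1/V2); Theorem 6 instead uses a
        # randomly sampled iterate.
        w_ref = np.mean(iterates, axis=0) if average else w
        eta, T = eta / 2.0, T * 2  # geometric step-size decay
    return w_ref
```

On a simple strongly convex quadratic with a noisy gradient oracle, the routine drives the iterate close to the minimizer within a few stages, which is the behavior the theorems formalize.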
We refer to these variants as stagewise SGD (V2) and (V3), respectively.
We conduct experiments on two datasets, CIFAR-10 and CIFAR-100, using different neural network structures, including residual networks and convolutional neural networks without skip connections. Two residual networks, ResNet20 and ResNet56 [17], are used for CIFAR-10 and CIFAR-100. For each network structure, we use two types of activation functions, namely ReLU and ELU (α = 1) [8]. ELU is smooth, which is consistent with our assumption. Although ReLU is non-smooth, we would like to show that the provided theory can also explain the good performance of stagewise SGD. For stagewise SGD on the CIFAR datasets, we use a stagewise step size strategy similar to that in [17], i.e., the step size is decreased by a factor of 10 at 40k and 60k iterations. For all algorithms, we select the best initial step size from 10⁻³ ∼ 10³ and the best regularization parameter 1/γ of stagewise SGD (V2, V3) from 0.0001 ∼ 0.1 by cross-validation based on performance on validation data. Due to the space limit, we only report the results for ResNet56 with ELU and no weight decay; results for other settings are included in the supplement.
The training error, generalization error and testing error are shown in Figure 1. We can see that SGD with a decreasing step size converges slowly, especially SGD with a step size proportional to 1/t. This is because the initial step size of SGD (c/t) is selected as a small value less than 1; we observe that using a large initial step size does not lead to convergence. In terms of the different algorithmic choices of START, we can see that using an explicit regularization as in V2 and V3 helps reduce the generalization error, which is consistent with the theory, but also slows down training a little. Using an averaged solution as the returned solution in V3 can further reduce the generalization error but also further slows down the
training. Overall, stagewise SGD (V2) achieves the best tradeoff between training error convergence and generalization error, which leads to the best testing error.
Finally, we verify the assumptions about the non-convexity made in Section 5. To this end, at a selected w_t we compute the value of θ, i.e., the ratio of ∇F_S(w_t)⊤(w_t − w∗_S) to F_S(w_t) − F_S(w∗_S) as in (5), and the value of µ, i.e., the ratio of F_S(w_t) − F_S(w∗_S) to 2‖w_t − w∗_S‖² as in (4). For w∗_S, we use the solution found by stagewise SGD (V1) after a large number of iterations (200k), which gives a small objective value close to zero.
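These two ratios are straightforward to compute for any differentiable objective. The snippet below is our own toy illustration on a convex quadratic (not the paper's ResNet experiment); the function name `theta_mu_ratios` and the example matrix are assumptions made for the illustration.

```python
import numpy as np

def theta_mu_ratios(F, gradF, w, w_star):
    """Empirical theta and mu as measured in the experiments:
    theta = <gradF(w), w - w*> / (F(w) - F(w*))        (cf. (5))
    mu    = (F(w) - F(w*)) / (2 * ||w - w*||^2)        (cf. (4))
    """
    gap = F(w) - F(w_star)
    theta = gradF(w) @ (w - w_star) / gap
    mu = gap / (2.0 * np.linalg.norm(w - w_star) ** 2)
    return theta, mu

# Toy check on F(w) = 0.5 * w^T A w with A positive definite, so w* = 0.
A = np.diag([1.0, 4.0])
F = lambda w: 0.5 * w @ A @ w
gradF = lambda w: A @ w
theta, mu = theta_mu_ratios(F, gradF, np.array([1.0, 1.0]), np.zeros(2))
# For any quadratic of this form, theta = 2 exactly, consistent with theta > 1
# for (one-point) convex-like objectives.
```

In practice w∗_S is unknown, which is why the experiments substitute a near-optimal solution with objective value close to zero, as described above.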
We select 200 points during the course of training by stagewise SGD (V1) across all stages, and plot the curves of the values of θ and µ, averaged over 5 trials, in the rightmost panel of Figure 1. We can clearly see that our assumptions that µ ≪ 1 and that one-point weak quasi-convexity holds with θ > 1 are satisfied. Hence, the provided theory for stagewise learning is applicable. We also compute the minimum eigenvalue of the Hessian at several selected solutions by the Lanczos method to verify the weak convexity assumption; the Hessian-vector product is approximated by finite differences of gradients. The negative of the minimum eigenvalue (i.e., ρ) is marked by a distinct marker in the same panel as θ and µ. We can see that the assumption ρ ≤ O(µ) does not seem to hold for learning deep neural networks.

7 Conclusion

In this paper, we have analyzed the convergence of the training error and testing error of a stagewise regularized training algorithm for solving empirical risk minimization under the Polyak-Łojasiewicz condition. We give the first theory justifying why the widely used stagewise step size scheme gives faster convergence than a polynomially decreasing step size. Our numerical experiments on deep learning verify that one class of non-convexity assumptions holds, and hence the provided theory of faster convergence applies. In particular, our generalization error bound analysis is based on the nice-behaved properties of non-convex functions through uniform stability, which is designed for general non-convex problems. In future work, we will consider improving the generalization error bound under the specific conditions.

References
[1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. CoRR, abs/1811.03962, 2018.

[2] R. Bassily, M. Belkin, and S. Ma.
On exponential convergence of SGD in non-convex over-parametrized learning. CoRR, abs/1811.02564, 2018.

[3] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.

[4] O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, Mar. 2002.

[5] Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 745–754, 2018.

[6] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. CoRR, abs/1611.01838, 2016.

[7] Z. Chen, Z. Yuan, J. Yi, B. Zhou, E. Chen, and T. Yang. Universal stagewise learning for non-convex problems with convergence on averaged solutions. In International Conference on Learning Representations, 2019.

[8] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015.

[9] D. Davis and D. Drusvyatskiy. Stochastic subgradient method converges at the rate O(k^{−1/4}) on weakly convex functions. CoRR, abs/1802.02988, 2018.

[10] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

[11] R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli. Rethinking learning rate schedules for stochastic optimization. Submitted to International Conference on Learning Representations, 2019. Under review.

[12] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[13] M.
Hardt and T. Ma. Identity matters in deep learning. CoRR, abs/1611.04231, 2016.

[14] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1225–1234, 2016.

[15] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[16] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pages 421–436, 2011.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.

[18] H. Karimi, J. Nutini, and M. W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML-PKDD), pages 795–811, 2016.

[19] B. Kleinberg, Y. Li, and Y. Yuan. An alternative view: When does SGD escape local minima? In Proceedings of the 35th International Conference on Machine Learning, pages 2698–2707, 2018.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.

[21] I. Kuzborskij and C. H. Lampert. Data-dependent stability of stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of JMLR Workshop and Conference Proceedings, pages 2820–2829. JMLR.org, 2018.

[22] G. Lan and Y. Yang. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/1805.05411, 2018.

[23] L. Lei, C. Ju, J.
Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30 (NIPS), pages 2345–2355, 2017.

[24] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems 30 (NIPS), pages 597–607, 2017.

[25] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

[26] B. T. Polyak. Gradient methods for minimizing functionals. Zh. Vychisl. Mat. Mat. Fiz., 3(4):864–878, 1963.

[27] C. Qu, H. Xu, and C. Ong. Fast rate analysis of some stochastic optimization algorithms. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 662–670, New York, New York, USA, 20–22 Jun 2016. PMLR.

[28] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning (ICML), volume 48, pages 314–323, 2016.

[29] B. Xie, Y. Liang, and L. Song. Diversity leads to generalization in neural networks. CoRR, abs/1611.03131, 2016.

[30] Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3821–3830, 2017.

[31] Y. Zhou and Y. Liang. Characterization of gradient dominance and regularity conditions for neural networks. CoRR, abs/1710.06910, 2017.

[32] Y. Zhou, Y. Liang, and H. Zhang. Generalization error bounds with probabilistic guarantee for SGD in nonconvex optimization.
CoRR, abs/1802.06903, 2018.