{"title": "Adaptive SVRG Methods under Error Bound Conditions with Unknown Growth Parameter", "book": "Advances in Neural Information Processing Systems", "page_first": 3277, "page_last": 3287, "abstract": "Error bound, an inherent property of an optimization problem, has recently been revived in the development of algorithms with improved global convergence without strong convexity. The most studied error bound is the quadratic error bound, which generalizes strong convexity and is satisfied by a large family of machine learning problems. Quadratic error bounds have been leveraged to achieve linear convergence in many first-order methods, including the stochastic variance reduced gradient (SVRG) method, which is one of the most important stochastic optimization methods in machine learning. However, the studies along this direction face the critical issue that the algorithms must depend on an unknown growth parameter (a generalization of the strong convexity modulus) in the error bound. This parameter is difficult to estimate exactly, and algorithms that choose this parameter heuristically have no theoretical convergence guarantee. To address this issue, we propose novel SVRG methods that automatically search for this unknown parameter on the fly during optimization while still obtaining almost the same convergence rate as when this parameter is known.
We also analyze the convergence properties of SVRG methods under the H\\\"{o}lderian error bound, which generalizes the quadratic error bound.", "full_text": "Adaptive SVRG Methods under Error Bound Conditions with Unknown Growth Parameter

Yi Xu†, Qihang Lin‡, Tianbao Yang†
†Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA
‡Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA
{yi-xu, qihang-lin, tianbao-yang}@uiowa.edu

Abstract

Error bound, an inherent property of an optimization problem, has recently been revived in the development of algorithms with improved global convergence without strong convexity. The most studied error bound is the quadratic error bound, which generalizes strong convexity and is satisfied by a large family of machine learning problems. Quadratic error bounds have been leveraged to achieve linear convergence in many first-order methods, including the stochastic variance reduced gradient (SVRG) method, which is one of the most important stochastic optimization methods in machine learning. However, the studies along this direction face the critical issue that the algorithms must depend on an unknown growth parameter (a generalization of the strong convexity modulus) in the error bound. This parameter is difficult to estimate exactly, and algorithms that choose this parameter heuristically have no theoretical convergence guarantee. To address this issue, we propose novel SVRG methods that automatically search for this unknown parameter on the fly during optimization while still obtaining almost the same convergence rate as when this parameter is known.
We also analyze the convergence properties of SVRG methods under the Hölderian error bound, which generalizes the quadratic error bound.

1 Introduction

Finite-sum optimization problems have broad applications in machine learning, including regression by minimizing (regularized) empirical square losses and classification by minimizing (regularized) empirical logistic losses. In this paper, we consider the following finite-sum problem:

min_{x∈Ω} F(x) := (1/n) Σ_{i=1}^n f_i(x) + Ψ(x),   (1)

where f_i(x) is a continuously differentiable convex function whose gradient is Lipschitz continuous and Ψ(x) is a proper, lower-semicontinuous convex function [24]. Traditional proximal gradient (PG) methods or accelerated proximal gradient (APG) methods for solving (1) become prohibitively expensive when the number of components n is very large, which has spurred many studies on developing stochastic optimization algorithms with fast convergence [4, 8, 25, 1].

An important milestone among several others is the stochastic variance reduced gradient (SVRG) method [8] and its proximal variant [26]. Under strong convexity of the objective function F(x), linear convergence of SVRG and its proximal variant has been established. Many variations of SVRG have also been proposed [2, 1]. However, the key assumption of strong convexity limits the power of SVRG for many interesting problems in machine learning without strong convexity. For example, in regression with high-dimensional data one is usually interested in solving least-squares regression with an ℓ1-norm regularization or constraint (known as the LASSO-type problem). A common practice for solving non-strongly convex finite-sum problems by SVRG is to add a small strongly convex regularizer (e.g., (λ/2)‖x‖_2^2) [26].
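To make the variance-reduction idea concrete, below is a minimal sketch (an illustration, not the authors' implementation) of one SVRG epoch for problem (1) with f_i a square loss and Ψ the ℓ1 norm; the function names and data layout here are assumptions made for this example.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def svrg_epoch(A, b, x_ref, eta, lam, T, rng):
    """One SVRG epoch for min (1/2n)||Ax - b||^2 + lam * ||x||_1.

    x_ref is the reference point at which the full gradient is computed once;
    each inner step uses the variance-reduced gradient
        g_t = grad_i(x) - grad_i(x_ref) + full_grad(x_ref).
    """
    n = A.shape[0]
    full_grad = A.T @ (A @ x_ref - b) / n       # full gradient at the reference point
    x = x_ref.copy()
    iterates = []
    for _ in range(T):
        i = rng.integers(n)
        gi_x = A[i] * (A[i] @ x - b[i])         # stochastic gradient at current x
        gi_ref = A[i] * (A[i] @ x_ref - b[i])   # stochastic gradient at reference
        g = gi_x - gi_ref + full_grad           # variance-reduced gradient
        x = soft_threshold(x - eta * g, eta * lam)  # proximal step for the l1 term
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)            # averaged iterate, as in Algorithm 1
```

Calling `svrg_epoch` repeatedly, with the returned average as the next reference point, gives the outer loop of an SVRG method.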
Recently, a variant of SVRG (named SVRG++ [2]) was designed that can cope with non-strongly convex problems without adding the strongly convex term. However, these approaches only have sublinear convergence (e.g., requiring O(1/ε) iterations to achieve an ε-optimal solution).

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Promisingly, recent studies on optimization showed that leveraging the quadratic error bound (QEB) condition can open a new door to linear convergence without strong convexity [9, 20, 6, 30, 5, 3]. Problem (1) obeys the QEB condition if the following holds:

‖x − x*‖_2 ≤ c (F(x) − F(x*))^{1/2},  ∀x ∈ Ω,   (2)

where x* denotes the closest optimal solution to x and Ω is usually a compact set. Indeed, the aforementioned LASSO-type problems satisfy the QEB condition. It is worth mentioning that the above condition (or similar conditions) has been explored extensively and has different names in the literature, e.g., the second-order growth condition, weak strong convexity [20], essential strong convexity [13], restricted strong convexity [31], optimal strong convexity [13], and semi-strong convexity [6]. Interestingly, [6, 9] have shown that SVRG can enjoy linear convergence under the QEB condition. However, the issue is that SVRG requires knowing the parameter c (analogous to the strong convexity parameter) in the QEB for setting the number of iterations of the inner loops, and c is usually unknown and difficult to estimate. A naive trick of setting the number of inner-loop iterations to a certain multiple (e.g., 2) of the number of components n is usually sub-optimal and worrisome, because it may not be large enough for badly conditioned problems or it could be too large for well-conditioned problems.
In the former case, the algorithm may not converge as the theory indicates, and in the latter case, too many iterations may be wasted in the inner loops.

To address this issue, we develop a new variant of SVRG that embeds an efficient automatic search step for c into the optimization. The challenge in developing such an adaptive variant of SVRG is that one needs an appropriate machinery to check whether the current value of c is large enough. One might be reminded of the restarting procedures for searching for the unknown strong convexity parameter in APG methods [21, 11]. However, there are several differences that make the development of such a search scheme much more daunting for SVRG than for APG. The first difference is that, although SVRG has a lower per-iteration cost than APG, it also makes smaller progress towards optimality after each iteration, which provides less information on the correctness of the current c. The second difference lies in the fact that SVRG is inherently stochastic, making the analysis for bounding the number of search steps much more difficult. To tackle this challenge, we propose to occasionally perform proximal gradient updates at the reference points in SVRG where the full gradient is naturally computed. The norm of the proximal gradient provides a probabilistic "certificate" for checking whether the value of c is large enough. We then provide a novel analysis to bound the expected number of search steps, taking into account that the probabilistic "certificate" might fail with some probability.
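The certificate idea above can be sketched as follows (a simplified illustration, not the paper's exact procedure): after a stage of SVRG, take one proximal gradient step at the reference point and compare the step length to that of the previous stage; if it has not shrunk by a chosen factor ϑ, the current guess of c is deemed too small. The helper names below are assumptions for this sketch.

```python
import numpy as np

def prox_grad_step(grad_f, prox, x, L):
    """One proximal gradient step: argmin_y <grad f(x), y-x> + (L/2)||y-x||^2 + Psi(y),
    where prox(z, t) is the proximal operator of t * Psi."""
    return prox(x - grad_f(x) / L, 1.0 / L)

def certificate_says_c_too_small(x_bar_new, x_tilde_new, x_bar_old, x_tilde_old,
                                 vartheta=0.5):
    """Probabilistic certificate check: if the proximal step length has not
    decreased by the factor vartheta, the growth-parameter guess is too small."""
    return (np.linalg.norm(x_bar_new - x_tilde_new)
            >= vartheta * np.linalg.norm(x_bar_old - x_tilde_old))
```

For instance, with f(x) = 0.5‖x‖² and Ψ = 0, the proximal step from any x with L = 1 lands exactly at the minimizer, and the step length equals ‖x‖, so the check compares distances to the optimum across stages.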
The final result shows that the new variant of SVRG enjoys linear convergence under the QEB condition with unknown c, and the corresponding complexity is only worse by a logarithmic factor than in the setting where the parameter c is assumed to be known.

Besides the QEB condition, we also consider more general error bound conditions (a.k.a. the Hölderian error bound (HEB) conditions [3]), whose definition is given below, and develop adaptive variants of SVRG under the HEB condition with θ ∈ (0, 1/2) that enjoy intermediate faster convergence rates than SVRG under only the smoothness assumption (e.g., SVRG++ [2]). It turns out that the adaptive variants of SVRG under HEB with θ < 1/2 are simpler than that under the QEB.

Definition 1 (Hölderian error bound (HEB)). Problem (1) is said to satisfy a Hölderian error bound condition on a compact set Ω if there exist θ ∈ (0, 1/2] and c > 0 such that for any x ∈ Ω

‖x − x*‖_2 ≤ c (F(x) − F*)^θ,   (3)

where x* denotes the closest optimal solution to x.

It is notable that the above inequality always holds for θ = 0 on a compact set Ω. Therefore the discussion in this paper regarding the HEB condition also applies to the case θ = 0. In addition, if an HEB condition with θ ∈ (1/2, 1] holds, we can always reduce it to the QEB condition provided that F(x) − F* is bounded over Ω. However, we are not aware of any interesting examples of (1) for such cases. We defer several examples satisfying the HEB conditions with explicit θ ∈ (0, 1/2] in machine learning to Section 5.
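As a quick numerical illustration of Definition 1 (a toy example chosen here, not one of the paper's applications): F(x) = x⁴ on Ω = [−1, 1] has x* = 0 and F* = 0, and satisfies (3) with θ = 1/4 and c = 1, since |x| = (x⁴)^{1/4}; the quadratic bound θ = 1/2 fails near the optimum because |x|/(x⁴)^{1/2} = 1/|x| is unbounded.

```python
import numpy as np

# Toy check of the HEB condition (3) for F(x) = x**4 on [-1, 1]:
# distance to the optimum x* = 0 versus c * (F(x) - F*)**theta.
xs = np.linspace(-1.0, 1.0, 2001)
F = xs ** 4
dist = np.abs(xs)

# theta = 1/4 holds with c = 1 (in fact with equality):
assert np.all(dist <= 1.0 * F ** 0.25 + 1e-12)

# theta = 1/2 (the QEB) fails for any fixed c as x -> 0:
x = 1e-3
ratio = x / (x ** 4) ** 0.5
print(ratio)  # ~ 1/x = 1000, growing without bound as x -> 0
```

This is exactly the regime θ < 1/2 in which the simpler restarting variant of Section 4.1 applies.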
We refer the reader to [29, 28, 27, 14] for more examples.

2 Related work

The use of error bound conditions in optimization for deriving fast convergence dates back to [15, 16, 17], where the (local) error bound condition bounds the distance of a point in a local neighborhood of the optimal solution to the optimal set by a multiple of the norm of the proximal gradient at that point. Based on their local error bound condition, they derived local linear convergence for descent methods (e.g., proximal gradient methods). Several recent works have established the same local error bound conditions for several interesting problems in machine learning [7, 32, 33].

Hölderian error bound (HEB) conditions have been studied extensively in variational analysis [10] and recently revived in optimization for developing fast convergence of optimization algorithms. Many studies have leveraged the QEB condition in place of the strong convexity assumption to develop fast convergence (e.g., linear convergence) of many optimization algorithms (e.g., the gradient method [3], the proximal gradient method [5], the accelerated gradient method [20], coordinate descent methods [30], randomized coordinate descent methods [9, 18], subgradient methods [29, 27], primal-dual methods [28], etc.). This work is closely related to several recent studies showing that SVRG methods can also enjoy linear convergence for finite-sum (composite) smooth optimization problems under the QEB condition [6, 9, 12]. However, these approaches all require knowing the growth parameter in the QEB condition, which is unknown in many practical problems. It is worth mentioning that several recent studies have also noticed a similar issue in SVRG-type methods, namely that the strong convexity constant is unknown, and suggested some practical heuristics for either stopping the inner iterations early or restarting the algorithm [2, 22, 19].
Nonetheless, no theoretical convergence guarantee is provided for the suggested heuristics.

Our work is also related to studies that focus on searching for the unknown strong convexity parameter in accelerated proximal gradient (APG) methods [21, 11], but with striking differences as mentioned before. Recently, Liu & Yang [14] considered the HEB for composite smooth optimization problems and developed an adaptive restarting accelerated gradient method without knowing the constant c in the HEB. As we argued before, their analysis cannot be trivially extended to SVRG.

3 SVRG under the HEB condition in the oracle setting

In this section, we present SVRG methods under the HEB condition in the oracle setting, assuming that the parameter c is given. We first give some notations. Denote by L_i the smoothness constant of f_i, i.e., for all x, y ∈ Ω, f_i(x) − f_i(y) ≤ ⟨∇f_i(y), x − y⟩ + (L_i/2)‖x − y‖_2^2. It implies that f(x) := (1/n) Σ_{i=1}^n f_i(x) is also a continuously differentiable convex function whose gradient is L_f-Lipschitz continuous, where L_f ≤ (1/n) Σ_{i=1}^n L_i. For simplicity, we can take L_f = (1/n) Σ_{i=1}^n L_i. In the sequel, we let L := max_i L_i and assume that it is given or can be estimated for the problem. Denote by Ω* the optimal set of problem (1), and by F* = min_{x∈Ω} F(x). The detailed steps of SVRG under the HEB condition are presented in Algorithm 1. The formal guarantee of SVRGHEB is given in the following theorem.

Theorem 2. Suppose problem (1) satisfies the HEB condition with θ ∈ (0, 1/2] and F(x_0) − F* ≤ ε_0, where x_0 is an initial solution. Let η = 1/(36L) and T_1 ≥ 81Lc^2 (1/ε_0)^{1−2θ}.
Algorithm 1 ensures

E[F(x̄^{(R)}) − F*] ≤ (1/2)^R ε_0.   (4)

In particular, by running Algorithm 1 with R = ⌈log_2(ε_0/ε)⌉, we have E[F(x̄^{(R)}) − F*] ≤ ε, and the computational complexity for achieving an ε-optimal solution in expectation is O(n log(ε_0/ε) + Lc^2 max{1/ε^{1−2θ}, log(ε_0/ε)}).

Remark: We make several remarks about Algorithm 1 and the results in Theorem 2. First, the constant factors in η and T_1 should not be taken literally, because we have made no effort to optimize them. Second, when θ = 1/2 (i.e., the QEB condition holds), Algorithm 1 reduces to the standard SVRG method under strong convexity, and the iteration complexity becomes O((n + Lc^2) log(ε_0/ε)), which is the same as that of the standard SVRG with Lc^2 mimicking the condition number of the problem. Third, when θ = 0 (i.e., with only the smoothness assumption), Algorithm 1 reduces to SVRG++ [2] with one difference: in SVRGHEB the initial point and the reference point for each outer loop are the same, whereas they differ in SVRG++; the iteration complexity of SVRGHEB becomes O(n log(ε_0/ε) + Lc^2/ε), which is similar to that of SVRG++. Fourth, for intermediate θ ∈ (0, 1/2) we can obtain faster convergence than SVRG++. Lastly, note that the number of iterations for each outer loop depends on the parameter c in the HEB condition. The proof of Theorem 2 builds on previous analyses of SVRG and is deferred to the supplement.

Algorithm 1 SVRG method under HEB (SVRGHEB(x_0, T_1, R, θ))
1: Input: x_0 ∈ Ω, the number of initial inner iterations T_1, and the number of outer loops R.
2: x̄^{(0)} = x_0
3: for r = 1, 2, . . . , R do
4:   ḡ_r = ∇f(x̄^{(r−1)}), x^{(r)}_0 = x̄^{(r−1)}
5:   for t = 1, 2, . . . , T_r do
6:     Choose i_t ∈ {1, . . . , n} uniformly at random.
7:     g^{(r)}_t = ∇f_{i_t}(x^{(r)}_{t−1}) − ∇f_{i_t}(x̄^{(r−1)}) + ḡ_r
8:     x^{(r)}_t = arg min_{x∈Ω} ⟨g^{(r)}_t, x − x^{(r)}_{t−1}⟩ + (1/(2η))‖x − x^{(r)}_{t−1}‖_2^2 + Ψ(x)
9:   end for
10:  x̄^{(r)} = (1/T_r) Σ_{t=1}^{T_r} x^{(r)}_t
11:  T_{r+1} = 2^{1−2θ} T_r
12: end for
13: Output: x̄^{(R)}

Algorithm 2 SVRG method under HEB with Restarting: SVRGHEB-RS
1: Input: x^{(0)} ∈ Ω, a small value c_0 > 0, and θ ∈ (0, 1/2).
2: Initialization: T^{(1)}_1 = 81Lc_0^2 (1/ε_0)^{1−2θ}
3: for s = 1, 2, . . . , S do
4:   x^{(s)} = SVRGHEB(x^{(s−1)}, T^{(s)}_1, R, θ)
5:   T^{(s+1)}_1 = 2^{1−2θ} T^{(s)}_1
6: end for

4 Adaptive SVRG under the HEB condition in the dark setting

In this section, we present adaptive variants of SVRGHEB that can be run in the dark setting, i.e., without assuming c is known. We first present the variant for θ < 1/2, which is simple and can help us understand the difficulty for θ = 1/2.

4.1 Adaptive SVRG for θ ∈ (0, 1/2)

An issue of SVRGHEB is that when c is unknown, the initial number of iterations T_1 in Algorithm 1 is difficult to estimate. A small value of T_1 may not guarantee that SVRGHEB converges as Theorem 2 indicates. To address this issue, we can use the restarting trick, i.e., restarting SVRGHEB with an increasing sequence of T_1. The steps are shown in Algorithm 2. We can start with a small value of c_0, which is not necessarily larger than c. If c_0 is larger than c, the first call of SVRGHEB will yield an ε-optimal solution as Theorem 2 indicates. Below, we assume that c_0 ≤ c.

Theorem 3.
Suppose problem (1) satisfies the HEB with θ ∈ (0, 1/2) and F(x_0) − F* ≤ ε_0, where x_0 is an initial solution. Let c_0 ≤ c, ε ≤ ε_0/2, R = ⌈log_2(ε_0/ε)⌉ and T^{(1)}_1 = 81Lc_0^2 (1/ε_0)^{1−2θ}. Then with at most a total number of S = ⌈(1/(1/2 − θ)) log_2(c/c_0)⌉ + 1 calls of SVRGHEB in Algorithm 2, we find a solution x^{(S)} such that E[F(x^{(S)}) − F*] ≤ ε. The computational complexity of SVRGHEB-RS for obtaining such an ε-optimal solution is O(n log(ε_0/ε) log(c/c_0) + Lc^2/ε^{1−2θ}).

Remark: The proof is in the supplement. We can see that Algorithm 2 cannot be applied to θ = 1/2, which gives a constant sequence of T^{(s)}_1 and therefore cannot provide any convergence guarantee for a small value of c_0 < c. We have to develop a different variant for tackling θ = 1/2. A minor point worth mentioning is that, if necessary, we can stop Algorithm 2 appropriately by performing a proximal gradient update at x^{(s)} (whose full gradient will be computed for the next stage) and checking whether the proximal gradient's squared Euclidean norm is less than a predefined level (c.f. (7)).

Algorithm 3 SVRG method under QEB with Restarting and Search: SVRGQEB-RS
1: Input: x̃^{(0)} ∈ Ω, an initial value c_0 > 0, ε > 0, ρ = 1/log(1/ε) and ϑ ∈ (0, 1).
2: x̄^{(0)} = arg min_{x∈Ω} ⟨∇f(x̃^{(0)}), x − x̃^{(0)}⟩ + (L/2)‖x − x̃^{(0)}‖_2^2 + Ψ(x), s = 0
3: while ‖x̄^{(s)} − x̃^{(s)}‖_2^2 > ε do
4:   Set R_s and T_s = ⌈81Lc_s^2⌉ as in Lemma 2
5:   x̃^{(s+1)} = SVRGHEB(x̄^{(s)}, T_s, R_s, 0.5)
6:   x̄^{(s+1)} = arg min_{x∈Ω} ⟨∇f(x̃^{(s+1)}), x − x̃^{(s+1)}⟩ + (L/2)‖x − x̃^{(s+1)}‖_2^2 + Ψ(x)
7:   if ‖x̄^{(s+1)} − x̃^{(s+1)}‖_2 ≥ ϑ‖x̄^{(s)} − x̃^{(s)}‖_2 then
8:     c_{s+1} = √2 c_s, x̄^{(s+1)} = x̄^{(s)}, x̃^{(s+1)} = x̃^{(s)}
9:   else
10:    c_{s+1} = c_s
11:  end if
12:  s = s + 1
13: end while
14: Output: x̄^{(s)}

4.2 Adaptive SVRG for θ = 1/2

In light of the value of T_1 in Theorem 2 for θ = 1/2, i.e., T_1 = ⌈81Lc^2⌉, one might consider starting with a small value for c and then increasing its value by a constant factor at certain points in order to increase the value of T_1. But the challenge is to decide when we should increase the value of c. If one follows a procedure similar to Algorithm 2, we may end up with a worse iteration complexity. To tackle this challenge, we need to develop an appropriate machinery to check whether the value of c is already large enough for SVRG to decrease the objective value. However, we cannot afford the cost of computing the objective value due to the large n. To this end, we develop a "certificate" that can be easily verified and can act as a signal for a sufficient decrease in the objective value.
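The restart-and-search loop described here can be sketched as follows (a schematic, with `svrg_stage` standing in for a stage of SVRGHEB and `prox_grad_step` for the proximal gradient update; both are caller-supplied placeholders, not the paper's code):

```python
import math

def norm2(a, b):
    """Euclidean distance between two vectors given as sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def svrg_qeb_rs(x0, c0, eps, L, Lf, svrg_stage, prox_grad_step, vartheta=0.5):
    """Schematic driver for SVRG under QEB with restarting and search.

    svrg_stage(x, T, R) stands in for R outer loops of SVRGHEB with T inner
    iterations each; prox_grad_step(x) stands in for one proximal gradient
    update with step size 1/L.
    """
    rho = 1.0 / math.log(1.0 / eps)      # allowed failure probability of the certificate
    c = c0
    x_tilde = list(x0)
    x_bar = prox_grad_step(x_tilde)
    while norm2(x_bar, x_tilde) ** 2 > eps:
        T = math.ceil(81 * L * c * c)
        R = math.ceil(math.log2(2 * c * c * (L + Lf) ** 2 / (vartheta ** 2 * rho * L)))
        x_tilde_new = svrg_stage(x_bar, T, R)
        x_bar_new = prox_grad_step(x_tilde_new)
        if norm2(x_bar_new, x_tilde_new) >= vartheta * norm2(x_bar, x_tilde):
            c *= math.sqrt(2)            # certificate failed: grow the guess of c
        else:
            x_tilde, x_bar = x_tilde_new, x_bar_new  # certificate passed: accept
    return x_bar
```

The only full gradients needed are those already computed by the SVRG stages and the two proximal gradient updates per stage, matching the structure of the algorithm above.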
The developed certificate is motivated by a property of the proximal gradient update under the QEB, as shown in (5).

Lemma 1. Let x̄ = arg min_{x∈Ω} ⟨∇f(x̃), x − x̃⟩ + (L/2)‖x − x̃‖_2^2 + Ψ(x). Then under the QEB condition of problem (1), we have

F(x̄) − F* ≤ (L + L_f)^2 c^2 ‖x̄ − x̃‖_2^2.   (5)

The above lemma indicates that we can perform a proximal gradient update at a point x̃ and use ‖x̄ − x̃‖_2 as a gauge for monitoring the decrease in the objective value. However, the proximal gradient update is too expensive to perform at every iteration due to the computation of the full gradient ∇f(x̃). Luckily, SVRG computes the full gradient at a small number of reference points. We propose to leverage these full gradients to conduct the proximal gradient updates and develop the certificate for searching for the value of c. The detailed steps of the proposed algorithm are presented in Algorithm 3, to which we refer as SVRGQEB-RS. Similar to SVRGHEB-RS, SVRGQEB-RS also calls SVRGHEB for multiple stages. We conduct the proximal gradient update at the returned solution of each call of SVRGHEB, which also serves as the initial solution and the initial reference point for the next stage of SVRGHEB when our check in Step 7 fails. At each stage, at most R_s + 1 full gradients are computed, where R_s is a logarithmic number as revealed later. Steps 7-11 in Algorithm 3 constitute our search step for the value of c. We will show that, if c_s is larger than c, the condition in Step 7 is true with small probability. This can be seen from the following lemma.

Lemma 2. Suppose problem (1) satisfies the QEB condition. Let G_0 ⊆ G_1 ⊆ · · · ⊆ G_s ⊆ · · · be a filtration with the sigma algebra G_s generated by all random events before line 4 of stage s of Algorithm 3. Let η = 1/(36L), T_s = ⌈81Lc_s^2⌉, and R_s = ⌈log_2( 2c_s^2(L + L_f)^2/(ϑ^2 ρ L) )⌉. Then for any ϑ ∈ (0, 1), we have

Pr( ‖x̄^{(s+1)} − x̃^{(s+1)}‖_2 ≥ ϑ‖x̄^{(s)} − x̃^{(s)}‖_2 | G_s, c_s ≥ c ) ≤ ρ.

Proof. By Lemma 1, we have F(x̄^{(s)}) − F* ≤ (L + L_f)^2 c^2 ‖x̄^{(s)} − x̃^{(s)}‖_2^2 for all s. Below we consider stages such that c_s ≥ c. Following Theorem 2 and the above inequality, when T_s = ⌈81Lc_s^2⌉ ≥ ⌈81Lc^2⌉, we have

E[F(x̃^{(s+1)}) − F* | G_s] ≤ 0.5^{R_s}(F(x̄^{(s)}) − F*) ≤ 0.5^{R_s}(L + L_f)^2 c^2 ‖x̄^{(s)} − x̃^{(s)}‖_2^2.   (6)

Moreover, the smoothness of f(x) and the definition of x̄^{(s+1)} imply (see Lemma 4 in the supplement)

F(x̃^{(s+1)}) − F* ≥ (L/2)‖x̄^{(s+1)} − x̃^{(s+1)}‖_2^2.   (7)

By combining (7) and (6) and using Markov's inequality, we have

Pr( (L/2)‖x̄^{(s+1)} − x̃^{(s+1)}‖_2^2 ≥ ε' | G_s ) ≤ 0.5^{R_s}(L + L_f)^2 c^2 ‖x̄^{(s)} − x̃^{(s)}‖_2^2 / ε'.

If we choose ε' = ϑ^2 L ‖x̄^{(s)} − x̃^{(s)}‖_2^2 / 2 in the inequality above and let R_s be defined as in the assumption, the conclusion follows.

Theorem 4.
Under the same conditions as in Lemma 2, with ρ = 1/log(1/ε), the expected computational complexity of SVRGQEB-RS for finding an ε-optimal solution is at most

O( (Lc^2 + n) log_2( (c^2(L + L_f)^2/(ϑ^2 L)) log(1/ε) ) ( log_{1/ϑ^2}( ‖x̄^{(0)} − x̃^{(0)}‖_2^2/ε ) + log_2(c/c_0) ) ).

Proof. We call stage s with s = 0, 1, . . . a successful stage if ‖x̄^{(s+1)} − x̃^{(s+1)}‖_2 < ϑ‖x̄^{(s)} − x̃^{(s)}‖_2; otherwise, stage s is called an unsuccessful stage. The condition ‖x̄^{(s)} − x̃^{(s)}‖_2^2 ≤ ε will hold after S_1 := log_{1/ϑ^2}( ‖x̄^{(0)} − x̃^{(0)}‖_2^2/ε ) successful stages, and then Algorithm 3 will stop. Let S denote the total number of stages when the algorithm stops. Although stage s = S − 1 is the last stage, for convenience in the proof, we still define stage s = S as a post-termination stage where no computation is performed.

In stage s with 0 ≤ s ≤ S − 1, the computational complexity is proportional to the number of stochastic gradient computations (#SGC), which is T_s R_s + n(R_s + 1) ≤ (T_s + 2n)R_s. If stage s is successful, then R_{s+1} = R_s and T_{s+1} = T_s. If stage s is unsuccessful, then R_{s+1} = R_s + 1 ≤ 2R_s and T_{s+1} = 2T_s, so that R_{s+1}T_{s+1} ≤ 4R_s T_s. In either case, R_s and T_s are non-decreasing. Note that, after S_2 := ⌈2 log_2(c/c_0)⌉ unsuccessful stages, we will have c_s ≥ c.
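The constant S_2 follows directly from the search update c_{s+1} = √2 c_s in Algorithm 3: each unsuccessful stage multiplies the current guess by √2, so after k unsuccessful stages the guess is c_0 2^{k/2}, and

```latex
c_0 \, 2^{k/2} \;\ge\; c
\quad\Longleftrightarrow\quad
k \;\ge\; 2\log_2\!\left(\frac{c}{c_0}\right),
```

hence S_2 = ⌈2 log_2(c/c_0)⌉ unsuccessful stages guarantee c_s ≥ c.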
We will consider two scenarios: (I) the algorithm stops with c_S < c, and (II) the algorithm stops with c_S ≥ c.

In the first scenario, we have S_1 successful stages and at most S_2 unsuccessful stages, so that S ≤ S_1 + S_2 and c_S < c. The #SGC of all stages can be bounded by (S_1 + S_2)(T_{S−1} + 2n)R_{S−1} ≤ O( (Lc^2 + n) log_2( 2c^2(L + L_f)^2/(ϑ^2 ρ L) ) ( log_{1/ϑ^2}( ‖x̄^{(0)} − x̃^{(0)}‖_2^2/ε ) + log_2(c/c_0) ) ).

Then, we consider the second scenario. Let ŝ be the first stage with c_s ≥ c, i.e., ŝ := min{s | c_s ≥ c}. It is easy to see that c_ŝ < √2 c, and there are S_2 unsuccessful and fewer than S_1 successful stages before stage ŝ. Since the #SGC in any stage before ŝ is bounded by (T_ŝ + 2n)R_ŝ ≤ O( (Lc^2 + n) log_2( 8c^2(L + L_f)^2/(ϑ^2 ρ L) ) ), the total #SGC in stages 0, 1, . . . , ŝ − 1 is at most (S_1 + S_2)(T_ŝ + 2n)R_ŝ ≤ O( (Lc^2 + n) log_2( 8c^2(L + L_f)^2/(ϑ^2 ρ L) ) ( log_{1/ϑ^2}( ‖x̄^{(0)} − x̃^{(0)}‖_2^2/ε ) + log_2(c/c_0) ) ).

Next, we bound the total #SGC in stages ŝ, ŝ + 1, . . . , S. In the rest of the proof, we consider stage s with ŝ ≤ s ≤ S. We define C(x̃, x̄, i, j, s) as the expected #SGC in stages s, s + 1, . . . , S, conditioning on the initial state of stage s being x̃^{(s)} = x̃ and x̄^{(s)} = x̄ and the numbers of successful and unsuccessful stages before stage s being i and j, respectively. Note that s = i + j. Because stage s depends on the historical path only through the state variables (x̃, x̄, i, j, s), C(x̃, x̄, i, j, s) is well defined, and (x̃, x̄, i, j, s) transits in a Markov chain with the next state being (x̃, x̄, i, j + 1, s + 1) if stage s does not succeed and (x̃_+, x̄_+, i + 1, j, s + 1) if stage s succeeds, where x̃_+ = SVRGHEB(x̄, T_s, R_s, 0.5) and x̄_+ = arg min_{x∈Ω} ⟨∇f(x̃_+), x − x̃_+⟩ + (L/2)‖x − x̃_+‖_2^2 + Ψ(x).

Next, we use backward induction to derive an upper bound for C(x̃, x̄, i, j, s) that depends only on i and j but not on s, x̃ and x̄.
In particular, we want to show that\n\n(cid:17)(cid:17)\n(cid:16)(cid:107)\u00afx(0)\u2212\u02dcx(0)(cid:107)2\n\n(cid:16) 8c2(L+Lf )2\n\n(cid:16) 2c2(L+Lf )2\n\n(cid:16) 2c2(L+Lf )2\n\n2 (cid:107)x \u2212 \u02dcx+(cid:107)2\n\n(Lc2 + n) log2\n\n(Lc2 + n)\n\n.\n\n(Lc2 + n)\n\n.\n\n2 + \u03a8(x).\n\n) + log1/\u03d12\n\n) + log1/\u03d12\n\nlog2( c\nc0\n\n2\n\nlog2\n\n2\n\nlog2\n\n(cid:16)(cid:104)\n\n(cid:17)(cid:105)\n\n\u0001\n\n\u221a\n\n(cid:16)\n\n(cid:17)\n\n(cid:17)\n\n(cid:17)\n\n(cid:17)\n\n\u0001\n\n\u03d12\u03c1L\n\n\u03d12\u03c1L\n\n\u03d12\u03c1L\n\nC(\u02dcx, \u00afx, i, j, s) \u2264 4j\u2212S2 (T\u02c6s + 2n)R\u02c6s\n\nwhere Ai :=(cid:80)S1\u2212i\u22121\n\nr=0\n\n(cid:17)r\n\n(cid:16) 1\u2212\u03c1\n\n1\u22124\u03c1\n\nAi, for i \u2265 0, j \u2265 0, i + j = s, s \u2265 \u02c6s,\n\n1 \u2212 4\u03c1\nif 0 \u2264 i \u2264 S1 \u2212 1 and Ai := 0 if i = S1.\n\n(8)\n\nWe start with the base case where i = S1. By de\ufb01nitions, the only stage with i = S1 is the post-\ntermination stage, namely, stage s = S. In this case, C(\u02dcx, \u00afx, i, j, s) = 0 since stage S performs no\ncomputation. Then, (8) holds trivially with Ai = 0.\n\n6\n\n\fSuppose i < S1 and (8) holds for i + 1, i + 2, . . . , S1. We want to prove it also holds i. We de\ufb01ne\nX = X(\u02dcx, \u00afx, i, j, s) as the random variable that equals the number of unsuccessful stages from stage\ns (including stage s) to the \ufb01rst successful stage among stages s, s + 1, s + 2, . . . , S \u2212 1, conditioning\non s \u2265 \u02c6s and the state variables at the beginning of stage s are (\u02dcx, \u00afx, i, j, s). Note that X = 0 means\nstage s is successful. For simplicity of notation, we use Pr(\u00b7) to represent the conditional probability\nPr(\u00b7|s \u2265 \u02c6s, (\u02dcx, \u00afx, i, j, s)). 
Since $c_s \ge c_{\hat s} \ge c$ for $s \ge \hat s$, we can show by Lemma 2 that (we follow the convention that $\prod_{t=i}^{j} = 1$ if $j < i$)
\begin{align*}
\Pr(X \ge r+1 \mid X \ge r) &= \Pr(\text{stage } s+r \text{ fails} \mid \text{stages } s, s+1, \dots, s+r-1 \text{ fail}) \le \rho,\\
\Pr(X = r \mid X \ge r) &= \Pr(\text{stage } s+r \text{ succeeds} \mid \text{stages } s, s+1, \dots, s+r-1 \text{ fail})\\
&= 1 - \Pr(X \ge r+1 \mid X \ge r) \ge 1 - \rho,
\end{align*}
and
$$\Pr(X = r) = \Big[\prod_{t=0}^{r-1} \Pr(X \ge t+1 \mid X \ge t)\Big] \Pr(X = r \mid X \ge r). \qquad (9)$$
When $X = r$, the #SGC from stage $s$ to the end of the algorithm will be $\sum_{t=0}^{r}(T_{s+t}+2n)R_{s+t} + \mathrm{E}\,C(\tilde x^+, \bar x^+, i+1, j+r, s+r+1)$, where $\mathrm{E}$ denotes the expectation over $\tilde x^+$ and $\bar x^+$ conditioning on $(\tilde x, \bar x)$, $\tilde x^+ = \mathrm{SVRG}^{\mathrm{HEB}}(\bar x, T_{s+r}, R_{s+r}, 0.5)$, and $\bar x^+ = \arg\min_{x \in \Omega} \langle \nabla f(\tilde x^+), x - \tilde x^+ \rangle + \frac{L}{2}\|x - \tilde x^+\|_2^2 + \Psi(x)$. Since stages $s, s+1, \dots, s+r-1$ are unsuccessful, we have
$$(T_{s+t}+2n)R_{s+t} \le 4^t (T_s+2n)R_s \le 4^{j+t-S_2}(T_{\hat s}+2n)R_{\hat s} \quad \text{for } t = 0, 1, \dots, r-1.$$
Because (8) holds for $i+1$ and for any $\tilde x^+$ and $\bar x^+$, we have
$$C(\tilde x^+, \bar x^+, i+1, j+r, s+r+1) \le \frac{4^{j+r-S_2}(T_{\hat s}+2n)R_{\hat s}\,A_{i+1}}{1-4\rho}. \qquad (10)$$
Based on the above inequality and the connection between $C(\tilde x, \bar x, i, j, s)$ and $C(\tilde x^+, \bar x^+, i+1, j+r, s+r+1)$, we will prove that (8) holds for $i, j, s$:
\begin{align*}
C(\tilde x, \bar x, i, j, s) &= \sum_{r=0}^{\infty} \Pr(X = r)\Big(\sum_{t=0}^{r}(T_{s+t}+2n)R_{s+t} + \mathrm{E}\,C(\tilde x^+, \bar x^+, i+1, j+r, s+r+1)\Big)\\
&\le \sum_{r=0}^{\infty} \Pr(X = r)\Big(\sum_{t=0}^{r} 4^{j+t-S_2}(T_{\hat s}+2n)R_{\hat s} + \frac{4^{j+r-S_2}(T_{\hat s}+2n)R_{\hat s}\,A_{i+1}}{1-4\rho}\Big)\\
&= 4^{j-S_2}(T_{\hat s}+2n)R_{\hat s} \sum_{r=0}^{\infty} \Pr(X = r)\Big(\frac{4^{r+1}-1}{3} + \frac{4^r A_{i+1}}{1-4\rho}\Big)\\
&= 4^{j-S_2}(T_{\hat s}+2n)R_{\hat s} \sum_{r=0}^{\infty} \Big[\prod_{t=0}^{r-1}\Pr(X \ge t+1 \mid X \ge t)\Big]\Pr(X = r \mid X \ge r)\Big(\frac{4^{r+1}-1}{3} + \frac{4^r A_{i+1}}{1-4\rho}\Big).
\end{align*}
For $0 \le a \le b$, define
$$D_a^b := \sum_{r=a}^{b}\Big[\prod_{t=a}^{r-1}\Pr(X \ge t+1 \mid X \ge t)\Big]\Pr(X = r \mid X \ge r)\Big(\frac{4^{r+1}-1}{3} + \frac{4^r A_{i+1}}{1-4\rho}\Big).$$
Since $1-\rho \ge \frac{3}{4}$, for any $a \ge 0$ and any $b \ge a+1$, we have
$$\frac{4^{a+1}-1}{3} + \frac{4^a A_{i+1}}{1-4\rho} \le (1-\rho)\Big(\frac{4^{a+2}-1}{3} + \frac{4^{a+1} A_{i+1}}{1-4\rho}\Big) \le \Pr(X = a+1 \mid X \ge a+1)\Big(\frac{4^{a+2}-1}{3} + \frac{4^{a+1} A_{i+1}}{1-4\rho}\Big) \le D_{a+1}^b,$$
which implies
$$D_a^b = \Pr(X = a \mid X \ge a)\Big(\frac{4^{a+1}-1}{3} + \frac{4^a A_{i+1}}{1-4\rho}\Big) + \Pr(X \ge a+1 \mid X \ge a)\,D_{a+1}^b \le (1-\rho)\Big(\frac{4^{a+1}-1}{3} + \frac{4^a A_{i+1}}{1-4\rho}\Big) + \rho D_{a+1}^b.$$
Applying this inequality for $a = 0, 1, \dots, b-1$ and the fact $D_b^b \le \frac{4^{b+1}-1}{3} + \frac{4^b A_{i+1}}{1-4\rho}$ gives
$$D_0^b \le (1-\rho)\sum_{r=0}^{b-1}\rho^r\Big(\frac{4^{r+1}-1}{3} + \frac{4^r A_{i+1}}{1-4\rho}\Big) + \rho^b\Big(\frac{4^{b+1}-1}{3} + \frac{4^b A_{i+1}}{1-4\rho}\Big).$$
Since $4\rho < 1$, letting $b$ in the inequality above increase to infinity gives
$$C(\tilde x, \bar x, i, j, s) \le 4^{j-S_2}(T_{\hat s}+2n)R_{\hat s}(1-\rho)\sum_{r=0}^{\infty}\rho^r\Big(\frac{4^{r+1}-1}{3} + \frac{4^r A_{i+1}}{1-4\rho}\Big) = 4^{j-S_2}(T_{\hat s}+2n)R_{\hat s}\Big(\frac{1}{1-4\rho} + \frac{A_{i+1}(1-\rho)}{(1-4\rho)^2}\Big) = \frac{4^{j-S_2}(T_{\hat s}+2n)R_{\hat s}\,A_i}{1-4\rho},$$
where the last equality uses $A_i = 1 + \frac{1-\rho}{1-4\rho}A_{i+1}$, which is (8). Then by induction, (8) holds for any state $(\tilde x, \bar x, i, j, s)$ with $s \ge \hat s$. At the moment when the algorithm enters stage $\hat s$, we must have $j = S_2$ and $i = \hat s - S_2$.
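The two geometric series appearing in the limit above can be evaluated in closed form; as a supplementary step (valid whenever $0 \le 4\rho < 1$):

```latex
% Evaluating the two series used when b -> infinity, for 0 <= 4\rho < 1:
\begin{align*}
(1-\rho)\sum_{r=0}^{\infty}\rho^r\,\frac{4^{r+1}-1}{3}
  &= \frac{1-\rho}{3}\Big(\frac{4}{1-4\rho}-\frac{1}{1-\rho}\Big)
   = \frac{4(1-\rho)-(1-4\rho)}{3(1-4\rho)}
   = \frac{1}{1-4\rho},\\
(1-\rho)\sum_{r=0}^{\infty}\rho^r\,\frac{4^{r}A_{i+1}}{1-4\rho}
  &= \frac{(1-\rho)\,A_{i+1}}{(1-4\rho)^2}.
\end{align*}
```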
By (8) and the facts that $\hat s \ge S_2$ and that $A_i = \sum_{r=0}^{S_1-i-1}\big(\tfrac{1-\rho}{1-4\rho}\big)^r \le (S_1+S_2-\hat s)\big(\tfrac{1-\rho}{1-4\rho}\big)^{S_1+S_2-\hat s}$ for $i = \hat s - S_2$, the expected #SGC from stage $\hat s$ to the end of the algorithm is
$$C(\tilde x, \bar x, \hat s - S_2, S_2, \hat s) \le \frac{(T_{\hat s}+2n)R_{\hat s}(S_1+S_2-\hat s)}{1-4\rho}\Big(\frac{1-\rho}{1-4\rho}\Big)^{S_1+S_2-\hat s} \le O\bigg((Lc^2+n)\log^2\Big(\frac{\|\bar x^{(0)}-\tilde x^{(0)}\|_2^2}{\epsilon}\Big)\Big(\frac{1-\rho}{1-4\rho}\Big)^{S_1}\bigg),$$
where we use $T_{\hat s} \le \frac{8c^2(L+L_f)^2}{\vartheta^2\rho L}$ and $R_{\hat s}, S_1 \le O\big(\log\frac{\|\bar x^{(0)}-\tilde x^{(0)}\|_2^2}{\epsilon}\big)$. Since $\log\big(\frac{1-\rho}{1-4\rho}\big) \le \frac{3\rho}{1-4\rho}$, we have $\big(\frac{1-\rho}{1-4\rho}\big)^{S_1} \le O\big(\big(\frac{1}{\epsilon}\big)^{\frac{3\rho}{(1-4\rho)\log(1/\vartheta)}}\big)$, and in light of the value of $\rho$, $\big(\frac{1-\rho}{1-4\rho}\big)^{S_1} \le O(1)$. Therefore, by adding the #SGC before and after stage $\hat s$ in the second scenario, the expected total #SGC is
$$O\bigg((Lc^2+n)\log\frac{\|\bar x^{(0)}-\tilde x^{(0)}\|_2^2}{\epsilon}\Big(\log\frac{\|\bar x^{(0)}-\tilde x^{(0)}\|_2^2}{\epsilon} + \log\frac{c}{c_0}\Big)\bigg).$$

5 Applications and Experiments

In this section, we consider some applications in machine learning and present some experimental results.
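For reference, the backbone shared by the SVRG-type methods compared in the experiments below is a proximal SVRG epoch loop in the spirit of [8, 26]. The following is a minimal sketch with a fixed epoch length on an $\ell_1$-regularized squared hinge loss (the function names, step size, and toy objective are our own); it is not the adaptive SVRGQEB-RS/SVRGHEB-RS procedures analyzed in this paper, which additionally adjust the epoch length across stages.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal mapping of t * ||.||_1 (componentwise shrinkage).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_svrg(A, b, lam, eta, T, epochs, x0, rng):
    """Generic proximal SVRG for
        min_x (1/n) sum_i max(0, 1 - b_i * a_i' x)^2 + lam * ||x||_1,
    i.e., l1-regularized squared hinge loss; fixed epoch length T."""
    n = A.shape[0]
    x_tilde = np.asarray(x0, dtype=float).copy()

    def grad_i(x, i):
        # Gradient of the squared hinge loss on example i.
        margin = 1.0 - b[i] * (A[i] @ x)
        return -2.0 * max(margin, 0.0) * b[i] * A[i]

    for _ in range(epochs):
        mu = sum(grad_i(x_tilde, i) for i in range(n)) / n  # full-gradient snapshot
        x = x_tilde.copy()
        for _ in range(T):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_tilde, i) + mu  # variance-reduced gradient
            x = soft_threshold(x - eta * v, eta * lam)  # proximal step
        x_tilde = x
    return x_tilde
```

In SVRGHEB and the adaptive variants, the epoch length $T_s$ and the number of epochs $R_s$ change across stages according to the (estimated) growth parameter $c$; the inner loop body stays the same.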
We will consider finite-sum problems in machine learning where $f_i(x) = \ell(x^\top a_i, b_i)$ denotes a loss function on an observed training feature-label pair $(a_i, b_i)$, and $\Psi(x)$ denotes a regularization on the model $x$. Let us first consider some examples of loss functions and regularizers that satisfy the QEB condition. More examples can be found in [29, 28, 27, 14].
Piecewise convex quadratic (PCQ) problems. According to the global error bound of piecewise convex polynomials by Li [10], PCQ problems satisfy the QEB condition. Examples of such problems include empirical square loss, squared hinge loss, or Huber loss minimization with $\ell_1$ norm, $\ell_\infty$ norm, or $\ell_{1,\infty}$ norm regularization or constraint.
A family of structured smooth composite functions. This family includes functions of the form $F(x) = h(Ax) + \Psi(x)$, where $\Psi(x)$ is a polyhedral function or an indicator function of a polyhedral set and $h(\cdot)$ is smooth and strongly convex on any compact set. According to the studies in [6, 20], the QEB holds on any compact set or on the involved polyhedral set. Examples of interesting loss functions include the aforementioned square loss and the logistic loss as well.
For examples satisfying the HEB condition with intermediate values of $\theta \in (0, 1/2)$, we can consider $\ell_1$ constrained $\ell_p$ norm regression, where the objective is $f(x) = \frac{1}{n}\sum_{i=1}^{n}(x^\top a_i - b_i)^p$ with $p \in 2\mathbb{N}_+$ [23]. According to the reasoning in [14], the HEB condition holds with $\theta = 1/p$.
Before presenting the experimental results, we would like to remark that many regularized machine learning formulations include no constraint confining $x$ to a compact domain $\Omega$. Nevertheless, we can explicitly add a constraint $\Psi(x) \le B$ to the problem to ensure that the intermediate solutions generated by the proposed algorithms always stay in a compact set, where $B$ can be set to a large value without affecting the optimal solutions.
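To make the exponent $\theta = 1/p$ concrete, consider a one-dimensional toy instance in the spirit of the $\ell_p$ regression example above (our own illustration, not taken from the cited references):

```latex
% Take f(x) = x^p on R with p in 2N_+, so the solution set is X_* = {0}
% and f_* = 0. Then for every x,
%   dist(x, X_*) = |x| = (x^p)^{1/p} = (f(x) - f_*)^{1/p},
% i.e., the Holderian error bound dist(x, X_*) <= c (f(x) - f_*)^theta
% holds with c = 1 and theta = 1/p. For p = 2 this gives theta = 1/2
% (the quadratic error bound), and p > 2 gives the intermediate
% exponents theta in (0, 1/2).
\mathrm{dist}(x, X_*) = |x| = \big(f(x) - f_*\big)^{1/p}
```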
The proximal mapping of $\Psi(x)$ with such an explicit constraint can be efficiently handled by combining the proximal mapping and a binary search for the Lagrangian multiplier. In practice, as long as $B$ is sufficiently large, the constraint remains inactive and the computational cost remains the same.

Figure 1: Comparison of different algorithms for solving different problems on different datasets.

Next, we conduct some experiments to demonstrate the effectiveness of the proposed algorithms on several tasks: $\ell_1$ regularized squared hinge loss minimization and $\ell_1$ regularized logistic loss minimization for linear classification problems; and $\ell_1$ constrained $\ell_p$ norm regression, $\ell_1$ regularized square loss minimization, and $\ell_1$ regularized Huber loss minimization for linear regression problems. We use three datasets from the libsvm website: Adult (n = 32561, d = 123), E2006-tfidf (n = 16087, d = 150360), and YearPredictionMSD (n = 51630, d = 90). Note that we use the testing set of the YearPredictionMSD data for our experiment because some baselines need a long time to converge on the large training set. We set the regularization parameter of the $\ell_1$ norm and the upper bound of the $\ell_1$ constraint to $10^{-4}$ and 100, respectively. In each plot, the difference between the objective value and the optimum is presented in log scale.
Our first experiment justifies the proposed SVRGQEB-RS algorithm by comparing it with SVRGHEB under different estimates of $c$ (corresponding to different initial values of $T_1$). We try four different values $T_1 \in \{1000, 2000, 8000, 2n\}$. The result is plotted in the top left of Figure 1. We can see that SVRGHEB with underestimated values of $T_1$ (e.g., 1000, 2000) converges very slowly.
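The binary search for the Lagrangian multiplier described above can be sketched as follows, taking $\Psi(x) = \lambda\|x\|_1$ and writing the compact-set constraint as $\|x\|_1 \le B$; this is a minimal sketch (the function names, the bisection bracket, and the tolerances are our own choices):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal mapping of t * ||.||_1 (componentwise shrinkage).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l1_with_ball(v, lam, B, tol=1e-10, max_iter=100):
    """Solve min_x 0.5*||x - v||^2 + lam*||x||_1  s.t.  ||x||_1 <= B
    by bisection on the multiplier mu >= 0 of the l1-ball constraint.
    The KKT solution is x = soft_threshold(v, lam + mu), with mu = 0
    when the constraint is inactive and ||x||_1 = B otherwise."""
    x = soft_threshold(v, lam)
    if np.sum(np.abs(x)) <= B:
        return x  # constraint inactive: mu = 0, plain prox suffices
    lo, hi = 0.0, float(np.max(np.abs(v)))  # at mu = hi the solution is 0
    for _ in range(max_iter):
        mu = 0.5 * (lo + hi)
        s = np.sum(np.abs(soft_threshold(v, lam + mu)))
        if abs(s - B) <= tol:
            break
        if s > B:      # still too large: increase shrinkage
            lo = mu
        else:          # over-shrunk: decrease shrinkage
            hi = mu
    return soft_threshold(v, lam + 0.5 * (lo + hi))
```

Since $\|\mathrm{soft\_threshold}(v, \lambda+\mu)\|_1$ is continuous and non-increasing in $\mu$, the bisection converges geometrically, so the per-iteration cost stays dominated by the proximal mapping itself.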
However, the performance of SVRGQEB-RS is not affected much by the initial value of $T_1$, which is consistent with our theory showing the logarithmic dependence on the initial value of $c$. Moreover, SVRGQEB-RS always performs better than its SVRGHEB counterpart with the same value of $T_1$.
Then we compare SVRGQEB-RS and SVRGHEB-RS with other baselines for solving different problems on different datasets. We choose SAGA and SVRG++ as the baselines. We also notice that a heuristic variant of SVRG++ was suggested in [2], where the epoch length is automatically determined based on the change in the variance of the gradient estimators between two consecutive epochs. However, according to our experiments, this heuristic strategy cannot always terminate an epoch because the suggested criterion cannot be met; this was also confirmed by our communication with the authors of SVRG++. To make it work, we manually cap each epoch length at $2n$ following the suggestion in [8]. The resulting baseline is denoted by SVRG-heuristics. For all algorithms, the step size is best tuned. The initial epoch length of SVRG++ is set to $n/4$ following the suggestion in [2], and the same initial epoch length is used in our algorithms. The comparisons with these baselines are reported in the remaining panels of Figure 1. We can see that SVRGQEB-RS (resp. SVRGHEB-RS) always has superior performance, while SVRG-heuristics sometimes performs well and sometimes badly.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. Y. Xu and T.
Yang are partially supported by National Science Foundation (IIS-1463988, IIS-1545995).

[Figure 1 panels (objective minus optimum, log scale, vs. #grad/n): squared hinge + $\ell_1$ norm on Adult comparing SVRGHEB and SVRGQEB-RS with $T_1 \in \{1000, 2000, 8000, 2n = 65122\}$; then squared hinge + $\ell_1$ norm (Adult), logistic + $\ell_1$ norm (Adult), square + $\ell_1$ norm (millionsongs), Huber loss + $\ell_1$ norm (millionsongs), and $\ell_p$ regression with $p = 4$ (E2006), each comparing SAGA, SVRG++, SVRG-heuristics, and SVRGQEB-RS (or SVRGHEB-RS).]

References

[1] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM Symposium on Theory of Computing, STOC '17, 2017.

[2] Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of The 33rd International Conference on Machine Learning, pages 1080-1089, 2016.

[3] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.

[4] A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646-1654, 2014.

[5] D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. arXiv:1602.06661, 2016.

[6] P. Gong and J. Ye.
Linear convergence of variance-reduced projected stochastic gradient without strong convexity. CoRR, abs/1406.1102, 2014.

[7] K. Hou, Z. Zhou, A. M. So, and Z. Luo. On the linear convergence of the proximal gradient method for trace norm regularization. In Advances in Neural Information Processing Systems (NIPS), pages 710-718, 2013.

[8] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315-323, 2013.

[9] H. Karimi, J. Nutini, and M. W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases - European Conference (ECML-PKDD), pages 795-811, 2016.

[10] G. Li. Global error bounds for piecewise convex polynomials. Math. Program., 137(1-2):37-64, 2013.

[11] Q. Lin and L. Xiao. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 73-81, 2014.

[12] J. Liu and M. Takáč. Projected semi-stochastic gradient descent method with mini-batch scheme under weak strong convexity assumption. CoRR, abs/1612.05356, 2016.

[13] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351-376, 2015.

[14] M. Liu and T. Yang. Adaptive accelerated gradient converging methods under Hölderian error bound condition. CoRR, abs/1611.07609, 2017.

[15] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.

[16] Z.-Q. Luo and P. Tseng. On the linear convergence of descent methods for convex essentially smooth minimization.
SIAM Journal on Control and Optimization, 30(2):408-425, 1992.

[17] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46:157-178, 1993.

[18] C. Ma, R. Tappenden, and M. Takáč. Linear convergence of the randomized feasible descent method under the weak strong convexity assumption. CoRR, abs/1506.02530, 2015.

[19] T. Murata and T. Suzuki. Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. CoRR, abs/1703.00439, 2017.

[20] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. CoRR, abs/1504.06298, 2015.

[21] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

[22] L. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. CoRR, 2017.

[23] H. Nyquist. The optimal Lp norm estimator in linear regression models. Communications in Statistics - Theory and Methods, 12(21):2511-2524, 1983.

[24] R. Rockafellar. Convex Analysis. Princeton mathematical series. Princeton University Press, 1970.

[25] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 567-599, 2013.

[26] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

[27] Y. Xu, Q. Lin, and T. Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3821-3830, 2017.

[28] Y. Xu, Y. Yan, Q.
Lin, and T. Yang. Homotopy smoothing for non-smooth problems with lower complexity than $O(1/\epsilon)$. In Advances In Neural Information Processing Systems 29 (NIPS), pages 1208-1216, 2016.

[29] T. Yang and Q. Lin. RSG: Beating SGD without smoothness and/or strong convexity. CoRR, abs/1512.03107, 2016.

[30] H. Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. CoRR, abs/1606.00269, 2016.

[31] H. Zhang and W. Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.

[32] Z. Zhou and A. M.-C. So. A unified approach to error bounds for structured convex optimization problems. arXiv:1512.03518, 2015.

[33] Z. Zhou, Q. Zhang, and A. M. So. $\ell_{1,p}$-norm regularization: Error bounds and convergence rate analysis of first-order methods. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1501-1510, 2015.