{"title": "Breaking the Span Assumption Yields Fast Finite-Sum Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2312, "page_last": 2321, "abstract": "In this paper, we show that SVRG and SARAH can be modified to be fundamentally faster than all of the other standard algorithms that minimize the sum of $n$ smooth functions, such as SAGA, SAG, SDCA, and SDCA without duality. Most finite sum algorithms follow what we call the ``span assumption'': Their updates are in the span of a sequence of component gradients chosen in a random IID fashion. In the big data regime, where the condition number $\\kappa=O(n)$, the span assumption prevents algorithms from converging to an approximate solution of accuracy $\\epsilon$ in less than $n\\ln(1/\\epsilon)$ iterations. SVRG and SARAH do not follow the span assumption since they are updated with a hybrid of full-gradient and component-gradient information. We show that because of this, they can be up to $\\Omega(1+(\\ln(n/\\kappa))_+)$ times faster. In particular, to obtain an accuracy $\\epsilon = 1/n^\\alpha$ for $\\kappa=n^\\beta$ and $\\alpha,\\beta\\in(0,1)$, modified SVRG requires $O(n)$ iterations, whereas algorithms that follow the span assumption require $O(n\\ln(n))$ iterations. Moreover, we present lower bound results that show this speedup is optimal, and provide analysis to help explain why this speedup exists. 
With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield faster algorithms in the big data regime.", "full_text": "Breaking the Span Assumption Yields Fast\n\nFinite-Sum Minimization\u2217\n\nRobert Hannah\u20201, Yanli Liu\u20211, Daniel O\u2019Connor\u00a72, and Wotao Yin\u00b61\n1Department of Mathematics, University of California, Los Angeles\n\n2Department of Mathematics, University of San Francisco\n\nAbstract\n\nIn this paper, we show that SVRG and SARAH can be modi\ufb01ed to be\nfundamentally faster than all of the other standard algorithms that minimize\nthe sum of n smooth functions, such as SAGA, SAG, SDCA, and SDCA\nwithout duality. Most \ufb01nite sum algorithms follow what we call the \u201cspan\nassumption\u201d: Their updates are in the span of a sequence of component\ngradients chosen in a random IID fashion. In the big data regime, where the\ncondition number \u03ba = O(n), the span assumption prevents algorithms from\nconverging to an approximate solution of accuracy \u0001 in less than n ln(1/\u0001)\niterations. SVRG and SARAH do not follow the span assumption since\nthey are updated with a hybrid of full-gradient and component-gradient\ninformation. We show that because of this, they can be up to \u2126(1 +\n(ln(n/\u03ba))+) times faster. In particular, to obtain an accuracy \u0001 = 1/n\u03b1 for\n\u03ba = n\u03b2 and \u03b1, \u03b2 \u2208 (0, 1), modi\ufb01ed SVRG requires O(n) iterations, whereas\nalgorithms that follow the span assumption require O(n ln(n)) iterations.\nMoreover, we present lower bound results that show this speedup is optimal,\nand provide analysis to help explain why this speedup exists. 
With the understanding that the span assumption is a point of weakness of finite sum algorithms, future work may purposefully exploit this to yield faster algorithms in the big data regime.

1 Introduction

Finite sum minimization is an important class of optimization problem that appears in many applications in machine learning and other areas. We consider the problem of finding an approximation x̂ to the minimizer x* of functions F : R^d → R of the form:

F(x) = f(x) + ψ(x) ≜ (1/n) Σ_{i=1}^n f_i(x) + ψ(x).      (1.1)

We assume each function f_i is smooth^6, and possibly nonconvex; ψ is proper, closed, and convex; and the sum F is strongly convex and smooth.

*This work was supported in part by grants: AFOSR MURI FA9550-18-1-0502, NSF DMS-1720237, and ONR N000141712162.
†Corresponding author: RobertHannah89@gmail.com
‡yanli@math.ucla.edu
§daniel.v.oconnor@gmail.com
¶WotaoYin@math.ucla.edu
6A function f is L-smooth if it has an L-Lipschitz gradient ∇f.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

It has become well-known that under a variety of assumptions, functions of this form can be minimized much faster with variance reduction (VR) algorithms that specifically exploit the finite-sum structure. When each f_i is μ-strongly convex and L-smooth, and ψ = 0, SAGA [1], SAG [2], Finito/Miso [3], [4], SVRG [5], SARAH [6], SDCA [7], and SDCA without duality [8] can find a vector x̂ with expected suboptimality E(f(x̂) − f(x*)) = O(ε) with only O((n + L/μ) ln(1/ε)) calculations of component gradients ∇f_i(x). This can be up to n times faster than (full) gradient descent, which takes O(nL/μ ln(1/ε)) gradients. These algorithms exhibit sublinear convergence for non-strongly convex problems^7. 
Various results also exist for nonzero convex ψ.
Accelerated VR algorithms have also been proposed. Katyusha [9] is a primal-only Nesterov-accelerated VR algorithm that uses only component gradients. It is based on SVRG and has complexity O((n + √(nκ)) ln(1/ε)) for condition number κ, which is defined as L/μ. In [10], the author devises an accelerated SAGA algorithm that attains the same complexity using component proximal steps. In [11], the author devises an accelerated primal-dual VR algorithm. There also exist “catalyst” [12] accelerated methods [13], [14]. However, catalyst methods appear to have a logarithmic complexity penalty over Nesterov-accelerated methods, a defect that researchers have not yet been able to correct.
In [15], authors show that a class of algorithms that includes SAGA, SAG, Finito (with replacement), Miso, SDCA without duality, etc. have complexity K(ε) lower bounded by Ω((n + √(nκ)) ln(1/ε)) for problem dimension d ≥ 2K(ε). More precisely, the lower bound applies to algorithms that satisfy what we will call the span condition. That is,

x^{k+1} ∈ x^0 + span{∇f_{i_0}(x^0), ∇f_{i_1}(x^1), . . . , ∇f_{i_k}(x^k)}      (1.2)

for some fixed IID random variable i_k over the indices {1, . . . , n}. Later, [16] and [17] extend lower bound results to algorithms that do not follow the span assumption: SDCA, SVRG, SARAH, accelerated SAGA, etc.; but with a weaker lower bound of Ω(n + √(nκ) ln(1/ε)). The difference in these two expressions was thought to be a proof artifact that would later be resolved.
However, we show a surprising result in Section 2: SVRG and SARAH can be fundamentally faster than methods that satisfy the span assumption, with the full gradient steps playing a critical role in their speedup. 
More precisely, for κ = O(n), SVRG and SARAH can be modified to reach an accuracy of ε in O((n/(1 + (ln(n/κ))₊)) ln(1/ε)) gradient calculations^8, instead of the Θ(n ln(1/ε)) iterations required for algorithms that follow the span condition. We also improve the lower bound of [17] to Ω(n + (n/(1 + (ln(n/κ))₊) + √(nκ)) ln(1/ε)) in Section 2.1. That is, we show that the complexity K(ε) of a very general class of algorithms that includes all of the above algorithms satisfies the lower bound:

K(ε) = Ω(n + √(nκ) ln(1/ε)),                     for n = O(κ),
K(ε) = Ω(n + (n/(1 + (ln(n/κ))₊)) ln(1/ε)),      for κ = O(n).      (1.3)

Hence when κ = O(n) our modified SVRG has optimal complexity, and when n = O(κ), Katyusha is optimal.
SDCA doesn't quite follow the span assumption. Also, the dimension n of the dual space on which the algorithm runs is inherently small in comparison to k, the number of iterations. We complete the picture using different arguments, by showing that its complexity is lower bounded by Ω(n ln(1/ε)) in Section 2.2. Hence SDCA doesn't attain this logarithmic speedup. We leave the analysis of accelerated SAGA, accelerated SDCA, and other algorithms to future work.
Our results identify a significant obstacle to high performance when n ≫ κ. The speedup that SVRG and SARAH can be modified to attain in this scenario is somewhat accidental, since their original purpose was to minimize memory overhead. However, with the knowledge that this assumption is a point of weakness for VR algorithms, future work may more purposefully exploit this to yield better speedups than SVRG and SARAH can currently attain. 
Though the complexity of SVRG and SARAH can be made optimal to within a constant factor, this factor is somewhat large and could potentially be reduced substantially; it is unclear how much of a speedup is possible.

7SDCA must be modified however with a dummy regularizer.
8We define (a)₊ as max{a, 0} for a ∈ R.

Having n ≫ κ, which has been referred to as the “big data condition”, is rather common, especially in regularized empirical risk minimization (ERM). For instance, [2] remarks that κ = √n is a nearly optimal choice of regularization for empirical risk minimization in the low-dimensional setting. In the high-dimensional setting, the authors of [2] claim there is no analysis that they are aware of that doesn't imply that we should set the regularization term to ensure n = O(κ). In [18], authors consider regularized ERM for minimizing a stochastic objective. They argue that the optimal choice of regularization parameter λ corresponds to κ = √n. [19] considers regularized SVM with κ = n^β for β < 1.
Hence our results have wide application. In the settings described above, we have the following corollary, which implies a complexity improvement from O(n ln(n)) to O(n). This will follow from Corollary 2 ahead.
Corollary 1. To obtain accuracy ε = 1/n^α for κ = n^β and some α, β ∈ (0, 1), modified SVRG requires O(n) iterations, whereas algorithms that follow the span assumption require O(n ln(n)) iterations [11] for sufficiently large problem dimension d.

For large-scale problems, this ln(n) factor can be rather large: for instance in the KDD Cup 2012 dataset (n = 149,639,105 and ln(n) ≈ 18), Criteo's Terabyte Click Logs (n = 4,195,197,692 and ln(n) ≈ 22), etc. 
Non-public internal company datasets can be far larger, with n potentially larger than 10^15.
We also analyze Prox-SVRG in the case where the f_i are smooth and potentially nonconvex, but the sum F is strongly convex. We build on the work of [20], which proves state-of-the-art complexity bounds for this setting, and show that we can attain a similar logarithmic speedup without modification. Lower bounds for this context are lacking, so it is unclear if this result can be further improved.

2 Optimal Convex SVRG

In this section, we show that the Prox-SVRG algorithm proposed in [21] for problem (1.1) can be sped up by a factor of Ω(1 + (ln(n/κ))₊) when κ = O(n). A similar speedup is clearly possible for vanilla SVRG and SARAH, which have similar rate expressions. We then refine the lower bound analysis of [17] to show that the complexity is optimal^9 when κ = O(n). Katyusha is optimal in the other scenario when n = O(κ) by [22].

Assumption 1. f_i is L_i-Lipschitz differentiable for i = 1, 2, ..., n. That is,

‖∇f_i(x) − ∇f_i(y)‖ ≤ L_i ‖x − y‖ for all x, y ∈ R^d.

f is L-Lipschitz differentiable. F is μ-strongly convex. That is,

F(y) ≥ F(x) + ⟨∇̃F(x), y − x⟩ + (μ/2)‖y − x‖² for all x, y ∈ R^d and ∇̃F(x) ∈ ∂F(x).

Assumption 2. f_i is convex for i = 1, 2, ..., n; and ψ is proper, closed, and convex.

9I.e. the complexity cannot be improved among a very broad class of finite-sum algorithms.

Algorithm 1 Prox-SVRG(F, x0, η, m)
Input: F(x) = ψ(x) + (1/n) Σ_{i=1}^n f_i(x), initial vector x0, step size η > 0, number of epochs K, probability distribution P = {p_1, . . . , p_n}
Output: vector x^K
1: for k ← 0, ..., K − 1 do
2:   M^k ∼ Geom(1/m);
3:   w_0 ← x^k; μ ← ∇f(x^k);
4:   for t ← 0, ..., M^k do
5:     pick i_t ∈ {1, 2, ..., n} ∼ P randomly;
6:     ∇̃_t = μ + (∇f_{i_t}(w_t) − ∇f_{i_t}(w_0))/(n p_{i_t});
7:     w_{t+1} = argmin_{y ∈ R^d} {ψ(y) + (1/(2η))‖y − w_t‖² + ⟨∇̃_t, y⟩};
8:   end for
9:   x^{k+1} ← w_{M^k+1};
10: end for

We make Assumption 1 throughout the paper, and Assumption 2 in this section. Recall the Prox-SVRG algorithm of [21], which we reproduce in Algorithm 1. The algorithm is organized into a series of K epochs of size M^k, where M^k is a geometric random variable with success probability 1/m. Hence epochs have an expected length of m. At the start of each epoch, a snapshot μ = ∇f(x^k) of the gradient is taken. Then for M^k steps, a random component gradient ∇f_{i_t}(w_t) is calculated, for an IID random variable i_t with fixed distribution P given by P[i_t = i] = p_i. This component gradient is used to calculate an unbiased estimate ∇̃_t of the true gradient ∇f(w_t). Each time, this estimate is then used to perform a proximal-gradient-like step with step size η. At the end of these M^k steps, a new epoch of size M^{k+1} is started, and the process continues.
We first recall a modified Theorem 1 from [21]. The difference is that in [21], the authors used an epoch length of m, whereas we use a random epoch length M^k with expectation E[M^k] = m. The proof and theorem statement require only trivial modifications to account for this. This modification is only to unify the different versions of SVRG in [21] and [20], and makes no difference to the result.
It becomes useful to define the effective Lipschitz constant L_Q = max_i L_i/(p_i n), and the effective condition number κ_Q = L_Q/μ for this algorithm. 
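As a concrete illustration, Algorithm 1 with uniform sampling (p_i = 1/n) can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' code; the names `prox_svrg` and `grad_i` are ours, and the proximal map defaults to the identity (the case ψ = 0).

```python
import numpy as np

def prox_svrg(grad_i, n, x0, eta, m, K, prox=lambda y: y, seed=0):
    # Sketch of Algorithm 1 with uniform sampling p_i = 1/n.
    # grad_i(i, x): gradient of the i-th component f_i at x.
    # prox: proximal map of eta*psi (identity when psi = 0).
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(K):
        Mk = rng.geometric(1.0 / m)                    # epoch length M^k ~ Geom(1/m), mean m
        mu = sum(grad_i(i, x) for i in range(n)) / n   # full-gradient snapshot at x^k
        w = x.copy()
        for _ in range(Mk):
            i = rng.integers(n)
            v = mu + grad_i(i, w) - grad_i(i, x)       # unbiased estimate of grad f(w)
            w = prox(w - eta * v)                      # proximal-gradient-like step
        x = w
    return x
```

The snapshot `mu` is what distinguishes this hybrid scheme from span-assumption methods such as SAGA: every epoch touches all n components once, regardless of which indices the inner loop samples.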
These reduce to the standard Lipschitz constant L and the standard condition number κ in the standard uniform scenario, where L_i = L for all i and P is uniform.
Theorem 1. Complexity of Prox-SVRG. Let Assumptions 1 and 2 hold. Then Prox-SVRG defined in Algorithm 1 satisfies

E[F(x^k) − F(x*)] ≤ ρ^k [F(x^0) − F(x*)]      (2.1)

for ρ = (1 + μη(1 + 4mL_Qη)) / (μηm(1 − 4L_Qη)).      (2.2)

In previous work, the optimal parameters were not really explored in much detail. In the original paper [5], the authors suggest η = 0.1/L, which results in linear convergence rate 1/4 ≤ ρ ≤ 1/2 for m ≥ 50κ. In [21], authors also suggest η = 0.1/L for m = 100κ, which yields ρ ≈ 5/6. However, they observe that η = 0.01/L works nearly as well. In [6], authors obtain a similar rate expression for SARAH and suggest η = 0.5/L and m = 4.5κ, which yields ρ ≈ 7/9. In the following corollary, we propose a choice of η and m that leads to an optimal complexity to within a constant factor for κ = O(n). This result helps explain why the optimal step size observed in prior work appears to be much smaller than the “standard” gradient descent step of 1/L.
Corollary 2. Let the conditions of Theorem 1 hold, and let m = n + 121κ_Q and η = κ_Q^{1/2} m^{−1/2}/(2L_Q). The Prox-SVRG in Algorithm 1 has convergence rate ρ ≤ (100/(121 + (n/κ_Q)))^{1/2}, and hence it needs

K(ε) = O( (n/(1 + (ln(n/κ_Q))₊) + κ_Q) ln(1/ε) + n + κ_Q )      (2.3)

iterations in expectation to obtain a point x^{K(ε)} such that E[f(x^{K(ε)}) − f(x*)] < ε.
This result is proven in Appendix A. 
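The parameter choice in Corollary 2 is easy to evaluate numerically. The sketch below (helper names are ours) computes m, η, and the rate bound under the assumed choice m = n + 121κ_Q and η = √(κ_Q/m)/(2L_Q), and converts the rate into an epoch count for a target accuracy:

```python
import math

def corollary2_params(n, kappa_q, l_q):
    # Parameter choice from Corollary 2: m = n + 121*kappa_Q and
    # eta = sqrt(kappa_Q/m)/(2*L_Q), giving rho <= sqrt(100/(121 + n/kappa_Q)).
    m = n + 121 * kappa_q
    eta = math.sqrt(kappa_q / m) / (2.0 * l_q)
    rho = math.sqrt(100.0 / (121.0 + n / kappa_q))
    return m, eta, rho

def epochs_for_accuracy(rho, eps):
    # Smallest k with rho**k <= eps.
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / rho))
```

For example, with n = 10^6 and κ_Q = 10^3 the bound gives ρ ≈ 0.30, so roughly a dozen epochs already reach ε = 10^{-6}; under the span assumption each of those digits of accuracy would instead cost a full pass of n component gradients.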
The n + κ_Q term is needed because we assume that at least one epoch is completed. For n = O(κ_Q), we have a similar convergence rate (ρ ≈ 10/11) and complexity to algorithms that follow the span assumption. For n ≫ κ_Q, we have a convergence rate ρ ∼ √(κ_Q/n) → 0, and complexity O((n/(1 + ln(n/κ))) ln(1/ε)), which can be much better than n ln(1/ε). See also Corollary 1. The corresponding result and proof for SARAH are nearly identical, and we do not include them.
In order to obtain this speedup, some estimate of the condition number must be known. However, this is often not a problem. In the case of a regularization term for empirical risk minimization, the strong convexity modulus is hand-picked based on the number of examples n. In other cases, we can simply tune an estimate of the parameter κ with the assurance that this can yield a logarithmic speedup.
Remark 1. In Theorem 1 and Corollary 2, the optimal choice of the probability distribution P = {p_1, p_2, ..., p_n} on {1, 2, ..., n} is p_i = L_i / Σ_{j=1}^n L_j for i = 1, 2, ..., n, and then L_Q = (Σ_{i=1}^n L_i)/n.

2.1 Optimality
The major difference between SAGA, SAG, Miso/Finito, and SDCA without duality on the one hand, and SVRG and SARAH on the other, is that the former satisfy the span condition (1.2). SVRG and SARAH do not, since they also involve full-gradient steps. We refer to SVRG and SARAH as hybrid methods, since they use full-gradient and partial-gradient information to calculate their iterations. We assume for simplicity that L_i = L for all i, and that ψ = 0.
We now present a rewording of Corollary 3 from [11].
Corollary 3. 
For every ε and randomized algorithm on (1.1) that follows the span assumption, there are a dimension d, and L-smooth, μ-strongly convex functions f_i on R^d, such that the algorithm takes at least Ω((n + √(κn)) ln(1/ε)) steps to reach sub-optimality E f(x^k) − f(x*) < ε.

The above algorithms that satisfy the span condition all have known upper complexity bounds of O((n + κ) ln(1/ε)), and hence for κ = O(n) we have a sharp convergence rate. However, it turns out that the span assumption is an obstacle to faster convergence when n ≫ κ (at least for sufficiently high dimension). In the following theorem, we improve^10 the analysis of [17] to show that the complexity of SVRG obtained in Corollary 2 is optimal to within a constant factor, without fundamentally different assumptions on the class of algorithms that are allowed. Clearly this also applies to SARAH. The theorem is actually far more general, and applies to a general class of algorithms called p-CLI oblivious algorithms introduced in [17]. This class contains all VR algorithms mentioned in this paper. In Appendix B, we recall the definition of p-CLI oblivious algorithms, as well as providing the proof of a more general version of Theorem 2.
Theorem 2. Lower complexity bound of Prox-SVRG and SARAH. For all μ, L, there exist L-smooth and μ-strongly convex functions f_i such that at least^11

K(ε) = Ω̃( (n/(1 + (ln(n/κ))₊) + √(nκ)) ln(1/ε) + n )      (2.4)

iterations are needed for SVRG or SARAH to obtain expected suboptimality E[f(x^{K(ε)}) − f(x*)] < ε.

2.2 SDCA
To complete the picture, in the following proposition, which we prove in Appendix C, we show that SDCA has a complexity lower bound of Ω(n ln(1/ε)). 
Hence it attains no logarithmic speedup.

10Specifically, we improve the analysis of Theorem 2 from this paper.
11We absorb some smaller low-accuracy terms (high ε) as is common practice. Exact lower bound expressions appear in the proof.

SDCA aims to solve the following problem:

min_{x ∈ R^d} F(x) = (1/n) Σ_{i=1}^n f_i(x) = (1/n) Σ_{i=1}^n (φ_i(x^T y_i) + (λ/2)‖x‖²),

where each y_i ∈ R^d, and φ_i : R → R is convex and smooth. It does so with coordinate minimization steps on the corresponding dual problem:

min_{α ∈ R^n} D(α) := (1/n) Σ_{i=1}^n φ*_i(−α_i) + (λ/2)‖(1/(λn)) Σ_{i=1}^n α_i y_i‖².

Here φ*_i(u) := max_z (zu − φ_i(z)) is the convex conjugate of φ_i. Let i_k be an IID sequence of uniform random variables on {1, ..., n}. SDCA updates a dual point α^k, while maintaining a corresponding primal vector x^k. SDCA can be written as:

α_i^{k+1} = α_i^k if i ≠ i_k, and α_i^{k+1} = argmin_z D(α_1^k, ..., α_{i−1}^k, z, α_{i+1}^k, ..., α_n^k) if i = i_k,      (2.5)

x^{k+1} = (1/(nλ)) Σ_{i=1}^n α_i^{k+1} y_i.      (2.6)

Since SDCA doesn't follow the span assumption, and the number of iterations k is much greater than the dual problem dimension n, different arguments to the ones used in [11] must be used. Motivated by the analysis in [23], which only proves a lower bound for dual suboptimality, we have the following lower complexity bound, which matches the upper complexity bound given in [7] for κ = O(n).
Proposition 1. Lower complexity bound of SDCA. 
For all μ, L, n > 2, there exist n functions f_i that are L-smooth and μ-strongly convex such that

K(ε) = Ω(n ln(1/ε))      (2.7)

iterations are needed for SDCA to obtain expected suboptimality E[F(x^{K(ε)}) − F(x*)] ≤ ε.

3 Why are hybrid methods faster?
In this section, we explain why SVRG and SARAH, which are a hybrid between full-gradient and VR methods, are fundamentally faster than other VR algorithms. We consider the performance of these algorithms on a variation of the adversarial function example from [11], [24]. The key insight is that the span condition makes this adversarial example hard to minimize, but that the full gradient steps of SVRG and SARAH make it easy when n ≫ κ. We conduct the analysis in ℓ², for simplicity^12, since the argument readily applies to R^d.
Consider the function introduced in [24], which we introduce for the case n = 1:

φ(x) = ((L − σ)/4) ((1/2)⟨x, Ax⟩ − ⟨e_1, x⟩),

where A is the tridiagonal matrix with 2 on the diagonal and −1 on the off-diagonals. The function φ(x) + (σ/2)‖x‖² is L-smooth and σ-strongly convex. Its minimizer is x* = (q_1, q_1², q_1³, . . .) for q_1 = (κ^{1/2} − 1)/(κ^{1/2} + 1). We assume that x^0 = 0 with no loss in generality. Let N(x) be the position of the last nonzero in the vector x; e.g. N(0, 2, 3, 0, 4, 0, 0, 0, . . .) = 5. N(x) is a control on how close x can be to the solution. If N(x) = N, then clearly:

‖x − x*‖² ≥ min_{y s.t. N(y)=N} ‖y − x*‖² = ‖(0, . . . , 0, q_1^{N+1}, q_1^{N+2}, . . .)‖² = q_1^{2N+2}/(1 − q_1²).

12This is the Hilbert space of sequences (x_i)_{i=1}^∞ with Σ_{i=1}^∞ x_i² < ∞.

Because of the tridiagonal pattern of nonzeros in the Hessian ∇²(φ(x) + (σ/2)‖x‖²)(y) = ((L − σ)/4)A + σI, the last nonzero N(x^k) of x^k can only increase by 1 per iteration for any algorithm that satisfies the span condition (e.g. gradient descent, accelerated gradient descent, etc.). Hence since we have N(x^0) = 0, we have ‖x^k − x*‖²/‖x^0 − x*‖² ≥ q_1^{2k}.
For the case n > 1, let the solution vector x = (x_1, . . . , x_n) be split into n coordinate blocks, and hence define:

f(x) = Σ_{i=1}^n (φ(x_i) + (σ/2)‖x‖²)      (3.1)
     = Σ_{i=1}^n ( ((L − σ)/4)((1/2)⟨x_i, Ax_i⟩ − ⟨e_1, x_i⟩) + ((σn)/2)‖x_i‖² ).      (3.2)

f is clearly the sum of n convex L-smooth functions φ(x_i) + (σ/2)‖x‖², that are σ-strongly convex. (3.2) shows it is σn-strongly convex and (L − σ + σn)-smooth with respect to coordinate x_i. Hence the minimizer is given by x_i = (q_n, q_n², q_n³, . . .) for q_n = (((κ − 1)/n + 1)^{1/2} − 1)/(((κ − 1)/n + 1)^{1/2} + 1) for all i. Similar to before, (N(x_1), . . . , N(x_n)) controls how close x can be to x*:

‖x − x*‖²/‖x*‖² = Σ_{i=1}^n ‖x_i − (q_n, q_n², . . .)‖² / (n q_n²/(1 − q_n²)) ≥ (1/n) Σ_{i=1}^n q_n^{2N(x_i)}.

Let I_{K,i} be the number of times that i_k = i for k = 0, 1, . . . , K − 1. For algorithms that satisfy the span assumption, we have N(x_i^k) ≤ I_{k,i}. If we assume that i_k is uniform, then I_{K,i} is a binomial random variable of probability 1/n and size k. Hence:

E‖x^k − x*‖²/‖x^0 − x*‖² ≥ E (1/n) Σ_{i=1}^n q_n^{2I_{k,i}} = (1 − n^{−1}(1 − q_n²))^k ≥ (1 − 2n^{−1})^k      (3.3)

for n ≥ κ. The second equality in (3.3) follows from the fact that I_{k,i} is a binomial random variable. Hence after 1 epoch (k = n), E‖x^k − x*‖² decreases by a factor of at most ≈ e², whereas for SVRG it decreases by at least a factor of ∼ (n/κ)^{1/2}, which is ≫ e² for n ≫ κ. To help understand why, consider trying the above analysis on SVRG for 1 epoch of size n. Because of the full-gradient step, we actually have N(w_i^n) ≤ 1 + I_{n,i}, and hence:

E‖w^n − x*‖²/‖x^0 − x*‖² ≥ E (1/n) Σ_{i=1}^n q_n^{2(I_{n,i}+1)} ≥ q_n² (1 − 2n^{−1})^n ≈ ((1/4)(κ − 1)/n)² e^{−2}

for n ≫ κ. Hence attempting the above analysis results in a much weaker lower bound. What it comes down to is that when n ≫ κ, we have E (1/n) Σ_{i=1}^n q_n^{2I_{i,k}} ≫ (1/n) Σ_{i=1}^n q_n^{2E I_{i,k}}.
The interpretation is that for this objective, the progress towards a solution is limited by the component function f_i that is minimized the least. The full gradient step ensures that at least some progress is made toward minimizing every f_i. 
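The second equality in (3.3) is just the probability generating function of a binomial random variable: for I ~ Binomial(k, 1/n), E[t^I] = (1 + (t − 1)/n)^k with t = q². This identity is easy to sanity-check numerically; the sketch below (function names ours) compares the closed form against a Monte Carlo estimate:

```python
import numpy as np

def exact_decay(q, n, k):
    # E[q**(2I)] for I ~ Binomial(k, 1/n), via the binomial
    # probability generating function with t = q**2.
    return (1.0 - (1.0 - q * q) / n) ** k

def simulated_decay(q, n, k, trials=200000, seed=1):
    # Monte Carlo estimate of the same expectation.
    rng = np.random.default_rng(seed)
    counts = rng.binomial(k, 1.0 / n, size=trials)  # samples of I_{k,i}
    return float(np.mean(q ** (2.0 * counts)))
```

With k = n (one epoch), the exact value (1 − (1 − q²)/n)^n stays above e^{−2} no matter how small q is, which is precisely the bottleneck for span-assumption methods: the blocks that are never sampled dominate the expectation.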
For algorithms that follow the span assumption, there will invariably be many indices i for which no gradient ∇f_i is calculated, and hence x_i^k can make no progress towards the minimum. This may be related to the observation that sampling without replacement can often speed up randomized algorithms. However, on the other hand, it is well known that full gradient methods fail to achieve a good convergence rate for other objectives with the same parameters μ, L, n (e.g. f(x) = (1/n) Σ_{i=1}^n φ(x) + (μ/2)‖x‖²). Hence we conclude that it is because SVRG is a hybrid method, which combines both full-gradient and VR elements, that it is able to outperform both VR and full-gradient algorithms.

4 Prox-SVRG for strongly convex sums of smooth nonconvex functions

In this section we show that this logarithmic speedup is still possible if we relax Assumption 2: that each f_i is convex. By Assumption 1, the functions f_i are still smooth, though possibly nonconvex. The sum F, though, is strongly convex and smooth. This is based on the analysis of Prox-SVRG found in [20]. The proof of Theorem 3 can be found in Appendix D.

Theorem 3. Under Assumption 1, let x* = argmin_x F(x), L̄ = (Σ_{i=1}^n L_i²/(n²p_i))^{1/2}, κ = L/μ, and η = (1/2) min{1/L̄, (1/(mL̄²))^{1/2}}. Then the Prox-SVRG in Algorithm 1 satisfies

E[F(x^k) − F(x*)] ≤ O(ρ^k)[F(x^0) − F(x*)],      (4.1)

for ρ = 1/(1 + (1/2)mημ).      (4.2)

Hence for m = n, in order to obtain an ε-optimal solution in terms of function value, the SVRG in Algorithm 1 needs at most

K = O( ( n/ln(1 + n/(4κ)) + κ + √n (L̄/μ)/ln(1 + (nμ²/(4L̄²))^{1/2}) ) ln(1/ε) + 2n )      (4.3)

gradient evaluations in expectation. The complexity of nonconvex SVRG using the original analysis of [9] would have been

K = O( (n + κ + √n (L̄/μ)) ln(1/ε) ).      (4.4)

Hence we have obtained a similar logarithmic speedup to the one obtained in Corollary 2. There are no known nontrivial lower bounds in this regime, and so it is not immediately clear whether our complexity is optimal.
Remark 2. In Theorem 3, the optimal choice of the probability distribution P = {p_1, p_2, ..., p_n} on {1, 2, ..., n} is p_i = L_i²/Σ_{j=1}^n L_j² for i = 1, 2, ..., n, and then L̄ = (Σ_{i=1}^n L_i²/n)^{1/2}.

5 Experiments
In this section we compare the performance of SVRG and SARAH to SAGA to verify our conclusions. We solve the regularized least squares problem

minimize (1/(2n))‖Ax − b‖² + (λ/2)‖x‖².      (5.1)

The matrix A and vector b are generated randomly with entries uniformly distributed between 0 and 1. In this experiment, A has n = 16000 rows and 20 columns. The parameter λ is chosen to control the condition number κ = L/μ of the problem. Figure 5.1 compares SAGA, SVRG, and SARAH for three instances of problem (5.1) with condition numbers κ = 5, κ = 10, and κ = 20. 
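For problem (5.1), the condition number can be set exactly through λ: with s_max and s_min the extreme eigenvalues of AᵀA/n, we have L = s_max + λ and μ = s_min + λ, so solving (s_max + λ)/(s_min + λ) = κ gives the ridge parameter directly. The paper does not give its tuning code; the helper below (name ours) is a sketch of this calculation:

```python
import numpy as np

def lambda_for_kappa(A, kappa):
    # Solve (s_max + lam)/(s_min + lam) = kappa for the ridge parameter lam,
    # where s_max, s_min are the extreme eigenvalues of A^T A / n.
    s = np.linalg.eigvalsh(A.T @ A / A.shape[0])
    s_min, s_max = float(s[0]), float(s[-1])
    assert kappa < s_max / s_min, 'target kappa must be below the unregularized one'
    return (s_max - kappa * s_min) / (kappa - 1.0)
```

This only produces a positive λ when the target κ is smaller than the unregularized condition number s_max/s_min, which holds comfortably for the small targets κ = 5, 10, 20 used here, since random matrices with positive-mean entries are badly conditioned.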
In order to provide a fair comparison, step sizes were tuned individually for each algorithm and each problem instance.

(a) κ = 5  (b) κ = 10  (c) κ = 20
Figure 5.1: Comparison of SAGA, SVRG, and SARAH for various values of the condition number κ.

There does appear to be a small but noticeable effect when n/κ is large. In all our experiments, SAGA appears to converge very quickly initially, but slows significantly after a few iterations. To compensate for this, we compare the convergence speed of the three algorithms after the first few iterations in terms of decibels per epoch^13. For κ = 5, we obtain convergence speeds of 2.1, 4.9, 6.3 decibels/epoch for SAGA, SARAH, and SVRG respectively. For κ = 10 these values are 2.3, 5.0, and 6.2 respectively; and for κ = 20 these values are 2.1, 4.3, and 6.0. So SVRG converges over 3× faster than SAGA in the long term, even though SVRG iterations are twice as expensive as SAGA's.

13Decibels are a logarithmic scale: 10 decibels corresponds to a 10-fold increase, 20 decibels corresponds to a 100-fold increase. This is the natural way to compare speeds for linearly converging error.

It is not yet clear whether this effect will have practical impact. However, we see a few obvious future directions. Firstly, SVRG and SARAH were never intentionally designed to exploit this logarithmic speedup; it is possible that designing an algorithm with this in mind will yield a greater speedup. Secondly, it should be investigated whether the initial speed burst of SAGA can be incorporated into an SVRG-like algorithm, which would make it more competitive. Thirdly, SVRG and SARAH have iterations that are twice as expensive as SAGA's because of the full gradient steps. It should be investigated whether there is a way of retaining this logarithmic speedup while reducing the iteration cost. 
Perhaps large-batch gradients instead of full gradients will suffice to yield this speedup while avoiding the high cost of full-gradient steps.

References

[1] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives," in Advances in Neural Information Processing Systems 27, 2014, pp. 1646–1654. [Online]. Available: http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives.pdf.

[2] N. L. Roux, M. Schmidt, and F. R. Bach, "A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets," in Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 2663–2671. [Online]. Available: http://papers.nips.cc/paper/4633-a-stochastic-gradient-method-with-an-exponential-convergence-rate-for-finite-training-sets.pdf.

[3] A. Defazio, J. Domke, and T. Caetano, "Finito: A faster, permutable incremental gradient method for big data problems," in International Conference on Machine Learning, Jan. 2014, pp. 1125–1133. [Online]. Available: http://proceedings.mlr.press/v32/defazio14.html.

[4] J. Mairal, "Optimization with First-Order Surrogate Functions," in International Conference on Machine Learning, Feb. 2013, pp. 783–791. [Online]. Available: http://proceedings.mlr.press/v28/mairal13.html.

[5] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Advances in Neural Information Processing Systems 26, 2013, pp. 315–323. [Online]. Available: http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.

[6] L. M. Nguyen, J. Liu, K. Scheinberg, and M.
Takáč, "SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient," arXiv:1703.00102 [cs, math, stat], Feb. 2017. [Online]. Available: http://arxiv.org/abs/1703.00102.

[7] S. Shalev-Shwartz and T. Zhang, "Stochastic dual coordinate ascent methods for regularized loss," J. Mach. Learn. Res., vol. 14, no. 1, pp. 567–599, Feb. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2502581.2502598.

[8] S. Shalev-Shwartz, "SDCA without Duality," arXiv:1502.06177, Feb. 2015. [Online]. Available: http://arxiv.org/abs/1502.06177.

[9] Z. Allen-Zhu, "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods," in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, ser. STOC 2017, New York, NY, USA: ACM, 2017, pp. 1200–1205. [Online]. Available: http://doi.acm.org/10.1145/3055399.3055448.

[10] A. Defazio, "A simple practical accelerated method for finite sums," in Advances in Neural Information Processing Systems, 2016, pp. 676–684.

[11] G. Lan and Y. Zhou, "An optimal randomized incremental gradient method," Mathematical Programming, pp. 1–49, Jun. 2017. [Online]. Available: https://link.springer.com/article/10.1007/s10107-017-1173-0.

[12] H. Lin, J. Mairal, and Z. Harchaoui, "A Universal Catalyst for First-Order Optimization," arXiv:1506.02186, Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.02186.

[13] Q. Lin, Z. Lu, and L. Xiao, "An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization," arXiv:1407.1296, Jul. 2014. [Online]. Available: http://arxiv.org/abs/1407.1296.

[14] S. Shalev-Shwartz and T. Zhang, "Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization," Mathematical Programming, vol. 155, no. 1-2, pp.
105–145, Jan. 2016. [Online]. Available: http://link.springer.com/article/10.1007/s10107-014-0839-0.

[15] G. Lan and Y. Zhou, "An optimal randomized incremental gradient method," arXiv:1507.02000, Jul. 2015. [Online]. Available: http://arxiv.org/abs/1507.02000.

[16] B. Woodworth and N. Srebro, "Tight Complexity Bounds for Optimizing Composite Objectives," arXiv:1605.08003, May 2016. [Online]. Available: http://arxiv.org/abs/1605.08003.

[17] Y. Arjevani and O. Shamir, "Dimension-Free Iteration Complexity of Finite Sum Optimization Problems," arXiv:1606.09333 [cs, math], Jun. 2016. [Online]. Available: http://arxiv.org/abs/1606.09333.

[18] K. Sridharan, S. Shalev-Shwartz, and N. Srebro, "Fast Rates for Regularized Objectives," in Advances in Neural Information Processing Systems 21, Curran Associates, Inc., 2009, pp. 1545–1552. [Online]. Available: http://papers.nips.cc/paper/3400-fast-rates-for-regularized-objectives.pdf.

[19] M. Eberts and I. Steinwart, "Optimal learning rates for least squares SVMs using Gaussian kernels," in Advances in Neural Information Processing Systems 24, Curran Associates, Inc., 2011, pp. 1539–1547. [Online]. Available: http://papers.nips.cc/paper/4216-optimal-learning-rates-for-least-squares-svms-using-gaussian-kernels.pdf.

[20] Z. Allen-Zhu, "Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization," arXiv:1802.03866, Feb. 2018. [Online]. Available: http://arxiv.org/abs/1802.03866.

[21] L. Xiao and T. Zhang, "A Proximal Stochastic Gradient Method with Progressive Variance Reduction," arXiv:1403.4699, Mar. 2014. [Online]. Available: http://arxiv.org/abs/1403.4699.

[22] Y. Arjevani and O. Shamir, "On the Iteration Complexity of Oblivious First-Order Optimization Algorithms," arXiv:1605.03529 [cs, math], May 2016. [Online].
Available: http://arxiv.org/abs/1605.03529.

[23] Y. Arjevani, "On Lower and Upper Bounds in Smooth Strongly Convex Optimization - A Unified Approach via Linear Iterative Methods," arXiv:1410.6387, Oct. 2014. [Online]. Available: http://arxiv.org/abs/1410.6387.

[24] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, Dec. 2013. [Online]. Available: https://dl.acm.org/citation.cfm?id=2670022.