{"title": "Towards closing the gap between the theory and practice of SVRG", "book": "Advances in Neural Information Processing Systems", "page_first": 648, "page_last": 658, "abstract": "Amongst the very first variance reduced stochastic methods for solving the empirical risk minimization problem was the SVRG method. SVRG is an inner-outer loop based method, where in the outer loop a reference full gradient is evaluated, after which $m \\in \\N$ steps of an inner loop are executed where the reference gradient is used to build a variance reduced estimate of the current gradient.\nThe simplicity of the SVRG method and its analysis have lead to multiple extensions and variants for even non-convex optimization. Yet there is a significant gap between the parameter settings that the analysis suggests and what is known to work well in practice. Our first contribution is that we take several steps towards closing this gap. In particular, the current analysis shows that $m$ should be of the order of the condition number so that the resulting method has a favorable complexity. Yet in practice $m=n$ works well regardless of the condition number, where $n$ is the number of data points. Furthermore, the current analysis shows that the inner iterates have to be reset using averaging after every outer loop. Yet in practice SVRG works best when the inner iterates are updated continuously and not reset. We provide an analysis of these aforementioned practical settings and show that they achieve the same favorable complexity as the original analysis (with slightly better constants). Our second contribution is to provide a more general analysis than had been previously done by using arbitrary sampling, which allows us to analyze virtually all forms of mini-batching through a single theorem. 
Since our setup and analysis reflect what is done in practice, we are able to set the parameters such as the mini-batch size and step size using our theory in such a way that produces a more efficient algorithm in practice, as we show in extensive numerical experiments.", "full_text": "Towards closing the gap between the theory and practice of SVRG

Othmane Sebbouh
LTCI, Télécom Paris
Institut Polytechnique de Paris
othmane.sebbouh@gmail.com

Nidham Gazagnadou
LTCI, Télécom Paris
Institut Polytechnique de Paris
nidham.gazagnadou@telecom-paris.fr

Samy Jelassi
ORFE Department
Princeton University
sjelassi@princeton.edu

Francis Bach
INRIA - École Normale Supérieure
PSL Research University
francis.bach@inria.fr

Robert M. Gower
LTCI, Télécom Paris
Institut Polytechnique de Paris
robert.gower@telecom-paris.fr

Abstract

Amongst the very first variance reduced stochastic methods for solving the empirical risk minimization problem was the SVRG method [13]. SVRG is an inner-outer loop based method, where in the outer loop a reference full gradient is evaluated, after which m ∈ N steps of an inner loop are executed, where the reference gradient is used to build a variance reduced estimate of the current gradient. The simplicity of the SVRG method and its analysis have led to multiple extensions and variants, even for non-convex optimization. Yet there is a significant gap between the parameter settings that the analysis suggests and what is known to work well in practice. Our first contribution is that we take several steps towards closing this gap. In particular, the current analysis shows that m should be of the order of the condition number so that the resulting method has a favorable complexity.
Yet in practice m = n works well regardless of the condition number, where n is the number of data points. Furthermore, the current analysis shows that the inner iterates have to be reset using averaging after every outer loop. Yet in practice SVRG works best when the inner iterates are updated continuously and not reset. We provide an analysis of these aforementioned practical settings and show that they achieve the same favorable complexity as the original analysis (with slightly better constants). Our second contribution is to provide a more general analysis than had been previously done by using arbitrary sampling, which allows us to analyse virtually all forms of mini-batching through a single theorem. Since our setup and analysis reflect what is done in practice, we are able to set the parameters, such as the mini-batch size and step size, using our theory in such a way that produces a more efficient algorithm in practice, as we show in extensive numerical experiments.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Consider the problem of minimizing a μ–strongly convex and L–smooth function f, where

x* = arg min_{x ∈ R^d} (1/n) Σ_{i=1}^n f_i(x) =: f(x),    (1)

and each f_i is convex and L_i–smooth. Several training problems in machine learning fit this format, e.g. least-squares, logistic regression and conditional random fields. Typically each f_i represents the regularized loss of the i-th data point. When n is large, algorithms that rely on full passes over the data, such as gradient descent, are no longer competitive. Instead, the stochastic version of gradient descent, SGD [26], is often used since it requires only a mini-batch of data to make progress towards the solution.
However, SGD suffers from high variance, which keeps the algorithm from converging unless a carefully hand-tuned decreasing sequence of step sizes is chosen. This often results in cumbersome parameter tuning and slow convergence.
To address this issue, many variance reduced methods have been designed in recent years, including SAG [27], SAGA [6] and SDCA [28], which require only a constant step size to achieve linear convergence. In this paper, we are interested in variance reduced methods with an inner-outer loop structure, such as S2GD [14], SARAH [21], L-SVRG [16] and the original SVRG [13] algorithm. Here we present not only a more general analysis that allows for any mini-batching strategy, but also a more practical analysis, by analysing methods that are based on what works in practice, and thus providing an analysis that can inform practice.

2 Background and Contributions

Convergence under arbitrary samplings. We give the first arbitrary sampling convergence results for SVRG type methods in the convex setting¹. That is, our analysis includes all forms of sampling, including mini-batching and importance sampling as special cases. To better understand the significance of this result, we use mini-batching b elements without replacement as a running example throughout the paper. With this sampling, the update step of SVRG, starting from x_0 = w_0 ∈ R^d, takes the form

x_{t+1} = x_t − α ( (1/b) Σ_{i∈B} ∇f_i(x_t) − (1/b) Σ_{i∈B} ∇f_i(w_{s−1}) + ∇f(w_{s−1}) ),    (2)

where α > 0 is the step size, B ⊆ [n] def= {1, . . . , n} and b = |B|. Here w_{s−1} is the reference point, which is updated after m ∈ N steps, the x_t's are the inner iterates and m is the loop length.

Figure 1: Left: the total complexity (3) for random Gaussian data; right: the step size (4) as b increases.
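To make the update (2) concrete, here is a minimal sketch of one inner step in Python (the paper's own experiments are in Julia; the function and variable names below are ours, purely for illustration):

```python
import numpy as np

def svrg_inner_step(x, w_ref, grad_full_ref, grad_fi, n, b, alpha, rng):
    """One inner step of mini-batch SVRG, following update (2).

    grad_fi(i, x) returns the gradient of f_i at x; grad_full_ref is the
    precomputed full gradient at the reference point w_ref.
    """
    batch = rng.choice(n, size=b, replace=False)  # b-nice sampling of B
    g = np.mean([grad_fi(i, x) for i in batch], axis=0) \
        - np.mean([grad_fi(i, w_ref) for i in batch], axis=0) \
        + grad_full_ref
    return x - alpha * g
```

Note the variance reduction property at work: when x coincides with the reference point w_ref, the two mini-batch terms cancel and the step is exactly a full-gradient step.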
As a special case of our forthcoming analysis in Corollary 4.1, we show that the total complexity of the SVRG method based on (2) to reach an ε > 0 accurate solution has a simple expression which depends on n, m, b, μ, L and Lmax def= max_{i∈[n]} L_i:

C_m(b) def= 2 (n/m + 2b) max{ (3/b)·((n − b)/(n − 1))·(Lmax/μ) + ((b − 1)/(n − 1))·(n/b)·(L/μ), m } log(1/ε),    (3)

so long as the step size is

α = (1/2) · b(n − 1) / (3(n − b)Lmax + n(b − 1)L).    (4)

By total complexity we mean the total number of individual ∇f_i gradients evaluated. This shows that the total complexity is a simple function of the loop length m and the mini-batch size b. See Figure 1 for an example of how the total complexity evolves as we increase the mini-batch size.

¹ SVRG has very recently been analysed under arbitrary samplings in the non-convex setting [12].

Optimal mini-batch and step sizes for SVRG. The size of the mini-batch b is often left as a parameter for the user to choose, or set using a rule of thumb. The current analysis in the literature for mini-batching shows that when increasing the mini-batch size b, while the iteration complexity can decrease², the total complexity increases or is invariant. See for instance results in the non-convex case [22, 25], and for the convex case [10], [15], [1] and finally [18], where one can find the iteration complexity of several variants of SVRG with mini-batching. However, in practice, mini-batching can often lead to faster algorithms.
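The expressions (3) and (4) are simple enough to evaluate directly. The following sketch (hypothetical helper names, not from the paper's code) shows that on an ill-conditioned example the total complexity can drop as b grows while the step size grows:

```python
import numpy as np

def step_size(b, n, L, Lmax):
    """Step size (4): alpha = (1/2) * b(n-1) / (3(n-b)Lmax + n(b-1)L)."""
    return 0.5 * b * (n - 1) / (3 * (n - b) * Lmax + n * (b - 1) * L)

def total_complexity(b, m, n, mu, L, Lmax, eps):
    """Total complexity C_m(b) of (3): number of individual gradient evaluations."""
    rate = (3 / b) * ((n - b) / (n - 1)) * (Lmax / mu) \
         + ((b - 1) / (n - 1)) * (n / b) * (L / mu)
    return 2 * (n / m + 2 * b) * max(rate, m) * np.log(1 / eps)
```

For instance, with n = 1000, μ = 10⁻³, L = 1, Lmax = 10 and m = n, one finds C_n(2) < C_n(1) while α(2) > α(1), illustrating the behaviour plotted in Figure 1.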
In contrast, our total complexity (3) clearly highlights that when increasing the mini-batch size, the total complexity can decrease while the step size increases, as can be seen in our plot of (3) and (4) in Figure 1. Furthermore, C_m(b) is a convex function of b, which allows us to determine the optimal mini-batch size a priori. For m = n – a widely used loop length in practice – the optimal mini-batch size, depending on the problem setting, is given in Table 1. Moreover, we can also determine the optimal loop length.

Table 1: Optimal mini-batch sizes for Algorithm 1 with m = n, for each of the possible problem settings. Notations: b̂ = sqrt( (n/2) · (3Lmax − L)/(nL − 3Lmax) ), b̃ = (3Lmax − L)n / (n(n − 1)μ − nL + 3Lmax).

  n ≤ L/μ:             b* = n if Lmax ≥ nL/3;  b* = ⌊b̂⌋ if Lmax < nL/3
  L/μ < n < 3Lmax/μ:   b* = ⌊b̃⌋ if Lmax ≥ nL/3;  b* = ⌊min{b̂, b̃}⌋ if Lmax < nL/3
  n ≥ 3Lmax/μ:         b* = 1

The reason we were able to achieve these new tighter mini-batch complexity bounds was by using the recently introduced concept of expected smoothness [9] alongside a new constant we introduce in this paper called the expected residual constant. The expected smoothness and residual constants, which we present later in Lemmas 4.1 and 4.2, show how mini-batching (and arbitrary sampling in general) combined with the smoothness of our data can determine how smooth in expectation our resulting mini-batched functions are. The expected smoothness constant has been instrumental in providing a tight mini-batch analysis for SGD [8], SAGA [7] and now SVRG.

New practical variants. We took special care so that our analysis allows for practical parameter settings.
In particular, often the loop length is set to m = n, or m = n/b in the case of mini-batching³. And yet, the classical SVRG analysis given in [13] requires m ≥ 20Lmax/μ in order to ensure a resulting iteration complexity of O((n + Lmax/μ) log(1/ε)). Furthermore, the standard SVRG analysis relies on averaging the x_t inner iterates after every m iterations of (2), yet this too is not what works well in practice⁴. To remedy this, we propose Free-SVRG, a variant of SVRG where the inner iterates are not averaged at any point. Furthermore, by developing a new Lyapunov style convergence analysis for Free-SVRG, our analysis holds for any choice of m, and in particular, for m = n we show that the resulting complexity is also given by O((n + Lmax/μ) log(1/ε)).
The only downside of Free-SVRG is that the reference point is set using a weighted averaging based on the strong convexity parameter. To fix this issue, [11], and later [16, 17], proposed a loopless version of SVRG. This loopless variant has no explicit inner-loop structure; it instead updates the reference point based on a coin toss, and it requires no knowledge of the strong convexity parameter and no averaging whatsoever. We introduce L-SVRG-D, an improved variant of Loopless-SVRG that takes much larger step sizes after the reference point is reset, and gradually smaller step sizes thereafter.

² Note that the total complexity is equal to the iteration complexity times the mini-batch size b.
³ See for example the lightning package from scikit-learn [23]: http://contrib.scikit-learn.org/lightning/ and [21] for examples where m = n. See [2] for an example where m = 5n/b.
⁴ Perhaps an exception to the above issues in the literature is the Katyusha method and its analysis [1], which is an accelerated variant of SVRG.
In [1] the author shows that using a loop length m = 2n and by not averaging the inner iterates, the Katyusha method achieves the accelerated complexity of O((n + sqrt(nLmax/μ)) log(1/ε)). Though a remarkable advance in the theory of accelerated methods, the analysis in [1] does not hold for the unaccelerated case. This is important since, contrary to the name, the accelerated variants of stochastic methods are not always faster than their non-accelerated counterparts. Indeed, acceleration only helps in the stochastic setting when Lmax/μ ≥ n, in other words for problems that are sufficiently ill-conditioned.

We provide a complexity analysis of L-SVRG-D that allows for arbitrary sampling and achieves the same complexity as Free-SVRG, albeit at the cost of introducing more variance into the procedure due to the coin toss.

3 Assumptions and Sampling

We collect all of the assumptions we use in the following.
Assumption 3.1. There exist L ≥ 0 and μ ≥ 0 such that for all x, y ∈ R^d,

f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖²,    (5)
f(x) ≤ f(y) + ⟨∇f(x), x − y⟩ − (μ/2)‖x − y‖².    (6)

We say that f is L–smooth (5) and μ–strongly convex (6). Moreover, for all i ∈ [n], f_i is convex and there exists L_i ≥ 0 such that f_i is L_i–smooth.
So that we can analyse all forms of mini-batching simultaneously through arbitrary sampling, we make use of a sampling vector.
Definition 3.1 (The sampling vector). We say that the random vector v = [v_1, . . . , v_n] ∈ R^n with distribution D is a sampling vector if E_D[v_i] = 1 for all i ∈ [n].
With a sampling vector we can compute an unbiased estimate of f(x) and ∇f(x) via

f_v(x) def= (1/n) Σ_{i=1}^n v_i f_i(x)  and  ∇f_v(x) def= (1/n) Σ_{i=1}^n v_i ∇f_i(x).    (7)

Indeed these are unbiased estimators since

E_D[f_v(x)] = (1/n) Σ_{i=1}^n E_D[v_i] f_i(x) = (1/n) Σ_{i=1}^n f_i(x) = f(x).    (8)

Likewise we can show that E_D[∇f_v(x)] = ∇f(x). Computing ∇f_v is cheaper than computing the full gradient ∇f whenever v is a sparse vector. In particular, this is the case when the support of v is based on a mini-batch sampling.
Definition 3.2 (Sampling). A sampling S ⊆ [n] is any random set-valued map which is uniquely defined by the probabilities p_B def= P(S = B) for all B ⊆ [n], where Σ_{B⊆[n]} p_B = 1. A sampling S is called proper if for every i ∈ [n] we have that p_i def= P[i ∈ S] = Σ_{C : i∈C} p_C > 0.
We can build a sampling vector using a sampling as follows.
Lemma 3.1 (Sampling vector). Let S be a proper sampling. Let p_i def= P[i ∈ S] and P def= Diag(p_1, . . . , p_n). Let v = v(S) be a random vector defined by

v(S) = P⁻¹ Σ_{i∈S} e_i = P⁻¹ e_S.    (9)

It follows that v is a sampling vector.
Proof. The i-th coordinate of v(S) is v_i(S) = 1(i ∈ S)/p_i and thus

E[v_i(S)] = E[1(i ∈ S)]/p_i = P[i ∈ S]/p_i = p_i/p_i = 1.

Our forthcoming analysis holds for all samplings. However, we will pay particular attention to b-nice sampling, otherwise known as mini-batching without replacement, since it is often used in practice.
Definition 3.3 (b-nice sampling). S is a b-nice sampling if it is a sampling such that

P[S = B] = 1/(n choose b),  for all B ⊆ [n] with |B| = b.

To construct a sampling vector based on b–nice sampling, note that p_i = b/n for all i ∈ [n], and thus we have that v(S) = (n/b) Σ_{i∈S} e_i according to Lemma 3.1. The resulting subsampled function is then f_v(x) = (1/|S|) Σ_{i∈S} f_i(x), which is simply the mini-batch average over S.
Using arbitrary sampling also allows us to consider non-uniform samplings, and for completeness, we present this sampling and several others in Appendix D.

4 Free-SVRG: freeing up the inner loop size

Similarly to SVRG, Free-SVRG is an inner-outer loop variance reduced algorithm. It differs from the original SVRG [13] on two major points: how the reference point is reset and how the first iterate of the inner loop is defined, see Algorithm 1⁵.
First, in SVRG, the reference point is the average of the iterates of the inner loop. Thus, old iterates and recent iterates have equal weights in the average. This is counterintuitive, as one would expect that to reduce the variance of the gradient estimate used in (2), one needs a reference point which is closer to the more recent iterates. This is why, inspired by [20], we use the weighted averaging in Free-SVRG given in (10), which gives more importance to recent iterates compared to old ones.
Second, in SVRG, the first iterate of the inner loop is reset to the reference point. Thus, the inner iterates of the algorithm are not updated using a one step recurrence. In contrast, Free-SVRG defines the first iterate of the inner loop as the last iterate of the previous inner loop, as is also done in practice.
These changes and a new Lyapunov function analysis are what allow us to freely choose the size of the inner loop⁶. To declutter the notation, we define for a given step size α > 0:

S_m def= Σ_{i=0}^{m−1} (1 − αμ)^{m−1−i}  and  p_t def= (1 − αμ)^{m−1−t} / S_m  for t = 0, . . . , m − 1.    (10)

Algorithm 1 Free-SVRG
  Parameters: inner-loop length m, step size α, a sampling vector v ∼ D, and p_t defined in (10)
  Initialization: w_0 = x_0^m ∈ R^d
  for s = 1, 2, . . . , S do
    x_s^0 = x_{s−1}^m
    for t = 0, 1, . . . , m − 1 do
      Sample v_t ∼ D
      g_s^t = ∇f_{v_t}(x_s^t) − ∇f_{v_t}(w_{s−1}) + ∇f(w_{s−1})
      x_s^{t+1} = x_s^t − α g_s^t
    w_s = Σ_{t=0}^{m−1} p_t x_s^t
  return x_S^m

4.1 Convergence analysis

Our analysis relies on two important constants called the expected smoothness constant and the expected residual constant. Their existence is a result of the smoothness of the function f and that of the individual functions f_i, i ∈ [n].
Lemma 4.1 (Expected smoothness, Theorem 3.6 in [8]). Let v ∼ D be a sampling vector and assume that Assumption 3.1 holds. There exists 𝓛 ≥ 0 such that for all x ∈ R^d,

E_D[ ‖∇f_v(x) − ∇f_v(x*)‖² ] ≤ 2𝓛 (f(x) − f(x*)).    (11)

Lemma 4.2 (Expected residual). Let v ∼ D be a sampling vector and assume that Assumption 3.1 holds. There exists ρ ≥ 0 such that for all x ∈ R^d,

E_D[ ‖∇f_v(x) − ∇f_v(x*) − ∇f(x)‖² ] ≤ 2ρ (f(x) − f(x*)).    (12)

⁵ After submitting this work, it has come to our attention that Free-SVRG is a special case of k-SVRG [24] when k = 1.
⁶ Hence the name of our method Free-SVRG.

For completeness, the proof of Lemma 4.1 is given in Lemma E.1 in the supplementary material. The proof of Lemma 4.2 is also given in the supplementary material, in Lemma F.1.
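As a concrete illustration of Algorithm 1, here is a minimal Python sketch for the b = 1 case (uniform single-element sampling). The paper's experiments are in Julia, and the names below are ours:

```python
import numpy as np

def free_svrg(grad_fi, grad_f, x0, n, m, alpha, mu, n_outer, rng):
    """Sketch of Free-SVRG (Algorithm 1) with uniform single-element sampling.

    Unlike classical SVRG, the inner iterates are never reset, and the
    reference point w_s is the weighted average (10) of the inner iterates,
    with more weight on recent iterates.
    """
    # Weights p_t = (1 - alpha*mu)^(m-1-t) / S_m from (10), for t = 0, ..., m-1.
    p = (1 - alpha * mu) ** np.arange(m - 1, -1, -1)
    p /= p.sum()
    x, w = x0.copy(), x0.copy()
    for _ in range(n_outer):
        gw = grad_f(w)                    # full gradient at the reference point
        inner = np.empty((m, x.size))
        for t in range(m):
            inner[t] = x                  # store x_s^t before the update
            i = rng.integers(n)           # sample v_t: 1-nice sampling
            g = grad_fi(i, x) - grad_fi(i, w) + gw
            x = x - alpha * g
        w = p @ inner                     # w_s = sum_t p_t x_s^t
    return x
```

On a small least-squares problem with the b = 1 step size α = 1/(6Lmax) from the forthcoming Corollary 4.3, the iterates converge linearly to the minimizer, as the theory predicts.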
Indeed, all proofs are deferred to the supplementary material.
Though Lemma 4.1 establishes the existence of the expected smoothness constant 𝓛, it was only very recently that a tight estimate of 𝓛 was conjectured in [7] and proven in [8]. In particular, for our working example of b–nice sampling, the constants 𝓛 and ρ have simple closed formulae that depend on b.
Lemma 4.3 (𝓛 and ρ for b-nice sampling). Let v be a sampling vector based on the b–nice sampling. It follows that

𝓛 = 𝓛(b) def= (1/b)·((n − b)/(n − 1))·Lmax + (n/b)·((b − 1)/(n − 1))·L,    (13)
ρ = ρ(b) def= (1/b)·((n − b)/(n − 1))·Lmax.    (14)

The reason that the expected smoothness and expected residual constants are so useful in obtaining a tight mini-batch analysis is that, as the mini-batch size b goes from n to 1, 𝓛(b) (resp. ρ(b)) gracefully interpolates between the smoothness of the full function, 𝓛(n) = L (resp. ρ(n) = 0), and the smoothness of the individual f_i functions, 𝓛(1) = Lmax (resp. ρ(1) = Lmax). Also, we can bound the second moment of a variance reduced gradient estimate using 𝓛 and ρ as follows.
Lemma 4.4. Let Assumption 3.1 hold. Let x, w ∈ R^d and let v ∼ D be a sampling vector. Consider g(x, w) def= ∇f_v(x) − ∇f_v(w) + ∇f(w). As a consequence of (11) and (12) we have that

E_D[ ‖g(x, w)‖² ] ≤ 4𝓛 (f(x) − f(x*)) + 4ρ (f(w) − f(x*)).    (15)

Next we present a new Lyapunov style convergence analysis through which we will establish the convergence of the iterates and the function values simultaneously.
Theorem 4.1. Consider the setting of Algorithm 1 and the following Lyapunov function

φ_s def= ‖x_s^m − x*‖² + ψ_s,  where  ψ_s def= 8α²ρ S_m (f(w_s) − f(x*)).    (16)

If Assumption 3.1 holds and if α ≤ 1/(2(𝓛 + 2ρ)), then

E[φ_s] ≤ β^s φ_0,  where  β = max{ (1 − αμ)^m, 1/2 }.    (17)

4.2 Total complexity for b–nice sampling

To gain better insight into the convergence rate stated in Theorem 4.1, we present the total complexity of Algorithm 1 when v is defined via the b–nice sampling introduced in Definition 3.3.
Corollary 4.1. Consider the setting of Algorithm 1 and suppose that we use b–nice sampling. Let α = 1/(2(𝓛(b) + 2ρ(b))), where 𝓛(b) and ρ(b) are given in (13) and (14). We have that the total complexity of finding an ε > 0 approximate solution that satisfies E[ ‖x_s^m − x*‖² ] ≤ ε φ_0 is

C_m(b) def= 2 (n/m + 2b) max{ (𝓛(b) + 2ρ(b))/μ, m } log(1/ε).    (18)

Now (3) results from plugging (13) and (14) into (18). As an immediate sanity check, we consider the two extremes b = n and b = 1. When b = n, we would expect to recover the iteration complexity of gradient descent, as we do in the next corollary⁷.
Corollary 4.2. Consider the setting of Corollary 4.1 with b = n and m = 1, thus α = 1/(2(𝓛(n) + 2ρ(n))) = 1/(2L). Hence, the resulting total complexity (18) is given by C_1(n) = 6n (L/μ) log(1/ε).
In practice, the most common setting is choosing b = 1 and the size of the inner loop m = n.
Here we recover a complexity that is common to other non-accelerated algorithms [27], [6], [14], for a range of values of m including m = n.

⁷ Though the resulting complexity is 6 times the tightest gradient descent complexity, it is of the same order.

Corollary 4.3. Consider the setting of Corollary 4.1 with b = 1, and thus α = 1/(2(𝓛(1) + 2ρ(1))) = 1/(6Lmax). Hence the resulting total complexity (18) is given by C_m(1) = 18 (n + Lmax/μ) log(1/ε), so long as m ∈ [min(n, Lmax/μ), max(n, Lmax/μ)].
Thus the total complexity is essentially invariant for m = n, m = Lmax/μ and everything in between.

5 L-SVRG-D: a decreasing step size approach

Although Free-SVRG solves multiple issues regarding the construction and analysis of SVRG, it still suffers from an important issue: it requires knowledge of the strong convexity constant, as is the case for the original SVRG algorithm [13]. One can of course use an explicit small regularization parameter as a proxy, but this can result in a slower algorithm.
A loopless variant of SVRG was proposed and analysed in [11, 16, 17]. At each iteration, their method makes a coin toss. With (a low) probability p, typically 1/n, the reference point is reset to the previous iterate, and with probability 1 − p, the reference point remains the same. This method does not require knowledge of the strong convexity constant.
Our method, L-SVRG-D, uses the same loopless structure as in [11, 16, 17] but introduces different step sizes at each iteration, see Algorithm 2. We initialize the step size to a fixed value α > 0. At each iteration we toss a coin, and if it lands heads (with probability 1 − p) the step size decreases by a factor sqrt(1 − p). If it lands tails (with probability p) the reference point is reset to the most recent iterate and the step size is reset to its initial value α.
This allows us to take larger steps than L-SVRG when we update the reference point, i.e., when the variance of the unbiased estimate of the gradient is low, and smaller steps when this variance increases.

Algorithm 2 L-SVRG-D
  Parameters: step size α, p ∈ (0, 1], and a sampling vector v ∼ D
  Initialization: w^0 = x^0 ∈ R^d, α_0 = α
  for k = 0, 1, . . . , K − 1 do
    Sample v_k ∼ D
    g^k = ∇f_{v_k}(x^k) − ∇f_{v_k}(w^k) + ∇f(w^k)
    x^{k+1} = x^k − α_k g^k
    (w^{k+1}, α_{k+1}) = (x^k, α) with probability p;  (w^k, sqrt(1 − p) α_k) with probability 1 − p
  return x^K

Theorem 5.1. Consider the iterates of Algorithm 2 and the following Lyapunov function

φ^k def= ‖x^k − x*‖² + ψ^k,  where  ψ^k def= (8α_k² 𝓛)/(p(3 − 2p)) (f(w^k) − f(x*)),  for all k ∈ N.    (19)

If Assumption 3.1 holds and

α ≤ 1/(2ζ_p 𝓛),  where  ζ_p def= ((7 − 4p)(1 − (1 − p)^{3/2})) / (p(2 − p)(3 − 2p)),    (20)

then

E[φ^k] ≤ β^k φ^0,  where  β = max{ 1 − (2/3)αμ, 1 − p/2 }.    (21)

Remark 5.1. To get a sense of the formula of the step size given in (20), it is easy to show that ζ_p is an increasing function of p such that 7/4 ≤ ζ_p ≤ 3. Since typically p ≈ 0, we often take a step size which is approximately α ≤ 2/(7𝓛).
Corollary 5.1. Consider the setting of Algorithm 2 and suppose that we use b–nice sampling. Let α = 1/(2ζ_p 𝓛(b)). We have that the total complexity of finding an ε > 0 approximate solution that satisfies E[ ‖x^k − x*‖² ] ≤ ε φ^0 is

C_p(b) def= 2(2b + pn) max{ (3ζ_p/2)·(𝓛(b)/μ), 1/p } log(1/ε).    (22)

6 Optimal parameter settings: loop, mini-batch and step sizes

In this section, we restrict our analysis to b–nice sampling. First, we determine the optimal loop size for Algorithm 1. Then, we examine the optimal mini-batch and step sizes for particular choices of the inner loop size m for Algorithm 1 and of the probability p of updating the reference point in Algorithm 2, which play analogous roles. Note that the step sizes used in our algorithms depend on b through the expected smoothness constant 𝓛(b) and the expected residual constant ρ(b). Hence, optimizing the total complexity in the mini-batch size also determines the optimal step size.
Examining the total complexities of Algorithms 1 and 2, given in (18) and (22), we can see that, when setting p = 1/m in Algorithm 2, these complexities only differ by constants. Thus, to avoid redundancy, we present the optimal mini-batch sizes for Algorithm 2 in Appendix C and we only consider here the complexity of Algorithm 1 given in (18).

6.1 Optimal loop size for Algorithm 1

Here we determine the optimal value of m for a fixed mini-batch size b, denoted by m*(b), which minimizes the total complexity (18).
Proposition 6.1. The loop size that minimizes (18) and the resulting total complexity are given by

m*(b) = (𝓛(b) + 2ρ(b))/μ  and  C_{m*}(b) = 2 (n + 2b (𝓛(b) + 2ρ(b))/μ) log(1/ε).    (23)

For example, when b = 1, we have that m*(1) = 3Lmax/μ and C_{m*}(1) = O((n + Lmax/μ) log(1/ε)), which is the same complexity as achieved by the range of m values given in Corollary 4.3. Thus, as we also observed in Corollary 4.3, the total complexity is not very sensitive to the choice of m, and m = n is a perfectly safe choice as it achieves the same complexity as m*. We also confirm this numerically with a series of experiments in Section G.2.2.

6.2 Optimal mini-batch and step sizes

In the following proposition, we determine the optimal mini-batch and step sizes for two practical choices of the size of the loop m.
Proposition 6.2. Let b* def= arg min_{b∈[n]} C_m(b), where C_m(b) is defined in (18). For the widely used choice m = n, b* is given in Table 1. For another widely used choice, m = n/b, which allows a full pass over the data set during each inner loop, we have

b* = ⌊b̄⌋  if n > 3Lmax/μ,
b* = 1   if 3Lmax/L < n ≤ 3Lmax/μ,
b* = n   otherwise, if n ≤ 3Lmax/L,
where b̄ def= (n(n − 1)μ − 3n(Lmax − L)) / (3(nL − Lmax)).    (24)

Previously, theory showed that the total complexity would increase as the mini-batch size increases, and thus established that single-element sampling was optimal. However, notice that for m = n and m = n/b, the usual choices for m in practice, the optimal mini-batch size is different from 1 for a range of problem settings.
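The case analysis for m = n can be sketched programmatically. The snippet below follows our reading of the conditions reconstructed in Table 1 (the helper names are ours, not the paper's):

```python
import numpy as np

def b_hat(n, L, Lmax):
    return np.sqrt((n / 2) * (3 * Lmax - L) / (n * L - 3 * Lmax))

def b_tilde(n, mu, L, Lmax):
    return (3 * Lmax - L) * n / (n * (n - 1) * mu - n * L + 3 * Lmax)

def optimal_b(n, mu, L, Lmax):
    """Optimal mini-batch size for Free-SVRG with m = n (Table 1)."""
    if n >= 3 * Lmax / mu:
        return 1
    if n <= L / mu:                      # very ill-conditioned regime
        if Lmax >= n * L / 3:
            return n
        return int(np.floor(b_hat(n, L, Lmax)))
    # intermediate regime: L/mu < n < 3*Lmax/mu
    if Lmax >= n * L / 3:
        return int(np.floor(b_tilde(n, mu, L, Lmax)))
    return int(np.floor(min(b_hat(n, L, Lmax), b_tilde(n, mu, L, Lmax))))
```

For instance, with n = 1000, μ = 1, L = 1 and Lmax = 10 we are in the well-conditioned regime n ≥ 3Lmax/μ and single-element sampling is optimal, whereas shrinking μ to 10⁻⁴ moves the problem into the ill-conditioned regime where a larger mini-batch wins.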
Since our algorithms are closer to the SVRG variants used in practice, we argue that our results explain why practitioners find that mini-batching works, as we verify in the next section.

7 Experiments

We performed a series of experiments on data sets from LIBSVM [5] and the UCI repository [3] to validate our theoretical findings. We tested l2–regularized logistic regression on ijcnn1 and real-sim, and ridge regression on slice and YearPredictionMSD. We used two choices for the regularizer: λ = 10⁻¹ and λ = 10⁻³. All of our code is implemented in Julia 1.0. Due to lack of space, most figures have been relegated to Section G in the supplementary material.

Figure 2: Comparison of theoretical variants of SVRG without mini-batching (b = 1) on the ijcnn1 data set.

Figure 3: Optimality of our mini-batch size b* given in Table 1 for Free-SVRG on the slice data set.

Practical theory. Our first round of experiments aimed at verifying whether our theory does result in efficient algorithms. Indeed, we found that Free-SVRG and L-SVRG-D with the parameter settings given by our theory are often faster than SVRG with the settings suggested by the theory in [13], that is, m = 20Lmax/μ and α = 1/(10Lmax). See Figure 2, and Section G.1 for more experiments comparing different theoretical parameter settings.

Optimal mini-batch size. We also confirmed numerically that when using Free-SVRG with m = n, the optimal mini-batch size b* derived in Table 1 was highly competitive compared to the range of mini-batch sizes b ∈ {1, 100, sqrt(n), n}. See Figure 3 and several more such experiments in Section G.2.1.
We also explore the optimality of our m^* in more experiments in Section G.2.2.

Acknowledgments

RMG acknowledges support from grants from DIM Math Innov Région Île-de-France (ED574 - FMJH), reference ANR-11-LABX-0056-LMH, LabEx LMH.

References

[1] Z. Allen-Zhu. "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods". In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. STOC 2017. 2017, pp. 1200–1205.

[2] Z. Allen-Zhu and E. Hazan. "Variance Reduction for Faster Non-Convex Optimization". In: Proceedings of The 33rd International Conference on Machine Learning. Vol. 48. 2016, pp. 699–707.

[3] A. Asuncion and D. Newman. UCI machine learning repository. 2007.

[4] S. Bubeck et al. "Convex optimization: Algorithms and complexity". In: Foundations and Trends in Machine Learning 8.3-4 (2015), pp. 231–357.

[5] C.-C. Chang and C.-J. Lin. "LIBSVM: A library for support vector machines". In: ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011), p. 27.

[6] A. Defazio, F. Bach, and S. Lacoste-Julien. "SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives". In: Advances in Neural Information Processing Systems 27. 2014, pp. 1646–1654.

[7] N. Gazagnadou, R. M. Gower, and J. Salmon. "Optimal mini-batch and step sizes for SAGA". In: The International Conference on Machine Learning (2019).

[8] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. "SGD: general analysis and improved rates".

[9] R. M. Gower, P. Richtárik, and F. Bach. "Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching". In: arXiv:1805.02632 (2018).

[10] R. Harikandeh, M. O. Ahmed, A. Virani, M. Schmidt, J. Konečný, and S. Sallinen. "Stop Wasting My Gradients: Practical SVRG". In: Advances in Neural Information Processing Systems 28. 2015, pp. 2251–2259.

[11] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. "Variance Reduced Stochastic Gradient Descent with Neighbors". In: Advances in Neural Information Processing Systems 28. 2015, pp. 2305–2313.

[12] S. Horváth and P. Richtárik. "Nonconvex Variance Reduced Optimization with Arbitrary Sampling".

[13] R. Johnson and T. Zhang. "Accelerating stochastic gradient descent using predictive variance reduction". In: Advances in Neural Information Processing Systems. 2013, pp. 315–323.

[14] J. Konečný and P. Richtárik. "Semi-stochastic gradient descent methods". In: Frontiers in Applied Mathematics and Statistics 3 (2017), p. 9.

[15] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. "Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting". In: IEEE Journal of Selected Topics in Signal Processing 2 (2016), pp. 242–255.

[16] D. Kovalev, S. Horváth, and P. Richtárik. "Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop". In: arXiv:1901.08689 (2019).

[17] A. Kulunchakov and J. Mairal. "Estimate Sequences for Variance-Reduced Stochastic Composite Optimization". In: Proceedings of the 36th International Conference on Machine Learning. Vol. 97. 2019, pp. 3541–3550.

[18] T. Murata and T. Suzuki. "Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. 2017, pp. 608–617.

[19] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Vol. 87. 2013.

[20] Y. Nesterov and J.-P. Vial. "Confidence level solutions for stochastic programming". In: Automatica. Vol. 44. 2008, pp. 1559–1568.

[21] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. "SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient". In: Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Proceedings of Machine Learning Research. 2017, pp. 2613–2621.

[22] A. Nitanda. "Stochastic Proximal Gradient Descent with Acceleration Techniques". In: Advances in Neural Information Processing Systems 27. 2014, pp. 1574–1582.

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[24] A. Raj and S. U. Stich. "k-SVRG: Variance Reduction for Large Scale Optimization". In: arXiv:1805.00982.

[25] S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. "Stochastic Variance Reduction for Nonconvex Optimization". In: Proceedings of The 33rd International Conference on Machine Learning. Vol. 48. 2016, pp. 314–323.

[26] H. Robbins and D. Siegmund. "A convergence theorem for non negative almost supermartingales and some applications". In: Herbert Robbins Selected Papers. 1985, pp. 111–135.

[27] N. L. Roux, M. Schmidt, and F. R. Bach. "A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets". In: Advances in Neural Information Processing Systems 25. 2012, pp. 2663–2671.

[28] S. Shalev-Shwartz and T. Zhang. "Stochastic dual coordinate ascent methods for regularized loss minimization". In: Journal of Machine Learning Research 14.Feb (2013), pp. 567–599.