{"title": "Accelerating Stochastic Gradient Descent using Predictive Variance Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 315, "page_last": 323, "abstract": "Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we  prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG).  However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.", "full_text": "Accelerating Stochastic Gradient Descent using\n\nPredictive Variance Reduction\n\nRie Johnson\n\nRJ Research Consulting\n\nTarrytown NY, USA\n\nTong Zhang\n\nBaidu Inc., Beijing, China\n\nRutgers University, New Jersey, USA\n\nAbstract\n\nStochastic gradient descent is popular for large scale optimization but has slow\nconvergence asymptotically due to the inherent variance. To remedy this prob-\nlem, we introduce an explicit variance reduction method for stochastic gradient\ndescent which we call stochastic variance reduced gradient (SVRG). For smooth\nand strongly convex functions, we prove that this method enjoys the same fast con-\nvergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic\nAverage Gradient (SAG). However, our analysis is signi\ufb01cantly simpler and more\nintuitive. 
Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.

1 Introduction

In machine learning, we often encounter the following optimization problem. Let ψ1, . . . , ψn be a sequence of vector functions from R^d to R. Our goal is to find an approximate solution of the following optimization problem

min_w P(w),   P(w) := (1/n) ∑_{i=1}^n ψi(w).   (1)

For example, given a sequence of n training examples (x1, y1), . . . , (xn, yn), where xi ∈ R^d and yi ∈ R, if we use the squared loss ψi(w) = (w^⊤xi − yi)^2, then we can obtain least squares regression. In this case, ψi(·) represents a loss function in machine learning. One may also include regularization conditions. For example, if we take ψi(w) = ln(1 + exp(−w^⊤xi yi)) + 0.5λw^⊤w (yi ∈ {±1}), then the optimization problem becomes regularized logistic regression.
A standard method is gradient descent, which can be described by the following update rule for t = 1, 2, . . .:

w(t) = w(t−1) − ηt ∇P(w(t−1)) = w(t−1) − (ηt/n) ∑_{i=1}^n ∇ψi(w(t−1)).   (2)

However, at each step, gradient descent requires evaluation of n derivatives, which is expensive. A popular modification is stochastic gradient descent (SGD), where at each iteration t = 1, 2, . . . we draw it randomly from {1, . . . , n}, and

w(t) = w(t−1) − ηt ∇ψit(w(t−1)).   (3)

The expectation E[w(t)|w(t−1)] is identical to (2). 
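To make the cost/variance contrast between (2) and (3) concrete, here is a minimal NumPy sketch on a synthetic least-squares instance; the function names and data are ours, not part of the paper.

```python
import numpy as np

def full_gradient_step(w, X, y, eta):
    # Rule (2): one gradient-descent step; touches all n examples.
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - eta * grad

def sgd_step(w, X, y, eta, rng):
    # Rule (3): draw i_t uniformly and use the single derivative of psi_i.
    i = rng.integers(len(y))
    return w - eta * 2.0 * (X[i] @ w - y[i]) * X[i]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=100)          # noisy labels
w_star = np.linalg.lstsq(X, y, rcond=None)[0]        # exact minimizer, for reference

w_gd = np.zeros(5)
w_sgd = np.zeros(5)
for t in range(500):
    w_gd = full_gradient_step(w_gd, X, y, eta=0.05)  # n derivatives per step
    w_sgd = sgd_step(w_sgd, X, y, eta=0.01, rng=rng) # 1 derivative per step
gd_err = np.linalg.norm(w_gd - w_star)
sgd_err = np.linalg.norm(w_sgd - w_star)             # hovers near w_star
```

With a constant step size, the full-gradient iterate converges linearly while the SGD iterate stalls at a noise floor proportional to η; decaying ηt removes the floor but yields the slower sub-linear rate discussed next.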
A more general version of SGD is the following:

w(t) = w(t−1) − ηt gt(w(t−1), ξt),   (4)

1

\f
where ξt is a random variable that may depend on w(t−1), and the expectation (with respect to ξt) E[gt(w(t−1), ξt)|w(t−1)] = ∇P(w(t−1)). The advantage of stochastic gradient is that each step only relies on a single derivative ∇ψi(·), and thus the computational cost is 1/n that of the standard gradient descent. However, a disadvantage of the method is that the randomness introduces variance — this is caused by the fact that gt(w(t−1), ξt) equals the gradient ∇P(w(t−1)) in expectation, but each gt(w(t−1), ξt) is different. In particular, if gt(w(t−1), ξt) is large, then we have a relatively large variance which slows down the convergence. For example, consider the case that each ψi(w) is smooth,

ψi(w) − ψi(w′) − 0.5L‖w − w′‖^2 ≤ ∇ψi(w′)^⊤(w − w′),   (5)

and convex; and P(w) is strongly convex,

P(w) − P(w′) − 0.5γ‖w − w′‖_2^2 ≥ ∇P(w′)^⊤(w − w′),   (6)

where L ≥ γ ≥ 0. As long as we pick ηt as a constant η < 1/L, we have linear convergence of O((1 − γ/L)^t) Nesterov [2004]. However, for SGD, due to the variance of random sampling, we generally need to choose ηt = O(1/t) and obtain a slower sub-linear convergence rate of O(1/t). This means that we have a trade-off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent. 
Although\nthe fast computation means it can reach an approximate solution relatively quickly, and thus has\nbeen proposed by various researchers for large scale problems Zhang [2004], Shalev-Shwartz et al.\n[2007] (also see Leon Bottou\u2019s Webpage http://leon.bottou.org/projects/sgd), the\nconvergence slows down when we need a more accurate solution.\nIn order to improve SGD, one has to design methods that can reduce the variance, which allows\nus to use a larger learning rate \u03b7t. Two recent papers Le Roux et al. [2012], Shalev-Shwartz and\nZhang [2012] proposed methods that achieve such a variance reduction effect for SGD, which leads\nto a linear convergence rate when \u03c8i(w) is smooth and strongly convex. The method in Le Roux\net al. [2012] was referred to as SAG (stochastic average gradient), and the method in Shalev-Shwartz\nand Zhang [2012] was referred to as SDCA. These methods are suitable for training convex linear\nprediction problems such as logistic regression or least squares regression, and in fact, SDCA is the\nmethod implemented in the popular lib-SVM package Hsieh et al. [2008]. However, both propos-\nals require storage of all gradients (or dual variables). Although this issue may not be a problem\nfor training simple regularized linear prediction problems such as least squares regression, the re-\nquirement makes it unsuitable for more complex applications where storing all gradients would be\nimpractical. One example is training certain structured learning problems with convex loss, and an-\nother example is training nonconvex neural networks. In order to remedy the problem, we propose a\ndifferent method in this paper that employs explicit variance reduction without the need to store the\nintermediate gradients. We show that if \u03c8i(w) is strongly convex and smooth, then the same con-\nvergence rate as those of Le Roux et al. 
[2012], Shalev-Shwartz and Zhang [2012] can be obtained. Even if ψi(w) is nonconvex (such as in neural networks), under mild assumptions it can be shown that asymptotically the variance of SGD goes to zero, and thus faster convergence can be achieved. In summary, this work makes the following three contributions:

• Our method does not require the storage of full gradients, and thus is suitable for some problems where methods such as Le Roux et al. [2012], Shalev-Shwartz and Zhang [2012] cannot be applied.
• We provide a much simpler proof of the linear convergence results for smooth and strongly convex loss, and our view provides a significantly more intuitive explanation of the fast convergence by explicitly connecting the idea to variance reduction in SGD. The resulting insight can easily lead to additional algorithmic development.
• The relatively intuitive variance reduction explanation also applies to nonconvex optimization problems, and thus this idea can be used for complex problems such as training deep neural networks.

2 Stochastic Variance Reduced Gradient

One practical issue for SGD is that in order to ensure convergence the learning rate ηt has to decay to zero. This leads to slower convergence. The need for a small learning rate is due to the variance

2

\f
of SGD (that is, SGD approximates the full gradient using a small batch of samples or even a single example, and this introduces variance). However, there is a fix described below. At each time, we keep a version of the estimated w, denoted ˜w, that is close to the optimal w. For example, we can keep a snapshot of ˜w after every m SGD iterations. Moreover, we maintain the average gradient

˜µ = ∇P(˜w) = (1/n) ∑_{i=1}^n ∇ψi(˜w),

and its computation requires one pass over the data using ˜w. 
Note that the expectation of ∇ψi(˜w) − ˜µ over i is zero, and thus the following update rule is a generalized SGD: randomly draw it from {1, . . . , n} and set

w(t) = w(t−1) − ηt(∇ψit(w(t−1)) − ∇ψit(˜w) + ˜µ).   (7)

We thus have

E[w(t)|w(t−1)] = w(t−1) − ηt ∇P(w(t−1)).

That is, if we let the random variable ξt = it and gt(w(t−1), ξt) = ∇ψit(w(t−1)) − ∇ψit(˜w) + ˜µ, then (7) is a special case of (4).
The update rule in (7) can also be obtained by defining the auxiliary function

˜ψi(w) = ψi(w) − (∇ψi(˜w) − ˜µ)^⊤ w.

Since ∑_{i=1}^n (∇ψi(˜w) − ˜µ) = 0, we know that

P(w) = (1/n) ∑_{i=1}^n ψi(w) = (1/n) ∑_{i=1}^n ˜ψi(w).

Now we may apply the standard SGD to the new representation P(w) = (1/n) ∑_{i=1}^n ˜ψi(w) and obtain the update rule (7).
To see that the variance of the update rule (7) is reduced, we note that when both ˜w and w(t) converge to the same parameter w∗, then ˜µ → 0. Therefore if ∇ψi(˜w) → ∇ψi(w∗), then

∇ψi(w(t−1)) − ∇ψi(˜w) + ˜µ → ∇ψi(w(t−1)) − ∇ψi(w∗) → 0.

This argument will be made more rigorous in the next section, where we will analyze the algorithm in Figure 1 that summarizes the ideas described in this section. We call this method stochastic variance reduced gradient (SVRG) because it explicitly reduces the variance of SGD. Unlike SGD, the learning rate ηt for SVRG does not have to decay, which leads to faster convergence as one can use a relatively large learning rate. 
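The snapshot scheme just described can be sketched in a few lines. The following is our minimal illustration of update (7) with option-I style snapshots (cf. Figure 1), on a synthetic least-squares problem; the helper names and constants are ours.

```python
import numpy as np

def svrg(grad_i, n, w0, eta, m, stages, rng):
    # grad_i(w, i) returns the gradient of psi_i at w.
    w_tilde = w0
    for s in range(stages):
        # snapshot: full gradient mu_tilde = grad P(w_tilde), one pass over the data
        mu_tilde = sum(grad_i(w_tilde, i) for i in range(n)) / n
        w = w_tilde
        for t in range(m):
            i = rng.integers(n)
            # update (7): unbiased, and its variance vanishes as w, w_tilde -> w*
            v = grad_i(w, i) - grad_i(w_tilde, i) + mu_tilde
            w = w - eta * v
        w_tilde = w  # option I: take the last iterate as the next snapshot
    return w_tilde

# demo on least squares, psi_i(w) = (w^T x_i - y_i)^2 (our toy data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
g = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
w_hat = svrg(g, n=100, w0=np.zeros(5), eta=0.002, m=2000, stages=20, rng=rng)
err = np.linalg.norm(w_hat - w_star)
```

A constant η works here: unlike plain SGD, the iterate keeps approaching w∗ because the correction term drives the update variance to zero.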
This is confirmed by our experiments.
In practical implementations, it is natural to choose option I, or to take ˜ws to be the average of the past t iterates. However, our analysis depends on option II. Note that each stage s requires 2m + n gradient computations (for some convex problems, one may save the intermediate gradients ∇ψi(˜w), and thus only m + n gradient computations are needed). Therefore it is natural to choose m to be of the same order as n but slightly larger (for example, m = 2n for convex problems and m = 5n for nonconvex problems in our experiments). In comparison, standard SGD requires only m gradient computations. Since gradient computation may be the most expensive operation, for a fair comparison we compare SGD to SVRG based on the number of gradient computations.

3 Analysis

For simplicity we will only consider the case that each ψi(w) is convex and smooth, and P(w) is strongly convex.
Theorem 1. Consider SVRG in Figure 1 with option II. Assume that all ψi are convex and that both (5) and (6) hold with γ > 0. Let w∗ = arg min_w P(w). Assume that m is sufficiently large so that

α = 1/(γη(1 − 2Lη)m) + 2Lη/(1 − 2Lη) < 1;

then we have geometric convergence in expectation for SVRG:

E P(˜ws) ≤ E P(w∗) + α^s [P(˜w0) − P(w∗)].

3

\f
Procedure SVRG
  Parameters: update frequency m and learning rate η
  Initialize ˜w0
  Iterate: for s = 1, 2, . . .
    ˜w = ˜ws−1
    ˜µ = (1/n) ∑_{i=1}^n ∇ψi(˜w)
    w0 = ˜w
    Iterate: for t = 1, 2, . . . , m
      Randomly pick it ∈ {1, . . . , n} and update the weight:
        wt = wt−1 − η(∇ψit(wt−1) − ∇ψit(˜w) + ˜µ)
    end
    option I: set ˜ws = wm
    option II: set ˜ws = wt for randomly chosen t ∈ {0, . . .
, m \u2212 1}\nend\n\nFigure 1: Stochastic Variance Reduced Gradient\n\nProof. Given any i, consider\n\ngi(w) = \u03c8i(w) \u2212 \u03c8i(w\u2217) \u2212 \u2207\u03c8i(w\u2217)(cid:62)(w \u2212 w\u2217).\n\nWe know that gi(w\u2217) = minw gi(w) since \u2207gi(w\u2217) = 0. Therefore\n\n0 = gi(w\u2217) \u2264 min\n\u2264 min\n\n\u03b7\n\n\u03b7\n\n[gi(w \u2212 \u03b7\u2207gi(w))]\n[gi(w) \u2212 \u03b7(cid:107)\u2207gi(w)(cid:107)2\n\n2 + 0.5L\u03b72(cid:107)\u2207gi(w)(cid:107)2\n\n2] = gi(w) \u2212 1\n2L\n\n(cid:107)\u2207gi(w)(cid:107)2\n2.\n\nThat is,\n\nn(cid:88)\n\n(cid:107)\u2207\u03c8i(w) \u2212 \u2207\u03c8i(w\u2217)(cid:107)2\n\n2 \u2264 2L[\u03c8i(w) \u2212 \u03c8i(w\u2217) \u2212 \u2207\u03c8i(w\u2217)(cid:62)(w \u2212 w\u2217)].\n\nBy summing the above inequality over i = 1, . . . , n, and using the fact that \u2207P (w\u2217) = 0, we obtain\n\nn\u22121\n\n(cid:107)\u2207\u03c8i(w) \u2212 \u2207\u03c8i(w\u2217)(cid:107)2\n\n2 \u2264 2L[P (w) \u2212 P (w\u2217)].\n\n(8)\nWe can now proceed to prove the theorem. Let vt = \u2207\u03c8it(wt\u22121) \u2212 \u2207\u03c8it( \u02dcw) + \u02dc\u00b5. Conditioned on\nwt\u22121, we can take expectation with respect to it, and obtain:\n\ni=1\n\nE (cid:107)vt(cid:107)2\n\n2\n\n2 + 2 E (cid:107)[\u2207\u03c8it( \u02dcw) \u2212 \u2207\u03c8it(w\u2217)] \u2212 \u2207P ( \u02dcw)(cid:107)2\n2 + 2 E (cid:107)[\u2207\u03c8it( \u02dcw) \u2212 \u2207\u03c8it(w\u2217)]\n\n\u22642 E (cid:107)\u2207\u03c8it(wt\u22121) \u2212 \u2207\u03c8it(w\u2217)(cid:107)2\n=2 E (cid:107)\u2207\u03c8it(wt\u22121) \u2212 \u2207\u03c8it(w\u2217)(cid:107)2\n\u2212 E [\u2207\u03c8it( \u02dcw) \u2212 \u2207\u03c8it(w\u2217)](cid:107)2\n\u22642 E (cid:107)\u2207\u03c8it(wt\u22121) \u2212 \u2207\u03c8it(w\u2217)(cid:107)2\n\u22644L[P (wt\u22121) \u2212 P (w\u2217) + P ( \u02dcw) \u2212 P (w\u2217)].\n2 and \u02dc\u00b5 = \u2207P ( \u02dcw). The second inequality uses\n2 + 2(cid:107)b(cid:107)2\n2 for any random vector \u03be. 
The third inequality uses (8).\n\n2 + 2 E (cid:107)\u2207\u03c8it( \u02dcw) \u2212 \u2207\u03c8it(w\u2217)(cid:107)2\n\nThe \ufb01rst inequality uses (cid:107)a + b(cid:107)2\n2 \u2212 (cid:107) E \u03be(cid:107)2\nE(cid:107)\u03be \u2212 E \u03be(cid:107)2\nNow by noticing that conditioned on wt\u22121, we have E vt = \u2207P (wt\u22121); and this leads to\n\n2 \u2264 2(cid:107)a(cid:107)2\n2 \u2264 E(cid:107)\u03be(cid:107)2\n\n2 = E(cid:107)\u03be(cid:107)2\n\n2\n\n2\n\n2\n\n2\n\nE (cid:107)wt \u2212 w\u2217(cid:107)2\n=(cid:107)wt\u22121 \u2212 w\u2217(cid:107)2\n\u2264(cid:107)wt\u22121 \u2212 w\u2217(cid:107)2\n\u2264(cid:107)wt\u22121 \u2212 w\u2217(cid:107)2\n=(cid:107)wt\u22121 \u2212 w\u2217(cid:107)2\n\n2 \u2212 2\u03b7(wt\u22121 \u2212 w\u2217)(cid:62) E vt + \u03b72 E (cid:107)vt(cid:107)2\n2 \u2212 2\u03b7(wt\u22121 \u2212 w\u2217)(cid:62)\u2207P (wt\u22121) + 4L\u03b72[P (wt\u22121) \u2212 P (w\u2217) + P ( \u02dcw) \u2212 P (w\u2217)]\n2 \u2212 2\u03b7[P (wt\u22121) \u2212 P (w\u2217)] + 4L\u03b72[P (wt\u22121) \u2212 P (w\u2217) + P ( \u02dcw) \u2212 P (w\u2217)]\n2 \u2212 2\u03b7(1 \u2212 2L\u03b7)[P (wt\u22121) \u2212 P (w\u2217)] + 4L\u03b72[P ( \u02dcw) \u2212 P (w\u2217)].\n\n2\n\n4\n\n\f2 + 2\u03b7(1 \u2212 2L\u03b7)m E [P ( \u02dcws) \u2212 P (w\u2217)]\n2 + 4Lm\u03b72 E[P ( \u02dcw) \u2212 P (w\u2217)]\n2 + 4Lm\u03b72 E[P ( \u02dcw) \u2212 P (w\u2217)]\n\nE (cid:107)wm \u2212 w\u2217(cid:107)2\n\u2264 E(cid:107)w0 \u2212 w\u2217(cid:107)2\n= E(cid:107) \u02dcw \u2212 w\u2217(cid:107)2\n\u2264 2\n\u03b3\n=2(\u03b3\u22121 + 2Lm\u03b72) E[P ( \u02dcw) \u2212 P (w\u2217)].\n\nE[P ( \u02dcw) \u2212 P (w\u2217)] + 4Lm\u03b72 E[P ( \u02dcw) \u2212 P (w\u2217)]\n\n(cid:20)\n\n(cid:21)\n\nThe second inequality uses the strong convexity property (6). 
We thus obtain\n\nE [P ( \u02dcws) \u2212 P (w\u2217)] \u2264\n\n1\n\n\u03b3\u03b7(1 \u2212 2L\u03b7)m\n\n+\n\n2L\u03b7\n1 \u2212 2L\u03b7\n\nE[P ( \u02dcws\u22121) \u2212 P (w\u2217)].\n\nThis implies that E [P ( \u02dcws) \u2212 P (w\u2217)] \u2264 \u03b1s E [P ( \u02dcw0) \u2212 P (w\u2217)]. The desired bound follows.\n\nThe bound we obtained in Theorem 1 is comparable to those obtained in Le Roux et al. [2012],\nShalev-Shwartz and Zhang [2012] (if we ignore the log factor). To see this, we may consider for\nsimplicity the most indicative case where the condition number L/\u03b3 = n. Due to the poor condition\nnumber, the standard batch gradient descent requires complexity of n ln(1/\u0001) iterations over the\ndata to achieve accuracy of \u0001, which means we have to process n2 ln(1/\u0001) number of examples. In\ncomparison, in our procedure we may take \u03b7 = 0.1/L and m = O(n) to obtain a convergence rate of\n\u03b1 = 1/2. Therefore to achieve an accuracy of \u0001, we need to process n ln(1/\u0001) number of examples.\nThis matches the results of Le Roux et al. [2012], Shalev-Shwartz and Zhang [2012]. Nevertheless,\nour analysis is signi\ufb01cantly simpler than both Le Roux et al. [2012] and Shalev-Shwartz and Zhang\n[2012], and the explicit variance reduction argument provides better intuition on why this method\nworks. In fact, in Section 4 we show that a similar intuition can be used to explain the effectiveness\nof SDCA.\nThe SVRG algorithm can also be applied to smooth but not strongly convex problems. A con-\n\u221a\nvergence rate of O(1/T ) may be obtained, which improves the standard SGD convergence rate of\nT ). In order to apply SVRG to nonconvex problems such as neural networks, it is useful\nO(1/\nto start with an initial vector \u02dcw0 that is close to a local minimum (which may be obtained with\nSGD), and then the method can be used to accelerate the local convergence rate of SGD (which may\nconverge very slowly by itself). 
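The arithmetic in the complexity comparison above can be checked directly from Theorem 1. In this small sketch the numbers are ours: m = 50n is an explicit choice that makes α exactly 1/2 when η = 0.1/L and the condition number L/γ equals n.

```python
def svrg_rate(L, gamma, eta, m):
    # Convergence factor from Theorem 1:
    #   alpha = 1 / (gamma * eta * (1 - 2*L*eta) * m) + 2*L*eta / (1 - 2*L*eta)
    assert 0 < 2 * L * eta < 1, "the bound needs eta < 1/(2L)"
    return 1.0 / (gamma * eta * (1 - 2 * L * eta) * m) + (2 * L * eta) / (1 - 2 * L * eta)

n = 10_000
L = 1.0
gamma = L / n            # condition number L/gamma = n
eta = 0.1 / L
alpha = svrg_rate(L, gamma, eta, m=50 * n)   # 0.25 + 0.25 = 1/2: each stage halves the gap
```

Processing 2m + n = O(n) gradients per stage and O(ln(1/ε)) stages then gives the O(n ln(1/ε)) total gradient computations mentioned above.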
If the system is locally (strongly) convex, then Theorem 1 can be directly applied, which implies a local geometric convergence rate with a constant learning rate.

4 SDCA as Variance Reduction

It can be shown that both SDCA and SAG are connected to SVRG in the sense that they are also variance reduction methods for SGD, although they use different techniques. In the following we present the variance reduction view of SDCA, which provides additional insight into these recently proposed fast convergence methods for stochastic optimization. In SDCA, we consider the following problem with convex φi(w):

w∗ = arg min_w P(w),   P(w) = (1/n) ∑_{i=1}^n φi(w) + 0.5 λ w^⊤w.   (9)

This is the same as our formulation with ψi(w) = φi(w) + 0.5 λ w^⊤w.
We can take the derivative of (9) and derive a "dual" representation of w at the solution w∗ as

w∗ = ∑_{i=1}^n α∗_i,   (10)

where the dual variables are α∗_i = −(1/(λn)) ∇φi(w∗).

5

\f
Therefore in the SGD update (3), if we maintain the representation

w(t) = ∑_{i=1}^n α(t)_i,   (11)

then the update of α becomes

α(t)_ℓ = (1 − ηt λ) α(t−1)_ℓ − ηt ∇φi(w(t−1))   (ℓ = i);   α(t)_ℓ = (1 − ηt λ) α(t−1)_ℓ   (ℓ ≠ i).   (12)

This update rule requires ηt → 0 when t → ∞.
Alternatively, we may consider starting with SGD by maintaining (11), and then applying the following Dual Coordinate Ascent rule:

α(t)_ℓ = α(t−1)_ℓ − ηt (∇φi(w(t−1)) + λn α(t−1)_i)   (ℓ = i);   α(t)_ℓ = α(t−1)_ℓ   (ℓ ≠ i),   (13)

and then updating w as w(t) = w(t−1) + (α(t)_i − α(t−1)_i).
It can be checked that if we take expectation over random i ∈ {1, . . . , n}, then the SGD rule in (12) and the dual coordinate ascent rule (13) both yield the gradient descent rule

E[w(t)|w(t−1)] = w(t−1) − ηt ∇P(w(t−1)).

Therefore both can be regarded as different realizations of the more general stochastic gradient rule in (4). However, the advantage of (13) is that we may take a larger step when t → ∞. This is because according to (10), when the primal-dual parameters (w, α) converge to the optimal parameters (w∗, α∗), we have

(∇φi(w) + λn αi) → 0,

which means that even if the learning rate ηt stays bounded away from zero, the procedure can converge. 
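Rule (13) can be simulated directly. The following sketch is our construction, not the paper's: a small ridge-regularized least-squares instance with φi(w) = (w^⊤xi − yi)^2, tracking the residuals ∇φi(w) + λn αi that the argument above says vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, eta = 100, 5, 0.1, 0.002
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

grad_phi = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of phi_i
alpha = np.zeros((n, d))     # dual variables; w = sum_i alpha_i as in (11)
w = alpha.sum(axis=0)

def residual_ms(w, alpha):
    # mean squared norm of grad phi_i(w) + lam*n*alpha_i (the vanishing quantity)
    r = np.array([grad_phi(w, i) + lam * n * alpha[i] for i in range(n)])
    return float(np.mean(np.sum(r * r, axis=1)))

start = residual_ms(w, alpha)
for t in range(50_000):
    i = rng.integers(n)
    # rule (13): update the i-th dual coordinate with a CONSTANT step size eta,
    # then refresh w through the representation (11)
    delta = -eta * (grad_phi(w, i) + lam * n * alpha[i])
    alpha[i] += delta
    w += delta
end = residual_ms(w, alpha)
grad_P = 2.0 * X.T @ (X @ w - y) / n + lam * w           # gradient of objective (9)
```

Because the mean of the residuals equals ∇P(w), driving them to zero makes the iterate converge even though η never decays, which is the variance reduction effect described above.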
This is the same effect as SVRG, in the sense that the variance goes to zero asymptotically: as w → w∗ and α → α∗, we have

(1/n) ∑_{i=1}^n (∇φi(w) + λn αi)^2 → 0.

That is, SDCA is also a variance reduction method for SGD, which is similar to SVRG.
From this discussion, we can view SVRG as an explicit variance reduction technique for SGD which is similar to SDCA. However, it is simpler, more intuitive, and easier to analyze. This relationship provides useful insights into the underlying optimization problem that may allow us to make further improvements.

5 Experiments

To confirm the theoretical results and insights, we experimented with SVRG (Fig. 1, Option I) in comparison with SGD and SDCA, with linear predictors (convex) and neural nets (nonconvex). In all the figures, the x-axis is computational cost measured by the number of gradient computations divided by n. For SGD, it is the number of passes through the training data, and for SVRG in the nonconvex case (neural nets), it includes the additional computation of ∇ψi(˜w) both in each iteration and for computing the gradient average ˜µ. For SVRG in our convex case, however, ∇ψi(˜w) does not have to be re-computed in each iteration. Since in this case the gradient is always a multiple of xi, i.e., ∇ψi(w) = φ′i(w^⊤xi) xi where ψi(w) = φi(w^⊤xi), ∇ψi(˜w) can be compactly saved in memory by only saving the scalars φ′i(˜w^⊤xi), with the same memory consumption as SDCA and SAG. The interval m was set to 2n (convex) and 5n (nonconvex). 
The weights for SVRG were initialized\nby performing 1 iteration (convex) or 10 iterations (nonconvex) of SGD; therefore, the line for\nSVRG starts after x = 1 (convex) or x = 10 (nonconvex) in the respective \ufb01gures.\n\n6\n\n\fFigure 2: Multiclass logistic regression (convex) on MNIST. (a) Training loss comparison with SGD with\n\ufb01xed learning rates. The numbers in the legends are the learning rate. (b) Training loss residual P (w)\u2212P (w\u2217);\ncomparison with best-tuned SGD with learning rate scheduling and SDCA. (c) Variance of weight update\n(including multiplication with the learning rate).\n\nFirst, we performed L2-regularized multiclass logistic regression (convex optimization) on MNIST1\nwith regularization parameter \u03bb =1e-4. Fig. 2 (a) shows training loss (i.e., the optimization objective\nP (w)) in comparison with SGD with \ufb01xed learning rates. The results are indicative of the known\nweakness of SGD, which also illustrates the strength of SVRG. That is, when a relatively large\nlearning rate \u03b7 is used with SGD, training loss drops fast at \ufb01rst, but it oscillates above the minimum\nand never goes down to the minimum. With small \u03b7, the minimum may be approached eventually,\nbut it will take many iterations to get there. Therefore, to accelerate SGD, one has to start with\nrelatively large \u03b7 and gradually decrease it (learning rate scheduling), as commonly practiced. By\ncontrast, using a single relatively large value of \u03b7, SVRG smoothly goes down faster than SGD.\nThis is in line with our theoretical prediction that one can use a relatively large \u03b7 with SVRG, which\nleads to faster convergence.\nFig. 
2 (b) and (c) compare SVRG with best-tuned SGD with learning rate scheduling and SDCA.\n\u2018SGD-best\u2019 is the best-tuned SGD, which was chosen by preferring smaller training loss from a\nlarge number of parameter combinations for two types of learning scheduling: exponential decay\n\u03b7(t) = \u03b70a(cid:98)t/n(cid:99) with parameters \u03b70 and a to adjust and t-inverse \u03b7(t) = \u03b70(1 + b(cid:98)t/n(cid:99))\u22121 with \u03b70\nand b to adjust. (Not surprisingly, the best-tuned SGD with learning rate scheduling outperformed\nthe best-tuned SGD with a \ufb01xed learning rate throughout our experiments.) Fig. 2 (b) shows training\nloss residual, which is training loss minus the optimum (estimated by running gradient descent for\na very long time): P (w) \u2212 P (w\u2217). We observe that SVRG\u2019s loss residual goes down exponentially,\nwhich is in line with Theorem 1, and that SVRG is competitive with SDCA (the two lines are almost\noverlapping) and decreases faster than SGD-best. In Fig. 2 (c), we show the variance of SVRG\nupdate \u2212\u03b7(\u2207\u03c8i(w) \u2212 \u2207\u03c8i( \u02dcw) + \u02dc\u00b5) in comparison with the variance of SGD update \u2212\u03b7(t)\u2207\u03c8i(w)\nand SDCA. As expected, the variance of both SVRG and SDCA decreases as optimization proceeds,\nand the variance of SGD with a \ufb01xed learning rate (\u2018SGD:0.001\u2019) stays high. The variance of the\nbest-tuned SGD decreases, but this is due to the forced exponential decay of the learning rate and\nthe variance of the gradients \u2207\u03c8i(w) (the dotted line labeled as \u2018SGD-best/\u03b7(t)\u2019) stays high.\nFig. 3 shows more convex-case results (L2-regularized logistic regression) in terms of training loss\nresidual (top) and test error rate (bottom) on rcv1.binary and covtype.binary from the LIBSVM site2,\nprotein3, and CIFAR-104. 
As protein and covtype do not come with labeled test data, we randomly split the training data into halves to make the training/test split. CIFAR was normalized into [0, 1] by division with 255 (which was also done with MNIST and CIFAR in the other figures), and protein was standardized. λ was set to 1e-3 (CIFAR) and 1e-5 (rest). Overall, SVRG is competitive with SDCA and clearly more advantageous than the best-tuned SGD. It is also worth mentioning that a recent study Schmidt et al. [2013] reports that SAG and SDCA are competitive.
To test SVRG with nonconvex objectives, we trained neural nets (with one fully-connected hidden layer of 100 nodes and ten softmax output nodes; sigmoid activation and L2 regularization) with mini-batches of size 10 on MNIST and CIFAR-10, both of which are standard datasets for deep

1 http://yann.lecun.com/exdb/mnist/
2 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
3 http://osmot.cs.cornell.edu/kddcup/datasets.html
4 www.cs.toronto.edu/~kriz/cifar.html

7

\f
Figure 3: More convex-case results. Loss residual P(w) − P(w∗) (top) and test error rates (bottom). 
L2-\nregularized logistic regression (10-class for CIFAR-10 and binary for the rest).\n\nFigure 4: Neural net results (nonconvex).\n\nneural net studies; \u03bb was set to 1e-4 and 1e-3, respectively. In Fig. 4 we con\ufb01rm that the results are\nsimilar to the convex case; i.e., SVRG reduces the variance and smoothly converges faster than the\nbest-tuned SGD with learning rate scheduling, which is a de facto standard method for neural net\ntraining. As said earlier, methods such as SDCA and SAG are not practical for neural nets due to\ntheir memory requirement. We view these results as promising. However, further investigation, in\nparticular with larger/deeper neural nets for which training cost is a critical issue, is still needed.\n\n6 Conclusion\n\nThis paper introduces an explicit variance reduction method for stochastic gradient descent meth-\nods. For smooth and strongly convex functions, we prove that this method enjoys the same fast\nconvergence rate as those of SDCA and SAG. However, our proof is signi\ufb01cantly simpler and more\nintuitive. 
Moreover, unlike SDCA or SAG, this method does not require the storage of gradients, and thus is more easily applicable to complex problems such as structured prediction or neural network learning.

Acknowledgment

We thank Leon Bottou and Alekh Agarwal for spotting a mistake in the original theorem.

8

\f
References

C.J. Hsieh, K.W. Chang, C.J. Lin, S.S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, pages 408–415, 2008.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. arXiv preprint arXiv:1202.6258, 2012.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. 
arXiv preprint arXiv:1309.2388, 2013.\n\nS. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for\n\nSVM. In International Conference on Machine Learning, pages 807\u2013814, 2007.\n\nShai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized\n\nloss minimization. arXiv preprint arXiv:1209.1873, 2012.\n\nT. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algo-\nrithms. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.\n\n9\n\n\f", "award": [], "sourceid": 238, "authors": [{"given_name": "Rie", "family_name": "Johnson", "institution": "RJ Research Consulting"}, {"given_name": "Tong", "family_name": "Zhang", "institution": "Baidu & Rutgers"}]}