{"title": "The Lingering of Gradients: How to Reuse Gradients Over Time", "book": "Advances in Neural Information Processing Systems", "page_first": 1244, "page_last": 1253, "abstract": "Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the ``lingering'' of gradients: once a gradient is computed at $x_k$, the additional time to compute gradients at $x_{k+1},x_{k+2},\\dots$ may be reduced.\n\nWe show how this improves the running time of gradient descent and SVRG. For instance, if the \"additional time'' scales linearly with respect to the traveled distance, then the \"convergence rate'' of gradient descent can be improved from $1/T$ to $\\exp(-T^{1/3})$. On the empirical side, we solve a hypothetical revenue management problem on the Yahoo! Front Page Today Module application with 4.6m users to $10^{-6}$ error (or $10^{-12}$ dual error) using 6 passes of the dataset.", "full_text": "The Lingering of Gradients:\n\nHow to Reuse Gradients over Time\n\nZeyuan Allen-Zhu\u2217\nMicrosoft Research AI\nRedmond, WA 98052\n\nDavid Simchi-Levi\u2217\n\nMIT\n\nCambridge, MA 02139\n\nzeyuan@csail.mit.edu\n\ndslevi@mit.edu\n\nXinshang Wang\u2217\n\nMIT\n\nCambridge, MA 02139\nxinshang@mit.edu\n\nAbstract\n\nClassically, the time complexity of a \ufb01rst-order method is estimated by its number\nof gradient computations. In this paper, we study a more re\ufb01ned complexity by\ntaking into account the \u201clingering\u201d of gradients: once a gradient is computed at\nxk, the additional time to compute gradients at xk+1, xk+2, . . . may be reduced.\nWe show how this improves the running time of gradient descent and SVRG. For\ninstance, if the \u201cadditional time\u201d scales linearly with respect to the traveled dis-\ntance, then the \u201cconvergence rate\u201d of gradient descent can be improved from 1/T\nto exp(\u2212T 1/3). 
On the empirical side, we solve a hypothetical revenue management problem on the Yahoo! Front Page Today Module application with 4.6m users to 10^{-6} error (or 10^{-12} dual error) using 6 passes of the dataset.

1 Introduction

First-order methods play a fundamental role in large-scale machine learning and optimization tasks. In most scenarios, the performance of a first-order method is represented by its convergence rate: the relationship between ε (the optimization error) and T (the number of gradient computations). This is meaningful because in most applications, the time complexities for evaluating gradients at different points are of the same magnitude. In other words, the worst-case time complexities of first-order methods are usually proportional to a fixed parameter times T.

In large-scale settings, however, if we have already spent time computing the (full) gradient at x, perhaps we can use such information to reduce the time complexity to compute full gradients at other points near x. We call this the "lingering" of gradients, because the gradient at x may be partially reused for future consideration, but will eventually fade away once we are far from x.

Formally, consider the (finite-sum) stochastic convex minimization problem:

    min_{x ∈ R^d} { f(x) := (1/n) Σ_{i=1}^n fi(x) } .    (1.1)

Then, could it be possible that whenever x is sufficiently close to y, for at least a large fraction of indices i ∈ [n], we have ∇fi(x) ≈ ∇fi(y)? In other words, if ∇f1(x), ..., ∇fn(x) are already calculated at some point x, can we reuse a large fraction of them to approximate ∇f(y)?

Example 1. In classification problems, fi(x) represents the loss value for "how well training sample i is classified under predictor x".
For any sample i that has a large margin under predictor x, its gradient ∇fi(x) may stay close to ∇fi(y) whenever x is close to y.

Formally, let fi(x) = max{0, 1 − ⟨x, ai⟩} be the hinge loss (or its smoothed variant if needed) with respect to the i-th sample ai ∈ R^d. If the margin |1 − ⟨x, ai⟩| is sufficiently large, then moving from x to a nearby point y should not affect the sign of 1 − ⟨x, ai⟩, and thus not change the gradient. Therefore, if the samples a1, ..., an are sufficiently spread out in the space, then a large fraction of them should incur large margins, and thus have the same gradients when x changes by a small amount.

∗Authors sorted in alphabetical order. Full version of this paper (containing additional theoretical results, additional experiments, and missing proofs) available at https://arxiv.org/abs/1901.02871.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Example 2. In revenue management problems, fi(x) represents the marginal profit of the i-th customer under bid-price strategy x ∈ R^d_+ over d items. In many applications (see Section 2.2), ∇fi(x) only depends on customer i's preferences under x.

If the bid-price vector x ∈ R^d_+ changes by a small amount to y, then for a large fraction of customers i, their most profitable items may not change, and thus ∇fi(x) ≈ ∇fi(y). (Indeed, imagine that one of the items is an Xbox and its price drops by 5%; perhaps 90% of the customers will not change their minds about buying or not. We shall demonstrate this using real-life data.)

1.1 Our Results

We assume in this paper that, given any point x ∈ R^d and index i ∈ [n], one can efficiently evaluate a "lingering radius" δ(x, i).
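To make Example 1 concrete: for the unsmoothed hinge loss, moving x by less than |1 − ⟨x, ai⟩| / ‖ai‖ cannot flip the sign of the margin, so that quantity is a valid lingering radius. A minimal Python sketch of this computation (the function names are ours, purely illustrative):

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def hinge_lingering_radius(x, a):
    """Radius within which the subgradient of f_i(x) = max(0, 1 - <x, a>)
    cannot change.  If ||y - x|| < |1 - <x, a>| / ||a||, then
    |<y - x, a>| <= ||y - x|| * ||a|| < |1 - <x, a>|,
    so 1 - <y, a> keeps the same sign as 1 - <x, a>."""
    margin = 1.0 - dot(a, x)
    return abs(margin) / math.sqrt(dot(a, a))

def hinge_grad(x, a):
    """Subgradient of max(0, 1 - <x, a>): -a while the margin is positive,
    the zero vector otherwise."""
    if 1.0 - dot(a, x) > 0:
        return [-ai for ai in a]
    return [0.0] * len(a)
```

Any y with ‖y − x‖ strictly below this radius therefore has the same subgradient as x, which is exactly the reuse opportunity described above.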
The radius satisfies the condition that for every point y that is within distance δ(x, i) from x, the stochastic gradient ∇fi(y) is equal to ∇fi(x). We make two remarks:

• We use "equal to" for the purpose of proving theoretical results. In practice and in our experiments, it suffices to use approximate equality such as ‖∇fi(x) − ∇fi(y)‖ ≤ 10^{-10}.
• By "efficient" we mean that δ(x, i) is computable in the same complexity as evaluating ∇fi(x). This is reasonable because when ∇fi(x) is an explicit function of x, it is usually easy to tell how sensitive it is to the input x. (We shall include such an example in our experiments.)

If we denote by B(x, r) the set of indices j satisfying δ(x, j) < r, and if we travel to some point y that is at most distance r from x, then we only need to re-evaluate the (stochastic) gradients ∇fj(y) for j ∈ B(x, r). Intuitively, one should expect |B(x, r)| to grow as a function of r, and this is indeed the case — for instance, for the revenue management problem (see Section 5).

Theory. To present the simplest theoretical result, we modify gradient descent (GD) to take into account the lingering of gradients. At a high level, we run GD, but during its execution, we maintain a decomposition of the indices Λ0 ∪ ··· ∪ Λt = {1, 2, ..., n}, where t is logarithmic in n. Now, whenever we need ∇fi(xk) for some i ∈ Λp, we approximate it by ∇fi(xk′) for a point xk′ that was visited at most 2^p steps ago. Our algorithm makes sure that such ∇fi(xk′) is available in memory.

We prove that the performance of our algorithm depends on how |B(x, r)| grows in r. Formally, let T be the total number of stochastic gradient computations divided by n, and suppose |B(x, r)| ≤ O(r), i.e., it linearly scales in the radius r.
Then, our algorithm finds a point x with f(x) − f(x∗) ≤ 2^{−Ω(T^{1/3})}. In contrast, traditional GD satisfies f(x) − f(x∗) ≤ O(T^{-1}).

In the full version of this paper, we also study the case when |B(x, r)| ≤ O(r^β) for an arbitrary constant β ∈ (0, 1].

Practice. We also design an algorithm that practically maximizes the use of gradient lingering. We take the SVRG method [19, 36] as the prototype because it is widely applied in large-scale settings. Recall that SVRG uses the gradient estimator ∇f(x̃) − ∇fi(x̃) + ∇fi(xk) to estimate the full gradient ∇f(xk), where x̃ is the so-called snapshot point (which was visited at most n steps ago) and i is a random index. At a high level, we modify SVRG so that the index i is only generated from those indices whose stochastic gradients need to be recomputed, and we ignore those such that ∇fi(xk) = ∇fi(x̃). This can further reduce the variance of the gradient estimator, and improve the running time.

1.2 Related Work

Variance Reduction. The SVRG method was independently proposed by Johnson and Zhang [19] and Zhang et al. [36], and belongs to the class of stochastic methods using the so-called variance-reduction technique [4, 8, 19, 23, 27–30, 35, 36]. The common idea behind these methods is to use some full gradient of the past to approximate the future, but they do not distinguish which ∇fi(x) can "linger longer in time" among all indices i ∈ [n] for different x.

Arguably the two most widely applied variance-reduction methods are SVRG and SAGA [8]. They have complementary performance depending on the internal structure of the dataset [5], so we compare to both in our experiments.

Reuse Gradients. Some researchers have exploited the internal structure of the dataset to speed up first-order methods.
That is, they use ∇fi(x) to approximate ∇fj(x) when the two data samples i and j are sufficiently close. This is orthogonal to our setting because we use ∇fi(x) to approximate ∇fi(y) when x and y are sufficiently close. In the extreme case when all the data samples are identical, they have ∇fi(x) = ∇fj(x) for every i, j, and thus stochastic gradient methods converge as fast as full gradient ones. For this problem, Hofmann et al. [16] introduced a variant of SAGA, and Allen-Zhu et al. [5] introduced a variant of SVRG and a variant of accelerated coordinate descent.

Other authors study how to reduce gradient computations at the snapshot points of SVRG [15, 20]. This is also orthogonal to the idea of this paper, and can be added to our algorithms for even better performance (see Section 5).

2 Notions and Problem Formulation

We denote by ‖·‖ the Euclidean norm, and by ‖·‖∞ the infinity norm. Recall the notion of Lipschitz smoothness (it has other equivalent definitions; see the textbook [25]).

Definition 2.1. A function f : R^d → R is L-Lipschitz smooth (or L-smooth for short) if

    ∀x, y ∈ R^d : ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ .

We also introduce the notion of the "lowbit sequence" of a positive integer.

Definition 2.2. For a positive integer k, let lowbit(k) := 2^i, where i ≥ 0 is the maximum integer such that k is an integral multiple of 2^i. For instance, lowbit(34) = 2, lowbit(12) = 4, and lowbit(8) = 8. Given a positive integer k, let the lowbit sequence of k be (k0, k1, ..., kt) where

    0 = k0 < k1 < ··· < kt = k   and   k_{i−1} = k_i − lowbit(k_i) .

For instance, the lowbit sequence of 45 is (0, 32, 40, 44, 45).

2.1 Our Model

We propose the following model to capture the lingering of gradients.
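Definition 2.2 is straightforward to implement with the standard bit trick k & −k, which extracts the largest power of two dividing k. A small sketch for concreteness (function names are ours):

```python
def lowbit(k):
    """Largest power of two dividing k (k > 0); e.g. lowbit(12) = 4."""
    return k & -k

def lowbit_sequence(k):
    """The sequence 0 = k_0 < k_1 < ... < k_t = k with
    k_{i-1} = k_i - lowbit(k_i), as in Definition 2.2."""
    seq = [k]
    while seq[-1] > 0:
        seq.append(seq[-1] - lowbit(seq[-1]))
    return seq[::-1]  # reverse so the sequence is increasing
```

For example, `lowbit_sequence(45)` walks 45 → 44 → 40 → 32 → 0 and returns the increasing sequence from Definition 2.2.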
For every x ∈ R^d and index i ∈ [n], let δ(x, i) ≥ 0 be the lingering radius of ∇fi(x), meaning that²

    ∀y ∈ R^d with ‖y − x‖ ≤ δ(x, i) :   ∇fi(x) = ∇fi(y) .

In other words, as long as we travel within distance δ(x, i) from x, the gradient ∇fi(x) can be reused to represent ∇fi(y). Accordingly, for every x ∈ R^d and r ≥ 0, we denote by B(x, r) the set of indices j satisfying δ(x, j) < r. That is,

    B(x, r) := { j ∈ [n] | δ(x, j) < r } .

Our main assumption of this paper is:

Assumption 1. Each δ(x, i) can be computed in the same time complexity as ∇fi(x).

Under Assumption 1, if at some point x we have already computed ∇fi(x) for all i ∈ [n], then we can compute δ(x, i) as well for every i ∈ [n], and sort the indices i ∈ [n] in increasing order of δ(x, i). In the future, if we arrive at any point y, we can calculate r = ‖x − y‖ and use

    ∇′ = (1/n) ( Σ_{i ∉ B(x,r)} ∇fi(x) + Σ_{i ∈ B(x,r)} ∇fi(y) )

to represent ∇f(y). We stress that the time to compute ∇′ is only proportional to |B(x, r)|.

We denote by Ttime the gradient complexity, which equals the number of times ∇fi(x) and δ(x, i) are calculated, divided by n. In computing ∇′ above, the gradient complexity is |B(x, r)|/n. If we always set δ(x, i) = 0, then |B(x, r)| = n and the gradient complexity for computing ∇′ remains 1. However, if the underlying Problem (1.1) is nice enough so that |B(x, r)| becomes an increasing function of r (see Figure 2), then we can hope to design faster algorithms.

²Recall that, in practice, one should replace the exact equality with, for instance, ‖∇fi(x) − ∇fi(y)‖ ≤ 10^{-10}.
To present the simplest statements, we do not introduce such an extra parameter.

2.2 Revenue Management Problem

As a motivating example, consider a canonical revenue management problem of selling d resources to n customers. Let bj ≥ 0 be the capacity of resource j ∈ [d]; let pi,j ∈ [0, 1] be the probability that customer i ∈ [n] will purchase a unit of resource j if offered resource j; and let rj be the revenue for each unit of resource j. We want to offer each customer one and only one candidate resource, and let yi,j be the probability that we offer customer i resource j. The following is an LP relaxation for this problem:

    max_{y ≥ 0} { Σ_{i∈[n], j∈[d]} rj pi,j yi,j  |  ∀j ∈ [d] : Σ_{i∈[n]} pi,j yi,j ≤ bj  ∧  ∀i ∈ [n] : Σ_{j∈[d]} yi,j = 1 }    (2.1)

This LP (2.1) and its variants have repeatedly found many applications, including adwords/ad allocation problems [3, 10, 11, 14, 17, 24, 34, 37], and revenue management for airline and service industries [7, 13, 18, 26, 31, 33]. Some authors also study the online version of solving such LPs [1, 2, 9, 12].

A standard way to reduce (2.1) to convex optimization is by regularization (cf. [37]). Let us subtract from the maximization objective the regularizer R(y) := µ Σ_{i∈[n]} pi Σ_{j∈[d]} yi,j log yi,j, where pi := max_{j∈[d]} pi,j and µ > 0 is some small regularization weight.
Then, after transforming to the dual, we have

    min_{x ≥ 0} { Σ_{j=1}^d xj bj + µ Σ_{i=1}^n pi · log Zi }   where   Zi = Σ_{j=1}^d exp( (rj − xj) pi,j / (pi µ) ) .    (2.2)

Any solution x (usually known as the bid price in operations management [32]) to (2.2) naturally gives back a solution y for the primal (2.1), by setting

    yi,j = exp( (rj − xj) pi,j / (pi µ) ) / Zi .    (2.3)

If we let fi(x) := µ n pi · log Zi + ⟨x, b⟩, then (2.2) reduces to Problem (1.1). We conduct empirical studies on this revenue management problem in Section 5.

3 Our Modification to Gradient Descent

In this section, we consider a convex function f(x) = (1/n) Σ_{i=1}^n fi(x) that is L-smooth. Recall from textbooks (e.g., [25]) that if gradient descent (GD) is applied for T iterations, starting at x0 ∈ R^d, then we can arrive at a point x with f(x) − f(x∗) ≤ O( L‖x0 − x∗‖² / T ). This is the 1/T convergence rate.

To improve on this theoretical rate, we make the following assumption on B(x, r):

Assumption 2. There exist α ∈ [0, 1] and C > 0 such that

    ∀x ∈ R^d, r ≥ 0 :   |B(x, r)| / n ≤ ψ(r) := max{α, r/C} .

It says that |B(x, r)| is a growing function in r, and the growth rate is ∝ r. (In the full version we investigate the more general case where the growth rate is ∝ r^β for arbitrary β ∈ (0, 1].) We also allow an additive term α to cover the case that an α fraction of the stochastic gradients always need to be recalculated, regardless of the distance. We illustrate the meaningfulness of Assumption 2 in Figure 2.

Our result of this section can be summarized as follows.
Hiding ‖x0 − x∗‖, L, C in the big-O notation, and letting Ttime be the gradient complexity, we can modify GD so that it finds a point x with

    f(x) − f(x∗) ≤ O( α/Ttime + 2^{−Ω(Ttime^{1/3})} ) .

We emphasize that our modified algorithm does not need to know α or C.

3.1 Algorithm Description

In classical gradient descent (GD), starting from x0 ∈ R^d, one iteratively updates xk+1 ← xk − (1/L)·∇f(xk). We propose GDlin (see Algorithm 1) which, at a high level, differs from GD in two ways:

• It performs a truncated gradient descent step with travel distance ‖xk − xk+1‖ ≤ ξ per step.
• It speeds up the process of calculating ∇f(xk) by using the lingering of past gradients.

Algorithm 1 GDlin(f, x^(0), S, C, D)
Input: f(x) = (1/n) Σ_{i=1}^n fi(x) convex and L-smooth; starting vector x^(0) ∈ R^d; number of epochs S ≥ 1; parameters C, D > 0.
Output: vector x ∈ R^d.
1: for s ← 1 to S do
2:   x0 ← x^(s−1); m ← ⌈(1 + C²/(16D²))^s⌉; and ξ ← C/m.
3:   g ← 0 and gi ← 0 for each i ∈ [n].
4:   for k ← 0 to m − 1 do
5:     Calculate Λk ⊆ [n] from x0, ..., xk according to Definition 3.1.
6:     for i ∈ Λk do
7:       g ← g + (∇fi(xk) − gi)/n and gi ← ∇fi(xk).
8:     xk+1 ← xk − min{ ξ/‖g‖, 1/L } g    ▷ it satisfies g = ∇f(xk)
9:   x^(s) ← xm.
10: return x = x^(S).

Figure 1: Illustration of the index sets Λk (panels (a) and (b)).

Formally, GDlin consists of S epochs s = 1, 2, ..., S of growing length m = ⌈(1 + C²/(16D²))^s⌉. In each epoch, it starts with x0 ∈ R^d and performs m truncated gradient descent steps

    xk+1 ← xk − min{ ξ/‖∇f(xk)‖, 1/L } · ∇f(xk) .

We choose ξ = C/m to ensure that the worst-case travel distance ‖xm − x0‖ is at most mξ = C.

In each iteration k = 0, 1, ..., m − 1 of epoch s, in order to calculate ∇f(xk), GDlin constructs index sets Λ0, Λ1, ..., Λm−1 ⊆ [n] and recalculates only ∇fi(xk) for those i ∈ Λk. We formally introduce these index sets below, and illustrate them in Figure 1.

Definition 3.1. Given x0, x1, ..., xm−1 ∈ R^d, we define index subsets Λ0, ..., Λm−1 ⊆ [n] as follows. Let Λ0 = [n]. For each k ∈ {1, 2, ..., m − 1}, if (k0, ..., kt) is k's lowbit sequence from Definition 2.2, then (recalling k = kt)

    Λk := ∪_{i=0}^{t−1} ( Bki(k − ki) \ Bki(kt−1 − ki) )   where   Bk(r) := Λk ∩ B(xk, r·ξ) .

3.2 Intuitions & Properties of Index Sets

We show in this paper that our construction of the index sets satisfies the following three properties.

Lemma 3.2. The construction of Λ0, . . .
, Λm−1 ensures that g = ∇f(xk) in each iteration k.

[Figure 1 legend: box 1 = B0(8); 2 = B0(12)\B0(8); 3 = B0(14)\B0(12); 4 = B0(15)\B0(14); 5 = B8(4); 6 = B8(6)\B8(4); 7 = B8(7)\B8(6); 8 = B12(2); 9 = B12(3)\B12(2); 0 = B14(1).]

Claim 3.3. The gradient complexity to construct Λ0, ..., Λm−1 is O( (1/n) Σ_{k=0}^{m−1} |Λk| ) under Assumption 1. The space complexity is O(n log n).

Lemma 3.4. Under Assumption 2, we have (1/n) Σ_{k=0}^{m−1} |Λk| ≤ O(α m + log² m).

Claim 3.3 is easy to verify. Indeed, for each Λℓ that is calculated, we can sort its indices j ∈ Λℓ in increasing order of δ(xℓ, j).³ Now, whenever we calculate Bki(k − ki) \ Bki(kt−1 − ki), we have already sorted the indices in Λki, so we can directly retrieve those j in the corresponding range, namely with (kt−1 − ki)·ξ ≤ δ(xki, j) < (k − ki)·ξ.

As for the space complexity, in any iteration k, we only need to store ⌈log₂ k⌉ index sets Λℓ with ℓ < k. For instance, when calculating Λ15 (see Figure 1(b)), we only need to use Λ0, Λ8, Λ12, Λ14; and from k = 16 onwards, we no longer need to store Λ1, . . .
, Λ15.

Lemma 3.2 is technically involved to prove (see the full version), but we give a sketched proof by picture. Take k = 15 as an example. As illustrated by Figure 1(b), for every j ∈ [n]:

• If j belongs to Λ15 (i.e., boxes 4, 0, 9, 7 of Figure 1), then we have calculated ∇fj(x15), so we are fine.
• If j belongs to Λ14 \ B14(1) (i.e., the ⊕ region of Figure 1(b)), then we have ∇fj(x15) = ∇fj(x14), because ‖x15 − x14‖ ≤ ξ and j ∉ B14(1). Therefore, we can safely retrieve gj = ∇fj(x14) to represent ∇fj(x15).
• If j belongs to Λ12 \ B12(3) (i.e., the ⊗ region of Figure 1(b)), then we have ∇fj(x15) = ∇fj(x12) for a similar reason. Also, the most recent update of gj was at iteration 12, so we can safely retrieve gj to represent ∇fj(x15).
• And so on.

In sum, for all indices j ∈ [n] we have gj = ∇fj(xk), so g = (g1 + ··· + gn)/n equals ∇f(xk).

Lemma 3.4 is also involved to prove (see the full version), but again should be intuitive from the picture. The indices in boxes 1, 2, 3, 4 of Figure 1 are disjoint, and belong to B(x0, 15ξ), totaling at most |B(x0, 15ξ)| ≤ n·ψ(15ξ). The indices in boxes 5, 6, 7 of Figure 1 are also disjoint, and belong to B(x8, 7ξ), totaling at most |B(x8, 7ξ)| ≤ n·ψ(7ξ). If we sum up the cardinalities of these boxes by carefully grouping them in this manner, then we can prove Lemma 3.4 using Assumption 2.

3.3 Convergence Theorem

So far, Lemma 3.4 shows that we can reduce the gradient complexity from O(m) to Õ(1) for every m steps of gradient descent. Therefore, we wish to set m as large as possible, or equivalently ξ = C/m as small as possible. Unfortunately, when ξ is too small, it will impact the performance of truncated gradient descent (see the full version).
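For concreteness, the truncated gradient descent step used throughout this section (travel distance capped at ξ per step) can be sketched as follows; a toy illustration under our own naming, not the paper's released code:

```python
def truncated_gd_step(x, g, xi, L):
    """One truncated GD step: x <- x - min(xi/||g||, 1/L) * g.

    The factor min(xi/||g||, 1/L) caps the travel distance at xi,
    so m such steps move at most m * xi = C in total."""
    norm_g = sum(gj * gj for gj in g) ** 0.5
    if norm_g == 0.0:
        return list(x)  # already at a stationary point
    step = min(xi / norm_g, 1.0 / L)
    return [xj - step * gj for xj, gj in zip(x, g)]
```

For f(x) = x²/2 (so L = 1 and ∇f(x) = x), a step from x = 10 with ξ = 0.5 is truncated to travel exactly 0.5, while a step from x = 0.5 is an ordinary 1/L gradient step.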
This motivates us to start with a small value of m and increase it epoch by epoch. Indeed, as the number of epochs grows, f(x0) becomes closer to the minimum f(x∗), and thus we can choose smaller values of ξ.

Formally, we have:

Theorem 3.5. Given any x^(0) ∈ R^d and D > 0 that is an upper bound on ‖x^(0) − x∗‖, suppose Assumptions 1 and 2 are satisfied with parameters C ∈ (0, D] and α ∈ [0, 1]. Then, denoting by ms = ⌈(1 + C²/(16D²))^s⌉, we have that GDlin(f, x^(0), S, C, D) outputs a point x ∈ R^d satisfying f(x) − f(x∗) ≤ 4LD²/mS, with gradient complexity Ttime = O( Σ_{s=1}^S (α ms + log² ms) ).

As simple corollaries, we have:

Theorem 3.6. In the setting of Theorem 3.5, given any T ≥ 1, one can choose S so that GDlin finds a point x in gradient complexity Ttime = O(T) s.t.

    f(x) − f(x∗) ≤ O( (LD⁴/C²) · (α/T) ) + LD² / 2^{Ω(C²T/D²)^{1/3}} .

We remark here that if ψ(r) = 1 (so there is no lingering effect for gradients), we can choose C = D, and in this case GDlin gives back the convergence f(x) − f(x∗) ≤ O(LD²/T) of GD.

³Calculating those lingering radii δ(xℓ, j) requires gradient complexity |Λℓ| according to Assumption 1, and the time for sorting is negligible.

4 Our Modification to SVRG

In this section, we use Assumption 1 to improve the running time of SVRG [19, 36], one of the most widely applied stochastic gradient methods in large-scale settings. The purpose of this section is to construct an algorithm that works well in practice: it should (1) work for any possible lingering radii δ(x, i), (2) be identical to SVRG if δ(x, i) ≡ 0, and (3) be faster than SVRG when δ(x, i) is large.

Recall how the SVRG method works.
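As a reference point for the modification described next, one epoch of plain SVRG can be sketched as follows (a toy sketch with illustrative names; `grads(i, x)` stands for ∇fi(x)):

```python
import random

def svrg_epoch(grads, x0, m, eta, n):
    """One epoch of (plain) SVRG on f = (1/n) * sum_i f_i.

    grads(i, x) returns the gradient of f_i at x (as a list).
    The full gradient at the snapshot x0 is computed once; every
    inner step then uses the unbiased estimator
        g = grad_f(x0) + grad_f_i(x_k) - grad_f_i(x0)."""
    full = [sum(col) / n for col in zip(*(grads(i, x0) for i in range(n)))]
    x = list(x0)
    for _ in range(m):
        i = random.randrange(n)  # i drawn uniformly from [n]
        gi_x, gi_0 = grads(i, x), grads(i, x0)
        g = [fj + a - b for fj, a, b in zip(full, gi_x, gi_0)]
        x = [xj - eta * gj for xj, gj in zip(x, g)]
    return x
```

The modification below changes only the sampling of i (restricting it to indices whose gradients actually moved) and the corresponding rescaling of the correction term.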
Each epoch of SVRG consists of m iterations (m = 2n in practice). Each epoch starts with a point x0 (known as the snapshot) where the full gradient ∇f(x0) is computed exactly. In each iteration k = 0, 1, ..., m − 1 of this epoch, SVRG updates xk+1 ← xk − ηg, where η > 0 is the learning rate and g is the gradient estimator g = ∇f(x0) + ∇fi(xk) − ∇fi(x0) for some i randomly drawn from [n]. Note that it satisfies Ei[g] = ∇f(xk), so g is an unbiased estimator of the gradient. In the next epoch, SVRG starts with the xm of the previous epoch.⁴ We denote by x^(s) the value of x0 at the beginning of epoch s = 0, 1, 2, ..., S − 1.

Our Algorithm. Our algorithm SVRGlin (pseudocode in the full version) maintains disjoint subsets Hs ⊆ [n], where each Hs includes the set of indices i whose gradients ∇fi(x^(s)) from epoch s can still be safely reused at present.

At the starting point x0 of an epoch s, we let Hs = [n] \ (H0 ∪ ··· ∪ Hs−1) and re-calculate the gradients ∇fi(x0) only for i ∈ Hs; the remaining ones can be loaded from memory. This computes the full gradient ∇f(x0). Then, we set m = 2|Hs| and perform only m iterations within epoch s. We next discuss how to perform the update xk → xk+1 and maintain {Hs}s during each iteration.

• In each iteration k of this epoch, we claim that ∇fi(xk) = ∇fi(x0) for every i ∈ H0 ∪ ··· ∪ Hs.⁵ Thus, we can uniformly sample i from [n] \ (H0 ∪ ··· ∪ Hs), and construct an unbiased estimator

    g ← ∇f(x0) + ( 1 − (Σ_{s′=0}^{s} |Hs′|) / n ) · [ ∇fi(xk) − ∇fi(x0) ]
Then, we update xk+1 \u2190 xk \u2212 \u03b7g the same way as SVRG.\nWe emphasize that the above choice of g reduces its variance (because there are fewer random\nchoices), and it is known that reducing variance leads to faster running time [19].\n\u2022 As for how to maintain {Hs}s, in each iteration k after xk+1 is computed, for every s(cid:48) \u2264 s, we\nwish to remove those indices i \u2208 Hs(cid:48) such that the current position x lies outside of the lingering\nradius of i, i.e., \u03b4(x(s), i) < (cid:107)x \u2212 x(s)(cid:107). To ef\ufb01ciently implement this, we need to make sure\nthat whenever Hs(cid:48) is constructed (at the beginning of epoch s(cid:48)), the algorithm sort all the indices\ni \u2208 Hs(cid:48) by the increasing order of \u03b4(x(s(cid:48)), i). We include implementation details in full version.\n\n5 Preliminary Empirical Evaluation\n\nIn this section, we construct a revenue maximization\nLP (2.1) using the publicly accessible dataset of Yahoo!\nFront Page Today Module [6, 22]. We describe details of\nthe experimental setup in full version. Based on this real-\nlife dataset, we validate Assumption 2 and our motivation\nbehind lingering gradients. We also test the performance\nof SVRGlin from Section 4 on optimizing this LP.\nIllustration of Lingering Radius. We calculate linger-\ning radii on the dual problem (2.2). Let \u03b8 > 0 be a pa-\nrameter large enough so that e\u2212\u03b8 can be viewed as zero.\n(For instance, \u03b8 = 20 gives e\u221220 \u2248 2 \u00d7 10\u22129.) Then, for\neach point x \u2208 R\u22650 and index i \u2208 [n], we let\n\nFigure 2: |B(x, r)|/n as a function of r. We\nchoose \u03b8 = 5. 
The dashed curve is for x = 0, and the solid curve is for x near the optimum.

    δ(x, i) := min_{j ∈ [d], j ≠ j∗} { ( (rj∗ − xj∗) pi,j∗ − (rj − xj) pi,j − θ pi µ ) / ( pi,j∗ + pi,j ) }   where   j∗ := arg max_{j ∈ [d]} { (rj − xj) pi,j } .

It is now a simple exercise to verify that, denoting by ej the j-th basis unit vector,⁶

    ∇fi(y) ≈ (b1, ..., bd) − n pi,j∗ ej∗   for every ‖y − x‖∞ ≤ δ(x, i) .

In Figure 2, we plot |B(x, r)| = |{ j ∈ [n] | δ(x, j) < r }| as an increasing function of r. We see that for practical data, |B(x, r)|/n is indeed bounded above by some increasing function ψ(·). We provide more justification of why this happens in the full paper.

⁴Some authors use the average of x1, ..., xm to start the next epoch, but we choose this simpler version.
⁵This is because for every i ∈ Hs, by the definition of Hs we have ∇fi(xk) = ∇fi(x^(s)) = ∇fi(x0); and for every i ∈ Hs′ where s′ < s, we know ∇fi(xk) = ∇fi(x^(s′)), but we also have ∇fi(x0) = ∇fi(x^(s′)) (because otherwise i would have been removed from Hs′).

Figure 3: Comparison of SVRGlin vs. SVRG and SAGA on (a) dual objective optimality (for which different learning rates are presented) and (b) primal objective optimality (for which the learning rates are best tuned).

Remark 5.1. This δ(x, i) differs from our definition in Section 2 in two ways. First, it ensures ∇fi(y) ≈ ∇fi(x) as opposed to exact equality; for practical purposes this is no big issue, and we choose θ = 5 in our experiments.
Second, ‖y − x‖∞ ≤ δ(x, i) gives a bigger "safe region" than ‖y − x‖ ≤ δ(x, i); thus, when implementing SVRGlin, we adopt ‖·‖∞ as the norm of choice.

Numerical Experiments. We consider solving the dual problem (2.2). In Figure 3(a), we plot the optimization error of (2.2) as a function of #grad/n, the number of stochastic gradient computations divided by n, also known as the number of passes of the dataset.

Figure 3(a) compares our SVRGlin to SVRG and SAGA (each for its 3 best tuned step lengths).⁷ We can see that SVRGlin is close to SVRG or SAGA during the first 5-7 passes of the data. This is because initially, x moves fast and usually cannot stay within the lingering radii for most indices i. After that period, SVRGlin requires a dramatically smaller number of gradient computations, as x moves more and more slowly and stays within the lingering radii more easily. It is interesting to note that SVRGlin does not significantly improve the optimization error as a function of the number of epochs; the improvement primarily lies in reducing the number of gradient computations per epoch.

More Plots. In Figure 3(b), we also compare the primal objective value for the LP (2.1). (We explain how to get feasible primal solutions from the dual in the full version.) It is perhaps worth noting that we have chosen µ = 10^{-5} as the regularization error, and the primal objective error indeed reaches 10^{-6}, which is roughly µ. In the full version, we also compare the running time of the algorithms. Those plots are almost identical to Figure 3(b).

6 Conclusion

In this paper, we study convex problems where the stochastic gradients ∇fi(x) can be reused when we move away from x.
In our theoretical result, we model the number of stochastic gradients that can be changed (and thus cannot be reused) as a function of how much distance we travel away from $x$, and show faster convergence for gradient descent (in terms of the number of gradient computations). On the empirical side, we show how to modify the SVRG method to reuse stochastic gradients efficiently. Figure 3(a) and Figure 3(b) summarize our findings on a hypothetical experiment.

$^6$For any other coordinate $j \neq j^*$, it satisfies $e^{(r_j-y_j)p_{i,j}/(p_i\mu)} \,/\, e^{(r_{j^*}-y_{j^*})p_{i,j^*}/(p_i\mu)} \le e^{-\theta}$ and hence is negligible.

$^7$Each epoch of SVRG consists of a full gradient computation and $2n$ iterations, totaling $3n$ computations of (new) stochastic gradients. (We do not count the computation of $\nabla f_i(0)$ at $x = 0$.)

Acknowledgements

We would like to thank Greg Yang, Ilya Razenshteyn and Sébastien Bubeck for discussing the motivation of this problem.