{"title": "Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2915, "page_last": 2923, "abstract": "We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.", "full_text": "Delay-Tolerant Algorithms for\n\nAsynchronous Distributed Online Learning\n\nH. Brendan McMahan\n\nGoogle, Inc.\nSeattle, WA\n\nmcmahan@google.com\n\nMatthew Streeter\nDuolingo, Inc.\u2217\nPittsburgh, PA\n\nmatt@duolingo.com\n\nAbstract\n\nWe analyze new online gradient descent algorithms for distributed systems with\nlarge delays between gradient computations and the corresponding updates. Us-\ning insights from adaptive gradient methods, we develop algorithms that adapt not\nonly to the sequence of gradients, but also to the precise update delays that occur.\nWe \ufb01rst give an impractical algorithm that achieves a regret bound that precisely\nquanti\ufb01es the impact of the delays. 
We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show that when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.

1 Introduction

Stochastic and online gradient descent methods have proved to be extremely useful for solving large-scale machine learning problems [1, 2, 3, 4]. Recently, there has been much work on extending these algorithms to parallel and distributed systems [5, 6, 7, 8, 9]. In particular, Recht et al. [10] and Duchi et al. [11] have shown that standard stochastic algorithms essentially "work" even when updates are applied asynchronously by many threads. Our experiments confirm this for moderate amounts of parallelism (say 100 threads), but show that for large amounts of parallelism (as in a distributed system, with say 1000 threads spread over many machines), performance can degrade significantly. To address this, we develop new algorithms that adapt to both the data and the amount of parallelism.

Adaptive gradient (AdaGrad) methods [12, 13] have proved remarkably effective for real-world problems, particularly on sparse data (for example, text classification with bag-of-words features). The key idea behind these algorithms is to prove a general regret bound in terms of an arbitrary sequence of non-increasing learning rates and the full sequence of gradients, and then to define an adaptive method for choosing the learning rates as a function of the gradients seen so far, so as to minimize the final bound when the learning rates are plugged in.
We extend this idea to the parallel setting, by developing a general regret bound that depends on both the gradients and the exact update delays that occur (rather than say an upper bound on delays). We then present AdaptiveRevision, an algorithm for choosing learning rates and efficiently revising past learning-rate choices that strives to minimize this bound. In addition to providing an adaptive regret bound (which recovers the standard AdaGrad bound in the case of no delays), we demonstrate excellent empirical performance.

*Work performed while at Google, Inc.

Problem Setting and Notation  We consider a computation model where one or more computation units (a thread in a parallel implementation or a full machine in a distributed system) store and update the model x ∈ R^n, and another larger set of computation units perform feature extraction and prediction. We call the first type the Updaters (since they apply the gradient updates) and the second type the Readers (since they read coefficients stored by the Updaters). Because the Readers and Updaters may reside on different machines, perhaps located in different parts of the world, communication between them is not instantaneous. Thus, when making a prediction, a Reader will generally be using a coefficient vector that is somewhat stale relative to the most recent version being served by the Updaters.

As one application of this model, consider the problem of predicting click-through rates for sponsored search ads using a generalized linear model [14, 15]. While the coefficient vector may be stored and updated centrally, predictions must be available in milliseconds in any part of the world. This leads naturally to an architecture in which a large number of Readers maintain local copies of the coefficient vector, sending updates to the Updaters and periodically requesting fresh coefficients from them.
As another application, this model encompasses the Parameter Server/Model Replica split of Downpour SGD [16].

Our bounds apply to general online convex optimization [4], which encompasses the problem of predicting with a generalized linear model (models where the prediction is a function of a_t · x_t, where a_t is a feature vector and x_t are model coefficients). We analyze the algorithm on a sequence of τ = 1, ..., T rounds; for the moment, we index rounds based on when each prediction is made. On each round, a convex loss function f_τ arrives at a Reader, the Reader predicts with x_τ ∈ R^n and incurs loss f_τ(x_τ). The Reader then computes a subgradient g_τ ∈ ∂f_τ(x_τ). For each coordinate i where g_{τ,i} is nonzero, the Reader sends an update to the Updater(s) for those coefficients. We are particularly concerned with sparse data, where n is very large, say 10^6 to 10^9, but any particular training example has only a small fraction of the features a_{t,i} that take non-zero values.

The regret against a comparator x* ∈ R^n is

    Regret(x*) ≡ Σ_{τ=1}^T ( f_τ(x_τ) − f_τ(x*) ).    (1)

Our primary theoretical contributions are upper bounds on the regret of our algorithms.

We assume a fully asynchronous model, where the delays in the read requests and update requests can be different for different coefficients even for the same training event. This leads to a combinatorial explosion in potential interleavings of these operations, making fine-grained adaptive analysis quite difficult. Our primary technique for addressing this will be the linearization of loss functions, a standard tool in online convex optimization which takes on increased importance in the parallel setting. An immediate consequence of convexity is that given a general convex loss function f_τ, with g_τ ∈ ∂f_τ(x_τ), for any x*, we have f_τ(x_τ) − f_τ(x*) ≤ g_τ · (x_τ − x*). One of the key observations of Zinkevich [1] is that by plugging this inequality into (1), we see that if we can guarantee low regret against linear functions, we can provide the same guarantees against arbitrary convex functions. Further, expanding the dot products and re-arranging the sum, we can write

    Regret(x*) ≡ Σ_{i=1}^n Regret_i(x*_i)    where    Regret_i(x*_i) = Σ_{τ=1}^T g_{τ,i} (x_{τ,i} − x*_i).    (2)

If we consider algorithms where the updates are also coordinate decomposable (that is, the update to coordinate i can be applied independently of the update of coordinate j), then we can bound Regret(x*) by proving a per-coordinate bound for linear functions and then summing across coordinates. In fact, our computation architecture already assumes a coordinate decomposable algorithm since this lets us avoid synchronizing the Updates, and so in addition to leading to more efficient algorithms, this approach will greatly simplify the analysis. The proofs of Duchi et al. [11] take a similar approach.

Bounding per-coordinate regret  Given the above, we will design and analyze asynchronous one-dimensional algorithms which can be run independently on each coordinate of the true learning problem. For each coordinate, each Read and Update is assumed to be an atomic operation. It will be critical to adopt an indexing scheme different than the prediction-based indexing τ used above. The net result will be bounding the sum of (2), but we will actually re-order the sum to make the analysis easier.
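The coordinate-wise decomposition in (2) is easy to sanity-check numerically; the following is an illustrative sketch (function names are ours, not from the paper):

```python
# Sketch: for linearized losses, total regret decomposes into independent
# per-coordinate sums, as in Eq. (2). G and X are lists of gradient/play
# vectors (one per round), x_star is the comparator.
def total_linear_regret(G, X, x_star):
    """sum_t g_t . (x_t - x*), i.e. Eq. (1) after linearization."""
    return sum(g[i] * (x[i] - x_star[i])
               for g, x in zip(G, X) for i in range(len(x_star)))

def per_coordinate_regret(G, X, x_star, i):
    """Regret_i(x*_i) = sum_t g_{t,i} (x_{t,i} - x*_i)."""
    return sum(g[i] * (x[i] - x_star[i]) for g, x in zip(G, X))
```

Summing `per_coordinate_regret` over all coordinates recovers `total_linear_regret` exactly, which is why a per-coordinate analysis suffices.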
Critically, this ordering could be different for different coordinates, and so considering one coordinate at a time simplifies the analysis considerably.1 We index time by the order of the Updates, so the index t is such that g_t is the gradient associated with the t-th update applied and x_t is the value of the coefficient immediately before the update for g_t is applied. Then, the Online Gradient Descent (OGD) update consists of exactly the assumed-atomic operation

    x_{t+1} = x_t − η_t g_t,    (3)

where η_t is a learning rate. Let r(t) ∈ {1, ..., t} be the index such that x_{r(t)} was the value of the coefficient used by the Reader to compute g_t (and to predict on the corresponding example). That is, update r(t) − 1 completed before the Read for g_t, but update r(t) completed after. Thus, our loss (for coordinate i) is g_t x_{r(t)}, and we desire a bound on

    Regret_i(x*) = Σ_{t=1}^T g_t (x_{r(t)} − x*).

Main result and related work  We say an update s is outstanding at time t if the Read for Update s occurs before update t, but the Update occurs after: precisely, s is outstanding at t if r(s) ≤ t < s. We let F_t ≡ {s | r(s) ≤ t < s} be the set of updates outstanding at time t. We call the sum of these gradients the forward gradient sum, g^fwd_t ≡ Σ_{s∈F_t} g_s. Then, ignoring constant factors and terms independent of T, we show that AdaptiveRevision has a per-coordinate bound of the form

    Regret ≤ √( Σ_{t=1}^T g_t^2 + g_t g^fwd_t ).    (4)

Theorem 3 gives the precise result as well as the n-dimensional version. Observe that without any delays, g^fwd_t = 0, and we arrive at the standard AdaGrad-style bound. To prove the bound for AdaptiveRevision, we require an additional InOrder assumption on the delays, namely that for any indexes s_1 and s_2, if r(s_1) < r(s_2) then s_1 < s_2.
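The delay bookkeeping just defined (the sets F_t, the forward sums g^fwd_t, and the InOrder property) can be sketched as follows; this is an illustrative helper of ours, with the read times given as a list where r[s-1] = r(s):

```python
def forward_sets(r):
    """F_t = {s : r(s) <= t < s} for t = 1..T, given 1-indexed read times."""
    T = len(r)
    return {t: {s for s in range(1, T + 1) if r[s - 1] <= t < s}
            for t in range(1, T + 1)}

def forward_sum(F, g, t):
    """g^fwd_t: sum of the gradients whose updates are outstanding at time t."""
    return sum(g[s - 1] for s in F[t])

def in_order(r):
    """InOrder: r(s1) < r(s2) implies s1 < s2, i.e. r is non-decreasing in s."""
    return all(r[i] <= r[i + 1] for i in range(len(r) - 1))
```

For example, a constant delay of one update gives r = [1, 1, 2, 3], which satisfies InOrder, and F_1 = {2} since only update 2 has been read but not yet applied at time 1.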
This assumption should be approximately satisfied most of the time for realistic delay distributions, and even under a more pathological delay distribution (delays uniform on {0, ..., m} rather than more tightly grouped around a mean delay), our experiments show excellent performance for AdaptiveRevision.

The key challenge is that unlike in the AdaGrad case, conceptually we need to know gradients that have not yet been computed in order to calculate the optimal learning rate. We surmount this by using an algorithm that not only chooses learning rates adaptively, but also revises previous gradient steps. Critically, these revisions require only moderate additional storage and network cost: we store a sum of gradients along with each coefficient, and for each Read, we remember the value of this gradient sum at the time of the Read until the corresponding Update occurs. This latter storage can essentially be implemented on the network, if the gradient sum is sent from the Updater to the Reader and back again, ensuring it is available exactly when needed. This is the approach taken in the pseudocode of Algorithm 1.

Against a true adversary and a maximum delay of m, in general we cannot do better than just training synchronously on a single machine using a 1/m fraction of the data. Our results surmount this issue by producing strongly data-dependent bounds: we do not expect fully adversarial gradients and delays in practice, and so on real data the bound we prove still gives interesting results. In fact, we can essentially recover the guarantees for AsyncAdaGrad from Duchi et al. [11], which rely on stochastic assumptions on the sparsity of the data, by applying the same assumptions to our bound. To simplify the comparison, WLOG we consider a 1-dimensional problem where ‖x*‖_2 = 1, ‖g_t‖_2 ≤ 1, and we have the stochastic assumption that each g_t is nonzero independently with probability p (implying M_j = 1, M = 1, and M_2 = p in their notation). Then, simple calculations (given in Appendix B) show our bound for AdaptiveRevision implies a bound on expected regret of O(√((1 + mp) p T)) without knowledge of p or m, ignoring terms independent of T.2 AsyncAdaGrad achieves the same bound, but critically this requires knowledge of both p and m in advance in order to tune the learning rate appropriately (in the general n-dimensional case, this would mean knowing not just one parameter p, but a separate sparsity parameter p_j for each coordinate, and then using an appropriate per-coordinate scaling of the learning rate depending on this); without such knowledge, AsyncAdaGrad only obtains the much worse bound O((1 + mp) √(pT)). AdaptiveRevision will also provide significantly better guarantees if most of the delays are much less than the maximum, or if the data is only approximately sparse (e.g., many g_t = 10^-6 rather than exactly 0). The above analysis also makes a worst-case assumption on the g_t g^fwd_t terms, but in practice many gradients in g^fwd_t are likely to have opposite signs and cancel out, a fact our algorithm and bounds can exploit.

1Our analysis could be extended to non-coordinate-decomposable algorithms, but then the full gradient update across all coordinates would need to be atomic. This case is less interesting due to the computational overhead.

2In the analysis, we choose the parameter G_0 based on an upper bound m on the delay, but this only impacts an additive term independent of T.

2 Algorithms and Analysis

We first introduce some additional definitions.
Let o(t) ≡ max(F_t ∪ {t}), the index of the highest update outstanding at time t, or t itself if nothing is outstanding. The sets F_t fully specify the delay pattern. In light of (4), we further define G^fwd_t ≡ g_t^2 + 2 g_t g^fwd_t. We also define B_t, the set of updates applied while update t was outstanding. Under our notation, this set is easily defined as B_t = {r(t), ..., t − 1} (or the empty set if r(t) = t, so in particular B_1 = ∅). We will also frequently use the backward gradient sum, g^bck_t ≡ Σ_{s=r(t)}^{t−1} g_s. These vectors most often appear in the products G^bck_t ≡ g_t^2 + 2 g_t g^bck_t. Figure 3 in Appendix A shows a variety of delay patterns and gives a visual representation of the sums G^fwd and G^bck. We say the delay is (upper) bounded by m if t − r(t) ≤ m for all t, which implies |F_t| ≤ m and |B_t| ≤ m. Note that if m = 0 then r(t) = t. We use the compressed summation notation c_{1:t} ≡ Σ_{s=1}^t c_s for vectors, scalars, and functions.

Our analysis builds on the following simple but fundamental result (Appendix C contains all proofs and lemmas omitted here).

Lemma 1. Given any non-increasing learning-rate schedule η_t, define σ_t where σ_1 = 1/η_1 and σ_t = 1/η_t − 1/η_{t−1} for t > 1, so η_t = 1/σ_{1:t}. Then, for any delay schedule, unprojected online gradient descent achieves, for any x* ∈ R,

    Regret(x*) ≤ (2R_T)^2/(2η_T) + (1/2) Σ_{t=1}^T η_t G^fwd_t    where    (2R_T)^2 ≡ Σ_{t=1}^T (σ_t/σ_{1:T}) |x* − x_t|^2.

Proof. Given how we have indexed time, we can consider the regret of a hypothetical online gradient descent algorithm that plays x_t and then observes g_t, since this corresponds exactly to the update (3). We can then bound regret for this hypothetical setting using a simple modification to the standard bound for OGD [1],

    Σ_{t=1}^T g_t · x_t − g_{1:T} · x* ≤ Σ_{t=1}^T (σ_t/2) |x* − x_t|^2 + (1/2) Σ_{t=1}^T η_t g_t^2.

The actual algorithm used x_{r(t)} to predict on g_t, not x_t, so we can bound its Regret by

    Regret ≤ (2R_T)^2/(2η_T) + (1/2) Σ_{t=1}^T η_t g_t^2 + Σ_{t=1}^T g_t (x_{r(t)} − x_t).    (5)

Recalling x_{t+1} = x_t − η_t g_t, observe that x_{r(t)} − x_t = Σ_{s=r(t)}^{t−1} η_s g_s = Σ_{s∈B_t} η_s g_s, and so

    Σ_{t=1}^T g_t (x_{r(t)} − x_t) = Σ_{t=1}^T g_t Σ_{s∈B_t} η_s g_s = Σ_{s=1}^T η_s g_s Σ_{t∈F_s} g_t = Σ_{s=1}^T η_s g_s g^fwd_s,

using Lemma 4(E) from the Appendix to re-order the sum. Plugging into (5) completes the proof.

For projected online gradient descent, by projecting onto a feasible set of radius R and assuming x* is in this set, we immediately get |x* − x_t| ≤ 2R. Without projecting, we get a more adaptive bound which depends on the weighted quadratic mean 2R_T. Though less standard, we choose to analyze the unprojected variant of the algorithm for two reasons. First, our analysis rests heavily on the ability to represent points played by our algorithms exactly as weighted sums of past gradients, a property not preserved when projection is invoked. More importantly, we know of no experiments on real-world prediction problems (where any x ∈ R^n is a valid model) where the projected algorithm actually performs better. In our experience, once the learning-rate schedule is tuned appropriately, the resulting R_T values will not be more than a constant factor of ‖x*‖.
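The sum re-ordering step in the proof can be checked numerically, since both sides range over exactly the pairs (s, t) with r(t) ≤ s < t. This sketch (all names ours) fixes an arbitrary learning-rate schedule and a small random delay pattern:

```python
import math
import random

random.seed(0)
T = 50
g = [random.gauss(0.0, 1.0) for _ in range(T)]       # g[t-1] = g_t
eta = [1.0 / math.sqrt(t) for t in range(1, T + 1)]  # any fixed schedule works
# read times r(t) <= t, here with delays of at most 3 updates
r = [random.randint(max(1, t - 3), t) for t in range(1, T + 1)]

# LHS: sum_t g_t (x_{r(t)} - x_t), using x_{r(t)} - x_t = sum_{s=r(t)}^{t-1} eta_s g_s
lhs = sum(g[t - 1] * sum(eta[s - 1] * g[s - 1] for s in range(r[t - 1], t))
          for t in range(1, T + 1))
# RHS: sum_s eta_s g_s g^fwd_s, with g^fwd_s = sum over t with r(t) <= s < t
gfwd = [sum(g[t - 1] for t in range(1, T + 1) if r[t - 1] <= s < t)
        for s in range(1, T + 1)]
rhs = sum(eta[s - 1] * g[s - 1] * gfwd[s - 1] for s in range(1, T + 1))
assert abs(lhs - rhs) < 1e-9
```

The assertion holds for any gradients, schedule, and delay pattern, since the two sides are the same double sum grouped by t and by s respectively.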
This makes intuitive sense in the stochastic case, where it is known that averages of the x_t should in fact converge to x*.3 For learning rate tuning we assume we know in advance a constant R̃ such that R_T ≤ R̃; again, in practice this is roughly equivalent to assuming we know ‖x*‖ in advance in order to choose the feasible set.

Our first algorithm, HypFwd (for Hypothetical-Forward), assumes it has knowledge of all the gradients, so it can optimize its learning rates to minimize the above bound. If there are no delays, that is, g^fwd_t = 0 for all t, then this immediately gives rise to a standard AdaGrad-style online gradient descent method. If there are delays, the G^fwd_t terms could be large, implying the optimal learning rates should be smaller. Unfortunately, it is impossible for a real algorithm to know g^fwd_t when η_t is chosen. To work toward a practical algorithm, we introduce HypBack, which achieves similar guarantees (but is still impractical). Finally, we introduce AdaptiveRevision, which plays points very similar to HypBack, but can be implemented efficiently. Since we will need non-increasing learning rates, it will be useful to define G̃^bck_{1:t} ≡ max_{s≤t} G^bck_{1:s} and G̃^fwd_{1:t} ≡ max_{s≤t} G^fwd_{1:s}. In practice, we expect G̃^bck_{1:T} to be close to G^bck_{1:T}. We assume WLOG that G^fwd_1 > 0, which at worst adds a negligible additive constant to our regret.

Algorithm HypFwd  This algorithm "cheats" by using the forward sum g^fwd_t to choose η_t,

    η_t = α / √(G̃^fwd_{1:t})    (6)

for an appropriate scaling parameter α > 0. Then, Lemma 1 combined with the technical inequality of Corollary 10 (given in Appendix D) gives

    Regret ≤ 2√2 R̃ √(G̃^fwd_{1:T})    (7)

when we take α = √2 R̃ (recalling R̃ ≥ R_T). If there are no delays, this bound reduces to the standard bound 2√2 R̃ √(Σ_{t=1}^T g_t^2). With delays, however, this is a hypothetical algorithm, because it is generally not possible to know g^fwd_t when update t is applied. However, we can implement this algorithm efficiently in a single-machine simulation, and it performs very well (see Section 3). Thus, our goal is to find an efficiently implementable algorithm that achieves comparable results in practice and also matches this regret bound.

Algorithm HypBack  The next step in the analysis is to show that a second hypothetical algorithm, HypBack, approximates the regret bound of (7). This algorithm plays

    x̂_{t+1} = −Σ_{s=1}^t η̂_s g_s    where    η̂_t = α / √(G̃^bck_{1:o(t)} + G_0)    (8)

is a learning rate with parameters α and G_0. This is a hypothetical algorithm, since we also can't (efficiently) know G^bck_{1:o(t)} on round t. We prove the following guarantee:

Lemma 2. Suppose delays bounded by m and |g_t| ≤ L. Then when the InOrder property holds, HypBack with α = √2 R̃ and G_0 = m^2 L^2 has

    Regret ≤ 2√2 R̃ √(G̃^fwd_{1:T}) + 2 R̃ m L.

3For example, the arguments of Nemirovski et al.
[17, Sec 2.2] hold for unprojected gradient descent.

Algorithm 1  Algorithm AdaptiveRevision

Procedure Read(loss function f):
   Read (x_i, ḡ_i) from the Updaters for all necessary coordinates
   Calculate a subgradient g ∈ ∂f(x)
   for each coordinate i with a non-zero gradient do
      Send an update tuple (g ← g_i, ḡ_old ← ḡ_i) to the Updater for coordinate i

Procedure Update(g, ḡ_old):  The Updater initializes state (ḡ ← 0, z ← 1, z′ ← 1, x ← 0) per coordinate. For analysis, assign index t to the current update. Do the following atomically:
   g_bck ← ḡ − ḡ_old
   η_old ← α/√z′                              # Invariant: the effective η for all of g_bck.
   z ← z + g^2 + 2 g g_bck;  z′ ← max(z, z′)  # Maintain z = G^bck_{1:t} and z′ = G̃^bck_{1:t}, to enforce non-increasing η.
   η ← α/√z′                                  # New learning rate.
   x ← x − η g                                # The main gradient-descent update.
   x ← x + (η_old − η) g_bck                  # Apply adaptive revision of some previous steps.
   ḡ ← ḡ + g                                  # Maintain ḡ = g_{1:t}.

Algorithm AdaptiveRevision  Now that we have shown that HypBack is effective, we can describe AdaptiveRevision, which efficiently approximates HypBack. We then analyze this new algorithm by showing its loss is close to the loss of HypBack. Pseudo-code for the algorithm as implemented for the experiments is given in Algorithm 1; we now give an equivalent expression for the algorithm under the InOrder assumption. Let β_t be the learning rate based on G̃^bck_{1:t}, β_t = α/√(G̃^bck_{1:t} + G_0). Then, AdaptiveRevision plays the points

    x_{t+1} = −Σ_{s=1}^t η^t_s g_s    where    η^t_s = β_{min(t, o(s))}.    (9)

When s ≪ t then we will usually have min(t, o(s)) = o(s), and so we see that η^t_s = β_{o(s)} = η̂_s, and so the effective learning rate applied to gradient g_s is the same one HypBack would have used (namely η̂_s); thus, the only difference between AdaptiveRevision and HypBack is on the leading edge, where o(s) > t. See Figure 4 in Appendix A for an example. When InOrder holds, Lemma 6 (in Appendix C) shows Algorithm 1 plays the points specified by (9).

Given Lemma 2, it is sufficient to show that the difference between the loss of HypBack and the loss of AdaptiveRevision is small. Lemma 8 (in the appendix) accomplishes this, showing that under the InOrder assumption and with G_0 = m^2 L^2 the difference in loss is at most 2αLm (a quantity independent of T). Our main theorem is then a direct consequence of Lemma 2 and Lemma 8:

Theorem 3. Under an InOrder delay pattern with a maximum delay of at most m, the AdaptiveRevision algorithm guarantees Regret ≤ 2√2 R̃ √(G̃^fwd_{1:T}) + (2√2 + 2) R̃ m L when we take G_0 = m^2 L^2 and α = √2 R̃. Applied on a per-coordinate basis to an n-dimensional problem, we have

    Regret ≤ 2√2 R̃ Σ_{i=1}^n √( Σ_{t=1}^T ( g_{t,i}^2 + 2 g_{t,i} Σ_{s∈F_{t,i}} g_{s,i} ) ) + n(2√2 + 2) R̃ m L.

We note the n-dimensional guarantee is at most O(n R̃ L √(Tm)), which matches the lower bound for the feasible set [−R, R]^n and g_t ∈ [−L, L]^n up to the difference between R̃ and R (see, for example, Langford et al.
[18]).4 Our point, of course, is that for real data our bound will often be much much better.

4To compare to regret bounds stated in terms of L_2 bounds on the feasible set and the gradients, note for g_t ∈ [−L, L]^n we have ‖g_t‖_2 ≤ √n L, and similarly for x ∈ [−R, R]^n we have ‖x‖_2 ≤ √n R, so the dependence on n is a necessary consequence of using these norms, which are quite natural for sparse problems.

Figure 1: Accuracy as a function of update delays, with learning rate scale factors optimized for each algorithm and dataset for the zero delay case. The x-axis is non-linear. The results are qualitatively similar across the plots, but note the differences in the y-axis ranges. In particular, the random delay pattern appears to hurt performance significantly less than either the minibatch or constant delay patterns.

Figure 2: Accuracy as a function of update delays, with learning rate scale factors optimized as a function of the delay. The lower plot in each group shows the best learning rate scale α on a log-scale.

3 Experiments

We study the performance of both hypothetical algorithms and AdaptiveRevision on two real-world medium-sized datasets. We simulate the update delays using an update queue, which allows us to implement the hypothetical algorithms and also lets us precisely control both the exact delays as well as the delay pattern. We compare to the dual-averaging AsyncAdaGrad algorithm of Duchi et al. [11] (AsyncAda-DA in the figures), as well as asynchronous AdaGrad gradient descent (AsyncAda-GD), which can be thought of as AdaptiveRevision with all g_bck set to zero and no revision step. As analyzed, AdaptiveRevision stores an extra variable (z′) in order to enforce a non-increasing learning rate.
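The per-coordinate state and Update procedure of Algorithm 1, including the z′ check discussed above, can be transcribed as the following sketch (our code, not the authors'; the g0 constructor parameter generalizes the pseudocode's initialization z = z′ = 1):

```python
import math

class AdaptiveRevisionCoord:
    """One coordinate's Updater state: x, gbar = g_{1:t}, z = G^bck_{1:t}, zp = max_s z_s."""

    def __init__(self, alpha=1.0, g0=1.0):
        self.alpha = alpha
        self.gbar = 0.0   # running gradient sum g_{1:t}
        self.z = g0       # z  = G^bck_{1:t}  (Algorithm 1 initializes z = 1)
        self.zp = g0      # z' = max(z so far), enforcing a non-increasing learning rate
        self.x = 0.0

    def read(self):
        # A Reader fetches (x, gbar); gbar is echoed back as gbar_old with the update.
        return self.x, self.gbar

    def update(self, g, gbar_old):
        gbck = self.gbar - gbar_old                # gradients applied while this update was outstanding
        eta_old = self.alpha / math.sqrt(self.zp)  # effective rate previously used for all of gbck
        self.z += g * g + 2.0 * g * gbck
        self.zp = max(self.z, self.zp)
        eta = self.alpha / math.sqrt(self.zp)      # new learning rate
        self.x -= eta * g                          # main gradient-descent step
        self.x += (eta_old - eta) * gbck           # revise the earlier steps
        self.gbar += g
```

With no delay (each Read immediately followed by its Update), gbck = 0, the revision term vanishes, and the procedure reduces to an AdaGrad-style step.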
In practice, we found this had a negligible impact; in the plots above, AdaptiveRevision* denotes the algorithm without this check. With this improvement AdaptiveRevision stores three numbers per coefficient, versus the two stored by AsyncAdaGrad DA or GD.

We consider three different delay patterns, which we parameterize by D, the average delay; this yields a fairer comparison across the delay patterns than using the maximum delay m. We consider: 1) constant delays, where all updates (except at the beginning and the end of the dataset) have a delay of exactly D (e.g., rows (B) and (C) in Figure 3 in the Appendix); 2) a minibatch delay pattern5, where 2D + 1 Reads occur, followed by 2D + 1 Updates; and 3) a random delay pattern, where the delays are chosen uniformly from the set {0, ..., 2D}, so again the mean delay is D. The first two patterns satisfy InOrder, but the third does not.

5It is straightforward to show that under this delay pattern, when we do not enforce non-increasing learning rates, AdaptiveRevision and HypBack are in fact equivalent to standard AdaGrad run on the minibatches (that is, with one update per minibatch using the combined minibatch gradient sum).

We evaluate on two datasets. The first is a web search advertising dataset from a large search engine. The dataset consists of about 3.1×10^6 training examples with a large number of sparse anonymized features based on the ad and query text. Each example is labeled {−1, 1} based on whether or not the person doing the query clicked on the ad. The second is a shuffled version of the malicious URL dataset as described by Ma et al.
[19] (2.4×10^6 examples, 3.2×10^6 features).6 For each of these datasets we trained a logistic regression model, and evaluated using the logistic loss (LogLoss). That is, for an example with feature vector a ∈ R^n and label y ∈ {−1, 1}, the loss is given by ℓ(x, (a, y)) = log(1 + exp(−y a · x)). Following the spirit of our regret bounds, we evaluate the models online, making a single pass over the data and computing accuracy metrics on the predictions made by the model immediately before it trained on each example (i.e., progressive validation). To avoid possible transient behavior, we only report metrics for the predictions on the second half of each dataset, though this choice does not change the results significantly.

The exact parametrization of the learning rate schedule is particularly important with delayed updates. We follow the common practice of taking learning rates of the form η_t = α/√(S_t + 1), where S_t is the appropriate learning rate statistic for the given algorithm, e.g., G̃^bck_{1:o(t)} for HypBack or Σ_{s=1}^t g_s^2 for vanilla AdaGrad. In the analysis, we use G_0 = m^2 L^2 rather than G_0 = 1; we believe G_0 = 1 will generally be a better choice in practice, though we did not optimize this choice.7 When we optimize α, we choose the best setting from a grid {α_0 (1.25)^i | i ∈ N}, where α_0 is an initial guess for each dataset.

All figures give the average delay D on the x-axis. For Figure 1, for each dataset and algorithm, we optimized α in the zero delay (D = m = 0) case, and fixed this parameter as the average delay D increases. This leads to very bad performance for standard AdaGrad DA and GD as D gets large. In Figure 2, we optimized α individually for each delay level; we plot the accuracy as before, with the lower plot showing the optimal learning rate scaling α on a log-scale.
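The evaluation pieces described above, the logistic loss on sparse examples and the learning-rate tuning grid {α_0 (1.25)^i}, can be sketched as follows (sparse vectors represented as {index: value} dicts; function names are ours):

```python
import math

def logloss(x, a, y):
    """log(1 + exp(-y * (a . x))) for a sparse example a and sparse model x."""
    dot = sum(v * x.get(i, 0.0) for i, v in a.items())
    return math.log1p(math.exp(-y * dot))

def alpha_grid(alpha0, k):
    """The tuning grid {alpha0 * 1.25**i} for i = 0..k-1 described above."""
    return [alpha0 * 1.25 ** i for i in range(k)]
```

For example, the all-zeros model scores log 2 ≈ 0.693 on every example, which is the natural baseline for progressive validation.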
The optimal learning rate scaling for GD and DA decreases by two orders of magnitude as the delays increase. However, even with this tuning they do not obtain the performance of AdaptiveRevision. The performance of AdaptiveRevision (and HypBack and HypFwd) is slightly improved by lowering the learning rate as delays increase, but the effect is comparatively very minor. As anticipated, the performances of AdaptiveRevision, HypBack, and HypFwd are closely grouped.

AdaptiveRevision's delay tolerance can lead to enormous speedups in practice. For example, the leftmost plot of Figure 2 shows that AdaptiveRevision achieves better accuracy with an update delay of 10,000 than AsyncAda-DA achieves with a delay of 1000. Because update delays are proportional to the number of Readers, this means that AdaptiveRevision can be used to train a model an order of magnitude faster than AsyncAda-DA, with no reduction in accuracy. This allows for much faster iteration when data sets are large and parallelism is cheap, which is the case in important real-world problems such as ad click-through rate prediction [14].

4 Conclusions and Future Work

We have demonstrated that adaptive tuning and revision of per-coordinate learning rates for distributed gradient descent can significantly improve accuracy as the update delays become large. The key algorithmic technique is maintaining a sum of gradients, which allows the adjustment of all learning rates for gradient updates that occurred between the current Update and its Read. The analysis method is novel, but is also somewhat indirect; an interesting open question is finding a general analysis framework for algorithms of this style. Ideally such an analysis would also remove the technical need for the InOrder assumption, and also allow for the analysis of AdaptiveRevision variants of OGD with Projection and Dual Averaging.

6We also ran experiments on the rcv1.binary training dataset (0.6×10^6 
examples, 0.05×10^6 features) from Chang and Lin [20]; results were qualitatively very similar to those for the URL dataset.

7 The main purpose of choosing a larger G_0 in the theorems was to make the performance of HypBack and AdaptiveRevision provably close to that of HypFwd, even in the worst case. On real data, the performance of the algorithms will typically be close even with G_0 = 1.

References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

[2] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004.

[3] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.

[4] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.

[5] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13(1), January 2012.

[6] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 [math.OC], 2012. URL http://arxiv.org/abs/1212.0873.

[7] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nati Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[8] Daniel Hsu, Nikos Karampatziakis, John Langford, and Alexander J. Smola. Scaling Up Machine Learning, chapter Parallel Online Learning. Cambridge University Press, 2011.

[9] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Automat. Contr., 57(3):592–606, 2012.

[10] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu.
Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.

[11] John C. Duchi, Michael I. Jordan, and H. Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In NIPS, 2013.

[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.

[13] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.

[14] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. Ad click prediction: a view from the trenches. In KDD, 2013.

[15] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, 2010.

[16] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012.

[17] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574–1609, January 2009. ISSN 1052-6234. doi: 10.1137/070704277.

[18] John Langford, Alex Smola, and Martin Zinkevich. Slow learners are fast. In Advances in Neural Information Processing Systems 22, 2009.

[19] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying suspicious URLs: An application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.

[20] Chih-Chung Chang and Chih-Jen Lin.
LIBSVM data sets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, 2010.

[21] Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 2002.