{"title": "Memory Efficient Adaptive Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9749, "page_last": 9758, "abstract": "Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter, thus introducing significant memory overheads that restrict the size of the model being used as well as the number of examples in a mini-batch. We describe an effective and flexible adaptive optimization method with greatly reduced memory overhead. Our method retains the benefits of per-parameter adaptivity while allowing significantly larger models and batch sizes. We give convergence guarantees for our method, and demonstrate its effectiveness in training very large translation and language models with up to 2-fold speedups compared to the state-of-the-art.", "full_text": "Memory-Ef\ufb01cient Adaptive Optimization\n\nRohan Anil Vineet Gupta\n\nGoogle Brain\n\n{rohananil,vineet}@google.com\n\ntkoren@google.com\n\nTomer Koren\n\nGoogle Brain and Tel Aviv Univ.\n\nYoram Singer\nPrinceton Univ.\n\ny.s@cs.princeton.edu\n\nAbstract\n\nAdaptive gradient-based optimizers such as Adagrad and Adam are crucial for\nachieving state-of-the-art performance in machine translation and language model-\ning. However, these methods maintain second-order statistics for each parameter,\nthus introducing signi\ufb01cant memory overheads that restrict the size of the model\nbeing used as well as the number of examples in a mini-batch. We describe an\neffective and \ufb02exible adaptive optimization method with greatly reduced memory\noverhead. Our method retains the bene\ufb01ts of per-parameter adaptivity while allow-\ning signi\ufb01cantly larger models and batch sizes. 
We give convergence guarantees for our method, and demonstrate its effectiveness in training very large translation and language models with up to 2-fold speedups compared to the state-of-the-art.\n\n1 Introduction\n\nAdaptive gradient-based optimizers such as Adagrad [11] and Adam [15] are among the de facto methods of choice in modern machine learning. These methods adaptively tune the learning rate for each parameter during the optimization process using cumulative second-order statistics. Often offering superior convergence properties, these methods are very attractive in large scale applications due to their moderate time and space requirements, which are linear in the number of parameters. However, when training extremely large models even the modest memory overhead imposes grave limitations on the quality of the trained model. For example, recent advances in natural language processing [26, 17] show that models with hundreds of millions to billions of parameters, trained with adaptive optimization methods, achieve state-of-the-art results. In such cases, the memory overhead of the optimizer severely restricts the size of the model that can be used as well as the number of examples in each mini-batch, both of which have a dramatic effect on the accuracy of the model.\nMotivated by these challenges, we describe an adaptive optimization method that retains the benefits of standard per-parameter adaptivity while significantly reducing memory overhead. Our construction is general and flexible, and very simple to implement. We give convergence guarantees for our method in the convex (online or stochastic) optimization setting, and demonstrate experimentally that it is particularly effective when the gradients exhibit natural activation patterns; namely, when the parameters can be subdivided into (not necessarily disjoint) sets where gradient entries within sets are correlated and of a similar order of magnitude. 
For example, we often observe in deep networks that the incoming (outgoing) edges into (from) a neuron are jointly activated and, loosely speaking, their associated gradients exhibit similar statistical characteristics. That said, our analysis of the algorithm makes no statistical assumptions on the gradients and is applicable for general stochastic convex optimization. Further, we do not assume that the activation pattern is fully prescribed a priori.\nLarge scale experiments show that our algorithm achieves comparable, and at times superior, rates of convergence compared to standard linear-space adaptive methods. Focusing primarily on language modeling tasks where state-of-the-art models are extremely large, we further demonstrate that the reduction in memory footprint can be utilized for a substantial increase in the batch size, which greatly speeds up convergence in a distributed environment. For a fixed budget of computational resources, our method is able to shorten the end-to-end walltime for convergence by up to 50%. Our method also exhibits slightly improved per-step time, which could be attributed to a reduction in the frequency of memory accesses.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Related work\n\nAdaptive learning rates in online and stochastic optimization date back at least to [5] and were popularized in [11, 16], the former of which introduced the well-known Adagrad algorithm. Several variants of Adagrad have now been proposed in the optimization and machine learning literature (see [19] and the references therein), the most notable of which is Adam [15]. 
All of these methods require (at least) linear space for maintaining various per-parameter statistics during their execution. One notable exception, which is directly related to our work, is the Adafactor algorithm [23] that was proposed as a way to reduce the memory costs of Adam, primarily for training large language models. While the memory requirements of our construction are similar to Adafactor's, the application scope and the convergence properties of the two algorithms are quite different. We discuss the relationship in more detail in Section 4 and give an empirical comparison between the algorithms in Section 5.\nSpring et al. [25] provide an alternative way to reduce memory costs, making use of the Count-Sketch data structure [7] to maintain a compressed approximation to the auxiliary variables. One key difference between SM3 and Count-Sketch is that SM3 uses specific hash functions instead of random hash functions. Our hash functions are compatible with slices of parameter tensors and are geared towards exploiting empirically observed correlations between the auxiliary parameters, as we discuss below (see Section 4). As a result, our sketches can be 100x-1000x smaller than the original tensors (compared to the 5x reduction reported in [25]), while showing significantly smaller approximation error (we provide details in the full version of the paper [3]). In addition, randomized sketching is extremely inefficient to implement on GPUs and TPUs, since it involves sparse look-ups and is not cache-efficient. These differences allow us to show significant improvements for a large variety of tasks and models, as compared to the results in [25].\nAlso related to our work is the Shampoo algorithm for optimization over tensor structures [12]. The goal of Shampoo is very different from ours: going beyond entry-wise learning rates and employing full-matrix regularization in a computationally efficient way. 
Nonetheless, Shampoo can also be seen as a method to substantially reduce the memory footprint of full-matrix preconditioned algorithms (specifically, full-matrix Adagrad). In a sense, our algorithms are analogous to a diagonalized version of the Shampoo algorithm. Yet another recent adaptive optimization method is the GGT algorithm [2]. Similarly to Shampoo, the goal of the latter is to reduce the computation cost of full-matrix preconditioning in order to make it practical in large scale settings. However, GGT stores multiple copies of the gradient over the course of its execution, and as a result, its space requirements restrict it from being applied at large scale.\n\n2 Preliminaries\n\n2.1 Online optimization\n\nWe henceforth assume the general online optimization setting (see [22, 13]). Online optimization consists of rounds $t = 1, \ldots, T$, where in each round the algorithm chooses a parameter vector $w_t \in \mathbb{R}^d$. After making a choice on round $t$, the algorithm receives a loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$ which is used to form an update of the parameters. In our analysis, we focus on online convex optimization, in which $\ell_1, \ldots, \ell_T$ are convex. Often, as is the case in this paper, the update is determined by the gradient $g_t = \nabla \ell_t(w_t)$ of the instantaneous loss $\ell_t$ at the current iterate $w_t$. The algorithm is measured by its $T$-round regret with respect to a given comparator $w^\star \in \mathbb{R}^d$, defined as the quantity $\sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(w^\star)$. An online optimization algorithm is convergent if its regret is $o(T)$, i.e., its average regret approaches zero as $T$ grows.\nThe above setting includes stochastic (possibly mini-batched) optimization as a special case. In stochastic optimization the underlying goal is to minimize a population loss $L(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ based on samples of $z$. Here $\ell(w, z)$ defines the loss of the parameters $w$ w.r.t. a batch $z$. 
The online loss function $\ell_t(w) = \ell(w, z_t)$ is the average loss over a mini-batch $z_t$ received on iteration $t$. The stochastic gradient $g_t$ is a conditionally unbiased estimate of the gradient of $L$ at the current parameter vector $w_t$. Under convexity assumptions, an online algorithm with vanishing average regret can be converted to a stochastic optimization algorithm for minimizing the population loss $L$ [6].\n\n2.2 Adaptive methods\n\nFor the sake of self-containment, we give a brief description of adaptive gradient methods, focusing on Adagrad [11]. Adagrad maintains at each step $t$ parameter-wise accumulated statistics which are computed from the previously obtained gradients $g_1, \ldots, g_t$:\n\n$$\gamma_t(i) = \sum_{s=1}^{t} g_s^2(i), \qquad \forall\, i \in [d]. \tag{1}$$\n\nBased on these statistics, the update rule of the algorithm on step $t$ takes the form:\n\n$$w_{t+1}(i) = w_t(i) - \eta\, \frac{g_t(i)}{\sqrt{\gamma_t(i)}}, \qquad \forall\, i \in [d],$$\n\nwhere $\eta > 0$ is an external learning rate parameter. Duchi et al. [11] proved the following regret bound for Adagrad with respect to a given $w^\star$ (with properly tuned $\eta$):\n\n$$\sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(w^\star) = O\Bigg( D \sum_{i=1}^{d} \sqrt{\sum_{t=1}^{T} g_t^2(i)} \Bigg), \tag{2}$$\n\nwhere $D \ge \max_t \|w_t - w^\star\|_\infty$. Adagrad has proved to be particularly useful in training sparse models, where the effective learning rates $\eta / \sqrt{\gamma_t(i)}$ decay in a moderate way for rare, yet potentially informative, features. In these settings, Adagrad can potentially lead to substantial improvements in convergence time; see for instance the discussion in [11]. Crucially, however, Adagrad must maintain an auxiliary sequence of accumulators $\gamma_t$ and thus needs $\Omega(d)$ additional space. The goal of this paper is to provide memory-efficient methods with comparable convergence characteristics that refrain from maintaining the full vectors $\gamma_t$.\n\n3 The SM3 Algorithm\n\nWe now present our memory-efficient adaptive optimization algorithm. 
As an abstraction, the algorithm employs a cover of the parameters: a collection of $k$ nonempty sets $\{S_r\}_{r=1}^{k}$, such that $S_r \subseteq [d]$ and $\cup_r S_r = [d]$. In particular, each index $i \in [d]$ may be contained in multiple sets $S_r$. The algorithm maintains a single variable for each set $S_r$ in the cover. Thus, the additional space it requires is $O(k)$ rather than the $O(d)$ required by standard adaptive methods. In large scale applications, $k$ will be chosen to be negligible in comparison to $d$, which translates to substantial savings in memory; see Section 4 for a discussion of the covers used in practice.\nConcretely, for each set $S_r$ in the cover, the algorithm maintains a running sum, $\mu_t(r)$, of the maximal variance over all gradient entries $j \in S_r$. Next, for each parameter $i$, we take the minimum over all variables $\mu_t(r)$ associated with sets which cover $i$, denoted $S_r \ni i$. Thereafter, the learning rate corresponding to the $i$'th gradient entry is determined by taking the square-root of this minimum, denoted by $\nu_t(i)$. Accordingly, we name our algorithm the Square-root of Minima of Sums of Maxima of Squared-gradients Method, or in short, SM3. See Algorithm SM3-I for its pseudocode.\nAs noted above, SM3-I requires only $O(k)$ space in addition to the space required for storing the parameters $w_t$ themselves. The time per iteration of SM3-I is $O(\sum_{r=1}^{k} |S_r|)$. To see this, consider a bipartite graph defined over $d + k$ vertices. Nodes on one side of the graph correspond to indices $i \in [d]$, while nodes on the other side correspond to indices $r \in [k]$. The edges of the graph are all pairs $(i, r)$ such that $i \in S_r$. The complexity of each inner for-loop of the algorithm scales with the number of edges in this graph, which is equal to $O(\sum_{r=1}^{k} |S_r|)$. 
Note that updating the weights $w_t$ takes $O(d)$ time, which is always dominated by the former quantity.\n\nSM3-I\n1: parameters: learning rate $\eta$\n2: initialize $w_1 = 0$; $\forall r \in [k]: \mu_0(r) = 0$\n3: for $t = 1, \ldots, T$ do\n4:   receive gradient $g_t = \nabla \ell_t(w_t)$\n5:   for $r = 1, \ldots, k$ do\n6:     set $\mu_t(r) \leftarrow \mu_{t-1}(r) + \max_{j \in S_r} g_t^2(j)$\n7:   for $i = 1, \ldots, d$ do\n8:     set $\nu_t(i) \leftarrow \min_{r: S_r \ni i} \mu_t(r)$\n9:     update $w_{t+1}(i) \leftarrow w_t(i) - \eta\, g_t(i) / \sqrt{\nu_t(i)}$  (with the convention that $0/0 = 0$)\n\nThe following provides convergence guarantees for SM3-I.\n\nProposition 1. Assume that the loss functions $\ell_1, \ell_2, \ldots$ are convex, and let $w_1, w_2, \ldots$ be the iterates generated by SM3-I. Then, for any $w^\star \in \mathbb{R}^d$,\n\n$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \le 2D \sum_{i=1}^{d} \sqrt{ \min_{r: S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j) },$$\n\nwhere $\max_t \|w_t - w^\star\|_\infty \le D$, choosing $\eta = D$.¹\nFor stochastic optimization, i.e., when the functions $\ell_t$ correspond to i.i.d. samples with $\mathbb{E}[\ell_t(w)] = L(w)$, the above bound translates via standard arguments to an $O(1/\sqrt{T})$-type convergence guarantee for the average iterate $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$ of the form\n\n$$\mathbb{E}[L(\bar{w}_T)] - L(w^\star) = O\Bigg( \frac{1}{T} \sum_{i=1}^{d} \mathbb{E} \sqrt{ \min_{r: S_r \ni i} \sum_{t=1}^{T} \max_{j \in S_r} g_t^2(j) } \Bigg).$$\n\nNote that adding more sets $S_r$ to the cover used by SM3 always improves its convergence bound, but results in a worse space complexity and a higher runtime per step. When $k = d$ and $S_i = \{i\}$ for all $i \in [d]$, SM3-I reduces to the Adagrad algorithm, and the regret bound in Proposition 1 then precisely recovers the bound attained by Adagrad (recall Eq. (2)). In general, the right-hand side of Proposition 1 is never smaller than Adagrad's regret bound, as expected from a space-restricted scheme (this is a consequence of Claim 2 below). 
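To make the pseudocode concrete, the following is a minimal NumPy sketch of one SM3-I step. The function name and the representation of the cover as a list of flat index arrays are our own illustrative choices; an actual implementation would operate on tensor slices as described in Section 4.

```python
import numpy as np

def sm3_i_step(w, grad, mu, cover, lr=0.1):
    """One SM3-I step: accumulate a max of squared gradients per cover set,
    then scale each coordinate by the min over the sets covering it."""
    for r, S in enumerate(cover):
        mu[r] += np.max(grad[S] ** 2)        # mu_t(r) <- mu_{t-1}(r) + max_{j in S_r} g_t(j)^2
    nu = np.full_like(w, np.inf)
    for r, S in enumerate(cover):
        nu[S] = np.minimum(nu[S], mu[r])     # nu_t(i) <- min over covering sets of mu_t(r)
    # update with the convention 0/0 = 0
    step = np.where(nu > 0, lr * grad / np.sqrt(np.maximum(nu, 1e-30)), 0.0)
    return w - step, mu
```

With the singleton cover $S_i = \{i\}$ (so $k = d$), the accumulators coincide with Adagrad's $\gamma_t$ and the update matches the reduction to Adagrad noted above.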
Nevertheless, the two bounds can be of similar order of magnitude in practical scenarios; see Section 4 below for a detailed discussion.\nWe now give a proof of Proposition 1. First, we state two elementary properties of the step sizes the algorithm computes. For a proof, see the full version of the paper [3].\n\nClaim 2. For any $i$, the sequence $\nu_1(i), \nu_2(i), \ldots$ is monotonically increasing, and $\nu_t(i) \ge \sum_{s=1}^{t} g_s^2(i)$.\n\nProof of Proposition 1. Let us first assume that $g_1(i) > 0$ for all $i$, so that $\nu_t(i) > 0$ for all $i$ and $t \ge 1$ due to Claim 2. We start by observing that SM3-I performs Online Mirror Descent updates, where the step on round $t$ uses the positive definite diagonal matrix $H_t = \mathrm{diag}(\nu_t^{1/2})$ for regularization. Then, employing a standard regret bound for the Online Mirror Descent algorithm with time-dependent regularization (see for instance [11, Proposition 3]), the regret of the algorithm is bounded by\n\n$$\frac{1}{2\eta} \sum_{t=1}^{T} \Big( \|w_t - w^\star\|_{H_t}^2 - \|w_{t+1} - w^\star\|_{H_t}^2 \Big) + \frac{\eta}{2} \sum_{t=1}^{T} \big( \|g_t\|_{H_t}^* \big)^2 .$$\n\nHere, $\|x\|_H = \sqrt{x^\mathsf{T} H x}$ and $\|\cdot\|_H^*$ is the corresponding dual norm, $\|x\|_H^* = \sqrt{x^\mathsf{T} H^{-1} x}$. Henceforth, for notational convenience we set $\nu_0 = 0$. Simplifying the first sum above using the fact that the $H_t$ are diagonal matrices, we have\n\n$$\sum_{t=1}^{T} \Big( \|w_t - w^\star\|_{H_t}^2 - \|w_{t+1} - w^\star\|_{H_t}^2 \Big) \le \sum_{t=1}^{T} \big( \nu_t^{1/2} - \nu_{t-1}^{1/2} \big) \cdot (w_t - w^\star)^2 \le \sum_{t=1}^{T} \big( \nu_t^{1/2} - \nu_{t-1}^{1/2} \big) \cdot \|w_t - w^\star\|_\infty^2 \mathbf{1}_d \le D^2\, \nu_T^{1/2} \cdot \mathbf{1}_d = D^2\, \mathrm{Tr}(H_T).$$\n\nNow, let $\gamma_t(i) = \sum_{s=1}^{t} g_s^2(i)$ and consider the positive definite diagonal matrix $G_t = \mathrm{diag}(\gamma_t^{1/2})$. From [12, Lemma 2] with $\Phi(G) = \mathrm{Tr}(G)$, we have\n\n$$\sum_{t=1}^{T} \big( \|g_t\|_{G_t}^* \big)^2 \le \sum_{t=1}^{T} \big( \|g_t\|_{G_T}^* \big)^2 + \mathrm{Tr}(G_T) = \gamma_T^{-1/2} \cdot \gamma_T + \mathrm{Tr}(G_T) = 2\, \mathrm{Tr}(G_T).$$\n\nAlso, from Claim 2 we know that for all $t$, $H_t \succeq G_t$, thus\n\n$$\sum_{t=1}^{T} \big( \|g_t\|_{H_t}^* \big)^2 \le \sum_{t=1}^{T} \big( \|g_t\|_{G_t}^* \big)^2 \le 2\, \mathrm{Tr}(G_T) \le 2\, \mathrm{Tr}(H_T).$$\n\nIn summary, we have established that\n\n$$\sum_{t=1}^{T} \big( \ell_t(w_t) - \ell_t(w^\star) \big) \le \Big( \frac{D^2}{2\eta} + \eta \Big) \mathrm{Tr}(H_T).$$\n\nPlugging in $\eta = D$ and the expression for the diagonal elements of $H_T$, we obtain the claim.\nFor the degenerate case where the matrices $H_t$ may not be strictly positive definite, a careful yet technical inspection of the proof above reveals that our arguments apply to this case as well by replacing inverses with pseudo-inverses. The rest of the proof remains intact as the algorithm does not update parameter $i$ on step $t$ if the corresponding diagonal entry in $H_t$ is zero.\n\n¹Here we implicitly assume that the iterates of the algorithm remain bounded and $D$ is a constant. This can be enforced by projecting the iterates to a bounded set of choice; we avoid introducing projections explicitly as they are rarely used in practice.\n\n3.1 SM3-II\n\nWe now discuss a slightly more efficient variant of SM3, which we describe in SM3-II. It is similar to SM3-I, and improves on the latter in the following sense.\n\nProposition 3. For any $i \in [d]$, the sequence $\nu'_1(i), \ldots, \nu'_T(i)$ is monotonically increasing. Further, fixing a sequence of gradients $g_1, \ldots, g_T$, we have for all $t, i$ that $\sum_{s=1}^{t} g_s^2(i) \le \nu'_t(i) \le \nu_t(i)$, where $\nu_1(i), \ldots, \nu_T(i)$ is the sequence SM3-I emits upon receiving the gradients $g_1, \ldots, g_T$.\n\n(See the full version of the paper [3] for a proof.) In other words, SM3-II provides a tighter upper bound on the cumulative gradient squares than SM3-I. 
Consequently, we can show, along similar lines to the proof of Proposition 1, a slightly better bound for SM3-II that scales with the quantity $\sum_{i=1}^{d} \sqrt{\nu'_T(i)}$, which is always smaller than the one appearing in the bound of SM3-I.\n\nSM3-II\n1: parameters: learning rate $\eta$\n2: initialize $w_1 = 0$; $\forall r \in [k]: \mu'_0(r) = 0$\n3: for $t = 1, \ldots, T$ do\n4:   receive gradient $g_t = \nabla \ell_t(w_t)$\n5:   initialize $\mu'_t(r) = 0$ for all $r \in [k]$\n6:   for $i = 1, \ldots, d$ do\n7:     $\nu'_t(i) \leftarrow \min_{r: S_r \ni i} \mu'_{t-1}(r) + g_t^2(i)$\n8:     $w_{t+1}(i) \leftarrow w_t(i) - \eta\, g_t(i) / \sqrt{\nu'_t(i)}$  (with the convention that $0/0 = 0$)\n9:     for all $r: S_r \ni i$ do\n10:      $\mu'_t(r) \leftarrow \max\{\mu'_t(r), \nu'_t(i)\}$\n\n4 Discussion\n\nThus far, we gave an analysis of SM3 in a worst-case (convex) setting without placing any further assumptions on the statistical characteristics of the underlying stochastic gradients. Further, we did not attempt to relate the cover used by SM3 to properties of the underlying stochastic optimization problem. It should not come as a surprise that in this general setting, the convergence of SM3 might be much worse, at least in theory, than its linear-memory counterpart Adagrad.\n\nActivation patterns. Often in our experiments, we observe common statistical attributes that could be exploited by SM3. Specifically, we see that certain entries of the stochastic gradients have (on average) similar values, and exhibit what we refer to as an activation pattern. For example, in gradients of embedding layers of deep networks, an entire row (or column) is either zero or non-zero. Similarly, in intermediate layers we often observe that gradients associated with the same unit are of similar order of magnitude. In these cases, a similar phenomenon is observed in the second-order statistics maintained by adaptive methods. In Figure 1 we visualize this phenomenon for different layers of a Transformer network. 
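The SM3-II update shown in the pseudocode above can likewise be sketched in a few lines of NumPy. The function name and the list-of-index-arrays cover representation are our own; rebuilding $\mu'_t$ from $\nu'_t$ at the end of the step is equivalent to the running maxima in the pseudocode, since $\nu'_t$ only reads the previous accumulators $\mu'_{t-1}$.

```python
import numpy as np

def sm3_ii_step(w, grad, mu, cover, lr=0.1):
    """One SM3-II step: nu'_t(i) = min_{r: S_r contains i} mu'_{t-1}(r) + g_t(i)^2,
    then mu'_t(r) = max_{i in S_r} nu'_t(i), rebuilt from scratch each step."""
    nu = np.full_like(w, np.inf)
    for r, S in enumerate(cover):
        nu[S] = np.minimum(nu[S], mu[r])   # min over covering sets of mu'_{t-1}
    nu = nu + grad ** 2                    # add the current squared gradient
    # update with the convention 0/0 = 0
    step = np.where(nu > 0, lr * grad / np.sqrt(np.maximum(nu, 1e-30)), 0.0)
    new_mu = np.zeros_like(mu)
    for r, S in enumerate(cover):
        new_mu[r] = np.max(nu[S])          # mu'_t(r) = max_{i in S_r} nu'_t(i)
    return w - step, new_mu
```

For singleton covers the accumulators again coincide with Adagrad's $\gamma_t$, consistent with the sandwich bound of Proposition 3.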
In the full version of the paper [3] we give additional illustrations of similar phenomena in convolutional layers of image classification models.\n\nChoice of covers. The intuitive notion of an activation pattern motivates a natural and generic choice for the cover used by SM3 in practice. For the parameters of deep networks, which are organized as a collection of tensors, we form a cover consisting of slices of co-dimension 1 for each tensor. Thus, for an $m \times n$ parameter matrix, the cover consists of the rows and columns of the matrix. The memory requirements therefore drop from $\Theta(mn)$ to merely $\Theta(m + n)$. For a parameter tensor of dimension $n_1 \times \cdots \times n_p$, the reduction in memory consumption is even more pronounced, dropping from $\Theta(\prod_{i=1}^{p} n_i)$ to $\Theta(\sum_{i=1}^{p} n_i)$. This virtually eliminates the memory overhead associated with maintaining the adaptive learning rates.\n\nFigure 1: Visualization of Adagrad's statistics (cf. Eq. (1)) for different weight matrices in a Transformer-Big model trained with Adagrad on WMT'14 en→fr (color intensities are in log scale): (a) input embedding, (b) attention layer, (c) output softmax.\n\nWe argue, though only informally, that when the choice of cover used by SM3 is compatible with the observed activation patterns, we expect the convergence of SM3 to be significantly better, and to closely match Adagrad. Quantitatively, if each parameter $i \in [d]$ is covered by a set $S_r$ such that $g_s(j) \approx g_s(i)$ for all $j \in S_r$, then $\max_{j \in S_r} g_s^2(j) \approx g_s^2(i)$, and thus $\min_{r: S_r \ni i} \sum_s \max_{j \in S_r} g_s^2(j) \approx \sum_s g_s^2(i)$. Thus, the bounds in Proposition 1 and Eq. (2) are of similar order of magnitude. In other words, in such scenarios we inherit the convergence properties of Adagrad while using a negligible amount of memory. We remark that the activation pattern need not be fully specified in advance; in particular, SM3 is robust to whether a certain parameter is \u201crow tied\u201d or \u201ccolumn tied\u201d, as long as both rows and columns are included in the cover.\n\nComparison with Adafactor. Adafactor [23] is a very effective method for space-efficient adaptive optimization. SM3 and Adafactor differ in a number of important ways. First, Adafactor is only defined for matrix-shaped parameters, while SM3 applies to tensors of arbitrary dimensions and, even more generally, to any predefined cover of the parameters. Second, Adafactor is in essence a fixed learning-rate algorithm, being a memory-constrained variation of Adam, and often requires a manually devised learning-rate schedule to ensure convergence. In contrast, SM3 adapts its learning rates in a data-driven manner similar to Adagrad. Finally, SM3 comes with rigorous convergence guarantees in stochastic convex optimization settings.\n\n5 Experiments\n\nWe demonstrate the practical efficacy of SM3 on several machine learning tasks using published state-of-the-art architectures. We focus on three domains: machine translation, language modeling, and image classification. We implemented SM3 as an optimizer in TensorFlow [1]; source code is publicly available at [4]. Our implementation follows the pseudocode of SM3-II, as it performed slightly yet consistently better than SM3-I in our experiments (as predicted by our bounds). 
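As an illustration of the co-dimension-1 covers described in Section 4, the small helper below (its name is our own) enumerates the slices of a tensor as flat index sets, one per coordinate along each axis; for a matrix these are exactly its rows and columns.

```python
import numpy as np

def codim1_cover(shape):
    """Cover of a tensor's entries by slices of co-dimension 1: one index
    set per coordinate along each axis (rows and columns for a matrix).
    The number of accumulators is sum(shape) instead of prod(shape)."""
    flat = np.arange(int(np.prod(shape))).reshape(shape)
    cover = []
    for axis, n in enumerate(shape):
        for idx in range(n):
            cover.append(np.take(flat, idx, axis=axis).ravel())
    return cover
```

For an $m \times n$ matrix this yields $m + n$ index sets (each entry covered once per axis), matching the drop from $\Theta(mn)$ to $\Theta(m+n)$ memory noted above.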
We use covers induced by rows and columns of matrices, and more generally, by slices of higher-order tensors (e.g., in convolutional layers represented by 4-dimensional tensors), as described in Section 4. In addition to being compatible with the natural activation patterns, these covers facilitate efficient tensor operations available on GPUs and TPUs for computing the max and min over the sets. In all experiments, we used the Cloud TPU-v2 device [14], where each core has 8GiB of memory. For more details on all of our experiments, including the precise hyperparameters used in each of them, refer to the full version of the paper [3].\n\n5.1 Machine translation\n\nWe experimented with machine translation tasks on two standard datasets from WMT'14: English to French (en→fr) with 36.3M sentence pairs, and English to German (en→de) with 4.5M sentence pairs. We used the state-of-the-art Transformer architecture of Vaswani et al. [26]. The basic version of this model has 93.3M parameters and consumes 0.36GiB of memory. The larger variant (Transformer-Big) has 375.4M parameters (1.432GiB) and consists of 6 layers for its encoder and decoder, where each layer has 1024 model dimensions, 8192 hidden dimensions, and 16 attention heads.\nHere we report our results on the larger Transformer-Big, and defer results on the basic Transformer to the full version of the paper [3]. We trained Transformer-Big on the en→fr dataset with batches of size 384, and compared SM3 with several standard optimizers in each of the tasks. In all cases, we used momentum (including for Adagrad) and extensively tuned all hyperparameters. We also ran SGD with momentum (with various exponential decay schedules), but it performed poorly and hence it is omitted from the figures. The results are provided in Figure 2 and Table 1, and demonstrate that SM3 performed substantially better and provided a large improvement in BLEU score compared to Adam and Adafactor. 
In addition, the small memory requirements of SM3 and Adafactor allowed us to double the number of examples in a batch to a total of 768, with minimal additional computation resources. In this setting, we found that SM3 outperformed Adafactor in terms of the number of steps as well as the wall-time to convergence by roughly a factor of 2. We further observed that SM3 approximated the second-order statistics tightly. For more details, see the full version of the paper [3].\nBoth models were trained on a 4x4 Cloud TPU-v2 using the Lingvo [24] sequence modeling framework, with 32K word-pieces [21] for each language pair. BLEU scores were computed on Newstest 2014 for evaluation, on tokenized, true-case outputs, and without manual post-processing of the text, similar to [28]. Our BLEU scores are not directly comparable to those of [26]; we instead followed the experimental protocol described in a later work [8].\n\nFigure 2: Test log-perplexity of a Transformer-Big model on WMT'14 en→fr, when training with batch sizes of 384 (left) and 768 (right). For a batch size of 768, Adam and Adagrad were infeasible as they exceeded the available memory.\n\nOPTIMIZER | BATCH SIZE PER CORE (TOTAL) | MEMORY USAGE PER CORE | BLEU\nAdam | 12 (384) | 6.88 GiB | 38.96 ± 0.002\nAdagrad | 12 (384) | 6.85 GiB | 39.90 ± 0.003\nAdafactor | 12 (384) | 5.43 GiB | 37.89 ± 0.002\nSM3 | 12 (384) | 5.36 GiB | 39.81 ± 0.002\nAdafactor | 24 (768) | 7.04 GiB | 39.65 ± 0.002\nSM3 | 24 (768) | 7.02 GiB | 40.50 ± 0.001\n\nTable 1: BLEU scores and memory usage for various batch sizes on the WMT'14 en→fr dataset.\n\n5.2 Language modeling\n\nNext, we considered a language modeling task on the concatenation of Wikipedia and BooksCorpus [29], with 2.5B and 800M words respectively. We used the recent Bidirectional Encoder Representation (BERT) architecture of Devlin et al. 
[10], focusing on its larger variant, coined BERT-Large. BERT-Large is a large bidirectional transformer model containing 24 transformer blocks with 1024 hidden dimensions and 16 self-attention heads. It has 340M parameters (1.297 GiB), and is set up to jointly optimize two objectives: (a) a masked language model (Masked-LM) loss, where the task is to predict masked tokens based on surrounding context, and (b) a next sentence prediction (NSP) loss, where the task is to predict whether two given sentences are consecutive in the text.\n\nOPTIMIZER | BATCH SIZE PER CORE (TOTAL) | MEMORY USAGE PER CORE\nAdam | 8 (1024) | 6.15 GiB\nSM3 | 8 (1024) | 4.90 GiB\nSM3 | 16 (2048) | 6.02 GiB\n\nTable 2: Training memory consumption at different batch sizes for BERT-Large on 8x8 TPUs.\n\nFigure 3: Masked-LM test accuracy (left), and number of steps to reach 70% test accuracy as a function of the batch size (right), of the BERT-Large language model trained on Wikipedia+BooksCorpus. SM3 with batch size 2048 uses about the same amount of memory as Adam/Adagrad with batch size 1024, and scales linearly up to a batch size of 2^16, at which point we hit the hardware memory limits.\n\nAs before, we compared SM3 with Adagrad, Adam and Adafactor. Our results are presented in Figure 3. We see that SM3 worked as well as Adam and Adagrad for a fixed batch size. However, the savings in memory allowed us to train SM3 with double the batch size, resulting in a substantial increase in accuracy. The experiments were run using the open-sourced code from [10] on an 8x8 Cloud TPU-v2 configuration.\nTo underscore the importance of our memory savings in the context of very large models, we report additional results on the number of steps required for reaching a given solution quality for various batch sizes. 
We chose a solution quality of 70% Masked-LM accuracy on the holdout set, which Adam and Adagrad reached at 500k steps. We used the Cloud TPU-v3 device, which has 16GiB per core, for this experiment. We measured the number of steps SM3 needed to reach this accuracy as a function of the batch size. Our results are presented in Figure 3. SM3 scaled almost linearly with the batch size, up to a size of 2^16, at which point the training program reached the limits of the memory available on hardware. We also found that SM3 came out ahead in terms of wall-time: with the same batch size, a step of SM3 was faster than Adam's by 3%, and doubling the batch size allowed it to reach the same solution quality in almost 35% less wall-time for the same computational budget.\n\n5.3 AmoebaNet-D on ImageNet\n\nFinally, we report results from a different domain: image classification on ImageNet [20] with the state-of-the-art AmoebaNet-D architecture [18], which has recently won the Stanford DAWNBench competition [9]. We compared SM3 with SGD with momentum (Adam performed poorly on this task). SM3 performed very well in this task and achieved improved convergence to state-of-the-art performance, reaching 78.71% top-1 and 94.31% top-5 test accuracies. The fully detailed convergence plots are provided in the full version of the paper [3].\n\n6 Summary\n\nMotivated by the large increase in model sizes and the huge amounts of memory required for training them, we have presented a new memory-efficient adaptive optimization algorithm for stochastic optimization called SM3. We demonstrated empirically that SM3 can be used effectively in training modern mammoth-sized models and dramatically decrease memory overhead. 
Utilizing the freed memory for increasing the batch size, our experiments indicate that this saving can also lead to significant improvements in performance. Our theoretical investigation focused on convex objectives. As with many other optimization scenarios, we believe the analysis of convex memory-efficient adaptive optimization could serve as a basis for understanding non-convex settings.\nOur memory savings virtually eliminate the overhead coming from the second-order statistics $\gamma_t$ with little and often no impact on convergence. Additional and potentially substantial improvements in memory consumption could come from compressing or sketching the momentum terms employed by virtually all first-order optimizers used in practice. We leave the exploration of this promising direction for future work.\n\nAcknowledgements\n\nWe would like to thank Luke Metz, Kunal Talwar and Yonghui Wu for numerous helpful discussions and suggestions. Special thanks go to Samy Bengio who made it possible for us to conduct large scale experiments on a tight schedule. We would also like to thank Zhifeng Chen for coming up with the shorthand 'SM3'.\n\nReferences\n\n[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.\n[2] N. Agarwal, B. Bullins, X. Chen, E. Hazan, K. Singh, C. Zhang, and Y. Zhang. The case for full-matrix adaptive regularization. CoRR, abs/1806.02958, 2018.\n[3] R. Anil, V. Gupta, T. Koren, and Y. Singer. Memory-efficient adaptive optimization for large-scale learning. arXiv preprint arXiv:1901.11150, 2019.\n[4] R. Anil, V. Gupta, T. Koren, and Y. Singer. 
SM3 TensorFlow optimizer. https://github.com/google-research/google-research/tree/master/sm3, 2019.
[5] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.
[6] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[7] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, ICALP '02, pages 693-703, Berlin, Heidelberg, 2002. Springer-Verlag. ISBN 3-540-43864-5. URL http://dl.acm.org/citation.cfm?id=646255.684566.
[8] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, pages 76-86, 2018.
[9] C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re, and M. Zaharia. Analysis of DAWNBench, a time-to-accuracy machine learning performance benchmark. arXiv preprint arXiv:1806.01427, 2018.
[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[11] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
[12] V. Gupta, T. Koren, and Y. Singer.
Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1842-1850, 2018.
[13] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.
[14] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1-12. IEEE, 2017.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. COLT 2010, page 244, 2010.
[17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
[18] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
[19] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[21] M. Schuster and K. Nakajima. Japanese and Korean voice search. In ICASSP, pages 5149-5152. IEEE, 2012.
[22] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.
[23] N. Shazeer and M. Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 4603-4611, 2018.
[24] J. Shen, P.
Nguyen, Y. Wu, Z. Chen, et al. Lingvo. https://github.com/tensorflow/lingvo.
[25] R. Spring, A. Kyrillidis, V. Mohan, and A. Shrivastava. Compressing gradient optimizers via count-sketches. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5946-5955, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/spring19a.html.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[27] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.
[28] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[29] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19-27, 2015.