{"title": "QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding", "book": "Advances in Neural Information Processing Systems", "page_first": 1709, "page_last": 1720, "abstract": "Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always guarantee convergence, and it is not clear whether they can be improved. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes for gradient updates which provides convergence guarantees. QSGD allows the user to smoothly trade off \\emph{communication bandwidth} and \\emph{convergence time}: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For example, on 16 GPUs, we can train the ResNet152 network to full accuracy on ImageNet 1.8x faster than the full-precision variant.", "full_text": "QSGD: Communication-Efficient SGD\nvia Gradient Quantization and Encoding\n\nDan Alistarh\n\nIST Austria & ETH Zurich\ndan.alistarh@ist.ac.at\n\nDemjan Grubic\n\nETH Zurich & Google\n\ndemjangrubic@gmail.com\n\nJerry Z. 
Li\n\nMIT\n\njerryzli@mit.edu\n\nRyota Tomioka\n\nMicrosoft Research\n\nryoto@microsoft.com\n\nMilan Vojnovic\n\nLondon School of Economics\n\nM.Vojnovic@lse.ac.uk\n\nAbstract\n\nParallel implementations of stochastic gradient descent (SGD) have received significant\nresearch attention, thanks to its excellent scalability properties. A fundamental\nbarrier when parallelizing SGD is the high bandwidth cost of communicating gradient\nupdates between nodes; consequently, several lossy compression heuristics have\nbeen proposed, by which nodes only communicate quantized gradients. Although\neffective in practice, these heuristics do not always converge.\nIn this paper, we propose Quantized SGD (QSGD), a family of compression\nschemes with convergence guarantees and good practical performance. QSGD\nallows the user to smoothly trade off communication bandwidth and convergence\ntime: nodes can adjust the number of bits sent per iteration, at the cost of possibly\nhigher variance. We show that this trade-off is inherent, in the sense that improving\nit past some threshold would violate information-theoretic lower bounds. QSGD\nguarantees convergence for convex and non-convex objectives, under asynchrony,\nand can be extended to stochastic variance-reduced techniques.\nWhen applied to training deep neural networks for image classification and automated\nspeech recognition, QSGD leads to significant reductions in end-to-end\ntraining time. For instance, on 16 GPUs, we can train the ResNet-152 network to\nfull accuracy on ImageNet 1.8\u00d7 faster than the full-precision variant.\n\n1 Introduction\n\nThe surge of massive data has led to significant interest in distributed algorithms for scaling computations\nin the context of machine learning and optimization. In this context, much attention has\nbeen devoted to scaling large-scale stochastic gradient descent (SGD) algorithms [33], which can be\nbriefly defined as follows. 
Let $f : \\mathbb{R}^n \\to \\mathbb{R}$ be a function which we want to minimize. We have access\nto stochastic gradients $\\tilde{g}$ such that $\\mathbb{E}[\\tilde{g}(x)] = \\nabla f(x)$. A standard instance of SGD will converge\ntowards the minimum by iterating the procedure\n\n$$x_{t+1} = x_t - \\eta_t \\tilde{g}(x_t), \\qquad (1)$$\n\nwhere $x_t$ is the current candidate, and $\\eta_t$ is a variable step-size parameter. Notably, this arises if\nwe are given i.i.d. data points $X_1, \\ldots, X_m$ generated from an unknown distribution $D$, and a loss\nfunction $\\ell(X, \\theta)$, which measures the loss of the model $\\theta$ at data point $X$. We wish to find a model\n$\\theta^*$ which minimizes $f(\\theta) = \\mathbb{E}_{X \\sim D}[\\ell(X, \\theta)]$, the expected loss to the data. This framework captures\nmany fundamental tasks, such as neural network training.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper, we focus on parallel SGD methods, which have received considerable attention recently\ndue to their high scalability [6, 8, 32, 13]. Specifically, we consider a setting where a large dataset is\npartitioned among K processors, which collectively minimize a function f. Each processor maintains\na local copy of the parameter vector $x_t$; in each iteration, it obtains a new stochastic gradient update\n(corresponding to its local data). Processors then broadcast their gradient updates to their peers, and\naggregate the gradients to compute the new iterate $x_{t+1}$.\nIn most current implementations of parallel SGD, in each iteration, each processor must communicate\nits entire gradient update to all other processors. If the gradient vector is dense, each processor will\nneed to send and receive n floating-point numbers per iteration to/from each peer to communicate\nthe gradients and maintain the parameter vector x. 
In practical applications, communicating the\ngradients in each iteration has been observed to be a significant performance bottleneck [35, 37, 8].\nOne popular way to reduce this cost has been to perform lossy compression of the gradients [11, 1,\n3, 10, 41]. A simple implementation is to reduce the precision of the representation, which has\nbeen shown to converge under convexity and sparsity assumptions [10]. A more drastic quantization\ntechnique is 1BitSGD [35, 37], which reduces each component of the gradient to just its sign\n(one bit), scaled by the average over the coordinates of $\\tilde{g}$, accumulating errors locally. 1BitSGD\nwas experimentally observed to preserve convergence [35], under certain conditions; thanks to the\nreduction in communication, it enabled state-of-the-art scaling of deep neural networks (DNNs) for\nacoustic modelling [37]. However, it is currently not known if 1BitSGD provides any guarantees,\neven under strong assumptions, and it is not clear if higher compression is achievable.\nContributions. Our focus is understanding the trade-offs between the communication cost of data-parallel\nSGD, and its convergence guarantees. We propose a family of algorithms allowing for lossy\ncompression of gradients called Quantized SGD (QSGD), by which processors can trade off the\nnumber of bits communicated per iteration with the variance added to the process.\nQSGD is built on two algorithmic ideas. The first is an intuitive stochastic quantization scheme:\ngiven the gradient vector at a processor, we quantize each component by randomized rounding to a\ndiscrete set of values, in a principled way which preserves the statistical properties of the original.\nThe second step is an efficient lossless code for quantized gradients, which exploits their statistical\nproperties to generate efficient encodings. 
Our analysis gives tight bounds on the precision-variance\ntrade-off induced by QSGD.\nAt one extreme of this trade-off, we can guarantee that each processor transmits at most\n$\\sqrt{n}(\\log n + O(1))$ expected bits per iteration, while increasing variance by at most a\n$\\sqrt{n}$ multiplicative factor.\nAt the other extreme, we show that each processor can transmit \u2264 2.8n + 32 bits per iteration in\nexpectation, while increasing variance by only a factor of 2. In particular, in the latter regime,\ncompared to full precision SGD, we use \u2248 2.8n bits of communication per iteration as opposed to\n32n bits, and guarantee at most 2\u00d7 more iterations, leading to bandwidth savings of \u2248 5.7\u00d7.\nQSGD is fairly general: it can also be shown to converge, under assumptions, to local minima for non-convex\nobjectives, as well as under asynchronous iterations. One non-trivial extension we develop\nis a stochastic variance-reduced [23] variant of QSGD, called QSVRG, which has exponential\nconvergence rate.\nOne key question is whether QSGD\u2019s compression-variance trade-off is inherent: for instance, does\nany algorithm guaranteeing at most constant variance blowup need to transmit \u2126(n) bits per iteration?\nThe answer is positive: improving asymptotically upon this trade-off would break the communication\ncomplexity lower bound of distributed mean estimation (see [44, Proposition 2] and [38]).\nExperiments. The crucial question is whether, in practice, QSGD can reduce communication cost\nby enough to offset the overhead of any additional iterations to convergence. The answer is yes.\nWe explore the practicality of QSGD on a variety of state-of-the-art datasets and machine learning\nmodels: we examine its performance in training networks for image classification tasks (AlexNet,\nInception, ResNet, and VGG) on the ImageNet [12] and CIFAR-10 [25] datasets, as well as on\nLSTMs [19] for speech recognition. 
We implement QSGD in Microsoft CNTK [3].\nExperiments show that all these models can significantly benefit from reduced communication when\ndoing multi-GPU training, with virtually no accuracy loss, and under standard parameters. For example,\nwhen training AlexNet on 16 GPUs with standard parameters, the reduction in communication\ntime is 4\u00d7, and the reduction in training time to the network\u2019s top accuracy is 2.5\u00d7. When training an\nLSTM on two GPUs, the reduction in communication time is 6.8\u00d7, while the reduction in training\ntime to the same target accuracy is 2.7\u00d7. Further, even computationally-heavy architectures such as\nInception and ResNet can benefit from the reduction in communication: on 16 GPUs, QSGD reduces\nthe end-to-end convergence time of ResNet152 by approximately 2\u00d7. Networks trained with QSGD\ncan converge to virtually the same accuracy as full-precision variants, and gradient quantization\nmay even slightly improve accuracy in some settings.\nRelated Work. One line of related research studies the communication complexity of convex\noptimization. In particular, [40] studied two-processor convex minimization in the same model,\nprovided a lower bound of \u2126(n(log n + log(1/\u03b5))) bits on the communication cost of n-dimensional\nconvex problems, and proposed a non-stochastic algorithm for strongly convex problems, whose\ncommunication cost is within a log factor of the lower bound. By contrast, our focus is on stochastic\ngradient methods. Recent work [5] focused on round complexity lower bounds on the number of\ncommunication rounds necessary for convex learning.\nBuckwild! [10] was the first to consider the convergence guarantees of low-precision SGD. 
It gave\nupper bounds on the error probability of SGD, assuming unbiased stochastic quantization, convexity,\nand gradient sparsity, and showed significant speedup when solving convex problems on CPUs.\nQSGD refines these results by focusing on the trade-off between communication and convergence.\nWe view quantization as an independent source of variance for SGD, which allows us to employ\nstandard convergence results [7]. The main differences from Buckwild! are that 1) we focus on the\nvariance-precision trade-off; 2) our results apply to the quantized non-convex case; 3) we validate\nthe practicality of our scheme on neural network training on GPUs. Concurrent work proposes\nTernGrad [41], which starts from a similar stochastic quantization, but focuses on the case where\nindividual gradient components can have only three possible values. They show that significant\nspeedups can be achieved on TensorFlow [1], while maintaining accuracy within a few percentage\npoints relative to full precision. The main differences to our work are: 1) our implementation\nguarantees convergence under standard assumptions; 2) we strive to provide a black-box compression\ntechnique, with no additional hyperparameters to tune; 3) experimentally, QSGD maintains the same\naccuracy within the same target number of epochs; for this, we allow gradients to have larger bit\nwidth; 4) our experiments focus on the single-machine multi-GPU case.\nWe note that QSGD can be applied to solve the distributed mean estimation problem [38, 24] with an\noptimal error-communication trade-off in some regimes. In contrast to the elegant random rotation\nsolution presented in [38], QSGD employs quantization and Elias coding. Our use case is different\nfrom the federated learning application of [38, 24], and has the advantage of being more efficient to\ncompute on a GPU.\nThere is an extremely rich area studying algorithms and systems for efficient distributed large-scale\nlearning, e.g. 
[6, 11, 1, 3, 39, 32, 10, 21, 43]. Significant interest has recently been dedicated to\nquantized frameworks, both for inference, e.g., [1, 17] and training [45, 35, 20, 37, 16, 10, 42]. In\nthis context, [35] proposed 1BitSGD, a heuristic for compressing gradients in SGD, inspired by\ndelta-sigma modulation [34]. It is implemented in Microsoft CNTK, and has a cost of n bits and two\nfloats per iteration. Variants of it were shown to perform well on large-scale Amazon datasets by [37].\nCompared to 1BitSGD, QSGD can achieve asymptotically higher compression, provably converges\nunder standard assumptions, and shows superior practical performance in some cases.\n\n2 Preliminaries\n\nSGD has many variants, with different preconditions and guarantees. Our techniques are rather\nportable, and can usually be applied in a black-box fashion on top of SGD. For conciseness, we will\nfocus on a basic SGD setup. The following assumptions are standard; see e.g. [7].\nLet $\\mathcal{X} \\subseteq \\mathbb{R}^n$ be a known convex set, and let $f : \\mathcal{X} \\to \\mathbb{R}$ be differentiable, convex, smooth, and\nunknown. We assume repeated access to stochastic gradients of f, which on (possibly random) input\nx, outputs a direction which is in expectation the correct direction to move in. Formally:\n\nDefinition 2.1. Fix $f : \\mathcal{X} \\to \\mathbb{R}$. A stochastic gradient for f is a random function $\\tilde{g}(x)$ so that\n$\\mathbb{E}[\\tilde{g}(x)] = \\nabla f(x)$. We say the stochastic gradient has second moment at most B if $\\mathbb{E}[\\|\\tilde{g}(x)\\|_2^2] \\le B$ for\nall $x \\in \\mathcal{X}$. We say it has variance at most $\\sigma^2$ if $\\mathbb{E}[\\|\\tilde{g}(x) - \\nabla f(x)\\|_2^2] \\le \\sigma^2$ for all $x \\in \\mathcal{X}$.\n\nObserve that any stochastic gradient with second moment bound B is automatically also a stochastic\ngradient with variance bound $\\sigma^2 = B$, since $\\mathbb{E}[\\|\\tilde{g}(x) - \\nabla f(x)\\|^2] \\le \\mathbb{E}[\\|\\tilde{g}(x)\\|^2]$ as long as\n$\\mathbb{E}[\\tilde{g}(x)] = \\nabla f(x)$. Second, in convex optimization, one often assumes a second moment bound\nwhen dealing with non-smooth convex optimization, and a variance bound when dealing with smooth\nconvex optimization. However, for us it will be convenient to consistently assume a second moment\nbound. This does not seem to be a major distinction in theory or in practice [7].\n\nData: Local copy of the parameter vector x\n1 for each iteration t do\n2   Let $\\tilde{g}^i_t$ be an independent stochastic gradient;\n3   $M^i \\leftarrow \\mathrm{Encode}(\\tilde{g}^i(x))$ //encode gradients;\n4   broadcast $M^i$ to all peers;\n5   for each peer $\\ell$ do\n6     receive $M^\\ell$ from peer $\\ell$;\n7     $\\hat{g}^\\ell \\leftarrow \\mathrm{Decode}(M^\\ell)$ //decode gradients;\n8   end\n9   $x_{t+1} \\leftarrow x_t - (\\eta_t / K) \\sum_{\\ell=1}^{K} \\hat{g}^\\ell$;\n10 end\nAlgorithm 1: Parallel SGD Algorithm.\n\nFigure 1: An illustration of generalized stochastic quantization with 5 levels.\n\nGiven access to stochastic gradients, and a starting point $x_0$, SGD builds iterates $x_t$ given by Equation\n(1), projected onto $\\mathcal{X}$, where $(\\eta_t)_{t \\ge 0}$ is a sequence of step sizes. In this setting, one can show:\nTheorem 2.1 ([7], Theorem 6.3). Let $\\mathcal{X} \\subseteq \\mathbb{R}^n$ be convex, and let $f : \\mathcal{X} \\to \\mathbb{R}$ be unknown, convex,\nand L-smooth. Let $x_0 \\in \\mathcal{X}$ be given, and let $R^2 = \\sup_{x \\in \\mathcal{X}} \\|x - x_0\\|^2$. Let T > 0 be fixed. 
Given\nrepeated, independent access to stochastic gradients with variance bound $\\sigma^2$ for f, SGD with initial\npoint $x_0$ and constant step sizes $\\eta_t = \\frac{1}{L + 1/\\gamma}$, where $\\gamma = \\frac{R}{\\sigma} \\sqrt{\\frac{2}{T}}$, achieves\n\n$$\\mathbb{E}\\left[ f\\left( \\frac{1}{T} \\sum_{t=0}^{T} x_t \\right) \\right] - \\min_{x \\in \\mathcal{X}} f(x) \\le R \\sqrt{\\frac{2\\sigma^2}{T}} + \\frac{L R^2}{T}. \\qquad (2)$$\n\nMinibatched SGD. A modification to the SGD scheme presented above often observed in practice\nis a technique known as minibatching. In minibatched SGD, updates are of the form $x_{t+1} = \\Pi_{\\mathcal{X}}(x_t - \\eta_t \\tilde{G}_t(x_t))$, where $\\tilde{G}_t(x_t) = \\frac{1}{m} \\sum_{i=1}^{m} \\tilde{g}_{t,i}$, and where each $\\tilde{g}_{t,i}$ is an independent stochastic\ngradient for f at $x_t$. It is not hard to see that if $\\tilde{g}_{t,i}$ are stochastic gradients with variance bound $\\sigma^2$,\nthen $\\tilde{G}_t$ is a stochastic gradient with variance bound $\\sigma^2/m$. By inspection of Theorem 2.1, as\nlong as the first term in (2) dominates, minibatched SGD requires a factor of m fewer iterations to converge.\nData-Parallel SGD. We consider synchronous data-parallel SGD, modelling real-world multi-GPU\nsystems, and focus on the communication cost of SGD in this setting. We have a set of K processors\n$p_1, p_2, \\ldots, p_K$ which proceed in synchronous steps, and communicate using point-to-point messages.\nEach processor maintains a local copy of a vector x of dimension n, representing the current estimate\nof the minimizer, and has access to private, independent stochastic gradients for f.\nIn each synchronous iteration, described in Algorithm 1, each processor aggregates the value of x,\nthen obtains random gradient updates for each component of x, then communicates these updates\nto all peers, and finally aggregates the received updates and applies them locally. Importantly, we\nadd encoding and decoding steps for the gradients before and after send/receive in lines 3 and 7,\nrespectively. 
In the following, whenever describing a variant of SGD, we assume the above general\npattern, and only specify the encode/decode functions. Notice that the decoding step does not\nnecessarily recover the original gradient $\\tilde{g}^\\ell$; instead, we usually apply an approximate version.\nWhen the encoding and decoding steps are the identity (i.e., no encoding / decoding), we shall refer\nto this algorithm as parallel SGD. In this case, it is a simple calculation to see that at each processor,\nif $x_t$ was the value of x that the processors held before iteration t, then the updated value of x by the\nend of this iteration is $x_{t+1} = x_t - (\\eta_t / K) \\sum_{\\ell=1}^{K} \\tilde{g}^\\ell(x_t)$, where each $\\tilde{g}^\\ell$ is a stochastic gradient. In\nparticular, this update is merely a minibatched update of size K. Thus, by the discussion above, and\nby rephrasing Theorem 2.1, we have the following corollary:\nCorollary 2.2. Let $\\mathcal{X}$, f, L, $x_0$, and R be as in Theorem 2.1. Fix $\\epsilon > 0$. Suppose we run parallel\nSGD on K processors, each with access to independent stochastic gradients with second moment\nbound B, with step size $\\eta_t = 1/(L + \\sqrt{K}/\\gamma)$, where $\\gamma$ is as in Theorem 2.1. Then if\n\n$$T = O\\left( R^2 \\cdot \\max\\left( \\frac{2B}{K \\epsilon^2}, \\frac{L}{\\epsilon} \\right) \\right), \\quad \\text{then} \\quad \\mathbb{E}\\left[ f\\left( \\frac{1}{T} \\sum_{t=0}^{T} x_t \\right) \\right] - \\min_{x \\in \\mathcal{X}} f(x) \\le \\epsilon. \\qquad (3)$$\n\nIn most reasonable regimes, the first term of the max in (3) will dominate the number of iterations\nnecessary. Specifically, the number of iterations will depend linearly on the second moment bound B.\n\n3 Quantized Stochastic Gradient Descent (QSGD)\n\nIn this section, we present our main results on stochastically quantized SGD. Throughout, log denotes\nthe base-2 logarithm, and the number of bits to represent a float is 32. 
For any vector $v \\in \\mathbb{R}^n$, we\nlet $\\|v\\|_0$ denote the number of nonzeros of v. For any string $\\omega \\in \\{0, 1\\}^*$, we will let $|\\omega|$ denote its\nlength. For any scalar $x \\in \\mathbb{R}$, we let $\\mathrm{sgn}(x) \\in \\{-1, +1\\}$ denote its sign, with $\\mathrm{sgn}(0) = 1$.\n\n3.1 Generalized Stochastic Quantization and Coding\n\nStochastic Quantization. We now consider a general, parametrizable lossy-compression scheme\nfor stochastic gradient vectors. The quantization function is denoted with $Q_s(v)$, where $s \\ge 1$ is\na tuning parameter, corresponding to the number of quantization levels we implement. Intuitively,\nwe define s uniformly distributed levels between 0 and 1, to which each value is quantized in a way\nwhich preserves the value in expectation, and introduces minimal variance. Please see Figure 1.\nFor any $v \\in \\mathbb{R}^n$ with $v \\ne 0$, $Q_s(v)$ is defined as\n\n$$Q_s(v_i) = \\|v\\|_2 \\cdot \\mathrm{sgn}(v_i) \\cdot \\xi_i(v, s), \\qquad (4)$$\n\nwhere $\\xi_i(v, s)$\u2019s are independent random variables defined as follows. Let $0 \\le \\ell < s$ be an\ninteger such that $|v_i| / \\|v\\|_2 \\in [\\ell/s, (\\ell + 1)/s]$. That is, $[\\ell/s, (\\ell + 1)/s]$ is the quantization interval\ncorresponding to $|v_i| / \\|v\\|_2$. Then\n\n$$\\xi_i(v, s) = \\begin{cases} \\ell/s & \\text{with probability } 1 - p\\left( \\frac{|v_i|}{\\|v\\|_2}, s \\right); \\\\ (\\ell + 1)/s & \\text{otherwise.} \\end{cases}$$\n\nHere, $p(a, s) = as - \\ell$ for any $a \\in [0, 1]$. If v = 0, then we define Q(v, s) = 0.\nThe distribution of $\\xi_i(v, s)$ has minimal variance over distributions with support $\\{0, 1/s, \\ldots, 1\\}$, and\nits expectation satisfies $\\mathbb{E}[\\xi_i(v, s)] = |v_i| / \\|v\\|_2$. Formally, we can show:\nLemma 3.1. For any vector $v \\in \\mathbb{R}^n$, we have that (i) $\\mathbb{E}[Q_s(v)] = v$ (unbiasedness), (ii) $\\mathbb{E}[\\|Q_s(v) - v\\|_2^2] \\le \\min(n/s^2, \\sqrt{n}/s) \\|v\\|_2^2$ (variance bound), and (iii) $\\mathbb{E}[\\|Q_s(v)\\|_0] \\le s(s + \\sqrt{n})$ (sparsity).\nEfficient Coding of Gradients. Observe that for any vector v, the output of $Q_s(v)$ is naturally\nexpressible by a tuple $(\\|v\\|_2, \\sigma, \\zeta)$, where $\\sigma$ is the vector of signs of the $v_i$\u2019s and $\\zeta$ is the vector\nof integer values $s \\cdot \\xi_i(v, s)$. The key idea behind the coding scheme is that not all integer values\n$s \\cdot \\xi_i(v, s)$ can be equally likely: in particular, larger integers are less frequent. We will exploit this\nvia a specialized Elias integer encoding [14], presented in full in the full version of our paper [4].\nIntuitively, for any positive integer k, its code, denoted Elias(k), starts from the binary representation\nof k, to which it prepends the length of this representation. It then recursively encodes this prefix.\nWe show that for any positive integer k, the length of the resulting code has |Elias(k)| = log k +\nlog log k + . . . + 1 \u2264 (1 + o(1)) log k + 1, and that encoding and decoding can be done efficiently.\nGiven a gradient vector represented as the triple $(\\|v\\|_2, \\sigma, \\zeta)$, with s quantization levels, our coding\noutputs a string S defined as follows. First, it uses 32 bits to encode $\\|v\\|_2$. It proceeds to encode\nusing Elias recursive coding the position of the first nonzero entry of $\\zeta$. It then appends a bit denoting\n$\\sigma_i$ and follows that with Elias($s \\cdot \\xi_i(v, s)$). Iteratively, it proceeds to encode the distance from the\ncurrent coordinate of $\\zeta$ to the next nonzero, and encodes the $\\sigma_i$ and $\\zeta_i$ for that coordinate in the\nsame way. The decoding scheme is straightforward: we first read off 32 bits to construct $\\|v\\|_2$, then\niteratively use the decoding scheme for Elias recursive coding to read off the positions and values of\nthe nonzeros of $\\zeta$ and $\\sigma$. 
The properties of the quantization and of the encoding imply the following.\nTheorem 3.2. Let $f : \\mathbb{R}^n \\to \\mathbb{R}$ be fixed, and let $x \\in \\mathbb{R}^n$ be arbitrary. Fix $s \\ge 2$ quantization\nlevels. If $\\tilde{g}(x)$ is a stochastic gradient for f at x with second moment bound B, then $Q_s(\\tilde{g}(x))$ is a\nstochastic gradient for f at x with variance bound $\\min\\left( \\frac{n}{s^2}, \\frac{\\sqrt{n}}{s} \\right) B$. Moreover, there is an encoding\nscheme so that in expectation, the number of bits to communicate $Q_s(\\tilde{g}(x))$ is upper bounded by\n\n$$\\left( 3 + \\left( \\frac{3}{2} + o(1) \\right) \\log\\left( \\frac{2(s^2 + n)}{s(s + \\sqrt{n})} \\right) \\right) s(s + \\sqrt{n}) + 32.$$\n\nSparse Regime. For the case s = 1, i.e., quantization levels 0, 1, and \u22121, the gradient density is\n$O(\\sqrt{n})$, while the second-moment blowup is $\\le \\sqrt{n}$. Intuitively, this means that we will employ\n$O(\\sqrt{n} \\log n)$ bits per iteration, while the convergence time is increased by $O(\\sqrt{n})$.\nDense Regime. The variance blowup is minimized to at most 2 for $s = \\sqrt{n}$ quantization levels; in\nthis case, we devise a more efficient encoding which yields an order of magnitude shorter codes\ncompared to the full-precision variant. The proof of this statement is not entirely obvious, as it\nexploits both the statistical properties of the quantization and the guarantees of the Elias coding.\n\nCorollary 3.3. Let f, x, and $\\tilde{g}(x)$ be as in Theorem 3.2. There is an encoding scheme for $Q_{\\sqrt{n}}(\\tilde{g}(x))$\nwhich in expectation has length at most 2.8n + 32.\n\n3.2 QSGD Guarantees\n\nPutting the bounds on the communication and variance given above with the guarantees for SGD\nalgorithms on smooth, convex functions yields the following results:\nTheorem 3.4 (Smooth Convex QSGD). Let $\\mathcal{X}$, f, L, $x_0$, and R be as in Theorem 2.1. 
Fix $\\epsilon > 0$.\nSuppose we run parallel QSGD with s quantization levels on K processors accessing independent\nstochastic gradients with second moment bound B, with step size $\\eta_t = 1/(L + \\sqrt{K}/\\gamma)$,\nwhere $\\gamma$ is as in Theorem 2.1 with $\\sigma = B'$, where $B' = \\min\\left( \\frac{n}{s^2}, \\frac{\\sqrt{n}}{s} \\right) B$. Then if $T = O\\left( R^2 \\cdot \\max\\left( \\frac{2B'}{K \\epsilon^2}, \\frac{L}{\\epsilon} \\right) \\right)$, then $\\mathbb{E}\\left[ f\\left( \\frac{1}{T} \\sum_{t=0}^{T} x_t \\right) \\right] - \\min_{x \\in \\mathcal{X}} f(x) \\le \\epsilon$. Moreover, QSGD requires\n$\\left( 3 + \\left( \\frac{3}{2} + o(1) \\right) \\log\\left( \\frac{2(s^2 + n)}{s^2 + \\sqrt{n}} \\right) \\right) (s^2 + \\sqrt{n}) + 32$ bits of communication per round. In the\nspecial case when $s = \\sqrt{n}$, this can be reduced to 2.8n + 32.\n\nQSGD is quite portable, and can be applied to almost any stochastic gradient method. For illustration,\nwe can use quantization along with [15] to get communication-efficient non-convex SGD.\nTheorem 3.5 (QSGD for smooth non-convex optimization). Let $f : \\mathbb{R}^n \\to \\mathbb{R}$ be an L-smooth\n(possibly nonconvex) function, and let $x_1$ be an arbitrary initial point. Let T > 0 be fixed, and\ns > 0. Then there is a random stopping time R supported on $\\{1, \\ldots, N\\}$ so that QSGD with\nquantization level s, constant stepsizes $\\eta = O(1/L)$ and access to stochastic gradients of f with\nsecond moment bound B satisfies\n\n$$\\frac{1}{L} \\mathbb{E}\\left[ \\|\\nabla f(x)\\|_2^2 \\right] \\le O\\left( \\frac{\\sqrt{L (f(x_1) - f^*)}}{N} + \\frac{\\min(n/s^2, \\sqrt{n}/s) B}{L} \\right).$$\n\nMoreover, the communication cost is the same as in Theorem 3.4.\n\n3.3 Quantized Variance-Reduced SGD\n\nAssume we are given K processors, and a parameter m > 0, where each processor i has access to\nfunctions $\\{f_{im/K}, \\ldots, f_{(i+1)m/K-1}\\}$. 
The goal is to approximately minimize $f = \\frac{1}{m} \\sum_{i=1}^{m} f_i$. For\nprocessor i, let $h_i = \\frac{1}{m} \\sum_{j=im/K}^{(i+1)m/K-1} f_j$ be the portion of f that it knows, so that $f = \\sum_{i=1}^{K} h_i$.\nA natural question is whether we can apply stochastic quantization to reduce communication for\nparallel SVRG. Upon inspection, we notice that the resulting update will break standard SVRG. We\nresolve this technical issue, proving one can quantize SVRG updates using our techniques and still\nobtain the same convergence bounds.\n\nTable 1: Description of networks, final top-1 accuracy, as well as end-to-end training speedup on 8 GPUs.\n\nNetwork | Dataset | Params. | Init. Rate | Top-1 (32bit) | Top-1 (QSGD) | Speedup (8 GPUs)\nAlexNet | ImageNet | 62M | 0.07 | 59.50% | 60.05% (4bit) | 2.05\u00d7\nResNet152 | ImageNet | 60M | 1 | 77.0% | 76.74% (8bit) | 1.56\u00d7\nResNet50 | ImageNet | 25M | 1 | 74.68% | 74.76% (4bit) | 1.26\u00d7\nResNet110 | CIFAR-10 | 1M | 0.1 | 93.86% | 94.19% (4bit) | 1.10\u00d7\nBN-Inception | ImageNet | 11M | 3.6 | - | - | 1.16\u00d7 (projected)\nVGG19 | ImageNet | 143M | 0.1 | - | - | 2.25\u00d7 (projected)\nLSTM | AN4 | 13M | 0.5 | 81.13% | 81.15% (4bit) | 2\u00d7 (2 GPUs)\n\nAlgorithm Description. Let $\\tilde{Q}(v) = Q(v, \\sqrt{n})$, where Q(v, s) is defined as in Section 3.1. Given\narbitrary starting point $x_0$, we let $y^{(1)} = x_0$. At the beginning of epoch p, each processor broadcasts\n$\\nabla h_i(y^{(p)})$, that is, the unquantized full gradient, from which the processors each aggregate\n$\\nabla f(y^{(p)}) = \\sum_{i=1}^{K} \\nabla h_i(y^{(p)})$. Within each epoch, for each iteration t = 1, . . . , T, and for each\nprocessor i = 1, . . . , K, we let $j^{(p)}_{i,t}$ be a uniformly random integer from [m] completely independent\nfrom everything else. Then, in iteration t in epoch p, processor i broadcasts the update vector\n\n$$u^{(p)}_{t,i} = \\tilde{Q}\\left( \\nabla f_{j^{(p)}_{i,t}}(x^{(p)}_t) - \\nabla f_{j^{(p)}_{i,t}}(y^{(p)}) + \\nabla f(y^{(p)}) \\right).$$\n\nEach processor then computes the total update $u^{(p)}_t = \\frac{1}{K} \\sum_{i=1}^{K} u^{(p)}_{t,i}$, and sets $x^{(p)}_{t+1} = x^{(p)}_t - \\eta u^{(p)}_t$.\nAt the end of epoch p, each processor sets $y^{(p+1)} = \\frac{1}{T} \\sum_{t=1}^{T} x^{(p)}_t$. We can prove the following.\nTheorem 3.6. Let $f(x) = \\frac{1}{m} \\sum_{i=1}^{m} f_i(x)$, where f is $\\ell$-strongly convex, and $f_i$ are convex and\nL-smooth, for all i. Let $x^*$ be the unique minimizer of f over $\\mathbb{R}^n$. Then, if $\\eta = O(1/L)$\nand $T = O(L/\\ell)$, then QSVRG with initial point $y^{(1)}$ ensures $\\mathbb{E}[f(y^{(p+1)})] - f(x^*) \\le 0.9^p (f(y^{(1)}) - f(x^*))$, for any epoch $p \\ge 1$. Moreover, QSVRG with T iterations per epoch\nrequires $\\le (F + 2.8n)(T + 1) + Fn$ bits of communication per epoch.\nDiscussion. In particular, this allows us to largely decouple the dependence between F and the\ncondition number of f in the communication. Let $\\kappa = L/\\ell$ denote the condition number of f. Observe\nthat whenever $F \\ll \\kappa$, the second term is subsumed by the first and the per epoch communication\nis dominated by (F + 2.8n)(T + 1). Specifically, for any fixed $\\epsilon$, to attain accuracy $\\epsilon$ we must\ntake $F = O(\\log 1/\\epsilon)$. 
As long as $\\log 1/\\epsilon \\ge \\Omega(\\kappa)$, which is true for instance in the case when\n$\\kappa \\ge \\mathrm{poly} \\log(n)$ and $\\epsilon \\ge \\mathrm{poly}(1/n)$, then the communication per epoch is $O(\\kappa(\\log 1/\\epsilon + n))$.\nGradient Descent. The full version of the paper [4] contains an application of QSGD to gradient\ndescent. Roughly, in this case, QSGD can simply truncate the gradient to its top components, sorted\nby magnitude.\n\n4 QSGD Variants\n\nOur experiments will stretch the theory, as we use deep networks, with non-convex objectives. (We\nhave also tested QSGD for convex objectives. Results closely follow the theory, and are therefore\nomitted.) Our implementations will depart from the previous algorithm description as follows.\nFirst, we notice that we can control the variance of the quantization by quantizing into buckets\nof a fixed size d. If we view each gradient as a one-dimensional vector v, reshaping tensors if\nnecessary, a bucket will be defined as a set of d consecutive vector values. (E.g. the ith bucket is the\nsub-vector v[(i \u2212 1)d + 1 : i \u00b7 d].) We will quantize each bucket independently, using QSGD. Setting\nd = 1 corresponds to no quantization (vanilla SGD), and d = n corresponds to full quantization,\nas described in the previous section. It is easy to see that, using bucketing, the guarantees from\nLemma 3.1 will be expressed in terms of d, as opposed to the full dimension n. This provides a\nknob by which we can control variance, at the cost of storing an extra scaling factor on every d\nbucket values. As an example, if we use a bucket size of 512, and 4 bits, the variance increase\ndue to quantization will be upper bounded by only $\\sqrt{512}/2^4 \\simeq 1.41$. This provides a theoretical\njustification for the similar convergence rates we observe in practice.\nThe second difference from the theory is that we will scale by the maximum value of the vector (as\nopposed to the 2-norm). 
Intuitively, normalizing by the max preserves more values, and has slightly higher accuracy for the same number of iterations. Both methods have the same baseline bandwidth reduction because of lower bit width (e.g. 32 bits to 2 bits per dimension), but normalizing by the max no longer provides any sparsity guarantees. We note that this does not affect our bounds in the regime where we use $\\Theta(\\sqrt{n})$ quantization levels per component, as we employ no sparsity in that case. (However, we note that in practice max normalization also generates non-trivial sparsity.)\n5 Experiments\nSetup. We performed experiments on Amazon EC2 p2.16xlarge instances, with 16 NVIDIA K80 GPUs. Instances have GPUDirect peer-to-peer communication, but do not currently support NVIDIA NCCL extensions. We have implemented QSGD on GPUs using the Microsoft Cognitive Toolkit (CNTK) [3]. This package provides efficient (MPI-based) GPU-to-GPU communication, and implements an optimized version of 1bit-SGD [35]. Our code is released as open-source [31].\n\nFigure 2: Breakdown of communication versus computation for various neural networks, on 2, 4, 8, 16 GPUs, for full 32-bit precision versus QSGD 4-bit. Each bar represents the total time for an epoch under standard parameters. Epoch time is broken down into communication (bottom, solid) and computation (top, transparent). Although epoch time diminishes as we parallelize, the proportion of communication increases.\n\nFigure 3: Accuracy numbers for different networks. Light blue lines represent 32-bit accuracy. (a) AlexNet Accuracy versus Time. (b) LSTM error vs Time. (c) ResNet50 Accuracy.\n\nWe execute two types of tasks: image classification on ILSVRC 2015 (ImageNet) [12], CIFAR-10 [25], and MNIST [27], and speech recognition on the CMU AN4 dataset [2]. For vision, we experimented with AlexNet [26], VGG [36], ResNet [18], and Inception with Batch Normalization [22] deep networks. 
For speech, we trained an LSTM network [19]. See Table 1 for details.\nProtocol. Our methodology emphasizes zero error tolerance, in the sense that we always aim to preserve the accuracy of the networks trained. We used standard sizes for the networks, with hyper-parameters optimized for the 32bit precision variant. (Unless otherwise stated, we use the default networks and hyper-parameters optimized for full-precision CNTK 2.0.) We increased batch size when necessary to balance communication and computation for larger GPU counts, but never past the point where we lose accuracy. We employed double buffering [35] to perform communication and quantization concurrently with the computation. Quantization usually benefits from lowering learning rates; yet, we always run the 32bit learning rate, and decrease bucket size to reduce variance. We will not quantize small gradient matrices (< 10K elements), since the computational cost of quantizing them significantly exceeds the reduction in communication. However, in all experiments, more than 99% of all parameters are transmitted in quantized form. We reshape matrices to fit bucket sizes, so that no receptive field is split across two buckets.\nCommunication vs. Computation. In the first set of experiments, we examine the ratio between computation and communication costs during training, for increased parallelism. The image classification networks are trained on ImageNet, while LSTM is trained on AN4. We examine the cost breakdown for these networks over a pass over the dataset (epoch). Figure 2 gives the results for various networks for image classification. The variance of epoch times is practically negligible (<1%), hence we omit confidence intervals.\nFigure 2 leads to some interesting observations. 
First, based on the ratio of communication to computation, we can roughly split networks into communication-intensive (AlexNet, VGG, LSTM), and computation-intensive (Inception, ResNet). For both network types, the relative impact of communication increases significantly as we increase the number of GPUs. Examining the breakdown for the 32-bit version, all networks could significantly benefit from reduced communication. For example, for AlexNet on 16 GPUs with batch size 1024, more than 80% of training time is spent on communication, whereas for LSTM on 2 GPUs with batch size 256, the ratio is 71%. (These ratios can be slightly changed by increasing batch size, but this can decrease accuracy, see e.g. [21].)\n\n[Figure panels: training loss vs. time (sec) for 2bit QSGD (d=128), 4bit QSGD (d=8192), 8bit QSGD (d=8192) and SGD; test accuracy (%) vs. epoch for 1bitSGD*, 32bit, QSGD 4bit and QSGD 8bit.]\n\nNext, we examine the impact of QSGD on communication and overall training time. (Communication time includes time spent compressing and uncompressing gradients.) We measured QSGD with 2-bit quantization and 128 bucket size, and 4-bit and 8-bit quantization with 512 bucket size. The results for the 2-bit and 4-bit variants are similar, since the different bucket sizes mean that the 4-bit version only sends 77% more data than the 2-bit version (but \u223c8\u00d7 less than 32-bit). These bucket sizes are chosen to ensure good convergence, but are not carefully tuned.\nOn 16-GPU AlexNet with batch size 1024, 4-bit QSGD reduces communication time by 4\u00d7, and overall epoch time by 2.5\u00d7. On LSTM, it reduces communication time by 6.8\u00d7, and overall epoch time by 2.7\u00d7. Runtime improvements are non-trivial for all architectures we considered.\nAccuracy. We now examine how QSGD influences accuracy and convergence rate. 
We ran AlexNet and ResNet to full convergence on ImageNet, LSTM on AN4, ResNet110 on CIFAR-10, as well as a two-layer perceptron on MNIST. Results are given in Figure 3, and exact numbers are given in Table 1. QSGD tests are performed on an 8-GPU setup, and are compared against the best known full-precision accuracy of the networks. In general, we notice that 4-bit or 8-bit gradient quantization is sufficient to recover or even slightly improve full accuracy, while ensuring non-trivial speedup. Across all our experiments, 8-bit gradients with 512 bucket size have been sufficient to recover or improve upon the full-precision accuracy. Our results are consistent with recent work [30] noting benefits of adding noise to gradients when training deep networks. Thus, quantization can be seen as a source of zero-mean noise, which happens to render communication more efficient. At the same time, we note that more aggressive quantization can hurt accuracy. In particular, 4-bit QSGD with 8192 bucket size (not shown) loses 0.57% for top-5 accuracy, and 0.68% for top-1, versus full precision on AlexNet when trained for the same number of epochs. Also, QSGD with 2-bit and 64 bucket size has a gap of 1.73% for top-1, and 1.18% for top-5.\nOne issue we examined in more detail is which layers are more sensitive to quantization. It appears that quantizing convolutional layers too aggressively (e.g., 2-bit precision) can lead to accuracy loss if trained for the same period of time as the full precision variant. However, increasing precision to 4-bit or 8-bit recovers accuracy. This finding suggests that modern architectures for vision tasks, such as ResNet or Inception, which are almost entirely convolutional, may benefit less from quantization than recurrent deep networks such as LSTMs.\nAdditional Experiments. The full version of the paper contains additional experiments, including a full comparison with 1BitSGD. 
In brief, QSGD outperforms or matches the performance and final accuracy of 1BitSGD for the networks and parameter values we consider.\n6 Conclusions and Future Work\nWe have presented QSGD, a family of SGD algorithms which allow a smooth trade-off between the amount of communication per iteration and the running time. Experiments suggest that QSGD is highly competitive with the full-precision variant on a variety of tasks. There are a number of optimizations we did not explore. The most significant is leveraging the sparsity created by QSGD. Current implementations of MPI do not provide support for sparse types, but we plan to explore such support in future work. Further, we plan to examine the potential of QSGD in larger-scale applications, such as super-computing. On the theoretical side, it is interesting to consider applications of quantization beyond SGD.\nThe full version of this paper [4] contains complete proofs, as well as additional applications.\n7 Acknowledgments\nThe authors would like to thank Martin Jaggi, Ce Zhang, Frank Seide and the CNTK team for their support during the development of this project, as well as the anonymous NIPS reviewers for their careful consideration and excellent suggestions. Dan Alistarh was supported by a Swiss National Fund Ambizione Fellowship. Jerry Li was supported by the NSF CAREER Award CCF-1453261, CCF-1565235, a Google Faculty Research Award, and an NSF Graduate Research Fellowship. This work was developed in part while Dan Alistarh, Jerry Li and Milan Vojnovic were with Microsoft Research Cambridge, UK.\n\nReferences\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.\n\n[2] Alex Acero. 
Acoustical and environmental robustness in automatic speech recognition, volume 201. Springer Science & Business Media, 2012.\n\n[3] Amit Agarwal, Eldar Akchurin, Chris Basoglu, Guoguo Chen, Scott Cyphers, Jasha Droppo, Adam Eversole, Brian Guenter, Mark Hillebrand, Ryan Hoens, et al. An introduction to computational networks and the computational network toolkit. Technical Report MSR-TR-2014-112, August 2014.\n\n[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. arXiv preprint arXiv:1610.02132, 2016.\n\n[5] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In NIPS, 2015.\n\n[6] Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.\n\n[7] S\u00e9bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends\u00ae in Machine Learning, 8(3-4):231\u2013357, 2015.\n\n[8] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project adam: Building an efficient and scalable deep learning training system. In OSDI, October 2014.\n\n[9] CNTK BrainScript file for AlexNet. https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/AlexNet/BrainScript. Accessed: 2017-02-24.\n\n[10] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher R\u00e9. Taming the wild: A unified analysis of hogwild-style algorithms. In NIPS, 2015.\n\n[11] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In NIPS, 2012.\n\n[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 
In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248\u2013255. IEEE, 2009.\n\n[13] John C Duchi, Sorathan Chaturapruek, and Christopher R\u00e9. Asynchronous stochastic convex optimization. In NIPS, 2015.\n\n[14] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194\u2013203, 1975.\n\n[15] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341\u20132368, 2013.\n\n[16] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737\u20131746, 2015.\n\n[17] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.\n\n[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n\n[19] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[20] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107\u20134115, 2016.\n\n[21] Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592\u20132600, 2016.\n\n[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 
arXiv preprint arXiv:1502.03167, 2015.\n\n[23] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.\n\n[24] Jakub Kone\u010dn\u00fd. Stochastic, distributed and federated optimization for machine learning. arXiv preprint arXiv:1707.01155, 2017.\n\n[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.\n\n[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n\n[27] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits, 1998.\n\n[28] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014.\n\n[29] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, 2015.\n\n[30] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.\n\n[31] CNTK implementation of QSGD. https://gitlab.com/demjangrubic/QSGD. Accessed: 2017-11-4.\n\n[32] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.\n\n[33] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400\u2013407, 1951.\n\n[34] Richard Schreier and Gabor C Temes. Understanding delta-sigma data converters, volume 74. IEEE Press, Piscataway, NJ, 2005.\n\n[35] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, 2014.\n\n[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[37] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, 2015.\n\n[38] Ananda Theertha Suresh, Felix X Yu, H Brendan McMahan, and Sanjiv Kumar. Distributed mean estimation with limited communication. arXiv preprint arXiv:1611.00429, 2016.\n\n[39] Seiya Tokui, Kenta Oono, Shohei Hido, CA San Mateo, and Justin Clayton. Chainer: a next-generation open source framework for deep learning.\n\n[40] John N Tsitsiklis and Zhi-Quan Luo. Communication complexity of convex optimization. Journal of Complexity, 3(3), 1987.\n\n[41] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. arXiv preprint arXiv:1705.07878, 2017.\n\n[42] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035\u20134043, 2017.\n\n[43] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685\u2013693, 2015.\n\n[44] Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, 2013.\n\n[45] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. 
arXiv preprint arXiv:1606.06160, 2016.\n", "award": [], "sourceid": 1063, "authors": [{"given_name": "Dan", "family_name": "Alistarh", "institution": "IST Austria & ETH Zurich"}, {"given_name": "Demjan", "family_name": "Grubic", "institution": "ETH Zurich / Google"}, {"given_name": "Jerry", "family_name": "Li", "institution": "MIT"}, {"given_name": "Ryota", "family_name": "Tomioka", "institution": "Microsoft Research Cambridge"}, {"given_name": "Milan", "family_name": "Vojnovic", "institution": "London School of Economics and Political Science (LSE)"}]}