{"title": "GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training", "book": "Advances in Neural Information Processing Systems", "page_first": 5123, "page_last": 5133, "abstract": "Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques could perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiVeQ enables direct aggregation of compressed gradients, and hence allows us to build a distributed learning system that parallelizes GradiVeQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5X without noticeable accuracy loss, and reduces the end-to-end training time by almost 50%. 
The results also show that GradiVeQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.", "full_text": "GradiVeQ: Vector Quantization for\n\nBandwidth-Efficient Gradient Aggregation in\n\nDistributed CNN Training\n\nMingchao Yu\u22c4\u22171, Zhifeng Lin\u22c4\u22171, Krishna Narra\u22c4, Songze Li\u22c4, Youjie Li\u2020, Nam Sung Kim\u2020,\n\nAlexander Schwing\u2020, Murali Annavaram\u22c4, and Salman Avestimehr\u22c4\n\n\u22c4University of Southern California\n\n\u2020University of Illinois at Urbana Champaign\n\nAbstract\n\nData parallelism can boost the training speed of convolutional neural networks\n(CNN), but could suffer from significant communication costs caused by gradient\naggregation. To alleviate this problem, several scalar quantization techniques\nhave been developed to compress the gradients. But these techniques could perform\npoorly when used together with decentralized aggregation protocols like\nring all-reduce (RAR), mainly due to their inability to directly aggregate compressed\ngradients. In this paper, we empirically demonstrate the strong linear\ncorrelations between CNN gradients, and propose a gradient vector quantization\ntechnique, named GradiVeQ, to exploit these correlations through principal\ncomponent analysis (PCA) for substantial gradient dimension reduction. GradiVeQ\nenables direct aggregation of compressed gradients, hence allows us to build\na distributed learning system that parallelizes GradiVeQ gradient compression\nand RAR communications. Extensive experiments on popular CNNs demonstrate\nthat applying GradiVeQ slashes the wall-clock gradient aggregation time of the\noriginal RAR by more than 5X without noticeable accuracy loss, and reduces the\nend-to-end training time by almost 50%. 
The results also show that GradiVeQ is\ncompatible with scalar quantization techniques such as QSGD (Quantized SGD),\nand achieves a much higher speed-up gain under the same compression ratio.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNN) such as VGG [1] and ResNet [2] can achieve unprecedented\nperformance on many practical applications like speech recognition [3, 4], text processing [5, 6],\nand image classification on very large datasets like CIFAR-100 [7] and ImageNet [8]. Due to the\nlarge dataset size, CNN training is widely implemented using distributed methods such as data-parallel\nstochastic gradient descent (SGD) [9, 10, 11, 12, 13, 14, 15, 16], where gradients computed\nby distributed nodes are summed after every iteration to update the CNN model of every node2.\nHowever, this gradient aggregation can dramatically hinder the expansion of such systems, for it\nincurs significant communication costs, and will become the system bottleneck when communication\nis slow.\nTo improve gradient aggregation efficiency, two main approaches have been proposed in the literature,\nnamely gradient compression and parallel aggregation. Gradient compression aims at reducing the\n\n\u2217 M. Yu and Z. Lin contributed equally to this work.\n2 We focus on synchronized training. Our compression technique could be applied to asynchronous systems\nwhere the model at different nodes may be updated differently [17, 18, 19, 20, 21, 22].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: An example of ring all-reduce (RAR) with 3 nodes. Each node has a local vector gn. The\ngoal is to compute g = g0 + g1 + g2 and share it with every node. Each node will initiate the aggregation\nof a 1/3 segment of the vector. After 2 steps, every segment will be completely aggregated. 
The\naggregated segments will then be simultaneously circulated to every node.\n\n(a) RAR with decompression\n\n(b) RAR with compressed domain aggregation\n\nFigure 2: Processing time \ufb02ow at a node in one RAR step. GradiVeQ allows the node to compress its\nlocal gradient segment while downloading, and then sum the two compressed segments and send.\n\nnumber of bits used to describe the gradients. Popular methods include gradient scalar quantization\n(lossy) [23, 24, 25, 26, 27] and sparsity coding (lossless) [24, 28]. Parallel aggregation, on the\nother hand, aims at minimizing communication congestion by shifting from centralized aggregation\nat a parameter server to distributed methods such as ring all-reduce (RAR) [29, 30, 31, 32]. As\ndemonstrated in Fig. 1, RAR places all nodes in a logical ring, and then circulates different segments\nof the gradient vector through the nodes simultaneously. Upon the reception of a segment, a node\nwill add to it the same segment of its own, and then send the sum to the next node in the ring. Once a\nsegment has been circulated through all nodes, it becomes a fully aggregated gradient vector. Then,\nanother round of circulation will make it available at all nodes.\nDue to the complementary nature of the above two approaches, one may naturally aim at combining\nthem to unleash their gains simultaneously. However, there lies a critical problem: the compressed\ngradients cannot be directly summed without \ufb01rst decompressing them. For example, summing two\nscalar quantized gradients will incur over\ufb02ow due to limited quantization levels. And, summing\ntwo sparsity coded descriptions is an unde\ufb01ned operation. An additional problem for sparsity based\ncompression is that the gradient density may increase rapidly during RAR [28], which may incur\nexploding communication costs.\nThe inability to directly aggregate compressed gradients could incur hefty compression-related\noverheads. 
In every step of RAR, every node will have to decompress the downloaded gradient\nsegment before adding its own corresponding uncompressed gradient segment. The nodes will\nthen compress the sum and communicate it to the next node in the ring. Consequently, download\nand compression processes cannot be parallelized (as illustrated in Fig. 2(a)). Moreover, the same\ngradients will be repeatedly compressed/decompressed at every single node.\nIn order to leverage both gradient compression and parallel aggregation, the compression function\nshould be commutable with the gradient aggregation. Mathematically, let Q() be the compression\nfunction on gradient vector gn of each node-n, then the following equality must hold:\n\n\u2211_{n=0}^{N\u22121} Q(gn) = Q(\u2211_{n=0}^{N\u22121} gn),\n\n(1)\n\nwhere N is the number of nodes. The LHS of (1) enables compressed domain gradient aggregation,\nnamely, direct summation of compressed gradients. Such a compression function will allow the\nparallelization of compression and RAR communications, so that compression time can be masked\nby communication time (as illustrated in Fig. 2(b)). The RHS of (1) indicates that decompression\nQ\u207b\u00b9() is only needed once - after the compressed gradients are fully aggregated.\n\n\fContributions We propose GradiVeQ (Gradient Vector Quantizer), a novel gradient compression\ntechnique that can significantly reduce the communication load in distributed CNN training.\nGradiVeQ is the first method that leverages both gradient compression and parallel aggregation by\nemploying a vector compression technique that commutes with gradient aggregation (i.e., satisfies\n(1)), hence enabling compressed domain gradient aggregation.\nIntuition and Motivational Data: At the core of GradiVeQ is a linear compressor that uses principal\ncomponent analysis (PCA) to exploit the linear correlation between the gradients for gradient\ndimension reduction. 
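As a quick sanity check of the commutability property in (1), the following minimal numpy sketch (ours, not from the paper) uses a linear compressor of the whitened form Q(g) = U_d^T (g \u2212 \u00b5/N) that GradiVeQ adopts later in Section 2, and verifies that aggregating per-node compressed gradients matches compressing the aggregated gradient:

```python
import numpy as np

# Sketch (ours): a linear map commutes with summation, so per-node
# compression followed by aggregation equals aggregation followed by
# compression, which is exactly the property required by Eq. (1).
rng = np.random.default_rng(0)
N, K, d = 4, 8, 3                 # nodes, slice size, compressed dimension
U_d = np.linalg.qr(rng.standard_normal((K, K)))[0][:, :d]   # K x d compressor
mu = rng.standard_normal(K)       # whitening vector shared by all nodes
g = rng.standard_normal((N, K))   # local gradient slices, one row per node

lhs = sum(U_d.T @ (g[n] - mu / N) for n in range(N))  # aggregate compressed
rhs = U_d.T @ (g.sum(axis=0) - mu)                    # compress aggregate
assert np.allclose(lhs, rhs)
```

A scalar quantizer, by contrast, is nonlinear, so the two sides of (1) differ and decompression is forced at every hop.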
Our development of GradiVeQ is rooted in the following experimental\nobservations on the strong linear correlation between gradients in CNN:\n\n1. Linear correlation: We flatten the gradients to a vector representation in a special way such\nthat adjacent gradients could have linear correlation. As shown in Fig. 3(a), we place together\nthe gradients located at identical coordinates across all the F filters of the same convolutional\nlayer. One of the foundational intuitions behind such a gradient vector representation is that\nthese F gradient elements are generated from the same input datum, such as a pixel value in\nan image, or an output from the last layer. Hence the F aggregated gradients could show\nstrong linear correlation across training iterations. This linear correlation will allow us to\nuse PCA to compute a linear compressor for each gradient vector.\nFor example, we record the value of the first 3 gradients of layer-1 of ResNet-32 for 150\niterations during CIFAR-100 training, and plot them as the 150 blue points in Fig. 4(a). As\ndescribed in the figure caption, a strong linear correlation can be observed between the 3\ngradients, which allows us to compress them into 2 gradients with negligible loss.\n\n2. Spatial domain consistency: Another interesting observation is that, within a (H, W, D, F)\nconvolutional layer, the large gradient vector can be sliced at the granularity of F \u00d7 D\nmultiples and these slices show strong similarity in their linear correlation. This correlation\nis best demonstrated by the low compression loss of using the compressor of one slice to\ncompress the other slices. For example, Fig. 5 shows that, in a (3, 3, 16, 16)-CNN layer,\nthe compression loss (\u2016\u02c6g \u2212 g\u2016\u2082/\u2016g\u2016\u2082, where \u02c6g is the decompressed gradient vector) drops\ndramatically at slice sizes of 256, 512 and so on (multiples of FD = 256). 
Thus, it is possible\nto just perform PCA on one gradient slice and then apply the compressor to other slices,\nwhich will make PCA overhead negligible.\n\n3. Time domain invariance: Finally, we also note that the observed linear correlation evolves\nslowly over iterations, which is likely due to the steady gradient directions and step size\nunder reasonable learning rates. Hence, we can invest a fraction of the compute resource\nduring a set of training iterations on uncompressed aggregations to perform PCA, and use\nthe resulting PCA compressors to compress the successor iterations.\n\nBuilt upon these observations, we develop a practical implementation of GradiVeQ, where gradient\ncompression and RAR communications are fully parallelized. Experiments on ResNet with CIFAR-\n100 show that GradiVeQ can compress the gradient by 8 times, and reduce the wall-clock gradient\naggregation time by over 5X, which translates to a 46% reduction on the end-to-end training time in\na system where communication contributes to 60% of the time. We also note that under the same\ncompression ratio of 8, scalar quantizers such as 4-bit QSGD [24], while effective over baseline\nRAR, does not achieve the same level of performance. QSGD has the requirement that compressed\ngradient vectors must \ufb01rst be uncompressed and aggregated thereby preventing compressed domain\naggregation, which in turn prevents compression and communication parallelization.\n\n2 Description of GradiVeQ\n\nIn this section, we describe the core compression and aggregation techniques applied in GradiVeQ.\nThe next section presents the details of system implementation, including system parameter selection.\nWe consider a distributed system that uses N nodes and a certain dataset (e.g., CIFAR-100 [7] or\nImageNet [8]) to train a CNN model that has M parameters in its convolutional layers. We use w to\nrepresent their weights. 3. 
3GradiVeQ is not applied to the gradient of other parameters such as those from fully connected layers.\n\n\fFigure 3: Gradient flattening, slicing, compression, and aggregation in GradiVeQ. (a) Each CNN layer has F filters\nof dimension (H, W, D), producing FDWH gradients in every iteration. We flatten the gradients into a vector g by\nplacing every F collocated gradients from all the F filters next to each other in g. The location\nselection traverses depth, width, then height. (b) Gradients are sliced and compressed separately. Once\nevery node-n has compressed its local slice-n, RAR aggregation can be launched.\n\nFigure 4: 3D scatter plot of the value of 3 adjacent gradients from 150 iterations (the 150 blue points\nin panel (a), the original data), and projections to different 2D planes. A strong linear correlation is observed.\nAfter proper centralization and rotation (panel (b)), most of the variance/information is captured by the value of\ngradients 1\u2032 and 2\u2032 (the 2-D green points at the bottom plane of (b)), indicating a compression ratio of 3/2 = 1.5.\n\nIn the t-th (t \u2265 0) training iteration, each node-n (n \u2208 [0, N \u2212 1]) trains a different subset of the\ndataset to compute a length-M gradient vector gn[t]. These gradient vectors\nare aggregated into:\n\ng[t] \u225c \u2211_{n=0}^{N\u22121} gn[t],\n\n(2)\n\nwhich updates the model of every node as w[t + 1] = w[t] \u2212 \u03b7[t]g[t], where \u03b7[t] is the learning rate.\nDuring the training phase, each convolutional layer uses the same set of filters repeatedly on sliding\nwindows over the training data. This motivates us to pack gradients gn[t] into a vector format carefully\nto unleash linear correlation between adjacent gradients in gn[t]. As explained in the previous section,\nfor a convolutional layer with F filters, every F collocated gradients of the F filters are placed\ntogether in gn[t]. 
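The flattening of Fig. 3(a) can be sketched with numpy as follows (our illustration; the array and variable names are assumptions, not from the paper):

```python
import numpy as np

# Sketch (ours) of the flattening in Fig. 3(a): a layer with F filters of
# shape (H, W, D) yields a gradient tensor of shape (F, H, W, D); moving the
# filter axis last and ravelling places the F collocated gradients of each
# (h, w, d) location next to each other, traversing depth, width, then height.
F, H, W, D = 16, 3, 3, 16
grads = np.arange(F * H * W * D, dtype=np.float64).reshape(F, H, W, D)

g = grads.transpose(1, 2, 3, 0).ravel()   # filter index now varies fastest

# The first F entries are the gradients at location (0, 0, 0) of all F filters.
assert np.array_equal(g[:F], grads[:, 0, 0, 0])
```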
This group assignment is first applied to gradients along the filter depth, then width,\nand finally height (Fig. 3(a)).\nGradiVeQ aims to compress every slice of K adjacent gradients in gn[t] separately, where K is called\nthe slice size. Let gn,m[t] be the m-th (m \u2208 [0, M/K \u2212 1]) slice; GradiVeQ will compress it into\ng\u2032n,m[t] using a function Q() as follows:\n\ng\u2032n,m[t] \u225c Q(gn,m[t]) \u225c U^T_{d,m} (gn,m[t] \u2212 \u00b5m/N),\n\n(3)\n\nwhere U_{d,m} is a K \u00d7 d linear compressor with d < K, and \u00b5m is a length-K whitening vector. After\ncompression, the dimension of the slice is reduced to d, indicating a compression ratio of r = K/d.\nThe compressed slices from different nodes can then be directly aggregated into:\n\ng\u2032m[t] = \u2211_{n=0}^{N\u22121} g\u2032n,m[t] = U^T_{d,m} (gm[t] \u2212 \u00b5m),\n\n(4)\n\nwhich indicates that GradiVeQ compression is commutable with aggregation. According to (4), a\nsingle decompression operation is applied to g\u2032m to obtain a lossy version \u02c6gm[t] of gm[t]:\n\n\u02c6gm[t] = U_{d,m} g\u2032m[t] + \u00b5m,\n\n(5)\n\nwhich will be used to update the corresponding K model parameters.\n\n\fFigure 5: Slice size vs. compression performance when using the linear compressor of the first slice\nof a (3, 3, 16, 16)-convolutional layer to compress the remaining slices. While compression ratio\nincreases steadily with slice size, the compression loss drops drastically when slice size is a multiple\nof FD = 256, indicating similarity between the first slice\u2019s linear correlation and those of the other\nslices.\n\nIn this work we rely on PCA to compute the compressor U_{d,m} and the whitening vector \u00b5m. More\nspecifically, for each slice-m, all nodes periodically invest the same Lt out of L CNN training\niterations on uncompressed gradient aggregation. 
After this, every node will have Lt samples of\nslice-m, say, gm[t],\u00b7\u00b7\u00b7 , gm[t + Lt \u2212 1]. The whitening vector \u00b5m is simply the average of these Lt\nsamples. Every node then computes the covariance matrix Cm of the K gradients using these Lt\nsamples, and applies singular value decomposition (SVD) to obtain the eigen matrix U m and eigen\nvector sm of Cm. The compressor U d,m is simply the \ufb01rst d columns of U m, corresponding to the\nd most signi\ufb01cant eigen values in the eigen vector sm. The obtained U d,m and \u00b5m are then used to\ncompress gn,m[t + Lt],\u00b7\u00b7\u00b7 , gn,m[t + L \u2212 1] of every node-n in the next L \u2212 Lt training iterations.\nDue to the commutability, GradiVeQ gradient compression can be parallelized with RAR as shown\nin Algorithm 1, so that compression time can be hidden behind communication time. We place all\nthe N nodes in a logical ring, where each node can only send data to its immediate successor. We\nalso partition every gradient vector gn[t] into N equal segments, each containing several gradient\nslices. The aggregation consists of two rounds: a compression round and a decompression round. To\ninitiate, each node-n will only compress the gradient slices in the n-th segment, and then send the\ncompressed segment to its successor. Then, in every step of the compression round, every node will\nsimultaneously 1) download a compressed segment from its predecessor, and 2) compress the same\nsegment of its own. Once both are completed, it will sum the two compressed segments and send the\nresult to its successor. After N \u2212 1 steps, the compression round is completed, and every node will\nhave a different completely aggregated compressed segment. Then in each step of the decompression\nround, every node will simultaneously 1) download a new compressed segment from its predecessor,\nand 2) decompress its last downloaded compressed segment. 
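The PCA fitting and compressed-domain aggregation described above can be sketched as follows (our illustration with synthetic data, not the paper's code): fit \u00b5 and U_d from Lt uncompressed samples via SVD of the covariance matrix, then compress, aggregate, and decompress per (3)-(5):

```python
import numpy as np

# Sketch (ours): PCA compressor fitted from L_t samples of an aggregated
# slice, then used for compressed-domain aggregation as in Eqs. (3)-(5).
rng = np.random.default_rng(1)
K, d, Lt, N = 8, 2, 100, 4

# Synthetic slices lying near a d-dimensional subspace, mimicking the
# strong linear correlation observed between CNN gradients.
basis = rng.standard_normal((K, d))
samples = (rng.standard_normal((Lt, d)) @ basis.T
           + 0.01 * rng.standard_normal((Lt, K)))

mu = samples.mean(axis=0)               # whitening vector
C = np.cov(samples, rowvar=False)       # K x K covariance of the K gradients
U, s, _ = np.linalg.svd(C)              # columns of U: principal directions
U_d = U[:, :d]                          # keep the d most significant ones

# Each node compresses its own slice; compressed slices sum directly (Eq. 4),
# and a single decompression recovers the aggregate (Eq. 5).
g_nodes = rng.standard_normal((N, d)) @ basis.T
g_prime = sum(U_d.T @ (gn - mu / N) for gn in g_nodes)
g_hat = U_d @ g_prime + mu

g_sum = g_nodes.sum(axis=0)
rel_loss = np.linalg.norm(g_hat - g_sum) / np.linalg.norm(g_sum)
assert rel_loss < 0.05                  # correlated data compresses well
```

Because the synthetic slices are nearly d-dimensional, compressing K = 8 values to d = 2 loses almost nothing, mirroring the low loss the paper reports for well-chosen slices.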
Note that after decompression, the\ncompressed segment must be kept for the successor to download. The original (i.e., uncompressed)\nRAR is a special case of Algorithm 1, where compression and decompression operations are skipped.\n\n\fAlgorithm 1 Parallelized GradiVeQ Gradient Compression and Ring All-Reduce Communication\n1: Input: N nodes, each with a local gradient vector gn, n \u2208 [0, N \u2212 1];\n2: Each node-n partitions its gn into N equal segments gn(0), \u00b7\u00b7\u00b7, gn(N \u2212 1);\n3: Every node-n compresses gn(n) to g\u2032n(n) as in (3), and sends g\u2032(n) \u225c g\u2032n(n) to node-[n + 1]N;\n4: for i = 1 : N \u2212 1 do\n5: Each node-n downloads g\u2032([n \u2212 i]N) from node-[n \u2212 1]N;\n6: At the same time, each node-n compresses gn([n \u2212 i]N) to g\u2032n([n \u2212 i]N) as in (3);\n7: Once node-n has completed the above two steps, it adds g\u2032n([n \u2212 i]N) to g\u2032([n \u2212 i]N), and sends the updated g\u2032([n \u2212 i]N) to node-[n + 1]N;\n8: end for\n9: Each node-n now has the completely aggregated compressed g\u2032([n + 1]N);\n10: for i = 0 : N \u2212 1 do\n11: Each node-n decompresses g\u2032([n + 1 \u2212 i]N) into g\u2032\u2032([n + 1 \u2212 i]N);\n12: At the same time, each node-n downloads g\u2032([n \u2212 i]N) from node-[n \u2212 1]N;\n13: end for\n14: All nodes now have the complete g\u2032\u2032.\n\n3\n\nImplementation Details of GradiVeQ\n\nSystem Overview: At the beginning of the training, we will first apply a short warm-up phase to\nstabilize the model, which is common in the literature (see, e.g., [28]). Then, we will iterate between\nLt iterations of uncompressed aggregations and Lc iterations of compressed aggregations for every\ngradient slice. Recall that PCA needs some initial gradient data to compute the U_d values for the\nlinear compressor. 
Hence, the Lt iterations of uncompressed data are used to generate the compressor,\nwhich is then followed by Lc iterations of compressed aggregation. Since the linear correlation drifts\nslowly, the compression error is curtailed by periodically re-generating the updated U_d values for\nthe linear compressor. Hence, by interspersing uncompressed aggregations with compressed\naggregations, GradiVeQ minimizes any compression-related losses.\nComputing U_{d,m} from a large gradient vector is a computationally intensive task. To minimize the\nPCA computation overhead, GradiVeQ exploits the spatial domain consistency of gradient correlation.\nFor every consecutive s (called the compressor reuse factor) slices in the same convolutional layer,\nGradiVeQ computes a single PCA compressor U_d (with the slice index m omitted) using the first\nslice, and then uses the resulting U_d to compress the remaining s \u2212 1 slices. We note that, although\ncomputing U_d does not require the values from the remaining s \u2212 1 slices, they should still be\naggregated in an uncompressed way, just like the first slice, so that the parameters associated with all slices\ncan evolve in the same way. The composition of training iterations is demonstrated in Fig. 6.\nIn order to reduce the bandwidth inefficiency brought by the Lt iterations that are not GradiVeQ-compressed,\nwe will apply a scalar quantization technique such as QSGD [24] to these iterations, and\ncommunicate the quantized gradients.\nParameter Selection: We briefly describe how the various parameters used in GradiVeQ are chosen.\nSlice size K: We set K to be a multiple of FD for a convolutional layer that has F filters with a\ndepth of D, as the compression loss is minimized at this granularity, as demonstrated in Fig. 5. In\naddition, K should be selected to balance the compression ratio and PCA complexity. 
Increasing K\nto higher multiples of F D may capture more linear correlations for higher compression ratio, but will\nincrease the computational and storage costs, as SVD is applied to a K \u00d7 K covariance matrix. For\nexample, for a (3, 3, 16, 16)-convolutional layer, a good choice of K would be 768 = 3 \u00d7 16 \u00d7 16.\nCompressor reuse factor s: A larger s implies that the PCA compressor computed from a single\nslice of size K is reused across many consecutive slices of the gradient vector, but with a potentially\nhigher compression loss. We experimented with different values of s and found that the accuracy\ndegradation of using s = \u221e (i.e., one compressor per layer) is less than 1% compared to the original\nuncompressed benchmark. Therefore, \ufb01nding the compressor for a single slice in a convolutional\nlayer is suf\ufb01cient.\nCompressor dimension d: the value of d is determined by the maximum compression loss we can\nafford. This loss can be easily projected using the eigen vector s. 
\fFigure 6: GradiVeQ CNN training iterations of one convolutional layer with compressor reuse factor\nof s = 2.\n\nLet 0 \u2264 \u03bb < 1 be a loss threshold; we find the minimum d such that:\n\n(\u2211_{k=0}^{d\u22121} s[k]) / (\u2211_{k=0}^{K\u22121} s[k]) \u2265 1 \u2212 \u03bb.\n\n(6)\n\nThe corresponding U_d will guarantee a compression loss of at most \u03bb to the sample slices of size K\nin each layer and, based on our spatial correlation observation, the U_d from one slice also works well\nfor other slices in the gradient vector.\nNumber of iterations with uncompressed aggregation Lt: This value could either be tuned as a\nhyper-parameter, or be determined with uncompressed RAR: namely, perform PCA on the collected\nsample slices of the gradients to find out how many samples will be needed to get a stable value of d.\nWe used the latter approach and determined that 100 samples are sufficient, hence an Lt of 100 iterations\nis used.\nNumber of compressed iterations Lc: This value could also be tuned as a hyper-parameter, or be\ndetermined at run-time by letting nodes perform local decompression to monitor the local\ncompression loss, which is defined as:\n\n\u2016gn[t] \u2212 (U_d \u00b7 g\u2032n[t] + \u00b5/N)\u2016\u2082.\n\n(7)\n\nWhen the majority of nodes experience large loss, we stop compression, and resume training with\nuncompressed aggregations to compute new compressors. Again, we currently use the latter approach\nand determine Lc to be 400 iterations.\nComputation Complexity: GradiVeQ has two main operations for each gradient slice: (a) one SVD\nover Lt samples of the aggregated slice for every Lt + Lc iterations, and (b) two low-dimensional\nmatrix multiplications per iteration per node to compress/decompress the slice. 
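The selection rule (6) can be sketched as follows (our illustration; the spectrum values are made up, and the eigenvalue vector plays the role of s above):

```python
import numpy as np

# Sketch (ours) of Eq. (6): choose the smallest d whose top-d eigenvalues
# capture at least a (1 - lambda) fraction of the total variance.
def min_dimension(eigvals, lam):
    # eigvals: eigenvalues of the K x K covariance matrix, sorted descending
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, 1.0 - lam) + 1)

# A hypothetical spectrum with a few dominant directions, as observed for
# CNN gradient slices; lambda = 0.01 matches the threshold used in Section 4.
spectrum = np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.15, 0.04, 0.01])
d = min_dimension(spectrum, 0.01)
assert d == 6   # the top 6 eigenvalues already cover 99.5% of the variance
```

With such a spectrum, K = 8 values compress to d = 6; the real gradient spectra reported in the experiments are far more concentrated, yielding K/d = 8.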
Our gradient \ufb02attening\nand slicing (Figure 3) allows the compressor calculated from one slice to be reused by all the\nslices in the same layer, offering drastically amortized SVD complexity. On the other hand, the\noperations associated with (b) are of low-complexity, and can be completely hidden behind the RAR\ncommunication time (Figure 2) due to GradiVeQ\u2019s linearity. Thus, GradiVeQ will not increase the\ntraining time.\n\n4 Experiments\n\nWe apply GradiVeQ to train an image classi\ufb01er using ResNet-32 from the data set CIFAR-100 under\na 6-node distributed computing system. The system is implemented under both a CPU cluster (local)\nand a GPU cluster (Google cloud). The purpose is to evaluate the gain of GradiVeQ under different\nlevels of communication bottleneck. In the local CPU cluster, each node is equipped with 2 Ten-Core\nIntel Xeon Processor (E5-2630 v4), which is equivalent to 40 hardware threads. We observe that\nwithout applying any gradient compression techniques, gradient communication and aggregation\noverheads account for 60% of the end-to-end training time, which is consistent with prior works [24].\nIn the GPU setup, each node has an NVIDIA Tesla K80 GPU. The increased compute capability\nfrom the GPUs magni\ufb01es the communication and aggregation overheads, which occupies 88% of the\nend-to-end training time when no compression techniques are used.\nWe choose \u03bb = 0.01 as our loss threshold. We set s = \u221e for each convolutional layer, which\nmeans that we only compute and use one compressor for each layer, so that the PCA overhead is\n\n7\n\n\f(a) For training accuracy with CPU setup, the uncompressed RAR converges around 135, 000 seconds, 4-bit-\nQSGD RAR converges around 90, 000 seconds, and GradiVeQ RAR converges around 76, 000 seconds. 
Under the\nGPU setup, these numbers are reduced to 75,000, 30,000, and 24,000 seconds, respectively.\n\n(b) End-to-end training time breakdown for 500 iterations of three systems: uncompressed RAR, 4-bit-QSGD\nRAR, and GradiVeQ RAR.\n\nminimized. Our experimental results indicate that s = \u221e yields no compression loss. We spend\nthe first 2,500 iterations on warm-up, and then periodically invest Lt = 100 iterations for PCA\nsampling, followed by Lc = 400 iterations of GradiVeQ-compressed gradient aggregations. With\nthis parameter selection, we observe an average compression ratio of K/d = 8. The performance\nmetrics we are interested in include wall-clock end-to-end training time and test accuracy. We\ncompare the performance of GradiVeQ with uncompressed baseline RAR. In addition, to fairly\ncompare GradiVeQ with scalar quantization techniques under the same compression ratio (i.e., 8), we\nalso integrate 4-bit-QSGD with RAR. We note that for CNNs, 4-bit is the minimum quantization\nlevel that allows QSGD to gracefully converge [24]. One of the advantages of QSGD is that, unlike\nGradiVeQ, it does not need any PCA before applying the compression. GradiVeQ exploits this\nproperty of QSGD to minimize bandwidth inefficiency during the Lt iterations that are not GradiVeQ-compressed.\nFurthermore, this approach demonstrates the compatibility of GradiVeQ with scalar\nquantization. We apply 4-bit-QSGD to these Lt iterations and use the quantized gradients for PCA.\n\n4.1 Model Convergence\n\nWe analyze the model convergence of the three different systems for a given training accuracy.\nAll three systems converge after 45,000 iterations, indicating that GradiVeQ does not incur extra\niterations. 
However, as plotted in Fig. 7(a), both compression approaches reduce the model convergence\nwall-clock time significantly, with GradiVeQ slashing it even more due to its further reduction in\ngradient aggregation time \u2013 in our CPU setup, uncompressed RAR takes about 135,000 seconds\nfor the model to converge, 4-bit-QSGD takes 90,000 seconds to converge, whereas GradiVeQ takes\nonly 76,000 seconds to converge. For our GPU setup in Google Cloud, these numbers are reduced\nto 75,000 (uncompressed), 30,000 (4-bit-QSGD), and 24,000 (GradiVeQ), respectively. In terms\nof test accuracy, uncompressed RAR\u2019s top-1 accuracy is 0.676, while 4-bit-QSGD RAR\u2019s top-1\naccuracy is 0.667, and GradiVeQ RAR\u2019s top-1 accuracy is 0.666, indicating only marginal accuracy\nloss due to quantization.\n\n\f4.2 End-to-End Training Time Reduction Breakdown\n\nThe end-to-end training time consists of computation time and gradient aggregation time. The\ncomputation time includes the time it takes to compute the gradient vector through backward\npropagation, and to update the model parameters. The gradient aggregation time is the time it takes to\naggregate the gradient vectors computed by the worker nodes. Both GradiVeQ and 4-bit-QSGD share\nthe same computation time as the uncompressed system, as they do not alter the computations. The\nnumber is about 650 seconds per 500 (Lt + Lc) iterations under our CPU setup. In terms of gradient\naggregation time, the uncompressed system needs 850 seconds per 500 iterations, which constitutes\n60% of the end-to-end training time. On the other hand, GradiVeQ substantially reduces this time\nby 5.25x to only 162 seconds thanks to both its gradient compression and parallelization with RAR.\nAs a result, GradiVeQ is able to slash the end-to-end training time by 46%. 
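The reported CPU-setup figures are internally consistent, as this back-of-the-envelope check (ours) shows, using the per-500-iteration timings quoted in the text:

```python
# Back-of-the-envelope check (ours) of the CPU-setup numbers quoted above,
# all measured per 500 (Lt + Lc) training iterations.
compute_s = 650.0           # gradient computation + model update (all systems)
agg_uncompressed_s = 850.0  # baseline RAR gradient aggregation
agg_gradiveq_s = 162.0      # GradiVeQ compressed-domain aggregation

speedup = agg_uncompressed_s / agg_gradiveq_s
reduction = 1.0 - (compute_s + agg_gradiveq_s) / (compute_s + agg_uncompressed_s)

assert round(speedup, 2) == 5.25      # the reported 5.25x aggregation speed-up
assert round(reduction * 100) == 46   # the reported 46% end-to-end reduction
```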
In contrast, although 4-bit-QSGD offers the same compression ratio, its inability to be parallelized with RAR makes its gradient aggregation time almost double that of GradiVeQ. The gain of GradiVeQ becomes even more significant under our GPU setup. GPUs boost the computation speed by 6 times, which makes communication a more substantial bottleneck: it accounts for 88% of the end-to-end training time without compression. Thus, by slashing the gradient aggregation time, GradiVeQ achieves a higher end-to-end training time reduction: 4x over the uncompressed method and 1.40x over 4-bit-QSGD.

5 Conclusion

In this paper we have proposed GradiVeQ, a novel vector quantization technique for CNN gradient compression. GradiVeQ enables direct aggregation of compressed gradients, so that when paired with decentralized aggregation protocols such as ring all-reduce (RAR), GradiVeQ compression can be parallelized with gradient communication. Experiments show that GradiVeQ can significantly reduce the wall-clock gradient aggregation time of RAR, and achieves a better speed-up than scalar quantization techniques such as QSGD.

In the future, we will adapt GradiVeQ to other types of neural networks. We are also interested in understanding the implications of the linear correlations between gradients that we have discovered, such as their use in model reduction.

Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. This work is also supported by NSF Grants CCF-1763673, CCF-1703575, CNS-1705047, CNS-1557244, and SHF-1719074.