{"title": "Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 6371, "page_last": 6381, "abstract": "We consider the problem of estimating the mean of a set of vectors, which are stored in a distributed system. This is a fundamental task with applications in distributed SGD and many other distributed problems, where communication is a main bottleneck for scaling up computations. We propose a new sparsity-aware algorithm, which improves previous results both theoretically and empirically. The communication cost of our algorithm is characterized by Hoyer's measure of sparseness. Moreover, we prove that the communication cost of our algorithm is information-theoretic optimal up to a constant factor in all sparseness regime. We have also conducted experimental studies, which demonstrate the advantages of our method and confirm our theoretical findings.", "full_text": "Optimal Sparsity-Sensitive Bounds for Distributed\n\nMean Estimation\n\nZengfeng Huang\n\nSchool of Data Science\n\nFudan University\n\nhuangzf@fudan.edu.cn\n\nYilei Wang\n\nDepartment of CSE\n\nHKUST\n\nywanggq@cse.ust.hk\n\nZiyue Huang\n\nDepartment of CSE\n\nHKUST\n\nzhuangbq@cse.ust.hk\n\nDepartment of CSE\n\nKe Yi\n\nHKUST\n\nyike@cse.ust.hk\n\nAbstract\n\nWe consider the problem of estimating the mean of a set of vectors, which are\nstored in a distributed system. This is a fundamental task with applications in\ndistributed SGD and many other distributed problems, where communication is a\nmain bottleneck for scaling up computations. We propose a new sparsity-aware\nalgorithm, which improves previous results both theoretically and empirically.\nThe communication cost of our algorithm is characterized by Hoyer\u2019s measure of\nsparseness. Moreover, we prove that the communication cost of our algorithm is\ninformation-theoretic optimal up to a constant factor in all sparseness regime. 
We have also conducted experimental studies, which demonstrate the advantages of our method and confirm our theoretical findings.\n\n1 Introduction\n\nConsider a distributed system with n nodes, called clients, each of which holds a d-dimensional vector Xi in R^d. The goal of distributed mean estimation (DME) is to estimate the mean of these vectors, i.e., X := (1/n) sum_{i=1}^n Xi, subject to a constraint on the communication cost (i.e., the total number of bits transmitted by all clients).\nDME is a fundamental task in distributed machine learning and optimization problems [3, 10, 18, 14, 12]. For example, gradient aggregation in distributed stochastic gradient descent (SGD) is a form of DME. In the standard synchronous implementation, in each round, clients evaluate their local gradients with respect to local mini-batches and communicate them to a central server; the server then computes the mean of all these gradients, which is used to update the model parameters. It is widely observed that the communication cost of gradient exchange has become a significant bottleneck for scaling up distributed training [5, 19, 24]. Therefore, communication-efficient gradient aggregation has received much attention recently [1, 2, 15, 23, 26]. DME is also a critical subproblem in many other applications, such as the distributed implementation of Lloyd's algorithm for K-means clustering [16] and power iteration for computing eigenvectors [21].\nHowever, the communication complexity of this fundamental problem has not been fully understood, especially when the input is sparse or skewed. In this paper, we provide a tight connection between communication complexity and input sparsity. 
Specifically, we propose a new sparsity-aware lossy compression scheme, which reduces the communication cost both theoretically and empirically. We also prove that the communication cost of our method is information-theoretically optimal up to a constant factor in all sparsity regimes, under Hoyer's measure of sparseness [9].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Problem definition and notation\n\nThe problem setting in this paper is the same as in [20]. Each client i holds a private vector Xi in R^d and transmits messages only to the central server according to some protocol; at the end, the server outputs an estimate for the mean X = (1/n) sum_{i=1}^n Xi based on all the messages it has received. The communication cost of a protocol is measured by the total number of bits exchanged between the clients and the server. Let X_hat denote the estimated mean; we wish to minimize the mean square error (MSE) of the estimate, i.e., E = E||X_hat - X||_2^2, under a certain communication budget. Note that the problem considered here is non-stochastic, i.e., the input vectors are arbitrary or even adversarial. This is different from distributed statistical estimation [27, 8, 4], where the inputs are i.i.d. samples from some distribution and the goal is to estimate the parameters of the underlying distribution. In particular, the expectation in the above definition of the MSE is only over the randomness of the protocol.\nWe define F1 := sum_{i=1}^n ||Xi||_1, i.e., the sum of the l1-norms of the input vectors; let F2 := sum_{i=1}^n ||Xi||_2^2 be the sum of the squared l2-norms of the input vectors; and let F0 be the total number of non-zero entries in all the input vectors. 
We will always use d to denote the dimensionality of the input vectors and n the number of clients.\n\n1.2 Previous results\n\nNaively sending all vectors to the server needs ndr bits of communication, where r is the number of bits used to represent a floating point number. In [20], several methods to save communication are proposed. The best of them uses O(nd) bits of communication while achieving an MSE of F2/n^2. Their algorithm first applies stochastic quantization and then encodes the quantized vectors with entropy encoding schemes such as arithmetic coding. Moreover, it is also proved that, in the worst case, this cost is optimal for one-round protocols. Similar bounds are obtained in [2, 11]. One major limitation of the methods in [20] is that they cannot exploit sparseness in the inputs, due to the nature of their quantization and encoding methods. In many distributed learning scenarios, the input vectors can be very sparse or skewed, i.e., a large fraction of the entries can be zero or close to zero. The sparsity can be caused either by data imbalance (large entries occur in a few clients) or by feature imbalance (large entries occur in a few dimensions). QSGD [2] works well in practice for sparse data, but it does not have an upper bound on the cost that is parameterized by input sparsity: to achieve an MSE of F2/n^2, the cost is still O(nd) bits (Theorem 3.2 and Corollary 3.3 in [2]).\nIntuitively, one could drop entries with small absolute values without affecting the MSE too much. Gradient sparsification utilizes this idea and has been successfully applied in distributed gradient compression [19, 1, 15, 22, 23]. 
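As an illustration of this idea (our sketch, not a method from the paper): each coordinate can be kept with probability proportional to its magnitude and the kept entries rescaled so that the sparsified vector remains unbiased, in the spirit of the sparsification schemes of [23]. The function name and the `budget` parameter are ours.

```python
import random

def sparsify_unbiased(x, budget, rng=None):
    """Keep coordinate j with probability p_j proportional to |x_j|
    (capped at 1) and rescale kept entries by 1/p_j, so the output
    equals x in expectation while most small entries become zero."""
    rng = rng or random.Random(0)
    l1 = sum(abs(v) for v in x)
    out = []
    for v in x:
        p = min(1.0, budget * abs(v) / l1) if l1 > 0 else 0.0
        out.append(v / p if p > 0 and rng.random() < p else 0.0)
    return out

# Roughly `budget` coordinates survive in expectation; large entries
# are kept deterministically, small ones only occasionally.
x = [0.01, -2.0, 0.003, 1.5, 0.0, -0.002, 3.0]
y = sparsify_unbiased(x, budget=3)
```

Such schemes trade a higher variance on the dropped coordinates for fewer transmitted entries, which is exactly the trade-off the sparsity-sensitive bounds below quantify.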
However, such methods either do not have optimal sparsity-sensitive theoretical guarantees or work only under strong sparsity assumptions.\nThere are various sparsity notions, but it is currently not clear which notion best characterizes the inherent complexity of DME. To get meaningful sparsity-sensitive bounds, it is essential to identify an appropriate sparsity notion for DME. In this paper, we propose to use a modified notion of Hoyer's measure of sparseness [9]. For a d-dimensional vector X, its sparseness is defined as ||X||_1^2 / (d ||X||_2^2).^1 Since our inputs can be viewed as an nd-dimensional vector, the global sparseness is defined as s := F1^2 / (nd F2). Note that 1/(nd) <= s <= 1; s = 1 (densest) iff all entries are non-zero and have equal absolute values, and s = 1/(nd) (sparsest) iff the input contains a single non-zero entry. Wangni et al. [23] obtain a sparsity-aware bound based on a different sparsity notion, but our result implies theirs and can be much better for some inputs (see the supplementary for details).\n\n1.3 Our contributions\n\nFirst, we propose a sparsity-sensitive compression method that provably exploits the sparseness of the input. Specifically, to achieve an MSE of E <= F2/n^2, our protocol only needs to transmit approximately nds log(1/s + 1) bits (ignoring some lower order terms), where s is the sparseness of the input defined earlier. Since s log(1/s + 1) <= 1 when s <= 1, this is always no worse than nd (the cost of [20]) and can be much smaller on sparse inputs, i.e., when s << 1.\nSecondly, we prove that, for any sparseness s <= 1, the communication cost of our protocol is optimal, up to a constant factor. Specifically, for any s <= 1, we construct a family of inputs with sparseness equal to s, and prove that any protocol achieving an MSE of F2/n^2 on this family must incur Omega(nds log(1/s)) bits of communication in expectation for some inputs in this family. This lower bound holds for multi-round protocols in the broadcasting model (where each message can be seen by all clients). As observed in [20], any lower bound for distributed statistical mean estimation can be translated into a lower bound for the DME problem. However, current lower bounds in this area do not suffice to obtain tight sparsity-sensitive bounds for DME.\nFinally, we complement our theoretical findings with experimental studies. Empirical results show that, under the same communication bandwidth, our proposed method has a much lower MSE, especially on sparse inputs, which verifies our theoretical analyses. Moreover, as a subroutine, our protocol outperforms previous approaches consistently in various distributed learning tasks, e.g., Lloyd's algorithm for K-means clustering and power iteration.\n\n1 The original Hoyer's measure is the ratio between the l1 and l2 norm, normalized to the range [0, 1].\n\n2 Sparsity-Sensitive DME Protocol\n\nOverview of our techniques. Algorithms in [20] apply k-level stochastic quantization and then encode the quantized vectors using variable length coding. Specifically, for each Xi, the client divides the range [Xi^min, Xi^max] into k - 1 intervals of equal length, and then identifies the interval containing each Xij and rounds it either to the left point or the right point of the corresponding interval with probability depending on its distance to the end points. 
After quantization, the vector can be viewed as a string of length d over an alphabet of size k, which is then compressed using arithmetic coding. QSGD is similar, but it encodes signs separately and uses the Elias coding method.\nSince the sparseness depends on the l1 norm F1, our quantization step size depends on F1, as in Wang et al. [22]. In addition to F1 quantization, our protocol has the following algorithmic ingredients. 1) All clients in our protocol use the same interval size in stochastic quantization. This means that the number of levels may vary across clients, as opposed to all previous methods, where every client uses a fixed number of levels. This is another major difference in our quantization step, and it is necessary for obtaining communication bounds in terms of the global sparseness. 2) As in QSGD, we encode the sign of each entry separately and only quantize the absolute values, which can be conveniently implemented by a scaling-and-rounding procedure. 3) Instead of encoding the quantized vectors directly with entropy coding methods, we first convert each integer vector into a bit string using a one-to-one map: for any integer vector v = (v1, v2, ..., vd), the length of its corresponding bit string is d + ||v||_1 - 1, among which the number of 1's is exactly d - 1. 4) We then apply efficient coding methods, e.g., arithmetic coding, to encode the entire bit string using roughly log binom(d + ||v||_1, d), i.e., approximately ||v||_1 log((d + ||v||_1)/||v||_1), bits.\n\nScaling and Rounding. We first introduce the scaling-and-rounding procedure (Algorithm 1), which is essentially equivalent to stochastic quantization (applied to the absolute values only). The next lemma summarizes the key properties of SaR; its proof is in the supplementary.\n\nAlgorithm 1 Scaling and Rounding (SaR)\ninput: v in R^d and a scaling factor F\n1: u = (1/F) v\n2: Randomized rounding: for j = 1, ..., d, set u_hat_j = floor(u_j) + 1 with probability u_j - floor(u_j), and u_hat_j = floor(u_j) otherwise\n3: return u_hat\n\nLemma 2.1. Let v_hat = F u_hat. Then E[v_hat] = v and E[||v_hat - v||_2^2] <= F ||v||_1. Moreover, E[|v_hat_j|] = |v_j| for every j.\n\nIn our protocol, we apply Algorithm 1 to each Xi with F = F1/C, where C is a tunable parameter. Let u_hat_i be the output for Xi and X_hat_i = F u_hat_i. At the end, the server uses X_hat = (1/n) sum_{i=1}^n X_hat_i as the estimate for the mean; then, by Lemma 2.1, the MSE is\n\nE = E||X_hat - X||_2^2 = (1/n^2) sum_{i=1}^n E[||X_hat_i - Xi||_2^2] <= (F/n^2) sum_{i=1}^n ||Xi||_1 = F1^2/(C n^2).   (1)\n\nConstant-weight binary sequences. The Hamming weight of a length-d binary sequence v, denoted w(v), is its number of 1-entries. The constant-weight binary code C(d, w) is the set of all length-d binary sequences of weight w. Since |C(d, w)| = binom(d, w), the number of bits needed to represent a sequence in C(d, w) is at least ceil(log binom(d, w)). There exist efficient encoding methods, such as arithmetic coding or its variants [17], that encode sequences in C(d, w) using binary strings of length very close to ceil(log binom(d, w)), with encoding and decoding time O(d).\n\nConstant-weight non-negative integer vector coding. The weight of a length-d non-negative integer vector v is w(v) = sum_{i=1}^d vi. The constant-weight integer code I(d, w) is the set of all length-d non-negative integer vectors of weight w. 
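Algorithm 1 and the unbiasedness claims of Lemma 2.1 can be sketched and checked empirically as follows (a minimal Python sketch; the function name is ours):

```python
import math
import random

def scale_and_round(v, F, rng=None):
    """SaR (Algorithm 1): scale v by 1/F, then randomly round each
    coordinate to an adjacent integer so that F * u_hat is unbiased."""
    rng = rng or random.Random(0)
    u_hat = []
    for x in v:
        uj = x / F
        fl = math.floor(uj)
        u_hat.append(fl + 1 if rng.random() < uj - fl else fl)
    return u_hat

# Empirical check of E[F * u_hat] = v (Lemma 2.1).
v = [0.3, -1.7, 2.2, 0.0]
F = 0.5
trials = 200_000
rng = random.Random(1)
est = [0.0] * len(v)
for _ in range(trials):
    for j, uj_hat in enumerate(scale_and_round(v, F, rng)):
        est[j] += F * uj_hat / trials
```

With 200,000 trials the per-coordinate standard error is well below 0.01 here, so `est` should match `v` closely; the variance bound F||v||_1 of Lemma 2.1 is what the choice F = F1/C later turns into the MSE in (1).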
In our protocol, we map each v in I(d, w) to a binary sequence f(v) in C(d + w - 1, d - 1) as follows: for i = 1, 2, ..., d - 1, we write vi '0's followed by one '1', and at the end we write vd '0's. The result is a constant-weight binary sequence of length d + w - 1 containing exactly d - 1 ones. One can verify that f is a one-to-one and onto map from I(d, w) to C(d + w - 1, d - 1). By applying the encoding methods for C mentioned above, we have the following lemma.\n\nLemma 2.2. Sequences in I(d, w) can be encoded losslessly by ceil(log binom(d + w - 1, d - 1))-bit binary strings, with encoding and decoding time O(d + w).\n\n2.1 The Protocol\n\nWe are now ready to describe our sparsity-sensitive DME protocol.\n1. (Initialization) Clients and the server determine the scaling factor F to be used in Algorithm 1; we will use F = F1/C for some C <= nd. To compute F1, each client i sends ||Xi||_1 to the server using r bits, where r is the number of bits used to represent a floating point. The server then computes F1 = sum_i ||Xi||_1 and broadcasts it to all the clients. This step uses 2r bits of communication per client.\n2. (Quantization) Client i runs SaR(Xi, F1/C) (Algorithm 1) and gets an integer vector u_hat_i. The absolute value and sign of each entry of u_hat_i will be encoded separately. Let vi = (|u_hat_i1|, ..., |u_hat_id|).\n3. (Encoding) Note that vi is in I(d, wi), where wi = w(vi). Client i encodes vi using a ceil(log binom(d + wi - 1, d - 1))-bit string (Lemma 2.2) and sends it to the server. The client also sends the value Delta_wi = wi - floor(C ||Xi||_1 / F1) with log(2d + 1) bits,^2 as wi is needed for decoding vi.^3\n
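The map f of Lemma 2.2 and its inverse are simple to implement: the '1's act as separators between unary runs of '0's that encode the d counts. A sketch (our naming):

```python
def encode_cw(v):
    """f: I(d, w) -> C(d + w - 1, d - 1). For each of the first d - 1
    entries write v[i] zeros then a separator '1'; end with v[d-1] zeros."""
    bits = []
    for vi in v[:-1]:
        bits.extend([0] * vi)
        bits.append(1)
    bits.extend([0] * v[-1])
    return bits

def decode_cw(bits):
    """Inverse of f: read back the run lengths of zeros between the '1's."""
    v, run = [], 0
    for b in bits:
        if b == 1:
            v.append(run)
            run = 0
        else:
            run += 1
    v.append(run)
    return v
```

In the protocol, the resulting constant-weight bit string would then be compressed down to about log binom(d + w - 1, d - 1) bits by an entropy coder such as arithmetic coding; that final stage is omitted here.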
4. (Sending the signs) Let si be a binary sequence indicating the signs of the non-zero entries of u_hat_i. Client i simply sends this sequence with di bits of communication, where di is the number of non-zero values in vi. Moreover, we can apply constant-weight coding to compress this sequence.\n5. (Decoding) The server decodes vi, which contains the absolute values of u_hat_i. Given the signs si of its non-zero entries, the server can recover u_hat_i losslessly. It finally computes the estimated mean X_hat = (1/n) sum_i X_hat_i = (F1/(Cn)) sum_i u_hat_i.\n\nThe correctness of the protocol readily follows from Lemma 2.1 and (1): E[X_hat] = X and E[||X_hat - X||_2^2] <= F1^2/(C n^2). Below we analyze its communication cost. By the last part of Lemma 2.1, we have E[wi] = sum_j E[|u_hat_ij|] = sum_j C|Xij|/F1 = C||Xi||_1/F1. Therefore, E[sum_{i=1}^n wi] = sum_{i=1}^n C||Xi||_1/F1 = C. Because di <= wi, the expected total communication cost is at most\n\nE[ sum_{i=1}^n ( 2r + log(2d + 1) + log binom(d + wi - 1, d - 1) + di ) ] <= E[ sum_{i=1}^n wi log(d/wi + 1) ] + O(C + nr + n log d).\n\n2 Clearly, Delta_wi is an integer and |Delta_wi| <= d. One can also use a universal code such as the Elias gamma code [7] to reduce the bits for transmitting Delta_wi.\n3 One can also use entropy coding to encode vi, but it is unclear whether such methods achieve the same theoretical guarantee as ours.\n\nFrom the concavity of the function x log(1/x + 1) on R>0 and Jensen's inequality, we have\n\nE[ sum_{i=1}^n wi log(d/wi + 1) ] <= E[ ( sum_{i=1}^n wi ) log( nd / sum_{i=1}^n wi + 1 ) ] <= C log(nd/C + 1).\n\nTherefore we get the following theorem, and by setting C = F1^2/F2, the next corollary follows.\nTheorem 2.3. For any C <= nd, there exists a DME protocol that achieves an MSE of F1^2/(C n^2) with C log(nd/C + 1) + O(C + nr + n log d) bits of communication, where r is the number of bits used to represent a floating point.\nCorollary 2.4. There exists a DME protocol that achieves an MSE of F2/n^2 using nds log(1/s + 1) + O(nds + nr + n log d) bits, where s = F1^2/(nd F2) is the Hoyer's measure of sparseness of the inputs.\n\nRemark. The authors of [20] discuss how to use client or coordinate sampling to obtain a trade-off between MSE and communication. Their analysis shows that, to achieve an MSE of F2/(p n^2), the communication cost is O(pnd) bits (ignoring low order terms), where 0 <= p <= 1 is the sampling probability. 
Applying our sparsity-sensitive protocol on the sampled clients or coordinates, we can achieve the same MSE with a communication cost of O(pnds log(1/s + 1)) bits, which is never worse and can be much better on inputs with small sparseness s.\nWe would also like to point out that our algorithm can be run without the synchronization round. In this setting, we can derive a communication bound for each client by simply setting n = 1 in Corollary 2.4, although s in the bound then becomes the local sparsity of the client. The local sparsity bound is worse than the global one when there is data imbalance, but it is still better than prior work as long as there is dimension imbalance across different clients. This is also verified in our experimental results below.\nSince the sparseness depends on the l1 norm F1, the key to getting a sparsity-sensitive bound is to understand the connection between F1 and the MSE-communication trade-off. Hence our quantization step size depends on F1, which is one of the main differences in the quantization step compared with [20, 2, 23]. Wang et al. [22] also use F1 quantization, but they only consider 1-level quantization and do not specify an appropriate encoding method to achieve an optimal sparsity-sensitive bound. Our protocol uses C log(nd/C + 1) bits of communication to achieve an MSE of F1^2/(n^2 C), where C <= nd is a tunable parameter; if we set C = F1^2/F2, the MSE and communication cost are F2/n^2 and nds log(1/s + 1), respectively, as claimed earlier. Having C as a tunable parameter gives us better control of the cost of the protocol; our result in fact implies the MSE-communication trade-off of [22] (and can be much better), but not vice versa. Wang et al. [22] prove that their algorithm can compress a vector X in R^d using kr bits (where r is the number of bits used to represent a floating point and k is a tuning parameter) with MSE F1^2/k; our algorithm (the special case n = 1) can compress X using C log(d/C + 1) bits with MSE at most F1^2/C. By setting C = k, we achieve the same MSE while the cost is k log(d/k) bits (and it is trivial to make it k min(log(d/k), r)). When k = Theta(d), the cost is O(k) bits versus O(kr).\n\n3 Lower Bound\n\nIn this section, we show the optimality of Theorem 2.3 by proving the following lower bound.\nTheorem 3.1. For any n <= C <= nd/2, there exists a family of inputs, all of which have F1 = F2 = C, such that any randomized protocol solving the DME problem on this family in the broadcast model with an MSE of F1^2/(4 n^2 C) must communicate at least (C/2) log(nd/(2C)) bits.\n\nThis theorem immediately leads to the following corollary, which means that our sparsity-sensitive protocol is optimal (up to a constant factor) for all sparseness 1/d <= s <= 1/2.\nCorollary 3.2. For any sparseness 1/d <= s <= 1/2, there exists a family of inputs, all of which have sparseness s, such that any randomized protocol solving the DME problem on this family in the broadcast model with an MSE of F2/(4 n^2) must communicate at least (nds/2) log(1/(2s)) bits.\n\nProof. Note that on the family of inputs used in the proof of Theorem 3.1 (presented shortly), we have s = F1^2/(nd F2) = C/(nd) and E = F1^2/(4 n^2 C) = F2/(4 n^2). Since this family exists for any n <= C <= nd/2, we obtain a family with sparseness s for any 1/d <= s <= 1/2. The MSE and communication cost in Theorem 3.1 can then be rewritten as claimed in the corollary.\n\nThe rest of this section is devoted to the proof of Theorem 3.1. To prove lower bounds for randomized protocols, the standard tool is Yao's Minimax Principle [25]. We will define an input distribution D for DME. 
Suppose there is a randomized algorithm A_R with worst-case (over all inputs) MSE M and expected cost T, where R is the randomness used in the algorithm. If we sample an input X ~ D, then E_R E_{X~D}[MSE of A_R(X)] <= M and E_R E_{X~D}[Cost of A_R(X)] <= T. By Markov's inequality, Pr_R[ E_{X~D}[MSE of A_R(X)] <= 4M ] >= 0.75 and Pr_R[ E_{X~D}[Cost of A_R(X)] <= 2T ] >= 0.5. Hence, with positive probability, the two events happen simultaneously. In other words, there exists some fixed randomness r such that E_{X~D}[MSE of A_r(X)] <= 4M and E_{X~D}[Cost of A_r(X)] <= 2T, where A_r is now simply a deterministic algorithm. That means that if there is a randomized algorithm with worst-case MSE M and expected cost T, then there exists a deterministic algorithm with MSE 4M and expected cost 2T w.r.t. any input distribution D.\n\nMinimax Principle. From the above argument, it is sufficient to prove that, for some input distribution D, any deterministic protocol with MSE at most Theta(F1^2/(n^2 C)) must incur an expected communication cost of Omega(C log(nd/C)) bits.\n\nInput distribution. For any fixed n <= C <= nd/2, we define the hard distribution D for our problem as follows. Each Xi is divided into t = C/n blocks, each of size b = nd/C. In this section, we use xij in R^b to denote the jth block of Xi. In D, each block xij is uniformly sampled from the b-dimensional standard basis vectors, i.e., Pr[xij = ek] = 1/b for each 1 <= k <= b, and the xij are independent across all i and j. Note that any input sampled from D has l1 norm exactly C.\nLet Pi be any deterministic protocol with MSE bounded by F1^2/(4 n^2 C) = C/(4 n^2) w.r.t. the input distribution D. We next prove a lower bound on the expected communication cost of Pi w.r.t. D. 
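The hard distribution D is easy to sample. The following sketch (our code, not from the paper) also checks that every draw has F1 = F2 = C, and hence global sparseness s = F1^2/(nd F2) = C/(nd):

```python
import random

def sample_hard_input(n, d, C, rng=None):
    """Draw (X1, ..., Xn) from D: each client's vector consists of
    t = C/n blocks of size b = nd/C, each block a uniformly random
    standard basis vector e_k in R^b."""
    rng = rng or random.Random(0)
    t, b = C // n, (n * d) // C
    assert t * b == d, "require C/n and nd/C to be integers"
    clients = []
    for _ in range(n):
        x = []
        for _ in range(t):
            block = [0.0] * b
            block[rng.randrange(b)] = 1.0
            x.extend(block)
        clients.append(x)
    return clients

n, d, C = 4, 100, 20  # t = 5 blocks of size b = 20 per client
X = sample_hard_input(n, d, C)
F1 = sum(sum(abs(v) for v in x) for x in X)
F2 = sum(sum(v * v for v in x) for x in X)
s = F1 ** 2 / (n * d * F2)
```

Each vector has exactly t ones, so ||Xi||_1 = ||Xi||_2^2 = t and F1 = F2 = nt = C, matching the family promised by Theorem 3.1.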
Let X1, X2, ..., Xn be a random input sampled from D and let Pi(X1, ..., Xn) be the transcript of the protocol on this input, i.e., the concatenation of all messages, which is a random variable. When there is no confusion, we omit the input and use Pi to denote the random transcript; pi ~ Pi means that pi is chosen according to the distribution of Pi(X1, ..., Xn).\nSince the protocol is deterministic, any particular transcript pi corresponds to a deterministic set of inputs R_pi, i.e., all inputs in R_pi generate the same transcript pi under the protocol Pi. Hence, all inputs in R_pi share the same output, denoted Y_pi. Each input belongs to a unique R_pi, and thus the sets R_pi form a partition of all possible inputs. It is well known that R_pi is a combinatorial rectangle, i.e., R_pi = B1 x ... x Bn, where each Bi, a subset of {0, 1}^d, is some subset of the possible inputs of the ith client.\nDefinition 1. Define D_pi as the conditional distribution of X1, X2, ..., Xn (sampled from D) conditioned on the event [X1, X2, ..., Xn] in R_pi.\nLet X = (1/n) sum_{i=1}^n Xi. By the properties of conditional expectation, we have the following lemma.\nLemma 3.3. If Pi has an MSE of C/(4 n^2), then E_{pi~Pi} E_{[X1,...,Xn]~D_pi}[ ||X - Y_pi||^2 ] <= C/(4 n^2).\nDefinition 2. Suppose [X1, ..., Xn] ~ D_pi; then, for every i and j, the distribution of xij is still a distribution over b-dimensional standard basis vectors. For each i, j, we define p_ijk^pi = Pr_{D_pi}[xij = ek] for k in [b], where sum_{k=1}^b p_ijk^pi = 1.\nThe next lemma is crucial to our argument; its proof can be found in the supplementary.\nLemma 3.4. 
For any transcript pi, letting Y_pi be its output, we have\n\nsum_{j=1}^t sum_{k=1}^b sum_{i=1}^n p_ijk^pi (1 - p_ijk^pi) <= n^2 E_{[X1,...,Xn]~D_pi}[ ||X - Y_pi||^2 ].\n\nHere we introduce some basic notions from information theory [6]. For any random variable X, H(X) is the standard Shannon entropy of X. For any random variables X, Y, Z, we use H(X | Y) = E_Y[H(X | Y = y)] to denote the conditional entropy of X given Y, and I(X; Y | Z) = H(X | Z) - H(X | Y, Z) to denote the conditional mutual information between X and Y given Z. The average encoding length of a random transcript Pi, i.e., the expected communication cost, is lower bounded by its entropy H(Pi). By the non-negativity of (conditional) entropy, we have\n\nH(Pi) = I(X1, ..., Xn; Pi) + H(Pi | X1, ..., Xn) >= I(X1, ..., Xn; Pi).   (2)\n\nNext we prove a lower bound on I(X1, ..., Xn; Pi). We will need the following property.\nLemma 3.5. Let X, Y, Z be three random variables such that X and Y are independent; then I(X, Y; Z) >= I(X; Z) + I(Y; Z).\nLemma 3.6. I(X1, ..., Xn; Pi) >= (C/2) log(nd/(2C)).\n\nProof. Since the input distribution is independent across different blocks and clients, by Lemma 3.5 we have I(X1, ..., Xn; Pi) >= sum_{j=1}^t sum_{i=1}^n I(Xij; Pi). Thus,\n\nI(X1, ..., Xn; Pi) >= sum_{j,i} H(Xij) - sum_{j,i} H(Xij | Pi) = C log(nd/C) - E_{pi~Pi}[ sum_{j,i} H(Xij | Pi = pi) ],   (3)\n\nwhere we use H(Xij) = log b = log(nd/C) and the fact that there are tn = C blocks in total. Let q_ijk^pi = min(p_ijk^pi, 1 - p_ijk^pi) (see Definition 2); then q_ijk^pi <= 0.5. It can be verified by elementary calculus that x log(1/x) >= (1 - x) log(1/(1 - x)) for 0 < x <= 0.5, which implies p_ijk^pi log(1/p_ijk^pi) <= q_ijk^pi log(1/q_ijk^pi). So,\n\nE_{pi~Pi}[ sum_{j,i} H(Xij | Pi = pi) ] = E_pi[ sum_{j,i,k} p_ijk^pi log(1/p_ijk^pi) ] <= E_pi[ sum_{j,i,k} q_ijk^pi log(1/q_ijk^pi) ] <= E_pi[ ( sum_{j,i,k} q_ijk^pi ) log( nd / sum_{j,i,k} q_ijk^pi ) ],\n\nwhere the last inequality is Jensen's inequality applied to the concave function x log(1/x) over the nd terms of the sum. Moreover, since q_ijk^pi <= 0.5,\n\nE_pi[ sum_{j,i,k} q_ijk^pi ] <= 2 E_pi[ sum_{j,i,k} q_ijk^pi (1 - q_ijk^pi) ] = 2 E_pi[ sum_{j,i,k} p_ijk^pi (1 - p_ijk^pi) ] <= C/2,\n\nwhere the equality uses q(1 - q) = p(1 - p) and the last inequality follows from Lemmas 3.3 and 3.4.\nConsider the function g(x) = x log(nd/x), which is concave on R>0 with derivative g'(x) = log(nd/x) - 1/ln 2. Thus g attains its maximum at x = nd/2^{1/ln 2} and is monotonically increasing for 0 < x <= nd/2^{1/ln 2}. Since C/2 <= nd/4 < nd/2^{1/ln 2}, we have\n\nE_pi[ sum_{j,i} H(Xij | Pi = pi) ] <= E_pi[ g( sum_{j,i,k} q_ijk^pi ) ] <= g( E_pi[ sum_{j,i,k} q_ijk^pi ] ) <= g(C/2) = (C/2) log(2nd/C),\n\nusing Jensen's inequality (g is concave) and then the monotonicity of g. By (3), we conclude that\n\nI(X1, ..., Xn; Pi) >= C log(nd/C) - (C/2) log(2nd/C) = (C/2) log(nd/(2C)).\n\nThis finishes the proof of the lemma.\n\nBy (2), the expected communication cost of Pi is at least (C/2) log(nd/(2C)) bits w.r.t. D. Theorem 3.1 then follows from the minimax principle.\n\nFigure 1: Communication-MSE trade-off on the synthetic dataset generated from the t-distribution; panels (a)-(d) have sparseness 0.60, 0.36, 0.15, and 0.06. The x-axis is the average number of bits sent for each dimension, and the y-axis is log(MSE).\n\n4 Experiments\n\nWe have conducted experiments comparing our DME protocol with the variable length coding method (the best in [20]) and the methods of [2, 23], both on their MSE-communication trade-off and on their performance in distributed learning tasks that use DME as a subroutine, including K-means clustering and power iteration. 
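The synthetic data and the sparseness measurements of Section 4.1 can be reproduced in spirit with the following sketch (our code; the paper does not state the degrees of freedom of the t-distribution, so df = 3 is an assumption, and we use a smaller dimension than the paper's 10000 for speed):

```python
import math
import random

def student_t(df, rng):
    # Standard Student's t sample: N(0,1) / sqrt(chi^2_df / df),
    # built from the stdlib since `random` has no t-distribution.
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

def hoyer_sparseness(vectors):
    # s = F1^2 / (nd * F2), the paper's global sparseness measure.
    F1 = sum(abs(v) for x in vectors for v in x)
    F2 = sum(v * v for x in vectors for v in x)
    n, d = len(vectors), len(vectors[0])
    return F1 ** 2 / (n * d * F2)

rng = random.Random(0)
n, d = 16, 1000  # the paper uses 16 clients; d is reduced here
X = [[student_t(3, rng) for _ in range(d)] for _ in range(n)]
s_dense = hoyer_sparseness(X)

# Zeroing a random 90% of the coordinates makes the data sparser,
# which lowers s (as in the construction behind Figure 1(d)).
keep = set(rng.sample(range(d), d // 10))
X_sparse = [[v if j in keep else 0.0 for j, v in enumerate(x)] for x in X]
s_sparse = hoyer_sparseness(X_sparse)
```

The exact values of `s_dense` and `s_sparse` depend on the random draw and on df, so they need not match the paper's 0.60 and 0.06; the qualitative behavior (zeroing coordinates shrinks s by roughly the kept fraction) is what matters here.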
The algorithm in [22] does not specify an appropriate encoding method and directly sends floating-point values, so its cost is worse than that of [2, 23].

4.1 DME

In the first set of experiments, we compare our new protocol with those of [2, 23, 20] on the DME problem directly, in terms of the MSE-communication trade-off. To see how the performance of each protocol is affected by the sparseness of the input, we generated synthetic datasets with varying sparseness. Specifically, we generated 16 vectors, each held by a different client. Each vector has 10000 dimensions, whose values are generated independently from Student's t-distribution. This data set has an empirical sparseness of 0.60, and the results are shown in Figure 1(a).

We used two ways to create sparser data. First, we scaled up the data on each node by a different factor, itself also generated from the t-distribution. This resulted in a data set with sparseness 0.36, and the experimental results are shown in Figure 1(b), which confirms the effectiveness of using a global quantization step size when the data is unbalanced across clients. Second, we randomly chose 30% and 10% of the dimensions and set the rest to 0. This resulted in two data sets with sparseness 0.15 and 0.06, respectively, and the experimental results are shown in Figure 1(c) and Figure 1(d). These results show that the sparser and/or less balanced (across clients) the data is, the larger the performance gain of our new protocol. The same phenomenon is also observed in the next two tasks.

In Figure 1 (a)(c)(d), the data sets used have no imbalance across clients (so the coordination round is effectively useless), and the results are still better than those of previous methods.

4.2 Distributed K-Means

We then test performance on distributed K-means.
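The weighted-mean reduction used in this task (each client scales its local per-cluster centroids by its cluster sizes, so that a plain DME over clients recovers the correctly weighted mean) can be sketched as follows. This is our illustration, not the paper's code; `dme_mean` stands in for any DME protocol and here simply returns the exact mean:

```python
import numpy as np

def dme_mean(vectors):
    # Stand-in for a communication-efficient DME protocol;
    # here it returns the exact mean (no quantization).
    return np.mean(vectors, axis=0)

def distributed_kmeans_round(local_data, centroids):
    """One synchronous round of distributed Lloyd's algorithm.

    local_data: list of (m_i, d) arrays, one per client.
    centroids:  (K, d) array broadcast by the server.
    """
    n, (K, d) = len(local_data), centroids.shape
    sums = np.zeros((n, K, d))   # cluster-size-scaled local centroids
    counts = np.zeros((n, K))
    for i, X in enumerate(local_data):
        # Assign each local point to its nearest centroid.
        assign = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            members = X[assign == k]
            counts[i, k] = len(members)
            sums[i, k] = members.sum(axis=0)  # = (local size) * (local centroid)
    total = counts.sum(axis=0)
    new_centroids = centroids.copy()
    for k in range(K):
        if total[k] > 0:
            # n * (DME mean over clients) = sum over clients;
            # dividing by the global cluster size yields the weighted mean.
            new_centroids[k] = n * dme_mean(sums[:, k]) / total[k]
    return new_centroids
```

Swapping `dme_mean` for a quantized protocol changes only the communication cost, not the structure of the round.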
In each iteration of the distributed K-means algorithm, the server broadcasts the current centroids of the clusters to all clients. Each client updates the centroids based on its local data and sends the updated centroids back to the server. The server then computes the average of these centroids for each cluster. This is exactly K instances of the DME problem, except that the average should be weighted by the cluster size at each client. Thus, we first scale up the centroids by the cluster size, and then apply the DME protocols.

We used the MNIST [13] data set, uniformly or non-uniformly distributed across 10 clients. The number of clusters and the number of iterations are set to 10 and 30, respectively. The results are shown in Figure 2, where we used different values of k (the quantization level) for Suresh et al.'s algorithm, k = 32 for less communication and k = 512 for less error; the other methods are tuned to achieve the same objective. The results show that, for the same final objective, our algorithm incurs a lower communication cost.

(a) uniform   (b) uniform   (c) non-uniform   (d) non-uniform

Figure 2: Distributed K-Means on the MNIST dataset distributed among 10 workers. The x-axis is the average number of bits sent for each dimension, accumulated over the iterations, and the y-axis is the objective function value of K-Means.
In (a) and (b) the data is uniformly distributed, while in (c) and (d) it is non-uniformly distributed: each worker has 1000, 4000, 7000, 10000, or 13000 images.

(a) uniform   (b) uniform   (c) non-uniform   (d) non-uniform

Figure 3: Distributed power iteration on the MNIST dataset distributed among 100 workers. The x-axis is the average number of bits sent for each dimension, which scales linearly with the number of iterations, and the y-axis is the ℓ2 distance between the current estimate of the eigenvector and the ground-truth eigenvector. In (a) and (b) the data is uniformly distributed, while in (c) and (d) it is non-uniformly distributed: each worker has 100, 400, 700, 1000, or 1300 images.

4.3 Distributed Power Iteration

The second learning task we tested is the distributed power iteration algorithm. The number of clients is set to 100 and the number of iterations to 15. In this algorithm, the server broadcasts the current estimate of the eigenvector to all clients; each client then updates the eigenvector by performing one power iteration on its local data and sends the compressed eigenvector back to the server. The server updates the current estimate of the eigenvector with the average of all the received eigenvectors. The results on the MNIST data set are reported in Figure 3, where we used different values of k (the quantization level) for Suresh et al.'s algorithm, k = 32 for less communication and k = 512 for less error; the other methods are tuned to achieve the same error. The results again show that our DME protocol uses less communication to achieve the same error.

Acknowledgments

Zengfeng Huang is partially supported by the National Natural Science Foundation of China (Grant No. 61802069), the Shanghai Sailing Program (Grant No. 18YF1401200), and the Shanghai Science and Technology Commission (Grant No. 17JC1420200). Ziyue Huang, Yilei Wang, and Ke Yi are supported by HKRGC under grants 16200415, 16202317, and 16201318.

References

[1] A. F.
Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.

[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[4] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1011–1020. ACM, 2016.

[5] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pages 571–582, 2014.

[6] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.

[7] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.

[8] A. Garg, T. Ma, and H. Nguyen. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems, pages 2726–2734, 2014.

[9] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469, 2004.

[10] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.

[11] J. Konečný and P. Richtárik. Randomized distributed mean estimation: Accuracy vs communication. Frontiers in Applied Mathematics and Statistics, 4:62, 2018.

[12] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, 2018.

[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[14] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.

[15] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.

[16] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[17] T. V. Ramabadran. A coding scheme for m-out-of-n codes. IEEE Transactions on Communications, 38(8):1156–1163, 1990.

[18] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

[19] N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[20] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In ICML, 2017.

[21] L. N. Trefethen and D. Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.

[22] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pages 9850–9861, 2018.

[23] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems 31, 2018.

[24] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

[25] A. C. Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (FOCS), pages 222–227. IEEE, 1977.

[26] M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing, M. Annavaram, and S. Avestimehr. GradiVeQ: Vector quantization for bandwidth-efficient gradient aggregation in distributed CNN training. In Advances in Neural Information Processing Systems 31, 2018.

[27] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.