{"title": "Gradient Sparsification for Communication-Efficient Distributed Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1299, "page_last": 1309, "abstract": "Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information such as stochastic gradients among different workers. In this paper, to reduce the communication cost, we propose a convex optimization formulation to minimize the coding length of stochastic gradients. The key idea is to randomly drop out coordinates of the stochastic gradient vectors and amplify the remaining coordinates appropriately to ensure the sparsified gradient to be unbiased. To solve the optimal sparsification efficiently, several simple and fast algorithms are proposed for an approximate solution, with a theoretical guarantee for sparseness. Experiments on $\\ell_2$ regularized logistic regression, support vector machines, and convolutional neural networks validate our sparsification approaches.", "full_text": "Gradient Sparsi\ufb01cation for Communication-Ef\ufb01cient\n\nDistributed Optimization\n\nJianqiao Wangni\n\nUniversity of Pennsylvania\n\nTencent AI Lab\n\nwnjq@seas.upenn.edu\n\nJi Liu\n\nUniversity of Rochester\n\nTencent AI Lab\n\nJialei Wang\n\nTwo Sigma Investments\n\njialei.wang@twosigma.com\n\nTong Zhang\nTencent AI Lab\n\ntongzhang@tongzhang-ml.org\n\nji.liu.uwisc@gmail.com\n\nAbstract\n\nModern large-scale machine learning applications require stochastic optimization\nalgorithms to be implemented on distributed computational architectures. A key\nbottleneck is the communication overhead for exchanging information such as\nstochastic gradients among different workers. 
In this paper, to reduce the communication cost, we propose a convex optimization formulation to minimize the coding length of stochastic gradients. The key idea is to randomly drop out coordinates of the stochastic gradient vectors and amplify the remaining coordinates appropriately to ensure that the sparsified gradient is unbiased. To solve the optimal sparsification efficiently, a simple and fast algorithm is proposed for an approximate solution, with a theoretical guarantee for sparseness. Experiments on $\ell_2$-regularized logistic regression, support vector machines, and convolutional neural networks validate our sparsification approaches.

1 Introduction

Scaling stochastic optimization algorithms [26, 24, 14, 11] to distributed computational architectures [10, 17, 33] or multicore systems [23, 9, 19, 22] is a crucial problem for large-scale machine learning. In the synchronous stochastic gradient method, each worker processes a random minibatch of its training data; the local updates are then synchronized by an All-Reduce step, which aggregates the stochastic gradients from all workers, followed by a Broadcast step that transmits the updated parameter vector back to all workers. The process is repeated until a certain convergence criterion is met. An important factor that may significantly slow down any optimization algorithm is the communication cost among workers. Even in the single-machine multi-core setting, where the cores communicate by reading and writing to a chunk of shared memory, conflicts over memory-access resources may significantly degrade efficiency. There are solutions to specific problems such as mean estimation [29, 28], component analysis [20], clustering [6], sparse regression [16], and boosting [7].
Other existing work on distributed machine learning falls into two directions: 1) designing communication-efficient algorithms that reduce the number of communication rounds among workers [37, 27, 12, 36], and 2) using large mini-batches without compromising the convergence speed [18, 31]. Several papers considered reducing the precision of the gradient, either by using fewer bits to represent floating-point numbers [25, 2, 34, 8, 32] or by transmitting only the coordinates of large magnitude [1, 21]. This problem has also drawn significant attention from theoretical perspectives on its communication complexity [30, 37, 3].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we propose a novel approach that complements the methods above. Specifically, we sparsify stochastic gradients to reduce the communication cost, at the price of a minor increase in the number of iterations. The key idea behind our sparsification technique is to drop some coordinates of the stochastic gradient and appropriately amplify the remaining coordinates so that the sparsified stochastic gradient remains unbiased. The sparsification approach can significantly reduce the coding length of the stochastic gradient while only slightly increasing its variance. This paper proposes a convex formulation to achieve the optimal tradeoff of variance and sparsity: the optimal probabilities for sampling coordinates can be obtained given any fixed variance budget. To solve this optimization in linear time, several efficient algorithms are proposed to find approximately optimal solutions with sparsity guarantees. The proposed sparsification approach can be incorporated seamlessly into many benchmark stochastic optimization algorithms in machine learning, such as SGD [4], SVRG [14, 35], SAGA [11], and ADAM [15].
We conducted empirical studies to validate the proposed approach on $\ell_2$-regularized logistic regression, support vector machines, and convolutional neural networks on both synthetic and real-world data sets.

2 Algorithms

We consider the problem of sparsifying a stochastic gradient vector, and formulate it as a linear planning problem. Consider a training data set $\{x_n\}_{n=1}^N$ and $N$ loss functions $\{f_n\}_{n=1}^N$, each of which $f_n : \Omega \to R$ depends on a training data point $x_n \in \Omega$. We use $w \in R^d$ to denote the model parameter vector, and consider solving the following problem using stochastic optimization:

SGD: $\min_w f(w) := \frac{1}{N} \sum_{n=1}^N f_n(w)$, with update $w_{t+1} = w_t - \eta_t g_t(w_t)$, (1)

where $t$ indexes the iterations and $E[g_t(w)] = \nabla f(w)$, so that $g_t(w_t)$ serves as an unbiased estimate of the true gradient $\nabla f(w_t)$. Two standard choices of $g_t$ are those of SGD [35, 4] and SVRG [14]; the SGD choice is

$g_t(w_t) = \nabla f_{n_t}(w_t)$, (2)

where $n_t$ is uniformly sampled from the data set (SVRG additionally uses a reference point $\tilde{w}$, as given in (3) below). The convergence of SGD is significantly dominated by $E\|g_t(w_t)\|^2$, or equivalently the variance of $g_t(w_t)$, as can be seen from the following simple derivation. Assume that the loss function $f(w)$ is $L$-smooth with respect to $w$, which means that for all $x, y \in R^d$, $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ (where $\|\cdot\|$ is the $\ell_2$-norm).
Then the expected loss function is given by

$E[f(w_{t+1})] \le E\left[ f(w_t) + \nabla f(w_t)^\top (w_{t+1} - w_t) + \frac{L}{2} \|w_{t+1} - w_t\|^2 \right]$
$= E\left[ f(w_t) - \eta_t \nabla f(w_t)^\top g_t(w_t) + \frac{L}{2} \eta_t^2 \|g_t(w_t)\|^2 \right]$
$= f(w_t) - \eta_t \|\nabla f(w_t)\|^2 + \frac{L}{2} \eta_t^2 \underbrace{E\|g_t(w_t)\|^2}_{\text{variance}}$,

where the inequality is due to the Lipschitz property, and the second equality is due to the unbiasedness of the gradient, $E[g_t(w)] = \nabla f(w)$. The SVRG choice of the stochastic gradient, for a reference point $\tilde{w}$, is

SVRG: $g_t(w_t) = \nabla f_{n_t}(w_t) - \nabla f_{n_t}(\tilde{w}) + \nabla f(\tilde{w})$. (3)

In either case, the magnitude of $E\|g_t(w_t)\|^2$, or equivalently the variance of $g_t(w_t)$, significantly affects the convergence efficiency.

Next we consider how to reduce the communication cost in distributed machine learning by using a sparsified version of the gradient $g_t(w_t)$, denoted by $Q(g_t(w_t))$, such that $Q(g_t(w_t))$ is unbiased and has a relatively small variance. In the following, to simplify notation, we denote the current stochastic gradient $g_t(w_t)$ by $g$ for short. Note that $g$ can be obtained either by SGD or by SVRG. We also let $g_i$ be the $i$-th component of the vector $g \in R^d$: $g = [g_1, \ldots, g_d]$. We propose to randomly drop out the $i$-th coordinate with probability $1 - p_i$, so that each coordinate remains non-zero with probability $p_i$. Let $Z_i \in \{0, 1\}$ be a binary-valued random variable indicating whether the $i$-th coordinate is selected: $Z_i = 1$ with probability $p_i$ and $Z_i = 0$ with probability $1 - p_i$. Then, to make the resulting sparsified gradient vector $Q(g)$ unbiased, we amplify the non-zero coordinates from $g_i$ to $g_i / p_i$. The final sparsified vector is thus $Q(g)_i = Z_i (g_i / p_i)$.
The whole protocol can be summarized as follows:

Gradients $g = [g_1, g_2, \cdots, g_d]$, Probabilities $p = [p_1, p_2, \cdots, p_d]$, Selectors $Z = [Z_1, Z_2, \cdots, Z_d]$ with $P(Z_i = 1) = p_i$ $\Longrightarrow$ Result $Q(g) = \left[ Z_1 \frac{g_1}{p_1}, Z_2 \frac{g_2}{p_2}, \cdots, Z_d \frac{g_d}{p_d} \right]$. (4)

We note that if $g$ is an unbiased estimate of the gradient, then $Q(g)$ is also an unbiased estimate, since $E[Q(g)_i] = p_i \times \frac{g_i}{p_i} + (1 - p_i) \times 0 = g_i$.

In distributed machine learning, each worker calculates a gradient $g$ and transmits it to the master node or the parameter server for an update. We use an index $m$ to indicate a node, and assume there are $M$ nodes in total. The gradient sparsification method can be used within a synchronous distributed stochastic optimization algorithm, as in Algorithm 1. Asynchronous algorithms can also be combined with our technique in a similar fashion.

Algorithm 1 A synchronous distributed optimization algorithm
1: Initialize the clock $t = 0$ and initialize the weight $w_0$.
2: repeat
3: Each worker $m$ calculates the local gradient $g_m(w_t)$ and the probability vector $p_m$.
4: Sparsify the gradients $Q(g_m(w_t))$ and take an All-Reduce step $v_t = \frac{1}{M} \sum_{m=1}^M Q(g_m(w_t))$.
5: Broadcast the average gradient $v_t$ and take a descent step $w_{t+1} = w_t - \eta_t v_t$ on all workers.
6: until convergence or the number of iterations reaches the maximum setting.

Our method can be combined with methods that are orthogonal to ours, such as transmitting only large coordinates while accumulating the gradient residual for transmission in a later step [1, 21]. The advanced quantization and coding strategies of [2] can be used for transmitting the valid coordinates of our method.
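As a concrete illustration, the dropout-and-amplify operator in (4) can be sketched in a few lines of NumPy; the function name and the Monte-Carlo unbiasedness check below are ours, not part of the paper:

```python
import numpy as np

def sparsify(g, p, rng):
    """Eq. (4): keep coordinate i with probability p[i] (selector Z_i),
    and amplify the survivors to g[i] / p[i] so that E[Q(g)] = g."""
    z = rng.random(g.shape) < p          # Z_i = 1 with probability p_i
    return np.where(z, g / p, 0.0)       # Q(g)_i = Z_i * g_i / p_i

# Monte-Carlo check of unbiasedness: the average of many sparsified
# copies should recover the original gradient.
rng = np.random.default_rng(0)
g = rng.standard_normal(6)
p = np.full(6, 0.3)                      # keep ~30% of coordinates
avg = np.mean([sparsify(g, p, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - g)))           # statistically close to zero
```

In Algorithm 1, each worker would apply such an operator to its local gradient before the All-Reduce step.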
In addition, this method is consistent with [29] for the mean estimation problem on distributed data, with a statistical guarantee under skewness.

2.1 Mathematical formulation

Although the gradient sparsification technique can reduce the communication cost, it increases the variance of the gradient vector, which might slow down the convergence rate. In the following, we investigate how to find the optimal tradeoff between sparsity and variance for the sparsification technique; in particular, we consider how to find the optimal sparsification strategy given a budget on the maximal variance. First, note that the variance of $Q(g)$ can be bounded by

$E\|Q(g)\|^2 = \sum_{i=1}^d E[Q(g)_i^2] = \sum_{i=1}^d \left[ \frac{g_i^2}{p_i^2} \times p_i + 0 \times (1 - p_i) \right] = \sum_{i=1}^d \frac{g_i^2}{p_i}$. (5)

In addition, the expected sparsity of $Q(g)$ is given by $E[\|Q(g)\|_0] = \sum_{i=1}^d p_i$. In this paper, we balance these two factors (sparsity and variance) by formulating a linear planning problem as follows:

$\min_p \sum_{i=1}^d p_i$ s.t. $\sum_{i=1}^d \frac{g_i^2}{p_i} \le (1 + \epsilon) \sum_{i=1}^d g_i^2$, $0 < p_i \le 1, \forall i \in [d]$, (6)

where $\epsilon$ is a factor that controls the variance increase of the stochastic gradient $g$. This leads to an optimal strategy for sparsification given an upper bound on the variance. The following proposition provides a closed-form solution for problem (6).

Proposition 1. The solution to the optimal sparsification problem (6) is a probability vector $p$ such that $p_i = \min(\lambda |g_i|, 1), \forall i \in [d]$, where $\lambda > 0$ is a constant depending only on $g$ and $\epsilon$.

Proof.
By introducing Lagrange multipliers $\lambda$ and $\mu_i$, the solution of (6) is given by the solution of the following objective:

$\min_p \max_\lambda \max_\mu L(p, \lambda, \mu) = \sum_{i=1}^d p_i + \lambda^2 \left( \sum_{i=1}^d \frac{g_i^2}{p_i} - (1 + \epsilon) \sum_{i=1}^d g_i^2 \right) + \sum_{i=1}^d \mu_i (p_i - 1)$. (7)

Consider the KKT conditions of the above formulation; by stationarity with respect to $p_i$ we have

$1 - \lambda^2 \frac{g_i^2}{p_i^2} + \mu_i = 0, \quad \forall i \in [d]$. (8)

Note that we have to permit $p_i = 0$ for the KKT conditions to apply. Combined with the complementary slackness condition, which guarantees $\mu_i (p_i - 1) = 0, \forall i \in [d]$, we know that $p_i = 1$ when $\mu_i \ne 0$, and $p_i = \lambda |g_i|$ when $\mu_i = 0$. This tells us that for some coordinates the probability of keeping the value is 1 (when $\mu_i \ne 0$), and for the other coordinates the probability of keeping the value is proportional to the magnitude of the gradient entry $g_i$. Also, by a simple exchange argument, if $|g_i| \ge |g_j|$ then $p_i \ge p_j$ (otherwise we could swap $p_i$ and $p_j$ and obtain a sparser result). Therefore there is a dominating set of coordinates $S$ with $p_j = 1, \forall j \in S$, and it must consist of the coordinates $j$ with the largest absolute magnitudes $|g_j|$. Suppose this set has size $|S| = k$ ($0 \le k \le d$), and denote by $g_{(1)}, g_{(2)}, \ldots, g_{(d)}$ the elements of $g$ ordered by their magnitudes (from the largest to the smallest); then $p_i = 1$ for $i \le k$, and $p_i = \lambda |g_{(i)}|$ for $i > k$.

2.2 Sparsification algorithms

In this section, we propose two algorithms for efficiently calculating the optimal probability vector $p$ in Proposition 1.
Since $\lambda > 0$, the complementary slackness condition implies that the variance constraint in (6) is active:

$\sum_{i=1}^d \frac{g_i^2}{p_i} - (1 + \epsilon) \sum_{i=1}^d g_i^2 = \sum_{i=1}^k g_{(i)}^2 + \sum_{i=k+1}^d \frac{|g_{(i)}|}{\lambda} - (1 + \epsilon) \sum_{i=1}^d g_i^2 = 0$. (9)

This further implies

$\lambda = \left( \epsilon \sum_{i=1}^d g_i^2 + \sum_{i=k+1}^d g_{(i)}^2 \right)^{-1} \left( \sum_{i=k+1}^d |g_{(i)}| \right)$. (10)

We then use the constraint $\lambda |g_{(k+1)}| \le 1$ and get

$|g_{(k+1)}| \left( \sum_{i=k+1}^d |g_{(i)}| \right) \le \epsilon \sum_{i=1}^d g_i^2 + \sum_{i=k+1}^d g_{(i)}^2$. (11)

It follows that we should find the smallest $k$ which satisfies the above inequality. Based on this reasoning, we obtain the closed-form solution for $p_i$ in Algorithm 2.

Algorithm 2 Closed-form solution
1: Find the smallest $k$ such that inequality (11) is true, and let $S_k$ be the set of coordinates with the top-$k$ largest magnitudes $|g_i|$.
2: Set the probability vector $p$ by
$p_i = 1$ if $i \in S_k$; $p_i = \left( \epsilon \sum_{j=1}^d g_j^2 + \sum_{j=k+1}^d g_{(j)}^2 \right)^{-1} |g_i| \left( \sum_{j=k+1}^d |g_{(j)}| \right)$ if $i \notin S_k$.

In practice, Algorithm 2 requires partial sorting of the gradient magnitudes to find $S_k$, which can be computationally expensive. We therefore developed a greedy algorithm for approximately solving the problem. We pre-define a sparsity parameter $\kappa \in (0, 1)$, meaning that we aim to find $p$ satisfying $\sum_i p_i / d \approx \kappa$. Loosely speaking, we want to initially set $\tilde{p}_i = \kappa d |g_i| / \sum_j |g_j|$, which sums to $\sum_i \tilde{p}_i = \kappa d$, meeting our requirement on $\kappa$. However, after the truncation operation $p_i = \min(\tilde{p}_i, 1)$, the expected nonzero density will be less than $\kappa$.
Now, we can use an iterative procedure: in each subsequent iteration, we fix the set $\{p_i : p_i = 1\}$ and rescale the remaining values, as summarized in Algorithm 3. This algorithm is much easier to implement and computationally more efficient on parallel computing architectures. Since the operations mainly consist of accumulations, multiplications, and minimizations, they can easily be accelerated on graphics processing units (GPUs) or other hardware supporting single-instruction-multiple-data (SIMD) execution.

Algorithm 3 Greedy algorithm
1: Input $g \in R^d$, $\kappa \in (0, 1)$. Initialize $j = 0$ and $p_i^0 = \min(\kappa d |g_i| / \sum_{i'} |g_{i'}|, 1)$ for all $i$.
2: repeat
3: Identify the active set $I = \{1 \le i \le d \mid p_i^j \ne 1\}$ and compute $c = (\kappa d - d + |I|) / \sum_{i \in I} p_i^j$.
4: Recalibrate the values by $p_i^{j+1} = \min(c\, p_i^j, 1)$; set $j = j + 1$.
5: until $c \le 1$ or $j$ reaches the maximum number of iterations. Return $p = p^j$.

2.3 Coding strategy

Once we have computed a sparsified gradient vector $Q(g)$, we need to pack the resulting vector into a message for transmission. Here we apply a hybrid strategy for encoding $Q(g)$. Suppose that computers represent a floating-point scalar using $b$ bits, with negligible loss of precision. We use two vectors $Q_A(g)$ and $Q_B(g)$ to represent the non-zero coordinates: one for coordinates $i \in S_k$, and the other for coordinates $i \notin S_k$. The vector $Q_A(g)$ represents $\{g_i : i \in S_k\}$, where each item needs $\log d$ bits for the coordinate index and $b$ bits for the value $g_i / p_i$. The vector $Q_B(g)$ represents $\{g_i : i \notin S_k\}$; since in this case $p_i = \lambda |g_i|$, the quantized value is $Q(g)_i = g_i / p_i = \mathrm{sign}(g_i) / \lambda$ for all $i \notin S_k$. Therefore, to represent $Q_B(g)$, we only need one floating-point scalar $1/\lambda$, plus the non-zero coordinate indices $i$ and their signs $\mathrm{sign}(g_i)$.
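A minimal NumPy sketch of Algorithm 3 follows (the function name and the small demo are ours; as in the experiments later, a couple of recalibration passes already give a good approximation):

```python
import numpy as np

def greedy_probabilities(g, kappa, max_iter=2):
    """Algorithm 3: approximate the optimal sampling probabilities for a
    target expected nonzero density kappa, without sorting the gradient."""
    d = g.size
    p = np.minimum(kappa * d * np.abs(g) / np.sum(np.abs(g)), 1.0)
    for _ in range(max_iter):
        active = p < 1.0                 # coordinates not yet saturated at 1
        if not np.any(active):
            break
        # Rescale the active coordinates so the total sums back to kappa*d:
        # the (d - |I|) saturated coordinates already contribute 1 each.
        c = (kappa * d - d + np.count_nonzero(active)) / np.sum(p[active])
        p = np.minimum(c * p, 1.0)       # recalibrate (saturated ones stay 1)
        if c <= 1.0:                     # density target already met; stop
            break
    return p

g = np.array([10.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
p = greedy_probabilities(g, kappa=0.5)
print(p)   # the two largest-magnitude coordinates are kept with probability 1
```

On this example the returned probabilities sum to $\kappa d = 4$ after two recalibration passes.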
Here we give an example of the format:

$Q(g): \left[ \frac{g_1}{p_1}, 0, 0, \frac{g_4}{p_4}, \frac{g_5}{p_5}, \frac{g_6}{p_6}, 0, \cdots, 0 \right]$, where $g_1, g_5 \in S_k$, $g_4, g_6 \notin S_k$, $g_4 < 0$, $g_6 > 0$,
$Q_A(g): \left[ 1, \frac{g_1}{p_1}, 5, \frac{g_5}{p_5} \right]$, $\quad Q_B(g): \left[ 4, -1/\lambda, 6, 1/\lambda, \cdots \right]$. (12)

Moreover, we can also represent the indices of $Q_A(g)$ and the vector $Q_B(g)$ using a dense vector $\tilde{q} \in \{0, \pm 1, 2\}^d$, where each component $\tilde{q}_i$ is defined as $\tilde{q}_i = \lambda Q(g)_i = \mathrm{sign}(g_i)$ when $i \notin S_k$, $\tilde{q}_i = 2$ if $i \in S_k$, and $\tilde{q}_i = 0$ otherwise. Using standard entropy coding, $\tilde{q}$ requires at most $\sum_{\ell=-1}^{2} d_\ell \log_2(d / d_\ell) \le 2d$ bits to represent, where $d_\ell$ counts the entries of $\tilde{q}$ equal to $\ell$.

3 Theoretical guarantees on sparsity

In this section we analyze the expected sparsity of $Q(g)$, which equals $\sum_{i=1}^d p_i$. In particular, we show that when the distribution of gradient magnitudes is highly skewed, there is a significant gain in applying the proposed sparsification strategy. First, we define the following notion of approximate sparsity of the magnitudes of the coordinates of $g$:

Definition 2. A vector $g \in R^d$ is $(\rho, s)$-approximately sparse if there exists a subset $S \subset [d]$ such that $|S| = s$ and $\|g_{S^c}\|_1 \le \rho \|g_S\|_1$, where $S^c$ is the complement of $S$.

The notion of $(\rho, s)$-approximate sparsity is inspired by the restricted eigenvalue condition used in high-dimensional statistics [5]. It measures how well the signal of a vector is concentrated on a small subset of the coordinates of size $s$. As we will see later, the quantity $(1 + \rho)s$ plays an important role in establishing the expected sparsity bound.
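For intuition, the smallest $\rho$ for which a vector is $(\rho, s)$-approximately sparse is easy to compute, since the best choice of $S$ is the $s$ coordinates of largest magnitude; a small helper (ours) illustrates Definition 2:

```python
import numpy as np

def best_rho(g, s):
    """Smallest rho such that g is (rho, s)-approximately sparse
    (Definition 2): choose S as the s largest-magnitude coordinates."""
    a = np.sort(np.abs(g))[::-1]         # magnitudes, descending
    return a[s:].sum() / a[:s].sum()     # ||g_{S^c}||_1 / ||g_S||_1

g = np.array([8.0, 4.0, 0.5, 0.25, 0.25])
print(best_rho(g, 2))   # 1/12: the signal concentrates on two coordinates
```

Here $(1 + \rho)s \approx 2.17 \ll d = 5$, the regime in which the sparsification bound below is most favorable.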
Note that we can always take $s = d$ and $\rho = 0$, so that $(\rho, s)$ satisfies the above definition with $(1 + \rho)s \le d$. If the distribution of magnitudes in $g$ is highly skewed, we would expect the existence of $(\rho, s)$ such that $(1 + \rho)s \ll d$. For example, when $g$ is exactly $s$-sparse, we can choose $\rho = 0$ and the quantity $(1 + \rho)s$ reduces to $s$, which can be significantly smaller than $d$.

Lemma 3. Suppose the gradient $g \in R^d$ of the loss function is $(\rho, s)$-approximately sparse as in Definition 2. Then we can find a sparsification $Q(g)$ with $\epsilon = \rho$ in (6) (that is, the variance of $Q(g)$ is increased by a factor of no more than $1 + \rho$), and the expected sparsity of $Q(g)$ can be upper bounded by $E[\|Q(g)\|_0] \le (1 + \rho)s$.

Proof. Based on Definition 2, we can choose $\epsilon = \rho$ and $S_k = S$ satisfying the condition that defines $k$; thus

$E[\|Q(g)\|_0] = \sum_{i=1}^d p_i = \sum_{i \in S_k} p_i + \sum_{i \notin S_k} p_i = s + \sum_{i \notin S_k} |g_i| \cdot \frac{\sum_{j=k+1}^d |g_{(j)}|}{\rho \sum_{j=1}^k g_{(j)}^2 + (1 + \rho) \sum_{j=k+1}^d g_{(j)}^2}$
$= s + \frac{\|g_{S_k^c}\|_1^2}{\rho \|g_{S_k}\|_2^2 + (1 + \rho) \|g_{S_k^c}\|_2^2} \le s + \frac{\rho^2 s \|g_{S_k}\|_2^2}{\rho \|g_{S_k}\|_2^2 + (1 + \rho) \|g_{S_k^c}\|_2^2} \le (1 + \rho)s$, (13)

where the first inequality uses $\|g_{S_k^c}\|_1 \le \rho \|g_{S_k}\|_1 \le \rho \sqrt{s}\, \|g_{S_k}\|_2$, which completes the proof.

Remark 1.
Lemma 3 indicates that the variance after sparsification increases only by a factor of $(1 + \rho)$, while in expectation we only need to communicate a $(1 + \rho)s$-sparse vector. To achieve the same optimization accuracy, we may need to increase the number of iterations by a factor of up to $(1 + \rho)$, and the overall number of floating-point numbers communicated is reduced by a factor of up to $(1 + \rho)^2 s / d$.

The lemma above bounds the number of floating-point numbers that need to be communicated per iteration under the proposed sparsification strategy. As shown in Section 2.3, we only need one floating-point number to encode the gradient values in $S_k^c$, so there is a further reduction in communication when counting the total number of bits transmitted; this is characterized by the theorem below. The detailed proof is given in the full version of this paper (https://arxiv.org/abs/1710.09854).

Theorem 4. If the gradient $g \in R^d$ of the loss function is $(\rho, s)$-approximately sparse as in Definition 2, and a floating-point number costs $b$ bits, then the coding length of $Q(g)$ in Lemma 3 can be bounded by $s(b + \log_2 d) + \min(\rho s \log_2 d, d) + b$.

Remark 2. The coding length of the original gradient vector $g$ is $db$; accounting for the slightly increased number of iterations needed to reach the same optimization accuracy, the total communication cost is reduced by a factor of at least $(1 + \rho)((s + 1)b + \log_2 d)/(db)$.

4 Experiments

Figure 1: SGD-type comparison between gradient sparsification (GSpar) and random sparsification with uniform sampling (UniSp).

In this section we conduct experiments to validate the effectiveness and efficiency of the proposed sparsification technique. We use $\ell_2$-regularized logistic regression as an example of convex problems, and take convolutional neural networks as an example of non-convex problems.
The sparsification technique shows strong improvement over the uniform-sampling baseline: the iteration complexity is only slightly increased while the communication cost is strongly reduced. Moreover, we also conduct asynchronous parallel experiments on a shared-memory architecture; in particular, these experiments show that the proposed sparsification technique significantly reduces the conflicts among multiple threads and dramatically improves performance. In all experiments, the probability vector $p$ is calculated by Algorithm 3 with the maximum number of iterations set to 2, which already generates a high-quality approximation of the optimal $p$ vector.

[Figure 1 panels omitted: suboptimality $f(w) - f(w^*)$ versus data passes for the baseline, GSpar, and UniSp at matched sparsity levels (spa $\in \{0.5, 0.17, 0.056\}$), with the measured gradient variance of each method in the legend.]

Figure 2: SVRG-type comparison between
gradient sparsification (GSpar) and random sparsification with uniform sampling (UniSp).

We first validate the sparsification technique on the $\ell_2$-regularized logistic regression problem, using SGD and SVRG respectively: $f(w) = \frac{1}{N} \sum_n \log\left(1 + \exp(-a_n^\top w\, b_n)\right) + \lambda_2 \|w\|_2^2$, where $a_n \in R^d$, $b_n \in \{-1, 1\}$. The experiments are conducted on synthetic data for the convenience of controlling the data sparsity. The mini-batch size is set to 8 by default unless otherwise specified. We simulated $M = 4$ machines, where one machine is both a worker and the master that aggregates the stochastic gradients received from the other workers. We compare our algorithm with a uniform sampling method as the baseline, where each element of the probability vector is set to $p_i = \kappa$ and a similar sparsification is applied; in this method, the sparsified vector has a nonzero density of $\kappa$ in expectation. The data set $\{a_n\}_{n=1}^N$ is generated as follows:

dense data: $\bar{a}_{ni} \sim N(0, 1)$, $\forall i \in [d], n \in [N]$; $\quad \bar{B} \sim \mathrm{Uniform}[0, 1]^d$;
sparsify: $\bar{B}_i \leftarrow C_1 \bar{B}_i$ if $\bar{B}_i \le C_2$, $\forall i \in [d]$;
apply: $a_n \leftarrow \bar{a}_n \odot \bar{B}$;
label: $\bar{w} \sim N(0, I)$, $\quad b_n \leftarrow \mathrm{sign}(\bar{a}_n^\top \bar{w})$,

where $\odot$ is the element-wise multiplication. In the steps above, the first line is a standard data-sampling procedure from a multivariate Gaussian distribution; the second line generates a magnitude vector $\bar{B}$, which is then sparsified by shrinking the elements smaller than a threshold $C_2$ by a factor of $C_1$; the third line applies the magnitude vector to the data set; and the fourth line generates a weight vector $\bar{w}$ and labels $b_n$ based on the signs of the products of the data and the weights.
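The generation procedure above can be sketched as follows (the function name, seeding, and the small sizes in the demo are ours):

```python
import numpy as np

def make_synthetic(N, d, C1, C2, seed=0):
    """Synthetic data with controllable gradient sparsity: shrinking the
    magnitude entries below C2 by a factor C1 makes most feature columns,
    and hence most gradient coordinates, small."""
    rng = np.random.default_rng(seed)
    a_bar = rng.standard_normal((N, d))     # dense Gaussian features
    B = rng.random(d)                       # magnitude vector ~ Uniform[0, 1]
    B[B <= C2] *= C1                        # sparsify: shrink small magnitudes
    A = a_bar * B                           # element-wise column scaling
    w_bar = rng.standard_normal(d)          # ground-truth weight vector
    b = np.sign(a_bar @ w_bar)              # labels from the dense features
    return A, b

A, b = make_synthetic(N=64, d=128, C1=0.9, C2=0.25)
print(A.shape)
```

Smaller $C_1$ and $C_2$ make the scaled columns of `A`, and hence the gradients, more concentrated on a few coordinates.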
We should note that the parameters $C_1$ and $C_2$ give us an easy way to control the sparsity of the data points and of the gradients: the smaller these two constants are, the sparser the gradients are. The gradient of a linear model on this data set is then expected to be approximately sparse in the sense of Definition 2, with $s$ on the order of $(1 - C_2)d$ and $\rho$ determined by $C_1$ and $C_2$, and the gradient of the regularization term need not be communicated. We set the data set size to $N = 1024$ and the dimension to $d = 2048$. The step sizes are fine-tuned for each case; in our findings, the empirically optimal step size is inversely related to the gradient variance, as the theoretical analysis suggests.

In Figures 1 and 2, from the top row to the bottom row, the $\ell_2$-regularization parameter $\lambda$ is set to $1/(10N)$ and $1/N$; in each row, from the first column to the last column, $C_2$ is set to $4^{-1}, 4^{-2}, 4^{-3}$. In these figures, our algorithm is denoted by 'GSpar', the uniform sampling method is denoted by 'UniSp', and the SGD/SVRG algorithm with non-sparsified communication is denoted by 'baseline', indicating the original distributed optimization algorithm. The x-axis shows the number of data passes, and the y-axis shows the suboptimality of the objective function, $f(w_t) - \min_w f(w)$. For these experiments, we report the variance of the sparsified SGD gradient under the label 'var' in Figure 1, and 'spa' in all figures denotes the nonzero density $\kappa$ in Algorithm 3.
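Under our reading of the figure labels, 'var' is the second-moment inflation factor $\sum_i g_i^2/p_i \big/ \sum_i g_i^2$ from (5) and 'spa' is the expected density $\frac{1}{d}\sum_i p_i$; uniform sampling at density $\kappa$ then always gives var $= 1/\kappa$ (e.g. var $= 2$ at spa $= 0.5$, matching the UniSp labels), while magnitude-proportional probabilities give a smaller inflation at a similar density. A small sketch (function name ours):

```python
import numpy as np

def var_and_spa(g, p):
    """'var': second-moment inflation E||Q(g)||^2 / ||g||^2, from Eq. (5);
    'spa': expected nonzero density E||Q(g)||_0 / d."""
    return np.sum(g**2 / p) / np.sum(g**2), np.mean(p)

g = np.array([4.0, 2.0, 1.0, 0.5])
kappa = 0.5
uniform = np.full(g.size, kappa)                                  # UniSp
gspar = np.minimum(kappa * g.size * np.abs(g) / np.abs(g).sum(), 1.0)
print(var_and_spa(g, uniform))   # (2.0, 0.5): inflation exactly 1/kappa
print(var_and_spa(g, gspar))     # smaller 'var' at a similar 'spa'
```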
We observe the theoretical complexity reduction against the baseline in terms of communication: it can be inferred as var × spa from the labels in Figures 1 and 2, where $C_1 = 0.9$; the remaining figures are deferred to the full version due to space limitations.

[Figure 2 panels omitted: suboptimality versus data passes for the baseline, GSpar, and UniSp at spa $\in \{0.5, 0.17, 0.055\}$.]

From Figure 1, we observe that results on sparser data yield smaller gradient variance than results on denser data. Compared to uniform sampling, our algorithm generates gradients with less variance and converges much faster. This observation is consistent with the objective of our algorithm, which is to minimize the gradient variance at a given sparsity. The convergence slows down linearly w.r.t.
the increase of variance. The results on SVRG show a better speed-up: although our algorithm increases the variance of the gradients, the convergence rate degrades only slightly.

We compared the gradient sparsification method with the quantized stochastic gradient descent (QSGD) algorithm of [2]. The results are shown in Figure 3. The data are generated as before, with both strong and weak sparsity settings. From the top row to the bottom row, the $\ell_2$-regularization parameter $\lambda$ is set to $1/(10N)$ and $1/N$; in each row, from the first column to the last column, $C_2$ is set to $4^{-1}, 4^{-2}$. The step sizes are set to be the same for both methods, after fine-tuning, for a fair comparison. In this comparison, the x-axis measures the overall coding length communicated by each algorithm. For QSGD, the communication cost per element is linearly related to $b$, the number of bits used per floating-point number; QSGD($b$) denotes the QSGD algorithm with bit number $b$, and the average number of bits required per element is given in the labels. We also tried to compare with the gradient-residual accumulation approaches [1], which unfortunately failed in our experiments: since the gradient is relatively sparse, many small coordinates can be delayed indefinitely, resulting in a large gradient bias that causes divergence on convex problems. From Figure 3, we observe that the proposed sparsification approach is at least comparable to QSGD, and significantly outperforms QSGD when the gradient sparsity is stronger; this accords with our analysis that approximate gradient sparsity yields a greater speed-up.

Figure 3: Comparison of the sparsified SGD with QSGD.

4.1 Experiments on deep learning

This section conducts experiments on non-convex problems.
We consider convolutional neural networks (CNNs) on the CIFAR-10 dataset under different settings. The networks consist of three convolutional layers (3 × 3), two pooling layers (2 × 2), and one 256-dimensional fully connected layer. Each convolutional layer is followed by a batch-normalization layer, and the number of channels per convolutional layer is chosen from {24, 32, 48, 64}. We use the ADAM optimization algorithm [15] with an initial step size of 0.02.

In Figure 4, we plot the objective function against the computational complexity measured by the number of epochs (1 epoch equals 1 pass over all training samples). We also plot the convergence with respect to the communication cost, which is the product of the computation and the sparsification parameter κ. The experiment in each setting is repeated 4 times, and we report the average objective function values.

Figure 4: Comparison of a 3-layer CNN with 64 channels (top) and 48 channels (bottom) on CIFAR-10. (Y-axis: f(w_t); x-axes: computations and communications; curves for sparsity ratios rho ∈ {1.0, 0.07, 0.045, 0.015, 0.004, 0.001}.)

The results show that for this non-convex problem, gradient sparsification slows down training only slightly. In particular, the optimization algorithm converges even when the sparsity ratio is as small as κ = 0.004, and the communication cost is significantly reduced in this setting. These experiments also show that the optimization of neural networks is less sensitive to gradient noise, and noise within a certain range may even help the algorithm escape bad local minima [13].

4.2 Experiments on asynchronous parallel SGD

In this section, we study parallel implementations of SGD on a single-machine multi-core architecture. We employ the support vector machine for binary classification, with the loss function

f(w) = (1/N) Σ_n max(1 − b_n a_n^⊤ w, 0) + λ2 ‖w‖_2^2,   a_n ∈ R^d, b_n ∈ {−1, 1},

where the labels are generated as b_n ← sign(a_n^⊤ w̄ + σ) with σ ∼ N(0, 1). We implemented shared-memory multi-thread SGD, where each thread performs a locked read, which may block other threads from writing to the same coordinate; the locks are implemented using compare-and-swap operations. To improve the speed of the algorithm, we also employ several engineering tricks.
First, we observe from Proposition 1 that, for all p_i < 1, g_i/p_i = sign(g_i)/λ; therefore we only need to assign constant values to these variables, without applying floating-point division operations. Another costly operation is pseudo-random number generation in the sampling procedure; we therefore generate a large array of pseudo-random numbers in [0, 1] in advance and iteratively read them during training, without calling a random number generator. The data are generated by first generating dense data, sparsifying them, and then generating the corresponding labels b_n as above:

ā_{ni} ∼ N(0, 1), ∀i ∈ [d], n ∈ [N],   w̄ ∼ Uniform[−0.5, 0.5]^d,   B̄ ∼ Uniform[0, 1]^d,
B̄_i ← C1 B̄_i if B̄_i ≤ C2, ∀i ∈ [d],   a_n ← ā_n ⊙ B̄.

We set the dataset size to N = 51200 and the dimension to d = 256, with C1 = 0.01 and C2 = 0.9. The regularization parameter λ2 is denoted by reg, the number of threads (workers) by W, and the learning rate by lrt. The number of workers is set to 16 or 32, the regularization parameter is chosen from {0.5, 0.1, 0.05}, and the learning rate is chosen from {0.5, 0.25, 0.05, 0.025}. The convergence of the objective value against running time (in milliseconds) is plotted in Figure 5; the remaining figures are in the full version.

From Figure 5, we observe that with gradient sparsification, the conflicts among multiple threads reading and writing the same coordinate are significantly reduced, and therefore training is significantly faster. Comparing across settings, we also observe that the sparsification technique works better when more threads are available, since lock conflicts occur more frequently as the number of threads grows.

Figure 5: Loss functions by a multi-thread SVM.
X-axis: time in milliseconds; Y-axis: log2(f(w_t)). (Panels show W = 16 with reg ∈ {0.5, 0.1, 0.05} and various lrt; curves correspond to rho ∈ {1/1, 1/2, 1/3, 1/4}.)

5 Conclusions

In this paper, we propose a gradient sparsification technique to reduce the communication cost of large-scale distributed machine learning. We propose a convex optimization formulation that minimizes the coding length of stochastic gradients given a variance budget, which monotonically depends on the computational complexity, together with efficient algorithms and a theoretical guarantee. Comprehensive experiments on distributed and parallel optimization of multiple models show that our algorithm can effectively reduce the communication cost during training and the conflicts among multiple threads.

Acknowledgments

Ji Liu is in part supported by NSF CCF1718513, an IBM faculty award, and an NEC fellowship.

References

[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.

[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.

[3] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, pages 1756–1764, 2015.

[4] Léon Bottou. Large-scale machine learning with stochastic gradient descent.
In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[5] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

[6] Jiecao Chen, He Sun, David Woodruff, and Qin Zhang. Communication-optimal distributed clustering. In Advances in Neural Information Processing Systems, pages 3727–3735, 2016.

[7] Shang-Tse Chen, Maria-Florina Balcan, and Duen Horng Chau. Communication efficient distributed agnostic boosting. In Artificial Intelligence and Statistics, pages 1299–1307, 2016.

[8] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.

[9] Christopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of hogwild-style algorithms. In Advances in Neural Information Processing Systems, pages 2674–2682, 2015.

[10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[11] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[12] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.

[13] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently.
In International Conference on Machine Learning, pages 1724–1732, 2017.

[14] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.

[16] Jason D Lee, Qiang Liu, Yuekai Sun, and Jonathan E Taylor. Communication-efficient sparse regression. Journal of Machine Learning Research, 18(5):1–30, 2017.

[17] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation, pages 583–598, 2014.

[18] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661–670. ACM, 2014.

[19] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[20] Yingyu Liang, Maria-Florina F Balcan, Vandana Kanchanapally, and David Woodruff. Improved distributed principal component analysis. In Advances in Neural Information Processing Systems, pages 3113–3121, 2014.

[21] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.

[22] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar.
An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.

[23] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[24] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming: Series A and B, 162(1-2):83–112, 2017.

[25] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[26] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[27] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.

[28] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337, 2017.

[29] Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337, 2017.

[30] John N Tsitsiklis and Zhi-Quan Luo. Communication complexity of convex optimization. Journal of Complexity, 3(3):231–243, 1987.

[31] Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox.
In Conference on Learning Theory, pages 1882–1919, 2017.

[32] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

[33] Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.

[34] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043, 2017.

[35] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, page 116, 2004.

[36] Yuchen Zhang and Xiao Lin. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370, 2015.

[37] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.