{"title": "LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5050, "page_last": 5060, "abstract": "This paper presents a new class of gradient methods for distributed \nmachine learning that adaptively skip the gradient calculations to \nlearn with reduced communication and computation. Simple rules \nare designed to detect slowly-varying gradients and, therefore, \ntrigger the reuse of outdated gradients. The resultant gradient-based \nalgorithms are termed Lazily Aggregated Gradient --- justifying our \nacronym LAG used henceforth. Theoretically, the merits of \nthis contribution are: i) the convergence rate is the same as batch \ngradient descent in strongly-convex, convex, and nonconvex cases; \nand, ii) if the distributed datasets are heterogeneous (quantified by \ncertain measurable constants), the communication rounds needed \nto achieve a targeted accuracy are reduced thanks to the adaptive \nreuse of lagged gradients. Numerical experiments on both \nsynthetic and real data corroborate a significant communication \nreduction compared to alternatives.", "full_text": "LAG: Lazily Aggregated Gradient for\n\nCommunication-Ef\ufb01cient Distributed Learning\n\nTianyi Chen\u22c6\n\nGeorgios B. Giannakis\u22c6\n\nTao Suny;(cid:3)\n\nWotao Yin(cid:3)\n\nyNational University of Defense Technology, Changsha, Hunan 410073, China\n\n\u22c6University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA\n(cid:3)University of California - Los Angeles, Los Angeles, CA 90095, USA\n\n{chen3827,georgios@umn.edu} nudtsuntao@163.com wotaoyin@math.ucla.edu\n\nAbstract\n\nThis paper presents a new class of gradient methods for distributed machine learn-\ning that adaptively skip the gradient calculations to learn with reduced commu-\nnication and computation. 
Simple rules are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients. The resultant gradient-based algorithms are termed Lazily Aggregated Gradient — justifying our acronym LAG used henceforth. Theoretically, the merits of this contribution are: i) the convergence rate is the same as batch gradient descent in strongly-convex, convex, and nonconvex cases; and, ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive reuse of lagged gradients. Numerical experiments on both synthetic and real data corroborate a significant communication reduction compared to alternatives.

1 Introduction

In this paper, we develop communication-efficient algorithms to solve the following problem

  min_{θ∈ℝ^d} L(θ)  with  L(θ) := Σ_{m∈M} L_m(θ)   (1)

where θ ∈ ℝ^d is the unknown vector, L and {L_m, m ∈ M} are smooth (but not necessarily convex) functions with M := {1, ..., M}.
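To make the finite-sum structure in (1) concrete, the following sketch instantiates it with a least-squares loss split across M workers; the data, dimensions, and the choice of squared loss are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Sketch of problem (1): L(theta) = sum_m L_m(theta), where worker m holds its own
# samples and L_m is a local least-squares loss (an assumption for illustration).
rng = np.random.default_rng(0)
M, d = 3, 4                                    # number of workers, model dimension
data = [(rng.standard_normal((5, d)), rng.standard_normal(5)) for _ in range(M)]

def L_m(theta, m):
    X, y = data[m]
    return 0.5 * np.sum((X @ theta - y) ** 2)  # local loss at worker m

def grad_L_m(theta, m):
    X, y = data[m]
    return X.T @ (X @ theta - y)               # local gradient at worker m

def L(theta):
    return sum(L_m(theta, m) for m in range(M))  # the global objective in (1)

theta = rng.standard_normal(d)
full_grad = sum(grad_L_m(theta, m) for m in range(M))  # gradient the server aggregates
```

The server never sees the raw data; it only ever aggregates the workers' gradients, which is the communication pattern the rest of the paper optimizes.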
Problem (1) naturally arises in a number of areas, such as multi-agent optimization [1], distributed signal processing [2], and distributed machine learning [3]. Considering the distributed machine learning paradigm, each L_m is also a sum of functions, e.g., L_m(θ) := Σ_{n∈N_m} ℓ_n(θ), where ℓ_n is the loss function (e.g., the square or the logistic loss) with respect to the vector θ (describing the model) evaluated at the training sample x_n; that is, ℓ_n(θ) := ℓ(θ; x_n). While machine learning tasks are traditionally carried out at a single server, for datasets with massive samples {x_n}, running gradient-based iterative algorithms at a single server can be prohibitively slow; e.g., the server needs to sequentially compute gradient components given limited processors. A simple yet popular solution in recent years is to parallelize the training across multiple computing units (a.k.a. workers) [3]. Specifically, assuming batch samples are distributedly stored in a total of M workers, with worker m ∈ M associated with samples {x_n, n ∈ N_m}, a globally shared model θ will be updated at the central server by aggregating gradients computed by the workers. Due to bandwidth and privacy concerns, each worker m will not upload its data {x_n, n ∈ N_m} to the server, thus the learning task needs to be performed by iteratively communicating with the server. We are particularly interested in scenarios where communication between the central server and the local workers is costly, as is the case in the Federated Learning setting [4, 5], cloud-edge AI systems [6], and, more broadly, the emerging Internet-of-Things paradigm [7]. In those cases, communication latency is the bottleneck of overall performance.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
More precisely, the communication latency is a result of initiating communication links, and of queueing and propagating the message. For sending small messages, e.g., the d-dimensional model θ or the aggregated gradient, this latency dominates the message-size-dependent transmission latency. Therefore, it is important to reduce the number of communication rounds, even more so than the bits per round. In short, our goal is to find the model parameter θ that minimizes (1) using as little communication overhead as possible.

1.1 Prior art

To put our work in context, we review prior contributions, which we group in two categories.

Large-scale machine learning. Solving (1) at a single server has been extensively studied for large-scale learning tasks, where the "workhorse approach" is the simple yet efficient stochastic gradient descent (SGD) [8, 9]. Despite its low per-iteration complexity, the inherent variance prevents SGD from achieving fast convergence. Recent advances include leveraging so-termed variance reduction techniques to achieve both low complexity and fast convergence [10-12]. For learning beyond a single server, distributed parallel machine learning is an attractive solution for tackling large-scale learning tasks, where the parameter server architecture is the most commonly used one [3, 13]. Different from the single-server case, parallel implementation of batch gradient descent (GD) is a popular choice, since SGD, with its low complexity per iteration, requires a large number of iterations and thus communication rounds [14]. For traditional parallel learning algorithms, however, latency, bandwidth limits, and unexpected drains on resources that delay the update of even a single worker will slow down the entire system's operation.
Recent research efforts in this line have been centered on understanding asynchronous-parallel algorithms that speed up machine learning by eliminating costly synchronization; e.g., [15-20]. All these approaches reduce either the computational complexity or the run time, but they do not save communication.

Communication-efficient learning. Going beyond single-server learning, the high communication overhead becomes the bottleneck of overall system performance [14]. Communication-efficient learning algorithms have thus gained popularity [21, 22]. Distributed learning approaches have been developed based on quantized (gradient) information, e.g., [23-26], but they only reduce the required bandwidth per communication round, not the number of rounds. For machine learning tasks where the loss function is convex and its conjugate dual is expressible, dual coordinate ascent-based approaches have been demonstrated to yield impressive empirical performance [5, 27, 28]. But these algorithms run in a double-loop manner, and their communication reduction has not been formally quantified. To reduce communication by accelerating convergence, approaches leveraging (inexact) second-order information have been studied in [29, 30]. Roughly speaking, the algorithms in [5, 27-30] reduce communication by increasing local computation (relative to GD), while our method does not increase local computation. In settings different from the one considered in this paper, communication-efficient approaches have recently been studied with triggered communication protocols [31, 32]. Apart from convergence guarantees, however, no theoretical justification for communication reduction has been established in [31].
While a sublinear convergence rate can be achieved by the algorithms in [32], the proposed gradient selection rule is nonadaptive and requires double-loop iterations.

1.2 Our contributions

Before introducing our approach, we revisit the popular GD method for (1) in the setting of one parameter server and M workers: at iteration k, the server broadcasts the current model θ^k to all the workers; every worker m ∈ M computes ∇L_m(θ^k) and uploads it to the server; and once the server receives gradients from all workers, it updates the model parameters via

  GD iteration:  θ^{k+1} = θ^k − α ∇^k_GD  with  ∇^k_GD := Σ_{m∈M} ∇L_m(θ^k)   (2)

where α is a stepsize, and ∇^k_GD is an aggregated gradient that summarizes the model change. To implement (2), the server has to communicate with all workers to obtain fresh {∇L_m(θ^k)}. In this context, the present paper puts forward a new batch gradient method (as simple as GD) that can skip communication at certain rounds, which justifies the term Lazily Aggregated Gradient (LAG).

Algorithm | Comm. PS→WK m   | Comm. WK m→PS      | Comput. PS | Comput. WK m         | Memory PS           | Memory WK m
GD        | θ^k             | ∇L_m(θ^k)          | (2)        | ∇L_m(θ^k)            | --                  | --
LAG-PS    | θ^k, if m∈M^k   | δ∇^k_m, if m∈M^k   | (4), (12b) | ∇L_m(θ^k), if m∈M^k  | θ^k, ∇^k, {θ̂^k_m}  | ∇L_m(θ̂^k_m)
LAG-WK    | θ^k             | δ∇^k_m, if m∈M^k   | (4)        | ∇L_m(θ^k), (12a)     | θ^k, ∇^k            | ∇L_m(θ̂^k_m)

Table 1: A comparison of communication, computation and memory requirements. PS denotes the parameter server, WK denotes the worker, PS→WK m is the communication link from the server to the worker m, and WK m→PS is the communication link from the worker m to the server.
With its derivations deferred to Section 2, LAG resembles (2), and is given by

  LAG iteration:  θ^{k+1} = θ^k − α ∇^k  with  ∇^k := Σ_{m∈M} ∇L_m(θ̂^k_m)   (3)

where each ∇L_m(θ̂^k_m) is either ∇L_m(θ^k), when θ̂^k_m = θ^k, or an outdated gradient that has been computed using an old copy θ̂^k_m ≠ θ^k. Instead of requesting a fresh gradient from every worker as in (2), the twist is to obtain ∇^k by refining the previous aggregated gradient ∇^{k−1}; that is, using only the new gradients from the selected workers in M^k, while reusing the outdated gradients from the rest of the workers. Therefore, with θ̂^k_m := θ^k, ∀m ∈ M^k, and θ̂^k_m := θ̂^{k−1}_m, ∀m ∉ M^k, LAG in (3) is equivalent to

  LAG iteration:  θ^{k+1} = θ^k − α ∇^k  with  ∇^k = ∇^{k−1} + Σ_{m∈M^k} δ∇^k_m   (4)

where δ∇^k_m := ∇L_m(θ^k) − ∇L_m(θ̂^{k−1}_m) is the difference between two evaluations of ∇L_m, at the current iterate θ^k and at the old copy θ̂^{k−1}_m. If ∇^{k−1} is stored at the server, this simple modification scales down the per-iteration communication rounds from GD's M to LAG's |M^k|.

We develop two different rules to select M^k. The first rule is adopted by the parameter server (PS), and the second one by every worker (WK).
At iteration k,

LAG-PS: the server determines M^k and sends θ^k to the workers in M^k; each worker m ∈ M^k computes ∇L_m(θ^k) and uploads δ∇^k_m; each worker m ∉ M^k does nothing; the server updates via (4);

LAG-WK: the server broadcasts θ^k to all workers; every worker computes ∇L_m(θ^k) and checks if it belongs to M^k; only the workers in M^k upload δ∇^k_m; the server updates via (4).

See a comparison of the two LAG variants with GD in Table 1.

Naively reusing outdated gradients, while saving communication per iteration, can increase the total number of iterations. To keep this number in control, we judiciously design our simple trigger rules so that LAG can: i) achieve the same order of convergence rates (thus iteration complexities) as batch GD under strongly-convex, convex, and nonconvex smooth cases; and, ii) require reduced communication to achieve a targeted learning accuracy, when the distributed datasets are heterogeneous (measured by a certain quantity specified later). In certain learning settings, LAG requires only O(1/M) of the communication of GD. Empirically, we found that LAG can reduce the communication required by GD and other distributed learning methods by an order of magnitude.

Figure 1: LAG in a parameter server setup.

Notation. Bold lowercase letters denote column vectors, which are transposed by (·)^⊤, and ‖x‖ denotes the ℓ2-norm of x. Inequalities for vectors, x > 0, are defined entrywise.

2 LAG: Lazily Aggregated Gradient Approach

In this section, we formally develop our LAG method, and present the intuition and basic principles behind its design.
The original idea of LAG comes from a simple rewriting of the GD iteration (2) as

  θ^{k+1} = θ^k − α Σ_{m∈M} ∇L_m(θ^{k−1}) − α Σ_{m∈M} (∇L_m(θ^k) − ∇L_m(θ^{k−1})).   (5)

Let us view ∇L_m(θ^k) − ∇L_m(θ^{k−1}) as a refinement to ∇L_m(θ^{k−1}), and recall that obtaining this refinement requires a round of communication between the server and the worker m. Therefore, to save communication, we can skip the server's communication with the worker m if this refinement is small compared to the old gradient; that is, ‖∇L_m(θ^k) − ∇L_m(θ^{k−1})‖ ≪ ‖Σ_{m∈M} ∇L_m(θ^{k−1})‖.

Generalizing on this intuition, given the generic outdated gradient components {∇L_m(θ̂^{k−1}_m)}, with θ̂^{k−1}_m = θ^{k−1−τ^{k−1}_m} for a certain τ^{k−1}_m ≥ 0, if communicating with some workers will bring only small gradient refinements, we skip those communications (contained in the set M^k_c) and end up with

  θ^{k+1} = θ^k − α Σ_{m∈M} ∇L_m(θ̂^{k−1}_m) − α Σ_{m∈M^k} (∇L_m(θ^k) − ∇L_m(θ̂^{k−1}_m))   (6a)
        = θ^k − α ∇L(θ^k) − α Σ_{m∈M^k_c} (∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k))   (6b)

where M^k and M^k_c are the sets of workers that do and do not communicate with the server, respectively. It is easy to verify that (6) is identical to (3) and (4).
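The equivalence between the direct aggregation (3) and the recursion (4) can be checked numerically. The sketch below (not the authors' code) uses toy linear gradients and a random selection set M^k in place of the trigger rules developed next; the server only ever adds the corrections δ∇^k_m from the selected workers.

```python
import numpy as np

# Sketch of recursion (4): the server keeps the running aggregate nabla^{k-1} and
# adds delta-corrections from the workers in M^k only. We verify it always equals
# the direct sum over the stored (possibly stale) per-worker gradients, i.e. (3).
rng = np.random.default_rng(1)
M, d, K = 4, 3, 20

def grad(m, theta):
    # Stand-in for nabla L_m(theta): gradient of a toy quadratic (an assumption).
    A = np.diag(np.arange(1, d + 1) * (m + 1) / 10.0)
    return A @ theta

theta = rng.standard_normal(d)
stored = [grad(m, theta) for m in range(M)]   # nabla L_m(theta_hat_m), fresh at start
agg = np.sum(stored, axis=0)                  # the server's aggregate nabla^0
alpha = 0.05
for k in range(K):
    theta = theta - alpha * agg               # LAG update (3)/(4)
    Mk = [m for m in range(M) if rng.random() < 0.5]   # selection rule abstracted away
    for m in Mk:
        new = grad(m, theta)
        agg += new - stored[m]                # recursion (4): add delta nabla^k_m
        stored[m] = new                       # theta_hat_m <- theta^k
    assert np.allclose(agg, np.sum(stored, axis=0))    # (4) stays equal to (3)
```

Only |M^k| difference vectors cross the network per iteration, which is exactly the communication saving described above.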
Comparing (2) with (6b), when M^k_c includes more workers, more communication is saved, but θ^k is updated by a coarser gradient. Key to addressing this communication-versus-accuracy tradeoff is a principled criterion for selecting the subset of workers M^k_c that do not communicate with the server at each round. To achieve this "sweet spot," we will rely on the fundamental descent lemma. For GD, it is given as follows [33].

Lemma 1 (GD descent in objective) Suppose L(θ) is L-smooth, and θ̄^{k+1} is generated by running one-step GD iteration (2) given θ^k and stepsize α. Then the objective values satisfy

  L(θ̄^{k+1}) − L(θ^k) ≤ −(α − α²L/2) ‖∇L(θ^k)‖² := Δ^k_GD(θ^k).   (7)

Likewise, for our wanted iteration (6), the following holds; its proof is given in the Supplement.

Lemma 2 (LAG descent in objective) Suppose L(θ) is L-smooth, and θ^{k+1} is generated by running one-step LAG iteration (4) given θ^k. Then the objective values satisfy (cf. δ∇^k_m in (4))

  L(θ^{k+1}) − L(θ^k) ≤ −(α/2) ‖∇L(θ^k)‖² + (α/2) ‖Σ_{m∈M^k_c} δ∇^k_m‖² + (L/2 − 1/(2α)) ‖θ^{k+1} − θ^k‖² := Δ^k_LAG(θ^k).   (8)

Lemmas 1 and 2 estimate the objective value descent of one iteration of the GD and LAG methods, respectively, conditioned on a common iterate θ^k. GD finds Δ^k_GD(θ^k) by performing M rounds of communication with all the workers, while LAG yields Δ^k_LAG(θ^k) by performing only |M^k| rounds of communication with a selected subset of workers.
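Lemma 1 is easy to verify numerically. The sketch below checks the bound Δ^k_GD(θ^k) in (7) for one GD step on a toy quadratic; the matrix and its eigenvalues are illustrative assumptions that fix the smoothness constant L.

```python
import numpy as np

# Numerical sanity check of Lemma 1 on L(theta) = 0.5 * theta^T A theta,
# whose smoothness constant is the largest eigenvalue of A.
rng = np.random.default_rng(2)
A = np.diag([0.5, 1.0, 4.0])          # eigenvalues chosen so that L = 4.0
Lsmooth = 4.0

def L(theta):
    return 0.5 * theta @ A @ theta

def gradL(theta):
    return A @ theta

alpha = 1.0 / Lsmooth                 # the standard stepsize alpha = 1/L
theta = rng.standard_normal(3)
theta_next = theta - alpha * gradL(theta)      # one GD step (2)
descent = L(theta_next) - L(theta)
bound = -(alpha - alpha**2 * Lsmooth / 2) * np.linalg.norm(gradL(theta))**2  # (7)
assert descent <= bound + 1e-12       # Lemma 1: actual descent is at least the bound
```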
Our pursuit is to select M^k so as to ensure that LAG enjoys a larger per-communication descent than GD; that is,

  Δ^k_LAG(θ^k)/|M^k| ≤ Δ^k_GD(θ^k)/M.   (9)

Choosing the standard α = 1/L, we can show that in order to guarantee (9), it is sufficient to have (see the supplemental material for the deduction)

  ‖∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k)‖² ≤ ‖∇L(θ^k)‖² / M²,  ∀m ∈ M^k_c.   (10)

However, directly checking (10) at each worker is expensive, since obtaining ‖∇L(θ^k)‖² requires information from all the workers. Instead, we approximate ‖∇L(θ^k)‖² in (10) by

  ‖∇L(θ^k)‖² ≈ (1/α²) Σ_{d=1}^{D} ξ_d ‖θ^{k+1−d} − θ^{k−d}‖²   (11)

where {ξ_d}_{d=1}^{D} are constant weights, and the constant D determines the number of recent iterate changes that LAG incorporates to approximate the current gradient.
The rationale here is that, as L is smooth, ∇L(θ^k) cannot be very different from the recent gradients or the recent iterate lags. Building upon (10) and (11), we will include worker m in M^k_c of (6) if it satisfies

  LAG-WK condition:  ‖∇L_m(θ̂^{k−1}_m) − ∇L_m(θ^k)‖² ≤ (1/(α²M²)) Σ_{d=1}^{D} ξ_d ‖θ^{k+1−d} − θ^{k−d}‖².   (12a)

Algorithm 1 LAG-WK
1: Input: stepsize α > 0, and thresholds {ξ_d}.
2: Initialize: θ^1, {∇L_m(θ̂^0_m), ∀m}.
3: for k = 1, 2, ..., K do
4:   Server broadcasts θ^k to all workers.
5:   for worker m = 1, ..., M do
6:     Worker m computes ∇L_m(θ^k).
7:     Worker m checks condition (12a).
8:     if worker m violates (12a) then
9:       Worker m uploads δ∇^k_m.   ▷ Save ∇L_m(θ̂^k_m) = ∇L_m(θ^k)
10:    else
11:      Worker m uploads nothing.
12:    end if
13:  end for
14:  Server updates via (4).
15: end for

Algorithm 2 LAG-PS
1: Input: stepsize α > 0, {ξ_d}, and L_m, ∀m.
2: Initialize: θ^1, {θ̂^0_m, ∇L_m(θ̂^0_m), ∀m}.
3: for k = 1, 2, ..., K do
4:   for worker m = 1, ..., M do
5:     Server checks condition (12b).
6:     if worker m violates (12b) then
7:       Server sends θ^k to worker m.   ▷ Save θ̂^k_m = θ^k at server
8:       Worker m computes ∇L_m(θ^k).
9:       Worker m uploads δ∇^k_m.
10:    else
11:      No actions at server and worker m.
12:    end if
13:  end for
14:  Server updates via (4).
15: end for

Table 2: A comparison of LAG-WK and LAG-PS.

Condition (12a) is checked at the worker side after each worker receives θ^k from the server and computes its ∇L_m(θ^k).
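As a concrete illustration, the worker-side check can be sketched as follows; the stepsize, gradients, and iterate differences are toy assumptions, with ξ_d = 1/D as suggested later in the text.

```python
import numpy as np

# Sketch (not the authors' code) of the worker-side trigger (12a): worker m uploads
# delta nabla^k_m only if its gradient change exceeds a weighted average of the
# recent iterate changes. All numbers below are illustrative assumptions.
alpha, M, D = 0.1, 9, 10
xi = np.full(D, 1.0 / D)                         # xi_d = 1/D

def violates_12a(grad_new, grad_old, iterate_diffs):
    """iterate_diffs[d-1] holds ||theta^{k+1-d} - theta^{k-d}||^2 for d = 1, ..., D."""
    thresh = np.dot(xi, iterate_diffs) / (alpha**2 * M**2)     # RHS of (12a)
    return float(np.sum((grad_new - grad_old) ** 2)) > thresh  # upload iff (12a) fails

diffs = np.ones(D)                  # recent iterate changes, each of squared norm 1
g_old = np.zeros(5)
assert not violates_12a(g_old, g_old, diffs)      # unchanged gradient: skip the upload
assert violates_12a(g_old + 10.0, g_old, diffs)   # large change: communicate
```

Note the check only needs locally available quantities: the worker's own stale gradient, its fresh gradient, and the recent model iterates it received from the server.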
If broadcasting is also costly, we can resort to the following server-side rule:

  LAG-PS condition:  L_m² ‖θ̂^{k−1}_m − θ^k‖² ≤ (1/(α²M²)) Σ_{d=1}^{D} ξ_d ‖θ^{k+1−d} − θ^{k−d}‖².   (12b)

The values of {ξ_d} and D admit simple choices, e.g., ξ_d = 1/D, ∀d, with D = 10 used in the simulations.

LAG-WK vs LAG-PS. To perform (12a), the server needs to broadcast the current model θ^k, and all the workers need to compute their gradients; to perform (12b), the server needs the estimated smoothness constant L_m of every local function. On the other hand, as will be shown in Section 3, (12a) and (12b) lead to the same worst-case convergence guarantees. In practice, however, the server-side condition is more conservative than the worker-side one at communication reduction, because the smoothness of L_m readily implies that satisfying (12b) will necessarily satisfy (12a), but not vice versa. Empirically, (12a) will lead to a larger M^k_c than that of (12b), and thus extra communication overhead will be saved. Hence, (12a) or (12b) can be chosen according to users' preferences. LAG-WK and LAG-PS are summarized as Algorithms 1 and 2.

Regarding our proposed LAG method, three remarks are in order.

R1) With the recursive update of the lagged gradients in (4) and the lagged iterates in (12), implementing LAG is as simple as GD; see Table 1. Both empirically and theoretically, we will further demonstrate that using lagged gradients even reduces the overall delay by cutting down costly communication.

R2) Although both LAG and the asynchronous-parallel algorithms in [15-20] leverage stale gradients, they are very different. LAG actively creates staleness, and by design, it reduces total communication despite the staleness.
Asynchronous algorithms passively receive staleness and increase total communication because of it, but they save run time.

R3) Compared with existing efforts for communication-efficient learning, such as quantized gradients, Nesterov's acceleration, dual coordinate ascent, and second-order methods, LAG is complementary to all of them: LAG can be combined with these methods to develop even more powerful learning schemes. Extension to a proximal LAG is also possible, to cover nonsmooth regularizers.

3 Iteration and communication complexity

In this section, we establish the convergence of LAG under the following standard conditions.

Assumption 1: Each loss function L_m(θ) is L_m-smooth, and L(θ) is L-smooth.
Assumption 2: L(θ) is convex and coercive.
Assumption 3: L(θ) is μ-strongly convex.

The subsequent convergence analysis critically builds on the following Lyapunov function:

  V^k := L(θ^k) − L(θ*) + Σ_{d=1}^{D} β_d ‖θ^{k+1−d} − θ^{k−d}‖²   (13)

where θ* is the minimizer of (1), and {β_d} is a sequence of constants that will be determined later. We will start with the sufficient descent of our V^k in (13).

Lemma 3 (descent lemma) Under Assumption 1, if α and {ξ_d} are chosen properly, there exist constants c_0, ..., c_D ≥ 0 such that the Lyapunov function in (13) satisfies

  V^{k+1} − V^k ≤ −c_0 ‖∇L(θ^k)‖² − Σ_{d=1}^{D} c_d ‖θ^{k+1−d} − θ^{k−d}‖²   (14)

which implies descent of our Lyapunov function, that is, V^{k+1} ≤ V^k.

Lemma 3 is a generalization of GD's descent lemma.
As specified in the supplementary material, under properly chosen {ξ_d}, any stepsize α ∈ (0, 2/L), including α = 1/L, guarantees (14), matching the stepsize region of GD. With M^k = M and β_d = 0, ∀d, in (13), Lemma 3 reduces to Lemma 1.

3.1 Convergence in strongly convex case

We first present the convergence under the smooth and strongly convex condition.

Theorem 1 (strongly convex case) Under Assumptions 1-3, the iterates {θ^k} of LAG satisfy

  L(θ^K) − L(θ*) ≤ (1 − c(α; {ξ_d}))^K V^0   (15)

where θ* is the minimizer of L(θ) in (1), and c(α; {ξ_d}) ∈ (0, 1) is a constant depending on α, {ξ_d}, {β_d}, and the condition number κ := L/μ, which are specified in the supplementary material.

Iteration complexity. The iteration complexity in its generic form is complicated, since c(α; {ξ_d}) depends on the choice of several parameters. Specifically, if we choose the parameters as follows

  ξ_1 = ⋯ = ξ_D := ξ < 1/D,  β_d := (D − d + 1)√(D/ξ)/(2α),  and  α := (1 − √(Dξ))/L   (16)

then, following Theorem 1, the iteration complexity of LAG in this case is

  I_LAG(ε) = κ/(1 − √(Dξ)) · log(ε^{−1}).   (17)

The iteration complexity in (17) is of the same order as GD's iteration complexity κ log(ε^{−1}), but has a worse constant. This is the consequence of using a smaller stepsize in (16) (relative to α = 1/L in GD) to simplify the choice of other parameters. Empirically, LAG with α = 1/L can achieve almost the same empirical iteration complexity as GD; see Section 4.
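To get a feel for the constants in (16)-(17), the following sketch evaluates the two iteration-complexity bounds for assumed values of κ, D, and ξ; these numbers are illustrative, not from the paper.

```python
import math

# Illustration of (16)-(17): with xi_1 = ... = xi_D = xi < 1/D and the stepsize
# alpha = (1 - sqrt(D*xi))/L, LAG's iteration bound is GD's kappa*log(1/eps)
# inflated by the factor 1/(1 - sqrt(D*xi)). All values below are assumptions.
kappa, D, eps = 100.0, 10, 1e-5
xi = 1.0 / (4 * D)                       # admissible since xi < 1/D; sqrt(D*xi) = 0.5
I_gd = kappa * math.log(1 / eps)                          # GD: kappa * log(1/eps)
I_lag = kappa / (1 - math.sqrt(D * xi)) * math.log(1 / eps)   # LAG bound (17)
assert math.isclose(I_lag, 2 * I_gd)     # sqrt(D*xi) = 1/2 doubles the bound
```

The inflation factor shrinks toward 1 as ξ → 0, recovering GD's bound but also weakening the trigger (fewer skipped uploads), which is the tradeoff quantified next.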
Building on the iteration complexity, we next study the communication complexity of LAG. In the setting of our interest, we define the communication complexity as the total number of uploads over all the workers needed to achieve accuracy ε. While the accuracy refers to the objective optimality error in the strongly convex case, it is taken as the gradient norm in the general (non)convex cases.

The power of LAG is best illustrated by numerical examples; see an example of LAG-WK in Figure 2. Clearly, workers with a small smoothness constant communicate with the server less frequently. This intuition is formally treated in the next lemma.

Figure 2: Communication events of workers 1, 3, 5, 7, 9 over 1,000 iterations. Each stick is an upload. A setup with L_1 < ... < L_9.

Lemma 4 (lazy communication) Define the importance factor of every worker m as H(m) := L_m/L. If the stepsize α and the constants {ξ_d} in the conditions (12) satisfy ξ_D ≤ ⋯ ≤ ξ_d ≤ ⋯ ≤ ξ_1, and worker m satisfies

  H²(m) ≤ ξ_d / (d α² L² M²) := γ_d   (18)

then, until the k-th iteration, worker m communicates with the server at most k/(d + 1) rounds.

Lemma 4 asserts that if worker m has a small L_m (a close-to-linear loss function) such that H²(m) ≤ γ_d, then under LAG it communicates with the server at most k/(d + 1) rounds. This is in contrast to the total of k communication rounds per worker under GD. Ideally, we want as many workers satisfying (18) as possible, especially when d is large.

To quantify the overall communication reduction, we define the heterogeneity score function as

  h(γ) := (1/M) Σ_{m∈M} 1(H²(m) ≤ γ)   (19)

where the indicator 1 equals 1 when H²(m) ≤ γ holds, and 0 otherwise. Clearly, h(γ) is a nondecreasing function of γ that depends on the distribution of the smoothness constants L_1, L_2, ..., L_M.
It is also instructive to view it as the cumulative distribution function of the deterministic quantity H²(m), implying h(γ) ∈ [0, 1]. Putting it in our context, the critical quantity h(γ_d) lower bounds the fraction of workers that communicate with the server at most k/(d + 1) rounds until the k-th iteration. We are now ready to present the communication complexity.

Proposition 5 (communication complexity) With γ_d defined in (18) and the function h(γ) in (19), the communication complexity of LAG, denoted C_LAG(ε), is bounded by

  C_LAG(ε) ≤ (1 − Σ_{d=1}^{D} (1/d − 1/(d+1)) h(γ_d)) M I_LAG(ε) := (1 − ΔC̄(h; {γ_d})) M I_LAG(ε)   (20)

where the constant is defined as ΔC̄(h; {γ_d}) := Σ_{d=1}^{D} (1/d − 1/(d+1)) h(γ_d).

The communication complexity in (20) crucially depends on the iteration complexity I_LAG(ε) as well as on what we call the fraction of reduced communication per iteration, ΔC̄(h; {γ_d}). Simply choosing the parameters as in (16), it follows from (17) and (20) that (cf. γ_d = ξ (1 − √(Dξ))^{−2} M^{−2} d^{−1})

  C_LAG(ε) ≤ (1 − ΔC̄(h; ξ)) / (1 − √(Dξ)) · C_GD(ε)   (21)

where GD's complexity is C_GD(ε) = Mκ log(ε^{−1}). In (21), due to the nondecreasing property of h(γ), increasing the constant ξ yields a smaller fraction 1 − ΔC̄(h; ξ) of workers communicating per iteration, yet a larger number of iterations (cf. (17)).
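The quantities in (18)-(21) are simple to evaluate. The sketch below computes h(γ_d) and the resulting bound on C_LAG(ε)/C_GD(ε) for an assumed two-level smoothness profile (all workers but one nearly linear); the specific values of M, D, and L are illustrative assumptions.

```python
import math

# Sketch of the heterogeneity score (19) and the communication bound (21) for a
# profile with L_m = 1 for m < M and L_M = L (all constants below are assumptions).
M, D = 10, 10
L = 200.0                                    # L_M = L >= M^2 >> 1
Lm = [1.0] * (M - 1) + [L]
H2 = [(l / L) ** 2 for l in Lm]              # importance factors H(m)^2 = (L_m/L)^2

def h(gamma):                                # heterogeneity score (19)
    return sum(v <= gamma for v in H2) / M

xi = M**2 * D / L**2                         # example choice; here xi < 1/D holds
alpha = (1 - math.sqrt(D * xi)) / L          # stepsize from (16)
gammas = [xi / (d * alpha**2 * L**2 * M**2) for d in range(1, D + 1)]   # (18)
delta_C = sum((1/d - 1/(d+1)) * h(g) for d, g in zip(range(1, D + 1), gammas))
frac = (1 - delta_C) / (1 - math.sqrt(D * xi))   # bound on C_LAG / C_GD, cf. (21)
assert 0 < frac < 1                          # fewer uploads than GD in this regime
```

Here nine of the ten workers satisfy (18) for every d ≤ D, so the bound on C_LAG/C_GD works out to 4/11, consistent with the O(1/M)-type saving discussed in the example.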
The key enabler of LAG's communication reduction is a heterogeneous environment with a favorable h(γ), ensuring that the benefit of increasing ξ outweighs its effect of increasing the iteration complexity. More precisely, for a given ξ, if h(γ) guarantees ΔC̄(h; ξ) > √(Dξ), then we have C_LAG(ε) < C_GD(ε). Intuitively speaking, if there is a large fraction of workers with small L_m, LAG has lower communication complexity than GD. An example follows to illustrate this reduction.

Example. Consider L_m = 1, m ≠ M, and L_M = L ≥ M² ≫ 1, so that H(m) = 1/L, m ≠ M, and H(M) = 1, implying that h(γ) ≥ 1 − 1/M if γ ≥ 1/L². Choosing D ≥ M and ξ = M²D/L² < 1/D in (16), such that γ_D ≥ 1/L² in (18), we have (cf. (21))

  C_LAG(ε)/C_GD(ε) ≤ [1 − (1 − 1/(D+1))(1 − 1/M)] / (1 − MD/L) ≈ (M + D)/(M(D + 1)) ≈ 2/M.   (22)

Due to technical issues in the convergence analysis, the current condition on h(γ) ensuring LAG's communication reduction is relatively restrictive.
Establishing communication reduction in a broader learning setting that matches LAG's intriguing empirical performance is in our agenda.

3.2 Convergence in (non)convex case

LAG's convergence and communication reduction guarantees go beyond the strongly-convex case. We next establish the convergence of LAG for general convex functions.

Theorem 2 (convex case) Under Assumptions 1 and 2, if α and {ξ_d} are chosen properly, then

  L(θ^K) − L(θ*) = O(1/K).   (23)

For nonconvex objective functions, LAG can guarantee the following convergence result.

Theorem 3 (nonconvex case) Under Assumption 1, if α and {ξ_d} are chosen properly, then

  min_{1≤k≤K} ‖θ^{k+1} − θ^k‖² = o(1/K)  and  min_{1≤k≤K} ‖∇L(θ^k)‖² = o(1/K).   (24)

Figure 3: Iteration and communication complexity on synthetic datasets (panels: increasing L_m; uniform L_m).

Figure 4: Iteration and communication complexity on real datasets (panels: linear regression; logistic regression).

Theorems 2 and 3 assert that, with the judiciously designed lazy gradient aggregation rules, LAG achieves an order of convergence rate identical to GD for general (non)convex objective functions. Similar to Proposition 5, we have also shown in the supplementary material that in the (non)convex case LAG still requires less communication than GD, under certain conditions on the function h(γ).

4 Numerical tests and conclusions

To validate the theoretical results, this section evaluates the empirical performance of LAG in linear and logistic regression tasks.
All experiments were performed using MATLAB on a desktop with an Intel CPU @ 3.4 GHz and 32 GB of RAM. By default, we consider one server and nine workers. Throughout the tests, we use the objective error L(θ^k) − L(θ*) as the figure of merit of our solution. For logistic regression, the regularization parameter is set to λ = 10^{−3}. To benchmark LAG, we consider the following approaches.
▷ Cyc-IAG is the cyclic version of the incremental aggregated gradient (IAG) method [9, 10]; it resembles the recursion (4), but communicates with one worker per iteration in a cyclic fashion.
▷ Num-IAG also resembles the recursion (4), and is the non-uniform-sampling enhancement of SAG [12]; it randomly selects one worker to obtain a fresh gradient per iteration, with the probability of choosing worker m equal to Lm/∑_{m′∈M} Lm′.
▷ Batch-GD is the GD iteration (2) that communicates with all the workers per iteration.
For LAG-WK, we choose ξd = ξ = 1/D with D = 10, and for LAG-PS we choose the more aggressive ξd = ξ = 10/D with D = 10. Stepsizes for LAG-WK, LAG-PS, and GD are chosen as α = 1/L; to optimize performance and guarantee stability, α = 1/(ML) is used for Cyc-IAG and Num-IAG.
We consider two synthetic-data tests: a) linear regression with increasing smoothness constants, e.g., Lm = (1.3^{m−1} + 1)², ∀m; and b) logistic regression with uniform smoothness constants, e.g., L1 = … = L9 = 4; see Figure 3. For the case of increasing Lm, it is not surprising that both LAG variants need fewer communication rounds. 
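To make the lazy-aggregation mechanism concrete, the following toy simulation is our own sketch of a LAG-WK-style loop on a synthetic least-squares problem. The relative-drift trigger used here is a simplified stand-in for the paper's condition (12), and all names (e.g., `stale`, the 0.1 threshold, the data sizes) are illustrative assumptions, not the paper's settings: each worker uploads a fresh gradient only when it has drifted noticeably since its last upload, while the server descends along the sum of the stored gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, n = 9, 5, 20                      # workers, dimension, samples per worker
x_true = rng.standard_normal(d)
A = [rng.standard_normal((n, d)) for _ in range(M)]
b = [Am @ x_true + 0.01 * rng.standard_normal(n) for Am in A]

def grad(m, th):                        # gradient of worker m's least-squares loss
    return A[m].T @ (A[m] @ th - b[m])

def obj(th):                            # global objective L(theta)
    return 0.5 * sum(np.linalg.norm(A[m] @ th - b[m]) ** 2 for m in range(M))

L_total = sum(np.linalg.norm(Am.T @ Am, 2) for Am in A)
alpha = 1.0 / L_total                   # stepsize alpha = 1/L

theta = np.zeros(d)
stale = [grad(m, theta) for m in range(M)]   # last uploaded gradients
uploads, iters = 0, 500
for _ in range(iters):
    for m in range(M):
        g = grad(m, theta)
        # Simplified trigger: upload only if the fresh gradient drifted
        # noticeably from the last uploaded one; otherwise reuse stale[m].
        if np.linalg.norm(g - stale[m]) > 0.1 * np.linalg.norm(g):
            stale[m] = g
            uploads += 1
    theta = theta - alpha * sum(stale)  # server step with lazily aggregated sum
print("uploads:", uploads, "vs. GD's", M * iters)
```

On this toy problem the loop drives the objective error down while triggering fewer than the M × iters uploads that batch GD would require; the drift threshold plays the role that ξ and D play in the actual rules.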
Interestingly enough, for uniform Lm, LAG-WK still yields marked improvements in communication, thanks to its ability to exploit the hidden smoothness of the loss functions; that is, the local curvature may not be as steep as the global constant Lm suggests.
Performance is also tested on real datasets [2]: a) linear regression using the Housing, Body fat, and Abalone datasets; and b) logistic regression using the Ionosphere, Adult, and Derm datasets; see Figure 4. Each dataset is evenly split across three workers, with the number of features used in the test equal to the minimal number of features among all datasets; see the details of parameters and data allocation in the supplementary material. In all tests, LAG-WK outperforms the alternatives in both metrics, reducing the needed communication rounds by several orders of magnitude. Its number of communication rounds can even be smaller than its number of iterations, when none of the workers violates the trigger condition (12) at certain iterations.

Table 3: Communication complexity (ϵ = 10^{−8}) on real datasets under different numbers of workers.

              Linear regression          Logistic regression
Algorithm     M=9     M=18    M=27       M=9      M=18     M=27
Cyc-IAG       5271    10522   15773      33300    65287    97773
Num-IAG       3466    5283    5815       22113    30540    37262
LAG-PS        1756    3610    5944       14423    29968    44598
LAG-WK        412     657     1058       584      1098     1723
Batch-GD      5283    10548   15822      33309    65322    97821

Additional tests under different numbers of workers are listed in Table 3, which corroborates the effectiveness of LAG when it comes to communication reduction. A similar performance gain has also been observed in an additional logistic regression test on the larger dataset Gisette. The dataset was taken from [7] and was constructed from the MNIST data [8]. After randomly selecting a subset of samples and eliminating all-zero features, it contains 2000 samples xn ∈ R^4837. We randomly split this dataset across nine workers. The performance of all the algorithms is reported in Figure 5 in terms of iteration and communication complexity.

Figure 5: Iteration and communication complexity on the Gisette dataset.

Clearly, LAG-WK and LAG-PS achieve the same iteration complexity as GD, and outperform Cyc-IAG and Num-IAG. Regarding communication complexity, the two LAG variants reduce the needed communication rounds by several orders of magnitude compared with the alternatives.
Confirmed by the impressive empirical performance on both synthetic and real datasets, this paper developed a promising communication-cognizant method for distributed machine learning that we term the Lazily Aggregated Gradient (LAG) approach. LAG achieves the same convergence rates as batch gradient descent (GD) in the smooth strongly-convex, convex, and nonconvex cases, and requires fewer communication rounds than GD when the datasets at different workers are heterogeneous. To overcome the limitations of LAG, future work includes incorporating smoothing techniques to handle nonsmooth loss functions, and robustifying our aggregation rules against cyber attacks.

Acknowledgments

The work by T. 
Chen and G. Giannakis is supported in part by NSF 1500713 and 1711471, and NIH 1R01GM104975-01. The work by T. Chen is also supported by the Doctoral Dissertation Fellowship from the University of Minnesota. The work by T. Sun is supported in part by the China Scholarship Council. The work by W. Yin is supported in part by NSF DMS-1720237 and ONR N0001417121.

References
[1] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Automat. Control, vol. 54, no. 1, pp. 48-61, Jan. 2009.
[2] G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, and H. Zhu, "Decentralized learning for wireless communications and networking," in Splitting Methods in Communication and Imaging, Science and Engineering. New York: Springer, 2016.
[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., "Large scale distributed deep networks," in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, NV, 2012, pp. 1223-1231.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Intl. Conf. Artificial Intell. and Stat., Fort Lauderdale, FL, Apr. 2017, pp. 1273-1282.
[5] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task learning," in Proc. Advances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 4427-4437.
[6] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. 
Gonzalez et al., "A Berkeley view of systems challenges for AI," arXiv preprint:1712.05855, Dec. 2017.
[7] T. Chen, S. Barbarossa, X. Wang, G. B. Giannakis, and Z.-L. Zhang, "Learning and management for Internet-of-Things: Accounting for adaptivity and scalability," Proc. of the IEEE, Nov. 2018.
[8] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. of COMPSTAT'2010, Y. Lechevallier and G. Saporta, Eds. Heidelberg: Physica-Verlag HD, 2010, pp. 177-186.
[9] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," arXiv preprint:1606.04838, Jun. 2016.
[10] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, NV, Dec. 2013, pp. 315-323.
[11] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2014, pp. 1646-1654.
[12] M. Schmidt, N. Le Roux, and F. Bach, "Minimizing finite sums with the stochastic average gradient," Mathematical Programming, vol. 162, no. 1-2, pp. 83-112, Mar. 2017.
[13] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, "Communication efficient distributed machine learning with the parameter server," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2014, pp. 19-27.
[14] B. McMahan and D. Ramage, "Federated learning: Collaborative machine learning without centralized training data," Google Research Blog, Apr. 2017. [Online]. Available: https://research.googleblog.com/2017/04/federated-learning-collaborative.html
[15] L. Cannelli, F. Facchinei, V. Kungurtsev, and G. 
Scutari, "Asynchronous parallel algorithms for nonconvex big-data optimization: Model and convergence," arXiv preprint:1607.04818, Jul. 2016.
[16] T. Sun, R. Hannah, and W. Yin, "Asynchronous coordinate descent under more realistic assumptions," in Proc. Advances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 6183-6191.
[17] Z. Peng, Y. Xu, M. Yan, and W. Yin, "ARock: An algorithmic framework for asynchronous parallel coordinate updates," SIAM J. Sci. Comp., vol. 38, no. 5, pp. 2851-2879, Sep. 2016.
[18] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, Dec. 2011, pp. 693-701.
[19] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar, "An asynchronous parallel stochastic coordinate descent algorithm," J. Machine Learning Res., vol. 16, no. 1, pp. 285-322, 2015.
[20] X. Lian, Y. Huang, Y. Li, and J. Liu, "Asynchronous parallel stochastic gradient for nonconvex optimization," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2015, pp. 2737-2745.
[21] M. I. Jordan, J. D. Lee, and Y. Yang, "Communication-efficient distributed statistical inference," J. American Statistical Association, vol. to appear, 2018.
[22] Y. Zhang, J. C. Duchi, and M. J. Wainwright, "Communication-efficient algorithms for statistical optimization," J. Machine Learning Res., vol. 14, no. 11, 2013.
[23] A. T. Suresh, X. Y. Felix, S. Kumar, and H. B. McMahan, "Distributed mean estimation with limited communication," in Proc. Intl. Conf. Machine Learn., Sydney, Australia, Aug. 2017, pp. 3329-3337.
[24] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. 
Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Proc. Advances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 1709-1720.
[25] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Proc. Advances in Neural Info. Process. Syst., Long Beach, CA, Dec. 2017, pp. 1509-1519.
[26] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," in Proc. of Empirical Methods in Natural Language Process., Copenhagen, Denmark, Sep. 2017, pp. 440-445.
[27] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan, "Communication-efficient distributed dual coordinate ascent," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, Dec. 2014, pp. 3068-3076.
[28] C. Ma, J. Konečný, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, "Distributed optimization with arbitrary local solvers," Optimization Methods and Software, vol. 32, no. 4, pp. 813-848, Jul. 2017.
[29] O. Shamir, N. Srebro, and T. Zhang, "Communication-efficient distributed optimization using an approximate Newton-type method," in Proc. Intl. Conf. Machine Learn., Beijing, China, Jun. 2014, pp. 1000-1008.
[30] Y. Zhang and X. Lin, "DiSCO: Distributed optimization for self-concordant empirical loss," in Proc. Intl. Conf. Machine Learn., Lille, France, Jun. 2015, pp. 362-370.
[31] Y. Liu, C. Nowzari, Z. Tian, and Q. Ling, "Asynchronous periodic event-triggered coordination of multi-agent systems," in Proc. IEEE Conf. Decision Control, Melbourne, Australia, Dec. 2017, pp. 6696-6701.
[32] G. Lan, S. Lee, and Y. 
Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization," arXiv preprint:1701.03961, Jan. 2017.
[33] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Berlin, Germany: Springer, 2013, vol. 87.
[34] D. Blatt, A. O. Hero, and H. Gauchman, "A convergent incremental gradient method with a constant step size," SIAM J. Optimization, vol. 18, no. 1, pp. 29-51, Feb. 2007.
[35] M. Gurbuzbalaban, A. Ozdaglar, and P. A. Parrilo, "On the convergence rate of incremental aggregated gradient algorithms," SIAM J. Optimization, vol. 27, no. 2, pp. 1035-1048, Jun. 2017.
[36] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[37] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo, "Supervised feature selection via dependence estimation," in Proc. Intl. Conf. Machine Learn., Corvallis, OR, Jun. 2007, pp. 823-830.
[38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.