{"title": "Communication Complexity of Distributed Convex Learning and Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1756, "page_last": 1764, "abstract": "We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered. We identify cases where existing algorithms are already worst-case optimal, as well as cases where room for further improvement is still possible. Among other things, our results indicate that without similarity between the local objective functions (due to statistical data similarity or otherwise) many communication rounds may be required, even if the machines have unbounded computational power.", "full_text": "Communication Complexity of Distributed\n\nConvex Learning and Optimization\n\nYossi Arjevani\n\nWeizmann Institute of Science\n\nRehovot 7610001, Israel\n\nOhad Shamir\n\nWeizmann Institute of Science\n\nRehovot 7610001, Israel\n\nyossi.arjevani@weizmann.ac.il\n\nohad.shamir@weizmann.ac.il\n\nAbstract\n\nWe study the fundamental limits to communication-ef\ufb01cient distributed methods\nfor convex learning and optimization, under different assumptions on the informa-\ntion available to individual machines, and the types of functions considered. We\nidentify cases where existing algorithms are already worst-case optimal, as well as\ncases where room for further improvement is still possible. 
Among other things, our results indicate that without similarity between the local objective functions (due to statistical data similarity or otherwise) many communication rounds may be required, even if the machines have unbounded computational power.

1 Introduction

We consider the problem of distributed convex learning and optimization, where a set of m machines, each with access to a different local convex function F_i : R^d → R and a convex domain W ⊆ R^d, attempt to solve the optimization problem

    min_{w∈W} F(w)  where  F(w) = (1/m) Σ_{i=1}^m F_i(w).    (1)

A prominent application is empirical risk minimization, where the goal is to minimize the average loss over some dataset, and each machine has access to a different subset of the data. Letting {z_1, . . . , z_N} be the dataset composed of N examples, and assuming the loss function ℓ(w, z) is convex in w, the empirical risk minimization problem min_{w∈W} (1/N) Σ_{i=1}^N ℓ(w, z_i) can be written as in Eq. (1), where F_i(w) is the average loss over machine i's examples.

The main challenge in solving such problems is that communication between the different machines is usually slow and constrained, at least compared to the speed of local processing. On the other hand, the datasets involved in distributed learning are usually large and high-dimensional. Therefore, machines cannot simply communicate their entire data to each other, and the question is how well we can solve problems such as Eq.
(1) using as little communication as possible.

As datasets continue to increase in size, and parallel computing platforms become more and more common (from multiple cores on a single CPU to large-scale and geographically distributed computing grids), distributed learning and optimization methods have been the focus of much research in recent years, with just a few examples including [25, 4, 2, 27, 1, 5, 13, 23, 16, 17, 8, 7, 9, 11, 20, 19, 3, 26]. Most of this work studied algorithms for this problem, which provide upper bounds on the required time and communication complexity.

In this paper, we take the opposite direction, and study the fundamental performance limitations in solving Eq. (1) under several different sets of assumptions. We identify cases where existing algorithms are already optimal (at least in the worst case), as well as cases where room for further improvement is still possible.

Since a major constraint in distributed learning is communication, we focus on studying the amount of communication required to optimize Eq. (1) up to some desired accuracy ε. More precisely, we consider the number of communication rounds that are required, where in each communication round the machines can generally broadcast to each other information linear in the problem's dimension d (e.g. a point in W or a gradient). This applies to virtually all algorithms for large-scale learning we are aware of, where sending vectors and gradients is feasible, but computing and sending larger objects, such as Hessians (d × d matrices), is not.

Our results pertain to several possible settings (see Sec. 2 for precise definitions). First, we distinguish between the local functions being merely convex or strongly convex, and whether they are smooth or not.
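To make the communication-round model concrete, here is a minimal sketch (our illustration with made-up dimensions, not an algorithm from the paper) of a single round: each machine broadcasts one O(d)-sized message, namely its local gradient, and by linearity the average of the m messages is exactly a gradient of F in Eq. (1).

```python
# Sketch (ours, illustrative): one communication round in which each of m
# machines broadcasts its local gradient A_i w + b_i (a d-dimensional vector),
# and the average of the messages equals the gradient of F = (1/m) sum_i F_i.
import random

random.seed(0)
m, d = 4, 6  # machines, dimension (illustrative)

# Each machine's local gradient oracle is w -> A_i w + b_i.
A = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(m)]
b = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
w = [random.uniform(-1, 1) for _ in range(d)]

def local_grad(i, w):
    return [sum(A[i][r][c] * w[c] for c in range(d)) + b[i][r] for r in range(d)]

# One round: every machine broadcasts one O(d)-sized message.
messages = [local_grad(i, w) for i in range(m)]
avg_grad = [sum(g[r] for g in messages) / m for r in range(d)]

# The same vector, computed directly from the averaged problem data.
A_bar = [[sum(A[i][r][c] for i in range(m)) / m for c in range(d)] for r in range(d)]
b_bar = [sum(b[i][r] for i in range(m)) / m for r in range(d)]
direct = [sum(A_bar[r][c] * w[c] for c in range(d)) + b_bar[r] for r in range(d)]
assert all(abs(x - y) < 1e-9 for x, y in zip(avg_grad, direct))
```

This is the mechanism by which any single-machine gradient method can be simulated at a cost of one communication round per iteration, as discussed after Theorem 1 below.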
These distinctions are standard in studying optimization algorithms for learning, and capture important properties such as the regularization and the type of loss function used. Second, we distinguish between a setting where the local functions are related, e.g., because they reflect statistical similarities in the data residing at different machines, and a setting where no relationship is assumed. For example, in the extreme case where data was split uniformly at random between machines, one can show that quantities such as the values, gradients and Hessians of the local functions differ only by δ = O(1/√n), where n is the sample size per machine, due to concentration of measure effects. Such similarities can be used to speed up the optimization/learning process, as was done in e.g. [20, 26]. Both the δ-related and the unrelated setting can be considered in a unified way, by letting δ be a parameter and studying the attainable lower bounds as a function of δ. Our results can be summarized as follows:

• First, we define a mild structural assumption on the algorithm (which is satisfied by reasonable approaches we are aware of), which allows us to provide the lower bounds described below on the number of communication rounds required to reach a given suboptimality ε.

– When the local functions can be unrelated, we prove a lower bound of Ω(√(1/λ)·log(1/ε)) for smooth and λ-strongly convex functions, and Ω(√(1/ε)) for smooth convex functions. These lower bounds are matched by a straightforward distributed implementation of accelerated gradient descent. In particular, the results imply that many communication rounds may be required to get a high-accuracy solution, and moreover, that no algorithm satisfying our structural assumption would be better, even if we endow the local machines with unbounded computational power.
For non-smooth functions, we show a lower bound of Ω(√(1/(λε))) for λ-strongly convex functions, and Ω(1/ε) for general convex functions. Although we leave a full derivation to future work, it seems these lower bounds can be matched in our framework by an algorithm combining acceleration and Moreau proximal smoothing of the local functions.

– When the local functions are related (as quantified by the parameter δ), we prove a communication round lower bound of Ω(√(δ/λ)·log(1/ε)) for smooth and λ-strongly convex functions. For quadratics, this bound is matched (up to constants and logarithmic factors) by the recently-proposed DISCO algorithm [26]. However, getting an optimal algorithm for general strongly convex and smooth functions in the δ-related setting, let alone for non-smooth or non-strongly convex functions, remains open.

• We also study the attainable performance without posing any structural assumptions on the algorithm, but in the more restricted case where only a single round of communication is allowed. We prove that in a broad regime, the performance of any distributed algorithm may be no better than a 'trivial' algorithm which returns the minimizer of one of the local functions, as long as the number of bits communicated is less than Ω(d²). Therefore, in our setting, no communication-efficient one-round distributed algorithm can provide non-trivial performance in the worst case.

Related Work

There have been several previous works which considered lower bounds in the context of distributed learning and optimization, but to the best of our knowledge, none of them provide a similar type of result.
Perhaps the most closely-related paper is [22], which studied the communication complexity of distributed optimization, and showed that Ω(d log(1/ε)) bits of communication are necessary between the machines, for d-dimensional convex problems. However, in our setting this does not lead to any non-trivial lower bound on the number of communication rounds (indeed, just specifying a d-dimensional vector up to accuracy ε requires O(d log(1/ε)) bits). More recently, [2] considered lower bounds for certain types of distributed learning problems, but not convex ones in an agnostic distribution-free framework. In the context of lower bounds for one-round algorithms, the results of [6] imply that Ω(d²) bits of communication are required to solve linear regression in one round of communication. However, that paper assumes a different model than ours, where the function to be optimized is not split among the machines as in Eq. (1) with each F_i convex. Moreover, issues such as strong convexity and smoothness are not considered. [20] proves an impossibility result for a one-round distributed learning scheme, even when the local functions are not merely related, but actually result from splitting data uniformly at random between machines. On the flip side, that result is for a particular algorithm, and doesn't apply to any possible method.

Finally, we emphasize that distributed learning and optimization can be studied under many settings, including ones different than those studied here. For example, one can consider distributed learning on a stream of i.i.d. data [19, 7, 10, 8], or settings where the computing architecture is different, e.g. where the machines have a shared memory, or the function to be optimized is not split as in Eq.
(1). Studying lower bounds in such settings is an interesting topic for future work.

2 Notation and Framework

The only vector and matrix norms used in this paper are the Euclidean norm and the spectral norm, respectively. e_j denotes the j-th standard unit vector. We let ∇G(w) and ∇²G(w) denote the gradient and Hessian of a function G at w, if they exist. G is smooth (with parameter L) if it is differentiable and the gradient is L-Lipschitz. In particular, if w∗ = arg min_{w∈W} G(w), then G(w) − G(w∗) ≤ (L/2)‖w − w∗‖². G is strongly convex (with parameter λ) if for any w, w′ ∈ W, G(w′) ≥ G(w) + ⟨g, w′ − w⟩ + (λ/2)‖w′ − w‖², where g ∈ ∂G(w) is a subgradient of G at w. In particular, if w∗ = arg min_{w∈W} G(w), then G(w) − G(w∗) ≥ (λ/2)‖w − w∗‖². Any convex function is also strongly convex with λ = 0. A special case of smooth convex functions are quadratics, where G(w) = wᵀAw + bᵀw + c for some positive semidefinite matrix A, vector b and scalar c. In this case, λ and L correspond to the smallest and largest eigenvalues of A.

We model the distributed learning algorithm as an iterative process, where in each round the machines may perform some local computations, followed by a communication round where each machine broadcasts a message to all other machines. We make no assumptions on the computational complexity of the local computations. After all communication rounds are completed, a designated machine provides the algorithm's output (possibly after additional local computation).

Clearly, without any assumptions on the number of bits communicated, the problem can be trivially solved in one round of communication (e.g.
each machine communicates the function F_i to the designated machine, which then solves Eq. (1)). However, in practical large-scale scenarios, this is non-feasible, and the size of each message (measured by the number of bits) is typically on the order of Õ(d)¹, enough to send a d-dimensional real-valued vector, such as points in the optimization domain or gradients, but not larger objects such as d × d Hessians.

¹ The Õ hides constants and factors logarithmic in the required accuracy of the solution. The idea is that we can represent real numbers up to some arbitrarily high machine precision, enough so that finite-precision issues are not a problem.

In this model, our main question is the following: How many rounds of communication are necessary in order to solve problems such as Eq. (1) to some given accuracy ε?

As discussed in the introduction, we first need to distinguish between different assumptions on the possible relation between the local functions. One natural situation is when no significant relationship can be assumed, for instance when the data is arbitrarily split or is gathered by each machine from statistically dissimilar sources. We denote this as the unrelated setting. However, this assumption is often unnecessarily pessimistic. Often the data allocation process is more random, or we can assume that the different data sources for each machine have statistical similarities (to give a simple example, consider learning from users' activity across a geographically distributed computing grid, each servicing its own local population). We will capture such similarities, in the context of quadratic functions, using the following definition:

Definition 1. We say that a set of quadratic functions

    F_i(w) := wᵀA_iw + b_iᵀw + c_i,    A_i ∈ R^{d×d}, b_i ∈ R^d, c_i ∈ R,

are δ-related, if for any i, j ∈ {1, . . . ,
k}, it holds that

    ‖A_i − A_j‖ ≤ δ,    ‖b_i − b_j‖ ≤ δ,    |c_i − c_j| ≤ δ.

For example, in the context of linear regression with the squared loss over a bounded subset of R^d, and assuming mn data points with bounded norm are randomly and equally split among m machines, it can be shown that the conditions above hold with δ = O(1/√n) [20]. The choice of δ provides us with a spectrum of learning problems ranked by difficulty: When δ = Ω(1), this generally corresponds to the unrelated setting discussed earlier. When δ = O(1/√n), we get the situation typical of randomly partitioned data. When δ = 0, all the local functions have essentially the same minimizers, in which case Eq. (1) can be trivially solved with zero communication, just by letting one machine optimize its own local function. We note that although Definition 1 can be generalized to non-quadratic functions, we do not need it for the results presented here.

We end this section with an important remark. In this paper, we prove lower bounds for the δ-related setting, which includes as a special case the commonly-studied setting of randomly partitioned data (in which case δ = O(1/√n)). However, our bounds do not apply for random partitioning, since they use δ-related constructions which do not correspond to randomly partitioned data. In fact, very recent work [12] has cleverly shown that for randomly partitioned data, and for certain reasonable regimes of strong convexity and smoothness, it is actually possible to get better performance than what is indicated by our lower bounds. However, this encouraging result crucially relies on the random partition property, and on parameter regimes which limit how much each data point needs to be "touched", hence preserving key statistical independence properties.
We suspect that it may be difficult to improve on our lower bounds under substantially weaker assumptions.

3 Lower Bounds Using a Structural Assumption

In this section, we present lower bounds on the number of communication rounds, where we impose a certain mild structural assumption on the operations performed by the algorithm. Roughly speaking, our lower bounds pertain to a very large class of algorithms, which are based on linear operations involving points, gradients, and vector products with local Hessians and their inverses, as well as solving local optimization problems involving such quantities. At each communication round, the machines can share any of the vectors they have computed so far. Formally, we consider algorithms which satisfy the assumption stated below. For convenience, we state it for smooth functions (which are differentiable) and discuss the case of non-smooth functions in Sec. 3.2.

Assumption 1. For each machine j, define a set W_j ⊂ R^d, initially W_j = {0}. Between communication rounds, each machine j iteratively computes and adds to W_j some finite number of points w, each satisfying

    γw + ν∇F_j(w) ∈ span{ w′, ∇F_j(w′), (∇²F_j(w′) + D)w″, (∇²F_j(w′) + D)⁻¹w″ | w′, w″ ∈ W_j, D diagonal, ∇²F_j(w′) exists, (∇²F_j(w′) + D)⁻¹ exists }    (2)

for some γ, ν ≥ 0 such that γ + ν > 0. After every communication round, let W_j := ∪_{i=1}^m W_i for all j.
The algorithm's final output (provided by the designated machine j) is a point in the span of W_j.

This assumption requires several remarks:

• Note that W_j is not an explicit part of the algorithm: it simply includes all points computed by machine j so far, or communicated to it by other machines, and is used to define the set of new points which the machine is allowed to compute.

• The assumption bears some resemblance to, but is far weaker than, standard assumptions used to provide lower bounds for iterative optimization algorithms. For example, a common assumption (see [14]) is that each computed point w must lie in the span of the previous gradients. This corresponds to a special case of Assumption 1, where γ = 1, ν = 0, and the span is only over gradients of previously computed points. Moreover, it also allows (for instance) exact optimization of each local function, which is a subroutine in some distributed algorithms (e.g. [27, 25]), by setting γ = 0, ν = 1 and computing a point w satisfying γw + ν∇F_j(w) = 0. By allowing the span to include previous gradients, we also incorporate algorithms which perform optimization of the local function plus terms involving previous gradients and points, such as [20], as well as algorithms which rely on local Hessian information and preconditioning, such as [26]. In summary, the assumption is satisfied by most techniques for black-box convex optimization that we are aware of.
Finally, we emphasize that we do not restrict the number or computational complexity of the operations performed between communication rounds.

• The requirement that γ, ν ≥ 0 is to exclude algorithms which solve non-convex local optimization problems of the form min_w F_j(w) + γ‖w‖² with γ < 0, which are unreasonable in practice and can sometimes break our lower bounds.

• The assumption that W_j is initially {0} (namely, that the algorithm starts from the origin) is purely for convenience, and our results can be easily adapted to any other starting point by shifting all functions accordingly.

The techniques we employ in this section are inspired by lower bounds on the iteration complexity of first-order methods for standard (non-distributed) optimization (see for example [14]). These are based on the construction of 'hard' functions, where each gradient (or subgradient) computation can only provide a small improvement in the objective value. In our setting, the dynamics are roughly similar, but the necessity of many gradient computations is replaced by the necessity of many communication rounds. This is achieved by constructing suitable local functions, where at any time point no individual machine can 'progress' on its own, without information from other machines.

3.1 Smooth Local Functions

We begin by presenting a lower bound when the local functions F_i are strongly convex and smooth:

Theorem 1.
For any even number m of machines, any distributed algorithm which satisfies Assumption 1, and any λ ∈ [0, 1), δ ∈ (0, 1), there exist m local quadratic functions over R^d (where d is sufficiently large) which are 1-smooth, λ-strongly convex, and δ-related, such that if w∗ = arg min_{w∈R^d} F(w), then the number of communication rounds required to obtain ŵ satisfying F(ŵ) − F(w∗) ≤ ε (for any ε > 0) is at least

    (1/4)·(√(1 + δ/λ) − 1)·log(λ‖w∗‖²/(4ε)) − 1/2  =  Ω(√(δ/λ)·log(λ‖w∗‖²/ε))

if λ > 0, and at least ‖w∗‖·√(3δ/(32ε)) − 2 if λ = 0.

The assumption of m being even is purely for technical convenience, and can be discarded at the cost of making the proof slightly more complex. Also, note that m does not appear explicitly in the bound, but may appear implicitly, via δ (for example, in a statistical setting δ may depend on the number of data points per machine, and may be larger if the same dataset is divided among more machines).

Let us contrast our lower bound with some existing algorithms and guarantees in the literature. First, regardless of whether the local functions are similar or not, we can always simulate any gradient-based method designed for a single machine, by iteratively computing gradients of the local functions, and performing a communication round to compute their average. Clearly, this will be a gradient of the objective function F(·) = (1/m) Σ_{i=1}^m F_i(·), which can be fed into any gradient-based method such as gradient descent or accelerated gradient descent [14].
The resulting number of required communication rounds is then equal to the number of iterations. In particular, using accelerated gradient descent for smooth and λ-strongly convex functions yields a round complexity of O(√(1/λ)·log(‖w∗‖²/ε)), and O(‖w∗‖·√(1/ε)) for smooth convex functions. This matches our lower bound (up to constants and log factors) when the local functions are unrelated (δ = Ω(1)).

When the functions are related, however, the upper bounds above are highly sub-optimal: even if the local functions are completely identical, and δ = 0, the number of communication rounds will remain the same as when δ = Ω(1). To utilize function similarity while guaranteeing arbitrarily small ε, the two most relevant algorithms are DANE [20], and the more recent DISCO [26]. For smooth and λ-strongly convex functions, which are either quadratic or satisfy a certain self-concordance condition, DISCO achieves Õ(1 + √(δ/λ)) round complexity ([26, Thm. 2]), which matches our lower bound in terms of dependence on δ, λ. However, for non-quadratic losses, the round complexity bounds are somewhat worse, and there are no guarantees for strongly convex and smooth functions which are not self-concordant. Thus, the question of the optimal round complexity for such functions remains open.

The full proof of Thm. 1 appears in the supplementary material, and is based on the following idea. For simplicity, suppose we have two machines, with local functions F1, F2 defined as follows:

    F1(w) = (δ(1−λ)/4)·wᵀA1w − (δ(1−λ)/4)·e_1ᵀw + (λ/2)‖w‖²,
    F2(w) = (δ(1−λ)/4)·wᵀA2w + (λ/2)‖w‖²,    (3)

where, with B = [[1, −1], [−1, 1]], A1 = diag(1, B, B, . . .) is the block-diagonal matrix with a single 1 in its top-left corner followed by copies of B on the coordinate pairs (2, 3), (4, 5), . . ., and A2 = diag(B, B, . . .) consists of copies of B on the coordinate pairs (1, 2), (3, 4), . . ..

It is easy to verify that for δ, λ ≤ 1, both F1(w) and F2(w) are 1-smooth and λ-strongly convex, as well as δ-related. Moreover, the optimum of their average is a point w∗ with non-zero entries at all coordinates. However, since each local function has a block-diagonal quadratic term, it can be shown that for any algorithm satisfying Assumption 1, after T communication rounds, the points computed by the two machines can only have the first T + 1 coordinates non-zero. No machine will be able to further 'progress' on its own, and cause additional coordinates to become non-zero, without another communication round. This leads to a lower bound on the optimization error which depends on T, resulting in the theorem statement after a few computations.

3.2 Non-smooth Local Functions

Remaining in the framework of algorithms satisfying Assumption 1, we now turn to discuss the situation where the local functions are not necessarily smooth or differentiable. For simplicity, our formal results here will be in the unrelated setting, and we only informally discuss their extension to a δ-related setting (in a sense relevant to non-smooth functions).
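The coordinate-propagation argument, used for Theorem 1 above and again for the non-smooth construction below, can be illustrated numerically in the smooth case. The following is our own sketch, with an illustrative dimension and constants (it is not the paper's proof): the even/odd block structure of A1 and A2 means that each machine's local gradients can only couple coordinates within its own blocks, so the last non-zero coordinate advances by one per alternation between the machines.

```python
# Sketch (ours, illustrative constants) of the construction behind Theorem 1:
# F_1(w) = 2c w.A_1 w/2-style quadratic with gradient 2c A_1 w - c e_1 + lam w,
# F_2(w) with gradient 2c A_2 w + lam w, where c = delta(1-lam)/4 and A_1, A_2
# carry 2x2 blocks [[1,-1],[-1,1]] on alternating coordinate pairs.
d, lam, delta = 8, 0.1, 0.5
c = delta * (1 - lam) / 4

def apply_A(blocks_start, w):
    # 2x2 blocks [[1,-1],[-1,1]] on coordinate pairs starting at index blocks_start
    out = [0.0] * d
    if blocks_start == 1:
        out[0] = w[0]  # A_1 has a lone 1 in its top-left corner
    for i in range(blocks_start, d - 1, 2):
        out[i] += w[i] - w[i + 1]
        out[i + 1] += w[i + 1] - w[i]
    return out

def grad1(w):  # gradient of F_1: 2c A_1 w - c e_1 + lam w
    g = [2 * c * x + lam * y for x, y in zip(apply_A(1, w), w)]
    g[0] -= c
    return g

def grad2(w):  # gradient of F_2: 2c A_2 w + lam w
    return [2 * c * x + lam * y for x, y in zip(apply_A(0, w), w)]

def support(w):
    nz = [i for i, x in enumerate(w) if abs(x) > 1e-12]
    return max(nz) + 1 if nz else 0

w = [0.0] * d
supports = []
for g in [grad1, grad2, grad1, grad2]:  # machines alternate; each switch models a round
    w = [x - y for x, y in zip(w, g(w))]
    supports.append(support(w))
print(supports)  # -> [1, 2, 3, 4]: one new non-zero coordinate per alternation

# Machine 2 on its own cannot even start: F_2 has no linear term, so from
# W_2 = {0} every point it is allowed to compute stays at the origin.
assert grad2([0.0] * d) == [0.0] * d
```

The alternation step is a simplification of Assumption 1 (any point in the span of a machine's current set and its local gradients has the same support pattern), but it captures why T + 1 non-zero coordinates require T communication rounds.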
Formally defining δ-related non-smooth functions is possible but not altogether trivial, and is therefore left to future work.

We adapt Assumption 1 to the non-smooth case, by allowing gradients to be replaced by arbitrary subgradients at the same points. Namely, we replace Eq. (2) by the requirement that for some g ∈ ∂F_j(w) and γ, ν ≥ 0, γ + ν > 0,

    γw + νg ∈ span{ w′, g′, (∇²F_j(w′) + D)w″, (∇²F_j(w′) + D)⁻¹w″ | w′, w″ ∈ W_j, g′ ∈ ∂F_j(w′), D diagonal, ∇²F_j(w′) exists, (∇²F_j(w′) + D)⁻¹ exists }.

The lower bound for this setting is stated in the following theorem.

Theorem 2. For any even number m of machines, any distributed optimization algorithm which satisfies Assumption 1, and for any λ ≥ 0, there exist λ-strongly convex, (1 + λ)-Lipschitz continuous convex local functions F1(w) and F2(w) over the unit Euclidean ball in R^d (where d is sufficiently large), such that if w∗ = arg min_{w:‖w‖≤1} F(w), the number of communication rounds required to obtain ŵ satisfying F(ŵ) − F(w∗) ≤ ε (for any sufficiently small ε > 0) is at least 1/(8ε) − 2 for λ = 0, and at least √(1/(16λε)) − 2 for λ > 0.

As in Thm. 1, we note that the assumption of even m is for technical convenience.

This theorem, together with Thm. 1, implies that both strong convexity and smoothness are necessary for the number of communication rounds to scale logarithmically with the required accuracy ε. We emphasize that this is true even if we allow the machines unbounded computational power, to perform arbitrarily many operations satisfying Assumption 1.
Moreover, a preliminary analysis indicates that performing accelerated gradient descent on smoothed versions of the local functions (using Moreau proximal smoothing, e.g. [15, 24]) can match these lower bounds up to log factors². We leave a full formal derivation (which has some subtleties) to future work.

The full proof of Thm. 2 appears in the supplementary material. The proof idea relies on the following construction. Assume that we fix the number of communication rounds to be T, and (for simplicity) that T is even and the number of machines is 2. Then we use local functions of the form

    F1(w) = (1/√(2(T + 2)))·(|w_2 − w_3| + |w_4 − w_5| + ··· + |w_T − w_{T+1}|) + (1/√2)·|b − w_1| + (λ/2)‖w‖²,
    F2(w) = (1/√(2(T + 2)))·(|w_1 − w_2| + |w_3 − w_4| + ··· + |w_{T+1} − w_{T+2}|) + (λ/2)‖w‖²,

where b is a suitably chosen parameter. It is easy to verify that both local functions are λ-strongly convex and (1 + λ)-Lipschitz continuous over the unit Euclidean ball. Similar to the smooth case, we argue that after T communication rounds, the resulting points w computed by machine 1 will be non-zero only on the first T + 1 coordinates, and the points w computed by machine 2 will be non-zero only on the first T coordinates. As in the smooth case, these functions allow us to 'control' the progress of any algorithm which satisfies Assumption 1.

Finally, although the result is in the unrelated setting, it is straightforward to give a similar construction in a 'δ-related' setting, by multiplying F1 and F2 by δ.
The resulting two functions have their gradients and subgradients at most δ-different from each other, and the construction above leads to a lower bound of Ω(δ/ε) for convex Lipschitz functions, and Ω(δ·√(1/(λε))) for λ-strongly convex Lipschitz functions. In terms of upper bounds, we are actually unaware of any relevant algorithm in the literature adapted to such a setting, and the question of attainable performance here remains wide open.

4 One Round of Communication

In this section, we study what lower bounds are attainable without any kind of structural assumption (such as Assumption 1). This is a more challenging setting, and the result we present will be limited to algorithms using a single round of communication. We note that this still captures a realistic non-interactive distributed computing scenario, where we want each machine to broadcast a single message, after which a designated machine is required to produce an output. In the context of distributed optimization, a natural example is a one-shot averaging algorithm, where each machine optimizes its own local data, and the resulting points are averaged (e.g. [27, 25]).

Intuitively, with only a single round of communication, getting an arbitrarily small error ε may be infeasible. The following theorem establishes a lower bound on the attainable error, depending on the strong convexity parameter λ and the similarity measure δ between the local functions, and compares this with a 'trivial' zero-communication algorithm, which just returns the optimum of a single local function:

Theorem 3.
For any even number m of machines, any dimension d larger than some numerical constant, any δ ≥ 3λ > 0, and any (possibly randomized) algorithm which communicates at most d²/128 bits in a single round of communication, there exist m quadratic functions over ℝᵈ, which are δ-related, λ-strongly convex and 9λ-smooth, for which the following hold for some positive numerical constants c, c′:

• The point ŵ returned by the algorithm satisfies
$$\mathbb{E}\left[F(\hat{w}) - \min_{w \in \mathbb{R}^d} F(w)\right] \ge c\,\frac{\delta^2}{\lambda}$$
in expectation over the algorithm's randomness.
• For any machine j, if ŵⱼ = arg min_{w∈ℝᵈ} Fⱼ(w), then F(ŵⱼ) − min_{w∈ℝᵈ} F(w) ≤ c′δ²/λ.

²Roughly speaking, for any γ > 0, this smoothing creates a (1/γ)-smooth function which is γ-close to the original function. Plugging these into the guarantees of accelerated gradient descent and tuning γ yields our lower bounds. Note that, in order to execute this algorithm, each machine must be sufficiently powerful to obtain the gradient of the Moreau envelope of its local function, which is indeed the case in our framework.

The theorem shows that unless the communication budget is extremely large (quadratic in the dimension), there are functions which cannot be optimized to non-trivial accuracy in one round of communication, in the sense that the same accuracy (up to a universal constant) can be obtained with a 'trivial' solution where we just return the optimum of a single local function.
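The Moreau smoothing of footnote 2 can be made concrete with a toy sketch. This is our own illustration, not the paper's implementation: we take f(w) = ‖w‖₁, whose proximal map is coordinate-wise soft-thresholding, purely as an example of a Lipschitz non-smooth function. The gradient of the Moreau envelope M_γf(w) = min_v { f(v) + ‖v − w‖²/(2γ) } is ∇M_γf(w) = (w − prox_{γf}(w))/γ, which is (1/γ)-Lipschitz, so M_γf is (1/γ)-smooth while staying uniformly close to f:

```python
import numpy as np

def prox_l1(w, gamma):
    # Proximal map of f(w) = ||w||_1: coordinate-wise soft-thresholding.
    return np.sign(w) * np.maximum(np.abs(w) - gamma, 0.0)

def moreau_grad(w, gamma):
    # Gradient of the Moreau envelope: (w - prox_{gamma f}(w)) / gamma.
    return (w - prox_l1(w, gamma)) / gamma

def moreau_env(w, gamma):
    # Value of the envelope, evaluated at the prox point.
    p = prox_l1(w, gamma)
    return np.abs(p).sum() + np.sum((p - w) ** 2) / (2 * gamma)

gamma = 0.1
w = np.array([0.03, -0.5, 2.0])
g = moreau_grad(w, gamma)
# Inside the threshold the gradient is w / gamma; outside it is sign(w),
# i.e. a Huber-style smoothing of the l1 norm.
print(g)
```

For a 1-Lipschitz coordinate (each |wᵢ| here), the per-coordinate gap between f and its envelope is at most γ/2, matching the "γ-close" statement of the footnote up to constants.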
This complements an earlier result in [20], which showed that a particular one-round algorithm is no better than returning the optimum of a local function, under the stronger assumption that the local functions are not merely δ-related, but are actually the average loss over some randomly partitioned data.
The full proof appears in the supplementary material, but we sketch the main ideas below. As before, focusing on the case of two machines, and assuming machine 2 is responsible for providing the output, we use

$$F_1(w) = 3\lambda\, w^\top\left(\left(I + \frac{1}{2c\sqrt{d}}\,M\right)^{-1} - \frac{1}{2}I\right)w, \qquad F_2(w) = \frac{3\lambda}{2}\,\|w\|^2 - \delta\, e_j^\top w,$$

where M is essentially a randomly chosen {−1, +1}-valued d × d symmetric matrix with spectral norm at most c√d, and c is a suitable constant. These functions can be shown to be δ-related as well as λ-strongly convex. Moreover, the optimum of F(w) = ½(F1(w) + F2(w)) equals

$$w^* = \frac{\delta}{6\lambda}\left(I + \frac{1}{2c\sqrt{d}}\,M\right)e_j.$$

Thus, we see that the optimal point w* depends on the j-th column of M. Intuitively, the machines need to approximate this column, and this is the source of hardness in this setting: Machine 1 knows M but not j, yet needs to communicate to machine 2 enough information to construct its j-th column. However, given a communication budget much smaller than the size of M (which is d²), it is difficult to convey enough information on the j-th column without knowing what j is. Carefully formalizing this intuition, and using some information-theoretic tools, allows us to prove the first part of Thm. 3. Proving the second part of Thm.
3 is straightforward, using a few computations.

5 Summary and Open Questions

In this paper, we studied lower bounds on the number of communication rounds needed to solve distributed convex learning and optimization problems, under several different settings. Our results indicate that when the local functions are unrelated, then regardless of the local machines' computational power, many communication rounds may be necessary (scaling polynomially with 1/ε or 1/λ), and that the worst-case optimal algorithm (at least for smooth functions) is just a straightforward distributed implementation of accelerated gradient descent. When the functions are related, we show that the optimal performance is achieved by the algorithm of [26] for quadratic and strongly convex functions, but designing optimal algorithms for more general functions remains open. Besides these results, which required a certain mild structural assumption on the algorithm employed, we also provided an assumption-free lower bound for one-round algorithms, which implies that even for strongly convex quadratic functions, such algorithms can sometimes only provide trivial performance.
Besides the question of designing optimal algorithms for the remaining settings, several additional questions remain open. First, it would be interesting to obtain assumption-free lower bounds for algorithms with multiple rounds of communication. Second, our work focused on communication complexity, but in practice the computational complexity of the local computations is no less important. Thus, it would be interesting to understand what performance is attainable with simple, runtime-efficient algorithms.
Finally, it would be interesting to study lower bounds for other distributed learning and optimization scenarios.

Acknowledgments: This research is supported in part by an FP7 Marie Curie CIG grant, the Intel ICRI-CI Institute, and Israel Science Foundation grant 425/13. We thank Nati Srebro for several helpful discussions and insights.

References
[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.
[2] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.
[3] M.-F. Balcan, V. Kanchanapally, Y. Liang, and D. Woodruff. Improved distributed principal component analysis. In NIPS, 2014.
[4] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
[5] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via ADMM. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[6] K. Clarkson and D. Woodruff. Numerical linear algebra in the streaming model. In STOC, 2009.
[7] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, 2011.
[8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
[9] J. Duchi, A. Agarwal, and M. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Automat. Contr., 57(3):592–606, 2012.
[10] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Competing with the empirical risk minimizer in a single pass. arXiv preprint arXiv:1412.6606, 2014.
[11] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T.
Hofmann, and M. Jordan. Communication-efficient distributed dual coordinate ascent. In NIPS, 2014.
[12] J. Lee, T. Ma, and Q. Lin. Distributed stochastic variance reduced gradient methods. CoRR, abs/1507.07595, 2015.
[13] D. Mahajan, S. Keerthy, S. Sundararajan, and L. Bottou. A parallel SGD method with strong convergence. CoRR, abs/1311.0636, 2013.
[14] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer, 2004.
[15] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[16] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[17] P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data. CoRR, abs/1310.2059, 2013.
[18] O. Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In NIPS, 2014.
[19] O. Shamir and N. Srebro. On distributed stochastic optimization and learning. In Allerton Conference on Communication, Control, and Computing, 2014.
[20] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.
[21] T. Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
[22] J. Tsitsiklis and Z.-Q. Luo. Communication complexity of convex optimization. J. Complexity, 3(3):231–243, 1987.
[23] T. Yang. Trading computation for communication: Distributed SDCA. In NIPS, 2013.
[24] Y.-L. Yu. Better approximation and faster algorithm using proximal average. In NIPS, 2013.
[25] Y. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
[26] Y. Zhang and L. Xiao.
Communication-efficient distributed optimization of self-concordant empirical loss. In ICML, 2015.
[27] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, 2010.