{"title": "On Communication Cost of Distributed Statistical Estimation and Dimensionality", "book": "Advances in Neural Information Processing Systems", "page_first": 2726, "page_last": 2734, "abstract": "We explore the connection between dimensionality and communication cost in distributed learning problems. Specifically we study the problem of estimating the mean $\\vectheta$ of an unknown $d$ dimensional gaussian distribution in the distributed setting. In this problem, the samples from the unknown distribution are distributed among $m$ different machines. The goal is to estimate the mean $\\vectheta$ at the optimal minimax rate while communicating as few bits as possible. We show that in this setting, the communication cost scales linearly in the number of dimensions i.e. one needs to deal with different dimensions individually. Applying this result to previous lower bounds for one dimension in the interactive setting \\cite{ZDJW13} and to our improved bounds for the simultaneous setting, we prove new lower bounds of $\\Omega(md/\\log(m))$ and $\\Omega(md)$ for the bits of communication needed to achieve the minimax squared loss, in the interactive and simultaneous settings respectively. To complement, we also demonstrate an interactive protocol achieving the minimax squared loss with $O(md)$ bits of communication, which improves upon the simple simultaneous protocol by a logarithmic factor. Given the strong lower bounds in the general setting, we initiate the study of the distributed parameter estimation problems with structured parameters. Specifically, when the parameter is promised to be $s$-sparse, we show a simple thresholding based protocol that achieves the same squared loss while saving a $d/s$ factor of communication. 
We conjecture that the tradeoff between communication and squared loss demonstrated by this protocol is essentially optimal up to a logarithmic factor.", "full_text": "On Communication Cost of Distributed Statistical Estimation and Dimensionality\n\nAnkit Garg\nDepartment of Computer Science, Princeton University\ngarg@cs.princeton.edu\n\nTengyu Ma\nDepartment of Computer Science, Princeton University\ntengyu@cs.princeton.edu\n\nHuy L. Nguyen\nSimons Institute, UC Berkeley\nhlnguyen@cs.princeton.edu\n\nAbstract\n\nWe explore the connection between dimensionality and communication cost in distributed learning problems. Specifically, we study the problem of estimating the mean ~θ of an unknown d-dimensional Gaussian distribution in the distributed setting. In this problem, the samples from the unknown distribution are distributed among m different machines. The goal is to estimate the mean ~θ at the optimal minimax rate while communicating as few bits as possible. We show that in this setting, the communication cost scales linearly in the number of dimensions, i.e., one needs to deal with different dimensions individually. Applying this result to previous lower bounds for one dimension in the interactive setting [1] and to our improved bounds for the simultaneous setting, we prove new lower bounds of Ω(md/log(m)) and Ω(md) for the bits of communication needed to achieve the minimax squared loss, in the interactive and simultaneous settings respectively. To complement, we also demonstrate an interactive protocol achieving the minimax squared loss with O(md) bits of communication, which improves upon the simple simultaneous protocol by a logarithmic factor. Given the strong lower bounds in the general setting, we initiate the study of distributed parameter estimation problems with structured parameters. 
Specifically, when the parameter is promised to be s-sparse, we show a simple thresholding-based protocol that achieves the same squared loss while saving a d/s factor of communication. We conjecture that the tradeoff between communication and squared loss demonstrated by this protocol is essentially optimal up to a logarithmic factor.\n\n1 Introduction\n\nThe last decade has witnessed a tremendous growth in the amount of data involved in machine learning tasks. In many cases, data volume has outgrown the capacity of memory of a single machine and it is increasingly common that learning tasks are performed in a distributed fashion on many machines. Communication has emerged as an important resource and sometimes the bottleneck of the whole system. Many recent works are devoted to understanding how to solve problems in a distributed fashion with efficient communication [2, 3, 4, 1, 5].\nIn this paper, we study the relation between the dimensionality and the communication cost of statistical estimation problems. Most modern statistical problems are characterized by high dimensionality. Thus, it is natural to ask the following meta question:\nHow does the communication cost scale in the dimensionality?\n\nWe study this question via the problems of estimating parameters of distributions in the distributed setting. For these problems, we answer the question above by providing two complementary results:\n\n1. Lower bound for the general case: If the distribution is a product distribution over the coordinates, then one essentially needs to estimate each dimension of the parameter individually, and the information cost (a proxy for communication cost) scales linearly in the number of dimensions.\n\n2. 
Upper bound for the sparse case: If the true parameter is promised to have low sparsity, then a very simple thresholding estimator gives a better tradeoff between communication cost and mean-squared loss.\n\nBefore getting into the ideas behind these results, we first define the problem more formally. We consider the case when there are m machines, each of which receives n i.i.d. samples from an unknown distribution P (from a family P) over the d-dimensional Euclidean space R^d. These machines need to estimate a parameter θ of the distribution via communicating with each other. Each machine can do arbitrary computation on its samples and the messages it receives from other machines. We regard communication (the number of bits communicated) as a resource, and therefore we not only want to optimize the estimation error of the parameters but also the tradeoff between the estimation error and the communication cost of the whole procedure. For simplicity, here we are typically interested in achieving the minimax error 1 while communicating as few bits as possible. Our main focus is the high-dimensional setting where d is very large.\n\nCommunication Lower Bound via Direct-Sum Theorem: The key idea for the lower bound is that, when the unknown distribution P = P1 × ··· × Pd is a product distribution over R^d, and each coordinate of the parameter θ only depends on the corresponding component of P, then we can view the d-dimensional problem as d independent copies of the one-dimensional problem. We show that, unfortunately, one cannot do anything beyond this trivial decomposition, that is, treating each dimension independently and solving d different estimation problems individually. In other words, the communication cost 2 must be at least d times the cost of the one-dimensional problem. 
We call this theorem the “direct-sum” theorem.\nTo demonstrate our theorem, we focus on the specific case where P is a d-dimensional spherical Gaussian distribution with an unknown mean and covariance σ²Id 3. The problem is to estimate the mean of P. The work [1] showed a lower bound on the communication cost for this problem when d = 1. Our technique, when applied to their theorem, immediately yields a lower bound equal to d times the lower bound for the one-dimensional problem, for any choice of d. Note that [5] independently achieves the same bound by refining the proof in [1].\nIn the simultaneous communication setting, where all machines send one message to one machine and this machine needs to figure out the estimation, the work [1] showed that Ω(md/log m) bits of communication are needed to achieve the minimax squared loss. In this paper, we improve this bound to Ω(md), by providing an improved lower bound for the one-dimensional setting and then applying our direct-sum theorem.\nThe direct-sum theorem that we prove heavily uses ideas and tools from recent developments in communication complexity and information complexity. There has been a lot of work on the paradigm of studying communication complexity via the notion of information complexity [6, 7, 8, 9, 10]. Information complexity can be thought of as a proxy for communication complexity that is especially accurate for solving multiple copies of the same problem simultaneously [8]. Proving so-called “direct-sum” results has become a standard tool, namely the fact that the amount of resources required for solving d copies of a problem (with different inputs) in parallel is equal to d times the amount required for one copy. 
In other words, there is no saving from solving many copies of the same problem in batch, and the trivial solution of solving each of them separately is optimal. Note that this generic statement is certainly NOT true for arbitrary types of tasks and arbitrary types of resources. Actually, even for distributed computing tasks, if the measure of resources is the communication cost instead of the information cost, there exist examples where solving d copies of a certain problem requires less communication than d times the communication required for one copy [11]. Therefore, a direct-sum theorem, if true, could indeed capture the features and difficulties of the problems.\n\n1 By minimax error we mean the minimum possible error that can be achieved when there is no limit on the communication.\n2 Technically, information cost, as discussed below.\n3 Here Id denotes the d × d identity matrix.\n\nOur result can be viewed as a direct-sum theorem for communication complexity for statistical estimation problems: the amount of communication needed for solving an estimation problem in d dimensions is at least d times the amount of information needed for the same problem in one dimension. The proof technique is directly inspired by the notion of conditional information complexity [7], which was used to prove direct-sum theorems and lower bounds for streaming algorithms. We believe this is a fruitful connection and can lead to more lower bounds in statistical machine learning.\nTo complement the above lower bounds, we also show an interactive protocol that uses a log factor less communication than the simple protocol, under which each machine sends the sample mean and the center takes the average as the estimation. 
Our protocol demonstrates the additional power of interactive communication and the potential complexity of proving lower bounds for interactive protocols.\n\nThresholding Algorithm for Sparse Parameter Estimation: In light of the strong lower bounds in the general case, a question suggests itself as a way to get around the impossibility results:\nCan we do better when the data (parameters) have more structure?\nWe study this question by considering a sparsity structure on the parameter ~θ. Specifically, we consider the case when the underlying parameter ~θ is promised to be s-sparse. We provide a simple protocol that achieves the same squared loss O(dσ²/(mn)) as in the general case while using Õ(sm) communication, or achieves the optimal squared loss O(sσ²/(mn)) with communication Õ(dm), or any tradeoff between these cases. We conjecture that this is the best tradeoff up to polylogarithmic factors.\n\n2 Problem Setup, Notations and Preliminaries\n\nClassical Statistical Parameter Estimation: We start by reviewing the classical framework of statistical parameter estimation problems. Let P be a family of distributions over X. Let θ : P → Θ ⊆ R denote a function defined on P. We are given samples X^1, . . . , X^n from some P ∈ P, and are asked to estimate θ(P). Let θ̂ : X^n → Θ be such an estimator, and θ̂(X^1, . . . , X^n) is the corresponding estimate.\nDefine the squared loss R of the estimator to be\n\nR(θ̂, θ) = E_(θ̂,X)[ ||θ̂(X^1, . . . , X^n) − θ(P)||_2^2 ]\n\nIn the high-dimensional case, let P^d := { ~P = P1 × ··· × Pd : Pi ∈ P } be the family of product distributions over X^d. Let ~θ : P^d → Θ^d ⊆ R^d be the d-dimensional function obtained by applying θ point-wise: ~θ(P1 × ··· × Pd) = (θ(P1), . . . 
, θ(Pd)).\nThroughout this paper, we consider the case when X = R and P = {N(θ, σ²) : θ ∈ [−1, 1]} is the family of Gaussian distributions for some fixed and known σ. Therefore, in the high-dimensional case, P^d = {N(~θ, σ²Id) : ~θ ∈ [−1, 1]^d} is a collection of spherical Gaussian distributions. We use ~θ̂ to denote the d-dimensional estimator. For clarity, in this paper, we always use ~· to indicate a vector in high dimensions.\nDistributed Protocols and Parameter Estimation: In this paper, we are interested in the situation where there are m machines and the jth machine receives n samples ~X^(j,1), . . . , ~X^(j,n) ∈ R^d from the distribution ~P = N(~θ, σ²Id). The machines communicate via a publicly shown blackboard. That is, when a machine writes a message on the blackboard, all other machines can see the content of the message. Following [1], we usually refer to the blackboard as the fusion center or simply the center. Note that this model captures both point-to-point communication and broadcast communication. Therefore, our lower bounds in this model apply to both the message-passing setting and the broadcast setting. We will say that a protocol is simultaneous if each machine broadcasts a single message based on its input independently of the other machines ([1] calls such protocols independent).\nWe denote the collection of all the messages written on the blackboard by Y. We will refer to Y as the transcript and note that Y ∈ {0, 1}* is written in bits and the communication cost is defined as the length of Y, denoted by |Y|. In the multi-machine setting, the estimator ~θ̂ only sees the transcript Y, and it maps Y to ~θ̂(Y) 4, which is the estimate of ~θ. Let the letter j be reserved for the index of the machine, k for the sample, and i for the dimension. In other words, ~X^(j,k)_i is the ith coordinate of the kth sample of machine j. 
We will use ~Xi as a shorthand for the collection of the ith coordinate of all the samples: ~Xi = { ~X^(j,k)_i : j ∈ [m], k ∈ [n] }. Also note that [n] is a shorthand for {1, . . . , n}.\nThe mean-squared loss of the protocol Π with estimator ~θ̂ is defined as\n\nR((Π, ~θ̂), ~θ) = sup_~θ E_(~X,Π)[ ||~θ̂(Y) − ~θ||^2 ]\n\nand the communication cost of Π is defined as\n\nCC(Π) = sup_~θ E_(~X,Π)[ |Y| ]\n\nThe main goal of this paper is to study the tradeoff between R((Π, ~θ̂), ~θ) and CC(Π).\nProving Minimax Lower Bounds: We follow the standard way to prove minimax lower bounds. We introduce a (product) distribution V^d of ~θ over [−1, 1]^d. Let's define the mean-squared loss with respect to the distribution V^d as\n\nR_(V^d)((Π, ~θ̂), ~θ) = E_(~θ∼V^d)[ E_(~X,Π)[ ||~θ̂(Y) − ~θ||^2 ] ]\n\nIt is easy to see that R_(V^d)((Π, ~θ̂), ~θ) ≤ R((Π, ~θ̂), ~θ) for any distribution V^d. Therefore, to prove a lower bound for the minimax rate, it suffices to prove a lower bound for the mean-squared loss under any distribution V^d. 5\nPrivate/Public Randomness: We allow the protocol to use both private and public randomness. Private randomness, denoted by Rpriv, refers to the random bits that each machine draws by itself. Public randomness, denoted by Rpub, is a sequence of random bits that is shared among all parties before the protocol without being counted toward the total communication. 
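To make the loss and communication-cost definitions concrete, here is a toy simulation of the simple simultaneous protocol discussed later (each machine sends its sample mean and the center averages). This is our own illustrative sketch, not code from the paper; quantization of messages to O(log m) bits per coordinate is omitted, so only the statistical part of the protocol is modeled.

```python
import random

def naive_simultaneous(theta, sigma, m, n, rng):
    """Each of m machines sends its d-dimensional sample mean (one
    'message' per machine, quantization omitted); the center averages
    the m messages to produce the estimate."""
    d = len(theta)
    msgs = []
    for _ in range(m):
        # sample mean of n i.i.d. draws from N(theta, sigma^2 I_d)
        mean = [sum(rng.gauss(theta[i], sigma) for _ in range(n)) / n
                for i in range(d)]
        msgs.append(mean)
    est = [sum(msg[i] for msg in msgs) / m for i in range(d)]
    loss = sum((est[i] - theta[i]) ** 2 for i in range(d))
    return est, loss

rng = random.Random(0)
d, m, n, sigma = 8, 20, 50, 1.0
est, loss = naive_simultaneous([0.5] * d, sigma, m, n, rng)
# The expected squared loss is d * sigma^2 / (m * n) = 0.008 here.
```

With unquantized averaging the expected squared loss equals dσ²/(mn), the centralized minimax rate; the cost of finite precision is what the log-factor discussion in Section 3 is about.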
Certainly allowing these two types of randomness only makes our lower bound stronger, and public randomness is actually only introduced for convenience.\nFurthermore, as we will see in the proof of Theorem 3.1, the benefit of allowing private randomness is that we can hide information using private randomness when doing the reduction from the one-dimensional protocol to the d-dimensional one. The downside is that we require a stronger theorem (one that tolerates private randomness) for the one-dimensional lower bound, which is not a problem in our case since the technique in [1] is general enough to handle private randomness.\nInformation cost: We define the information cost IC(Π) of a protocol Π as the mutual information between the data and the messages communicated, conditioned on the mean ~θ: 6\n\nIC_(V^d)(Π) = I(~X; Y | ~θ, Rpub)\n\n4 Therefore here ~θ̂ maps {0, 1}* to Θ^d.\n5 The standard minimax theorem says that actually sup_(V^d) R_(V^d)((Π, ~θ̂), ~θ) = R((Π, ~θ̂), ~θ) under certain compactness conditions on the space of ~θ.\n6 Note that here we have introduced a distribution for the choice of ~θ, and therefore ~θ is a random variable.\n\nPrivate randomness doesn't explicitly appear in the definition of information cost, but it affects it. Note that the information cost is a lower bound on the communication cost:\n\nIC_(V^d)(Π) = I(~X; Y | ~θ, Rpub) ≤ H(Y) ≤ CC(Π)\n\nThe first inequality uses the fact that I(U; V | W) ≤ H(V | W) ≤ H(V) holds for any random variables U, V, W, and the second inequality uses Shannon's source coding theorem [13].\nWe will drop the subscript for the prior V^d of ~θ when it is clear from the context.\n\n3 Main Results\n\n3.1 High Dimensional Lower Bound via Direct Sum\n\nOur main theorem roughly states that if one can solve the d-dimensional problem, then one must be able to 
solve the one-dimensional problem with information cost and squared loss reduced by a factor of d. Therefore, a lower bound for the one-dimensional problem will imply a lower bound for the high-dimensional problem, with information cost and squared loss scaled up by a factor of d.\nWe first define our task formally, and then state the theorem that relates the d-dimensional task to the one-dimensional task.\nDefinition 1. We say a protocol and estimator pair (Π, ~θ̂) solves the task T(d, m, n, σ², V^d) with information cost C and mean-squared loss R, if for ~θ randomly chosen from V^d, m machines, each of which takes n samples from N(~θ, σ²Id) as input, can run the protocol Π and get a transcript Y so that the following are true:\n\nR_(V^d)((Π, ~θ̂), ~θ) = R (1)\nI_(V^d)(~X; Y | ~θ, Rpub) = C (2)\n\nTheorem 3.1. [Direct-Sum] If (Π, ~θ̂) solves the task T(d, m, n, σ², V^d) with information cost C and squared loss R, then there exists (Π', θ̂) that solves the task T(1, m, n, σ², V) with information cost at most 4C/d and squared loss at most 4R/d. Furthermore, if the protocol Π is simultaneous, then the protocol Π' is also simultaneous.\nRemark 1. Note that this theorem doesn't directly prove that the communication cost scales linearly with the dimension, only the information cost. However, for many natural problems, the communication cost and the information cost are similar in one dimension (e.g., for Gaussian mean estimation), and then this direct-sum theorem can be applied. In this sense it is a very generic tool and is widely used in the communication complexity and streaming algorithms literature.\n\nCorollary 3.1. Suppose (Π, ~θ̂) estimates the mean of N(~θ, σ²Id), for all ~θ ∈ [−1, 1]^d, with mean-squared loss R and communication cost B. Then\n\nR ≥ Ω( min{ d²σ²/(nB log m), dσ², d } )\n\nAs a corollary, when σ² ≤ mn, to achieve the mean-squared loss R = dσ²/(mn), the communication cost B is at least Ω(dm/log m).\n\nThis lower bound is tight up to polylogarithmic factors. In most of the cases, roughly B/m machines sending their sample means to the fusion center, with ~θ̂ simply outputting the mean of the sample means with O(log m) bits of precision, will match the lower bound up to a multiplicative log² m factor. 7\n\n7 When σ is very large, since θ is known to be in [−1, 1], ~θ̂ = 0 is a better estimator; that is essentially why the lower bound not only has the first term we desired but also the other two.\n\n3.2 Protocol for the sparse estimation problem\n\nIn this section we consider the class of Gaussian distributions with sparse mean: Ps = {N(~θ, σ²Id) : |~θ|_0 ≤ s, ~θ ∈ R^d}. We provide a protocol that exploits the sparse structure of ~θ.\n\nInputs: Machine j gets samples X^(j,1), . . . , X^(j,n) distributed according to N(~θ, σ²Id), where ~θ ∈ R^d with |~θ|_0 ≤ s.\nFor each 1 ≤ j ≤ m' = (Lm log d)/α (where L is a sufficiently large constant), machine j sends its sample mean X̄^(j) = (1/n)(X^(j,1) + ··· + X^(j,n)) (with precision O(log m)) to the center.\nThe fusion center calculates the mean of the sample means X̄ = (1/m')(X̄^(1) + ··· + X̄^(m')).\nLet ~θ̂i = X̄i if |X̄i|² ≥ ασ²/(mn), and ~θ̂i = 0 otherwise.\nOutputs: ~θ̂\n\nProtocol 1: Protocol for Ps\n\nTheorem 3.2. 
For any P ∈ Ps and any d/s ≥ α ≥ 1, Protocol 1 returns ~θ̂ with mean-squared loss O(αsσ²/(mn)) and communication cost O((dm log m log d)/α).\n\nThe proof of the theorem is deferred to the supplementary material. Note that when α = 1, we have a protocol with Õ(dm) communication cost and mean-squared loss O(sσ²/(mn)), and when α = d/s, the communication cost is Õ(sm) but the squared loss is O(dσ²/(mn)). Compared to the case where we do not have sparse structure, we basically either replace the d factor in the communication cost by the intrinsic dimension s or replace the d factor in the squared loss by s, but not both.\n\n3.3 Improved upper bound\n\nThe lower bound provided in Section 3.1 is only tight up to polylogarithmic factors. To achieve the centralized minimax rate σ²d/(mn), the best existing upper bound of O(dm log(m)) bits of communication is achieved by the simple protocol that asks each machine to send its sample mean with O(log n) bits of precision. We improve the upper bound to O(dm) using an interactive protocol.\nRecall that the class of unknown distributions of our model is P^d = {N(~θ, σ²Id) : ~θ ∈ [−1, 1]^d}.\nTheorem 3.3. There is an interactive protocol Π with communication O(md) and an estimator ~θ̂ based on Π which estimates ~θ up to a squared loss of O(dσ²/(mn)).\nRemark 2. Our protocol is interactive but not simultaneous, and it is a very interesting question whether the upper bound of O(dm) can be achieved by a simultaneous protocol.\n\n3.4 Improved lower bound for simultaneous protocols\n\nAlthough we are not able to prove an Ω(dm) lower bound for achieving the centralized minimax rate in the interactive model, the lower bound for the simultaneous case can be improved to Ω(dm). 
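For intuition, the thresholding protocol of Section 3.2 (Protocol 1) can be sketched in code. This is our own illustrative simulation under arbitrary demo parameters (L = 4 here is a placeholder for the "sufficiently large constant"), not the authors' implementation, and the O(log m)-bit rounding of the messages is omitted.

```python
import math
import random

def sparse_protocol(theta, sigma, m, n, alpha, L, rng):
    """Sketch of Protocol 1: m' = L*m*log(d)/alpha machines send their
    sample means; the center averages them and zeroes every coordinate
    whose squared average falls below alpha*sigma^2/(m*n)."""
    d = len(theta)
    m_prime = max(1, int(L * m * math.log(d) / alpha))
    acc = [0.0] * d
    for _ in range(m_prime):
        for i in range(d):
            # sample mean of machine's n draws in coordinate i
            acc[i] += sum(rng.gauss(theta[i], sigma) for _ in range(n)) / n
    xbar = [a / m_prime for a in acc]
    threshold = alpha * sigma ** 2 / (m * n)
    return [x if x * x >= threshold else 0.0 for x in xbar]

rng = random.Random(1)
d, s, m, n, sigma = 64, 2, 10, 20, 1.0
theta = [1.0] * s + [0.0] * (d - s)  # an s-sparse mean
est = sparse_protocol(theta, sigma, m, n, alpha=d // s, L=4, rng=rng)
loss = sum((e - t) ** 2 for e, t in zip(est, theta))
```

Choosing α = d/s, as above, uses the fewest machines and hence the least communication, at the price of a d/s factor in the loss; α = 1 recovers the low-loss end of the tradeoff in Theorem 3.2.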
Again, we lower bound the information cost for the one-dimensional problem first, and applying the direct-sum theorem of Section 3.1, we get the d-dimensional lower bound.\nTheorem 3.4. Suppose a simultaneous protocol (Π, ~θ̂) estimates the mean of N(~θ, σ²Id), for all ~θ ∈ [−1, 1]^d, with mean-squared loss R and communication cost B. Then\n\nR ≥ Ω( min{ d²σ²/(nB), d } )\n\nAs a corollary, when σ² ≤ mn, to achieve the mean-squared loss R = dσ²/(mn), the communication cost B is at least Ω(dm).\n\n4 Proof sketches\n\n4.1 Proof sketch of Theorem 3.1 and Corollary 3.1\n\nTo prove a lower bound for the d-dimensional problem using an existing lower bound for the one-dimensional problem, we demonstrate a reduction that uses the (hypothetical) protocol Π for d dimensions to construct a protocol for the one-dimensional problem.\nFor each fixed coordinate i ∈ [d], we design a protocol Πi for the one-dimensional problem by embedding the one-dimensional problem into the ith coordinate of the d-dimensional problem. We will show, essentially, that if the machines first collectively choose a coordinate i at random and run protocol Πi for the one-dimensional problem, then the information cost and mean-squared loss of this protocol will be only a 1/d fraction of those of the d-dimensional problem. Therefore, the information cost of the d-dimensional problem is at least d times the information cost of the one-dimensional problem.\n\nInputs: Machine j gets samples X^(j,1), . . . , X^(j,n) distributed according to N(θ, σ²), where θ ∼ V.\n\n1. All machines publicly sample θ̆_(−i) distributed according to V^(d−1).\n2. Machine j privately samples X̆^(j,1)_(−i), . . . , X̆^(j,n)_(−i) distributed according to N(θ̆_(−i), σ²I_(d−1)). Let X̆^(j,k) = (X̆^(j,k)_1, . . . , X̆^(j,k)_(i−1), X^(j,k), X̆^(j,k)_(i+1), . . . , X̆^(j,k)_d).\n3. All machines run protocol Π on data X̆ and get transcript Yi. The estimator θ̂i is θ̂i(Yi) = ~θ̂(Y)i, i.e., the ith coordinate of the d-dimensional estimator.\n\nProtocol 2: Πi\n\nIn more detail, under protocol Πi (described formally in Protocol 2) the machines prepare a d-dimensional dataset as follows: First they fill the one-dimensional data that they got into the ith coordinate of the d-dimensional data. Then the machines choose ~θ_(−i) publicly at random from the distribution V^(d−1), draw Gaussian random variables from N(~θ_(−i), σ²I_(d−1)) independently and privately, and fill this data into the other d−1 coordinates. The machines then simply run the d-dimensional protocol Π on this tailored dataset. Finally the estimator, denoted by θ̂i, outputs the ith coordinate of the d-dimensional estimator ~θ̂.\nWe are interested in the mean-squared loss and information cost of the protocols Πi that we just designed. The following lemmas relate the Πi's to the original protocol Π.\n\nLemma 1. The protocols Πi satisfy Σ_(i=1)^d R_V((Πi, θ̂i), θ) = R_(V^d)((Π, ~θ̂), ~θ).\nLemma 2. The protocols Πi satisfy Σ_(i=1)^d IC_V(Πi) ≤ IC_(V^d)(Π).\n\nNote that the counterpart of Lemma 2 with communication cost is not true; actually the communication cost of each Πi is the same as that of Π. It turns out doing the reduction in communication cost is much harder, and this is part of the reason why we use information cost as a proxy for communication cost when proving lower bounds. 
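Concretely, the data-embedding step of Protocol 2 can be sketched as follows. This is our own hypothetical code, not from the paper: the d-dimensional protocol Π itself is abstracted away, and `theta_rest` stands in for the publicly sampled means of the remaining coordinates.

```python
import random

def embed_samples(samples_1d, i, theta_rest, sigma, rng):
    """Sketch of step 2 of Protocol 2 for one machine: place the
    machine's one-dimensional samples in coordinate i and fill the
    remaining d-1 coordinates with privately drawn N(theta_rest[j],
    sigma^2) samples."""
    data = []
    for x in samples_1d:
        # private redundant data for the other d-1 coordinates
        row = [rng.gauss(t, sigma) for t in theta_rest]
        row.insert(i, x)  # coordinate i carries the real one-dim sample
        data.append(row)
    return data

rng = random.Random(2)
samples = [0.3, 0.5, 0.4]        # one machine's one-dimensional data
theta_rest = [0.0, 0.0, 0.0]     # publicly shared means, d - 1 = 3
data = embed_samples(samples, i=1, theta_rest=theta_rest,
                     sigma=1.0, rng=rng)
# Each row of `data` is d = 4 dimensional; coordinate 1 holds the
# real sample, so running the d-dimensional protocol on `data` and
# reading off coordinate 1 of its estimate yields the 1-d estimator.
```

Note that the filler coordinates must be drawn privately, as Lemma 2 requires; drawing them from public randomness would leak them to the transcript's conditioning and break the information-cost accounting.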
Also note that the correctness of Lemma 2 heavily relies on the fact that Πi draws the redundant data privately and independently (see Section 2 and the proof for more discussion of private versus public randomness).\nBy Lemma 1, Lemma 2, and a Markov argument, there exists an i ∈ {1, . . . , d} such that\n\nR((Πi, θ̂i), θ) ≤ (4/d) · R((Π, ~θ̂), ~θ) and IC(Πi) ≤ (4/d) · IC(Π)\n\nThen the pair (Π', θ̂) = (Πi, θ̂i) solves the task T(1, m, n, σ², V) with information cost at most 4C/d and squared loss at most 4R/d, which proves Theorem 3.1.\nCorollary 3.1 follows from Theorem 3.1 and the following lower bound for one-dimensional Gaussian mean estimation proved in [1]. We provide complete proofs in the supplementary material.\n\nTheorem 4.1. [1] Let V be the uniform distribution over {±δ}, where δ² ≤ min(1, σ² log(m)/n). If (Π, θ̂) solves the task T(1, m, n, σ², V) with information cost C and squared loss R, then either C ≥ Ω(σ²/(δ²n log(m))) or R ≥ δ²/10.\n\n4.2 Proof sketch of Theorem 3.3\n\nThe protocol is described in Protocol 3 in the supplementary material. We only describe the d = 1 case; for the general case we only need to run d protocols individually, one for each dimension.\nThe central idea is that we maintain an upper bound U and a lower bound L for the target mean, and iteratively ask the machines to send their sample means to shrink the interval [L, U]. Initially we only know that θ ∈ [−1, 1]. Therefore we set the upper bound U and the lower bound L for θ to be 1 and −1. In the first iteration the machines try to determine whether θ < 0 or θ ≥ 0. This is done by letting a number of machines (on the order of log m; the precise count is given in the supplementary material) send whether their sample means are < 0 or ≥ 0. If the majority of the samples are < 0, θ is likely to be < 0. 
However, when θ is very close to 0, one needs many samples to determine this, but here we ask only this limited number of machines. Therefore we should be more conservative, and we only update the interval in which θ might lie to [−1, 1/2] if the majority of the sample means are < 0.\nWe repeat this until the interval (L, U) becomes smaller than our target squared loss. In each round, we ask a number of new machines to send one bit of information about whether their sample mean is larger than (U + L)/2. The number of participating machines is carefully set so that the failure probability p is small. An interesting feature of the protocol is to choose the target error probability p differently at each iteration so that we have a better balance between the failure probability and the communication cost. The complete description of the protocol and the proof are given in the supplementary material.\n\n4.3 Proof sketch of Theorem 3.4\n\nWe use a different prior on the mean, N(0, δ²), instead of the uniform distribution over {−δ, δ} used by [1]. The Gaussian prior allows us to use a strong data processing inequality for jointly Gaussian random variables by [14]. Since we do not have to truncate the Gaussian, we do not lose the factor of log(m) lost by [1].\nTheorem 4.2. ([14], Theorem 7) Suppose X and V are jointly Gaussian random variables with correlation ρ. Let Y ↔ X ↔ V be a Markov chain with I(Y; X) ≤ R. Then I(Y; V) ≤ ρ²R.\nNow suppose that each machine gets n samples X^1, . . . , X^n ∼ N(V, σ²), where V is drawn from the prior N(0, δ²) on the mean. By an application of Theorem 4.2, we prove that if Y is a B-bit message depending on X^1, . . . , X^n, then Y has only (nδ²/σ²) · B bits of information about V. Using some standard information-theoretic arguments, this converts into the statement that if Y is the transcript of a simultaneous protocol with communication cost ≤ B, then it has at most (nδ²/σ²) · B bits of information about V. 
Then a lower bound on the communication cost B of a simultaneous protocol estimating the mean θ ∈ [−1, 1] follows from proving that such a protocol must have Ω(1) bits of information about V. The complete proof is given in the supplementary material.\n\n5 Conclusion\n\nWe have lower bounded the communication cost of estimating the mean of a d-dimensional spherical Gaussian in a distributed fashion. We provided a generic tool, called the direct-sum theorem, for relating the information cost of the d-dimensional problem to that of the one-dimensional problem, which might be of use for statistical problems other than Gaussian mean estimation as well.\nWe also initiated the study of distributed estimation of a Gaussian mean with sparse structure. We provide a simple protocol that exploits the sparse structure, and we conjecture its tradeoff to be optimal:\nConjecture 1. If some protocol estimates the mean for any distribution P ∈ Ps with mean-squared loss R and communication cost C, then C · R ≳ sdσ²/(mn), where we use ≳ to hide log factors and potential corner cases.\n\nReferences\n[1] Yuchen Zhang, John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336, 2013.\n[2] Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. In COLT, pages 26.1–26.22, 2012.\n[3] Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Protocols for learning classifiers on distributed data. In AISTATS, pages 282–290, 2012.\n[4] Hal Daumé III, Jeff M. Phillips, Avishek Saha, and Suresh Venkatasubramanian. Efficient protocols for distributed classification and optimization. In ALT, pages 154–168, 2012.\n[5] John C. Duchi, Michael I. Jordan, Martin J. 
Wainwright, and Yuchen Zhang. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. CoRR, abs/1405.0782, 2014.\n[6] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Chi-Chih Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In FOCS, pages 270–278, 2001.\n[7] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4), 2004.\n[8] Mark Braverman and Anup Rao. Information equals amortized communication. In FOCS, pages 748–757, 2011.\n[9] Boaz Barak, Mark Braverman, Xi Chen, and Anup Rao. How to compress interactive communication. SIAM J. Comput., 42(3):1327–1363, 2013.\n[10] Mark Braverman, Faith Ellen, Rotem Oshman, Toniann Pitassi, and Vinod Vaikuntanathan. A tight bound for set disjointness in the message-passing model. In FOCS, pages 668–677, 2013.\n[11] Anat Ganor, Gillat Kol, and Ran Raz. Exponential separation of information and communication. Electronic Colloquium on Computational Complexity (ECCC), 21:49, 2014.\n[12] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321–3363, 2013.\n[13] Claude Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.\n[14] Elza Erkip and Thomas M. Cover. The efficiency of investment information. IEEE Trans. Inform. Theory, 44, 1998.\n", "award": [], "sourceid": 1407, "authors": [{"given_name": "Ankit", "family_name": "Garg", "institution": "Princeton"}, {"given_name": "Tengyu", "family_name": "Ma", "institution": "Princeton University"}, {"given_name": "Huy", "family_name": "Nguyen", "institution": "University of California, Berkeley"}]}