NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
[I have read the author response; no change in score, as my confidence is the limiting factor.]

In eq. (3), it seems to me that "m" is not introduced; it is only defined in words later. As I was reading the result, I kept thinking about this question: what is the number of samples needed for optimal rates if the communication network is a fully connected graph? This is an interesting special case, which should be implied by existing work and is probably worth highlighting in the paper. How does that special case compare to the presented result? I imagine there is a gap. If so, this would open another direction for future work: the paper presents two extremes, but how does the threshold depend on the network topology?
Reviewer 2
This paper proves optimal generalization error bounds for Distributed Gradient Descent in the context of multi-agent decentralized non-parametric least-squares regression when i.i.d. samples are assigned to agents. The derived results show that if the agents hold sufficiently many samples relative to the network size, then Distributed Gradient Descent with a number of iterations depending on a network parameter achieves optimal rates. Provided the “communication delay” is sufficiently small, Distributed Gradient Descent yields a linear speed-up in runtime compared to the single-machine protocol. To the best of my knowledge, the derived result is the first statistical result for Distributed Gradient Descent in the context of multi-agent decentralized non-parametric least-squares regression, making an important contribution to understanding the generalization properties of decentralized distributed gradient descent.

Overall, I find the results very interesting. The novel key step of the proof is the estimation of the network error. I did a high-level check of this proof step and did not find anything wrong. The presentation is fine, although I believe it can be further improved (for example, by making some of the statements more concise).

Minor comments:
- Line 145 of Page 4: the marginal distribution on X…
- The presentation in the supplementary material could be further improved. For example, the notation $T_{\rho}$ in Line 93 on Page 3 is not introduced.
- The proof in the appendix involves a lot of notation, but it seems to me this notation is necessary.
- Could you derive similar results for the non-attainable cases?

-------- After Rebuttal --------
I have read the rebuttal. I am happy with the acceptance of the paper.
Reviewer 3
Update: I have read the author response and appreciate that they addressed some of my comments. I'm leaving my score unchanged, mainly due to the remaining limitations.

--

This paper studies the performance of decentralized gradient descent for nonparametric (linear) regression (in an RKHS). The focus is on obtaining statistical guarantees on generalization. This is a highly relevant direction for the growing body of work on decentralized training. The paper is generally well written, contains very original ideas, and I was very excited to read it. The main reason I didn't give a higher rating is the set of limitations listed at the beginning of Sec 5. I commend the authors for acknowledging them. Nevertheless, they make the reader question the relevance of the results (e.g., not covering the case of finite-dimensional linear regression). I only have a few suggestions/questions.

1. In line 254 it may be worth emphasizing that the relaxation time 1/(1 - \sigma_2) in the setting considered (static, symmetric, doubly-stochastic P) is at worst polynomial in n, so the condition on n is not very restrictive (see the small numerical sketch appended after this review).

2. In line 275, please elaborate on the communication model. I would have expected the denominator on the right to involve the product of \tau and Deg(P), or otherwise for \tau to at least be increasing in Deg(P), since having to communicate with more neighbors should require more communication. How would that affect the conclusions in Sec 3.1? Aside: such a model would be similar to the one considered in Tsianos and Rabbat, NIPS 2012, which also investigated protocols that progressively sparsify communication over time (relevant to the discussion in Sec 5).

3. The limitations listed at the beginning of Sec 5 seem pretty severe. Could you elaborate on why modified bounds for the finite-dimensional setting would only hold for "easy" problems, and what the challenge is in generalizing these results to other finite-dimensional problems? (Note: it seems questionable to focus on the infinite-dimensional setting for decentralized algorithms, since it is no longer clear how to communicate infinite-dimensional quantities using a finite number of bits so that the communication delay is finite.)

4. The result in Sec 3.1 begins to hint at a tradeoff in terms of the parameter \tau. I would have liked to see more discussion around this. In particular, for very large \tau, should a regime appear where it is most efficient for all nodes to work independently (for m sufficiently large)?

5. Regarding accelerated gossip, results similar to those mentioned are also proved in Oreshkin, Coates, and Rabbat, "Optimization and analysis of distributed averaging with short node memory," IEEE Trans Signal Processing, 2010.
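To illustrate point 1, here is a minimal numerical sketch (my own illustration, assuming a lazy nearest-neighbour gossip matrix on the n-cycle, which is not necessarily the P used in the paper) showing that the relaxation time 1/(1 - \sigma_2) grows only polynomially, roughly like n^2, even on this poorly connected topology:

    import numpy as np

    # Lazy nearest-neighbour gossip matrix on the n-cycle:
    # P[i, i] = 1/2 and P[i, (i +/- 1) mod n] = 1/4.
    # It is static, symmetric and doubly stochastic, and its relaxation
    # time 1/(1 - sigma_2) grows roughly like n^2 / pi^2, i.e. only
    # polynomially in n.
    for n in (10, 100, 1000):
        P = np.zeros((n, n))
        for i in range(n):
            P[i, i] = 0.5
            P[i, (i - 1) % n] = 0.25
            P[i, (i + 1) % n] = 0.25
        eigvals = np.linalg.eigvalsh(P)   # ascending order; P is symmetric
        sigma2 = eigvals[-2]              # second-largest eigenvalue
        print(n, 1.0 / (1.0 - sigma2), n**2 / np.pi**2)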