Paper ID: | 3435 |
---|---|

Title: | Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation |

Originality: The proposed compression scheme involves significant novelty. The paper clearly cites the prior work and contrast its contribution from the prior work. As claimed in the paper, this paper proposes the first compression scheme (along with tight up to constants) that characterizes the effect of underlying sparsity on the communication-cost and MSE trade-off for the DME problem across all sparsity regimes. Quality: As far as the reviewer can tell, the results presented in the paper are correct. The paper is well organized and it definitely constitutes as a complete piece of work. Clarity: The paper is well written and clearly conveys the main ideas. In fact, the paper is able to nicely summarize all the main ingredients of the proofs in the main text. (Please see below for some minor typos.) Significance: The DME problems is of great importance in distributed computing setups as it arises in many tasks as a subroutine, e.g., distributed gradient descent, distributed k-means. The paper makes significant contributions towards this problem when the underlying vectors are sparse (which is also quite prevalent in the practice). Given the number of recent research efforts in the domain, the reviewer believes that this work is certainly going to receive the attention of other researchers. ########## Post-rebuttal ############# I have gone through the response of the authors and other reviewers' comments. I think I am comfortable with my previous score.

Positives: - an almost tight answer to a fundamental problem - good practical performance on some tasks - the lower bound argument is quite nice Negatives: - writing and positioning w.r.t. related work can be improved - experiments on training small CNNs or LSTMs with sparsity could be added, but are not absolutely necessary I am quite positive about this paper. My main observation to the authors is to stress less about the relation with previous work, and stress more about explaining your algorithm in detail. For instance, I think the discussion in lines 101-115 which basically re-explains how your algorithm is different to a finer level of detail could be left for *after* you've actually described your algorithm and its guarantees. You could have a whole sub-section for that, where you can carefully cover the similarities and differences. Currently, the draft is very defensive on this point; this makes sense, but notice that you are the first to be proving tight bounds, so it's kind of OK if your algorithm is not super-novel (given the amount of activity in this area one wouldn't really expect that anyway). The main shortcoming of your algorithm though is from the practical side: you need to synch the normalization factor, which means that your algorithm is two-rounds. This could potentially negate the benefits of reduced communication. It would be nice if you could comment on this. The paper has many typos (some collected below), which should be fixed. - maybe add "asymptotically" optimal to the title. sometimes in communication complexity people actually care about constants (see e.g. list decoding) 105 -> doen't 110 -> trade-off of 101 -> 115: maybe move this discussion for *after* you've presented your algorithm? 128 -> summarizes 157-> sends it 181-182: repeated discussion

*Originality and Significance* I think the contribution of the current paper is quite incremental over [2,20] and other works on similar topic, some of which the authors have not cited (such as the ATOMO algorithm which provides a method for handling sparse vectors). But more than this, I feel the authors have failed to appreciate a few points: 1. The authors have allowed small interaction between the parties and even downlink communication from the centre to the parties. This communication is critical for normalising the coordinates and using a quantiser with a fixed dynamic range. However, one of the main points in [20] was to rotate randomly so that this range can be determined by \ell_2 norms, which was assumed to be bounded and need not be described. Authors have a rather simple scheme, but it still requires coordination between the parties by the centre. I feel that random rotations such as randomised Hadamard transform should have been used to avoid this. 2. The lower bound seems to only consider SMPs and doesn't allow for the downlink communication by the centre. In this sense, I feel that the optimality of the proposed scheme has not been established in a strict sense. The authors continue with the running assumption of [2,20] that the norm can be described using a fixed number of bits that can be ignored. I find this assumption a bit absurd for theoretical contributions, but am now accustomed to seeing it in papers on this literature. However, the authors should clearly specify the scope of their lower bounds and allowed communication protocols. 3. There is not much innovation in the scheme or the lower bound. Both seem to follow the template of [20]. The main contribution is the choice of normalising parameter F_1/C in the scheme which allows the authors to exploit sparsity. I like authors choice of Hoyer's measure of sparsity since it will seamlessly bridge between sparse and non-sparse data and will not require a separate evaluation of sparsity before deciding which scheme to use. But the technical contribution beyond prior work is minimal and doesn't suffice for publication at NeurIPS. *Clarity* I think the presentation is rathe clear. Personally I would have preferred more exhaustive set of experiments with applications in distributed optimisation, such as those in [2], but I am happy that the authors made an effort to compare their results with those in [20] and repeated the same experiments. Also, the reported improvement shows that indeed the data in these experiments comprises sparse observations, a point which was raised in some earlier work as well. Even the SGD updates for training CNNs on MNIST are sparse. [Update after reading the rebuttal] I have to admit that my first reading of the paper was less thorough than I wanted it to be, I will blame the short review cycle for it. I read a few sections again in order to fix the "factual errors" in my review. Here are my additional comments: 1. Lower bound: For some reason, I saw the distribution and information complexity popping-in, and just assumed that there is a Fano bound hidden somewhere. Indeed, this was a gross oversight on my part, apologies. Looks like the main idea was to use Lemma 3.4 to relate MSE to the conditional entropy H(X|\Pi) for a specific distribution. It is an interesting argument, but its relation to existing lower bounds in distributed Gaussian mean estimation has not been clarified. Since the proof doesn't look difficult, I can't comment on the innovation here. 2. The scheme: I think my evaluation of the scheme was not a "major factual error." It is indeed very simple extension of prior work: The main difference is the choice of F_1 to normalize the values so that the clients can use appropriate interval for uniform quantization. But this F_1 had to be computed based on norms of all the vectors and prior work was not at all looking at this case. The authors have clarified how their goal was to capture "local sparsity" and sort of global sparsity at the same time. That's interesting. Overall, I think the paper is acceptable for NeurIPS.