NIPS 2018
Sun Dec 2nd through Sat Dec 8th, 2018, at Palais des Congrès de Montréal
Paper ID 685: Gradient Sparsification for Communication-Efficient Distributed Optimization

### Reviewer 1

This paper focuses on large-scale machine learning tasks in a distributed computing setup, with the objective of making the underlying optimization procedures communication-efficient. In particular, the paper considers the problem of reducing the overall communication incurred when collecting the stochastic gradients of the objective function at hand. The authors propose a sparsification approach where, during each iteration of the optimization procedure, every coordinate of the stochastic gradient is independently replaced with zero with a certain probability. The non-zero coordinates are then scaled appropriately to ensure that the sparsified gradient remains an unbiased estimate of the true gradient (a sketch of this step is given after the review). The (scaled) sparse gradient vector can then be encoded up to a desired level of precision.

The randomized sparsification of the (original) stochastic gradient yields another stochastic gradient with higher variance, which may degrade the convergence rate of the optimization procedure. Keeping this in mind, the authors tune their sparsification procedure to ensure that the variance of the resulting stochastic gradient is not too high. This allows them to simultaneously achieve both a good convergence rate and communication efficiency. In particular, the authors prove that their approach performs very well when the original stochastic gradient is sparse or approximately sparse. The proposed scheme is practical and can be combined with various other existing approaches for reducing the communication overhead in a distributed computing setting. The authors conduct extensive experiments involving logistic regression and neural network training to demonstrate the effectiveness of their approach.

The paper is well written, except for some minor typos listed below. It would be great if the authors could briefly mention a few real-life scenarios where one might expect to encounter gradient vectors that are sparse or approximately sparse.

Typos:
1) Line 221: Figure 4 --> Figure 3
2) Line 232: Figure 4 --> Figure 3
3) Line 241: Figure 4.1 --> Figure 4
4) Lines 277 and 279: Figure 4.1 --> Figure 5
5) The authors use $b_i$ to denote the labels. Please consider changing this notation, as it conflicts with the number of bits $b$.
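As a concrete illustration of the sparsify-and-rescale step described in the review, here is a minimal NumPy sketch. The function name `sparsify_gradient` and the magnitude-proportional choice of keep-probabilities are illustrative assumptions made here; the paper instead tunes these probabilities to explicitly control the variance of the resulting estimator.

```python
import numpy as np

def sparsify_gradient(g, target_density=0.1, rng=None):
    """Unbiased random sparsification of a stochastic gradient.

    Coordinate i is kept independently with probability p_i and, if kept,
    rescaled by 1/p_i, so the output has expectation g. Here p_i is taken
    proportional to |g_i| (capped at 1) as a simple illustrative choice.
    """
    rng = np.random.default_rng() if rng is None else rng
    abs_g = np.abs(g)
    # Scale chosen so that the expected fraction of kept coordinates
    # is roughly target_density (before the cap at 1).
    scale = target_density * g.size / max(abs_g.sum(), 1e-12)
    p = np.minimum(scale * abs_g, 1.0)   # keep-probabilities in [0, 1]
    mask = rng.random(g.size) < p        # independent coin flips
    out = np.zeros_like(g)
    out[mask] = g[mask] / p[mask]        # rescale survivors: unbiasedness
    return out

# Sanity check: averaging many draws recovers the original gradient.
g = np.array([0.5, -0.1, 0.01, 2.0, 0.0])
est = np.mean([sparsify_gradient(g, target_density=0.4) for _ in range(20000)], axis=0)
print(np.round(est, 2))  # approximately [0.5, -0.1, 0.01, 2.0, 0.0]
```

Note that coordinates kept with small probability are inflated by a large factor 1/p_i, so the variance of the sparsified gradient grows as the density shrinks; this is exactly the sparsity-versus-variance trade-off the review highlights.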