NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Update: I have carefully read the authors' rebuttal. I have raised my score from 5 to 6 to reflect their clarification about Figure 3 and Table 6. It still seems that the speedups of the current formulation are often not of great practical significance, except for the language model, which achieved a 2x wall-clock speedup. As another reviewer noted, it is disappointing that the main paper reports only the average batch time rather than the overall training time, as that makes it unclear whether latency and other overheads between batches might be a significant concern. The author rebuttal notes that Appendix C shows time-to-accuracy, which would be good to mention in the main paper. But those results still appear mixed: for CIFAR10, SGD beats Rank 1 and seems only competitive with Ranks 2 and 4, whereas for the language model all ranks seem to convincingly beat SGD. So it is not clear that this approach, in its current form, will often produce significant improvements on a given new task.

Overall, this approach appears promising but lacks any significant theoretical support. Furthermore, the experiments mainly provide an existence proof that some wall-clock speedup is achieved using gradient compression, without putting these results in a meaningful context (e.g., what are the limitations, such as when the single-step approximation of the power iteration fails and gives worse test accuracy?). PowerSGD reduces gradient communication significantly in some experiments, but the net speedup is not proportional (e.g., for the LSTM experiment, a 10x reduction in communication only yields a 2x speedup). However, even that claimed improvement seems misleading. Figure 3 shows that with modern NCCL, nearly 8x speedup on 16 GPUs across 8 machines is achieved not only by (Rank-2) PowerSGD but also by SGD and Signum. Thus, Tables 6 and 7 seem misleading: although they appear to indicate significant practical speedups (e.g., 134 ms per epoch for PowerSGD versus 300 ms for SGD on the language-model LSTM), that only seems to be the case when using the slow GLOO backend, where SGD and PowerSGD differ significantly as shown in Figure 3, and not when the modern, commonly available NCCL backend is used for all-reduce. If that is the case, then this approach is not clearly of much practical significance as is for modern practitioners (who typically do use NCCL). So the significance of the contributions of this work is unclear, especially given the lack of theoretical justification and the limited scope of the experimental work. Algorithm 1 (rank-r PowerSGD compression) is presented without justification: it is not clear from the experiments when it works well or fails, or what existing work it is based on (no citations are offered).
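To make the rank-r compression discussed above concrete, the following is a minimal single-worker NumPy sketch of a rank-r approximation with a single power-iteration step, in the spirit of the paper's Algorithm 1. The function names, matrix shapes, and warm-started query matrix are illustrative assumptions, not the authors' reference implementation; the comments mark where the all-reduce of P and Q would occur in the distributed setting.

import numpy as np

def orthogonalize(p):
    # Orthonormalize the columns of p (Gram-Schmidt via QR), assuming full column rank.
    q, _ = np.linalg.qr(p)
    return q

def compress_decompress(m, q_prev):
    # One power-iteration step: m is an n x d gradient matrix,
    # q_prev is the d x r query matrix reused from the previous step.
    p = m @ q_prev            # n x r  (would be all-reduced across workers)
    p = orthogonalize(p)
    q = m.T @ p               # d x r  (would be all-reduced across workers)
    m_hat = p @ q.T           # rank-r approximation of m
    return m_hat, q           # q is carried over as a warm start

# Toy usage: relative error of a rank-2 approximation of a random gradient.
rng = np.random.default_rng(0)
grad = rng.standard_normal((256, 128))
q0 = rng.standard_normal((128, 2))
approx, q1 = compress_decompress(grad, q0)
print(np.linalg.norm(grad - approx) / np.linalg.norm(grad))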
Reviewer 2
This paper studies gradient compression methods for distributed optimization. The existing sign-based and top-k methods rely on an all-gather operation, which does not scale well with an increasing number of workers. In this paper, a rank-r PowerSGD compression method is proposed that allows using all-reduce, which has better scalability than all-gather. The rank-r PowerSGD compression uses a one-step power iteration and is therefore computationally inexpensive. Extensive experiments are conducted to demonstrate the efficiency of the proposed method. It is good to see the ablation study in the experiments. The paper is well written. Overall, I think this is a good paper, but the theoretical parts and the experiments can be further strengthened.

It is not clear to me why PowerSGD enjoys linearity. In particular, why does Lemma 3 in the supplementary material hold? As can be seen in Algorithm 1, the construction of the matrix Q involves matrix multiplication between M_i and M_j for i \neq j, and therefore \hat{P} Q^T \neq \frac{1}{W} \sum_{i=1}^W \hat{P}_i Q_i^T, where \hat{P}_i and Q_i are obtained by performing the compression on each i-th worker's gradient. Because of this, I do not see why PowerSGD can have linearity (see the numerical sketch at the end of this review).

There is no rigorous convergence analysis of Algorithm 1 and Algorithm 2; the supplementary material only gives a short description of which techniques could be applied. A similar setting is considered in [Yu et al., 2018], which proposed a PCA-based compression for all-reduce; it would be good to compare the proposed method with it. In Section 4.1, an unbiased low-rank approximation without feedback is compared; why not simply compare to PowerSGD without feedback? The test accuracy of 94.3% for ResNet18 trained with SGD on CIFAR-10 is much better than the 91.25% result in the original paper [He et al., 2016]; I wonder which difference in the experimental setting contributes to this improvement? Finally, distributed training on a small-scale dataset such as CIFAR-10 is not very interesting; it would be more convincing to conduct experiments on training a deep ResNet on the ImageNet dataset.

Yu et al. GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training. NIPS 2018.
He et al. Deep Residual Learning for Image Recognition. CVPR 2016.

=====================================
I have read the rebuttal, and my score remains the same. I would encourage the authors to take the time to study convergence with the k-step power method and to perform experiments on a larger dataset such as ImageNet.
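Regarding the linearity question raised above, the following small NumPy check illustrates the property that all-reduce relies on, under the assumption that the query matrix Q is shared by all workers and held fixed from the previous step (the worker count W and matrix shapes are illustrative assumptions, not taken from the paper).

import numpy as np

rng = np.random.default_rng(1)
W, n, d, r = 4, 64, 32, 2
grads = [rng.standard_normal((n, d)) for _ in range(W)]   # per-worker gradients M_1 ... M_W
q_shared = rng.standard_normal((d, r))                    # same Q on every worker

p_avg_of_workers = sum(m @ q_shared for m in grads) / W   # all-reduce (average) of P_i = M_i Q
p_of_avg_grad = (sum(grads) / W) @ q_shared               # P computed from the averaged gradient
print(np.allclose(p_avg_of_workers, p_of_avg_grad))       # True: P is linear in M when Q is shared

# By contrast, if each worker computed its own Q_i from its local M_i (e.g. via a
# local SVD), averaging P_i Q_i^T would generally not equal a rank-r approximation
# of the averaged gradient, which is the concern expressed in the review above.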
Reviewer 3
Updated review: After carefully reading the rebuttal, part of my concerns are addressed, as the authors state that they will extend the experimental study to a larger scale. However, I do not wish to change my overall score. I agree with Reviewer 1 and Reviewer 2 that providing a convergence analysis would strengthen this work, but it may take time. Moreover, I maintain my suggestions that (i) the authors should consider studying PowerSGD in the setting where the global batch size is fixed, since this would lead to more promising speedups, and (ii) end-to-end speedups need to be clearly illustrated, e.g., by adding a table that reports the speedup of PowerSGD over various baselines in reaching certain accuracies.
-------------------------------------------------------------------------------------------------
This paper proposes PowerSGD, a low-rank gradient compression algorithm for speeding up distributed training. Compared to the previously proposed method ATOMO [1], PowerSGD avoids ATOMO's computationally expensive SVD step and thereby attains better scalability. PowerSGD is a biased gradient compression technique, which makes it hard to conduct a solid theoretical analysis. Moreover, unlike most previously proposed communication-efficient distributed training algorithms, PowerSGD is the first gradient compression algorithm that is compatible with all-reduce based distributed training. The paper is very well written and well motivated. Extensive experimental results indicate that PowerSGD has good scalability and end-to-end convergence performance.
[1] https://papers.nips.cc/paper/8191-atomo-communication-efficient-learning-via-atomic-sparsification