Paper ID: | 5205 |
---|---|

Title: | Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models |

Without rehashing the "contributions," the authors have done a great job of justifying their approach theoretically, and I enjoyed reading the paper. It is very clearly written, and the approach is well introduced. There are a couple small concerns. First, the paper is extremely dependent upon the supplemental material, and it's difficult to sufficiently understand the approach without reading it. Second, the experiments are not actually that overwhelming in practical performance; they are only slightly better or merely comparable to other approaches. Finally, I don't think "Vol" is defined in Theorem 6, so I don't know what's going on there. There is a great deal of intuitive justification for the approach based on the idea that averaging between workers can result in pulling the worker out of a minimum. This seems just as likely to happen to the non-leader workers in this approach. Is this indicated in the schedule of the leader switching? Does it tend to stick with a single leader for a long time, while the other workers are stranded in averaged locations?

This paper proposed a new algorithm called LSGD for distributed optimization in non-convex settings based on pulling workers to the current best performer among them, rather than their average at each iteration. Originality: using the leader stochastic gradient descent is interesting. Quality: the theoretical analysis is rigorous. Clarity: the paper is easy to understand including the motivating example, and the problem formulation. They do not compare the convergence rate with existing methods theoretically, which is important to see the theoretical improvement.

After rebuttal: I have carefully read the authors' response. Unfortunately, I do not think my concerns are well addressed. (1) intuitively, i cannot see why many candidate leaders will be in directions of negative curvature; (2) there is neither theoretical nor empirical evidence that x_i's of all workers will lie in the same valley; (3) the test perplexity of additional experiment with LSTM on the NLP task looks very bad. See Table 2 in "Regularizing and Optimizing LSTM Language Models" for comparison; (4) the performance of SGD on a single GTX1080 GPU does not tell how it performs with multiple workers (larger mini-batch size); (5) selecting learning rate based on the test error is not a good practice. For machine learning, we should select the hyper-parameters according to the accuracy on a hold-out validation set. Considering the above five points, I decide to keep my score unchanged. This paper considers distributed training for nonconvex models such deep neural networks. Different from EASGD, the proposed leader gradient descent (LGD) replaces the averaged parameters with the parameters on "leader" worker, which gives the smallest objective value among all the workers. To further reduces the communication cost, the leader worker is updated every finite number of iterations, and asynchronous execution is considered. As the evaluation of the full objective is expensive, the authors proposed a stochastic extension, leader stochastic gradient descent (LSGD), in which the leader is selected based on the estimated value on a subsampled batch. The theoretical results show: 1) using guiding point improves one-step SGD and has an asymptotic O(1/t) rate for strongly convex problems; 2) the stationary points of the perturbed objective are the stationary points of the original objective; 3) leader selection makes the update direction more close Newton direction. The experiments on image classification task show that the proposed method outperforms EASGD. Overall, this is an interesting paper. However, I raise some questions, which need to be addressed by the authors, as follows: 1. The theoretical results only apply to the deterministic and simplified version of the proposed asynchronous LSGD algorithm. This makes it difficult to understand the convergence behaviour of the actual algorithm used in the experiment. For instance, as shown in Theorem 3, the stochastic leader selection contributes an extra error term and the convergence is not guaranteed unless the importance of the leader decreases over time. The LSGD method also behaves like local SGD method, and number of rounds for synchronization plays a significant role in the convergence. However, there is no analysis on how this hyper-parameter affects the convergence rate of the proposed approach. 2. In Section 3.3, it is shown that at least half of the candidate leaders result in a search direction that is closer to Newton direction. This is a good result for convex optimization. However, as the focus of the paper is on nonconvex problems and it has been shown that Newton iterates are attracted to saddle points (Dauphin et al., 2014), the claimed statement is not convincing in the context of the training of the deep neural networks. 3. The authors state that the averaging step hurts the convergence, which is true in the symmetric non-convex landscapes. However, there are some cases that the averaged iterate improve the performance, such as training AWD-LSTM model for language modeling and Transformer network for translation task. Thus, it is also good to conduct experiments on such NLP tasks to see how the LSGD method behaves. Besides, as claimed in the paper that the averaging step hurts the performance, then why don't you use the leader's parameter instead of the average of the parameters of all the workers, which is used in the experiment, to perform the testing? 4. SGD is compared on all the CIFAR-10 experiments. However, SGD is not compared on training the ResNet50 on ImageNet dataset. I guess that SGD can perform significantly better than LSGD in this case. I am also curious about how the proposed method performs on a deeper ResNet (e.g., ResNet110) on the CIFAR-10 dataset. 5. There is no description on how the best learning rate is selected for each method in the supplementary. 6. As the authors claim that LSGD is more communication-efficient, it is necessary to also report the breakdown of the total time into computation and communication costs. Minor comments: 1. In Theorem 4, as x^i has been previously defined, it would be better to use different symbol. 2. The definition of f^i is missing in Theorem 4. 3. There are many typos in the paper/supplementary. Dauphin et al., 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014.