NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6703
Title: Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning

Reviewer 1

Originality: The method (pipelining + smoothing) is derivative. At the same time, I am not certain whether the proof techniques are original. Overall, I would consider the paper to have low-to-medium originality.
Quality: The paper presents a thorough and important collection of theoretical results, and provides excellent insight into the problem setting. There are some issues with the way the empirical results are presented (see the improvements section). Overall, medium-high quality.
Clarity: I found the paper well written and easy to follow. High clarity.
Significance: Although the authors were mostly concerned with the non-smooth, low-sample setting, I find that the paper addresses an important intersection of topics: ML systems (pipelining), optimization theory, and deep learning. This intersection will only continue to grow in importance, and work such as this paper is highly significant.

Reviewer 2

Update after reading the authors' feedback: With the additional information on connections with existing work and the comments on the practical setup, I'm willing to update the score to 7.

The characterization of distributed deep learning in the form of pipeline optimization seems to be a novel contribution, and the convergence results, particularly when combined with randomized smoothing, look reasonable. Some comments on clarifying the contribution:

1. The relationship between the proposed pipeline parallel optimization setting and existing work is not clear. Does it contain related work as special cases? The authors mention in the abstract that the presented study is distributed per layer instead of per sample. It would be helpful to give additional comparison along this line.
2. The manuscript seems to be short on details about the distributed computing mechanism. This is briefly touched on in Section 2, on asynchronous value/gradient evaluation. Additional discussion of, for example, the distributed framework and scalability could add more practical value to the submission. This part is also unclear in the evaluation section of the paper. The improvement over GPipe discussed in Section 5 shows an interesting trade-off; however, as the authors mention, those conditions are seldom seen in practice, and the experimental setup seems artificial.
3. At the beginning of Section 4, the authors mention that acceleration is possible. What is the counterpart against which the method is evaluated?

While the manuscript is an interesting read from the theoretical perspective, the reviewer would like to see additional evidence of practical impact, such as improvement over state-of-the-art methods on well-studied applications.

Reviewer 3

I read the authors' response, which addresses the concerns raised, especially regarding the general applications and the experimental results shown. I raise my rating to 7.

The theoretical findings and contributions of this work are of general significance and give a nice overview of pipeline parallel optimization for different classes of functions. Further, the introduced PPRS optimization algorithm for non-smooth (and potentially almost non-smooth, e.g. L >>) functions is of general interest to the ML / DL community. However, the major problem of this work is that the hypothesized benefits of PPRS are not backed up empirically, e.g. for Section 5, and the experimental Section 6 also seems unreliable in its current state. Due to my limited overview of the current optimization literature, it is hard for me to judge the originality of this work, and especially of the proposed PPRS algorithm.

Clarity: The paper is well written and well structured. Theoretical concepts and theorems are described in an understandable manner. The figures and tables are also of high quality. Apart from the challenges that arise from parallel computing, it seems that one could easily implement the PPRS algorithm on the basis of the descriptions provided.