Summary and Contributions: Model parallelism is essential for effectively training large models on accelerators with limited memory. Pipelining seems to be one successful way to do this easily. This work posits a Combinatorial Optimization problem which can be solved to find the optimal way to partition a large model across multiple homogeneous devices. == AFTER AUTHOR FEEDBACK == The authors' do a good job of addressing some of the concerns that I have raised in my review. I think my original score still stands. However, I am not fully convinced that all the issues have been sufficiently resolved (cost model vs learning from raw reward signal etc.)
Strengths: [+] This work is very relevant to the current need of model parallelism and pipelining to train large deep neural networks. [+] It proposes a novel Comb. Optimization formulation which can be solved to get the optimal placements. For popular models that get reused thousands of times on standard device configurations, knowing the optimal placement can be very useful. [+] Evaluation is done on some of the recent state-of-the-art models (e.g. BERT). [+] Simple DP / IP formulation means it is easy to implement and try out.
Weaknesses: [-] Number of nodes in the graphs seems to be quite low (~200 for GNMT). GDP reports several tens of thousands of nodes? Is there some manual grouping operation performed on the computational graph? If so, wouldn't this be a significant drawback? Does a "node" correspond to an op run on a device? Few more details on how the graphs are pre-processed could be helpful. [-] Evaluation does not compare with other learning based approaches of model parallelism (for instance to GDP). [-] All the graphs have only up to a few thousand nodes at max, where as the state-of-the-art places graphs with tens of thousands of nodes (see REGAL/GDP/ Hierarchical Device Placer). [-] Unlike the recent line of work using RL. ( see Azalia et al.,), this approach makes use of a fairly sophisticated cost model. How hard is this to build for different configurations? What if we would like to extend this to devices across multiple machines? RL approaches require re-training with the same measured reward signal, where as the proposed approach would need sophisticated config specific modeling.
Correctness: Evaluation methodology is thorough. Claims are easy to verify.
Clarity: Paper is fairly easy to understand.
Relation to Prior Work: This paper does a good job of summarizing and comparing with all the prior work. I do not believe there have been any omissions in this regard.
Summary and Contributions: This paper proposes an efficient algorithm for model parallelism, which is an interesting topic for model acceleration. They pay attention to two aspects: latency minimization and throughput maximization with the corresponding strategy. The experiments show the efficiency of their proposed methods.
Strengths: 1. Model partitioning across different devices is a practical field and it’s very useful in model deploying for real-world application. Recently, more and more models have huge parameters, which are difficult to be deployed directly on some hardwares. Thus, this problem setting is interesting. 2. They take both contiguous and non-contiguous split scenarios and provide both solution for model deployment. 3. They consider many famous models like BERT and ResNet-50 for experiment. The sufficient results demonstrate the efficiency of their algorithms.
Weaknesses: 1. The algorithms are lack of novelty. Even though the problem setting is a very interesting topic, the proposed algorithms are somewhat trivial. For instance, the algorithm just put different mini-batches on different devices and performs feed-forward and back-forward propagation. 2. I am confused about Figure 3(a). Why the time of sample on Device 2 takes much more time compared the sample on Device 1? And there is no explanation about it.
Correctness: The claims and method are correct and the empirical methodology is correct.
Clarity: The paper is well written.
Relation to Prior Work: It is clearly discussed how this work differs from previous contributions.
Summary and Contributions: The paper considers the problem of partitioning of a deep neural network (DNN) model on a given set of devices (for e.g. accelerators like GPUs, TPUs etc.) with their memory and interconnect constraints while optimizing certain metrics of interest (for e.g. latency and throughput). Two different settings of non-pipelined model-parallel inference/training and pipelined-parallel inference/training are considered. A combinatorial optimization formulation of the problem along with two algorithms (Dynamic Programming and Integer Programming based) as solutions are provided. Experiments are performed on multiple state-of-the-art DNNs.
Strengths: Soundness of the claims: All the claims are backed by experimental evaluation. A clear description of the Integer programming formulation is provided which is correct. Significance and novelty: Efficient training and inference of DNNs is an important problem. The algorithms are novel for the considered setting. A good feature of the proposed approach is its generality which allows it application to potential any neural-network architecture.
Correctness: Experimental methodology seems correct and the proposed approach shows good improvement over multiple good baselines in terms of both throughput and latency objectives.
Clarity: The paper is well-written with clear description of all the main components of the proposed approach.
Relation to Prior Work: Related work is discussed clearly. Proper description of the differences between the proposed approach and the related work is given. The key distinction is that the proposed method builds a cost model of the metrics of interest and solves an offline optimization problem as compared to previous work that focuses on optimizing the objectives in an online manner while evaluating each design on an actual system or a simulator.
Additional Feedback: Update after rebuttal: I didn't have any question and I am happy with author's response. Therefore, my review remains the same.