Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The paper is clearly written and presents an interesting approach to large-model training: it allows a GPU to train models that cannot fit into on-device memory. However, it completely ignores how fast a model can be trained after rematerialization. Given that training a large model on a single GPU already takes a significant amount of time, and that there are many alternatives for training large models without sacrificing too much computation, the paper does not provide strong enough evidence for choosing this approach over an alternative.

1. The paper claims the "feasibility" of training large models as its main motivation, but memory is not the only bottleneck, and the paper completely ignores the training-speed side of the story. Training a large model such as BERT on a single GPU (e.g., a Titan V) already takes hundreds of days without specialized optimization. The evaluation shows that the reduced memory footprint comes at the cost of a 3-4x increase in schedule length, which implies at least a 3-4x increase in training time. The actual training time could be prohibitively slow: sacrificing computation speed further lengthens an already unbearably long training run and seems equally infeasible. It would be better if the paper reported the actual training time after rematerialization.

2. The main motivation for rematerialization is to make it possible to train a large model with limited memory, but the paper fails to discuss the many alternatives that address the same problem. For example, existing DL frameworks such as TensorFlow and PyTorch support model parallelism, which partitions the computation graph of a large model and uses aggregated device memory for training. On a single node, there are approaches that reduce memory consumption by reusing memory regions [1], or by using unified memory that lets training use both CPU and GPU memory [2, 3].
Given that, the advantage of this work over that existing work is not clear.

3. A large portion of the proposed technique comes from rematerialization techniques in compilers, where tree decomposition is used to optimize register allocation. Algorithm 1 seems fairly generic and does not rely much on any domain knowledge of neural networks. The technical contribution seems incremental.

Minor: It seems possible that an increased schedule length leads to a much longer actual execution time. Although the schedule length increases only 3-4x, might the new schedule cause the computation to lose parallelism opportunities, increasing the critical path and hurting GPU utilization, and thus leading to significantly longer execution time? It would be better to report the actual end-to-end training time.

[1] K. Shirahata, Y. Tomita, and A. Ike. 2016. Memory reduction method for deep neural network training. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing.
[2] Chen Meng, Minmin Sun, Jun Yang, Minghui Qiu, and Yang Gu. 2017. Training deeper models by GPU memory optimization on TensorFlow. In Proc. of ML Systems Workshop in NIPS.
[3] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. 2016. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. arXiv:1602.08124 (Feb. 2016).

===================== After author response ====================

After reading the authors' response, I increased my score from 5 to 6: although the technique has been applied in other domains, the paper offers a good problem formalization and a theoretical analysis of trading computation for memory consumption in the ML/DL context. It is not a bad thing to release a paper that offers a different line of attack on the memory-consumption challenges of large-scale model training.
That said, it would be better to consider the computation constraint while optimizing the memory usage.
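The minor point above, that schedule length may not predict wall-clock time, can be made concrete with a toy sketch (my own illustration, not from the paper): on parallel hardware the runtime is lower-bounded by the critical path of the dependency DAG, not by the number of scheduled operations, so the two can diverge.

```python
def critical_path(nodes, preds):
    """Length (in node executions) of the longest dependency chain in a
    DAG. With enough parallel hardware this, not the sequential schedule
    length, lower-bounds execution time. `nodes` must be topologically
    ordered; `preds` maps each node to its inputs."""
    depth = {}
    for v in nodes:
        depth[v] = 1 + max((depth[p] for p in preds.get(v, ())), default=0)
    return max(depth.values())

# Diamond DAG: `a` feeds `b` and `c`, which both feed `d`. A sequential
# schedule has length 4, but the critical path is only 3 (`b` and `c`
# can run in parallel), so schedule length alone does not determine
# runtime -- and a rematerialization schedule could lengthen the
# critical path itself, which is the reviewer's concern.
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(critical_path(["a", "b", "c", "d"], preds))  # 3
```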
This paper proposes a variant of the gradient checkpointing algorithm that recursively applies checkpointing to obtain an O(log n) memory cost on general computation graphs.

First of all, I do believe that the general treatment of tree decomposition for computational memory optimization is valuable. Although some of this was also mentioned in previous work (e.g., the TensorFlow gradient-checkpointing blog post), it is helpful to have papers that formally summarize the algorithms. The authors use "materialization" (which comes from database systems), while most existing work uses "gradient checkpointing" (a terminology from AD). I think it would be helpful to clarify the relation between these terminologies to give readers better context, as they are essentially the same thing.

Given that recomputation trades additional computation for reduced memory, the authors should also report the additional computing cost alongside the memory usage -- the recursive algorithm pays extra compute to save memory. I am not sure we want to go as far as the recursive approach and pay that additional compute to save more memory -- the O(sqrt(n))-memory scheme (roughly one extra forward pass) might be good enough for many cases. Some discussion of computing cost vs. memory would be helpful. Can the current algorithm be adapted to take a memory constraint into consideration, i.e., to minimize the amount of compute needed under a given memory budget?
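To make the compute-vs-memory point concrete, here is a back-of-the-envelope cost model for plain checkpointing on a chain of n layers (my own sketch of the standard sqrt(n) scheme, not the paper's recursive treewidth algorithm):

```python
import math

def chain_costs(n, num_checkpoints):
    """Toy cost model for checkpointing a chain of n layers with k
    evenly spaced checkpoints. Every non-checkpoint activation is
    computed twice (once forward, once recomputed during backward),
    and at any moment we hold the k checkpoints plus the activations
    of one segment being recomputed.

    Returns (forward_ops, peak_stored_activations)."""
    k = num_checkpoints
    seg = math.ceil(n / k)          # layers per segment
    forward_ops = n + (n - k)       # non-checkpoints computed twice
    peak = k + seg                  # checkpoints + one live segment
    return forward_ops, peak

# n = 100 layers: storing everything costs 100 activations and 100
# forward ops; k = 10 checkpoints roughly doubles forward compute but
# cuts peak activation storage to k + n/k = 20.
print(chain_costs(100, 10))  # (190, 20)
```

Minimizing k + n/k gives k ≈ sqrt(n), i.e., peak memory ≈ 2·sqrt(n) at the price of about one extra forward pass; the recursive variant pushes memory further down to O(log n) but multiplies recomputation by another logarithmic factor, which is exactly the tradeoff the review asks the authors to spell out.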
[Originality] This paper considers an interesting rematerialization problem, motivated by the complexity of present-day neural networks and state-of-the-art hardware.

[Quality & Clarity]
* The general description using a DAG (in Section 2) is quite simple, and I feel it captures the motivation sufficiently. Given a DAG where each vertex corresponds to a computation and each edge to a data dependency, the objective is to find a short sequence of nodes with small peak memory, whose definition is natural.
* The precise problem formulation and the proposed algorithm are not convincing, though Theorem 4 itself is technically sound. Instead of optimizing either the length or the peak memory of the schedule, the authors provide an algorithm that computes a valid schedule whose length and peak memory depend on the treewidth. Since there is a tradeoff between peak memory and schedule length, it would be more natural to formulate an optimization problem: minimize peak memory (or length) given a fixed length (or peak memory). E.g., Figure 3 shows that TwRemat runs 3 times as long as the other baselines; what if one wants a schedule that is only twice as long as the baselines? Section 4.3 claims that we can interpolate between TwRemat and NoRemat by changing the recursion limit, but it seems hard to find a schedule of the desired length without trying all possible recursion limits.
* The experimental evaluation is well described; the section reports results for TwRemat and two baselines on several deep networks. It effectively demonstrates that TwRemat can significantly reduce peak memory at the price of schedule length, and that memory usage does not increase with the number of layers. It is also clearly presented that the treewidth of the computation graphs of real models is bounded.

[Significance] Overall, I thought this is a reasonable paper.
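The schedule/peak-memory definitions paraphrased above can be sketched as a small simulator (my own reading of the Section 2 setup; the paper's exact cost model may differ, e.g. it may weight values by tensor size):

```python
def peak_memory(schedule, preds):
    """Peak number of simultaneously live values for a schedule over a
    DAG. `preds` maps each node to its inputs; a node may appear in the
    schedule more than once (rematerialization). A value is freed as
    soon as it has no further use before its next recomputation."""
    live, peak = set(), 0
    for i, v in enumerate(schedule):
        if any(p not in live for p in preds.get(v, ())):
            raise ValueError(f"invalid schedule: inputs of {v} not live")
        live.add(v)
        peak = max(peak, len(live))
        rest = schedule[i + 1:]
        def needed(u):
            for w in rest:
                if w == u:                  # recomputed before next use
                    return False
                if u in preds.get(w, ()):   # consumed before recompute
                    return True
            return False
        live = {u for u in live if needed(u)}
    return peak

# Forward chain x1 -> x2 -> x3 plus its backward pass: each gradient gi
# consumes the saved activation xi, so storing everything peaks at 4.
preds = {"x2": ["x1"], "x3": ["x2"],
         "g3": ["x3"], "g2": ["x2", "g3"], "g1": ["x1", "g2"]}
print(peak_memory(["x1", "x2", "x3", "g3", "g2", "g1"], preds))  # 4
```

A rematerialization schedule repeats some forward nodes so that their activations need not stay live across the whole backward pass, lowering the peak at the cost of a longer schedule, which is exactly the tradeoff the bullet points discuss.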
This work formulates a new rematerialization problem for complex neural networks, develops an algorithm that works well in the bounded-treewidth case, and verifies its effectiveness through experimental evaluation on real deep networks.