Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The proposed approach is compelling in its simplicity, which is a good thing, especially given its effectiveness. It is more of an engineering solution than a mathematically motivated method. It is highly effective and clearly presented. The individual aspects of the approach (compaction, retention of critical connections, etc.) have been explored in individual methods, but this seems a compelling and novel combination of them. I liked the ablative dissection of the approach into existing methods as special cases in Section 1: Method Overview.
As mentioned, this approach is more engineering than mathematically principled. That's fine given its effectiveness, but it would be useful to develop or combine the mathematical foundations to better motivate the approach.
The use of the retention masks would seem to prevent new tasks from fine-tuning previously learned models, limiting the amount of reverse transfer. Progressive nets are incapable of reverse transfer, which is another large limitation of that approach. Is your CPG approach similarly limited, and if not, how is it capable of fine-tuning previous models based on new tasks?
The experimental analysis is good, but lacks key details on the specific setups for all experiments. These should have been added as supplemental material, and the final paper would need to be revised to include them.
Tables 1 and 2 would be much better presented as line or bar graphs. Are these peak performances, or performances after training on all tasks (which might be the same)? If the method did incur some sort of reverse transfer or fine-tuning from subsequent learning, it would be good to examine learning curves of all tasks over time, and also to measure any amount of forgetting.
POST-RESPONSE: Thanks for your comments. The biological motivation you mentioned isn't nearly as satisfactory as a solid mathematical motivation, although it would be quite difficult to come up with the math/theory behind such an engineered approach.
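The request above to measure forgetting and reverse transfer can be made concrete with a small sketch. It assumes the authors can log an accuracy matrix `acc[i][j]` (accuracy on task j after training through task i); the function name and all numbers below are hypothetical illustrations, not results from the paper. Forgetting is taken as the best earlier accuracy minus the final accuracy, and backward transfer as the final accuracy minus the accuracy right after the task was learned, which are the usual definitions in the continual learning literature.

```python
def forgetting_and_backward_transfer(acc):
    """Per-task forgetting and backward transfer from an accuracy matrix.

    acc[i][j] is the accuracy on task j measured after training through
    task i (only entries with i >= j are used; others may be None).
    The last task is excluded, since nothing is trained after it.
    """
    num_tasks = len(acc)
    final = acc[num_tasks - 1]
    forgetting = []
    backward_transfer = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any point before the final evaluation.
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        forgetting.append(best_earlier - final[j])
        # Positive values mean later tasks improved task j.
        backward_transfer.append(final[j] - acc[j][j])
    return forgetting, backward_transfer


# Hypothetical numbers for illustration only.
# Old weights frozen by masks: accuracies never change after a task is learned.
frozen = [
    [0.90, None, None],
    [0.90, 0.85, None],
    [0.90, 0.85, 0.80],
]
# A method whose old-task accuracy drifts as new tasks arrive.
drifting = [
    [0.90, None, None],
    [0.80, 0.85, None],
    [0.70, 0.80, 0.75],
]

print(forgetting_and_backward_transfer(frozen))
print(forgetting_and_backward_transfer(drifting))
```

If the retention masks truly freeze earlier weights, both quantities would be zero for every task, which would also answer the reverse-transfer question in the negative.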
Summary: This paper combines three existing continual learning approaches, PackNet, Piggyback, and ProgressiveNets, to reduce forgetting. Upon learning a task, as in PackNet and Ref#49, the network is gradually pruned; then Piggyback is used to learn a differentiable mask; and when the network’s capacity is maxed out, the architecture can be expanded, similar to ProgressiveNets. While the individual parts are not novel, combining all three of them in one model is still interesting.
Strengths:
a) Overall, the paper is well-written and organized.
b) The approach is overall novel and shows how to combine three existing ideas (PackNet, Piggyback, and ProgressiveNet).
c) Experimental evaluation on several larger-scale datasets as tasks shows that the model outperforms prior work w.r.t. accuracy at a small model size footprint.
d) Additional evaluation is also shown on CIFAR (less convincing) and on different face recognition tasks.
Weaknesses:
1. Determining hyperparameters and reporting complexity
1.1. The paper requires setting “accuracy goals” when encountering a new task. However, it may be unclear which accuracy can be reached, and the paper is opaque about how these accuracy goals are determined, e.g. when comparing to prior work. To reach optimal performance, Algorithm 1 might need significant manual intervention.
1.1.1. How are the “accuracy goals” determined (especially for Tables 6 and 7)?
1.1.2. What happens if growing the network does not lead to achieving the accuracy goal? For example, increasing the network capacity might lead to stronger overfitting and reduced accuracy.
1.2. The approach may need many iterations to retrain the model to meet the “accuracy goal” (both w.r.t. growing and compressing).
1.3. How much is the model grown, how much is picked, how much is compressed? It would be interesting to see this for the different models in Table 6, as well as the accuracy targets.
1.4. It would be good to report the memory overhead from the binary masks and relate this to memory-based approaches such as GEM, A-GEM, and generative replay.
2. Experimental Evaluation
2.1. Ablations
2.1.1. The paper claims that “Another distinction of our approach is the “picking” step”. However, this aspect is not ablated.
2.2. Experiments on CIFAR. The comparison on CIFAR is not convincing.
2.2.1. The continual learning literature has extensive experiments on this dataset, and the paper only compares to one approach (DEN).
2.2.2. It is unclear if DEN is correctly used/evaluated. It would have been more convincing if the authors had used the same setup as in the DEN paper to make sure the comparison is fair/correct.
3. Motivation
3.1. The paper claims forgetting is fully avoided due to the usage of a mask. While it is true that *after* model compression no further forgetting happens, there is an accuracy drop during pruning, in contrast to e.g. regularization-based methods. Specifically, the original value (before pruning) is not recoverable and hence should be reported as forgetting.
4. The checklist is not fully accurate. The paper does not provide error bars and standard deviations for experiments.
5. Minor:
5.1. Grammar issue in the word “determining” in the 4th paragraph on page 3.
5.2. On page 3, in “Method overview”, it says “An overview of our method is depicted below”, whereas it should directly refer to Figure 1 because Figure 1 is on page 2.
5.3. On page 6, right below Figure 2, it says “in all experiments, but realize DEN”. The word “realize” does not fit the context.
5.4. In future, please use the submission template (not the camera-ready version) so that line numbers on the margins can be used to easily refer to the text.
I lean more towards accept: the overall convincing results (especially Table 6) and the overall novel model outweigh the limitations discussed above.
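As a rough illustration of the memory accounting asked for in point 1.4, the sketch below compares per-task binary masks with a replay buffer. It assumes one uncompressed bit per weight per task and 32-bit float weights; the paper may store masks differently. The function names, model size, task count, and buffer size are all hypothetical and are not taken from the paper or from the GEM or A-GEM setups.

```python
def mask_bytes(num_weights, num_tasks, bits_per_entry=1):
    """Storage for one binary mask per task, rounded up to whole bytes."""
    return (num_weights * num_tasks * bits_per_entry + 7) // 8


def weight_bytes(num_weights, bits_per_weight=32):
    """Storage for the shared weights themselves."""
    return num_weights * bits_per_weight // 8


def replay_buffer_bytes(samples_per_task, num_tasks, bytes_per_sample):
    """Storage for an episodic memory holding raw samples per task."""
    return samples_per_task * num_tasks * bytes_per_sample


# Hypothetical setting: 10M weights, 20 tasks,
# 256 stored 32x32 RGB images (uint8) per task.
num_weights, num_tasks = 10_000_000, 20
masks = mask_bytes(num_weights, num_tasks)
weights = weight_bytes(num_weights)
buffer = replay_buffer_bytes(256, num_tasks, 3 * 32 * 32)

print(f"masks:   {masks / 1e6:.1f} MB")
print(f"weights: {weights / 1e6:.1f} MB")
print(f"buffer:  {buffer / 1e6:.1f} MB")
```

Under these assumptions the mask cost grows linearly with both the number of tasks and the model size, so reporting it alongside the model size in Table 6 would make the comparison to memory-based methods explicit.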
I found this sentence convincing: "replay needs re-training which requires memory".
Iterative pruning is a big overhead after learning each task.
The authors stated that "Without loss of generality, our work follows a task-based sequential learning setup". It is a standard setup, but it has many limitations; for example, it cannot be applied when tasks are not known at test time. How could this not limit the generality of the approach?
Why gradual pruning? There are many methods for compression, and the choice is not discussed extensively.
The main components of the method are:
1- Compression.
2- Piggyback (a previous method) and training of released weights.
3- If not enough (no explanation of what this means), add nodes or filters. How is this done?
4- The algorithm is a paragraph of text.
5- Comparison to HAT is missing, an ICML 2018 paper that masks neurons instead of parameters: "Overcoming Catastrophic Forgetting with Hard Attention to the Task".
There is no explanation of how the method behaves at test time. Are only the weights associated with the test task activated? How does this limit the method?
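The test-time question above can be illustrated with a minimal sketch of how mask-based methods of this kind are commonly run at inference: a task identity selects a binary mask, and only the weights that mask keeps contribute to the output. Whether the paper does exactly this is the open question raised here; the layer, the weights, the masks, and the names `masked_linear` and `predict` are hypothetical.

```python
def masked_linear(x, weights, mask):
    """Linear layer where only weights with mask value 1 contribute."""
    return [
        sum(xi * w * m for xi, w, m in zip(x, w_row, m_row))
        for w_row, m_row in zip(weights, mask)
    ]


# One shared weight matrix (2 outputs, 3 inputs); values are illustrative.
weights = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# One binary mask per task. The later task reuses the earlier task's
# weights and activates additional ones.
task_masks = {
    "task_a": [[1, 1, 0], [0, 1, 0]],
    "task_b": [[1, 1, 1], [0, 1, 1]],
}


def predict(x, task_id):
    """The task identity must be supplied to choose the mask."""
    return masked_linear(x, weights, task_masks[task_id])


x = [1.0, 1.0, 1.0]
print(predict(x, "task_a"))
print(predict(x, "task_b"))
```

The same input yields different outputs depending on `task_id`, which is why such a scheme would not apply directly when the task is unknown at test time.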