NeurIPS 2020

Continual Learning with Node-Importance based Adaptive Group Sparse Regularization

Review 1

Summary and Contributions: This paper proposes a regularization method for continual learning that introduces a group L1 norm on every neuron. Important neurons are frozen, while unimportant neurons are forced to be sparse.

Strengths: This is an interesting topic. There are two sources of catastrophic forgetting: model drift and negative transfer. Each source is restricted by its own penalty term.

Weaknesses: The contribution is quite incremental; it is essentially a combination of pruning and task-specific neuron regularization. The setting assumes task boundaries. The architecture is AlexNet, for which continual learning is not essential because retraining is not expensive. The presentation of the proximal gradient step is a little confusing: Eq. 5 holds for each layer rather than for the whole loss. The reinitialization trick rests on intuition only and lacks a firm theoretical explanation.

Correctness: yes

Clarity: It is well written, and pleasant to read.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: The paper would be more convincing with results on larger datasets and neural networks.

Review 2

Summary and Contributions: This paper proposes two sparsity-based regularisation terms for continual learning that are driven by the importance of nodes instead of weights and hence reduce memory cost significantly. The two terms can explicitly control the stability and plasticity of the model by splitting the hidden nodes into two groups: unimportant and important nodes. Proximal gradient descent is utilised to obtain the analytic form of the optimal solution of the regularisation terms. The learning process also includes a parameter update strategy for specifically alleviating forgetting and negative transfer.
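
For readers unfamiliar with why proximal gradient descent yields an analytic solution for a group-sparsity term, the following is a minimal sketch of the standard group soft-thresholding operator, applied to the incoming weights of a single node. It illustrates the general technique only and is not taken from the paper; the function name `group_soft_threshold` and the parameter `threshold` (step size times regularisation strength) are assumptions made for illustration.

```python
import math


def group_soft_threshold(weights, threshold):
    """Proximal operator of threshold * ||w||_2 for one group of weights.

    `weights` is assumed to hold the incoming weights of a single node;
    `threshold` is assumed to be step_size * regularisation_strength.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= threshold:
        # The whole group is set to zero, which makes the node sparse.
        return [0.0 for _ in weights]
    # Otherwise the group is shrunk towards zero without changing direction.
    scale = 1.0 - threshold / norm
    return [scale * w for w in weights]


if __name__ == "__main__":
    print(group_soft_threshold([3.0, 4.0], 1.0))  # shrunk to roughly [2.4, 3.2]
    print(group_soft_threshold([0.3, 0.4], 1.0))  # norm 0.5 <= 1.0, so all zeros
```

Because the operator has this closed form, the sparsity term needs no separate inner optimisation: each group is either zeroed out or rescaled in a single step.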

Strengths: The motivation is sound and attractive. The proposed method is novel, sophisticated, and efficient in terms of memory cost. It provides a feasible way to combine model compression and continual learning in the fixed model capacity scenario, which could be an important contribution to the community.

Weaknesses: This paper in general is strong.

Correctness: No obvious problem in my understanding.

Clarity: This paper is generally well structured; however, the clarity could be improved. 1. An incoming weight of an important node could be an outgoing weight of an unimportant node; how does the mechanism update such a weight? Will it be frozen or nullified? Could the authors give some analysis of this situation? 2. Since the PGD update is applied per epoch, does the number of epochs affect the performance significantly? Can this method be applied in an online continual learning scenario, i.e. when only 1 epoch is allowed? 3. Why is only EWC compared in the reinforcement learning experiments?

Relation to Prior Work: Relation to prior work has been clearly discussed in Sec.1.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper addresses the catastrophic forgetting problem in continual learning with a regularization-based method. It takes inspiration from model compression and focuses on neural network weights at node level (grouping weights responsible for a single activation). Contributions: The authors 1) use a node-based version of 2 regularizers - 1 inducing sparsity and 1 reducing change of weights, for nodes which were important for previous tasks. 2) use a simple feed-forward method to evaluate the importance of a node, which appears to be made possible by their sparsity regulariser. 3) use proximal gradient descent to reduce the number of hyperparameters. 4) explicitly trim connections to prevent negative transfer (degrading the performance of a previous task), and randomly initialise unimportant weights in order to increase the model’s capacity. 5) show that their method outperforms competing methods using significantly less memory.

Strengths: - I think the ideas of contributions 1), 2) and 4) are novel and potentially interesting for the community. - Overall, the improvement in performance with less memory is an important advancement.

Weaknesses: In my opinion, the evaluation setting could be improved. - The first set of tasks, “Supervised learning on vision datasets”, divides a single dataset into multiple tasks. It would have been better if the tasks came from different datasets, so that the input domains differ more from each other and thus fewer nodes could be reused. - The RL setting does not include all baselines from the first setting, and I couldn’t find an explanation why.

Correctness: I didn't find any incorrect statements in the paper.

Clarity: I found most of the paper easy to follow. Here are a few points: - The use of proximal gradient descent isn’t motivated before section 3.3. - It might be good to add a 1-paragraph summary of the methods described in sections 3.2, 3.3, 3.4 in order to provide a better overview of your approach. - Question: In line 214 you describe the effects of Zero-init and Rand-init. It appears that, for an unimportant node n, all parameters which are multiplied by this node are set to 0 for future tasks. Therefore, this node shouldn’t be usable by the following layer in future tasks. Yet the incoming weights to this node are randomly initialised, so its activation can change in the following tasks. Why randomly initialise weights for a node that cannot be used later? I think I know what you meant to write, but I find the description confusing.
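
The question above, like the first clarity question in Review 2, turns on which connections are zeroed and which are re-drawn. The sketch below shows one hypothetical reading under which the two steps would not conflict: only connections from an unimportant lower node into an important upper node are cut, so a re-initialised node can still feed other unimportant nodes. This reading is an assumption and is not confirmed by the reviews; the function `reinit_layer`, its arguments, and the uniform range `scale` are all illustrative.

```python
import random


def reinit_layer(weights, important_in, important_out, rng, scale=0.1):
    """Return a re-initialised copy of one layer (hypothetical reading).

    weights[j][i] connects lower-layer node i to upper-layer node j.
    important_in[i] / important_out[j] flag which nodes are important.
    """
    new_weights = []
    for j, row in enumerate(weights):
        if important_out[j]:
            # Important upper node: keep weights from important lower nodes,
            # cut weights coming from unimportant lower nodes (Zero-init).
            new_row = [w if important_in[i] else 0.0 for i, w in enumerate(row)]
        else:
            # Unimportant upper node: re-draw all incoming weights (Rand-init).
            new_row = [rng.uniform(-scale, scale) for _ in row]
        new_weights.append(new_row)
    return new_weights


if __name__ == "__main__":
    layer = [[1.0, 2.0], [3.0, 4.0]]
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    print(reinit_layer(layer, [True, False], [True, False], rng))
```

Under this reading, the weight from the unimportant lower node into the important upper node is nullified rather than frozen, while the unimportant upper node keeps random connections from both lower nodes and so remains usable for future tasks.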

Relation to Prior Work: Related work is well outlined and the differences from the closest paper are sufficiently discussed.

Reproducibility: Yes

Additional Feedback: