NeurIPS 2020

Continual Learning in Low-rank Orthogonal Subspaces

Review 1

Summary and Contributions: The paper proposes a novel replay-based continual learning method named Orthg-subspace. To prevent knowledge interference among observed tasks, it adopts orthogonal task-specific parameter projections at each layer, so that updating the model parameters for newly arriving tasks does not cause forgetting of previous tasks.

Strengths: First, the proposed method shows outstanding performance compared to recent replay-based works while using only a small replay buffer. The performance gain stems from the orthogonality across the projected model parameters of observed tasks. This is described concretely, and the authors suggest appropriate optimization techniques for solving their objective.

Weaknesses: The method is based on a multi-head approach, which requires task identities during both training and inference. This may not be practical in an online setting, even though their setup is based on single-epoch training, which fundamentally targets the online scenario. I also have several concerns about the paper in the comments below.

Correctness: The suggested methodology and explanations look correct.

Clarity: The paper is well written and the methodology and experimental setting are well described.

Relation to Prior Work: This is one of the weaknesses I felt the paper has. The authors quantitatively compare their method against baselines in terms of accuracy, but they do not provide intuitive quantitative/qualitative analysis for better understanding the uniqueness and contributions of Orthg-subspace compared to other replay-based methods.

Reproducibility: Yes

Additional Feedback: I have several main concerns about the paper which strongly affect my score.
- Although the paper bypasses some expensive computation during training, I suspect it still requires a much larger wall-clock time per training iteration. Could you give a numerical comparison of training wall-clock time against the baselines?
- The forgetting results are quite interesting in that all baselines show only marginal forgetting of less than 0.5%. To my knowledge, however, EWC, for example, has been shown to suffer far more severe forgetting (more than 10%) in many works [1,2,3]. What is the difference in the setting?
- In line 5 of the algorithm, is k randomly picked from the previous tasks? Is there

[1] Serra, Joan, et al. "Overcoming catastrophic forgetting with hard attention to the task." ICML 2018.
[2] Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning." NIPS 2017.
[3] Chaudhry, Arslan, et al. "Efficient lifelong learning with A-GEM." ICLR 2019.

Once I receive feedback and my concerns are resolved, I am willing to raise the score.

========== Post-rebuttal: I thoroughly read the other reviews and the author responses. I find that the method can be run within a reasonable wall-clock time. However, I still have the concern that there is a lack of (quantitative and/or qualitative) explanation and analysis for better understanding the uniqueness and contributions of Orthg-subspace. I keep my score.

Review 2

Summary and Contributions: This paper introduces deep nets whose weight matrices are orthogonal (in the extension of the term to non-square matrices). This is motivated by applications to continual learning on different tasks, and specifically by the desire to embed an input's representation (and the corresponding backpropagation of gradients) into a feature subspace orthogonal to the subspaces used for other tasks. For this purpose, the authors show that by (a) adding a projection onto a predefined subspace for each task, and (b) enforcing that the weight matrices are orthogonal, feature embeddings and gradient updates do not interfere with each other (they remain orthogonal) across different tasks. Experimentally, the authors evaluate the performance of their architecture on the MNIST, CIFAR-100, and ImageNet datasets, and provide ablation tests. *** Updated review *** I thank the authors for their response. I have updated my score to 6 to reflect the outcome of the reviewer discussion: in particular, the concerns raised regarding changing the size of T over time seem important to address; I believe that the solution discussed by the authors in their response warrants empirical evaluation. However, my concerns regarding the benchmarking tasks have been alleviated.
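[Reviewer's note: the non-interference property summarized above can be illustrated with a small toy sketch. This is my own construction, not the authors' code; the dimensions and rank-2 subspaces are arbitrary choices. It shows that projecting two tasks' inputs onto orthogonal subspaces and passing them through an orthogonal weight matrix leaves their embeddings orthogonal.]

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension (toy choice)

# A random orthonormal basis of R^d; the first two columns span task 1's
# subspace, the next two span task 2's (orthogonal) subspace.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
P1 = Q[:, :2] @ Q[:, :2].T   # projector for task 1
P2 = Q[:, 2:4] @ Q[:, 2:4].T # projector for task 2

# An orthogonal weight matrix (W^T W = I) is an isometry, so it preserves
# inner products between the projected representations.
W, _ = np.linalg.qr(rng.standard_normal((d, d)))

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
h1, h2 = W @ (P1 @ x1), W @ (P2 @ x2)

print(abs(h1 @ h2))  # ~0 up to floating-point error: no interference
```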

Strengths: The theoretical motivation and analysis of this novel architecture are sound; the experimental results seem strong, sometimes improving upon previous results by several standard deviations. The topic of this paper (continual learning, specific design choices for neural nets) is entirely relevant to the NeurIPS community. In terms of significance, the authors compare on a benchmark set of problems from a previously published paper; I am not familiar enough with the literature to evaluate whether this set of benchmarks is significant (it appears to be). I am curious to know whether the authors have investigated multi-task learning across a more diverse set of tasks, as it appears that the transformations applied in each task belong to the same family: different rotation angles for Rotated MNIST, etc.

Weaknesses: As mentioned above, the limitations of the experimental setup, where all tasks appear to be fairly similar, might slightly weaken the results of this paper.

Correctness: The claims are correct, and the methodology (I have not checked the code) seems well-motivated and correct. Several clarifications might improve the readability of the theoretical results, and certain results (in particular regarding basic isometry properties) may not be necessary for the main paper.

Clarity: The paper is well-written. However, there are several minor grammar mistakes that I recommend be fixed, although they do not hinder understanding of the paper.

Relation to Prior Work: Although I am not particularly familiar with continual learning, the authors compare their method experimentally to a wide range of previous work, including fairly recent network architectures.

Reproducibility: Yes

Additional Feedback:

*** Questions ***
- For the last equality in Eq. (9), are you defining g_L^t = P_t dl/dh_L? If not, could you explain that equality in more detail?
- In Figure 2, the inner products between the gradients across different tasks are much closer -- but not equal -- to 0. Is this entirely due to the fact that in practice, the ReLU activation across neurons is not entirely linear, or is there another factor explaining this behavior?
- Relatedly, does Figure 2 show the absolute value of the inner products? If not, I would expect some gradients to have negative inner products.
- Another assumption made in the theoretical analysis is that the layers are of decreasing size. Do the architectures you use follow that assumption? Have you thought of how you might generalize to architectures that do not satisfy this constraint? Although it is possible to define isometries in this space, I am curious to know whether the Stiefel manifold learning can be generalized to this setting.

*** Minor comments ***
- I would recommend mentioning before Eq. (4) that the matrices considered are not square, and that "orthogonal" is here used in the sense "W^T W = I".
- l. 18: "standard supervised learning *setting*"?
- l. 19: "poseS"?
- l. 79: "Let the inner product in H *be* denoted"
- l. 126: where are you using the sets V_i? I did not notice them being referenced after this paragraph.
- l. 133 and elsewhere: "basEs" when using the plural of "basis"
- l. 147: "relatively deep"
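[Reviewer's note: to make the ReLU question concrete, here is a toy sketch (my own construction, under assumed toy sizes, not the authors' code). With inputs projected onto orthogonal subspaces, gradients with respect to the first-layer weights stay orthogonal across tasks, but the ReLU masks let gradients of downstream layers overlap, which would explain small-but-nonzero inner products; the overlaps are also nonnegative here, which relates to my absolute-value question.]

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy dimension

# Orthogonal task subspaces and orthogonal weight matrices.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
P1 = Q[:, :4] @ Q[:, :4].T
P2 = Q[:, 4:8] @ Q[:, 4:8].T
W1, _ = np.linalg.qr(rng.standard_normal((d, d)))
W2, _ = np.linalg.qr(rng.standard_normal((d, d)))

def grads(P, x):
    """Gradients of loss = sum(W2 @ relu(W1 @ P @ x)) w.r.t. W1 and W2."""
    a = W1 @ (P @ x)              # pre-activation
    h = np.maximum(a, 0.0)        # ReLU
    g_W2 = np.outer(np.ones(d), h)
    g_W1 = np.outer((W2.T @ np.ones(d)) * (a > 0), P @ x)
    return g_W1, g_W2

gA_W1, gA_W2 = grads(P1, rng.standard_normal(d))
gB_W1, gB_W2 = grads(P2, rng.standard_normal(d))

print(np.sum(gA_W1 * gB_W1))  # ~0: inputs live in orthogonal subspaces
print(np.sum(gA_W2 * gB_W2))  # typically nonzero: ReLU breaks orthogonality
```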

Review 3

Summary and Contributions: The authors propose a method for tackling catastrophic forgetting in a continual learning setting. The core idea is to decouple the data for different tasks by mapping them into orthogonal subspaces using a number of orthonormal matrices and then adapting propagation accordingly. Experiments on four different benchmark datasets are performed, and the method is compared against several continual learning methods to demonstrate its effectiveness.

Strengths: The area of continual learning is relevant to the NeurIPS community and has recently been a focus of the machine learning community. The idea of using orthogonal subspaces to tackle catastrophic forgetting in continual learning is novel, and the authors have provided somewhat convincing empirical evaluation to support their claims.

Weaknesses: Despite having a novel core idea, I think this paper is not ready for publication and needs substantial improvement:

1. Currently it seems that you need to know T in advance, because the projection matrices P_t must be constructed before continual learning starts. This is a huge limitation, because the very notion of "continual learning" implies that T is not known a priori: the learning agent supposedly learns over unlimited time periods (i.e., we may even have T \rightarrow \infty). Learning task T+1 would invalidate your core idea, because building an orthogonal P_{T+1} does not seem to be trivial. In my opinion, this constraint should be removed.

2. Currently it seems that you have assumed the ReLU functions are in their linear range in your derivations, but I think this is a highly slippery assumption. The very reason ReLU is used is its nonlinearity: if all the ReLU functions were in their linear range, then any network would be equal to a two-layer shallow network, irrespective of the number of layers. Nonlinearity is absolutely essential for meaningful deep networks. It may be acceptable to state that you approximate the Jacobian matrices with identity matrices, and you could even use empirical validation to check how good such an approximation is, but assuming that the ReLU functions are in their linear range is not a good assumption. I speculate this is why your method is less effective when the base network is deep.

3. Section 2 is lengthier than it is useful. Subsection 2.1 can be rewritten as a single paragraph without allocating space to trivial material, e.g., the metrics. Subsection 2.2 covers quite trivial theoretical material that can be found in most linear algebra textbooks. Having it in the appendix might be acceptable, but its presentation in the current form is misleading. For example, Theorem 2.1 is presented as if it were a contribution by the authors, but it is a preliminary theorem taught in undergraduate courses; almost the same holds for the rest of that section. I think the paper has almost no theoretical contribution, but the presentation implies one and makes the paper look mathematically rigorous, even if this is not intended. Using mathematical terms such as "isometry" or "Stiefel manifold" to describe your method does not increase your contribution; on the contrary, presenting ideas without relying on technical jargon widens the reach of your work.

4. Experimental section:
- As is clear from Appendix C, the out-performance of your method over memory-based methods is mainly due to the use of a very small memory buffer. I understand that you used the same size to be fair, but allowing a bigger buffer is not a substantial limitation. I think it would be more meaningful to allow bigger buffers for the memory-based methods, up to the point where they outperform your method; the regimes in which your method wins could then be understood better. Additionally, methods based on generative experience replay (see below) do not use a memory buffer at all, yet are able to perform well under certain conditions, and in this sense are not limited.
- Ablation studies are missing from your results. For example, what happens if the P_t matrices are not orthogonal? What is the effect of the memory buffer size and other hyper-parameters?
- Adding figures that plot performance versus task number would be helpful. At the moment you report average performance, but per-task performance is not visualized. You could include figures that report performance on tasks 1 to t after learning task t, with t on the x-axis.
- Can you explain how you select the samples that are stored in the buffer? Are you using the same stored samples for all methods?
- How many runs do you use to compute the std and average accuracy?
- It would also be helpful to mention which performance numbers were computed by you and, if not, which references you copied the numbers from. For example, I know the original EWC paper does not perform 23 permuted-MNIST tasks, but currently it is not clear whether you generated the EWC results or are using results from a third party.

5. References to the line of work on generative experience replay are missing from the related-work section. Incorporating these works would be helpful, because they do not rely on a memory buffer to implement experience replay:
a. Shin H, Lee JK, Kim J, Kim J. "Continual learning with deep generative replay." NeurIPS 2017.
b. Kamra N, Gupta U, Liu Y. "Deep generative dual memory network for continual learning." arXiv preprint arXiv:1710.10368, 2017.
c. Seff A, Beatson A, Suo D, Liu H. "Continual learning in generative adversarial nets." arXiv preprint arXiv:1705.08395, 2017.
d. Rostami M, Kolouri S, Pilly PK. "Complementary Learning for Overcoming Catastrophic Forgetting Using Experience Replay." AAAI 2019.
These works use experience replay without a memory buffer; comparing against 1-2 of them would be helpful, too.

================================
Post-rebuttal comment: I think this work requires further demonstration to show that fixing T before training is not a limiting factor.
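[Reviewer's note on point 1: one possible direction, which is my own sketch and not something the paper proposes, is to allocate each new task's subspace incrementally by orthogonalizing fresh random directions against the basis used so far. This avoids fixing T in advance, though only until the d available dimensions are exhausted; the dimensions and rank per task below are arbitrary toy choices.]

```python
import numpy as np

def extend_basis(B, k, rng):
    """Return B augmented with k new orthonormal columns orthogonal to B.

    B: (d, m) matrix with orthonormal columns; requires m + k <= d.
    """
    d = B.shape[0]
    C = rng.standard_normal((d, k))
    C -= B @ (B.T @ C)            # Gram-Schmidt against the existing basis
    C, _ = np.linalg.qr(C)        # orthonormalize the residual
    return np.hstack([B, C[:, :k]])

rng = np.random.default_rng(0)
d, r = 16, 4                       # feature dim, rank per task (toy choices)
B = np.zeros((d, 0))
for _ in range(3):                 # allocate a fresh subspace per new task
    B = extend_basis(B, r, rng)

# Per-task projectors built from consecutive blocks of the shared basis.
P = [B[:, r*t:r*(t+1)] @ B[:, r*t:r*(t+1)].T for t in range(3)]
print(np.max(np.abs(P[0] @ P[1])))  # ~0: the new projectors stay orthogonal
```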

Correctness: The empirical methodology seems to be correct but can be improved. Please check the comments above.

Clarity: The paper is written well and following the text is straightforward. But I think it still can be improved, e.g., reducing section 2 and improving section 4.

Relation to Prior Work: The authors have covered prior work and explicitly stated their contribution but I think it might be better to move section 6 to right after section 1 and merge redundant material to make the paper more coherent.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper proposes a new continual learning method operating in low-rank orthogonal subspaces. The method learns tasks in different vector subspaces that are kept orthogonal to each other in order to minimize interference.

Strengths: Low-rank orthogonal subspaces are effective for minimizing interference among different tasks.

Weaknesses: The proposed method computes gradients on past tasks when training the current task. Consequently, it needs a replay buffer to store data from past tasks, which is a weakness of the method. Comparing the proposed method with continual learning methods that do not use a replay buffer, such as EWC, is unfair: when developing continual learning algorithms, storing data from previous tasks makes the continual learning problem easier.

Correctness: Yes

Clarity: I think so.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: My concerns are addressed in the rebuttal.