Summary and Contributions: This paper proposes to leverage task-similarity information to aid continual learning. It transfers knowledge from similar tasks and prevents changes to the parameters that are important for dissimilar tasks.
Strengths: The idea of using task-similarity information is interesting. The authors propose an intuitive and simple architecture, consisting of a knowledge transfer module for similar tasks and a mask module (adopted from HAT) that identifies the important parameters for dissimilar tasks.
Weaknesses: 1. Relative to HAT, the authors add a knowledge transfer module to learn from old similar tasks. However, this comes at the expense of much more computation: the knowledge transfer module (Sec. 3.2), the transfer model (L193), and the reference model (L198). There are no experiments comparing the running time with HAT. Are the extra computations worthwhile? 2. HAT masks the important parameters using an attention module, which already identifies similar tasks. Why is an extra module needed to do this? Is it really necessary to discriminate between similar and dissimilar tasks? Moreover, similarity is a regression problem rather than a classification problem. Why is a similar classifier (9) used for this?
Correctness: Yes.
Clarity: Yes.
Relation to Prior Work: Not really. More analysis comparing the method with HAT is needed, as noted under "Weaknesses".
Reproducibility: Yes
Additional Feedback: See above. ------------- My initial major concern was the running time comparison. The authors have acknowledged that efficiency is a potential problem. They state in the rebuttal that they use 10x the training time to gain 3-5% higher accuracy. I am neutral on this now. My other concern is the motivation for identifying similar and dissimilar tasks, given that HAT can already mask the important parameters using an attention module. Is it really necessary to discriminate between similar and dissimilar tasks? The authors did not properly address this question in the rebuttal. My other concerns were answered properly. Therefore, I will change my score from "4: An okay submission, but not good enough; a reject." to "5: Marginally below the acceptance threshold."
Summary and Contributions: This paper proposes a method to learn a sequence of mixed tasks in continual learning. For dissimilar tasks, the method deals with forgetting by using task masks, while for similar tasks, it uses knowledge transfer attention to selectively transfer the useful knowledge.
Strengths: The method proposed in this paper is novel and is relevant to the NeurIPS community. The claims in the paper are sound.
Weaknesses: I think the main weakness of the paper is the relatively simple network architecture used in the experiments (a 2-layer fully connected network), especially for the CIFAR and CelebA data sets. For these data sets, a convolutional neural network (CNN) would be a more appropriate model. Can the proposed method be used with CNNs? Another weakness, in my opinion, is that the task similarity detection could be expensive. The method needs to train several f_{k -> t} models to detect similar tasks. It would be helpful if the paper discussed the computational cost of the method in more detail. The space overhead of storing the masks should also be discussed.
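To make the cost concern concrete, here is a rough count of the auxiliary models such a scheme would train. It assumes (this is my reading, not something I can confirm from the paper) that one transfer model f_{k -> t} is trained for every previous task k when task t arrives, plus one reference model per new task. The function name is hypothetical.

```python
def count_auxiliary_models(num_tasks: int) -> dict:
    """Count auxiliary models trained over a sequence of `num_tasks` tasks.

    Assumption: when task t arrives (t = 1..num_tasks), one transfer model
    f_{k -> t} is trained for every earlier task k < t, plus one reference
    model trained on task t alone.
    """
    # Task t has t - 1 predecessors, so the sum equals T(T-1)/2.
    transfer = sum(t - 1 for t in range(1, num_tasks + 1))
    reference = num_tasks
    return {
        "transfer": transfer,
        "reference": reference,
        "total": transfer + reference,
    }


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, count_auxiliary_models(n))
```

Under these assumptions the number of transfer models grows quadratically with the number of tasks (45 for a sequence of 10 tasks), which is why a running-time comparison would be informative.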
Correctness: Yes.
Clarity: The paper is well written. But there are some minor points that are not very clear to me. 1. In Eq (1), m_l^(t) is the output of a sigmoid function. How did you make it into a binary value in Eq (2)? I do not think the paper explained this. 2. In Eq (5), N_t should be defined before the first use. 3. On L193, are the readout functions trained after applying the masks? 4. On L197, did you mean the "transfer network" instead of the "reference network"?
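Regarding point 1: for reference, HAT does not threshold the sigmoid. It multiplies the task embedding by a scale s that is annealed from 1/s_max to s_max over the batches of an epoch, and uses s = s_max at test time, so the mask becomes nearly binary. The sketch below illustrates that mechanism on the assumption that the paper inherits it from HAT; the function names and the example embedding are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def annealed_scale(batch: int, num_batches: int, s_max: float) -> float:
    """HAT-style linear annealing of s from 1/s_max to s_max in one epoch.

    `batch` is 1-indexed. With a single batch, s_max is used directly.
    """
    if num_batches == 1:
        return s_max
    frac = (batch - 1) / (num_batches - 1)
    return 1.0 / s_max + (s_max - 1.0 / s_max) * frac


def mask(embedding, s: float):
    """Soft mask m = sigmoid(s * e), computed per unit."""
    return [sigmoid(s * e) for e in embedding]


if __name__ == "__main__":
    e = [0.5, -0.3, 0.1]  # hypothetical task embedding for one layer
    s_max = 400.0
    for b in (1, 50, 100):
        s = annealed_scale(b, 100, s_max)
        print(b, round(s, 4), [round(m, 4) for m in mask(e, s)])
```

Early in the epoch the mask is close to 0.5 everywhere, so gradients can still flow to the embedding; at s = s_max the values saturate toward 0 or 1. If the paper uses a different binarization, it should say so.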
Relation to Prior Work: Yes.
Reproducibility: Yes
Additional Feedback: Post-rebuttal: I've read the other reviews and the rebuttal. The rebuttal has adequately addressed my concern regarding the CNN architecture with additional experimental results. I agree with other reviewers that running time is an issue for this method. I still think that this paper is marginally above the acceptance threshold due to the increase in accuracy.
Summary and Contributions: The paper conjectures that prior work on "task-level" continual learning focuses either on similar or on dissimilar tasks. Consequently, the authors consider a mix of both similar and dissimilar tasks during continual learning. The approach identifies tasks similar to the current one in order to update their parameters, while the dissimilar task parameters are fixed (using a learned mask over parameters). There were some concerns raised in the initial reviews, and the authors have provided a satisfactory response. I suggest the authors include the promised changes in the final version.
Strengths: * The introduction does a good job of clearly defining the problem setting (i.e., incremental task learning instead of incremental classifier learning).
* The main idea showcased in the paper is to identify a set of shared and masked parameters, which correspond to similar and dissimilar tasks respectively. A validation dataset is used to identify which tasks are similar. During the update process, the gradients are blocked so that they don't affect the dissimilar task parameters and only update the similar task parameters to achieve backward knowledge transfer.
* A knowledge base is used to summarize past task information, and a task-specific attention is used to focus on the relevant information.
* The paper employs some tricks to stabilize the learning process. One main trick is the use of annealing to stabilize the masks learned from embeddings.
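The gradient blocking described above can be sketched as follows. This is a simplified per-unit illustration under my own assumptions, not the authors' implementation: masks are taken to be binary and per unit, the protected set contains only the tasks judged dissimilar to the current one, and the function names are hypothetical. HAT itself derives a per-weight factor from the masks of the two layers a weight connects.

```python
def cumulative_mask(masks, num_units):
    """Element-wise max over the masks of the tasks to protect.

    `masks` is a list of per-unit masks (one list per dissimilar task).
    With no protected tasks, nothing is blocked.
    """
    if not masks:
        return [0.0] * num_units
    return [max(column) for column in zip(*masks)]


def block_gradients(grads, protected):
    """Scale each unit's gradient by (1 - protected mask value)."""
    return [g * (1.0 - p) for g, p in zip(grads, protected)]


if __name__ == "__main__":
    # Hypothetical masks of two earlier tasks judged dissimilar.
    dissimilar = [[1.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]]
    grads = [0.5, -0.2, 0.3, 0.1]
    protected = cumulative_mask(dissimilar, num_units=4)
    print(protected)
    print(block_gradients(grads, protected))
```

Units used only by similar tasks are left out of the protected set in this sketch, so they stay trainable; that is what would allow backward transfer, if my reading of the method is right.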
Weaknesses: * The dataset choice seems arbitrary. Since the authors are defining a new setting, they should explain why FEMNIST and FCelebA specifically are used to create similar and dissimilar pairs.
* Relevant prior work is neither mentioned nor discussed. For example, Rajasegaran, et al. "Random path selection for continual learning." NeurIPS'19 also propose a similar masking-based approach to learn non-overlapping paths for dissimilar tasks. Similarly, PathNet (Evolution Channels Gradient Descent in Super Neural Networks) selectively masks out irrelevant model parameters. These papers should be cited and discussed (preferably compared against) in this manuscript.
* To my understanding, the notion of similar and dissimilar tasks is not accurate. For example, prior works on task incremental learning include both similar and dissimilar tasks (e.g., consider the CIFAR100 classes in GEM - NeurIPS'17). In fact, the considered set of similar and dissimilar tasks is not too different from the ones considered in earlier works. Specifically, consider the seminal work of Li & Hoiem, "Learning without forgetting" (TPAMI), where different datasets such as ImageNet/Places365/VOC/CUB/Scenes/MNIST are considered in continual learning experiments. Nevertheless, the proposed splits and dataset choices should be properly motivated, and the authors should also report some experiments on previously considered protocols for fair benchmarking against existing methods.
* The annealing strategy is somewhat similar to the controller proposed in iTAML (iTAML: An Incremental Task-Agnostic Meta-learning Approach - CVPR'20).
* The approach assumes that the task ID is known beforehand. Although this is consistent with some prior works, isn't it a bit restrictive in practical settings? It would be good to describe some application scenarios where task IDs can be known, to motivate the readers.
* Equation 3 is wrong; it should be written out explicitly.
* The caption of Figure 1 should have some description for the MTCL architecture (a) as well.
Correctness: The method generally seems correct. The empirical methodology is reasonable; however, I do have some concerns, as mentioned under the weaknesses and relation to prior work sections.
Clarity: The paper is overall clearly written with nice visualizations and easy to follow structure.
Relation to Prior Work: My biggest concern is that relevant prior work is neither cited nor properly acknowledged. Further, there should be comparisons with similar previous work. If the authors argue that those methods were designed for a different incremental learning setting, then they must implement a simple baseline version of the previous models to clearly show the advantage of the proposed improvements. In addition, the authors devise a new protocol without much justification. I think it needs to be better motivated, along with more extensive comparisons.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The paper proposes a novel model for learning a sequence of tasks, some of which are similar and others dissimilar, whereas previous work focused on only similar or only dissimilar tasks. The proposed model learns an embedding for each task and predicts binary masks for each layer to gate the gradient. The network has a knowledge base network with a task classification loss, and a knowledge transfer attention network conditioned on all similar tasks' hidden representations from the former network, again with a classification loss. The model is compared to a set of benchmarks on several sets of classification tasks and outperforms all the benchmarks in terms of accuracy for mixed similar/dissimilar sequences. One of the benchmarks, ONE, tends to outperform the proposed model on sequences of only dissimilar tasks, but the proposed model again outperforms all benchmarks for sequences of only similar tasks.
Strengths: Novel model and setup evaluated against a number of recent benchmark models. The ablation study supports the model's features.
Weaknesses: The tasks considered are on small datasets with small numbers of classes and the paper only considers classification tasks in experiments.
Correctness: The results seem to support the claims. It would be good to have a discussion/comparison to data and tasks used for evaluating models in previous work.
Clarity: Yes.
Relation to Prior Work: The paper has a good related work section.
Reproducibility: Yes
Additional Feedback: Comments: - Define x, y in the paper. - There seems to be a different f_mask and f_KTA for each task, which is not clear from the notation or the figure.