NeurIPS 2020

MATE: Plugging in Model Awareness to Task Embedding for Meta Learning

Review 1

Summary and Contributions: The work advocates bringing model awareness into the learning of task embeddings for meta learning and few-shot classification. The main argument is that existing works leverage only the data distribution, while the models used in learning the embeddings can provide additional value by encoding some measure of task difficulty. Task complexity is taken into account through a Hilbert space kernel mean embedding (KME) framework, and the idea is implemented with an SVM-based feature selection algorithm and FiLM conditioning layers.

Strengths: The proposed idea is well motivated and interesting. “The model learned in each task is itself part of the inductive bias” is a convincing argument; it makes sense that reasoning about the difficulty of different tasks through their best-fit models can inform how to leverage those tasks to generalize to new ones. The authors clearly lay out this main idea at the beginning of Section 3 using KME theory. Section 3 then does a good job of connecting the theory to concrete implementations, such as the model-aware surrogate using SVM variable selection and the sample-task fusion by DNN conditioning layers. The method is general and in principle should be applicable to many few-shot backbones. In the experiments, the authors plug MATE in on top of the current state of the art in few-shot classification (MetaOptNet, reimplemented by the authors) and report results on two benchmarks. This “plug-in” model awareness improves CIFAR-FS top-5 by a noteworthy 1% and miniImageNet top-5 by 0.76%. These gains look promising considering that MATE uses essentially the same MetaOptNet model, apart from an extra model-aware conditioning step at the task embedding stage. The ablation study dissects the respective gain of each MATE component, which is also helpful.

Weaknesses: One clear limitation is that the authors demonstrate the MATE idea on only the MetaOptNet model, despite their more general claim and design. Perhaps this could be partially excused, as the authors report (in supplementary Section 3) some unsuccessful attempts at reproducing another SOTA method. It is also well known in the field that reproducing meta-learning results can be unstable and challenging. Still, the paper would definitely be strengthened if the authors could verify their ideas on more backbones. Another limitation is that the reported performance gain from adding MATE is not very significant (although relatively consistent). While Top-5 shows observable margins, the Top-1 difference is usually within the confidence interval in the ablation. In addition, it is unclear to me what protocol the authors followed to compute the performance std (e.g., how many runs). And in Table 2, the top-5 std values in the last three rows are all 0.46%; could the authors confirm that those are not typos? Despite the above, I think the model-aware idea would be of interest and inspiration to the meta-learning community, so I currently lean toward acceptance.

Correctness: Yes.

Clarity: The paper is clearly written. The motivation and theoretical grounding are explained well. Figure 2 helps in understanding the framework.

Relation to Prior Work: The authors discuss related work well.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper argues that, in a meta-learning setup, the learner can benefit not only from incorporating the task distribution but also from representing the model complexity. To this end, the authors propose a model-aware task embedding inspired by kernel theory. Specifically, they propose to train an SVM classifier on top of the backbone feature extractor. From the SVM parameter w*, they define the model complexity feature vector as f_M(x) = \partial || w ||^2_2 / \partial \phi(x). They then use this vector to condition the FiLM parameters of the feature extractor in a second pass through the model. Note that in the first pass the FiLM parameters are set to the identity transformation. The entire meta-learning model is then trained in the usual end-to-end way. While the proposed method can be incorporated into any meta-learner with a backbone feature extractor, the authors experiment only with the MetaOptNet model (a state-of-the-art model). On CIFAR-FS and MiniImageNet, the authors report small gains over the baseline models.
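
The two-pass procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: all function names and the toy data are hypothetical, the dual coefficients are given by hand instead of coming from an SVM solver, they are held fixed when differentiating, and the mapping gamma = 1 + f_M(x) merely stands in for a learned conditioning network.

```python
# Illustrative sketch; every name and the toy data are assumptions, not the paper's code.

def film(features, gamma, beta):
    """FiLM layer: per-channel scale (gamma) and shift (beta)."""
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


def svm_weight(feats, labels, alphas):
    """Linear SVM weight in dual form: w = sum_i alpha_i * y_i * phi(x_i)."""
    w = [0.0] * len(feats[0])
    for phi, y, a in zip(feats, labels, alphas):
        for d, value in enumerate(phi):
            w[d] += a * y * value
    return w


def complexity_feature(k, feats, labels, alphas):
    """d||w||^2 / d phi(x_k) with alphas held fixed, i.e. 2 * alpha_k * y_k * w."""
    w = svm_weight(feats, labels, alphas)
    return [2.0 * alphas[k] * labels[k] * w_d for w_d in w]


# First pass: FiLM set to the identity (gamma = 1, beta = 0).
raw = [[1.0, 0.0], [0.0, 1.0]]   # toy backbone activations, one row per sample
labels = [1, -1]
alphas = [0.5, 0.5]              # hypothetical dual coefficients from an SVM solver
first_pass = [film(h, [1.0, 1.0], [0.0, 0.0]) for h in raw]

# Second pass: condition FiLM on f_M(x). gamma = 1 + f_M(x) is a placeholder
# for the learned conditioning network; beta is left at zero.
f_m = [complexity_feature(k, first_pass, labels, alphas) for k in range(len(raw))]
second_pass = [film(h, [1.0 + f for f in fm], [0.0, 0.0]) for h, fm in zip(raw, f_m)]
```

Under this simplification, f_M(x_k) is w scaled by 2 * alpha_k * y_k, so it is nonzero only for support vectors.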

Strengths:
- The introduction of the problem is well motivated.
- The proposed method is novel and interesting, and generally applicable to many meta-learning models.
- The paper is well written and extensively covers related work.

Weaknesses:
- My main concern is the experimental section. The current results show only very small benefits over the baseline model. The paper would therefore be much stronger if it reported results for models other than MetaOptNet. This would emphasize the generality of the method and convincingly demonstrate that it can improve several baseline models.
- Perhaps more a question than a comment: why not condition the FiLM parameters on the w parameter directly? What are the benefits of using your formulation?

I'm willing to update my score if you can address these concerns.

Correctness: Yes, as far as I can judge

Clarity: Yes, in general the paper was easy to follow.

Relation to Prior Work: Yes, the related work section covers this well

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper proposes to improve generalization to unseen tasks in meta learning by introducing a new model-aware task embedding framework called MATE. The main new argument is that the model learned in each task is itself part of the inductive bias and can indicate the task complexity. The authors derive their framework from kernel mean embedding and implement it with instance-adaptive attention plus model conditioning.

Strengths:
- The core idea of informing task difficulty by model awareness is interesting and well illustrated.
- The authors explicitly draw a connection between meta learning and Hilbert space embedding of distributions from a theoretical perspective.
- The idea of incorporating model awareness is implemented with an SVM-style first-order variable selection mechanism and a neural network conditioning layer. The MATE module seems easy to plug in.
- Experiments demonstrate that MATE can help the learning agent adapt faster and better to new tasks, improving 5-shot accuracy by up to 1% on two popular few-shot classification benchmarks.

Weaknesses: - The experiments are insufficient to support the claim. While the MATE idea seems to be general, all experiments are done on top of only one recent baseline (MetaOptNet).

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: