NeurIPS 2020

OOD-MAML: Meta-Learning for Few-Shot Out-of-Distribution Detection and Classification

Review 1

Summary and Contributions: The submission explores simultaneous classification and out-of-distribution (OOD) detection in a few-shot setting. The underlying framework for the few-shot setting is MAML with binary classification meta-tasks. For OOD detection, the paper uses generated examples of OOD samples per class. MetaGAN [1] has proposed to use a GAN to generate such contrastive samples, but the current work uses an adversarial perturbation method instead based on the FGSM [2], with the sample initialisation being meta-learned. Experiments indicate that the proposed method might perform better than more naive MAML instantiations. A small case-study indicates that the adversarial perturbation per class is significantly useful for OOD detection in the proposed framework. [1] MetaGAN: An Adversarial Approach to Few-Shot Learning, Zhang et al., 2018 [2] Explaining and Harnessing Adversarial Examples, Goodfellow et al., 2015
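For readers less familiar with FGSM [2], the sketch below shows a single sign-gradient step on a toy logistic-regression classifier. It illustrates the generic technique only: the model, the function names (`sigmoid`, `predict`, `fgsm_step`) and the step size `eps` are assumptions made for this example and are not taken from the submission, whose perturbation starts from a meta-learned initialisation rather than a fixed input.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> float:
    """Return -1.0, 0.0 or 1.0 depending on the sign of v."""
    return float((v > 0) - (v < 0))


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability of class 1 under a toy logistic-regression model (assumed)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm_step(w: Sequence[float], b: float, x: Sequence[float],
              y: float, eps: float) -> List[float]:
    """One FGSM step: x + eps * sign(d loss / d x).

    For binary cross-entropy on a logistic model, the gradient of the
    loss with respect to the input is (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

With label `y = 1`, the step moves the input in the direction that lowers the predicted probability of class 1, that is, in the direction that increases the loss.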

Strengths: I find the idea of meta-learned initialisations for adversarial samples interesting, and potentially inspirational for other works. To my knowledge, this is quite novel.

Weaknesses: My comments are mostly to do with empirical methodology; see below.

Correctness: In Table 1, baseline methods are thresholded at a 95% TPR, while the proposed method and its variants are claimed to be threshold-agnostic. From Section 3.3, however, it appears that the threshold is manually set to 0.5, so they are not really threshold-agnostic. Picking thresholds with different criteria for the compared methods could easily lead to an unfair assessment. I’d recommend also picking the OOD-detection threshold for the proposed method at 95% TPR (on the maximum softmax values for class 1 across all tasks, for example) for a more even comparison.
The experiment in Section 4.4 feels a bit anecdotal due to the particular example studied. Appendix D studies the effect of the adversarial adaptation, and while the text says random-(ini)-OOD outperforms random-OOD, the table seems to show the opposite trend (a typo perhaps?), which would indicate the adversarial adaptation did not help. In any case, aggregate numbers in the actual setting, with the meta-learned initialisations instead of random initialisations, would be nice. It might also be useful to visualise the fake samples learned by the meta-training process, as well as the adversarially perturbed ones.
Do the authors suspect a particular reason why performance goes down when the number of negative samples is increased (M = 3 to 5) in Table 1 for miniImagenet (and, to some small extent, for CIFAR-FS)? How were hyper-parameters picked for the various methods? From the Appendix it appears that validation sets were not used.
While the following claims are probably true, I’d be cautious about making them without some evidence: L45: “the algorithms in previous studies would not perform well under few-shot settings.” L61: “meta-learning algorithms, including MAML, would generate trivial classifiers that predict all the examples as in-distribution examples.”
Post-rebuttal: Thanks for the responses to my comments. I’m happy with the updates.
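The 95% TPR recommendation in the comments above can be made concrete with a short sketch. It assumes that in-distribution examples are the positive class and that a higher score means "more in-distribution"; the function names `threshold_at_tpr` and `false_positive_rate` are hypothetical, and the scores in the usage note are made up for illustration, not results from the paper.

```python
import math
from typing import Sequence


def threshold_at_tpr(in_scores: Sequence[float], target_tpr: float = 0.95) -> float:
    """Largest threshold that still accepts at least target_tpr of the
    in-distribution scores (an example is accepted when score >= threshold)."""
    if not in_scores:
        raise ValueError("need at least one in-distribution score")
    ranked = sorted(in_scores, reverse=True)
    # Number of in-distribution examples that must be accepted.
    k = max(1, math.ceil(target_tpr * len(ranked)))
    return ranked[k - 1]


def false_positive_rate(ood_scores: Sequence[float], threshold: float) -> float:
    """Fraction of OOD examples wrongly accepted as in-distribution."""
    if not ood_scores:
        raise ValueError("need at least one OOD score")
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)
```

For example, `threshold_at_tpr([0.9, 0.8, 0.6, 0.4], target_tpr=0.75)` selects the threshold 0.6, and the FPR of every method can then be reported at that same operating point instead of at a fixed cut-off of 0.5.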

Clarity: While the writing could probably be improved to make the method clearer, it wasn’t particularly difficult for me to follow. Here are some typos/suggestions: L31: “K-show” -> “K-shot”; L142: “examples” -> “example”; L173: “If then,” -> “If so,”/“Then”; L267: “few-show” -> “few-shot”.

Relation to Prior Work: To my knowledge, prior work is adequately discussed.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This work introduces a novel and challenging setting: few-shot learning for open-set. The authors propose an optimization-based method along with a data generation strategy to handle this task.

Strengths: This work introduces a challenging few-shot setting, which is rarely explored in previous works. The method introduced in this work is optimization-based and the data generation strategy can actually help to improve the performance for this task.

Weaknesses: 1. The terminology is ambiguous: in computer vision tasks, "detection" usually means localizing and classifying instances of a specific class. 2. The authors claim that GANs are difficult to optimize and use adversarial gradient updates instead, but it is not clear why the proposed idea is better than a GAN. The method section says little about how the proposed method avoids vanishing gradients and mode collapse. 3. Can you provide some visualizations of the fake examples for better understanding and analysis?

Correctness: Correct

Clarity: The method section should be improved with more intuitive explanation and analysis.

Relation to Prior Work: This work is also related to the open-set problem in classification. The related works should be discussed and compared against.

Reproducibility: Yes

Additional Feedback: The author feedback has resolved my concerns. Related work on the open-set problem should be discussed in the final version. I will keep my rating.

Review 3

Summary and Contributions: This paper studies OOD detection in the few-shot setting. It first formulates the few-shot OOD detection problem based on the classical few-shot setting. Then this paper proposes OOD-MAML to meta-train and meta-test the model. The main idea is to involve adversarial samples to train a sharp classifier which can help to identify the OOD samples. This paper also demonstrates the effective performance of OOD-MAML over benchmark datasets.

Strengths:
- This paper discusses the few-shot OOD detection problem, which is interesting. The authors rigorously formulate the problem based on the classical few-shot learning setting.
- This paper proposes a MAML-like algorithm, OOD-MAML, to solve the problem. The main contribution is to introduce adversarial samples (as meta-parameters) to help train the classifier, which is novel. The paper also carefully designs the gradient descent steps that update the meta-parameters.
- The experiments are sufficient and show the effectiveness of the OOD-MAML algorithm.

Weaknesses: The baseline in the experiments is MAML, which is not designed for OOD detection, so I do not think the comparison is entirely fair. Do you have other baselines? For example, many-shot OOD detection algorithms?

Correctness: Yes. I haven't seen anything wrong so far.

Clarity: Yes. One suggestion: The notations used in section 3 are a little bit too complex and could be simplified.

Relation to Prior Work: Yes. This work is related to MAML and OOD detection, which are both discussed in section 2.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: Presents a meta-learning algorithm that meta-trains for detecting out-of-distribution samples. At meta-test time, OOD-MAML can do both N-way classification and OOD detection. The method is a modification of MAML.
Meta-train:
- Each task is essentially a binary classification problem where in-distribution samples are examples from the same class (labeled 1) and out-of-distribution samples are generated adversarially (labeled 0).
- To get the adversarial examples, they start with theta_fake, a meta-learned vector of adversarial examples. For each task, they directly gradient-update theta_fake to fool the model. theta_fake is updated using the sign of the gradient, and has a meta-learned inner learning rate.
- IIUC, they then need to update the model again to reflect the updated adversarial examples.
Meta-test:
- The N-way K-shot problem is again turned into N binary classification tasks, essentially giving N binary classifiers, one for each class.
- Looking at the output of each of the N classifiers, if the output is < 0.5 for all classes then they say it is OOD.
- Otherwise, they choose the class corresponding to the highest (maximum) output among the classifiers.
Experiments:
- OOD examples are generated by choosing from unseen classes.
- They compare against OOD detection baselines combined with MAML.
- OOD-MAML appears to better detect OOD examples, and has comparable accuracies on in-distribution data.
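The meta-test decision rule summarised above can be written down in a few lines. This is a minimal sketch of the rule as this review describes it, not the authors' code: the function name `predict_with_ood` is hypothetical, and it assumes that each of the N binary classifiers returns a score in [0, 1].

```python
from typing import Optional, Sequence


def predict_with_ood(scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Combine the outputs of N one-vs-rest binary classifiers.

    Returns None (OOD) when every score is below the threshold,
    otherwise the index of the class with the highest score.
    """
    if not scores:
        raise ValueError("need at least one classifier output")
    best = max(range(len(scores)), key=lambda i: scores[i])
    # "All scores below the threshold" is the same as "the maximum is below it".
    if scores[best] < threshold:
        return None
    return best
```

Because the OOD decision depends only on the maximum score, the fixed value 0.5 acts as an ordinary detection threshold that could be swept or calibrated like any other.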

Strengths:
- The method seems novel as far as I'm aware, in particular the ideas of turning the few-shot classification problem into few-shot in/out binary classification, and of adapting the adversarial example directly in the inner loop.
- The evaluation appears to show benefits in detecting OOD examples, without hurting in-distribution accuracy.
- The work is relevant to the community.

Weaknesses:
- Though the method is applied in a different setting (adding in OOD detection), the actual technical deviations from MAML and MetaGAN are relatively minor.
- It is not clear that the difference from MetaGAN (the lack of a generator for adversarial examples) is justified or better, aside from being easier to train, and no experimental comparison is made to that approach.
Post author response: The response is satisfactory, and my concerns were relatively minor to begin with. No change to score (7).

Correctness: The empirical methodology seems correct.

Clarity: The notation is somewhat cumbersome: many mathematical symbols are too long, which decreases readability (D_{meta-train}^{MAML}, for example). Minor: too many equations are in-lined (perhaps for space), which also decreases readability, e.g. L168. Figures and tables often lack captions explaining what is going on.

Relation to Prior Work:
- Clearly discusses the relation to other OOD methods.
- No discussion of prior meta-learning + OOD work, or comparison to it. However, I am not familiar enough with that area to say whether it is necessary.
- Some discussion of other meta-learning work, particularly MAML and MetaGAN.

Reproducibility: Yes

Additional Feedback: