Review for NeurIPS paper: Task-Robust Model-Agnostic Meta-Learning

NeurIPS 2020

Task-Robust Model-Agnostic Meta-Learning

Review 1

Summary and Contributions: The submission proposes a modification of the multi-task meta-learning objective from the average of the per-task losses to the maximum of those losses. The argument is that this will force the learner to learn all tasks to a comparable amount, even the worst-case ones, so no task can be ignored. The manuscript presents a stochastic algorithm for optimizing the resulting min-max problem and proves convergence including explicit rates for the convex as well as the non-convex case. Some generalization bounds are also stated. Experiments on small-scale data show that worst-case (w.r.t. tasks) performance is improved compared to MAML, which optimizes for average loss.
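
For reference, the change of objective described in this summary can be written as below. The notation is assumed for illustration and is not taken from the manuscript: $w$ denotes the shared meta-parameters, $m$ the number of tasks, and $f_i(w)$ the meta-loss of task $i$.

```latex
\underbrace{\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} f_i(w)}_{\text{average task loss (MAML-style)}}
\qquad \longrightarrow \qquad
\underbrace{\min_{w} \; \max_{1 \le i \le m} f_i(w)}_{\text{worst-case task loss}}
```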

Strengths:
+ Meta-learning is a problem of high relevance and of high interest in the community at the moment.
+ The manuscript contains a lot of material.
+ The proposed method seems technically correct.
+ The proposed algorithm is not only tested empirically; its convergence is also formally proved, including rates.
+ The experiments indeed show that performance on the worst tasks is improved; in the Omniglot case, even average performance sometimes gets better.

Weaknesses: The work does not seem to have any major flaws, but there are a number of small to medium-sized weaknesses regarding the novelty, significance and relevance of the work.
1) The actual new objective is not of high novelty: switching from average to max is a well-known way of aiming for better worst-case performance.
2) The motivation/discussion of why to optimize for worst-case task loss is insufficient (see below).
3) The contribution on the optimization side is not made clear enough. The proposed algorithm and its convergence analysis are based on prior work for min-max problems (of course), and it is not explained to what extent the proposed steps are simply an application of that work.
4) The established rates for the non-convex case ( max(1/eps^5, 1/delta^5) ) are far from practical.
5) The generalization bounds are not useful as given, because they rely on a task-specific notion of Rademacher complexity that is neither explained nor quantified.
6) The manuscript has no conclusion section; page 8 ends on experiments (the "Broader Impact" section on page 9 is written as a conclusion, but I disregard this, as that is not what that section is meant to be, and it is beyond the page limit).
7) The experiments are mostly unsurprising: when switching from average to max loss, the average-case performance generally gets worse and the worst-case performance gets better.
Details:
2) The manuscript's arguments in favor of minimizing the max-loss across tasks are not fully convincing. The proposed objective is not "robust" as suggested in the manuscript's title; it is brittle. A single "outlier" or "too hard" task would render the setting pointless. The manuscript mentions that adversarial tasks would be a problem, but an adversary is not required: tasks of different difficulty (e.g. different Bayes error rates) should already pose problems. Prior work (which is cited, e.g. [32],[9]) states explicitly that avg-loss has a lot of advantages, but that there are situations in which minimizing the max-loss can make sense. That, however, is across samples that come from the same distribution, and the emphasis is on the realizable setting, i.e. even the max can be 0, and the difference to the average loss is mainly on the optimization side. For multiple tasks, the differences and problems seem far bigger. I would have hoped to see a discussion of this, and potentially a justification.
5) The generalization bounds rely on standard arguments, which use the task-specific Rademacher complexity as a black box. To understand the implications of the bounds, the reader has to understand the behavior of the Rademacher complexity. Does it even converge to 0 for m->infty? Is the amount of test data or of train data crucial for that? What is the dependence on \mathcal{W}? Maybe it could be expressed in terms of other existing, better understood Rademacher complexity measures?

Correctness: I did not spot any mistakes. I did not check the proofs in the supplemental material, though.

Clarity: The writing is not ideal. Overall, the paper is trying to squeeze too much into the available pages. The work feels almost like three papers: one that presents an algorithm with its convergence analysis, one that presents an objective and some generalization bounds, and one that shows experimental improvements. Ultimately, each part ended up a bit too short to be satisfying.
- The motivation of the max-loss is not convincing. I don't know if this is fixable, but making clearer in which situations it is a good idea and in which it is not might help.
- The use of 'task instances' and 'episodes' in the Problem Formulation (lines 93-98) is not clear enough. I only inferred what is meant from the later text and the treatment of the j-index.
- The algorithm and convergence analysis lack a clear distinction between what is a standard technique/result and what is a new contribution.
- The generalization bounds are rather useless to the reader, because the properties of the version of Rademacher complexity that occurs in them are not explained. The results for convex combinations of tasks are unsurprising given the max-formulation, but convex combinations of tasks are not a very realistic setting anyway.
- The experiments do not convince me that the results carry over to "real-world" tasks. Results on two datasets are reported, but both are quite artificial (sinusoid regression is synthetic 1D, Omniglot is handwritten characters).
- The manuscript has no conclusion section; page 8 ends on experiments (the "Broader Impact" section on page 9 is written as a conclusion, but I have to disregard this, as that is not what that section is meant to be, and it is beyond the page limit).

Relation to Prior Work: Prior work is acknowledged properly. The exact differences between the proposed steps and the ones from the literature are not clear enough, though.

Reproducibility: No

Additional Feedback: Dear authors, I appreciate the amount of material presented, but the density of writing in the manuscript makes it hard for me as a reader to assess the aspects of novelty, relevance and significance. That's why I gave a borderline score, but I'd be happy to still adjust that. To help me better understand the contribution of the paper, could you please explicitly list what exactly you consider your main contributions (ideally split into each of the parts: setting, algorithm, convergence analysis, bounds, experiments)? Specifically for the algorithm, convergence analysis and bounds, I would be interested in which parts you consider applications of existing work (with minor adjustments) and which ones you consider new contributions to the community.
-----------------------
After reading the author response and following the discussion, my impression of the work is still the same. It makes some contributions, which are laid out in the rebuttal (thank you), but none of them appears to be a fundamentally new contribution. The motivation for max-loss is unconvincing to me, unless one makes stronger assumptions on the task environment. Overall, I remain with the assessment that this is a borderline work. Comparable work has been published at NeurIPS, and comparable work has been rejected.

Review 2

Summary and Contributions: This paper proposes a new meta-learning objective that treats rare tasks with the same importance as major tasks. This makes learning robust to distribution shifts over the observed tasks and gives better generalization performance than the average loss. The authors also prove that their formulation converges in both convex and nonconvex settings, and show an empirical gain in regression and image classification experiments.

Strengths: Task-robustness (robust learning not just for significant tasks but also for rare or hard tasks) is a nice idea to prevent task overfitting (the given tasks can be biased when their number is small) or to obtain a more general meta-learned model, and the authors provide a proof of convergence for their objective.

Weaknesses: To use the proposed objective, the tasks have to be defined explicitly. When the task distribution is continuous, it has to be quantized, and how to quantize becomes another hyperparameter to tune. Another point I want to mention is that the objective can cause worse performance on an easy task. Let me give an example. When there are easy and difficult tasks, the proposed objective focuses on the difficult tasks. During training, the loss for the easy task can become larger than that for the hard task. At that point, the model tries to learn more about the easy task. However, the performance on the easy task cannot become much better than on the hard task, because if that happens, the objective will focus more on the hard task again. Thus, the model could do better on the easy task, but the objective prevents it. (I think this is the reason for the lower MEAN performance than MAML on regression.) A limitation of this paper is the lack of empirical analysis on more complicated image classification tasks (mini-ImageNet) or RL tasks (point navigation or tasks in MuJoCo). Using the Omniglot dataset to classify characters rather than alphabets is nice, because alphabet classification performance is already over 95%, so it would be hard to show the gain from your method. However, if the authors validated their method on more complex tasks, it would help readers to understand or agree with their proposal.
# To authors: Thank you for updating the experiment parts; I'd like to update my score in consideration of them. If a description of the comparison with the method using the task-average loss, or of the performance on easier tasks, is added, the paper will be more concrete.
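
The easy-versus-hard-task example in this review can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes a single shared scalar parameter `w` and two hypothetical quadratic task losses, one that can reach zero (the "easy" task) and one with an irreducible error of 1 (the "hard" task), and it compares the minimizer of the average loss with the minimizer of the worst-case loss by plain grid search.

```python
def task_losses(w):
    """Return (easy_loss, hard_loss) for a shared scalar parameter w.

    Both loss functions are hypothetical and chosen only for illustration.
    """
    easy = (w - 0.0) ** 2          # can reach zero loss at w = 0
    hard = (w - 2.0) ** 2 + 1.0    # never drops below 1, best at w = 2
    return easy, hard


def argmin_on_grid(objective, grid):
    """Return the grid point with the smallest objective value."""
    return min(grid, key=objective)


# Candidate values of the shared parameter, from -1 to 3 in steps of 0.001.
grid = [-1.0 + 0.001 * k for k in range(4001)]

# Average-loss objective versus worst-case (max-loss) objective.
w_avg = argmin_on_grid(lambda w: sum(task_losses(w)) / 2.0, grid)
w_max = argmin_on_grid(lambda w: max(task_losses(w)), grid)

for name, w in [("average", w_avg), ("worst-case", w_max)]:
    easy, hard = task_losses(w)
    mean = (easy + hard) / 2.0
    print(f"{name:>10}: w={w:.3f} easy={easy:.4f} hard={hard:.4f} mean={mean:.4f}")
```

In this toy setup the average loss is minimized near w = 1 and the worst-case loss near w = 1.25, where the two task losses are equal. Moving from the first solution to the second lowers the hard-task loss but raises both the easy-task loss and the mean, which is the effect the reviewer describes.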

Correctness: Correct

Clarity: Well written

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:
- When I read sections 1 and 2, I assumed that you would compare your method with a method whose loss is the sum of the per-task average losses. I think that method can also solve the underfitting problem on sparse tasks, so comparing with it would let you analyze your method more deeply.
- For the cases showing worse performance than MAML, a deeper analysis would help to understand or validate your method. Currently, it is only mentioned that TR-MAML performs worse because it focuses more on the worst task.
- Since your idea makes it possible to learn correctly from a biased task distribution, it can be related to probabilistic meta-learning methods. I think you could use those methods as baselines.
- A more fundamental comment: your method learns tasks beyond the given task distribution, whereas meta-learning is about learning the task distribution or the shared inductive bias of the given tasks. This means your method learns a more general inductive bias than a naive meta-learning method. I think it could give better performance in out-of-distribution cases. An analysis of those cases would make this a better paper, I think.

Review 3

Summary and Contributions: The paper proposes to optimize worst case (meta) loss for MAML to obtain robustness with respect to worst-case test distributions.

Strengths: The work seems to be technically sound and well-executed. The idea makes a lot of sense. The analysis appears to be correct (though I did not check the proofs), and the theoretical results, though simple, are non-trivial and support the claims of the paper. The experiments are performed on relatively simple domains, but the results seem promising. I believe this work will make an impact, and is likely to lead to follow-up work.

Weaknesses: There are two weaknesses, which I think are not critical, but I would appreciate a response from the authors about #1:
1. The mean performance on Omniglot exceeds MAML. But MAML trains for the mean case. Is this not strange? Does it indicate (meta) overfitting on the meta-training set? If so, it would be good to add results for a domain where there are sufficient meta-training examples to avoid overfitting.
2. Following up on #1, if overfitting is the issue, I would recommend a comparison to a regularized variant of MAML (see, e.g. "Meta-Learning without Memorization" Yin et al.).
3. I like the idea behind the paper. But it is a bit obvious: a less charitable interpretation is that it is a fairly obvious application of known ideas in minimax/DRO to the meta-learning setting. This is the main thing preventing me from giving the paper a higher score. I think on balance this is OK: the idea is valuable, and although it is somewhat obvious, there is no previous paper that proposes this, and although the paper utilizes in some sense the most obvious algorithm for solving this problem, it is executed well and analyzed thoroughly.
Other than the two nitpicks above, it's hard to imagine how the authors could have executed on this idea better. If #1 and/or #2 are addressed well in the rebuttal, I would be willing to raise my score to 8. Though perhaps that is not needed to get the paper over the bar.

Correctness: Yes, the claims are correct.

Clarity: Yes, the paper is clearly written, and it was very easy to read.

Relation to Prior Work: Yes, I believe the relevant prior work is generally discussed well. I'm sure there are more citations that could be added (as always...) but I didn't notice glaring omissions.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper proposes TR-MAML, a variant of MAML that optimizes for worst-case performance in the few-shot learning problem. TR-MAML works by replacing the average task loss with the maximum, which is later approximated by a min-max problem over a probability simplex. The authors provide a convergence guarantee as well as a generalization bound for TR-MAML. The experimental results support the theoretical claims that TR-MAML can achieve better worst-case performance than MAML.
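
To make the simplex formulation mentioned in this summary concrete: for any loss values f_1, ..., f_m, the maximum equals the largest weighted sum sum_i p_i f_i over weight vectors p in the probability simplex, so minimizing the max-loss can be posed as a min-max problem over (w, p). The sketch below is a deterministic, full-gradient descent-ascent loop on two hypothetical quadratic task losses with a scalar parameter `w`. It is an illustration under these assumptions, not the stochastic algorithm analyzed in the paper; the step sizes, loss functions and function names are all made up for the example.

```python
def project_to_simplex(v):
    """Euclidean projection of a list of floats onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:            # condition holds for a prefix of the sorted values
            theta = t              # keep the threshold of the largest such k
    return [max(x - theta, 0.0) for x in v]


# Hypothetical per-task losses f_i(w) = (w - c_i)^2 + b_i.
CENTERS = [0.0, 2.0]
OFFSETS = [0.0, 1.0]


def losses(w):
    return [(w - c) ** 2 + b for c, b in zip(CENTERS, OFFSETS)]


def grads(w):
    return [2.0 * (w - c) for c in CENTERS]


def descent_ascent(steps=2000, lr_w=0.05, lr_p=0.05):
    """Alternate projected gradient ascent on p with gradient descent on w."""
    w = 0.0
    p = [1.0 / len(CENTERS)] * len(CENTERS)   # start from uniform task weights
    for _ in range(steps):
        # Ascent step: shift weight towards tasks with a larger current loss.
        f = losses(w)
        p = project_to_simplex([p_i + lr_p * f_i for p_i, f_i in zip(p, f)])
        # Descent step on the p-weighted loss.
        g = sum(p_i * g_i for p_i, g_i in zip(p, grads(w)))
        w -= lr_w * g
    return w, p


w_star, p_star = descent_ascent()
print("w:", round(w_star, 4), "task weights:", [round(x, 4) for x in p_star])
print("task losses:", [round(x, 4) for x in losses(w_star)])
```

With these particular losses the iterates settle where both task losses are equal, which is the minimizer of the worst-case loss. Setting `p` to the uniform vector and skipping the ascent step would recover plain gradient descent on the average loss.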

Strengths: A worst-case guarantee in few-shot learning is certainly important. The authors did a good job of providing a clear objective for this problem as well as a theoretical analysis of their method. The improvements on Omniglot seem significant.

Weaknesses: My major concern is the insufficient experimental analysis; more concrete experiments are needed. Experimental results on the miniImageNet and tieredImageNet datasets would be more informative. Besides, since the authors claim that the proposed method is robust to shifts in the task distribution between meta-training and meta-testing, experimental results on Meta-Dataset [1] would be more convincing.
[1] Triantafillou E, Zhu T, Dumoulin V, Lamblin P, Evci U, Xu K, Goroshin R, Gelada C, Swersky KJ, Manzagol PA, Larochelle H. Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples.

Correctness: The paper seems to be correct.

Clarity: The paper is well written.

Relation to Prior Work: The related work is adequately discussed.

Reproducibility: Yes

Additional Feedback: