NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:856
Title:Efficient Meta Learning via Minibatch Proximal Update

Reviewer 1

Originality: By the authors' own admission, the proposed algorithm is a fairly straightforward extension of previous ideas. However, the authors are up-front about this, and their manuscript contains considerable additional analysis and empirical work, which more than clears any originality bar in this reviewer's opinion.

Quality: The work is of high quality, with extensive theoretical analysis of the convergence of the algorithm.

Clarity: The paper is quite clear. The appendix discussion of the proofs could use slightly more detail (particularly for outsiders); even a few sentences of guiding intuition would be useful (for instance, regarding Karush-Kuhn-Tucker points, for the uninitiated).

Significance: This method clearly outperforms other methods, and is conceptually simpler and easier to compute. Provided the authors actually release TensorFlow code for the results, this looks like a powerful, general technique for few-shot meta-learning.

Reviewer 2

UPDATE: I'd like to thank the authors for their detailed response. In light of this response I have increased my score to a 7.

Originality: This paper presents Meta-MinibatchProx (MMP), an algorithm for model- and algorithm-agnostic meta-learning that, unlike MAML and friends, comes with theoretical guarantees of convergence. To the best of my knowledge it is the first such algorithm to offer any convergence guarantees, and it also has the potential to scale to very large problems.

Quality: Something I would like to have seen addressed explicitly in this paper is the distinction between what I will call the "finite" and "infinite" versions of MMP. The distinction is related to the comment "Usually we are only provided with n observed tasks..." on line 135. In particular:

1. The number of tasks n is fairly small (say, in the tens) and we can permanently materialize the per-task parameters w_T for each task (the "finite" case).
2. The number of tasks n is infinite (as in the sine wave experiments) or effectively so (as in the few-shot ImageNet experiments), so that a single task is unlikely to be seen more than once and permanently materializing the task-specific parameters is not useful (the "infinite" case).

The paper focuses exclusively on the second setting (which is the typical framing for meta-learning), but it seems to me that one of the big advantages of MMP is its ability to deal well with the finite setting too. In the finite case I think the MMP framework offers a starting point for thinking about how to deal with issues like:

1. Task spaces where some tasks have vastly more data than others.
2. Hierarchically structured task spaces (for example, the structured labels in tieredImageNet could be used to construct hierarchical priors).
3. Task spaces with non-uniform input or output structures, where task-specific models share only some of their parameters (perhaps also with a graph structure over dependencies between model parts).
4. And so on…

This paper presents itself as a better MAML than MAML, which is justified through convergence theory and empirical results. It would be vastly more compelling if it also demonstrated the capability to deal with problems that are hard or impossible to express with MAML. I feel that not doing so is quite a missed opportunity.

Clarity: I found the paper quite clear. There is necessarily a lot of notation in a paper of this sort, but it is well presented and reasonably easy to follow.

Significance: As mentioned in the quality section, I think the biggest weakness of this paper is a missed opportunity to present an algorithm that can do more than the existing meta-learning formulations. It is one thing to have a slightly better meta-learning algorithm than the baselines; it is quite another to have that in a framework that is also general enough to address a wider class of problems.

Typos:
139: "inra-meta-task"
271: The text says 15 steps of SGD, but the legend in Figure 1(a) says 32 steps.
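The finite/infinite distinction raised in this review can be made concrete with a toy sketch. Everything in it is hypothetical and not taken from the paper: scalar parameters, quadratic task losses 0.5*(v - c)^2, a proximal term 0.5*lam*(v - w)^2, and the names `inner_steps` and `run`. With `persistent=True` the per-task parameters are stored and warm-started on each revisit (the "finite" case); with `persistent=False` every visit restarts from the meta-parameter w (the "infinite" case).

```python
def inner_steps(v, w, c, lam, lr=0.1, steps=3):
    """Take a few gradient steps on 0.5*(v - c)**2 + 0.5*lam*(v - w)**2, starting at v."""
    for _ in range(steps):
        grad = (v - c) + lam * (v - w)
        v -= lr * grad
    return v


def run(centers, lam=1.0, meta_lr=0.1, rounds=300, persistent=True):
    """Cycle through toy tasks; optionally keep per-task parameters between visits."""
    w = 0.0        # shared meta-parameter (the prior centre)
    stored = {}    # task index -> materialized w_T, used only when persistent=True
    for r in range(rounds):
        t = r % len(centers)
        # Finite case: warm-start from the stored w_T. Infinite case: restart from w.
        start = stored.get(t, w) if persistent else w
        v = inner_steps(start, w, centers[t], lam)
        if persistent:
            stored[t] = v
        # Gradient step on the proximal term with respect to w.
        w -= meta_lr * lam * (w - v)
    return w, stored


w_finite, params = run([1.0, 2.0, 3.0], persistent=True)
w_infinite, _ = run([1.0, 2.0, 3.0], persistent=False)
```

In this toy setting both modes leave w hovering near the mean of the task optima; the only difference is whether per-task state survives between visits.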

Reviewer 3

UPDATE: After reading the author feedback I would like to keep my score and stand by my recommendation to introduce a more intuitive, graphical explanation of the difference between MMP and MAML, similar to how my question was answered in the response.

Novelty: The approach seems to be quite novel, as do the convergence results. I am not an expert in this, but I think it should not be difficult to provide similar convergence guarantees for, e.g., standard MAML, and their absence is mostly due to a lack of interest from the community. However, it is important that such results are finally available.

Quality: The proposed Meta-MinibatchProx is a sound framework, analysed well from both theoretical and experimental perspectives. I have only a few questions and suggestions for the authors. I would appreciate more discussion of the parameter lambda and how one should set it. It must have something to do with the (meta-)generalisation performance, as it is in some sense the strength of the prior. If that is the case, it would be interesting to see a graph of the test performance, e.g. on ImageNet, as a function of lambda.

Clarity: The paper is well written and generally easy to follow. I have only a few comments:

1. The distinction between MAML and Meta-MinibatchProx updates discussed on lines 132-151 could be made clearer and perhaps supplied with a graphical illustration.
2. It looks like the function F(w) first appears in eq. 3 and is only defined implicitly.

Significance: The paper makes an interesting contribution to the very important area of meta-learning and, I think, will be recognised by the community.
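The remark that lambda acts as the strength of a prior can be illustrated with a minimal sketch. It assumes a per-task objective of the proximal form L_T(v) + (lam/2)*(v - w)^2 with a scalar toy loss L_T(v) = 0.5*(v - c)^2; this form, the closed-form solution, and the names `prox_solution` and `meta_train` are illustrative assumptions, not the authors' implementation. For determinism the meta-step averages over all toy tasks rather than sampling a minibatch.

```python
def prox_solution(w, c, lam):
    """Closed-form minimizer over v of 0.5*(v - c)**2 + 0.5*lam*(v - w)**2."""
    return (c + lam * w) / (1.0 + lam)


def meta_train(centers, lam, lr=0.5, steps=200, w0=0.0):
    """Gradient descent on the task-averaged proximal objective (toy, full batch)."""
    w = w0
    for _ in range(steps):
        # For this toy objective, the gradient with respect to w of each task's
        # minimized value is lam * (w - v_star), with v_star the task's prox solution.
        grads = [lam * (w - prox_solution(w, c, lam)) for c in centers]
        w -= lr * sum(grads) / len(grads)
    return w


# lam = 0 ignores the prior: the task solution equals the task optimum c.
no_prior = prox_solution(w=0.0, c=2.0, lam=0.0)
# A very large lam pins the task solution to the prior centre w.
strong_prior = prox_solution(w=0.0, c=2.0, lam=1e6)
# The learned prior centre for three toy tasks with optima 1, 2 and 3.
w_star = meta_train([1.0, 2.0, 3.0], lam=1.0)
```

In this toy model a small lambda lets each task drift to its own optimum, while a large lambda keeps every task-specific solution close to the shared w, which is the trade-off a plot of test performance against lambda would expose.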