Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Please find my concerns about this approach below:
1. The proof of Lemma 1 relies on too many strong continuity assumptions. Such assumptions may be acceptable for simple datasets, but they may not generalize to more complex scenarios. I would like to see a simple frequentist study showing how often these assumptions are violated in practice. One option is to evaluate the assumed continuities at randomly sampled points in a real-world setup.
2. Complementary to the above, the paper would be more impactful if the authors studied a complicated real-world task and showed superior performance of their approach. A study of failure cases would also be greatly appreciated. Table 2 shows an extremely saturated task, which leaves the reviewer in doubt. Beyond a task that is already at 95+ performance, I would like to see the proposed model challenged in a more complex setup.
3. What is the optimizer for MAML in Table 2?
4. The authors highlight that MAML may have gradient instabilities (line 274), which the reader naturally connects to the continuity of the gradients and the inner loop. Wouldn't this also break your claims about continuity? What would happen if the assumptions do not hold in your approach? Methods like MAML are desirable because they work regardless of the exact complexity of the data and of the continuity properties along the optimization path.
5. Line 172 needs a citation. It is not clear why line search methods would in fact fail.
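The random continuity check requested in concern 1 could look roughly like the sketch below. It assumes the continuity assumptions in question amount to a Lipschitz-type bound on the loss gradient, which this review does not spell out. The function `estimate_gradient_lipschitz` and the quadratic stand-in loss are hypothetical names introduced only for illustration; in a real study `grad_fn` would be the gradient of the actual task loss.

```python
import math
import random


def estimate_gradient_lipschitz(grad_fn, dim, num_pairs=1000, radius=1.0, seed=0):
    """Largest observed ||grad(x) - grad(y)|| / ||x - y|| over random pairs.

    This is only an empirical lower bound on the true Lipschitz constant:
    a small value proves nothing, while a very large value flags regions
    where a smoothness assumption is doubtful.
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_pairs):
        x = [rng.uniform(-radius, radius) for _ in range(dim)]
        y = [rng.uniform(-radius, radius) for _ in range(dim)]
        dist = math.dist(x, y)
        if dist == 0.0:
            continue  # skip degenerate pairs
        worst = max(worst, math.dist(grad_fn(x), grad_fn(y)) / dist)
    return worst


# Hypothetical stand-in for a task loss: f(x) = 1.5 * ||x||^2, so grad f(x) = 3x
# and the gradient is exactly 3-Lipschitz.
def quadratic_grad(x):
    return [3.0 * xi for xi in x]


lipschitz_estimate = estimate_gradient_lipschitz(quadratic_grad, dim=5)
```

Counting how often the observed ratio exceeds an assumed constant, over many sampled pairs, would give the kind of frequency estimate the concern asks for.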
The paper tackles the problem of efficiently differentiating through the inner loop in meta-learning (mainly for few-shot classification/regression) so as to learn the optimal meta-learner, i.e. the "optimal" initial model from which a new model can be learned with few examples. To this end, the paper gives an excellent definition of one of the main problems in existing meta-learning algorithms, i.e. the difficulty of computing gradients over the compute chain of the meta-learning algorithm's inner loop.
On the positive side, the proposed technique is simple, novel and clear; presented with neat arguments, and derived in a theoretically sound way. I really enjoyed reading the paper even though I am familiar with the problem being tackled. It is also good to see that the paper improves over directly relevant MAML-based baselines, even if it is not achieving state-of-the-art results.
On the negative side, I am quite unimpressed with the experimental analysis and verification of the proposed method. The problems I see are as follows:
- The paper makes one important speculative argument about the reasons for improving over MAML, even in the cases where MAML gives the exact inner-loop gradient (albeit inefficiently): the paper claims that MAML works worse due to numerical instability, but that is not verified!
- The following question is not answered: would MAML still behave badly if you were to implement it with infinite precision, or scale the data to minimize numerical problems? Or is something else going on, in terms of the improvements brought by the proposed implicit gradient technique, when the exact same inner loop is being used as in MAML?
- It would be very interesting to see the accuracy of the implicit gradient compared against the exact (MAML) gradients. How does the implicit gradient's accuracy depend on the number of steps being computed?
- I am surprised to see that Figure 2 presents iMAML with only 20 inner update steps. Why? Given that the performance of the MAML baselines varies rather significantly depending on the number of steps, I find it truly essential to do the same kind of analysis with iMAML.
- The choice of datasets is just poor. The sigmoid experiments are fine for understanding the algorithm, but Omniglot is pretty much saturated, with accuracy values at around 98% in most settings. Why not some other commonly used benchmark, like mini-ImageNet (with or without episodic training) or ImageNet-1k, ideally in a generalized few-shot learning setting? Incorporating at least one "more modern" benchmark would significantly improve the experimental analysis.
- Finally, although not very critical, an empirical comparison of training speed and memory use between MAML and iMAML would be very nice to see for a few different inner-loop optimizers.
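The comparison of implicit and exact gradients asked for above can be illustrated on a toy problem where the exact meta-gradient is available in closed form. The sketch below is not the paper's implementation. It assumes a proximally regularized inner objective, min over phi of L(phi) + (lam/2)·||phi − theta||², for which the implicit meta-gradient is (I + H/lam)⁻¹ applied to the outer-loss gradient, and it assumes that this linear system is solved with a few conjugate gradient (CG) steps. The diagonal quadratic loss, all numerical values, and all names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_matrix, rhs, num_steps):
    """Run at most num_steps CG iterations on apply_matrix(x) = rhs, from x = 0."""
    x = [0.0] * len(rhs)
    residual = list(rhs)
    direction = list(rhs)
    rs_old = dot(residual, residual)
    for _ in range(num_steps):
        if rs_old < 1e-30:
            break  # already converged
        a_dir = apply_matrix(direction)
        step = rs_old / dot(direction, a_dir)
        x = [xi + step * di for xi, di in zip(x, direction)]
        residual = [ri - step * ai for ri, ai in zip(residual, a_dir)]
        rs_new = dot(residual, residual)
        direction = [ri + (rs_new / rs_old) * di for ri, di in zip(residual, direction)]
        rs_old = rs_new
    return x


# Toy inner loss with diagonal Hessian H: L(phi) = sum(0.5 * h_i * phi_i^2 - b_i * phi_i).
hessian_diag = [1.0, 4.0, 9.0]
b = [1.0, 2.0, 3.0]
lam = 2.0                  # strength of the proximal term (assumed value)
theta = [0.0, 0.0, 0.0]    # meta-parameters
target = [1.0, 1.0, 1.0]   # outer loss: 0.5 * ||phi - target||^2

# Closed-form inner solution of min_phi L(phi) + lam/2 * ||phi - theta||^2.
phi_star = [(bi + lam * ti) / (hi + lam) for bi, ti, hi in zip(b, theta, hessian_diag)]
outer_grad = [p - t for p, t in zip(phi_star, target)]

# Exact meta-gradient: (I + H / lam)^{-1} applied to the outer gradient.
exact = [g / (1.0 + hi / lam) for g, hi in zip(outer_grad, hessian_diag)]


def apply_system(p):
    """Matrix-free product (I + H / lam) p; only Hessian-vector products are needed."""
    return [pi + hi * pi / lam for pi, hi in zip(p, hessian_diag)]


# Error of the CG-based implicit gradient after k = 0, 1, 2, 3 steps.
errors = []
for k in range(4):
    approx = conjugate_gradient(apply_system, outer_grad, num_steps=k)
    errors.append(math.dist(approx, exact))
```

On this three-dimensional problem CG recovers the exact gradient, up to floating-point error, after three steps. On a real network the same error curve would have to be measured against unrolled MAML gradients, and it would also depend on how accurately the inner problem itself is solved.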
The paper is well written and well structured. It clearly references existing work on (this type of) meta-learning, and gives an excellent overview of the area. The concept of "implicit differentiation" probably deserves a slightly better introduction and explanation, plus more references, given its relative importance to the main contribution. Apart from this slight exception, each section of the paper is well presented, concise, and easy to understand. I especially appreciated the references and background section, and the interpretation of how iMAML relates to, builds on, differs from, and generalises previous work. I wish the paper were a bit clearer on what iMAML "gives up" compared to normal MAML, e.g. by pointing out situations where MAML might succeed and iMAML fail (and vice versa). I also would have wished for more detail and resolution in the experimental section, and perhaps for a slightly more thorough experimental comparison, or a stronger demonstration of superior performance (e.g. wall-clock measurements of CPU and memory usage). Nonetheless, it is an excellent paper that I enjoyed reading and would recommend for acceptance at NeurIPS 2019.