NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2692
Title: Adaptive Cross-Modal Few-shot Learning

Reviewer 1

Pros:
- Paper is overall well-written
- Approach is simple yet effective
- Method achieves state-of-the-art (SOTA) performance on standard benchmarks
- Overall, this is a solid paper which warrants acceptance

Cons:
- Some expositional details should be added (see details)
- Some additional ablations would be useful (see details)

Detailed Comments
-----------------------

> Exposition
Overall, the paper provides a clear introduction that motivates the use of cross-modal information and lays out a high-level overview of the contributions. The experiments are also well structured and clearly presented. That said, some improvements are possible and would be warranted:
- The description of the problem at the beginning of Section 3.1.1 was confusing at first glance. Given the variety of few-shot learning scenarios, such as generalized few-shot learning, it would be useful to lay out the support and query sets with better notation and more formally (mathematically rather than descriptively).
- A more detailed overview of TADAM, the second model being extended (in addition to ProtoNets), is missing. This should be added.
- The zero-shot learning (ZSL) case is mentioned in multiple places throughout the work, which at times seems unnecessary.

> Novelty
The method is conceptually simple, yet practically effective (as shown by the new SOTA performance achieved on few-shot image classification on the miniImageNet and tieredImageNet datasets). The paper's adaptive convex combination using learned transformation and mixing networks is interesting and relatively novel.

> Significance
The fact that the method is agnostic with respect to which metric-based model it is extending is a positive. As such, the proposed model may have the potential to extend uni-modal metric-based models using other modes of information, such as class definitions and knowledge graphs. It could seemingly also be applied more broadly to other few-shot learning tasks.

> Experiments
The experiments on few-shot image classification are reasonable in scope and coverage. The chosen baselines cover the majority of SOTA and near-SOTA methods previously proposed, with good coverage across meta-learners, metric-based methods, space aligners, variational methods and other approaches. The CUB-200 results, although mainly included in the supplemental material, support the same conclusion, and the reported results show the superior performance of the proposed method on a variety of k-shot 5-way tasks.
- It would have been useful to also report results on MNIST, although presumably the bi-modal class embeddings for the digits may not have been as effective.
- Additional ablations should be added, specifically on the "N-way" behavior of the method. Are similar improvements observed with more classes? Although the 5-way classification setting is the most widely used, the method's behavior with respect to few-shot variables such as the number of classes is important.

> POST REBUTTAL
The authors have largely addressed my comments. While simple, the approach appears effective and valuable. I continue to be in favor of the paper's acceptance.

Reviewer 2

Post Rebuttal:
Thank you to the authors for their comments. My concern was primarily about not having labels for classes in the test episodes, since few-shot learning models can be applied to learn about arbitrary classes at evaluation without requiring any syntactic labels. In light of the authors' comments on the simplicity of the model and its applicability to any metric-based method, I have increased my score.

Before Rebuttal:
This paper proposes a modification to the few-shot image classification scenario, which involves using additional cross-modal semantic information in the form of word embeddings for class labels. The idea is that the extra semantic information can help in the many cases where discriminating between visually similar classes is difficult, especially when given few examples. The authors propose an extension of prototypical networks for this setting, where a prototype is the convex combination of the usual prototype computed from support item features and a learned transformation of the word embedding of the class label. The coefficient given to each part of the sum is a learned function of the word embedding. The method is evaluated on two few-shot learning benchmarks: Mini-Imagenet and Tiered-Imagenet. Additional experiments study the benefit of the extra semantic features as the number of examples for few-shot learning increases.

Strengths
1. Paper is well-written: it motivates and describes the idea well.
2. Extensive comparison with relevant baselines.

Weaknesses
1. The proposed method is a relatively simple extension: it uses the typical prototype for a class in addition to a transformation of the class word embedding.
2. The benefit of incorporating semantic information is largely in the 5-class, 1-shot learning case (3.5% on Mini-Imagenet and 2.75% on Tiered-Imagenet accuracy gain compared to state-of-the-art LEO applied to the regular few-shot learning scenario), and there seems to be very little gain beyond that number of shots.

Comments
The proposed setting and method require that word embeddings are known for all train and test classes. Is it a more realistic scenario for few-shot learning that word embeddings are only available for train classes, since this removes the requirement that the model can only be used to learn about concepts for which we already have word embeddings? Is semantic information as defined in the paper applicable to few-shot settings beyond image classification?
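
For readers of this review, the prototype construction summarized above (a convex combination of the visual prototype and a transformed word embedding, with a coefficient computed from the word embedding) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `cross_modal_prototype`, `g_params` and `h_params` are hypothetical, single affine layers stand in for the learned transformation and mixing networks, and the sigmoid is an assumption used only to keep the coefficient in (0, 1).

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (the usual visual prototype)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def linear(weights, bias, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_modal_prototype(support_features, word_embedding, g_params, h_params):
    """Mix a visual prototype with a transformed word embedding.

    g_params and h_params are (weights, bias) pairs; both are hypothetical
    one-layer stand-ins for the learned networks mentioned in the review.
    """
    visual = mean_vector(support_features)
    # g: word embedding -> visual feature space (assumed affine here)
    semantic = linear(*g_params, word_embedding)
    # h: word embedding -> scalar, squashed to (0, 1) (sigmoid is an assumption)
    lam = sigmoid(linear(*h_params, word_embedding)[0])
    mixed = [lam * v + (1.0 - lam) * s for v, s in zip(visual, semantic)]
    return mixed, lam


# Toy usage with random parameters: 5 support items, 4-d features, 3-d embedding.
rng = random.Random(0)
support = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
embedding = [rng.gauss(0, 1) for _ in range(3)]
g = ([[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)], [0.0] * 4)
h = ([[rng.gauss(0, 1) for _ in range(3)]], [0.0])
prototype, lam = cross_modal_prototype(support, embedding, g, h)
```

Because the coefficient depends only on the class word embedding in this sketch, it is shared by every episode containing that class; swapping the input of `h` would be the place to experiment with other gating choices.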

Reviewer 3

1. Although the gated fusion of visual and semantic representations is simple, the idea is interesting and effective.
2. To my knowledge, I have not seen a similar idea in few-shot learning, but I am not sure whether other work has similar ideas or implementations.
3. In the gated formulation of Equation 5, the computation of \lambda should come from the concatenation of both the visual representation and the word embedding rather than from the word embedding only. Please refer to the following two papers:
   - Gated fusion unit for multimodal information fusion, ICLR.
   - Learning semantic concept and order for image and sentence matching, CVPR.
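
The contrast raised in point 3 can be written out as a short sketch. The notation is assumed here for illustration and is not copied from the paper's Equation 5: p_c denotes the visual prototype of class c, w_c its (transformed) word embedding, h a learned mixing network, and \sigma a sigmoid gate.

```latex
\begin{align}
\text{word embedding only:}\quad
  \lambda_c &= \sigma\bigl(h(w_c)\bigr) \\
\text{suggested concatenation:}\quad
  \lambda_c &= \sigma\bigl(h([\,p_c \,;\, w_c\,])\bigr) \\
\text{in both cases:}\quad
  p'_c &= \lambda_c\, p_c + (1 - \lambda_c)\, w_c
\end{align}
```

Under the second form the gate can react to the support images of the current episode, whereas under the first it is fixed per class label.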