Review for NeurIPS paper: Transductive Information Maximization for Few-Shot Learning

NeurIPS 2020

Transductive Information Maximization for Few-Shot Learning

Review 1

Summary and Contributions: This paper presents a transductive few-shot learning method based on information maximization. The aim is to maximize the mutual information between the query and the support set for each episode of a few-shot learning task. The idea is to use features from a pre-trained model and to update the classifier weights using the query and support images. The method outperforms recent transductive few-shot learning methods such as TPN and SIB.

Strengths:
- The proposed method is somewhat novel for few-shot learning, as it takes a mutual-information perspective.
- The results are superior to those of previous methods.

Weaknesses:
- The motivation of the method is weak. In particular, the introduction motivates mutual information by citing InfoMax in deep representation learning, yet the paper uses a fully supervised (pre-trained) model, and the proposed method has little connection to the mutual information between the input and output of deep networks presented in DeepInfoMax.
- Some citations are missing for recent work on transductive few-shot learning and information maximization. I recommend discussing [1] and [2].
- The method is similar to [3], especially Eq. 7 in [3]. Could the authors highlight and discuss the difference?
- In the 1-shot case, is the one-hot encoded label used directly for $\pi$ in Eq. 3?
- It is not well described how w is updated. In TIM-GD, is w always updated with all of the classes (using the loss in Eq. 3)? In some recent few-shot learning works, the samples used to update w are drawn from the support set (like a mini-batch) and do not necessarily cover all K classes.
- Is the proposed method limited to the uniform distribution? It is still not clear how it can be applied to the iNat dataset, which has highly imbalanced tasks.

[1] Li et al., "Learning to Self-Train for Semi-Supervised Few-Shot Classification," NeurIPS, 2019.
[2] Guo and Cheng, "Attentive Weights Generation for Few Shot Learning via Information Maximization," CVPR, 2020.
[3] Hu et al., "Empirical Bayes Transductive Meta-Learning with Synthetic Gradients," ICLR, 2020.

Correctness: The equations are correct and the experiments also follow the existing works.

Clarity: The rebuttal clarifies the ambiguous passages. Please clarify the unclear statements in the revised version.

Relation to Prior Work: Please discuss [2, 3] in the revised version as the readers might need a better description to bridge the existing techniques and the proposed approach.

Reproducibility: Yes

Additional Feedback: After going through the rebuttal and the other reviews, I believe this work benefits the community. The rebuttal addressed my concerns, so I vote for acceptance. The other reviewers also ask for a more detailed explanation of the proposed approach and experiments. I encourage the authors to revise the manuscript to address the other reviewers' concerns as well and to make it clearer.

Review 2

Summary and Contributions: The authors propose an information maximization method for few-shot learning. Unlike traditional methods, it integrates a KL divergence regularizer to maximize the mutual information between query samples and predictions. According to the experiments, the method can handle few-shot tasks even under domain shift.

Strengths:
1. The paper proposes a method inspired by information theory, which is theoretically well-founded.
2. The authors derive an approximate method for minimizing the overall loss function, which effectively speeds up the optimization.
3. Extensive experiments have been conducted.

Weaknesses:
1. In the standard few-shot setting (e.g., 5-way 5-shot tasks), $\pi$ is a uniform distribution for both the support set and the query set under a transductive setting. In this case, there is no difference between $\pi$ and $\pi_k$ (fixed by the label statistics of the support set). In other words, $\pi_k$ is just a uniform distribution, and Eq. (2) reduces to the standard mutual information, as mentioned in Remark 1. In fact, $\pi$ is an estimate of the true distribution of the query set, which is not known in a real setting, i.e., an inductive setting. Importantly, I don't think $\pi$ can be estimated from the prior distribution of the support set, since the two may differ. The authors estimate the distribution of the query set using the distribution of the support set under a transductive setting. This is not reasonable; I think it is just an artificial coincidence.
2. In Eq. (1), the authors say $p_{ik}$ is proportional to $\exp(\cdot)$. Why? Has the feature vector been $\ell_2$-normalized?
3. I think some figures or text are needed to introduce the steps of the algorithm.
4. The implementation details are not very clear. For example, what image size is used in the experiments?
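To make the reduction mentioned in item 1 concrete, here is a short worked derivation. It assumes the marginal term has the form $D_{\mathrm{KL}}(\hat{p} \,\|\, \pi)$, where $\hat{p}_k$ is the average soft prediction for class $k$ over the query set. This form and the symbols $\hat{p}$, $Q$, $X_Q$ and $Y_Q$ are assumptions made for illustration, not quoted from the paper.

```latex
% Assumed notation: K classes, query set Q, soft predictions p_{ik},
% and the query label marginal \hat{p}_k = \frac{1}{|Q|} \sum_{i \in Q} p_{ik}.
\begin{align*}
D_{\mathrm{KL}}(\hat{p} \,\|\, \pi)
  &= \sum_{k=1}^{K} \hat{p}_k \log \frac{\hat{p}_k}{\pi_k} \\
  &= \sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k + \log K
     \qquad \text{(uniform prior, } \pi_k = 1/K \text{)} \\
  &= \log K - H(\hat{p}).
\end{align*}
% Consequently, with a uniform prior,
\begin{align*}
-D_{\mathrm{KL}}(\hat{p} \,\|\, \pi) - H(Y_Q \mid X_Q)
  = \underbrace{H(\hat{p}) - H(Y_Q \mid X_Q)}_{\text{standard mutual information}} - \log K .
\end{align*}
```

Under these assumptions the prior-aware quantity differs from the standard mutual information only by the constant $\log K$, which does not affect the optimization.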

Correctness: Maybe not.

Clarity: I think the writing needs to be improved. It is difficult for me to understand the whole process of the proposed method. The problem definition is not very clear. There are no figures or materials to help understand.

Relation to Prior Work: Yes. As far as I know, this paper differs from prior work.

Reproducibility: Yes

Additional Feedback: I have carefully read the authors' feedback and the comments of the other reviewers. Some of my concerns have been well addressed, but others have not. However, because this paper does have some impact on the community, I would like to accept it and encourage the authors to refine the manuscript according to the following concerns.
1. What is the main contribution? In the feedback, the authors said "the use of a label marginal prior (R2, R4), which is just a convenient generalization of our MI, but not really the main contribution of the paper" (Lines 5-6) and "We emphasize that the MI (no prior) alone is our contribution, and achieves SOTA results over 4 standard benchmarks by wide margins (it is also competitive on iNat)" (Lines 14-15). However, in the submitted paper, the authors said "In fact, if we remove the marginal divergence D_KL in objective (3), our TIM objective reduces to the loss in [6]." (Lines 129-130 in the submitted paper). If my understanding is correct, MI (no prior) means that there is no D_KL term. Therefore, what is the contribution of this paper compared to [6]?
2. Is the introduced D_KL really important? The authors said "The label-marginal regularizer D_KL is of high importance. As it will be observed from our experiments, it brings substantial improvements in performances (e.g., up to 10% increase in accuracy over entropy fine-tuning on the standard few-shot benchmarks)" (Lines 131-132 in the submitted paper). However, from the ablation study in Table 3, we can see that TIM-ADM (CE+D_KL+H) has almost the same results as TIM-ADM (CE+H) in the 5-shot setting on all three datasets. The slight improvements in the 1-shot setting may come from randomness. The same phenomenon can be seen in Table 4, where TIM-GD (Uniform) has almost the same results as SimpleShot [42].
3. Why use a support-based prior? Although the authors have explained the reason for using a support-based prior, the explanation still does not convince me, because it is given under a transductive setting.
4. Implementation details. I cannot agree with the authors' viewpoint. I think a good paper should be self-contained. Why not make this part clearer in your own paper?

Review 3

Summary and Contributions: This paper studies the transductive setting of few-shot learning from the perspective of mutual information maximization. In particular, it designs the TIM loss, which consists of a CE loss and the empirical prior-aware mutual information. To train the model, the paper proposes TIM-GD, which has the best performance, and further proposes TIM-ADM to speed up training while maintaining accuracy. Experiments on standard few-shot learning and on scenarios with domain shift and severe class imbalance show significant performance improvements.

Strengths:
- The paper introduces the empirical prior-aware mutual information, which includes the standard mutual information as a special case when the prior is the uniform distribution.
- The training procedure minimizes the cross-entropy and maximizes the empirical prior-aware mutual information simultaneously. The former uses the supervised information from the support set, while the latter uses the unsupervised information from the query set.
- For the optimization, to speed up training, the paper designs an alternating direction method for minimizing the loss function, which achieves almost the same accuracy as gradient descent.

Weaknesses:
- This paper focuses on transductive few-shot learning. However, this setting is too artificial, since the model can see all test data (not only one test task) during training.
- The loss function includes three terms: the standard CE loss, the KL divergence between the prior and the latent label distribution, and the conditional entropy, which keeps the classifier's boundaries from passing through dense regions of samples. The novelty is borderline.
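For reference, the three-term objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the weights `lam` and `alpha`, and the soft-assignment form (a softmax over negative squared distances to class prototypes, with l2-normalized features) are all assumptions made for the sketch.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def soft_assign(z, prototypes, tau=1.0):
    """Assumed soft assignment: p_k proportional to exp(-tau/2 * ||w_k - z||^2)."""
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    z = [v / norm for v in z]  # l2-normalize the feature (assumption)
    logits = [-0.5 * tau * sum((w - v) ** 2 for w, v in zip(wk, z))
              for wk in prototypes]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(support_probs, support_labels):
    """Mean negative log-likelihood of the true labels on the support set."""
    nll = sum(math.log(p[y] + EPS) for p, y in zip(support_probs, support_labels))
    return -nll / len(support_labels)


def marginal(query_probs):
    """Average prediction per class over the query set."""
    n = len(query_probs)
    return [sum(p[k] for p in query_probs) / n for k in range(len(query_probs[0]))]


def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pk * math.log((pk + EPS) / (qk + EPS)) for pk, qk in zip(p, q))


def conditional_entropy(query_probs):
    """Mean entropy of the per-sample query predictions."""
    total = sum(pk * math.log(pk + EPS) for p in query_probs for pk in p)
    return -total / len(query_probs)


def three_term_loss(support_probs, support_labels, query_probs, prior,
                    lam=0.1, alpha=0.1):
    """CE on the support set + KL(query marginal || prior) + conditional entropy.

    lam and alpha are hypothetical weights chosen for illustration.
    """
    return (lam * cross_entropy(support_probs, support_labels)
            + kl_divergence(marginal(query_probs), prior)
            + alpha * conditional_entropy(query_probs))


prior = [0.5, 0.5]
support = [[1.0, 0.0], [0.0, 1.0]]
balanced = three_term_loss(support, [0, 1], [[1.0, 0.0], [0.0, 1.0]], prior)
collapsed = three_term_loss(support, [0, 1], [[1.0, 0.0], [1.0, 0.0]], prior)
```

In this toy example, confident and balanced query predictions give a loss close to zero, while predictions that collapse onto a single class are penalized by the marginal term alone, which is the role the divergence plays in the description above.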

Correctness: Yes. I haven't seen anything wrong so far.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper presents a novel few-shot learning framework using Transductive Information Maximization. In the transductive setting, the authors assume the model can see the unlabeled test data. They also introduce two new loss terms, the marginal divergence and the conditional entropy. The results show that the model outperforms the state of the art by a large margin.

Strengths:
1. The model significantly outperforms the baselines on four different datasets.
2. The paper shows that naive gradient descent optimization is two orders of magnitude slower than the baseline. To speed up the model, the authors introduce an alternating direction method that uses auxiliary variables.

Weaknesses: The results in this paper are really promising, and the idea is novel. I just have some questions and suggestions:
1. In Table 3, have the authors done multiple runs to get the average accuracy? I found that the TIM-GD+CE+{w} results differ from the TIM-ADM+CE+{w} results. I think they are supposed to be the same.
2. The entropy loss is widely used in domain adaptation, but I haven't found papers using the label prior and mutual information. Is it possible to apply this method to the domain adaptation problem?
3. In the paper, the authors assume the prior is uniform or S-based. However, in some cases the query images might not satisfy this prior. For example, 50% of the support images are class A, but only 10% of the query images are class A. How will the model perform in that case?
4. Figure 5 is missing.
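The mismatch raised in question 3 can be quantified with a small sketch. It assumes a two-class episode (class A versus everything else) and assumes the regularizer is the KL divergence from the query label marginal to the prior; both are illustrative assumptions, and the variable names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions with q_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


support_prior = [0.5, 0.5]   # 50% of the support images are class A
query_marginal = [0.1, 0.9]  # only 10% of the query images are class A

# Penalty paid by a classifier that labels every query image correctly,
# when the prior is taken from the support set.
mismatch_penalty = kl_divergence(query_marginal, support_prior)

# Penalty for the same predictions if the true query proportions were known.
matched_penalty = kl_divergence(query_marginal, query_marginal)

print(f"mismatched prior: {mismatch_penalty:.3f} nats, "
      f"matched prior: {matched_penalty:.3f} nats")
```

Under these assumptions, even perfectly correct query predictions incur a nonzero marginal penalty when the support-based prior does not match the query set, so the regularizer pulls the predictions away from the true class proportions.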

Correctness: Yes

Clarity: The paper is clear, but there are some typos.
1. In the abstract, "Transductive Infomation Maximization" should be "Transductive Information Maximization".
2. Line 276: "similar as" should be "similar to".

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The authors have addressed my questions. I will keep my score unchanged.