NeurIPS 2020

Few-shot Image Generation with Elastic Weight Consolidation


Review 1

Summary and Contributions: This paper presents a methodology to train a GAN in a few-shot setting, for a domain where limited training examples are available. To achieve this, the authors propose to fine-tune model parameters trained on a domain with abundant training examples by setting a variable layerwise learning rate. To identify the suitable learning rate, the authors employ Fisher Information (FI). They also adopt the elastic weight consolidation loss as a regulariser, which has previously been applied successfully to fine-tuning discriminative models. The proposed method is evaluated on multiple source-target domain pairs. Mostly qualitative, and also quantitative, comparisons are made with some of the existing art.

Strengths: The few-shot paradigm for training a generative model is an interesting research problem. Extensive qualitative comparisons are made. Qualitative results seem better than existing art in the compared setup. Quantitative measurements also show the efficacy of the proposed method against the existing art. The paper is generally well written. I would consider the novelty of the method moderate, as the regulariser has previously been applied to training discriminative models.

Weaknesses: To identify the trend of the layerwise fine-tuning rate, the authors took CelebA (200K images) as the source domain and Bitmoji (80K) as the target domain (lines 120-122). However, the downstream scenario the authors consider has very limited training examples (from 1 to 10). What guarantees that the former scenario generalises to the latter? After the rebuttal, I still question how the fine-tuned version would be better when such a large dataset is already available. Referring to the above comment, if up to 80K examples are available, does the model merely fine-tune, or does it learn parameters specific to the target domain? Is there any guarantee that the fine-tuned version is better than one trained from scratch?

Correctness: FID as a metric to evaluate the quality of synthetic data is reasonable. However, the authors compare with the existing art at a single point (10-shot) only (Table 1). From this comparison alone, it is very hard to draw a concrete conclusion; hence, it is important to compare at other shot counts as well (Table 2). In the rebuttal, results on other shot counts were reported and look convincing. Hence, I am raising my rating.

Clarity: Yes, the paper is generally well written.

Relation to Prior Work: Generally, related works are discussed well. Need to discuss the following reference too: Zakharov, Egor, et al. "Few-shot adversarial learning of realistic neural talking head models." Proceedings of the IEEE International Conference on Computer Vision. 2019.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: In this paper, the authors propose a novel approach to few-shot image generation that uses regularized finetuning according to the importance of the parameters. The paper includes an extensive and convincing qualitative and quantitative evaluation of the proposed method, and a thorough analysis of the impact of the number of target examples and of the dissimilarity between source and target datasets on the quality and diversity of image generation.

Strengths: The proposed method is sufficiently novel and reasonable. The authors provide either empirical evaluation results or appropriate references for almost every claim they make in the paper. The experimental setup is thoroughly thought through, and appropriate metrics are used to support their hypothesis. The qualitative results indicate a clear advantage of the proposed solution. Finally, the paper is easy to read and well organized.

Weaknesses: The quality of the paper is already great, but there are a few comments.
1. In Equation 3 (page 4), it is not clear whether you compute F on the generated source or target data. Also, I don't quite understand why the FI is computed for the difference between the pretrained and finetuned parameters, and not just for the pretrained parameters. Finally, I assume i in this equation is the layer index, but this should be clearly stated. Update: In the rebuttal, the authors kindly explained that F is computed for each individual parameter in the network rather than for an entire layer. I suggest adding this clarification to the main paper as well.
2. In Figure 3, it would be very illustrative to show how each layer affects the generation. You could do this by regularizing everything but one layer and looking at the generation result. This would further convince the reader that some layers are more important than others for diversity preservation.
3. In Table 3, it is unclear from the caption what source dataset is used. Update: Please add information about the source dataset to the caption. Image and table captions should contain all the necessary information about the experiment, so that the reader doesn't have to search for it in the main text.
Apart from that, great job!
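For concreteness, the per-parameter computation clarified in the rebuttal can be sketched as follows. This is a plain-Python illustration of the general EWC-style penalty, not the authors' code; the flat-list parameter layout and the function names are my own assumptions.

```python
def diagonal_fisher(grad_samples):
    """Diagonal Fisher Information estimate: for each individual
    parameter i (not each layer), average the squared gradient
    over a list of sampled gradient vectors."""
    n = len(grad_samples)
    dim = len(grad_samples[0])
    return [sum(g[i] ** 2 for g in grad_samples) / n for i in range(dim)]

def ewc_penalty(params, source_params, fisher, lam=1.0):
    """EWC-style regularizer: lam * sum_i F_i * (theta_i - theta_{S,i})^2,
    with one Fisher weight F_i per individual parameter."""
    return lam * sum(f * (p - s) ** 2
                     for f, p, s in zip(fisher, params, source_params))
```

The penalty vanishes when the adapted parameters equal the source parameters, and parameters with larger Fisher values are pulled more strongly toward their source values during finetuning.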

Correctness: Apart from the claim that the last layers play a more significant role in diversity preservation (which was clarified in the rebuttal), all the main claims of the paper are intuitive or proved empirically.

Clarity: The paper is very well-written.

Relation to Prior Work: The prior work is covered well, although it would be better if the baselines were described in more detail in the related work section.

Reproducibility: No

Additional Feedback:


Review 3

Summary and Contributions: The paper proposes to regularize the changes of the network parameters during the source-to-target adaptation process for few-shot image generation. The method is demonstrated to effectively preserve the “information” of the source dataset while fitting the target.

Strengths: 1. The idea of quantifying the “importance” of each parameter (the deep network's parameters) for adaptation tuning is interesting and novel. It effectively transfers knowledge to target datasets for few-shot image generation. 2. The qualitative and quantitative comparisons are well analyzed and demonstrate the effectiveness of the proposed method. 3. The paper is well written and easy to follow.

Weaknesses: Overall, the idea introduced in the paper is novel and inspiring, though the regularization term in Eq. 3 was proposed in [17]. The reviewer has some concerns about the experiments. 1. The experiments are mainly performed on different face datasets at an image size of 256×256. Can the approach be used in high-resolution settings, or for synthesis where the images contain many structured details (e.g., urban driving images and indoor images)? 2. Since a key contribution is using Eq. 2 (F_i) to weight the regularization terms (\theta_i - \theta_{S,i})^2, some straightforward options (baselines) should be compared. For example, assign fixed weights to each convolutional layer according to Figure 2 (middle). Besides, how about removing F_i?
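The fixed-per-layer-weight baseline suggested in point 2 above could be sketched as follows. This is a hypothetical plain-Python illustration, not something in the paper; the nested-list parameter layout and function name are my own assumptions.

```python
def fixed_layer_penalty(params_by_layer, source_by_layer, layer_weights, lam=1.0):
    """Baseline variant: replace the per-parameter Fisher weights F_i
    with one fixed constant per convolutional layer (e.g. chosen from
    the layerwise trend in Figure 2, middle). Parameters are given as
    per-layer lists of floats."""
    total = 0.0
    for w, ps, ss in zip(layer_weights, params_by_layer, source_by_layer):
        total += w * sum((p - s) ** 2 for p, s in zip(ps, ss))
    return lam * total
```

Comparing this against the Fisher-weighted penalty (and against removing F_i entirely, i.e. setting all weights to 1) would isolate how much the per-parameter importance estimate actually contributes.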

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: The rebuttal addresses some of my concerns. The main contribution of the paper is its interesting idea, i.e., quantifying the “importance” of each parameter (the deep network's parameters) for adaptation tuning. The regularization term is not a major contribution since it has been studied in [17]. I will keep my borderline positive rating.


Review 4

Summary and Contributions: This paper proposes a method for few-shot image generation that regularizes the changes of the weights during adaptation to best preserve the information of the source dataset. The effectiveness of the proposed algorithm is demonstrated by generating high-quality results in different target domains.

Strengths: The proposed self-adaptation technique produces diverse generations with limited data. The effectiveness of the proposed method is demonstrated in artistic domains and on several cross-domain source/target pairs, in contrast to previous methods, which mostly focus on the photo domain.

Weaknesses: The main contribution of this paper is the regularization of the changes of the weights during adaptation to best preserve the source information; thus, the technical contribution of this work is limited. In line 132, it is claimed that the later layers in generators are mainly responsible for synthesizing low-level features, which are more likely to be shared across domains. This is hard to understand, because high-level semantic features tend to be shared across domains. Some analysis of the regularization parameter is also missing: it is claimed that the parameter is determined empirically, but how does performance vary with different regularization parameters?

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Having seen the rebuttal and the other reviews, I decide to raise my score.