NeurIPS 2020

Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation

Review 1

Summary and Contributions: This paper proposes two methods to improve the performance of a text-guided image-to-image translation model: 1. adopting a word-region-level discriminator that considers only words representing visual properties, and 2. using a single generator-discriminator pair so the model can run on memory-limited devices. The ablation study shows that the proposed methods work as intended, but the technical novelty seems somewhat incremental.

Strengths: The authors clearly describe why they propose these methods and support their claims through ablation studies. This work is relevant to the NeurIPS community, but I did not find major contributions.

Weaknesses: - The total training time is compared to the baseline to show that the proposed model is light. However, comparing the inference time or the number of generator parameters would be more convincing. - The results preserve the text conditions well, but they look somewhat unrealistic and blurry, especially on the COCO dataset. A human evaluation is required to claim the superiority of the results.

Correctness: The description of the method is clear, and it is supported by the experimental results.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The limitations of previous works, and the differences between these and the proposed method are clearly explained.

Reproducibility: Yes

Additional Feedback: In relation to the weaknesses noted above, I think further analysis of the following would improve the paper: - Inference time comparison - Human evaluation of the generated images

POST REBUTTAL COMMENTS: The authors' rebuttal clearly addressed the questions I raised. In particular, I initially felt that the novelty of this model was somewhat minor, but their explanation of why the smaller model is able to outperform existing models was reasonable and changed my mind. In my opinion, rather than emphasizing that the model is lightweight, it would be better for the title to highlight that the performance improvement comes from the word-level discriminator, which provides a more explicit training signal. Since the benefits of the proposed methods are demonstrated both quantitatively and qualitatively, I raise my score from 4 to 6.

Review 2

Summary and Contributions: This paper proposes a method for text-to-image synthesis based on a Generative Adversarial Network (GAN). The proposed method extends the model from [14] by introducing a word-level alignment loss in the discriminator and designing a much more lightweight generator that requires a much smaller memory footprint. Experiments on two datasets, CUB birds and MS-COCO, demonstrate the effectiveness of the proposed method compared to the previous state of the art.

Strengths: + The paper is generally well written and easy to follow. + The generated image quality and speed seem to be clear improvements over the existing state of the art.

Weaknesses: - The technical novelty of the proposed method is somewhat incremental, since it is largely based on the work from [14] with some modifications to the generator and discriminator architectures. The word-level training feedback in the discriminator seems to be the main technical contribution, but it is not ground-breaking, as it extends the auxiliary classifier in a conditional GAN to multiple classes (i.e., pre-defined sets of words). - The approach to discovering which parts of an image are relevant and irrelevant to the text description is very heuristic and may not be valid in general. Specifically, only nouns and adjectives are chosen manually as text-relevant attributes, which convey a very limited context of general descriptions. Although this may allow fine control of the image content in a limited context, it reduces the ability to align the rich context of the text with the image, which is available in approaches that learn to encode the whole sentence (e.g., [4]). Although the authors offer some justification for the heuristic approach in Section 3.2.1, this assumption does not seem to hold in general. - It would be informative to include more baselines in the experiments; the current comparisons focus mostly on ManiGAN. It would also be informative to include a user study, as there are no comprehensive metrics that measure both alignment and synthesis quality.

Correctness: I did not find any technical or factual errors in the proposed method.

Clarity: Details of the experiments and models are missing from the main draft, which makes it less self-contained. For instance, how do you create training data for text-guided image manipulation without manipulated ground truth? What is L_{DAMSM} in Ln 160? Do the two discriminators in Eq. (6) share parameters? What is manipulative precision?

Relation to Prior Work: It would be informative to highlight the limitations of using heuristics for word-level alignment. Approaches based on whole-sentence encoding may encode much more general and flexible textual context than the proposed method.

Reproducibility: Yes

Additional Feedback: Please address the concerns in the weaknesses section in the rebuttal.

POST REBUTTAL: I appreciate the authors' efforts in the rebuttal. It addresses some of my major concerns, especially regarding the capability of modeling the rich context of a sentence. I raised my score to 6.

Review 3

Summary and Contributions: This paper proposes a new lightweight method for image manipulation with a text description. In general, the proposed architecture uses a single generator and a new word-level discriminator that incorporates word labeling to focus on the specific attributes to be manipulated. The authors evaluate the proposed method on CUB and COCO, comparing it with ManiGAN, and provide promising quantitative and qualitative results with ablations.

Strengths: - Lightweight image manipulation is very important and challenging. - The word-level feedback discriminator is novel. - The authors conducted extensive experiments, and the results seem promising in terms of quality and speed.

Weaknesses: - My main concern is the term "lightweight". If I understand correctly, "lightweight" is a main contribution of this paper, and it comes from achieving competitive performance with a single generator (G) and discriminator (D), compared to ManiGAN's multi-scale G and D. However, this is not clarified in Section 3, and the authors present only the TPE and total training time as evidence of being "lightweight". I recommend that Section 3.1 describe in more detail how the lightweight design is achieved. - 'L' is used without definition; perhaps L means the sequence length? In Figure 2, defining the notation (e.g., v, w, L, ...) would help readers. - Lightweight means both faster and smaller, so a model-size comparison should be added to Table 1. - Why do the SISGAN and TAGAN results appear in Figure 1 only? - In most figures, the adjectives are mainly color-related. How are the results on other types of adjectives? Results on the same input image with various colors would also be helpful. - In the CUB result of Figure 6, why is the index-word pair of w/o Dis different from the others? And what does the order of the word-image pairs mean?

Correctness: The method seems to be correct.

Clarity: This paper is easy to follow, but for IS and MP, the phrase "a large number of" (L187, P6) is not specific.

Relation to Prior Work: There is no discussion of related work on lightweight GANs, such as [Aguinaldo et al. 2019, Chen et al. 2020, Li et al. 2020]. [Aguinaldo et al. 2019] Compressing GANs Using Knowledge Distillation. arXiv 2019. [Chen et al. 2020] Distilling Portable Generative Adversarial Networks for Image Translation. AAAI 2020. [Li et al. 2020] GAN Compression: Efficient Architectures for Interactive Conditional GANs. CVPR 2020.

Reproducibility: Yes

Additional Feedback: - As textual information, only nouns and adjectives are used. The reason this method focuses on them looks reasonable, but are there any results on other parts of speech? - What does a single optimization epoch mean?

AFTER REBUTTAL: I thank the authors for their great efforts. I carefully read the other reviewers' comments and the author response, and the authors alleviated most of my concerns. I think the main contribution is a lightweight model for text-guided image manipulation, and they showed promising results. Despite the incremental novelty, this might lead to practical use of text-to-image synthesis models. So, I decided to raise my score to 7.

Review 4

Summary and Contributions: This paper addresses text-guided image manipulation: given an original image and a text describing the desired attributes, such as texture, color, and background, the objective is to modify the original image to match the text. The authors propose an adversarial learning method with a novel word-level discriminator. Although the whole network architecture is lighter than that of the current state-of-the-art method, which has multiple discriminators, the proposed method achieves more accurate manipulation.

Strengths: - A challenging problem: manipulating an original image according to a text. - An interesting approach that pays attention to word-level information in the text. - Thorough experiments on several datasets; both the accuracy of image manipulation and the speed of training are shown.

Weaknesses: - The authors should add the FID score to evaluate the quality of the manipulated images; IS alone sometimes does not correlate well with subjective evaluation. - In [14], only good results, with images successfully manipulated by the text, are shown. This is not the fault of this paper, but the authors could show some failure cases to guide further improvement and attract more researchers to this task.

Correctness: This paper seems correct.

Clarity: This paper is clearly written.

Relation to Prior Work: The differences between this work and previous ones are discussed sufficiently. The proposed method also uses the affine combination module proposed in [14]. The overall network architecture, however, is significantly different from that in [14]; in particular, instead of the multiple discriminators in [14], a word-level discriminator with word labeling is proposed. The improvements in training time and accuracy are quantitatively shown.

Reproducibility: Yes

Additional Feedback: The authors provide good references, but some of the arXiv preprints can be updated. For example, [17] was accepted at ICLR 2016, and, after the NeurIPS 2020 deadline, [14] was published at CVPR 2020. After reading all the reviews and the authors' responses, I have decided to maintain my initial score. The authors have answered my questions exactly.