NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 306
Title: Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

Reviewer 1

The idea to predict the parameters of the convolutions is interesting; however, I cannot really understand the motivation behind the proposed method. If I understand correctly, the V layer is only smaller than a full convolutional layer by a factor of D (which is 3?). I see the real bottleneck in the fact that you need to predict H*W kernels although you know the semantic content at each pixel. To me it would make much more sense to predict DxCxKxKxL channels, where L is the number of labels. If you used conditional weight prediction, you would still be able to use context as input.

With respect to the discriminators, it is difficult to understand what the authors developed and what has been taken from other papers. Although they reference other papers, they do not state what they have contributed.

It is really hard to judge performance. The number of parameters must be high, especially when attention is included as well, and since the network details are not listed anywhere (number of parameters, etc.), it is hard to compare to SPADE. So many different adjustments are made that it is hard to determine which one is the most important. The ablation study goes in the right direction but is lacking some parts, for example CondConv pred C/O FP + MsPatch. Only when you use neither the feature pyramid nor attention can you actually assess whether the parameter learning is better than SPADE. Also, it would be good to see results with SPADE + FP and SPADE + FP + SE.

Originality: medium. Quality: medium. Clarity: medium - without having read SPADE, the reader does not know how to compare these approaches. Significance: low-medium.
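The parameter-count argument above (predicting H*W per-pixel kernels versus one kernel bank per semantic label) can be made concrete with a back-of-the-envelope sketch. All sizes below (feature-map resolution, channel count, kernel size, depth multiplier, number of labels) are illustrative assumptions, not values from the paper:

```python
# Rough parameter counts for the two weight-prediction schemes
# discussed above. All concrete sizes are assumed for illustration.

def per_pixel_params(h, w, c, k, d):
    """Kernels predicted per spatial location: H*W locations,
    each with a depthwise-separable bank of D*C*K*K weights."""
    return h * w * d * c * k * k

def per_label_params(n_labels, c, k, d):
    """Reviewer's alternative: one predicted kernel bank per
    semantic label, i.e. D*C*K*K*L weights in total."""
    return n_labels * d * c * k * k

H, W = 64, 64               # assumed feature-map resolution
C, K, D = 256, 3, 1         # assumed channels, kernel size, depth mult.
L = 35                      # assumed number of semantic labels

print(per_pixel_params(H, W, C, K, D))  # 9437184 weights per layer
print(per_label_params(L, C, K, D))     # 80640 weights per layer
```

Under these assumed sizes, the per-label scheme predicts orders of magnitude fewer weights per layer, which is the crux of the reviewer's suggestion.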

Reviewer 2

This paper proposes a strongly conditional network for generating images from semantic maps.

How robust is this network to small changes in the input map? For example, given 3 sequential frames of a video (as segmentation maps), is the model consistent in assigning colors and structures, or do small changes in the geometry of the semantic objects have a large impact on the output? This is mostly curiosity, as having smoothness inherent in the model has large potential for video applications. A fair number of qualitative results comparing to other models were shown, but showing the important regions of the input conditioning, and the influence of input perturbations on the model output, could also lead to valuable insight; something like GradCAM or related methods may be usable for checking the importance of input features.

In 4.3 (qualitative worker analysis) there could be more detail (variance across labelers / uncertainty / statistical significance) rather than a pure percentage preference. How many workers labeled the 500 images?

Given that this is largely a (very impressive) empirical paper, it would be nice to see a larger exploration of ablations on the various components, or some larger intuition on how and why the network was designed the way it was. The empirical results are convincing, and the demonstrated experiments are thorough; though more ablations could add greater insight, the current experiments seem sufficient given the high quality of the model. I strongly encourage the authors to release their code, as the community should be able to use, improve, and extend this work in interesting new ways - perhaps doing some "in-the-wild" ablation studies along the way.

Feedback post-rebuttal: My score remains unchanged, primarily because I had no major criticisms of this paper to begin with; the response didn't fundamentally change my perception of the work. The authors' comments clarified some of my key questions - thank you for the explanations.

Reviewer 3

Originality: This paper mainly proposes two things: (1) generating the weights of the CNNs in the generator from the input semantic label map, and (2) using feature pyramids throughout the network (i.e., in both the generator and the discriminator). I think this combination is new for the semantic image synthesis task. Related studies are well cited, and the paper is different enough from them.

Quality: The method is reasonable, and the idea to use separable convolutions to reduce the number of parameters makes sense. The proposed method is compared with previous methods, and better results are reported. One concern is that the authors claim their method is better than SPADE, but the comparison is done against the original implementation of SPADE, not against the idea itself. It should be possible to estimate the scaling parameters instead of the CNN weights using the same network; it would be valuable to perform such experiments.

Clarity: The paper is easy to follow and understand. I would personally prefer to see more explanation of each loss term; the name of a loss alone (e.g., perceptual or FM) does not always make clear what is actually computed.

Significance: The proposed method is reasonable given the current state of the art, and the combination seems novel. I expect that this will attract researchers working in similar fields.
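The parameter savings from separable convolutions mentioned above can be sketched with a simple count. The channel widths and kernel size below are assumptions for illustration, not the paper's actual configuration:

```python
# Comparing parameter counts of a standard convolution with a
# depthwise-separable one. Sizes are assumed, not from the paper.

def standard_conv_params(c_in, c_out, k):
    # one c_in x k x k filter per output channel
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    # depthwise stage: one k x k filter per input channel,
    # followed by a 1x1 pointwise convolution mixing channels
    return c_in * k * k + c_in * c_out

c_in = c_out = 256   # assumed channel widths
k = 3                # assumed kernel size

print(standard_conv_params(c_in, c_out, k))   # 589824
print(separable_conv_params(c_in, c_out, k))  # 67840
```

With these assumed sizes, the separable form needs roughly 8.7x fewer parameters, which is why predicting separable kernels keeps the weight-prediction network tractable.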