NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1710
Title: Multi-mapping Image-to-Image Translation via Learning Disentanglement

Reviewer 1

Originality - There has been rapid progress in the field of image-to-image translation. Notably, recent methods generalize image-to-image translation to handle unpaired data (e.g., CycleGAN), multiple domains (e.g., StarGAN), and multi-modal outputs (e.g., DRIT, MUNIT). This work builds upon all of these and presents a method capable of multi-domain, multi-mapping translation from unpaired training data. While this is a straightforward and expected extension, I think it is a nice contribution.

Quality - The method is technically sound. However, the work integrates existing losses and disentangled representations (e.g., from DRIT, MUNIT) to achieve the multi-domain, multi-output mapping. The technical novelty is somewhat limited.
- The experimental validation is not very convincing.
1) The "diversity" has not been quantified.
2) In Table 2, it seems that the main improvement over prior art is due to implementation details (e.g., projectionD). It is hard to draw a conclusion from the table. Also, as far as I know, LPIPS scores are "the lower the better". Table 2 seems to treat them the other way around.
3) The paper claims that the method can perform "multi-domain" image-to-image translation. Yet most of the applications shown in the paper (except the face attribute experiments in the supplementary material) map between "two domains" only. Examples include seasons, weather, and time of day.

Clarity - The paper is well written. The implementation details are clearly discussed in the paper and the supplementary material.

Significance - I think the paper will stimulate future research in image-to-image translation.
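On the point above that "diversity" has not been quantified: one common way to put a number on it in multi-modal translation work is the average pairwise distance between several outputs generated from the same input. The following is only a minimal sketch of that idea, not the paper's evaluation code. The function names are hypothetical, images are represented as flat lists of floats, and a plain L2 distance stands in for a perceptual metric such as LPIPS.

```python
from itertools import combinations
from math import sqrt


def l2_distance(a, b):
    """Euclidean distance between two equally sized flat vectors.

    Assumption: this is a stand-in for a perceptual distance such as LPIPS.
    """
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distance(samples, distance=l2_distance):
    """Average distance over all unordered pairs of samples.

    `samples` holds several outputs generated from the same input image
    (for example, with different sampled style codes).
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples to measure diversity")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Three toy "outputs" for one input; real use would pass image tensors
    # and a perceptual distance instead of L2.
    outputs = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
    print(mean_pairwise_distance(outputs))
```

With this definition the score is zero when all outputs coincide and grows as they spread apart, so a table reporting it would need to state explicitly which direction counts as better.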

Reviewer 2

-- Equation 2 encourages the distribution of styles of all domains to be as close as possible to a prior distribution. However, I am a little confused about how it disentangles the styles of different domains (E_s_{x_i}). In other words, is it possible that E_s_{x_i} and E_s_{x_j}, where i is different from j, end up very close? I would appreciate it if the authors could provide more explanation here.
-- Domain label encoder E_{d}: I question the necessity of introducing a domain label encoder here. I think the information extracted by E_{d} could be handled by the style encoder E_{s}. I hope the authors can address this in the rebuttal.
-- Figure 2 (d) uses only one generator for multi-modal data, but in previous works on multi-modal translation, illustrated in Figure 2 (b), different modalities need different generators. Why is a single generator enough for multi-mapping translation, as shown in subfigure (d)?
-- In Table 2, why does DMIT w/o D-Path achieve the best LPIPS score in all cases (while DMIT outperforms the other settings in FID)?
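To make the question about Equation 2 concrete, here is a small sketch of the kind of prior-matching term it appears to describe. It rests on assumptions that are not confirmed by the review: the style encoder is taken to output a diagonal Gaussian (mean and variance per dimension), the prior is taken to be a standard normal, and the function name is hypothetical.

```python
from math import log


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form.

    Assumption: the style posterior is a diagonal Gaussian and the prior
    is a standard normal; the paper's Equation 2 may differ in detail.
    """
    return 0.5 * sum(m * m + v - 1.0 - log(v) for m, v in zip(mu, var))


if __name__ == "__main__":
    # Hypothetical style posteriors for images from two different domains.
    style_domain_i = ([0.3, -0.2], [0.9, 1.1])
    style_domain_j = ([0.3, -0.2], [0.9, 1.1])

    # The term takes no domain label as input, so identical posteriors
    # from different domains receive exactly the same penalty.
    print(kl_to_standard_normal(*style_domain_i))
    print(kl_to_standard_normal(*style_domain_j))
```

Under these assumptions the term alone does nothing to push the style codes of two domains apart, which is why an explanation of what separates E_s_{x_i} from E_s_{x_j} would be helpful.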

Reviewer 3

The paper unifies multi-modal and multi-domain translation, which have so far been addressed separately. It provides a novel combination of techniques, each previously used to solve one of the problems alone, in order to solve the unified problem. For example, an adversarial loss is used in the embedding space of the common encoder, and a reconstruction loss is used to reconstruct the input from the common, separate, and domain information. Such techniques have been used to achieve disentanglement for two domains only in, for example, MUNIT and [1]. In that sense, the novelty lies in combining existing techniques from previous works in a clever way. However, each part in isolation (style encoder and embedding, etc.) is not novel.

The qualitative and quantitative evaluation is very convincing, showing that the use of additional domains (and thus additional supervision) can improve disentangled translation. There seems to be a large gap in FID and other scores for semantic image synthesis compared to season transfer. Could the authors comment on why this is? Further, the ablation analysis and comparison to baselines are very thorough.

One concern I have is whether the work is able to perform image-to-image translation between domains where the difference is not only in style. For example, in the CelebA dataset, consider domains with different attributes (e.g., facial hair, glasses, smile). In this case the facial attributes are common to all domains, and the additional attributes (facial hair, glasses, smile) are separate. However, since the separate encoder is constructed so as to model only style/global properties, and the additional attributes are content, how would the translation work in this case? Is this a limitation of the work? [1] should be considered for this case. Could the authors comment on any other limitations of the work?

With regard to clarity, the paper is very well written and clear. The overview in Fig. 3 is also very clear.
As for significance, the work seems important for practitioners, providing results superior to the state of the art so far. For future research, since the underlying ideas used in the solution already exist in previous work (adversarial loss, reconstruction loss, etc.), I am not sure whether, and how, ideas from this work can be built upon. Apart from the (rather significant) visual and numerical improvements, are there any observations about the results, or new insights, that this work provides?

[1] Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer. Press et al. ICLR 2019.