NeurIPS 2020

### Review 1

Summary and Contributions: The paper presents a post-processing method for "Face Swapping" algorithms that reduces the discrepancies in lighting and color distribution that many algorithms create between their output and the target image. The major contributions are:
- Formulation of the post-processing step as an optimal transport problem based on the distribution of features in image space. The Wasserstein algorithm is used in an unusual (i.e., novel) way to solve the problem.
- A novel conditional GAN discriminator that learns to detect blending artifacts.
- The method seems to reliably produce plausible results, in contrast to the face-swapping techniques it is supposed to serve as a post-process for.

Strengths:
- In principle, this method should be applicable to any face-swapping technique.
- In contrast to the results from the original face-swapping techniques, the results from this method always look flawless to the human eye.
- The way the OT/Wasserstein method is used to perform image-to-image translation seems very clever and novel to me, especially since it incorporates not only pixel color but also more abstract image features.
- The evaluation is extensive and contains quantitative (metrics) and qualitative (user study) components.
- The "mix-and-segment" discriminator seems to be novel as well, at least for the application of face swapping.

Weaknesses:
- The authors never clearly state (neither in the manuscript nor in the supplemental material) which information is available to the network at *test* time. I do not understand whether their system is trained once on a large corpus of face swaps and can then be applied to arbitrary reenactment results, or whether it is in some way specific to a certain subject or setting. I also do not understand whether the target image of a swap needs to be supplied at test time, nor whether the NOTPE network is retrained at test time. This lack of clarity is my biggest complaint.
- There is no discussion of limitations, and no failure cases are shown.
- In terms of exposition, this submission does not clear the bar of what I would expect from a NeurIPS publication (see below).

Correctness: Overall, the claims made in the paper appear to be substantiated. Minor flaws:
- Line 182 claims that Wasserstein GAN "minimizes the Wasserstein distance between two images". This statement is wrong: Wasserstein GAN minimizes the Earth-Mover distance between two probability distributions.
- In part (b) of Table 1, it is not clear what the "scores" mean, and Section 4.2.2 does not clearly say either. I was able to infer that the scores probably represent the rate at which users picked the result of the presented method as the "better" one, but the text should say this explicitly.
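To make the distinction concrete (my own toy illustration, not from the paper): the Earth-Mover (Wasserstein-1) distance is defined between probability distributions, e.g. the empirical distributions of pixel intensities of two images, not between the images themselves. Using SciPy's 1-D `wasserstein_distance`:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Empirical pixel-intensity distributions of two hypothetical images
darker = rng.normal(loc=0.4, scale=0.1, size=10_000)
brighter = rng.normal(loc=0.6, scale=0.1, size=10_000)

# Wasserstein-1 distance between the two empirical distributions;
# for two 1-D Gaussians with equal scale, it equals the difference of means
print(wasserstein_distance(darker, brighter))  # ≈ 0.2
```

Note that the distance operates on the unordered samples: any spatial rearrangement of pixels that preserves the intensity histogram leaves it unchanged.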

Relation to Prior Work: Yes, section 2 points out the differences between the submission and the previous work.

Reproducibility: Yes

Additional Feedback: There is pseudocode for the training loop and a detailed sketch of the networks involved in the supplemental material. -- The rebuttal addresses some of my points. I hope the authors can produce a more polished paper in terms of language. I would still like to accept the paper.

### Review 2

Summary and Contributions: The authors propose to address the problem of inconsistent face-swapping results when there are large appearance gaps between the source and target images, including illumination and skin color. The goal is to model the complex appearance mapping so as to transfer fine-grained appearance adaptively while preserving identity traits. Promising results are achieved compared to DeepFaceLab and FSGAN.

Strengths: The idea of achieving appearance transfer in face swapping by first transferring the latent features and then using a face decoder to generate the final result is somewhat novel.

Weaknesses:
1. Eq. (3) and Eq. (4) seem to be inconsistent. In Eq. (3), the authors aim to solve for an optimal transport plan between the feature histograms of $$F_{X_r}^i$$ and $$F_{X_t}^i$$. In optimal transport, such a plan not only transports the source distribution to the target distribution but also minimizes the transportation cost. In Eq. (4), the Wasserstein distance between the mapped distribution and the target distribution is minimized. In this way, the mapping function $$\omega$$ maps the source distribution to the target distribution, but the transportation cost is not minimized.
2. Sec. 3.2 is not easy to follow. What is the relation between the proposed Mix-and-Segment Discriminator and 'mixup' in Sec. 3.7 of [1]? Besides, an empirical comparison against WGAN should be added.
3. Metrics including 'verification', 'pose', and 'landmarks' should be added, as they are all used in the two baseline face-swapping methods.
4. Time complexity should also be compared with that of the competitors.

[1] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations.
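A toy discrete example of point 1 (my own illustration, not from the paper): a transport plan can push the source histogram exactly onto the target histogram without minimizing transport cost, so enforcing distribution matching alone does not determine the *optimal* plan.

```python
import numpy as np

# Three support points shared by source and target; with equal marginals,
# any permutation of mass maps the source distribution exactly onto the target.
x = np.array([0.0, 1.0, 2.0])           # source support
y = np.array([0.0, 1.0, 2.0])           # target support
cost = np.abs(x[:, None] - y[None, :])  # |x_i - y_j| ground cost

mu = np.full(3, 1 / 3)  # source histogram
nu = np.full(3, 1 / 3)  # target histogram

identity_plan = np.diag(mu)                     # keep all mass in place
shift_plan = np.roll(identity_plan, 1, axis=1)  # cyclically shift all mass

for plan in (identity_plan, shift_plan):
    # Both plans satisfy the marginal constraints ...
    assert np.allclose(plan.sum(axis=1), mu)
    assert np.allclose(plan.sum(axis=0), nu)
    # ... but their transport costs differ:
    print((plan * cost).sum())  # 0.0 for identity, 4/3 for the shift
```

Only the identity plan is the optimal transport plan here; the shifted plan matches the distributions equally well but pays a strictly higher cost, which is exactly what a pure distribution-matching objective like Eq. (4) cannot rule out.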

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

### Review 3

Summary and Contributions: The authors propose AOT, a method to improve face-swapping results by addressing changes in lighting and skin tone. The authors extend the idea of optimal transport by introducing appearance-related features and applying it in both latent and pixel space, and they introduce NOTPE to adapt the OT idea to neural network training. The method produces plausible results qualitatively and is quantitatively better than the other methods compared.

Strengths:
1. Formulation of the problem. The authors address the latent feature transformation problem using NOTPE, and they address the similarity term in the image space using adversarial training. The overall network design is solid and well established.
2. Plausible results. The swapping results shown in Fig. 1 are clearly better than those of other methods, and artifacts caused by changes in lighting and skin color are barely visible.

Weaknesses:
1. Related work. This work relies on the previous face-swapping methods [31, 29], mainly taking their results as input and correcting the artifacts. It is worth comparing to other state-of-the-art fully automatic or end-to-end methods and justifying why a two-stage computation is preferred here.
2. Loss of details. The network relies on an encoder-decoder, which is prone to losing the high-frequency details of the images. As shown in the supplemental video from 4:41, the results look blurry compared to the input. The authors would need to show that their work does not worsen the results of the previous methods.

Correctness: The claims and methods are correct in this paper.

Clarity: Yes. The reviewer found the following points which might warrant the authors' attention:
1. Line 204: extra ***
2. In the supplemental video: 'Neutral network' -> 'Neural network'

Relation to Prior Work: Yes.

Reproducibility: Yes

### Review 4

Summary and Contributions: This paper proposes an enhanced face-swapping framework via optimal transport. After obtaining an initial swapped face using any previous swapping method, the system refines its color and illumination using a Relighting Generator. This generator has a modified U-Net structure, with Neural Optimal Transport Plan Estimation (NOTPE) to transport features in the skip connections. It also employs a Mix-and-Segment Discriminator to enforce realism. The proposed method outperforms the state-of-the-art ones in qualitative comparison, quantitative metrics, and user evaluation.

Strengths:
* AOT applies optimal transport (OT), an emerging technique, to improving face-swapping results. It recognizes the problem of conventional OT when applied to this task and hence proposes a modified version, which is included inside the deep generator. AOT shows much more realistic swapped images compared with the previous methods as well as conventional OT.
* AOT exploits 3D face estimation to further improve the generated results.
* The Mix-and-Segment Discriminator is an interesting and effective component.
* AOT outperforms the state-of-the-art ones in qualitative comparison, quantitative metrics, and user evaluation.

Weaknesses:
* Some results look less like the source face. The authors should add analyses of identity preservation, which is important in face swapping.
* In Equation (7), the second term is not a style loss. The authors should find another name for it.
* Minor issues:
  - Section 4.1: redundant "***". Also, Poisson blending does not show "the ghosting" effect in Fig. 4, as mentioned in L206.
  - Typos: "assess" (L231), "lose" (L237).
  - More challenging examples, such as cross-gender swapping, should be added.

Correctness: Yes.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback: The rebuttal addressed my comments. I believe this paper is good for publication. While there are some ethical concerns, I still lean toward accepting this paper. To improve deepfake detection, we need to understand the capabilities of deepfake methods, and deepfake generation studies are essential. This work will raise awareness of this problem.