NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3854
Title: First Order Motion Model for Image Animation

Reviewer 1

Summary: The system attacks the problem of generating images that conform to a given source image, driven by motion estimated from a given video. To that end, the system estimates sparse transformations in terms of corresponding key points and, unlike [21], the local linear deformations around those key points. These transformations are composed from transformations with respect to a common reference configuration. A CNN converts the sparse transformations into dense motion and occlusion masks. Finally, another neural network combines the motion and occlusion with the input image to create the final output.

Positive: The paper introduces the novel idea of first order motion and occlusion modeling to unsupervised image animation. Apart from the basic training idea, which is an extension of the equivariance constraints, the authors need to simplify the transformations learned by the networks by performing a number of well chosen geometric tricks (e.g. line 164, 184) or mapping variations (e.g. using transformations D_t<-D_1 instead of D_t<-S). The mathematics of the first order motion is well described, and the supplementary material covers the details missing from the manuscript well. The authors show good performance in three very different applications (faces, full bodies, robots), where they clearly outperform the state of the art.

Negative: My main complaint about the paper is that some aspects of it are hard to understand and could use a description of the intuitions behind them. For example:
- What is the intuition behind the difference of Gaussians used in the heatmap H_k (section 2.3 of the supp. mat.)?
- What do the reference configurations R look like? Is there any intuition behind those configurations? Could they be visualized (in the supp. mat., obviously)?
- Related to the equivariance constraint, could it be said that it encourages the key points to segment the object so that it deforms as closely as possible to the thin plate spline model? This also holds for the affine model around the joints, which should conform to the thin plate spline model, right?

A secondary concern is the reproducibility of the system. Its large number of components would make implementing it from scratch very hard. I think the value of the paper would increase if the authors released the source code.

I found a couple of typos:
- Forth -> Fourth, line 42
- Latter -> later, line 95
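The composition through a common reference configuration described in the summary can be sketched numerically. This is a minimal sketch, not the authors' implementation. It assumes the first order form in which the motion near key point k is approximated as T_{S<-D}(z) ≈ T_{S<-R}(p_k) + J_k (z - T_{D<-R}(p_k)), with J_k = J_S J_D^{-1}. The function name and the argument names (kp_source, kp_driving, jac_source, jac_driving) are hypothetical.

```python
import numpy as np

def local_affine_motion(z, kp_source, kp_driving, jac_source, jac_driving):
    """Map a point z of the driving frame into the source frame near one key point.

    kp_source, kp_driving: key point location in each frame, i.e. the
        reference key point mapped by T_{S<-R} and T_{D<-R} (shape (2,)).
    jac_source, jac_driving: 2x2 Jacobians of T_{S<-R} and T_{D<-R}
        evaluated at the reference key point (assumed invertible for D).
    """
    z = np.asarray(z, dtype=float)
    # Combined Jacobian J_k = J_S @ inv(J_D): the reference frame cancels out.
    jac = np.asarray(jac_source) @ np.linalg.inv(np.asarray(jac_driving))
    # First order approximation around the key point in the driving frame.
    return np.asarray(kp_source) + jac @ (z - np.asarray(kp_driving))

# Toy values (made up for illustration only).
kp_s = np.array([0.2, 0.1])
kp_d = np.array([0.0, 0.0])
jac_s = np.array([[2.0, 0.0], [0.0, 2.0]])  # local scaling by 2 in the source
jac_d = np.eye(2)

at_keypoint = local_affine_motion(kp_d, kp_s, kp_d, jac_s, jac_d)
nearby = local_affine_motion([0.1, 0.0], kp_s, kp_d, jac_s, jac_d)
# With identity Jacobians the local motion reduces to a pure translation.
translation_only = local_affine_motion([0.1, 0.0], kp_s, kp_d, np.eye(2), np.eye(2))
```

With identity Jacobians the nearby point is only shifted by the key point displacement, whereas the first order version also applies the local scaling. This is the kind of difference a visualization of the reference configurations could make concrete.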

Reviewer 2

This paper proposes a first order motion model for image animation. First, motion is modeled as a set of keypoints and local affine transformations, which can model larger object pose changes than the previously proposed zeroth order model. The paper approximates the motion between two frames with first order Taylor expansions. Second, the paper models occlusions to indicate to the generator network which image parts should be inpainted. This is needed to handle the large motion patterns in the driving video.

Pros:
- Extensive experiments are performed on high resolution datasets. The proposed method shows clear improvement over the previous methods quantitatively and qualitatively.
- The paper also releases a new high resolution dataset, Tai-Chi-HD, for the community.

Cons:
- Lack of insight into why the first order motion model is necessary. For example, in the ablation section, I see a gap between the "Full" model and the "Pyr.+O" model. It would be nice to include a test showing all the factors in between, e.g. first order model vs. zeroth order model.
- Missing citations for GANs and VAEs in lines 24-25:
  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
- Clarity improvements: Reconstruction loss, line 194: which VGG-19 layer did you use exactly? What resolutions are used for the reconstruction loss? Instead of a pyramid of resolutions, have you tried computing the same loss on multiple layers at a single input resolution? That would be more efficient, as only one feedforward pass is needed.

A few minor issues:
- Line 243: How do you preserve the aspect ratio if you resize a video to the fixed target 256x256 resolution?
- Line 172: is moves -> is moved
- Line 196: similarly to -> similar to
- Line 203: Lets -> Let's
- Line 240: until its it -> until it is
- Line 302: every single metrics -> every single metric
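To make the aspect ratio question about line 243 concrete, here is a sketch of one common option (letterboxing). It is not a description of what the authors did. It computes the scaled size and the padding needed to fit a frame into a square target without distortion. The function name `letterbox_geometry` is hypothetical.

```python
def letterbox_geometry(width, height, target=256):
    """Return (new_w, new_h, pad_x, pad_y) for fitting a frame into a square.

    The frame is scaled so that its longer side equals `target`; pad_x and
    pad_y are the left and top offsets that center the scaled frame.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # One scale factor for both axes keeps the aspect ratio unchanged.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y

wide_frame = letterbox_geometry(1280, 720)
square_frame = letterbox_geometry(512, 512)
```

The alternatives are a center crop to a square before resizing, which discards content at the borders, or a direct resize, which distorts non-square videos. The manuscript should state which one was used.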

Reviewer 3

Originality:
1) The task of image animation is not new, but the proposed motion model is novel.
2) The clear difference between this paper and previous papers lies in the first order motion model.
3) Related work is adequately cited and compared.

Quality:
1) The submission is technically sound. The mathematical proof seems correct to me.
2) This is a complete piece of work. The authors conduct extensive experiments on several benchmark datasets to validate the effectiveness of the proposed first order motion model.
3) Failure case analysis is missing.
4) What is the intuition behind using a first order motion model? Can the authors explain how this connects to affine transformations? Can we use a second order motion model?
5) What is the virtual reference frame? Does it represent a canonical pose for a specific object category?

Clarity: The method part is a bit unclear.
1) It seems to me that the zeroth order motion is the keypoint location. However, in Figure 1, T_{S<-D}(p_k) seems to be a pixel displacement, which is different from the keypoint location.
2) Can the authors explain L161-L163 in more detail?

Significance: This paper addresses a difficult problem in a better way than previous papers. It demonstrably advances the state of the art. I believe it provides a good baseline for subsequent researchers to build upon and make significant progress.
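A short derivation makes the question about the connection between the first order model and affine transformations concrete. This is a generic Taylor expansion argument, not a restatement of the authors' proof. It assumes the motion T is differentiable near the key point p_k; the symbols A_k and b_k are introduced here for illustration only.

```latex
\begin{aligned}
T(z) &= T(p_k) + J_k\,(z - p_k) + o\!\left(\lVert z - p_k \rVert\right),
  \qquad J_k = \left.\frac{\partial T}{\partial p}\right|_{p = p_k}, \\
T(z) &\approx A_k\, z + b_k,
  \qquad A_k = J_k, \quad b_k = T(p_k) - J_k\, p_k .
\end{aligned}
```

Dropping the remainder term leaves a map of the form A_k z + b_k, which is an affine transformation valid in a neighborhood of p_k. A second order model would add a quadratic term involving the Hessian of each output coordinate, which increases the number of parameters to predict per key point.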