NeurIPS 2020

Self-Learning Transformations for Improving Gaze and Head Redirection


Review 1

Summary and Contributions: The paper introduces a novel method for gaze redirection. Furthermore, the authors use the proposed method to train gaze prediction models with fewer labels, showing its ability to augment data. The authors evaluate the method on several standard benchmarks and show that it improves on existing models on redirection tasks.

Strengths:
- The paper describes the model and the task very well, and justifies the purpose of each part of the model.
- The authors show that the model improves on existing models on gaze and head redirection tasks across multiple datasets.
- I believe that explicitly modeling the head/gaze distinction is an interesting technical addition to the model.
- The augmentation experiments show a very interesting use of the model for data augmentation in a low-data regime. The fact that the authors are able to improve baseline performance with little data is interesting and useful for future research. Furthermore, the ability to control gaze direction enables new training schemes that were not possible with static datasets.

Weaknesses:
- The authors address the possibility of using this work to build deepfakes in the broader impact statement. Although their point is valid and the model for now only works on particular poses, I still think this line of research helps the overall development of fake videos. This should be taken more into account in the discussion in the paper.
- The authors claim in the conclusion that the model would apply to generative factors other than head pose and gaze. Although the model is general, every problem is very particular and the claim might be too strong.
- It would have been interesting to train the model on other datasets to see how performance changes with the training data. The authors do claim that their model transfers very well across datasets (which is true), but it would be interesting to see whether it still does so with other training data.
- How did the authors validate how well F_d works, and how sensitive is it to small changes? It is used to evaluate the effect of the redirection, and I believe it is important to also evaluate the evaluator itself; otherwise, the metrics involving F_d are hard to interpret for the reader. A sensitivity check along the lines of the sketch below would already help.

After rebuttal: After reading the rebuttal, the authors have addressed most of my concerns and I update my score to 7-Accept.
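To make the F_d point above concrete: a minimal sensitivity check I would find convincing, sketched below under the assumption that F_d maps an image batch to 3D gaze direction vectors (the interface and helper names are hypothetical, not from the paper):

    import torch

    def angular_error(pred, ref):
        # Angle in degrees between two batches of 3D direction vectors.
        cos = torch.nn.functional.cosine_similarity(pred, ref, dim=-1).clamp(-1.0, 1.0)
        return torch.rad2deg(torch.acos(cos))

    def fd_sensitivity(f_d, images, sigma=0.01, n_trials=10):
        # How far do F_d's predictions drift under small pixel noise?
        # A well-behaved evaluator should drift smoothly with sigma.
        with torch.no_grad():
            base = f_d(images)  # assumed: batch of gaze direction vectors
            drifts = []
            for _ in range(n_trials):
                noisy = (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)
                drifts.append(angular_error(f_d(noisy), base).mean())
            return torch.stack(drifts).mean()

Sweeping sigma and reporting the resulting drift curve would let the reader judge whether sub-degree differences in the F_d-based metrics are meaningful.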

Correctness: Yes. The claims in the paper are well formulated. The method seems correct and is well supported by the empirical results, both qualitative and quantitative.

Clarity: Yes, the paper is well organised and well written.

Relation to Prior Work: Yes, the authors describe in detail other works attempting gaze redirection, as well as related work on generation and gaze estimation.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This paper presents a new generative model for producing high-quality face images with redirected eye gaze and head pose. To this end, the authors propose a new encoder-decoder architecture that disentangles various independent factors of faces, with several constraints designed for the disentanglement and redirection. The proposed method is validated with three error measures, and its usefulness is further shown through improvements in semi-supervised cross-dataset gaze estimation.

Strengths: This paper proposes a new structure that is simple and works well. The proposed method is clearly described and explained, and its superior performance is demonstrated through various experiments. The results (especially the qualitative ones) are clearly better in many respects than those of existing methods.

Weaknesses: It is unlikely to be easy to balance the six losses used in the proposed method (i.e., the weight setting); a sketch of the kind of weighted combination I mean is given below. More specific discussion and experiments related to this parameter setting are needed. It would also be nice to have a frank discussion of the proposed method's limitations, showing failure cases.
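For concreteness, the tuning burden I am referring to is the weighted sum over all loss terms; the sketch below uses hypothetical loss names and weights (not taken from the paper) purely to illustrate the size of the search space:

    # Six hypothetical loss terms and weights; with six coefficients,
    # even a coarse grid search is expensive, which is why a reported
    # sensitivity analysis over these weights would be valuable.
    loss_weights = {
        "pixel_recon":   1.0,
        "perceptual":    0.5,
        "gaze_redirect": 1.0,
        "head_redirect": 1.0,
        "disentangle":   0.1,
        "adversarial":   0.01,
    }

    def total_loss(losses):
        # losses: dict mapping loss name -> scalar loss value
        return sum(loss_weights[name] * value for name, value in losses.items())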

Correctness: Yes, they seem correct.

Clarity: The paper was well written and easy to follow. The overall structure is also logical, and the motivations of the research and the issues being addressed are clearly described.

Relation to Prior Work: Yes, analysis of and comparison with other papers have been done sufficiently.

Reproducibility: Yes

Additional Feedback: It would be better if there were a comparison with FAZE [20] in the experiments.


Review 3

Summary and Contributions: The paper proposes a method for re-targeting/redirecting eye gaze and head pose, specifically targeting lower-quality images. Further, it demonstrates the usefulness of such a method beyond direct redirection, namely as a way to augment training data for gaze estimation tasks.

Strengths: The paper shows impressive qualitative results on the task of gaze and head redirection. Especially impressive is the temporal stability of the redirection of both eye gaze and head pose. The authors present an interesting way to evaluate their work, by augmenting the training data with their redirection network and training a downstream task. This is a good way to evaluate redirection/controllable image generation work in general.

Weaknesses:
- The method section of the paper is a bit difficult to follow (see below).
- There is a potential issue in the evaluation (see below).
- It would be great to include a limitations section discussing where the approach still struggles.

Correctness: There is one potential issue with the evaluation in terms of redirection error (F_d). Is this the same F_d as is used in the loss function? If so, it is unsurprising that it goes down, as the optimizer is told to reduce it. The model may be learning to exploit the biases of F_d rather than learning to redirect gaze. Ideally, you would want to use a different method to estimate gaze than the one present in the loss function; a sketch of the evaluation protocol I have in mind follows below.
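Concretely, the protocol would hold out an independently trained estimator for evaluation only. A minimal sketch, with the generator/estimator interfaces assumed rather than taken from the paper:

    import torch

    def angular_error(pred, ref):
        # Angle in degrees between batches of 3D gaze direction vectors.
        cos = torch.nn.functional.cosine_similarity(pred, ref, dim=-1).clamp(-1.0, 1.0)
        return torch.rad2deg(torch.acos(cos))

    def redirection_errors(generator, f_d_loss, f_d_eval, images, target_gaze):
        # f_d_loss: the estimator used inside the training loss.
        # f_d_eval: an independently trained estimator (different seed
        # and/or architecture) never seen by the generator.
        # A large gap between the two errors suggests the generator is
        # exploiting biases of f_d_loss rather than truly redirecting gaze.
        with torch.no_grad():
            out = generator(images, target_gaze)  # assumed interface
            err_loss_net = angular_error(f_d_loss(out), target_gaze).mean()
            err_eval_net = angular_error(f_d_eval(out), target_gaze).mean()
        return err_loss_net, err_eval_net

Reporting both numbers would make the redirection-error metric much easier to trust.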

Clarity: The introduction/background and evaluation sections are clearly written. However, the method section is difficult to follow at times and required several re-reads to understand. Some specifically unclear parts:
1. It is not entirely clear what the curved-arrow-in-a-circle notation in Figure 1 alludes to. The method description does not contain \delta c; is it Equation 2?
2. The relationship between factors and conditions is not immediately clear. Why do we need such a separation? Is there a deterministic mapping between each embedding and a condition?
3. Is a condition a scalar? If so, how can gaze (2-dimensional) and head pose (3-dimensional) be encoded in a condition?
4. It is never explained what is meant by rotationally equivariant mappings (the standard definition I would expect is given after this list).
5. [minor] Figure 1 could do with subfigures (rather than stating top left, etc.).
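On point 4 above: for reference, the standard definition I would expect (my assumption; the paper does not state one) is that a mapping f is rotationally equivariant if

    f(R \cdot z) = \rho(R) \, f(z) \qquad \text{for every rotation } R,

i.e., rotating the condition and then encoding gives the same result as encoding and then applying a corresponding rotation \rho(R) in embedding space. It would help if the paper stated whether this is the intended meaning and what \rho(R) is concretely.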

Relation to Prior Work: Prior work and differences to it are clearly explained.

Reproducibility: Yes

Additional Feedback: Some relevant work that the authors might be interested in: "CONFIG: Controllable Neural Face Image Generation". It is completely understandable that the work was not cited, as it came out very recently, but the authors might find some similarities to their work. Typos: highly-quality -> high-quality.