Four qualified referees have carefully reviewed the paper. They agree that the method is interesting and that the view synthesis results on ShapeNet are strong given the small amount of training data used. On the downside, the reviewers question the use of "unsupervised" in the name of the method (it requires images with ground-truth camera poses) and note that the experiments are limited to relatively simple synthetic data. If sample efficiency is the point, an evaluation on real data would make sense, as would experiments focused on demonstrating sample efficiency, e.g., training the proposed method and some baselines on varying amounts of data and showing that the proposed method degrades more gracefully. After the rebuttal and discussion, the reviewers agree that the merits of the paper outweigh the issues, so I recommend acceptance. However, the authors are strongly encouraged to reconsider using "unsupervised" in the title and to address the reviewers' other comments in the final version of the paper.