NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4150
Title: PerspectiveNet: A Scene-consistent Image Generator for New View Synthesis in Real Indoor Environments

Reviewer 1

Given a few RGBD images of a real indoor scene, together with the camera locations where they were taken, the algorithm predicts RGBD images taken from different camera locations. The novelty is the use of a denoising auto-encoder for a given view and finding latent representations that are consistent across different views.

Detailed comments:

- It would be good if the whole process were described in steps, because it wasn’t clear from the start what the overall approach is (maybe it would be for someone working on a similar topic). Some figures are good, but could be better together with such a description. Something like the following would be useful for me:

A) We are given a set of RGBD views of a given scene along with their camera locations. Given a subset of these views, we would like to predict the remaining views (given their camera locations).
B) For each given view we can map each RGBD pixel into 3D space using plain geometry.
C) We can then project these pixels into a new view, again using plain geometry. Some of the pixel locations in the new view will not be filled; we call those holes. Some of them will have multiple guesses; we use a heuristic that chooses these values as … (explain in one sentence).
D) For each view we train a denoising auto-encoder to fill in the missing details of a given view.
E) To make the filled details self-consistent we do the following. For each view we modify the latent to be consistent with the other views: we take the latents of n-1 views, project them to pixels, extract the 3D locations of these pixels, project them to the remaining view, and impose a loss that makes this view similar to the one decoded from that view's own latent representation. We optimise all the latents by gradient descent.

There could be a figure going with that, or perhaps what you already have but with this kind of description in the caption.

86: It would be good to describe g and K of the camera.
94: …describe what non-holes are.
102: Maybe the renderer shouldn’t be in the supplement; it seems like a fundamental component. Maybe at least explain intuitively what it is doing (especially for people not familiar with this literature). Why exp(-d) (from the supplement)? So that it’s differentiable?
GAN: It is likely that your GAN is not well trained and could be made better. It would also be good to train both: simple reconstruction to ground it and create good representations, and then the GAN to hallucinate the right details.
- It would be good if a simpler solution were found; there are a lot of regularisers and choices here. How much hyperparameter optimisation was there?
- Show more source, target and unpainted images (in the supplement, say), just to see more results of how well it works.
167: Why only the non-holes of the second v, since the first one is auto-encoded and has values everywhere?

Pros: Nice use of the denoising auto-encoder and the self-consistency training.

Cons:
- Relying on a depth channel, which provides a clear grounding of each pixel and a mapping between different views. It would be much better if such things could be inferred.
- Blurry filling. Using a proper generative model would be good (a non-GAN model might work for this as well).
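Steps B and C of the outline above can be illustrated with a minimal, dependency-free sketch. It is not the paper's renderer: the pinhole parameters `fx, fy, cx, cy`, the function names, and the rigid motion `R, t` are all hypothetical, and the "nearest point wins" rule is only an assumed stand-in for the heuristic the reviewer leaves unspecified.

```python
def unproject(depth, fx, fy, cx, cy):
    """Step B: lift every pixel with a valid depth to a 3D point (camera frame)."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0:  # assumption: depth 0 marks a missing reading
                points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def transform(points, R, t):
    """Apply a rigid motion x -> R x + t (hypothetical source-to-target pose)."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def reproject(points, values, fx, fy, cx, cy, height, width):
    """Step C: project points into the target view and report the holes."""
    depth = [[0.0] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (x, y, z), value in zip(points, values):
        if z <= 0:  # point is behind the target camera
            continue
        u = round(fx * x / z + cx)
        v = round(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            # Assumed tie-break for "multiple guesses": keep the nearest point.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
                image[v][u] = value
    holes = [[d == 0.0 for d in row] for row in depth]
    return image, depth, holes


# Usage: a flat wall 2 units away, seen again after moving the camera sideways.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
wall = [[2.0] * 4 for _ in range(3)]
pts = unproject(wall, fx=2, fy=2, cx=1, cy=1)
moved = transform(pts, IDENTITY, (1.0, 0.0, 0.0))
_, _, holes = reproject(moved, list(range(len(pts))), 2, 2, 1, 1, 3, 4)
```

In this toy case the content shifts by one pixel, so the first column of the target view receives no points and is marked as a hole; that is the kind of region the denoising auto-encoder in step D would have to fill.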

Reviewer 2

Main reason for the "accept" decision: the paper addresses the problem on a realistic, large-scale indoor scene dataset; it makes a nice theoretical contribution on the losses and explains the decisions made.

Good points:
- tackles a novel, challenging, large-scale problem of synthesizing views for indoor scenes;
- works on a large dataset of realistic indoor scenes;
- introduces a reprojection consistency loss and a style consistency loss, which is a nice theoretical contribution;
- this work is relevant in the context of indoor localization and navigation applications, where inpainting is necessary, e.g. for completing meshes or reconstructing views for which information is not available, such as holes due to occluders.

Not so good points:
- Evaluates on only one dataset, while other (larger-scale) indoor scene datasets exist (Matterport3D, Gibson). If they are not suitable, explain why.

Abstract and Figure 1 -- the mention that the input should be RGBD is omitted. It is not clear from the evaluation how the method performs on synthetic, smaller scenes.
Abstract: mention that as few as 4 RGB"D" views are needed. The mention of reference views taken with a hand-held camera is a bit misleading, making readers expect just RGB.
Figure 1: mention that the input should contain depth as well. The reference input views are captured at the same time as the desired output views. Might be a good idea to emphasize that as few as 4 reference views
In related work (L72), GQN has not been tested in a real-world setup -- it would have been valuable to add this experiment for comparison. Since the method is capable of browsing simple synthetic experiments, it would be worth checking how it performs on realistic
Figures 1 and 2 are not referenced in the text.
It is unclear later on whether the 3D Conv Net [34] (L72), mentioned in Table 1 and later in L234-240, is a contribution of the current work or was proposed in [34].

3. L90-91: What is the range of d_u (depth)? What are the units?
L102 -- the differentiable point tracer is suitable to be a part of the main paper. Is it mandatory or important for the point tracer to be differentiable?
3.2 L154-158: How are the layers pre-selected for adding residuals?
4. Experiments
L206 -- Matterport3D is a larger indoor scene dataset; it provides, among others, RGBD + camera annotations. How would the proposed method perform on MP3D? (Optional) Evaluate on the Gibson dataset.
L209, 210 → Are the 8 views used for testing, i.e., reference views?
L213: How are the views clustered? Is this a manual step? If not, what features were used?
How would the current method perform on ShapeNet? How difficult would it be to compare with methods that evaluate on ShapeNet, e.g. Dosovitskiy et al. → How would it perform on SceneNet?
There is no clear comparison showing why the proposed method
For ablation: what is the contribution of the individual loss components, e.g. style consistency vs reprojection consistency?
Bring the dataset statistics closer to the beginning of the section defining the evaluation protocol (# of samples, # train, # test). It is not clear how the train / test sets were split, and whether there is a validation set.
In Figure 4, please add the color and depth measures for the selected pictures (e.g. similar to Table 1).
In discussion -- It is understandable from the text and Table 1 that the authors are comparing against a very strong baseline (ablation). How much does 0.02 in PSNR (color), or 0.03 in LPIPS, affect perception? For someone not familiar with these measures, how could one understand the improvement? The meaning of these differences should be explained in the discussion of the results.
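For the PSNR part of that last question, the standard definition PSNR = 10 log10(MAX^2 / MSE) gives a rough sense of scale. The sketch below assumes the 0.02 difference is in dB and that images are scaled to [0, 1]; neither is stated in the review, and the function names are illustrative. LPIPS is a learned metric with no comparable closed form, so it is not covered here.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_factor_for_psnr_gain(gain_db):
    """Factor by which the MSE must shrink for PSNR to rise by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# Usage: how much lower is the MSE after a 0.02 dB PSNR improvement?
factor = mse_factor_for_psnr_gain(0.02)
relative_drop = 1.0 - factor  # fraction of the original MSE removed
```

Under these assumptions a 0.02 dB gain corresponds to an MSE that is a little under half a percent lower, which says something about pixel-level error but nothing direct about perceived quality.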

Reviewer 3

> The BiGAN image predictions seem noisy along the grid. This might be the result of suboptimal architectural choices (low model capacity, filter size, etc.).
> This approach seems quite similar to "Neural Rerendering in the Wild" (Meshry et al.) at CVPR 2019. That paper uses a similar approach of using point cloud representations in the context of multi-view reconstruction. How are these methods related?
> It would be good to get clarity on how different this work is from Meshry et al. before evaluating the contributions of this paper. I hope that the rebuttal clarifies this.