NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 936
Title: Incremental Scene Synthesis

Reviewer 1


The rebuttal and the promised improvements to the writing have increased my score to a 7.

The paper presents a spatially structured memory model capable of registering observations onto a globally consistent map, localizing incoming data, and hallucinating areas of the map not yet or only partially visited. Although it borrows architectural details from previous work, especially MapNet, the paper proposes a way to incorporate a generative process directly into a spatially structured memory. Previous generative models for scenes have omitted any spatial inductive bias and present the model directly with the sequence of observations. Additionally, previous spatial architectures often assume that an oracle localizer is available. The proposed architecture provides the generative model with strong geometric priors, which enable accurate view generation and localization without needing an oracle. A toy image generation task and VizDoom navigation experiments demonstrate the improved performance of the proposed model over a previous (non-spatially-structured) generative baseline in terms of view synthesis quality and localization accuracy.

The paper's main novelty can be summarized as augmenting MapNet's localization and registration modules with an integrated generative process. This seems like an important contribution: generating a whole scene (or parts of one) from a prior distribution could have important downstream effects in exploration, agent navigation, and memory systems, where an agent can explicitly reason about where objects or rewards are likely to be in the environment based on past experience. This is in stark contrast to the current standard treatment of unexplored areas in spatial memories, which are either zeroed out or initialized to a single learnable embedding. Additionally, the generative process can be a useful window into the agent's mapping procedure, enabling visualization and rendering of the agent's semantic representation of the scene, which is for the most part opaque to the agent designer in current spatial memories (except where the memory is forced to represent a provided semantics, as in CogMap).

Despite the positive and interesting contribution of generative modeling to spatial memories, there are several concerns with the current paper:

(1) Perhaps the most obvious is that the quantized orientation limits the current applicability of the method, and I do not think this point is adequately addressed. Doesn't quantized orientation make localization trivial? A local odometry method can probably learn to predict the motion of very long sequences accurately without drift, especially at the quantization levels tested in this paper.

(2) Related to the first point, I think odometry is not an interesting evaluation setting for what this model can do. It would be far more informative, for example, to train the model in a set of mazes with structured goal spaces and then test how accurately it can imagine plausible goal locations. One could then also measure how accurately it predicts the correct goal location as a function of the environment explored. This would directly demonstrate the model's capability to be used for downstream exploration and navigation tasks, which is where I believe it would be most impactful.

(3) More pressing is that the technical writing in the main text is very poor, with critical architectural details omitted and left unexplained, and figures and tables carrying minimal captions. For example, despite being a central contribution, the generative process is mostly left unexplained. Is it a deterministic autoencoder or some stochastic variant? The only reference I can find to the generative process is Figure 3, which mentions a DAE (I assume this refers to a denoising autoencoder?).

(4) There is some description of the generative model in the appendix, but even then it is not completely clear. It seems there is no noise in the generative process? Does that mean there is only a single possible decoded view given an encoded map feature? If so, this is a serious limitation and seems to contradict point (d) in the abstract.

In conclusion, the method seems like an otherwise important contribution to spatial memories, but the currently poor technical writing, a relatively uninformative evaluation setting, and the confusing architectural description make me somewhat cautious about recommending this paper for acceptance.

Reviewer 2


Response to author feedback: Thank you for your answers and additional experiments. As a result, I have increased my score to 7.

This paper introduces a novel model for incremental scene synthesis that uses a localization network (MapNet) and a memory architecture which can be used to reconstruct images at arbitrary viewpoints. Incremental scene synthesis is an interesting and relevant application that can benefit downstream tasks in many subfields. The proposed model is a complex architecture, but the authors provide a fairly clear step-by-step explanation (part of it is in the appendix due to space constraints in the NeurIPS format).

It is less clear to me, however, for which applications there is a strong need to produce hallucinations. Hallucinations are in fact meaningful only in less interesting cases for which it is possible to learn a very strong prior over the images (e.g. the experiment with faces in CelebA). However, for many environments (e.g. the experiments with floor plans in the HoME dataset), observing part of the image does not tell you much about other parts of it, and the hallucinations are far from accurate.

To my knowledge, this is among the first models able to perform coherent localization and incremental scene synthesis in an end-to-end fashion, and the experimental results look convincing. They could, however, be further improved to provide a deeper understanding of the real competitiveness of the model. The GTM-SM model that the authors use in the experiments is a relevant competing method, but it focuses on a fairly different task, so it is hard to provide a meaningful comparison while keeping it in its original form. The GTM-SM is built to produce long-term action-conditioned predictions for planning applications in RL, not scene registration/synthesis (although it can do the latter to a certain extent). It is trained in an unsupervised way, fed action and image information during training, and not location information as in the proposed model. However, the learned state-space model in the GTM-SM could also greatly benefit from the location information (at least during training) that is assumed to be available in the experiments in this submission. For this reason, I believe ground-truth locations should also be used in the GTM-SM experiments to provide a fairer comparison (it is true that the GTM-SM also has access to additional action information in the experiments, but this information is far weaker than the location information, considering that the GTM-SM needs to learn a transition model). You may also consider passing the viewpoints to the GTM-SM when generating images (the model basically becomes a GQN); this would certainly give an unfair advantage to the GTM-SM, but its results would be very helpful for assessing the potential of the proposed architecture.

I am quite concerned about the correctness of your GTM-SM implementation. Considering the results in the original paper [8], it seems to me that a well specified and trained GTM-SM should perform much better, at least on the anamnesis metric in the A_cel^s experiment, but likely also in the other experiments. Could you provide more details on the GTM-SM implementation and training procedure?

For the proposed model to work, there has to be substantial correlation and feature overlap among the initial observations in order to learn to localize and memorize correctly. How can this model be used for efficient exploration in larger environments?

Reviewer 3


Overall, the introduced method is novel and the results it produces are interesting. A number of aspects of the work remain unclear in its current state; however, I think these can be addressed in the final version of the paper. Therefore, I am leaning towards accepting this work.

- L101: What is meant by 'frustum culling' in the context of extracting features from spatial memory? The notion of 'culling' is introduced in Section 3.2, where it refers to culling features, but as this is part of the main contribution it should be made clearer what is meant by it. Furthermore, in Section 3.2 it is not clear what the culling operation is really doing. Instead, the description of Equation (3) states that the requested neighborhood is filled. So apart from Equation (3) there is no description of what is actually meant by the term 'culling'. This should be clarified.
- The description of 'Encoding Memories' is not clear. In lines 112-122 the text states that a sequence of observed images, either RGB or RGB-D, is used. Later, an example is provided for constructing the agent's spatial memory from RGB-D observations by converting the depth maps into point clouds, and existing solutions for approximating ground planes are presented.
- L128: What is a 'patch' in this context? Later the text also refers to a 'feature patch'. It should be defined more formally.
- Figure 4 should have a more meaningful caption. As is, it is not comprehensible what is illustrated in this figure.
- L159: 'field of view' is the more common term.
- If I am not mistaken, 's' and 't' (in Equations 3 and 4) are not defined (also not in the supplementary material).
- L208: What is the unit of 83x83?
- L216: What is a patch here?