Sun, Dec 8th through Sat, Dec 14th, 2019, at the Vancouver Convention Center
This paper presents a new model for the task of vision-and-language navigation (VLN), where an agent must follow a natural language navigation instruction in a simulated environment with real-world visual imagery. The model combines several components: (1) a learned, spatial memory representation of the environment; (2) a neuralized Bayes filter for goal localization that conditions on latent "actions" and "observations", produced by a recurrent decoder conditioned on the instruction; and (3) a reactive policy that conditions on the goal predicted by the Bayes filter. The paper trains the model end-to-end and evaluates it on the benchmark Room-to-Room dataset, in both the standard navigation setting and in goal prediction.

*Originality* While, as the paper acknowledges, all of the individual components combined here have been explored in some form in past work (the metric spatial memory, the differentiable Bayes filter, and rollouts of latent observations and actions), their combination and application to the language-conditioned navigation task is, to my knowledge, novel. I found it creative and well-motivated.

*Quality* The main weakness of the paper is the results on the full navigation task, which are weak in comparison to past work. As the paper points out, this past work has used a variety of training and inference conditions that improve performance and are likely orthogonal to the contributions here. However, much of this past work has also reported results without these augmentations, and those results are comparable to or better than the navigation performance here. It would be clearer if the paper presented these results (for example, the "co-grounding" and "greedy decoding" ablation of Ma et al., which obtains 42 SR and 28 SPL on the val-unseen environments, and the behavioral cloning (IL) ablation of Tan et al., which obtains 43.6 SR and 40 SPL on val-unseen) rather than the augmented settings, or explained why they are not comparable.
In particular, since this paper uses the panoramic state representation of Fried et al. and an action space similar to theirs, their "panoramic space" ablation model seems a more appropriate baseline than the non-panoramic Seq2Seq model compared to here. However, all these differences may be at least partly explained by the use of different ResNet visual features than in these past works. In addition, the results on the goal prediction task show a substantial improvement over the strong LingUNet model.

*Clarity* I found the paper overall extremely clear about the model details, the intuition for each part, and the motivation for the work. A few minor details of the training procedure were underspecified:
- Is the true state sequence in 245 always the human trajectory, or does it include the exploration done by the model during training?
- When training the policy with the cross-entropy loss, are the parameters of the rest of the network (e.g. the filter and semantic map) also updated, or are they updated only by the filter supervision?
- Is the mapper-filter in the experiments of 5.2 produced by training without the policy, or does this take a trained full model and remove the policy component? (The first seems likely, but the text is a bit unclear.)

*Significance* While the results on the full navigation task don't show an improvement over past work, I think the model class is still likely to be built upon by researchers in this area. Past work has seen two high-level problems in this task, which models like this one may be able to address:
(1) Substantial improvements from exploration of the environment at inference time. A model with an explicit simulated planning component makes it possible to measure how much simulated planning using the learned environment representation could reduce the need for actual exploration.
(2) A generalization gap in novel environments.
It seems promising that this model has no gap in performance between seen and unseen environments, although the reason for this is not explained in this work.

*Minor comments*
- 238: The policy does seem to have indirect access to a representation of the instruction and semantic map through the belief network; this could be clarified by saying that it has no direct access.
- I found it surprising that incorporating the agent's heading into the belief state has such a large impact on performance, given the panoramic visual representation and action space. Some discussion of this would be helpful.
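For concreteness, the filter recursion this review refers to can be sketched in its classical, non-neural form: a predict step that diffuses the belief under a transition model (the paper's latent "action") followed by an update step that reweights by an observation likelihood (the latent "observation"). The function name, grid shapes, and kernels below are illustrative assumptions, not the paper's implementation, in which both quantities are produced by a learned decoder.

```python
import numpy as np

def bayes_filter_step(belief, transition_kernel, obs_likelihood):
    """One predict/update step of a discrete Bayes filter over a 2D grid.

    belief:            (H, W) probability grid over goal location.
    transition_kernel: small (k, k) kernel standing in for the latent
                       "action" model (applied here by correlation).
    obs_likelihood:    (H, W) per-cell likelihood standing in for the
                       latent "observation".
    Names and shapes are illustrative, not taken from the paper.
    """
    kh, kw = transition_kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(belief, ((ph, ph), (pw, pw)))
    # Predict: diffuse the belief under the transition model.
    predicted = np.zeros_like(belief)
    for i in range(kh):
        for j in range(kw):
            predicted += transition_kernel[i, j] * \
                padded[i:i + belief.shape[0], j:j + belief.shape[1]]
    # Update: reweight by the observation likelihood and renormalize,
    # so the output is again a proper distribution over the grid.
    posterior = predicted * obs_likelihood
    return posterior / posterior.sum()
```

Because every operation here (padding, weighted sums, elementwise product, normalization) is differentiable, the same recursion can be trained end-to-end when the kernel and likelihood come from a network, which is the sense in which the paper's filter is "neuralized".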
Summary: In this work, the authors propose a modular pipeline, reminiscent of traditional control architectures in robotics, in which a map is updated online, a state-belief distribution is maintained over the map through sequential filtering, and a policy is conditioned on this belief. The architecture is demonstrated on a vision-and-language navigation task, although in principle it could apply to other spatial tasks as well.

Originality: Neither map->estimate->control pipelines nor end-to-end differentiable Bayesian filtering is new, as the authors note, but this application is a novel and promising avenue for this sort of task. In particular, conditioning a reactive policy on the Bayesian state estimate has the potential to avoid overfitting, since the policy's input has been abstracted away from the raw observations. The performance results in the paper are poor in the absolute sense, but do seem to show less overfitting. The action space of the agent is also novel, although it requires significant domain and task knowledge in the form of the structure of the navigation graph.

Quality: The technical contribution is sound. The results fall significantly below state-of-the-art algorithms. Performance could perhaps be increased by techniques such as reinforcement learning and data augmentation, as in the state-of-the-art approaches, although this is not clear.

Clarity: The paper is well written and organized. The video attachment is helpful for understanding.

Significance: With such low performance results, it is not clear that this work directly advances the field in this area, despite its novelty.
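The final stage of the map->estimate->control pipeline described above can be illustrated with a toy stand-in: a reactive policy that sees only the belief distribution, never the raw observations. The grid world, the `greedy_policy` name, and the one-cell moves are assumptions for illustration only; the paper's policy is a learned network.

```python
import numpy as np

def greedy_policy(belief, position):
    """Toy reactive policy conditioned only on the belief: take one
    grid step toward the mode of the goal distribution. A hand-coded
    stand-in for the learned policy, for illustration."""
    goal = np.unravel_index(np.argmax(belief), belief.shape)
    step = np.sign(np.subtract(goal, position))
    return tuple(int(c) for c in np.add(position, step))

# A sharply peaked belief (as if produced by the filter) drives the
# agent from the corner of the grid to the belief's mode.
belief = np.zeros((6, 6))
belief[4, 5] = 1.0
pos = (0, 0)
for _ in range(10):
    if belief[pos] == belief.max():
        break
    pos = greedy_policy(belief, pos)
```

The design point the review raises is visible even in this sketch: the policy's input is a distribution over locations, so environment-specific visual appearance never reaches it, which is one plausible reason the architecture shows less overfitting to seen environments.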
The authors propose a new method for vision-and-language navigation using Bayesian filtering and state tracking. The paper was well written, up to some equation nomenclature and acronyms, but still very easy to follow, and I think it is indeed innovative to extend the simulator for the purpose of learning as a synthesis of two different fields. This submission would definitely advance the field: the ability to generate more realistic training grounds for VLN models can present a huge advantage. The experiments were also truly convincing (nice touch on the videos of the problem). Yet there are a few things I would be happy to have clarified or that are unclear:
- The first, and I think my biggest, question is how the state-rejection mechanism compares to that of Weib et al. (https://arxiv.org/abs/1511.06458). Can one think of the language component as a refined prior?
- In the abstract, can you please expand on what you mean by a strong baseline?
- Line 134: what are XY?
- Lines 147-150 are a bit of a mess of definitions and notation.