{"title": "Chasing Ghosts: Instruction Following as Bayesian State Tracking", "book": "Advances in Neural Information Processing Systems", "page_first": 371, "page_last": 381, "abstract": "A visually-grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking - learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet baseline when predicting the goal location on the map. On the full VLN task, i.e. navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.", "full_text": "Chasing Ghosts\nChasing Ghosts\nChasing Ghosts:\n\nChasing Ghosts\n\nInstruction Following\n\nas Bayesian State Tracking\n\nPeter Anderson\u2217 1 Ayush Shrivastava\u22171 Devi Parikh1,2 Dhruv Batra1,2\n\nStefan Lee1,3\n\n1Georgia Institute of Technology, 2Facebook AI Research, 3Oregon State University\n\n{peter.anderson, ayshrv, parikh, dbatra}@gatech.edu\n\nleestef@oregonstate.edu\n\nAbstract\n\nA visually-grounded navigation instruction can be interpreted as a sequence of\nexpected observations and actions an agent following the correct trajectory would\nencounter and perform. 
Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) [1] within the framework of Bayesian state tracking – learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on-the-fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet [2] baseline when predicting the goal location on the map. On the full VLN task, i.e., navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.

1 Introduction

One long-term challenge in AI is to build agents that can navigate complex 3D environments from natural language instructions. In the Vision-and-Language Navigation (VLN) instantiation of this task [1], an agent is placed in a photo-realistic reconstruction of an indoor environment and given a natural language navigation instruction, similar to the example in Figure 1. The agent must interpret this instruction and execute a sequence of actions to navigate efficiently from its starting point to the corresponding goal. This task is challenging for existing models [3–9], particularly as the test environments are unseen during training and no prior exploration is permitted in the hardest setting. To be successful, agents must learn to ground language instructions to both visual observations and actions.
Since the environment is only partially-observable, this in turn requires the agent to relate instructions, visual observations and actions through memory. Current approaches to the VLN task use unstructured general purpose memory representations implemented with recurrent neural network (RNN) hidden state vectors [1, 3–9]. However, these approaches lack geometric priors and contain no mechanism for reasoning about the likelihood of alternative trajectories – a crucial skill for the task, e.g., 'Would this look more like the goal if I was on the other side of the room?'. Due to this limitation, many previous works have resorted to performing inefficient first-person search through the environment using search algorithms such as beam search [5, 7]. While this greatly improves performance, it is clearly inconsistent with practical applications like robotics since the resulting agent trajectories are enormously long – in the range of hundreds or thousands of meters.
To address these limitations, it is essential to move towards reasoning about alternative trajectories in a representation of the environment – where there are no search costs associated with moving a physical robot – rather than in the environment itself. Towards this, we extend the Matterport3D simulator [1] to provide depth outputs, enabling us to investigate the use of a semantic spatial map [10–13] in the context of the VLN task for the first time.
We propose an instruction-following agent incorporating three components: (1) a mapper that builds a semantic spatial map of its environment from first-person views; (2) a filter that determines the most probable trajectory(ies) and goal location(s) in the map; and (3) a policy that executes a sequence of actions to reach the predicted goal.

*First two authors contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Navigation instructions can be interpreted as encoding a set of latent expectable observations and actions an agent would encounter and undertake while successfully following the directions.

From a modeling perspective, our key contribution is the filter that formulates instruction following as a problem of Bayesian state tracking [14]. We notice that a visually-grounded navigation instruction typically contains a description of expected future observations and actions on the path to the goal. For example, consider the instruction 'walk out of the bathroom, turn left, and go on to the bottom of the stairs and wait near the coat rack' shown in Figure 1. When following this instruction, we would expect to immediately observe a bathroom, and at the end a coat rack near a stairwell. Further, in reaching the goal we can anticipate performing certain actions, such as turning left and continuing that way. Based on this intuition, we use a sequence-to-sequence model with attention to extract sequences of latent vectors representing observations and actions from a natural language instruction.
Faced with a known starting state, a (partially-observed) semantic spatial map generated by the mapper, and a sequence of (latent) observations and actions, we now quite naturally interpret our instruction following task within the framework of Bayesian state tracking.
Specifically, we formulate an end-to-end differentiable histogram filter [15] with learnable observation and motion models, and we train it to predict the most likely trajectory taken by a human demonstrator. We emphasize that we are not tracking the state of the actual agent. In the VLN setting, the pose of the agent is known with certainty at all times. The key challenge lies in determining the location of the natural-language-specified goal state. Leveraging the machinery of Bayesian state estimation allows us to reason in a principled fashion about what a (hallucinated) human demonstrator would do when following this instruction – by explicitly modeling the demonstrator's trajectory over multiple time steps in terms of a probability distribution over map cells. The resulting model encodes both strong geometric priors (e.g., pinhole camera projection) and strong algorithmic priors (e.g., explicit handling of uncertainty, which can be multi-modal), while enabling explainability of the learned model. For example, we can separately examine the motion model, the observation model, and their interaction during filtering.
Empirically, we show that our filter-based approach significantly outperforms a strong LingUNet [2] baseline when tasked with predicting the goal location in VLN given a partially-observed semantic spatial map. On the full VLN task (incorporating the learned policy as well), our approach achieves a success rate on the test server [1] of 32.7% (29.9% SPL [16]), a credible result for a new class of model trained exclusively with imitation learning and without data augmentation. Although our policy network is specific to the Matterport3D simulator environment, the rest of our pipeline is general and operates without knowledge of the simulator's navigation graph (which has been heavily utilized in previous work [1, 3–9]).
We anticipate this could be an advantage for sim-to-real transfer (i.e., in real robot scenarios where a navigation graph is not provided, and could be non-trivial to generate).
Contributions. In summary, we:
- Extend the existing Matterport3D simulator [1] used for VLN to support depth image outputs.
- Implement and investigate a semantic spatial memory in the context of VLN for the first time.
- Propose a novel formulation of instruction following / goal prediction as Bayesian state tracking of a hypothetical human demonstrator.
- Show that our approach outperforms a strong baseline for goal location prediction.
- Demonstrate credible results on the full VLN task with the addition of a simple reactive policy, with less reliance on navigation constraints than prior work.

2 Related work
Vision-and-Language Navigation Task. The VLN task [1], based on the Matterport3D dataset [17], builds on a rich history of prior work on situated instruction-following tasks beginning with SHRDLU [18]. Despite the task's difficulty, a recent flurry of work has seen significant improvements in success rates and related metrics [3–9]. Key developments include the use of instruction-generation ('speaker') models for trajectory re-ranking and data augmentation [7, 8], which have been widely adopted. Other work has focused on developing modules for estimating progress towards the goal [5] and learning when to backtrack [6, 9]. However, comparatively little attention has been paid to the memory architecture of the agent. LSTM [19] memory has been used in all previous work.
Memory architectures for navigation agents.
Beyond the VLN task, various categories of memory structures for deep neural navigation agents can be identified in the literature, including unstructured, addressable, metric and topological. General purpose unstructured memory representations, such as LSTM memory [19], have been used extensively in both 2D and 3D environments [20–24]. However, LSTM memory does not offer context-dependent storage or retrieval, and so does not naturally facilitate local reasoning when navigating large or complex environments [25]. To overcome these limitations, both addressable [25, 26] and topological [27] memory representations have been proposed for navigating in mazes and for predicting free space. However, in this work we elect to use a metric semantic spatial map [10–13] – which preserves the geometry of the environment – as our agent's memory representation, since reasoning about observed phenomena from alternative viewpoints is an important aspect of the VLN task. Semantic spatial maps are grid-based representations containing convolutional neural network (CNN) features which have been recently proposed in the context of visual navigation [10], interactive question answering [13], and localization [12]. However, there has been little work on incorporating these memory representations into tasks involving natural language. The closest work to ours is Blukis et al. [11]; however, our map construction is more sophisticated as we use depth images and do not assume that all pixels lie on the ground plane. Furthermore, our major contribution is formulating instruction-following as Bayesian state tracking.

3 Preliminaries: Bayes filters
A Bayes filter [14] is a framework for estimating a probability distribution over a latent state s (e.g., the pose of a robot) given a history of observations o and actions a (e.g., camera observations, odometry, etc.).
At each time step t the algorithm computes a posterior probability distribution bel(s_t) = p(s_t | a_{1:t}, o_{1:t}) conditioned on the available data. This is also called the belief.
Taking as a key assumption the Markov property of states, and conditional independence between observations and actions given the state, the belief bel(s_t) can be recursively updated from bel(s_{t-1}) using two alternating steps to efficiently combine the available evidence. These steps may be referred to as the prediction based on action a_t and the observation update using observation o_t.
Prediction. In the prediction step, the filter processes the action a_t using a motion model p(s_t | s_{t-1}, a_t) that defines the probability of a state s_t given the previous state s_{t-1} and an action a_t. In particular, the predicted belief bel⁻(s_t) is obtained by integrating (summing) over all prior states s_{t-1} from which action a_t could have led to s_t, as follows:

bel⁻(s_t) = ∫ p(s_t | s_{t-1}, a_t) bel(s_{t-1}) ds_{t-1}    (1)

Observation update. During the observation update, the filter incorporates information from the observation o_t using an observation model p(o_t | s_t) which defines the likelihood of an observation o_t given a state s_t. The observation update is given by:

bel(s_t) = η p(o_t | s_t) bel⁻(s_t)    (2)

where η is a normalization constant and Equation 2 is derived from Bayes rule.
Differentiable implementations. To apply Bayes filters in practice, a major challenge is to construct accurate probabilistic motion and observation models for a given choice of belief representation bel(s_t). However, recent work has demonstrated that Bayes filter implementations – including Kalman filters [28], histogram filters [15] and particle filters [29, 30] – can be embedded into deep neural networks.
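For concreteness, Equations 1 and 2 can be written down for a discrete 1-D state space in a few lines of NumPy. This is our own illustration, not the paper's implementation; the corridor world, motion kernel and observation likelihood below are hypothetical:

```python
import numpy as np

def predict(belief, kernel):
    """Prediction step (Eq. 1), discretized: sum over all prior states,
    weighting by the motion model p(s_t | s_{t-1}, a_t)."""
    out = np.zeros_like(belief)
    half = len(kernel) // 2
    for s_prev, mass in enumerate(belief):
        for k, p in enumerate(kernel):
            s_t = s_prev + (k - half)
            if 0 <= s_t < len(belief):
                out[s_t] += p * mass
    return out

def update(belief, likelihood):
    """Observation update (Eq. 2): reweight by p(o_t | s_t), then renormalize (eta)."""
    b = belief * likelihood
    return b / b.sum()

# Toy 10-cell corridor: the agent starts in cell 0 with certainty.
belief = np.zeros(10)
belief[0] = 1.0
# Hypothetical motion model for a 'move forward' action, offsets (-1, 0, +1).
kernel = np.array([0.0, 0.2, 0.8])
belief = predict(belief, kernel)      # mass spreads over cells 0 and 1
# Hypothetical observation likelihood peaked at cell 1.
likelihood = np.full(10, 0.1)
likelihood[1] = 0.9
belief = update(belief, likelihood)   # posterior concentrates on cell 1
```

The same two-step recursion underlies the differentiable histogram filter used later in the paper, where the kernel and the likelihood map are produced by learned networks.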
The resulting models may be seen as new recurrent architectures that encode algorithmic priors from Bayes filters (e.g., explicit representations of uncertainty, conditionally independent observation and motion models) yet are fully differentiable and end-to-end learnable.

Figure 2: Proposed filter architecture. To identify likely goal locations in the partially-observed semantic spatial map M generated by the mapper, we first initialize the belief bel(s_t) with the known starting state s_0. We then recursively: (1) generate a latent observation o_t and action a_t from the instruction, (2) compute the prediction step using the motion model (Equation 3), and (3) compute the observation update using the observation model (Equation 5), stopping after T time steps. The resulting belief bel(s_T) represents the posterior probability distribution over likely goal locations.

4 Agent model

In this section, we describe our VLN agent that simultaneously: (1) builds a semantic spatial map from first-person views; (2) determines the most probable goal location in the current map by filtering likely trajectories taken by a human demonstrator from the start location (i.e., the 'ghost'); and (3) executes actions to reach the predicted goal. Each of these functions is the responsibility of a separate module which we refer to as the mapper, filter, and policy, respectively. We begin with the mapper.

4.1 Mapper
At each time step t, the mapper updates a learned semantic spatial map M_t ∈ R^{M×Y×X} in the world coordinate frame from first-person views. This map is a grid-based metric representation in which each grid cell contains an M-sized latent vector representing the visual appearance of a small corresponding region in the environment. X and Y are the spatial dimensions of the semantic map, which could be dynamically resized if necessary.
The map maintains a representation for every world coordinate (x, y) that has been observed by the agent, and each map cell is computed from all past observations of the region. We define the world coordinate frame by placing the agent at the center of the map at the start of each episode, and defining the xy plane to coincide with the ground plane.
Inputs. As with previous work on the VLN task [5–7], we provide the agent with a panoramic view of its environment at each time step2, comprised of a set of RGB images I_t = {I_{t,1}, I_{t,2}, ..., I_{t,K}}, where I_{t,k} represents the image captured in direction k. The agent also receives the associated depth images D_t = {D_{t,1}, D_{t,2}, ..., D_{t,K}} and camera poses P_t = {P_{t,1}, P_{t,2}, ..., P_{t,K}}. We additionally assume that the camera intrinsics and the ground plane are known. In the VLN task, these inputs are provided by the simulator; in other settings they could be provided by SLAM systems etc.
Image processing. Each image I ∈ R^{H×W×3} is processed with a pretrained convolutional neural network (CNN) to extract a downsized visual feature representation v ∈ R^{H′×W′×C}. To extract a corresponding depth image d ∈ R^{H′×W′}, we apply 2D adaptive average pooling to the original depth image D ∈ R^{H×W}. Missing (zero) depth values are excluded from the pooling operation.
Feature projection. Similarly to MapNet [12], we project CNN features v onto the ground plane in the world coordinate frame using the corresponding depth image d, the camera pose P, and a pinhole camera model with known camera intrinsics. We then discretize the projected features into a 2D spatial grid F_t ∈ R^{C×Y×X}, using elementwise max pooling to handle feature collisions in a cell.
Map update. To integrate map observations F_t into our semantic spatial map M_t, we use a convolutional implementation [31] of a Gated Recurrent Unit (GRU) [32].
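A minimal sketch of the feature projection step (ours, not the paper's code, with a simplified camera-to-world convention; the intrinsics fx, fy, cx, cy, the pose, and the cell size are hypothetical):

```python
import numpy as np

def project_features(feats, depth, fx, fy, cx, cy, T_world_cam, grid, cell=0.5):
    """Back-project per-pixel CNN features to world coordinates with a pinhole
    model, then discretize into a 2-D grid using elementwise max pooling."""
    H, W, _ = feats.shape
    G = grid.shape[0]
    for v in range(H):
        for u in range(W):
            d = depth[v, u]
            if d <= 0:
                continue  # missing (zero) depth values are skipped
            # pixel (u, v) at depth d, back-projected to the camera frame (homogeneous)
            p_cam = np.array([(u - cx) * d / fx, (v - cy) * d / fy, d, 1.0])
            x, y, _ = T_world_cam @ p_cam  # 3x4 camera-to-world transform
            gx, gy = int(x / cell) + G // 2, int(y / cell) + G // 2
            if 0 <= gx < G and 0 <= gy < G:
                # elementwise max pooling handles feature collisions in a cell
                grid[gy, gx] = np.maximum(grid[gy, gx], feats[v, u])
    return grid

# Toy example: a single 2-channel feature map, identity camera pose.
feats = np.zeros((4, 4, 2))
feats[2, 2] = [1.0, 2.0]
depth = np.ones((4, 4))
grid = project_features(feats, depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0,
                        T_world_cam=np.eye(4)[:3], grid=np.zeros((8, 8, 2)))
```

In the paper's pipeline the equivalent operation is vectorized and differentiable; this loop version only illustrates the geometry.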
In preliminary experiments we found that using convolutions in both the input-to-state and state-to-state transitions reduced the variance in the performance of the complete agent by sharing information across neighboring map cells. However, since both the map M_t and the map update F_t are sparse, we use a sparsity-aware convolution operation that evaluates only observed pixels and normalizes the output [33]. We also mask the GRU map update to prevent bias terms from accumulating in the unobserved regions.

4.2 Filter
At the beginning of each episode the agent is placed at a start location s*_0 = (x_0, y_0, θ_0), where θ represents the agent's heading and x and y are coordinates in the world frame as previously described. The agent is given an instruction X describing the trajectory to an unknown goal coordinate s*_T = (x_T, y_T, ·). As an intermediate step towards actually reaching the goal, we wish to identify likely goal locations in the partially-observed semantic spatial map M generated by the mapper.
Our approach to this problem is based on the observation that a natural language navigation instruction typically conveys a sequence of expected future observations and actions, as previously discussed. Based on this observation, we frame the problem of determining the goal location s*_T as a tracking problem. As illustrated in Figure 2 and described further below, we implement a Bayes filter to track the pose s*_t of a hypothetical human demonstrator (i.e., the 'ghost') from the start location to the goal. As inputs to the filter, we provide a series of latent observations o_t and actions a_t extracted from the navigation instruction X. The output of the filter is the belief over likely goal locations bel(s_T).
Note that in this section we use the subscript t to denote time steps in the filter, overloading the notation from Section 4.1 in which t referred to agent time steps. We wish to make clear that in our model the filter runs in an inner loop, re-estimating the belief over trajectories taken by a demonstrator starting from s_0 each time the map is updated by the agent in the outer loop.
Belief. We define the state s_t = (x_t, y_t, θ_t) using the agent's (x, y) position and heading θ. We represent the belief over the demonstrator's state at each time step t with a histogram, implemented as a tensor bel(s_t) = b_t, b_t ∈ R^{Θ×Y×X}, where X, Y and Θ are the number of bins for each component of the state, respectively.
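Initializing this belief tensor from the known start state might look as follows (a sketch of our own; the bin counts and cell size are hypothetical, only the (Θ, Y, X) layout follows the text):

```python
import numpy as np

def init_belief(start, n_theta=4, Y=32, X=32, cell=0.5):
    """Initialize bel(s_0) as a (Theta, Y, X) histogram with all probability
    mass on the discretized start state (x_0, y_0, theta_0)."""
    x0, y0, theta0 = start
    b = np.zeros((n_theta, Y, X))
    tbin = int(theta0 / (2 * np.pi / n_theta)) % n_theta   # heading bin
    xbin = int(x0 / cell) + X // 2                         # map-centred position bins
    ybin = int(y0 / cell) + Y // 2
    b[tbin, ybin, xbin] = 1.0
    return b

# Agent starts at the map centre, facing 'north' (heading pi/2).
b0 = init_belief((0.0, 0.0, np.pi / 2))
```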
Using a histogram-based approach allows the filter to track multiple hypotheses, meshes easily with our implementation of a grid-based semantic map, and leads naturally to an efficient motion model implementation based on convolutions, as discussed further below. However, our proposed approach could also be implemented as a particle filter [29, 30], for example if discretization error was a significant concern.
Observations and actions. To transform the instruction X into a latent representation of observations o and actions a, we use a sequence-to-sequence model with attention [34]. We first tokenize the instruction into a sequence of words X = {x_1, x_2, ..., x_l} which are encoded using learned word embeddings and a bi-directional LSTM [19] to output a series of encoder hidden states {e_1, e_2, ..., e_l} and a final hidden state e representing the output of a complete pass in each direction. We then use an LSTM decoder to generate a series of latent observation and action vectors {o_1, o_2, ..., o_T} and {a_1, a_2, ..., a_T} respectively. Here, o_t is given by o_t = [ê^o_t, h_t], where h_t is the hidden state of the decoder LSTM, and ê^o_t is the attended instruction representation computed using a standard dot-product attention mechanism [35]. The action vectors a_t are computed analogously, using the same decoder LSTM but with a separate learned attention mechanism. The only input to the decoder LSTM is a positional encoding [36] of the decoding time step t. While the correct number of decoding time steps T is unknown, in practice we always run the filter for a fixed number of time steps equal to the maximum trajectory length in the dataset (which is 6 steps in the navigation graph).
Motion model. We implement the motion model p(s_t | s_{t-1}, a_t, M) as a convolution over the belief b_{t-1}.
This ensures that agent motion is consistent across the state space while explicitly enforcing locality, i.e., the agent cannot move further than half the kernel size in a single time step. Similarly to Jonschkowski and Brock [15], the prediction step from Equation 1 is thus reformulated as:

b⁻_t = b_{t-1} ∗ g(a_t, M)    (3)

where we define an action- and map-dependent motion kernel g(a_t, M) ∈ R^{Θ²×M²} given by:

g(a_t, M) = softmax(conv([a_t, M]))    (4)

where conv is a small 3-layer CNN with ReLU activations operating on the semantic spatial map M and the spatially-tiled action vector a_t, M is the motion kernel size, and the softmax function enforces the prior that g(a_t, M) represents a probability mass function. Note that we include M in the input so that the motion model can learn that the agent is unlikely to move through obstacles.

2 The panoramic setting is chosen for comparison with prior work – not as a requirement of our architecture.

Observation model. We require an observation model p(o_t | s_t, M) to define the likelihood of a latent observation o_t conditioned on the agent's state s_t and the map M. A generative observation model like this would be hard to learn, since it is not clear how to generate high-dimensional latent observations and normalization needs to be done across observations, not states. Therefore, we follow prior work [30] and learn a discriminative observation model that takes o_t and M as inputs and directly outputs the likelihood of this observation for each state. As detailed further in Section 4.4, this observation model is trained end-to-end without direct supervision of the likelihood.
To implement our observation model we use LingUNet [2], a language-conditioned image-to-image network based on U-Net [37]. Specifically, we use the LingUNet implementation from Blukis et al. [11] with 3 cascaded convolution and deconvolution operations.
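The prediction step of Equations 3–4 amounts to convolving the belief with a small kernel that sums to one. A toy NumPy version (ours, ignoring the heading dimension; the kernel logits below are a hypothetical stand-in for the learned conv([a_t, M])):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def motion_step(belief, kernel):
    """Eq. 3: convolve the 2-D belief histogram with a motion kernel that
    sums to one (zero-padded borders), spreading mass to nearby cells."""
    Y, X = belief.shape
    M = kernel.shape[0]
    h = M // 2
    out = np.zeros_like(belief)
    for y in range(Y):
        for x in range(X):
            mass = belief[y, x]
            if mass == 0.0:
                continue
            for ky in range(M):
                for kx in range(M):
                    ny, nx = y + ky - h, x + kx - h
                    if 0 <= ny < Y and 0 <= nx < X:
                        out[ny, nx] += kernel[ky, kx] * mass
    return out

# Eq. 4: a softmax over the kernel logits enforces a probability mass function.
logits = np.zeros((3, 3))
logits[1, 2] = 5.0                    # strongly favour a one-cell move in +x
kernel = softmax(logits)
belief = np.zeros((9, 9))
belief[4, 4] = 1.0
belief = motion_step(belief, kernel)  # belief peak shifts one cell to the right
```

Because the kernel is normalized and local, total probability mass is preserved and the agent cannot jump further than half the kernel size per step, exactly the locality prior described above.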
The spatial dimensionality of the LingUNet output matches the input image (in this case, M), and the number of output channels is selected to match the number of heading bins Θ. Outputs are restricted to the range [0, 1] using a sigmoid function. The observation update from Equation 2 is re-defined as:

b_t = η b⁻_t ⊙ LingUNet(o_t, M)    (5)

where η is a normalization constant and ⊙ represents element-wise multiplication.
Goal prediction. In summary, to identify goal locations in the partially-observed spatial map M, we initialize the belief b_0 with the known starting state s_0. We then iteratively: (1) generate a latent observation o_t and action a_t, (2) compute the prediction step using Equation 3, and (3) compute the observation update using Equation 5. We stop after T filter update time steps. The resulting belief b_T represents the posterior probability distribution over goal locations.
4.3 Policy
The final component of our agent is a simple reactive policy network. It operates over a global action space defined by the complete set of panoramic viewpoints observed in the current episode (including both visited viewpoints and their immediate neighbors). Our agent thus memorizes the local structure of the observed navigation graph to enable it to return to any previously observed location in a single action. The probability distribution over actions is defined by a softmax function, where the logit associated with each viewpoint i is given by y_i = MLP([b_{1:T,i}, v_i]), where MLP is a two-layer neural network, b_{1:T,i} is a vector containing the belief at each time step 1:T in a Gaussian neighborhood around viewpoint i, and v_i is a vector containing the distance from the agent's current location to viewpoint i, and an indicator variable for whether i has been previously visited. If the policy chooses to revisit a previously visited viewpoint, we interpret this as a stop action.
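Putting the two steps together, the goal-prediction recursion can be sketched in 1-D (our own toy; the fixed motion kernel and hand-set likelihood maps stand in for the learned models, with `np.convolve` playing the role of Eq. 3 and the reweighting that of Eq. 5):

```python
import numpy as np

def goal_belief(b0, kernel, likelihoods):
    """Alternate the prediction step (Eq. 3) and the observation update (Eq. 5)
    for T steps; the final belief is the posterior over goal locations."""
    b = b0.copy()
    for lik in likelihoods:
        b = np.convolve(b, kernel, mode="same")  # prediction step
        b = b * lik                              # observation update ...
        b = b / b.sum()                          # ... with normalization (eta)
    return b

# Toy 8-cell corridor, 3 filter steps.
b0 = np.zeros(8)
b0[0] = 1.0
kernel = np.array([0.0, 0.2, 0.8])   # offsets (-1, 0, +1): mostly move forward
flat = np.full(8, 0.1)               # uninformative observations at steps 1-2
peaked = np.full(8, 0.01)
peaked[3] = 1.0                      # final observation: 'the coat rack is here'
posterior = goal_belief(b0, kernel, [flat, flat, peaked])
```

After three mostly-forward steps the motion prior places mass around cells 2–3, and the final peaked likelihood concentrates the posterior on cell 3, mirroring how the filter combines the instruction's expected actions and observations.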
Note that our policy does not have direct access to any representation of the instruction, or the semantic map M. Although our policy network is specific to the Matterport3D simulator environment, the rest of our pipeline is general and operates without knowledge of the simulator's navigation graph.
4.4 Learning
Our entire agent model is fully differentiable, from policy actions back to image pixels via the semantic spatial map, geometric feature projection function, etc. Training data for the model consists of instruction-trajectory pairs (X, s*_{1:T}). In all experiments we train the filter using supervised learning by minimizing the KL-divergence between the predicted belief b_{1:T} and the true trajectory from the start to the goal s*_{1:T}, backpropagating gradients through the previous belief b_{t-1} at each step. Note that the predicted belief b_{1:T} is independent of the agent's actual trajectory s_{1:T} given the map M. In the goal prediction experiments (Section 5.2), the model is trained without a policy and so the agent's trajectory s_{1:T} is generated by moving towards the goal with 50% probability, or randomly otherwise. In the full VLN experiments (Section 5.3), we train the filter concurrently with the policy. The policy is trained with cross-entropy loss to maximize the likelihood of the ground-truth target action, defined as the first action in the shortest path from the agent's current location s_t to the goal s*_T. In this regime, trajectories are generated by sampling an action from the policy with 50% probability, or selecting the ground-truth target action otherwise. In both sets of experiments we train all parameters end-to-end (except for the pretrained CNN).
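The per-step supervision described above can be sketched as a KL-divergence between each predicted belief b_t and a target distribution placed on the demonstrator's true state (a minimal illustration of ours; the paper applies this over all T filter steps and backpropagates through the filter):

```python
import numpy as np

def kl_divergence(target, pred, eps=1e-8):
    """KL(target || pred) between two belief histograms."""
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

def filter_loss(pred_beliefs, true_states, shape):
    """Sum KL terms over filter steps, with a one-hot target per true state s*_t."""
    loss = 0.0
    for b, (y, x) in zip(pred_beliefs, true_states):
        target = np.zeros(shape)
        target[y, x] = 1.0
        loss += kl_divergence(target, b)
    return loss

# A belief that matches the demonstrator's state incurs (near) zero loss;
# a diffuse belief is penalized.
good = np.zeros((4, 4)); good[1, 2] = 1.0
bad = np.full((4, 4), 1.0 / 16)
low = filter_loss([good], [(1, 2)], (4, 4))
high = filter_loss([bad], [(1, 2)], (4, 4))
```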
We have verified that the stand-alone performance of the filter is not unduly impacted by the addition of the policy, but we leave the investigation of more sophisticated RL training regimes to future work.
Implementation details. We provide further implementation details in the supplementary material. PyTorch code will be released to replicate all experiments.3

3 https://github.com/batra-mlp-lab/vln-chasing-ghosts

5 Experiments
5.1 Environment and dataset
Simulator. We use the Matterport3D Simulator [1] based on the Matterport3D dataset [17] containing RGB-D images, textured 3D meshes and other annotations captured from 11K panoramic viewpoints

Table 1: Goal prediction results given a natural language navigation instruction and a fixed trajectory that either moves towards the goal, or randomly, with 50:50 probability. We evaluate predictions at each time step, although on average the goal is not seen until later time steps. Our filtering approach that explicitly models trajectories outperforms LingUNet [2, 11] across all time steps (i.e., regardless of map sparsity).
We confirm that adding heading θ to the filter state provides a robust boost.

Val-Seen
Time step                          0     1     2     3     4     5     6     7    Avg
Map Seen (m²)                   47.2  62.5  73.3  82.1  90.7  98.3   105   112   83.9
Goal Seen (%)                   8.82  17.2  25.9  33.7  41.2  48.8  54.5  60.2   36.3
Prediction Error (m)
  Hand-coded baseline           7.42  7.33  7.19  7.18  7.15  7.13  7.09  7.11   7.20
  LingUNet baseline             7.17  6.66  6.17  5.75  5.42  5.15  4.89  4.69   5.74
  Filter, s = (x, y) (ours)     6.45  5.94  5.66  5.25  5.00  4.86  4.67  4.62   5.31
  Filter, s = (x, y, θ) (ours)  6.10  5.75  5.30  5.06  4.81  4.71  4.59  4.46   5.09
Success Rate (<3m error)
  Hand-coded baseline           17.3  17.8  18.5  18.2  18.0  19.1  18.8  18.6   18.3
  LingUNet baseline             10.7  16.7  21.2  25.8  29.7  33.6  36.9  39.1   26.7
  Filter, s = (x, y) (ours)     24.6  29.3  31.9  35.9  39.7  41.0  42.1  41.2   35.7
  Filter, s = (x, y, θ) (ours)  30.9  34.3  38.4  41.6  43.7  44.9  44.3  46.2   40.6

Val-Unseen
Time step                          0     1     2     3     4     5     6     7    Avg
Map Seen (m²)                   45.6  60.3  69.8  78.0  84.9  91.1  96.7   102   78.6
Goal Seen (%)                   16.0  25.2  34.6  43.2  50.5  57.0  62.8  67.6   44.6
Prediction Error (m)
  Hand-coded baseline           6.75  6.53  6.40  6.37  6.29  6.20  6.15  6.12   6.35
  LingUNet baseline             6.18  5.80  5.40  5.17  4.90  4.65  4.44  4.27   5.10
  Filter, s = (x, y) (ours)     5.92  5.50  5.14  4.88  4.67  4.45  4.41  4.30   4.91
  Filter, s = (x, y, θ) (ours)  5.69  5.28  4.90  4.60  4.40  4.26  4.14  4.05   4.67
Success Rate (<3m error)
  Hand-coded baseline           18.9  20.1  21.1  21.3  21.8  22.2  22.6  22.9   21.4
  LingUNet baseline             16.9  22.3  27.7  31.6  35.2  38.4  41.1  44.5   32.2
  Filter, s = (x, y) (ours)     29.1  32.5  36.1  39.2  41.9  44.5  45.7  46.2   39.4
  Filter, s = (x, y, θ) (ours)  34.2  38.7  42.7  46.1  48.2  48.4  49.9  51.2   44.9

densely sampled throughout 90 buildings. Using this dataset, the simulator implements a visually-realistic first-person environment that allows the agent to look in any direction while moving between panoramic viewpoints along edges in a navigation graph. Viewpoints are 2.25m apart on average.
Depth outputs.
As the Matterport3D Simulator supports RGB output only, we extend it to support depth outputs, which are necessary to accurately project CNN features into the semantic spatial map. Our simulator extension projects the undistorted depth images from the Matterport3D dataset onto cubes aligned with the provided ‘skybox’ images, such that each cube-mapped pixel represents the Euclidean distance from the camera center. We then adapt the existing rendering pipeline to render depth images from these cube-maps, converting depth values from Euclidean distance back to distance from the camera plane in the process. To fill missing depth values corresponding to shiny, bright, transparent, and distant surfaces, we apply a simple cross-bilateral filter based on the NYUv2 implementation [38]. We additionally implement various other performance improvements, such as caching, which boost the frame-rate of the simulator to up to 1000 FPS, subject to GPU performance and CPU-GPU memory bandwidth. We have incorporated these extensions into the original simulator codebase.4
R2R instruction dataset. We evaluate using the Room-to-Room (R2R) dataset for Vision-and-Language Navigation (VLN) [1]. The dataset consists of 22K open-vocabulary, crowd-sourced navigation instructions with an average length of 29 words. Each instruction corresponds to a 5–24m trajectory in the Matterport3D dataset, traversing 5–7 viewpoint transitions. Instructions are divided into splits for training, validation and testing. The validation set is further split into two components: val-seen, where instructions and trajectories are situated in environments seen during training, and val-unseen, containing instructions situated in environments that are not seen during training.
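The Euclidean-to-camera-plane depth conversion described under ‘Depth outputs’ above can be sketched in a few lines. This is an illustrative numpy reimplementation, not the simulator's actual rendering code, and it assumes a pinhole camera with a 90-degree field of view per cube face (the focal length and pixel-offset conventions below are our own assumptions):

```python
import numpy as np

def euclidean_to_plane_depth(d_euclid, width, height):
    """Convert per-pixel Euclidean (camera-center) distances into
    camera-plane depth for one cube face with a 90-degree FOV.

    d_euclid: (height, width) array of distances from the camera center.
    Returns an array of the same shape with distances from the camera plane.
    """
    f = width / 2.0  # pinhole focal length implied by a 90-degree FOV
    u = np.arange(width) - (width - 1) / 2.0    # horizontal pixel offsets
    v = np.arange(height) - (height - 1) / 2.0  # vertical pixel offsets
    uu, vv = np.meshgrid(u, v)
    # Each pixel's viewing ray has length sqrt((u/f)^2 + (v/f)^2 + 1)
    # per unit of plane depth, so dividing recovers the plane depth.
    ray_norm = np.sqrt((uu / f) ** 2 + (vv / f) ** 2 + 1.0)
    return d_euclid / ray_norm
```

At the image center the two quantities coincide; towards the face corners the Euclidean distance overestimates plane depth by up to a factor of sqrt(3).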
All the test set instructions and trajectories are from environments that are unseen in training and validation.
5.2 Goal prediction results
We first evaluate the goal prediction performance of our proposed mapper and filter architecture in a policy-free setting using fixed trajectories. Trajectories are generated by an agent that moves towards the goal with 50% probability, or randomly otherwise. As an ablation, we also report results for our model excluding heading from the agent's filter state, i.e., st = (x, y), to quantify the value of encoding the agent's orientation in the motion and observation models. We compare to two baselines as follows:
LingUNet baseline. As a strong neural net baseline, we compare to LingUNet [2] – a language-conditioned variant of the U-Net image-to-image architecture [37] – that has recently been applied to goal location prediction in the context of a simulated quadrocopter instruction-following task [11]. We choose LingUNet because existing VLN models [3–9] do not explicitly model the goal location or the map, and are thus not capable of predicting the goal location from a provided trajectory. Following Blukis et al. [11] we train a 5-layer LingUNet module conditioned on the sentence encoding e and the semantic map M to directly predict the goal location distribution (as well as a path visitation distribution, as an auxiliary loss) in a single forward pass. As we implement our observation model using a (smaller, 3-layer) LingUNet, the LingUNet baseline resembles an ablated single-step version of our model that dispenses with the decoder generating latent observations and actions as well as the motion model. Note that we use the same mapper architecture for our filter and for LingUNet.

4 https://github.com/peteanderson80/Matterport3DSimulator

Table 2: Results for the full VLN task on the R2R dataset. Our model achieves credible results for a new model class trained exclusively with imitation learning (no RL) and without any data augmentation or specialized pretraining (Aug).

                                        Val-Seen                      Val-Unseen                       Test
Model                 RL  Aug    TL    NE   OS   SR   SPL |   TL    NE   OS   SR   SPL |   TL    NE   OS   SR   SPL
RPA [4]               ✓    -      -   5.56 0.53 0.43   -  |   -    7.65 0.32 0.25   -  |  9.15  7.53 0.32 0.25 0.23
Speaker-Follower [7]  -    ✓      -   3.36 0.74 0.66   -  |   -    6.62 0.45 0.36   -  | 14.82  6.62 0.44 0.35 0.28
RCM [3]               ✓    ✓   10.65  3.53 0.75 0.67   -  | 11.46  6.09 0.50 0.43   -  | 11.97  6.12 0.50 0.43 0.38
Self-Monitoring [5]   -    ✓      -   3.18 0.77 0.68 0.58 |   -    5.41 0.59 0.47 0.34 | 18.04  5.67 0.59 0.48 0.35
Regretful Agent [6]   -    ✓      -   3.23 0.77 0.69 0.63 |   -    5.32 0.59 0.50 0.41 | 13.69  5.69 0.56 0.48 0.40
FAST [9]              -    ✓      -     -    -    -    -  | 21.1   4.97 0.56 0.43   -  | 22.08  5.14 0.64 0.54 0.41
Back Translation [8]  ✓    ✓   11.0   3.99   -  0.62 0.59 | 10.7   5.22   -  0.52 0.48 | 11.66  5.23 0.59 0.51 0.47
Speaker-Follower [7]  -    -    8.46  4.86 0.63 0.52   -  |  9.15  7.07 0.41 0.31   -  |    -     -    -    -    -
Back Translation [8]  ✓    -   10.3   5.39   -  0.48 0.46 |  9.64  6.25   -  0.44 0.40 |    -     -    -    -    -
Ours                  -    -   10.15  7.59 0.42 0.34 0.30 |   -    7.20 0.44 0.35 0.31 | 10.03  7.83 0.42 0.33 0.30

Hand-coded baseline.
We additionally compare to a hand-coded goal prediction baseline designed to exploit biases in the R2R dataset [1] and the provided trajectories. We first calculate the mean straight-line distance from the start position to the goal across the entire training set, which is 7.6m. We then select as the predicted goal the position (x, y) in the map at a radius of 7.6m from the start position that has the greatest observed map area in a Gaussian-weighted neighborhood of (x, y).
Results. As illustrated in Table 1, our proposed filter architecture that explicitly models belief over trajectories that could be taken by a human demonstrator outperforms a strong LingUNet baseline at predicting the goal location (with an average success rate of 45% vs. 32% in unseen environments). This finding holds at all time steps (i.e., regardless of the sparsity of the map). We also demonstrate that removing the heading θ from the agent's state in our model degrades this success rate to 39%, demonstrating the importance of relative orientation to instruction understanding. For instance, it is unlikely for an agent following the true path to turn 180 degrees midway through (unless this is commanded by the instruction). Similarly, without knowing heading, the model can represent instructions such as ‘go past the table’ but not ‘go past with the table on your left’. Finally, the poor performance of the hand-coded baseline confirms that the goal location cannot be trivially predicted from the trajectory.

5.3 Vision-and-Language Navigation results
Having established the efficacy of our approach for goal prediction from a partial map, we turn to the full VLN task that requires our agent to take actions to actually reach the goal.
Evaluation. In VLN, an episode is successful if the final navigation error is less than 3m.
We report our agent's average success rate at reaching the goal (SR), and SPL [16], a recently proposed summary measure of an agent's navigation performance that balances navigation success against trajectory efficiency (higher is better). We also report trajectory length (TL) and navigation error (NE) in meters, as well as oracle success (OS), defined as the agent's success rate under an oracle stopping rule.
Results. In Table 2, we present our results in the context of state-of-the-art methods; however, as noted by the RL and Aug columns in the table, these approaches include reinforcement learning and complex data augmentation and pretraining strategies. These are non-trivial extensions that are the result of a community effort [3–9] and are orthogonal to our own contribution. We also use a less powerful CNN (ResNet-34 vs. ResNet-152 in prior work). For the most direct comparison, we consider the ablated models in the lower panel of Table 2 to be most appropriate. We find these results promising given that this is the first work to explore such a drastically different model class (i.e., maintaining a metric map and a probability distribution over alternative trajectories in the map). Our model also exhibits less overfitting than other approaches – performing equally well on both seen (val-seen) and unseen (val-unseen) environments.
Further, our filtering approach allows us greater insight into the model. We examine a qualitative example in Figure 3. On the left, we can see the agent attends to appropriate visual and direction words when generating latent observations and actions, supporting the intuition in Figure 1.
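For reference, the SPL metric [16] reported in Table 2 has a simple closed form: success weighted by the ratio of shortest-path length to the greater of the taken and shortest path lengths. A minimal sketch (our own illustrative code, not taken from the evaluation server):

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by (normalized inverse) Path Length [16].

    successes:        per-episode binary success indicators S_i
    shortest_lengths: geodesic shortest-path length to the goal, l_i
    actual_lengths:   length of the path the agent actually took, p_i
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)  # efficient successes score near 1
    return total / len(successes)
```

For example, an episode that succeeds but takes twice the shortest-path length contributes only 0.5, so SPL is always at most the success rate.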
Figure 3: Left: Textual attention during latent observation and action generation is appropriately more focused towards action words (‘left’, ‘right’) for the motion model, and visual words (‘bedroom’, ‘corridor’, ‘table’) for the observation model. Right: Top-down view illustrating the agent's expanding semantic spatial map (lighter-colored region), navigation graph (blue dots) and corresponding belief (red heatmap and circles with white heading markers) when following this instruction. At t = 0 the map is largely unexplored, and the belief is approximately correct but dispersed. By t = 6, the agent has become confident about the correct goal location, despite many now-visible alternative paths.

On the right, we can see the growing confidence our goal predictor places on the correct location as more of the map is explored – despite the increasing number of visible alternatives. We provide further examples (including insight into the motion and observation models) in the supplementary video.
6 Conclusion
We show that instruction following can be formulated as Bayesian state tracking in a model that maintains a semantic spatial map of the environment, and an explicit probability distribution over alternative possible trajectories in that map. To evaluate our approach we choose the complex problem of Vision-and-Language Navigation (VLN). This represents a significant departure from existing work in the area, and required augmenting the Matterport3D simulator with depth.
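In its simplest discrete form, the Bayesian state tracking formulation above reduces to a histogram-filter recursion over grid cells. The toy sketch below is ours, not the paper's learned PyTorch implementation (which replaces both the motion and observation models with neural networks conditioned on the instruction); it only illustrates the predict/update steps over a 2D position grid:

```python
import numpy as np

def convolve2d_same(x, k):
    """Naive zero-padded 'same' 2D convolution using only numpy."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    kf = k[::-1, ::-1]  # flip the kernel for a true convolution
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kf)
    return out

def bayes_filter_step(belief, motion_kernel, obs_likelihood):
    """One predict/update step of a discrete Bayes (histogram) filter
    over a 2D grid of candidate agent positions.

    belief:         (H, W) prior probability over grid cells
    motion_kernel:  (k, k) transition kernel spreading probability mass
    obs_likelihood: (H, W) likelihood of the current observation per cell
    """
    predicted = convolve2d_same(belief, motion_kernel)  # motion model
    posterior = predicted * obs_likelihood              # observation model
    return posterior / posterior.sum()                  # renormalize
```

The appeal of this recursion, and of the differentiable filter built on it, is that the geometric prior (probability mass can only move to nearby cells) is baked into the algorithm rather than learned from data.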
Empirically, we show that our approach outperforms recent alternative approaches to goal location prediction, and achieves credible results on the full VLN task without using RL or data augmentation – while offering reduced overfitting to seen environments, unprecedented interpretability and less reliance on the simulator's navigation constraints.

Acknowledgments
We thank Abhishek Kadian and Prithviraj Ammanabrolu for their help in the initial stages of the project. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

References
[1] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
[2] Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. Mapping instructions to actions in 3D environments with visual goal prediction. In EMNLP, 2018.
[3] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, 2019.
[4] Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, 2018.
[5] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation.
In ICLR, 2019.
[6] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In CVPR, 2019.
[7] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In NeurIPS, 2018.
[8] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. In NAACL, 2019.
[9] Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In CVPR, 2019.
[10] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
[11] Valts Blukis, Dipendra Misra, Ross A Knepper, and Yoav Artzi. Mapping navigation instructions to continuous control actions with position-visitation prediction. In CoRL, 2018.
[12] J. F. Henriques and A. Vedaldi. MapNet: An allocentric spatial memory for mapping environments. In CVPR, 2018.
[13] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
[14] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT Press, 2005.
[15] Rico Jonschkowski and Oliver Brock. End-to-end learnable histogram filters. In Workshop on Deep Learning for Action and Interaction at the Conference on Neural Information Processing Systems (NIPS), 2016.
[16] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir.
On evaluation of embodied navigation agents. arXiv:1807.06757, 2018.
[17] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV), 2017.
[18] Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, Massachusetts Institute of Technology, 1971.
[19] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
[20] Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory POMDPs with recurrent policy gradients. In International Conference on Artificial Neural Networks, 2007.
[21] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
[22] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. In ICLR, 2017.
[23] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931, 2017.
[24] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018.
[25] Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Control of memory, active perception, and action in Minecraft. In ICML, 2016.
[26] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. In ICLR, 2018.
[27] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun.
Semi-parametric topological memory for navigation. In ICLR, 2018.
[28] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In NIPS, 2016.
[29] Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors. In Proceedings of Robotics: Science and Systems (RSS), 2018.
[30] Peter Karkus, David Hsu, and Wee Sun Lee. Particle filter networks with application to visual localization. In CoRL, 2018.
[31] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[32] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[33] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), 2017.
[34] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[35] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[38] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus.
Indoor segmentation and support inference from RGBD images. In ECCV, 2012.