{"title": "QMDP-Net: Deep Learning for Planning under Partial Observability", "book": "Advances in Neural Information Processing Systems", "page_first": 4694, "page_last": 4704, "abstract": "This paper introduces the QMDP-net, a neural network architecture for planning under partial observability. The QMDP-net combines the strengths of model-free learning and model-based planning. It is a recurrent policy network, but it represents a policy for a parameterized set of tasks by connecting a model with a planning algorithm that solves the model, thus embedding the solution structure of planning in a network learning architecture. The QMDP-net is fully differentiable and allows for end-to-end training. We train a QMDP-net on different tasks so that it can generalize to new ones in the parameterized task set and \u201ctransfer\u201d to other similar tasks beyond the set. In preliminary experiments, QMDP-net showed strong performance on several robotic tasks in simulation. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm in the experiments, as a result of end-to-end learning.", "full_text": "QMDP-Net: Deep Learning for Planning under\n\nPartial Observability\n\nPeter Karkus1,2\n\nDavid Hsu1,2\n\nWee Sun Lee2\n\n1NUS Graduate School for Integrative Sciences and Engineering\n\n2School of Computing\n\nNational University of Singapore\n\n{karkus, dyhsu, leews}@comp.nus.edu.sg\n\nAbstract\n\nThis paper introduces the QMDP-net, a neural network architecture for planning under\npartial observability. The QMDP-net combines the strengths of model-free learning and\nmodel-based planning. It is a recurrent policy network, but it represents a policy for a\nparameterized set of tasks by connecting a model with a planning algorithm that solves the\nmodel, thus embedding the solution structure of planning in a network learning architecture.\nThe QMDP-net is fully differentiable and allows for end-to-end training. 
We train a QMDP-\nnet on different tasks so that it can generalize to new ones in the parameterized task set\nand \u201ctransfer\u201d to other similar tasks beyond the set. In preliminary experiments, QMDP-net\nshowed strong performance on several robotic tasks in simulation. Interestingly, while\nQMDP-net encodes the QMDP algorithm, it sometimes outperforms the QMDP algorithm\nin the experiments, as a result of end-to-end learning.\n\n1\n\nIntroduction\n\nDecision-making under uncertainty is of fundamental importance, but it is computationally hard,\nespecially under partial observability [24]. In a partially observable world, the agent cannot determine\nthe state exactly based on the current observation; to plan optimal actions, it must integrate information\nover the past history of actions and observations. See Fig. 1 for an example. In the model-based\napproach, we may formulate the problem as a partially observable Markov decision process (POMDP).\nSolving POMDPs exactly is computationally intractable in the worst case [24]. Approximate POMDP\nalgorithms have made dramatic progress on solving large-scale POMDPs [17, 25, 29, 32, 37]; however,\nmanually constructing POMDP models or learning them from data remains dif\ufb01cult. In the model-free\napproach, we directly search for an optimal solution within a policy class. If we do not restrict the\npolicy class, the dif\ufb01culty is data and computational ef\ufb01ciency. We may choose a parameterized\npolicy class. The effectiveness of policy search is then constrained by this a priori choice.\nDeep neural networks have brought unprecedented success in many domains [16, 21, 30] and\nprovide a distinct new approach to decision-making under uncertainty. The deep Q-network (DQN),\nwhich consists of a convolutional neural network (CNN) together with a fully connected layer,\nhas successfully tackled many Atari games with complex visual input [21]. 
Replacing the post-convolutional fully connected layer of DQN by a recurrent LSTM layer allows it to deal with partial observability [10]. However, compared with planning, this approach does not exploit the underlying sequential structure of decision-making.

We introduce QMDP-net, a neural network architecture for planning under partial observability. QMDP-net combines the strengths of model-free learning and model-based planning. A QMDP-net is a recurrent policy network, but it represents a policy by connecting a POMDP model with an algorithm that solves the model, thus embedding the solution structure of planning in a network learning architecture.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Fig. 1: A robot learning to navigate in partially observable grid worlds. (a) The robot has a map. It has a belief over the initial state, but does not know the exact initial state. (b) Local observations are ambiguous and are insufficient to determine the exact state. (c, d) A policy trained on expert demonstrations in a set of randomly generated environments generalizes to a new environment. It also “transfers” to a much larger real-life environment, represented as a LIDAR map [12].

Specifically, our network uses QMDP [18], a simple but fast approximate POMDP algorithm, though other more sophisticated POMDP algorithms could be used as well. A QMDP-net consists of two main network modules (Fig. 2). One represents a Bayesian filter, which integrates the history of an agent's actions and observations into a belief, i.e., a probabilistic estimate of the agent's state. The other represents the QMDP algorithm, which chooses the action given the current belief. Both modules are differentiable, allowing the entire network to be trained end-to-end. We train a QMDP-net on expert demonstrations in a set of randomly generated environments. 
The trained policy generalizes to new environments and also “transfers” to more complex environments (Fig. 1c–d). Preliminary experiments show that QMDP-net outperformed state-of-the-art network architectures on several robotic tasks in simulation. It successfully solved difficult POMDPs that require reasoning over many time steps, such as the well-known Hallway2 domain [18]. Interestingly, while QMDP-net encodes the QMDP algorithm, it sometimes outperformed the QMDP algorithm in our experiments, as a result of end-to-end learning.

2 Background

2.1 Planning under Uncertainty

A POMDP is formally defined as a tuple (S, A, O, T, Z, R), where S, A and O are the state, action, and observation space, respectively. The state-transition function T(s, a, s′) = P(s′|s, a) defines the probability of the agent being in state s′ after taking action a in state s. The observation function Z(s, a, o) = P(o|s, a) defines the probability of receiving observation o after taking action a in state s. The reward function R(s, a) defines the immediate reward for taking action a in state s.

In a partially observable world, the agent does not know its exact state. It maintains a belief, which is a probability distribution over S. The agent starts with an initial belief b_0 and updates the belief b_t at each time step t with a Bayesian filter:

b_t(s′) = τ(b_{t−1}, a_t, o_t) = η Z(s′, a_t, o_t) Σ_{s∈S} T(s, a_t, s′) b_{t−1}(s),   (1)

where η is a normalizing constant. The belief b_t recursively integrates information from the entire past history (a_1, o_1, a_2, o_2, . . . , a_t, o_t) for decision making. 
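The belief update in Eq. (1) is a matrix-vector product followed by element-wise reweighting and renormalization. As a concrete illustration, here is a minimal tabular sketch in NumPy; the tensor shapes and names are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

# Minimal sketch of the Bayesian filter in Eq. (1); shapes and names
# are illustrative assumptions, not the paper's implementation.
def belief_update(b, a, o, T, Z):
    """b: (|S|,) prior belief; a, o: action and observation indices;
    T: (|S|, |A|, |S|) with T[s, a, s'] = P(s'|s, a);
    Z: (|S|, |A|, |O|) with Z[s', a, o] = P(o|s', a)."""
    predicted = T[:, a, :].T @ b        # sum_s T(s, a, s') b(s)
    weighted = Z[:, a, o] * predicted   # weight by observation likelihood
    return weighted / weighted.sum()    # eta: renormalize to a distribution
```

In the QMDP-net the same two steps appear as differentiable network layers (Sec. 4), so T and Z can be learned from data rather than specified by hand.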
POMDP planning seeks a policy π that maximizes the value, i.e., the expected total discounted reward:

V_π(b_0) = E[ Σ_{t=0}^∞ γ^t R(s_t, a_{t+1}) | b_0, π ],   (2)

where s_t is the state at time t, a_{t+1} = π(b_t) is the action that the policy π chooses at time t, and γ ∈ (0, 1) is a discount factor.

2.2 Related Work

To learn policies for decision making in partially observable domains, one approach is to learn models [6, 19, 26] and solve the models through planning. An alternative is to learn policies directly [2, 5]. Model learning is usually not end-to-end. While policy learning can be end-to-end, it does not exploit model information for effective generalization. Our proposed approach combines model-based and model-free learning by embedding a model and a planning algorithm in a recurrent neural network (RNN) that represents a policy and then training the network end-to-end.

RNNs have been used earlier for learning in partially observable domains [4, 10, 11]. In particular, Hausknecht and Stone extended DQN [21], a convolutional neural network (CNN), by replacing its post-convolutional fully connected layer with a recurrent LSTM layer [10]. Similarly, Mirowski et al. [20] considered learning to navigate in partially observable 3-D mazes. The learned policy generalizes over different goals, but in a fixed environment. Instead of using the generic LSTM, our approach embeds algorithmic structure specific to sequential decision making in the network architecture and aims to learn a policy that generalizes to new environments.

The idea of embedding specific computation structures in the neural network architecture has been gaining attention recently. Tamar et al. implemented value iteration in a neural network, called the Value Iteration Network (VIN), to solve Markov decision processes (MDPs) in fully observable domains, where an agent knows its exact state and does not require filtering [34]. 
Okada et al.\naddressed a related problem of path integral optimal control, which allows for continuous states\nand actions [23]. Neither addresses the issue of partial observability, which drastically increases\nthe computational complexity of decision making [24]. Haarnoja et al. [9] and Jonschkowski and\nBrock [15] developed end-to-end trainable Bayesian \ufb01lters for probabilistic state estimation. Silver\net al. introduced Predictron for value estimation in Markov reward processes [31]. They do not\ndeal with decision making or planning. Both Shankar et al. [28] and Gupta et al. [8] addressed\nplanning under partial observability. The former focuses on learning a model rather than a policy.\nThe learned model is trained on a \ufb01xed environment and does not generalize to new ones. The\nlatter proposes a network learning approach to robot navigation in an unknown environment, with a\nfocus on mapping. Its network architecture contains a hierarchical extension of VIN for planning\nand thus does not deal with partial observability during planning. The QMDP-net extends the prior\nwork on network architectures for MDP planning and for Bayesian \ufb01ltering. It imposes the POMDP\nmodel and computation structure priors on the entire network architecture for planning under partial\nobservability.\n\n3 Overview\n\nWe want to learn a policy that enables an agent to act effectively in a diverse set of partially\nobservable stochastic environments. Consider, for example, the robot navigation domain in Fig. 1.\nThe environments may correspond to different buildings. The robot agent does not observe its own\nlocation directly, but estimates it based on noisy readings from a laser range \ufb01nder. It has access\nto building maps, but does not have models of its own dynamics and sensors. While the buildings\nmay differ signi\ufb01cantly in their layouts, the underlying reasoning required for effective navigation is\nsimilar in all buildings. 
After training the robot in a few buildings, we want to place the robot in a new building and have it navigate effectively to a specified goal.

Formally, the agent learns a policy for a parameterized set of tasks in partially observable stochastic environments: W_Θ = {W(θ) | θ ∈ Θ}, where Θ is the set of all parameter values. The parameter value θ captures a wide variety of task characteristics that vary within the set, including environments, goals, and agents. In our robot navigation example, θ encodes a map of the environment, a goal, and a belief over the robot's initial state. We assume that all tasks in W_Θ share the same state space, action space, and observation space. The agent does not have prior models of its own dynamics, sensors, or task objectives. After training on tasks for some subset of values in Θ, the agent learns a policy that solves W(θ) for any given θ ∈ Θ.

A key issue is a general representation of a policy for W_Θ, without knowing the specifics of W_Θ or its parametrization. We introduce the QMDP-net, a recurrent policy network. A QMDP-net represents a policy by connecting a parameterized POMDP model with an approximate POMDP algorithm and embedding both in a single, differentiable neural network. Embedding the model allows the policy to generalize over W_Θ effectively. Embedding the algorithm allows us to train the entire network end-to-end and learn a model that compensates for the limitations of the approximate algorithm.

Let M(θ) = (S̄, Ā, Ō, f_T(·|θ), f_Z(·|θ), f_R(·|θ)) be the embedded POMDP model, where S̄, Ā and Ō are the shared model state, action, and observation spaces, designed manually for all tasks in W_Θ (written with bars to distinguish them from the task spaces S, A and O), and f_T(·|·), f_Z(·|·), f_R(·|·) are the state-transition, observation, and reward functions to be learned from data. 
It may appear that a perfect answer to our learning problem would have

Fig. 2: QMDP-net architecture. (a) A policy maps a history of actions and observations to a new action. (b) A QMDP-net is an RNN that imposes structure priors for sequential decision making under partial observability. It embeds a Bayesian filter and the QMDP algorithm in the network. The hidden state of the RNN encodes the belief for POMDP planning. (c) A QMDP-net unfolded in time.

f_T(·|θ), f_Z(·|θ), and f_R(·|θ) represent the “true” underlying models of dynamics, observation, and reward for the task W(θ). This is true only if the embedded POMDP algorithm is exact, but not true in general. The agent may learn an alternative model to mitigate an approximate algorithm's limitations and obtain an overall better policy. In this sense, while QMDP-net embeds a POMDP model in the network architecture, it aims to learn a good policy rather than a “correct” model.

A QMDP-net consists of two modules (Fig. 2). One encodes a Bayesian filter, which performs state estimation by integrating the past history of agent actions and observations into a belief. The other encodes QMDP, a simple but fast approximate POMDP planner [18]. QMDP chooses the agent's actions by solving the corresponding fully observable Markov decision process (MDP) and performing one-step look-ahead search on the MDP values weighted by the belief.

We evaluate the proposed network architecture in an imitation learning setting. We train on a set of expert trajectories with randomly chosen task parameter values in Θ and test with new parameter values. An expert trajectory consists of a sequence of demonstrated actions and observations (a_1, o_1, a_2, o_2, . . .) for some θ ∈ Θ. The agent does not access the ground-truth states or beliefs along the trajectory during training. We define the loss as the cross entropy between predicted and demonstrated action sequences and use RMSProp [35] for training. See Appendix C.7 for details. Our implementation in TensorFlow [1] is available online at http://github.com/AdaCompNUS/qmdp-net.

4 QMDP-Net

We assume that all tasks in a parameterized set W_Θ share the same underlying state space S, action space A, and observation space O. We want to learn a QMDP-net policy for W_Θ, conditioned on the parameters θ ∈ Θ. A QMDP-net is a recurrent policy network. The inputs to a QMDP-net are the action a_t ∈ A and the observation o_t ∈ O at time step t, as well as the task parameter θ ∈ Θ. The output is the action a_{t+1} for time step t + 1.

A QMDP-net encodes a parameterized POMDP model M(θ) = (S̄, Ā, Ō, T = f_T(·|θ), Z = f_Z(·|θ), R = f_R(·|θ)) and the QMDP algorithm, which selects actions by solving the model approximately. We choose S̄, Ā, and Ō of M(θ) manually, based on prior knowledge of W_Θ, specifically, prior knowledge of S, A, and O. In general the model spaces need not coincide with the task spaces: S̄ ≠ S, Ā ≠ A, and Ō ≠ O. The model states, actions, and observations may be abstractions of their real-world counterparts in the task. In our robot navigation example (Fig. 1), while the robot moves in a continuous space, we choose S̄ to be a grid of finite size. We can do the same for Ā and Ō, in order to reduce representational and computational complexity. The transition function T, observation function Z, and reward function R of M(θ) are conditioned on θ, and are learned from data through end-to-end training. In this work, we assume that T is the same for all tasks in W_Θ to simplify the network architecture. 
In other words, T does not depend on θ.

End-to-end training is feasible because a QMDP-net encodes both a model and the associated algorithm in a single, fully differentiable neural network. The main idea for embedding the algorithm in a neural network is to represent linear operations, such as matrix multiplication and summation, by convolutional layers, and to represent maximum operations by max-pooling layers. Below we provide some details on the QMDP-net's architecture, which consists of two modules, a filter and a planner.

Fig. 3: A QMDP-net consists of two modules. (a) The Bayesian filter module incorporates the current action a_t and observation o_t into the belief. (b) The QMDP planner module selects the action according to the current belief b_t.

Filter module. The filter module (Fig. 3a) implements a Bayesian filter. It maps a belief, action, and observation to a next belief, b_{t+1} = f(b_t | a_t, o_t). The belief is updated in two steps; the first accounts for actions, the second for observations:

b′_t(s) = Σ_{s′∈S̄} T(s′, a_t, s) b_t(s′),   (3)

b_{t+1}(s) = η Z(s, o_t) b′_t(s),   (4)

where o_t ∈ Ō is the observation received after taking action a_t ∈ Ā and η is a normalization factor. We implement the Bayesian filter by transforming Eq. (3) and Eq. (4) into layers of a neural network. For ease of discussion, consider our N×N grid navigation task (Fig. 1a–c). The agent does not know its own state and only observes neighboring cells. It has access to the task parameter θ that encodes the obstacles, goal, and a belief over initial states. Given the task, we choose M(θ) to have an N×N state space. The belief, b_t(s), is now an N×N tensor.

Eq. (3) is implemented as a convolutional layer with |Ā| convolutional filters. We denote the convolutional layer by f_T. 
The kernel weights of f_T encode the transition function T in M(θ). The output of the convolutional layer, b′_t(s, a), is an N×N×|Ā| tensor.

b′_t(s, a) encodes the updated belief after taking each of the actions a ∈ Ā. We need to select the belief corresponding to the last action taken by the agent, a_t. We could directly index b′_t(s, a) by a_t if Ā = A; in general Ā ≠ A, so we cannot use simple indexing. Instead, we use “soft indexing”. First we encode actions in A to actions in Ā through a learned function f_A, which maps a_t to an indexing vector w^a_t, a distribution over actions in Ā. We then weight b′_t(s, a) by w^a_t along the action dimension, i.e.,

b′_t(s) = Σ_{a∈Ā} b′_t(s, a) w^a_t(a).   (5)

Eq. (4) incorporates observations through an observation model Z(s, o). Now Z(s, o) is an N×N×|Ō| tensor that represents the probability of receiving observation o ∈ Ō in state s ∈ S̄. In our grid navigation task, observations depend on the obstacle locations. We condition Z on the task parameter, Z(s, o) = f_Z(s, o|θ) for θ ∈ Θ. The function f_Z is a neural network mapping from θ to Z(s, o); in this paper f_Z is a CNN.

Z(s, o) encodes observation probabilities for each of the observations o ∈ Ō. We need the observation probabilities for the last observation o_t. In general Ō ≠ O and we cannot index Z(s, o) directly. Instead, we use soft indexing again: we encode observations in O to observations in Ō through f_O, a function mapping from o_t to an indexing vector w^o_t, a distribution over Ō. We then weight Z(s, o) by w^o_t, i.e.,

Z(s) = Σ_{o∈Ō} Z(s, o) w^o_t(o).   (6)

Finally, we obtain the updated belief, b_{t+1}(s), by multiplying b′_t(s) and Z(s) element-wise and normalizing over states. In our setting the initial belief for the task W(θ) is encoded in θ. We initialize the belief in QMDP-net through an additional encoding function, b_0 = f_B(θ).

Planner module. The QMDP planner (Fig. 3b) performs value iteration at its core. Q values are computed by iteratively applying Bellman updates:

Q_{k+1}(s, a) = R(s, a) + γ Σ_{s′∈S̄} T(s, a, s′) V_k(s′),   (7)

V_k(s) = max_a Q_k(s, a).   (8)

Actions are then selected by weighting the Q values with the belief.

We can implement value iteration using convolutional and max-pooling layers [28, 34]. In our grid navigation task Q(s, a) is an N×N×|Ā| tensor. Eq. (8) is expressed by a max-pooling layer, where Q_k(s, a) is the input and V_k(s) is the output. Eq. (7) is an N×N convolution with |Ā| convolutional filters, followed by an addition with R(s, a), the reward tensor. We denote the convolutional layer by f′_T. The kernel weights of f′_T encode the transition function T, similarly to f_T in the filter. Rewards for a navigation task depend on the goal and obstacles. We condition rewards on the task parameter, R(s, a) = f_R(s, a|θ), where f_R maps from θ to R(s, a); in this paper f_R is a CNN.

We implement K iterations of Bellman updates by stacking the layers representing Eq. (7) and Eq. (8) K times with tied weights. After K iterations we get Q_K(s, a), the approximate Q values for each state-action pair. We weight the Q values by the belief to obtain action values:

q(a) = Σ_{s∈S̄} Q_K(s, a) b_t(s).   (9)

Finally, we choose the output action through a low-level policy function f_π, mapping from q(a) to the action output, a_{t+1}.

QMDP-net naturally extends to higher-dimensional discrete state spaces (e.g., our maze navigation task), where n-dimensional convolutions can be used [14]. 
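Putting the filter (Eqs. 3–6) and the planner (Eqs. 7–9) together, one recurrent step of the network can be sketched in plain NumPy. Everything below is a hedged illustration under our own assumptions: the 3×3 kernels, K, γ, and all names are made up for the example, the convolutions are written out by hand, and the learned functions f_A, f_O, f_Z, f_R are replaced by their precomputed outputs (w_a, w_o, Z, R).

```python
import numpy as np

def conv2d(x, k):
    # 3x3 "same" correlation with zero padding; stands in for a conv layer.
    H, W = x.shape
    p = np.pad(x, 1)
    return sum(k[i, j] * p[i:i + H, j:j + W]
               for i in range(3) for j in range(3))

def qmdp_net_step(b, w_a, w_o, T_kernels, Z, R, K=10, gamma=0.95):
    """One QMDP-net step on an N x N grid (illustrative assumptions).
    b: (N, N) belief; w_a: (|A|,) soft action index; w_o: (|O|,) soft
    observation index; T_kernels: (|A|, 3, 3) transition kernels, each
    summing to 1 (e.g. softmax-normalized); Z: (N, N, |O|); R: (N, N, |A|)."""
    A = T_kernels.shape[0]
    # Filter, prediction step (Eqs. 3 and 5): one convolution per action,
    # then soft indexing by the encoded action w_a.
    b_pred = np.stack([conv2d(b, T_kernels[a]) for a in range(A)], axis=-1)
    b_pred = (b_pred * w_a).sum(axis=-1)
    # Filter, observation step (Eqs. 4 and 6): soft-indexed likelihood,
    # element-wise product, and renormalization (eta).
    b_new = b_pred * (Z * w_o).sum(axis=-1)
    b_new = b_new / b_new.sum()
    # Planner (Eqs. 7 and 8): K Bellman updates with tied kernels,
    # i.e. convolution plus reward, then a max over actions.
    V = np.zeros_like(b)
    for _ in range(K):
        Q = R + gamma * np.stack([conv2d(V, T_kernels[a]) for a in range(A)],
                                 axis=-1)
        V = Q.max(axis=-1)
    # Belief-weighted action values (Eq. 9); a learned f_pi would map q
    # to the output action.
    q = (Q * b_new[..., None]).sum(axis=(0, 1))
    return b_new, q
```

In the actual network these operations are TensorFlow layers, so gradients flow through both modules and the kernels, Z, and R are trained end-to-end.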
While M(θ) is restricted to a discrete space, we can handle continuous tasks W_Θ by simultaneously learning a discrete M(θ) for planning, and f_A, f_O, f_B, f_π to map between the states, actions, and observations of W_Θ and M(θ).

5 Experiments

The main objective of the experiments is to understand the benefits of structure priors on learning neural-network policies. We create several alternative network architectures by gradually relaxing the structure priors and evaluate the architectures on simulated robot navigation and manipulation tasks. While these tasks are simpler than, for example, Atari games in terms of visual perception, they are in fact very challenging because of the sophisticated long-term reasoning required to handle partial observability and distant future rewards. Since the exact state of the robot is unknown, a successful policy must reason over many steps to gather information and improve state estimation through partial and noisy observations. It must also reason about the trade-off between the cost of information gathering and the reward in the distant future.

5.1 Experimental Setup

We compare the QMDP-net with a number of related alternative architectures. Two are QMDP-net variants. Untied QMDP-net relaxes the constraints on the planning module by untying the weights representing the state-transition function over the different CNN layers. LSTM QMDP-net replaces the filter module with a generic LSTM module. The other two architectures do not embed POMDP structure priors at all. CNN+LSTM is a state-of-the-art deep CNN connected to an LSTM; it is similar to the DRQN architecture proposed for reinforcement learning under partial observability [10]. RNN is a basic recurrent neural network with a single fully connected hidden layer; it contains no structure specific to planning under partial observability.

Each experimental domain contains a parameterized set of tasks W_Θ. 
The parameters θ encode an environment, a goal, and a belief over the robot's initial state. To train a policy for W_Θ, we generate random environments, goals, and initial beliefs. We construct ground-truth POMDP models for the generated data and apply the QMDP algorithm. If the QMDP algorithm successfully reaches the goal, we retain the resulting sequence of actions and observations (a_1, o_1, a_2, o_2, . . .) as an expert trajectory, together with the corresponding environment, goal, and initial belief. It is important to note that the ground-truth POMDPs are used only for generating expert trajectories and not for learning the QMDP-net.

For fair comparison, we train all networks using the same set of expert trajectories in each domain. We perform a basic search over training parameters, the number of layers, and the number of hidden units for each network architecture. Below we briefly describe the experimental domains. See Appendix C for implementation details.

Grid-world navigation. A robot navigates in an unknown building given a floor map and a goal. The robot is uncertain of its own location. It is equipped with a LIDAR that detects obstacles in its direct neighborhood. The world is uncertain: the robot may fail to execute desired actions, possibly because of wheel slippage, and the LIDAR may produce false readings. We implemented a simplified version of this task in a discrete n×n grid world (Fig. 1c). The task parameter θ is represented as an n×n image with three channels. The first channel encodes the obstacles in the environment, the second channel encodes the goal, and the last channel encodes the belief over the robot's initial state. The robot's state represents its position in the grid. It has five actions: moving in each of the four canonical directions or staying put. 
The LIDAR observations are compressed into four binary values corresponding to obstacles in the four neighboring cells. We consider both a deterministic and a stochastic variant of the domain. The stochastic variant adds action and observation uncertainties: the robot fails to execute the specified move action and stays in place with probability 0.2, and the observations are faulty with probability 0.1 independently in each direction. We trained a policy using expert trajectories from 10,000 random environments, 5 trajectories from each environment. We then tested on a separate set of 500 random environments.

Maze navigation. A differential-drive robot navigates in a maze with the help of a map, but it does not know its pose (Fig. 1d). This domain is similar to the grid-world navigation, but it is significantly more challenging. The robot's state contains both its position and orientation. The robot cannot move freely because of kinematic constraints. It has four actions: move forward, turn left, turn right, and stay put. The observations are relative to the robot's current orientation, and the increased ambiguity makes it more difficult to localize the robot, especially when the initial state is highly uncertain. Finally, successful trajectories in mazes are typically much longer than those in randomly generated grid worlds. Again we trained on expert trajectories in 10,000 randomly generated mazes and tested in 500 new ones.

2-D object grasping. A robot gripper picks up novel objects from a table using a two-finger hand with noisy touch sensors at the finger tips. The gripper uses the fingers to perform compliant motions while maintaining contact with the object, or to grasp the object. It knows the shape of the object to be grasped, perhaps from an object database. However, it does not know its own pose relative to the object and relies on the touch sensors to localize itself. 
We implemented a simplified 2-D variant of this task, modeled as a POMDP [13]. The task parameter θ is an image with three channels encoding the object shape, the grasp point, and a belief over the gripper's initial pose. The gripper has four actions, each moving in a canonical direction unless it touches the object or the environment boundary. Each finger has 3 binary touch sensors at the tip, resulting in 64 distinct observations. We trained on expert demonstrations on 20 different objects with 500 randomly sampled poses for each object. We then tested on 10 previously unseen objects in random poses.

Fig. 4: Highly ambiguous observations in a maze. The four observations (in red) are the same, even though the robot states are all different.

Fig. 5: Object grasping using touch sensing. (a) An example [3]. (b) Simplified 2-D object grasping. Objects from the training set (top) and the test set (bottom).

5.2 Choosing QMDP-Net Components for a Task

Given a new task W_Θ, we need to choose an appropriate neural network representation for M(θ). More specifically, we need to choose S̄, Ā and Ō, and a representation for the functions f_R, f_T, f′_T, f_Z, f_O, f_A, f_B, f_π. This provides an opportunity to incorporate domain knowledge in a principled way. For example, if W_Θ has a local and spatially invariant connectivity structure, we can choose convolutions with small kernels to represent f_T, f_R and f_Z.

In our experiments we use S̄ = N×N for N×N grid navigation, and S̄ = N×N×4 for N×N maze navigation, where the robot has 4 possible orientations. We use |Ā| = |A| and |Ō| = |O| for all tasks except the object grasping task, where |O| = 64 and |Ō| = 16. We represent f_T, f_R and f_Z by CNN components with 3×3 and 5×5 kernels, depending on the task. 
We enforce that f_T and f_Z are proper probability distributions by using softmax and sigmoid activations on the convolutional kernels, respectively. Finally, f_O is a small fully connected component, f_A is a one-hot encoding function, f_π is a single softmax layer, and f_B is the identity function.

We can adjust the amount of planning in a QMDP-net by setting K. A large K allows information to propagate to more distant states without affecting the number of parameters to learn. However, it results in deeper networks that are computationally expensive to evaluate and more difficult to train. We used K = 20 to 116, depending on the problem size. We were able to transfer policies to larger environments by increasing K up to 450 when executing the policy.

In our experiments the representation of the task parameter θ is isomorphic to the chosen state space S̄. While the architecture is not restricted to this setting, we rely on it to represent f_T, f_Z, f_R by convolutions with small kernels. Experimenting with a more general class of problems is an interesting direction for future work.

5.3 Results and Discussion

The main results are reported in Table 1. Some additional results are reported in Appendix A. For each domain, we report the task success rate and the average number of time steps for task completion. Comparing the completion time is meaningful only when the success rates are similar.

QMDP-net successfully learns policies that generalize to new environments. When evaluated on new environments, the QMDP-net has a higher success rate and faster completion time than the alternatives in nearly all domains. To understand the performance difference better, we specifically compared the architectures in a fixed environment for navigation. Here only the initial state and the goal vary across the task instances, while the environment remains the same. See the results in the last row of Table 1. 
The QMDP-net and the alternatives have comparable performance. Even the RNN performs very well. Why? In a fixed environment, a network may learn the features of an optimal policy directly, e.g., going straight towards the goal. In contrast, the QMDP-net learns a model for planning, i.e., for generating a near-optimal policy in a given arbitrary environment.

POMDP structure priors improve the performance of learning complex policies. Moving across Table 1 from left to right, we gradually relax the POMDP structure priors on the network architecture. As the structure priors weaken, so does the overall performance. However, strong priors sometimes over-constrain the network and result in degraded performance. For example, we found that tying the weights of fT in the filter and fT′ in the planner may lead to worse policies. While both fT and fT′ represent the same underlying transition dynamics, using different weights allows each to choose its own approximation, and thus greater flexibility. We shed some light on this issue and visualize the learned POMDP model in Appendix B.

QMDP-net learns "incorrect", but useful models. Planning under partial observability is intractable in general, and we must rely on approximation algorithms. A QMDP-net encodes both a POMDP model and QMDP, an approximate POMDP algorithm that solves the model. We then train the network end-to-end. This provides the opportunity to learn an "incorrect", but useful model that compensates for the limitations of the approximation algorithm, in a way similar to reward shaping in reinforcement learning [22]. Indeed, our results show that the QMDP-net achieves a higher success rate than QMDP in nearly all tasks. In particular, QMDP-net performs well on the well-known Hallway2 domain, which is designed to expose the weakness of QMDP resulting from its myopic planning horizon.
The planning algorithm is the same for both the QMDP-net and QMDP, but the QMDP-net learns a more effective model from expert demonstrations. This is true even though QMDP generates the expert data for training. We note that the expert data contain only successful QMDP demonstrations. When both successful and unsuccessful QMDP demonstrations were used for training, the QMDP-net did not perform better than QMDP, as one would expect.

Table 1: Performance comparison of QMDP-net and alternative architectures for recurrent policy networks. SR is the success rate in percentage. Time is the average number of time steps for task completion. D-n and S-n denote deterministic and stochastic variants of a domain with environment size n×n.

             QMDP         QMDP-net     Untied       LSTM         CNN+LSTM     RNN
                                       QMDP-net     QMDP-net
Domain       SR    Time   SR    Time   SR    Time   SR    Time   SR    Time   SR    Time
Grid D-10    99.8   8.8   99.6   8.2   98.6   8.3   84.4  12.8   90.0  13.4   87.8  13.4
Grid D-18    99.0  15.5   99.0  14.6   98.8  14.8   43.8  27.9   57.8  33.7   35.8  24.5
Grid D-30    97.6  24.6   98.6  25.0   98.8  23.9   22.2  51.1   19.4  45.2   16.4  39.3
Grid S-18    98.1  23.9   98.8  23.9   95.9  24.0   23.8  55.6   41.4  65.9   34.0  64.1
Maze D-29    63.2  54.1   98.0  56.5   95.4  62.5    9.8  57.2    9.2  41.4    9.8  47.0
Maze S-19    63.1  50.5   93.9  60.4   98.7  57.1   18.9  79.0   19.2  80.8   19.6  82.1
Hallway2     37.3  28.2   82.9  64.4   69.6 104.4   82.8  89.7   77.8  99.5   68.0 108.8
Grasp        98.3  14.6   99.6  18.2   98.9  20.4   91.4  26.4   92.8  22.1   94.1  25.7
Intel Lab    90.2  85.4   94.4 107.7   20.0  55.3    -     -      -     -      -     -
Freiburg     88.4  66.9   93.2  81.1   37.4  51.7    -     -      -     -      -     -
Fixed grid   98.8  17.4   98.6  17.6   99.8  17.0   97.0  19.7   98.4  19.9   98.0  19.8

QMDP-net policies learned in small environments transfer directly to larger environments. Learning a policy for large environments from scratch is often difficult. A more scalable approach would be to
learn a policy in small environments and transfer it to large environments by repeating the reasoning process. To transfer a learned QMDP-net policy, we simply expand its planning module by adding more recurrent layers. Specifically, we trained a policy in randomly generated 30×30 grid worlds with K = 90. We then set K = 450 and applied the learned policy to several real-life environments, including Intel Lab (100×101) and Freiburg (139×57), using their LIDAR maps (Fig. 1c) from the Robotics Data Set Repository [12]. See the results for these two environments in Table 1. Additional results with different K settings and other buildings are available in Appendix A.

6 Conclusion

A QMDP-net is a deep recurrent policy network that embeds POMDP structure priors for planning under partial observability. While generic neural networks learn a direct mapping from inputs to outputs, a QMDP-net learns how to model and solve a planning task. The network is fully differentiable and allows for end-to-end training.

Experiments on several simulated robotic tasks show that learned QMDP-net policies successfully generalize to new environments and also transfer to larger environments. The POMDP structure priors and end-to-end training substantially improve the performance of learned policies. Interestingly, while a QMDP-net encodes the QMDP algorithm for planning, learned QMDP-net policies sometimes outperform QMDP.

There are many exciting directions for future exploration. First, a major limitation of our current approach is the state space representation. The value iteration algorithm used in QMDP iterates through the entire state space and is well known to suffer from the "curse of dimensionality". To alleviate this difficulty, the QMDP-net, through end-to-end training, may learn a much smaller abstract state space representation for planning.
One may also incorporate hierarchical planning [8]. Second, QMDP makes strong approximations in order to reduce computational complexity. We want to explore the possibility of embedding more sophisticated POMDP algorithms in the network architecture. While these algorithms provide stronger planning performance, their algorithmic sophistication increases the difficulty of learning. Finally, we have so far restricted the work to imitation learning. It would be exciting to extend it to reinforcement learning. Based on earlier work [28, 34], this is indeed promising.

Acknowledgments We thank Leslie Kaelbling and Tomás Lozano-Pérez for insightful discussions that helped to improve our understanding of the problem. The work is supported in part by Singapore Ministry of Education AcRF grant MOE2016-T2-2-068 and National University of Singapore AcRF grant R-252-000-587-112.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/.
[2] J. A. Bagnell, S. Kakade, A. Y. Ng, and J. G. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems, pages 831–838, 2003.
[3] H. Bai, D. Hsu, W. S. Lee, and V. A. Ngo. Monte Carlo value iteration for continuous-state POMDPs. In Algorithmic Foundations of Robotics IX, pages 175–191, 2010.
[4] B. Bakker, V. Zhumatiy, G. Gruener, and J. Schmidhuber. A robot that reinforcement-learns to identify and memorize important previous observations. In International Conference on Intelligent Robots and Systems, pages 430–435, 2003.
[5] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[6] B. Boots, S. M. Siddiqi, and G. J. Gordon.
Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
[7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[8] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. arXiv preprint arXiv:1702.03920, 2017.
[9] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel. Backprop KF: Learning discriminative deterministic state estimators. In Advances in Neural Information Processing Systems, pages 4376–4384, 2016.
[10] M. J. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[12] A. Howard and N. Roy. The robotics data set repository (Radish), 2003. URL http://radish.sourceforge.net/.
[13] K. Hsiao, L. P. Kaelbling, and T. Lozano-Pérez. Grasping POMDPs. In International Conference on Robotics and Automation, pages 4685–4692, 2007.
[14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
[15] R. Jonschkowski and O. Brock. End-to-end learnable histogram filters. In Workshop on Deep Learning for Action and Interaction at NIPS, 2016. URL http://www.robotics.tu-berlin.de/fileadmin/fg170/Publikationen_pdf/Jonschkowski-16-NIPS-WS.pdf.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] H.
Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
[18] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning, pages 362–370, 1995.
[19] M. L. Littman, R. S. Sutton, and S. Singh. Predictive representations of state. In Advances in Neural Information Processing Systems, pages 1555–1562, 2002.
[20] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[22] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, pages 278–287, 1999.
[23] M. Okada, L. Rigazio, and T. Aoshima. Path integral networks: End-to-end differentiable optimal control. arXiv preprint arXiv:1706.09597, 2017.
[24] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[25] J. Pineau, G. J. Gordon, and S. Thrun. Applying metric-trees to belief-point POMDPs. In Advances in Neural Information Processing Systems, 2003.
[26] G. Shani, R. I. Brafman, and S. E. Shimony. Model-based online learning of POMDPs. In European Conference on Machine Learning, pages 353–364, 2005.
[27] G. Shani, J. Pineau, and R. Kaplow. A survey of point-based POMDP solvers.
Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
[28] T. Shankar, S. K. Dwivedy, and P. Guha. Reinforcement learning via recurrent convolutional neural networks. In International Conference on Pattern Recognition, pages 2592–2597, 2016.
[29] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
[30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[31] D. Silver, H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley, G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016.
[32] M. T. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
[33] C. Stachniss. Robotics 2D-laser dataset. URL http://www.ipb.uni-bonn.de/datasets/.
[34] A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2146–2154, 2016.
[35] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, pages 26–31, 2012.
[36] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[37] N. Ye, A. Somani, D. Hsu, and W. S. Lee. DESPOT: Online POMDP planning with regularization.
Journal of Artificial Intelligence Research, 58:231–266, 2017.