{"title": "Regression Planning Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1319, "page_last": 1329, "abstract": "Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their real-world applicability. In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. We borrow the idea of regression (backward) planning from classical planning literature and introduce Regression Planning Networks (RPN), a neural network architecture that plans backward starting at a task goal and generates a sequence of intermediate goals that reaches the current observation. We show that our model not only inherits many favorable traits from symbolic planning --including the ability to solve previously unseen tasks-- but also can learn from visual inputs in an end-to-end manner. We evaluate the capabilities of RPN in a grid world environment and a simulated 3D kitchen environment featuring complex visual scenes and long task horizon, and show that it achieves near-optimal performance in completely new task instances.", "full_text": "Regression Planning Networks\n\nDanfei Xu\n\nStanford University\n\nRoberto Mart\u00edn-Mart\u00edn\n\nStanford University\n\nDe-An Huang\n\nStanford University\n\nYuke Zhu\n\nStanford University\nNVIDIA Research\n\nSilvio Savarese\n\nStanford University\n\nLi Fei-Fei\n\nStanford University\n\nAbstract\n\nRecent learning-to-plan methods have shown promising results on planning directly\nfrom observation space. 
Yet, their ability to plan for long-horizon tasks is limited\nby the accuracy of the prediction model. On the other hand, classical symbolic\nplanners show remarkable capabilities in solving long-horizon tasks, but they\nrequire prede\ufb01ned symbolic rules and symbolic states, restricting their real-world\napplicability.\nIn this work, we combine the bene\ufb01ts of these two paradigms\nand propose a learning-to-plan method that can directly generate a long-term\nsymbolic plan conditioned on high-dimensional observations. We borrow the idea\nof regression (backward) planning from classical planning literature and introduce\nRegression Planning Networks (RPN), a neural network architecture that plans\nbackward starting at a task goal and generates a sequence of intermediate goals that\nreaches the current observation. We show that our model not only inherits many\nfavorable traits from symbolic planning, e.g., the ability to solve previously unseen\ntasks, but also can learn from visual inputs in an end-to-end manner. We evaluate\nthe capabilities of RPN in a grid world environment and a simulated 3D kitchen\nenvironment featuring complex visual scene and long task horizon, and show that\nit achieves near-optimal performance in completely new task instances.\n\n1\n\nIntroduction\n\nPerforming real-world tasks such as cooking meals or assembling furniture requires an agent to\ndetermine long-term strategies. This is often formulated as a planning problem. In traditional\nAI literature, symbolic planners have shown remarkable capability in solving high-level reasoning\nproblems by planning in human-interpretable symbolic spaces [1, 2]. However, classical symbolic\nplanning methods typically abstract away perception with ground-truth symbols and rely on pre-\nde\ufb01ned planning domains to specify the causal effects of actions. 
These assumptions signi\ufb01cantly\nrestrict the applicability of these methods in real environments, where states are high-dimensional\n(e.g., color images) and it\u2019s tedious, if not impossible, to specify a detailed planning domain.\nA solution to plan without relying on prede\ufb01ned action models and symbols is to learn to plan from\nobservations. Recent works have shown that deep networks can capture the environment dynamics\ndirectly in the observation space [3\u20135] or a learned latent space [6\u20138]. With a learned dynamics model,\nthese methods can plan a sequence of actions towards a desired goal through forward prediction.\nHowever, these learned models are far from accurate in long-term predictions due to the compounding\nerrors over multiple steps. Moreover, due to the action-conditioned nature of these models, they\nare bound to use myopic sampling-based action selection for planning [4, 5]. Such strategy may be\nsuf\ufb01cient for simple short-horizon tasks, e.g., pushing an object to a location, but they fall short in\ntasks that involve high-level decision making over longer timescale, e.g., making a meal.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1:\nRegression (backward)\nplanning with Regression Planning\nNetworks (RPN): Starting from the\n\ufb01nal symbolic goal g, our learning-\nbased planner iteratively predicts a se-\nquence of intermediate goals condi-\ntioning on the current observation ot\nuntil it reaches a goal g(\u2212T ) that is\nreachable from the current state using\na low-level controller.\n\nIn this work, we aim to combine the merits of planning from observation and the high-level reasoning\nability and interpretability of classical planners. 
We propose a learning-to-plan method that can\ngenerate a long-term plan towards a symbolic task goal from high-dimensional observation inputs.\nAs discussed above, the key challenge is that planning with either symbolic or observation space\nrequires accurate forward models that are hard to obtain, namely symbolic planning domains and\nobservation-space dynamics models. Instead, we propose to plan backward in a symbolic space\nconditioning on the current observation. Similar to forward planning, backward planning in symbolic\nspace (formally known as regression planning [9, 10] or pre-image backchaining [11\u201313]) also relies\non a planning domain to expand the search space starting from the \ufb01nal goal until the current state is\nreached. Our key insight is that by conditioning on the current observation, we can train a planner to\ndirectly predict a single path in the search space that connects the \ufb01nal goal to the current observation.\nThe resulting plan is a sequence of intermediate goals that can be used to guide a low-level controller\nto interact with the environment and achieve the \ufb01nal task goal.\nWe present Regression Planning Networks (RPN), a neural network architecture that learns to perform\nregression planning (backward planning) in a symbolic planning space conditioned on environment\nobservations. Central to the architecture is a precondition network that takes as input the current\nobservation and a symbolic goal and iteratively predicts a sequence of intermediate goals in reverse\norder. In addition, the architecture exploits the compositional structure of the symbolic space by\nmodeling the dependencies among symbolic subgoals with a dependency network. Such dependency\ninformation can be used to decompose a complex task goal into simpler subgoals, an essential\nmechanism to learn complex plans and generalize to new task goals. 
Finally, we present an algorithm\nthat combines these networks to perform regression planning and invokes low-level controllers to\nexecute the plan in the environment. An overview of our method is illustrated in Fig. 1.\nWe train RPN with supervision from task demonstration data. Each demonstration consists of a\nsequence of intermediate symbolic goals, their corresponding environment observations, and a final\nsymbolic task goal. An advantage of our approach is that the trained RPN models can compose seen\nplans to solve novel tasks that are outside of the training dataset. As we show in the experiments,\nwhen trained to cook two dishes with fewer than three ingredients, RPN can plan for a three-course\nmeal with more ingredients at near-optimal performance. In contrast, we observe that the performance of\nmethods that lack the essential components of our RPN degrades significantly when facing new tasks.\nWe demonstrate the capabilities of RPN in solving tasks in two domains: a grid world environment\nthat illustrates the essential features of RPN, and a 3D kitchen environment where we tackle the\nchallenges of longer-horizon tasks and increased complexity of visual observations.\n\n2 Related Work\n\nAlthough the area is recent, there is a large body of prior work on learning to plan from observation. Methods in\nmodel-based RL [3\u20135, 7] have focused on building action-conditioned forward models and performing\nsampling-based planning. However, learning to make accurate predictions with high-dimensional\nobservations is still challenging [3, 4, 6, 7], especially for long-horizon tasks. Recent works have\nproposed to learn structured latent representations for planning [8, 14, 15]. For example, Causal\nInfoGAN [8] learns a latent binary representation that can be used jointly with a graph-planning\nalgorithm. 
However, similar to model-based RL, learning such representations relies on reconstructing\nthe full input space, which can be difficult to scale to challenging visual domains. Instead, our method\ndirectly plans in a symbolic space, which allows more effective long-term planning and interpretability,\nwhile still taking high-dimensional observations as input.\nOur work is also closely related to Universal Planning Networks [16], which propose to learn planning\ncomputation from expert demonstrations. However, their planning-by-gradient-descent scheme is not\nideal for a non-differentiable symbolic action space, and they require detailed action trajectories as\ntraining labels, which are agent-specific and can be hard to obtain in the case of human demonstrations.\nOur method does not require an explicit action space and learns directly from high-level symbolic\ngoal information, which makes it agent-agnostic and adaptive to different low-level controllers.\nOur method is inspired by classical symbolic planning, especially a) goal-regression planning [9]\n(also known as pre-image backchaining [11\u201313]), where the planning process regresses from a goal\ninstead of progressing towards it, and b) the idea of partial-order planning [10, 17], where the ordering\nof the subgoals within a goal is exploited to reduce search complexity. However, these methods\nrequire (1) a complete specification of a symbolic planning domain [1] and (2) the initial symbolic state,\neither given or obtained from a highly accurate symbol grounding module [18, 19]; both can be hard\nto obtain for real-world task domains. 
In contrast, our method does not perform explicit symbolic\ngrounding and can generate a plan directly from a high-dimensional observation and a task goal.\nOur network architecture design is inspired by recursive networks for natural language syntax\nmodeling [20, 21] and program induction [22, 23]. Given a goal and an environment observation, our\nRPN predicts a set of predecessor goals that need to be completed before achieving the goal. The\nregression planning process is then to apply RPN recursively by feeding the predicted goals back to\nthe network until a predicted goal is reachable from the current observation.\n\n3 Problem Definition and Preliminaries\n\n3.1 Zero-shot Task Generalization with a Hierarchical Policy\n\nThe goal of zero-shot task generalization is to achieve task goals that are not seen during training [24\u201326]. Each task goal g belongs to a set of valid goals G. We consider an environment with transition\nprobability O \u00d7 A \u00d7 O \u2192 R, where O is a set of environment observations and A a set of primitive\nactions. Given a symbolic task goal g, the objective of an agent is to arrive at o \u2208 Og, where Og \u2282 O\nis the set of observations where g is satisfied. We adopt a hierarchical policy setup where, given\na final goal g and the current observation ot, a high-level policy \u00b5 : O \u00d7 G \u2192 G generates an\nintermediate goal g\u2032 \u2208 G, and a low-level policy \u03c0 : O \u00d7 G \u2192 A acts in the environment to achieve\nthe intermediate goal. We assume a low-level policy can only perform short-horizon tasks. In this\nwork, we focus on learning an effective high-level policy \u00b5 and assume the low-level policy can be\neither a pre-trained agent or a motion planner. 
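The hierarchical setup above can be sketched as a simple control loop. This is our own illustrative pseudocode, not the authors' implementation; `mu`, `pi`, `satisfied`, and the environment interface are all hypothetical stand-ins:

```python
# Illustrative sketch of the hierarchical policy loop (all names hypothetical).
# mu: high-level policy  O x G -> G   (proposes an intermediate goal)
# pi: low-level policy   O x G -> A   (short-horizon control)

def run_episode(env, mu, pi, goal, satisfied, max_steps=100):
    """Alternate between proposing intermediate goals and executing them."""
    obs = env.reset()
    for _ in range(max_steps):
        if satisfied(obs, goal):       # final goal test on the observation
            return True
        subgoal = mu(obs, goal)        # high level: next intermediate goal
        obs = rollout(env, pi, obs, subgoal)
    return False

def rollout(env, pi, obs, subgoal, horizon=20):
    """Short-horizon rollout of the low-level policy towards a subgoal."""
    for _ in range(horizon):
        obs = env.step(pi(obs, subgoal))
    return obs
```

The key assumption mirrored here is that the low-level policy is only trusted for short horizons, so all long-term reasoning happens in the choice of subgoals.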
For evaluation we consider a zero-shot generalization\nsetup [25, 26] where only a subset of the task goals, Gtrain \u2282 G, is available during training, and the\nagent has to achieve a disjoint set of test task goals Gtest \u2282 G, where Gtrain \u2229 Gtest = \u2205.\n\n3.2 Regression Planning\n\nIn this work, we formulate the high-level policy \u00b5 as a learning-based regression planner. Goal-regression planning [9\u201313] is a class of symbolic planning algorithms where the planning process\nruns backward from the goal instead of forward towards the goal. Given an initial symbolic state, a\nsymbolic goal, and a planning domain that defines actions by their preconditions and post-effects (i.e.,\naction operators), a regression planner starts by considering all action operators that might lead to\nthe goal and in turn expands the search space by enumerating all preconditions of these action operators.\nThe process repeats until the current symbolic state satisfies the preconditions of an operator. A plan\nis then a sequence of action operators that leads from the goal back to the current state.\nAn important distinction in our setup is that, because we do not assume access to these high-level\naction operators (from a symbolic planning domain) or the current symbolic state, we cannot explicitly\nperform such an exhaustive search. Instead, our model learns to predict the preconditions\nthat need to be satisfied in order to achieve a goal, conditioned on the current environment observation.\nThis ability enables our method to perform regression planning without explicit action operators, a\nplanning domain definition, or symbolic states.\n\nWe now define the essential concepts of regression planning adopted by our method. Following [12],\nwe define a goal g \u2208 G as the conjunction of a set of logical atoms, each consisting of a predicate and a\nlist of object arguments, e.g., On(pot, stove) \u2227 \u00acClean(cabbage). 
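A goal of this form, a conjunction of possibly negated logical atoms, can be captured with a small data structure. The following is a minimal sketch under our own naming, not code from the paper:

```python
# Minimal sketch of a symbolic goal as a conjunction of logical atoms
# (our own naming, not the paper's code).
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    predicate: str         # e.g. "On", "Clean"
    args: tuple            # object arguments, e.g. ("pot", "stove")
    positive: bool = True  # False encodes a negated atom

# A goal is a set of atoms read as a conjunction; each atom is a subgoal.
goal = frozenset({
    Atom("On", ("pot", "stove")),
    Atom("Clean", ("cabbage",), positive=False),
})

def satisfied(goal, state):
    """`state` is the set of positive atoms that currently hold."""
    return all((Atom(a.predicate, a.args) in state) == a.positive for a in goal)
```

Here `satisfied` is a ground-truth goal test for illustration only; in the paper, satisfaction is predicted from observations by a learned network (Sec. 4.3).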
We denote each atom in g as a\nsubgoal gi. A subgoal can also be viewed as a goal with a single atom. We define the preconditions of a\ngoal g or a subgoal gi as another intermediate goal g\u2032 \u2208 G that needs to be satisfied before attempting\ng or gi, conditioning on the environment observation. An intuitive example is that In(cabbage,\nsink) is a precondition of Clean(cabbage) if the cabbage is, e.g., on the table.\n\n4 Method\n\nOur primary contribution is to introduce a learning formulation of regression planning and propose\nRegression Planning Networks (RPN) as a solution. Here, we summarize the essential regression\nplanning steps to be posed as learning problems, introduce the state representation used by our model,\nand explain our learning approach to solving regression planning.\nSubgoal Serialization: The idea of subgoal serialization stems from partial-order planning [10, 17],\nwhere a planning goal is broken into subgoals, and the plans for each subgoal can be combined to\nreduce the search complexity. The challenge is to execute the subgoal plans in an order such that\na plan does not undo an already achieved subgoal. The process of finding such orderings is called\nsubgoal serialization [10]. Our method explicitly models the dependencies among subgoals and\nformulates subgoal serialization as a directed graph prediction problem (Sec. 4.1). This is an essential\ncomponent for our method to learn to achieve complex goals and generalize to new goals.\nPrecondition Prediction: Finding the predecessor goals (preconditions) that need to be satisfied\nbefore attempting to achieve another goal is an essential step in planning backward. As discussed\nin Sec. 3.2, symbolic regression planners rely on a set of high-level action operators defined in a\nplanning domain to enumerate valid preconditions. The challenge here is to directly predict the\npreconditions of a goal given an environment observation, without assuming a planning domain. 
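The backward-chaining idea, repeatedly asking what must hold before a goal until something reachable is found, can be sketched as a simple loop. Here `precondition` and `reachable` are placeholders standing in for the learned networks introduced in Sec. 4:

```python
# Sketch of the regression (backward) planning loop: starting from the final
# goal, repeatedly predict what must hold beforehand until a goal reachable
# from the current observation is found. `precondition` and `reachable` are
# placeholders for the learned networks (not the authors' code).

def regression_plan(obs, goal, precondition, reachable, max_depth=10):
    """Return intermediate goals ordered from `goal` down to a reachable one."""
    plan = [goal]
    for _ in range(max_depth):
        if reachable(obs, plan[-1]):
            return plan                      # plan[-1] can be executed first
        plan.append(precondition(obs, plan[-1]))
    raise RuntimeError("no reachable subgoal found within max_depth")
```

Because both functions condition on the current observation, a single backward path is produced rather than the exhaustive search tree a classical regression planner would expand.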
We\nformulate the problem of predicting preconditions as a graph node classification problem in Sec. 4.2.\nThe overall regression planning process is then as follows: given a task goal, we (1) decompose\nthe final task goal into subgoals and find the optimal ordering of completing the subgoals (subgoal\nserialization), (2) predict the preconditions of each subgoal, and (3) set the preconditions as the final\ngoal and repeat (1) and (2) recursively. We implement the process with a recursive neural network\narchitecture and an algorithm that invokes the networks to perform regression planning (Sec. 4.4).\nObject-Centric Representation: To bridge between the symbolic representation of the goals and the\nraw observations, we adopt an object-centric state representation [27\u201329]. The general form is that\neach object is represented as a continuous-valued feature vector extracted from the observation. We\nextend this representation to n-ary relationships among the objects, where each relationship has its\ncorresponding feature, akin to a scene-graph feature representation [30]. We refer to objects and their\nn-ary relationships as entities and to their features as entity features, e. This factorization reflects that\neach goal atom gi indicates the desired symbolic state of an entity in the scene. For example, the goal\nOn(A, B) indicates that the desired state of the binary entity (A, B) is A on top of B. We assume\nthat either the environment observation is already in such an entity-centric representation, or there exists\na perception function F that maps an observation o to a set of entity features e_t^i \u2208 R^D, i \u2208 {1...N},\nwhere N is the number of entities in an environment and D is the feature dimension. As an example,\nF can be a 2D object detector, and the features are simply the resulting image patches.\n\n4.1 Learning Subgoal Serialization\n\nWe pose subgoal serialization as a learning problem. 
We say that a subgoal gi depends on subgoal gj\nif gj needs to be completed before attempting gi. For example, Clean(cabbage) depends\non In(cabbage, sink). The process of subgoal serialization [10] is to find the optimal order\nin which to complete all subgoals by considering all dependencies among them. Following the taxonomy\nintroduced by Korf et al. [10], we consider four cases: we say that a set of subgoals is independent if\nthe subgoals can be completed in any order, and serializable if they can be completed in a fixed order.\nOften a subset of the subgoals needs to be completed together, e.g., Heated(pan) and On(stove),\nin which case these subgoals are called a subgoal block and g is block-serializable. Non-serializable\nis the special case of block-serializable where the entire g is a single subgoal block.\n\nFigure 2: Given a task goal, g, and the current observation, ot, RPN performs subgoal serialization\n(Algorithm 1) to find the highest-priority subgoal and estimates whether the subgoal is reachable by\none of the low-level controllers. If not, RPN predicts the preconditions of the subgoal and recursively\nuses them as the new goal for regression planning.
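Treating subgoal dependencies as a directed graph suggests a simple serialization procedure. The sketch below topologically orders subgoals from thresholded dependency scores; it is our simplified stand-in for the full procedure of Sec. 4.4, which additionally merges mutually dependent subgoals into blocks with the Bron-Kerbosch algorithm:

```python
# Simplified subgoal serialization from pairwise dependency scores
# (a stand-in for Algorithm 1: here every subgoal is its own block, whereas
# the paper merges mutually dependent subgoals into blocks via Bron-Kerbosch).

def serialize_subgoals(dep, thresh=0.5):
    """Topologically order subgoal indices given dep[i][j] ~ P(i depends on j).

    The returned order lists a subgoal before its dependencies, so the LAST
    entry depends on nothing unmet and should be attempted first, mirroring
    the paper's `return g[sortedBlocks[-1]]`.
    """
    n = len(dep)
    edges = {(i, j) for i in range(n) for j in range(n)
             if i != j and dep[i][j] > thresh}        # i depends on j
    indeg = [sum(1 for i, j in edges if j == k) for k in range(n)]
    ready = [k for k in range(n) if indeg[k] == 0]    # nothing depends on them
    order = []
    while ready:
        i = ready.pop()
        order.append(i)
        for j in range(n):
            if (i, j) in edges:
                indeg[j] -= 1
                if indeg[j] == 0:
                    ready.append(j)
    if len(order) < n:
        raise ValueError("cyclic dependencies: non-serializable subgoal block")
    return order
```

A cycle in the thresholded graph corresponds to subgoals that must be completed together, which is exactly the case the paper's block detection handles.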
For simplicity, we refer to both subgoals and subgoal\nblocks interchangeably from now on.\nNow, we can formulate the subgoal serialization problem as a graph prediction problem. Concretely,\ngiven a goal g = {g1, g2, ..., gK} and the corresponding entity features e_t^g = {e_t^g1, e_t^g2, ..., e_t^gK}, our\nsubgoal dependency network is then:\n\nfdependency(e_t^g, g) = \u03c6\u03b8({[e_t^gi, e_t^gj, gi, gj]}_{i,j=1}^K) = {dep(gi, gj)}_{i,j=1}^K,   (1)\n\nwhere dep(gi, gj) \u2208 [0, 1] is a score indicating whether gi depends on gj, \u03c6\u03b8 is a learnable network, and\n[\u00b7,\u00b7] denotes concatenation. We describe a subgoal serialization algorithm in Sec. 4.4.\n\n4.2 Learning Precondition Prediction\n\nWe have discussed how to find the optimal order of completing a set of subgoals. The next step\nis to find the preconditions of a subgoal or subgoal block, an essential step in planning backward.\nThe preconditions of a subgoal are another goal that needs to be completed before attempting the\nsubgoal at hand. To formulate this as a learning problem, we note that the subgoal and its preconditions\nmay not share the same set of entities. For example, the precondition of Clean(cabbage) may be\nIn(cabbage, sink). Hence a subgoal may map to subgoals grounded on any entities in the\nscene. To realize this intuition, we formulate the precondition problem as a node classification\nproblem [31], where each node corresponds to a pair of goal predicate and entity in the scene. We\nconsider three target classes: True and False correspond to the logical state of the goal predicate,\nand a third class, Null, indicates that the predicate is not part of the precondition. Concretely, given\na goal or subgoal set g = {g1, ..., gK} and all entity features et, the precondition network is then:\n\nfprecondition(et, g) = \u03c6\u03c8(\u2206(et, g)) = g(\u22121),   (2)\n\nwhere g(\u22121) is the predicted precondition of g, and \u03c6\u03c8 is a learnable network. Note that g may only\nmap to a subset of et; \u2206 fills the missing predicates with the Null class before concatenating et and g.\n\n4.3 Learning Subgoal Satisfaction and Reachability\n\nSubgoal serialization determines the order of completing subgoals, but some of the subgoals might\nhave already been completed. Here we use a learnable module to determine if a subgoal is already\nsatisfied. We formulate the subgoal satisfaction problem as a single-entity classification problem\nbecause whether a subgoal is satisfied does not depend on any other entity features. Similarly, we\nuse another module to determine if a subgoal is reachable by a low-level controller from the current\nobservation. We note that the reachability of a subgoal by a low-level controller may depend on the\nstate of other entities. For example, whether we can launch a grasping planner to fetch an apple from\nthe fridge depends on whether the fridge door is open. Hence we formulate it as a binary classification\nproblem conditioning on all entity features. 
Given a goal g and its subgoals {g1, ..., gK}, the models\ncan be expressed as:\n\nfsatisfied(e_t^gi, gi) = \u03c6\u03b1([e_t^gi, gi]) = sat(gi),   freachable(et, g) = \u03c6\u03b2([et, g]) = rec(g),   (3)\n\nwhere sat(gi) \u2208 [0, 1] indicates if a subgoal gi \u2208 g is satisfied, and rec(g) \u2208 [0, 1] indicates if the\ngoal g is reachable by a low-level controller given the current observation.\n\n4.4 Regression Planning with RPN\n\nHaving described the essential components of RPN, we introduce an algorithm that invokes\nthe networks at inference time to generate a plan. Given the entity features et and the final goal\ng, the first step is to serialize the subgoals. We start by finding all subgoals that are unsatisfied with\nfsatisfied(\u00b7) and construct the input nodes for fdependency(\u00b7), which in turn predicts a directed\ngraph structure. Then we use the Bron-Kerbosch algorithm [32] to find all complete subgraphs and\nconstruct a DAG among all subgraphs. Finally, we use topological sorting to find the subgoal block\nthat has the highest priority to be completed. The subgoal serialization subroutine is summarized\nin Algorithm 1. Given a subgoal, we first check if it is reachable by a low-level controller with\nfreachable(\u00b7), and invoke the controller with the subgoal if it is deemed reachable. Otherwise\nfprecondition(\u00b7) is used to find the preconditions of the subgoal and set them as the new goal. The overall\nprocess is illustrated in Fig. 2 and is additionally summarized with an algorithm in the Appendix.\n\nAlgorithm 1 SUBGOALSERIALIZATION\nInputs: Current entity features et, goal g\nv \u2190 \u2205, w \u2190 \u2205\nfor all gi \u2208 g do\n    sat(gi) \u2190 fsatisfied(e_t^gi, gi)\n    if sat(gi) < 0.5 then\n        v \u2190 v \u222a {gi}, w \u2190 w \u222a {e_t^gi}\n    end if\nend for\ndepGraph \u2190 DiGraph(fdependency(w, v))\nblockGraph \u2190 Bron-Kerbosch(depGraph)\nblockDAG \u2190 DAG(blockGraph)\nsortedBlocks \u2190 TopoSort(blockDAG)\nreturn g[sortedBlocks[\u22121]]\n\n4.5 Supervision and Training\n\nSupervision from demonstrations: We parse the training labels from task demonstrations generated by a hard-coded expert. A task demonstration consists of a sequence of intermediate goals\n{g(0), ..., g(T)} and the corresponding environment observations {o(0), ..., o(T)}. In addition, we also\nassume the dependencies among the subgoals {g0, ..., gN} of a goal are given in the form of a directed\ngraph. A detailed discussion of the training supervision is included in the Appendix.\nTraining: We train all sub-networks with full supervision. Due to the recursive nature of our\narchitecture, a long planning sequence can be optimized in parallel by considering the intermediate\ngoals and their preconditions independent of the planning history. More details are included in the\nAppendix.\n\n5 Experiments\n\nOur experiments aim to (1) illustrate the essential features of RPN, especially the effect of regression\nplanning and subgoal serialization, (2) test whether RPN can achieve zero-shot generalization to\nnew task instances, and (3) test whether RPN can directly learn from visual observation inputs. We\nevaluate our method on two environments: an illustrative Grid World environment [33] that dissects\ndifferent aspects of the generalization challenges (Sec. 5.1), and a simulated Kitchen 3D domain\n(Sec. 5.2) that features complex visual scenes and long-horizon tasks in BulletPhysics [34].\nWe evaluate RPN including all components introduced in Sec. 4 to perform regression planning\nwith the algorithm of Sec. 
4.4 and compare the following baselines and ablation versions: 1) E2E, a\nreactive planner adopted from Pathak et al. [35] that learns to plan by imitating the expert trajectory.\nBecause we do not assume a high-level action space, we train the planner to directly predict the next\nintermediate goal conditioning on the final goal and the current observation. The comparison to E2E\nis important to understand the effect of the inductive biases embedded in RPN. 2) SS-only shares\nthe same network architecture as RPN, but instead of performing regression planning, it directly\nplans the next intermediate goal based on the highest-priority subgoal produced by Algorithm 1. This\nbaseline evaluates in isolation the capability of our proposed subgoal serialization to decompose\ncomplex task goals into simpler subgoals. Similarly, in 3) RP-only, we replace subgoal serialization\n(Algorithm 1) with a single network, measuring the capabilities of the backward planning alone.\n\n5.1 Grid World\n\nIn this environment we remove the complexity of learning the visual features and focus on comparing\nthe planning capabilities of RPN and the baselines. The environment is the 2D grid world built on [33]\n(see Table 1, left). The state space is factored into object-centric features, which consist of object types\n(door, key, etc.), object states (e.g., a door is open), object colors (six unique colors), and locations\nrelative to the agent. The goals are provided as grounded symbolic expressions as described in\nSec. 3, e.g., Open(door_red) \u2227 Open(door_blue). Both the expert demonstrator and the low-level\ncontrollers are A\u2217-based search algorithms. Further details on the training and evaluation setup are in the\nAppendix. In the grid world we consider two domains:\nDoorKey: Six pairs of doors and keys, where a locked door can only be unlocked by the key of the\nsame color. Doors are randomly initialized to be locked or unlocked. 
The training tasks consist of\nopening D = 2 randomly selected doors (the other doors can be in any state). The evaluation tasks\nconsist of opening D \u2208 {4, 6} doors, measuring the generalization capabilities of the methods to\ndeal with new tasks composed of multiple instances of similar subtasks. The key to solving tasks\ninvolving more than two doors is to model opening each door as an independent subgoal.\nRoomGoal: Six rooms connected to a central area by possibly locked and closed doors. The training\ndata is evenly sampled from two tasks: k-d (key-door) is to open a randomly selected (possibly\nlocked) door without getting into the room. d-g (door-goal) is to reach a tile by getting through a\nclosed but unlocked door. In evaluation, the agent is asked to reach a specified tile by getting through\na locked door (k-d-g), measuring the capabilities of the methods to compose plans learned from the\ntwo training tasks to form a longer unseen plan.\n\nTable 1: (Left) Sample initial states of the DoorKey and RoomGoal domains; (Right) Results on DoorKey\nand RoomGoal reported in average success rate (percentage).\n\n            DoorKey                       RoomGoal\n            Train    Eval                 Train            Eval\nTask        D=2      D=4      D=6        k-d      d-g      k-d-g\nE2E [35]    81.2     1.2      0.0        100.0    100.0    3.2\nRP-only     92.2     18.2     0.0        100.0    100.0    100.0\nSS-only     99.7     46.0     21.1       99.9     100.0    7.8\nRPN         99.1     91.9     64.3       98.7     99.9     98.8\n\nResults: The results for both domains are shown in Table 1, right. In DoorKey, all methods except\nE2E almost perfectly learn to reproduce the training tasks. The performance drops significantly for\nthe three baselines when increasing the number of doors, D. RP-only degrades significantly due to its\ninability to decompose the goal into independent parts, while the performance of SS-only degrades\nbecause, in addition to interpreting the goal, it also needs to determine if a key is needed and the color\nof the key to pick. 
However, it still achieves a 21% success rate at D = 6. RPN maintains a 64% success\nrate even for D = 6, although it has been trained with very few samples where all six doors are\ninitialized as closed or locked. Most of the failures (21% of the episodes) are due to RPN not being\nable to predict any precondition while no subgoals are reachable (full error breakdown in the Appendix).\nIn RoomGoal, all methods almost perfectly learn the two training tasks. In particular, E2E achieves\nperfect performance, but it only achieves a 3.2% success rate in the long k-d-g evaluation task. In\ncontrast, both RP-only and RPN achieve optimal performance on the k-d-g evaluation task as well,\nshowing that our regression planning mechanism is enough to solve new tasks by composing learned\nplans, even when the planning steps connecting the plans have never been observed during training.\n\n5.2 Kitchen 3D\n\nThis environment features complex visual scenes and very long-horizon tasks composed of tabletop\ncooking and sorting subtasks. We test in this environment the full capabilities of each component of\nRPN, and whether the complete regression planning mechanism can solve previously unseen task\ninstances without dropping performance while coping directly with high-dimensional visual inputs.\n\nFigure 4: Visualization of RPN solving a sample cooking task with one dish and two ingredients\n(I = 2, D = 1): (Top) Visualization of the environment after a goal is achieved (zoom in to see\ndetails), and (Bottom) the regression planning trace generated by RPN. An additional video illustration\nis in the supplementary material.\n\nCooking: The task is for a robotic agent to prepare a meal consisting\nof a variable number of dishes, each involving a variety of ingredients\nand different cookware. As shown in Fig. 
3, the initial layout consists of a set of ingredients and plates randomly located at the center of the workspace, surrounded by (1) a stove, (2) a sink, (3) two cookwares, and (4) three serving areas. There are two types of ingredients, fruits and vegetables, and six ingredients in total. An ingredient needs to be cleaned at the sink before cooking. An ingredient can be cooked by setting up the correct cookware at the stove, activating the stove, and placing the ingredient on the cookware. Fruits can only be cooked in the small pan and vegetables in the big pot.
The environment is simulated with [34]. A set of low-level controllers interact with objects to complete a subgoal, e.g., On(tomato, sink) invokes an RRT-based motion planner [36] to pick up the tomato and place it in the sink. For the object-centric representation, we assume access to object bounding boxes of the input image o_t and use a CNN-based encoder to encode the individual image patches into e_t. The encoder is trained end-to-end with the rest of the model. More details on the network architectures and evaluation setup are available in the Appendix.
We focus on evaluating the ability to generalize to tasks that involve a different number of dishes (D) and ingredients (I). The training tasks are to cook I = 3 randomly chosen ingredients into D = 2 dishes. The evaluation tasks are to cook meals with I ∈ {2, ..., 6} ingredients and D ∈ {1, ..., 3} dishes. Moreover, the cooking steps may vary depending on the order in which the ingredients are cooked, e.g., the agent has to set up the correct cookware before cooking an ingredient, or turn on the stove/sink if these steps were not already done while cooking the previous ingredient. In addition to the average task success rate, we also report the average subgoal completion rate. For example, for an episode where 5 out of 6 ingredients are successfully prepared in an I = 6 task, the metric value would be 5/6.

Figure 3: The Kitchen 3D environment.
An agent (not shown) is tasked to prepare a meal with a variable number of dishes and ingredients.

Table 2: Results of Kitchen 3D in average task success rate / average subgoal completion rate over 1000 evaluation episodes. All standard errors are less than or equal to 0.5 and are thus omitted.

              Train                               Evaluation
Task        I=3, D=2     I=2, D=1     I=4, D=1     I=4, D=3     I=6, D=1     I=6, D=3
E2E [35]    5.0 / 8.3    16.4 / 21.2  2.3 / 3.7    0.7 / 3.0    0.0 / <0.1   0.0 / <0.1
RP-only     70.3 / 83.4  67.1 / 77.4  47.0 / 71.7  27.9 / 64.1  <0.1 / 23.9  0.0 / 22.9
SS-only     49.1 / 59.7  59.3 / 61.9  56.6 / 66.2  43.4 / 60.0  42.8 / 69.3  32.7 / 59.7
RPN         98.5 / 98.8  98.6 / 98.7  98.2 / 99.2  98.4 / 99.2  95.3 / 98.9  97.2 / 99.4

Results: As shown in Table 2, RPN achieves near-optimal performance on all tasks, showing that our method achieves strong zero-shot generalization even with visual inputs. In comparison, E2E performs poorly on both training and evaluation tasks, and RP-only achieves high accuracy on the training tasks but degrades significantly as the generalization difficulty increases. This shows that regression planning is effective in modeling long-term plans but generalizes poorly to new task goals. SS-only performs worse than RP-only on the training tasks, but it is able to maintain reasonable performance across the evaluation tasks by decomposing task goals into subgoals.
In Fig. 4, we visualize a sample planning trajectory generated by RPN on a two-ingredient, one-dish task (I = 2, D = 1). RPN is able to resolve the optimal order of completing each subgoal. In this case, RPN chooses to cook the cabbage first and then the banana. Note that the steps for cooking a banana differ from those for cooking cabbage: the agent does not have to activate the stove and the sink, but it has to set a pan on the stove in addition to the pot, because fruits can only be cooked with the pan. RPN
RPN\nis able to correctly generate different plans for the two ingredients and complete the task.\n\nTable 3: Results of RPN trained on I = 3, D = 2 tasks with different number of task instances T\nand demonstrations per task N and evaluated on I = 6, D = 3 tasks.\n\nTrain Set\n\nRPN\n\nT=50\nN=10\n\n80.0 / 89.8\n\nT=100\nN=10\n\n93.7 / 97.7\n\nT=500\nN=10\n\n94.6 / 99.0\n\nT=1080\nN=1\n87.7 / 94.1\n\nT=1080\nN=5\n97.8 / 99.4\n\nT=1080\nN=10\n97.2 / 99.4\n\nAblation study: generalization under limited data. Here we evaluate RPN under limited training\ndata. Speci\ufb01cally, we construct training datasets with reduced number of unique task instances T\n(combinations of ingredients in a meal) and number of demonstrations per task instance N, and we\nevaluate the resulting RPN models on the most challenging I = 6, D = 3 tasks. As shown in Table 3,\nRPN generalizes well with both reduced T and N. Notably, RPN is able to maintain a \u223c 95% task\nsuccess rate with 1/10 of all unique training task instances (T = 100, N = 10), showing that RPN\ncan generalize to unseen goals and plans in a complex task domain.\n\n6 Limitations and Future Works\n\nPartially-observable environments. Because the current RPN framework assumes that the symbolic\ngoals are grounded on the current observation, more architecture changes are required to accommodate\ngoal speci\ufb01cations under POMDP. In addition, more principled approaches [19] may require extending\nRPN to explicitly reason about uncertainty of the state. We consider extending RPN to POMDP as an\nimportant future direction.\nGeneralize to new objects and/or predicates. We have shown preliminary results in Sec. 5 that\nRPN is able to generalize to new visual con\ufb01gurations of known predicates: An RPN model trained\non tasks I=3, D=2 has never seen a dish with more than two cooked ingredients, but it is able to\ngeneralize to tasks with four-ingredients dishes(I=6, D=3). 
Nonetheless, generalizing to arbitrary new objects and predicates remains a major challenge. For example, planning for a goal Cooked(X) regardless of X requires understanding the invariant features of Cooked. While this is possible in the current Kitchen 3D environment, where predicates are rendered as simple changes of texture hue, generalizing to more realistic scenarios would require either more diverse training data or explicitly learning disentangled representations [37] for the predicates.
Learning new rules from a few demonstrations. We have demonstrated that RPN can learn the implicit rules of a domain from thousands of video demonstrations. However, in a more realistic setting, we would like an agent that can learn new planning rules by observing only a few demonstrations, similar to [38]. A possible approach is to combine RPN with a meta-learning framework such as [39] to quickly adapt to new domains.
Generalized planning space. In future work, we plan to extend the regression planning mechanism to more complex but structured planning spaces such as geometric planning [19] (e.g., including the object pose as part of a goal), enabling a more fine-grained interface with the low-level controller. Another direction is to plan in a learned latent space instead of an explicit symbolic space. For example, we can learn compact representations of the entity features with existing representation learning methods [37] and train RPN to perform regression planning in the latent space.

7 Acknowledgements

This work has been partially supported by JD.com American Technologies Corporation (“JD”) under the SAIL-JD AI Research Initiative. This article solely reflects the opinions and conclusions of its authors and not JD or any entity associated with JD.com.

References

[1] Richard E. Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving.
Artificial Intelligence, 2(3-4):189–208, 1971.
[2] Drew McDermott, Malik Ghallab, Adele Howe, Craig Knoblock, Ashwin Ram, Manuela Veloso, Daniel Weld, and David Wilkins. PDDL: The Planning Domain Definition Language. 1998.
[3] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in ATARI games. In NIPS, pages 2863–2871, 2015.
[4] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, pages 2786–2793. IEEE, 2017.
[5] Pulkit Agrawal, Ashvin V. Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.
[6] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.
[7] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
[8] Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart J. Russell, and Pieter Abbeel. Learning plannable representations with causal InfoGAN. In Advances in Neural Information Processing Systems, pages 8733–8744, 2018.
[9] Richard Waldinger. Achieving several goals simultaneously. Stanford Research Institute, Menlo Park, CA, 1975.
[10] Richard E. Korf. Planning as search: A quantitative approach. Artificial Intelligence, 33(1):65–88, 1987.
[11] Tomas Lozano-Perez, Matthew T. Mason, and Russell H. Taylor. Automatic synthesis of fine-motion strategies for robots.
The International Journal of Robotics Research, 3(1):3–24, 1984.
[12] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Hierarchical task and motion planning in the now. In ICRA, 2011.
[13] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Pre-image backchaining in belief space for mobile manipulation. In Robotics Research, pages 383–400. Springer, 2017.
[14] Dane Corneil, Wulfram Gerstner, and Johanni Brea. Efficient model-based deep reinforcement learning with variational state tabulation. In International Conference on Machine Learning, pages 1057–1066, 2018.
[15] Masataro Asai and Alex Fukunaga. Classical planning in deep latent space: Bridging the subsymbolic-symbolic boundary. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[16] Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pages 4739–4748, 2018.
[17] Daniel S. Weld. An introduction to least commitment planning. AI Magazine, 15(4):27–61, 1994.
[18] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
[19] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in belief space. The International Journal of Robotics Research, 32(9-10):1194–1227, 2013.
[20] Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pages 2096–2104, 2014.
[21] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts.
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
[22] Scott Reed and Nando de Freitas. Neural programmer-interpreters. In ICLR, 2016.
[23] Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611, 2017.
[24] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 166–175. JMLR.org, 2017.
[25] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2661–2670. JMLR.org, 2017.
[26] Sungryull Sohn, Junhyuk Oh, and Honglak Lee. Multitask reinforcement learning for zero-shot generalization with subtask dependencies. arXiv preprint arXiv:1807.07665, 2018.
[27] Dequan Wang, Coline Devin, Qi-Zhi Cai, Fisher Yu, and Trevor Darrell. Deep object centric policies for autonomous driving. arXiv preprint arXiv:1811.05432, 2018.
[28] Michael Janner, Sergey Levine, William T. Freeman, Joshua B. Tenenbaum, Chelsea Finn, and Jiajun Wu. Reasoning about physical interactions with object-oriented prediction and planning. arXiv preprint arXiv:1812.10972, 2018.
[29] Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems, pages 153–164, 2017.
[30] Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017.
[31] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[32] Eralp Abdurrahim Akkoyunlu. The enumeration of maximal cliques of large graphs. SIAM Journal on Computing, 2(1):1–6, 1973.
[33] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.
[34] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation, games, robotics and machine learning. http://pybullet.org/, 2016–2017.
[35] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. Zero-shot visual imitation. In ICLR, 2018.
[36] Steven M. LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998.
[37] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.
[38] Rohan Chitnis, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning quickly to plan quickly using modular meta-learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 7865–7871. IEEE, 2019.
[39] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.
org, 2017.