{"title": "Planning with Goal-Conditioned Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 14843, "page_last": 14854, "abstract": "Planning methods can solve temporally extended sequential decision making problems by composing simple behaviors. However, planning requires suitable abstractions for the states and transitions, which typically need to be designed by hand. In contrast, reinforcement learning (RL) can acquire behaviors from low-level inputs directly, but struggles with temporally extended tasks. Can we utilize reinforcement learning to automatically form the abstractions needed for planning, thus obtaining the best of both approaches? We show that goal-conditioned policies learned with RL can be incorporated into planning, such that a planner can focus on which states to reach, rather than how those states are reached. However, with complex state observations such as images, not all inputs represent valid states. We therefore also propose using a latent variable model to compactly represent the set of valid states for the planner, such that the policies provide an abstraction of actions, and the latent variable model provides an abstraction of states. We compare our method with planning-based and model-free methods and find that our method significantly outperforms prior work when evaluated on image-based tasks that require non-greedy, multi-staged behavior.", "full_text": "Planning with Goal-Conditioned Policies

Soroush Nasiriany*, Vitchyr H. Pong*, Steven Lin, Sergey Levine

University of California, Berkeley

{snasiriany, vitchyr, stevenlin598, svlevine}@berkeley.edu

Abstract

Planning methods can solve temporally extended sequential decision making problems by composing simple behaviors.
However, planning requires suitable abstractions for the states and transitions, which typically need to be designed by hand. In contrast, model-free reinforcement learning (RL) can acquire behaviors from low-level inputs directly, but often struggles with temporally extended tasks. Can we utilize reinforcement learning to automatically form the abstractions needed for planning, thus obtaining the best of both approaches? We show that goal-conditioned policies learned with RL can be incorporated into planning, so that a planner can focus on which states to reach, rather than how those states are reached. However, with complex state observations such as images, not all inputs represent valid states. We therefore also propose using a latent variable model to compactly represent the set of valid states for the planner, so that the policies provide an abstraction of actions, and the latent variable model provides an abstraction of states. We compare our method with planning-based and model-free methods and find that our method significantly outperforms prior work when evaluated on image-based robot navigation and manipulation tasks that require non-greedy, multi-staged behavior.

1 Introduction

Reinforcement learning can acquire complex skills by learning through direct interaction with the environment, sidestepping the need for accurate modeling and manual engineering. However, complex and temporally extended sequential decision making requires more than just well-honed reactions. Agents that generalize effectively to new situations and new tasks must reason about the consequences of their actions and solve new problems via planning. Accomplishing this entirely with model-free RL often proves challenging, as purely model-free learning does not inherently provide for temporal compositionality of skills.
Planning and trajectory optimization algorithms encode this temporal compositionality by design, but require accurate models with which to plan. When these models are specified manually, planning can be very powerful, but learning such models presents major obstacles: in complex environments with high-dimensional observations such as images, direct prediction of future observations presents a very difficult modeling problem [4, 43, 36, 6, 27, 3, 31], and model errors accumulate over time [39], making their predictions inaccurate in precisely those long-horizon settings where we most need the compositionality of planning methods. Can we obtain the benefits of the temporal compositionality inherent in model-based planning, without the need to model the environment at the lowest level, in terms of both time and state representation?

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One way to avoid modeling the environment in detail is to plan over abstractions: simplified representations of states and transitions on which it is easier to construct predictions and plans. Temporal abstractions allow planning at a coarser time scale, skipping over the high-frequency details and instead planning over higher-level subgoals, while state abstractions allow planning over a simpler representation of the state. Both make modeling and planning easier. In this paper, we study how model-free RL can be used to provide such abstraction for a model-based planner. At first glance, this might seem like a strange proposition, since model-free RL methods learn value functions and policies, not models.
However, this is precisely what makes them ideal for abstracting away the complexity in temporally extended tasks with high-dimensional observations: by avoiding low-level (e.g., pixel-level) prediction, model-free RL can acquire behaviors that manipulate these low-level observations without needing to predict them explicitly. This leaves the planner free to operate at a higher level of abstraction, reasoning about the capabilities of low-level model-free policies.

Building on this idea, we propose a model-free planning framework. For temporal abstraction, we learn low-level goal-conditioned policies, and use their value functions as implicit models, such that the planner plans over the goals to pass to these policies. Goal-conditioned policies are policies that are trained to reach a goal state that is provided as an additional input [24, 55, 53, 48]. While in principle such policies can solve any goal-reaching problem, in practice their effectiveness is constrained to nearby goals: for long-distance goals that require planning, they tend to be substantially less effective, as we illustrate in our experiments. However, when these policies are trained together with a value function, as in actor-critic algorithms, the value function can provide an indication of whether a particular goal is reachable or not. The planner can then plan over intermediate subgoals, using the goal-conditioned value function to evaluate reachability. A major challenge with this setup is the need to actually optimize over these subgoals. In domains with high-dimensional observations such as images, this may require explicitly optimizing over image pixels. This optimization is challenging, as realistic images – and, in general, feasible states – typically form a thin, low-dimensional manifold within the larger space of possible state observation values [34].
To address this, we also build abstractions of the state observation by learning a compact latent variable state representation, which makes it feasible to optimize over goals in domains with high-dimensional observations, such as images, without explicitly optimizing over image pixels. The learned representation allows the planner to determine which subgoals actually represent feasible states, while the learned goal-conditioned value function tells the planner whether these states are reachable.

Our contribution is a method for combining model-free RL for short-horizon goal-reaching with model-based planning over a latent variable representation of subgoals. We evaluate our method on temporally extended tasks that require multistage reasoning and handling image observations. The low-level goal-reaching policies themselves cannot solve these tasks effectively, as they do not plan over subgoals and therefore do not benefit from temporal compositionality. Planning without state representation learning also fails to perform these tasks, as optimizing directly over images results in invalid subgoals. By contrast, our method, which we call Latent Embeddings for Abstracted Planning (LEAP), is able to successfully determine suitable subgoals by searching in the latent representation space, and then reach these subgoals via the model-free policy.

2 Related Work

Goal-conditioned reinforcement learning has been studied in a number of prior works [24, 25, 37, 18, 53, 2, 48, 57, 40, 59]. While goal-conditioned methods excel at training policies to greedily reach goals, they often fail to solve long-horizon problems. Rather than proposing a new goal-conditioned RL method, we propose to use goal-conditioned policies as the abstraction for planning in order to handle tasks with a longer horizon.

Model-based planning in deep reinforcement learning is a well-studied problem in the context of low-dimensional state spaces [50, 32, 39, 7].
When the observations are high-dimensional, such as images, model errors for direct prediction compound quickly, making model-based RL difficult [15, 13, 5, 14, 26]. Rather than planning directly over image observations, we propose to plan at a temporally abstract level by utilizing goal-conditioned policies. A number of papers have studied embedding high-dimensional observations into a low-dimensional latent space for planning [60, 16, 62, 22, 29]. While our method also plans in a latent space, we additionally use a model-free goal-conditioned policy as the abstraction to plan over, allowing our method to plan over temporal abstractions rather than only state abstractions.

Automatically setting subgoals for a low-level goal-reaching policy bears a resemblance to hierarchical RL, where prior methods have used model-free learning on top of goal-conditioned policies [10, 61, 12, 58, 33, 20, 38]. By instead using a planner at the higher level, our method can flexibly plan to solve new tasks and benefit from the compositional structure of planning.

Our method builds on temporal difference models (TDMs) [48], which are finite-horizon, goal-conditioned value functions. In prior work, TDMs were used together with a single-step planner that optimized over a single goal, represented as a low-dimensional ground-truth state (under the assumption that all states are valid) [48]. We also use TDMs as implicit models, but in contrast to prior work, we plan over multiple subgoals and demonstrate that our method can perform temporally extended tasks. More critically, our method also learns abstractions of the state, which makes this planning process much more practical, as it does not require assuming that all state vectors represent feasible states.
Planning with goal-conditioned value functions has also been studied when there are a discrete number of predetermined goals [30] or skills [1], in which case graph-search algorithms can be used to plan. In this paper, we not only provide a concrete instantiation of planning with goal-conditioned value functions, but we also present a new method for scaling this planning approach to images, which lie on a low-dimensional manifold.

Lastly, we note that while a number of papers have studied how to combine model-free and model-based methods [54, 41, 23, 56, 44, 51, 39], our method is substantially different from these approaches: we study how to use model-free policies as the abstraction for planning, rather than using models [54, 41, 23, 39] or planning-inspired architectures [56, 44, 51, 21] to accelerate model-free learning.

3 Background

We consider a finite-horizon, goal-conditioned Markov decision process (MDP) defined by a tuple (S, G, A, p, R, Tmax, ρ0, ρg), where S is the set of states, G is the set of goals, A is the set of actions, p(st+1 | st, at) is the time-invariant (unknown) dynamics function, R is the reward function, Tmax is the maximum horizon, ρ0 is the initial state distribution, and ρg is the goal distribution. The objective in goal-conditioned RL is to obtain a policy π(at | st, g, t) that maximizes the expected sum of rewards E[∑_{t=0}^{Tmax} R(st, g, t)], where the goal is sampled from ρg and the states are sampled according to s0 ∼ ρ0, at ∼ π(at | st, g, t), and st+1 ∼ p(st+1 | st, at).
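To make this objective concrete, the finite-horizon goal-conditioned return can be written as a short sketch; the reward function and the rollout of states here are placeholder stand-ins, not part of the paper's implementation:

```python
def episode_return(reward_fn, states, g, t_max):
    """Sum of rewards R(s_t, g, t) for t = 0..Tmax along a rollout,
    i.e. the quantity the goal-conditioned policy maximizes in expectation."""
    return sum(reward_fn(states[t], g, t) for t in range(t_max + 1))
```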
We consider the case where goals reside in the same space as states, i.e., G = S.

An important quantity in goal-conditioned MDPs is the goal-conditioned value function V^π, which predicts the expected sum of future rewards, given the current state s, goal g, and time t:

V^π(s, g, t) = E[ ∑_{t'=t}^{Tmax} R(st', g, t') | st = s, π is conditioned on g ].

To keep the notation uncluttered, we will omit the dependence of V on π. While various time-varying reward functions can be used, temporal difference models (TDMs) [48] use the following form:

R_TDM(s, g, t) = −δ(t = Tmax) d(s, g),   (1)

where δ is the indicator function, and the distance function d is defined by the task. This particular choice of reward function gives a TDM the following interpretation: given a state s, how close will the goal-conditioned policy π get to g after t time steps of attempting to reach g? TDMs can thus be used as a measure of reachability by quantifying how close to another state the policy can get in t time steps, thus providing temporal abstraction. However, TDMs will only produce reasonable reachability predictions for valid goals – goals that resemble the kinds of states on which the TDM was trained. This important limitation requires us to also utilize state abstractions, limiting our search to valid states. In the next section, we will discuss how we can use TDMs in a planning framework over high-dimensional state observations such as images.

4 Planning with Goal-Conditioned Policies

We aim to learn a model that can solve arbitrary long-horizon goal-reaching tasks with high-dimensional observation and goal spaces, such as images. A model-free goal-conditioned reinforcement learning algorithm could, in principle, solve such a problem.
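The sparse TDM reward of Equation 1 above admits a one-line sketch; the Euclidean distance used here is a placeholder for the task-defined distance function d:

```python
import numpy as np

def tdm_reward(s, g, t, t_max):
    """Eq. 1: zero at every step except the final one (t == Tmax),
    where the reward is the negative distance between state and goal."""
    d = np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(g, dtype=float))  # placeholder choice of d
    return -d if t == t_max else 0.0
```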
However, as we will show in our experiments, in practice such methods produce overly greedy policies, which can accomplish short-term goals, but struggle with goals that are more temporally extended. We instead combine

Figure 1: Summary of Latent Embeddings for Abstracted Planning (LEAP). (1) The planner is given a goal state. (2) The planner plans intermediate subgoals in a low-dimensional latent space. By planning in this latent space, the subgoals correspond to valid state observations. (3) The goal-conditioned policy then tries to reach the first subgoal. After t1 time steps, the policy replans and repeats steps 2 and 3.

goal-conditioned policies trained to achieve subgoals with a planner that decomposes long-horizon goal-reaching tasks into K shorter-horizon subgoals. Specifically, our planner chooses the K subgoals, g1, . . . , gK, and a goal-reaching policy then attempts to reach the first subgoal g1 in the first t1 time steps, before moving on to the second goal g2, and so forth, as shown in Figure 1. This procedure only requires training a goal-conditioned policy to solve short-horizon tasks. Moreover, by planning appropriate subgoals, the agent can compose previously learned goal-reaching behavior to solve new, temporally extended tasks. The success of this approach will depend heavily on the choice of subgoals. In the sections below, we outline how one can measure the quality of the subgoals. Then, we address issues that arise when optimizing over these subgoals in high-dimensional state spaces such as images. Lastly, we summarize the overall method and provide details on our implementation.

4.1 Planning over Subgoals

Suitable subgoals are ones that are reachable: if the planner can choose subgoals such that each subsequent subgoal is reachable given the previous subgoal, then it can reach any goal by ensuring the last subgoal is the true goal.
If we use a goal-conditioned policy to reach these goals, how can we quantify how reachable these subgoals are?

One natural choice is to use a goal-conditioned value function which, as previously discussed, provides a measure of reachability. In particular, given the current state s, a policy will reach a goal g after t time steps if and only if V(s, g, t) = 0. More generally, given K intermediate subgoals g1:K = g1, . . . , gK and K + 1 time intervals t1, . . . , tK+1 that sum to Tmax, we define the feasibility vector as

→V(s, g1:K, t1:K+1, g) = [ V(s, g1, t1), V(g1, g2, t2), . . . , V(gK−1, gK, tK), V(gK, g, tK+1) ]^T.

The feasibility vector provides a quantitative measure of a plan's feasibility: the first element describes how close the policy will get to the first subgoal, g1, starting from the initial state, s. The second element describes how close the policy will get to the second subgoal, g2, starting from the first subgoal, and so on, until the last term measures the reachability of the true goal, g.

To create a feasible plan, we would like each element of this vector to be zero, and so we minimize the norm of the feasibility vector:

L(g1:K) = ||→V(s, g1:K, t1:K+1, g)||.   (2)

In other words, minimizing Equation 2 searches for subgoals such that the overall path is feasible and terminates at the true goal. In the next section, we turn to optimizing Equation 2 and address issues that arise in high-dimensional state spaces.

4.2 Optimizing over Images

We consider image-based environments, where the set of states S is the set of valid image observations in our domain. In image-based environments, solving the optimization in Equation 2 presents two problems.
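Before turning to those problems, note that the feasibility vector and the objective of Equation 2 translate directly into code. The sketch below leaves the value function abstract; the toy V used to exercise it (reachable iff within t distance units) is illustrative only:

```python
import numpy as np

def feasibility_vector(V, s, subgoals, times, g):
    """Stack V across consecutive waypoints:
    [V(s, g1, t1), V(g1, g2, t2), ..., V(gK, g, tK+1)]."""
    waypoints = [s] + list(subgoals) + [g]
    return np.array([V(waypoints[i], waypoints[i + 1], times[i])
                     for i in range(len(waypoints) - 1)])

def plan_cost(V, s, subgoals, times, g, ord=np.inf):
    """Eq. 2: norm of the feasibility vector; zero iff every segment is feasible."""
    return np.linalg.norm(feasibility_vector(V, s, subgoals, times, g), ord=ord)
```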
First, the optimization variables g1:K are very high-dimensional – even with 64x64 images and just 3 subgoals, there are over 10,000 dimensions. Second, and perhaps more subtle, the optimization iterates must be constrained to the set of valid image observations S for the subgoals to correspond to meaningful states. While a plethora of constrained optimization methods exist, they typically require knowing the set of valid states [42] or being able to project onto that set [46]. In image-based domains, the set of states S is an unknown r-dimensional manifold embedded in a higher-dimensional space R^N, for some N ≫ r [34] – i.e., the set of valid image observations.

Optimizing Equation 2 would be much easier if we could directly optimize over the r dimensions of the underlying representation, since r ≪ N, and crucially, since we would not have to worry about constraining the planner to an unknown manifold. While we may not know the set S a priori, we can learn a latent-variable model with a compact latent space to capture it, and then optimize in the latent space of this model. To this end, we use a variational autoencoder (VAE) [28, 52], which we train with images randomly sampled from our environment. A VAE consists of an encoder qφ(z | s) and a decoder pθ(s | z). The encoder maps high-dimensional states s ∈ S to a distribution over latent variables z in some lower-dimensional space Z, while the decoder reverses this mapping. Moreover, the VAE is trained so that the marginal distribution of Z matches our prior distribution p0, the standard Gaussian. This last property of VAEs is crucial, as it allows us to tractably optimize over the manifold of valid states S. So long as the latent variables have high likelihood under the prior, the corresponding images will remain inside the manifold of valid states, as shown in Figure 2.
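The prior-likelihood check described above amounts to evaluating latents under a standard Gaussian; a minimal sketch (the latent dimensionality here is arbitrary):

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of z under the VAE prior p0 = N(0, I)."""
    z = np.asarray(z, dtype=float)
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)
```

Latents near the origin score higher than latents far from it, which is what keeps decoded subgoals close to the manifold of valid images.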
In fact, Dai and Wipf [9] showed that a VAE with a Gaussian prior can always recover the true manifold, making this choice of latent-variable model particularly appealing.

Figure 2: Optimizing directly over the image manifold (b) is challenging, as it is generally unknown and resides in a high-dimensional space. We optimize over a latent state (a) and use our decoder to generate images. So long as the latent states have high likelihood under the prior (green), they will correspond to realistic images, while latent states with low likelihood (red) will not.

In summary, rather than minimizing Equation 2, which requires optimizing over the high-dimensional, unknown space S, we minimize

LLEAP(z1:K) = ||→V(s, z1:K, t1:K+1, g)||_p − λ ∑_{k=1}^{K} log p(zk),   (3)

where

→V(s, z1:K, t1:K+1, g) = [ V(s, ψ(z1), t1), V(ψ(z1), ψ(z2), t2), . . . , V(ψ(zK−1), ψ(zK), tK), V(ψ(zK), g, tK+1) ]^T and ψ(z) = arg max_{g'} pθ(g' | z).

This procedure optimizes over latent variables zk, which are then mapped onto high-dimensional goal states gk using the maximum likelihood estimate (MLE) of the decoder, arg max_g pθ(g | z). In our case, the MLE can be computed in closed form by taking the mean of the decoder. The term summing over log p(zk) penalizes latent variables that have low likelihood under the prior p, and λ is a hyperparameter that controls the importance of this second term.

While any norm could be used, we use the ℓ∞-norm, which forces each element of the feasibility vector to be near zero. We found that the ℓ∞-norm outperformed the ℓ1-norm, which only forces the sum of absolute values of elements to be near zero.
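A sketch of the latent-space objective in Equation 3, with the decoder ψ and value function V passed in as stand-ins; the N(0, I) log-prior is kept only up to its additive constant, which does not affect the optimization:

```python
import numpy as np

def leap_cost(V, psi, s, zs, times, g, lam=1.0):
    """Eq. 3: l_inf norm of the feasibility vector over decoded subgoals psi(z_k),
    minus lam times the log prior likelihood of each latent plan variable."""
    waypoints = [s] + [psi(z) for z in zs] + [g]
    feas = [V(waypoints[i], waypoints[i + 1], times[i])
            for i in range(len(waypoints) - 1)]
    log_prior = sum(-0.5 * float(np.sum(z ** 2)) for z in zs)  # N(0, I), up to a constant
    return np.linalg.norm(feas, ord=np.inf) - lam * log_prior
```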
(See Subsection A.1 for a comparison of norms.)

4.3 Goal-Conditioned Reinforcement Learning

For our goal-conditioned reinforcement learning algorithm, we use temporal difference models (TDMs) [48]. TDMs learn Q functions rather than V functions, and so we compute V by evaluating Q with the action from the deterministic policy: V(s, g, t) = Q(s, a, g, t)|a=π(s,g,t). To further improve the efficiency of our method, we can also utilize the same VAE that we use to recover the latent space for planning as a state representation for TDMs. While we could train the reinforcement learning agents from scratch, this can be expensive in terms of sample efficiency, as much of the learning will focus on simply learning good convolution filters. We therefore use the pretrained mean-encoder of the VAE as the state encoder for our policy and value function networks, and only train additional fully connected layers with RL on top of these representations. Details of the architecture are provided in Appendix C. We show in Section 5 that our method works without reusing the VAE mean-encoder, and that this parameter reuse primarily helps with increasing the speed of learning.

4.4 Summary of Latent Embeddings for Abstracted Planning

Our overall method is called Latent Embeddings for Abstracted Planning (LEAP) and is summarized in Algorithm 1. We first train a goal-conditioned policy and a variational autoencoder on randomly collected states. Then at test time, given a new goal, we choose subgoals by minimizing Equation 3. Once the plan is chosen, the first goal ψ(z1) is given to the policy. After t1 steps, we repeat this procedure: we produce a plan with K − 1 (rather than K) subgoals, and give the first goal to the policy. In this work, we fix the time intervals to be evenly spaced (i.e., t1 = t2 = · · · =
tK+1 = ⌊Tmax/(K + 1)⌋), but additionally optimizing over the time intervals would be a promising future extension.

Algorithm 1 Latent Embeddings for Abstracted Planning (LEAP)
1: Train VAE encoder qφ and decoder pθ.
2: Train TDM policy π and value function V.
3: Initialize state, goal, and time: s1 ∼ ρ0, goal g ∼ ρg, and t = 1.
4: Assign the last subgoal to the true goal: gK+1 = g.
5: for k in 1, . . . , K + 1 do
6:    Optimize Equation 3 to choose latent subgoals zk, . . . , zK using V and pθ, if k ≤ K.
7:    Decode zk to obtain goal gk = ψ(zk).
8:    for t' in 1, . . . , tk do
9:        Sample next action at using goal-conditioned policy π(· | st, gk, tk − t').
10:       Execute at and obtain next state st+1.
11:       Increment the global timer: t ← t + 1.
12:   end for
13: end for

5 Experiments

Our experiments study the following two questions: (1) How does LEAP compare to model-based methods, which directly predict each time step, and model-free RL, which directly optimizes for the final goal? (2) How do the use of a latent state representation and other design decisions impact the performance of LEAP?

5.1 Vision-based Comparison and Results

We study the first question on two distinct vision-based tasks, each of which requires temporally extended planning and handling high-dimensional image observations.

The first task, 2D Navigation, requires navigating around a U-shaped wall to reach a goal, as shown in Figure 3. The state observation is a top-down image of the environment. We use this task to conduct ablation studies that test how each component of LEAP contributes to final performance. We also use this environment to generate visualizations that help us better understand how our method uses the goal-conditioned value function to evaluate reachability over images.
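For reference, the replanning loop of Algorithm 1 (Section 4.4) can be sketched as below; the environment interface, planner, and decoder are hypothetical stand-ins rather than the paper's implementation:

```python
def run_leap_episode(env, policy, plan_subgoals, decode, K, t_max):
    """Replanning loop of Algorithm 1 (sketch): plan latent subgoals, hand the
    first decoded subgoal to the goal-conditioned policy for its time slice,
    then replan with one fewer subgoal; the final subgoal is the true goal.
    `env`, `policy`, `plan_subgoals`, and `decode` are hypothetical stand-ins."""
    s, g = env.reset()
    segment = t_max // (K + 1)               # evenly spaced time intervals
    for k in range(K + 1):
        if k < K:
            zs = plan_subgoals(s, g, K - k)  # minimize Eq. 3 over remaining latents
            subgoal = decode(zs[0])          # g_k = psi(z_k)
        else:
            subgoal = g                      # last subgoal is the true goal
        for t_left in range(segment, 0, -1):
            a = policy(s, subgoal, t_left)
            s = env.step(a)
    return s
```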
While visually simple, this task is far from trivial for goal-conditioned and planning methods: a greedy goal-reaching policy that moves directly towards the goal will never reach it. The agent must plan a temporally extended path that moves around the walls, sometimes moving away from the goal. We also use this environment to compare our method with prior work on goal-conditioned and model-based RL.

Figure 3: Comparisons on two vision-based domains that evaluate temporally extended control, with illustrations of the tasks. In 2D Navigation (left), the goal is to navigate around a U-shaped wall to reach the goal. In the Push and Reach manipulation task (right), a robot must first push a puck to a target location (blue star), which may require moving the hand away from the goal hand location, and then move the hand to another location (red star). Curves are averaged over multiple seeds and shaded regions represent one standard deviation. Our method, shown in red, outperforms prior methods on both tasks. On the Push and Reach task, prior methods typically get the hand close to the right location, but perform much worse at moving the puck, indicating an overly greedy strategy, while our approach succeeds at both.

To evaluate LEAP on a more complex task, we utilize a robotic manipulation simulation of a Push and Reach task. This task requires controlling a simulated Sawyer robot to both (1) move a puck to a target location and (2) move its end effector to a target location. This task is more visually complex, and requires more temporally extended reasoning. The initial arm and puck locations are randomized, so the agent must decide how to reposition the arm to reach around the object, push the object in the desired direction, and then move the arm to the correct location, as shown in Figure 3.
A common failure case for model-free policies in this setting is to adopt an overly greedy strategy, only moving the arm to the goal while ignoring the puck.

We train all methods on randomly initialized goals and initial states. However, for evaluation, we intentionally select difficult start and goal states to evaluate long-horizon reasoning. For 2D Navigation, we initialize the policy randomly inside the center square and sample a goal from the region directly below the U-shaped wall. This requires initially moving away from the goal to navigate around the wall. For Push and Reach, we evaluate on 5 distinct challenging configurations, each requiring the agent to first plan to move the puck, and then move the arm only once the puck is in its desired location. In one configuration, for example, we initialize the hand and puck on opposite sides of the workspace and set goals so that the hand and puck must switch sides.

We compare our method to both model-free methods and model-based methods that plan over learned models. All of our tasks use Tmax = 100, and LEAP uses CEM to optimize over K = 3 subgoals, each of which is 25 time steps apart. We compare directly with model-free TDMs, which we label TDM-25. Since the task is evaluated on a horizon of length Tmax = 100, we also compare to a model-free TDM policy trained for Tmax = 100, which we label TDM-100. We compare to reinforcement learning with imagined goals (RIG) [40], a state-of-the-art method for solving image-based goal-conditioned tasks. RIG learns a reward function from images rather than using a predetermined reward function. We found that providing RIG with the same distance function as our method improves its performance, so we use this stronger variant of RIG to ensure a fair comparison. In addition, we compare to hindsight experience replay (HER) [2], which uses sparse, indicator rewards.
Lastly, we compare to probabilistic ensembles with trajectory sampling (PETS) [7], a state-of-the-art model-based RL method. To give PETS a favorable comparison, we implemented it on the ground-truth low-dimensional state representation; we label this variant PETS, state.

The results are shown in Figure 3. LEAP significantly outperforms prior work on both tasks, particularly on the harder Push and Reach task. While the TDM used by LEAP (TDM-25) performs poorly by itself, composing it with 3 different subgoals using LEAP results in much better performance. By 400k environment steps, LEAP already achieves a final puck distance of under 10 cm, while the next best method, TDM-100, requires 5 times as many samples. Details on each task are in Appendix B, and algorithm implementation details are given in Appendix C.

We visualize the subgoals chosen by LEAP in Figure 4 by decoding the latent subgoals z1:K into images with the VAE decoder pθ. In Push and Reach, these images correspond to natural subgoals for the task. Figure 4 also shows a visualization of the value function, which is used by the planner to determine reachability. Note that the value function generally recognizes that the wall is impassable, and makes reasonable predictions for different time horizons. Videos of the final policies and generated subgoals, as well as code for our implementation of LEAP, are available on the paper website.3

Figure 4: (Left) Visualization of subgoals reconstructed from the VAE (bottom row), and the actual images seen when reaching those subgoals (top row). Given an initial state s0 and a goal image g, the planner chooses meaningful subgoals: at gt1, it moves towards the puck, at gt2 it begins pushing the puck, and at gt3 it completes the pushing motion before moving to the goal hand position at g. (Middle) The top row shows the image subgoals superimposed on one another. The blue circle is the starting position, the green circle is the target position, and the intermediate circles show the progression of subgoals (bright red is gt1, brown is gt3). The colored circles show the subgoals in the latent space (bottom row) for the two most active VAE latent dimensions, as well as samples from the VAE aggregate posterior [35]. (Right) Heatmap of the value function V(s, g, t), with each column showing a different time horizon t for a fixed state s. Warmer colors show higher value. Each image indicates the value function for all possible goals g. As the time horizon decreases, the value function recognizes that it can only reach nearby goals.

5.2 Planning in Non-Vision-based Environments with Unknown State Spaces

While LEAP was presented in the context of optimizing over images, we also study its utility in non-vision-based domains. Specifically, we compare LEAP to prior works on an Ant Navigation task, shown in Figure 5, where the state space consists of the quadruped robot's joint angles, joint velocities, and center of mass. While this state space is more compact than images, only certain combinations of state values are actually valid, and the obstacle in the environment is unknown to the agent, meaning that a naïve optimization over the state space can easily result in invalid states (e.g., putting the robot inside an obstacle).

This task has a significantly longer horizon of Tmax = 600, and LEAP uses CEM to optimize over K = 11 subgoals, each of which is 50 time steps apart. As in the vision-based comparisons, we compare with model-free TDMs, in the short-horizon setting (TDM-50), which LEAP is built on top of, and the long-horizon setting (TDM-600).
In addition to HER, we compare to a variant of HER that uses the same rewards and relabeling strategy as RIG, which we label HER+. We exclude the PETS baseline, as it was unable to solve long-horizon tasks such as this one. In this section, we add a comparison to hierarchical reinforcement learning with off-policy correction (HIRO) [38], a hierarchical method for state-based goals. We evaluate all baselines on a challenging configuration of the task in which the ant must navigate from one corner of the maze to the other side by going around a long wall. The desired behavior incurs large negative rewards during the trajectory, but results in an optimal final state. We see in Figure 5 that LEAP is the only method that successfully navigates the ant to the goal. HIRO, HER, and HER+ do not attempt to go around the wall at all, as doing so would accumulate a large sum of negative rewards. TDM-50 has a short horizon that results in greedy behavior, while TDM-600 fails to learn due to the temporal sparsity of the reward.

Figure 5: In the Ant Navigation task, the ant must move around the long wall, which incurs large negative rewards during the trajectory but results in an optimal final state. We illustrate the task, with the purple ant showing the starting state and the green ant showing the goal. We use 3 subgoals here for illustration. Our method (shown in red in the plot) is the only method that successfully navigates the ant to the goal.

3https://sites.google.com/view/goal-planning

5.3 Ablation Study

We analyze the importance of planning in the latent space, as opposed to image space, on the navigation task. For comparison, we implement a planner that directly optimizes over image subgoals (i.e., in pixel space).
We also study the importance of reusing the pretrained VAE encoder by replicating the experiments with the RL networks trained from scratch. We see in Figure 6 that a model that does not reuse the VAE encoder does succeed, but takes much longer to do so. More importantly, planning over latent states achieves dramatically better performance than planning over raw images. Figure 6 also shows the intermediate subgoals output by our optimizer when optimizing over images. While these subgoals may have high value according to Equation 2, they clearly do not correspond to valid state observations, indicating that the planner is exploiting the value function by choosing images far outside the manifold of valid states.
We include further ablations in Appendix A, in which we study the sensitivity to λ in Equation 3 (Subsection A.3), the choice of norm (Subsection A.1), and the choice of optimizer (Subsection A.2). The results show that LEAP works well for a wide range of λ, that the ℓ∞-norm performs better, and that CEM consistently outperforms gradient-based optimizers, both in terms of optimizer loss and policy performance.

Figure 6: (Left) Ablative studies on 2D Navigation. We keep all components of LEAP the same but replace optimizing over the latent space with optimizing over the image space (-latent). We separately train the RL methods from scratch rather than reusing the VAE mean encoder (-shared), and also test both ablations together (-latent, -shared). We see that sharing the encoder weights with the RL policy results in faster learning, and that optimizing over the latent space is critical for the success of the method. (Right) Visualization of the subgoals generated when optimizing over the latent space and decoding the image (top) and when optimizing over the images directly (bottom).
The goals generated when planning in image space are not meaningful, which explains\nthe poor performance of \u201c-latent\u201d shown in (Left).\n\n6 Discussion\n\nWe presented Latent Embeddings for Abstracted Planning (LEAP), an approach for solving temporally\nextended tasks with high-dimensional state observations, such as images. The key idea in LEAP\nis to form temporal abstractions by using goal-reaching policies to evaluate reachability, and state\nabstractions by using representation learning to provide a convenient state representation for planning.\nBy planning over states in a learned latent space and using these planned states as subgoals for goal-\nconditioned policies, LEAP can solve tasks that are dif\ufb01cult to solve with conventional model-free\ngoal-reaching policies, while avoiding the challenges of modeling low-level observations associated\nwith fully model-based methods. More generally, the combination of model-free RL with planning is\nan exciting research direction that holds the potential to make RL methods more \ufb02exible, capable,\nand broadly applicable. Our method represents a step in this direction, though many crucial questions\nremain to be answered. Our work largely neglects the question of exploration for goal-conditioned\npolicies, and though this question has been studied in some recent works [17, 45, 59, 49], examining\nhow exploration interacts with planning is an exciting future direction. 
Another exciting direction for future work is to study how lossy state abstractions might further improve the performance of the planner, by explicitly discarding state information that is irrelevant for higher-level planning.

7 Acknowledgments

This work was supported by the Office of Naval Research, the National Science Foundation, Google, NVIDIA, Amazon, and ARL DCIST CRA W911NF-17-2-0181.

References

[1] Arpit Agarwal, Katharina Muelling, and Katerina Fragkiadaki. Model learning for look-ahead exploration in continuous control. AAAI, 2019.

[2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, 2017.

[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.

[4] Byron Boots, Arunkumar Byravan, and Dieter Fox. Learning predictive models of a depth camera & manipulator from raw execution traces. In IEEE International Conference on Robotics and Automation, 2014.

[5] Arunkumar Byravan, Felix Leeb, Franziska Meier, and Dieter Fox. SE3-Pose-Nets: structured deep dynamics models for visuomotor planning and control. In IEEE International Conference on Robotics and Automation.

[6] Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. In International Conference on Learning Representations, 2017.

[7] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models.
In Advances in Neural Information Processing Systems, 2018.

[8] Cédric Colas, Pierre Fournier, Olivier Sigaud, and Pierre-Yves Oudeyer. CURIOUS: intrinsically motivated multi-task, multi-goal reinforcement learning. International Conference on Machine Learning, 2019.

[9] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019.

[10] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, 1993.

[11] Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1), 2005.

[12] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 2000.

[13] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning, 2017.

[14] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.

[15] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In Advances in Neural Information Processing Systems, 2016.

[16] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation, 2016.

[17] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, 2018.

[18] David Foster and Peter Dayan. Structure in the space of value functions.
Machine Learning, 49(2-3), 2002.

[19] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018.

[20] Dibya Ghosh, Abhishek Gupta, and Sergey Levine. Learning actionable representations with goal-conditioned policies. In International Conference on Learning Representations, 2019.

[21] Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, et al. An investigation of model-free planning. In International Conference on Machine Learning, 2019.

[22] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, 2019.

[23] Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2015.

[24] Leslie Pack Kaelbling. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), volume 2, 1993.

[25] Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: preliminary results. In International Conference on Machine Learning, 1993.

[26] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.

[27] Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, 2017.

[28] Diederik P Kingma and Max Welling.
Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[29] Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart J Russell, and Pieter Abbeel. Learning plannable representations with causal InfoGAN. In Advances in Neural Information Processing Systems, 2018.

[30] Terran Lane and Leslie Pack Kaelbling. Toward hierarchical decomposition for planning in uncertain environments. In Proceedings of the 2001 IJCAI workshop on planning under uncertainty and incomplete information, 2001.

[31] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.

[32] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: learning deep latent features for model predictive control. In Robotics: Science and Systems (RSS), 2015.

[33] Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. Learning multi-level hierarchies with hindsight. In International Conference on Learning Representations, 2019.

[34] Haw-Minn Lu, Yeshaiahu Fainman, and Robert Hecht-Nielsen. Image manifolds. In Applications of Artificial Neural Networks in Image Processing III, volume 3307. International Society for Optics and Photonics, 1998.

[35] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In International Conference on Learning Representations, 2016.

[36] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations, 2016.

[37] Andrew W Moore, Leemon Baird, and Leslie P Kaelbling. Multi-value-functions: efficient automatic action hierarchies for multiple goal MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.

[38] Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine.
Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018.

[39] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE International Conference on Robotics and Automation, 2018.

[40] Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, 2018.

[41] Derrick H Nguyen and Bernard Widrow. Neural networks for self-learning control systems. IEEE Control Systems Magazine, 1990.

[42] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[43] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, 2015.

[44] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, 2017.

[45] Fabio Pardo, Vitaly Levdik, and Petar Kormushev. Q-map: a convolutional approach for goal-oriented reinforcement learning. CoRR, abs/1810.02927, 2018.

[46] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3), 2014.

[47] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.

[48] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: model-free deep RL for model-based control.
In International Conference on Learning Representations, 2018.

[49] Vitchyr H. Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine. Skew-Fit: state-covering self-supervised reinforcement learning. CoRR, abs/1903.03698, 2019.

[50] Ali Punjani and Pieter Abbeel. Deep learning helicopter dynamics models. In IEEE International Conference on Robotics and Automation, 2015.

[51] Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.

[52] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[53] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, 2015.

[54] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990. Elsevier, 1990.

[55] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.

[56] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, 2016.

[57] Vivek Veeriah, Junhyuk Oh, and Satinder Singh. Many-goals reinforcement learning.
arXiv preprint arXiv:1806.09605, 2018.

[58] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, 2017.

[59] David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. CoRR, abs/1811.11359, 2018.

[60] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, 2015.

[61] Marco Wiering and Jürgen Schmidhuber. HQ-learning. Adaptive Behavior, 6(2), 1997.

[62] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew J Johnson, and Sergey Levine. SOLAR: deep structured latent representations for model-based reinforcement learning. In International Conference on Machine Learning, 2019.