{"title": "Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion", "book": "Advances in Neural Information Processing Systems", "page_first": 769, "page_last": 776, "abstract": null, "full_text": "Hierarchical Apprenticeship Learning, with\n\nApplication to Quadruped Locomotion\n\nJ. Zico Kolter, Pieter Abbeel, Andrew Y. Ng\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\n{kolter, pabbeel, ang}@cs.stanford.edu\n\nAbstract\n\nWe consider apprenticeship learning\u2014learning from expert demonstrations\u2014in\nthe setting of large, complex domains. Past work in apprenticeship learning\nrequires that the expert demonstrate complete trajectories through the domain.\nHowever, in many problems even an expert has dif\ufb01culty controlling the system,\nwhich makes this approach infeasible. For example, consider the task of teach-\ning a quadruped robot to navigate over extreme terrain; demonstrating an optimal\npolicy (i.e., an optimal set of foot locations over the entire terrain) is a highly\nnon-trivial task, even for an expert. In this paper we propose a method for hier-\narchical apprenticeship learning, which allows the algorithm to accept isolated\nadvice at different hierarchical levels of the control task. This type of advice is\noften feasible for experts to give, even if the expert is unable to demonstrate com-\nplete trajectories. This allows us to extend the apprenticeship learning paradigm\nto much larger, more challenging domains. In particular, in this paper we apply\nthe hierarchical apprenticeship learning algorithm to the task of quadruped loco-\nmotion over extreme terrain, and achieve, to the best of our knowledge, results\nsuperior to any previously published work.\n\n1 Introduction\nIn this paper we consider apprenticeship learning in the setting of large, complex domains. 
While\nmost reinforcement learning algorithms operate under the Markov decision process (MDP) formal-\nism (where the reward function is typically assumed to be given a priori), past work [1, 13, 11]\nhas noted that often the reward function itself is dif\ufb01cult to specify by hand, since it must quantify\nthe trade off between many features. Apprenticeship learning is based on the insight that often it\nis easier for an \u201cexpert\u201d to demonstrate the desired behavior than it is to specify a reward function\nthat induces this behavior. However, when attempting to apply apprenticeship learning to large do-\nmains, several challenges arise. First, past algorithms for apprenticeship learning require the expert\nto demonstrate complete trajectories in the domain, and we are speci\ufb01cally concerned with domains\nthat are suf\ufb01ciently complex so that even this task is not feasible. Second, these past algorithms\nrequire the ability to solve the \u201ceasier\u201d problem of \ufb01nding a nearly optimal policy given some can-\ndidate reward function, and even this is challenging in large domains. Indeed, such domains often\nnecessitate hierarchical control in order to reduce the complexity of the control task [2, 4, 15, 12].\nAs a motivating application, consider the task of navigating a quadruped robot (shown in Figure\n1(a)) over challenging, irregular terrain (shown in Figure 1(b,c)). 
In a naive approach, the dimen-\nsionality of the state space is prohibitively large: the robot has 12 independently actuated joints, and\nthe state must also specify the current three-dimensional position and orientation of the robot, lead-\ning to an 18-dimensional state space that is well beyond the capabilities of standard RL algorithms.\nFortunately, this control task succumbs very naturally to a hierarchical decomposition: we \ufb01rst plan\na general path over the terrain, then plan footsteps along this path, and \ufb01nally plan joint movements\n\n1\n\n\fFigure 1: (a) LittleDog robot, designed and built by Boston Dynamics, Inc. (b) Typical terrain. (c) Height\nmap of the depicted terrain. (Black = 0cm altitude, white = 12cm altitude.)\n\nto achieve these footsteps. However, it is very challenging to specify a proper reward, speci\ufb01cally\nfor the higher levels of control, as this requires quantifying the trade-off between many features,\nincluding progress toward a goal, the height differential between feet, the slope of the terrain under-\nneath its feet, etc. Moreover, consider the apprenticeship learning task of specifying a complete set\nof foot locations, across an entire terrain, that properly captures all the trade-offs above; this itself is\na highly non-trivial task.\nMotivated by these dif\ufb01culties, we present a uni\ufb01ed method for hierarchical apprenticeship learn-\ning. Our approach is based on the insight that, while it may be dif\ufb01cult for an expert to specify\nentire optimal trajectories in a large domain, it is much easier to \u201cteach hierarchically\u201d: that is, if we\nemploy a hierarchical control scheme to solve our problem, it is much easier for the expert to give\nadvice independently at each level of this hierarchy. At the lower levels of the control hierarchy,\nour method only requires that the expert be able to demonstrate good local behavior, rather than\nbehavior that is optimal for the entire task. 
This type of advice is often feasible for the expert to give even when the expert is entirely unable to give full trajectory demonstrations. Thus the approach allows for apprenticeship learning in extremely complex, previously intractable domains.\n\nThe contributions of this paper are twofold. First, we introduce the hierarchical apprenticeship learning algorithm. This algorithm extends the apprenticeship learning paradigm to complex, high-dimensional control tasks by allowing an expert to demonstrate desired behavior at multiple levels of abstraction. Second, we apply the hierarchical apprenticeship approach to the quadruped locomotion problem discussed above. By applying this method, we achieve performance that is, to the best of our knowledge, well beyond any published results for quadruped locomotion.1\n\nThe remainder of this paper is organized as follows. In Section 2 we discuss preliminaries and notation. In Section 3 we present the general formulation of the hierarchical apprenticeship learning algorithm. In Section 4 we present experimental results, both on a hierarchical multi-room grid world and on the real-world quadruped locomotion task. Finally, in Section 5 we discuss related work and conclude the paper.\n\n2 Preliminaries and Notation\n\nA Markov decision process (MDP) is a tuple (S, A, T, H, D, R), where S is a set of states; A is a set of actions; T = {P_{sa}} is a set of state transition probabilities (here, P_{sa} is the state transition distribution upon taking action a in state s); H is the horizon, which corresponds to the number of time-steps considered; D is a distribution over initial states; and R : S \u2192 R is a reward function. As we are often concerned with MDPs for which no reward function is given, we use the notation MDP\\R to denote an MDP minus the reward function. A policy \u03c0 is a mapping from states to a probability distribution over actions. The value of a policy \u03c0 is given by\n\nV(\u03c0) = E[\\sum_{t=0}^{H} R(s_t) | \u03c0],\n\nwhere the expectation is taken with respect to the random state sequence s_0, s_1, . . . , s_H drawn by starting from the state s_0 (drawn from distribution D) and picking actions according to \u03c0.\n\n1There are several other institutions working with the LittleDog robot, and many have developed (unpublished) systems that are also very capable. As of the date of submission, we believe that the controller presented in this paper is on par with the very best controllers developed at other institutions. For instance, although direct comparison is difficult, the fastest running time that any team achieved during public evaluations was 39 seconds. In Section 4 we present results crossing terrain of comparable difficulty and distance in 30-35 seconds.\n\nOften the reward function R can be represented more compactly as a function of the state. Let \u03c6 : S \u2192 R^n be a mapping from states to a set of features. We consider the case where the reward function R is a linear combination of the features: R(s) = w^T \u03c6(s) for parameters w \u2208 R^n. Then we have that the value of a policy \u03c0 is linear in the reward function weights:\n\nV(\u03c0) = E[\\sum_{t=0}^{H} R(s_t) | \u03c0] = E[\\sum_{t=0}^{H} w^T \u03c6(s_t) | \u03c0] = w^T E[\\sum_{t=0}^{H} \u03c6(s_t) | \u03c0], (1)\n\nwhere we used linearity of expectation to bring w outside of the expectation. The last quantity defines the vector of feature expectations \u00b5_\u03c6(\u03c0) = E[\\sum_{t=0}^{H} \u03c6(s_t) | \u03c0], so that V(\u03c0) = w^T \u00b5_\u03c6(\u03c0).\n\n3 The Hierarchical Apprenticeship Learning Algorithm\nWe now present our hierarchical apprenticeship learning algorithm (hereafter HAL). For simplicity, we present a two-level hierarchical formulation of the control task, referred to generically as the low-level and high-level controllers. The extension to higher-order hierarchies poses no difficulties.\n\n3.1 Reward Decomposition in HAL\nAt the heart of the HAL algorithm is a simple decomposition of the reward function that links the two levels of control. Suppose that we are given a hierarchical decomposition of a control task in the form of two MDP\\Rs \u2014 a low-level and a high-level MDP\\R, denoted M_\u2113 = (S_\u2113, A_\u2113, T_\u2113, H_\u2113, D_\u2113) and M_h = (S_h, A_h, T_h, H_h, D_h) respectively \u2014 and a partitioning function \u03c8 : S_\u2113 \u2192 S_h that maps low-level states to high-level states (the assumption here is that |S_h| \u226a |S_\u2113|, so that this hierarchical decomposition actually provides a computational gain).2 For example, in the case of the quadruped locomotion problem the low-level MDP\\R describes the state of all four feet, while the high-level MDP\\R describes only the position of the robot\u2019s center of mass. As is standard in apprenticeship learning, we suppose that the rewards in the low-level MDP\\R can be represented as a linear function of state features, R(s_\u2113) = w^T \u03c6(s_\u2113). The HAL algorithm assumes that the reward of a high-level state is equal to the average reward over all its corresponding low-level states. 
Formally,\n\nR(s_h) = (1/N(s_h)) \\sum_{s_\u2113 \u2208 \u03c8^{-1}(s_h)} R(s_\u2113) = (1/N(s_h)) \\sum_{s_\u2113 \u2208 \u03c8^{-1}(s_h)} w^T \u03c6(s_\u2113) = w^T (1/N(s_h)) \\sum_{s_\u2113 \u2208 \u03c8^{-1}(s_h)} \u03c6(s_\u2113), (2)\n\nwhere \u03c8^{-1}(s_h) denotes the inverse image of the partitioning function and N(s_h) = |\u03c8^{-1}(s_h)|. While this may not always be the ideal decomposition of the reward function\u2014for example, we may want to let the reward of a high-level state be the maximum of its low-level state rewards, to capture the fact that an ideal agent would always seek to maximize reward at the lower level, or alternatively the minimum of its low-level state rewards, to be robust to worst-case outcomes\u2014it captures the idea that, in the absence of other prior information, it seems reasonable to assume a uniform distribution over the low-level states corresponding to a high-level state. An important consequence of (2) is that the high-level reward is also linear in the low-level reward weights w. This will enable us in the subsequent sections to formulate a unified hierarchical apprenticeship learning algorithm that is able to incorporate expert advice at both the high level and the low level simultaneously.\n\n3.2 Expert Advice at the High Level\nAs in past apprenticeship learning methods, expert advice at the high level consists of full policies demonstrated by the expert. However, because the high-level MDP\\R can be significantly simpler than the low-level MDP\\R, this task can be substantially easier. 
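The averaging decomposition in (2), and the fact that the resulting high-level reward stays linear in w, can be illustrated with a small sketch (all vectors and numbers below are hypothetical, purely for illustration):\n\n```python
# Toy sketch of Eq. (2): the reward of a high-level state is the average
# reward of its low-level states, and it equals w^T applied to the
# averaged low-level features. Data here are illustrative only.

def dot(w, phi):
    return sum(wi * pi for wi, pi in zip(w, phi))

def high_level_reward(w, low_level_features):
    """R(s_h) = (1/N) * sum over s_l in psi^-1(s_h) of w^T phi(s_l)."""
    n = len(low_level_features)
    return sum(dot(w, phi) for phi in low_level_features) / n

def high_level_features(low_level_features):
    """mu(s_h) = (1/N) * sum of phi(s_l), so R(s_h) = w^T mu(s_h)."""
    n = len(low_level_features)
    dim = len(low_level_features[0])
    return [sum(phi[d] for phi in low_level_features) / n
            for d in range(dim)]

w = [1.0, -2.0]
partition = [[0.5, 1.0], [1.5, 3.0]]   # psi^-1(s_h): two low-level states
# Averaging rewards equals w^T of the averaged features (linearity in w).
assert abs(high_level_reward(w, partition) -
           dot(w, high_level_features(partition))) < 1e-9
```\n\nThis linearity is what lets the high-level and low-level constraints in the following sections share a single weight vector w.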
If the expert suggests that \u03c0_{h,E}^{(i)} is an optimal policy for some given MDP\\R M_h^{(i)}, then this corresponds to the following constraint, which states that the expert\u2019s policy outperforms all other policies:\n\nV^{(i)}(\u03c0_{h,E}^{(i)}) \u2265 V^{(i)}(\u03c0_h^{(i)}) \u2200\u03c0_h^{(i)}.\n\nEquivalently, using (1), we can formulate this constraint as follows:\n\nw^T \u00b5_\u03c6^{(i)}(\u03c0_{h,E}^{(i)}) \u2265 w^T \u00b5_\u03c6(\u03c0_h^{(i)}) \u2200\u03c0_h^{(i)}.\n\nWhile we may not be able to obtain the exact feature expectations of the expert\u2019s policy if the high-level transitions are stochastic, observing a single expert demonstration corresponds to receiving a sample from these feature expectations, so we simply use the observed expert feature counts \u02c6\u00b5_\u03c6^{(i)}(\u03c0_{h,E}^{(i)}) in lieu of the true expectations. By standard sample complexity arguments [1], it can be shown that a sufficient number of observed feature counts will converge to the true expectations. To resolve the ambiguity in w, and to allow the expert to provide noisy advice, we use regularization and slack variables (similar to standard SVM formulations), which results in the following formulation:\n\nmin_{w,\u03b7} (1/2)||w||_2^2 + C_h \\sum_{i=1}^{n} \u03b7^{(i)}\ns.t. w^T \u02c6\u00b5_\u03c6^{(i)}(\u03c0_{h,E}^{(i)}) \u2265 w^T \u00b5_\u03c6(\u03c0_h^{(i)}) + 1 \u2212 \u03b7^{(i)} \u2200\u03c0_h^{(i)}, i\n\nwhere \u03c0_h^{(i)} indexes over all high-level policies, i indexes over all MDPs, and C_h is a regularization constant.3 Despite the fact that there are an exponential number of possible policies, there are well-known algorithms that are able to solve this optimization problem; however, we defer this discussion until after presenting our complete formulation.\n\n2As with much work in reinforcement learning, it is the assumption of this paper that the hierarchical decomposition of a control task is given by a system designer. While there has also been recent work on the automated discovery of state abstractions [5], we have found that there is often a very natural decomposition of control tasks into multiple levels (as we will discuss for the specific case of quadruped locomotion).\n\n3.3 Expert Advice at the Low Level\nOur approach differs from standard apprenticeship learning when we consider advice at the low level. Unlike the apprenticeship learning paradigm, where an expert specifies full trajectories in the target domain, we allow for an expert to specify single, greedy actions in the low-level domain. Specifically, if the agent is in state s_\u2113 and the expert suggests that the best greedy action would move to state s'_\u2113, this corresponds directly to a constraint on the reward function, namely that\n\nR(s'_\u2113) \u2265 R(s''_\u2113)\n\nfor all other states s''_\u2113 that can be reached from the current state (we say that s''_\u2113 is \u201creachable\u201d from the current state s_\u2113 if \u2203a s.t. P_{s_\u2113 a}(s''_\u2113) > \u03b5 for some 0 < \u03b5 \u2264 1).4 This results in the following constraints on the reward function parameters w,\n\nw^T \u03c6(s'_\u2113) \u2265 w^T \u03c6(s''_\u2113)\n\nfor all s''_\u2113 reachable from s_\u2113. As before, to resolve the ambiguity in w and to allow for the expert to provide noisy advice, we use regularization and slack variables. This gives:\n\nmin_{w,\u03be} (1/2)||w||_2^2 + C_\u2113 \\sum_{j=1}^{m} \u03be^{(j)}\ns.t. w^T \u03c6(s'^{(j)}_\u2113) \u2265 w^T \u03c6(s''^{(j)}_\u2113) + 1 \u2212 \u03be^{(j)} \u2200s''^{(j)}_\u2113, j\n\nwhere s''^{(j)}_\u2113 indexes over the states reachable from the current state of demonstration j, and j indexes over all low-level demonstrations provided by the expert.\n\n3.4 The Unified HAL Algorithm\nFrom (2) we see that the high-level and low-level rewards are linear combinations of the same set of reward weights w. This allows us to combine both types of expert advice presented above to obtain the following unified optimization problem:\n\nmin_{w,\u03b7,\u03be} (1/2)||w||_2^2 + C_\u2113 \\sum_{j=1}^{m} \u03be^{(j)} + C_h \\sum_{i=1}^{n} \u03b7^{(i)}\ns.t. w^T \u03c6(s'^{(j)}_\u2113) \u2265 w^T \u03c6(s''^{(j)}_\u2113) + 1 \u2212 \u03be^{(j)} \u2200s''^{(j)}_\u2113, j\nw^T \u02c6\u00b5_\u03c6^{(i)}(\u03c0_{h,E}^{(i)}) \u2265 w^T \u00b5_\u03c6(\u03c0_h^{(i)}) + 1 \u2212 \u03b7^{(i)} \u2200\u03c0_h^{(i)}, i. (3)\n\nThis optimization problem is convex, and can be solved efficiently. In particular, even though the optimization problem has an exponentially large number of constraints (one constraint per policy), the optimum can be found efficiently (i.e., in polynomial time) using, for example, the ellipsoid method, since we can efficiently identify a violated constraint.5 However, in practice we found the following constraint generation method more efficient:\n\n3This formulation is not entirely correct by itself, due to the fact that it is impossible to separate a policy from all policies (including itself) by a margin of one, and so the exact solution to this problem will be w = 0. To deal with this, one typically scales the margin or slack by some loss function that quantifies how different two policies are [16, 17], and this is the approach taken by Ratliff et al. [13] in their maximum margin planning algorithm. 
Alternatively, Abbeel & Ng [1] solve the optimization problem without any slack, and notice that as soon as the problem becomes infeasible, the expert\u2019s policy lies in the convex hull of the generated policies. However, in our full formulation, with low-level advice also taken into account, this becomes less of an issue, and so we present the above formulation for simplicity. In all experiments where we use only the high-level constraints, we employ margin scaling as in [13].\n\n4Alternatively, one can interpret low-level advice at the level of actions, and interpret the expert picking action a as the constraint that \\sum_{s'} P_{sa}(s')R(s') \u2265 \\sum_{s'} P_{sa'}(s')R(s') \u2200a' \u2260 a. However, in the domains we consider, where there is a clear set of \u201creachable\u201d states from each state, the formalism above seems more natural.\n\n5Similar techniques are employed by [17] to solve structured prediction problems. Alternatively, Ratliff et al. [13] take a different approach, and move the constraints into the objective by eliminating the slack variables, then employ a subgradient method.\n\nFigure 2: (a) Picture of the multi-room gridworld environment. (b) Performance versus number of training samples for HAL and flat apprenticeship learning. (c) Performance versus number of training MDPs for HAL versus using only low-level or only high-level constraints.\n\n1. 
Begin with no expert path constraints.\n2. Find the current reward weights by solving the current optimization problem.\n3. Solve the reinforcement learning problem at the high level of the hierarchy to find the optimal (high-level) policy for the current reward for each MDP\\R i. If any of these optimal policies violates its corresponding (high-level) constraint, then add this constraint to the current optimization problem and go to Step (2). Otherwise, no constraints are violated and the current reward weights are the solution of the optimization problem.\n\n4 Experimental Results\n4.1 Gridworld\nIn this section we present results on a multi-room gridworld domain with unknown cost. While this is not meant to be a challenging control task, it allows us to compare the performance of HAL to traditional \u201cflat\u201d (non-hierarchical) apprenticeship learning methods, as these algorithms are feasible in such domains. The grid world domain has a very natural hierarchical decomposition: if we average the cost over each room, we can form a \u201chigh-level\u201d approximation of the grid world. Our hierarchical controller first plans in this domain to choose a path over the rooms. Then, for each room along this path, we plan a low-level path to the desired exit.\n\nFigure 2(b) shows the performance versus the number of training examples provided to the algorithm (where one training example equals one action demonstrated by the expert).6 As expected, the flat apprenticeship learning algorithm eventually converges to a superior policy, since it employs full value iteration to find the optimal policy, while HAL uses the (non-optimal) hierarchical controller. However, for small amounts of training data, HAL outperforms the flat method, since it is able to leverage the small amount of data provided by the expert at both levels of the hierarchy. 
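The constraint generation loop (Steps 1-3 above) can be sketched in miniature. In the sketch below, a structured-perceptron-style weight update stands in for re-solving the margin QP of Step 2, and a brute-force search over a small set of candidate feature expectations stands in for the high-level planner of Step 3; all vectors are illustrative, not the paper's actual solver:\n\n```python
# Toy stand-in for the constraint-generation procedure: repeatedly find a
# policy (feature-expectation vector) that violates the margin constraint
# under the current weights, and update the weights until none remains.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def learn_weights(expert_mu, candidate_mus, iters=100):
    w = [0.0] * len(expert_mu)          # Step 1: start with no constraints
    for _ in range(iters):
        # Step 3 stand-in: candidates violating w . mu_E >= w . mu + 1.
        violated = [mu for mu in candidate_mus
                    if dot(w, mu) + 1 > dot(w, expert_mu)]
        if not violated:                 # no violated constraint: done
            return w
        # Step 2 stand-in: perceptron update toward the expert's features.
        mu = violated[0]
        w = [wi + (e - m) for wi, e, m in zip(w, expert_mu, mu)]
    return w

expert = [1.0, 0.0]                      # expert's feature expectations
others = [[0.0, 1.0], [0.5, 0.5]]        # alternative policies' features
w = learn_weights(expert, others)
# Under the learned weights the expert policy is (weakly) optimal:
assert all(dot(w, expert) >= dot(w, mu) for mu in others)
```\n\nThe real algorithm differs in that Step 3 runs value iteration over the high-level MDP\\R and Step 2 re-solves the full QP (3), but the violated-constraint/re-solve alternation is the same.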
Figure 2(c) shows performance versus the number of MDPs in the training set for HAL, as well as for algorithms which receive the same training data as HAL (that is, both high-level and low-level expert demonstrations) but which make use of only one or the other. Here we see that HAL performs substantially better. This is not meant to be a direct comparison of the different methods, since HAL obtains more training data per MDP than the single-level approaches. Rather, this experiment illustrates that in situations where one has access to both high-level and low-level advice, it is advantageous to use\n\n6Experimental details: We consider a 111x111 grid world, evenly divided into 100 rooms of size 10x10 each. There are walls around each room, except for a door of size 2 that connects a room to each of its neighbors (a picture of the domain is shown in Figure 2(a)). Each state has 40 binary features, sampled from a distribution particular to that room, and the reward function is chosen randomly to have 10 \u201csmall\u201d negative rewards in [-0.75, -0.25], 20 \u201cmedium\u201d negative rewards in [-2.0, -1.0], and 10 \u201chigh\u201d negative rewards in [-5.0, -3.0]. In all cases we generated multiple training MDPs, which differ in which features are active at each state, and we provided the algorithm with one expert demonstration for each sampled MDP. After training on each MDP we tested on 25 holdout MDPs generated by the same process. In all cases the results were averaged over 10 runs. For all our experiments, we fixed the ratio C_h/C_\u2113 so that both types of constraints were equally weighted (i.e., if it typically took t low-level actions to accomplish one high-level action, then we used a ratio of C_h/C_\u2113 = t). Given this fixed scaling, we found that the algorithm was generally insensitive (in terms of the resulting policy\u2019s suboptimality) to scaling of the slack penalties. 
In the comparison of HAL with flat apprenticeship learning in Figure 2(b), one training example corresponds to one expert action. Concretely, for HAL the number of training examples for a given training MDP corresponds to the number of high-level actions in the high-level demonstration plus the (equal) number of low-level expert actions provided. For flat apprenticeship learning, the number of training examples for a given training MDP corresponds to the number of expert actions in the expert\u2019s full trajectory demonstration.\n\nFigure 3: (a) High-level (path) expert demonstration. (b) Low-level (footstep) expert demonstration.\n\nboth. This will be especially important in domains such as the quadruped locomotion task, where we have access to very few training MDPs (i.e., different terrains).\n\n4.2 Quadruped Robot\nIn this section we present the primary experimental result of this paper, a successful application of hierarchical apprenticeship learning to the task of quadruped locomotion. Videos of the results in this section are available at http://cs.stanford.edu/~kolter/nips07videos.\n\n4.2.1 Hierarchical Control for Quadruped Locomotion\n\nThe LittleDog robot, shown in Figure 1, is designed and built by Boston Dynamics, Inc. The robot consists of 12 independently actuated servo motors, three on each leg, with two at the hip and one at the knee. It is equipped with an internal IMU and foot force sensors. We estimate the robot\u2019s state using a motion capture system that tracks reflective markers on the robot\u2019s body. We perform all computation on a desktop computer, and send commands to the robot via a wireless connection.\n\nAs mentioned in the introduction, we employ a hierarchical control scheme for navigating the quadruped over the terrain. Due to space constraints, we describe the complete control system only briefly; a much more detailed description can be found in [8]. 
The high-level controller is a body path planner that plans an approximate trajectory for the robot\u2019s center of mass over the terrain; the low-level controller is a footstep planner that, given a path for the robot\u2019s center, plans a set of footsteps that follow this path. The footstep planner uses a reward function that specifies the relative trade-off between several different features of the robot\u2019s state, including (i) several features capturing the roughness and slope of the terrain at several different spatial scales around the robot\u2019s feet, (ii) the distance of the foot location from the robot\u2019s desired center, and (iii) the area and inradius of the support triangle formed by the three stationary feet, among other similar features. Kinematic feasibility is required for all candidate foot locations, and collision of the legs with obstacles is forbidden. To form the high-level cost, we aggregate features from the footstep planner. In particular, for each foot we consider all the footstep features within a 3 cm radius of the foot\u2019s \u201chome\u201d position (the desired position of the foot relative to the center of mass in the absence of all other discriminating features), and aggregate these features to form the features for the body path planner. While this is an approximation, we found that it performed very well in practice, possibly due to its ability to account for stochasticity of the domain. After forming the cost function for both levels, we used value iteration to find the optimal policy for the body path planner, and a five-step lookahead receding horizon search to find a good set of footsteps for the footstep planner.\n\n4.2.2 Hierarchical Apprenticeship Learning for Quadruped Locomotion\n\nAll experiments were carried out on two terrains: a relatively easy terrain for training, and a significantly more challenging terrain for testing. 
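The feature aggregation described in Section 4.2.1 can be sketched as follows; the cell representation, averaging, and all coordinates below are illustrative assumptions, not the actual planner code:\n\n```python
# Hypothetical sketch: average the footstep features of terrain cells
# within a fixed radius of a foot's "home" position to form one feature
# vector for the body path planner. Numbers are illustrative only.

def aggregate_features(cells, home, radius):
    """cells: list of ((x, y), feature_vector) for candidate foot cells."""
    near = [f for (x, y), f in cells
            if (x - home[0]) ** 2 + (y - home[1]) ** 2 <= radius ** 2]
    n, dim = len(near), len(near[0])
    # Component-wise average of the nearby cells' feature vectors.
    return [sum(f[d] for f in near) / n for d in range(dim)]

cells = [((0.00, 0.0), [1.0, 2.0]),
         ((0.02, 0.0), [3.0, 4.0]),
         ((0.10, 0.0), [9.0, 9.0])]   # this cell lies outside a 3 cm radius
assert aggregate_features(cells, (0.0, 0.0), 0.03) == [2.0, 3.0]
```\n\nBecause the aggregated vector is still a plain feature vector, the body planner's cost remains w^T times these features, with the same weights w learned from both levels of advice.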
To give advice at the high level, we specified complete body trajectories for the robot\u2019s center of mass, as shown in Figure 3(a). To give advice for the low level, we looked for situations in which the robot stepped in a suboptimal location, and then indicated the correct greedy foot placement, as shown in Figure 3(b). The entire training set consisted of a single high-level path demonstration across the training terrain, and 20 low-level footstep demonstrations on this terrain; it took about 10 minutes to collect the data.\n\nFigure 4: Snapshots of quadruped while traversing the testing terrain.\n\nFigure 5: Body and footstep plans for different constraints on the training (left) and testing (right) terrains: (Red) No Learning, (Green) HAL, (Blue) Path Only, (Yellow) Footstep Only.\n\nEven from this small amount of training data, the learned system achieved excellent performance, not only on the training board, but also on the much more difficult testing board. Figure 4 shows snapshots of the quadruped crossing the testing board. Figure 5 shows the resulting footsteps taken for each of the different types of constraints; there is a very large qualitative difference between the footsteps chosen before and after training. Table 1 shows the crossing times for each of the different types of constraints. As shown, the HAL algorithm outperforms all the intermediate methods. Using only footstep constraints does quite well on the training board, but on the testing board the lack of high-level training leads the robot to take a very roundabout route, and it performs much worse. The quadruped fails at crossing the testing terrain when learning from the path-level demonstration only or when not learning at all.\n\nFinally, prior to undertaking our work on hierarchical apprenticeship learning, we invested several weeks attempting to hand-tune a controller capable of picking good footsteps across challenging terrain. 
However, none of our previous efforts could significantly outperform the controller presented here, learned from about 10 minutes worth of data, and many of our previous efforts performed substantially worse.\n\n                     HAL     Feet Only  Path Only  No Learning\nTraining Time (sec)  31.03   33.46      40.25      \u2014\nTesting Time (sec)   35.25   45.70      \u2014          \u2014\n\nTable 1: Execution times for different constraints on training and testing terrains. Dashes indicate that the robot fell over and did not reach the goal.\n\n5 Related Work and Discussion\nThe work presented in this paper relates to many areas of reinforcement learning, including apprenticeship learning and hierarchical reinforcement learning, and to a large body of past work in quadruped locomotion. In the introduction and in the formulation of our algorithm we discussed the connection to the inverse reinforcement learning algorithm of [1] and the maximum margin planning algorithm of [13]. In addition, there has been subsequent work [14] that extends the maximum margin planning framework to allow for the automated addition of new features through a boosting procedure. There has also been much recent work on hierarchical reinforcement learning; a recent survey is [2]. However, all the work in this area that we are aware of deals with the more standard reinforcement learning formulation, where known rewards are given to the agent as it acts in a (possibly unknown) environment. In contrast, our work follows the apprenticeship learning paradigm, where the model, but not the reward function, is known to the agent. Prior work on legged locomotion has mostly focused on generating gaits for stably traversing fairly flat terrain (see, among many others, [10], [7]). Only very few learning algorithms that attempt to generalize to previously unseen terrains have been successfully applied before [6, 3, 9]. 
The terrains considered in this paper go well beyond the difficulty level considered in prior work.\n\n6 Acknowledgements\nWe gratefully acknowledge the anonymous reviewers for helpful suggestions. This work was supported by the DARPA Learning Locomotion program under contract number FA8650-05-C-7261.\n\nReferences\n\n[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.\n\n[2] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems: Theory and Applications, 13:41\u201377, 2003.\n\n[3] Joel Chestnutt, James Kuffner, Koichi Nishiwaki, and Satoshi Kagami. Planning biped navigation strategies in complex environments. In Proceedings of the International Conference on Humanoid Robotics, 2003.\n\n[4] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227\u2013303, 2000.\n\n[5] Nicholas K. Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In Proceedings of the International Joint Conference on Artificial Intelligence, 2005.\n\n[6] H. Kim, T. Kang, V. G. Loc, and H. R. Choi. Gait planning of quadruped walking and climbing robot for locomotion in 3D environment. In Proceedings of the International Conference on Robotics and Automation, 2005.\n\n[7] Nate Kohl and Peter Stone. Machine learning for fast quadrupedal locomotion. In Proceedings of AAAI, 2004.\n\n[8] J. Zico Kolter, Mike P. Rodgers, and Andrew Y. Ng. A complete control architecture for quadruped locomotion over rough terrain. In Proceedings of the International Conference on Robotics and Automation (to appear), 2008.\n\n[9] Honglak Lee, Yirong Shen, Chih-Han Yu, Gurjeet Singh, and Andrew Y. Ng. 
Quadruped robot obstacle negotiation via reinforcement learning. In Proceedings of the International Conference on Robotics and Automation, 2006.\n\n[10] Jun Morimoto and Christopher G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Neural Information Processing Systems 15, 2002.\n\n[11] Gergely Neu and Csaba Szepesv\u00e1ri. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of Uncertainty in Artificial Intelligence, 2007.\n\n[12] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In Neural Information Processing Systems 10, 1998.\n\n[13] Nathan Ratliff, J. Andrew Bagnell, and Martin Zinkevich. Maximum margin planning. In Proceedings of the International Conference on Machine Learning, 2006.\n\n[14] Nathan Ratliff, David Bradley, J. Andrew Bagnell, and Joel Chestnutt. Boosting structured prediction for imitation learning. In Neural Information Processing Systems 19, 2007.\n\n[15] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181\u2013211, 1999.\n\n[16] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the International Conference on Machine Learning, 2005.\n\n[17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453\u20131484, 2005.\n", "award": [], "sourceid": 985, "authors": [{"given_name": "J.", "family_name": "Kolter", "institution": null}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}