{"title": "Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories", "book": "Advances in Neural Information Processing Systems", "page_first": 1162, "page_last": 1170, "abstract": "We introduce CST, an algorithm for constructing skill trees from demonstration trajectories in continuous reinforcement learning domains. CST uses a changepoint detection method to segment each trajectory into a skill chain by detecting a change of appropriate abstraction, or that a segment is too complex to model as a single skill. The skill chains from each trajectory are then merged to form a skill tree. We demonstrate that CST constructs an appropriate skill tree that can be further refined through learning in a challenging continuous domain, and that it can be used to segment demonstration trajectories on a mobile manipulator into chains of skills where each skill is assigned an appropriate abstraction.", "full_text": "Constructing Skill Trees for Reinforcement Learning\n\nAgents from Demonstration Trajectories\n\nAutonomous Learning Laboratory\u2020\n\nLaboratory for Perceptual Robotics\u2021\n\nGeorge Konidaris\u2020\n\nScott Kuindersma\u2020\u2021 Andrew Barto\u2020 Roderic Grupen\u2021\n\nComputer Science Department, University of Massachusetts Amherst\n\n{gdk, scottk, barto, grupen}@cs.umass.edu\n\nAbstract\n\nWe introduce CST, an algorithm for constructing skill trees from demonstration\ntrajectories in continuous reinforcement learning domains. CST uses a change-\npoint detection method to segment each trajectory into a skill chain by detecting\na change of appropriate abstraction, or that a segment is too complex to model\nas a single skill. The skill chains from each trajectory are then merged to form a\nskill tree. 
We demonstrate that CST constructs an appropriate skill tree that can be further refined through learning in a challenging continuous domain, and that it can be used to segment demonstration trajectories on a mobile manipulator into chains of skills where each skill is assigned an appropriate abstraction.\n\n1 Introduction\n\nHierarchical reinforcement learning [1] offers an appealing family of approaches to scaling up standard reinforcement learning (RL) [2] methods by enabling the use of both low-level primitive actions and higher-level macro-actions (or skills). A core research goal in hierarchical RL is the development of methods by which an agent can autonomously acquire its own high-level skills.\nRecently, Konidaris and Barto [3] introduced a general method for skill discovery in continuous RL domains called skill chaining. Skill chaining adaptively segments complex policies into skills that can be executed sequentially and that are easier to represent and learn. It can be coupled with abstraction selection [4] to select skill-specific abstractions, which can aid in acquiring policies that are high-dimensional when represented monolithically, but can be broken into subpolicies that can be defined over far fewer variables. Unfortunately, performing skill chaining iteratively is slow: it creates skills sequentially, and requires several episodes to learn a new skill policy followed by several further episodes to learn by trial and error where it can be executed successfully. While this is reasonable for many problems, in domains where experience is expensive (such as robotics) we require a faster method. 
Moreover, with the growing realization that learning policies completely from scratch in such domains is infeasible, we may also need to bootstrap learning through a method that provides a reasonable initial policy such as learning from demonstration [5], sequencing existing controllers [6], using a kinematic planner, or using a feedback controller [7].\nWe introduce CST, a new skill acquisition method that can build skill trees (with appropriate abstractions) from a set of sample solution trajectories obtained from demonstration, a planner, or a controller. CST uses an incremental MAP changepoint detection method [8] to segment each solution trajectory into skills and then merges the resulting skill chains into a skill tree. The time complexity of CST is controlled through the use of a particle filter. We show that CST can construct a skill tree from human demonstration trajectories in Pinball, a challenging dynamic continuous domain, and that the resulting skills can be refined using RL. We further show that it can be used to segment demonstration trajectories from a mobile manipulator into chains of skills, where each skill is assigned an appropriate abstraction.\n\n2 Background\n\n2.1 Hierarchical Reinforcement Learning and the Options Framework\n\nThe options framework [9] adds methods for hierarchical planning and learning using temporally-extended actions to the standard RL framework. Rather than restricting the agent to selecting actions that take a single time step to complete, it models higher-level decision making using options: actions that have their own policies and which may require multiple time steps to complete. 
An option, o, consists of three components: an option policy, π_o, giving the probability of executing each action in each state in which the option is defined; an initiation set indicator function, I_o, which is 1 for states where the option can be executed and 0 elsewhere; and a termination condition, β_o, giving the probability of option execution terminating in states where the option is defined. Options can be added to an agent's action repertoire alongside its primitive actions, and the agent chooses when to execute them in the same way it chooses when to execute primitive actions.\nMethods for creating new options must determine when to create an option, how to define its termination condition (skill discovery), how to define its initiation set, and how to learn its policy. Given an option reward function, policy learning can be viewed as just another RL problem. Creation and termination are typically performed by the identification of option goal states, with an option created to reach a goal state and then terminate. The initiation set is then the set of states from which a goal state can be reached. Although there are many skill discovery methods for discrete domains, very few exist for continuous domains. To the best of our knowledge (see Section 6), skill chaining [3] is the only such method that does not make any assumptions about the domain structure.\n\n2.2 Skill Chaining and Abstraction Selection\n\nSkill chaining mirrors an idea present in other control fields—for example, in robotics a similar idea is known as pre-image backchaining [10, 11], and in control for chaotic systems as adaptive targeting [12]. 
Given a continuous RL problem where the policy is either too difficult to learn directly or too complex to represent monolithically, we construct a skill tree such that we can obtain a trajectory from every start state to a solution state by executing a sequence (or chain) of acquired skills.\nThis is accomplished as follows. The agent starts with an initial list of target events (regions of the state space), T, which in most cases consists simply of the solution regions of the problem. It then performs RL as usual to try to learn a reasonable policy for the problem. When the agent triggers some target event, T_o—which occurs when it moves from a state not contained in any event in T to one contained in T_o—it creates a new option, o, with the goal of reaching T_o. As the agent continues to interact with the environment it learns a policy for o, and adds it to its set of available actions. Initially, o has an initiation set that covers the whole state space. Over time, some executions of o will succeed (the agent reaches T_o), and some will fail. The agent uses these states as training examples and learns I_o, the initiation set of o, using a classifier. When learning has converged, I_o is added to T as a new target event. An agent applying this method along a single trajectory will slowly learn a chain of skills that grows backward from the task goal region towards the start region (as depicted in Figure 1). More generally, multiple trajectories, noise in control, stochasticity in the environment, or simple variance will result in skill trees rather than skill chains because more than one option will be created to reach some target events. Eventually, the entire state space is covered by acquired skills. 
A more detailed description can be found in Konidaris and Barto [3].\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 1: An agent creates options using skill chaining. (a) First, the agent encounters a target event and creates an option to reach it. (b) Entering the initiation set of this first option triggers the creation of a second option whose target is the initiation set of the first option. (c) Finally, after many trajectories the agent has created a chain of options to reach the original target. (d) When multiple options are created to target an initiation set, the chain splits and the agent creates a skill tree.\n\nThe major advantage of skill chaining is that it provides a mechanism for the agent to adaptively represent a complex policy using a collection of simpler policies. We can take this further and allow each individual option policy to use its own state abstraction. In this way, we may be able to represent high-dimensional policies using component policies that are low-dimensional (and therefore feasible to learn). For example, a complex policy like driving to school in the morning, which requires far too many features to be easily represented monolithically, may be broken into component tasks (such as walking to the car, opening the door, inserting the key, etc.) that do not. Abstraction selection [4] is a simple mechanism for achieving this. Given a library of possible abstractions, and a set of sample trajectories (as, for example, obtained when initially learning an option policy), abstraction selection finds the abstraction best able to represent the value function inferred from the sample trajectories. 
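The selection step can be sketched in a few lines. This is a minimal sketch only: we stand in for the Bayesian fit criterion of [4] with the residual error of an ordinary least-squares fit to the sampled returns, and all function names are ours:

```python
def least_squares(X, y):
    # Solve (X^T X) w = X^T y by Gaussian elimination (small systems only).
    m = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))  # partial pivoting
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * m
    for i in reversed(range(m)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, m))) / A[i][i]
    return w

def select_abstraction(states, returns, abstractions):
    """Pick the abstraction (a named subset of state variables) whose linear
    fit best represents the returns sampled along the trajectory."""
    best, best_err = None, float('inf')
    for name, idxs in abstractions.items():
        X = [[s[i] for i in idxs] + [1.0] for s in states]  # affine features
        w = least_squares(X, returns)
        err = sum((sum(xi * wi for xi, wi in zip(x, w)) - R) ** 2
                  for x, R in zip(X, returns))
        if err < best_err:
            best, best_err = name, err
    return best
```

A trajectory whose return depends only on a few state variables will select the abstraction containing them, which is the property the skill tree exploits.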
Abstraction selection can be combined with skill chaining to learn a skill tree where each skill has its own abstraction; in such cases, the initiation set of each skill will be restricted to states where its policy can be well-represented using its abstraction.\n\n2.3 Changepoint Detection\n\nSkill chaining learns a segmented policy by creating a new option when either the most suitable abstraction changes, or the value function (and therefore policy) becomes too complex to represent with a single option. We would like to segment an entire trajectory at once; the question then becomes: how many options exist along it, and where do they begin and end? This can be modeled as a multiple changepoint detection problem [8]. In this setting, we are given observed data and a set of candidate models. The data are segmented such that the data within a segment are generated by a single model. We are to infer the number of changepoints and their positions, and select and fit an appropriate model for each segment. Figure 2 shows a simple example.\n\nFigure 2: Data with multiple segments. The observed data (left) are generated by three different models (solid line, changepoints shown using dashed lines, right) plus noise. The first and third segments are generated by linear models, whereas the second is quadratic.\n\nUnlike the standard regression setting, in RL our data is sequentially but not necessarily spatially segmented, and we would like to perform changepoint detection online—processing transitions as they occur and then discarding them. Fearnhead and Liu [8] introduced online algorithms for both Bayesian and MAP changepoint detection; we use the simpler method that obtains the MAP changepoints and models via an online Viterbi algorithm.\nThe changepoint process is implemented as follows. We observe data tuples (x_t, y_t), for times t ∈ [1, T], and are given a set of models Q with prior p(q ∈ Q). We model the marginal probability of a segment length l with PMF g(l) and CDF G(l) = Σ_{i=1}^{l} g(i). Finally, we assume that we can fit a segment from time j + 1 to t using model q to obtain the probability of the data P(j, t, q) conditioned on q.\nThis results in a Hidden Markov Model where the hidden state at time t is the model q_t and the observed data is y_t given x_t. The hidden state transition probability from time i to time j with model q is given by g(j − i − 1)p(q) (reflecting the probability of a segment of length j − i − 1 and the prior for q). The probability of an observed data segment starting at time i + 1 and continuing through j using q is P(i, j, q)(1 − G(j − i − 1)), reflecting the fit probability and the probability of a segment of at least j − i − 1 steps. Note that a transition between two instances of the same model (but with different parameters) is possible. We can thus use an online Viterbi algorithm to compute P_t(j, q), the probability of the changepoint previous to time t occurring at time j using model q:\n\nP_t(j, q) = (1 − G(t − j − 1)) P(j, t, q) p(q) P_j^MAP,    (1)\n\nand\n\nP_j^MAP = max_{i,q} [ P_j(i, q) g(j − i) / (1 − G(j − i − 1)) ],  ∀ j < t.    (2)\n\nAt time j, the i and q maximizing Equation 2 are the MAP changepoint position and model for the current segment, respectively. We then perform this procedure for time i, repeating until we reach time 1, to obtain the changepoints and models for the entire sequence.\nThus, at each time step t we compute P_t(j, q) for each model q and changepoint time j < t (using P_j^MAP) and store it.¹ This requires O(T) storage and O(TL|Q|) time per timestep, where L is the time required to compute P(j, t, q). 
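The recursion in Equations 1 and 2 can be sketched directly. The following is a minimal offline rendering in log space; the function and variable names are ours, and the segment-fit, model-prior, and length-distribution functions are supplied by the caller:

```python
import math

def map_segmentation(T, models, log_fit, log_prior, log_g, log_1mG):
    """MAP changepoint detection via the Viterbi recursion of Eqs. 1-2.

    log_fit(j, t, q): log P(j, t, q), the fit of model q to data in (j, t].
    log_g(l), log_1mG(l): log PMF and log survival, log(1 - G(l)), of the
    segment-length prior.  Returns (start, end, model) with end exclusive.
    """
    log_map = [0.0] + [-math.inf] * T   # log P_j^MAP, with P_0^MAP = 1
    best = [None] * (T + 1)
    for t in range(1, T + 1):
        for j in range(t):
            for q in models:
                # log P_t(j, q), Eq. 1
                log_P_t = (log_1mG(t - j - 1) + log_fit(j, t, q)
                           + log_prior(q) + log_map[j])
                # candidate for P_t^MAP, Eq. 2
                cand = log_P_t + log_g(t - j) - log_1mG(t - j - 1)
                if cand > log_map[t]:
                    log_map[t], best[t] = cand, (j, q)
    segments, t = [], T
    while t > 0:                        # walk the Viterbi path backwards
        j, q = best[t]
        segments.append((j, t, q))
        t = j
    return segments[::-1]
```

An online implementation maintains the same quantities incrementally as transitions arrive and prunes them with a particle filter, as described next.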
We can reduce L to a constant for most models of interest by storing a small sufficient statistic and updating it incrementally in time independent of t, obtaining P(j, t, q) from P(j, t − 1, q). In addition, since most P_t(j, q) values will be close to zero, we can employ a particle filter to discard most combinations of j and q and retain a constant number per timestep. Each particle then stores j, q, P_j^MAP, sufficient statistics and its Viterbi path. We use the Stratified Optimal Resampling algorithm of Fearnhead and Liu [8] to filter down to M particles whenever the number of particles reaches N. This results in a time complexity of O(NL) and storage complexity of O(Nc), where there are O(c) changepoints in the data.\n\n¹ In practice all equations are computed in log form to ensure numerical stability.\n\n3 Constructing Skill Trees from Sample Trajectories\n\nWe propose using changepoint detection to segment sample trajectories into skills, using return R_t (sum of discounted reward) as the target variable. This provides an intuitive mapping to RL since a value function is simply an estimator of return; segmentation based on return thus provides a natural way to segment the value function implied by a trajectory into simpler value functions, or to detect a change in model (and therefore abstraction). To do so, we must select an appropriate model of expected skill (segment) length, and an appropriate model for fitting the data. We assume a geometric distribution for skill lengths with parameter p, so that g(l) = (1 − p)^{l−1} p and G(l) = 1 − (1 − p)^l. This gives us a natural way to set p since p = 1/k, where k is the expected skill length.\nSince RL in continuous state spaces usually employs linear function approximation, it is natural to use a linear regression model with Gaussian noise as our model of the data. Following Fearnhead and Liu [8], we assume conjugate priors: the Gaussian noise prior has mean zero, and variance with inverse gamma prior with parameters v/2 and u/2. The prior for each weight is a zero-mean Gaussian with variance σ²δ. Integrating the likelihood function over the parameters obtains:\n\nP(j, t, q) = π^{−n/2} δ^{−m/2} |(A + D)^{−1}|^{1/2} [ u^{v/2} Γ((n + v)/2) ] / [ (y + u)^{(n+v)/2} Γ(v/2) ],    (3)\n\nwhere n = t − j, q has m basis functions, Γ is the Gamma function, D is an m by m matrix with δ^{−1} on the diagonal and zeros elsewhere, and:\n\ny = ( Σ_{i=j}^{t} R_i² ) − b^T (A + D)^{−1} b,    (4)\n\nwhere A = Σ_{i=j}^{t} Φ(x_i)Φ(x_i)^T, b = Σ_{i=j}^{t} R_i Φ(x_i), Φ(x_i) is a vector of m basis functions evaluated at state x_i, and R_i = Σ_{j=i}^{T} γ^{j−i} r_j is the return obtained from state i.\nNote that we are using each R_t as the target regression variable in this formulation, even though we only observe r_t for each state. However, to compute Equation 3 we need only retain sufficient statistics A, b and (Σ_{i=j}^{t} R_i²). Each can be updated incrementally using r_t (the latter two using traces). Thus, the sufficient statistics required to obtain the fit probability can be computed incrementally and online at each timestep, without requiring any transition data to be stored. Furthermore, A and b are the same matrices used for performing a least-squares fit to the data using R_t as the regression target. They can thus be used to produce a value function fit (equivalent to a least-squares Monte Carlo estimate) for the skill segment if so desired; again, without the need to store the trajectory.\nUsing this model we can segment a single trajectory into a skill chain; given multiple skill chains from different trajectories, we would like to merge them into a skill tree. We merge two trajectory segments by assigning them to the same skill (rather than two distinct skills). Since we wish to build skills that can be sequentially executed, we can only consider merging two segments when they have the same target—which means that the segments immediately following each of them have been merged. Since we assume that all trajectories have the same final goal, we merge two chains by starting at their final skill segments. For each pair of segments, we determine whether or not they are a good statistical match, and if so merge them, repeating this process until we fail to merge a pair of skill segments, after which the remaining skill chains branch off on their own. Since P(j, t, q) as defined in Equation 3 is the integration over the likelihood function of our model given segment data, we can reuse it as a measure of whether a pair of trajectories are better modeled as one skill (where we simply sum their sufficient statistics), or as two separate skills (forming new sufficient statistics using two groups of basis functions, each of which is zero over the other's data segments). Before merging, we perform a fast test to ensure that the trajectory pairs actually overlap in state space—if they do not, we will often be able to represent them both simultaneously with low error and hence our metric may incorrectly suggest a merge.\nSegmenting a sample trajectory should be performed using a lower-order function approximator than that used for policy learning, since we see merely a single trajectory sample rather than a dense sample over the state space. However, merging should be performed using the same function approximator used for learning. 
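The incremental updates for A, b, and Σ R_i² via discounted traces might look as follows. This is a sketch under stated assumptions: the class and attribute names are ours, and a full implementation would keep one such set of statistics per candidate changepoint and model:

```python
class SegmentStats:
    """Sufficient statistics for one candidate segment, updated online from
    (features, reward) pairs without storing the trajectory."""

    def __init__(self, m, gamma):
        self.gamma = gamma
        self.A = [[0.0] * m for _ in range(m)]  # sum of phi(x_i) phi(x_i)^T
        self.b = [0.0] * m                      # sum of R_i phi(x_i)
        self.z = [0.0] * m                      # discounted trace of features
        self.sum_R2 = 0.0                       # sum of R_i^2
        self.trace_sq = 0.0                     # sum of gamma^{2(t-i)}
        self.trace_R = 0.0                      # sum of gamma^{t-i} * partial R_i

    def update(self, phi, r):
        g = self.gamma
        for i, pi in enumerate(phi):
            for j, pj in enumerate(phi):
                self.A[i][j] += pi * pj
        # every return R_i with i <= t gains gamma^{t-i} * r_t:
        self.z = [g * zi + pi for zi, pi in zip(self.z, phi)]
        self.b = [bi + r * zi for bi, zi in zip(self.b, self.z)]
        self.trace_sq = g * g * self.trace_sq + 1.0
        self.trace_R *= g
        self.sum_R2 += 2.0 * r * self.trace_R + r * r * self.trace_sq
        self.trace_R += r * self.trace_sq
```

Summing two such objects field by field gives the merged-segment statistics used in the merge test above.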
This necessitates the maintenance of two sets of sufficient statistics during segmentation; fortunately, the majority of time is consumed computing P(j, t, q), which during segmentation is only required using the lower-order approximator.\nIf we are to merge skills obtained over multiple trajectories into trees, we require the component skills to be aligned, meaning that the changepoints occur in roughly the same places. This will occur naturally in domains where changepoints are primarily caused by a change in relevant abstraction. When this is not the case, they may vary since segmentation is then based on function approximation boundaries, and hence two trajectories segmented independently may be poorly aligned. Therefore, when segmenting two trajectories sequentially in anticipation of a merge, we may wish to include a bias on changepoint locations in the second trajectory. We model this bias as a Mixture of Gaussians, centering an isotropic Gaussian at each location in state-space where a changepoint previously occurred. We can include this bias during changepoint detection by multiplying Equation 1 with the resulting PDF evaluated at the current state.\n\n4 Acquiring Skills from Human Demonstration in the PinBall Domain\n\nThe Pinball domain is a continuous domain with dynamic aspects, sharp discontinuities, and extended control characteristics that make it difficult for control and function approximation.² Previous experiments have shown that skill chaining is able to find a very good policy while flat learning finds a poor solution [3]. In this section, we evaluate the performance benefits obtained using a skill tree generated from a pair of human-provided solution trajectories.\nThe goal of PinBall is to maneuver the small ball (which starts in one of two places) into the large red hole. The ball is dynamic (drag coefficient 0.995), so its state is described by four variables: x, y, ẋ and ẏ. 
Collisions with obstacles are fully elastic and cause the ball to bounce, so rather than merely avoiding obstacles the agent may choose to use them to efficiently reach the hole. There are five primitive actions: incrementing or decrementing ẋ or ẏ by a small amount (which incurs a reward of −5 per action), or leaving them unchanged (which incurs a reward of −1 per action); reaching the goal obtains a reward of 10,000. We use the Pinball domain instance shown in Figure 3 with 5 pairs (one trajectory in each pair for each start state) of human demonstration trajectories.\n\n4.1 Implementation Details\n\nOverall task learning for both standard and option-learning agents used linear FA Sarsa (γ = 1, ε = 0.01) using a 5th-order Fourier basis [13] with α = 0.0005. Option policy learning used Q-learning (α_o = 0.0005, γ = 1, ε = 0.01) with a 3rd-order Fourier basis. Initiation sets were learned using logistic regression using 2nd order polynomial features with learning rate η = 0.1 and 100 sweeps per new data point. Other parameters were as in Konidaris and Barto [3].\n\n² Java source code for Pinball can be downloaded at http://www-all.cs.umass.edu/~gdk/pinball\n\nCST used an expected skill length of 100, δ = 0.0001, particle filter parameters N = 30 and M = 50, and a first-order Fourier basis (16 basis functions). After segmenting the first trajectory we used isotropic Gaussians with variance 0.5² to bias the segmentation of the second. The full 3rd-order Fourier basis representation was used for merging. 
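The Fourier basis [13] used throughout assigns one cosine feature to each integer coefficient vector up to the given order, giving (n + 1)^d features for order n in d dimensions: 16 for the first-order basis over Pinball's four state variables, and 256 for the third-order basis. A minimal sketch (function names are ours), assuming states rescaled to [0, 1]^d:

```python
import itertools, math

def fourier_basis(order, dim):
    # one feature per coefficient vector c in {0, ..., order}^dim
    coeffs = list(itertools.product(range(order + 1), repeat=dim))
    def phi(x):  # x is a state scaled to [0, 1]^dim
        return [math.cos(math.pi * sum(c * xi for c, xi in zip(cv, x)))
                for cv in coeffs]
    return phi

phi = fourier_basis(1, 4)   # first-order basis over (x, y, xdot, ydot)
```

The all-zero coefficient vector yields the constant feature, so the basis always includes a bias term.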
To obtain a fair comparison to skill chaining, we initialized the CST skill policies using 10 episodes of experience replay of the demonstrated trajectories, rather than using the sufficient statistics to perform a least-squares value function fit.\n\n4.2 Results\n\nTrajectory segmentation was successful for all demonstration trajectories, and all pairs were merged successfully into skill trees when the alignment bias was used to segment the second trajectory in the pair (two of the five could not be merged due to misalignments when the bias was not used). Example segmentations and the resulting merged trajectories are shown in Figure 3.\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 3: The Pinball instance used in our experiment (a), along with segmented skill chains from a pair of sample solution trajectories (b and c), and the assignments obtained when the two chains are merged (d).\n\nFigure 4: Learning curves in the PinBall domain, for agents employing skill trees created from demonstration trajectories, skill chaining agents, and agents starting with pre-learned skills.\n\nThe learning curves obtained by RL agents using the resulting skill trees are shown in Figure 4. These results compare the learning curves of CST agents, agents that perform skill chaining from scratch, and agents that are given fully pre-learned skills (obtained over 250 episodes of skill chaining). They show that the CST policies are not good enough to use immediately, as the agents do worse than those given pre-learned skills for the first few episodes. However, very shortly thereafter the CST agents are able to learn excellent policies—immediately performing much better than skill chaining agents, and shortly thereafter even exceeding the performance of agents with pre-learned skills. 
This is likely because the skill tree structure obtained from demonstration has fewer but better skills than that learned incrementally by skill chaining agents.\nIn addition, segmenting demonstration trajectories into skills results in much faster learning than attempting to acquire the entire policy by demonstration at once. The learning curve for agents that first perform experience replay on the overall task value function and then proceed using skill chaining (not shown) is virtually identical to that of agents performing skill chaining from scratch.\n\n5 Acquiring Skills from Human Demonstration on the uBot\n\nIn this section we show that CST is able to create skill chains and select appropriate abstractions for each skill from human demonstration on the uBot-5, a dynamically balancing mobile manipulator. Demonstration trajectories are obtained from an expert human operator, controlling the uBot as it enters a corridor, approaches a door, pushes the door open, turns right into a new corridor, and finally approaches and pushes on a panel (illustrated in Figure 5).\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 5: Starting at the beginning of a corridor (a), the uBot must approach and push open a door (b), turn through the doorway (c), then approach and push a panel (d).\n\nTo simplify perception, the uBot uses purple, orange, and yellow colored circles placed on the door and panel, beginning of the back wall, and middle of the back wall, respectively, as perceptually salient markers indicating the centroid of each object. The distances (obtained using stereo vision) between the uBot and each marker are computed at 8Hz and filtered. The uBot is able to engage one of two motor abstractions at a time: either performing end-point position control of its hand, or controlling the speed and angle of its forward motion. 
Thus, we constructed six sensorimotor abstractions, each containing either the differences between the arm endpoint position and marker position, or the distance to and angle between the robot's torso and the object. We assume a reward function of −1 every 10th of a second.\n\n#  Abstraction        Description         Trajectories Required\na  torso-purple       Drive to door.      2\nb  endpoint-purple    Open door.          1\nc  torso-orange       Drive toward wall.  1\nd  torso-yellow       Turn.               2\ne  torso-purple       Drive to panel.     1\nf  endpoint-purple    Press panel.        3\n\nFigure 6: A demonstration trajectory segmented into skills, each with an appropriate abstraction.\n\nWe gathered 12 demonstration trajectories from the uBot, of which 3 had to be discarded because the perceptual features were too noisy. Of the remaining 9, all segmented sensibly and 8 were able to be merged into a single skill chain. An example segmentation corresponding to this chain is shown in Figure 6 along with the abstractions selected, a brief description of each skill segment, and the number of sample trajectories required before the skill policy (learned using ridge regression with a 5th order Fourier basis) could be replayed successfully 9 times out of 10. This shows that CST is able to segment trajectories obtained from a robot platform, select an appropriate abstraction in each case, and then replay the resulting policies using a small number of sample trajectories.\n\n6 Related Work\n\nSeveral methods exist for skill discovery in discrete reinforcement learning domains; the most recent relevant work is by Mehta et al. [14], which induces task hierarchies from demonstration trajectories in discrete domains, but assumes a factored MDP with given dynamic Bayes network action models. By contrast, we know of very little work on skill acquisition in continuous domains where the skills or action hierarchy are not designed in advance. 
Mugan and Kuipers [15] use learned qualitatively-discretized factored models of a continuous state space to derive options, which is only feasible and appropriate in some settings. In Neumann et al. [16], an agent learns to solve a complex task by sequencing task-specific parametrized motion templates. Finally, Tedrake [17] builds a similar tree to ours in the model-based control setting.\nA sequence of policies represented using linear function approximators may be considered a switching linear dynamical system. Methods exist for learning such systems from data [18, 19]; these methods are able to handle multivariate target variables and models that repeat in the sequence. However, they are consequently more complex and computationally intensive than the much simpler changepoint detection method we use, and they have not been used in the context of skill acquisition.\nA great deal of work exists in robotics under the general heading of learning from demonstration [5], where control policies are learned using sample trajectories obtained from a human, robot demonstrator, or a planner. Most methods learn an entire single policy from data, although some perform segmentation—for example, Jenkins and Matarić [20] segment demonstrated data into motion primitives, and thereby build a motion primitive library. They perform segmentation using a heuristic specific to human-like kinematic motions; more recent work has used more principled statistical methods [21, 22] to segment the data into multiple models as a way to avoid perceptual aliasing in the policy. Other methods use demonstration to provide an initial policy that is then refined using reinforcement learning (e.g., Peters and Schaal [23]). Prior to our work, we know of no existing method that both performs trajectory segmentation and results in motion primitives suitable for further learning.\n\n7 Discussion and Conclusions\n\nCST makes several key assumptions. 
The first is that the demonstrated skills form a tree, when in some cases they may form a more general graph (e.g., when the demonstrated policy has a loop). A straightforward modification of the procedure to merge skill chains could accommodate such cases. We also assume that the domain reward function is known and that each option reward can be obtained from it by adding in a termination reward. A method for using inferred reward functions (e.g., Abbeel and Ng [24]) could be incorporated into our method when this is not true. However, this requires segmentation based on policy rather than value function, since rewards are not given at demonstration time. Because policies are usually multivariate, this would require a multivariate changepoint detection algorithm, such as that by Xuan and Murphy [18]. Finally, we assume that the best model for combining a pair of skills is the model selected for representing both individually. This may not always hold—two skills best represented individually by one model may be better represented together using another (perhaps more general) one. Since the correct abstraction would presumably be at least competitive during segmentation, such cases can be resolved by considering segmentations other than the final MAP solution when merging.\nSegmenting demonstration trajectories into skills has several advantages. Each skill is allocated its own abstraction, and therefore can be learned and represented efficiently—potentially allowing us to learn higher dimensional, extended policies. During learning, an unsuccessful or partial episode can still improve skills whose goals were nevertheless reached. Confidence-based learning methods [25] can be applied to each skill individually. 
Finally, skills learned using agent-centric features (such as in our uBot example) can be transferred to new problems [26], and thereby detached from a problem-specific setting to be more generally useful. Taken together, these advantages, in conjunction with the application of CST to bootstrap skill policy acquisition, may prove crucial to scaling up policy learning methods to high-dimensional, continuous domains.

Acknowledgements

We would like to thank Dan Xie and Dirk Ruiken for their invaluable help with the uBot, and Phil Thomas and Brenna Argall for useful discussions. Andrew Barto and George Konidaris were supported by the Air Force Office of Scientific Research under grant FA9550-08-1-0418. Scott Kuindersma is supported by a NASA GSRP fellowship.

References

[1] A.G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:41–77, 2003. Special Issue on Reinforcement Learning.

[2] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[3] G.D. Konidaris and A.G. Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pages 1015–1023, 2009.

[4] G.D. Konidaris and A.G. Barto. Efficient skill learning using abstraction selection. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, July 2009.

[5] B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483, 2009.

[6] M. Huber and R.A. Grupen. A feedback control structure for on-line learning tasks. Robotics and Autonomous Systems, 22(3-4):303–315, 1997.

[7] M. Rosenstein and A.G. Barto. Supervised actor-critic reinforcement learning. In J. Si, A.G. Barto, A. Powell, and D.
Wunsch, editors, Learning and Approximate Dynamic Programming: Scaling up the Real World, pages 359–380. John Wiley & Sons, Inc., New York, 2004.

[8] P. Fearnhead and Z. Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society B, 69:589–605, 2007.

[9] R.S. Sutton, D. Precup, and S.P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[10] T. Lozano-Perez, M.T. Mason, and R.H. Taylor. Automatic synthesis of fine-motion strategies for robots. The International Journal of Robotics Research, 3(1):3–24, 1984.

[11] R.R. Burridge, A.A. Rizzi, and D.E. Koditschek. Sequential composition of dynamically dextrous robot behaviors. International Journal of Robotics Research, 18(6):534–555, 1999.

[12] S. Boccaletti, A. Farini, E.J. Kostelich, and F.T. Arecchi. Adaptive targeting of chaos. Physical Review E, 55(5):4845–4848, 1997.

[13] G.D. Konidaris and S. Osentoski. Value function approximation in reinforcement learning using the Fourier basis. Technical Report UM-CS-2008-19, Department of Computer Science, University of Massachusetts Amherst, June 2008.

[14] N. Mehta, S. Ray, P. Tadepalli, and T. Dietterich. Automatic discovery and transfer of MAXQ hierarchies. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 648–655, 2008.

[15] J. Mugan and B. Kuipers. Autonomously learning an action hierarchy using a learned qualitative state representation. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, 2009.

[16] G. Neumann, W. Maass, and J. Peters. Learning complex motions by sequencing simpler motion templates. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, 2009.

[17] R. Tedrake.
LQR-Trees: Feedback motion planning on sparse randomized trees. In Proceedings of Robotics: Science and Systems, pages 18–24, 2009.

[18] X. Xuan and K. Murphy. Modeling changing dependency structure in multivariate time series. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007.

[19] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. Nonparametric Bayesian learning of switching linear dynamical systems. In Advances in Neural Information Processing Systems 21, 2008.

[20] O.C. Jenkins and M. Matarić. Performance-derived behavior vocabularies: data-driven acquisition of skills from motion. International Journal of Humanoid Robotics, 1(2):237–288, 2004.

[21] D.H. Grollman and O.C. Jenkins. Incremental learning of subtasks from unsegmented demonstration. In Proceedings of the International Conference on Intelligent Robots and Systems, 2010.

[22] J. Butterfield, S. Osentoski, G. Jay, and O.C. Jenkins. Learning from demonstration using a multi-valued function regressor for time-series data. In Proceedings of the Tenth IEEE-RAS International Conference on Humanoid Robots, 2010.

[23] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[24] P. Abbeel and A.Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

[25] S. Chernova and M. Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the Sixth International Joint Conference on Autonomous Agents and Multiagent Systems, 2007.

[26] G.D. Konidaris and A.G. Barto. Building portable options: Skill transfer in reinforcement learning.
In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007.