{"title": "Adaptive Skills Adaptive Partitions (ASAP)", "book": "Advances in Neural Information Processing Systems", "page_first": 1588, "page_last": 1596, "abstract": "We introduce the Adaptive Skills, Adaptive Partitions (ASAP) framework that (1) learns skills (i.e., temporally extended actions or options) as well as (2) where to apply them. We believe that both (1) and (2) are necessary for a truly general skill learning framework, which is a key building block needed to scale up to lifelong learning agents. The ASAP framework is also able to solve related new tasks simply by adapting where it applies its existing learned skills. We prove that ASAP converges to a local optimum under natural conditions. Finally, our experimental results, which include a RoboCup domain, demonstrate the ability of ASAP to learn where to reuse skills as well as solve multiple tasks with considerably less experience than solving each task from scratch.", "full_text": "Adaptive Skills Adaptive Partitions (ASAP)\n\nDaniel J. Mankowitz, Timothy A. Mann\u2217 and Shie Mannor\n\nThe Technion - Israel Institute of Technology,\n\nHaifa, Israel\n\ndanielm@tx.technion.ac.il, mann.timothy@acm.org, shie@ee.technion.ac.il\n\n\u2217Timothy Mann now works at Google Deepmind.\n\nAbstract\n\nWe introduce the Adaptive Skills, Adaptive Partitions (ASAP) framework that (1)\nlearns skills (i.e., temporally extended actions or options) as well as (2) where to\napply them. We believe that both (1) and (2) are necessary for a truly general skill\nlearning framework, which is a key building block needed to scale up to lifelong\nlearning agents. The ASAP framework can also solve related new tasks simply by\nadapting where it applies its existing learned skills. We prove that ASAP converges\nto a local optimum under natural conditions. 
Finally, our experimental results, which include a RoboCup domain, demonstrate the ability of ASAP to learn where to reuse skills as well as solve multiple tasks with considerably less experience than solving each task from scratch.

1 Introduction

Human decision-making involves decomposing a task into a course of action. The course of action is typically composed of abstract, high-level actions that may execute over different timescales (e.g., walk to the door or make a cup of coffee). The decision-maker chooses actions to execute to solve the task. These actions may need to be reused at different points in the task. In addition, the actions may need to be used across multiple, related tasks.

Consider, for example, the task of building a city. The course of action for building a city may involve building the foundations, laying down sewage pipes, as well as building houses and shopping malls. Each action operates over multiple timescales, and certain actions (such as building a house) may need to be reused if additional units are required. In addition, these actions can be reused if a neighboring city needs to be developed (a multi-task scenario).

Reinforcement Learning (RL) represents actions that last for multiple timescales as Temporally Extended Actions (TEAs) (Sutton et al., 1999), also referred to as options, skills (Konidaris & Barto, 2009) or macro-actions (Hauskrecht, 1998). It has been shown both experimentally (Precup & Sutton, 1997; Sutton et al., 1999; Silver & Ciosek, 2012; Mankowitz et al., 2014) and theoretically (Mann & Mannor, 2014) that TEAs speed up the convergence rates of RL planning algorithms. TEAs are seen as a potentially viable solution to making RL truly scalable. TEAs in RL have become popular in many domains including RoboCup soccer (Bai et al., 2012), video games (Mann et al., 2015) and robotics (Fu et al., 2015).
Here, decomposing the domains into temporally extended courses of action (strategies in RoboCup, move combinations in video games and skill controllers in robotics, for example) has generated impressive solutions. From here on in, we will refer to TEAs as skills.

A course of action is defined by a policy. A policy is a solution to a Markov Decision Process (MDP) and is defined as a mapping from states to a probability distribution over actions. That is, it tells the RL agent which action to perform given the agent's current state. We will refer to an inter-skill policy as being a policy that tells the agent which skill to execute, given the current state.

A truly general skill learning framework must (1) learn skills as well as (2) automatically compose them together (as stated by Bacon & Precup (2015)) and determine where each skill should be executed (the inter-skill policy). This framework should also determine (3) where skills can be reused in different parts of the state space and (4) adapt to changes in the task itself. Finally, it should also be able to (5) correct model misspecification (Mankowitz et al., 2014).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Table 1: Comparison of Approaches to ASAP

| | Automated Skill Learning with Policy Gradient | Automatic Skill Composition | Continuous State | Multitask Learning | Learning Reusable Skills | Correcting Model Misspecification |
| ASAP (this paper) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| da Silva et al. 2012 | ✓ | × | ✓ | ✓ | × | × |
| Konidaris & Barto 2009 | × | ✓ | ✓ | × | × | × |
| Bacon & Precup 2015 | ✓ | × | × | × | × | × |
| Eaton & Ruvolo 2013 | × | × | × | ✓ | × | × |
Whilst different forms of\nmodel misspeci\ufb01cation exist in RL, we de\ufb01ne it here as having an unsatisfactory set of skills and\ninter-skill policy that provide a sub-optimal solution to a given task. This skill learning framework\nshould be able to correct this misspeci\ufb01cation to obtain a near-optimal solution. A number of works\nhave addressed some of these issues separately as shown in Table 1. However, no work, to the best of\nour knowledge, has combined all of these elements into a truly general skill-learning framework.\nOur framework entitled \u2018Adaptive Skills, Adaptive Partitions (ASAP)\u2019 is the \ufb01rst of its kind to\nincorporate all of the above-mentioned elements into a single framework, as shown in Table 1, and\nsolve continuous state MDPs. It receives as input a misspeci\ufb01ed model (a sub-optimal set of skills and\ninter-skill policy). The ASAP framework corrects the misspeci\ufb01cation by simultaneously learning a\nnear-optimal skill-set and inter-skill policy which are both stored, in a Bayesian-like manner, within\nthe ASAP policy. In addition, ASAP automatically composes skills together, learns where to reuse\nthem and learns skills across multiple tasks.\nMain Contributions: (1) The Adaptive Skills, Adaptive Partitions (ASAP) algorithm that automati-\ncally corrects a misspeci\ufb01ed model. It learns a set of near-optimal skills, automatically composes\nskills together and learns an inter-skill policy to solve a given task. (2) Learning skills over multiple\ndifferent tasks by automatically adapting both the inter-skill policy and the skill set. (3) ASAP can\ndetermine where skills should be reused in the state space. 
(4) Theoretical convergence guarantees.

2 Background

Reinforcement Learning Problem: A Markov Decision Process is defined by a 5-tuple ⟨X, A, R, γ, P⟩ where X is the state space, A is the action space, R ∈ [−b, b] is a bounded reward function, γ ∈ [0, 1] is the discount factor and P : X × A → [0, 1]^X is the transition probability function for the MDP. The solution to an MDP is a policy π : X → ΔA, which is a function mapping states to a probability distribution over actions. An optimal policy π* : X → ΔA determines the best actions to take so as to maximize the expected reward. The value function V^π(x) = E_{a∼π(·|x)} [ R(x, a) + γ E_{x′∼P(·|x,a)} [ V^π(x′) ] ] defines the expected reward for following a policy π from state x. The optimal expected reward V^{π*}(x) is the expected value obtained for following the optimal policy from state x.

Policy Gradient: Policy Gradient (PG) methods have enjoyed success in recent years, especially in the field of robotics (Peters & Schaal, 2006, 2008). The goal in PG is to learn a policy π_θ that maximizes the expected return J(π_θ) = ∫_τ P(τ) R(τ) dτ, where τ is a trajectory, P(τ) is the probability of a trajectory and R(τ) is the reward obtained for a particular trajectory. P(τ) is defined as P(τ) = P(x_0) Π_{k=0}^{T} P(x_{k+1} | x_k, a_k) π_θ(a_k | x_k). Here, x_k ∈ X is the state at the kth timestep of the trajectory; a_k ∈ A is the action at the kth timestep; T is the trajectory length. Only the policy, in the general formulation of policy gradient, is parameterized with parameters θ. The idea is then to update the policy parameters using stochastic gradient ascent, leading to the update rule θ_{t+1} = θ_t + η ∇J(π_θ), where θ_t are the policy parameters at timestep t, ∇J(π_θ) is the gradient of the objective function with respect to the parameters and η is the step size.

3 Skills, Skill Partitions and Intra-Skill Policy

Skills: A skill is a parameterized Temporally Extended Action (TEA) (Sutton et al., 1999). The power of a skill is that it incorporates both generalization (due to the parameterization) and temporal abstraction. Skills are a special case of options and therefore inherit many of their useful theoretical properties (Sutton et al., 1999; Precup et al., 1998).

Definition 1. A Skill ζ is a TEA that consists of the two-tuple ζ = ⟨σ_θ, p(x)⟩ where σ_θ : X → ΔA is a parameterized, intra-skill policy with parameters θ ∈ R^d and p : X → [0, 1] is the termination probability distribution of the skill.

Skill Partitions: A skill, by definition, performs a specialized task on a sub-region of a state space. We refer to these sub-regions as Skill Partitions (SPs), which are necessary for skills to specialize during the learning process. A given set of SPs covering a state space effectively defines the inter-skill policy, as they determine where each skill should be executed. These partitions are unknown a priori and are generated using intersections of hyperplane half-spaces (described below). Hyperplanes provide a natural way to automatically compose skills together.
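To make the half-space construction concrete, the hard (temperature α → ∞) assignment of states to partitions described below can be sketched as follows. This is our own illustrative snippet, not the authors' code; the feature vector ψ and hyperplane matrix β follow the notation of this section, while the function names are ours:

```python
import numpy as np

def partition_index(psi, beta):
    """Map a state's feature vector to a skill index via hyperplane half-spaces.

    psi:  (d,) feature vector for the current state (and task).
    beta: (d, K) matrix whose column k holds the parameters of hyperplane k.
    Each hyperplane contributes one bit, b_k = 1 iff psi^T beta_k > 0, and the
    K bits form the binary skill index i = sum_k 2^(k-1) b_k.
    """
    bits = (psi @ beta > 0).astype(int)            # (K,) half-space indicators
    return int(sum(b << k for k, b in enumerate(bits)))

# Two hyperplanes (h1: x > 0, h2: y > 0) split the plane into four partitions,
# one per skill index 0..3, as in the four-partition example of Figure 1a.
beta = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
print(partition_index(np.array([2.0, 3.0]), beta))   # both half-spaces positive -> 3
print(partition_index(np.array([-1.0, 3.0]), beta))  # only h2 positive -> 2
```

In the ASAP framework itself the assignment is soft (each bit is a Bernoulli variable whose probability is a sigmoid of the hyperplane activation), which keeps partition boundaries differentiable; the hard sign test above is the limiting behavior as the temperature grows.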
In addition, once a skill is being executed, the agent needs to select actions from the skill's intra-skill policy σ_θ. We next utilize SPs and the intra-skill policy for each skill to construct the ASAP policy, defined in Section 4. We first define a skill hyperplane.

Definition 2. Skill Hyperplane (SH): Let ψ_{x,m} ∈ R^d be a vector of features that depend on a state x ∈ X and an MDP environment m. Let β_i ∈ R^d be a vector of hyperplane parameters. A skill hyperplane is defined as ψ_{x,m}^T β_i = L, where L is a constant.

In this work, we interpret hyperplanes to mean that the intersection of skill hyperplane half-spaces forms sub-regions in the state space called Skill Partitions (SPs), defining where each skill is executed. Figure 1a contains two example skill hyperplanes h1, h2. Skill ζ1 is executed in the SP defined by the intersection of the positive half-space of h1 and the negative half-space of h2. The same argument applies for ζ0, ζ2, ζ3. From here on in, we will refer to skill ζ_i interchangeably with its index i.

Skill hyperplanes have two functions: (1) They automatically compose skills together, creating chainable skills as desired by Bacon & Precup (2015). (2) They define SPs which enable us to derive the probability of executing a skill, given a state x and MDP m. First, we need to be able to uniquely identify a skill. We define a binary vector B = [b_1, b_2, ···, b_K] ∈ {0, 1}^K where b_k is a Bernoulli random variable and K is the number of skill hyperplanes. We define the skill index i = Σ_{k=1}^K 2^{k−1} b_k as a sum of Bernoulli random variables b_k. Note that this is but one approach to generate skills (and SPs). In principle this setup defines 2^K skills, but in practice, far fewer skills are typically used (see experiments). Furthermore, the complexity of the SP is governed by the VC-dimension. We can now define the probability of executing skill i as a Bernoulli likelihood in Equation 1:

P(i | x, m) = P( i = Σ_{k=1}^K 2^{k−1} b_k ) = Π_{k=1}^K p_k(b_k = i_k | x, m) .   (1)

Here, i_k ∈ {0, 1} is the value of the kth bit of B, x is the current state and m is a description of the MDP. The probabilities p_k(b_k = 1 | x, m) and p_k(b_k = 0 | x, m) are defined in Equation 2:

p_k(b_k = 1 | x, m) = 1 / (1 + exp(−α ψ_{x,m}^T β_k)) ,  p_k(b_k = 0 | x, m) = 1 − p_k(b_k = 1 | x, m) .   (2)

We have made use of the logistic sigmoid function to ensure valid probabilities, where ψ_{x,m}^T β_k is a skill hyperplane and α > 0 is a temperature parameter. The intuition here is that the kth bit of a skill is b_k = 1 if the skill hyperplane ψ_{x,m}^T β_k > 0, meaning that the skill's partition is in the positive half-space of the hyperplane. Similarly, b_k = 0 if ψ_{x,m}^T β_k < 0, corresponding to the negative half-space. Using skill 3 as an example with K = 2 hyperplanes in Figure 1a, we would define the Bernoulli likelihood of executing ζ3 as p(i = 3 | x, m) = p1(b1 = 1 | x, m) · p2(b2 = 1 | x, m).

Intra-Skill Policy: Now that we can define the probability of executing a skill based on its SP, we define the intra-skill policy σ_θ for each skill. The Gibbs distribution is a commonly used function to define policies in RL (Sutton et al., 1999). Therefore we define the intra-skill policy for skill i, parameterized by θ_i ∈ R^d, as

σ_{θ_i}(a | x) = exp(α φ_{x,a}^T θ_i) / Σ_{b∈A} exp(α φ_{x,b}^T θ_i) .   (3)

Here, α > 0 is the temperature and φ_{x,a} ∈ R^d is a feature vector that depends on the current state x ∈ X and action a ∈ A.
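The two distributions above can be sketched directly from Equations 1-3. This is an illustrative NumPy rendering under our own variable names, not the authors' implementation:

```python
import numpy as np

def skill_probability(i, psi, beta, alpha=1.0):
    """Equations 1-2: P(i | x, m) as a product of per-hyperplane Bernoullis.

    psi:  (d,) hyperplane feature vector psi_{x,m}.
    beta: (d, K) hyperplane parameter matrix; bit k of index i selects the
    half-space, with p_k(b_k = 1) a sigmoid of the hyperplane activation.
    """
    p_one = 1.0 / (1.0 + np.exp(-alpha * (psi @ beta)))   # (K,) p_k(b_k=1|x,m)
    K = beta.shape[1]
    bits = [(i >> k) & 1 for k in range(K)]
    return float(np.prod([p if b else 1.0 - p for p, b in zip(p_one, bits)]))

def intra_skill_policy(phi, theta_i, alpha=1.0):
    """Equation 3: Gibbs (softmax) intra-skill policy over actions.

    phi: (|A|, d) matrix with one feature row phi_{x,a} per action.
    """
    logits = alpha * (phi @ theta_i)
    z = np.exp(logits - logits.max())                     # numerically stable
    return z / z.sum()
```

Because every hyperplane contributes exactly one Bernoulli factor, the probabilities P(i | x, m) over all 2^K skill indices sum to one, so a mixture over skills weighted by them remains a valid distribution.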
Now that we have a definition of both the probability of executing a skill and an intra-skill policy, we need to incorporate these distributions into the policy gradient setting using a generalized trajectory.

Generalized Trajectory: A generalized trajectory is necessary to derive policy gradient update rules with respect to the parameters Θ, β, as will be shown in Section 4. A typical trajectory is usually defined as τ = (x_t, a_t, r_t, x_{t+1})_{t=0}^T where T is the length of the trajectory. For a generalized trajectory, our algorithm emits a class i_t at each timestep t ≥ 1, which denotes the skill that was executed. The generalized trajectory is defined as g = (x_t, a_t, i_t, r_t, x_{t+1})_{t=0}^T. The probability of a generalized trajectory, as an extension to the PG trajectory in Section 2, is now P_{Θ,β}(g) = P(x_0) Π_{t=0}^T P(x_{t+1} | x_t, a_t) P_β(i_t | x_t, m) σ_{θ_i}(a_t | x_t), where P_β(i_t | x_t, m) is the probability of a skill being executed, given the state x_t ∈ X and environment m at time t ≥ 1; σ_{θ_i}(a_t | x_t) is the probability of executing action a_t ∈ A at time t ≥ 1 given that we are executing skill i. The generalized trajectory is now a function of two parameter vectors θ and β.

4 Adaptive Skills, Adaptive Partitions (ASAP) Framework

The Adaptive Skills, Adaptive Partitions (ASAP) framework simultaneously learns a near-optimal set of skills and SPs (the inter-skill policy), given an initially misspecified model. ASAP also automatically composes skills together and allows for a multi-task setting as it incorporates the environment m into its hyperplane feature set. We have previously defined two important distributions, P_β(i_t | x_t, m) and σ_{θ_i}(a_t | x_t), respectively. These distributions are used to collectively define the ASAP policy, which is presented below.
Using the notion of a generalized trajectory, the ASAP policy can be learned in a policy gradient setting.

ASAP Policy: Assume that we are given a probability distribution μ over MDPs with a d-dimensional state-action space and a z-dimensional vector describing each MDP. We define β as a (d + z) × K matrix where each column β_i represents a skill hyperplane, and Θ as a (d × 2^K) matrix where each column θ_j parameterizes an intra-skill policy. Using the previously defined distributions, we now define the ASAP policy.

Definition 3. (ASAP Policy). Given K skill hyperplanes, a set of 2^K skills Σ = {ζ_i | i = 1, ···, 2^K}, a state space x ∈ X, a set of actions a ∈ A and an MDP m from a hypothesis space of MDPs, the ASAP policy is defined as

π_{Θ,β}(a | x, m) = Σ_{i=1}^{2^K} P_β(i | x, m) σ_{θ_i}(a | x) ,   (4)

where P_β(i | x, m) and σ_{θ_i}(a | x) are the distributions as defined in Equations 1 and 3 respectively. This is a powerful description for a policy, which resembles a Bayesian approach, as the policy takes into account the uncertainty of the skills that are executing as well as the actions that each skill's intra-skill policy chooses. We now define the ASAP objective with respect to the ASAP policy.

ASAP Objective: We defined the policy with respect to a hypothesis space of MDPs. We now need to define an objective function which takes this hypothesis space into account. Since we assume that we are provided with a distribution μ : M → [0, 1] over possible MDP models m ∈ M, with a d-dimensional state-action space, we can incorporate this into the ASAP objective function:

ρ(π_{Θ,β}) = ∫ μ(m) J^{(m)}(π_{Θ,β}) dm ,   (5)

where π_{Θ,β} is the ASAP policy and J^{(m)}(π_{Θ,β}) is the expected return for MDP m with respect to the ASAP policy. To simplify the notation, we group all of the parameters into a single parameter vector Ω = [vec(Θ), vec(β)]. We define the expected reward for generalized trajectories g as J(π_Ω) = ∫_g P_Ω(g) R(g) dg, where R(g) is the reward obtained for a particular trajectory g. This is a slight variation of the original policy gradient objective defined in Section 2. We then insert J(π_Ω) into Equation 5 and we get the ASAP objective function

ρ(π_Ω) = ∫ μ(m) J^{(m)}(π_Ω) dm ,   (6)

where J^{(m)}(π_Ω) is the expected return for policy π_Ω in MDP m. Next, we need to derive gradient update rules to learn the parameters of the optimal policy π*_Ω that maximizes this objective.

ASAP Gradients: To learn both the intra-skill policy parameter matrix Θ as well as the hyperplane parameter matrix β (and therefore implicitly the SPs), we derive an update rule for the policy gradient framework with generalized trajectories. The full derivation is in the supplementary material. The first step involves calculating the gradient of the ASAP objective function, yielding the ASAP gradient (Theorem 1).

Theorem 1. (ASAP Gradient Theorem).
Suppose that the ASAP objective function is ρ(π_Ω) = ∫ μ(m) J^{(m)}(π_Ω) dm, where μ(m) is a distribution over MDPs m and J^{(m)}(π_Ω) is the expected return for MDP m whilst following policy π_Ω. Then the gradient of this objective is:

∇_Ω ρ(π_Ω) = E_{μ(m)} [ E_{P_Ω^{(m)}(g)} [ Σ_{t=0}^{H^{(m)}} ∇_Ω Z_Ω^{(m)}(x_t, i_t, a_t) R^{(m)} ] ] ,

where Z_Ω^{(m)}(x_t, i_t, a_t) = log P_β(i_t | x_t, m) σ_{θ_i}(a_t | x_t); H^{(m)} is the length of a trajectory for MDP m; and R^{(m)} = Σ_{t=0}^{H^{(m)}} γ^t r_t is the discounted cumulative reward for the trajectory¹.

If we are able to derive ∇_Ω Z_Ω^{(m)}(x_t, i_t, a_t), then we can estimate the gradient ∇_Ω ρ(π_Ω). We will refer to Z_Ω^{(m)} = Z_Ω^{(m)}(x_t, i_t, a_t) where it is clear from context. It turns out that it is possible to derive this term as a result of the generalized trajectory. This yields the gradients ∇_Θ Z_Ω^{(m)} and ∇_β Z_Ω^{(m)} in Theorems 2 and 3 respectively. The derivations can be found in the supplementary material.

Theorem 2. (Θ Gradient Theorem). Suppose that Θ is a (d × 2^K) matrix where each column θ_j parameterizes an intra-skill policy. Then the gradient ∇_{θ_{i_t}} Z_Ω^{(m)} corresponding to the intra-skill parameters of the ith skill at time t is:

∇_{θ_{i_t}} Z_Ω^{(m)} = α φ_{x_t,a_t} − α ( Σ_{b∈A} φ_{x_t,b} exp(α φ_{x_t,b}^T θ_{i_t}) ) / ( Σ_{b∈A} exp(α φ_{x_t,b}^T θ_{i_t}) ) ,

where α > 0 is the temperature parameter and φ_{x_t,a_t} ∈ R^d is a feature vector of the current state x_t ∈ X and the current action a_t ∈ A.

Theorem 3. (β Gradient Theorem). Suppose that β is a (d + z) × K matrix where each column β_k represents a skill hyperplane. Then the gradient ∇_{β_k} Z_Ω^{(m)} corresponding to the parameters of the kth hyperplane is:

∇_{β_{k,1}} Z_Ω^{(m)} = ( α ψ_{x_t,m} exp(−α ψ_{x_t,m}^T β_k) ) / ( 1 + exp(−α ψ_{x_t,m}^T β_k) ) ,  ∇_{β_{k,0}} Z_Ω^{(m)} = −α ψ_{x_t,m} + ( α ψ_{x_t,m} exp(−α ψ_{x_t,m}^T β_k) ) / ( 1 + exp(−α ψ_{x_t,m}^T β_k) ) ,   (7)

where α > 0 is the hyperplane temperature parameter, ψ_{x_t,m}^T β_k is the kth skill hyperplane for MDP m, β_{k,1} corresponds to locations in the binary vector equal to 1 (b_k = 1) and β_{k,0} corresponds to locations in the binary vector equal to 0 (b_k = 0).

Using these gradient updates, we can then order all of the gradients into a vector ∇_Ω Z_Ω^{(m)} = ⟨∇_{θ_1} Z_Ω^{(m)}, ..., ∇_{θ_{2^K}} Z_Ω^{(m)}, ∇_{β_1} Z_Ω^{(m)}, ..., ∇_{β_K} Z_Ω^{(m)}⟩ and update both the intra-skill policy parameters and hyperplane parameters for the given task (learning a skill set and SPs). Note that the updates occur on a single time scale. This is formally stated in the ASAP Algorithm.

¹These expectations can easily be sampled (see supplementary material).

5 ASAP Algorithm

We present the ASAP algorithm (Algorithm 1), which dynamically and simultaneously learns skills, the inter-skill policy and automatically composes skills together by learning SPs. The skills (Θ matrix) and SPs (β matrix) are initially arbitrary and therefore form a misspecified model. Line 2 combines the skill and hyperplane parameters into a single parameter vector Ω. Lines 3-7 learn the skill and hyperplane parameters (and therefore implicitly the skill partitions). In line 4 a generalized trajectory is generated using the current ASAP policy. The gradient ∇_Ω ρ(π_Ω) is then estimated in line 5 from this trajectory, and line 6 updates the parameters. This is repeated until the skill and hyperplane parameters have converged, thus correcting the misspecified model.
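The loop just described can be sketched as a REINFORCE-style update. This is our own minimal rendering of the idea, not the authors' code; the two helper callables, which sample a generalized trajectory and evaluate ∇_Ω Z, are assumed to be supplied by the environment and model:

```python
import numpy as np

def asap_step(omega, sample_trajectory, grad_log_policy, eta=0.01, n_trials=10):
    """One stochastic gradient-ascent step on the ASAP objective (a sketch).

    sample_trajectory(omega) -> ([(x, i, a, r), ...], R) returns a generalized
    trajectory (state, skill, action, reward tuples) and its discounted return R.
    grad_log_policy(omega, x, i, a) returns the gradient of
    log[P_beta(i | x, m) * sigma_theta_i(a | x)] with respect to omega.
    """
    grad = np.zeros_like(omega)
    for _ in range(n_trials):
        trajectory, discounted_return = sample_trajectory(omega)
        for x, i, a, _ in trajectory:
            # Accumulate sum_t grad-log-policy * R, the sampled ASAP gradient.
            grad += grad_log_policy(omega, x, i, a) * discounted_return
    return omega + eta * grad / n_trials        # ascend the objective
```

Because the skill parameters Θ and the hyperplane parameters β live in the single vector Ω, one such step moves the skills and the skill partitions simultaneously, which is what allows a misspecified inter-skill policy to be repaired while the skills themselves are still being learned.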
Theorem 4 provides a convergence guarantee of ASAP to a local optimum (see supplementary material for the proof).

Algorithm 1 ASAP
Require: φ_{x,a} ∈ R^d {state-action feature vector}, ψ_{x,m} ∈ R^{d+z} {skill hyperplane feature vector}, K {the number of hyperplanes}, Θ ∈ R^{d×2^K} {an arbitrary skill matrix}, β ∈ R^{(d+z)×K} {an arbitrary skill hyperplane matrix}, μ(m) {a distribution over MDP tasks}
1: Z = |d||2^K| + |(d + z)K| {define the number of parameters}
2: Ω = [vec(Θ), vec(β)] ∈ R^Z
3: repeat
4: Perform a trial (which may consist of multiple MDP tasks) and obtain x_{0:H}, i_{0:H}, a_{0:H}, r_{0:H}, m_{0:H} {states, skills, actions, rewards, task-specific information}
5: ∇_Ω ρ(π_Ω) = ⟨⟨ Σ_{t=0}^{T^{(m)}} ∇_Ω Z^{(m)}(Ω) R^{(m)} ⟩⟩_m {T^{(m)} is the task episode length}
6: Ω → Ω + η ∇_Ω ρ(π_Ω)
7: until parameters Ω have converged
8: return Ω

Theorem 4. Convergence of ASAP: Given an ASAP policy π(Ω), an ASAP objective over MDP models ρ(π_Ω), as well as the ASAP gradient update rules: if (1) the step-size η_k satisfies lim_{k→∞} η_k = 0 and Σ_k η_k = ∞; and (2) the second derivative of the policy is bounded and we have bounded rewards; then the sequence {ρ(π_{Ω,k})}_{k=0}^∞ converges such that lim_{k→∞} ∂ρ(π_{Ω,k})/∂Ω = 0 almost surely.

6 Experiments

The experiments have been performed on four different continuous domains: the Two Rooms (2R) domain (Figure 1b), the Flipped 2R domain (Figure 1c), the Three Rooms (3R) domain (Figure 1d) and RoboCup domains (Figure 1e) that include a one-on-one scenario between a striker and a goalkeeper (R1), a two-on-one scenario of a striker against a goalkeeper and a defender (R2), and a striker against two defenders and a goalkeeper (R3) (see supplementary material). In each experiment, ASAP is provided with a misspecified model; that is, a set of skills and SPs (the inter-skill policy) that achieve degenerate, sub-optimal performance. ASAP corrects this misspecified model in each case to learn a set of near-optimal skills and SPs. For each experiment we implement ASAP using Actor-Critic Policy Gradient (AC-PG) as the learning algorithm².

The Two-Room and Flipped Room Domains (2R): In both domains, the agent (red ball) needs to reach the goal location (blue square) in the shortest amount of time. The agent receives constant negative rewards and, upon reaching the goal, receives a large positive reward. There is a wall dividing the environment, which creates two rooms. The state space is a 4-tuple consisting of the continuous ⟨x_agent, y_agent⟩ location of the agent and the ⟨x_goal, y_goal⟩ location of the center of the goal. The agent can move in each of the four cardinal directions. For each experiment involving the two-room domains, a single hyperplane is learned (resulting in two SPs) with a linear feature vector representation ψ_{x,m} = [1, x_agent, y_agent]. In addition, a skill is learned in each of the two SPs.
The intra-skill policies are represented as a probability distribution over actions.

Automated Hyperplane and Skill Learning: Using ASAP, the agent learned intuitive SPs and skills, as seen in Figures 1f and 1g. Each colored region corresponds to an SP. The white arrows have been superimposed onto the figures to indicate the skills learned for each SP. Since each intra-skill policy is a probability distribution over actions, each skill is unable to solve the entire task on its own. ASAP has taken this into account and has positioned the hyperplane accordingly such that the given skill representation can solve the task. Figure 2a shows that ASAP improves upon the initial misspecified partitioning to attain near-optimal performance compared to executing ASAP on the fixed initial misspecified partitioning and on a fixed approximately optimal partitioning.

²AC-PG works well in practice and can be trivially incorporated into ASAP with convergence guarantees.

Figure 1: (a) The intersection of skill hyperplanes {h1, h2} forms four partitions, each of which defines a skill's execution region (the inter-skill policy). The (b) 2R, (c) Flipped 2R, (d) 3R and (e) RoboCup domains (with a varying number of defenders for R1, R2, R3). The learned skills and Skill Partitions (SPs) for the (f) 2R, (g) Flipped 2R, (h) 3R and (i) across multiple tasks.

Figure 2: Average reward of the learned ASAP policy compared to (1) the approximately optimal SPs and skill set as well as (2) the initial misspecified model. This is for the (a) 2R, (b) 3R, (c) 2R learning across multiple tasks and the (d) 2R without learning by flipping the hyperplane. (e) The average reward of the learned ASAP policy for a varying number of K hyperplanes. (f) The learned SPs and skill set for the R1 domain. (g) The learned SPs using a polynomial hyperplane (1),(2) and linear hyperplane (3) representation. (h) The learned SPs using a polynomial hyperplane representation without the defender's location as a feature (1) and with the defender's x location (2), y location (3), and ⟨x, y⟩ location as a feature (4). (i) The dribbling behavior of the striker when taking the defender's y location into account. (j) The average reward for the R1 domain.

Multiple Hyperplanes: We analyzed the ASAP framework when learning multiple hyperplanes in the two-room domain. As seen in Figure 2e, increasing the number of hyperplanes K does not have an impact on the final solution in terms of average reward. However, it does increase the computational complexity of the algorithm, since 2^K skills need to be learned. The approximate points of convergence are marked in the figure as K1, K2 and K3, respectively. In addition, two skills dominate in each case, producing similar partitions to those seen in Figure 1a (see supplementary material), indicating that ASAP learns that not all skills are necessary to solve the task.

Multitask Learning: We first applied ASAP to the 2R domain (Task 1) and attained a near-optimal average reward (Figure 2c). It took approximately 35000 episodes to get near-optimal performance and resulted in the SPs and skill set shown in Figure 1i (top). Using the learned SPs and skills, ASAP was then able to adapt and learn a new set of SPs and skills to solve a different task (Flipped 2R - Task 2) in only 5000 episodes (Figure 2c), indicating that the parameters learned from the old task provided a good initialization for the new task. The knowledge transfer is seen in Figure 1i (bottom), as the SPs do not significantly change between tasks, yet the skills are completely relearned.

We also wanted to see whether we could flip the SPs; that is, switch the sign of the hyperplane parameters learned in the 2R domain and see whether ASAP can solve the Flipped 2R domain (Task 2) without any additional learning.
Due to the symmetry of the domains, ASAP was indeed able to\nsolve the new domain and attained near-optimal performance (Figure 2d). This is an exciting result as\nmany problems, especially navigation tasks, possess symmetrical characteristics. This insight could\ndramatically reduce the sample complexity of these problems.\nThe Three-Room Domain (3R): The 3R domain (Figure 1d), is similar to the 2R domain regarding\nthe goal, state-space, available actions and rewards. However, in this case, there are two walls,\ndividing the state space into three rooms. The hyperplane feature vector \u03c8x,m consists of a single\n\n7\n\n\ffourier feature. The intra-skill policy is a probability distribution over actions. The resulting learned\nhyperplane partitioning and skill set are shown in Figure 1h. Using this partitioning ASAP achieved\nnear optimal performance (Figure 2b). This experiment shows an insightful and unexpected result.\nReusable Skills: Using this hyperplane representation, ASAP was able to not only learn the intra-skill\npolicies and SPs, but also that skill \u2018A\u2019 needed to be reused in two different parts of the state space\n(Figure 1h). ASAP therefore shows the potential to automatically create reusable skills.\nRoboCup Domain: The RoboCup 2D soccer simulation domain (Akiyama & Nakashima, 2014) is\na 2D soccer \ufb01eld (Figure 1e) with two opposing teams. We utilized three RoboCup sub-domains\n3 R1, R2 and R3 as mentioned previously. In these sub-domains, a striker (the agent) needs to\nlearn to dribble the ball and try and score goals past the goalkeeper. State space: R1 domain -\nthe continuous locations of the striker (cid:104)xstriker, ystriker(cid:105) , the ball (cid:104)xball, yball(cid:105), the goalkeeper\n(cid:104)xgoalkeeper, ygoalkeeper(cid:105) and the constant goal location (cid:104)xgoal, ygoal(cid:105). R2 domain - we have the\naddition of the defender\u2019s location (cid:104)xdef ender, ydef ender(cid:105) to the state space. 
R3 domain - we add the locations of two defenders. Features: For the R1 domain, we tested both a linear and a degree-two polynomial feature representation for the hyperplanes. For the R2 and R3 domains, we also utilized a degree-two polynomial hyperplane feature representation. Actions: The striker has three actions: (1) move to the ball (M); (2) move to the ball and dribble towards the goal (D); (3) move to the ball and shoot towards the goal (S). Rewards: The reward setup is consistent with logical football strategies (Hausknecht & Stone, 2015; Bai et al., 2012): small negative (positive) rewards for shooting from outside (inside) the box and for dribbling when inside (outside) the box; large negative rewards for losing possession and kicking the ball out of bounds; and a large positive reward for scoring.

Different SP Optima: Since ASAP attains a locally optimal solution, it may sometimes learn different SPs. For the polynomial hyperplane feature representation, ASAP attained the two different solutions shown in Figure 2g(1) and Figure 2g(2), respectively. Both achieve near-optimal performance compared to the approximately optimal scoring controller (see supplementary material). For the linear feature representation, ASAP obtained the SPs and skill set in Figure 2g(3), achieving near-optimal performance (Figure 2j) and outperforming the polynomial representation.

SP Sensitivity: In the R2 domain, an additional player (the defender) is added to the game. It is expected that the presence of the defender will affect the shape of the learned SPs. ASAP again learns intuitive SPs. However, the shape of the learned SPs changes based on the pre-defined hyperplane feature vector ψ_{x,m}. Figure 2h(1) shows the learned SPs when the location of the defender is not used as a hyperplane feature. When the x location of the defender is utilized, the 'flatter' SPs in Figure 2h(2) are learned.
Using the y location of the defender as a hyperplane feature causes the hyperplane offset shown in Figure 2h(3). This is due to the striker learning to dribble around the defender in order to score a goal, as seen in Figure 2i. Finally, taking the ⟨x, y⟩ location of the defender into account results in the 'squashed' SPs shown in Figure 2h(4), clearly showing the sensitivity and adaptability of ASAP to dynamic factors in the environment.

7 Discussion

We have presented the Adaptive Skills, Adaptive Partitions (ASAP) framework, which automatically composes skills together and simultaneously learns a near-optimal skill set and skill partitions (the inter-skill policy) to correct an initially misspecified model. We derived the gradient update rules for both skill and skill-hyperplane parameters and incorporated them into a policy gradient framework. This is possible due to our definition of a generalized trajectory. In addition, ASAP has shown the potential to learn across multiple tasks as well as to automatically reuse skills. These are the necessary requirements for a truly general skill learning framework and can be applied to lifelong learning problems (Ammar et al., 2015; Thrun & Mitchell, 1995). An exciting extension of this work is to incorporate it into a Deep Reinforcement Learning framework, where both the skills and the ASAP policy can be represented as deep networks.

Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Program (FP/2007-2013) / ERC Grant Agreement n. 306638.

³ https://github.com/mhauskn/HFO.git

References

Akiyama, Hidehisa and Nakashima, Tomoharu. HELIOS Base: An open source package for the RoboCup Soccer 2D Simulation. In RoboCup 2013: Robot World Cup XVII, pp. 528–535. Springer, 2014.

Ammar, Haitham Bou, Tutunov, Rasul, and Eaton, Eric.
Safe policy search for lifelong reinforcement learning with sublinear regret. arXiv preprint arXiv:1505.05798, 2015.

Bacon, Pierre-Luc and Precup, Doina. The option-critic architecture. In NIPS Deep Reinforcement Learning Workshop, 2015.

Bai, Aijun, Wu, Feng, and Chen, Xiaoping. Online planning for large MDPs with MAXQ decomposition. In AAMAS, 2012.

da Silva, B.C., Konidaris, G.D., and Barto, A.G. Learning parameterized skills. In ICML, 2012.

Eaton, Eric and Ruvolo, Paul L. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 507–515, 2013.

Fu, Justin, Levine, Sergey, and Abbeel, Pieter. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. arXiv preprint arXiv:1509.06841, 2015.

Hausknecht, Matthew and Stone, Peter. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015.

Hauskrecht, Milos, Meuleau, Nicolas, et al. Hierarchical solution of Markov decision processes using macro-actions. In UAI, pp. 220–229, 1998.

Konidaris, George and Barto, Andrew G. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS, 2009.

Mankowitz, Daniel J, Mann, Timothy A, and Mannor, Shie. Time regularized interrupting options. In International Conference on Machine Learning, 2014.

Mann, Timothy A and Mannor, Shie. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Mann, Timothy Arthur, Mankowitz, Daniel J, and Mannor, Shie. Learning when to switch between skills in a high dimensional domain. In AAAI Workshop, 2015.

Masson, Warwick and Konidaris, George. Reinforcement learning with parameterized actions. arXiv preprint arXiv:1509.01644, 2015.

Peters, Jan and Schaal, Stefan.
Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 2219–2225. IEEE, 2006.

Peters, Jan and Schaal, Stefan. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21:682–691, 2008.

Precup, Doina and Sutton, Richard S. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems 10 (Proceedings of NIPS'97), 1997.

Precup, Doina, Sutton, Richard S, and Singh, Satinder. Theoretical results on reinforcement learning with temporally abstract options. In Machine Learning: ECML-98, pp. 382–393. Springer, 1998.

Silver, David and Ciosek, Kamil. Compositional planning using optimal option models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, 2012.

Sutton, Richard S, Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

Sutton, Richard S, McAllester, David, Singh, Satinder, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057–1063, 2000.

Thrun, Sebastian and Mitchell, Tom M. Lifelong robot learning. Springer, 1995.