{"title": "Finding Structure in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 385, "page_last": 392, "abstract": null, "full_text": "Finding Structure in Reinforcement Learning \n\nSebastian Thrun \nUniversity of Bonn \n\nDepartment of Computer Science nr \nR6merstr. 164, D-53117 Bonn, Germany \n\nE-mail: thrun@carbon.informatik.uni-bonn.de \n\nAnton Schwartz \n\nDept. of Computer Science \n\nStanford University \nStanford, CA 94305 \n\nEmail: schwartz@cs.stanford.edu \n\nAbstract \n\nReinforcement learning addresses the problem of learning to select actions in order to \nmaximize one's performance in unknown environments. To scale reinforcement learning \nto complex real-world tasks, such as typically studied in AI, one must ultimately be able \nto discover the structure in the world, in order to abstract away the myriad of details and \nto operate in more tractable problem spaces. \nThis paper presents the SKILLS algorithm. SKILLS discovers skills, which are partially \ndefined action policies that arise in the context of multiple, related tasks. Skills collapse \nwhole action sequences into single operators. They are learned by minimizing the com(cid:173)\npactness of action policies, using a description length argument on their representation. \nEmpirical results in simple grid navigation tasks illustrate the successful discovery of \nstructure in reinforcement learning. \n\n1 Introduction \nReinforcement learning comprises a family of incremental planning algorithms that construct \nreactive controllers through real-world experimentation. A key scaling problem of reinforce(cid:173)\nment learning, as is generally the case with unstructured planning algorithms, is that in large \nreal-world domains there might be an enormous number of decisions to be made, and pay-off \nmay be sparse and delayed. 
Hence, instead of learning all single fine-grain actions at the same time, one could conceivably learn much faster if one abstracted away the myriad of micro-decisions, and focused instead on a small set of important decisions. But this immediately raises the problem of how to recognize the important, and how to distinguish it from the unimportant. \nThis paper presents the SKILLS algorithm. SKILLS finds partially defined action policies, called skills, that occur in more than one task. Skills, once found, constitute parts of solutions to multiple reinforcement learning problems. In order to find maximally useful skills, a description length argument is employed. Skills reduce the number of bytes required to describe action policies. This is because instead of having to describe a complete action policy for each task separately, as is the case in plain reinforcement learning, skills constrain multiple tasks to pick the same actions, and thus reduce the total number of actions required for representing action policies. However, using skills comes at a price. In general, one cannot constrain actions to be the same in multiple tasks without ultimately suffering a loss in performance. Hence, in order to find maximally useful skills that incur a minimum loss in performance, the SKILLS algorithm minimizes a function of the form \n\nE = PERFORMANCE LOSS + η · DESCRIPTION LENGTH.    (1) \n\nThis equation summarizes the rationale of the SKILLS approach. The remainder of this paper gives more precise definitions and learning rules for the terms \"PERFORMANCE LOSS\" and \"DESCRIPTION LENGTH,\" using the vocabulary of reinforcement learning. In addition, experimental results empirically illustrate the successful discovery of skills in simple grid navigation domains. 
\n\n2 Reinforcement Learning \nReinforcement learning addresses the problem of learning, through experimentation, to act so as to maximize one's pay-off in an unknown environment. Throughout this paper we will assume that the environment of the learner is a partially controllable Markov chain [1]. At any instant in time the learner can observe the state of the environment, denoted by s ∈ S, and apply an action, a ∈ A. Actions change the state of the environment, and also produce a scalar pay-off value, denoted by r_{s,a} ∈ ℝ. Reinforcement learning seeks to identify an action policy, π : S → A, i.e., a mapping from states s ∈ S to actions a ∈ A that, if actions are selected accordingly, maximizes the expected discounted sum of future pay-off \n\nR = E[ Σ_{t=t₀}^∞ γ^{t-t₀} · r_t ].    (2) \n\nHere γ (with 0 ≤ γ ≤ 1) is a discount factor that favors pay-offs reaped sooner in time, and r_t refers to the expected pay-off at time t. In general, pay-off might be delayed. Therefore, in order to learn an optimal π, one has to solve a temporal credit assignment problem [11]. \nTo date, the single most widely used algorithm for learning from delayed pay-off is Q-Learning [14]. Q-Learning solves the problem of learning π by learning a value function, denoted by Q : S × A → ℝ. Q maps states s ∈ S and actions a ∈ A to scalar values. After learning, Q(s, a) ranks actions according to their goodness: The larger the expected cumulative pay-off for picking action a at state s, the larger the value Q(s, a). Hence Q, once learned, allows the learner to maximize R by picking actions greedily with respect to Q: \n\nπ(s) = argmax_{a ∈ A} Q(s, a) \n\nThe value function Q is learned on-line through experimentation. Initially, all values Q(s, a) are set to zero. 
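To make this concrete, here is a minimal tabular sketch in Python of greedy action selection and the standard on-line Q-Learning value update of Watkins [14]. The function names, the state/action encoding, and the parameter defaults are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict

# All values Q(s, a) start at zero, as described in the text.
Q = defaultdict(float)

def greedy_policy(Q, s, actions):
    """pi(s) = argmax_{a in A} Q(s, a): pick actions greedily w.r.t. Q."""
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Watkins' Q-Learning update:
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    v_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
```

With a zero-initialized table, a single update after observing pay-off r moves Q(s, a) a fraction alpha of the way toward the bootstrapped target, exactly the incremental scheme the text describes.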
Suppose during learning the learner executes action a at state s, which leads to a new state s' and the immediate pay-off r_{s,a}. Q-Learning uses this state transition to update Q(s, a): \n\nQ(s, a) ← (1 - α) · Q(s, a) + α · (r_{s,a} + γ · V(s'))    (3) \n\nwith V(s') = max_a Q(s', a) \n\nThe scalar α (0 < α ≤ 1) is a learning rate, and η > 0 is a gain parameter that trades off the two target functions of Eq. (1), the performance loss and the description length. E-optimal policies make heavy use of large skills, yet result in a minimum loss in performance. Notice that the state space may be partitioned completely by skills, in which case solutions to the individual tasks can be uniquely described by the skills and their usages. If such a complete partitioning does not exist, however, tasks may instead rely to some extent on task-specific, local policies. \n\n4 Derivation of the Learning Algorithm \nEach skill k is characterized by three types of adjustable variables: skill actions π_k(s), the skill domain S_k, and skill usages u_{b,k}, one for each task b ∈ B. In this section we give update rules that perform hill-climbing in E for each of these variables. As in Q-Learning, these rules apply only at the currently visited state (henceforth denoted by s). Learning action policies (cf. Eq. (3)) and learning skills are fully interleaved. \nActions. Determining skill actions is straightforward, since the action prescribed by a skill exclusively affects the performance loss, but does not play any part in the description length. Hence, the action policy π_k(s) minimizes LOSS(s) (cf. Eqs. (5) and (6)): \n\nπ_k(s) = argmax_{a ∈ A} Σ_{b ∈ B} P_b(k|s) · Q_b(s, a)    (9) \n\nDomains. Initially, each skill domain S_k contains only a single state that is chosen at random. S_k is changed incrementally by minimizing E(s) for states s which are visited during learning. 
More specifically, for each skill k, it is evaluated whether or not to include s in S_k by considering E(s) = LOSS(s) + η · DL(s): \n\ns ∈ S_k if and only if E(s)|_{s ∈ S_k} < E(s)|_{s ∉ S_k}, otherwise s ∉ S_k    (10) \n\nIf the domain of a skill k vanishes completely, i.e., if S_k = ∅, it is re-initialized by a randomly selected state. In addition, all usage values {u_{b,k} | b ∈ B} are initialized randomly. This mechanism ensures that skills, once overturned by other skills, will not get lost forever. \nUsages. Unlike skill domains, which are discrete quantities, usages are real-valued numbers. Initially, they are chosen at random in [0, 1]. Usages are optimized by stochastic gradient descent in E. According to Eq. (8), the derivative of E(s) is the sum of ∂LOSS(s)/∂u_{b,k} and η · ∂DL(s)/∂u_{b,k}. The first term is governed by \n\n∂LOSS(s)/∂u_{b,k} = -∂V_b(s)/∂u_{b,k} = -(∂P_b*(s)/∂u_{b,k}) · Q_b(s, π_b*(s)) - Σ_{j ∈ K} (∂P_b(j|s)/∂u_{b,k}) · Q_b(s, π_j(s))    (11) \n\nwith the derivatives ∂P_b(j|s)/∂u_{b,k} (12) and ∂P_b*(s)/∂u_{b,k} (13) obtained from the definitions of P_b(j|s) and P_b*(s). Here δ_{kj} denotes the Kronecker delta function, which is 1 if k = j and 0 otherwise. The second term, ∂DL(s)/∂u_{b,k}, is governed by ∂P_b*(s)/∂u_{b,k} and can be further transformed using Eqs. (12) and (11). In order to minimize E, usages are incrementally refined in the opposite direction of the gradients: \n\nu_{k,b} ← u_{k,b} - β · ( -∂V(s)/∂u_{k,b} + η · ∂DL(s)/∂u_{k,b} )    (14) \n\nHere β > 0 is a small learning rate. This completes the derivation of the SKILLS algorithm. After each action execution, Q-Learning is employed to update the Q-function. SKILLS also re-calculates, for any applicable skill, the skill policy according to Eq. (9), and adjusts skill domains and usage values based upon Eqs. (10) and (14). 
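The two per-step update rules above, Eq. (9) for skill actions and Eq. (10) for skill domains, can be sketched in Python. Here `Q`, `P`, and `E` are hypothetical stand-ins for the tabular task value functions, the skill-selection probabilities P_b(k|s), and the combined error E(s) defined in the paper; this illustrates the decision logic only, not the authors' implementation.

```python
def skill_action(s, k, tasks, Q, P, actions):
    """Eq. (9): pi_k(s) = argmax_a sum_b P_b(k|s) * Q_b(s, a).

    Q[b] is task b's tabular value function; P(b, k, s) is a stand-in for
    the probability that task b uses skill k at state s (hypothetical)."""
    return max(actions,
               key=lambda a: sum(P(b, k, s) * Q[b][(s, a)] for b in tasks))

def update_skill_domain(s, k, domain, E):
    """Eq. (10): include state s in the skill domain S_k if and only if
    membership lowers the combined error E(s) = LOSS(s) + eta * DL(s).

    E(s, in_domain) is a hypothetical callable evaluating that error
    under the two membership assumptions."""
    if E(s, in_domain=True) < E(s, in_domain=False):
        domain.add(s)
    else:
        domain.discard(s)
```

Applied at each visited state, these two rules, together with the Q-Learning update and the usage gradient step of Eq. (14), make up one interleaved SKILLS iteration.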
\n\n5 Experimental Results \nThe SKILLS algorithm was applied to discover skills in a simple, discrete grid-navigation domain, depicted in Fig. 1. At each state, the agent can move to one of at most eight adjacent grid cells. With a 10% chance the agent is carried to a random neighboring state, regardless of the commanded action. Each corner defines a starting state for one out of four tasks, with the corresponding goal state being in the opposite corner. The pay-off (costs) for executing actions is -1, except for the goal state, which is an absorbing state with zero pay-off. In a first experiment, we supplied the agent with two skills K = {k₁, k₂}. All four tasks were trained in a time-shared manner, with time slices being 2,000 steps long. We used the following parameter settings: η = 1.2, γ = 1, α = 0.1, and β = 0.001. \nAfter 30,000 training steps for each task, the SKILLS algorithm has successfully discovered the two skills shown in Figure 1. One of these skills leads the agent to the right door, and the second to the left. \n\nFigure 1: Simple 3-room environment. Start and goal states are marked by circles. The diagram also shows two skills (black states), which lead to the doors connecting the rooms. \n\nEach skill is employed by two tasks. By forcing two tasks to adopt a single policy in the region of the skill, they both have to sacrifice performance, but the loss in performance is comparatively small. Beyond the door, however, optimal actions point into opposite directions. There, forcing both tasks to select actions according to the same policy would result in a significant performance loss, which would clearly outweigh the savings in description length. The solution shown in Fig. 1 is (approximately) the global minimum of E, given that only two skills are available. 
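For concreteness, the dynamics of this navigation domain, stochastic moves with a 10% slip probability, a pay-off of -1 per step, and an absorbing zero-pay-off goal, might be sketched as follows. The grid layout, the wall handling, and the class interface are illustrative assumptions; the rooms and doors of Fig. 1 are omitted.

```python
import random

class GridWorld:
    """Stochastic grid navigation: each commanded move succeeds with
    probability 1 - slip; otherwise the agent is carried to a random
    neighboring cell. Every step costs -1 until the absorbing goal is
    reached (illustrative sketch, not the authors' code)."""

    MOVES = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]  # up to eight neighbors

    def __init__(self, width, height, goal, slip=0.1):
        self.width, self.height = width, height
        self.goal, self.slip = goal, slip

    def neighbors(self, s):
        x, y = s
        return [(x + dx, y + dy) for dx, dy in self.MOVES
                if 0 <= x + dx < self.width and 0 <= y + dy < self.height]

    def step(self, s, a):
        """Apply move index a at state s; return (next_state, pay_off)."""
        if s == self.goal:
            return s, 0.0                      # absorbing goal, zero pay-off
        if random.random() < self.slip:
            s_next = random.choice(self.neighbors(s))
        else:
            x, y = s
            dx, dy = self.MOVES[a]
            s_next = (x + dx, y + dy)
            if s_next not in self.neighbors(s):
                s_next = s                     # off-grid move: stay put (assumption)
        return s_next, -1.0
```

Running the four corner-to-corner tasks on such an environment with tabular Q-Learning reproduces the setting in which SKILLS is applied above.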
It is easy to see that these skills establish helpful building blocks for many navigation tasks. \nWhen using more than two skills, E can be minimized further. We repeated the experiment using six skills, which can partition the state space in a more efficient way. Two of the resulting skills were similar to the skills shown in Fig. 1, but they were defined only between the doors. The other four skills were policies for moving out of a corner, one for each corner. Each of the latter four skills can be used in three tasks (unlike two tasks for passing through the middle room), resulting in an improved description length when compared to the two-skill solution shown in Fig. 1. \nWe also applied skill learning to a more complex grid world, using 25 skills for a total of 20 tasks. The environment, along with one of the skills, is depicted in Fig. 2. Different tasks were defined by different starting positions, goal positions and door configurations, which could be open or closed. The training time was typically an order of magnitude slower than in the previous task, and skills were less stable over time. However, Fig. 2 illustrates that modular skills could be discovered even in such a complex domain. \n\n6 Discussion \nThis paper presents the SKILLS algorithm. SKILLS learns skills, which are partial policies that are defined on a subset of all states. Skills are used in as many tasks as possible, while affecting the performance in these tasks as little as possible. They are discovered by minimizing a combined measure, which takes task performance and a description length argument into account. \nWhile our empirical findings in simple grid world domains are encouraging, there are several open questions that warrant future research. \nLearning speed. In our experiments we found that the time required for finding useful skills is up to an order of magnitude larger than the time it takes to find close-to-optimal policies. 
\n\nFigure 2: Skill found in a more complex grid navigation task. \n\nSimilar findings are reported in [9]. This is because discovering skills is much harder than learning control. Initially, nothing is known about the structure of the state space, and unless reasonably accurate Q-tables are available, SKILLS cannot discover meaningful skills. Faster methods for learning skills, which might precede the development of optimal value functions, are clearly desirable. \nTransfer. We conjecture that skills can be helpful when one wants to learn new, related tasks. This is because if tasks are related, as is the case in many natural learning environments, skills allow knowledge to be transferred from previously learned tasks to new tasks. In particular, if the learner faces tasks with increasing complexity, as proposed by Singh [10], learning skills could conceivably reduce the learning time in complex tasks, and hence scale reinforcement learning techniques to more complex tasks. \nUsing function approximators. In this paper, performance loss and description length have been defined based on table look-up representations of Q. Recently, various researchers have applied reinforcement learning in combination with generalizing function approximators, such as nearest neighbor methods or artificial neural networks (e.g., [2, 4, 12, 13]). In order to apply the SKILLS algorithm together with generalizing function approximators, the notions of skill domains and description length have to be modified. For example, the membership function m_k, which defines the domain of a skill, could be represented by a function approximator, which would allow gradients of the description length to be derived. \nGeneralization in state space. In its current form, SKILLS exclusively discovers skills that are used across multiple tasks. 
However, skills might be useful under multiple circumstances even in single tasks. For example, the (generalized) skill of climbing a staircase may be useful several times in one and the same task. SKILLS, in its current form, cannot represent such skills. \nThe key to learning such generalized skills is generalization. Currently, skills generalize exclusively over tasks, since they can be applied to entire sets of tasks. However, they cannot generalize over states. One could imagine an extension to the SKILLS algorithm in which skills are free to pick what to generalize over. For example, they could choose to ignore certain state information (like the color of the staircase). It remains to be seen if effective learning mechanisms can be designed for learning such generalized skills. \nAbstractions and action hierarchies. In recent years, several researchers have recognized the importance of structuring reinforcement learning in order to build abstractions and action hierarchies. Different approaches differ in the origin of the abstraction, and the way it is incorporated into learning. For example, abstractions have been built upon previously learned, simpler tasks [9, 10], previously learned low-level behaviors [7], subgoals, which are either known in advance [15] or determined at random [6], or based on a pyramid of different levels of perceptual resolution, which produces a whole spectrum of problem solving capabilities [3]. For all these approaches, drastically improved problem solving capabilities have been reported, which are far beyond those of plain, unstructured reinforcement learning. \nThis paper exclusively focuses on how to discover the structure inherent in a family of related tasks. Using skills to form abstractions and learning in the resulting abstract problem spaces is beyond the scope of this paper. 
The experimental findings indicate, however, that skills are powerful candidates for operators on a more abstract level, because they collapse whole action sequences into single entities. \n\nReferences \n[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, to appear. \n[2] J. A. Boyan. Generalization in reinforcement learning: Safely approximating the value function. Same volume. \n[3] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 5, 1993. Morgan Kaufmann. \n[4] V. Gullapalli, J. A. Franklin, and H. Benbrahim. Acquiring robot skills via reinforcement learning. IEEE Control Systems, 1994. \n[5] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Technical Report 9307, Department of Brain and Cognitive Sciences, MIT, July 1993. \n[6] L. P. Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In Paul E. Utgoff, editor, Proceedings of the Tenth International Conference on Machine Learning, 1993. Morgan Kaufmann. \n[7] L.-J. Lin. Self-supervised Learning by Reinforcement and Artificial Neural Networks. PhD thesis, Carnegie Mellon University, School of Computer Science, 1992. \n[8] M. Ring. Two methods for hierarchy learning in reinforcement environments. In From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior. MIT Press, 1993. \n[9] S. P. Singh. Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence AAAI-92, 1992. AAAI Press/The MIT Press. \n[10] S. P. Singh. Transfer of learning by composing solutions for elemental sequential tasks. Machine Learning, 8, 1992. \n[11] R. S. Sutton. 
Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Department of Computer and Information Science, University of Massachusetts, 1984. \n[12] G. J. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8, 1992. \n[13] S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, 1993. Erlbaum Associates. \n[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989. \n[15] S. Whitehead, J. Karlsson, and J. Tenenberg. Learning multiple goal behavior via task decomposition and dynamic policy merging. In J. H. Connell and S. Mahadevan, editors, Robot Learning. Kluwer Academic Publishers, 1993. \n", "award": [], "sourceid": 887, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}, {"given_name": "Anton", "family_name": "Schwartz", "institution": null}]}