{"title": "Reinforcement Learning with Hierarchies of Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1043, "page_last": 1049, "abstract": null, "full_text": "Reinforcement Learning with Hierarchies of Machines*\n\nRonald Parr and Stuart Russell\n\nComputer Science Division, UC Berkeley, CA 94720\n\n{parr,russell}@cs.berkeley.edu\n\nAbstract\n\nWe present a new approach to reinforcement learning in which the policies considered by the learning process are constrained by hierarchies of partially specified machines. This allows for the use of prior knowledge to reduce the search space and provides a framework in which knowledge can be transferred across problems and in which component solutions can be recombined to solve larger and more complicated problems. Our approach can be seen as providing a link between reinforcement learning and \"behavior-based\" or \"teleo-reactive\" approaches to control. We present provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrate their effectiveness on a problem with several thousand states.\n\n1  Introduction\nOptimal decision making in virtually all spheres of human activity is rendered intractable by the complexity of the task environment. Generally speaking, the only way around intractability has been to provide a hierarchical organization for complex activities. Although it can yield suboptimal policies, top-down hierarchical control often reduces the complexity of decision making from exponential to linear in the size of the problem. For example, hierarchical task network (HTN) planners can generate solutions containing tens of thousands of steps [5], whereas \"flat\" planners can manage only tens of steps. 
\nHTN planners are successful because they use a plan library that describes the decomposition of high-level activities into lower-level activities. This paper describes an approach to learning and decision making in uncertain environments (Markov decision processes) that uses a roughly analogous form of prior knowledge. We use hierarchical abstract machines (HAMs), which impose constraints on the policies considered by our learning algorithms. HAMs consist of nondeterministic finite state machines whose transitions may invoke lower-level machines. Nondeterminism is represented by choice states where the optimal action is yet to be decided or learned. The language allows a variety of prior constraints to be expressed, ranging from no constraint all the way to a fully specified solution. One useful intermediate point is the specification of just the general organization of behavior into a layered hierarchy, leaving it up to the learning algorithm to discover exactly which lower-level activities should be invoked by higher levels at each point.\n\n*This research was supported in part by ARO under the MURI program \"Integrated Approach to Intelligent Systems,\" grant number DAAH04-96-1-0341.\n\nFigure 1: (a) An MDP with approximately 3600 states. The initial state is in the top left. (b) Close-up showing a typical obstacle. (c) Nondeterministic finite-state controller for negotiating obstacles.\n\nThe paper begins with a brief review of Markov decision processes (MDPs) and a description of hierarchical abstract machines. We then present, in abbreviated form, the following results: 1) Given any HAM and any MDP, there exists a new MDP such that the optimal policy in the new MDP is optimal in the original MDP among those policies that satisfy the constraints specified by the HAM. 
This means that even with complex machine specifications we can still apply standard decision-making and learning methods. 2) An algorithm exists that determines this optimal policy, given an MDP and a HAM. 3) On an illustrative problem with 3600 states, this algorithm yields dramatic performance improvements over standard algorithms applied to the original MDP. 4) A reinforcement learning algorithm exists that converges to the optimal policy, subject to the HAM constraints, with no need to construct explicitly a new MDP. 5) On the sample problem, this algorithm learns dramatically faster than standard RL algorithms. We conclude with a discussion of related approaches and ongoing work.\n\n2  Markov Decision Processes\nWe assume the reader is familiar with the basic concepts of MDPs. To review, an MDP is a 4-tuple (S, A, T, R), where S is a set of states, A is a set of actions, T is a transition model mapping S × A × S into probabilities in [0, 1], and R is a reward function mapping S × A × S into real-valued rewards. Algorithms for solving MDPs can return a policy π that maps from S to A, a real-valued value function V on states, or a real-valued Q-function on state-action pairs. In this paper, we focus on infinite-horizon MDPs with a discount factor β. The aim is to find an optimal policy π* (or, equivalently, V* or Q*) that maximizes the expected discounted total reward of the agent.\nThroughout the paper, we will use as an example the MDP shown in Figure 1(a). Here A contains four primitive actions (up, down, left, right). The transition model, T, specifies that each action succeeds 80% of the time, while 20% of the time the agent moves in an unintended perpendicular direction. The agent begins in a start state in the upper left corner. A reward of 5.0 is given for reaching the goal state and the discount factor β is 0.999. 
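As an illustration only, the noisy transition model of the example MDP can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in the experiments: the text gives the 80%/20% split but does not say how the 20% is divided, so the sketch assumes it is shared evenly between the two perpendicular directions, and it assumes that a move into a blocked cell leaves the agent in place. The names MOVES, PERPENDICULAR, direction_distribution and next_state_distribution are hypothetical.

```python
# Hypothetical sketch of the noisy grid transition model described above.
# Assumption (not stated in the text): the 20% slip probability is split
# evenly between the two perpendicular directions.
MOVES = {'up': (0, -1), 'down': (0, 1), 'left': (-1, 0), 'right': (1, 0)}
PERPENDICULAR = {
    'up': ('left', 'right'),
    'down': ('left', 'right'),
    'left': ('up', 'down'),
    'right': ('up', 'down'),
}


def direction_distribution(action, p_success=0.8):
    '''Map each direction the agent may actually move in to its probability.'''
    slip = (1.0 - p_success) / 2.0
    first, second = PERPENDICULAR[action]
    return {action: p_success, first: slip, second: slip}


def next_state_distribution(state, action, blocked=frozenset()):
    '''Distribution over next (x, y) cells for one primitive action.'''
    dist = {}
    for direction, prob in direction_distribution(action).items():
        dx, dy = MOVES[direction]
        target = (state[0] + dx, state[1] + dy)
        if target in blocked:
            # Assumption: bumping into an obstacle leaves the agent in place.
            target = state
        dist[target] = dist.get(target, 0.0) + prob
    return dist
```

Each entry of the returned dictionary plays the role of T(s, a, s') for one successor state s'; the probabilities for a given state and action sum to one.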
\n\n3  Hierarchical abstract machines\nA HAM is a program which, when executed by an agent in an environment, constrains the actions that the agent can take in each state. For example, a very simple machine might dictate, \"repeatedly choose right or down,\" which would eliminate from consideration all policies that go up or left. HAMs extend this simple idea of constraining policies by providing a hierarchical means of expressing constraints at varying levels of detail and specificity. Machines for HAMs are defined by a set of states, a transition function, and a start function that determines the initial state of the machine. Machine states are of four types: Action states execute an action in the environment. Call states execute another machine as a subroutine. Choice states nondeterministically select a next machine state. Stop states halt execution of the machine and return control to the previous call state.\nThe transition function determines the next machine state after an action or call state as a stochastic function of the current machine state and some features of the resulting environment state. Machines will typically use a partial description of the environment to determine the next state. Although machines can function in partially observable domains, for the purposes of this paper we make the standard assumption that the agent has access to a complete description as well.\nA HAM is defined by an initial machine in which execution begins and the closure of all machines reachable from the initial machine. Figure 1(c) shows a simplified version of one element of the HAM we used for the MDP in Figure 1. This element is used for traversing a hallway while negotiating obstacles of the kind shown in Figure 1(b). 
It runs until the end of the hallway or an intersection is reached. When it encounters an obstacle, a choice point is created to choose between two possible next machine states. One calls the backoff machine to back away from the obstacle and then move forward until the next one. The other calls the follow-wall machine to try to get around the obstacle. The follow-wall machine is very simple and will be tricked by obstacles that are concave in the direction of intended movement; the backoff machine, on the other hand, can move around any obstacle in this world but could waste time backing away from some obstacles unnecessarily and should be used sparingly.\nOur complete \"navigation HAM\" involves a three-level hierarchy, somewhat reminiscent of a Brooks-style architecture but with hard-wired decisions replaced by choice states. The top level of the hierarchy is basically just a choice state for choosing a hallway navigation direction from the four coordinate directions. This machine has control initially and regains control at intersections or corners. The second level of the hierarchy contains four machines for moving along hallways, one for each direction. Each machine at this level has a choice state with four basic strategies for handling obstacles. Two back away from obstacles and two attempt to follow walls to get around obstacles. The third level of the hierarchy implements these strategies using the primitive actions.\nThe transition function for this HAM assumes that an agent executing the HAM has access to a short-range, low-directed sonar that detects obstacles in any of the four axis-parallel adjacent squares and a long-range, high-directed sonar that detects larger objects such as the intersections and the ends of hallways. The HAM uses these partial state descriptions to identify feasible choices. 
For example, the machine to traverse a hallway northwards would not be called from the start state because the high-directed sonar would detect a wall to the north.\nOur navigation HAM represents an abstract plan to move about the environment by repeatedly selecting a direction and pursuing this direction until an intersection is reached. Each machine for navigating in the chosen direction represents an abstract plan for moving in a particular direction while avoiding obstacles. The next section defines how a HAM interacts with a specific MDP and how to find an optimal policy that respects the HAM constraints.\n\nFigure 2: Experimental results showing policy value (at the initial state) as a function of runtime on the domain shown in Figure 1. (a) Policy iteration with and without the HAM. (b) Q-learning with and without the HAM (averaged over 10 runs). [Plots not reproduced; legend entries: \"With HAM\" and \"Without HAM\"; axis label: runtime (seconds).]\n\n4  Defining and solving the HAM-induced MDP\nA policy for a model, M, that is HAM-consistent with HAM H is a scheme for making choices whenever an agent executing H in M enters a choice state. To find the optimal HAM-consistent policy we apply H to M to yield an induced MDP, H ∘ M. A somewhat simplified description of the construction of H ∘ M is as follows: 1) The set of states in H ∘ M is the cross-product of the states of H with the states of M. 2) For each state in H ∘ M where the machine component is an action state, the model and machine transition functions are combined. 
3) For each state where the machine component is a choice state, actions that change only the machine component of the state are introduced. 4) The reward is taken from M for primitive actions, otherwise it is zero. With this construction, we have the following (proof omitted):\nLemma 1 For any Markov decision process M and any HAM H (see footnote 1), the induced process H ∘ M is a Markov decision process.\nLemma 2 If π is an optimal policy for H ∘ M, then the primitive actions specified by π constitute the optimal policy for M that is HAM-consistent with H.\nOf course, H ∘ M may be quite large. Fortunately, there are two things that will make the problem much easier in most cases. The first is that not all pairs of HAM states and environment states will be possible, i.e., reachable from an initial state. The second is that the actual complexity of the induced MDP is determined by the number of choice points, i.e., states of H ∘ M in which the HAM component is a choice state. This leads to the following:\nTheorem 1 For any MDP, M, and HAM, H, let C be the set of choice points in H ∘ M. There exists a decision process, reduce(H ∘ M), with states C such that the optimal policy for reduce(H ∘ M) corresponds to the optimal policy for M that is HAM-consistent with H.\nProof sketch We begin by applying Lemma 1 and then observing that in states of H ∘ M where the HAM component is not a choice state, only one action is permitted. These states can be removed to produce an equivalent Semi-Markov decision process (SMDP). (SMDPs are a generalization of Markov decision processes that permit different discount rates for different transitions.) The optimal policy for this SMDP will be the same as the optimal policy for H ∘ M and by Lemma 2, this will be the optimal policy for M that is HAM-consistent with H.  
□\nThis theorem formally establishes the mechanism by which the constraints embodied in a HAM can be used to simplify an MDP. As an example of the power of this theorem, and to demonstrate that this transformation can be done efficiently, we applied our navigation HAM to the problem described in the previous section. Figure 2(a) shows the results of applying policy iteration to the original model and to the transformed model. Even when we add in the cost of transformation (which, with our rather underoptimized code, takes 866 seconds), the HAM method produces a good policy in less than a quarter of the time required to find the optimal policy in the original model. The actual solution time is 185 seconds versus 4544 seconds.\nAn important property of the HAM approach is that model transformation produces an MDP that is an accurate model of the application of the HAM to the original MDP. Unlike typical approximation methods for MDPs, the HAM method can give strict performance guarantees. The solution to the transformed model reduce(H ∘ M) is the optimal solution from within a well-defined class of policies and the value assigned to this solution is the true expected value of applying the concrete HAM policy to the original MDP.\n\nFootnote 1: To preserve the Markov property, we require that if a machine has more than one possible caller in the hierarchy, each appearance is treated as a distinct machine. This is equivalent to requiring that the call graph for the HAM is a tree. It follows from this that circular calling sequences are also forbidden.\n\n5  Reinforcement learning with HAMs\nHAMs can be of even greater advantage in a reinforcement learning context, where the effort required to obtain a solution typically scales very badly with the size of the problem. 
\nHAM constraints can focus exploration of the state space, reducing the \"blind search\" phase that reinforcement learning agents must endure while learning about a new environment. Learning will also be faster for the same reason policy iteration is faster in the HAM-induced model; the agent is effectively operating in a reduced state space.\nWe now introduce a variation of Q-learning called HAMQ-learning that learns directly in the reduced state space without performing the model transformation described in the previous section. This is significant because the environment model is not usually known a priori in reinforcement learning contexts.\nA HAMQ-learning agent keeps track of the following quantities: t, the current environment state; n, the current machine state; s_c and m_c, the environment state and machine state at the previous choice point; a, the choice made at the previous choice point; and r_c and β_c, the total accumulated reward and discount since the previous choice point. It also maintains an extended Q-table, Q([s, m], a), which is indexed by an environment-state/machine-state pair and by an action taken at a choice point.\nFor every environment transition from state s to state t with observed reward r and discount β, the HAMQ-learning agent updates: r_c ← r_c + β_c r and β_c ← β β_c. For each transition to a choice point, the agent does\n\nQ([s_c, m_c], a) ← Q([s_c, m_c], a) + α[r_c + β_c V([t, n]) − Q([s_c, m_c], a)],\n\nand then r_c ← 0, β_c ← 1.\nTheorem 2 For any finite-state MDP, M, and any HAM, H, HAMQ-learning will converge to the optimal choice for every choice point in reduce(H ∘ M) with probability 1.\nProof sketch We note that the expected reinforcement signal in HAMQ-learning is the same as the expected reinforcement signal that would be received if the agent were acting directly in the transformed model of Theorem 1 above. 
Thus, Theorem 1 of [11] can be applied to prove the convergence of the HAMQ-learning agent, provided that we enforce suitable constraints on the exploration strategy and the update parameter decay rate. □\nWe ran some experiments to measure the performance of HAMQ-learning on our sample problem. Exploration was achieved by selecting actions according to the Boltzmann distribution with a temperature parameter for each state. We also used an inverse decay for the update parameter α. Figure 2(b) compares the learning curves for Q-learning and HAMQ-learning. HAMQ-learning appears to learn much faster: Q-learning required 9,000,000 iterations to reach the level achieved by HAMQ-learning after 270,000 iterations. Even after 20,000,000 iterations, Q-learning did not do as well as HAMQ-learning (see footnote 2).\n\nFootnote 2: Speedup techniques such as eligibility traces could be applied to get better Q-learning results; such methods apply equally well to HAMQ-learning.\n\n6  Related work\nState aggregation (see, e.g., [18] and [7]) clusters \"similar\" states together and assigns them the same value, effectively reducing the state space. This is orthogonal to our approach and could be combined with HAMs. However, aggregation should be used with caution as it treats distinct states as a single state and can violate the Markov property, leading to the loss of performance guarantees and oscillation or divergence in reinforcement learning. Moreover, state aggregation may be hard to apply effectively in many cases.\nDean and Lin [8] and Bertsekas and Tsitsiklis [2] showed that some MDPs are loosely coupled and hence amenable to divide-and-conquer algorithms. A machine-like language was used in [13] to partition an MDP into decoupled subproblems. 
In problems that are amenable to decoupling, these approaches could be used in combination with HAMs.\nDayan and Hinton [6] have proposed feudal RL, which specifies an explicit subgoal structure, with fixed values for each subgoal achieved, in order to achieve a hierarchical decomposition of the state space. Dietterich extends and generalizes this approach in [9]. Singh has investigated a number of approaches to subgoal-based decomposition in reinforcement learning (e.g., [17] and [16]). Subgoals seem natural in some domains, but they may require a significant amount of outside knowledge about the domain, and establishing the relationship between the value of subgoals with respect to the overall problem can be difficult.\nBradtke and Duff [3] proposed an RL algorithm for SMDPs. Sutton [19] proposes temporal abstractions, which concatenate sequences of state transitions together to permit reasoning about temporally extended events, and which can thereby form a behavioral hierarchy as in [14] and [15]. Lin's somewhat informal scheme [12] also allows agents to treat entire policies as single actions. These approaches can be encompassed within our framework by encoding the events or behaviors as machines.\nThe design of hierarchically organized, \"layered\" controllers was popularized by Brooks [4]. His designs use a somewhat different means of passing control, but our analysis and theorems apply equally well to his machine description language. The \"teleo-reactive\" agent designs of Benson and Nilsson [1] are even closer to our HAM language. Both of these approaches assume that the agent is completely specified, albeit self-modifiable. The idea of partial behavior descriptions can be traced at least to Hsu's partial programs [10], which were used with a deterministic logical planner. 
\n\n7  Conclusions and future work\nWe have presented HAMs as a principled means of constraining the set of policies that are considered for a Markov decision process and we have demonstrated the efficacy of this approach in a simple example for both policy iteration and reinforcement learning. Our results show very significant speedup for decision-making and learning, but of course, this reflects the provision of knowledge in the form of the HAM. The HAM language provides a very general method of transferring knowledge to an agent and we have only scratched the surface of what can be done with this approach.\nWe believe that if desired, subgoal information can be incorporated into the HAM structure, unifying subgoal-based approaches with the HAM approach. Moreover, the HAM structure provides a natural decomposition of the HAM-induced model, making it amenable to the divide-and-conquer approaches of [8] and [2].\nThere are opportunities for generalization across all levels of the HAM paradigm. Value function approximation can be used for the HAM-induced model and inductive learning methods can be used to produce HAMs or to generalize their effects upon different regions of the state space. Gradient-following methods also can be used to adjust the transition probabilities of a stochastic HAM.\nHAMs also lend themselves naturally to partially observable domains. They can be applied directly when the choice points induced by the HAM are states where no confusion about the true state of the environment is possible. The application of HAMs to more general partially observable domains is more complicated and is a topic of ongoing research. We also believe that the HAM approach can be extended to cover the average-reward optimality criterion. 
\nWe expect that successful pursuit of these lines of research will provide a formal basis for understanding and unifying several seemingly disparate approaches to control, including behavior-based methods. It should also enable the use of the MDP framework in real-world applications of much greater complexity than hitherto attacked, much as HTN planning has extended the reach of classical planning methods.\n\nReferences\n[1] S. Benson and N. Nilsson. Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, and S. Muggleton, editors, Machine Intelligence 14. Oxford University Press, Oxford, 1995.\n\n[2] D. C. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, New Jersey, 1989.\n\n[3] S. J. Bradtke and M. O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7: Proc. of the 1994 Conference, Denver, Colorado, December 1995. MIT Press.\n\n[4] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2, 1986.\n\n[5] K. W. Currie and A. Tate. O-Plan: the Open Planning Architecture. Artificial Intelligence, 52(1), November 1991.\n\n[6] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, editors, Neural Information Processing Systems 5, San Mateo, California, 1993. Morgan Kaufmann.\n\n[7] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proc. of the Thirteenth Conference on Uncertainty in Artificial Intelligence, Providence, Rhode Island, August 1997. Morgan Kaufmann.\n\n[8] T. Dean and S.-H. Lin. 
Decomposition techniques for planning in stochastic domains. In Proc. of the Fourteenth Int. Joint Conference on Artificial Intelligence, Montreal, Canada, August 1995. Morgan Kaufmann.\n\n[9] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Technical report, Department of Computer Science, Oregon State University, Corvallis, Oregon, 1997.\n\n[10] Y.-J. Hsu. Synthesizing efficient agents from partial programs. In Methodologies for Intelligent Systems: 6th Int. Symposium, ISMIS '91, Proc., Charlotte, North Carolina, October 1991. Springer-Verlag.\n\n[11] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.\n\n[12] L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Computer Science Department, Carnegie-Mellon University, Pittsburgh, Pennsylvania, 1993.\n\n[13] Shieu-Hong Lin. Exploiting Structure for Planning and Control. PhD thesis, Computer Science Department, Brown University, Providence, Rhode Island, 1997.\n\n[14] A. McGovern, R. S. Sutton, and A. H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In 1997 Grace Hopper Celebration of Women in Computing, 1997.\n\n[15] D. Precup and R. S. Sutton. Multi-time models for temporally abstract planning. In This Volume.\n\n[16] S. P. Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, July 1992. Morgan Kaufmann.\n\n[17] S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3), May 1992.\n\n[18] S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, and T. K. 
Leen, editors, Neural Information Processing Systems 7, Cambridge, Massachusetts, 1995. MIT Press.\n\n[19] R. S. Sutton. Temporal abstraction in reinforcement learning. In Proc. of the Twelfth Int. Conference on Machine Learning, Tahoe City, CA, July 1995. Morgan Kaufmann.\n", "award": [], "sourceid": 1384, "authors": [{"given_name": "Ronald", "family_name": "Parr", "institution": null}, {"given_name": "Stuart", "family_name": "Russell", "institution": null}]}