{"title": "Integrated Modeling and Control Based on Reinforcement Learning and Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 471, "page_last": 478, "abstract": null, "full_text": "Integrated Modeling and Control Based on Reinforcement Learning and Dynamic Programming\n\nRichard S. Sutton\nGTE Laboratories Incorporated\nWaltham, MA 02254\n\nAbstract\n\nThis is a summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures, Dyna-AHC and Dyna-Q. Using a navigation task, results are shown for a simple Dyna-AHC system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. We show that Dyna-Q architectures (based on Watkins's Q-learning) are easy to adapt for use in changing environments.\n\n1 Introduction to Dyna\n\nDyna architectures (Sutton, 1990) use learning algorithms to approximate the conventional optimal control technique known as dynamic programming (DP) (Bellman, 1957; Bertsekas, 1987). DP itself is not a learning method, but rather a computational method for determining optimal behavior given a complete model of the task to be solved. It is very similar to state-space search, but differs in that it is more incremental and never considers actual action sequences explicitly, only single actions at a time.
This makes DP more amenable to incremental planning at execution time, and also makes it more suitable for stochastic or incompletely modeled environments, as it need not consider the extremely large number of sequences possible in an uncertain environment. Learned world models are likely to be stochastic and uncertain, making DP approaches particularly promising for learning systems. Dyna architectures are those that learn a world model online while using approximations to DP to learn and plan optimal behavior.\n\nThe theory of Dyna is based on the theory of DP and on DP's relationship to reinforcement learning (Watkins, 1989; Barto, Sutton & Watkins, 1989, 1990), to temporal-difference learning (Sutton, 1988), and to AI methods for planning and search (Korf, 1990). Werbos (1987) has previously argued for the general idea of building AI systems that approximate dynamic programming, and Whitehead & Ballard (1989) and others (Sutton & Barto, 1981; Sutton & Pinette, 1985; Rumelhart et al., 1986; Lin, 1991; Riolo, 1991) have presented results for the specific idea of augmenting a reinforcement learning system with a world model used for planning.\n\n2 Dyna-AHC: Dyna by Approximating Policy Iteration\n\nThe Dyna-AHC architecture is based on approximating a DP method known as policy iteration (see Bertsekas, 1987). It consists of four components interacting as shown in Figure 1. The policy is simply the function formed by the current set of reactions; it receives as input a description of the current state of the world and produces as output an action to be sent to the world. The world represents the task to be solved; prototypically it is the robot's external environment. The world receives actions from the policy and produces a next-state output and a reward output. The overall task is defined as maximizing the long-term average reward per time step.
The architecture also includes an explicit world model. The world model is intended to mimic the one-step input-output behavior of the real world. Finally, the Dyna-AHC architecture includes an evaluation function that rapidly maps states to values, much as the policy rapidly maps states to actions. The evaluation function, the policy, and the world model are each updated by separate learning processes.\n\nThe policy is continually modified by an integrated planning/learning process. The policy is, in a sense, a plan, but one that is completely conditioned by current input. The planning process is incremental and can be interrupted and resumed at any time. It consists of a series of shallow searches, each typically of one ply, and yet ultimately produces the same result as an arbitrarily deep conventional search. I call this relaxation planning.\n\nRelaxation planning is based on continually adjusting the evaluation function in such a way that credit is propagated to the appropriate steps within action sequences. Generally speaking, the evaluation e(x) of a state x should be equal to the best of the states y that can be reached from it in one action, taking into consideration the reward (or cost) r for that one transition:\n\ne(x) \"=\" max_{a in Actions} E{r + e(y) | x, a},    (1)\n\nwhere E{. | .} denotes a conditional expected value and the equal sign is quoted to indicate that this is a condition that we would like to hold, not one that necessarily does hold. If we have a complete model of the world, then the right-hand side can be computed by looking ahead one action. Thus we can generate any number of training examples for the process that learns the evaluation function: for any x,
Figure 1. Overview of Dyna-AHC. (The figure shows the four components: the policy sends actions, via a switch, either to the world or to the world model; both return a next state and a scalar reward; the evaluation function supplies a heuristic reward signal.)\n\nFigure 2. Inner Loop of Dyna-AHC. These steps are repeated continually, sometimes with real experiences, sometimes with hypothetical ones.\n\n1. Decide if this will be a real experience or a hypothetical one.\n2. Pick a state x. If this is a real experience, use the current state.\n3. Choose an action: a <- Policy(x).\n4. Do action a; obtain next state y and reward r from world or world model.\n5. If this is a real experience, update the world model from x, a, y and r.\n6. Update the evaluation function so that e(x) is more like r + γe(y); this is temporal-difference learning.\n7. Update the policy: strengthen or weaken the tendency to perform action a in state x according to the error in the evaluation function: r + γe(y) - e(x).\n8. Go to Step 1.\n\nthe right-hand side of (1) is the desired output. If the learning process converges such that (1) holds in all states, then the optimal policy is given by choosing the action in each state x that achieves the maximum on the right-hand side. There is an extensive theoretical basis from dynamic programming for algorithms of this type for the special case in which the evaluation function is tabular, with enumerable states and actions. For example, this theory guarantees convergence to a unique evaluation function satisfying (1) and that the corresponding policy is optimal (Bertsekas, 1987).\n\nThe evaluation function and policy need not be tables, but can be more compact function approximators such as connectionist networks, decision trees, k-d trees, or symbolic rules.
Although the existing theory does not apply to these machine learning algorithms directly, it does provide a theoretical foundation for exploring their use in this way.\n\nThe above discussion gives the general idea of relaxation planning, but not the exact form used in policy iteration and Dyna-AHC, in which the policy is adapted simultaneously with the evaluation function. The evaluations in this case are not supposed to reflect the value of states given optimal behavior, but rather their value given current behavior (the current policy). As the current policy gradually approaches optimality, the evaluation function also approaches the optimal evaluation function. In addition, Dyna-AHC is a Monte Carlo or stochastic-approximation variant of policy iteration, in which the world model is only sampled, not examined directly. Since the real world can also be sampled, by actually taking actions and observing the result, the world can be used in place of the world model in these methods. In this case, the result is not relaxation planning, but a trial-and-error learning process much like reinforcement learning (see Barto, Sutton & Watkins, 1989, 1990). In Dyna-AHC, both of these are done at once.\n\nFigure 3. Learning Curves of Dyna-AHC Systems on a Navigation Task. (The figure plots steps per trial against trials for 0 planning steps (trial-and-error learning only), 10 planning steps, and 100 planning steps; the maze is shown in the upper right.)\n\nFigure 4. Policies Found by Planning (k = 100) and Non-Planning (k = 0) Dyna-AHC Systems by the Middle of the Second Trial. The black square is the current location of the system. The arrows indicate action probabilities (excess over smallest) for each direction of movement.
The same algorithm is applied both to real experience (resulting in learning) and to hypothetical experience generated by the world model (resulting in relaxation planning). The results in both cases are accumulated in the policy and the evaluation function.\n\nThere is insufficient room here to fully justify the algorithm used in Dyna-AHC, but it is quite simple and is given in outline form in Figure 2.\n\n3 A Navigation Task\n\nAs an illustration of the Dyna-AHC architecture, consider the task of navigating the maze shown in the upper right of Figure 3. The maze is a 6 by 9 grid of possible locations or states, one of which is marked as the starting state, \"S\", and one of which is marked as the goal state, \"G\". The shaded states act as barriers and cannot be entered. All the other states are distinct and completely distinguishable. From each there are four possible actions: UP, DOWN, RIGHT, and LEFT, which change the state accordingly, except where such a movement would take the system into a barrier or outside the maze, in which case the location is not changed. Reward is zero for all transitions except for those into the goal state, for which it is +1. Upon entering the goal state, the system is instantly transported back to the start state to begin the next trial. None of this structure and dynamics is known to the Dyna-AHC system a priori.\n\nIn this instance of the Dyna-AHC architecture, real and hypothetical experiences were used alternately (Step 1). For each single experience with the real world, k hypothetical experiences were generated with the model. Figure 3 shows learning curves for k = 0, k = 10, and k = 100, each an average over 100 runs.
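The inner loop of Figure 2, specialized to a tabular maze of this kind, can be sketched as follows. This is a minimal illustration and not the original implementation: the maze layout, the step size ALPHA, the discount GAMMA, the near-greedy policy, and the sampling of hypothetical experiences from previously observed state-action pairs are all assumptions.

```python
import random

# 6 by 9 maze sketch: 'S' start, 'G' goal, '#' barrier (layout is an assumption)
MAZE = ['.........',
        '..#....#.',
        'S.#....#.',
        '..#..#.#G',
        '.....#...',
        '.........']
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # UP, DOWN, LEFT, RIGHT
GAMMA, ALPHA = 0.95, 0.5                       # assumed parameter values

e = {}       # evaluation function: state -> value
pref = {}    # policy: (state, action index) -> strength
model = {}   # learned world model: (state, action index) -> (next state, reward)

def world_step(x, a):
    # Real world: move unless blocked or off-grid; +1 only for entering the goal
    nr, nc = x[0] + ACTIONS[a][0], x[1] + ACTIONS[a][1]
    if 0 <= nr < 6 and 0 <= nc < 9 and MAZE[nr][nc] != '#':
        x = (nr, nc)
    return x, (1.0 if MAZE[x[0]][x[1]] == 'G' else 0.0)

def policy(x):
    # Near-greedy in the preferences; tiny noise breaks ties among untried actions
    return max(range(4), key=lambda a: pref.get((x, a), 0.0) + 1e-3 * random.random())

def td_update(x, a, y, r):
    # Steps 6 and 7: move e(x) toward r + GAMMA*e(y), and adjust the tendency
    # to take a in x by the same temporal-difference error
    err = r + GAMMA * e.get(y, 0.0) - e.get(x, 0.0)
    e[x] = e.get(x, 0.0) + ALPHA * err
    pref[(x, a)] = pref.get((x, a), 0.0) + ALPHA * err

def dyna_ahc_step(x, k=10):
    a = policy(x)
    y, r = world_step(x, a)      # one real experience
    model[(x, a)] = (y, r)       # Step 5: update the world model
    td_update(x, a, y, r)        # learning
    for _ in range(k):           # k hypothetical experiences from the model
        (hx, ha), (hy, hr) = random.choice(list(model.items()))
        td_update(hx, ha, hy, hr)   # relaxation planning
    return y
```

With k = 0 the loop reduces to pure trial-and-error learning; larger k lets value found at the goal propagate back through the learned model between real steps.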
The k = 0 case involves no planning; this is a pure trial-and-error learning system entirely analogous to those used in reinforcement learning systems based on the adaptive heuristic critic (AHC) (Sutton, 1984; Barto, Sutton & Anderson, 1983). Although the length of path taken from start to goal falls dramatically for this case, it falls much more rapidly for the cases including hypothetical experiences, showing the benefit of relaxation planning using the learned world model. For k = 100, the optimal path was generally found and followed by the fourth trip from start to goal; this is very rapid learning.\n\nFigure 4 shows why a Dyna-AHC system that includes planning solves this problem so much faster than one that does not. Shown are the policies found by the k = 0 and k = 100 Dyna-AHC systems half-way through the second trial. Without planning (k = 0), each trial adds only one additional step to the policy, and so only one step (the last) has been learned so far. With planning, the first trial also learned only one step, but here during the second trial an extensive policy has been developed that by the trial's end will reach almost back to the start state.\n\n4 Dyna-Q: Dyna by Q-learning\n\nThe Dyna-AHC architecture is in essence the reinforcement learning architecture based on the adaptive heuristic critic (AHC) that my colleagues and I developed (Sutton, 1984; Barto, Sutton & Anderson, 1983) plus the idea of using a learned world model to generate hypothetical experience and to plan. Watkins (1989) subsequently developed the relationships between the reinforcement-learning architecture and dynamic programming (see also Barto, Sutton & Watkins, 1989, 1990) and, moreover, proposed a slightly different kind of reinforcement learning called Q-learning.
The Dyna-Q architecture is the combination of this new kind of learning with the Dyna idea of using a learned world model to generate hypothetical experience and achieve planning.\n\nWhereas the AHC reinforcement learning architecture maintains two fundamental memory structures, the evaluation function and the policy, Q-learning maintains only one. That one is a cross between an evaluation function and a policy. For each pair of state x and action a, Q-learning maintains an estimate Q_xa of the value of taking a in x. The value of a state can then be defined as the value of the state's best state-action pair: e(x) =def max_a Q_xa. In general, the Q-value for a state x and an action a should equal the expected value of the immediate reward r plus the discounted value of the next state y:\n\nQ_xa \"=\" E{r + γe(y) | x, a}.    (3)\n\nTo achieve this goal, the updating steps (Steps 6 and 7 of Figure 2) are implemented by\n\nQ_xa <- Q_xa + β(r + γe(y) - Q_xa).    (4)\n\nThis is the only update rule in Q-learning. We note that it is very similar though not identical to Holland's bucket brigade and to Sutton's (1988) temporal-difference learning.\n\nThe simplest way of determining the policy on real experiences is to deterministically select the action that currently looks best: the action with the maximal Q-value. However, as we show below, this approach alone suffers from inadequate exploration. To deal with this problem, a new memory structure was added that keeps track of the degree of uncertainty about each component of the model. For each state x and action a, a record is kept of the number of time steps n_xa that have elapsed since a was tried in x in a real experience.
An exploration bonus of ε√n_xa is used to make actions that have not been tried in a long time (and that therefore have uncertain consequences) appear more attractive by replacing (4) with:\n\nQ_xa <- Q_xa + β(r + ε√n_xa + γe(y) - Q_xa).    (5)\n\nIn addition, the system is permitted to hypothetically experience actions it has never before tried, so that the exploration bonus for trying them can be propagated back by relaxation planning. This was done by starting the system with a non-empty initial model and by selecting actions randomly on hypothetical experiences. In the experiments with Dyna-Q systems reported below, actions that had never been tried were assumed to produce zero reward and leave the state unchanged.\n\n5 Changing-World Experiments\n\nTwo experiments were performed to test the ability of Dyna systems to adapt to changes in their environments. Three Dyna systems were used: the Dyna-AHC system presented earlier in the paper, a Dyna-Q system including the exploration bonus (5), called the Dyna-Q+ system, and a Dyna-Q system without the exploration bonus (i.e., using (4)), called the Dyna-Q- system. All systems used k = 10.\n\nThe blocking experiment used the two mazes shown in the upper portion of Figure 5. Initially a short path from start to goal was available (first maze). After 1000 time steps, by which time the short path was usually well learned, that path was blocked and a longer path was opened (second maze). Performance under the new condition was measured for 2000 time steps. Average results over 50 runs are shown in Figure 5 for the three Dyna systems. The graph shows a cumulative record of the number of rewards received by the system up to each moment in time. In the first 1000 time steps, all three Dyna systems found a short route to the goal, though the Dyna-Q+ system did so significantly faster than the other two.
After the short path was blocked at 1000 steps, the graph for the Dyna-AHC system remains almost flat, indicating that it was unable to obtain further rewards. The Dyna-Q systems, on the other hand, clearly solved the blocking problem, reliably finding the alternate path after about 800 time steps.\n\nThe shortcut experiment began with only a long path available (first maze of Figure 6). After 3000 time steps all three Dyna systems had learned the long path, and then a shortcut was opened without interfering with the long path (second maze of Figure 6). The lower part of Figure 6 shows the results. The increase in the slope of the curve for the Dyna-Q+ system, while the others remain constant, indicates that it alone was able to find the shortcut. The Dyna-Q+ system also learned the original long route faster than the Dyna-Q- system, which in turn learned it faster than the Dyna-AHC system. However, the ability of the Dyna-Q+ system to find shortcuts does not come totally for free. Continually re-exploring the world means occasionally making suboptimal actions. If one looks closely at Figure 6, one can see that the Dyna-Q+ system actually achieves a slightly lower rate of reinforcement during the first 3000 steps. In a static environment, Dyna-Q+ will eventually perform worse than Dyna-Q-, whereas, in a changing environment, it will be far superior, as here. One possibility is to use a meta-level learning process to adjust the exploration parameter ε to match the degree of variability of the environment.\n\nFigure 5. Performance on the Blocking Task (Slope is the Rate of Reward). (The figure plots cumulative reward over the 3000 time steps for the Dyna-Q+, Dyna-Q-, and Dyna-PI (Dyna-AHC) systems.)\n\nFigure 6. Performance on the Shortcut Task (Slope is the Rate of Reward). (The same three systems, plotted in the same format.)
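The Dyna-Q+ update (5), including the aging counts n_xa, can be sketched in tabular form as follows. This is a minimal illustration under stated assumptions, not the original implementation: the parameter values, the function names, and the replay of hypothetical experiences from stored model entries are all invented for the sketch.

```python
import math, random

GAMMA, BETA, EPS = 0.95, 0.5, 0.001   # discount, step size, exploration parameter

Q = {}       # (state, action) -> value estimate Q_xa
n = {}       # (state, action) -> time steps since a was last tried in x
model = {}   # learned world model: (state, action) -> (next state, reward)

def value(y, actions):
    # e(y) = max over a of Q_ya, the value of the best state-action pair
    return max(Q.get((y, a), 0.0) for a in actions)

def q_plus_update(x, a, r, y, actions):
    # Update (5): the bonus EPS*sqrt(n_xa) makes long-untried actions look
    # attractive, so the system keeps re-exploring a changing world
    bonus = EPS * math.sqrt(n.get((x, a), 0))
    q = Q.get((x, a), 0.0)
    Q[(x, a)] = q + BETA * (r + bonus + GAMMA * value(y, actions) - q)

def real_step(x, a, r, y, actions):
    model[(x, a)] = (y, r)     # learn the world model
    for key in n:
        n[key] += 1            # every other recorded pair ages by one step
    n[(x, a)] = 0              # a was just tried in x
    q_plus_update(x, a, r, y, actions)

def planning_step(actions):
    # Hypothetical experience: replay a randomly chosen modeled pair, so
    # rewards and exploration bonuses propagate back by relaxation planning
    (x, a), (y, r) = random.choice(list(model.items()))
    q_plus_update(x, a, r, y, actions)
```

Setting EPS = 0 recovers the plain update (4), i.e., the Dyna-Q- system.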
\n\n6 Limitations and Conclusions\n\nThe results presented here are clearly limited in many ways. The state and action spaces are small and denumerable, permitting tables to be used for all learning processes and making it feasible for the entire state space to be explicitly explored. In addition, these results have assumed knowledge of the world state, have used a trivial form of search control (random exploration), and have used terminal goal states. These are significant limitations of the results, but not of the Dyna architecture. There is nothing about the Dyna architecture which prevents it from being applied more generally in each of these ways (e.g., see Lin, 1991; Riolo, 1991; Whitehead & Ballard, in press).\n\nDespite limitations, these results are significant. They show that the use of a forward model can dramatically speed trial-and-error (reinforcement) learning processes even on simple problems. Moreover, they show how planning can be done with the incomplete, changing, and oft-times incorrect world models that are constructed through learning. Finally, they show how the functionality of planning can be obtained in a completely incremental manner, and how a planning process can be freely intermixed with reaction and learning processes. Further results are needed for a thorough comparison of Dyna-AHC and Dyna-Q architectures, but the results presented here suggest that it is easier to adapt Dyna-Q architectures to changing environments.\n\nAcknowledgements\n\nThe author gratefully acknowledges the contributions by Andrew Barto, Chris Watkins, Steve Whitehead, Paul Werbos, Luis Almeida, and Leslie Kaelbling.\n\nReferences\n\nBarto, A. G., Sutton, R. S., & Anderson, C. W. (1983) IEEE Trans. SMC-13, 834-846.\nBarto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1989) In: Learning and Computational Neuroscience, M. Gabriel and J. W.
Moore (Eds.), MIT Press, 1991.\nBarto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990) NIPS 2, 686-693.\nBellman, R. E. (1957) Dynamic Programming, Princeton University Press.\nBertsekas, D. P. (1987) Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall.\nKorf, R. E. (1990) Artificial Intelligence 42, 189-211.\nLin, Long-Ji (1991) In: Proceedings of the International Conference on the Simulation of Adaptive Behavior, MIT Press.\nRiolo, R. (1991) In: Proceedings of the International Conference on the Simulation of Adaptive Behavior, MIT Press.\nRumelhart, D. E., Smolensky, P., McClelland, J. L., & Hinton, G. E. (1986) In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume II, by J. L. McClelland, D. E. Rumelhart, and the PDP research group, 7-57. MIT Press.\nSutton, R. S. (1984) Temporal Credit Assignment in Reinforcement Learning. PhD thesis, COINS Dept., Univ. of Mass.\nSutton, R. S. (1988) Machine Learning 3, 9-44.\nSutton, R. S. (1990) In: Proceedings of the Seventh International Conference on Machine Learning, 216-224, Morgan Kaufmann.\nSutton, R. S., Barto, A. G. (1981) Cognition and Brain Theory 4, 217-246.\nSutton, R. S., Pinette, B. (1985) In: Proceedings of the Seventh Annual Conf. of the Cognitive Science Society, 54-64, Lawrence Erlbaum.\nWatkins, C. J. C. H. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge University Psychology Department.\nWerbos, P. J. (1987) IEEE Trans. SMC-17, 7-20.\nWhitehead, S. D., Ballard, D. H. (1989) In: Proceedings of the Sixth International Workshop on Machine Learning, 354-357, Morgan Kaufmann.\nWhitehead, S. D., Ballard, D. H. (in press) Machine Learning.\n", "award": [], "sourceid": 388, "authors": [{"given_name": "Richard", "family_name": "Sutton", "institution": null}]}