Richard S. Sutton
This is a summary of results with Dyna, a class of architectures for intel(cid:173) ligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures, Dyna-AHC and Dyna-Q. Using a navigation task, results are shown for a simple Dyna-AHC system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. We show that Dyna-Q architectures (based on Watkins's Q-Iearning) are easy to adapt for use in changing environments.
Introduction to Dyna
Dyna architectures (Sutton, 1990) use learning algorithms to approximate the con(cid:173) ventional optimal control technique known as dynamic programming (DP) (Bell(cid:173) man, 1957; Bertsekas, 1987). DP itself is not a learning method, but rather a computational method for determining optimal behavior given a complete model of the task to be solved. It is very similar to state-space search, but differs in that it is more incremental and never considers actual action sequences explicitly, only single actions at a time. This makes DP more amenable to incremental planning at execution time, and also makes it more suitable for stochastic or incompletely modeled environments, as it need not consider the extremely large number of se(cid:173) quences possible in an uncertain environment. Learned world models are likely to be stochastic and uncertain, making DP approaches particularly promising for