Andrew Moore, Christopher Atkeson
We present a new algorithm, Prioritized Sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as Temporal Differencing and Q-Iearning have fast real time perfor(cid:173) mance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized Sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize impor(cid:173) tant dynamic programming sweeps and to guide the exploration of state(cid:173) space. We compare Prioritized Sweeping with other reinforcement learning schemes for a number of different stochastic optimal control problems. It successfully solves large state-space real time problems with which other methods have difficulty.
1 STOCHASTIC PREDICTION
The paper introduces a memory-based technique, prioritized 6weeping, which is used both for stochastic prediction and reinforcement learning. A fuller version of this paper is in preparation [Moore and Atkeson, 1992]. Consider the 500 state Markov system depicted in Figure 1. The system has sixteen absorbing states, depicted by white and black circles. The prediction problem is to estimate, for every state, the long-term probability that it will terminate in a white, rather than black, circle. The data available to the learner is a sequence of observed state transitions. Let us consider two existing methods along with prioritized sweeping.