Geoffrey J. Gordon
Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(O) and V(O); the latter algorithm was used in the well-known TD-Gammon program.