John Tsitsiklis, Benjamin Van Roy
We present new results about the temporal-difference learning al(cid:173) gorithm, as applied to approximating the cost-to-go function of a Markov chain using linear function approximators. The algo(cid:173) rithm we analyze performs on-line updating of a parameter vector during a single endless trajectory of an aperiodic irreducible finite state Markov chain. Results include convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to establishing new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Furthermore, we discuss the implications of two counter-examples with regards to the Significance of on-line updating and linearly parameterized function approximators.