Andrew Barto, Michael Duff
We describe the relationship between certain reinforcement learning (RL) methods based on dynamic programming (DP) and a class of unorthodox Monte Carlo methods for solving systems of linear equations proposed in the 1950's. These methods recast the solution of the linear system as the expected value of a statistic suitably defined over sample paths of a Markov chain. The significance of our observations lies in arguments (Curtiss, 1954) that these Monte Carlo methods scale better with respect to state-space size than do standard, iterative techniques for solving systems of linear equations. This analysis also establishes convergence rate estimates. Because methods used in RL systems for approximating the evaluation function of a fixed control policy also approximate solutions to systems of linear equations, the connection to these Monte Carlo methods establishes that algorithms very similar to TD algorithms (Sutton, 1988) are asymptotically more efficient in a precise sense than other methods for evaluating policies. Further, all DP-based RL methods have some of the properties of these Monte Carlo algorithms, which suggests that although RL is often perceived to be slow, for sufficiently large problems, it may in fact be more efficient than other known classes of methods capable of producing the same results.
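The recasting described above can be illustrated with a minimal sketch (not from the paper; the chain, rewards, and discount factor below are hypothetical). Evaluating a fixed policy on a Markov chain means solving the linear system (I - gamma*P)v = r; the Monte Carlo view instead estimates each component of v as the expected discounted return, a statistic averaged over sample paths started in that state:

```python
import numpy as np

# Hypothetical 3-state Markov chain under a fixed policy: transition
# matrix P, reward vector r, discount factor gamma. Policy evaluation
# solves the linear system  v = r + gamma * P v,
# i.e.  (I - gamma * P) v = r.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Standard approach: solve the linear system directly.
v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

# Monte Carlo approach: v[s] is the expected discounted return along
# sample paths started in s -- the "statistic suitably defined over
# sample paths" of the abstract.
rng = np.random.default_rng(0)

def mc_estimate(s0, n_paths=5000, horizon=100):
    """Average the truncated discounted return over sampled paths."""
    total = 0.0
    for _ in range(n_paths):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            ret += disc * r[s]
            disc *= gamma
            s = rng.choice(3, p=P[s])
        total += ret
    return total / n_paths

v_mc = np.array([mc_estimate(s) for s in range(3)])
# v_mc approximates v_exact. Note that estimating v for one state needs
# only sampled transitions from that state onward, not a sweep over the
# whole state space -- the intuition behind the scaling argument.
```

The per-state estimate requires only simulated transitions, never the full matrix sweep that iterative solvers perform, which is the sense in which such methods can scale better with state-space size.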