Part of Advances in Neural Information Processing Systems 5 (NIPS 1992)
Steven Bradtke
Recent research on reinforcement learning has focused on algorithms based on the principles of Dynamic Programming (DP). One of the most promising areas of application for these algorithms is the control of dynamical systems, and some impressive results have been achieved. However, there are significant gaps between practice and theory. In particular, there are no convergence proofs for problems with continuous state and action spaces, or for systems involving non-linear function approximators (such as multilayer perceptrons). This paper presents research applying DP-based reinforcement learning theory to Linear Quadratic Regulation (LQR), an important class of control problems involving continuous state and action spaces and requiring a simple type of non-linear function approximator. We describe an algorithm based on Q-learning that is proven to converge to the optimal controller for a large class of LQR problems. We also describe a slightly different algorithm that is only locally convergent to the optimal Q-function, demonstrating one of the possible pitfalls of using a non-linear function approximator with DP-based learning.
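To make the "simple type of non-linear function approximator" concrete, the sketch below shows the standard quadratic Q-function for discounted LQR under a fixed linear policy. The symbols (dynamics matrices A, B, cost weights S, R, discount factor gamma, and value matrix P) are generic notation chosen for illustration, not necessarily the paper's own; the point is that Q is quadratic in the state-action pair but linear in its parameter matrix H, and the improved policy can be read directly off the blocks of H.

% Sketch under assumed generic LQR notation (A, B, S, R, gamma, P are
% illustrative symbols, not taken from the paper).
\[
x_{t+1} = A x_t + B u_t, \qquad c(x,u) = x^\top S x + u^\top R u .
\]
% Under a fixed linear policy with value function V(x) = x^\top P x,
% the Q-function is quadratic in (x, u):
\[
Q(x,u) = c(x,u) + \gamma\,(A x + B u)^\top P\,(A x + B u)
       = \begin{bmatrix} x \\ u \end{bmatrix}^{\!\top}
         \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
         \begin{bmatrix} x \\ u \end{bmatrix},
\]
% where H_{xx} = S + \gamma A^\top P A,  H_{xu} = H_{ux}^\top = \gamma A^\top P B,
% and H_{uu} = R + \gamma B^\top P B.  Minimizing Q over u gives the greedy
% (improved) policy, which is again linear in the state:
\[
u = -\,H_{uu}^{-1} H_{ux}\, x .
\]

Because Q is linear in the entries of H, the parameters can be estimated from observed transitions with ordinary least-squares methods, which is what makes a convergence analysis tractable despite the continuous state and action spaces.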