Part of Advances in Neural Information Processing Systems 6 (NIPS 1993)
Kenneth Buckland, Peter Lawrence
Transition point dynamic programming (TPDP) is a memory-based, reinforcement learning, direct dynamic programming approach to adaptive optimal control that can reduce the learning time and memory usage required for the control of continuous stochastic dynamic systems. TPDP does so by determining an ideal set of transition points (TPs) which specify only the control action changes necessary for optimal control. TPDP converges to an ideal TP set by using a variation of Q-learning to assess the merits of adding, swapping and removing TPs from states throughout the state space. When applied to a race track problem, TPDP learned the optimal control policy much sooner than conventional Q-learning, and was able to do so using less memory.
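The idea of storing only control action changes can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the TPDP variation of Q-learning, so the update shown is the standard one-step tabular Q-learning rule, and the names `rollout_actions`, `q_update`, the discrete integer states and the action labels are all hypothetical choices made for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def rollout_actions(
    tps: Dict[Hashable, str],
    states: Iterable[Hashable],
    initial_action: str,
) -> List[str]:
    """Return the action applied at each visited state.

    Assumption: a transition point (TP) is stored as a state -> action entry,
    and the controller keeps its current action until it visits a state that
    holds a TP. States without a TP need no memory entry.
    """
    action = initial_action
    applied = []
    for state in states:
        if state in tps:  # a TP marks a change of control action
            action = tps[state]
        applied.append(action)
    return applied


def q_update(
    q: Dict[Tuple[Hashable, str], float],
    state: Hashable,
    action: str,
    reward: float,
    next_state: Hashable,
    actions: List[str],
    alpha: float = 0.1,
    gamma: float = 0.9,
) -> float:
    """Standard one-step tabular Q-learning update (not the TPDP variation)."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


# Two TPs cover a five-state trajectory; the other states store nothing.
example_tps = {0: "accelerate", 3: "brake"}
example = rollout_actions(example_tps, [0, 1, 2, 3, 4], "coast")
# example == ["accelerate", "accelerate", "accelerate", "brake", "brake"]
```

In this sketch the memory saving comes from the sparse `tps` dictionary: only the states where the action changes hold an entry, whereas a conventional Q-table would store a value for every state-action pair visited.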