A Natural Policy Gradient

Part of Advances in Neural Information Processing Systems 14 (NIPS 2001)

Sham M. Kakade


We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
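
As a rough illustration of the claim that the natural gradient points toward the greedy action, the following sketch compares the ordinary and natural gradients for a tabular softmax policy. It is not the paper's implementation: the single-state setting, the reward values, and the function names `softmax` and `vanilla_and_natural_gradient` are assumptions made for this example, and the Moore-Penrose pseudoinverse is used because the Fisher information matrix of a tabular softmax policy is singular.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def vanilla_and_natural_gradient(theta, rewards):
    """Gradients of expected reward for a one-state softmax policy.

    Assumed setting (not from the paper): one state, one preference
    parameter per action, deterministic reward per action.
    """
    pi = softmax(theta)
    # Row a holds grad_theta log pi(a) = e_a - pi.
    score = np.eye(len(theta)) - pi
    # Advantage of each action relative to the policy's expected reward.
    advantage = rewards - pi @ rewards
    # Ordinary policy gradient: sum_a pi(a) * A(a) * score[a].
    grad = score.T @ (pi * advantage)
    # Fisher information matrix: sum_a pi(a) * score[a] score[a]^T.
    fisher = score.T @ (pi[:, None] * score)
    # Natural gradient: F^+ grad (pseudoinverse, since F is singular here).
    nat_grad = np.linalg.pinv(fisher, rcond=1e-10) @ grad
    return grad, nat_grad

theta = np.array([2.0, 0.0, -1.0])    # policy currently prefers action 0
rewards = np.array([0.0, 0.5, 1.0])   # action 2 is the greedy choice
grad, nat_grad = vanilla_and_natural_gradient(theta, rewards)
print("vanilla gradient:", grad)
print("natural gradient:", nat_grad)
```

In this example the ordinary gradient weights each action's advantage by its current probability, so its largest component can fall on a likely but suboptimal action. The natural gradient instead equals the advantage vector up to a constant shift, so its largest component is the action with the highest reward.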