Improving Policies without Measuring Merits

Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)


Peter Dayan, Satinder Singh


Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), namely what Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
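
As a rough illustration of why relative measures suffice for policy improvement, the sketch below works in a small discrete setting rather than the smooth problems the abstract addresses. It assumes one common definition of the advantage, namely an action's value minus the best action value in the same state; the Q-value table, the function names, and the per-state offsets are all hypothetical and are not taken from the paper. It checks that the greedy policy is unchanged when absolute state utility is removed or arbitrarily shifted.

```python
import numpy as np


def advantages(q):
    """Advantage of each action relative to the best action in its state.

    Assumed definition: A(s, a) = Q(s, a) - max_a' Q(s, a').
    """
    return q - q.max(axis=1, keepdims=True)


def greedy_policy(scores):
    """Pick the highest-scoring action in each state."""
    return scores.argmax(axis=1)


rng = np.random.default_rng(0)

# Hypothetical action values for 5 states and 3 actions.
q = rng.normal(size=(5, 3))
adv = advantages(q)

# Arbitrary per-state offsets stand in for unknown absolute utilities.
offsets = rng.normal(size=(5, 1))

policy_from_q = greedy_policy(q)
policy_from_adv = greedy_policy(adv)
policy_from_shifted = greedy_policy(adv + offsets)

print(np.array_equal(policy_from_q, policy_from_adv))
print(np.array_equal(policy_from_q, policy_from_shifted))
```

In this sketch the advantages are still derived from an absolute table `q`, which is exactly the step the paper aims to avoid; the example only shows that the improved policy depends on the relative quantities alone.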