Multi-Step Dyna Planning for Policy Evaluation and Control

Part of Advances in Neural Information Processing Systems 22 (NIPS 2009)

Bibtex »Metadata »Paper »Supplemental »


Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári


We extend Dyna planning architecture for policy evaluation and control in two significant aspects. First, we introduce a multi-step Dyna planning that projects the simulated state/feature many steps into the future. Our multi-step Dyna is based on a multi-step model, which we call the {\em $\lambda$-model}. The $\lambda$-model interpolates between the one-step model and an infinite-step model, and can be learned efficiently online. Second, we use for Dyna control a dynamic multi-step model that is able to predict the results of a sequence of greedy actions and track the optimal policy in the long run. Experimental results show that Dyna using the multi-step model evaluates a policy faster than using single-step models; Dyna control algorithms using the dynamic tracking model are much faster than model-free algorithms; further, multi-step Dyna control algorithms enable the policy and value function to converge much faster to their optima than single-step Dyna algorithms.