Learning Continuous Control Policies by Stochastic Value Gradients

Part of Advances in Neural Information Processing Systems 28 (NIPS 2015)

Bibtex »Metadata »Paper »Reviews »Supplemental »


Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, Yuval Tassa


We present a unified framework for learning continuous control policies usingbackpropagation. It supports stochastic control by treating stochasticity in theBellman equation as a deterministic function of exogenous noise. The productis a spectrum of general policy gradient algorithms that range from model-freemethods with value functions to model-based methods without value functions.We use learned models but only require observations from the environment insteadof observations from model-predicted trajectories, minimizing the impactof compounded model errors. We apply these algorithms first to a toy stochasticcontrol problem and then to several physics-based control problems in simulation.One of these variants, SVG(1), shows the effectiveness of learning models, valuefunctions, and policies simultaneously in continuous domains.