Optimistic Planning in Markov Decision Processes Using a Generative Model

Part of Advances in Neural Information Processing Systems 27 (NIPS 2014)

Bibtex Metadata Paper Reviews Supplemental


Balázs Szörényi, Gunnar Kedenburg, Remi Munos


We consider the problem of online planning in a Markov decision process with discounted rewards for any given initial state. We consider the PAC sample complexity problem of computing, with probability $1-\delta$, an $\epsilon$-optimal action using the smallest possible number of calls to the generative model (which provides reward and next-state samples). We design an algorithm, called StOP (for Stochastic-Optimistic Planning), based on the optimism in the face of uncertainty" principle. StOP can be used in the general setting, requires only a generative model, and enjoys a complexity bound that only depends on the local structure of the MDP."