Part of Advances in Neural Information Processing Systems 29 (NIPS 2016)
In many applications such as advertisement placement or automated dialog systems, an intelligent system optimizes performance over a sequence of interactions with each user. Such tasks often involve many states and potentially time-dependent transition dynamics, and can be modeled well as episodic Markov decision processes (MDPs). In this paper, we present a PAC algorithm for reinforcement learning in episodic finite MDPs with time-dependent transitions that acts epsilon-optimal in all but O(S A H^3 / epsilon^2 log(1 / delta)) episodes. Our algorithm has a polynomial computational complexity, and our sample complexity bound accounts for the fact that we may only be able to approximately solve the internal planning problems. In addition, our PAC sample complexity bound has only linear dependency on the number of states S and actions A and strictly improves previous bounds with S^2 dependency in this setting. Compared against other methods for infinite horizon reinforcement learning with linear state space sample complexity our method has much lower dependency on the (effective) horizon. Indeed, our bound is optimal up to a factor of H.