Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
Zihan Zhang, Xiangyang Ji
We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle that efficiently learns Reinforcement Learning (RL) problems modeled by a Markov decision process (MDP) with a finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^*$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with logarithmic factors ignored.} for an MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^*$, i.e., $\mathrm{sp}(h^*)$, is known. This result outperforms the best previous regret bound $\tilde{O}(HS\sqrt{AT})$~\cite{bartlett2009regal} by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SATH})$~\cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{DSAT})$ for MDPs with finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{DSAT})$~\cite{jaksch2010near}.
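As a minimal sketch, assuming the standard average-reward formulation (a restatement of the usual definitions, not quoted from the paper), the regret over $T$ steps and the span of the optimal bias function $h^*$ can be written as
\[
\mathrm{Regret}(T) \;=\; T\rho^* - \sum_{t=1}^{T} r_t,
\qquad
\mathrm{sp}(h^*) \;=\; \max_{s} h^*(s) - \min_{s} h^*(s) \;\le\; H,
\]
where $\rho^*$ denotes the optimal average reward (gain) and $r_t$ the reward collected at step $t$; the algorithm is assumed to know an upper bound $H$ on $\mathrm{sp}(h^*)$.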