NIPS Proceedingsβ

Action-Gap Phenomenon in Reinforcement Learning

Part of: Advances in Neural Information Processing Systems 24 (NIPS 2011)

[PDF] [BibTeX] [Spotlights]



Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy \(\hat{\pi}\) with respect to an action-value function \(\hat{Q}\), the performance loss \(E[V^*(X) - V^{\hat{X}} (X)]\) is upper bounded by \(O(|| \hat{Q} - Q^*||_\infty^{1+\zeta}\)), in which \(\zeta >= 0\) is the parameter quantifying the action-gap regularity. For \(\zeta > 0\), our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.