NeurIPS 2020

Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning


Meta Review

This paper unifies KL- and entropy-regularized RL algorithms (e.g., TRPO, MPO, SAC, soft Q-learning, softmax DQN, DPP) under a single mirror descent framework and provides proofs for KL-regularized value iteration. The paper shows that KL regularization implicitly averages the successive estimates of the Q-function, and uses this result to establish a linear dependence of the approximation-error term on the time horizon, whereas many previous works under similar assumptions obtain a quadratic dependence. This is a significant result. In addition, KL regularization ensures convergence when the errors are independent and centered, which is not the case for standard approximate dynamic programming. The paper also examines how KL regularization interacts with entropy regularization, and presents empirical findings suggesting that KL regularization alone might be sufficient and better than entropy regularization: it encourages substantial exploration at the beginning, when the policy is still close to uniform, and less as the policy deviates from uniformity.

This is the highest-rated paper in my batch. All reviews were very positive, and the reviewers unanimously considered this a very strong contribution. Most criticisms were cosmetic in nature and easily fixable. R1 asked for a broader range of experiments, but the rebuttal pointed out that these already exist in the appendix.

The contributions are impactful, and the work will be of interest not only to the RL theory community but also to users of RL methods. I recommend this paper be accepted as an oral, both on its merits and to reach as broad an audience as possible.
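
For context, here is a minimal sketch of the KL-regularized iteration underlying the averaging argument; the notation is mine rather than copied from the paper, which also covers several variants (lambda is the KL weight, epsilon_{k+1} the evaluation error):

% Sketch of KL-regularized approximate value iteration (my notation, not the paper's exact scheme).
\begin{align}
  \pi_{k+1} &= \arg\max_{\pi} \; \langle \pi, Q_k \rangle - \lambda \,\mathrm{KL}(\pi \,\|\, \pi_k), \\
  Q_{k+1}   &= r + \gamma P \langle \pi_{k+1}, Q_k \rangle + \epsilon_{k+1}.
\end{align}
% With a uniform initial policy, the greedy step has the closed form
% \pi_{k+1} \propto \pi_k \exp(Q_k/\lambda) \propto \exp\!\big(\tfrac{1}{\lambda}\sum_{j \le k} Q_j\big),
% i.e., a softmax of the sum (equivalently, a scaled average) of past Q-estimates.
% This implicit averaging is what lets independent, centered errors cancel and yields the
% linear, rather than quadratic, horizon dependence in the error bound.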