Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Heyang Zhao, Jiafan He, Quanquan Gu
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of Õ(d√HK) when K is sufficiently large, and a near-optimal policy switching cost of Õ(dH), with d being the eluder dimension of the function class, H being the planning horizon, and K being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
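To give a rough sense of the variance-weighted regression component mentioned in (3), the sketch below shows weighted ridge regression in the linear special case, where samples with smaller estimated noise variance receive larger weight. This is only an illustrative sketch under simplifying assumptions, not the paper's algorithm for general function classes; the function name, the regularizer lam, and the toy data are all hypothetical.

```python
import numpy as np

def variance_weighted_ridge(features, targets, variances, lam=1.0):
    """Variance-weighted ridge regression (linear special case, illustrative only).

    Solves  min_theta  sum_i (y_i - phi_i @ theta)^2 / sigma_i^2 + lam * ||theta||^2,
    so low-variance (more reliable) samples get larger weight in the fit.
    """
    weights = 1.0 / variances                                   # per-sample weights 1 / sigma_i^2
    gram = features.T @ (weights[:, None] * features) + lam * np.eye(features.shape[1])
    moment = features.T @ (weights * targets)
    return np.linalg.solve(gram, moment)

# Toy usage with synthetic heteroscedastic data (purely illustrative).
rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 5))
theta_star = rng.normal(size=5)
sigma2 = rng.uniform(0.1, 2.0, size=100)                        # per-sample noise variances
y = phi @ theta_star + rng.normal(size=100) * np.sqrt(sigma2)
print(variance_weighted_ridge(phi, y, sigma2, lam=0.1))
```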