Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang
Despite the tremendous success of reinforcement learning (RL) with function approximation, efficient exploration remains a significant challenge, both practically and theoretically. In particular, existing theoretically grounded RL algorithms based on upper confidence bounds (UCBs), such as optimistic least-squares value iteration (LSVI), are often incompatible with practically powerful function approximators, such as neural networks. In this paper, we develop a variant of \underline{boo}tstrapped LS\underline{VI}, namely BooVI, which bridges this gap between practice and theory. Practically, BooVI drives exploration through (re)sampling, making it compatible with general function approximators. Theoretically, BooVI inherits the worst-case $\tilde{O}(\sqrt{d^3 H^3 T})$ regret bound of optimistic LSVI in the episodic linear setting. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps.
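To make the exploration-through-(re)sampling idea concrete, the following is a minimal sketch, not the authors' BooVI, of how resampled regression targets can replace a UCB bonus in backward least-squares value iteration for an episodic linear MDP. The function name `boot_lsvi_episode`, the Gaussian perturbation with scale `sigma`, and the ridge parameter `lam` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def boot_lsvi_episode(phi, data, H, d, num_actions, lam=1.0, sigma=1.0, rng=None):
    """Fit greedy value weights w[0..H] for one episode using perturbed targets.

    phi  : callable (state, action) -> np.ndarray of shape (d,)   [assumed feature map]
    data : list over steps h of (states, actions, rewards, next_states),
           where rewards is an np.ndarray of shape (n,)
    """
    rng = np.random.default_rng() if rng is None else rng
    w = [np.zeros(d) for _ in range(H + 1)]          # w[H] stays zero (terminal value)
    for h in reversed(range(H)):                     # backward value iteration
        states, actions, rewards, next_states = data[h]
        n = len(states)
        Phi = np.stack([phi(s, a) for s, a in zip(states, actions)])   # (n, d)
        # Greedy next-step values under the already-fitted weights w[h + 1].
        v_next = np.array([
            max(phi(s_next, a_next) @ w[h + 1] for a_next in range(num_actions))
            for s_next in next_states
        ])
        # Exploration via resampling: perturb the regression targets instead of
        # adding an explicit UCB bonus (scale sigma is an assumed hyperparameter).
        y = rewards + v_next + sigma * rng.standard_normal(n)
        A = Phi.T @ Phi + lam * np.eye(d)            # ridge-regularized least squares
        w[h] = np.linalg.solve(A, Phi.T @ y)
    return w
```

In this sketch, acting greedily with respect to the resampled weights plays the role that the optimism bonus plays in UCB-based LSVI; because the perturbation only touches the regression targets, the same recipe applies when the linear regression is swapped for a general function approximator such as a neural network.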