Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Kwang-Sung Jun, Chicheng Zhang
We study regret minimization in stochastic structured bandits. The fact that the popular optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality (asymptotic optimality for short) has recently drawn the attention of researchers. On the other hand, it is known that one can achieve bounded regret (i.e., regret that does not grow indefinitely with the time horizon $n$) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling that introduces an $\omega(1)$ term in $n$ into their regret, failing to adapt to the ``easiness'' of the instance. In this paper, we focus on the finite-hypothesis case and ask whether one can achieve asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer by introducing a new algorithm called CRush Optimism with Pessimism (CROP) that eliminates optimistic hypotheses by pulling the informative arms indicated by a pessimistic hypothesis. Our finite-time analysis shows that CROP $(i)$ achieves constant-factor asymptotic optimality, $(ii)$ adapts to bounded regret thanks to its forced-exploration-free design, and $(iii)$ attains a regret bound that scales not with the number of arms $K$ but with an effective number of arms $K_\psi$ that we introduce. We also discuss a problem class in which CROP can be exponentially better than existing algorithms in \textit{nonasymptotic} regimes. This problem class also reveals the surprising fact that even a clairvoyant oracle that plays according to the asymptotically optimal arm-pull scheme may suffer linear worst-case regret.
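To make the notion of asymptotic instance-dependent optimality concrete, the following is a brief sketch of the standard Graves--Lai-style criterion for a structured bandit with $K$ arms and an instance $\theta$ from a known class; the notation below ($\mathrm{Reg}_n$, $\Delta_a$, $\Lambda(\theta)$) is our own shorthand and is not taken from the paper. Any consistent algorithm must satisfy
$$
\liminf_{n\to\infty} \frac{\mathrm{Reg}_n}{\log n} \;\ge\; c^*(\theta),
\qquad
c^*(\theta) \;=\; \min_{\alpha \in [0,\infty)^K} \; \sum_{a=1}^{K} \alpha_a \Delta_a
\quad \text{s.t.} \quad \sum_{a=1}^{K} \alpha_a \, \mathrm{KL}\!\left(\theta_a \,\|\, \lambda_a\right) \ge 1 \;\; \forall \lambda \in \Lambda(\theta),
$$
where $\Delta_a$ is the suboptimality gap of arm $a$ under $\theta$ and $\Lambda(\theta)$ is the set of alternative hypotheses that have a different best arm than $\theta$ yet cannot be distinguished from $\theta$ by pulling $\theta$'s best arm alone. An algorithm is asymptotically optimal if its regret matches this bound, i.e., $\limsup_{n\to\infty} \mathrm{Reg}_n / \log n \le c^*(\theta)$; when $c^*(\theta) = 0$, sub-logarithmic (possibly bounded) regret is achievable, which is the ``easiness'' referred to above.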