NIPS Proceedingsβ

Cornering Stationary and Restless Mixing Bandits with Remix-UCB

Part of: Advances in Neural Information Processing Systems 28 (NIPS 2015)

A note about reviews: "heavy" review comments were provided by reviewers in the program committee as part of the evaluation process for NIPS 2015, along with posted responses during the author feedback period. Numerical scores from both "heavy" and "light" reviewers are not provided in the review link below.

[PDF] [BibTeX] [Supplemental] [Reviews]

Authors

Conference Event Type: Poster

Abstract

We study the restless bandit problem where arms are associated with stationary $\varphi$-mixing processes and where rewards are therefore dependent: the question that arises from this setting is that of carefully recovering some independence by `ignoring' the values of some rewards. As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off, which we do by considering the idea of a {\em waiting arm} in the new Remix-UCB algorithm, a generalization of Improved-UCB for the problem at hand, that we introduce. We provide a regret analysis for this bandit strategy; two noticeable features of Remix-UCB are that i) it reduces to the regular Improved-UCB when the $\varphi$-mixing coefficients are all $0$, i.e. when the i.i.d scenario is recovered, and ii) when $\varphi(n)=O(n^{-\alpha})$, it is able to ensure a controlled regret of order $\Ot\left( \Delta_*^{(\alpha- 2)/\alpha} \log^{1/\alpha} T\right),$ where $\Delta_*$ encodes the distance between the best arm and the best suboptimal arm, even in the case when $\alpha<1$, i.e. the case when the $\varphi$-mixing coefficients {\em are not} summable.