Part of Advances in Neural Information Processing Systems 28 (NIPS 2015)
Julien Audiffren, Liva Ralaivola
We study the restless bandit problem where arms are associated with stationary $\phi$-mixing processes and where rewards are therefore dependent: the question that arises from this setting is that of carefully recovering some independence by `ignoring' the values of some rewards. As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off, which we do by considering the idea of a {\em waiting arm} in the new Remix-UCB algorithm that we introduce, a generalization of Improved-UCB to the problem at hand. We provide a regret analysis for this bandit strategy; two noticeable features of Remix-UCB are that i) it reduces to the regular Improved-UCB when the $\phi$-mixing coefficients are all $0$, i.e., when the i.i.d. scenario is recovered, and ii) when $\phi(n) = O(n^{-\alpha})$, it is able to ensure a controlled regret of order $\widetilde{O}\left(\Delta_*^{(\alpha-2)/\alpha} \log^{1/\alpha} T\right)$, where $\Delta_*$ encodes the distance between the best arm and the best suboptimal arm, even in the case when $\alpha < 1$, i.e., the case when the $\phi$-mixing coefficients {\em are not} summable.
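To make the waiting-arm idea concrete, here is a minimal Python sketch of an Improved-UCB-style elimination loop (Auer & Ortner, 2010) that spends idle steps between successive pulls of the same arm, so that observations from a mixing process are closer to independent. Everything here is illustrative: the name `remixed_ucb_sketch`, the `wait_gap` schedule, and the elimination constants are assumptions standing in for the paper's actual Remix-UCB construction, which calibrates the waiting to the $\phi$-mixing coefficients.

```python
import math


def remixed_ucb_sketch(arms, T, wait_gap=lambda m: m + 1):
    """Illustrative Improved-UCB-style loop with a 'waiting' option.

    NOT the paper's Remix-UCB: the elimination rule follows Improved-UCB,
    and `wait_gap` is a hypothetical schedule standing in for the paper's
    mixing-driven waiting arm. `arms` is a list of zero-argument callables
    returning a reward in [0, 1]; `T` is the horizon in time steps.
    """
    K = len(arms)
    active = list(range(K))
    sums = [0.0] * K
    counts = [0] * K
    t, m = 0, 0
    delta = 1.0  # current gap guess, halved each elimination round

    while t < T and len(active) > 1:
        # Samples required per surviving arm this round (Improved-UCB style).
        n_m = math.ceil(2.0 * math.log(max(T * delta * delta, math.e))
                        / (delta * delta))
        for i in active:
            while counts[i] < n_m and t < T:
                sums[i] += arms[i]()
                counts[i] += 1
                t += 1
                # 'Waiting': burn idle time steps between pulls so that
                # successive observations of a mixing process decorrelate.
                t += wait_gap(m)
        # Eliminate arms whose optimistic value falls below the best
        # pessimistic value among the survivors.
        half = math.sqrt(math.log(max(T * delta * delta, math.e)) / (2.0 * n_m))
        means = {i: sums[i] / counts[i] for i in active}
        best_lower = max(means[i] - half for i in active)
        active = [i for i in active if means[i] + half >= best_lower]
        delta /= 2.0
        m += 1

    return max(active, key=lambda i: sums[i] / max(counts[i], 1))


if __name__ == "__main__":
    # Toy usage with i.i.d. Bernoulli arms; the mixing setting would replace
    # these with dependent reward processes.
    import random
    arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
    print(remixed_ucb_sketch(arms, T=20_000))
```

Note the trade-off the sketch exposes: each call to `wait_gap` consumes budget from the horizon `T` without producing a reward, which is exactly why the paper speaks of an exploration/exploitation/independence trade-off rather than the usual two-way one.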