NeurIPS 2020

Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits


Meta Review

I must admit that judging this paper was a fairly challenging task, given the mixed opinions expressed by the reviewers together with my own impressions after scrutinizing the manuscript in detail. The reviewers largely agree that the paper deserves credit for tackling the challenging, relevant, and (relatively) scarcely studied topic of restless bandit learning. In my view, the main value of the paper lies in the introduction of the birth-death Markov chain structure for the arms of a restless bandit, together with the monotonicity and positive correlation assumptions on rewards and transitions. These assumptions are not unnatural, as evidenced by the modeling literature on scheduling over wireless channels and queueing systems, and they appear to greatly reduce the computational complexity of part of the learning process.

On the other hand, the reviewers are not fully convinced of the significance of the proposed algorithm and the regret bound proven in the paper, given that the analysis is carried out for a highly structured ensemble of Markov decision processes. A technical question about the analysis that I was unable to resolve satisfactorily is: why is optimistic planning actually required in an explore-then-commit type algorithm? At least in bandits, this does not appear to give order-wise improvements in regret.

On the subject of the new structural assumptions: while identifying such structure is, of course, not always easy, I feel that a paper introducing new assumptions must go further in delivering a complete solution. For instance, the question of how computationally demanding the planning oracle is to implement remains largely unresolved (the author response gives only superficial insight into it); it is therefore unclear whether the exponential reduction in the complexity of handling the uncertainty set is ultimately a practical advantage (what if the planner is still very expensive to run?).

Finally, for a work that proposes to lighten the computational burden of learning restless bandits, it is disappointing to see experiments carried out only on Markov chains with 2 states (which are trivially birth-death Markov chains). A more convincing demonstration would have shown an end-to-end implementation of the learning algorithm on arms with several states, illustrating the full power of the framework. A more comprehensive set of experiments on instances with many states per arm is highly recommended to demonstrate the value of the algorithmic solution.

In summary, the paper is balanced on a knife edge. I feel the author(s) could make a stronger case by motivating the proposed Markov chain structure more solidly, explicitly discussing the effort involved in offline planning, shedding more light on optimality within this structured class (perhaps by exposing fundamental limits on regret), and demonstrating experimental results on more complex setups.
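For readers less familiar with the structure discussed above, the following is a minimal, hypothetical sketch of a single arm modeled as an N-state birth-death Markov chain with monotone rewards. The state count, transition probabilities, and reward values are illustrative only and are not taken from the paper; the positive-correlation condition on transitions is not checked here.

```python
import numpy as np

# Hypothetical example: one restless-bandit arm as an N-state birth-death
# Markov chain. From state i the chain can only move to i-1, stay at i, or
# move to i+1, so the transition matrix is tridiagonal; rewards are assumed
# monotone (non-decreasing) in the state index.
N = 5
p_up, p_down = 0.3, 0.2          # illustrative "birth" and "death" probabilities

P = np.zeros((N, N))
for i in range(N):
    up = p_up if i < N - 1 else 0.0
    down = p_down if i > 0 else 0.0
    P[i, i] = 1.0 - up - down    # stay in the current state
    if i < N - 1:
        P[i, i + 1] = up         # move one state up
    if i > 0:
        P[i, i - 1] = down       # move one state down

rewards = np.linspace(0.0, 1.0, N)   # monotone rewards in the state index

# Sanity checks on the assumed structure.
assert np.allclose(P.sum(axis=1), 1.0)            # rows sum to 1 (stochastic matrix)
assert all(P[i, j] == 0.0                         # birth-death: tridiagonal support
           for i in range(N) for j in range(N) if abs(i - j) > 1)
assert np.all(np.diff(rewards) >= 0)              # non-decreasing rewards
```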