NeurIPS 2019
Sun Dec 8 – Sat Dec 14, 2019, Vancouver Convention Center
Paper ID 4824: Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems

### Reviewer 1

The main problem with this paper is that the mathematical formulation of the problem setting is not precise enough to understand exactly what is done later in the paper. For example, the paper is not clear about what is meant by the "entire parameters of the systems". What are P^{active} and P^{passive}? What is X_j in formula (1)? Elsewhere there is $X_{t,A_t}$ with two indices; is this the same quantity? What is the difference between a "true parameter" and a parameter, etc.? The experimental part is unreadable: even at maximum zoom, the figures remain very hard to read. Furthermore, several of the curves are plotted without a real explanation of how they were produced (e.g., over how many runs). Why does the curve in Figure 3 (left) look the way it does (jumps, lack of smoothness)?

### Reviewer 2

Under the periodic and simultaneous restarting assumption, the Thompson sampling algorithm is quite straightforward; the significance lies in the analysis of its regret. Allowing general benchmark policies in the regret analysis is a strength of the result. The writing is clear. My main concerns are as follows:

1. The simultaneous and periodic restart of all arms is rather limiting. Are there motivating applications that give rise to this model?
2. Under such a restarting assumption, the problem is essentially reduced from a restless bandit problem to a problem with i.i.d. processes by viewing each epoch as a super-time-instant. The authors should discuss this view of the problem and its implications for the regret analysis (e.g., the difficulties in mapping the analysis of Thompson sampling for i.i.d. reward processes to this setting of i.i.d. super-time-instants, and how the current analysis differs from existing analyses of Thompson sampling for i.i.d. rewards).
3. The abstract and the introduction, especially the presentation and comparison with the existing literature on restless bandits with unknown models, do not give a complete picture. The restrictive restarting assumption (which also significantly simplifies the regret analysis) is not mentioned. Note that the work by Liu et al. and by Tekin and Liu allows general Markov chains rather than two-state chains and does not need the restarting assumption. The work by Dai et al., although it requires stochastically identical two-state arms, does not rely on the restarting assumption either. A more complete picture of the result should be reflected in the title, abstract, and introduction. All these results, including this paper, make restrictive assumptions, largely due to the difficulty of the problem; they are just limited in different aspects.
4. Under known transition probabilities, with the restarting, the problem is a finite-horizon restless bandit problem with a horizon of length L (the period of restarting). In this case, the optimal policy is in general time-dependent. A clearer discussion of this issue and its ramifications for the regret analysis when the benchmark policy is the optimal policy would provide more insight and strengthen the paper.
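For concreteness, the epoch-as-super-time-instant view can be sketched as a toy simulation. Everything specific here is an illustrative assumption, not the paper's actual algorithm: the two-state chains, the independent Beta posteriors, the greedy stand-in for the planner, and (unlike the real setting) fully observed passive-arm states.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ARMS, L, N_EPISODES = 3, 5, 200  # arms, restart period, number of epochs (toy sizes)

# Hypothetical ground truth: for each arm and current state s in {0, 1},
# the probability that the next state is 1 (state 1 pays reward 1).
p_true = rng.uniform(0.2, 0.8, size=(N_ARMS, 2))

# Independent Beta(1, 1) posteriors on each (arm, state) -> "next state is 1"
# transition probability.
alpha = np.ones((N_ARMS, 2))
beta = np.ones((N_ARMS, 2))

total_reward = 0.0
for ep in range(N_EPISODES):
    # Thompson step: one posterior sample per epoch; within the epoch the
    # sampled model is fixed, so epochs behave like i.i.d. super-time-instants.
    p_sample = rng.beta(alpha, beta)

    # The periodic-restart assumption: every chain resets simultaneously
    # to a known state at the start of each epoch.
    states = np.zeros(N_ARMS, dtype=int)

    for t in range(L):
        # Greedy stand-in for the planner (not the paper's policy): pull the
        # arm whose sampled model gives the best chance of reaching state 1.
        a = int(np.argmax(p_sample[np.arange(N_ARMS), states]))
        next_state = int(rng.random() < p_true[a, states[a]])
        total_reward += next_state
        # Posterior update from the observed transition of the pulled arm.
        if next_state:
            alpha[a, states[a]] += 1
        else:
            beta[a, states[a]] += 1
        states[a] = next_state
        # Restless dynamics: passive arms also evolve (for simplicity this toy
        # assumes their states remain observable, unlike the actual setting).
        for j in range(N_ARMS):
            if j != a:
                states[j] = int(rng.random() < p_true[j, states[j]])
```

Because the sampled model and the restart state are refreshed together at each epoch boundary, the per-epoch trajectories are exchangeable given the posterior, which is exactly the reduction described above.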

### Reviewer 3

As I mentioned above, I consider the framework for Thompson sampling algorithms valuable, as it is easily implementable and allows for regret bounds with respect to a general class of policies. The analysis is also intuitive and not technically very difficult, combining ideas from posterior sampling that are now standard in the Thompson sampling literature with characterizations of the Bayesian regret through the Bellman operator and concentration of the parameter estimates. The proofs appear correct and are formally written. One thing I want to point out is that the episode length L necessarily needs to scale as o(T^{1/2}) (that is, sub-square-root in the horizon) for the statement of the main theorem to yield a sublinear regret bound. In the analysis, it appears that the sample counts used to establish concentration of the parameter estimate in episode $l$ around the true parameter are those at the start of the episode; these counts are not tracked through the episode itself. So the analysis seems tightest when $L$ is very small, say a constant that does not even grow with $T$. My feeling is that tracking the evolution of sample counts within episodes would make the analysis very delicate, but I am interested to hear from the authors whether a more fine-grained analysis would add any value, and how difficult it would be. In particular, should I expect the bounds in the main theorem to be tight? Are lower bounds easy to prove? I ask these questions because I am curious about how fundamental the episodic setting is and, in particular, how the episode length affects the regret bound. Are longer and fewer episodes necessarily much worse for regret in simulation, as the theoretical upper bound suggests?
I already consider the results in this submission valuable, given that the restless multi-armed bandit problem is very difficult and this is the first analysis I have seen that involves learning the transition matrices, but concrete answers to these questions would make me consider raising my score.