NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
This work studies a Bayesian multi-armed bandit with a known and finite time horizon. The proposed framework brings the operations-research technique of relaxing the non-anticipativity constraint to bandit problems and generates several interesting algorithms. These algorithms exploit knowledge of the time horizon in a systematic way and achieve better performance than Thompson sampling. The paper is well written and organized, and I believe the results are very interesting to the bandit community and even to the larger field of reinforcement learning. I have some comments:

1) Can this framework get rid of the knowledge of T? In the frequentist setting, one can use a doubling trick to extend a regret bound that assumes known T to the setting of unknown T (see the sketch after this review). I am wondering whether this framework can be extended to the setting of unknown T; that would be useful in practice.

2) In most cases, IRS has per-round computational complexity that is linear (or even polynomial) in T. This can be a problem when the time horizon is in the millions or more, which could limit its practical value.

3) Are there any theoretical results for IRS.V-EMAX in the form of Theorems 2 and 3?

A minor one: line 34: focus "on" the Bayesian setting ...

==== After author response ====
Rebuttal read. I keep my score.
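For concreteness, here is a minimal sketch of the doubling trick mentioned in comment 1). The environment and the horizon-aware policy (`BernoulliBandit`, `explore_then_commit`) are illustrative stand-ins, not the paper's IRS algorithms; the point is only that any policy requiring T up front can be restarted on phases of doubling length, so the true horizon never has to be known.

```python
# Hedged sketch of the doubling trick: `BernoulliBandit` and `explore_then_commit`
# are illustrative stand-ins (not the paper's IRS policies). Any algorithm that
# needs the horizon up front can be restarted on phases of doubling length.
import numpy as np

class BernoulliBandit:
    def __init__(self, means, rng):
        self.means = np.asarray(means)
        self.n_arms = len(self.means)
        self.rng = rng

    def pull(self, arm):
        return float(self.rng.random() < self.means[arm])

def explore_then_commit(env, horizon):
    """Toy horizon-aware policy: round-robin exploration, then commit to the best arm."""
    n_explore = max(1, int(np.sqrt(horizon)))                 # exploration rounds per arm
    counts = np.zeros(env.n_arms)
    sums = np.zeros(env.n_arms)
    total = 0.0
    for t in range(horizon):
        if t < env.n_arms * n_explore:
            arm = t % env.n_arms                              # explore
        else:
            arm = int(np.argmax(sums / np.maximum(counts, 1)))  # commit to empirical best
        r = env.pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

def doubling_trick(env, total_rounds):
    """Play `total_rounds` rounds without knowing the horizon ahead of time."""
    played, total, phase_len = 0, 0.0, 1
    while played < total_rounds:
        h = min(phase_len, total_rounds - played)             # truncate the last phase
        total += explore_then_commit(env, h)                  # restart with phase horizon
        played += h
        phase_len *= 2                                        # phase lengths 1, 2, 4, 8, ...
    return total

rng = np.random.default_rng(0)
env = BernoulliBandit([0.3, 0.5, 0.7], rng)
print(doubling_trick(env, total_rounds=10_000))
```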
Reviewer 2
The paper applies information relaxation techniques to the finite-horizon multi-armed bandit problem. By proposing a series of information relaxation penalty functions, the authors derive a series of upper bounds on the optimal value function that are increasingly tighter at the cost of increasing computational complexity. Using these penalty functions, the authors also propose a spectrum of algorithms with increasing computational complexity, with one end being Thompson sampling and the other end being the optimal policy. Algorithms derived from this technique take into account the effect of having a finite horizon in a natural way.

I think these results are very interesting and the approach shows great promise for problems with a relatively short time horizon. The simulation results show that the proposed algorithms perform better than vanilla Thompson sampling, at the cost of more compute time. What seems interesting to me is that the performance of information directed sampling looks comparable to the IRS algorithms, even though IDS only looks at the current time period at each step and does not consider the future or the finite horizon. What is the advantage of IRS over IDS? My guess is that IDS probably does not scale well with the number of arms. Could you please confirm?

The paper is well written in general. There are a few minor typos, so please proofread. Also, since IRS.Index achieves the best regret empirically, it might be worthwhile to discuss it in the main text.
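To make the endpoints of that spectrum concrete, the Thompson-sampling end is just posterior sampling run over the known horizon; the sketch below is a minimal Bernoulli version for illustration only, not the authors' code. My understanding is that the IRS variants would replace the single posterior sample and greedy step with a penalized look-ahead over sampled future outcomes, which is where the extra per-round cost comes from.

```python
# Minimal Bernoulli Thompson-sampling sketch, intended only to ground the
# "Thompson sampling" end of the spectrum described above (roughly, the
# zero-penalty case); it is an illustration, not the authors' implementation.
import numpy as np

def thompson_sampling(means, horizon, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors, run for `horizon` rounds."""
    rng = np.random.default_rng(seed)
    k = len(means)
    alpha, beta = np.ones(k), np.ones(k)      # Beta posterior parameters per arm
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)         # one posterior sample per arm
        arm = int(np.argmax(theta))           # act greedily on the sampled means
        reward = float(rng.random() < means[arm])
        alpha[arm] += reward                  # conjugate Beta-Bernoulli update
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))
```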
Reviewer 3
I think this is a good paper with a clear structure, and it is nicely written. However, the paper focuses on stochastic bandits with independent arms, which is an extensively studied setting with abundant results. Although the authors present a convincing theoretical analysis, there is no significant improvement over previous bounds (as the authors mention, TS for the MAB is already minimax optimal). The generalisation of TS to finite-horizon problems is not novel, and the exploration-exploitation trade-off has also been addressed in previous literature. This could be the start of an interesting line of work, but I do not think the current version is above the acceptance threshold.