NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2821 Learning in Generalized Linear Contextual Bandits with Stochastic Delays

### Reviewer 1

This paper studies online learning in the generalized linear contextual bandits where rewards are observed with delays. UCB-based and Thompson sampling based algorithms are proposed, respectively. Theoretical analysis of these two algorithm in terms of regret bounds are provided. The problem is very fundamental and practical. The proposed algorithms have theoretical perform and the theoretical analysis is reasonable. It may be interesting to consider some real applications and test the performance of the proposed algorithms.

### Reviewer 2

BRIEF SUMMARY ------------------------- In this work, authors study the GLB problem with delayed feedback. In this setting, they describe how to adapt the UCB and TS (bayesian) algorithms for which they prove similar frequentist and Bayesian regret bounds respectively. They work under the assumption of independent stochastic delays, but they allow the distribution to have a near-heavy tail distribution, hence going beyond the sub-Gaussian case. Finally, they also provide a characterization of the regret under more restrictive assumptions on the delays (iid or bounded). GENERAL REMARKS ------------------------------ The paper is well written and easy to follow. The problem is well motivated and I appreciated that they discuss the regret bound they obtain w.r.t. the structure of the delay. In particular, how the regret bounds simplify when the delays are iid or bounded brings insight as it allows a straightforward comparison to previous work. On the negative side, the algorithm involves some universal (non-explicit) constant which impacts the tuning parameter $\tau$. It would be nice to have a discussion on how to set $\tau$ appropriately. Further, the technical contribution consists mainly in adapting existing result on delayed feedback and GLB regret bounds. That being said, I believe this is offset by the novelty (to the best of my knowledge) of the result and because it brings a valuable contribution to the contextual bandit literature. COMMENTS AND QUESTIONS ------------------------------------ 1) TS versus UCB. The comparison between the two algorithm is somehow unfair. Although it is commonly acknowledge that TS performs better than UCB in practice and is easier to implement, it is mainly because the DTS instance considered here is design for the **Bayesian regret** guarantee. The TS algorithm with **frequentist regret** guarantee in GLB (see [1]) requires a similar careful design of the confidence parameter as UCB and moreover offers a worse regret bound by $\sqrt{d}$. To compare the two algorithms, one should consider the instances which offer the same type of guarantee (Frequentist or Bayesian). The frequentist guarantee is way stronger than the Bayesian one (it actually implies it), which explains why DUCB is more complicated than DTS. The consequence is that Bayesian TS is way less aggressive in the exploration than UCB which may explain its better empirical performances. As a result, I would be cautious in doing such comparison (in Sec. 1.2 and Conclusion), unless experiments are provided and conducted with the appropriate algorithm instances. 2) Arm set. Authors consider the finite arm case, where $K$ arms are sample iid at each time step from some distribution which should satisfy that they span (in some sense) every directions. What about the infinite arm case ? Regret bounds have been derived in this setting for both UCB and TS, and I believe that this analysis can extend to this case. Is it ? What if the arms are fixed at the beginning of the trajectory ? Again it seems that it would only require modifying Asm. 1. Overall, I think that the structure of the arm set can be relaxed with minor modifications, which would allow for a less restricted setting. MINOR COMMENTS ------------------------------------ - Check the numbering of Thm, Prop and Cor in the main text. - In the related work section, you may also mention [1] and [2] for the TS in GLB and Bandit with delay respectively. - Line 207. $p$ should be $q$ isn't it ? REFERENCES ------------------- [1] M. Abeille, A Lazaric. Linear Thompson sampling revisited - Electronic Journal of Statistics, 2017 [2] Vernade, C., Cappé, O., & Perchet, V. (2017). Stochastic bandit models for delayed conversions.

### Reviewer 3

The problem considered in this paper is well motivated, and the constraint of delay is of practical interest. Also, it looks like this is the first paper that extends existing literature into the stochastic parametric bandit setting. Two things I would like to see, but do not appear are: 1. A lower bound for learning under stochastic delay. Although I understand that the regret bounds provided here collapse to the optimal regret bound when no delay exists, it would still be great if one can fully characterize the hardness of the problem. 2. A more interesting DTS analysis would go into the frequentist framework similar to [1]. Even for the Bayes setting described here, it would be helpful if the authors could describe in details what is the unique difficulty of extending the results from DUCB to DTS if one simply wants to adopt the reduction framework proposed in [2]. [1] S. Agrawal, N. Goyal, "Thompson Sampling for contextual bandits with linear payoffs". In ICML 2013. [2] Russo D, Van Roy B, "Learning to optimize via posterior sampling". in Mathematics of Operations Research 2014.