NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1546
Title:On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems

Reviewer 1

The paper studies the Follow-the-perturbed (FTPL) leader algorithms for multi-armed bandit problems (both stochastic and adversarial). In the stochastic setting, a unified analysis of FTPL with sub-Weibull perturbations and FTPL with bouded perturbatinos is given in the case of sub-Gaussian arms. In the adversarial setting, the paper provide a negative result regarding the construction of a FTPL algorithm achieving the optimal $\sqrt{KT}$ regret bound: it asserts that there is no perturbation so that FTPL is equivalent to the optimal INF algorithm. A conjecture, based on extreme-value theory, regarding the form a such an optimal perturbation is proposed. Overall, I believe the paper brings several interesting contributions to the study of FTPL algorithms for MAB.

Reviewer 2

********* [added after the rebuttal] I have read the reviews of the other reviewers, who did not raise any major concerns, and so the authors' rebuttal wasn't expected to bring any major correction or clarification. Overall I stick to my originally given score. ********* Originality: for the adversarial setting, the paper builds on and significantly extends previous work [1,2]. For the stochastic setting, it seems to be novel. Quality: technically strong, using concepts and methods from different theories, providing several statements with proofs and a conjecture. Clarity: very good, easy to read. Significance: I found the unification ideas of high interest. The problem in the adversarial setting is very hard and the authors showed a mix of negative results and a positive result in a special case. Minor comments: L52: please specify the range for t (starting at 0 or 1?) L83: after -> before L167: FTRL is not defined until L185 L285: what is the difference between the number of epizodes and the number of iterations (mentioned in Figure 1)? L287: define the algorithm parameters that need tuning L288: B(1,1/2) is not defined

Reviewer 3

Originality: This paper manages to unify different perturbation based algorithms to one template and as a byproduct it gives a randomized version of UCB. Moreover it excludes the two natural approaches for obtaining the optimal regret bound for adversarial bandit problem and provide a conjecture with strong evidence. Quality: The proofs are reliable. The paper gives supporting argument on the conjecture it proposes. Clarity: The paper is clearly written. Significance: The general analysis on perturbed algorithms for stochastic multi-armed bandit problem is inspiring. We may use it for more complex settings. Also it eliminates two possible ways of pursuing the optimal regret bound for adversarial multi-armed bandit problem and shine a light on what the optimal perturbation should be.