NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3185
Title:Equipping Experts/Bandits with Long-term Memory

Reviewer 1

Summary: This paper proposes black-box reduction methods from online learning with experts to that of tracking best shifting experts. In particular, the paper considers two settings, full information, and adversarial bandit setting, where the latter gives only feedbacks of chosen experts. The proposed algorithms achieve better bounds than previous work, as well as adaptive algorithms. For the bandit tracking problems, the paper also shows a lower bound of the regret. Comments: The proposed black-box reduction from tracking experts to competing with a static optimal expert is new (as far as I know) and a breakthrough in the theory. The algorithms are new but simple. The results looks feasible while I have not checked the details. The technical contribution is substantial and I recommend this paper for acceptance.

Reviewer 2

The paper is nice to read, the proofs are sound and the literature seem to be well cited. On the negative side, the different regret bounds proved in this paper improve existing regret bounds in very specific setting (depending on n, S, and the total number of experts) and some of them may not be optimal (such as the one on bandits with sparsity). This might thus only interest a small niche of experts only. Other comments: - Some discussion should be made somewhere about the optimality of the bounds. What is known to be optimal and what is not? - About the bandit feedback with sparsity: it would be nice to add figures that compare the rates of the existing bound and the one of Thm. 8 to better understand when this one is better and when it is not. In particular, it is stated that this is the case when S and K are very large. But only S needs to be large, since the assumption $T>S>max{T/K^3,K^5/T}$ is enough which can be satisfied for small K. - About the existing results for the stochastic setting (see line 47-52): when reading this paragraph, I am not sure if the existing results refer to best-of-both-worlds results only or to any results for switching regret in stochastic environment. Does there exists any results for tracking a small set of experts for stochastic losses? - I am wondering if an algorithm such as Coral (see Agarwal, Luo et al. 2016) could be use in the sparse bandit setting to improve the rate to sqrt{T} instead of T^{2/3}.

Reviewer 3

The main novelty in the full-information part of the results is the black-box reduction to the confidence-based framework. The main intuitive idea is that confidences are generated for experts based on how good they have looked in the past/how bad regret has been with respect to them so far (Step 7 of Algorithm 1), and the confidence is multiplied by the probabilities generated by the actual expert algorithm *with a static regret guarantee*. The algorithm and analysis provides a conceptual look into how switching regret minimization can be achieved and is interesting. Proof (of Theorem 3) appears essentially correct. These ideas also yield new results for the sparse bandit version of the problem, which has seen recent progress. I did not check the proofs in detail, but it seems that the building blocks seem a bit easier to achieve than the original expert-based approach (here, they are achieving worst-case static regret, and worst-case switching regret over 2 actions, as opposed to directly achieving worst-case switching regret over K actions). It still requires sophisticated regularization with OMD "one-sided log barrier", which I have not seen in prior work and appears to be a technical contribution. Proofs appear correct (although I only checked in detail for full-information) and formally written. The submission is also nicely contextualized in related work. While the results are nice, I am not sure about the broad appeal of this problem to people in the NeurIPS community outside of the online learning community; so it would be nice to hear from the authors on whether they think the regret guarantee and this problem more generally has scope in any practical machine learning application.