NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2673
Title: Blocking Bandits

Reviewer 1

The paper makes three key contributions. First, the authors provide results on the hardness of the problem by reducing it to the PINWHEEL scheduling problem, and provide negative results on the solvability of the problem even with prior knowledge of the rewards and delays of all the arms. Second, motivated by the obtained hardness result, the authors show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically (1 - 1/e) optimal. Finally, the authors show that, by exploiting the free exploration of arms due to their unavailability, the UCB algorithm achieves O(log T) cumulative regret with respect to the oracle greedy algorithm with known rewards.

The presentation of the paper was mostly clear. The claimed contributions are discussed in the light of existing results, and the paper surveys related work appropriately. Specifically, the authors place and compare the proposed setting in the context of other settings such as sleeping bandits, combinatorial semi-bandits, and bandits with budgets. Regarding the quality of the writing, the paper is reasonably well written; the structure and language are good. The paper is technically sound, and the proofs seem to be correct as far as I checked.
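The oracle greedy strategy summarized here (each round, play the available arm with the highest mean reward, which then becomes unavailable for its delay) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the function name and the convention that `d[i]` counts the rounds until the arm is free again are assumptions:

```python
def greedy_blocking(mu, d, T):
    """Oracle greedy for blocking bandits: each round, play the
    available arm with the highest mean reward mu[i]; pulling arm i
    at round t makes it unavailable until round t + d[i]."""
    K = len(mu)
    next_free = [0] * K          # earliest round arm i is available again
    total = 0.0
    for t in range(T):
        avail = [i for i in range(K) if next_free[i] <= t]
        if not avail:            # every arm is blocked this round
            continue
        i = max(avail, key=lambda j: mu[j])
        total += mu[i]
        next_free[i] = t + d[i]
    return total
```

With two arms of means 1.0 and 0.5 and delay 2 each, the policy alternates between them and collects 3.0 over 4 rounds.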

Reviewer 2

This paper introduces an interesting and original framework for multi-armed bandits with potential applications. I liked the impossibility result and the global approach of the paper. Yet, in my opinion, the paper still needs some more polishing. It sometimes lacks clarity and contains typos, as well as proofs that I found hard to follow. Furthermore, the tightness of the results should perhaps be discussed in more detail.

I have two main comments:
- The lower bound and the upper bound do not really match. In particular, if we apply the reward instance used in the lower bound to the upper bound, the latter becomes infinite (because the Delta_{ij} are zero), am I right? I found the dependence on 1/Delta(K,K^*) quite weak, since this quantity may be really small. I would be interested to see a lower bound with that dependence, or at least experiments showing whether it is really observed in practice. Furthermore, there is also a gap of a factor K between the upper bound and the lower bound.
- What about the dependence on the delays? The proposed algorithm does not use knowledge of the delays. Would it be possible to improve the results if they are known? The impossibility result states that we cannot reach the optimal strategy, but can we get closer to it?

Other remarks:
- Distribution-free upper bound: the quantity Delta(K,K^*) may indeed be very small in many situations; is it possible to also state a distribution-free bound?
- Is the factor (1-1/e) tight? The best factor obtained by the lower bound is 3/4, isn't it?
- About Prop 3.4: The choice epsilon=0 seems to be the best, leading to a ratio of 3/4. Why do you need epsilon at all? Also, as in all the propositions, the statement should be more precise. In particular, it is assumed that epsilon<1, which is not stated in the proposition. Furthermore, "for any epsilon" should come before "there exists", since for each epsilon the instance is different.
- About Prop 3.5: I could not follow the proof at all. The greedy algorithm plays the (K+1)th arm if (1+eps)/(K+1)>1, but then its cumulative reward is better than the one stated for the optimal algorithm... What did I miss?
- About Thm 4.1: the upper bound is a bit hard to read. It would be useful to simplify it or to explain the different terms.
- In the experiments, it is stated that the greedy algorithm sometimes gets blocked in a wrong cycle, leading to a negative cumulative regret. I would expect that the opposite is also possible, with UCB blocked in a suboptimal cycle. Isn't that possible with small probability? A high-probability regret bound would be useful. Furthermore, would some random perturbation of the prediction make it possible to fall into a better cycle?
- A proof of the examples on l.254 would be nice.
- I do not understand how K^*=20 is possible in the experiments, since the delays are all smaller than 20. Are all delays equal?

Typos:
- l.72: mu_1 > mu_2 > ... > mu_K should be stated before the definition of Delta
- l.73: >=1 missing in the definition of K^*
- In some results there are K arms, in others K+1 arms. Please make this uniform to ease the reading.
- l.205: note that H is the Riemann zeta function
- l.208 and later: Delta_{ij} -> Delta(i,j), otherwise define it
- l.236: ",,"
- l.237: "exploraiton"
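For concreteness, the delay-agnostic index policy under discussion (UCB restricted to the currently available arms, so that blocking of the leading arms forces free exploration of the rest) could be sketched as follows. This is a hypothetical illustration of the idea, not the authors' exact algorithm; all names and the confidence-width constant are assumptions:

```python
import math
import random

def ucb_blocking(pull, K, d, T, seed=0):
    """UCB over available arms only. `pull(i, rng)` returns a reward
    for arm i; pulling arm i at round t blocks it until round t + d[i].
    Note the policy never reads d[i] to choose an arm: the delays only
    determine availability, so no delay knowledge is needed."""
    rng = random.Random(seed)
    counts = [0] * K
    sums = [0.0] * K
    next_free = [0] * K
    total = 0.0
    for t in range(T):
        avail = [i for i in range(K) if next_free[i] <= t]
        if not avail:
            continue
        untried = [i for i in avail if counts[i] == 0]
        if untried:                      # play each arm once first
            i = untried[0]
        else:                            # highest upper confidence bound
            i = max(avail, key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t + 1) / counts[j]))
        r = pull(i, rng)
        counts[i] += 1
        sums[i] += r
        total += r
        next_free[i] = t + d[i]
    return total
```

The point of the sketch is that the arm-selection rule depends only on observed rewards and current availability, which is why improvements from known delays are a natural open question.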

Reviewer 3

The paper studies a version of the multi-armed bandit problem in which, when arm i is pulled, it cannot be pulled again for at least d_i rounds. While the problem itself is interesting, I have some doubts about the results. My most important concern is about the lower bound for the offline version of the problem. I am suspicious about the reduction from this problem, named MAXREWARD, to another problem called PINWHEEL SCHEDULING (the reduction presented in the appendix). Here is why: an instance of PINWHEEL SCHEDULING turns into an instance of MAXREWARD of size \Pi_i d_i = d_1 * ... * d_k (because T has to be this large in order to distinguish YES cases from NO cases). On the other hand, it is not hard to see that PINWHEEL SCHEDULING does admit a dynamic programming algorithm whose running time is polynomial in \Pi_i d_i = d_1 * ... * d_k. Therefore I am not sure what conclusion this reduction is supposed to support.

---- UPDATE AFTER REBUTTAL: Here are more details on my comment and my confusion regarding the lower bound. For the lower bound, a decision version of the problem of the following form is studied: "Given an instance of PINWHEEL SCHEDULING and T, is the optimal reward T?" It is assumed that T is given in binary format, so a "poly running time" is expected to be poly in log(T). ----

Also, there are minor issues in the paper which make it a bit hard to read. One of them is in the definition of \tau_{i,t}. Is it defined correctly? I guess the authors actually meant to replace min_{t' >= 1} with max_{t' <= t}.
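The dynamic program alluded to above can be sketched as a search over the finite graph of "slack" vectors (state[i] = rounds left before task i violates its deadline d_i), whose size is at most \Pi_i d_i. This is an illustrative reconstruction under the standard reading of PINWHEEL SCHEDULING, not code from the paper, and the names are assumptions:

```python
def pinwheel_feasible(d):
    """Decide PINWHEEL SCHEDULING for periods d[0..K-1]: does an
    infinite schedule exist in which task i appears in every window
    of d[i] consecutive slots?  State: tuple of slacks (rounds left
    until each task's deadline).  Scheduling task j resets its slack
    to d[j]; every other slack drops by 1.  The state graph has at
    most prod(d) nodes, so an infinite path exists iff the start
    state survives pruning of dead ends (states with no successor)."""
    K = len(d)
    start = tuple(d)
    seen = {start}
    stack = [start]
    succ = {}
    while stack:                         # enumerate reachable states
        s = stack.pop()
        outs = []
        for j in range(K):
            t = tuple(d[j] if i == j else s[i] - 1 for i in range(K))
            if min(t) >= 1:              # no deadline violated
                outs.append(t)
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        succ[s] = outs
    alive = set(seen)
    changed = True
    while changed:                       # prune states with no live successor
        changed = False
        for s in list(succ):
            if s in alive and not any(t in alive for t in succ[s]):
                alive.discard(s)
                changed = True
    return start in alive
```

Since an infinite walk in a finite graph must revisit a state, feasibility reduces to the existence of a cycle reachable without violating any deadline, which the pruning loop detects in time polynomial in the number of states, i.e., in \Pi_i d_i.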

Reviewer 4

The paper studies a new multi-armed bandit setup where, after each pull of an arm, it is blocked (i.e., yields zero reward) for a certain number of rounds (this number is known). The setup captures some real-life applications nicely, as motivated in the beginning of the paper. The paper also makes it clear in Sec 1.2 how this new setup is connected to existing ones and why those results do not directly solve the problem. The paper includes a set of theoretical results as listed above, starting from the natural question of whether one can solve the offline version of the problem with knowledge of the expected rewards of all arms, and then coming back to the online setting and discussing how to ensure low regret against a reasonable offline strategy. The techniques for proving the regret bound (highlighted on Page 2) indeed seem novel.

There are still several loose ends though. For example, the regret upper and lower bounds do not match exactly. Also, it is not entirely clear to me that a computational hardness result for finding the exact solution of the offline problem excludes the possibility of an efficient online algorithm with sublinear (exact) regret. Overall, I believe the paper studies an interesting new setting and provides several solid results. I recommend accept.
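The setup described here (a pulled arm yields zero reward while blocked, with known blocking lengths) can be captured in a minimal environment sketch. Bernoulli rewards and all names are illustrative assumptions, not from the paper:

```python
import random

class BlockingBanditEnv:
    """Blocking-bandit environment: pulling arm i at round t blocks
    it until round t + d[i]; a pull of a blocked arm returns 0.
    Rewards are Bernoulli(mu[i]) as an illustrative choice."""

    def __init__(self, mu, d, seed=0):
        self.mu, self.d = mu, d
        self.t = 0
        self.next_free = [0] * len(mu)
        self.rng = random.Random(seed)

    def available(self):
        """Indices of arms that may be pulled this round."""
        return [i for i, f in enumerate(self.next_free) if f <= self.t]

    def pull(self, i):
        r = 0.0
        if self.next_free[i] <= self.t:          # arm is available
            r = float(self.rng.random() < self.mu[i])
            self.next_free[i] = self.t + self.d[i]
        self.t += 1                              # one round per pull
        return r
```

An online algorithm in this model would query `available()` each round, which is exactly the interface through which blocking constrains (and, as noted, implicitly aids) exploration.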