Paper ID: | 3526 |
---|---|

Title: | Nonstochastic Multiarmed Bandits with Unrestricted Delays |

The paper deals with algorithms and regret guarantees for the non-stochastic delayed reward bandit problem. The authors make three main contributions. 1. For the setting of non-stochastic bandit problems with unknown, variable, but bounded delays the authors establish regret guarantees for the delayed EXP3 algorithm. These regret guarantees establish a conjecture of Cesa-Bianchi[2016]. 2. The authors then study regret guarantees for the case where the delays are potentially unbounded, and variable. For this setting the authors provide an algorithm that is a slight variant of delayed EXP3. Regret bounds are established for this setting. 3. Furthermore, via a non-standard application of the doubling trick the algorithms are made robust to various unknown parameters. The results are significant and the problem of bandit learning with delayed feedback is significant. The writing could have been improved as it was not always easy to follow the paper. I have a few suggestions and questions. 1. It is important to clearly explain the setup early on in the paper, ideally in the introduction itself. I am not that familiar with delayed reward setup in bandits and it took me a bit of time to understand (1) How the regret of the learning algorithm is calculated. (2) What extra information does an oracle comparator have that the learning algorithm does not have. (3) How the delays effect the learning algorithm. Furthermore, I initially assumed that the delays are themselves arm dependent but that turned out not to be the case. Hence, it is important that the authors clearly explain the above points in the introduction so that the learning setup is clear. 2. It looks like the idea of stability-span is new to the analysis and that the definition/usage of this quantity is also a contribution of this paper. Can the authors confirm this? 3. On line 119, the authors claim that N \leq 2 \max d_t. However, since the set N_t has at most d_t points, it should be the case that N \leq \max d_t. Can the authors explain why a factor of 2 is necessary? 4. Does Theorem 1 assume that the losses are bounded? 5. For the skipper algorithm it would be great if the authors could provide some intuition as to why it achieves potentially lower regret. What would be especially insightful is if the authors can explain why the algorithm works fine even though it potentially throws away information. I understand that the stability span in this case is lower, but I would like to see a simple, explaination that does not have to resort to the use of stability span. 6. In section 4, it would be nice if the authors explained the reasoning behind the definition of epochs as shown in equations 4-6. For example, why is the definition of epoch in equation 6 the correct definition? Overall, I like this paper and I think it is an important contribution. However, the writing is not up to the mark and with some effort this paper can be made reader-friendly.

The paper is well-motivated especially the delay at action/observation time settings. These are real settings one must consider in practice and no previous works sufficiently studies them. My only real consider is the assumptions needed in this work. How reasonable are these in practice? Though, since this seems to be the first work in this area, I guess assumptions are OK to show initial results. The authors do acknowledge this and are able to come up with a refinement which removes assumptions. One compliant I have about the paper is that the authors do not provide an empirical results. The theoretical results are nice but this paper's quality would benefit from even the most rudimentary experimental section. I'm disappointed the authors did not include this section. ********** I have read the author's response and am not satisfied. The first point they make about not running experiments because it's not possible to cover all scenarios is a disappointing. Experiments do not have to cover all scenarios, that's kind of their empirical nature. There are plenty of scenarios the authors could experiment with to at least provide some additional evidence the algorithm works. The second point is even more disappointing. One always has something to compare against, otherwise how do you know your method has any merit? For example, compare against an algorithm which picks random arms. This is the minimum baseline and is always available to compare against. One can also compare against algorithms not designed for delays. If the proposed algorithm is worth publishing then it should beat these algorithms. I expect more from authors when submitting to a venue of this caliber.

The paper considers a version of the adversarial stochastic multi-armed bandit problem where the feedback received from pulling an arm is delayed. The delays are time-varying and can be selected by an oblivious adversary. The main algorithm presented is referred to as the “skipper” algorithm, and refines exponential weights analysis by skipping rounds with large delays, and is based on prior knowledge of the cumulative delay. When delays are only revealed each times an action is taken, and no prior information is granted on the cumulative delay (or the horizon length) the authors provide a doubling scheme that achieves a refined regret bound. I like the paper and definitely appreciate various aspects of it. The topic and the algorithmic approach are interesting, the paper is well-written, and, as far as I could tell, analytically correct. I don't have major concerns regarding the paper. My main comments are as follow. 1. Since in many application domains delay information is only realized when feedback arrives. This is true also in the motivating examples that are given by the authors in page 2 (while expected delay might be available at time of action, the realized delay can often be different from this expectation. It would definitely be insightful to understand how one could extend current results and algorithms in that direction, and a first step in this direction can be to have some robustness analysis that will help to understand to extent to which the performance of the doubling scheme deteriorates when actual delays are (even a bit) different from the one observed at the time of the action. 2. Similarly, it would be insightful to understand how robust the skipper algorithm is for misspecifications in the cumulative delay D. I apologize if I missed a discussion along these lines in the paper. 3. It would be very insightful, and would clearly increase the contribution of the paper, to have some form of optimality for the results. Ideally that would be established through a refined lower bound for time-varying delays, but even some discussion using examples or numeric analysis might be helpful in that respect.