Reviews: Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

i. Novelty: The paper addressed the similar problem setups introduced in [1], [3], [10], [11] in an adversarial bandit feedback setup and showed that the classical EXP3 algorithm is robust enough to be applied in the delayed feedback setup as well. However I have a major issue with their proposed algorithms which seems erroneous due to the choice of the learning rate (\eta) which requires the knowledge of the delays (d_t) -- but according to the problem formulation this is unknown to the learner. Then the whole technique seems to be pointless! ii: Choice of learning rate (\eta): How to select the learning rate \eta? The optimal learning rate seem to depend on the delays (d_t), e.g. Thm 1, Line-146 etc., but those are unknown to the learner. The claims of the paper stands vacuous if the proposed technique requires the knowledge of delays, where lies the major challenge of the problem addressed. iii. Writing: Overall, the paper is well structured and as easy to follow. However there are some minor issues which could be taken care of: a) In Eqn. (1) and (4), it is better and least confusing to use just \min_i \sum_{t}l_t^{(i)}. b) Line 91: Unexpected occurrence of "," c) How is p_1 initialized in Alg-1 and 2 (uniformly I presume)? d) Introducing the notation \lamda^* in Thm 2 is unnecessary e) \eta_e expression incorrect in Eqn (9). etc. iv. Clarity of the results and comparison with earlier works: a) Overall, the paper does not draw a clear picture of how their result improves over the existing work, which would have been very useful to understand their precise contributions. b) Some intuition behind ergodic average (Eqn. 15) would be appreciated. c) Dependence of delay is somewhat unclear in the second setting (of two player game). In my understanding the statement "the delays do not have to be bounded (in t) for the convergence to NE to hold" (Line-186) is tad confusing, e.g. is there a way to set \eta_t, \gamma_t if d_t = e^t, and still achieve "ergodic convergence"? Does Thm-2 yield back the classical NE convergence result of no-regret algorithm [14] when there is no delay, i.e. d_t = 0 \forall t --- explain how? v. Experiments: Its disappointing that the paper does not provide any empirical studies to validate the theoretical guarantees/ comparative performances with existing works.

Major Comments. While online learning problems with delay are important problems, I have two concerns about the paper. First, there is a lack of comparison with relevant literature, which hinders the readers from understanding the novelty and the significance of the paper. Indeed, reference [3] is very relevant to the paper, since [3] provides a generic framework for translating online algorithms without delays to those with delays, with a mild deterioration in the regret bounds. The framework of [3] also allows adversarial online learning problems with an oblivious adversary (See their Theorem 1), for example the first problem considered by the authors. While I understand that the authors’ algorithms are different, do the authors’ algorithms suffer a smaller regret bound than those in [3]? There is no discussion on this matter in the main text or in the supplementary material. Such a discussion will allow readers to see the novel aspects of the paper in comparison to [3]. My second concern is about the parameter-free version of Algorithm 1, which is provided in Algorithm 2. In Line 142, the authors claim that the sum of delays can be guessed by the doubling trick shown in equation (8). However, I have a hard time convincing myself on the effectiveness of the learning rate in (8), and I do not think that Algorithm 2 can achieve the regret bound in Theorem 1. Before elaborating on the second concern, I would like to point out that the algorithmic framework in [3] is parameter-free. In order to improve upon [3], the authors need to show that Algorithm 2 has a smaller regret bound than [3]. To understand my second concern, consider the following example. Let say the horizon $T$ is equal to $\sum^L_{\ell = 0} 2^ell $, so that there are epochs $1, \ldots, L$, which are respectively of lengths $2^0, 2^1, …, 2^L$. Suppose that there is no delay in epochs 1, …, $L$, so that $m_t = 1$ in epochs $1, \ldots L$. In addition, suppose that there is say a delay of say $M$ is epoch $L+1$, where $M$ is an integer much larger than 1. On the one hand, at the start of epoch $L+1$ (which is at the start of time $(1 + 2 + … + 2^{L-1}) +1$), the player has observed that $m_t = 1$ for each time $t$ in each of the epochs $1, \ldots, L$. Consequently, the summation $\sum^t_{t=1} m_t$ in the denominator in (8) is chosen as $2^L - 1$. On the other hand, we know that the sum of $d_t$ in epoch $L+1$ (which is required in Theorem 1) is in fact $M \times 2^L $. Given such a discrepancy, Algorithm 2 would not achieve the regret bound in Theorem 1. I could have mis-understood something here, since there is a typo ($\sum^t_{t=1} m_t$) in equation (8), and I take it as $\sum^t_{s=1} m_s$. Regardless, I do not think that the sum of delays can be guessed by the doubling trick. Indeed, different from the horizon $T$ which is increased by 1 per time step with certainty, the sum of delays can grow in an arbitrary manner. There is something missing in moving towards the parameter free version of Algorithm 1. For example, in guessing the sum of delays, there must be some error in the guess, leading to a sub-optimal choice of $\eta$. Can the authors provide a general version of Theorem 1 under an arbitrary $\eta$? Minor Comments. The paper could be reorganized in order to flesh out the novel contributions. The proofs of Corollary 1 and Theorem 2, which occupy much of the main text, only involve routine calculations. The only non-routine part in these proofs are the use of Lemma 3. My understanding is that the novel contributions come from Lemma 3, which is not even stated in the main text, but is stated and proved in the supplementary material. The paper would have been better explained if the authors provide more intuitions on the proof of Lemma 3, and defer the proofs of Corollary 1 and Theorem 2 to the supplementary material. In Line 118, the authors seem to emphasize that that their algorithm 1 is competitive in comparison to the best mixed strategy, which is a tad peculiar since it is clear that the loss under the best deterministic strategy is the same as the loss under the best mixed strategy. In Proposition 1, the authors mention about $l_t$, what is $l_t$? Does $l_t$ equal to $U(\cdot, a_t)$, where $a_t$ is the random action chosen by the column player in time t?

Paper ID:	6063
Title:	Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

Reviewer 1

Reviewer 2

Reviewer 3