Summary and Contributions: This paper provides nearly tight first- and second-order regret bounds for adversarial bandits. This result is achieved by polynomial-time algorithms that rely only on linear optimization oracles. The main novelty is a distribution-truncation technique, which is applied to the standard continuous exponential weights algorithm.
Strengths: The result is new and will be of interest to the bandit community. The solution to this problem is also elegant and easy to understand.
Weaknesses: The proposed polynomial-time algorithm is still impractical because of its use of continuous exponential weights. Is it possible to achieve similar results without continuous exponential weights when the number of actions is finite?
Correctness: I believe the result is correct. There are some clarity issues that make me not entirely sure (please see the Clarity section).
Clarity: Some details are left unexplained in the paper:
1. Is the covariance matrix S(\tilde{p}_t) of the truncated distribution always invertible?
2. How can the inverse of S(\tilde{p}_t) be computed or approximated efficiently? If it is approximated, how can one argue that the resulting error is small?
Both points are important for the validity of the algorithm. Please try to address them in the response.
Minor: In Eq. (20), to avoid confusion, please state that this applies Lemma 1 with S(p_t)^{-1/2}x.
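For context, here is the standard construction I assume the paper follows (my reading; the paper's notation may differ slightly): the least-squares loss estimator

\[
\hat{\ell}_t = S(\tilde{p}_t)^{-1} a_t \langle \ell_t, a_t \rangle,
\qquad
S(\tilde{p}_t) = \mathbb{E}_{a \sim \tilde{p}_t}\!\left[ a a^{\top} \right],
\]

satisfies \mathbb{E}_{a_t \sim \tilde{p}_t}[\hat{\ell}_t] = S(\tilde{p}_t)^{-1} \mathbb{E}[a_t a_t^{\top}] \ell_t = \ell_t, but only if S(\tilde{p}_t) is invertible; both questions above target exactly this step.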
Relation to Prior Work: Table 1 provides a clear comparison. However, the bounds shown for previous work do not reflect the best bounds that are known:
1. In the last row of Table 1, when comparing to [40] (Rakhlin & Sridharan), I believe their bound should have <\ell_t - m_t, a_t>^2 inside the square root, not \|\ell_t - m_t\|^2. The latter seems to be only an intermediate bound in their analysis. Please see the top of [Page 27, arxiv/1208.3728], and also [Eq. (3), arxiv/1901.10604].
2. In the second-to-last row of Table 1, you show the second-order bound of [29] with a \|.\|_2^2 dependence. I am not sure whether [29] actually gets a better bound, but again, using the techniques developed in [Lemma 6, arxiv/1208.3728], \|.\|_*^2 is already achievable. Please also check the techniques developed in [Section 3, arxiv/1901.10604], which also already achieve a \|.\|_*^2 dependence for the second-order bound.
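To spell out the distinction (my notation, for concreteness), the two forms being contrasted are

\[
\sqrt{\sum_{t=1}^{T} \langle \ell_t - m_t, a_t \rangle^2}
\qquad \text{vs.} \qquad
\sqrt{\sum_{t=1}^{T} \|\ell_t - m_t\|_*^2},
\]

and since \langle \ell_t - m_t, a_t \rangle^2 \le \|\ell_t - m_t\|_*^2 \|a_t\|^2 \le \|\ell_t - m_t\|_*^2 whenever \|a_t\| \le 1, the inner-product form is never worse and can be much smaller.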
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper concerns the adversarial linear bandit setting. New algorithms are derived that have near-optimal first- and second-order regret bounds. Post-rebuttal: The authors adequately addressed my comments, which is why I raised my score.
Strengths: The main technical novelty in this paper is that, instead of sampling from the multiplicative weight distribution, the new algorithm samples from a truncated version of it. Previously, the multiplicative weight distribution would need to be mixed with a stationary distribution to ensure that the covariance matrix of the samples is suitably bounded. By sampling from the truncated distribution, the algorithm ensures that the covariance matrix behaves nicely, which in turn ensures that the loss estimates behave nicely. Using the fact that the multiplicative weight distribution is log-concave, it is shown that sampling from the truncated distribution is not too harmful for the remainder of the analysis. Combining the nicely behaved loss estimates with the optimistic mirror descent framework leads to the novel first- and second-order regret bounds. The last new result in this paper is a lower bound for adversarial linear bandits, which shows that the first- and second-order regret bounds almost match the lower bound.
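To illustrate the mechanism, here is a minimal sketch of my own (not the paper's algorithm; the learning rate, loss vector, and truncation threshold below are hypothetical): truncating the low-density tail of a log-concave exponential-weights distribution shrinks the Mahalanobis radius of its support, which is what keeps the loss estimates <\hat{\ell}_t, x> bounded over the playable actions.

```python
import numpy as np

# Grid approximation of a continuous exponential-weights density over
# the unit ball in R^2.
n = 400
xs = np.linspace(-1.0, 1.0, n)
X, Y = np.meshgrid(xs, xs)
pts = np.stack([X.ravel(), Y.ravel()], axis=1)
pts = pts[(pts ** 2).sum(axis=1) <= 1.0]

eta = 8.0                      # learning rate (hypothetical value)
L = np.array([1.0, 0.0])       # cumulative loss vector (hypothetical value)
logw = -eta * pts @ L
w = np.exp(logw - logw.max())  # unnormalized density on the grid

def support_radius(points, weights):
    """Largest Mahalanobis distance of a support point from the mean."""
    weights = weights / weights.sum()
    mu = weights @ points
    centered = points - mu
    S = centered.T @ (centered * weights[:, None])  # covariance S(p)
    Linv = np.linalg.inv(np.linalg.cholesky(S))     # acts as S^{-1/2} (up to rotation)
    return np.linalg.norm(centered @ Linv.T, axis=1).max()

print("untruncated:", support_radius(pts, w))

# Truncation: drop the low-density tail (one possible rule; the paper's
# rule differs in its details).
mask = w >= 1e-3 * w.max()
print("truncated:  ", support_radius(pts[mask], w[mask]))
```

The untruncated support (the whole ball) contains points enormously far from the mean in Mahalanobis distance, while the truncated support stays within a modest radius, matching the intuition that truncation replaces mixing with a stationary distribution.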
Weaknesses: The improvement upon previous second-order bounds, such as that of Rakhlin and Sridharan (2013), who use optimistic mirror descent, is from d sqrt(theta x variance) to d sqrt(variance), where theta is the self-concordance parameter. For some sets the self-concordance parameter can be d, but for other sets it can be 1, such as for the unit ball. For the unit ball, the algorithm of Rakhlin and Sridharan (2013) has a significantly smaller runtime than continuous multiplicative weights and the same regret bound. This should be made clear in the main text, as right now only the sentence “note that the self-concordance parameter can be d” hints at the above. Also note that assumption (iii) is not really an assumption, due to the existence of the universal barrier.
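In symbols (my paraphrase of the two bounds, with V denoting the relevant variance term):

\[
d \sqrt{\theta \, V} \quad \longrightarrow \quad d \sqrt{V},
\]

so the gain is exactly a factor sqrt(theta), which disappears when theta = 1 (e.g., the unit ball) and is largest when theta = d.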
Correctness: The proofs appear to be correct.
Clarity: The writing is clear. The proof of Theorem 1 is well explained. As a minor negative remark: although Lemma 1 is an important part of the analysis, from the main text it does not become clear how Lemma 1 is used, which is a shame, as I think that together with Lemma 4 it is one of the main new technical tools used to derive the novel results.
Relation to Prior Work: A similar improvement from optimistic mirror descent to continuous multiplicative weights is observed in Bubeck and Eldan (2015) and Van der Hoeven et al. (2018), but with their methods second-order bounds are out of the question due to mixing with a stationary distribution. Also, a similar technique to find the optimal hint was used by Cutkosky (2019) (Section 6); although the loss used is slightly different, the techniques are alike.
Reproducibility: Yes
Additional Feedback: Minor comments and references:
- Line 128: "with Q^* was provided" → perhaps change to: "where Q^* is known beforehand".
- Equation (31): I think there is a subscript t missing from gamma.
- For consistent O-notation, (14) should be an equality rather than an inequality.
References:
Abernethy, J., Hazan, E., & Rakhlin, A. (2008). Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008 (pp. 263-273).
Bubeck, S., & Eldan, R. (2015). The entropic barrier: a simple and optimal universal self-concordant barrier. In Conference on Learning Theory (pp. 279-279).
Cutkosky, A. (2019). Combining online learning guarantees. In Conference on Learning Theory (pp. 895-913).
Rakhlin, A., & Sridharan, K. (2013). Online learning with predictable sequences. In Conference on Learning Theory (pp. 993-1019).
Van der Hoeven, D., Van Erven, T., & Kotłowski, W. (2018). The many faces of exponential weights in online learning. In Conference on Learning Theory (pp. 2067-2092).
Summary and Contributions: This paper derives first- and second-order regret bounds for adversarial linear bandits using a truncated distribution.
Strengths: If the results are correct and the algorithm is efficient, I think this would be a breakthrough in the linear bandits literature. The paper not only provides various common adaptive regret bounds, but also matches the optimal dependency on the dimension d. Another surprising point is that the algorithm is very simple; it uses a common trick to obtain first-order regret bounds.
Weaknesses: If the claims are correct and the algorithm is efficient, I see no evident weakness. Also, as the authors claim, the algorithm is parameter-free, that is, it obtains the claimed bounds without knowing the relevant quantities in advance.
Correctness: At first glance it seems too good to be true, as the algorithm is arguably even simpler than those of [18] and [31], which only get \sqrt{T} regret bounds. When checking the correctness of the proof, I found many places where details are not provided, which makes it hard for me to verify. For example, it is known that in MAB, if one uses a truncated distribution, then the standard loss estimator is no longer unbiased. Here it is not obvious to me that the estimator is unbiased. Specifically, why is the matrix in (8) always invertible? Another part is (20). It is very mysterious how Lemma 1 is used here: the norms are not the same, and the condition S(p) <= I is not verified either. Moreover, how to compute S(p')^{-1} efficiently to construct the loss estimator is also an issue.
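To make the MAB remark concrete, here is a toy calculation of my own (not from the paper; the truncation rule is hypothetical): if arms are truncated away and the importance weights are taken with respect to the truncated distribution, the expected estimate is exact on surviving arms but zero on truncated ones, so the estimator is biased. Viewing MAB as a linear bandit on the simplex corners, this is the case where truncation makes S(\tilde{p}) singular, which is exactly why invertibility of the matrix in (8) needs an argument in the continuous setting.

```python
import numpy as np

# 3-armed bandit: truncate low-probability arms, renormalize, and compute
# the exact expectation of the standard importance-weighted estimator.
p = np.array([0.6, 0.3, 0.1])      # original sampling distribution
loss = np.array([0.2, 0.5, 0.9])   # true losses

keep = p >= 0.2                    # hypothetical truncation rule
p_trunc = np.where(keep, p, 0.0)
p_trunc /= p_trunc.sum()

# E[hat_ell_i] = p_trunc_i * (loss_i / p_trunc_i) = loss_i if arm i survives,
# and 0 if arm i was truncated away.
expected = np.where(p_trunc > 0, loss, 0.0)
print("expected estimate:", expected)  # [0.2 0.5 0. ]
print("true losses:      ", loss)      # [0.2 0.5 0.9]
```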
Clarity: Yes, the language is great. But the analysis has some hand-waving parts. For example, in Line 395, I think the authors should argue more clearly why the x they refer to is larger than -1. Also, the part of Lemma 4 that uses Lemma 1 is not well explained.
Relation to Prior Work: I think the method the authors use is more like [6] than [39] and [14], but this is not mentioned in the paper.
Reproducibility: Yes
Additional Feedback: (after the author response phase) I think the authors indeed answered my questions regarding correctness. I am just still not sure about the efficiency and error control. In the paper, the analysis is entirely based on the assumption that the algorithm has access to the exact S(p_t)^{-1}, S(\tilde{p}_t)^{-1}, etc., which is not the case if we want a polynomial-time algorithm. I believe the algorithm remains efficient even when such errors are accounted for, as in "Oracle-Efficient Algorithms for Online Linear Optimization with Bandit Feedback", but I would still like to know how small epsilon should be chosen to make all of this work. I hope the authors can explain this more clearly in the final version.
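As a back-of-envelope sketch of what I mean (my own rough reasoning, not from the paper): if the available inverse \hat{S}_t satisfies

\[
\| \hat{S}_t - S(\tilde{p}_t)^{-1} \|_{\mathrm{op}} \le \epsilon,
\]

then the estimator \hat{\ell}_t = \hat{S}_t a_t \langle \ell_t, a_t \rangle carries a per-round bias of order epsilon (for bounded actions and losses), contributing an extra O(\epsilon T) to the regret; keeping this below a \sqrt{T}-order leading term would suggest \epsilon = O(T^{-1/2}) suffices, but a precise statement from the authors would be welcome.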