NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID 6485: Weighted Linear Bandits for Non-Stationary Environments

### Reviewer 1

Significance: Theorem 1 can be useful in other works. The B_T term in Theorem 2 (and the need to know it to get the optimal \gamma) is a little disappointing and could be discussed more (I'm wondering if one could get rid of it if the opponent had no knowledge of \theta_t but only of \theta_{t-1} at timestep t).

Originality: The results are very similar to SW-LinUCB, since this work essentially replaces the sliding window with a discounted weighted estimation.

Quality: My main objection concerns the adequacy to real-world scenarios (Section 4.2 is not very convincing). Anyway, this is not the core of the paper, since the proposed approach is likely to have the same kind of performance as any sliding-window technique.

Clarity: The paper is crystal clear.
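Reviewer 1's point that this work essentially replaces the sliding window with a discounted weighted estimation can be made concrete with a small sketch (illustrative code, not the authors'; the function name and interface are made up): both approaches downweight past observations, one with a hard cutoff and one geometrically, and a discount factor gamma corresponds to an effective window of roughly 1/(1-gamma).

```python
import numpy as np

# Illustrative sketch (not the paper's code): both estimators solve a weighted
# least-squares problem and differ only in the weight given to past samples.
def sample_weights(t, window=None, gamma=None):
    """Weight of observation s = 1..t in the estimate at time t."""
    s = np.arange(1, t + 1)
    if window is not None:              # sliding window: hard cutoff
        return (s > t - window).astype(float)
    return gamma ** (t - s)             # discounting: geometric decay

t = 100
w_sw = sample_weights(t, window=20)
w_d = sample_weights(t, gamma=0.95)     # effective horizon 1/(1 - 0.95) = 20
print(w_sw.sum(), w_d.sum())            # both concentrate mass on ~20 samples
```

The total weight of the discounted scheme converges to 1/(1-gamma), which is one way to see why the two techniques tend to deliver the same kind of performance.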

### Reviewer 2

Thanks for the author feedback. I am happy that the gap in the regret bound is closed. However, it seems the issue of parameter-free algorithms is not addressed. References [12,13] use SW-UCB as a sub-routine to derive parameter-free bandit algorithms; it would be extremely interesting to see what would happen if D-LinUCB were employed in such a manner.

-----------------------------------------------------------------
Original Comments:

The paper provides an intuitive analysis of the D-LinUCB algorithm, which attains a dynamic regret bound of order O(dB_T^{1/3}T^{2/3}). It also delivers solid experimental results for the proposed algorithm, and it is well written in general. But some questions remain:

1. It seems that the current algorithm is only optimal w.r.t. B_T and T, but not d, the problem dimension. It is proved in [5] and [12] that the lower bounds for the K-armed and linear settings are O(K^{1/3}B_T^{1/3}T^{2/3}) and O(d^{2/3}B_T^{1/3}T^{2/3}), respectively, so it is worth checking the source of this gap.

2. It is shown in [12] that one can build a parameter-free algorithm for the non-stationary linear bandit setting on top of SW-LinUCB. It is thus expected that the D-LinUCB algorithm could be enhanced accordingly. Is there any specific reason that hinders this? If not, what would the parameter-free dynamic regret based on D-LinUCB be?

3. It seems that the discounted linear regression estimator is not entirely new: a similar type of estimator is analyzed for the 2-dimensional dynamic pricing setting in [B. Keskin and A. Zeevi. Chasing Demand: Learning and Earning in a Changing Environment. Mathematics of Operations Research, 42(2), 277-307, 2016]. It is thus worth doing a thorough literature search to make sure all prior related works are properly credited.

4. At the end of paragraph 2 of Section 1.2, the dynamic regret bound is O(d^{2/3}B_T^{1/3}T^{2/3}).
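For context on point 3, the family of discounted linear regression estimators under discussion can be sketched as a weighted ridge regression (a hedged illustration under assumed notation, not the paper's or Keskin-Zeevi's implementation; the placement of the regularizer here is a simplification):

```python
import numpy as np

# Hedged sketch of a discounted (weighted) ridge-regression estimate of theta;
# variable names and the lambda regularizer are illustrative assumptions.
def discounted_theta(X, y, gamma, lam=1.0):
    t, d = X.shape
    w = gamma ** (t - np.arange(1, t + 1))        # geometric weights gamma^{t-s}
    V = (w[:, None] * X).T @ X + lam * np.eye(d)  # weighted Gram matrix + ridge
    b = (w[:, None] * X).T @ y[:, None]
    return np.linalg.solve(V, b).ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
theta_old, theta_new = np.array([1., 0., 0.]), np.array([0., 1., 0.])
# the parameter drifts halfway through: discounting tracks the recent theta
y = np.concatenate([X[:250] @ theta_old, X[250:] @ theta_new])
est = discounted_theta(X, y, gamma=0.95)
```

On this noiseless drifting example, the geometric weights make the estimate track the post-change parameter theta_new rather than the stale theta_old, which is the behavior the non-stationary analysis exploits.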

### Reviewer 3

Update (after reading the rebuttal): After reading the authors' rebuttal, my concerns about the novelty of the new self-normalized concentration have been addressed, since the key point is that the coefficient of the regularizer is changing. I indeed appreciate this work. The idea of this paper is natural, but there do exist technical challenges, and the authors address these issues elegantly. So I think it deserves acceptance. Nevertheless, there are still many typos in the current version besides those listed before; for example, in Theorem 2, eq. (6), it should be "\log(1/\delta)" instead of "\log(1/\gamma)". I hope the authors will carefully check and revise the final version if accepted :)

-----------------------------------------------------------------
Original Comments:

This paper extends the work of [12,13] from a sliding-window strategy to a weighted least squares method. The idea is natural and well-motivated. The authors give a theoretical analysis, which is based on the self-normalized concentration from [1] with a clever choice of norm. The analysis is essentially not hard but requires a deep understanding. In particular, the authors propose to consider the $M$-norm, where $M = V\tilde{V}^{-1}V$, in the WLS-based linear bandits. This contrasts with the common $V$-norm considered in [1] and [12,13]. By considering the concentration over such a norm, the authors can still appeal to the self-normalized concentration developed in [1] (with certain modifications) for WLS-based linear bandits. Along with some standard techniques and tricks, they finish the argument. I appreciate the elegant idea.

However, there are also some deficiencies in the current version, particularly regarding the paper organization. First, I would like to remark that the new self-normalized concentration claimed in the paper is essentially a direct extension of Theorem 1 of [1].
Since Proposition 1 is in the $\tilde{V}$-norm rather than the $V$-norm, the authors can directly obtain the result from Theorem 1 of [1]. It seems unnecessary to restate the supermartingale argument as done in Lemmas 1-3. Please clearly state the novelty of this part.

Technical issues:
- The current order for WLS bandits is O(dB_T^{1/3}T^{2/3}); the dependence on d is worse than that of sliding-window linear bandits [12]. This can be improved by choosing \gamma as 1-(B_T/Td^{2/3})^{2/3} in line 470.
- line 461: the inequality should hold for the maximum singular value of $M$. How can you guarantee it is the same as the maximum eigenvalue, since $M$ may not be symmetric?

Other minor issues:
- line 459: there is a $\lambda$-term in the $V$-matrix, so is there an extra $\lambda$ missing in the 4th inequality?
- line 483: I do not think it is problematic to apply Theorem 2 of [12] to A_t, because the self-normalized concentration applies for all t>0. The authors are encouraged to add more explanations.

Additionally, the related work in Section 1.2 is not satisfactory. The authors are requested to reorganize it to make it more informative:
- lines 67-69: the authors first introduce the full-information dynamic regret [6], but this is followed by work on linear bandits with static regret [21,32]. The organization should be reconsidered.
- lines 76-77: the authors claim "[2] gives fully problem-dependent dynamic regret bounds in O(log(T))", but it seems impossible to obtain a sublinear dynamic regret independent of the non-stationarity measure; besides, how can an O(log T) bound be regarded as "problem-dependent"?
- line 78: the rate of [12,13] is not O(B_T^{2/3}T^{1/3}); the correct one should be O(B_T^{1/3}T^{2/3}). The authors should carefully check it.
- lines 83-84: I do not think [15] is the first work that considers switching bandits; the seminal work [Auer JMLR'02] has already given some results.

Other related works on the dynamic regret of contextual bandits are missing: [Luo et al. COLT'18] and [Chen et al. COLT'19].

[Auer JMLR'02] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.
[Luo et al. COLT'18] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In 31st Annual Conference on Learning Theory (COLT), 2018.
[Chen et al. COLT'19] Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. A new algorithm for non-stationary contextual bandits: efficient, optimal, and parameter-free. In 32nd Annual Conference on Learning Theory (COLT), 2019.

Some of the experimental descriptions are unclear and hard to understand. For example, what does the sentence in lines 280-282 mean? How is the noise added? How is the accuracy of the algorithm calculated?