NeurIPS 2020

Adversarial Attacks on Linear Contextual Bandits


Review 1

Summary and Contributions: The authors study the scope of adversarial attacks on linear contextual bandit algorithms, which have applications in a wide range of domains. They consider adversarial attacks on both the reward and the context, and analyze the robustness (or lack thereof) of several contextual linear bandit algorithms, including LinUCB, LinTS, and epsilon-greedy. Empirical evaluations on synthetic data and two real datasets examine the effect of the attacks on these algorithms.

Strengths: The problem of analyzing the effect of adversarial attacks on bandit algorithms is indeed interesting and well motivated, and the present work is supposedly the first to analyze this for stochastic contextual linear bandits. The authors also analyze some popularly studied bandit algorithms, like LinUCB, LinTS, and epsilon-greedy, and show attacking strategies (formulated as optimization problems) that fool these algorithms into playing a targeted suboptimal arm a majority of the time. The experiments are fairly detailed and are reported on a large set of datasets, showing the effect of different degrees and types of attacks on the learning rates of existing techniques.

Weaknesses:

1. Earlier works: While I like the problem formulation and the inferred results, what concerns me is the earlier work by Jun et al., 2018 [19]. The reason is that they address almost the same problem for stochastic bandits, with the exact same adversarial attack formulation on the rewards, and they provide a similar analysis of the effect of such attacks on algorithms such as epsilon-greedy and UCB (a similar optimization-objective-based attacking strategy, etc.). The present work therefore appears to be only an extension of their problem formulation and analysis technique to the contextual case. Is there any unique contribution of this paper beyond that?

2. Intuition behind the two corruptions: Another contribution, claimed to be the first attempt of this paper, is studying the case of "context corruption" (Sec. 4). But I fail to see why this is fundamentally any different from corrupting the rewards (Sec. 3): since the rewards are a linear function of the context vectors, corrupting the latter is equivalent to corrupting the former (see the sketch after this list), and similar ideas are in fact used in the analysis of Sec. 4. Please clarify whether there is a fundamental difference in the attacking strategy required for context corruption.

3. Robust algorithms: This is more of a suggestion. The paper presently analyzes strategies to corrupt the rewards (or contexts) such that some well-known techniques, like LinUCB, LinTS, and epsilon-greedy, fail to learn the best policy, similar to the results and analysis of [19]. However, what would have been really interesting and useful is to design a robust contextual bandit algorithm that could survive the attacks (up to some corruption threshold), i.e., one with a regret guarantee proportional to the degree of corruption or non-stationarity, similar to the guarantees of Besbes et al. (2014), Luo et al. (2018), etc.

4. Presentation: This is a minor issue, but the paper could have been structured better. For example, a precise contribution statement summarizing the conclusions for LinUCB, LinTS, and epsilon-greedy under the different types of corruption (reward, context, etc.) would have been easier to parse -- a table would have been best. Please avoid the large paragraphs that appear throughout the paper (pages 2, 4, 7); break them into smaller paragraphs with paragraph headings so that readers know a priori what to expect. I also find it hard to identify the main claims; the key results of each section could be presented as theorems followed by remarks that clearly state the inferences.
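To make the equivalence claimed in point 2 concrete, here is a minimal numpy sketch (the dimension, parameter vector, and perturbation are hypothetical illustrative values, not taken from the paper): under a linear reward model r = <theta_a, x>, perturbing the context by delta_x shifts the expected reward by <theta_a, delta_x>, so a context corruption induces an equivalent reward corruption.

```python
import numpy as np

# Illustrative values only (not from the paper): a linear arm parameter,
# a context, and an attacker-chosen context perturbation.
rng = np.random.default_rng(0)
d = 5
theta_a = rng.normal(size=d)          # hidden parameter of some arm a
x = rng.normal(size=d)                # original context
delta_x = 0.1 * rng.normal(size=d)    # context perturbation

reward_clean = theta_a @ x
reward_after_context_attack = theta_a @ (x + delta_x)
reward_after_reward_attack = reward_clean + theta_a @ delta_x  # same shift applied directly to the reward

# The two corruptions yield the same expected reward for arm a.
assert np.isclose(reward_after_context_attack, reward_after_reward_attack)
```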

Correctness: Justified.

Clarity: The paper suffers from unclear claims and a lack of to-the-point presentation of results. The writing could be improved; see point 4 under "Weaknesses".

Relation to Prior Work: Reasonably clearly mentioned.

Reproducibility: Yes

Additional Feedback: Please see the comments in the "Weaknesses" section. Overall, some of the results are interesting, but it is hard to appreciate their originality given the work of Jun et al. (2018). If the specific new contributions of this work over that paper could be fairly justified, I would be happy to reconsider my score.

=============== Post rebuttal ====================

I went through the rebuttal and I have changed my score to weak accept. Some parts of the work align with existing results, so I still do not see a major contribution, either conceptual or algorithmic, but the problem is interesting and the paper covers many attack models with reasonable empirical evaluations, so I now tend to accept the paper, provided the concerns raised about the presentation are taken care of. Also, please add a discussion of the different attack models (on contexts and on rewards) and their implications for the regret performance.


Review 2

Summary and Contributions: The paper investigates the robustness of contextual bandits when an attacker is present. The attacker aims to cause the contextual bandit to pull a pre-specified target arm almost every time while incurring a low attack cost. The work complements previous work on attacking bandits and pushes the frontier of the area.

Strengths: (1). The paper has strong theoretical results proving that contextual bandits are vulnerable to adversarial attacks: the attacker can strategically manipulate the rewards or contexts to force a set of target arms to be pulled in almost every iteration. Meanwhile, the attack cost is sublinear, which is not a surprising but nevertheless a convincing result. (2). Many different attack settings are studied, including changing the rewards, changing the contexts, and targeting a specific user. This is a really thorough investigation of the problem. (3). The paper presents well-organized experiments, and the empirical results match the theory.

Weaknesses: (1). Is it truly necessary to assume that the reward lies in (0,1)? It seems critical for the attacker to know a lower bound on the reward, so that they can always drag the reward of non-target arms down to 0 and make them seem inferior (a minimal sketch of this mechanism follows below). What if no such lower-bound information is known beforehand? In that case, will the attacker still be able to attack? Note that prior works do not assume such lower-bound information. (2). When the attacker changes the contexts, what if the attacker targets multiple users instead of a single user? I think in that case the attacker may not be able to change the selected arms simultaneously for all target users. I wonder whether the authors have thought about that.
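As a minimal sketch of the mechanism questioned in point (1), assuming the attacker knows a reward lower bound r_min (the names r_min, target_arm, and poison_reward are hypothetical, not the paper's notation or algorithm): the attacker leaves the target arm's reward untouched and drags every other arm's observed reward down to r_min, making the non-target arms look inferior.

```python
# Hypothetical sketch, not the paper's actual attack (which also controls the attack cost):
# knowing a lower bound r_min lets the attacker make all non-target arms look inferior.
r_min = 0.0

def poison_reward(arm, true_reward, target_arm):
    """Reward that the bandit algorithm actually observes after the attack."""
    if arm == target_arm:
        return true_reward   # target arm is left untouched
    return r_min             # every other arm is dragged down to the lower bound

print(poison_reward(arm=0, true_reward=0.8, target_arm=2))  # 0.0
print(poison_reward(arm=2, true_reward=0.3, target_arm=2))  # 0.3
```

If no such lower bound were known, the attacker could not be sure how far down to push the non-target rewards, which is the concern raised in point (1).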

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: The paper studies adversarial attacks on linear bandits, where the adversary may poison either the reward feedback or the observed context to cause the contextual bandit to pull a pre-specified target arm (or set of arms) a linear number of times. Online data poisoning attacks have been studied for stochastic bandits, and this is the first paper that considers online attacks on linear contextual bandits. The paper considers several settings, including online attacks on the reward for a no-regret algorithm, online attacks on the context for LinUCB, and offline attacks on a single context. The paper proposes attacking strategies for these scenarios and theoretically analyzes the feasibility and cost of the proposed methods.

Strengths: 1) This paper is the first to study online adversarial attacks on linear bandits, an important problem following recent works on online attacks on stochastic MABs and offline attacks on linear bandits. 2) The paper thoroughly covers different attack scenarios for the linear bandit problem. 3) The theoretical proofs of the proposed methods are sound. 4) Experimental results on simulated data and real-world datasets validate the effectiveness of the attacks.

Weaknesses:

1) Regarding the attack on rewards in Sec. 3: the theoretical analysis only holds for rewards with r_{min} = 0 (in Assumption 1) and does not generalize to other choices of r_{min}. This means that the context $x$ is limited to the intersection of half-spaces determined by all the \theta_a. This strong assumption makes the problem easier, as the environment is no longer the standard contextual bandit problem (which can choose any $x$ within its norm bound), and it limits the contribution of the theoretical analysis.

2) A minor concern is the assumption that the attacker knows the values of r_{min} and r_{max}. I acknowledge the argument in the paper that r_{min} and r_{max} may be accessible to the attacker in some real-world applications, but this assumption makes the problem a much easier (even trivial) setting than the first adversarial-attacks-on-bandits paper by Jun et al. (NeurIPS 2018), which proposed an adaptive attack strategy that does not require knowing r_{min} and r_{max}.

3) Regarding the attack on contexts in Sec. 4: in order to ensure that the norm of the attacked context is bounded by L (following Assumption 1, where the norms of un-attacked contexts are bounded by L), the paper adds another assumption in Proposition 2 that the norms of un-attacked contexts are bounded by $\nu L/2$. This means there is still a gap between the norms of attacked and un-attacked contexts, which makes the norm-clipping step (sketched below) seem not well motivated and not beneficial to the attacker at all.
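A minimal sketch of the norm-clipping step discussed in point 3 (the values of L, nu, and the perturbation are hypothetical illustrations, not the paper's construction): the un-attacked context has norm at most nu*L/2, the attacker perturbs it, and the result is rescaled so that its norm does not exceed the bound L from Assumption 1.

```python
import numpy as np

# Illustrative values only; not the paper's attack construction.
L, nu = 1.0, 1.0
rng = np.random.default_rng(1)

x = rng.normal(size=4)
x *= (nu * L / 2) / np.linalg.norm(x)   # un-attacked context, norm nu*L/2

delta = rng.normal(size=4)              # attacker's perturbation
x_attacked = x + delta

norm = np.linalg.norm(x_attacked)
if norm > L:                            # clip back to the norm bound L of Assumption 1
    x_attacked *= L / norm

print(np.linalg.norm(x), np.linalg.norm(x_attacked))
```

The gap between the bound nu*L/2 on un-attacked contexts and the clipping threshold L is the discrepancy pointed out in point 3.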

Correctness: I checked most of the derivations and they are correct.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The paper clearly discussed the related works in data poisoning attack on bandits and more generally, reinforcement learning.

Reproducibility: Yes

Additional Feedback: While the authors show that the proposed method empirically works for general r_{min} and r_{max}, it is important to affirmatively answer whether the proposed method can successfully attack in the general r_{min} and r_{max} case, i.e., with fewer or even no restrictions on the environment's choice of context $x$. It would also be helpful to justify the norm-clipping step in Sec. 4.

-------------Post Rebuttal--------------------------

I agree with the authors' argument that it seems hard to avoid the assumption of bounded rewards if the algorithm under attack is unknown. I suggest the authors include a discussion of this intuition in the paper.