Reviews: Policy Poisoning in Batch Reinforcement Learning and Control

Reviewer 1

Originality: The work appears original; it seems to be the first paper to address the problem of data poisoning in batch reinforcement learning.

Quality: The paper presents interesting ideas and shows that the policy poisoning problem can be reformulated as a bi-level optimisation problem. Two instances of this optimisation problem are derived (TCE and LQR), and for each the authors analyse the feasibility and cost of potential attacks. The paper also reports four sets of experiments demonstrating that the attack can force the RL agent to learn a policy chosen by the attacker.

Clarity: The paper is quite well structured. Section 4.2 could be more structured, for instance by stating explicit claims and proofs to better guide the reader through the argument.

Significance: This is indeed a crucial topic. The paper seems to open interesting discussions about corrupting data in order to influence the decisions of a potential victim controller.

*** Post feedback *** Thanks to the authors for the feedback and clarification. I have updated the overall score accordingly.

Reviewer 2

The paper is well written and organized. It discusses possible attacks on an RL system that modify only the rewards, where the RL algorithms are a table-based Q-function learner and a linear quadratic regulator. I think it is fairly obvious that the policy can be changed in any way if one has full control over the rewards.

The paper provides a couple of experiments in which the changes to the rewards look surprisingly small, such as 12%. But this does not mean much. If I change the reward of a single state transition by a large enough amount, all other rewards might need no change at all, and dividing that one large change by the number of state transitions in the data yields a small relative change.

In addition, I criticize the definition of the poisoning ratio ||r - r^0|| / ||r|| for two reasons:
1. If all rewards are increased by an offset of, say, 1e9, the policy will not change, but the poisoning ratio does.
2. It seems odd to divide by ||r||, r being the changed reward; more natural would be ||r^0||, the original reward.
I think defining the poisoning ratio as ||r - r^0|| / ||r^0 - mean(r^0)|| would give a better measure.

AFTER FEEDBACK: I am pleased with the authors' feedback. In addition, after reading the comments of reviewers 1 and 4, I think that I slightly underestimated the importance of the topic. Therefore I increased the "overall score" to 6.
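The two objections to the poisoning ratio can be checked numerically. The following is a minimal sketch, not taken from the paper or its source code: the reward vector, its size, the size of the single edit, and the function names `ratio_paper` and `ratio_centered` are all assumptions made for illustration. It edits one reward out of many and then adds a common offset to both the clean and the poisoned rewards.

```python
import numpy as np

def ratio_paper(r, r0):
    """Poisoning ratio as quoted in the review: ||r - r0|| / ||r||."""
    return np.linalg.norm(r - r0) / np.linalg.norm(r)

def ratio_centered(r, r0):
    """Alternative suggested in the review: ||r - r0|| / ||r0 - mean(r0)||."""
    return np.linalg.norm(r - r0) / np.linalg.norm(r0 - r0.mean())

# Hypothetical data: 10,000 clean rewards drawn from a standard normal.
rng = np.random.default_rng(0)
n = 10_000
r0 = rng.normal(loc=0.0, scale=1.0, size=n)

# Poison a single transition with one comparatively large change.
r = r0.copy()
r[0] += 5.0

# Shift every reward (clean and poisoned) by the same constant.
offset = 1e9

print("quoted ratio,   no offset:  ", ratio_paper(r, r0))
print("quoted ratio,   with offset:", ratio_paper(r + offset, r0 + offset))
print("centered ratio, no offset:  ", ratio_centered(r, r0))
print("centered ratio, with offset:", ratio_centered(r + offset, r0 + offset))
```

Under these assumptions, the quoted ratio is already small for a single large edit because the denominator grows with the number of transitions, and it collapses toward zero once the offset is added. The centered ratio is essentially unchanged by the offset, since subtracting mean(r^0) removes any constant shift.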

Reviewer 3

The paper studies the problem of policy poisoning in batch reinforcement learning and control, where the learner estimates a model of the world from a batch data set and finds an optimal policy with respect to the learned model. The attacker modifies the reward entries of the data to make the agent learn a target policy. The paper presents a framework for solving batch policy poisoning attacks on two standard victims. The theoretical and experimental results provide some evidence for the feasibility of policy poisoning attacks.

Overall, I think this is an interesting paper that is motivated by a realistic adversarial setting in which the attacker can alter the reward (instead of altering the dynamics of the world) to change the optimal policy to an adversarial target policy. The paper is easy to read due to its clear organization. Further, I appreciated the source code made available with the submission, so thank you for doing this. However, I have two major issues with the paper which I hope can be clarified by the authors.

i) While I understand the overall idea of your bilevel programming model, it seems that you are not following a formal notation (i.e., the Bard et al. 2000 notation). Therefore it is not clear whether you have two separate follower problems, what the decision variables of (each?) follower are, etc. Further, it is confusing to use r, \mathbb{r}, R, \mathcal{R}, \tilde{R} etc. all in one model, as the reader cannot tell which variables are controlled by the leader and which by the follower (and that is very important). Please clarify this in your rebuttal and in the revised version of your paper.

ii) The experimental results are underwhelming: the problems are very small and the domain is limited to a grid world problem, which does not tie in well with the realistic adversarial setting introduced earlier. Is this because solving the adversarial optimization problem is not scalable? If so, this is important to note in the paper. Further, have you experimented with other and/or larger domains?

One piece of literature that may be of interest to the authors is “Data Poisoning Attacks against Autoregressive Models”.

Minor correction: on line 161 the word ‘itself’ is repeated.

**After Rebuttal** Thank you for your responses; they were helpful in clarifying my questions. Therefore, I am increasing my final score. Please include your clarifications in the final version of the paper.
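For reference on point (i), a generic bilevel program in leader/follower form can be written as below. This is a textbook-style sketch and not the paper's formulation: the symbols x, y, F, G, f, g, X and Y are generic placeholders assumed here for illustration. It shows the separation the review asks for, with x controlled by the leader and y constrained to be an optimal response of the follower.

```latex
\begin{align*}
\min_{x \in X} \quad & F(x, y) \\
\text{s.t.} \quad & G(x, y) \le 0, \\
& y \in \arg\min_{y' \in Y} \bigl\{\, f(x, y') : g(x, y') \le 0 \,\bigr\}.
\end{align*}
```

In the setting described in this review, the leader would presumably be the attacker choosing the modified rewards, and the follower the learner computing its policy from the modified data; stating that mapping explicitly for each symbol is what the requested notation would make clear.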

Paper ID:	8247
Title:	Policy Poisoning in Batch Reinforcement Learning and Control
