{"title": "Policy Poisoning in Batch Reinforcement Learning and Control", "book": "Advances in Neural Information Processing Systems", "page_first": 14570, "page_last": 14580, "abstract": "We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics and the rewards from a batch data set, and then solves for the optimal policy with respect to the estimates. The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: tabular certainty equivalence learner in reinforcement learning and linear quadratic regulator in control. We show that both instantiation result in a convex optimization problem on which global optimality is guaranteed, and provide analysis on attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.", "full_text": "Policy Poisoning\n\nin Batch Reinforcement Learning and Control\n\nYuzhe Ma\n\nUniversity of Wisconsin\u2013Madison\n\nyzm234@cs.wisc.edu\n\nXuezhou Zhang\n\nUniversity of Wisconsin\u2013Madison\nzhangxz1123@cs.wisc.edu\n\nWen Sun\n\nMicrosoft Research New York\nSun.Wen@microsoft.com\n\nXiaojin Zhu\n\nUniversity of Wisconsin\u2013Madison\n\njerryzhu@cs.wisc.edu\n\nAbstract\n\nWe study a security threat to batch reinforcement learning and control where the\nattacker aims to poison the learned policy. The victim is a reinforcement learner /\ncontroller which \ufb01rst estimates the dynamics and the rewards from a batch data set,\nand then solves for the optimal policy with respect to the estimates. 
The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: the tabular certainty equivalence learner in reinforcement learning and the linear quadratic regulator in control. We show that both instantiations result in convex optimization problems for which global optimality is guaranteed, and provide analysis of attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.\n\n1 Introduction\n\nWith the increasing adoption of machine learning, it is critical to study security threats to learning algorithms and to design effective defense mechanisms against those threats. There has been significant work on adversarial attacks [2, 9]. We focus on the subarea of data poisoning attacks, where the adversary manipulates the training data so that the learner learns a wrong model. Prior work on data poisoning targeted victims in supervised learning [17, 13, 19, 22] and multi-armed bandits [11, 16, 15]. We take a step further and study data poisoning attacks on reinforcement learning (RL). Given RL's prominent applications in robotics, games and so on, an intentionally and adversarially planted bad policy could be devastating.\nWhile there has been some related work on test-time attacks against RL, reward shaping, and teaching inverse reinforcement learning (IRL), little is understood about how to poison the training set of a reinforcement learner. We take the first step and focus on batch reinforcement learners and controllers as the victims. These victims learn their policy from a batch training set. We assume that the attacker can modify the rewards in the training set, which we show is sufficient for policy poisoning.
The attacker\u2019s goal is to\nforce the victim to learn a particular target policy (hence the name policy poisoning), while minimizing\nthe reward modi\ufb01cations. Our main contribution is to characterize batch policy poisoning with a\nuni\ufb01ed optimization framework, and to study two instances against tabular certainty-equivalence\n(TCE) victim and linear quadratic regulator (LQR) victim, respectively.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f2 Related Work\n\nOf particular interest is the work on test-time attacks against RL [10]. Unlike policy poisoning, there\nthe RL agent carries out an already-learned and \ufb01xed policy \u21e1 to e.g. play the Pong Game. The\nattacker perturbs pixels in a game board image, which is part of the state s. This essentially changes\nthe RL agent\u2019s perceived state into some s0. The RL agent then chooses the action a0 := \u21e1(s0) (e.g.\nmove down) which may differ from a := \u21e1(s) (e.g. move up). The attacker\u2019s goal is to force some\nspeci\ufb01c a0 on the RL agent. Note \u21e1 itself stays the same through the attack. In contrast, ours is a\ndata-poisoning attack which happens at training time and aims to change \u21e1.\nData-poisoning attacks were previously limited to supervised learning victims, either in batch\nmode [3, 21, 14, 17] or online mode [19, 22]. Recently data-poisoning attacks have been extended to\nmulti-armed bandit victims [11, 16, 15], but not yet to RL victims.\nThere are two related but distinct concepts in RL research. One concept is reward shaping [18, 1,\n7, 20] which also modi\ufb01es rewards to affect an RL agent. However, the goal of reward shaping\nis fundamentally different from ours. Reward shaping aims to speed up convergence to the same\noptimal policy as without shaping. Note the differences in both the target (same vs. different policies)\nand the optimality measure (speed to converge vs. 
magnitude of reward change).\nThe other concept is teaching IRL [5, 4, 12]. Teaching and attacking are mathematically equivalent. However, the main difference to our work is the victim. They require an IRL agent, which is a specialized algorithm that estimates a reward function from demonstrations of (state, action) trajectories alone (i.e., no reward given). In contrast, our attacks target more prevalent RL agents and are thus potentially more applicable. Due to the difference in the input to IRL vs. RL victims, our attack framework is completely different.\n\n3 Preliminaries\nA Markov Decision Process (MDP) is defined as a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P : S × A → Δ_S is the transition kernel, where Δ_S denotes the space of probability distributions on S, R : S × A → R is the reward function, and γ ∈ [0, 1) is a discounting factor. We define a policy π : S → A as a function that maps a state to an action. We denote the Q function of a policy π as Q^π(s, a) = E[Σ_{τ=0}^∞ γ^τ R(s_τ, a_τ) | s_0 = s, a_0 = a, π], where the expectation is over the randomness in both transitions and rewards. The Q function that corresponds to the optimal policy can be characterized by the following Bellman optimality equation:\n\nQ*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′ | s, a) max_{a′∈A} Q*(s′, a′),   (1)\n\nand the optimal policy is defined as π*(s) ∈ arg max_{a∈A} Q*(s, a).\nWe focus on RL victims who perform batch reinforcement learning. A training item is a tuple (s, a, r, s′) ∈ S × A × R × S, where s is the current state, a is the action taken, r is the received reward, and s′ is the next state. A training set is a batch of T training items denoted by D = (s_t, a_t, r_t, s′_t)_{t=0:T−1}. Given the training set D, a model-based learner performs learning in two steps:\nStep 1. The learner estimates an MDP M̂ = (S, A, P̂, R̂, γ) from D.
In particular, we assume the learner uses the maximum likelihood estimate for the transition kernel P̂ : S × A → Δ_S,\n\nP̂ ∈ arg max_P Σ_{t=0}^{T−1} log P(s′_t | s_t, a_t),   (2)\n\nand the least-squares estimate for the reward function R̂ : S × A → R,\n\nR̂ = arg min_R Σ_{t=0}^{T−1} (r_t − R(s_t, a_t))².   (3)\n\nNote that we do not require (2) to have a unique maximizer P̂. When multiple maximizers exist, we assume the learner arbitrarily picks one of them as the estimate. We assume the minimizer R̂ is always unique; we will discuss the conditions that guarantee the uniqueness of R̂ for the two learners later.\n\nStep 2. The learner finds the optimal policy π̂ that maximizes the expected discounted cumulative reward on the estimated environment M̂, i.e.,\n\nπ̂ ∈ arg max_{π:S→A} E_P̂ Σ_{τ=0}^∞ γ^τ R̂(s_τ, π(s_τ)),   (4)\n\nwhere s_0 is a specified or random initial state. Note that there could be multiple optimal policies, thus we use ∈ in (4). Later we will specialize (4) to two specific victim learners: the tabular certainty equivalence learner (TCE) and the certainty-equivalent linear quadratic regulator (LQR).\n\n4 Policy Poisoning\n\nWe study policy poisoning attacks on model-based batch RL learners. Our threat model is as follows:\nKnowledge of the attacker. The attacker has access to the original training set D⁰ = (s_t, a_t, r⁰_t, s′_t)_{t=0:T−1}. The attacker knows the model-based RL learner's algorithm. Importantly, the attacker knows how the learner estimates the environment, i.e., (2) and (3). In the case that (2) has multiple maximizers, we assume the attacker knows exactly the P̂ that the learner picks.\nAvailable actions of the attacker. The attacker is allowed to arbitrarily modify the rewards r⁰ = (r⁰_0, ..., r⁰_{T−1}) in D⁰ into r = (r_0, ..., r_{T−1}).
As we show later, changing the r's but not s, a, s′ is sufficient for policy poisoning.\nAttacker's goals. The attacker has a pre-specified target policy π†. The attack goals are to (1) force the learner to learn π†, and (2) minimize the attack cost ‖r − r⁰‖_α under an α-norm chosen by the attacker.\nGiven the threat model, we can formulate policy poisoning as a bi-level optimization problem¹:\n\nmin_{r, R̂} ‖r − r⁰‖_α   (5)\ns.t. R̂ = arg min_R Σ_{t=0}^{T−1} (r_t − R(s_t, a_t))²   (6)\n{π†} = arg max_{π:S→A} E_P̂ Σ_{τ=0}^∞ γ^τ R̂(s_τ, π(s_τ)).   (7)\n\nThe P̂ in (7) does not involve r and is precomputed from D⁰. The singleton set {π†} on the LHS of (7) ensures that the target policy is learned uniquely, i.e., there are no other optimal policies tied with π†. Next, we instantiate this attack formulation for two representative model-based RL victims.\n\n4.1 Poisoning a Tabular Certainty Equivalence (TCE) Victim\nIn tabular certainty equivalence (TCE), the environment is a Markov Decision Process (MDP) with finite state and action spaces. Given the original data D⁰ = (s_t, a_t, r⁰_t, s′_t)_{0:T−1}, let T_{s,a} = {t | s_t = s, a_t = a} be the time indexes of all training items for which action a is taken at state s. We assume |T_{s,a}| ≥ 1 for all s, a, i.e., each state-action pair appears at least once in D⁰. This condition is needed to ensure that the learner's estimates P̂ and R̂ exist. Recall that we require (3) to have a unique solution. For the TCE learner, R̂ is unique as long as it exists; therefore |T_{s,a}| ≥ 1 for all s, a is sufficient to guarantee a unique solution to (3). Let the poisoned data be D = (s_t, a_t, r_t, s′_t)_{0:T−1}.
Instantiating the model estimation (2), (3) for TCE, we have\n\nP̂(s′ | s, a) = (1 / |T_{s,a}|) Σ_{t∈T_{s,a}} 1[s′_t = s′],   (8)\n\nwhere 1[·] is the indicator function, and\n\nR̂(s, a) = (1 / |T_{s,a}|) Σ_{t∈T_{s,a}} r_t.   (9)\n\n¹As we will show, the constraint (7) could lead to an open feasible set (e.g., in (10)) for the attack optimization (5)-(7), on which the minimum of the objective function (5) may not be well-defined. In the case that (7) induces an open set, we will instead consider a closed subset of it, and optimize over that subset. How to construct the closed subset will be made clear for concrete learners later.\n\nThe TCE learner uses P̂, R̂ to form an estimated MDP M̂, then solves for the optimal policy π̂ with respect to M̂ using the Bellman equation (1). The attack goal (7) can be naively characterized by\n\nQ(s, π†(s)) > Q(s, a), ∀s ∈ S, ∀a ≠ π†(s).   (10)\n\nHowever, due to the strict inequality, (10) induces an open set in the Q space, on which the minimum of (5) may not be well-defined. Instead, we require a stronger attack goal which leads to a closed subset in the Q space. This is defined as the following ε-robust target Q polytope.\nDefinition 1. (ε-robust target Q polytope) The set of ε-robust Q functions induced by a target policy π† is the polytope\n\nQ_ε(π†) = {Q : Q(s, π†(s)) ≥ Q(s, a) + ε, ∀s ∈ S, ∀a ≠ π†(s)}   (11)\n\nfor a fixed ε > 0.\n\nThe margin parameter ε ensures that π† is the unique optimal policy for any Q in the polytope.
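In code, the TCE estimates (8)-(9) reduce to visit counts and per-pair reward averages. A minimal NumPy sketch (states and actions are assumed integer-encoded, and `tce_estimate` is our illustrative name, not from the paper):

```python
import numpy as np

def tce_estimate(data, n_states, n_actions):
    """Count-based TCE estimates (8)-(9): empirical transition
    frequencies P_hat and per-(s, a) mean rewards R_hat."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        visits[s, a] += 1
    # Assumes every (s, a) pair appears at least once, as in the paper.
    P_hat = counts / visits[:, :, None]
    R_hat = reward_sums / visits
    return P_hat, R_hat
```

On the two-state example of Section 5 (A, B encoded as 0, 1; stay, move as 0, 1), this recovers the deterministic transitions and the clean rewards exactly.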
We now have a solvable attack problem, where the attacker wants to force the victim's Q function into the ε-robust target Q polytope Q_ε(π†):\n\nmin_{r∈R^T, R̂, Q∈R^{|S|×|A|}} ‖r − r⁰‖_α   (12)\ns.t. R̂(s, a) = (1 / |T_{s,a}|) Σ_{t∈T_{s,a}} r_t, ∀s, a   (13)\nQ(s, a) = R̂(s, a) + γ Σ_{s′} P̂(s′ | s, a) Q(s′, π†(s′)), ∀s, a   (14)\nQ(s, π†(s)) ≥ Q(s, a) + ε, ∀s ∈ S, ∀a ≠ π†(s).   (15)\n\nThe constraint (14) enforces Bellman optimality on the value function Q, in which max_{a′∈A} Q(s′, a′) is replaced by Q(s′, π†(s′)), since the target policy is guaranteed to be optimal by (15). Note that problem (12)-(15) is a convex program with linear constraints given that α ≥ 1, thus it can be solved to global optimality. However, we point out that (12)-(15) is a more stringent formulation than (5)-(7) due to the additional margin parameter ε we introduced. The feasible set of (12)-(15) is a subset of that of (5)-(7). Therefore, the optimal solution to (12)-(15) could in general be a sub-optimal solution to (5)-(7) with a potentially larger objective value. We now study a few theoretical properties of policy poisoning on TCE. All proofs are in the appendix. First of all, the attack is always feasible.\nProposition 1. The attack problem (12)-(15) is always feasible for any target policy π†.\n\nProposition 1 states that for any target policy π†, there exists a perturbation on the rewards that teaches the learner that policy. Therefore, the attacker changing the r's but not s, a, s′ is already sufficient for policy poisoning.\nWe next bound the attack cost. Let the MDP estimated on the clean data be M̂⁰ = (S, A, P̂, R̂⁰, γ). Let Q⁰ be the Q function that satisfies the Bellman optimality equation on M̂⁰. Define Δ(ε) = max_{s∈S} [max_{a≠π†(s)} Q⁰(s, a) − Q⁰(s, π†(s)) + ε]₊, where [·]₊ takes the maximum with 0.
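This suboptimality measure (written Δ(ε) here) and the cost interval it controls in Theorem 2 are easy to evaluate from the clean Q table. A small pure-Python sketch, using the two-state Q values of Experiment 1 for illustration (`suboptimality_gap` is our name, not the paper's):

```python
def suboptimality_gap(Q0, target_policy, eps):
    """Delta(eps): worst-case shortfall of the target action's clean
    Q value against the best competing action, plus eps, clipped at 0."""
    delta = 0.0
    for s, a_target in enumerate(target_policy):
        others = [Q0[s][a] for a in range(len(Q0[s])) if a != a_target]
        delta = max(delta, max(others) - Q0[s][a_target] + eps)
    return max(delta, 0.0)

# Clean Q values of the two-state example: Q(., stay) = 10, Q(., move) = 9.
Q0 = [[10.0, 9.0], [10.0, 9.0]]
delta = suboptimality_gap(Q0, target_policy=[1, 1], eps=1.0)  # target: move
# Theorem 2's interval with alpha = 2, gamma = 0.9, T = 4, min |T_{s,a}| = 1:
gamma, T, min_visits = 0.9, 4, 1
lower = 0.5 * (1 - gamma) * delta * min_visits ** 0.5
upper = 0.5 * (1 + gamma) * delta * T ** 0.5
```

Here delta = 2, and the resulting interval [0.1, 3.8] indeed contains the reported optimal attack cost ‖r − r⁰‖₂ = 2 of Experiment 1.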
Intuitively, Δ(ε) measures how suboptimal the target policy π† is compared to the clean optimal policy π⁰ learned on M̂⁰, up to a margin parameter ε.\nTheorem 2. Assume α ≥ 1 in (12). Let r*, R̂* and Q* be an optimal solution to (12)-(15). Then\n\n(1/2)(1 − γ) Δ(ε) (min_{s,a} |T_{s,a}|)^{1/α} ≤ ‖r* − r⁰‖_α ≤ (1/2)(1 + γ) Δ(ε) T^{1/α}.   (16)\n\nCorollary 3. If α = 1, the optimal attack cost is O(Δ(ε)T). If α = 2, the optimal attack cost is O(Δ(ε)√T). If α = ∞, the optimal attack cost is O(Δ(ε)).\n\nNote that both the upper and lower bounds on the attack cost are linear in Δ(ε), which can be estimated directly from the clean training set D⁰. This allows the attacker to easily estimate its attack cost before actually solving the attack problem.\n\n4.2 Poisoning a Linear Quadratic Regulator (LQR) Victim\nAs the second example, we study an LQR victim that performs system identification from a batch training set [6]. Let the linear dynamical system be\n\ns_{t+1} = A s_t + B a_t + w_t, ∀t ≥ 0,   (17)\n\nwhere A ∈ R^{n×n}, B ∈ R^{n×m}, s_t ∈ R^n is the state, a_t ∈ R^m is the control signal, and w_t ∼ N(0, σ²I) is Gaussian noise. When the agent takes action a at state s, it suffers a quadratic loss of the general form\n\nL(s, a) = (1/2) sᵀQs + qᵀs + (1/2) aᵀRa + c   (18)\n\nfor some Q ⪰ 0, R ≻ 0, q ∈ R^n and c ∈ R. Here we have redefined the symbols Q and R in order to conform with the notation convention in LQR: we now use Q for the quadratic loss matrix associated with the state, not the action-value function, and R for the quadratic loss matrix associated with the action, not the reward function. The reward function R(s, a) of the general MDP (Section 3) is now equivalent to the negative loss −L(s, a). This form of loss captures various LQR control problems.
Note that the above linear dynamical system can be viewed as an MDP with transition\nkernel P (s0 | s, a) = N (As + Ba, 2I) and reward function L(s, a). The environment is thus\ncharacterized by matrices A, B (for transition kernel) and Q, R, q, c (for reward function), which are\nall unknown to the learner.\nWe assume the clean training data D0 = (st, at, r0\nt , st+1)0:T1 was generated by running the\nlinear system for multiple episodes following some random policy [6]. Let the poisoned data be\nD = (st, at, rt, st+1)0:T1. Instantiating model estimation (2), (3), the learner performs system\nidenti\ufb01cation on the poisoned data:\n\n( \u02c6A, \u02c6B) 2 arg min\n\n(A,B)\n\n2\n\n1\n2\n\nkAst + Bat st+1k2\n\nT1Xt=0\ns>t Qst + q>st + a>t Rat + c + rt\nT1Xt=0\n\n1\n2\n\n1\n2\n\n(19)\n\n(20)\n\n2\n\n2\n\n.\n\n( \u02c6Q, \u02c6R, \u02c6q, \u02c6c) = arg min\n\n(Q\u232b0,R\u232b\"I,q,c)\n\nNote that in (20), the learner uses a stronger constraint R \u232b \"I than the original constraint R 0,\nwhich guarantees that the minimizer can be achieved. The conditions to further guarantee (20) having\na unique solution depend on the property of certain matrices formed by the clean training set D0,\nwhich we defer to appendix D.\nThe learner then computes the optimal control policy with respect to \u02c6A, \u02c6B, \u02c6Q, \u02c6R, \u02c6q and \u02c6c. We assume\nthe learner solves a discounted version of LQR control\n\nE\" 1X\u2327 =0\n\n\u2327 (\n\n1\n2\n\ns>\u2327\n\n\u02c6Qs\u2327 + \u02c6q>s\u2327 + \u21e1(s\u2327 )> \u02c6R\u21e1(s\u2327 ) + \u02c6c)#\n\nmax\n\u21e1:S7!A\ns.t.\n\n(22)\nwhere the expectation is over w\u2327 . 
It is known that this control problem has a closed-form solution â_τ = π̂(s_τ) = K s_τ + k, where\n\nK = −γ (R̂ + γ B̂ᵀX B̂)⁻¹ B̂ᵀX Â,    k = −γ (R̂ + γ B̂ᵀX B̂)⁻¹ B̂ᵀx.   (23)\n\nHere X ⪰ 0 is the unique solution of the Algebraic Riccati Equation,\n\nX = γ ÂᵀX Â − γ² ÂᵀX B̂ (R̂ + γ B̂ᵀX B̂)⁻¹ B̂ᵀX Â + Q̂,   (24)\n\nand x is a vector that satisfies\n\nx = q̂ + γ (Â + B̂K)ᵀ x.   (25)\n\nThe attacker aims to force the victim into taking the target action π†(s) for every s ∈ R^n. Note that in LQR the attacker cannot arbitrarily choose π†, as the optimal control policy K and k enforce a linear structural constraint between π†(s) and s. One can easily see that the target action must obey π†(s) = K†s + k† for some (K†, k†) in order for the attack to succeed. Therefore we assume instead that the attacker has a target policy specified by a pair (K†, k†). However, an arbitrary linear policy may still not be feasible. A target policy (K†, k†) is feasible if and only if it is produced by solving some Riccati equation, namely, it must lie in the following set:\n\n{(K, k) : ∃ Q ⪰ 0, R ⪰ εI, q ∈ R^n, c ∈ R, such that (23), (24), and (25) are satisfied}.   (26)\n\nTherefore, to guarantee feasibility, we assume the attacker always picks the target policy (K†, k†) by solving an LQR problem with some attacker-defined loss function.
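The closed-form policy above can be computed by fixed-point iteration on the discounted Riccati equation. A sketch under our reading of that recursion, X = γAᵀXA − γ²AᵀXB(R + γBᵀXB)⁻¹BᵀXA + Q with gain K = −γ(R + γBᵀXB)⁻¹BᵀXA, on a toy double integrator (system, function name, and parameters are our illustration, not the paper's):

```python
import numpy as np

def solve_discounted_lqr(A, B, Q, R, gamma, iters=5000, tol=1e-12):
    """Fixed-point iteration for the discounted Riccati equation;
    returns the cost-to-go matrix X and the gain K with a = K s."""
    X = np.zeros_like(Q)
    for _ in range(iters):
        G = np.linalg.inv(R + gamma * B.T @ X @ B)
        X_new = gamma * A.T @ X @ A \
            - gamma**2 * A.T @ X @ B @ G @ B.T @ X @ A + Q
        if np.max(np.abs(X_new - X)) < tol:
            X = X_new
            break
        X = X_new
    K = -gamma * np.linalg.inv(R + gamma * B.T @ X @ B) @ B.T @ X @ A
    return X, K

# Toy double integrator (position, velocity) with a scalar force input.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
X, K = solve_discounted_lqr(A, B, np.eye(2), np.array([[1.0]]), gamma=0.9)
```

The converged X satisfies the Riccati equation, and standard discounted-LQR theory guarantees that √γ(A + BK) is stable.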
We can now pose the policy\npoisoning attack problem:\n\nmin\n\nr, \u02c6Q, \u02c6R,\u02c6q,\u02c6c,X,x\n\ns.t.\n\n\u02c6B>X \u02c6A = K\u2020\n\nkr r0k\u21b5\n\u21e3 \u02c6R + \u02c6B>X \u02c6B\u23181\n\u21e3 \u02c6R + \u02c6B>X \u02c6B\u23181\nX = \u02c6A>X \u02c6A 2 \u02c6A>X \u02c6B\u21e3 \u02c6R + \u02c6B>X \u02c6B\u23181\n\nx = \u02c6q + ( \u02c6A + \u02c6BK\u2020)>x\n\n\u02c6B>x = k\u2020\n\n( \u02c6Q, \u02c6R, \u02c6q, \u02c6c) = arg min\n\n(Q\u232b0,R\u232b\"I,q,c)\n\n\u02c6B>X \u02c6A + \u02c6Q\n\n(27)\n\n(28)\n\n(29)\n\n(30)\n\n(31)\n\n(32)\n\n2\n\n2\n\nT1Xt=0\n\n1\n2\n\ns>t Qst + q>st + a>t Rat + c + rt\n\nX \u232b 0.\n\n(33)\nNote that the estimated transition matrices \u02c6A, \u02c6B are not optimization variables because the attacker\ncan only modify the rewards, which will not change the learner\u2019s estimate on \u02c6A and \u02c6B. The attack\noptimization (27)-(33) is hard to solve due to the constraint (32) itself being a semi-de\ufb01nite program\n(SDP). To overcome the dif\ufb01culty, we pull all the positive semi-de\ufb01nite constraints out of the lower-\nlevel optimization. This leads to a more stringent surrogate attack optimization (see appendix C).\nSolving the surrogate attack problem, whose feasible region is a subset of the original problem, in\ngeneral gives a suboptimal solution to (27)-(33). But it comes with one advantage: convexity.\n\n5 Experiments\n\nThroughout the experiments, we use CVXPY [8] to implement the optimization. All code can be\nfound in https://github.com/myzwisc/PPRL_NeurIPS19.\n\n5.1 Policy Poisoning Attack on TCE Victim\nExperiment 1. We consider a simple MDP with two states A, B and two actions: stay in the same\nstate or move to the other state, shown in \ufb01gure 1a. The discounting factor is = 0.9. The MDP\u2019s Q\nvalues are shown in table 1b. Note that the optimal policy will always pick action stay. 
The clean training data D⁰ reflects this underlying MDP, and consists of 4 tuples:\n\n(A, stay, 1, A)   (A, move, 0, B)   (B, stay, 1, B)   (B, move, 0, A)\n\nLet the attacker's target policy be π†(s) = move for every state s. The attacker sets ε = 1 and uses α = 2, i.e., ‖r − r⁰‖₂ as the attack cost. Solving the policy poisoning attack optimization problem (12)-(15) produces the poisoned data:\n\n(A, stay, 0, A)   (A, move, 1, B)   (B, stay, 0, B)   (B, move, 1, A)\n\nwith attack cost ‖r − r⁰‖₂ = 2. The resulting poisoned Q values are shown in table 1c. To verify this attack, we run the TCE learner on both the clean data and the poisoned data. Specifically, we estimate the transition kernel and the reward function as in (8) and (9) on each data set, and then run value iteration until the Q values converge. In Figure 1d, we show the trajectory of Q values for state A, where the x and y axes denote Q(A, stay) and Q(A, move) respectively. All trajectories start at (0, 0). The dots on a trajectory correspond to the steps of value iteration, while the star denotes the converged Q values. The diagonal dashed line is the (zero margin) policy boundary, while the gray region is the ε-robust target Q polytope with an offset ε = 1 from the policy boundary. The trajectory on clean data converges to a point below the policy boundary, where the action stay is optimal. With the poisoned data, the trajectory of Q values converges to a point exactly on the boundary of the ε-robust target Q polytope, where the action move becomes optimal. This validates our attack.\n\n(a) A toy MDP with two states (reward +1 for action stay, 0 for action move). (b) Original Q values: Q(·, stay) = 10, Q(·, move) = 9 for both states. (c) Poisoned Q values: Q(·, stay) = 9, Q(·, move) = 10 for both states. (d) Trajectory of the Q values of state A during value iteration.\n\nFigure 1: Poisoning TCE in a two-state MDP.\n\nWe also compare our attack with reward shaping [18].
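The verification step of Experiment 1 (TCE estimation as in (8)-(9) followed by value iteration) fits in a few lines; a sketch with states A, B encoded as 0, 1 and actions stay, move as 0, 1 (helper names are ours):

```python
import numpy as np

def tce_q_values(data, n_s, n_a, gamma=0.9, iters=500):
    """Estimate P_hat, R_hat by counts and averages, then run value
    iteration to convergence. Assumes every (s, a) pair appears in data."""
    P = np.zeros((n_s, n_a, n_s))
    R = np.zeros((n_s, n_a))
    cnt = np.zeros((n_s, n_a))
    for s, a, r, s2 in data:
        P[s, a, s2] += 1
        R[s, a] += r
        cnt[s, a] += 1
    P /= cnt[:, :, None]
    R /= cnt
    Q = np.zeros((n_s, n_a))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)   # Bellman optimality update
    return Q

clean = [(0, 0, 1, 0), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0)]
poisoned = [(0, 0, 0, 0), (0, 1, 1, 1), (1, 0, 0, 1), (1, 1, 1, 0)]
Q_clean = tce_q_values(clean, 2, 2)
Q_poisoned = tce_q_values(poisoned, 2, 2)
# Clean data: stay (action 0) is optimal; poisoned data: move (action 1) wins.
```

The converged values match the tables: (10, 9) per state on clean data and (9, 10) on poisoned data.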
We let the potential function φ(s) be the optimal value function V(s) for all s to shape the clean dataset, i.e., each reward is replaced by r′_t = r_t + γφ(s′_t) − φ(s_t). The dataset after shaping is\n\n(A, stay, 0, A)   (A, move, −1, B)   (B, stay, 0, B)   (B, move, −1, A)\n\nIn Figure 1d, we show the trajectory of Q values after reward shaping. Note that, same as on the clean dataset, the trajectory after shaping converges to a point below the policy boundary. This means reward shaping cannot make the learner learn a different policy from the original optimal policy. Also note that after reward shaping, value iteration converges much faster (in only one iteration), which matches the benefits of reward shaping shown in [18]. More importantly, this illustrates the difference between our attack and reward shaping.\n\n(a) Grid world with a single terminal state G. (b) Grid world with two terminal states G1 and G2.\n\nFigure 2: Poisoning TCE in grid-world tasks. (The per-cell numbers in the figure are the reward modifications produced by the attack.)\n\nExperiment 2. As another example, we consider the grid world tasks in [5]. In particular, we focus on the two tasks shown in figures 2a and 2b. In figure 2a, the agent starts from S and aims to arrive at the terminal cell G. The black regions are walls, thus the agent can only choose to go through the white or gray regions.
The agent can take four actions in every state: go left, right, up or down, and it stays in place if the action would take it into a wall. Reaching a gray region, a white region, or the terminal state yields reward −10, −1, and +2, respectively. After the agent arrives at the terminal state G, it stays there forever and always receives reward 0 regardless of the subsequent actions. The original optimal policy is to follow the blue trajectory. The attacker's goal is to force the agent to follow the red trajectory. Correspondingly, for the states on the red trajectory we set the target actions to follow the trajectory, and for the remaining states we set the target actions to be the same as the original optimal policy learned on the clean data. The clean training data contains a single item for every state-action pair. We run the attack with ε = 0.1 and α = 2. Our attack is successful: with the poisoned data, TCE generates a policy that produces the red trajectory in Figure 2a, which is the desired behavior. The attack cost is ‖r − r⁰‖₂ ≈ 2.64, which is small compared to ‖r⁰‖₂ = 21.61. In Figure 2a, we show the poisoning on the rewards. Each state-action pair is denoted by an orange arrow, and the value tagged to each arrow is the modification to that reward: a red value means the reward is increased and a blue value means it is decreased. An arrow without any tagged value means the corresponding reward is not changed by the attack. Note that rewards along the red trajectory are increased, while those along the blue trajectory are reduced, resulting in the red trajectory being preferred by the agent. Furthermore, rewards closer to the starting state S suffer larger poisoning since they contribute more to the Q values. For the large modification +2.139 at the terminal state, we provide an explanation in appendix E.\nExperiment 3. In Figure 2b there are two terminal states G1 and G2 with rewards +1 and +2, respectively. The agent starts from S.
Although G2 is more profitable, the path to it is longer and each step incurs a −1 reward. Therefore, the original optimal policy is the blue trajectory to G1. The attacker's target policy is to force the agent along the red trajectory to G2. We set the target actions for the states as in experiment 2. The clean training data contains a single item for every state-action pair. We run our attack with ε = 0.1 and α = 2. Again, after the attack, TCE on the poisoned dataset produces the red trajectory in figure 2b, with attack cost ‖r − r⁰‖₂ ≈ 0.38, compared to ‖r⁰‖₂ = 11.09. The reward poisoning follows a similar pattern to experiment 2.\n\n5.2 Policy Poisoning Attack on LQR Victim\n\n(a) Clean and poisoned vehicle trajectory.   (b) Clean and poisoned rewards.\n\nFigure 3: Poisoning a vehicle running LQR in 4D state space.\n\nExperiment 4. We now demonstrate our attack on LQR. We consider a linear dynamical system that approximately models a vehicle. The state of the vehicle consists of its 2D position and 2D velocity: s_t = (x_t, y_t, vx_t, vy_t) ∈ R⁴. The control signal at time t is the force a_t ∈ R², which is applied on the vehicle for h seconds. We assume there is a friction parameter η such that the friction force is −η v_t. Let m be the mass of the vehicle. Given small enough h, the transition matrices can be approximated by (17) where\n\nA = [ 1  0  h  0 ;  0  1  0  h ;  0  0  1 − hη/m  0 ;  0  0  0  1 − hη/m ],   B = [ 0  0 ;  0  0 ;  h/m  0 ;  0  h/m ].   (34)
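The learner's dynamics estimate is an ordinary least-squares fit of (A, B) to the transition tuples. A sketch that simulates the vehicle under random controls and recovers the matrices (the rollout length and noise level follow Experiment 4; the sampling of controls inside the unit sphere is our simplification, not necessarily the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
h, m, eta, sigma = 0.1, 1.0, 0.5, 0.01
A = np.array([[1, 0, h, 0],
              [0, 1, 0, h],
              [0, 0, 1 - h * eta / m, 0],
              [0, 0, 0, 1 - h * eta / m]])
B = np.array([[0.0, 0], [0, 0], [h / m, 0], [0, h / m]])

# Roll out 400 steps from s0 = (1, 1, 1, -0.5) with random controls.
s = np.array([1.0, 1.0, 1.0, -0.5])
rows, nexts = [], []
for _ in range(400):
    a = rng.normal(size=2)
    a /= max(np.linalg.norm(a), 1.0)   # keep the control inside the unit sphere
    s_next = A @ s + B @ a + sigma * rng.normal(size=4)
    rows.append(np.concatenate([s, a]))
    nexts.append(s_next)
    s = s_next

# Least squares: [A_hat B_hat] minimizes sum ||A s_t + B a_t - s_{t+1}||^2.
theta, *_ = np.linalg.lstsq(np.array(rows), np.array(nexts), rcond=None)
A_hat, B_hat = theta.T[:, :4], theta.T[:, 4:]
```

With σ = 0.01 and 400 samples, the recovered matrices are close to the true (A, B), mirroring the small gap between the clean estimated policy and the true optimal policy reported in the experiment.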
Throughout the experiment, we let = 0.9 for solving the optimal control policy in (21).\nWith the true dynamics and loss function, the computed optimal control policy is\n\nK\u21e4 =\uf8ff 1.32\n\n0\n\n0\n\n2.39\n\n0\n\n2.39 , k\u21e4 = [ 0\n\n0 ] ,\n\n(35)\n\n0\n\n1.32\nwhich will drive the vehicle to the origin.\nThe batch LQR learner estimates the dynamics and the loss function from a batch training data. To\nproduce the training data, we let the vehicle start from state s0 and simulate its trajectory with a\nrandom control policy. Speci\ufb01cally, in each time step, we uniformly sample a control signal at in a\nunit sphere. The vehicle then takes action at to transit from current state st to the next state st+1, and\nreceives a reward rt = L(st, at). This gives us one training item (st, at, rt, st+1). We simulate a\ntotal of 400 time steps to obtain a batch data that contains 400 items, on which the learner estimates\nthe dynamics and the loss function. With the learner\u2019s estimate, the computed clean optimal policy is\n\u02c6K0 =\uf8ff 1.31\n\n1.97e2 1.35 1.14e2 2.42 , \u02c6k0 = [ 4.88e5\n\n4.95e6 ] . (36)\nThe clean optimal policy differs slightly from the true optimal policy due to the inaccuracy of the\nlearner\u2019s estimate. The attacker has a target policy (K\u2020, k\u2020) that can drive the vehicle close to its\ntarget destination (x\u2020, y\u2020) = (0, 1) with terminal velocity (0, 0), which can be represented as a target\nstate s\u2020 = (0, 1, 0, 0). To ensure feasibility, we assume that the attacker starts with the loss function\n2 (s s\u2020)>Q(s s\u2020) + a>Ra where Q = I, R = 0.1I. Due to the offset this corresponds to setting\n1\n2 s\u2020>Qs\u2020 = 0.5 in (18). 
The attacker then solves the Riccati\nQ = I, R = 0.1I, q = s\u2020, c = 1\nequation with its own loss function and the learner\u2019s estimates \u02c6A and \u02c6B to arrive at the target policy\n\n1.00e2\n\n2.03e3\n\n2.41\n\nK\u2020 =\uf8ff 1.31\n\n1.97e2 1.35 1.14e2 2.42 , k\u2020 = [ 0.01\n\n9.99e3\n\n2.02e3\n\n2.41\n\n1.35 ] .\n\n(37)\n\nWe run our attack (27)-(33) with \u21b5 = 2 and \" = 0.01 in (32). Figure 3 shows the result of our attack.\nIn Figure 3a, we plot the trajectory of the vehicle with policy learned on clean data and poisoned\ndata respectively. Our attack successfully forces LQR into a policy that drives the vehicle close to the\ntarget destination. The wiggle on the trajectory is due to the noise wt of the dynamical system. On\nthe poisoned data, the LQR victim learns the policy\n2.41\n\n1.97e2 1.35 1.14e2 2.42 , \u02c6k = [ 0.01\n\n\u02c6K =\uf8ff 1.31\n\n9.99e3\n\n2.02e3\n\nwhich matches exactly the target policy K\u2020, k\u2020. In Figure 3b, we show the poisoning on rewards. Our\nattack leads to very small modi\ufb01cation on each reward, thus the attack is ef\ufb01cient. The total attack\ncost over all 400 items is only kr r0k2 = 0.73, which is tiny small compared to kr0k2 = 112.94.\nThe results here demonstrate that our attack can dramatically change the behavior of LQR by only\nslightly modifying the rewards in the dataset.\nFinally, for both attacks on TCE and LQR, we note that by setting the attack cost norm \u21b5 = 1 in (5),\nthe attacker is able to obtain a sparse attack, meaning that only a small fraction of the batch data\nneeds to be poisoned. Such sparse attacks have profound implications in adversarial machine learning\nas they can be easier to carry out and harder to detect. We show detailed results in appendix E.\n\n1.35 ] ,\n\n(38)\n\n6 Conclusion\n\nWe presented a policy poisoning framework against batch reinforcement learning and control. We\nshowed the attack problem can be formulated as convex optimization. 
We provided theoretical analysis of attack feasibility and attack cost. Experiments show the attack can force the learner into an attacker-chosen target policy while incurring only a small attack cost.
Acknowledgments. This work is supported in part by NSF 1545481, 1561512, 1623605, 1704117, 1836978 and the MADLab AF Center of Excellence FA9550-18-1-0166.

References

[1] John Asmuth, Michael L Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, pages 604–609. AAAI Press, 2008.

[2] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.

[3] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, pages 1807–1814, 2012.

[4] Daniel S Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7749–7758, 2019.

[5] Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[6] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.

[7] Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 433–440. IFAAMAS, 2012.

[8] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization.
Journal of Machine Learning Research, 17(83):1–5, 2016.

[9] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58. ACM, 2011.

[10] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.

[11] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649, 2018.

[12] Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, and Adish Singla. Interactive teaching algorithms for inverse reinforcement learning. In 28th International Joint Conference on Artificial Intelligence, pages 604–609, 2019.

[13] Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741, 2018.

[14] Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pages 1885–1893, 2016.

[15] Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pages 4042–4050, 2019.

[16] Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.

[17] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[18] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping.
In ICML, volume 99, pages 278–287, 1999.

[19] Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.

[20] Eric Wiewiora. Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003.

[21] Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pages 1689–1698, 2015.

[22] Xuezhou Zhang and Xiaojin Zhu. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.