{"title": "Adversarial Attacks on Stochastic Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3640, "page_last": 3649, "abstract": "We study adversarial attacks that manipulate the reward signals to control the actions chosen by a stochastic multi-armed bandit algorithm. We propose the first attack against two popular bandit algorithms: $\\epsilon$-greedy and UCB, \\emph{without} knowledge of the mean rewards. The attacker is able to spend only logarithmic effort, multiplied by a problem-specific parameter that becomes smaller as the bandit problem gets easier to attack. The result means the attacker can easily hijack the behavior of the bandit algorithm to promote or obstruct certain actions, say, a particular medical treatment. As bandits are seeing increasingly wide use in practice, our study exposes a significant security threat.", "full_text": "Adversarial Attacks on Stochastic Bandits\n\nKwang-Sung Jun\nBoston University\n\nkwangsung.jun@gmail.com\n\nYuzhe Ma\nUW-Madison\n\nma234@wisc.edu\n\nLihong Li\nGoogle Brain\n\nlihong@google.com\n\nXiaojin Zhu\nUW-Madison\n\njerryzhu@cs.wisc.edu\n\nAbstract\n\nWe study adversarial attacks that manipulate the reward signals to control the\nactions chosen by a stochastic multi-armed bandit algorithm. We propose the\n\ufb01rst attack against two popular bandit algorithms: \u270f-greedy and UCB, without\nknowledge of the mean rewards. The attacker is able to spend only logarithmic\neffort, multiplied by a problem-speci\ufb01c parameter that becomes smaller as the\nbandit problem gets easier to attack. The result means the attacker can easily hijack\nthe behavior of the bandit algorithm to promote or obstruct certain actions, say,\na particular medical treatment. 
As bandits are seeing increasingly wide use in practice, our study exposes a significant security threat.\n\n1 Introduction\n\nDesigning trustworthy machine learning systems requires understanding how they may be attacked. There has been a surge of interest in adversarial attacks against supervised learning [12, 15]. In contrast, little is known about adversarial attacks against stochastic multi-armed bandits (MABs), a form of online learning with limited feedback. This is potentially hazardous since stochastic MABs are widely used in industry to recommend news articles [18], display advertisements [9], improve search results [17], allocate medical treatment [16], and promote users’ well-being [13], among many others. Indeed, as we show, an adversarial attacker can modify the reward signal to manipulate the MAB for nefarious goals.\n\nOur main contribution is an analysis of reward-manipulation attacks. We distinguish three agents in this setting: “the world,” “Bob” the bandit algorithm, and “Alice” the attacker. As in standard stochastic bandit problems, the world consists of K arms with sub-Gaussian rewards centered at μ_1, . . . , μ_K. Note that we do not assume {μ_i} are sorted. Neither Bob nor Alice knows {μ_i}. Bob pulls selected arms in rounds and attempts to minimize his regret. When Bob pulls arm I_t ∈ [K] in round t, the world generates a random reward r⁰_t drawn from a sub-Gaussian distribution with expectation μ_{I_t}. However, Alice sits in between the world and Bob and manipulates the reward into r_t = r⁰_t - α_t. We call α_t ∈ ℝ the attack. If Alice decides not to attack in this round, she simply lets α_t = 0. Bob then receives r_t, without knowing the presence of Alice. Without loss of generality, assume arm K is a suboptimal “attack target” arm: μ_K < max_{i=1,...,K} μ_i.
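This three-agent interaction (the world draws r⁰_t, Alice subtracts α_t, Bob only ever sees r_t = r⁰_t - α_t) can be sketched in a few lines of Python. This is a minimal illustration of the protocol, not the paper's code; the function name is ours, and the Gaussian reward is one concrete sub-Gaussian choice:

```python
import random

def run_round(mu, bob_choose, alice_attack, sigma=0.1):
    """One round of the attack protocol: Bob picks an arm, the world draws a
    pre-attack reward r0_t, and Alice subtracts alpha_t before Bob sees it."""
    arm = bob_choose()                 # Bob picks I_t (0-indexed here)
    r0 = random.gauss(mu[arm], sigma)  # world: Gaussian as a sub-Gaussian example
    alpha = alice_attack(arm, r0)      # Alice observes (I_t, r0_t), picks alpha_t
    return arm, r0 - alpha             # Bob receives r_t = r0_t - alpha_t

# A do-nothing Alice (alpha_t = 0) leaves the bandit problem unchanged.
arm, r = run_round(mu=[0.5, 0.2],
                   bob_choose=lambda: 0,
                   alice_attack=lambda arm, r0: 0.0)
```

The attacks studied below instantiate `alice_attack` adaptively from the empirical rewards observed so far.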
Alice’s goal is to manipulate Bob into pulling arm K very often while making small attacks. Specifically, we show Alice can force Bob to pull the target arm T - o(T) times with a cumulative attack cost of Σ_{t=1}^T |α_t| = O(log T).\n\nThe assumption that Alice does not know {μ_i} is significant because otherwise Alice can perform the attack trivially. To see this, with the knowledge of {μ_i} Alice would be able to compute the truncated reward gap Δ^ε_i = max{μ_i - μ_K + ε, 0} ≥ 0 for all non-target arms i ≠ K, for some small parameter ε > 0.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nAlice can perform the following oracle attack: in any round where a non-target arm I_t ≠ K is pulled, attack with α_t = Δ^ε_{I_t}. This oracle attack transforms the original bandit problem into one where all non-target arms have expected reward less than μ_K. It is well known that if Bob runs a sublinear-regret algorithm (e.g., UCB [6, 8]), almost all arm pulls will concentrate on the now-best target arm K in the transformed bandit problem. Furthermore, Alice’s cumulative attack cost will be sublinear in time, because the total number of non-target arm pulls is sublinear in the transformed problem. In practice, however, it is almost never the case that Alice knows μ_1, . . . , μ_K and hence the Δ^ε_i’s. Thus the oracle attack is impractical. Our focus in this paper is to design an attack that nearly matches the oracle attack, but for Alice who does not know {μ_i}. We do so for two popular bandit algorithms, ε-greedy [7] and UCB [8].\n\nWhat damage can Alice do in practice? She can largely control the arms pulled by Bob. She can also control which arm appears to Bob as the best arm at the end. As an example, consider the news-delivering contextual bandit problem [18].
The arms are available news articles, and Bob selects which arm to pull (i.e., which article to show to a user at the news site). In normal operation, Bob shows news articles to users to maximize the click-through rate. However, Alice can attack Bob to change his behavior. For instance, Alice can manipulate the rewards so that users from a particular political base are always shown particular news articles that can reinforce or convert their opinion. Conversely, Alice can coerce the bandit to not show an important article to certain users. As another example, Alice may interfere with clinical trials [16] to funnel most patients toward a certain treatment, or make researchers draw wrong conclusions on whether the treatment is better than the control. Therefore, adversarial attacks on MABs deserve our attention. Insights gained from our study can be used to build defenses in the future.\n\nFinally, we note that our setting is motivated by modern industry-scale applications of contextual bandits, where arm selection, reward-signal collection, and policy updates are done in a distributed way [3, 18]. Attacks can happen when the reward signal is joined with the selected arm, or when the arm-reward data is sent to another module for Bob to update his policy. In either case, Alice has access to both I_t and r⁰_t, for the present and previous rounds.\n\nThe rest of the paper is organized as follows. In Section 2, we introduce notation and straightforward attack algorithms that serve as baselines. We then propose our two attack algorithms for ε-greedy and UCB in Sections 3 and 4 respectively, along with their theoretical attack guarantees. In Section 5, we empirically confirm our findings with toy experiments.
Finally, we conclude our paper with related work (Section 6) and a discussion of future work (Section 7) that will enrich our understanding of security vulnerabilities and defense mechanisms for secure MAB deployment.\n\n2 Preliminaries\n\nBefore presenting our main attack algorithms, in this section we first discuss a simple heuristic attack algorithm which serves to illustrate the intrinsic difficulty of attacks. Throughout, we assume Bob runs a bandit algorithm with sublinear pseudo-regret E Σ_{t=1}^T (max_{j=1,...,K} μ_j - μ_{I_t}). As Alice does not know {μ_i}, she must rely on the empirical rewards up to round t - 1 to decide the appropriate attack α_t. The attack is online since α_t is computed on-the-fly as I_t and r⁰_t are revealed. The attacking protocol is summarized in Algorithm 1.\n\nAlgorithm 1 Alice’s attack against a bandit algorithm\n1: Input: Bob’s bandit algorithm, target arm K\n2: for t = 1, 2, . . . do\n3: Bob chooses arm I_t to pull.\n4: World generates pre-attack reward r⁰_t.\n5: Alice observes I_t and r⁰_t, and then decides the attack α_t.\n6: Alice gives r_t = r⁰_t - α_t to Bob.\n7: end for\n\nWe assume all arm rewards are σ²-sub-Gaussian, where σ² is known to both Alice and Bob. Let N_i(t) be the number of pulls of arm i up to round t. We say the attack is successful after T rounds if the number of target-arm pulls is N_K(T) = T - o(T) in expectation or with high probability, while minimizing the cumulative attack cost Σ_{t=1}^T |α_t|.
Other attack settings are discussed in Section 7.\n\nFor convenience we define the following quantities:\n• τ_i(t) := {s : s ≤ t, I_s = i}, the set of rounds up to t where arm i is chosen,\n• μ̂⁰_i(t) := N_i(t)⁻¹ Σ_{s ∈ τ_i(t)} r⁰_s, the pre-attack average reward of arm i up to round t, and\n• μ̂_i(t) := N_i(t)⁻¹ Σ_{s ∈ τ_i(t)} r_s, the corresponding post-attack average reward.\n\nThe oracle attack, revisited   While the oracle attack was impractical, it gives us a baseline for comparison. The oracle attack drags down the reward of all non-target arms,¹ and can be written as\n\nα_t = 1{I_t ≠ K} · Δ^ε_{I_t}.\n\nProposition 1 shows that the oracle attack succeeds and requires only a logarithmic attack cost. While more general statements for sublinear-regret algorithms can be made, we focus on logarithmic-regret bandit algorithms for simplicity. Throughout, omitted proofs can be found in our supplementary material.\n\nProposition 1. Assume that Bob’s bandit algorithm achieves an O(log T) regret bound. Then, Alice’s oracle attack with ε > 0 succeeds; i.e., E N_K(T) = T - o(T). Furthermore, the expected total attack cost is O(Σ_{i=1}^{K-1} Δ^ε_i log T).²\n\nThe heuristic constant attack   A slight variant of the oracle attack is to attack all the non-target arms with a single constant amount A > 0, regardless of the actual μ_i’s:\n\nα_t = 1{I_t ≠ K} · A.\n\nLet Δ_i := Δ⁰_i. Unfortunately, this heuristic constant attack depends critically on the value of A compared to the unknown maximum gap max_i Δ_i. Proposition 2 states the condition under which the attack succeeds:\n\nProposition 2. Assume that Bob’s bandit algorithm achieves an O(log T) regret bound. Then, Alice’s heuristic constant attack with A succeeds if and only if A > max_i Δ_i. If the attack succeeds, then the expected attack cost is O(AK log T).\n\nConversely, if A < max_i Δ_i the attack fails.
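The threshold behavior in Proposition 2 is easy to observe in simulation. The sketch below is ours, not the paper's experiment code; it assumes Gaussian rewards and an ε-greedy Bob with ε_t = K/t, and the mean rewards and values of A are arbitrary illustrations:

```python
import random

def constant_attack_sim(mu, target, A, T=20000, seed=0, sigma=0.1):
    """Toy simulation of the heuristic constant attack alpha_t = 1{I_t != target} * A
    against an epsilon-greedy Bob with eps_t = K/t. Returns the fraction of
    target-arm pulls and the cumulative attack cost."""
    rng = random.Random(seed)
    K = len(mu)
    n = [0] * K        # pull counts N_i(t)
    mean = [0.0] * K   # post-attack empirical means, as seen by Bob
    cost = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                 # pull each arm once
        elif rng.random() < min(1.0, K / t):
            arm = rng.randrange(K)                      # exploration
        else:
            arm = max(range(K), key=lambda i: mean[i])  # exploitation
        r0 = rng.gauss(mu[arm], sigma)                  # pre-attack reward
        alpha = A if arm != target else 0.0             # constant attack
        cost += alpha
        n[arm] += 1
        mean[arm] += (r0 - alpha - mean[arm]) / n[arm]  # running post-attack mean
    return n[target] / T, cost

# Here max_i Delta_i = 1.0 - 0.2 = 0.8, so A = 1.0 should succeed and
# A = 0.5 should fail, as Proposition 2 predicts.
frac_hi, _ = constant_attack_sim([1.0, 0.5, 0.2], target=2, A=1.0)
frac_lo, _ = constant_attack_sim([1.0, 0.5, 0.2], target=2, A=0.5)
```

With A above the (unknown) maximum gap, exploitation pulls concentrate on the target arm; below it, Bob mostly pulls the arm whose post-attack mean still exceeds the target's.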
This is because in the transformed bandit problem, there exists an arm that has a higher expected reward than arm K, and Bob will mostly pull that arm. Therefore, the heuristic constant attack has to know an unknown quantity to guarantee a successful attack. Moreover, the attack is non-adaptive to the problem difficulty since some Δ_i’s can be much smaller than A, in which case Alice pays an unnecessarily large attack cost.\n\nWe therefore ask the following question:\n\nDoes there exist an attacker Alice that guarantees a successful attack with cost adaptive to the problem difficulty?\n\nThe answer is yes. We present attack strategies against two popular bandit algorithms of Bob: ε-greedy and UCB. We show that Alice can indeed succeed in her attacks and incur a cost as small as that of the oracle, with an additive term due to the sub-Gaussian noise level σ.\n\n3 Alice’s Attack on ε-Greedy Bob\n\nThe ε-greedy strategy initially pulls each arm once in the first K rounds. For convenience, we assume that the target arm is pulled first: I_1 = K. Our results in this section can be adapted to any order of initialization with more complicated notation.\n\n¹The opposite strategy is to push up the target arm: α_t = 1{I_t = K} · (μ_K - max_j μ_j - ε), to make arm K the best arm in post-attack rewards. However, a successful attack means that Bob pulls the target arm T - o(T) times; the attack cost is then necessarily linear in T, which is inefficient. Simulations that support “drag down” instead of “push up” are presented in Appendix D.\n²For near-optimal algorithms like UCB [6], one can find the optimal choice of ε. See our supplementary material for detail.\n\nBob’s ε-greedy strategy has the following arm-selection rule for t > K [7]:
I_t = { draw uniformly from [K],   w.p. ε_t (exploration)\n      { argmax_i μ̂_i(t - 1),        otherwise (exploitation).\n\nThe strategy uses an exploration scheme {ε_t} over t. Alice’s attack algorithm is not aware of {ε_t}, though her cumulative attack cost Σ |α_t| will implicitly depend on it. Later in Corollary 1 we show that, for the typical decaying scheme ε_t ∝ 1/t, the cumulative attack cost is mild: O(log t).\n\nAlice wants to make Bob always pull the target arm during exploitation rounds. Since Alice has no influence on which arm is pulled during exploration, this attack goal is the strongest she can achieve. Here, Algorithm 1 is specialized to ensure the following condition:\n\nμ̂_{I_t}(t) ≤ μ̂_K(t) - 2β(N_K(t)),   (1)\n\nwhere we define β(N) as\n\nβ(N) := √( (2σ²/N) log( π²KN² / (3δ) ) ).   (2)\n\nFrom this condition, we derive the actual attack α_t. Since\n\nμ̂_{I_t}(t) = ( μ̂_{I_t}(t - 1) N_{I_t}(t - 1) + r⁰_t - α_t ) / N_{I_t}(t),   (3)\n\nwe set the attack in Algorithm 1 as\n\nα_t = [ μ̂_{I_t}(t - 1) N_{I_t}(t - 1) + r⁰_t - ( μ̂_K(t) - 2β(N_K(t)) ) N_{I_t}(t) ]_+,   (4)\n\nwhere [z]_+ = max(0, z). Note α_t is always non-negative; thus the cumulative attack cost can be written without absolute values: Σ_{t=1}^T α_t.\n\nWith this α_t, we claim that (i) Alice forces Bob to pull the target arm in all exploitation rounds, as shown in Lemma 2, and (ii) the cumulative attack cost is logarithmic in t for the standard ε-greedy exploration scheme ε_t = O(1/t), as shown in Corollary 1. Our main result is the following general upper bound on the cumulative attack cost.\n\nTheorem 1. Let δ ≤ 1/2. With probability at least 1 - 2δ, for any T satisfying Σ_{t=1}^T ε_t ≥ (K/e²) log(K/δ),³ Alice forces Bob running ε-greedy to choose the target arm in at least Ñ_K(T) rounds, using a cumulative attack cost at most\n\nΣ_{t=1}^T |α_t| < ( Σ_{i=1}^K Δ_i ) Ñ(T) + (K - 1) · ( Ñ(T) β(Ñ(T)) + 3 Ñ(T) β(Ñ_K(T)) ),\n\nwhere\n\nÑ(T) = (1/K) Σ_{t=1}^T ε_t + √( 3 log(K/δ) · (1/K) Σ_{t=1}^T ε_t ),\nÑ_K(T) = T - Σ_{t=1}^T ε_t - √( 3 log(K/δ) · Σ_{t=1}^T ε_t ).   (5)\n\nBefore proving the theorem, we first look at its consequence. If Bob’s ε_t decay scheme is ε_t = min{1, cK/t} for some c > 0, as recommended in Auer et al. [7], Alice’s cumulative attack cost is O(Σ_{i=1}^K Δ_i log T) for large enough T, as the following corollary shows:\n\nCorollary 1. Inherit the assumptions in Theorem 1. Fix K and δ. If ε_t = cK/t for some constant c > 0, then\n\nΣ_{t=1}^T |α_t| = Ô( ( Σ_{i=1}^K Δ_i ) log T + σK √(log T) ),\n\nwhere Ô ignores log log factors.\n\n³One can drop this condition by considering a slightly larger Ñ(T) and a smaller Ñ_K(T). However, we keep the condition as it simplifies Ñ(T) and Ñ_K(T). We refer to the proof of Lemma 4 for detail.\n\nNote that the two important constants are Σ_i Δ_i and σ. While a large σ can increase the cost significantly, the term with Σ_i Δ_i dominates the cost for large enough T. Specifically, Σ_i Δ_i is multiplied by log T, which is of higher order than √(log T). We empirically verify the scaling of the cost with T in Section 5.\n\nTo prove Theorem 1, we first show that β in (2) is a high-probability bound on the pre-attack empirical means of all arms in all rounds. Define the event\n\nE := { ∀i, ∀t > K : |μ̂⁰_i(t) - μ_i| < β(N_i(t)) }.   (6)\n\nLemma 1. For δ ∈ (0, 1), P(E) > 1 - δ.\n\nThe following lemma proves the first half of our claim.\n\nLemma 2. For δ ≤ 1/2 and under event E, attacks (4) force Bob to always pull the target arm K in exploitation rounds.\n\nWe now show that on average each attack on a non-target arm i is not much bigger than Δ_i.\n\nLemma 3.
For δ ≤ 1/2 and under event E, we have for all arms i < K and all t that\n\nΣ_{s ∈ τ_i(t)} |α_s| < ( Δ_i + β(N_i(t)) + 3β(N_K(t)) ) N_i(t).\n\nFinally, we upper bound the number of pulls N_i(T) of each non-target arm i < K. Recall that the arm-i pulls are only the result of exploration rounds. In round t the exploration probability is ε_t; if Bob explores, he chooses an arm uniformly at random. We also lower bound the target-arm pulls N_K(T).\n\nLemma 4. Let δ < 1/2. Suppose T satisfies Σ_{t=1}^T ε_t ≥ (K/e²) log(K/δ). With probability at least 1 - δ, for all non-target arms i < K,\n\nN_i(T) < (1/K) Σ_{t=1}^T ε_t + √( 3 log(K/δ) · (1/K) Σ_{t=1}^T ε_t ),\n\nand for the target arm K,\n\nN_K(T) > T - Σ_{t=1}^T ε_t - √( 3 log(K/δ) · Σ_{t=1}^T ε_t ).\n\nWe are now ready to prove Theorem 1.\n\nProof. The theorem follows immediately from a union bound over Lemma 3 and Lemma 4 above. We add up the attack costs over the K - 1 non-target arms. Then, we note that Nβ(N) is increasing in N, so N_i(T) β(N_i(T)) ≤ Ñ(T) β(Ñ(T)). Finally, by Lemma 8 in our supplementary material, β(N) is decreasing in N, so β(N_K(T)) ≤ β(Ñ_K(T)).\n\n4 Alice’s Attack on UCB Bob\n\nRecall that we assume rewards are σ²-sub-Gaussian. Bob’s UCB algorithm in its basic form often assumes rewards are bounded in [0, 1]; we need to modify the algorithm to handle the more general sub-Gaussian rewards. By choosing α = 4.5 and ψ : λ ↦ σ²λ²/2 in the (α, ψ)-UCB algorithm of Bubeck & Cesa-Bianchi [8, Section 2.2], we obtain the following arm-selection rule:\n\nI_t = { t,   if t ≤ K\n      { argmax_i { μ̂_i(t - 1) + 3σ √( log t / N_i(t - 1) ) },   otherwise.\n\nFor the first K rounds, where Bob plays each of the K arms once in an arbitrary order, Alice does not attack: α_t = 0 for t ≤ K. After that, an attack happens only when I_t ≠ K. Specifically, consider any round t > K where Bob pulls arm i ≠ K. It follows from the UCB algorithm that\n\nμ̂_i(t - 1) + 3σ √( log t / N_i(t - 1) ) ≥ μ̂_K(t - 1) + 3σ √( log t / N_K(t - 1) ).\n\nAlice attacks as follows. She computes an attack α_t with the smallest absolute value such that\n\nμ̂_i(t) ≤ μ̂_K(t - 1) - 2β(N_K(t - 1)) - Δ₀,\n\nwhere Δ₀ ≥ 0 is a parameter of Alice. Since the post-attack empirical mean can be computed recursively by\n\nμ̂_i(t) = ( N_i(t - 1) μ̂_i(t - 1) + r⁰_t - α_t ) / ( N_i(t - 1) + 1 ),\n\nwhere r⁰_t is the pre-attack reward, we can write down Alice’s attack in closed form:\n\nα_t = [ N_i(t) μ̂⁰_i(t) - Σ_{s ∈ τ_i(t-1)} α_s - N_i(t) · ( μ̂_K(t - 1) - 2β(N_K(t - 1)) - Δ₀ ) ]_+.   (7)\n\nFor convenience, define α_t = 0 if I_t = K. We now present the main theorem on Alice’s cumulative attack cost against Bob who runs UCB.\n\nTheorem 2. Suppose T ≥ 2K and δ ≤ 1/2. Then, with probability at least 1 - δ, Alice forces Bob to choose the target arm in at least\n\nT - (K - 1) ( 2 + (9σ²/Δ₀²) log T )\n\nrounds, using a cumulative attack cost at most\n\nΣ_{t=1}^T α_t ≤ ( 2 + (9σ²/Δ₀²) log T ) Σ_{i<K} ( Δ_i + Δ₀ + 4β( 2 + (9σ²/Δ₀²) log T ) ).\n\nThe key step of the proof bounds the attack cost on each non-target arm i as Σ_{s ∈ τ_i(t)} α_s ≤ N_i(t) ( Δ_i + Δ₀ + 4β(N_i(t)) ). In the cost bound, a larger Δ₀ decreases the number of non-target pulls through the (9σ²/Δ₀²) log T factor but increases the per-pull cost through the Δ₀ term; thus there is no need for Alice to choose Δ₀ larger than Θ(σ).\n\nCorollary 2. Inherit the assumptions in Theorem 2. By choosing Δ₀ = Θ(σ), the cumulative attack cost is Ô( (Σ_{i<K} Δ_i + Kσ) log T ).\n\n5 Experiments\n\nWe now empirically verify our findings with toy experiments.\n\nAttacking ε-greedy   The bandit has two arms whose mean rewards are separated by a gap Δ₁ > 0. Alice’s target arm is arm 2. We let δ = 0.025. Bob’s exploration probability decays as ε_t = 1/t. We run Alice and Bob for T = 10⁵ rounds; this forms one trial. We repeat 1000 trials.\n\nIn Figure 1(a), we fix σ = 0.1 and show Alice’s cumulative attack cost Σ_{s=1}^t |α_s|. Furthermore, note that Σ_{t=1}^T |α_t| = Ô( Δ₁ log T + σ √(log T) ). Ignoring log log T terms, we have Σ_{t=1}^T |α_t| ≤ C ( Δ₁ log T + σ √(log T) ) for some constant C > 0 and large enough T. Therefore, log( Σ_{t=1}^T |α_t| ) ≈ max{ log log T + log Δ₁, (1/2) log log T + log σ } + log C.
We thus expect the log-cost curve as a function of log log T to behave like the maximum of two lines, one with slope 1/2 and the other with slope 1. Indeed, we observe such a curve in Figure 1(b), where we fix Δ₁ = 1 and vary σ. All the slopes eventually approach 1, though larger σ’s take a longer time. This implies that the effect of σ diminishes for large enough T, which was predicted by Corollary 1.\n\nIn Figure 1(c), we compare the number of pulls of the target arm (the suboptimal arm 2) with and without attack. This experiment is with Δ₁ = 0.1 and σ = 0.1. Alice’s attack dramatically forces Bob to pull the target arm. In 10000 rounds, Bob is forced to pull the target arm 9994 rounds with the attack, compared to only 6 rounds if Alice were not present.\n\nAttacking UCB   The bandit has two arms. The reward distributions are the same as in the ε-greedy experiment. We let δ = 0.05. To study how σ and Δ₀ affect the cumulative attack cost, we perform two groups of experiments. In the first group, we fix σ = 0.1 and vary Alice’s free parameter Δ₀, while in the second group, we fix Δ₀ = 0.1 and vary σ. We perform 100 trials with T = 10⁷ rounds.\n\nFigure 2(a) shows Alice’s cumulative attack cost as Δ₀ varies. As Δ₀ increases, the cumulative attack cost decreases. In Figure 2(b), we show the cost as σ varies. Note that for large enough t, the cost grows almost linearly with log t, which is implied by Corollary 2. In both figures, there is a large attack near the beginning, after which the cost grows slowly. This is because the initial attacks drag down the empirical averages of the non-target arms by a large amount, such that the target arm appears to have the best UCB for many subsequent rounds.
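Concretely, Alice's closed-form attack against UCB can be sketched as follows. This is our own sketch with our own variable names; the numeric values of σ, δ, and Δ₀ below are illustrative, not the paper's settings:

```python
import math

def beta(N, K, sigma, delta):
    """High-probability deviation bound beta(N) from Eq. (2)."""
    return math.sqrt(2 * sigma**2 / N
                     * math.log(math.pi**2 * K * N**2 / (3 * delta)))

def ucb_attack(pre_sum, past_alpha, n_i, mu_hat_K, n_K, K,
               sigma=0.1, delta=0.05, Delta0=0.1):
    """Alpha_t for a pull of non-target arm i, in the spirit of Eq. (7).
    pre_sum:    sum of arm i's pre-attack rewards, including the current one
    past_alpha: sum of Alice's past attacks on arm i
    n_i:        N_i(t), arm i's pull count including the current pull
    mu_hat_K:   post-attack empirical mean of the target arm
    n_K:        N_K(t-1), the target arm's pull count"""
    level = mu_hat_K - 2 * beta(n_K, K, sigma, delta) - Delta0
    return max(0.0, pre_sum - past_alpha - n_i * level)

# One hypothetical round with made-up statistics: the attack pushes arm i's
# post-attack mean down to the level mu_hat_K - 2*beta(N_K) - Delta0.
alpha = ucb_attack(pre_sum=3.5, past_alpha=1.0, n_i=5, mu_hat_K=0.2, n_K=10, K=2)
post_mean = (3.5 - 1.0 - alpha) / 5
```

The `max(0.0, ...)` clipping matches [·]_+ in (7): when the arm already looks bad enough, Alice spends nothing, which is why the cost is dominated by the first few large attacks.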
Figure 2: Attack on UCB learner. (a) Attack cost Σ_{s=1}^t α_s as Δ₀ varies. (b) Attack cost as σ varies. (c) Target arm pulls N_K(t).\n\nFigure 2(c) again shows that Alice’s attack forces Bob to pull the target arm: with the attack, Bob is forced to pull the target arm 10⁷ - 2 times, compared to only 156 times without attack.\n\n6 Related Work\n\nThe literature on general adversarial learning is vast and covers ethics, safety, fairness, and legal concerns; see, e.g., Joseph et al. [15] and Goodfellow et al. [12]. Related to MABs, there has been empirical evidence suggesting that adversarial attacks can be quite effective, even in the more general multi-step reinforcement learning problems, as opposed to the bandit case considered in this paper. The learned policy may be lured to visit certain target states when adversarial examples are presented [19], or have inferior generalization ability when training examples are corrupted [14]. There are differences, though. In the first related setting, non-stochastic bandits [7, 11], the reward is generated by an adversary instead of a stationary, stochastic process. However, the reward observed by the learner is still a real reward, in that the learner is still interested in maximizing it, or more precisely, minimizing some notion of regret in reference to some reference policy [8]. Another related problem is reward shaping (e.g., Dorigo & Colombetti [10]), where the reward received by the learner is modified, as in our paper. However, those changes are typically done to help the learner in various ways (such as promoting exploration), and are designed in a way not to change the optimal policy the learner eventually converges to [22].\n\nA concurrent work by Lykouris et al. [20] considers a complementary problem to ours. They propose a randomized bandit algorithm that is robust to adversarial attacks on the stochastic rewards.
In contrast, our work shows that the existing stochastic algorithms are vulnerable to adversarial attacks. Note that their attack protocol is slightly different in that the attacker has to prepare attacks for all the arms before the learner chooses an arm. Furthermore, they have a different attack-cost definition, where the cost in a round is the largest manipulation over the arms, regardless of which arm the learner selects afterwards.\n\nAnother concurrent work by Ma et al. [21] considers attacking stochastic contextual bandit algorithms. The authors show that for a contextual bandit algorithm which periodically updates the arm-selection policy, an attacker can perform an offline attack to force the contextual bandit algorithm to pull some pre-specified target arm for a given target context vector. Our work differs in that we consider online attacks, which are performed on the fly rather than offline.\n\n7 Conclusions and Future Work\n\nWe presented a reward-manipulating attack on stochastic MABs. We analyzed the attack against ε-greedy and a generalization of the UCB algorithm, and proved that the attacker can force the algorithms to almost always pull a suboptimal target arm. The cost of the attack is only logarithmic in time. Given the wide use of MABs in practice, this is a significant security threat.\n\nOur analysis is only the beginning. We targeted ε-greedy and UCB learners for their simplicity and popularity. Future work may look into attacking Thompson sampling [23, 4], linear bandits [1, 5], contextual bandits [18, 2], etc. We assumed the reward attacks α_t are unbounded from above; new analysis is needed if an application’s reward space is bounded or discrete. It will also be useful to establish lower bounds on the cumulative attack cost. Specifically, it would be interesting to study Pareto optimality w.r.t.
the number of target-arm pulls and the cumulative attack cost.\n\nBeyond the attack studied in this paper, there is a wide range of possible attacks on MABs. We may organize them along several dimensions:\n\n• Optimal control viewpoint: Our “reward shaping” attack model can be formulated as optimal control [24]. We can define the control cost as α_t² + 1{I_t ≠ K} and design optimal control strategies.\n• The attack goal: The attacker may force the learner into pulling or avoiding target arms, or worsen the learner’s regret, or make the learner identify the wrong best arm, etc.\n• The attack action: The attacker can manipulate the rewards, or corrupt the context for contextual bandits, etc.\n• Online vs. offline: An online attacker must choose the attack action in real time; an offline attacker poisons a dataset of historical action-reward pairs in batch mode, and the learner then learns from the poisoned dataset.\n\nThe combination of these attack dimensions presents fertile ground for future research into both bandit-algorithm attacks and the corresponding defense mechanisms.\n\nAcknowledgments\nThis work is supported in part by NSF 1837132, 1545481, 1704117, 1623605, 1561512, and the MADLab AF Center of Excellence FA9550-18-1-0166.\n\nReferences\n[1] Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pp. 2312–2320, 2011.\n\n[2] Agarwal, Alekh, Hsu, Daniel, Kale, Satyen, Langford, John, Li, Lihong, and Schapire, Robert E. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning (ICML), pp.
1638–1646, 2014.\n\n[3] Agarwal, Alekh, Bird, Sarah, Cozowicz, Markus, Hoang, Luong, Langford, John, Lee, Stephen, Li, Jiaji, Melamed, Dan, Oshri, Gal, Ribas, Oswaldo, Sen, Siddhartha, and Slivkins, Alex. Making contextual decisions with low technical debt. CoRR abs/1606.03966, 2016.\n\n[4] Agrawal, Shipra and Goyal, Navin. Analysis of Thompson Sampling for the multi-armed bandit problem. In Proceedings of the Conference on Learning Theory (COLT), volume 23, pp. 39.1–39.26, 2012.\n\n[5] Agrawal, Shipra and Goyal, Navin. Thompson Sampling for contextual bandits with linear payoffs. In Proceedings of the International Conference on Machine Learning (ICML), pp. 127–135, 2013.\n\n[6] Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.\n\n[7] Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.\n\n[8] Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5:1–122, 2012.\n\n[9] Chapelle, Olivier, Manavoglu, Eren, and Rosales, Romer. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology, 5(4):61:1–61:34, 2014.\n\n[10] Dorigo, Marco and Colombetti, Marco. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1997. ISBN 0-262-04164-2.\n\n[11] Even-Dar, Eyal, Kakade, Sham M., and Mansour, Yishay. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.\n\n[12] Goodfellow, Ian J, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples.
In International Conference on Learning Representations (ICLR), 2015.\n[13] Greenewald, Kristjan, Tewari, Ambuj, Murphy, Susan A., and Klasnja, Predrag V. Action\ncentered contextual bandits. In Advances in Neural Information Processing Systems 30 (NIPS),\npp. 5979\u20135987, 2017.\n\n[14] Huang, Sandy, Papernot, Nicolas, Goodfellow, Ian, Duan, Yan, and Abbeel, Pieter. Adversarial\n\nattacks on neural network policies, 2017. arXiv:1702.02284.\n\n[15] Joseph, Anthony D., Nelson, Blaine, Rubinstein, Benjamin I. P., and Tygar, J.D. Adversarial\n\nMachine Learning. Cambridge University Press, 2018.\n\n[16] Kuleshov, Volodymyr and Precup, Doina. Algorithms for multi-armed bandit problems. CoRR\n\nabs/1402.6028, 2014.\n\n[17] Kveton, Branislav, Szepesv\u00e1ri, Csaba, Wen, Zheng, and Ashkan, Azin. Cascading bandits:\nLearning to rank in the cascade model. In Proceedings of the 32nd International Conference on\nMachine Learning (ICML), pp. 767\u2013776, 2015.\n\n[18] Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach\nto personalized news article recommendation. In Proceedings of the Nineteenth International\nConference on World Wide Web (WWW), pp. 661\u2013670, 2010.\n\n[19] Lin, Yen-Chen, Hong, Zhang-Wei, Liao, Yuan-Hong, Shih, Meng-Li, Liu, Ming-Yu, and Sun,\nMin. Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the\n26th International Joint Conference on Arti\ufb01cial Intelligence (IJCAI), pp. 3756\u20133762, 2017.\n\n[20] Lykouris, Thodoris, Mirrokni, Vahab, and Paes Leme, Renato. Stochastic bandits robust to\nadversarial corruptions. In Proceedings of the Annual ACM SIGACT Symposium on Theory of\nComputing (STOC), pp. 114\u2013122, 2018.\n\n[21] Ma, Yuzhe, Jun, Kwang-Sung, Li, Lihong, and Zhu, Xiaojin. Data poisoning attacks in\ncontextual bandits. In Conference on Decision and Game Theory for Security (GameSec), 2018.\n[22] Ng, Andrew Y., Harada, Daishi, and Russell, Stuart J. 
Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML), pp. 278–287, 1999.\n\n[23] Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.\n\n[24] Zhu, Xiaojin. An optimal control view of adversarial machine learning, 2018. arXiv:1811.04422.", "award": [], "sourceid": 1837, "authors": [{"given_name": "Kwang-Sung", "family_name": "Jun", "institution": "UW-Madison"}, {"given_name": "Lihong", "family_name": "Li", "institution": "Google Inc."}, {"given_name": "Yuzhe", "family_name": "Ma", "institution": "University of Wisconsin-Madison"}, {"given_name": "Jerry", "family_name": "Zhu", "institution": "University of Wisconsin-Madison"}]}