{"title": "Causal Bandits: Learning Good Interventions via Causal Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1181, "page_last": 1189, "abstract": "We study the problem of using causal models to improve the rate at which good interventions can be learned online in a stochastic environment. Our formalism combines multi-arm bandits and causal inference to model a novel type of bandit feedback that is not exploited by existing approaches. We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.", "full_text": "Causal Bandits: Learning Good Interventions via\n\nCausal Inference\n\nFinnian Lattimore\n\nAustralian National University and Data61/NICTA\n\nfinn.lattimore@gmail.com\n\nTor Lattimore\n\nIndiana University, Bloomington\ntor.lattimore@gmail.com\n\nMark D. Reid\n\nAustralian National University and Data61/NICTA\n\nmark.reid@anu.edu.au\n\nAbstract\n\nWe study the problem of using causal models to improve the rate at which good\ninterventions can be learned online in a stochastic environment. Our formalism\ncombines multi-arm bandits and causal inference to model a novel type of bandit\nfeedback that is not exploited by existing approaches. We propose a new algorithm\nthat exploits the causal feedback and prove a bound on its simple regret that is\nstrictly better (in all quantities) than algorithms that do not use the additional causal\ninformation.\n\n1\n\nIntroduction\n\nMedical drug testing, policy setting, and other scienti\ufb01c processes are commonly framed and analysed\nin the language of sequential experimental design and, in special cases, as bandit problems (Robbins,\n1952; Chernoff, 1959). 
In this framework, single actions (also referred to as interventions) from a\npre-determined set are repeatedly performed in order to evaluate their effectiveness via feedback from\na single, real-valued reward signal. We propose a generalisation of the standard model by assuming\nthat, in addition to the reward signal, the learner observes the values of a number of covariates drawn\nfrom a probabilistic causal model (Pearl, 2000). Causal models are commonly used in disciplines\nwhere explicit experimentation may be dif\ufb01cult such as social science, demography and economics.\nFor example, when predicting the effect of changes to childcare subsidies on workforce participation,\nor school choice on grades. Results from causal inference relate observational distributions to\ninterventional ones, allowing the outcome of an intervention to be predicted without explicitly\nperforming it. By exploiting the causal information we show, theoretically and empirically, how\nnon-interventional observations can be used to improve the rate at which high-reward actions can be\nidenti\ufb01ed.\nThe type of problem we are concerned with is best illustrated with an example. Consider a farmer\nwishing to optimise the yield of her crop. She knows that crop yield is only affected by temperature,\na particular soil nutrient, and moisture level but the precise effect of their combination is unknown.\nIn each season the farmer has enough time and money to intervene and control at most one of these\nvariables: deploying shade or heat lamps will set the temperature to be low or high; the nutrient\ncan be added or removed through a choice of fertilizer; and irrigation or rain-proof covers will keep\nthe soil wet or dry. When not intervened upon, the temperature, soil, and moisture vary naturally\nfrom season to season due to weather conditions and these are all observed along with the \ufb01nal crop\nyield at the end of each season. 
How might the farmer best experiment to identify the single, highest\nyielding intervention in a limited number of seasons?\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nContributions We take the first step towards formalising and solving problems such as the one\nabove. In \u00a72 we formally introduce causal bandit problems in which interventions are treated as\narms in a bandit problem but their influence on the reward, along with any other observations,\nis assumed to conform to a known causal graph. We show that our causal bandit framework\nsubsumes the classical bandits (no additional observations) and contextual stochastic bandit problems\n(observations are revealed before an intervention is chosen) before focusing on the case where, like\nthe above example, observations occur after each intervention is made.\nOur focus is on the simple regret, which measures the difference between the return of the optimal\naction and that of the action chosen by the algorithm after T rounds. In \u00a73 we analyse a specific\nfamily of causal bandit problems that we call parallel bandit problems in which N factors affect the\nreward independently and there are 2N possible interventions. We propose a simple causal best arm\nidentification algorithm for this problem and show that up to logarithmic factors it enjoys minimax\noptimal simple regret guarantees of \u02dc\u0398(\u221a(m/T)) where m depends on the causal model and may\nbe much smaller than N. In contrast, existing best arm identification algorithms suffer \u2126(\u221a(N/T))\nsimple regret (Thm. 4 by Audibert and Bubeck (2010)). This shows theoretically the value of our\nframework over the traditional bandit problem. 
Experiments in \u00a75 further demonstrate the value of\ncausal models in this framework.\nIn the general causal bandit problem interventions and observations may have a complex relationship.\nIn \u00a74 we propose a new algorithm inspired by importance sampling that a) enjoys sub-linear regret\nequivalent to the optimal rate in the parallel bandit setting and b) captures many of the intricacies of\nsharing information in a causal graph in the general case. As in the parallel bandit case, the regret\nguarantee scales like O(\u221a(m/T)) where m depends on the underlying causal structure, with smaller\nvalues corresponding to structures that are easier to learn. The value of m is always less than the\nnumber of interventions N and in the special case of the parallel bandit (where we have lower bounds)\nthe notions are equivalent.\n\nRelated Work As alluded to above, causal bandit problems can be treated as classical multi-armed\nbandit problems by simply ignoring the causal model and extra observations and applying an existing\nbest-arm identification algorithm with well understood simple regret guarantees (Jamieson et al.,\n2014). However, as we show in \u00a73, ignoring the extra information available in the non-intervened\nvariables yields sub-optimal performance.\nA well-studied class of bandit problems with side information is \u201ccontextual bandits\u201d (Langford\nand Zhang, 2008; Agarwal et al., 2014). Our framework bears a superficial similarity to contextual\nbandit problems since the extra observations on non-intervened variables might be viewed as context\nfor selecting an intervention. However, a crucial difference is that in our model the extra observations\nare only revealed after selecting an intervention and hence cannot be used as context.\nThere have been several proposals for bandit problems where extra feedback is received after an\naction is taken. Most recently, Alon et al. (2015) and Koc\u00e1k et al. 
(2014) have considered very general\nmodels related to partial monitoring games (Bart\u00f3k et al., 2014) where rewards on unplayed actions\nare revealed according to a feedback graph. As we discuss in \u00a76, the parallel bandit problem can be\ncaptured in this framework, however the regret bounds are not optimal in our setting. They also focus\non cumulative regret, which cannot be used to guarantee low simple regret (Bubeck et al., 2009). The\npartial monitoring approach taken by Wu et al. (2015) could be applied (up to modi\ufb01cations for the\nsimple regret) to the parallel bandit, but the resulting strategy would need to know the likelihood\nof each factor in advance, while our strategy learns this online. Yu and Mannor (2009) utilize\nextra observations to detect changes in the reward distribution, whereas we assume \ufb01xed reward\ndistributions and use extra observations to improve arm selection. Avner et al. (2012) analyse bandit\nproblems where the choice of arm to pull and arm to receive feedback on are decoupled. The main\ndifference from our present work is our focus on simple regret and the more complex information\nlinking rewards for different arms via causal graphs. To the best of our knowledge, our paper is the\n\ufb01rst to analyse simple regret in bandit problems with extra post-action feedback.\nTwo pieces of recent work also consider applying ideas from causal inference to bandit problems.\nBareinboim et al. (2015) demonstrate that in the presence of confounding variables the value that a\nvariable would have taken had it not been intervened on can provide important contextual information.\nTheir work differs in many ways. For example, the focus is on the cumulative regret and the context\nis observed before the action is taken and cannot be controlled by the learning agent.\n\n2\n\n\fOrtega and Braun (2014) present an analysis and extension of Thompson sampling assuming actions\nare causal interventions. 
Their focus is on causal induction (i.e., learning an unknown causal model)\ninstead of exploiting a known causal model. Combining their handling of causal induction with our\nanalysis is left as future work.\nThe truncated importance weighted estimators used in \u00a74 have been studied before in a causal\nframework by Bottou et al. (2013), where the focus is on learning from observational data, but\nnot controlling the sampling process. They also briefly discuss some of the issues encountered in\nsequential design, but do not give an algorithm or theoretical results for this case.\n\n2 Problem Setup\n\nWe now introduce a novel class of stochastic sequential decision problems which we call causal\nbandit problems. In these problems, rewards are given for repeated interventions on a fixed causal\nmodel (Pearl, 2000). Following the terminology and notation in Koller and Friedman (2009), a causal\nmodel is given by a directed acyclic graph G over a set of random variables X = {X1, . . . , XN}\nand a joint distribution P over X that factorises over G. We will assume each variable only takes\non a finite number of distinct values. An edge from variable Xi to Xj is interpreted to mean that a\nchange in the value of Xi may directly cause a change to the value of Xj. The parents of a variable\nXi, denoted PaXi, are the variables Xj such that there is an edge from Xj to Xi in G. An\nintervention or action (of size n), denoted do(X = x), assigns the values x = {x1, . . . , xn} to the\ncorresponding variables X = {X1, . . . , Xn} \u2282 X with the empty intervention (where no variable is\nset) denoted do(). The intervention also \u201cmutilates\u201d the graph G by removing all edges from PaXi\nto Xi for each Xi \u2208 X. The resulting graph defines a probability distribution P{X^c|do(X = x)}\nover X^c := X \u2212 X, the variables that are not intervened on. 
Details can be found in Chapter 21 of Koller and Friedman (2009).\nA learner for a causal bandit problem is given the causal model\u2019s graph G and a set of allowed actions\nA. One variable Y \u2208 X is designated as the reward variable and takes on values in {0, 1}. We denote\nthe expected reward for the action a = do(X = x) by \u00b5a := E[Y |do(X = x)] and the optimal\nexpected reward by \u00b5\u2217 := max_{a\u2208A} \u00b5a. The causal bandit game proceeds over T rounds. In round t,\nthe learner intervenes by choosing at = do(Xt = xt) \u2208 A based on previous observations. It then\nobserves sampled values for all non-intervened variables X^c_t drawn from P{X^c_t | do(Xt = xt)},\nincluding the reward Yt \u2208 {0, 1}. After T observations the learner outputs an estimate of the optimal\naction \u02c6a\u2217_T \u2208 A based on its prior observations.\nThe objective of the learner is to minimise the simple regret RT = \u00b5\u2217 \u2212 E[\u00b5_{\u02c6a\u2217_T}]. This is sometimes\nreferred to as a \u201cpure exploration\u201d (Bubeck et al., 2009) or \u201cbest-arm identification\u201d problem (Gabillon\net al., 2012) and is most appropriate when, as in drug and policy testing, the learner has a fixed\nexperimental budget after which its policy will be fixed indefinitely.\nAlthough we will focus on the intervene-then-observe ordering of events within each round, other\nscenarios are possible. If the non-intervened variables are observed before an intervention is selected\nour framework reduces to stochastic contextual bandits, which are already reasonably well\nunderstood (Agarwal et al., 2014). 
Even if no observations are made during the rounds, the causal model\nmay still allow offline pruning of the set of allowable interventions thereby reducing the complexity.\nWe note that the classical K-armed stochastic bandit problem can be recovered in our framework by\nconsidering a simple causal model with one edge connecting a single variable X that can take on K\nvalues to a reward variable Y \u2208 {0, 1} where P{Y = 1|X} = r(X) for some arbitrary but unknown,\nreal-valued function r. The set of allowed actions in this case is A = {do(X = k) : k \u2208 {1, . . . , K}}.\nConversely, any causal bandit problem can be reduced to a classical stochastic |A|-armed bandit\nproblem by treating each possible intervention as an independent arm and ignoring all sampled values\nfor the observed variables except for the reward. Intuitively though, one would expect to perform\nbetter by making use of the extra structure and observations.\n\n3 Regret Bounds for Parallel Bandit\n\nIn this section we propose and analyse an algorithm for achieving the optimal regret in a natural\nspecial case of the causal bandit problem which we call the parallel bandit. It is simple enough to\nadmit a thorough analysis but rich enough to model the type of problem discussed in \u00a71, including\nthe farming example. It also suffices to witness the regret gap between algorithms that make use of\ncausal models and those which do not.\n\nFigure 1: Causal Models. (a) Parallel graph. (b) Confounded graph. (c) Chain graph.\n\nThe causal model for this class of problems has N binary variables {X1, . . . , XN} where the\nXi \u2208 {0, 1} are independent causes of a reward variable Y \u2208 {0, 1}, as shown in Figure 1a. 
All\nvariables are observable and the set of allowable actions consists of all size 0 and size 1 interventions: A =\n{do()} \u222a {do(Xi = j) : 1 \u2264 i \u2264 N and j \u2208 {0, 1}}. In the farming example from the introduction,\nX1 might represent temperature (e.g., X1 = 0 for low and X1 = 1 for high). The interventions\ndo(X1 = 0) and do(X1 = 1) indicate the use of shades or heat lamps to keep the temperature low or\nhigh, respectively.\nIn each round the learner either purely observes by selecting do() or sets the value of a single variable.\nThe remaining variables are simultaneously set by independent biased coin flips. The values of all\nvariables are then used to determine the distribution of rewards for that round. Formally, when not\nintervened upon we assume that each Xi \u223c Bernoulli(qi) where q = (q1, . . . , qN) \u2208 [0, 1]^N so that\nqi = P{Xi = 1}. The value of the reward variable is distributed as P{Y = 1|X} = r(X) where\nr : {0, 1}^N \u2192 [0, 1] is an arbitrary, fixed, and unknown function. In the farming example, this choice\nof Y models the success or failure of a season\u2019s crop, which depends stochastically on the various\nenvironment variables.\n\nThe Parallel Bandit Algorithm The algorithm operates as follows. For the first T/2 rounds it\nchooses do() to collect observational data. As the only link from each X1, . . . , XN to Y is a direct,\ncausal one, P{Y |do(Xi = j)} = P{Y |Xi = j}. Thus we can create good estimators for the returns\nof the actions do(Xi = j) for which P{Xi = j} is large. The actions for which P{Xi = j} is small\nmay not be observed (often) so estimates of their returns could be poor. To address this, the remaining\nT/2 rounds are evenly split to estimate the rewards for these infrequently observed actions. The\ndifficulty of the problem depends on q and, in particular, how many of the variables are unbalanced\n(i.e., small qi or (1 \u2212 qi)). 
For \u03c4 \u2208 [2...N] let I\u03c4 = {i : min{qi, 1 \u2212 qi} < 1/\u03c4}. Define\n\nm(q) = min{\u03c4 : |I\u03c4| \u2264 \u03c4}.\n\nI\u03c4 is the set of variables considered unbalanced and we tune \u03c4 to trade off identifying the low\nprobability actions against not having too many of them, so as to minimize the worst-case\nsimple regret. When q = (1/2, ..., 1/2) we have m(q) = 2 and when q = (0, ..., 0) we have\nm(q) = N. We do not assume that q is known, thus Algorithm 1 also utilizes the samples captured\nduring the observational phase to estimate m(q). Although very simple, the following two theorems\nshow that this algorithm is effectively optimal.\n\nAlgorithm 1 Parallel Bandit Algorithm\n1: Input: Total rounds T and N.\n2: for t \u2208 1, ..., T/2 do\n3:   Perform empty intervention do()\n4:   Observe Xt and Yt\n5: for a = do(Xi = x) \u2208 A do\n6:   Count times Xi = x seen: Ta = \u2211_{t=1}^{T/2} 1{Xt,i = x}\n7:   Estimate reward: \u02c6\u00b5a = (1/Ta) \u2211_{t=1}^{T/2} 1{Xt,i = x} Yt\n8:   Estimate probabilities: \u02c6pa = 2Ta/T, \u02c6qi = \u02c6p_{do(Xi=1)}\n9: Compute \u02c6m = m(\u02c6q) and A = {a \u2208 A : \u02c6pa \u2264 1/\u02c6m}.\n10: Let TA := T/(2|A|) be times to sample each a \u2208 A.\n11: for a = do(Xi = x) \u2208 A do\n12:   for t \u2208 1, ..., TA do\n13:     Intervene with a and observe Yt\n14:   Re-estimate \u02c6\u00b5a = (1/TA) \u2211_{t=1}^{TA} Yt\n15: return estimated optimal \u02c6a\u2217_T \u2208 arg max_{a\u2208A} \u02c6\u00b5a\n\nTheorem 1. Algorithm 1 satisfies RT \u2208 O(\u221a((m(q)/T) log(NT/m(q)))).\nTheorem 2. 
For all strategies and T, q, there exist rewards such that RT \u2208 \u2126(\u221a(m(q)/T)).\n\nThe proofs of Theorems 1 and 2 follow by carefully analysing the concentration of \u02c6pa and \u02c6m about\ntheir true values and may be found in the supplementary material. By utilizing knowledge of the\ncausal structure, Algorithm 1 effectively only has to explore the m(q) \u2018difficult\u2019 actions. Standard\nmulti-armed bandit algorithms must explore all 2N actions and thus achieve regret \u2126(\u221a(N/T)). Since\nm is typically much smaller than N, the new algorithm can significantly outperform classical bandit\nalgorithms in this setting. In practice, you would combine the data from both phases to estimate\nrewards for the low probability actions. We do not do so here as it slightly complicates the proofs and\ndoes not improve the worst case regret.\n\n4 Regret Bounds for General Graphs\n\nWe now consider the more general problem where the graph structure is known, but arbitrary. For\ngeneral graphs, P{Y |Xi = j} \u2260 P{Y |do(Xi = j)} (correlation is not causation). However, if all\nthe variables are observable, any causal distribution P{X1...XN|do(Xi = j)} can be expressed in\nterms of observational distributions via the truncated factorization formula (Pearl, 2000),\n\nP{X1...XN|do(Xi = j)} = \u220f_{k\u2260i} P{Xk|PaXk} \u03b4(Xi \u2212 j),\n\nwhere PaXk denotes the parents of Xk and \u03b4 is the Dirac delta function.\nWe could naively generalize our approach for parallel bandits by observing for T/2 rounds, applying\nthe truncated product factorization to write an expression for each P{Y |a} in terms of observational\nquantities and explicitly playing the actions for which the observational estimates were poor. 
However,\nit is no longer optimal to ignore the information we can learn about the reward for intervening on one\nvariable from rounds in which we act on a different variable. Consider the graph in Figure 1c and\nsuppose each variable deterministically takes the value of its parent, Xk = Xk\u22121 for k \u2208 2, . . . , N\nand P{X1 = 1} = 0. We can learn the reward for all the interventions do(Xi = 1) simultaneously by\nselecting do(X1 = 1), but not from do(). In addition, the variance of the observational estimator for\na = do(Xi = j) can be high even if P{Xi = j} is large. Given the causal graph in Figure 1b,\nP{Y |do(X2 = j)} = \u2211_{X1} P{X1} P{Y |X1, X2 = j}. Suppose X2 = X1 deterministically; no\nmatter how large P{X2 = 1} is we will never observe (X2 = 1, X1 = 0) and so cannot get a good\nestimate for P{Y |do(X2 = 1)}.\nTo solve the general problem we need an estimator for each action that incorporates information\nobtained from every other action and a way to optimally allocate samples to actions. To address\nthis difficult problem, we assume the conditional interventional distributions P{PaY |a} (but not\nP{Y |a}) are known. These could be estimated from experimental data on the same covariates but\nwhere the outcome of interest differed, such that Y was not included, or similarly from observational\ndata subject to identifiability constraints. Of course this is a somewhat limiting assumption, but\nseems like a natural place to start. The challenge of estimating the conditional distributions for\nall variables in an optimal way is left as an interesting future direction. Let \u03b7 be a distribution on\navailable interventions a \u2208 A so \u03b7a \u2265 0 and \u2211_{a\u2208A} \u03b7a = 1. 
Define Q = \u2211_{a\u2208A} \u03b7a P{PaY |a} to\nbe the mixture distribution over the interventions with respect to \u03b7.\nOur algorithm samples T actions from \u03b7 and uses them to estimate the returns \u00b5a for all a \u2208 A\nsimultaneously via a truncated importance weighted estimator. Let PaY (X) denote the\nrealization of the variables in X that are parents of Y and define Ra(X) = P{PaY (X)|a} / Q{PaY (X)} and\n\n\u02c6\u00b5a = (1/T) \u2211_{t=1}^{T} Yt Ra(Xt) 1{Ra(Xt) \u2264 Ba},\n\nwhere Ba \u2265 0 is a constant that tunes the level\nof truncation to be chosen subsequently. The truncation introduces a bias in the estimator, but\nsimultaneously chops the potentially heavy tail that is so detrimental to its concentration guarantees.\n\nAlgorithm 2 General Algorithm\nInput: T, \u03b7 \u2208 [0, 1]^A, B \u2208 [0, \u221e)^A\nfor t \u2208 {1, . . . , T} do\n  Sample action at from \u03b7\n  Do action at and observe Xt and Yt\nfor a \u2208 A do\n  \u02c6\u00b5a = (1/T) \u2211_{t=1}^{T} Yt Ra(Xt) 1{Ra(Xt) \u2264 Ba}\nreturn \u02c6a\u2217_T = arg max_a \u02c6\u00b5a\n\nThe distribution over actions, \u03b7, plays the role of allocating samples to actions and is optimized to\nminimize the worst-case simple regret. Abusing notation we define m(\u03b7) by\n\nm(\u03b7) = max_{a\u2208A} Ea[P{PaY (X)|a} / Q{PaY (X)}], where Ea is the expectation with respect to P{.|a}.\n\nWe will show shortly that m(\u03b7) is a measure of the difficulty of the problem that approximately\ncoincides with the version for parallel bandits, justifying the name overloading.\nTheorem 3. If Algorithm 2 is run with B \u2208 R^A given by Ba = \u221a(m(\u03b7)T / log(2T|A|)), then\n\nRT \u2208 O(\u221a((m(\u03b7)/T) log(2T|A|))).\n\nThe proof is in the supplementary materials. Note the regret has the same form as that obtained\nfor Algorithm 1, with m(\u03b7) replacing m(q). 
Algorithm 1 assumes only the graph structure and not\nknowledge of the conditional distributions on X. Thus it has broader applicability to the parallel\ngraph than the generic algorithm given here. We believe that Algorithm 2 with the optimal choice of\n\u03b7 is close to minimax optimal, but leave lower bounds for future work.\n\nChoosing the Sampling Distribution Algorithm 2 depends on a choice of sampling distribution\nQ that is determined by \u03b7. In light of Theorem 3 a natural choice of \u03b7 is the minimiser of m(\u03b7),\n\n\u03b7\u2217 = arg min_\u03b7 m(\u03b7) = arg min_\u03b7 max_{a\u2208A} Ea[P{PaY (X)|a} / \u2211_{b\u2208A} \u03b7b P{PaY (X)|b}].\n\nSince the mixture of convex functions is convex and the maximum of a set of convex functions is\nconvex, we see that m(\u03b7) is convex (in \u03b7). Therefore the minimisation problem may be tackled\nusing standard techniques from convex optimisation. The quantity m(\u03b7\u2217) may be interpreted as the\nminimum achievable worst-case variance of the importance weighted estimator. In the experimental\nsection we present some special cases, but for now we give two simple results. The first shows that\n|A| serves as an upper bound on m(\u03b7\u2217).\nProposition 4. m(\u03b7\u2217) \u2264 |A|. Proof. By definition, m(\u03b7\u2217) \u2264 m(\u03b7) for all \u03b7. Let \u03b7a = 1/|A| \u2200a. Then\n\nm(\u03b7) = max_a Ea[P{PaY (X)|a} / Q{PaY (X)}] \u2264 max_a Ea[P{PaY (X)|a} / (\u03b7a P{PaY (X)|a})] = max_a Ea[1/\u03b7a] = |A|.\n\nThe second observation is that, in the parallel bandit setting, m(\u03b7\u2217) \u2264 2m(q). 
This is easy to see\nby letting \u03b7a = 1/2 for a = do() and \u03b7a = 1{P{Xi = j} \u2264 1/m(q)} / (2m(q)) for the actions\ncorresponding to do(Xi = j), and applying an argument like that for Proposition 4. The proof is in\nthe supplementary materials.\nRemark 5. The choice of Ba given in Theorem 3 is not the only possibility. As we shall see in the\nexperiments, it is often possible to choose Ba significantly larger when there is no heavy tail and\nthis can drastically improve performance by eliminating the bias. This is especially true when the\nratio Ra is never too large and Bernstein\u2019s inequality could be used directly without the truncation.\nFor another discussion see the article by Bottou et al. (2013) who also use importance weighted\nestimators to learn from observational data.\n\n5 Experiments\n\nWe compare Algorithms 1 and 2 with the Successive Reject algorithm of Audibert and Bubeck (2010),\nThompson Sampling and UCB under a variety of conditions. Thompson Sampling and UCB are\noptimized to minimize cumulative regret. We apply them in the fixed horizon, best arm identification\nsetting by running them up to horizon T and then selecting the arm with the highest empirical mean.\nThe importance weighted estimator used by Algorithm 2 is not truncated, which is justified in this\nsetting by Remark 5.\n\nFigure 2: Experimental results. (a) Simple regret vs m(q) for fixed horizon T = 400 and number of variables N = 50. (b) Simple regret vs horizon, T, with N = 50, m = 2 and \u03b5 = \u221a(N/(8T)). (c) Simple regret vs horizon, T, with N = 50, m = 2 and fixed \u03b5 = .3. Curves: Algorithm 2, Algorithm 1, Successive Reject, UCB, Thompson Sampling.\n\nThroughout we use a model in which Y depends only on a single variable X1 (this is unknown to\nthe algorithms). 
Yt \u223c Bernoulli(1/2 + \u03b5) if X1 = 1 and Yt \u223c Bernoulli(1/2 \u2212 \u03b5\u2032) otherwise, where\n\u03b5\u2032 = q1\u03b5/(1 \u2212 q1). This leads to an expected reward of 1/2 + \u03b5 for do(X1 = 1), 1/2 \u2212 \u03b5\u2032 for do(X1 = 0)\nand 1/2 for all other actions. We set qi = 0 for i \u2264 m and 1/2 otherwise. Note that changing m and\nthus q has no effect on the reward distribution. For each experiment, we show the average regret\nover 10,000 simulations with error bars displaying three standard errors. The code is available from\n<https://github.com/finnhacks42/causal_bandits>\nIn Figure 2a we fix the number of variables N and the horizon T and compare the performance\nof the algorithms as m increases. The regret for the Successive Reject algorithm is constant as it\ndepends only on the reward distribution and has no knowledge of the causal structure. For the causal\nalgorithms it increases approximately with \u221am. As m approaches N, the gain the causal algorithms\nobtain from knowledge of the structure is outweighed by the fact that they do not leverage the observed\nrewards to focus sampling effort on actions with high pay-offs.\nFigure 2b demonstrates the performance of the algorithms in the worst case environment for standard\nbandits, where the gap between the optimal and sub-optimal arms, \u03b5 = \u221a(N/(8T)), is just too small\nto be learned. This gap is learnable by the causal algorithms, for which the worst case \u03b5 depends on\nm \u226a N. In Figure 2c we fix N and \u03b5 and observe that, for sufficiently large T, the regret decays\nexponentially. The decay constant is larger for the causal algorithms as they have observed a greater\neffective number of samples for a given T.\nFor the parallel bandit problem, the regression estimator used in the specific algorithm outperforms the\ntruncated importance weighted estimator in the more general algorithm, despite the fact the specific\nalgorithm must estimate q from the data. 
This is an interesting phenomenon that has been noted\nbefore in off-policy evaluation where the regression (and not the importance weighted) estimator is\nknown to be minimax optimal asymptotically (Li et al., 2014).\n\n6 Discussion & Future Work\n\nAlgorithm 2 for general causal bandit problems estimates the reward for all allowable interventions\na \u2208 A over T rounds by sampling and applying interventions from a distribution \u03b7. Theorem 3 shows\nthat this algorithm has (up to log factors) simple regret that is O(\u221a(m(\u03b7)/T)) where the parameter\nm(\u03b7) measures the difficulty of learning the causal model and is always less than N. The value of\nm(\u03b7) is a uniform bound on the variance of the reward estimators \u02c6\u00b5a and, intuitively, problems where\nall variables\u2019 values in the causal model \u201coccur naturally\u201d when interventions are sampled from \u03b7\nwill have low values of m(\u03b7).\nThe main practical drawback of Algorithm 2 is that both the estimator \u02c6\u00b5a and the optimal sampling\ndistribution \u03b7\u2217 (i.e., the one that minimises m(\u03b7)) require knowledge of the conditional distributions\nP{PaY |a} for all a \u2208 A. In contrast, in the special case of parallel bandits, Algorithm 1 uses\nthe do() action to effectively estimate m(\u03b7) and the rewards, then re-samples the interventions with\nvariances that are not bounded by \u02c6m(\u03b7). Despite these extra estimates, Theorem 2 shows that this\napproach is optimal (up to log factors). 
Finding an algorithm that only requires the causal graph and\nlower bounds for its simple regret in the general case is left as future work.\n\nMaking Better Use of the Reward Signal Existing algorithms for best arm identification are\nbased on \u201csuccessive rejection\u201d (SR) of arms based on UCB-like bounds on their rewards (Even-Dar\net al., 2002). In contrast, our algorithms completely ignore the reward signal when developing their\narm sampling policies and only use the rewards when estimating \u02c6\u00b5a. Incorporating the reward signal\ninto our sampling techniques or designing more adaptive reward estimators that focus on high reward\ninterventions is an obvious next step. This would likely improve the poor performance of our causal\nalgorithm relative to the Successive Reject algorithm for large m, as seen in Figure 2a. For the parallel\nbandit the required modifications should be quite straightforward. The idea would be to adapt the\nalgorithm to essentially use successive elimination in the second phase so arms are eliminated as\nsoon as they are provably no longer optimal with high probability. In the general case a similar\nmodification is also possible by dividing the budget T into phases and optimising the sampling\ndistribution \u03b7, eliminating arms when their confidence intervals are no longer overlapping. Note\nthat these modifications will not improve the minimax regret, which at least for the parallel bandit is\nalready optimal. For this reason we prefer to emphasize the main point that causal structure should\nbe exploited when available. Another observation is that Algorithm 2 is actually using a fixed design,\nwhich in some cases may be preferred to a sequential design for logistical reasons. This is not possible\nfor Algorithm 1, since the q vector is unknown.\n\nCumulative Regret Although we have focused on simple regret in our analysis, it would also be\nnatural to consider the cumulative regret. 
In the case of the parallel bandit problem we can slightly\nmodify the analysis of Wu et al. (2015) on bandits with side information to get near-optimal\ncumulative regret guarantees. They consider a \ufb01nite-armed bandit model with side information where\nin each round the learner chooses an action and receives a Gaussian reward signal for all actions, but\nwith a known variance that depends on the chosen action. In this way the learner can gain information\nabout actions it does not take with varying levels of accuracy. The reduction follows by substituting\nthe importance-weighted estimators in place of the Gaussian reward. In the case that q is known\nthis would lead to a known variance and the only (insigni\ufb01cant) difference is the Bernoulli noise\nmodel. In the parallel bandit case we believe this would lead to near-optimal cumulative regret, at\nleast asymptotically.\nThe parallel bandit problem can also be viewed as an instance of a time-varying graph feedback\nproblem (Alon et al., 2015; Koc\u00e1k et al., 2014), where at each timestep the feedback graph Gt is\nselected stochastically, dependent on q, and revealed after an action has been chosen. The feedback\ngraph is distinct from the causal graph. A link A \u2192 B in Gt indicates that selecting the action A\nreveals the reward for action B. For this parallel bandit problem, Gt will always be a star graph\nwith the action do() connected to half the remaining actions. However, Alon et al. (2015) and Koc\u00e1k\net al. (2014) give adversarial algorithms, which when applied to the parallel bandit problem obtain\nthe standard bandit regret. A malicious adversary can select the same graph each time, such that\nthe rewards for half the arms are never revealed by the informative action. 
This is equivalent to a\nnominally stochastic selection of feedback graph where q = 0.\nCausal Models with Non-Observable Variables If we assume knowledge of the conditional\ninterventional distributions P{PaY |a} our analysis applies unchanged to the case of causal models with\nnon-observable variables. Some of the interventional distributions may be non-identi\ufb01able, meaning\nwe cannot obtain prior estimates for P{PaY |a} from even an in\ufb01nite amount of observational\ndata. Even if all variables are observable and the graph is known, if the conditional distributions\nare unknown, then Algorithm 2 cannot be used. Estimating these quantities while simultaneously\nminimising the simple regret is an interesting and challenging open problem.\nPartially or Completely Unknown Causal Graph A much more dif\ufb01cult generalisation would\nbe to consider causal bandit problems where the causal graph is completely unknown or known to\nbe a member of a class of models. The latter case arises naturally if we assume free access to a large\nobservational dataset, from which the Markov equivalence class can be found via causal discovery\ntechniques. Work on the problem of selecting experiments to discover the correct causal graph from\nwithin a Markov equivalence class (Eberhardt et al., 2005; Eberhardt, 2010; Hauser and B\u00fchlmann,\n2014; Hu et al., 2014) could potentially be incorporated into a causal bandit algorithm. In particular,\nHu et al. (2014) show that only O(log log n) multi-variable interventions are required on average to\nrecover a causal graph over n variables once purely observational data is used to recover the \u201cessential\ngraph\u201d. Simultaneously learning a completely unknown causal model while estimating the rewards\nof interventions without a large observational dataset would be much more challenging.\n\nReferences\nAgarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. E. (2014). 
Taming the monster: A fast and\n\nsimple algorithm for contextual bandits. In ICML, pages 1638\u20131646.\n\nAlon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with feedback graphs: Beyond\n\nbandits. In COLT, pages 23\u201335.\n\nAudibert, J.-Y. and Bubeck, S. (2010). Best arm identi\ufb01cation in multi-armed bandits. In COLT, pages 13\u2013p.\n\nAvner, O., Mannor, S., and Shamir, O. (2012). Decoupling exploration and exploitation in multi-armed bandits.\n\nIn ICML, pages 409\u2013416.\n\nBareinboim, E., Forney, A., and Pearl, J. (2015). Bandits with unobserved confounders: A causal approach. In\n\nNIPS, pages 1342\u20131350.\n\nBart\u00f3k, G., Foster, D. P., P\u00e1l, D., Rakhlin, A., and Szepesv\u00e1ri, C. (2014). Partial monitoring-classi\ufb01cation, regret\n\nbounds, and algorithms. Mathematics of Operations Research, 39(4):967\u2013997.\n\nBottou, L., Peters, J., Quinonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard,\nP., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational\nadvertising. JMLR, 14(1):3207\u20133260.\n\nBubeck, S., Munos, R., and Stoltz, G. (2009). Pure exploration in multi-armed bandits problems. In ALT, pages\n\n23\u201337.\n\nChernoff, H. (1959). Sequential design of experiments. The Annals of Mathematical Statistics, pages 755\u2013770.\n\nEberhardt, F. (2010). Causal Discovery as a Game. In NIPS Causality: Objectives and Assessment, pages 87\u201396.\n\nEberhardt, F., Glymour, C., and Scheines, R. (2005). On the number of experiments suf\ufb01cient and in the worst\n\ncase necessary to identify all causal relations among n variables. In UAI.\n\nEven-Dar, E., Mannor, S., and Mansour, Y. (2002). Pac bounds for multi-armed bandit and markov decision\n\nprocesses. In Computational Learning Theory, pages 255\u2013270.\n\nGabillon, V., Ghavamzadeh, M., and Lazaric, A. (2012). 
Best arm identi\ufb01cation: A uni\ufb01ed approach to \ufb01xed\n\nbudget and \ufb01xed con\ufb01dence. In NIPS, pages 3212\u20133220.\n\nHauser, A. and B\u00fchlmann, P. (2014). Two optimal strategies for active learning of causal models from\n\ninterventional data. International Journal of Approximate Reasoning, 55(4):926\u2013939.\n\nHu, H., Li, Z., and Vetta, A. R. (2014). Randomized experimental design for causal graph discovery. In NIPS,\n\npages 2339\u20132347.\n\nJamieson, K., Malloy, M., Nowak, R., and Bubeck, S. (2014). lil\u2019UCB: An optimal exploration algorithm for\n\nmulti-armed bandits. In COLT, pages 423\u2013439.\n\nKoc\u00e1k, T., Neu, G., Valko, M., and Munos, R. (2014). Ef\ufb01cient learning by implicit exploration in bandit\n\nproblems with side observations. In NIPS, pages 613\u2013621.\n\nKoller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.\n\nLangford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information.\n\nIn NIPS, pages 817\u2013824.\n\nLi, L., Munos, R., and Szepesvari, C. (2014). On minimax optimal of\ufb02ine policy evaluation. arXiv preprint\n\narXiv:1409.3653.\n\nOrtega, P. A. and Braun, D. A. (2014). Generalized thompson sampling for sequential decision-making and\n\ncausal inference. Complex Adaptive Systems Modeling, 2(1):2.\n\nPearl, J. (2000). Causality: models, reasoning and inference. MIT Press, Cambridge.\n\nRobbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical\n\nSociety, 58(5):527\u2013536.\n\nWu, Y., Gy\u00f6rgy, A., and Szepesv\u00e1ri, C. (2015). Online Learning with Gaussian Payoffs and Side Observations.\n\nIn NIPS, pages 1360\u20131368.\n\nYu, J. Y. and Mannor, S. (2009). Piecewise-stationary bandit problems with side observations. 
In ICML, pages\n\n1177\u20131184.", "award": [], "sourceid": 657, "authors": [{"given_name": "Finnian", "family_name": "Lattimore", "institution": "Australian National University"}, {"given_name": "Tor", "family_name": "Lattimore", "institution": "Indiana University"}, {"given_name": "Mark", "family_name": "Reid", "institution": "Apple"}]}