{"title": "Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes", "book": "Advances in Neural Information Processing Systems", "page_first": 13401, "page_last": 13411, "abstract": "A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, that dictates how to determine the treatment assignment to patients based on evolving treatments and covariates' history. These regimes are particularly effective for managing chronic disorders and is arguably one of the key aspects towards more personalized decision-making. In this paper, we investigate the online reinforcement learning (RL) problem for selecting optimal DTRs provided that observational data is available. We develop the first adaptive algorithm that achieves near-optimal regret in DTRs in online settings, without any access to historical data. We further derive informative bounds on the system dynamics of the underlying DTR from confounded, observational data. Finally, we combine these results and develop a novel RL algorithm that efficiently learns the optimal DTR while leveraging the abundant, yet imperfect confounded observations.", "full_text": "Near-Optimal Reinforcement Learning\n\nin Dynamic Treatment Regimes\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nElias Bareinboim\n\nColumbia University\nNew York, NY 10027\neb@cs.columbia.edu\n\nJunzhe Zhang\n\nColumbia University\nNew York, NY 10027\n\njunzhez@cs.columbia.edu\n\nAbstract\n\nA dynamic treatment regime (DTR) consists of a sequence of decision rules, one\nper stage of intervention, that dictates how to determine the treatment assignment\nto patients based on evolving treatments and covariates\u2019 history. These regimes are\nparticularly effective for managing chronic disorders and is arguably one of the key\naspects towards more personalized decision-making. 
In this paper, we investigate the online reinforcement learning (RL) problem of selecting optimal DTRs when observational data is available. We develop the first adaptive algorithm that achieves near-optimal regret in DTRs in online settings, without any access to historical data. We further derive informative bounds on the system dynamics of the underlying DTR from confounded, observational data. Finally, we combine these results and develop a novel RL algorithm that efficiently learns the optimal DTR while leveraging the abundant, yet imperfect, confounded observations.

1 Introduction

In medical practice, a patient typically has to be treated at multiple stages; the physician repeatedly adapts each treatment to the patient's time-varying, dynamic state (e.g., level of virus, results of diagnostic tests). Dynamic treatment regimes (DTRs) [18] provide an attractive framework for personalized treatment in longitudinal settings. Operationally, a DTR consists of decision rules that dictate what treatment to provide at each stage, given the patient's evolving conditions and history. These decision rules are alternatively known as adaptive treatment strategies [12, 13, 19, 33, 34] or treatment policies [16, 37, 38]. DTRs offer an effective vehicle for the personalized management of chronic conditions, including cancer, diabetes, and mental illnesses [36].

Consider the DTR instance regarding the treatment of alcohol dependence [19, 6], which is graphically represented in Fig. 1a. Based on the condition of alcohol-dependent patients (S1), the physician may prescribe a medication or behavioral therapy (X1). Patients are classified as responders or non-responders (S2) based on their level of heavy drinking within the next two months. The physician then must decide whether to continue the initial treatment or switch to an augmented plan combining both medication and behavioral therapy (X2). 
The unobserved covariate U summarizes all the unknown factors about the patient. We are interested in the primary outcome Y, the percentage of abstinent days over a 12-month period. The treatment policy \pi in this set-up is a sequence of decision rules x_1 \gets \pi_1(s_1), x_2 \gets \pi_2(s_1, s_2, x_1) selecting the values of X_1, X_2 based on the history.

Policy learning in a DTR setting is concerned with finding an optimal policy \pi that maximizes the primary outcome Y. The main challenge is that, since the parameters of the DTR are often unknown, it is not immediate how to directly compute the consequences of executing the policy do(\pi), i.e., the expected value E_\pi[Y]. Most of the current work in the causal inference literature focuses on trying to identify this quantity, E_\pi[Y], from finite observational data and causal assumptions about the data-generating mechanisms (commonly through causal graphs and potential outcomes).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Causal diagrams over nodes S_1, S_2, Y, X_1, X_2, U of (a) a DTR with K = 2 stages of intervention, inducing P(\bar{x}_2, \bar{s}_2, y); and (b) the DTR in (a) under sequential interventions do(X_1 \sim \pi_1(X_1|S_1), X_2 \sim \pi_2(X_2|S_1, S_2, X_1)), inducing P_{\bar{\pi}_2}(\bar{x}_2, \bar{s}_2, y).

Several criteria and algorithms have been developed [23, 28, 4]. For instance, a criterion called the sequential backdoor [24] permits one to determine whether causal effects can be obtained by covariate adjustment. This condition is also referred to as conditional ignorability or unconfoundedness [27, 18]: there exist no unobserved confounders (UCs) that simultaneously affect the treatment at any stage and all the subsequent outcomes, given a set of observed covariates. 
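Under ignorability, the effect E_\pi[Y] of a candidate policy reduces to a sum over observational conditionals (the sequential g-formula). A minimal sketch for a hypothetical two-stage binary DTR; all distributions below are invented placeholders, not estimates from any real study:

```python
from itertools import product

# Placeholder observational quantities for a hypothetical 2-stage binary DTR.
p_s1 = {0: 0.4, 1: 0.6}                                                # P(S1)
p_s2 = {(s1, x1): {0: 0.5, 1: 0.5} for s1 in (0, 1) for x1 in (0, 1)}  # P(S2 | s1, x1)
e_y = {k: 0.3 + 0.1 * sum(k) / 4 for k in product((0, 1), repeat=4)}   # E[Y | s1, x1, s2, x2]

def evaluate(pi1, pi2):
    """g-formula: E_pi[Y] = sum_{s1,s2} P(s1) P(s2|s1,pi1(s1)) E[Y|s1,pi1,s2,pi2]."""
    value = 0.0
    for s1 in (0, 1):
        x1 = pi1(s1)
        for s2 in (0, 1):
            x2 = pi2(s1, x1, s2)
            value += p_s1[s1] * p_s2[(s1, x1)][s2] * e_y[(s1, x1, s2, x2)]
    return value

v = evaluate(lambda s1: s1, lambda s1, x1, s2: 1 - s2)  # -> 0.355
```

Note this adjustment is only licensed when the sequential backdoor holds; the confounded case discussed next breaks it.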
Whenever ignorability holds, a number of efficient estimation procedures exist, including popular methods based on the propensity score [26], inverse probability of treatment weighting [21, 25], and Q-learning [31, 20].

In general, the combination of observational data and causal assumptions does not always lead to point-identification [23, Ch. 3-4]. An alternative is to randomize patients' treatments at each stage based on the previous decisions and observed outcomes; for instance, one popular strategy is known as the sequential multiple assignment randomized trial (SMART) [19]. By virtue of randomization, the sequential backdoor condition is entailed. However, in practice, performing a randomized experiment in the actual environment can be extremely costly and undesirable (due to unintended consequences), especially in domains where humans are the main research subjects (e.g., medicine, epidemiology, and psychology). Reinforcement learning (RL) [31] provides a unique opportunity to efficiently learn DTRs owing to its balancing of exploration and exploitation. A typical RL agent learns by conducting adaptive, sequential experimentation: it repeatedly adjusts the policy that is currently deployed based on past outcomes. The goal is to learn an optimal policy while minimizing the experimental cost. Efficient RL algorithms have been successfully developed for very general settings such as Markov decision processes (MDPs) [30, 11, 32], where a finite state is statistically sufficient to summarize the treatments and covariates' history. Variations of this setting include multi-armed bandits [1], partially observable MDPs [10, 2], and factored MDPs [22].

Our focus here is on learning a policy for an unknown DTR while leveraging the observational data. This is a challenging setting for both causal inference and RL. As an example, consider data collected from an unknown behavior policy of the DTR in Fig. 
1a (i.e., x_1 \gets f_1(s_1, u), x_2 \gets f_2(s_1, s_2, x_1, u), where both U and {f_1, f_2} are unobserved), which is materialized in the form of the observational distribution P(x_1, x_2, s_1, s_2, y) [23, pp. 205]. The existence of the unmeasured confounder U leads to an immediate violation of the sequential backdoor criterion (e.g., due to the spurious path X_1 \gets U \to Y), which implies that the effect of the policy E_\pi[Y] is not identifiable [23, Ch. 4.4]. On the other hand, existing RL algorithms are not applicable either, which can be seen by noting that DTRs are inherently non-Markovian; in words, the initial treatment X_1 directly affects the outcome Y. Even though a heuristic approach may be pursued (e.g., Thompson Sampling [35]), and could eventually converge, it is still not optimal since it is oblivious to all the observational data.1 Indeed, it is acknowledged in the literature [7, 8] that the "development of statistically sound estimation and inference techniques" for online RL settings "seem to be another very important research direction", especially when the increasing use of mobile devices allows for the possibility of continuous monitoring and just-in-time intervention.

The goal of this paper is to overcome these challenges. We will introduce novel RL strategies capable of optimizing an unknown DTR while efficiently leveraging the imperfect, but large amounts of, observational data. In particular, our contributions are as follows: (1) We introduce the first algorithm (UC-DTR (Alg. 1)) that reaches a near-optimal regret bound in the pure DTR setting, without

1 Standard off-policy RL methods such as Q-learning rely on the sequential backdoor condition, and are thus not applicable to confounded observational data. For a more elaborate discussion, see [7, Ch. 
3.5].

observational data; (2) We derive novel bounds capable of exploiting observational data based on the DTR structure (Thms. 5 and 6), which are provably tight; (3) We develop a novel algorithm (UCc-DTR (Alg. 2)) that efficiently incorporates these bounds and accelerates learning in the online setting. Our results are validated on randomly generated DTRs and multi-stage clinical trials on cancer treatment.

1.1 Preliminaries

In this section, we introduce the basic notation and definitions used throughout the paper. We use capital letters to denote variables (X) and small letters for their values (x). Let \mathcal{X} represent the domain of X and |\mathcal{X}| its dimension. We consistently use the abbreviation P(x) to represent the probability P(X = x). \bar{X}_k stands for a sequence {X_1, ..., X_k} (\emptyset if k < 1), and \bar{\mathcal{X}}_k represents its domain, i.e., \mathcal{X}_1 \times \cdots \times \mathcal{X}_k. Further, we denote by I{\cdot} the indicator function.

The basic semantical framework of our analysis rests on structural causal models (SCMs) [23, Ch. 7]. An SCM M is a tuple \langle U, V, F, P(u) \rangle, where U is a set of exogenous (unobserved) variables and V is a set of endogenous (observed) variables. F is a set of structural functions, where f_i \in F decides the values of V_i \in V taking as argument a combination of other endogenous and exogenous variables (i.e., V_i \gets f_i(PA_i, U_i), PA_i \subseteq V, U_i \subseteq U). The values of U are drawn from the distribution P(u), and induce an observational distribution P(v) [23, pp. 205]. Each SCM is associated with a causal diagram in the form of a directed acyclic graph G, where nodes represent endogenous variables, dashed nodes exogenous variables, and arrows stand for functional relations (e.g., see Fig. 1).

An intervention on a set of endogenous variables X, denoted by do(x), is an operation where the values of X are set to constants x, regardless of how they were ordinarily determined (through the functions {f_X : \forall X \in X}). For an SCM M, let M_x be the sub-model of M induced by the intervention do(x). The interventional distribution P_x(y) induced by do(x) is the distribution over variables Y in the sub-model M_x. For a more detailed discussion of SCMs, we refer readers to [23, Ch. 7].

2 Optimizing Dynamic Treatment Regimes

In this section, we formalize the problem of online optimization in DTRs with confounded observations and provide an efficient solution. We start by defining DTRs in the structural semantics.

Definition 1 (Dynamic Treatment Regime [18]). A dynamic treatment regime (DTR) is an SCM \langle U, V, F, P(u) \rangle where the endogenous variables are V = {\bar{X}_K, \bar{S}_K, Y}; K \in N^+ is the total number of stages of intervention. For stage k = 1, ..., K: (1) X_k is a finite decision decided by a behavior policy x_k \gets f_k(\bar{s}_k, \bar{x}_{k-1}, u); (2) S_k is a finite state decided by a transition function s_k \gets \tau_k(\bar{x}_{k-1}, \bar{s}_{k-1}, u). Y is the primary outcome at the final stage K, decided by a reward function y \gets r(\bar{x}_K, \bar{s}_K, u) bounded in [0, 1]. Values of the exogenous variables U are drawn from the distribution P(u).

A DTR M^* induces an observational distribution P(\bar{x}_K, \bar{s}_K, y). Fig. 1a shows the causal diagram of a DTR with K = 2 stages of intervention. A policy \pi for a DTR is a sequence of decision rules \bar{\pi}_K, where each \pi_k(x_k|\bar{s}_k, \bar{x}_{k-1}) is a function mapping the domain of histories \bar{S}_k, \bar{X}_{k-1} up to stage k to a distribution over the decision X_k. 
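The generative process of Def. 1 can be sketched directly in code. The following is an illustrative 2-stage DTR; the concrete structural functions f_k, tau_k, r and the role of the exogenous noise U are invented for demonstration and are not the paper's model:

```python
import random

def sample_episode(rng, policy=None):
    """One episode of a toy 2-stage DTR (Def. 1); all functions are illustrative."""
    u = rng.random()                                  # exogenous U ~ P(u)
    s1 = int(u > 0.5)                                 # s1 <- tau_1(u)
    x1 = policy[0](s1) if policy else int(u > 0.3)    # behavior policy: x1 <- f_1(s1, u)
    s2 = int(u + 0.25 * (s1 + x1) > 0.6)              # s2 <- tau_2(x1, s1, u)
    x2 = policy[1](s1, s2, x1) if policy else int(u > 0.7)
    y = 0.2 + 0.3 * x2 + 0.4 * u * x1                 # y <- r(x1, x2, s1, s2, u), in [0, 1]
    return (s1, x1, s2, x2, y)
```

Calling `sample_episode` without a policy draws from the (confounded) observational distribution, since X_1, X_2 then depend on U; passing a pair of decision rules emulates the intervention do(\pi), which cuts that dependence.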
A policy is called deterministic if the above mappings are from the histories \bar{S}_k, \bar{X}_{k-1} to the domain of the decision X_k, i.e., x_k \gets \pi_k(\bar{s}_k, \bar{x}_{k-1}). The collection of possible policies, depending on the domains of the history and decision, defines a policy space \Pi. A policy \pi defines a sequence of stochastic interventions do(X_1 \sim \pi_1(X_1|\bar{S}_1), ..., X_K \sim \pi_K(X_K|\bar{S}_K, \bar{X}_{K-1})), which induce an interventional distribution over the variables \bar{X}_K, \bar{S}_K, Y, i.e.:

P_\pi(\bar{x}_K, \bar{s}_K, y) = P_{\bar{x}_K}(y|\bar{s}_K) \prod_{k=0}^{K-1} P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) \pi_{k+1}(x_{k+1}|\bar{s}_{k+1}, \bar{x}_k),   (1)

where P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) is the transition distribution at stage k and P_{\bar{x}_K}(y|\bar{s}_K) is the reward distribution over the primary outcome. Fig. 1b describes a DTR under K = 2 stages of interventions do(X_1 \sim \pi_1(X_1|S_1), X_2 \sim \pi_2(X_2|S_1, S_2, X_1)). The expected cumulative reward of a policy \pi in a DTR M^* is given by V_\pi(M^*) = E_\pi[Y]. We are searching for an optimal policy \pi^* that maximizes the cumulative reward, i.e., \pi^* = \arg\max_{\pi \in \Pi} V_\pi(M^*). It is a well-known fact in decision theory that no stochastic policy can improve on the utility of the best deterministic policy (see, e.g., [15, Lem. 2.1]). Thus, in what follows, we will usually consider the policy space \Pi to be deterministic.

Algorithm 1: UC-DTR
Input: failure tolerance \delta \in (0, 1).
1: for all episodes t = 1, 2, ... do
2:   Define event counts prior to episode t, for horizon k = 1, ..., K, as N^t(\bar{s}_k, \bar{x}_k) = \sum_{i=1}^{t-1} I{\bar{S}^i_k = \bar{s}_k, \bar{X}^i_k = \bar{x}_k} and N^t(\bar{s}_k, \bar{x}_{k-1}) = \sum_{i=1}^{t-1} I{\bar{S}^i_k = \bar{s}_k, \bar{X}^i_{k-1} = \bar{x}_{k-1}}. Further, define reward counts prior to episode t as R^t(\bar{s}_K, \bar{x}_K) = \sum_{i=1}^{t-1} Y^i I{\bar{S}^i_K = \bar{s}_K, \bar{X}^i_K = \bar{x}_K}.
3:   Compute estimates \hat{P}^t_{\bar{x}_k}(s_{k+1}|\bar{s}_k) and \hat{E}^t_{\bar{x}_K}[Y|\bar{s}_K] as
       \hat{P}^t_{\bar{x}_k}(s_{k+1}|\bar{s}_k) = N^t(\bar{s}_{k+1}, \bar{x}_k) / \max{1, N^t(\bar{s}_k, \bar{x}_k)},
       \hat{E}^t_{\bar{x}_K}[Y|\bar{s}_K] = R^t(\bar{s}_K, \bar{x}_K) / \max{1, N^t(\bar{s}_K, \bar{x}_K)}.
4:   Let \mathcal{M}_t denote the set of DTRs such that for any M \in \mathcal{M}_t, its transition probabilities P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) and reward E_{\bar{x}_K}[Y|\bar{s}_K] are close to the estimates \hat{P}^t_{\bar{x}_k}(s_{k+1}|\bar{s}_k), \hat{E}^t_{\bar{x}_K}[Y|\bar{s}_K], i.e.,
       ||P_{\bar{x}_k}(\cdot|\bar{s}_k) - \hat{P}^t_{\bar{x}_k}(\cdot|\bar{s}_k)||_1 \leq \sqrt{6|\mathcal{S}_{k+1}| \log(2K|\bar{\mathcal{S}}_k||\bar{\mathcal{X}}_k|t/\delta) / \max{1, N^t(\bar{s}_k, \bar{x}_k)}},   (2)
       |E_{\bar{x}_K}[Y|\bar{s}_K] - \hat{E}^t_{\bar{x}_K}[Y|\bar{s}_K]| \leq \sqrt{2 \log(2K|S||X|t/\delta) / \max{1, N^t(\bar{s}_K, \bar{x}_K)}}.   (3)
5:   Find the optimal policy \pi_t of an optimistic DTR M_t \in \mathcal{M}_t such that
       V_{\pi_t}(M_t) = \max_{\pi \in \Pi, M \in \mathcal{M}_t} V_\pi(M).   (4)
6:   Execute policy \pi_t for episode t and observe the samples \bar{S}^t_K, \bar{X}^t_K, Y^t.
7: end for

Our goal is to optimize an unknown DTR M^* based solely on the domains S = \bar{\mathcal{S}}_K, X = \bar{\mathcal{X}}_K and the observational distribution P(\bar{x}_K, \bar{s}_K, y) (i.e., both F and P(u) are unknown). The agent (e.g., a physician) learns through repeated experiments over episodes t = 1, ..., T. 
Each episode t contains a complete DTR process: at stage k, the agent observes the state S^t_k, performs an intervention do(X^t_k), and moves to the state S^t_{k+1}; the primary outcome Y^t is received at the final stage K. The cumulative regret up to episode T is defined as R(T) = \sum_{t=1}^{T} (V_{\pi^*}(M^*) - Y^t), i.e., the loss due to the fact that the agent does not always pick the optimal policy \pi^*. We will assess and compare algorithms in terms of their regret R(T). A desirable asymptotic property is to have \lim_{T \to \infty} E[R(T)]/T = 0, meaning that the agent eventually converges and finds the optimal policy \pi^*.

2.1 The UC-DTR Algorithm

We now introduce a new RL algorithm for optimizing an unknown DTR, which we call UC-DTR. We will later prove that UC-DTR achieves a near-optimal bound on the total regret given only knowledge of the domains S and X. Like many other online RL algorithms [1, 11, 22], UC-DTR follows the principle of optimism under uncertainty to balance exploration and exploitation. The algorithm generally works in phases of model learning, optimistic planning, and strategy execution.

The details of the UC-DTR procedure can be found in Alg. 1. The algorithm proceeds in episodes and computes a new strategy \pi_t from the samples {\bar{S}^i_K, \bar{X}^i_K, Y^i}_{i=1}^{t-1} collected so far at the beginning of each episode t. Specifically, UC-DTR computes in Steps 1-3 the empirical estimates \hat{E}^t_{\bar{x}_K}[Y|\bar{s}_K] of the expected reward E_{\bar{x}_K}[Y|\bar{s}_K], and \hat{P}^t_{\bar{x}_k}(s_{k+1}|\bar{s}_k) of the transition probabilities P_{\bar{x}_k}(s_{k+1}|\bar{s}_k), from experimental samples collected prior to episode t. In Step 4, a set \mathcal{M}_t of plausible DTRs is defined in terms of a confidence region around the empirical estimates \hat{E}^t_{\bar{x}_K}[Y|\bar{s}_K] and \hat{P}^t_{\bar{x}_k}(s_{k+1}|\bar{s}_k). This guarantees that the true DTR M^* is in the set \mathcal{M}_t with high probability. 
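The Step-4 confidence radii of Eqs. (2)-(3) are simple closed-form expressions; a sketch of their computation follows, where the argument names and example numbers are ours, not the paper's experiments:

```python
import math

def transition_radius(n, size_s_next, size_hist, t, delta, K):
    """L1 radius around the transition estimate (Eq. (2)).

    n: visitation count N^t(sk, xk); size_hist: |S_k bar| * |X_k bar|.
    """
    return math.sqrt(6 * size_s_next * math.log(2 * K * size_hist * t / delta) / max(1, n))

def reward_radius(n, size_s, size_x, t, delta, K):
    """Radius around the reward estimate (Eq. (3))."""
    return math.sqrt(2 * math.log(2 * K * size_s * size_x * t / delta) / max(1, n))

# Both radii shrink as O(1/sqrt(n)) in the visitation count n, so the set of
# plausible DTRs M_t contracts around the truth as data accumulates.
```

Quadrupling the count n of a state-action history halves either radius, which is the usual concentration-of-measure behavior behind the optimism principle.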
In Step 5, UC-DTR computes the optimal policy \pi_t of the most optimistic instance M_t in the family of DTRs \mathcal{M}_t, i.e., the instance that induces the maximal optimal expected reward. This policy \pi_t is executed throughout episode t, and new samples \bar{S}^t_K, \bar{X}^t_K, Y^t are collected (Step 6).

Finding Optimistic DTRs   Step 5 of UC-DTR tries to find an optimal policy \pi_t for an optimistic DTR M_t. While the Bellman equation [5] allows one to optimize a fixed DTR, we need to find a DTR M_t that gives the maximal optimal reward among all plausible DTRs in \mathcal{M}_t given by Eq. (3). We now introduce a method that extends standard dynamic programming planners [5] to solve this problem. We first combine all DTRs in \mathcal{M}_t to get an extended DTR M^+. That is, we consider a DTR M^+ with continuous decision space \bar{\mathcal{X}}^+ = \bar{\mathcal{X}}^+_K, where for each horizon k, each action \bar{x}_k \in \bar{\mathcal{X}}_k, and each admissible transition probability P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) according to Eq. (2), there is an action in \bar{\mathcal{X}}^+ inducing the same probabilities P_{\bar{x}_k}(s_{k+1}|\bar{s}_k). Similar arguments also apply to the expected reward E_{\bar{x}_K}[Y|\bar{s}_K]. Then, for each policy \pi^+ on M^+, there is a DTR M_t \in \mathcal{M}_t and a policy \pi_t \in \Pi such that the policies \pi^+ and \pi_t induce the same transition probabilities on the respective DTRs, and vice versa. Thus, solving the optimization problem in Eq. (4) is equivalent to finding an optimal policy \pi^*_+ on the extended DTR M^+. Let V^*(\bar{s}_k, \bar{x}_{k-1}) denote the optimal value E_{\pi^*_+}[Y|\bar{s}_k, \bar{x}_{k-1}] in M^+. The Bellman equation on M^+ for k = 1, ..., K-1 is defined as follows:

V^*(\bar{s}_k, \bar{x}_{k-1}) = \max_{x_k} { \max_{P_{\bar{x}_k}(\cdot|\bar{s}_k) \in \mathcal{P}_k} { \sum_{s_{k+1}} V^*(\bar{s}_{k+1}, \bar{x}_k) P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) } },   (5)

and V^*(\bar{s}_K, \bar{x}_{K-1}) = \max_{x_K} \max_{E_{\bar{x}_K}[Y|\bar{s}_K] \in \mathcal{R}} E_{\bar{x}_K}[Y|\bar{s}_K], where \mathcal{R} and \mathcal{P}_k are the convex polytopes of parameters E_{\bar{x}_K}[Y|\bar{s}_K] and P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) defined in Eqs. (3) and (2), respectively. The inner maximum in Eq. (5) is a linear program (LP) over the convex polytope \mathcal{P}_k (or \mathcal{R}), which is solvable using standard LP algorithms.

2.2 Theoretical Analysis

We now analyze the asymptotic behavior of UC-DTR, which will lead to a better understanding of its theoretical guarantees. Given space constraints, all proofs are provided in the full technical report [40, Appendix I]. The following theorem shows that the cumulative regret of UC-DTR after T steps is at most \tilde{O}(K\sqrt{|S||X|T}).2

Theorem 1. Fix a \delta \in (0, 1). With probability (w.p.) at least 1 - \delta, it holds for any T > 1 that the regret of UC-DTR with parameter \delta is bounded by

R(T) \leq 12K\sqrt{|S||X|T \log(2K|S||X|T/\delta)} + 4K\sqrt{T \log(2T/\delta)}.   (6)

It is also possible to obtain an instance-dependent bound on the expected regret. Let \Pi^- denote the set of sub-optimal policies {\pi \in \Pi : V_\pi(M^*) < V_{\pi^*}(M^*)}. For any \pi \in \Pi^-, let its gap in expected reward relative to the optimal policy \pi^* be \Delta_\pi = V_{\pi^*}(M^*) - V_\pi(M^*). We next derive a gap-dependent logarithmic bound on the expected regret of UC-DTR after T steps.

Theorem 2. 
For any T \geq 1, with parameter \delta = 1/T, the expected regret of UC-DTR is bounded by

E[R(T)] \leq \max_{\pi \in \Pi^-} { 332K^2|S||X| \log(T)/\Delta_\pi + 32/\Delta_\pi^3 + 4/\Delta_\pi } + 1.   (7)

Since Eq. (7) is a decreasing function of the gap \Delta_\pi, the maximum of the regret in Thm. 2 is achieved by the second-best policy \pi^- = \arg\min_{\pi \in \Pi^-} \Delta_\pi. We also provide a corresponding lower bound on the expected regret of any experimental algorithm.

Theorem 3. For any algorithm A, any natural numbers K \geq 1 and |\mathcal{S}_k| \geq 2, |\mathcal{X}_k| \geq 2 for any k \in {1, ..., K}, there is a DTR M with horizon K, state domains S and action domains X, such that the expected regret of A after T \geq |S||X| episodes is at least

E[R(T)] \geq 0.05\sqrt{|S||X|T}.   (8)

Thm. 3 implies that a cumulative regret of \Omega(\sqrt{|S||X|T}) is unavoidable in the worst case over DTR instances. The regret upper bound \tilde{O}(K\sqrt{|S||X|T}) in Thm. 1 is close to the lower bound \Omega(\sqrt{|S||X|T}) in Thm. 3, which means that UC-DTR is near-optimal provided with only the domains of states S and actions X.

2 \tilde{O}(\cdot) is similar to O(\cdot) but ignores log-terms, i.e., f = \tilde{O}(g) if and only if \exists k, f = O(g \log^k(g)).

3 Learning from Confounded Observations

The results presented so far (Thms. 1 to 3) establish the dimension of the state-action domain |S||X| as an important parameter for the information complexity of online learning in DTRs. When the domains S \times X are high-dimensional, the cumulative regret will be significant for any online algorithm, no matter how sophisticated it might be. 
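The dependence on |S||X| is easy to see numerically. A small illustration (our numbers, not the paper's experiments) of how both the Thm. 1 upper bound and the Thm. 3 lower bound scale with the domain size:

```python
import math

def thm1_upper(K, SX, T, delta=0.01):
    """Upper bound of Eq. (6), with SX standing for the product |S||X|."""
    return 12 * K * math.sqrt(SX * T * math.log(2 * K * SX * T / delta)) \
        + 4 * K * math.sqrt(T * math.log(2 * T / delta))

def thm3_lower(SX, T):
    """Lower bound of Eq. (8)."""
    return 0.05 * math.sqrt(SX * T)

# Quadrupling |S||X| doubles the unavoidable regret:
ratio = thm3_lower(64, 10**4) / thm3_lower(16, 10**4)  # -> 2.0
```

Both bounds grow as sqrt(|S||X|), so large state-action domains are costly for any online learner, which motivates bringing in observational data below.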
This observation suggests that we should explore other reasonable assumptions to address the issues of high-dimensional domains.

A natural approach is to utilize the abundant observational data, which could be obtained by passively observing other agents behaving in the environment. Despite all its power, the UC-DTR algorithm does not make use of any knowledge in the observational distribution P(\bar{s}_K, \bar{x}_K, y). For the remainder of this paper, we present and study an efficient procedure to incorporate observational samples of P(\bar{s}_K, \bar{x}_K, y), so that the performance of online learners can be improved.

When the states \bar{S}_K satisfy the sequential backdoor criterion [24] with respect to the treatments \bar{X}_K and the primary outcome Y, one can identify the transition probabilities P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) and the expected reward E_{\bar{x}_K}[Y|\bar{s}_K] from P(\bar{s}_K, \bar{x}_K, y). The optimal policy is thus solvable using standard off-policy learning methods such as Q-learning [31, 20]. However, issues of non-identifiability arise in the general settings where the sequential backdoor does not hold (e.g., see Fig. 1a).

Theorem 4. Given P(\bar{s}_K, \bar{x}_K, y) > 0, there exist DTRs M_1, M_2 such that P^{M_1}(\bar{s}_K, \bar{x}_K, y) = P^{M_2}(\bar{s}_K, \bar{x}_K, y) = P(\bar{s}_K, \bar{x}_K, y) while P^{M_1}_{\bar{x}_K}(\bar{s}_K, y) \neq P^{M_2}_{\bar{x}_K}(\bar{s}_K, y).

Thm. 4 is stronger than the standard non-identifiability results (e.g., [14, Thm. 1]). 
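A single-stage caricature of this phenomenon can be verified by exhaustive enumeration. The construction below is our own toy example, not the DTR pair used in the proof: two SCMs agree on the observational P(x, y) yet disagree under do(X = 1).

```python
def obs_and_do(f_y):
    """Enumerate an SCM with U ~ Bernoulli(1/2), confounded policy X <- U, Y <- f_y(X, U)."""
    obs, p_do1 = {}, 0.0
    for u in (0, 1):
        p = 0.5                       # P(U = u)
        x = u                         # confounded behavior policy: X <- U
        y = f_y(x, u)
        obs[(x, y)] = obs.get((x, y), 0.0) + p
        p_do1 += p * f_y(1, u)        # P(Y = 1 | do(X = 1)): force X = 1, keep U
    return obs, p_do1

obs1, do1 = obs_and_do(lambda x, u: x & u)   # M1: Y = X AND U
obs2, do2 = obs_and_do(lambda x, u: x)       # M2: Y = X
# obs1 == obs2, yet do1 = 0.5 while do2 = 1.0: the effect is not identifiable.
```

Any method that treats the confounded data as if it were experimental must return the same answer for both models, and is therefore wrong in at least one of them.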
It shows that, given any observational distribution P(\bar{s}_K, \bar{x}_K, y), one can construct two DTRs that are both compatible with P(\bar{s}_K, \bar{x}_K, y) but disagree on the interventional probabilities P_{\bar{x}_K}(\bar{s}_K, y).

3.1 Bounds and Partial Identification in DTRs

In this section, we consider a partial identification task in DTRs, which bounds the parameters P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) and E_{\bar{x}_K}[Y|\bar{s}_K] from the observational distribution P(\bar{s}_K, \bar{x}_K, y). Our first result shows that the gap between the causal quantities P_{\bar{x}_k}(\bar{s}_{k+1}) and P_{\bar{x}_k}(\bar{s}_k) in a DTR is bounded by the gap between the corresponding observational distributions P(\bar{s}_{k+1}, \bar{x}_k) and P(\bar{s}_k, \bar{x}_k).

Lemma 1. For a DTR, given P(\bar{s}_K, \bar{x}_K, y), for any k = 1, ..., K-1,

P_{\bar{x}_k}(\bar{s}_{k+1}) - P_{\bar{x}_k}(\bar{s}_k) \leq P(\bar{s}_{k+1}, \bar{x}_k) - P(\bar{s}_k, \bar{x}_k).   (9)

Lem. 1 allows one to derive informative bounds on the transition probabilities P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) in a DTR, which are consistently estimable from the observational data P(\bar{s}_K, \bar{x}_K).

Theorem 5. For a DTR, given P(\bar{s}_K, \bar{x}_K, y) > 0, for any k = 1, ..., K-1,

P(\bar{s}_{k+1}, \bar{x}_k)/\Gamma(\bar{s}_k, \bar{x}_{k-1}) \leq P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) \leq \Gamma(\bar{s}_{k+1}, \bar{x}_k)/\Gamma(\bar{s}_k, \bar{x}_{k-1}),   (10)

where \Gamma(\bar{s}_{k+1}, \bar{x}_k) = P(\bar{s}_{k+1}, \bar{x}_k) - P(\bar{s}_k, \bar{x}_k) + \Gamma(\bar{s}_k, \bar{x}_{k-1}) and \Gamma(s_1) = P(s_1).

The bounds in Thm. 5 exploit the sequential functional relationships among states and treatments in the underlying DTR, which improve over the best-known bounds reported in [17, 3, 39]. Let [a_{\bar{x}_k,\bar{s}_k}(s_{k+1}), b_{\bar{x}_k,\bar{s}_k}(s_{k+1})] denote the bound over P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) given by Eq. (10). We next show
We next show\nthat P\u00afxk (sk+1|\u00afsk) \u2208(cid:2)a\u00afxk,\u00afsk (sk+1), b\u00afxk,\u00afsk (sk+1)(cid:3) is indeed optimal without additional assumption.\n(sk+1)(cid:3)\nstrictly contained in(cid:2)a\u00afxk,\u00afsk (sk+1), b\u00afxk,\u00afsk (sk+1)(cid:3). By Thm. 6, we could always \ufb01nd DTRs M1, M2\n(sk+1)(cid:3), which is a contradiction.\n\nTheorem 6. Given P (\u00afsK, \u00afxK, y) > 0, for any k \u2208 {1, . . . , K \u2212 1}, there exists DTRs M1, M2 such\n\u00afxk (sk+1|\u00afsk) = a\u00afxk,\u00afsk (sk+1),\nthat P M1(\u00afsK, \u00afxK, y) = P M2(\u00afsK, \u00afxK, y) = P (\u00afsK, \u00afxK, y) while P M1\n\u00afxk (sk+1|\u00afsk) = b\u00afxk,\u00afsk (sk+1).\nP M2\nThm. 6 ensures the optimality of Thm. 5. Suppose there exists a bound [a(cid:48)\n\nthat are compatible with the observational data P (\u00afsK, \u00afxK, y) while their transition probabilities\nP\u00afxk (sk+1|\u00afsk) lie outside of the bound [a(cid:48)\nAs a corollary, one could apply methods of Lem. 1 and Thm. 5 to bound expected rewards E\u00afxK [Y |\u00afsk]\nfrom P (\u00afsK, \u00afxK, y). The optimality of the derived bounds follows immediately after Thm. 6.\n\n(sk+1), b(cid:48)\n\n\u00afxk,\u00afsk\n\n(sk+1), b(cid:48)\n\n\u00afxk,\u00afsk\n\n\u00afxk,\u00afsk\n\n\u00afxk,\u00afsk\n\n6\n\n\fAlgorithm 2: Causal UC-DTR (UCc-DTR)\nInput: failure tolerance \u03b4 \u2208 (0, 1), causal bounds C.\n1: Let Mc denote a set of DTRs compatible with causal bounds C, i.e., for any M \u2208 Mc, its\ncausal quantities P\u00afxk (sk+1|\u00afsk) and E\u00afxK [Y |\u00afsK] satisfy Eq. (13) and Eq. (14) respectively.\n2: for all episodes t = 1, 2, . . . do\n3:\n4:\n\nExecute Steps 2-4 of UC-DTR (Alg. 
1).\nFind the optimal policy \u03c0t of an optimistic DTR Mt in Mc\nV\u03c0(M )\n\nt = Mt \u2229 Mc such that\n\nV\u03c0t(Mt) =\n\nmax\n\n\u03c0\u2208\u03a0,M\u2208Mc\n\nt\n\n(12)\n\n.\n\n(11)\n\nExecute policy \u03c0t for episode t and observe the samples \u00afSt\n\nK, \u00afX t\n\nK, Y t.\n\n5:\n6: end for\n\nCorollary 1. For a DTR, given P (\u00afsK, \u00afxK, y) > 0,\n\nE[Y |\u00afsK, \u00afxK]P (\u00afsK, \u00afxK)\n\n\u0393(\u00afsK, \u00afxK\u22121)\n\n\u2264 E\u00afxK [Y |\u00afsk] \u2264 1 \u2212 (1 \u2212 E[Y |\u00afsK, \u00afxK])P (\u00afsK, \u00afxK)\n\n\u0393(\u00afsK, \u00afxK\u22121)\n\nSince E[Y |\u00afsK, \u00afxK] \u2208 [0, 1], the bounds in Eq. (11) are contained in [0, 1] and are thus informative.\nThe bounds developed so far are functions of the observational distribution P (\u00afsK, \u00afxK, y) which is\nidenti\ufb01able by the sampling process, and so generally can be estimated consistently. Speci\ufb01cally, we\nestimate the bounds in Thm. 5 and Corol. 1 by the corresponding sample mean estimates. Standard\nresults of large-deviation theory are thus applicable to control the uncertainties due to \ufb01nite samples.\n\n3.2 The Causal UC-DTR Algorithm\n\nCK =\n\nOur goal in this section is to introduce a simple, yet principled approach for leveraging the new-found\nbounds de\ufb01ned in Thm. 5 and Corol. 1, hopefully improving the performance of UC-DTR procedure.\nFor k = 1, . . . 
, K-1, let C_k denote the set of bounds over the transition probabilities P_{\bar{x}_k}(s_{k+1}|\bar{s}_k), i.e.,

C_k = {\forall \bar{s}_{k+1}, \bar{x}_k : P_{\bar{x}_k}(s_{k+1}|\bar{s}_k) \in [a_{\bar{x}_k,\bar{s}_k}(s_{k+1}), b_{\bar{x}_k,\bar{s}_k}(s_{k+1})]}.   (13)

Similarly, let C_K denote the set of bounds over the conditional expected reward E_{\bar{x}_K}[Y|\bar{s}_K], i.e.,

C_K = {\forall \bar{s}_K, \bar{x}_K : E_{\bar{x}_K}[Y|\bar{s}_K] \in [a_{\bar{x}_K,\bar{s}_K}, b_{\bar{x}_K,\bar{s}_K}]}.   (14)

We denote by C the set of bounds {C_1, ..., C_K} on the system dynamics of the DTR, called causal bounds. Our procedure Causal UC-DTR (for short, UCc-DTR) is summarized in Alg. 2. UCc-DTR is similar to the original UC-DTR but exploits the causal bounds C. It maintains a set of possible DTRs M^c compatible with the causal bounds C (Step 1). Before each episode t, it computes the optimal policy \pi_t of an optimistic DTR M_t in the set M^c_t = \mathcal{M}_t \cap M^c (Step 3). Similar to UC-DTR, \pi_t can be obtained by solving the LPs defined in Eq. (5) subject to the additional causal constraints of Eqs. (13) and (14). We next analyze the asymptotic properties of UCc-DTR, showing that it consistently outperforms UC-DTR.

Let ||C_k||_1 denote the maximal L1 norm of any parameter in C_k, i.e., for any k = 1, ..., K-1,

||C_k||_1 = \max_{\bar{x}_k,\bar{s}_k} \sum_{s_{k+1}} |a_{\bar{x}_k,\bar{s}_k}(s_{k+1}) - b_{\bar{x}_k,\bar{s}_k}(s_{k+1})|, and ||C_K||_1 = \max_{\bar{x}_K,\bar{s}_K} |a_{\bar{x}_K,\bar{s}_K} - b_{\bar{x}_K,\bar{s}_K}|.

Further, let ||C||_1 = \sum_{k=1}^{K} ||C_k||_1. The total regret of UCc-DTR after T steps is bounded as follows.

Theorem 7. Fix a \delta \in (0, 1). With probability at least 1 - \delta, it holds for any T > 1 that the regret of UCc-DTR with parameter \delta and causal bounds C is bounded by

R(T) \leq \min{ 12K\sqrt{|S||X|T \log(2K|S||X|T/\delta)}, ||C||_1 T } + 4K\sqrt{T \log(2T/\delta)}.   (15)

It is immediate from Thm. 7 that the regret bound in Eq. (15) is smaller than the bound given by Eq. (6) if T < 12^2 |S||X| \log(2K|S||X|T/\delta)/||C||_1^2. This means that UCc-DTR has a head start over UC-DTR when the causal bounds C are informative, i.e., when the dimension ||C||_1 is small.

Figure 2: Simulations comparing online learners that are randomized (rand), adaptive (uc-dtr), and causally enhanced (ucc-dtr), on (a)-(b) random DTRs and (c) cancer treatment. Graphs are rendered in high resolution and can be zoomed in.

We can also witness the improvement from causal bounds on the total expected regret. Let \Pi^-_C be the set of sub-optimal policies whose maximal expected rewards over instances in M^c are no less than the true optimal value V_{\pi^*}(M^*), i.e., \Pi^-_C = {\pi \in \Pi^- : \max_{M \in M^c} V_\pi(M) \geq V_{\pi^*}(M^*)}. The following is an instance-dependent bound on the total regret of UCc-DTR after T steps.

Theorem 8. For any T \geq 1, with parameter \delta = 1/T and causal bounds C, the expected regret of UCc-DTR is bounded by

E[R(T)] \leq \max_{\pi \in \Pi^-_C} { 332K^2|S||X| \log(T)/\Delta_\pi + 32/\Delta_\pi^3 + 4/\Delta_\pi } + 1.   (16)

Since \Pi^-_C \subseteq \Pi^-, it follows that the regret bound in Thm. 8 is smaller than or equal to that of Eq. (7), i.e., UCc-DTR consistently dominates UC-DTR in terms of performance. For instance, in a multi-armed
For instance, in a multi-armed bandit model (i.e., a 1-stage DTR with S1 = ∅) with optimal reward μ∗, the regret of UCc-DTR is O(|X| log(T)/Δx), where Δx is the smallest gap among sub-optimal arms x satisfying bx ≥ μ∗.

4 Experiments

We demonstrate our algorithms on several dynamic treatment regimes, including randomly generated DTRs and the survival model in the context of multi-stage cancer treatment. We find that our algorithms efficiently find the optimal policy; the observational data typically improve the convergence rate of online RL learners despite the confounding bias.
In all experiments, we test sequentially randomized trials (rand), the UC-DTR algorithm (uc-dtr), and the causal UC-DTR (ucc-dtr) with causal bounds derived from 1 × 10⁵ confounded observational samples. Each experiment lasts for T = 1.1 × 10⁴ episodes. We set the parameter δ = 1/(KT) for uc-dtr and ucc-dtr, where K is the total number of intervention stages. For all algorithms, we measure the cumulative regret over 200 repetitions. We refer readers to the complete technical report [40, Appendix II] for more details on the experimental set-up.

Random DTRs We generate 200 random instances and observational distributions of the DTR described in Fig. 1. We assume that the treatments X1, X2, states S1, S2, and primary outcome Y are all binary variables; the value of each variable is decided by its corresponding unobserved counterfactuals S2x1, X2x1, Yx̄2, following the definitions in [3, 9]. The probabilities of the joint distribution P(s1, x1, s2x1, x2x1, yx̄2) are drawn randomly over [0, 1]. The cumulative regrets, averaged over all random DTRs, are reported in Fig. 2a. We find that the online methods (uc-dtr, ucc-dtr) dominate randomized assignments (rand); RL learners that leverage causal bounds (ucc-dtr) consistently dominate learners that do not (uc-dtr). Fig.
2b reports the relative improvement in total regret of ucc-dtr compared to uc-dtr among the 200 instances: ucc-dtr outperforms uc-dtr in over 80% of the generated DTRs. This suggests that causal bounds derived from observational data are beneficial in most instances.

Cancer Treatment We test the survival model of the two-stage clinical trial conducted by the Cancer and Leukemia Group B [16, 37]. Protocol 8923 was a double-blind, placebo-controlled two-stage trial reported by [29], examining the effects of infusions of granulocyte-macrophage colony-stimulating factor (GM-CSF) after initial chemotherapy in patients with acute myelogenous leukemia (AML). Standard chemotherapy for AML can place patients at increased risk of death due to infection or bleeding-related complications. GM-CSF administered after chemotherapy might assist patient recovery, thus reducing the number of deaths due to such complications. Patients were randomized initially to GM-CSF or placebo following standard chemotherapy. Later, patients meeting the criteria of complete remission and consenting to further participation were offered a second randomization to one of two intensification treatments.
Fig. 1a describes the DTR of this two-stage trial. X1 represents the initial GM-CSF administration and X2 represents the intensification treatment; the initial state S1 = ∅ and S2 indicates complete remission after the first treatment; the primary outcome Y indicates the survival of patients at the time of recording. We generate observational samples using the age of patients as the UCs U. The average cumulative regrets are reported in Fig. 2c. We find that rand performs worst among all strategies; uc-dtr finds the optimal policy with sub-linear regret.
Interestingly, ucc-dtr converges almost immediately, suggesting that causal bounds derived from confounded observations can significantly improve the performance of online learners.

5 Conclusion

In this paper, we investigated the online reinforcement learning problem for selecting the optimal DTR provided with abundant, yet imperfect observations made about the underlying environment. We first presented an online RL algorithm with near-optimal regret bounds in DTRs based solely on knowledge of the state-action domains. We further derived causal bounds on the system dynamics of DTRs from observational data. These bounds can be incorporated in a simple, yet principled way to improve the performance of online RL learners. In today's healthcare, for example, the growing use of mobile devices opens new opportunities for continuous monitoring of patients' conditions and just-in-time interventions. We believe that our results constitute a significant step towards the development of a more principled and robust science of precision medicine.

Acknowledgments

This research is supported in part by grants from IBM Research, Adobe Research, NSF IIS-1704352, and IIS-1750807 (CAREER).

References
[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. In COLT, 2016.

[3] A. Balke and J. Pearl. Counterfactuals and policy analysis in structural models. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 11–18, 1995.

[4] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345–7352, 2016.

[5] R. Bellman. Dynamic programming.
Science, 153(3731):34–37, 1966.

[6] B. Chakraborty. Dynamic treatment regimes for managing chronic health conditions: a statistical perspective. American Journal of Public Health, 101(1):40–45, 2011.

[7] B. Chakraborty and E. Moodie. Statistical methods for dynamic treatment regimes. Springer, 2013.

[8] B. Chakraborty and S. A. Murphy. Dynamic treatment regimes. Annual Review of Statistics and Its Application, 1:447–464, 2014.

[9] C. Frangakis and D. Rubin. Principal stratification in causal inference. Biometrics, 1(58):21–29, 2002.

[10] Z. D. Guo, S. Doroudi, and E. Brunskill. A PAC RL algorithm for episodic POMDPs. In Artificial Intelligence and Statistics, pages 510–518, 2016.

[11] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

[12] P. W. Lavori and R. Dawson. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163(1):29–38, 2000.

[13] P. W. Lavori and R. Dawson. Adaptive treatment strategies in chronic disease. Annu. Rev. Med., 59:443–453, 2008.

[14] S. Lee, J. D. Correa, and E. Bareinboim. General identifiability with arbitrary surrogate experiments. In Proceedings of the Thirty-fifth Conference on Uncertainty in Artificial Intelligence (UAI), Corvallis, OR, 2019. AUAI Press.

[15] Q. Liu and A. Ihler. Belief propagation for structured decision making. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 523–532. AUAI Press, 2012.

[16] J. K. Lunceford, M. Davidian, and A. A. Tsiatis. Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics, 58(1):48–57, 2002.

[17] C. Manski.
Nonparametric bounds on treatment effects. American Economic Review, Papers\n\nand Proceedings, 80:319\u2013323, 1990.\n\n[18] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society:\n\nSeries B (Statistical Methodology), 65(2):331\u2013355, 2003.\n\n[19] S. A. Murphy. An experimental design for the development of adaptive treatment strategies.\n\nStatistics in medicine, 24(10):1455\u20131481, 2005.\n\n[20] S. A. Murphy. A generalization error for q-learning. Journal of Machine Learning Research,\n\n6(Jul):1073\u20131097, 2005.\n\n[21] S. A. Murphy, M. J. van der Laan, J. M. Robins, and C. P. P. R. Group. Marginal mean models\nfor dynamic regimes. Journal of the American Statistical Association, 96(456):1410\u20131423,\n2001.\n\n[22] I. Osband and B. Van Roy. Near-optimal reinforcement learning in factored mdps. In Advances\n\nin Neural Information Processing Systems, pages 604\u2013612, 2014.\n\n[23] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York,\n\n2000. 2nd edition, 2009.\n\n[24] J. Pearl and J. Robins. Probabilistic evaluation of sequential plans from causal models with\nhidden variables. In P. Besnard and S. Hanks, editors, Uncertainty in Arti\ufb01cial Intelligence 11,\npages 444\u2013453. Morgan Kaufmann, San Francisco, 1995.\n\n[25] J. Robins, L. Orellana, and A. Rotnitzky. Estimation and extrapolation of optimal treatment and\n\ntesting strategies. Statistics in medicine, 27(23):4678\u20134721, 2008.\n\n[26] P. Rosenbaum and D. Rubin. The central role of propensity score in observational studies for\n\ncausal effects. Biometrika, 70:41\u201355, 1983.\n\n[27] D. Rubin. Bayesian inference for causal effects: The role of randomization. Annals of Statistics,\n\n6(1):34\u201358, 1978.\n\n[28] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT press,\n\n2000.\n\n[29] R. M. Stone, D. T. Berg, S. L. George, R. K. Dodge, P. A. Paciucci, P. Schulman, E. 
J. Lee, J. O. Moore, B. L. Powell, and C. A. Schiffer. Granulocyte–macrophage colony-stimulating factor after initial chemotherapy for elderly patients with primary acute myelogenous leukemia. New England Journal of Medicine, 332(25):1671–1677, 1995.

[30] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.

[31] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.

[32] I. Szita and C. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010.

[33] P. F. Thall, R. E. Millikan, and H.-G. Sung. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine, 19(8):1011–1028, 2000.

[34] P. F. Thall, H.-G. Sung, and E. H. Estey. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association, 97(457):29–39, 2002.

[35] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[36] E. H. Wagner, B. T. Austin, C. Davis, M. Hindmarsh, J. Schaefer, and A. Bonomi. Improving chronic illness care: translating evidence into action. Health Affairs, 20(6):64–78, 2001.

[37] A. S. Wahed and A. A. Tsiatis. Optimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomization designs in clinical trials. Biometrics, 60(1):124–133, 2004.

[38] A. S. Wahed and A. A. Tsiatis. Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data.
Biometrika, 93(1):163–177, 2006.

[39] J. Zhang and E. Bareinboim. Transfer learning in multi-armed bandits: a causal approach. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1340–1346. AAAI Press, 2017.

[40] J. Zhang and E. Bareinboim. Near-optimal reinforcement learning in dynamic treatment regimes. Technical Report R-48, Causal AI Lab, Columbia University, 2019.