{"title": "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 9668, "page_last": 9678, "abstract": "Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) --- the problem of evaluating a new policy using the historical data obtained by different behavior policies --- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. \nMIS achieves a mean-squared error of \n$$\n\\frac{1}{n} \\sum_{t=1}^H\\mathbb{E}_{\\mu}\\left[\\frac{d_t^\\pi(s_t)^2}{d_t^\\mu(s_t)^2} \\Var_{\\mu}\\left[\\frac{\\pi_t(a_t|s_t)}{\\mu_t(a_t|s_t)}\\big( V_{t+1}^\\pi(s_{t+1}) + r_t\\big) \\middle| s_t\\right]\\right] + \\tilde{O}(n^{-1.5})\n$$\nwhere $\\mu$ and $\\pi$ are the logging and target policies, $d_t^{\\mu}(s_t)$ and $d_t^{\\pi}(s_t)$ are the marginal distribution of the state at $t$th step, $H$ is the horizon, $n$ is the sample size and $V_{t+1}^\\pi$ is the value function of the MDP under $\\pi$. The result matches the Cramer-Rao lower bound in [Jiang and Li, 2016] up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. 
Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.", "full_text": "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Tengyang Xie*
Dept. of Computer Science, UIUC
Urbana, IL 61801
tx10@illinois.edu

Yifei Ma
AWS AI Labs, Amazon.com Services, Inc.
East Palo Alto, CA 94303
yifeim@amazon.com

Yu-Xiang Wang
Dept. of Computer Science, UC Santa Barbara
Santa Barbara, CA 93106
yuxiangw@cs.ucsb.edu

Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) --- the problem of evaluating a new policy using the historical data obtained by different behavior policies --- under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of
$$
\frac{1}{n} \sum_{t=1}^H \mathbb{E}_{\mu}\left[\frac{d_t^\pi(s_t)^2}{d_t^\mu(s_t)^2} \operatorname{Var}_{\mu}\left[\frac{\pi_t(a_t|s_t)}{\mu_t(a_t|s_t)}\big( V_{t+1}^\pi(s_{t+1}) + r_t\big) \,\middle|\, s_t\right]\right] + \tilde{O}(n^{-1.5}),
$$
where $\mu$ and $\pi$ are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the $t$-th step, $H$ is the horizon, $n$ is the sample size, and $V_{t+1}^\pi$ is the value function of the MDP under $\pi$. The result matches the Cramer-Rao lower bound in Jiang and Li [2016] up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.

1 Introduction

The problem of off-policy evaluation (OPE), which predicts the performance of a policy with data only sampled by a behavior policy [Sutton and Barto, 1998], is crucial for using reinforcement learning (RL) algorithms responsibly in many real-world applications. In many settings where RL algorithms have already been deployed, e.g., targeted advertising and marketing [Bottou et al., 2013; Tang et al., 2013; Chapelle et al., 2015; Theocharous et al., 2015; Thomas et al., 2017] or medical treatments [Murphy et al., 2001; Ernst et al., 2006; Raghu et al., 2017], online policy evaluation is usually expensive, risky, or even unethical. Moreover, deploying a bad policy in these applications is dangerous and could lead to severe consequences. Solving OPE is therefore often the starting point in many RL applications.

* The research was partially conducted when TX was visiting YW and YM at Amazon AWS AI Labs during his internship in Summer 2018.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To tackle OPE, importance sampling (IS) corrects the mismatch between the distributions induced by the behavior policy and the target policy, and typically provides unbiased or strongly consistent estimators [Precup et al., 2000]. IS-based off-policy evaluation methods have seen much recent interest, especially for short-horizon problems such as contextual bandits [Murphy et al., 2001; Hirano et al., 2003; Dudík et al., 2011; Wang et al., 2017].
However, the variance of IS-based approaches [Precup et al., 2000; Thomas et al., 2015; Jiang and Li, 2016; Thomas and Brunskill, 2016; Guo et al., 2017; Farajtabar et al., 2018] tends to be too high to provide informative results for long-horizon problems [Mandel et al., 2014], since the variance of the product of importance weights may grow exponentially as the horizon grows. There are also model-based approaches to the OPE problem [Liu et al., 2018b; Gottesman et al., 2019], where the value of the target policy is estimated directly from an approximated MDP.

Given this high-variance issue, it is necessary to find an IS-based approach that does not rely heavily on the cumulative product of importance weights over whole trajectories. While the benefit of cumulative products is to allow unbiased estimation even without any state-observability assumptions, reweighting entire trajectories may not be necessary if some intermediate states are directly observable. In the latter case, based on Markov independence assumptions, we can aggregate all trajectories that share the same state-transition patterns to directly estimate the state distribution shift induced by changing from the behavioral policy to the target policy. We call this approach marginalized importance sampling (MIS), because it computes the marginal state distribution shift at every single step instead of the product of policy weights.

Related work [Liu et al., 2018a] tackles the high-variance issue caused by the cumulative product of importance weights by applying importance sampling to the average visitation distribution of state-action pairs, based on an estimate of the mixed state distribution. Hallak and Mannor [2017] and Gelada and Bellemare [2019] also leverage the same fact in time-invariant MDPs, where they use the stationary ratio of state-action pairs to replace the trajectory weights.
However, these methods may not directly work in finite-horizon MDPs, where the state distributions may not mix.

In contrast to the prior work, the first goal of our paper is to study the sample complexity and optimality of the marginalized approach. Specifically, we provide the first finite-sample error bound on the mean-square error of our MIS off-policy evaluation estimator under the episodic tabular MDP setting (with a potentially continuous action space). Our MSE bound is an exact calculation up to low-order terms. Compared to the Cramer-Rao lower bound established in [Jiang and Li, 2016, Theorem 3] for the DAG-MDP, our bound is larger by at most a factor of H, and we have good reason to believe that this additional factor is required for any OPE estimator in this setting.

In addition to the theoretical results, we empirically evaluate our estimator against a number of strong baselines from prior work in several time-invariant/time-varying, fully observable/partially observable, and long-horizon environments. Our approach can also be used in most OPE estimators that leverage IS-based estimators, such as doubly robust [Jiang and Li, 2016], MAGIC [Thomas and Brunskill, 2016], and MRDR [Farajtabar et al., 2018], under a mild Markov assumption.

Here is a road map for the rest of the paper. Section 2 provides the preliminaries of the off-policy evaluation problem. Section 3 presents the design of our marginalized estimator, and Section 4 studies its information-theoretic optimality. Section 5 presents empirical results on a number of RL tasks. Finally, Section 6 concludes the paper.

2 Problem formulation

Symbols and notations.
We consider the problem of off-policy evaluation for a finite-horizon, nonstationary, episodic MDP, a tuple $M = (\mathcal{S}, \mathcal{A}, T, r, H)$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $T_t : \mathcal{S}\times\mathcal{A}\times\mathcal{S} \to [0,1]$ is the transition function, with $T_t(s'|s,a)$ the probability of reaching state $s'$ after taking action $a$ in state $s$ at time $t$; $r_t : \mathcal{S}\times\mathcal{A}\times\mathcal{S} \to \mathbb{R}$ is the expected reward function, with $r_t(s,a,s')$ the mean of the immediate reward received after taking action $a$ in state $s$ and transitioning into $s'$; and $H$ denotes the finite horizon. We use $\mathbb{P}[E]$ to denote the probability of an event $E$ and $p(x)$ the p.m.f. (or p.d.f.) of a random variable $X$ taking value $x$. $\mathbb{E}[\cdot]$ and $\mathbb{E}[\cdot|E]$ denote expectation and conditional expectation given $E$, respectively.

Let $\mu, \pi : \mathcal{S} \to \mathbb{P}_{\mathcal{A}}$ be policies that output a distribution over actions given an observed state. We call $\mu$ the behavioral policy and $\pi$ the target policy. For notational convenience we write $\mu(a_t|s_t)$ and $\pi(a_t|s_t)$ for the p.m.f. of actions given the state at time $t$. The expectation operators in this paper are indexed by $\pi$ or $\mu$, denoting that all random variables come from roll-outs of the specified policy.

Moreover, we denote by $d^\mu_t(s_t)$ and $d^\pi_t(s_t)$ the induced state distributions at time $t$. When $t = 1$, the initial distributions are identical: $d^\mu_1 = d^\pi_1 = d_1$. For $t > 1$, $d^\mu_t(s_t)$ and $d^\pi_t(s_t)$ are functions not just of the policies themselves but also of the unknown underlying transition dynamics; i.e., for $\pi$ (and similarly for $\mu$), we recursively define

$$
d^\pi_t(s_t) = \sum_{s_{t-1}} P^\pi_t(s_t|s_{t-1})\, d^\pi_{t-1}(s_{t-1}), \quad \text{where } P^\pi_t(s_t|s_{t-1}) = \sum_{a_{t-1}} T_t(s_t|s_{t-1}, a_{t-1})\, \pi(a_{t-1}|s_{t-1}). \tag{2.1}
$$

We denote by $P^\pi_{i,j} \in \mathbb{R}^{S\times S}$, $\forall j < i$, the state-transition probability from step $j$ to step $i$ under a sequence of actions taken by $\pi$. Note that $P^\pi_{t+1,t}(s'|s) = \sum_a T_{t+1}(s'|s,a)\,\pi_t(a|s)$.

The behavior policy $\mu$ is used to collect data in the form of $(s^{(i)}_t, a^{(i)}_t, r^{(i)}_t) \in \mathcal{S}\times\mathcal{A}\times\mathbb{R}$ for time indices $t = 1, \dots, H$ and episode indices $i = 1, \dots, n$. The target policy $\pi$ is what we are interested in evaluating. Let $D$ denote the historical data, which contains $n$ episode trajectories in total, and define $D_h = \{(s^{(i)}_t, a^{(i)}_t, r^{(i)}_t) : i \in [n], t \le h\}$ to be the roll-in realization of the $n$ trajectories up to step $h$.

Throughout the paper, probability distributions are often used in their vector or matrix form. For instance, $d^\pi_t$ without an input is interpreted as a vector in the $S$-dimensional probability simplex, and $P^\pi_{i,j}$ is a stochastic transition matrix. This allows us to write (2.1) concisely as $d^\pi_{t+1} = P^\pi_{t+1,t}\, d^\pi_t$. Also, while $s_t, a_t, r_t$ usually denote fixed elements of $\mathcal{S}$, $\mathcal{A}$, and $\mathbb{R}$, in some cases we overload them to denote the generic random variables $s^{(1)}_t, a^{(1)}_t, r^{(1)}_t$. For example, $\mathbb{E}_\pi[r_t] = \mathbb{E}_\pi[r^{(1)}_t] = \sum_{s_t,a_t,s_{t+1}} d^\pi(s_t,a_t,s_{t+1})\, r_t(s_t,a_t,s_{t+1})$ and $\operatorname{Var}_\pi[r_t(s_t,a_t,s_{t+1})] = \operatorname{Var}_\pi[r_t(s^{(1)}_t, a^{(1)}_t, s^{(1)}_{t+1})]$. The distinction will be clear in each context.

Problem setup.
The problem of off-policy evaluation is about finding an estimator $\hat{v}^\pi : (\mathcal{S}\times\mathcal{A}\times\mathbb{R})^{H\times n} \to \mathbb{R}$ that makes use of the data collected by running $\mu$ to estimate

$$
v^\pi = \sum_{t=1}^H \sum_{s_t} d^\pi_t(s_t) \sum_{a_t} \pi(a_t|s_t) \sum_{s_{t+1}} T_t(s_{t+1}|s_t, a_t)\, r_t(s_t, a_t, s_{t+1}), \tag{2.2}
$$

where we assume knowledge of $\mu(a|s)$ and $\pi(a|s)$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$, but we observe only a noisy version of $r_t(s_t, a_t, s_{t+1})$ at the evaluated actions and nothing for other actions. Nor do we observe the state distributions $d^\pi_t(s_t)$, $\forall t > 1$, implied by the change of policies. Nonetheless, our goal is to find an estimator that minimizes the mean-square error (MSE): $\mathrm{MSE}(\pi, \mu, M) = \mathbb{E}_\mu[(\hat{v}^\pi - v^\pi)^2]$, using the observed data and the known action probabilities. Different from previous studies, we focus on the case where $S$ is sufficiently small but $S^2 A$ is too large for a reasonable sample size. In other words, this is a setting where we do not have enough data points to estimate the state-action-state transition dynamics, but we do observe the states and can estimate the distribution of the states after the change of policies, which is our main strategy.

Assumptions: We list the technical assumptions we need and provide the necessary justification.

A1. $\exists R_{\max}, \sigma < +\infty$ such that $0 \le \mathbb{E}[r_t|s_t, a_t, s_{t+1}] \le R_{\max}$ and $\operatorname{Var}[r_t|s_t, a_t, s_{t+1}] \le \sigma^2$ for all $t, s_t, a_t$.

A2. The behavior policy $\mu$ obeys $d_m := \min_{t,s_t} d^\mu_t(s_t) > 0$ for all $t, s_t$ such that $d^\pi_t(s_t) > 0$.

A3. Bounded weights: $\tau_s := \max_{t,s_t} \frac{d^\pi_t(s_t)}{d^\mu_t(s_t)} < +\infty$ and $\tau_a := \max_{t,s_t,a_t} \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)} < +\infty$.

Assumption A1 is assumed without loss of generality.
The $\sigma$ bound is required even for on-policy evaluation, and the assumptions of non-negativity and $R_{\max}$ can always be satisfied by shifting and rescaling the problem. Assumption A2 is necessary for any consistent off-policy evaluation estimator. Assumption A3 is also necessary for discrete states and actions, as otherwise the second moment of the importance weights would be unbounded. For continuous actions, $\tau_a < +\infty$ is stronger than we need and should be considered a simplifying assumption for the clarity of our presentation. Finally, we note that the parameters $d_m, \tau_s, \tau_a$ do not occur in the leading $O(1/n)$ term of our MSE bound, but only in simplified results after relaxation.

3 Marginalized Importance Sampling Estimators for OPE

In this section, we present the design of marginalized IS estimators for OPE. For small action spaces, we may directly build models by estimating the transition function $T_t(s_t|s_{t-1}, a_{t-1})$ and the reward function $r_t(s_t, a_t, s_{t+1})$ from empirical data. However, the models may be inaccurate in large action spaces, where not all actions are frequently visited, and function approximation in the models may cause additional biases from covariate shift due to the change of policies. Standard importance sampling estimators (including the doubly robust versions) [Dudík et al., 2011; Jiang and Li, 2016] avoid the need to estimate the model dynamics and instead directly approximate the expected reward:

$$
\hat{v}^\pi_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^n \sum_{h=1}^H \left[\prod_{t=1}^h \frac{\pi(a^{(i)}_t|s^{(i)}_t)}{\mu(a^{(i)}_t|s^{(i)}_t)}\right] r^{(i)}_h.
$$

To adjust for the differences in the policies, importance weights are used, and it can be shown that this is an unbiased estimator of $v^\pi$ (see a more detailed discussion of IS and the doubly robust version in Appendix C).
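To make the step-wise IS estimator above concrete, here is a minimal sketch for logged tabular trajectories. The function name, the array layout, and the toy data in the checks below are our own illustrative assumptions, not from the paper.

```python
import numpy as np

def stepwise_is(states, actions, rewards, pi, mu):
    """Step-wise importance sampling estimate of v^pi.

    states, actions, rewards: arrays of shape (n, H), logged under mu.
    pi, mu: callables (t, s, a) -> action probability.
    Each reward r_h is weighted by the cumulative ratio prod_{t<=h} pi/mu.
    """
    n, H = rewards.shape
    total = 0.0
    for i in range(n):
        w = 1.0  # cumulative importance weight for episode i
        for h in range(H):
            w *= pi(h, states[i, h], actions[i, h]) / mu(h, states[i, h], actions[i, h])
            total += w * rewards[i, h]
    return total / n
```

As a quick sanity check, when `pi == mu` every cumulative weight is 1 and the estimate reduces to the empirical mean return.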
The main issue of this approach, when applied to episodic MDPs with large action spaces, is that the variance of the importance weights grows exponentially in $H$ [Liu et al., 2018a], which makes the sample complexity exponentially worse than that of the model-based approaches, when they are applicable. We address this problem by proposing an alternative way of estimating the importance weights that achieves the same sample complexity as the model-based approaches, while retaining the flexibility and interpretability of the IS estimator, which does not explicitly require estimating the state-action dynamics $T_t$. We propose the Marginalized Importance Sampling (MIS) estimator:

$$
\hat{v}^\pi_{\mathrm{MIS}} = \frac{1}{n}\sum_{i=1}^n\sum_{t=1}^H \frac{\hat{d}^\pi_t(s^{(i)}_t)}{\hat{d}^\mu_t(s^{(i)}_t)}\, \hat{r}^\pi_t(s^{(i)}_t). \tag{3.1}
$$

Clearly, if $\hat{d}^\pi \to d^\pi_t$, $\hat{d}^\mu \to d^\mu_t$, and $\hat{r}^\pi_t \to \mathbb{E}_\pi[R_t(s_t,a_t)|s_t]$, then $\hat{v}^\pi_{\mathrm{MIS}} \to v^\pi$. It turns out that if we take $\hat{d}^\mu_t(s_t) := \frac{1}{n}\sum_i \mathbf{1}(s^{(i)}_t = s_t)$ --- the empirical mean --- and set $\hat{d}^\pi_t(s_t)/\hat{d}^\mu_t(s_t) = 0$ whenever $n_{s_t} = 0$, then (3.1) is equivalent to $\sum_{t=1}^H \sum_{s_t} \hat{d}^\pi_t(s_t)\, \hat{r}^\pi_t(s_t)$ --- the direct plug-in estimator of (2.2). It remains to specify $\hat{d}^\pi_t(s_t)$ and $\hat{r}^\pi_t(s_t)$. $\hat{d}^\pi_t(s_t)$ is estimated recursively via $\hat{d}^\pi_t = \hat{P}^\pi_t\, \hat{d}^\pi_{t-1}$, where

$$
\hat{P}^\pi_t(s_t|s_{t-1}) = \frac{1}{n_{s_{t-1}}}\sum_{i=1}^n \frac{\pi(a^{(i)}_{t-1}|s_{t-1})}{\mu(a^{(i)}_{t-1}|s_{t-1})}\, \mathbf{1}\big((s^{(i)}_{t-1}, s^{(i)}_t) = (s_{t-1}, s_t)\big), \qquad \hat{r}^\pi_t(s_t) = \frac{1}{n_{s_t}}\sum_{i=1}^n \frac{\pi(a^{(i)}_t|s_t)}{\mu(a^{(i)}_t|s_t)}\, r^{(i)}_t\, \mathbf{1}(s^{(i)}_t = s_t), \tag{3.2}
$$

and $n_{s_\tau}$ is the empirical visitation frequency of state $s_\tau$ at time $\tau$. Note that our estimator of $r^\pi_t(s_t)$ is the standard IS estimator used in bandits [Li et al., 2015], which is shown to be optimal (up to a universal constant) when $A$ is large [Wang et al., 2017].

The advantage of MIS over the naive IS estimator is that the variance of its importance weights need not depend exponentially on $H$. A major theoretical contribution of this paper is to formalize this argument by characterizing the dependence on $\pi$, $\mu$, as well as the parameters of the MDP $M$. Note that the MIS estimator does not dominate the IS estimator: in the more general setting where the state is given by the entire history of observations, Jiang and Li [2016] establish that no estimator can achieve polynomial dependence on $H$. We give a concrete example later (Example 1) of how the IS estimator suffers from the "curse of horizon" [Liu et al., 2018a]. The MIS estimator can be thought of as one that exploits state observability while retaining the properties of IS estimators that tackle the problem of large action spaces. As we illustrate in the experiments, the MIS estimator can be modified to naturally handle partially observed states, e.g., when $s$ is only observed every other step.

Finally, when available, model-based approaches can be combined with importance-weighted methods [Jiang and Li, 2016; Thomas and Brunskill, 2016]. We defer discussion of these extensions to Appendix C to stay focused on the scenarios where model-based approaches are not applicable.

4 Theoretical Analysis of the MIS Estimator

Motivated by the challenge of the curse of horizon with naive IS estimators, and similar to [Liu et al., 2018a], we show that the sample complexity of our MIS estimator reduces to $O(H^3)$.
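As a concrete companion to the recursion (3.1)-(3.2), the following self-contained sketch computes the MIS plug-in estimate on a toy tabular problem. The function name, the array layout, and the toy data in the check below are our own illustrative assumptions; states with zero visits simply contribute zero, mirroring the $n_{s_t} = 0$ convention above. A useful sanity property: when $\pi = \mu$, the recursion reproduces the empirical state marginals and the estimate reduces to the empirical mean return.

```python
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu, n_states):
    """Marginalized importance sampling (MIS) plug-in estimate of v^pi.

    states, actions: int arrays of shape (n, H); rewards: float array (n, H),
    all logged under the behavior policy mu. pi, mu: callables
    (t, s, a) -> action probability.
    """
    n, H = rewards.shape
    # Shared initial state distribution (t = 1 in the paper's indexing).
    d_pi = np.bincount(states[:, 0], minlength=n_states) / n
    v = 0.0
    for t in range(H):
        counts = np.bincount(states[:, t], minlength=n_states).astype(float)
        # Importance-weighted per-state reward estimate, as in (3.2).
        r_hat = np.zeros(n_states)
        for i in range(n):
            s = states[i, t]
            w = pi(t, s, actions[i, t]) / mu(t, s, actions[i, t])
            r_hat[s] += w * rewards[i, t]
        r_hat /= np.maximum(counts, 1.0)  # unvisited states stay 0
        v += float(d_pi @ r_hat)          # plug-in form of (3.1)
        if t + 1 < H:
            # Weighted transition estimate P_hat[s', s], then push d_pi forward.
            P_hat = np.zeros((n_states, n_states))
            for i in range(n):
                s, s_next = states[i, t], states[i, t + 1]
                w = pi(t, s, actions[i, t]) / mu(t, s, actions[i, t])
                P_hat[s_next, s] += w
            P_hat /= np.maximum(counts, 1.0)[None, :]
            d_pi = P_hat @ d_pi
    return v
```

This is a sketch for exposition, not an optimized implementation; in particular it loops over episodes where a vectorized version would aggregate with `np.add.at`.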
To the best of our knowledge, this is the first sample complexity guarantee under this setting, and it matches the Cramer-Rao lower bound for the DAG-MDP [Jiang and Li, 2016] as $n \to \infty$ up to a factor of $H$.

Example 1 (Curse of horizon). Consider an MDP with i.i.d. state transition models over time, and assume that $\pi_t/\mu_t$ is bounded from both sides for all $t$. Suppose the reward is a constant $1$ shown only at the last step, so that naive IS becomes $\hat{v}^\pi_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^n \prod_{t=1}^H \frac{\pi(a^{(i)}_t|s^{(i)}_t)}{\mu(a^{(i)}_t|s^{(i)}_t)}$. For every trajectory, $\prod_{t=1}^H \frac{\pi_t}{\mu_t} = \exp\big(\sum_{t=1}^H \log\frac{\pi_t}{\mu_t}\big)$; let $E_{\log} = \mathbb{E}[\log\frac{\mu_t}{\pi_t}]$ and $V_{\log} = \operatorname{Var}[\log\frac{\pi_t}{\mu_t}]$. By the Central Limit Theorem, $\sum_{t=1}^H \log\frac{\pi_t}{\mu_t}$ asymptotically follows a normal distribution with parameters $(-H E_{\log}, H V_{\log})$. In other words, $\prod_{t=1}^H \frac{\pi_t}{\mu_t}$ asymptotically follows $\mathrm{LogNormal}(-H E_{\log}, H V_{\log})$, whose variance is exponential in the horizon: $\big(\exp(H V_{\log}) - 1\big)$. On the other hand, MIS estimates the state distributions recursively, yielding a variance that is polynomial in the horizon and small OPE errors.

We now formalize the sample complexity bound in Theorem 4.1.

Theorem 4.1. Let the value function under $\pi$ be defined as follows:

$$
V^\pi_h(s_h) := \mathbb{E}_\pi\left[\sum_{t=h}^H r_t(s^{(1)}_t, a^{(1)}_t, s^{(1)}_{t+1}) \,\middle|\, s^{(1)}_h = s_h\right] \in [0, V_{\max}], \quad \forall h \in \{1, 2, \dots, H\}.
$$

For the simplicity of the statement, define the boundary conditions: $r_0(s_0) \equiv 0$, $\sigma_0(s_0, a_0) \equiv 0$, $\frac{d^\pi_0(s_0)}{d^\mu_0(s_0)} \equiv 1$, $\frac{\pi(a_0|s_0)}{\mu(a_0|s_0)} \equiv 1$, and $V^\pi_{H+1} \equiv 0$. Moreover, let $\tau_a := \max_{t,s_t,a_t} \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}$ and $\tau_s := \max_{t,s_t} \frac{d^\pi_t(s_t)}{d^\mu_t(s_t)}$. If the number of episodes $n$ obeys

$$
n > \max\left\{\frac{16\log n}{\min_{t,s_t} d^\mu_t(s_t)},\; \frac{4t\,\tau_a\tau_s}{\min_{t,s_t}\max\{d^\pi_t(s_t), d^\mu_t(s_t)\}}\right\}
$$

for all $t = 2, \dots, H$, then our estimator $\hat{v}^\pi_{\mathrm{MIS}}$ with an additional clipping step obeys

$$
\mathbb{E}[(\mathcal{P}\hat{v}^\pi_{\mathrm{MIS}} - v^\pi)^2] \le \frac{1}{n}\sum_{h=0}^H\sum_{s_h}\frac{d^\pi_h(s_h)^2}{d^\mu_h(s_h)} \operatorname{Var}_\mu\left[\frac{\pi(a^{(1)}_h|s_h)}{\mu(a^{(1)}_h|s_h)}\big(V^\pi_{h+1}(s^{(1)}_{h+1}) + r^{(1)}_h\big)\,\middle|\, s^{(1)}_h = s_h\right]\cdot\left(1 + \sqrt{\frac{16\log n}{n\min_{t,s_t} d^\mu_t(s_t)}}\right) + \frac{19\,\tau_a^2\tau_s^2\, S H^2(\sigma^2 + R_{\max}^2 + V_{\max}^2)}{n^2}.
$$

Corollary 1. In the familiar setting when $V_{\max} = H R_{\max}$, the same conditions as in Theorem 4.1 imply that:

$$
\mathbb{E}[(\mathcal{P}\hat{v}^\pi_{\mathrm{MIS}} - v^\pi)^2] \le \frac{4}{n}\tau_a\tau_s\big(H\sigma^2 + H^3 R_{\max}^2\big).
$$

We make a few remarks about the results in Theorem 4.1.

Dependence on $S$, $A$ and the weights. The leading term in the variance bound very precisely calculates the MSE of a clipped version of our estimator(1), modulo a $(1 + O(n^{-1/2}))$ multiplicative factor and an $O(1/n^2)$ additive factor. Specifically, our bound does not explicitly depend on $S$ and $A$ but instead on how similar $\pi$ and $\mu$ are. This allows the method to handle the case when the action space is continuous. The dependence on $\tau_a, \tau_s$ only appears in the low-order terms, while the leading term depends only on the second moments of the importance weights.

Dependence on $H$. In general, our sample complexity upper bound is proportional to $H^3$, as Corollary 1 indicates. Our bound also reveals that in several cases it is possible to achieve a smaller exponent on $H$ for specific triplets $(M, \pi, \mu)$. For instance, when $\pi \approx \mu$, such that $\tau_a, \tau_s = 1 + O(1/H)$, the variance bound gives $O((V_{\max}^2 + H\sigma^2)/n)$ or $O((H^2 R_{\max}^2 + H\sigma^2)/n)$, which matches the MSE bound (up to a constant) of the simple-averaging estimator that knows $\pi = \mu$ a priori (see Remark 3 in the Appendix for more details). If $V_{\max}$ is a constant that does not depend on $H$ (this is often the case in games with a fixed reward at the end), then the sample complexity is only $O(H)$.

(1) The clipping step to $[0, H R_{\max}]$ or $[0, V_{\max}]$ should not be alarming. It is required only for technical reasons, and the clipped estimator is a valid estimator to begin with. Since the true policy value must be within this range, the clipping step can only improve the MSE.

Optimality. Compare with the Cramer-Rao lower bound of Theorem 3 in [Jiang and Li, 2016], which we paraphrase below:

$$
\frac{1}{n}\sum_{h=1}^H\sum_{s_h}\frac{d^\pi_h(s_h)^2}{d^\mu_h(s_h)}\sum_{a_h}\frac{\pi_h(a_h|s_h)^2}{\mu_h(a_h|s_h)}\operatorname{Var}\left[V^\pi_{h+1}(s^{(1)}_{h+1}) + r^{(1)}_h \,\middle|\, s^{(1)}_h = s_h, a^{(1)}_h = a_h\right]. \tag{4.1}
$$

The MSE of our estimator is asymptotically larger by an additive factor of

$$
\frac{1}{n}\sum_{h=1}^H\sum_{s_h}\frac{d^\pi_h(s_h)^2}{d^\mu_h(s_h)}\operatorname{Var}_\mu\left[\frac{\pi_h(a^{(1)}_h|s_h)}{\mu_h(a^{(1)}_h|s_h)}\, Q^\pi_h(s_h, a^{(1)}_h)\,\middle|\, s^{(1)}_h = s_h\right], \tag{4.2}
$$

where $Q^\pi_h(s_h, a_h) := \mathbb{E}\big[V^\pi_{h+1}(s^{(1)}_{h+1}) + r^{(1)}_h \,\big|\, s^{(1)}_h = s_h, a^{(1)}_h = a_h\big]$ is the standard Q-function of the MDP. The gap is significant, as the CR lower bound (4.1) itself only has a worst-case bound of $H^2\tau_s\tau_a/n$,(2) while (4.2) is proportional to $H^3\tau_s\tau_a/n$.
This implies that our estimator is optimal up to a factor of $H$; see Remark 4 in the appendix for more details. It is an intriguing open question whether this additional factor of $H$ can be removed. Our conjecture is that the answer is negative, and that what we established in Theorem 4.1 matches the correct information-theoretic limit for any method in the case when the action space $\mathcal{A}$ is continuous (or significantly larger than $n$). This conjecture is consistent with an existing lower bound in the simpler contextual bandit setting, where Wang et al. [2017] established that a variance-of-expectation term analogous to the one above cannot be removed, and that no estimator can asymptotically attain the CR lower bound for all problems in the large state/action space setting.

4.1 Proof Sketch

In this section, we briefly describe the main technical components in the proof of Theorem 4.1. More detailed arguments are deferred to the full proof in Appendix B.

Recall that (3.1) is equivalent to $\sum_{t=1}^H \sum_{s_t} \hat{d}^\pi_t(s_t)\, \hat{r}^\pi_t(s_t)$, where $\hat{r}^\pi_t(s_t)$ is estimated with importance sampling and $\hat{d}^\pi_t(s_t)$ is recursively estimated using $\hat{d}^\pi_{t-1}(s_{t-1})$ and the importance sampling estimator of the transition matrix $P^\pi_t(s_t|s_{t-1})$ under $\pi$. While the MIS estimator is easy to state, it is not straightforward to analyze. We highlight three challenges below.

1. Dependent data and a complex estimator: While the episodes are independent, the data within each episode are not. Each time step of our MIS estimator uses the data from all episodes up to that time step.

2. An annoying bias: There is a non-zero probability that some state $s_t$ at time $t$ is not visited at all in any of the $n$ episodes. This creates a bias in the estimator of $\hat{d}^\pi_h$ for all times $h > t$.
While the probability of this happening is extremely small, conditioning on the high-probability event breaks the natural conditional independences, which makes the analysis hard.

3. Error propagation: The recursive estimator $\hat{d}^\pi_t$ is affected by all estimation errors in earlier time steps. A naive calculation of the error with a constant slack in each step can lead to a "snowball" effect that causes an exponential blow-up.

All these issues require delicate handling, because otherwise the MSE calculation will not be tight. Our solutions are as follows.

Defining the appropriate filtration. The first observation is that we need a convenient representation of the data. Instead of considering the $n$ episodes as independent trajectories, it is more useful to think of them all together as a Markov chain of multi-dimensional observations of $n$ state, action, reward triplets. Specifically, we define the "cumulative" data up to time $t$ by $\mathrm{Data}_t := \big\{s^{(i)}_{1:t}, a^{(i)}_{1:t-1}, r^{(i)}_{1:t-1}\big\}_{i=1}^n$. We also observe that the state of the Markov chain at time $t$ can be summarized by $n_{s_t}$ --- the number of times state $s_t$ is visited.

(2) This is somewhat surprising, as each of the $H$ summands in the expression can be as large as $H^2$.

Fictitious estimator technique. We address the bias issue by defining a fictitious estimator $\tilde{v}^\pi$. The fictitious estimator is constructed by using, instead of $\hat{d}^\pi_t$ and $\hat{r}^\pi_t$, the fictitious versions of these estimators $\tilde{d}^\pi_t$ and $\tilde{r}^\pi_t$, where $\tilde{d}^\pi_t$ is constructed recursively using

$$
\tilde{d}^\pi_t(s_t) = \sum_{s_{t-1}} \tilde{P}^\pi(s_t|s_{t-1})\, \tilde{d}^\pi_{t-1}(s_{t-1}).
$$

The key difference is that whenever $n_{s_t} < \mathbb{E}_\mu[n_{s_t}](1 - \delta)$ for some $0 < \delta < 1$, we assign $\tilde{P}^\pi(s_{t+1}|s_t) = P^\pi(s_{t+1}|s_t)$ and $\tilde{r}^\pi(s_t) = \mathbb{E}_\pi[r_t|s_t]$ --- the true values of interest. This ensures that the fictitious estimator is always unbiased (see Lemma B.2). Note that this fictitious estimator cannot be implemented in practice; it is used as a purely theoretical construct that simplifies the analysis of the (biased) MIS estimator. In Lemma B.1, we show that $\tilde{v}^\pi$ and $\hat{v}^\pi$ are exponentially close to each other.

Disentangling the dependency by backwards peeling. The fictitious estimator technique reduces the problem of calculating the MSE of the MIS estimator to a variance analysis of the fictitious estimator. By recursively applying the law of total variance backwards, peeling one item at a time from $\mathrm{Data}_t$, we establish an exact linear decomposition of the variance of the fictitious estimator (Lemma B.3):

$$
\operatorname{Var}[\tilde{v}^\pi] = \sum_{h=0}^H\sum_{s_h} \mathbb{E}\left[\frac{\tilde{d}^\pi_h(s_h)^2}{n_{s_h}}\, \mathbf{1}\big(n_{s_h} \ge n d^\mu_h(s_h)(1-\delta)\big)\right] \operatorname{Var}_\mu\left[\frac{\pi(a^{(1)}_h|s_h)}{\mu(a^{(1)}_h|s_h)}\big(V^\pi_{h+1}(s^{(1)}_{h+1}) + r^{(1)}_h\big)\,\middle|\, s^{(1)}_h = s_h\right].
$$

Observe that the value function $V^\pi_t$ shows up naturally.
This novel decomposition can be thought of as a generalization of the celebrated Bellman equation for the variance [Sobel, 1982] to the off-policy, episodic MDP setting with a finite sample, and can be of independent interest.

Characterizing the error propagation in $\tilde{d}^\pi_h(s_h)$. Lastly, we bound the error term in the state distribution estimation as follows:

$$
\mathbb{E}\left[\frac{\tilde{d}^\pi_h(s_h)^2}{n_{s_h}}\,\mathbf{1}\big(n_{s_h} \ge n d^\mu_h(s_h)(1-\delta)\big)\right] \le \frac{(1-\delta)^{-1}}{n\, d^\mu_h(s_h)}\Big(d^\pi_h(s_h)^2 + \operatorname{Var}\big[\tilde{d}^\pi_h(s_h)\big]\Big),
$$

which reduces the problem to bounding $\operatorname{Var}[\tilde{d}^\pi_h(s_h)]$. We show (in Theorem B.1) that, instead of the exponential blow-up that a concentration-inequality-based argument would imply, the variance increases at most linearly in $h$: $\operatorname{Var}[\tilde{d}^\pi_h(s_h)] \le \frac{2(1-\delta)^{-1} h}{n}\cdot\frac{d^\pi_h(s_h)^2}{d^\mu_h(s_h)}$. The proof uses a novel decomposition of $\operatorname{Cov}(\tilde{d}^\pi_h)$ (Lemma B.5), which is derived using a similar backwards peeling argument as before. Finally, Theorem 4.1 is established by appropriately choosing $\delta = O\big(\sqrt{\log n / (n \min_{t,s_t} d^\mu_t(s_t))}\big)$.

Due to space limits, we can only highlight a few key elements of the proof. We invite the readers to check out a more detailed exposition in Appendix B.

5 Experiments

Throughout this section, we present empirical results to illustrate the comparison among different estimators. We demonstrate the effectiveness of our proposed marginalized estimator by comparing it with different classic estimators on several domains. The methods we compare in this section are: the direct method (DM), importance sampling (IS), weighted importance sampling (WIS), importance sampling with stationary state distribution (SSD-IS), and marginalized importance sampling (MIS).
DM uses the model-based approach to estimate $T_t(s_t|s_{t-1}, a_{t-1})$ and $r_t(s_t, a_t)$ by enumerating all tuples of $(s_{t-1}, a_{t-1}, s_t)$; IS is the step-wise importance sampling method; WIS uses the step-wise weighted (self-normalized) importance sampling method; SSD-IS denotes the method of importance sampling with the stationary state distribution proposed by [Liu et al., 2018a] (our implementation of SSD-IS for the discrete-state case is described in Appendix D.3); and MIS is our proposed marginalized method. Note that our MIS also uses the trick of self-normalization to obtain better performance, but the MIS normalization is different: we normalize the estimate $\hat{d}^\pi_t$ to the probability simplex, whereas WIS normalizes the importance weights. We provide further results comparing the doubly robust estimator, the weighted doubly robust estimator, and our estimators in Appendix D. We use logarithmic scales in all figures and include 95% confidence intervals as error bars from 128 runs. Our metric is the relative root mean squared error (Relative-RMSE), the ratio of the RMSE to the true value $v^\pi$.

Time-invariant MDPs. We first test our methods on the standard ModelWin and ModelFail models with time-invariant MDPs, first introduced by Thomas and Brunskill [2016]. The ModelWin domain simulates a fully observable MDP, depicted in Figure 1(a). On the other hand, the ModelFail domain (Figure 1(b)) simulates a partially observable MDP, where the agent can only tell the difference between $s_1$ and the "other" unobservable states. A detailed description of these two domains can be found in Appendix D. For both problems, the target policy $\pi$ always selects $a_1$ and $a_2$ with probabilities 0.2 and 0.8, respectively, and the behavior policy $\mu$ is a uniform policy.

We provide two types of experiments to show the properties of our marginalized approach.
The first kind is with different numbers of episodes, where we use a fixed horizon $H = 50$. The second kind is with different horizons, where we use a fixed number of episodes $n = 1024$. We use MIS only with the observable states and the partial trajectories between them. Details about applying MIS with partial observability can be found in Appendix C. While this approach is general in more complex applications, for ModelFail the agent always visits $s_1$ at every other step, and we can simply replace $\frac{\pi(a^{(i)}_t|s^{(i)}_t)}{\mu(a^{(i)}_t|s^{(i)}_t)}$ with $\frac{\pi(a^{(i)}_{2\tau-1}|s^{(i)}_{2\tau-1})}{\mu(a^{(i)}_{2\tau-1}|s^{(i)}_{2\tau-1})} \cdot \frac{\pi(a^{(i)}_{2\tau}|s^{(i)}_{2\tau}=?)}{\mu(a^{(i)}_{2\tau}|s^{(i)}_{2\tau}=?)}$ for $t = 2\tau - 1$ in (3.2).

Figure 1: MDPs of OPE domains. (a) ModelWin. (b) ModelFail.

Figure 2: Results on time-invariant MDPs. (a) ModelWin with different numbers of episodes $n$. (b) ModelWin with different horizons $H$. (c) ModelFail with different numbers of episodes $n$. (d) ModelFail with different horizons $H$. MIS matches DM on ModelWin and outperforms IS/WIS on ModelFail, both of which are the best existing methods on their respective domains.

Figure 2 shows the results on the time-invariant ModelWin and ModelFail MDPs. The results clearly demonstrate that MIS maintains a polynomial dependence on $H$ and matches the best alternatives, such as DM in Figure 2(b) and IS at the beginning of Figure 2(d). Notably, the IS curve in Figure 2(d) reflects a bias-variance trade-off: its RMSE is smaller at short horizons due to unbiasedness, yet larger at long horizons due to high variance.

Time-varying, non-mixing MDPs with continuous actions. We also test our approach in simulated MDP environments where the states are binary, the actions are continuous on $[0, 1]$, and the state transition models are time-varying with a finite horizon $H$.
The agent starts at State 1. At every step, the environment samples a random parameter $p \in [0.5/H, 0.5 - 0.5/H]$. Any agent in State 1 transitions to State 0 if and only if it samples an action in $[p - 0.5/H, p + 0.5/H]$. State 0, on the other hand, is a sinking state. The agent collects rewards at State 0 in the latter half of the steps ($t \ge H/2$). Thus, the agent wants to transition to State 0, but the transition probability is inversely proportional to the horizon $H$ for uniform action policies. We pick the behavior policy to be uniform on $[0, 1]$ and the target policy to be uniform on $[0, 0.5]$ with 95% total probability, with the remaining 5% uniformly distributed on $[0.5, 1]$.

Figure 3(a) shows the asymptotic convergence rates of RMSE with respect to the number of episodes, given a fixed horizon $H = 64$. MIS converges at a $O(1/\sqrt{n})$ rate from the very beginning. In comparison, neither IS nor WIS has entered its asymptotic $n^{-1/2}$ regime with $n \le 4{,}096$. SSD-IS does not improve as $n$ gets larger, because the stationary state distribution (a point mass on State 0) is not a good approximation of the average probability of visiting State 0 for $t \in [H/2, H]$. We exclude DM because it requires additional model assumptions to apply to continuous action spaces.

Figure 3(b) shows the Relative-RMSE dependency on $H$, fixing the number of episodes $n = 1024$. We see that as $H$ gets larger, the Relative-RMSE scales as $O(\sqrt{H})$ for MIS and stays roughly constant for SSD-IS. Since the true reward $v^\pi \propto H$, the result matches the worst-case bound of an $O(H^3)$ MSE in Corollary 1.
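The translation from the measured Relative-RMSE scaling to the MSE bound is a short calculation: with $\mathrm{Relative\text{-}RMSE} = \mathrm{RMSE}/v^\pi \propto \sqrt{H}$ and $v^\pi \propto H$,

```latex
\mathrm{MSE} = \mathrm{RMSE}^2
  = \big(\mathrm{Relative\text{-}RMSE} \times v^\pi\big)^2
  \propto \big(\sqrt{H} \cdot H\big)^2 = H^3,
```

consistent with the worst-case $O(H^3)$ MSE of Corollary 1.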
SSD-IS saves a factor of $H$ in variance, as it marginalizes over the $H$ steps, but it introduces a large bias, as we have seen in Figure 3(a). IS and WIS work better for small $H$, but quickly deteriorate as $H$ increases. Together with Figure 3(a), we conclude that MIS is the only method, among the alternatives in this example, that produces a consistent estimator with low variance.

Figure 3: Time-varying MDPs. (a) Dependency on $n$. (b) Dependency on $H$.

Mountain Car. Finally, we benchmark our estimator on the Mountain Car domain [Singh and Sutton, 1996], where an under-powered car drives up a steep valley by "swinging" on both sides to gradually build up potential energy. To construct the stochastic behavior policy $\mu$ and the stochastic evaluated policy $\pi$, we first compute the optimal Q-function using Q-learning and use its softmax policy as the evaluated policy $\pi$ (with a temperature of 1). For the behavior policy $\mu$, we also use the softmax policy of the optimal Q-function, but set the temperature to 1.25. Note that this is a finite-horizon MDP with continuous states. We apply MIS by discretizing the state space as in [Jiang and Li, 2016].

The results, shown in Figure 4, demonstrate the effectiveness of our approach in a common benchmark control task, where the ability to evaluate under long horizons is required for success. Note that Mountain Car is an episodic environment with an absorbing state, so it is not a setting that SSD-IS is designed for. We include a detailed description of the experimental setup and a discussion of the results in Appendix D.

Figure 4: Mountain Car with different numbers of episodes.

6 Conclusions

In this paper, we propose a marginalized importance sampling (MIS) method for the problem of off-policy evaluation in reinforcement learning.
Our approach avoids the exponential dependence on the horizon by using an estimated marginal state distribution of the target policy at every step, instead of the cumulative product of importance weights.

Compared to the pioneering work of Liu et al. [2018a], which uses a similar philosophy, this paper focuses on the finite-state episodic setting with a potentially infinite action space. We prove the first finite-sample error bound for such estimators with polynomial dependence on all parameters. The error bound is tight in that it matches the asymptotic variance of a fictitious estimator that has access to oracle information, up to a low-order additive factor. Moreover, it is within a factor of $O(H)$ of the Cramer-Rao lower bound of this problem in [Jiang and Li, 2016]. We conjecture that this additional factor of $H$ is required for any estimator in the infinite-action setting.

Our experiments demonstrate that the MIS estimator is effective in practice, as it achieves substantially better performance than existing approaches in a number of benchmarks.

Acknowledgement

The authors thank Yu Bai, Murali Narayanaswamy, Lin F. Yang, Nan Jiang, Phil Thomas, and Ying Yang for helpful discussions, and the Amazon internal review committee for feedback on an early version of the paper. We also acknowledge the NeurIPS area chair and anonymous reviewers for helpful comments, and Ming Yin for carefully proofreading the paper.

YW was supported by a start-up grant from the UCSB CS department, NSF-OAC 1934641, and a gift from the AWS ML Research Award.

References

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D.
M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207-3260.

Chapelle, O., Manavoglu, E., and Rosales, R. (2015). Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61.

Chernoff, H. et al. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493-507.

Dudík, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In International Conference on Machine Learning, pages 1097-1104. Omnipress.

Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. (2006). Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. In Decision and Control, 2006 45th IEEE Conference on, pages 667-672. IEEE.

Farajtabar, M., Chow, Y., and Ghavamzadeh, M. (2018). More robust doubly robust off-policy evaluation. In International Conference on Machine Learning (ICML-18), volume 80, pages 1447-1456, Stockholmsmässan, Stockholm, Sweden. PMLR.

Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI Conference on Artificial Intelligence (AAAI-19), volume 33, pages 3647-3655.

Gottesman, O., Liu, Y., Sussex, S., Brunskill, E., and Doshi-Velez, F. (2019). Combining parametric and nonparametric models for off-policy evaluation. In International Conference on Machine Learning (ICML-19).

Guo, Z., Thomas, P. S., and Brunskill, E. (2017). Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems (NIPS-17), pages 2492-2501.

Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. In International Conference on Machine Learning (ICML-17), pages 1372-1383. JMLR.org.

Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161-1189.

Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning (ICML-16), pages 652-661. JMLR.org.

Li, L., Munos, R., and Szepesvari, C. (2015). Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics (AISTATS-15), pages 608-616.

Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018a). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems (NeurIPS-18), pages 5361-5371.

Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., and Brunskill, E. (2018b). Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems (NeurIPS-18), pages 2649-2658.

Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., and Popovic, Z. (2014). Offline policy evaluation across representations with applications to educational games. In International Conference on Autonomous Agents and Multi-agent Systems, pages 1077-1084. International Foundation for Autonomous Agents and Multiagent Systems.

Murphy, S. A., van der Laan, M. J., Robins, J. M., and Group, C. P. P. R. (2001). Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410-1423.

Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning (ICML-00), pages 759-766. Morgan Kaufmann Publishers Inc.

Raghu, A., Komorowski, M., Celi, L. A., Szolovits, P., and Ghassemi, M. (2017). Continuous state-space models for optimal sepsis treatment: A deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147-163.

Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123-158.

Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794-802.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.

Tang, L., Rosales, R., Singh, A., and Agarwal, D. (2013). Automatic ad format selection via contextual bandits. In ACM International Conference on Information & Knowledge Management (CIKM-13), pages 1587-1594. ACM.

Theocharous, G., Thomas, P. S., and Ghavamzadeh, M. (2015). Personalized ad recommendation systems for life-time value optimization with guarantees. In International Joint Conference on Artificial Intelligence (IJCAI-15), pages 1806-1812.

Thomas, P. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning (ICML-16), pages 2139-2148.

Thomas, P. S. (2015). Safe Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. (2015). High-confidence off-policy evaluation. In AAAI Conference on Artificial Intelligence (AAAI-15), pages 3000-3006.

Thomas, P. S., Theocharous, G., Ghavamzadeh, M., Durugkar, I., and Brunskill, E. (2017). Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In AAAI Conference on Artificial Intelligence (AAAI-17), pages 4740-4745.

Wang, Y.-X., Agarwal, A., and Dudík, M. (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning (ICML-17), pages 3589-3597.