{"title": "DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections", "book": "Advances in Neural Information Processing Systems", "page_first": 2318, "page_last": 2328, "abstract": "In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, our algorithm eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.", "full_text": "DualDICE: Behavior-Agnostic Estimation of\nDiscounted Stationary Distribution Corrections\n\nO\ufb01r Nachum\u2217\n\nYinlam Chow\u2217\n\nBo Dai\n\nLihong Li\n\n{ofirnachum,yinlamchow,bodai,lihong}@google.com\n\nGoogle Research\n\nAbstract\n\nIn many real-world reinforcement learning applications, access to the environ-\nment is limited to a \ufb01xed dataset, instead of direct (online) interaction with the\nenvironment. 
When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios (correction terms which quantify the likelihood that the new policy will experience a certain state-action pair, normalized by the probability with which the state-action pair appears in the dataset) can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, it eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.\u00b9\n\n1 Introduction\n\nReinforcement learning (RL) has recently demonstrated a number of successes in various domains, such as games [25], robotics [1], and conversational systems [11, 18]. These successes have often hinged on the use of simulators to provide large amounts of experience necessary for RL algorithms. While this is reasonable in game environments, where the game is often a simulator itself, and some simple real-world tasks can be simulated to an accurate enough degree, in general one does not have such direct or easy access to the environment. Furthermore, in many real-world domains such as medicine [26], recommendation [19], and education [24], the deployment of a new policy, even just for the sake of performance evaluation, may be expensive and risky. 
In these applications, access to the environment is usually in the form of off-policy data [39], logged experience collected by potentially multiple and possibly unknown behavior policies.\n\nState-of-the-art methods which consider this more realistic setting, either for policy evaluation or policy improvement, often rely on estimating (discounted) stationary distribution ratios or corrections. For each state and action in the environment, these quantities measure the likelihood that one\u2019s current target policy will experience the state-action pair normalized by the probability with which the state-action pair appears in the off-policy data. Proper estimation of these ratios can improve the accuracy of policy evaluation [21] and the stability of policy learning [12, 14, 22, 40]. In general, these ratios are difficult to compute, let alone estimate, as they rely not only on the probability that the target policy will take the desired action at the relevant state, but also on the probability that the target policy\u2019s interactions with the environment dynamics will lead it to the relevant state.\n\n\u2217 Equal contribution.\n\u00b9 Find code at https://github.com/google-research/google-research/tree/master/dual_dice.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSeveral methods to estimate these ratios have been proposed recently [12, 14, 21], all based on the steady-state property of stationary distributions of Markov processes [15]. This property may be expressed locally with respect to state-action-next-state tuples, and is therefore amenable to stochastic optimization algorithms. However, these methods possess several issues when applied in practice: First, these methods require knowledge of the probability distribution used for each sampled action appearing in the off-policy data. 
In practice, these probabilities are usually not known and difficult to estimate, especially in the case of multiple, non-Markovian behavior policies. Second, the loss functions of these algorithms involve per-step importance ratios (the ratio of action sample probability with respect to the target policy versus the behavior policy). Depending on how far the behavior policy is from the target policy, these quantities may have large variance, and thus have a detrimental effect on stochastic optimization algorithms.\n\nIn this work, we propose Dual stationary DIstribution Correction Estimation (DualDICE), a new method for estimating discounted stationary distribution ratios. It is agnostic to the number or type of behavior policies used for collecting the off-policy data. Moreover, the objective function of our algorithm does not involve any per-step importance ratios, and so our solution is less likely to be affected by their high variance. We provide theoretical guarantees on the convergence of our algorithm and evaluate it on a number of off-policy policy evaluation benchmarks. We find that DualDICE can consistently, and often significantly, improve performance compared to previous algorithms for estimating stationary distribution ratios.\n\n2 Background\n\nWe consider a Markov Decision Process (MDP) setting [32], in which the environment is specified by a tuple M = \u27e8S, A, R, T, \u03b2\u27e9, consisting of a state space, an action space, a reward function, a transition probability function, and an initial state distribution. A policy \u03c0 interacts with the environment iteratively, starting with an initial state s_0 \u223c \u03b2. At step t = 0, 1, \u2026, the policy produces a distribution \u03c0(\u00b7|s_t) over the actions A, from which an action a_t is sampled and applied to the environment. The environment stochastically produces a scalar reward r_t \u223c R(s_t, a_t) and a next state s_{t+1} \u223c T(s_t, a_t). 
In this work, we consider infinite-horizon environments and the \u03b3-discounted reward criterion for \u03b3 \u2208 [0, 1). It is clear that any finite-horizon environment may be interpreted as infinite-horizon by considering an augmented state space with an extra terminal state which continually loops onto itself with zero reward.\n\n2.1 Off-Policy Policy Evaluation\n\nGiven a target policy \u03c0, we are interested in estimating its value, defined as the normalized expected per-step reward obtained by following the policy:\n\n$$\\rho(\\pi) := (1-\\gamma)\\cdot\\mathbb{E}\\left[\\sum_{t=0}^{\\infty}\\gamma^t r_t \\;\\middle|\\; s_0\\sim\\beta,\\ \\forall t,\\ a_t\\sim\\pi(s_t),\\ r_t\\sim R(s_t,a_t),\\ s_{t+1}\\sim T(s_t,a_t)\\right]. \\tag{1}$$\n\nThe off-policy policy evaluation (OPE) problem studied here is to estimate \u03c1(\u03c0) using a fixed set D of transitions (s, a, r, s') sampled in a certain way. This is a very general scenario: D can be collected by a single behavior policy (as in most previous work), multiple behavior policies, or an oracle sampler, among others. In the special case where D contains entire trajectories collected by a known behavior policy \u00b5, one may use importance sampling (IS) to estimate \u03c1(\u03c0). Specifically, given a finite-length trajectory \u03c4 = (s_0, a_0, r_0, \u2026, s_H) collected by \u00b5, the IS estimate of \u03c1(\u03c0) based on \u03c4 is given by [31]:\n\n$$(1-\\gamma)\\left(\\prod_{t=0}^{H-1}\\frac{\\pi(a_t|s_t)}{\\mu(a_t|s_t)}\\right)\\left(\\sum_{t=0}^{H-1}\\gamma^t r_t\\right).$$\n\nAlthough many improvements exist [e.g., 9, 16, 31, 43], importance-weighting the entire trajectory can suffer from exponentially high variance, which is known as \u201cthe curse of horizon\u201d [20, 21].\n\nTo avoid exponential dependence on trajectory length, one may weight the states by their long-term occupancy measure. First, observe that the policy value may be re-expressed as,\n\n$$\\rho(\\pi) = \\mathbb{E}_{(s,a)\\sim d^\\pi,\\ r\\sim R(s,a)}[r],$$\n\nwhere\n\n$$d^\\pi(s,a) := (1-\\gamma)\\sum_{t=0}^{\\infty}\\gamma^t\\,\\Pr\\left(s_t=s,\\ a_t=a \\;\\middle|\\; s_0\\sim\\beta,\\ \\forall t,\\ a_t\\sim\\pi(s_t),\\ s_{t+1}\\sim T(s_t,a_t)\\right), \\tag{2}$$\n\nis the normalized discounted stationary distribution over state-actions with respect to \u03c0. One may define the discounted stationary distribution over states analogously, and we slightly abuse notation by denoting it as d^\u03c0(s); note that d^\u03c0(s, a) = d^\u03c0(s)\u03c0(a|s). If D consists of trajectories collected by a behavior policy \u00b5, then the policy value may be estimated as,\n\n$$\\rho(\\pi) = \\mathbb{E}_{(s,a)\\sim d^\\mu,\\ r\\sim R(s,a)}\\left[w_{\\pi/\\mu}(s,a)\\cdot r\\right],$$\n\nwhere w_{\u03c0/\u00b5}(s, a) = d^\u03c0(s, a) / d^\u00b5(s, a) is the discounted stationary distribution correction. The key challenge is in estimating these correction terms using data drawn from d^\u00b5.\n\n2.2 Learning Stationary Distribution Corrections\n\nWe provide a brief summary of previous methods for estimating the stationary distribution corrections. The ones that are most relevant to our work are a suite of recent techniques [12, 14, 21], which are all essentially based on the following steady-state property of stationary Markov processes:\n\n$$d^\\pi(s') = (1-\\gamma)\\beta(s') + \\gamma\\sum_{s\\in S}\\sum_{a\\in A} d^\\pi(s)\\,\\pi(a|s)\\,T(s'|s,a), \\quad \\forall s'\\in S, \\tag{3}$$\n\nwhere we have simplified the identity by restricting to discrete state and action spaces. This identity simply reflects the conservation of flow of the stationary distribution: At each timestep, the flow out of s' (the LHS) must equal the flow into s' (the RHS). Given a behavior policy \u00b5, equation 3 can be equivalently rewritten in terms of the stationary distribution corrections, i.e., for any given s' \u2208 S,\n\n$$\\mathbb{E}_{(s_t,a_t,s_{t+1})\\sim d^\\mu}\\left[\\mathrm{TD}(s_t,a_t,s_{t+1}\\mid w_{\\pi/\\mu}) \\;\\middle|\\; s_{t+1}=s'\\right] = 0, \\tag{4}$$\n\nwhere\n\n$$\\mathrm{TD}(s,a,s'\\mid w_{\\pi/\\mu}) := -w_{\\pi/\\mu}(s') + (1-\\gamma)\\beta(s') + \\gamma\\, w_{\\pi/\\mu}(s)\\cdot\\frac{\\pi(a|s)}{\\mu(a|s)},$$\n\nprovided that \u00b5(a|s) > 0 whenever \u03c0(a|s) > 0. The quantity TD can be viewed as a temporal difference associated with w_{\u03c0/\u00b5}. Accordingly, previous works optimize loss functions which minimize this TD error using samples from d^\u00b5. We emphasize that although w_{\u03c0/\u00b5} is associated with a temporal difference, it does not satisfy a Bellman recurrence in the usual sense [2]. Indeed, note that equation 3 is written \u201cbackwards\u201d: The occupancy measure of a state s' is written as a (discounted) function of previous states, as opposed to vice-versa. This will serve as a key differentiator between our algorithm and these previous methods.\n\n2.3 Off-Policy Estimation with Multiple Unknown Behavior Policies\n\nWhile the previous algorithms are promising, they have several limitations when applied in practice:\n\u2022 The off-policy experience distribution d^\u00b5 is with respect to a single, Markovian behavior policy \u00b5, and this policy must be known during optimization.\u00b2 In practice, off-policy data often comes from multiple, unknown behavior policies.\n\u2022 Computing the TD error in equation 4 requires the use of per-step importance ratios \u03c0(a_t|s_t)/\u00b5(a_t|s_t) at every state-action sample (s_t, a_t). 
Depending on how far the behavior policy is from the target policy, these quantities may have high variance, which can have a detrimental effect on the convergence of any stochastic optimization algorithm that is used to estimate w_{\u03c0/\u00b5}.\n\nThe method we derive below will be free of the aforementioned issues, avoiding unnecessary requirements on the form of the off-policy data collection as well as explicit uses of importance ratios. Rather, we consider the general setting where D consists of transitions sampled in an unknown fashion. Since D contains rewards and next states, we will often slightly abuse notation and write not only (s, a) \u223c d^D but also (s, a, r) \u223c d^D and (s, a, s') \u223c d^D, where the notation d^D emphasizes that, unlike previously, D is not the result of a single, known behavior policy. The target policy\u2019s value can be equivalently written as,\n\n$$\\rho(\\pi) = \\mathbb{E}_{(s,a,r)\\sim d^{\\mathcal{D}}}\\left[w_{\\pi/\\mathcal{D}}(s,a)\\cdot r\\right], \\tag{5}$$\n\nwhere the correction terms are given by w_{\u03c0/D}(s, a) := d^\u03c0(s, a) / d^D(s, a), and our algorithm will focus on estimating these correction terms. Rather than relying on the assumption that D is the result of a single, known behavior policy, we instead make the following regularity assumption:\n\nAssumption 1 (Reference distribution property). For any (s, a), d^\u03c0(s, a) > 0 implies d^D(s, a) > 0. Furthermore, the correction terms are bounded by some finite constant C: \u2016w_{\u03c0/D}\u2016_\u221e \u2264 C.\n\n\u00b2 The Markovian requirement is necessary for TD methods. However, notably, IS methods do not depend on this assumption.\n\n3 DualDICE\n\nWe now develop our algorithm, DualDICE, for estimating the discounted stationary distribution corrections w_{\u03c0/D}(s, a) = d^\u03c0(s, a) / d^D(s, a). In the OPE setting, one does not have explicit knowledge of the distribution d^D, but rather only access to samples D = {(s, a, r, s')} \u223c d^D. Similar to the TD methods described above, we also assume access to samples from the initial state distribution \u03b2. We begin by introducing a key result, which we will later derive and use as the crux for our algorithm.\n\n3.1 The Key Idea\n\nConsider optimizing a (bounded) function \u03bd : S \u00d7 A \u2192 R for the following objective:\n\n$$\\min_{\\nu: S\\times A\\to\\mathbb{R}} J(\\nu) := \\frac{1}{2}\\,\\mathbb{E}_{(s,a)\\sim d^{\\mathcal{D}}}\\left[(\\nu - \\mathcal{B}^\\pi\\nu)(s,a)^2\\right] - (1-\\gamma)\\,\\mathbb{E}_{s_0\\sim\\beta,\\ a_0\\sim\\pi(s_0)}\\left[\\nu(s_0,a_0)\\right], \\tag{6}$$\n\nwhere we use B^\u03c0 to denote the expected Bellman operator with respect to policy \u03c0 and zero reward: B^\u03c0\u03bd(s, a) = \u03b3 E_{s'\u223cT(s,a), a'\u223c\u03c0(s')}[\u03bd(s', a')]. The first term in equation 6 is the expected squared Bellman error with zero reward. This term alone would lead to a trivial solution \u03bd^* \u2261 0, which can be avoided by the second term that encourages \u03bd^* > 0. Together, these two terms result in an optimal \u03bd^* that places some non-zero amount of Bellman residual at state-action pairs sampled from d^D. Perhaps surprisingly, as we will show, the Bellman residuals of \u03bd^* are exactly the desired distribution corrections:\n\n$$(\\nu^* - \\mathcal{B}^\\pi\\nu^*)(s,a) = w_{\\pi/\\mathcal{D}}(s,a). \\tag{7}$$\n\nThis key result provides the foundation for our algorithm, since it provides us with a simple objective (relying only on samples from d^D, \u03b2, \u03c0) which we may optimize in order to obtain estimates of the distribution corrections. In the text below, we will show how we arrive at this result. We provide one additional step which allows us to efficiently learn a parameterized \u03bd with respect to equation 6. 
We then generalize our results to a family of similar algorithms and lastly present theoretical guarantees.\n\n3.2 Derivation\n\nA Technical Observation We begin our derivation of the algorithm for estimating w_{\u03c0/D} by presenting the following simple technical observation: For arbitrary scalars m \u2208 R_{>0}, n \u2208 R_{\u22650}, the optimizer of the convex problem min_x J(x) := (1/2) m x^2 \u2212 n x is unique and given by x^* = n/m. Using this observation, and letting $\\mathcal{C}$ be some bounded subset of R which contains [0, C], one immediately sees that the optimizer of the following problem,\n\n$$\\min_{x: S\\times A\\to\\mathcal{C}} J_1(x) := \\frac{1}{2}\\,\\mathbb{E}_{(s,a)\\sim d^{\\mathcal{D}}}\\left[x(s,a)^2\\right] - \\mathbb{E}_{(s,a)\\sim d^\\pi}\\left[x(s,a)\\right], \\tag{8}$$\n\nis given by x^*(s, a) = w_{\u03c0/D}(s, a) for any (s, a) \u2208 S \u00d7 A. This result provides us with an objective that shares the same basic form as equation 6. The main distinction is that the second term relies on an expectation over d^\u03c0, which we do not have access to.\n\nChange of Variables In order to transform the second expectation in equation 8 to be over the initial state distribution \u03b2, we perform the following change of variables: Let \u03bd : S \u00d7 A \u2192 R be an arbitrary state-action value function that satisfies,\n\n$$\\nu(s,a) := x(s,a) + \\gamma\\,\\mathbb{E}_{s'\\sim T(s,a),\\ a'\\sim\\pi(s')}\\left[\\nu(s',a')\\right], \\quad \\forall (s,a)\\in S\\times A. \\tag{9}$$\n\nSince x(s, a) \u2208 $\\mathcal{C}$ is bounded and \u03b3 < 1, the variable \u03bd(s, a) is well-defined and bounded. By applying this change of variables, the objective function in equation 8 can be re-written in terms of \u03bd, and this yields our previously presented objective from equation 6. Indeed, define,\n\n$$\\beta_t(s) := \\Pr\\left(s = s_t \\;\\middle|\\; s_0\\sim\\beta,\\ a_k\\sim\\pi(s_k),\\ s_{k+1}\\sim T(s_k,a_k)\\ \\text{for } 0 \\le k < t\\right),$$\n\nto be the state visitation probability at step t when following \u03c0. Clearly, \u03b2_0 = \u03b2. 
Then,\n\n$$\\begin{aligned}\n\\mathbb{E}_{(s,a)\\sim d^\\pi}[x(s,a)] &= \\mathbb{E}_{(s,a)\\sim d^\\pi}\\left[\\nu(s,a) - \\gamma\\,\\mathbb{E}_{s'\\sim T(s,a),\\ a'\\sim\\pi(s')}[\\nu(s',a')]\\right] \\\\\n&= (1-\\gamma)\\sum_{t=0}^{\\infty}\\gamma^t\\,\\mathbb{E}_{s\\sim\\beta_t,\\ a\\sim\\pi(s)}\\left[\\nu(s,a) - \\gamma\\,\\mathbb{E}_{s'\\sim T(s,a),\\ a'\\sim\\pi(s')}[\\nu(s',a')]\\right] \\\\\n&= (1-\\gamma)\\sum_{t=0}^{\\infty}\\gamma^t\\,\\mathbb{E}_{s\\sim\\beta_t,\\ a\\sim\\pi(s)}[\\nu(s,a)] - (1-\\gamma)\\sum_{t=0}^{\\infty}\\gamma^{t+1}\\,\\mathbb{E}_{s\\sim\\beta_{t+1},\\ a\\sim\\pi(s)}[\\nu(s,a)] \\\\\n&= (1-\\gamma)\\,\\mathbb{E}_{s\\sim\\beta,\\ a\\sim\\pi(s)}[\\nu(s,a)].\n\\end{aligned}$$\n\nThe Bellman residuals of the optimum of this objective give the desired off-policy corrections:\n\n$$(\\nu^* - \\mathcal{B}^\\pi\\nu^*)(s,a) = x^*(s,a) = w_{\\pi/\\mathcal{D}}(s,a). \\tag{10}$$\n\nEquation 6 provides a promising approach for estimating the stationary distribution corrections, since the first expectation is over state-action pairs sampled from d^D, while the second expectation is over \u03b2 and actions sampled from \u03c0, both of which we have access to. Therefore, in principle we may solve this problem with respect to a parameterized value function \u03bd, and then use the optimized \u03bd^* to deduce the corrections. In practice, however, the objective in its current form presents two difficulties:\n\u2022 The quantity (\u03bd \u2212 B^\u03c0\u03bd)(s, a)^2 involves a conditional expectation inside a square. In general, when environment dynamics are stochastic and the action space may be large or continuous, this quantity may not be readily optimized using standard stochastic techniques. 
(For example, when the environment is stochastic, its Monte-Carlo sample gradient is generally biased.)\n\u2022 Even if one has computed the optimal value \u03bd^*, the corrections (\u03bd^* \u2212 B^\u03c0\u03bd^*)(s, a), due to the same argument as above, may not be easily computed, especially when the environment is stochastic or the action space continuous.\n\nExploiting Fenchel Duality We solve both difficulties listed above in one step by exploiting Fenchel duality [35]: Any convex function f(x) may be written as f(x) = max_\u03b6 x \u00b7 \u03b6 \u2212 f^*(\u03b6), where f^* is the Fenchel conjugate of f. In the case of f(x) = (1/2) x^2, the Fenchel conjugate is given by f^*(\u03b6) = (1/2) \u03b6^2. Thus, we may express our objective as,\n\n$$\\min_{\\nu: S\\times A\\to\\mathbb{R}} J(\\nu) := \\mathbb{E}_{(s,a)\\sim d^{\\mathcal{D}}}\\left[\\max_{\\zeta}\\ (\\nu - \\mathcal{B}^\\pi\\nu)(s,a)\\cdot\\zeta - \\frac{1}{2}\\zeta^2\\right] - (1-\\gamma)\\,\\mathbb{E}_{s_0\\sim\\beta,\\ a_0\\sim\\pi(s_0)}\\left[\\nu(s_0,a_0)\\right].$$\n\nBy the interchangeability principle [6, 34, 36], we may replace the inner max over scalar \u03b6 with a max over functions \u03b6 : S \u00d7 A \u2192 R and obtain a min-max saddle-point optimization:\n\n$$\\min_{\\nu: S\\times A\\to\\mathbb{R}}\\ \\max_{\\zeta: S\\times A\\to\\mathbb{R}} J(\\nu,\\zeta) := \\mathbb{E}_{(s,a,s')\\sim d^{\\mathcal{D}},\\ a'\\sim\\pi(s')}\\left[(\\nu(s,a) - \\gamma\\nu(s',a'))\\,\\zeta(s,a) - \\zeta(s,a)^2/2\\right] - (1-\\gamma)\\,\\mathbb{E}_{s_0\\sim\\beta,\\ a_0\\sim\\pi(s_0)}\\left[\\nu(s_0,a_0)\\right]. \\tag{11}$$\n\nUsing the KKT condition of the inner optimization problem (which is convex and quadratic in \u03b6), for any \u03bd the optimal value \u03b6^*_\u03bd is equal to the Bellman residual, \u03bd \u2212 B^\u03c0\u03bd. Therefore, the desired stationary distribution correction can then be found from the saddle-point solution (\u03bd^*, \u03b6^*) of the minimax problem in equation 11 as follows:\n\n$$\\zeta^*(s,a) = (\\nu^* - \\mathcal{B}^\\pi\\nu^*)(s,a) = w_{\\pi/\\mathcal{D}}(s,a). \\tag{12}$$\n\nNow we finally have an objective which is well-suited for practical computation. First, unbiased estimates of both the objective and its gradients are easy to compute using stochastic samples from d^D, \u03b2, and \u03c0, all of which we have access to. Secondly, notice that the min-max objective function in equation 11 is linear in \u03bd and concave in \u03b6. Therefore in certain settings, one can provide guarantees on the convergence of optimization algorithms applied to this objective (see Section 3.4). Thirdly, the optimizer of the objective in equation 11 immediately gives us the desired stationary distribution corrections through the values of \u03b6^*(s, a), with no additional computation.\n\n3.3 Extension to General Convex Functions\n\nBesides a quadratic penalty function, one may extend the above derivations to a more general class of convex penalty functions. Consider a generic convex penalty function f : R \u2192 R. Recall that $\\mathcal{C}$ is a bounded subset of R which contains the interval [0, C] of stationary distribution corrections. If $\\mathcal{C}$ is contained in the range of f', then the optimizer of the convex problem, min_x J(x) := m \u00b7 f(x) \u2212 n x for n/m \u2208 $\\mathcal{C}$, satisfies the following KKT condition: f'(x^*) = n/m. Analogously, the optimizer x^* of,\n\n$$\\min_{x: S\\times A\\to\\mathcal{C}} J_1(x) := \\mathbb{E}_{(s,a)\\sim d^{\\mathcal{D}}}\\left[f(x(s,a))\\right] - \\mathbb{E}_{(s,a)\\sim d^\\pi}\\left[x(s,a)\\right], \\tag{13}$$\n\nsatisfies the equality f'(x^*(s, a)) = w_{\u03c0/D}(s, a).\n\nWith change of variables \u03bd := x + B^\u03c0\u03bd, the above problem becomes,\n\n$$\\min_{\\nu: S\\times A\\to\\mathbb{R}} J(\\nu) := \\mathbb{E}_{(s,a)\\sim d^{\\mathcal{D}}}\\left[f\\left((\\nu - \\mathcal{B}^\\pi\\nu)(s,a)\\right)\\right] - (1-\\gamma)\\,\\mathbb{E}_{s_0\\sim\\beta,\\ a_0\\sim\\pi(s_0)}\\left[\\nu(s_0,a_0)\\right]. \\tag{14}$$\n\nApplying Fenchel duality to f in this objective further leads to the following saddle-point problem:\n\n$$\\min_{\\nu: S\\times A\\to\\mathbb{R}}\\ \\max_{\\zeta: S\\times A\\to\\mathbb{R}} J(\\nu,\\zeta) := \\mathbb{E}_{(s,a,s')\\sim d^{\\mathcal{D}},\\ a'\\sim\\pi(s')}\\left[(\\nu(s,a) - \\gamma\\nu(s',a'))\\,\\zeta(s,a) - f^*(\\zeta(s,a))\\right] - (1-\\gamma)\\,\\mathbb{E}_{s_0\\sim\\beta,\\ a_0\\sim\\pi(s_0)}\\left[\\nu(s_0,a_0)\\right]. \\tag{15}$$\n\nBy the KKT condition of the inner optimization problem, for any \u03bd the optimizer \u03b6^*_\u03bd satisfies,\n\n$$(f^*)'(\\zeta^*_\\nu(s,a)) = (\\nu - \\mathcal{B}^\\pi\\nu)(s,a). \\tag{16}$$\n\nTherefore, using the fact that the derivative of a convex function f' is the inverse function of the derivative of its Fenchel conjugate (f^*)', our desired stationary distribution corrections are found by computing the saddle-point (\u03b6^*, \u03bd^*) of the above problem:\n\n$$\\zeta^*(s,a) = f'\\left((\\nu^* - \\mathcal{B}^\\pi\\nu^*)(s,a)\\right) = f'(x^*(s,a)) = w_{\\pi/\\mathcal{D}}(s,a). \\tag{17}$$\n\nAmazingly, despite the generalization beyond the quadratic penalty function f(x) = (1/2) x^2, the optimization problem in equation 15 retains all the computational benefits that make this method very practical for learning w_{\u03c0/D}(s, a): All quantities and their gradients may be unbiasedly estimated from 
stochastic samples; the objective is linear in \u03bd and concave in \u03b6, thus is well-behaved; and the optimizer of this problem immediately provides the desired stationary distribution corrections through the values of \u03b6^*(s, a), without any additional computation.\n\nThis generalized derivation also provides insight into the initial technical result: It is now clear that the objective in equation 13 is the negative Fenchel dual (variational) form of the Ali-Silvey or f-divergence, which has been used in previous work to estimate divergence and data likelihood ratios [27]. In the case of f(x) = (1/2) x^2 (equation 8), this corresponds to a variant of the Pearson \u03c7^2 divergence. Despite the similar formulations of our work and previous works using the same divergences to estimate data likelihood ratios [27], we emphasize that the aforementioned dual form of the f-divergence is not immediately applicable to estimation of off-policy corrections in the context of RL, due to the fact that samples from distribution d^\u03c0 are unobserved. Indeed, our derivations hinged on two additional key steps: (1) the change of variables from x to \u03bd := x + B^\u03c0\u03bd; and (2) the second application of duality to introduce \u03b6. Due to these repeated applications of duality in our derivations, we term our method Dual stationary DIstribution Correction Estimation (DualDICE).\n\n3.4 Theoretical Guarantees\n\nIn this section, we consider the theoretical properties of DualDICE in the setting where we have a dataset formed by empirical samples {s_i, a_i, r_i, s'_i}_{i=1}^N \u223c d^D, {s_0^i}_{i=1}^N \u223c \u03b2, and target actions a'_i \u223c \u03c0(s'_i), a_0^i \u223c \u03c0(s_0^i) for i = 1, \u2026, N.\u00b3 We will use the shorthand notation \u00ca_{d^D} to denote an average over these empirical samples. Although the proposed estimator can adopt general f, for simplicity of exposition we restrict to f(x) = (1/2) x^2. We consider using an algorithm OPT (e.g., stochastic gradient descent/ascent) to find optimal \u03bd, \u03b6 of equation 15 within some parameterization families F, H, respectively. We denote by \u03bd\u0302, \u03b6\u0302 the outputs of OPT. We have the following guarantee on the quality of \u03bd\u0302, \u03b6\u0302 with respect to the off-policy policy evaluation (OPE) problem.\n\nTheorem 2. (Informal) Under some mild assumptions, the mean squared error (MSE) associated with using \u03bd\u0302, \u03b6\u0302 for OPE can be bounded as,\n\n$$\\mathbb{E}\\left[\\left(\\hat{\\mathbb{E}}_{d^{\\mathcal{D}}}\\left[\\hat\\zeta(s,a)\\cdot r\\right] - \\rho(\\pi)\\right)^2\\right] = \\widetilde{O}\\left(\\epsilon_{\\mathrm{approx}}(\\mathcal{F},\\mathcal{H}) + \\epsilon_{\\mathrm{opt}} + \\frac{1}{\\sqrt{N}}\\right), \\tag{18}$$\n\nwhere the outer expectation is with respect to the randomness of the empirical samples and OPT, \u03b5_opt denotes the optimization error, and \u03b5_approx(F, H) denotes the approximation error due to F, H.\n\nThe sources of estimation error are explicit in Theorem 2. As the number of samples N increases, the statistical error N^{\u22121/2} approaches zero. Meanwhile, there is an implicit trade-off in \u03b5_approx(F, H) and \u03b5_opt. With flexible function spaces F and H (such as the space of neural networks), the approximation error can be further decreased; however, optimization will be complicated and it is difficult to characterize \u03b5_opt. On the other hand, with linear parameterization of (\u03bd, \u03b6), under some mild conditions, after T iterations we achieve provably fast rate, O(exp(\u2212T)) for OPT = SVRG and O(1/T) for OPT = SGD, at the cost of potentially increased approximation error. See the Appendix for the precise theoretical results, proofs, and further discussions.\n\n\u00b3 For the sake of simplicity, we consider the batch learning setting with i.i.d. samples as in [41]. The results can be easily generalized to single sample path with dependent samples (see Appendix).\n\n4 Related Work\n\nDensity Ratio Estimation Density ratio estimation is an important tool for many machine learning and statistics problems. Other than the naive approach (i.e., the density ratio is calculated via estimating the densities in the numerator and denominator separately, which may magnify the estimation error), various direct ratio estimators have been proposed [37], including the moment matching approach [13], probabilistic classification approach [3, 5, 33], and ratio matching approach [17, 27, 38].\n\nThe proposed DualDICE algorithm, as a direct approach for density ratio estimation, bears some similarities to ratio matching [27], which is also derived by exploiting the Fenchel dual representation of the f-divergences. However, compared to the existing direct estimators, the major difference lies in the requirement of the samples from the stationary distribution. Specifically, the existing estimators require access to samples from both d^D and d^\u03c0, which is impractical in the off-policy learning setting. Therefore, DualDICE is uniquely applicable to the more difficult RL setting.\n\nOff-policy Policy Evaluation The problem of off-policy policy evaluation has been heavily studied in contextual bandits [8, 42, 45] and in the more general RL setting [10, 16, 20, 23, 28, 29, 30, 43, 44]. Several representative approaches can be identified in the literature. The Direct Method (DM) learns a model of the system and then uses it to estimate the performance of the evaluation policy. 
This approach often has low variance but its bias depends on how well the selected function class can express the environment dynamics. Importance sampling (IS) [31] uses importance weights to correct the mismatch between the distributions of the system trajectory induced by the target and behavior policies. Its variance can be unbounded when there is a big difference between the distributions of the evaluation and behavior policies, and grows exponentially with the horizon of the RL problem. Doubly Robust (DR) is a combination of DM and IS, and can achieve the low variance of DM and no (or low) bias of IS. Other than DM, all the methods described above require knowledge of the policy density ratio, and thus the behavior policy. Our proposed algorithm avoids this necessity.\n\n5 Experiments\n\nWe evaluate our method applied to off-policy policy evaluation (OPE). We focus on this setting because it is a direct application of stationary distribution correction estimation, without many additional tunable parameters, and it has been previously used as a test-bed for similar techniques [21]. In each experiment, we use a behavior policy \u00b5 to collect some number of trajectories, each for some number of steps. This data is used to estimate the stationary distribution corrections, which are then used to estimate the average step reward, with respect to a target policy \u03c0. We focus our comparisons here on a TD-based approach (based on [12]) and weighted step-wise IS (as described in [21]), which we and others have generally found to work best relative to common IS variants [24, 31]. See the Appendix for additional results and implementation details.\n\nWe begin in a controlled setting with an evaluation agnostic to optimization issues, where we find that, absent these issues, our method is competitive with TD-based approaches (Figure 1). 
However,\nas we move to more difficult settings with complex environment dynamics, the performance of TD\nmethods degrades dramatically, while our method is still able to provide accurate estimates (Figure 2).\nFinally, we provide an analysis of the optimization behavior of our method on a simple control task\nacross different choices of function f (Figure 3). Interestingly, although the choice of f(x) = (1/2) x^2 is\nmost natural, we find the empirically best performing choice to be f(x) = (2/3)|x|^{3/2}. All results are\nsummarized for 20 random seeds, with median plotted and error bars at 25th and 75th percentiles (see footnote 4).\n\nFootnote 4: The choice of plotting percentiles is somewhat arbitrary. Plotting mean and standard errors yields similar plots.\n\n5.1 Estimation Without Function Approximation\nWe begin with a tabular task, the Taxi domain [7]. In this task, we evaluate our method in a manner\nagnostic to optimization difficulties: the objective (6) is a quadratic function of \u03bd, and thus may\nbe solved exactly by matrix operations. The Bellman residuals (equation 7) may then be estimated via an\nempirical average of the transitions appearing in the off-policy data. In a similar manner, TD methods\nfor estimating the correction terms may also be solved using matrix operations [21]. In this controlled\nsetting, we find that, as expected, TD methods can perform well (Figure 1), and our method achieves\ncompetitive performance. As we will see in the following results, the good performance of TD\nmethods quickly deteriorates as one moves to more complex settings, while our method is able to\nmaintain good performance, even when using function approximation and stochastic optimization.\n\n[Figure 1 plot omitted. Panels: # traj = 50, 100, 200, 400; x-axis: trajectory length; y-axis: log RMSE; legend: DualDICE (ours), TD, IS.]\nFigure 1: We perform OPE on the Taxi domain [7]. The plots show log RMSE of the estimator\nacross different numbers of trajectories and different trajectory lengths (x-axis). For this domain,\nwe avoid any potential issues in optimization by solving for the optimum of the objectives exactly\nusing standard matrix operations. Thus, we are able to see that our method and the TD method are\ncompetitive with each other.\n\n5.2 Control Tasks\nWe now move on to difficult control tasks: a discrete-control task, Cartpole, and a continuous-control\ntask, Reacher [4]. In these tasks, observations are continuous, and thus we use neural network function\napproximators with stochastic optimization. Figure 2 shows the results of our method compared to\nthe TD method. We find that in this setting, DualDICE is able to provide good, stable performance,\nwhile the TD approach suffers from high variance, and this issue is exacerbated when we attempt to\nestimate \u00b5 rather than assume it as given. See the Appendix for additional baseline results.\n\n[Figure 2 plot omitted. Panels: Cartpole with \u03b1 = 0, 0.33, 0.66; Reacher with \u03b1 = 0, 0.33, 0.66; x-axis: training step; y-axis: estimated average step reward \u03c1\u0302(\u03c0); legend: DualDICE (ours), TD (known \u00b5), TD (unknown \u00b5), IS (known \u00b5), True Value.]\nFigure 2: We perform OPE on control tasks. Each plot shows the estimated average step reward over\ntraining (x-axis is training step) and different behavior policies (higher \u03b1 corresponds to a behavior\npolicy closer to the target policy). We find that in all cases, our method is able to approximate these\ndesired values well, with accuracy improving with a larger \u03b1. On the other hand, the TD method\nperforms poorly, even more so when the behavior policy \u00b5 is unknown and must be estimated. While\non Cartpole it can start to approach the desired value for large \u03b1, on the more complicated Reacher\ntask (which involves continuous actions) its learning is too unstable to learn anything at all.\n\n5.3 Choice of Convex Function f\nWe analyze the choice of the convex function f. We consider a simple continuous grid task where an\nagent may move left, right, up, or down and is rewarded for reaching the bottom right corner of a\nsquare room. We plot the estimation errors of using DualDICE for off-policy policy evaluation on this\ntask, comparing different choices of convex functions of the form f(x) = (1/p)|x|^p. Interestingly,\nalthough the choice of f(x) = (1/2) x^2 is most natural, we find the empirically best performing choice to\nbe f(x) = (2/3)|x|^{3/2}. Thus, this is the form of f we used in our experiments for Figure 2.\n\n[Figure 3 plot omitted. Panels: traj length = 50, 100, 200, 400; y-axis: log RMSE.]\nFigure 3: We compare the OPE error when using different forms of f to estimate stationary distribution\nratios with function approximation, which are then applied to OPE on a simple continuous\ngrid task. In this setting, optimization stability is crucial, and this heavily depends on the form of\nthe convex function f. We plot the results of using f(x) = (1/p)|x|^p for p \u2208 [1.25, 1.5, 2, 3, 4]. We also\nshow the results of TD and IS methods on this task for comparison. We find that p = 1.5 consistently\nperforms the best, often providing significantly better results.\n\n6 Conclusions\nWe have presented DualDICE, a method for estimating off-policy stationary distribution corrections.\nCompared to previous work, our method is agnostic to knowledge of the behavior policy used to\ncollect the off-policy data and avoids the use of importance weights in its losses. These advantages\nhave a profound empirical effect: our method provides significantly better estimates compared to TD\nmethods, especially in settings which require function approximation and stochastic optimization.\n\nFuture work includes (1) incorporating the DualDICE algorithm into off-policy training, (2) further\nunderstanding the effects of f on the performance of DualDICE (in terms of approximation error of\nthe distribution corrections), and (3) evaluating DualDICE on real-world off-policy evaluation tasks.\n\nAcknowledgments\n\nWe thank Marc Bellemare, Carles Gelada, and the rest of the Google Brain team for helpful thoughts\nand discussions.\n\nReferences\n[1] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub\nPachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous\nin-hand manipulation. 
arXiv preprint arXiv:1808.00177, 2018.\n\n[2] Richard Ernest Bellman. Dynamic Programming. Dover Publications, Inc., New York, NY,\nUSA, 2003.\n\n[3] Steffen Bickel, Michael Br\u00fcckner, and Tobias Scheffer. Discriminative learning for differing\ntraining and test distributions. In Proceedings of the 24th international conference on Machine\nlearning, pages 81\u201388. ACM, 2007.\n\n[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,\nand Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[5] Kuang Fu Cheng, Chih-Kang Chu, et al. Semiparametric density estimation under a two-sample\ndensity ratio model. Bernoulli, 10(4):583\u2013604, 2004.\n\n[6] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional\ndistributions via dual embeddings. arXiv preprint arXiv:1607.04579, 2016.\n\n[7] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function\ndecomposition. Journal of Artificial Intelligence Research, 13:227\u2013303, 2000.\n\n[8] Miroslav Dud\u00edk, John Langford, and Lihong Li. Doubly robust policy evaluation and learning.\narXiv preprint arXiv:1103.4601, 2011.\n\n[9] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust\noff-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.\n\n[10] Raphael Fonteneau, Susan A. Murphy, Louis Wehenkel, and Damien Ernst. Batch mode\nreinforcement learning based on the synthesis of artificial trajectories. Annals of Operations\nResearch, 208(1):383\u2013416, 2013.\n\n[11] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to Conversational AI. Foundations\nand Trends in Information Retrieval, 13(2\u20133):127\u2013298, 2019.\n\n[12] Carles Gelada and Marc G Bellemare. 
Off-policy deep reinforcement learning by bootstrapping\nthe covariate shift. AAAI, 2018.\n\n[13] Arthur Gretton, Alex J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and\nBernhard Sch\u00f6llkopf. Covariate shift by kernel mean matching. In Dataset shift in machine\nlearning, pages 131\u2013160. MIT Press, 2009.\n\n[14] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In Proceedings of the\n34th International Conference on Machine Learning-Volume 70, pages 1372\u20131383. JMLR.org,\n2017.\n\n[15] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.\n1970.\n\n[16] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning.\nIn Proceedings of the 33rd International Conference on Machine Learning, pages 652\u2013661,\n2016.\n\n[17] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct\nimportance estimation. Journal of Machine Learning Research, 10(Jul):1391\u20131445, 2009.\n\n[18] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep\nreinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.\n\n[19] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of\ncontextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth\nACM international conference on Web search and data mining, pages 297\u2013306. ACM, 2011.\n\n[20] Lihong Li, R\u00e9mi Munos, and Csaba Szepesv\u00e1ri. Toward minimax off-policy value estimation.\nIn Proceedings of the 18th International Conference on Artificial Intelligence and Statistics,\npages 608\u2013616, 2015.\n\n[21] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon:\nInfinite-horizon off-policy estimation. 
In Advances in Neural Information Processing Systems,\npages 5356\u20135366, 2018.\n\n[22] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient\nwith state distribution correction. In Proceedings of the Thirty-Fifth Conference on Uncertainty\nin Artificial Intelligence, 2019. To appear.\n\n[23] A. Mahmood, H. van Hasselt, and R. Sutton. Weighted importance sampling for off-policy\nlearning with linear function approximation. In Proceedings of the 27th International Conference\non Neural Information Processing Systems, 2014.\n\n[24] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline\npolicy evaluation across representations with applications to educational games. In Proceedings\nof the 2014 international conference on Autonomous agents and multi-agent systems, pages\n1077\u20131084. International Foundation for Autonomous Agents and Multiagent Systems, 2014.\n\n[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan\nWierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint\narXiv:1312.5602, 2013.\n\n[26] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research\nGroup. Marginal mean models for dynamic regimes. Journal of the American Statistical\nAssociation, 96(456):1410\u20131423, 2001.\n\n[27] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals\nand the likelihood ratio by convex risk minimization. IEEE Transactions on Information\nTheory, 56(11):5847\u20135861, 2010.\n\n[28] C. Paduraru. Off-policy Evaluation in Markov Decision Processes. PhD thesis, McGill University,\n2013.\n\n[29] D. Precup, R. Sutton, and S. Dasgupta. Off-policy temporal difference learning with function\napproximation. In Proceedings of the 18th International Conference on Machine Learning,\npages 417\u2013424, 2001.\n\n[30] D. Precup, R. 
Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In\nProceedings of the 17th International Conference on Machine Learning, pages 759\u2013766, 2000.\n\n[31] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department\nFaculty Publication Series, page 80, 2000.\n\n[32] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming.\n1994.\n\n[33] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models.\nBiometrika, 85(3):619\u2013630, 1998.\n\n[34] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science\n& Business Media, 2009.\n\n[35] Ralph Tyrrell Rockafellar. Convex analysis. Princeton University Press, 2015.\n\n[36] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy\u0144ski. Lectures on stochastic\nprogramming: modeling and theory. SIAM, 2009.\n\n[37] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine\nlearning. Cambridge University Press, 2012.\n\n[38] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von B\u00fcnau, and\nMotoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the\nInstitute of Statistical Mathematics, 60(4):699\u2013746, 2008.\n\n[39] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135.\n\n[40] Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the\nproblem of off-policy temporal-difference learning. The Journal of Machine Learning Research,\n17(1):2603\u20132631, 2016.\n\n[41] Richard S Sutton, Csaba Szepesv\u00e1ri, Alborz Geramifard, and Michael Bowling. Dyna-style\nplanning with linear function approximation and prioritized sweeping. In Proceedings of the\nTwenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 528\u2013536. AUAI Press,\n2008.\n\n[42] A. Swaminathan, A. Krishnamurthy, A. 
Agarwal, M. Dud\u00edk, J. Langford, D. Jose, and I. Zitouni.\nOff-policy evaluation for slate recommendation. In Proceedings of the 31st International\nConference on Neural Information Processing Systems, pages 3635\u20133645, 2017.\n\n[43] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement\nlearning. In Proceedings of the 33rd International Conference on Machine Learning, pages\n2139\u20132148, 2016.\n\n[44] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In\nProceedings of the 29th Conference on Artificial Intelligence, 2015.\n\n[45] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation\nin contextual bandits. In Proceedings of the 34th International Conference on Machine\nLearning-Volume 70, pages 3589\u20133597. JMLR.org, 2017.\n", "award": [], "sourceid": 1361, "authors": [{"given_name": "Ofir", "family_name": "Nachum", "institution": "Google Brain"}, {"given_name": "Yinlam", "family_name": "Chow", "institution": "Google Research"}, {"given_name": "Bo", "family_name": "Dai", "institution": "Google Brain"}, {"given_name": "Lihong", "family_name": "Li", "institution": "Google Brain"}]}