{"title": "Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 5356, "page_last": 5366, "abstract": "We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.", "full_text": "Breaking the Curse of Horizon:\n\nInfinite-Horizon Off-Policy Estimation\n\nQiang Liu\n\nThe University of Texas at Austin\n\nAustin, TX, 78712\n\nlqiang@cs.utexas.edu\n\nZiyang Tang\n\nThe University of Texas at Austin\n\nAustin, TX, 78712\n\nztang@cs.utexas.edu\n\nLihong Li\n\nGoogle Brain\n\nKirkland, WA, 98033\nlihong@google.com\n\nDengyong Zhou\nGoogle Brain\n\nKirkland, WA, 98033\n\ndennyzhou@google.com\n\nAbstract\n\nWe consider off-policy estimation of the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique for deriving (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. 
In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimator that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance faced by existing methods. Our key contribution is a novel approach to estimating the density ratio of two stationary state distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of RKHS. We support our method with both theoretical and empirical analyses.\n\n1 Introduction\n\nReinforcement learning (RL) [36] is one of the most successful approaches to artificial intelligence, and has found successful applications in robotics, games, dialogue systems, and recommendation systems, among others. One of the key problems in RL is policy evaluation: given a fixed policy, estimate the average reward garnered by an agent that runs this policy in the environment. In this paper, we consider the off-policy estimation problem, in which we want to estimate the expected reward of a given target policy with samples collected by a different behavior policy. This problem is of great practical importance in many application domains where deploying a new policy can be costly or risky, such as medical treatments [26], econometrics [13], recommender systems [19], education [23], Web search [18], advertising and marketing [4, 5, 38, 40]. It can also be used as a key component for developing efficient off-policy policy optimization algorithms [7, 14, 18, 39].\nMost state-of-the-art off-policy estimation methods are based on importance sampling (IS) [e.g., 22]. A major limitation, however, is that this approach can become inaccurate due to the high variance introduced by the importance weights, especially when the trajectory is long. 
Indeed, most existing IS-based estimators compute the weight as the product of the importance ratios of many steps in the trajectory. Variances in individual steps accumulate multiplicatively, so that the overall IS weight of a random trajectory can have exponentially high variance, resulting in an unreliable estimator. In the extreme case when the trajectory length is infinite, as in infinite-horizon average-reward problems, some of these estimators are not even well-defined. Ad hoc approaches can be used, such as truncating the trajectories, but often lead to a hard-to-control bias in the final estimation. Analogous to the well-known \u201ccurse of dimensionality\u201d in dynamic programming [2], we call this problem the \u201ccurse of horizon\u201d in off-policy learning.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, we develop a new approach that tackles the curse of horizon. The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories. This avoids the cumulative product across time in the density ratio, substantially decreasing its variance and eliminating the estimator\u2019s dependence on the horizon.\nOur key challenge, of course, is to estimate the importance ratios of average visitation distributions. In practice, we often have access to both the target and behavior policies to compute their importance ratio of an action conditioned on a given state. But we typically have no access to transition probabilities of the environment, so estimating importance ratios of state visitation distributions has been very difficult, especially when only off-policy samples are available. 
In this paper, we develop a mini-max loss function for estimating the true stationary density ratio, which yields a closed-form representation similar to maximum mean discrepancy [9] when combined with a reproducing kernel Hilbert space (RKHS). We study the theoretical properties of our loss function, and demonstrate its empirical effectiveness on long-horizon problems.\n\n2 Background\nProblem Definition Consider a Markov decision process (MDP) [31] $M = \\langle \\mathcal{S}, \\mathcal{A}, r, T \\rangle$ with state space $\\mathcal{S}$, action space $\\mathcal{A}$, reward function $r$, and transition probability function $T$. Assume the environment is initialized at state $s_0 \\in \\mathcal{S}$, drawn from an unknown distribution $d_0(\\cdot)$. At each time step $t$, an agent observes the current state $s_t$, takes an action $a_t$ according to a possibly stochastic policy $\\pi(\\cdot|s_t)$, receives a reward $r_t$ whose expectation is $r(s_t, a_t)$, and transitions to a next state $s_{t+1}$ according to transition probabilities $T(\\cdot|s_t, a_t)$. To simplify exposition and avoid unnecessary technicalities, we assume $\\mathcal{S}$ and $\\mathcal{A}$ are finite unless otherwise specified, although our method extends to continuous spaces straightforwardly, as demonstrated in experiments.\nWe consider the infinite horizon problem in which the MDP continues without termination. Let $p_\\pi(\\cdot)$ be the distribution of trajectory $\\tau = \\{s_t, a_t, r_t\\}_{t=0}^{\\infty}$ under policy $\\pi$. The expected reward of $\\pi$ is\n\n$$R_\\pi := \\lim_{T \\to \\infty} \\mathbb{E}_{\\tau \\sim p_\\pi}[R^T(\\tau)], \\qquad R^T(\\tau) := \\Big(\\sum_{t=0}^{T} \\gamma^t r_t\\Big) \\Big/ \\Big(\\sum_{t=0}^{T} \\gamma^t\\Big),$$\n\nwhere $R^T(\\tau)$ is the reward of trajectory $\\tau$ up to time $T$. Here, $\\gamma \\in (0, 1]$ is a discount factor. We distinguish two reward criteria, the average reward ($\\gamma = 1$) and discounted reward ($0 < \\gamma < 1$):\n\n$$\\text{Average: } R(\\tau) := \\lim_{T \\to \\infty} \\frac{1}{T+1} \\sum_{t=0}^{T} r_t, \\qquad \\text{Discounted: } R(\\tau) := (1 - \\gamma) \\sum_{t=0}^{\\infty} \\gamma^t r_t,$$\n\nwhere $(1 - \\gamma) = 1/\\sum_{t=0}^{\\infty} \\gamma^t$ is a normalization factor. 
The problem of off-policy value estimation is to estimate the expected reward $R_\\pi$ of a given target policy $\\pi$, when we only observe a set of trajectories $\\tau^i = \\{s_t^i, a_t^i, r_t^i\\}_{t=0}^{T}$ generated by following a different behavior policy $\\pi_0$.\nBellman Equation We briefly review the Bellman equation and the notation of value functions, for both average and discounted reward criteria. In the discounted case ($0 < \\gamma < 1$), the value $V^\\pi(s)$ is the expected total discounted reward when the initial state $s_0$ is fixed to be $s$: $V^\\pi(s) = \\mathbb{E}_{\\tau \\sim p_\\pi}[\\sum_{t=0}^{\\infty} \\gamma^t r_t \\mid s_0 = s]$. Note that we do not normalize $V^\\pi$ by $(1 - \\gamma)$ in our notation. For the average reward ($\\gamma = 1$) case, the expected average reward does not depend on the initial state if the Markov process is ergodic [31]. Instead, the value function $V^\\pi(s)$ in the average case measures the average adjusted sum of reward: $V^\\pi(s) = \\lim_{T \\to \\infty} \\mathbb{E}_{\\tau \\sim p_\\pi}[\\sum_{t=0}^{T} (r_t - R_\\pi) \\mid s_0 = s]$. It represents the relative difference in total reward gained from starting in state $s_0 = s$ as opposed to $R_\\pi$.\nUnder these definitions, $V^\\pi$ is the fixed-point solution to the respective Bellman equations:\n\n$$\\text{Average:} \\quad V^\\pi(s) - \\mathbb{E}_{s',a|s \\sim d_\\pi}[V^\\pi(s')] = \\mathbb{E}_{a|s \\sim \\pi}[r(s,a) - R_\\pi], \\tag{1}$$\n$$\\text{Discounted:} \\quad V^\\pi(s) - \\gamma \\mathbb{E}_{s',a|s \\sim d_\\pi}[V^\\pi(s')] = \\mathbb{E}_{a|s \\sim \\pi}[r(s,a)]. \\tag{2}$$\n\nImportance Sampling IS represents a major class of approaches to off-policy estimation, which, in principle, only applies to the finite-horizon reward $R_\\pi^T$ when the trajectory is truncated at a finite time step $T < \\infty$. 
IS-based estimators are based on the following change-of-measure equality:\n\n$$R_\\pi^T = \\mathbb{E}_{\\tau \\sim p_{\\pi_0}}[w_{0:T}(\\tau) R^T(\\tau)], \\quad \\text{with} \\quad w_{0:T}(\\tau) := \\frac{p_\\pi(\\tau_{0:T})}{p_{\\pi_0}(\\tau_{0:T})} = \\prod_{t=0}^{T} \\beta_{\\pi/\\pi_0}(a_t|s_t), \\tag{3}$$\n\nwhere $\\beta_{\\pi/\\pi_0}(a|s) := \\pi(a|s)/\\pi_0(a|s)$ is the single-step density ratio of policies $\\pi$ and $\\pi_0$ evaluated at a particular state-action pair $(s, a)$, and $w_{0:T}$ is the density ratio of the trajectory $\\tau$ up to time $T$. Methods based on (3) are called trajectory-wise IS, or weighted IS (WIS) when the weights are self-normalized [22, 30]. It is possible to improve trajectory-wise IS with the so-called step-wise, or per-decision, IS/WIS, which uses weight $w_{0:t}$ for reward $r_t$ at time $t$, yielding smaller variance [30]. More details about these estimators are given in Appendix A.\n\nThe Curse of Horizon The importance weight $w_{0:T}$ is a product of $T + 1$ density ratios, whose variance can grow exponentially with $T$. Thus, IS-based estimators have not been widely successful in long-horizon problems, let alone infinite-horizon ones where $w_{0:\\infty}$ may not even be well-defined. While WIS estimators often have reduced variance, the exponential dependence on horizon is unavoidable in general. We call this phenomenon in IS/WIS-based estimators the curse of horizon.\nNot all hope is lost, however. To see this, consider an MDP with $n$ states and 2 actions, where states are arranged on a circle (see figure on the right). The two actions deterministically move the agent from the current state to the neighboring state counterclockwise and clockwise, respectively. Suppose we are given two policies with opposite effects: the behavior policy $\\pi_0$ moves the agent clockwise with probability $\\rho$, and the target policy $\\pi$ moves the agent counterclockwise with probability $\\rho$, for some constant $\\rho \\in (0, 1)$. 
As shown in Appendix B, IS and WIS estimators suffer from exponentially large variance when estimating the average reward of $\\pi$. However, a keen reader will realize that the two policies are symmetric, and thus their stationary state visitation distributions are identical. As we show in the sequel, this allows us to estimate the expected reward using a much more efficient importance sampling, whose importance weight equals the single-step density ratio $\\beta_{\\pi/\\pi_0}(a_t|s_t)$, instead of the cumulative product weight $w_{0:T}$ in (3), allowing us to significantly reduce the variance. Such an observation inspired the approach developed in this paper.\n\n3 Off-Policy Estimation via Stationary State Density Ratio Estimation\n\nAs shown in the example above, a significant decrease in estimation variance is possible when we apply importance weighting on the state space, rather than the trajectory space. It eliminates the dependency on the trajectory length and is much better suited to long- or infinite-horizon problems. To realize this, we need to introduce an alternative representation of the expected reward. Denote by $d_{\\pi,t}(\\cdot)$ the distribution of state $s_t$ when we execute policy $\\pi$ starting from an initial state $s_0$ drawn from an initial distribution $d_0(\\cdot)$. We define the average visitation distribution to be\n\n$$d_\\pi(s) = \\lim_{T \\to \\infty} \\Big(\\sum_{t=0}^{T} \\gamma^t d_{\\pi,t}(s)\\Big) \\Big/ \\Big(\\sum_{t=0}^{T} \\gamma^t\\Big). \\tag{4}$$\n\nWe always assume the limit $T \\to \\infty$ exists in this work. When $\\gamma \\in (0, 1)$ in the discounted case, $d_\\pi$ is a discounted average of $d_{\\pi,t}$, that is, $d_\\pi(s) = (1 - \\gamma) \\sum_{t=0}^{\\infty} \\gamma^t d_{\\pi,t}(s)$; when $\\gamma = 1$ in the average reward case, $d_\\pi$ is the stationary distribution of $s_t$ as $t \\to \\infty$ under policy $\\pi$, that is, $d_\\pi(s) = \\lim_{T \\to \\infty} \\frac{1}{T+1} \\sum_{t=0}^{T} d_{\\pi,t}(s) = \\lim_{t \\to \\infty} d_{\\pi,t}(s)$.\nFollowing the definition in (4), it can be verified that $R_\\pi$ can be expressed alternatively as\n\n$$R_\\pi = \\sum_{s,a} d_\\pi(s) \\pi(a|s) r(s,a) = \\mathbb{E}_{(s,a) \\sim d_\\pi}[r(s,a)], \\tag{5}$$\n\nwhere, abusing notation slightly, we use $(s, a) \\sim d_\\pi$ to denote draws from distribution $d_\\pi(s, a) := d_\\pi(s) \\pi(a|s)$. Our idea is to construct an IS estimator based on (5), where the importance ratio is computed on state-action pairs rather than on trajectories:\n\n$$R_\\pi = \\mathbb{E}_{(s,a) \\sim d_{\\pi_0}}\\big[w_{\\pi/\\pi_0}(s) \\beta_{\\pi/\\pi_0}(a|s) r(s,a)\\big], \\tag{6}$$\n\nwhere $\\beta_{\\pi/\\pi_0}(a|s) = \\pi(a|s)/\\pi_0(a|s)$ and $w_{\\pi/\\pi_0}(s) := d_\\pi(s)/d_{\\pi_0}(s)$ is the density ratio of the visitation distributions $d_\\pi$ and $d_{\\pi_0}$; here, $w_{\\pi/\\pi_0}(s)$ is not known directly but can be estimated, as shown later. Eq. (6) allows us to construct a (weighted-)IS estimator by approximating $\\mathbb{E}_{(s,a) \\sim d_{\\pi_0}}[\\cdot]$ with data $\\{s_t^i, a_t^i, r_t^i\\}_{i=1}^{m}$ obtained when running policy $\\pi_0$,\n\n$$\\hat{R}_\\pi = \\sum_{i=1}^{m} \\sum_{t=0}^{T} w_t^i r_t^i, \\quad \\text{where} \\quad w_t^i := \\frac{\\gamma^t w_{\\pi/\\pi_0}(s_t^i) \\beta_{\\pi/\\pi_0}(a_t^i|s_t^i)}{\\sum_{t',i'} \\gamma^{t'} w_{\\pi/\\pi_0}(s_{t'}^{i'}) \\beta_{\\pi/\\pi_0}(a_{t'}^{i'}|s_{t'}^{i'})}. \\tag{7}$$\n\nThis IS estimator works in the space of $(s, a)$, instead of trajectories $\\tau = \\{s_t, a_t\\}_{t=0}^{T}$, leading to a potentially significant variance reduction. Returning to the example in Section 2 (see also Appendix B), since the two policies are symmetric and lead to the same stationary distributions, that is, $w_{\\pi/\\pi_0}(s) = 1$, the importance weight in (6) is simply $\\pi(a|s)/\\pi_0(a|s)$, independent of the trajectory length. This avoids the excessive variance in long horizon problems. 
In Appendix A, we provide a further discussion, showing that our estimator can be viewed as a type of Rao-Blackwellization of the trajectory-wise and step-wise estimators.\n\n3.1 Average Reward Case\nThe key technical challenge remaining is estimating the density ratio $w_{\\pi/\\pi_0}(s)$, which we address in this section. To simplify the presentation, we start with estimating $d_\\pi(s)$ for the average reward case and discuss the discounted case in Section 3.2.\nLet $T_\\pi(s'|s) := \\sum_a T(s'|s,a) \\pi(a|s)$ be the transition probability from $s$ to $s'$ following policy $\\pi$. In the average reward case, $d_\\pi$ equals the stationary distribution of $T_\\pi$, satisfying\n\n$$d_\\pi(s') = \\sum_s T_\\pi(s'|s) d_\\pi(s), \\quad \\forall s'. \\tag{8}$$\n\nAssuming the Markov chain of $T_\\pi$ is finite-state and ergodic, $d_\\pi$ is also the unique distribution that satisfies (8). This simple fact can be leveraged to derive the following key property of $w_{\\pi/\\pi_0}(s)$.\nTheorem 1. In the average reward case ($\\gamma = 1$), assume $d_\\pi$ is the unique invariant distribution of $T_\\pi$ and $d_{\\pi_0}(s) > 0$, $\\forall s$. Then a function $w(s)$ equals $w_{\\pi/\\pi_0}(s) := d_\\pi(s)/d_{\\pi_0}(s)$ (up to a constant factor) if and only if it satisfies\n\n$$\\mathbb{E}_{(s,a)|s' \\sim d_{\\pi_0}}[\\Delta(w; s, a, s') \\mid s'] = 0, \\quad \\forall s', \\qquad \\text{with} \\quad \\Delta(w; s, a, s') := w(s) \\beta_{\\pi/\\pi_0}(a|s) - w(s'), \\tag{9}$$\n\nwhere $\\beta_{\\pi/\\pi_0}(a|s) = \\pi(a|s)/\\pi_0(a|s)$ and $(s, a)|s' \\sim d_{\\pi_0}$ denotes the conditional distribution $d_{\\pi_0}(s, a|s')$ related to the joint distribution $d_{\\pi_0}(s, a, s') := d_{\\pi_0}(s) \\pi_0(a|s) T(s'|s, a)$. 
Note that this is a time-reversed conditional probability, since it is the conditional distribution of $(s, a)$ given that their next state is $s'$ following policy $\\pi_0$.\n\nBecause the conditional distribution is time reversed, it is difficult to directly estimate the conditional expectation $\\mathbb{E}_{(s,a)|s'}[\\cdot]$ for a given $s'$. This is because we usually can observe only a single data point from $d_{\\pi_0}(s, a|s')$ for a fixed $s'$, given that it is unlikely to see by chance two different $(s, a)$ pairs transit to the same $s'$. This problem can be addressed by introducing a discriminator function and constructing a mini-max loss function. Specifically, multiplying (9) with a function $f(s')$ and averaging under $s' \\sim d_{\\pi_0}$ gives\n\n$$L(w, f) := \\mathbb{E}_{(s,a,s') \\sim d_{\\pi_0}}[\\Delta(w; s, a, s') f(s')] = \\mathbb{E}_{(s,a,s') \\sim d_{\\pi_0}}\\big[\\big(w(s) \\beta_{\\pi/\\pi_0}(a|s) - w(s')\\big) f(s')\\big]. \\tag{10}$$\n\nFollowing Theorem 1, we have $w \\propto w_{\\pi/\\pi_0}$ if and only if $L(w, f) = 0$ for any function $f$. This motivates us to estimate $w_{\\pi/\\pi_0}$ with a mini-max problem:\n\n$$\\min_w \\Big\\{ D(w) := \\max_{f \\in \\mathcal{F}} L(w/z_w, f)^2 \\Big\\}, \\tag{11}$$\n\nwhere $\\mathcal{F}$ is a set of discriminator functions and $z_w := \\mathbb{E}_{s \\sim d_{\\pi_0}}[w(s)]$ normalizes $w$ to avoid the trivial solution $w \\equiv 0$. We shall assume $\\mathcal{F}$ to be rich enough following the conditions to be discussed in Section 3.3. A promising choice of a rich function class is neural networks, for which the mini-max problem (11) can be solved numerically in a fashion similar to generative adversarial networks (GANs) [8]. Alternatively, we can take $\\mathcal{F}$ to be a ball of a reproducing kernel Hilbert space (RKHS), which enables a closed form representation of $D(w)$ as we show in the following.\nTheorem 2. Assume $\\mathcal{H}$ is a RKHS of functions $f(s)$ with a positive definite kernel $k(s, \\bar{s})$, and define $\\mathcal{F} := \\{f \\in \\mathcal{H} : \\|f\\|_{\\mathcal{H}} \\le 1\\}$ to be the unit ball of $\\mathcal{H}$. 
We have\n\n$$\\max_{f \\in \\mathcal{F}} L(w, f)^2 = \\mathbb{E}_{d_{\\pi_0}}[\\Delta(w; s, a, s') \\Delta(w; \\bar{s}, \\bar{a}, \\bar{s}') k(s', \\bar{s}')], \\tag{12}$$\n\nwhere $(s, a, s')$ and $(\\bar{s}, \\bar{a}, \\bar{s}')$ are independent transition pairs obtained when running policy $\\pi_0$, and $\\Delta(w; s, a, s')$ is defined in (10). See Appendix C for more background on RKHS.\n\nIn practice, we approximate the expectation in (12) using the empirical distribution of the transition pairs, yielding consistent estimates following standard results on V-statistics [33].\n\n3.2 Discounted Reward Case\nWe now discuss the extension to the discounted case of $\\gamma \\in (0, 1)$. Similar to the average reward case, we start with a recursive equation that characterizes $d_\\pi(s)$ in the discounted case.\nLemma 3. Following the definition of $d_\\pi$ in (4), for any $\\gamma \\in (0, 1]$, we have\n\n$$\\gamma \\sum_s T_\\pi(s'|s) d_\\pi(s) - d_\\pi(s') + (1 - \\gamma) d_0(s') = 0, \\quad \\forall s'. \\tag{13}$$\n\nDenote by $(s, a, s') \\sim d_\\pi$ draws from $d_\\pi(s) \\pi(a|s) T(s'|s, a)$. For any function $f$, we have\n\n$$\\mathbb{E}_{(s,a,s') \\sim d_\\pi}[\\gamma f(s') - f(s)] + (1 - \\gamma) \\mathbb{E}_{s \\sim d_0}[f(s)] = 0. \\tag{14}$$\n\nOne may view $d_\\pi$ as the invariant distribution of an induced Markov chain with transition probability of $(1 - \\gamma) d_0(s') + \\gamma T_\\pi(s'|s)$, which follows $T_\\pi$ with probability $\\gamma$, and restarts from initial distribution $d_0(s')$ with probability $1 - \\gamma$. We can show that $d_\\pi$ exists and is unique under mild conditions [31].\nTheorem 4. Assume $d_\\pi$ is the unique solution of (13), and $d_{\\pi_0}(s) > 0$, $\\forall s$. Define\n\n$$L(w, f) = \\gamma \\mathbb{E}_{(s,a,s') \\sim d_{\\pi_0}}[\\Delta(w; s, a, s') f(s')] + (1 - \\gamma) \\mathbb{E}_{s \\sim d_0}[(1 - w(s)) f(s)]. \\tag{15}$$\n\nAssume $0 < \\gamma < 1$; then $w(s) = w_{\\pi/\\pi_0}(s)$ if and only if $L(w, f) = 0$ for any test function $f$.\nWhen $\\gamma = 1$, the definition in (15) reduces to the average reward case in (10). A subtle difference is that $L(w, f) = 0$ only ensures $w \\propto w_{\\pi/\\pi_0}$ when $\\gamma = 1$, while $w = w_{\\pi/\\pi_0}$ when $\\gamma \\in (0, 1)$. 
This is because the additional term $\\mathbb{E}_{s \\sim d_0}[(1 - w(s)) f(s)]$ in (15) forces $w$ to be normalized properly. In practice, however, we still find it works better to pre-normalize $w$ to $\\tilde{w} = w/\\mathbb{E}_{d_{\\pi_0}}[w]$, and optimize the objective $L(\\tilde{w}, f)$.\n\n3.3 Further Theoretical Analysis\nIn this section, we develop further theoretical understanding on the loss function $L(w, f)$. Lemma 5 below reveals an interesting connection between $L(w, f)$ and the Bellman equation, allowing us to bound the estimation error of density ratio and expected reward with the mini-max loss when the discriminator space $\\mathcal{F}$ is chosen properly (Theorems 6 and 7). The results in this section apply to both discounted and average reward cases.\nLemma 5. Given $L(w, f)$ in (15), and assuming $\\mathbb{E}_{d_{\\pi_0}}[w] = 1$ in the average reward case, we have\n\n$$L(w, f) = \\mathbb{E}_{s \\sim d_{\\pi_0}}[(w_{\\pi/\\pi_0}(s) - w(s)) \\Pi f(s)], \\tag{16}$$\n\nwhere\n\n$$\\Pi f(s) := f(s) - \\gamma \\mathbb{E}_{(s',a)|s \\sim d_\\pi}[f(s')]. \\tag{17}$$\n\nNote that $\\Pi f$ equals the left hand side of the Bellman equations (1) and (2), when $f = V^\\pi$.\n\nLemma 5 represents $L(w, f)$ as an inner product between $w_{\\pi/\\pi_0} - w$ and $\\Pi f$ (under base measure $d_{\\pi_0}$). This provides an alternative proof of Theorem 4, since $L(w, f) = 0$, $\\forall f \\in \\mathcal{F}$ implies that $w_{\\pi/\\pi_0} - w$ is orthogonal to all $\\Pi f$ and hence $w_{\\pi/\\pi_0} = w$ when $\\{\\Pi f : f \\in \\mathcal{F}\\}$ is sufficiently rich.\nIn order to make $(w_{\\pi/\\pi_0} - w)$ orthogonal to a given function $g$, it requires \u201creversing\u201d operator $\\Pi$: finding a function $f_g$ which solves $g = \\Pi f_g$ for given $g$. Observing that $g = \\Pi f_g$ can be viewed as a Bellman equation (Eqs. (1)\u2013(2)) when taking $g$ and $f_g$ to be the reward and value functions, respectively, we can derive an explicit representation of $f_g$ (Lemma 10 in Appendix). 
This allows one to gain insights into what discriminator set $\\mathcal{F}$ would be a good choice, so that minimizing $\\max_{f \\in \\mathcal{F}} L(w, f)$ yields good estimation with desirable properties. In the following, by taking $g(s) \\propto \\pm \\mathbf{1}(s = \\tilde{s})$, $\\forall \\tilde{s}$, we can characterize the conditions on $\\mathcal{F}$ under which the mini-max loss upper bounds the estimation error of $w_{\\pi/\\pi_0}$ or $d_\\pi$.\n\nTheorem 6. Let $T_\\pi^t(s'|s)$ be the $t$-step transition probability of $T_\\pi(s'|s)$. For $\\forall \\tilde{s} \\in \\mathcal{S}$, define\n\n$$f_{\\tilde{s}}(s) = \\begin{cases} \\sum_{t=0}^{\\infty} \\gamma^t T_\\pi^t(\\tilde{s}|s) & \\text{when } 0 < \\gamma < 1, \\\\ \\sum_{t=0}^{\\infty} (T_\\pi^t(\\tilde{s}|s) - d_\\pi(\\tilde{s})) & \\text{when } \\gamma = 1. \\end{cases} \\tag{18}$$\n\nAssume Lemma 5 holds. We have\n\n$$\\max_{f \\in \\mathcal{F}} L(w, f) \\ge \\|w_{\\pi/\\pi_0} - w\\|_\\infty, \\quad \\text{if} \\quad \\{\\pm f_{\\tilde{s}}/d_{\\pi_0}(\\tilde{s}) : \\forall \\tilde{s} \\in \\mathcal{S}\\} \\subseteq \\mathcal{F},$$\n$$\\max_{f \\in \\mathcal{F}} L(w, f) \\ge \\|d_\\pi(s) - w(s) d_{\\pi_0}(s)\\|_\\infty, \\quad \\text{if} \\quad \\{\\pm f_{\\tilde{s}} : \\forall \\tilde{s} \\in \\mathcal{S}\\} \\subseteq \\mathcal{F}.$$\n\nSince our main goal is to estimate the expected total reward $R_\\pi$ instead of the density ratio $w_{\\pi/\\pi_0}$, it is of interest to select $\\mathcal{F}$ to directly bound the estimation error of the total reward. Interestingly, this can be achieved once $\\mathcal{F}$ includes the true value function $V^\\pi$.\nTheorem 7. Define $R_\\pi[w]$ to be the reward estimate using estimated density ratio $w(s)$ (which may not equal the true ratio $w_{\\pi/\\pi_0}$) and an infinite number of trajectories from $d_{\\pi_0}$, that is,\n\n$$R_\\pi[w] := \\mathbb{E}_{(s,a,s') \\sim d_{\\pi_0}}[w(s) \\beta_{\\pi/\\pi_0}(a|s) r(s, a)].$$\n\nAssume $w$ is properly normalized such that $\\mathbb{E}_{s \\sim d_{\\pi_0}}[w(s)] = 1$; then we have $L(w, V^\\pi) = R_\\pi - R_\\pi[w]$. Therefore, if $\\pm V^\\pi \\in \\mathcal{F}$, we have $|R_\\pi[w] - R_\\pi| \\le \\max_{f \\in \\mathcal{F}} L(w, f)$.\n\n4 Related Work\n\nOur off-policy setting is related to, but different from, off-policy value-function learning [30, 29, 37, 12, 25, 21]. Our goal is to estimate a single scalar that summarizes the quality of a policy (a.k.a. off-policy value estimation as called by some authors [20]). 
However, our idea can be extended to estimating value functions as well, by using estimated density ratios to weight observed transitions (cf. the distribution $\\mu$ in LSTDQ [16]). We leave this as future work.\nIS-based off-policy value estimation has seen a lot of interest recently for short-horizon problems, including contextual bandits [26, 13, 7, 42], and achieved many empirical successes [7, 34]. When extended to long-horizon problems, it faces an exponential blowup of variance, and variance-reduction techniques are used to improve the estimator [14, 39, 10, 42]. However, it can be proved that in the worst case, the mean squared error of any estimator has to depend exponentially on the horizon [20, 10]. Fortunately, many problems encountered in practical applications may present structures that enable more efficient off-policy estimation, as tackled by the present paper. An interesting open direction is to characterize theoretical conditions that can ensure tractable estimation for long horizon problems.\nFew prior works directly target infinite-horizon problems. There exist approaches that use simulated samples to estimate stationary state distributions [1, Chapter IV]. However, they need a reliable model to draw such simulations, a requirement that is not satisfied in many real-world applications. To the best of our knowledge, the recently developed COP-TD algorithm [11] is the only work that attempts to estimate $w_{\\pi/\\pi_0}$ as an intermediate step of estimating the value function of a target policy $\\pi$. They take a stochastic-approximation approach and show asymptotic consistency. 
However, extending their approach to continuous state/action spaces appears challenging.\n\n[Figure 1 panels (y-axis: log MSE; TV Distance in (e)): (a) # of Trajectories (n); (b) Different Behavior Policies; (c) Truncated Length T; (d) $\\hat{d}_\\pi(s)$ vs. $d_\\pi(s)$ Plot; (e) Training Iteration. Legend: Naive Average, On Policy (oracle), WIS Trajectory-wise, WIS Step-wise, Model-based, Our Method.]\n\nFigure 1: Results on Taxi environment with average reward ($\\gamma = 1$). (a)-(b) show the performance of various methods as the number of trajectories (a) and the difference between behavior and target policies (b) vary. (c) shows the change of truncated length $T$. (d) shows the scatter plot of pairs $(\\hat{d}_\\pi(s), d_\\pi(s))$, $\\forall s$. The diagonal line means exact estimation. (e) shows the weighted total variation distance between $\\hat{d}_\\pi := \\hat{w} d_{\\pi_0}$ and $d_\\pi$ along the training iteration of the ratio estimator $\\hat{w}$. The number of trajectories is fixed to be 100 in (b,c,d). The potential behavior policy $\\pi_+$ (the right most points in (b)) is used in (a,c,d,e).\n\nFinally, there is a comprehensive literature of two-sample density ratio estimation [e.g., 27, 35], which estimates the density ratio of two distributions from pairs of their samples. Our problem setting is different in that we only have data from $d_{\\pi_0}$, but not from $d_\\pi$; this makes the traditional density ratio estimators inapplicable to our problem. Our method is made possible by taking the special temporal structure of MDP into consideration.\n\n5 Experiment\n\nIn this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods. 
We compare with the standard trajectory-wise and step-wise IS and WIS methods. We do not report the results of unnormalized IS because they are generally significantly worse than WIS methods [30, 22]. In all the cases, we also compare with an on-policy oracle and a naive averaging baseline, which estimate the reward using direct averaging over the trajectories generated by the target policy and behavior policy, respectively. For problems with discrete action and state spaces, we also compare with a standard model-based method, which estimates the transition and reward model and then calculates expected reward explicitly using the model up to the desired truncation length. When applying our method on problems with finite and discrete state space, we optimize $w$ and $f$ in the space of all possible functions (corresponding to using a delta kernel in terms of RKHS). For continuous state space, we assume $w$ is a standard feed-forward neural network, and $\\mathcal{F}$ is a RKHS with a standard Gaussian RBF kernel whose bandwidth equals the median of the pairwise distances between the observed data points.\nBecause we cannot simulate truly infinite steps in practice, we use the behavior policy to generate trajectories of length $T$, and evaluate the algorithms based on the mean squared error (MSE) w.r.t. the $T$-step rewards of a large number of trajectories of length $T$ from the target policy. We expect that our method gets better as $T$ increases, since it is designed for infinite horizon problems, while the IS/WIS methods receive large variance and deteriorate as $T$ increases.\n\nTaxi Environment Taxi [6] is a 2D grid world simulating taxi movement along the grids. A taxi moves North, East, South, West or attempts to pick up or drop off a passenger. It receives a reward of 20 when it successfully picks up a passenger or drops her off at the right place, and otherwise a reward of -1 every time step. 
The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of $5 \\times 5$, which yields 2000 states in total ($25 \\times 2^4 \\times 5$, corresponding to 25 taxi locations, $2^4$ passenger appearance statuses and 5 taxi statuses (empty or with one of 4 destinations)).\nTo construct target and behavior policies for testing our algorithm, we set our target policy to be the final policy $\\pi_*$ after running Q-learning for 1000 iterations, and set another policy $\\pi_+$ after 950 iterations. The behavior policy is $\\pi = (1 - \\alpha)\\pi_* + \\alpha\\pi_+$, where $\\alpha$ is a mixing ratio that can be varied.\nResults in Taxi Environment Figure 1(a)\u2013(b) show results with average reward. We can see our method performs almost as well as the on-policy oracle, outperforming all the other methods. To evaluate the approximation error of the estimated density ratio $\\hat{w}$, we plot in Figure 1(e) the weighted total variation distance between $\\hat{d}_\\pi = \\hat{w} d_{\\pi_0}$ and the true $d_\\pi$ as we optimize the loss function. Figure 1(d) shows the scatter plot of $\\{(\\hat{d}_\\pi(s), d_\\pi(s)) : \\forall s \\in \\mathcal{S}\\}$ at convergence, indicating our method correctly estimates the true density ratio over the state space.\nFigure 2 shows similar results for discounted reward. From Figure 2(c) and (d), we can see that typical IS methods deteriorate as the trajectory length $T$ and discount factor $\\gamma$ increase, respectively, which is expected since their variance grows exponentially with $T$. In contrast, our density ratio method performs better as trajectory length $T$ increases, and is robust as $\\gamma$ increases.\n\n[Figure 2 panels (y-axis: log MSE): (a) # of Trajectories (n); (b) Different Behavior Policies; (c) Truncated Length T; (d) Discount Factor $\\gamma$. Legend: Naive Average, On Policy (oracle), WIS Trajectory-wise, WIS Step-wise, Model-based, Our Method.]\n\nFigure 2: Results on Taxi with discounted reward ($0 < \\gamma < 1$), as we vary the number of trajectories $n$ (a), the difference between target and behavior policies (b), the truncated length $T$ (c), and the discount factor $\\gamma$ (d). The default values of the parameters, unless varied, are $\\gamma = 0.99$, $n = 200$, $T = 400$. The potential behavior policy $\\pi_+$ (the right most points in (b)) is used in (a,c,d).\n\n[Figure 3 panels (y-axis: log MSE): (a) Average Reward Case, Mixing Ratio $\\alpha$; (b) Average Reward Case, Truncated Length T; (c) Discounted Reward Case, Mixing Ratio $\\alpha$; (d) Discounted Reward Case, Discount Factor $\\gamma$. Legend: Naive Average, On Policy (oracle), WIS Trajectory-wise, WIS Step-wise, Our Method.]\n\nFigure 3: Results on Pendulum. (a)-(b) show the results in the average reward case when we vary the mixing ratio $\\alpha$ in the behavior policies and the truncated length $T$, respectively. (c)-(d) show the results of the discounted reward case when we vary the mixing ratio $\\alpha$ in the behavior policies and the discount factor $\\gamma$, respectively. The default parameters are $n = 150$, $T = 1000$, $\\gamma = 0.99$, $\\alpha = 1$.\n\nPendulum Environment The Taxi environment features discrete action and state spaces. We now test Pendulum, which has a continuous state space of $\\mathbb{R}^3$ and action space of $[-2, 2]$. 
In this\nenvironment, we want to control the pendulum to make it stand up as long as possible (for the average\ncase), or as fast as possible (for the discounted case with a small discount factor). The policy is taken to be a truncated Gaussian\nwhose mean is a neural network function of the state and whose variance is a constant.\nWe train a near-optimal policy \u03c0* using REINFORCE and set it to be the target policy. The behavior\npolicy is set to be \u03c0 = (1 \u2212 \u03b1)\u03c0* + \u03b1\u03c0+, where \u03b1 is a mixing ratio, and \u03c0+ is another policy from\nREINFORCE taken before it has converged. Our results are shown in Figure 3, where we again find\nthat our method generally outperforms the standard trajectory-wise and step-wise WIS, and works\nfavorably in long-horizon problems (Figure 3(b)).\n\nFigure 4: Results on SUMO (a) with average reward, as we vary the number of trajectories (b), choose different\nbehavior policies (c), and vary the truncated length T (d). When fixed, the default parameters are n = 250, T = 400.\nThe behavior policy in (c) with x-tick 2 is used in (b) and (d).\n[Plot omitted. Panels: (a) Environment, (b) # of Trajectories (n), (c) Different Behavior Policies, (d) Truncated Length T; y-axis: log MSE; legend: Naive Average, On Policy (oracle), WIS Trajectory-wise, WIS Step-wise, Our Method.]\n\nSUMO Traffic Simulator SUMO [15] is an open source traffic simulator; see Figure 4(a) for an\nillustration. We consider the task of reducing traffic congestion by modelling traffic light control as a\nreinforcement learning problem [41]. We use TraCI, a built-in \u201cTraffic Control Interface\u201d, to interact\nwith the SUMO simulator. 
Full details of our environment settings can be found in Appendix E.\nOur results are shown in Figure 4, where we again find that our method is consistently better than\nstandard IS methods.\n\n6 Conclusions\n\nWe study the off-policy estimation problem in infinite-horizon problems and develop a new algorithm\nbased on direct estimation of the stationary state density ratio between the target and behavior policies.\nOur mini-max objective function enjoys nice theoretical properties and yields an intriguing connection\nwith Bellman equations that is worth further investigation. Future directions include scaling our\nmethod to larger problems and extending it to estimate value functions and to leverage off-policy\ndata in policy optimization.\n\nAcknowledgement\n\nThis work is supported in part by NSF CRII 1830161. We would like to acknowledge Google Cloud\nfor their support.\n\nReferences\n\n[1] S\u00f8ren Asmussen and Peter W. Glynn. Stochastic Simulation: Algorithms and Analysis,\nvolume 57 of Probability Theory and Stochastic Processes. Springer-Verlag, 2007.\n\n[2] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.\n\n[3] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability\nand statistics. Springer Science & Business Media, 2011.\n\n[4] L\u00e9on Bottou, Jonas Peters, Joaquin Qui\u00f1onero-Candela, Denis Xavier Charles, D. Max Chickering,\nElon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning\nand learning systems: The example of computational advertising. Journal of Machine Learning\nResearch, 14:3207\u20133260, 2013.\n\n[5] Olivier Chapelle, Eren Manavoglu, and Romer Rosales. Simple and scalable response prediction\nfor display advertising. ACM Transactions on Intelligent Systems and Technology, 5(4):61:1\u201361:34,\n2014.\n\n[6] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function\ndecomposition. 
Journal of Artificial Intelligence Research, 13:227\u2013303, 2000.\n\n[7] Miroslav Dud\u00edk, John Langford, and Lihong Li. Doubly robust policy evaluation and learning.\nIn Proceedings of the 28th International Conference on Machine Learning (ICML), pages\n1097\u20131104, 2011.\n\n[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural\nInformation Processing Systems 27 (NIPS), pages 2672\u20132680, 2014.\n\n[9] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch\u00f6lkopf, and Alexander\nSmola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723\u2013773,\n2012.\n\n[10] Zhaohan Guo, Philip S. Thomas, and Emma Brunskill. Using options and covariance testing\nfor long horizon off-policy policy evaluation. In Advances in Neural Information Processing\nSystems 30 (NIPS), pages 2489\u20132498, 2017.\n\n[11] Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In Proceedings of the\n34th International Conference on Machine Learning (ICML), pages 1372\u20131383, 2017.\n\n[12] Assaf Hallak, Aviv Tamar, Remi Munos, and Shie Mannor. Generalized emphatic temporal\ndifference learning: Bias-variance analysis. In Proceedings of the 30th AAAI Conference on\nArtificial Intelligence, pages 1631\u20131637, 2016.\n\n[13] Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment\neffects using the estimated propensity score. Econometrica, 71(4):1161\u20131189, 2003.\n\n[14] Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In\nProceedings of the 33rd International Conference on Machine Learning (ICML), pages 652\u2013661,\n2016.\n\n[15] Daniel Krajzewicz, Jakob Erdmann, Michael Behrisch, and Laura Bieker. Recent development\nand applications of SUMO - Simulation of Urban MObility. 
International Journal On Advances in\nSystems and Measurements, 5(3&4), 2012.\n\n[16] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine\nLearning Research, 4:1107\u20131149, 2003.\n\n[17] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American\nMathematical Soc., 2017.\n\n[18] Lihong Li, Shunbao Chen, Ankur Gupta, and Jim Kleban. Counterfactual analysis of click\nmetrics for search engine optimization: A case study. In Proceedings of the 24th International\nWorld Wide Web Conference (WWW), Companion Volume, pages 929\u2013934, 2015.\n\n[19] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of\ncontextual-bandit-based news article recommendation algorithms. In Proceedings of the 4th\nInternational Conference on Web Search and Data Mining (WSDM), pages 297\u2013306, 2011.\n\n[20] Lihong Li, R\u00e9mi Munos, and Csaba Szepesv\u00e1ri. Toward minimax off-policy value estimation.\nIn Proceedings of the 18th International Conference on Artificial Intelligence and Statistics\n(AISTATS), pages 608\u2013616, 2015.\n\n[21] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent\ncontrol variates for policy optimization via Stein identity. In Proceedings of the 6th International\nConference on Learning Representations (ICLR), 2018.\n\n[22] Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics.\nSpringer-Verlag, 2001.\n\n[23] Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy\nevaluation across representations with applications to educational games. In Proceedings of\nthe 13th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS),\npages 1077\u20131084, 2014.\n\n[24] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Sch\u00f6lkopf, et al. Kernel\nmean embedding of distributions: A review and beyond. 
Foundations and Trends\u00ae in Machine\nLearning, 10(1-2):1\u2013141, 2017.\n\n[25] R\u00e9mi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and efficient\noff-policy reinforcement learning. In Advances in Neural Information Processing Systems 29\n(NIPS), pages 1046\u20131054, 2016.\n\n[26] Susan A. Murphy, Mark van der Laan, and James M. Robins. Marginal mean models for\ndynamic regimes. Journal of the American Statistical Association, 96(456):1410\u20131423, 2001.\n\n[27] XuanLong Nguyen, Martin J Wainwright, and Michael Jordan. Estimating divergence functionals\nand the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions\non, 56(11):5847\u20135861, 2010.\n\n[28] Art B. Owen. Monte Carlo Theory, Methods and Examples. 2013.\nhttp://statweb.stanford.edu/~owen/mc.\n\n[29] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning\nwith function approximation. In Proceedings of the 18th International Conference on Machine Learning\n(ICML), pages 417\u2013424, 2001.\n\n[30] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy\nevaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML),\npages 759\u2013766, 2000.\n\n[31] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.\nWiley-Interscience, New York, 1994.\n\n[32] Bernhard Sch\u00f6lkopf and Alexander J Smola. Learning with kernels: support vector machines,\nregularization, optimization, and beyond. MIT press, 2001.\n\n[33] Robert J Serfling. Approximation theorems of mathematical statistics, volume 162. John Wiley\n& Sons, 2009.\n\n[34] Alexander L. Strehl, John Langford, Lihong Li, and Sham M. Kakade. Learning from logged\nimplicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS-10),\npages 2217\u20132225, 2010.\n\n[35] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. 
Density ratio estimation in machine\nlearning. Cambridge University Press, 2012.\n\n[36] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press,\nCambridge, MA, March 1998.\n\n[37] Richard S. Sutton, A. Rupam Mahmood, and Martha White. An emphatic approach to the\nproblem of off-policy temporal-difference learning. Journal of Machine Learning Research,\n17(73):1\u201329, 2016.\n\n[38] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via\ncontextual bandits. In Proceedings of the 22nd ACM International Conference on Information\n& Knowledge Management (CIKM), pages 1587\u20131594, 2013.\n\n[39] Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for\nreinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning\n(ICML), pages 2139\u20132148, 2016.\n\n[40] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh, Ishan Durugkar, and\nEmma Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems,\nwith applications to digital marketing. In Proceedings of the 31st AAAI Conference on Artificial\nIntelligence (AAAI), pages 4740\u20134745, 2017.\n\n[41] Elise Van der Pol and Frans A Oliehoek. Coordinated deep reinforcement learners for traffic\nlight control. In NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems,\n2016.\n\n[42] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dud\u00edk. Optimal and adaptive off-policy evaluation\nin contextual bandits. 
In Proceedings of the 34th International Conference on Machine\nLearning (ICML), pages 3589\u20133597, 2017.\n", "award": [], "sourceid": 2567, "authors": [{"given_name": "Qiang", "family_name": "Liu", "institution": "UT Austin"}, {"given_name": "Lihong", "family_name": "Li", "institution": "Google Inc."}, {"given_name": "Ziyang", "family_name": "Tang", "institution": "The University of Texas at Austin"}, {"given_name": "Dengyong", "family_name": "Zhou", "institution": "Google"}]}