{"title": "Generalized Off-Policy Actor-Critic", "book": "Advances in Neural Information Processing Systems", "page_first": 2001, "page_last": 2011, "abstract": "We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.", "full_text": "Generalized Off-Policy Actor-Critic\n\nShangtong Zhang, Wendelin Boehmer, Shimon Whiteson\n\n{shangtong.zhang, wendelin.boehmer, shimon.whiteson}@cs.ox.ac.uk\n\nDepartment of Computer Science\n\nUniversity of Oxford\n\nAbstract\n\nWe propose a new objective, the counterfactual objective, unifying existing ob-\njectives for off-policy policy gradient algorithms in the continuing reinforcement\nlearning (RL) setting. Compared to the commonly used excursion objective, which\ncan be misleading about the performance of the target policy when deployed,\nour new objective better predicts such performance. We prove the Generalized\nOff-Policy Policy Gradient Theorem to compute the policy gradient of the coun-\nterfactual objective and use an emphatic approach to get an unbiased sample from\nthis policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC)\nalgorithm. 
We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.\n\n1 Introduction\n\nReinforcement learning (RL) algorithms based on the policy gradient theorem (Sutton et al., 2000; Marbach and Tsitsiklis, 2001) have recently enjoyed great success in various domains, e.g., achieving human-level performance on Atari games (Mnih et al., 2016). The original policy gradient theorem is on-policy and used to optimize the on-policy objective. However, in many cases, we would prefer to learn off-policy to improve data efficiency (Lin, 1992) and exploration (Osband et al., 2018). To this end, the Off-Policy Policy Gradient (OPPG) Theorem (Degris et al., 2012; Maei, 2018; Imani et al., 2018) was developed and has been widely used (Silver et al., 2014; Lillicrap et al., 2015; Wang et al., 2016; Gu et al., 2017; Ciosek and Whiteson, 2017; Espeholt et al., 2018).\n\nIdeally, an off-policy algorithm should optimize the off-policy analogue of the on-policy objective. In the continuing RL setting, this analogue would be the performance of the target policy in expectation w.r.t. the stationary distribution of the target policy, which is referred to as the alternative life objective (Ghiassian et al., 2018). This objective corresponds to the performance of the target policy when deployed. Previously, OPPG optimized a different objective, the performance of the target policy in expectation w.r.t. the stationary distribution of the behavior policy. This objective is referred to as the excursion objective (Ghiassian et al., 2018), as it corresponds to the excursion setting (Sutton et al., 2016). 
Unfortunately, the excursion objective can be misleading about the performance of the target policy when deployed, as we illustrate in Section 3.\n\nIt is infeasible to optimize the alternative life objective directly in the off-policy continuing setting. Instead, we propose to optimize the counterfactual objective, which approximates the alternative life objective. In the excursion setting, an agent in the stationary distribution of the behavior policy considers a hypothetical excursion that follows the target policy. The return from this hypothetical excursion is an indicator of the performance of the target policy. The excursion objective measures this return w.r.t. the stationary distribution of the behavior policy, using samples generated by executing the behavior policy. By contrast, evaluating the alternative life objective requires samples from the stationary distribution of the target policy, to which the agent does not have access. In the counterfactual objective, we use a new parameter \u02c6\u03b3 to control how counterfactual the objective is, akin to Gelada and Bellemare (2019). With \u02c6\u03b3 = 0, the counterfactual objective uses the stationary distribution of the behavior policy to measure the performance of the target policy, recovering the excursion objective. With \u02c6\u03b3 = 1, the counterfactual objective is fully decoupled from the behavior policy and uses the stationary distribution of the target policy to measure the performance of the target policy, recovering the alternative life objective. As in the excursion objective, the excursion is never actually executed and the agent always follows the behavior policy.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe make two contributions in this paper. First, we prove the Generalized Off-Policy Policy Gradient (GOPPG) Theorem, which gives the policy gradient of the counterfactual objective. 
Second, using an emphatic approach (Sutton et al., 2016) to compute an unbiased sample for this policy gradient, we develop the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We evaluate Geoff-PAC empirically in challenging robot simulation tasks with neural network function approximators. Geoff-PAC outperforms the actor-critic algorithms proposed by Degris et al. (2012); Imani et al. (2018), and to our best knowledge, Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.\n\n2 Background\n\nWe use a time-indexed capital letter (e.g., Xt) to denote a random variable. We use a bold capital letter (e.g., X) to denote a matrix and a bold lowercase letter (e.g., x) to denote a column vector. If x : S \u2192 R is a scalar function defined on a finite set S, we use its corresponding bold lowercase letter to denote its vector form, i.e., x .= [x(s_1), . . . , x(s_{|S|})]^T. We use I to denote the identity matrix and 1 to denote an all-one column vector.\n\nWe consider an infinite horizon MDP (Puterman, 2014) consisting of a finite state space S, a finite action space A, a bounded reward function r : S \u00d7 A \u2192 R and a transition kernel p : S \u00d7 S \u00d7 A \u2192 [0, 1]. We consider a transition-based discount function (White, 2017) \u03b3 : S \u00d7 A \u00d7 S \u2192 [0, 1] for unifying continuing tasks and episodic tasks. At time step t, an agent at state St takes an action At according to a policy \u03c0 : A \u00d7 S \u2192 [0, 1]. The agent then proceeds to a new state St+1 according to p and gets a reward Rt+1 satisfying E[Rt+1] = r(St, At). The return of \u03c0 at time step t is Gt .= \u2211_{i=0}^{\u221e} \u0393_t^{i\u22121} Rt+1+i, where \u0393_t^{i\u22121} .= \u03a0_{j=0}^{i\u22121} \u03b3(St+j, At+j, St+j+1) and \u0393_t^{\u22121} .= 1. We use v\u03c0 to denote the value function of \u03c0, which is defined as v\u03c0(s) .= E\u03c0[Gt|St = s]. Like White (2017), we assume v\u03c0 exists for all s. We use q\u03c0(s, a) .= E\u03c0[Gt|St = s, At = a] to denote the state-action value function of \u03c0. We use P\u03c0 to denote the transition matrix induced by \u03c0, i.e., P\u03c0[s, s'] .= \u2211_a \u03c0(a|s)p(s'|s, a). We assume the chain induced by \u03c0 is ergodic and use d\u03c0 to denote its unique stationary distribution. We define D\u03c0 .= diag(d\u03c0).\n\nIn the off-policy setting, an agent aims to learn a target policy \u03c0 but follows a behavior policy \u00b5. We use the same assumption of coverage as Sutton and Barto (2018), i.e., \u2200(s, a), \u03c0(a|s) > 0 =\u21d2 \u00b5(a|s) > 0. We assume the chain induced by \u00b5 is ergodic and use d\u00b5 to denote its stationary distribution. Similarly, D\u00b5 .= diag(d\u00b5). We define \u03c1(s, a) .= \u03c0(a|s)/\u00b5(a|s), \u03c1t .= \u03c1(St, At) and \u03b3t .= \u03b3(St\u22121, At\u22121, St).\n\nTypically, there are two kinds of tasks in RL, prediction and control.\n\nPrediction: In prediction, we are interested in finding the value function v\u03c0 of a given policy \u03c0. Temporal Difference (TD) learning (Sutton, 1988) is perhaps the most powerful algorithm for prediction. TD enjoys a convergence guarantee in both on- and off-policy tabular settings. TD can also be combined with linear function approximation. The update rule for on-policy linear TD is w \u2190 w + \u03b1\u2206t, where \u03b1 is a step size and \u2206t .= [Rt+1 + \u03b3V(St+1) \u2212 V(St)]\u2207wV(St) is an incremental update. Here we use V to denote an estimate of v\u03c0 parameterized by w. Tsitsiklis and Van Roy (1997) prove the convergence of on-policy linear TD. In off-policy linear TD, the update \u2206t is weighted by \u03c1t. The divergence of off-policy linear TD is well documented (Tsitsiklis and Van Roy, 1997). To approach this issue, Gradient TD methods (Sutton et al., 2009) were proposed. Instead of bootstrapping from the prediction of a successor state like TD, Gradient TD methods compute the gradient of the projected Bellman error directly. Gradient TD methods are true stochastic gradient methods and enjoy convergence guarantees. However, they are usually two-time-scale, involving two sets of parameters and two learning rates, which makes them hard to use in practice (Sutton et al., 2016). To approach this issue, Emphatic TD (ETD, Sutton et al. 2016) was proposed.\n\nETD introduces an interest function i : S \u2192 [0, \u221e) to specify the user's preferences for different states. With function approximation, we typically cannot get accurate predictions for all states and must thus trade off between them. States are usually weighted by d\u00b5(s) in the off-policy setting (e.g., Gradient TD methods) but with the interest function, we can explicitly weight them by d\u00b5(s)i(s) in our objective. Consequently, we weight the update at time t via Mt, which is the emphasis that accumulates previous interests in a certain way. In the simplest form of ETD, we have Mt .= i(St) + \u03b3t\u03c1t\u22121Mt\u22121. The update \u2206t is weighted by \u03c1tMt. In practice, we usually set i(s) \u2261 1.\n\nInspired by ETD, Hallak and Mannor (2017) propose to weight \u2206t via \u03c1t\u00afc(St) in the Consistent Off-Policy TD (COP-TD) algorithm, where \u00afc(s) .= d\u03c0(s)/d\u00b5(s) is the density ratio, which is also known as the covariate shift (Gelada and Bellemare, 2019). To learn \u00afc via stochastic approximation, Hallak and Mannor (2017) propose the COP operator. However, the COP operator does not have a unique fixed point. Extra normalization and projection is used to ensure convergence (Hallak and Mannor, 2017) in the tabular setting. To address this limitation, Gelada and Bellemare (2019) further propose the \u02c6\u03b3-discounted COP operator.\n\nGelada and Bellemare (2019) define a new transition matrix P\u02c6\u03b3 .= \u02c6\u03b3P\u03c0 + (1 \u2212 \u02c6\u03b3)1d\u00b5^T, where \u02c6\u03b3 \u2208 [0, 1] is a constant. Following this matrix, an agent either proceeds to the next state according to P\u03c0 w.p. \u02c6\u03b3 or gets reset to d\u00b5 w.p. 1 \u2212 \u02c6\u03b3. Gelada and Bellemare (2019) show the chain under P\u02c6\u03b3 is ergodic. With d\u02c6\u03b3 denoting its stationary distribution, they prove\n\nd\u02c6\u03b3 = (1 \u2212 \u02c6\u03b3)(I \u2212 \u02c6\u03b3P\u03c0^T)^\u22121 d\u00b5 (\u02c6\u03b3 < 1) (1)\n\nand d\u02c6\u03b3 = d\u03c0 (\u02c6\u03b3 = 1). With c(s) .= d\u02c6\u03b3(s)/d\u00b5(s), Gelada and Bellemare (2019) prove that\n\nc = \u02c6\u03b3D\u00b5^\u22121P\u03c0^TD\u00b5c + (1 \u2212 \u02c6\u03b3)1, (2)\n\nyielding the following learning rule for estimating c in the tabular setting:\n\nC(St+1) \u2190 C(St+1) + \u03b1[\u02c6\u03b3\u03c1tC(St) + (1 \u2212 \u02c6\u03b3) \u2212 C(St+1)], (3)\n\nwhere C is an estimate of c and \u03b1 is a step size. A semi-gradient is used when C is a parameterized function (Gelada and Bellemare, 2019). For a small \u02c6\u03b3 (depending on the difference between \u03c0 and \u00b5), Gelada and Bellemare (2019) prove a multi-step contraction under linear function approximation. For a large \u02c6\u03b3 or nonlinear function approximation, they provide an extra normalization loss for the sake of the constraint d\u00b5^Tc = 1^Td\u02c6\u03b3 = 1. Gelada and Bellemare (2019) use \u03c1tc(St) to weight the update \u2206t in Discounted COP-TD. They demonstrate empirical success in Atari games (Bellemare et al., 2013) with pixel inputs.\n\nControl: In this paper, we focus on policy-based control. 
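The tabular rule (3) is simple to realize. Below is a minimal sketch, not from the paper's codebase: a made-up two-state MDP with a uniform behavior policy, on which the estimate C learned by rule (3) approaches the closed-form ratio c = d\u02c6\u03b3/d\u00b5 implied by (1). All MDP quantities, the step-size schedule, and \u02c6\u03b3 = 0.3 are illustrative assumptions.

```python
import numpy as np

# Toy check of the tabular learning rule (3); every quantity below is an illustrative assumption.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
hat_gamma = 0.3

# p[s, a] is the next-state distribution; mu is uniform, pi is an arbitrary target policy.
p = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
mu = np.full((n_states, n_actions), 0.5)
pi = np.array([[0.9, 0.1], [0.3, 0.7]])

C = np.ones(n_states)  # estimate of the density ratio c; c == 1 would hold if pi == mu
s = 0
for t in range(200000):
    a = rng.choice(n_actions, p=mu[s])
    s_next = rng.choice(n_states, p=p[s, a])
    rho = pi[s, a] / mu[s, a]
    alpha = 1.0 / (100 + t) ** 0.7  # decaying step size (Robbins-Monro style)
    # Rule (3): C(S_{t+1}) += alpha * [hat_gamma * rho_t * C(S_t) + (1 - hat_gamma) - C(S_{t+1})]
    C[s_next] += alpha * (hat_gamma * rho * C[s] + (1 - hat_gamma) - C[s_next])
    s = s_next
```

With \u02c6\u03b3 = 0 the fixed point is c \u2261 1; as \u02c6\u03b3 grows toward 1, C drifts toward the undiscounted ratio d\u03c0/d\u00b5.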
In the on-policy continuing setting, we seek to optimize the average value objective (Silver, 2015)\n\nJ\u03c0 .= \u2211_s d\u03c0(s)i(s)v\u03c0(s). (4)\n\nOptimizing the average value objective is equivalent to optimizing the average reward objective (Puterman, 2014) if both \u03b3 and i are constant (see White 2017). In general, the average value objective can be interpreted as a generalization of the average reward objective to adopt transition-based discounting and a nonconstant interest function.\n\nIn the off-policy continuing setting, Imani et al. (2018) propose to optimize the excursion objective\n\nJ\u00b5 .= \u2211_s d\u00b5(s)i(s)v\u03c0(s) (5)\n\ninstead of the alternative life objective J\u03c0. The key difference between J\u03c0 and J\u00b5 is how we trade off different states. With function approximation, it is usually not possible to maximize v\u03c0(s) for all states, which is the first trade-off we need to make. Moreover, visiting one state more implies visiting another state less, which is the second trade-off we need to make. J\u00b5 and J\u03c0 achieve both kinds of trade-off according to d\u00b5 and d\u03c0 respectively. However, it is J\u03c0, not J\u00b5, that correctly reflects the deploy-time performance of \u03c0, as the behavior policy will no longer matter when we deploy the off-policy learned \u03c0 in a continuing task.\n\nIn both objectives, i(s) is usually set to 1. We assume \u03c0 is parameterized by \u03b8. In the rest of this paper, all gradients are taken w.r.t. \u03b8 unless otherwise specified, and we consider the gradient for only one component of \u03b8 for the sake of clarity.\n\nFigure 1: (a) The two-circle MDP. Rewards are 0 unless specified on the edge. (b) The probability of transitioning to B from A under target policy \u03c0 during training. (c) The influence of \u02c6\u03b3 and \u03bb2 on the final solution found by Geoff-PAC.\n\nIt is not clear how to compute the policy gradient of J\u03c0 in the off-policy continuing setting directly. For J\u00b5, we can compute the policy gradient as\n\n\u2207J\u00b5 = \u2211_s d\u00b5(s)i(s) \u2211_a (q\u03c0(s, a)\u2207\u03c0(a|s) + \u03c0(a|s)\u2207q\u03c0(s, a)). (6)\n\nDegris et al. (2012) prove in the Off-Policy Policy Gradient (OPPG) theorem that we can ignore the term \u03c0(a|s)\u2207q\u03c0(s, a) without introducing bias for a tabular policy1 when i(s) \u2261 1, yielding the Off-Policy Actor Critic (Off-PAC), which updates \u03b8 as\n\n\u03b8t+1 = \u03b8t + \u03b1\u03c1tq\u03c0(St, At)\u2207 log \u03c0(At|St), (7)\n\nwhere \u03b1 is a step size, St is sampled from d\u00b5, and At is sampled from \u00b5(\u00b7|St). For a policy using a general function approximator, Imani et al. (2018) propose a new OPPG theorem. They define\n\nF^(1)_t .= i(St) + \u03b3t\u03c1t\u22121F^(1)_{t\u22121}, M^(1)_t .= (1 \u2212 \u03bb1)i(St) + \u03bb1F^(1)_t,\nZ^(1)_t .= \u03c1tM^(1)_t q\u03c0(St, At)\u2207 log \u03c0(At|St),\n\nwhere \u03bb1 \u2208 [0, 1] is a constant used to optimize the bias-variance trade-off and F^(1)_{\u22121} .= 0. Imani et al. (2018) prove that Z^(1)_t is an unbiased sample of \u2207J\u00b5 in the limiting sense if \u03bb1 = 1 and \u03c0 is fixed, i.e., lim_{t\u2192\u221e} E\u00b5[Z^(1)_t] = \u2207J\u00b5. Based on this, Imani et al. (2018) propose the Actor-Critic with Emphatic weightings (ACE) algorithm, which updates \u03b8 as \u03b8t+1 = \u03b8t + \u03b1Z^(1)_t. ACE is an emphatic approach where M^(1)_t is the emphasis to reweigh the update.\n\n3 The Counterfactual Objective\n\nWe now introduce the counterfactual objective\n\nJ\u02c6\u03b3 .= \u2211_s d\u02c6\u03b3(s)\u02c6i(s)v\u03c0(s), (8)\n\nwhere \u02c6i is a user-defined interest function. 
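To make the follow-on trace F^(1) concrete, here is a small numerical check (a sketch on a made-up two-state MDP, not the paper's code): with constant \u03b3 and i(s) \u2261 1, the Monte Carlo estimate of d\u00b5(s) E\u00b5[F^(1)_t | St = s] should match the closed-form emphatic weighting m = (I \u2212 \u03b3P\u03c0^T)^\u22121 D\u00b5 i that appears in the limiting analysis of Sutton et al. (2016); all numerical quantities below are assumptions for illustration.

```python
import numpy as np

# Verify d_mu(s) * lim_t E_mu[F_t | S_t = s] = m(s), with m = (I - gamma P_pi^T)^{-1} D_mu i,
# for constant gamma and i(s) = 1, on a made-up two-state MDP.
rng = np.random.default_rng(1)
nS, nA, gamma = 2, 2, 0.6
p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p[s, a] = next-state distribution
mu = np.full((nS, nA), 0.5)                     # uniform behavior policy
pi = rng.dirichlet(np.ones(nA), size=nS)        # arbitrary target policy

P_pi = np.einsum('sa,sap->sp', pi, p)
P_mu = np.einsum('sa,sap->sp', mu, p)
A = np.vstack([P_mu.T - np.eye(nS), np.ones(nS)])
d_mu = np.linalg.lstsq(A, np.r_[np.zeros(nS), 1.0], rcond=None)[0]
m = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, d_mu)  # D_mu i = d_mu when i(s) = 1

# Monte Carlo: F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1}, averaged per visited state.
F, s = 1.0, 0
sums, counts = np.zeros(nS), np.zeros(nS)
for t in range(300000):
    a = rng.choice(nA, p=mu[s])
    s_next = rng.choice(nS, p=p[s, a])
    F = 1.0 + gamma * (pi[s, a] / mu[s, a]) * F
    sums[s_next] += F
    counts[s_next] += 1
    s = s_next

m_hat = d_mu * sums / counts  # empirical counterpart of m
```

The same per-state averaging trick is a cheap way to sanity-check the traces F^(2) and M^(2) introduced later.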
Similarly, we can set \u02c6i(s) to 1 for the continuing setting but we proceed with a general \u02c6i. When \u02c6\u03b3 = 1, J\u02c6\u03b3 recovers the alternative life objective J\u03c0. When \u02c6\u03b3 = 0, J\u02c6\u03b3 recovers the excursion objective J\u00b5. To motivate the counterfactual objective J\u02c6\u03b3, we first present the two-circle MDP (Figure 1a) to highlight the difference between J\u03c0 and J\u00b5.\n\nIn the two-circle MDP, an agent needs to make a decision only in state A. The behavior policy \u00b5 proceeds to B or C randomly with equal probability. For this continuing task, the discount factor \u03b3 is always 0.6 and the interest is always 1. Under this task specification (White, 2017), the optimal policy under the alternative life objective J\u03c0, which is equivalent to the average reward objective as \u03b3 and i are constant, is to stay in the outer circle. However, to maximize J\u00b5, the agent prefers the inner circle. To see this, first note v\u03c0(B) and v\u03c0(C) hardly change w.r.t. \u03c0, and we have v\u03c0(B) \u2248 3.6 and v\u03c0(C) \u2248 5. To maximize J\u00b5, the target policy \u03c0 would prefer transitioning to state C to maximize v\u03c0(A). The agent, therefore, remains in the inner circle. The two-circle MDP is tabular, so the policy maximizing v\u03c0(s) for all s can be represented accurately. If we consider an episodic task, e.g., where we aim to maximize only v\u03c0(A) and set the discount of the transition back to A to 0, such a policy will be optimal under the episodic return criterion. However, when we consider a continuing task and aim to optimize J\u03c0, we have to consider state visitation. The excursion objective J\u00b5 maximizes v\u03c0(A) regardless of the state visitation under \u03c0, yielding a policy \u03c0 that never visits the state D, a state of the highest value. Such a policy is sub-optimal in this continuing task. To maximize J\u03c0, the agent has to sacrifice v\u03c0(A) and visit state D more. This two-circle MDP is not an artifact due to the small \u03b3. The same effect can also occur with a larger \u03b3 if the path is longer. With function approximation, the discrepancy between J\u00b5 and J\u03c0 can be magnified as we need to make trade-offs in both maximizing v\u03c0(s) and state visitation.\n\n1See Errata in Degris et al. (2012), also in Imani et al. (2018); Maei (2018).\n\nOne solution to this problem is to set the interest function i in J\u00b5 in a clever way. However, it is not clear how to achieve this without domain knowledge. Imani et al. (2018) simply set i to 1. Another solution might be to optimize J\u03c0 directly in off-policy learning, if one could use importance sampling ratios to fully correct d\u00b5 to d\u03c0 as Precup et al. (2001) propose for value-based methods in the episodic setting. However, this solution is infeasible for the continuing setting (Sutton et al., 2016). One may also use the differential value function (Sutton and Barto, 2018) to replace the discounted value function in J\u00b5. Off-policy policy gradient with the differential value function, however, is still an open problem and we are not aware of existing literature on this.\n\nIn this paper, we propose to optimize J\u02c6\u03b3 instead. It is a well-known fact that the policy gradient of the stationary distribution exists under mild conditions (e.g., see Yu (2005)).2 It follows immediately from the proof of this existence that lim_{\u02c6\u03b3\u21921} d\u02c6\u03b3 = d\u03c0. Moreover, it is trivial to see that lim_{\u02c6\u03b3\u21920} d\u02c6\u03b3 = d\u00b5, indicating the counterfactual objective can recover both the excursion objective and the alternative life objective smoothly. 
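The smooth interpolation between the two stationary distributions can also be seen numerically. The sketch below (a toy three-state chain; both transition matrices are made up for illustration) evaluates d\u02c6\u03b3 from equation (1) at the two extremes of \u02c6\u03b3:

```python
import numpy as np

# d_hat_gamma from Eq. (1) interpolates between d_mu (hat_gamma = 0) and d_pi (hat_gamma -> 1).
rng = np.random.default_rng(0)
n = 3
P_pi = rng.dirichlet(np.ones(n), size=n)   # target-policy transition matrix (assumed)
P_mu = rng.dirichlet(np.ones(n), size=n)   # behavior-policy transition matrix (assumed)

def stationary(P):
    # Solve d = P^T d with sum(d) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

d_pi, d_mu = stationary(P_pi), stationary(P_mu)

def d_hat(hat_gamma):
    # Eq. (1): d_hat = (1 - hat_gamma) (I - hat_gamma P_pi^T)^{-1} d_mu
    return (1 - hat_gamma) * np.linalg.solve(np.eye(n) - hat_gamma * P_pi.T, d_mu)

print(d_hat(0.0))    # equals d_mu exactly
print(d_hat(0.999))  # close to d_pi
```

Every intermediate \u02c6\u03b3 yields a proper distribution (it sums to 1), which is the constraint d\u00b5^Tc = 1 mentioned in Section 2.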
Furthermore, we show empirically that a small \u02c6\u03b3 (e.g., 0.6 in the two-circle MDP and 0.2 in Mujoco tasks) is enough to generate a different solution from maximizing J\u00b5.\n\n4 Generalized Off-Policy Policy Gradient\n\nIn this section, we derive an estimator for \u2207J\u02c6\u03b3 and show in Proposition 1 that it is unbiased in the limiting sense. Our (standard) assumptions are given in supplementary materials. The OPPG theorem (Imani et al., 2018) leaves us the freedom to choose the interest function i in J\u00b5. In this paper, we set i(s) .= \u02c6i(s)c(s), which, to our best knowledge, is the first time that a non-trivial interest is used. Hence, i depends on \u03c0 and we cannot invoke OPPG directly as \u2207J\u00b5 \u2260 \u2211_s d\u00b5(s)i(s)\u2207v\u03c0(s). However, we can still invoke the remaining parts of OPPG:\n\n\u2211_s d\u00b5(s)i(s)\u2207v\u03c0(s) = \u2211_s m(s) \u2211_a q\u03c0(s, a)\u2207\u03c0(a|s), (9)\n\nwhere m^T .= i^TD\u00b5(I \u2212 P\u03c0,\u03b3)^\u22121, P\u03c0,\u03b3[s, s'] .= \u2211_a \u03c0(a|s)p(s'|s, a)\u03b3(s, a, s'). We now compute the gradient \u2207J\u02c6\u03b3.\n\nTheorem 1 (Generalized Off-Policy Policy Gradient Theorem) For \u02c6\u03b3 < 1,\n\n\u2207J\u02c6\u03b3 = \u2211_s m(s) \u2211_a q\u03c0(s, a)\u2207\u03c0(a|s) [term 1] + \u2211_s d\u00b5(s)\u02c6i(s)v\u03c0(s)g(s) [term 2],\n\nwhere g .= \u02c6\u03b3D\u00b5^\u22121(I \u2212 \u02c6\u03b3P\u03c0^T)^\u22121b, b .= \u2207P\u03c0^TD\u00b5c.\n\nProof. We first use the product rule of calculus and plug in d\u02c6\u03b3(s) = d\u00b5(s)c(s):\n\n\u2207J\u02c6\u03b3 = \u2211_s d\u02c6\u03b3(s)\u02c6i(s)\u2207v\u03c0(s) + \u2211_s \u2207d\u02c6\u03b3(s)\u02c6i(s)v\u03c0(s)\n= \u2211_s d\u00b5(s)c(s)\u02c6i(s)\u2207v\u03c0(s) [term 3] + \u2211_s d\u00b5(s)\u2207c(s)\u02c6i(s)v\u03c0(s) [term 4].\n\nTerm 1 equals term 3 directly from (9) with i(s) = \u02c6i(s)c(s). To show that term 2 equals term 4, we take gradients on both sides of (2). We have\n\n\u2207c = \u02c6\u03b3D\u00b5^\u22121P\u03c0^TD\u00b5\u2207c + \u02c6\u03b3D\u00b5^\u22121\u2207P\u03c0^TD\u00b5c.\n\nSolving this linear system for \u2207c leads to\n\n\u2207c = (I \u2212 \u02c6\u03b3D\u00b5^\u22121P\u03c0^TD\u00b5)^\u22121\u02c6\u03b3D\u00b5^\u22121\u2207P\u03c0^TD\u00b5c = \u02c6\u03b3D\u00b5^\u22121(I \u2212 \u02c6\u03b3P\u03c0^T)^\u22121\u2207P\u03c0^TD\u00b5c = g,\n\nwhere the second equality uses (I \u2212 \u02c6\u03b3D\u00b5^\u22121P\u03c0^TD\u00b5)^\u22121 = D\u00b5^\u22121(I \u2212 \u02c6\u03b3P\u03c0^T)^\u22121D\u00b5. With \u2207c(s) = g(s), term 2 equals term 4. \u25a1\n\nNow we use an emphatic approach to provide an unbiased sample of \u2207J\u02c6\u03b3. We define\n\nIt .= c(St\u22121)\u03c1t\u22121\u2207 log \u03c0(At\u22121|St\u22121), F^(2)_t .= It + \u02c6\u03b3\u03c1t\u22121F^(2)_{t\u22121}, M^(2)_t .= (1 \u2212 \u03bb2)It + \u03bb2F^(2)_t.\n\nHere It functions as an intrinsic interest (in contrast to the user-defined extrinsic interest \u02c6i) and is a sample for b. F^(2)_t accumulates previous interests and translates b into g. \u03bb2 is for bias-variance trade-off similar to Sutton et al. (2016); Imani et al. (2018). We now define\n\nZ^(2)_t .= \u02c6\u03b3\u02c6i(St)v\u03c0(St)M^(2)_t, Zt .= Z^(1)_t + Z^(2)_t,\n\nand proceed to show that Zt is an unbiased sample of \u2207J\u02c6\u03b3 when t \u2192 \u221e.\n\nLemma 1 Assuming the chain induced by \u00b5 is ergodic and \u03c0 is fixed, the limit f(s) .= d\u00b5(s) lim_{t\u2192\u221e} E\u00b5[F^(2)_t | St = s] exists, and f = (I \u2212 \u02c6\u03b3P\u03c0^T)^\u22121b for \u02c6\u03b3 < 1.\n\nProof. Previous works (Sutton et al., 2016; Imani et al., 2018) assume the existence of lim_{t\u2192\u221e} E\u00b5[F^(1)_t | St = s]. Here we prove the existence of lim_{t\u2192\u221e} E\u00b5[F^(2)_t | St = s], inspired by the process of computing the value of lim_{t\u2192\u221e} E\u00b5[F^(1)_t | St = s] (assuming its existence) in Sutton et al. (2016). The existence of lim_{t\u2192\u221e} E\u00b5[F^(1)_t | St = s] with transition-dependent \u03b3 can also be established with the same routine.3 The proof also involves similar techniques as Hallak and Mannor (2017). Details in supplementary materials. \u25a1\n\nProposition 1 Assuming the chain induced by \u00b5 is ergodic, \u03c0 is fixed, \u03bb1 = \u03bb2 = 1, \u02c6\u03b3 < 1, and i(s) .= \u02c6i(s)c(s), then lim_{t\u2192\u221e} E\u00b5[Zt] = \u2207J\u02c6\u03b3.\n\nProof. The proof involves Proposition 1 in Imani et al. (2018) and Lemma 1. Details are provided in supplementary materials. \u25a1\n\nWhen \u02c6\u03b3 = 0, the Generalized Off-Policy Policy Gradient (GOPPG) Theorem recovers the OPPG theorem in Imani et al. (2018). The main contribution of GOPPG lies in the computation of \u2207c, i.e., the policy gradient of a distribution, which has not been done by previous policy gradient methods. The main contribution of Proposition 1 is the trace F^(2)_t, which is an effective way to approximate \u2207c.\n\n2For completeness, we include that proof in supplementary materials. 
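Because every quantity in Theorem 1 is linear-algebraic for a tabular MDP, the theorem can be checked mechanically. The following sketch (all MDP quantities, a softmax target policy, constant \u03b3, and \u02c6i \u2261 1 are assumptions for illustration, not the paper's setup) assembles \u2207J\u02c6\u03b3 from Theorem 1 and compares it against a finite-difference gradient of the counterfactual objective (8):

```python
import numpy as np

# Finite-difference check of Theorem 1 on a made-up tabular MDP (constant gamma, hat_i = 1).
rng = np.random.default_rng(0)
nS, nA, gamma, hat_gamma = 3, 2, 0.8, 0.5
p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p[s, a] = next-state distribution
r = rng.standard_normal((nS, nA))               # reward function r(s, a)
mu = np.full((nS, nA), 1.0 / nA)                # uniform behavior policy
theta = rng.standard_normal((nS, nA))           # softmax parameters of the target policy

def softmax(th):
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P):
    A = np.vstack([P.T - np.eye(nS), np.ones(nS)])
    return np.linalg.lstsq(A, np.r_[np.zeros(nS), 1.0], rcond=None)[0]

d_mu = stationary(np.einsum('sa,sap->sp', mu, p))

def J(th):
    # Counterfactual objective (8) with hat_i = 1: J = sum_s d_hat_gamma(s) v_pi(s).
    pi = softmax(th)
    P_pi = np.einsum('sa,sap->sp', pi, p)
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r).sum(axis=1))
    d_hat = (1 - hat_gamma) * np.linalg.solve(np.eye(nS) - hat_gamma * P_pi.T, d_mu)
    return d_hat @ v

# Assemble the gradient of Theorem 1 component by component.
pi = softmax(theta)
P_pi = np.einsum('sa,sap->sp', pi, p)
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r).sum(axis=1))
q = r + gamma * np.einsum('sap,p->sa', p, v)
d_hat = (1 - hat_gamma) * np.linalg.solve(np.eye(nS) - hat_gamma * P_pi.T, d_mu)
c = d_hat / d_mu
m = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, d_mu * c)   # interest i(s) = c(s)
inv_hat = np.linalg.inv(np.eye(nS) - hat_gamma * P_pi.T)

grad = np.zeros((nS, nA))
for s in range(nS):
    for k in range(nA):
        dpi_s = pi[s] * ((np.arange(nA) == k) - pi[s, k])     # d pi(.|s) / d theta[s, k]
        term1 = m[s] * (q[s] @ dpi_s)                         # sum_s m(s) sum_a q grad pi
        b = (dpi_s @ p[s]) * (d_mu[s] * c[s])                 # b = grad(P_pi)^T D_mu c
        g = hat_gamma * (inv_hat @ b) / d_mu                  # g per Theorem 1
        grad[s, k] = term1 + (d_mu * v) @ g                   # + sum_s d_mu(s) v(s) g(s)

# Numerical gradient of J for comparison.
eps, fd = 1e-5, np.zeros((nS, nA))
for s in range(nS):
    for k in range(nA):
        e = np.zeros((nS, nA))
        e[s, k] = eps
        fd[s, k] = (J(theta + e) - J(theta - e)) / (2 * eps)
```

Setting hat_gamma = 0 in this sketch makes c \u2261 1 and g = 0, so only the first (emphatic excursion) term survives, consistent with GOPPG recovering OPPG.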
Inspired by Proposition 1, we propose to update \u03b8 as \u03b8t+1 = \u03b8t + \u03b1Zt, where \u03b1 is a step size.\n\nSo far, we discussed the policy gradient for a single dimension of the policy parameter \u03b8, so F^(1)_t, F^(2)_t, M^(1)_t, M^(2)_t are all scalars. When we compute policy gradients for the whole \u03b8 in parallel, F^(1)_t, M^(1)_t remain scalars while F^(2)_t, M^(2)_t become vectors of the same size as \u03b8. This is because our intrinsic interest \u201cfunction\u201d It is a multi-dimensional random variable, instead of a deterministic scalar function like \u02c6i. We, therefore, generalize the concept of interest.\n\nSo far, we also assumed access to the true density ratio c and the true value function v\u03c0. We can plug in their estimates C and V, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm.4 The density ratio estimate C can be learned via the learning rule in (3). The value estimate V can be learned by any off-policy prediction algorithm, e.g., one-step off-policy TD (Sutton and Barto, 2018), Gradient TD methods, (Discounted) COP-TD or V-trace (Espeholt et al., 2018). Pseudocode of Geoff-PAC is provided in supplementary materials.\n\nWe now discuss two potential practical issues with Geoff-PAC. First, Proposition 1 requires t \u2192 \u221e. In practice, this means \u00b5 has been executed for a long time and can be satisfied by a warm-up before training. Second, Proposition 1 provides an unbiased sample for a fixed policy \u03c0. Once \u03c0 is updated, F^(1)_t, F^(2)_t will be invalidated, as will C and V. As their update rules do not have a learning rate, we cannot simply use a larger learning rate for F^(1)_t, F^(2)_t as we would do for C, V. This issue also appeared in Imani et al. (2018). In principle, we could store previous transitions in a replay buffer (Lin, 1992) and replay them for a certain number of steps after \u03c0 is updated. In this way, we can satisfy the requirement t \u2192 \u221e and get the up-to-date F^(1)_t, F^(2)_t. In practice, we found this unnecessary. When we use a small learning rate for \u03c0, we assume \u03c0 changes slowly and ignore this invalidation effect.\n\n3This existence does not follow directly from the convergence analysis of ETD in Yu (2015).\n\n4At this moment, a convergence analysis of Geoff-PAC is an open problem.\n\n5 Experimental Results\n\nOur experiments aim to answer the following questions. 1) Can Geoff-PAC find the same solution as on-policy policy gradient algorithms in the two-circle MDP as promised? 2) How does the degree of counterfactualness (\u02c6\u03b3) influence the solution? 3) Can Geoff-PAC scale up to challenging tasks like robot simulation in Mujoco with neural network function approximators? 4) Can the counterfactual objective in Geoff-PAC translate into performance improvement over Off-PAC and ACE? 5) How does Geoff-PAC compare with other downstream applications of OPPG, e.g., DDPG (Lillicrap et al., 2015) and TD3 (Fujimoto et al., 2018)?\n\n5.1 Two-circle MDP\n\nWe implemented a tabular version of ACE and Geoff-PAC for the two-circle MDP. The behavior policy \u00b5 was random, and we monitored the probability from A to B under the target policy \u03c0. In Figure 1b, we plot \u03c0(A \u2192 B) during training. The curves are averaged over 10 runs and the shaded regions indicate standard errors. We set \u03bb1 = \u03bb2 = 1 so that both ACE and Geoff-PAC are unbiased. For Geoff-PAC, \u02c6\u03b3 was set to 0.9. ACE converges to the correct policy that maximizes J\u00b5 as expected, while Geoff-PAC converges to the policy that maximizes J\u03c0, the policy we want in on-policy training. Figure 1c shows how manipulating \u02c6\u03b3 and \u03bb2 influences the final solution. 
In this two-circle MDP, \u03bb2 has little influence on the final solution, while manipulating \u02c6\u03b3 significantly changes the final solution.\n\n5.2 Robot Simulation\n\nFigure 2: Comparison among Off-PAC, ACE, and Geoff-PAC. Black dashed lines are random agents.\n\nFigure 3: Comparison among DDPG, TD3, and Geoff-PAC.\n\nEvaluation: We benchmarked Off-PAC, ACE, DDPG, TD3, and Geoff-PAC on five Mujoco robot simulation tasks from OpenAI gym (Brockman et al., 2016). As all the original tasks are episodic, we adopted similar techniques as White (2017) to compose continuing tasks. We set the discount function \u03b3 to 0.99 for all non-termination transitions and to 0 for all termination transitions. The agent was teleported back to the initial states upon termination. The interest function was always 1. This setting complies with the common training scheme for Mujoco tasks (Lillicrap et al., 2015; Asadi and Williams, 2016). However, we interpret the tasks as continuing tasks. As a consequence, J\u03c0, instead of episodic return, is the proper metric to measure the performance of a policy \u03c0. The behavior policy \u00b5 is a fixed uniformly random policy, the same as Gelada and Bellemare (2019). The data generated by \u00b5 is significantly different from that of any meaningful policy in those tasks. Thus, this setting exhibits a high degree of off-policyness. We monitored J\u03c0 periodically during training. To evaluate J\u03c0, states were sampled according to \u03c0, and v\u03c0 was approximated via Monte Carlo return. Evaluation based on the commonly used total undiscounted episodic return criterion is provided in supplementary materials. 
The relative performance under the two criteria is almost identical.

Implementation: Although emphatic algorithms have enjoyed great theoretical success (Yu, 2015; Hallak et al., 2016; Sutton et al., 2016; Imani et al., 2018), their empirical success is still limited to simple domains (e.g., simple hand-crafted Markov chains, cart-pole balancing) with linear function approximation. To the best of our knowledge, this is the first time that emphatic algorithms have been evaluated on challenging robot simulation tasks with neural network function approximators. To stabilize training, we adopted the A2C (Clemente et al., 2017) paradigm with multiple workers and utilized a target network (Mnih et al., 2015) and a replay buffer (Lin, 1992). All three algorithms share the same architecture and the same parameterization. We first tuned hyperparameters for Off-PAC; ACE and Geoff-PAC inherited the common hyperparameters from Off-PAC. For DDPG and TD3, we used the same architectures and hyperparameters as Lillicrap et al. (2015) and Fujimoto et al. (2018), respectively. More details are provided in the supplementary materials, and all the implementations are publicly available.5

Results: We first studied the influence of λ1 on ACE and the influence of λ1, λ2, and γ̂ on Geoff-PAC in HalfCheetah. The results are reported in the supplementary materials. We found ACE was not sensitive to λ1 and set λ1 = 0 for all experiments. For Geoff-PAC, we found that λ1 = 0.7, λ2 = 0.6, γ̂ = 0.2 produced good empirical results and used this combination for all remaining tasks. All curves are averaged over 10 independent runs and shaded regions indicate standard errors. Figure 2 compares Geoff-PAC, ACE, and Off-PAC. Geoff-PAC significantly outperforms ACE and Off-PAC in 3 out of 5 tasks. The performance on Walker and Reacher is similar.
This performance improvement supports our claim that optimizing Jγ̂ better approximates optimizing Jπ than optimizing Jµ does. We also report the performance of a random agent for reference. Moreover, this is the first time that ACE has been evaluated on such challenging domains instead of simple Markov chains. Figure 3 compares Geoff-PAC, DDPG, and TD3. Geoff-PAC outperforms DDPG in Hopper and Swimmer. DDPG with a uniformly random behavior policy exhibits high instability in HalfCheetah, Walker, and Hopper. This is expected, because DDPG ignores the discrepancy between dµ and dπ. As training progresses, this discrepancy grows and finally yields a performance drop. TD3 uses several techniques to stabilize DDPG, which translate into the performance and stability improvements over DDPG in Figure 3. However, Geoff-PAC still outperforms TD3 in Hopper and Swimmer. This is not a fair comparison, in that many design choices for DDPG, TD3, and Geoff-PAC differ (e.g., one worker vs. multiple workers, deterministic vs. stochastic policy, network architectures), and we do not expect Geoff-PAC to outperform all applications of OPPG. However, this comparison does suggest that GOPPG sheds light on how to improve applications of OPPG.

6 Related Work

The density ratio c, proposed by Gelada and Bellemare (2019), is a key component of Geoff-PAC. However, how we use this density ratio is different. Q-Learning (Watkins and Dayan, 1992; Mnih et al., 2015) is a semi-gradient method, and Gelada and Bellemare (2019) use the density ratio to reweight the Q-Learning semi-gradient update directly. The resulting algorithm still belongs to the semi-gradient methods.
If we used the density ratio to reweight the Off-PAC update (7) directly, we would obtain an actor-critic analogue of the Q-Learning approach in Gelada and Bellemare (2019). This reweighted Off-PAC, however, would no longer follow the policy gradient of the objective Jµ, yielding instead a "policy semi-gradient". In this paper, we use the density ratio to define a new objective, the counterfactual objective, and compute the policy gradient of this new objective directly (Theorem 1). The resulting algorithm, Geoff-PAC, still belongs to the policy gradient methods (in the limiting sense). Computing the policy gradient of the counterfactual objective requires computing the policy gradient of the density ratio, which was not explored in Gelada and Bellemare (2019).

5 https://github.com/ShangtongZhang/DeepRL

There have been many applications of OPPG, e.g., DPG (Silver et al., 2014), DDPG, ACER (Wang et al., 2016), EPG (Ciosek and Whiteson, 2017), and IMPALA (Espeholt et al., 2018). In particular, Gu et al. (2017) propose IPG to unify on- and off-policy policy gradients. IPG is a mix of gradients (i.e., a mix of ∇Jµ and ∇Jπ). To compute ∇Jπ, IPG does need on-policy samples. In this paper, the counterfactual objective is a mix of objectives, and we do not need on-policy samples to compute the policy gradient of the counterfactual objective. Mixing ∇Jγ̂ and ∇Jπ directly in the IPG style is a possibility for future work.

There have been other policy-based off-policy algorithms. Maei (2018) provides an unbiased estimator (in the limiting sense) for ∇Jµ, assuming the value function is linear. Theoretical results are provided without an empirical study. Imani et al. (2018) eliminate this linear assumption and provide a thorough empirical study. We therefore conduct our comparison with Imani et al. (2018) instead of Maei (2018).
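To make the distinction above concrete, a single actor step of the reweighted Off-PAC variant (the "policy semi-gradient" being contrasted here, not the Geoff-PAC update itself) can be sketched as follows. All inputs are hypothetical placeholders: the density ratio `c`, importance ratio `rho`, critic estimate `q`, and score `grad_log_pi` are assumed to come from separately learned components.

```python
import numpy as np

def reweighted_offpac_step(theta, c, rho, q, grad_log_pi, lr=1e-3):
    """One actor update of 'reweighted Off-PAC': the usual Off-PAC
    direction rho * q * grad log pi(a|s), additionally scaled by a
    learned density ratio c(s) approximating d_pi(s) / d_mu(s),
    following the reweighting idea of Gelada and Bellemare (2019).
    As discussed above, this is a 'policy semi-gradient'; it no longer
    follows the gradient of J_mu, whereas Geoff-PAC differentiates
    the counterfactual objective itself."""
    return theta + lr * c * rho * q * grad_log_pi

# Illustrative call with made-up quantities for a 2-parameter policy.
theta = np.zeros(2)
theta = reweighted_offpac_step(
    theta, c=2.0, rho=0.5, q=1.0,
    grad_log_pi=np.array([1.0, -1.0]), lr=0.1)
```

The point of the sketch is that the scaling by `c` is applied to an update rule, not derived from any objective; Geoff-PAC instead differentiates Jγ̂, which requires the policy gradient of the density ratio as well (Theorem 1).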
In another line of work, the policy entropy is used for reward shaping, and the target policy can then be derived from the value function directly (O'Donoghue et al., 2016; Nachum et al., 2017a; Schulman et al., 2017). This line of work includes deep energy-based RL (Haarnoja et al., 2017, 2018), where a value function is learned off-policy and the policy is derived from the value function directly, and path consistency learning (Nachum et al., 2017a,b), where gradients are computed to satisfy certain path consistencies. This line of work is orthogonal to this paper, where we compute the policy gradient of the counterfactual objective directly in an off-policy manner and do not involve reward shaping.

Liu et al. (2018) prove that c̄ is the unique solution to a minimax problem, which involves maximization over a function set F. They show that, in theory, F should be sufficiently rich (e.g., neural networks). To make this tractable, they restrict F to a ball of a reproducing kernel Hilbert space, yielding a closed-form solution for the maximization step. SGD is then used to learn an estimate of c̄ in the minimization step, which is then used for policy evaluation. In concurrent work (Liu et al., 2019), this approximation of c̄ is used in an off-policy policy gradient for Jπ, and empirical success is observed in simple domains. By contrast, our Jγ̂ unifies Jπ and Jµ, where γ̂ naturally allows a bias-variance trade-off, yielding empirical success in challenging robot simulation tasks.

7 Conclusions

In this paper, we introduced the counterfactual objective, unifying the excursion objective and the alternative life objective in the continuing RL setting. We further provided the Generalized Off-Policy Policy Gradient Theorem and the corresponding Geoff-PAC algorithm.
GOPPG is the first example in which a non-trivial interest function is used, and Geoff-PAC is the first empirical success of emphatic algorithms in prevailing deep RL benchmarks. There have been numerous applications of OPPG, including DDPG, ACER, IPG, EPG, and IMPALA. We expect GOPPG to shed light on improving those applications. Theoretically, a convergence analysis of Geoff-PAC involving the compatible function assumption (Sutton et al., 2000) or multi-timescale stochastic approximation (Borkar, 2009) is also worth further investigation.

Acknowledgments

SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. The authors thank Richard S. Sutton, Matthew Fellows, and Huizhen Yu for valuable discussions.

References

Asadi, K. and Williams, J. D. (2016). Sample-efficient deep reinforcement learning for dialog control. arXiv preprint arXiv:1612.06000.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.

Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint. Springer.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.

Ciosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.

Clemente, A. V., Castejón, H. N., and Chandra, A. (2017). Efficient parallel methods for deep reinforcement learning. arXiv preprint arXiv:1705.04862.

Degris, T., White, M., and Sutton, R. S. (2012).
Off-policy actor-critic. arXiv preprint arXiv:1205.4839.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.

Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.

Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.

Ghiassian, S., Patterson, A., White, M., Sutton, R. S., and White, A. (2018). Online off-policy prediction.

Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. (2017). Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352-1361. JMLR.org.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.

Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning.

Hallak, A., Tamar, A., Munos, R., and Mannor, S. (2016). Generalized emphatic temporal difference learning: Bias-variance analysis. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.

Imani, E., Graves, E., and White, M. (2018). An off-policy policy gradient theorem using emphatic weightings.
In Advances in Neural Information Processing Systems.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.

Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2019). Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.

Maei, H. R. (2018). Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.

Marbach, P. and Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). Combining policy gradient and Q-learning.
arXiv preprint arXiv:1611.01626.

Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems.

Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning.

Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.

Schulman, J., Chen, X., and Abbeel, P. (2017). Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.

Silver, D. (2015). Policy gradient methods. URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction (2nd Edition). MIT Press.

Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.

Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation.
In Advances in Neural Information Processing Systems.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning.

White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.

Yu, H. (2005). A function approximation approach to estimation of policy gradient for POMDP with structured policies. The 21st Conference on Uncertainty in Artificial Intelligence.

Yu, H. (2015). On convergence of emphatic temporal-difference learning. In Conference on Learning Theory.