{"title": "An Off-policy Policy Gradient Theorem Using Emphatic Weightings", "book": "Advances in Neural Information Processing Systems", "page_first": 96, "page_last": 106, "abstract": "Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm\u2014called Actor Critic with Emphatic weightings (ACE)\u2014that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods\u2014particularly OffPAC and DPG\u2014converge to the wrong solution whereas ACE finds the optimal solution.", "full_text": "An Off-policy Policy Gradient Theorem Using\n\nEmphatic Weightings\n\nEhsan Imani\u21e4, Eric Graves\u21e4, Martha White\n\nReinforcement Learning and Arti\ufb01cial Intelligence Laboratory\n\n{imani,graves,whitem}@ualberta.ca\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nAbstract\n\nPolicy gradient methods are widely used for control in reinforcement learning,\nparticularly for the continuous action setting. 
There have been a host of theoret-\nically sound algorithms proposed for the on-policy setting, due to the existence\nof the policy gradient theorem which provides a simpli\ufb01ed form for the gradient.\nIn off-policy learning, however, where the behaviour policy is not necessarily at-\ntempting to learn and follow the optimal policy for the given task, the existence\nof such a theorem has been elusive. In this work, we solve this open problem by\nproviding the \ufb01rst off-policy policy gradient theorem. The key to the derivation is\nthe use of emphatic weightings. We develop a new actor-critic algorithm\u2014called\nActor Critic with Emphatic weightings (ACE)\u2014that approximates the simpli\ufb01ed\ngradients provided by the theorem. We demonstrate in a simple counterexam-\nple that previous off-policy policy gradient methods\u2014particularly OffPAC and\nDPG\u2014converge to the wrong solution whereas ACE \ufb01nds the optimal solution.\n\n1\n\nIntroduction\n\nOff-policy learning holds great promise for learning in an online setting, where an agent generates\na single stream of interaction with its environment. On-policy methods are limited to learning about\nthe agent\u2019s current policy. Conversely, in off-policy learning, the agent can learn about many poli-\ncies that are different from the policy being executed. Methods capable of off-policy learning have\nseveral important advantages over on-policy methods. Most importantly, off-policy methods allow\nan agent to learn about many different policies at once, forming the basis for a predictive under-\nstanding of an agent\u2019s environment [Sutton et al., 2011, White, 2015] and enabling the learning of\noptions [Sutton et al., 1999, Precup, 2000]. With options, an agent can determine optimal (short)\nbehaviours, starting from its current state. 
Off-policy methods can also learn from data generated\nby older versions of a policy, known as experience replay, a critical factor in the recent success of\ndeep reinforcement learning [Lin, 1992, Mnih et al., 2015, Schaul et al., 2015]. They also enable\nlearning from other forms of suboptimal data, including data generated by human demonstration,\nnon-learning controllers, and even random behaviour. Off-policy methods also enable learning about\nthe optimal policy while executing an exploratory policy [Watkins and Dayan, 1992], thereby ad-\ndressing the exploration-exploitation tradeoff.\nPolicy gradient methods are a general class of algorithms for learning optimal policies, for both the\non and off-policy setting. In policy gradient methods, a parameterized policy is improved using\ngradient ascent [Williams, 1992], with seminal work in actor-critic algorithms [Witten, 1977, Barto\net al., 1983] and many techniques since proposed to reduce variance of the estimates of this gradient\n[Konda and Tsitsiklis, 2000, Weaver and Tao, 2001, Greensmith et al., 2004, Peters et al., 2005,\n\n\u21e4These authors contributed equally.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fBhatnagar et al., 2008, 2009, Grondman et al., 2012, Gu et al., 2016]. These algorithms rely on\na fundamental theoretical result: the policy gradient theorem. This theorem [Sutton et al., 2000,\nMarbach and Tsitsiklis, 2001] simpli\ufb01es estimation of the gradient, which would otherwise require\ndif\ufb01cult-to-estimate gradients with respect to the stationary distribution of the policy and potentially\nof the action-values if they are used.\nOff-policy policy gradient methods have also been developed, particularly in recent years where\nthe need for data ef\ufb01ciency and decorrelated samples in deep reinforcement learning require the\nuse of experience replay and so off-policy learning. 
This work began with OffPAC [Degris et al.,\n2012a], where an off-policy policy gradient theorem was provided that parallels the on-policy policy\ngradient theorem, but only for tabular policy representations.2 This motivated further development,\nincluding a recent actor-critic algorithm proven to converge when the critic uses linear function\napproximation [Maei, 2018], as well as several methods using the approximate off-policy gradient\nsuch as Deterministic Policy Gradient (DPG) [Silver et al., 2014, Lillicrap et al., 2015], ACER\n[Wang et al., 2016], and Interpolated Policy Gradient (IPG) [Gu et al., 2017]. However, it remains an\nopen question whether the foundational theorem that underlies these algorithms can be generalized\nbeyond tabular representations.\nIn this work, we provide an off-policy policy gradient theorem, for general policy parametrization.\nThe key insight is that the gradient can be simpli\ufb01ed if the gradient in each state is weighted with an\nemphatic weighting. We use previous methods for incrementally estimating these emphatic weight-\nings [Yu, 2015, Sutton et al., 2016] to design a new off-policy actor-critic algorithm, called Actor-\nCritic with Emphatic weightings (ACE). We show in a simple three state counterexample, with two\nstates aliased, that solutions are suboptimal with the (semi)-gradients used in previous off-policy\nalgorithms\u2014such as OffPAC and DPG. We demonstrate both the theorem and the counterexample\nunder stochastic and deterministic policies, and that ACE converges to the optimal solution.\n\n2 Problem Formulation\n\nWe consider a Markov decision process (S, A, P, r), where S denotes the \ufb01nite set of states, A\ndenotes the \ufb01nite set of actions, P : S \u21e5 A \u21e5 S ! [0,1) denotes the one-step state transition\ndynamics, and r : S \u21e5 A \u21e5 S ! R denotes the transition-based reward function. At each timestep\nt = 1, 2, . . 
., the agent selects an action $A_t$ according to its behaviour policy $\\mu$, where $\\mu : S \\times A \\to [0, 1]$. The environment responds by transitioning into a new state $S_{t+1}$ according to $P$, and emits a\nscalar reward $R_{t+1}$ such that $\\mathbb{E}[R_{t+1} | S_t, A_t, S_{t+1}] = r(S_t, A_t, S_{t+1})$.\nThe discounted sum of future rewards, given that actions are selected according to some target policy $\\pi$,\nis called the return and is defined as:\n\n$$G_t \\overset{\\text{def}}{=} R_{t+1} + \\gamma_{t+1} R_{t+2} + \\gamma_{t+1} \\gamma_{t+2} R_{t+3} + \\ldots = R_{t+1} + \\gamma_{t+1} G_{t+1} \\tag{1}$$\n\nWe use transition-based discounting $\\gamma : S \\times A \\times S \\to [0, 1]$, as it unifies continuing and episodic\ntasks [White, 2017]. Then the state value function for policy $\\pi$ and $\\gamma$ is defined as:\n\n$$v_\\pi(s) \\overset{\\text{def}}{=} \\mathbb{E}_\\pi[G_t | S_t = s] = \\sum_{a \\in A} \\pi(s, a) \\sum_{s' \\in S} P(s, a, s') [r(s, a, s') + \\gamma(s, a, s') v_\\pi(s')] \\quad \\forall s \\in S \\tag{2}$$\n\n2 See B. Errata in Degris et al. [2012b] for the clarification that the theorem only applies to tabular policy\nrepresentations.\n\nIn off-policy control, the agent\u2019s goal is to learn a target policy $\\pi$ while following the behaviour\npolicy $\\mu$. The target policy $\\pi_\\theta$ is a differentiable function of a weight vector $\\theta \\in \\mathbb{R}^d$. The goal is to\nlearn $\\theta$ to maximize the following objective function:\n\n$$J_\\mu(\\theta) \\overset{\\text{def}}{=} \\sum_{s \\in S} d_\\mu(s) i(s) v_{\\pi_\\theta}(s) \\tag{3}$$\n\nwhere $i : S \\to [0, \\infty)$ is an interest function, $d_\\mu(s) \\overset{\\text{def}}{=} \\lim_{t \\to \\infty} P(S_t = s | s_0, \\mu)$ is the limiting\ndistribution of states under $\\mu$ (which we assume exists), and $P(S_t = s | s_0, \\mu)$ is the probability that\n$S_t = s$ when starting in state $s_0$ and executing $\\mu$. The interest function, introduced by Sutton et al.\n[2016], provides more flexibility in weighting states in the objective. If $i(s) = 1$ for all states, the\nobjective reduces to the standard off-policy objective. 
Otherwise, it naturally encompasses other\nsettings, such as the start state formulation by setting $i(s) = 0$ for all states but the start state(s).\nBecause it adds no complications to the derivations, we opt for this more generalized objective.\n\n3 Off-Policy Policy Gradient Theorem using Emphatic Weightings\n\nThe policy gradient theorem with function approximation has only been derived for the on-policy\nsetting thus far, for stochastic policies [Sutton et al., 2000, Theorem 1] and deterministic policies\n[Silver et al., 2014]. The policy gradient theorem for the off-policy case has only been established\nfor the setting where the policy is tabular [Degris et al., 2012b, Theorem 2].3 In this section, we\nshow that the policy gradient theorem does hold in the off-policy setting, when using function\napproximation for the policy, as long as we use emphatic weightings. These results parallel those in\noff-policy policy evaluation, for learning the value function, where issues of convergence for\ntemporal difference methods are ameliorated by using an emphatic weighting [Sutton et al., 2016].\n\n3 Note that the statement in the paper is stronger, but in an errata published by the authors, they highlight an\nerror in the proof. Consequently, the result is only correct for the tabular setting.\n\nTheorem 1 (Off-policy Policy Gradient Theorem).\n\n$$\\frac{\\partial J_\\mu(\\theta)}{\\partial \\theta} = \\sum_s m(s) \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a) \\tag{4}$$\n\nwhere $m : S \\to [0, \\infty)$ is the emphatic weighting, in vector form defined as\n\n$$\\mathbf{m}^\\top \\overset{\\text{def}}{=} \\mathbf{i}^\\top (\\mathbf{I} - \\mathbf{P}_{\\pi,\\gamma})^{-1} \\tag{5}$$\n\nwhere the vector $\\mathbf{i} \\in \\mathbb{R}^{|S|}$ has entries $\\mathbf{i}(s) \\overset{\\text{def}}{=} d_\\mu(s) i(s)$ and $\\mathbf{P}_{\\pi,\\gamma} \\in \\mathbb{R}^{|S| \\times |S|}$ is the matrix with\nentries $\\mathbf{P}_{\\pi,\\gamma}(s, s') \\overset{\\text{def}}{=} \\sum_a \\pi(s, a; \\theta) P(s, a, s') \\gamma(s, a, s')$.\n\nProof. First notice that\n\n$$\\frac{\\partial J_\\mu(\\theta)}{\\partial \\theta} = \\frac{\\partial \\sum_s \\mathbf{i}(s) v_\\pi(s)}{\\partial \\theta} = \\sum_s \\mathbf{i}(s) \\frac{\\partial v_\\pi(s)}{\\partial \\theta}$$\n\nTherefore, to compute the gradient of $J_\\mu$, we need to compute the gradient of the value function\nwith respect to the policy parameters. A recursive form of the gradient of the value function can be\nderived, as we show below. Before starting, for simplicity of notation we will use\n\n$$g(s) = \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a)$$\n\nwhere $g : S \\to \\mathbb{R}^d$. Now let us compute the gradient of the value function.\n\n$$\\begin{aligned} \\frac{\\partial v_\\pi(s)}{\\partial \\theta} &= \\frac{\\partial}{\\partial \\theta} \\sum_a \\pi(s, a; \\theta) q_\\pi(s, a) \\\\\\\\ &= \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a) + \\sum_a \\pi(s, a; \\theta) \\frac{\\partial q_\\pi(s, a)}{\\partial \\theta} \\\\\\\\ &= g(s) + \\sum_a \\pi(s, a; \\theta) \\frac{\\partial \\sum_{s'} P(s, a, s') (r(s, a, s') + \\gamma(s, a, s') v_\\pi(s'))}{\\partial \\theta} \\\\\\\\ &= g(s) + \\sum_a \\pi(s, a; \\theta) \\sum_{s'} P(s, a, s') \\gamma(s, a, s') \\frac{\\partial v_\\pi(s')}{\\partial \\theta} \\end{aligned} \\tag{6}$$\n\nWe can simplify this more easily using vector form. Let $\\dot{\\mathbf{v}}_\\pi \\in \\mathbb{R}^{|S| \\times d}$ be the matrix of gradients\n(with respect to the policy parameters $\\theta$) of $v_\\pi$ for each state $s$, and $\\mathbf{G} \\in \\mathbb{R}^{|S| \\times d}$ the matrix where\neach row corresponding to state $s$ is the vector $g(s)$. Then\n\n$$\\dot{\\mathbf{v}}_\\pi = \\mathbf{G} + \\mathbf{P}_{\\pi,\\gamma} \\dot{\\mathbf{v}}_\\pi \\implies \\dot{\\mathbf{v}}_\\pi = (\\mathbf{I} - \\mathbf{P}_{\\pi,\\gamma})^{-1} \\mathbf{G} \\tag{7}$$\n\nTherefore, we obtain\n\n$$\\sum_s \\mathbf{i}(s) \\frac{\\partial v_\\pi(s)}{\\partial \\theta} = \\mathbf{i}^\\top \\dot{\\mathbf{v}}_\\pi = \\mathbf{i}^\\top (\\mathbf{I} - \\mathbf{P}_{\\pi,\\gamma})^{-1} \\mathbf{G} = \\mathbf{m}^\\top \\mathbf{G} = \\sum_s m(s) \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a)$$\n\nWe can prove a similar result for the deterministic policy gradient objective, for a deterministic\npolicy, $\\pi : S \\to A$. The objective remains the same, but the space of possible policies is constrained,\nresulting in a slightly different gradient.\nTheorem 2 (Deterministic Off-policy Policy Gradient Theorem).\n\n$$\\frac{\\partial J_\\mu(\\theta)}{\\partial \\theta} = \\int_S m(s) \\frac{\\partial \\pi(s; \\theta)}{\\partial \\theta} \\left. \\frac{\\partial q_\\pi(s, a)}{\\partial a} \\right|_{a = \\pi(s; \\theta)} ds \\tag{8}$$\n\nwhere $m : S \\to [0, \\infty)$ is the emphatic weighting for a deterministic policy, which is the solution to\nthe recursive equation\n\n$$m(s') \\overset{\\text{def}}{=} d_\\mu(s') i(s') + \\int_S P(s, \\pi(s; \\theta), s') \\gamma(s, \\pi(s; \\theta), s') m(s) \\, ds \\tag{9}$$\n\nThe proof is presented in Appendix A.\n\n4 Actor-Critic with Emphatic Weightings\n\nIn this section, we develop an incremental actor-critic algorithm with emphatic weightings, that uses\nthe above off-policy policy gradient theorem. To perform a gradient ascent update on the policy\nparameters, the goal is to obtain a sample of the gradient\n\n$$\\sum_s m(s) \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a). \\tag{10}$$\n\nComparing this expression with the approximate gradient used by OffPAC and subsequent methods\n(which we refer to as semi-gradient methods) reveals that the only difference is in the weighting of\nstates:\n\n$$\\sum_s d_\\mu(s) \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a). \\tag{11}$$\n\nTherefore, we can use standard solutions developed for other actor-critic algorithms to obtain a\nsample of $\\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a)$. Explicit details for our off-policy setting are given in Appendix B.\nThe key difficulty is then in estimating $m(s)$ to reweight this gradient, which we address below.\nThe policy gradient theorem assumes access to the true value function, and provides a Bellman\nequation that defines the optimal fixed point. However, approximation errors can occur in practice,\nboth in estimating the value function (the critic) and the emphatic weighting. For the critic, we\ncan take advantage of numerous algorithms that improve estimation of value functions, including\nthrough the use of $\\lambda$-returns to mitigate bias, with $\\lambda = 1$ corresponding to using unbiased samples\nof returns [Sutton, 1988]. 
For the emphatic weighting, we introduce a similar parameter $\\lambda_a \\in [0, 1]$,\nthat introduces bias but could help reduce variability in the weightings\n\n$$\\mathbf{m}_{\\lambda_a}^\\top = \\mathbf{i}^\\top (\\mathbf{I} - \\mathbf{P}_{\\pi,\\gamma})^{-1} (\\mathbf{I} - (1 - \\lambda_a) \\mathbf{P}_{\\pi,\\gamma}). \\tag{12}$$\n\nFor $\\lambda_a = 1$, we get $\\mathbf{m}_{\\lambda_a} = \\mathbf{m}$ and so get an unbiased emphatic weighting.4 For $\\lambda_a = 0$, the\nemphatic weighting is simply $\\mathbf{i}$, and the gradient with this weighting reduces to the regular\noff-policy actor critic update [Degris et al., 2012b]. For $\\lambda_a = 0$, therefore, we obtain a biased gradient\nestimate, but the emphatic weightings themselves are easy to estimate (they are myopic estimates\nof interest), which could significantly reduce variance when estimating the gradient. Selecting $\\lambda_a$\nbetween 0 and 1 could provide a reasonable balance, obtaining a nearly unbiased gradient to enable\nconvergence to a valid stationary point but potentially reducing some variability when estimating\nthe emphatic weighting.\n\n4 Note that the original emphatic weightings [Sutton et al., 2016] use $\\lambda = 1 - \\lambda_a$. This is because their\nemphatic weightings are designed to balance bias introduced from using $\\lambda$ for estimating value functions: larger\n$\\lambda$ means the emphatic weighting plays less of a role. For this setting, we want larger $\\lambda_a$ to correspond to the\nfull emphatic weighting (the unbiased emphatic weighting), and smaller $\\lambda_a$ to correspond to a more biased\nestimate, to better match the typical meaning of such trace parameters.\n\nNow we can draw on previous work estimating emphatic weightings incrementally to obtain an\nemphatically weighted policy gradient. 
Assume access to an estimate of the gradient $\\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a)$,\nsuch as the commonly-used estimate: $\\rho_t \\delta_t \\nabla_\\theta \\ln \\pi(s, a; \\theta)$, where $\\rho_t$ is the importance sampling\nratio (described further in Appendix B), and $\\delta_t$ is the temporal difference error, which (as an estimate\nof the advantage function $a_\\pi(s, a) = q_\\pi(s, a) - v_\\pi(s)$) implicitly includes a state value baseline.\nBecause this is an off-policy setting, the states $s$ from which we would sample this gradient are\nweighted according to $d_\\mu$. We need to adjust this weighting from $d_\\mu(s)$ to $m(s)$. We can do so by\nusing an online algorithm previously derived to obtain a sample $M_t$ of the emphatic weighting\n\n$$\\begin{aligned} F_t &\\overset{\\text{def}}{=} \\gamma_t \\rho_{t-1} F_{t-1} + i(S_t) \\\\\\\\ M_t &\\overset{\\text{def}}{=} (1 - \\lambda_a) i(S_t) + \\lambda_a F_t \\end{aligned} \\tag{13}$$\n\nfor $F_0 = 0$. The actor update is then multiplied by $M_t$ to give the emphatically-weighted actor\nupdate: $\\rho_t M_t \\delta_t \\nabla_\\theta \\ln \\pi(s, a; \\theta)$. Previous work by Thomas [2014] to remove bias in natural actor-critic\nalgorithms is of interest here, as it suggests weighting actor updates by an accumulating product of\ndiscount factors, which can be thought of as an on-policy precursor to emphatic weightings. We\nprove that our update is an unbiased estimate of the gradient for a fixed policy in Proposition 1. We\nprovide the complete Actor-Critic with Emphatic weightings (ACE) algorithm, with pseudo-code\nand additional algorithm details, in Appendix B.\nProposition 1. For a fixed policy $\\pi$, and with the conditions on the MDP from [Yu, 2015],\n\n$$\\mathbb{E}_\\mu[\\rho_t M_t \\delta_t \\nabla_\\theta \\ln \\pi(S_t, A_t; \\theta)] = \\sum_s m(s) \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a)$$\n\nProof. Emphatic weightings were previously shown to provide an unbiased estimate for value\nfunctions with Emphatic TD. 
We use the emphatic weighting differently, but can rely on the proof\nfrom [Sutton et al., 2016] to ensure that (a) $d_\\mu(s) \\mathbb{E}_\\mu[M_t | S_t = s] = m(s)$ and the fact that (b)\n$\\mathbb{E}_\\mu[M_t | S_t = s] = \\mathbb{E}_\\mu[M_t | S_t = s, A_t, S_{t+1}]$. Using these equalities, we obtain\n\n$$\\begin{aligned} &\\mathbb{E}_\\mu[\\rho_t M_t \\delta_t \\nabla_\\theta \\ln \\pi(S_t, A_t; \\theta)] \\\\\\\\ &= \\sum_s d_\\mu(s) \\mathbb{E}_\\mu[\\rho_t M_t \\delta_t \\nabla_\\theta \\ln \\pi(S_t, A_t; \\theta) | S_t = s] \\\\\\\\ &= \\sum_s d_\\mu(s) \\mathbb{E}_\\mu\\big[ \\mathbb{E}_\\mu[\\rho_t M_t \\delta_t \\nabla_\\theta \\ln \\pi(S_t, A_t; \\theta) | S_t = s, A_t, S_{t+1}] \\big] && \\triangleright \\text{ law of total expectation} \\\\\\\\ &= \\sum_s d_\\mu(s) \\mathbb{E}_\\mu\\big[ \\mathbb{E}_\\mu[M_t | S_t = s, A_t, S_{t+1}] \\, \\mathbb{E}_\\mu[\\rho_t \\delta_t \\nabla_\\theta \\ln \\pi(S_t, A_t; \\theta) | S_t = s, A_t, S_{t+1}] \\big] \\\\\\\\ &= \\sum_s d_\\mu(s) \\mathbb{E}_\\mu[M_t | S_t = s] \\, \\mathbb{E}_\\mu\\big[ \\mathbb{E}_\\mu[\\rho_t \\delta_t \\nabla_\\theta \\ln \\pi(S_t, A_t; \\theta) | S_t = s, A_t, S_{t+1}] \\big] && \\triangleright \\text{ using (b)} \\\\\\\\ &= \\sum_s m(s) \\sum_a \\frac{\\partial \\pi(s, a; \\theta)}{\\partial \\theta} q_\\pi(s, a) && \\triangleright \\text{ using (a)} \\end{aligned}$$\n\n5 Experiments\n\nWe empirically investigate the utility of using the true off-policy gradient, as opposed to the previous\napproximation used by OffPAC; the impact of the choice of $\\lambda_a$; and the efficacy of estimating\nemphatic weightings in ACE. We present a toy problem to highlight the fact that OffPAC, which\nuses an approximate semi-gradient, can converge to suboptimal solutions, even in ideal conditions,\nwhereas ACE, with the true gradient, converges to the optimal solution. We conduct several other\nexperiments on the same toy problem, to elucidate properties of ACE.\n\n5.1 The Drawback of Semi-Gradient Updates\n\nWe design a world with aliased states to highlight the problem with semi-gradient updates. The toy\nproblem, depicted in Figure 1a, has three states, where S0 is a start state with feature vector [1, 0],\nand S1 and S2 are aliased, both with feature vector [0, 1]. This aliased representation forces the\nactor to take a similar action in S1 and S2. 
The behaviour policy takes actions A0 and A1 with\nprobabilities 0.25 and 0.75 in all non-terminal states, so that S0, S1, and S2 will have probabilities\n0.5, 0.125, and 0.375 under d\u00b5. Under this aliasing, the optimal action in S1 and S2 is A0, for the\noff-policy objective J\u00b5. The target policy is initialized to take A0 and A1 with probabilities 0.9 and\n0.1 in all states, which is near optimal.\nWe \ufb01rst compared an idealized semi-gradient actor (a = 0) and gradient actor (a = 1), with\nexact value function (critic) estimates. Figures 1b and 1c clearly indicate that the semi-gradient\nupdate\u2014which corresponds to an idealized version of the OffPAC update\u2014converges to a subopti-\nmal solution. This occurs even if it is initialized close to the optimal solution, which highlights that\nthe true solution is not even a stationary point for the semi-gradient objective. On the other hand,\nthe gradient solution\u2014corresponding to ACE\u2014increases the objective until converging to optimal.\nWe show below, in Section 5.3 and Figure 5, that this is similarly a problem in the continuous action\nsetting with DPG.\nThe problem with the semi-gradient updates is made clear from the fact that it corresponds to the\na = 0 solution, which uses the weighting d\u00b5 instead of m. In an expected semi-gradient update,\neach state tries to increase the probability of the action with the highest action-value. There will\nbe a con\ufb02ict between the aliased states S1 and S2, since their highest-valued actions differ. If the\nstates are weighted by d\u00b5 in the expected update, S1 will appear insigni\ufb01cant to the actor, and the\nupdate will increase the probability of A1 in the aliased states. (The ratio between q\u21e1(S1, A0) and\nq\u21e1(S2, A1) is not enough to counterbalance this weighting.) However, S1 has an importance that a\nsemi-gradient update overlooks. 
Taking a suboptimal action at S1 will also reduce q(S0, A0) and,\nafter multiple updates, the actor gradually prefers to take A1 in S0. Eventually, the target policy\nwill be to take A1 at all states, which has a lower objective function than the initial target policy.\nThis experiment suggests why the weight of a state should depend not only on its own share of d\u00b5,\nbut also on its predecessors, and the behaviour policy\u2019s state distribution is not the proper deciding\nfactor in the competition between S1 and S2.\n\n(a) Counterexample\n\n(b) Learning curves\n\n(c) Action probability\n\nFigure 1: (a) A counterexample that identi\ufb01es suboptimal behaviour when using semi-gradient up-\ndates. The semi-gradients converge for the tabular setting [Degris et al., 2012b], but not necessarily\nunder function approximation\u2014such as with the state aliasing in this MDP. S0 is the start state and\nthe terminal state is denoted by T3. S1 and S2 are aliased to the actor. The interest i(s) is set to one\nfor all states. (b) Learning curves comparing semi-gradient updates and gradient updates, averaged\nover 30 runs with negligible standard error bars. The actor has a softmax output on a linear trans-\nformation of features and is trained with a step-size of 0.1 (though results were similar across all the\nstepsizes tested). The dashed line shows the highest attainable objective function under the aliased\nrepresentation. (c) The probability of taking A0 at the aliased states, where taking A0 is optimal\nunder the aliased representation.\n\n6\n\n\f(a) Stepsize sensitivity\n\n(b) Learning curves\n\n(c) Action probability\n\nFigure 2: Performance with different values of a in the 3-state MDP, averaged over 30 runs. (a)\nACE performs well, for a range of stepsizes and even a that gets quite small. 
(b) For a = 0, which\ncorresponds to OffPAC, the algorithm decreases performance, to get to the suboptimal \ufb01xed point.\nEven with as low a value as a = 0.25, ACE improves the value from the starting point, but does\nconverge to a worse solution than a  0.5. The learning curves correspond to each algorithm\u2019s\nbest step-size. (c) The optimal behaviour is to take A0 with probability 1, in the aliased states.\nACE(0)\u2014corresponding to OffPAC\u2014quickly converges to the suboptimal solution of preferring the\noptimal action for S2 instead of S1. Even with a just a bit higher than 0, convergence is to a more\nreasonable solution, preferring the optimal action the majority of the time.\n\n5.2 The Impact of the Trade-Off Parameter\nThe parameter a in (12) has the potential to trade off bias and variance. For a = 0, the bias can\nbe signi\ufb01cant, as shown in the previous section. A natural question is how the bias changes as a\ndecreases from 1 to 0. There is unlikely to be signi\ufb01cant variance reduction\u2014it is a low variance\ndomain\u2014but we can nonetheless gain some insight into bias.\nWe repeated the experiment in 5.1, with a chosen from {0, 0.25, 0.5, 0.75, 1} and the step-size\nchosen from {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1}. To highlight the rate of learning, the actor parame-\nters are now initialized to zero. Figure 2 summarizes the results. As shown in Figure 2a, with a\nclose to one, a small and carefully-tuned step-size is needed to make the most of the method. As\na decreases, the actor is able to learn with higher step-sizes and increases the objective function\nfaster. The quality of the \ufb01nal solution, however, deteriorates with small values of a since the actor\nfollows biased gradients. 
Even for surprisingly small a = 0.5 the actor converged to optimal, and\neven a = 0.25 produced a much more reasonable solution than a = 0.\nWe ran a similar experiment, this time using value estimates from a critic trained with Gradient\nTD, called GTD() [Maei, 2011] to examine whether the impact of a values with the actual (non-\nidealized) ACE algorithm persists in an actor-critic architecture. The step-size \u21b5v was chosen from\n{105, 104, 103, 102, 101, 100}, \u21b5w was chosen from {1010, 108, 106, 104, 102}, and\n{0, 0.5, 1.0} was the set of candidate values of  for the critic. The results in Figure 3 show that,\nas before, even relatively low a values can still get close to the optimal solution. However, semi-\ngradient updates, corresponding to a = 0, still \ufb01nd a suboptimal policy.\n\n(a) Stepsize sensitivity\n\n(b) Learning curves\n\n(c) Action probability\n\nFigure 3: Performance of ACE with a GTD() critic and different values of a in the 3-state MDP.\nThe results are averaged over 10 runs. The outcomes are similar to Figure 2, though noisier due to\nlearning the critic rather than using the true critic.\n\n7\n\n\f5.3 Challenges in Estimating the Emphatic Weightings\nWe have been using an online algorithm to estimate the emphatic weightings. There can be different\nsources of inaccuracy in these approximations. First, the estimate depends on importance sampling\nratios of previous actions in the trajectory. This can result in high variance if the behaviour policy\nand the target policy are not close. Secondly, derivation of the online algorithm assumes a \ufb01xed\ntarget policy [Sutton et al., 2016], while the actor is updated at every time step. Therefore, the\napproximation is less reliable in long trajectories, as it is partly decided by older target policies in\nthe beginning of the trajectory. 
We designed experiments to study the efficacy of these estimates in\nan aliased task with more states.\nThe first environment, shown in Figure 6 in the appendix, is an extended version of the previous\nMDP with two long chains before the aliased states. The addition of new states makes the\ntrajectories considerably longer, allowing errors to build up in emphatic weighting estimates. The actor is\ninitialized with zero weights and uses true state values in its updates. The behaviour policy takes A0\nwith probability 0.25 and A1 with probability 0.75 in all non-terminal states. The actor\u2019s step-size\nis picked from $\\{5 \\cdot 10^{-5}, 10^{-4}, 2 \\cdot 10^{-4}, 5 \\cdot 10^{-4}, 10^{-3}, 2 \\cdot 10^{-3}, 5 \\cdot 10^{-3}, 10^{-2}\\}$. We also trained\nan actor, called True-ACE, that uses true emphatic weightings for the current target policy and\nbehaviour policy, computed at each timestep. The performance of True-ACE is included here for the\nsake of comparison; computing the exact emphatic weightings is not generally possible in an\nunknown environment. The results in Figure 4 show that, even though performance improves as $\\lambda_a$\nis increased, there is a gap between ACE with $\\lambda_a = 1$ and True-ACE. This shows the inaccuracies\npointed out above indeed disturb the updates in long trajectories.\n\n(a) Stepsize sensitivity\n\n(b) Learning curves\n\n(c) Action probability\n\nFigure 4: Performance of ACE with different values of $\\lambda_a$ and True-ACE on the 11-state MDP. The\nresults are averaged over 10 runs. Unlike Figure 2, the methods now have more difficulty getting\nnear the optimal solution, though ACE with larger $\\lambda_a$ does still clearly get a significantly better\nsolution than $\\lambda_a = 0$.\n\nThe second environment is similar to Figure 1a, but with one continuous unbounded action. Taking\naction with value $a$ at S0 will result in a transition to S1 with probability $1 - \\sigma(a)$ and a transition\nto S2 with probability $\\sigma(a)$, where $\\sigma$ denotes the logistic sigmoid function. 
For all actions from\nS0, the reward is zero. From S1 and S2, the agent can only transition to the terminal state, with\nreward 2(a) and (a) respectively. The behaviour policy takes actions drawn from a Gaussian\ndistribution with mean 1.0 and variance 1.0.\nBecause the environment has continuous actions, we can include both stochastic and deterministic\npolicies, and so can include DPG in the comparison. DPG is built on the semi-gradient, like OffPAC.\nWe compare to DPG with Emphatic weightings (DPGE), with the true emphatic weightings rather\nthan estimated ones. We compare to True-DPGE to avoid the confounding factor of estimating\nthe emphatic weighting, and to focus the investigation on whether DPG converges to a suboptimal solution.\nEstimation of the emphatic weightings for a deterministic target policy is left for future work. The\nstochastic actor in ACE has a linear output unit and a softplus output unit to represent the mean and\nthe standard deviation of a Gaussian distribution. All actors are initialized with zero weights.\nFigure 5 summarizes the results. The first observation is that DPG demonstrates suboptimal\nbehaviour similar to OffPAC. As training goes on, DPG prefers to take positive actions in all states,\nbecause S2 is updated more often. This problem goes away in True-DPGE. The emphatic weightings\nemphasize updates in S1 and, thus, the actor gradually prefers negative actions and surpasses DPG\nin performance. Similarly, True-ACE learns to take negative actions but, being a stochastic policy, it\ncannot achieve True-DPGE\u2019s performance on this domain. ACE with different $\\lambda_a$ values, however,\ncannot outperform DPG, and this result suggests that an alternative to importance sampling ratios is\nneeded to extend ACE to continuous actions.\n\n(a) Stepsize sensitivity\n\n(b) Learning curves\n\n(c) Mean action\n\nFigure 5: Performance of ACE with different values of $\\lambda_a$, True-ACE, DPG, and True-DPGE on the\ncontinuous action MDP. 
The results are averaged over 30 runs. For continuous actions, the methods\nhave even more difficulty getting to the optimal solutions, given by True-DPGE and True-ACE,\nthough the action selection graphs suggest that ACE for higher $\\lambda_a$ is staying nearer the optimal\naction selection than ACE(0) and DPG.\n\n6 Conclusions and Future Work\n\nIn this paper we proved the off-policy policy gradient theorem, using emphatic weightings. The\nresult is generally applicable to any differentiable policy parameterization. Using this theorem, we\nderived an off-policy actor-critic algorithm that follows the gradient of the objective function, as\nopposed to previous methods like OffPAC and DPG that followed an approximate semi-gradient. We\ndesigned a simple MDP to highlight issues with existing methods, namely OffPAC and DPG,\nshowing in particular that the stationary points of these semi-gradient methods for this problem\ndo not include the optimal solution. Our algorithm, Actor-Critic with Emphatic Weightings,\nfollows the gradient and reaches the optimal solution, both for an idealized\nsetting given the true critic and when learning the critic. We conclude with a result suggesting that\nmore work needs to be done to effectively estimate emphatic weightings, and that an important next\nstep for developing actor-critic algorithms for the off-policy setting is to improve estimation of\nthese weightings.\n\n7 Acknowledgements\n\nThe authors would like to thank Alberta Innovates for funding the Alberta Machine Intelligence\nInstitute and by extension this research. We would also like to thank Hamid Maei, Susan Murphy,\nand Rich Sutton for their helpful discussions and insightful comments.\n\nReferences\nAndrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can\nsolve difficult learning control problems. 
IEEE Transactions on Systems, Man, and Cybernetics,\n(5):834\u2013846, 1983.\n\nS Bhatnagar, R Sutton, and M Ghavamzadeh. Natural actor-critic algorithms. Automatica, 2009.\n\nShalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural\nactor-critic algorithms. In Advances in Neural Information Processing Systems, pages 105\u2013112,\n2008.\n\nThomas Degris, Patrick M Pilarski, and Richard S Sutton. Model-free reinforcement learning with\ncontinuous action in practice. In American Control Conference (ACC), 2012, pages 2177\u20132182.\nIEEE, 2012a.\n\nThomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In International\nConference on Machine Learning, 2012b.\n\nEvan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient\nestimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471\u20131530,\n2004.\n\nIvo Grondman, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. A survey of actor-critic\nreinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems,\nMan, and Cybernetics, Part C (Applications and Reviews), 42(6):1291\u20131307, 2012.\n\nShixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop:\nSample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.\n\nShixiang Gu, Tim Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Sch\u00f6lkopf, and Sergey\nLevine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for\ndeep reinforcement learning. In Advances in Neural Information Processing Systems, pages\n3849\u20133858, 2017.\n\nVijay R Konda and John N Tsitsiklis. Actor-critic algorithms. 
In Advances in Neural Information\nProcessing Systems, pages 1008\u20131014, 2000.\n\nTimothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,\nDavid Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv\npreprint arXiv:1509.02971, 2015.\n\nLong-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching.\nMachine Learning, 8(3-4):293\u2013321, 1992.\n\nH Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta,\n2011.\n\nHamid Reza Maei. Convergent actor-critic algorithms under off-policy training and function\napproximation. arXiv preprint arXiv:1802.07842, 2018.\n\nPeter Marbach and John N Tsitsiklis. Simulation-based optimization of Markov reward processes.\nIEEE Transactions on Automatic Control, 46(2):191\u2013209, 2001.\n\nVolodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level\ncontrol through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\nJan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural actor-critic. In European Conference on\nMachine Learning, pages 280\u2013291. Springer, 2005.\n\nDoina Precup. Temporal abstraction in reinforcement learning. PhD thesis, University\nof Massachusetts Amherst, 2000.\n\nTom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv\npreprint arXiv:1511.05952, 2015.\n\nD Silver, G Lever, N Heess, T Degris, and D Wierstra. Deterministic policy gradient algorithms. In\nProceedings of the 31st . . . , 2014.\n\nRichard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework\nfor temporal abstraction in reinforcement learning. 
Artificial Intelligence, 112(1-2):181\u2013211, 1999.\n\nRichard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White,\nand Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised\nsensorimotor interaction. In The 10th International Conference on Autonomous Agents and\nMultiagent Systems-Volume 2, pages 761\u2013768. International Foundation for Autonomous Agents\nand Multiagent Systems, 2011.\n\nRichard S Sutton, A R Mahmood, and Martha White. An emphatic approach to the problem of\noff-policy temporal-difference learning. The Journal of Machine Learning Research, 2016.\n\nR.S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.\n\nRS Sutton, D McAllester, S Singh, and Y Mansour. Policy gradient methods for reinforcement\nlearning with function approximation. In Advances in Neural Information Processing Systems,\n2000.\n\nPhilip Thomas. Bias in natural actor-critic algorithms. In International Conference on Machine\nLearning, pages 441\u2013448, 2014.\n\nZiyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, R\u00e9mi Munos, Koray Kavukcuoglu, and\nNando de Freitas. Sample Efficient Actor-Critic with Experience Replay. arXiv.org, 2016.\n\nChristopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279\u2013292, 1992.\n\nLex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning.\nIn Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages\n538\u2013545. Morgan Kaufmann Publishers Inc., 2001.\n\nAdam White. Developing a predictive approach to knowledge. PhD thesis, University\nof Alberta, 2015.\n\nMartha White. Unifying task specification in reinforcement learning. In International Conference\non Machine Learning, pages 3742\u20133750, 2017.\n\nR Williams. 
Simple statistical gradient-following algorithms for connectionist reinforcement\nlearning. Machine Learning, 1992.\n\nIan H Witten. An adaptive optimal controller for discrete-time Markov environments. Information\nand Control, 34(4):286\u2013295, 1977.\n\nHuizhen Yu. On convergence of emphatic temporal-difference learning. In Annual Conference on\nLearning Theory, 2015.\n", "award": [], "sourceid": 85, "authors": [{"given_name": "Ehsan", "family_name": "Imani", "institution": "University of Alberta"}, {"given_name": "Eric", "family_name": "Graves", "institution": "University of Alberta"}, {"given_name": "Martha", "family_name": "White", "institution": "University of Alberta"}]}