{"title": "Bridging the Gap Between Value and Policy Based Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2775, "page_last": 2785, "abstract": "We establish a new connection between value and policy based reinforcement learning (RL) based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax consistent action values correspond to optimal entropy regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that PCL can be interpreted as generalizing both actor-critic and Q-learning algorithms. We subsequently deepen the relationship by showing how a single model can be used to represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. The experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.", "full_text": "Bridging the Gap Between Value and Policy Based\n\nReinforcement Learning\n\nDale Schuurmans\nO\ufb01r Nachum1\n{ofirnachum,mnorouzi,kelvinxx}@google.com, daes@ualberta.ca\n\nMohammad Norouzi\n\nKelvin Xu1\n\nGoogle Brain\n\nAbstract\n\nWe establish a new connection between value and policy based reinforcement\nlearning (RL) based on a relationship between softmax temporal value consistency\nand policy optimality under entropy regularization. Speci\ufb01cally, we show that\nsoftmax consistent action values correspond to optimal entropy regularized policy\nprobabilities along any action sequence, regardless of provenance. 
From this\nobservation, we develop a new RL algorithm, Path Consistency Learning (PCL),\nthat minimizes a notion of soft consistency error along multi-step action sequences\nextracted from both on- and off-policy traces. We examine the behavior of PCL\nin different scenarios and show that PCL can be interpreted as generalizing both\nactor-critic and Q-learning algorithms. We subsequently deepen the relationship\nby showing how a single model can be used to represent both a policy and the\ncorresponding softmax state values, eliminating the need for a separate critic. The\nexperimental evaluation demonstrates that PCL signi\ufb01cantly outperforms strong\nactor-critic and Q-learning baselines across several benchmarks.2\n\n1\n\nIntroduction\n\nModel-free RL aims to acquire an effective behavior policy through trial and error interaction with a\nblack box environment. The goal is to optimize the quality of an agent\u2019s behavior policy in terms of\nthe total expected discounted reward. Model-free RL has a myriad of applications in games [22, 37],\nrobotics [16, 17], and marketing [18, 38], to name a few. Recently, the impact of model-free RL has\nbeen expanded through the use of deep neural networks, which promise to replace manual feature\nengineering with end-to-end learning of value and policy representations. Unfortunately, a key\nchallenge remains how best to combine the advantages of value and policy based RL approaches in\nthe presence of deep function approximators, while mitigating their shortcomings. Although recent\nprogress has been made in combining value and policy based methods, this issue is not yet settled,\nand the intricacies of each perspective are exacerbated by deep models.\nThe primary advantage of policy based approaches, such as REINFORCE [45], is that they directly\noptimize the quantity of interest while remaining stable under function approximation (given a\nsuf\ufb01ciently small learning rate). 
Their biggest drawback is sample inef\ufb01ciency: since policy gradients\nare estimated from rollouts the variance is often extreme. Although policy updates can be improved\nby the use of appropriate geometry [14, 27, 32], the need for variance reduction remains paramount.\nActor-critic methods have thus become popular [33, 34, 36], because they use value approximators\nto replace rollout estimates and reduce variance, at the cost of some bias. Nevertheless, on-policy\nlearning remains inherently sample inef\ufb01cient [10]; by estimating quantities de\ufb01ned by the current\npolicy, either on-policy data must be used, or updating must be suf\ufb01ciently slow to avoid signi\ufb01cant\nbias. Naive importance correction is hardly able to overcome these shortcomings in practice [28, 29].\n\n1Work done as a member of the Google Brain Residency program (g.co/brainresidency)\n2An implementation of PCL can be found at https://github.com/tensorflow/models/tree/\n\nmaster/research/pcl_rl\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fBy contrast, value based methods, such as Q-learning [44, 22, 30, 42, 21], can learn from any\ntrajectory sampled from the same environment. Such \u201coff-policy\u201d methods are able to exploit data\nfrom other sources, such as experts, making them inherently more sample ef\ufb01cient than on-policy\nmethods [10]. Their key drawback is that off-policy learning does not stably interact with function\napproximation [35, Chap.11]. The practical consequence is that extensive hyperparameter tuning can\nbe required to obtain stable behavior. Despite practical success [22], there is also little theoretical\nunderstanding of how deep Q-learning might obtain near-optimal objective values.\nIdeally, one would like to combine the unbiasedness and stability of on-policy training with the data\nef\ufb01ciency of off-policy approaches. 
This desire has motivated substantial recent work on off-policy\nactor-critic methods, where the data ef\ufb01ciency of policy gradient is improved by training an off-\npolicy critic [19, 21, 10]. Although such methods have demonstrated improvements over on-policy\nactor-critic approaches, they have not resolved the theoretical dif\ufb01culty associated with off-policy\nlearning under function approximation. Hence, current methods remain potentially unstable and\nrequire specialized algorithmic and theoretical development as well as delicate tuning to be effective\nin practice [10, 41, 8].\nIn this paper, we exploit a relationship between policy optimization under entropy regularization\nand softmax value consistency to obtain a new form of stable off-policy learning. Even though\nentropy regularized policy optimization is a well studied topic in RL [46, 39, 40, 47, 5, 4, 6, 7]\u2013in\nfact, one that has been attracting renewed interest from concurrent work [25, 11]\u2013we contribute new\nobservations to this study that are essential for the methods we propose: \ufb01rst, we identify a strong\nform of path consistency that relates optimal policy probabilities under entropy regularization to\nsoftmax consistent state values for any action sequence; second, we use this result to formulate a\nnovel optimization objective that allows for a stable form of off-policy actor-critic learning; \ufb01nally, we\nobserve that under this objective the actor and critic can be uni\ufb01ed in a single model that coherently\nful\ufb01lls both roles.\n\n2 Notation & Background\nWe model an agent\u2019s behavior by a parametric distribution \u03c0\u03b8(a | s) de\ufb01ned by a neural network over\na \ufb01nite set of actions. At iteration t, the agent encounters a state st and performs an action at sampled\nfrom \u03c0\u03b8(a | st). 
The environment then returns a scalar reward rt and transitions to the next state st+1.\nNote: Our main results identify specific properties that hold for arbitrary action sequences. To keep the presentation clear and focus attention on the key properties, we provide a simplified presentation in the main body of this paper by assuming deterministic state dynamics. This restriction is not necessary, and in the Supplementary Material we provide a full treatment of the same concepts generalized to stochastic state dynamics. All of the desired properties continue to hold in the general case and the algorithms proposed remain unaffected.\nFor simplicity, we assume the per-step reward rt and the next state st+1 are given by functions rt = r(st, at) and st+1 = f(st, at) specified by the environment. We begin the formulation by reviewing the key elements of Q-learning [43, 44], which uses a notion of hard-max Bellman backup to enable off-policy TD control. First, observe that the expected discounted reward objective, OER(s, \u03c0), can be recursively expressed as,\n\nOER(s, \u03c0) = \u2211_a \u03c0(a | s) [r(s, a) + \u03b3 OER(s', \u03c0)] ,   where s' = f(s, a) .   (1)\n\nLet V\u25e6(s) denote the optimal state value at a state s given by the maximum value of OER(s, \u03c0) over policies, i.e., V\u25e6(s) = max_\u03c0 OER(s, \u03c0). Accordingly, let \u03c0\u25e6 denote the optimal policy that results in V\u25e6(s) (for simplicity, assume there is one unique optimal policy), i.e., \u03c0\u25e6 = argmax_\u03c0 OER(s, \u03c0). Such an optimal policy is a one-hot distribution that assigns a probability of 1 to an action with maximal return and 0 elsewhere. Thus we have\n\nV\u25e6(s) = OER(s, \u03c0\u25e6) = max_a (r(s, a) + \u03b3 V\u25e6(s')) .   (2)\n\nThis is the well-known hard-max Bellman temporal consistency. 
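For concreteness, the hard-max backup in (2) can be iterated to convergence on a tiny deterministic MDP. The following is a minimal sketch only: the two-state reward table `r`, transition table `f`, and the function name `value_iteration` are illustrative inventions, not part of our implementation.

```python
import numpy as np

# Tabular hard-max value iteration implementing the backup in (2):
#   V(s) <- max_a [ r(s, a) + gamma * V(f(s, a)) ].
# The two-state tables `r` and `f` are an illustrative toy, not the paper's code.
def value_iteration(r, f, gamma=0.9, iters=200):
    """r[s, a]: reward; f[s, a]: index of the deterministic successor state."""
    v = np.zeros(r.shape[0])
    for _ in range(iters):
        v = np.max(r + gamma * v[f], axis=1)  # hard-max backup over actions
    return v

r = np.array([[0.0, 1.0], [0.0, 1.0]])  # action 1 always pays reward 1
f = np.array([[0, 1], [1, 1]])          # action 1 moves to (or stays in) state 1
v_star = value_iteration(r, f)
# here V(s) converges to 1 / (1 - 0.9) = 10 for both states
```

Because the backup is a \u03b3-contraction, the iterates converge geometrically to the unique fixed point V\u25e6.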
Instead of state values, one can equivalently (and more commonly) express this consistency in terms of optimal action values, Q\u25e6:\n\nQ\u25e6(s, a) = r(s, a) + \u03b3 max_{a'} Q\u25e6(s', a') .   (3)\n\nQ-learning relies on a value iteration algorithm based on (3), where Q(s, a) is bootstrapped based on successor action values Q(s', a').\n\n3 Softmax Temporal Consistency\n\nIn this paper, we study the optimal state and action values for a softmax form of temporal consistency [48, 47, 7], which arises by augmenting the standard expected reward objective with a discounted entropy regularizer. Entropy regularization [46] encourages exploration and helps prevent early convergence to sub-optimal policies, as has been confirmed in practice (e.g., [21, 24]). In this case, one can express regularized expected reward as a sum of the expected reward and a discounted entropy term,\n\nOENT(s, \u03c0) = OER(s, \u03c0) + \u03c4 H(s, \u03c0) ,   (4)\n\nwhere \u03c4 \u2265 0 is a user-specified temperature parameter that controls the degree of entropy regularization, and the discounted entropy H(s, \u03c0) is recursively defined as\n\nH(s, \u03c0) = \u2211_a \u03c0(a | s) [\u2212 log \u03c0(a | s) + \u03b3 H(s', \u03c0)] .   (5)\n\nThe objective OENT(s, \u03c0) can then be re-expressed recursively as,\n\nOENT(s, \u03c0) = \u2211_a \u03c0(a | s) [r(s, a) \u2212 \u03c4 log \u03c0(a | s) + \u03b3 OENT(s', \u03c0)] .   (6)\n\nNote that when \u03b3 = 1 this is equivalent to the entropy regularized objective proposed in [46].\nLet V\u2217(s) = max_\u03c0 OENT(s, \u03c0) denote the soft optimal state value at a state s and let \u03c0\u2217(a | s) denote the optimal policy at s that attains the maximum of OENT(s, \u03c0). 
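As a sanity check on the recursion in (6), one can evaluate the objective for a fixed policy on a depth-limited toy problem: at \u03c4 = 0 it reduces to the plain expected reward of (1), and for \u03c4 > 0 it adds exactly \u03c4 times the discounted entropy of (5). The uniform policy and all tables below are illustrative inventions, not part of the paper's setup.

```python
import numpy as np

# Sanity check for the recursion in (6) on a depth-limited toy problem.
# The uniform policy and the tables `r` (rewards) and `f` (successors) are
# illustrative inventions, not the paper's implementation.
def o_ent(s, policy, r, f, tau, gamma, depth):
    """Entropy-regularized return of `policy` from state s, truncated at `depth`."""
    if depth == 0:
        return 0.0
    total = 0.0
    for a, p in enumerate(policy[s]):
        if p > 0:
            total += p * (r[s, a] - tau * np.log(p)
                          + gamma * o_ent(f[s, a], policy, r, f, tau, gamma, depth - 1))
    return total

policy = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([[0.0, 1.0], [0.0, 1.0]])
f = np.array([[0, 1], [1, 1]])
v_er = o_ent(0, policy, r, f, tau=0.0, gamma=0.9, depth=3)   # plain O_ER, eq. (1)
v_ent = o_ent(0, policy, r, f, tau=1.0, gamma=0.9, depth=3)  # adds tau * H, eq. (4)
# v_ent - v_er equals the discounted three-step entropy (1 + 0.9 + 0.81) * log 2
# of the uniform two-action policy.
```

The additivity used in the last comment follows from the linearity of the recursions (1) and (5), which is exactly the decomposition stated in (4).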
When \u03c4 > 0, the optimal policy is no longer a one-hot distribution, since the entropy term prefers the use of policies with more uncertainty. We characterize the optimal policy \u03c0\u2217(a | s) in terms of the OENT-optimal state values of successor states V\u2217(s') as a Boltzmann distribution of the form,\n\n\u03c0\u2217(a | s) \u221d exp{(r(s, a) + \u03b3 V\u2217(s'))/\u03c4} .   (7)\n\nIt can be verified that this is the solution by noting that the OENT(s, \u03c0) objective is simply a \u03c4-scaled constant-shifted KL-divergence between \u03c0 and \u03c0\u2217, hence the optimum is achieved when \u03c0 = \u03c0\u2217.\nTo derive V\u2217(s) in terms of V\u2217(s'), the policy \u03c0\u2217(a | s) can be substituted into (6), which after some manipulation yields the intuitive definition of optimal state value in terms of a softmax (i.e., log-sum-exp) backup,\n\nV\u2217(s) = OENT(s, \u03c0\u2217) = \u03c4 log \u2211_a exp{(r(s, a) + \u03b3 V\u2217(s'))/\u03c4} .   (8)\n\nNote that in the \u03c4 \u2192 0 limit one recovers the hard-max state values defined in (2). Therefore we can equivalently state softmax temporal consistency in terms of optimal action values Q\u2217(s, a) as,\n\nQ\u2217(s, a) = r(s, a) + \u03b3 V\u2217(s') = r(s, a) + \u03b3\u03c4 log \u2211_{a'} exp(Q\u2217(s', a')/\u03c4) .   (9)\n\nNow, much like Q-learning, the consistency equation (9) can be used to perform one-step backups to asynchronously bootstrap Q\u2217(s, a) based on Q\u2217(s', a'). In the Supplementary Material we prove that such a procedure, in the tabular case, converges to a unique fixed point representing the optimal values.\nWe point out that the notion of softmax Q-values has been studied in previous work (e.g., [47, 48, 13, 5, 3, 7]). 
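The tabular fixed-point iteration of (9) can be sketched as follows. This is a minimal illustration, not our implementation: the toy tables `r` and `f` and the function names are invented, and a numerically stable log-sum-exp is used for the soft value of (8).

```python
import numpy as np

# A tabular sketch of the one-step softmax backup (9); the toy MDP (`r`, `f`)
# and function names are illustrative, not the paper's implementation.
def soft_value(q, tau):
    # Numerically stable log-sum-exp, i.e. the soft value V of eq. (8).
    m = q.max(axis=1)
    return m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))

def soft_q_iteration(r, f, tau, gamma=0.9, iters=500):
    q = np.zeros_like(r)
    for _ in range(iters):
        q = r + gamma * soft_value(q, tau)[f]  # backup of eq. (9)
    return q

r = np.array([[0.0, 1.0], [0.0, 1.0]])
f = np.array([[0, 1], [1, 1]])
q_soft = soft_q_iteration(r, f, tau=0.5)
q_hard = soft_q_iteration(r, f, tau=1e-8)  # tau -> 0 recovers hard-max Q, eq. (3)
```

On this toy chain the near-zero-temperature values approach the hard-max fixed point [[9, 10], [9, 10]], while the \u03c4 = 0.5 values dominate them entrywise, reflecting the nonnegative entropy bonus in (4).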
Concurrently to our work, [11] has also proposed a soft Q-learning algorithm for continuous control that is based on a similar notion of softmax temporal consistency. However, we contribute new observations below that lead to the novel training principles we explore.\n\n4 Consistency Between Optimal Value & Policy\n\nWe now describe the main technical contributions of this paper, which lead to the development of two novel off-policy RL algorithms in Section 5. The first key observation is that, for the softmax value function V\u2217 in (8), the quantity exp{V\u2217(s)/\u03c4} also serves as the normalization factor of the optimal policy \u03c0\u2217(a | s) in (7); that is,\n\n\u03c0\u2217(a | s) = exp{(r(s, a) + \u03b3 V\u2217(s'))/\u03c4} / exp{V\u2217(s)/\u03c4} .   (10)\n\nManipulation of (10) by taking the log of both sides then reveals an important connection between the optimal state value V\u2217(s), the value V\u2217(s') of the successor state s' reached from any action a taken in s, and the corresponding action probability under the optimal log-policy, log \u03c0\u2217(a | s).\nTheorem 1. For \u03c4 > 0, the policy \u03c0\u2217 that maximizes OENT and state values V\u2217(s) = max_\u03c0 OENT(s, \u03c0) satisfy the following temporal consistency property for any state s and action a (where s' = f(s, a)),\n\nV\u2217(s) \u2212 \u03b3 V\u2217(s') = r(s, a) \u2212 \u03c4 log \u03c0\u2217(a | s) .   (11)\n\nProof. All theorems are established for the general case of a stochastic environment and discounted infinite horizon problems in the Supplementary Material. 
Theorem 1 follows as a special case.\nNote that one can also characterize \u03c0\u2217 in terms of Q\u2217 as\n\n\u03c0\u2217(a | s) = exp{(Q\u2217(s, a) \u2212 V\u2217(s))/\u03c4} .   (12)\n\nAn important property of the one-step softmax consistency established in (11) is that it can be extended to a multi-step consistency defined on any action sequence from any given state. That is, the softmax optimal state values at the beginning and end of any action sequence can be related to the rewards and optimal log-probabilities observed along the trajectory.\nCorollary 2. For \u03c4 > 0, the optimal policy \u03c0\u2217 and optimal state values V\u2217 satisfy the following extended temporal consistency property, for any state s1 and any action sequence a1, ..., at\u22121 (where si+1 = f(si, ai)):\n\nV\u2217(s1) \u2212 \u03b3^{t\u22121} V\u2217(st) = \u2211_{i=1}^{t\u22121} \u03b3^{i\u22121} [r(si, ai) \u2212 \u03c4 log \u03c0\u2217(ai | si)] .   (13)\n\nProof. The proof in the Supplementary Material applies (the generalized version of) Theorem 1 to any s1 and sequence a1, ..., at\u22121, summing the left and right hand sides of (the generalized version of) (11) to induce telescopic cancellation of intermediate state values. Corollary 2 follows as a special case.\n\nImportantly, the converse of Theorem 1 (and Corollary 2) also holds:\nTheorem 3. If a policy \u03c0(a | s) and state value function V(s) satisfy the consistency property (11) for all states s and actions a (where s' = f(s, a)), then \u03c0 = \u03c0\u2217 and V = V\u2217. 
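The multi-step identity (13) can be checked numerically on a toy deterministic MDP: solve for the soft-optimal values, read off the optimal log-policy via (12), and accumulate both sides of (13) along a deliberately non-optimal action sequence. All tables and constants below are illustrative inventions.

```python
import numpy as np

# Numerical illustration of Corollary 2 on a toy deterministic MDP: the
# telescoped identity (13) holds along an ARBITRARY action sequence, not just
# an optimal one. The tables `r`, `f` and the constants are illustrative only.
r = np.array([[0.0, 1.0], [0.0, 1.0]])
f = np.array([[0, 1], [1, 1]])
tau, gamma = 0.5, 0.9

q = np.zeros_like(r)
for _ in range(2000):  # soft Q-iteration to the fixed point of (9)
    m = q.max(axis=1)
    v = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))
    q = r + gamma * v[f]
m = q.max(axis=1)
v = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))  # V*, eq. (8)
log_pi = (q - v[:, None]) / tau                                   # log pi*, eq. (12)

actions = [0, 1, 0, 0]      # deliberately non-optimal choices
s, rhs, disc = 0, 0.0, 1.0
for a in actions:
    rhs += disc * (r[s, a] - tau * log_pi[s, a])
    disc *= gamma
    s = f[s, a]
lhs = v[0] - disc * v[s]    # left-hand side of (13)
# lhs and rhs agree to numerical precision
```

At the fixed point the agreement is exact: substituting (12) into each summand of (13) gives r \u2212 \u03c4 log \u03c0\u2217 = V\u2217(si) \u2212 \u03b3 V\u2217(si+1), and the sum telescopes.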
(See the Supplementary Material.)\n\nTheorem 3 motivates the use of one-step and multi-step path-wise consistencies as the foundation of RL algorithms that aim to learn parameterized policy and value estimates by minimizing the discrepancy between the left and right hand sides of (11) and (13).\n\n5 Path Consistency Learning (PCL)\n\nThe temporal consistency properties between the optimal policy and optimal state values developed above lead to a natural path-wise objective for training a policy \u03c0\u03b8, parameterized by \u03b8, and a state value function V\u03c6, parameterized by \u03c6, via the minimization of a soft consistency error. Based on (13), we first define a notion of soft consistency for a d-length sub-trajectory si:i+d \u2261 (si, ai, . . . , si+d\u22121, ai+d\u22121, si+d) as a function of \u03b8 and \u03c6:\n\nC(si:i+d, \u03b8, \u03c6) = \u2212V\u03c6(si) + \u03b3^d V\u03c6(si+d) + \u2211_{j=0}^{d\u22121} \u03b3^j [r(si+j, ai+j) \u2212 \u03c4 log \u03c0\u03b8(ai+j | si+j)] .   (14)\n\nThe goal of a learning algorithm can then be to find V\u03c6 and \u03c0\u03b8 such that C(si:i+d, \u03b8, \u03c6) is as close to 0 as possible for all sub-trajectories si:i+d. Accordingly, we propose a new learning algorithm, called Path Consistency Learning (PCL), that attempts to minimize the squared soft consistency error over a set of sub-trajectories E,\n\nOPCL(\u03b8, \u03c6) = (1/2) \u2211_{si:i+d \u2208 E} C(si:i+d, \u03b8, \u03c6)\u00b2 .   (15)\n\nThe PCL update rules for \u03b8 and \u03c6 are derived by calculating the gradient of (15). For a given trajectory si:i+d these take the form,\n\n\u2206\u03b8 = \u03b7\u03c0 C(si:i+d, \u03b8, \u03c6) \u2211_{j=0}^{d\u22121} \u03b3^j \u2207\u03b8 log \u03c0\u03b8(ai+j | si+j) ,   (16)\n\n\u2206\u03c6 = \u03b7v C(si:i+d, \u03b8, \u03c6) (\u2207\u03c6V\u03c6(si) \u2212 \u03b3^d \u2207\u03c6V\u03c6(si+d)) ,   (17)\n\nwhere \u03b7v and \u03b7\u03c0 denote the value and policy learning rates respectively. Given that the consistency property must hold on any path, the PCL algorithm applies the updates (16) and (17) both to trajectories sampled on-policy from \u03c0\u03b8 as well as trajectories sampled from a replay buffer. The union of these trajectories comprises the set E used in (15) to define OPCL.\nSpecifically, given a fixed rollout parameter d, at each iteration, PCL samples a batch of on-policy trajectories and computes the corresponding parameter updates for each sub-trajectory of length d. Then PCL exploits off-policy trajectories by maintaining a replay buffer and applying additional updates based on a batch of episodes sampled from the buffer at each iteration. We have found it beneficial to sample replay episodes proportionally to exponentiated reward, mixed with a uniform distribution, although we did not exhaustively experiment with this sampling procedure. In particular, we sample a full episode s0:T from the replay buffer of size B with probability 0.1/B + 0.9 \u00b7 exp(\u03b1 \u2211_{i=0}^{T\u22121} r(si, ai))/Z, where we use no discounting on the sum of rewards, Z is a normalization factor, and \u03b1 is a hyper-parameter. Pseudocode of PCL is provided in the Appendix.\nWe note that in stochastic settings, our squared inconsistency objective approximated by Monte Carlo samples is a biased estimate of the true squared inconsistency (in which an expectation over stochastic dynamics occurs inside rather than outside the square). 
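As a minimal illustration of the objective being minimized, the soft consistency error C of (14) can be evaluated on a toy sub-trajectory. Here the log-policy is constructed to satisfy the one-step consistency (11) exactly, so the multi-step error telescopes to zero, mirroring Corollary 2; the value table, tables `r` and `f`, and all numbers are illustrative inventions.

```python
import numpy as np

# A toy evaluation of the soft consistency error C in (14). The value table
# `v`, the log-policy construction, and all numbers are illustrative; the
# log-policy is built to satisfy the one-step consistency (11) exactly, so
# every multi-step error telescopes to zero, mirroring Corollary 2.
def soft_consistency_error(states, actions, rewards, v, log_pi, tau, gamma):
    d = len(actions)
    c = -v[states[0]] + gamma ** d * v[states[d]]
    for j in range(d):
        c += gamma ** j * (rewards[j] - tau * log_pi[states[j], actions[j]])
    return c

tau, gamma = 0.5, 0.9
r = np.array([[0.0, 1.0], [0.0, 1.0]])
f = np.array([[0, 1], [1, 1]])
v = np.array([1.0, 2.0])
log_pi = (r + gamma * v[f] - v[:, None]) / tau  # enforces (11) by construction

states, actions = [0, 1, 1], [1, 0]
rewards = [r[0, 1], r[1, 0]]
c = soft_consistency_error(states, actions, rewards, v, log_pi, tau, gamma)
# c is 0 up to floating point; PCL minimizes (1/2) * c**2 over sub-trajectories
```

In training, C is of course nonzero and its gradient with respect to \u03b8 and \u03c6 yields the updates (16) and (17).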
This issue arises in Q-learning as well, and others have proposed possible remedies which can also be applied to PCL [2].\n\n5.1 Unified Path Consistency Learning (Unified PCL)\n\nThe PCL algorithm maintains a separate model for the policy and the state value approximation. However, given the soft consistency between the state and action value functions (e.g., in (9)), one can express the soft consistency errors strictly in terms of Q-values. Let Q\u03c1 denote a model of action values parameterized by \u03c1, based on which one can estimate both the state values and the policy as,\n\nV\u03c1(s) = \u03c4 log \u2211_a exp{Q\u03c1(s, a)/\u03c4} ,   (18)\n\n\u03c0\u03c1(a | s) = exp{(Q\u03c1(s, a) \u2212 V\u03c1(s))/\u03c4} .   (19)\n\nGiven this unified parameterization of policy and value, we can formulate an alternative algorithm, called Unified Path Consistency Learning (Unified PCL), which optimizes the same objective (i.e., (15)) as PCL but differs by combining the policy and value function into a single model. Merging the policy and value function models in this way is significant because it presents a new actor-critic paradigm where the policy (actor) is not distinct from the values (critic). We note that in practice, we have found it beneficial to apply updates to \u03c1 from V\u03c1 and \u03c0\u03c1 using different learning rates, very much like PCL. Accordingly, the update rule for \u03c1 takes the form,\n\n\u2206\u03c1 = \u03b7\u03c0 C(si:i+d, \u03c1) \u2211_{j=0}^{d\u22121} \u03b3^j \u2207\u03c1 log \u03c0\u03c1(ai+j | si+j)   (20)\n\n      + \u03b7v C(si:i+d, \u03c1) (\u2207\u03c1V\u03c1(si) \u2212 \u03b3^d \u2207\u03c1V\u03c1(si+d)) .   (21)\n\n5.2 Connections to Actor-Critic and Q-learning\n\nTo those familiar with advantage-actor-critic methods [21] (A2C and its asynchronous analogue A3C), PCL\u2019s update rules might appear to be similar. 
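Before turning to that comparison, the unified parameterization of (18) and (19) can be sketched as follows: both the state values and the policy are read off a single Q-model via a log-sum-exp. The toy table `q` and the function name are illustrative inventions.

```python
import numpy as np

# The unified parameterization (18)-(19): both the state values and the policy
# are read off a single Q-model. The toy table `q` and the function name are
# illustrative only.
def values_and_policy_from_q(q, tau):
    m = q.max(axis=1, keepdims=True)
    v = m + tau * np.log(np.exp((q - m) / tau).sum(axis=1, keepdims=True))  # eq. (18)
    pi = np.exp((q - v) / tau)                                              # eq. (19)
    return v[:, 0], pi

q = np.array([[1.0, 2.0], [0.5, 0.5]])
v, pi = values_and_policy_from_q(q, tau=1.0)
# each row of pi is a proper distribution: the softmax of q / tau
```

By construction \u2211_a \u03c0\u03c1(a | s) = 1, since V\u03c1(s) in (18) is exactly the log-normalizer of (19).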
In particular, advantage-actor-critic is an on-policy method that exploits the expected value function,\n\nV^\u03c0(s) = \u2211_a \u03c0(a | s) [r(s, a) + \u03b3 V^\u03c0(s')] ,   (22)\n\nto reduce the variance of policy gradient, in service of maximizing the expected reward. As in PCL, two models are trained concurrently: an actor \u03c0\u03b8 that determines the policy, and a critic V\u03c6 that is trained to estimate V^\u03c0\u03b8. A fixed rollout parameter d is chosen, and the advantage of an on-policy trajectory si:i+d is estimated by\n\nA(si:i+d, \u03c6) = \u2212V\u03c6(si) + \u03b3^d V\u03c6(si+d) + \u2211_{j=0}^{d\u22121} \u03b3^j r(si+j, ai+j) .   (23)\n\nThe advantage-actor-critic updates for \u03b8 and \u03c6 can then be written as,\n\n\u2206\u03b8 = \u03b7\u03c0 E_{si:i+d|\u03b8} [A(si:i+d, \u03c6) \u2207\u03b8 log \u03c0\u03b8(ai | si)] ,   (24)\n\n\u2206\u03c6 = \u03b7v E_{si:i+d|\u03b8} [A(si:i+d, \u03c6) \u2207\u03c6V\u03c6(si)] ,   (25)\n\nwhere the expectation E_{si:i+d|\u03b8} denotes sampling from the current policy \u03c0\u03b8. These updates exhibit a striking similarity to the updates expressed in (16) and (17). In fact, if one takes PCL with \u03c4 \u2192 0 and omits the replay buffer, a slight variation of A2C is recovered. In this sense, one can interpret PCL as a generalization of A2C. Moreover, while A2C is restricted to on-policy samples, PCL minimizes an inconsistency measure that is defined on any path, hence it can exploit replay data to enhance its efficiency via off-policy learning.\nIt is also important to note that for A2C, it is essential that V\u03c6 tracks the non-stationary target V^\u03c0\u03b8 to ensure suitable variance reduction. In PCL, no such tracking is required. This difference is more dramatic in Unified PCL, where a single model is trained both as an actor and a critic. 
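For reference, the d-step advantage estimate of (23) can be sketched in a few lines; as the comparison above suggests, it coincides with the PCL consistency error of (14) at \u03c4 = 0. The function name and all numbers below are toy inventions.

```python
# The d-step advantage estimate of eq. (23); note it is exactly the PCL soft
# consistency error (14) with tau = 0. All numbers below are toy values.
def advantage(v_i, v_id, rewards, gamma):
    """v_i, v_id: critic values at s_i and s_{i+d}; rewards: the d rewards."""
    d = len(rewards)
    return -v_i + gamma ** d * v_id + sum(gamma ** j * rewards[j] for j in range(d))

a_hat = advantage(1.0, 2.0, [0.5, 0.5], gamma=0.9)
# -1 + 0.81 * 2 + 0.5 + 0.9 * 0.5 = 1.57
```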
That is, it is\nnot necessary to have a separate actor and critic; the actor itself can serve as its own critic.\nOne can also compare PCL to hard-max temporal consistency RL algorithms, such as Q-learning [43].\nIn fact, setting the rollout to d = 1 in Uni\ufb01ed PCL leads to a form of soft Q-learning, with the degree\nof softness determined by \u03c4. We therefore conclude that the path consistency-based algorithms\ndeveloped in this paper also generalize Q-learning. Importantly, PCL and Uni\ufb01ed PCL are not\nrestricted to single step consistencies, which is a major limitation of Q-learning. While some\nhave proposed using multi-step backups for hard-max Q-learning [26, 21], such an approach is not\ntheoretically sound, since the rewards received after a non-optimal action do not relate to the hard-max\nQ-values Q\u25e6. Therefore, one can interpret the notion of temporal consistency proposed in this paper\nas a sound generalization of the one-step temporal consistency given by hard-max Q-values.\n\n6 Related Work\n\nConnections between softmax Q-values and optimal entropy-regularized policies have been previously\nnoted. In some cases entropy regularization is expressed in the form of relative entropy [4, 6, 7, 31],\nand in other cases it is the standard entropy [47]. While these papers derive similar relationships to (7)\nand (8), they stop short of stating the single- and multi-step consistencies over all action choices we\nhighlight. Moreover, the algorithms proposed in those works are essentially single-step Q-learning\nvariants, which suffer from the limitation of using single-step backups. Another recent work [25]\nuses the softmax relationship in the limit of \u03c4 \u2192 0 and proposes to augment an actor-critic algorithm\nwith of\ufb02ine updates that minimize a set of single-step hard-max Bellman errors. 
Again, the methods\nwe propose are differentiated by the multi-step path-wise consistencies which allow the resulting\nalgorithms to utilize multi-step trajectories from off-policy samples in addition to on-policy samples.\nThe proposed PCL and Uni\ufb01ed PCL algorithms bear some similarity to multi-step Q-learning [26],\nwhich rather than minimizing one-step hard-max Bellman error, optimizes a Q-value function\napproximator by unrolling the trajectory for some number of steps before using a hard-max backup.\nWhile this method has shown some empirical success [21], its theoretical justi\ufb01cation is lacking,\nsince rewards received after a non-optimal action no longer relate to the hard-max Q-values Q\u25e6. In\ncontrast, the algorithms we propose incorporate the log-probabilities of the actions on a multi-step\nrollout, which is crucial for the version of softmax consistency we consider.\nOther notions of temporal consistency similar to softmax consistency have been discussed in the RL\nliterature. Previous work has used a Boltzmann weighted average operator [20, 5]. In particular, this\noperator has been used by [5] to propose an iterative algorithm converging to the optimal maximum\nreward policy inspired by the work of [15, 39]. While they use the Boltzmann weighted average,\nthey brie\ufb02y mention that a softmax (log-sum-exp) operator would have similar theoretical properties.\nMore recently [3] proposed a mellowmax operator, de\ufb01ned as log-average-exp. These log-average-\nexp operators share a similar non-expansion property, and the proofs of non-expansion are related.\n\n6\n\n\fSynthetic Tree\n\nCopy\n\nDuplicatedInput\n\nRepeatCopy\n\nReverse\n\nReversedAddition\n\nReversedAddition3\n\nHard ReversedAddition\n\nFigure 1: The results of PCL against A3C and DQN baselines. Each plot shows average reward\nacross 5 random training runs (10 for Synthetic Tree) after choosing best hyperparameters. 
We also\nshow a single standard deviation bar clipped at the min and max. The x-axis is number of training\niterations. PCL exhibits comparable performance to A3C in some tasks, but clearly outperforms A3C\non the more challenging tasks. Across all tasks, the performance of DQN is worse than PCL.\n\nAdditionally it is possible to show that when restricted to an in\ufb01nite horizon setting, the \ufb01xed point\nof the mellowmax operator is a constant shift of the Q\u2217 investigated here. In all these cases, the\nsuggested training algorithm optimizes a single-step consistency unlike PCL and Uni\ufb01ed PCL, which\noptimizes a multi-step consistency. Moreover, these papers do not present a clear relationship between\nthe action values at the \ufb01xed point and the entropy regularized expected reward objective, which was\nkey to the formulation and algorithmic development in this paper.\nFinally, there has been a considerable amount of work in reinforcement learning using off-policy data\nto design more sample ef\ufb01cient algorithms. Broadly speaking, these methods can be understood as\ntrading off bias [36, 34, 19, 9] and variance [28, 23]. Previous work that has considered multi-step\noff-policy learning has typically used a correction (e.g., via importance-sampling [29] or truncated\nimportance sampling with bias correction [23], or eligibility traces [28]). By contrast, our method\nde\ufb01nes an unbiased consistency for an entire trajectory applicable to on- and off-policy data. An\nempirical comparison with all these methods remains however an interesting avenue for future work.\n\n7 Experiments\n\nWe evaluate the proposed algorithms, namely PCL & Uni\ufb01ed PCL, across several different tasks and\ncompare them to an A3C implementation, based on [21], and an implementation of double Q-learning\nwith prioritized experience replay, based on [30]. We \ufb01nd that PCL can consistently match or beat the\nperformance of these baselines. 
We also provide a comparison between PCL and Unified PCL and find that the use of a single unified model for both values and policy can be competitive with PCL.\nThese new algorithms readily accommodate expert trajectories. Thus, for the more difficult tasks we also experiment with seeding the replay buffer with 10 randomly sampled expert trajectories. During training we ensure that these trajectories are not removed from the replay buffer and always have a maximal priority.\nThe details of the tasks and the experimental setup are provided in the Appendix.\n\n7.1 Results\n\nWe present the results of each of the variants PCL, A3C, and DQN in Figure 1. After finding the best hyperparameters (see the Supplementary Material), we plot the average reward over training iterations for five randomly seeded runs. For the Synthetic Tree environment, the same protocol is performed but with ten seeds instead.\n\n[Figure 2 panels: Synthetic Tree, Copy, DuplicatedInput, RepeatCopy, Reverse, ReversedAddition, ReversedAddition3, Hard ReversedAddition]\n\nFigure 2: The results of PCL vs. Unified PCL. Overall we find that using a single model for both values and policy is not detrimental to training. Although in some of the simpler tasks PCL has an edge over Unified PCL, on the more difficult tasks Unified PCL performs better.\n\n[Figure 3 panels: Reverse, ReversedAddition, ReversedAddition3, Hard ReversedAddition]\n\nFigure 3: The results of PCL vs. PCL augmented with a small number of expert trajectories on the hardest algorithmic tasks. 
We \ufb01nd that incorporating expert trajectories greatly improves performance.\n\nThe gap between PCL and A3C is hard to discern in some of the more simple tasks such as Copy,\nReverse, and RepeatCopy. However, a noticeable gap is observed in the Synthetic Tree and Dupli-\ncatedInput results and more signi\ufb01cant gaps are clear in the harder tasks, including ReversedAddition,\nReversedAddition3, and Hard ReversedAddition. Across all of the experiments, it is clear that the\nprioritized DQN performs worse than PCL. These results suggest that PCL is a competitive RL\nalgorithm, which in some cases signi\ufb01cantly outperforms strong baselines.\nWe compare PCL to Uni\ufb01ed PCL in Figure 2. The same protocol is performed to \ufb01nd the best\nhyperparameters and plot the average reward over several training iterations. We \ufb01nd that using a\nsingle model for both values and policy in Uni\ufb01ed PCL is slightly detrimental on the simpler tasks,\nbut on the more dif\ufb01cult tasks Uni\ufb01ed PCL is competitive or even better than PCL.\nWe present the results of PCL along with PCL augmented with expert trajectories in Figure 3. We\nobserve that the incorporation of expert trajectories helps a considerable amount. Despite only\nusing a small number of expert trajectories (i.e., 10) as opposed to the mini-batch size of 400, the\ninclusion of expert trajectories in the training process signi\ufb01cantly improves the agent\u2019s performance.\nWe performed similar experiments with Uni\ufb01ed PCL and observed a similar lift from using expert\ntrajectories. Incorporating expert trajectories in PCL is relatively trivial compared to the specialized\nmethods developed for other policy based algorithms [1, 12]. While we did not compare to other\nalgorithms that take advantage of expert trajectories, this success shows the promise of using path-\nwise consistencies. 
Importantly, the ability of PCL to incorporate expert trajectories without requiring adjustment or correction is a desirable property in real-world applications such as robotics.

8 Conclusion

We study the characteristics of the optimal policy and state values for a maximum expected reward objective in the presence of discounted entropy regularization. The introduction of an entropy regularizer induces an interesting softmax consistency between the optimal policy and optimal state values, which may be expressed as either a single-step or multi-step consistency. This softmax consistency leads to the development of Path Consistency Learning (PCL), an RL algorithm that resembles actor-critic in that it maintains and jointly learns a model of the state values and a model of the policy, and is similar to Q-learning in that it minimizes a measure of temporal consistency error. We also propose the variant Unified PCL, which maintains a single model for both the policy and the values, thus upending the actor-critic paradigm of separating the actor from the critic. Unlike standard policy based RL algorithms, PCL and Unified PCL apply to both on-policy and off-policy trajectory samples. Further, unlike value based RL algorithms, PCL and Unified PCL can take advantage of multi-step consistencies.
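The multi-step soft consistency error that PCL minimizes can be sketched numerically. The function below is our own illustration, with the objective written as we read it from the paper's earlier sections (names and argument layout are assumptions):

```python
def soft_consistency_error(v_start, v_end, rewards, log_pis, gamma, tau):
    """Soft consistency error C for a sub-trajectory s_0, a_0, ..., s_d
    (illustrative sketch, not the paper's code):

        C = -V(s_0) + gamma**d * V(s_d)
            + sum_{j=0}^{d-1} gamma**j * (r_j - tau * log pi(a_j | s_j))

    where d = len(rewards) = len(log_pis), tau is the entropy
    regularization weight, and gamma is the discount. PCL minimizes
    0.5 * C**2 over such sub-trajectories.
    """
    d = len(rewards)
    c = -v_start + gamma ** d * v_end
    for j, (r, lp) in enumerate(zip(rewards, log_pis)):
        c += gamma ** j * (r - tau * lp)
    return c
```

Because the optimal entropy regularized policy and state values make C zero along any action sequence, this error can be driven toward zero on off-policy (replay) samples as well as on-policy ones, which is what the Conclusion's on-/off-policy claim rests on.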
Empirically, PCL and Unified PCL exhibit a significant improvement over baseline methods across several algorithmic benchmarks.

9 Acknowledgment

We thank Rafael Cosman, Brendan O'Donoghue, Volodymyr Mnih, George Tucker, Irwan Bello, and the Google Brain team for insightful comments and discussions.

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[2] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

[3] K. Asadi and M. L. Littman. A new softmax operator for reinforcement learning. arXiv:1612.05628, 2016.

[4] M. G. Azar, V. Gómez, and H. J. Kappen. Dynamic policy programming with function approximation. AISTATS, 2011.

[5] M. G. Azar, V. Gómez, and H. J. Kappen. Dynamic policy programming. JMLR, 13(Nov), 2012.

[6] M. G. Azar, V. Gómez, and H. J. Kappen. Optimal control as a graphical model inference problem. Mach. Learn. J., 87, 2012.

[7] R. Fox, A. Pakman, and N. Tishby. G-learning: Taming the noise in reinforcement learning via soft updates. UAI, 2016.

[8] A. Gruslys, M. G. Azar, M. G. Bellemare, and R. Munos. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.

[9] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. ICRA, 2016.

[10] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. ICLR, 2017.

[11] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies.
arXiv:1702.08165, 2017.

[12] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.

[13] D.-A. Huang, A.-m. Farahmand, K. M. Kitani, and J. A. Bagnell. Approximate MaxEnt inverse optimal control and its application for mental simulation of human interactions. 2015.

[14] S. Kakade. A natural policy gradient. NIPS, 2001.

[15] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.

[16] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. IJRR, 2013.

[17] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(39), 2016.

[18] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. 2010.

[19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.

[20] M. L. Littman. Algorithms for sequential decision making. PhD thesis, Brown University, 1996.

[21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.

[22] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, 2015.

[23] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. NIPS, 2016.

[24] O. Nachum, M. Norouzi, and D. Schuurmans. Improving policy gradient by exploring under-appreciated rewards. ICLR, 2017.

[25] B. O'Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. PGQ: Combining policy gradient and Q-learning. ICLR, 2017.

[26] J. Peng and R. J. Williams.
Incremental multi-step Q-learning. Machine Learning, 22(1-3):283–290, 1996.

[27] J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. AAAI, 2010.

[28] D. Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

[29] D. Precup, R. S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function approximation. 2001.

[30] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. ICLR, 2016.

[31] J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440, 2017.

[32] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. ICML, 2015.

[33] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. ICLR, 2016.

[34] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. ICML, 2014.

[35] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 2nd edition, 2017. Preliminary Draft.

[36] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. NIPS, 1999.

[37] G. Tesauro. Temporal difference learning and TD-Gammon. CACM, 1995.

[38] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. IJCAI, 2015.

[39] E. Todorov. Linearly-solvable Markov decision problems. NIPS, 2006.

[40] E. Todorov. Policy gradients in linearly-solvable MDPs. NIPS, 2010.

[41] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. ICLR, 2017.

[42] Z. Wang, N. de Freitas, and M.
Lanctot. Dueling network architectures for deep reinforcement learning. ICLR, 2016.

[43] C. J. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.

[44] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[45] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. J., 1992.

[46] R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.

[47] B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, CMU, 2010.

[48] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. AAAI, 2008.