{"title": "Distributional Policy Optimization: An Alternative Approach for Continuous Control", "book": "Advances in Neural Information Processing Systems", "page_first": 1352, "page_last": 1362, "abstract": "We identify a fundamental problem in policy gradient-based methods in continuous control. As policy gradient methods require the agent's underlying probability distribution, they limit policy representation to parametric distribution classes. We show that optimizing over such sets results in local movement in the action space and thus convergence to sub-optimal solutions. We suggest a novel distributional framework, able to represent arbitrary distribution functions over the continuous action space. Using this framework, we construct a generative scheme, trained using an off-policy actor-critic paradigm, which we call the Generative Actor Critic (GAC). Compared to policy gradient methods, GAC does not require knowledge of the underlying probability distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable and often surpasses current state-of-the-art baselines in continuous domains.", "full_text": "Distributional Policy Optimization:\n\nAn Alternative Approach for Continuous Control\n\nChen Tessler\u2217, Guy Tennenholtz\u2217 and Shie Mannor\n\nchen.tessler@campus.technion.ac.il, guytenn@gmail.com, shie@ee.technion.ac.il\n\nTechnion Institute of Technology, Haifa, Israel\n\n\u2217 Equal Contribution\n\nAbstract\n\nWe identify a fundamental problem in policy gradient-based methods in continu-\nous control. As policy gradient methods require the agent\u2019s underlying probability\ndistribution, they limit policy representation to parametric distribution classes. We\nshow that optimizing over such sets results in local movement in the action space\nand thus convergence to sub-optimal solutions. 
We suggest a novel distributional framework, able to represent arbitrary distribution functions over the continuous action space. Using this framework, we construct a generative scheme, trained using an off-policy actor-critic paradigm, which we call the Generative Actor Critic (GAC). Compared to policy gradient methods, GAC does not require knowledge of the underlying probability distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable to and often surpasses current state-of-the-art baselines in continuous domains.

1 Introduction

Model-free Reinforcement Learning (RL) is a learning paradigm which aims to maximize a cumulative reward signal based on experience gathered through interaction with an environment [Sutton and Barto, 1998]. It is divided into two primary categories. Value-based approaches involve learning the value of each action and acting greedily with respect to it (i.e., selecting the action with the highest value). On the other hand, policy-based approaches (the focus of this work) learn the policy directly, thereby explicitly learning a mapping from state to action.
Policy gradients (PGs) [Sutton et al., 2000b] have been the go-to approach for learning policies in empirical applications. The combination of the policy gradient with recent advances in deep learning has enabled the application of RL in complex and challenging environments. Such domains include continuous control problems, in which an agent controls complex robotic machines both in simulation [Schulman et al., 2015, Haarnoja et al., 2017, Peng et al., 2018] as well as in real life [Levine et al., 2016, Andrychowicz et al., 2018, Riedmiller et al., 2018]. Nevertheless, there exists a fundamental problem when PG methods are applied to continuous control regimes.
As the gradients require knowledge of the probability of the performed action P(a | s), the PG is empirically limited to parametric distribution functions. Common parametric distributions used in the literature include the Gaussian [Schulman et al., 2015, 2017], Beta [Chou et al., 2017] and Delta [Silver et al., 2014, Lillicrap et al., 2015, Fujimoto et al., 2018] distribution functions.
In this work, we show that while the PG is properly defined over parametric distribution functions, it is prone to converge to sub-optimal extrema (Section 3). The leading reason is that these distributions are not convex in the distribution space1 and are thus limited to local improvement in the action space itself. Inspired by Approximate Policy Iteration schemes, for which convergence guarantees exist [Puterman and Brumelle, 1979], we introduce the Distributional Policy Optimization (DPO) framework, in which an agent's policy evolves towards a distribution over improving actions. This framework requires the ability to minimize a distance (loss function) defined over two distributions, as opposed to the policy gradient approach, which requires explicit differentiation through the density function.
DPO establishes the building blocks for our generative algorithm, the Generative Actor Critic2. It is composed of three elements: a generative model which represents the policy, a value, and a critic. The value and the critic are combined to obtain the advantage of each action. A target distribution is then defined as one which improves the value (i.e., all actions with negative advantage receive zero probability mass).

1As an example, consider the Gaussian distribution, which is known to be non-convex.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
The generative model is optimized directly from samples, without an explicit definition of the underlying probability distribution, using quantile regression and Autoregressive Implicit Quantile Networks (see Section 4). Generative Actor Critic is evaluated on tasks in the MuJoCo control suite (Section 5), showing promising results on several difficult baselines.

2 Preliminaries

We consider an infinite-horizon discounted Markov Decision Process (MDP) with a continuous action space. An MDP is defined as the 5-tuple (S, A, P, r, γ) [Puterman, 1994], where S is a countable state space, A the continuous action space, P : S × S × A → [0, 1] is a transition kernel, r : S × A → [0, 1] is a reward function, and γ ∈ (0, 1) is the discount factor. Let π : S → B(A) be a stationary policy, where B(A) is the set of probability measures on the Borel sets of A. We denote by Π the set of stationary stochastic policies. In addition to Π, one is often interested in optimizing over a set of parametric distributions. We denote the set of possible distribution parameters by Θ (e.g., the mean μ and variance σ of a Gaussian distribution).
Two measures of interest in RL are the value and action-value functions v^π ∈ R^|S| and Q^π ∈ R^(|S|×|A|), respectively. The action-value of a policy π, starting at state s and performing action a, is defined by Q^π(s, a) = E_π[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a]. The value function is then defined by v^π(s) = E_{a∼π(·|s)}[Q^π(s, a)]. Given the action-value and value functions, the advantage of an action a ∈ A at state s ∈ S is defined by A^π(s, a) = Q^π(s, a) − v^π(s).
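As a concrete illustration of these definitions, the following sketch (a toy example of our own; the single-state MDP, its rewards and its policy are hypothetical, not from the paper) computes Q^π, v^π, and the advantage, and recovers the set of actions with positive advantage:

```python
import numpy as np

# Hypothetical toy problem: a single-state MDP with three discrete actions,
# illustrating Q^pi, v^pi and the advantage A^pi(s, a) = Q^pi(s, a) - v^pi(s).
gamma = 0.9
rewards = np.array([0.1, 0.5, 0.3])   # r(s, a) for the three actions
pi = np.array([0.2, 0.5, 0.3])        # a stochastic policy pi(a | s)

# With a single state, Q^pi(s, a) = r(s, a) + gamma * v^pi(s) and
# v^pi(s) = E_{a~pi}[Q^pi(s, a)]; solving the two equations gives:
v = pi @ rewards / (1.0 - gamma)
Q = rewards + gamma * v
advantage = Q - v

# Actions with positive advantage; these form the support of the
# improving-action distributions discussed in the next section.
improving = advantage > 0
print(improving)
```

Here only the action whose reward exceeds the policy's average has positive advantage, matching the intuition that the advantage measures improvement over the current value.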
The optimal policy is defined by π* = arg max_{π∈Π} v^π and the optimal value by v* = v^{π*}.

3 From Policy Gradient to Distributional Policy Optimization

Current practical approaches leverage the Policy Gradient Theorem [Sutton et al., 2000b] in order to optimize a policy, which updates the policy parameters according to

θ_{k+1} = θ_k + α_k E_{s∼d(π_{θ_k})} E_{a∼π_{θ_k}(·|s)} [ ∇_θ log π_θ(a | s) |_{θ=θ_k} Q^{π_{θ_k}}(s, a) ] ,   (1)

where d(π) is the stationary distribution of states under π. Since this update rule requires knowledge of the log probability of each action under the current policy, log π_θ(a | s), empirical methods in continuous control resort to parametric distribution functions. Most commonly used are the Gaussian [Schulman et al., 2017], Beta [Chou et al., 2017] and deterministic Delta [Lillicrap et al., 2015] distribution functions. However, as we show in Proposition 1, this approach is not ensured to converge, even though there exists an optimal policy which is deterministic (i.e., a Delta) - a policy which is contained within this set.
The sub-optimality of uni-modal policies such as Gaussian or Delta distributions does not occur due to the limitation induced by their parametrization (e.g., the neural network), but is rather a result of the predefined set of policies. As an example, consider the set of Delta distributions. As illustrated in Figure 1, while this set is convex in the parameter μ (the mean of the distribution), it is not convex in the set Π. This is due to the fact that (1 − α)δ_{μ1} + αδ_{μ2} results in a stochastic distribution over two supports, which cannot be represented using a single Delta function.
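The local-movement phenomenon is easy to reproduce numerically. The sketch below (our own toy experiment; the bimodal reward, constants and seed are hypothetical stand-ins for the construction formalized in Proposition 1) runs a REINFORCE-style update of a narrow Gaussian policy on a one-state bandit. Initialized near the lower mode, the mean climbs that mode only and never crosses the low-reward valley to the global optimum:

```python
import numpy as np

# Hypothetical illustration of local movement in the action space:
# REINFORCE on a one-state bandit with a bimodal reward.
rng = np.random.default_rng(0)

def reward(a):
    # Local mode at a = 0 (height 0.5), global mode at a = 4 (height 1.0).
    return 0.5 * np.exp(-a ** 2) + 1.0 * np.exp(-(a - 4.0) ** 2)

mu, sigma, lr = 0.5, 0.1, 0.05
for _ in range(2000):
    a = rng.normal(mu, sigma, size=64)
    # grad_mu log N(a; mu, sigma^2) = (a - mu) / sigma^2
    mu += lr * np.mean((a - mu) / sigma ** 2 * reward(a))

print(mu)  # stays near the local mode at 0, far from the global mode at 4
```

Because the gradient only sees rewards within a few σ of μ, the update drifts toward the nearest mode, exactly the failure mode Figure 1(c) depicts.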
Parametric distributions such as Gaussian and Delta functions highlight this issue, as the policy gradient considers the gradient w.r.t. the parameters μ, σ. This results in local movement in the action space. Clearly such an approach can only guarantee convergence to a locally optimal solution and not a global one.

2Code provided in the following anonymous repository: github.com/tesslerc/GAC

Figure 1: (a): A conceptual diagram comparing policy optimization in parameter space Θ (black dots) in contrast to distribution space Π (white dots). Plots depict Q values in both spaces. As parameterized policies are non-convex in the distribution space, they are prone to converge to a local optimum. Considering the entire policy space ensures convergence to the global optimum. (b, c): Policy evolution of Delta (b) and Gaussian (c) parameterized policies for multi-modal problems.

Proposition 1. For any initial Gaussian policy π_0 ∼ N(μ_0, Σ) and L ∈ [0, v*/2) there exists an MDP M such that π_∞ satisfies

∥v* − v^{π_∞}∥_∞ > L ,   (2)

where π_∞ is the convergent result of a PG method with step size bounded by α. Moreover, given M the result follows even when μ_0 is only known to lie in some ball of radius R around μ̃_0, B_R(μ̃_0).
Proof sketch. For brevity we prove the case a ∈ R, such that B_R is a finite interval [a, b]. We also assume [a, b] ⊆ [μ_0 − 2α, μ_0 + 2α], and σ → 0. The general case proof can be found in the supplementary material. Let ε > 0.
We consider a single state MDP (i.e., x-armed bandit) with action space A = R and a multi-modal reward function (similar to the illustration in Figure 1b), defined by

r(a) = |cos( (2π / 8α)(a − μ_0) )| ( ε W_{μ_0−2α, μ_0+2α} + (1 − ε) W_{μ_0+2α, μ_0+6α} ) ,

where W_{x,y}(z) = 1 if z ∈ [x, y] and 0 otherwise is the window function.
In PG, we assume μ is parameterized by some parameters θ. Without loss of generality, let us consider the derivative with respect to θ = μ. At iteration k the derivative can be written as d/dμ log π_μ(a) |_{μ=μ_k} = −(1 / 2σ²)(μ_k − a). PG will thus update the policy parameter μ by μ_{k+1} = μ_k + α_k E_{a∼N(μ_k,σ)}[ (1 / 2σ²)(a − μ_k) r(a) ]. As σ → 0, it holds that sign{ E_{a∼N(μ_k,σ)} (a − μ_k) r(a) } = sign{ d r(a)/da |_{a=μ_k} }. It follows that if ε < 1/3 and μ_k ∈ [μ_0 − 2α, μ_0 + 2α] then so is μ_{k+1}. Then, μ_∞ ∈ [μ_0 − 2α, μ_0 + 2α]. That is, the policy can never reach the interval [μ_0 + 2α, μ_0 + 6α] in which the optimal solution lies. Hence, ∥v* − v^{π_∞}∥_∞ = 1 − 2ε and the result follows for ε < 1/3.

3.1 Distributional Policy Optimization (DPO)

In order to overcome issues present in parametric distribution functions, we consider an alternative approach. In our solution, the policy does not evolve based on the gradient w.r.t.
distribution parameters (e.g., μ, σ), but rather updates the policy distribution according to

π_{k+1} = Γ( π_k − α_k ∇_π d(D^{π_k}_{I^{π_k}}, π) |_{π=π_k} ) ,

where Γ is a projection operator onto the set of distributions, d : Π × Π → [0, ∞) is a distance measure (e.g., the Wasserstein distance), and D^π_{I^π}(s) is a distribution defined over the support I^π(s) = {a : A^π(s, a) > 0} (i.e., the actions with positive advantage). Table 1 provides examples of such distributions.

Algorithm 1 Distributional Policy Optimization (DPO)
1: Input: learning rates α_k ≫ β_k ≫ δ_k
2: π_{k+1} = Γ( π_k − α_k ∇_π d(D^{π′_k}_{I^{π′_k}}, π) |_{π=π_k} )
3: Q^{π′}_{k+1}(s, a) = Q^{π′}_k(s, a) + β_k ( r(s, a) + γ v^{π′}_k(s′) − Q^{π′}_k(s, a) )
4: v^{π′}_{k+1}(s) = v^{π′}_k(s) + β_k ( ∫_A π′_k(da | s) Q^{π′}_k(s, a) − v^{π′}_k(s) )
5: π′_{k+1} = π′_k + δ_k (π_k − π′_k)

Table 1: Examples of target distributions over the set of improving actions

Argmax: D^π_{I^π(s)}(a | s) = δ_{arg max_{a∈I^π(s)} A^π(s,a)}(a | s)
Linear: D^π_{I^π(s)}(a | s) = 1{a ∈ I^π(s)} A^π(s, a) / ∫_{I^π(s)} A^π(s, a′) da′
Boltzmann (β > 0): D^π_{I^π(s)}(a | s) = 1{a ∈ I^π(s)} exp(A^π(s, a)/β) / ∫_{I^π(s)} exp(A^π(s, a′)/β) da′
Uniform: D^π_{I^π(s)}(a | s) = Uniform(I^π(s))

Algorithm 1 describes the Distributional Policy Optimization (DPO) framework as a three time-scale approach to learning the policy.
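To make the iteration concrete, the following sketch is our own drastic simplification, not the full framework: a single state, a discretized action set, an exactly computed advantage in place of a learned critic, and a simple mixing step in place of the projected Wasserstein update. It iterates a DPO-style improvement with a "Linear"-style target (mass proportional to the positive advantage); unlike the Gaussian policy gradient above, the probability mass escapes the local mode:

```python
import numpy as np

# Minimal one-state sketch of a DPO-style iteration (our simplification):
# target distribution supported on positive-advantage actions only.
actions = np.linspace(-2.0, 6.0, 81)
reward = 0.5 * np.exp(-actions ** 2) + 1.0 * np.exp(-(actions - 4.0) ** 2)

pi = np.full_like(actions, 1.0 / len(actions))  # start from a uniform policy
for _ in range(50):
    v = pi @ reward                             # single state: v = E_pi[r]
    advantage = reward - v
    # Linear-style target: weight proportional to the positive advantage.
    target = np.where(advantage > 0, advantage, 0.0)
    target /= target.sum()
    pi = 0.9 * pi + 0.1 * target                # move the policy toward the target

best = actions[np.argmax(pi)]
print(best)  # mass concentrates on the global mode near a = 4
```

Each iteration raises the value, which in turn shrinks the positive-advantage support; the fixed point places all mass on the globally optimal action, mirroring the evolution depicted in Figure 2.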
It can be shown, under standard stochastic approximation assumptions [Borkar, 2009, Konda and Tsitsiklis, 2000, Bhatnagar and Lakshmanan, 2012, Chow et al., 2017], to converge to an optimal solution. DPO consists of 4 elements: (1) a policy π on a fast timescale, (2) a delayed policy π′ on a slow timescale, and (3) a value and (4) a critic, which estimate the quality of the delayed policy π′ on an intermediate timescale. Unlike the PG approach, DPO does not require access to the underlying p.d.f. In addition, π, which is updated on the fast timescale, views the delayed policy π′, the value and the critic as quasi-static, and as such it can be optimized using supervised learning techniques3. Finally, we note that in DPO, the target distribution D^{π′}_{I^{π′}} induces a higher value than the current policy π′, ensuring an always-improving policy.
The concept of policy evolution using the positive advantage is depicted in Figure 2. While the policy starts as a uni-modal distribution, it is not restricted to this subset of policies. As the policy evolves, fewer actions have positive advantage, and the process converges to an optimal solution. In the next section we construct a practical algorithm under the DPO framework using a generative actor.

4 Method

In this section we present our method, the Generative Actor Critic, which learns a policy based on the Distributional Policy Optimization framework (Section 3). Distributional Policy Optimization requires a model which is both capable of representing arbitrarily complex distributions and can be optimized by minimizing a distributional distance.
We consider the Autoregressive Implicit Quantile Network [Ostrovski et al., 2018], which is detailed below.

4.1 Quantile Regression & Autoregressive Implicit Quantile Networks

As seen in Algorithm 1, DPO requires the ability to minimize a distance between two distributions. The Implicit Quantile Network (IQN) [Dabney et al., 2018a] provides such an approach using the Wasserstein metric. The IQN receives a quantile value τ ∈ [0, 1] and is tasked with returning the value of the corresponding quantile of a target distribution. As the IQN learns to predict the value of the quantile, it allows one to sample from the underlying distribution (i.e., by sampling τ ∼ U([0, 1]) and performing a forward pass). Learning such a model requires the ability to estimate the quantiles. The quantile regression loss [Koenker and Hallock, 2001] provides this ability. It is given by ρ_τ(u) = (τ − 1{u ≤ 0})u, where τ ∈ [0, 1] is the quantile and u the error.

3Assuming the target distribution is 'fixed', the policy π can be trained using a supervised learning loss, e.g., GAN, VAE or AIQN.

Figure 2: Policy evolution of a general, non-parametric policy ((a) π_0, (b) π_1, (c) π_2, (d) π_k), where the target policy is a distribution over the actions with positive advantage. The horizontal dashed line denotes the current value of the policy, the green region denotes the target distribution (i.e., the actions with positive advantage), and π_k denotes the policy after multiple updates. As opposed to Delta and Gaussian distributions, the fixed point of this approach is the optimal policy.

Nevertheless, the IQN is only capable of coping with univariate (scalar) distribution functions. Ostrovski et al.
[2018] proposed to extend the IQN to the multi-variate case using quantile autoregression [Koenker and Xiao, 2006]. Let X = (X^1, . . . , X^n) be an n-dimensional random variable. Given a fixed ordering of the n dimensions, the c.d.f. can be written as the product of conditional likelihoods F_X(x) = P(X^1 ≤ x^1, . . . , X^n ≤ x^n) = Π_{i=1}^n F_{X^i | X^{i−1}, . . . , X^1}(x^i). The Autoregressive Implicit Quantile Network (AIQN) receives an i.i.d. vector τ ∼ U([0, 1]^n). The network architecture then ensures each output dimension x^i is conditioned on the previously generated values x^1, . . . , x^{i−1}; it is trained by minimizing the quantile regression loss.

4.2 Generative Actor Critic (GAC)

Next, we introduce a practical implementation of the DPO framework. As shown in Section 3, DPO is composed of 4 elements: an actor, a delayed actor, a value, and an action-value estimator. The Generative Actor Critic (GAC) uses a generative actor trained using an AIQN, as described below. Contrary to parametric distribution functions, a generative neural network acts as a universal function approximator, enabling us to represent arbitrarily complex distributions, as a corollary of the following lemma.
Lemma (Kernels and Randomization [Kallenberg, 2006]). Let π be a probability kernel from a measurable space S to a Borel space A. Then there exists some measurable function f : S × [0, 1] → A such that if θ is U(0, 1), then f(s, θ) has distribution π(a | s) for every s ∈ S.
Actor: DPO defines the actor as one which is capable of representing arbitrarily complex policies. To obtain this we construct a generative neural network, an AIQN. The AIQN learns a mapping from a sampled noise vector τ ∼ U([0, 1]^n) to a target distribution.
As illustrated in Figure 3, the actor network contains a recurrent cell which enables sequential generation of the action.
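The quantile regression machinery can be illustrated in isolation. The sketch below (our own scalar stochastic-gradient toy, not the paper's network) minimizes ρ_τ(u) = (τ − 1{u ≤ 0})u over samples and thereby recovers the τ-quantile of the sample distribution:

```python
import numpy as np

# The quantile regression loss from Section 4.1, with u = x - q the error;
# minimizing its expectation over samples x drives q to the tau-quantile.
def quantile_loss(tau, u):
    return (tau - (u <= 0.0)) * u

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=20000)

tau, q, lr = 0.9, 0.0, 0.01
for x in samples:
    u = x - q
    # (Sub)gradient of rho_tau w.r.t. q is -(tau - 1{u <= 0}).
    q += lr * (tau - (u <= 0.0))

print(q)  # approaches the 0.9-quantile of N(0, 1), roughly 1.28
```

At the stationary point, E[τ − 1{x ≤ q}] = 0, i.e., P(x ≤ q) = τ; an IQN applies this same loss per sampled τ, and the AIQN applies it per output dimension of the autoregressive factorization above.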
This generation scheme ensures the autoregressive nature of the model. Each generated action dimension is conditioned only on the current sampled noise scalar τ^i and the previous action dimensions a^{i−1}, . . . , a^1. In order to train the generative actor, the AIQN requires the ability to produce samples from the target distribution D^{π′}_{I^{π′}}. Although we are unable to sample from this distribution, given an action we are able to estimate its probability. An unbiased estimator of the loss can be attained by uniformly sampling actions and then multiplying them by their corresponding weight. More specifically, the weighted autoregressive quantile loss is defined by

Σ_{a_j ∼ U(A)} D^{π′}_{I^{π′}}(a_j | s) Σ_{i=1}^n ρ^k_{τ^i_j}( a^i_j − π_φ(τ^i_j | a^{i−1}_j, . . . , a^1_j) ) ,   (3)

where a^i_j is the i-th coordinate of action a_j, and ρ^k_{τ^i_j} is the Huber quantile loss [Huber, 1992, Dabney et al., 2018b]. Estimation of I^{π′} in the target distribution is obtained using the estimated advantage.

Delayed Actor: The delayed actor, also known as Polyak averaging [Polyak, 1990], is an appealing requirement, as it is common in off-policy actor-critic schemes [Lillicrap et al., 2015]. The delayed actor is an additional AIQN π_{θ′} which tracks π_θ. It is updated based on θ′_{k+1} = (1 − α)θ′_k + αθ_k and is used for training the value and critic networks.
Value and Action-Value: While it is possible to train a critic and use its empirical mean w.r.t. the policy as a value estimate, we found it to be noisy, resulting in bad convergence. We therefore train a value network to estimate the expectation of the critic w.r.t. the delayed policy. In addition, as suggested in Fujimoto et al.
[2018], we train two critic networks in parallel. During both policy and value updates, we refer to the minimal value of the two critics. We observed that this indeed reduced variance and improved overall performance.
To summarize, GAC combines 4 elements. The delayed actor tracks the actor using a Polyak averaging scheme. The value and critic networks estimate the performance of the delayed actor. Given the Q and v estimates, we are able to estimate the advantage of each action and thus construct the weighted autoregressive quantile loss, used to train the actor network. We refer the reader to the supplementary material for an exhaustive overview of the algorithm and architectural details.

Figure 3: Illustration of the actor's architecture. ⊗ is the Hadamard product, ⊕ a concatenation operator, and ψ a mapping [0, 1] → R^d.

5 Experiments

In order to evaluate our approach, we test GAC on a variety of continuous control tasks in the MuJoCo control suite [Todorov et al., 2012]. The agents are composed of n joints: from 2 joints in the simplistic Swimmer task and up to 17 in the Humanoid robot task. The state is a vector representation of the agent, containing the spatial location and angular velocity of each element. The action is a continuous n-dimensional vector, representing how much torque to apply to each joint. The task in these domains is to move forward as much as possible within a given time limit. We run each task for 1 million steps and, as GAC is an off-policy approach, evaluate the policy every 5000 steps and report the average over 10 evaluations. We train GAC using a batch size of 128 and uncorrelated Gaussian noise for exploration. Results are depicted in Figure 4. Each curve presented is a product of 5 training procedures with a randomly sampled seed.
In addition to our raw results, we compare to the relevant baselines4, including: (1) DDPG [Lillicrap et al., 2015], (2) TD3 [Fujimoto et al., 2018], an off-policy actor-critic approach which represents the policy using a deterministic Delta distribution, and (3) PPO [Schulman et al., 2017], an on-policy method which represents the policy using a Gaussian distribution.
As we have shown in the previous sections, DPO and GAC only require some target distribution to be defined, namely, a distribution over actions with positive advantage. In our results we present two such distributions: the Linear and Boltzmann distributions (see Table 1). We also test a non-autoregressive version of our model5 using an IQN. For completeness, we provide additional discussion regarding the various parameters and how they performed, in addition to a pseudo-code illustration of our approach, in the supplementary material.
Comparison to the policy gradient baselines: Results in Figure 4 show the ability of GAC to solve complex, high dimensional problems. GAC attains competitive results across all domains, often outperforming the baseline policy gradient algorithms and exhibiting lower variance. This is somewhat surprising, as GAC is a vanilla algorithm; it is not supported by the numerous improvements apparent in recent PG methods. In addition to these results, we provide numerical results in the supplementary material, which emphasize this claim.

4We use the implementations of DDPG and PPO from the OpenAI baselines repo [Dhariwal et al., 2017], and TD3 [Fujimoto et al., 2018] from the authors' GitHub repository.

5Theoretically, the dimensions of the actions may be correlated and thus should be represented using an autoregressive model.

Figure 4: Training curves on continuous control benchmarks.
For the Generative Actor Critic approach we present both the autoregressive and non-autoregressive variants; the exact hyperparameters for each domain are provided in the appendix.

Table 2: Relative best GAC results compared to the best policy gradient baseline

Humanoid-v2: +3447 (+595%) | Walker2d-v2: +533 (+14%) | Hopper-v2: +467 (+17%) | HalfCheetah-v2: −381 (−4%) | Ant-v2: −444 (−8%) | Swimmer-v2: +107 (+81%)

Parameter Comparison: Below we discuss how various parameters affect the behavior of GAC in terms of convergence rates and overall performance:

1. At each step, the target policy is approximated through samples using the weighted quantile loss (Equation (3)). The results presented in Figure 4 are obtained using 32 (256 for HalfCheetah and Walker) samples at each step. 32 (128) samples are taken uniformly over the action space and 32 (128) from the delayed policy π′ (a form of combining exploration and exploitation). Ablation tests showed that increasing the number of samples improved stability and overall performance. Moreover, we observed that the combination of both sampling methods is crucial for success.

2. Not presented is the Uniform distribution, which did not work well. We believe this is due to the fact that the Uniform target provides an equal weight to actions which are very good as well as to those which barely improve the value.

3. We observed that in most tasks, similar to the observations of Korenkevych et al. [2019], the AIQN model outperforms the IQN (non-autoregressive) one.

6 Related Work

Distributional RL: Recent interest in distributional methods for RL has grown with the introduction of deep RL approaches for learning the distribution of the return. Bellemare et al. [2017] presented the C51-DQN, which partitions the possible values [−v_max, v_max] into a fixed number of bins and estimates the p.d.f.
of the return over this discrete set. Dabney et al. [2017] extended this work by representing the c.d.f. using a fixed number of quantiles. Finally, Dabney et al. [2018a] extended the QR-DQN to represent the entire distribution using the Implicit Quantile Network (IQN). In addition to the empirical line of work, Qu et al. [2018] and Rowland et al. [2018] have provided fundamental theoretical results for this framework.
Generative Modeling: Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] combine two neural networks in a game-theoretic approach which attempts to find a Nash Equilibrium. This equilibrium is found when the generative model is capable of "fooling" the discriminator (i.e., the discriminator is no longer capable of distinguishing between samples produced from the real distribution and those from the generator). Multiple GAN models and training methods have been introduced, including the Wasserstein-GAN [Arjovsky et al., 2017], which minimizes the Wasserstein loss. However, as the optimization scheme is highly non-convex, these approaches are not proven to converge and may thus suffer from instability and mode collapse [Salimans et al., 2016].
Policy Learning: Learning a policy is generally performed using one of two methods. The Policy Gradient (PG) [Williams, 1992, Sutton et al., 2000a] defines the gradient as the direction which maximizes the reward under the assumed policy parametrization class. Although there have been a multitude of improvements, including the ability to cope with deterministic policies [Silver et al., 2014, Lillicrap et al., 2015], stabilize learning through trust region updates [Schulman et al., 2015, 2017] and Bayesian approaches [Ghavamzadeh et al., 2016], these methods are bound to parametric distribution sets (as the gradient is w.r.t. the log probability of the action).
An alternative line of work formulates the problem as maximum entropy RL [Haarnoja et al., 2018], which enables the definition of the target policy using an energy functional. However, training is performed via minimizing the KL-divergence. The need to know the KL-divergence limits practical implementation to parametric distribution functions, similar to PG methods.

7 Discussion and Future Work

In this work we presented limitations inherent to empirical Policy Gradient (PG) approaches in continuous control. While current PG methods in continuous control are computationally efficient, they are not ensured to converge to a global extremum. As the policy gradient is defined w.r.t. the log probability of the policy, the gradient results in local changes in the action space (e.g., changing the mean and variance of a Gaussian policy). These limitations do not occur in discrete action spaces. In order to ensure better asymptotic results, it is often necessary to use methods that are more complex and computationally demanding (i.e., "No Free Lunch" [Wolpert et al., 1997]). Existing approaches attempting to mitigate these issues either enrich the policy space using mixture models, or discretize the action space. However, while the discretization scheme is appealing, there is a clear trade-off between optimality and efficiency. While finer discretization improves guarantees, the complexity (number of discrete actions) grows exponentially in the action dimension [Tang and Agrawal, 2019].
Similar limitations to those inherent in PG approaches also exist when considering mixture models, such as Gaussian Mixtures. A mixture model of k Gaussians provides a categorical distribution over k Gaussian distributions. The policy gradient w.r.t. these parameters, similarly to the single Gaussian model, directly controls the mean μ and variance σ of each Gaussian independently.
As such, even a mixture model is confined to local improvement in the action space.
In practical scenarios, and as the number of Gaussians grows, it is likely that the modes of the mixture would be located in the vicinity of a global optimum. A Gaussian Mixture model may therefore be able to cope with various non-convex continuous control problems. Nevertheless, we note that Gaussian Mixture models, unlike a single Gaussian, are numerically unstable. Due to the summation over Gaussians, the log probability of a mixture of Gaussians does not admit a linear representation. This can cause numerical instability, and thus hinder the learning process. These insights lead us to question the optimality of current PG approaches in continuous control, suggesting that, although these approaches are well understood, there is room for research into alternative policy-based approaches.
In this paper we suggested the Distributional Policy Optimization (DPO) framework and its empirical implementation - the Generative Actor Critic (GAC). We evaluated GAC on a series of continuous control tasks in the MuJoCo control suite. When considering overall performance, we observed that despite the algorithmic maturity of PG methods, GAC attains competitive performance and often outperforms the various baselines. Nevertheless, as noted above, there is "no free lunch". While GAC remains as sample efficient as the current PG methods (in terms of the batch size during training and number of environment interactions), it suffers from high computational complexity.
Finally, the elementary framework presented in this paper can be extended in various future research directions. First, improving the computational efficiency is a top priority for GAC to achieve deployment in real robotic agents. In addition, as the target distribution is defined w.r.t.
the advantage function, future work may consider integrating uncertainty estimates in order to improve exploration. Moreover, PG methods have been thoroughly researched, and many of their improvements, such as trust region optimization [Schulman et al., 2015], can be adapted to the DPO framework. Finally, DPO and GAC can be readily applied to other well-known frameworks such as the Soft Actor-Critic [Haarnoja et al., 2018], in which the entropy of the policy is encouraged through an augmented reward function. We believe this work is a first step towards a principled alternative for RL in continuous action space domains.

8 Acknowledgement

We thank Yonathan Efroni for his fruitful comments that greatly improved this paper.

References

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR. org, 2017.

Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.

Vivek S Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.

Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution.
In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 834–843. JMLR. org, 2017.

Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044, 2017.

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018a.

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI baselines. https://github.com/openai/baselines, 2017.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

Mohammad Ghavamzadeh, Yaakov Engel, and Michal Valko. Bayesian policy gradient and actor-critic algorithms. The Journal of Machine Learning Research, 17(1):2319–2371, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR.
org, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865, 2018.

Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.

Olav Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.

Roger Koenker and Kevin Hallock. Quantile regression: An introduction. Journal of Economic Perspectives, 15(4):43–56, 2001.

Roger Koenker and Zhijie Xiao. Quantile autoregression. Journal of the American Statistical Association, 101(475):980–990, 2006.

Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.

Dmytro Korenkevych, A Rupam Mahmood, Gautham Vasan, and James Bergstra. Autoregressive policies for continuous control deep reinforcement learning. arXiv preprint arXiv:1903.11524, 2019.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Georg Ostrovski, Will Dabney, and Rémi Munos. Autoregressive quantile networks for generative modeling. arXiv preprint arXiv:1806.05575, 2018.

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.

Boris T Polyak. New stochastic approximation type procedures. Automat.
i Telemekh, 7(98-107):2, 1990.

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.

Martin L Puterman and Shelby L Brumelle. On the convergence of policy iteration in stationary dynamic programming. Mathematics of Operations Research, 4(1):60–69, 1979.

Chao Qu, Shie Mannor, and Huan Xu. Nonlinear distributional gradient temporal-difference learning. arXiv preprint arXiv:1805.07732, 2018.

Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing: solving sparse reward tasks from scratch. In International Conference on Machine Learning, pages 4341–4350, 2018.

Mark Rowland, Marc G Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163, 2018.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in neural information processing systems, pages 2234–2242, 2016.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1.
MIT Press, Cambridge, 1998.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000a.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000b.

Yunhao Tang and Shipra Agrawal. Discretizing continuous action space for on-policy optimization. arXiv preprint arXiv:1901.10500, 2019.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

David H Wolpert, William G Macready, et al. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.