{"title": "A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5940, "page_last": 5950, "abstract": "We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term. \nThe extant entropy-regularized MDPs can be cast into our framework. \nMoreover, under our framework, many regularization terms can bring multi-modality and sparsity, which are potentially useful in reinforcement learning. \nIn particular, we present sufficient and necessary conditions that induce a sparse optimal policy. We also conduct a full mathematical analysis of the proposed regularized MDPs, including the optimality condition, performance error, and sparseness control. We provide a generic method to devise regularization forms and propose off-policy actor critic algorithms in complex environment settings. We empirically analyze the numerical properties of optimal policies and compare the performance of different sparse regularization forms in discrete and continuous environments.", "full_text": "A Regularized Approach to Sparse Optimal Policy in\n\nReinforcement Learning\n\nXiang Li\u2217\n\nSchool of Mathematical Sciences\n\nPeking University\n\nBeijing, China\n\nlx10077@pku.edu.cn\n\nWenhao Yang\u2217\n\nCenter for Data Science\n\nPeking University\n\nBeijing, China\n\nyangwenhaosms@pku.edu.cn\n\nNational Engineering Lab for Big Data Analysis and Applications\n\nZhihua Zhang\n\nSchool of Mathematical Sciences\n\nPeking University\n\nBeijing, China\n\nzhzhang@math.pku.edu.cn\n\nAbstract\n\nWe propose and study a general framework for regularized Markov decision pro-\ncesses (MDPs) where the goal is to \ufb01nd an optimal policy that maximizes the\nexpected discounted total reward plus a policy regularization term. 
The extant entropy-regularized MDPs can be cast into our framework. Moreover, under our framework, many regularization terms can bring multi-modality and sparsity, which are potentially useful in reinforcement learning. In particular, we present sufficient and necessary conditions that induce a sparse optimal policy. We also conduct a full mathematical analysis of the proposed regularized MDPs, including the optimality condition, performance error, and sparseness control. We provide a generic method to devise regularization forms and propose off-policy actor-critic algorithms for complex environment settings. We empirically analyze the numerical properties of optimal policies and compare the performance of different sparse regularization forms in discrete and continuous environments.

1 Introduction

Reinforcement learning (RL) aims to find an optimal policy that maximizes the expected discounted total reward in an MDP [4, 36]. It is not an easy task to solve the nonlinear Bellman equation [2] greedily in a high-dimensional action space or when function approximation (such as neural networks) is used. Even if the optimal policy is obtained precisely, it is often the case that the optimal policy is deterministic. Deterministic policies are ill-suited to coping with unexpected situations, in which their suggested action suddenly becomes unavailable or forbidden. By contrast, a multi-modal policy assigns positive probability mass to both optimal and near-optimal actions and hence has multiple alternatives for handling such cases. For example, an autonomous vehicle aims to do path planning with a pair of departure and destination as the state. When a suggested route is unfortunately congested, an alternative route can be provided by a multi-modal policy, whereas a deterministic policy cannot provide one without invoking a new computation. 
Therefore, in a real-life application, we hope the optimal policy to possess the property of multi-modality.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Entropy-regularized RL methods have been proposed to handle this issue. More specifically, an entropy bonus term is added to the expected long-term return. As a result, it not only softens the non-linearity of the original Bellman equation but also forces the optimal policy to be stochastic, which is desirable in problems where dealing with unexpected situations is crucial. In prior work, the Shannon entropy is usually used. The optimal policy then takes the form of a softmax, which has been shown to encourage exploration [8, 40]. However, a softmax policy assigns a non-negligible probability mass to all actions, including really terrible and dismissible ones, which may result in an unsafe policy. For RL problems with high-dimensional action spaces, a sparse distribution is preferred for modeling a policy function, because it implicitly performs action filtration, i.e., weeds out suboptimal actions and maintains near-optimal actions. Thus, Lee et al. [19] proposed to use the Tsallis entropy [39] instead, giving rise to a sparse MDP where only a few actions have non-zero probability at each state in the optimal policy. Lee et al. [20] empirically showed that the general Tsallis entropy² also leads to a sparse MDP. Moreover, Tsallis-regularized RL has a lower performance error, i.e., the optimal value of Tsallis-regularized RL is closer to the original optimal value than that of Shannon-regularized RL.

The above discussions manifest that an entropy regularization characterizes the solution to the corresponding regularized RL. From Neu et al. 
[28], any entropy-regularized MDP can be viewed as a regularized convex optimization problem where the entropy serves as the regularizer and the decision variable is a stationary policy. Geist et al. [10] proposed a framework in which the MDP is regularized by a general strongly concave function; they analyze some variants of classic algorithms under that framework but do not provide insight into the choice of regularizers. On the other hand, a sparse optimal policy distribution is favored in RL problems with large action spaces. Prior work by Lee et al. [19] and Nachum et al. [27] obtains a sparse optimal policy via Tsallis entropy regularization. Considering the diversity and generality of regularization forms in convex optimization, it is natural to ask whether other regularizations can lead to sparseness. The answer is that there do exist other regularizers that induce sparsity.

In this paper, we propose a framework for regularized MDPs, where a general form of regularizer is imposed on the expected discounted total reward. This framework includes the entropy-regularized MDP as a special case, implying that certain regularizers can induce sparseness. We first give the optimality condition for regularized MDPs under the framework and then give necessary and sufficient conditions showing which kinds of regularization lead to a sparse optimal policy. Interestingly, there are many regularizers that can lead to sparseness, and the degree of sparseness can be controlled by the regularization coefficient. Furthermore, we show that regularized MDPs have a regularization-dependent performance error caused by the regularization term, which can guide us in choosing an effective regularization when dealing with problems with a continuous action space. 
To solve regularized MDPs, we employ the idea of generalized policy iteration and propose an off-policy actor-critic algorithm to examine the performance of different regularizers.

2 Notation and preliminaries

Markov Decision Processes. In reinforcement learning (RL) problems, the agent's interaction with the environment is often modeled as a Markov decision process (MDP). An MDP is defined by a tuple (S, A, P, r, P_0, γ), where S is the state space and A the action space with |A| actions. We use Δ_X to denote the simplex on any set X, defined as the set of distributions over X, i.e., Δ_X = {P : Σ_{x∈X} P(x) = 1, P(x) ≥ 0}. The vertex set of Δ_X is defined as V_X = {P ∈ Δ_X : ∃ x ∈ X s.t. P(x) = 1}. P : S × A → Δ_S is the unknown state transition probability distribution and r : S × A → [0, R_max] is the bounded reward on each transition. P_0 is the distribution of the initial state and γ ∈ [0, 1) is the discount factor.

Optimality Condition of MDPs. The goal of RL is to find a stationary policy π : S → Δ_A, mapping from the state space to the simplex over actions, that maximizes the expected discounted total reward, i.e.,

max_π E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, P_0 ],    (1)

where s_0 ∼ P_0, a_t ∼ π(·|s_t), and s_{t+1} ∼ P(·|s_t, a_t). Given any policy π, its state value and Q-value functions are defined respectively as

V^π(s) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, π ],   Q^π(s, a) = r(s, a) + γ E_{s'|s,a} V^π(s').

Any solution of the problem (1) is called an optimal policy and denoted by π*. Optimal policies may not be unique in an MDP, but the optimal state value is unique (denoted V*). Actually, V* is the unique fixed point of the Bellman operator T, i.e., V*(s) = T V*(s), where

T V(s) ≜ max_π E_{a∼π(·|s)} [ r(s, a) + γ E_{s'|s,a} V(s') ].

π* is often a deterministic policy, which puts all probability mass on one action [31]. Actually, it can be obtained as the greedy action w.r.t. the optimal Q-value function, i.e., π*(s) ∈ argmax_a Q*(s, a).³ The optimal Q-value can be obtained from the state value V*(s) by definition. In summary, any optimal policy π* and its optimal state value V* and Q-value Q* satisfy the following optimality condition for all states and actions:

Q*(s, a) = r(s, a) + γ E_{s'|s,a} V*(s'),   V*(s) = max_a Q*(s, a),   π*(s) ∈ argmax_a Q*(s, a).

²The general Tsallis entropy is defined with an additional real-valued parameter, called an entropic index. Lee et al. [20] show that when this entropic index is large enough, the optimal policy is sparse.

³π* is not necessarily deterministic. If there are two actions a_1, a_2 that attain the maximum of T V(s) for a fixed s ∈ S, one can show that the stochastic policy π(a_1|s) = 1 − π(a_2|s) = p ∈ [0, 1] is also optimal.

3 Regularized MDPs

To obtain a sparse but multi-modal optimal policy, we impose a general regularization term on the objective (1) and solve the following regularized MDP problem

max_π E[ Σ_{t=0}^∞ γ^t ( r(s_t, a_t) + λ φ(π(a_t|s_t)) ) | π, P_0 ],    (2)

where φ(·) is a regularization function. Problem (2) can be seen as an RL problem in which the reward function is the sum of the original reward function r(s, a) and a term φ(π(a|s)) that provides regularization. If we take the expectation of the regularization term φ(π(a|s)), we find that the quantity

H_φ(π) = E_{a∼π(·|s)} φ(π(a|s))    (3)

is entropy-like but not necessarily an entropy in our work. However, Problem (2) is not well-defined for arbitrary regularizers, which would be more of a hindrance than a help. In the following, we make some assumptions about φ(·).

3.1 Assumption for regularizers

A regularizer φ(·) characterizes solutions to Problem (2). In order to make Problem (2) analyzable, a basic assumption (Assumption 1) is a prerequisite. Explanations and examples will be provided to show that this assumption is reasonable and not strict.

Assumption 1 The regularizer φ(x) is assumed to satisfy the following conditions on the interval (0, 1]: (1) Monotonicity: φ(x) is non-increasing; (2) Non-negativity: φ(1) = 0; (3) Differentiability: φ(x) is differentiable; (4) Expected Concavity: xφ(x) is strictly concave.

The assumptions of monotonicity and non-negativity make the regularizer a positive exploration bonus. The bonus for choosing an action of high probability is less than that for choosing an action of low probability. When the policy becomes deterministic, the bonus is forced to zero. The assumption of differentiability facilitates theoretical analysis and benefits practical implementation, thanks to the automatic differentiation widely available in deep learning platforms. The last assumption of expected concavity makes H_φ(π) a concave function w.r.t. π. Thus any solution to Eqn. (2) hardly lies in the vertex set of the action simplex V_A (where the policy is deterministic). As a byproduct, let f_φ(x) = xφ(x). Then its derivative f'_φ(x) = φ(x) + xφ'(x) is a strictly decreasing function on (0, 1), and thus (f'_φ)^{-1}(x) exists. For simplicity, we denote g_φ(x) = (f'_φ)^{-1}(x).

There are plenty of options for the regularizer φ(·) that satisfy Assumption 1. First, entropies can be recovered from H_φ(π) with specific φ(·). For example, when φ(x) = −log x, the Shannon entropy is recovered; when φ(x) = k/(q−1) (1 − x^{q−1}) with k > 0, the Tsallis entropy is recovered. Second, there are many instances that are not viewed as an entropy but can serve as a regularizer. We find two families of such functions, namely, the exponential function family q − x^k q^x with k ≥ 0, q ≥ 1, and the trigonometric function family cos(θx) − cos(θ) and sin(θ) − sin(θx), both with hyper-parameter θ ∈ (0, π/2]. Since these functions are simple, we term them basic functions.

Apart from the basic functions mentioned earlier, we come up with a generic method to combine different basic functions. Let F be the set of all functions satisfying Assumption 1. By Proposition 1, the operations of positive addition and minimum preserve the properties shared by F. Therefore, a finite number of applications of such operations still yields an admissible regularizer.

Proposition 1 Given φ_1, φ_2 ∈ F, we have αφ_1 + βφ_2 ∈ F for all α, β ≥ 0 and min{φ_1, φ_2} ∈ F.

Here we only consider differentiable min{φ_1, φ_2} in the theoretical analysis, because the minimum of any two functions in F may be non-differentiable at some points. 
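Assumption 1 and the closure operations of Proposition 1 are easy to spot-check numerically. The sketch below (our own illustration, not part of the paper's toolchain) grids the basic regularizers that reappear in the experiments, together with three combined ones, and tests monotonicity, φ(1) = 0, and midpoint concavity of xφ(x); a grid test is of course a sanity check, not a proof:

```python
import math

# Candidate regularizers phi(x) on (0, 1]. The names and constants are the
# paper's examples: Shannon, Tsallis with k = 1, q = 2, exponential with
# q = e, k = 0, and trigonometric with theta = pi/2.
REGULARIZERS = {
    "shannon": lambda x: -math.log(x),
    "tsallis": lambda x: 1.0 - x,                    # k/(q-1)(1 - x^(q-1)), k=1, q=2
    "exp":     lambda x: math.e - math.exp(x),       # q - x^k q^x with q=e, k=0
    "cos":     lambda x: math.cos(math.pi / 2 * x),  # cos(theta x) - cos(theta), theta=pi/2
}

def satisfies_assumption_1(phi, n=2000, eps=1e-9):
    """Spot-check conditions (1), (2), (4) of Assumption 1 on a grid of (0, 1]."""
    xs = [i / n for i in range(1, n + 1)]
    vals = [phi(x) for x in xs]
    monotone = all(a >= b - eps for a, b in zip(vals, vals[1:]))  # non-increasing
    zero_at_one = abs(phi(1.0)) < 1e-8                            # phi(1) = 0
    f = lambda x: x * phi(x)                                      # expected concavity:
    concave = all(f(0.5 * (x + y)) >= 0.5 * (f(x) + f(y)) - eps   # midpoint test on x*phi(x)
                  for x, y in zip(xs, xs[2:]))
    return monotone and zero_at_one and concave

# Proposition 1: positive combinations and minima stay in the family F.
combined = {
    "mix":  lambda x: -math.log(x) + 0.5 * (1 - x),
    "min":  lambda x: min(-math.log(x), 2 * (1 - x)),
    "poly": lambda x: 0.5 * (1 - x) + (1 - x ** 2),
}

for name, phi in {**REGULARIZERS, **combined}.items():
    print(name, satisfies_assumption_1(phi))
```

Differentiability (condition 3) is not checked here; for min-combinations it can fail at isolated points, as noted below.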
For instance, given any q > 1, the minimum of −log(x) and q(1 − x) has a unique non-differentiable point on (0, 1).

3.2 Optimality and sparsity

Once the regularizer φ(·) is given, similar to the non-regularized case, the (regularized) state value and Q-value functions of any given policy π in a regularized MDP are defined as

V^π_λ(s) = E[ Σ_{t=0}^{+∞} γ^t ( r(s_t, a_t) + λ φ(π(a_t|s_t)) ) | s_0 = s, π ],
Q^π_λ(s, a) = r(s, a) + γ E_{s'|s,a} V^π_λ(s').    (4)

Any solution to Problem (2) is called a regularized optimal policy (denoted π*_λ). The corresponding regularized optimal state value and Q-value are also optimal and denoted by V*_λ and Q*_λ respectively. If the context is clear, we will omit the word regularized for simplicity. In this part, we aim to show the optimality condition for the regularized MDPs (Theorem 1). 
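For a fixed policy on a small tabular MDP, the regularized value V^π_λ of Eqn. (4) can be computed exactly, since the regularizer merely shifts the reward to r(s, a) + λφ(π(a|s)). A minimal sketch (the random MDP, its sizes, and the tsallis-style φ(x) = (1 − x)/2 are illustrative assumptions):

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma, lam, phi):
    """Solve V = r_pi + gamma * P_pi V, the regularized value of Eqn. (4).

    P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy.
    The regularizer only shifts the reward: r(s, a) + lam * phi(pi(a|s)).
    """
    S, A = r.shape
    # phi is only defined on (0, 1]; zero-probability actions contribute nothing.
    phi_vals = np.where(pi > 0, phi(np.clip(pi, 1e-12, 1.0)), 0.0)
    r_reg = np.einsum("sa,sa->s", pi, r + lam * phi_vals)  # expected one-step reward
    P_pi = np.einsum("sa,sat->st", pi, P)                  # state-to-state kernel
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_reg)

# Tiny random MDP, uniform policy, tsallis-style regularizer phi(x) = (1 - x)/2.
rng = np.random.default_rng(0)
S, A, gamma, lam = 5, 3, 0.9, 0.1
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))
pi = np.full((S, A), 1.0 / A)
V = evaluate_policy(P, r, pi, gamma, lam, phi=lambda x: (1 - x) / 2)
print(V)
```

With the uniform policy the per-step bonus is exactly λφ(1/|A|), so the regularized value exceeds the unregularized one by λφ(1/|A|)/(1 − γ) in every state, consistent with the error bound discussed in Section 3.3.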
The proof of Theorem 1 is given in Appendix B.

Theorem 1 Any optimal policy π*_λ and its optimal state value V*_λ and Q-value Q*_λ satisfy the following optimality condition for all states and actions:

Q*_λ(s, a) = r(s, a) + γ E_{s'|s,a} V*_λ(s'),
π*_λ(a|s) = max{ g_φ( (μ*_λ(s) − Q*_λ(s, a)) / λ ), 0 },    (5)
V*_λ(s) = μ*_λ(s) − λ Σ_a π*_λ(a|s)² φ'(π*_λ(a|s)),

where g_φ(x) = (f'_φ)^{-1}(x) is strictly decreasing and μ*_λ(s) is a normalization term so that Σ_{a∈A} π*_λ(a|s) = 1.

Theorem 1 shows how the regularization influences the optimality condition. Let f'_φ(0) ≜ lim_{x→0+} f'_φ(x) for short. From (5), it can be shown that the optimal policy π*_λ assigns zero probability to the actions whose Q-values Q*_λ(s, a) are below the threshold μ*_λ(s) − λ f'_φ(0), and assigns positive probability to near-optimal actions in proportion to their Q-values (since g_φ(x) is decreasing). The threshold involves μ*_λ(s) and f'_φ(0). μ*_λ(s) can be uniquely solved from the equation obtained by plugging Eqn. (5) into Σ_{a∈A} π*_λ(a|s) = 1. Note that the resulting equation only involves {Q*_λ(s, a)}_{a∈A}; thus μ*_λ(s) is actually always a multivariate finite-valued function of {Q*_λ(s, a)}_{a∈A}. However, the value f'_φ(0) can be infinite, rendering the threshold vacuous: if f'_φ(0) = ∞, the threshold is −∞ and all actions are assigned positive probability in any optimal policy. To characterize the number of zero-probability actions, we define a δ-sparse policy in Definition 1. It is trivial that 1/|A| ≤ δ ≤ 1. For instance, a deterministic optimal policy in a non-regularized MDP is 1/|A|-sparse.

Definition 1 A given policy π : S → Δ_A is called δ-sparse if it satisfies

|{(s, a) ∈ S × A : π(a|s) ≠ 0}| / (|S||A|) ≤ δ.    (6)

If π(a|s) > 0 for all (s, a) ∈ S × A, we say it has no sparsity.

Theorem 2 If lim_{x→0+} f'_φ(x) = ∞ (or 0 ∉ dom f'_φ), the optimal policy of the regularized MDP is not sparse.

Theorem 2 provides a criterion to determine whether a regularization can render the corresponding regularized optimal policy sparse. To facilitate understanding, let us see two examples. When φ(x) = −log(x), we have lim_{x→0+} f'_φ(x) = lim_{x→0+} −log(x) − 1 = ∞, which implies that the optimal policy of the Shannon entropy-regularized MDP has no sparsity. When φ(x) = k/(q−1) (1 − x^{q−1}) for q > 1, the corresponding optimal policy can be sparse if λ is small enough, because lim_{x→0+} f'_φ(x) = k/(q−1). What's more, the sparseness property of the Tsallis entropy is kept for all 1 < q < ∞ and small λ, which is empirically shown in [20]. 
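The optimality condition (5) can also be solved numerically at a single state: the probability mass Σ_a max{g_φ((μ − Q_a)/λ), 0} is decreasing in μ, so μ*_λ(s) can be found by bisection, with g_φ itself obtained by inverting the decreasing f'_φ. The sketch below (the Q-values and λ are made up for illustration) contrasts the two examples above:

```python
import math

def inv_fprime(fprime, y, lo=1e-12, hi=1.0):
    """g_phi = (f'_phi)^(-1), extended by 0 above f'_phi(0+) as in Theorem 1."""
    if y >= fprime(lo):
        return 0.0               # beyond the threshold f'_phi(0): clipped to zero
    if y <= fprime(hi):
        return 1.0
    for _ in range(80):          # bisection: f'_phi is strictly decreasing
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if fprime(mid) > y else (lo, mid)
    return 0.5 * (lo + hi)

def sparse_optimal_policy(q, lam, fprime):
    """pi(a) = max{g_phi((mu - Q_a)/lam), 0}, mu fixed so the probs sum to 1."""
    n = len(q)
    mass = lambda mu: sum(inv_fprime(fprime, (mu - qa) / lam) for qa in q)
    lo = max(q) + lam * fprime(1.0)        # here mass >= 1 (best action gets prob 1)
    hi = max(q) + lam * fprime(1.0 / n)    # here mass <= 1 (best action gets prob 1/n)
    for _ in range(80):                    # mass is decreasing in mu
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) > 1.0 else (lo, mid)
    mu = 0.5 * (lo + hi)
    return [inv_fprime(fprime, (mu - qa) / lam) for qa in q]

# f'_phi(x) = phi(x) + x phi'(x) for the two running examples.
fp_shannon = lambda x: -math.log(x) - 1.0   # f'_phi(0) = infinity: never sparse
fp_tsallis = lambda x: 0.5 - x              # phi(x) = (1 - x)/2: f'_phi(0) = 1/2

q = [1.0, 0.9, 0.3, 0.0]
pi_sh = sparse_optimal_policy(q, lam=0.2, fprime=fp_shannon)
pi_ts = sparse_optimal_policy(q, lam=0.2, fprime=fp_tsallis)
print("shannon:", [round(p, 3) for p in pi_sh])  # every action keeps positive mass
print("tsallis:", [round(p, 3) for p in pi_ts])  # low-Q actions are cut to exactly 0
```

On this example the Tsallis policy puts zero mass on the two low-Q actions while the Shannon policy keeps every action positive, matching the f'_φ(0) criterion of Theorem 2; the δ of Definition 1 is simply the fraction of nonzero entries of such a policy over all states.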
Additionally, when 0 < q < 1, the Tsallis entropy no longer leads to sparseness, because lim_{x→0+} f'_φ(x) = lim_{x→0+} k/(1−q) (q x^{q−1} − 1) = ∞.

The sparseness property was first discussed in [19], which shows that the Tsallis entropy with k = 1/2 and q = 2 can devise a sparse MDP. However, we point out that any φ(·) such that f'_φ(0) < ∞ leads to a sparse MDP with a proper λ. Let G ⊆ F be the set that satisfies: ∀φ ∈ G, 0 ∈ dom f'_φ. The positive combination of any two regularizers belonging to G still belongs to G.

Proposition 2 Given φ_1, φ_2 ∈ G, we have that αφ_1 + βφ_2 ∈ G for all α, β ≥ 0. However, if φ_1 ∈ G but φ_2 ∉ G, then αφ_1 + βφ_2 ∉ G for any positive β.

It is easily checked that the two families (i.e., exponential functions and trigonometric functions) given in Section 3.1 can also induce a sparse MDP with a proper λ. For convenience, we prefer to term the function φ(x) = k/(q−1) (1 − x^{q−1}) that defines the Tsallis entropy a polynomial function, because when q > 1 it is a polynomial function of degree q−1. Additionally, these three basic families of functions can be combined to construct more complex regularizers (Propositions 1, 2).

Control of the Sparsity of the Optimal Policy. Theorem 2 shows that 0 ∈ dom f'_φ is necessary but not sufficient for the optimal policy π*_λ to be sparse. The sparsity of the optimal policy is also controlled by λ. Theorem 3 shows how the sparsity of the optimal policy is controlled by λ when f'_φ(0) < ∞. The proof is detailed in Appendix E.

Theorem 3 Let Q*_λ(s, a) and μ*_λ(s) be defined in Theorem 1 and assume f'_φ(0) < ∞. When λ → 0, the sparsity of the optimal policy π*_λ shrinks to δ = 1/|A|. When λ → ∞, the optimal policy has no sparsity. More specifically, π*_λ(a|s) → 1/|A| for all (s, a) ∈ S × A as λ → ∞.

3.3 Properties of regularized MDPs

In this section, we present some properties of regularized MDPs. We first prove the uniqueness of the optimal policy and value. Next, we bound the performance error between π*_λ (the optimal policy obtained by a regularized MDP) and π* (the policy obtained by the original MDP). In the proofs of this section, we need an additional assumption for regularizers. Assumption 2 is quite weak; all the functions introduced in Section 3.1 satisfy it.

Assumption 2 The regularizer φ(·) satisfies f_φ(0) ≜ lim_{x→0+} xφ(x) = 0.

Generic Bellman Operator T_λ. We define a new operator T_λ for regularized MDPs, which defines a smoothed maximum. Given a state s ∈ S and the current value function V_λ, T_λ is defined as

T_λ V_λ(s) ≜ max_π Σ_a π(a|s) [ Q_λ(s, a) + λ φ(π(a|s)) ],    (7)

where Q_λ(s, a) = r(s, a) + γ E_{s'|s,a} V_λ(s') is the Q-value function derived from one-step lookahead according to V_λ. By definition, T_λ maps V_λ(s) to its highest possible value, accounting for both future discounted rewards and the regularization term. We provide simple upper and lower bounds of T_λ w.r.t. T:

Theorem 4 Under Assumptions 1 and 2, for any value function V and s ∈ S, we have

T V(s) ≤ T_λ V(s) ≤ T V(s) + λ φ(1/|A|).    (8)
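Theorem 4 can be checked numerically in the one case where T_λ has a well-known closed form: for the Shannon regularizer φ(x) = −log x, the inner maximization in (7) gives T_λV(s) = λ log Σ_a exp(Q_λ(s, a)/λ), and φ(1/|A|) = log |A|. The sketch below (a synthetic random MDP, purely illustrative) iterates both operators to their fixed points and checks the sandwich bound (8), together with the implied fixed-point gap λ log|A|/(1 − γ):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, lam = 6, 4, 0.9, 0.5
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
r = rng.random((S, A))

def T(V):
    """Standard Bellman operator."""
    return (r + gamma * P @ V).max(-1)

def T_lam(V):
    """Smoothed maximum (7); log-sum-exp is exact for phi(x) = -log x."""
    Q = r + gamma * P @ V
    return lam * np.log(np.exp(Q / lam).sum(-1))

V, V_lam = np.zeros(S), np.zeros(S)
for _ in range(500):           # both operators are gamma-contractions
    V, V_lam = T(V), T_lam(V_lam)

gap = lam * np.log(A) / (1 - gamma)   # phi(1/|A|)/(1-gamma) with phi(1/|A|) = log|A|
print(V_lam - V, "<=", gap)
```

After convergence the fixed points satisfy V* ≤ V*_λ ≤ V* + λ log|A|/(1 − γ) componentwise, as the bound on the operators suggests.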
The bound (8) shows that T_λ is a bounded and smooth approximation of T. When λ = 0, T_λ degenerates to the Bellman operator T. Moreover, it can be proved that T_λ is a γ-contraction. By the Banach fixed point theorem [35], V*_λ, the fixed point of T_λ, is unique. As a result of Theorem 1, Q*_λ and π*_λ are both unique. We formally state the conclusion and give the proof in Appendix C.

Performance Error Between V*_λ and V*. In general, V*_λ ≠ V*, but their difference is controlled by both λ and φ(·). The behavior of φ(x) around the origin represents the regularization ability of φ(x). Theorem 5 shows that when |A| is quite large (which means φ(1/|A|) is close to φ(0) due to continuity), the closeness of φ(0) to 0 also determines their difference. As a result, Tsallis entropy-regularized MDPs always have tighter error bounds than Shannon entropy-regularized MDPs, because the value at the origin of the concave function k/(q−1) (1 − x^{q−1}) (q > 1) is much lower than that of −log x, both functions satisfying Assumption 2. Our theory incorporates the result of Lee et al. [19, 20], which shows a similar performance error for (general) Tsallis entropy RL. The proof of Theorem 5 is detailed in Appendix D.

Theorem 5 Under Assumptions 1 and 2, the error between V*_λ and V* can be bounded as

‖V*_λ − V*‖_∞ ≤ λ/(1 − γ) φ(1/|A|).

4 Regularized Actor-Critic

To solve the problem (2) in complex environments, we propose an off-policy algorithm, Regularized Actor-Critic (RAC), which alternates between policy evaluation and policy improvement. In practice, we apply neural networks to parameterize the Q-value and policy to increase expressive power. 
In particular, we model the regularized Q-value function Q_θ(s, a) and a tractable policy π_ψ(a|s). We use Adam [17] to optimize ψ and θ. RAC is created by consulting the previous work on SAC [13, 14] and making the changes necessary for it to be agnostic to the form of regularization. The goal in training the regularized Q-value parameters is to minimize the general Bellman residual:

J_Q(θ) = (1/2) Ê_D (Q_θ(s_t, a_t) − y)²,    (9)

where D is the replay buffer, used to eliminate the correlation of sampled trajectory data, and y is the target defined as follows:

y = r(s_t, a_t) + γ [ Q_θ̄(s_{t+1}, a_{t+1}) + λ φ(π_ψ(a_{t+1}|s_{t+1})) ].

The target involves a target regularized Q-value function with parameters θ̄ that are updated in a moving-average fashion, which can stabilize the training process [24, 13]. Thus the gradient of J_Q(θ) w.r.t. θ can be estimated by

∇̂J_Q(θ) = Ê_D ∇_θ Q_θ(s_t, a_t) (Q_θ(s_t, a_t) − y).

For training the policy parameters, we minimize the negative total reward:

J_π(ψ) = Ê_D [ E_{a∼π_ψ(·|s_t)} [ −λ φ(π_ψ(a|s_t)) − Q_θ(s_t, a) ] ].    (10)

RAC is formally described in Algorithm 1. The method alternates between data collection and parameter updating. Trajectory data is collected by executing the current policy in environments and then stored in a replay buffer. Parameters of the function approximators are updated by descending along the stochastic gradients computed from batches sampled from that replay buffer. The method makes use of two Q-functions to overcome the positive bias incurred by overestimation of the Q-value, which is known to yield poor performance [15, 9]. 
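For discrete actions, the target y in Eqn. (9) amounts to a few array operations per batch. Below is a plain-NumPy sketch (the tabular arrays standing in for Q_θ̄ and π_ψ, the toy sizes, and the tsallis-style φ are our own assumptions), using the minimum over the two target critics:

```python
import numpy as np

def rac_target(r_t, s_next, a_next, Q_targ1, Q_targ2, pi, gamma, lam, phi):
    """y = r + gamma * [min_i Q_targ_i(s', a') + lam * phi(pi(a'|s'))], per Eqn. (9)."""
    q_min = np.minimum(Q_targ1[s_next, a_next], Q_targ2[s_next, a_next])
    bonus = lam * phi(pi[s_next, a_next])
    return r_t + gamma * (q_min + bonus)

# Illustrative batch on a toy 3-state, 2-action problem.
rng = np.random.default_rng(2)
Q1, Q2 = rng.random((3, 2)), rng.random((3, 2))   # the two target critics
pi = np.full((3, 2), 0.5)                         # current policy probabilities
s_next = np.array([0, 1, 2]); a_next = np.array([0, 1, 0])
r_t = np.array([1.0, 0.0, 0.5])
y = rac_target(r_t, s_next, a_next, Q1, Q2, pi, gamma=0.99, lam=0.1,
               phi=lambda x: (1 - x) / 2)         # tsallis-style regularizer
print(y)
```

The critic step then descends (Q_θ(s_t, a_t) − y) ∇_θ Q_θ(s_t, a_t), as in the gradient estimate above.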
Specifically, these two Q-functions are parametrized by different parameters θ_i and are independently trained to minimize J_Q(θ_i). The minimum of these two Q-functions is used to compute the target value y, which is involved in the computation of ∇̂J_Q(θ) and ∇̂J_π(ψ).

Algorithm 1 Regularized Actor-Critic (RAC)
Input: θ_1, θ_2, ψ
Initialization: θ̄_1 ← θ_1, θ̄_2 ← θ_2, D ← ∅
for each iteration do
  for each environment step do
    sample action a_t ∼ π_ψ(·|s_t)
    receive reward r_t ∼ r(s_t, a_t)
    receive next state s_{t+1} from environment
    D ← D ∪ {(s_t, a_t, r_t, s_{t+1})}
  end for
  for each gradient step do
    θ_i ← θ_i − η_Q ∇̂J_Q(θ_i) for i ∈ {1, 2}
    ψ ← ψ − η_π ∇̂J_π(ψ)
    θ̄_i ← τ θ_i + (1 − τ) θ̄_i for i ∈ {1, 2}
  end for
end for
Output: θ_1, θ_2, ψ

5 Experiments

We investigate the performance of different regularizers in diverse environments. We first test basic and combined regularizers in two numerical environments. Then we test basic regularizers in discrete Atari problems. In the end, we explore possible applications in MuJoCo control environments.

5.1 Numerical results

The two discrete numerical environments we consider are a simple randomly generated MDP (S = 50, A = 10) and a Gridworld environment (S = 81, A = 4). Refer to Appendix H.1 for more detailed settings.

Regularizers. The four basic regularizers are shannon (−log x), tsallis ((1/2)(1 − x)), cos (cos((π/2)x)), and exp (exp(1) − exp(x)). Propositions 1 and 2 allow three combined regularizers: (1) min: the minimum of tsallis and shannon, i.e., min{−log(x), 2(1 − x)}; (2) poly: the positive addition of two polynomial functions, i.e., (1/2)(1 − x) + (1 − x²); and (3) mix: the positive addition of tsallis and shannon, i.e., −log(x) + (1/2)(1 − x).

Sparsity and Convergence. From (a) and (b) in Figure 1, when λ is extremely large, δ = 1 for all regularizers. (c) shows how the probability of each action in the optimal policy at a given state varies with λ (one curve represents one action). These results validate Theorem 3. A reasonable explanation is that a large λ reduces the importance of the discounted reward sum and makes H_φ(π) dominate the loss, which forces the optimal policy to put probability mass evenly on all actions in order to maximize H_φ(π). We regard the ability to resist this tendency towards converging to a uniform distribution as sparseness power. From our additional experiments in Appendix H, cos has the strongest sparseness power. (d) shows the convergence speed of RPI with different regularizers. It also shows that ‖V* − V^{π*_λ}‖_∞ is bounded as Theorem 4 states.

5.2 Atari results

Regularizers. We test the four basic regularizers across four discrete control tasks from the OpenAI Gym benchmark [5]. All the training details are in Appendix H.2.

Performance. Figure 2 shows the score during training for RAC with the four regularization forms, taking the best performance over λ ∈ {0.01, 0.1, 1.0}. Except on Breakout, shannon performs worse than the other three regularizers. Cos performs best in Alien and Seaquest, while tsallis performs best in Boxing and exp performs unremarkably. Appendix H.2 gives all the results with different λ and a sensitivity analysis. 
In general, shannon is the least sensitive among them.

(a) Random MDP  (b) Gridworld  (c) cos: cos((π/2)x)  (d) Error of different regularizers on Random MDP

Figure 1: (a) and (b) show the sparsity δ of optimal policies on the Random MDP and Gridworld. (c) shows how the probability of each action in the optimal policy regularized by cos((π/2)x) changes on the Random MDP. (d) shows the ℓ∞-error between V* and V^{π*_λ}.

(a) Alien  (b) Boxing  (c) Breakout  (d) Seaquest

Figure 2: Training curves on Atari games. Each entry in the legend is named with the rule the regularization form + λ. The score is smoothed with a window of 100, and the shaded area is one standard deviation.

5.3 Mujoco results

Regularizers. We explore basic regularizers across four continuous control tasks from the OpenAI Gym benchmark [5] with the MuJoCo simulator [38]. Unfortunately, cos is quite unstable and prone to exploding-gradient problems in the deep RL training process. We speculate that its instability is rooted in numerical issues where the probability density function often diverges to infinity. What's more, the periodicity of cos((π/2)x) makes the gradients vacillate and the algorithm hard to converge. All the details of the following experiments are given in Appendix H.3.

(a) Ant-v2  (b) Walker-v2  (c) Hopper-v2  (d) HalfCheetah-v2

Figure 3: Training curves on continuous control benchmarks. Each curve is the average of four experiments with different seeds. Each entry in the legend is named with the rule the regularization form + λ. The score is smoothed with a window of 30, and the shaded area is one standard deviation.

Performance. Figure 3 shows the total average return of rollouts during training for RAC with three regularization forms and different regularization coefficients ([0.01, 0.1, 1]). 
For each curve, we train four instances with different random seeds. Tsallis performs consistently better than shannon given the same regularization coefficient λ. Tsallis is also more stable, since its shaded area is thinner than shannon's. Exp performs almost as well as tsallis on Ant-v2 and Hopper-v2 but performs poorly in the other two environments. The sensitivity analysis provided in Appendix H.3 shows that tsallis is less sensitive to λ than cos and shannon.

6 Conclusion

In this paper, we have proposed a unified framework for regularized reinforcement learning, which includes entropy-regularized RL as a special case.
Under this framework, the regularization function characterizes the optimal policy and value of the corresponding regularized MDP. We have shown that many regularization functions, such as trigonometric and exponential functions, can lead to a sparse yet multi-modal optimal policy. We have specified a necessary and sufficient condition under which a regularization function induces sparse optimal policies, and shown how the sparsity is controlled by λ. We have presented the mathematical foundations of these properties and supported them with experimental results.

Acknowledgements

This work is sponsored by the Key Project of MOST of China (No. 2018AAA0101000), by Beijing Municipal Commission of Science and Technology under Grant No. 181100008918005, and by Beijing Academy of Artificial Intelligence (BAAI).

References

[1] Kavosh Asadi and Michael L Littman. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 243–252. JMLR.org, 2017.

[2] Richard Bellman. Dynamic programming. Princeton University Press, Princeton, NJ, 1957.

[3] Boris Belousov and Jan Peters. f-divergence constrained policy improvement. arXiv preprint arXiv:1801.00056, 2017.

[4] Dimitri P Bertsekas. Neuro-dynamic programming. In Encyclopedia of Optimization, pages 2555–2560. Springer, 2008.

[5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[6] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1133–1142, 2018.

[7] Amir M Farahmand, Mohammad Ghavamzadeh, Shie Mannor, and Csaba Szepesvári. Regularized policy iteration.
In Advances in Neural Information Processing Systems, pages 441–448, 2009.

[8] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

[9] Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[10] Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275, 2019.

[11] Jordi Grau-Moya, Felix Leibfried, and Peter Vrancx. Soft Q-learning with mutual-information regularization. 2018.

[12] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[13] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[14] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

[15] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.

[16] Jeffrey Johns, Christopher Painter-Wakefield, and Ronald Parr. Linear complementarity for regularized policy evaluation and improvement. In Advances in Neural Information Processing Systems, pages 1009–1017, 2010.

[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] J Zico Kolter and Andrew Y Ng. Regularization and feature selection in least-squares temporal difference learning.
In Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528. ACM, 2009.

[19] Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473, 2018.

[20] Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, and Songhwai Oh. Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning. arXiv preprint arXiv:1902.00137, 2019.

[21] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.

[22] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

[23] Amir Massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In American Control Conference (ACC'09), pages 725–730. IEEE, 2009.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[25] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785, 2017.

[26] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017.

[27] Ofir Nachum, Yinlam Chow, and Mohammad Ghavamzadeh. Path consistency learning in Tsallis entropy regularized MDPs.
arXiv preprint arXiv:1802.03501, 2018.

[28] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

[29] Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.

[30] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta, 2010.

[31] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[32] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[33] John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

[34] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[35] David Roger Smart. Fixed point theorems, volume 66. CUP Archive, 1980.

[36] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[37] Emanuel Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376, 2007.

[38] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

[39] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487, 1988.

[40] Peter Vamplew, Richard Dazeley, and Cameron Foale.
Softmax exploration strategies for multiobjective reinforcement learning. Neurocomputing, 263:74–86, 2017.

[41] Lieven Vandenberghe. The CVXOPT linear and quadratic cone program solvers. Online: http://cvxopt.org/documentation/coneprog.pdf, 2010.