{"title": "On Learning Intrinsic Rewards for Policy Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 4644, "page_last": 4654, "abstract": "In many sequential decision making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behavior that is considered good by the agent designer. A number of different formulations of the reward-design problem, or close variants thereof, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et al. that defines the optimal intrinsic reward function as one that when used by an RL agent achieves behavior that optimizes the task-specifying or extrinsic reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead search based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remains an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains.", "full_text": "On Learning Intrinsic Rewards for\n\nPolicy Gradient Methods\n\nZeyu Zheng\n\nJunhyuk Oh\n\nSatinder Singh\n\nComputer Science & Engineering\n\nUniversity of Michigan\n\n{zeyu,junhyuk,baveja}@umich.edu\n\nAbstract\n\nIn many sequential decision making tasks, it is challenging to design reward\nfunctions that help an RL agent ef\ufb01ciently learn behavior that is considered good\nby the agent designer. A number of different formulations of the reward-design\nproblem have been proposed in the literature. 
In this paper we build on the Optimal\nRewards Framework of Singh et al. [2010] that de\ufb01nes the optimal intrinsic reward\nfunction as one that when used by an RL agent achieves behavior that optimizes the\ntask-specifying or extrinsic reward function. Previous work in this framework has\nshown how good intrinsic reward functions can be learned for lookahead search\nbased planning agents. Whether it is possible to learn intrinsic reward functions\nfor learning agents remains an open problem. In this paper we derive a novel\nalgorithm for learning intrinsic rewards for policy-gradient based learning agents.\nWe compare the performance of an augmented agent that uses our algorithm to\nprovide additive intrinsic rewards to an A2C-based policy learner (for Atari games)\nand a PPO-based policy learner (for Mujoco domains) with a baseline agent that\nuses the same policy learners but with only extrinsic rewards. We also compare our\nmethod with using a constant \u201clive bonus\u201d and with using a count-based exploration\nbonus (i.e., pixel-SimHash). Our results show improved performance on most but\nnot all of the domains.\n\n1\n\nIntroduction\n\nOne of the challenges facing an agent-designer in formulating a sequential decision making task\nas a Reinforcement Learning (RL) problem is that of de\ufb01ning a reward function. In some cases a\nchoice of reward function is clear from the designer\u2019s understanding of the task. For example, in\nboard games such as Chess or Go the notion of win/loss/draw comes with the game de\ufb01nition, and in\nAtari games there is a game score that is part of the game. In other cases there may not be any clear\nchoice of reward function. For example, in domains in which the agent is interacting with humans in\nthe environment and the objective is to maximize human-satisfaction it can be hard to de\ufb01ne a reward\nfunction. 
Similarly, when the task objective contains multiple criteria, such as minimizing energy consumption, maximizing throughput, and minimizing latency, it is not clear how to combine these into a single scalar-valued reward function.
Even when a reward function can be defined, it is not unique in the sense that certain transformations of the reward function, e.g., adding a potential-based reward [Ng et al., 1999], will not change the resulting ordering over agent behaviors. While the choice of potential-based or other (policy) order-preserving reward function used to transform the original reward function does not change what the optimal policy is, it can change, for better or for worse, the sample (and computational) complexity of the RL agent learning from experience in its environment using the transformed reward function.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Yet another aspect of the challenge of reward-design stems from the observation that in many complex real-world tasks an RL agent is simply not going to learn an optimal policy because of various bounds (or limitations) on the agent-environment interaction (e.g., inadequate memory, representational capacity, computation, training data, etc.). Thus, in addressing the reward-design problem one may want to consider transformations of the task-specifying reward function that change the optimal policy, because this could result in the bounded agent achieving a policy more desirable (to the agent designer) than otherwise. This is often done in the form of shaping reward functions that are less sparse than the original reward function and so lead to faster learning of a good policy, even if in principle this changes what the theoretically optimal policy might be [Rajeswaran et al., 2017]. 
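The potential-based shaping mentioned above can be made concrete with a small sketch (illustrative, not from the paper): the shaping term F telescopes along a trajectory, which is why it shifts every policy's return by the same amount and preserves the optimal policy. The potential function Phi here is an arbitrary made-up choice for demonstration.

```python
def shaping_bonus(phi, s, s_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi(s_next) - phi(s)

def shaped_return(phi, states, rewards, gamma):
    """Discounted return of r_t + F(s_t, s_{t+1}) along one trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        f = shaping_bonus(phi, states[t], states[t + 1], gamma)
        g += gamma ** t * (r + f)
    return g
```

Because the F terms telescope, the shaped return equals the original return plus gamma^T * Phi(s_T) - Phi(s_0), a quantity that does not depend on the actions taken.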
Another example of transforming the reward function to aid learning in RL agents is the use of exploration bonuses, e.g., count-based reward bonuses that encourage the agent to experience infrequently visited states [Bellemare et al., 2016, Ostrovski et al., 2017, Tang et al., 2017].
The above challenges make reward-design difficult, error-prone, and typically an iterative process. Reward functions that seem to capture the designer's objective can sometimes lead to unexpected and undesired behaviors. Phenomena such as reward-hacking [Amodei et al., 2016] illustrate this vividly. There are many formulations of, and resulting approaches to, the problem of reward-design, including preference elicitation, inverse RL, intrinsically motivated RL, optimal rewards, potential-based shaping rewards, more general reward shaping, and mechanism design; often the details of the formulation depend on the class of RL domains being addressed. In this paper we build on the optimal rewards problem formulation of Singh et al. [2010]. We discuss the optimal rewards framework as well as some other approaches for learning intrinsic rewards in Section 2.
Our main contribution in this paper is the derivation of a new stochastic-gradient-based method for learning parametric intrinsic rewards that, when added to the task-specifying (hereafter extrinsic) rewards, can improve the performance of policy-gradient based learning methods for solving RL problems. The policy gradient updates the policy parameters to optimize the sum of the extrinsic and intrinsic rewards, while simultaneously our method updates the intrinsic reward parameters to optimize the extrinsic rewards achieved by the policy. 
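As an aside, the count-based exploration bonuses cited above can be sketched roughly in the spirit of SimHash [Tang et al., 2017]: states are hashed with random signed projections and the bonus decays with the visit count of the hash bucket. This is our illustrative reconstruction, not a reference implementation; the class name and constants (k = 16 bits, beta = 0.01) are made up.

```python
import numpy as np

class SimHashBonus:
    """Toy count-based exploration bonus using random-projection hashing."""
    def __init__(self, obs_dim, k=16, beta=0.01, seed=0):
        self.A = np.random.default_rng(seed).normal(size=(k, obs_dim))
        self.beta = beta
        self.counts = {}

    def __call__(self, obs):
        key = tuple((self.A @ obs > 0).astype(int))    # k-bit hash of the state
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.beta / np.sqrt(self.counts[key])   # bonus shrinks with count
```

Similar observations collide into the same bucket, so the bonus acts as an approximate novelty signal even in high-dimensional observation spaces.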
We evaluate our method on several Atari games with a state-of-the-art A2C (Advantage Actor-Critic) [Mnih et al., 2016] agent as well as on a few Mujoco domains with a similarly state-of-the-art PPO agent, and show that learning intrinsic rewards can outperform using just the extrinsic reward as well as using a combination of the extrinsic reward and a constant "live bonus" [Duan et al., 2016a]. On Atari games, we also compared our method with a count-based method, pixel-SimHash [Tang et al., 2017]; our method showed better performance.

2 Background and Related Work

Optimal rewards and reward design. Our work builds on the Optimal Reward Framework [Singh et al., 2010]. Formally, the optimal intrinsic reward for a specific combination of RL agent and environment is defined as the reward that, when used by the agent for its learning in its environment, maximizes the extrinsic reward. The main intuition is that in practice all RL agents are bounded (computationally, representationally, in terms of data availability, etc.) and the optimal intrinsic reward can help mitigate these bounds. Computing the optimal reward remains a big challenge, of course. The paper introducing the framework used exhaustive search over a space of intrinsic reward functions and thus does not scale. Sorg et al. [2010] introduced PGRD (Policy Gradient for Reward Design), a scalable algorithm that only works with lookahead-search (e.g., UCT) based planning agents (and hence the agent itself is not a learning-based agent; only the reward to use with the fixed planner is learned). Its insight was that the intrinsic reward can be treated as a parameter that influences the outcome of the planning process and thus can be trained via gradient ascent as long as the planning process is differentiable (which UCT and related algorithms are). Guo et al. 
[2016] extended the scalability of PGRD to high-dimensional image inputs in Atari 2600 games and used the intrinsic reward as a reward bonus to improve the performance of the Monte Carlo Tree Search algorithm, using the Atari emulator as a model for the planning. A big open challenge is deriving a sound algorithm for learning intrinsic rewards for learning-based RL agents and showing that it can learn intrinsic rewards fast enough to beneficially influence the online performance of the learning-based RL agent. Our main contribution in this paper is to answer this challenge.

Reward shaping and Auxiliary rewards. Reward shaping [Ng et al., 1999] provides a general answer to the question of which reward function modifications do not change the optimal policy: specifically, potential-based rewards. Other attempts have been made to design auxiliary rewards with desired properties. For example, the UNREAL agent [Jaderberg et al., 2016] used pseudo-rewards computed from unsupervised auxiliary tasks to refine its internal representations. In Bellemare et al. [2016], Ostrovski et al. [2017], and Tang et al. [2017], a pseudo-count based reward bonus was given to the agent to encourage exploration. Pathak et al. [2017] used self-supervised prediction errors as intrinsic rewards to help the agent explore. In these and other similar examples [Schmidhuber, 2010, Stadie et al., 2015, Oudeyer and Kaplan, 2009], the agent's learning performance improves through the reward transformations, but the reward transformations are expert-designed and not learned. The main departure point in this paper is that we learn the parameters of an intrinsic reward function that maps high-dimensional observations and actions to rewards.

Hierarchical RL. Another approach to a form of intrinsic reward is in the work on hierarchical RL. 
For example, FeUdal Networks (FuNs) [Vezhnevets et al., 2017] is a hierarchical architecture\nwith a Manager and a Worker learning at different time scales. The Manager conveys abstract goals\nto the Worker and the Worker optimizes its policy to maximize the extrinsic reward and the cosine\ndistance to the goal. The Manager optimizes its proposed goals to guide the Worker to learn a better\npolicy in terms of the cumulative extrinsic reward. A large body of work on hierarchical RL also\ngenerally involves a higher level module choosing goals for lower level modules. All of this work\ncan be viewed as a special case of creating intrinsic rewards within a multi-module agent architecture.\nOne special aspect of hierarchical-RL work is that these intrinsic rewards are usually associated with\ngoals of achievement, i.e., achieving a speci\ufb01c goal state while in our setting the intrinsic reward\nfunctions are general mappings from observation-action pairs to rewards. Another special aspect is\nthat most evaluations of hierarchical RL show a bene\ufb01t in the transfer setting with typically worse\nperformance on early tasks while the manager is learning and better performance on later tasks once\nthe manager has learned. In our setting we take on the challenge of showing that learning and using\nintrinsic rewards can help the RL agent perform better while it is learning on a single task. Finally,\nanother difference is that hierarchical RL typically treats the lower-level learner as a black box while\nwe train the intrinsic reward using gradients through the policy module in our architecture.\n\nMeta Learning for RL. 
Our work can be viewed as an instance of meta learning [Andrychowicz et al., 2016, Santoro et al., 2016, Nichol and Schulman, 2018] in the sense that the intrinsic reward function module acts as a meta-learner that learns to improve the agent's objective (i.e., the mixture of extrinsic and intrinsic reward) by taking into account how each gradient step of the agent affects the true objective (i.e., extrinsic reward) through the meta-gradient. However, a key distinction from the prior work on meta learning for RL [Finn et al., 2017, Duan et al., 2017, Wang et al., 2016, Duan et al., 2016b] is that our method aims to meta-learn intrinsic rewards within a single task, whereas much of the prior work is designed to quickly adapt to new tasks in a few-shot learning scenario. Xu et al. [2018] concurrently proposed a similar idea that learns to find meta-parameters (e.g., the discount factor) such that the agent can learn more efficiently within a single task. In contrast to the state-independent meta-parameters in [Xu et al., 2018], we propose a richer form of state-dependent meta-learner (i.e., intrinsic rewards) that directly changes the reward function of the agent, and which can potentially be extended to hierarchical RL.

3 Gradient-Based Learning of Intrinsic Rewards: A Derivation

As noted earlier, the most practical previous work in learning intrinsic rewards using the Optimal Rewards framework was limited to settings where the underlying RL agent was a planning agent (i.e., one that needs a model of the environment) that uses lookahead search in some form (e.g., UCT). In these settings the only quantity being learned was the intrinsic reward function. 
By contrast, in this section we derive our algorithm for learning intrinsic rewards for the setting where the underlying RL agent is itself a learning agent, specifically a policy gradient based learning agent.

3.1 Policy Gradient based RL

Here we briefly describe how policy gradient based RL works, and then present our method that incorporates it. We assume an episodic, discrete-actions RL setting. Within an episode, the state of the environment at time step t is denoted by s_t ∈ S, the action the agent takes from action space A at time step t by a_t, and the reward at time step t by r_t. The agent's policy, parameterized by θ (for example the weights of a neural network), maps a representation of states to a probability distribution over actions. The value of a policy π_θ, denoted J(π_θ) or equivalently J(θ), is the expected discounted sum of rewards obtained by the agent when executing actions according to policy π_θ, i.e.,

J(θ) = E_{s_t ∼ T(·|s_{t−1}, a_{t−1}), a_t ∼ π_θ(·|s_t)} [ Σ_{t=0}^∞ γ^t r_t ],   (1)

where T denotes the transition dynamics, and the initial state s_0 ∼ μ is chosen from some distribution μ over states. Henceforth, for ease of notation we will write the above quantity as J(θ) = E_θ[ Σ_{t=0}^∞ γ^t r_t ].

The policy gradient theorem of Sutton et al. [2000] shows that the gradient of the value J with respect to the policy parameters θ can be computed from all time steps t within an episode as

∇_θ J(θ) = E_θ[ G(s_t, a_t) ∇_θ log π_θ(a_t|s_t) ],   (2)

where G(s_t, a_t) = Σ_{i=t}^∞ γ^{i−t} r_i is the return until termination. Note that recent advances such as advantage actor-critic (A2C) learn a critic (V_θ(s_t)) and use it to reduce the variance of the gradient and bootstrap the value after every n steps. However, we present this simple policy gradient formulation (Eq. 2) in order to simplify the derivation of our proposed algorithm and aid understanding.

3.2 LIRPG: Learning Intrinsic Rewards for Policy Gradient

Notation. We use the following notation throughout.
• θ: policy parameters
• η: intrinsic reward parameters
• r^ex: extrinsic reward from the environment
• r^in_η = r^in_η(s, a): intrinsic reward estimated by η
• G^ex(s_t, a_t) = Σ_{i=t}^∞ γ^{i−t} r^ex_i
• G^in(s_t, a_t) = Σ_{i=t}^∞ γ^{i−t} r^in_η(s_i, a_i)
• G^{ex+in}(s_t, a_t) = Σ_{i=t}^∞ γ^{i−t} (r^ex_i + λ r^in_η(s_i, a_i))
• J^ex = E_θ[ Σ_{t=0}^∞ γ^t r^ex_t ]
• J^in = E_θ[ Σ_{t=0}^∞ γ^t r^in_η(s_t, a_t) ]
• J^{ex+in} = E_θ[ Σ_{t=0}^∞ γ^t (r^ex_t + λ r^in_η(s_t, a_t)) ]
• λ: relative weight of the intrinsic reward.

Figure 1: Inside the agent are two modules, a policy function parameterized by θ and an intrinsic reward function parameterized by η. In our experiments the policy function (A2C / PPO) has an associated value function, as does the intrinsic reward function (see supplementary materials for details). As shown by the dashed lines, the policy module is trained to optimize the weighted sum of intrinsic and extrinsic rewards while the intrinsic reward module is trained to optimize just the extrinsic rewards.

The departure point of our approach to reward optimization for policy gradient is to distinguish between the extrinsic reward, r^ex, that defines the task, and a separate intrinsic reward r^in that additively transforms the extrinsic reward and influences learning via policy gradients. 
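The returns G^ex, G^in, and G^{ex+in} in the notation above are all discounted sums of the same form and can be computed with one backward pass over an episode; a minimal illustrative sketch:

```python
def discounted_returns(rewards, gamma):
    """G_t = sum_{i >= t} gamma^(i - t) * r_i, computed backwards in O(T)."""
    g = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g   # G_t = r_t + gamma * G_{t+1}
        out[t] = g
    return out
```

G^{ex+in} is obtained by passing the combined per-step rewards r^ex_t + λ r^in_η(s_t, a_t) in place of `rewards`.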
It is crucial to note that the ultimate\nmeasure of performance we care about improving is\nthe value of the extrinsic rewards achieved by the agent; the intrinsic rewards serve only to in\ufb02uence\nthe change in policy parameters. Figure 1 shows an abstract representation of our intrinsic reward\naugmented policy gradient based learning agent.\n\nAlgorithm Overview. An overview of our algorithm, LIRPG, is presented in Algorithm 1. At\neach iteration of LIRPG, we simultaneously update the policy parameters \u03b8 and the intrinsic reward\nparameters \u03b7. More speci\ufb01cally, we \ufb01rst update \u03b8 in the direction of the gradient of J ex+in which is\nthe weighted sum of intrinsic and extrinsic rewards. After updating policy parameters, we update \u03b7\nin the direction of the gradient of J ex which is just the extrinsic rewards. Intuitively, the policy is\nupdated to maximize the sum of extrinsic and intrinsic rewards, while the intrinsic reward function is\nupdated to maximize only the extrinsic reward. 
We describe more details of each step below.

Algorithm 1 LIRPG: Learning Intrinsic Reward for Policy Gradient
1: Input: step-size parameters α and β
2: Init: initialize θ and η with random values
3: repeat
4:   Sample a trajectory D = {s_0, a_0, s_1, a_1, ...} by interacting with the environment using π_θ
5:   Approximate ∇_θ J^{ex+in}(θ; D) by Equation 4
6:   Update θ' ← θ + α ∇_θ J^{ex+in}(θ; D)
7:   Approximate ∇_{θ'} J^ex(θ'; D) on D by Equation 11
8:   Approximate ∇_η θ' by Equation 10
9:   Compute ∇_η J^ex = ∇_{θ'} J^ex(θ'; D) ∇_η θ'
10:  Update η' ← η + β ∇_η J^ex
11: until done

Updating Policy Parameters (θ). Given an episode where the behavior is generated according to policy π_θ, we update the policy parameters using the regular policy gradient with the sum of intrinsic and extrinsic rewards as the reward:

θ' = θ + α ∇_θ J^{ex+in}(θ)   (3)
   ≈ θ + α G^{ex+in}(s_t, a_t) ∇_θ log π_θ(a_t|s_t),   (4)

where Equation 4 is a stochastic gradient update.

Updating Intrinsic Reward Parameters (η). Given an episode and the updated policy parameters θ', we update the intrinsic reward parameters. Intuitively, updating η requires estimating the effect such a change would have on the extrinsic value through the change in the policy parameters. Our key idea is to use the chain rule to compute the gradient as follows:

∇_η J^ex = ∇_{θ'} J^ex ∇_η θ',   (5)

where the first term (∇_{θ'} J^ex) can be sampled as

∇_{θ'} J^ex ≈ G^ex(s_t, a_t) ∇_{θ'} log π_{θ'}(a_t|s_t),   (6)

an approximate stochastic gradient of the extrinsic value with respect to the updated policy parameters θ' when the behavior is generated by π_{θ'}, and the second term can be computed as follows:

∇_η θ' = ∇_η ( θ + α G^{ex+in}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) )   (7)
       = ∇_η ( α G^{ex+in}(s_t, a_t) ∇_θ log π_θ(a_t|s_t) )   (8)
       = ∇_η ( α λ G^in(s_t, a_t) ∇_θ log π_θ(a_t|s_t) )   (9)
       = α λ Σ_{i=t}^∞ γ^{i−t} ∇_η r^in_η(s_i, a_i) ∇_θ log π_θ(a_t|s_t).   (10)

Note that to compute the gradient of the extrinsic value J^ex with respect to the intrinsic reward parameters η, we would need a new episode generated with the updated policy parameters θ' (cf. Equation 6), thus requiring two episodes per iteration. To improve data efficiency we instead reuse the episode generated by the policy parameters θ at the start of the iteration and correct for the resulting mismatch by replacing the on-policy update in Equation 6 with the following off-policy update using importance sampling:

∇_{θ'} J^ex = G^ex(s_t, a_t) ∇_{θ'} π_{θ'}(a_t|s_t) / π_θ(a_t|s_t).   (11)

The parameters η are updated using the product of Equations 10 and 11 with a step-size parameter β; this approximates a stochastic gradient update (cf. Equation 5).

Implementation on A2C and PPO. We described LIRPG using the most basic policy gradient formulation for simplicity. 
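To make the updates above concrete, the following self-contained toy (our construction, not the authors' code) runs the Algorithm 1 loop on a two-armed bandit with one-step episodes, a tabular softmax policy with logits θ, and intrinsic reward r_in(a) = tanh(η_a), so the gradients in Equations 4, 10, and 11 can be written out by hand. All constants are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters (logits)
eta = np.zeros(2)              # intrinsic reward parameters
alpha, beta, lam = 0.5, 0.5, 0.1

for _ in range(2000):
    # Sample a one-step "trajectory" with pi_theta; action 0 pays extrinsic reward.
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r_ex = 1.0 if a == 0 else 0.0
    r_in = np.tanh(eta[a])

    # Policy update on the combined reward (Eqs. 3-4).
    grad_logpi = np.eye(2)[a] - p          # d log pi(a|.) / d theta for softmax
    theta_new = theta + alpha * (r_ex + lam * r_in) * grad_logpi

    # d theta' / d eta (Eq. 10): only eta[a] influences r_in here.
    dr_in = np.zeros(2)
    dr_in[a] = 1.0 - np.tanh(eta[a]) ** 2
    dtheta_deta = alpha * lam * np.outer(dr_in, grad_logpi)

    # Off-policy gradient of J_ex w.r.t. theta' (Eq. 11), reusing the sample.
    p_new = softmax(theta_new)
    dpi_new = p_new[a] * (np.eye(2)[a] - p_new)   # d pi_theta'(a) / d theta'
    grad_theta_new = r_ex * dpi_new / p[a]

    # Chain rule (Eq. 5): update eta, then adopt theta'.
    eta = eta + beta * dtheta_deta @ grad_theta_new
    theta = theta_new
```

After training, the policy concentrates on the extrinsically rewarded arm, and η has learned a positive intrinsic reward for that arm, reinforcing the extrinsic signal.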
There have been many advances in policy gradient methods that reduce the variance of the gradient and improve data efficiency. Our LIRPG algorithm is also compatible with such actor-critic architectures. Specifically, for our experiments on Atari games we used a reasonably state-of-the-art advantage actor-critic (A2C) architecture, and for our experiments on Mujoco domains we used a similarly state-of-the-art proximal policy optimization (PPO) architecture. We provide all implementation details in the supplementary material [1].

4 Experiments on Atari Games

Our overall objective in the following first set of experiments is to evaluate whether augmenting a policy gradient based RL agent with intrinsic rewards learned using our LIRPG algorithm (henceforth, augmented agent for short) improves performance relative to the baseline policy gradient based RL agent that uses just the extrinsic reward (henceforth, A2C baseline agent for short). To this end, we first perform this evaluation on multiple Atari games from the Arcade Learning Environment (ALE) platform [Bellemare et al., 2013], using the same open-source implementation with exactly the same hyper-parameters of the A2C algorithm [Mnih et al., 2016] from OpenAI [Dhariwal et al., 2017] for both our augmented agent and the baseline agent. The extrinsic reward used is the change in game score, as is standard in work on Atari games. The LIRPG algorithm has two additional parameters relative to the baseline algorithm: the parameter λ that controls how the intrinsic reward is scaled before adding it to the extrinsic reward, and the step-size β; we describe how we choose these parameters below in our results.
We also conducted experiments against two other baselines. The first baseline simply added a constant positive value as a live bonus to the agent's reward at each time step (henceforth, A2C-live-bonus baseline agent for short). 
The live bonus heuristic encourages the agent to live longer so that it will potentially have a better chance of getting extrinsic rewards. The second baseline augmented the agent with a count-based bonus generated by the pixel-SimHash algorithm [Tang et al., 2017] (henceforth, A2C-pixel-SimHash baseline agent for short).
Note that the policy module inside the agent is really two networks: a policy network and a value function network (which helps estimate G^{ex+in} as required in Equation 4). Similarly, the intrinsic reward module in the agent is also two networks: a reward function network and a value function network (which helps estimate G^ex as required in Equation 6).

4.1 Implementation Details

The intrinsic reward module has two neural network architectures very similar to those of the policy module described above. It has a "policy" network that, instead of a softmax over actions, produces a scalar reward for every action through a tanh nonlinearity to keep the scalar output in [−1, 1]; we will refer to it as the intrinsic reward network. It also has a value network that estimates G^ex; this has the same architecture as the intrinsic reward network except for the output layer, which has a single scalar output without a non-linear activation. These two networks share the parameters of the first four layers with each other. We keep the default values of all hyper-parameters in the original OpenAI implementation of the A2C-based policy module unchanged for both the augmented and baseline agents. We use RMSProp to optimize the two networks of the intrinsic reward module. Recall that there are two parameters special to LIRPG. Of these, the step size β was initialized to 0.0007 and annealed linearly to zero over 50 million time steps for all the experiments reported below. 
We did a small hyper-parameter search for λ for each game (described below).

4.2 Overall Performance

Figure 2 shows the improvements of the augmented agents over baseline agents on 15 Atari games: Alien, Amidar, Asterix, Atlantis, BeamRider, Breakout, DemonAttack, DoubleDunk, MsPacman, Qbert, Riverraid, RoadRunner, SpaceInvaders, Tennis, and UpNDown. We picked as many games as our computational resources allowed in which the published performance of the underlying A2C baseline agents was good but where the learning was not so fast in terms of sample complexity as to leave little room for improvement. We ran each agent for 5 separate runs of 50 million time steps on each game, for both the baseline agents and the augmented agents. For the augmented agents, we explored the following values for the intrinsic reward weighting coefficient λ, {0.003, 0.005, 0.01, 0.02, 0.03, 0.05}, and the following values for the term ξ, {0.001, 0.01, 0.1, 1}, which weights the loss from the value function estimates against the loss from the intrinsic reward function (the policy component of the intrinsic reward module). We plotted the best results from the hyper-parameter search in Figure 2. For the A2C-live-bonus baseline agents, we explored the value of the live bonus over the set {0.001, 0.01, 0.1, 1} on two games, Amidar and MsPacman, and chose the best performing value of 0.01 for all 15 games. 

Figure 2: (a) Improvements of LIRPG augmented agents over A2C baseline agents. (b) Improvements of LIRPG augmented agents over live-bonus augmented A2C baseline agents. (c) Improvements of LIRPG augmented agents over pixel-SimHash augmented A2C baseline agents. In all figures, the columns correspond to different games labeled on the x-axes and the y-axes show human score normalized improvements.

[1] Our implementation is available at: https://github.com/Hwhitetooth/lirpg
For the A2C-pixel-SimHash baseline agents, we adopted\nall hyper-parameters from [Tang et al., 2017]. The learning curves of all agents are provided in the\nsupplementary material.\nThe blue bars in Figure 2 show the human score normalized improvements of the augmented agents\nover the A2C baseline agents, the A2C-live-bonus baseline agents, and the A2C-pixel-SimHash\nbaseline agents. We see that the augmented agent outperforms the A2C baseline agent on all 15\ngames and has an improvement of more than ten percent on 9 out of 15 games. As for the comparison\nto the A2C-live-bonus baseline agent, the augmented agent still performed better on all games except\nfor SpaceInvaders and Asterix. Note that most Atari games are shooting games so the A2C-live-bonus\nbaseline agent is expected to be a stronger baseline. The augmented agent outperformed or was\ncomparable to the A2C-pixel-SimHash baseline agent on all 15 games.\n\n4.3 Analysis of the Learned Intrinsic Reward\n\nAn interesting question is whether the learned intrinsic reward function learns a general state-\nindependent bias over actions or whether it is an interesting function of state. To explore this question\nwe used the learned intrinsic reward module and the policy module from the end of a good run (cf.\nFigure 2) for each game with no further learning to collect new data for each game. Figure 3 shows\nthe variation in intrinsic rewards obtained and the actions selected by the agent over 100 thousand\nsteps, i.e., 400 thousand frames, on 5 games. The analysis for all 15 games is in the supplementary\nmaterial. The red bars show the average intrinsic reward per-step for each action. The black segments\nshow the standard deviation of the intrinsic rewards. The blue bars show the frequency of each action\nbeing selected. 
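The per-action statistics described above can be recovered from logged (action, intrinsic reward) pairs collected from a frozen agent; a hypothetical helper (`per_action_stats` is our name and construction, not from the paper):

```python
import numpy as np

def per_action_stats(actions, rewards, n_actions):
    """Per-action mean/std of intrinsic reward and selection frequency."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    stats = []
    for a in range(n_actions):
        mask = actions == a
        freq = mask.mean()                              # selection frequency
        mean = rewards[mask].mean() if mask.any() else 0.0
        std = rewards[mask].std() if mask.any() else 0.0
        stats.append((mean, std, freq))
    return stats
```

The (mean, std) pairs correspond to the red bars and black segments, and the frequencies to the blue bars, in a plot like Figure 3.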
Figure 3 shows that the intrinsic rewards for most actions vary through the episode, as shown by the large black segments, indirectly confirming that the intrinsic reward module learns more than a state-independent constant bias over actions. By comparing the red bars and the blue bars, we see the expected correlation between aggregate intrinsic reward over actions and their selection (through the policy module that trains on the weighted sum of extrinsic and intrinsic rewards).

Figure 3: Intrinsic reward variation and frequency of action selection. For each game/plot the x-axis shows the index of the actions that are available in that game. The red bars show the means and standard deviations of the intrinsic rewards associated with each action. The blue bars show the frequency of each action being selected.

5 Mujoco Experiments

Our main objective in the following experiments is to demonstrate that our LIRPG-based algorithm can extend to a different class of domains and a different choice of baseline actor-critic architecture (namely, PPO instead of A2C). Specifically, we explore domains from the Mujoco continuous control benchmark [Duan et al., 2016a], and used the open-source implementation of the PPO [Schulman et al., 2017] algorithm from OpenAI [Dhariwal et al., 2017] as our baseline agent. 
We also compared LIRPG to the simple heuristic of giving a live bonus as intrinsic reward (PPO-live-bonus baseline agents for short). As for the Atari game results above, we kept all hyper-parameters for the policy module at their default values for both baseline and augmented agents. Finally, we also conducted a preliminary exploration into the question of how robust the learning of intrinsic rewards is to the sparsity of extrinsic rewards. Specifically, we used delayed versions of the Mujoco domains, where the extrinsic reward is made sparse by accumulating the reward for N = 10, 20, 40 time steps before providing it to the agent. Note that the live bonus is not delayed when we delay the extrinsic reward for the PPO-live-bonus baseline agent. We expect that the problem becomes more challenging with increasing N, but that the learning of intrinsic rewards (which are available at every time step) can mitigate some of that increased difficulty.

Delayed Mujoco benchmark. We evaluated 5 environments from the Mujoco benchmark, i.e., Hopper, HalfCheetah, Walker2d, Ant, and Humanoid. As noted above, to create a more challenging sparse-reward setting we accumulated rewards for 10, 20 and 40 steps (or until the end of the episode, whichever comes first) before giving them to the agent. We trained the baseline and augmented agents for 1 million steps on each environment.

5.1 Implementation Details

The intrinsic reward function networks are quite similar to the two networks in the policy module. Each network is a multi-layer perceptron (MLP) with 2 hidden layers. We concatenated the observation vector and the action vector as the input to the intrinsic reward network. The first two layers are fully connected layers with 64 hidden units. Each hidden layer is followed by a tanh non-linearity. The output layer has one scalar output. We apply tanh on the output to bound the intrinsic reward to [−1, 1].
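The intrinsic reward network just described can be sketched as a plain NumPy forward pass. This is a hedged illustration only, since the paper does not publish code: the weight initialization scheme and the function names are assumptions, while the shape (concatenated observation-action input, two 64-unit tanh hidden layers, tanh-bounded scalar output) follows the text.

```python
import numpy as np

def init_intrinsic_reward_net(obs_dim, act_dim, hidden=64, seed=0):
    """Random weights for a 2-hidden-layer MLP; the Gaussian
    initialization here is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    dims = [obs_dim + act_dim, hidden, hidden, 1]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def intrinsic_reward(params, obs, action):
    """Forward pass: tanh after every layer, including the scalar
    output, which bounds the intrinsic reward to [-1, 1]."""
    x = np.concatenate([obs, action])
    for W, b in params:
        x = np.tanh(x @ W + b)
    return float(x[0])
```

The tanh on the final scalar is what enforces the [−1, 1] bound on the learned reward.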
The value network used to estimate Gex has the same architecture as the intrinsic reward network, except that its single scalar output has no non-linear activation. These two networks do not share any parameters. We keep the default values of all hyper-parameters in the original OpenAI implementation of PPO unchanged for both the augmented and baseline agents. We use Adam [Kingma and Ba, 2014] to optimize the two networks of the intrinsic reward module. The step size β was initialized to 0.0001 and was fixed over 1 million time steps for all the experiments reported below. The mixing coefficient λ was fixed to 1.0 and instead we multiplied the extrinsic reward by 0.01 across all 5 environments.

Figure 4: The x-axis is time steps during learning. The y-axis is the average reward over the last 100 training episodes. The black curves are for the baseline PPO architecture. The blue curves are for the PPO-live-bonus baseline. The red curves are for our LIRPG based augmented architecture. The green curves are for our LIRPG architecture in which the policy module was trained with only intrinsic rewards. The dark curves are the average of 10 runs with different random seeds. The shaded area shows the standard errors of 10 runs.

5.2 Overall Performance

Our results comparing the use of learned intrinsic rewards with using just extrinsic rewards on top of a PPO architecture are shown in Figure 4. We only show the results for a delay of 20 here; the full results can be found in the supplementary material.
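The delayed-reward setting used in these experiments can be sketched as a simple environment wrapper. This is a hedged sketch against a hypothetical Gym-style `step` interface (the authors' actual wrapper is not published): extrinsic rewards are accumulated for `delay` steps, or until the episode ends, before being released to the agent.

```python
class DelayedRewardWrapper:
    """Accumulate extrinsic rewards for `delay` steps (or until the
    episode ends) before releasing them, making the reward sparser."""

    def __init__(self, env, delay=20):
        self.env = env
        self.delay = delay
        self._acc = 0.0    # rewards accumulated since last release
        self._count = 0    # steps since last release

    def reset(self):
        self._acc, self._count = 0.0, 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._acc += reward
        self._count += 1
        if done or self._count >= self.delay:
            released, self._acc, self._count = self._acc, 0.0, 0
        else:
            released = 0.0
        return obs, released, done, info
```

With `delay=20` the agent sees a non-zero extrinsic reward at most once every 20 steps, while a learned intrinsic reward (or a live bonus) remains available at every step.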
The black curves are for the PPO baseline agents. The blue curves are for the PPO-live-bonus baseline agents, where we explored the value of the live bonus over the set {0.001, 0.01, 0.1, 1} and plotted the curves for the domain-specific best-performing choice. The red curves are for the augmented LIRPG agents.

We see that in 4 out of 5 domains learning intrinsic rewards significantly improves the performance of PPO, while in one domain (Ant) performance degraded. Although the live bonus did help on 2 domains, Hopper and Walker2d, LIRPG still outperformed it on 4 out of 5 domains; on the remaining domain, HalfCheetah, LIRPG achieved comparable performance. We note that there was no domain-specific hyper-parameter optimization for the results in this figure; with such optimization there might be an opportunity to get improved performance for our method in all the domains.

Training with Only Intrinsic Rewards. We also conducted a more challenging experiment on the Mujoco domains in which we used only intrinsic rewards to train the policy module. Recall that the intrinsic reward module is trained to optimize the extrinsic reward. In 3 out of 5 domains, as shown by the green curves denoted 'PPO-LIRPG(Rin)' in Figure 4, using only intrinsic rewards achieved similar performance to the red curves, where we used a mixture of extrinsic and intrinsic rewards. Using only intrinsic rewards to train the policy performed worse than using the mixture on Hopper but performed better on HalfCheetah. It is important to note that training the policy using only a live-bonus reward without the extrinsic reward would completely fail, because there would be no learning signal that encourages the agent to move forward.
In contrast, our result shows that the agent can learn complex behaviors solely from the learned intrinsic reward on the MuJoCo environments, and thus the intrinsic reward captures far more than a live bonus does; this is because the intrinsic reward module takes the extrinsic reward structure into account through its training.

6 Conclusion

Our experiments on using LIRPG with A2C on multiple Atari games showed that it helped improve learning performance in all of the 15 games we tried. Similarly, using LIRPG with PPO on multiple Mujoco domains showed that it helped improve learning performance in 4 out of 5 domains (for the version with a delay of 20). Note that we used the same A2C / PPO architecture and hyper-parameters in both our augmented and baseline agents. In summary, we derived a novel practical algorithm, LIRPG, for learning intrinsic reward functions in problems with high-dimensional observations for use with policy-gradient based RL agents. This is the first such algorithm to the best of our knowledge. Our empirical results show promise in using intrinsic reward function learning as a kind of meta-learning to improve the performance of modern policy gradient architectures like A2C and PPO.

Acknowledgments

We thank Richard Lewis for conversations on optimal rewards. This work was supported by NSF grant IIS-1526059, by a grant from Toyota Research Institute (TRI), and by a grant from DARPA's L2M program.
Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.

References

Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.

Andrew Y Ng, Daishi Harada, and Stuart J Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 278–287. Morgan Kaufmann Publishers Inc., 1999.

Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pages 6553–6564, 2017.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.

Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In International Conference on Machine Learning, pages 2721–2730, 2017.

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2750–2759, 2017.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu.
Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016a.

Jonathan Sorg, Richard L Lewis, and Satinder Singh. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, pages 2190–2198, 2010.

Xiaoxiao Guo, Satinder Singh, Richard Lewis, and Honglak Lee. Deep learning for reward design to improve Monte Carlo tree search in Atari games. arXiv preprint arXiv:1604.07095, 2016.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning.
In International Conference on Machine Learning, pages 3540–3549, 2017.

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

Alex Nichol and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.

Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, pages 1087–1098, 2017.

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016b.

Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. arXiv preprint arXiv:1805.09801, 2018.

Richard S Sutton, David A McAllester, Satinder Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.