{"title": "A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5691, "page_last": 5700, "abstract": "In this paper we consider the problem of how a reinforcement learning agent that is tasked with solving a sequence of reinforcement learning problems (a sequence of Markov decision processes) can use knowledge acquired early in its lifetime to improve its ability to solve new problems. We argue that previous experience with similar problems can provide an agent with information about how it should explore when facing a new but related problem. We show that the search for an optimal exploration strategy can be formulated as a reinforcement learning problem itself and demonstrate that such strategy can leverage patterns found in the structure of related problems. \nWe conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework.", "full_text": "A Meta-MDP Approach to Exploration\nfor Lifelong Reinforcement Learning\n\nFrancisco M. Garcia and Philip S. Thomas\nCollege of Information and Computer Sciences\n\nUniversity of Massachusetts Amherst\n\nAmherst, MA, USA\n\n{fmgarcia,pthomas}@cs.umass.edu\n\nAbstract\n\nIn this paper we consider the problem of how a reinforcement learning agent that\nis tasked with solving a sequence of reinforcement learning problems (a sequence\nof Markov decision processes) can use knowledge acquired early in its lifetime to\nimprove its ability to solve new problems. We argue that previous experience with\nsimilar problems can provide an agent with information about how it should explore\nwhen facing a new but related problem. We show that the search for an optimal\nexploration strategy can be formulated as a reinforcement learning problem itself\nand demonstrate that such strategy can leverage patterns found in the structure\nof related problems. 
We conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework.\n\n1 Introduction\n\nOne hallmark of human intelligence is our ability to leverage knowledge collected over our lifetimes when we face a new problem. When dealing with a new problem related to one we already know how to address, we leverage the experience obtained from solving the former problem. For example, upon buying a new car, we do not re-learn from scratch how to drive; instead, we use the experience we had driving a previous car to quickly adapt to the new controls and dynamics.\nStandard reinforcement learning (RL) methods lack this ability. When faced with a new problem—a new Markov decision process (MDP)—they typically start learning from scratch, initially making uninformed decisions in order to explore and learn about the current problem they face. The problem of creating agents that can leverage previous experiences to solve new problems is called lifelong learning or continual learning, and is related to the problem of transfer learning.\nAlthough the idea of how an agent can learn to learn has been explored for quite some time [14, 15], in this paper we focus on one aspect of lifelong learning: when faced with a sequence of MDPs sampled from a distribution over MDPs, how can a reinforcement learning agent learn an optimal policy for exploration? Specifically, we do not consider the question of when an agent should explore or how much an agent should explore, which is a well-studied area of reinforcement learning research [20, 10, 1, 5, 18]. Instead, we study the question of: given that an agent decides to explore, which action should it take? In this work we formally define the problem of searching for an optimal exploration policy and show that this problem can itself be modeled as an MDP. 
This means that the task of finding an optimal exploration strategy for a learning agent can be solved by another reinforcement learning agent that is solving a new meta-MDP, which operates at a different timescale from the RL agent solving a specific task—one episode of the meta-MDP corresponds to an entire lifetime of the RL agent. This difference of timescales distinguishes our approach from previous meta-MDP methods for optimizing components of reinforcement learning algorithms [21, 9, 22, 8, 3].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe contend that ignoring the experience an agent might have with related MDPs is a missed opportunity for guiding exploration on novel but related problems. One such example is exploration by random action selection (as is common when using Q-learning [23], Sarsa [19], and DQN [11]). To address this limitation, we propose separating the policies that define the agent's behavior into an exploration policy (which is trained across many related MDPs) and an exploitation policy (which is trained for each specific MDP).\nIn this paper we make the following contributions: 1) we formally define the problem of searching for an optimal exploration policy, 2) we prove that this problem can be modeled as a new MDP and describe one algorithm for solving this meta-MDP, and 3) we present experimental results that show the benefits of our approach. Although the search for an optimal exploration policy is only one of the necessary components for lifelong learning (along with deciding when to explore, how to represent data, how to transfer models, etc.), it provides one key step towards agents that leverage prior knowledge to solve challenging problems. 
Code used for this paper can be found at https://github.com/fmaxgarcia/Meta-MDP\n\n2 Related Work\n\nThere is a large body of work discussing the problem of how an agent should behave during exploration when faced with a single MDP. Simple strategies, such as ε-greedy with random action-selection, Boltzmann action-selection, or softmax action-selection, make sense when an agent has no prior knowledge of the problem that it is currently trying to solve. It is well known that the performance of an agent exploring with unguided exploration techniques, such as random action-selection, degrades drastically as the size of the state-space increases [24]; for example, the performance of Boltzmann or softmax action-selection hinges on the accuracy of the action-value estimates. When these estimates are poor (e.g., early during the learning process), it can have a drastic negative effect on the overall learning ability of the agent. Given this limitation of unguided techniques, when there is information available to guide an agent's exploration strategy, it should be exploited.\nThere exist more sophisticated methods for exploration; for example, it is possible to use state-visitation counts to encourage the agent to explore states that have not been frequently visited [20, 10]. Recent research has also shown that adding an exploration “bonus” to the reward function can be an effective way of improving exploration; VIME [6] takes a Bayesian approach by maintaining a model of the dynamics of the environment, obtaining a posterior of the model after taking an action, and using the KL divergence between these two models as a bonus. The intuition behind this approach is that encouraging actions that make large updates to the model allows the agent to better explore areas where the current model is inaccurate. The authors of [12] define a bonus in the reward function by adding an intrinsic reward. 
They propose using a neural network to predict state transitions based on the action taken and provide an intrinsic reward proportional to the prediction error. The agent is therefore encouraged to make state transitions that are not modeled accurately.\nAnother relevant work in exploration was presented in [3], where the authors propose building a library of policies from prior experience to explore the environment in new problems more efficiently.\nThese techniques can be efficient when an agent is dealing with a single MDP; however, when facing a new problem they ignore potentially useful information the agent may have discovered from solving a previous task. That is, they fail to leverage prior experience. We aim to address this limitation by exploiting existing knowledge specifically for exploration. We do so by taking a meta-learning approach, where a meta-agent learns a policy that is used to guide an RL agent whenever it decides to explore, and contrast the performance of our method with Model-Agnostic Meta-Learning (MAML), a state-of-the-art general meta-learning method which was shown to be capable of speeding up learning in RL tasks [4].\n\n3 Background\n\nA Markov decision process (MDP) is a tuple, M = (S, A, P, R, d_0), where S is the set of possible states of the environment, A is the set of possible actions that the agent can take, P(s, a, s′) is the probability that the environment will transition to state s′ ∈ S if the agent takes action a ∈ A in state s ∈ S, R(s, a, s′) is a function denoting the reward received after taking action a in state s and transitioning to state s′, and d_0 is the initial state distribution. We use t ∈ {0, 1, 2, . . . , T} to index the time-step, and write S_t, A_t, and R_t to denote the state, action, and reward at time t. We also consider the undiscounted episodic setting, wherein rewards are not discounted based on the time at which they occur. 
We assume that T, the maximum time step, is finite, and thus we restrict our discussion to episodic MDPs. We use I to denote the total number of episodes the agent interacts with an environment. A policy, π : S × A → [0, 1], provides a conditional distribution over actions given each possible state: π(s, a) = Pr(A_t = a | S_t = s). Furthermore, we assume that for all policies, π, (and all tasks, c ∈ C, defined later) the expected returns are normalized to be in the interval [0, 1].\nOne of the key challenges within RL, and the one this work focuses on, is the exploration-exploitation dilemma. To ensure that an agent is able to find a good policy, it should act with the sole purpose of gathering information about the environment (exploration). However, once enough information is gathered, it should behave according to what it believes to be the best policy (exploitation). In this work, we separate the behavior of an RL agent into two distinct policies: an exploration policy and an exploitation policy. We assume an ε-greedy exploration schedule, i.e., with probability ε_i the agent explores and with probability 1 − ε_i the agent exploits, where (ε_i)_{i=1}^{I} is a sequence of exploration rates with ε_i ∈ [0, 1], and i refers to the episode number in the current task. We note that more sophisticated decisions on when to explore are certainly possible and could exploit our proposed method. Under this exploration strategy the agent forgoes the ability to learn when it should explore, and we assume that the decision as to whether the agent explores or not is random. That being said, ε-greedy is currently widely used (e.g., Sarsa [19], Q-learning [23], DQN [11]) and its popularity makes its study still relevant today.\nLet C be the set of all tasks, c = (S, A, P_c, R_c, d_0^c). 
That is, all c ∈ C are MDPs sharing the same state set S and action set A, which may have different transition functions P_c, reward functions R_c, and initial state distributions d_0^c. An agent is required to solve a set of tasks c ∈ C, where we refer to the set C as the problem class. Given that each task is a separate MDP, the exploitation policy might not directly apply to a novel task. In fact, applying it directly could hinder the agent's ability to learn an appropriate policy. Such scenarios arise, for example, in control problems where the policy learned for one specific agent will not work for another due to differences in the environment dynamics and physical properties. As a concrete example, Intelligent Control Flight Systems (ICFS) is an area of study that was born out of the necessity to address some of the limitations of PID controllers, and where RL has gained significant traction in recent years [26, 27]. One particular scenario where our proposed problem would arise is in using RL to control autonomous vehicles [7], where a single control policy would likely not work for a number of distinct vehicles and each policy would need to be adapted to the specifics of each vehicle.\nIn our framework, the agent has a task-specific policy, π, that is updated by the agent's own learning algorithm. This policy defines the agent's behavior during exploitation, and so we refer to it as the exploitation policy. The behavior of the agent during exploration is determined by an advisor, which maintains a policy, μ, tailored to the problem class (i.e., it is shared across all tasks in C). We refer to this policy as an exploration policy. The agent is given K = IT time-steps of interactions with each of the sampled tasks. 
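The interaction just described, between the exploitation policy π, the advisor's exploration policy μ, and the ε-greedy selection mechanism, can be sketched in a few lines. This is a minimal illustration only, under the assumption that both policies are represented as callables mapping a state to an action; it is not the paper's implementation.

```python
import random

def choose_action(state, agent_policy, advisor_policy, eps_i):
    """epsilon-greedy mixing of the two policies: with probability eps_i take
    the advisor's exploration action U ~ mu, otherwise the agent's
    exploitation action A ~ pi."""
    if random.random() < eps_i:
        return advisor_policy(state)  # exploration action suggested by mu
    return agent_policy(state)        # exploitation action from pi
```

For example, `choose_action(s, pi, mu, 0.0)` always exploits, while `choose_action(s, pi, mu, 1.0)` always follows the advisor.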
Hereafter we use i to denote the index of the current episode on the current task, t to denote the time step within that episode, and k to denote the number of time steps that have passed on the current task, i.e., k = iT + t; we refer to k as the advisor time step. At every time step, k, the advisor suggests an action, U_k, to the agent, where U_k is sampled according to μ. If the agent decides to explore at this step, it takes action U_k; otherwise it takes action A_k sampled according to the agent's policy, π. We refer to an optimal policy for the agent solving a specific task, c ∈ C, as an optimal exploitation policy, π*_c. More formally: π*_c ∈ argmax_π E[G | π, c], where G = Σ_{t=0}^{T} R_t is referred to as the return. Thus, the agent solving a specific task is optimizing the standard expected return objective. From now on we refer to the agent solving a specific task as the agent (even though the advisor can also be viewed as an agent). We consider a process where a task c ∈ C is sampled from some distribution, d_C, over C. While the RL agent learns how to solve a few of these tasks, the advisor also updates its policy to guide the agent during exploration. Whenever the agent decides to explore, it uses an action provided by the advisor according to its policy, μ.\n\n4 Problem Statement\n\nWe define the performance of the advisor's policy, μ, for a specific task c ∈ C to be ρ(μ, c) = E[Σ_{i=0}^{I} Σ_{t=0}^{T} R_t^i | μ, c], where R_t^i is the reward at time step t during the ith episode. Let C be a random variable that denotes a task sampled from d_C. 
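Since ρ(μ, c) is the expected sum of all rewards collected over a lifetime of I episodes, i.e., the expected area under the agent's learning curve, it can be estimated by plain Monte Carlo averaging over sampled lifetimes. The sketch below is illustrative only; the data layout (one list of per-episode reward lists per lifetime) is an assumption.

```python
def lifetime_return(episodes):
    """One realization of rho(mu, c): the sum of all rewards collected over
    a lifetime, i.e., the area under the agent's learning curve."""
    return sum(sum(rewards) for rewards in episodes)

def estimate_objective(lifetimes):
    """Monte Carlo estimate of E[rho(mu, C)] from lifetimes run on tasks C ~ d_C."""
    return sum(lifetime_return(lt) for lt in lifetimes) / len(lifetimes)
```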
The goal of the advisor is to find an optimal exploration policy, μ*, which we define to be any policy that satisfies:\n\nμ* ∈ argmax_μ E[ρ(μ, C)].   (1)\n\nIn intuitive terms, this objective seeks to maximize the area under the learning curve of an agent. Assuming a stable policy π whose performance improves with training (the performance of the policy does not collapse), maximizing this objective implies that the agent is able to learn more quickly. Because no single policy can solve every task, the meta-agent learns to help the agent obtain an optimal policy, but it does not learn a policy to solve any task in particular.\nUnfortunately, we cannot directly optimize this objective because we do not know the transition and reward functions of each MDP, and we can only sample tasks from d_C. In the next section we show that the search for an exploration policy can be formulated as an RL problem where the advisor is itself an RL agent solving an MDP whose environment contains both the current task, c, and the agent solving the current task.\n\n5 A General Solution Framework\n\nOur framework can be viewed as a meta-MDP—an MDP within an MDP. From the point of view of the agent, the environment is the current task, c (an MDP). However, from the point of view of the advisor, the environment contains both the task, c, and the agent. At every time-step, the advisor selects an action U and the agent an action A. The selected actions go through a selection mechanism which executes action A with probability 1 − ε_i and action U with probability ε_i at episode i.\nIn our formulation, from the point of view of the advisor, action U is always executed and the selection mechanism is simply another source of uncertainty in the environment. Figure 1 depicts the proposed framework with action A (exploitation) being selected. 
Even though one time step for the agent corresponds to one time step for the advisor, one episode for the advisor constitutes a lifetime of the agent. From this perspective, wherein the advisor is merely another reinforcement learning algorithm, we can take advantage of the existing body of work in RL to optimize the exploration policy, μ.\nWe experimented with training the advisor policy using two different RL algorithms: REINFORCE [25] and Proximal Policy Optimization (PPO) [17]. Using Monte Carlo methods, such as REINFORCE, results in a simpler implementation at the expense of a large computation time (each update of the advisor requires training the agent for an entire lifetime). On the other hand, using temporal difference methods, such as PPO, overcomes this computational bottleneck at the expense of larger variance in the performance of the advisor. Pseudocode for the implementations used in our framework using REINFORCE and PPO is shown in Appendix C.\n\nFigure 1: MDP view of the interaction between the advisor and agent. At each time-step, the advisor selects an action U and the agent an action A. With probability ε the agent executes action U and with probability 1 − ε it executes action A. After each action the agent and advisor receive a reward R, and the agent's and advisor's environments transition to states S and X, respectively.\n\n5.1 Theoretical Results\n\nBelow, we formally define the meta-MDP faced by the advisor and show that an optimal policy for the meta-MDP optimizes the objective in (1). Recall that R_c, P_c, and d_0^c denote the reward function, transition function, and initial state distribution of the MDP c ∈ C. To formally describe the meta-MDP, we must capture the property that the agent can implement an arbitrary RL algorithm. 
To do so, we assume the agent maintains some memory, M_k, that is updated by some learning rule l (an RL algorithm) at each time step, and write π_{M_k} to denote the agent's policy given that its memory is M_k. In other words, M_k provides all the information needed to determine π_{M_k}, and its update is of the form M_{k+1} = l(M_k, S_k, A_k, R_k, S_{k+1}) (this update rule can represent popular RL algorithms like Q-learning and actor-critics). We make no assumptions about which learning algorithm the agent uses (e.g., it can use Sarsa, Q-learning, REINFORCE, and even batch methods like Fitted Q-Iteration), and consider the learning rule to be unknown and a source of uncertainty.\nProposition 1. Consider an advisor policy, μ, and episodic tasks c ∈ C belonging to a problem class C. The problem of learning μ can be formulated as an MDP, M_meta = (X, U, T, Y, d′_0), where X is the state space, U the action space, T the transition function, Y the reward function, and d′_0 the initial state distribution.\nProof. See Appendix A.\nGiven the formulated meta-MDP, M_meta, we are able to show that the optimal policy for this new MDP corresponds to an optimal exploration policy.\nTheorem 1. An optimal policy for M_meta is an optimal exploration policy, μ*, as defined in (1). That is, E[ρ(μ, C)] = E[Σ_{k=0}^{K} Y_k | μ].\nProof. See Appendix B.\nSince M_meta is an MDP for which an optimal exploration policy is an optimal policy, it follows that the convergence properties of reinforcement learning algorithms apply to the search for an optimal exploration policy. For example, in some experiments the advisor uses the REINFORCE algorithm [25], the convergence properties of which have been well-studied [13].\nAlthough conceptually simple, the framework presented thus far may require solving a large number of tasks (episodes of the meta-MDP), each one potentially being an expensive procedure. 
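Concretely, the advisor's training proceeds by running one agent lifetime per sampled task (one meta-MDP episode each) and applying a policy update to μ from the collected trajectories. The skeleton below is a schematic sketch only; `sample_task`, `run_lifetime`, and `update_advisor` are placeholders for the task distribution d_C, the inner RL training loop, and the advisor's learner (e.g., REINFORCE or PPO), none of which are specified here.

```python
def train_advisor(sample_task, run_lifetime, update_advisor, n_iterations, n_tasks):
    """Schematic meta-training loop: one advisor episode = one agent lifetime.

    sample_task()         -- draws a task c ~ d_C (placeholder)
    run_lifetime(task)    -- trains a fresh agent on the task and returns the
                             advisor's trajectory (states X, actions U, rewards Y)
    update_advisor(trajs) -- one policy update to mu from the trajectories
    """
    for _ in range(n_iterations):
        tasks = [sample_task() for _ in range(n_tasks)]
        trajectories = [run_lifetime(c) for c in tasks]  # could run in parallel
        update_advisor(trajectories)
```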
To address\nthis issue, we sampled a small number of tasks c1, . . . , cn, where each ci \u21e0 dC and train many\nepisodes on each task in parallel. By taking this approach, every update to the advisor is in\ufb02uenced by\nseveral simultaneous tasks and results in an scalable approach to obtain a general exploration policy.\nIn more dif\ufb01cult tasks, which might require the agent to train a long time, using TD techniques allows\nthe advisor to improve its policy while the agent is still training.\n\n6 Empirical Results\nIn this section we present experiments for discrete and continuous control tasks. Figures 8a and\n8b depicts task variations for Animat for the case of discrete action set. Figures 11a and 11b show\ntask variations for Ant problem for the case of continuous action set. Implementations used for the\ndiscrete case pole-balancing and all continuous control problems, where taken from OpenAI Gym,\nRoboschool benchmarks [2]. For the driving task experiments we used a simulator implemented in\nUnity by Tawn Kramer from the \u201cDonkey Car\u201d community 1. We demonstrate that: 1) in practice the\nmeta-MDP, Mmeta, can be solved using existing reinforcement learning methods, 2) the exploration\npolicy learned by the advisor improves performance on existing RL methods, on average, and 3) the\nexploration policy learned by the advisor differs from the optimal exploitation policy for any task\nc 2C , i.e., the exploration policy learned by the advisor is not necessarily a good exploitation policy.\nIntuitively, our method works well when there is a common pattern across tasks of what actions\nshould not to be taken at a given state. 
For example, in a simple grid-world our method would not be able to learn a good exploration policy, but in the case of Animat (shown in Figures 2a and 2b) the meta-agent is able to learn that certain action patterns never lead to an optimal policy.\n\n(a) Animat task 1. (b) Animat task 2. (c) Ant task 1. (d) Ant task 2.\nFigure 2: Example of task variations. The problem classes correspond to Animat (left) with discrete action space, and Ant (right) with continuous action space.\n\n1 The Unity simulator for the self-driving task can be found at https://github.com/tawnkramer/sdsandbox\n\nTo show that our algorithm behaves as desired, we first study the behavior of our method in two simple problem classes with discrete action-spaces, pole-balancing [19] and Animat [21], and then in a more realistic application of control tuning in self-driving vehicles. As a baseline meta-learning method, to which we contrast our framework, we chose Model-Agnostic Meta-Learning (MAML) [4], a general meta-learning method for adapting previously trained neural networks to novel but related tasks. It is worth noting that, although the method was not specifically designed for RL, the authors describe some promising results in adapting behavior learned from previous tasks to novel ones.\n\n6.1 Empirical Evaluation of Proposed Framework\n\nWe begin our evaluation by assessing the behavior of our algorithm in two different problems with discrete action spaces: pole-balancing and Animat. We chose these problems because they present structural patterns that are intuitive to understand and can be exploited by the agent.\nPole-balancing: the agent is tasked with applying force to a cart to prevent a pole balancing on it from falling. The distinct tasks were constructed by modifying the length and mass of the pole, the mass of the cart, and the force magnitude. 
States are represented by 4-D vectors describing the position and velocity of the cart, and the angle and angular velocity of the pendulum, i.e., s = [x, v, θ, θ̇]. The agent has 2 actions at its disposal: apply a force in the positive or negative x direction. Figure 3a contrasts the cumulative return of an agent using the advisor against random exploration during training over 6 tasks, shown in blue and red respectively. Both policies, π and μ, were trained using REINFORCE: π for I = 1,000 episodes and μ for 500 iterations. In the figure, the horizontal axis corresponds to episodes for the advisor. The horizontal red line denotes an estimate (with standard error bar) of the expected cumulative reward over an agent's lifetime if it samples actions uniformly when exploring. Notice that this is not a function of the training iteration, as the random exploration is not updated. The blue curve (with standard error bars from 15 trials) shows how the expected cumulative reward the agent obtains during its lifetime changes as the advisor improves its policy. After the advisor is trained, the agent obtains roughly 30% more reward during its lifetime than it did when using random exploration. To visualize this difference, Figure 3b shows the mean learning curves (episodes of an agent's lifetime on the horizontal axis and average return for each episode on the vertical axis) during the first and last 50 iterations.\n\nFigure 3: (a) Performance curves during training comparing the advisor policy (blue) and a random exploration policy (red). (b) Average learning curves on training tasks over the first 50 advisor episodes (blue) and the last 50 advisor episodes (orange).\n\nAnimat: in these environments, the agent is a circular creature that lives in a continuous state space. It has 8 independent actuators, angled around it in increments of 45 degrees. 
Each actuator can be either on or off at each time step, so the action set is {0, 1}^8, for a total of 256 actions. When an actuator is on, it produces a small force in the direction that it is pointing. The resulting action moves the agent in the direction that results from the sum of all those forces, perturbed by 0-mean unit-variance Gaussian noise. The agent is tasked with moving to a goal location; it receives a reward of −1 at each time-step and a reward of +100 at the goal state. The different variations of the tasks correspond to randomized start and goal positions in different environments. Figure 4a shows a clear performance improvement on average as the advisor improves its policy over 50 training iterations. The curves show the averages obtained over the first and last 10 iterations of training the advisor, shown in blue and orange respectively. Each individual task was trained for I = 800 episodes.\nAn interesting pattern that is shared across all variations of this problem class is that there are actuator combinations that are not useful for reaching the goal. For example, activating actuators at opposite angles would leave the agent in the same position it was in before (ignoring the effect of the noise). The presence of these poorly performing actions provides some common patterns that can be leveraged. To test our intuition that an exploration policy would exploit the presence of poorly performing actions, we recorded the frequency with which they were executed on unseen testing tasks, over 5 different tasks, when using the learned exploration policy after training and when using a random exploration strategy. Figure 4b helps explain the improvement in performance. It depicts on the y-axis the percentage of times these poorly performing actions were selected at a given episode, and on the x-axis the agent's episode number in the current task. 
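The poorly performing actions tracked here can be characterized concretely. The sketch below is a toy reconstruction from the description above, assuming actuator j points at angle 45j degrees and each active actuator contributes a unit force; it is not the environment's actual code.

```python
import math

def net_force(actuators):
    """Resultant (x, y) force for an action in {0, 1}^8, with actuator j
    pointing at angle 45*j degrees (the Gaussian noise term is omitted)."""
    fx = sum(math.cos(math.radians(45 * j)) for j, on in enumerate(actuators) if on)
    fy = sum(math.sin(math.radians(45 * j)) for j, on in enumerate(actuators) if on)
    return fx, fy

def is_null_action(actuators, tol=1e-9):
    """True when the active actuators cancel out, e.g., two opposite ones."""
    fx, fy = net_force(actuators)
    return abs(fx) < tol and abs(fy) < tol
```

For instance, activating actuators 0 and 4 (which point in opposite directions) moves the agent nowhere in expectation, which is exactly the kind of action pattern the advisor learns to avoid.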
The agent using the advisor policy (blue) is encouraged to reduce the selection of known poorly performing actions, compared to a random action-selection exploration strategy (red).\n\nFigure 4: (a) Animat results: average learning curves on training tasks over the first 10 iterations (blue) and last 10 iterations (orange). (b) Animat results: frequency of poor-performing actions in an agent's lifetime with learned (blue) and random (red) exploration.\n\nVehicle Control: a more pragmatic application of our framework is quickly adapting a control policy from one system to another. For this experiment, we tested the advisor on a control problem using a self-driving car simulator implemented in Unity. We assume that the agent has a constant acceleration (up to some maximum velocity) and the actions consist of 15 possible steering angles between θ_min < 0 and θ_max > 0. The state is represented as a stack of the last four 80 × 80 images sensed by a front-facing camera, and the tasks vary in the body mass, m, of the car and the values of θ_min and θ_max. We tested the ability of the advisor to improve fine-tuning of controls for specific cars. We first learned a well-performing policy for one car and used that policy as a starting point to fine-tune policies for 8 different cars.\nThe experiment, depicted in Figure 5, compares an agent that is able to use an advisor during exploration for fine-tuning (blue) vs. an agent that does not have access to an advisor (red). The figure shows the number of episodes of fine-tuning needed to reach a pre-defined performance threshold (1,000 time-steps without leaving the correct lane). The first and second groups in the figure show the average number of episodes needed to fine-tune in the first and second half of tasks, respectively. 
In the first half of tasks (left), the advisor seems to make fine-tuning more difficult, since it has not yet been trained to deal with this specific problem. Using the advisor took an average of 42 episodes to fine-tune, while it took on average 12 episodes to fine-tune without it. The benefit, however, can be seen in the second half of training tasks. Once the advisor had been trained, it took on average 5 episodes to fine-tune, while not using the advisor required an average of 18 episodes to reach the required performance threshold. When the number of tasks is large enough and each episode is a time-consuming or costly process, our framework could result in important time and cost savings.\n\nFigure 5: Number of episodes needed to achieve threshold performance (lower is better).\n\n6.2 Is an Exploration Policy Simply a General Exploitation Policy?\n\nOne might be tempted to think that the learned policy for exploration might simply be a policy that works well in general. How do we know that the advisor is learning a policy for exploration and not simply a policy for exploitation? To answer this question, we generated three distinct unseen tasks for the pole-balancing and Animat problem classes and compared the performance of using only the learned exploration policy with the performance obtained by an exploitation policy trained to solve each specific task. Figure 6 shows two bar charts contrasting the performance of the exploration policy (blue) and the exploitation policy (green) on each task variation. In both charts, the first three groups of bars on the left correspond to the performance on each task and the last one to an average over all tasks. Figure 6a corresponds to the mean performance on pole-balancing and the error bars to the standard deviation; the y-axis denotes the return obtained. 
We can see that, as expected, the exploration policy by itself fails to achieve performance comparable to a task-specific policy. The same occurs with the Animat problem class, shown in Figure 6b. In this case, the y-axis refers to the number of steps needed to reach the goal (smaller bars are better). In all cases, a task-specific policy performs significantly better than the learned exploration policy, indicating that the exploration policy is not a general exploitation policy.\n\nFigure 6: Performance comparison of exploration and exploitation policies. (a) Average returns obtained on test tasks when using the advisor's exploration policy (blue) and a task-specific exploitation policy (green). (b) Number of steps needed to complete test tasks with the advisor policy (blue) and the exploitation policy (green).\n\n6.3 Performance Evaluation on Novel Tasks\n\nWe examine the performance of our framework on novel tasks when learning from scratch, and contrast our method to MAML trained using PPO. In the case of discrete action sets, we trained each task for 500 episodes and compare the performance of an agent trained with REINFORCE (R) and PPO, with and without an advisor. In the case of continuous tasks, we restrict our experiments to an agent using PPO after training for 500 episodes. In our experiments we set the initial value of ε to 0.8 and decreased it by a factor of 0.995 every episode. 
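The schedule just described corresponds to ε_i = 0.8 × 0.995^i, where i is the episode index on the current task; as a minimal check:

```python
def epsilon(i, eps0=0.8, decay=0.995):
    """Exploration rate after i episodes on the current task."""
    return eps0 * decay ** i
```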
The results shown in Table 1 were obtained by training 5 times on 5 novel tasks and recording the average performance and standard deviations. The table displays the mean of those averages and the mean of the standard deviations recorded. The problem classes "pole-balance (d)" and "animat" have discrete action spaces, while "pole-balance (c)", "hopper", and "ant" are continuous.

| Problem Class    | R               | R+Advisor       | PPO            | PPO+Advisor    | MAML            |
| Pole-balance (d) | 20.32 ± 3.15    | 28.52 ± 7.6     | 27.87 ± 6.17   | 46.29 ± 6.30   | 39.29 ± 5.74    |
| Animat           | 779.62 ± 110.28 | 387.27 ± 162.33 | 751.40 ± 68.73 | 631.97 ± 155.5 | 669.93 ± 92.32  |
| Pole-balance (c) | —               | —               | 29.95 ± 7.90   | 438.13 ± 35.54 | 267.76 ± 163.05 |
| Hopper           | —               | —               | 13.82 ± 10.53  | 164.43 ± 48.54 | 39.41 ± 7.95    |
| Ant              | —               | —               | 42.75 ± 24.35  | 83.76 ± 20.41  | 113.33 ± 64.48  |

Table 1: Average performance (and standard deviations) over all unseen-task trials on discrete and continuous control, measured over the last 50 episodes.

7 Conclusion
In this work we developed a framework for leveraging experience to guide an agent's exploration in novel tasks, where the advisor learns the exploration policy used by the agent solving a task. We showed that a few sample tasks can be used to learn an exploration policy that the agent can use to improve its speed of learning on novel tasks. A takeaway from this work is that oftentimes an agent solving a new task may have had experience with similar problems, and that experience can be leveraged. One way to do so is to learn a better approach for exploring in the face of uncertainty.
A natural future direction from this work is to use past experience to identify when exploration is needed, and not just what action to take when exploring.

References

[1] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, pages 263–272, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[3] Fernando Fernandez and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS '06, pages 720–727, New York, NY, USA, 2006. ACM.

[4] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[5] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT'11, pages 174–188, Berlin, Heidelberg, 2011. Springer-Verlag.

[6] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. CoRR, abs/1605.09674, 2016.

[7] William Koch, Renato Mancuso, Richard West, and Azer Bestavros. Reinforcement learning for UAV attitude control.
CoRR, abs/1804.04154, 2018.

[8] Romain Laroche, Mehdi Fatemi, Harm van Seijen, and Joshua Romoff. Multi-advisor reinforcement learning. April 2017.

[9] Bingyao Liu, Satinder P. Singh, Richard L. Lewis, and Shiyin Qin. Optimal rewards in multiagent teams. In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics, ICDL-EPIROB 2012, San Diego, CA, USA, November 7–9, 2012, pages 1–8, 2012.

[10] Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. CoRR, abs/1706.08090, 2017.

[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[12] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. CoRR, abs/1705.05363, 2017.

[13] V. V. Phansalkar and M. A. L. Thathachar. Local and global optimization algorithms for generalized learning automata. Neural Comput., 7(5):950–973, September 1995.

[14] Jürgen Schmidhuber, Jieyu Zhao, and Nicol N. Schraudolph. Reinforcement learning with self-modifying policies. In Learning to Learn, pages 293–309. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[15] Jürgen Schmidhuber. On learning how to learn learning strategies. Technical report, 1995.

[16] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation.
In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

[18] Alexander L. Strehl. Probably approximately correct (PAC) exploration in reinforcement learning. In ISAIM, 2008.

[19] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[20] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. CoRR, abs/1611.04717, 2016.

[21] Philip S. Thomas and Andrew G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July 2, 2011, pages 137–144, 2011.

[22] Harm van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. June 2017.

[23] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In Machine Learning, pages 279–292, 1992.

[24] Steven D. Whitehead. Complexity and cooperation in Q-learning. In Proceedings of the Eighth International Workshop (ML91), Northwestern University, Evanston, Illinois, USA, pages 363–367, 1991.

[25] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.

[26] Q. Yang and S. Jagannathan. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators.
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(2):377–390, April 2012.

[27] Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. CoRR, abs/1509.06791, 2015.