{"title": "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation", "book": "Advances in Neural Information Processing Systems", "page_first": 3675, "page_last": 3683, "abstract": "Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms. One of the key difficulties is insufficient exploration, resulting in an agent being unable to learn robust policies. Intrinsically motivated agents can explore new behavior for their own sake rather than to directly solve external goals. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework to integrate hierarchical action-value functions, operating at different temporal scales, with goal-driven intrinsically motivated deep reinforcement learning. A top-level q-value function learns a policy over intrinsic goals, while a lower-level function learns a policy over atomic actions to satisfy the given goals. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments. We demonstrate the strength of our approach on two problems with very sparse and delayed feedback: (1) a complex discrete stochastic decision process with stochastic transitions, and (2) the classic ATARI game -- `Montezuma's Revenge'.", "full_text": "Hierarchical Deep Reinforcement Learning:\n\nIntegrating Temporal Abstraction and\n\nIntrinsic Motivation\n\nTejas D. Kulkarni\u2217\nDeepMind, London\n\nKarthik R. Narasimhan\u2217\n\nCSAIL, MIT\n\nArdavan Saeedi\n\nCSAIL, MIT\n\ntejasdkulkarni@gmail.com\n\nkarthikn@mit.edu\n\nardavans@mit.edu\n\nJoshua B. Tenenbaum\n\nBCS, MIT\n\njbt@mit.edu\n\nAbstract\n\nLearning goal-directed behavior in environments with sparse feedback is a major\nchallenge for reinforcement learning algorithms. 
One of the key difficulties is insufficient exploration, resulting in an agent being unable to learn robust policies. Intrinsically motivated agents can explore new behavior for their own sake rather than to directly solve external goals. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment. We present hierarchical-DQN (h-DQN), a framework to integrate hierarchical action-value functions, operating at different temporal scales, with goal-driven intrinsically motivated deep reinforcement learning. A top-level Q-value function learns a policy over intrinsic goals, while a lower-level function learns a policy over atomic actions to satisfy the given goals. h-DQN allows for flexible goal specifications, such as functions over entities and relations. This provides an efficient space for exploration in complicated environments. We demonstrate the strength of our approach on two problems with very sparse and delayed feedback: (1) a complex discrete decision process with stochastic transitions, and (2) the classic ATARI game 'Montezuma's Revenge'.

1 Introduction

Learning goal-directed behavior with sparse feedback from complex environments is a fundamental challenge for artificial intelligence. Learning in this setting requires the agent to represent knowledge at multiple levels of spatio-temporal abstraction and to explore the environment efficiently. Recently, non-linear function approximators coupled with reinforcement learning [14, 16, 23] have made it possible to learn abstractions over high-dimensional state spaces, but the task of exploration with sparse feedback still remains a major challenge. Existing methods like Boltzmann exploration and Thompson sampling [31, 19] offer significant improvements over ε-greedy, but are limited because the underlying models still function at the level of basic actions.
In this work, we propose a framework that integrates deep reinforcement learning with hierarchical action-value functions (h-DQN), where the top-level module learns a policy over options (subgoals) and the bottom-level module learns policies to accomplish the objective of each option. Exploration in the space of goals enables efficient exploration in problems with sparse and delayed rewards. Additionally, our experiments indicate that goals expressed in the space of entities and relations can help constrain the exploration space for data-efficient deep reinforcement learning in complex environments.

*Equal Contribution. Work done while Tejas Kulkarni was affiliated with MIT.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Reinforcement learning (RL) formalizes control problems as finding a policy π that maximizes expected future rewards [32]. Value functions V(s) are central to RL; they cache the utility of any state s in achieving the agent's overall objective. Recently, value functions have also been generalized as V(s, g) in order to represent the utility of state s for achieving a given goal g ∈ G [33, 21]. When the environment provides delayed rewards, we adopt a strategy to first learn ways to achieve intrinsically generated goals, and subsequently learn an optimal policy to chain them together. Each of the value functions V(s, g) can be used to generate a policy that terminates when the agent reaches the goal state g. A collection of these policies can be hierarchically arranged with temporal dynamics for learning or planning within the framework of semi-Markov decision processes [34, 35]. In high-dimensional problems, these value functions can be approximated by neural networks as V(s, g; θ).

We propose a framework with hierarchically organized deep reinforcement learning modules working at different time-scales.
The model makes decisions over two levels of hierarchy: (a) a top-level module (meta-controller) takes in the state and picks a new goal, and (b) a lower-level module (controller) uses both the state and the chosen goal to select actions either until the goal is reached or the episode terminates. The meta-controller then chooses another goal and steps (a)-(b) repeat. We train our model using stochastic gradient descent at different temporal scales to optimize expected future intrinsic rewards (controller) and extrinsic rewards (meta-controller). We demonstrate the strength of our approach on problems with delayed rewards: (1) a discrete stochastic decision process with a long chain of states before receiving optimal extrinsic rewards, and (2) a classic ATARI game ('Montezuma's Revenge') with even longer-range delayed rewards, where most existing state-of-the-art deep reinforcement learning approaches fail to learn policies in a data-efficient manner.

2 Literature Review

Reinforcement Learning with Temporal Abstractions Learning and operating over different levels of temporal abstraction is a key challenge in tasks involving long-range planning. In the context of hierarchical reinforcement learning [2], Sutton et al. [34] proposed the options framework, which involves abstractions over the space of actions. At each step, the agent chooses either a one-step "primitive" action or a "multi-step" action policy (option). Each option defines a policy over actions (either primitive or other options) and can be terminated according to a stochastic function β. Thus, the traditional MDP setting can be extended to a semi-Markov decision process (SMDP) with the use of options. Recently, several methods have been proposed to learn options in real-time by using varying reward functions [35] or by composing existing options [28]. Value functions have also been generalized to consider goals along with states [21].
Our work is inspired by these papers and builds upon them.

Other related work on hierarchical formulations includes the MAXQ framework [6], which decomposed the value function of an MDP into combinations of value functions of smaller constituent MDPs, as did Guestrin et al. [12] in their factored MDP formulation. Hernandez-Gardiol and Mahadevan [13] combine hierarchies with short-term memory to handle partial observations. In the skill learning literature, Baranes et al. [1] have proposed a goal-driven active learning approach for learning skills in continuous sensorimotor spaces.

In this work, we propose a scheme for temporal abstraction that involves simultaneously learning options and a control policy to compose options in a deep reinforcement learning setting. Our approach does not use separate Q-functions for each option, but instead treats the option as part of the input, similar to [21]. This has two potential advantages: (1) there is shared learning between different options, and (2) the model is scalable to a large number of options.

Intrinsic Motivation The nature and origin of 'good' intrinsic reward functions is an open question in reinforcement learning. Singh et al. [27] explored agents with intrinsic reward structures in order to learn generic options that can apply to a wide variety of tasks. In another paper, Singh et al. [26] take an evolutionary perspective to optimize over the space of reward functions for the agent, leading to a notion of extrinsically and intrinsically motivated behavior. In the context of hierarchical RL, Goel and Huber [10] discuss a framework for sub-goal discovery using the structural aspects of a learned policy model. Şimşek et al.
[24] provide a graph partitioning approach to subgoal identification.

Schmidhuber [22] provides a coherent formulation of intrinsic motivation, which is measured by the improvements to a predictive world model made by the learning algorithm. Mohamed and Rezende [17] have recently proposed a notion of intrinsically motivated learning within the framework of mutual information maximization. Frank et al. [9] demonstrate the effectiveness of artificial curiosity using information gain maximization in a humanoid robot. Oudeyer et al. [20] categorize intrinsic motivation approaches into knowledge-based methods, competence or goal-based methods, and morphological methods. Our work relates to competence-based intrinsic motivation, but other complementary methods can be combined in future work.

Object-based Reinforcement Learning Object-based representations [7, 4] that can exploit the underlying structure of a problem have been proposed to alleviate the curse of dimensionality in RL. Diuk et al. [7] propose an Object-Oriented MDP, using a representation based on objects and their interactions. Defining each state as a set of value assignments to all possible relations between objects, they introduce an algorithm for solving deterministic object-oriented MDPs. Their representation is similar to that of Guestrin et al. [11], who describe an object-based representation in the context of planning. In contrast to these approaches, our representation does not require explicit encoding of the relations between objects and can be used in stochastic domains.

Deep Reinforcement Learning Recent advances in function approximation with deep neural networks have shown promise in handling high-dimensional sensory input.
Deep Q-Networks and their variants have been successfully applied to various domains including Atari games [16, 15] and Go [23], but still perform poorly in environments with sparse, delayed reward signals.

Cognitive Science and Neuroscience The nature and origin of intrinsic goals in humans is a thorny issue, but there are some notable insights from the existing literature. There is converging evidence in developmental psychology that human infants, primates, children, and adults in diverse cultures base their core knowledge on certain cognitive systems including entities, agents and their actions, numerical quantities, space, social structures, and intuitive theories [29]. During curiosity-driven activities, toddlers use this knowledge to generate intrinsic goals such as building physically stable block structures. In order to accomplish these goals, toddlers seem to construct subgoals in the space of their core knowledge. Knowledge of space can also be utilized to learn a hierarchical decomposition of spatial environments. This has been explored in neuroscience with the successor representation, which represents value functions in terms of expected future state occupancy. Decompositions of the successor representation have been shown to yield reasonable subgoals for spatial navigation problems [5, 30].

3 Model

Consider a Markov decision process (MDP) represented by states s ∈ S, actions a ∈ A, and a transition function T : (s, a) → s′. An agent operating in this framework receives a state s from the external environment and can take an action a, which results in a new state s′. We define the extrinsic reward function as F : s → R. The objective of the agent is to maximize this function over long periods of time.
For example, this function can take the form of the agent's survival time or score in a game.

Agents Effective exploration in MDPs is a significant challenge in learning good control policies. Methods such as ε-greedy are useful for local exploration but fail to provide impetus for the agent to explore different areas of the state space. In order to tackle this, we utilize a notion of intrinsic goals g ∈ G. The agent focuses on setting and achieving sequences of goals via learning policies πg in order to maximize cumulative extrinsic reward. In order to learn each πg, the agent also has a critic, which provides intrinsic rewards based on whether the agent is able to achieve its goals (see Figure 1).

Temporal Abstractions As shown in Figure 1, the agent uses a two-stage hierarchy consisting of a controller and a meta-controller. The meta-controller receives state st and chooses a goal gt ∈ G, where G denotes the set of all possible current goals. The controller then selects an action at using st and gt. The goal gt remains in place for the next few time steps, either until it is achieved or a terminal state is reached. The internal critic is responsible for evaluating whether a goal has been reached and provides an appropriate reward rt(g) to the controller. In this work, we make a minimal

Figure 1: (Overview) The agent receives sensory observations and produces actions. Separate DQNs are used inside the meta-controller and controller. The meta-controller looks at the raw states and produces a policy over goals by estimating the action-value function Q2(st, gt; θ2) (to maximize expected future extrinsic reward). The controller takes in states and the current goal, and produces a policy over actions by estimating the action-value function Q1(st, at; θ1, gt) to solve the predicted goal (by maximizing expected future intrinsic reward).
The internal critic provides a positive reward to the controller if and only if the goal is reached. The controller terminates either when the episode ends or when g is accomplished. The meta-controller then chooses a new g and the process repeats.

assumption of a binary internal reward, i.e. 1 if the goal is reached and 0 otherwise. The objective function for the controller is to maximize the cumulative intrinsic reward

$$R_t(g) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}(g).$$

Similarly, the objective of the meta-controller is to optimize the cumulative extrinsic reward

$$F_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} f_{t'},$$

where ft are reward signals received from the environment. Note that the time scales for Ft and Rt are different: each ft is the accumulated external reward over the time period between successive goal selections. The discounting in Ft, therefore, is over sequences of goals and not lower-level actions. This setup is similar to optimizing over the space of optimal reward functions to maximize fitness [25]. In our case, the reward functions are dynamic and temporally dependent on the sequential history of goals. Figure 1 illustrates the agent's use of the hierarchy over subsequent time steps.

Deep Reinforcement Learning with Temporal Abstractions We use the Deep Q-Learning framework [16] to learn policies for both the controller and the meta-controller. Specifically, the controller estimates the following Q-value function:

$$Q_1^*(s, a; g) = \max_{\pi_{ag}} \mathbb{E}\Big[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \;\Big|\; s_t = s, a_t = a, g_t = g, \pi_{ag}\Big] = \max_{\pi_{ag}} \mathbb{E}\Big[r_t + \gamma \max_{a_{t+1}} Q_1^*(s_{t+1}, a_{t+1}; g) \;\Big|\; s_t = s, a_t = a, g_t = g, \pi_{ag}\Big] \quad (1)$$

where g is the agent's goal in state s and πag is the action policy. Similarly, for the meta-controller, we have:

$$Q_2^*(s, g) = \max_{\pi_g} \mathbb{E}\Big[\sum_{t'=t}^{t+N} f_{t'} + \gamma \max_{g'} Q_2^*(s_{t+N}, g') \;\Big|\; s_t = s, g_t = g, \pi_g\Big] \quad (2)$$

where N denotes the number of time steps until the controller halts given the current goal, g′ is the agent's goal in state st+N, and πg is the policy over goals. It is important to note that the transitions (st, gt, ft, st+N) generated by Q2 run at a slower time-scale than the transitions (st, at, gt, rt, st+1) generated by Q1.

We can represent Q*(s, g) ≈ Q(s, g; θ) using a non-linear function approximator with parameters θ. Each Q ∈ {Q1, Q2} can be trained by minimizing corresponding loss functions L1(θ1) and L2(θ2). We store experiences (st, gt, ft, st+N) for Q2 and (st, at, gt, rt, st+1) for Q1 in disjoint memory spaces D1 and D2 respectively. The loss function for Q1 can then be stated as:

$$L_1(\theta_{1,i}) = \mathbb{E}_{(s,a,g,r,s') \sim D_1}\big[(y_{1,i} - Q_1(s, a; \theta_{1,i}, g))^2\big],$$

where i denotes the training iteration number and y1,i = r + γ maxa′ Q1(s′, a′; θ1,i−1, g). Following [16], the parameters θ1,i−1 from the previous iteration are held fixed when optimizing the loss function.
The parameters θ1 can be optimized using the gradient:

$$\nabla_{\theta_{1,i}} L_1(\theta_{1,i}) = \mathbb{E}_{(s,a,g,r,s') \sim D_1}\Big[\big(r + \gamma \max_{a'} Q_1(s', a'; \theta_{1,i-1}, g) - Q_1(s, a; \theta_{1,i}, g)\big)\, \nabla_{\theta_{1,i}} Q_1(s, a; \theta_{1,i}, g)\Big] \quad (3)$$

The loss function L2 and its gradients can be derived using a similar procedure.

Algorithm 1 Learning algorithm for h-DQN
1: Initialize experience replay memories {D1, D2} and parameters {θ1, θ2} for the controller and meta-controller respectively.
2: Initialize exploration probability ε1,g = 1 for the controller for all goals g and ε2 = 1 for the meta-controller.
3: for i = 1, num_episodes do
4:   Initialize game and get start state description s
5:   g ← EPSGREEDY(s, G, ε2, Q2)
6:   while s is not terminal do
7:     F ← 0
8:     s0 ← s
9:     while not (s is terminal or goal g reached) do
10:      a ← EPSGREEDY({s, g}, A, ε1,g, Q1)
11:      Execute a and obtain next state s′ and extrinsic reward f from environment
12:      Obtain intrinsic reward r(s, a, s′) from internal critic
13:      Store transition ({s, g}, a, r, {s′, g}) in D1
14:      UPDATEPARAMS(L1(θ1,i), D1)
15:      UPDATEPARAMS(L2(θ2,i), D2)
16:      F ← F + f
17:      s ← s′
18:    end while
19:    Store transition (s0, g, F, s′) in D2
20:    if s is not terminal then
21:      g ← EPSGREEDY(s, G, ε2, Q2)
22:    end if
23:  end while
24:  Anneal ε2 and ε1.
25: end for

Algorithm 2: EPSGREEDY(x, B, ε, Q)
1: if random() < ε then
2:   return random element from set B
3: else
4:   return argmax_{m∈B} Q(x, m)
5: end if

Algorithm 3: UPDATEPARAMS(L, D)
1: Randomly sample mini-batches from D
2: Perform gradient descent on loss L(θ) (cf. (3))

Learning Algorithm We learn the parameters of h-DQN using stochastic gradient descent at different time scales: transitions from the controller are collected at every time step, but a transition from the meta-controller is only collected when the controller terminates (i.e. when a goal is re-picked or the episode ends). Each new goal g is drawn in an ε-greedy fashion (Algorithms 1 & 2), with the exploration probability ε2 annealed as learning proceeds (from a starting value of 1).

In the controller, at every time step, an action is drawn for the current goal using the exploration probability ε1,g, which depends on the current empirical success rate of reaching g. Specifically, if the success rate for goal g is > 90%, we set ε1,g = 0.1, else 1. All ε1,g values are annealed to 0.1. The model parameters (θ1, θ2) are periodically updated by drawing experiences from replay memories D1 and D2, respectively (see Algorithm 3).

4 Experiments

(1) Discrete stochastic decision process with delayed rewards For our first experiment, we consider a stochastic decision process where the extrinsic reward depends on the history of visited states in addition to the current state. This task demonstrates the importance of goal-driven exploration in such environments. There are 6 possible states and the agent always starts at s2. The agent moves left deterministically when it chooses the left action, but the right action only succeeds 50% of the time, resulting in a left move otherwise. The terminal state is s1 and the agent receives a reward of 1 when it first visits s6 and then s1. The reward for reaching s1 without visiting s6 is 0.01. This is a modified version of the MDP in [19], with the reward structure adding complexity to the task (see Figure 2). We consider each state as a candidate goal for exploration.
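As a concrete illustration, this decision process can be written as a tiny simulator. This is our own sketch, not the paper's code: the class and method names are ours, and the boundary behavior of the right action at s6 (falling back to a left move) is an assumption.

```python
import random

class StochasticDecisionProcess:
    """Six states s1..s6; the agent starts at s2 and s1 is terminal.
    'left' always moves left; 'right' moves right with probability 0.5
    and otherwise moves left. Reaching s1 yields reward 1 if s6 was
    visited first, else 0.01."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def reset(self):
        self.state = 2
        self.visited_s6 = False
        return self.state

    def step(self, action):  # action in {"left", "right"}
        # Assumption: from s6 a 'right' also resolves to a left move.
        if action == "right" and self.state < 6 and self.rng.random() < 0.5:
            self.state += 1
        else:
            self.state = max(1, self.state - 1)
        if self.state == 6:
            self.visited_s6 = True
        done = self.state == 1
        reward = (1.0 if self.visited_s6 else 0.01) if done else 0.0
        return self.state, reward, done
```

Because the left action is deterministic, moving left from the start state immediately illustrates the sub-optimal 0.01 outcome that plain Q-Learning converges to.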
This enables and encourages the agent to visit state s6 (if chosen as a goal) and hence learn the optimal policy. For each goal, the agent receives a positive intrinsic reward if and only if it reaches the corresponding state.

Figure 2: A stochastic decision process where the reward at the terminal state s1 depends on whether s6 is visited (r = 1) or not (r = 1/100). Edges are annotated with transition probabilities (red arrow: move right, black arrow: move left).

Results We compare the performance of our approach (without deep neural networks) against Q-Learning as a baseline (without intrinsic rewards) in terms of the average extrinsic reward gained in an episode. In our experiments, all ε parameters are annealed from 1 to 0.1 over 50k steps. The learning rate is set to 2.5 · 10−4. Figure 3 plots the evolution of reward for both methods averaged over 10 different runs. As expected, we see that Q-Learning is unable to find the optimal policy even after 200 epochs, converging to a sub-optimal policy of reaching state s1 directly to obtain a reward of 0.01. In contrast, our approach with hierarchical Q-estimators learns to choose goals s4, s5 or s6, which statistically lead the agent to visit s6 before going back to s1. Our agent obtains a significantly higher average reward of 0.13.

Figure 3: (left) Average reward (over 10 runs) of our approach compared to Q-learning. (right) Number of visits of our approach to states s3–s6 (over 1000 episodes). Initial state: s2, terminal state: s1.

Figure 3 illustrates that the number of visits to states s3, s4, s5, s6 increases with episodes of training. Each data point shows the average number of visits for each state over the last 1000 episodes.
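The ε-greedy selection used at both levels (Algorithm 2), together with the success-rate-dependent exploration probability ε1,g described in the Learning Algorithm paragraph, can be sketched as follows; the helper names are our own.

```python
import random

def eps_greedy(x, choices, eps, q, rng=random):
    """Algorithm 2: with probability eps pick a random element of
    `choices`, otherwise the element maximizing q(x, m)."""
    if rng.random() < eps:
        return rng.choice(list(choices))
    return max(choices, key=lambda m: q(x, m))

def controller_eps(success_rate, floor=0.1):
    """Exploration probability for a goal g: 0.1 once the empirical
    success rate of reaching g exceeds 90%, otherwise 1; all values
    are annealed towards `floor` over training."""
    return floor if success_rate > 0.9 else 1.0
```

The same `eps_greedy` routine serves both levels: the meta-controller calls it with the goal set G and Q2, the controller with the action set A and Q1.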
This indicates that our model is choosing goals in a way that reaches the critical state s6 more often.

(2) ATARI game with delayed rewards We now consider 'Montezuma's Revenge', an ATARI game with sparse, delayed rewards. The game requires the player to navigate the explorer (in red) through several rooms while collecting treasures. In order to pass through doors (in the top right and top left corners of the figure), the player has to first pick up the key. The player has to then climb down the ladders on the right and move left towards the key, resulting in a long sequence of actions before receiving a reward (+100) for collecting the key. After this, navigating towards the door and opening it results in another reward (+300).

Basic DQN [16] achieves a score of 0, while even the best performing system, Gorila DQN [18], manages only 4.16 on average. Asynchronous actor-critic methods achieve a non-zero score but require hundreds of millions of training frames [15].

Figure 4 panels: Architecture; Total extrinsic reward; Success ratio for reaching the goal 'key'; Success % of different goals over time.

Figure 4: (top-left) Architecture: DQN architecture for the controller (Q1). A similar architecture produces Q2 for the meta-controller (without goal as input). (top-right) The joint training learns to consistently get high rewards.
(bottom-left) Goal success ratio: The agent learns to choose the key more often as training proceeds and is successful at achieving it. (bottom-right) Goal statistics: During early phases of joint training, all goals are equally preferred due to high exploration, but as training proceeds, the agent learns to select appropriate goals such as the key and the bottom-left door.

Setup The agent needs intrinsic motivation to explore meaningful parts of the scene before learning about the advantage of obtaining the key. Inspired by the developmental psychology literature [29] and object-oriented MDPs [7], we use entities or objects in the scene to parameterize goals in this environment. Unsupervised detection of objects in visual scenes is an open problem in computer vision, although there has been recent progress in obtaining objects directly from image or motion data [8]. In this work, we built a custom pipeline to provide plausible object candidates. Note that the agent is still required to learn which of these candidates are worth pursuing as goals. The controller and meta-controller are convolutional neural networks (Figure 4) that learn representations from raw pixel data. We use the Arcade Learning Environment [3] to perform experiments.

The internal critic is defined in the space of ⟨entity1, relation, entity2⟩, where relation is a function over configurations of the entities. In our experiments, the agent learns to choose entity2. For instance, the agent is deemed to have completed a goal (and only then receives a reward) if the agent entity reaches another entity such as the door. The critic computes binary rewards using the relative positions of the agent and the goal (1 if the goal was reached). Note that this notion of relational intrinsic rewards can be generalized to other settings.
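A minimal sketch of such a relational critic follows. The paper only states that binary rewards are computed from the relative positions of the agent and the goal; reducing entities to bounding boxes and using an overlap test as the "reached" relation is our own assumption.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test; boxes are (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def intrinsic_reward(agent_box, goal_box):
    """Binary critic: 1 if the agent entity reaches the goal entity
    (here: their bounding boxes overlap), 0 otherwise."""
    return 1 if boxes_overlap(agent_box, goal_box) else 0
```

Other relations (e.g. "never reaches", as in the Asteroids example below) would simply substitute a different predicate for `boxes_overlap`.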
For instance, in the ATARI game 'Asteroids', the agent could be rewarded when a bullet reaches an asteroid, or simply if the ship never reaches an asteroid. In 'Pacman', the agent could be rewarded when pellets on the screen are reached. In the most general case, we can potentially let the model evolve a parameterized intrinsic reward function given entities. We leave this for future work.

Model Architecture and Training As shown in Figure 4, the model consists of stacked convolutional layers with rectified linear units (ReLU). The input to the meta-controller is a set of four consecutive images of size 84 × 84. To encode the goal output from the meta-controller, we append a binary mask of the goal location in image space to the original 4 consecutive frames. This augmented input is passed to the controller. The experience replay memories D1 and D2 were set to sizes 10^6 and 5 · 10^4 respectively. We set the learning rate to 2.5 · 10−4, with a discount rate of 0.99. We follow a two-phase training procedure: (1) In the first phase, we set the exploration parameter ε2 of the meta-controller to 1 and train the controller on actions. This effectively leads to pre-training the controller so that it can learn to solve a subset of the goals.
(2) In the second phase, we jointly train the controller and meta-controller.

Figure 4 architecture (controller): image (s) + goal (g) → ReLU Conv (filter 8, 32 feature maps, stride 4) → ReLU Conv (filter 4, 64 feature maps, stride 2) → ReLU Conv (filter 3, 64 feature maps, stride 1) → ReLU Linear (512) → Linear → Q1(s, a; g).

Figure 5: Sample game play on Montezuma's Revenge: The four quadrants are arranged in temporal order (top-left, top-right, bottom-left and bottom-right). First, the meta-controller chooses key as the goal (illustrated in red). The controller then tries to satisfy this goal by taking a series of low-level actions (only a subset shown) but fails due to colliding with the skull (the episode terminates here). The meta-controller then chooses the bottom-right ladder as the next goal and the controller terminates after reaching it. Subsequently, the meta-controller chooses the key and the top-right door, and the controller is able to successfully achieve both these goals.

Results Figure 4 shows reward progress from the joint training phase: it is evident that the model gradually learns to both reach the key and open the door to get a reward of around +400 per episode. The agent learns to choose the key more often as training proceeds and is also successful at reaching it.
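The goal encoding used for the controller's input (four 84 × 84 frames plus an appended binary goal mask, described under Model Architecture and Training) can be sketched with NumPy. The paper does not specify how the mask is constructed beyond "a binary mask of the goal location in image space", so the box-shaped mask here is an assumption.

```python
import numpy as np

def controller_input(frames, goal_box):
    """Stack four grayscale 84x84 frames with a binary goal mask.
    frames: array of shape (4, 84, 84); goal_box: (x, y, w, h) in
    image coordinates. Returns a (5, 84, 84) tensor as input to Q1."""
    assert frames.shape == (4, 84, 84)
    mask = np.zeros((1, 84, 84), dtype=frames.dtype)
    x, y, w, h = goal_box
    mask[0, y:y + h, x:x + w] = 1.0  # assumed box-shaped goal region
    return np.concatenate([frames, mask], axis=0)
```

The meta-controller consumes the same four frames without the mask channel, matching the "without goal as input" note in the Figure 4 caption.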
We observe that the agent first learns to perform the simpler goals (such as reaching the right door or the middle ladder) and then slowly starts learning the 'harder' goals such as the key and the bottom ladders, which provide a path to higher rewards. Figure 4 also shows the evolution of the success rate of goals that are picked. At the end of training, we can see that the 'key', 'bottom-left-ladder' and 'bottom-right-ladder' are chosen increasingly more often.

In order to scale up to solve the entire game, several key ingredients are missing, such as automatic discovery of objects from videos to aid the goal parameterization we considered, a flexible short-term memory, and the ability to intermittently terminate ongoing options. We also show some screenshots from a test run with our agent (with epsilon set to 0.1) in Figure 5, as well as a sample animation of the run.2

References

[1] A. Baranes and P.-Y. Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013.

[2] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012.

2Sample trajectory of a run on 'Montezuma's Revenge' – https://goo.gl/3Z64Ji

[4] L. C. Cobo, C. L. Isbell, and A. L. Thomaz. Object focused q-learning for autonomous agents. In Proceedings of AAMAS, pages 1061–1068, 2013.

[5] P. Dayan. Improving generalization for temporal difference learning: The successor representation.
Neural Computation, 5(4):613–624, 1993.

[6] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research (JAIR), 13:227–303, 2000.

[7] C. Diuk, A. Cohen, and M. L. Littman. An object-oriented representation for efficient reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 240–247, 2008.

[8] S. Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, and G. E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575, 2016.

[9] M. Frank, J. Leitner, M. Stollenga, A. Förster, and J. Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. Intrinsic Motivations and Open-Ended Development in Animals, Humans, and Robots, page 245, 2015.

[10] S. Goel and M. Huber. Subgoal discovery for hierarchical reinforcement learning using learned policies. In FLAIRS Conference, pages 346–350, 2003.

[11] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans to new environments in relational MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1003–1010, 2003.

[12] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, pages 399–468, 2003.

[13] N. Hernandez-Gardiol and S. Mahadevan. Hierarchical memory-based reinforcement learning. In Advances in Neural Information Processing Systems, pages 1047–1053, 2001.

[14] J. Koutník, J. Schmidhuber, and F. Gomez. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, pages 541–548. ACM, 2014.

[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. 
Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[17] S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

[18] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

[19] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. arXiv preprint arXiv:1602.04621, 2016.

[20] P.-Y. Oudeyer and F. Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.

[21] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1312–1320, 2015.

[22] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

[23] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[24] Ö. Şimşek, A. Wolfe, and A. Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the International Conference on Machine Learning, pages 816–823, 2005.

[25] S. Singh, R. L. Lewis, and A. G. Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606, 2009.

[26] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.

[27] S. P. Singh, A. G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288, 2004.

[28] J. Sorg and S. Singh. Linear options. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, pages 31–38, Richland, SC, 2010.

[29] E. S. Spelke and K. D. Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.

[30] K. L. Stachenfeld, M. Botvinick, and S. J. Gershman. Design principles of the hippocampal cognitive map. In Advances in Neural Information Processing Systems, pages 2528–2536, 2014.

[31] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

[32] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998.

[33] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.

[34] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[35] C. Szepesvari, R. S. Sutton, J. Modayil, S. Bhatnagar, et al. Universal option models. 
In Advances in Neural Information Processing Systems, pages 990–998, 2014.