{"title": "Generalization of Reinforcement Learners with Working and Episodic Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 12469, "page_last": 12478, "abstract": "Memory is an important aspect of intelligence and plays a role in many deep reinforcement learning models. However, little progress has been made in understanding when specific memory systems help more than others and how well they generalize. The field also has yet to see a prevalent consistent and rigorous approach for evaluating agent performance on holdout data.\nIn this paper, we aim to develop a comprehensive methodology to test different kinds of memory in an agent and assess how well the agent can apply what it learns in training to a holdout set that differs from the training set along dimensions that we suggest are relevant for evaluating memory-specific generalization. To that end, we first construct a diverse set of memory tasks that allow us to evaluate test-time generalization across multiple dimensions. Second, we develop and perform multiple ablations on an agent architecture that combines multiple memory systems, observe its baseline models, and investigate its performance against the task suite.", "full_text": "Generalization of Reinforcement Learners with\n\nWorking and Episodic Memory\n\nRyan Faulkner?\n\nGavin Buttimore\n\nMeire Fortunato?\n\nSteven Hansen?\n\nCharlie Deck\n\nMelissa Tan?\n\nAdri\u00e0 Puigdom\u00e8nech Badia\n\nJoel Z Leibo\n\nCharles Blundell\n\n{meirefortunato, melissatan, rfaulk, stevenhansen,\n\nadriap, buttimore, cdeck, jzl, cblundell}@google.com\n\nDeepMind\n\n(? Equal Contribution)\n\nAbstract\n\nMemory is an important aspect of intelligence and plays a role in many deep\nreinforcement learning models. However, little progress has been made in un-\nderstanding when speci\ufb01c memory systems help more than others and how well\nthey generalize. The \ufb01eld also has yet to see a prevalent consistent and rigorous\napproach for evaluating agent performance on holdout data. In this paper, we aim\nto develop a comprehensive methodology to test different kinds of memory in an\nagent and assess how well the agent can apply what it learns in training to a holdout\nset that differs from the training set along dimensions that we suggest are relevant\nfor evaluating memory-speci\ufb01c generalization. To that end, we \ufb01rst construct a\ndiverse set of memory tasks1 that allow us to evaluate test-time generalization\nacross multiple dimensions. Second, we develop and perform multiple ablations on\nan agent architecture that combines multiple memory systems, observe its baseline\nmodels, and investigate its performance against the task suite.\n\n1\n\nIntroduction\n\nHumans use memory to reason, imagine, plan, and learn. Memory is a foundational component of\nintelligence, and enables information from past events and contexts to inform decision-making in the\npresent and future. Recently, agents that utilize memory systems have advanced the state of the art\nin various research areas including reasoning, planning, program execution and navigation, among\nothers (Graves et al., 2016; Zambaldi et al., 2018; Santoro et al., 2018; Banino et al., 2018; Vaswani\net al., 2017; Sukhbaatar et al., 2015).\nMemory has many aspects, and having access to different kinds allows intelligent organisms to\nbring the most relevant past information to bear on different sets of circumstances. 
In cognitive psychology and neuroscience, two commonly studied types of memory are working and episodic memory. Working memory (Miyake and Shah, 1999) is a short-term temporary store with limited capacity.
In contrast, episodic memory (Tulving and Murray, 1985) is typically a larger autobiographical database of experience (e.g. recalling a meal eaten last month) that lets one store information over a longer time scale and compile sequences of events into episodes (Tulving, 2002). Episodic memory has been shown to help reinforcement learning agents adapt more quickly and thereby boost data efficiency (Blundell et al., 2016; Pritzel et al., 2017; Hansen et al., 2018). More recently, Ritter et al. (2018) show how episodic memory can be used to provide agents with context-switching abilities in contextual bandit problems. The transformer (Vaswani et al., 2017) can be viewed as a hybrid of working memory and episodic memory that has been successfully applied to many supervised learning problems.

1https://github.com/deepmind/dm_memorytasks. Videos available at https://sites.google.com/view/memory-tasks-suite

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this work, we explore adding such memory systems to agents and propose a consistent and rigorous approach for evaluating whether an agent demonstrates generalization-enabling memory capabilities similar to those seen in animals and humans.
One fundamental principle in machine learning is to train on one set of data and test on an unseen holdout set, but it has to date been common in reinforcement learning to evaluate agent performance solely on the training set, which is suboptimal for testing generalization (Pineau, 2018). Also, though advances have recently been made on evaluating generalization in reinforcement learning (Cobbe et al., 2018), these have not been specific to memory.
Our approach is to construct a train-holdout split where the holdout set differs from the training set along axes that we propose are relevant specifically to memory, i.e. the scale of the task and the precise objects used in the task environments. For instance, if an agent learns in training to travel to an apple placed in a room, altering the room size or apple color as part of a generalization test should ideally not throw it off.
We propose a set of environments that possess such a split and test different aspects of working and episodic memory, to help us better understand when different kinds of memory systems are most helpful and identify memory architectures in agents with memory abilities that cognitive scientists and psychologists have observed in humans.
Alongside these tasks, we develop a benchmark memory-based agent, the Memory Recall Agent (MRA), that brings together previously developed systems thought to mimic working memory and episodic memory. This combination of a controller that models working memory, an external episodic memory, and an architecture that encourages long-term representational credit assignment via an auxiliary unsupervised loss and backpropagation through time that can 'jump' over several time-steps obtains better performance than baselines across the suite.
In particular, episodic memory and learning good representations both prove crucial and in some cases stack synergistically.
To summarize, our contribution is to:

• Introduce a suite of tasks that require an agent to utilize fundamental functional properties of memory in order to solve in a way that generalizes to holdout data.
• Develop an agent architecture that explicitly models the operation of memory by integrating components that functionally mimic humans' episodic and working memory.
• Show that different components of our agent's memory have different effectiveness in training and in generalizing to holdout sets.
• Show that none of the models fully generalize outside of the train set on the more challenging tasks, and that the extrapolation incurs a greater level of degradation.

2 Task suite overview

We define a suite of 13 tasks designed to test different aspects of memory, with train-test splits that test for generalization across multiple dimensions (https://github.com/deepmind/dm_memorytasks). These include cognitive psychology tasks adapted from PsychLab (Leibo et al., 2018) and DMLab (Beattie et al., 2016), and new tasks built with the Unity 3D game engine (Unity) that require the agent to 1) spot the difference between two scenes; 2) remember the location of a goal and navigate to it; or 3) infer an indirect transitive relation between objects. Videos with task descriptions are at https://sites.google.com/view/memory-tasks-suite.

2.1 PsychLab

Four tasks in the Memory Tasks Suite use the PsychLab environment (Leibo et al., 2018), which simulates a psychology laboratory in first-person. The agent is presented with a set of one or multiple consecutive images, where each set is called a 'trial'. Each episode has multiple trials.

In Arbitrary Visuomotor Mapping (AVM) a series of objects is presented, each with an associated look-direction (e.g. up, left). The agent is rewarded if it looks in the associated direction the next time it sees a given object in the episode (Fig 8(a) in App. B). Continuous Recognition presents a series of images with rewards given for correctly indicating whether an image has been previously shown in the episode (Fig 8(b) in App. B). In Change Detection the agent sees two consecutive images, separated by a variable-length delay, and has to correctly indicate if the two images differ (Fig 8(c) in App. B). In What Then Where the agent is shown a single 'challenge' MNIST digit, then an image of that digit with three other digits, each placed along an edge of the rectangular screen. It next has to correctly indicate the location of the 'challenge' digit (Fig 8(d) in App. B).

2.2 3D tasks

Figure 1: Task layouts for (a) Spot the Difference basic, (b) Navigate to Goal, and (c) Transitive Inference. In (a), the agent has to identify the difference between the two rooms. In (b), the agent has to go to the goal, which is represented by an oval symbol here and may be visible or not to the agent. In (c), the agent has to go to the higher-valued object in each pair. The value order is given by the transitive chain outside the room. It is shown here solely for illustration; the agent cannot see it.

Spot the Difference: This tests whether the agent can correctly identify the difference between two nearly identical scenes (Figure 1(a)).
The agent has to move from the first to the second room, with a 'delay' corridor in between. See Fig. 2 for the four different variants.

Figure 2: Spot the Difference tasks: (a) Spot the Difference Basic, (b) Spot the Difference Passive, (c) Spot the Difference Multi-Object, (d) Spot the Difference Motion. (a) All the tasks in this family are variants of this basic setup, where each room contains two blocks. (b) By placing Room 1's blocks right next to the corridor entrance, we guarantee that the agent will always see them. (c) The number of objects varies. (d) Instead of differing in color between rooms, the altered block follows a different motion pattern.

Goal Navigation: This task family was inspired by the Morris Watermaze (Miyake and Shah, 1999) setup used with rodents in behavioral neuroscience. The agent is rewarded every time it successfully reaches the goal; once it gets there it is respawned randomly in the arena and has to find its way back to the goal. The goal location is re-randomized at the start of each episode (Fig. 1(b), Fig. 3).

Transitive Inference: This task tests if an agent can learn an overall transitive ordering over a chain of objects, through being presented with ordered pairs of adjacent objects (see Fig. 1(c) and App. B).

2.3 Scale and Stimulus Split

To test how well the agent can generalize to holdout data after training, we create per-task holdout levels that differ from the training level along a scale and a stimulus dimension. The scale dimension is intended to capture something about the memory demand of the task: e.g., a task with a longer time delay between events that must be related should be harder than one with a short delay. The stimulus dimension is to guard against trivial overfitting to the particular visual input presented to the agent: the memory representation should be more abstract than the particular colour of an object.

Figure 3: Goal Navigation tasks: (a) Invisible Goal Empty Arena, (b) Invisible Goal, With Buildings, (c) Visible Goal With Buildings, (d) Visible Goal Procedural Maze. (a) The arena has no buildings; the agent must navigate by the skybox. (b) There are rectangular buildings at fixed, non-randomized locations in the arena. (c) As in (b), but the goal appears as an oval. (d) A visible goal in a procedurally generated maze.

The training level comprises a 'small' and 'large' scale version of the task. When training the agent we uniformly sample between these two scales. As for the holdout levels, one of them – 'holdout-interpolate' – corresponds to an interpolation between those two scales (call it 'medium') and the other, 'holdout-extrapolate', corresponds to an extrapolation beyond the 'large' scale (call it 'extra-large'). Alterations made for each task split and their settings are in Table 2 in App. A; a sketch of the split is given below.
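To make the scale protocol concrete, the following minimal Python sketch shows one way a single numeric scale parameter could be sampled in training and fixed for the holdout levels. All names (SCALES, training_scale, holdout_scale) and the extrapolation factor are hypothetical illustrations; the actual per-task settings are those listed in Table 2 in App. A.

```python
import random

# Hypothetical scale settings for one task, e.g. the delay-corridor length
# in Spot the Difference. Real per-task values are in Table 2, App. A.
SCALES = {"small": 5, "large": 15}

def training_scale():
    # During training, uniformly sample between the two training scales.
    return random.choice([SCALES["small"], SCALES["large"]])

def holdout_scale(kind):
    if kind == "holdout-interpolate":
        # 'medium': between small and large.
        return (SCALES["small"] + SCALES["large"]) // 2
    if kind == "holdout-extrapolate":
        # 'extra-large': beyond large (the factor here is illustrative).
        return 2 * SCALES["large"]
    raise ValueError(f"unknown holdout level: {kind}")
```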
3 The Memory Recall Agent

Our agent, the Memory Recall Agent (MRA), incorporates five components: 1) a pixel-input convolutional, residual network, 2) a working memory, 3) a slot-based episodic memory, 4) an auxiliary contrastive loss for representation learning (van den Oord et al., 2018), 5) a jumpy backpropagation-through-time training regime. Our agent architecture is shown in Figure 4(a). The overall agent is built on top of the IMPALA model (Espeholt et al., 2018) and is trained in the same way with the exceptions described below. Component descriptions follow.

Pixel Input  Pixel input is fed to a convolutional neural network, as is common in recent agents, followed by a residual block (He et al., 2015). The precise hyper-parameters are given in App. C.2: we use three convolutional layers followed by two residual layers. The output of this process is x_t in Figure 4(a) and serves as input to three other parts of the network: 1) part of the input to the working memory module, 2) the formation of keys and queries for the episodic memory, 3) part of the target for the contrastive predictive coding.

Working Memory  Working memory is often realized through latent recurrent neural networks (RNNs) with some form of gating, such as LSTMs and Relational Memory architectures (Hochreiter and Schmidhuber, 1997; Santoro et al., 2018). These working memory models calculate the next set of hidden units using the current input and the previous hidden units. Although models which rely on working memory can perform well on a variety of problems, their ability to tackle dependencies and represent variables over long time periods is limited. The short-term nature of working memory is pragmatically, and perhaps unintentionally, reflected in the use of truncated backpropagation through time and the tendency for gradients through these RNNs to explode or vanish. Our agent uses an LSTM as a model of working memory. As we shall see in experiments, this module is able to perform working memory-like operations on tasks: i.e., learn calculations involving short-term memory. As depicted in Figure 4(a), the LSTM takes as input x_t from the pixel input network and m_t from the episodic memory module. As in Espeholt et al. (2018), the LSTM has two heads as output, producing the policy π and the baseline value function V. In our architecture these are derived from the output of the LSTM, h_t. h_t is also used to form episodic memories, as described below.

Episodic Memory (MEM)  If our agent only consisted of the working memory and pixel input described above, it would be almost identical to the model in IMPALA (Espeholt et al., 2018), an already powerful RL agent. But MRA also includes a slot-based episodic memory module, as that can store values more reliably and longer-term than an LSTM, is less susceptible to the intricacies of gradient propagation, and its fundamental operations afford the agent different abilities (as observed in our experiments). The MEM in MRA has a key-value structure which the agent reads from and writes to at every time-step (see Fig. 4(a)).
MRA implements a mechanism to learn how to store summaries of past experiences and retrieve relevant information when it encounters similar contexts. The reads from memory are used as additional inputs to the neural network (controller), which produces the model predictions. This effectively augments the controller's working memory capabilities with experiences from different time scales retrieved from the MEM, which facilitates learning long-term dependencies, a difficult task when relying entirely on backpropagation in recurrent architectures (Hochreiter and Schmidhuber, 1997; Graves et al., 2016; Vaswani et al., 2017).

Figure 4: The Memory Recall Agent (MRA) architecture. (a) Architecture of the MRA. (b) Contrastive Predictive Coding loss for MRA. Here p_i is the pixel input embedding x_t from step t, and v_i is the LSTM hidden state h_t. k_i is the key used for reading; we compute it from p_i and v_i. q_t is the query we use to compare against keys to find nearest neighbors. The MEM has a number of slots, indexed by i. Each slot stores activations from the pixel input network and LSTM from previous times t_i in the past. The MEM acts as a fixed-size circular (first-in-first-out) buffer: new keys and values are added, overwriting the least recently added entry if there are no unused slots available. The contents of the episodic memory buffer are wiped at the end of each episode.

Memory Writing  Crucially, writing to episodic memory is done without gradients. At each step a free slot is chosen for writing, denoted i. Next, the following is stored:

p_i \leftarrow x_t; \quad v_i \leftarrow h_t; \quad k_i \leftarrow W_k[p_i, v_i] + b_k \qquad (1)

where p_i is the pixel input embedding from step t and v_i is the LSTM hidden state (if the working memory is something else, e.g. a feedforward network, this would be the output activations). k_i is the key, used for reading (described below), computed as a simple linear function of the other two values stored. Caching the key speeds up memory reads significantly. However, the key can become stale as the weights and biases W_k and b_k are learnt (the procedure for learning them is described below under Jumpy Backpropagation). In our experiments we did not see an adverse effect of this staleness.

Memory Reading  The agent uses a form of dot-product attention (Bahdanau et al., 2015) over its MEM to select the most relevant events to provide as input m_t to the LSTM. The query q_t is a linear transform of the pixel input embedding x_t and the LSTM hidden state from the previous time-step h_{t-1}, with weight W_q and bias b_q:

q_t = W_q[x_t, h_{t-1}] + b_q \qquad (2)

The query q_t is then compared against the keys in MEM as in Pritzel et al. (2017): let (p_j, v_j, k_j), 1 \leq j \leq K, be the K nearest neighbors to q_t from MEM, under an L2 norm between k_j and q_t.

m_t = \sum_{j=1}^{K} w_j v_j \quad \text{where} \quad w_j \propto \frac{1}{\epsilon + \lVert q_t - W_k[p_j, v_j] - b_k \rVert_2^2} \qquad (3)

We compute a weighted aggregate of the values v_j of the K nearest neighbors, weighted by the inverse of each neighbor-key's distance to the query. Note that the distance is re-calculated from values stored in the MEM, via the linear projection W_k, b_k in (1). We concatenate the resulting weighted aggregate memory m_t with the embedded pixel input x_t, and pass it as input to the working memory as shown in Figure 4(a).
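As a concrete illustration of equations (1)-(3) and the FIFO slot scheme described in the Figure 4 caption, here is a minimal PyTorch sketch. The class and parameter names are hypothetical, not the paper's code; the stop-gradient of the jumpy-backpropagation scheme described next appears here as .detach().

```python
import torch

class EpisodicMemory:
    """Minimal sketch of the MEM: a fixed-size FIFO key-value buffer."""

    def __init__(self, num_slots, p_dim, v_dim, key_dim, k_neighbors=8, eps=1e-3):
        self.p = torch.zeros(num_slots, p_dim)    # stored pixel embeddings x_t
        self.v = torch.zeros(num_slots, v_dim)    # stored LSTM states h_t
        self.k = torch.zeros(num_slots, key_dim)  # cached (possibly stale) keys
        self.W_k = torch.nn.Linear(p_dim + v_dim, key_dim)  # learnt (W_k, b_k)
        self.next_slot, self.size = 0, 0
        self.num_slots, self.K, self.eps = num_slots, k_neighbors, eps

    def write(self, x_t, h_t):
        # Eq. (1) with the stop-gradient of Eq. (4): gradients never flow back
        # into the pixel network or the LSTM through a write.
        i = self.next_slot
        self.p[i], self.v[i] = x_t.detach(), h_t.detach()
        self.k[i] = self.W_k(torch.cat([self.p[i], self.v[i]])).detach()
        self.next_slot = (i + 1) % self.num_slots  # FIFO overwrite of oldest slot
        self.size = min(self.size + 1, self.num_slots)

    def read(self, q_t):
        # Eqs. (2)-(3): search neighbors with the cached keys, then re-key them
        # with the current (W_k, b_k) so those parameters receive gradients.
        # Assumes at least one write has already occurred.
        dists = torch.norm(self.k[:self.size] - q_t, dim=1)
        idx = torch.topk(dists, min(self.K, self.size), largest=False).indices
        keys = self.W_k(torch.cat([self.p[idx], self.v[idx]], dim=1))
        w = 1.0 / (self.eps + ((q_t - keys) ** 2).sum(dim=1))
        w = w / w.sum()  # normalize the proportional weights of Eq. (3)
        return (w.unsqueeze(1) * self.v[idx]).sum(dim=0)  # m_t
```

The returned m_t would then be concatenated with x_t and fed to the working memory, as in Figure 4(a).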
Jumpy backpropagation  We now turn to how gradients flow into memory writes. Full backpropagation can become computationally infeasible, as this would require backpropagation into every write that is read from, and so on. Thus as a new (p_i, v_i, k_i)-triplet is added to the MEM, there are trade-offs to be made regarding computational complexity versus performance of the agent. To make it more computationally tractable, we place a stop-gradient in the memory write. In particular, the write operation for the key in (1) becomes:

k_i \leftarrow W_k[\mathrm{SG}(p_i), \mathrm{SG}(v_i)] + b_k \qquad (4)

where SG(·) denotes that the gradients are stopped. This allows the parameters W_k and b_k to receive gradients from the loss during writing and reading, while at the same time bounding the computational complexity, as the gradients do not flow back into the recurrent working memory (or via that back into the MEM). To re-calculate the distances, we want to use these learnt parameters rather than, say, a random projection, so we need to store the arguments x_t and h_t of the key-generating linear transform W_k, b_k for all previous time-steps. Thus in the MEM we store the full (p_i, v_i, k_i)-triplet, where p_i = x_{t_i}, v_i = h_{t_i} and t_i is the step at which write i was made. We call this technique 'jumpy backpropagation' because the intermediate steps between the current time-step t and the memory write step t_i are not taken into account in the gradient updates.
This approach is similar to Sparse Attentive Backtracking (Ke et al., 2018, SAB), which uses sparse replay by passing gradients only through memories selected as relevant at each step. Our model differs in that it does not have a fixed chunking scheme and does not do full backpropagation through the architecture (which in our case quickly becomes intractable). Our approach has minimal computational overhead as we only recompute the keys for the nearest neighbors.

Auxiliary Unsupervised Losses  An agent with good memory provides a good basis for forming a rich representation of the environment, as it captures a history of the states visited by the agent. This is the primary basis for many rich probabilistic state representations in reinforcement learning such as belief states and predictive state representations (Littman and Sutton, 2002). Auxiliary unsupervised losses can significantly improve agent performance (Jaderberg et al., 2016). Recently it has been shown that agents augmented with one-step contrastive predictive coding (van den Oord et al., 2018, CPC) can learn belief state representations of the environment (Guo et al., 2018). Thus in MRA we combine the working and episodic memory mechanisms listed above with a CPC unsupervised loss to imbue the agent with a rich state representation. The CPC auxiliary loss is added to the usual RL losses, and is of the following form:

\sum_{\tau=1}^{N} \mathrm{CPCLoss}\left[ h_t; x_{t+1}, x_{t+2}, \ldots, x_{t+\tau} \right] \qquad (5)

where CPCLoss is from van den Oord et al. (2018), h_t is the working memory hidden state, and x_{t+\tau} is the encoded pixel input τ steps in the future. N is the number of CPC steps (typically 10 or 50 in our experiments). See Figure 4(b) for an illustration; further details and equations elaborating on this loss are in App. C.3.
Reconstruction losses have also been used as an auxiliary task (Jaderberg et al., 2016; Wayne et al., 2018) and we include this as a baseline in our experiments. Our reconstruction baseline minimizes the L2 distance between the predicted reward and predicted pixel input and the true reward and pixel input, using the working memory state h_t as input. Details of this baseline are given in App. C.4.
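To illustrate the objective in (5), here is a minimal InfoNCE-style sketch in PyTorch. It assumes one linear predictor per lookahead step τ and uses other batch elements as negative samples; the function name and these choices are assumptions for illustration, and the paper's exact CPC formulation is the one detailed in App. C.3.

```python
import torch
import torch.nn.functional as F

def cpc_auxiliary_loss(h_t, future_x, predictors):
    """Sketch of an InfoNCE-style CPC loss in the spirit of Eq. (5).

    h_t:        (B, H) working-memory hidden states at time t.
    future_x:   (N, B, D) encoded pixel inputs x_{t+1..t+N}.
    predictors: list of N linear layers, one per lookahead step tau.
    """
    total = 0.0
    for tau, w in enumerate(predictors):
        pred = w(h_t)                       # (B, D) prediction of x_{t+tau+1}
        logits = pred @ future_x[tau].t()   # (B, B) scores vs. all batch targets
        labels = torch.arange(h_t.size(0))  # positives lie on the diagonal
        total = total + F.cross_entropy(logits, labels)
    return total
```

This term would simply be added to the usual RL losses during training, alongside the policy and value objectives.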
4 Experiments

Setup  We ran 10 ablations on the MRA architecture, on the training and the two holdout levels:
• Working Memory component: either a feedforward neural network ('FF' for short) or an LSTM. The LSTM-only baseline corresponds to IMPALA (Espeholt et al., 2018).
• With or without using the episodic memory module ('MEM').
• With or without an auxiliary unsupervised loss (either CPC or reconstruction loss ('REC')).
• With or without jumpy backpropagation, for MRA (i.e. LSTM + MEM + CPC).
Given that the experiments are computationally demanding, we only performed small variations as part of our hyper-parameter tuning process for each task (see App. D).

We hypothesize that in general the agent should perform best in training, somewhat worse on the holdout-interpolation level and worst on the holdout-extrapolation level. That is, we expect to see a generalization gap. Our results validated this hypothesis for the tasks that were much harder for agents than for humans.

4.1 Full comparison

We computed human-normalized scores (details in App. B) and plotted them into a heatmap (Fig 5), sorted such that the model with the highest train scores on average is the top row and the task with the highest train scores on average is the leftmost column. The heatmap suggests that the MRA architecture, LSTM + MEM + CPC, broadly outperforms the other models (App. B Table 3). This ranking was almost always maintained across train and holdout levels, despite MRA performing worse than the LSTM-only baseline on What Then Where. What Then Where was one of the tasks where all models did poorly, along with Spot the Difference: Multi-Object (rightmost columns in heatmap). At the other end of the difficulty spectrum, LSTM + MEM had superhuman scores on Visible Goal Procedural Maze in training and on Transitive Inference in training and holdout, and further adding CPC or REC boosted the scores even higher.

Figure 5: Heatmap of ablations per task sorted by normalized score for Train, Holdout-Interpolate, Holdout-Extrapolate. The same plot with standard errors is in App. B Fig. 14.

4.2 Results

Different memory systems worked best for different kinds of tasks, but the MRA architecture's combination of LSTM + MEM + CPC did the best overall on training and holdout (Fig. 6). Removing jumpy backpropagation from MRA hurt performance in five Memory Suite tasks (App. B Fig. 10), while performance was the same in the remaining ones (App. B Fig. 11 and 12).

Generalization gap widens as task difficulty increases  The hypothesized generalization gap was minimal for some tasks, e.g. AVM and Continuous Recognition, but significant for others, e.g. What Then Where and Spot the Difference: Multi-Object (Fig 7).
We observed that the gap tended to be wider as the task difficulty went up, and that in PsychLab, the two tasks where the scale was the number of trials seemed to be easier than the other two tasks where the scale was the delay duration.

Figure 6: Normalized scores averaged across tasks.

MEM critical on some tasks, is enhanced by auxiliary unsupervised loss  Adding MEM improved scores on nine tasks in training, six in holdout-interpolate, and six in holdout-extrapolate. Adding MEM alone, without an auxiliary unsupervised loss, was enough to improve scores on AVM and Continuous Recognition, all Spot the Difference tasks except Spot the Difference: Multi-Object, all Goal Navigation tasks except Visible Goal Procedural Maze, and also on Transitive Inference. Adding MEM helped to significantly boost holdout performance for Transitive Inference, AVM, and Continuous Recognition. For the two PsychLab tasks this finding was in line with our expectations, since both can be solved by memorizing single images and determining exact matches, and thus an external episodic memory would be the most useful. For Transitive Inference, in training MEM helped when the working memory was FF but made little difference with an LSTM, while on holdout MEM helped noticeably for both FF and LSTM. In Change Detection and Spot the Difference: Multi-Object, adding MEM alone had little or no effect, but combining it with CPC or REC provided a noticeable boost.

Synergistic effect of MEM + CPC, for LSTM  On average, adding either the MEM + CPC stack or the MEM + REC stack to any working memory appeared to improve the agent's ability to generalize to holdout levels (Fig. 6). Interestingly, on several tasks we found that combining MEM + CPC had a synergistic effect when the working memory was an LSTM: the performance boost from adding MEM + CPC together was larger than the sum of the boosts from adding MEM or CPC alone. We observed this phenomenon in seven tasks in training, six in holdout-interpolate, and six in holdout-extrapolate. Among these, the tasks where there was MEM + CPC synergy across training, holdout-interpolate, and holdout-extrapolate were: the easiest task, Visible Goal Procedural Maze; Visible Goal with Buildings; Spot the Difference: Basic; and the hardest task, Spot the Difference: Multi-Object.

CPC vs. REC  CPC was better than REC on all Spot the Difference tasks and on the two harder PsychLab tasks, Change Detection and What Then Where. On the other two PsychLab tasks there was no difference between CPC and REC. However, REC was better on all Goal Navigation tasks except Invisible Goal Empty Arena. When averaged out, REC was more useful when the working memory was FF, but CPC was more useful for an LSTM working memory.

Figure 7: Generalization gap is smaller for AVM and Continuous Recognition, larger for What Then Where and Spot the Difference: Multi-Object. Dotted lines indicate human baseline scores. See other curves in App. B Fig. 13.

5 Discussion & Future Work

We constructed a diverse set of environments2 to test memory-specific generalization, based on tasks designed to identify working memory and episodic memory in humans, and also developed an agent that demonstrates many of these cognitive abilities.
We propose both a testbed and benchmark for further work on agents with memory, and demonstrate how better understanding the memory and generalization abilities of reinforcement learning agents can point to new avenues of research to improve agent performance and data efficiency. There is still room for improvement on the trickiest tasks in the suite, where the agent fared relatively poorly. In particular, solving Spot the Difference: Motion might need a generative model that enables forward planning to imagine how future motion unrolls (e.g., Racanière et al., 2017). Our results indicate that adding an auxiliary loss such as CPC or reconstruction loss to an architecture that already has an external episodic memory improves generalization performance on holdout sets, sometimes synergistically. This suggests that existing agents that use episodic memory, such as DNC and NEC, could potentially boost performance by implementing an additional auxiliary unsupervised loss.

2Available at https://github.com/deepmind/dm_memorytasks.

Acknowledgements

We would like to thank Jessica Hamrick, Jean-Baptiste Lespiau, Frederic Besse, Josh Abramson, Oriol Vinyals, Federico Carnevale, Charlie Beattie, Piotr Trochim, Piermaria Mendolicchio, Aaron van den Oord, Chloe Hillier, Tom Ward, Ricardo Barreira, Matthew Mauger, Thomas Köppe, Pauline Coquinot and many others at DeepMind for insightful discussions, comments and feedback on this work.

References

Unity. http://unity3d.com/.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

A. Banino, C. Barry, B. Uria, C. Blundell, T. Lillicrap, P. Mirowski, A. Pritzel, M. J. Chadwick, T. Degris, J. Modayil, et al. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429, 2018.

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. DeepMind Lab. CoRR, abs/1612.03801, 2016.

C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. Rae, D. Wierstra, and D. Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.

K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018. URL http://arxiv.org/abs/1802.01561.

A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016. doi: 10.1038/nature20101. URL https://doi.org/10.1038/nature20101.

Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos.
Neural predictive belief representations. CoRR, abs/1811.06407, 2018. URL http://arxiv.org/abs/1811.06407.

S. Hansen, A. Pritzel, P. Sprechmann, A. Barreto, and C. Blundell. Fast deep reinforcement learning using online adjustments from the past. In Advances in Neural Information Processing Systems, pages 10567–10577, 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.

M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

N. R. Ke, A. Goyal, O. Bilaniuk, J. Binas, M. C. Mozer, C. Pal, and Y. Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. CoRR, abs/1809.03702, 2018. URL http://arxiv.org/abs/1809.03702.

J. Z. Leibo, C. de Masson d'Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, and M. Botvinick. Psychlab: A psychology laboratory for deep reinforcement learning agents. CoRR, abs/1801.08116, 2018.

M. L. Littman and R. S. Sutton. Predictive representations of state. In Advances in Neural Information Processing Systems, pages 1555–1561, 2002.

A. Miyake and P. Shah. Models of working memory: Mechanisms of active maintenance and executive control. Cambridge University Press, 1999. doi: 10.1017/CBO9781139174909.

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.

J. Pineau. Reproducible, reusable, and robust reinforcement learning (invited talk). Advances in Neural Information Processing Systems, 2018, 2018.

A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell. Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2827–2836. JMLR.org, 2017.

S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5690–5701, 2017.

S. Ritter, J. X. Wang, Z. Kurth-Nelson, S. M. Jayakumar, C. Blundell, R. Pascanu, and M. Botvinick. Been there, done that: Meta-learning with episodic recall. arXiv preprint arXiv:1805.09692, 2018.

A. Santoro, R. Faulkner, D. Raposo, J. W. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. P. Lillicrap. Relational recurrent neural networks. CoRR, abs/1806.01822, 2018. URL http://arxiv.org/abs/1806.01822.

C. Smith and L. R. Squire. Declarative memory, awareness, and transitive inference. Journal of Neuroscience, 25(44):10138–10146, 2005. ISSN 0270-6474. doi: 10.1523/JNEUROSCI.2731-05.2005. URL http://www.jneurosci.org/content/25/44/10138.

S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Weakly supervised memory networks. CoRR, abs/1503.08895, 2015. URL http://arxiv.org/abs/1503.08895.

E. Tulving. Episodic memory: From mind to brain.
Annual Review of Psychology, 53(1):1–25, 2002. doi: 10.1146/annurev.psych.53.100901.135114. URL https://doi.org/10.1146/annurev.psych.53.100901.135114. PMID: 11752477.

E. Tulving and D. Murray. Elements of episodic memory. Canadian Psychology, 26(3):235–238, 1985.

A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.

G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap. Unsupervised predictive memory in a goal-directed agent. CoRR, abs/1803.10760, 2018.

V. F. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. P. Reichert, T. P. Lillicrap, E. Lockhart, M. Shanahan, V. Langston, R. Pascanu, M. Botvinick, O. Vinyals, and P. Battaglia. Relational deep reinforcement learning. CoRR, abs/1806.01830, 2018. URL http://arxiv.org/abs/1806.01830.
", "award": [], "sourceid": 6749, "authors": [{"given_name": "Meire", "family_name": "Fortunato", "institution": "DeepMind"}, {"given_name": "Melissa", "family_name": "Tan", "institution": "Deepmind"}, {"given_name": "Ryan", "family_name": "Faulkner", "institution": "Deepmind"}, {"given_name": "Steven", "family_name": "Hansen", "institution": "DeepMind"}, {"given_name": "Adrià", "family_name": "Puigdomènech Badia", "institution": "Google DeepMind"}, {"given_name": "Gavin", "family_name": "Buttimore", "institution": "DeepMind"}, {"given_name": "Charles", "family_name": "Deck", "institution": "Deepmind"}, {"given_name": "Joel", "family_name": "Leibo", "institution": "DeepMind"}, {"given_name": "Charles", "family_name": "Blundell", "institution": "DeepMind"}]}