{"title": "Relational recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7299, "page_last": 7310, "abstract": "Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected -- i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module -- a Relational Memory Core (RMC) -- which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (BoxWorld & Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.", "full_text": "Relational recurrent neural networks\n\nAdam Santoro*\u03b1, Ryan Faulkner*\u03b1, David Raposo*\u03b1, Jack Rae\u03b1\u03b2, Mike Chrzanowski\u03b1,\n\nTh\u00e9ophane Weber\u03b1, Daan Wierstra\u03b1, Oriol Vinyals\u03b1, Razvan Pascanu\u03b1, Timothy Lillicrap\u03b1\u03b2\n\n*Equal Contribution\n\n\u03b1DeepMind\n\nLondon, United Kingdom\n\n\u03b2CoMPLEX, Computer Science, University College London\n\nLondon, United Kingdom\n\n{adamsantoro; rfaulk; draposo; jwrae; chrzanowskim;\n\ntheophane; weirstra; vinyals; razp; countzero}@google.com\n\nAbstract\n\nMemory-based neural networks model temporal data by leveraging an ability to\nremember information for long periods. It is unclear, however, whether they also\nhave an ability to perform complex relational reasoning with the information they\nremember. 
Here, we \ufb01rst con\ufb01rm our intuitions that standard memory architectures\nmay struggle at tasks that heavily involve an understanding of the ways in which\nentities are connected \u2013 i.e., tasks involving relational reasoning. We then improve\nupon these de\ufb01cits by using a new memory module \u2013 a Relational Memory Core\n(RMC) \u2013 which employs multi-head dot product attention to allow memories to\ninteract. Finally, we test the RMC on a suite of tasks that may pro\ufb01t from more\ncapable relational reasoning across sequential information, and show large gains\nin RL domains (e.g. Mini PacMan), program evaluation, and language modeling,\nachieving state-of-the-art results on the WikiText-103, Project Gutenberg, and\nGigaWord datasets.\n\n1\n\nIntroduction\n\nHumans use sophisticated memory systems to access and reason about important information regard-\nless of when it was initially perceived [1, 2]. In neural network research many successful approaches\nto modeling sequential data also use memory systems, such as LSTMs [3] and memory-augmented\nneural networks generally [4\u20137]. Bolstered by augmented memory capacities, bounded computational\ncosts over time, and an ability to deal with vanishing gradients, these networks learn to correlate\nevents across time to be pro\ufb01cient at storing and retrieving information.\nHere we propose that it is fruitful to consider memory interactions along with storage and retrieval.\nAlthough current models can learn to compartmentalize and relate distributed, vectorized memories,\nthey are not biased towards doing so explicitly. We hypothesize that such a bias may allow a model\nto better understand how memories are related, and hence may give it a better capacity for relational\nreasoning over time. We begin by demonstrating that current models do indeed struggle in this\ndomain by developing a toy task to stress relational reasoning of sequential information. 
Using a new\nRelational Memory Core (RMC), which uses multi-head dot product attention to allow memories to\ninteract with each other, we solve and analyze this toy problem. We then apply the RMC to a suite\nof tasks that may pro\ufb01t from more explicit memory-memory interactions, and hence, a potentially\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fincreased capacity for relational reasoning across time: partially observed reinforcement learning\ntasks, program evaluation, and language modeling on the Wikitext-103, Project Gutenberg, and\nGigaWord datasets.\n\n2 Relational reasoning\n\nWe take relational reasoning to be the process of understanding the ways in which entities are\nconnected and using this understanding to accomplish some higher order goal [8]. For example,\nconsider sorting the distances of various trees to a park bench: the relations (distances) between the\nentities (trees and bench) are compared and contrasted to produce the solution, which could not be\nreached if one reasoned about the properties (positions) of each individual entity in isolation.\nSince we can often quite \ufb02uidly de\ufb01ne what constitutes an \u201centity\u201d or a \u201crelation\u201d, one can imagine a\nspectrum of neural network inductive biases that can be cast in the language of relational reasoning\n1. For example, a convolutional kernel can be said to compute a relation (linear combination)\nof the entities (pixels) within a receptive \ufb01eld. Some previous approaches make the relational\ninductive bias more explicit: in message passing neural networks [e.g. 9\u201312], the nodes comprise\nthe entities and relations are computed using learnable functions applied to nodes connected with\nan edge, or sometimes reducing the relational function to a weighted sum of the source entities [e.g.\n13, 14]. 
In Relation Networks [15\u201317] entities are obtained by exploiting spatial locality in the input\nimage, and the model focuses on computing binary relations between each entity pair. Even further,\nsome approaches emphasize that more capable reasoning may be possible by employing simple\ncomputational principles; by recognizing that relations might not always be tied to proximity in space,\nnon-local computations may be better able to capture the relations between entities located far away\nfrom each other [18, 19].\nIn the temporal domain relational reasoning could comprise a capacity to compare and contrast\ninformation seen at different points in time [20]. Here, attention mechanisms [e.g. 21, 22] implicitly\nperform some form of relational reasoning; if previous hidden states are interpreted as entities, then\ncomputing a weighted sum of entities using attention helps to remove the locality bias present in\nvanilla RNNs, allowing embeddings to be better related using content rather than proximity.\nSince our current architectures solve complicated temporal tasks they must have some capacity for\ntemporal relational reasoning. However, it is unclear whether their inductive biases are limiting, and\nwhether these limitations can be exposed with tasks demanding particular types of temporal relational\nreasoning. For example, memory-augmented neural networks [4\u20137] solve a compartmentalization\nproblem with a slot-based memory matrix, but may have a harder time allowing memories to interact,\nor relate, with one another once they are encoded. LSTMs [3, 23], on the other hand, pack all\ninformation into a common hidden memory vector, potentially making compartmentalization and\nrelational reasoning more dif\ufb01cult.\n\n3 Model\n\nOur guiding design principle is to provide an architectural backbone upon which a model can learn\nto compartmentalize information, and learn to compute interactions between compartmentalized\ninformation. 
To accomplish this we assemble building blocks from LSTMs, memory-augmented\nneural networks, and non-local networks (in particular, the Transformer seq2seq model [22]). Similar\nto memory-augmented architectures we consider a \ufb01xed set of memory slots; however, we allow for\ninteractions between memory slots using an attention mechanism. As we will describe, in contrast to\nprevious work we apply attention between memories at a single time step, and not across all previous\nrepresentations computed from all previous observations.\n\n3.1 Allowing memories to interact using multi-head dot product attention\n\nWe will \ufb01rst assume that we do not need to consider memory encoding; that is, that we already\nhave some stored memories in matrix M, with row-wise compartmentalized memories mi. To allow\nmemories to interact we employ multi-head dot product attention (MHDPA) [22], also known as\n\n1Indeed, in the broadest sense any multivariable function must be considered \u201crelational.\u201d\n\n2\n\n\fFigure 1: Relational Memory Core. (a) The RMC receives a previous memory matrix and input\nvector as inputs, which are passed to the MHDPA module labeled with an \u201cA\u201d. (b). Linear projections\nare computed for each memory slot, and input vector, using row-wise shared weights W q for the\nqueries, W k for the keys, and W v for the values. (c) The queries, keys, and values are then compiled\ninto matrices and softmax(QK T )V is computed. The output of this computation is a new memory\nwhere information is blended across memories based on their attention weights. An MLP is applied\nrow-wise to the output of the MHDPA module (a), and the resultant memory matrix is gated, and\npassed on as the core output or next memory state.\n\nself-attention. 
Using MHDPA, each memory will attend over all of the other memories, and will\nupdate its content based on the attended information.\nFirst, a simple linear projection is used to construct queries ($Q = M W^q$), keys ($K = M W^k$), and\nvalues ($V = M W^v$) for each memory (i.e. row $m_i$) in matrix $M$. Next, we use the queries, $Q$, to\nperform a scaled dot-product attention over the keys, $K$. The returned scalars can be put through a\nsoftmax-function to produce a set of weights, which can then be used to return a weighted average\nof values from $V$ as $A(Q, K, V) = \\mathrm{softmax}\\left(\\frac{Q K^T}{\\sqrt{d_k}}\\right) V$, where $d_k$ is the dimensionality of the key\nvectors used as a scaling factor. Equivalently:\n\n$$A_\\theta(M) = \\mathrm{softmax}\\left(\\frac{M W^q (M W^k)^T}{\\sqrt{d_k}}\\right) M W^v, \\quad \\text{where } \\theta = (W^q, W^k, W^v) \\qquad (1)$$\n\nThe output of $A_\\theta(M)$, which we will denote as $\\widetilde{M}$, is a matrix with the same dimensionality as\n$M$. $\\widetilde{M}$ can be interpreted as a proposed update to $M$, with each $\\widetilde{m}_i$ comprising information from\nmemories $m_j$. Thus, in one step of attention each memory is updated with information originating\nfrom other memories, and it is up to the model to learn (via parameters $W^q$, $W^k$, and $W^v$) how to\nshuttle information from memory to memory.\nAs implied by the name, MHDPA uses multiple heads. We implement this by producing $h$ sets of\nqueries, keys, and values, using unique parameters to compute a linear projection from the original\nmemory for each head $h$. We then independently apply an attention operation for each head. 
For\nexample, if $M$ is an $N \\times F$ dimensional matrix and we employ two attention heads, then we compute\n$\\widetilde{M}^1 = A_\\theta(M)$ and $\\widetilde{M}^2 = A_\\phi(M)$, where $\\widetilde{M}^1$ and $\\widetilde{M}^2$ are $N \\times F/2$ matrices, $\\theta$ and $\\phi$ denote unique\nparameters for the linear projections to produce the queries, keys, and values, and $\\widetilde{M} = [\\widetilde{M}^1 : \\widetilde{M}^2]$,\nwhere $[:]$ denotes column-wise concatenation. Intuitively, heads could be useful for letting a memory\nshare different information, to different targets, using each head.\n\n[Figure 1 diagram labels: (a) core with previous memory, input, output, attention module A, residual connections, MLP, and gating (computation of gates not depicted); (b) multi-head dot product attention with query, key, and value projections of memory and input; (c) attention weights normalized with a row-wise softmax, weighted average of values, updated memory.]\n\n3.2 Encoding new memories\n\nWe assumed that we already had a matrix of memories $M$. Of course, memories instead need to be\nencoded as new inputs are received. Suppose then that $M$ is some randomly initialised memory. We\ncan efficiently incorporate new information $x$ into $M$ with a simple modification to equation 1:\n\n$$\\widetilde{M} = \\mathrm{softmax}\\left(\\frac{M W^q ([M; x] W^k)^T}{\\sqrt{d_k}}\\right) [M; x] W^v, \\qquad (2)$$\n\nwhere we use $[M; x]$ to denote the row-wise concatenation of $M$ and $x$. Since we use $[M; x]$ when\ncomputing the keys and values, and only $M$ when computing the queries, $\\widetilde{M}$ is a matrix with same\ndimensionality as $M$. Thus, equation 2 is a memory-size preserving attention operation that includes\nattention over the memories and the new observations. Notably, we use the same attention operation\nto efficiently compute memory interactions and to incorporate new information.\nWe also note the possible utility of this operation when the memory consists of a single vector rather\nthan a matrix. 
In this case the model may learn to pick and choose which information from the input\nshould be written into the vector memory state by learning how to attend to the input, conditioned\non what is contained in the memory already. This is possible in LSTMs via the gates, though at a\ndifferent granularity. We return to this idea, and the possible compartmentalization that can occur via\nthe heads even in the single-memory-slot case, in the discussion.\n\n3.3 Introducing recurrence and embedding into an LSTM\n\nSuppose we have a temporal dimension with new observations at each timestep, $x_t$. Since $M$ and $\\widetilde{M}$\nare the same dimensionality, we can naively introduce recurrence by first randomly initialising $M$,\nand then updating it with $\\widetilde{M}$ at each timestep. We chose to do this by embedding this update into an\nLSTM. Suppose memory matrix $M$ can be interpreted as a matrix of cell states, usually denoted as $C$,\nfor a 2-dimensional LSTM. We can make the operations of individual memories $m_i$ nearly identical\nto those in a normal LSTM cell state as follows (subscripts are overloaded to denote the row from a\nmatrix, and timestep; e.g., $m_{i,t}$ is the $i$th row from $M$ at time $t$).\n\n$$s_{i,t} = (h_{i,t-1}, m_{i,t-1}) \\qquad (3)$$\n$$f_{i,t} = W^f x_t + U^f h_{i,t-1} + b^f \\qquad (4)$$\n$$i_{i,t} = W^i x_t + U^i h_{i,t-1} + b^i \\qquad (5)$$\n$$o_{i,t} = W^o x_t + U^o h_{i,t-1} + b^o \\qquad (6)$$\n$$m_{i,t} = \\sigma(f_{i,t} + \\tilde{b}^f) \\circ m_{i,t-1} + \\sigma(i_{i,t}) \\circ \\underbrace{g_\\psi(\\widetilde{m}_{i,t})} \\qquad (7)$$\n$$h_{i,t} = \\sigma(o_{i,t}) \\circ \\tanh(m_{i,t}) \\qquad (8)$$\n$$s_{i,t+1} = (m_{i,t}, h_{i,t}) \\qquad (9)$$\n\nThe underbrace denotes the modification to a standard LSTM. 
In practice we did not \ufb01nd output gates\nnecessary \u2013 please see the url in the footnote for our Tensor\ufb02ow implementation of this model in the\nSonnet library 2, and for the exact formulation we used, including our choice for the g\u03c8 function\n(brie\ufb02y, we found a row/memory-wise MLP with layer normalisation to work best). There is also an\ninteresting opportunity to introduce a different kind of gating, which we call \u2018memory\u2019 gating, which\nresembles previous gating ideas [24, 3]. Instead of producing scalar gates for each individual unit\n(\u2018unit\u2019 gating), we can produce scalar gates for each memory row by converting W f , W i, W o, U f ,\nU i, and U o from weight matrices into weight vectors, and by replacing the element-wise product in\nthe gating equations with scalar-vector multiplication.\nSince parameters W f , W i, W o, U f , U i, U o, and \u03c8 are shared for each mi, we can modify the\nnumber of memories without affecting the number of parameters. Thus, tuning the number of\nmemories and the size of each memory can be used to balance the overall storage capacity (equal\nto the total number of units, or elements, in M) and the number of parameters (proportional to the\ndimensionality of mi). We \ufb01nd in our experiments that some tasks require more, but not necessarily\nlarger, memories, and others such as language modeling require fewer, larger memories.\n\n2https://github.com/deepmind/sonnet/blob/master/sonnet/python/modules/\n\nrelational_memory.py\n\n4\n\n\fFigure 2: Tasks. We tested the RMC on a suite of supervised and reinforcement learning tasks.\nNotable are the N th Farthest toy task and language modeling. In the former, the solution requires\nexplicit relational reasoning since the model must sort distance relations between vectors, and not the\nvectors themselves. 
The latter tests the model on a large quantity of natural data and allows us to\ncompare performance to well-tuned models.\n\nThus, we have a number of tune-able parameters: the number of memories, the size of each memory,\nthe number of attention heads, the number of steps of attention, the gating method, and the post-\nattention processor g\u03c8. In the appendix we list the exact con\ufb01gurations for each task.\n\n4 Experiments\n\nHere we brie\ufb02y outline the tasks on which we applied the RMC, and direct the reader to the appendix\nfor full details on each task and details on hyperparameter settings for the model.\n\n4.1\n\nIllustrative supervised tasks\n\nN th Farthest The N th Farthest task is designed to stress a capacity for relational reasoning across\ntime. Inputs are a sequence of randomly sampled vectors, and targets are answers to a question of the\nform: \u201cWhat is the nth farthest vector (in Euclidean distance) from vector m?\u201d, where the vector\nvalues, their IDs, n, and m are randomly sampled per sequence. It is not enough to simply encode and\nretrieve information as in a copy task. Instead, a model must compute all pairwise distance relations\nto the reference vector m, which might also lie in memory, or might not have even been provided as\ninput yet. It must then implicitly sort these distances to produce the answer. We emphasize that the\nmodel must sort distance relations between vectors, and not the vectors themselves.\n\nProgram Evaluation The Learning to Execute (LTE) dataset [25] consists of algorithmic snippets\nfrom a Turing complete programming language of pseudo-code, and is broken down into three cate-\ngories: addition, control, and full program. Inputs are a sequence of characters over an alphanumeric\nvocabulary representing such snippets, and the target is a numeric sequence of characters that is\nthe execution output for the given programmatic input. 
Given that the snippets involve symbolic\nmanipulation of variables, we felt it could strain a model\u2019s capacity for relational reasoning; since\nsymbolic operators can be interpreted as de\ufb01ning a relation over the operands, successful learning\ncould re\ufb02ect an understanding of this relation. To also assess model performance on classical se-\nquence tasks we also evaluated on memorization tasks, in which the output is simply a permuted\nform of the input rather than an evaluation from a set of operational instructions. See the appendix\nfor further experimental details.\n\n4.2 Reinforcement learning\n\nMini Pacman with viewport We follow the formulation of Mini Pacman from [26]. Brie\ufb02y, the\nagent navigates a maze to collect food while being chased by ghosts. However, we implement this\ntask with a viewport: a 5 \u00d7 5 window surrounding the agent that comprises the perceptual input. The\ntask is therefore partially observable, since the agent must navigate the space and take in information\nthrough this viewport. Thus, the agent must predict the dynamics of the ghosts in memory, and plan\nits navigation accordingly, also based on remembered information about which food has already been\n\n5\n\nWhat is the Nth farthest from vector m?x = 339for [19]: x += 597 for[94]: x += 875x if 428 < 778 else 652print(x)BoxWorldMini-PacmanLockKeyLoose KeyAgentGemViewportReinforcement LearningProgram EvaluationNth farthest Language ModelingSupervised LearningIt had 24 step programming abilities, which meant it was highly _____A gold dollar had been proposed several times in the 1830s and 1840s , but was not initially _____Super Mario Land is a 1989 side scrolling platform video _____\fpicked up. We also point the reader to the appendix for a description and results of another RL task\ncalled BoxWorld, which demands relational reasoning in memory space.\n\n4.3 Language Modeling\n\nFinally, we investigate the task of word-based language modeling. 
We model the conditional probability\n$p(w_t \\mid w_{<t})$ of a word $w_t$ given a sequence of observed words $w_{<t} = (w_{t-1}, w_{t-2}, \\ldots, w_1)$.\nLanguage models can be directly applied to predictive keyboard and search-phrase completion,\nor they can be used as components within larger systems, e.g. machine translation [27], speech\nrecognition [28], and information retrieval [29]. RNNs, and most notably LSTMs, have proven to be\nstate-of-the-art on many competitive language modeling benchmarks such as Penn Treebank [30, 31],\nWikiText-103 [32, 33], and the One Billion Word Benchmark [34, 35]. As a sequential reasoning\ntask, language modeling allows us to assess the RMC\u2019s ability to process information over time on a\nlarge quantity of natural data, and compare it to well-tuned models.\nWe focus on datasets with contiguous sentences and a moderately large amount of data. WikiText-103\nsatisfies this set of requirements as it consists of Wikipedia articles shuffled at the article level with\nroughly 100M training tokens, as do two stylistically different sources of text data: books from\nProject Gutenberg3 and news articles from GigaWord v5 [36]. Using the same processing from [32]\nthese datasets consist of 180M training tokens and 4B training tokens respectively, thus they cover\na range of styles and corpus sizes. We choose a similar vocabulary size for all three datasets of\napproximately 250,000, which is large enough to include rare words and numeric values.\n\n5 Results\n\n5.1 N th Farthest\n\nThis task revealed a stark difference between our LSTM and DNC baselines and RMC when training\non 16-dimensional vector inputs. Both LSTM and DNC models failed to surpass 30% best batch\naccuracy, whereas the RMC consistently achieved 91% at the end of training (see figure 5 in the appendix\nfor training curves). 
The RMC achieved similar performance when the dif\ufb01culty of the task was\nincreased by using 32-dimensional vectors, placing a greater demand on high-\ufb01delity memory storage.\nHowever, this performance was less robust with only a small number of seeds/model con\ufb01gurations\ndemonstrating this performance, in contrast to the 16-dimensional vector case where most model\ncon\ufb01gurations succeeded.\nAn attention analysis revealed some notable features of the RMC\u2019s internal functions. Figure 3 shows\nattention weights in the RMC\u2019s memory throughout a sequence: the \ufb01rst row contains a sequence\nwhere the reference vector m was observed last; in the second row it was observed \ufb01rst; and in the\nlast row it was observed in the middle of the sequence. Before m is seen the model seems to shuttle\ninput information into one or two memory slots, as shown by the high attention weights from these\nslots\u2019 queries to the input key. After m is seen, most evident in row three of the \ufb01gure, the model\ntends to change its attention behaviour, with all the memory slots preferentially focusing attention\non those particular memories to which the m was written. Although this attention analysis provides\nsome useful insights, the conclusions we can make are limited since even after a single round of\nattention the memory can become highly distributed, making any interpretations about information\ncompartmentalisation potentially inaccurate.\n\n5.2 Program Evaluation\n\nProgram evaluation performance was assessed via the Learning to Execute tasks [25]. We evaluated\na number of baselines alongside the RMC including an LSTM [3, 37], DNC [5], and a bank of\nLSTMs resembling Recurrent Entity Networks [38] (EntNet) - the con\ufb01gurations for each of these is\ndescribed in the appendix. Best test batch accuracy results are shown in Table 1. The RMC performs\nat least as well as all of the baselines on each task. 
It is marginally surpassed by a small fraction of\nperformance on the double memorization task, but both models effectively solve this task. Further,\nthe results of the RMC outperform all equivalent tasks from [25] which use teacher forcing even when\nevaluating model performance. It\u2019s worth noting that we observed better results when we trained in a\n\n3Project Gutenberg. (n.d.). Retrieved January 2, 2018, from www.gutenberg.org\n\n6\n\n\fFigure 3: Model analysis. Each row depicts the attention matrix at each timestep of a particular\nsequence. The text beneath spells out the particular task for the sequence, which was encoded and\nprovided to the model as an input. We mark in red the vector that is referenced in the task: e.g., if the\nmodel is to choose the 2nd farthest vector from vector 7, then the time point at which vector 7 was\ninput to the model is depicted in red. A single attention matrix shows the attention weights from one\nparticular memory slot (y-axis) to another memory slot (columns), or the input (offset column), with\nthe numbers denoting the memory slot and \u201cinput\u201d denoting the input embedding.\n\nnon-auto-regressive fashion - that is, with no teacher forcing during training. This is likely related to\nthe effect that relaxing the ground truth requirement has on improving model generalization [39] and\nhence, performance. 
It is perhaps more pronounced in these tasks due to the independence of output\ntoken probabilities and also the sharply uni-modal nature of the output distribution (that is, there is\nno ambiguity in the answer given the program).\n\nTable 1: Test per character Accuracy on Program Evaluation and Memorization tasks.\n\nModel                    Add     Control  Program  Copy    Reverse  Double\nLSTM [3, 37]             99.8    97.4     66.1     99.8    99.7     99.7\nEntNet [38]              98.4    98.0     73.4     91.8    100.0    62.3\nDNC [5]                  99.4    83.8     69.5     100.0   100.0    100.0\nRelational Memory Core   99.9    99.6     79.0     100.0   100.0    99.8\n\nTable 2: Validation and test perplexities on WikiText-103, Project Gutenberg, and GigaWord v5.\n\nModel                    WikiText-103 Valid.  WikiText-103 Test  Gutenberg Valid.  Gutenberg Test  GigaWord Test\nLSTM [40]                -                    48.7               -                 -               -\nTemporal CNN [41]        -                    45.2               -                 -               -\nGated CNN [42]           -                    37.2               -                 -               -\nLSTM [32]                34.1                 34.3               41.8              45.5            43.7\nQuasi-RNN [43]           32                   33                 -                 -               -\nRelational Memory Core   30.8                 31.6               39.2              42.0            38.3\n\n[Figure 3 panels: (a) Reference vector is the last in a sequence, e.g. \"Choose the 5th furthest vector from vector 7\"; (b) Reference vector is the first in a sequence, e.g. \"Choose the 3rd furthest vector from vector 4\"; (c) Reference vector comes in the middle of a sequence, e.g. \"Choose the 6th furthest vector from vector 6\". Axes: attending from, attending to; panels ordered by time; colour shows attention weights.]\n\n5.3 Mini-Pacman\n\nIn Mini Pacman with viewport the RMC achieved approximately 100 points more than an LSTM\n(677 vs. 550), and when trained with the full observation the RMC nearly doubled the performance\nof an LSTM (1159 vs. 598, figure 10).\n\n5.4 Language Modeling\n\nFor all three language modeling tasks we observe lower perplexity when using the relational memory\ncore, with a drop of 1.4\u20135.4 perplexity over the best published results. 
Although small, this\nconstitutes a 5 \u2212 12% relative improvement and appears to be consistent across tasks of varying\nsize and style. For WikiText-103, we see this can be compared to LSTM architectures [5, 32],\nconvolutional models [42] and hybrid recurrent-convolutional models [43].\nThe model learns with a slightly better data ef\ufb01ciency than an LSTM (appendix \ufb01gure 11). The RMC\nscored highly when the number of context words provided during evaluation were relatively few,\ncompared to an LSTM which pro\ufb01ted much more from a larger context (supplementary \ufb01gure 12).\nThis could be because RMC better captures short-term relations, and hence only needs a relatively\nsmall context for accurate modeling. Inspecting the perplexity broken down by word frequency in\nsupplementary table 3, we see the RMC improved the modeling of frequent words, and this is where\nthe drop in overall perplexity is obtained.\n\n6 Discussion\n\nA number of other approaches have shown success in modeling sequential information by using a\ngrowing buffer of previous states [21, 22]. These models better capture long-distance interactions,\nsince their computations are not biased by temporally local proximity. However, there are serious\nscaling issues for these models when the number of timesteps is large, or even unbounded, such as\nin online reinforcement learning (e.g., in the real world). Thus, some decisions need to be made\nregarding the size of the past-embedding buffer that should be stored, whether it should be a rolling\nwindow, how computations should be cached and propagated across time, etc. These considerations\nmake it dif\ufb01cult to directly compare these approaches in these online settings. 
Nonetheless, we\nbelieve that a blend of purely recurrent approaches with those that scale with time could be a fruitful\npursuit: perhaps the model accumulates memories losslessly for some chunk of time, then learns to\ncompress them in a recurrent core before moving on to processing a subsequent chunk.\nWe believe it is difficult to agree on a definition for \u201crelational reasoning\u201d, and prefer the intuition\nthat \u201crelational reasoning\u201d describes an ability or capacity to understand the ways in which entities\nare related. Since it is a capacity, it must be measured behaviourally, using tasks that we know to\ndemand reasoning about the ways in which entities are related. Thus, since we cannot formally prove\nthat some series of computations will necessarily result in improved relational reasoning, we must\nrely on empirical verification that some computations may or may not be correlated with improved\nrelational reasoning. We hypothesized that memory-memory interactions may underlie such\ncomputations, and proposed intuitions for the specific memory mechanisms that may better equip a\nmodel for complex relational reasoning. Namely, by explicitly allowing memories to interact either\nwith each other, with the input, or both via MHDPA, we demonstrated improved performance on\ntasks demanding relational reasoning across time. We would like to emphasize, however, that while\nthese intuitions guided our design of the model, and while the analysis of the model in the N th\nfarthest task aligned with our intuitions, we cannot necessarily make any concrete claims as to the\ncausal influence of our design choices on the model\u2019s capacity for relational reasoning, or as to the\ncomputations taking place within the model and how they may map to traditional approaches for\nthinking about relational reasoning. 
Thus, we consider our results primarily as evidence of improved\nfunction \u2013 if a model can better solve tasks that require relational reasoning, then it must have an\nincreased capacity for relational reasoning, even if we do not precisely know why it may have this\nincreased capacity. In this light the RMC may be usefully viewed from multiple vantages, and these\nvantages may offer ideas for further improvements.\nOur model has multiple mechanisms for forming and allowing for interactions between memory\nvectors: slicing the memory matrix row-wise into slots, and column-wise into heads. Each has its\nown advantages (computations on slots share parameters, while having more heads and a larger\nmemory size takes advantage of more parameters). We don\u2019t yet understand the interplay, but we\nnote some empirical findings. First, in the N th farthest task a model with a single memory slot\nperformed better when it had more attention heads, though in all cases it performed worse than a\nmodel with many memory slots. Second, in language modeling, our model used a single memory\nslot. The reasons for choosing a single memory here were mainly due to the need for a large number\nof parameters for LM in general (hence the large size for the single memory slot), and the inability to\nquickly run a model with both a large number of parameters and multiple memory slots. Thus, we do\nnot necessarily claim that a single memory slot is best for language modeling; rather, we emphasize\nan interesting trade-off between number of memories and individual memory size, which may be\na task-specific ratio that can be tuned. 
Moreover, in program evaluation, an intermediate solution\nworked well across subtasks (4 slots and heads), though some subtasks performed best with 1 memory, and\nothers with 8.\nAltogether, our results show that explicit modeling of memory interactions improves performance in\na reinforcement learning task, alongside program evaluation, comparative reasoning, and language\nmodeling, demonstrating the value of instilling a capacity for relational reasoning in recurrent neural\nnetworks.\n\nAcknowledgements\n\nWe thank Caglar Gulcehre, Matt Botvinick, Vinicius Zambaldi, Charles Blundell, S\u00e9bastien Racani\u00e8re,\nChloe Hillier, Victoria Langston, and many others on the DeepMind team for critical feedback,\ndiscussions, and support.\n\nReferences\n[1] Daniel L Schacter and Endel Tulving. Memory systems 1994. MIT Press, 1994.\n\n[2] Barbara J Knowlton, Robert G Morrison, John E Hummel, and Keith J Holyoak. A neurocomputational\nsystem for relational reasoning. Trends in cognitive sciences, 16(7):373\u2013381, 2012.\n\n[3] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation,\n9(8):1735\u20131780, 1997.\n\n[4] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401,\n2014.\n\n[5] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwi\u0144ska,\nSergio G\u00f3mez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing\nusing a neural network with dynamic external memory. Nature, 538(7626):471, 2016.\n\n[6] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-\nlearning with memory-augmented neural networks. In International conference on machine learning,\npages 1842\u20131850, 2016.\n\n[7] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. 
In Advances in\nneural information processing systems, pages 2440\u20132448, 2015.\n\n[8] James A Waltz, Barbara J Knowlton, Keith J Holyoak, Kyle B Boone, Fred S Mishkin, Marcia\nde Menezes Santos, Carmen R Thomas, and Bruce L Miller. A system for relational reasoning in\nhuman prefrontal cortex. Psychological science, 10(2):119\u2013125, 1999.\n\n[9] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message\npassing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.\n\n[10] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The\ngraph neural network model. IEEE Transactions on Neural Networks, 20(1):61\u201380, 2009.\n\n[11] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks.\nICLR, 2016.\n\n[12] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for\nlearning about objects, relations and physics. In Advances in neural information processing systems, pages\n4502\u20134510, 2016.\n\n[13] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.\narXiv preprint arXiv:1609.02907, 2016.\n\n[14] Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio.\nGraph attention networks. In International Conference on Learning Representations, 2018.\n\n[15] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia,\nand Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in neural\ninformation processing systems, pages 4974\u20134983, 2017.\n\n[16] David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter Battaglia.\nDiscovering objects and their relations from entangled scene representations. 
arXiv preprint arXiv:1702.05068,\n2017.\n\n[17] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection.\narXiv preprint arXiv:1711.11575, 2017.\n\n[18] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. arXiv\npreprint arXiv:1711.07971, 2017.\n\n[19] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network\nfor image restoration. arXiv preprint arXiv:1806.02919, 2018.\n\n[20] Juan Pavez, H\u00e9ctor Allende, and H\u00e9ctor Allende-Cid. Working memory networks: Augmenting memory\nnetworks with a relational reasoning module. arXiv preprint arXiv:1805.09354, 2018.\n\n[21] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning\nto align and translate. ICLR, abs/1409.0473, 2015.\n\n[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz\nKaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing\nSystems, pages 6000\u20136010, 2017.\n\n[23] Alex Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.\n\n[24] Felix A Gers, J\u00fcrgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm.\n1999.\n\n[25] Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615v3, 2014.\n\n[26] Th\u00e9ophane Weber, S\u00e9bastien Racani\u00e8re, David P Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez\nRezende, Adria Puigdom\u00e8nech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented\nagents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.\n\n[27] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger\nSchwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. 
arXiv preprint arXiv:1406.1078, 2014.\n\n[28] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end\nattention-based large vocabulary speech recognition. In Acoustics, Speech and Signal Processing (ICASSP),\n2016 IEEE International Conference on, pages 4945\u20134949. IEEE, 2016.\n\n[29] Djoerd Hiemstra. Using language models for information retrieval. 2001.\n\n[30] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck:\na high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.\n\n[31] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus\nof english: The penn treebank. Computational linguistics, 19(2):313\u2013330, 1993.\n\n[32] Jack W Rae, Chris Dyer, Peter Dayan, and Timothy P Lillicrap. Fast parametric learning with activation\nmemorization. arXiv preprint arXiv:1803.10049, 2018.\n\n[33] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.\narXiv preprint arXiv:1609.07843, 2016.\n\n[34] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of\nlanguage modeling. arXiv preprint arXiv:1602.02410, 2016.\n\n[35] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony\nRobinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv\npreprint arXiv:1312.3005, 2013.\n\n[36] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword fifth edition\nldc2011t07. dvd. Philadelphia: Linguistic Data Consortium, 2011.\n\n[37] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent\nneural networks. arXiv preprint arXiv:1312.6026, 2013.\n\n[38] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state\nwith recurrent entity networks. 
In Fifth International Conference on Learning Representations, 2017.\n\n[39] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction\nwith recurrent neural networks. In Advances in Neural Information Processing Systems 28, 2015.\n\n[40] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous\ncache. arXiv preprint arXiv:1612.04426, 2016.\n\n[41] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Convolutional sequence modeling revisited. 2018.\n\n[42] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated\nconvolutional networks. arXiv preprint arXiv:1612.08083, 2016.\n\n[43] Stephen Merity, Nitish Shirish Keskar, James Bradbury, and Richard Socher. Scalable language modeling:\nWikitext-103 on a single gpu in 12 hours. 2018.\n\n[44] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[45] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In\nAdvances in neural information processing systems 27, 2014.\n\n[46] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls,\nDavid Reichert, Timothy Lillicrap, Edward Lockhart, et al. Relational deep reinforcement learning. arXiv\npreprint arXiv:1806.01830, 2018.\n\n[47] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron,\nVlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable\ndistributed deep-rl with importance weighted actor-learner architectures. 
arXiv preprint arXiv:1802.01561,\n2018.", "award": [], "sourceid": 3634, "authors": [{"given_name": "Adam", "family_name": "Santoro", "institution": "DeepMind"}, {"given_name": "Ryan", "family_name": "Faulkner", "institution": "Deepmind"}, {"given_name": "David", "family_name": "Raposo", "institution": "DeepMind"}, {"given_name": "Jack", "family_name": "Rae", "institution": "DeepMind, UCL"}, {"given_name": "Mike", "family_name": "Chrzanowski", "institution": "DeepMind"}, {"given_name": "Theophane", "family_name": "Weber", "institution": "DeepMind"}, {"given_name": "Daan", "family_name": "Wierstra", "institution": "DeepMind Technologies"}, {"given_name": "Oriol", "family_name": "Vinyals", "institution": "Google DeepMind"}, {"given_name": "Razvan", "family_name": "Pascanu", "institution": "Google DeepMind"}, {"given_name": "Timothy", "family_name": "Lillicrap", "institution": "Google DeepMind"}]}