{"title": "Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes", "book": "Advances in Neural Information Processing Systems", "page_first": 3621, "page_last": 3629, "abstract": "Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows --- limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs $1,\\!000\\times$ faster and with $3,\\!000\\times$ less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring $100,\\!000$s of time steps and memories. As well, we show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.", "full_text": "Scaling Memory-Augmented Neural Networks with\n\nSparse Reads and Writes\n\nJack W Rae\u21e4\n\njwrae\n\nJonathan J Hunt\u21e4\n\njjhunt\n\nTim Harley\ntharley\n\nIvo Danihelka\ndanihelka\n\nAndrew Senior\nandrewsenior\n\nTimothy P Lillicrap\n\ncountzero\n\nGreg Wayne\ngregwayne\n\nAlex Graves\ngravesa\n\nGoogle DeepMind\n@google.com\n\nAbstract\n\nNeural networks augmented with external memory have the ability to learn algorith-\nmic solutions to complex tasks. These models appear promising for applications\nsuch as language modeling and machine translation. 
However, they scale poorly in\nboth space and time as the amount of memory grows \u2014 limiting their applicability\nto real-world domains. Here, we present an end-to-end differentiable memory\naccess scheme, which we call Sparse Access Memory (SAM), that retains the\nrepresentational power of the original approaches whilst training ef\ufb01ciently with\nvery large memories. We show that SAM achieves asymptotic lower bounds in\nspace and time complexity, and \ufb01nd that an implementation runs 1,000\u21e5 faster\nand with 3,000\u21e5 less physical memory than non-sparse models. SAM learns with\ncomparable data ef\ufb01ciency to existing models on a range of synthetic tasks and\none-shot Omniglot character recognition, and can scale to tasks requiring 100,000s\nof time steps and memories. As well, we show how our approach can be adapted\nfor models that maintain temporal associations between memories, as with the\nrecently introduced Differentiable Neural Computer.\n\n1\n\nIntroduction\n\nRecurrent neural networks, such as the Long Short-Term Memory (LSTM) [11], have proven to be\npowerful sequence learning models [6, 18]. However, one limitation of the LSTM architecture is\nthat the number of parameters grows proportionally to the square of the size of the memory, making\nthem unsuitable for problems requiring large amounts of long-term memory. Recent approaches,\nsuch as Neural Turing Machines (NTMs) [7] and Memory Networks [21], have addressed this issue\nby decoupling the memory capacity from the number of model parameters. We refer to this class\nof models as memory augmented neural networks (MANNs). External memory allows MANNs to\nlearn algorithmic solutions to problems that have eluded the capabilities of traditional LSTMs, and to\ngeneralize to longer sequence lengths. 
Nonetheless, MANNs have had limited success in real-world applications.\nA significant difficulty in training these models results from their smooth read and write operations, which incur a computational overhead linear in the number of memories stored, per time step of training. Even worse, they require duplication of the entire memory at each time step to perform backpropagation through time (BPTT). For sufficiently complex problems, such as processing a book or Wikipedia, this overhead becomes prohibitive. For example, to store 64 memories, a straightforward implementation of the NTM trained over a sequence of length 100 consumes ≈ 30 MiB of physical memory; to store 64,000 memories the overhead exceeds 29 GiB (see Figure 1).\n\n*These authors contributed equally.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper, we present a MANN named SAM (Sparse Access Memory). By thresholding memory modifications to a sparse subset, and using efficient data structures for content-based read operations, our model is optimal in space and time with respect to memory size, while retaining end-to-end gradient-based optimization. To test whether the model is able to learn with this sparse approximation, we examined its performance on a selection of synthetic and natural tasks: algorithmic tasks from the NTM work [7], Babi reasoning tasks used with Memory Networks [17], and Omniglot one-shot classification [16, 12]. We also tested several of these tasks scaled to longer sequences via curriculum learning. 
For large external memories we observed improvements in empirical run-time and memory overhead of up to three orders of magnitude over vanilla NTMs, while maintaining near-identical data efficiency and performance.\nFurther, in Supplementary D we demonstrate the generality of our approach by describing how to construct a sparse version of the recently published Differentiable Neural Computer [8]. This Sparse Differentiable Neural Computer (SDNC) is over 400× faster than the canonical dense variant for a memory size of 2,000 slots, and achieves the best reported result on the Babi tasks without supervising the memory access.\n\n2 Background\n\n2.1 Attention and content-based addressing\nAn external memory $M \\in \\mathbb{R}^{N \\times M}$ is a collection of N real-valued vectors, or words, of fixed size M. A soft read operation is defined to be a weighted average over memory words,\n\n$$r = \\sum_{i=1}^{N} w(i) M(i) ,$$ (1)\n\nwhere $w \\in \\mathbb{R}^N$ is a vector of weights with non-negative entries that sum to one. Attending to memory is formalized as the problem of computing w. A content addressable memory, proposed in [7, 21, 2, 17], is an external memory with an addressing scheme which selects w based upon the similarity of memory words to a given query q. Specifically, for the ith read weight w(i) we define\n\n$$w(i) = \\frac{f(d(q, M(i)))}{\\sum_{j=1}^{N} f(d(q, M(j)))} ,$$ (2)\n\nwhere d is a similarity measure, typically Euclidean distance or cosine similarity, and f is a differentiable monotonic transformation, typically a softmax. We can think of this as an instance of kernel smoothing where the network learns to query relevant points q. 
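As a concrete illustration, the soft read (1) and the content-based weighting (2) can be sketched in a few lines of NumPy. The similarity d here is negative Euclidean distance and f is a softmax, as in the text; the sharpness parameter beta is an illustrative addition, not part of the formulation above.

```python
import numpy as np

def content_weights(M, q, beta=1.0):
    # Content-based addressing (2): softmax over similarities, where
    # similarity is negative Euclidean distance to the query q.
    sim = -np.linalg.norm(M - q, axis=1)
    e = np.exp(beta * sim - np.max(beta * sim))  # numerically stable softmax
    return e / e.sum()

def soft_read(M, w):
    # Soft read (1): weighted average over all N memory words.
    return w @ M

# Toy memory: N = 4 words of size M = 3.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
q = np.array([0.9, 0.1, 0.0])
w = content_weights(M, q, beta=5.0)  # concentrates on the first word
r = soft_read(M, w)
```

Note that every memory word receives non-zero weight here; this is exactly the linear overhead that the sparse scheme below removes.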
Because the read operation (1) and the content-based addressing scheme (2) are smooth, we can place them within a neural network and train the full model using backpropagation.\n\n2.2 Memory Networks\nOne recent architecture, Memory Networks, makes use of a content addressable memory that is accessed via a series of read operations [21, 17] and has been successfully applied to a number of question answering tasks [20, 10]. In these tasks, the memory is pre-loaded using a learned embedding of the provided context, such as a paragraph of text, and then the controller, given an embedding of the question, repeatedly queries the memory by content-based reads to determine an answer.\n\n2.3 Neural Turing Machine\nThe Neural Turing Machine is a recurrent neural network equipped with a content-addressable memory, similar to Memory Networks, but with the additional capability to write to memory over time. The memory is accessed by a controller network, typically an LSTM, and the full model is differentiable, allowing it to be trained via BPTT.\n\nA write to memory,\n\n$$M_t \\leftarrow (1 - R_t) \\odot M_{t-1} + A_t ,$$ (3)\n\nconsists of a copy of the memory from the previous time step, $M_{t-1}$, decayed by the erase matrix $R_t$ indicating obsolete or inaccurate content, and an addition of new or updated information $A_t$. The erase matrix $R_t = w^W_t e_t^T$ is constructed as the outer product between a set of write weights $w^W_t \\in [0, 1]^N$ and an erase vector $e_t \\in [0, 1]^M$. The add matrix $A_t = w^W_t a_t^T$ is the outer product between the write weights and a new write word $a_t \\in \\mathbb{R}^M$, which the controller outputs.\n\n3 Architecture\n\nThis paper introduces Sparse Access Memory (SAM), a new neural memory architecture with two innovations. Most importantly, all writes to and reads from external memory are constrained to a sparse subset of the memory words, providing similar functionality to the NTM while allowing computation- and memory-efficient operation. 
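As a sketch, the erase-and-add write (3) amounts to two outer products; the shapes and values below are purely illustrative:

```python
import numpy as np

def dense_write(M_prev, w_write, erase_vec, add_word):
    # NTM-style write (3): M_t = (1 - R_t) * M_{t-1} + A_t, where the
    # erase matrix R_t and add matrix A_t are outer products.
    R = np.outer(w_write, erase_vec)  # (N, M) erase matrix
    A = np.outer(w_write, add_word)   # (N, M) add matrix
    return (1.0 - R) * M_prev + A

M_prev = np.ones((4, 3))                   # N = 4 words of size M = 3
w_write = np.array([1.0, 0.0, 0.0, 0.0])   # write entirely to slot 0
M_new = dense_write(M_prev, w_write,
                    erase_vec=np.ones(3),              # fully erase slot 0
                    add_word=np.array([2.0, 3.0, 4.0]))
```

Because the write weights are dense in general, every row of memory is touched; constraining them to a few non-zero entries is precisely the sparsification introduced below.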
Secondly, we introduce a sparse memory management scheme that tracks memory usage and finds unused blocks of memory for recording new information. For a memory containing N words, SAM executes a forward and backward step in Θ(log N) time, initializes in Θ(N) space, and consumes Θ(1) space per time step. Under some reasonable assumptions, SAM is asymptotically optimal in time and space complexity (Supplementary A).\n\n3.1 Read\n\nThe sparse read operation is defined to be a weighted average over a selection of words in memory:\n\n$$\\tilde{r}_t = \\sum_{i=1}^{K} \\tilde{w}^R_t(s_i) M_t(s_i) ,$$ (4)\n\nwhere $\\tilde{w}^R_t \\in \\mathbb{R}^N$ contains K non-zero entries with indices $s_1, s_2, \\ldots, s_K$; K is a small constant, independent of N, typically K = 4 or K = 8. We will refer to sparse analogues of weight vectors w as $\\tilde{w}$, and when discussing operations that are used in both the sparse and dense versions of our model we use w.\nWe wish to construct $\\tilde{w}^R_t$ such that $\\tilde{r}_t \\approx r_t$. For content-based reads where $w^R_t$ is defined by (2), an effective approach is to keep the K largest non-zero entries and set the remaining entries to zero. We can compute $\\tilde{w}^R_t$ naively in O(N) time by calculating $w^R_t$ and keeping the K largest values. However, linear-time operation can be avoided. Since the K largest values in $w^R_t$ correspond to the K closest points to our query $q_t$, we can use an approximate nearest neighbor data-structure, described in Section 3.5, to calculate $\\tilde{w}^R_t$ in O(log N) time.\nSparse read can be considered a special case of the matrix-vector product defined in (1), with two key distinctions. The first is that we pass gradients for only a constant K rows of memory per time step, versus N, which results in a negligible fraction of non-zero error gradient per time step when the memory is large. 
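A minimal sketch of the naive O(N) version of the sparse read (4): compute dense content weights, keep the K largest, and read from only those rows (the O(log N) ANN route of Section 3.5 is not shown):

```python
import numpy as np

def sparse_read(M, w, K=4):
    # Sparse read (4): keep the K largest read weights and average
    # over only the corresponding K memory words.
    idx = np.argpartition(w, -K)[-K:]  # indices s_1..s_K of the top-K weights
    return w[idx] @ M[idx], idx

rng = np.random.default_rng(0)
M = rng.standard_normal((1000, 8))            # N = 1000 words of size 8
q = M[17] + 0.01 * rng.standard_normal(8)     # query close to word 17
sim = -5.0 * np.linalg.norm(M - q, axis=1)    # content similarities as in (2)
w = np.exp(sim - sim.max())
w /= w.sum()
r_tilde, idx = sparse_read(M, w, K=4)         # sparse approximation
r = w @ M                                     # dense read (1), for comparison
```

With a peaked weighting, the discarded N − K entries carry negligible mass, so the sparse read closely approximates the dense one.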
The second distinction is in implementation: by using an efficient sparse matrix format such as Compressed Sparse Rows (CSR), we can compute (4) and its gradients in constant time and space (see Supplementary A).\n\n3.2 Write\n\nThe write operation in SAM is an instance of (3) where the write weights $\\tilde{w}^W_t$ are constrained to contain a constant number of non-zero entries. This is done by a simple scheme where the controller writes either to previously read locations, in order to update contextually relevant memories, or to the least recently accessed location, in order to overwrite stale or unused memory slots with fresh content.\nThe introduction of sparsity could be achieved via other write schemes. For example, we could use a sparse content-based write scheme, where the controller chooses a query vector $q^W_t$ and applies writes to similar words in memory. This would allow for direct memory updates, but would create problems when the memory is empty (and shift further complexity to the controller). We decided upon the previously read / least recently accessed addressing scheme for simplicity and flexibility.\n\nThe write weights are defined as\n\n$$w^W_t = \\alpha_t \\left( \\gamma_t w^R_{t-1} + (1 - \\gamma_t) I^U_t \\right) ,$$ (5)\n\nwhere the controller outputs the interpolation gate parameter $\\gamma_t$ and the write gate parameter $\\alpha_t$. 
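The write weighting (5) and its least-used indicator can be sketched as follows. The usage vector plays the role of a discounted access count in the spirit of the U(1) definition below (so its argmin is the least-used slot); all numbers are illustrative:

```python
import numpy as np

def write_weights(w_read_prev, usage, alpha, gamma):
    # Write weights (5): interpolate between the previously read
    # locations and a one-hot indicator (6) over the least-used slot.
    I_U = np.zeros_like(usage)
    I_U[np.argmin(usage)] = 1.0  # ties broken arbitrarily (first index)
    return alpha * (gamma * w_read_prev + (1.0 - gamma) * I_U)

w_read_prev = np.array([0.0, 0.7, 0.3, 0.0])  # sparse read with K = 2
usage = np.array([4.0, 9.0, 6.0, 1.0])        # slot 3 is least used
w_w = write_weights(w_read_prev, usage, alpha=1.0, gamma=0.5)
# w_w has at most K + 1 non-zero entries, so the write stays sparse.
```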
The write to the previously read locations $w^R_{t-1}$ is purely additive, while the least recently accessed word $I^U_t$ is set to zero before being written to. When the read operation is sparse ($w^R_{t-1}$ has K non-zero entries), it follows that the write operation is also sparse.\nWe define $I^U_t$ to be an indicator over words in memory, with a value of 1 when the word minimizes a usage measure $U_t$:\n\n$$I^U_t(i) = \\begin{cases} 1 & \\text{if } U_t(i) = \\min_{j=1,\\ldots,N} U_t(j) \\\\ 0 & \\text{otherwise.} \\end{cases}$$ (6)\n\nIf there are several words that minimize $U_t$ then we choose arbitrarily between them. We tried two definitions of $U_t$. The first definition is a time-discounted sum of write weights, $U^{(1)}_T(i) = \\sum_{t=0}^{T} \\lambda^{T-t} \\left( w^W_t(i) + w^R_t(i) \\right)$, where $\\lambda$ is the discount factor. This usage definition is incorporated within Dense Access Memory (DAM), a dense approximation to SAM that is used for experimental comparison in Section 4.\nThe second usage definition, used by SAM, is simply the number of time steps since a non-negligible memory access: $U^{(2)}_T(i) = T - \\max \\{ t : w^W_t(i) + w^R_t(i) > \\delta \\}$. Here, $\\delta$ is a tuning parameter that we typically choose to be 0.005. We maintain this usage statistic in constant time using a custom data-structure (described in Supplementary A). Finally, we also use the least recently accessed word to calculate the erase matrix: $R_t = I^U_t \\mathbf{1}^T$ is defined to be the expansion of this usage indicator, where $\\mathbf{1}$ is a vector of ones. The total cost of the write is constant in time and space for both the forward and backward pass, which improves on the linear space and time of the dense write (see Supplementary A).\n\n3.3 Controller\n\nWe use a one-layer LSTM for the controller throughout. At each time step, the LSTM receives a concatenation of the external input, $x_t$, and the word, $r_{t-1}$, read in the previous time step. The LSTM then produces a vector, $p_t = (q_t, a_t, \\alpha_t, \\gamma_t)$, of read and write parameters for memory access via a linear layer. 
The word read from memory for the current time step, $r_t$, is then concatenated with the output of the LSTM, and this vector is fed through a linear layer to form the final output, $y_t$. The full control flow is illustrated in Supplementary Figure 6.\n\n3.4 Efficient backpropagation through time\n\nWe have already demonstrated how the forward operations in SAM can be efficiently computed in O(T log N) time. However, when considering the space complexity of MANNs, there remains a dependence on $M_t$ for the computation of the derivatives at the corresponding time step. A naive implementation requires the state of the memory to be cached at each time step, incurring a space overhead of O(NT), which severely limits memory size and sequence length.\nFortunately, this can be remedied. Since there are only O(1) words that are written at each time step, we instead track the sparse modifications made to the memory at each time step, and apply them in place to compute $M_t$ in O(1) time and O(T) space. During the backward pass, we can restore the state of $M_t$ from $M_{t+1}$ in O(1) time by reverting the sparse modifications applied at time step t. As such, the memory is actually rolled back to previous states during backpropagation (Supplementary Figure 5). At the end of the backward pass, the memory is rolled back to its start state. If required, such as when using truncated BPTT, the final memory state can be restored by making a copy of $M_T$ prior to calling backwards in O(N) time, or by re-applying the T sparse updates in O(T) time.\n\n3.5 Approximate nearest neighbors\n\nWhen querying the memory, we can use an approximate nearest neighbor index (ANN) to search over the external memory for the K nearest words. 
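The in-place update and rollback scheme of Section 3.4 can be sketched with a per-step log of sparse modifications; reverting log entries restores earlier memory states during the backward pass without caching a full memory copy per time step (the class below is a hypothetical illustration, not the paper's implementation):

```python
class SparseMemory:
    # Memory with O(1)-per-step writes and a reversible history, so BPTT
    # can roll the memory back instead of caching O(N T) state.
    def __init__(self, num_words, word_size):
        self.mem = [[0.0] * word_size for _ in range(num_words)]
        self.history = []  # one (slot, old_word, new_word) record per write

    def write(self, slot, word):
        self.history.append((slot, self.mem[slot], word))
        self.mem[slot] = word  # in-place sparse modification

    def rollback(self):
        # Revert the most recent write (used while stepping backwards).
        slot, old_word, _ = self.history.pop()
        self.mem[slot] = old_word

m = SparseMemory(num_words=4, word_size=2)
m.write(0, [1.0, 2.0])   # forward pass, step 1
m.write(0, [3.0, 4.0])   # forward pass, step 2
m.rollback()             # backward pass: memory returns to the step-1 state
```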
Where a linear KNN search inspects every element in memory (taking O(N) time), an ANN index maintains a structure over the dataset to allow for fast inspection of nearby points in O(log N) time.\nIn our case, the memory is still a dense tensor that the network directly operates on; however, the ANN is a structured view of its contents. Both the memory and the ANN index are passed through the network and kept in sync during writes. However, there are no gradients with respect to the ANN, as its function is fixed.\nWe considered two types of ANN indexes: FLANN's randomized k-d tree implementation [15], which arranges the datapoints in an ensemble of structured (randomized k-d) trees to search for nearby points via comparison-based search, and one that uses locality sensitive hash (LSH) functions, which map points into buckets with distance-preserving guarantees. We used randomized k-d trees for small word sizes and LSHs for large word sizes. For both ANN implementations, there is an O(log N) cost for insertion, deletion and query. We also rebuild the ANN from scratch every N insertions to ensure it does not become imbalanced.\n\n4 Results\n\n4.1 Speed and memory benchmarks\n\nFigure 1: (a) Wall-clock time of a single forward and backward pass. The k-d tree is a FLANN randomized ensemble with 4 trees and 32 checks. For 1M memories a single forward and backward pass takes 12 s for the NTM and 7 ms for SAM, a speedup of 1600×. (b) Memory used to train over a sequence of 100 time steps, excluding initialization of external memory. The space overhead of SAM is independent of memory size, which we see by the flat line. When the memory contains 64,000 words the NTM consumes 29 GiB whereas SAM consumes only 7.8 MiB, a compression ratio of 3700.\n\nWe measured the forward and backward times of the SAM architecture versus the dense DAM variant and the original NTM (details of setup in Supplementary E). 
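As a toy sketch of the LSH option discussed in Section 3.5, random hyperplane signatures bucket nearby vectors together; this single-table version has neither FLANN's interface nor the O(log N) guarantees of the indexes used in the paper:

```python
import numpy as np

class RandomHyperplaneLSH:
    # Hash = sign pattern of projections onto random hyperplanes; vectors
    # with a small angle between them tend to share a bucket.
    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}

    def _key(self, x):
        return tuple(bool(b) for b in (self.planes @ x > 0))

    def insert(self, idx, x):
        self.buckets.setdefault(self._key(x), []).append(idx)

    def query(self, q):
        # Candidate neighbours: everything sharing q's bucket.
        return self.buckets.get(self._key(q), [])

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 16))     # 50 memory words of size 16
index = RandomHyperplaneLSH(dim=16)
for i, x in enumerate(X):
    index.insert(i, x)
candidates = index.query(X[0])        # bucket containing word 0 itself
```

A practical index would hash into several tables and rerank the candidates by exact distance.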
SAM is over 100 times faster than the NTM when the memory contains one million words and an exact linear index is used, and 1600 times faster with the k-d tree (Figure 1a). With an ANN the model runs in sublinear time with respect to the memory size. SAM's memory usage per time step is independent of the number of memory words (Figure 1b), which empirically verifies the O(1) space claim from Supplementary A. For 64K memory words SAM uses 53 MiB of physical memory to initialize the network and 7.8 MiB to run a 100-step forward and backward pass, compared with the NTM, which consumes 29 GiB.\n\n4.2 Learning with sparse memory access\n\nWe have established that SAM reaps a huge computational and memory advantage over previous models, but can we really learn with SAM's sparse approximations? We investigated the learning cost of inducing sparsity, and the effect of placing an approximate nearest neighbor index within the network, by comparing SAM with its dense variant DAM and some established models, the NTM and the LSTM.\nWe trained each model on three of the original NTM tasks [7]:\n1. Copy: copy a random input sequence of length 1–20.\n2. Associative Recall: given 3–6 random (key, value) pairs, and subsequently a cue key, return the associated value.\n3. Priority Sort: given 20 random keys and priority values, return the top 16 keys in descending order of priority.\n\nFigure 2: Training curves for sparse (SAM) and dense (DAM, NTM) models on (a) Copy, (b) Associative Recall, and (c) Priority Sort. SAM trains comparably for the Copy task, and reaches asymptotic error significantly faster for Associative Recall and Priority Sort. Light colors indicate one standard deviation over 30 random seeds.\n\n
We chose these tasks because the NTM is known to perform well on them.\nFigure 2 shows that sparse models are able to learn with comparable efficiency to the dense models and, surprisingly, learn more effectively for some tasks, notably Priority Sort and Associative Recall. This shows that sparse reads and writes can actually benefit early-stage learning in some cases. Full hyperparameter details are in Supplementary C.\n\n4.3 Scaling with a curriculum\n\nThe computational efficiency of SAM opens up the possibility of training on tasks that require storing a large amount of information over long sequences. Here we show this is possible in practice, by scaling tasks to a large scale via an exponentially increasing curriculum.\nWe parametrized three of the tasks described in Section 4.2 (Associative Recall, Copy, and Priority Sort) with a progressively increasing difficulty level, which characterises the length of the sequence and the number of entries to store in memory. For example, the level specifies the input sequence length for the Copy task. We exponentially increased the maximum level h when the network began to learn the fundamental algorithm. Since the time taken for a forward and backward pass scales O(T) with the sequence length T, following a standard linearly increasing curriculum could potentially take $O(T^2)$ time if the same amount of training were required at each step of the curriculum. Specifically, h was doubled whenever the average training loss dropped below a threshold for a number of episodes. The level was sampled for each minibatch from the uniform distribution over integers U(0, h).\nWe compared the dense models, NTM and DAM, with both SAM with an exact nearest neighbor index (SAM linear) and with locality sensitive hashing (SAM ANN). The dense models contained 64 memory words, while the sparse models had $2 \\times 10^6$ words. 
These sizes were chosen to ensure all models use approximately the same amount of physical memory when trained over 100 steps.\nFor all tasks, SAM was able to advance further than the other models, and on the Associative Recall task, SAM was able to advance through the curriculum to sequences of length greater than 4,000 (Figure 3). Note that we did not use truncated backpropagation, so this involved BPTT over more than 4,000 steps with a memory size in the millions of words.\nTo investigate whether SAM was able to learn algorithmic solutions to tasks, we investigated its ability to generalize to sequences that far exceeded those observed during training. Namely, we trained SAM on the Associative Recall task up to sequences of length 10,000, and found it was then able to generalize to sequences of length 200,000 (Supplementary Figure 8).\n\n4.4 Question answering on the Babi tasks\n\n[20] introduced toy tasks they considered a prerequisite for agents that can reason about and understand natural language. They are synthetically generated language tasks, with a vocabulary of about 150 words, that test various aspects of simple reasoning such as deduction, induction and coreferencing.\n\nFigure 3: Curriculum training curves for sparse and dense models on (a) Associative Recall, (b) Copy, and (c) Priority Sort. Difficulty level indicates the task difficulty (e.g. the length of the sequence for Copy). We see SAM train (and backpropagate over) episodes with thousands of steps, and tasks which require thousands of words to be stored to memory. Each model is averaged across 5 replicas of identical hyper-parameters (light lines indicate individual runs).\n\nWe tested the models (including the Sparse Differentiable Neural Computer described in Supplementary D) on this task. 
The full results and training details are described in Supplementary G.\nThe MANNs, except the NTM, are able to learn solutions comparable to the previous best results, failing at only 2 of the tasks. The SDNC manages to solve all but 1 of the tasks, the best reported result on Babi that we are aware of.\nNotably, the best prior results have been obtained by supervising the memory retrieval (during training the model is provided annotations which indicate which memories should be used to answer a query). More directly comparable previous work with end-to-end memory networks, which did not use supervision [17], fails at 6 of the tasks.\nBoth the sparse and dense models perform comparably on this task, again indicating that the sparse approximations do not impair learning. We believe the NTM may perform poorly since it lacks a mechanism which allows it to allocate memory effectively.\n\n4.5 Learning on real world data\n\nFinally, we demonstrate that the model is capable of learning on a non-synthetic dataset. Omniglot [12] is a dataset of 1623 characters taken from 50 different alphabets, with 20 examples of each character. This dataset is used to test rapid, or one-shot, learning, since there are few examples of each character but many different character classes. Following [16], we generate episodes where a subset of characters are randomly selected from the dataset, rotated and stretched, and assigned a randomly chosen label. At each time step an example of one of the characters is presented, along with the correct label of the preceding character. Each character is presented 10 times in an episode (but each presentation may be any one of the 20 examples of the character). 
In order to succeed at the task the model must learn to rapidly associate a novel character with the correct label, such that it can correctly classify subsequent examples of the same character class.\nAgain, we used an exponential curriculum, doubling the number of additional characters provided to the model whenever the cost was reduced below a threshold. After training all MANNs for the same length of time, a validation task with 500 characters was used to select the best run, and this was then tested on a test set containing entirely novel characters, for different sequence lengths (Figure 4). All of the MANNs were able to perform much better than chance, even on sequences ≈ 4× longer than seen during training. SAM outperformed other models, presumably due to its much larger memory capacity. Previous results on the Omniglot curriculum task [16] are not identical, since we used 1-hot labels throughout and the training curriculum scaled to longer sequences, but our results with the dense models are comparable (≈ 0.4 errors with 100 characters), while SAM is significantly better (< 0.2 errors with 100 characters).\n\nFigure 4: Test errors for the Omniglot task (described in the text) for the best runs (as chosen by the validation set). The characters used in the test set were not used in validation or training. All of the MANNs were able to perform much better than chance with ≈ 500 characters (sequence lengths of ≈ 5000), even though they were trained, at most, on sequences of length ≈ 130 (chance is 0.002 for 500 characters). This indicates they are learning generalizable solutions to the task. 
SAM is able to outperform other approaches, presumably because it can utilize a much larger memory.\n\n5 Discussion\n\nScaling memory systems is a pressing research direction due to the potential for compelling applications with large amounts of memory. We have demonstrated that one can train neural networks with large memories via a sparse read and write scheme that makes use of efficient data structures within the network, and obtain significant speedups during training. Although we have focused on a specific MANN (SAM), which is closely related to the NTM, the approach taken here is general and can be applied to many differentiable memory architectures, such as Memory Networks [21].\nIt should be noted that there are multiple possible routes toward scalable memory architectures. For example, prior work aimed at scaling Neural Turing Machines [22] used reinforcement learning to train a discrete addressing policy. This approach also touches only a sparse set of memories at each time step, but relies on higher-variance estimates of the gradient during optimization. Though we can only guess at what class of memory models will become a staple in machine learning systems of the future, we argue in Supplementary A that they will be no more efficient than SAM in space and time complexity if they address memories based on content.\nWe have experimented with randomized k-d trees and LSH within the network to reduce the forward pass of training to sublinear time, but there may be room for improvement here. K-d trees were not designed specifically for fully online scenarios, and can become imbalanced during training. Recent work on tree ensemble models, such as Mondrian forests [13], shows promising results in maintaining balanced hierarchical set coverage in the online setting. An alternative approach which may be well-suited is LSH forests [3], which adaptively modify the number of hashes used. 
It would be an\ninteresting empirical investigation to more fully assess different ANN approaches in the challenging\ncontext of training a neural network.\nHumans are able to retain a large, task-dependent set of memories obtained in one pass with a\nsurprising amount of \ufb01delity [4]. Here we have demonstrated architectures that may one day compete\nwith humans at these kinds of tasks.\n\nAcknowledgements\n\nWe thank Vyacheslav Egorov, Edward Grefenstette, Malcolm Reynolds, Fumin Wang and Yori Zwols\nfor their assistance, and the Google DeepMind family for helpful discussions and encouragement.\n\n8\n\n\fReferences\n[1] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu. An optimal\nalgorithm for approximate nearest neighbor searching \ufb01xed dimensions. J. ACM, 45(6):891\u2013923, November\n1998.\n\n[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning\n\nto align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[3] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. Lsh forest: self-tuning indexes for similarity search.\n\nIn Proceedings of the 14th international conference on World Wide Web, pages 651\u2013660. ACM, 2005.\n\n[4] Timothy F Brady, Talia Konkle, George A Alvarez, and Aude Oliva. Visual long-term memory has a massive\nstorage capacity for object details. Proceedings of the National Academy of Sciences, 105(38):14325\u201314329,\n2008.\n\n[5] Ronan Collobert, Koray Kavukcuoglu, and Cl\u00e9ment Farabet. Torch7: A matlab-like environment for\n\nmachine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.\n\n[6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural\nnetworks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on,\npages 6645\u20136649. IEEE, 2013.\n\n[7] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. 
arXiv preprint arXiv:1410.5401,\n\n2014.\n\n[8] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwi\u00b4nska,\nSergio G\u00f3mez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing\nusing a neural network with dynamic external memory. Nature, 2016.\n\n[9] Ga\u00ebl Guennebaud, Beno\u0131t Jacob, Philip Avery, Abraham Bachrach, Sebastien Barthelemy, et al. Eigen v3,\n\n2010.\n\n[10] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children\u2019s\n\nbooks with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.\n\n[11] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780,\n\n1997.\n\n[12] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through\n\nprobabilistic program induction. Science, 350(6266):1332\u20131338, 2015.\n\n[13] Balaji Lakshminarayanan, Daniel M Roy, and Yee Whye Teh. Mondrian forests: Ef\ufb01cient online random\n\nforests. In Advances in Neural Information Processing Systems, pages 3140\u20133148, 2014.\n\n[14] Rajeev Motwani, Assaf Naor, and Rina Panigrahy. Lower bounds on locality sensitive hashing. SIAM\n\nJournal on Discrete Mathematics, 21(4):930\u2013935, 2007.\n\n[15] Marius Muja and David G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Pattern\n\nAnalysis and Machine Intelligence, IEEE Transactions on, 36, 2014.\n\n[16] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and T Lillicrap. Meta-learning with\n\nmemory-augmented neural networks. In International conference on machine learning, 2016.\n\n[17] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in\n\nNeural Information Processing Systems, pages 2431\u20132439, 2015.\n\n[18] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 
Sequence to sequence learning with neural networks. In\nAdvances in Neural Information Processing Systems 27, pages 3104\u20133112. Curran Associates, Inc., 2014.\n[19] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of\n\nits recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.\n\n[20] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merri\u00ebnboer, Armand Joulin,\nand Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv\npreprint arXiv:1502.05698, 2015.\n\n[21] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916,\n\n2014.\n\n[22] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. arXiv preprint\n\narXiv:1505.00521, 2015.\n\n9\n\n\f", "award": [], "sourceid": 1803, "authors": [{"given_name": "Jack", "family_name": "Rae", "institution": "Google DeepMind"}, {"given_name": "Jonathan", "family_name": "Hunt", "institution": "Brain Corporation"}, {"given_name": "Ivo", "family_name": "Danihelka", "institution": "Google DeepMind"}, {"given_name": "Timothy", "family_name": "Harley", "institution": "Google DeepMind"}, {"given_name": "Andrew", "family_name": "Senior", "institution": "Google DeepMind"}, {"given_name": "Gregory", "family_name": "Wayne", "institution": "Google DeepMind"}, {"given_name": "Alex", "family_name": "Graves", "institution": "Google DeepMind"}, {"given_name": "Timothy", "family_name": "Lillicrap", "institution": "Google DeepMind"}]}