{"title": "Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding", "book": "Advances in Neural Information Processing Systems", "page_first": 7640, "page_last": 7651, "abstract": "Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps.\nThis becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years.) However, humans are often reminded of past memories or mental states which are associated with the current mental state.\nWe consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. 
Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.", "full_text": "Sparse Attentive Backtracking:\n\nTemporal Credit Assignment Through Reminding\n\nNan Rosemary Ke1,2, Anirudh Goyal1, Olexa Bilaniuk1, Jonathan Binas1,\n\nMichael C. Mozer3, Chris Pal1,2,4, Yoshua Bengio1\u2020\n\n1 Mila, Universit\u00e9 de Montr\u00e9al\n2 Mila, Polytechnique Montr\u00e9al\n3 University of Colorado, Boulder\n\n4 Element AI\n\n\u2020CIFAR Senior Fellow.\n\nAbstract\n\nLearning long-term dependencies in extended temporal sequences requires credit\nassignment to events far back in the past. The most common method for training\nrecurrent neural networks, back-propagation through time (BPTT), requires credit\ninformation to be propagated backwards through every single step of the forward\ncomputation, potentially over thousands or millions of time steps. This becomes\ncomputationally expensive or even infeasible when used with long sequences.\nImportantly, biological brains are unlikely to perform such detailed reverse replay\nover very long sequences of internal states (consider days, months, or years.)\nHowever, humans are often reminded of past memories or mental states which\nare associated with the current mental state. We consider the hypothesis that\nsuch memory associations between past and present could be used for credit\nassignment through arbitrarily long sequences, propagating the credit assigned to\nthe current state to the associated past state. Based on this principle, we study a\nnovel algorithm which only back-propagates through a few of these temporal skip\nconnections, realized by a learned attention mechanism that associates current states\nwith relevant past states. 
We demonstrate in experiments that our method matches or\noutperforms regular BPTT and truncated BPTT in tasks involving particularly\nlong-term dependencies, but without requiring the biologically implausible backward\nreplay through the whole history of states. Additionally, we demonstrate that the\nproposed method transfers to longer sequences significantly better than LSTMs\ntrained with BPTT and LSTMs trained with full self-attention.\n\n1 Introduction\n\nHumans have a remarkable ability to remember events from the distant past which are associated\nwith the current mental state (Ciaramelli et al., 2008). Most experimental and theoretical analyses\nof memory have focused on understanding the deliberate route to memory formation and recall.\nBut automatic reminding\u2014when memories pop into one\u2019s head\u2014can have a potent influence on\ncognition. Reminding is normally triggered by contextual features present at the moment of retrieval\nwhich match distinctive features of the memory being recalled (Berntsen et al., 2013; Wharton\net al., 1996), and can occur more often following unexpected events (Read & Ian, 1991). Thus, an\nindividual\u2019s current state of understanding can trigger reminding of a past state. Reminding can\nprovide distracting sources of irrelevant information (Forbus et al., 1995; Novick, 1988), but it can\nalso serve a useful computational role in ongoing cognition by providing information essential to\ndecision making (Benjamin & Ross, 2010).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we identify another possible role of reminding: to perform credit assignment across\nlong time spans. Consider the following scenario. As you drive down the highway, you hear an\nunusual popping sound. 
You think nothing of it until you stop for gas and realize that one of your\ntires has deflated, at which point you are suddenly reminded of the pop. The reminding event helps\ndetermine the cause of your flat tire, and probably leads to synaptic changes by which a future pop\nsound while driving would be processed differently. Credit assignment is critical in machine learning.\nBack-propagation is fundamentally performing credit assignment. Although some progress has been\nmade toward credit-assignment mechanisms that are functionally equivalent to back-propagation (Lee\net al., 2014; Scellier & Bengio, 2016; Whittington & Bogacz, 2017), it remains very unclear how the\nequivalent of back-propagation through time, used to train recurrent neural networks (RNNs), could\nbe implemented by brains. Here we explore the hypothesis that an associative reminding process\ncould play an important role in propagating credit across long time spans, also known as the problem\nof learning long-term dependencies in RNNs, i.e., of learning to exploit statistical dependencies\nbetween events and variables which occur temporally far from each other.\n\n1.1 Credit Assignment in Recurrent Neural Networks\n\nRNNs are used to process sequences of variable length. They have\nachieved state-of-the-art results for many machine learning sequence\nprocessing tasks. Examples where models based on RNNs shine include\nspeech recognition (Miao et al., 2015; Chan et al., 2016), image captioning\n(Vinyals et al., 2015; Lu et al., 2017), and machine translation (Luong et al.,\n2015).\nIt is common practice to train RNNs using gradients computed with\nback-propagation through time (BPTT), wherein the network states are unrolled\nin time over the whole trajectory of discrete time steps and gradients\nare back-propagated through the unrolled graph. 
The network unfolding\nprocedure of BPTT does not seem biologically plausible. It requires\nstoring and playing back these events (in reverse order) using the same\nrecurrent weights to combine error signals with activities and derivatives at\nprevious time points. The replay is initiated only at the end of a trajectory\nof T time steps, and thus requires memorization of a large number of states. If a discrete time\ninstant corresponds to a saccade (about 200-300 ms), then a trajectory of 100 days would require\nreplaying computations through over 42 million time steps. This is not only inconvenient,\nbut more importantly a small error in any one of these events could either vanish or blow up and\ncause catastrophic outcomes. Also, if this unfolding and back-propagation is done only over shorter\nsequences, then learning typically will not capture longer-term dependencies linking events across\nlarger temporal spans than the length of the back-propagated trajectory.\n\nWhat are the alternatives to BPTT? One approach we explore here exploits associative reminding\nof past events which may be triggered by the current state and added to it, thus making it possible\nto propagate gradients with respect to the current state into approximate gradients in the state\ncorresponding to the recalled event. The approximation comes from not backpropagating through\nthe unfolded ordinary recurrence across long time spans, but only through this memory retrieval\nmechanism. Completely different approaches, such as methods based on the online estimation\nof gradients (Ollivier et al., 2015), are possible but are not currently close to BPTT in\nterms of learning performance on large networks. Assuming that no exact gradient estimation method is possible\n(which seems likely), it could well be that brains combine multiple estimators.\n\nIn machine learning, the most common practical alternative to full BPTT is truncated BPTT (TBPTT;\nWilliams & Peng, 1990). 
In TBPTT, a long sequence is sliced into a number of (possibly overlapping)\nsubsequences, gradients are backpropagated only for a fixed, limited number of time steps into the past,\nand the parameters are updated after each backpropagation through a subsequence. Unfortunately,\nthis truncation makes capturing dependencies across distant timesteps nigh-impossible, because no\nerror signal reaches further back into the past than TBPTT\u2019s truncation length.\n\nNeurophysiological findings support the idea that memories are recalled and are involved\nin credit assignment and learning in biological systems. In particular, hippocampal recordings in\nrats indicate that brief sequences of prior experience are replayed both in the awake resting state and\nduring sleep, two conditions that are both linked to memory consolidation and learning (Foster &\nWilson, 2006; Davidson et al., 2009; Gupta et al., 2010; Ambrose et al., 2016). Thus, the mental\nlook back into the past seems to occur exactly when credit assignment is to be performed. It is therefore\nplausible that hippocampal replay could be a way of doing temporal credit assignment (and possibly\nBPTT) on a short time scale, but here we argue for a solution which could handle credit assignment\nover much longer durations.\n\n[Figure: The sparse attentive backtracking model.]\n\n1.2 Novel Credit Assignment Mechanism: Sparse Attentive Backtracking\n\nInspired by the ability of brains to selectively reactivate memories of the past based on the current\ncontext, we propose here a novel solution called Sparse Attentive Backtracking (SAB) that\nincorporates a differentiable, sparse (hard) attention mechanism to select from past states. Inspired by the\ncognitive analogy of reminding, SAB is designed to retrieve one or very few past states. 
This may\nalso be advantageous in focusing the credit assignment, although this hypothesis remains to be tested.\nSAB meshes well with TBPTT, yet allows gradient to propagate over distances far in excess of the\nTBPTT truncation length. We experimentally answer affirmatively the following questions:\n\nQ1. Can Sparse Attentive Backtracking (SAB) capture long-term dependencies? SAB captures long-term dependencies. See results for 7 tasks supporting this in \u00a74.\nQ2. Generalization and transfer ability of SAB? See the strong transfer results in \u00a74.\nQ3. How does SAB perform compared to the Transformers (Vaswani et al., 2017)? SAB outperforms the Transformers (comparison in \u00a74).\nQ4. Is sparsity important for SAB and does it learn to retrieve meaningful memories? See the results on the Importance of Sparsity and Table 4 in \u00a74.\n\n2 Related Machine Learning Work\n\nSkip-connections and gradient flow Neural architectures such as Residual Networks (He et al.,\n2016) and Dense Networks (Huang et al., 2016) allow information to skip over convolutional\nprocessing blocks of an underlying convolutional network architecture. This construction provably\nmitigates the vanishing gradient problem by allowing the gradient at any given layer to be bounded.\nDensely-connected convolutional networks alleviate the vanishing gradient problem by allowing a\ndirect path from any layer in the network to the output layer. In contrast, in this work we propose\nand explore what one might regard as a form of dynamic skip connection, modulated by an attention\nmechanism corresponding to a reminding process, which matches the current state with an older state\nthat is retrieved from memory.\nRecurrent neural networks with skip-connections in time can allow information to flow over much\nlonger time spans. These skip-connections can have either a fixed time span such as in hierarchical\nEl Hihi & Bengio (1996) or clockwork Koutnik et al. 
(2014) RNNs, or a dynamic time span such\nas in Chung et al. (2016); Mozer et al. (2017); Ke et al. (2018). All of these models still need to\nbe trained with full BPTT, which requires a full replay of past events. Designs also exist based on\nwormhole connections, implemented as differentiable reads and writes to external memories, as in\nGulcehre et al. (2017). Also, as noted in K\u00e1d\u00e1r et al. (2018), the training procedures and\nimplementations of highly complex architectures might hinder their utility.\n\nThe transformer network The Transformer network (Vaswani et al., 2017) takes sequence\nprocessing using attention to its logical extreme \u2013 using attention only, not relying on RNNs at all. The\nattention mechanism is a softmax not over the sequence itself but over the outputs of the previous\nself-attention layer. In order to attend to multiple parts of the layer outputs simultaneously, the\nTransformer uses 8 small attention \u201cheads\u201d per layer (instead of a single large head) and combines\nthe attention heads\u2019 outputs by concatenation. No attempt is made to make the attention weights\nsparse, and the authors do not test their models on sequences of length greater than the intermediate\nrepresentations of the Transformer model. With brains clearly involving a recurrent computation,\nthis approach would seem to miss an important characteristic of biological credit assignment through\ntime. Another implausible aspect of the Transformer architecture is the simultaneous access to (and\nlinear combination of) all past memories (as opposed to a handful with SAB).\n\nFigure 1: This figure illustrates the forward pass in SAB for the configuration ktop = 3, katt = 2, ktrunc = 2. This\ninvolves sparse retrieval (\u00a7 3.1) and summarization of memories into the next RNN hidden state. 
Gray arrows\ndepict how attention weights a(t) are evaluated, first by broadcasting and concatenating the current provisional\nhidden state \u0125(t) against the set of all memories M and computing raw attention weights with an MLP. The\nsparsifier selects and normalizes only the ktop greatest raw attention weights, while the others are zeroed out.\nRed arrows show memories corresponding to non-zero sparsified attention weights being weighted, summed,\nthen added into \u0125(t) to compute the final hidden state h(t).\n\n3 Sparse Attentive Backtracking\n\nMindful that humans use a very sparse subset of past experiences in credit assignment, and are\ncapable of direct random access to past experiences and their relevance to the present, we present\nhere SAB: the principle of learned, dynamic, sparse access to, and replay of, relevant past states for\ncredit assignment in neural network models, such as RNNs.\nIn the limit of maximum sparsity (no access to the past), SAB degenerates to the use of a regular\nstatic neural network. In the limit of minimum sparsity (full access to the past), SAB degenerates to\nthe use of a full self-attention mechanism. For the purposes of this paper, we explore the gap between\nthese with a specific variety of augmented LSTM models; but SAB does not refer to any particular\narchitecture, and the augmented LSTM described herein is used purely as a vehicle to explore and\nvalidate our hypotheses in \u00a71.\nBroadly, an SAB neural network is required to do two things:\n\n\u2022 During the forward pass, manage a memory unit and select at most a sparse subset of past\nmemories at every timestep. We will call this sparse retrieval.\n\u2022 During the backward pass, propagate gradient only to that sparse subset of memory and its\nlocal surroundings. We will call this sparse replay.\n\n3.1 Sparse Retrieval of Memories\n\nJust as humans make a selective use of all past memories to inform their decisions in the present, so\nmust an SAB model learn to remember and dynamically select only a few memories that could be\npotentially useful in the present. There are several alternative implementations of this concept. An\nimportant class of them are attention mechanisms, especially self-attention over a model\u2019s own past\nstates. Closely linked to the question of dynamic access to memory is the structure of the memory\nitself; for instance, in the Differentiable Neural Computer (DNC) (Graves et al., 2016), the memory is\na fixed-size tensor accessed with explicit read and write operations, while in Bahdanau et al. (2014),\nthe memory is implicitly a list of past hidden states that continuously grows.\nFor the purposes of this paper, we choose a simple approach similar to Bahdanau et al. (2014). Many\nother options are possible, and the question of memory representation in humans (faithful to actual\nbrains) and machines (with good computational properties) remains open. Here, to test the principle\nof SAB without having to answer that question, we use an approach already shown to work well\nin machine learning. We augment a unidirectional LSTM with the memory of every katt\u2019th hidden\nstate from the past, with a modified hard self-attention mechanism limited to selecting at most ktop\nmemories at every timestep. Future work should investigate more realistic mechanisms for storing\nmemories, e.g., based on saliency, novelty, etc. But this simple scheme allows us to test the hypothesis\nthat neural network models can still perform well even when compelled at every timestep to access\ntheir past sparsely. 
If they cannot, then it would be meaningless to further encumber them with a\nbounded-size memory.\n\nSAB-augmented LSTM We now describe the sparse retrieval mechanism that we have settled on.\nIt determines which memories will be selected on the forward pass of the RNN, and therefore also\nwhich memories will receive gradient on the backward pass during training.\nAt time t, the underlying LSTM receives a vector of hidden states h(t\u22121), a vector of cell states\nc(t\u22121), and an input x(t), and computes new cell states c(t) and a provisional hidden state vector\n\u0125(t) that also serves as a provisional output. We next use an attention mechanism that is similar to\nBahdanau et al. (2014), but modified to produce sparse attention decisions. First, the provisional\nhidden state vector \u0125(t) is concatenated to each memory vector m(i) in the memory M. Then, an\nMLP with one hidden layer maps each such concatenated vector to a scalar, non-sparse, raw attention\nweight a_i(t) representing the salience of the memory i at the current time t. The MLP is parametrized\nwith weight matrices W1, W2 and W3.\nWith the raw attention weights, we compute the\nsparsified attention weights \u00e3_i(t) by subtracting\nout the (ktop + 1)\u2019th largest raw weight from all the\nothers, passing the intermediate result through\nReLU, then normalizing to sum to 1. This mechanism\nis differentiable (see S.3 for details) and\neffectively implements a discrete, hard decision\nto drop all but ktop memories and to weigh the selected\nmemories by their prominence over the others,\nas opposed to their raw value. 
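
The sparsification step described above (subtract the (ktop + 1)'th largest raw weight, apply ReLU, normalize to sum to 1) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' implementation: the function name `sparsify_attention`, the small `eps` guard against division by zero, and the decision to raise an error when there are not more than `k_top` memories are all assumptions made here.

```python
def sparsify_attention(raw_weights, k_top, eps=1e-8):
    # Hypothetical helper: turn raw attention weights into sparse, normalized ones.
    if k_top < 1:
        raise ValueError('k_top must be at least 1')
    if len(raw_weights) <= k_top:
        # Assumption of this sketch: a cut-off needs at least k_top + 1 memories.
        raise ValueError('need more than k_top memories to pick a cut-off')
    # The (k_top + 1)-th largest raw weight acts as the cut-off.
    cutoff = sorted(raw_weights, reverse=True)[k_top]
    # Subtract the cut-off and clamp at zero (ReLU): at most k_top entries stay positive.
    shifted = [max(a - cutoff, 0.0) for a in raw_weights]
    # Normalize so the surviving weights sum to (approximately) 1; eps is an assumption.
    total = sum(shifted) + eps
    return [s / total for s in shifted]


weights = sparsify_attention([0.1, 2.0, -0.5, 1.5, 0.7], k_top=2)
# Only the two largest raw weights (2.0 and 1.5) keep a non-zero attention weight.
```

Because the surviving weights are measured relative to the cut-off, a memory that only narrowly makes it into the top `k_top` receives a weight close to zero, which matches the description of weighing memories by their prominence over the others.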
This is different\nfrom typical attention mechanisms that normalize\nattention weights using a softmax function\n(Bahdanau et al., 2014), whose output is never\nsparse.\nA summary vector s(t) is then computed using a\nsimple sum of the selected memories, weighted\nby their respective sparsified attention weight.\nGiven that this sum is very sparse, the summary\noperation is very fast. This summary is then\nadded into the provisional hidden state \u0125(t) computed\npreviously to obtain the final state h(t).\nLastly, to compute the SAB-augmented LSTM\ncell\u2019s output y(t) at t, we concatenate h(t) and\nsummary vector s(t), then apply an affine output transform parametrized with learned weight\nmatrices V1 and V2 and bias vector b.\nThe forward pass into a hidden state h(t) has two paths contributing to it. One path is the regular\nsequential forward path in an RNN; the other path is through the dynamic but sparse skip connections\nin the attention mechanism that connect the present states to potentially very distant past experiences.\n\nAlgorithm 1 SAB-augmented LSTM\nRequire: ktop > 0, katt > 0, ktrunc > 0\nRequire: Memories m(i) \u2208 M\nRequire: Previous hidden state h(t\u22121)\nRequire: Previous cell state c(t\u22121)\nRequire: Input x(t)\nprocedure SABCell(h(t\u22121), c(t\u22121), x(t))\n    \u0125(t), c(t) \u2190 LSTMCell(h(t\u22121), c(t\u22121), x(t))\n    for all i \u2208 1 ... |M| do\n        d_i(t) \u2190 W1 m(i) + W2 \u0125(t)\n        a_i(t) \u2190 W3 tanh(d_i(t))\n    a_ktop(t) \u2190 sorted(a(t))[ktop + 1]\n    \u00e3(t) \u2190 ReLU(a(t) \u2212 a_ktop(t))\n    s(t) \u2190 (\u03a3_{m(i) \u2208 M} \u00e3_i(t) m(i)) / (\u03a3_i \u00e3_i(t))\n    h(t) \u2190 \u0125(t) + s(t)\n    y(t) \u2190 V1 h(t) + V2 s(t) + b\n    if t \u2261 0 (mod katt) then\n        M.append(h(t))\n    return h(t), c(t), y(t)\n\n3.2 Sparse Replay\n\nHumans are trivially capable of assigning credit or blame to events even a long time after the fact,\nand do not need to replay all events from the present to the credited event sequentially and in reverse\nto do so. 
But that is effectively what RNNs trained with full BPTT require, and this does not seem\nbiologically plausible when considering events which are far from each other in time. Even less\nplausible is TBPTT because it ignores time dependencies beyond the truncation length ktrunc.\nSAB networks\u2019 twin paths during the forward pass (sequential connection and sparse skip connections)\nallow gradient to flow not just from h(t) to h(t\u22121), but also to the at-most ktop memories m(i)\nretrieved by the attention mechanism (and no others). Learning to deliver gradient directly (and\nsparsely) where it is needed (and nowhere else) (1) avoids competition for the limited information-carrying\ncapacity of the sequential path, (2) is a simple form of credit assignment, and (3) imposes\na trade-off that is absent in previous, dense self-attentive mechanisms: opening a connection to an\ninteresting or useful timestep must be made at the price of excluding others. This competition for a\nlimited budget of ktop connections results in interesting timesteps being given frequent attention and\nstrong gradient flow, while uninteresting timesteps are ignored and starve.\n\nFigure 2: This figure illustrates the backward pass in SAB for the configuration ktop = 3, katt = 2, ktrunc = 2.\nThe gradients are passed to the hidden states selected in the forward pass and a local truncated backprop is\nperformed around those hidden states. Blue arrows show the gradient flow in the backward pass. Red crosses\nindicate TBPTT truncation points, where the gradient stops propagating.\n\nMental updates If we not only allow gradient to flow directly to a past timestep, but on to a few\nlocal timesteps around it as well, we have mental updates: a type of local credit assignment around\na memory. There are various ways of enabling this. 
In our SAB-augmented LSTM, we choose to\nperform TBPTT locally before the selected timesteps (ktrunc timesteps before a selected one).\n\n4 Experimental Setup and Results\n\nBaselines We compare SAB to two baseline models for all tasks: 1) an LSTM trained using either full\nBPTT or TBPTT with various truncation lengths; 2) an LSTM augmented with full self-attention\ntrained using full BPTT. For the pixel-by-pixel CIFAR10 classification task, we also compare to the\nTransformer architecture (Vaswani et al., 2017).\n\nCopying and adding problems (Q1) The copy and adding problems defined in Hochreiter &\nSchmidhuber (1997) are synthetic tasks specifically designed to evaluate a model\u2019s performance\non long-term dependencies by testing its ability to remember a sub-sequence for a large number of\ntimesteps.\nFor the copy task, the network is given a sequence of T + 20 inputs consisting of: a) 10 randomly\ngenerated digits (digits 1 to 8), followed by b) T blank inputs, followed by c) a special\nend-of-sequence character, followed by d) 10 additional blank inputs. After the end-of-sequence character\nthe network must output a copy of the initial 10 digits. The adding task requires the model to sum\ntwo specific entries in a sequence of T (input) entries. Each example in the task consists of two input\nvectors of length T. The first is a vector of uniformly generated values between 0 and 1. The second\nvector encodes a binary mask which indicates the two entries in the first input to be added (the mask\nvector consists of T \u2212 2 zeros and 2 ones). The mask is randomly generated with the constraint that\nmasked-in entries must be from different halves of the first input vector.\nThe hyperparameters for both baselines and SAB are kept the same. All models have 128 hidden units\nand use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-3. 
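
As an illustration of the adding task described above, a single example could be generated along the following lines. This is a hedged sketch in plain Python, not the data pipeline used in the experiments: the function name `make_adding_example`, the use of `random.Random` as the source of randomness, and the choice of splitting the sequence at `T // 2` are assumptions made here.

```python
import random


def make_adding_example(T, rng):
    # Hypothetical helper: build one adding-task example as (values, mask, target).
    if T < 2:
        raise ValueError('T must be at least 2')
    # First input vector: T values drawn uniformly from [0, 1).
    values = [rng.random() for _ in range(T)]
    # Pick one masked-in position in each half of the sequence (assumed split at T // 2).
    half = T // 2
    first = rng.randrange(0, half)
    second = rng.randrange(half, T)
    # Second input vector: binary mask with T - 2 zeros and 2 ones.
    mask = [0] * T
    mask[first] = 1
    mask[second] = 1
    # The target is the sum of the two masked-in values.
    target = values[first] + values[second]
    return values, mask, target


values, mask, target = make_adding_example(200, random.Random(0))
```

The two masked-in positions can be up to T − 1 steps apart, which is why performance on this task degrades when gradients are truncated to a window much shorter than T.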
The first model in\nthe ablation study (dense version of SAB) was more difficult to train, so we explored different\nlearning rates ranging from 1e-3 to 1e-5. We report the best performing model.\nThe performance of SAB almost matches the performance of LSTMs augmented with self-attention\ntrained using full BPTT. Note that our copy and adding LSTM baselines are more competitive\ncompared to ones reported in the existing literature (Arjovsky et al., 2016). These findings support\nour hypothesis that at any given time step, only a few past events need to be recalled for the correct\nprediction of the output at the current timestep.\n\nTable 3 reports the cross-entropy (CE) of the model predictions on unseen sequences in the adding\ntask. LSTM with full self-attention trained using BPTT obtains the lowest CE loss, followed by\nLSTM trained using BPTT. LSTM trained with truncated BPTT performs significantly worse. When\nT = 200, SAB\u2019s performance is comparable to the best baseline models. With longer sequences\n(T = 400), SAB outperforms TBPTT, but is outperformed by pure BPTT. For more details regarding\nthe setup, refer to the supplementary material.\n\nCharacter level Penn TreeBank (PTB) (Q1) We follow the setup in Cooijmans et al. (2016) and\nall of our models use 1000 hidden units and a learning rate of 0.002. We used non-overlapping\nsequences of length 100 in batches of 32 as in Cooijmans et al. (2016). All models were trained for up\nto 100 epochs with early stopping based on the validation performance.\nWe evaluate the performance of our model using the bits-per-character (BPC) metric. As shown in\nTable 3, SAB\u2019s performance is significantly better than TBPTT\u2019s and almost matches BPTT, which is\nroughly what one expects from an approximate-gradient method like SAB.\n\nText8 (Q1) We follow the setup of Mikolov et al. 
(2012); we use the first 90M characters for training,\nthe next 5M for validation and the final 5M characters for testing. We train on non-overlapping\nsequences of length 180. Due to computational constraints, all baselines use 1000 hidden units. We\ntrained all models using a batch size of 64. We trained SAB for a maximum of 30 epochs.\nDetails about our experimental setup can be found in the supplementary material. Note that we\ndid not carry out any additional hyperparameter search for our model. Table 3 reports the BPC of\nthe model\u2019s predictions on the test sets. SAB outperforms LSTM trained using TBPTT. SAB also\noutperforms LSTM with self-attention trained with TBPTT. For more details, refer to the supplementary\nmaterial.\n\nComparison to LSTM + self attention (with truncation) SAB is trained with TBPTT,\nwhereas the vanilla LSTM with self-attention is not. Here we argue that training the vanilla LSTM with\nself-attention using truncation works less well on the more challenging Text8 language modelling dataset (Table 1).\n\nPermuted pixel-by-pixel MNIST (Q1) This task\nis a sequential version of the MNIST classification\ndataset. The task involves predicting the label of the\nimage after being given its pixels as a sequence\npermuted in a fixed, random order. All models use an\nLSTM with 128 hidden units. The prediction is\nproduced by passing the final hidden state of the network\ninto a softmax. We used a learning rate of 0.001. We\ntrained our model for about 100 epochs, and did early\nstopping based on the validation set. 
Our experimental\nsetup can be found in the supplementary material.\nTable 5 shows that SAB performs well compared to\nBPTT.\n\nMethod | Test BPC\nLSTM (full BPTT) | 1.42\nLSTM (TBPTT, ktrunc=5) | 1.56\nLSTM (Self Attention with Truncation, ktrunc=10) | 1.48\nSAB (ktrunc=10, ktop=10, katt=10) | 1.44\n\nTable 1: Bits-per-character (BPC) results on the\ntest set for Text8 (lower is better).\n\nCIFAR10 classification (Q1,Q3) We test our model\u2019s performance on pixel-by-pixel CIFAR10 (no\npermutation). This task involves predicting the label of the image after being given it as a sequence of\npixels. This task is relatively difficult compared to other tasks, as sequences are substantially longer\n(length 1024). Our method outperforms Transformers and LSTMs trained with BPTT (Table 5).\n\nLearning long-term dependencies (Q1) Table 2 reports both accuracy and cross-entropy (CE)\nof the models\u2019 predictions on unseen sequences for the copy memory task. The best-performing\nbaseline model is the LSTM with full self-attention trained using BPTT, followed by vanilla LSTMs\ntrained using BPTT. Far behind are LSTMs trained using truncated BPTT. Table 2 demonstrates that\nSAB is able to learn the task almost perfectly for all copy lengths T. Further, SAB outperforms all\nLSTM baselines and matches the performance of LSTMs with full self-attention trained using BPTT\non the copy memory task. This becomes particularly noticeable as the sequence length increases.\n\nTransfer learning (Q2) We examine the generalization ability of SAB compared to LSTMs trained\nwith full BPTT and LSTMs with full self-attention. 
The experiment is set up as follows: For the copy\n\n7\n\n\fktop\n\nktrunc\nfull BPTT\nfull self-attn.\n-\n-\n-\n-\n-\n1\n5\n5\n10\n\n1\n5\n10\n20\n150\n1\n1\n5\n10\n\nM\nT\nS\nL\n\nB\nA\nS\n\nCopying (T=100)\n\nacc.\n99.8\n100.0\n20.6\n31.0\n29.6\n30.5\n-\n57.9\n100.0\n100.0\n100.0\n\nCE10\n0.030\n0.0008\n1.984\n1.737\n1.772\n1.714\n-\n1.041\n0.001\n0.000\n0.000\n\nCE\n0.002\n0.0000\n0.165\n0.145\n0.148\n0.143\n-\n0.087\n0.000\n0.000\n0.001\n\nCopying (T=200)\nacc.\n56.0\n100.0\n\nCE10\n1.07\n0.001\n\nCE\n0.046\n0.000\n\n17.1\n20.2\n35.8\n35.0\n39.9\n\n100.0\n100.0\n\n2.03\n1.98\n1.61\n1.596\n1.516\n\n0.000\n0.000\n\n0.092\n0.090\n0.073\n0.073\n0.069\n\n0.000\n0.000\n\nCopying (T=300)\nacc.\n35.9\n100.0\n14.0\n\nCE10\n0.197\n0.002\n2.077\n\nCE\n0.047\n7.5e-5\n0.065\n\n25.7\n24.4\n43.1\n89.1\n99.9\n\n1.848\n1.857\n0.231\n0.383\n0.007\n\n0.197\n0.058\n0.045\n0.012\n0.001\n\nTable 2: Test accuracy and cross-entropy (CE) loss performance on the copying task with sequence lengths of\nT=100, 200, and 300. Accuracies are given in percent for the last 10 characters. CE10 corresponds to the CE\nloss on the last 10 characters. These results are with mental updates; Compare with Table 4 for without.\n\nktop\n\nAdding\nktrunc\nfull BPTT\nfull self-attn.\n-\n-\n-\n5\n10\n10\n\n20\n50\n100\n5\n5\n10\n\nM\nT\nS\nL\n\nB\nA\nS\n\nT=200\nCE\n4.59e-6\n5.541e-8\n1.1e-3\n3.0e-4\n\n4.26e-5\n\n2.0e-6\n\nT=400\n\nCE\n1.554e-7\n4.972e-7\n\n6.8e-4\n\n2.30e-4\n1.001e-5\n\nLanguage\nktrunc\nfull BPTT\n\nktop\n\nM\nT\nS\nL\n\nB\nA\nS\n\n1\n5\n20\n10\n10\n20\n20\n\n-\n-\n-\n5\n10\n5\n10\n\nkatt\n\n-\n-\n-\n10\n10\n20\n20\n\nPTB Text8\nBPC\nBPC\n1.36\n1.42\n1.47\n1.44\n1.40\n1.42\n1.40\n1.39\n1.37\n\n1.47\n1.45\n1.45\n1.44\n\n1.56\n\nTable 3: Performance on the adding task (left) and language modeling tasks (PTB and Text8; right). 
The\nadding task performance is evaluated on unseen sequences of length T = 200 and T = 400 (note that all methods\nhave configurations that allow them to perform near optimally). For T = 400, BPTT slightly outperforms SAB,\nwhich outperforms TBPTT. For the language modeling tasks, the BPC score is evaluated on the test sets of the\ncharacter-level PTB and Text8.\n\ntask of length T = 100, we train SAB, an LSTM trained with BPTT, and an LSTM with full self-attention to\nconvergence. We then take the trained models and evaluate them on the copy task for an array of larger\nT values. The results are shown in Table 6. Although all 3 models have similar performance on\nT = 100, it is clear that performance for all 3 models drops as T grows. However, SAB still manages\nto complete the task at T = 5000, whereas by T = 2000 both vanilla LSTM and LSTM with full\nself-attention do no better than random guessing (1/8 = 12.5%).\n\nImportance of sparsity and mental updates (Q4) We study the necessity of sparsity and mental\nupdates by running an ablation study on the copying problem. The ablation study focuses on two\nvariants. The first model attends to all events in the past while performing a truncated update. This\ncan be seen either as a dense version of SAB or an LSTM with full self-attention trained using TBPTT.\nEmpirically, we find that such models are both more difficult to train and do not reach the same\nperformance as SAB. The second ablation experiment tests the necessity of mental updates, without\nwhich the model would only attend to the past time steps without passing gradients through them to\npreceding time steps. We observe a degradation of model performance when blocking gradients to\npast events. This effect is most evident when attending to only one timestep in the past (ktop = 1).\n\nWe evaluate SAB on language modeling, with the Penn TreeBank (PTB) (Marcus et al., 1993) and\nText8 (Mahoney, 2011) datasets. 
For models trained using truncated BPTT, the performance drops as\nktrunc shrinks. We found that on PTB, SAB with ktrunc = 20, ktop = 10 performs almost as well as full\nBPTT. For the larger Text8 dataset, SAB with ktrunc = 10 and ktop = 5 outperforms LSTM trained\nusing BPTT.\n\nAblation | ktrunc | ktop | Copying (T=100) acc. | CE (last 10) | CE\nno MU | 1 | 1 | 49.0 | 1.252 | 0.104\nno MU | 5 | 5 | 98.3 | 0.042 | 0.0036\nno MU | 10 | 10 | 99.6 | 0.022 | 0.0018\ndense | 5 | all | 40.5 | 1.529 | 0.127\nAdding (T=200) CE: 2.171e-6\n\n[Figure: attention weights over past macrostates (x-axis: past macrostate, 0 to 200; y-axis: timestep, 210 to 220) at three points in training, panels a-c.]\n\nTable 4: Left: ablation studies on the adding and copying tasks. The limiting cases of dense attention (ktop = all)\nand of no mental updates (MU) were tested. Right: focus of the attention for the T=200 copying task, where\nreproduction of the initial 10 input symbols is required (black corresponds to stronger attention weights). The\nfigure was generated at different points in training (a-c) within the first epoch. Attention quickly shifts to the relevant\nparts of the sequence (the initial 10 states).\n\nImage class. | ktrunc | ktop | katt | pMNIST acc. | CIFAR10 acc.\nLSTM | full BPTT | | | 90.3 | 58.3\nLSTM | 300 | - | - | | 51.3\nSAB | 20 | 20 | 5 | 89.8 |\nSAB | 20 | 20 | 10 | 90.9 |\nSAB | 50 | 50 | 10 | 94.2 |\nSAB | 16 | 16 | 10 | | 64.5\nTransformer (Vaswani\u201917) | | | | 97.9 | 62.2\n\nTable 5: Test accuracy for the permuted MNIST and\nCIFAR10 classification tasks.\n\nTransfer Learning Results\n\nCopy len. (T) | LSTM | LSTM + self-a. | SAB\n100 | 99% | 100% | 99%\n200 | 34% | 52% | 95%\n300 | 25% | 28% | 83%\n400 | 21% | 20% | 75%\n2000 | 12% | 12% | 47%\n5000 | 12% | OOM | 41%\n\nTable 6: Transfer performance (accuracy\nfor last 10 digits) for models trained on\nT = 100 copy memory task. 
Comparisons to LSTM and LSTM with full\nself-attention trained with BPTT.\n\nComparison to Transformer (Q3) We test how SAB compares to the Transformer model (Vaswani\net al., 2017), which is based on a self-attention mechanism. On pMNIST, the Transformer model outperforms our\nbest model, as shown in Table 5. On CIFAR10, however, our proposed model performs much better.\n\n5 Conclusions\nBy considering how brains could perform long-term temporal credit assignment, we developed\nan alternative to the traditional method of training recurrent neural networks by unfolding the\ncomputational graph and applying BPTT. We explored the hypothesis that a reminding process which uses\nthe current state to evoke a relevant state arbitrarily far back in the past could be used to effectively\nteleport credit backwards in time to the computations performed to obtain the past state. To test this\nidea, we developed a novel temporal architecture and credit assignment mechanism called SAB, for\nSparse Attentive Backtracking, which aims to combine the strengths of full backpropagation through\ntime and truncated backpropagation through time. It does so by backpropagating gradients only\nthrough paths for which the current state and a past state are associated. This allows the RNN to learn\nlong-term dependencies, as with full backpropagation through time, while still allowing it to only\nbacktrack for a few steps, as with truncated backpropagation through time, thus making it possible to\nupdate weights as frequently as needed rather than having to wait for the end of very long sequences.\nCognitive processes in reminding serve not only as the inspiration for SAB, but also suggest two interesting\ndirections for future research. 
First, we assumed a simple content-independent rule for selecting hidden\nstates for inclusion in the memory (selecting every katt-th step), whereas humans show a systematic\ndependence on content: salient, extreme, unusual, and unexpected experiences are more likely to be\nstored and subsequently remembered. These landmarks of memory should be useful for connecting\npast to current context, just as an individual learns to map out a city via distinctive geographic\nlandmarks. Second, SAB determines the relevance of past hidden states to the current state through a\ngeneric, flexible mapping, whereas humans perform similarity-based retrieval. We conjecture that\na version of SAB with a strong inductive bias in the mechanism to select past states may further\nimprove its performance.\n\nAcknowledgements\nThe authors would like to thank Hugo Larochelle, Walter Senn, Alex Lamb, Remi Le Priol, Matthieu\nCourbariaux, Gaetan Marceau Caron, and Sandeep Subramanian for useful discussions, as well as\nNSERC, CIFAR, Google, Samsung, SNSF, Nuance, IBM, Canada Research Chairs, and National Science\nFoundation awards EHR-1631428 and SES-1461535 for funding. We would also like to thank\nCompute Canada and NVIDIA for computing resources. The authors would also like to thank Alex\nLamb for code review. The authors would also like to express a debt of gratitude towards those who\ncontributed to Theano over the years (now that it is being sunset), for making it such a great tool.\n\nReferences\nAmbrose, R Ellen, Pfeiffer, Brad E, and Foster, David J. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron, 91(5):1124\u20131136, 2016.\n\nArjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120\u20131128, 2016.\n\nBahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473, 2014.\n\nBenjamin, Aaron S and Ross, Brian H. The causes and consequences of reminding. In Benjamin, A. S. (ed.), Successful remembering and successful forgetting: A Festschrift in honor of Robert A. Bjork. Psychology Press, 2010.\n\nBerntsen, Dorthe, Staugaard, S\u00f8ren Risl\u00f8v, and S\u00f8rensen, Louise Maria Torp. Why am i remembering this now? predicting the occurrence of involuntary (spontaneous) episodic memories. Journal of Experimental Psychology: General, 142(2):426, 2013.\n\nChan, William, Jaitly, Navdeep, Le, Quoc, and Vinyals, Oriol. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 4960\u20134964. IEEE, 2016.\n\nChung, Junyoung, Ahn, Sungjin, and Bengio, Yoshua. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.\n\nCiaramelli, Elisa, Grady, Cheryl L, and Moscovitch, Morris. Top-down and bottom-up attention to memory: A hypothesis on the role of the posterior parietal cortex in memory retrieval. Neuropsychologia, 46(7):1828\u20131851, 2008.\n\nCooijmans, Tim, Ballas, Nicolas, Laurent, C\u00e9sar, G\u00fcl\u00e7ehre, \u00c7a\u011flar, and Courville, Aaron. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.\n\nDavidson, Thomas J, Kloosterman, Fabian, and Wilson, Matthew A. Hippocampal replay of extended experience. Neuron, 63(4):497\u2013507, 2009.\n\nEl Hihi, Salah and Bengio, Yoshua. Hierarchical recurrent neural networks for long-term dependencies. In Advances in neural information processing systems, pp. 493\u2013499, 1996.\n\nForbus, Kenneth D, Gentner, Dedre, and Law, Keith. Mac/fac: A model of similarity-based retrieval. Cognitive Science, 19:141\u2013205, 1995.\n\nFoster, David J and Wilson, Matthew A. 
Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680\u2013683, 2006.\n\nGraves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwi\u0144ska, Agnieszka, Colmenarejo, Sergio G\u00f3mez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.\n\nGulcehre, Caglar, Chandar, Sarath, and Bengio, Yoshua. Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718, 2017.\n\nGupta, Anoopum S, van der Meer, Matthijs AA, Touretzky, David S, and Redish, A David. Hippocampal replay is not a simple function of experience. Neuron, 65(5):695\u2013705, 2010.\n\nHe, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770\u2013778, 2016.\n\nHochreiter, Sepp and Schmidhuber, J\u00fcrgen. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n\nHuang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.\n\nK\u00e1d\u00e1r, Akos, C\u00f4t\u00e9, Marc-Alexandre, Chrupa\u0142a, Grzegorz, and Alishahi, Afra. Revisiting the hierarchical multiscale lstm. arXiv preprint arXiv:1807.03595, 2018.\n\nKe, Nan Rosemary, Zolna, Konrad, Sordoni, Alessandro, Lin, Zhouhan, Trischler, Adam, Bengio, Yoshua, Pineau, Joelle, Charlin, Laurent, and Pal, Chris. Focused hierarchical rnns for conditional sequence processing. arXiv preprint arXiv:1806.04342, 2018.\n\nKingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\nKoutnik, Jan, Greff, Klaus, Gomez, Faustino, and Schmidhuber, Juergen. A clockwork rnn. 
arXiv preprint arXiv:1402.3511, 2014.\n\nLee, Dong-Hyun, Zhang, Saizheng, Biard, Antoine, and Bengio, Yoshua. Target propagation. CoRR, abs/1412.7525, 2014.\n\nLu, Jiasen, Xiong, Caiming, Parikh, Devi, and Socher, Richard. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, 2017.\n\nLuong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.\n\nMahoney, Matt. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html, 2011.\n\nMarcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313\u2013330, 1993.\n\nMiao, Yajie, Gowayyed, Mohammad, and Metze, Florian. Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pp. 167\u2013174. IEEE, 2015.\n\nMikolov, Tom\u00e1\u0161, Sutskever, Ilya, Deoras, Anoop, Le, Hai-Son, Kombrink, Stefan, and Cernocky, Jan. Subword language modeling with neural networks. preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 8, 2012.\n\nMozer, Michael C, Kazakov, Denis, and Lindsey, Robert V. Discrete event, continuous time rnns. arXiv preprint arXiv:1710.04110, 2017.\n\nNovick, Laura R. Analogical transfer, problem similarity, and expertise. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3):510, 1988.\n\nOllivier, Yann, Tallec, Corentin, and Charpiat, Guillaume. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680, 2015.\n\nRead, Stephen J and Ian, L. Expectation failures in reminding and explanation. 
Journal of Experimental Social Psychology, 27:1\u201325, 1991.\n\nScellier, Benjamin and Bengio, Yoshua. Towards a biologically plausible backprop. CoRR, abs/1602.05179, 2016.\n\nVaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, \u0141ukasz, and Polosukhin, Illia. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000\u20136010, 2017.\n\nVinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156\u20133164, 2015.\n\nWharton, Charles M, Holyoak, Keith J, and Lange, Trent E. Remote analogical reminding. Memory & Cognition, 24:629\u2013643, 1996.\n\nWhittington, James CR and Bogacz, Rafal. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation, 29(5):1229\u20131262, 2017.\n\nWilliams, Ronald J and Peng, Jing. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural computation, 2(4):490\u2013501, 1990.", "award": [], "sourceid": 3784, "authors": [{"given_name": "Nan Rosemary", "family_name": "Ke", "institution": "MILA, University of Montreal"}, {"given_name": "Anirudh Goyal", "family_name": "ALIAS PARTH GOYAL", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"given_name": "Olexa", "family_name": "Bilaniuk", "institution": "University of Montreal"}, {"given_name": "Jonathan", "family_name": "Binas", "institution": "MILA, Montreal"}, {"given_name": "Michael", "family_name": "Mozer", "institution": "Google Brain / U. Colorado"}, {"given_name": "Chris", "family_name": "Pal", "institution": "MILA, Polytechnique Montr\u00e9al, Element AI"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "U. Montreal"}]}