{"title": "Reversible Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 9029, "page_last": 9040, "abstract": "Recurrent neural networks (RNNs) provide state-of-the-art performance in processing sequential data but are memory intensive to train, limiting the flexibility of RNN models which can be trained. Reversible RNNs---RNNs for which the hidden-to-hidden transition can be reversed---offer a path to reduce the memory requirements of training, as hidden states need not be stored and instead can be recomputed during backpropagation. We first show that perfectly reversible RNNs, which require no storage of the hidden activations, are fundamentally limited because they cannot forget information from their hidden state. We then provide a scheme for storing a small number of bits in order to allow perfect reversal with forgetting. Our method achieves comparable performance to traditional models while reducing the activation memory cost by a factor of 10--15. We extend our technique to attention-based sequence-to-sequence models, where it maintains performance while reducing activation memory cost by a factor of 5--10 in the encoder, and a factor of 10--15 in the decoder.", "full_text": "Reversible Recurrent Neural Networks\n\nMatthew MacKay, Paul Vicol, Jimmy Ba, Roger Grosse\n\n{mmackay, pvicol, jba, rgrosse}@cs.toronto.edu\n\nUniversity of Toronto\n\nVector Institute\n\nAbstract\n\nRecurrent neural networks (RNNs) provide state-of-the-art performance in pro-\ncessing sequential data but are memory intensive to train, limiting the \ufb02exibility\nof RNN models which can be trained. Reversible RNNs\u2014RNNs for which the\nhidden-to-hidden transition can be reversed\u2014offer a path to reduce the memory\nrequirements of training, as hidden states need not be stored and instead can be\nrecomputed during backpropagation. 
We first show that perfectly reversible RNNs, which require no storage of the hidden activations, are fundamentally limited because they cannot forget information from their hidden state. We then provide a scheme for storing a small number of bits in order to allow perfect reversal with forgetting. Our method achieves comparable performance to traditional models while reducing the activation memory cost by a factor of 10–15. We extend our technique to attention-based sequence-to-sequence models, where it maintains performance while reducing activation memory cost by a factor of 5–10 in the encoder, and a factor of 10–15 in the decoder.

1 Introduction

Recurrent neural networks (RNNs) have attained state-of-the-art performance on a variety of tasks, including speech recognition [Graves et al., 2013], language modeling [Melis et al., 2017, Merity et al., 2017], and machine translation [Bahdanau et al., 2014, Wu et al., 2016]. However, RNNs are memory intensive to train. The standard training algorithm is truncated backpropagation through time (TBPTT) [Werbos, 1990, Rumelhart et al., 1986]. In this algorithm, the input sequence is divided into subsequences of smaller length, say T. Each subsequence is processed in turn and the gradient is backpropagated. If H is the size of our model's hidden state, the memory required for TBPTT is O(TH).
Decreasing the memory requirements of the TBPTT algorithm would allow us to increase the length T of our truncated sequences, capturing dependencies over longer time scales. Alternatively, we could increase the size H of our hidden state or use deeper input-to-hidden, hidden-to-hidden, or hidden-to-output transitions, granting our model greater expressivity. 
Increasing the depth of these transitions\nhas been shown to increase performance in polyphonic music prediction, language modeling, and\nneural machine translation (NMT) [Pascanu et al., 2013, Barone et al., 2017, Zilly et al., 2016].\nReversible recurrent network architectures present an enticing way to reduce the memory requirements\nof TBPTT. Reversible architectures enable the reconstruction of the hidden state at the current timestep\ngiven the next hidden state and the current input, which would enable us to perform TBPTT without\nstoring the hidden states at each timestep. In exchange, we pay an increased computational cost to\nreconstruct the hidden states during backpropagation.\nWe \ufb01rst present reversible analogues of the widely used Gated Recurrent Unit (GRU) [Cho et al.,\n2014] and Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] architectures.\nWe then show that any perfectly reversible RNN requiring no storage of hidden activations will fail\non a simple one-step prediction task. This task is trivial to solve even for vanilla RNNs, but perfectly\nreversible models fail since they need to memorize the input sequence in order to solve the task.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIn light of this \ufb01nding, we extend the memory-ef\ufb01cient reversal method of Maclaurin et al. [2015],\nstoring a handful of bits per unit in order to allow perfect reversal for architectures which forget\ninformation.\nWe evaluate the performance of these models on language modeling and neural machine translation\nbenchmarks. Depending on the task, dataset, and chosen architecture, reversible models (without\nattention) achieve 10\u201315-fold memory savings over traditional models. 
Reversible models achieve\napproximately equivalent performance to traditional LSTM and GRU models on word-level language\nmodeling on the Penn TreeBank dataset [Marcus et al., 1993] and lag 2\u20135 perplexity points behind\ntraditional models on the WikiText-2 dataset [Merity et al., 2016].\nAchieving comparable memory savings with attention-based recurrent sequence-to-sequence models\nis dif\ufb01cult, since the encoder hidden states must be kept simultaneously in memory in order to\nperform attention. We address this challenge by performing attention over a small subset of the\nhidden state, concatenated with the word embedding. With this technique, our reversible models\nsucceed on neural machine translation tasks, outperforming baseline GRU and LSTM models on\nthe Multi30K dataset [Elliott et al., 2016] and achieving competitive performance on the IWSLT\n2016 [Cettolo et al., 2016] benchmark. Applying our technique reduces memory cost by a factor of\n10\u201315 in the decoder, and a factor of 5\u201310 in the encoder.1\n2 Background\nWe begin by describing techniques to construct reversible neural network architectures, which we\nthen adapt to RNNs. Reversible networks were \ufb01rst motivated by the need for \ufb02exible probability\ndistributions with tractable likelihoods [Papamakarios et al., 2017, Dinh et al., 2016, Kingma et al.,\n2016]. Each of these architectures de\ufb01nes a mapping between probability distributions, one of which\nhas a simple, known density. Because this mapping is reversible with an easily computable Jacobian\ndeterminant, maximum likelihood training is ef\ufb01cient.\nA recent paper, closely related to our work, showed that reversible network architectures can be\nadapted to image classi\ufb01cation tasks [Gomez et al., 2017]. Their architecture, called the Reversible\nResidual Network or RevNet, is composed of a series of reversible blocks. Each block takes an input\nx and produces an output y of the same dimensionality. 
The input x is separated into two groups, x = [x1; x2], and outputs are produced according to the following coupling rule:

    y1 = x1 + F(x2)        y2 = x2 + G(y1)        (1)

where F and G are residual functions analogous to those in standard residual networks [He et al., 2016]. The output y is formed by concatenating y1 and y2: y = [y1; y2]. Each layer's activations can be reconstructed from the next layer's activations as follows:

    x2 = y2 − G(y1)        x1 = y1 − F(x2)        (2)

Because of this property, activations from the forward pass need not be stored for use in the backwards pass. Instead, starting from the last layer, activations of previous layers are reconstructed during backpropagation2. Because reversible backprop requires an additional computation of the residual functions to reconstruct activations, it requires 33% more arithmetic operations than ordinary backprop and is about 50% more expensive in practice. Full details of how to efficiently combine reversibility with backpropagation may be found in Gomez et al. [2017].

3 Reversible Recurrent Architectures

The techniques used to construct RevNets can be combined with traditional RNN models to produce reversible RNNs. In this section, we propose reversible analogues of the GRU and the LSTM.

3.1 Reversible GRU

We start by recalling the GRU equations used to compute the hidden state h(t) given the previous hidden state h(t−1) and the current input x(t) (omitting biases):

    [z(t); r(t)] = σ(W [x(t); h(t−1)])
    g(t) = tanh(U [x(t); r(t) ⊙ h(t−1)])
    h(t) = z(t) ⊙ h(t−1) + (1 − z(t)) ⊙ g(t)        (3)

1Code will be made available at https://github.com/matthewjmackay/reversible-rnn
2The activations prior to a pooling step must still be saved, since this involves projection to a lower-dimensional space, and hence loss of information.

Here, ⊙ denotes elementwise multiplication. 
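A minimal NumPy sketch of the RevNet coupling rule in Equations 1 and 2, with hypothetical tanh residual functions F and G (reconstruction in floating point is exact only up to rounding):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residual functions F and G; any functions of the indicated
# half of the state work, since reversibility comes from the additive
# coupling, not from F and G themselves.
W_f = rng.standard_normal((4, 4))
W_g = rng.standard_normal((4, 4))
F = lambda x: np.tanh(x @ W_f)
G = lambda x: np.tanh(x @ W_g)

def couple(x1, x2):
    """Forward coupling (Equation 1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def uncouple(y1, y2):
    """Inverse coupling (Equation 2): reconstruct the inputs."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
y1, y2 = couple(x1, x2)
r1, r2 = uncouple(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```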
To make this update reversible, we separate the hidden state h into two groups, h = [h1; h2]. These groups are updated using the following rules:

    [z1(t); r1(t)] = σ(W1 [x(t); h2(t−1)])
    g1(t) = tanh(U1 [x(t); r1(t) ⊙ h2(t−1)])
    h1(t) = z1(t) ⊙ h1(t−1) + (1 − z1(t)) ⊙ g1(t)        (4)

    [z2(t); r2(t)] = σ(W2 [x(t); h1(t)])
    g2(t) = tanh(U2 [x(t); r2(t) ⊙ h1(t)])
    h2(t) = z2(t) ⊙ h2(t−1) + (1 − z2(t)) ⊙ g2(t)        (5)

Note that h1(t), and not h1(t−1), is used to compute the update for h2(t). We term this model the Reversible Gated Recurrent Unit, or RevGRU. Note that zi(t) ≠ 0 for i = 1, 2, as it is the output of a sigmoid, which maps to the open interval (0, 1). This means the RevGRU updates are reversible in exact arithmetic: given h(t) = [h1(t); h2(t)], we can use h1(t) and x(t) to find z2(t), r2(t), and g2(t) by redoing part of our forwards computation. Then we can find h2(t−1) using:

    h2(t−1) = [h2(t) − (1 − z2(t)) ⊙ g2(t)] ⊙ 1/z2(t)        (6)

h1(t−1) is reconstructed similarly. We address numerical issues which arise in practice in Section 3.3.

3.2 Reversible LSTM

We next construct a reversible LSTM. The LSTM separates the hidden state into an output state h and a cell state c. The update equations are:

    [f(t), i(t), o(t)] = σ(W [x(t); h(t−1)])        (7)
    g(t) = tanh(U [x(t); h(t−1)])        (8)
    c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ g(t)        (9)
    h(t) = o(t) ⊙ tanh(c(t))        (10)

We cannot straightforwardly apply our reversible techniques, as the update for h(t) is not a nonzero linear transformation of h(t−1). 
Despite this, reversibility can be achieved using the equations:

    [f1(t), i1(t), o1(t), p1(t)] = σ(W1 [x(t); h2(t−1)])        (11)
    g1(t) = tanh(U1 [x(t); h2(t−1)])        (12)
    c1(t) = f1(t) ⊙ c1(t−1) + i1(t) ⊙ g1(t)        (13)
    h1(t) = p1(t) ⊙ h1(t−1) + o1(t) ⊙ tanh(c1(t))        (14)

We calculate the updates for c2 and h2 in an identical fashion to the above equations, using c1(t) and h1(t). We call this model the Reversible LSTM, or RevLSTM.

3.3 Reversibility in Finite Precision Arithmetic

We have defined RNNs which are reversible in exact arithmetic. In practice, the hidden states cannot be perfectly reconstructed due to finite numerical precision. Consider the RevGRU equations 4 and 5. If the hidden state h is stored in fixed point, multiplication of h by z (whose entries are less than 1) destroys information, preventing perfect reconstruction. Multiplying a hidden unit by 1/2, for example, corresponds to discarding its least-significant bit, whose value cannot be recovered in the reverse computation. These errors from information loss accumulate exponentially over timesteps, causing the initial hidden state obtained by reversal to be far from the true initial state. The same issue also affects the reconstruction of the RevLSTM hidden states. Hence, we find that forgetting is the main roadblock to constructing perfectly reversible recurrent architectures.
There are two possible avenues to address this limitation. The first is to remove the forgetting step. For the RevGRU, this means we compute zi(t), ri(t), and gi(t) as before, and update hi(t) using:

    hi(t) = hi(t−1) + (1 − zi(t)) ⊙ gi(t)        (15)

We term this model the No-Forgetting RevGRU or NF-RevGRU. 
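The additive update in Equation 15 can be undone by recomputing the gates from the other hidden group and subtracting. A minimal NumPy sketch with hypothetical weight shapes, following the two-group update order of Equations 4 and 5 with forgetting removed (in floating point the recovery is exact only up to rounding):

```python
import numpy as np

rng = np.random.default_rng(1)
D, X = 4, 3  # half-hidden size and input size (illustrative values)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights; shapes follow [z; r] = sigma(W [x; h_other]).
W1, U1 = rng.standard_normal((X + D, 2 * D)), rng.standard_normal((X + D, D))
W2, U2 = rng.standard_normal((X + D, 2 * D)), rng.standard_normal((X + D, D))

def gates(x, h_other, W, U):
    z, r = np.split(sigmoid(np.concatenate([x, h_other]) @ W), 2)
    g = np.tanh(np.concatenate([x, r * h_other]) @ U)
    return z, g

def nf_revgru_step(x, h1, h2):
    # Equation 15: additive updates, no multiplicative forgetting.
    z1, g1 = gates(x, h2, W1, U1)
    h1 = h1 + (1 - z1) * g1
    z2, g2 = gates(x, h1, W2, U2)
    h2 = h2 + (1 - z2) * g2
    return h1, h2

def nf_revgru_reverse(x, h1, h2):
    # Undo the step: recompute each gate from the *other* group, subtract.
    z2, g2 = gates(x, h1, W2, U2)
    h2 = h2 - (1 - z2) * g2
    z1, g1 = gates(x, h2, W1, U1)
    h1 = h1 - (1 - z1) * g1
    return h1, h2

h1, h2 = rng.standard_normal(D), rng.standard_normal(D)
x = rng.standard_normal(X)
n1, n2 = nf_revgru_step(x, h1, h2)
p1, p2 = nf_revgru_reverse(x, n1, n2)
assert np.allclose(p1, h1) and np.allclose(p2, h2)
```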
Because the updates of the NF-RevGRU do not discard information, we need only store one hidden state in memory at a given time during training. Similar steps can be taken to define an NF-RevLSTM.
The second avenue is to accept some memory usage and store the information forgotten from the hidden state in the forward pass. We can then achieve perfect reconstruction by restoring this information to our hidden state in the reverse computation. We discuss how to do so efficiently in Section 5.

Figure 1: Unrolling the reverse computation of an exactly reversible model on the repeat task yields a sequence-to-sequence computation. Left: The repeat task itself, where the model repeats each input token. Right: Unrolling the reversal. The model effectively uses the final hidden state to reconstruct all input tokens, implying that the entire input sequence must be stored in the final hidden state.

4 Impossibility of No Forgetting

We have shown that reversible RNNs in finite precision can be constructed by ensuring that no information is discarded. We were unable to find such an architecture that achieved acceptable performance on tasks such as language modeling3. This is consistent with prior work which found forgetting to be crucial to LSTM performance [Gers et al., 1999, Greff et al., 2017]. In this section, we argue that this results from a fundamental limitation of no-forgetting reversible models: if none of the hidden state can be forgotten, then the hidden state at any given timestep must contain enough information to reconstruct all previous hidden states. 
Thus, any information stored in the hidden state at one timestep must remain present at all future timesteps to ensure exact reconstruction, overwhelming the storage capacity of the model.
We make this intuition concrete by considering an elementary sequence learning task, the repeat task. In this task, an RNN is given a sequence of discrete tokens and must simply repeat each token at the subsequent timestep. This task is trivially solvable by ordinary RNN models with only a handful of hidden units, since it doesn't require modeling long-distance dependencies. But consider how an exactly reversible model would perform the repeat task. Unrolling the reverse computation, as shown in Figure 1, reveals a sequence-to-sequence computation in which the encoder and decoder weights are tied. The encoder takes in the tokens and produces a final hidden state. The decoder uses this final hidden state to produce the input sequence in reverse sequential order.
Notice the relationship to another sequence learning task, the memorization task, used as part of a curriculum learning strategy by Zaremba and Sutskever [2014]. After an RNN observes an entire sequence of input tokens, it is required to output the input sequence in reverse order. As shown in Figure 1, the memorization task for an ordinary RNN reduces to the repeat task for an NF-RevRNN. Hence, if the memorization task requires a hidden representation size that grows with the sequence length, then so does the repeat task for NF-RevRNNs.
We confirmed experimentally that NF-RevGRU and NF-RevLSTM networks with limited capacity were unable to solve the repeat task4. Interestingly, the NF-RevGRU was able to memorize input sequences using considerably fewer hidden units than the ordinary GRU or LSTM, suggesting it may be a useful architecture for tasks requiring memorization. 
Consistent with the results on the repeat task, the NF-RevGRU and NF-RevLSTM were unable to match the performance of even vanilla RNNs on word-level language modeling on the Penn TreeBank dataset [Marcus et al., 1993].

5 Reversibility with Forgetting

The impossibility of zero forgetting leads us to explore the second possibility for achieving reversibility: storing information lost from the hidden state during the forward computation, then restoring it in the reverse computation. Initially, we investigated discrete forgetting, in which only an integral number of bits are allowed to be forgotten. This leads to a simple implementation: if n bits are forgotten in the forwards pass, we can store these n bits on a stack, to be popped off and restored to the hidden state during reconstruction. However, restricting our model to forget only an integral number of bits led to a substantial drop in performance compared to baseline models5. For the remainder of

3We discuss our failed attempts in Appendix A.
4We include full results and details in Appendix B. The argument presented applies to idealized RNNs able to implement any hidden-to-hidden transition and whose hidden units can store 32 bits each. We chose to use the LSTM and the NF-RevGRU as approximations to these idealized models since they performed best at their respective tasks.
5Algorithmic details and experimental results for discrete forgetting are given in Appendix D.

Algorithm 1 Exactly reversible multiplication (Maclaurin et al. 
[2015])

1: Input: buffer integer B, hidden state h = 2^(−RH) h*, forget value z = 2^(−RZ) z* with 0 < z* < 2^RZ
2: B ← B × 2^RZ  {make room for new information on buffer}
3: B ← B + (h* mod 2^RZ)  {store lost information in buffer}
4: h* ← h* ÷ 2^RZ  {divide by denominator of z}
5: h* ← h* × z*  {multiply by numerator of z}
6: h* ← h* + (B mod z*)  {add information to hidden state}
7: B ← B ÷ z*  {shorten information buffer}
8: return updated buffer B, updated value h = 2^(−RH) h*

this paper, we turn to fractional forgetting, in which a fractional number of bits are allowed to be forgotten.
To allow forgetting of a fractional number of bits, we use a technique introduced by Maclaurin et al. [2015] to store lost information. To avoid cumbersome notation, we do away with super- and subscripts and consider a single hidden unit h and its forget value z. We represent h and z as fixed-point numbers (integers with an implied radix point). For clarity, we write h = 2^(−RH) h* and z = 2^(−RZ) z*. Hence, h* is the number stored on the computer and multiplication by 2^(−RH) supplies the implied radix point. In general, RH and RZ are distinct. Our goal is to multiply h by z, storing as few bits as necessary to make this operation reversible.
The full process of reversible multiplication is shown in detail in Algorithm 1. The algorithm maintains an integer information buffer which stores h* mod 2^RZ at each timestep, so integer division of h* by 2^RZ is reversible. However, this requires enlarging the buffer by RZ bits at each timestep. Maclaurin et al. [2015] reduced this storage requirement by shifting information from the buffer back onto the hidden state. 
Reversibility is preserved if the shifted information is small enough that it does not affect the reverse operation (integer division of h* by z*). We include a full review of the algorithm of Maclaurin et al. [2015] in Appendix C.1.
However, this trick introduces a new complication not discussed by Maclaurin et al. [2015]: the information shifted from the buffer could introduce significant noise into the hidden state. Shifting information requires adding a positive value less than z* to h*. Because z* ∈ (0, 2^RZ) (z is the output of a sigmoid function and z = 2^(−RZ) z*), h = 2^(−RH) h* may be altered by as much as (2^RZ − 1)/2^RH. If RZ ≥ RH, this can alter the hidden state h by 1 or more6. This is substantial, as in practice we observe |h| ≤ 16. Indeed, we observed severe performance drops for RH and RZ close to equal.
The solution is to limit the amount of information moved from the buffer to the hidden state by setting RZ smaller than RH. We found RH = 23 and RZ = 10 to work well. The amount of noise added onto the hidden state is bounded by 2^(RZ−RH), so with these values, the hidden state is altered by at most 2^(−13). While the precision of our forgetting value z is limited to 10 bits, previous work has found that neural networks can be trained with precision as low as 10–15 bits and reach the same performance as high-precision networks [Gupta et al., 2015, Courbariaux et al., 2014]. We find our situation to be similar.
Memory Savings. To analyze the savings that are theoretically possible using the procedure above, consider an idealized memory buffer which maintains dynamically resizing storage integers B_h^i for each hidden unit h in groups i = 1, 2 of the RevGRU model. 
Using the above procedure, at each timestep the number of bits stored in each B_h^i grows by:

    RZ − log2(z*_{i,h}) = log2(2^RZ / z*_{i,h}) = log2(1 / z_{i,h})        (16)

If the entries of z_{i,h} are not close to zero, this compares favorably with the naïve cost of 32 bits per timestep. The total storage cost of TBPTT for a RevGRU model with hidden state size H on a sequence of length T will be7:

    − Σ_{t=1}^{T} Σ_{h=1}^{H} [ log2(z_{1,h}(t)) + log2(z_{2,h}(t)) ]        (17)

Thus, in the idealized case, the number of bits stored equals the number of bits forgotten.

6We illustrate this phenomenon with a concrete example in Appendix C.2.
7For the RevLSTM, we would sum over p_i(t) and f_i(t) terms.

Figure 2: Attention mechanism for NMT. The word embeddings, encoder hidden states, and decoder hidden states are color-coded orange, blue, and green, respectively; the striped regions of the encoder hidden states represent the slices that are stored in memory for attention. The final vectors used to compute the context vector are concatenations of the word embeddings and encoder hidden state slices.

5.1 GPU Considerations

For our method to be used as part of a practical training procedure, we must run it on a parallel architecture such as a GPU. This introduces additional considerations which require modifications to Algorithm 1: (1) we implement it with ordinary finite-bit integers, hence dealing with overflow, and (2) for GPU efficiency, we ensure uniform memory access patterns across all hidden units.
Overflow. Consider the storage required for a single hidden unit. Algorithm 1 assumes unboundedly large integers, and hence would need to be implemented using dynamically resizing integer types, as was done by Maclaurin et al. [2015]. 
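As a concrete illustration, Algorithm 1 and its exact inverse can be sketched using Python's arbitrary-precision integers in the role of the dynamically resizing buffer (the function names and z* values here are ours, and h* is assumed nonnegative, as in an unsigned fixed-point encoding):

```python
def reversible_multiply(B, h_star, z_star, RZ):
    # One step of Algorithm 1: multiply h = 2**-RH * h_star by
    # z = 2**-RZ * z_star, pushing just enough bits into the integer
    # buffer B to make the step exactly invertible.
    assert 0 < z_star < 2 ** RZ and h_star >= 0
    B = B * 2 ** RZ + h_star % 2 ** RZ      # store the bits lost to division
    h_star = (h_star // 2 ** RZ) * z_star   # divide by 2**RZ, multiply by z*
    h_star = h_star + B % z_star            # shift information back from buffer
    B = B // z_star                         # shorten the buffer
    return B, h_star

def reverse_multiply(B, h_star, z_star, RZ):
    # Exact inverse: undo the steps of reversible_multiply in reverse order.
    B = B * z_star + h_star % z_star                     # restore buffer contents
    h_star = (h_star // z_star) * 2 ** RZ + B % 2 ** RZ  # restore forgotten bits
    B = B // 2 ** RZ                                     # undo buffer enlargement
    return B, h_star

# Round trip over several timesteps: the initial state comes back exactly,
# while the buffer grows by roughly log2(1/z) bits per step.
RZ = 10
B0, h0 = 1, 123456789
zs = [517, 900, 333, 1020, 7]  # hypothetical z_star values, all in (0, 2**RZ)
B, h = B0, h0
for z in zs:
    B, h = reversible_multiply(B, h, z, RZ)
for z in reversed(zs):
    B, h = reverse_multiply(B, h, z, RZ)
assert (B, h) == (B0, h0)
```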
But such data structures would require non-uniform memory access patterns, limiting their efficiency on GPU architectures. Therefore, we modify the algorithm to use ordinary finite integers. In particular, instead of a single integer, the buffer is represented with a sequence of 64-bit integers (B0, ..., BD). Whenever the last integer in our buffer is about to overflow upon multiplication by 2^RZ, as required by step 2 of Algorithm 1, we append a new integer BD+1 to the sequence. Overflow will occur if BD > 2^(64−RZ).
After appending a new integer BD+1, we apply Algorithm 1 unmodified, using BD+1 in place of B. It is possible that up to RZ − 1 bits of BD will not be used, incurring an additional penalty on storage cost. We experimented with several ways of alleviating this penalty but found that none improved significantly over the storage cost of the initial method.
Vectorization. Vectorization imposes an additional penalty on storage. For efficient computation, we cannot maintain different-size lists as buffers for each hidden unit in a minibatch. Rather, we must store the buffer as a three-dimensional tensor, with dimensions corresponding to the minibatch size, the hidden state size, and the length of the buffer list. This means each list of integers being used as a buffer for a given hidden unit must be the same size. Whenever a buffer being used for any hidden unit in the minibatch overflows, an extra integer must be added to the buffer list for every hidden unit in the minibatch. Otherwise, the steps outlined above can still be followed.
We give the complete, revised algorithm in Appendix C.3. The compromises to address overflow and vectorization entail additional overhead. 
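The append-on-overflow rule can be sketched as follows; this is a deliberate simplification that models only the buffer-enlarging step for a single hidden unit (the div/mod-by-z* steps of Algorithm 1, and the minibatch padding, are omitted):

```python
WORD_BITS = 64

def push_bits(words, bits, RZ):
    """Append RZ bits of forgotten information to a buffer held as a list
    of 64-bit words standing in for one unbounded integer. A fresh word is
    started whenever multiplying the top word by 2**RZ could exceed 64
    bits, wasting up to RZ - 1 bits of it: the storage penalty above."""
    assert 0 <= bits < 2 ** RZ
    if words[-1] > 2 ** (WORD_BITS - RZ):
        words.append(1)  # start a new 64-bit word
    words[-1] = words[-1] * 2 ** RZ + bits
    return words

# With RZ = 10, each 64-bit word absorbs six 10-bit pushes before a new
# word must be started, so 20 pushes occupy four words.
words, RZ = [1], 10
for t in range(20):
    words = push_bits(words, t % 2 ** RZ, RZ)
assert all(w < 2 ** WORD_BITS for w in words)
assert len(words) == 4
```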
We measure the size of this overhead in Section 6.

Table 1: Validation perplexities (memory savings) on Penn TreeBank word-level language modeling. Results shown when forgetting is restricted to 2, 3, and 5 bits per hidden unit per timestep and when there is no restriction.

Reversible Model    2 bits         3 bits         5 bits        No limit     |  Usual Model     No limit
1 layer RevGRU      82.2 (13.8)    81.1 (10.8)    81.1 (7.4)    81.5 (6.4)   |  1 layer GRU     82.2
2 layer RevGRU      83.8 (14.8)    83.8 (12.0)    82.2 (9.4)    82.3 (4.9)   |  2 layer GRU     81.5
1 layer RevLSTM     79.8 (13.8)    79.4 (10.1)    78.4 (7.4)    78.2 (4.9)   |  1 layer LSTM    78.0
2 layer RevLSTM     74.7 (14.0)    72.8 (10.0)    72.9 (7.3)    72.9 (4.9)   |  2 layer LSTM    73.0

5.2 Memory Savings with Attention

Most modern architectures for neural machine translation make use of attention mechanisms [Bahdanau et al., 2014, Wu et al., 2016]; in this section, we describe the modifications that must be made to obtain memory savings when using attention. We denote the source tokens by x(1), x(2), ..., x(T), and the corresponding word embeddings by e(1), e(2), ..., e(T). We also use the following notation to denote vector slices: given a vector v ∈ R^D, we let v[:k] ∈ R^k denote the vector consisting of the first k dimensions of v. Standard attention-based models for NMT perform attention over the encoder hidden states; this is problematic from the standpoint of memory savings, because we must retain the hidden states in memory to use them when computing attention. To remedy this, we explore several alternatives to storing the full hidden state in memory. In particular, we consider performing attention over: 1) the embeddings e(t), which capture the semantics of individual words; 2) slices of the encoder hidden states, h_enc(t)[:k] (where we consider k = 20 or 100); and 3) the concatenation of embeddings and hidden state slices, [e(t); h_enc(t)[:k]]. 
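A minimal NumPy sketch of the third alternative, scoring the concatenated annotations [e(t); h_enc(t)[:k]] with a simple dot product against a hypothetical decoder query (a simplification of the global attention of Luong et al. [2015] used in the actual models; the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, E, D, k = 5, 8, 16, 4  # source length, embedding size, hidden size, slice size

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Hypothetical encoder outputs: word embeddings and hidden states.
emb = rng.standard_normal((T, E))
h_enc = rng.standard_normal((T, D))

# Only the attended slice h_enc[:, :k] must stay resident in memory; the
# remaining D - k dimensions can go through the buffer of Section 5.
annotations = np.concatenate([emb, h_enc[:, :k]], axis=1)  # shape (T, E + k)

query = rng.standard_normal(E + k)     # stand-in for a decoder query
alpha = softmax(annotations @ query)   # attention weights over the source
context = alpha @ annotations          # context vector, shape (E + k,)

assert np.isclose(alpha.sum(), 1.0)
assert context.shape == (E + k,)
```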
Since the embeddings are computed directly from the input tokens, they don't need to be stored. When we slice the hidden state, only the slices that are attended to must be stored. We apply our memory-saving buffer technique to the remaining D − k dimensions.
In our NMT models, we make use of the global attention mechanism introduced by Luong et al. [2015], where each decoder hidden state h_dec(t) is modified by incorporating context from the source annotations: a context vector c(t) is computed as a weighted sum of source annotations (with weights α_j(t)); h_dec(t) and c(t) are then used to produce an attentional decoder hidden state h̃_dec(t). Figure 2 illustrates this attention mechanism, where attention is performed over the concatenated embeddings and hidden state slices. Additional details on attention are provided in Appendix F.

5.3 Additional Considerations

Restricting forgetting. In order to guarantee memory savings, we may restrict the entries of z_i(t) to lie in (a, 1) rather than (0, 1), for some a > 0. Setting a = 0.5, for example, forces our model to forget at most one bit from each hidden unit per timestep. This restriction may be accomplished by applying the linear transformation x ↦ (1 − a)x + a to z_i(t) after its initial computation8.
Limitations. The main flaw of our method is the increased computational cost. We must reconstruct hidden states during the backwards pass and manipulate the buffer at each timestep. We find that each step of reversible backprop takes about 2–3 times as much computation as regular backprop. We believe this overhead could be reduced through careful engineering. We did not observe a slowdown in convergence in terms of number of iterations, so we only pay an increased per-iteration cost.

6 Experiments

We evaluated the performance of reversible models on two standard RNN tasks: language modeling and machine translation. 
We wished to determine how much memory we could save using the techniques we have developed, how these savings compare with those possible using an idealized buffer, and whether these memory savings come at a cost in performance. We also evaluated our proposed attention mechanism on machine translation tasks.

6.1 Language Modeling Experiments

We evaluated our one- and two-layer reversible models on word-level language modeling on the Penn TreeBank [Marcus et al., 1993] and WikiText-2 [Merity et al., 2016] corpora. In the interest of a fair comparison, we kept architectural and regularization hyperparameters the same between all models and datasets. We regularized the hidden-to-hidden, hidden-to-output, and input-to-hidden connections, as well as the embedding matrix, using various forms of dropout9. We used the hyperparameters from Merity et al. [2017]. Details are provided in Appendix G.1. We include training/validation curves for all models in Appendix I.

6.1.1 Penn TreeBank Experiments

We conducted experiments on Penn TreeBank to understand the performance of our reversible models, how much restrictions on forgetting affect performance, and what memory savings are achievable.

8For the RevLSTM, we would apply this transformation to p_i(t) and f_i(t).
9We discuss why dropout does not require additional storage in Appendix E.

Table 2: Validation perplexities on WikiText-2 word-level language modeling. Results shown when forgetting is restricted to 2, 3, and 5 bits per hidden unit per timestep and when there is no restriction.

Reversible Model    2 bits   3 bits   5 bits   No limit   |  Usual Model     No limit
1 layer RevGRU      97.7     97.2     96.3     97.1       |  1 layer GRU     97.8
2 layer RevGRU      95.2     94.7     95.3     95.0       |  2 layer GRU     93.6
1 layer RevLSTM     94.8     94.5     94.5     94.1       |  1 layer LSTM    89.3
2 layer RevLSTM     90.7     87.7     87.0     86.0       |  2 layer LSTM    82.2

Performance. 
With no restriction on the amount forgotten, one- and two-layer RevGRU and RevLSTM models obtained validation performance10 roughly equivalent to that of their non-reversible counterparts, as shown in Table 1. To determine how little could be forgotten without affecting performance, we also experimented with restricting forgetting to at most 2, 3, or 5 bits per hidden unit per timestep using the method of Section 5.3. Restricting the amount of forgetting to 2, 3, or 5 bits from each hidden unit did not significantly impact performance. Performance suffered once forgetting was restricted to at most 1 bit. This caused a 4–5 point increase in perplexity for the RevGRU. It also made the RevLSTM unstable for this task, since its hidden state, unlike the RevGRU's, can grow unboundedly if not enough is forgotten. Hence, we do not include these results.
Memory savings. We tracked the size of the information buffer throughout training and used this to compare the memory required when using reversibility vs. storing all activations. As shown in Appendix H, the buffer size remains roughly constant throughout training. Therefore, we show the average ratio of memory requirements during training in Table 1. Overall, we can achieve a 10–15-fold reduction in memory when forgetting at most 2–3 bits, while maintaining comparable performance to standard models. Using Equation 17, we also compared the actual memory savings to the idealized memory savings possible with a perfect buffer. In general, we use about twice the amount of memory as is theoretically possible. Plots of memory savings for all models, both idealized and actual, are given in Appendix H.

6.1.2 WikiText-2 Experiments

We conducted experiments on the WikiText-2 dataset (WT2) to see how reversible models fare on a larger, more challenging dataset. We investigated various restrictions, as well as no restriction, on forgetting and contrasted with baseline models, as shown in Table 2. 
The RevGRU model matched the performance of the baseline GRU model, even with forgetting restricted to 2 bits. The RevLSTM lagged behind the baseline LSTM by about 5 perplexity points for both one- and two-layer models.
6.2 Neural Machine Translation Experiments
We further evaluated our models on English-to-German neural machine translation (NMT). We used a unidirectional encoder-decoder model and our novel attention mechanism described in Section 5.2. We experimented on two datasets: Multi30K [Elliott et al., 2016], a dataset of ∼30,000 sentence pairs derived from Flickr image captions, and IWSLT 2016 [Cettolo et al., 2016], a larger dataset of ∼180,000 pairs. Experimental details are provided in Appendix G.2; training and validation curves are shown in Appendices I.3 (Multi30K) and I.4 (IWSLT); plots of memory savings during training are shown in Appendix H.2.
For Multi30K, we used single-layer RNNs with 300-dimensional hidden states and 300-dimensional word embeddings for both the encoder and decoder. Our baseline GRU and LSTM models achieved test BLEU scores of 32.60 and 37.06, respectively. The test BLEU scores and encoder memory savings achieved by our reversible models are shown in Table 3, for several variants of attention and restrictions on forgetting. For attention, we use Emb to denote word embeddings, xH to denote an x-dimensional slice of the hidden state (so 300H denotes the whole hidden state), and Emb+xH to denote the concatenation of the two. Overall, while Emb attention achieved the best memory savings, Emb+20H achieved the best balance between performance and memory savings. The RevGRU with Emb+20H attention and forgetting at most 2 bits achieved a test BLEU score of 34.41, outperforming the standard GRU, while reducing activation memory requirements by 7.1× and 14.8× in the encoder and decoder, respectively. 
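To make the Emb+xH notation concrete: under our reading, each encoder position exposes its word embedding concatenated with a small slice of its hidden state for attention, so only that slice (rather than the full 300-dimensional state) must be kept around for the attention computation. The helper name and shapes below are our own hypothetical sketch, not the exact implementation of Section 5.2:

```python
import numpy as np

def emb_plus_slice_keys(embeddings, hidden_states, slice_dim=20):
    """Emb+20H attention vectors: the word embedding concatenated with
    the first slice_dim units of the encoder hidden state at each
    position.  Only this slice of the hidden state needs to be stored
    for attention, which is where the encoder memory savings come from.
    (Hypothetical sketch of the mechanism described in Section 5.2.)"""
    return np.concatenate([embeddings, hidden_states[:, :slice_dim]], axis=1)

T, d = 7, 300                          # sequence length, model dimension
emb = np.random.randn(T, d)            # encoder word embeddings
hid = np.random.randn(T, d)            # encoder hidden states
keys = emb_plus_slice_keys(emb, hid)   # one (300 + 20)-dim vector per position
assert keys.shape == (T, d + 20)
```

Setting slice_dim to the full hidden size recovers ordinary attention over the whole state (the 300H rows of Table 3, which show no memory savings), while slice_dim=0 recovers pure Emb attention.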
The RevLSTM with Emb+20H attention and forgetting at most 3 bits achieved a test BLEU score of 37.23, outperforming the standard LSTM, while reducing activation memory requirements by 8.9× and 11.1× in the encoder and decoder, respectively.

10 Test perplexities exhibit similar patterns but are 3–5 perplexity points lower.

Table 3: Performance on the Multi30K dataset with different restrictions on forgetting. P denotes the test BLEU score; M denotes the average memory savings of the encoder during training.

Model     Attention   1 bit         2 bits        3 bits        5 bits        No limit
                      P      M      P      M      P      M      P      M      P      M
RevLSTM   20H         29.18  11.8   30.63  9.5    30.47  8.5    30.02  7.3    29.13  6.1
RevLSTM   100H        27.90  4.9    35.43  4.3    36.03  4.0    35.75  3.7    34.96  3.5
RevLSTM   300H        26.44  1.0    36.10  1.0    37.05  1.0    37.30  1.0    36.80  1.0
RevLSTM   Emb         31.92  20.0   31.98  15.1   31.60  13.9   31.42  10.7   31.45  10.1
RevLSTM   Emb+20H     36.80  12.1   36.78  9.9    37.23  8.9    36.45  8.1    37.30  7.4
RevGRU    20H         26.52  7.2    26.86  7.2    28.26  6.8    27.71  6.5    27.86  5.7
RevGRU    100H        33.28  2.6    32.53  2.6    31.44  2.5    31.60  2.4    31.66  2.3
RevGRU    300H        34.86  1.0    33.49  1.0    33.01  1.0    33.03  1.0    33.08  1.0
RevGRU    Emb         28.51  13.2   28.76  13.2   28.86  12.9   27.93  12.8   28.59  12.9
RevGRU    Emb+20H     34.00  7.2    34.41  7.1    34.39  6.4    34.04  5.9    34.94  5.7

For IWSLT 2016, we used two-layer RNNs with 600-dimensional hidden states and 600-dimensional word embeddings for the encoder and decoder. We evaluated reversible models in which the decoder used Emb+60H attention. The baseline GRU and LSTM models achieved test BLEU scores of 16.07 and 22.35, respectively. The RevGRU achieved a test BLEU score of 20.70, outperforming the GRU, while saving 7.15× and 12.92× memory in the encoder and decoder, respectively. The RevLSTM achieved a score of 22.34, competitive with the LSTM, while saving 8.32× and 6.57× memory in the encoder and decoder, respectively. 
Both reversible models were restricted to forget at most 5 bits.

7 Related Work

Several approaches have been taken to reduce the memory requirements of RNNs. Frameworks that use static computational graphs [Abadi et al., 2016, Al-Rfou et al., 2016] aim to allocate memory efficiently in the training algorithms themselves. Checkpointing [Martens and Sutskever, 2012, Chen et al., 2016, Gruslys et al., 2016] is a frequently used method: certain activations are stored as checkpoints throughout training and the remaining activations are recomputed as needed in the backwards pass. Checkpointing has previously been used to train recurrent neural networks on sequences of length T by storing the activations every ⌈√T⌉ layers [Martens and Sutskever, 2012]. Gruslys et al. [2016] further developed this strategy by using dynamic programming to determine which activations to store in order to minimize computation for a given storage budget.

Decoupled neural interfaces [Jaderberg et al., 2017, Czarnecki et al., 2017] use auxiliary neural networks trained to produce the gradient of a layer's weight matrix given the layer's activations as input, then use these predictions, rather than the true gradient, to train. This strategy depends on the quality of the gradient approximation produced by the auxiliary network. Unlike in our method, hidden activations must still be stored, as in the usual backpropagation algorithm, to train the auxiliary networks.

Unitary recurrent neural networks [Arjovsky et al., 2016, Wisdom et al., 2016, Jing et al., 2016] refine vanilla RNNs by parametrizing their transition matrix to be unitary. These networks are reversible in exact arithmetic [Arjovsky et al., 2016]: the conjugate transpose of the transition matrix is its inverse, so the hidden-to-hidden transition is reversible. 
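As an illustration of the exact-arithmetic claim (our own sketch, using a real orthogonal matrix and omitting the nonlinearity and input terms of a full unitary RNN), a linear transition h_t = W h_{t-1} is undone by repeatedly applying the transpose:

```python
import numpy as np

rng = np.random.default_rng(0)
# QR factorization of a random matrix yields a random orthogonal W.
W, _ = np.linalg.qr(rng.standard_normal((64, 64)))

h0 = rng.standard_normal(64)
h = h0.copy()
for _ in range(1000):   # forward: h_t = W h_{t-1}
    h = W @ h
for _ in range(1000):   # reverse: h_{t-1} = W^T h_t
    h = W.T @ h

# Reversal is exact in infinite precision; in float64 a small residual
# error remains and grows with the number of timesteps.
err = np.max(np.abs(h - h0))
assert err < 1e-6
```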
In practice, this method would run into numerical precision issues as floating point errors accumulate over timesteps. Our method avoids these issues by storing the lost information.

8 Conclusion

We have introduced reversible recurrent neural networks as a method to reduce the memory requirements of truncated backpropagation through time. We demonstrated the flaws of exactly reversible RNNs, and developed methods to efficiently store the information lost during the hidden-to-hidden transition, allowing us to reverse the transition during backpropagation. Reversible models can achieve roughly equivalent performance to standard models while reducing memory requirements by a factor of 5–15 during training. We believe reversible models offer a compelling path towards constructing more flexible and expressive recurrent neural networks.

Acknowledgments

We thank Kyunghyun Cho for experimental advice and discussion. We also thank Aidan Gomez, Mengye Ren, Gennady Pekhimenko, and David Duvenaud for helpful discussion. MM is supported by an NSERC CGS-M award, and PV is supported by an NSERC PGS-D award.

References

Alex Graves, Abdel-Rahman Mohamed, and Geoffrey Hinton. Speech Recognition with Deep Recurrent Neural Networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

Gábor Melis, Chris Dyer, and Phil Blunsom. On the State of the Art of Evaluation in Neural Language Models. arXiv:1707.05589, 2017.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. arXiv:1708.02182, 2017.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473, 2014.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 
Google\u2019s Neural Machine Translation\nSystem: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144, 2016.\n\nPaul J Werbos. Backpropagation through Time: What It Does and How to Do It. Proceedings of the\n\nIEEE, 78(10):1550\u20131560, 1990.\n\nDavid E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning Representations by\n\nBack-propagating Errors. Nature, 323(6088):533, 1986.\n\nRazvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to Construct Deep\n\nRecurrent Neural Networks. arXiv:1312.6026, 2013.\n\nAntonio Valerio Miceli Barone, Jind\u02c7rich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch.\n\nDeep Architectures for Neural Machine Translation. arXiv:1707.07631, 2017.\n\nJulian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutn\u00edk, and J\u00fcrgen Schmidhuber. Recurrent\n\nHighway Networks. arXiv:1607.03474, 2016.\n\nKyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger\nSchwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for\nStatistical Machine Translation. arXiv:1406.1078, 2014.\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):\n\n1735\u20131780, 1997.\n\nDougal Maclaurin, David Duvenaud, and Ryan P Adams. Gradient-based Hyperparameter Opti-\nmization through Reversible Learning. In Proceedings of the 32nd International Conference on\nMachine Learning (ICML), 2015.\n\nMitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated\n\nCorpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330, 1993.\n\nStephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture\n\nModels. arXiv:1609.07843, 2016.\n\nDesmond Elliott, Stella Frank, Khalil Sima\u2019an, and Lucia Specia. Multi30K: Multilingual English-\n\nGerman Image Descriptions. 
arXiv:1605.00459, 2016.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. The IWSLT 2016 Evaluation Campaign. Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT), 2016.

George Papamakarios, Iain Murray, and Theo Pavlakou. Masked Autoregressive Flow for Density Estimation. In Advances in Neural Information Processing Systems (NIPS), pages 2335–2344, 2017.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation using Real NVP. arXiv:1605.08803, 2016.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems (NIPS), pages 4743–4751, 2016.

Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The Reversible Residual Network: Backpropagation Without Storing Activations. In Advances in Neural Information Processing Systems (NIPS), pages 2211–2221, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to Forget: Continual Prediction with LSTM. 1999.

Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.

Wojciech Zaremba and Ilya Sutskever. Learning to Execute. arXiv:1410.4615, 2014.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning (ICML), pages 1737–1746, 2015.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 
Training Deep Neural Networks with Low Precision Multiplications. arXiv:1412.7024, 2014.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective Approaches to Attention-Based Neural Machine Translation. arXiv:1508.04025, 2015.

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467, 2016.

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv:1605.02688, 2016.

James Martens and Ilya Sutskever. Training Deep and Recurrent Networks with Hessian-Free Optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174, 2016.

Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-Efficient Backpropagation through Time. In Advances in Neural Information Processing Systems (NIPS), pages 4125–4133, 2016.

Max Jaderberg, Wojciech M Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. In International Conference on Machine Learning (ICML), 2017.

Wojciech Marian Czarnecki, Grzegorz Świrszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, and Koray Kavukcuoglu. Understanding Synthetic Gradients and Decoupled Neural Interfaces. arXiv:1703.00522, 2017.

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. 
In International Conference on Machine Learning (ICML), pages 1120–1128, 2016.

Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-Capacity Unitary Recurrent Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pages 4880–4888, 2016.

Li Jing, Yichen Shen, Tena Dubček, John Peurifoy, Scott Skirlo, Max Tegmark, and Marin Soljačić. Tunable Efficient Unitary Neural Networks (EUNN) and their Application to RNN. arXiv:1612.05231, 2016.