{"title": "Deep Equilibrium Models", "book": "Advances in Neural Information Processing Systems", "page_first": 690, "page_last": 701, "abstract": "We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective \u201cdepth\u201d of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements to existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github.com/locuslab/deq.", "full_text": "Deep Equilibrium Models\n\nShaojie Bai\n\nJ. Zico Kolter\n\nCarnegie Mellon University\n\nCarnegie Mellon University\n\nVladlen Koltun\n\nIntel Labs\n\nBosch Center for AI\n\nAbstract\n\nWe present a new approach to modeling sequential data: the deep equilibrium\nmodel (DEQ). Motivated by an observation that the hidden layers of many existing\ndeep sequence models converge towards some \ufb01xed point, we propose the DEQ\napproach that directly \ufb01nds these equilibrium points via root-\ufb01nding. 
Such a\nmethod is equivalent to running an infinite depth (weight-tied) feedforward network,\nbut has the notable advantage that we can analytically backpropagate through the\nequilibrium point using implicit differentiation. Using this approach, training\nand prediction in these networks require only constant memory, regardless of the\neffective \u201cdepth\u201d of the network. We demonstrate how DEQs can be applied to\ntwo state-of-the-art deep sequence models: self-attention transformers and trellis\nnetworks. On large-scale language modeling tasks, such as the WikiText-103\nbenchmark, we show that DEQs 1) often improve performance over these\nstate-of-the-art models (for similar parameter counts); 2) have similar computational\nrequirements to existing models; and 3) vastly reduce memory consumption (often\nthe bottleneck for training large sequence models), demonstrating an up to 88%\nmemory reduction in our experiments. The code is available at https://github.com/locuslab/deq.\n\n1 Introduction\n\nMost modern feedforward deep networks are built on the core concept of layers. In the forward pass,\neach network consists of a stack of some L transformations, where L is the depth of the network. To\nupdate these networks, the backward passes rely on backpropagating through the same L layers via the\nchain rule, which typically necessitates that we store the intermediate values of these layers. The value\nfor L is usually a hyperparameter and is picked by model designers (e.g., ResNet-101 [25]). Among\nthe many applications of deep networks, sequence modeling has witnessed continuous advances in\nmodel architectures. 
Specifically, while recurrent networks have long been the dominant model for\nsequences [21, 26, 14, 34], deep feedforward architectures based on temporal convolutions [49, 47, 7]\nand self-attention [48, 16, 13] have (re-)emerged to claim superior performance on a variety of\nsequence prediction tasks.\n\nIn very general terms, a deep feedforward sequence model can be written as the following iteration:\n\nz^{[i+1]}_{1:T} = f^{[i]}_\u03b8(z^{[i]}_{1:T}; x_{1:T})    for i = 0, 1, 2, . . . , L \u2212 1    (1)\n\nwhere i is the layer index; z^{[i]}_{1:T} is the hidden sequence of length T at layer i; x_{1:T} is the input\nsequence (i.e., we are choosing to explicitly model skip connections, for reasons we explain later);\nand f^{[i]}_\u03b8 is some nonlinear transformation which typically enforces causality (i.e., future time points\ncannot influence past ones). Our paper derives its motivation from surprising recent works that\nemploy the same transformation in each layer (known as weight tying, with f^{[i]}_\u03b8 = f_\u03b8, \u2200i) and still\nachieve results competitive with the state-of-the-art [18, 8, 15]. This raises an interesting question: If\nthe same transformation is applied at each layer of a deep network, what is the limit of this process,\nand how do we model it?\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we propose a new approach to \u201cdeep\u201d modeling that addresses this question. Specifically,\nwe introduce the deep equilibrium model (DEQ), a method that directly computes the fixed point\nz^\u22c6_{1:T} of a nonlinear transformation, i.e., the solution to the nonlinear system\n\nz^\u22c6_{1:T} = f_\u03b8(z^\u22c6_{1:T}; x_{1:T}).    (2)\n\nThis solution corresponds to the eventual hidden layer values of an infinite depth network. 
But instead\nof finding this value by iterating the model, we propose to directly (and in practice, more quickly)\nsolve for the equilibrium via any black-box root-finding method. Importantly, we show that DEQ\ncan directly differentiate through the fixed point equations via implicit differentiation, which does not\nrequire storing any intermediate activation values. In other words, we can backpropagate through the\ninfinite-depth network while using only constant memory, equivalent to a single layer\u2019s activations.\n\nAfter developing the generic DEQ approach, we study in detail the instantiation of DEQ via two\nfeedforward sequence models: trellis networks (weight-tied temporal convolutions) [8] and\nmemory-augmented universal transformers (weight-tied multi-head self-attention) [18, 16], both of which\nhave obtained state-of-the-art (SOTA) performance on various sequence tasks. We show how both\nthe forward and backward passes can be implemented efficiently via quasi-Newton methods. Finally,\nwe demonstrate via experiments on large-scale high-dimensional sequence modeling benchmarks\n(e.g., WikiText-103 language modeling) that, despite only using constant memory, DEQ can attain\nmodeling accuracy on par with (or even slightly better than) corresponding layer-based networks. We\nbelieve that DEQ offers a novel perspective on the analysis of sequential data.\n\n2 Background\n\nDeep sequence models. Given an input sequence x_{1:T} = [x_1, . . . , x_T] \u2208 R^{T\u00d7p}, where x_i \u2208 R^p\n(e.g., a word embedding) and T \u2208 N is the sequence length, we define a sequence model as any\nfunction G that produces output G(x_{1:T}) = y_{1:T} \u2208 R^{T\u00d7q} that satisfies the causality constraint: y_t\ndepends only on x_{1:t} and not on any element of x_{t+1:T}. 
Recent progress on autoregressive sequence\ntasks has been based on deep learning, where three major families of sequence models stand out.\nRecurrent networks (RNNs) [21, 51] as well as their variants such as LSTM [26] are universally\napplied and optimized in a variety of time-series tasks [9, 22, 34]. Alternatively, prior work has\nshown that deeply stacked temporal convolutions [49, 47, 17, 7] can achieve competitive results,\nespecially on long sequences. Finally, the self-attention transformer architecture [48, 16] has also\nachieved SOTA on several NLP benchmarks [19, 13]. Efforts have also been devoted to drawing\ndeeper connections among the three model families. Bai et al. [8] study the underlying relationship\nbetween RNNs and ConvNets, unifying these in the Trellis Network, which combines the benefits\nof both families. Dehghani et al. [18] introduce a recurrently-stacked universal transformer and\ndemonstrate its effectiveness on text understanding and generation.\n\nMemory-efficient deep networks. An important factor that limits the development of\nhigh-capacity networks is limited memory on hardware devices used for training. To address this issue,\n[12] proposes gradient checkpointing that reduces an L-layer network\u2019s memory requirement to\nO(\u221aL) at the cost of extra forward passes (i.e., extra computations). Alternatively, [23, 30] develop\nreversible networks, where each layer\u2019s activations can be reconstructed from the next layer during\nbackpropagation to reduce memory requirements. DEQs reduce memory consumption to a constant\n(i.e., independent of network \u201cdepth\u201d) by directly differentiating through the equilibrium point and\nthus circumventing the construction and maintenance of \u201clayers\u201d.\n\nContinuous view of deep networks. Some prior works have studied continuous views of deep\nnetworks. 
[41] proposes a biologically inspired equilibrium propagation framework for an\nenergy-based model whose prediction is the fixed point of the energy dynamics at its local minimum. [24, 11]\nmodel deep ResNets by black-box ODE solvers in forward and backward passes (as if the network\nhas smaller \u201clayer steps\u201d) given the start- and end-points of a dynamical system. For deep sequence\nmodels, [43, 36] consider the RNN as a dynamical system to investigate its stability properties.\n\nOur work takes a further step in the direction of the aforementioned areas. While some of the prior\nwork has primarily focused on the analysis of residual architectures or small symmetric-weight\nenergy-based models, our work is not predicated on any specific type of interlayer transformation.\nWe show that DEQs can be easily instantiated via two very different sequence learning architectures.\nMore fundamentally, unlike ODE-based methods, which use the adjoint system to backpropagate\nthrough the entire latent trajectory, the DEQ model solves directly for sequence-level equilibria via\na quasi-Newton method and backpropagates directly through this fixed point, without regard for\nthe solution path that brought it there. Moreover, while ODE-based models [24, 11] were verified\non numerical experiments and MNIST classification, computation and numerical stability issues\nchallenge their application to large-scale problems. In comparison, we demonstrate the applicability\nof DEQs on realistic high-dimensional sequence tasks with competitive performance, while enjoying\nsimilar constant-memory benefits as [11].\n\nImplicit layers in deep learning. The DEQ model can be viewed as an infinitely deep network,\nbut interestingly can also be viewed as a single-layer network, with the caveat that the layer is defined\nimplicitly: the output z^\u22c6_{1:T} is defined as the value which solves some non-linear equation. 
There has\nbeen a growing interest in implicit layers in recent years [37, 3, 37, 50], but the precise formulation\nof the DEQ is quite different, and our current models represent the largest-scale practical application\nof implicit layers in deep learning of which we are aware. Concurrent work [20] also looks at such\nimplicit layers in a broad sense and focuses on training small models via Lagrangian methods; a\ncombination of these approaches with the DEQ model is a promising avenue for future work.\n\nAnother thread of work on implicit layers traces back to some of the original papers on recurrent\nnetworks trained via recurrent backpropagation (RBP) [2, 38]. Recent work [28] has re-examined RBP\nand established an implicit, constant-memory variant based on conjugate gradient and Neumann series.\nA number of related papers also enforce fixed point conditions within RNN architectures [54, 27].\nWhereas the DEQ model shares similarities with the RBP approach, some major differences involve:\n1) the explicit use of equilibrium as a replacement for depth in general networks, along with our proof\nof the universality of these models to replace depth; 2) the use of the approach in methods outside\nof fixed-input RNNs (i.e., same input vector x_t for all t), especially the compatibility with SOTA\narchitectures; and 3) the scalability of the DEQ model to practical tasks where it achieves results on\npar with the current SOTA, whereas RBP has typically been applied in small-scale settings.\n\n3 The Deep Equilibrium Sequence Model\n\nWe broadly consider the class of weight-tied deep sequence models (with passthrough connections\nfrom the input to each layer), which consist of the update\n\nz^{[i+1]}_{1:T} = f_\u03b8(z^{[i]}_{1:T}; x_{1:T}),  i = 0, . . . , L \u2212 1,    z^{[0]}_{1:T} = 0,    G(x_{1:T}) \u2261 z^{[L]}_{1:T}    (3)\n\nWe note that this model encapsulates classes such as the trellis network [8] and the universal\ntransformer [18] (which is typically not written with passthrough connections, but this is a trivial\nmodification). Such weight-tying is generally considered to come with four major benefits: 1) it\nacts as a form of regularization that stabilizes training and supports generalization; 2) it significantly\nreduces the model size; 3) it is trivial to show that any deep network can be represented by a\nweight-tied deep network of equal depth and only a linear increase in width (see Appendix C); and\n4) the network can be unrolled to any depth, typically with improved feature abstractions as depth\nincreases [8, 18]. However, in practice almost all such models (and deep nets in general) are stacked,\ntrained and evaluated by unrolling a pre-determined, fixed number of layers. One reason is the limited\nmemory on training hardware: the models need to store intermediate hidden units for backpropagation\nand thus cannot be trained beyond a certain depth that depends on the available memory.\n\nIn principle, the network could have infinite depth. This is attained in the limit of unrolling a\nweight-tied model for an ever higher number of layers. What is the limit of this process? In practice, for\ncertain classes of f_\u03b8 (discussed later), we hypothesize and observe that such weight-tied models\ntend to converge to a fixed point as depth increases towards infinity (see Appendix D for empirical\nevidence). 
In other words, as each layer refines the previous one by combining temporal features\nacross the sequence, increasing depth towards infinity brings \u201cdiminishing returns\u201d: each additional\nlayer has a smaller and smaller contribution until the network reaches an equilibrium:\n\nlim_{i\u2192\u221e} z^{[i]}_{1:T} = lim_{i\u2192\u221e} f_\u03b8(z^{[i]}_{1:T}; x_{1:T}) \u2261 f_\u03b8(z^\u22c6_{1:T}; x_{1:T}) = z^\u22c6_{1:T}    (4)\n\n3.1 The Deep Equilibrium Approach\n\nWe introduce the deep equilibrium model (DEQ) which, instead of iteratively stacking f_\u03b8, directly\nsolves for and differentiates through the equilibrium state.\n\n3.1.1 Forward Pass\n\nUnlike a conventional network where the output is the activations from the Lth layer, the output of\na DEQ is the equilibrium point itself. Therefore, the forward evaluation could be any procedure\nthat solves for this equilibrium point. Conventional deep sequence networks, if they converge to an\nequilibrium, can be considered a form of fixed-point iterations:\n\nz^{[i+1]}_{1:T} = f_\u03b8(z^{[i]}_{1:T}; x_{1:T})    for i = 0, 1, 2, . . .    (5)\n\nOne can alternatively use other methods that provide faster convergence guarantees. For notational\nconvenience, we define g_\u03b8 and rewrite Eq. (4) as g_\u03b8(z^\u22c6_{1:T}; x_{1:T}) = f_\u03b8(z^\u22c6_{1:T}; x_{1:T}) \u2212 z^\u22c6_{1:T} \u2192 0. The\nequilibrium state z^\u22c6_{1:T} is thus the root of g_\u03b8, which we can find more easily with Newton\u2019s method or\nquasi-Newton methods (e.g., Broyden\u2019s method [10]):\n\nz^{[i+1]}_{1:T} = z^{[i]}_{1:T} \u2212 \u03b1 B g_\u03b8(z^{[i]}_{1:T}; x_{1:T})    for i = 0, 1, 2, . . .    (6)\n\nwhere B is the Jacobian inverse (or its low-rank approximation) at z^{[i]}_{1:T}, and \u03b1 is the step size. 
But\ngenerally, one can exploit any black-box root-finding algorithm to solve for the equilibrium point in\nthe forward pass, given an initial estimate z^{[0]}_{1:T} (which we set to 0): z^\u22c6_{1:T} = RootFind(g_\u03b8; x_{1:T}).\n\n3.1.2 Backward Pass\n\nA major problem with using a black-box RootFind is that we are no longer able to rely on explicit\nbackpropagation through the exact operations in the forward pass. While one can certainly fix an\nalgorithm (say Newton\u2019s method) to obtain the equilibrium, and then store and backpropagate through\nall the Newton iterations, we provide below an alternative procedure that is much simpler, requires\nconstant memory, and assumes no knowledge of the black-box RootFind.\n\nTheorem 1. (Gradient of the Equilibrium Model) Let z^\u22c6_{1:T} \u2208 R^{T\u00d7d} be an equilibrium hidden\nsequence with length T and dimensionality d, and y_{1:T} \u2208 R^{T\u00d7q} the ground-truth (target) sequence.\nLet h : R^d \u2192 R^q be any differentiable function and let L : R^q \u00d7 R^q \u2192 R be a loss function (where\nh, L are applied in a vectorized manner) that computes\n\n\u2113 = L(h(z^\u22c6_{1:T}), y_{1:T}) = L(h(RootFind(g_\u03b8; x_{1:T})), y_{1:T}).    (7)\n\nThen the loss gradient w.r.t. (\u00b7) (for instance, \u03b8 or x_{1:T}) is\n\n\u2202\u2113/\u2202(\u00b7) = \u2212 (\u2202\u2113/\u2202z^\u22c6_{1:T}) (J^{\u22121}_{g_\u03b8}|_{z^\u22c6_{1:T}}) \u2202f_\u03b8(z^\u22c6_{1:T}; x_{1:T})/\u2202(\u00b7) = \u2212 (\u2202\u2113/\u2202h) (\u2202h/\u2202z^\u22c6_{1:T}) (J^{\u22121}_{g_\u03b8}|_{z^\u22c6_{1:T}}) \u2202f_\u03b8(z^\u22c6_{1:T}; x_{1:T})/\u2202(\u00b7),    (8)\n\nwhere J^{\u22121}_{g_\u03b8}|_x is the inverse Jacobian of g_\u03b8 evaluated at x.\n\nThe proof is provided in Appendix A. The insight provided by Theorem 1 is at the core of our method\nand its various benefits. 
Importantly, the backward gradient through the \u201cinfinite\u201d stacking can be\nrepresented as one step of matrix multiplication that involves the Jacobian at equilibrium. For instance,\nan SGD update step on model parameters \u03b8 would be\n\n\u03b8^+ = \u03b8 \u2212 \u03b1 \u00b7 \u2202\u2113/\u2202\u03b8 = \u03b8 + \u03b1 (\u2202\u2113/\u2202z^\u22c6_{1:T}) (J^{\u22121}_{g_\u03b8}|_{z^\u22c6_{1:T}}) \u2202f_\u03b8(z^\u22c6_{1:T}; x_{1:T})/\u2202\u03b8.    (9)\n\nNote that this result is independent of the root-finding algorithm we choose or the internal structure of\nthe transformation f_\u03b8, and thus does not require any storage of the intermediate hidden states, which\nis necessary for backpropagation in conventional deep networks.\n\n3.1.3 Accelerating DEQ by Approximating the Inverse Jacobian\n\nOne challenge of enforcing the forward and backward passes described in Sections 3.1.1 and 3.1.2 is\nthe cost of computing the exact inverse Jacobian J^{\u22121}_{g_\u03b8} at every intermediate Newton iteration. We\npropose to address this using Broyden\u2019s method [10], a quasi-Newton approach that makes low-rank\nupdates to approximate J^{\u22121}_{g_\u03b8} via the Sherman-Morrison formula [42]:\n\nJ^{\u22121}_{g_\u03b8}|_{z^{[i+1]}_{1:T}} \u2248 B^{[i+1]}_{g_\u03b8} = B^{[i]}_{g_\u03b8} + [(\u2206z^{[i+1]} \u2212 B^{[i]}_{g_\u03b8} \u2206g^{[i+1]}_\u03b8) / (\u2206z^{[i+1]\u22a4} B^{[i]}_{g_\u03b8} \u2206g^{[i+1]}_\u03b8)] \u2206z^{[i+1]\u22a4} B^{[i]}_{g_\u03b8},    (10)\n\nwhere \u2206z^{[i+1]} = z^{[i+1]}_{1:T} \u2212 z^{[i]}_{1:T} and \u2206g^{[i+1]}_\u03b8 = g_\u03b8(z^{[i+1]}_{1:T}; x_{1:T}) \u2212 g_\u03b8(z^{[i]}_{1:T}; x_{1:T}). Initially, we set\nB^{[0]}_{g_\u03b8} = \u2212I and the Broyden iterations are stopped when either the norm of g^{[i]}_\u03b8 falls below a tolerance\n\u03b5 or when the maximum number of iterations is reached. This lets us avoid the cubic cost induced by\nthe inverse operation.\n\nA similar idea can be used for the backward pass as well. Specifically, to compute \u2212(\u2202\u2113/\u2202z^\u22c6_{1:T}) (J^{\u22121}_{g_\u03b8}|_{z^\u22c6_{1:T}})\nin Theorem 1, we can alternatively solve the linear system\n\n(J^\u22a4_{g_\u03b8}|_{z^\u22c6_{1:T}}) x^\u22a4 + (\u2202\u2113/\u2202z^\u22c6_{1:T})^\u22a4 = 0,    (11)\n\nwhere the first term (a vector-Jacobian product) can be efficiently computed via autograd packages\n(e.g., PyTorch [45]) for any x, without explicitly writing out the Jacobian matrix. Such linear systems\ncan generally be solved by any indirect methods that leverage fast matrix-vector products; we thus\npropose to also rely on Broyden\u2019s method (other indirect methods would also suffice) to solve for\nEq. (11) and directly backpropagate through the equilibrium by Theorem 1 in the backward pass.\n\n3.2 Properties of Deep Equilibrium Models\n\nSection 3.1 develops a sequence model that, while still based on the deep learning philosophy, is quite\ndifferent from other approaches in the field, as its output is agnostic to the choice of the RootFind\nalgorithm in the forward pass. We now discuss some implications of the DEQ approach.\n\nMemory cost of DEQ. An important benefit of DEQ is its extreme memory efficiency. As outlined\nin Section 3.1.3, since we are able to use any root-finding algorithm for both the forward and backward\npasses (e.g., Broyden\u2019s method [10]), a DEQ only needs to store z^\u22c6_{1:T} (the equilibrium sequence),\nx_{1:T} (input-related, layer-independent variables), and f_\u03b8 for the backward pass. Note that as we\nonly need the vector-Jacobian product (with dimension N \u00d7 Td, where N is the minibatch size) in\nEq. (11), we never need to explicitly construct the Jacobian J^\u22a4_{g_\u03b8}|_{z^\u22c6_{1:T}}, which could be prohibitively\nlarge on long and high-dimensional sequences (with dimension N \u00d7 (Td)^2). Compared to other deep\nnetworks, DEQs therefore offer a constant-memory alternative that enables models that previously\nrequired multiple GPUs and other implementation-based techniques (e.g., half-precision or gradient\ncheckpointing [12, 13]) to fit easily into a single GPU.\n\nThe choice of f_\u03b8. Our analysis in Sections 3.1.1, 3.1.2, and 3.1.3 is independent of the choice of\nf_\u03b8, and the same kind of memory benefit is present regardless of the type of f_\u03b8. However, to find\nthe equilibrium in a reliable and efficient manner, generally f_\u03b8 needs to be stable and constrained.\nThe two instantiations we provide in Section 4 are examples of stable transformations. (The gated\nactivation in TrellisNet and layer normalization in the transformer constrain the output ranges.)\n\nStacking the DEQ? A natural question arises: if one DEQ is good, can we get additional benefits\nby \u201cstacking\u201d DEQs (with potentially different classes of transformations)? The answer, somewhat\nsurprisingly, is no, as evidenced by the following theorem, which is proved in Appendix B. The\ntheorem essentially shows that stacking multiple DEQs does not create extra representational power\nover a single DEQ.\n\nTheorem 2. (Universality of \u201csingle-layer\u201d DEQs.) Let x_{1:T} \u2208 R^{T\u00d7p} be the input sequence,\nand \u03b8^{[1]}, \u03b8^{[2]} the sets of parameters for stable transformations f_{\u03b8^{[1]}} : R^r \u00d7 R^p \u2192 R^r and v_{\u03b8^{[2]}} :\nR^d \u00d7 R^r \u2192 R^d, respectively. 
Then there exists \u0393_\u0398 : R^{d+r} \u00d7 R^p \u2192 R^{d+r}, where \u0398 = \u03b8^{[1]} \u222a \u03b8^{[2]}, s.t.\n\nz^\u22c6_{1:T} = RootFind(g^v_{\u03b8^{[2]}}; RootFind(g^f_{\u03b8^{[1]}}; x_{1:T})) = RootFind(g^\u0393_\u0398; x_{1:T})_{[:,\u2212d:]},    (12)\n\nwhere [\u00b7]_{[:,\u2212d:]} denotes the last d feature dimensions of [\u00b7].\n\n4 Instantiations of DEQ\n\nWhile the forward and backward analyses of DEQ do not depend on the internal structure of f_\u03b8,\nin this section we briefly highlight two examples of f_\u03b8 as specific instantiations of DEQ. Both\nmodels (TrellisNet [8] and self-attention [48, 18]) achieve state-of-the-art results on various sequence\nmodeling benchmarks. Importantly, through these two very different models and their properties,\nwe illustrate the compatibility of the DEQ approach with all three major families of existing deep\nsequence networks: transformers, RNNs, and temporal convolutional networks (TCNs).\n\nFigure 1: Comparison of the DEQ with conventional weight-tied deep networks. (a) A simple illustration of solving for\nan equilibrium point in 2D. (b) A deep equilibrium model operates with significantly less memory\nthan conventional deep nets due to an analytical backward pass.\n\nTrellis networks. We briefly introduce the trellis network (TrellisNet) here and refer interested\nreaders to [8] for a detailed description. Generally, TrellisNet is a TCN with two modifications. First,\na linear transformation of the original input sequence x_{1:T} is added to the convolutional outputs at\nall layers. Second, the convolutional kernel weights are tied across the depth of the network (i.e.,\nTrellisNet is a weight-tied TCN). Thus we can write TrellisNet with convolutional kernel size k,\ndilation s, and nonlinearity \u03c8 in DEQ form as\n\nx\u0303_{1:T} = Input injection (i.e., linearly transformed inputs by Conv1D(x_{1:T}; W_x))\n\nf_\u03b8(z_{1:T}; x_{1:T}) = \u03c8(Conv1D([u_{\u2212(k\u22121)s:}, z_{1:T}]; W_z) + x\u0303_{1:T})\n\nwhere u_{\u2212(k\u22121)s:} is typically: 1) the last (k \u2212 1)s elements of the previous sequence\u2019s output (if\nusing history padding [8]); or 2) simply zero-padding. [\u00b7,\u00b7] means concatenation along the temporal\ndimension. Following [8], we use the LSTM gated activation for \u03c8.\n\nWeight-tied transformers. At a high level, multi-head self-attention transformers [48] are very\ndifferent from most deep networks. Instead of convolutions or recurrence, a self-attention layer maps\nthe input into Q (query), K (key), and V (value) and computes the attention score between time-steps\nt_i and t_j as [QK^\u22a4]_{i,j}. This attention score is then normalized via softmax and multiplied with the\nV sequence to produce the output. 
Since the transformer is order-invariant, prior work proposed to\nadd positional embeddings (PE) [48, 16] to the self-attention operation. Following this design, [18]\nfurther proposed the universal transformer, which \u201crecurrently stacks\u201d the transformer\u2019s self-attention\nand transition function block \u03c6 through a number of layers. Referring readers to [48, 16, 18] for more\ndetails, we write a weight-tied transformer in the DEQ form as\n\nx\u0303_{1:T} = Input injection (i.e., linearly transformed inputs by x_{1:T} W_x)\n\nf_\u03b8(z_{1:T}; x_{1:T}) = LN(\u03c6(LN(SelfAttention(z_{1:T} W_{QKV} + x\u0303_{1:T}; PE_{1:T}))))\n\nwhere W_{QKV} \u2208 R^{d\u00d73d} produces the Q, K, V for the multi-head self-attention, and LN stands for\nlayer normalization [5]. Note that we add input injection x\u0303_{1:T} to Q, K, V in addition to the positional\nembedding and initialize with z^{[0]}_{1:T} = 0. Following prior work [48, 19, 16, 18], we use a 2-layer\npositionwise feedforward residual block for \u03c6. In our implementation, we use the memory-augmented\ntransformer proposed by [16], where we feed [z^\u22c6_{\u2212T\u2032:}, z_{1:T}] (i.e., with history padding of length T\u2032)\nand relative positional embedding PE_{\u2212T\u2032:T} to the self-attention operation.\n\nFigure 1 provides a generic comparison between these conventional weight-tied deep networks and\nthe DEQ approach, highlighting the constant memory requirements of the latter.\n\n5 Experiments\n\nWe evaluate DEQ on both synthetic stress tests and realistic large-scale language modeling (where\ncomplex long-term temporal dependencies are involved). We use the two aforementioned\ninstantiations of f_\u03b8 in DEQ. 
On both WikiText-103 [35] (which contains >100M words and a vocabulary\nsize of >260K) and the smaller Penn Treebank corpus (where stronger regularizations are needed for\nconventional deep nets) for word-level language modeling, we show that DEQ achieves competitive\n(or better) performance even when compared to SOTA methods (of the same model size, both\nweight-tied and not) while using significantly less memory. We provide a more detailed introduction of the\ntasks and datasets in Appendix F.\n\nTable 1: DEQ achieves strong performance on the long-range copy-memory task.\n\nModels (Size) | DEQ-Transformer (ours) (14K) | TCN [7] (16K) | LSTM [26] (14K) | GRU [14] (14K)\nCopy Memory T=400 Loss | 3.5e-6 | 2.7e-5 | 0.0501 | 0.0491\n\nTable 2: DEQ achieves competitive performance on word-level Penn Treebank language modeling\n(on par with SOTA results, without fine-tuning steps [34]). \u2020The memory footprints are benchmarked\n(for fairness) on input sequence length 150 and batch size 15, which does not reflect the actual\nhyperparameters used; the values also do not include the memory for word embeddings.\n\nWord-level Language Modeling w/ Penn Treebank (PTB)\n\nModel | # Params | Non-embedding model size | Test perplexity | Memory\u2020\nVariational LSTM [22] | 66M | - | 73.4 | -\nNAS Cell [55] | 54M | - | 62.4 | -\nNAS (w/ black-box hyperparameter tuner) [32] | 24M | 20M | 59.7 | -\nAWD-LSTM [34] | 24M | 20M | 58.8 | -\nDARTS architecture search (second order) [29] | 23M | 20M | 55.7 | -\n60-layer TrellisNet (w/ auxiliary loss, w/o MoS) [8] | 24M | 20M | 57.0 | 8.5GB\nDEQ-TrellisNet (ours) | 24M | 20M | 57.1 | 1.2GB\n\nSetting. Both instantiations of DEQ use Broyden\u2019s method [10] to avoid direct computation of the\ninverse Jacobian, as described in Section 3.1.3. We note that the use of DEQ implicitly introduces a\nnew \u201chyperparameter\u201d \u2013 the stopping criterion for Broyden iterations. 
During training, we set this\ntolerance \u03b5 of forward and backward passes to \u03b5 = \u221aT \u00b7 10^{\u22125} and \u221aT \u00b7 10^{\u22128}, respectively. At\ninference, we relax the tolerance to \u03b5 = \u221aT \u00b7 10^{\u22122} (or we can use a smaller maximum iteration limit\nfor Broyden\u2019s method; see discussions later). For the DEQ-TrellisNet instantiation, we roughly follow\nthe settings of [8]. For DEQ-Transformers, we employ the relative positional embedding [16], with\nsequences of length 150 at both training and inference on the WikiText-103 dataset. Implementations\nand pretrained models can be found at https://github.com/locuslab/deq.\n\n5.1 Copy Memory Task\n\nThe goal of the copy memory task is simple: to explicitly test a sequence model\u2019s ability to exactly\nmemorize elements across a long period of time (see Appendix F). As shown in Table 1, DEQ\ndemonstrates good memory retention over relatively long sequences (T = 400), with substantially\nbetter results than recurrent architectures such as LSTM/GRU (consistent with the findings in [7]).\n\n5.2 Large-Scale Language Modeling\n\nOne issue encountered in prior works that take a continuous view of deep networks [11, 24] is\nthe challenge of scaling these approaches to real, high-dimensional, large-scale datasets. In this\nsubsection, we evaluate the DEQ approach on some large-scale language datasets and investigate its\neffectiveness as a practical \u201cimplicit-depth\u201d sequence model.\n\nPerformance on Penn Treebank. Following the set of hyperparameters used by [8] for TrellisNet,\nwe evaluate the DEQ-TrellisNet instantiation on word-level language modeling with the PTB corpus.\nNote that without an explicit notion of \u201clayer\u201d, we do not add auxiliary losses, as was done in [8]. 
As shown in Table 2, when trained from scratch, the DEQ-TrellisNet achieves a test perplexity on par with the original deeply supervised TrellisNet.\n\nPerformance on WikiText-103. On the much larger scale WT103 corpus (about 100x larger than PTB), the DEQ-TrellisNet achieves better test perplexity than the original deep TrellisNet. For the Transformer instantiation, we follow the design of the Transformer-XL model [16]. We specifically compare to a \u201cmedium\u201d Transformer-XL model (the largest released model that can fit on GPUs) and a \u201csmall\u201d Transformer-XL model, while noting that the largest Transformer-XL network has massive memory requirements (due in part to very wide hidden features, batch sizes, and training-time sequence lengths, which would not be decreased by a DEQ) and can only be trained on TPUs [16]. In Table 3, we show that the DEQs yield competitive performance, outperforming prior SOTA approaches such as [16] on similar model sizes while consuming much less memory during training.\n\nTable 3: DEQ-based models are competitive with SOTA deep networks of the same model size on the WikiText-103 corpus, with significantly less memory. \u2020See Table 2 for more details on the memory benchmarking. Transformer-XL models are not weight-tied, unless specified otherwise.\n\nWord-level Language Modeling w/ WikiText-103 (WT103)\n\nModel | # Params | Non-Embedding Model Size | Test perplexity | Memory\u2020\nGeneric TCN [7] | 150M | 34M | 45.2 | -\nGated Linear ConvNet [17] | 230M | - | 37.2 | -\nAWD-QRNN [33] | 159M | 51M | 33.0 | 7.1GB\nRelational Memory Core [40] | 195M | 60M | 31.6 | -\nTransformer-XL (X-large, adaptive embed., on TPU) [16] | 257M | 224M | 18.7 | 12.0GB\n70-layer TrellisNet (+ auxiliary loss, etc.) [8] | 180M | 45M | 29.2 | 24.7GB\n70-layer TrellisNet with gradient checkpointing | 180M | 45M | 29.2 | 5.2GB\nDEQ-TrellisNet (ours) | 180M | 45M | 29.0 | 3.3GB\nTransformer-XL (medium, 16 layers) | 165M | 44M | 24.3 | 8.5GB\nDEQ-Transformer (medium, ours) | 172M | 43M | 24.2 | 2.7GB\nTransformer-XL (medium, 18 layers, adaptive embed.) | 110M | 72M | 23.6 | 9.0GB\nDEQ-Transformer (medium, adaptive embed., ours) | 110M | 70M | 23.2 | 3.7GB\nTransformer-XL (small, 4 layers) | 139M | 4.9M | 35.8 | 4.8GB\nTransformer-XL (small, weight-tied 16 layers) | 138M | 4.5M | 34.9 | 6.8GB\nDEQ-Transformer (small, ours) | 138M | 4.5M | 32.4 | 1.1GB\n\nFigure 2: Left: number of Broyden iterations in forward and backward passes gradually grows with epochs. Right: DEQ-Transformer finds the equilibrium in a stable and efficient manner (whereas the deep transformer could oscillate around the fixed point, even when one exists).\n\nMemory footprint of DEQ. For conventional deep networks with L layers, the training memory complexity is O(L) since all intermediate activations are stored for backpropagation. In comparison, DEQs have an O(1) (i.e., constant) memory footprint due to the root-finding formulation. 
We benchmark the reduced memory consumption in the last column of Tables 2 and 3, with controlled sequence lengths and batch sizes for fairness. On both instantiations, the DEQ approach leads to an over 80% (up to 88%) reduction in memory consumption by the model (excluding word embeddings, which are orthogonal to the comparison here). Moreover, we empirically verify (using a 70-layer TrellisNet) that DEQ consumes even less memory than gradient checkpointing [12], a popular technique that reduces the memory required to train a layer-based model to O(\u221aL). Note that the DEQ\u2019s memory footprint remains competitive even when compared with baselines that are not weight-tied (a reduction of over 60%), with similar or better accuracy.\n\nInitialization of DEQ. To train DEQ models, it is critical to ensure that the model is stable, such that the equilibrium state can be reliably approximated via quasi-Newton methods. While we found that the most commonly used initialization schemes with small values (around 0) suffice, it is generally important to make sure that DEQ starts with a small operator norm in the weight matrices. For both DEQ-TrellisNet and DEQ-Transformer, we observe that they are not sensitive to any specific initialization scheme since non-linearities such as \u03c3/tanh and LayerNorm also help make f_\u03b8 contractive (and stable). We initialize the parameters of f_\u03b8 by sampling from N(0, 0.05).\n\n[Figure 2 plots, both titled DEQ-Transformer on WT103 (Seq. Length=150). Left: # of Broyden Iter. per Time Step vs. Training Epoch, with curves Forward (eps=1e-6) and Backward (eps=1e-8). Right: Difference Norm ||f(x)-x|| vs. Number of Function Evaluations, with curves Weight-tied Trans. (Ep. 1), Weight-tied Trans. (Ep. 12), DEQ-Trans. (Ep. 1), and DEQ-Trans. (Ep. 12).]\n\nTable 4: Runtime ratios between DEQs and corresponding deep networks at training and inference (> 1\u00d7 implies DEQ is slower). The ratios are benchmarked on WikiText-103.\n\nComparison | Training | Inference\nDEQ / 18-layer Transformer | 2.82\u00d7 | 1.76\u00d7\nDEQ / 70-layer TrellisNet | 2.40\u00d7 | 1.64\u00d7\n\nFigure 3: DEQ can be accelerated by leveraging higher tolerance \u03b5 (left) or a lower Broyden iteration limit (right). In general, poor estimates of the equilibrium can hurt DEQ performance.\n\nConvergence to equilibrium. The deep equilibrium model does not have \u201clayers\u201d. One factor that affects computation time in DEQs is the number of Broyden iterations in forward/backward passes, where each forward Broyden step evaluates f_\u03b8 once, and a backward step computes a vector-Jacobian product. We find that in general the number of Broyden iterations gradually increases with training epochs (Figure 2, left, where the y-axis is computed by (Total Broyden Iterations) / (Sequence Length)), an observation similar to the one reported for training Neural ODEs [11]. One factor contributing to this phenomenon could be that the training pushes the operator norm of J_{f_\u03b8} to larger values, making the fixed point harder to solve. Meanwhile, the backward pass requires far fewer iterations than the forward pass, primarily due to the simplicity of the linear system in Eq. (11). We also find that DEQs can almost always converge to the sequence-level fixed point, much more efficiently than original weight-tied transformers (Figure 2, right). Note that after 12 epochs, deeply stacked self-attention tends to oscillate around the fixed point, while DEQs exhibit stable convergence with the quasi-Newton method.\n\nBroyden iterations and the runtime of DEQ. Unlike conventional deep networks that come with a fixed number L of layers, the runtime of DEQ depends strongly on the number of Broyden steps to reach the equilibrium. 
Therefore, it\u2019s challenging to fairly compare the runtimes of implicit-depth models like DEQ with those of corresponding weight-tied deep networks (e.g., using higher depth necessarily takes longer to run). Ideally, the values of \u03b5 should be as small as possible so as to ensure that the analytical gradients from Theorem 1 are accurate. However, we empirically observe that using a higher \u03b5 or a lower iteration limit allows the DEQ to be trained and evaluated much faster with only a small degradation in performance. For instance, we generally find \u03b5 < 0.1 or an iteration limit of 30 (on sequence length 75) to be sufficient for competitive performance. Figure 3 visualizes this tradeoff on a medium DEQ-Transformer (without adaptive embedding). Note that accuracy quickly diverges when tolerance \u03b5 is too large (Figure 3, left), suggesting that a poor estimate of the equilibrium can hurt DEQ performance. Table 4 provides approximate runtime ratios for competitive-accuracy DEQs on WikiText-103. DEQs are typically slower than layer-based deep networks.\n\nAdditional empirical remarks as well as training tips are provided in Appendix E.\n\n6 Conclusion\n\nDeep networks have predominantly taken the form of stacks of layers. We propose the deep equilibrium approach (DEQ), which models temporal data by directly solving for the sequence-level fixed point and optimizing this equilibrium for better representations. DEQ needs only O(1) memory at training time, is agnostic to the choice of the root solver in the forward pass, and is sufficiently versatile to subsume drastically different architectural choices. Our experiments have shown that DEQs have good temporal memory retention, are able to scale to realistic, large-scale sequence tasks, and perform competitively with, or slightly outperform, SOTA methods. 
Overall, we believe that the DEQ approach provides an interesting and practical new perspective on designing and optimizing sequence models.\n\n[Figure 3 plots, both titled DEQ-Transformer on WT103. Left: Validation Perplexity vs. Forward Threshold Epsilon (Step Avg.). Right: Validation Perplexity vs. Forward Broyden Iteration Limit.]\n\nReferences\n\n[1] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv:1808.04444, 2018.\n\n[2] Luis B Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Artificial Neural Networks. 1990.\n\n[3] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning (ICML), 2017.\n\n[4] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning (ICML), 2016.\n\n[5] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450, 2016.\n\n[6] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations (ICLR), 2019.\n\n[7] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.\n\n[8] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. In International Conference on Learning Representations (ICLR), 2019.\n\n[9] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In International Conference on Learning Representations (ICLR), 2017.\n\n[10] Charles G Broyden. A class of methods for solving nonlinear simultaneous equations. 
Mathematics of Computation, 1965.\n\n[11] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Neural Information Processing Systems, 2018.\n\n[12] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv:1604.06174, 2016.\n\n[13] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019.\n\n[14] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.\n\n[15] Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation models. In AAAI Conference on Artificial Intelligence, 2019.\n\n[16] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics (ACL), 2019.\n\n[17] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International Conference on Machine Learning (ICML), 2017.\n\n[18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and \u0141ukasz Kaiser. Universal transformers. In International Conference on Learning Representations (ICLR), 2019.\n\n[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.\n\n[20] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, and Armin Askari. Implicit deep learning. arXiv:1908.06315, 2019.\n\n[21] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2), 1990.\n\n[22] Yarin Gal and Zoubin Ghahramani. 
A theoretically grounded application of dropout in recurrent neural networks. In Neural Information Processing Systems, 2016.\n\n[23] Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Neural Information Processing Systems, 2017.\n\n[24] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 2017.\n\n[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[26] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.\n\n[27] Michaeel Kazi and Brian Thompson. Implicitly-defined neural networks for sequence labeling. In Annual Meeting of the Association for Computational Linguistics (Short Papers), 2017.\n\n[28] Renjie Liao, Yuwen Xiong, Ethan Fetaya, Lisa Zhang, KiJung Yoon, Xaq Pitkow, Raquel Urtasun, and Richard Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning (ICML), 2018.\n\n[29] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.\n\n[30] Matthew MacKay, Paul Vicol, Jimmy Ba, and Roger B. Grosse. Reversible recurrent neural networks. In Neural Information Processing Systems, 2018.\n\n[31] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2), 1993.\n\n[32] G\u00e1bor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations (ICLR), 2018.\n\n[33] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. 
arXiv:1803.08240, 2018.\n\n[34] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR), 2018.\n\n[35] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017.\n\n[36] John Miller and Moritz Hardt. When recurrent models don\u2019t need to be recurrent. arXiv:1805.10369, 2018.\n\n[37] Vlad Niculae, Andre Martins, Mathieu Blondel, and Claire Cardie. SparseMAP: Differentiable sparse structured inference. In International Conference on Machine Learning (ICML), 2018.\n\n[38] Fernando J Pineda. Generalization of back propagation to recurrent and higher order neural networks. In Neural Information Processing Systems, 1988.\n\n[39] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural Information Processing Systems, 2016.\n\n[40] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In Neural Information Processing Systems, 2018.\n\n[41] Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 2017.\n\n[42] Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 1950.\n\n[43] Patrice Y Simard, Mary B Ottaway, and Dana H Ballard. Fixed point analysis for recurrent networks. In Neural Information Processing Systems, 1989.\n\n[44] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1), 2014.\n\n[45] Benoit Steiner, Zachary DeVito, Soumith Chintala, Sam Gross, Adam Paszke, Francisco Massa, Adam Lerer, Gregory Chanan, Zeming Lin, Edward Yang, et al. PyTorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems, 2019.\n\n[46] Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. Learning longer-term dependencies in RNNs with auxiliary losses. In International Conference on Machine Learning (ICML), 2018.\n\n[47] A\u00e4ron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.\n\n[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.\n\n[49] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 1989.\n\n[50] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning (ICML), 2019.\n\n[51] Paul J Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1990.\n\n[52] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations (ICLR), 2018.\n\n[53] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R Salakhutdinov, and Yoshua Bengio. 
Architectural complexity measures of recurrent neural networks. In Neural Information Processing Systems, 2016.\n\n[54] Ziming Zhang, Anil Kag, Alan Sullivan, and Venkatesh Saligrama. Equilibrated recurrent neural network: Neuronal time-delayed self-feedback improves accuracy and stability. arXiv:1903.00755, 2019.\n\n[55] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.", "award": [], "sourceid": 348, "authors": [{"given_name": "Shaojie", "family_name": "Bai", "institution": "Carnegie Mellon University"}, {"given_name": "J. Zico", "family_name": "Kolter", "institution": "Carnegie Mellon University / Bosch Center for AI"}, {"given_name": "Vladlen", "family_name": "Koltun", "institution": "Intel Labs"}]}