{"title": "Breaking the Activation Function Bottleneck through Adaptive Parameterization", "book": "Advances in Neural Information Processing Systems", "page_first": 7739, "page_last": 7750, "abstract": "Standard neural network architectures are non-linear only by virtue of a simple element-wise activation function, making them both brittle and excessively large. In this paper, we consider methods for making the feed-forward layer more flexible while preserving its basic structure. We develop simple drop-in replacements that learn to adapt their parameterization conditional on the input, thereby increasing statistical efficiency significantly. We present an adaptive LSTM that advances the state of the art for the Penn Treebank and WikiText-2 word-modeling tasks while using fewer parameters and converging in half as many iterations.", "full_text": "Breaking the Activation Function Bottleneck through Adaptive Parameterization

Sebastian Flennerhag1, 2    Hujun Yin1, 2    John Keane1    Mark Elliot1

1University of Manchester    2The Alan Turing Institute

sflennerhag@turing.ac.uk    {hujun.yin, john.keane, mark.elliot}@manchester.ac.uk

Abstract

Standard neural network architectures are non-linear only by virtue of a simple element-wise activation function, making them both brittle and excessively large. In this paper, we consider methods for making the feed-forward layer more flexible while preserving its basic structure. We develop simple drop-in replacements that learn to adapt their parameterization conditional on the input, thereby increasing statistical efficiency significantly. 
We present an adaptive LSTM that advances the state of the art for the Penn Treebank and WikiText-2 word-modeling tasks while using fewer parameters and converging in less than half the number of iterations.

1 Introduction

While a two-layer feed-forward neural network is sufficient to approximate any function (Cybenko, 1989; Hornik, 1991), in practice much deeper networks are necessary to learn a good approximation to a complex function. In fact, a network tends to generalize better the larger it is, often to the point of having more parameters than there are data points in the training set (Canziani et al., 2016; Novak et al., 2018; Frankle & Carbin, 2018).

One reason why neural networks are so large is that they bias towards linear behavior: if the activation function is largely linear, so will the hidden layer be. Common activation functions, such as the Sigmoid, Tanh, and ReLU, all behave close to linear over large ranges of their domain. Consequently, for a randomly sampled input to break linearity, layers must be wide and the network deep to ensure that some elements lie in non-linear regions of the activation function. To overcome the bias towards linear behavior, more sophisticated activation functions have been designed (Clevert et al., 2015; He et al., 2015; Klambauer et al., 2017; Dauphin et al., 2017). However, these still limit all non-linearity to sit in the activation function.

We instead propose adaptive parameterization, a method for learning to adapt the parameters of the affine transformation to a given input. In particular, we present a generic adaptive feed-forward layer that retains the basic structure of the standard feed-forward layer while significantly increasing the capacity to model non-linear patterns. 
We develop specific instances of adaptive parameterization that can be trained end-to-end jointly with the network using standard backpropagation, are simple to implement, and run at minimal additional cost.

Empirically, we find that adaptive parameterization can learn non-linear patterns where a non-adaptive baseline fails, or outperform the baseline using 30–50% fewer parameters. In particular, we develop an adaptive version of the Long Short-Term Memory model (LSTM; Hochreiter & Schmidhuber, 1997; Gers et al., 2000) that enjoys both faster convergence and greater statistical efficiency. The adaptive LSTM advances the state of the art for the Penn Treebank and WikiText-2 word modeling tasks using ~20–30% fewer parameters and converging in less than half as many iterations.1 We proceed as follows: section 2 presents the adaptive feed-forward layer, section 3 develops the adaptive LSTM, section 4 discusses related work, and section 5 presents empirical analysis and results.

1Code available at https://github.com/flennerhag/alstm.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Adaptation policies. Left: output adaptation (DW) shifts the mean of each row in W; center left: input adaptation (W D) shifts the mean of each column; center right: IO-adaptation (D(2)W D(1)) shifts mean and variance across sub-matrices; right: SVA (W (2)DW (1)) scales singular values.

2 Adaptive Parameterization

To motivate adaptive parameterization, we show that deep neural networks learn a family of compositions of linear maps and, because the activation function is static, the inherent flexibility in this family is weak. Adaptive parameterization is a means of increasing this flexibility and thereby increasing the model's capacity to learn non-linear patterns. 
We focus on the feed-forward layer, f(x) := φ(W x + b), for some activation function φ : R → R. Define the pre-activation layer as a = A(x) := W x + b and denote by g(a) := φ(a)/a the activation effect of φ given a, where division is element-wise. Let G = diag(g(a)). We then have f(x) = g(a) ⊙ a = G a; we use "⊙" to denote the Hadamard product.2

For any pair (x, y) ∈ Rn × Rk, a deep feed-forward network with N ∈ N layers, f(N) ◦ ··· ◦ f(1), approximates the relationship x ↦ y by a composition of linear maps. To see this, note that x is sufficient to determine all activation effects G = {G(1), . . . , G(N)}. Together with fixed transformations A = {A(1), . . . , A(N)}, the network can be expressed as

ŷ = (f(N) ◦ ··· ◦ f(1))(x) = (G(N) ◦ A(N) ◦ ··· ◦ G(1) ◦ A(1))(x).   (1)

A neural network can therefore be understood as learning a "prior" A in parameter space around which it constructs a family of compositions of linear maps (as G varies across inputs). The neural network adapts to inputs through the set of activation effects G. This adaptation mechanism is weak: if φ is close to linear over the distribution of a, as is often the case, little adaptation can occur. Moreover, because G does not have any learnable parameters itself, the fixed prior A must learn to encode both global input-invariant information as well as local contextual information. We refer to this as the activation function bottleneck. Adaptive parameterization breaks this bottleneck by parameterizing the adaptation mechanism in G, thereby circumventing these issues.

To see how the activation function bottleneck arises, note that φ is redundant whenever it is closely approximated by a linear function over some non-trivial segment of the input distribution. 
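The decomposition above can be checked numerically. The following sketch (our own illustration with arbitrary layer sizes, not the paper's released code) verifies that f(x) = G a, with G = diag(g(a)) and g(a) = φ(a)/a, reproduces φ(W x + b) for a Tanh layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)  # arbitrary layer sizes
x = rng.normal(size=3)

a = W @ x + b            # pre-activation a = A(x)
g = np.tanh(a) / a       # activation effect g(a) = phi(a)/a (a != 0 almost surely)
G = np.diag(g)

f_decomposed = G @ a     # f(x) = G a
f_direct = np.tanh(a)    # f(x) = phi(W x + b)
assert np.allclose(f_decomposed, f_direct)
```

The diagonal matrix G here is entirely determined by the pre-activation; it is exactly this quantity that adaptive parameterization replaces with a learned policy.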
For these inputs, φ has no non-linear effect, and such lost opportunities imply that the neural network must be made larger than necessary to fully capture non-linear patterns. For instance, both the Sigmoid and the Tanh are closely approximated around 0 by a linear function, rendering them redundant for inputs close to 0. Consequently, the network must be made deeper and its layers wider to mitigate the activation function bottleneck. In contrast, adaptive parameterization places the layer's non-linearity within the parameter matrix itself, thereby circumventing the activation function bottleneck. Further, by relaxing the element-wise non-linearity constraint imposed on the standard feed-forward layer, it can learn behaviors that would otherwise be very hard or impossible to model, such as contextual rotations and shears, and adaptive feature activation.

2This holds almost everywhere, but not for {a | ai = 0, ai ∈ a}. Being measure 0, we ignore this exception.

2.1 The Adaptive Feed-Forward Layer

Our goal is to break the activation function bottleneck by generalizing G into a parameterized adaptation policy, thereby enabling the network to specialize parameters in A to encode global, input-invariant information while parameters in G encode local, contextual information.

Consider the standard feed-forward layer, defined by one adaptation block f(x) = (G ◦ A)(x). As described above, we increase the capacity of the adaptation mechanism G by replacing it with a parameterized adaptation mechanism D(j) := diag(π(j)(x)), where π(j) is a learnable adaptation policy. Note that π(j) can be made arbitrarily complex. In particular, even if π(j) is linear, the adaptive mechanism D(j) a is quadratic in x, and as such escapes the bottleneck. 
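The quadratic escape from the bottleneck is easy to verify. The sketch below (our own illustration, not the paper's code) builds a first-order input-adaptation layer with a linear policy π(x) = P x, no bias, and no activation function; the resulting map is homogeneous of degree two in x, so it is non-linear even though every component is linear:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))
P = rng.normal(size=(3, 3))  # linear adaptation policy: pi(x) = P x

def adaptive_layer(x):
    # First-order input adaptation with the bias term omitted:
    # f(x) = W D(1) x, where D(1) = diag(P x).
    return W @ (np.diag(P @ x) @ x)

x = rng.normal(size=3)
# The map is quadratic in x: scaling the input by c scales the output by c**2,
# so the layer is non-linear even though pi itself is linear.
assert np.allclose(adaptive_layer(2 * x), 4 * adaptive_layer(x))
```

With a non-linear π, or with several such blocks composed, correspondingly richer input-conditional behavior is obtained.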
To ensure that the adaptive feed-forward layer has sufficient capacity, we generalize it to q ∈ N adaptation blocks,3

f(x) := φ( D(q)W (q−1) ··· W (1)D(1) x + D(0) b ).   (2)

We refer to the number of adaptation blocks q as the order of the layer. Strictly speaking, the adaptive feed-forward layer does not need an activation function, but it can provide desirable properties depending on the application. It is worth noting that the adaptive feed-forward layer places no restrictions on the form of the adaptation policy π = (π(0), . . . , π(q)) or its training procedure. In this paper, we parameterize π as a neural network trained jointly with the main model. Next, we show how different adaptive feed-forward layers are generated by the choice of adaptation policy.

2.2 Adaptation Policies

Higher-order adaptation (i.e. q large) enables expressive adaptation policies, but because the adaptation policy depends on x, high-order layers are less efficient than a stack of low-order layers. We find that low-order layers are surprisingly powerful, and present a policy of order 2 that can express any other adaptation policy.

Partial Adaptation The simplest adaptation policy (q = 1) is given by f(x) = W D(1) x + D(0) b. This policy is equivalent to a mean shift and a re-scaling of the columns of W, or alternatively a re-scaling of the input. It can be thought of as a learned contextualized standardization mechanism that conditions the effect on the specific input. As such, we refer to this policy as input adaptation. Its mirror image, output adaptation, is given by f(x) = D(1)W x + D(0) b; this is the special case of a second-order adaptation policy in which the inner adaptation matrix is the identity I. Both these policies are restrictive in that they only operate on either the rows or the columns of W (fig. 
1).

IO-adaptation The general form of second-order adaptation policies integrates input- and output-adaptation into a jointly learned adaptation policy. As such, we refer to this as IO-adaptation,

f(x) = D(2)W D(1) x + D(0) b.   (3)

IO-adaptation is much more powerful than either input- or output-adaptation alone, which can be seen by the fact that it essentially learns to identify and adapt sub-matrices in W by sharing adaptation vectors across rows and columns (fig. 1). In fact, assuming π is sufficiently powerful, IO-adaptation can express any mapping from input to parameter space.

Property 1. Let W be given and fix x. For any G of the same dimensionality as W, there are arbitrarily many (D(1), D(2)) such that G x = D(2)W D(1) x.

Proof: see supplementary material.

3The ordering of W and D matrices can be reversed by setting the first and / or last adaptation matrix to be the identity matrix.

Singular Value Adaptation (SVA) Another policy of interest arises as a special case of third-order adaptation policies, where D(1) = I as before. The resulting policy,

f(x) = W (2)DW (1) x + D(0) b,   (4)

is reminiscent of Singular Value Decomposition. However, rather than being a decomposition, it composes a projection by adapting singular values to the input. In particular, letting W (1) = V T A and W (2) = BU, with U and V appropriately orthogonal, eq. 4 can be written as B(U DV T )A x, with U DV T adapted to x through its singular values. In our experiments, we initialize weight matrices as semi-orthogonal (Saxe et al., 2013), but we do not enforce orthogonality after initialization.

The drawback of SVA is that it requires learning two separate matrices of relatively high rank. For problems where the dimensionality of x is large, the dimensionality of the adaptation space has to be made small to control parameter count. 
This limits the model's capacity by enforcing a low-rank factorization, which also tends to impact training negatively (Denil et al., 2013).

SVA and IO-adaptation are simple but flexible policies that can be used as drop-in replacements for any feed-forward layer. Because they are differentiable, they can be trained using standard backpropagation. Next, we demonstrate adaptive parameterization in the context of Recurrent Neural Networks (RNNs), where feed-forward layers are predominant.

3 Adaptive Parameterization in RNNs

RNNs are common in sequence learning, where the input is a sequence {x_1, . . . , x_t} and the target variable is either itself a sequence or a single point or vector. In either case, the computational graph of an RNN, when unrolled over time, will be of the form in eq. 1, making it a prime candidate for adaptive parameterization. Moreover, in sequence-to-sequence learning, the model estimates a conditional distribution p(y_t | x_1, . . . , x_t) that changes significantly from one time step to the next. Because of this variance, an RNN must be very flexible to model the conditional distribution. By embedding adaptive parameterization, we can increase flexibility for a given model size. Consider the LSTM model (Hochreiter & Schmidhuber, 1997; Gers et al., 2000), defined by the gating mechanism

c_t = σ(u^f_t) ⊙ c_{t−1} + σ(u^i_t) ⊙ τ(u^z_t)
h_t = σ(u^o_t) ⊙ τ(c_t),   (5)

where σ and τ represent the Sigmoid and Tanh activation functions respectively, and each u^s_t, s ∈ {i, f, o, z}, is a linear transformation of the form u^s_t = W^(s) x_t + V^(s) h_{t−1} + b^(s). Adaptation in the LSTM can be derived directly from the adaptive feed-forward layer (eq. 2). We focus on IO-adaptation as this adaptation policy performed better in our experiments. 
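Applied to a single linear map, the IO-adaptation of eq. 3 can be sketched as follows (a NumPy illustration with hypothetical Tanh policy projections P0, P1, P2, not the released aLSTM implementation); note that element-wise scaling by the adaptation vectors is equivalent to the diagonal-matrix form, which is what makes vectorized implementations cheap:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
W, b = rng.normal(size=(n, n)), rng.normal(size=n)
# Hypothetical linear projections for the three policy outputs pi_0, pi_1, pi_2.
P0, P1, P2 = rng.normal(size=(3, n, n))

x = rng.normal(size=n)
d0, d1, d2 = np.tanh(P0 @ x), np.tanh(P1 @ x), np.tanh(P2 @ x)

# eq. 3 in vector form: element-wise scaling by the adaptation vectors.
y_vec = d2 * (W @ (d1 * x)) + d0 * b
# The same map in matrix form: f(x) = D(2) W D(1) x + D(0) b.
y_mat = np.diag(d2) @ W @ np.diag(d1) @ x + np.diag(d0) @ b

assert np.allclose(y_vec, y_mat)
```

In the aLSTM, the same pattern is applied to each of the four gate transformations, with the policy conditioned on the state of the system rather than on x alone.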
For π, we use a small neural network to output a latent variable z_t that we map into each sub-policy with a projection U^(j): π^(j)(z_t) = τ(U^(j) z_t). We test a static and a recurrent network as models for the latent variable,

z_t = ReLU(W v_t + b),   (6)
z_t = m(v_t, z_{t−1}),   (7)

where m is a standard LSTM and v_t a summary variable of the state of the system, normally v_t = [x_t ; h_{t−1}] (we use [· ; ·] to denote concatenation). The potential benefit of using a recurrent model is that it is able to retain a separate memory that facilitates learning of local, sub-sequence specific patterns (Ha et al., 2017). Generally, we find that the recurrent model converges faster and generalizes marginally better. To extend the adaptive feed-forward layer to the LSTM, index sub-policies with a tuple (s, j) ∈ {i, f, o, z} × {0, 1, 2, 3, 4} such that D^(s,j)_t = diag(π^(s,j)(z_t)). At each time step t we adapt the LSTM's linear transformations through IO-adaptation,

u^s_t = D^(s,4)_t W^(s) D^(s,3)_t x_t + D^(s,2)_t V^(s) D^(s,1)_t h_{t−1} + D^(s,0)_t b^(s).   (8)

An undesirable side-effect of the formulation in eq. 8 is that each linear transformation requires its own modified input, preventing a vectorized implementation of the LSTM. We avoid this by tying all input adaptations across s: that is, D^(s′,j) = D^(s,j) for all (s′, j) ∈ {i, f, o, z} × {1, 3}. Doing so approximately halves the computation time and speeds up convergence considerably. When stacking multiple aLSTM layers, the computational graph of the model becomes complex in that it extends both in the temporal dimension and along the depth of the stack. For the recurrent adaptation policy (eq. 
7) to be consistent, it should be conditioned not only on the latent variable in its own layer but also on that of the preceding layer, or it will not have a full memory of the computational graph. To achieve this, for a layer l ∈ {1, . . . , L}, we define the input summary variable as

v^(l)_t = [ h^(l−1)_t ; h^(l)_{t−1} ; z^(l−1)_t ],   (9)

where h^(0)_t = x_t and z^(0)_t = z^(L)_{t−1}. In doing so, the credit assignment path of the adaptation policy visits all nodes in the computational graph. The resulting adaptation model becomes a blend of a standard LSTM and a Recurrent Highway Network (RHN; Zilly et al., 2016).

4 Related Work

Adaptive parameterization is a special case of having a relatively inexpensive learning algorithm search a vast parameter space in order to parameterize the larger main model (Stanley et al., 2009; Fernando et al., 2016). The notion of using one model to generate context-dependent parameters for another was suggested by Schmidhuber (1992) and Gomez & Schmidhuber (2005). Building on this idea, Ha et al. (2017) proposed to jointly train a small network to generate the parameters of a larger network; such HyperNetworks have achieved impressive results in several domains (Suarez, 2017; Ha & Eck, 2018; Brock et al., 2018). The general concept of learning to parameterize a model has been explored in a variety of contexts, for example Schmidhuber (1992); Gomez & Schmidhuber (2005); Denil et al. (2013); Jaderberg et al. (2017); Andrychowicz et al. (2016); Yang et al. (2018).

Parameter adaptation has also been explored in meta-learning, usually in the context of few-shot learning, where a meta-learner is trained across a set of tasks to select task-specific parameters of a downstream model (Bengio et al., 1991, 1995; Schmidhuber, 1992). Similar to adaptive parameterization, Bertinetto et al. 
(2016) directly tasks a meta learner with predicting the weights of\nthe task-speci\ufb01c learner. Ravi & Larochelle (2017) de\ufb01nes the adaptation policy as a gradient-descent\nrule, where the meta learner is an LSTM tasked with learning the update rule to use. An alternative\nmethod pre-de\ufb01nes the adaptation policy as gradient descent and meta-learns an initialization such\nthat performing gradient descent on a given input from some new task yields good task-speci\ufb01c\nparameters (Finn et al., 2017; Lee & Choi, 2017; Al-Shedivat et al., 2018).\nUsing gradient information to adjust parameters has also been explored in sequence-to-sequence\nlearning, where it is referred to as dynamic evaluation (Mikolov, 2012; Graves, 2013; Krause et al.,\n2017). This form of adaptation relies on the auto-regressive property of RNNs to adapt parameters at\neach time step by taking a gradient step with respect to one or several previous time steps.\nMany extensions have been proposed to the basic RNN and the LSTM model (Hochreiter & Schmid-\nhuber, 1997; Gers et al., 2000), some of which can be seen as implementing a form of constrained\nadaptation policy. The multiplicative RNN (mRNN; Sutskever et al., 2011) and the multiplicative\nLSTM (mLSTM; Krause et al., 2016) can be seen as implementing an SVA policy for the hidden-to-\nhidden projections. mRNN improves upon RNNs in language modeling tasks (Sutskever et al., 2011;\nMikolov et al., 2012), but tends to perform worse than the standard LSTM (Cooijmans et al., 2016).\nmLSTM has been shown to improve upon RNNs and LSTMs on language modeling tasks (Krause\net al., 2017; Radford et al., 2017). 
The multiplicative-integration RNN and its LSTM version (Wu et al., 2016) essentially implement a constrained output-adaptation policy.

The implicit policies in the above models condition only on the input, ignoring the state of the system. In contrast, the GRU (Cho et al., 2014; Chung et al., 2014) can be interpreted as implementing an input-adaptation policy on the input-to-hidden matrix that conditions on both the input and the state of the system. Most closely related to the aLSTM are HyperNetworks (Ha et al., 2017; Suarez, 2017); these implement output adaptation conditioned on both the input and the state of the system using a recurrent adaptation policy. HyperNetworks have attained impressive results on character-level modeling tasks and sequence generation tasks, including hand-writing and drawing sketches (Ha et al., 2017; Ha & Eck, 2018). They have also been used in neural architecture search by generating weights conditional on the architecture (Brock et al., 2018), demonstrating that adaptive parameterization can be conditioned on some arbitrary context, in this case the architecture itself.

5 Experiments

We compare the behavior of a model with adaptive feed-forward layers to standard feed-forward baselines in a controlled regression problem and on MNIST (LeCun et al., 1998). The aLSTM is tested on the Penn Treebank and WikiText-2 word modeling tasks. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated.

5.1 Extreme Tail Regression

To study the flexibility of the adaptive feed-forward layer, we sample x = (x1, x2) from N(0, I) and construct the target variable as y = (2x1)^2 − (3x2)^4 + ε, with ε ∼ N(0, 1). 
Most of the data lies on\na hyperplane, but the target variable grows or shrinks exponentially as x1 or x2 moves away from 0.\nWe compare a 3-layer feed-forward network with 10 hidden units to a 2-layer model with 2 hidden\nunits, where the \ufb01rst layer is adaptive and the \ufb01nal layer is static. We use an SVA policy where \u03c0 is a\ngated linear unit (Dauphin et al., 2017). Both models are trained for 10 000 steps with a batch size of\n50 and a learning rate of 0.003.\n\nFigure 2: Extreme tail regression. Left: Predictions of the adaptive model (blue) and the baseline\nmodel (green) against ground truth (black). Center & Right: distribution of adaptive singular values.\n\nThe baseline model fails to represent the tail of the distribution despite being three times larger.\nIn contrast, the adaptive model does a remarkably good job given how small the model is and the\nextremity of the distribution. It is worth noting how the adaptation policy encodes local information\nthrough the distribution of its singular values (\ufb01g. 2).\n\n5.2 MNIST\n\nWe compare performance of a 3-layer feed-forward model against (a) a single-layer SVA model and\n(b) a 3-layer SVA model. We train all models with Stochastic Gradient Descent with a learning rate of\n0.001, a batch size of 128, and train for 50 000 steps. The single-layer adaptive model reduces to a\nlogistic regression conditional on the input. By comparing it to a logistic regression, we measure the\nmarginal bene\ufb01t of the SVA policy to approximately 1 percentage point gain in accuracy. In fact, if the\none-layer SVA model has a suf\ufb01ciently expressive adaptation model it matches and even outperforms\nthe deep feed-forward baseline.\n\n5.3 Penn Treebank\n\nThe Penn Treebank corpus (PTB; Marcus et al., 1993; Mikolov et al., 2010) is a widely used benchmark\nfor language modeling. It consists of heavily processed news articles and contains no capital letters,\nnumbers, or punctuation. 
As such, the vocabulary is relatively small at 10 000 unique words.

We evaluate the aLSTM on word-level modeling following standard practice in training setup (e.g. Zaremba et al., 2015). As we are interested in statistical efficiency, we fix the number of layers to 2, though more layers tend to perform better, and use a policy latent variable size of 100. For details on hyper-parameters, see supplementary material. As we are evaluating underlying architectures, we do not compare against bolt-on methods (Grave et al., 2017; Yang et al., 2018; Mikolov, 2012; Graves, 2013; Krause et al., 2017). These are equally applicable to the aLSTM.

Table 1: Train and test set accuracy on MNIST

Model | Size | Train | Test
Logistic Regression | 8K | 92.00% | 92.14%
3-layer feed-forward | 100K | 97.57% | 97.01%
1-layer SVA | 8K | 94.05% | 93.86%
1-layer SVA | 100K | 98.62% | 97.14%
3-layer SVA | 100K | 99.99% | 97.65%

Figure 3: Validation loss on PTB for our LSTM (green), aLSTM (blue), aLSTM with static policy (dashed), and the AWD-LSTM (orange; Merity et al., 2018). Drops correspond to learning rate cuts.

The aLSTM improves upon previously published results using roughly 30% fewer parameters, a smaller hidden state size, and fewer layers while converging in fewer iterations (table 2). Notably, for the standard LSTM to converge at all, gradient clipping is required and dropout rates must be reduced by ~25%. In our experimental setup, a percentage point change to these rates causes either severe overfitting or failure to converge. Taken together, this indicates that adaptive parameterization enjoys both superior stability properties and substantially increased model capacity, even when the baseline model is complex; we explore both further in sections 5.5 and 5.6. Melis et al. 
(2018) applies a large-scale hyper-parameter search to an LSTM version with tied input and forget gates and inter-layer skip-connections (TG-SC LSTM), making it a challenging baseline that the aLSTM improves upon by a considerable margin.

Previous state-of-the-art performance was achieved by the ASGD Weight-Dropped LSTM (AWD-LSTM; Merity et al., 2018), which uses regularization, optimization, and fine-tuning techniques designed specifically for language modeling4. The AWD-LSTM requires approximately 500 epochs to converge to optimal performance; the aLSTM outperforms the AWD-LSTM after 144 epochs and converges to optimal performance in 180 epochs. Consequently, even though the AWD-LSTM runs on top of the CuDNN implementation of the LSTM, the aLSTM converges ~25% faster in wall-clock time. In summary, any form of adaptation is beneficial, and a recurrent adaptation model (eq. 7) enjoys both the fastest convergence rate and the best final performance in this experiment.

4Public release of their code at https://github.com/salesforce/awd-lstm-lm

Table 2: Validation and test set perplexities on Penn Treebank. All results except those from Zaremba et al. (2015) use tied input and output embeddings (Press & Wolf, 2017).

Model | Size | Depth | Valid | Test
LSTM, Zaremba et al. (2015) | 24M | 2 | 82.2 | 78.4
RHN, Zilly et al. (2016) | 24M | 10 | 67.9 | 65.4
NAS, Zoph & Le (2017) | 54M | — | — | 62.4
TG-SC LSTM, Melis et al. (2018) | 10M | 4 | 62.4 | 60.1
TG-SC LSTM, Melis et al. (2018) | 24M | 4 | 60.9 | 58.3
AWD-LSTM, Merity et al. (2018) | 24M | 3 | 60.0 | 57.3
LSTM | 20M | 2 | 71.7 | 68.9
aLSTM, static policy (eq. 6) | 17M | 2 | 60.2 | 58.0
aLSTM, recurrent policy (eq. 7) | 14M | 2 | 59.6 | 57.2
aLSTM, recurrent policy (eq. 7) | 17M | 2 | 58.7 | 56.5
aLSTM, recurrent policy (eq. 7) | 24M | 2 | 57.6 | 55.3

5.4 WikiText-2

WikiText-2 (WT2; Merity et al., 2017) is a corpus curated from Wikipedia articles with lighter processing than PTB. It is about twice as large, with three times as many unique tokens. We evaluate the aLSTM using the same settings as on PTB, and additionally test a version with a larger hidden state size to match the parameter count of current state-of-the-art models. Without tuning for WT2, both outperform previously published results in 150 epochs (table 3) and converge to new state-of-the-art performance in 190 epochs. In contrast, the AWD-LSTM requires 700 epochs to reach optimal performance. As such, the aLSTM trains ~40% faster in wall-clock time. The TG-SC LSTM in Melis et al. (2018) uses fewer parameters, but its hyper-parameters are tuned for WT2, in contrast to both the AWD-LSTM and aLSTM. We expect that tuning hyper-parameters specifically for WT2 would yield further gains.

Table 3: Validation and test set perplexities on WikiText-2.

Model | Size | Depth | Valid | Test
LSTM, Grave et al. (2017) | — | — | — | 99.3
LSTM, Inan et al. (2017) | 22M | 3 | 91.5 | 87.7
AWD-LSTM, Merity et al. (2018) | 33M | 3 | 68.6 | 65.8
TG-SC LSTM, Melis et al. (2018) | 24M | 2 | 69.1 | 65.9
aLSTM, recurrent policy (eq. 7) | 27M | 2 | 68.1 | 65.5
aLSTM, recurrent policy (eq. 7) | 32M | 2 | 67.5 | 64.5

5.5 Ablation Study

We isolate the effect of each component in the aLSTM through an ablation study on PTB. We adjust the hidden state to ensure every model has approximately 17M learnable parameters. 
We use the same hyper-parameters for all models except for (a) the standard LSTM (see above) and (b) the aLSTM under an output-adaptation policy and a feed-forward adaptation model, as this configuration needed slightly lower dropout rates to converge to good performance.

As table 4 shows, any form of adaptation yields a significant performance gain. Going from a feed-forward adaptation model (eq. 6) to a recurrent adaptation model (eq. 7) yields a significant improvement irrespective of policy, and our hybrid RHN-LSTM (eq. 9) provides a further boost. Similarly, moving from a partial adaptation policy to IO-adaptation leads to a significant performance improvement under any adaptation model. These results indicate that the LSTM is constrained by the activation function bottleneck and that increasing its adaptive capacity breaks the bottleneck.

Table 4: Ablation study: perplexities on Penn Treebank. †Equivalent to the HyperNetwork, except the aLSTM uses one projection from z to π instead of nesting two (Ha et al., 2017).

Model | Adaptation model | Adaptation policy | Valid | Test
LSTM | — | — | 71.7 | 68.9
aLSTM | feed-forward | output-adaptation | 66.0 | 63.1
aLSTM† | LSTM | output-adaptation | 59.9 | 58.2
aLSTM | LSTM-RHN | output-adaptation | 59.7 | 57.3
aLSTM | feed-forward | IO-adaptation | 61.6 | 59.1
aLSTM | LSTM | IO-adaptation | 59.0 | 56.9
aLSTM | LSTM-RHN | IO-adaptation | 58.5 | 56.5

5.6 Robustness

We further study the robustness of the aLSTM with respect to hyper-parameters. We limit ourselves to dropout rates and train for 10 epochs on PTB. All other hyper-parameters are held fixed. For each model, we draw 100 random samples uniformly from intervals of the form [r − 0.1, r + 0.1], with r being the optimal rate found through previous hyper-parameter tuning. The two models exhibit very different distributions (fig. 4). 
The distribution of the aLSTM is tight, reflecting robustness with respect to hyper-parameters; no sampled model fails to converge. In contrast, approximately 25% of the population of LSTM configurations fail to converge, and fully 45% of the LSTM population fail to outperform the worst aLSTM configuration; the 90th percentile of the aLSTM distribution is on the same level as the 10th percentile of the LSTM distribution. On WT2 these results are amplified, with half of the LSTM population failing to converge and 80% of the LSTM population failing to outperform the worst-case aLSTM configuration.

Figure 4: Distribution of validation scores on WikiText-2 (top) and Penn Treebank (bottom) for randomly sampled hyper-parameters. The aLSTM (blue) is more robust than the LSTM (red).

6 Conclusions

By viewing deep neural networks as adaptive compositions of linear maps, we have shown that standard activation functions induce an activation function bottleneck because they fail to have a significant non-linear effect on a non-trivial subset of inputs. We break this bottleneck through adaptive parameterization, which allows the model to adapt the affine transformation to the input.

We have developed an adaptive feed-forward layer and showed empirically that it can learn patterns where a deep feed-forward network fails whilst also using fewer parameters. Extending the adaptive feed-forward layer to RNNs, we presented an adaptive LSTM that significantly increases model capacity and statistical efficiency while being more robust to hyper-parameters. In particular, we obtain new state-of-the-art results on the Penn Treebank and WikiText-2 word-modeling tasks, using ~20–30% fewer parameters and converging in less than half as many iterations.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments. 
This work was supported by the ESRC via the North West Doctoral Training Centre, grant number ES/J500094/1.

References

Al-Shedivat, Maruan, Bansal, Trapit, Burda, Yuri, Sutskever, Ilya, Mordatch, Igor, and Abbeel, Pieter. Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations, 2018.

Andrychowicz, Marcin, Denil, Misha, Gómez, Sergio, Hoffman, Matthew W, Pfau, David, Schaul, Tom, and de Freitas, Nando. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.

Bengio, Samy, Bengio, Yoshua, Cloutier, Jocelyn, and Gecsei, Jan. On the optimization of a synaptic learning rule. In Optimality in Biological and Artificial Networks, pp. 6–8, 1995.

Bengio, Yoshua, Bengio, Samy, and Cloutier, Jocelyn. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1991.

Bertinetto, Luca, Henriques, João F, Valmadre, Jack, Torr, Philip, and Vedaldi, Andrea. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pp. 523–531, 2016.

Brock, Andrew, Lim, Theo, Ritchie, J.M., and Weston, Nick. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.

Canziani, Alfredo, Paszke, Adam, and Culurciello, Eugenio. An analysis of deep neural network models for practical applications. arXiv preprint, arXiv:1605.07678, 2016.

Cho, Kyunghyun, van Merrienboer, Bart, Gülçehre, Çaglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
In Proceedings of Empirical Methods in Natural Language Processing, pp. 1724–1734, 2014.

Chung, Junyoung, Gülçehre, Çaglar, Cho, Kyunghyun, and Bengio, Yoshua. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint, arXiv:1412.3555, 2014.

Clevert, Djork-Arné, Unterthiner, Thomas, and Hochreiter, Sepp. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2015.

Cooijmans, Tim, Ballas, Nicolas, Laurent, César, and Courville, Aaron. Recurrent Batch Normalization. In International Conference on Learning Representations, 2016.

Cybenko, George. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989.

Dauphin, Yann N, Fan, Angela, Auli, Michael, and Grangier, David. Language Modeling with Gated Convolutional Networks. In International Conference on Machine Learning, 2017.

Denil, Misha, Shakibi, Babak, Dinh, Laurent, De Freitas, Nando, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.

Fernando, Chrisantha, Banarse, Dylan, Reynolds, Malcolm, Besse, Frederic, Pfau, David, Jaderberg, Max, Lanctot, Marc, and Wierstra, Daan. Convolution by Evolution: Differentiable Pattern Producing Networks. In GECCO, 2016.

Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning, 2017.

Frankle, Jonathan and Carbin, Michael. The lottery ticket hypothesis: Training pruned neural networks. arXiv preprint, arXiv:1803.03635, 2018.

Gers, Felix A, Schmidhuber, Jürgen, and Cummins, Fred. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

Gomez, Faustino and Schmidhuber, Jürgen. Evolving modular fast-weight networks for control.
In International Conference on Artificial Neural Networks, pp. 383–389. Springer, 2005.

Grave, Edouard, Joulin, Armand, and Usunier, Nicolas. Improving Neural Language Models with a Continuous Cache. In International Conference on Learning Representations, 2017.

Graves, Alex. Generating Sequences With Recurrent Neural Networks. arXiv preprint, arXiv:1308.0850, 2013.

Ha, David and Eck, Douglas. A neural representation of sketch drawings. In International Conference on Learning Representations, 2018.

Ha, David, Dai, Andrew, and Le, Quoc V. HyperNetworks. In International Conference on Learning Representations, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In International Conference on Computer Vision, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9:1735–80, 1997.

Hornik, Kurt. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Inan, Hakan, Khosravi, Khashayar, and Socher, Richard. Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations, 2017.

Jaderberg, Max, Czarnecki, Wojciech Marian, Osindero, Simon, Vinyals, Oriol, Graves, Alex, and Kavukcuoglu, Koray. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, 2017.

Kingma, Diederik P. and Ba, Jimmy. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.

Klambauer, Günter, Unterthiner, Thomas, Mayr, Andreas, and Hochreiter, Sepp. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 972–981, 2017.

Krause, Ben, Lu, Liang, Murray, Iain, and Renals, Steve.
Multiplicative LSTM for sequence modelling. arXiv preprint, arXiv:1609.07959, 2016.

Krause, Ben, Kahembwe, Emmanuel, Murray, Iain, and Renals, Steve. Dynamic Evaluation of Neural Sequence Models. arXiv preprint, arXiv:1709.07432, 2017.

LeCun, Yann, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–50. Springer, 1998.

Lee, Yoonho and Choi, Seungjin. Meta-Learning with Adaptive Layerwise Metric and Subspace. In International Conference on Machine Learning, 2017.

Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Melis, Gábor, Dyer, Chris, and Blunsom, Phil. On the State of the Art of Evaluation in Neural Language Models. In International Conference on Learning Representations, 2018.

Merity, Stephen, Xiong, Caiming, Bradbury, James, and Socher, Richard. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.

Merity, Stephen, Keskar, Nitish Shirish, and Socher, Richard. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018.

Mikolov, Tomáš. Statistical language models based on neural networks. PhD thesis, Brno University of Technology, 2012.

Mikolov, Tomas, Karafiat, Martin, Burget, Lukas, Cernocky, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. Interspeech, 2:3, 2010.

Mikolov, Tomáš, Sutskever, Ilya, Deoras, Anoop, Le, Hai-Son, Kombrink, Stefan, and Cernocky, Jan. Subword language modeling with neural networks. Preprint, 2012.

Novak, Roman, Bahri, Yasaman, Abolafia, Daniel A., Pennington, Jeffrey, and Sohl-Dickstein, Jascha. Sensitivity and generalization in neural networks: an empirical study.
In International Conference on Learning Representations, 2018.

Press, Ofir and Wolf, Lior. Using the output embedding to improve language models. In Proceedings of the European Chapter of the Association for Computational Linguistics, volume 2, pp. 157–163, 2017.

Radford, Alec, Jozefowicz, Rafal, and Sutskever, Ilya. Learning to Generate Reviews and Discovering Sentiment. arXiv preprint, arXiv:1704.01444, 2017.

Ravi, Sachin and Larochelle, Hugo. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint, arXiv:1312.6120, 2013.

Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

Stanley, Kenneth O., D'Ambrosio, David B., and Gauci, Jason. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.

Suarez, Joseph. Character-level language modeling with recurrent highway hypernetworks. In Advances in Neural Information Processing Systems, pp. 3269–3278, 2017.

Sutskever, Ilya, Martens, James, and Hinton, Geoffrey E. Generating text with recurrent neural networks. In International Conference on Machine Learning, pp. 1017–1024, 2011.

Wu, Yuhuai, Zhang, Saizheng, Zhang, Ying, Bengio, Yoshua, and Salakhutdinov, Ruslan. On Multiplicative Integration with Recurrent Neural Networks. In Advances in Neural Information Processing Systems, pp. 2864–2872, 2016.

Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, and Cohen, William W. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In International Conference on Learning Representations, 2018.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol.
Recurrent Neural Network Regularization. In International Conference on Learning Representations, 2015.

Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutnik, Jan, and Schmidhuber, Jurgen. Recurrent Highway Networks. arXiv preprint, arXiv:1607.03474, 2016.

Zoph, Barret and Le, Quoc V. Neural Architecture Search with Reinforcement Learning. In International Conference on Learning Representations, 2017.