{"title": "Language Modeling with Recurrent Highway Hypernetworks", "book": "Advances in Neural Information Processing Systems", "page_first": 3267, "page_last": 3276, "abstract": "We present extensive experimental and theoretical support for the efficacy of recurrent highway networks (RHNs) and recurrent hypernetworks complimentary to the original works. Where the original RHN work primarily provides theoretical treatment of the subject, we demonstrate experimentally that RHNs benefit from far better gradient flow than LSTMs in addition to their improved task accuracy. The original hypernetworks work presents detailed experimental results but leaves several theoretical issues unresolved--we consider these in depth and frame several feasible solutions that we believe will yield further gains in the future. We demonstrate that these approaches are complementary: by combining RHNs and hypernetworks, we make a significant improvement over current state-of-the-art character-level language modeling performance on Penn Treebank while relying on much simpler regularization. Finally, we argue for RHNs as a drop-in replacement for LSTMs (analogous to LSTMs for vanilla RNNs) and for hypernetworks as a de-facto augmentation (analogous to attention) for recurrent architectures.", "full_text": "Character-Level Language Modeling with Recurrent\n\nHighway Hypernetworks\n\nJoseph Suarez\n\nStanford University\n\njoseph15@stanford.edu\n\nAbstract\n\nWe present extensive experimental and theoretical support for the ef\ufb01cacy of re-\ncurrent highway networks (RHNs) and recurrent hypernetworks complimentary to\nthe original works. 
Where the original RHN work primarily provides theoretical\ntreatment of the subject, we demonstrate experimentally that RHNs bene\ufb01t from\nfar better gradient \ufb02ow than LSTMs in addition to their improved task accuracy.\nThe original hypernetworks work presents detailed experimental results but leaves\nseveral theoretical issues unresolved\u2013we consider these in depth and frame sev-\neral feasible solutions that we believe will yield further gains in the future. We\ndemonstrate that these approaches are complementary: by combining RHNs and\nhypernetworks, we make a signi\ufb01cant improvement over current state-of-the-art\ncharacter-level language modeling performance on Penn Treebank while relying on\nmuch simpler regularization. Finally, we argue for RHNs as a drop-in replacement\nfor LSTMs (analogous to LSTMs for vanilla RNNs) and for hypernetworks as a\nde-facto augmentation (analogous to attention) for recurrent architectures.\n\n1\n\nIntroduction and related works\n\nRecurrent architectures have seen much improvement since their inception in the 1990s, but they\nstill suffer signi\ufb01cantly from the problem of vanishing gradients [1]. Though many consider LSTMs\n[2] the de-facto solution to vanishing gradients, in practice, the problem is far from solved (see\nDiscussion). Several LSTM variants have been developed, most notably GRUs [3], which are simpler\nthan LSTM cells but bene\ufb01t from only marginally better gradient \ufb02ow. Greff et al. and Britz et al.\nconducted exhaustive (for all practical purposes) architecture searches over simple LSTM variants\nand demonstrated that none achieved signi\ufb01cant improvement [4] [5]\u2013in particular, the latter work\ndiscovered that LSTMs consistently outperform comparable GRUs on machine translation, and no\nproposed cell architecture to date has been proven signi\ufb01cantly better than the LSTM. 
This result necessitated novel approaches to the problem.\nOne approach is to upscale by simply stacking recurrent cells and increasing the number of hidden units. While there is certainly some optimal trade-off between depth and cell size, with enough data, simply upscaling both has yielded remarkable results in neural machine translation (NMT) [6].1 However, massive upscaling is impractical in all but the least hardware-constrained settings and fails to remedy fundamental architecture issues, such as poor gradient flow inherent in recurrent cells [8]. We later demonstrate that gradient issues persist in LSTMs (see Results) and that the grid-like architecture of stacked LSTMs is suboptimal.\n\n1For fair comparison, Google\u2019s NMT system does far more than upscaling and includes an explicit attentional mechanism [7]. We do not experiment with attention and/or residual schemes, but we expect the gains made by such techniques to stack with our work.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThe problem of gradient flow can be somewhat mitigated by the adaptation of Batch Normalization [9] to the recurrent case [10] [11]. While effective, it does not solve the problem entirely and also imposes significant overhead in memory and thus in performance, given the efficiency of parallelization over minibatches. This cost is often offset by a reduction in the total number of epochs required over the data, but recurrent architectures with better gradient flow could ideally provide comparable or better convergence without reliance upon explicit normalization.\nZilly et al. recently proposed recurrent highway networks (RHNs) and offered copious theoretical support for the architecture\u2019s improved gradient flow [12]. 
However, while the authors provided\nmathematical rigor, we believe that experimental con\ufb01rmation of the authors\u2019 claims could further\ndemonstrate the model\u2019s simplicity and widespread applicability. Furthermore, we \ufb01nd that the\ndiscussion of gradient \ufb02ow is more nuanced than presented in the original work (see Discussion).\nHa et al. recently questioned the weight-sharing paradigm common among recurrent architectures,\nproposing hypernetworks as a mechanism for allowing weight drift between timesteps [13]. This\nconsideration is highly desirable, given the successes of recent convolutional architectures on language\nmodeling tasks [14] [15], which were previously dominated by recurrent architectures.\nBoth RHNs and hypernetworks achieved state-of-the-art (SOTA) on multiple natural language\nprocessing (NLP) tasks at the time. As these approaches address unrelated architectural issues, it\nshould not be surprising that combining them yields SOTA on Penn Treebank [16] (PTB), improving\nsigni\ufb01cantly over either model evaluated individually. 
We consider both RHNs and hypernetworks to be largely overlooked in recent literature on account of apparent rather than extant complexity. Furthermore, the original RHN work lacks sufficient experimental demonstration of improved gradient flow; the original hypernetworks work lacks theoretical generalization of their weight-drift scheme. We present experimental results for RHNs complementary to the original work\u2019s theoretical results and theoretical results for hypernetworks complementary to the original work\u2019s experimental results. Founded on these results, our most important contribution is a strong argument for the utility of RHNs and hypernetworks, both individually and jointly, in constructing improved recurrent architectures.\n\n2 Model architecture\n\n2.1 Recurrent highway networks\n\nWe make a few notational simplifications to the original RHN equations that will later facilitate extensibility. We find it clearest and most succinct to be programmatic in our notation 2. First, consider the GRU:\n\n[h, t] = xiU + si\u22121W\nh, t = \u03c3(h), \u03c3(t)\nr = tanh(xi \u02c6U + (si\u22121 \u25e6 h) \u02c6W )\nsi = (1 \u2212 t) \u25e6 r + t \u25e6 si\u22121\n\n(1)\n\nwhere x \u2208 Rd, h, t, r, si \u2208 Rn, and U, \u02c6U \u2208 Rd\u00d72n, W, \u02c6W \u2208 Rn\u00d72n are weight matrices where d, n are the input and hidden dimensions. \u03c3 is the sigmoid nonlinearity, and \u25e6 is the Hadamard (elementwise) product. A one-layer RHN cell is a simplified GRU variant:\n\n[h, t] = xiU + si\u22121W\nh, t = tanh(h), \u03c3(t)\nsi = (1 \u2212 t) \u25e6 si\u22121 + t \u25e6 h\n\n(2)\n\nwhere the definitions from above hold. 
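For concreteness, the one-layer RHN cell defined above can be sketched in NumPy as follows (an illustrative sketch only: the toy dimensions, random untrained weights, and function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rhn_cell(x, s_prev, U, W):
    """One-layer RHN cell: a simplified GRU variant."""
    # Joint pre-activation, then split into candidate h and gate t.
    h, t = np.split(x @ U + s_prev @ W, 2, axis=-1)
    h, t = np.tanh(h), sigmoid(t)
    # Highway update: carry (1 - t) of the old state, write t of the candidate.
    return (1.0 - t) * s_prev + t * h

rng = np.random.default_rng(0)
d, n = 8, 16                                  # toy input/hidden sizes
U = rng.normal(size=(d, 2 * n))
W = rng.normal(size=(n, 2 * n))
s = np.zeros(n)
for _ in range(5):                            # unroll over a toy sequence
    s = rhn_cell(rng.normal(size=d), s, U, W)
print(s.shape)  # (16,)
```

Because each update is a convex combination gated by t, the state remains bounded in (\u22121, 1) from a zero initialization.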
The RHN is extended to arbitrary depth by simply stacking this cell with new hidden weight matrices, with the caveat that xiU is omitted except at the input layer:\n\nRHN Cell(xi, si\u22121, l) :\n\n[h, t] = 1[l = 0] xiU + si\u22121W\nh, t = tanh(h), \u03c3(t)\nc, t = 1 \u2212 t, dropout(t)\nsi = c \u25e6 si\u22121 + t \u25e6 h\n\n(3)\n\nwhere l is the layer index, which is used as an indicator. We can introduce recurrent dropout [17] on t across all layers with a single hyperparameter. We later demonstrate strong results without the need for more complex regularization or layer normalization. Finally, unlike stacked LSTMs, RHNs are structurally linear. That is, a depth L RHN applied to a sequence of length M can be unrolled to a simple depth M L network. We restate this fact from the original work only because it is important to our analysis, which we defer to Results and Discussion.\n\n2Note that for purpose of clean alignment, equations are presented top to bottom, then left to right.\n\n\f2.2 Hypernetworks\n\nWe slightly alter the original notation of recurrent hypernetworks for ease of combination with RHNs. We define a hypervector z as a linear upscaling projection applied to the outputs of a small recurrent network:\n\nz(a) = Wpa\n\n(4)\n\nwhere a \u2208 Rh is the activation vector output by an arbitrary recurrent architecture, Wp \u2208 Rn\u00d7h is an upscaling projection from dimension h to n, and h \u226a n. The hypervector is then used to scale the weights of the main recurrent network by:\n\n\u02dcW (z(a)) = z(a) \u25e6 W\n\n(5)\n\nwhere we overload \u25e6 as the element-wise product across columns. That is, each element of z scales one column (or row, depending on notation) of W . 
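The hypervector and scaling defined in (4) and (5) are cheap to verify numerically. The following NumPy sketch (toy sizes and names are ours) also checks the identity that makes the update practical: scaling the columns of W by z is equivalent to an elementwise multiply after the matrix-vector product, so the scaled matrix never needs to be materialized per example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 16, 4                      # main and hypernetwork sizes (toy values)
a = rng.normal(size=h)            # activation of the small hypernetwork
Wp = rng.normal(size=(n, h))      # upscaling projection
z = Wp @ a                        # hypervector, eq. (4)

W = rng.normal(size=(n, n))       # one main-network weight matrix
x = rng.normal(size=n)

# Eq. (5): each element of z scales one column of W (broadcast over rows).
W_scaled = W * z

# Equivalent cheap form: elementwise multiply after the matrix-vector product.
assert np.allclose(x @ W_scaled, (x @ W) * z)
```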
As this constitutes a direct modification of the weights, hypernetworks have the interpretation of relaxing the weight sharing constraint implicit in RNNs.\n\n2.3 Recurrent highway hypernetworks\n\nWe adapt hypernetworks to RHNs by directly modifying the RHN cell using (5):\n\nRHN CellHyper(xi, si\u22121, l, z) :\n\n[h, t] = 1[l = 0] xi\u02dcU (z) + si\u22121\u02dcW (z)\nh, t = tanh(h), \u03c3(t)\nc, t = 1 \u2212 t, dropout(t)\nsi = c \u25e6 si\u22121 + t \u25e6 h\n\n(6)\n\nIf RHN Cell and RHN CellHyper had the same state sizes, we could simply stack them. However, as the hypernetwork is much smaller than the main network by design, we instead must upscale between the networks. Our final architecture at each timestep for layer l can thus be written:\n\nsh = RHN Cell(sh, l)\nz = [Mplsh, Mplsh]\nsn = RHN CellHyper(sn, l, z)\n\n(7)\n\nwhere Mpl \u2208 Rh\u00d7n is the upscale projection matrix for layer l and z is the concatenation of Mplsh with itself. Notice the simplicity of this extension\u2013it is at least as straightforward to extend RHNs as GRUs and LSTMs. Again, we use only simple recurrent dropout for regularization.\nA few notes, for clarity and ease of reproduction: as the internal weight matrices of the main network have different dimensionality (Ul \u2208 Rd\u00d72n, Wl \u2208 Rn\u00d72n), we require the concatenation operation to form z in (7). We find this works much better than simply using larger projection matrices. Also, sh, sn in (7) are the hypernetwork and main network states, respectively. This may seem backwards from the notation above, but note that the hypernetwork is a standard, unmodified RHN Cell; its outputs are then used in the main network, which is the modified RHN CellHyper.\n\n3 Results (experimental)\n\n3.1 Penn Treebank\n\nPenn Treebank (PTB) contains approximately 5.10M/0.40M/0.45M characters in the train/val/test sets respectively and has a small vocabulary of 50 characters. 
There has recently been some controversy surrounding results on PTB: Jozefowicz et al. went as far as to say that performance on such small datasets is dominated by regularization [18]. Radford et al. chose to evaluate language modeling performance only upon the (38GB) Amazon Product Review dataset for this reason [19].\nPerformance on large, realistic datasets is inarguably a better metric of architecture quality than performance on smaller datasets such as PTB. However, such metrics make comparison among models nearly impossible: performance on large datasets is non-standard because evaluation at this scale is infeasible in many research settings simply because of limited hardware access. While most models can be trained on 1-4 GPUs within a few weeks, this statement is misleading, as significantly more hardware is required for efficient development and hyperparameter search. We therefore emphasize the importance of small datasets for standardized comparison among models. Hutter is a medium-sized task (approximately 20 times larger than PTB) that should be feasible in most settings (e.g. the original RHN and Hypernetwork works). We are only reasonably able to evaluate on PTB due to a strict hardware limitation of two personally owned GPUs.\n\n\fTable 1: Comparison of bits per character (BPC) test errors on PTB. We achieve SOTA without layer normalization, improving over vanilla hypernetworks, which require layer normalization\n\nModel | Test | Val | Params (M)\nLSTM | 1.35 | 1.31 | 4.3\n2-Layer LSTM | 1.28 | 1.31 | 12.2\n2-Layer LSTM (1125 hidden, ours) | 1.29 | \u2013 | 15.6\nHyperLSTM | 1.30 | 1.26 | 4.9\nLayer Norm LSTM | 1.30 | 1.27 | 4.3\nLayer Norm HyperLSTM | 1.25 | 1.28 | 4.9\nLayer Norm HyperLSTM (large embed) | 1.26 | 1.23 | 5.1\n2-Layer Norm HyperLSTM, 1000 units | 1.24 | 1.22 | 14.4\nRecurrent Highway Network (ours) | 1.24 | \u2013 | 14.0\nHyperRHN (ours) | 1.19 | 1.21 | 15.5\n\n
We therefore take additional precautions to ensure fair comparison:\nFirst, we address the critiques of Jozefowicz et al. by avoiding complex regularization. We use only simple recurrent dropout with a uniform probability across layers. Second, we minimally tune hyperparameters as discussed below. Finally, we are careful with the validation data and run the test set only once on our best model. We believe these precautions prevent overfitting the domain and corroborate the integrity of our result. Furthermore, SOTA performance with suboptimal hyperparameters demonstrates the robustness of our model.\n\n3.2 Architecture and training details\n\nIn addition to our HyperRHN, we consider our implementations of a 2-Layer LSTM and a plain RHN below. All models, including hypernetworks and their strong baselines, are compared in Table 1. Other known published results are included in the original hypernetworks work, but have test bpc \u2265 1.27. We train all models using Adam [20] with the default learning rate 0.001 and sequence length 100, batch size 256 (the largest that fits in memory for our main model) on a single GTX 1080 Ti until overfitting becomes obvious. We evaluate test performance only once and only on our main model, using the validation set for early stopping.\nOur data batcher loads the dataset into main memory as a single contiguous block and reshapes it to column size 100. We do not zero pad for efficiency and no distinction is made between sentences for simplicity. Data is embedded into a 27-dimensional vector. We do not cross-validate any hyperparameters except for dropout.\nWe first consider our implementation of a 2-Layer LSTM with hidden dimension 1125, which yields approximately as many learnable parameters as our main model. We train for 350 epochs with recurrent dropout keep probability 0.9. As expected, our model performs slightly better than the slightly smaller baseline in the original hypernetworks work. 
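The contiguous-block batching described above can be sketched as follows (a minimal illustration with invented sizes; the function name is ours and details differ from our actual implementation):

```python
import numpy as np

def make_batches(token_ids, seq_len=100, batch_size=256):
    """Reshape a contiguous token stream into (num_batches, batch_size, seq_len)."""
    n_seqs = len(token_ids) // seq_len        # drop the ragged tail; no padding
    data = np.asarray(token_ids[: n_seqs * seq_len]).reshape(n_seqs, seq_len)
    n_batches = n_seqs // batch_size
    return data[: n_batches * batch_size].reshape(n_batches, batch_size, seq_len)

# Toy corpus: 10,000 characters drawn from a 50-symbol vocabulary, as in PTB.
rng = np.random.default_rng(0)
ids = rng.integers(0, 50, size=10_000)
batches = make_batches(ids, seq_len=100, batch_size=16)
print(batches.shape)  # (6, 16, 100)
```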
We use this model in gradient flow comparisons (see Discussion).\nAs the original RHN work presents only word-level results for PTB, we trained an RHN baseline by simply disabling the Hypernetwork augmentation. Convergence was achieved in 618 epochs.\nOur model consists of a recurrent highway hypernetwork with 7 layers per cell. The main network has 1000 neurons per layer and the hypernetwork has 128 neurons per layer, for a total of approximately 15.2M parameters. Both subnetworks use a recurrent dropout keep probability of 0.65 and no other regularizer/normalizer. We attribute our model\u2019s ability to perform without layer normalization to the improved gradient flow of RHNs (see Discussion).\nThe model converges in 660 epochs, obtaining test perplexity 2.29 (where cross entropy corresponds to loge of perplexity), 1.19 bits per character (BPC, log2 of perplexity), and 74.6 percent accuracy. By epoch count, our model is comparable to a plain RHN but performs better. Training takes 2-3 days (fairly long for PTB) compared to 1-2 days for a plain RHN and a few hours for an LSTM. However, this comparison is unfair: all models require a similar number of floating point operations and differ primarily in backend implementation optimization.\n\n\fWe consider possible modifications to our model that take advantage of existing optimization in Results (theoretical), below.\nFinally, we note that reporting of accuracy is nonstandard. Accuracy is a standard metric in vision; we encourage its adoption in language modeling, as BPC is effectively a change of base applied to standard cross entropy and is exponential in scale. This downplays the significance of gains where the error ceiling is likely small. Accuracy is more immediately comparable to maximum task performance, which we estimate to be well below 80 percent given the recent trend of diminishing returns coupled with genuine ambiguity in the task. 
Human performance is roughly 55 percent, as measured by our own performance on the task.\n\n4 Results (theoretical)\n\nOur final model is a direct adaptation of the original hypervector scaling factor to RHNs. However, we did attempt a generalization of hypernetworks and encountered extreme memory considerations that have important implications for future work. Notice that the original hypernetwork scaling factor is equivalent to element-wise multiplication by a rank-1 matrix (e.g. the outer product of z with a ones vector, which does not include all rank-1 matrices). Ideally, we should be able to scale by any matrix at all. As mentioned by the authors, naively generating different scaling vectors for each column of the weight matrix is prohibitively expensive in both memory and computation time. We propose a low-rank (rank-d) update inspired by the thin singular value decomposition as follows:\n\n\u02dcW = W \u25e6 \u2211_{i=1}^{d} ui vi\u22a4\n\n(8)\n\nCompared to the original scaling update, our variation has memory and performance cost linear in the rank of the update. As with the SVD, we would expect most of the information relevant to the weight drift scale to be contained in a relatively low-rank update. However, we were unable to verify this hypothesis due to a current framework limitation. All deep learning platforms currently assemble computation graphs, and this low rank approximation is added as a node in the graph. This requires memory equal to the dimensionality of the scaling matrix per training example!\nThe original hypernetworks update is only feasible because of a small mathematical trick: row-wise scaling of the weight matrix is equal to elementwise multiplication after the matrix-vector multiply. Note that this is a practical rather than theoretical limitation. 
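Under (8), the elementwise scaling matrix is a sum of d outer products, so the hypernetwork only emits 2dn values rather than n\u00b2. A NumPy sketch of the construction (toy sizes and names are ours, and, per the framework limitation above, this is unverified at scale):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 3                  # weight dimension and update rank (toy sizes)
W = rng.normal(size=(n, n))
U = rng.normal(size=(d, n))   # rows u_i, emitted by the hypernetwork
V = rng.normal(size=(d, n))   # rows v_i, emitted by the hypernetwork

# Eq. (8): elementwise-scale W by the rank-d matrix sum_i u_i v_i^T.
S = U.T @ V                   # equals the sum of the d outer products
W_scaled = W * S

assert np.allclose(S, sum(np.outer(U[i], V[i]) for i in range(d)))
# Rank 1 with v = ones recovers the original row/column scaling scheme.
assert np.allclose(W * np.outer(U[0], np.ones(n)), W * U[0][:, None])
```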
As variations in the weights of the hypernetwork arise only as a function of variations in ui, vi, W , it is possible to define a custom gradient operation that does not need to store the low rank scaling matrices at each time step for backpropagation.\nLastly, we note that hypernetworks are a new and largely unexplored area of research. Even without the above addition, hypernetworks have yielded large improvements on a diverse array of tasks while introducing a minimal number of additional parameters. The only reason we cannot currently recommend hypernetworks as a drop-in network augmentation for most tasks (compare to e.g. attention) is another framework limitation. Despite requiring far fewer floating point operations than the larger main network, adding a hypernetwork still incurs nearly a factor of two in training time. This is due to the extreme efficiency of parallelization over large matrix multiplies; the overhead is largely time spent copying data. We propose rolling the hypernetwork into the main network. This could be accomplished by simply increasing the hidden dimension by the desired hypernetwork dimension h. The first h elements of the activation can then be treated as the hypervector. Note that this may require experimentation with matrix blocking and/or weight masking schemes in order to avoid linear interactions between the hypernetwork and main network during matrix multiplication.\nThe issues and solutions above are left as thought experiments; we prioritize our limited computational resources towards experimental efforts on recurrent highway networks. The theoretical results above are included to simultaneously raise and assuage concerns surrounding generalization and efficiency of hypernetworks. 
We see additional development of hypernetworks as crucial to the continued success of our recurrent model in the same manner that attention is a necessary, de-facto network augmentation in machine translation (and we further expect the gains to stack). Our model\u2019s strong language modeling result using a single graphics card was facilitated by the small size of PTB, which allowed us to afford the 2X computational cost of recurrent hypernetworks. We present methods for optimizing the representational power and computational cost of hypernetworks; additional engineering will still be required in order to fully enable efficient training on large datasets.\n\n\fFigure 1: Visualization of hyper recurrent highway network training convergence\n\n5 Discussion (experimental)\n\n5.1 Training time\n\nWe visualize training progress in Fig. 1. Notice that validation perplexity seems to remain below training perplexity for nearly 500 epochs. While the validation and test sets in PTB appear slightly easier than the training set, the cause of this artifact is that the validation loss is masked by a minimum 50-character context whereas the training loss is not (we further increase minimum context to 95 after training and observe a small performance gain); the training loss therefore suffers from the first few impossible predictions at the start of each example. The validation data is properly overlapped such that performance is being evaluated over the entire set.\nIt may also seem surprising that the model takes over 600 epochs to converge, and that training progress appears incredibly slow towards the end. We make three observations: first, we did not experiment with different optimizers, annealing the learning rate, or even the fixed learning rate itself. Second, as the maximum task accuracy is unknown, it is likely that gains small on an absolute scale are large on a relative scale. 
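For scale, the unit conversions relating BPC, cross entropy, and perplexity (as defined in Results) can be checked directly; this is our own arithmetic, and last digits differ slightly from the rounded figures quoted in the text:

```python
import math

bpc = 1.19                            # our model's test bits per character
perplexity = 2 ** bpc                 # BPC = log2(perplexity)
cross_entropy = math.log(perplexity)  # nats: cross entropy = loge(perplexity)

print(round(perplexity, 2))           # 2.28 (reported as 2.29 from unrounded BPC)
print(round(cross_entropy, 3))        # 0.825
# A 0.01 BPC difference is under a 0.7 percent multiplicative change in perplexity:
print(round(2 ** 0.01, 4))            # 1.007
```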
We base this conjecture on the diminishing gains of recent work on an absolute scale: we find that the difference between 1.31 (1-layer LSTM) and 1.19 bpc (our model) is approximately 71.1-74.6 percent accuracy. For reference, our improvement over the original hypernetworks work is approximately 1.0 percent (this figure is obtained from interpolation on the BPC scale). Third and finally, regardless of whether our second observation is true, our architecture exhibits similar convergence to an RHN and begins outperforming the 2-layer LSTM baseline before the latter converges.\n\n5.2 Overview of visualizations\n\nOur motivation in the visualizations that follow is to compare desirable and undesirable properties of our RHN-based model and standard recurrent models, namely stacked LSTMs. There are two natural gradient visualizations: within-cell gradients, which are averaged over time but not over all of the weight layers within the recurrent cell, and outside-cell gradients, which are averaged over internal weight layers but not over time. Time-averaged gradients are less useful to our discussion than the norms of raw weight layers; we therefore present these along with outside-cell gradient visualizations.\n\n5.3 Cell visualizations\n\nWe visualize the 2-norms of the learned weight layers of our RHN-based model in Fig. 2 and of an LSTM baseline (2 layers, 1150 hidden units, recurrent dropout keep p=0.90, 15.6M parameters) in Fig. 3.\nNotice that in the middle six layers (the first/last layers have different dimensionality and are incomparable) of the RHN block (Fig. 2), weight magnitude decreases with increasing layer depth. We view this as evidence for the iterative-refinement view of deep learning, as smaller updates are applied in deeper layers. This is the first evidence of this paradigm that we are aware of in the recurrent case, as similar statistics in stacked LSTMs are less conclusive because of horizontal grid connections. This also explains why performance gains diminish as RHN depth increases, as was noted in the original work.\n\n\fFigure 2: L2 norms of learned weights in our recurrent highway hypernetwork model. Increasing depth is shown from left to right in each block of layers. As dimensionality differs between blocks, the middle layers of each block are incomparable to the first/last layers, hence the disparity in norm.\n\nFigure 3: L2 norms of learned weights in our 2-layer LSTM baseline, with layer 1 left of layer 2.\n\n5.4 Gradient visualizations over time\n\nWe consider the mean L2-norms of the gradients of the activations with respect to the loss at the final timestep. But first, an important digression: when should we visualize gradient flow: at initialization, during training, or after convergence? To our knowledge, this matter has not yet received direct treatment. Fig. 4 is computed at initialization and seems to suggest that RHNs are far inferior to LSTMs in the multilayer case, as the network cannot possibly learn in the presence of extreme vanishing gradients. This line of reasoning lacks the required nuance, which we discuss below.\n\n6 Discussion (theoretical)\n\nWe address the seemingly inconsistent experimental results surrounding gradient flow in RHNs.\nFirst, we note that the LSTM/RHN comparison is unfair: multilayer LSTM/GRU cells are laid out in a grid. The length of the gradient path is equal to the sum of the sequence length and the number of layers (minus one); in an RHN, it is equal to the product. In the fair one-layer case, we found that the RHN actually possesses far greater initial gradient flow. Second, these intuitions regarding vanishing gradients at initialization are incorrect. As shown in Fig. 5, gradient flow improves dramatically after training for just one epoch. By convergence, as shown in Fig. 
6, results shift in favor of RHNs, confirming experimentally the theoretical gradient flow benefits of RHNs over LSTMs.\n\n\fFigure 4: Layer-averaged gradient comparison between our model and an LSTM baseline. Gradients are computed at initialization at the input layer of each timestep with respect to the final timestep\u2019s loss. Weights are initialized orthogonally.\n\nFigure 5: Identical to Fig. 4, but gradients are computed from models trained for one epoch.\n\nFigure 6: Identical to Fig. 4, but gradients are computed after convergence.\n\nThird, we address a potential objection. One might argue that while the gradient curves of our RHN-based model and the LSTM baseline are similar in shape, the magnitude difference is misleading. For example, if LSTMs naturally had a million times smaller weights, then the factor-of-a-hundred magnitude difference in Fig. 6 would actually demonstrate superiority of the LSTM. This is the reason for our consideration of weight norms in Fig. 2-3, which show that LSTM weights are only about 100 times smaller. Thus the gradient curves in Fig. 6 are effectively comparable in magnitude. However, RHNs maintain gradient flow equal to that of stacked LSTMs while having far greater gradient path length, thus the initial comparison is unfair. We believe that this is the basis for the RHN\u2019s performance increase over the LSTM: RHNs allow much greater effective network depth without incurring additional gradient vanishing.\nFourth, we experimented with adding the corresponding horizontal grid connections to our RHN, obtaining significantly better gradient flow. With the same parameter budget as our HyperRHN model, this variant obtains 1.40 bpc\u2013far inferior to our HyperRHN, though it could likely be optimized somewhat. It appears that long gradient paths are precisely the advantage in RHNs. 
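The path-length argument above can be made concrete with a line of arithmetic (ours), using our training configuration of sequence length 100 and RHN depth 7:

```python
def grid_path(seq_len, depth):
    """Longest input-to-loss gradient path in a stacked (grid) recurrent net."""
    return seq_len + depth - 1

def rhn_path(seq_len, depth):
    """Longest path in an RHN, which unrolls to a depth seq_len * depth network."""
    return seq_len * depth

M, L = 100, 7
print(grid_path(M, L))  # 106
print(rhn_path(M, L))   # 700
```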
We therefore suggest that gradient flow specifically along the deepest gradient path is an important consideration in architecture design: it provides an upper limit on effective network depth. It appears that this greater effective depth is precisely the source of the RHN\u2019s improved modeling potential.\n\n7 Conclusion\n\nWe present a cohesive set of contributions to recurrent architectures. First, we provide strong experimental evidence for RHNs as a simple drop-in replacement for stacked LSTMs and a detailed discussion of several engineering optimizations that could further improve performance. Second, we visualize\n\n\fand discuss the problem of vanishing gradients in recurrent architectures, revealing that gradient flow significantly shifts during training, which can lead to misleading comparisons among models. This demonstrates that gradient flow should be evaluated at or near convergence; using this metric, we confirm that RHNs benefit from far greater effective depth than stacked LSTMs while maintaining equal gradient flow. Third, we suggest multiple expansions upon hypernetworks for future work that have the potential to significantly improve efficiency and generalize the weight-drift paradigm. This could lead to further improvement upon our architecture and, we hope, facilitate general adoption of hypernetworks as a network augmentation. Finally, we demonstrate effectiveness by presenting and open-sourcing (code 3) a combined architecture that obtains SOTA on PTB with minimal regularization and tuning, which normally compromise results on small datasets.\n\nAcknowledgments\n\nSpecial thanks to Ziang Xie, Jeremy Irvin, Dillon Laird, and Hao Sheng for helpful commentary and suggestions during the revision process.\n\nReferences\n[1] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and J\u00fcrgen Schmidhuber. 
Gradient \ufb02ow in\n\nrecurrent nets: the dif\ufb01culty of learning long-term dependencies, 2001.\n\n[2] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):\n\n1735\u20131780, 1997.\n\n[3] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation\nof gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555,\n2014.\n\n[4] Klaus Greff, Rupesh K Srivastava, Jan Koutn\u00edk, Bas R Steunebrink, and J\u00fcrgen Schmidhuber.\nLSTM: A search space odyssey. IEEE transactions on neural networks and learning systems,\n2016.\n\n[5] Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. Massive exploration of neural machine\n\ntranslation architectures. arXiv preprint arXiv:1703.03906, 2017.\n\n[6] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang\nMacherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google\u2019s neural machine\ntranslation system: Bridging the gap between human and machine translation. arXiv preprint\narXiv:1609.08144, 2016.\n\n[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\n\nlearning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[8] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and J\u00fcrgen Schmidhuber. Gradient \ufb02ow in\n\nrecurrent nets: the dif\ufb01culty of learning long-term dependencies, 2001.\n\n[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\n\nby reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[10] Tim Cooijmans, Nicolas Ballas, C\u00e9sar Laurent, \u00c7a\u02d8glar G\u00fcl\u00e7ehre, and Aaron Courville. Recur-\n\nrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.\n\n[11] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. 
arXiv preprint arXiv:1607.06450, 2016.\n\n[12] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutn\u00edk, and J\u00fcrgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.\n\n[13] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.\n\n3github.com/jsuarez5341/Recurrent-Highway-Hypernetworks-NIPS\n\n\f[14] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.\n\n[15] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.\n\n[16] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330, 1993.\n\n[17] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.\n\n[18] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.\n\n[19] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.\n\n[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n\f", "award": [], "sourceid": 1861, "authors": [{"given_name": "Joseph", "family_name": "Suarez", "institution": "Stanford University"}]}