{"title": "Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 11, "abstract": "Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the Tensorized LSTM in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.", "full_text": "Wider and Deeper, Cheaper and Faster:\nTensorized LSTMs for Sequence Learning\n\nZhen He1,2, Shaobing Gao3, Liang Xiao2, Daxue Liu2, Hangen He2, and David Barber\n\n\u22171,4\n\n1University College London, 2National University of Defense Technology, 3Sichuan University,\n\n4Alan Turing Institute\n\nAbstract\n\nLong Short-Term Memory (LSTM) is a popular approach to boosting the ability\nof Recurrent Neural Networks to store longer term temporal information. The\ncapacity of an LSTM network can be increased by widening and adding layers.\nHowever, usually the former introduces additional parameters, while the latter\nincreases the runtime. As an alternative we propose the Tensorized LSTM in\nwhich the hidden states are represented by tensors and updated via a cross-layer\nconvolution. 
By increasing the tensor size, the network can be widened efficiently\nwithout additional parameters since the parameters are shared across different\nlocations in the tensor; by delaying the output, the network can be deepened\nimplicitly with little additional runtime since deep computations for each timestep\nare merged into temporal computations of the sequence. Experiments conducted on\nfive challenging sequence learning tasks show the potential of the proposed model.\n\n1 Introduction\n\nWe consider the time-series prediction task of producing a desired output $y_t$ at each timestep\n$t \\in \\{1, \\ldots, T\\}$ given an observed input sequence $x_{1:t} = \\{x_1, x_2, \\cdots, x_t\\}$, where $x_t \\in \\mathbb{R}^R$ and\n$y_t \\in \\mathbb{R}^S$ are vectors (see footnote 1). The Recurrent Neural Network (RNN) [17, 43] is a powerful model that learns\nhow to use a hidden state vector $h_t \\in \\mathbb{R}^M$ to encapsulate the relevant features of the entire input\nhistory $x_{1:t}$ up to timestep $t$. Let $h^{\\mathrm{cat}}_{t-1} \\in \\mathbb{R}^{R+M}$ be the concatenation of the current input $x_t$ and the\nprevious hidden state $h_{t-1}$:\n\n$$h^{\\mathrm{cat}}_{t-1} = [x_t, h_{t-1}] \\tag{1}$$\n\nThe update of the hidden state $h_t$ is defined as:\n\n$$a_t = h^{\\mathrm{cat}}_{t-1} W^h + b^h \\tag{2}$$\n$$h_t = \\phi(a_t) \\tag{3}$$\n\nwhere $W^h \\in \\mathbb{R}^{(R+M) \\times M}$ is the weight, $b^h \\in \\mathbb{R}^M$ the bias, $a_t \\in \\mathbb{R}^M$ the hidden activation, and $\\phi(\\cdot)$\nthe element-wise tanh function. Finally, the output $y_t$ at timestep $t$ is generated by:\n\n$$y_t = \\varphi(h_t W^y + b^y) \\tag{4}$$\n\nwhere $W^y \\in \\mathbb{R}^{M \\times S}$ and $b^y \\in \\mathbb{R}^S$, and $\\varphi(\\cdot)$ can be any differentiable function, depending on the task.\nHowever, this vanilla RNN has difficulties in modeling long-range dependencies due to the\nvanishing/exploding gradient problem [4]. 
Long Short-Term Memories (LSTMs) [19, 24] alleviate\nthese problems by employing memory cells to preserve information for longer, and adopting gating\nmechanisms to modulate the information flow. Given the success of the LSTM in sequence modeling,\nit is natural to consider how to increase the complexity of the model and thereby increase the set of\ntasks for which the LSTM can be profitably applied.\n\n\u2217 Corresponding authors: Shaobing Gao <gaoshaobing@scu.edu.cn> and Zhen He <hezhen.cs@gmail.com>.\nFootnote 1: Vectors are assumed to be in row form throughout this paper.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWe consider the capacity of a network to consist of two components: the width (the amount of\ninformation handled in parallel) and the depth (the number of computation steps) [5]. A naive way\nto widen the LSTM is to increase the number of units in a hidden layer; however, the parameter\nnumber scales quadratically with the number of units. To deepen the LSTM, the popular Stacked\nLSTM (sLSTM) stacks multiple LSTM layers [20]; however, runtime is proportional to the number\nof layers and information from the input is potentially lost (due to gradient vanishing/explosion) as it\npropagates vertically through the layers.\nIn this paper, we introduce a way to both widen and deepen the LSTM whilst keeping the parameter\nnumber and runtime largely unchanged. 
In summary, we make the following contributions:\n(a) We tensorize RNN hidden state vectors into higher-dimensional tensors which allow more flexible\nparameter sharing and can be widened more efficiently without additional parameters.\n(b) Based on (a), we merge RNN deep computations into its temporal computations so that the\nnetwork can be deepened with little additional runtime, resulting in a Tensorized RNN (tRNN).\n(c) We extend the tRNN to an LSTM, namely the Tensorized LSTM (tLSTM), which integrates a\nnovel memory cell convolution to help to prevent the vanishing/exploding gradients.\n\n2 Method\n\n2.1 Tensorizing Hidden States\n\nIt can be seen from (2) that in an RNN, the parameter number scales quadratically with the size of the\nhidden state. A popular way to limit the parameter number when widening the network is to organize\nparameters as higher-dimensional tensors which can be factorized into lower-rank sub-tensors that\ncontain significantly fewer elements [6, 15, 18, 26, 32, 39, 46, 47, 51], which is known as tensor\nfactorization. This implicitly widens the network since the hidden state vectors are in fact broadcast to\ninteract with the tensorized parameters. Another common way to reduce the parameter number is to\nshare a small set of parameters across different locations in the hidden state, similar to Convolutional\nNeural Networks (CNNs) [34, 35].\nWe adopt parameter sharing to cut down the parameter number for RNNs, since compared with\nfactorization, it has the following advantages: (i) scalability, i.e., the number of shared parameters\ncan be set independently of the hidden state size, and (ii) separability, i.e., the information flow can be\ncarefully managed by controlling the receptive field, allowing one to shift RNN deep computations to\nthe temporal domain (see Sec. 2.2). 
We also explicitly tensorize the RNN hidden state vectors, since\ncompared with vectors, tensors offer better (i) flexibility, i.e., one can specify which dimensions\nto share parameters and then can just increase the size of those dimensions without introducing\nadditional parameters, and (ii) efficiency, i.e., with higher-dimensional tensors, the network can be\nwidened faster w.r.t. its depth when fixing the parameter number (see Sec. 2.3).\nFor ease of exposition, we first consider 2D tensors (matrices): we tensorize the hidden state $h_t \\in \\mathbb{R}^M$\nto become $H_t \\in \\mathbb{R}^{P \\times M}$, where $P$ is the tensor size, and $M$ the channel size. We locally-connect the\nfirst dimension of $H_t$ in order to share parameters, and fully-connect the second dimension of $H_t$ to\nallow global interactions. This is analogous to the CNN which fully-connects one dimension (e.g.,\nthe RGB channel for input images) to globally fuse different feature planes. Also, if one compares\n$H_t$ to the hidden state of a Stacked RNN (sRNN) (see Fig. 1(a)), then $P$ is akin to the number of\nstacked hidden layers, and $M$ the size of each hidden layer. We start to describe our model based on\n2D tensors, and finally show how to strengthen the model with higher-dimensional tensors.\n\n2.2 Merging Deep Computations\n\nSince an RNN is already deep in its temporal direction, we can deepen an input-to-output computation\nby associating the input $x_t$ with a (delayed) future output. In doing this, we need to ensure that the\noutput $y_t$ is separable, i.e., not influenced by any future input $x_{t'}$ ($t' > t$). Thus, we concatenate\nthe projection of $x_t$ to the top of the previous hidden state $H_{t-1}$, then gradually shift the input\ninformation down when the temporal computation proceeds, and finally generate $y_t$ from the bottom\nof $H_{t+L-1}$, where $L-1$ is the number of delayed timesteps for computations of depth $L$. 
An example\nwith $L = 3$ is shown in Fig. 1(b). This is in fact a skewed sRNN as used in [1] (also similar to [48]).\nHowever, our method does not need to change the network structure and also allows different kinds\nof interactions as long as the output is separable, e.g., one can increase the local connections and use\nfeedback (see Fig. 1(c)), which can be beneficial for sRNNs [10]. In order to share parameters, we\nupdate $H_t$ using a convolution with a learnable kernel. In this manner we increase the complexity of\nthe input-to-output mapping (by delaying outputs) and limit parameter growth (by sharing transition\nparameters using convolutions).\n\nFigure 1: Examples of sRNN, tRNNs and tLSTMs. (a) A 3-layer sRNN. (b) A 2D tRNN without (\u2013)\nfeedback (F) connections, which can be thought of as a skewed version of (a). (c) A 2D tRNN. (d) A 2D\ntLSTM without (\u2013) memory (M) cell convolutions. (e) A 2D tLSTM. In each model, the blank circles\nin columns 1 to 4 denote the hidden state at timesteps $t-1$ to $t+2$, respectively, and the blue region\ndenotes the receptive field of the current output $y_t$. In (b)-(e), the outputs are delayed by $L-1 = 2$\ntimesteps, where $L = 3$ is the depth.\n\nTo describe the resulting tRNN model, let $H^{\\mathrm{cat}}_{t-1} \\in \\mathbb{R}^{(P+1) \\times M}$ be the concatenated hidden state, and\n$p \\in \\mathbb{Z}_+$ the location in a tensor. The channel vector $h^{\\mathrm{cat}}_{t-1,p} \\in \\mathbb{R}^M$ at location $p$ of $H^{\\mathrm{cat}}_{t-1}$ is defined as:\n\n$$h^{\\mathrm{cat}}_{t-1,p} = \\begin{cases} x_t W^x + b^x & \\text{if } p = 1 \\\\ h_{t-1,p-1} & \\text{if } p > 1 \\end{cases} \\tag{5}$$\n\nwhere $W^x \\in \\mathbb{R}^{R \\times M}$ and $b^x \\in \\mathbb{R}^M$. 
Then, the update of tensor $H_t$ is implemented via a convolution:\n\n$$A_t = H^{\\mathrm{cat}}_{t-1} \\circledast \\{W^h, b^h\\} \\tag{6}$$\n$$H_t = \\phi(A_t) \\tag{7}$$\n\nwhere $W^h \\in \\mathbb{R}^{K \\times M^i \\times M^o}$ is the kernel weight of size $K$, with $M^i = M$ input channels and $M^o = M$\noutput channels, $b^h \\in \\mathbb{R}^{M^o}$ is the kernel bias, $A_t \\in \\mathbb{R}^{P \\times M^o}$ is the hidden activation, and $\\circledast$ is the\nconvolution operator (see Appendix A.1 for a more detailed definition). Since the kernel convolves\nacross different hidden layers, we call it the cross-layer convolution. The kernel enables interaction,\nboth bottom-up and top-down across layers. Finally, we generate $y_t$ from the channel vector\n$h_{t+L-1,P} \\in \\mathbb{R}^M$ which is located at the bottom of $H_{t+L-1}$:\n\n$$y_t = \\varphi(h_{t+L-1,P} W^y + b^y) \\tag{8}$$\n\nwhere $W^y \\in \\mathbb{R}^{M \\times S}$ and $b^y \\in \\mathbb{R}^S$. To guarantee that the receptive field of $y_t$ only covers the current\nand previous inputs $x_{1:t}$ (see Fig. 1(c)), $L$, $P$, and $K$ should satisfy the constraint:\n\n$$L = \\left\\lceil \\frac{2P}{K - K \\bmod 2} \\right\\rceil \\tag{9}$$\n\nwhere $\\lceil \\cdot \\rceil$ is the ceil operation. For the derivation of (9), please see Appendix B.\nWe call the model defined in (5)-(8) the Tensorized RNN (tRNN). 
The model can be widened by\nincreasing the tensor size $P$, whilst the parameter number remains fixed (thanks to the convolution).\nAlso, unlike the sRNN of runtime complexity $O(TL)$, tRNN breaks down the runtime complexity to\n$O(T+L)$, which means that increasing either the sequence length $T$ or the network depth $L$ would not\nsignificantly increase the runtime.\n\n2.3 Extending to LSTMs\n\nTo allow the tRNN to capture long-range temporal dependencies, one can straightforwardly extend it\nto an LSTM by replacing the tRNN tensor update equations of (6)-(7) as follows:\n\n$$[A^g_t, A^i_t, A^f_t, A^o_t] = H^{\\mathrm{cat}}_{t-1} \\circledast \\{W^h, b^h\\} \\tag{10}$$\n$$[G_t, I_t, F_t, O_t] = [\\phi(A^g_t), \\sigma(A^i_t), \\sigma(A^f_t), \\sigma(A^o_t)] \\tag{11}$$\n$$C_t = G_t \\odot I_t + C_{t-1} \\odot F_t \\tag{12}$$\n$$H_t = \\phi(C_t) \\odot O_t \\tag{13}$$\n\nwhere the kernel $\\{W^h, b^h\\}$ is of size $K$, with $M^i = M$ input channels and $M^o = 4M$ output channels,\n$A^g_t, A^i_t, A^f_t, A^o_t \\in \\mathbb{R}^{P \\times M}$ are activations for the new content $G_t$, input gate $I_t$, forget gate $F_t$, and\noutput gate $O_t$, respectively, $\\sigma(\\cdot)$ is the element-wise sigmoid function, $\\odot$ is the element-wise\nmultiplication, and $C_t \\in \\mathbb{R}^{P \\times M}$ is the memory cell. However, since in (12) the previous memory cell\n$C_{t-1}$ is only gated along the temporal direction (see Fig. 1(d)), long-range dependencies from the\ninput to output might be lost when the tensor size $P$ becomes large.\nMemory Cell Convolution.\nTo capture long-range dependencies from multiple directions, we\nadditionally introduce a novel memory cell convolution, by which the memory cells can have a larger\nreceptive field (see Fig. 1(e)). We also dynamically generate this convolution kernel so that it is\nboth time- and location-dependent, allowing for flexible control over long-range dependencies from\ndifferent directions. 
This results in our tLSTM tensor update equations:\n\n$$[A^g_t, A^i_t, A^f_t, A^o_t, A^q_t] = H^{\\mathrm{cat}}_{t-1} \\circledast \\{W^h, b^h\\} \\tag{14}$$\n$$[G_t, I_t, F_t, O_t, Q_t] = [\\phi(A^g_t), \\sigma(A^i_t), \\sigma(A^f_t), \\sigma(A^o_t), \\varsigma(A^q_t)] \\tag{15}$$\n$$W^c_t(p) = \\mathrm{reshape}(q_{t,p}, [K, 1, 1]) \\tag{16}$$\n$$C^{\\mathrm{conv}}_{t-1} = C_{t-1} \\circledast W^c_t(p) \\tag{17}$$\n$$C_t = G_t \\odot I_t + C^{\\mathrm{conv}}_{t-1} \\odot F_t \\tag{18}$$\n$$H_t = \\phi(C_t) \\odot O_t \\tag{19}$$\n\nwhere, in contrast to (10)-(13), the kernel $\\{W^h, b^h\\}$ has additional\n$\\langle K \\rangle$ output channels (see footnote 2) to generate the activation $A^q_t \\in \\mathbb{R}^{P \\times \\langle K \\rangle}$ for\nthe dynamic kernel bank $Q_t \\in \\mathbb{R}^{P \\times \\langle K \\rangle}$, $q_{t,p} \\in \\mathbb{R}^{\\langle K \\rangle}$ is the vectorized\nadaptive kernel at the location $p$ of $Q_t$, and $W^c_t(p) \\in \\mathbb{R}^{K \\times 1 \\times 1}$ is\nthe dynamic kernel of size $K$ with a single input/output channel,\nwhich is reshaped from $q_{t,p}$ (see Fig. 2(a) for an illustration). In\n(17), each channel of the previous memory cell $C_{t-1}$ is convolved\nwith $W^c_t(p)$ whose values vary with $p$, forming a memory cell\nconvolution (see Appendix A.2 for a more detailed definition),\nwhich produces a convolved memory cell $C^{\\mathrm{conv}}_{t-1} \\in \\mathbb{R}^{P \\times M}$. Note\nthat in (15) we employ a softmax function $\\varsigma(\\cdot)$ to normalize the\nchannel dimension of $Q_t$, which, similar to [37], can stabilize the\nvalue of memory cells and help to prevent the vanishing/exploding\ngradients (see Appendix C for details).\nThe idea of dynamically generating network weights has been used\nin many works [6, 14, 15, 23, 44, 46], where in [14] location-dependent\nconvolutional kernels are also dynamically generated to improve CNNs. In contrast to\nthese works, we focus on broadening the receptive field of tLSTM memory cells. 
Whilst the flexibility\nis retained, fewer parameters are required to generate the kernel since the kernel is shared by different\nmemory cell channels.\n\nFigure 2: Illustration of generating the memory cell convolution kernel, where (a) is for 2D\ntensors and (b) for 3D tensors.\n\nFootnote 2: The operator $\\langle \\cdot \\rangle$ returns the cumulative product of all elements in the input variable.\n\nChannel Normalization.\nTo improve training, we adapt Layer Normalization (LN) [3] to our\ntLSTM. Similar to the observation in [3] that LN does not work well in CNNs where channel vectors\nat different locations have very different statistics, we find that LN is also unsuitable for tLSTM\nwhere lower level information is near the input while higher level information is near the output. We\ntherefore normalize the channel vectors at different locations with their own statistics, forming a\nChannel Normalization (CN), with its operator $\\mathrm{CN}(\\cdot)$:\n\n$$\\mathrm{CN}(Z; \\Gamma, B) = \\widehat{Z} \\odot \\Gamma + B \\tag{20}$$\n\nwhere $Z, \\widehat{Z}, \\Gamma, B \\in \\mathbb{R}^{P \\times M^z}$ are the original tensor, normalized tensor, gain parameter, and bias\nparameter, respectively. The $m^z$-th channel of $Z$, i.e. $z_{m^z} \\in \\mathbb{R}^P$, is normalized element-wise:\n\n$$\\widehat{z}_{m^z} = (z_{m^z} - z^{\\mu}) / z^{\\sigma} \\tag{21}$$\n\nwhere $z^{\\mu}, z^{\\sigma} \\in \\mathbb{R}^P$ are the mean and standard deviation along the channel dimension of $Z$, respectively,\nand $\\widehat{z}_{m^z} \\in \\mathbb{R}^P$ is the $m^z$-th channel of $\\widehat{Z}$. Note that the number of parameters introduced by\nCN/LN can be neglected as it is very small compared to the number of other parameters in the model.\nUsing Higher-Dimensional Tensors. One can observe from (9) that when fixing the kernel size\n$K$, the tensor size $P$ of a 2D tLSTM grows linearly w.r.t. its depth $L$. How can we expand the tensor\nvolume more rapidly so that the network can be widened more efficiently? We can achieve this goal\nby leveraging higher-dimensional tensors. 
Based on previous definitions for 2D tLSTMs, we replace\nthe 2D tensors with $D$-dimensional ($D > 2$) tensors, obtaining $H_t, C_t \\in \\mathbb{R}^{P_1 \\times P_2 \\times \\ldots \\times P_{D-1} \\times M}$ with the\ntensor size $\\mathbf{P} = [P_1, P_2, \\ldots, P_{D-1}]$. Since the hidden states are no longer matrices, we concatenate\nthe projection of $x_t$ to one corner of $H_{t-1}$, and thus (5) is extended as:\n\n$$h^{\\mathrm{cat}}_{t-1,\\mathbf{p}} = \\begin{cases} x_t W^x + b^x & \\text{if } p_d = 1 \\text{ for } d = 1, 2, \\ldots, D-1 \\\\ h_{t-1,\\mathbf{p}-\\mathbf{1}} & \\text{if } p_d > 1 \\text{ for } d = 1, 2, \\ldots, D-1 \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{22}$$\n\nwhere $h^{\\mathrm{cat}}_{t-1,\\mathbf{p}} \\in \\mathbb{R}^M$ is the channel vector at location $\\mathbf{p} \\in \\mathbb{Z}_+^{D-1}$ of the concatenated hidden state\n$H^{\\mathrm{cat}}_{t-1} \\in \\mathbb{R}^{(P_1+1) \\times (P_2+1) \\times \\ldots \\times (P_{D-1}+1) \\times M}$. For the tensor update, the convolution kernels $W^h$ and $W^c_t(\\cdot)$\nalso increase their dimensionality with kernel size $\\mathbf{K} = [K_1, K_2, \\ldots, K_{D-1}]$. Note that $W^c_t(\\cdot)$ is\nreshaped from the vector, as illustrated in Fig. 2(b). Correspondingly, we generate the output $y_t$ from\nthe opposite corner of $H_{t+L-1}$, and therefore (8) is modified as:\n\n$$y_t = \\varphi(h_{t+L-1,\\mathbf{P}} W^y + b^y) \\tag{23}$$\n\nFor convenience, we set $P_d = P$ and $K_d = K$ for $d = 1, 2, \\ldots, D-1$ so that all dimensions of $\\mathbf{P}$\nand $\\mathbf{K}$ can satisfy (9) with the same depth $L$. 
In addition, CN still normalizes the channel dimension\nof tensors.\n\n3 Experiments\n\nWe evaluate tLSTM on five challenging sequence learning tasks under different configurations:\n(a) sLSTM (baseline): our implementation of sLSTM [21] with parameters shared across all layers.\n(b) 2D tLSTM: the standard 2D tLSTM, as defined in (14)-(19).\n(c) 2D tLSTM\u2013M: removing (\u2013) memory (M) cell convolutions from (b), as defined in (10)-(13).\n(d) 2D tLSTM\u2013F: removing (\u2013) feedback (F) connections from (b).\n(e) 3D tLSTM: tensorizing (b) into 3D tLSTM.\n(f) 3D tLSTM+LN: applying (+) LN [3] to (e).\n(g) 3D tLSTM+CN: applying (+) CN to (e), as defined in (20).\nTo compare different configurations, we also use L to denote the number of layers of an sLSTM, and\nM to denote the hidden size of each sLSTM layer. We set the kernel size K to 2 for 2D tLSTM\u2013F\nand 3 for other tLSTMs, in which case we have L = P according to (9).\nFor each configuration, we fix the parameter number and increase the tensor size to see if the\nperformance of tLSTM can be boosted without increasing the parameter number. We also investigate\nhow the runtime is affected by the depth, where the runtime is measured by the average GPU\nmilliseconds spent by a forward and backward pass over one timestep of a single example. Next, we\ncompare tLSTM against the state-of-the-art methods to evaluate its ability. Finally, we visualize the\ninternal working mechanism of tLSTM. Please see Appendix D for training details.\n\n3.1 Wikipedia Language Modeling\n\nThe Hutter Prize Wikipedia dataset [25] consists of 100 million\ncharacters taken from 205 different characters including alphabets,\nXML markups and special symbols. 
We model the dataset\nat the character-level, and try to predict the next character of the\ninput sequence.\nWe fix the parameter number to 10M, corresponding to channel\nsizes M of 1120 for sLSTM and 2D tLSTM\u2013F, 901 for other\n2D tLSTMs, and 522 for 3D tLSTMs. All configurations are\nevaluated with depths L = 1, 2, 3, 4. We use Bits-per-character\n(BPC) to measure the model performance.\nResults are shown in Fig. 3. When L \u2264 2, sLSTM and 2D\ntLSTM\u2013F outperform other models because of a larger M. With\nL increasing, the performances of sLSTM and 2D tLSTM\u2013M\nimprove but become saturated when L \u2265 3, while tLSTMs with\nmemory cell convolutions improve with increasing L and finally\noutperform both sLSTM and 2D tLSTM\u2013M. When L = 4, 2D\ntLSTM\u2013F is surpassed by 2D tLSTM, which is in turn surpassed\nby 3D tLSTM. The performance of 3D tLSTM+LN benefits from\nLN only when L \u2264 2. However, 3D tLSTM+CN consistently\nimproves 3D tLSTM with different L.\nWhilst the runtime of sLSTM is almost proportional to L, it is nearly\nconstant in each tLSTM configuration\nand largely independent of L.\nWe compare a larger model, i.e. a\n3D tLSTM+CN with L = 6 and M = 1200,\nto the state-of-the-art methods\non the test set, as reported in Table 1.\nOur model achieves 1.264 BPC with\n50.1M parameters, and is competitive\nwith the best performing methods [38, 54]\nwith similar parameter numbers.\n\nFigure 3: Performance and runtime of different configurations on Wikipedia.\n\nTable 1: Test BPC on Wikipedia.\n\nModel | BPC | # Param.\nMI-LSTM [51] | 1.44 | \u224817M\nmLSTM [32] | 1.42 | \u224820M\nHyperLSTM+LN [23] | 1.34 | 26.5M\nHM-LSTM+LN [11] | 1.32 | \u224835M\nLarge RHN [54] | 1.27 | \u224846M\nLarge FS-LSTM-4 [38] | 1.245 | \u224847M\n2 \u00d7 Large FS-LSTM-4 [38] | 1.198 | \u224894M\n3D tLSTM+CN (L = 6, M = 1200) | 1.264 | 50.1M\n\n3.2 Algorithmic Tasks\n\n(a) Addition: The task is to sum\ntwo 15-digit integers. 
The network\nfirst reads two integers with one\ndigit per timestep, and then predicts\nthe summation. We follow the processing\nof [30], where a symbol\n\u2018-\u2019 is used to delimit the integers\nas well as pad the input/target sequence.\nA 3-digit integer addition\ntask is of the form:\n\nInput: - 1 2 3 - 9 0 0 - - - - -\nTarget: - - - - - - - - 1 0 2 3 -\n\n(b) Memorization: The goal of this\ntask is to memorize a sequence of\n20 random symbols. Similar to the\naddition task, we use 65 different\nsymbols. A 5-symbol memorization task is of the form:\n\nInput: - a b c c b - - - - - -\nTarget: - - - - - - a b c c b -\n\nFigure 4: Performance and runtime of different configurations\non the addition (left) and memorization (right) tasks.\n\nWe evaluate all configurations with L = 1, 4, 7, 10 on both tasks, where M is 400 for addition and\n100 for memorization. The performance is measured by the symbol prediction accuracy.\nFig. 4 shows the results. In both tasks, large L degrades the performances of sLSTM and 2D tLSTM\u2013M.\nIn contrast, the performance of 2D tLSTM\u2013F steadily improves with L increasing, and is further\nenhanced by using feedback connections, higher-dimensional tensors, and CN, while LN helps only\nwhen L = 1. Note that in both tasks, the correct solution can be found (when 100% test accuracy is\nachieved) due to the repetitive nature of the task. In our experiment, we also observe that for the\naddition task, 3D tLSTM+CN with L = 7 outperforms other configurations and finds the solution\nwith only 298K training samples, while for the memorization task, 3D tLSTM+CN with L = 10 beats\nother configurations and achieves perfect memorization after seeing 54K training samples. 
Also,\nunlike in sLSTM, the runtime of all tLSTMs is largely unaffected by L.\n\nWe further compare the best\nperforming configurations to\nthe state-of-the-art methods\nfor both tasks (see Table 2).\nOur models solve both tasks\nsignificantly faster (i.e., using\nfewer training samples) than\nother models, achieving the\nnew state-of-the-art results.\n\nTable 2: Test accuracies on two algorithmic tasks.\n\nModel | Addition Acc. | Addition # Samp. | Memorization Acc. | Memorization # Samp.\nStacked LSTM [21] | 51% | 5M | >50% | 900K\nGrid LSTM [30] | >99% | 550K | >99% | 150K\n3D tLSTM+CN (L = 7) | >99% | 298K | >99% | 115K\n3D tLSTM+CN (L = 10) | >99% | 317K | >99% | 54K\n\n3.3 MNIST Image Classification\n\nThe MNIST dataset [35] consists\nof 50000/10000/10000 handwritten\ndigit images of size 28\u00d728 for\ntraining/validation/test. We have two\ntasks on this dataset:\n(a) Sequential MNIST: The goal\nis to classify the digit after sequentially\nreading the pixels in a scanline\norder [33]. It is therefore a\n784 timestep sequence learning task\nwhere a single output is produced at\nthe last timestep; the task requires\nvery long range dependencies in the\nsequence.\n(b) Sequential Permuted MNIST:\nWe permute the original image pixels\nin a fixed random order as in\n[2], resulting in a permuted MNIST\n(pMNIST) problem that has even longer range dependencies across pixels and is harder.\nIn both tasks, all configurations are evaluated with M = 100 and L = 1, 3, 5. The model performance\nis measured by the classification accuracy.\nResults are shown in Fig. 5. sLSTM and 2D tLSTM\u2013M no longer benefit from the increased depth\nwhen L = 5. Both increasing the depth and tensorization boost the performance of 2D tLSTM.\nHowever, removing feedback connections from 2D tLSTM seems not to affect the performance. On\nthe other hand, CN enhances the 3D tLSTM and when L \u2265 3 it outperforms LN. 
3D tLSTM+CN\nwith L = 5 achieves the highest performances in both tasks, with a validation accuracy of 99.1% for\nMNIST and 95.6% for pMNIST. The runtime of tLSTMs is negligibly affected by L, and all tLSTMs\nbecome faster than sLSTM when L = 5.\n\nFigure 5: Performance and runtime of different configurations\non sequential MNIST (left) and sequential pMNIST (right).\n\nFigure 6: Visualization of the diagonal channel means of the tLSTM memory cells for each task. In\neach horizontal bar, the rows from top to bottom correspond to the diagonal locations from pin to\npout, the columns from left to right correspond to different timesteps (from 1 to T+L\u22121 for the full\nsequence, where L\u22121 is the time delay), and the values are normalized to be in range [0, 1] for better\nvisualization. Both full sequences in (d) and (e) are zoomed out horizontally.\n\nTable 3: Test accuracies (%) on sequential MNIST/pMNIST.\n\nModel | MNIST | pMNIST\niRNN [33] | 97.0 | 82.0\nLSTM [2] | 98.2 | 88.0\nuRNN [2] | 95.1 | 91.4\nFull-capacity uRNN [49] | 96.9 | 94.1\nsTANH [53] | 98.1 | 94.0\nBN-LSTM [13] | 99.0 | 95.4\nDilated GRU [8] | 99.2 | 94.6\nDilated CNN [40] in [8] | 98.3 | 96.7\n3D tLSTM+CN (L = 3) | 99.2 | 94.9\n3D tLSTM+CN (L = 5) | 99.0 | 95.7\n\nWe also compare the configurations\nof the highest test accuracies\nto the state-of-the-art methods (see\nTable 3). For sequential MNIST, our\n3D tLSTM+CN with L = 3 performs\nas well as the state-of-the-art Dilated\nGRU model [8], with a test accuracy\nof 99.2%. For the sequential\npMNIST, our 3D tLSTM+CN with\nL = 5 has a test accuracy of 95.7%,\nwhich is close to the state-of-the-art\nof 96.7% produced by the Dilated\nCNN [40] in [8].\n\n3.4 Analysis\n\nThe experimental results of different model configurations on different tasks suggest that the performance\nof tLSTMs can be improved by increasing the tensor size and network depth, requiring no\nadditional parameters and little additional runtime. 
As the network gets wider and deeper, we found\nthat the memory cell convolution mechanism is crucial to maintain improvement in performance.\nAlso, we found that feedback connections are useful for tasks of sequential output (e.g., our Wikipedia\nand algorithmic tasks). Moreover, tLSTM can be further strengthened via tensorization or CN.\nIt is also intriguing to examine the internal working mechanism of tLSTM. Thus, we visualize the\nmemory cell which gives insight into how information is routed. For each task, the best performing\ntLSTM is run on a random example. We record the channel mean (the mean over channels, e.g., it is\nof size P \u00d7P for 3D tLSTMs) of the memory cell at each timestep, and visualize the diagonal values\nof the channel mean from location pin = [1, 1] (near the input) to pout = [P, P ] (near the output).\nVisualization results in Fig. 6 reveal the distinct behaviors of tLSTM when dealing with different tasks:\n(i) Wikipedia: the input can be carried to the output location with less modi\ufb01cation if it is suf\ufb01cient\nto determine the next character, and vice versa; (ii) addition: the \ufb01rst integer is gradually encoded\ninto memories and then interacts (performs addition) with the second integer, producing the sum; (iii)\nmemorization: the network behaves like a shift register that continues to move the input symbol to the\noutput location at the correct timestep; (iv) sequential MNIST: the network is more sensitive to the\npixel value change (representing the contour, or topology of the digit) and can gradually accumulate\nevidence for the \ufb01nal prediction; (v) sequential pMNIST: the network is sensitive to high value pixels\n(representing the foreground digit), and we conjecture that this is because the permutation destroys\nthe topology of the digit, making each high value pixel potentially important.\nFrom Fig. 
6 we can also observe common phenomena for all tasks: (i) at each timestep, the values\nat different tensor locations are markedly different, implying that wider (larger) tensors can encode\nmore information, with less effort to compress it; (ii) from the input to the output, the values become\nincreasingly distinct and are shifted by time, revealing that deep computations are indeed performed\ntogether with temporal computations, with long-range dependencies carried by memory cells.\n\nFigure 7: Examples of models related to tLSTMs. (a) A single layer cLSTM [48] with vector array\ninput. (b) A 3-layer sLSTM [21]. (c) A 3-layer Grid LSTM [30]. (d) A 3-layer RHN [54]. (e) A\n3-layer QRNN [7] with kernel size 2, where costly computations are done by temporal convolution.\n\n4 Related Work\n\nConvolutional LSTMs. Convolutional LSTMs (cLSTMs) are proposed to parallelize the computation\nof LSTMs when the input at each timestep is structured (see Fig. 7(a)), e.g., a vector array\n[48], a vector matrix [41, 42, 50, 52], or a vector tensor [9, 45]. 
Unlike cLSTMs, tLSTM aims to\nincrease the capacity of LSTMs when the input at each timestep is non-structured, i.e., a single vector,\nand is advantageous over cLSTMs in that: (i) it performs the convolution across different hidden\nlayers whose structure is independent of the input structure, and integrates information bottom-up\nand top-down; while cLSTM performs the convolution within each hidden layer whose structure is\ncoupled with the input structure, and thus will fall back to the vanilla LSTM if the input at each timestep\nis a single vector; (ii) it can be widened efficiently without additional parameters by increasing the\ntensor size; while cLSTM can be widened by increasing the kernel size or kernel channel, which\nsignificantly increases the number of parameters; (iii) it can be deepened with little additional runtime\nby delaying the output; while cLSTM can be deepened by using more hidden layers, which\nsignificantly increases the runtime; (iv) it captures long-range dependencies from multiple directions\nthrough the memory cell convolution; while cLSTM struggles to capture long-range dependencies\nfrom multiple directions since memory cells are only gated along one direction.\nDeep LSTMs. Deep LSTMs (dLSTMs) extend sLSTMs by making them deeper (see Fig. 7(b)-(d)).\nTo keep the parameter number small and ease training, Graves [22], Kalchbrenner et al. [30], Mujika\net al. [38], and Zilly et al. [54] apply another RNN/LSTM along the depth direction of dLSTMs, which,\nhowever, multiplies the runtime. Though there are implementations to accelerate the deep computation\n[1, 16], they generally aim at simple architectures such as sLSTMs. Compared with dLSTMs, tLSTM\nperforms the deep computation with little additional runtime, and employs a cross-layer convolution to\nenable the feedback mechanism. 
Moreover, the capacity of tLSTM can be increased more efficiently\nby using higher-dimensional tensors, whereas in dLSTM all hidden layers as a whole are only equivalent to a\n2D tensor (i.e., a stack of hidden vectors), the dimensionality of which is fixed.\nOther Parallelization Methods. Some methods [7, 8, 28, 29, 36, 40] parallelize the temporal\ncomputation of the sequence (e.g., using temporal convolution, as in Fig. 7(e)) during training, in\nwhich case full input/target sequences are accessible. However, during online inference, when the\ninput arrives sequentially, temporal computations can no longer be parallelized and will be blocked\nby deep computations of each timestep, making these methods potentially unsuitable for real-time\napplications that demand a high sampling/output frequency. Unlike these methods, tLSTM can speed\nup not only training but also online inference for many tasks since it performs the deep computation\nthrough the temporal computation, which is also human-like: we convert each signal to an action and\nmeanwhile receive new signals in a non-blocking way. Note that for the online inference of tasks\nthat use the previous output yt\u22121 for the current input xt (e.g., autoregressive sequence generation),\ntLSTM cannot parallelize the deep computation since it must wait L\u22121 timesteps to obtain yt\u22121.\n\n5 Conclusion\n\nWe introduced the Tensorized LSTM, which employs tensors to share parameters and utilizes the\ntemporal computation to perform the deep computation for sequential tasks. We validated our model\non a variety of tasks, showing its potential over other popular approaches.\n\nAcknowledgements\n\nThis work is supported by the NSFC grant 91220301, the Alan Turing Institute under the EPSRC\ngrant EP/N510129/1, and the China Scholarship Council.\n\nReferences\n[1] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. Optimizing performance of recurrent neural networks on gpus. 
arXiv preprint arXiv:1604.01946, 2016. 3, 9\n[2] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016. 7, 8\n[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 4, 5\n[4] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN, 5(2):157\u2013166, 1994. 1\n[5] Yoshua Bengio. Learning deep architectures for ai. Foundations and Trends\u00ae in Machine Learning, 2009. 2\n[6] Luca Bertinetto, Jo\u00e3o F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016. 2, 4\n[7] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In ICLR, 2017. 9\n[8] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas Huang. Dilated recurrent neural networks. In NIPS, 2017. 8, 9\n[9] Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, and Danny Z Chen. Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation. In NIPS, 2016. 9\n[10] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In ICML, 2015. 3, 13\n[11] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. In ICLR, 2017. 6\n[12] Ronan Collobert, Koray Kavukcuoglu, and Cl\u00e9ment Farabet. Torch7: A matlab-like environment for machine learning. In NIPS Workshop, 2011. 13\n[13] Tim Cooijmans, Nicolas Ballas, C\u00e9sar Laurent, and Aaron Courville. Recurrent batch normalization. In ICLR, 2017. 8\n[14] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. 
In NIPS, 2016. 4\n[15] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013. 2, 4\n[16] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent rnns: Stashing recurrent weights on-chip. In ICML, 2016. 9\n[17] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179\u2013211, 1990. 1\n[18] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. In NIPS Workshop, 2016. 2\n[19] Felix A Gers, J\u00fcrgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. Neural computation, 12(10):2451\u20132471, 2000. 1\n[20] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013. 2\n[21] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. 5, 7, 9\n[22] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016. 9\n[23] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In ICLR, 2017. 4, 6\n[24] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997. 1\n[25] Marcus Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 2012. 6\n[26] Ozan Irsoy and Claire Cardie. Modeling compositionality with multiplicative recurrent neural networks. In ICLR, 2015. 2\n[27] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In ICML, 2015. 13\n[28] \u0141ukasz Kaiser and Samy Bengio. Can active memory replace attention? In NIPS, 2016. 9\n[29] \u0141ukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In ICLR, 2016. 
9\n[30] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In ICLR, 2016. 6, 7, 9, 13\n[31] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 13\n[32] Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative lstm for sequence modelling. In ICLR Workshop, 2017. 2, 6\n[33] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015. 7, 8\n[34] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541\u2013551, 1989. 2\n[35] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998. 2, 7\n[36] Tao Lei and Yu Zhang. Training rnns as fast as cnns. arXiv preprint arXiv:1709.02755, 2017. 9\n[37] Gundram Leifert, Tobias Strau\u00df, Tobias Gr\u00fcning, Welf Wustlich, and Roger Labahn. Cells in multidimensional recurrent neural networks. JMLR, 17(1):3313\u20133349, 2016. 4, 13\n[38] Asier Mujika, Florian Meier, and Angelika Steger. Fast-slow recurrent neural networks. In NIPS, 2017. 6, 9\n[39] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In NIPS, 2015. 2\n[40] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. 8, 9\n[41] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016. 9\n[42] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. 
Recurrent instance segmentation. In ECCV, 2016. 9\n[43] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533\u2013536, 1986. 1\n[44] J\u00fcrgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131\u2013139, 1992. 4\n[45] Marijn F Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber. Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation. In NIPS, 2015. 9\n[46] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In ICML, 2011. 2, 4\n[47] Graham W Taylor and Geoffrey E Hinton. Factored conditional restricted boltzmann machines for modeling motion style. In ICML, 2009. 2\n[48] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016. 3, 9\n[49] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016. 8\n[50] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016. 9\n[51] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. In NIPS, 2016. 2, 6\n[52] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015. 9\n[53] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Ruslan R Salakhutdinov, and Yoshua Bengio. Architectural complexity measures of recurrent neural networks. In NIPS, 2016. 
8\n[54] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutn\u00edk, and J\u00fcrgen Schmidhuber. Recurrent highway networks. In ICML, 2017. 6, 9\n", "award": [], "sourceid": 17, "authors": [{"given_name": "Zhen", "family_name": "He", "institution": "University College London"}, {"given_name": "Shaobing", "family_name": "Gao", "institution": "Sichuan University"}, {"given_name": "Liang", "family_name": "Xiao", "institution": "National University of Defense Technology"}, {"given_name": "Daxue", "family_name": "Liu", "institution": "National University of Defense Technology"}, {"given_name": "Hangen", "family_name": "He", "institution": "National University of Defense Technology"}, {"given_name": "David", "family_name": "Barber", "institution": "University College London"}]}