{"title": "A Tensorized Transformer for Language Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2232, "page_last": 2242, "abstract": "Latest development of neural models has connected the encoder and decoder through a self-attention mechanism. In particular, Transformer, which is solely based on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, as a key component of Transformer, limits the effective deployment of the model to a resource-limited setting. In this paper, based on the ideas of tensor decomposition and parameters sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention can not only largely compress the model parameters but also obtain performance improvements, compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.", "full_text": "A Tensorized Transformer for Language Modeling\n\n1College of Intelligence and Computing, Tianjin University, Tianjin, China\n\n2Microsoft Research Asia, Beijing, China\n\n3School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China\n\nXindian Ma1, Peng Zhang1\u2217, Shuai Zhang1,\n\nNan Duan2, Yuexian Hou1, Dawei Song3, Ming Zhou2\n\n{xindianma, pzhang, szhang96, yxhou}@tju.edu.cn\n\n{nanduan, mingzhou}@microsoft.com\n\n{dwsong}@bit.edu.cn\n\nAbstract\n\nLatest development of neural models has connected the encoder and decoder\nthrough a self-attention mechanism. In particular, Transformer, which is solely\nbased on self-attention, has led to breakthroughs in Natural Language Processing\n(NLP) tasks. 
However, the multi-head attention mechanism, a key component of Transformer, limits the effective deployment of the model in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-Billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention not only largely compresses the model parameters but also obtains performance improvements, compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.\n\n1\n\nIntroduction\n\nIn NLP, neural language model pre-training has been shown to be effective for improving many tasks [12, 26]. Transformer [35] is based solely on the attention mechanism, dispensing with recurrent and convolutional networks entirely. At present, this model has received extensive attention and plays a key role in many neural language models, such as BERT [12], GPT [27] and Universal Transformer [10]. However, the large number of parameters in Transformer-based models makes them difficult to train and deploy in a resource-limited setting. Thus, the compression of large neural pre-training language models has become an essential problem in NLP research.\nIn the literature, several compression methods [18, 38, 14] have been proposed. When the vocabulary is large, the corresponding weight matrices can be enormous. Tensorized embedding (TE) [18] uses the tensor train [25] to compress the embedding layers in Transformer-XL [7], but does not compress the attention layers. Recently, Block-Term Tensor Decomposition (BTD) [9] has been used to compress recurrent neural networks (RNNs) [38]. Ye et al. 
[38] propose a compact, flexible structure to deal with the large number of model parameters induced by high-dimensional inputs when training recurrent neural networks (RNNs). This method greatly reduces the parameters of RNNs and improves their training efficiency. Still, the model only considers compressing the input layer via the idea of low-rank approximation. On the other hand, some methods [14, 2] aim to develop a specific structure on their weight matrices and can reduce the parameters of the models.\n\n\u2217Corresponding Author: Peng Zhang\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, the new structure obtained after compression cannot be integrated into the model [35].\nIn Transformer, the multi-head attention is a key part, and it involves a large number of parameters. Specifically, Vaswani et al. [35] compute the attention function on a set of queries simultaneously, packed together into a matrix Q, while the keys and values are also packed together into matrices K and V, respectively. The attention function then applies a non-linear softmax function over the two matrices Q and K. There are two challenges in finding a high-quality method to compress the multi-head attention in Transformer.\nFirst, the self-attention function in Transformer is a non-linear function, which makes it difficult to compress. In order to address this challenge, we first prove that the output of the attention function of the self-attention model [35] can be linearly represented by a group of orthonormal base vectors. 
Then, by initializing a low-rank core tensor, we use Tucker decomposition [33, 20] to reconstruct a new attention representation, where Q, K and V can be considered as factor matrices. In order to construct the multi-head mechanism and compress the model, we use Block-Term Tensor Decomposition (BTD), which is a combination of CP decomposition [3] and Tucker decomposition [33]. The difference is that the three factor matrices Q, K and V are shared in constructing each 3-order block tensor. This sharing removes many parameters.\nThe second challenge is that the attention model after compression cannot be directly integrated into the encoder-decoder framework of Transformer [35, 7]. In order to address this challenge, we proceed in three steps. First, the average of the block tensors is computed; second, multiple matrices are obtained by splitting the resulting tensor; third, the concatenation of these matrices serves as the input to the next layer of the Transformer network. After that, the model can be integrated into the encoder-decoder framework of Transformer [35, 7] and trained end-to-end. Moreover, we also prove that the 3-order tensor can reconstruct the scaled dot-product attention in Transformer by summing over a particular dimension.\nOur method combines two ideas, low-rank approximation and parameter sharing, at the same time. Therefore, it achieves higher compression ratios. 
Although the self-attention (i.e., scaled dot-product attention) in Transformer can be reconstructed in this way, we do not perform the reconstruction; instead, we split the 3-order tensor (the output of Multi-linear attention), which is helpful for improving accuracy in experiments.\nThe major contributions of this paper are as follows:\n\n1) It is proved that the output of scaled dot-product attention (considered as a function) can be linearly represented by a group of orthonormal base vectors.\n\n2) A novel self-attention method, namely Multi-linear attention, is provided, which combines two compression ideas, parameter sharing and low-rank approximation.\n\n3) Multi-linear attention builds a strong connection between the three factor matrices (which pack a set of queries, keys and values, respectively), enhancing the ability to capture sufficient attention information. We also prove that our model can reconstruct the scaled dot-product attention in the original Transformer.\n\nIn order to validate the benefits of our model, we test it on two NLP tasks, namely language modeling and neural machine translation. In our experiments, the multi-head attention is replaced by the proposed model, namely Multi-linear attention. We observe that the standard multi-head attention can be compressed with high compression ratios on the One-Billion dataset. As a result, we show that Multi-linear attention not only considerably reduces the number of parameters, but also achieves promising experimental results, especially on language modeling tasks.\n\n2 Preliminaries\n\nThe analysis of Multi-linear attention relies on concepts and results from the fields of tensor decomposition and multi-head attention. We cover below in Section 2.1 basic background on Block-Term tensor decomposition [9]. 
Then, we describe in Section 2.2 multi-head attention [35].\n\nFigure 1: The representation of Block-Term tensor decomposition for a 3-order tensor. A \u2208 R^{d1\u00d7d2\u00d7d3} is a 3-order tensor, and can be approximated by P Tucker decompositions. P is the CP rank, and R1, R2, R3 are the Tucker ranks, respectively. In this paper, we assume that R=R1=R2=R3.\n\n2.1 Tensor and Block-Term Tensor Decomposition\n\nTensor We use the Euler script letter A to denote a tensor, which can be thought of as a multi-dimensional array. Thereby a vector and a matrix are a 1-order tensor and a 2-order tensor, respectively. The element of an n-order tensor is denoted as A_{d1,...,dn}. In the geometric representation of a tensor, a 3-order tensor can be represented by a cube. There is also a related concept named tensor slice that will be used in this paper. Tensors and some other related concepts are shown in Supplementary Materials A.\n\nBlock-Term Tensor Decomposition (BTD) Block-Term tensor decomposition is a combination of CP decomposition [3] and Tucker decomposition [33]. Given an n-order tensor A \u2208 R^{d1\u00d7...\u00d7dn}, the high-order tensor can be decomposed into P block terms by BTD. \u2022z denotes the tensor-tensor product on the z-th order [19], with z \u2208 {1, . . . , d}. Each term is the product \u2022z between a core tensor Gi \u2208 R^{R1\u00d7...\u00d7Rd} and d factor matrices X_i^{(k)} \u2208 R^{dk\u00d7Rk}, where i \u2208 [1, P] and k \u2208 [1, d]. The formulation of the BTD decomposition is as follows:\n\nA = \u2211_{i=1}^{P} Gi \u20221 X_i^{(1)} \u20222 X_i^{(2)} \u20223 . . . \u2022d X_i^{(d)}\n\n(1)\n\nwhere P is the CP rank, and d is the core order. In our work, the tensor is 3-order. Figure 1 demonstrates how a 3-order tensor A can be decomposed into P block terms.\n\n2.2 Multi-head Attention\n\nIn Transformer, the attention function is named \u201cScaled Dot-Product Attention\u201d. 
In practice, Transformer [35] packs the queries, keys and values into matrices Q, K, and V respectively. The attention function can be written as follows:\n\nAttention(Q, K, V) = softmax(QK^T / \u221ad) V\n\n(2)\n\nwhere d is the number of columns of Q and K. These works [35, 12, 7] all use multi-head attention, as introduced in [35],\n\nMultiHeadAttention(Q\u2032, K\u2032, V\u2032) = Concat(head_1, . . . , head_h) W^O\n\nwhere head_i = Attention(Q\u2032W_i^Q, K\u2032W_i^K, V\u2032W_i^V)\n\n(3)\n\nwhere the matrices W_i^Q, W_i^K \u2208 R^{dmodel\u00d7dk}, W_i^V \u2208 R^{dmodel\u00d7dv} and W^O \u2208 R^{hd\u00d7dmodel}. In practice, dv and dk are equal to d. In this work [35], multiple groups of parameters (W_i^Q, W_i^K and W_i^V) are used, which results in a large number of redundant parameters.\n\n3 Tensorized Transformer\n\nIn this section, we first build a Single-block attention in Figure 2 (left) based on the Tucker decomposition, a low-rank decomposition method. 
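For reference, the scaled dot-product attention of Eq. 2 and the multi-head wrapper of Eq. 3 can be sketched in a few lines of numpy; the variable names and shapes here are our own illustration, not taken from the paper's released code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Eq. 2: softmax(Q K^T / sqrt(d)) V, with d = number of columns of Q and K.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Qp, Kp, Vp, WQ, WK, WV, WO):
    # Eq. 3: one projection triple per head, one attention per head,
    # then concatenation followed by the output matrix W^O.
    heads = [attention(Qp @ wq, Kp @ wk, Vp @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```

Note that each of the h heads carries its own (W_i^Q, W_i^K, W_i^V) triple; this per-head parameter group is exactly the redundancy that the tensorized construction in the following sections targets.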
In this process, we prove that the self-attention function in Transformer can be represented by a linear function, i.e., a linear combination of a set of basis vectors.\n\nFigure 2: (left) Single-block attention using Tucker decomposition. (right) Multi-linear attention based on Block-Term tensor decomposition.\n\nIn order to compress the multi-head mechanism, we propose a Multi-linear attention constructed by a Block-Term tensor decomposition. This attention uses the idea of parameter sharing, i.e., sharing factor matrices across multiple blocks, as shown in Figure 2 (right). After that, the compression ratios and the relatively lower complexity are analyzed.\n\n3.1 Single-block Attention by Tucker Decomposition\n\nBefore building the Single-block attention, it is necessary to state Theorem 3.1. The theorem is closely related to the properties of the Single-block attention function by Tucker decomposition [33].\nTheorem 3.1. Let e1, . . . , en be basis vectors from the vector space S. Assume that the vectors e1, . . . , en are linearly independent and Q, K, V can be linearly represented by this set of basis vectors. The output of the attention function in Eq. 
2 can be represented by a linear combination of this set of basis vectors:\n\nAttention(Q, K, V) = (e1, . . . , en)M,\n\n(4)\n\nwhere M \u2208 R^{n\u00d7d} is a coefficient matrix, and d is the dimension of the matrices (i.e., Q, K, and V).\n\nProof. The proof can be found in Supplementary Materials B.\n\nFigure 2 (left) is a schematic diagram of the Single-block attention. First, we assume that the query, key and value can be mapped into three factor matrices, each composed of a group of orthogonal basis vectors. The three factor matrices are Q, K and V. After that, we can construct a new attention (i.e., Single-block attention) by initializing a trainable 3-order diagonal tensor G. In Figure 2 (left), R is the rank of the tensor, N is the length of a sequence, and d is the dimension of the matrices. The Single-block attention function can be computed based on Tucker decomposition as follows:\n\nAttenTD(G; Q, K, V) = G \u20221 Q \u20222 K \u20223 V = \u2211_{i=1}^{I} \u2211_{j=1}^{J} \u2211_{m=1}^{M} Gijm Qi \u25e6 Kj \u25e6 Vm\n\n(5)\n\nwhere G is the core tensor, i, j and m are the indices of the core tensor, \u25e6 is the outer product, and \u2022z has the same definition as in Eq. 1. Qi, Kj and Vm are column vectors of the matrices Q, K and V, where Q \u2208 R^{N\u00d7d}, K \u2208 R^{N\u00d7d} and V \u2208 R^{N\u00d7d}, and N is the length of a sequence. In practice, we set I=J=M=R. 
The core tensor G can be defined as follows:\n\nGijm = rand(0, 1) if i = j = m, and 0 otherwise\n\n(6)\n\nwhere rand(0, 1) is a random function, and the diagonal entries of the core tensor G form the vector g. Each entry gr \u2208 (0, 1), r \u2208 {1, . . . , R}. We can consider g as the trainable weight. 
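Under the definitions of Eqs. 5 and 6, the Single-block attention can be sketched with a short numpy einsum; the sizes N, d, R below are hypothetical, and the einsum exploits the fact that a diagonal core collapses the triple sum of Eq. 5 to a single sum over r:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, R = 6, 8, 4          # sequence length, matrix dimension, rank (I = J = M = R)

Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Diagonal core tensor of Eq. 6: only G[r, r, r] is non-zero, drawn from rand(0, 1);
# the diagonal vector g is then normalized with softmax as described in the text.
g = rng.random(R)
g = np.exp(g) / np.exp(g).sum()

# With a diagonal core, Eq. 5 reads
#   AttenTD[i, j, m] = sum_r g[r] * Q[i, r] * K[j, r] * V[m, r],
# i.e. a weighted sum of outer products of the first R columns of Q, K, V.
T = np.einsum('r,ir,jr,mr->ijm', g, Q[:, :R], K[:, :R], V[:, :R])
print(T.shape)  # a 3-order tensor over the sequence dimension
```

In this sketch the output tensor has shape N x N x N (one axis per factor matrix); the text next reconstructs the scaled dot-product attention of Eq. 2 by summing this tensor along its second index.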
In experiments, we normalize the weight vector with the softmax function (i.e., softmax(g)).\nAfter that, the output of the Single-block attention function is a 3-order tensor obtained by linear computation. The Single-block attention (i.e., a 3-order tensor with Tucker decomposition) can reconstruct the scaled dot-product attention in Eq. 2 by summing over the tensor along the second index (which can be seen as the coordinate in the vertical direction of the tensor), as proved in the following corollary. Note that in our model, we do not adopt this reconstruction. Instead, to obtain a new representation, we adopt the concat method after tensor splitting (see Sec. 3.2). We will further show the compression ability of the Single-block attention in Sec. 3.3.\nCorollary 1. Under the same conditions as in Theorem 3.1 and with the value of N equal to the value of d, the Single-block attention representation of Eq. 5 can reconstruct the scaled dot-product attention in Eq. 2 by summing over the output tensor along the second index. It holds that:\n\nAttention(Q, K, V)_{i,m} = \u2211_{j=1}^{N} AttenTD(G; Q, K, V)_{i,j,m}\n\n(7)\n\nwhere i, j and m are the indices of the Single-block attention\u2019s output (i.e., a 3-order tensor), AttenTD(\u00b7) is the function of Single-block attention based on Tucker decomposition, and i and m are the indices of the output (i.e., a matrix) of Eq. 2.\n\nProof. The proof can be found in Supplementary Materials C.\n\n3.2 Multi-Linear Attention by Block-Term Tensor Decomposition\n\nIn order to construct the multi-head mechanism and compress the parameters of the multiple groups of mappings, we use a single group of linear projections and share their outputs. In Figure 2 (right), the learned linear projections map queries, keys and values to three matrices which are composed of basis vectors. 
After that, we use Block-Term tensor decomposition to build the multi-head mechanism. The resulting model is named Multi-linear attention, which can be formulated as follows:\n\nMultiLinear(G; Q\u2032, K\u2032, V\u2032) = SplitConcat((1/h) \u2217 (T1 + . . . + Th)) W^O\n\nwhere Tj = AttenTD(Gj; Q\u2032W^q, K\u2032W^k, V\u2032W^v)\n\n(8)\n\nwhere the core tensor Gj is a diagonal tensor, the number of parameters in Gj is equal to the rank of the core tensor, and j \u2208 {1, . . . , h}. Q\u2032W^q, K\u2032W^k and V\u2032W^v are equal to Q, K and V in Eq. 5, respectively. G is the set of core tensors. SplitConcat(\u00b7) is a function which performs concatenation after splitting a 3-order tensor. Figure 2 (right) shows the basic idea of Multi-linear attention. W^O is the parameter matrix of a fully connected layer applied to the output of Multi-linear attention. AttenTD(\u00b7) is the Single-block attention function, which is a part of Multi-linear attention. W^q, W^k and W^v are the parameter matrices which are shared in constructing Multi-linear attention.\nMulti-linear attention is a compression model: compressing the multi-head attention in Transformer in this way yields a Tensorized Transformer. Multi-linear attention can be incorporated into the Transformer architecture; a diagram of how Multi-linear attention is incorporated into part of the Transformer structure is given in Supplementary Materials E.1.\n\n3.3 Analysis of Compression and Complexity\n\nCompression Our focus is on the compression of the multi-head mechanism in the multi-head attention of Transformer. Previous work [35] obtains the multi-head attention by multiple groups of linear mappings. We use three linear mappings for the matrices Q, K and V, respectively. 
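The construction of Eq. 8 can be sketched as follows. All names and sizes here are our own, the exact splitting performed by SplitConcat is one assumed reading of the operator (the released code may differ), and the shape of W^O is chosen only to make the toy sketch type-check:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_model, d, R, h = 6, 16, 8, 4, 2   # hypothetical sizes for the sketch

Qp = rng.standard_normal((N, d_model))  # Q' (K' and V' likewise)
Kp = rng.standard_normal((N, d_model))
Vp = rng.standard_normal((N, d_model))

# A single shared group of projections W^q, W^k, W^v -- the parameter-sharing idea.
Wq, Wk, Wv = [rng.standard_normal((d_model, d)) for _ in range(3)]
Q, K, V = Qp @ Wq, Kp @ Wk, Vp @ Wv

def single_block(g):
    # Single-block attention (Eq. 5) with a diagonal core (Eq. 6).
    return np.einsum('r,ir,jr,mr->ijm', g, Q[:, :R], K[:, :R], V[:, :R])

# The h blocks differ only in their diagonal cores g_j; Q, K, V are shared.
cores = [rng.random(R) for _ in range(h)]
T_avg = sum(single_block(g) for g in cores) / h      # (1/h) * (T_1 + ... + T_h)

# SplitConcat: split the 3-order tensor along its second index and concatenate
# the resulting matrices, then apply the output matrix W^O.
flat = np.concatenate([T_avg[:, j, :] for j in range(N)], axis=-1)
WO = rng.standard_normal((flat.shape[-1], d_model))  # shape assumed for the sketch
out = flat @ WO                                      # one row per sequence position

# Parameter sharing vs. Eq. 3's separate per-head projections:
shared   = 3 * d_model * d + h * R    # one projection group + h diagonal cores
separate = 3 * h * d_model * d        # h separate projection groups
print(shared, separate)
```

Even at these toy sizes the shared construction uses far fewer attention-layer parameters than the per-head projections of Eq. 3, which is the effect quantified in Sec. 3.3.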
For the outputs of the three mappings, we choose to share them as the three factor matrices in reconstructing the Multi-linear attention. This process is shown in Figure 2 (left). h is the number of heads in [35], and d is the dimension of the factor matrices. The compression ratio can be computed as (3 \u00d7 h \u00d7 d)/(3 \u00d7 d + h). In practice, h is normally set to 8 and d to 512. In this case, the compression ratio reaches about 8. In other words, we can reduce the parameters in the attention layer by a factor of almost 8. The details of the computation of the compression ratio can be found in Supplementary Materials D. The Transformer also contains other network layers, such as the position-wise feed-forward network and the embedding layers. Therefore, the compression ratio of the whole Transformer is assessed via the analysis of the experimental results on model parameters.\nComplexity The time complexity of the attention function in Eq. 2 is O(N^2 d), where N is the length of a sequence and d is the representation dimension. In Multi-linear attention, we can reorder the computations to obtain the model complexity O(N^3), where N is also the length of the sequence. The minimum number of sequential operations in Multi-linear attention for different layers is approximately equal to that of the self-attention in Transformer [35].\n\n2If the coordinates of a 3-order tensor are i, j and m, j is the second index.\n\n4 Related Work\n\nThe field of language modeling has witnessed many significant advances. In contrast to convolutional neural network (CNN) and recurrent neural network (RNN) architectures for language modeling, the Transformer [35] and its variants [7, 12, 10] achieve excellent results. Transformer networks have the potential of learning long-term dependency, but are limited by a fixed-length context in the setting of language modeling. Dai et al. [7] use a segment-level recurrence mechanism and a novel positional encoding scheme to resolve this problem. BERT [12] is a bidirectional encoder representation from Transformers. It is designed to pre-train deep bidirectional representations and obtains new SoTA results on several NLP tasks. Although these methods have achieved great results, their large number of parameters makes the models difficult to train with limited resources. Transformer fails to generalize in many simple tasks, e.g., copying strings and logical inference [10]. Universal Transformer [10] proposes a self-attentive recurrent sequence model which addresses this problem and can increase the training speed. In that work, the authors, following the weight sharing found in CNNs and RNNs, extend the Transformer with a simple form of weight sharing that strikes an effective balance between inductive bias and model expressivity. This method still uses a large number of parameters.\nTherefore, it is very important to consider how to reduce the amount of memory and computation these models need. Existing model compression methods are mainly divided into parameter pruning and sharing [14], low-rank approximation [29], knowledge transfer [2], and transferred convolutional filters [6]. Currently, tensor decomposition methods are used to decompose high-order tensors, which yields different neural network language model structures [39, 40]. Besides, tensor decomposition methods, which in most cases adopt the idea of low-rank approximation, have been successfully applied to neural network compression. For example, in the literature [11, 16], researchers approximate a tensor by minimizing the reconstruction error of the original parameters on convolutional neural networks (CNNs). 
However, these approaches tend to accumulate errors when multiple layers are compressed sequentially, and the output feature maps deviate further from the original values as the number of compressed layers increases. Our compression method uses the idea of parameter sharing in constructing the attention layers, and the size of the output is the same as that of the self-attention in Transformer, which effectively avoids these problems. Tensorizing Neural Networks [24] combined the idea of reshaping the weights of fully-connected layers into high-dimensional tensors with representing them in the Tensor Train format [25]. This approach was later extended to convolutional [13] and recurrent neural networks [36]. Recently, researchers [5, 34] introduced efficient compression methods for the embedding and softmax layers based on structured low-rank matrix approximation. TT-embedding [18] aims to compress the large embedding layer of Transformer-XL [7]. Sparse Transformer [28] adopts sparsification techniques on the attention matrix and reduces its parameters. That work uses a sparse attention matrix by selecting the information at some positions in the attention matrix, but does not change the mechanism of the attention. Our method is different from these works, and combines two compression ideas (low-rank approximation and parameter sharing) to construct a tensorized Transformer.\nIn our work, we focus on compressing the multi-head attention in Transformer based on the idea of parameter sharing. At the same time, we also use a low-rank approximation method to reduce parameters and computational complexity.\n\n5 Experiments\n\nTransformer is a versatile and powerful modeling tool that is widely used in various natural language processing tasks. 
In order to verify the effectiveness of our method (i.e., Multi-linear attention) as a replacement for the multi-head attention in Transformer, we carry out two NLP tasks, namely language modeling (LM) and neural machine translation (NMT). Code3 for running the experiments has been released, and the key code of our method can be found in Supplementary Materials F.\n\n5.1 Language Modeling\n\nLanguage modeling is the task of predicting the next word in a sentence, i.e., estimating the joint probability p(s) of a sentence of tokens s = (w1, . . . , wn). The resulting models can be used to generate text or further fine-tuned to solve other NLP tasks [27]. In this paper, we employ the standard setting of predicting the next token given the sequence of preceding tokens, based on the factorization p(s) = p(w1) \u220f_{i=2}^{n} p(wi | w1, . . . , wi\u22121). We chose three datasets in the order of small (i.e., PTB), medium (i.e., WikiText-103) and large (i.e., One-Billion). Models are evaluated by perplexity (PPL), which is derived from the average per-word log-probability. The lower the PPL, the better the model.\nSpecifically, we take Transformer, the open-source state-of-the-art language modeling architecture, replace the standard multi-head attention layers with our Multi-linear attention, and then test different model configurations on the PTB [23], WikiText-103 [22] and One-Billion Word benchmark [4] datasets, reporting the results in Table 1 and Table 2.\n\nTable 1: Results (PPL) and model parameters compared with state-of-the-art results on One-Billion. Tensorized Transformer is our model. Core-1 means that the model uses a single block-term tensor. 
Analogously, core-2 means that two block-term tensors are used.\n\nModel | Params | Test PPL\nRNN-1024 + 9 Gram [4] | 20B | 51.3\nLSTM-2048-512 [17] | 0.83B | 43.7\nGCNN-14 bottleneck [8] | \u2013 | 31.9\nLSTM-8192-1024 + CNN Input [17] | 1.04B | 30.0\nHigh-Budget MoE [32] | 5B | 28.0\nLSTM+Mos [37] | 113M | 37.10\nTransformer + adaptive input [1] | 0.46B | 23.7\nTransformer-XL Base [7] | 0.46B | 23.5\nTransformer-XL Large [7] | 0.8B | 21.8\nTensorized Transformer core-1 | 0.16B | 20.5\nTensorized Transformer core-2 | 0.16B | 19.5\n\n5.2 Results and Details\n\nPTB has 929k training tokens, 73k validation words, and 82k test words. The results are reported in Table 2. Similar to AWD-LSTM-MoS [37], we apply variational dropout and weight averaging to our model (i.e., Tensorized Transformer). In addition, we note that our model only replaces the multi-head attention with the Multi-linear attention structure; the other components remain the same. We compare the results of our model with those of other models. Our model achieves results comparable with the SoTA when the number of core tensors is equal to two, while our model size (i.e., number of model parameters) is reduced by nearly half compared with Transformer and Transformer-XL.\nWikiText-103 contains 267,735 unique tokens. The dataset is a widely available word-level language modeling benchmark with long-term dependency. It contains 103M training tokens from 28k articles, with an average length of 3.6k tokens per article, which allows testing the ability of long-term dependency modeling. 
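Perplexity, the metric reported in Tables 1 and 2, is the exponential of the average negative per-word log-probability; a minimal sketch (our own illustration, not from the paper's code):

```python
import math

def perplexity(token_probs):
    # PPL = exp(-(1/n) * sum_i log p(w_i | w_1 .. w_{i-1}))
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that always spreads probability uniformly over 4 candidate next words
# has perplexity 4: lower PPL means the model is less "surprised" per word.
print(perplexity([0.25] * 10))
```

This is why the PPL differences in the tables, though numerically small, correspond to meaningful differences in per-word predictive probability.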
As shown in Table 2, our model achieves a perplexity of 18.9, comparable with the previous SoTA perplexity of 18.3, which demonstrates the effectiveness of the proposed attention architecture.\n\n3https://github.com/szhangtju/The-compression-of-Transformer\n\nTable 2: Results and compression compared with state-of-the-art results on PTB and WikiText-103. \u2019\u2212\u2019 indicates no reported result in that setting; \u2019\u2217\u2019 indicates that the result is from our own implementation.\n\nModel | PTB Params | PTB Val PPL | PTB Test PPL | WT-103 Params | WT-103 Val PPL | WT-103 Test PPL\nLSTM+augmented loss [15] | 24M | 75.7 | 48.7 | \u2013 | \u2013 | 48.7\nVariational RHN [41] | 23M | 67.9 | 65.4 | \u2013 | \u2013 | 45.2\n4-layer QRNN [21] | \u2013 | \u2013 | \u2013 | 151M | \u2013 | 33.0\nAWD-LSTM-MoS [37] | 22M | 58.08 | 55.97 | \u2013 | 29.0 | 29.2\nTransformer+adaptive input [1] | 24M | 59.1 | 57 | 247M | 19.8 | 20.5\nTransformer-XL-Base [7] | 24M | 56.72 | 54.52 | 151M | 23.1 | 24.0\nTransformer-XL-Large [7] | \u2013 | \u2013 | \u2013 | 257M | \u2013 | 18.3\nTransformer-XL+TT [18] | 18M | 57.9* | 55.4* | 130M | 23.61* | 25.70*\nSparse Transformer [28] | 14M | 74.0* | 73.1* | 174M | 38.98* | 40.23*\nTensorized Transformer core-1 | 12M | 60.5 | 57.9 | 85.3M | 22.7 | 20.9\nTensorized Transformer core-2 | 12M | 54.25 | 49.8 | 85.3M | 19.7 | 18.9\n\nThe One-Billion Word benchmark is a large dataset derived from a news site. The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words. In this dataset, sentences are shuffled and hence the context is limited. Consequently, this dataset mainly tests the ability to model short-term dependency. The comparison between the Tensorized Transformer and the other methods is shown in Table 1. Although the Tensorized Transformer is mainly designed to better compress the Transformer or Transformer-XL model, it dramatically improves the single-model SoTA from 21.8 to 19.5. 
Specifically, the Tensorized Transformer significantly outperforms a contemporary method using vanilla Transformers [35], suggesting that the advantage of the Tensorized Transformer also generalizes to modeling short sequences.\nTable 1 and Table 2 show that our model achieves lower PPL than the other models on all three datasets. An important observation is that our model has much fewer parameters. Transformer-XL+TT [18] is a recent compression model that uses Tensor Train to compress the input embedding layers only. Sparse Transformer [28] uses a sparse attention matrix to compress the Transformer model. The results in Table 2 show that, compared with Transformer-XL+TT, our method has much fewer parameters and better language modeling performance. These results verify that our model (i.e., Multi-linear attention) is effective on language modeling tasks and performs well for model compression. Other details (such as hyperparameters and hardware) can be found in Supplementary Materials E.\n\n5.3 Neural Machine Translation\n\nThe goal is to map an input sequence s = (x1, x2, . . . , xn) representing a phrase in one language to an output sequence y = (y1, y2, . . . , ym) representing the same phrase in a different language. In this task, we train the Transformer model [35] on the WMT 2016 English-German dataset [31]. Sentences are tokenized using SentencePiece4. For our experiments, we replace each of the attention layers in the encoder with Multi-linear attention. For evaluation we use beam search with a beam size of 5 and length penalty \u03b1=0.6. In this section, we only compare the results with Transformer [35]. Our results are summarized in Table 3, where \u2217 indicates that the result is from our own implementation.\nIn Table 3, we select two baseline models. The Base-line [31] is the first model on the WMT 2016 English-German dataset. For the other baseline, we use the basic Transformer architecture [35]. 
The BLEU score is 34.5 for the basic architecture. We evaluate two Tensorized Transformer configurations, namely core-1 and core-2. With Tensorized Transformer core-1 and core-2, the BLEU scores are 34.10 and 34.91 respectively; core-2 outperforms the Transformer baseline. As for the reported model parameter sizes, our models use fewer parameters.

4https://github.com/google/sentencepiece

Table 3: Results and compression compared with Transformer on WMT-16 English-to-German translation.

| Model | Params | BLEU |
|---|---|---|
| Base-line [31] | – | 26.8 |
| Linguistic Input Features [30] | – | 28.4 |
| Attentional encoder-decoder + BPE [31] | – | 34.2 |
| Transformer [35] | 52M | 34.5* |
| Tensorized Transformer core-1 | 21M | 34.10 |
| Tensorized Transformer core-2 | 21.2M | 34.91 |

5.4 Discussion

We have shown the results on language modeling and neural machine translation tasks using Multi-linear attention. Regarding the compression of model parameters, although we report the parameters of the whole model structure, our method mainly compresses the multi-head attention and leaves the other layers of Transformer unchanged. Regarding the rationale for the improvements: in Corollary 1, we prove that the output of the original attention can be represented by summing over the 3-order tensor. In Figure 2, we instead use a concat function over the matrices obtained from tensor splitting. The concat operation can model all values in the 3-order tensor, and thus captures more information than the sum operator. Another reason could be the alleviation of overfitting by reducing parameters. Overfitting appears when the number of core tensors is greater than 2. Besides, according to our experiments, a relatively large word embedding dimension can lead to overfitting, resulting in performance degradation. Therefore, our model requires a relatively small embedding dimension compared with the original Transformer.
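The contrast between the sum and concat operators over the matrices obtained by splitting a 3-order tensor can be made concrete with a small sketch; the shapes below are illustrative and not the paper's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
h, n, d = 3, 4, 6  # number of splits, sequence length, feature dim (illustrative)
# A 3-order tensor viewed as h matrices of shape (n, d) after splitting.
T = rng.standard_normal((h, n, d))

# Sum operator: collapses the h slices into one (n, d) matrix, losing per-slice detail.
summed = T.sum(axis=0)

# Concat operator: keeps every value, yielding an (n, h*d) matrix.
concatenated = T.transpose(1, 0, 2).reshape(n, h * d)

print(summed.shape, concatenated.shape)  # (4, 6) (4, 18)
```

The concat output retains all h·n·d entries of the tensor, while the sum output keeps only n·d, matching the intuition that concat preserves more information than summation.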
For a more systematic evaluation, we report more experiments and analyses in Supplementary Materials E.4.

6 Conclusion and Further Work

We have proposed a novel self-attention encoder layer, namely the Multi-linear attention, to compress the original multi-head attention and derive a novel encoding scheme. Our main contribution lies in a Tensorized Transformer structure based on Block-Term tensor decomposition, which is represented by the combination of a group of 3-order tensors, with low-rank approximation and parameter sharing ideas adopted. Compared with existing Transformer-based methods, our model achieves a higher compression ratio and better experimental results, particularly on language modeling tasks. This evidence implies that our method can potentially be applied to more NLP tasks with limited resources.
In the future, we will continue to optimize the Tensorized Transformer framework and apply it to other NLP tasks. As stated earlier, our model may suffer from overfitting when the number of cores is large in language modeling. We will explore the fundamental reasons behind this problem and tackle them within the Tensorized Transformer framework.

7 Acknowledgement

This work is supported in part by the state key development program of China (grant No. 2017YFE0111900, 2018YFC0831704), Natural Science Foundation of China (grant No. 61772363, U1636203), and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.

References

[1] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.

[2] Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression.
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.

[3] J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.

[4] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Computer Science, 2013.

[5] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. Groupreduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998, 2018.

[6] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.

[7] Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

[8] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, pages 933–941. JMLR.org, 2017.

[9] Lieven De Lathauwer. Decompositions of a higher-order tensor in block terms—part ii: Definitions and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.

[10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. Published at ICLR 2019, 2018.

[11] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus.
Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[13] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.

[14] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[15] Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

[16] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.

[17] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[18] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.

[19] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[20] Guangxi Li, Jinmian Ye, Haiqin Yang, Di Chen, Shuicheng Yan, and Zenglin Xu. BT-nets: simplifying deep neural networks via block term decomposition. arXiv preprint arXiv:1712.05689, 2017.

[21] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales.
arXiv preprint arXiv:1803.08240, 2018.

[22] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[23] Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černocký. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association, 2011.

[24] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.

[25] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

[26] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.

[27] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf, 2018.

[28] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

[29] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6655–6659. IEEE, 2013.

[30] Rico Sennrich and Barry Haddow.
Linguistic input features improve neural machine translation. arXiv preprint arXiv:1606.02892, 2016.

[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891, 2016.

[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[33] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.

[34] Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. WEST: Word encoded sequence transducers. arXiv preprint arXiv:1811.08417, 2018.

[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[36] Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. In Proceedings of the 34th International Conference on Machine Learning, pages 3891–3900. JMLR.org, 2017.

[37] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.

[38] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.

[39] Lipeng Zhang, Peng Zhang, Xindian Ma, Shuqin Gu, Zhan Su, and Dawei Song. A generalized language model in tensor space. arXiv preprint arXiv:1901.11167, 2019.

[40] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song.
A quantum many-body wave function inspired language modeling approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1303–1312. ACM, 2018.

[41] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.