{"title": "LightRNN: Memory and Computation-Efficient Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4385, "page_last": 4393, "abstract": "Recurrent neural networks (RNNs) have achieved state-of-the-art performances in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model will become very big (e.g., possibly beyond the memory capacity of a GPU device) and its training will become very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \\sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, which are far less than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrifice of accuracy (it achieves similar, if not better, perplexity as compared to state-of-the-art language models). Remarkably, on the One-Billion-Word benchmark Dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. 
We name our proposed algorithm \\emph{LightRNN} to reflect its very small model size and very high training speed.", "full_text": "LightRNN: Memory and Computation-Efficient Recurrent Neural Networks\n\nXiang Li$^1$, Tao Qin$^2$, Jian Yang$^1$, Tie-Yan Liu$^2$\n\n$^1$Nanjing University of Science and Technology, $^2$Microsoft Research Asia\n\n$^1$implusdream@gmail.com, $^1$csjyang@njust.edu.cn\n\n$^2${taoqin, tie-yan.liu}@microsoft.com\n\nAbstract\n\nRecurrent neural networks (RNNs) have achieved state-of-the-art performances in many natural language processing tasks, such as language modeling and machine translation. However, when the vocabulary is large, the RNN model will become very big (e.g., possibly beyond the memory capacity of a GPU device) and its training will become very inefficient. In this work, we propose a novel technique to tackle this challenge. The key idea is to use 2-Component (2C) shared embedding for word representations. We allocate every word in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Depending on its position in the table, a word is jointly represented by two components: a row vector and a column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2 \\sqrt{|V|}$ vectors to represent a vocabulary of $|V|$ unique words, which are far less than the $|V|$ vectors required by existing approaches. Based on the 2-Component shared embedding, we design a new RNN algorithm and evaluate it using the language modeling task on several benchmark datasets. The results show that our algorithm significantly reduces the model size and speeds up the training process, without sacrifice of accuracy (it achieves similar, if not better, perplexity as compared to state-of-the-art language models). 
Remarkably, on the One-Billion-Word benchmark Dataset, our algorithm achieves comparable perplexity to previous language models, whilst reducing the model size by a factor of 40-100, and speeding up the training process by a factor of 2. We name our proposed algorithm LightRNN to reflect its very small model size and very high training speed.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nRecently recurrent neural networks (RNNs) have been used in many natural language processing (NLP) tasks, such as language modeling [14], machine translation [23], sentiment analysis [24], and question answering [26]. A popular RNN architecture is long short-term memory (LSTM) [8, 11, 22], which can model long-term dependence and resolve the gradient-vanishing problem by using memory cells and gating functions. With these elements, LSTM RNNs have achieved state-of-the-art performance in several NLP tasks, despite learning almost from scratch.\n\nWhile RNNs are becoming increasingly popular, they have a known limitation: when applied to textual corpora with large vocabularies, the size of the model will become very big. For instance, when using RNNs for language modeling, a word is first mapped from a one-hot vector (whose dimension is equal to the size of the vocabulary) to an embedding vector by an input-embedding matrix. Then, to predict the probability of the next word, the top hidden layer is projected by an output-embedding matrix onto a probability distribution over all the words in the vocabulary. When the vocabulary contains tens of millions of unique words, which is very common in Web corpora, the two embedding matrices will contain tens of billions of elements, making the RNN model too big to fit into the memory of GPU devices. 
Take the ClueWeb dataset [19] as an example, whose vocabulary contains over 10M words. If the embedding vectors are of 1024 dimensions and each dimension is represented by a 32-bit floating-point number, the size of the input-embedding matrix will be around 40GB. Further considering the output-embedding matrix and those weights between hidden layers, the RNN model will be larger than 80GB, which is far beyond the capability of the best GPU devices on the market [2]. Even if the memory constraint were not a problem, the computational complexity of training such a big model would be too high to afford. In RNN language models, the most time-consuming operation is to calculate a probability distribution over all the words in the vocabulary, which requires the multiplication of the output-embedding matrix and the hidden state at each position of a sequence. A simple calculation shows that it would take tens of years for the best single GPU today to finish the training of a language model on the ClueWeb dataset. Furthermore, in addition to the challenges during the training phase, even if we can successfully train such a big model, it is almost impossible to host it on mobile devices for efficient inference.\n\nTo address the above challenges, in this work, we propose to use 2-Component (2C) shared embedding for word representations in RNNs. We allocate all the words in the vocabulary into a table, each row of which is associated with a vector, and each column associated with another vector. Then we use two components to represent a word depending on its position in the table: the corresponding row vector and column vector. Since the words in the same row share the row vector and the words in the same column share the column vector, we only need $2\\sqrt{|V|}$ vectors to represent a vocabulary with $|V|$ unique words, and thus greatly reduce the model size as compared with the vanilla approach that needs $|V|$ unique vectors. 
Meanwhile, due to the reduced model size, the training of the RNN model also speeds up significantly. We therefore call our proposed algorithm LightRNN, to reflect its very small model size and very high training speed.\n\nA central technical challenge of our approach is how to appropriately allocate the words into the table. To this end, we propose a bootstrap framework: (1) We first randomly initialize the word allocation and then train the LightRNN model. (2) We fix the trained embedding vectors (corresponding to the row and column vectors in the table), and refine the allocation to minimize the training loss, which is a minimum weight perfect matching problem in graph theory and can be effectively solved. (3) We repeat the second step until a certain stopping criterion is met.\n\nWe evaluate LightRNN using the language modeling task on several benchmark datasets. The experimental results show that LightRNN achieves comparable (if not better) accuracy to state-of-the-art language models in terms of perplexity, while reducing the model size by a factor of up to 100 and speeding up the training process by a factor of 2.\n\nPlease note that it is desirable to have a highly compact model (without accuracy drop). First, it makes it possible to put the RNN model into a GPU or even a mobile device. Second, if the training data is large and one needs to perform distributed data-parallel training, the communication cost for aggregating the models from local workers will be low. In this way, our approach makes previously expensive RNN algorithms very economical and scalable, and therefore has a profound impact on deep learning for NLP tasks.\n\n2 Related work\n\nIn the literature of deep learning, there have been several works that try to resolve the problem caused by the large vocabulary of the text corpus.\n\nSome works focus on reducing the computational complexity of the softmax operation on the output-embedding matrix. 
In [16, 17], a binary tree is used to represent a hierarchical clustering of words in the vocabulary. Each leaf node of the tree is associated with a word, and every word has a unique path from the root to its leaf. In this way, when calculating the probability of the next word, one can replace the original $|V|$-way normalization with a sequence of $\\log |V|$ binary normalizations. In [9, 15], the words in the vocabulary are organized into a tree with two layers: the root node has roughly $\\sqrt{|V|}$ intermediate nodes, each of which also has roughly $\\sqrt{|V|}$ leaf nodes. Each intermediate node represents a cluster of words, and each leaf node represents a word in the cluster. To calculate the probability of the next word, one first calculates the probability of the cluster of the word and then the conditional probability of the word given its cluster. Besides, sampling-based approximation methods select, randomly or heuristically, a small subset of the output layer and estimate the gradient only from those samples; examples are importance sampling [3] and BlackOut [12]. Although these methods can speed up the training process by means of efficient softmax, they do not reduce the size of the model.\n\nSome other works focus on reducing the model size. Techniques [6, 21] like differentiated softmax and recurrent projection are employed to reduce the size of the output-embedding matrix. However, they only slightly compress the model, and the number of parameters is still of the same order as the vocabulary size. Character-level convolutional filters are used to shrink the size of the input-embedding matrix in [13]. 
However, it still suffers from the gigantic output-embedding matrix. Besides, these methods have not addressed the challenge of computational complexity caused by the time-consuming softmax operations.\n\nAs can be seen from the above introductions, no existing work has simultaneously achieved the significant reduction of both model size and computational complexity. This is exactly the problem that we will address in this paper.\n\n3 LightRNN\n\nIn this section, we introduce our proposed LightRNN algorithm.\n\n3.1 RNN Model with 2-Component Shared Embedding\n\nA key technical innovation in the LightRNN algorithm is its 2-Component shared embedding for word representations. As shown in Figure 1, we allocate all the words in the vocabulary into a table. The $i$-th row of the table is associated with an embedding vector $x^r_i$ and the $j$-th column of the table is associated with an embedding vector $x^c_j$. Then a word in the $i$-th row and the $j$-th column is represented by two components: $x^r_i$ and $x^c_j$. By sharing the embedding vector among words in the same row (and also in the same column), for a vocabulary with $|V|$ words, we only need $2\\sqrt{|V|}$ unique vectors for the input word embedding. It is the same case for the output word embedding.\n\nFigure 1: An example of the word table\n\nWith the 2-Component shared embedding, we can construct the LightRNN model by doubling the basic units of a vanilla RNN model, as shown in Figure 2. 
Let $n$ and $m$ denote the dimension of a row/column input vector and that of a hidden state vector respectively. To compute the probability distribution of $w_t$, we need to use the column vector $x^c_{t-1} \\in \\mathbb{R}^n$, the row vector $x^r_t \\in \\mathbb{R}^n$, and the hidden state vector $h^r_{t-1} \\in \\mathbb{R}^m$. The column and row vectors are from input-embedding matrices $X^c, X^r \\in \\mathbb{R}^{n \\times \\sqrt{|V|}}$ respectively. Next two hidden state vectors $h^c_{t-1}, h^r_t \\in \\mathbb{R}^m$ are produced by applying the following recursive operations:\n\n$$h^c_{t-1} = f(W x^c_{t-1} + U h^r_{t-1} + b), \\qquad h^r_t = f(W x^r_t + U h^c_{t-1} + b). \\qquad (1)$$\n\nIn the above function, $W \\in \\mathbb{R}^{m \\times n}$, $U \\in \\mathbb{R}^{m \\times m}$, $b \\in \\mathbb{R}^m$ are parameters of affine transformations, and $f$ is a nonlinear activation function (e.g., the sigmoid function).\n\nFigure 2: LightRNN (left) vs. Conventional RNN (right).\n\nThe probability $P(w_t)$ of a word $w$ at position $t$ is determined by its row probability $P_r(w_t)$ and column probability $P_c(w_t)$:\n\n$$P_r(w_t) = \\frac{\\exp(h^c_{t-1} \\cdot y^r_{r(w)})}{\\sum_{i \\in S_r} \\exp(h^c_{t-1} \\cdot y^r_i)}, \\qquad P_c(w_t) = \\frac{\\exp(h^r_t \\cdot y^c_{c(w)})}{\\sum_{i \\in S_c} \\exp(h^r_t \\cdot y^c_i)}, \\qquad (2)$$\n\n$$P(w_t) = P_r(w_t) \\cdot P_c(w_t), \\qquad (3)$$\n\nwhere $r(w)$ is the row index of word $w$, $c(w)$ is its column index, $y^r_i \\in \\mathbb{R}^m$ is the $i$-th vector of $Y^r \\in \\mathbb{R}^{m \\times \\sqrt{|V|}}$, $y^c_i \\in \\mathbb{R}^m$ is the $i$-th vector of $Y^c \\in \\mathbb{R}^{m \\times \\sqrt{|V|}}$, and $S_r$ and $S_c$ denote the set of rows and columns of the word table respectively. Note that we do not see the $t$-th word before predicting it. In Figure 2, given the input column vector $x^c_{t-1}$ of the $(t-1)$-th word, we first infer the row probability $P_r(w_t)$ of the $t$-th word, and then choose the index of the row with the largest probability in $P_r(w_t)$ to look up the next input row vector $x^r_t$. Similarly, we can then infer the column probability $P_c(w_t)$ of the $t$-th word.\n\nWe can see that by using Eqn.(3), we effectively reduce the computation of the probability of the next word from a $|V|$-way normalization (in standard RNN models) to two $\\sqrt{|V|}$-way normalizations. 
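The factorization in Eqn.(3) can be illustrated with a minimal sketch, assuming toy sizes and random vectors. The helper names (softmax, dot, word_prob) and the 4 x 4 table are hypothetical and not part of the original implementation. For simplicity the sketch holds the two hidden states fixed, so the row and column distributions are independent and the product over all table cells sums to one:

```python
import math
import random

def softmax(scores):
    # Subtract the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def word_prob(h_c_prev, h_r, Y_r, Y_c, row, col):
    # Eqn.(2): one normalization over rows, one over columns.
    p_row = softmax([dot(h_c_prev, y) for y in Y_r])
    p_col = softmax([dot(h_r, y) for y in Y_c])
    # Eqn.(3): the word probability is the product of the two.
    return p_row[row] * p_col[col]

random.seed(0)
m, side = 8, 4  # hidden size and table side (a 16-word toy vocabulary)
Y_r = [[random.gauss(0, 1) for _ in range(m)] for _ in range(side)]
Y_c = [[random.gauss(0, 1) for _ in range(m)] for _ in range(side)]
h_c_prev = [random.gauss(0, 1) for _ in range(m)]
h_r = [random.gauss(0, 1) for _ in range(m)]

# Probability of every cell of the toy table.
probs = [[word_prob(h_c_prev, h_r, Y_r, Y_c, i, j) for j in range(side)]
         for i in range(side)]
total = sum(sum(row) for row in probs)
print(round(total, 6))
```

Each call scores 2 x side output vectors rather than side x side, which is where the two sqrt(|V|)-way normalizations come from. In the full model the hidden state used for the column depends on the chosen row vector, so the column distribution is conditional on the row.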
To better understand the reduction of the model size, we compare the key components in a vanilla RNN model and in our proposed LightRNN model by considering an example with embedding dimension $n = 1024$, hidden unit dimension $m = 1024$ and vocabulary size $|V| = 10$M. Suppose we use 32-bit floating point representation (4 bytes) for each dimension. The total size of the two embedding matrices $X, Y$ is $(m \\times |V| + n \\times |V|) \\times 4 \\approx 80$GB for the vanilla RNN model and that of the four embedding matrices $X^r, X^c, Y^r, Y^c$ in LightRNN is $2 \\times (m \\times \\sqrt{|V|} + n \\times \\sqrt{|V|}) \\times 4 \\approx 50$MB. It is clear that LightRNN shrinks the model size by a significant factor so that it can easily fit into the memory of a GPU device or a mobile device.\n\nThe cell of hidden state $h$ can be implemented by an LSTM [22] or a gated recurrent unit (GRU) [7], and our idea works with any kind of recurrent unit. Please note that in LightRNN, the input and output use different embedding matrices but they share the same word-allocation table.\n\n3.2 Bootstrap for Word Allocation\n\nThe LightRNN algorithm described in the previous subsection assumes that there exists a word allocation table. It remains a problem how to appropriately generate this table, i.e., how to allocate the words into appropriate columns and rows. In this subsection, we discuss this issue. Specifically, we propose a bootstrap procedure to iteratively refine word allocation based on the learned word embedding in the LightRNN model:\n\n(1) For cold start, randomly allocate the words into the table.\n(2) Train the input/output embedding vectors in LightRNN based on the given allocation until convergence. 
Exit if a stopping criterion (e.g., training time, or perplexity for language modeling) is met, otherwise go to the next step.\n\n(3) Fixing the embedding vectors learned in the previous step, refine the allocation in the table, to minimize the loss function over all the words. Go to Step (2).\n\nAs can be seen above, the refinement of the word allocation table according to the learned embedding vectors is a key step in the bootstrap procedure. We will provide more details about it, by taking language modeling as an example.\n\nThe target in language modeling is to minimize the negative log-likelihood of the next word in a sequence, which is equivalent to optimizing the cross-entropy between the target probability distribution and the prediction given by the LightRNN model. Given a context with $T$ words, the overall negative log-likelihood can be expressed as follows:\n\n$$NLL = \\sum_{t=1}^{T} -\\log P(w_t) = \\sum_{t=1}^{T} \\big(-\\log P_r(w_t) - \\log P_c(w_t)\\big). \\qquad (4)$$\n\n$NLL$ can be expanded with respect to words: $NLL = \\sum_{w=1}^{|V|} NLL_w$, where $NLL_w$ is the negative log-likelihood for a specific word $w$.\n\nFor ease of deduction, we rewrite $NLL_w$ as $l(w, r(w), c(w))$, where $(r(w), c(w))$ is the position of word $w$ in the word allocation table. In addition, we use $l_r(w, r(w))$ and $l_c(w, c(w))$ to represent the row component and column component of $l(w, r(w), c(w))$ (which we call row loss and column loss of word $w$ for ease of reference). The relationship between these quantities is\n\n$$NLL_w = \\sum_{t \\in S_w} -\\log P(w_t) = l(w, r(w), c(w)) = \\sum_{t \\in S_w} -\\log P_r(w_t) + \\sum_{t \\in S_w} -\\log P_c(w_t) = l_r(w, r(w)) + l_c(w, c(w)), \\qquad (5)$$\n\nwhere $S_w$ is the set of all the positions for the word $w$ in the corpus.\n\nNow we consider adjusting the allocation table to minimize the loss function $NLL$. 
For word $w$, suppose we plan to move it from the original cell $(r(w), c(w))$ to another cell $(i, j)$ in the table. Then we can calculate the row loss $l_r(w, i)$ if it is moved to row $i$ while its column and the allocation of all the other words remain unchanged. We can also calculate the column loss $l_c(w, j)$ in a similar way. Next we define the total loss of this move as $l(w, i, j)$ which is equal to $l_r(w, i) + l_c(w, j)$ according to Eqn.(5). The total cost of calculating all $l(w, i, j)$ is $O(|V|^2)$, by assuming $l(w, i, j) = l_r(w, i) + l_c(w, j)$, since we only need to calculate the loss of each word allocated in every row and column separately. In fact, all $l_r(w, i)$ and $l_c(w, j)$ have already been calculated during the forward part of LightRNN training: to predict the next word we need to compute the scores (i.e., in Eqn.(2), $h^c_{t-1} \\cdot y^r_i$ and $h^r_t \\cdot y^c_i$ for all $i$) of all the words in the vocabulary for normalization and $l_r(w, i)$ is the sum of $-\\log\\left(\\frac{\\exp(h^c_{t-1} \\cdot y^r_i)}{\\sum_k \\exp(h^c_{t-1} \\cdot y^r_k)}\\right)$ over all the appearances of word $w$ in the training data. After we calculate $l(w, i, j)$ for all possible $w, i, j$, we can write the reallocation problem as the following optimization problem:\n\n$$\\min_a \\sum_{(w,i,j)} l(w, i, j)\\, a(w, i, j) \\quad \\text{subject to} \\quad \\sum_{(i,j)} a(w, i, j) = 1 \\;\\; \\forall w \\in V, \\quad \\sum_{w} a(w, i, j) = 1 \\;\\; \\forall i \\in S_r, j \\in S_c, \\quad a(w, i, j) \\in \\{0, 1\\} \\;\\; \\forall w \\in V, i \\in S_r, j \\in S_c, \\qquad (6)$$\n\nwhere $a(w, i, j) = 1$ means allocating word $w$ to position $(i, j)$ of the table, and $S_r$ and $S_c$ denote the row set and column set of the table respectively.\n\nBy defining a weighted bipartite graph $G = (\\mathcal{V}, \\mathcal{E})$ with $\\mathcal{V} = (V, S_r \\times S_c)$, in which the weight of the edge in $\\mathcal{E}$ connecting a node $w \\in V$ and node $(i, j) \\in S_r \\times S_c$ is $l(w, i, j)$, we will see that the above optimization problem is equivalent to a standard minimum weight perfect matching problem [18] on graph $G$. 
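A minimal sketch of the reallocation step under Eqn.(6), assuming a toy 2 x 2 table with four words and made-up row and column losses. The greedy pass below is a simple stand-in heuristic, not the MCMF algorithm [1] and not the approximation algorithm of [20]. The names cost, greedy_allocate, total_loss and exact_best are hypothetical, and the brute-force search is only feasible at toy sizes:

```python
from itertools import permutations

def cost(l_row, l_col, w, i, j):
    # l(w, i, j) = l_r(w, i) + l_c(w, j)
    return l_row[w][i] + l_col[w][j]

def greedy_allocate(l_row, l_col):
    # Visit (word, cell) edges from cheapest to most expensive and keep an
    # edge whenever both the word and the cell are still free.
    n_rows, n_cols = len(l_row[0]), len(l_col[0])
    edges = sorted((cost(l_row, l_col, w, i, j), w, i, j)
                   for w in range(len(l_row))
                   for i in range(n_rows) for j in range(n_cols))
    alloc, used = {}, set()
    for _, w, i, j in edges:
        if w not in alloc and (i, j) not in used:
            alloc[w] = (i, j)
            used.add((i, j))
    return alloc

def total_loss(alloc, l_row, l_col):
    return sum(cost(l_row, l_col, w, i, j) for w, (i, j) in alloc.items())

def exact_best(l_row, l_col):
    # Brute force over every one-to-one assignment of words to cells.
    cells = [(i, j) for i in range(len(l_row[0])) for j in range(len(l_col[0]))]
    return min(total_loss(dict(enumerate(p)), l_row, l_col)
               for p in permutations(cells))

# Made-up losses: l_row[w][i] and l_col[w][j] for 4 words in a 2 x 2 table.
l_row = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]
l_col = [[0.2, 0.5], [0.6, 0.1], [0.3, 0.9], [0.8, 0.2]]
alloc = greedy_allocate(l_row, l_col)
greedy_loss = total_loss(alloc, l_row, l_col)
best_loss = exact_best(l_row, l_col)
print(alloc, greedy_loss, best_loss)
```

Because every word can go to every cell, the greedy pass always returns a complete one-to-one allocation, but in general it carries no optimality guarantee; the brute-force value is included only as a reference for the exact solution of the matching problem.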
This problem has been well studied in the literature, and one of the best practical algorithms for the problem is the minimum cost maximum flow (MCMF) algorithm [1], whose basic idea is shown in Figure 3. In Figure 3(a), we assign each edge connecting a word node $w$ and a position node $(i, j)$ with flow capacity 1 and cost $l(w, i, j)$. The remaining edges starting from source (src) or ending at destination (dst) are all with flow capacity 1 and cost 0. The thick solid lines in Figure 3(a) give an example of the optimal weighted matching solution, while Figure 3(b) illustrates how the allocation gets updated correspondingly. Since the computational complexity of MCMF is $O(|V|^3)$, which is still costly for a large vocabulary, we alternatively leverage a linear time (with respect to $|\\mathcal{E}|$) $\\frac{1}{2}$-approximation algorithm [20] in our experiments whose computational complexity is $O(|V|^2)$. When the number of tokens in the dataset is far larger than the size of the vocabulary (which is the common case), this complexity can be ignored as compared with the overall complexity of LightRNN training (which is around $O(|V|KT)$, where $K$ is the number of epochs in the training process and $T$ is the total number of tokens in the training data).\n\nFigure 3: The MCMF algorithm for minimum weight perfect matching. Panels: (a), (b).\n\n4 Experiments\n\nTo test LightRNN, we conducted a set of experiments on the language modeling task.\n\n4.1 Settings\n\nWe use perplexity ($PPL$) as the measure to evaluate the performance of an algorithm for language modeling (the lower, the better), defined as $PPL = \\exp(\\frac{NLL}{T})$, where $T$ is the number of tokens in the test set. We used all the linguistic corpora from 2013 ACL Workshop Morphological Language Datasets (ACLW) [4] and the One-Billion-Word Benchmark Dataset (BillionW) [5] in our experiments. 
The detailed information of these public datasets is listed in Table 1.\n\nTable 1: Statistics of the datasets\n\nDataset | #Token | Vocabulary Size\nACLW-Spanish | 56M | 152K\nACLW-French | 57M | 137K\nACLW-English | 20M | 60K\nACLW-Czech | 17M | 206K\nACLW-German | 51M | 339K\nACLW-Russian | 25M | 497K\nBillionW | 799M | 793K\n\nFor the ACLW datasets, we kept all the training/validation/test sets exactly the same as those in [4, 13] by using their processed data (footnote 1). For the BillionW dataset, since the data (footnote 2) are unprocessed, we processed the data according to the standard procedure as listed in [5]: We discarded all words with count below 3 and padded the sentence boundary markers <S>,<\\S>. Words outside the vocabulary were mapped to the <UNK> token. Meanwhile, the partition of training/validation/test sets on BillionW was the same as the public settings in [5] for fair comparisons.\n\nWe trained LSTM-based LightRNN using stochastic gradient descent with truncated backpropagation through time [10, 25]. The initial learning rate was 1.0 and was then divided by 2 whenever the perplexity did not improve on the validation set after a certain number of mini-batches. We clipped the gradients of the parameters such that their norms were bounded by 5.0. We further performed dropout with probability 0.5 [28]. All the training processes were conducted on one single GPU K20 with 5GB memory.\n\n4.2 Results and Discussions\n\nFor the ACLW datasets, we mainly compared LightRNN with two state-of-the-art LSTM RNN algorithms in [13]: one utilizes hierarchical softmax for word prediction (denoted as HSM), and the other one utilizes hierarchical softmax as well as character-level convolutional filters for input embedding (denoted as C-HSM). We explored several choices of dimensions of shared embedding for LightRNN: 200, 600, and 1000. Note that 200 is exactly the word embedding size of HSM and C-HSM models used in [13]. 
Since our algorithm significantly reduces the model size, it allows us to use larger dimensions of embedding vectors while still keeping our model size very small. Therefore, we also tried 600 and 1000 in LightRNN, and the results are shown in Table 2. We can see that with larger embedding sizes, LightRNN achieves better accuracy in terms of perplexity. With 1000-dimensional embedding, it achieves the best result while the total model size is still quite small. Thus, we set 1000 as the shared embedding size while comparing with baselines on all the ACLW datasets in the following experiments.\n\nFootnote 1: https://www.dropbox.com/s/m83wwnlz3dw5zhk/large.zip?dl=0\nFootnote 2: http://tiny.cc/1billionLM\n\nTable 2: Test PPL of LightRNN on the ACLW-French dataset w.r.t. embedding sizes\n\nEmbedding size | PPL | #param\n200 | 340 | 0.9M\n600 | 208 | 7M\n1000 | 176 | 17M\n\nTable 5 shows the perplexity and model sizes in all the ACLW datasets. As can be seen, LightRNN significantly reduces the model size, while at the same time outperforming the baselines in terms of perplexity. Furthermore, while the model sizes of the baseline methods increase linearly with respect to the vocabulary size, the model size of LightRNN stays almost constant on the ACLW datasets.\n\nTable 3: Runtime comparisons in order to achieve the HSMs\u2019 baseline PPL\n\nDataset | Method | Runtime (hours) | Reallocation/Training\nACLW | C-HSM[13] | 168 | \u2013\nACLW | LightRNN | 82 | 0.19%\nBillionW | HSM[6] | 168 | \u2013\nBillionW | LightRNN | 70 | 2.36%\n\nTable 4: Results on BillionW dataset\n\nMethod | PPL | #param\nKN[5] | 68 | 2G\nHSM[6] | 85 | 1.6G\nB-RNN[12] | 68 | 4.1G\nLightRNN | 66 | 41M\nKN + HSM[6] | 56 | \u2013\nKN + B-RNN[12] | 47 | \u2013\nKN + LightRNN | 43 | \u2013\n\nFor the BillionW dataset, we mainly compared with BlackOut for RNN [12] (B-RNN) which achieves the state-of-the-art result by interpolating with KN (Kneser-Ney) 5-gram. 
Since the best single model reported in the paper is a 1-layer RNN with 2048-dimensional word embedding, we also used this embedding size for LightRNN. In addition, we compared with the HSM result reported in [6], which used 1024 dimensions for word embedding, but still has 40x more parameters than our model. For further comparisons, we also ensembled LightRNN with the KN 5-gram model. We utilized the KenLM Language Model Toolkit (footnote 3) to get the probability distribution from the KN model with the same vocabulary setting.\n\nFootnote 3: http://kheafield.com/code/kenlm/\n\nThe results on BillionW are shown in Table 4. It is easy to see that LightRNN achieves the lowest perplexity whilst significantly reducing the model size. For example, it reduces the model size by a factor of 40 as compared to HSM and by a factor of 100 as compared to B-RNN. Furthermore, through ensemble with the KN 5-gram model, LightRNN achieves a perplexity of 43.\n\nIn our experiments, the overall training of LightRNN consisted of several rounds of word table refinement. In each round, the training continued until the perplexity on the validation set converged. Figure 4 shows how the perplexity gets improved with respect to the table refinement on one of the ACLW datasets. Based on our observations, 3-4 rounds of refinements usually give satisfactory results.\n\nFigure 4: Perplexity curve on ACLW-French.\n\nTable 3 shows the training time of our algorithm in order to achieve the same perplexity as some baselines on the two datasets. As can be seen, LightRNN roughly halves the runtime needed to achieve the same perplexity as C-HSM and HSM (82 vs. 168 hours on ACLW, 70 vs. 168 hours on BillionW). This table also shows the time cost of word table refinement in the whole training process. The word reallocation part accounts for a very small fraction of the total training time.\n\nTable 5: PPL results in test set for various linguistic datasets on ACLW datasets. 
Italic results are the previous state-of-the-art. #P denotes the number of Parameters.\n\nPPL on ACLW test\n\nMethod | Spanish/#P | French/#P | English/#P | Czech/#P | German/#P | Russian/#P\nKN[4] | 219/\u2013 | 243/\u2013 | 291/\u2013 | 862/\u2013 | 463/\u2013 | 390/\u2013\nHSM[13] | 186/61M | 202/56M | 236/25M | 701/83M | 347/137M | 353/200M\nC-HSM[13] | 169/48M | 190/44M | 216/20M | 578/64M | 305/104M | 313/152M\nLightRNN | 157/18M | 176/17M | 191/17M | 558/18M | 281/18M | 288/19M\n\nFigure 5 shows a set of rows in the word allocation table on the BillionW dataset after several rounds of bootstrap. Surprisingly, our approach could automatically discover the semantic and syntactic relationship of words in natural languages. For example, the place names are allocated together in row 832; the expressions about the concept of time are allocated together in row 889; and URLs are allocated together in row 887. This automatically discovered semantic/syntactic relationship may explain why LightRNN, with such a small number of parameters, sometimes outperforms those baselines that assume all the words are independent of each other (i.e., embedding each word as an independent vector).\n\nFigure 5: Case study of word allocation table\n\n5 Conclusion and future work\n\nIn this work, we have proposed a novel algorithm, LightRNN, for natural language processing tasks. Through the 2-Component shared embedding for word representations, LightRNN achieves high efficiency in terms of both model size and running time, especially for text corpora with large vocabularies.\n\nThere are many directions to explore in the future. First, we plan to apply LightRNN on even larger corpora, such as the ClueWeb dataset, for which conventional RNN models cannot fit into a modern GPU. Second, we will apply LightRNN to other NLP tasks such as machine translation and question answering. 
Third, we will explore k-Component shared embedding ($k > 2$) and study the role of $k$ in the tradeoff between efficiency and effectiveness. Fourth, we are cleaning our code and will release it soon through CNTK [27].\n\nAcknowledgments\n\nThe authors would like to thank the anonymous reviewers for their critical and constructive comments and suggestions. This work was partially supported by the National Science Fund of China under Grant Nos. 91420201, 61472187, 61502235, 61233011 and 61373063, the Key Project of Chinese Ministry of Education under Grant No. 313030, the 973 Program No. 2014CB349303, and Program for Changjiang Scholars and Innovative Research Team in University. We also would like to thank Professor Xiaolin Hu from Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNList) for giving a lot of wonderful advice.\n\nReferences\n\n[1] Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Network flows. Technical report, DTIC Document, 1988.\n\n[2] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. Optimizing performance of recurrent neural networks on gpus. arXiv preprint arXiv:1604.01946, 2016.\n\n[3] Yoshua Bengio, Jean-S\u00e9bastien Sen\u00e9cal, et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, 2003.\n\n[4] Jan A Botha and Phil Blunsom. Compositional morphology for word representations and language modelling. arXiv preprint arXiv:1405.4273, 2014.\n\n[5] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.\n\n[6] Welin Chen, David Grangier, and Michael Auli. Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906, 2015.\n\n[7] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.\n\n[8] Felix A Gers, J\u00fcrgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. Neural computation, 12(10):2451\u20132471, 2000.\n\n[9] Joshua Goodman. Classes for fast maximum entropy training. In Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP\u201901). 2001 IEEE International Conference on, volume 1, pages 561\u2013564. IEEE, 2001.\n\n[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.\n\n[11] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n\n[12] Shihao Ji, SVN Vishwanathan, Nadathur Satish, Michael J Anderson, and Pradeep Dubey. Blackout: Speeding up recurrent neural network language models with very large vocabularies. arXiv preprint arXiv:1511.06909, 2015.\n\n[13] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. arXiv preprint arXiv:1508.06615, 2015.\n\n[14] Tomas Mikolov, Martin Karafi\u00e1t, Lukas Burget, Jan \u010cernock\u00fd, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, volume 2, page 3, 2010.\n\n[15] Tom\u00e1\u0161 Mikolov, Stefan Kombrink, Luk\u00e1\u0161 Burget, Jan Honza \u010cernock\u00fd, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528\u20135531. IEEE, 2011.\n\n[16] Andriy Mnih and Geoffrey E Hinton. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pages 1081\u20131088, 2009.\n\n[17] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pages 246\u2013252. Citeseer, 2005.\n\n[18] Christos H Papadimitriou and Kenneth Steiglitz. 
Combinatorial optimization: algorithms and complexity. Courier Corporation, 1982.\n\n[19] Jan Pomik\u00e1lek, Milos Jakub\u00edcek, and Pavel Rychl\u00fd. Building a 70 billion word corpus of english from clueweb. In LREC, pages 502\u2013506, 2012.\n\n[20] Robert Preis. Linear time 1/2-approximation algorithm for maximum weighted matching in general graphs. In STACS 99, pages 259\u2013269. Springer, 1999.\n\n[21] Ha\u015fim Sak, Andrew Senior, and Fran\u00e7oise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128, 2014.\n\n[22] Martin Sundermeyer, Ralf Schl\u00fcter, and Hermann Ney. Lstm neural networks for language modeling. In INTERSPEECH, pages 194\u2013197, 2012.\n\n[23] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104\u20133112, 2014.\n\n[24] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422\u20131432, 2015.\n\n[25] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550\u20131560, 1990.\n\n[26] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.\n\n[27] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. An introduction to computational networks and the computational network toolkit. Technical report, Microsoft Research, 2014.\n\n[28] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 
Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\n", "award": [], "sourceid": 2164, "authors": [{"given_name": "Xiang", "family_name": "Li", "institution": "NJUST"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft"}, {"given_name": "Jian", "family_name": "Yang", "institution": "Facebook Inc."}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research"}]}