{"title": "SVD-Softmax: Fast Softmax Approximation on Large Vocabulary Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5463, "page_last": 5473, "abstract": "We propose a fast approximation method of a softmax function with a very large vocabulary using singular value decomposition (SVD). SVD-softmax targets fast and accurate probability estimation of the topmost probable words during inference of neural network language models. The proposed method transforms the weight matrix used in the calculation of the output vector by using SVD. The approximate probability of each word can be estimated with only a small part of the weight matrix by using a few large singular values and the corresponding elements for most of the words. We applied the technique to language modeling and neural machine translation and present a guideline for good approximation. The algorithm requires only approximately 20\\% of arithmetic operations for an 800K vocabulary case and shows more than a three-fold speedup on a GPU.", "full_text": "SVD-Softmax: Fast Softmax Approximation on Large\n\nVocabulary Neural Networks\n\nKyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, Wonyong Sung\n\nDepartment of Electrical and Computer Engineering\n\nSeoul National University, Seoul, Korea\n\nskhu20@snu.ac.kr, {mjlee, ischoi, yhboo}@dsp.snu.ac.kr, wysung@snu.ac.kr\n\nAbstract\n\nWe propose a fast approximation method of a softmax function with a very large\nvocabulary using singular value decomposition (SVD). SVD-softmax targets fast\nand accurate probability estimation of the topmost probable words during infer-\nence of neural network language models. The proposed method transforms the\nweight matrix used in the calculation of the output vector by using SVD. The ap-\nproximate probability of each word can be estimated with only a small part of\nthe weight matrix by using a few large singular values and the corresponding ele-\nments for most of the words. 
We applied the technique to language modeling and neural machine translation and present a guideline for good approximation. The algorithm requires only approximately 20% of arithmetic operations for an 800K vocabulary case and shows more than a three-fold speedup on a GPU.\n\n1 Introduction\n\nNeural networks have shown impressive results for language modeling [1\u20133]. Neural network-based language models (LMs) estimate the likelihood of a word sequence by predicting the next word wt+1 from the previous words w1:t. Word probabilities for every step are acquired by matrix multiplication and a softmax function. Likelihood evaluation by an LM is necessary for various tasks, such as speech recognition [4, 5], machine translation, or natural language parsing and tagging. However, executing an LM with a large vocabulary size is computationally challenging because of the softmax normalization. Softmax computation needs to access every word to compute the normalization factor Z, where softmax(zk) = exp(zk)/\u2211_{i=1}^{V} exp(zi) = exp(zk)/Z, and V indicates the vocabulary size of the dataset. We refer to the conventional softmax algorithm as the \"full-softmax.\"\nThe computational requirement of the softmax function frequently dominates the complexity of neural network LMs. For example, a Long Short-Term Memory (LSTM) [6] RNN with four layers of 2K hidden units requires roughly 128M multiply-add operations for one inference. If the LM supports an 800K vocabulary, the evaluation of the output probability computation with softmax normalization alone demands approximately 1,600M multiply-add operations, far exceeding that of the RNN core itself.\nAlthough we should compute the output vector of all words to evaluate the denominator of the softmax function, few applications require the probability of every word. For example, if an LM is used for rescoring purposes as in [7], only the probabilities of one or a few given words are needed. 
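To make the cost concrete, here is a minimal numpy sketch (toy, arbitrarily chosen sizes V and D; random values stand in for trained weights) showing that even a single word's probability requires the full V x D product, because the normalization factor Z touches every logit:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 128                             # toy vocabulary / hidden sizes
A = rng.standard_normal((V, D)) / np.sqrt(D)   # stand-in softmax weight matrix
b = np.zeros(V)                                # bias vector
h = rng.standard_normal(D)                     # hidden state from the LM

z = A @ h + b                    # V x D multiply-adds: the dominant cost
Z = np.exp(z).sum()              # normalization factor touches all V logits
p_one_word = np.exp(z[42]) / Z   # even one word's probability needs Z
```

Reducing the V x D product is therefore the only way to speed up inference, which is what the preview-window idea targets.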
Further, for applications employing beam search, the most probable top-5 or top-10 values are usually required. In speech recognition, since many states need to be pruned for efficient implementations, it is not necessary to consider the probabilities of all the words. Thus, we formulate our goal: to obtain accurate top-K word probabilities with considerably less computation for LM evaluation, where the K considered is from 1 to 500.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn this paper, we present a fast softmax approximation for LMs, which does not involve alternative neural network architectures or additional loss during training. Our method can be directly applied to full-softmax, regardless of how it is trained. This method is different from those proposed in other papers, in that it aims to reduce the evaluation complexity, not to minimize the training time or to improve the performance.\nThe proposed technique is based on singular value decomposition (SVD) [8] of the softmax weight matrix. Experimental results show that the proposed algorithm provides both fast and accurate evaluation of the most probable top-K word probabilities.\nThe contributions of this paper are as follows.\n\n\u2022 We propose a fast and accurate softmax approximation, SVD-softmax, applied for calculating the top-K word probabilities.\n\u2022 We provide a quantitative analysis of SVD-softmax with three different datasets and two different tasks.\n\u2022 We show through experimental results that the normalization term of softmax can be approximated fairly accurately by computing only a fraction of the full weight matrix.\n\nThis paper is organized as follows. In Section 2, we review related studies and compare them to our study. We introduce SVD-softmax in Section 3. In Section 4, we provide experimental results. In Section 5, we discuss more details about the proposed algorithm. 
Section 6 concludes the paper.\n\n2 Related work\n\nMany methods have been developed to reduce the computational burden of the softmax function.\nThe most successful approaches include sampling-based softmax approximation, hierarchical soft-\nmax architecture, and self-normalization techniques. Some of these support very ef\ufb01cient training.\nHowever, the methods listed below must search the entire vocabulary to \ufb01nd the top-K words.\nSampling-based approximations choose a small subset of possible outputs and train only with\nthose. Importance sampling (IS) [9], noise contrastive estimation (NCE) [10], negative sampling\n(NEG) [11], and Blackout [12] are included in this category. These approximations train the net-\nwork to increase the possibilities of positive samples, which are usually labels, and to decrease the\nprobabilities of negative samples, which are randomly sampled. These strategies are bene\ufb01cial for\nincreasing the training speed. However, their evaluation does not show any improvement in speed.\nHierarchical softmax (HS) uni\ufb01es the softmax function and output vector computation by con-\nstructing a tree structure of words. Binary HS [13, 14] uses the binary tree structure, which is\nlog(V ) in depth. However, the binary representation is heavily dependent on each word\u2019s position,\nand therefore, a two-layer [2] or three-layer [15] hierarchy is also introduced. In particular, in the\nstudy in [15] several clustered words were arranged in a \"short-list,\" where the outputs of the second\nlevel hierarchy were the words themselves, not the classes of the third hierarchy. Adaptive softmax\n[16] extends the idea and allocates the short-list to the \ufb01rst layer, with a two-layer hierarchy. Adap-\ntive softmax achieves both a training time speedup and a performance gain. HS approaches have\nadvantages for quickly gathering probability of a certain word or predetermined words. 
However, HS must still visit every word to find the topmost likely words, a case in which the merit of the tree structure is lost.\nSelf-normalization approaches [17, 18] employ an additional training loss term, which drives the normalization factor Z close to 1. The evaluation of selected words can be achieved significantly faster than by using full-softmax if the denominator is trained well. However, the method cannot guarantee that the denominator is always close to 1, and it must also consider every word for top-K estimation.\nDifferentiated Softmax (D-softmax) [19] restricts the effective parameters, using only a fraction of the full output matrix. The matrix allocates a higher dimensional representation to frequent words and only a lower dimensional vector to rare words. From this point of view, there is a commonality between our method and D-softmax in that the length of the vector used in the output vector computation varies among words. However, the determination of the length of each portion is somewhat heuristic and requires specialized training procedures in D-softmax. The word representation learned by D-softmax is restricted from the start, and may therefore be lacking in terms of expressiveness. In contrast, our algorithm first trains words with a full-length vector and dynamically limits the dimension during evaluation. In SVD-softmax, the importance of each word is also dynamically determined during inference.\n\nFigure 1: Illustration of the proposed SVD-softmax algorithm. The softmax weight matrix is decomposed by singular value decomposition (b). Only a part of the columns is used to compute the preview outputs (c). Selected rows, which are chosen by sorting the preview outputs, are recomputed with full width (d). For simplicity, the bias vector is omitted.\n\n3 SVD-softmax\n\nThe softmax function transforms a D-dimensional real-valued vector h to a V-dimensional probability distribution. 
The probability calculation consists of two stages. First, we acquire the output vector of size V, denoted as z, from h by matrix multiplication as\n\nz = Ah + b    (1)\n\nwhere A \u2208 R^{V\u00d7D} is a weight matrix, h \u2208 R^D is an input vector, b \u2208 R^V is a bias vector, and z \u2208 R^V is the computed output vector. Second, we normalize the output vector to compute the probability yk of each word as\n\nyk = softmax(zk) = exp(Akh + bk)/\u2211_{i=1}^{V} exp(Aih + bi) = exp(zk)/\u2211_{i=1}^{V} exp(zi) = exp(zk)/Z    (2)\n\nThe computational complexity of calculating the probability distribution over all classes and over only one class is the same, because the normalization factor Z requires every output vector element to be computed.\n\n3.1 Singular value decomposition\n\nSVD is a factorization method that decomposes a matrix into two unitary matrices U, V with singular vectors in their columns and one diagonal matrix \u03a3 with non-negative real singular values in descending order. SVD is applied to the weight matrix A as\n\nA = U\u03a3VT    (3)\n\nwhere U \u2208 R^{V\u00d7D}, \u03a3 \u2208 R^{D\u00d7D}, and V \u2208 R^{D\u00d7D}. We multiply U and \u03a3 to factorize the original matrix into two parts: U\u03a3 and VT. Note that the U \u00d7 \u03a3 multiplication is negligible in evaluation time because we can keep the result as a single matrix.\nLarger singular values in \u03a3 are multiplied to the leftmost columns of U. As a result, the elements of the B (= U\u03a3) matrix are statistically arranged in descending order of magnitude, from the first column to the last. 
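The factorization step can be checked with a short numpy sketch (a random matrix stands in for a trained A; sizes are illustrative). Since the columns of U are orthonormal, the column norms of B = U\u03a3 are exactly the singular values, so they decay from left to right:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 5_000, 64
A = rng.standard_normal((V, D))

# Thin SVD: U is V x D, s holds the D singular values in descending order,
# Vt is D x D (the transpose of V in the text's notation).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U * s                         # same as U @ np.diag(s), kept as one matrix

assert np.allclose(A, B @ Vt)     # A is recovered exactly (up to float error)

# Leftmost columns of B dominate: the column norms equal s, which is descending.
col_norms = np.linalg.norm(B, axis=0)
assert np.all(np.diff(col_norms) <= 1e-8)
```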
The leftmost columns of B are more influential than the rightmost columns.\n\nAlgorithm 1 Algorithm of the proposed SVD-softmax.\n1: input: trained weight matrix A, input vector h, bias vector b\n2: hyperparameter: width of preview window W, number of full-view vectors N\n3: initialize: decompose A = U\u03a3VT, B = U\u03a3\n4: \u02dch = VT \u00d7 h\n5: \u02dcz = B[:, : W] \u00d7 \u02dch[: W] + b    \u25b7 compute preview outputs with only W dimensions\n6: Sort \u02dcz in descending order\n7: CN = top-N word indices of \u02dcz    \u25b7 select the N words with the largest preview outputs\n8: for all id in CN do\n9:   \u02dcz[id] = B[id, :] \u00d7 \u02dch + b[id]    \u25b7 update selected words by full-view vector multiplication\n10: end for\n11: \u02dcZ = \u2211_i exp(\u02dczi)\n12: \u02dcy = exp(\u02dcz)/\u02dcZ    \u25b7 compute probability distribution using softmax\n13: return \u02dcy\n\n3.2 Softmax approximation\n\nAlgorithm 1 shows the softmax approximation procedure, which is also illustrated in Figure 1. Previous methods needed to compare every output vector element to find the top-K words. Instead of using the full-length vector, we consult every word with a window of restricted length W. We call this the \"preview window\" and the results the \"preview outputs.\" Note that adding the bias b in the preview output computation is crucial for the performance. Since the larger singular values are multiplied into the leftmost columns, it is reasonable to assume that the most important portion of the output vector is already computed within the preview window.\nHowever, we find that the preview outputs do not suffice to obtain accurate results. To increase the accuracy, the N largest candidates CN are selected by sorting the V preview outputs. 
The selected candidates are recomputed with the full-length window. We call these candidates the \"full-view\" vectors. As a result, N outputs are computed exactly, while (V \u2212 N) outputs are only an approximation based on the preview outputs. In other words, only the selected indices use the full window for output vector computation. Finally, the softmax function is applied to the output vector to normalize the probability distribution. The modified output vector \u02dczk is formulated as\n\n\u02dczk = Bk\u02dch + bk,  if k \u2208 CN\n\u02dczk = Bk[: W]\u02dch[: W] + bk,  otherwise    (4)\n\nwhere B \u2208 R^{V\u00d7D} and \u02dch = VTh \u2208 R^D. Note that if k \u2208 CN, \u02dczk is equal to zk. The computational complexity is reduced from O(V \u00d7 D) to O(V \u00d7 W + N \u00d7 D).\n\n3.3 Metrics\n\nTo observe the accuracy of every word probability, we use the Kullback-Leibler divergence (KLD) as a metric. KLD shows the closeness of the approximated distribution to the actual one. Perplexity, or negative log-likelihood (NLL), is a useful measurement for likelihood estimation. The gap between the full-softmax and SVD-softmax NLL should be small. For the evaluation of a given word, the accuracy of the probability depends only on the normalization factor Z, and therefore we also monitor the denominator of the softmax function.\nWe define \"top-K coverage,\" which represents how many top-K words of full-softmax are included in the top-K words of SVD-softmax. For beam-search purposes, it is important to correctly select the top-K words, as beam paths might change if the order is scrambled.\n\n4 Experimental results\n\nThe experiments were performed on three datasets and two different applications: language modeling and machine translation. The WikiText-2 [20] and One Billion Word benchmark (OBW) [21]\n\n\fTable 1: Effect of the number of hidden units on the WikiText-2 language model. 
The number of full-view vectors is fixed to 3,300 for the table, which is about 10% of the size of the vocabulary. Top-K denotes the top-K coverage defined in Section 3.3. The values are averaged.\n\nD     W     \u02dcZ/Z      KLD       NLL (full/SVD)   Top-10   Top-100   Top-1000\n256   16    0.9813    0.03843   4.408 / 4.518    9.97     99.47     952.71\n256   32    0.9914    0.01134   4.408 / 4.441    10.00    99.97     986.94\n512   32    0.9906    0.01453   3.831 / 3.907    10.00    99.89     974.87\n512   64    0.9951    0.00638   3.831 / 3.852    10.00    99.99     993.35\n1024  64    0.9951    0.00656   3.743 / 3.789    10.00    99.99     992.62\n1024  128   0.9971    0.00353   3.743 / 3.761    10.00    100.00    998.28\n\ndatasets were used for language modeling. The neural machine translation (NMT) from German to English was trained with a dataset provided by the OpenNMT toolkit [22].\nWe first analyzed the extent to which the preview window size W and the number of full-view vectors N affect the overall performance and searched for the best working combination.\n\n4.1 Effect of the number of hidden units on preview window size\n\nTo find the relationship between the preview window\u2019s width and the approximation quality, three LMs trained with WikiText-2 were tested. WikiText is a recently introduced text dataset [20]. The WikiText-2 dataset contains a 33,278-word vocabulary and approximately 2M training tokens. An RNN with a single LSTM layer [6] was used for language modeling. Traditional full-softmax was used for the output layer. The number of LSTM units was the same as the input embedding dimension. Three models were trained on WikiText-2 with the number of hidden units D being 256, 512, and 1,024.\nThe models were trained with stochastic gradient descent (SGD) with an initial learning rate of 1.0 and momentum of 0.95. The batch size was set to 20, and the network was unrolled for 35 timesteps. Dropout [23] was applied to the LSTM output with a drop ratio of 0.5. 
Gradient clipping [24] with a maximum norm value of 5 was applied.\nThe preview window widths W selected were 16, 32, 64, and 128, and the numbers of full-view candidates N were 5% and 10% of the full vocabulary size for all three models. One thousand sequential frames were used for the evaluation. Table 1 shows the results of selected experiments, which indicate that the sufficient preview window size is proportional to the hidden layer dimension D. In most cases, 1/8 of D is an adequate window width, which costs 12.5% of the multiplications. Over 99% of the denominator is covered. KLD and NLL show that the approximation produces almost the same results as the original. The top-K words are also computed precisely. We also checked that the order of the top-K words was preserved. The results show that using too short a window width degrades the performance considerably.\n\n4.2 Effect of the vocabulary size on the number of full-view vectors\n\nThe OBW dataset was used to analyze the effect of vocabulary size on SVD-softmax. This benchmark is a huge dataset with a 793,472-word vocabulary. The model used 256-dimension word embedding, an LSTM layer of 2,048 units, and a full-softmax output layer. The RNN LM was trained with SGD with an initial learning rate of 1.0.\nWe explored multiple models by employing vocabulary sizes of 8,004, 80,004, 401,951, and 793,472, abbreviated as 8K, 80K, 400K, and 800K below. The 800K model follows the preprocessing consensus, keeping words that appear more than three times. The 400K vocabulary follows the same process as the 800K but without case sensitivity. The 8K and 80K data models were created by choosing the topmost frequent 8K and 80K words, respectively. Because of the limitation of GPU memory, the 800K model was trained with half-precision parameters. We used the full data for training.\n\n\fTable 2: Effect of the number of full-view vectors N on the One Billion Word benchmark language model. 
The preview window width is fixed to 256 in this table. We omit the ratio of the approximated \u02dcZ to the real Z, because the ratio is over 0.997 for all cases in the table. The multiplication ratio is relative to full-softmax, including the overhead of VT \u00d7 h.\n\nV     N      NLL (full/SVD)   Top-10   Top-50   Top-100   Top-500   Mult. ratio\n8K    1024   2.685 / 2.698    9.98     49.81    99.36     469.48    0.493\n8K    2048   2.685 / 2.687    9.99     49.99    99.89     496.05    0.605\n80K   4096   3.589 / 3.6051   10.00    49.94    99.85     497.73    0.195\n80K   8192   3.589 / 3.591    10.00    49.99    99.97     499.56    0.240\n400K  16384  3.493 / 3.495    10.00    50.00    100.00    499.90    0.171\n400K  32768  3.493 / 3.495    10.00    50.00    100.00    499.98    0.201\n800K  32768  4.688 / 4.718    10.00    49.99    99.96     499.99    0.168\n800K  65536  4.688 / 4.690    10.00    49.99    99.96     499.89    0.200\n\nTable 3: SVD-softmax on the machine translation task. The baseline perplexity and BLEU score are 10.57 and 21.98, respectively.\n\nW    N     Perplexity   BLEU\n200  5000  10.57        21.99\n200  2500  10.57        21.99\n200  1000  10.58        22.00\n100  5000  10.58        22.00\n100  2500  10.59        22.00\n100  1000  10.65        22.01\n50   5000  10.60        22.00\n50   2500  10.68        21.99\n50   1000  11.04        22.00\n\nThe preview window width and the number of full-view vectors were selected in powers of 2. The results were computed on randomly selected 2,000 consecutive frames.\nTable 2 shows the experimental results. With a fixed hidden dimension of 2,048, the required preview window width does not change significantly, which is consistent with the observations in Section 4.1. However, the number of full-view vectors N should increase as the vocabulary size grows. In our experiments, using 5% to 10% of the total vocabulary size as candidates sufficed to achieve a successful approximation. The results show that the proposed method is scalable and more efficient
The results prove that the proposed method is scalable and more ef\ufb01cient\nwhen applied to large vocabulary softmax.\n\n4.3 Result on machine translation\n\nNMT is based on neural networks and contains an internal softmax function. We applied SVD-\nsoftmax to a German to English NMT task to evaluate the actual performance of the proposed\nalgorithm.\nThe baseline network, which employs the encoder-decoder model with an attention mechanism\n[25, 26], was trained using the OpenNMT toolkit. The network was trained with concatenated\ndata which contained a WMT 2015 translation task [27], Europarl v7 [28], common crawl [29],\nand news commentary v10 [30], and evaluated with newstest 2013. The training and evaluation\ndata were tokenized and preprocessed by following the procedures in previous studies [31, 32] to\nconduct case-sensitive translation with 50,004 frequent words. The baseline network employed 500-\ndimension word embedding, encoder- and decoder-networks with two unidirectional LSTM layers\nwith 500 units each, and a full-softmax output layer. The network was trained with SGD with\nan initial learning rate of 1.0 while applying dropout [23] with ratio 0.3 between adjacent LSTM\nlayers. The rest of the training settings followed the OpenNMT training recipe, which is based on\n\n6\n\n\fFigure 2: Singular value plot of three WikiText-2 language models that differ in hidden vector\ndimension D \u2208 {256, 512, 1024}. The left hand side \ufb01gure represents the singular value for each\nelement, while the right hand side \ufb01gure illustrates the value proportional to D. The dashed line\nimplies 0.125 = 1/8 point. Both are from the same data.\n\nprevious studies [31, 33]. The performance of the network was evaluated according to perplexity\nand the case-sensitive BLEU score [34], which was computed with the Moses toolkit [35]. 
During translation, a beam search was conducted with beam width 5.\nTo evaluate our algorithm, the preview window widths W selected were 25, 50, 100, and 200, and the numbers of full-view candidates N chosen were 1,000, 2,500, and 5,000.\nTable 3 shows the experimental results for perplexity and the BLEU score with respect to the preview window dimension W and the number of full-view vectors N. The full-softmax layer in the baseline model employed a hidden dimension D of 500 and computed the probability for V = 50,004 words. The experimental results show that a speedup can be achieved with preview width W = 100, which is 1/5 of D, and the number of full-view vectors N = 2,500 or 5,000, which is 1/20 or 1/10 of V. The parameters chosen did not affect the translation performance in terms of perplexity. For a wider W, it is possible to use a smaller N. The experimental results show that SVD-softmax is also effective when applied to NMT tasks.\n\n5 Discussion\n\nIn this section, we provide empirical evidence of the reasons why SVD-softmax operates efficiently. We also present the results of an implementation on a GPU.\n\n5.1 Analysis of W, N, and D\n\nWe first explain the reason the required preview window width W is proportional to the hidden vector size D. Figure 2 shows the singular value distribution of the WikiText-2 LM softmax weights. We observed that the distributions are similar for all three cases when the singular value indices are scaled with D. Thus, it is important to preserve the ratio between W and D. The ratio of the singular values in a D/8 window over the total sum of singular values for the 256, 512, and 1,024 hidden vector dimensions is 0.42, 0.38, and 0.34, respectively.\nFurthermore, we explore the manner in which W and N affect the normalization term, i.e., the denominator. Figure 3 shows how the denominator is approximated while changing W or N. 
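The quoted ratios come from trained weights, which we cannot reproduce here, but the measurement itself is short; a hedged helper (`singular_mass_ratio` is our name, and the random matrix below is only a stand-in for real LM weights) that computes the share of the singular-value sum inside a D/8 window:

```python
import numpy as np

def singular_mass_ratio(A, frac=1 / 8):
    """Fraction of the total singular-value sum captured by the first
    ceil(frac * D) singular values of a V x D matrix A."""
    s = np.linalg.svd(A, compute_uv=False)    # descending singular values
    w = int(np.ceil(frac * A.shape[1]))
    return s[:w].sum() / s.sum()

rng = np.random.default_rng(3)
M = rng.standard_normal((4_000, 256))         # stand-in for softmax weights
ratio = singular_mass_ratio(M)                # D/8 window share for M
```

Because the singular values are sorted in descending order, the D/8 window always captures at least 1/8 of the total mass; how far above 1/8 it lands (0.34 to 0.42 for the models above) is what makes the preview informative.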
Note that the leftmost column of Figure 3 represents the case in which no full-view vectors were used.\n\n5.2 Computational efficiency\n\nThe modeled number of multiplications in Table 2 shows that the required computation can be decreased to 20%. After factorization, the overhead of the matrix multiplication with VT, which is O(D^2), is a fixed cost. In most cases, especially with a very large vocabulary, V is significantly larger than D, and the additional computation cost is negligible. However, as V decreases, the portion of the overhead increases.\n\nFigure 3: Heatmap of the approximated normalization factor ratio \u02dcZ/Z. The x and y axes represent N and W, respectively. The WikiText-2 language model with D = 1,024 was used. Note that the maximum values of W and N are 1,024 and 33,278, respectively. The gray line separates the area by 0.99 as a threshold. Best viewed in color.\n\nTable 4: Measured time (ms) of full-softmax and SVD-softmax on a GPU and CPU. The experiment was conducted on an NVIDIA GTX Titan-X (Pascal) GPU and an Intel i7-6850 CPU. The second column corresponds to full-softmax, while the other columns represent each step of SVD-softmax. The cost of the sorting, exponential, and sum operations is omitted, as their time consumption is negligible.\n\nDevice   Full-softmax: A \u00d7 h (262k, 2k) \u00d7 2k   VT \u00d7 h (2k, 2k) \u00d7 2k   Preview window (262k, 256) \u00d7 256   Full-view vectors (16k, 2k) \u00d7 2k   Sum (speedup)\nGPU      14.12                                  0.33                    2.98                               1.12                               4.43 (\u00d73.19)\nCPU      1541.43                                25.32                   189.27                             88.98                              303.57 (\u00d75.08)\n\nWe provide an example of time consumption on a CPU and GPU. Assume the weight A is a 262K (V = 2^18) by 2K (D = 2^11) matrix and SVD-softmax is applied with a preview window width of 256 and a number of full-view vectors of 16K (N = 2^14). 
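The multiply-add counts behind this setting can be checked by hand: full-softmax costs V\u00b7D, while SVD-softmax costs V\u00b7W for the preview, N\u00b7D for the full-view rows, and D\u00b2 for the VT \u00d7 h overhead. A few lines of Python confirm the roughly-20% figure:

```python
V, D = 2 ** 18, 2 ** 11    # 262,144 words, 2,048 hidden units
W, N = 256, 2 ** 14        # preview width D/8, full-view count V/16

full_cost = V * D                      # full-softmax multiply-adds
svd_cost = V * W + N * D + D * D       # preview + full-view + V^T h overhead
ratio = svd_cost / full_cost
print(ratio)                           # 0.1953125, i.e. about 20% of full-softmax
```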
This corresponds to W/D = 1/8 and N/V = 1/16. The setting closely simulates a real LM environment and uses the recommended SVD-softmax hyperparameters discussed above. We used our highly optimized custom CUDA kernel for the GPU evaluation. The matrix B was stored in row-major order for convenient full-view vector evaluation.\nAs observed in Table 4, the time consumption is reduced by approximately 70% on the GPU and approximately 80% on the CPU. Note that the GPU kernel is fully parallelized while the CPU code employs sequential logic. We also tested various vocabulary sizes and hidden dimensions on the custom kernel; a speedup was observed in most cases, although it was less effective for small vocabularies.\n\n5.3 Compatibility with other methods\n\nThe proposed method is compatible with a neural network trained with sampling-based softmax approximations. SVD-softmax is also applicable to hierarchical softmax and adaptive softmax, especially when the vocabulary is large. Hierarchical methods need a large weight matrix multiplication to gather every word probability, and SVD-softmax can reduce this computation. We tested SVD-softmax with various softmax approximations and observed that a significant number of multiplications is removed while the performance is not significantly affected relative to full softmax.\n\n\f6 Conclusion\n\nWe present SVD-softmax, an efficient softmax approximation algorithm, which is effective for computing top-K word probabilities. The proposed method factorizes the matrix by SVD, and only a part of the SVD-transformed matrix is previewed to determine which words are worth preserving. A guideline for hyperparameter selection was given empirically. Language modeling and NMT experiments were conducted. Our method reduces the number of multiplication operations to only 20% of that of the full-softmax with little performance degradation. The proposed SVD-softmax is a simple
The proposed SVD-softmax is a simple\nyet powerful computation reduction technique.\n\nAcknowledgments\n\nThis work was supported in part by the Brain Korea 21 Plus Project and the Na-\ntional Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP)\n(No.2015R1A2A1A10056051).\n\nReferences\n[1] Tomas Mikolov, Martin Kara\ufb01\u00e1t, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur, \u201cRe-\n\ncurrent neural network based language model,\u201d in Interspeech, 2010, vol. 2, p. 3.\n\n[2] Tom\u00e1\u0161 Mikolov, Stefan Kombrink, Luk\u00e1\u0161 Burget, Jan \u02c7Cernock`y, and Sanjeev Khudanpur, \u201cEx-\ntensions of recurrent neural network language model,\u201d in Acoustics, Speech and Signal Pro-\ncessing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5528\u20135531.\n\n[3] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, \u201cLanguage modeling with\n\ngated convolutional networks,\u201d arXiv preprint arXiv:1612.08083, 2016.\n\n[4] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, \u201cListen, attend and spell: A\nneural network for large vocabulary conversational speech recognition,\u201d in Acoustics, Speech\nand Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp.\n4960\u20134964.\n\n[5] Kyuyeon Hwang and Wonyong Sung, \u201cCharacter-level incremental speech recognition with\nrecurrent neural networks,\u201d in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE\nInternational Conference on. IEEE, 2016, pp. 5335\u20135339.\n\n[6] Sepp Hochreiter and J\u00fcrgen Schmidhuber, \u201cLong short-term memory,\u201d Neural Computation,\n\nvol. 9, no. 8, pp. 1735\u20131780, 1997.\n\n[7] Xunying Liu, Yongqiang Wang, Xie Chen, Mark JF Gales, and Philip C Woodland, \u201cEf\ufb01cient\nlattice rescoring using recurrent neural network language models,\u201d in Acoustics, Speech and\nSignal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 
4908\u2013\n4912.\n\n[8] Gene H Golub and Christian Reinsch, \u201cSingular value decomposition and least squares solu-\n\ntions,\u201d Numerische Mathematik, vol. 14, no. 5, pp. 403\u2013420, 1970.\n\n[9] Yoshua Bengio, Jean-S\u00e9bastien Sen\u00e9cal, et al., \u201cQuick training of probabilistic neural nets by\n\nimportance sampling.,\u201d in AISTATS, 2003.\n\n[10] Andriy Mnih and Yee Whye Teh, \u201cA fast and simple algorithm for training neural probabilistic\n\nlanguage models,\u201d arXiv preprint arXiv:1206.6426, 2012.\n\n[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, \u201cDistributed repre-\nsentations of words and phrases and their compositionality,\u201d in Advances in Neural Information\nProcessing Systems, 2013, pp. 3111\u20133119.\n\n[12] Shihao Ji, SVN Vishwanathan, Nadathur Satish, Michael J Anderson, and Pradeep Dubey,\n\u201cBlackout: Speeding up recurrent neural network language models with very large vocabular-\nies,\u201d arXiv preprint arXiv:1511.06909, 2015.\n\n9\n\n\f[13] Frederic Morin and Yoshua Bengio,\n\n\u201cHierarchical probabilistic neural network language\n\nmodel,\u201d in AISTATS. Citeseer, 2005, vol. 5, pp. 246\u2013252.\n\n[14] Andriy Mnih and Geoffrey E Hinton, \u201cA scalable hierarchical distributed language model,\u201d in\n\nAdvances in Neural Information Processing Systems, 2009, pp. 1081\u20131088.\n\n[15] Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and Fran\u00e7ois Yvon, \u201cStruc-\ntured output layer neural network language model,\u201d in Acoustics, Speech and Signal Process-\ning (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 
5524–5527.

[16] Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou, "Efficient softmax approximation for GPUs," arXiv preprint arXiv:1609.04309, 2016.

[17] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M Schwartz, and John Makhoul, "Fast and robust neural network joint models for statistical machine translation," in ACL (1). Citeseer, 2014, pp. 1370–1380.

[18] Jacob Andreas, Maxim Rabinovich, Michael I Jordan, and Dan Klein, "On the accuracy of self-normalized log-linear models," in Advances in Neural Information Processing Systems, 2015, pp. 1783–1791.

[19] Welin Chen, David Grangier, and Michael Auli, "Strategies for training large vocabulary neural language models," arXiv preprint arXiv:1512.04906, 2015.

[20] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher, "Pointer sentinel mixture models," arXiv preprint arXiv:1609.07843, 2016.

[21] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson, "One billion word benchmark for measuring progress in statistical language modeling," arXiv preprint arXiv:1312.3005, 2013.

[22] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush, "OpenNMT: Open-source toolkit for neural machine translation," arXiv preprint arXiv:1701.02810, 2017.

[23] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[24] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks," ICML (3), vol. 28, pp.
1310–1318, 2013.

[25] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[26] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[27] Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi, "Findings of the 2015 workshop on statistical machine translation," in Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015, pp. 1–46.

[28] Philipp Koehn, "Europarl: A parallel corpus for statistical machine translation," in MT Summit, 2005, vol. 5, pp. 79–86.

[29] Common Crawl Foundation, "Common crawl," http://commoncrawl.org, 2016, Accessed: 2017-04-11.

[30] Jorg Tiedemann, "Parallel data, tools and interfaces in OPUS," in LREC, 2012, vol.
2012, pp. 2214–2218.

[31] Minh-Thang Luong, Hieu Pham, and Christopher D Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.

[32] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio, "On using very large target vocabulary for neural machine translation," arXiv preprint arXiv:1412.2007, 2014.

[33] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[34] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318.

[35] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al., "Moses: Open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007, pp. 177–180.