{"title": "A Neural Probabilistic Language Model", "book": "Advances in Neural Information Processing Systems", "page_first": 932, "page_last": 938, "abstract": null, "full_text": "A Neural Probabilistic Language Model \n\nYoshua Bengio*, R\u00e9jean Ducharme and Pascal Vincent \nD\u00e9partement d'Informatique et Recherche Op\u00e9rationnelle \nCentre de Recherche Math\u00e9matiques \nUniversit\u00e9 de Montr\u00e9al \nMontr\u00e9al, Qu\u00e9bec, Canada, H3C 3J7 \n{bengioy,ducharme,vincentp}@iro.umontreal.ca \n\nAbstract \n\nA goal of statistical language modeling is to learn the joint probability \nfunction of sequences of words. This is intrinsically difficult because of \nthe curse of dimensionality: we propose to fight it with its own weapons. \nIn the proposed approach one learns simultaneously (1) a distributed representation \nfor each word (i.e. a similarity between words) along with (2) \nthe probability function for word sequences, expressed with these representations. \nGeneralization is obtained because a sequence of words that \nhas never been seen before gets high probability if it is made of words \nthat are similar to words forming an already seen sentence. We report on \nexperiments using neural networks for the probability function, showing \non two text corpora that the proposed approach very significantly improves \non a state-of-the-art trigram model. \n\n1 Introduction \nA fundamental problem that makes language modeling and other learning problems difficult \nis the curse of dimensionality. It is particularly obvious in the case when one wants to \nmodel the joint distribution between many discrete random variables (such as words in a \nsentence, or discrete attributes in a data-mining task). For example, if one wants to model \nthe joint distribution of 10 consecutive words in a natural language with a vocabulary $V$ of \nsize 100,000, there are potentially $100000^{10} - 1 = 10^{50} - 1$ free parameters. 
\n\n*Y.B. was also with AT&T Research while doing this research. \n\nA statistical model of language can be represented by the conditional probability of the \nnext word given all the previous ones in the sequence, since $P(w_1^T) = \\prod_{t=1}^{T} P(w_t | w_1^{t-1})$, \nwhere $w_t$ is the $t$-th word, and writing subsequence $w_i^j = (w_i, w_{i+1}, \\ldots, w_{j-1}, w_j)$. \nWhen building statistical models of natural language, one reduces the difficulty by taking \nadvantage of word order, and the fact that temporally closer words in the word sequence are \nstatistically more dependent. Thus, n-gram models construct tables of conditional probabilities \nfor the next word, for each one of a large number of contexts, i.e. combinations of \nthe last $n - 1$ words: $P(w_t | w_1^{t-1}) \\approx P(w_t | w_{t-n+1}^{t-1})$. Only those combinations of successive \nwords that actually occur in the training corpus (or that occur frequently enough) are \nconsidered. What happens when a new combination of $n$ words appears that was not seen \nin the training corpus? A simple answer is to look at the probability predicted using smaller \ncontext size, as done in back-off trigram models [7] or in smoothed (or interpolated) trigram \nmodels [6]. So, in such models, how is generalization basically obtained from sequences of \nwords seen in the training corpus to new sequences of words? Simply by looking at a short \nenough context, i.e., the probability for a long sequence of words is obtained by \"gluing\" \nvery short pieces of length 1, 2 or 3 words that have been seen frequently enough in the \ntraining data. Obviously there is much more information in the sequence that precedes the \nword to predict than just the identity of the previous couple of words. There are at least two \nobvious flaws in this approach (which however has turned out to be very difficult to beat): \nfirst it is not taking into account contexts farther than 1 or 2 words, second it is not taking \naccount of the \"similarity\" between words. \n
For example, having seen the sentence \"The cat is walking in the bedroom\" in the \ntraining corpus should help us generalize to make the sentence \"A dog was running in a room\" \nalmost as likely, simply \nbecause \"dog\" and \"cat\" (resp. \"the\" and \"a\", \"room\" and \"bedroom\", etc.) have similar \nsemantics and grammatical roles. \n1.1 Fighting the Curse of Dimensionality with its Own Weapons \nIn a nutshell, the idea of the proposed approach can be summarized as follows: \n\n1. associate with each word in the vocabulary a distributed \"feature vector\" (a real-valued \nvector in $\\mathbb{R}^m$), thereby creating a notion of similarity between words, \n\n2. express the joint probability function of word sequences in terms of the feature \nvectors of these words in the sequence, and \n\n3. learn simultaneously the word feature vectors and the parameters of that function. \n\nThe feature vector represents different aspects of a word: each word is associated with a \npoint in a vector space. The number of features (e.g. $m = 30$, 60 or 100 in the experiments) \nis much smaller than the size of the vocabulary. The probability function is expressed as a \nproduct of conditional probabilities of the next word given the previous ones, (e.g. using \na multi-layer neural network in the experiment). This function has parameters that can be \niteratively tuned in order to maximize the log-likelihood of the training data or a regularized \ncriterion, e.g. by adding a weight decay penalty. The feature vectors associated with each \nword are learned, but they can be initialized using prior knowledge. \n\nWhy does it work? \n
In the previous example, if we knew that \"dog\" and \"cat\" played similar \nroles (semantically and syntactically), and similarly for (\"the\", \"a\"), (\"bedroom\", \"room\"), \n(\"is\", \"was\"), (\"running\", \"walking\"), we could naturally generalize from \"The cat is \nwalking in the bedroom\" to \"A dog was running in a room\" and likewise \nto many other combinations. In the proposed model, it will so generalize because \"similar\" \nwords should have a similar feature vector, and because the probability function is a \nsmooth function of these feature values, so a small change in the features (to obtain similar \nwords) induces a small change in the probability: seeing only one of the above sentences \nwill increase the probability not only of that sentence but also of its combinatorial number \nof \"neighbors\" in sentence space (as represented by sequences of feature vectors). \n\n1.2 Relation to Previous Work \n\nThe idea of using neural networks to model high-dimensional discrete distributions has \nalready been found useful in [3] where the joint probability of $Z_1 \\ldots Z_n$ is decomposed \nas a product of conditional probabilities: $P(Z_1 = z_1, \\ldots, Z_n = z_n) = \\prod_i P(Z_i = z_i | g_i(Z_{i-1}, Z_{i-2}, \\ldots, Z_1))$, \nwhere $g(\\cdot)$ is a function represented by part of a neural network, \nand it yields parameters for expressing the distribution of $Z_i$. Experiments on four UCI \ndata sets show this approach to work comparatively very well [3, 2]. The idea of a distributed \nrepresentation for symbols dates from the early days of connectionism [5]. More \nrecently, Hinton's approach was improved and successfully demonstrated on learning several \nsymbolic relations [9]. The idea of using neural networks for language modeling is not \nnew either, e.g. [8]. In contrast, here we push this idea to a large scale, and concentrate on \nlearning a statistical model of the distribution of word sequences, rather than learning the \nrole of words in a sentence. \n
The proposed approach is also related to previous proposals \nof character-based text compression using neural networks [11]. Learning a clustering of \nwords [10, 1] is also a way to discover similarities between words. In the model proposed \nhere, instead of characterizing the similarity with a discrete random or deterministic variable \n(which corresponds to a soft or hard partition of the set of words), we use a continuous \nreal-vector for each word, i.e. a distributed feature vector, to indirectly represent similarity \nbetween words. The idea of using a vector-space representation for words has been well \nexploited in the area of information retrieval (for example see [12]), where vectorial feature \nvectors for words are learned on the basis of their probability of co-occurring in the \nsame documents (Latent Semantic Indexing [4]). An important difference is that here we \nlook for a representation for words that is helpful in representing compactly the probability \ndistribution of word sequences from natural language text. Experiments indicate that \nlearning jointly the representation (word features) and the model makes a big difference in \nperformance. \n2 The Proposed Model: two Architectures \nThe training set is a sequence $w_1 \\cdots w_T$ of words $w_t \\in V$, where the vocabulary $V$ is a \nlarge but finite set. The objective is to learn a good model $f(w_t, \\cdots, w_{t-n}) = P(w_t | w_1^{t-1})$, \nin the sense that it gives high out-of-sample likelihood. In the experiments, we will report \nthe geometric average of $1/P(w_t | w_1^{t-1})$, also known as perplexity, which is also the exponential \nof the average negative log-likelihood. The only constraint on the model is that \nfor any choice of $w_1^{t-1}$, $\\sum_{i=1}^{|V|} f(i, w_{t-1}, \\cdots, w_{t-n}) = 1$. By the product of these conditional \nprobabilities, one obtains a model of the joint probability of any sequence of words. \n\nThe basic form of the model is described here. \n
Refinements to speed it up and extend it \nwill be described in the following sections. We decompose the function $f(w_t, \\cdots, w_{t-n}) = P(w_t | w_1^{t-1})$ \nin two parts: \n\n1. A mapping $C$ from any element of $V$ to a real vector $C(i) \\in \\mathbb{R}^m$. It represents \nthe \"distributed feature vector\" associated with each word in the vocabulary. In \npractice, $C$ is represented by a $|V| \\times m$ matrix (of free parameters). \n\n2. The probability function over words, expressed with $C$. We have considered two \nalternative formulations: \n\n(a) The direct architecture: a function $g$ maps a sequence of feature \nvectors for words in context $(C(w_{t-n}), \\cdots, C(w_{t-1}))$ to a probability \ndistribution over words in $V$. It is a vector function whose $i$-th \nelement estimates the probability $P(w_t = i | w_1^{t-1})$ as in figure 1. \n$f(i, w_{t-1}, \\cdots, w_{t-n}) = g(i, C(w_{t-1}), \\cdots, C(w_{t-n}))$. We used the \"softmax\" \nin the output layer of a neural net: $P(w_t = i | w_1^{t-1}) = e^{h_i} / \\sum_j e^{h_j}$, \nwhere $h_i$ is the neural network output score for word $i$. \n\n(b) The cycling architecture: a function $h$ maps a sequence of feature \nvectors $(C(w_{t-n}), \\cdots, C(w_{t-1}), C(i))$ (i.e. including the context words \nand a candidate next word $i$) to a scalar $h_i$, and again using a softmax, \n$P(w_t = i | w_1^{t-1}) = e^{h_i} / \\sum_j e^{h_j}$. $f(w_t, w_{t-1}, \\cdots, w_{t-n}) = g(C(w_t), C(w_{t-1}), \\cdots, C(w_{t-n}))$. \nWe call this architecture \"cycling\" because \none repeatedly runs $h$ (e.g. a neural net), each time putting in input the \nfeature vector $C(i)$ for a candidate next word $i$. \n\nThe function $f$ is a composition of these two mappings ($C$ and $g$), with $C$ being shared \nacross all the words in the context. To each of these two parts are associated some parameters. \nThe parameters of the mapping $C$ are simply the feature vectors themselves \n(represented by a $|V| \\times m$ matrix $C$ whose row $i$ is the feature vector $C(i)$ for word $i$). \n
The \nfunction $g$ may be implemented by a feed-forward or recurrent neural network or another \nparameterized function, with parameters $\\theta$. \n\n[Figure 1 diagram: indices for $w_{t-n}$, ..., $w_{t-2}$, $w_{t-1}$ go through a table look-up in $C$; the $i$-th output = P($w_t = i$ | context), computed only for words in the short list.] \n\nFigure 1: \"Direct Architecture\": $f(i, w_{t-1}, \\cdots, w_{t-n}) = g(i, C(w_{t-1}), \\cdots, C(w_{t-n}))$ \nwhere $g$ is the neural network and $C(i)$ is the $i$-th word feature vector. \n\nTraining is achieved by looking for $(\\theta, C)$ that maximize the training corpus penalized log-likelihood: \n$L = \\frac{1}{T} \\sum_t \\log p_{w_t}(C(w_{t-n}), \\cdots, C(w_{t-1}); \\theta) + R(\\theta, C)$, where $R(\\theta, C)$ is a \nregularization term (e.g. a weight decay $\\lambda ||\\theta||^2$) that penalizes slightly the norm of $\\theta$. \n\n3 Speeding-up and other Tricks \nShort list. The main idea is to focus the effort of the neural network on a \"short list\" \nof words that have the highest probability. This can save much computation because in \nboth of the proposed architectures the time to compute the probability of the observed \nnext word scales almost linearly with the number of words in the vocabulary (because \nthe scores $h_i$ associated with each word $i$ in the vocabulary must be computed for properly \nnormalizing probabilities with the softmax). The idea of the speed-up trick is the \nfollowing: instead of computing the actual probability of the next word, the neural network \nis used to compute the relative probability of the next word within that short list. \nThe choice of the short list depends on the current context (the previous $n$ words). We \nhave used our smoothed trigram model to pre-compute a short list containing the most \nprobable next words associated to the previous two words. The conditional probabilities \n$P(w_t = i | h_t)$ are thus computed as follows, denoting with $h_t$ the history (context) before \n$w_t$, and $L_t$ the short list of words for the prediction of $w_t$. \n
If $i \\in L_t$ then the probability \nis $P_{NN}(w_t = i | w_t \\in L_t, h_t) P_{trigram}(w_t \\in L_t | h_t)$, else it is $P_{trigram}(w_t = i | h_t)$, \nwhere $P_{NN}(w_t = i | w_t \\in L_t, h_t)$ are the normalized scores of the words computed by \nthe neural network, where the \"softmax\" is only normalized over the words in the short \nlist $L_t$, and $P_{trigram}(w_t \\in L_t | h_t) = \\sum_{i \\in L_t} P_{trigram}(i | h_t)$, with $P_{trigram}(i | h_t)$ standing \nfor the next-word probabilities computed by the smoothed trigram. Note that both $L_t$ and \n$P_{trigram}(w_t \\in L_t | h_t)$ can be pre-computed (and stored in a hash table indexed by the last \ntwo words). \n\nTable look-up for recognition. To speed up application of the trained model, one can \npre-compute in a hash table the output of the neural network, at least for the most frequent \ninput contexts. In that case, the neural network will only be rarely called upon, and the \naverage computation time will be very small. Note that in a speech recognition system, one \nneeds only compute the relative probabilities of the acoustically ambiguous words in each \ncontext, also reducing drastically the amount of computations. \n\nStochastic gradient descent. Since we have millions of examples, it is important to converge \nwithin only a few passes through the data. For very large data sets, stochastic gradient \ndescent convergence time seems to increase sub-linearly with the size of the data set (see \nexperiments on Brown vs Hansard below). To speed up training using stochastic gradient \ndescent, we have found it useful to break the corpus into paragraphs and to randomly permute \nthem. In this way, some of the non-stationarity in the word stream is eliminated, yielding \nfaster convergence. \n\nCapacity control. For the \"smaller corpora\" like Brown (1.2 million examples), we have \nfound early stopping and weight decay useful to avoid over-fitting. For the larger corpora, \nour networks still under-fit. \n
For the larger corpora, we have found double-precision computation \nto be very important to obtain good results. \n\nMixture of models. We have found improved performance by combining the probability \npredictions of the neural network with those of the smoothed trigram, with weights that \nwere conditional on the frequency of the context (same procedure used to combine trigram, \nbigram, and unigram in the smoothed trigram). \n\nInitialization of word feature vectors. We have tried both random initialization (uniform \nbetween -0.01 and 0.01) and a \"smarter\" method based on a Singular Value Decomposition \n(SVD) of a very large matrix of \"context features\". These context features are formed \nby counting the frequency of occurrence of each word in each one of the most frequent \ncontexts (word sequences) in the corpus. The idea is that \"similar\" words should occur \nwith similar frequency in the same contexts. We used about 9000 most frequent contexts, \nand compressed these to 30 features with the SVD. \n\nOut-of-vocabulary words. For an out-of-vocabulary word $w_t$ we need to come up with \na feature vector in order to predict the words that follow, or predict its probability (that \nis only possible with the cycling architecture). We used as feature vector the weighted \naverage feature vector of all the words in the short list, with the weights being the relative \nprobabilities of those words: $E[C(w_t) | h_t] = \\sum_i C(i) P(w_t = i | h_t)$. \n4 Experimental Results \nComparative experiments were performed on the Brown and Hansard corpora. The Brown \ncorpus is a stream of 1,181,041 words (from a large variety of English texts and books). \nThe first 800,000 words were used for training, the following 200,000 for validation (model \nselection, weight decay, early stopping) and the remaining 181,041 for testing. \n
The number \nof different words is 47,578 (including punctuation, distinguishing between upper and \nlower case, and including the syntactical marks used to separate texts and paragraphs). \nRare words with frequency $\\leq 3$ were merged into a single token, reducing the vocabulary \nsize to $|V|$ = 16,383. \n\nThe Hansard corpus (Canadian parliament proceedings, French version) is a stream of \nabout 34 million words, of which 32 million (set A) were used for training, 1.1 million \n(set B) were used for validation, and 1.2 million (set C) were used for out-of-sample tests. \nThe original data has 106,936 different words, and those with frequency $\\leq 10$ were merged \ninto a single token, yielding $|V|$ = 30,959 different words. \n\nThe benchmark against which the neural network was compared is an interpolated or \nsmoothed trigram model [6]. Let $q_t = l(\\mathrm{freq}(w_{t-1}, w_{t-2}))$ represent the discretized frequency \nof occurrence of the context $(w_{t-1}, w_{t-2})$ (we used $l(x) = \\lceil -\\log((1 + x)/T) \\rceil$ \nwhere $x$ is the frequency of occurrence of the context and $T$ is the size of the training corpus). \nA conditional mixture of the trigram, bigram, unigram and zero-gram was learned on \nthe validation set, with mixture weights conditional on discretized frequency. \n\nBelow are measures of test set perplexity (geometric average of $1/P(w_t | w_1^{t-1})$) for different \nmodels $P$. Apparent convergence of the stochastic gradient descent procedure was \nobtained after around 10 epochs for Hansard and after about 50 epochs for Brown, with \na learning rate gradually decreased from approximately $10^{-3}$ to $10^{-5}$. Weight decay of \n$10^{-4}$ or $10^{-5}$ was used in all the experiments (based on a few experiments compared on \nthe validation set). \n\nThe main result is that the neural network performs much better than the smoothed trigram. \n
On Brown the best neural network system, according to validation perplexity \n(among different architectures tried, see below) yielded a perplexity of 258, while the \nsmoothed trigram yields a perplexity of 348, which is about 35% worse. This is obtained \nusing a network with the direct architecture mixed with the trigram (conditional mixture), \nwith 30 word features initialized with the SVD method, 40 hidden units, and $n = 5$ words \nof context. On Hansard, the corresponding figures are 44.8 for the neural network and 54.1 \nfor the smoothed trigram, which is 20.7% worse. This is obtained with a network with the \ndirect architecture, 100 randomly initialized word features, 120 hidden units, and $n = 8$ \nwords of context. \n\nMore context is useful. Experiments with the cycling architecture on Brown, with 30 \nword features, and 30 hidden units, varying the number of context words: $n = 1$ (like the \nbigram) yields a test perplexity of 302, $n = 3$ yields 291, $n = 5$ yields 281, $n = 8$ yields \n279 (N.B. the smoothed trigram yields 348). \n\nHidden units help. Experiments with the direct architecture on Brown (with direct input \nto output connections), with 30 word features, 5 words of context, varying the number of \nhidden units: 0 yields a test perplexity of 275, 10 yields 267, 20 yields 266, 40 yields 265, \n80 yields 265. \n\nLearning the word features jointly is important. Experiments with the direct architecture \non Brown (40 hidden units, 5 words of context), in which the word features initialized \nwith the SVD method are kept fixed during training yield a test perplexity of 345.8 whereas \nif the word features are trained jointly with the rest of the parameters, the perplexity is 265. \n\nInitialization not so useful. \n
Experiments on Brown with both architectures reveal that the \nSVD initialization of the word features does not bring much improvement with respect to \nrandom initialization: it speeds up initial convergence (saving about 2 epochs), and yields \na perplexity improvement of less than 0.3%. \n\nDirect architecture works a bit better. The direct architecture was found about 2% better \nthan the cycling architecture. \n\nConditional mixture helps but even without it the neural net is better. On Brown, the \nbest neural net without the mixture yields a test perplexity of 265, the smoothed trigram \nyields 348, and their conditional mixture yields 258 (i.e., better than both). On Hansard \nthe improvement is less: a neural network yielding 46.7 perplexity, mixed with the trigram \n(54.1), yields a mixture with perplexity 45.1. \n\n5 Conclusions and Proposed Extensions \nThe experiments on two corpora, a medium one (1.2 million words), and a large one (34 \nmillion words) have shown that the proposed approach yields much better perplexity than \na state-of-the-art method, the smoothed trigram, with differences on the order of 20% to \n35%. \n\nWe believe that the main reason for these improvements is that the proposed approach \nallows one to take advantage of the learned distributed representation to fight the curse of dimensionality \nwith its own weapons: each training sentence informs the model about a \ncombinatorial number of other sentences. Note that if we had a separate feature vector \nfor each \"context\" (short sequence of words), the model would have much more capacity \n(which could grow like that of n-grams) but it would not naturally generalize between the \nmany different ways a word can be used. A more reasonable alternative would be to explore \nlanguage units other than words (e.g. some short word sequences, or alternatively \nsome sub-word morphemic units). 
\n\nThere is probably much more to be done to improve the model, at the level of architecture, \ncomputational efficiency, and taking advantage of prior knowledge. An important priority \nof future research should be to evaluate and improve the speeding-up tricks proposed here, \nand find ways to increase capacity without increasing training time too much (to deal with \ncorpora with hundreds of millions of words). A simple idea to take advantage of temporal \nstructure and extend the size of the input window to include possibly a whole paragraph, \nwithout increasing too much the number of parameters, is to use a time-delay and possibly \nrecurrent neural network. In such a multi-layered network the computation that has been \nperformed for small groups of consecutive words does not need to be redone when the \nnetwork input window is shifted. Similarly, one could use a recurrent network to capture \npotentially even longer term information about the subject of the text. \n\nA very important area in which the proposed model could be improved is in the use of prior \nlinguistic knowledge: semantic (e.g. WordNet), syntactic (e.g. a tagger), and morphological \n(radix and morphemes). Looking at the word features learned by the model should \nhelp understand it and improve it. Finally, future research should establish how useful the \nproposed approach will be in applications to speech recognition, language translation, and \ninformation retrieval. \n\nAcknowledgments \n\nThe authors would like to thank L\u00e9on Bottou and Yann Le Cun for useful discussions. This \nresearch was made possible by funding from the NSERC granting agency. \n\nReferences \n[1] D. Baker and A. McCallum. Distributional clustering of words for text classification. In \nSIGIR '98, 1998. \n\n[2] S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using \nneural networks. \n
IEEE Transactions on Neural Networks, special issue on Data Mining and \nKnowledge Discovery, 11(3):550-557, 2000. \n\n[3] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer \nneural networks. In S. A. Solla, T. K. Leen, and K-R. M\u00fcller, editors, Advances in Neural \nInformation Processing Systems 12, pages 400-406. MIT Press, 2000. \n\n[4] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent \nsemantic analysis. Journal of the American Society for Information Science, 41(6):391-407, \n1990. \n\n[5] G.E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual \nConference of the Cognitive Science Society, pages 1-12, Amherst, 1986. Lawrence \nErlbaum, Hillsdale. \n\n[6] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse \ndata. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, \nAmsterdam, 1980. \n\n[7] Slava M. Katz. Estimation of probabilities from sparse data for the language model component \nof a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, \nASSP-35(3):400-401, March 1987. \n\n[8] R. Miikkulainen and M.G. Dyer. Natural language processing with modular neural networks \nand distributed lexicon. Cognitive Science, 15:343-399, 1991. \n\n[9] A. Paccanaro and G.E. Hinton. Extracting distributed representations of concepts and relations \nfrom positive and negative propositions. In Proceedings of the International Joint Conference \non Neural Network, IJCNN'2000, Como, Italy, 2000. IEEE, New York. \n\n[10] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual \nMeeting of the Association for Computational Linguistics, pages 183-190, Columbus, Ohio, \n1993. \n\n[11] J\u00fcrgen Schmidhuber. Sequential neural text compression. \n
IEEE Transactions on Neural Networks, 7(1):142-146, 1996. \n\n[12] H. Sch\u00fctze. Word space. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in \nNeural Information Processing Systems 5, pages 895-902, San Mateo CA, 1993. Morgan \nKaufmann. \n", "award": [], "sourceid": 1839, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "R\u00e9jean", "family_name": "Ducharme", "institution": null}, {"given_name": "Pascal", "family_name": "Vincent", "institution": null}]}