{"title": "A Neural Probabilistic Language Model", "book": "Advances in Neural Information Processing Systems", "page_first": 932, "page_last": 938, "abstract": null, "full_text": "A Neural Probabilistic Language Model \n\nYoshua Bengio; Rejean Ducharme and Pascal Vincent \nDepartement d'Informatique et Recherche Operationnelle \n\nCentre de Recherche Mathematiques \n\nUniversite de Montreal \n\nMontreal, Quebec, Canada, H3C 317 \n\n{bengioy,ducharme, vincentp }@iro.umontreal.ca \n\nAbstract \n\nA goal  of statistical language modeling is  to  learn  the joint probability \nfunction of sequences of words.  This is intrinsically difficult because of \nthe curse of dimensionality:  we propose to fight it with its own weapons. \nIn the proposed approach one learns simultaneously (1) a distributed rep(cid:173)\nresentation for each word (i.e.  a similarity between words) along with (2) \nthe probability function for word sequences, expressed with these repre(cid:173)\nsentations.  Generalization is  obtained because a sequence of words that \nhas  never been seen before gets  high probability if it is  made of words \nthat are similar to words forming an already seen sentence.  We report on \nexperiments using neural networks for the probability function, showing \non  two  text  corpora that  the  proposed approach  very  significantly  im(cid:173)\nproves on a state-of-the-art trigram model. \n\n1  Introduction \nA fundamental  problem that makes language modeling and other learning problems diffi(cid:173)\ncult is  the curse of dimensionality.  It is particularly obvious in the case when one wants to \nmodel the joint distribution between many  discrete random variables  (such as  words in a \nsentence, or discrete attributes in a data-mining task).  For example, if one wants to model \nthe joint distribution of 10 consecutive words in a natural language with a vocabulary V  of \nsize  100,000, there are potentially 10000010  - 1 = 1050  - 1 free parameters. \nA statistical  model  of language can be represented by  the  conditional  probability  of the \nnext word given all the previous ones in the sequence, since P( W'[)  = rri=l P( Wt Iwf-1), \nwhere Wt  is  the t-th word, and writing subsequence w[  = (Wi, Wi+1, ... , Wj-1, Wj). \nWhen building statistical models of natural language, one reduces the difficulty by taking \nadvantage of word order, and the fact that temporally closer words in the word sequence are \nstatistically  more dependent.  Thus, n-gram models construct tables of conditional proba(cid:173)\nbilities for the next word, for each one of a large number of contexts, i.e. combinations of \nthe last n  - 1 words:  p(wtlwf-1) ~ P(WtIW!=~+l)' Only those combinations of succes(cid:173)\nsive words that actually occur in the training corpus (or that occur frequently enough) are \nconsidered.  What happens when a new combination of n  words appears that was not seen \nin the training corpus? A simple answer is to look at the probability predicted using smaller \ncontext size, as done in back -off trigram models [7]  or in smoothed (or interpolated) trigram \nmodels [6].  So, in such models, how is generalization basically obtained from sequences of \n\n\"Y.B.  was also with AT&T Research while doing this research. \n\n\fwords seen in the training corpus to new sequences of words? 
So, in such models, how is generalization basically obtained from sequences of words seen in the training corpus to new sequences of words? Simply by looking at a short enough context, i.e., the probability for a long sequence of words is obtained by "gluing" very short pieces of length 1, 2 or 3 words that have been seen frequently enough in the training data. Obviously there is much more information in the sequence that precedes the word to predict than just the identity of the previous couple of words. There are at least two obvious flaws in this approach (which however has turned out to be very difficult to beat): first, it does not take into account contexts farther than 1 or 2 words; second, it does not take into account the "similarity" between words. For example, having seen the sentence "The cat is walking in the bedroom" in the training corpus should help us generalize to make the sentence "A dog was running in a room" almost as likely, simply because "dog" and "cat" (resp. "the" and "a", "room" and "bedroom", etc.) have similar semantic and grammatical roles.

1.1 Fighting the Curse of Dimensionality with its Own Weapons

In a nutshell, the idea of the proposed approach can be summarized as follows:

1. associate with each word in the vocabulary a distributed "feature vector" (a real-valued vector in $\mathbb{R}^m$), thereby creating a notion of similarity between words,

2. express the joint probability function of word sequences in terms of the feature vectors of the words in the sequence, and

3. learn simultaneously the word feature vectors and the parameters of that function.

The feature vector represents different aspects of a word: each word is associated with a point in a vector space. The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary. The probability function is expressed as a product of conditional probabilities of the next word given the previous ones (e.g. using a multi-layer neural network in the experiments). This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data or a regularized criterion, e.g. obtained by adding a weight decay penalty. The feature vectors associated with each word are learned, but they can be initialized using prior knowledge.

Why does it work? In the previous example, if we knew that "dog" and "cat" played similar roles (semantically and syntactically), and similarly for ("the", "a"), ("bedroom", "room"), ("is", "was"), ("running", "walking"), we could naturally generalize from "The cat is walking in the bedroom" to "A dog was running in a room" and likewise to many other combinations. In the proposed model it will generalize in this way because "similar" words should have similar feature vectors, and because the probability function is a smooth function of these feature values, so a small change in the features (to obtain similar words) induces a small change in the probability: seeing only one of the above sentences will increase the probability not only of that sentence but also of its combinatorial number of "neighbors" in sentence space (as represented by sequences of feature vectors).
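To make point (1) above concrete, here is a toy sketch (not part of the paper) of how a feature matrix C encodes word similarity: nearby points in the feature space correspond to words that play similar roles. The vocabulary, the hand-picked vectors standing in for learned ones, and the cosine-similarity measure are all illustrative assumptions.

```python
import numpy as np

# Hypothetical feature matrix C (|V| x m); in the model these vectors are
# learned, here they are hand-picked so that "cat"/"dog" and "bedroom"/"room"
# land close together.
vocab = ["the", "a", "cat", "dog", "bedroom", "room", "is", "was"]
C = np.array([
    [0.9, 0.1, 0.0],   # the
    [0.8, 0.2, 0.0],   # a
    [0.0, 0.9, 0.1],   # cat
    [0.1, 0.8, 0.1],   # dog
    [0.0, 0.1, 0.9],   # bedroom
    [0.1, 0.0, 0.8],   # room
    [0.5, 0.5, 0.0],   # is
    [0.4, 0.6, 0.0],   # was
])

def nearest(word):
    """Return the vocabulary sorted by cosine similarity to `word`."""
    v = C[vocab.index(word)]
    sims = C @ v / (np.linalg.norm(C, axis=1) * np.linalg.norm(v))
    return [vocab[i] for i in np.argsort(-sims)]

print(nearest("cat"))   # "dog" should come right after "cat" itself
```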
1.2 Relation to Previous Work

The idea of using neural networks to model high-dimensional discrete distributions has already been found useful in [3], where the joint probability of $Z_1, \ldots, Z_n$ is decomposed as a product of conditional probabilities: $P(Z_1 = z_1, \ldots, Z_n = z_n) = \prod_i P(Z_i = z_i \mid g_i(Z_{i-1}, Z_{i-2}, \ldots, Z_1))$, where $g(\cdot)$ is a function represented by part of a neural network that yields the parameters for expressing the distribution of $Z_i$. Experiments on four UCI data sets show this approach to work comparatively very well [3, 2]. The idea of a distributed representation for symbols dates from the early days of connectionism [5]. More recently, Hinton's approach was improved and successfully demonstrated on learning several symbolic relations [9]. The idea of using neural networks for language modeling is not new either, e.g. [8]. In contrast, here we push this idea to a large scale, and concentrate on learning a statistical model of the distribution of word sequences, rather than learning the role of words in a sentence. The proposed approach is also related to previous proposals of character-based text compression using neural networks [11]. Learning a clustering of words [10, 1] is also a way to discover similarities between words. In the model proposed here, instead of characterizing the similarity with a discrete random or deterministic variable (which corresponds to a soft or hard partition of the set of words), we use a continuous real vector for each word, i.e. a distributed feature vector, to indirectly represent similarity between words. The idea of using a vector-space representation for words has been well exploited in the area of information retrieval (for example see [12]), where feature vectors for words are learned on the basis of their probability of co-occurring in the same documents (Latent Semantic Indexing [4]). An important difference is that here we look for a representation for words that is helpful in representing compactly the probability distribution of word sequences from natural language text. Experiments indicate that learning jointly the representation (word features) and the model makes a big difference in performance.

2 The Proposed Model: Two Architectures

The training set is a sequence $w_1 \ldots w_T$ of words $w_t \in V$, where the vocabulary V is a large but finite set. The objective is to learn a good model $f(w_t, \ldots, w_{t-n}) = P(w_t \mid w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood. In the experiments, we will report the geometric average of $1/P(w_t \mid w_1^{t-1})$, also known as perplexity, which is also the exponential of the average negative log-likelihood. The only constraint on the model is that for any choice of $w_1^{t-1}$, $\sum_{i=1}^{|V|} f(i, w_{t-1}, \ldots, w_{t-n}) = 1$. By the product of these conditional probabilities, one obtains a model of the joint probability of any sequence of words.
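For reference, this perplexity can be computed as below; `model_prob` is a hypothetical stand-in for any conditional model of $P(w_t \mid \text{context})$, and the n-word truncation of the context is an assumption made to keep the sketch self-contained.

```python
import math

def perplexity(model_prob, tokens, n):
    """Perplexity = exp of the average negative log-likelihood, i.e. the
    geometric mean of 1 / P(w_t | w_{t-n}, ..., w_{t-1}).
    `model_prob(word, context)` is any conditional model (hypothetical API)."""
    nll = 0.0
    count = 0
    for t in range(n, len(tokens)):
        p = model_prob(tokens[t], tokens[t - n:t])
        nll += -math.log(p)
        count += 1
    return math.exp(nll / count)
```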
The basic form of the model is described here. Refinements to speed it up and extend it will be described in the following sections. We decompose the function $f(w_t, \ldots, w_{t-n}) = P(w_t \mid w_1^{t-1})$ in two parts:

1. A mapping C from any element of V to a real vector $C(i) \in \mathbb{R}^m$. It represents the "distributed feature vector" associated with each word in the vocabulary. In practice, C is represented by a $|V| \times m$ matrix (of free parameters).

2. The probability function over words, expressed with C. We have considered two alternative formulations:

(a) The direct architecture: a function g maps a sequence of feature vectors for the words in the context, $(C(w_{t-n}), \ldots, C(w_{t-1}))$, to a probability distribution over words in V. It is a vector function whose i-th element estimates the probability $P(w_t = i \mid w_1^{t-1})$, as in Figure 1: $f(i, w_{t-1}, \ldots, w_{t-n}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n}))$. We used the "softmax" in the output layer of a neural net: $P(w_t = i \mid w_1^{t-1}) = e^{h_i} / \sum_j e^{h_j}$, where $h_i$ is the neural network output score for word i.

(b) The cycling architecture: a function h maps a sequence of feature vectors $(C(w_{t-n}), \ldots, C(w_{t-1}), C(i))$ (i.e. including the context words and a candidate next word i) to a scalar $h_i$, and again using a softmax, $P(w_t = i \mid w_1^{t-1}) = e^{h_i} / \sum_j e^{h_j}$, with $f(w_t, w_{t-1}, \ldots, w_{t-n}) = g(C(w_t), C(w_{t-1}), \ldots, C(w_{t-n}))$. We call this architecture "cycling" because one repeatedly runs h (e.g. a neural net), each time putting in input the feature vector C(i) for a candidate next word i.

The function f is a composition of these two mappings (C and g), with C being shared across all the words in the context. To each of these two parts are associated some parameters. The parameters of the mapping C are simply the feature vectors themselves (represented by a $|V| \times m$ matrix C whose row i is the feature vector C(i) for word i). The function g may be implemented by a feed-forward or recurrent neural network or another parameterized function, with parameters $\theta$.

[Figure 1 (schematic): the indices of the context words $w_{t-n}, \ldots, w_{t-2}, w_{t-1}$ are mapped through a table look-up in C to their feature vectors, which feed the neural network; the i-th output estimates $P(w_t = i \mid \text{context})$ and is computed only for words in the short list.]

Figure 1: "Direct architecture": $f(i, w_{t-1}, \ldots, w_{t-n}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n}))$, where g is the neural network and C(i) is the i-th word feature vector.

Training is achieved by looking for $(\theta, C)$ that maximize the training corpus penalized log-likelihood: $L = \frac{1}{T} \sum_t \log p_{w_t}(C(w_{t-n}), \ldots, C(w_{t-1}); \theta) + R(\theta, C)$, where $R(\theta, C)$ is a regularization term (e.g. a weight decay $\lambda \|\theta\|^2$) that penalizes slightly the norm of $\theta$.
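A minimal sketch of the forward computation of the direct architecture (feature-vector look-up, one tanh hidden layer, softmax over the vocabulary) is given below, assuming NumPy and illustrative dimensions. The optional direct input-to-output connections used in some experiments and the training loop itself (stochastic gradient steps on L with respect to both θ and C) are omitted; this is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, h, n_ctx = 1000, 30, 40, 4   # vocab size, feature dim, hidden units, context length (illustrative)

# Parameters: the shared feature matrix C plus the network weights (theta).
C = rng.normal(0, 0.01, (V, m))          # word feature vectors, |V| x m
H = rng.normal(0, 0.01, (h, n_ctx * m))  # input-to-hidden weights
d = np.zeros(h)                          # hidden bias
U = rng.normal(0, 0.01, (V, h))          # hidden-to-output weights
b = np.zeros(V)                          # output bias

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def direct_arch_probs(context_ids):
    """P(w_t = . | context): look up and concatenate the feature vectors of
    the context words, pass them through one tanh hidden layer, and
    normalise the |V| output scores h_i with a softmax."""
    x = C[context_ids].reshape(-1)       # concatenated feature vectors
    hidden = np.tanh(H @ x + d)
    scores = U @ hidden + b              # h_i for every word i in V
    return softmax(scores)

p = direct_arch_probs([12, 7, 3, 42])    # arbitrary context word indices
print(p.shape, p.sum())                  # (1000,) 1.0
```

In the cycling architecture the candidate word's own feature vector C(i) is fed to the network together with the context vectors and the network outputs a single score $h_i$, so the network must be re-run once per candidate word before the softmax.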
3 Speeding-up and other Tricks

Short list. The main idea is to focus the effort of the neural network on a "short list" of words that have the highest probability. This can save much computation because in both of the proposed architectures the time to compute the probability of the observed next word scales almost linearly with the number of words in the vocabulary (the scores $h_i$ associated with each word i in the vocabulary must be computed to properly normalize the probabilities with the softmax). The idea of the speed-up trick is the following: instead of computing the actual probability of the next word, the neural network is used to compute the relative probability of the next word within that short list. The choice of the short list depends on the current context (the previous n words). We have used our smoothed trigram model to pre-compute a short list containing the most probable next words associated with the previous two words. Denoting by $h_t$ the history (context) before $w_t$ and by $L_t$ the short list of words for the prediction of $w_t$, the conditional probabilities $P(w_t = i \mid h_t)$ are computed as follows: if $i \in L_t$ then the probability is $P_{NN}(w_t = i \mid w_t \in L_t, h_t)\, P_{trigram}(w_t \in L_t \mid h_t)$, else it is $P_{trigram}(w_t = i \mid h_t)$, where $P_{NN}(w_t = i \mid w_t \in L_t, h_t)$ are the normalized scores of the words computed by the neural network, with the "softmax" normalized only over the words in the short list $L_t$, and $P_{trigram}(w_t \in L_t \mid h_t) = \sum_{i \in L_t} P_{trigram}(i \mid h_t)$, with $P_{trigram}(i \mid h_t)$ standing for the next-word probabilities computed by the smoothed trigram. Note that both $L_t$ and $P_{trigram}(w_t \in L_t \mid h_t)$ can be pre-computed (and stored in a hash table indexed by the last two words).
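A sketch of the short-list computation just described, under the assumption that the short list, the neural-network scores for its words, and a smoothed-trigram probability function are available; the argument names are illustrative, not the authors' code.

```python
import numpy as np

def shortlist_prob(word, history, shortlist, nn_scores, trigram_prob):
    """P(w_t = word | history) using the short-list trick.

    shortlist    : candidate words pre-selected by the trigram for this history
                   (pre-computable, e.g. keyed on the last two words)
    nn_scores    : dict word -> unnormalised neural-network score h_i,
                   computed only for the words in the short list
    trigram_prob : function (word, history) -> smoothed trigram probability
    """
    p_in_list = sum(trigram_prob(w, history) for w in shortlist)
    if word in shortlist:
        # softmax normalised over the short list only
        scores = np.array([nn_scores[w] for w in shortlist])
        p_nn = np.exp(scores - scores.max())
        p_nn /= p_nn.sum()
        return p_nn[shortlist.index(word)] * p_in_list
    # otherwise fall back on the smoothed trigram directly
    return trigram_prob(word, history)
```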
Table look-up for recognition. To speed up application of the trained model, one can pre-compute in a hash table the output of the neural network, at least for the most frequent input contexts. In that case, the neural network will only rarely be called upon, and the average computation time will be very small. Note that in a speech recognition system, one needs only compute the relative probabilities of the acoustically ambiguous words in each context, also reducing drastically the amount of computation.

Stochastic gradient descent. Since we have millions of examples, it is important to converge within only a few passes through the data. For very large data sets, stochastic gradient descent convergence time seems to increase sub-linearly with the size of the data set (see the experiments on Brown vs. Hansard below). To speed up training with stochastic gradient descent, we have found it useful to break the corpus into paragraphs and to randomly permute them. In this way, some of the non-stationarity in the word stream is eliminated, yielding faster convergence.

Capacity control. For the "smaller" corpora like Brown (1.2 million examples), we have found early stopping and weight decay useful to avoid over-fitting. For the larger corpora, our networks still under-fit. For the larger corpora, we have also found double-precision computation to be very important to obtain good results.

Mixture of models. We have found improved performance by combining the probability predictions of the neural network with those of the smoothed trigram, with weights that were conditional on the frequency of the context (the same procedure used to combine trigram, bigram, and unigram in the smoothed trigram).

Initialization of word feature vectors. We have tried both random initialization (uniform between -0.01 and 0.01) and a "smarter" method based on a Singular Value Decomposition (SVD) of a very large matrix of "context features". These context features are formed by counting the frequency of occurrence of each word in each one of the most frequent contexts (word sequences) in the corpus. The idea is that "similar" words should occur with similar frequency in the same contexts. We used about the 9000 most frequent contexts, and compressed these to 30 features with the SVD.
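A sketch of such an SVD-based initialization, assuming the word-by-context count matrix has already been built; how the frequent contexts are selected and the counts collected is left unspecified here, and the toy matrix is purely illustrative.

```python
import numpy as np

def svd_init(count_matrix, m=30):
    """Compress a |V| x n_contexts matrix of word-in-context frequencies to
    m features per word with a truncated SVD.  count_matrix[i, j] is how
    often word i occurred in frequent context j (the paper used roughly the
    9000 most frequent contexts and m = 30)."""
    U, S, _ = np.linalg.svd(count_matrix, full_matrices=False)
    return U[:, :m] * S[:m]            # one m-dimensional feature vector per word

# toy usage: 5 words x 4 contexts
counts = np.array([[3., 0., 1., 0.],
                   [2., 1., 0., 0.],
                   [0., 4., 0., 1.],
                   [0., 3., 1., 1.],
                   [1., 0., 5., 2.]])
C_init = svd_init(counts, m=2)
print(C_init.shape)                    # (5, 2)
```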
Out-of-vocabulary words. For an out-of-vocabulary word $w_t$ we need to come up with a feature vector in order to predict the words that follow, or to predict its probability (the latter is only possible with the cycling architecture). We used as feature vector the weighted average feature vector of all the words in the short list, with the weights being the relative probabilities of those words: $E[C(w_t) \mid h_t] = \sum_i C(i) P(w_t = i \mid h_t)$.
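A sketch of this expected feature vector, assuming the short-list words' conditional probabilities under the current context are available (argument names are illustrative):

```python
import numpy as np

def oov_feature_vector(C, shortlist_ids, shortlist_probs):
    """Feature vector for an out-of-vocabulary word: the average of the
    feature vectors of the short-list words, weighted by their conditional
    probabilities given the current context, i.e. E[C(w_t) | h_t]."""
    p = np.asarray(shortlist_probs, dtype=float)
    p = p / p.sum()                    # make the relative probabilities sum to 1
    return p @ C[shortlist_ids]        # convex combination of feature vectors
```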
4 Experimental Results

Comparative experiments were performed on the Brown and Hansard corpora. The Brown corpus is a stream of 1,181,041 words (from a large variety of English texts and books). The first 800,000 words were used for training, the following 200,000 for validation (model selection, weight decay, early stopping) and the remaining 181,041 for testing. The number of different words is 47,578 (including punctuation, distinguishing between upper and lower case, and including the syntactical marks used to separate texts and paragraphs). Rare words with frequency $\leq$ 3 were merged into a single token, reducing the vocabulary size to |V| = 16,383.

The Hansard corpus (Canadian parliament proceedings, French version) is a stream of about 34 million words, of which 32 million (set A) were used for training, 1.1 million (set B) for validation, and 1.2 million (set C) for out-of-sample tests. The original data has 106,936 different words, and those with frequency $\leq$ 10 were merged into a single token, yielding |V| = 30,959 different words.

The benchmark against which the neural network was compared is an interpolated or smoothed trigram model [6]. Let $q_t = l(\mathrm{freq}(w_{t-1}, w_{t-2}))$ represent the discretized frequency of occurrence of the context $(w_{t-1}, w_{t-2})$ (we used $l(x) = \lceil -\log((1+x)/T) \rceil$, where x is the frequency of occurrence of the context and T is the size of the training corpus). A conditional mixture of the trigram, bigram, unigram and zero-gram was learned on the validation set, with mixture weights conditional on the discretized frequency.

Below are measures of test set perplexity (geometric average of $1/P(w_t \mid w_1^{t-1})$) for different models P. Apparent convergence of the stochastic gradient descent procedure was obtained after around 10 epochs for Hansard and after about 50 epochs for Brown, with a learning rate gradually decreased from approximately $10^{-3}$ to $10^{-5}$. Weight decay of $10^{-4}$ or $10^{-5}$ was used in all the experiments (based on a few experiments compared on the validation set).

The main result is that the neural network performs much better than the smoothed trigram. On Brown, the best neural network system according to validation perplexity (among the different architectures tried, see below) yielded a test perplexity of 258, while the smoothed trigram yields a perplexity of 348, which is about 35% worse. This is obtained using a network with the direct architecture mixed with the trigram (conditional mixture), with 30 word features initialized with the SVD method, 40 hidden units, and n = 5 words of context. On Hansard, the corresponding figures are 44.8 for the neural network and 54.1 for the smoothed trigram, which is 20.7% worse. This is obtained with a network with the direct architecture, 100 randomly initialized word features, 120 hidden units, and n = 8 words of context.

More context is useful. Experiments with the cycling architecture on Brown, with 30 word features and 30 hidden units, varying the number of context words: n = 1 (like the bigram) yields a test perplexity of 302, n = 3 yields 291, n = 5 yields 281, n = 8 yields 279 (N.B. the smoothed trigram yields 348).

Hidden units help. Experiments with the direct architecture on Brown (with direct input-to-output connections), with 30 word features and 5 words of context, varying the number of hidden units: 0 yields a test perplexity of 275, 10 yields 267, 20 yields 266, 40 yields 265, 80 yields 265.

Learning the word features jointly is important. Experiments with the direct architecture on Brown (40 hidden units, 5 words of context) in which the word features initialized with the SVD method are kept fixed during training yield a test perplexity of 345.8, whereas if the word features are trained jointly with the rest of the parameters, the perplexity is 265.

Initialization not so useful. Experiments on Brown with both architectures reveal that the SVD initialization of the word features does not bring much improvement with respect to random initialization: it speeds up initial convergence (saving about 2 epochs), and yields a perplexity improvement of less than 0.3%.

Direct architecture works a bit better. The direct architecture was found to be about 2% better than the cycling architecture.

Conditional mixture helps, but even without it the neural net is better. On Brown, the best neural net without the mixture yields a test perplexity of 265, the smoothed trigram yields 348, and their conditional mixture yields 258 (i.e., better than both). On Hansard the improvement is smaller: a neural network yielding 46.7 perplexity, mixed with the trigram (54.1), yields a mixture with perplexity 45.1.

5 Conclusions and Proposed Extensions

The experiments on two corpora, a medium one (1.2 million words) and a large one (34 million words), have shown that the proposed approach yields much better perplexity than a state-of-the-art method, the smoothed trigram, with differences on the order of 20% to 35%.

We believe that the main reason for these improvements is that the proposed approach allows one to take advantage of the learned distributed representation to fight the curse of dimensionality with its own weapons: each training sentence informs the model about a combinatorial number of other sentences. Note that if we had a separate feature vector for each "context" (short sequence of words), the model would have much more capacity (which could grow like that of n-grams) but it would not naturally generalize between the many different ways a word can be used. A more reasonable alternative would be to explore language units other than words (e.g. some short word sequences, or alternatively some sub-word morphemic units).

There is probably much more to be done to improve the model, at the level of architecture, computational efficiency, and taking advantage of prior knowledge. An important priority of future research should be to evaluate and improve the speeding-up tricks proposed here, and to find ways to increase capacity without increasing training time too much (to deal with corpora with hundreds of millions of words). A simple idea to take advantage of temporal structure and extend the size of the input window, possibly to include a whole paragraph without increasing the number of parameters too much, is to use a time-delay and possibly recurrent neural network. In such a multi-layered network the computation that has been performed for small groups of consecutive words does not need to be redone when the network input window is shifted. Similarly, one could use a recurrent network to capture potentially even longer-term information about the subject of the text.

A very important area in which the proposed model could be improved is in the use of prior linguistic knowledge: semantic (e.g. WordNet), syntactic (e.g. a tagger), and morphological (radix and morphemes). Looking at the word features learned by the model should help understand it and improve it. Finally, future research should establish how useful the proposed approach will be in applications to speech recognition, language translation, and information retrieval.

Acknowledgments

The authors would like to thank Léon Bottou and Yann Le Cun for useful discussions. This research was made possible by funding from the NSERC granting agency.
References

[1] D. Baker and A. McCallum. Distributional clustering of words for text classification. In SIGIR'98, 1998.

[2] S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining and Knowledge Discovery, 11(3):550-557, 2000.

[3] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 400-406. MIT Press, 2000.

[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[5] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst, 1986. Lawrence Erlbaum, Hillsdale.

[6] F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, Amsterdam, 1980.

[7] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3):400-401, March 1987.

[8] R. Miikkulainen and M. G. Dyer. Natural language processing with modular neural networks and distributed lexicon. Cognitive Science, 15:343-399, 1991.

[9] A. Paccanaro and G. E. Hinton. Extracting distributed representations of concepts and relations from positive and negative propositions. In Proceedings of the International Joint Conference on Neural Networks, IJCNN'2000, Como, Italy, 2000. IEEE, New York.

[10] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183-190, Columbus, Ohio, 1993.

[11] Jürgen Schmidhuber. Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142-146, 1996.

[12] H. Schütze. Word space. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 895-902, San Mateo, CA, 1993. Morgan Kaufmann.
", "award": [], "sourceid": 1839, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "R\u00e9jean", "family_name": "Ducharme", "institution": null}, {"given_name": "Pascal", "family_name": "Vincent", "institution": null}]}