{"title": "A Scalable Hierarchical Distributed Language Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1081, "page_last": 1088, "abstract": "Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words that was two orders of magnitude faster than the non-hierarchical language model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical models and achieve state-of-the-art performance.", "full_text": "A Scalable Hierarchical Distributed Language Model\n\nAndriy Mnih\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nGeoffrey Hinton\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\namnih@cs.toronto.edu\n\nhinton@cs.toronto.edu\n\nAbstract\n\nNeural probabilistic language models (NPLMs) have been shown to be competi-\ntive with and occasionally superior to the widely-used n-gram language models.\nThe main drawback of NPLMs is their extremely long training and testing times.\nMorin and Bengio have proposed a hierarchical language model built around a\nbinary tree of words, which was two orders of magnitude faster than the non-\nhierarchical model it was based on. However, it performed considerably worse\nthan its non-hierarchical counterpart in spite of using a word tree created using\nexpert knowledge. 
We introduce a fast hierarchical language model along with\na simple feature-based algorithm for automatic construction of word trees from\nthe data. We then show that the resulting models can outperform non-hierarchical\nneural models as well as the best n-gram models.\n\n1 Introduction\n\nStatistical language modelling is concerned with building probabilistic models of word sequences.\nSuch models can be used to discriminate probable sequences from improbable ones, a task important\nfor performing speech recognition, information retrieval, and machine translation. The vast majority\nof statistical language models are based on the Markov assumption, which states that the distribu-\ntion of a word depends only on some \ufb01xed number of words that immediately precede it. While\nthis assumption is clearly false, it is very convenient because it reduces the problem of modelling\nthe probability distribution of word sequences of arbitrary length to the problem of modelling the\ndistribution on the next word given some \ufb01xed number of preceding words, called the context. We\nwill denote this distribution by P (wn|w1:n\u22121), where wn is the next word and w1:n\u22121 is the context\n(w1, ..., wn\u22121).\nn-gram language models are the most popular statistical language models due to their simplicity\nand surprisingly good performance. These models are simply conditional probability tables for\nP (wn|w1:n\u22121), estimated by counting the n-tuples in the training data and normalizing the counts\nappropriately. Since the number of n-tuples is exponential in n, smoothing the raw counts is essential\nfor achieving good performance. There is a large number of smoothing methods available for n-gram\nmodels [4]. In spite of the sophisticated smoothing methods developed for them, n-gram models are\nunable to take advantage of large contexts since the data sparsity problem becomes extreme. 
The main reason for this behavior is the fact that classical n-gram models are essentially conditional probability tables where different entries are estimated independently of each other. These models do not take advantage of the fact that similar words occur in similar contexts, because they have no concept of similarity. Class-based n-gram models [3] aim to address this issue by clustering words and/or contexts into classes based on their usage patterns and then using this class information to improve generalization. While it can improve n-gram performance, this approach introduces a very rigid kind of similarity, since each word typically belongs to exactly one class.\n\nAn alternative and much more flexible approach to counteracting the data sparsity problem is to represent each word using a real-valued feature vector that captures its properties, so that words used in similar contexts will have similar feature vectors. Then the conditional probability of the next word can be modelled as a smooth function of the feature vectors of the context words and the next word. This approach provides automatic smoothing, since for a given context similar words are now guaranteed to be assigned similar probabilities. Similarly, similar contexts are now likely to have similar representations, resulting in similar predictions for the next word. Most models based on this approach use a feed-forward neural network to map the feature vectors of the context words to the distribution for the next word (e.g. [12], [5], [9]). Perhaps the best known model of this type is the Neural Probabilistic Language Model [1], which has been shown to outperform n-gram models on a dataset of about one million words.\n\n2 The hierarchical neural network language model\n\nThe main drawback of the NPLM and other similar models is that they are very slow to train and test [10]. 
Since computing the probability of the next word requires explicitly normalizing over all words in the vocabulary, the cost of computing the probability of the given next word and the cost of computing the full distribution over the next word are virtually the same – they take time linear in the vocabulary size. Since computing the exact gradient in such models requires repeatedly computing the probability of the next word given its context and updating the model parameters to increase that probability, training time is also linear in the vocabulary size. Typical natural language datasets have vocabularies containing tens of thousands of words, which means that training NPLM-like models the straightforward way is usually too computationally expensive in practice. One way to speed up the process is to use a specialized importance sampling procedure to approximate the gradients required for learning [2]. However, while this method can speed up training substantially, testing remains computationally expensive.\n\nThe hierarchical NPLM introduced in [10] provides an exponential reduction in the time complexity of learning and testing as compared to the NPLM. It achieves this reduction by replacing the unstructured vocabulary of the NPLM by a binary tree that represents a hierarchical clustering of words in the vocabulary. Each word corresponds to a leaf in the tree and can be uniquely specified by the path from the root to that leaf. If N is the number of words in the vocabulary and the tree is balanced, any word can be specified by a sequence of O(log N) binary decisions indicating which of the two children of the current node is to be visited next. This setup replaces one N-way choice by a sequence of O(log N) binary choices. In probabilistic terms, one N-way normalization is replaced by a sequence of O(log N) local (binary) normalizations. 
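To make the O(log N) saving concrete, here is a tiny illustrative sketch (not from the paper) comparing the per-prediction cost of an explicit N-way normalization with the number of binary decisions needed by a balanced tree; the vocabulary size is the one reported for the APNews dataset in Section 6:

```python
import math

def flat_softmax_cost(vocab_size):
    # An explicit normalization touches every word in the vocabulary.
    return vocab_size

def balanced_tree_cost(vocab_size):
    # A balanced binary tree needs one binary decision per level.
    return math.ceil(math.log2(vocab_size))

N = 17964  # vocabulary size of the APNews dataset (Section 6)
print(flat_softmax_cost(N))   # 17964
print(balanced_tree_cost(N))  # 15
```

The roughly 15 decisions per word here agree with the mean code lengths of about 14 reported for the balanced trees in Table 1.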
As a result, a distribution over words in\nthe vocabulary can be speci\ufb01ed by providing the probability of visiting the left child at each of the\nnodes. In the hierarchical NPLM, these local probabilities are computed by giving a version of the\nNPLM the feature vectors for the context words as well as a feature vector for the current node as\ninputs. The probability of the next word is then given by the probability of making a sequence of\nbinary decisions that corresponds to the path to that word.\n\nWhen applied to a dataset of about one million words, this model outperformed class-based trigrams,\nbut performed considerably worse than the NPLM [10]. The hierarchical model however was more\nthan two orders of magnitude faster than the NPLM. The main limitation of this work was the\nprocedure used to construct the tree of words for the model. The tree was obtained by starting\nwith the WordNet IS-A taxonomy and converting it into a binary tree through a combination of\nmanual and data-driven processing. Our goal is to replace this procedure by an automated method\nfor building trees from the training data without requiring expert knowledge of any kind. We will\nalso explore the performance bene\ufb01ts of using trees where each word can occur more than once.\n\n3 The log-bilinear model\n\nWe will use the log-bilinear language model (LBL) [9] as the foundation of our hierarchical model\nbecause of its excellent performance and simplicity. Like virtually all neural language models, the\nLBL model represents each word with a real-valued feature vector. We will denote the feature vector\nfor word w by rw and refer to the matrix containing all these feature vectors as R. 
To predict the next word wn given the context w1:n−1, the model computes the predicted feature vector r̂ for the next word by linearly combining the context word feature vectors:\n\nr̂ = Σ_{i=1}^{n−1} C_i r_{w_i},   (1)\n\nwhere C_i is the weight matrix associated with the context position i. Then the similarity between the predicted feature vector and the feature vector for each word in the vocabulary is computed using the inner product. The similarities are then exponentiated and normalized to obtain the distribution over the next word:\n\nP(w_n = w | w_{1:n−1}) = exp(r̂^T r_w + b_w) / Σ_j exp(r̂^T r_j + b_j).   (2)\n\nHere b_w is the bias for word w, which is used to capture the context-independent word frequency. Note that the LBL model can be interpreted as a special kind of feed-forward neural network with one linear hidden layer and a softmax output layer. The inputs to the network are the feature vectors for the context words, while the matrix of weights from the hidden layer to the output layer is simply the feature vector matrix R. The vector of activities of the hidden units corresponds to the predicted feature vector for the next word. Unlike the NPLM, the LBL model needs to compute the hidden activities only once per prediction and has no nonlinearities in its hidden layer. In spite of its simplicity, the LBL model performs very well, outperforming both the NPLM and the n-gram models on a fairly large dataset [9].\n\n4 The hierarchical log-bilinear model\n\nOur hierarchical language model is based on the hierarchical model from [10]. The distinguishing features of our model are the use of the log-bilinear language model for computing the probabilities at each node and the ability to handle multiple occurrences of each word in the tree. 
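As a concrete illustration of Eqs. 1 and 2, here is a minimal NumPy sketch of the LBL prediction step; the dimensions and the random initialization are arbitrary placeholders, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, n_ctx = 1000, 25, 5                 # vocab size, feature dim, context size (illustrative)
R = rng.normal(0.0, 0.01, (V, D))         # word feature vectors, one row per word
C = rng.normal(0.0, 0.01, (n_ctx, D, D))  # context weight matrices C_i, one per position
b = np.zeros(V)                           # per-word biases b_w

def next_word_distribution(context):
    """Distribution over the next word given context word indices (Eqs. 1-2)."""
    # Eq. 1: the predicted feature vector is a position-weighted sum of context vectors.
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context))
    # Eq. 2: exponentiated inner products, normalized over the full vocabulary.
    scores = R @ r_hat + b
    scores -= scores.max()                # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = next_word_distribution([3, 14, 159, 26, 535])
```

Note that the normalization in Eq. 2 runs over all V words; this per-prediction cost is exactly what the hierarchical model removes.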
Note that the idea of using multiple word occurrences in a tree was proposed in [10], but it was not implemented.\n\nThe first component of the hierarchical log-bilinear model (HLBL) is a binary tree with words at its leaves. For now, we will assume that each word in the vocabulary is at exactly one leaf. Then each word can be uniquely specified by a path from the root of the tree to the leaf node the word is at. The path itself can be encoded as a binary string d of decisions made at each node, so that di = 1 corresponds to the decision to visit the left child of the current node. For example, the string “10” corresponds to a path that starts at the root, visits its left child, and then visits the right child of that child. This allows each word to be represented by a binary string, which we will call a code.\n\nThe second component of the HLBL model is the probabilistic model for making the decisions at each node, which in our case is a modified version of the LBL model. In the HLBL model, just like in its non-hierarchical counterpart, context words are represented using real-valued feature vectors. Each of the non-leaf nodes in the tree also has a feature vector associated with it that is used for discriminating the words in the left subtree from the words in the right subtree of the node. Unlike the context words, the words being predicted are represented using their binary codes, which are determined by the word tree. However, this representation is still quite flexible, since each binary digit in the code encodes a decision made at a node, which depends on that node's feature vector.\n\nIn the HLBL model, the probability of the next word being w is the probability of making the sequence of binary decisions specified by the word's code, given the context. 
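The code assignment just described is a simple tree traversal; a small sketch follows (the nested-tuple tree representation is a hypothetical choice made for illustration, not the paper's data structure):

```python
def assign_codes(tree, prefix=""):
    """Return a dict mapping each leaf word to its binary code ('1' = visit left child)."""
    if isinstance(tree, str):        # a leaf holds a single word
        return {tree: prefix}
    left, right = tree
    codes = assign_codes(left, prefix + "1")
    codes.update(assign_codes(right, prefix + "0"))
    return codes

# A four-word tree whose root has subtrees (cat, dog) and (fish, bird):
codes = assign_codes((("cat", "dog"), ("fish", "bird")))
print(codes)  # {'cat': '11', 'dog': '10', 'fish': '01', 'bird': '00'}
```

Here "dog" gets the code "10" from the example in the text: left at the root, then right.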
Since the probability of making a decision at a node depends only on the predicted feature vector, determined by the context, and the feature vector for that node, we can express the probability of the next word as a product of probabilities of the binary decisions:\n\nP(w_n = w | w_{1:n−1}) = Π_i P(d_i | q_i, w_{1:n−1}),   (3)\n\nwhere d_i is the ith digit in the code for word w, and q_i is the feature vector for the ith node in the path corresponding to that code. The probability of each decision is given by\n\nP(d_i = 1 | q_i, w_{1:n−1}) = σ(r̂^T q_i + b_i),   (4)\n\nwhere σ(x) is the logistic function and r̂ is the predicted feature vector computed using Eq. 1. The bias b_i in the equation captures the context-independent tendency to visit the left child when leaving this node.\n\nThe definition of P(w_n = w | w_{1:n−1}) can be extended to multiple codes per word by including a summation over all codes for w as follows:\n\nP(w_n = w | w_{1:n−1}) = Σ_{d∈D(w)} Π_i P(d_i | q_i, w_{1:n−1}),   (5)\n\nwhere D(w) is the set of codes corresponding to word w. Allowing multiple codes per word can lead to better prediction of words that have multiple senses or multiple usage patterns. Using multiple codes per word also makes it easy to combine several separate word hierarchies into a single one, reflecting the fact that no single hierarchy can express all the relationships between words.\n\nUsing the LBL model instead of the NPLM for computing the local probabilities allows us to avoid computing the nonlinearities in the hidden layer, which makes our hierarchical model faster at making predictions than the hierarchical NPLM. More importantly, the hierarchical NPLM needs to compute the hidden activities once for each of the O(log N) decisions, while the HLBL model computes the predicted feature vector just once per prediction. 
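The decomposition in Eqs. 3 and 4 can be sketched in a few lines; the parameters below are random placeholders, where a real model would use the learned node vectors and the context-derived r̂:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(r_hat, code, node_vectors, node_biases):
    """P(w | context) as a product of binary decisions along the word's path (Eqs. 3-4)."""
    p = 1.0
    for d, q, b in zip(code, node_vectors, node_biases):
        p_left = sigmoid(r_hat @ q + b)   # Eq. 4: probability of visiting the left child
        p *= p_left if d == 1 else 1.0 - p_left
    return p

rng = np.random.default_rng(1)
D = 25
r_hat = rng.normal(size=D)        # predicted feature vector from the context (Eq. 1)
nodes = rng.normal(size=(3, D))   # feature vectors q_i of the nodes on the path
biases = np.zeros(3)
p = word_probability(r_hat, [1, 0, 1], nodes, biases)  # code "101": left, right, left
```

Because the two branch probabilities at every node sum to one, summing this quantity over all leaves of a tree yields exactly one, so no explicit normalization over the vocabulary is ever needed.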
However, the time complexity of computing the probability for a single binary decision in an LBL model is still quadratic in the feature vector dimensionality D, which might make the use of high-dimensional feature vectors too computationally expensive. We make the time complexity linear in D by restricting the weight matrices Ci to be diagonal.1 Note that for a context of size 1, this restriction does not reduce the representational power of the model, because the context weight matrix C1 can be absorbed into the word feature vectors. And while this restriction does make the models with larger contexts slightly less powerful, we believe that this loss is more than compensated for by the much faster training times, which allow using more complex trees.\n\nHLBL models can be trained by maximizing the (penalized) log-likelihood. Since the probability of the next word depends only on the context weights, the feature vectors of the context words, and the feature vectors of the nodes on the paths from the root to the leaves containing the word in question, only a (logarithmically) small fraction of the parameters need to be updated for each training case.\n\n5 Hierarchical clustering of words\n\nThe first step in training a hierarchical language model is constructing a binary tree of words for the model to use. This can be done by using expert knowledge, data-driven methods, or a combination of the two. For example, in [10] the tree was constructed from the IS-A taxonomy DAG from WordNet [6]. 
After preprocessing the taxonomy by hand to ensure that each node had only one parent, data-driven hierarchical binary clustering was performed on the children of the nodes in the taxonomy that had more than two children, resulting in a binary tree.\n\nWe are interested in using a pure learning approach applicable in situations where expert knowledge is unavailable. It is also not clear that using expert knowledge, even when it is available, will lead to superior performance. Hierarchical binary clustering of words based on their usage statistics is a natural choice for generating binary trees of words automatically. This task is similar to the task of clustering words into classes for training class-based n-gram models, for which a large number of algorithms have been proposed. We considered several of these algorithms before deciding to use our own algorithm, which turned out to be surprisingly effective in spite of its simplicity. However, we will mention two existing algorithms that might be suitable for producing binary word hierarchies. Since we wanted an algorithm that scaled well to large vocabularies, we restricted our attention to top-down hierarchical clustering algorithms, as they tend to scale better than their agglomerative counterparts [7]. The algorithm from [8] produces exactly the kind of binary trees we need, except that its time complexity is cubic in the vocabulary size.2 We also considered the distributional clustering algorithm [11] but decided not to use it because of the difficulties involved in using contexts of more than one word for clustering. This problem is shared by most n-gram clustering algorithms, so we will describe it in some detail. Since we would like to cluster words for easy prediction of the next word based on its context, it is natural to describe each word in terms of the contexts that can precede it. 
For example, for a single-word context one such description is the distribution of words that precede the word of interest in the training data. The problem becomes apparent when we consider using larger contexts: the number of contexts that can potentially precede a word grows exponentially in the context size. This is the very same data sparsity problem that affects the n-gram models, which is not surprising, since we are trying to describe words in terms of exponentially large (normalized) count vectors. Thus, clustering words based on such large-context representations becomes non-trivial due to the computational cost involved as well as the statistical difficulties caused by the sparsity of the data.\n\n1Thus the feature vector for the next word can now be computed as r̂ = Σ_{i=1}^{n−1} c_i ∘ r_{w_i}, where c_i is a vector of context weights for position i and ∘ denotes the elementwise product of two vectors.\n\n2More precisely, the time complexity of the algorithm is cubic in the number of the frequent words, but that is still too slow for our purposes.\n\nWe avoid these difficulties by operating on low-dimensional real-valued word representations in our tree-building procedure. Since we need to train a model to obtain word feature vectors, we perform the following bootstrapping procedure: we generate a random binary tree of words, train an HLBL model based on it, and use the distributed representations it learns to represent words when building the word tree.\n\nSince each word is represented by a distribution over contexts it appears in, we need a way of compressing such a collection of contexts down to a low-dimensional vector. After training the HLBL model, we summarize each context w1:n−1 with the predicted feature vector produced from it using Eq. 1. 
Then, we condense the distribution of contexts that precede a given word into a\nfeature vector by computing the expectation of the predicted representation w.r.t. that distribution.\nThus, for the purposes of clustering each word is represented by its average predicted feature vector.\nAfter computing the low-dimensional real-valued feature vectors for words, we recursively apply a\nvery simple clustering algorithm to them. At each step, we \ufb01t a mixture of two Gaussians to the\nfeature vectors and then partition them into two subsets based on the responsibilities of the two\nmixture components for them. We then partition each of the subsets using the same procedure, and\nso on. The recursion stops when the current set contains only two words. We \ufb01t the mixtures by\nrunning the EM algorithm for 10 steps3. The algorithm updates both the means and the spherical\ncovariances of the components. Since the means of the components are initialized based on a random\npartitioning of the feature vectors, the algorithm is not deterministic and will produce somewhat\ndifferent clusterings on different runs. One appealing property of this algorithm is that the running\ntime of each iteration is linear in the vocabulary size, which is a consequence of representing words\nusing feature vectors of \ufb01xed dimensionality. In our experiments, the algorithm took only a few\nminutes to build a hierarchy for a vocabulary of nearly 18000 words based on 100-dimensional\nfeature vectors.\n\nThe goal of an algorithm for generating trees for hierarchical language models is to produce trees\nthat are well-supported by the data and are reasonably well-balanced so that the resulting models\ngeneralize well and are fast to train and test. To explore the trade-off between these two require-\nments, we tried several splitting rules in our tree-building algorithm. 
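The recursive mixture-of-two-Gaussians procedure described above can be sketched as follows. This is a simplified illustration (hard assignment to the higher-responsibility component and an even-cut fallback for degenerate splits), not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def split_responsibilities(X, n_iter=10):
    """Fit a 2-component spherical Gaussian mixture with EM; return responsibilities."""
    n, dim = X.shape
    part = rng.permutation(n) < n // 2           # random initial partition, as in the paper
    mu = np.stack([X[part].mean(axis=0), X[~part].mean(axis=0)])
    var = np.ones(2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities under spherical Gaussians.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(pi) - 0.5 * d2 / var - 0.5 * dim * np.log(var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and spherical variances.
        w = resp.sum(axis=0)
        pi = w / n
        mu = (resp.T @ X) / w[:, None]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.maximum((resp * d2).sum(axis=0) / (w * dim), 1e-6)
    return resp

def build_tree(words, X):
    """Recursively split until a subset holds at most two words."""
    if len(words) <= 2:
        return words
    left = split_responsibilities(X)[:, 0] >= 0.5  # higher-responsibility assignment
    if left.all() or not left.any():               # degenerate split: force an even cut
        left = np.arange(len(words)) < len(words) // 2
    return (build_tree([w for w, m in zip(words, left) if m], X[left]),
            build_tree([w for w, m in zip(words, left) if not m], X[~left]))

words = ["w%d" % i for i in range(8)]
X = np.concatenate([rng.normal(-3.0, 0.1, (4, 2)), rng.normal(3.0, 0.1, (4, 2))])
tree = build_tree(words, X)
```

Each level costs one EM fit over the points at that node, which is where the linear-in-vocabulary running time per iteration comes from.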
The rules are based on the observation that the responsibility of a component for a datapoint can be used as a measure of confidence about the assignment of the datapoint to the component. Thus, when the responsibilities of both components for a datapoint are close to 0.5, we cannot be sure that the datapoint should be in one component but not the other.\n\nOur simplest rule aims to produce a balanced tree at any cost. It sorts the responsibilities and splits the words into two disjoint subsets of equal size based on the sorted order. The second rule makes splits well-supported by the data even if that results in an unbalanced tree. It achieves that by assigning each word to the component with the higher responsibility for the word. The third and most sophisticated rule is an extension of the second rule, modified to assign a point to both components whenever both responsibilities are within ε of 0.5, for some pre-specified ε. This rule is designed to produce multiple codes for words that are difficult to cluster. We will refer to the algorithms that use these rules as BALANCED, ADAPTIVE, and ADAPTIVE(ε) respectively. Finally, as a baseline for comparison with the above algorithms, we will use an algorithm that generates random balanced trees. It starts with a random permutation of the words and recursively builds the left subtree based on the first half of the words and the right subtree based on the second half of the words. We will call this algorithm RANDOM.\n\n3Running EM for more than 10 steps did not make a significant difference in the quality of the resulting trees.\n\nTable 1: Trees of words generated by the feature-based algorithm. The mean code length is the sum of lengths of codes associated with a word, averaged over the distribution of the words in the training data. The run-time complexity of the hierarchical model is linear in the mean code length of the tree used. 
The mean number of codes per word refers to the number of codes per word averaged over the training data distribution. Since each non-leaf node in a tree has its own feature vector, the number of free parameters associated with the tree is linear in this quantity.\n\nTree label | Generating algorithm | Mean code length | Mean number of codes per word | Number of non-leaf nodes\nT1 | RANDOM | 14.2 | 1.0 | 17963\nT2 | BALANCED | 14.3 | 1.0 | 17963\nT3 | ADAPTIVE | 16.1 | 1.0 | 17963\nT4 | ADAPTIVE(0.25) | 24.2 | 1.3 | 22995\nT5 | ADAPTIVE(0.4) | 29.0 | 1.7 | 30296\nT6 | ADAPTIVE(0.4) ×2 | 69.1 | 3.4 | 61014\nT7 | ADAPTIVE(0.4) ×4 | 143.2 | 6.8 | 121980\n\nTable 2: The effect of the feature dimensionality and the word tree used on the test set perplexity of the model.\n\nFeature dimensionality | Perplexity using a random tree | Perplexity using a non-random tree | Reduction in perplexity\n25 | 191.6 | 162.4 | 29.2\n50 | 166.4 | 141.7 | 24.7\n75 | 156.4 | 134.8 | 21.6\n100 | 151.2 | 131.3 | 19.9\n\n6 Experimental results\n\nWe compared the performance of our models on the APNews dataset containing the Associated Press news stories from 1995 and 1996. The dataset consists of a 14 million word training set, a 1 million word validation set, and a 1 million word test set. The vocabulary size for this dataset is 17964. We chose this dataset because it had already been used to compare the performance of neural models to that of n-gram models in [1] and [9], which allowed us to compare our results to the results in those papers. Except where stated otherwise, the models used for the experiments used 100-dimensional feature vectors and a context size of 5. The details of the training procedure we used are given in the appendix. All models were compared based on their perplexity score on the test set.\n\nWe started by training a model that used a tree generated by the RANDOM algorithm (tree T1 in Table 1). 
The feature vectors learned by this model were used to build a tree using the BALANCED algorithm (tree T2). We then trained models of various feature vector dimensionality on each of these trees to see whether a highly expressive model can compensate for using a poorly constructed tree. The test scores for the resulting models are given in Table 2. As can be seen from the scores, using a non-random tree results in much better model performance. Though the gap in performance can be reduced by increasing the dimensionality of the feature vectors, using a non-random tree drastically improves performance even for the model with 100-dimensional feature vectors. It should be noted, however, that models that use the random tree are not entirely hopeless. For example, they outperform the unigram model, which achieved a perplexity of 602.0, by a very large margin. This suggests that the HLBL architecture is sufficiently flexible to make effective use of a random tree over words.\n\nSince increasing the feature dimensionality beyond 100 did not result in a substantial reduction in perplexity, we used 100-dimensional feature vectors for all of our models in the following experiments. Next we explored the effect of the tree-building algorithm on the performance of the resulting HLBL model. To do that, we used the RANDOM, BALANCED, and ADAPTIVE algorithms to generate one tree each. The ADAPTIVE(ε) algorithm was used to generate two trees: one with ε set\n\nTable 3: Test set perplexity results for the hierarchical LBL models. All the distributed models in the comparison used 100-dimensional feature vectors and a context size of 5. LBL is the non-hierarchical log-bilinear model. KNn is a Kneser-Ney n-gram model. The scores for LBL, KN3, and KN5 are from [9]. 
The timing for LBL is based on our implementation of the model.\n\nModel type | Tree used | Tree generating algorithm | Perplexity | Minutes per epoch\nHLBL | T1 | RANDOM | 151.2 | 4\nHLBL | T2 | BALANCED | 131.3 | 4\nHLBL | T3 | ADAPTIVE | 127.0 | 4\nHLBL | T4 | ADAPTIVE(0.25) | 124.4 | 6\nHLBL | T5 | ADAPTIVE(0.4) | 123.3 | 7\nHLBL | T6 | ADAPTIVE(0.4) ×2 | 115.7 | 16\nHLBL | T7 | ADAPTIVE(0.4) ×4 | 112.1 | 32\nLBL | – | – | 117.0 | 6420\nKN3 | – | – | 129.8 | –\nKN5 | – | – | 123.2 | –\n\nto 0.25 and the other with ε set to 0.4. We then generated a 2× overcomplete tree by running the ADAPTIVE(ε = 0.4) algorithm twice and creating a tree with a root node that had the two generated trees as its subtrees. Since the ADAPTIVE(ε) algorithm involves some randomization, we tried to improve the model performance by allowing the model to choose dynamically between two possible clusterings. Finally, we generated a 4× overcomplete tree using the same approach. Table 1 lists the generated trees as well as some statistics for them. Note that trees generated by ADAPTIVE(ε) with ε > 0 result in models with more parameters, due to the greater number of tree nodes and thus tree-node feature vectors, compared to trees generated by methods producing one code/leaf per word.\n\nTable 3 shows the test set perplexities and time per epoch for the resulting models, along with the perplexities for models from [9]. The results show that the performance of the HLBL models based on non-random trees is comparable to that of the n-gram models. As expected, building word trees adaptively improves model performance. The general trend that emerges is that bigger trees tend to lead to better performing models. For example, a model based on a single tree produced using the ADAPTIVE(0.4) algorithm performs as well as the 5-gram but not as well as the non-hierarchical LBL model. 
However, using a 2× overcomplete tree generated using the same algorithm results in a model that outperforms both the n-gram models and the LBL model, and using a 4× overcomplete tree leads to a further reduction in perplexity. The time-per-epoch statistics reported for the neural models in Table 3 show the great speed advantage of the HLBL models over the LBL model. Indeed, the slowest of our HLBL models is over 200 times faster than the LBL model.\n\n7 Discussion and future work\n\nWe have demonstrated that a hierarchical neural language model can actually outperform its non-hierarchical counterparts and achieve state-of-the-art performance. The key to making a hierarchical model perform well is using a carefully constructed hierarchy over words. We have presented a simple and fast feature-based algorithm for automatic construction of such hierarchies. Creating hierarchies in which every word occurred more than once was essential to getting the models to perform better.\n\nAn inspection of trees generated by our adaptive algorithm showed that the words with the largest numbers of codes (i.e. the words that were replicated the most) were not the words with multiple distinct senses. Instead, the algorithm appeared to replicate the words that occurred relatively infrequently in the data and were therefore difficult to cluster. The failure to use multiple codes for words with several very different senses is probably a consequence of summarizing the distribution over contexts with a single mean feature vector when clustering words. 
The “sense multimodality” of context distributions would be better captured by using a small set of feature vectors found by clustering the contexts.\n\nFinally, since our tree-building algorithm is based on the feature vectors learned by the model, it is possible to periodically interrupt training of such a model to rebuild the word tree based on the feature vectors provided by the model being trained. This modified training procedure might produce better models by allowing the word hierarchy to adapt to the probabilistic component of the model and vice versa.\n\nAppendix: Details of the training procedure\n\nThe models have been trained by maximizing the log-likelihood using stochastic gradient ascent. All model parameters other than the biases were initialized by sampling from a Gaussian of small variance. The biases for the tree nodes were initialized so that the distribution produced by the model with all the non-bias parameters set to zero matched the base rates of the words in the training set.\n\nModels were trained using a learning rate of 10^-3 until the perplexity on the validation set started to increase. Then the learning rate was reduced to 3 × 10^-5 and training was resumed until the validation perplexity started increasing again. All model parameters were regularized using a small L2 penalty.\n\nAcknowledgments\n\nWe thank Martin Szummer for his comments on a draft of this paper. This research was supported by NSERC and CFI. GEH is a fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\n[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.\n\n[2] Yoshua Bengio and Jean-Sébastien Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS’03, 2003.\n\n[3] P.F. Brown, R.L. Mercer, V.J. Della Pietra, and J.C. Lai. Class-based n-gram models of natural
Class-based n-gram models of natural\n\nlanguage. Computational Linguistics, 18(4):467\u2013479, 1992.\n\n[4] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for lan-\nguage modeling. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for\nComputational Linguistics, pages 310\u2013318, San Francisco, 1996.\n\n[5] Ahmad Emami, Peng Xu, and Frederick Jelinek. Using a connectionist model in a syntactical\n\nbased language model. In Proceedings of ICASSP, volume 1, pages 372\u2013375, 2003.\n\n[6] C. Fellbaum et al. WordNet: an electronic lexical database. Cambridge, Mass: MIT Press,\n\n1998.\n\n[7] J. Goodman. A bit of progress in language modeling. Technical report, Microsoft Research,\n\n2000.\n\n[8] John G. McMahon and Francis J. Smith. Improving statistical language model performance\nwith automatically generated word hierarchies. Computational Linguistics, 22(2):217\u2013247,\n1996.\n\n[9] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. Pro-\n\nceedings of the 24th international conference on Machine learning, pages 641\u2013648, 2007.\n\n[10] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model.\n\nIn Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS\u201905, pages 246\u2013252, 2005.\n\n[11] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. Proceedings of\n\nthe 31st conference on Association for Computational Linguistics, pages 183\u2013190, 1993.\n\n[12] Holger Schwenk and Jean-Luc Gauvain. Connectionist language modeling for large vocabu-\nlary continuous speech recognition. In Proceedings of the International Conference on Acous-\ntics, Speech and Signal Processing, pages 765\u2013768, 2002.\n\n8\n\n\f", "award": [], "sourceid": 918, "authors": [{"given_name": "Andriy", "family_name": "Mnih", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}