John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence Saul
Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction , then fed as input into a hierarchical mixture of experts, where each expert is a multinomial dis- tribution over predicted words . While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocab- ulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts.
Statistical language models are essential components of natural language systems for human-computer interaction. They play a central role in automatic speech recognition , machine translation , statistical parsing , and information retrieval . These mod- els estimate the probability that a word will occur in a given context, where in general a context specifies a relationship to one or more words that have already been observed. The simplest, most studied case is that of n-gram language modeling, where each word is predicted from the preceding n-1 words. The main problem in building these models is that the vast majority of word combinations occur very infrequently, making it difficult to estimate accurate probabilities of words in most contexts.
Researchers in statistical language modeling have developed a variety of smoothing tech- niques to alleviate this problem of data sparseness. Most smoothing methods are based on simple back-off formulas or interpolation schemes that discount the probability of observed events and assign the "leftover" probability mass to events unseen in training . Unfortu- nately, these methods do not typically represent or take advantage of statistical regularities
among contexts. One expects the probabilities of rare or unseen events in one context to be related to their probabilities in statistically similar contexts. Thus, it should be possible to estimate more accurate probabilities by exploiting these regularities.
Several approaches have been suggested for sharing statistical information across contexts. The aggregate Markov model (AMM) of Saul and Pereira  (also discussed by Hofmann and Puzicha  as a special case of the aspect model) factors the conditional probability table of a word given its context by a latent variable representing context "classes". How- ever, this latent variable approach is difficult to generalize to multiword contexts, as the size of the conditional probability table for class given context grows exponentially with the context length.
The neural probabilistic language model (NPLM) of Bengio et al. [2, 3] achieved signifi- cant improvements over state-of-the-art smoothed n-gram models . The NPLM encodes contexts as low-dimensional continuous vectors. These are fed to a multilayer neural net- work that outputs a probability distribution over words. The low-dimensional vectors and the parameters of the network are trained simultaneously to minimize the perplexity of the language model. This model has no difficulty encoding multiword contexts, but its training and application are very costly because of the need to compute a separate normalization for the conditional probabilities associated to each context.
In this paper, we introduce and evaluate a statistical language model that combines the advantages of the AMM and NPLM. Like the NPLM, it can be used for multiword con- texts, and like the AMM it avoids per-context normalization. In our model, contexts are represented as low-dimensional real vectors initialized by unsupervised algorithms for di- mensionality reduction . The probabilities of words given contexts are represented by a hierarchical mixture of experts (HME) , where each expert is a multinomial distri- bution over predicted words. This tree-structured mixture model allows a rich dependency on context without expensive per-context normalization. Proper initialization of the dis- tributed representations is crucial; in particular, we find that initializations from the results of linear and nonlinear dimensionality reduction algorithms lead to better models (with significantly lower test perplexities) than random initialization.
In practice our model is several orders of magnitude faster to train and apply than the NPLM, enabling us to work with larger vocabularies and training corpora. We present re- sults on a large-scale bigram modeling task, showing that our model also leads to significant improvements over comparable AMMs.
2 Distributed representations of words
Natural language has complex, multidimensional semantics. As a trivial example, consider the following four sentences:
The vase broke. The vase contains water. The window broke. The window contains water.
The bottom right sentence is syntactically valid but semantically meaningless. As shown by the table, a two-bit distributed representation of the words "vase" and "window" suffices to express that a vase is both a container and breakable, while a window is breakable but can- not be a container. More generally, we expect low dimensional continuous representations of words to be even more effective at capturing semantic regularities.
Distributed representations of words can be derived in several ways. In a given corpus of text, for example, consider the matrix of bigram counts whose element Cij records the number of times that word wj follows word wi. Further, let pij = Cij/ C k ik denote the conditional frequencies derived from these counts, and let pi denote the V -dimensional
frequency vector with elements pij, where V is the vocabulary size. Note that the vectors pi themselves provide a distributed representation of the words wi in the corpus. For large vocabularies and training corpora, however, this is an extremely unwieldy representation, tantamount to storing the full matrix of bigram counts. Thus, it is natural to seek a lower dimensional representation that captures the same information. To this end, we need to map each vector pi to some d-dimensional vector xi, with d V . We consider two methods in dimensionality reduction for this problem. The results from these methods are then used to initialize the HME architecture in the next section.
2.1 Linear dimensionality reduction
The simplest form of dimensionality reduction is principal component analysis (PCA). PCA computes a linear projection of the frequency vectors pi into the low dimensional subspace that maximizes their variance. The variance-maximizing subspace of dimension- ality d is spanned by the top d eigenvectors of the frequency vector covariance matrix. The eigenvalues of the covariance matrix measure the variance captured by each axis of the subspace. The effect of PCA can also be understood as a translation and rotation of the frequency vectors pi, followed by a truncation that preserves only their first d elements.
2.2 Nonlinear dimensionality reduction
Intuitively, we would like to map the vectors pi into a low dimensional space where se- mantically similar words remain close together and semantically dissimilar words are far apart. Can we find a nonlinear mapping that does this better than PCA? Weinberger et al. recently proposed a new solution to this problem based on semidefinite programming .
Let xi denote the image of pi under this mapping. The mapping is discovered by first learning the V V matrix of Euclidean squared distances  given by Dij = |xi - xj|2. This is done by balancing two competing goals: (i) to co-locate semantically similar words, and (ii) to separate semantically dissimilar words. The first goal is achieved by fixing the distances between words with similar frequency vectors to their original values. In particu- lar, if pj and pk lie within some small neighborhood of each other, then the corresponding element Djk in the distance matrix is fixed to the value |pj - pk|2. The second goal is achieved by maximizing the sum of pairwise squared distances ijDij. Thus, we push the words in the vocabulary as far apart as possible subject to the constraint that the distances between semantically similar words do not change.
The only freedom in this optimization is the criterion for judging that two words are se- mantically similar. In practice, we adopt a simple criterion such as k-nearest neighbors in the space of frequency vectors pi and choose k as small as possible so that the resulting neighborhood graph is connected .
The optimization is performed over the space of Euclidean squared distance matrices . Necessary and sufficient conditions for the matrix D to be interpretable as a Euclidean squared distance matrix are that D is symmetric and that the Gram matrix1 derived from G = - 1 HDHT is semipositive definite, where H = I - 1 11T. The optimization can 2 V thus be formulated as the semidefinite programming problem:
Maximize ijDij subject to: (i) DT = D, (ii) - 1 HDH 0, and 2 (iii) Dij = |pi - pj|2 for all neighboring vectors pi and pj. 1Assuming without loss of generality that the vectors xi are centered on the origin, the dot prod- ucts Gij = xi xj are related to the pairwise squared distances Dij = |xi - xj |2 as stated above.