{"title": "Unbounded cache model for online language modeling with open vocabulary", "book": "Advances in Neural Information Processing Systems", "page_first": 6042, "page_last": 6052, "abstract": "Recently, continuous cache models were proposed as extensions to recurrent neural network language models, to adapt their predictions to local changes in the data distribution. These models only capture the local context, of up to a few thousands tokens. In this paper, we propose an extension of continuous cache models, which can scale to larger contexts. In particular, we use a large scale non-parametric memory component that stores all the hidden activations seen in the past. We leverage recent advances in approximate nearest neighbor search and quantization algorithms to store millions of representations while searching them efficiently. We conduct extensive experiments showing that our approach significantly improves the perplexity of pre-trained language models on new distributions, and can scale efficiently to much larger contexts than previously proposed local cache models.", "full_text": "Unbounded cache model for online language\n\nmodeling with open vocabulary\n\nEdouard Grave\n\nFacebook AI Research\n\negrave@fb.com\n\nMoustapha Cisse\n\nFacebook AI Research\n\nmoustaphacisse@fb.com\n\nArmand Joulin\n\nFacebook AI Research\n\najoulin@fb.com\n\nAbstract\n\nRecently, continuous cache models were proposed as extensions to recurrent neural\nnetwork language models, to adapt their predictions to local changes in the data\ndistribution. These models only capture the local context, of up to a few thousands\ntokens. In this paper, we propose an extension of continuous cache models, which\ncan scale to larger contexts. In particular, we use a large scale non-parametric\nmemory component that stores all the hidden activations seen in the past. 
We leverage recent advances in approximate nearest neighbor search and quantization algorithms to store millions of representations while searching them efficiently. We conduct extensive experiments showing that our approach significantly improves the perplexity of pre-trained language models on new distributions, and can scale efficiently to much larger contexts than previously proposed local cache models.

1 Introduction

Language models are a core component of many natural language processing applications such as machine translation [3], speech recognition [2] or dialogue agents [50]. In recent years, deep learning has led to remarkable progress in this domain, reaching state-of-the-art performance on many challenging benchmarks [31]. These models are known to be over-parametrized, and large quantities of data are needed for them to reach their full potential [12]. Consequently, the training time can be very long (up to weeks), even when vast computational resources are available [31]. Unfortunately, in many real-world scenarios, either such a quantity of data is not available, or the distribution of the data changes too rapidly to permit very long training. A common strategy to circumvent these problems is to use a pre-trained model and slowly finetune it on the new source of data. Such an adaptive strategy is also time-consuming for parametric models, since the specificities of the new dataset must be slowly encoded in the parameters of the model. Additionally, such a strategy is prone to overfitting and dramatic forgetting of crucial information from the original dataset. These difficulties directly result from the nature of parametric models.
In contrast, non-parametric approaches do not require retraining and can efficiently incorporate new information without damaging the original model.
This makes them particularly suitable for settings requiring rapid adaptation to a changing distribution or to novel examples. However, non-parametric models perform significantly worse than fully trained deep models [12]. In this work, we are interested in building a language model that combines the best of both non-parametric and parametric approaches: a deep language model to model most of the distribution, and a non-parametric one to adapt it to the change of distribution.
This solution has been used in speech recognition under the name of cache models [36, 37]. Cache models exploit the unigram distribution of a recent context to improve the predictive ability of the model. Recently, Grave et al. [22] and Merity et al. [43] showed that this solution could be applied to neural networks. However, cache models depend on the local context. Hence, they can only adapt a parametric model to a local change in the distribution. These specificities limit their usefulness when the context is unavailable (e.g., tweets) or is enormous (e.g., book reading). This work overcomes this limitation by introducing a fast non-parametric retrieval system into the hybrid approach. We demonstrate that this novel combination of a parametric neural language model with a non-parametric retrieval system can smoothly adapt to changes in the distribution while remaining as consistent as possible with the history of the data. Our approach is a generalization of cache models which scales to millions of examples.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Related work

This section reviews different settings that require models to adapt to changes in the data distribution, like transfer learning or open set (continual) learning. We also discuss solutions specific to language models, and we briefly explain large-scale retrieval methods.

Transfer Learning.
Transfer learning [10] is a well-established component of machine learning practitioners' toolbox. It exploits the commonalities between different tasks to improve the predictive performance of the models trained to solve them. Notable variants of transfer learning are multitask learning [10], domain adaptation [6], and curriculum learning [8]. Multitask learning jointly trains several models to promote sharing of statistical strength. Domain adaptation reuses existing information about a given problem (e.g., data or model) to solve a new task. Curriculum learning takes one step further by adapting an existing model across a (large) sequence of increasingly difficult tasks. Models developed for these settings have proven useful in practice. However, they are chiefly designed for supervised learning and do not scale to the size of the problem we consider in this work.

Class-incremental and Open Set Learning. These methods are concerned with problems where the set of targets is not known in advance but, instead, increases over time. The main difficulty in this scenario lies in the deterioration of performance on previously seen classes when trying to accommodate new ones. Kuzborskij et al. [39] proposed to reduce the loss of accuracy when adding new classes by partly retraining the existing classifier. Muhlbaier et al. [47] introduced an ensemble model to deal with an increasingly large number of concepts. However, their approach relies on unrealistic assumptions on the data distribution. Zero-shot learning [41] can deal with new classes but often requires additional descriptive information about them [1]. Scheirer et al. [49] proposed a framework for open set recognition based on one-class SVMs.

Adaptive language models. Adaptive language models change their parameters according to the recent history. Therefore, they implement a form of domain adaptation.
A popular approach adds a cache to the model, and showed early success in the context of speech recognition [36, 38, 37]. Jelinek et al. [29] further extended this strategy into a smoothed trigram language model, reporting a reduction in both perplexity and word error rates. Della Pietra et al. [15] adapt the cache to a general n-gram model such that it satisfies marginal constraints obtained from the current document. Closer to our work, Grave et al. [21] have shown that this strategy can improve modern language models like recurrent networks without retraining. However, their model assumes that the data distribution changes smoothly over time, by using a context window to improve the performance. Merity et al. [43] proposed a similar model, where the cache is jointly trained with the language model.
Other adaptive language models have been proposed in the past: Kneser and Steinbiss [35] and Iyer and Ostendorf [26] dynamically adapt the parameters of their model to recent history using different weight interpolation schemes. Bellegarda [5] and Coccaro and Jurafsky [14] use latent semantic analysis to adapt their models to the current context. Similarly, topic features have been used with either maximum entropy models [33] or recurrent networks [46, 53]. Finally, Lau et al. [42] propose to use pairs of distant words to capture long-range dependencies.

Large scale retrieval approaches. The standard method for large-scale retrieval is to compress vectors and query them using a standard efficient algorithm. One of the most popular strategies is locality-sensitive hashing (LSH) by Charikar [11], which uses random projections to approximate the cosine similarity between vectors by a function related to the Hamming distance between their corresponding binary codes. Several works have built on this initial binarization technique, such as spectral hashing [54] or Iterative Quantization (ITQ) [19].
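As a concrete illustration of the binary-code idea behind LSH, here is a minimal numpy sketch of sign random projections; the function names and parameters are ours, not from any library:

```python
import numpy as np

def lsh_codes(vectors, n_bits, rng):
    # One random hyperplane per bit; the sign of the projection gives the bit
    # (Charikar-style sign random projection).
    projections = rng.standard_normal((vectors.shape[1], n_bits))
    return vectors @ projections > 0

def hamming_similarity(code_a, code_b):
    # Fraction of matching bits; for sign random projections this estimates
    # 1 - angle(x, y) / pi, a monotone function of the cosine similarity.
    return float(np.mean(code_a == code_b))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
batch = np.stack([x, x + 0.05 * rng.standard_normal(64), -x])
codes = lsh_codes(batch, n_bits=256, rng=rng)
sim_near = hamming_similarity(codes[0], codes[1])      # near-duplicate: most bits agree
sim_opposite = hamming_similarity(codes[0], codes[2])  # -x flips every bit
```

A near-duplicate of x shares almost all of its bits, while -x disagrees on every bit, so Hamming comparisons on the short binary codes stand in for angle comparisons on the original vectors.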
Product Quantization (PQ) [28] approximates the distances between vectors by simultaneously learning the codes and the centroids, using k-means. In the context of text, several works have shown that compression does not significantly reduce the performance of models [17, 24, 30].

3 Approach

In this section, we first briefly review language modeling and the use of recurrent networks for this task. We then describe our model, called unbounded cache, and explain how to scale it to large datasets with millions of words.

3.1 Language modeling

A language model evaluates the probability distribution of sequences of words. It is often framed as learning the conditional probability of words, given their history [4]. Let V be the size of the vocabulary; each word is represented by a one-hot encoding vector $x \in \mathbb{R}^V$, corresponding to its index in the dictionary. Using the chain rule, the probability assigned to a sequence of words $x_1, \ldots, x_T$ can be factorized as

  $p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1)$.   (1)

This conditional probability is traditionally approximated with non-parametric models based on counting statistics [20]. In particular, smoothed N-gram models [32, 34] have been the dominant type of models historically, achieving good performance in practice [44]. While the use of parametric models for language modeling is not new [48], their superiority has only been established with the recent emergence of neural networks [7, 45].
In particular, recurrent networks are now the standard approach, achieving state-of-the-art performance on several challenging benchmarks [31, 55].

3.2 Recurrent networks

Recurrent networks are a special case of neural networks specifically designed for sequence modeling. At each time step, they maintain a hidden representation of the past and make a prediction accordingly. This representation is maintained by a continuous vector $h_t \in \mathbb{R}^d$ encoding the history $x_t, \ldots, x_1$. The probability of the next word is then simply parametrized using this hidden vector, i.e.,

  $p(w \mid x_t, \ldots, x_1) \propto \exp(h_t^\top o_w)$.   (2)

The hidden vector $h_t$ is computed by recursively applying an update rule:

  $h_t = \Phi(x_t, h_{t-1})$,   (3)

where $\Phi$ is a function depending on the architecture of the network. Depending on $\Phi$, the hidden vectors may have a specific structure adapted to different sequence representation problems. Several architectures for recurrent networks have been proposed, such as the Elman network [16], the long short-term memory (LSTM) [25] or the gated recurrent unit (GRU) [13]. For example, the Elman network [16] is defined by the following update rule

  $h_t = \sigma(L x_t + R h_{t-1})$,   (4)

where $\sigma$ is a non-linearity such as the logistic or tanh functions, $L \in \mathbb{R}^{d \times V}$ is a word embedding matrix and $R \in \mathbb{R}^{d \times d}$ is the recurrent matrix. Empirical results have validated the effectiveness of the LSTM architecture for natural language modeling [31]. We refer the reader to [23] for details on this architecture. In the rest of this paper, we focus on this structure of recurrent networks.
Recurrent networks process a sentence one word at a time and update their weights by backpropagating the error of the prediction to a fixed window size of past time steps.
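To make the recurrent update and the output distribution above concrete, here is a toy numpy sketch of a forward pass through an Elman network; all dimensions and variable names are illustrative, and a real system would use an LSTM and train the matrices:

```python
import numpy as np

def elman_step(x_onehot, h_prev, L, R):
    # One Elman update: h_t = tanh(L x_t + R h_{t-1}).
    return np.tanh(L @ x_onehot + R @ h_prev)

def next_word_distribution(h, O):
    # p(w | history) proportional to exp(h^T o_w), computed as a softmax
    # over the rows of the output matrix O.
    scores = O @ h
    scores -= scores.max()          # numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(0)
V, d = 10, 8                        # toy vocabulary and hidden sizes
L, R, O = (rng.standard_normal(s) * 0.1 for s in [(d, V), (d, d), (V, d)])
h = np.zeros(d)
for w in [3, 1, 4]:                 # feed a short word sequence
    h = elman_step(np.eye(V)[w], h, L, R)
p = next_word_distribution(h, O)    # distribution over the next word
```

Multiplying the conditional probabilities produced this way, one step per word, yields the sequence probability of Eq. (1).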
This training procedure is computationally expensive, and often requires a significant amount of data to achieve good performance. To circumvent the need of retraining such a network for domain adaptation, we propose to add a non-parametric model that takes care of the fluctuation in the data distribution.

3.3 Unbounded cache

An unbounded cache adds a non-parametric and unconstrained memory to a neural network. Our approach is inspired by the cache model of Kuhn [36] and can be seen as an extension of Grave et al. [22] to an unbounded memory structure tailored to deal with out-of-vocabulary and rare words. Similar to Grave et al. [22], we extend a recurrent neural network with a key-value memory component, storing the pairs $(h_i, w_{i+1})$ of hidden representation and corresponding word. This memory component also shares similarity with the parametric memory component of the pointer network introduced by Vinyals et al. [52] and extended by Merity et al. [43]. As opposed to these models and standard cache models, we do not restrict the cache component to recent history but store all previously observed words. Using the information stored in the cache component, we can obtain a probability distribution over the words observed up to time t using the kernel density estimator:

  $p_{cache}(w_t \mid w_1, \ldots, w_{t-1}) \propto \sum_{i=1}^{t-1} \mathbf{1}\{w_t = w_i\}\, K\!\left(\frac{\|h_t - h_i\|}{\theta}\right)$,   (5)

where K is a kernel, such as Epanechnikov or Gaussian, and $\theta$ is a smoothing parameter. If K is the Gaussian kernel ($K(x) = \exp(-x^2/2)$) and the hidden representations are normalized, this is equivalent to the continuous cache model.
As the memory grows with the amount of data seen by the model, this probability distribution becomes impossible to compute. Millions of words and their multiple associated context representations are stored, and exact exhaustive matching is prohibitive.
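For small caches, the exact estimator above can be sketched in a few lines of numpy. This is a fixed-bandwidth version with illustrative names; the uniform fallback for an all-zero kernel sum is our own choice, not from the paper:

```python
import numpy as np

def epanechnikov(u):
    # Kernel that is positive on [-1, 1] and zero outside.
    return np.maximum(0.0, 1.0 - u ** 2)

def cache_distribution(h_t, cached_h, cached_w, vocab_size, theta):
    # Kernel density estimate over the stored (h_i, w_{i+1}) pairs:
    # each cached word accumulates the kernel weight of its hidden state.
    dists = np.linalg.norm(cached_h - h_t, axis=1)
    weights = epanechnikov(dists / theta)
    p = np.zeros(vocab_size)
    np.add.at(p, cached_w, weights)   # sum kernel weights per word id
    total = p.sum()
    return p / total if total > 0 else np.full(vocab_size, 1.0 / vocab_size)

rng = np.random.default_rng(0)
vocab_size, d, n = 20, 8, 500
cached_h = rng.standard_normal((n, d))      # past hidden states
cached_w = rng.integers(0, vocab_size, n)   # the words that followed them
p = cache_distribution(cached_h[0], cached_h, cached_w, vocab_size, theta=2.0)
# The query equals cached_h[0], so cached_w[0] receives kernel weight 1 from it.
```

An unbounded cache cannot afford this exhaustive sum over millions of stored states, which motivates restricting it to approximate nearest neighbors, with the bandwidth set adaptively from the k-th neighbor, as described next.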
Instead, we use the approximate k-nearest neighbors algorithm described below in Sec. 3.4 to estimate this probability distribution:

  $p_{cache}(w_t \mid w_1, \ldots, w_{t-1}) \propto \sum_{i \in N(h_t)} \mathbf{1}\{w_t = w_i\}\, K\!\left(\frac{\|h_t - h_i\|}{\theta(h_t)}\right)$,   (6)

where $N(h_t)$ is the set of nearest neighbors and $\theta(h_t)$ is the Euclidean distance from $h_t$ to its k-th nearest neighbor. This estimator is known as variable kernel density estimation [51]. It should be noted that if the kernel K is equal to zero outside of $[-1, 1]$, taking the sum over the k nearest neighbors is equivalent to taking the sum over the full data.
The distribution obtained using the estimator defined in Eq. 6 assigns non-zero probability to at most k words, where k is the number of nearest neighbors used. In order to have non-zero probability everywhere (and avoid getting infinite perplexity), we propose to linearly interpolate this distribution with the one from the model:

  $p(w_t \mid w_1, \ldots, w_{t-1}) = (1 - \lambda)\, p_{model}(w_t \mid w_1, \ldots, w_{t-1}) + \lambda\, p_{cache}(w_t \mid w_1, \ldots, w_{t-1})$.

3.4 Fast large scale retrieval

Fast computation of the probability of a rare word is crucial to make the cache grow to millions of potential words. Their representations also need to be stored with relatively low memory usage. In this section, we briefly describe a scalable retrieval method introduced by Jegou et al. [27]. Their approach, called Inverted File System Product Quantization (IVFPQ), combines two methods: an inverted file system [56] and a quantization method called Product Quantization (PQ) [28]. Combining these two components offers a good compromise between fast retrieval of approximate nearest neighbors and a low memory footprint.

Inverted file system. Inverted file systems [56] are a core component of standard large-scale text retrieval systems, like search engines.
When a query x is compared to a set Y of potential elements, an inverted file avoids an exhaustive search by providing a subset of possible matching candidates. In the context of continuous vectors, this subset is obtained by measuring some distance between the query and predefined vector representations of the set. More precisely, these candidates are selected through "coarse matching" by clustering all the elements in Y into c groups using k-means. The centroids are used as the vector representations. Each element of the set Y is associated with one centroid in an inverted table. The query x is then compared to each centroid, and a subset of them is selected according to their distance to the query. All the elements of Y associated with these centroids are then compared to the query x. Typically, we take c centroids and keep only a small number of the closest centroids to a query (the probes).

This procedure is quite efficient but very memory consuming, as each vector in the set Y must be stored. This can be drastically reduced by quantizing the vectors. Product Quantization (PQ) is a popular quantization method that has shown competitive performance on many retrieval benchmarks [28]. Following Jegou et al. [28], we do not directly quantize the vector y but its residual r, i.e., the difference between the vector and its associated centroid.

Product Quantization. Product Quantization is a data-driven compression algorithm with no overhead during search [28]. While PQ has been designed for image feature compression, Joulin et al. [30] have demonstrated its effectiveness for text too. PQ compresses real-valued vectors by approximating them with the closest vector in a pre-defined structured set of centroids, called a codebook. This codebook is obtained by splitting each residual vector r into k subvectors $r^i$, each of dimension d/k, and running a k-means algorithm with s centroids on each resulting subspace.
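Putting the two components together, here is a toy, self-contained numpy rendition of the IVFPQ idea: coarse k-means cells, PQ-coded residuals, and probing a few cells at query time. Everything here (names, sizes, the plain k-means) is illustrative; the experiments in this paper use the FAISS implementation:

```python
import numpy as np

def kmeans(points, c, iters=10, seed=0):
    # Plain k-means; returns centroids and each point's cell assignment.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=c, replace=False)].copy()
    for _ in range(iters):
        assign = ((points[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for j in range(c):
            if (assign == j).any():
                centroids[j] = points[assign == j].mean(0)
    assign = ((points[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    return centroids, assign

def pq_codebooks(residuals, k, s):
    # One k-means codebook of s centroids per subspace (subvectors of dim d/k).
    n, d = residuals.shape
    subs = residuals.reshape(n, k, d // k)
    return [kmeans(subs[:, i], s, seed=i)[0] for i in range(k)]

def pq_encode(r, books):
    # Store r as k small integers: the closest centroid index in each subspace.
    subs = r.reshape(len(books), -1)
    return [int(((b - sv) ** 2).sum(-1).argmin()) for sv, b in zip(subs, books)]

def pq_decode(code, books):
    # Concatenate the selected centroids to reconstruct the residual.
    return np.concatenate([b[j] for j, b in zip(code, books)])

rng = np.random.default_rng(1)
db = rng.standard_normal((2000, 16))
centroids, assign = kmeans(db, c=32)          # coarse cells (inverted file)
residuals = db - centroids[assign]
books = pq_codebooks(residuals, k=4, s=32)    # 4 sub-quantizers, 5 bits each
codes = [pq_encode(r, books) for r in residuals]

def search(query, n_probe=4):
    # Compare the query to the 32 centroids only, then search exhaustively
    # inside the n_probe closest cells, using the PQ reconstructions.
    cells = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, cells))
    recon = np.stack([centroids[assign[i]] + pq_decode(codes[i], books) for i in cand])
    return cand[((recon - query) ** 2).sum(-1).argmin()], len(cand)

nearest, n_compared = search(db[123] + 1e-3)
```

With c = 32 cells and 4 probes, only a fraction of the 2,000 stored vectors is compared against the query, and each stored vector costs its cell id plus 4 × 5 = 20 bits of PQ code instead of 16 floats.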
The resulting codebook contains $s^k$ elements, which is too large to be enumerated; it is instead implicitly defined by its structure: a d-dimensional vector $x \in \mathbb{R}^d$ is approximated as

  $\hat{x} = \sum_{i=1}^{k} q_i(x)$,   (7)

where $q_i(x)$ is the closest centroid to subvector $x^i$. For each subspace, there are $s = 2^b$ centroids, where b is the number of bits required to store the quantization index of the sub-quantizer. Note that in PQ, the subspaces are aligned with the natural axes; improvements were made by Ge et al. [18] to align the subspaces with the principal axes of the data. The reconstructed vector can take $2^{kb}$ distinct reproduction values and is stored in kb bits.
PQ estimates the inner product in the compressed domain as

  $x^\top y \approx \hat{x}^\top y = \sum_{i=1}^{k} q_i(x^i)^\top y^i$.   (8)

In practice, the vector estimate $\hat{x}$ is trivially reconstructed from the codes (i.e., from the quantization indexes) by concatenating these centroids. PQ uses two parameters, namely the number of sub-quantizers k and the number of bits b per quantization index.

4 Experiments

In this section, we present evaluations of our unbounded cache model on different language modeling tasks. We first briefly describe our experimental setting and the datasets we used, before presenting the results.

4.1 Experimental setting

One of the motivations of our model is to be able to adapt to a changing data distribution. In particular, we want to incorporate new words in the vocabulary, as they appear in the test data. We thus consider a setting where we do not replace any words by the <unk> token, and where the test set contains out-of-vocabulary words (OOV) which were absent at train time. Since we use the perplexity as the evaluation metric, we need to avoid probabilities equal to zero in the output of our models (which would result in infinite perplexity).
Thus, we always interpolate the probability distributions of the various models with the uniform distribution over the full vocabulary:

  $p_{uniform}(w_t) = \frac{1}{|vocabulary|}$.

This is a standard technique, which was previously used to compare language models trained on datasets with different vocabularies [9].

Baselines. We compare our unbounded cache model with the static model interpolated with the uniform distribution, as well as the static model interpolated with the unigram probability distribution observed up to time t. Our proposal is a direct extension of the local cache model [22]. Therefore, we also compare to it, to highlight the settings where an unbounded cache model is preferable to a local one.

Table 1: Vocabulary size and out-of-vocabulary rate for various test sets (for a model trained on News 2007).

  Test set      Size     OoV rate (%)
  News 2008     219,796  2.3
  News 2009     218,628  2.4
  News 2010     205,859  2.4
  News 2011     209,187  2.5
  Commentary    144,197  4.2
  Web           321,072  5.9
  Wiki          191,554  5.5
  Books         174,037  3.7

4.2 Implementation details

We train recurrent neural networks with 256 LSTM hidden units, using the Adagrad algorithm with a learning rate of 0.2 and 10 epochs. We compute the gradients using backpropagation through time (BPTT) over 20 timesteps. Because of the large vocabulary sizes, we use the adaptive softmax [21]. We use the IVFPQ implementation from the FAISS open source library.¹ We use 4,096 centroids and 8 probes for the inverted file. Unless said otherwise, we query the 1,024 nearest neighbors.

4.3 Datasets

Most commonly used benchmarks for evaluating language models propose to replace rare words by the <unk> token. On the contrary, we are interested in open vocabulary settings, and therefore decided to use datasets without <unk>.
We performed experiments on data from the five following domains:

• News Crawl² is a dataset made of news articles, collected from various online publications. There is one subset of the data for each year, from 2007 to 2011. This dataset allows testing the unbounded cache models on data whose distribution slowly changes over time. The dataset is shuffled at the sentence level. In the following, we refer to this dataset as news 2007-2011.

• News Commentary consists of political and economic commentaries from the website https://www.project-syndicate.org/. This dataset is publicly available from the Statistical Machine Translation workshop website. In the following, we refer to this dataset as commentary.

• Common Crawl is a text dataset collected from diverse web sources. The dataset is shuffled at the sentence level. In the following, we refer to this dataset as web.

• WikiText³ is a dataset derived from high quality English Wikipedia articles, introduced by Merity et al. [43]. Since we do not replace any tokens by <unk>, we use the raw version. In the following, we refer to this dataset as wiki.

• The Book Corpus is a dataset of 3,036 English books, collected from the Project Gutenberg⁴ [40]. We use a subset of the books, which have a length of around 100,000 tokens. In the following, we refer to this dataset as books.

All these datasets are publicly available. Unless stated otherwise, we use 2 million tokens for training the static models and 10 million tokens for evaluation.
All datasets are lowercased and tokenized using the europarl dataset tools.⁵

Footnotes:
1. https://github.com/facebookresearch/faiss
2. http://www.statmt.org/wmt14/translation-task.html
3. https://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/
4. http://www.gutenberg.org/
5. http://statmt.org/europarl/v7/tools.tgz

Table 2: Static model trained on news 2007 and tested on news 2007-2011.

  model                     2007   2008   2009   2010   2011
  static                    220.9  237.6  256.2  259.7  268.8
  static + unigram          220.3  235.9  252.6  256.1  264.3
  static + local cache      218.9  234.5  250.5  256.2  265.2
  static + unbounded cache  166.5  191.4  202.6  204.8  214.3

Figure 1: Performance of our model, as a function of the number k of nearest neighbors used to estimate the probability of words in the unbounded cache. We report the entropy difference with the static+unigram baseline. (Left panel: news 2008-2011; right panel: domain adaptation to commentary, web, wiki and books.)

Table 3: Static model trained on news 2007 and tested on data from other domains. (Rows: training domain and model; columns: test domain.)

  Train: News               News   Commentary  Web    Wiki    Books
  static                    -      342.7       689.3  1003.2  687.1
  static + unigram          -      303.5       581.1  609.4   349.1
  static + local cache      -      288.5       593.4  316.5   240.3
  static + unbounded cache  -      191.1       383.4  337.4   237.2

  Train: Web                News   Commentary  Web    Wiki    Books
  static                    624.1  484.0       -      805.3   784.3
  static + unigram          519.2  395.6       -      605.3   352.4
  static + local cache      531.4  391.3       -      321.5   235.8
  static + unbounded cache  306.3  234.9       -      340.2   223.6

  Train: Wiki               News   Commentary  Web    Wiki    Books
  static                    638.1  626.3       901.0  -       654.6
  static + unigram          537.9  462.2       688.5  -       346.9
  static + local cache      532.8  436.7       694.3  -       228.8
  static + unbounded cache  318.7  255.3       456.1  -       223.8

Table 4: Computational time (in seconds) to process 10M tokens from different test sets for the static language model, the local cache (size 10,000) and the unbounded cache.

  Dataset     Static model  Local cache  Unbounded cache
  News 2008   82            664          433
  Commentary  78            613          494
  Web         85            668          502
  Wiki        87            637          540
  Books       81            626          562

Figure 2: Performance of the unbounded cache model, as a function of the number of test examples. We report the entropy difference with the static+unigram baseline. We observe that, as the number of test examples increases (and thus, the information stored in the cache), the performance of the unbounded cache increases.

4.4 Results

We demonstrate the effectiveness of using an unbounded cache to complement a language model, as advocated in the previous sections, by performing two types of experiments representing near domain and far domain adaptation scenarios. In both experiments, we compare the static model, its unigram extension, and the unbounded cache model.

Local vs. Unbounded Cache. We first study the impact of using an unbounded cache instead of a local one. To that end, we compare the performance of the two models when trained and tested on different combinations of the previously described datasets. These datasets can be categorized into two groups according to their properties and the results obtained by the various models we use.
On the one hand, the Wiki and Books datasets are not shuffled. Hence, the recent history (up to a few thousand words) contains a wealth of information that can be used by a local cache to reduce the perplexity of a static model. Indeed, the local cache model achieves respectively 316.5 and 240.3 on the Wiki and Books datasets when trained on the News dataset. This corresponds to about a 3× reduction in perplexity on both datasets in comparison to the static model.
A similar trend holds when the training data is either the Web or Wiki dataset. Surprisingly, the unbounded cache model performs similarly to the local cache model despite using orders of magnitude broader context. A static model trained on News and augmented with an unbounded cache achieves respectively 337.4 and 237.2 of perplexity. It is also worth noting that our approach is more efficient than the local cache, while storing a much larger number of elements. Thanks to the use of a fast nearest neighbor algorithm, it takes 502 seconds to process 10M tokens from the test set when using the unbounded cache. Comparatively, it takes 668 seconds for a local cache model of size 10,000 to perform a similar task. The timing experiments, reported in Table 4, show a similar trend.
On the other hand, the Commentary and Web datasets are shuffled. Therefore, a local cache can hardly capture the relevant statistics to significantly improve upon the static model interpolated with the unigram distribution. Indeed, the perplexity of a local cache model on these datasets when the static model is trained on the News dataset is respectively 288.5 and 593.4. In comparison, the unbounded cache model achieves on the same datasets a perplexity of respectively 191.1 and 383.4. That is an average improvement of about 50% over the local cache in both cases (see Table 3).

Near domain adaptation. We study the benefit of using an unbounded cache model when the test domain is only slightly different from the source domain. We train the static model on news 2007 and test on the corpora news 2008 to news 2011. All the results are reported in Table 2.
We first observe that the unbounded cache brings a 24.6% improvement relative to the static model on the in-domain news 2007 corpus by bringing the perplexity from 220.9 down to 166.5. In comparison, neither using the unigram information nor using a local cache leads to significant improvement.
This result underlines two phenomena. First, the simple distributional information captured by the unigram or the local cache is already captured by the static model. Second, the unbounded cache enhances the discrimination capabilities of the static model by capturing useful non-linearities, thanks to the combination of the nearest neighbors and the representation extracted from the static model. Interestingly, these observations remain consistent when we consider evaluations on the test sets news 2008-2011. Indeed, the average improvement of the unbounded cache relative to the static model on the corpora news 2008-2011 is 20.44%, while the relative improvement of the unigram cache is only 1.3%. Similarly to the in-domain experiment, the unigram brings little useful information to the static model, mainly because the source (news 2007) and the target distributions (news 2008-2011) are very close. In contrast, the unbounded cache still complements the static model with valuable non-linear information about the target distributions.

Far domain adaptation. Our second set of experiments is concerned with testing on different domains from the one the static model is trained on. We use the News, Web and Wiki datasets as source domains, and all five domains as targets. The results are reported in Table 3.
First, we observe that the unigram, the local and the unbounded cache significantly help the static model in all the far domain adaptation experiments. For example, when adapting the static model from the News domain to the Commentary and Wiki domains, the unigram reduces the perplexity of the static model by 39.2 and 393.8 in absolute value respectively. The unbounded cache significantly improves upon the static model and the unigram on all the far domain adaptation experiments.
The smallest relative improvement compared to the static model and the unigram is achieved when adapting from News to Web, and is 79.7% and 51.6% respectively. The more the target domain differs from the source one, the more beneficial the unbounded cache model is. Indeed, when adapting to the Books domain (which is the most different from the other domains), the average improvement given by the unbounded cache relative to the static model is 69.7%.

Number of nearest neighbors. Figure 1 shows the performance of our model as a function of the number of nearest neighbors per query. As observed previously by Grave et al. [22], the performance of a language model improves with the size of the context used in the cache. This context is, in some sense, a constrained version of our set of retained nearest neighbors. Interestingly, we observe the same phenomenon despite forming the set of possible predictions over a much broader set of potential candidates than the immediate local context. Since IVFPQ has a complexity linear in the number of nearest neighbors, setting the number of nearest neighbors to a thousand offers a good trade-off between speed and accuracy.

Size of the cache. Figure 2 shows the gap between the performance of the static language model with and without the cache as the size of the test set increases. Despite having a much more significant set of candidates to look through, our algorithm continues to select relevant information. As the test set is explored, better representations for rare words are stored, explaining this constant improvement.

5 Conclusion

In this paper, we introduce an extension to recurrent networks for language modeling, which stores past hidden activations and associated target words. This information can then be used to obtain a probability distribution over the previous words, allowing the language model to adapt dynamically to the current distribution of the data.
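A minimal sketch of this cache mechanism, assuming exact inner-product search over the stored hidden states (the full model relies on approximate nearest neighbor search for scalability, and the class and parameter names below are illustrative, not taken from the authors' implementation):

```python
import numpy as np

class UnboundedCacheSketch:
    """Illustrative cache over past hidden activations and target words.

    Uses brute-force inner-product search; a scalable version would
    replace it with an approximate index (e.g. an IVFPQ-style index).
    """

    def __init__(self, theta=1.0):
        self.keys = []      # past hidden activations
        self.words = []     # associated target word ids
        self.theta = theta  # flatness parameter of the exponential kernel

    def add(self, hidden, word):
        # Store one (hidden state, target word) pair seen in the past.
        self.keys.append(np.asarray(hidden, dtype=np.float64))
        self.words.append(word)

    def prob(self, hidden, vocab_size, k=1000):
        # Cache distribution over the vocabulary from the k nearest
        # stored states, weighted by exp(theta * <h, h_i>).
        h = np.asarray(hidden, dtype=np.float64)
        sims = np.stack(self.keys) @ h        # inner-product similarities
        nearest = np.argsort(-sims)[:k]       # indices of the k nearest states
        p = np.zeros(vocab_size)
        for i in nearest:
            p[self.words[i]] += np.exp(self.theta * sims[i])
        return p / p.sum()
```

At prediction time, such a cache distribution would typically be linearly interpolated with the static model, e.g. `p = (1 - lam) * p_model + lam * p_cache`.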
We propose to scale this simple mechanism to large amounts of data (millions of examples) by using fast approximate nearest neighbor search. We demonstrated on several datasets that our unbounded cache is an efficient method to adapt a recurrent neural network to new domains dynamically, and that it can scale to millions of examples.

Acknowledgements

We thank the anonymous reviewers for their insightful comments.

References

[1] I. Alabdulmohsin, M. Cisse, and X. Zhang. Is attribute-based zero-shot learning an ill-posed strategy? In ECML-PKDD.

[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In ICML, 2016.

[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

[4] L. R. Bahl, F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. PAMI, 1983.

[5] J. R. Bellegarda. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 2000.

[6] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1), 2010.

[7] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. JMLR, 2003.

[8] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.

[9] C. Buck, K. Heafield, and B. van Ooyen. N-gram counts and language models from the common crawl. In LREC, 2014.

[10] R. Caruana. Multitask learning. In Learning to Learn. Springer, 1998.

[11] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.

[12] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson.
One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[14] N. Coccaro and D. Jurafsky. Towards better integration of semantic predictors in statistical language modeling. In ICSLP, 1998.

[15] S. Della Pietra, V. Della Pietra, R. L. Mercer, and S. Roukos. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the workshop on Speech and Natural Language, 1992.

[16] J. L. Elman. Finding structure in time. Cognitive Science, 1990.

[17] M. Federico, N. Bertoldi, and M. Cettolo. IRSTLM: an open source toolkit for handling large scale language models. In INTERSPEECH, 2008.

[18] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, 2013.

[19] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.

[20] J. T. Goodman. A bit of progress in language modeling. Computer Speech & Language, 2001.

[21] E. Grave, A. Joulin, M. Cissé, D. Grangier, and H. Jégou. Efficient softmax approximation for GPUs. In ICML, 2017.

[22] E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. In ICLR, 2017.

[23] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.

[24] K. Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011.

[25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.

[26] R. M. Iyer and M. Ostendorf. Modeling long distance dependence in language: Topic mixtures versus dynamic cache models.
IEEE Transactions on Speech and Audio Processing, 1999.

[27] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.

[28] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. PAMI, 2011.

[29] F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss. A dynamic language model for speech recognition. In HLT, 1991.

[30] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.

[31] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[32] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. ICASSP, 1987.

[33] S. Khudanpur and J. Wu. Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling. Computer Speech & Language, 2000.

[34] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In ICASSP, 1995.

[35] R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In ICASSP, 1993.

[36] R. Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In Proceedings of the 12th Conference on Computational Linguistics - Volume 1, 1988.

[37] R. Kuhn and R. De Mori. A cache-based natural language model for speech recognition. PAMI, 1990.

[38] J. Kupiec. Probabilistic models of short and long distance word dependencies in running text. In Proceedings of the workshop on Speech and Natural Language, 1989.

[39] I. Kuzborskij, F. Orabona, and B. Caputo. From N to N+1: Multiclass transfer incremental learning. In CVPR, 2013.

[40] S. Lahiri.
Complexity of word collocation networks: A preliminary structural analysis. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014.

[41] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. PAMI, 2014.

[42] R. Lau, R. Rosenfeld, and S. Roukos. Trigger-based language models: A maximum entropy approach. In ICASSP, 1993.

[43] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In ICLR, 2017.

[44] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Cernocký. Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH, 2011.

[45] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.

[46] T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. In SLT, 2012.

[47] M. D. Muhlbaier, A. Topalis, and R. Polikar. Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks, 20(1), 2009.

[48] R. Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 1996.

[49] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. PAMI, 2013.

[50] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, 2016.

[51] G. R. Terrell and D. W. Scott. Variable kernel density estimation. The Annals of Statistics, 1992.

[52] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In NIPS, 2015.

[53] T. Wang and K. Cho. Larger-context language modelling.
In ACL, 2016.

[54] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2009.

[55] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. In ICML, 2017.

[56] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 2006.