{"title": "A Neural Autoregressive Topic Model", "book": "Advances in Neural Information Processing Systems", "page_first": 2708, "page_last": 2716, "abstract": "We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm.", "full_text": "A Neural Autoregressive Topic Model\n\nHugo Larochelle\n\nD\u00b4epartement d\u2019informatique\n\nUniversit\u00b4e de Sherbrooke\n\nhugo.larochelle@usherbrooke.ca\n\nStanislas Lauly\n\nD\u00b4epartement d\u2019informatique\n\nUniversit\u00b4e de Sherbrooke\n\nstanislas.lauly@usherbrooke.ca\n\nAbstract\n\nWe describe a new model for learning meaningful representations of text docu-\nments from an unlabeled collection of documents. This model is inspired by the\nrecently proposed Replicated Softmax, an undirected graphical model of word\ncounts that was shown to learn a better generative model and more meaningful\ndocument representations. 
Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm.\n\n1 Introduction\n\nIn order to leverage the large amount of available unlabeled text, a lot of research has been devoted to developing good probabilistic models of documents. Such models are usually embedded with latent variables or topics, whose role is to capture salient statistical patterns in the co-occurrence of words within documents.\n\nThe most popular model is latent Dirichlet allocation (LDA) [1], a directed graphical model in which each word is a sample from a mixture of global word distributions (shared across documents) and where the mixture weights vary between documents. In this context, the word multinomial distributions (mixture components) correspond to the topics and a document is represented as the parameters (mixture weights) of its associated distribution over topics. 
Once trained, these topics have been found to extract meaningful groups of semantically related words, and the (approximately) inferred topic mixture weights have been shown to form a useful representation for documents.\n\nMore recently, Salakhutdinov and Hinton [2] proposed an alternative undirected model, the Replicated Softmax, which, instead of representing documents as distributions over topics, relies on a binary distributed representation of the documents. The latent variables can then be understood as topic features: they do not correspond to normalized distributions over words, but to unnormalized factors over words. A combination of topic features generates a word distribution by multiplying these factors and renormalizing. They show that the Replicated Softmax allows for very efficient inference of a document\u2019s topic feature representation and outperforms LDA both as a generative model of documents and as a method for representing documents in an information retrieval setting.\n\nWhile inference of a document representation is efficient in the Replicated Softmax, one of its disadvantages is that the complexity of its learning update scales linearly with the vocabulary size V, i.e., the number of different words that are observed in a document. The factor responsible for this\n\nFigure 1: (Left) Illustration of NADE. Colored lines identify the connections that share parameters, and v\u0302i is a shorthand for the autoregressive conditional p(vi|v<i)