{"title": "Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process", "book": "Advances in Neural Information Processing Systems", "page_first": 1982, "page_last": 1989, "abstract": "We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the ``topics). In the sparse topic model (STM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the STM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the STM on four real-world datasets. Compared to traditional approaches, the empirical results show that STMs give better predictive performance with simpler inferred models.", "full_text": "Decoupling Sparsity and Smoothness in the\n\nDiscrete Hierarchical Dirichlet Process\n\nChong Wang\n\nComputer Science Department\n\nPrinceton University\n\nDavid M. Blei\n\nComputer Science Department\n\nPrinceton University\n\nchongw@cs.princeton.edu\n\nblei@cs.princeton.edu\n\nAbstract\n\nWe present a nonparametric hierarchical Bayesian model of document collections\nthat decouples sparsity and smoothness in the component distributions (i.e., the\n\u201ctopics\u201d). In the sparse topic model (sparseTM), each topic is represented by a\nbank of selector variables that determine which terms appear in the topic. Thus\neach topic is associated with a subset of the vocabulary, and topic smoothness is\nmodeled on this subset. We develop an ef\ufb01cient Gibbs sampler for the sparseTM\nthat includes a general-purpose method for sampling from a Dirichlet mixture\nwith a combinatorial number of components. We demonstrate the sparseTM on\nfour real-world datasets. 
Compared to traditional approaches, the empirical results show that sparseTMs give better predictive performance with simpler inferred models.

1 Introduction

The hierarchical Dirichlet process (HDP) [1] has emerged as a powerful model for the unsupervised analysis of text. The HDP models documents as distributions over a collection of latent components, which are often called “topics” [2, 3]. Each word is assigned to a topic, and is drawn from a distribution over terms associated with that topic. The per-document distributions over topics represent systematic regularities of word use among the documents; the per-topic distributions over terms encode the randomness inherent in observations from the topics. The number of topics is unbounded.

Given a corpus of documents, analysis proceeds by approximating the posterior of the topics and topic proportions. This posterior bundles the two types of regularity. It is a probabilistic decomposition of the corpus into its systematic components, i.e., the distributions over topics associated with each document, and a representation of our uncertainty surrounding observations from each of those components, i.e., the topic distributions themselves. With this perspective, it is important to investigate how prior assumptions behind the HDP affect our inferences of these regularities.

In the HDP for document modeling, the topics are typically assumed drawn from an exchangeable Dirichlet, a Dirichlet for which the components of the vector parameter are equal to the same scalar parameter. As this scalar parameter approaches zero, it affects the Dirichlet in two ways. First, the resulting draws of random distributions will place their mass on only a few terms. That is, the resulting topics will be sparse. Second, given observations from such a Dirichlet, a small scalar parameter encodes increased confidence in the estimate from the observed counts.
As the parameter approaches zero, the expectation of each per-term probability becomes closer to its empirical estimate. Thus, the expected distribution over terms becomes less smooth. The single scalar Dirichlet parameter affects both the sparsity of the topics and the smoothness of the word probabilities within them.

When employing the exchangeable Dirichlet in an HDP, these distinct properties of the prior have consequences for both the global and local regularities captured by the model. Globally, posterior inference will prefer more topics because more sparse topics are needed to account for the observed words of the collection. Locally, the per-topic distribution over terms will be less smooth, because the posterior distribution has more confidence in its assessment of the per-topic word probabilities, and this results in less smooth document-specific predictive distributions.

The goal of this work is to decouple sparsity and smoothness in the HDP. With the sparse topic model (sparseTM), we can fit sparse topics with more smoothing. Rather than placing a prior over the entire vocabulary, we introduce a Bernoulli variable for each term and each topic to determine whether or not the term appears in the topic. Conditioned on these variables, each topic is represented by a multinomial distribution over its subset of the vocabulary, a sparse representation. This prior smoothes only the relevant terms, and thus the smoothness and sparsity are controlled through different hyper-parameters. As we will demonstrate, sparseTMs give better predictive performance with simpler models than traditional approaches.

2 Sparse Topic Models

Sparse topic models (sparseTMs) aim to separately control the number of terms in a topic, i.e., sparsity, and the probabilities of those words, i.e., smoothness. Recall that a topic is a pattern of word use, represented as a distribution over the fixed vocabulary of the collection.
In order to decouple smoothness and sparsity, we define a topic on a random subset of the vocabulary (giving sparsity), and then model uncertainty of the probabilities on that subset (giving smoothness). For each topic, we introduce a Bernoulli variable for each term in the vocabulary that decides whether the term appears in the topic. Similar ideas of using Bernoulli variables to represent “on” and “off” have been seen in several other models, such as the noisy-OR model [4] and the aspect Bernoulli model [5]. We can view this approach as a particular “spike and slab” prior [6] over Dirichlet distributions. The “spike” chooses the terms for the topic; the “slab” only smoothes those terms selected by the spike.

Assume the size of the vocabulary is V. A Dirichlet distribution over the topic is defined on a (V − 1)-simplex, i.e.,

β ∼ Dirichlet(γ1),   (1)

where 1 is a V-length vector of 1s. In a sparseTM, the idea of imposing sparsity is to use Bernoulli variables to restrict the size of the simplex over which the Dirichlet distribution is defined. Let b be a V-length binary vector composed of V Bernoulli variables. Thus b specifies a smaller simplex through the “on”s of its elements. The Dirichlet distribution over the restricted simplex is

β ∼ Dirichlet(γb),   (2)

which is a degenerate Dirichlet distribution over the sub-simplex specified by b. In [7], Friedman and Singer use this type of distribution for language modeling.

Now we introduce the generative process of the sparseTM. The sparseTM is built on the hierarchical Dirichlet process for text, which we shorthand HDP-LDA.¹ In the Bayesian nonparametric setting the number of topics is not specified in advance or found by model comparison. Rather, it is inferred through posterior inference. The sparseTM assumes the following generative process:

1. 
For each topic k ∈ {1, 2, . . .}, draw term selection proportion π_k ∼ Beta(r, s).
   (a) For each term v, 1 ≤ v ≤ V, draw term selector b_kv ∼ Bernoulli(π_k).
   (b) Let b_k,V+1 = 1[∑_{v=1}^V b_kv = 0] and b_k = [b_kv]_{v=1}^{V+1}.
   (c) Draw topic distribution β_k ∼ Dirichlet(γ b_k).

2. Draw stick lengths α ∼ GEM(λ), which are the global topic proportions.

3. For document d:
   (a) Draw per-document topic proportions θ_d ∼ DP(τ, α).
   (b) For the ith word:
       i. Draw topic assignment z_di ∼ Mult(θ_d).
       ii. Draw word w_di ∼ Mult(β_{z_di}).

Figure 1 illustrates the sparseTM as a graphical model.

¹This acronym comes from the fact that the HDP for text is akin to a nonparametric Bayesian version of latent Dirichlet allocation (LDA).

Figure 1: A graphical model representation for sparseTMs.

The distinguishing feature of the sparseTM is step 1, which generates the latent topics in such a way that decouples sparsity and smoothness. For each topic k there is a corresponding Beta random variable π_k and a set of Bernoulli variables b_kv, one for each term in the vocabulary. Define the sparsity of the topic as

sparsity_k ≜ 1 − ∑_{v=1}^V b_kv / V.   (3)

This is the proportion of zeros in its bank of Bernoulli random variables. Conditioned on the Bernoulli parameter π_k, the expectation of the sparsity is

E[sparsity_k | π_k] = 1 − π_k.   (4)

The conditional distribution of the topic β_k given the vocabulary subset b_k is Dirichlet(γ b_k). Thus, topic k is represented by those terms with non-zero b_kv, and the smoothing is only enforced over these terms through the hyperparameter γ. Sparsity, which is determined by the pattern of ones in b_k, is controlled by the Bernoulli parameter. Smoothing and sparsity are decoupled.

One nuance is that we introduce b_k,V+1 = 1[∑_{v=1}^V b_kv = 0].
The reason is that when b_k,1:V = 0, Dirichlet(γ b_k,1:V) is not well defined. The term b_k,V+1 extends the vocabulary to V + 1 terms, where the (V + 1)th term never appears in the documents. Thus, Dirichlet(γ b_k,1:V+1) is always well defined.

We next compute the marginal distribution of β_k, after integrating out the Bernoullis b_k and their parameter π_k:

p(β_k | γ, r, s) = ∫ dπ_k p(β_k | γ, π_k) p(π_k | r, s)
             = ∑_{b_k} p(β_k | γ, b_k) ∫ dπ_k p(b_k | π_k) p(π_k | r, s).

We see that p(β_k | γ, r, s) and p(β_k | γ, π_k) are mixtures of Dirichlet distributions, where the mixture components are defined over simplices of different dimensions. In total, there are 2^V components; each configuration of the Bernoulli variables b_k specifies one particular component. In posterior inference we will need to sample from this distribution. Sampling from such a mixture is difficult in general, due to the combinatorial sum. In the supplement, we present an efficient procedure to overcome this issue. This is the central computational challenge for the sparseTM.

Steps 2 and 3 mimic the generative process of HDP-LDA [1]. The stick lengths α come from a Griffiths, Engen, and McCloskey (GEM) distribution [8], which is drawn using the stick-breaking construction [9]:

η_k ∼ Beta(1, λ),
α_k = η_k ∏_{j=1}^{k−1} (1 − η_j),   k ∈ {1, 2, . . .}.

Note that ∑_k α_k = 1 almost surely. The stick lengths are used as a base measure in the Dirichlet process prior on the per-document topic proportions, θ_d ∼ DP(τ, α).
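As an aside, the stick-breaking construction is easy to simulate. The following Python sketch draws a truncated approximation to α ∼ GEM(λ); the truncation level T is an illustration-only device, not part of the model:

```python
import numpy as np

def stick_breaking(lam, T, seed=None):
    """Truncated draw of alpha ~ GEM(lam) via stick breaking.

    eta_k ~ Beta(1, lam); alpha_k = eta_k * prod_{j<k} (1 - eta_j).
    """
    rng = np.random.default_rng(seed)
    eta = rng.beta(1.0, lam, size=T)
    # Length of stick remaining before break k: prod_{j<k} (1 - eta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - eta)[:-1]))
    return eta * remaining

alpha = stick_breaking(lam=1.0, T=1000, seed=0)
# For large T the weights sum to nearly 1, matching sum_k alpha_k = 1 a.s.
```

The weights decay stochastically, so a finite truncation captures essentially all of the mass, which is why only finitely many topics are active in any finite corpus.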
Finally, the generative process for the topic assignments z and observed words w is straightforward.

3 Approximate posterior inference using collapsed Gibbs sampling

Since posterior inference is intractable in sparseTMs, we turn to a collapsed Gibbs sampling algorithm. In order to do so, we integrate out the topic proportions θ, topic distributions β and term selectors b analytically. The latent variables needed by the sampling algorithm are the stick lengths α, the Bernoulli parameters π and the topic assignments z. We fix the hyperparameter s to 1.

To sample α and the topic assignments z, we use the direct-assignment method, which is based on an analogy to the Chinese restaurant franchise (CRF) [1]. To apply direct assignment sampling, an auxiliary table count random variable m is introduced. In the CRF setting, we use the following notation. The number of customers in restaurant d (document) eating dish k (topic) is denoted n_dk, and n_d· denotes the number of customers in restaurant d. The number of tables in restaurant d serving dish k is denoted m_dk, m_d· denotes the number of tables in restaurant d, m_·k denotes the number of tables serving dish k, and m_·· denotes the total number of tables occupied. (Marginal counts are represented with dots.) Let K be the current number of topics. The count n_k^(v) denotes the number of times that term v has been assigned to topic k, while n_k^(·) denotes the number of times that all the terms have been assigned to topic k.
Index u is used to indicate the new topic in the sampling process. Note that direct assignment sampling of α and z is conditioned on π.

The crux for sampling the stick lengths α and topic assignments z (conditioned on π) is to compute the conditional density of w_di under topic component k given all data items except w_di:

f_k^{−w_di}(w_di = v | π_k) ≜ p(w_di = v | {w_d′i′, z_d′i′ : z_d′i′ = k, d′i′ ≠ di}, π_k).   (5)

The derivation of the equations for computing this conditional density is detailed in the supplement.² We summarize our findings as follows. Let V ≜ {1, . . . , V} be the set of vocabulary terms, let B_k ≜ {v : n_k,−di^(v) > 0, v ∈ V} be the set of terms that have word assignments in topic k after excluding w_di, and let |B_k| be its cardinality. Let's assume that B_k is not an empty set.³ We have the following:

f_k^{−w_di}(w_di = v | π_k) ∝ (n_k,−di^(v) + γ) E[g_{B_k}(X) | π_k]   if v ∈ B_k,
f_k^{−w_di}(w_di = v | π_k) ∝ γ π_k E[g_{B̄_k}(X̄) | π_k]   otherwise,   (6)

where

g_{B_k}(x) = Γ((|B_k| + x)γ) / Γ(n_k,−di^(·) + 1 + (|B_k| + x)γ),
X | π_k ∼ Binomial(V − |B_k|, π_k),
X̄ | π_k ∼ Binomial(V − |B̄_k|, π_k),   (7)

and where B̄_k = B_k ∪ {v}. Further, Γ(·) is the Gamma function and n_k,−di^(v) denotes the corresponding count excluding word w_di. In the supplement, we also show that E[g_{B_k}(X) | π_k] > π_k E[g_{B̄_k}(X̄) | π_k]. The central difference between the algorithms for HDP-LDA and the sparseTM is the conditional probability in Equation 6, which depends on the selector variables and selector proportions.

We now describe how we sample the stick lengths α and topic assignments z.
This is similar to the sampling procedure for HDP-LDA [1].

Sampling stick lengths α. Although α is an infinite-length vector, the number of topics K is finite at every point in the sampling process. Sampling α can be replaced by sampling α ≜ [α_1, . . . , α_K, α_u] [1]. That is,

α | m ∼ Dirichlet(m_·1, . . . , m_·K, λ).   (8)

Sampling topic assignments z. This is similar to the sampling approach for HDP-LDA [1] as well. Using the conditional density f defined in Equations 5 and 6, we have

p(z_di = k | z_−di, m, α, π_k) ∝ (n_dk,−di + τ α_k) f_k^{−w_di}(w_di | π_k)   if k is previously used,
p(z_di = k | z_−di, m, α, π_k) ∝ τ α_u f_u^{−w_di}(w_di | π_u)   if k = u.   (9)

If a new topic k_new is sampled, then sample κ ∼ Beta(1, λ), and let α_{k_new} = κ α_u and α_{u_new} = (1 − κ) α_u.

²Note that we integrate out β_k and b_k. Another sampling strategy is to sample b (by integrating out π), and the resulting Gibbs sampler is much easier to derive. However, conditioned on b, sampling z will be constrained to a smaller set of topics (specified by the values of b), which slows down convergence of the sampler.

³In the supplement, we show that if B_k is an empty set, the result is trivial.

Sampling Bernoulli parameter π. To sample π_k, we use b_k as an auxiliary variable. Note that b_k was integrated out earlier. Recall that B_k is the set of terms that have word assignments in topic k. (This time, we don’t need to exclude certain words since we are sampling π.)
Let A_k = {v : b_kv = 1, v ∈ V} be the set of the indices of b_k that are “on”. The joint conditional distribution of π_k and b_k is

p(π_k, b_k | rest) ∝ p(b_k | π_k) p(π_k | r) p({w_di : z_di = k} | b_k, {z_di : z_di = k})
= p(b_k | π_k) p(π_k | r) ∫ dβ_k p({w_di : z_di = k} | β_k, {z_di : z_di = k}) p(β_k | b_k)
= p(b_k | π_k) p(π_k | r) 1[B_k ⊂ A_k] Γ(|A_k|γ) ∏_{v∈A_k} Γ(n_k^(v) + γ) / [Γ^{|A_k|}(γ) Γ(n_k^(·) + |A_k|γ)]
= p(b_k | π_k) p(π_k | r) 1[B_k ⊂ A_k] Γ(|A_k|γ) ∏_{v∈B_k} Γ(n_k^(v) + γ) / [Γ^{|B_k|}(γ) Γ(n_k^(·) + |A_k|γ)]
∝ ∏_v p(b_kv | π_k) p(π_k | r) 1[B_k ⊂ A_k] Γ(|A_k|γ) / Γ(n_k^(·) + |A_k|γ),   (10)

where 1[B_k ⊂ A_k] is an indicator function and |A_k| = ∑_v b_kv. This follows because if A_k is not a superset of B_k, there must be a term, say v, in B_k but not in A_k, causing β_kv = 0 a.s., and then p({w_di : (d, i) ∈ Z_k} | β_k, {z_di : (d, i) ∈ Z_k}) = 0 a.s. Using this joint conditional distribution⁴, we iteratively sample b_k conditioned on π_k and π_k conditioned on b_k to ultimately obtain a sample from π_k.

Others. Sampling the table counts m is exactly the same as for the HDP [1], so we omit the details here. In addition, we can sample the hyper-parameters λ, τ and γ. For the concentration parameters λ and τ, in both HDP-LDA and sparseTMs, we use previously developed approaches with Gamma priors [1, 10]. For the Dirichlet hyper-parameter γ, we use Metropolis-Hastings.

Finally, with any single sample we can estimate the topic distributions β from the values of the topic assignments z and term selectors b by

β̂_k,v = (n_k^(v) + b_k,v γ) / (n_k^(·) + ∑_v b_kv γ),   (11)

where we smooth only those terms that are chosen to be in the topics.
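To make the estimate in Equation 11 concrete, here is a small Python sketch; the count matrix n and selector matrix b below are toy stand-ins for quantities produced by a Gibbs sampler, not output of the paper's implementation:

```python
import numpy as np

def estimate_topics(n, b, gamma):
    """Equation 11: beta_hat[k, v] =
    (n[k, v] + b[k, v] * gamma) / (n[k, :].sum() + b[k, :].sum() * gamma).

    n[k, v]: times term v is assigned to topic k; b[k, v]: term selector.
    """
    n = np.asarray(n, dtype=float)
    b = np.asarray(b, dtype=float)
    den = n.sum(axis=1, keepdims=True) + gamma * b.sum(axis=1, keepdims=True)
    return (n + gamma * b) / den

# Toy example: 2 topics over a 4-term vocabulary.
n = np.array([[5, 3, 0, 0], [0, 0, 7, 1]])
b = np.array([[1, 1, 0, 0], [0, 1, 1, 1]])
beta_hat = estimate_topics(n, b, gamma=0.5)
# Each row sums to 1; unused terms whose selector is "off" get probability 0.
```

Only the selected terms receive the γ smoothing mass, which is exactly the decoupling of sparsity from smoothness described above.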
Note that we can obtain the samples of b when sampling the Bernoulli parameter π.

4 Experiments

In this section, we studied the performance of the sparseTM on four datasets and demonstrated how the sparseTM decouples the smoothness and sparsity in the HDP.⁵ We placed Gamma(1, 1) priors over the hyper-parameters λ and τ. The sparsity proportion prior was a uniform Beta, i.e., r = s = 1. For the hyper-parameter γ, we used a Metropolis-Hastings sampling method with a symmetric Gaussian proposal with variance 1.0. A disadvantage of the sparseTM is that its running speed is about 4-5 times slower than that of HDP-LDA.

4.1 Datasets

The four datasets we use in the experiments are:

1. The arXiv data set contains 2500 (randomly sampled) online research abstracts (http://arxiv.org). It has 2873 unique terms, around 128K observed words and an average of 36 unique terms per document.

2. The Nematode Biology data set contains 2500 (randomly sampled) research abstracts (http://elegans.swmed.edu/wli/cgcbib). It has 2944 unique terms, around 179K observed words and an average of 52 unique terms per document.

3. The NIPS data set contains the NIPS articles published between 1988-1999 (http://www.cs.utoronto.ca/∼sroweis/nips). It has 5005 unique terms and around 403K observed words. We randomly sample 20% of the words for each paper and this leads to an average of 150 unique terms per document.

⁴In our experiments, we used the algorithm described in the main text to sample π. We note that an improved algorithm might be achieved by modeling the joint conditional distribution of π_k and ∑_v b_kv instead, i.e., p(π_k, ∑_v b_kv | rest), since sampling π_k only depends on ∑_v b_kv.

⁵Other experiments, which we don’t report here, also showed that the finite version of the sparseTM outperforms LDA with the same number of topics.

4. 
The Conf. abstracts data set contains abstracts (including papers and posters) from six international conferences: CIKM, ICML, KDD, NIPS, SIGIR and WWW (http://www.cs.princeton.edu/∼chongw/data/6conf.tgz). It has 3733 unique terms, around 173K observed words and an average of 46 unique terms per document. The data are from 2005-2008.

For all data, stop words and words occurring fewer than 10 times were removed.

4.2 Performance evaluation and model examinations

We studied the predictive performance of the sparseTM compared to HDP-LDA. On the training documents our Gibbs sampler uses the first 2000 steps as burn-in, and we record the following 100 samples as samples from the posterior. Conditioned on these samples, we run the Gibbs sampler on the test documents to estimate the predictive quantities of interest. We use 5-fold cross validation.

We study two predictive quantities. First, we examine overall predictive power with the predictive perplexity of the test set given the training set. (This is a metric from the natural language literature.) The predictive perplexity is

perplexity_pw = exp{ − ∑_{d∈D_test} log p(w_d | D_train) / ∑_{d∈D_test} N_d }.

Lower perplexity is better.

Second, we compute model complexity. Nonparametric Bayesian methods are often used to sidestep model selection and integrate over all instances (and all complexities) of a model at hand (e.g., the number of clusters). The model, though hidden and random, still lurks in the background. Here we study its posterior distribution with the desideratum that between two equally good predictive distributions, a simpler model, or a posterior peaked at a simpler model, is preferred.

To capture model complexity we first define the complexity of a topic. Recall that each Gibbs sample contains a topic assignment z for every observed word in the corpus (see Equation 9).
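As a sketch, the per-word predictive perplexity above can be computed from held-out log likelihoods as follows; the log-likelihood values in the example are made-up inputs (in the paper they come from the Gibbs sampler run on the test documents):

```python
import math

def predictive_perplexity(log_probs, num_words):
    """perplexity_pw = exp(- sum_d log p(w_d | D_train) / sum_d N_d).

    log_probs[d]: held-out log likelihood of test document d
    num_words[d]: N_d, the number of words in test document d
    """
    return math.exp(-sum(log_probs) / sum(num_words))

# Toy example with two test documents (100 words total).
ppl = predictive_perplexity(log_probs=[-350.0, -150.0], num_words=[60, 40])
# exp(500 / 100) = exp(5); lower is better.
```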
The topic complexity is the number of unique terms that have at least one word assigned to the topic. This can be expressed as a sum of indicators,

complexity_k = ∑_v 1[(∑_{d,n} 1[z_d,n = k, w_d,n = v]) > 0],

where recall that z_d,n is the topic assignment for the nth word in document d. Note that a topic with no words assigned to it has complexity zero. For a particular Gibbs sample, the model complexity is the sum of the topic complexities and the number of topics. Loosely, this is the number of free parameters in the “model” that the nonparametric Bayesian method has selected, which is

complexity = #topics + ∑_k complexity_k.   (12)

We performed posterior inference with the sparseTM and HDP-LDA, computing predictive perplexity and average model complexity with 5-fold cross validation. Figure 2 illustrates the results.

Perplexity versus Complexity. Figure 2 (first row) shows the model complexity versus predictive perplexity for each fold: red circles represent sparseTM, blue squares represent HDP-LDA, and a dashed line connecting a red circle and a blue square indicates that the two are from the same fold. These results show that the sparseTM achieves better perplexity than HDP-LDA, with simpler models. (To see this, notice that all the connecting lines going from HDP-LDA to sparseTM point down and to the left.)

Figure 2: Experimental results for sparseTM (shortened as STM in this figure) and HDP-LDA on four datasets. First row. The scatter plots of model complexity versus predictive perplexity for 5-fold cross validation: red circles represent the results from sparseTM, blue squares represent the results from HDP-LDA and the dashed lines connect results from the same fold. Second row. Box plots of the hyperparameter γ values. Third row. Box plots of the number of topics. Fourth row. 
Box plots of the number of terms per topic.

Hyperparameter γ, number of topics and number of terms per topic. Figure 2 (second to fourth rows) shows the Dirichlet parameter γ and the posterior number of topics for HDP-LDA and sparseTM. HDP-LDA tends to have a very small γ in order to attain a reasonable number of topics, but this leads to less smooth distributions. In contrast, the sparseTM allows a larger γ and selects more smoothing, even with a smaller number of topics. The numbers of terms per topic for the two models don’t have a consistent trend, but they don’t differ too much either.

Example topics. For the NIPS data set, we provide some example topics (with top 15 terms) discovered by HDP-LDA and sparseTM in Table 1. Incidentally, we found that HDP-LDA seems to produce more noisy topics, such as those shown in Table 2.

Table 1: Similar topics discovered (top 15 terms per topic).
sparseTM: support, vector, svm, kernel, machines, margin, training, vapnik, solution, examples, space, sv, note, kernels, svms
HDP-LDA: svm, vector, support, machines, kernel, svms, decision, http, digit, machine, diagonal, regression, sparse, optimization, misclassification
sparseTM: belief, networks, inference, lower, bound, variational, jordan, graphical, exact, field, methods, approximate, conditional, distribution, intractable
HDP-LDA: variational, networks, jordan, parameters, inference, bound, belief, distributions, approximation, lower, probabilistic, quadratic, field, variables, models

Table 2: “Noise” topics in HDP-LDA (example noise topics, top 15 terms each).
Topic 1: resulting, mation, direct, development, depicted, global, submitted, carried, applications, replicated, refers, specification, operates, tension, class
Topic 2: epsilon, stream, inferred, transfer, behaviour, motor, corner, inter, applicable, mixture, served, searching, modest, vertical, matter

5 Discussion

These results illuminate the issue with a single parameter controlling both sparsity and smoothing. In the Gibbs sampler, if the HDP-LDA posterior requires more topics to explain the data, it will reduce the value of γ to accommodate the increased (necessary) sparseness. This smaller γ, however, leads to less smooth topics that are less robust to “noise”, i.e., infrequent words that might populate a topic.
The process is circular: to explain the noisy words, the Gibbs sampler might invoke still more new topics, thereby further reducing the hyperparameter. As a result of this interplay, HDP-LDA settles on more topics and a smaller γ. Ultimately, the fit to held-out data suffers.

For the sparseTM, however, more topics can be used to explain the data by using the sparsity control gained from the “spike” component of the prior. The hyperparameter γ is controlled separately. Thus the smoothing effect is retained, and held-out performance is better.

Acknowledgements. We thank the anonymous reviewers for insightful suggestions. David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, and grants from Google and Microsoft.

References

[1] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[2] Blei, D., A. Ng, M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[3] Griffiths, T., M. Steyvers. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. 2006.
[4] Saund, E. A multiple cause mixture model for unsupervised learning. Neural Comput., 7(1):51–71, 1995.
[5] Kabán, A., E. Bingham, T. Hirsimäki. Learning to read between the lines: The aspect Bernoulli model. In SDM. 2004.
[6] Ishwaran, H., J. S. Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
[7] Friedman, N., Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. In NIPS. 1999.
[8] Pitman, J. Poisson–Dirichlet and GEM invariant distributions for split-and-merge transformations of an interval partition. Comb. Probab. Comput., 11(5):501–514, 2002.
[9] Sethuraman, J. A constructive definition of Dirichlet priors. 
Statistica Sinica, 4:639–650, 1994.
[10] Escobar, M. D., M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
", "award": [], "sourceid": 215, "authors": [{"given_name": "Chong", "family_name": "Wang", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}