{"title": "Online Bayesian Moment Matching for Topic Modeling with Unknown Number of Topics", "book": "Advances in Neural Information Processing Systems", "page_first": 4536, "page_last": 4544, "abstract": "Latent Dirichlet Allocation (LDA) is a very popular model for topic modeling as well as many other problems with latent groups. It is both simple and effective. When the number of topics (or latent groups) is unknown, the Hierarchical Dirichlet Process (HDP) provides an elegant non-parametric extension; however, it is a complex model and it is difficult to incorporate prior knowledge since the distribution over topics is implicit. We propose two new models that extend LDA in a simple and intuitive fashion by directly expressing a distribution over the number of topics. We also propose a new online Bayesian moment matching technique to learn the parameters and the number of topics of those models based on streaming data. The approach achieves higher log-likelihood than batch and online HDP with fixed hyperparameters on several corpora.", "full_text": "Online Bayesian Moment Matching for\n\nTopic Modeling with Unknown Number of Topics\n\nWei-Shou Hsu and Pascal Poupart\n\nDavid R. Cheriton School of Computer Science\n\nUniversity of Waterloo\nWateroo, ON N2L 3G1\n\n{wwhsu,ppoupart}@uwaterloo.ca\n\nAbstract\n\nLatent Dirichlet Allocation (LDA) is a very popular model for topic modeling as\nwell as many other problems with latent groups. It is both simple and effective.\nWhen the number of topics (or latent groups) is unknown, the Hierarchical Dirich-\nlet Process (HDP) provides an elegant non-parametric extension; however, it is a\ncomplex model and it is dif\ufb01cult to incorporate prior knowledge since the distri-\nbution over topics is implicit. We propose two new models that extend LDA in a\nsimple and intuitive fashion by directly expressing a distribution over the number\nof topics. 
We also propose a new online Bayesian moment matching technique to learn the parameters and the number of topics of those models based on streaming data. The approach achieves higher log-likelihood than batch and online HDP with fixed hyperparameters on several corpora. The code is publicly available at https://github.com/whsu/bmm.

1 Introduction

Latent Dirichlet Allocation (LDA) [3] recently emerged as the dominant framework for topic modeling as well as many other applications with latent groups. The Hierarchical Dirichlet Process (HDP) [18] provides an elegant extension to LDA when the number of topics (latent groups) is unknown. The non-parametric nature of HDPs is quite attractive since HDPs effectively allow an unbounded number of topics to be inferred from the data. There is also a rich mathematical theory underlying HDPs as well as attractive metaphors (e.g., stick breaking process, Chinese restaurant franchise) to ease the understanding by those less comfortable with non-parametric statistics [18].
That being said, HDPs are not perfect. They do not expose an explicit distribution over the topics that could allow practitioners to incorporate prior knowledge and to inspect the model's posterior confidence in different numbers of topics. Furthermore, the implicit distribution over the number of topics is restricted to a regime where the number of topics grows logarithmically with the amount of data in expectation [18]. For instance, this growth rate is insufficient for applications that exhibit a power law distribution [6] – a generalization of the HDP known as the hierarchical Pitman-Yor process [21] is often used instead. Existing inference algorithms for HDPs (e.g., Gibbs sampling [18], variational inference [19, 24, 23, 4, 17]) are also fairly complex.
As a result, practitioners often stick with LDA and estimate the number of topics by repeatedly evaluating different numbers of topics by cross-validation; however, this is an expensive procedure.
We propose two new models that extend LDA in a simple and intuitive fashion by directly expressing a distribution over the number of topics under the assumption that an upper bound on the number of topics is available. When the amount of data is finite, this assumption is perfectly fine since there cannot be more topics than the amount of data. Otherwise, domain experts can often define a suitable range for the number of topics, and if they plan to inspect the resulting topics, they cannot inspect an unbounded number of topics. We also propose a novel Bayesian moment matching algorithm to compute a posterior distribution over the model parameters and the number of topics. Bayesian learning naturally lends itself to online learning for streaming data since the posterior is updated sequentially after each data point and there is no need to go over the data more than once. The main issue is that the posterior becomes intractable. We approximate the posterior after each observed word by a tractable distribution that matches some moments of the exact posterior (hence the name Bayesian Moment Matching). The approach compares favorably to online HDP on several topic modeling tasks.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Related work

Setting the number of topics to use can be treated as a model selection problem. One solution is to train a topic model multiple times, each time with a different number of topics, and choose the number of topics that minimizes some cost function on a heldout test set. More recently, nonparametric Bayesian methods have been used to bypass the model selection problem. The Hierarchical Dirichlet process (HDP) [18] is the natural extension of LDA in this direction.
With HDP, the number of topics is learned from data as part of the inference procedure. Gibbs sampling [7, 15] and Variational Bayes [3, 20] are by far the most popular inference techniques for LDA. They have been extended to HDP [18, 19, 17]. With the rise of streaming data, online variants of Variational Bayes have also been developed for LDA [8] and HDP [24, 23, 4]. The first online variational technique [24] used a truncation that effectively bounds the number of topics while subsequent techniques [23, 4] avoid any fixed truncation to fully exploit the non-parametric nature of HDP. These online variational techniques perform stochastic gradient ascent on mini-batches, which reduces their data efficiency, but improves computational efficiency.
We propose two new models that are simpler than HDP and express a distribution directly on the number of topics. We extend online Bayesian moment matching (originally designed for LDA with a fixed number of topics [14]) to learn the number of topics. This technique avoids mini-batches. It approximates Bayesian learning by Assumed Density Filtering [13], which can be thought of as a single forward iteration of Expectation Propagation [12]. Note that Bayesian moment matching is different from frequentist moment matching techniques such as spectral learning [1, 2, 9, 11]. In BMM, we compute a posterior over the parameters of the model and approximate the posterior with a simpler distribution that matches some moments of the exact posterior. In spectral learning, moments of the empirical distribution of the data are used to find parameters that yield the same moments in the model. This is usually achieved by a spectral (or tensor) decomposition of the empirical moments, hence the name spectral learning. Although both BMM and spectral learning use the method of moments, they match different moments in different distributions, resulting in completely different algorithms.
While stochastic gradient descent can be used to compute tensor decompositions in an online fashion [5, 10], no online variant of spectral learning has been developed to infer the number of topics in LDA.

3 Models

We investigate the problem of online clustering of grouped discrete observations. Using terminology from text processing, we will call each observation a word and each group a document. The observed data set is then a corpus of N words, {w_n}_{n=1}^N, along with the IDs, {d_n}_{n=1}^N, of the documents to which these words belong. We will let D denote the number of documents and V the number of distinct words in the vocabulary. Figure 1 shows the generative models we are considering.
The basic model is LDA, in which the number of topics T is fixed. We propose two extensions to the basic model where the parameter T is unknown and inferred from data, with the assumption that T ranges from 1 to K. Each θ_d specifies the topic distribution of a document, while each φ_t specifies the word distribution of a topic.
In the rest of the paper, we will use Θ to denote the collection of all θ_d's and Φ the collection of all φ_t's in the model.

Figure 1: Graphical representations of basic model with fixed number of topics (left), degenerate Dirichlet model (middle), and triangular Dirichlet model (right)

3.1 Degenerate Dirichlet model

The generative process of the degenerate Dirichlet model (DDM), as shown in the middle in Figure 1, works by first sampling the hyperparameters γ, {α_d}_{d=1}^D, and {β_t}_{t=1}^K. The parameters T, {θ_d}_{d=1}^D, and {φ_t}_{t=1}^K are then sampled from the following conditional distributions:

P(T | γ) = Discrete(T; γ)
P(θ_d | α_d, T) = Dir(θ_d; α_d, T)
P(φ_t | β_t) = Dir(φ_t; β_t)

where Dir(θ_d; α_d, T) denotes a degenerate Dirichlet distribution Dir(θ_d; α′_d) with

α′_{d,t} = α_{d,t} for t ≤ T, and α′_{d,t} = 0 for t > T,

and Discrete(T; γ) is the general discrete distribution with probability P(T = k) = γ_k for k = 1, ..., K. Finally, the N observations are generated by first sampling the topic indicators t_n according to the distribution P(t_n | d_n, Θ) = θ_{d_n,t_n}. Note that since θ_{d_n} is sampled from a degenerate Dirichlet, we have θ_{d_n,t_n} = 0 for t_n > T. Given t_n, the words are then sampled according to the categorical distribution P(w_n | t_n, Φ) = φ_{t_n,w_n}.

3.2 Triangular Dirichlet model

The triangular Dirichlet model (TDM), shown on the right in Figure 1, works in a similar way except the document-topic distribution Θ is represented by a three-dimensional array that is also indexed by the number of topics T in addition to the document ID d and the topic ID t. Given T and d, the topic t is drawn according to the probability P(t | d, Θ, T) = θ_{T,d,t} for 1 ≤ t ≤ T. The array Θ therefore has a triangular shape in the first and third dimension. Again, we place a Dirichlet prior on each θ_{k,d}: P(θ_{k,d} | α_{k,d}) = Dir(θ_{k,d}; α_{k,d}). In this case, however, θ_{k,d} has no dependence on T.

4 Bayesian update by moment matching

Let P_n(Θ, Φ, T) denote the joint posterior probability of Θ, Φ, and T after seeing the first n observations. Then¹

P_n(Θ, Φ, T) = P(Θ, Φ, T | w_{1:n}) = (1/c_n) Σ_{t_n=1}^K P(t_n | Θ, T) P(w_n | Φ, t_n) P_{n−1}(Θ, Φ, T)    (1)

where c_n = P(w_n | w_{1:n−1}).
From (1) we can see that after seeing each new observation w_n, the number of terms in the posterior is increased by a factor of K, resulting in an exponential complexity for exact Bayesian update. Therefore, we will instead approximate P_n by a different distribution, whose parameters will be estimated by moment matching.

¹In the derivations that follow, the dependence on the document IDs {d_n}_{n=1}^N and the hyperparameters γ, α, and β is implicit and not shown.

4.1 Approximating distribution

To make the inference tractable, we approximate P_n using a factorized distribution: P_n(Θ, Φ, T) = f_Θ(Θ) f_Φ(Φ) f_T(T).
For TDM, we choose the factorized distribution to have the exact same form as the prior distribution, i.e.,

f_Θ(Θ) = Π_{k=1}^K Π_{d=1}^D Dir(θ_{k,d}; α_{k,d})    (2)
f_Φ(Φ) = Π_{t=1}^K Dir(φ_t; β_t)    (3)
f_T(T) = Discrete(T; γ)    (4)

For DDM, we use the same f_Φ and f_T, but rather than choosing f_Θ as degenerate Dirichlets again, we instead approximate the posterior over Θ using proper Dirichlet distributions to decouple Θ from T:

f_Θ(Θ) = Π_{d=1}^D Dir(θ_d; α_d)    (5)

4.2 Moment matching

Let x be a random variable with distribution p(x).
The i-th moment of x about zero is defined as the expectation of x^i over p, and we denote it by M_{x^i}(p):

M_{x^i}(p) = E_p[x^i]    (6)

For a K-dimensional Dirichlet distribution Dir(x_1, ..., x_K; τ_1, ..., τ_K), we can uniquely solve for the parameters τ_1, ..., τ_K if we have K − 1 first moments, M_{x_1}, ..., M_{x_{K−1}}, and one second moment, M_{x_1^2}. Given the moments, we can determine the Dirichlet parameters as

τ_k = M_{x_k} (M_{x_1} − M_{x_1^2}) / (M_{x_1^2} − M_{x_1}^2)    (7)

for k = 1, ..., K. Therefore, we can compute the parameters for f_Θ and f_Φ using (7): for α_d, replace τ_k with α_{d,k} and x_k with θ_{d,k}; and for β_t, replace τ_k with β_{t,k} and x_k with φ_{t,k}.
The parameters for Discrete(T; γ) are estimated directly as

γ_k = E[δ_{T,k}]    (8)

where δ denotes the Kronecker delta

δ_{i,j} = 1 if i = j, and 0 if i ≠ j.    (9)

4.3 Moment computation

From (7) and (8), we see that to approximate P_n by moment matching, we need to compute the first and second moments of Θ and Φ as well as the expectation E[δ_{T,k}] with respect to P_n. They can be calculated using the Bayesian update equation (1).
To keep the notation uncluttered, let S_{x,:m} denote the sum of the first m elements in a vector x and S_x the sum of all elements in x. We can then compute the moments of DDM as follows:

c_n = Σ_{T=1}^K γ_T Σ_{t_n=1}^T (α_{d_n,t_n} / S_{α_{d_n},:T}) (β_{t_n,w_n} / S_{β_{t_n}})    (10)

E_{P_n}[δ_{T,k}] = (1/c_n) γ_k Σ_{t_n=1}^k (α_{d_n,t_n} / S_{α_{d_n},:k}) (β_{t_n,w_n} / S_{β_{t_n}})    (11)

M_{θ_{d,t}}(P_n) = (1/c_n) Σ_{T=t}^K γ_T Σ_{t_n=1}^T (α_{d_n,t_n} / S_{α_{d_n},:T}) (β_{t_n,w_n} / S_{β_{t_n}}) · (α_{d,t} + δ_{d,d_n} δ_{t,t_n}) / (S_{α_d,:T} + δ_{d,d_n})    (12)

M_{θ_{d,t}^2}(P_n) = (1/c_n) Σ_{T=t}^K γ_T Σ_{t_n=1}^T (α_{d_n,t_n} / S_{α_{d_n},:T}) (β_{t_n,w_n} / S_{β_{t_n}}) · (α_{d,t} + δ_{d,d_n} δ_{t,t_n}) / (S_{α_d,:T} + δ_{d,d_n}) · (α_{d,t} + 1 + δ_{d,d_n} δ_{t,t_n}) / (S_{α_d,:T} + 1 + δ_{d,d_n})    (13)

M_{φ_{t,w}}(P_n) = (1/c_n) Σ_{T=1}^K γ_T Σ_{t_n=1}^T (α_{d_n,t_n} / S_{α_{d_n},:T}) (β_{t_n,w_n} / S_{β_{t_n}}) · (β_{t,w} + δ_{t,t_n} δ_{w,w_n}) / (S_{β_t} + δ_{t,t_n})    (14)

M_{φ_{t,w}^2}(P_n) = (1/c_n) Σ_{T=1}^K γ_T Σ_{t_n=1}^T (α_{d_n,t_n} / S_{α_{d_n},:T}) (β_{t_n,w_n} / S_{β_{t_n}}) · (β_{t,w} + δ_{t,t_n} δ_{w,w_n}) / (S_{β_t} + δ_{t,t_n}) · (β_{t,w} + 1 + δ_{t,t_n} δ_{w,w_n}) / (S_{β_t} + 1 + δ_{t,t_n})    (15)

For TDM, the moments are computed similarly except that T is used to index into α rather than to take partial sums.
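As a concrete illustration, the Dirichlet parameter recovery in (7) can be sketched in a few lines of Python (a minimal sketch; the function name and the round-trip check are ours, not taken from the paper's released code):

```python
import numpy as np

def dirichlet_params_from_moments(m1, m2_first):
    """Recover Dirichlet parameters from moments, as in equation (7).

    m1[k] = M_{x_k} (first moments), m2_first = M_{x_1^2} (one second moment).
    For Dir(tau) with S = sum(tau): E[x_k] = tau_k / S and
    E[x_1^2] = tau_1 (tau_1 + 1) / (S (S + 1)), so the ratio
    (M_{x_1} - M_{x_1^2}) / (M_{x_1^2} - M_{x_1}^2) equals S.
    """
    m1 = np.asarray(m1, dtype=float)
    scale = (m1[0] - m2_first) / (m2_first - m1[0] ** 2)  # equals S = sum(tau)
    return m1 * scale

# Round trip: exact moments of a known Dirichlet recover its parameters.
tau = np.array([2.0, 1.0, 3.0])
S = tau.sum()
m1 = tau / S
m2_first = tau[0] * (tau[0] + 1) / (S * (S + 1))
print(dirichlet_params_from_moments(m1, m2_first))  # → [2. 1. 3.]
```

In the algorithm, m1 and m2_first would come from the exact moments of the intractable posterior, such as (12)–(15), rather than from a known Dirichlet.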
The equations are included in the supplement.

4.4 Parameter update

For TDM, the approximating distribution for the posterior has the exact same form as the prior; therefore, the parameters we compute for P_n in the n-th update can be used directly as the parameters for the prior in the (n+1)-th update.
However, for DDM, the prior for Θ consists of degenerate Dirichlet distributions conditionally dependent on T, whereas the approximating distribution for the posterior is a fully factorized distribution with proper Dirichlets. Therefore, we have to make a further approximation to match the parameters of the two distributions.
When P_n is being used as the prior in the (n+1)-th update, we use the same α that was obtained by moment matching during the n-th update, but it now has a different meaning. During the n-th update, α is computed as parameters of proper Dirichlet distributions, but in the next update, it is used as parameters of a weighted sum of degenerate Dirichlet distributions. As a result, the DDM has a natural bias towards a smaller number of topics.

4.5 Algorithm summary

In summary, starting from a prior distribution, the algorithm successively updates the posterior by first computing the exact moments according to the Bayesian update equation (1), and then updating the parameters by matching the moments with those of an approximating distribution. In the case of TDM, the approximating distribution has the same form as the prior, whereas a simplified distribution is used for DDM. Algorithm 1 summarizes the procedure for the two models.

Algorithm 1 Online Bayesian moment matching algorithm
1: Initialize α, β, and γ.
2: for n = 1, ..., N do
3:   Read the n-th observation (d_n, w_n).
4:   Compute moments according to (10)–(15) for DDM or equations in supplement for TDM.
5:   Update α, β, and γ according to (7) and (8) with appropriate substitutions.
6: end for

5 Experiments

In this section, we discuss our experiments on a synthetic dataset and three real text corpora. The TDM and DDM implementations are available at https://github.com/whsu/bmm. For both models we initialized the hyperparameters to be α_{d,t} = 1 and β_{t,w} = 1/√V for all d, t, and w. The reason that β_{t,w} was not initialized to 1 was to encourage the algorithm to find topics with more concentrated word distributions.

Figure 2: Number of topics discovered by the DDM and TDM on synthetic datasets using (a) uniform prior and (b) exponentially decreasing prior on T. The results are averaged over 100 randomly generated datasets for each actual T. Error bars show plus/minus one standard deviation. Gray line indicates the true number of topics that generated the datasets.

5.1 Synthetic data

We first ran some tests on synthetic data to see how well the models estimate the number of topics. For this experiment, the actual number of topics T was varied from 1 to 10, and for each value of T, we generated 100 random datasets with D = 100, V = 200, and N = 100,000. Each random dataset was created by first sampling Θ from Dir(α_d | 0.05) and Φ from Dir(β_t | 0.1). The observations were then sampled from Θ and Φ.
We set K = 20 and used the uniform prior P(T) = 1/K for T = 1, ..., K. The estimated number of topics is shown in Figure 2(a).
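For reference, the synthetic-data generation described in Section 5.1 can be sketched as follows. This is a minimal version under stated assumptions: the variable names are ours, words are assigned to documents uniformly at random (the paper does not specify the allocation), and N is reduced from the paper's 100,000 to keep the example fast:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from Section 5.1 (N reduced here; the paper uses N = 100,000).
D, V, T, N = 100, 200, 5, 10_000

# Theta ~ Dir(0.05) per document, Phi ~ Dir(0.1) per topic.
theta = rng.dirichlet(np.full(T, 0.05), size=D)  # D x T document-topic distributions
phi = rng.dirichlet(np.full(V, 0.1), size=T)     # T x V topic-word distributions

# Each observation is a (document ID, word) pair: pick a document,
# draw its topic from theta[d], then draw a word from phi[t].
docs = rng.integers(0, D, size=N)
topics = np.array([rng.choice(T, p=theta[d]) for d in docs])
words = np.array([rng.choice(V, p=phi[t]) for t in topics])

print(docs.shape, words.shape)
```

An online learner such as Algorithm 1 would then consume the stream of (d_n, w_n) pairs one at a time.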
Both models were able to discover more topics as the actual number of topics increases. They tend to overestimate the number of topics because the initial value β_{t,w} = 1/√V encourages topics with smaller numbers of words.
However, in both models, the modeler has direct control over the number of topics. If there is reason to believe the data come from a smaller number of topics, the modeler can change the prior distribution on T accordingly, as is typical in a Bayesian framework.
For this example, we also tested an exponentially decreasing prior P(T) ∝ e^{−T} for T = 1, ..., K. The results are shown in Figure 2(b). In this case, TDM's estimates decrease slightly compared with the uniform prior, whereas DDM produces an estimate that is close to the true number of topics.

5.2 Text modeling

We compare the two proposed models by using them to model the distributions of three real text corpora containing Reuters news articles, NIPS conference proceedings, and Yelp reviews. We also include online HDP (oHDP) in the comparisons, as well as the basic moment matching (basic MM) algorithm with different values of T. For online HDP, we used the gensim 0.10.3 [16] implementation with the default parameters except for the top-level truncation, which we set equal to the maximum number of topics we used for DDM and TDM. Because DDM and TDM do not estimate a global alpha as oHDP does, for oHDP we include the results with both uniform alpha (oHDP unif) and learned alpha (oHDP alpha).
We followed an experimental setup similar to that of [22, 4]. Each dataset was divided into a training set Dtrain and a test set Dtest based on document IDs. The words in the test set were further split into two subsets W1 and W2, where W1 contains the words in the first half of each document in the test set, and W2 contains the second half.
The evaluation metric used is the per-word log likelihood L = log p(W2 | W1, Dtrain) / |W2|, where |W2| denotes the total number of tokens in W2.
For each experiment we also report the number of topics inferred by DDM and TDM. We do not report this number for online HDP because it is not returned by the implementation.

Figure 3: Text modeling on Reuters-21578: (a) Per-word test log likelihood and (b) Number of topics found as a function of number of observations.

5.2.1 Reuters-21578

The Reuters-21578 corpus contains 21,578 Reuters news articles in 1987. For this dataset, we divided the data into training and test sets according to the LEWISSPLIT attribute that is available as part of the distribution at http://www.daviddlewis.com/resources/testcollections/reuters21578/. The text was passed through a stemmer, and stopwords and words appearing in five or fewer documents were removed. This resulted in a total of 1,307,468 tokens and a vocabulary of 7,720 distinct words. We chose K to be 100 for both models with uniform prior P(T) = 1/K. Figure 3(a) shows the experimental results.
DDM discovered 39 topics while TDM found 36, and they both achieved per-word log likelihood similar to the best models with fixed T, showing that they were able to automatically determine the number of topics necessary to model the data.
While both models found a similar number of topics in the end, they progressed to the final values in different ways. Fig. 3(b) shows the number of topics found by the two models as a function of number of observations. DDM shows a logarithmically increasing trend as more words are observed, whereas TDM follows a more irregular progression.

5.2.2 NIPS

We also tested the two models on 2,742 articles from the NIPS conference for the years 1988–2004. We used the raw text versions available at http://cs.nyu.edu/~roweis/data.html (1988–1999) and http://ai.stanford.edu/~gal/data.html (2000–2004). The first set was used as the training set and the second as the test set. The corpus was again passed through a stemmer, and stopwords and words appearing no more than 50 times were removed. After preprocessing we are left with 2,207,106 total words and a vocabulary of 4,383 unique words.
For this dataset we used K = 400 with the exponentially decreasing prior. DDM discovered 54 topics, and TDM found 89 topics. Figure 4(a) shows the per-word log likelihood on the test set. In this experiment, both DDM and TDM obtained close to the optimal likelihood compared to basic MM.

5.2.3 Yelp

In our third experiment, we tested the models on a subset of the Yelp Academic Dataset (http://www.yelp.com/dataset_challenge). We took the 129,524 reviews in the dataset that were given to businesses in the Food category. The reviews were randomly split so that 70% were used for training and 30% for testing.
Similar preprocessing was performed. The corpus was passed through a stemmer, and stopwords and words appearing no more than 50 times were removed.
After preprocessing the corpus contains a total of 5,317,041 words and a vocabulary of 5,640 distinct words.

Figure 4: Per-word test log likelihood of (a) NIPS and (b) Yelp.

For this dataset, we tested with K = 100 using the exponentially decreasing prior on T. Figure 4(b) shows the per-word log likelihood on the test set. DDM found the optimal number of topics while both models achieved close to the best likelihood on the test set compared to basic MM.

5.2.4 Comparison with online HDP

Because DDM and TDM do not estimate the global alpha, in the experiments we compute the test likelihood using a uniform alpha. If we also use a uniform alpha for online HDP, DDM and TDM achieve higher test likelihood. However, online HDP is able to learn the global alpha, which results in higher likelihood. This is a shortcoming of our models, and we are exploring ways to estimate the global alpha.

5.3 Additional experimental results

Additional experimental results may be found in the supplement, including running time of the experiments and samples of topics discovered in the Reuters and NIPS corpora, as well as experiments on using the models as dimensionality reduction preprocessors in text classification.

6 Conclusions

In this paper we proposed two topic models that can be used when the number of topics is not known. Unlike nonparametric Bayesian models, the proposed models provide explicit control over the prior for the number of topics.
We then presented an online learning algorithm based on Bayesian moment matching, and experiments showed that reasonable topics could be recovered using the proposed models. Additional experiments on text classification and visual inspection of the inferred topics show that the clusters discovered were indeed semantically meaningful.
One unsolved problem is that the proposed models do not estimate the global alpha, resulting in lower test likelihood compared to online HDP, which is able to estimate alpha. Developing a robust way to estimate alpha will be the next step to improve the models.

References
[1] Anima Anandkumar, Dean P Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, pages 926–934, 2012.
[2] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In Foundations of Computer Science, pages 1–10. IEEE, 2012.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] Michael Bryant and Erik B Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In NIPS, pages 2699–2707, 2012.
[5] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
[6] Sharon Goldwater, Mark Johnson, and Thomas L Griffiths. Interpolating between types and tokens by estimating power-law generators. In NIPS, pages 459–466, 2005.
[7] Tom Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. 2002.
[8] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In NIPS, pages 856–864, 2010.
[9] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.
[10] Furong Huang, U. N. Niranjan, Mohammad Umar Hakeem, and Animashree Anandkumar. Online tensor methods for learning latent variable models. Journal of Machine Learning Research, 16:2797–2835, 2015.
[11] Furong Huang, U. N. Niranjan, Mohammad Umar Hakeem, and Animashree Anandkumar. Fast detection of overlapping communities via online tensor methods. arXiv preprint arXiv:1309.0787, 2013.
[12] Thomas Minka and John Lafferty. Expectation-propagation for the generative aspect model. In UAI, pages 352–359, 2002.
[13] Thomas P Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369, 2001.
[14] Farheen Omar. Online Bayesian Learning in Probabilistic Graphical Models using Moment Matching with Applications. PhD thesis, David R. Cheriton School of Computer Science, University of Waterloo, 2016.
[15] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In ACM SIGKDD, pages 569–577, 2008.
[16] R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
[17] Issei Sato, Kenichi Kurihara, and Hiroshi Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In ACM SIGKDD, pages 105–113. ACM, 2012.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
[19] Yee W Teh, Kenichi Kurihara, and Max Welling. Collapsed variational inference for HDP.
In NIPS, pages 1481–1488, 2007.
[20] Yee W Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, pages 1353–1360, 2006.
[21] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics, 2006.
[22] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In G. Gordon, D. Dunson, and M. Dudík, editors, AISTATS, volume 15. JMLR W&CP, 2011.
[23] Chong Wang and David M Blei. Truncation-free online variational inference for Bayesian nonparametric models. In NIPS, pages 413–421, 2012.
[24] Chong Wang, John W Paisley, and David M Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, pages 752–760, 2011.
", "award": [], "sourceid": 2262, "authors": [{"given_name": "Wei-Shou", "family_name": "Hsu", "institution": "University of Waterloo"}, {"given_name": "Pascal", "family_name": "Poupart", "institution": "University of Waterloo"}]}