{"title": "Collapsed Variational Inference for HDP", "book": "Advances in Neural Information Processing Systems", "page_first": 1481, "page_last": 1488, "abstract": null, "full_text": "Collapsed Variational Inference for HDP\n\nYee Whye Teh\nGatsby Unit\n\nUniversity College London\nywteh@gatsby.ucl.ac.uk\n\nKenichi Kurihara\n\nDept. of Computer Science\nTokyo Institute of Technology\n\nMax Welling\n\nICS\n\nUC Irvine\n\nkurihara@mi.cs.titech.ac.jp\n\nwelling@ics.uci.edu\n\nAbstract\n\nA wide variety of Dirichlet-multinomial \u2018topic\u2019 models have found interesting ap-\nplications in recent years. While Gibbs sampling remains an important method of\ninference in such models, variational techniques have certain advantages such as\neasy assessment of convergence, easy optimization without the need to maintain\ndetailed balance, a bound on the marginal likelihood, and side-stepping of issues\nwith topic-identi\ufb01ability. The most accurate variational technique thus far, namely\ncollapsed variational latent Dirichlet allocation, did not deal with model selection\nnor did it include inference for hyperparameters. We address both issues by gen-\neralizing the technique, obtaining the \ufb01rst variational algorithm to deal with the\nhierarchical Dirichlet process and to deal with hyperparameters of Dirichlet vari-\nables. Experiments show a signi\ufb01cant improvement in accuracy.\n\n1 Introduction\n\nMany applications of graphical models have traditionally dealt with discrete state spaces, where\neach variable is multinomial distributed given its parents [1]. Without strong prior knowledge on\nthe structure of dependencies between variables and their parents, the typical Bayesian prior over\nparameters has been the Dirichlet distribution. This is because the Dirichlet prior is conjugate to\nthe multinomial, leading to simple and ef\ufb01cient computations for both the posterior over parameters\nand the marginal likelihood of data. 
When there are latent or unobserved variables, the variational Bayesian approach to posterior estimation, where the latent variables are assumed independent from the parameters, has proven successful [2].

In recent years there has been a proliferation of graphical models composed of a multitude of multinomial and Dirichlet variables interacting in various inventive ways. The major classes include latent Dirichlet allocation (LDA) [3] and many other topic models inspired by LDA, and the hierarchical Dirichlet process (HDP) [4] and many other nonparametric models based on the Dirichlet process (DP). LDA pioneered the use of Dirichlet distributed latent variables to represent shades of membership to different clusters or topics, while the HDP pioneered the use of nonparametric models to sidestep the need for model selection.

For these Dirichlet-multinomial models the inference method of choice is typically collapsed Gibbs sampling, due to its simplicity, speed, and good predictive performance on test sets. However there are drawbacks as well: it is often hard to assess convergence of the Markov chains; it is harder still to accurately estimate the marginal probability of the training data or the predictive probability of test data (if latent variables are associated with the test data); averaging topic-dependent quantities based on samples is not well-defined because the topic labels may have switched during sampling; and avoiding local optima through large MCMC moves such as split and merge algorithms is tricky to implement due to the need to preserve detailed balance. Thus there seems to be a genuine need to consider alternatives to sampling.

For LDA and its cousins, there are alternatives based on variational Bayesian (VB) approximations [3] and on expectation propagation (EP) [5].
[6] found that EP was not efficient enough for large scale applications, while VB suffered from significant bias resulting in worse predictive performance than Gibbs sampling. [7] addressed these issues by proposing an improved VB approximation based on the idea of collapsing, that is, integrating out the parameters while assuming that other latent variables are independent. As for nonparametric models, a number of VB approximations have been proposed for DP mixture models [8, 9], while to our knowledge none has been proposed for the HDP thus far ([10] derived a VB inference for the HDP, but dealt only with point estimates for higher level parameters).

In this paper we investigate a new VB approach to inference for the class of Dirichlet-multinomial models. To be concrete we focus our attention on an application of the HDP to topic modeling [4], though the approach is more generally applicable. Our approach is an extension of the collapsed VB approximation for LDA (CV-LDA) presented in [7], and represents the first VB approximation to the HDP1. We call this the collapsed variational HDP (CV-HDP). The advantage of CV-HDP over CV-LDA is that the optimal number of variational components is not finite. This implies, apart from local optima, that we can keep adding components indefinitely while the algorithm will take care of removing unnecessary clusters. Ours is also the first variational algorithm to treat full posterior distributions over the hyperparameters of Dirichlet variables, and we show experimentally that this results in significant improvements in both the variational bound and test-set likelihood.
We expect our approach to be generally applicable to a wide variety of Dirichlet-multinomial models beyond what we have described here.

2 A Nonparametric Hierarchical Bayesian Topic Model

We consider a document model where each document in a corpus is modelled as a mixture over topics, and each topic is a distribution over words in the vocabulary. Let there be D documents in the corpus, and W words in the vocabulary. For each document d = 1, . . . , D, let θd be a vector of mixing proportions over topics. For each topic k, let φk be a vector of probabilities for words in that topic. Words in each document are drawn as follows: first choose a topic k with probability θdk, then choose a word w with probability φkw. Let xid be the ith word token in document d, and zid its chosen topic. We have,

zid | θd ∼ Mult(θd)    xid | zid, φzid ∼ Mult(φzid)    (1)

We place Dirichlet priors on the parameters θd and φk,

θd | π ∼ Dir(απ)    φk | τ ∼ Dir(βτ)    (2)

where π is the corpus-wide distribution over topics, τ is the corpus-wide distribution over the vocabulary, and α and β are concentration parameters describing how close θd and φk are to their respective prior means π and τ.

If the number of topics K is finite and fixed, the above model is LDA. As we usually do not know the number of topics a priori, and would like a model that can determine this automatically, we consider a nonparametric extension based on the HDP [4].
Specifically, we have a countably infinite number of topics (thus θd and π are infinite-dimensional vectors), and we use a stick-breaking representation [11] for π:

π̃k | γ ∼ Beta(1, γ)    πk = π̃k ∏_{l=1}^{k−1} (1 − π̃l)    for k = 1, 2, . . .    (3)

In the normal Dirichlet process notation, we would equivalently have Gd ∼ DP(α, G0) and G0 ∼ DP(γ, Dir(βτ)), where Gd = ∑_{k=1}^∞ θdk δφk and G0 = ∑_{k=1}^∞ πk δφk are sums of point masses, and Dir(βτ) is the base distribution. Finally, in addition to the prior over π, we place priors over the other hyperparameters α, β, γ and τ of the model as well,

α ∼ Gamma(aα, bα)    β ∼ Gamma(aβ, bβ)    γ ∼ Gamma(aγ, bγ)    τ ∼ Dir(aτ)    (4)

The full model is shown graphically in Figure 1(left).

1In this paper, by HDP we shall mean the two level HDP topic model in Section 2. We do not claim to have derived a VB inference for the general HDP in [4], which is significantly more difficult; see final discussions.

Figure 1: Left: The HDP topic model. Right: Factor graph of the model with auxiliary variables.

3 Collapsed Variational Bayesian Inference for HDP

There is substantial empirical evidence that marginalizing out variables is helpful for efficient inference. For instance, in [12] it was observed that Gibbs sampling enjoys better mixing, while in [7] it was shown that variational inference is more accurate in this collapsed space. In the following we will build on this experience and propose a collapsed variational inference algorithm for the HDP, based upon first replacing the parameters with auxiliary variables, then effectively collapsing out the auxiliary variables variationally.
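As a brief aside (not part of the paper's algorithm), the truncated stick-breaking draw in (3)–(4) can be sketched in a few lines of Python; the truncation level and the value of γ below are illustrative assumptions:

```python
import random

def stick_breaking(gamma, K):
    # Draw pi_1..pi_K from a truncated stick-breaking construction:
    # v_k ~ Beta(1, gamma), pi_k = v_k * prod_{l<k} (1 - v_l).
    pis, remaining = [], 1.0
    for _ in range(K):
        v = random.betavariate(1.0, gamma)
        pis.append(v * remaining)
        remaining *= 1.0 - v
    # The weights sum to 1 - prod_k (1 - v_k); the leftover mass is the
    # unbroken tail of the stick belonging to topics beyond the truncation.
    return pis

weights = stick_breaking(gamma=1.0, K=50)
```

Smaller γ concentrates mass on the first few sticks; larger γ spreads it over many topics, which matches the role of γ as the top-level DP concentration.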
The algorithm is fully Bayesian in the sense that all parameter posteriors are treated exactly and full posterior distributions are maintained for all hyperparameters. The only assumptions made are independencies among the latent topic variables and hyperparameters, and that there is a finite upper bound on the number of topics used (which is found automatically). The only inputs required of the modeller are the values of the top-level parameters aα, bα, ....

3.1 Replacing parameters with auxiliary variables

In order to obtain efficient variational updates, we shall replace the parameters θ = {θd} and φ = {φk} with auxiliary variables. Specifically, we first integrate out the parameters; this gives a joint distribution over latent variables z = {zid} and word tokens x = {xid} as follows:

p(z, x | α, β, γ, π, τ) = ∏_{d=1}^D ( Γ(α) / Γ(α + nd··) ∏_{k=1}^K Γ(απk + ndk·) / Γ(απk) ) × ∏_{k=1}^K ( Γ(β) / Γ(β + n·k·) ∏_{w=1}^W Γ(βτw + n·kw) / Γ(βτw) )    (5)

with ndkw = #{i : xid = w, zid = k}, dot denoting sum over that index, and K denoting an index such that zid ≤ K for all i, d. The ratios of gamma functions in (5) result from the normalization constants of the Dirichlet densities of θ and φ, and prove to be nuisances for updating the hyperparameter posteriors. Thus we introduce four sets of auxiliary variables: ηd and ξk taking values in [0, 1], and sdk and tkw taking integral values.
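The collapsed marginal (5) is just a product of ratios of Gamma normalizers and is easy to evaluate with log-gamma functions. The sketch below (an illustration, not the paper's implementation; the function name and count layout are our own) computes log p(z, x | α, β, π, τ) for fixed assignments:

```python
import math
from collections import Counter

def collapsed_log_marginal(z, x, alpha, beta, pi, tau, D, K, W):
    # log p(z, x | alpha, beta, pi, tau) as in Eq. (5): the ratios of
    # Gamma normalizers left after integrating out theta_d and phi_k.
    n = Counter()                       # n[(d, k, w)] = #{i : x_id = w, z_id = k}
    for d, (zs, xs) in enumerate(zip(z, x)):
        for zi, xi in zip(zs, xs):
            n[(d, zi, xi)] += 1
    ndk, nkw, nd, nk = Counter(), Counter(), Counter(), Counter()
    for (d, k, w), c in n.items():
        ndk[(d, k)] += c; nkw[(k, w)] += c; nd[d] += c; nk[k] += c
    lp = 0.0
    for d in range(D):                  # document-side Dirichlet-multinomial terms
        lp += math.lgamma(alpha) - math.lgamma(alpha + nd[d])
        for k in range(K):
            lp += math.lgamma(alpha * pi[k] + ndk[(d, k)]) - math.lgamma(alpha * pi[k])
    for k in range(K):                  # topic-side Dirichlet-multinomial terms
        lp += math.lgamma(beta) - math.lgamma(beta + nk[k])
        for w in range(W):
            lp += math.lgamma(beta * tau[w] + nkw[(k, w)]) - math.lgamma(beta * tau[w])
    return lp
```

A useful sanity check: for a single token the marginal reduces to πk τw, and the value is invariant to permuting tokens within a document, as exchangeability requires.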
This results in a joint probability distribution over an expanded system,

p(z, x, η, ξ, s, t | α, β, γ, π, τ) = ∏_{d=1}^D ( ηd^{α−1} (1 − ηd)^{nd··−1} / Γ(nd··) · ∏_{k=1}^K [ndk·, sdk] (απk)^{sdk} ) × ∏_{k=1}^K ( ξk^{β−1} (1 − ξk)^{n·k·−1} / Γ(n·k·) · ∏_{w=1}^W [n·kw, tkw] (βτw)^{tkw} )    (6)

where [n, m] are unsigned Stirling numbers of the first kind, and bold face letters denote sets of the corresponding variables. It can be readily verified that marginalizing out η, ξ, s and t reduces (6) to (5). The main insight is that conditioned on z and x the auxiliary variables are independent and have well-known distributions. Specifically, ηd and ξk are Beta distributed, while sdk (respectively tkw) is the random number of occupied tables in a Chinese restaurant process with ndk· (respectively n·kw) customers and a strength parameter of απk (respectively βτw) [13, 4].

3.2 The Variational Approximation

We assume the following form for the variational posterior over the auxiliary variable system:

q(z, η, ξ, s, t, α, β, γ, τ, π) = q(α) q(β) q(γ) q(τ) q(π) q(η, ξ, s, t | z) ∏_{d=1}^D ∏_{i=1}^{nd··} q(zid)    (7)

where the dependence of the auxiliary variables on z is modelled exactly. [7] showed that modelling exactly the dependence of a set of variables on another set is equivalent to integrating out the first set.
Thus we can interpret (7) as integrating out the auxiliary variables with respect to z. Given the above factorization, q(π) further factorizes so that the π̃k's are independent, as do the posteriors over the auxiliary variables.

For computational tractability, we also truncated our posterior representation to K topics. Specifically, we assumed that q(zid > K) = 0 for every i and d. A consequence is that observations have no effect on π̃k and φk for all k > K, and these parameters can be exactly marginalized out. Notice that our approach to truncation is different from that in [8], who implemented a truncation at T by instead fixing the posterior for the stick weight q(vT = 1) = 1, and from that in [9], who assumed that the variational posteriors for parameters beyond the truncation level are set at their priors. Our truncation approximation is nested like that in [9], and unlike that in [8]. Our approach is also simpler than that in [9], which requires computing an infinite sum which is intractable in the case of HDPs. We shall treat K as a parameter of the variational approximation, possibly optimized by iteratively splitting or merging topics (though we have not explored this in this paper; see discussion section). As in [9], we reordered the topic labels such that E[n·1·] > E[n·2·] > ··· . An expression for the variational bound on the marginal log-likelihood is given in appendix A.

3.3 Variational Updates

In this section we shall derive the complete set of variational updates for the system. In the following E[y] denotes the expectation of y, G[y] = e^{E[log y]} the geometric expectation, and V[y] = E[y²] − E[y]² the variance. Let Ψ(y) = ∂ log Γ(y)/∂y be the digamma function. We shall also employ index summation shorthands: · sums out that index, while >l sums over indices i where i > l.

Hyperparameters.
Updates for the hyperparameters are derived using the standard fully factorized variational approach, since they are assumed independent from each other and from other variables. For completeness we list these here, noting that α, β, γ are gamma distributed in the posterior, the π̃k's are beta distributed, and τ is Dirichlet distributed:

q(α) ∝ α^{aα+E[s··]−1} e^{−α(bα − ∑d E[log ηd])}
q(β) ∝ β^{aβ+E[t··]−1} e^{−β(bβ − ∑k E[log ξk])}
q(γ) ∝ γ^{aγ+K−1} e^{−γ(bγ − ∑_{k=1}^K E[log(1−π̃k)])}
q(π̃k) ∝ π̃k^{E[s·k]} (1 − π̃k)^{E[γ]+E[s·>k]−1}
q(τ) ∝ ∏_{w=1}^W τw^{aτ+E[t·w]−1}    (8)

In subsequent updates we will need averages and geometric averages of these quantities, which can be extracted using the following identities: p(x) ∝ x^{a−1}e^{−bx} ⇒ E[x] = a/b, G[x] = e^{Ψ(a)}/b; p(x) ∝ ∏k xk^{ak−1} ⇒ G[xk] = e^{Ψ(ak)}/e^{Ψ(∑k ak)}. Note also that the geometric expectations factorize: G[απk] = G[α]G[πk], G[βτw] = G[β]G[τw] and G[πk] = G[π̃k] ∏_{l=1}^{k−1} G[1 − π̃l].

Auxiliary variables. The variational posteriors for the auxiliary variables depend on z through the counts ndkw. ηd and ξk are beta distributed. If ndk· = 0 then q(sdk = 0) = 1, otherwise q(sdk) > 0 only if 1 ≤ sdk ≤ ndk·. Similarly for tkw.
The posteriors are:

q(ηd | z) ∝ ηd^{E[α]−1} (1 − ηd)^{nd··−1}        q(sdk = m | z) ∝ [ndk·, m] (G[απk])^m
q(ξk | z) ∝ ξk^{E[β]−1} (1 − ξk)^{n·k·−1}        q(tkw = m | z) ∝ [n·kw, m] (G[βτw])^m    (9)

To obtain expectations of the auxiliary variables in (8) we will have to average over z as well. For ηd this is E[log ηd] = Ψ(E[α]) − Ψ(E[α] + nd··), where nd·· is the (fixed) number of words in document d. For the other auxiliary variables these expectations depend on counts which can take on many values, and a naïve computation can be expensive. We derive computationally tractable approximations based upon an improvement to the second-order approximation in [7]. As we see in the experiments these approximations are very accurate. Consider E[log ξk]. We have,

E[log ξk | z] = Ψ(E[β]) − Ψ(E[β] + n·k·)    (10)

and we need to average over n·k· as well. [7] tackled a similar problem with log instead of Ψ using a second order Taylor expansion to log. Unfortunately such an approximation failed to work in our case, as the digamma function Ψ(y) diverges much more quickly than log y at y = 0. Our solution is to treat the case n·k· = 0 exactly, and apply the second-order approximation when n·k· > 0. This leads to the following approximation:

E[log ξk] ≈ P+[n·k·] ( Ψ(E[β]) − Ψ(E[β] + E+[n·k·]) − (1/2) V+[n·k·] Ψ''(E[β] + E+[n·k·]) )    (11)

where P+ is the "probability of being positive" operator: P+[y] = q(y > 0), and E+[y], V+[y] are the expectation and variance conditional on y > 0.
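A quick numerical check of the truncated second-order approximation (11) is possible when n·k· is a sum of a few independent Bernoulli variables, since the exact expectation can then be enumerated. The sketch below is ours, not the paper's code; the hand-rolled digamma routines and the particular Bernoulli probabilities are illustrative assumptions (P+, E+, V+ are computed from the Bernoulli parameters as in (14)):

```python
import math

def psi(x):
    # Digamma via upward recurrence plus an asymptotic series (x > 0).
    r = 0.0
    while x < 8.0:
        r -= 1.0 / x
        x += 1.0
    xi = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - xi * (1/12.0 - xi * (1/120.0 - xi / 252.0))

def psi2(x):
    # Second derivative psi''(x), using psi''(x) = psi''(x+1) - 2/x^3.
    r = 0.0
    while x < 8.0:
        r -= 2.0 / x**3
        x += 1.0
    return r - 1.0 / x**2 - 1.0 / x**3 - 0.5 / x**4 + 1.0 / (6.0 * x**6)

def exact_E_log_xi(qs, b):
    # Exact E[psi(b) - psi(b + n)] for n = sum of Bernoulli(q_i), by
    # dynamic-programming enumeration of the Poisson-binomial pmf.
    pmf = [1.0]
    for q in qs:
        nxt = [0.0] * (len(pmf) + 1)
        for n, p in enumerate(pmf):
            nxt[n] += p * (1.0 - q)
            nxt[n + 1] += p * q
        pmf = nxt
    return sum(p * (psi(b) - psi(b + n)) for n, p in enumerate(pmf) if n > 0)

def approx_E_log_xi(qs, b):
    # Eq. (11): treat n = 0 exactly, second-order Taylor for n > 0.
    p0 = math.prod(1.0 - q for q in qs)      # P(n = 0) = e^{Z[n]}
    P = 1.0 - p0                             # P+[n]
    Ep = sum(qs) / P                         # E+[n]
    V = sum(q * (1.0 - q) for q in qs)
    Vp = V / P - p0 * Ep**2                  # V+[n]
    return P * (psi(b) - psi(b + Ep) - 0.5 * Vp * psi2(b + Ep))

exact = exact_E_log_xi([0.9, 0.8, 0.7], 2.0)
approx = approx_E_log_xi([0.9, 0.8, 0.7], 2.0)
```

On small examples like this the two values agree to a fraction of a percent, consistent with the accuracy reported in the experiments.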
The other two expectations are derived similarly, making use of the fact that sdk and tkw are distributionally equal to the random numbers of tables in Chinese restaurant processes:

E[sdk] ≈ G[απk] P+[ndk·] ( Ψ(G[απk] + E+[ndk·]) − Ψ(G[απk]) + (1/2) V+[ndk·] Ψ''(G[απk] + E+[ndk·]) )
E[tkw] ≈ G[βτw] P+[n·kw] ( Ψ(G[βτw] + E+[n·kw]) − Ψ(G[βτw]) + (1/2) V+[n·kw] Ψ''(G[βτw] + E+[n·kw]) )    (12)

As in [7], we can efficiently track the relevant quantities above by noting that each count is a sum of independent Bernoulli variables. Consider ndk· as an example. We keep track of three quantities:

E[ndk·] = ∑i q(zid = k)    V[ndk·] = ∑i q(zid = k) q(zid ≠ k)    Z[ndk·] = ∑i log q(zid ≠ k)    (13)

Some algebraic manipulations now show that:

P+[ndk·] = 1 − e^{Z[ndk·]}    E+[ndk·] = E[ndk·] / P+[ndk·]    V+[ndk·] = V[ndk·] / P+[ndk·] − e^{Z[ndk·]} E+[ndk·]²    (14)

Topic assignment variables. [7] showed that if the dependence of a set of variables, say A, on another set of variables, say z, is modelled exactly, then in deriving the updates for z we may equivalently integrate out A.
Applying this to our situation with A = {η, ξ, s, t}, we obtain updates similar to those in [7], except that the hyperparameters are replaced by either their expectations or their geometric expectations, depending on which is used in the updates for the corresponding auxiliary variables:

q(zid = k) ∝ G[ G[απk] + n^{¬id}_{dk·} ] G[ G[βτ_{xid}] + n^{¬id}_{·k xid} ] G[ E[β] + n^{¬id}_{·k·} ]^{−1}

≈ ( G[απk] + E[n^{¬id}_{dk·}] ) ( G[βτ_{xid}] + E[n^{¬id}_{·k xid}] ) ( E[β] + E[n^{¬id}_{·k·}] )^{−1} × exp( − V[n^{¬id}_{dk·}] / (2(G[απk] + E[n^{¬id}_{dk·}])²) − V[n^{¬id}_{·k xid}] / (2(G[βτ_{xid}] + E[n^{¬id}_{·k xid}])²) + V[n^{¬id}_{·k·}] / (2(E[β] + E[n^{¬id}_{·k·}])²) )    (15)

4 Experiments

We implemented and compared performances for five inference algorithms for LDA and HDP: 1) variational LDA (V-LDA) [3], 2) collapsed variational LDA (CV-LDA) [7], 3) collapsed variational HDP (CV-HDP, this paper), 4) collapsed Gibbs sampling for LDA (G-LDA) [12] and 5) the direct assignment Gibbs sampler for HDP (G-HDP) [4].

We report results on the following 3 datasets: i) KOS (W = 6906, D = 3430, number of word tokens N = 467,714), ii) a subset of the Reuters dataset consisting of news topics with a number of documents larger than 300 (W = 4593, D = 8433, N = 566,298), iii) a subset of the 20Newsgroups dataset consisting of the topics 'comp.os.ms-windows.misc', 'rec.autos', 'rec.sport.baseball', 'sci.space' and 'talk.politics.misc' (W = 8424, D = 4716, N = 437,850).

For G-HDP we use the released code at http://www.gatsby.ucl.ac.uk/~ywteh/research/software.html. The variables β, τ are not adapted in that code, so we fixed them
at β = 100 and τw = 1/W for all algorithms (see below for discussion regarding adapting these in CV-HDP). G-HDP was initialized with either 1 topic (G-HDP1) or with 100 topics (G-HDP100). For CV-HDP we use the following initialization: E[β] = G[β] = 100 and G[τw] = 1/W (kept fixed to compare with G-HDP), E[α] = aα/bα, G[α] = e^{Ψ(aα)}/bα, E[γ] = aγ/bγ, G[πk] = 1/K and q(zij = k) ∝ 1 + u with u ∼ U[0, 1]. We set2 the hyperparameters aα, bα, aβ, bβ in the range [2, 6], while aγ, bγ were chosen in the range [5, 10] and aτ in [30, 50]/W. The number of topics used in CV-HDP was truncated at 40, 80, and 120 topics, corresponding to the numbers of topics used in the LDA algorithms. Finally, for all LDA algorithms we used α = 0.1, π = 1/K.

2We actually set these values using a fixed but somewhat elaborate scheme, which is the reason they ended up different for each dataset. Note that this scheme simply converts prior expectations about the number of topics and amount of sharing into hyperparameter values, and that they were never tweaked. Since they always ended up in these compact ranges, and since we do not expect a strong dependence on their values inside these ranges, we choose to omit the details.

Performance was evaluated by comparing i) the in-sample (train) variational bound on the log-likelihood for all three variational methods and ii) the out-of-sample (test) log-likelihood for all five methods. All inference algorithms were run on 90% of the words in each document while test-set performance was evaluated on the remaining 10% of the words.
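This held-out evaluation, using the mean posterior estimates given in (16) below for the variational methods, can be sketched as follows. This is an illustration under our own assumptions, not the paper's code: the function name and the layout of the expected-count arrays (E_njk[j][k] and E_nkw[k][w], taken from a trained q(z)) are hypothetical.

```python
import math

def variational_test_loglik(test_docs, alpha, pi, beta, tau, E_njk, E_nkw):
    # Per-word test log-likelihood: form mean estimates of theta and phi
    # from expected counts, then log-sum over topics for each held-out word.
    K = len(pi)
    nk = [sum(E_nkw[k]) for k in range(K)]      # expected n_.k. per topic
    total, nwords = 0.0, 0
    for j, doc in enumerate(test_docs):
        nj = sum(E_njk[j])                      # expected n_j.. for document j
        theta = [(alpha * pi[k] + E_njk[j][k]) / (alpha + nj) for k in range(K)]
        for w in doc:
            p = sum(theta[k] * (beta * tau[w] + E_nkw[k][w]) / (beta + nk[k])
                    for k in range(K))
            total += math.log(p)
            nwords += 1
    return total / nwords
```

With all expected counts at zero this reduces to the prior predictive ∑k πk τw, which is a convenient degenerate check.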
Test-set log-likelihood was computed as follows for the variational methods:

p(x^test) = ∏_{ij} ∑_k θ̄jk φ̄_{k x^test_ij}    θ̄jk = (απk + Eq[njk·]) / (α + Eq[nj··])    φ̄kw = (βτw + Eq[n·kw]) / (β + Eq[n·k·])    (16)

Note that we used estimated mean values of θjk and φkw [14]. For CV-HDP we replaced all hyperparameters by their expectations. For the Gibbs sampling algorithms, given S samples from the posterior, we used:

p(x^test) = ∏_{ij} (1/S) ∑_{s=1}^S ∑_k θ^s_{jk} φ^s_{k x^test_ij}    θ^s_{jk} = (α^s π^s_k + n^s_{jk·}) / (α^s + n^s_{j··})    φ^s_{kw} = (βτw + n^s_{·kw}) / (β + n^s_{·k·})    (17)

We used all samples obtained by the Gibbs sampling algorithms after an initial burn-in period; each point in the predictive probability plots below is obtained from the samples collected thus far.

The results, shown in Figure 2, display a significant improvement in accuracy of CV-HDP over CV-LDA, both in terms of the bound on the training log-likelihood as well as for the test-set log-likelihood. This is caused by the fact that CV-HDP is learning the variational distributions over the hyperparameters. We note that we have not trained β or τ for any of these methods. In fact, initial results for CV-HDP with these adapted show no additional improvement in test-set log-likelihood, in some cases even a deterioration of the results. A second observation is that convergence of all variational methods is faster than for the sampling methods. Thirdly, we see significant local optima effects in our simulations.
For example, G-HDP100 achieves the best results, better than G-HDP1, indicating that\npruning topics is a better way than adding topics to escape local optima in these models and leads to\nbetter posterior modes.\nIn further experiments we have also found that the variational methods bene\ufb01t from better initializa-\ntions due to local optima. In Figure 3 we show results when the variational methods were initialized\nat the last state obtained by G-HDP100. We see that indeed the variational methods were able to \ufb01nd\nsigni\ufb01cantly better local optima in the vicinity of the one found by G-HDP100, and that CV-HDP is\nstill consistently better than the other variational methods.\n\n5 Discussion\n\nIn this paper we have explored collapsed variational inference for the HDP. Our algorithm is the \ufb01rst\nto deal with the HDP and with posteriors over the parameters of Dirichlet distributions. We found\nthat the CV-HDP performs signi\ufb01cantly better than the CV-LDA on both test-set likelihood and the\nvariational bound. A caveat is that CV-HDP gives slightly worse test-set likelihood than collapsed\nGibbs sampling. However, as discussed in the introduction, we believe there are advantages to\nvariational approximations that are not available to sampling methods. A second caveat is that our\nvariational approximation works only for two layer HDPs\u2014a layer of group-speci\ufb01c DPs, and a\nglobal DP tying the groups together. It would be interesting to explore variational approximations\nfor more general HDPs.\nCV-HDP presents an improvement over CV-LDA in two ways. Firstly, we use a more sophisticated\nvariational approximation that can infer posterior distributions over the higher level variables in the\nmodel. Secondly, we use a more sophisticated HDP based model with an in\ufb01nite number of topics,\nand allow the model to \ufb01nd an appropriate number of topics automatically. 
These two advances are coupled, because we needed the more sophisticated variational approximation to deal with the HDP.

Along the way we have also proposed two useful technical tricks. Firstly, we have a new truncation technique that guarantees nesting. As a result we know that the variational bound on the marginal log-likelihood will reach its highest value (ignoring local optima issues) when K → ∞. This fact should facilitate the search over the number of topics or clusters, e.g. by splitting and merging topics, an aspect that we have not yet fully explored but from which we expect significant gains given the local optima issues observed in the experiments. Secondly, we have an improved second-order approximation that is able to handle the often encountered digamma function accurately.

An issue raised by the reviewers, and in need of more thought by the community, is the need for better evaluation criteria. The standard evaluation criteria in this area of research are the variational bound

Figure 2: Left column: KOS, Middle column: Reuters and Right column: 20Newsgroups. Top row: log p(x^test) as a function of K, Middle row: log p(x^test) as a function of the number of steps (defined as number of iterations multiplied by K) and Bottom row: variational bounds as a function of K. Log probabilities are on a per-word basis. Shown are averages and standard errors obtained by repeating the experiments 10 times with random restarts. The distributions over the number of topics found by G-HDP1 are: KOS: K = 113.2 ± 11.4, Reuters: K = 60.4 ± 6.4, 20News: K = 83.5 ± 5.0. For G-HDP100 we have: KOS: K = 168.3 ± 3.9, Reuters: K = 122.2 ± 5.0, 20News: K = 128.1 ± 6.6.

Figure 3: G-HDP100-initialized variational methods (K = 130), compared against variational methods initialized in the usual manner with K = 130 as well.
Results were averaged over 10 repeats.

and the test-set likelihood. However both confound improvements to the model and improvements to the inference method. An alternative is to compare the computed posteriors over latent variables on toy problems with known true values. However such toy problems are much smaller than real world problems, and inferential quality on such problems may be of limited interest to practitioners.

We expect the proliferation of Dirichlet-multinomial models and their many exciting applications to continue. For some applications variational approximations may prove to be the most convenient tool for inference.
We believe that the methods presented here are applicable to many models of this general class and we hope to provide general purpose software to support inference in these models in the future.

A Variational lower bound

The variational bound on the marginal log-likelihood is

E[log p(z, x | α, β, γ, π, τ) − log q(z)] − KL[q(α) ‖ p(α)] − KL[q(β) ‖ p(β)] − ∑_{k=1}^K KL[q(π̃k) ‖ p(π̃k)] − KL[q(τ) ‖ p(τ)]

= ∑_d F[log Γ(E[α]) / Γ(E[α] + nd··)] + ∑_{dk} F[log Γ(G[α]G[πk] + ndk·) / Γ(G[α]G[πk])] + ∑_k F[log Γ(E[β]) / Γ(E[β] + n·k·)] + ∑_{kw} F[log Γ(G[β]G[τw] + n·kw) / Γ(G[β]G[τw])]

− log ( Γ(aα) / Γ(aα + E[s··]) · G[α]^{E[s··]} e^{E[α] ∑d E[log ηd]} · (bα − ∑d E[log ηd])^{aα+E[s··]} / bα^{aα} )

− log ( Γ(aβ) / Γ(aβ + E[t··]) · G[β]^{E[t··]} e^{E[β] ∑k E[log ξk]} · (bβ − ∑k E[log ξk])^{aβ+E[t··]} / bβ^{aβ} )

− ∑_k log ( Γ(1 + γ + E[s·k] + E[s·>k]) / (γ Γ(1 + E[s·k]) Γ(γ + E[s·>k])) · G[π̃k]^{E[s·k]} G[1 − π̃k]^{E[s·>k]} )

− log ( Γ(κ + E[t··]) / Γ(κ) · ∏_w Γ(κτw) / Γ(κτw + E[t·w]) · ∏_w G[τw]^{E[t·w]} )

− ∑_{dk} ∑_{i=1}^{nd} q(zid = k) log q(zid = k)    (18)

where F[f(n)] = P+[n] ( f(E+[n]) + (1/2) V+[n] f''(E+[n]) ) is the improved second-order approximation.

Acknowledgements

We thank the reviewers for thoughtful and constructive comments. MW was supported by NSF grants IIS-0535278 and IIS-0447903.

References

[1] R. G. Cowell, A. P. Dawid, S. L.
Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag, 1999.

[2] M. J. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1(4), 2006.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[4] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[5] T. P. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 18, 2002.

[6] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 20, 2004.

[7] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2007.

[8] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.

[9] K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational DP mixture models. In Advances in Neural Information Processing Systems, volume 19, 2007.

[10] P. Liang, S. Petrov, M. I. Jordan, and D. Klein. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2007.

[11] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[12] T. L. Griffiths and M. Steyvers. A probabilistic approach to semantic representation. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002.

[13] C. E. Antoniak.
Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.\n\nAnnals of Statistics, 2(6):1152\u20131174, 1974.\n\n[14] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computa-\n\ntional Neuroscience Unit, University College London, 2003.\n\n\f", "award": [], "sourceid": 763, "authors": [{"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "Kenichi", "family_name": "Kurihara", "institution": null}, {"given_name": "Max", "family_name": "Welling", "institution": null}]}