{"title": "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 1353, "page_last": 1360, "abstract": null, "full_text": "A Collapsed Variational Bayesian Inference\nAlgorithm for Latent Dirichlet Allocation\n\nYee Whye Teh\n\nDavid Newman and Max Welling\n\nGatsby Computational Neuroscience Unit\n\nBren School of Information and Computer Science\n\nUniversity College London\n\n17 Queen Square, London WC1N 3AR, UK\n\nUniversity of California, Irvine\n\nCA 92697-3425 USA\n\nywteh@gatsby.ucl.ac.uk\n\n{newman,welling}@ics.uci.edu\n\nAbstract\n\nLatent Dirichlet allocation (LDA) is a Bayesian network that has recently gained\nmuch popularity in applications ranging from document modeling to computer\nvision. Due to the large scale nature of these applications, current inference pro-\ncedures like variational Bayes and Gibbs sampling have been found lacking. In\nthis paper we propose the collapsed variational Bayesian inference algorithm for\nLDA, and show that it is computationally ef\ufb01cient, easy to implement and signi\ufb01-\ncantly more accurate than standard variational Bayesian inference for LDA.\n\n1 Introduction\n\nBayesian networks with discrete random variables form a very general and useful class of proba-\nbilistic models. In a Bayesian setting it is convenient to endow these models with Dirichlet priors\nover the parameters as they are conjugate to the multinomial distributions over the discrete random\nvariables [1]. This choice has important computational advantages and allows for easy inference in\nsuch models.\n\nA class of Bayesian networks that has gained signi\ufb01cant momentum recently is latent Dirichlet\nallocation (LDA) [2], otherwise known as multinomial PCA [3]. It has found important applications\nin both text modeling [4, 5] and computer vision [6]. 
Training LDA on a large corpus of several million documents can be a challenge and crucially depends on an efficient and accurate inference procedure. A host of inference algorithms have been proposed, ranging from variational Bayesian (VB) inference [2] and expectation propagation (EP) [7] to collapsed Gibbs sampling [5].

Perhaps surprisingly, the collapsed Gibbs sampler proposed in [5] seems to be the preferred choice in many of these large scale applications. In [8] it is observed that EP is not efficient enough to be practical while VB suffers from a large bias. However, collapsed Gibbs sampling also has its own problems: one needs to assess convergence of the Markov chain, to have some idea of mixing times to estimate the number of samples to collect, and to identify coherent topics across multiple samples. In practice one often ignores these issues and collects as many samples as is computationally feasible, while the question of topic identification is often sidestepped by using just one sample. Hence there still seems to be a need for more efficient, accurate and deterministic inference procedures.

In this paper we will leverage the important insight that a Gibbs sampler that operates in a collapsed space—where the parameters are marginalized out—mixes much better than a Gibbs sampler that samples parameters and latent topic variables simultaneously. This suggests that the parameters and latent variables are intimately coupled. As we shall see in the following, marginalizing out the parameters induces new dependencies between the latent variables (which are conditionally independent given the parameters), but these dependencies are spread out over many latent variables. This implies that the dependency between any two latent variables is expected to be small. This is precisely the right setting for a mean field (i.e.
fully factorized variational) approximation: a particular variable interacts with the remaining variables only through summary statistics called the field, and the impact of any single variable on the field is very small [9]. Note that this is not true in the joint space of parameters and latent variables because fluctuations in parameters can have a significant impact on latent variables. We thus conjecture that the mean field assumptions are much better satisfied in the collapsed space of latent variables than in the joint space of latent variables and parameters. In this paper we leverage this insight and propose a collapsed variational Bayesian (CVB) inference algorithm.

In theory, the CVB algorithm requires the calculation of very expensive averages. However, the averages only depend on sums of independent Bernoulli variables, and thus are very closely approximated with Gaussian distributions (even for relatively small sums). Making use of this approximation, the final algorithm is computationally efficient, easy to implement and significantly more accurate than standard VB.

2 Approximate Inference in Latent Dirichlet Allocation

LDA models each document as a mixture over topics. We assume there are K latent topics, each being a multinomial distribution over a vocabulary of size W. For document j, we first draw a mixing proportion θ_j = {θ_{jk}} over K topics from a symmetric Dirichlet with parameter α. For the ith word in the document, a topic z_{ij} is drawn with topic k chosen with probability θ_{jk}, then word x_{ij} is drawn from the z_{ij}th topic, with x_{ij} taking on value w with probability φ_{kw}. Finally, a symmetric Dirichlet prior with parameter β is placed on the topic parameters φ_k = {φ_{kw}}.
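The generative process just described can be sketched in a few lines of NumPy. The corpus sizes, document lengths, and variable names below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K, W, D = 8, 100, 5      # topics, vocabulary size, documents (illustrative sizes)
alpha, beta = 0.1, 0.1   # symmetric Dirichlet hyperparameters

# Topic parameters: phi[k] is a distribution over the W vocabulary words.
phi = rng.dirichlet(np.full(W, beta), size=K)

docs = []
for j in range(D):
    n_j = rng.poisson(50) + 1                   # document length (not part of the Dirichlet structure)
    theta_j = rng.dirichlet(np.full(K, alpha))  # mixing proportion theta_j for document j
    z_j = rng.choice(K, size=n_j, p=theta_j)    # topic assignment z_ij for each word
    x_j = np.array([rng.choice(W, p=phi[k]) for k in z_j])  # word x_ij drawn from topic z_ij
    docs.append(x_j)
```

With a small symmetric α and β as above, each document concentrates on a few topics and each topic on a subset of the vocabulary, which is the regime the experiments in this paper operate in.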
The full joint distribution over all parameters and variables is:

p(x, z, θ, φ | α, β) = ∏_{j=1}^D [ Γ(Kα)/Γ(α)^K ∏_{k=1}^K θ_{jk}^{α−1+n_{jk·}} ] ∏_{k=1}^K [ Γ(Wβ)/Γ(β)^W ∏_{w=1}^W φ_{kw}^{β−1+n_{·kw}} ]    (1)

where n_{jkw} = #{i : x_{ij} = w, z_{ij} = k}, and a dot means the corresponding index is summed out: n_{·kw} = ∑_j n_{jkw}, and n_{jk·} = ∑_w n_{jkw}.

Given the observed words x = {x_{ij}} the task of Bayesian inference is to compute the posterior distribution over the latent topic indices z = {z_{ij}}, the mixing proportions θ = {θ_j} and the topic parameters φ = {φ_k}. There are three current approaches: variational Bayes (VB) [2], expectation propagation [7] and collapsed Gibbs sampling [5]. We review the VB and collapsed Gibbs sampling methods here as they are the most popular methods and to motivate our new algorithm which combines advantages of both.

2.1 Variational Bayes

Standard VB inference upper bounds the negative log marginal likelihood − log p(x | α, β) using the variational free energy:

− log p(x | α, β) ≤ F̃(q̃(z, θ, φ)) = E_{q̃}[− log p(x, z, θ, φ | α, β)] − H(q̃(z, θ, φ))    (2)

with q̃(z, θ, φ) an approximate posterior, H(q̃(z, θ, φ)) = E_{q̃}[− log q̃(z, θ, φ)] the variational entropy, and q̃(z, θ, φ) assumed to be fully factorized:

q̃(z, θ, φ) = ∏_{ij} q̃(z_{ij} | γ̃_{ij}) ∏_j q̃(θ_j | α̃_j) ∏_k q̃(φ_k | β̃_k)    (3)

q̃(z_{ij} | γ̃_{ij}) is multinomial with parameters γ̃_{ij} and q̃(θ_j | α̃_j), q̃(φ_k | β̃_k) are Dirichlet with parameters α̃_j and β̃_k respectively.
Optimizing F̃(q̃) with respect to the variational parameters gives us a set of updates guaranteed to improve F̃(q̃) at each iteration and converges to a local minimum:

α̃_{jk} = α + ∑_i γ̃_{ijk}    (4)

β̃_{kw} = β + ∑_{ij} 1(x_{ij} = w) γ̃_{ijk}    (5)

γ̃_{ijk} ∝ exp( Ψ(α̃_{jk}) + Ψ(β̃_{kx_{ij}}) − Ψ(∑_w β̃_{kw}) )    (6)

where Ψ(y) = ∂ log Γ(y)/∂y is the digamma function and 1 is the indicator function.

Although efficient and easily implemented, VB can potentially lead to very inaccurate results. Notice that the latent variables z and parameters θ, φ can be strongly dependent in the true posterior p(z, θ, φ | x) through the cross terms in (1). This dependence is ignored in VB which assumes that latent variables and parameters are independent instead. As a result, the VB upper bound on the negative log marginal likelihood can be very loose, leading to inaccurate estimates of the posterior.

2.2 Collapsed Gibbs Sampling

Standard Gibbs sampling, which iteratively samples latent variables z and parameters θ, φ, can potentially have slow convergence due again to strong dependencies between the parameters and latent variables. Collapsed Gibbs sampling improves upon Gibbs sampling by marginalizing out θ and φ instead, therefore dealing with them exactly.
The marginal distribution over x and z is

p(z, x | α, β) = ∏_j [ Γ(Kα)/Γ(Kα + n_{j··}) ∏_k Γ(α + n_{jk·})/Γ(α) ] ∏_k [ Γ(Wβ)/Γ(Wβ + n_{·k·}) ∏_w Γ(β + n_{·kw})/Γ(β) ]    (7)

Given the current state of all but one variable z_{ij}, the conditional probability of z_{ij} is:

p(z_{ij} = k | z^{¬ij}, x, α, β) = (α + n^{¬ij}_{jk·})(β + n^{¬ij}_{·kx_{ij}})(Wβ + n^{¬ij}_{·k·})^{−1} / ∑_{k'=1}^K (α + n^{¬ij}_{jk'·})(β + n^{¬ij}_{·k'x_{ij}})(Wβ + n^{¬ij}_{·k'·})^{−1}    (8)

where the superscript ¬ij means the corresponding variables or counts with x_{ij} and z_{ij} excluded, and the denominator is just a normalization. The conditional distribution of z_{ij} is multinomial with simple to calculate probabilities, so the programming and computational overhead is minimal.

Collapsed Gibbs sampling has been observed to converge quickly [5]. Notice from (8) that z_{ij} depends on z^{¬ij} only through the counts n^{¬ij}_{jk·}, n^{¬ij}_{·kx_{ij}}, n^{¬ij}_{·k·}. In particular, the dependence of z_{ij} on any particular other variable z_{i'j'} is very weak, especially for large datasets. As a result we expect the convergence of collapsed Gibbs sampling to be fast [10]. However, as with other MCMC samplers, and unlike variational inference, it is often hard to diagnose convergence, and a sufficiently large number of samples may be required to reduce sampling noise.

The argument of rapid convergence of collapsed Gibbs sampling is reminiscent of the argument for when mean field algorithms can be expected to be accurate [9]. The counts n^{¬ij}_{jk·}, n^{¬ij}_{·kx_{ij}}, n^{¬ij}_{·k·} act as fields through which z_{ij} interacts with other variables.
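As a concrete illustration of the sampler defined by (8), one full sweep over the corpus can be sketched as follows; the count-array layout, function signature, and in-place bookkeeping are our own, not the paper's:

```python
import numpy as np

def gibbs_sweep(docs, z, n_jk, n_kw, n_k, alpha, beta, rng):
    """One sweep of collapsed Gibbs sampling, resampling each z_ij from (8).

    docs: list of word-index arrays; z: list of current topic assignments;
    n_jk, n_kw, n_k: count arrays n_{jk.}, n_{.kw}, n_{.k.} kept in sync with z.
    """
    K, W = n_kw.shape
    for j, (x_j, z_j) in enumerate(zip(docs, z)):
        for i, w in enumerate(x_j):
            k = z_j[i]
            # Remove word ij from the counts (the "¬ij" superscript in (8)).
            n_jk[j, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Unnormalized conditional probabilities, first line of (8).
            p = (alpha + n_jk[j]) * (beta + n_kw[:, w]) / (W * beta + n_k)
            k = rng.choice(K, p=p / p.sum())
            # Add the word back under its newly sampled topic.
            z_j[i] = k
            n_jk[j, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
```

Keeping the three count arrays in sync with z is what makes each resampling step O(K), matching the minimal overhead noted above.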
In particular, averaging both sides of (8) by p(z^{¬ij} | x, α, β) gives us the Callen equations, a set of equations that the true posterior must satisfy:

p(z_{ij} = k | x, α, β) = E_{p(z^{¬ij}|x,α,β)}[ (α + n^{¬ij}_{jk·})(β + n^{¬ij}_{·kx_{ij}})(Wβ + n^{¬ij}_{·k·})^{−1} / ∑_{k'=1}^K (α + n^{¬ij}_{jk'·})(β + n^{¬ij}_{·k'x_{ij}})(Wβ + n^{¬ij}_{·k'·})^{−1} ]    (9)

Since the latent variables are already weakly dependent on each other, it is possible to replace (9) by a set of mean field equations where latent variables are assumed independent and still expect these equations to be accurate. This is the idea behind the collapsed variational Bayesian inference algorithm of the next section.

3 Collapsed Variational Bayesian Inference for LDA

We derive a new inference algorithm for LDA combining the advantages of both standard VB and collapsed Gibbs sampling. It is a variational algorithm which, instead of assuming independence, models the dependence of the parameters on the latent variables in an exact fashion. On the other hand we still assume that latent variables are mutually independent. This is not an unreasonable assumption to make since as we saw they are only weakly dependent on each other. We call this algorithm collapsed variational Bayesian (CVB) inference.

There are two ways to deal with the parameters in an exact fashion: the first is to marginalize them out of the joint distribution and to start from (7); the second is to explicitly model the posterior of θ, φ given z and x without any assumptions on its form. We will show that these two methods are equivalent.
The only assumption we make in CVB is that the latent variables z are mutually independent, thus we approximate the posterior as:

q̂(z, θ, φ) = q̂(θ, φ | z) ∏_{ij} q̂(z_{ij} | γ̂_{ij})    (10)

where q̂(z_{ij} | γ̂_{ij}) is multinomial with parameters γ̂_{ij}. The variational free energy becomes:

F̂(q̂(z) q̂(θ, φ | z)) = E_{q̂(z)q̂(θ,φ|z)}[− log p(x, z, θ, φ | α, β)] − H(q̂(z) q̂(θ, φ | z))
= E_{q̂(z)}[ E_{q̂(θ,φ|z)}[− log p(x, z, θ, φ | α, β)] − H(q̂(θ, φ | z)) ] − H(q̂(z))    (11)

We minimize the variational free energy with respect to q̂(θ, φ | z) first, followed by q̂(z). Since we do not restrict the form of q̂(θ, φ | z), the minimum is achieved at the true posterior q̂(θ, φ | z) = p(θ, φ | x, z, α, β), and the variational free energy simplifies to:

F̂(q̂(z)) ≜ min_{q̂(θ,φ|z)} F̂(q̂(z) q̂(θ, φ | z)) = E_{q̂(z)}[− log p(x, z | α, β)] − H(q̂(z))    (12)

We see that CVB is equivalent to marginalizing out θ, φ before approximating the posterior over z. As CVB makes a strictly weaker assumption on the variational posterior than standard VB, we have

F̂(q̂(z)) ≤ F̃(q̃(z)) ≜ min_{q̃(θ)q̃(φ)} F̃(q̃(z) q̃(θ) q̃(φ))    (13)

and thus CVB is a better approximation than standard VB. Finally, we derive the updates for the variational parameters γ̂_{ij}. Minimizing (12) with respect to γ̂_{ijk}, we get

γ̂_{ijk} = q̂(z_{ij} = k) = exp( E_{q̂(z^{¬ij})}[log p(x, z^{¬ij}, z_{ij} = k | α, β)] ) / ∑_{k'=1}^K exp( E_{q̂(z^{¬ij})}[log p(x, z^{¬ij}, z_{ij} = k' | α, β)] )    (14)

Plugging in (7), expanding log Γ(η+n)/Γ(η) = ∑_{l=0}^{n−1} log(η + l) for positive reals η and positive integers n, and cancelling terms appearing both in the numerator and denominator, we get

γ̂_{ijk} = exp( E_{q̂(z^{¬ij})}[ log(α + n^{¬ij}_{jk·}) + log(β + n^{¬ij}_{·kx_{ij}}) − log(Wβ + n^{¬ij}_{·k·}) ] ) / ∑_{k'=1}^K exp( E_{q̂(z^{¬ij})}[ log(α + n^{¬ij}_{jk'·}) + log(β + n^{¬ij}_{·k'x_{ij}}) − log(Wβ + n^{¬ij}_{·k'·}) ] )    (15)

For completeness, we describe how to compute each expectation term in (15) exactly in the appendix. This exact implementation of CVB is computationally too expensive to be practical, and we propose instead to use a simple Gaussian approximation which works very accurately and which requires minimal computational costs.

3.1 Gaussian approximation for CVB Inference

In this section we describe the Gaussian approximation applied to E_{q̂}[log(α + n^{¬ij}_{jk·})]; the other two expectation terms are similarly computed. Assume that n_{j··} ≫ 0. Notice that n^{¬ij}_{jk·} = ∑_{i'≠i} 1(z_{i'j} = k) is a sum of a large number of independent Bernoulli variables 1(z_{i'j} = k), each with mean parameter γ̂_{i'jk}, thus it can be accurately approximated by a Gaussian.
The mean and variance are given by the sum of the means and variances of the individual Bernoulli variables:

E_{q̂}[n^{¬ij}_{jk·}] = ∑_{i'≠i} γ̂_{i'jk}        Var_{q̂}[n^{¬ij}_{jk·}] = ∑_{i'≠i} γ̂_{i'jk}(1 − γ̂_{i'jk})    (16)

We further approximate the function log(α + n^{¬ij}_{jk·}) using a second-order Taylor expansion about E_{q̂}[n^{¬ij}_{jk·}], and evaluate its expectation under the Gaussian approximation:

E_{q̂}[log(α + n^{¬ij}_{jk·})] ≈ log(α + E_{q̂}[n^{¬ij}_{jk·}]) − Var_{q̂}(n^{¬ij}_{jk·}) / (2(α + E_{q̂}[n^{¬ij}_{jk·}])²)    (17)

Because E_{q̂}[n^{¬ij}_{jk·}] ≫ 0, the third derivative is small and the Taylor series approximation is very accurate. In fact, we have found experimentally that the Gaussian approximation works very well even when n_{j··} is small. The reason is that we often have γ̂_{i'jk} being either close to 0 or 1, thus the variance of n^{¬ij}_{jk·} is small relative to its mean and the Gaussian approximation will be accurate. Finally, plugging (17) into (15), we have our CVB updates:

γ̂_{ijk} ∝ (α + E_{q̂}[n^{¬ij}_{jk·}]) (β + E_{q̂}[n^{¬ij}_{·kx_{ij}}]) (Wβ + E_{q̂}[n^{¬ij}_{·k·}])^{−1} exp( − Var_{q̂}(n^{¬ij}_{jk·}) / (2(α + E_{q̂}[n^{¬ij}_{jk·}])²) − Var_{q̂}(n^{¬ij}_{·kx_{ij}}) / (2(β + E_{q̂}[n^{¬ij}_{·kx_{ij}}])²) + Var_{q̂}(n^{¬ij}_{·k·}) / (2(Wβ + E_{q̂}[n^{¬ij}_{·k·}])²) )    (18)

Notice the striking correspondence between (18), (8) and (9), showing that CVB is indeed the mean field version of collapsed Gibbs sampling.
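A direct, unoptimized transcription of update (18) with the Gaussian approximation might look as follows. For clarity this sketch recomputes the field means and variances from scratch for every token rather than maintaining the running sums that give the O(K)-per-update cost discussed below, and the flat token layout and names are our own:

```python
import numpy as np

def cvb_update(gamma, x, d, alpha, beta, W):
    """One sequential pass of the CVB update (18) over all tokens.

    gamma[t] is the K-vector q(z_t = .) for token t; x[t] and d[t] give
    the word and document of token t; W is the vocabulary size.
    """
    var = gamma * (1.0 - gamma)
    for t in range(gamma.shape[0]):
        g, v = gamma[t], var[t]
        same_doc, same_word = (d == d[t]), (x == x[t])
        # Means and variances of the fields with token t removed (the ¬ij counts), as in (16).
        E_jk = gamma[same_doc].sum(0) - g;  V_jk = var[same_doc].sum(0) - v
        E_kw = gamma[same_word].sum(0) - g; V_kw = var[same_word].sum(0) - v
        E_k  = gamma.sum(0) - g;            V_k  = var.sum(0) - v
        # First line of (18): collapsed-Gibbs probabilities at the mean fields...
        p = (alpha + E_jk) * (beta + E_kw) / (W * beta + E_k)
        # ...second line: exponentiated second-order variance corrections from (17).
        p = p * np.exp(-V_jk / (2 * (alpha + E_jk) ** 2)
                       - V_kw / (2 * (beta + E_kw) ** 2)
                       + V_k / (2 * (W * beta + E_k) ** 2))
        gamma[t] = p / p.sum()
        var[t] = gamma[t] * (1.0 - gamma[t])
    return gamma
```

An efficient implementation would instead keep running totals of the means and variances of n_{jk·}, n_{·kw} and n_{·k·} and subtract the current token's contribution, exactly as described in the next paragraph.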
In particular, the first line in (18) is obtained from (8) by replacing the fields n^{¬ij}_{jk·}, n^{¬ij}_{·kx_{ij}} and n^{¬ij}_{·k·} by their means (thus the term mean field) while the exponentiated terms are correction factors accounting for the variance in the fields.

CVB with the Gaussian approximation is easily implemented and has minimal computational costs. By keeping track of the mean and variance of n_{jk·}, n_{·kw} and n_{·k·}, and subtracting the mean and variance of the corresponding Bernoulli variables whenever we require the terms with x_{ij}, z_{ij} removed, the computational cost scales only as O(K) for each update to q̂(z_{ij}). Further, we only need to maintain one copy of the variational posterior over the latent variable for each unique document/word pair, thus the overall computational cost per iteration of CVB scales as O(MK) where M is the total number of unique document/word pairs, while the memory requirement is O(MK). This is the same as for VB. In comparison, collapsed Gibbs sampling needs to keep track of the current sample of z_{ij} for every word in the corpus, thus the memory requirement is O(N) while the computational cost scales as O(NK) where N is the total number of words in the corpus—higher than for VB and CVB. Note however that the constant factor involved in the O(NK) time cost of collapsed Gibbs sampling is significantly smaller than those for VB and CVB.

4 Experiments

We compared the three algorithms described in the paper: standard VB, CVB and collapsed Gibbs sampling. We used two datasets. The first is “KOS” (www.dailykos.com), which has J = 3430 documents, a vocabulary size of W = 6909, a total of N = 467,714 words in all the documents and on average 136 words per document.
Second is “NIPS” (books.nips.cc) with J = 1675 documents, a vocabulary size of W = 12419, N = 2,166,029 words in the corpus and on average 1293 words per document. In both datasets stop words and infrequent words were removed. We split both datasets into a training set and a test set by assigning 10% of the words in each document to the test set. In all our experiments we used α = 0.1, β = 0.1, and K = 8 topics for KOS and K = 40 for NIPS. We ran each algorithm on each dataset 50 times with different random initializations.

Performance was measured in two ways. First using variational bounds of the log marginal probabilities on the training set, and secondly using log probabilities on the test set. Expressions for the variational bounds are given in (2) for VB and (12) for CVB. For both VB and CVB, test set log probabilities are computed as:

p(x^{test}) = ∏_{ij} ∑_k θ̄_{jk} φ̄_{kx^{test}_{ij}}        θ̄_{jk} = (α + E_q[n_{jk·}]) / (Kα + E_q[n_{j··}])        φ̄_{kw} = (β + E_q[n_{·kw}]) / (Wβ + E_q[n_{·k·}])    (19)

Note that we used estimated mean values of θ_{jk} and φ_{kw} [11]. For collapsed Gibbs sampling, given S samples from the posterior, we used:

p(x^{test}) = ∏_{ij} ∑_k (1/S) ∑_{s=1}^S θ^s_{jk} φ^s_{kx^{test}_{ij}}        θ^s_{jk} = (α + n^s_{jk·}) / (Kα + n^s_{j··})        φ^s_{kw} = (β + n^s_{·kw}) / (Wβ + n^s_{·k·})    (20)

Figure 1 summarizes our results. We show both quantities as functions of iterations and as histograms of final values for all algorithms and datasets. CVB converged faster and to significantly better solutions than standard VB; this confirms our intuition that CVB provides much better approximations than VB.
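The held-out evaluation in (19) can be sketched as below; the array names and the per-word averaging convention are our own (the figures report per-word log probabilities):

```python
import numpy as np

def vb_test_log_prob(test_docs, E_njk, E_nkw, alpha, beta):
    """Per-word test set log probability for VB/CVB, equation (19).

    test_docs: list of word-index arrays (document j's held-out words);
    E_njk[j, k] and E_nkw[k, w]: expected counts under the variational posterior.
    """
    K = E_njk.shape[1]
    W = E_nkw.shape[1]
    # Estimated means theta_bar_{jk} and phi_bar_{kw} from (19).
    theta = (alpha + E_njk) / (K * alpha + E_njk.sum(1, keepdims=True))
    phi = (beta + E_nkw) / (W * beta + E_nkw.sum(1, keepdims=True))
    total, n_words = 0.0, 0
    for j, x_j in enumerate(test_docs):
        # p(x_ij) = sum_k theta_bar_{jk} phi_bar_{k, x_ij} for each held-out word.
        total += np.log(theta[j] @ phi[:, x_j]).sum()
        n_words += len(x_j)
    return total / n_words
```

The collapsed Gibbs evaluation (20) has the same shape, with the expected counts replaced by counts from each sample s and the word probabilities averaged over the S samples.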
CVB also converged faster than collapsed Gibbs sampling, but Gibbs sampling attains a better solution in the end; this is reasonable since Gibbs sampling should be exact with enough samples.

Figure 1: Left: results for KOS. Right: results for NIPS. First row: per word variational bounds as functions of numbers of iterations of VB and CVB. Second row: histograms of converged per word variational bounds across random initializations for VB and CVB. Third row: test set per word log probabilities as functions of numbers of iterations for VB, CVB and Gibbs. Fourth row: histograms of final test set per word log probabilities across 50 random initializations.

Figure 2: Left: test set per word log probabilities. Right: per word variational bounds. Both as functions of the number of documents for KOS.

We have also applied the exact but much slower version of CVB without the Gaussian approximation, and found that it gave identical results to the one proposed here (not shown).

We have also studied the dependence of approximation accuracies on the number of documents in the corpus. To conduct this experiment we train on 90% of the words in a (growing) subset of the corpus and test on the corresponding 10% left-out words. In Figure 2 we show both variational bounds and test set log probabilities as functions of the number of documents J. We observe that as expected the variational methods improve as J increases. However, perhaps surprisingly, CVB does not suffer as much as VB for small values of J, even though one might expect that the Gaussian approximation becomes dubious in that regime.

5 Discussion

We have described a collapsed variational Bayesian (CVB) inference algorithm for LDA. The algorithm is easy to implement, computationally efficient and more accurate than standard VB. The central insight of CVB is that instead of assuming parameters to be independent from latent variables, we treat their dependence on the topic variables in an exact fashion. Because the factorization assumptions made by CVB are weaker than those made by VB, the resulting approximation is more accurate.
Computational efficiency is achieved in CVB with a Gaussian approximation, which was found to be so accurate that there is never a need for exact summation.

The idea of integrating out parameters before applying variational inference has been independently proposed by [12]. Unfortunately, because they worked in the context of general conjugate-exponential families, the approach cannot be made generally computationally useful. Nevertheless, we believe the insights of CVB can be applied to a wider class of discrete graphical models beyond LDA. Specific examples include various extensions of LDA [4, 13], hidden Markov models with discrete outputs, and mixed-membership models with Dirichlet distributed mixture coefficients [14]. These models all have the property that they consist of discrete random variables with Dirichlet priors on the parameters, which is the property allowing us to use the Gaussian approximation. We are also exploring CVB on an even more general class of models, including mixtures of Gaussians, Dirichlet processes, and hierarchical Dirichlet processes.

Over the years a variety of inference algorithms have been proposed based on a combination of {maximize, sample, assume independent, marginalize out} applied to both parameters and latent variables. We conclude by summarizing these algorithms in Table 1, and note that CVB is located in the marginalize out parameters and assume latent variables are independent cell.

A Exact Computation of Expectation Terms in (15)

We can compute the expectation terms in (15) exactly as follows. Consider E_{q̂}[log(α + n^{¬ij}_{jk·})], which requires computing q̂(n^{¬ij}_{jk·}) (other expectation terms are similarly computed).
↓ Latent variables \ Parameters → | maximize      | sample         | assume independent | marginalize out
maximize                          | Viterbi EM    | ?              | ME                 | ME
sample                            | stochastic EM | Gibbs sampling | ?                  | collapsed Gibbs
assume independent                | variational EM| ?              | VB                 | CVB
marginalize out                   | EM            | any MCMC       | EP for LDA         | intractable

Table 1: A variety of inference algorithms for graphical models. Note that not every cell is filled in (marked by ?) while some are simply intractable. “ME” is the maximization-expectation algorithm of [15] and “any MCMC” means that we can use any MCMC sampler for the parameters once latent variables have been marginalized out.

Note that n^{¬ij}_{jk·} = ∑_{i'≠i} 1(z_{i'j} = k) is a sum of independent Bernoulli variables 1(z_{i'j} = k), each with mean parameter γ̂_{i'jk}. Define vectors v_{i'jk} = [(1 − γ̂_{i'jk}), γ̂_{i'jk}]^T, and let v_{jk} = v_{1jk} ⊗ ··· ⊗ v_{n_{j··}jk} be the convolution of all v_{i'jk}. Finally let v^{¬ij}_{jk} be v_{jk} deconvolved by v_{ijk}. Then q̂(n^{¬ij}_{jk·} = m) will be the (m+1)st entry in v^{¬ij}_{jk}. The expectation E_{q̂}[log(α + n^{¬ij}_{jk·})] can now be computed explicitly. This exact implementation requires an impractical O(n²_{j··}) time to compute E_{q̂}[log(α + n^{¬ij}_{jk·})]. At the expense of complicating the algorithm implementation, this can be improved by sparsifying the vectors v_{jk} (setting small entries to zero) as well as other computational tricks. We propose instead the Gaussian approximation of Section 3.1, which we have found to give extremely accurate results but with minimal implementation complexity and computational cost.

Acknowledgement

YWT was previously at NUS SoC and supported by the Lee Kuan Yew Endowment Fund. MW was supported by ONR under grant no. N00014-06-1-0734 and by NSF under grant no. 0535278.

References

[1] D.
Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers, 1999.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.

[3] W. Buntine. Variational extensions to EM and multinomial PCA. In ECML, 2002.

[4] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, 2004.

[5] T. L. Griffiths and M. Steyvers. Finding scientific topics. In PNAS, 2004.

[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.

[7] T. P. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In UAI, 2002.

[8] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In UAI, 2004.

[9] M. Opper and O. Winther. From naive mean field theory to the TAP equations. In D. Saad and M. Opper, editors, Advanced Mean Field Methods: Theory and Practice. The MIT Press, 2001.

[10] G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.

[11] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[12] J. Sung, Z. Ghahramani, and S. Choi. Variational Bayesian EM: A second-order approach. Unpublished manuscript, 2005.

[13] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, 2006.

[14] E. M. Airoldi, D. M. Blei, E. P. Xing, and S. E. Fienberg. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006.

[15] M. Welling and K. Kurihara. Bayesian K-means as a “maximization-expectation” algorithm.
In SIAM Conference on Data Mining, 2006.
", "award": [], "sourceid": 3113, "authors": [{"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "David", "family_name": "Newman", "institution": null}, {"given_name": "Max", "family_name": "Welling", "institution": null}]}