{"title": "Truncation-free Online Variational Inference for Bayesian Nonparametric Models", "book": "Advances in Neural Information Processing Systems", "page_first": 413, "page_last": 421, "abstract": "We present a truncation-free online variational inference algorithm for Bayesian nonparametric models. Unlike traditional (online) variational inference algorithms that require truncations for the model or the variational distribution, our method adapts model complexity on the fly. Our experiments for Dirichlet process mixture models and hierarchical Dirichlet process topic models on two large-scale data sets show better performance than previous online variational inference algorithms.", "full_text": "Truncation-free Stochastic Variational Inference for\n\nBayesian Nonparametric Models\n\nChong Wang\u2217\n\nMachine Learning Department\nCarnegie Mellon University\nchongw@cs.cmu.edu\n\nDavid M. Blei\n\nblei@cs.princeton.edu\n\nComputer Science Department\n\nPrinceton Univeristy\n\nAbstract\n\nWe present a truncation-free stochastic variational inference algorithm for Bayesian\nnonparametric models. While traditional variational inference algorithms require\ntruncations for the model or the variational distribution, our method adapts model\ncomplexity on the \ufb02y. We studied our method with Dirichlet process mixture\nmodels and hierarchical Dirichlet process topic models on two large data sets. Our\nmethod performs better than previous stochastic variational inference algorithms.\n\n1 Introduction\n\nBayesian nonparametric (BNP) models [1] have emerged as an important tool for building probability\nmodels with \ufb02exible latent structure and complexity. BNP models use posterior inference to adapt\nthe model complexity to the data. 
For example, as more data are observed, Dirichlet process (DP)\nmixture models [2] can create new mixture components and hierarchical Dirichlet process (HDP)\ntopic models [3] can create new topics.\nIn general, posterior inference in BNP models is intractable and we must approximate the posterior.\nThe most widely used approaches are Markov chain Monte Carlo (MCMC) [4] and variational\ninference [5]. For BNP models, the advantage of MCMC is that it directly operates in the unbounded\nlatent space; whether to increase model complexity (such as adding a new mixture component)\nnaturally folds into the sampling steps [6, 3]. However, MCMC does not easily scale\u2014it requires\nstoring many configurations of hidden variables, each one on the order of the number of data\npoints. For scalable MCMC one typically needs parallel hardware, and even then the computational\ncomplexity scales linearly with the data, which is not fast enough for massive data [7, 8, 9].\nThe alternative is variational inference, which finds the member of a simplified family of distributions\nthat best approximates the true posterior [5, 10]. This is generally faster than MCMC, and recent innovations\nlet us use stochastic optimization to approximate posteriors with massive and streaming data [11,\n12, 13]. Unlike MCMC, however, variational inference algorithms for BNP models do not operate\nin an unbounded latent space. 
Rather, they truncate the model or the variational distribution at a\nmaximum model complexity [13, 14, 15, 16, 17, 18].\u00b9 This is particularly limiting in the stochastic\napproach, where we might hope for a Bayesian nonparametric posterior seamlessly adapting its model\ncomplexity to an endless stream of data.\nIn this paper, we develop a truncation-free stochastic variational inference algorithm for BNP models.\nThis lets us more easily apply Bayesian nonparametric data analysis to massive and streaming data.\n\n\u2217 Work was done when the author was with Princeton University.\n\u00b9 In [17, 18], split-merge techniques were used to grow/shrink truncations. However, split-merge operations\nare model-specific and difficult to design. It is also unknown how to apply these to the stochastic variational\ninference setting we consider.\n\nIn particular, we present a new general inference algorithm, locally collapsed variational inference.\nWhen applied to BNP models, it does not require truncations and gives a principled mechanism\nfor increasing model complexity on the fly. We demonstrate our algorithm on DP mixture models\nand HDP topic models with two large data sets, showing improved performance over truncated\nalgorithms.\n\n2 Truncation-free stochastic variational inference for BNP models\n\nAlthough our goal is to develop an efficient stochastic variational inference algorithm for BNP\nmodels, it is more succinct to describe our algorithm for a wider class of hierarchical Bayesian\nmodels [19]. We show how to apply it to BNP models in \u00a72.3.\nWe consider the general class of hierarchical Bayesian models shown in Figure 1. Let the global\nhidden variables be \u03b2 with prior p(\u03b2 | \u03b7) (\u03b7 is the hyperparameter) and the local variables for each data\nsample be zi (hidden) and xi (observed) for i = 1, . . . , n. 
The joint distribution of all the variables\n(hidden and observed) factorizes as\n\np(\u03b2, z1:n, x1:n | \u03b7) = p(\u03b2 | \u03b7) \u220f_{i=1}^n p(xi, zi | \u03b2) = p(\u03b2 | \u03b7) \u220f_{i=1}^n p(xi | zi, \u03b2) p(zi | \u03b2). (1)\n\nThe idea behind the nomenclature is that the local variables are conditionally independent of each\nother given the global variables. For convenience, we assume the global variables \u03b2 are continuous and\nthe local variables zi are discrete. (This assumption is not necessary.) A large range of models can be\nrepresented in this form, e.g., mixture models [20, 21], mixed-membership models [3, 22], latent\nfactor models [23, 24] and tree-based hierarchical models [25].\nAs an example, consider a DP mixture model for document clustering. Each document is modeled\nas a bag of words drawn from a distribution over the vocabulary. The mixture components are the\ndistributions over the vocabulary \u03b8 and the mixture proportions \u03c0 are represented with a stick-breaking\nprocess [26]. The global variables \u03b2 \u225c (\u03c0, \u03b8) contain the proportions and components, and the local\nvariables zi are the mixture assignments for each document xi. The generative process is:\n\n1. Draw mixture components \u03b8k and sticks \u03c0k for k = 1, 2, . . . ,\n\n\u03b8k \u223c Dirichlet(\u03b7), \u03c0k = \u00af\u03c0k \u220f_{\u2113=1}^{k\u22121} (1 \u2212 \u00af\u03c0\u2113), \u00af\u03c0k \u223c Beta(1, a).\n\n2. For each document xi,\n\n(a) Draw the mixture assignment zi \u223c Mult(\u03c0).\n(b) Draw each word xij \u223c Mult(\u03b8zi).\n\nWe now return to the general model in Eq. 1. In inference, we are interested in the posterior of the\nhidden variables \u03b2 and z1:n given the observed data x1:n, i.e., p(\u03b2, z1:n | x1:n, \u03b7). For many models\nthis posterior is intractable. 
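The generative process above can be simulated directly; a minimal Python sketch, where the truncation T only makes the simulation finite (the unbounded model is recovered as T grows) and all names and defaults are illustrative:

```python
import random

def dirichlet(alpha):
    """Sample from Dirichlet(alpha) via normalized Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_corpus(n_docs, doc_len, V=10, eta=0.5, a=1.0, T=50):
    """Simulate the DP mixture of multinomials via stick breaking.

    T truncates only this simulation; the model itself is unbounded.
    """
    # Sticks: pi_k = v_k * prod_{l<k} (1 - v_l), with v_k ~ Beta(1, a).
    pis, remaining = [], 1.0
    for _ in range(T):
        v = random.betavariate(1.0, a)
        pis.append(remaining * v)
        remaining *= 1.0 - v
    pis.append(remaining)  # lump the leftover stick mass into one extra atom
    # Components: theta_k ~ Dirichlet(eta, ..., eta) over the vocabulary.
    thetas = [dirichlet([eta] * V) for _ in range(T + 1)]
    docs, zs = [], []
    for _ in range(n_docs):
        z = random.choices(range(T + 1), weights=pis)[0]   # z_i ~ Mult(pi)
        words = random.choices(range(V), weights=thetas[z], k=doc_len)
        docs.append(words)
        zs.append(z)
    return docs, zs
```

A finite corpus only ever touches finitely many components, which is the property the truncation-free inference developed below exploits.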
We will approximate it using mean-field variational inference.\n\n2.1 Variational inference\n\nIn variational inference we try to find a distribution in a simple family that is close to the true posterior.\nWe describe the mean-field approach, the simplest variational inference algorithm [5]. It assumes the\nfully factorized family of distributions over the hidden variables,\n\nq(\u03b2, z1:n) = q(\u03b2) \u220f_{i=1}^n q(zi). (2)\n\nWe call q(\u03b2) the global variational distribution and q(zi) the local variational distribution. We want\nto minimize the KL divergence between this variational distribution and the true posterior. Under\nstandard variational theory [5], this is equivalent to maximizing a lower bound on the log marginal\nlikelihood of the observed data x1:n. We obtain this bound with Jensen\u2019s inequality,\n\nlog p(x1:n | \u03b7) = log \u222b \u2211_{z1:n} p(x1:n, z1:n, \u03b2 | \u03b7) d\u03b2\n\u2265 Eq[log p(\u03b2) \u2212 log q(\u03b2) + \u2211_{i=1}^n (log p(xi, zi | \u03b2) \u2212 log q(zi))] \u225c L(q). (3)\n\nFigure 1: Graphical model\nfor hierarchical Bayesian models\nwith global hidden variables \u03b2,\nlocal hidden and observed variables zi and xi, i = 1, . . . , n.\nHyperparameter \u03b7 is fixed, not a\nrandom variable.\n\nFigure 2: Results on assigning document d = {w1, 0, . . . , 0} to q(\u03b81) (case\nA and B, shown in the figure above) or q(\u03b82) = Dirichlet(0.1, 0.1, . . . , 0.1).\nThe y axis is the log-odds of q(z = 1) to q(z = 2)\u2014if it is larger than 0,\nit is more likely to be assigned to component 1. 
The mean-field approach\nunderestimates the uncertainty around \u03b82, assigning d incorrectly for case B.\nThe locally collapsed approach does it correctly in both cases.\n\nAlgorithm 1 Mean-field variational inference.\n1: Initialize q(\u03b2).\n2: for iter = 1 to M do\n3:   for i = 1 to n do\n4:     Set local variational distribution q(zi) \u221d exp{Eq(\u03b2)[log p(xi, zi | \u03b2)]}.\n5:   end for\n6:   Set global variational distribution q(\u03b2) \u221d exp{Eq(z1:n)[log p(x1:n, z1:n, \u03b2)]}.\n7: end for\n8: return q(\u03b2).\n\nAlgorithm 2 Locally collapsed variational inference.\n1: Initialize q(\u03b2).\n2: for iter = 1 to M do\n3:   for i = 1 to n do\n4:     Set local distribution q(zi) \u221d Eq(\u03b2)[p(xi, zi | \u03b2)].\n5:     Sample from q(zi) to obtain its empirical \u02c6q(zi).\n6:   end for\n7:   Set global variational distribution q(\u03b2) \u221d exp{E\u02c6q(z1:n)[log p(x1:n, z1:n, \u03b2)]}.\n8: end for\n9: return q(\u03b2).\n\nMaximizing L(q) w.r.t. q(\u03b2, z1:n) defined in Eq. 2 (with the optimal conditions given in [27]) gives\n\nq(\u03b2) \u221d exp{Eq(z1:n)[log p(x1:n, z1:n, \u03b2 | \u03b7)]}, (4)\nq(zi) \u221d exp{Eq(\u03b2)[log p(xi, zi | \u03b2)]}. (5)\n\nTypically these equations are used in a coordinate ascent algorithm, iteratively optimizing each factor\nwhile holding the others fixed (see Algorithm 1). The factorization into global and local variables\nensures that the local updates only depend on the global factors, which facilitates speed-ups like\nparallel [28] and stochastic variational inference [11, 12, 13, 29].\nIn BNP models, however, the value of zi is potentially unbounded (e.g., the mixture assignment in a\nDP mixture). Thus we need to truncate the variational distribution [13, 14]. Truncation is necessary\nin variational inference because of the mathematical structure of BNP models. 
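The local updates in Algorithms 1 and 2 differ only in whether the expectation over q(\u03b2) is taken inside or outside the log. A sketch of both, for a discrete zi and with the expectations approximated by Monte Carlo samples of \u03b2 (the sampling is an illustrative stand-in for the closed forms derived later; all names are ours):

```python
import math

def meanfield_local(log_joint, beta_samples, K):
    """Algorithm 1, line 4: q(z=k) ~ exp{ E_q(beta)[ log p(x, z=k | beta) ] }."""
    scores = [sum(log_joint(k, b) for b in beta_samples) / len(beta_samples)
              for k in range(K)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return [x / sum(w) for x in w]

def locally_collapsed_local(log_joint, beta_samples, K):
    """Algorithm 2, line 4: q(z=k) ~ E_q(beta)[ p(x, z=k | beta) ]."""
    scores = []
    for k in range(K):
        logs = [log_joint(k, b) for b in beta_samples]
        m = max(logs)
        # log-mean-exp, for numerical stability
        scores.append(m + math.log(sum(math.exp(l - m) for l in logs) / len(logs)))
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return [x / sum(w) for x in w]
```

By Jensen's inequality E[p] \u2265 exp(E[log p]), so the locally collapsed update keeps more mass on components whose q(\u03b2) is still uncertain.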
Moreover, it is difficult\nto grow the truncation in mean-field variational inference even in an ad-hoc way, because mean-field inference tends to\nunderestimate posterior variance [30, 31]. In contrast, Gibbs sampling for BNP models can effectively\nexplore the unbounded latent space [6]: it samples from conditional distributions, which capture the posterior variance correctly.\n\n2.2 Locally collapsed variational inference\n\nWe now describe locally collapsed variational inference, which mitigates the problem of underestimating posterior variance in mean-field variational inference. Further, when applied to BNP models,\nit is truncation-free\u2014it gives a good mechanism to increase the truncation on the fly.\nAlgorithm 2 outlines the approach. The difference between traditional mean-field variational inference\nand our algorithm lies in the update of the local distribution q(zi). In our algorithm, it is\n\nq(zi) \u221d Eq(\u03b2)[p(xi, zi | \u03b2)], (6)\n\nas opposed to the mean-field update in Eq. 5. Because we collapse out the global variational\ndistribution q(\u03b2) locally, we call this method locally collapsed variational inference. Note the two\nalgorithms are similar when q(\u03b2) has low variance. However, when the uncertainty modeled in q(\u03b2)\nis high, the two approaches lead to different approximations of the posterior.\nIn our implementation, we use a collapsed Gibbs sampler to sample from Eq. 6. This is a local\nGibbs sampling step and thus is very fast. Further, this is where our algorithm does not require\ntruncation, because Gibbs samplers for BNP models can operate in an unbounded space [6, 3].\nNow we update q(\u03b2). Suppose we have a set of samples from q(zi) with which to construct its empirical\ndistribution \u02c6q(zi). 
Plugging this into Eq. 3 gives the solution for q(\u03b2),\n\nq(\u03b2) \u221d exp{E\u02c6q(z1:n)[log p(x1:n, z1:n, \u03b2 | \u03b7)]}, (7)\n\nwhich has the same form as Eq. 4 for the mean-field approach. This finishes Algorithm 2.\nTo give an intuitive comparison of locally collapsed (Algorithm 2) and mean-field (Algorithm 1)\nvariational inference, we consider a toy document clustering problem with vocabulary size V = 10.\nWe use a two-component Bayesian mixture model with fixed and equal prior proportions \u03c01 = \u03c02 =\n1/2. Suppose at some stage component 1 has some documents assigned to it while component 2\ndoes not yet, and we have obtained the (approximate) posteriors for the two component parameters \u03b81\nand \u03b82 as q(\u03b81) and q(\u03b82). For \u03b81, we consider two cases: A) q(\u03b81) = Dirichlet(0.1, 1, . . . , 1); B)\nq(\u03b81) = Dirichlet(1, 1, . . . , 1). For \u03b82, we only consider q(\u03b82) = Dirichlet(0.1, 0.1, . . . , 0.1). In\nboth cases, q(\u03b81) has relatively low variance while q(\u03b82) has high variance. The difference is that\nq(\u03b81) in case A puts a lower probability on word 1 than in case B. Now we have a new document\nd = {w1, 0, . . . , 0}, where word 1 is the only word and its frequency is w1. In both cases, document\nd is more likely to be assigned to component 2 as w1 becomes larger. Figure 2 shows the difference\nbetween mean-field and locally collapsed variational inference. In case A, the mean-field approach\nbehaves correctly, since word 1 already has a very low probability under \u03b81. But in case B, it ignores the\nuncertainty around \u03b82, resulting in incorrect clustering. Our approach behaves correctly in both cases.\nWhat justifies this approach? Alas, as for some other adaptations of variational inference, we do not\nyet have an airtight justification [32, 33, 34]. 
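The toy comparison above can be checked numerically; a self-contained sketch under the stated setup (V = 10, equal proportions, a document consisting of w1 copies of word 1; the digamma routine is a hand-rolled approximation to keep the example dependency-free):

```python
import math

def digamma(x):
    """Asymptotic approximation of psi(x), adequate for this illustration."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def meanfield_score(lam, counts):
    """E_q(theta)[ log p(x | theta) ] under q(theta) = Dirichlet(lam) (Eq. 5 style)."""
    lam0 = sum(lam)
    return sum(c * (digamma(l) - digamma(lam0)) for l, c in zip(lam, counts))

def collapsed_score(lam, counts):
    """log E_q(theta)[ p(x | theta) ]: the Dirichlet-multinomial marginal (Eq. 6 style)."""
    lam0, n = sum(lam), sum(counts)
    out = math.lgamma(lam0) - math.lgamma(lam0 + n)
    for l, c in zip(lam, counts):
        out += math.lgamma(l + c) - math.lgamma(l)
    return out

V, w1 = 10, 20
doc = [w1] + [0] * (V - 1)            # document with only word 1
q1_caseA = [0.1] + [1.0] * (V - 1)    # case A: q(theta1) = Dirichlet(0.1, 1, ..., 1)
q1_caseB = [1.0] * V                  # case B: q(theta1) = Dirichlet(1, ..., 1)
q2 = [0.1] * V                        # q(theta2) = Dirichlet(0.1, ..., 0.1)

for name, q1 in [("A", q1_caseA), ("B", q1_caseB)]:
    mf = meanfield_score(q1, doc) - meanfield_score(q2, doc)   # log-odds, Eq. 5
    lc = collapsed_score(q1, doc) - collapsed_score(q2, doc)   # log-odds, Eq. 6
    print(name, round(mf, 2), round(lc, 2))
```

In case B the mean-field log-odds come out large and positive (d goes to component 1), while the locally collapsed log-odds are negative (d correctly goes to the uncertain component 2); in case A both are negative, matching Figure 2.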
We are not optimizing q(zi), and so, setting aside the issue of local modes, the corresponding\nlower bound must be looser than the optimized lower bound from the mean-field approach. However, our experiments show that we find a better predictive\ndistribution than mean-field inference. One possible explanation is outlined in S.1 (section 1 of the\nsupplement), where we show that our algorithm can be understood as an approximate expectation\npropagation (EP) algorithm [35].\n\nRelated algorithms. Our algorithm is closely related to collapsed variational inference (CVI) [15,\n16, 36, 32, 33]. CVI applies variational inference to the marginalized model, integrating out the\nglobal hidden variable \u03b2. This gives better estimates of posterior variance. In CVI, however, the\noptimization for each local variable zi depends on all other local variables, and this makes it difficult\nto apply at large scale. Our algorithm is akin to applying CVI to an intermediate model that treats\nq(\u03b2) as a prior and considers a single data point xi with its hidden structure zi. This lets us develop\nstochastic algorithms that can be fit to massive data sets (as we show below).\nOur algorithm is also related to a recently proposed hybrid approach that uses Gibbs sampling\ninside stochastic variational inference to take advantage of the sparsity of text documents in topic\nmodeling [37]. Their approach still uses the mean-field update in Eq. 5, where all local hidden\ntopic variables (for a document) are grouped together and the optimal q(zi) is approximated by a\nGibbs sampler. With some adaptations, their fast sparse update idea can be used inside our algorithm.\n\nStochastic locally collapsed variational inference. We now extend our algorithm to stochastic\nvariational inference, allowing us to fit approximate posteriors to massive data sets. 
To do this, we\nassume the model in Figure 1 is in the exponential family and satisfies conditional conjugacy [11,\n13, 29]\u2014the global distribution p(\u03b2 | \u03b7) is the conjugate prior for the local distribution p(xi, zi | \u03b2),\n\np(\u03b2 | \u03b7) = h(\u03b2) exp{\u03b7\u22a4 t(\u03b2) \u2212 a(\u03b7)}, (8)\np(xi, zi | \u03b2) = h(xi, zi) exp{\u03b2\u22a4 t(xi, zi) \u2212 a(\u03b2)}, (9)\n\nwhere we overload the notation for the base measures h(\u00b7), sufficient statistics t(\u00b7), and log normalizers\na(\u00b7). (These will often be different for the two families.) Due to the conjugacy, the term t(\u03b2) has\nthe form t(\u03b2) = [\u03b2; \u2212a(\u03b2)]. Also assume the global variational distribution q(\u03b2 | \u03bb) is in the same\nfamily as the prior p(\u03b2 | \u03b7). Given these conditions, the batch update for q(\u03b2 | \u03bb) in Eq. 7 is\n\n\u03bb = \u03b7 + \u2211_{i=1}^n E\u02c6q(zi)[\u00aft(xi, zi)], (10)\n\nwhere the term \u00aft(xi, zi) is defined as \u00aft(xi, zi) \u225c [t(xi, zi); 1].\nAnalysis in [12, 13, 29] shows that given the conditional conjugacy assumption, the batch update\nof the parameter \u03bb in Eq. 10 can easily be turned into a stochastic update using the natural gradient [38].\nSuppose our parameter is \u03bbt at step t. Given a random observation xt, we sample from q(zt | xt, \u03bbt)\nto obtain the empirical distribution \u02c6q(zt). With an appropriate learning rate \u03c1t, we have\n\n\u03bbt+1 \u2190 \u03bbt + \u03c1t(\u2212\u03bbt + \u03b7 + n E\u02c6q(zt)[\u00aft(xt, zt)]). (11)\n\nThis corresponds to a stochastic update using the noisy natural gradient to optimize the lower bound\nin Eq. 3 [39]. (We note that the natural gradient is an approximation, since our q(zi) in Eq. 6 is\nsuboptimal for the lower bound in Eq. 3.)\n\nMini-batch. 
A common strategy used in stochastic variational inference [12, 13] is to use a small\nbatch of samples at each update. Suppose we have a batch of size S with samples xt, t \u2208 S.\nUnder our formulation, q(zt, t \u2208 S) becomes\n\nq(zt, t \u2208 S) \u221d Eq(\u03b2 | \u03bbt)[\u220f_{t \u2208 S} p(xt, zt | \u03b2)].\n\nWe choose not to factorize over zt, t \u2208 S, since factorization can lead to the label-switching\nproblem when new components are instantiated in BNP models [7].\n\n2.3 Truncation-free stochastic variational inference for BNP models\n\nWe have described locally collapsed variational inference in a general setting. Our main interest in\nthis paper is BNP models, and we now show how this approach leads to truncation-free variational\nalgorithms. We describe the approach for a DP mixture model [21], whose full description was\npresented at the beginning of \u00a72. See S.2 for details on HDP topic models [3].\n\nThe global variational distribution. The variational distribution for the global hidden variables,\nmixture components \u03b8 and stick proportions \u00af\u03c0, is\n\nq(\u03b8, \u00af\u03c0 | \u03bb, u, v) = \u220f_k q(\u03b8k | \u03bbk) q(\u00af\u03c0k | uk, vk),\n\nwhere \u03bbk is a Dirichlet parameter and (uk, vk) is a Beta parameter. The sufficient statistic term\nt(xi, zi) defined in Eq. 9 can be summarized as\n\nt(xi, zi)\u03bbkw = 1[zi = k] \u2211_j 1[xij = w], t(xi, zi)uk = 1[zi = k], t(xi, zi)vk = \u2211_{j>k} 1[zi = j],\n\nwhere 1[\u00b7] is the indicator function. Suppose at time t we have obtained the empirical distribution\n\u02c6q(zi) for observation xi. We then use Eq. 
11 to update the Dirichlet parameters \u03bb and Beta parameters (u, v),\n\n\u03bbkw \u2190 \u03bbkw + \u03c1t(\u2212\u03bbkw + \u03b7 + n \u02c6q(zi = k) \u2211_j 1[xij = w]),\nuk \u2190 uk + \u03c1t(\u2212uk + 1 + n \u02c6q(zi = k)),\nvk \u2190 vk + \u03c1t(\u2212vk + a + n \u2211_{\u2113>k} \u02c6q(zi = \u2113)).\n\nAlthough we have an unbounded number of mixture components, we do not need to represent them all\nexplicitly. Suppose we have T components that are associated with some data. These updates imply\nq(\u03b8k | \u03bbk) = Dirichlet(\u03b7) and q(\u00af\u03c0k) = Beta(1, a), i.e., their prior distributions, when k > T.\nSimilar to a Gibbs sampler [6], the model is \u201ctruncated\u201d automatically. (We re-order the sticks\naccording to their sizes [15].)\n\nThe local empirical distribution \u02c6q(zi). Since the mixture assignment zi is the only local hidden variable,\nwe obtain its analytical form using Eq. 6,\n\nq(zi = k) \u221d \u222b p(xi | \u03b8k) p(zi = k | \u03c0) q(\u03b8k | \u03bbk) q(\u00af\u03c0) d\u03b8k d\u00af\u03c0\n= [\u0393(\u2211_w \u03bbkw) / \u220f_w \u0393(\u03bbkw)] \u00d7 [\u220f_w \u0393(\u03bbkw + \u2211_j 1[xij = w]) / \u0393(\u2211_w \u03bbkw + |xi|)] \u00d7 [uk / (uk + vk)] \u00d7 \u220f_{\u2113=1}^{k\u22121} [v\u2113 / (u\u2113 + v\u2113)],\n\nwhere |xi| is the document length and \u0393(\u00b7) is the Gamma function. (For mini-batches, we do not have\nan analytical form, but we can sample from it.) The probability of creating a new component is\n\nq(zi > T) \u221d [\u0393(\u03b7V) / \u0393(\u03b7)^V] \u00d7 [\u220f_w \u0393(\u03b7 + \u2211_j 1[xij = w]) / \u0393(\u03b7V + |xi|)] \u00d7 \u220f_{k=1}^T [vk / (uk + vk)].\n\nWe sample from q(zi) to obtain its empirical distribution \u02c6q(zi). If zi > T, we create a new component.\n\nDiscussion. Why is \u201clocally collapsed\u201d enough? 
This is analogous to collapsed Gibbs sampling in DP mixture models [6], where the exploration of a new mixture component is initiated by a single sample. Locally collapsed variational inference is powerful enough to trigger this as well.\nIn the toy example above, the role of the distribution q(\u03b82) = Dirichlet(0.1, . . . , 0.1) is similar to that\nof the potential new component we would maintain in Gibbs sampling for DP mixture models.\nNote the difference between this approach and those found in [17, 18], which use mean-field methods\nthat can grow or shrink the truncation using split-merge moves. Those approaches are model-specific\nand difficult to design. Further, they do not transfer to the stochastic setting. In contrast, the approach\npresented here grows the truncation as a natural consequence of the inference algorithm and is easily\nadapted to stochastic inference.\n\n3 Experiments\n\nWe evaluate our methods on DP mixtures and HDP topic models, comparing them to truncation-based\nstochastic mean-field variational inference. We focus on stochastic methods and large data sets.\n\nDatasets. We analyzed two large document collections. The Nature data contains about 350K\ndocuments from the journal Nature from the years 1869 to 2008, with about 58M tokens and a vocabulary\nsize of 4,253. The New York Times dataset contains about 1.8M documents from the years 1987 to\n2007, with about 461M tokens and a vocabulary size of 8,000. Standard stop words and words\nthat appear fewer than 10 times or in more than 20 percent of the documents are removed, and the\nfinal vocabulary is chosen by TF-IDF. We set aside a test set of 10K documents from each corpus on\nwhich to evaluate predictive power; these test sets were not used in training.\n\nEvaluation Metric. 
We evaluate the different algorithms using held-out per-word likelihood,\n\nlikelihood_pw \u225c log p(Dtest | Dtrain) / \u2211_{xi \u2208 Dtest} |xi|.\n\nHigher likelihood is better. Since exactly computing the held-out likelihood is intractable, we use\napproximations; see S.3 for details. There is some question as to\nthe meaningfulness of held-out likelihood as a metric for comparing different models [40]. Held-out\nlikelihood metrics are nonetheless suited to measuring how well an inference algorithm accomplishes\nthe specific optimization task defined by a model.\n\nExperimental Settings. For DP mixtures, we set the component Dirichlet parameter \u03b7 = 0.5 and the\nconcentration parameter of the DP a = 1. For HDP topic models, we set the topic Dirichlet parameter\n\u03b7 = 0.01, and the first-level and second-level concentration parameters of the DP a = b = 1, as in [13].\n(See S.2 for the full description of HDP topic models.) For stochastic mean-field variational inference,\nwe set the truncation level at 300 for both DP and HDP. We ran all algorithms for 10 hours and took\nthe model at the final stage as the output, without assessing convergence. We varied the mini-batch\nsize S = {1, 2, 5, 10, 50, 100, 500}. (We do not intend to compare DP and HDP; we want to show that\nour algorithm works on both.)\nFor the stochastic mean-field approach, we set the learning rate according to [13], \u03c1t = (\u03c40 + t)^{\u2212\u03ba}\nwith \u03ba = 0.6 and \u03c40 = 1. We start our new method with 0 components, before seeing any data. We\ncannot use the learning rate schedule of [13], since it gives very large weights to the first several\ncomponents, effectively leaving no room for creating new components on the fly. We instead set the learning\nrate \u03c1t = S/nt for nt < n, where nt is the size of the corpus that the algorithm has seen at time t. After\nwe have seen all the documents, nt = n. 
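A sketch of this learning-rate schedule together with the coordinate-wise update of Eq. 11 (names are illustrative; the rate is floored at S/n, the lower bound used in the experiments):

```python
def learning_rate(S, n_seen, n_total):
    """rho_t = S / n_t while initializing, floored at S / n."""
    return max(S / max(n_seen, S), S / n_total)

def natural_gradient_step(lam, eta, n_total, expected_stats, rho):
    """Eq. 11, per coordinate:
    lam <- lam + rho * (-lam + eta + n * E_qhat[ t_bar(x_t, z_t) ])."""
    return [l + rho * (-l + e + n_total * s)
            for l, e, s in zip(lam, eta, expected_stats)]
```

With rho = 1 a step lands exactly on the batch fixed point eta + n * E[t_bar]; smaller rho damps the noisy natural gradient.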
For both stochastic mean-field and our algorithm, we set the lower\nbound of the learning rate to S/n. We found this works well in practice. This mimics the usual trick\nfor running a Gibbs sampler\u2014one uses sequential prediction for initialization and, after all data points\nhave been initialized, runs the full Gibbs sampler [41]. We remove components with fewer than\n1 document for DP, and topics with fewer than 1 word for HDP topic models, each time we\nprocess 20K documents.\n\nFigure 3: Results on DP mixtures. (a) Held-out likelihood comparison on both corpora. Our approach is more\nrobust to batch sizes and gives better predictive performance. (b) The inferred number of mixtures on New York\nTimes. (Nature is similar.) The left of figure (b) shows the number of mixture components inferred after 10\nhours; our method tends to give more mixtures. Small batch sizes for the stochastic mean-field approach do not\nreally work, resulting in a very small number of mixtures. The right of figure (b) shows how different methods\ninfer the number of mixtures. The stochastic mean-field approach shrinks it while our approach grows it.\n\nFigure 4: Results on HDP topic models. (a) Held-out likelihood comparison on both corpora. Our approach is\nmore robust to batch sizes and gives better predictive performance most of the time. (b) The inferred number of\ntopics on Nature. (New York Times is similar.) The left of figure (b) shows the number of topics inferred after 10\nhours; our method tends to give more topics. Small batch sizes for the stochastic mean-field approach do not\nreally work, resulting in a very small number of topics. The right of figure (b) shows how different methods infer\nthe number of topics. Similar to DP, the stochastic mean-field approach shrinks it while our approach grows it.\n\nResults. Figure 3 shows the results for DP mixture models. Figure 3(a) shows the held-out\nlikelihood comparison on both datasets. 
Our approach is more robust to batch sizes and usually gives\nbetter predictive performance. Small batch sizes for the stochastic mean-field approach do not work\nwell. Figure 3(b) shows the inferred number of mixtures on New York Times. (Nature is similar.)\nOur method tends to give more mixtures than the stochastic mean-field approach. The stochastic\nmean-field approach shrinks the preset truncation; our approach does not need a truncation and grows\nthe number of mixtures when the data require it.\nFigure 4 shows the results for HDP topic models. Figure 4(a) shows the held-out likelihood comparison on both datasets. Similar to DP mixtures, our approach is more robust to batch sizes and\ngives better predictive performance most of the time. And small batch sizes for the stochastic mean-field\napproach do not work well. Figure 4(b) shows the inferred number of topics on Nature. (New York\nTimes is similar.) This is also similar to DP. Our method tends to give more topics than the stochastic\nmean-field approach. The stochastic mean-field approach shrinks the preset truncation while our\napproach grows the number of topics when the data require it.\nOne possible explanation for why our method gives better results than the truncation-based stochastic\nmean-field approach is as follows. The truncation-based algorithm relies more on\nthe random initialization placed on the parameters within the preset truncation. If the random\ninitialization is not used well, performance degrades. This also explains why smaller batch sizes\nin stochastic mean-field tend to work much worse\u2014the first few samples may dominate the\neffect of the random initialization, leaving no room for later samples. Our approach mitigates this\nproblem by allowing new components/topics to be created as the data require.\nIf we compare DP and HDP, the best result for DP is better than that for HDP. But this comparison is\nnot meaningful. 
Besides the different settings of hyperparameters, computing the held-out likelihood\nis tractable for DP but intractable for HDP; we used importance sampling to approximate it. (See S.3 
 for details.) [42] shows that importance sampling usually gives the correct ranking of different topic\nmodels but significantly underestimates the probability.\n\n4 Conclusion and future work\n\nIn this paper, we have developed truncation-free stochastic variational inference algorithms for\nBayesian nonparametric (BNP) models and applied them to two large datasets. Extensions\nto other BNP models, such as the Pitman-Yor process [43], the Indian buffet process (IBP) [23, 24] and\nthe nested Chinese restaurant process [18, 25], are straightforward using their stick-breaking\nconstructions. Exploring how this algorithm behaves in the true streaming setting, where the program\nnever stops\u2014a \u201cnever-ending learning machine\u201d [44]\u2014is an interesting future direction.\n\nAcknowledgements. Chong Wang was supported by Google PhD and Siebel Scholar Fellowships.\nDavid M. Blei is supported by ONR N00014-11-1-0651, NSF CAREER 0745520, AFOSR FA9550-09-1-0668, the Alfred P. Sloan foundation, and a grant from Google.\n\nReferences\n[1] Hjort, N., C. Holmes, P. Mueller, et al. Bayesian Nonparametrics: Principles and Practice. Cambridge\nUniversity Press, Cambridge, UK, 2010.\n\n[2] Antoniak, C. 
Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
[3] Teh, Y., M. Jordan, M. Beal, et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[4] Andrieu, C., N. de Freitas, A. Doucet, et al. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[5] Jordan, M., Z. Ghahramani, T. Jaakkola, et al. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
[6] Neal, R. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[7] Newman, D., A. Asuncion, P. Smyth, et al. Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801–1828, 2009.
[8] Smola, A., S. Narayanamurthy. An architecture for parallel topic models. Proc. VLDB Endow., 3(1-2):703–710, 2010.
[9] Ahmed, A., M. Aly, J. Gonzalez, et al. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM). 2012.
[10] Wainwright, M., M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[11] Hoffman, M., D. M. Blei, C. Wang, et al. Stochastic variational inference. ArXiv e-prints, 2012.
[12] Hoffman, M., D. Blei, F. Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS). 2010.
[13] Wang, C., J. Paisley, D. M. Blei. Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics (AISTATS). 2011.
[14] Blei, D., M. Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 1(1):121–144, 2006.
[15] Kurihara, K., M. Welling, Y. Teh. Collapsed variational Dirichlet process mixture models. In International Joint Conferences on Artificial Intelligence (IJCAI). 2007.
[16] Teh, Y., K. Kurihara, M. Welling. Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems (NIPS). 2007.
[17] Kurihara, K., M. Welling, N. Vlassis. Accelerated variational Dirichlet process mixtures. In Advances in Neural Information Processing Systems (NIPS). 2007.
[18] Wang, C., D. Blei. Variational inference for the nested Chinese restaurant process. In Advances in Neural Information Processing Systems (NIPS). 2009.
[19] Gelman, A., J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
[20] McLachlan, G., D. Peel. Finite Mixture Models. Wiley-Interscience, 2000.
[21] Escobar, M., M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
[22] Blei, D., A. Ng, M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[23] Griffiths, T., Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems (NIPS). 2006.
[24] Teh, Y., D. Gorur, Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In International Conference on Artificial Intelligence and Statistics (AISTATS). 2007.
[25] Blei, D., T. Griffiths, M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.
[26] Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[27] Bishop, C. Pattern Recognition and Machine Learning.
Springer, New York, 2006.
[28] Zhai, K., J. Boyd-Graber, N. Asadi, et al. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In International World Wide Web Conference (WWW). 2012.
[29] Sato, M. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.
[30] Opper, M., O. Winther. From Naive Mean Field Theory to the TAP Equations, pages 1–19. MIT Press, 2001.
[31] MacKay, D. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[32] Asuncion, A., M. Welling, P. Smyth, et al. On smoothing and inference for topic models. In Uncertainty in Artificial Intelligence (UAI). 2009.
[33] Sato, I., H. Nakagawa. Rethinking collapsed variational Bayes inference for LDA. In International Conference on Machine Learning (ICML). 2012.
[34] Sato, I., K. Kurihara, H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 105–113. ACM, New York, NY, USA, 2012.
[35] Minka, T. Divergence measures and message passing. Tech. Rep. TR-2005-173, Microsoft Research, 2005.
[36] Teh, Y., D. Newman, M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS). 2006.
[37] Mimno, D., M. Hoffman, D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning (ICML). 2012.
[38] Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[39] Robbins, H., S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[40] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models.
In Advances in Neural Information Processing Systems (NIPS). 2009.
[41] Griffiths, T., M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), 2004.
[42] Wallach, H., I. Murray, R. Salakhutdinov, et al. Evaluation methods for topic models. In International Conference on Machine Learning (ICML). 2009.
[43] Pitman, J., M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900, 1997.
[44] Carlson, A., J. Betteridge, B. Kisiel, et al. Toward an architecture for never-ending language learning. In AAAI Conference on Artificial Intelligence (AAAI). 2010.