{"title": "Online Learning for Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 856, "page_last": 864, "abstract": "We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time.", "full_text": "Online Learning for Latent Dirichlet Allocation\n\nMatthew D. Hoffman\n\nDepartment of Computer Science\n\nPrinceton University\n\nPrinceton, NJ\n\nmdhoffma@cs.princeton.edu\n\nDavid M. Blei\n\nDepartment of Computer Science\n\nPrinceton University\n\nPrinceton, NJ\n\nblei@cs.princeton.edu\n\nFrancis Bach\n\nINRIA\u2014Ecole Normale Sup\u00e9rieure\n\nParis, France\n\nfrancis.bach@ens.fr\n\nAbstract\n\nWe develop an online variational Bayes (VB) algorithm for Latent Dirichlet\nAllocation (LDA). Online LDA is based on online stochastic optimization with a\nnatural gradient step, which we show converges to a local optimum of the VB\nobjective function. It can handily analyze massive document collections, including\nthose arriving in a stream. We study the performance of online LDA in several\nways, including by \ufb01tting a 100-topic topic model to 3.3M articles from Wikipedia\nin a single pass. 
We demonstrate that online LDA \ufb01nds topic models as good or\nbetter than those found with batch VB, and in a fraction of the time.\n\n1\n\nIntroduction\n\nHierarchical Bayesian modeling has become a mainstay in machine learning and applied statistics.\nBayesian models provide a natural way to encode assumptions about observed data, and analysis\nproceeds by examining the posterior distribution of model parameters and latent variables condi-\ntioned on a set of observations. For example, research in probabilistic topic modeling\u2014the applica-\ntion we will focus on in this paper\u2014revolves around \ufb01tting complex hierarchical Bayesian models\nto large collections of documents. In a topic model, the posterior distribution reveals latent semantic\nstructure that can be used for many applications.\nFor topic models and many other Bayesian models of interest, however, the posterior is intractable\nto compute and researchers must appeal to approximate posterior inference. Modern approximate\nposterior inference algorithms fall in two categories\u2014sampling approaches and optimization ap-\nproaches. Sampling approaches are usually based on Markov Chain Monte Carlo (MCMC) sam-\npling, where a Markov chain is de\ufb01ned whose stationary distribution is the posterior of interest. Op-\ntimization approaches are usually based on variational inference, which is called variational Bayes\n(VB) when used in a Bayesian hierarchical model. Whereas MCMC methods seek to generate inde-\npendent samples from the posterior, VB optimizes a simpli\ufb01ed parametric distribution to be close in\nKullback-Leibler divergence to the posterior. Although the choice of approximate posterior intro-\nduces bias, VB is empirically shown to be faster than and as accurate as MCMC, which makes it an\nattractive option when applying Bayesian models to large datasets [1, 2, 3].\nNonetheless, large scale data analysis with VB can be computationally dif\ufb01cult. 
Standard \u201cbatch\u201d\nVB algorithms iterate between analyzing each observation and updating dataset-wide variational\nparameters. The per-iteration cost of batch algorithms can quickly become impractical for very large\ndatasets. In topic modeling applications, this issue is particularly relevant\u2014topic modeling promises\nto summarize the latent structure of massive document collections that cannot be annotated by hand.\nA central research problem for topic modeling is to ef\ufb01ciently \ufb01t models to larger corpora [4, 5].\n\nFigure 1: Top: Perplexity on held-out Wikipedia documents as a function of number of documents\nanalyzed, i.e., the number of E steps. Online VB run on 3.3 million unique Wikipedia articles is\ncompared with online VB run on 98,000 Wikipedia articles and with the batch algorithm run on the\nsame 98,000 articles. The online algorithms converge much faster than the batch algorithm does.\nBottom: Evolution of a topic about business as online LDA sees more and more documents.\n\nTo this end, we develop an online variational Bayes algorithm for latent Dirichlet allocation (LDA),\none of the simplest topic models and one on which many others are based. Our algorithm is based on\nonline stochastic optimization, which has been shown to produce good parameter estimates dramatically\nfaster than batch algorithms on large datasets [6]. Online LDA handily analyzes massive collections\nof documents and, moreover, online LDA need not locally store or collect the documents\u2014\neach can arrive in a stream and be discarded after one look.\nIn the subsequent sections, we derive online LDA and show that it converges to a stationary point\nof the variational objective function. We study the performance of online LDA in several ways,\nincluding by \ufb01tting a topic model to 3.3M articles from Wikipedia without looking at the same\narticle twice. 
We show that online LDA \ufb01nds topic models as good as or better than those found\nwith batch VB, and in a fraction of the time (see \ufb01gure 1). Online variational Bayes is a practical\nnew method for estimating the posterior of complex hierarchical Bayesian models.\n\n2 Online variational Bayes for latent Dirichlet allocation\n\nLatent Dirichlet Allocation (LDA) [7] is a Bayesian probabilistic model of text documents. It assumes\na collection of K \u201ctopics.\u201d Each topic de\ufb01nes a multinomial distribution over the vocabulary\nand is assumed to have been drawn from a Dirichlet, \u03b2k \u223c Dirichlet(\u03b7). Given the topics, LDA\nassumes the following generative process for each document d. First, draw a distribution over topics\n\u03b8d \u223c Dirichlet(\u03b1). Then, for each word i in the document, draw a topic index zdi \u2208 {1, . . . , K}\nfrom the topic weights zdi \u223c \u03b8d and draw the observed word wdi from the selected topic, wdi \u223c \u03b2zdi.\nFor simplicity, we assume symmetric priors on \u03b8 and \u03b2, but this assumption is easy to relax [8].\n\nNote that if we sum over the topic assignments z, then we get p(wdi|\u03b8d, \u03b2) = \u2211k \u03b8dk\u03b2kw. This\nleads to the \u201cmultinomial PCA\u201d interpretation of LDA; we can think of LDA as a probabilistic\nfactorization of the matrix of word counts n (where ndw is the number of times word w appears in\ndocument d) into a matrix of topic weights \u03b8 and a dictionary of topics \u03b2 [9]. 
Our work can thus\nbe seen as an extension of online matrix factorization techniques that optimize squared error [10] to\nmore general probabilistic formulations.\n\n[Figure 1 graphic. Top panel: perplexity versus documents seen (log scale) for Batch 98K, Online 98K, and Online 3.3M. Bottom panel: top eight words of the topic by number of documents analyzed:\n2048: systems, road, made, service, announced, national, west, language\n4096: systems, health, communication, service, billion, language, care, road\n8192: service, systems, health, companies, market, communication, company, billion\n12288: service, systems, companies, business, company, billion, health, industry\n16384: service, companies, systems, business, company, industry, market, billion\n32768: business, service, companies, industry, company, management, systems, services\n49152: business, service, companies, industry, services, company, management, public\n65536: business, industry, service, companies, services, company, management, public]\n\nWe can analyze a corpus of documents with LDA by examining the posterior distribution of the\ntopics \u03b2, topic proportions \u03b8, and topic assignments z conditioned on the documents. This reveals\nlatent structure in the collection that can be used for prediction or data exploration. This posterior\ncannot be computed directly [7], and is usually approximated using Markov Chain Monte Carlo\n(MCMC) methods or variational inference. Both classes of methods are effective, but both present\nsigni\ufb01cant computational challenges in the face of massive data sets. Developing scalable approximate\ninference methods for topic models is an active area of research [3, 4, 5, 11].\nTo this end, we develop online variational inference for LDA, an approximate posterior inference\nalgorithm that can analyze massive collections of documents. 
We \ufb01rst review the traditional variational\nBayes algorithm for LDA and its objective function, then present our online method, and\nshow that it converges to a stationary point of the same objective function.\n\n2.1 Batch variational Bayes for LDA\n\nIn variational Bayesian inference (VB) the true posterior is approximated by a simpler distribution\nq(z, \u03b8, \u03b2), which is indexed by a set of free parameters [12, 13]. These parameters are optimized to\nmaximize the Evidence Lower BOund (ELBO):\n\n$$\\log p(w \\mid \\alpha, \\eta) \\ge \\mathcal{L}(w, \\phi, \\gamma, \\lambda) \\triangleq \\mathbb{E}_q[\\log p(w, z, \\theta, \\beta \\mid \\alpha, \\eta)] - \\mathbb{E}_q[\\log q(z, \\theta, \\beta)]. \\tag{1}$$\n\nMaximizing the ELBO is equivalent to minimizing the KL divergence between q(z, \u03b8, \u03b2) and the\nposterior p(z, \u03b8, \u03b2|w, \u03b1, \u03b7). Following [7], we choose a fully factorized distribution q of the form\n\n$$q(z_{di} = k) = \\phi_{d w_{di} k}; \\quad q(\\theta_d) = \\mathrm{Dirichlet}(\\theta_d; \\gamma_d); \\quad q(\\beta_k) = \\mathrm{Dirichlet}(\\beta_k; \\lambda_k). \\tag{2}$$\n\nThe posterior over the per-word topic assignments z is parameterized by \u03c6, the posterior over the\nper-document topic weights \u03b8 is parameterized by \u03b3, and the posterior over the topics \u03b2 is parameterized\nby \u03bb. As a shorthand, we refer to \u03bb as \u201cthe topics.\u201d Equation 1 factorizes to\n\n$$\\mathcal{L}(w, \\phi, \\gamma, \\lambda) = \\sum_d \\Big\\{ \\mathbb{E}_q[\\log p(w_d \\mid \\theta_d, z_d, \\beta)] + \\mathbb{E}_q[\\log p(z_d \\mid \\theta_d)] - \\mathbb{E}_q[\\log q(z_d)] + \\mathbb{E}_q[\\log p(\\theta_d \\mid \\alpha)] - \\mathbb{E}_q[\\log q(\\theta_d)] + \\big(\\mathbb{E}_q[\\log p(\\beta \\mid \\eta)] - \\mathbb{E}_q[\\log q(\\beta)]\\big)/D \\Big\\}. \\tag{3}$$\n\nNotice we have brought the per-corpus terms into the summation over documents, and divided them\nby the number of documents D. This step will help us to derive an online inference algorithm.\nWe now expand the expectations above to be functions of the variational parameters. 
This reveals\nthat the variational objective relies only on ndw, the number of times word w appears in document\nd. When using VB\u2014as opposed to MCMC\u2014documents can be summarized by their word counts,\n\n$$\\begin{aligned} \\mathcal{L} = \\sum_d \\Big\\{ & \\sum_w n_{dw} \\sum_k \\phi_{dwk} \\big(\\mathbb{E}_q[\\log \\theta_{dk}] + \\mathbb{E}_q[\\log \\beta_{kw}] - \\log \\phi_{dwk}\\big) \\\\ & - \\log \\Gamma\\Big(\\sum_k \\gamma_{dk}\\Big) + \\sum_k \\big((\\alpha - \\gamma_{dk}) \\mathbb{E}_q[\\log \\theta_{dk}] + \\log \\Gamma(\\gamma_{dk})\\big) \\\\ & + \\Big(\\sum_k \\big(-\\log \\Gamma\\big(\\sum_w \\lambda_{kw}\\big) + \\sum_w \\big((\\eta - \\lambda_{kw}) \\mathbb{E}_q[\\log \\beta_{kw}] + \\log \\Gamma(\\lambda_{kw})\\big)\\big)\\Big)/D \\\\ & + \\log \\Gamma(K\\alpha) - K \\log \\Gamma(\\alpha) + \\big(\\log \\Gamma(W\\eta) - W \\log \\Gamma(\\eta)\\big)/D \\Big\\} \\\\ \\triangleq {} & \\sum_d \\ell(n_d, \\phi_d, \\gamma_d, \\lambda), \\end{aligned} \\tag{4}$$\n\nwhere W is the size of the vocabulary and D is the number of documents. \u2113(nd, \u03c6d, \u03b3d, \u03bb) denotes\nthe contribution of document d to the ELBO.\nL can be optimized using coordinate ascent over the variational parameters \u03c6, \u03b3, \u03bb [7]:\n\n$$\\phi_{dwk} \\propto \\exp\\{\\mathbb{E}_q[\\log \\theta_{dk}] + \\mathbb{E}_q[\\log \\beta_{kw}]\\}; \\quad \\gamma_{dk} = \\alpha + \\sum_w n_{dw} \\phi_{dwk}; \\quad \\lambda_{kw} = \\eta + \\sum_d n_{dw} \\phi_{dwk}. \\tag{5}$$\n\nThe expectations under q of log \u03b8 and log \u03b2 are\n\n$$\\mathbb{E}_q[\\log \\theta_{dk}] = \\Psi(\\gamma_{dk}) - \\Psi\\Big(\\sum_{i=1}^K \\gamma_{di}\\Big); \\quad \\mathbb{E}_q[\\log \\beta_{kw}] = \\Psi(\\lambda_{kw}) - \\Psi\\Big(\\sum_{i=1}^W \\lambda_{ki}\\Big), \\tag{6}$$\n\nwhere \u03a8 denotes the digamma function (the \ufb01rst derivative of the logarithm of the gamma function).\nThe updates in equation 5 are guaranteed to converge to a stationary point of the ELBO. By analogy\nto the Expectation-Maximization (EM) algorithm [14], we can partition these updates into an \u201cE\u201d\nstep\u2014iteratively updating \u03b3 and \u03c6 until convergence, holding \u03bb \ufb01xed\u2014and an \u201cM\u201d step\u2014updating\n\u03bb given \u03c6. In practice, this algorithm converges to a better solution if we reinitialize \u03b3 and \u03c6 before\neach E step. 
Algorithm 1 outlines batch VB for LDA.\n\nAlgorithm 1 Batch variational Bayes for LDA\n\n  Initialize \u03bb randomly.\n  while relative improvement in L(w, \u03c6, \u03b3, \u03bb) > 0.00001 do\n    E step:\n    for d = 1 to D do\n      Initialize \u03b3dk = 1. (The constant 1 is arbitrary.)\n      repeat\n        Set \u03c6dwk \u221d exp{Eq[log \u03b8dk] + Eq[log \u03b2kw]}\n        Set \u03b3dk = \u03b1 + \u2211w \u03c6dwk ndw\n      until (1/K) \u2211k |change in \u03b3dk| < 0.00001\n    end for\n    M step:\n    Set \u03bbkw = \u03b7 + \u2211d ndw \u03c6dwk\n  end while\n\n2.2 Online variational inference for LDA\n\nAlgorithm 1 has constant memory requirements and empirically converges faster than batch collapsed\nGibbs sampling [3]. However, it still requires a full pass through the entire corpus each\niteration. It can therefore be slow to apply to very large datasets, and is not naturally suited to settings\nwhere new data is constantly arriving. We propose an online variational inference algorithm\nfor \ufb01tting \u03bb, the parameters to the variational posterior over the topic distributions \u03b2. Our algorithm\nis nearly as simple as the batch VB algorithm, but converges much faster for large datasets.\nA good setting of the topics \u03bb is one for which the ELBO L is as high as possible after \ufb01tting the\nper-document variational parameters \u03b3 and \u03c6 with the E step de\ufb01ned in algorithm 1. Let \u03b3(nd, \u03bb)\nand \u03c6(nd, \u03bb) be the values of \u03b3d and \u03c6d produced by the E step. 
Our goal is to set \u03bb to maximize\n\n$$\\mathcal{L}(n, \\lambda) \\triangleq \\sum_d \\ell(n_d, \\gamma(n_d, \\lambda), \\phi(n_d, \\lambda), \\lambda), \\tag{7}$$\n\nwhere \u2113(nd, \u03b3d, \u03c6d, \u03bb) is the dth document\u2019s contribution to the variational bound in equation 4.\nThis is analogous to the goal of least-squares matrix factorization, although the ELBO for LDA is\nless convenient to work with than a simple squared loss function such as the one in [10].\nOnline VB for LDA (\u201conline LDA\u201d) is described in algorithm 2. As the tth vector of word counts\nnt is observed, we perform an E step to \ufb01nd locally optimal values of \u03b3t and \u03c6t, holding \u03bb \ufb01xed.\nWe then compute \u03bb\u0303, the setting of \u03bb that would be optimal (given \u03c6t) if our entire corpus consisted\nof the single document nt repeated D times. D is the number of unique documents available to the\nalgorithm, e.g. the size of a corpus. (In the true online case D \u2192 \u221e, corresponding to empirical\nBayes estimation of \u03b2.) We then update \u03bb using a weighted average of its previous value and \u03bb\u0303.\nThe weight given to \u03bb\u0303 is \u03c1t \u225c (\u03c40 + t)^{\u2212\u03ba}, where \u03ba \u2208 (0.5, 1] controls the rate at which\nold values of \u03bb\u0303 are forgotten and \u03c40 \u2265 0 slows down the early iterations of the algorithm. The\ncondition that \u03ba \u2208 (0.5, 1] is needed to guarantee convergence. We show in section 2.3 that online\nLDA corresponds to a stochastic natural gradient algorithm on the variational objective L [15, 16].\nThis algorithm closely resembles one proposed in [16] for online VB on models with hidden data\u2014\nthe most important difference is that we use an approximate E step to optimize \u03b3t and \u03c6t, since we\ncannot compute the conditional distribution p(zt, \u03b8t|\u03b2, nt, \u03b1) exactly.\n\nMini-batches. 
A common technique in stochastic learning is to consider multiple observations per\nupdate to reduce noise [6, 17]. In online LDA, this means computing \u03bb\u0303 using S > 1 observations:\n\n$$\\tilde{\\lambda}_{kw} = \\eta + \\frac{D}{S} \\sum_s n_{tsw} \\phi_{tswk}, \\tag{8}$$\n\nwhere nts is the sth document in mini-batch t. The variational parameters \u03c6ts and \u03b3ts for this\ndocument are \ufb01t with a normal E step. Note that we recover batch VB when S = D and \u03ba = 0.\n\nHyperparameter estimation. In batch variational LDA, point estimates of the hyperparameters\n\u03b1 and \u03b7 can be \ufb01t given \u03b3 and \u03bb using a linear-time Newton-Raphson method [7]. We can likewise\nincorporate updates for \u03b1 and \u03b7 into online LDA:\n\n$$\\alpha \\leftarrow \\alpha - \\rho_t \\tilde{\\alpha}(\\gamma_t); \\quad \\eta \\leftarrow \\eta - \\rho_t \\tilde{\\eta}(\\lambda), \\tag{9}$$\n\nwhere \u03b1\u0303(\u03b3t) is the inverse of the Hessian times the gradient \u2207\u03b1\u2113(nt, \u03b3t, \u03c6t, \u03bb), \u03b7\u0303(\u03bb) is the inverse\nof the Hessian times the gradient \u2207\u03b7L, and \u03c1t \u225c (\u03c40 + t)^{\u2212\u03ba} as elsewhere.\n\nAlgorithm 2 Online variational Bayes for LDA\n\n  De\ufb01ne \u03c1t \u225c (\u03c40 + t)^{\u2212\u03ba}\n  Initialize \u03bb randomly.\n  for t = 0 to \u221e do\n    E step:\n    Initialize \u03b3tk = 1. (The constant 1 is arbitrary.)\n    repeat\n      Set \u03c6twk \u221d exp{Eq[log \u03b8tk] + Eq[log \u03b2kw]}\n      Set \u03b3tk = \u03b1 + \u2211w \u03c6twk ntw\n    until (1/K) \u2211k |change in \u03b3tk| < 0.00001\n    M step:\n    Compute \u03bb\u0303kw = \u03b7 + D ntw \u03c6twk\n    Set \u03bb = (1 \u2212 \u03c1t)\u03bb + \u03c1t \u03bb\u0303.\n  end for\n\n2.3 Analysis of convergence\n\nIn this section we show that algorithm 2 converges to a stationary point of the objective de\ufb01ned in\nequation 7. Since variational inference replaces sampling with optimization, we can use results from\nstochastic optimization to analyze online LDA. 
Stochastic optimization algorithms optimize an objective\nusing noisy estimates of its gradient [18]. Although there is no explicit gradient computation,\nalgorithm 2 can be interpreted as a stochastic natural gradient algorithm [16, 15].\nWe begin by deriving a related \ufb01rst-order stochastic gradient algorithm for LDA. Let g(n) denote\nthe population distribution over documents n from which we will repeatedly sample documents:\n\n$$g(n) \\triangleq \\frac{1}{D} \\sum_{d=1}^D I[n = n_d]. \\tag{10}$$\n\nI[n = nd] is 1 if n = nd and 0 otherwise. If this population consists of the D documents in the\ncorpus, then we can rewrite equation 7 as\n\n$$\\mathcal{L}(g, \\lambda) \\triangleq D \\, \\mathbb{E}_g[\\ell(n, \\gamma(n, \\lambda), \\phi(n, \\lambda), \\lambda) \\mid \\lambda], \\tag{11}$$\n\nwhere \u2113 is de\ufb01ned as in equation 4. We can optimize equation 11 over \u03bb by repeatedly drawing an\nobservation nt \u223c g, computing \u03b3t \u225c \u03b3(nt, \u03bb) and \u03c6t \u225c \u03c6(nt, \u03bb), and applying the update\n\n$$\\lambda \\leftarrow \\lambda + \\rho_t D \\nabla_\\lambda \\ell(n_t, \\gamma_t, \\phi_t, \\lambda), \\tag{12}$$\n\nwhere \u03c1t \u225c (\u03c40 + t)^{\u2212\u03ba} as in algorithm 2. If we condition on the current value of \u03bb and\ntreat \u03b3t and \u03c6t as random variables drawn at the same time as each observed document nt, then\nEg[D\u2207\u03bb\u2113(nt, \u03b3t, \u03c6t, \u03bb)|\u03bb] = \u2207\u03bb \u2211d \u2113(nd, \u03b3d, \u03c6d, \u03bb). Thus, since $\\sum_{t=0}^{\\infty} \\rho_t = \\infty$ and $\\sum_{t=0}^{\\infty} \\rho_t^2 < \\infty$,\nthe analysis in [19] shows both that \u03bb converges and that the gradient \u2207\u03bb \u2211d \u2113(nd, \u03b3d, \u03c6d, \u03bb)\nconverges to 0, and thus that \u03bb converges to a stationary point.1\n\n1Although we use a deterministic procedure to compute \u03b3 and \u03c6 given n and \u03bb, this analysis can also be\napplied if \u03b3 and \u03c6 are optimized using a randomized algorithm. We address this case in the supplement.\n\nThe update in equation 12 only makes use of \ufb01rst-order gradient information. Stochastic gradient\nalgorithms can be sped up by multiplying the gradient by the inverse of an appropriate positive\nde\ufb01nite matrix H [19]. One choice for H is the Hessian of the objective function. 
In variational\ninference, an alternative is to use the Fisher information matrix of the variational distribution q (i.e.,\nthe Hessian of the log of the variational probability density function), which corresponds to using\na natural gradient method instead of a (quasi-) Newton method [16, 15]. Following the analysis in\n[16], the gradient of the per-document ELBO \u2113 can be written as\n\n$$\\begin{aligned} \\frac{\\partial \\ell(n_t, \\gamma_t, \\phi_t, \\lambda)}{\\partial \\lambda_{kw}} &= \\sum_{v=1}^W \\frac{\\partial \\mathbb{E}_q[\\log \\beta_{kv}]}{\\partial \\lambda_{kw}} \\big(-\\lambda_{kv}/D + \\eta/D + n_{tv} \\phi_{tvk}\\big) \\\\ &= \\sum_{v=1}^W -\\frac{\\partial^2 \\log q(\\log \\beta_k)}{\\partial \\lambda_{kv} \\partial \\lambda_{kw}} \\big(-\\lambda_{kv}/D + \\eta/D + n_{tv} \\phi_{tvk}\\big), \\end{aligned} \\tag{13}$$\n\nwhere we have used the fact that Eq[log \u03b2kv] is the derivative of the log-normalizer of q(log \u03b2k). By\nde\ufb01nition, multiplying equation 13 by the inverse of the Fisher information matrix yields\n\n$$\\left[ \\left( -\\frac{\\partial^2 \\log q(\\log \\beta_k)}{\\partial \\lambda_k \\partial \\lambda_k^T} \\right)^{-1} \\frac{\\partial \\ell(n_t, \\gamma_t, \\phi_t, \\lambda)}{\\partial \\lambda_k} \\right]_w = -\\lambda_{kw}/D + \\eta/D + n_{tw} \\phi_{twk}. \\tag{14}$$\n\nMultiplying equation 14 by \u03c1tD and adding it to \u03bbkw yields the update for \u03bb in algorithm 2. Thus\nwe can interpret our algorithm as a stochastic natural gradient algorithm, as in [16].\n\n3 Related work\n\nComparison with other stochastic learning algorithms. In the standard stochastic gradient optimization\nsetup, the number of parameters to be \ufb01t does not depend on the number of observations\n[19]. 
However, some learning algorithms must also \ufb01t a set of per-observation parameters (such\nas the per-document variational parameters \u03b3d and \u03c6d in LDA). The problem is addressed by on-\nline coordinate ascent algorithms such as those described in [20, 21, 16, 17, 10]. The goal of these\nalgorithms is to set the global parameters so that the objective is as good as possible once the per-\nobservation parameters are optimized. Most of these approaches assume the computability of a\nunique optimum for the per-observation parameters, which is not available for LDA.\n\nEf\ufb01cient sampling methods. Markov Chain Monte Carlo (MCMC) methods form one class of\napproximate inference algorithms for LDA. Collapsed Gibbs Sampling (CGS) is a popular MCMC\napproach that samples from the posterior over topic assignments z by repeatedly sampling the topic\nassignment zdi conditioned on the data and all other topic assignments [22].\nOne online MCMC approach adapts CGS by sampling topic assignments zdi based on the topic\nassignments and data for all previously analyzed words, instead of all other words in the corpus [23].\nThis algorithm is fast and has constant memory requirements, but is not guaranteed to converge to\nthe posterior. Two alternative online MCMC approaches were considered in [24]. The \ufb01rst, called\nincremental LDA, periodically resamples the topic assignments for previously analyzed words. The\nsecond approach uses particle \ufb01ltering instead of CGS. In a study in [24], none of these three online\nMCMC algorithms performed as well as batch CGS.\nInstead of online methods, the authors of [4] used parallel computing to apply LDA to large corpora.\nThey developed two approximate parallel CGS schemes for LDA that gave similar predictive per-\nformance on held-out documents to batch CGS. 
However, they require parallel hardware, and their\ncomplexity and memory costs still scale linearly with the number of documents.\nExcept for the algorithm in [23] (which is not guaranteed to converge), all of the MCMC algorithms\ndescribed above have memory costs that scale linearly with the number of documents analyzed. By\ncontrast, batch VB can be implemented using constant memory, and parallelizes easily. As we will\nshow in the next section, its online counterpart is even faster.\n\n4 Experiments\n\nWe ran several experiments to evaluate online LDA\u2019s ef\ufb01ciency and effectiveness. The \ufb01rst set of\nexperiments compares algorithms 1 and 2 on static datasets. The second set of experiments evaluates\nonline VB in the setting where new documents are constantly being observed. Both algorithms were\nimplemented in Python using Numpy. The implementations are as similar as possible.2\n\n2Open-source Python implementations of batch and online LDA can be found at\nhttp://www.cs.princeton.edu/~mdhoffma.\n\nTable 1: Best settings of \u03ba and \u03c40 for various mini-batch sizes S, with resulting perplexities on\nNature and Wikipedia corpora.\n\nBest parameter settings for Nature corpus\nS           1      4      16     64     256    1024   4096   16384\n\u03ba           0.9    0.8    0.8    0.7    0.6    0.5    0.5    0.5\n\u03c40          1024   1024   1024   1024   1024   256    64     1\nPerplexity  1132   1087   1052   1053   1042   1031   1030   1046\n\nBest parameter settings for Wikipedia corpus\nS           1      4      16     64     256    1024   4096   16384\n\u03ba           0.9    0.9    0.8    0.7    0.6    0.5    0.5    0.5\n\u03c40          1024   1024   1024   1024   1024   1024   64     1\nPerplexity  675    640    611    595    588    584    580    584\n\nFigure 2: Held-out perplexity obtained on the Nature (left) and Wikipedia (right) corpora as a function\nof CPU time. For moderately large mini-batch sizes, online LDA \ufb01nds solutions as good as\nthose that the batch LDA \ufb01nds, but with much less computation. 
When \ufb01t to a 10,000-document\nsubset of the training corpus batch LDA\u2019s speed improves, but its performance suffers.\n\nWe use perplexity on held-out data as a measure of model \ufb01t. Perplexity is de\ufb01ned as the geometric\nmean of the inverse marginal probability of each word in the held-out set of documents:\n\n$$\\mathrm{perplexity}(n^{\\mathrm{test}}, \\lambda, \\alpha) \\triangleq \\exp\\Big\\{ -\\Big(\\sum_i \\log p(n_i^{\\mathrm{test}} \\mid \\alpha, \\beta)\\Big) \\Big/ \\Big(\\sum_{i,w} n_{iw}^{\\mathrm{test}}\\Big) \\Big\\}, \\tag{15}$$\n\nwhere $n_i^{\\mathrm{test}}$ denotes the vector of word counts for the ith document. Since we cannot directly\ncompute $\\log p(n_i^{\\mathrm{test}} \\mid \\alpha, \\beta)$, we use an upper bound on perplexity (obtained from the variational\nlower bound on the log likelihood) as a proxy:\n\n$$\\mathrm{perplexity}(n^{\\mathrm{test}}, \\lambda, \\alpha) \\le \\exp\\Big\\{ -\\Big(\\sum_i \\big(\\mathbb{E}_q[\\log p(n_i^{\\mathrm{test}}, \\theta_i, z_i \\mid \\alpha, \\beta)] - \\mathbb{E}_q[\\log q(\\theta_i, z_i)]\\big)\\Big) \\Big/ \\Big(\\sum_{i,w} n_{iw}^{\\mathrm{test}}\\Big) \\Big\\}. \\tag{16}$$\n\nThe per-document parameters \u03b3i and \u03c6i for the variational distributions q(\u03b8i) and q(zi) are \ufb01t using\nthe E step in algorithm 2. The topics \u03bb are \ufb01t to a training set of documents and then held \ufb01xed. In\nall experiments \u03b1 and \u03b7 are \ufb01xed at 0.01 and the number of topics K = 100.\nThere is some question as to the meaningfulness of perplexity as a metric for comparing different\ntopic models [25]. Held-out likelihood metrics are nonetheless well suited to measuring how well\nan inference algorithm accomplishes the speci\ufb01c optimization task de\ufb01ned by a model.\nEvaluating learning parameters. 
Online LDA introduces several learning parameters: \u03ba \u2208\n(0.5, 1], which controls how quickly old information is forgotten; \u03c40 \u2265 0, which downweights early\niterations; and the mini-batch size S, which controls how many documents are used each iteration.\nAlthough online LDA converges to a stationary point for any valid \u03ba, \u03c40, and S, the quality of this\nstationary point and the speed of convergence may depend on how the learning parameters are set.\nWe evaluated a range of settings of the learning parameters \u03ba, \u03c40, and S on two corpora: 352,549\ndocuments from the journal Nature 3 and 100,000 documents downloaded from the English version\nof Wikipedia 4. For each corpus, we set aside a 1,000-document test set and a separate\n1,000-document validation set. We then ran online LDA for \ufb01ve hours on the remaining documents\nfrom each corpus for \u03ba \u2208 {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, \u03c40 \u2208 {1, 4, 16, 64, 256, 1024}, and\nS \u2208 {1, 4, 16, 64, 256, 1024, 4096, 16384}, for a total of 288 runs per corpus. After \ufb01ve hours of\nCPU time, we computed perplexity on the test sets for the topics \u03bb obtained at the end of each \ufb01t.\n\n3For the Nature articles, we removed all words not in a pruned vocabulary of 4,253 words.\n\n[Figure 2 graphic. Two panels (Nature, Wikipedia): held-out perplexity versus time in seconds (log scale). Legend: mini-batch sizes 1, 16, 256, 1024, 4096, and 16384, plus batch10K and batch98K.]\n\nTable 1 summarizes the best settings for each corpus of \u03ba and \u03c40 for a range of settings of S. The\nsupplement includes a more exhaustive summary. The best learning parameter settings for both\ncorpora were \u03ba = 0.5, \u03c40 = 64, and S = 4096. The best settings of \u03ba and \u03c40 are consistent across\nthe two corpora. 
For mini-batch sizes from 256 to 16384 there is little difference in perplexity scores.\nSeveral trends emerge from these results. Higher values of the learning rate \u03ba and the downweighting\nparameter \u03c40 lead to better performance for small mini-batch sizes S, but worse performance for\nlarger values of S. Mini-batch sizes of at least 256 documents outperform smaller mini-batch sizes.\nComparing batch and online on \ufb01xed corpora. To compare batch LDA to online LDA, we evalu-\nated held-out perplexity as a function of time on the Nature and Wikipedia corpora above. We tried\nvarious mini-batch sizes from 1 to 16,384, using the best learning parameters for each mini-batch\nsize found in the previous study of the Nature corpus. We also evaluated batch LDA \ufb01t to a 10,000-\ndocument subset of the training corpus. We computed perplexity on a separate validation set from\nthe test set used in the previous experiment. Each algorithm ran for 24 hours of CPU time.\nFigure 2 summarizes the results. On the larger Nature corpus, online LDA \ufb01nds a solution as good as\nthe batch algorithm\u2019s with much less computation. On the smaller Wikipedia corpus, the online al-\ngorithm \ufb01nds a better solution than the batch algorithm does. The batch algorithm converges quickly\non the 10,000-document corpora, but makes less accurate predictions on held-out documents.\nTrue online. To demonstrate the ability of online VB to perform in a true online setting, we wrote a\nPython script to continually download and analyze mini-batches of articles chosen at random from\na list of approximately 3.3 million Wikipedia articles. This script can download and analyze about\n60,000 articles an hour. It completed a pass through all 3.3 million articles in under three days. 
The\namount of time needed to download an article and convert it to a vector of word counts is comparable\nto the amount of time that the online LDA algorithm takes to analyze it.\nWe ran online LDA with \u03ba = 0.5, \u03c40 = 1024, and S = 1024. Figure 1 shows the evolution of the\nperplexity obtained on the held-out validation set of 1,000 Wikipedia articles by the online algorithm\nas a function of number of articles seen. Shown for comparison is the perplexity obtained by the\nonline algorithm (with the same parameters) \ufb01t to only 98,000 Wikipedia articles, and that obtained\nby the batch algorithm \ufb01t to the same 98,000 articles.\nThe online algorithm outperforms the batch algorithm regardless of which training dataset is used,\nbut it does best with access to a constant stream of novel documents. The batch algorithm\u2019s failure\nto outperform the online algorithm on limited data may be due to stochastic gradient\u2019s robustness\nto local optima [19]. The online algorithm converged after analyzing about half of the 3.3 million\narticles. Even one iteration of the batch algorithm over that many articles would have taken days.\n\n5 Discussion\n\nWe have developed online variational Bayes (VB) for LDA. This algorithm requires only a few\nmore lines of code than the traditional batch VB of [7], and is handily applied to massive and\nstreaming document collections. Online VB for LDA approximates the posterior as well as previous\napproaches in a fraction of the time. The approach we used to derive an online version of batch VB\nfor LDA is general (and simple) enough to apply to a wide variety of hierarchical Bayesian models.\n\nAcknowledgments\n\nD.M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, AFOSR 09NL202, the Alfred\nP. Sloan foundation, and a grant from Google. F. 
Bach is supported by ANR (MGA project).\n\n4 For the Wikipedia articles, we removed all words not from a fixed vocabulary of 7,995 common words.\nThis vocabulary was obtained by removing words less than 3 characters long from a list of the 10,000 most common\nwords in Project Gutenberg texts obtained from http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists.\n\nReferences\n\n[1] M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. arXiv, (0712.2526), 2008.\n\n[2] D. Blei and M. Jordan. Variational methods for the Dirichlet process. In Proc. 21st Int\u2019l Conf. on Machine Learning, 2004.\n\n[3] A. Asuncion, M. Welling, P. Smyth, and Y.W. Teh. On smoothing and inference for topic models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.\n\n[4] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Neural Information Processing Systems, 2007.\n\n[5] Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In Advances in Neural Information Processing Systems 22, pages 2134\u20132142, 2009.\n\n[6] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, pages 161\u2013168. NIPS Foundation (http://books.nips.cc), 2008.\n\n[7] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, January 2003.\n\n[8] Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 22, pages 1973\u20131981, 2009.\n\n[9] W. Buntine. Variational extensions to EM and multinomial PCA. In European Conf. on Machine Learning, 2002.\n\n[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. 
Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1):19\u201360, 2010.\n\n[11] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In KDD 2009: Proc. 15th ACM SIGKDD int\u2019l Conf. on Knowledge discovery and data mining, pages 937\u2013946, 2009.\n\n[12] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233, 1999.\n\n[13] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12, 2000.\n\n[14] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1\u201338, 1977.\n\n[15] L. Bottou and N. Murata. Stochastic approximations and efficient learning. The Handbook of Brain Theory and Neural Networks, Second edition. The MIT Press, Cambridge, MA, 2002.\n\n[16] M.A. Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649\u20131681, 2001.\n\n[17] P. Liang and D. Klein. Online EM for unsupervised models. In Proc. Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611\u2013619, 2009.\n\n[18] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400\u2013407, 1951.\n\n[19] L. Bottou. Online learning and stochastic approximations. Cambridge University Press, Cambridge, UK, 1998.\n\n[20] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models, 89:355\u2013368, 1998.\n\n[21] M.A. Sato and S. Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2):407\u2013432, 2000.\n\n[22] T. 
Griffiths and M. Steyvers. Finding scientific topics. Proc. National Academy of Science, 2004.\n\n[23] X. Song, C.Y. Lin, B.L. Tseng, and M.T. Sun. Modeling and predicting personal information dissemination behavior. In KDD 2005: Proc. 11th ACM SIGKDD int\u2019l Conf. on Knowledge discovery and data mining. ACM, 2005.\n\n[24] K.R. Canini, L. Shi, and T.L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, 2009.\n\n[25] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 21 (NIPS), 2009.\n", "award": [], "sourceid": 1291, "authors": [{"given_name": "Matthew", "family_name": "Hoffman", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}