{"title": "Deterministic Single-Pass Algorithm for LDA", "book": "Advances in Neural Information Processing Systems", "page_first": 2074, "page_last": 2082, "abstract": "We develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.", "full_text": "Deterministic Single-Pass Algorithm for LDA\n\nIssei Sato\nUniversity of Tokyo, Japan\nsato@r.dl.itc.u-tokyo.ac.jp\n\nKenichi Kurihara\nGoogle\nkenichi.kurihara@gmail.com\n\nHiroshi Nakagawa\nUniversity of Tokyo, Japan\nn3@dl.itc.u-tokyo.ac.jp\n\nAbstract\n\nWe develop a deterministic single-pass algorithm for latent Dirichlet allocation (LDA) in order to process received documents one at a time and then discard them in an excess text stream. Our algorithm does not need to store old statistics for all data. The proposed algorithm is much faster than a batch algorithm and is comparable to the batch algorithm in terms of perplexity in experiments.\n\n1 Introduction\n\nHuge quantities of text data such as news articles and blog posts arrive in a continuous stream. Online learning has attracted a great deal of attention as a useful method for handling this growing quantity of streaming data because it processes data one at a time, whereas batch algorithms are not feasible in these settings because they need all the data at the same time. 
This paper focuses on online learning for latent Dirichlet allocation (LDA) (Blei et al., 2003), which is a widely used probabilistic model for text data.\n\nFigure 1: Overview of the relationships among inferences.\n\nOnline learning for LDA has already been developed (Banerjee and Basu, 2007; Alsumait et al., 2008; Canini et al., 2009; Yao et al., 2009). Existing studies were based on sampling methods such as the incremental Gibbs sampler and the particle filter. Sampling methods seem to be inappropriate for streaming data because they have to represent a posterior by using many samples, which basically takes much time. Moreover, sampling algorithms often need a resampling step in which a sampling method is applied to old data. Storing old data or old samples adversely affects the good properties of online algorithms. Particle filters also need to run m parallel processes. A parallel algorithm needs more memory than a single-process algorithm, which is not useful for a large quantity of data, especially in the case of a large vocabulary. For example, LDA needs to store the number of words observed in each topic. If the number of topics is T, the vocabulary size is V, and the number of particles is m, the required memory size is O(m × T × V).\nWe propose two deterministic online algorithms: an incremental algorithm and a single-pass algorithm. Our incremental algorithm is an incremental variant of the reverse EM (REM) algorithm (Minka, 2001). The incremental algorithm updates parameters by replacing old sufficient statistics with new ones for each datum. Our single-pass algorithm is based on the incremental algorithm, but it does not need to store old statistics for all data. In our single-pass algorithm, we propose a sequential update method for the Dirichlet parameters. Asuncion et al. (2009); Wallach et al. 
(2009) indicated the importance of estimating the parameters of the Dirichlet distribution, which is the distribution over the topic distributions of documents. Moreover, we can deal with a growing vocabulary size. In real life, the total vocabulary size is unknown, i.e., it increases as more documents are observed.\n\nFigure 1 plots the inference methods along two axes, running time (short to long) and memory usage (small to large): VB-LDA, CVB-LDA, iREM-LDA, and sREM-LDA.\nIn summary, Fig. 1 shows the relationships among inferences. VB-LDA is the variational inference for LDA, which is a batch inference; CVB-LDA is the collapsed variational inference for LDA (Teh et al., 2007); iREM-LDA is our incremental algorithm; and sREM-LDA is our single-pass algorithm for LDA.\n\nSection 2 briefly explains inference algorithms for LDA. Section 3 describes the proposed algorithm for online learning. Section 4 presents the experimental results.\n\n2 Overview of Latent Dirichlet Allocation\n\nThis section overviews LDA, where documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words. First, we define the notation, and then we describe the formulation of LDA. T is the number of topics. M is the number of documents. V is the vocabulary size. N_j is the number of words in document j. w_{j,i} denotes the i-th word in document j. z_{j,i} denotes the latent topic of word w_{j,i}. Multi(·) is a multinomial distribution. Dir(·) is a Dirichlet distribution. θ_j denotes a T-dimensional probability vector that is the parameter of the multinomial distribution and represents the topic distribution of document j. β_t is a V-dimensional probability vector, the multinomial parameter, where β_{t,v} specifies the probability of generating word v given topic t. α is the T-dimensional parameter vector of the Dirichlet distribution over θ_j (j = 1, ..., M).\nLDA assumes the following generative process. 
For each of the T topics t, draw β_t ∼ Dir(β_t|λ), where Dir(β_t|λ) ∝ ∏_v β_{t,v}^{λ−1}. For each of the M documents j, draw θ_j ∼ Dir(θ_j|α), where Dir(θ_j|α) ∝ ∏_t θ_{j,t}^{α_t−1}. For each of the N_j words w_{j,i} in document j, draw topic z_{j,i} ∼ Multi(z|θ_j) and draw word w_{j,i} ∼ p(w|z_{j,i}, β), where p(w = v|z = t, β) = β_{t,v}.\nThat is to say, the complete-data likelihood of a document w_j is given by\n\np(w_j, z_j, θ_j|α, β) = p(θ_j|α) ∏_{i=1}^{N_j} p(w_{j,i}|z_{j,i}, β) p(z_{j,i}|θ_j). (1)\n\n2.1 Variational Bayes Inference for LDA\n\nThe VB inference for LDA (Blei et al., 2003) introduces a factorized variational posterior q(z, θ, β) over z = {z_{j,i}}, θ = {θ_j} and β = {β_t}, given by\n\nq(z, θ, β) = ∏_{j,i} q(z_{j,i}|φ_{j,i}) ∏_j q(θ_j|γ_j) ∏_t q(β_t|μ_t), (2)\n\nwhere φ and γ are variational parameters, φ_{j,i,t} specifies the probability that the topic of word w_{j,i} is topic t, and γ_j and μ_t are the parameters of the Dirichlet distributions over θ_j and β_t, respectively, i.e., q(θ_j|γ_j) ∝ ∏_t θ_{j,t}^{γ_{j,t}−1} and q(β_t|μ_t) ∝ ∏_v β_{t,v}^{μ_{t,v}−1}.\nThe log-likelihood of the documents is lower bounded by introducing q(z, θ, β):\n\nF[q(z, θ, β)] = ∫ Σ_z q(z, θ, β) log [∏_j p(w_j, z_j, θ_j|α, β) ∏_t p(β_t|λ) / q(z, θ, β)] dθ dβ. (3)\n\nThe parameters are updated as\n\nφ_{j,i,t} ∝ exp(Ψ(μ_{t,w_{j,i}}) − Ψ(Σ_v μ_{t,v})) exp(Ψ(γ_{j,t})), γ_{j,t} = α_t + Σ_{i=1}^{N_j} φ_{j,i,t}, μ_{t,v} = λ + Σ_j n_{j,t,v}, (4)\n\nwhere n_{j,t,v} = Σ_i φ_{j,i,t} I(w_{j,i} = v) and I(·) is an indicator function.\nWe can estimate α with the fixed point iteration (Minka, 2000; Asuncion et al., 2009) by introducing the gamma prior G(α_t|a_0, b_0), i.e., 
α_t ∼ G(α_t|a_0, b_0) (t = 1, ..., T), as\n\nα_t^new = [a_0 − 1 + Σ_j {Ψ(α_t^old + n_{j,t}) − Ψ(α_t^old)} α_t^old] / [b_0 + Σ_j (Ψ(N_j + α_0^old) − Ψ(α_0^old))], (5)\n\nwhere α_0 = Σ_t α_t, and a_0 and b_0 are the parameters of the gamma distribution. Algorithm 1 shows the VB inference scheme for LDA.\n\nAlgorithm 1: VB inference for LDA\n1: for iteration it = 1, ..., L do\n2: for j = 1, ..., M do\n3: for i = 1, ..., N_j do\n4: Update φ_{j,i,t} (t = 1, ..., T) by Eq. (4)\n5: end for\n6: Update γ_{j,t} (t = 1, ..., T) by Eq. (4)\n7: end for\n8: Update μ by Eq. (4)\n9: Update α by Eq. (5)\n10: end for\n\nAlgorithm 2: CVB inference for LDA\n1: for iteration it = 1, ..., L do\n2: for j = 1, ..., M do\n3: for i = 1, ..., N_j do\n4: Update φ_{j,i,t} by Eq. (7)\n5: Update n_{j,t}, replacing φ_{j,i,t}^old with φ_{j,i,t}^new\n6: Update n_{t,w_{j,i}}, replacing φ_{j,i,t}^old with φ_{j,i,t}^new\n7: end for\n8: end for\n9: Update α by Eq. (5)\n10: end for\n\n2.2 Collapsed Variational Bayes Inference for LDA\n\nTeh et al. (2007) proposed CVB-LDA, inspired by collapsed Gibbs sampling, and found that the convergence of CVB-LDA is experimentally faster than that of VB-LDA and that CVB-LDA outperforms VB-LDA in terms of perplexity. CVB-LDA introduces only a variational posterior q(z) and marginalizes out θ and β over the priors. 
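To make the fixed-point iteration of Eq. (5) concrete, here is a minimal, dependency-free Python sketch (our own illustration, not the authors' code; the digamma approximation and the toy statistics n_jt and N are assumptions):

```python
import math

def digamma(x):
    # Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and an asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f * (1.0 / 252)))

def update_alpha(alpha, n_jt, N, a0=1.0, b0=1.0):
    """One fixed-point step of Eq. (5). n_jt[j][t]: expected topic counts; N[j]: doc length."""
    alpha0 = sum(alpha)
    den = b0 + sum(digamma(N[j] + alpha0) - digamma(alpha0) for j in range(len(N)))
    new = []
    for t, a_t in enumerate(alpha):
        num = a0 - 1.0 + sum((digamma(a_t + n_jt[j][t]) - digamma(a_t)) * a_t
                             for j in range(len(N)))
        new.append(num / den)
    return new

# toy statistics: 2 documents, 3 topics
n_jt = [[2.0, 1.0, 3.0], [4.0, 0.5, 1.5]]
N = [6.0, 6.0]
alpha = [0.5, 0.5, 0.5]
for _ in range(50):
    alpha = update_alpha(alpha, n_jt, N)
```

With a0 = b0 = 1 the gamma prior is essentially flat, and on toy data of this size the iteration typically settles quickly, with larger topic counts yielding larger α_t.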
The CVB inference optimizes the following lower bound:\n\nF_CVB[q(z)] = Σ_{j=1}^{M} Σ_z q(z) log [p(w_j, z_j|α, λ) / q(z)]. (6)\n\nThe derivation of the update equation for q(z) is slightly complicated and involves approximations to compute intractable summations. Although Teh et al. (2007) made use of a second-order Taylor expansion as an approximation, Asuncion et al. (2009) showed the usefulness of an approximation using only zero-order information. An update using only zero-order information is given by\n\nφ_{j,i,t} ∝ [(λ + n_{t,w_{j,i}}^{−j,i}) / (V λ + Σ_v n_{t,v}^{−j,i})] (α_t + n_{j,t}^{−j,i}), n_{j,t} = Σ_{i=1}^{N_j} φ_{j,i,t}, n_{t,v} = Σ_{j,i} φ_{j,i,t} I(w_{j,i} = v), (7)\n\nwhere “−j,i” denotes subtracting φ_{j,i,t}. Algorithm 2 provides the CVB inference scheme for LDA.\n\n3 Deterministic Online Algorithm for LDA\n\nThe purpose of this study is to process text data such as news articles and blog posts arriving in a continuous stream by using LDA. We propose a learning algorithm for LDA that can be applied to these semi-infinite and time-series text streams. In these situations, we want to process texts one at a time and then discard them. We repeat iterations only for each word within a document. That is, we update parameters from an arriving document and discard the document after doing l iterations. Therefore, we do not need to store statistics about discarded documents. First, we derive an incremental algorithm for LDA, and then we extend the incremental algorithm to a single-pass algorithm.\n\n3.1 Incremental Learning\n\nNeal and Hinton (1998) provided a framework of incremental learning for the EM algorithm. 
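As a toy illustration of the zero-order update of Eq. (7), the following sketch recomputes φ for a single word occurrence (all counts and sizes here are made-up values, not from the paper):

```python
# Toy CVB0 (zero-order) update of Eq. (7) for one word occurrence.
V, T, lam = 5, 2, 0.1
alpha = [0.5, 0.5]
# word-topic counts n_tv[t][v] and document-topic counts n_jt[t],
# already excluding the word being updated (the "-j,i" convention)
n_tv = [[1.0, 0.5, 0.0, 2.0, 0.5], [0.2, 1.8, 1.0, 0.0, 1.0]]
n_jt = [1.5, 0.5]
v_word = 1  # the word w_{j,i} whose phi we recompute

phi = []
for t in range(T):
    word_term = (lam + n_tv[t][v_word]) / (V * lam + sum(n_tv[t]))
    phi.append(word_term * (alpha[t] + n_jt[t]))
s = sum(phi)
phi = [p / s for p in phi]  # normalize over topics
```

In a full implementation the new φ would then be added back into n_tv and n_jt, which is exactly the per-word swap shown in Algorithm 2.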
In general unsupervised learning, we estimate sufficient statistics s_i for each data point i, compute the whole sufficient statistics σ (= Σ_i s_i) from all data, and update parameters by using σ. In incremental learning, for each data point i, we estimate s_i, compute σ(i) from s_i, and update parameters from σ(i). It is easy to extend an existing batch algorithm to incremental learning if the whole sufficient statistics or the parameter updates are constructed by simply summing per-data statistics. The incremental algorithm processes data point i by subtracting the old s_i^old and adding the new s_i^new, i.e., σ(i) = σ − s_i^old + s_i^new. The incremental algorithm needs to store the old statistics {s_i^old} for all data. While batch algorithms update parameters by sweeping through all data, the incremental algorithm updates parameters for each data point one at a time, which results in more parameter updates than in batch algorithms. Therefore, the incremental algorithm sometimes converges faster than batch algorithms.\n\n3.2 Incremental Learning for LDA\n\nOur motivation for devising the incremental algorithm for LDA was to compare CVB-LDA and VB-LDA. Statistics {n_{t,v}} and {n_{j,t}} are updated after each word is updated in CVB-LDA. This update schedule is similar to that of the incremental algorithm. This incremental property seems to be the reason CVB-LDA converges faster than VB-LDA. Moreover, since CVB-LDA optimizes a tighter lower bound than VB-LDA, CVB-LDA can find better optima. Below, let us consider the incremental algorithm for LDA. 
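The statistics swap σ(i) = σ − s_i^old + s_i^new can be sketched with a deliberately trivial model (a sample mean), where the point is only the bookkeeping, not the model:

```python
# Incremental statistics swap of Section 3.1:
# sigma <- sigma - s_old[i] + s_new[i], then re-estimate the parameter from sigma.
data = [2.0, 4.0, 9.0]
s = data[:]              # per-datum sufficient statistics (here: the datum itself)
sigma = sum(s)           # whole sufficient statistics
theta = sigma / len(data)

def incremental_step(i, s_new):
    """Replace datum i's statistic and refresh the parameter."""
    global sigma, theta
    sigma = sigma - s[i] + s_new   # sigma(i) = sigma - s_i^old + s_i^new
    s[i] = s_new
    theta = sigma / len(data)
    return theta

incremental_step(2, 3.0)
```

Note that the old statistic s[i] must be kept around to be subtracted, which is exactly the storage cost the single-pass algorithm later removes.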
We start by optimizing a lower bound different from that of VB-LDA by using the reverse EM (REM) algorithm (Minka, 2001) as follows:\n\np(w_j|α, β) = ∫ ∏_{i=1}^{N_j} ∏_v (Σ_{t=1}^{T} θ_{j,t} β_{t,v})^{I(w_{j,i}=v)} p(θ_j|α) dθ_j = ∫ ∏_{i=1}^{N_j} (Σ_{t=1}^{T} θ_{j,t} β_{t,w_{j,i}}) p(θ_j|α) dθ_j (8)\n\n≥ ∫ ∏_{i=1}^{N_j} ∏_{t=1}^{T} (θ_{j,t} β_{t,w_{j,i}} / φ_{j,i,t})^{φ_{j,i,t}} p(θ_j|α) dθ_j (9)\n\n= ∏_{i=1}^{N_j} ∏_{t=1}^{T} (β_{t,w_{j,i}} / φ_{j,i,t})^{φ_{j,i,t}} ∫ ∏_{t=1}^{T} θ_{j,t}^{Σ_i φ_{j,i,t}} p(θ_j|α) dθ_j. (10)\n\nEquation (9) is derived from Jensen's inequality as follows: log Σ_x f(x) ≥ Σ_x q(x) log (f(x)/q(x)) = log ∏_x (f(x)/q(x))^{q(x)}, where Σ_x q(x) = 1, and so Σ_x f(x) ≥ ∏_x (f(x)/q(x))^{q(x)}.\nTherefore, the lower bound for the log-likelihood is given by\n\n^F[q(z)] = Σ_j {Σ_{i,t} φ_{j,i,t} log (β_{t,w_{j,i}} / φ_{j,i,t}) + log [Γ(Σ_t α_t) / Γ(N_j + Σ_t α_t)] + Σ_t log [Γ(α_t + Σ_i φ_{j,i,t}) / Γ(α_t)]}. (11)\n\nThe maximum of ^F[q(z)] with respect to q(z_{j,i} = t) = φ_{j,i,t} and β is given by\n\nφ_{j,i,t} ∝ β_{t,w_{j,i}} exp{Ψ(α_t + Σ_i φ_{j,i,t})}, β_{t,v} ∝ λ + Σ_j n_{j,t,v}. (12)\n\nThe updates of α are the same as in Eq. (5). Note that we use the maximum a posteriori estimation for β; however, we do not use λ − 1 in order to avoid λ − 1 + Σ_j n_{j,t,v} taking a negative value.\nThe lower bound ^F[q(z)] introduces only q(z), like CVB-LDA. 
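A minimal sketch of the per-document fixed-point iteration of Eq. (12) follows (our own code, not the authors'; the synchronous per-sweep update and the toy β are assumptions):

```python
import math

def digamma(x):
    # Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and an asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f * (1.0 / 252)))

def irem_document_update(words, beta, alpha, n_iters=10):
    """Fixed-point sweeps of Eq. (12) for one document.
    words: word ids; beta[t][v]: topic-word probabilities; alpha[t]: Dirichlet parameters."""
    T = len(alpha)
    phi = [[1.0 / T] * T for _ in words]
    for _ in range(n_iters):
        # alpha_t + sum_i phi_{j,i,t}, recomputed once per sweep
        mass = [alpha[t] + sum(p[t] for p in phi) for t in range(T)]
        for i, v in enumerate(words):
            raw = [beta[t][v] * math.exp(digamma(mass[t])) for t in range(T)]
            s = sum(raw)
            phi[i] = [r / s for r in raw]
    return phi

# topic 0 prefers word 0; topic 1 prefers word 2 (made-up beta)
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
phi = irem_document_update([0, 0, 2], beta, alpha=[0.5, 0.5])
```

The resulting φ would then be folded into n_{j,t,v} and hence into the β update, as described next.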
Equation (12) incrementally updates the topic distribution of a document for each word, as in CVB-LDA, because we do not need γ_j in Eq. (12) owing to the marginalization of θ_j. Equation (12) is a fixed point update, whereas CVB-LDA can be interpreted as a coordinate ascent algorithm. α and β are updated from the entire set of documents. That is, when we compare this algorithm with VB-LDA, it looks like a hybrid variant of batch updates for α and β and incremental updates for γ_j.\nHere, we consider an incremental update for β analogous to CVB-LDA, in which β is updated for each word. Note that in the LDA setup, each independent identically distributed data point is a document, not a word. Therefore, we incrementally estimate β for each document by swapping the statistics n_{j,t,v} = Σ_{i=1}^{N_j} φ_{j,i,t} I(w_{j,i} = v), which is the number of times word v is generated from topic t in document j. Algorithm 3 shows our incremental algorithm for LDA. This algorithm incrementally optimizes the lower bound in Eq. (11).\n\nAlgorithm 3: Incremental algorithm for LDA\n1: for iteration it = 1, ..., L do\n2: for j = 1, ..., M do\n3: for i = 1, ..., N_j do\n4: Update φ_{j,i,t} by Eq. (12)\n5: end for\n6: Replace n_{j,t,v}^old with n_{j,t,v}^new for v ∈ {w_{j,i}}_{i=1}^{N_j} in β of Eq. (12)\n7: end for\n8: Update α by Eq. (5)\n9: end for\n\n3.3 Single-Pass Algorithm for LDA\n\nAlgorithm 4: Single-pass algorithm for LDA\n1: for j = 1, ..., M do\n2: for iteration it = 1, ..., l do\n3: for i = 1, ..., N_j do\n4: Update φ_{j,i,t} by Eq. 
(13)\n5: end for\n6: Update β^(j) by Eq. (13)\n7: Update α^(j) by Eq. (17)\n8: end for\n9: Update λ^(j) by Eq. (14)\n10: Update ~a^(j) and ~b^(j) by Eq. (17)\n11: end for\n\nOur single-pass algorithm for LDA was inspired by the Bayesian formulation, which internally includes a sequential update. The posterior distribution with the contribution from data point x_N separated out satisfies p(θ|{x_i}_{i=1}^{N}) ∝ p(x_N|θ) p(θ|{x_i}_{i=1}^{N−1}), where θ denotes a parameter. This indicates that we can use the posterior given the observed data as a prior for the next datum. We use parameters learned from observed data as prior parameters for the next data. For example, β_{t,v} in Eq. (12) can be represented as β_{t,v} ∝ {λ + Σ_{j=1}^{M−1} n_{j,t,v}} + n_{M,t,v}. Here, we can interpret {λ + Σ_{j=1}^{M−1} n_{j,t,v}} as the prior parameter λ_{t,v}^{(M−1)} for the M-th document.\nOur single-pass algorithm sequentially sets a prior for each arrived document. By using this sequential setting of prior parameters, we present a single-pass algorithm for LDA, as shown in Algorithm 4. First, we update parameters from the j-th arrived document, given the prior parameters {λ_{t,v}^{(j−1)}}, for l iterations:\n\nφ_{j,i,t} ∝ β_{t,w_{j,i}}^{(j)} exp{Ψ(α_t^{(j)} + Σ_i φ_{j,i,t})}, β_{t,v}^{(j)} ∝ λ_{t,v}^{(j−1)} + Σ_i φ_{j,i,t} I(w_{j,i} = v), (13)\n\nwhere λ_{t,v}^{(0)} = λ and α_t^{(j)} is explained below. 
Then, we set prior parameters by using statistics from the document as priors for the next document as follows, and finally discard the document:\n\nλ_{t,v}^{(j)} = λ_{t,v}^{(j−1)} + Σ_i φ_{j,i,t} I(w_{j,i} = v). (14)\n\nSince the updates are repeated within a document, we need to store the statistics {φ_{j,i,t}} for each word in the current document, but not for all words in all documents.\nIn the CVB and iREM algorithms, the Dirichlet parameter α uses batch updates, i.e., α is updated by using the entire set of documents once in one iteration. We need an online-update algorithm for α to process a streaming text. However, unlike parameter β_{t,v}, the update of α in Eq. (5) is not constructed by simply summarizing sufficient statistics of the data and a prior. Therefore, we derive a single-pass update for the Dirichlet parameter α using the following interpretation.\nWe consider Eq. (5) to be the expectation of α_t over the posterior G(α_t|~a_t, ~b) given documents D and the prior G(α_t|a_0, b_0), i.e., α_t^new = E[α_t]_{G(α|~a_t, ~b)} = (~a_t − 1) / ~b, (15)\n\nwhere ~a_t = a_0 + Σ_{j=1}^{M} a_{j,t}, ~b = b_0 + Σ_{j=1}^{M} b_j, a_{j,t} = {Ψ(α_t^old + n_{j,t}) − Ψ(α_t^old)} α_t^old, b_j = Ψ(N_j + α_0^old) − Ψ(α_0^old). (16)\n\nWe regard a_{j,t} and b_j as statistics for each document, which indicates that the parameters we actually update are ~a_t and ~b in Eq. (5). These updates are simple summarizations of a_{j,t} and b_j and the prior parameters a_0 and b_0. 
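The per-document statistics of Eq. (16) and their summarization in Eq. (15) can be sketched as follows (toy counts; the digamma helper and all numbers are our own assumptions, not the paper's code):

```python
import math

def digamma(x):
    # Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and an asymptotic series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f * (1.0 / 252)))

def doc_alpha_stats(alpha, n_jt, N_j):
    """Per-document statistics of Eq. (16): a_{j,t} and b_j."""
    alpha0 = sum(alpha)
    a_j = [(digamma(alpha[t] + n_jt[t]) - digamma(alpha[t])) * alpha[t]
           for t in range(len(alpha))]
    b_j = digamma(N_j + alpha0) - digamma(alpha0)
    return a_j, b_j

# accumulate the statistics over toy documents as in Eq. (15)
a0, b0 = 1.0, 1.0
alpha = [0.5, 0.5, 0.5]
docs = [([2.0, 1.0, 3.0], 6.0), ([4.0, 0.5, 1.5], 6.0)]
a_tilde = [a0] * 3
b_tilde = b0
for n_jt, N_j in docs:
    a_j, b_j = doc_alpha_stats(alpha, n_jt, N_j)
    a_tilde = [x + y for x, y in zip(a_tilde, a_j)]
    b_tilde += b_j
alpha_new = [(x - 1.0) / b_tilde for x in a_tilde]
```

Because ~a_t and ~b are plain sums of per-document terms, they can be accumulated one document at a time, which is what the sequential update exploits.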
Therefore, we have an update for α_t^{(j)} after observing document j, given by\n\nα_t^{(j)} = E[α_t]_{G(α|~a_t^{(j)}, ~b^{(j)})} = (~a_t^{(j)} − 1) / ~b^{(j)}, ~a_t^{(j)} = ~a_t^{(j−1)} + a_{j,t}, ~b^{(j)} = ~b^{(j−1)} + b_j, (17)\n\na_{j,t} = {Ψ(α_t^{(j−1)} + n_{j,t}) − Ψ(α_t^{(j−1)})} α_t^{(j−1)}, b_j = Ψ(N_j + α_0^{(j−1)}) − Ψ(α_0^{(j−1)}), (18)\n\nwhere ~a_t^{(0)} = a_0 and ~b^{(0)} = b_0. ~a_t^{(j−1)} and ~b^{(j−1)} are used as prior parameters for the next, i.e., the j-th, document.\n\n3.4 Analysis\n\nThis section analyzes the proposed updates for parameters α and β from the previous section.\nWe eventually update parameters α^{(j)} and β^{(j)} given document j as\n\nα_t^{(j)} = [a_0 − 1 + Σ_{d=1}^{j−1} a_{d,t} + a_{j,t}] / [b_0 + Σ_{d=1}^{j−1} b_d + b_j] = α_t^{(j−1)} (1 − η_j^α) + η_j^α (a_{j,t} / b_j), η_j^α = b_j / (b_0 + Σ_{d=1}^{j} b_d), (19)\n\nβ_{t,v}^{(j)} = [λ + Σ_{d=1}^{j−1} n_{d,t,v} + n_{j,t,v}] / [V_j λ + Σ_{d=1}^{j−1} n_{d,t,·} + n_{j,t,·}] = β_{t,v}^{(j−1)} (1 − η_j^β) + η_j^β (n_{j,t,v} / n_{j,t,·}), η_j^β = [(V_j − V_{j−1}) λ + n_{j,t,·}] / (V_j λ + Σ_{d=1}^{j} n_{d,t,·}), (20)\n\nwhere n_{t,·} = Σ_v n_{t,v} and V_j is the vocabulary size of the documents observed so far (d = 1, ..., j). Our single-pass algorithm sequentially sets a prior for each arrived document, and so we can select a prior (a dimension of the Dirichlet distribution) corresponding to the observed vocabulary. In fact, this property is useful for our problem because the vocabulary size grows in a text stream. These updates indicate that η_j^α and η_j^β interpolate the parameters estimated from old and new data. 
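The equivalence between the sequential interpolated update and the direct ratio in Eq. (19) can be checked numerically (the per-document values a_d and b_d below are arbitrary positive test numbers, not real statistics):

```python
# Numeric check of Eq. (19): updating alpha by
# alpha_j = alpha_{j-1} * (1 - eta_j) + eta_j * a_j / b_j, eta_j = b_j / (b0 + sum_d b_d),
# reproduces the direct ratio (a0 - 1 + sum_d a_d) / (b0 + sum_d b_d).
a0, b0 = 2.0, 1.0
a = [0.8, 1.3, 0.6, 2.1]   # a_{d,t} for one fixed topic t
b = [1.1, 0.9, 1.4, 1.0]   # b_d

# direct batch-style ratio after all documents
direct = (a0 - 1.0 + sum(a)) / (b0 + sum(b))

# sequential interpolation, one document at a time
alpha = (a0 - 1.0) / b0          # alpha^(0)
B = b0
for a_j, b_j in zip(a, b):
    B += b_j
    eta = b_j / B
    alpha = alpha * (1.0 - eta) + eta * (a_j / b_j)
```

The two routes agree exactly (up to floating-point rounding), which is just the algebraic identity behind Eq. (19) restated as code.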
These updates look like a stepwise algorithm (Robbins and Monro, 1951; Sato and Ishii, 2000), although a stepwise algorithm interpolates sufficient statistics whereas our updates interpolate parameters. In our updates, how we set the stepsize for parameter updates is equivalent to how we set the hyperparameters for the priors. Therefore, we do not need to introduce a new stepsize parameter.\nIn our update of β, the appearance rate of word v in topic t in document j, n_{j,t,v} / n_{j,t,·}, is added to the old parameter β_{t,v}^{(j−1)} with weight η_j^β, which gradually decreases as more documents are observed. The same relation holds for α. Therefore, the influence of new data decreases as the number of observed documents increases, as shown in Theorem 1. Moreover, Theorem 1 plays an important role in analyzing the convergence of the parameter updates by using the super-martingale convergence theorem (Bertsekas and Tsitsiklis, 1996; Brochu et al., 2004). This convergence analysis is our future work.\nTheorem 1. If there exist ε and ν satisfying 0 < ε < S_j < ν for any j, then\n\nη_j = S_j / (τ + Σ_{d=1}^{j} S_d) (21)\n\nsatisfies\n\nlim_{j→∞} η_j = 0, Σ_{j=1}^{∞} η_j = ∞, Σ_{j=1}^{∞} η_j^2 < ∞. (22)\n\nNote that η_j^α and η_j^β can both be written in the form of η_j in Eq. (21). The proof is given in the supporting material.\n\n4 Experiments\n\nWe carried out experiments on document modeling in terms of perplexity. We compared the inferences for LDA on two text data sets. The first was “Associated Press (AP)”, where the number of documents was M = 10,000 and the vocabulary size was V = 67,291. The second was “The Wall Street Journal (WSJ)”, where M = 10,000 and V = 56,738. 
The ordering of documents is time-series. The comparison metric for document modeling was the “test set perplexity”. We randomly split both data sets into a training set and a test set by assigning 20% of the words in each document to the test set. Stop words were eliminated from the datasets.\n\nWe performed experiments on six inferences: PF, VB, CVB0, CVB, iREM, and sREM. PF denotes the particle filter for LDA used in Canini et al. (2009). We set α_t to 50/T in PF. The number of particles, denoted by P, is 64. The number of words for resampling, denoted by R, is 20. The effective sample size (ESS) threshold, which controls the number of resamplings, is set at 10. CVB0 and CVB are collapsed variational inference for LDA using zero-order and second-order information, respectively. iREM represents the incremental reverse EM algorithm in Algorithm 3. CVB0 and CVB estimate the Dirichlet parameter α over the topic distributions from the whole data set, i.e., in a batch framework. We estimated α in iREM from the whole data set, as in CVB, to clarify the properties of iREM compared with CVB. L denotes the number of iterations over the whole set of documents in Algorithms 1 and 2. sREM indicates the single-pass variant of iREM in Algorithm 4. l denotes the number of iterations within a document in Algorithm 4. sREM does not iterate over the whole set of documents.\n\nFigure 2 shows the results of the experiments on test set perplexity, where lower values indicate better performance. We ran the experiments five times with different random initializations and show the averages.¹ PF and sREM calculate the test set perplexity after sweeping through the whole training set.\n\nVB converges more slowly than CVB and iREM. Moreover, iREM outperforms CVB in convergence rate. Although CVB0 outperforms the other algorithms when the number of topics is small, the convergence rate of CVB0 depends on the number of topics. 
sREM does not outperform iREM in terms of perplexity; however, the performance of sREM is close to that of iREM. As a result, we recommend sREM for a large number of documents or for document streams. sREM does not need to store old statistics for all documents, unlike the other algorithms. In addition, the convergence of sREM depends on the length of a document rather than on the number of documents. Since we process each document individually, we can control the number of iterations according to the length of each arrived document. Finally, we discuss the running time. The running time of sREM is O(L/l) times shorter than that of VB, CVB0, CVB, and iREM. The averaged running times of PF (T = 300, P = 64, R = 20) are 28.2 hours on AP and 31.2 hours on WSJ. Those of sREM (T = 300, l = 5) are 1.2 hours on AP and 1.3 hours on WSJ.\n\n5 Conclusions\n\nWe developed a deterministic online-learning algorithm for latent Dirichlet allocation (LDA). The proposed algorithm can be applied to excess text data in a continuous stream because it processes received documents one at a time and then discards them. The proposed algorithm was much faster than a batch algorithm and was comparable to the batch algorithm in terms of perplexity in experiments.\n\n¹We exclude the error bars with standard deviations because they are so small that they are hidden by the plot markers.\n\nFigure 2: Results of experiments. The left column shows the results on the AP corpus; the right column shows the results on the WSJ corpus. (a) and (b) compare test set perplexity with respect to the number of topics. (c), (d), (e), and (f) compare test set perplexity with respect to the number of iterations for T = 100 and T = 300, respectively. (g) and (h) show the relationship between test set perplexity and the number of iterations within a document, i.e., l.\n\nReferences\n\nLoulwah Alsumait, Daniel Barbara, and Carlotta Domeniconi. 
On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In IEEE International Conference on Data Mining, pages 3–12, 2008. ISSN 1550-4786.\n\nA. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2009.\n\nArindam Banerjee and Sugato Basu. Topic models over text streams: A study of batch and online unsupervised learning. In SIAM International Conference on Data Mining, 2007.\n\nD. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n\nD. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.\n\nEric Brochu, Nando de Freitas, and Kejie Bao. 
Owed to a martingale: A fast Bayesian on-line EM algorithm for multinomial models, 2004.\n\nKevin R. Canini, Lei Shi, and Thomas L. Griffiths. Online inference of topics with latent Dirichlet allocation. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2009.\n\nH. Robbins and S. Monro. A stochastic approximation method. In Annals of Mathematical Statistics, pages 400–407, 1951.\n\nThomas P. Minka. Estimating a Dirichlet distribution. Technical report, Microsoft, 2000. URL http://research.microsoft.com/~minka/papers/dirichlet/minka-dirichlet.pdf.\n\nThomas P. Minka. Using lower bounds to approximate integrals. Technical report, Microsoft, 2001. URL http://research.microsoft.com/en-us/um/people/minka/papers/rem.html.\n\nR. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.2557.\n\nMasa A. Sato and Shin Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2):407–432, 2000. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3704.\n\nYee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 19, 2007.\n\nHanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1973–1981. 2009.\n\nLimin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on streaming document collections. 
In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 937–946, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.\n", "award": [], "sourceid": 800, "authors": [{"given_name": "Issei", "family_name": "Sato", "institution": null}, {"given_name": "Kenichi", "family_name": "Kurihara", "institution": null}, {"given_name": "Hiroshi", "family_name": "Nakagawa", "institution": null}]}