{"title": "Beta-Negative Binomial Process and Exchangeable \ufffcRandom Partitions for Mixed-Membership Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 3455, "page_last": 3463, "abstract": "The beta-negative binomial process (BNBP), an integer-valued stochastic process, is employed to partition a count vector into a latent random count matrix. As the marginal probability distribution of the BNBP that governs the exchangeable random partitions of grouped data has not yet been developed, current inference for the BNBP has to truncate the number of atoms of the beta process. This paper introduces an exchangeable partition probability function to explicitly describe how the BNBP clusters the data points of each group into a random number of exchangeable partitions, which are shared across all the groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a novel nonparametric Bayesian topic model that is distinct from existing ones, with simple implementation, fast convergence, good mixing, and state-of-the-art predictive performance.", "full_text": "Beta-Negative Binomial Process and Exchangeable\nRandom Partitions for Mixed-Membership Modeling\n\nIROM Department, McCombs School of Business\n\nThe University of Texas at Austin, Austin, TX 78712, USA\n\nMingyuan Zhou\n\nmingyuan.zhou@mccombs.utexas.edu\n\nAbstract\n\nThe beta-negative binomial process (BNBP), an integer-valued stochastic process,\nis employed to partition a count vector into a latent random count matrix. As the\nmarginal probability distribution of the BNBP that governs the exchangeable ran-\ndom partitions of grouped data has not yet been developed, current inference for\nthe BNBP has to truncate the number of atoms of the beta process. 
This paper introduces an exchangeable partition probability function to explicitly describe how the BNBP clusters the data points of each group into a random number of exchangeable partitions, which are shared across all the groups. A fully collapsed Gibbs sampler is developed for the BNBP, leading to a novel nonparametric Bayesian topic model that is distinct from existing ones, with simple implementation, fast convergence, good mixing, and state-of-the-art predictive performance.

1 Introduction

For mixture modeling, there is a wide selection of nonparametric Bayesian priors, such as the Dirichlet process [1] and the more general family of normalized random measures with independent increments (NRMIs) [2, 3]. Although a draw from an NRMI usually consists of countably infinite atoms that are impossible to instantiate in practice, one may transform the infinite-dimensional problem into a finite one by marginalizing out the NRMI. For instance, it is well known that the marginalization of the Dirichlet process random probability measure under multinomial sampling leads to the Chinese restaurant process [4, 5]. The general structure of the Chinese restaurant process is broadened by [5] to the so-called exchangeable partition probability function (EPPF) model, leading to fully collapsed inference and providing a unified view of the characteristics of various nonparametric Bayesian mixture-modeling priors. Despite significant progress on EPPF models in the past decade, their use in mixture modeling (clustering) is usually limited to a single set of data points.

Moving beyond mixture modeling of a single set, there has been significant recent interest in mixed-membership modeling, i.e., mixture modeling of grouped data $x_1, \ldots, x_J$, where each group $x_j = \{x_{ji}\}_{i=1,m_j}$ consists of $m_j$ data points that are exchangeable within the group.
To cluster the $m_j$ data points in each group into a random, potentially unbounded number of partitions, which are exchangeable and shared across all the groups, is a much more challenging statistical problem. While the hierarchical Dirichlet process (HDP) [6] is a popular choice, it is shown in [7] that a wide variety of integer-valued stochastic processes, including the gamma-Poisson process [8, 9], the beta-negative binomial process (BNBP) [10, 11], and the gamma-negative binomial process (GNBP), can all be applied to mixed-membership modeling. However, none of these stochastic processes are able to describe their marginal distributions that govern the exchangeable random partitions of grouped data. Without these marginal distributions, the HDP exploits an alternative representation known as the Chinese restaurant franchise [6] to derive collapsed inference, while fully collapsed inference is available for neither the BNBP nor the GNBP.

The EPPF provides a unified treatment of mixture modeling, but there is hardly a unified treatment of mixed-membership modeling. As a first step to fill that gap, this paper thoroughly investigates the law of the BNBP that governs its exchangeable random partitions of grouped data. As directly deriving the BNBP's EPPF for mixed-membership modeling is difficult, we first randomize the group sizes $\{m_j\}_j$ and derive the joint distribution of $\{m_j\}_j$ and their random partitions on a shared list of exchangeable clusters; we then derive the marginal distribution of the group-size count vector $m = (m_1, \ldots, m_J)^T$, and use Bayes' rule to further arrive at the BNBP's EPPF that describes the prior distribution of a latent column-exchangeable random count matrix, whose $j$th row sums to $m_j$. The general method to arrive at an EPPF for mixed-membership modeling using an integer-valued stochastic process is an important contribution.
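The route just described, randomize the group sizes, derive their joint law with the partitions, then condition, is at its core one application of Bayes' rule. Schematically (a restatement of the quantities named above, with the normalizer obtained by marginalizing over partitions):

```latex
% EPPF via Bayes' rule: randomize the group sizes, then condition on them.
% f(z, m | ...) is the joint law of the partitions z and the group-size
% count vector m (the ECPF derived in Section 2); f(m | ...) is the
% marginal of m; the ratio is the group-size dependent EPPF.
f(z \mid m, r, \gamma_0, c)
  = \frac{f(z, m \mid r, \gamma_0, c)}{f(m \mid r, \gamma_0, c)},
\qquad
f(m \mid r, \gamma_0, c) = \sum_{z} f(z, m \mid r, \gamma_0, c).
```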
We make several additional contributions: 1) We derive a prediction rule for the BNBP to simulate exchangeable random partitions of grouped data governed by its EPPF. 2) We construct a BNBP topic model, derive a fully collapsed Gibbs sampler that analytically marginalizes out not only the topics and topic weights but also the infinite-dimensional beta process, and provide closed-form update equations for the model parameters. 3) The BNBP topic model's sampling algorithm, which is straightforward to implement, converges fast, mixes well, and produces state-of-the-art predictive performance with a compact representation of the corpus.

1.1 Exchangeable Partition Probability Function

Let $\Pi_m = \{A_1, \ldots, A_l\}$ denote a random partition of the set $[m] = \{1, 2, \ldots, m\}$, where there are $l$ partitions and each element $i \in [m]$ belongs to one and only one set $A_k$ from $\Pi_m$. If $P(\Pi_m = \{A_1, \ldots, A_l\} \mid m)$ depends only on the number and sizes of the $A_k$'s, regardless of their order, then it is called an exchangeable partition probability function (EPPF) of $\Pi_m$. An EPPF of $\Pi_m$ is an EPPF of $\Pi := (\Pi_1, \Pi_2, \ldots)$ if $P(\Pi_m \mid n) = P(\Pi_m \mid m)$ does not depend on $n$, where $P(\Pi_m \mid n)$ denotes the marginal partition probability for $[m]$ when it is known that the sample size is $n$. Such a constraint can also be expressed as an addition rule for the EPPF [5]. In this paper, the addition rule is not required and the proposed EPPF is allowed to be dependent on the group sizes (or the sample size if the number of groups is one). Detailed discussions about sample size dependent EPPFs can be found in [12]. We generalize the work of [12] to model the partition of a count vector into a latent column-exchangeable random count matrix.
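As a concrete, classical instance of an EPPF satisfying the addition rule, the Chinese restaurant process with concentration parameter $\alpha$ assigns a partition $\{A_1, \ldots, A_l\}$ of $[m]$ probability $\alpha^l \frac{\Gamma(\alpha)}{\Gamma(\alpha+m)} \prod_k (|A_k|-1)!$. The sketch below (purely illustrative; not one of the paper's algorithms) checks numerically that this function, which depends only on the block sizes, sums to one over all set partitions of a small set:

```python
from math import gamma


def partitions(elements):
    """Recursively enumerate all set partitions of a list of elements."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for p in partitions(rest):
        # place `first` into each existing block in turn
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        # or open a new block for it
        yield p + [[first]]


def crp_eppf(block_sizes, alpha):
    """Chinese restaurant process EPPF: a function of the number and sizes
    of the blocks only, regardless of their order."""
    m = sum(block_sizes)
    prob = alpha ** len(block_sizes) * gamma(alpha) / gamma(alpha + m)
    for n in block_sizes:
        prob *= gamma(n)  # Gamma(n) = (n - 1)!
    return prob


alpha = 1.5
total = sum(crp_eppf([len(b) for b in p], alpha) for p in partitions([1, 2, 3, 4]))
print(abs(total - 1.0) < 1e-9)  # True: the EPPF sums to one over all partitions of [4]
```

For $m = 2$ the two partitions get probabilities $\frac{1}{\alpha+1}$ (one block) and $\frac{\alpha}{\alpha+1}$ (two blocks), recovering the familiar rich-get-richer seating rule.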
A marginal sampler for $\sigma$-stable Poisson-Kingman mixture models (but not mixed-membership models) is proposed in [13], encompassing a large class of random probability measures and their corresponding EPPFs of $\Pi$. Note that the BNBP is not within that class and both the BNBP's EPPF and prediction rule are dependent on the group sizes.

1.2 Beta Process

The beta process $B \sim \mathrm{BP}(c, B_0)$ is a completely random measure defined on the product space $[0, 1] \times \Omega$, with a concentration parameter $c > 0$ and a finite and continuous base measure $B_0$ over a complete separable metric space $\Omega$ [14, 15]. We define the Lévy measure of the beta process as
$$\nu(dp\, d\omega) = p^{-1}(1 - p)^{c-1}\, dp\, B_0(d\omega). \quad (1)$$
A draw from $B \sim \mathrm{BP}(c, B_0)$ can be represented as a countably infinite sum $B = \sum_{k=1}^{\infty} p_k \delta_{\omega_k}$, $\omega_k \sim g_0$, where $\gamma_0 = B_0(\Omega)$ is the mass parameter and $g_0(d\omega) = B_0(d\omega)/\gamma_0$ is the base distribution. The beta process is unique in that the beta distribution is not infinitely divisible, and its measure on a Borel set $A \subset \Omega$, expressed as $B(A) = \sum_{k:\omega_k \in A} p_k$, could be larger than one and hence is clearly not a beta random variable. In this paper we will work with $Q(A) = -\sum_{k:\omega_k \in A} \ln(1 - p_k)$, defined as a logbeta random variable, to analyze model properties and derive closed-form Gibbs sampling update equations. We provide these details in the Appendix.

2 Exchangeable Cluster/Partition Probability Functions for the BNBP

The integer-valued beta-negative binomial process (BNBP) is defined as
$$X_j \mid B \sim \mathrm{NBP}(r_j, B), \quad B \sim \mathrm{BP}(c, B_0), \quad (2)$$
where for the $j$th group $r_j$ is the negative binomial dispersion parameter and $X_j \mid B \sim \mathrm{NBP}(r_j, B)$ is a negative binomial process such that $X_j(A) = \sum_{k:\omega_k \in A} n_{jk}$, $n_{jk} \sim \mathrm{NB}(r_j, p_k)$, for each Borel set $A \subset \Omega$. The negative binomial distribution $n \sim \mathrm{NB}(r, p)$ has probability mass function (PMF) $f_N(n) = \frac{\Gamma(n+r)}{n!\,\Gamma(r)} p^n (1-p)^r$ for $n \in \mathbb{Z}$, where $\mathbb{Z} = \{0, 1, \ldots\}$. Our definition of the BNBP follows
As the negative binomial process is also a gamma-Poisson mixture process, we\ncan augment (2) as a beta-gamma-Poisson process as\n\nXj|\u0398j \u223c PP(\u0398j), \u0398j|rj, B \u223c \u0393P[rj, B/(1 \u2212 B)], B \u223c BP(c, B0),\n\n\u0393P[rj, B/(1\u2212B)] is a gamma process such that \u0398j(A) =(cid:80)\n\nwhere Xj|\u0398j \u223c PP(\u0398j) is a Poisson process such that Xj(A) \u223c Pois[\u0398j(A)], and \u0398j|B \u223c\nk:\u03c9k\u2208A \u03b8jk, \u03b8jk \u223c Gamma[rj, pk/(1\u2212\npk)], for each Borel set A \u2282 \u2126. The mixed-membership-modeling potentials of the BNBP become\nclear under this augmented representation. The Poisson process provides a bridge to link count mod-\neling and mixture modeling [7], since Xj \u223c PP(\u0398j) can be equivalently generated by \ufb01rst drawing\na total random count mj := Xj(\u2126) \u223c Pois[\u0398j(\u2126)] and then assigning this random count to disjoint\nBorel sets of \u2126 using a multinomial distribution.\n\n2.1 Exchangeable Cluster Probability Function\n\nIn mixed-membership modeling, the size of each group is observed rather being random, thus al-\nthough the BNBP\u2019s augmented representation is instructive, it is still unclear how exactly the integer-\nvalued stochastic process leads to a prior distribution on exchangeable random partitions of grouped\ndata. The \ufb01rst step for us to arrive at such a prior distribution is to build a group size dependent\nmodel that treats the number of data points to be clustered (partitioned) in each group as random.\nBelow we will \ufb01rst derive an exchangeable cluster probability function (ECPF) governed by the\nBNBP to describe the joint distribution of the random group sizes and their random partitions over a\nrandom, potentially unbounded number of exchangeable clusters shared across all the groups. Later\nwe will show how to derive the EPPF from the ECPF using Bayes\u2019 rule.\n\nLemma 1. 
Denote \u03b4k(zji) as a unit point mass at zji = k, njk = (cid:80)mj\n(cid:80)\n\ni=1 \u03b4k(zji), and Xj(A) =\nk:\u03c9k\u2208A njk as the number of data points in group j assigned to the atoms within the Borel set\n\nA \u2282 \u2126. The Xj\u2019s generated via the group size dependent mixed-membership model as\n\n\u03b8jk\n\nk=1\n\n\u0398j (\u2126) \u03b4k, mj \u223c Pois[\u0398j(\u2126)],\n\u0398j \u223c \u0393P[rj, B/(1 \u2212 B)], B \u223c BP(c, B0)\nis equivalent in distribution to the Xj\u2019s generated from a BNBP as in (2).\n\nk=1 pk\u03b4\u03c9k and \u0398j =(cid:80)\u221e\n\nProof. With B =(cid:80)\u221e\nzj = (zj1, . . . , zjmj ) given \u0398j and mj can be expressed as p(zj|\u0398j, mj) = (cid:81)mj\n(cid:81)\u221e\n((cid:80)\u221e\n\nk=1 \u03b8jk\u03b4\u03c9k, the joint distribution of the cluster indices\n\u03b8jzji(cid:80)\u221e\nk(cid:48)=1 \u03b8jk(cid:48) =\njk , which is not in a fully factorized form. As mj is linked to the total random\n\nk=1\u03b8njk\n\n(3)\n\ni=1\n\n1\n\nmass \u0398j(\u2126) with a Poisson distribution, we have the joint likelihood of zj and mj given \u0398j as\n\nk(cid:48)=1 \u03b8jk(cid:48) )mj\n\nzji \u223c(cid:80)\u221e\n\n(cid:81)\u221e\n\nf (zj, mj|\u0398j) = f (zj|\u0398j, mj)Pois[mj; \u0398j(\u2126)] = 1\n\n(4)\nwhich is fully factorized and hence amenable to marginalization. Since \u03b8jk \u223c Gamma[rj, pk/(1 \u2212\npk)], we can marginalize \u03b8jk out analytically as f (zj, mj|rj, B) = E\u0398j [f (zj, mj|\u0398j)], leading to\n(5)\n\nf (zj, mj|rj, B) = 1\n\n(1 \u2212 pk)rj .\n\njk e\u2212\u03b8jk ,\n\n(cid:81)\u221e\n\nk=1 \u03b8njk\n\n\u0393(njk+rj )\n\nmj !\n\npnjk\nk\n\nmj !\n\nk=1\n\n\u0393(rj )\n\nMultiplying the above equation with a multinomial coef\ufb01cient transforms the prior distribution\nfor the categorical random variables zj to the prior distribution for a random count vector as\nf (nj1, . . . , nj\u221e|rj, B) =\nk=1 NB(njk; rj, pk). Thus in the prior,\n\nk=1 njk! 
f (zj, mj|rj, B) = (cid:81)\u221e\nmj !(cid:81)\u221e\n\n3\n\n\fdata points independently at each atom. With Xj :=(cid:80)\u221e\nsuch that Xj(A) =(cid:80)\n\nfor each group, the group size dependent model in (3) draws njk \u223c NB(rj, pk) random number of\nk=1 njk\u03b4\u03c9k, we have Xj|B \u223c NBP(rj, B)\n\nk:\u03c9k\u2208A njk, njk \u223c NB(rj, pk).\n\nThe Lemma below provides a \ufb01nite-dimensional distribution obtained by marginalizing out the\nin\ufb01nite-dimensional beta process from the BNBP. The proof is provided in the Appendix.\nLemma 2 (ECPF). The exchangeable cluster probability function (ECPF) of the BNBP, which de-\nscribes the joint distribution of the random count vector m := (m1, . . . , mJ )T and its exchangeable\nrandom partitions z := (z11, . . . , zJmJ ), can be expressed as\n\nf (z, m|r, \u03b30, c) = \u03b3KJ\n\n0\n\nr\u00b7 :=(cid:80)J\n\nj=1 rj, n\u00b7k :=(cid:80)J\n\n(cid:81)J\n\n(cid:81)KJ\n\ne\u2212\u03b30[\u03c8(c+r\u00b7 )\u2212\u03c8(c)]\n\n(cid:104) \u0393(n\u00b7k)\u0393(c+r\u00b7)\n\n(cid:81)J\nj=1 njk, and mj \u2208 Z is the random size of group j.\n\n\u0393(c+n\u00b7k+r\u00b7)\n\nj=1 mj !\n\nk=1\n\nj=1\n\nwhere KJ is the number of observed points of discontinuity for which n\u00b7k > 0, r := (r1, . . . , rJ )T ,\n\n\u0393(njk+rj )\n\n\u0393(rj )\n\n,\n\n(6)\n\n(cid:105)\n\n2.2 Exchangeable Partition Probability Function and Prediction Rule\n\n(cid:81)J\n\nHaving the ECPF does not directly lead to the EPPF for the BNBP, as an EPPF describes the distri-\nbution of the exchangeable random partitions of the data groups whose sizes are observed rather than\nbeing random. To arrive at the EPPF, \ufb01rst we organize z into a random count matrix NJ \u2208 ZJ\u00d7KJ ,\nwhose jth row represents the partition of the mj data points into the KJ shared exchangeable clus-\nters and whose order of these KJ nonzero columns is chosen uniformly at random from one of the\nKJ ! 
possible permutations, then we obtain a prior distribution on a BNBP random count matrix as
$$f(N_J \mid r, \gamma_0, c) = \frac{1}{K_J!} \prod_{j=1}^{J} \frac{m_j!}{\prod_{k=1}^{K_J} n_{jk}!}\, f(z, m \mid r, \gamma_0, c) = \frac{\gamma_0^{K_J}\, e^{-\gamma_0[\psi(c + r_\cdot) - \psi(c)]}}{K_J!} \prod_{k=1}^{K_J} \left[ \frac{\Gamma(n_{\cdot k})\Gamma(c + r_\cdot)}{\Gamma(c + n_{\cdot k} + r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_j)}{n_{jk}!\,\Gamma(r_j)} \right]. \quad (7)$$
As described in detail in [17], although the matrix prior does not appear to be simple, direct calculation shows that this random count matrix has $K_J \sim \mathrm{Pois}\{\gamma_0 [\psi(c + r_\cdot) - \psi(c)]\}$ independent and identically distributed (i.i.d.) columns that can be generated via
$$n_{:k} \sim \mathrm{DirMult}(n_{\cdot k}, r_1, \ldots, r_J), \quad n_{\cdot k} \sim \mathrm{Digam}(r_\cdot, c), \quad (8)$$
where $n_{:k} := (n_{1k}, \ldots, n_{Jk})^T$ is the count vector for the $k$th column (cluster), the Dirichlet-multinomial (DirMult) distribution [19] has PMF $\mathrm{DirMult}(n_{:k} \mid n_{\cdot k}, r) = \frac{n_{\cdot k}!}{\prod_{j=1}^{J} n_{jk}!} \frac{\Gamma(r_\cdot)}{\Gamma(n_{\cdot k} + r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_j)}{\Gamma(r_j)}$, and the digamma distribution [20] has PMF $\mathrm{Digam}(n \mid r, c) = \frac{1}{\psi(c + r) - \psi(c)} \frac{\Gamma(r + n)\Gamma(c + r)}{n\,\Gamma(c + n + r)\Gamma(r)}$, where $n = 1, 2, \ldots$. Thus in the prior, the BNBP generates a Poisson random number of clusters, the size of each cluster follows a digamma distribution, and each cluster is further partitioned into the $J$ groups using a Dirichlet-multinomial distribution [17].

With both the ECPF and random count matrix prior governed by the BNBP, we are ready to derive both the EPPF and prediction rule, given in the following two Lemmas, with proofs in the Appendix.

Lemma 3 (EPPF). Let $\sum\sum_{\sum_{k=1}^{K} n_{:k} = m}$ denote a summation over all sets of count vectors with $\sum_{k=1}^{K} n_{:k} = m$, where $m_\cdot = \sum_{j=1}^{J} m_j$ and $n_{\cdot k} \geq 1$. The group-size dependent exchangeable partition probability function (EPPF) governed by the BNBP can be expressed as
$$f(z \mid m, r, \gamma_0, c) = \frac{\frac{\gamma_0^{K_J}}{\prod_{j=1}^{J} m_j!} \prod_{k=1}^{K_J} \left[ \frac{\Gamma(n_{\cdot k})\Gamma(c + r_\cdot)}{\Gamma(c + n_{\cdot k} + r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_j)}{\Gamma(r_j)} \right]}{\sum_{K'=1}^{m_\cdot} \frac{\gamma_0^{K'}}{K'!} \sum\sum_{\sum_{k'=1}^{K'} n_{:k'} = m} \prod_{k'=1}^{K'} \left[ \frac{\Gamma(n_{\cdot k'})\Gamma(c + r_\cdot)}{\Gamma(c + n_{\cdot k'} + r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk'} + r_j)}{n_{jk'}!\,\Gamma(r_j)} \right]}, \quad (9)$$
which is a function of the cluster sizes $\{n_{jk}\}_{k=1,K_J}$, regardless of the orders of the indices $k$'s.

Although the EPPF is fairly complicated, one may derive a simple prediction rule, as shown below, to simulate exchangeable random partitions of grouped data governed by this EPPF.

Lemma 4 (Prediction Rule). With $y^{-ji}$ representing the variable $y$ excluding the contribution of $x_{ji}$, the prediction rule of the BNBP group size dependent mixed-membership model in (3) is
$$P(z_{ji} = k \mid z^{-ji}, m, r, \gamma_0, c) \propto \begin{cases} \dfrac{n_{\cdot k}^{-ji}}{c + n_{\cdot k}^{-ji} + r_\cdot}\,(n_{jk}^{-ji} + r_j), & \text{for } k = 1, \ldots, K_J^{-ji}; \\[2mm] \dfrac{\gamma_0}{c + r_\cdot}\, r_j, & \text{if } k = K_J^{-ji} + 1. \end{cases} \quad (10)$$

Figure 1: Random draws from the EPPF that governs the BNBP's exchangeable random partitions of 10 groups (rows), each of which has 50 data points. The parameters of the EPPF are set as $c = 2$, $\gamma_0 = \frac{12}{\psi(c + \sum_j r_j) - \psi(c)}$, and (a) $r_j = 1$, (b) $r_j = 10$, or (c) $r_j = 100$ for all the 10 groups. The $j$th row of each matrix, which sums to 50, represents the partition of the $m_j = 50$ data points of the $j$th group over a random number of exchangeable clusters, and the $k$th column of each matrix represents the $k$th nonempty cluster in order of appearance in Gibbs sampling (the empty clusters are deleted).

2.3 Simulating Exchangeable Random Partitions of Grouped Data

While the EPPF (9) is not simple, the prediction rule (10) clearly shows that the probability of selecting $k$ is proportional to the product of two terms: one is related to the $k$th cluster's overall popularity across groups, and the other is only related to the $k$th cluster's popularity at that group and that group's dispersion parameter; and the probability of creating a new cluster is related to $\gamma_0$, $c$, $r_\cdot$ and $r_j$. The BNBP's exchangeable random partitions of the group-size count vector $m$, whose prior distribution is governed by (9), can be easily simulated via Gibbs sampling using (10).

Running Gibbs sampling using (10) for 2500 iterations and displaying the last sample, we show in Figure 1 (a)-(c) three distinct exchangeable random partitions of the same group-size count vector, under three different parameter settings. It is clear that the dispersion parameters $\{r_j\}_j$ play a critical role in controlling how overdispersed the counts are: the smaller the $\{r_j\}_j$ are, the more overdispersed the counts in each row and those in each column are. This is unsurprising as in the BNBP's prior, we have $n_{jk} \sim \mathrm{NB}(r_j, p_k)$ and $n_{:k} \sim \mathrm{DirMult}(n_{\cdot k}, r_1, \ldots, r_J)$.
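The simulation just described can be sketched in a few lines. The following is an illustrative NumPy implementation of Gibbs sampling with the prediction rule in (10) (not the paper's Matlab code; the parameter values in the demo call are arbitrary choices for illustration, not the settings used for Figure 1):

```python
import numpy as np


def bnbp_partition_gibbs(m, r, gamma0, c, n_iter=100, seed=0):
    """Simulate exchangeable random partitions of grouped data governed by
    the BNBP's EPPF, by Gibbs sampling each z_ji with the prediction rule (10).
    m: list of group sizes m_j; r: per-group dispersion parameters r_j.
    Returns the column-exchangeable count matrix N with N[j, k] = n_jk."""
    rng = np.random.default_rng(seed)
    J, r_dot = len(m), sum(r)
    z = [np.zeros(mj, dtype=int) for mj in m]         # start with everyone in cluster 0
    counts = np.array([[mj] for mj in m])             # counts[j, k] = n_jk
    for _ in range(n_iter):
        for j in range(J):
            for i in range(m[j]):
                k = z[j][i]
                counts[j, k] -= 1                     # remove x_ji from its cluster
                if counts[:, k].sum() == 0:           # delete an emptied cluster
                    counts = np.delete(counts, k, axis=1)
                    for g in range(J):
                        z[g][z[g] > k] -= 1
                n_dot = counts.sum(axis=0)            # n^{-ji}_{.k}
                # existing clusters: cross-group popularity times local weight
                p = n_dot / (c + n_dot + r_dot) * (counts[j] + r[j])
                # new cluster: gamma0 / (c + r_dot) * r_j
                p = np.append(p, gamma0 / (c + r_dot) * r[j])
                k_new = rng.choice(len(p), p=p / p.sum())
                if k_new == counts.shape[1]:          # a new cluster is born
                    counts = np.hstack([counts, np.zeros((J, 1), dtype=int)])
                z[j][i] = k_new
                counts[j, k_new] += 1
    return counts


N = bnbp_partition_gibbs([50] * 10, [1.0] * 10, 5.0, 2.0, n_iter=50)
print(N.sum(axis=1))  # each row sums to its group size, here 50
```

Every row of the returned matrix sums to its group size by construction, and empty columns are deleted on the fly, mirroring the truncation-free behavior described above.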
Figure 1 suggests that it is important to infer $r_j$ rather than setting them in a heuristic manner or fixing them.

3 Beta-Negative Binomial Process Topic Model

With the BNBP's EPPF derived, it becomes evident that the integer-valued BNBP also governs a prior distribution for exchangeable random partitions of grouped data. To demonstrate the use of the BNBP, we apply it to topic modeling [21] of a document corpus, a special case of mixture modeling of grouped data, where the words of the $j$th document $x_{j1}, \ldots, x_{jm_j}$ constitute a group $x_j$ ($m_j$ words in document $j$), and each word $x_{ji}$ is an exchangeable group member indexed by $v_{ji}$ in a vocabulary with $V$ unique terms. We choose the base distribution as a $V$ dimensional Dirichlet distribution as $g_0(\phi) = \mathrm{Dir}(\phi; \eta, \ldots, \eta)$, and choose a multinomial distribution to link a word to a topic (atom). We express the hierarchical construction of the BNBP topic model as
$$x_{ji} \sim \mathrm{Mult}(\phi_{z_{ji}}), \quad \phi_k \sim \mathrm{Dir}(\eta, \ldots, \eta), \quad z_{ji} \sim \sum_{k=1}^{\infty} \frac{\theta_{jk}}{\Theta_j(\Omega)} \delta_k, \quad m_j \sim \mathrm{Pois}[\Theta_j(\Omega)],$$
$$\Theta_j \sim \Gamma\mathrm{P}\!\left(r_j, \tfrac{B}{1-B}\right), \quad r_j \sim \mathrm{Gamma}(a_0, 1/b_0), \quad B \sim \mathrm{BP}(c, B_0), \quad \gamma_0 \sim \mathrm{Gamma}(e_0, 1/f_0). \quad (11)$$
Let $n_{vjk} := \sum_{i=1}^{m_j} \delta_v(x_{ji})\delta_k(z_{ji})$. Multiplying (4) and the data likelihood $f(x_j \mid z_j, \Phi) = \prod_{k=1}^{\infty} \prod_{v=1}^{V} (\phi_{vk})^{n_{vjk}}$, where $\Phi = (\phi_1, \ldots, \phi_\infty)$, we have $f(x_j, z_j, m_j \mid \Phi, \Theta_j) = \frac{1}{m_j!} \prod_{k=1}^{\infty} \prod_{v=1}^{V} \mathrm{Pois}(n_{vjk}; \phi_{vk}\theta_{jk})$. Thus the BNBP topic model can also be considered as an infinite Poisson factor model [10], where the term-document word count matrix $(m_{vj})_{v=1:V,\, j=1:J}$ is factorized under the Poisson likelihood as $m_{vj} = \sum_{k=1}^{\infty} n_{vjk}$, $n_{vjk} \sim \mathrm{Pois}(\phi_{vk}\theta_{jk})$, whose likelihood $f(\{n_{vjk}\}_{v,k} \mid \Phi, \Theta_j)$ is different from $f(x_j, z_j, m_j \mid \Phi, \Theta_j)$ up to a multinomial coefficient.

The full conditional likelihood $f(x, z, m \mid \Phi, \Theta) = \prod_{j=1}^{J} f(x_j, z_j, m_j \mid \Phi, \Theta_j)$ can be further expressed as $f(x, z, m \mid \Phi, \Theta) = \left\{\prod_{k=1}^{\infty} \prod_{v=1}^{V} \phi_{vk}^{n_{v\cdot k}}\right\} \cdot \left\{\frac{\prod_{k=1}^{\infty} \prod_{j=1}^{J} \theta_{jk}^{n_{jk}} e^{-\theta_{jk}}}{\prod_{j=1}^{J} m_j!}\right\}$, where the marginalization of $\Phi$ from the first right-hand-side term is the product of Dirichlet-multinomial distributions and the second right-hand-side term leads to the ECPF. Thus we have a fully marginalized likelihood as $f(x, z, m \mid \gamma_0, c, r) = f(z, m \mid \gamma_0, c, r) \prod_{k=1}^{K_J} \left[\frac{\Gamma(V\eta)}{\Gamma(n_{\cdot k} + V\eta)} \prod_{v=1}^{V} \frac{\Gamma(n_{v\cdot k} + \eta)}{\Gamma(\eta)}\right]$. Directly applying Bayes' rule to this fully marginalized likelihood, we construct a nonparametric Bayesian fully collapsed Gibbs sampler for the BNBP topic model as
$$P(z_{ji} = k \mid x, z^{-ji}, \gamma_0, m, c, r) \propto \begin{cases} \dfrac{\eta + n_{v_{ji}\cdot k}^{-ji}}{V\eta + n_{\cdot k}^{-ji}} \cdot \dfrac{n_{\cdot k}^{-ji}}{c + n_{\cdot k}^{-ji} + r_\cdot} \cdot (n_{jk}^{-ji} + r_j), & \text{for } k = 1, \ldots, K_J^{-ji}; \\[2mm] \dfrac{1}{V} \cdot \dfrac{\gamma_0}{c + r_\cdot} \cdot r_j, & \text{if } k = K_J^{-ji} + 1. \end{cases} \quad (12)$$
In the Appendix we include all the other closed-form Gibbs sampling update equations.

3.1 Comparison with Other Collapsed Gibbs Samplers

One may compare the collapsed Gibbs sampler of the BNBP topic model with that of latent Dirichlet allocation (LDA) [22], which, in our notation, can be expressed as
$$P(z_{ji} = k \mid x, z^{-ji}, m, \alpha, K) \propto \frac{\eta + n_{v_{ji}\cdot k}^{-ji}}{V\eta + n_{\cdot k}^{-ji}} \cdot \left(n_{jk}^{-ji} + \frac{\alpha}{K}\right), \quad \text{for } k = 1, \ldots, K, \quad (13)$$
where the number of topics $K$ and the topic proportion Dirichlet smoothing parameter $\alpha$ are both tuning parameters. The BNBP topic model is a nonparametric Bayesian algorithm that removes the need to tune these parameters. One may also compare the BNBP topic model with the HDP-LDA [6, 23], whose direct assignment sampler in our notation can be expressed as
$$P(z_{ji} = k \mid x, z^{-ji}, m, \alpha, \tilde{r}) \propto \begin{cases} \dfrac{\eta + n_{v_{ji}\cdot k}^{-ji}}{V\eta + n_{\cdot k}^{-ji}} \cdot (n_{jk}^{-ji} + \alpha\tilde{r}_k), & \text{for } k = 1, \ldots, K_J^{-ji}; \\[2mm] \dfrac{1}{V} \cdot (\alpha\tilde{r}_*), & \text{if } k = K_J^{-ji} + 1; \end{cases} \quad (14)$$
where $\alpha$ is the concentration parameter for the group-specific Dirichlet processes $\tilde{\Theta}_j \sim \mathrm{DP}(\alpha, \tilde{G})$, and $\tilde{r}_k = \tilde{G}(\omega_k)$ and $\tilde{r}_* = \tilde{G}(\Omega \backslash \mathcal{D}_J)$ are the measures of the globally shared Dirichlet process $\tilde{G} \sim \mathrm{DP}(\gamma_0, \tilde{G}_0)$ over the observed points of discontinuity and the absolutely continuous space, respectively.

Comparison between (14) and (12) shows that, distinct from the HDP-LDA that combines a topic's global and local popularities in an additive manner as $(n_{jk}^{-ji} + \alpha\tilde{r}_k)$, the BNBP topic model combines them in a multiplicative manner as $\frac{n_{\cdot k}^{-ji}}{c + n_{\cdot k}^{-ji} + r_\cdot} \cdot (n_{jk}^{-ji} + r_j)$. This term can also be rewritten as the product of $n_{\cdot k}^{-ji}$ and $\frac{n_{jk}^{-ji} + r_j}{c + n_{\cdot k}^{-ji} + r_\cdot}$, the latter of which represents how much the $j$th document contributes to the overall popularity of the $k$th topic. Therefore, the BNBP and HDP-LDA have distinct mechanisms to automatically shrink the number of topics. Note that while the BNBP sampler in (12) is fully collapsed, the direct assignment sampler of the HDP-LDA in (14) is only partially collapsed, as neither the globally shared Dirichlet process $\tilde{G}$ nor the concentration parameter $\alpha$ is marginalized out. To derive a collapsed sampler for the HDP-LDA that marginalizes out $\tilde{G}$ (but still not $\alpha$), one has to use the Chinese restaurant franchise [6], which has cumbersome book-keeping as each word is indirectly linked to its topic via a latent table index.

4 Example Results

We consider the JACM^1, PsyReview^2, and NIPS12^3 corpora, restricting the vocabulary to terms that occur in five or more documents. The JACM corpus includes 536 documents, with $V = 1539$ unique terms and 68,055 total word counts.
The PsyReview corpus includes 1281 documents, with $V = 2566$ and 71,279 total word counts. The NIPS12 corpus includes 1740 documents, with $V = 13{,}649$ and 2,301,375 total word counts. To evaluate the BNBP topic model^4 and its performance relative to that of the HDP-LDA, which are both nonparametric Bayesian algorithms, we randomly choose 50% of the words in each document as training, and use the remaining ones to calculate per-word heldout perplexity. We set the hyperparameters as $a_0 = b_0 = e_0 = f_0 = 0.01$. We consider 2500 Gibbs sampling iterations and collect the last 1500 samples. In each iteration, we randomize the ordering of the words. For each collected sample, we draw the topics $(\phi_k \mid -) \sim \mathrm{Dir}(\eta + n_{1\cdot k}, \ldots, \eta + n_{V\cdot k})$, and the topic weights $(\theta_{jk} \mid -) \sim \mathrm{Gamma}(n_{jk} + r_j, p_k)$ for the BNBP and the topic proportions $(\theta_j \mid -) \sim \mathrm{Dir}(n_{j1} + \alpha\tilde{r}_1, \ldots, n_{jK_J} + \alpha\tilde{r}_{K_J})$ for the HDP, with which the per-word perplexity is computed as
$$\exp\left\{ -\frac{1}{m_{\cdot\cdot}^{\mathrm{test}}} \sum_{v} \sum_{j} m_{vj}^{\mathrm{test}} \ln \frac{\sum_s \sum_k \phi_{vk}^{(s)} \theta_{jk}^{(s)}}{\sum_s \sum_{v} \sum_k \phi_{vk}^{(s)} \theta_{jk}^{(s)}} \right\},$$
where $s \in \{1, \ldots, S\}$ is the index of a collected MCMC sample, $m_{vj}^{\mathrm{test}}$ is the number of test words at term $v$ in document $j$, and $m_{\cdot\cdot}^{\mathrm{test}} = \sum_v \sum_j m_{vj}^{\mathrm{test}}$. The final results are averaged over five random training/testing partitions. The evaluation method is similar to those used in [24, 25, 26, 10]. Similar to [26, 10], we set the topic Dirichlet smoothing parameter as $\eta = 0.01$, 0.02, 0.05, 0.10, 0.25, or 0.50. To test how the algorithms perform in more extreme settings, we also consider $\eta = 0.001$, 0.002, and 0.005. All algorithms are implemented with unoptimized Matlab code. On a 3.4 GHz CPU, the fully collapsed Gibbs sampler of the BNBP topic model takes about 2.5 seconds per iteration on the NIPS12 corpus when the inferred number of topics is around 180. The direct assignment sampler of the HDP-LDA has comparable computational complexity when the inferred number of topics is similar. Note that when the inferred number of topics $K_J$ is large, the sparse computation technique for LDA [27, 28] may also be used to considerably speed up the sampling algorithm of the BNBP topic model.

1 http://www.cs.princeton.edu/~blei/downloads/
2 http://psiexp.ss.uci.edu/research/programs data/toolbox.htm
3 http://www.cs.nyu.edu/~roweis/data.html
4 Matlab code available at http://mingyuanzhou.github.io/

Figure 2: The inferred number of topics $K_J$ for the first 1500 Gibbs sampling iterations for the (a) HDP-LDA and (b) BNBP topic model on JACM. (c)-(d) and (e)-(f) are analogous plots to (a)-(b) for the PsyReview and NIPS12 corpora, respectively. From bottom to top in each plot, the red, blue, magenta, black, green, yellow, and cyan curves correspond to the results for $\eta = 0.50$, 0.25, 0.10, 0.05, 0.02, 0.01, and 0.005, respectively.

We first diagnose the convergence and mixing of the collapsed Gibbs samplers for the HDP-LDA and BNBP topic model via the trace plots of their samples. The three plots in the left column of Figure 2 show that the HDP-LDA travels relatively slowly to the target distributions of the number of topics, often reaching them in more than 300 iterations, whereas the three plots in the right column show that the BNBP topic model travels quickly to the target distributions, usually reaching them in less than 100 iterations.
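The per-word heldout perplexity defined above is straightforward to compute once the collected samples are stacked into arrays. Below is an illustrative sketch (array names and toy dimensions are ours, not the paper's; the sanity check exploits the fact that a uniform predictive distribution over $V$ terms yields perplexity exactly $V$):

```python
import numpy as np


def heldout_perplexity(m_test, phi_samples, theta_samples):
    """Per-word heldout perplexity from S collected MCMC samples.

    m_test: (V, J) heldout term-document counts; phi_samples: (S, V, K)
    topic samples; theta_samples: (S, J, K) topic-weight samples.
    Computes exp{-(1/m_tot) * sum_{v,j} m_test_vj * ln p_vj}, where p_vj is
    sum_s sum_k phi_vk theta_jk, normalized over the vocabulary per document."""
    rates = np.einsum('svk,sjk->svj', phi_samples, theta_samples).sum(axis=0)  # (V, J)
    p = rates / rates.sum(axis=0, keepdims=True)   # normalize over terms v for each j
    return float(np.exp(-(m_test * np.log(p)).sum() / m_test.sum()))


# Sanity check: with uniform topics every term is equally likely, so the
# perplexity equals the vocabulary size V.
S, V, J, K = 3, 4, 2, 5
rng = np.random.default_rng(0)
phi = np.full((S, V, K), 1.0 / V)                  # uniform topics
theta = rng.gamma(2.0, 1.0, size=(S, J, K))        # arbitrary positive topic weights
m_test = rng.integers(1, 10, size=(V, J))
print(heldout_perplexity(m_test, phi, theta))      # -> 4.0, up to floating point
```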
Moreover, Figure 2 shows that the chains of the HDP-LDA take small steps and do not traverse their distributions quickly, whereas the chains of the BNBP topic model mix very well locally and traverse their distributions relatively quickly.

A smaller topic Dirichlet smoothing parameter η generally supports a larger number of topics, as shown in the left column of Figure 3, and hence often leads to lower perplexities, as shown in the middle column of Figure 3; however, an η that is as small as 0.001 (not commonly used in practice) may lead to more than a thousand topics and consequently overfit the corpus, which is particularly evident for the HDP-LDA on both the JACM and PsyReview corpora. Similar trends would likely be observed on the larger NIPS12 corpus if we allowed the values of η to be even smaller than 0.001. As shown in the middle column of Figure 3, for the same η, the BNBP topic model, usually representing the corpus with a smaller number of topics, often has higher perplexities than those of the HDP-LDA, which is unsurprising as the BNBP topic model has a multiplicative control mechanism to more strongly shrink the number of topics, whereas the HDP has a softer additive shrinkage mechanism. Similar performance differences have also been observed

Figure 3: Comparison between the HDP-LDA and BNBP topic model with the topic Dirichlet smoothing parameter η ∈ {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.10, 0.25, 0.50}.
For the JACM corpus: (a) the posterior mean of the inferred number of topics K_J and (b) per-word heldout perplexity, both as a function of η, and (c) per-word heldout perplexity as a function of the inferred number of topics K_J; the topic Dirichlet smoothing parameter η and the number of topics K_J are displayed on a logarithmic scale. (d)-(f) Analogous plots to (a)-(c) for the PsyReview corpus. (g)-(i) Analogous plots to (a)-(c) for the NIPS12 corpus, where the results of η = 0.002 and 0.001 for the HDP-LDA are omitted.

in [7], where the HDP and BNBP are inferred under finite approximations with truncated block Gibbs sampling. However, this does not necessarily mean that the HDP-LDA has better predictive performance than the BNBP topic model. In fact, as shown in the right column of Figure 3, the BNBP topic model's perplexity tends to be lower than that of the HDP-LDA when their inferred numbers of topics are comparable and η is not overly small, which implies that the BNBP topic model is able to achieve the same predictive power as the HDP-LDA, but with a more compact representation of the corpus under common experimental settings. While it is interesting to probe the ultimate potential of the HDP-LDA and BNBP topic model for out-of-sample prediction by setting η to be very small, a moderate η that supports a moderate number of topics is usually preferred in practice, for which the BNBP topic model could be a preferred choice over the HDP-LDA, as our experimental results on three different corpora all suggest that the BNBP topic model can achieve lower perplexity using the same number of topics. To further understand why the BNBP topic model and HDP-LDA have distinct characteristics, one may view them from a count-modeling perspective [7] and examine how they differently control the relationship between the variances and means of the latent topic usage count vectors {(n1k, . . .
, nJk)}k.

We also find that the BNBP collapsed Gibbs sampler clearly outperforms the blocked Gibbs sampler of [7] in terms of convergence speed, computational complexity, and memory requirements. A blocked Gibbs sampler based on finite truncation [7] or adaptive truncation [11], however, does have the clear advantage of being easy to parallelize. The heuristics used to parallelize an HDP collapsed sampler [24] may also be modified to parallelize the proposed BNBP collapsed sampler.

5 Conclusions

A group size dependent exchangeable partition probability function (EPPF) for mixed-membership modeling is developed using the integer-valued beta-negative binomial process (BNBP). The exchangeable random partitions of grouped data, governed by the EPPF of the BNBP, are strongly influenced by the group-specific dispersion parameters. We construct a BNBP nonparametric Bayesian topic model that is distinct from existing ones, intuitive to interpret, and straightforward to implement. The fully collapsed Gibbs sampler converges fast, mixes well, and has state-of-the-art predictive performance when a compact representation of the corpus is desired.
The method to derive the EPPF for the BNBP via a group size dependent model is unique, and it is of interest to further investigate whether it can be generalized to derive new EPPFs for mixed-membership modeling induced by other integer-valued stochastic processes, including the gamma-Poisson and gamma-negative binomial processes.

References

[1] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1973.
[2] E. Regazzini, A. Lijoi, and I. Prünster. Distributional results for means of normalized random measures with independent increments. Ann. Statist., 2003.
[3] A. Lijoi and I. Prünster. Models beyond the Dirichlet process. In N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, editors, Bayesian Nonparametrics. Cambridge Univ. Press, 2010.
[4] D. Blackwell and J. MacQueen. Ferguson distributions via Pólya urn schemes. Ann. Statist., 1973.
[5] J. Pitman. Combinatorial Stochastic Processes. Lecture Notes in Mathematics. Springer-Verlag, 2006.
[6] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 2006.
[7] M. Zhou and L. Carin.
Negative binomial process count and mixture modeling. To appear in IEEE Trans. Pattern Anal. Mach. Intelligence, 2014.
[8] A. Y. Lo. Bayesian nonparametric statistical inference for Poisson point processes. Z. Wahrsch. Verw. Gebiete, pages 55–66, 1982.
[9] M. K. Titsias. The infinite gamma-Poisson feature model. In NIPS, 2008.
[10] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
[11] T. Broderick, L. Mackey, J. Paisley, and M. I. Jordan. Combinatorial clustering and the beta negative binomial process. To appear in IEEE Trans. Pattern Anal. Mach. Intelligence, 2014.
[12] M. Zhou and S. G. Walker. Sample size dependent species models. arXiv:1410.3155, 2014.
[13] M. Lomelí, S. Favaro, and Y. W. Teh. A marginal sampler for σ-stable Poisson-Kingman mixture models. arXiv:1407.4211, 2014.
[14] N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Ann. Statist., 1990.
[15] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In AISTATS, 2007.
[16] C. Heaukulani and D. M. Roy. The combinatorial structure of beta negative binomial processes. arXiv:1401.0062, 2013.
[17] M. Zhou, O.-H. Madrid-Padilla, and J. G. Scott. Priors for random count matrices derived from a family of negative binomial processes. arXiv:1404.3331v2, 2014.
[18] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2005.
[19] R. E. Madsen, D. Kauchak, and C. Elkan. Modeling word burstiness using the Dirichlet distribution. In ICML, 2005.
[20] M. Sibuya. Generalized hypergeometric, digamma and trigamma distributions. Annals of the Institute of Statistical Mathematics, pages 373–390, 1979.
[21] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 2003.
[22] T.
L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.
[23] C. Wang, J. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.
[24] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. JMLR, 2009.
[25] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML, 2009.
[26] J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In AISTATS, 2011.
[27] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In SIGKDD, 2008.
[28] D. Mimno, M. Hoffman, and D. Blei. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.