{"title": "Augment-and-Conquer Negative Binomial Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2546, "page_last": 2554, "abstract": "By developing data augmentation methods unique to the negative binomial (NB) distribution, we unite seemingly disjoint count and mixture models under the NB process framework. We develop fundamental properties of the models and derive efficient Gibbs sampling inference. We show that the gamma-NB process can be reduced to the hierarchical Dirichlet process with normalization, highlighting its unique theoretical, structural and computational advantages. A variety of NB processes with distinct sharing mechanisms are constructed and applied to topic modeling, with connections to existing algorithms, showing the importance of inferring both the NB dispersion and probability parameters.", "full_text": "Augment-and-Conquer Negative Binomial Processes\n\nMingyuan Zhou\n\nLawrence Carin\n\nDept. of Electrical and Computer Engineering\n\nDept. of Electrical and Computer Engineering\n\nDuke University, Durham, NC 27708\n\nmz1@ee.duke.edu\n\nDuke University, Durham, NC 27708\n\nlcarin@ee.duke.edu\n\nAbstract\n\nBy developing data augmentation methods unique to the negative binomial (NB)\ndistribution, we unite seemingly disjoint count and mixture models under the NB\nprocess framework. We develop fundamental properties of the models and derive\nef\ufb01cient Gibbs sampling inference. We show that the gamma-NB process can\nbe reduced to the hierarchical Dirichlet process with normalization, highlighting\nits unique theoretical, structural and computational advantages. 
A variety of NB processes with distinct sharing mechanisms are constructed and applied to topic modeling, with connections to existing algorithms, showing the importance of inferring both the NB dispersion and probability parameters.

1 Introduction

There has been increasing interest in count modeling using the Poisson process, geometric process [1, 2, 3, 4] and recently the negative binomial (NB) process [5, 6]. Notably, it has been independently shown in [5] and [6] that the NB process, originally constructed for count analysis, can be naturally applied for mixture modeling of grouped data x1,···,xJ, where each group xj = {xji}i=1,Nj. For a territory long occupied by the hierarchical Dirichlet process (HDP) [7] and related models, the inference of which may require substantial bookkeeping and suffer from slow convergence [7], the discovery of the NB process for mixture modeling can be significant. As the seemingly distinct problems of count and mixture modeling are united under the NB process framework, new opportunities emerge for better data fitting, more efficient inference and more flexible model constructions. However, neither [5] nor [6] explores the properties of the NB distribution deeply enough to achieve fully tractable closed-form inference. Of particular concern is the NB dispersion parameter, which was simply fixed or empirically set [6], or inferred with a Metropolis-Hastings algorithm [5].
Under these limitations, both papers fail to reveal the connections of the NB process to the HDP, and thus may lead to false assessments when comparing their modeling abilities.

We perform joint count and mixture modeling under the NB process framework, using completely random measures [1, 8, 9] that are simple to construct and amenable to posterior computation. We propose to augment-and-conquer the NB process: by "augmenting" a NB process into both the gamma-Poisson and compound Poisson representations, we "conquer" the unification of count and mixture modeling, the analysis of fundamental model properties, and the derivation of efficient Gibbs sampling inference. We make two additional contributions: 1) we construct a gamma-NB process, analyze its properties and show how its normalization leads to the HDP, highlighting its unique theoretical, structural and computational advantages relative to the HDP. 2) We show that a variety of NB processes can be constructed with distinct model properties, for which the shared random measure can be selected from completely random measures such as the gamma, beta, and beta-Bernoulli processes; we compare their performance on topic modeling, a typical example for mixture modeling of grouped data, and show the importance of inferring both the NB dispersion and probability parameters, which respectively govern the overdispersion level and the variance-to-mean ratio in count modeling.

1.1 Poisson process for count and mixture modeling

Before introducing the NB process, we first illustrate how the seemingly distinct problems of count and mixture modeling can be united under the Poisson process. Denote Ω as a measure space and for each Borel set A ⊂ Ω, denote Xj(A) as a count random variable describing the number of observations in xj that reside within A.
Given grouped data x1,···,xJ, for any measurable disjoint partition A1,···,AQ of Ω, we aim to jointly model the count random variables {Xj(Aq)}. A natural choice would be to define a Poisson process Xj ~ PP(G), with a shared completely random measure G on Ω, such that Xj(A) ~ Pois(G(A)) for each A ⊂ Ω. Denote G(Ω) = Σ_{q=1}^Q G(Aq) and G̃ = G/G(Ω). Following Lemma 4.1 of [5], the joint distributions of Xj(Ω), Xj(A1),···,Xj(AQ) are equivalent under the following two expressions:

Xj(Aq) ~ Pois(G(Aq)),  Xj(Ω) = Σ_{q=1}^Q Xj(Aq);  (1)

Xj(Ω) ~ Pois(G(Ω)),  [Xj(A1),···,Xj(AQ)] ~ Mult(Xj(Ω); G̃(A1),···,G̃(AQ)).  (2)

Thus the Poisson process provides not only a way to generate independent counts from each Aq, but also a mechanism for mixture modeling, which allocates the observations into any measurable disjoint partition {Aq}1,Q of Ω, conditioning on Xj(Ω) and the normalized mean measure G̃. To complete the model, we may place a gamma process [9] prior on the shared measure as G ~ GaP(c, G0), with concentration parameter c and base measure G0, such that G(A) ~ Gamma(G0(A), 1/c) for each A ⊂ Ω, where G0 can be continuous, discrete or a combination of both. Note that G̃ = G/G(Ω) now becomes a Dirichlet process (DP) as G̃ ~ DP(γ0, G̃0), where γ0 = G0(Ω) and G̃0 = G0/γ0. The normalized gamma representation of the DP is discussed in [10, 11, 9] and has been used to construct the group-level DPs for an HDP [12]. The Poisson process has an equal-dispersion assumption for count modeling.
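The equivalence between (1) and (2) is easy to check numerically. Below is a minimal sketch (our own illustration, not from the paper); the partition sizes and G0 values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared gamma random measure on a partition A_1..A_Q, as in G ~ GaP(c, G_0):
# G(A_q) ~ Gamma(G_0(A_q), 1/c).  The G_0(A_q) values here are hypothetical.
c = 1.0
G0 = np.array([2.0, 1.0, 0.5])
G = rng.gamma(G0, 1.0 / c)
n = 20_000

# Expression (1): independent Poisson counts in each cell.
X1 = rng.poisson(G, size=(n, len(G)))

# Expression (2): draw the total, then allocate it multinomially by G / G(Omega).
totals = rng.poisson(G.sum(), size=n)
X2 = np.array([rng.multinomial(t, G / G.sum()) for t in totals])

# Same joint law, so the per-cell empirical means agree (both estimate G(A_q)).
assert np.allclose(X1.mean(axis=0), G, rtol=0.1, atol=0.05)
assert np.allclose(X2.mean(axis=0), G, rtol=0.1, atol=0.05)
```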
As shown in (2), the construction of Poisson processes with a shared gamma process mean measure implies the same mixture proportions across groups, which is essentially the same as the DP when used for mixture modeling when the total counts {Xj(Ω)}j are not treated as random variables. This motivates us to consider adding an additional layer or using a distribution other than the Poisson to model the counts. As shown below, the NB distribution is an ideal candidate, not only because it allows overdispersion, but also because it can be augmented into both a gamma-Poisson and a compound Poisson representation.

2 Augment-and-Conquer the Negative Binomial Distribution

The NB distribution m ~ NB(r, p) has the probability mass function (PMF) f_M(m) = Γ(r+m)/(m! Γ(r)) (1−p)^r p^m. It has a mean μ = rp/(1−p) smaller than the variance σ² = rp/(1−p)² = μ + r^{-1}μ², with the variance-to-mean ratio (VMR) as (1−p)^{-1} and the overdispersion level (ODL, the coefficient of the quadratic term in σ²) as r^{-1}. It has been widely investigated and applied in numerous scientific studies [13, 14, 15]. The NB distribution can be augmented into a gamma-Poisson construction as m ~ Pois(λ), λ ~ Gamma(r, p/(1−p)), where the gamma distribution is parameterized by its shape r and scale p/(1−p). It can also be augmented under a compound Poisson representation [16] as m = Σ_{t=1}^l u_t, u_t ~ Log(p), l ~ Pois(−r ln(1−p)), where u ~ Log(p) is the logarithmic distribution [17] with probability-generating function (PGF) C_U(z) = ln(1−pz)/ln(1−p), |z| < p^{-1}. In a slight abuse of notation, but for added conciseness, in the following discussion we use m ~ Σ_{t=1}^l Log(p) to denote m = Σ_{t=1}^l u_t, u_t ~ Log(p).

The inference of the NB dispersion parameter r has long been a challenge [13, 18, 19]. In this paper, we first place a gamma prior on it as r ~ Gamma(r1, 1/c1). We then use Lemma 2.1 (below) to infer a latent count l for each m ~ NB(r, p) conditioning on m and r. Since l ~ Pois(−r ln(1−p)) by construction, we can use the gamma-Poisson conjugacy to update r. Using Lemma 2.2 (below), we can further infer an augmented latent count l' for each l, and then use these latent counts to update r1, assuming r1 ~ Gamma(r2, 1/c2). Using Lemmas 2.1 and 2.2, we can continue this process repeatedly, suggesting that we may build a NB process to model data that have subgroups within groups. The conditional posterior of the latent count l was first derived by us but was not given an analytical form [20]. Below we explicitly derive the PMF of l, shown in (3), and find that it exactly represents the distribution of the random number of tables occupied by m customers in a Chinese restaurant process with concentration parameter r [21, 22, 7]. We denote l ~ CRT(m, r) as a Chinese restaurant table (CRT) count random variable with such a PMF; as proved in the supplementary material, we can sample it as l = Σ_{n=1}^m b_n, b_n ~ Bernoulli(r/(n−1+r)).
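The two augmentations and the Bernoulli-sum CRT sampler are straightforward to simulate. A minimal sketch (our own illustration using only the standard library; parameter values are arbitrary) checks that the gamma-Poisson mixture reproduces the NB mean rp/(1−p) and exercises l ~ CRT(m, r):

```python
import math
import random

random.seed(1)

def sample_crt(m: int, r: float) -> int:
    # l = sum_{n=1}^m b_n, b_n ~ Bernoulli(r / (n - 1 + r)), i.e. l ~ CRT(m, r)
    return sum(random.random() < r / (n - 1.0 + r) for n in range(1, m + 1))

def sample_poisson(lam: float) -> int:
    # Inversion sampler (the stdlib has no Poisson draw); fine for moderate lam.
    k, u = 0, random.random()
    pk = cdf = math.exp(-lam)
    while u > cdf and pk > 0.0:
        k += 1
        pk *= lam / k
        cdf += pk
    return k

# Gamma-Poisson augmentation: m ~ Pois(lam), lam ~ Gamma(r, p/(1-p))
# should marginally give m ~ NB(r, p), whose mean is r p / (1 - p).
r, p, n = 2.5, 0.3, 50_000
draws = [sample_poisson(random.gammavariate(r, p / (1.0 - p))) for _ in range(n)]
assert abs(sum(draws) / n - r * p / (1.0 - p)) < 0.03

# CRT sampler: since l is a sum of independent Bernoullis, its exact mean is
# sum_n r / (n - 1 + r); the empirical mean should match.
m = 50
exact_mean = sum(r / (n - 1.0 + r) for n in range(1, m + 1))
ls = [sample_crt(m, r) for _ in range(2000)]
assert abs(sum(ls) / len(ls) - exact_mean) < 0.3
assert all(1 <= l <= m for l in ls)  # first customer always opens a table
```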
We will show that these augment-and-concur methods\nnot only unite count and mixture modeling and provide ef\ufb01cient inference, but also, as shown in\nSection 3, let us examine the posteriors to understand fundamental properties of the NB processes,\nclearly revealing connections to previous nonparametric Bayesian mixture models.\nLemma 2.1. Denote s(m, j) as Stirling numbers of the \ufb01rst kind [17]. Augment m \u223c NB(r, p)\nt=1 Log(p), l \u223c Pois(\u2212r ln(1 \u2212 p)), then\n\n(3)\nt=1 Log(p), j = 1,\u00b7\u00b7\u00b7 , m. Since wj is the summation of j iid Log(p)\nU (z) = [ln(1 \u2212 pz)/ln(1 \u2212 p)]j , |z| <\n[17], we have Pr(wj = m) =\n(0)/m! = (\u22121)mpjj!s(m, j)/(m![ln(1 \u2212 p)]j). Thus for 0 \u2264 j \u2264 m, we have Pr(L =\nj=0 |s(m, j)|rj, we\n\nunder the compound Poisson representation as m \u223c(cid:80)l\nProof. Denote wj \u223c (cid:80)j\np\u22121. Using the property that [ln(1 + x)]j = j!(cid:80)\u221e\nj|m, r) \u221d Pr(wj = m)Pois(j;\u2212r ln(1\u2212p)) \u221d |s(m, j)|rj. Denote Sr(m) =(cid:80)m\nhave Sr(m) = (m\u22121+r)Sr(m\u22121) = \u00b7\u00b7\u00b7 =(cid:81)m\u22121\nt=1 Log(p), l \u223c(cid:80)l(cid:48)\nProof. Augmenting m leads to m \u223c(cid:80)l\n\nLemma 2.2. Let m \u223c NB(r, p), r \u223c Gamma(r1, 1/c1), denote p(cid:48) =\nbe generated from a compound distribution as\n\nt(cid:48)=1 Log(p(cid:48)), l(cid:48) \u223c Pois(\u2212r1 ln(1 \u2212 p(cid:48))).\n(4)\nt=1 Log(p), l \u223c Pois(\u2212r ln(1 \u2212 p)). Marginalizing out r\n\nn=1 (r +n)Sr(1) =(cid:81)m\u22121\n\nn=0 (r +n) = \u0393(m+r)\n\u2212 ln(1\u2212p)\nc1\u2212ln(1\u2212p) , then m can also\n\nleads to l \u223c NB (r1, p(cid:48)). Augmenting l using its compound Poisson representation leads to (4).\n3 Gamma-Negative Binomial Process\nWe explore sharing the NB dispersion across groups while the probability parameters are group\ndependent. 
We define a NB process X ~ NBP(G, p) as X(A) ~ NB(G(A), p) for each A ⊂ Ω and construct a gamma-NB process for joint count and mixture modeling as Xj ~ NBP(G, pj), G ~ GaP(c, G0), which can be augmented as a gamma-gamma-Poisson process as

Xj ~ PP(Λj),  Λj ~ GaP((1−pj)/pj, G),  G ~ GaP(c, G0).  (5)

In the above PP(·) and GaP(·) represent the Poisson and gamma processes, respectively, as defined in Section 1.1. Using Lemma 2.2, the gamma-NB process can also be augmented as

Xj ~ Σ_{t=1}^{Lj} Log(pj),  Lj ~ PP(−G ln(1−pj)),  G ~ GaP(c, G0);  (6)

L = Σj Lj ~ Σ_{t=1}^{L'} Log(p'),  L' ~ PP(−G0 ln(1−p')),  p' = −Σj ln(1−pj) / (c − Σj ln(1−pj)).  (7)

These three augmentations allow us to derive a sequence of closed-form update equations for inference with the gamma-NB process. Using the gamma-Poisson conjugacy on (5), for each A ⊂ Ω, we have Λj(A) | G, Xj, pj ~ Gamma(G(A) + Xj(A), pj); thus the conditional posterior of Λj is

Λj | G, Xj, pj ~ GaP(1/pj, G + Xj)  (8)

for each A ⊂ Ω. Define T ~ CRTP(X, G) as a CRT process for which T(A) = Σ_{ω∈A} T(ω), T(ω) ~ CRT(X(ω), G(ω)), for each A ⊂ Ω. Applying Lemma 2.1 on (6) and (7), we have

Lj | Xj, G ~ CRTP(Xj, G),  L' | L, G0 ~ CRTP(L, G0).  (9)

If G0 is a continuous base measure and γ0 = G0(Ω) is finite, we have G0(ω) → 0 ∀ ω ∈ Ω and thus

L'(Ω) | L, G0 = Σ_{ω∈Ω} δ(L(ω) > 0) = Σ_{ω∈Ω} δ(Σj Xj(ω) > 0),  (10)

which is equal to K+, the total number of used discrete atoms; if G0 is discrete as G0 = Σ_{k=1}^K (γ0/K) δ_{ωk}, then L'(ωk) = CRT(L(ωk), γ0/K) ≥ 1 if Σj Xj(ωk) > 0, thus L'(Ω) ≥ K+. In either case, let γ0 ~ Gamma(e0, 1/f0); with the gamma-Poisson conjugacy on (6) and (7), we have

γ0 | {L'(Ω), p'} ~ Gamma(e0 + L'(Ω), 1/(f0 − ln(1−p')));  (11)

G | G0, {Lj, pj} ~ GaP(c − Σj ln(1−pj), G0 + Σj Lj).  (12)

Since the data {xji}i are exchangeable within group j, the predictive distribution of a point Xji, conditioning on Xj^{−i} = {Xjn}_{n≠i} and G, with Λj marginalized out, can be expressed as

Xji | G, Xj^{−i} ~ E[Λj | G, Xj^{−i}] / E[Λj(Ω) | G, Xj^{−i}] = G/(G(Ω) + Xj(Ω) − 1) + Xj^{−i}/(G(Ω) + Xj(Ω) − 1).  (13)

3.1 Relationship with the hierarchical Dirichlet process

Using the equivalence between (1) and (2) and normalizing all the gamma processes in (5), denoting Λ̃j = Λj/Λj(Ω), α = G(Ω), G̃ = G/α, γ0 = G0(Ω) and G̃0 = G0/γ0, we can re-express (5) as

Xji ~ Λ̃j,  Λ̃j ~ DP(α, G̃),  α ~ Gamma(γ0, 1/c),  G̃ ~ DP(γ0, G̃0),  (14)

which is an HDP [7]. Thus the normalized gamma-NB process leads to an HDP, yet we cannot return from the HDP to the gamma-NB process without modeling Xj(Ω) and Λj(Ω) as random variables.
Theoretically, they are distinct in that the gamma-NB process is a completely random measure, assigning independent random variables into any disjoint Borel sets {Aq}1,Q of Ω, whereas the HDP is not. Practically, the gamma-NB process can exploit conjugacy to achieve analytical conditional posteriors for all latent parameters. The inference of the HDP is a major challenge and it is usually solved through alternative constructions such as the Chinese restaurant franchise (CRF) and stick-breaking representations [7, 23]. In particular, without analytical conditional posteriors, the inference of concentration parameters α and γ0 is nontrivial [7, 24] and they are often simply fixed [23]. Under the CRF metaphor α governs the random number of tables occupied by customers in each restaurant independently; further, if the base probability measure G̃0 is continuous, γ0 governs the random number of dishes selected by tables of all restaurants. One may apply the data augmentation method of [22] to sample α and γ0. However, if G̃0 is discrete as G̃0 = Σ_{k=1}^K (1/K) δ_{ωk}, which is of practical value and becomes a continuous base measure as K → ∞ [11, 7, 24], then using the method of [22] to sample γ0 is only approximately correct, which may result in a biased estimate in practice, especially if K is not large enough.
By contrast, in the gamma-NB process, the shared gamma process G can be analytically updated with (12), and G(Ω), which plays the role of α in the HDP, is readily available as

G(Ω) | G0, {Lj, pj} ~ Gamma(γ0 + Σj Lj(Ω), 1/(c − Σj ln(1−pj))),  (15)

and as in (11), regardless of whether the base measure is continuous, the total mass γ0 has an analytical gamma posterior whose shape parameter is governed by L'(Ω), with L'(Ω) = K+ if G0 is continuous and finite and L'(Ω) ≥ K+ if G0 = Σ_{k=1}^K (γ0/K) δ_{ωk}. Equation (15) also intuitively shows how the NB probability parameters {pj} govern the variations among {Λ̃j} in the gamma-NB process. In the HDP, pj is not explicitly modeled, and since its value becomes irrelevant when taking the normalized constructions in (14), it is usually treated as a nuisance parameter and perceived as pj = 0.5 when needed for interpretation purposes. Fixing pj = 0.5 is also considered in [12] to construct an HDP, whose group-level DPs are normalized from gamma processes with the scale parameters as pj/(1−pj) = 1; it is also shown in [12] that improved performance can be obtained for topic modeling by learning the scale parameters with a log Gaussian process prior.
However, no analytical conditional posteriors are provided and Gibbs sampling is not considered as a viable option [12].

3.2 Augment-and-conquer inference for joint count and mixture modeling

For a finite continuous base measure, the gamma process G ~ GaP(c, G0) can also be defined with its Lévy measure on a product space R+ × Ω, expressed as ν(dr dω) = r^{-1} e^{-cr} dr G0(dω) [9]. Since the Poisson intensity ν+ = ν(R+ × Ω) = ∞ and ∫∫_{R+×Ω} r ν(dr dω) is finite, a draw from this process can be expressed as G = Σ_{k=1}^∞ rk δ_{ωk}, (rk, ωk) ~ π(dr dω), π(dr dω) ν+ ≡ ν(dr dω) [9]. Here we consider a discrete base measure as G0 = Σ_{k=1}^K (γ0/K) δ_{ωk}, ωk ~ g0(ωk); then we have G = Σ_{k=1}^K rk δ_{ωk}, rk ~ Gamma(γ0/K, 1/c), ωk ~ g0(ωk), which becomes a draw from the gamma process with a continuous base measure as K → ∞. Let xji ~ F(ω_{zji}) be observation i in group j, linked to a mixture component ω_{zji} ∈ Ω through a distribution F. Denote njk = Σ_{i=1}^{Nj} δ(zji = k); we can express the gamma-NB process with the discrete base measure as

ωk ~ g0(ωk),  Nj = Σ_{k=1}^K njk,  njk ~ Pois(λjk),  λjk ~ Gamma(rk, pj/(1−pj)),
rk ~ Gamma(γ0/K, 1/c),  pj ~ Beta(a0, b0),  γ0 ~ Gamma(e0, 1/f0),  (16)

where marginally we have njk ~ NB(rk, pj). Using the equivalence between (1) and (2), we can equivalently express Nj and njk in the above model as Nj ~ Pois(λj), [nj1,···,njK] ~ Mult(Nj; λj1/λj,···,λjK/λj), where λj = Σ_{k=1}^K λjk. Since the data {xji}i=1,Nj are fully exchangeable, rather than drawing [nj1,···,njK] once, we may equivalently draw the index

zji ~ Discrete(λj1/λj,···,λjK/λj)  (17)

for each xji and then let njk = Σ_{i=1}^{Nj} δ(zji = k). This provides further insights on how the seemingly disjoint problems of count and mixture modeling are united under the NB process framework. Following (8)-(12), the block Gibbs sampling is straightforward to write as

p(ωk|−) ∝ Π_{ji: zji=k} F(xji; ωk) g0(ωk),  Pr(zji = k|−) ∝ F(xji; ωk) λjk,
(pj|−) ~ Beta(a0 + Nj, b0 + Σk rk),  (ljk|−) ~ CRT(njk, rk),
(l'k|−) ~ CRT(Σj ljk, γ0/K),  p' = −Σj ln(1−pj)/(c − Σj ln(1−pj)),
(γ0|−) ~ Gamma(e0 + Σk l'k, 1/(f0 − ln(1−p'))),
(rk|−) ~ Gamma(γ0/K + Σj ljk, 1/(c − Σj ln(1−pj))),  (λjk|−) ~ Gamma(rk + njk, pj),  (18)

which has similar computational complexity as that of the direct assignment block Gibbs sampling of the CRF-HDP [7, 24]. If g0(ω) is conjugate to the likelihood F(x; ω), then the posterior p(ω|−) would be analytical.
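The update cycle in (18) can be sketched end-to-end on a toy corpus. The following is our own illustrative implementation, not the authors' code: the documents are random, the hyperparameter values and the small numerical floors are arbitrary choices for the demo, and multinomial topics play the role of F(x; ω) with a Dirichlet g0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: J documents of token indices over a V-term vocabulary (hypothetical).
docs = [rng.integers(0, 8, size=40).tolist() for _ in range(5)]
J, V, K = len(docs), 8, 10
a0 = b0 = e0 = f0 = 0.01
c, eta = 1.0, 0.05

def sample_crt(m, r):
    # l ~ CRT(m, r): sum of Bernoulli(r / (n - 1 + r)), n = 1..m
    if m == 0:
        return 0
    n = np.arange(1, m + 1)
    return int((rng.random(m) < r / (n - 1 + r)).sum())

# Initialization, cf. the paper's use of r_k = 50/K and p_j = 0.5 early on.
r, p, gamma0 = np.full(K, 50.0 / K), np.full(J, 0.5), 1.0
lam = rng.gamma(np.tile(r, (J, 1)), p[:, None] / (1 - p[:, None]))
omega = rng.dirichlet(np.full(V, eta), size=K)          # topics omega_k

for it in range(50):
    n_jk, n_kv = np.zeros((J, K)), np.zeros((K, V))
    for j, doc in enumerate(docs):
        for v in doc:
            # Pr(z_ji = k | -) propto F(x_ji; omega_k) * lambda_jk
            q = omega[:, v] * lam[j] + 1e-300           # underflow guard
            z = rng.choice(K, p=q / q.sum())
            n_jk[j, z] += 1
            n_kv[z, v] += 1
    omega = np.array([rng.dirichlet(eta + n_kv[k]) for k in range(K)])
    N = n_jk.sum(axis=1)
    p = rng.beta(a0 + N, b0 + r.sum())                  # (p_j | -)
    l_jk = np.array([[sample_crt(int(n_jk[j, k]), r[k]) for k in range(K)]
                     for j in range(J)])                # (l_jk | -)
    lp = np.array([sample_crt(int(l_jk[:, k].sum()), gamma0 / K)
                   for k in range(K)])                  # (l'_k | -)
    s = -np.log1p(-p).sum()                             # -sum_j ln(1 - p_j)
    pprime = s / (c + s)
    gamma0 = max(rng.gamma(e0 + lp.sum(),
                           1.0 / (f0 - np.log1p(-pprime))), 1e-4)  # numerical floor
    r = np.maximum(rng.gamma(gamma0 / K + l_jk.sum(axis=0),
                             1.0 / (c + s)), 1e-8)      # (r_k | -), floored
    lam = rng.gamma(r[None, :] + n_jk, p[:, None])      # (lambda_jk | -)

assert lam.shape == (J, K) and np.isfinite(lam).all() and lam.max() > 0
```

Topics with negligible r_k and no counts collapse toward zero weight, which is how the sampler effectively infers the number of active topics when K is an upper bound.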
Note that when K → ∞, we have (l'k|−) = δ(Σj ljk > 0) = δ(Σj njk > 0). Using (1) and (2) and normalizing the gamma distributions, (16) can be re-expressed as

zji ~ Discrete(λ̃j),  λ̃j ~ Dir(α r̃),  α ~ Gamma(γ0, 1/c),  r̃ ~ Dir(γ0/K,···,γ0/K),  (19)

which loses the count modeling ability and becomes a finite representation of the HDP, the inference of which is not conjugate and has to be solved under alternative representations [7, 24]. This also implies that by using the Dirichlet process as the foundation, traditional mixture modeling may discard useful count information from the beginning.

4 The Negative Binomial Process Family and Related Algorithms

The gamma-NB process shares the NB dispersion across groups. Since the NB distribution has two adjustable parameters, we may explore alternative ideas, with the NB probability measure shared across groups as in [6], or with both the dispersion and probability measures shared as in [5]. These constructions are distinct from both the gamma-NB process and the HDP in that Λj has space dependent scales, and thus its normalization Λ̃j = Λj/Λj(Ω) no longer follows a Dirichlet process.

It is natural to let the probability measure be drawn from a beta process [25, 26], which can be defined by its Lévy measure on a product space [0, 1] × Ω as ν(dp dω) = c p^{-1}(1−p)^{c−1} dp B0(dω). A draw from the beta process B ~ BP(c, B0) with concentration parameter c and base measure B0 can be expressed as B = Σ_{k=1}^∞ pk δ_{ωk}. A beta-NB process [5, 6] can be constructed by letting Xj ~ NBP(rj, B), with a random draw expressed as Xj = Σ_{k=1}^∞ njk δ_{ωk}, njk ~ NB(rj, pk). Under this construction, the NB probability measure is shared and the NB dispersion parameters are group dependent. As in [5], we may also consider a marked-beta-NB¹ process in which both the NB probability and dispersion measures are shared, with each point of the beta process marked with an independent gamma random variable. Thus a draw from the marked-beta process becomes (R, B) = Σ_{k=1}^∞ (rk, pk) δ_{ωk}, and the NB process Xj ~ NBP(R, B) becomes Xj = Σ_{k=1}^∞ njk δ_{ωk}, njk ~ NB(rk, pk). Since the beta and NB processes are conjugate, the posterior of B is tractable, as shown in [5, 6]. If it is believed that there is an excessive number of zeros, governed by a process other than the NB process, we may introduce a zero-inflated NB process as Xj ~ NBP(RZj, pj), where Zj ~ BeP(B) is drawn from the Bernoulli process [26] and (R, B) = Σ_{k=1}^∞ (rk, πk) δ_{ωk} is drawn from a marked-beta process; thus njk ~ NB(rk bjk, pj), bjk = Bernoulli(πk). This construction can be linked to the model in [27] with appropriate normalization, with the advantages that there is no need to fix pj = 0.5 and the inference is fully tractable. The zero-inflated construction can also be linked to models for real valued data using the Indian buffet process (IBP) or beta-Bernoulli process spike-and-slab prior [28, 29, 30, 31].

4.1 Related Algorithms

To show how the NB processes can be diversely constructed and to make connections to previous parametric and nonparametric mixture models, we show in Table 1 a variety of NB processes, which differ on how the dispersion and probability measures are shared.
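To make the different sharing mechanisms concrete, finite-K draws from the beta-NB, marked-beta-NB and zero-inflated-NB constructions can be sketched as follows. This is our own illustration: it uses a common finite beta-process approximation, pk ~ Beta(c γ0/K, c(1 − γ0/K)), which the section does not spell out, and all hyperparameter values are hypothetical. Note NumPy's NB parameterization uses the success probability, i.e. its p corresponds to 1 − p here:

```python
import numpy as np

rng = np.random.default_rng(2)

K, J, c, gamma0 = 200, 4, 1.0, 2.0
# Finite-K approximation of B ~ BP(c, B0); clipped for numerical safety.
p_k = np.clip(rng.beta(c * gamma0 / K, c * (1 - gamma0 / K), size=K), 1e-6, 1 - 1e-6)
r_k = rng.gamma(1.0, 1.0, size=K)        # gamma marks (hypothetical hyperparameters)

# Marked-beta-NB: n_jk ~ NB(r_k, p_k); both measures shared across groups j.
n_mb = rng.negative_binomial(r_k, 1 - p_k, size=(J, K))

# Beta-NB: n_jk ~ NB(r_j, p_k); the dispersion r_j is group dependent instead.
r_j = rng.gamma(1.0, 1.0, size=J)
n_b = rng.negative_binomial(r_j[:, None], 1 - p_k[None, :], size=(J, K))

# Zero-inflated NB: n_jk ~ NB(r_k b_jk, p_j), b_jk = Bernoulli(pi_k).
pi_k = rng.beta(1.0, 1.0, size=K)
b = rng.random((J, K)) < pi_k
p_j = np.clip(rng.beta(1.0, 1.0, size=J), 1e-6, 1 - 1e-6)
n_zi = np.where(b,
                rng.negative_binomial(np.maximum(r_k, 1e-9), 1 - p_j[:, None],
                                      size=(J, K)),
                0)

assert n_mb.shape == n_b.shape == n_zi.shape == (J, K)
assert n_mb.min() >= 0 and n_b.min() >= 0 and n_zi.min() >= 0
```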
For a deeper understanding of how the counts are modeled, we also show in Table 1 both the VMR and ODL implied by these settings.

Table 1: A variety of negative binomial processes are constructed with distinct sharing mechanisms, reflected by which parameters among rk, rj, pk, pj and πk (bjk) are inferred (indicated by a check-mark ✓), and the implied VMR and ODL for counts {njk}j,k. They are applied for topic modeling of a document corpus, a typical example of mixture modeling of grouped data. Related algorithms are shown in the last column.

Algorithms     | rj | rk | pj  | pk | πk | VMR         | ODL          | Related Algorithms
NB-LDA         | ✓  |    | ✓   |    |    | (1−pj)^{-1} | rj^{-1}      | LDA [32], Dir-PFA [5]
NB-HDP         |    | ✓  | 0.5 |    |    | 2           | rk^{-1}      | HDP [7], DILN-HDP [12]
NB-FTM         |    | ✓  | 0.5 |    | ✓  | 2           | rk^{-1} bjk  | FTM [27], SγΓ-PFA [5]
Beta-NB        | ✓  |    |     | ✓  |    | (1−pk)^{-1} | rj^{-1}      | BNBP [5], BNBP [6]
Gamma-NB       |    | ✓  | ✓   |    |    | (1−pj)^{-1} | rk^{-1}      | CRF-HDP [7, 24]
Marked-Beta-NB |    | ✓  |     | ✓  |    | (1−pk)^{-1} | rk^{-1}      | BNBP [5]

¹We may also consider a beta marked-gamma-NB process, whose performance is found to be very similar.

We consider topic modeling of a document corpus, a typical example of mixture modeling of grouped data, where each bag-of-words document constitutes a group, each word is an exchangeable group member, and F(xji; ωk) is simply the probability of word xji in topic ωk. We consider six differently constructed NB processes in Table 1: (i) Related to latent Dirichlet allocation (LDA) [32] and Dirichlet Poisson factor analysis (Dir-PFA) [5], the NB-LDA is also a parametric topic model that requires tuning the number of topics.
However, it uses a document dependent rj and pj to automatically learn the smoothing of the gamma distributed topic weights, and it lets rj ~ Gamma(γ0, 1/c), γ0 ~ Gamma(e0, 1/f0) to share statistical strength between documents, with closed-form Gibbs sampling inference. Thus even the most basic parametric LDA topic model can be improved under the NB count modeling framework. (ii) The NB-HDP model is related to the HDP [7], and since pj is an irrelevant parameter in the HDP due to normalization, we set it in the NB-HDP as 0.5, the usually perceived value before normalization. The NB-HDP model is comparable to the DILN-HDP [12] that constructs the group-level DPs with normalized gamma processes, whose scale parameters are also set as one. (iii) The NB-FTM model introduces an additional beta-Bernoulli process under the NB process framework to explicitly model zero counts. It is the same as the sparse-gamma-gamma-PFA (SγΓ-PFA) in [5] and is comparable to the focused topic model (FTM) [27], which is constructed from the IBP compound DP. Nevertheless, they apply about the same likelihoods and priors for inference. The zero-inflated NB process improves over them by allowing pj to be inferred, which generally yields better data fitting. (iv) The Gamma-NB process explores the idea that the dispersion measure is shared across groups, and it improves over the NB-HDP by allowing the learning of pj. It reduces to the HDP [7] by normalizing both the group-level and the shared gamma processes. (v) The Beta-NB process explores sharing the probability measure across groups, and it improves over the beta negative binomial process (BNBP) proposed in [6] by allowing inference of rj. (vi) The Marked-Beta-NB process is comparable to the BNBP proposed in [5], with the distinction that it allows analytical update of rk.
The constructions and inference of the various NB processes and related algorithms in Table 1 all follow the formulas in (16) and (18), respectively, with additional details presented in the supplementary material.

Note that as shown in [5], NB process topic models can also be considered as factor analysis of the term-document count matrix under the Poisson likelihood, with ωk as the kth factor loading that sums to one and λjk as the factor score, which can be further linked to nonnegative matrix factorization [33] and a gamma Poisson factor model [34]. If, in addition to the proportions λ̃j and r̃, the absolute values, e.g., λjk, rk and pk, are also of interest, then the NB process based joint count and mixture models would clearly be more appropriate than the HDP based mixture models.

5 Example Results

Motivated by Table 1, we consider topic modeling using a variety of NB processes, which differ on which parameters are learned and consequently how the VMR and ODL of the latent counts {njk}j,k are modeled. We compare them with LDA [32, 35] and CRF-HDP [7, 24]. For fair comparison, they are all implemented with block Gibbs sampling using a discrete base measure with K atoms, and for the first fifty iterations, the Gamma-NB process with rk ≡ 50/K and pj ≡ 0.5 is used for initialization. For LDA and NB-LDA, we search K for optimal performance and for the other models, we set K = 400 as an upper-bound. We set the parameters as c = 1, η = 0.05 and a0 = b0 = e0 = f0 = 0.01. For LDA, we set the topic proportion Dirichlet smoothing parameter as 50/K, following the topic model toolbox² provided for [35]. We consider 2500 Gibbs sampling iterations, with the last 1500 samples collected.
Under the NB processes, each word xji would be assigned to a topic k based on both F(xji; ωk) and the topic weights {λjk}k=1,K; each topic is drawn from a Dirichlet base measure as ωk ~ Dir(η,···,η) ∈ R^V, where V is the number of unique terms in the vocabulary and η is a smoothing parameter. Let vji denote the location of word xji in the vocabulary; then we have (ωk|−) ~ Dir(η + Σ_{ji} δ(zji = k, vji = 1),···, η + Σ_{ji} δ(zji = k, vji = V)). We consider the Psychological Review² corpus, restricting the vocabulary to terms that occur in five or more documents. The corpus includes 1281 abstracts from 1967 to 2003, with 2,566 unique terms and 71,279 total word counts. We randomly select 20%, 40%, 60% or 80% of the words from each document to learn a document dependent probability for each term v as f_jv = Σ_{s=1}^S Σ_{k=1}^K ω_vk^{(s)} λ_jk^{(s)} / (Σ_{s=1}^S Σ_{v=1}^V Σ_{k=1}^K ω_vk^{(s)} λ_jk^{(s)}), where ω_vk is the probability of term v in topic k and S is the total number of collected samples.

Figure 1: Comparison of per-word perplexities on the held-out words between various algorithms. (a) With 60% of the words in each document used for training, the performance varies as a function of K in both LDA and NB-LDA, which are parametric models, whereas the NB-HDP, NB-FTM, Beta-NB, CRF-HDP, Gamma-NB and Marked-Beta-NB all infer the number of active topics, which are 127, 201, 107, 161, 177 and 130, respectively, according to the last Gibbs sampling iteration. (b) Per-word perplexities of various models as a function of the percentage of words in each document used for training. The results of the LDA and NB-LDA are shown with the best settings of K under each training/testing partition.
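The f_jv averaging and the resulting per-word perplexity are simple to compute from collected samples. A minimal sketch (our own illustration; the sample arrays and toy dimensions are hypothetical stand-ins for collected Gibbs draws):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical collected samples: omega[s, v, k] (term probabilities per topic)
# and lam[s, j, k] (topic weights), for S samples, V terms, K topics, J documents.
S, V, K, J = 10, 30, 5, 4
omega = rng.dirichlet(np.full(V, 0.1), size=(S, K)).transpose(0, 2, 1)  # (S, V, K)
lam = rng.gamma(1.0, 1.0, size=(S, J, K))

# f_jv = sum_{s,k} omega_vk^(s) lam_jk^(s) / sum_{s,v,k} omega_vk^(s) lam_jk^(s)
num = np.einsum('svk,sjk->jv', omega, lam)
f = num / num.sum(axis=1, keepdims=True)

# Per-word perplexity on held-out tokens: exp(-(1/N) sum_i log f_{j_i, v_i})
held = [(int(rng.integers(J)), int(rng.integers(V))) for _ in range(100)]
logp = np.array([np.log(f[j, v]) for j, v in held])
perplexity = float(np.exp(-logp.mean()))

assert np.allclose(f.sum(axis=1), 1.0)   # f_j. is a distribution over terms
assert perplexity > 1.0                  # lower is better; 1 is the floor
```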
We use {f_jv}_{j,v} to calculate the per-word perplexity on the held-out words as in [5]. The final results are averaged over five random training/testing partitions. Note that perplexity per test word is the fair metric for comparing topic models. However, when the actual Poisson rates or distribution parameters for counts, rather than the mixture proportions, are of interest, an NB process based joint count and mixture model would clearly be more appropriate than an HDP based mixture model.

Figure 1 compares the performance of various algorithms. The Marked-Beta-NB process has the best performance, closely followed by the Gamma-NB process, CRF-HDP and Beta-NB process. With an appropriate K, the parametric NB-LDA may outperform the nonparametric NB-HDP and NB-FTM as the training data percentage increases, a somewhat unexpected but intuitively reasonable result: by learning both the NB dispersion and probability parameters r_j and p_j in a document-dependent manner, we may fit the data better than with nonparametric models that share the NB dispersion parameters r_k across documents but fix the NB probability parameters.

Figure 2 shows the model parameters learned by various algorithms under the NB process framework, revealing distinct sharing mechanisms and model properties. When (r_j, p_j) is used, as in NB-LDA, different documents are weakly coupled with r_j ∼ Gamma(γ_0, 1/c), and the modeling results show that a typical document in this corpus usually has a small r_j and a large p_j, thus a large ODL and a large VMR, indicating highly overdispersed counts on its topic usage. When (r_j, p_k) is used to model the latent counts {n_jk}_{j,k}, as in the Beta-NB process, the transition between active and non-active topics is so sharp that p_k is either close to one or close to zero.
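The ODL and VMR referred to throughout follow directly from the NB moments: for n ∼ NB(r, p), the mean is r p/(1−p), the VMR is (1−p)^{−1}, and the ODL is r^{−1}. A small numeric sketch (the parameter values are hypothetical, not fitted to the corpus):

```python
# Mean and variance-to-mean ratio (VMR) of NB(r, p); the ODL is taken
# as 1/r, following the paper's usage. Values below are illustrative.
def nb_mean(r, p):
    return r * p / (1.0 - p)

def nb_vmr(p):
    return 1.0 / (1.0 - p)

# A "typical document" case from the discussion: small r, large p
r, p = 0.5, 0.9
print(nb_mean(r, p))  # appreciable mean, ~4.5
print(nb_vmr(p))      # large VMR, ~10
print(1.0 / r)        # large ODL, 2.0
```

Under this parameterization, increasing p inflates both the mean and the VMR, while decreasing r inflates the ODL without raising the mean, which is exactly the coupling the following analysis exploits.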
That is because p_k controls the mean as E[Σ_j n_jk] = p_k/(1−p_k) Σ_j r_j and the VMR as (1−p_k)^{−1} on topic k; thus a popular topic must also have a large p_k and hence large overdispersion measured by the VMR. Since the counts {n_jk}_j are usually overdispersed, which is particularly true in this corpus, a middle-range p_k, indicating an appreciable mean and small overdispersion, is not favored by the model and thus is rarely observed. When (r_k, p_j) is used, as in the Gamma-NB process, the transition is much smoother, in that r_k gradually decreases. The reason is that r_k controls the mean as E[Σ_j n_jk] = r_k Σ_j p_j/(1−p_j) and the ODL r_k^{−1} on topic k; thus popular topics must also have large r_k and hence small overdispersion measured by the ODL, while unpopular topics are modeled with small r_k and thus large overdispersion, allowing rarely and lightly used topics.

Topic model toolbox: http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Figure 2: Distinct sharing mechanisms and model properties are evident between various NB processes, by comparing their inferred parameters. Note that the transition between active and non-active topics is very sharp when p_k is used and much smoother when r_k is used. Both the documents and topics are ordered in decreasing order of the number of words associated with each of them. These results are based on the last Gibbs sampling iteration. The values are shown in either linear or log scales for convenient visualization.

Therefore, we can expect that (r_k, p_j) would allow more topics than (r_j, p_k), as confirmed in Figure 1(a): the Gamma-NB process learns 177 active topics, significantly more than the 107 of the Beta-NB process. From these analyses, we can conclude that the mean and the amount of overdispersion (measured by the VMR or ODL) for the usage of topic k are positively correlated under (r_j, p_k) and negatively correlated under (r_k, p_j).

When (r_k, p_k) is used, as in the Marked-Beta-NB process, more diverse combinations of mean and overdispersion are allowed, as both r_k and p_k are now responsible for the mean E[Σ_j n_jk] = J r_k p_k/(1−p_k). For example, there could be not only large mean and small overdispersion (large r_k and small p_k), but also large mean and large overdispersion (small r_k and large p_k). Thus (r_k, p_k) may combine the advantages of using only r_k or only p_k to model topic k, as confirmed by the superior performance of the Marked-Beta-NB over the Beta-NB and Gamma-NB processes. When (r_k, π_k) is used, as in the NB-FTM model, our results show that we usually have a small π_k and a large r_k, indicating that topic k is sparsely used across the documents but, once used, exhibits little variation in usage. This modeling property might be helpful when there is an excessive number of zeros that might not be well modeled by the NB process alone. In our experiments, we find that the more direct approaches of using p_k or p_j generally yield better results, but this might not be the case when an excessive number of zeros is better explained by the underlying beta-Bernoulli or IBP processes, e.g., when the training words are scarce.

It is also interesting to compare the Gamma-NB and NB-HDP. From a mixture-modeling viewpoint, fixing p_j = 0.5 is natural, as p_j becomes irrelevant after normalization.
However, from a count modeling viewpoint, this would make the restrictive assumption that each count vector {n_jk}_{k=1,K} has the same VMR of 2, and the experimental results in Figure 1 confirm the importance of learning p_j together with r_k. It is also interesting to examine (15), which can be viewed as the concentration parameter α in the HDP: allowing the adjustment of p_j permits a more flexible model assumption on the amount of variation between the topic proportions, and thus potentially better data fitting.

6 Conclusions

We propose a variety of negative binomial (NB) processes to jointly model counts across groups, which can be naturally applied for mixture modeling of grouped data. The proposed NB processes are completely random measures, in that they assign independent random variables to disjoint Borel sets of the measure space, as opposed to the hierarchical Dirichlet process (HDP), whose measures on disjoint Borel sets are negatively correlated. We discover augment-and-conquer inference methods: by "augmenting" an NB process into both the gamma-Poisson and compound Poisson representations, we are able to "conquer" the unification of count and mixture modeling, the analysis of fundamental model properties, and the derivation of efficient Gibbs sampling inference. We demonstrate that the gamma-NB process, which shares the NB dispersion measure across groups, can be normalized to produce the HDP, and we show in detail its theoretical, structural and computational advantages over the HDP.
We examine the distinct sharing mechanisms and model properties of various NB processes, drawing connections to existing algorithms, with experimental results on topic modeling showing the importance of modeling both the NB dispersion and probability parameters.

Acknowledgments

The research reported here was supported by ARO, DOE, NGA, and ONR, and by DARPA under the MSEE and HIST programs.

References

[1] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993.
[2] M. K. Titsias. The infinite gamma-Poisson feature model. In NIPS, 2008.
[3] R. J. Thibaux. Nonparametric Bayesian Models for Machine Learning. PhD thesis, UC Berkeley, 2008.
[4] K. T. Miller. Bayesian Nonparametric Latent Feature Models. PhD thesis, UC Berkeley, 2011.
[5] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
[6] T. Broderick, L. Mackey, J. Paisley, and M. I. Jordan. Combinatorial clustering and the beta negative binomial process. arXiv:1111.1802v3, 2012.
[7] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 2006.
[8] M. I. Jordan. Hierarchical models, nested models and completely random measures. 2010.
[9] R. L. Wolpert, M. A. Clyde, and C. Tu. Stochastic expansions using continuous dictionaries: Lévy Adaptive Regression Kernels. Annals of Statistics, 2011.
[10] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1973.
[11] H. Ishwaran and M. Zarepour. Exact and approximate sum-representations for the Dirichlet process. Can. J. Statist., 2002.
[12] J. Paisley, C. Wang, and D. M. Blei. The discrete infinite logistic normal distribution. Bayesian Analysis, 2012.
[13] C. I. Bliss and R. A. Fisher. Fitting the negative binomial distribution to biological data. Biometrics, 1953.
[14] A. C. Cameron and P. K. Trivedi. Regression Analysis of Count Data. Cambridge, UK, 1998.
[15] R. Winkelmann. Econometric Analysis of Count Data. Springer, Berlin, 5th edition, 2008.
[16] M. H. Quenouille. A relation between the logarithmic, Poisson, and negative binomial series. Biometrics, 1949.
[17] N. L. Johnson, A. W. Kemp, and S. Kotz. Univariate Discrete Distributions. John Wiley & Sons, 2005.
[18] S. J. Clark and J. N. Perry. Estimation of the negative binomial parameter κ by maximum quasi-likelihood. Biometrics, 1989.
[19] M. D. Robinson and G. K. Smyth. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 2008.
[20] M. Zhou, L. Li, D. Dunson, and L. Carin. Lognormal and gamma mixed negative binomial regression. In ICML, 2012.
[21] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist., 1974.
[22] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. JASA, 1995.
[23] C. Wang, J. Paisley, and D. M. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.
[24] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. Developing a tempered HDP-HMM for systems with state persistence. MIT LIDS, TR #2777, 2007.
[25] N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Ann. Statist., 1990.
[26] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In AISTATS, 2007.
[27] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
[28] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2005.
[29] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, G. Sapiro, and L. Carin. Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images. IEEE TIP, 2012.
[30] M. Zhou, H. Yang, G. Sapiro, D. Dunson, and L. Carin. Dependent hierarchical beta process for image interpolation and denoising. In AISTATS, 2011.
[31] L. Li, M. Zhou, G. Sapiro, and L. Carin. On the integration of topic modeling and dictionary learning. In ICML, 2011.
[32] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 2003.
[33] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.
[34] J. Canny. GaP: a factor model for discrete data. In SIGIR, 2004.
[35] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.