{"title": "The Poisson Gamma Belief Network", "book": "Advances in Neural Information Processing Systems", "page_first": 3043, "page_last": 3051, "abstract": "To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. The PGBN with a single hidden layer reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.", "full_text": "The Poisson Gamma Belief Network

Mingyuan Zhou, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA

Yulai Cong, National Laboratory of RSP, Xidian University, Xi'an, Shaanxi, China

Bo Chen, National Laboratory of RSP, Xidian University, Xi'an, Shaanxi, China

Abstract

To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer.
The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. The PGBN with a single hidden layer reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.

1 Introduction

There has been significant recent interest in deep learning. Despite its tremendous success in supervised learning, inferring a multilayer data representation in an unsupervised manner remains a challenging problem [1, 2, 3]. The sigmoid belief network (SBN), which connects the binary units of adjacent layers via sigmoid functions, infers a deep representation of multivariate binary vectors [4, 5]. The deep belief network (DBN) [6] is an SBN whose top hidden layer is replaced by the restricted Boltzmann machine (RBM) [7], which is undirected. The deep Boltzmann machine (DBM) is an undirected deep network that connects the binary units of adjacent layers using RBMs [8]. All these deep networks are designed to model binary observations. Although one may modify the bottom layer to model Gaussian and multinomial observations, the hidden units of these networks are still typically restricted to be binary [8, 9, 10].
One may further consider the exponential family harmoniums [11, 12] to construct more general networks with non-binary hidden units, but often at the expense of noticeably increased complexity in training and data fitting.

Moving beyond conventional deep networks using binary hidden units, we construct a deep directed network with gamma distributed nonnegative real hidden units to unsupervisedly infer a multilayer representation of multivariate count vectors, with a simple but powerful mechanism to capture the correlations among the visible/hidden features across all layers and handle highly overdispersed counts. The proposed model is called the Poisson gamma belief network (PGBN), which factorizes the observed count vectors under the Poisson likelihood into the product of a factor loading matrix and the gamma distributed hidden units (factor scores) of layer one, and further factorizes the shape parameters of the gamma hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer. Distinct from previous deep networks that often utilize binary units for tractable inference and require tuning both the width (number of hidden units) of each layer and the network depth (number of layers), the PGBN employs nonnegative real hidden units and automatically infers the widths of subsequent layers given a fixed budget on the width of its first layer. Note that the budget could be infinite and hence the whole network can grow without bound as more data are being observed.
When the budget is finite and hence the ultimate capacity of the network is limited, we find that the PGBN equipped with a narrower first layer could increase its depth to match or even outperform a shallower network with a substantially wider first layer.

The gamma distribution density function has the highly desired strong non-linearity for deep learning, but the existence of neither a conjugate prior nor a closed-form maximum likelihood estimate for its shape parameter makes a deep network with gamma hidden units appear unattractive. Despite the seeming difficulty, we discover that, by generalizing the data augmentation and marginalization techniques for discrete data [13], one may propagate latent counts one layer at a time from the bottom data layer to the top hidden layer, with which one may derive an efficient upward-downward Gibbs sampler that, one layer at a time in each iteration, upward samples Dirichlet distributed connection weight vectors and then downward samples gamma distributed hidden units.

In addition to constructing a new deep network that well fits multivariate count data and developing an efficient upward-downward Gibbs sampler, the other contributions of the paper include: 1) combining the gamma-negative binomial process [13, 14] with a layer-wise training strategy to automatically infer the network structure; 2) revealing the relationship between the upper bound imposed on the width of the first layer and the inferred widths of subsequent layers; 3) revealing the relationship between the network depth and the model's ability to model overdispersed counts; and 4) generating a multivariate high-dimensional random count vector, whose distribution is governed by the PGBN, by propagating the gamma hidden units of the top hidden layer back to the bottom data layer.

1.1 Useful count distributions and their relationships

Let the Chinese restaurant table (CRT) distribution $l \sim \mathrm{CRT}(n, r)$ represent the distribution of a random count generated as $l = \sum_{i=1}^{n} b_i$, $b_i \sim \mathrm{Bernoulli}[r/(r + i - 1)]$. Its probability mass function (PMF) can be expressed as $P(l \mid n, r) = \frac{\Gamma(r) r^l}{\Gamma(n + r)} |s(n, l)|$, where $l \in \{0, 1, \ldots, n\}$ and the $|s(n, l)|$ are unsigned Stirling numbers of the first kind. Let $u \sim \mathrm{Log}(p)$ denote the logarithmic distribution with PMF $P(u \mid p) = \frac{-1}{\ln(1 - p)} \frac{p^u}{u}$, where $u \in \{1, 2, \ldots\}$. Let $n \sim \mathrm{NB}(r, p)$ denote the negative binomial (NB) distribution with PMF $P(n \mid r, p) = \frac{\Gamma(n + r)}{n! \, \Gamma(r)} p^n (1 - p)^r$, where $n \in \mathbb{Z} := \{0, 1, \ldots\}$. The NB distribution $n \sim \mathrm{NB}(r, p)$ can be generated as a gamma mixed Poisson distribution as $n \sim \mathrm{Pois}(\lambda)$, $\lambda \sim \mathrm{Gam}[r, p/(1 - p)]$, where $p/(1 - p)$ is the gamma scale parameter. As shown in [13], the joint distribution of $n$ and $l$ given $r$ and $p$ in $l \sim \mathrm{CRT}(n, r)$, $n \sim \mathrm{NB}(r, p)$, where $l \in \{0, \ldots, n\}$ and $n \in \mathbb{Z}$, is the same as that in $n = \sum_{t=1}^{l} u_t$, $u_t \sim \mathrm{Log}(p)$, $l \sim \mathrm{Pois}[-r \ln(1 - p)]$, which is called the Poisson-logarithmic bivariate distribution, with PMF $P(n, l \mid r, p) = \frac{|s(n, l)| r^l}{n!} p^n (1 - p)^r$.

2 The Poisson Gamma Belief Network

Assuming the observations are multivariate count vectors $x_j^{(1)} \in \mathbb{Z}^{K_0}$, the generative model of the Poisson gamma belief network (PGBN) with $T$ hidden layers, from top to bottom, is expressed as

$\theta_j^{(T)} \sim \mathrm{Gam}\big(r, 1/c_j^{(T+1)}\big), \;\cdots,\; \theta_j^{(t)} \sim \mathrm{Gam}\big(\Phi^{(t+1)} \theta_j^{(t+1)}, 1/c_j^{(t+1)}\big), \;\cdots,\; \theta_j^{(1)} \sim \mathrm{Gam}\big(\Phi^{(2)} \theta_j^{(2)}, p_j^{(2)}/(1 - p_j^{(2)})\big), \quad x_j^{(1)} \sim \mathrm{Pois}\big(\Phi^{(1)} \theta_j^{(1)}\big). \quad (1)$

The PGBN
factorizes the count observation $x_j^{(1)}$ into the product of the factor loading $\Phi^{(1)} \in \mathbb{R}_+^{K_0 \times K_1}$ and hidden units $\theta_j^{(1)} \in \mathbb{R}_+^{K_1}$ of layer one under the Poisson likelihood, where $\mathbb{R}_+ = \{x : x \ge 0\}$, and for $t = 1, 2, \ldots, T-1$, factorizes the shape parameters of the gamma distributed hidden units $\theta_j^{(t)} \in \mathbb{R}_+^{K_t}$ of layer $t$ into the product of the connection weight matrix $\Phi^{(t+1)} \in \mathbb{R}_+^{K_t \times K_{t+1}}$ and the hidden units $\theta_j^{(t+1)} \in \mathbb{R}_+^{K_{t+1}}$ of layer $t+1$; the top layer's hidden units $\theta_j^{(T)}$ share the same vector $r = (r_1, \ldots, r_{K_T})'$ as their gamma shape parameters; and the $p_j^{(2)}$ are probability parameters and the $\{1/c_j^{(t)}\}_{3,T+1}$ are gamma scale parameters, with $c_j^{(2)} := \big(1 - p_j^{(2)}\big)\big/p_j^{(2)}$.

For scale identifiability and ease of inference, each column of $\Phi^{(t)} \in \mathbb{R}_+^{K_{t-1} \times K_t}$ is restricted to have a unit $L_1$ norm. To complete the hierarchical model, for $t \in \{1, \ldots, T-1\}$, we let

$\phi_k^{(t)} \sim \mathrm{Dir}\big(\eta^{(t)}, \ldots, \eta^{(t)}\big), \quad r_k \sim \mathrm{Gam}\big(\gamma_0/K_T, 1/c_0\big) \quad (2)$

and impose $c_0 \sim \mathrm{Gam}(e_0, 1/f_0)$ and $\gamma_0 \sim \mathrm{Gam}(a_0, 1/b_0)$; and for $t \in \{3, \ldots, T+1\}$, we let

$p_j^{(2)} \sim \mathrm{Beta}(a_0, b_0), \quad c_j^{(t)} \sim \mathrm{Gam}(e_0, 1/f_0). \quad (3)$

We expect the correlations between the rows (features) of $(x_1^{(1)}, \ldots, x_J^{(1)})$ to be captured by the columns of $\Phi^{(1)}$, and the correlations between the rows (latent features) of $(\theta_1^{(t)}, \ldots, \theta_J^{(t)})$ to be captured by the columns of $\Phi^{(t+1)}$. Even if all $\Phi^{(t)}$ for $t \ge 2$ are identity matrices, indicating no correlations between latent features, our analysis will show that a deep structure with $T \ge 2$ could still benefit data fitting by better modeling the variability of the latent features $\theta_j^{(1)}$.

Sigmoid and deep belief networks.
Under the hierarchical model in (1), given the connection weight matrices, the joint distribution of the count observations and gamma hidden units of the PGBN can be expressed, similar to those of the sigmoid and deep belief networks [3], as

$P\big(x_j^{(1)}, \{\theta_j^{(t)}\}_t \mid \{\Phi^{(t)}\}_t\big) = P\big(x_j^{(1)} \mid \Phi^{(1)}, \theta_j^{(1)}\big) \Big[\prod_{t=1}^{T-1} P\big(\theta_j^{(t)} \mid \Phi^{(t+1)}, \theta_j^{(t+1)}, c_j^{(t+1)}\big)\Big] P\big(\theta_j^{(T)}\big).$

With $\phi_{v:}$ representing the $v$th row of $\Phi$, for the gamma hidden units $\theta_{vj}^{(t)}$ we have

$P\big(\theta_{vj}^{(t)} \mid \phi_{v:}^{(t+1)}, \theta_j^{(t+1)}, c_j^{(t+1)}\big) = \frac{\big(c_j^{(t+1)}\big)^{\phi_{v:}^{(t+1)} \theta_j^{(t+1)}}}{\Gamma\big(\phi_{v:}^{(t+1)} \theta_j^{(t+1)}\big)} \big(\theta_{vj}^{(t)}\big)^{\phi_{v:}^{(t+1)} \theta_j^{(t+1)} - 1} e^{-c_j^{(t+1)} \theta_{vj}^{(t)}}, \quad (4)$

which are highly nonlinear functions that are strongly desired in deep learning. By contrast, with the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ and bias terms $b_v^{(t+1)}$, a sigmoid/deep belief network would connect the binary hidden units $\theta_{vj}^{(t)} \in \{0, 1\}$ of layer $t$ (for deep belief networks, $t < T - 1$) to the product of the connection weights and binary hidden units of the next layer with

$P\big(\theta_{vj}^{(t)} = 1 \mid \phi_{v:}^{(t+1)}, b_v^{(t+1)}, \theta_j^{(t+1)}\big) = \sigma\big(b_v^{(t+1)} + \phi_{v:}^{(t+1)} \theta_j^{(t+1)}\big). \quad (5)$

Comparing (4) with (5) clearly shows the differences between the gamma nonnegative hidden units and the sigmoid link based binary hidden units.
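To make the contrast concrete, here is a minimal NumPy sketch (all function and variable names are ours, not from the paper's code) of one top-down step under (4) versus (5): a gamma unit draws a nonnegative real value whose gamma shape is the inner product $\phi_{v:}^{(t+1)} \theta_j^{(t+1)}$, while a sigmoid unit thresholds the analogous inner product into a binary value.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_units_down(phi, theta_next, c):
    """One downward draw under eq. (4): theta^(t) ~ Gam(Phi^(t+1) theta^(t+1), 1/c)."""
    shape = phi @ theta_next            # K_t-dimensional vector of gamma shapes
    return rng.gamma(shape, 1.0 / c)    # nonnegative real hidden units

def sigmoid_units_down(w, theta_next, b):
    """The SBN analogue, eq. (5): binary units through a sigmoid link."""
    p = 1.0 / (1.0 + np.exp(-(b + w @ theta_next)))
    return (rng.random(p.shape) < p).astype(float)

K_t, K_next = 5, 3
phi = rng.dirichlet(np.ones(K_t), size=K_next).T    # (K_t, K_next); columns sum to one
theta_next = rng.gamma(1.0, 1.0, size=K_next)       # hidden units of layer t+1
theta_t = gamma_units_down(phi, theta_next, c=1.0)                               # in (0, inf)
binary_t = sigmoid_units_down(rng.normal(size=(K_t, K_next)), theta_next, b=0.0)  # in {0, 1}
```

The gamma draw keeps the full magnitude information of the next layer's units, which is exactly what the binary thresholding discards.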
Note that the rectified linear units have emerged as powerful alternatives to sigmoid units for introducing nonlinearity [15]. It would be interesting to use the gamma units to introduce nonlinearity in the positive region of the rectified linear units.

Deep Poisson factor analysis. With $T = 1$, the PGBN specified by (1)-(3) reduces to Poisson factor analysis (PFA) using the (truncated) gamma-negative binomial process [13], which is also related to latent Dirichlet allocation [16] if the Dirichlet priors are imposed on both $\phi_k^{(1)}$ and $\theta_j^{(1)}$. With $T \ge 2$, the PGBN is related to the gamma Markov chain hinted at by Corollary 2 of [13] and realized in [17], the deep exponential family of [18], and the deep PFA of [19]. Different from the PGBN, in [18] it is the gamma scale but not shape parameters that are chained and factorized; in [19] it is the correlations between binary topic usage indicators but not the full connection weights that are captured; and neither [18] nor [19] provides a principled way to learn the network structure. Below we break the PGBN of $T$ layers into $T$ related submodels that are solved with the same subroutine.

2.1 The propagation of latent counts and model properties

Lemma 1 (Augment-and-conquer the PGBN). With $p_j^{(1)} := 1 - e^{-1}$ and

$p_j^{(t+1)} := \frac{-\ln\big(1 - p_j^{(t)}\big)}{c_j^{(t+1)} - \ln\big(1 - p_j^{(t)}\big)} \quad (6)$

for $t = 1, \ldots, T$, one may connect the observed (if $t = 1$) or some latent (if $t \ge 2$) counts $x_j^{(t)} \in \mathbb{Z}^{K_{t-1}}$ to the product $\Phi^{(t)} \theta_j^{(t)}$ at layer $t$ under the Poisson likelihood as

$x_j^{(t)} \sim \mathrm{Pois}\big[-\Phi^{(t)} \theta_j^{(t)} \ln\big(1 - p_j^{(t)}\big)\big]. \quad (7)$

Proof. By definition (7) is true for layer $t = 1$. Suppose that (7) is true for layer $t \ge 2$; then we can augment each count $x_{vj}^{(t)}$ into the summation of $K_t$ latent counts that are smaller than or equal to it as

$x_{vj}^{(t)} = \sum_{k=1}^{K_t} x_{vjk}^{(t)}, \quad x_{vjk}^{(t)} \sim \mathrm{Pois}\big[-\phi_{vk}^{(t)} \theta_{kj}^{(t)} \ln\big(1 - p_j^{(t)}\big)\big], \quad (8)$

where $v \in \{1, \ldots, K_{t-1}\}$. With $m_{kj}^{(t)(t+1)} := x_{\cdot jk}^{(t)} := \sum_{v=1}^{K_{t-1}} x_{vjk}^{(t)}$ representing the number of times that factor $k \in \{1, \ldots, K_t\}$ of layer $t$ appears in observation $j$, and $m_j^{(t)(t+1)} := \big(x_{\cdot j1}^{(t)}, \ldots, x_{\cdot jK_t}^{(t)}\big)'$, since $\sum_{v=1}^{K_{t-1}} \phi_{vk}^{(t)} = 1$, we can marginalize out $\Phi^{(t)}$ as in [20], leading to

$m_j^{(t)(t+1)} \sim \mathrm{Pois}\big[-\theta_j^{(t)} \ln\big(1 - p_j^{(t)}\big)\big].$

Further marginalizing out the gamma distributed $\theta_j^{(t)}$ from the above Poisson likelihood leads to

$m_j^{(t)(t+1)} \sim \mathrm{NB}\big(\Phi^{(t+1)} \theta_j^{(t+1)}, p_j^{(t+1)}\big). \quad (9)$

The $k$th element of $m_j^{(t)(t+1)}$ can be augmented under its compound Poisson representation as

$m_{kj}^{(t)(t+1)} = \sum_{\ell=1}^{x_{kj}^{(t+1)}} u_\ell, \quad u_\ell \sim \mathrm{Log}\big(p_j^{(t+1)}\big), \quad x_{kj}^{(t+1)} \sim \mathrm{Pois}\big[-\phi_{k:}^{(t+1)} \theta_j^{(t+1)} \ln\big(1 - p_j^{(t+1)}\big)\big].$

Thus if (7) is true for layer $t$, then it is also true for layer $t + 1$.

Corollary 2 (Propagate the latent counts upward). Using Lemma 4.1 of [20] on (8) and Theorem 1 of [13] on (9), we can propagate the latent counts $x_{vj}^{(t)}$ of layer $t$ upward to layer $t + 1$ as

$\Big\{\big(x_{vj1}^{(t)}, \ldots, x_{vjK_t}^{(t)}\big) \mid x_{vj}^{(t)}, \phi_{v:}^{(t)}, \theta_j^{(t)}\Big\} \sim \mathrm{Mult}\Big(x_{vj}^{(t)}; \frac{\phi_{v1}^{(t)} \theta_{1j}^{(t)}}{\sum_{k=1}^{K_t} \phi_{vk}^{(t)} \theta_{kj}^{(t)}}, \ldots, \frac{\phi_{vK_t}^{(t)} \theta_{K_t j}^{(t)}}{\sum_{k=1}^{K_t} \phi_{vk}^{(t)} \theta_{kj}^{(t)}}\Big), \quad (10)$

$\big(x_{kj}^{(t+1)} \mid m_{kj}^{(t)(t+1)}, \phi_{k:}^{(t+1)}, \theta_j^{(t+1)}\big) \sim \mathrm{CRT}\big(m_{kj}^{(t)(t+1)}, \phi_{k:}^{(t+1)} \theta_j^{(t+1)}\big). \quad (11)$

As $x_{\cdot j}^{(t)} = m_{\cdot j}^{(t)(t+1)}$ and $x_{kj}^{(t+1)}$ is in the same order as $\ln\big(m_{kj}^{(t)(t+1)}\big)$, the total count of layer $t+1$, expressed as $\sum_j x_{\cdot j}^{(t+1)}$, would often be much smaller than that of layer $t$, expressed as $\sum_j x_{\cdot j}^{(t)}$. Thus the PGBN may use $\sum_j x_{\cdot j}^{(T)}$ as a simple criterion to decide whether to add more layers.

2.2 Modeling overdispersed counts

In comparison to a single-layer shallow model with $T = 1$ that assumes the hidden units of layer one to be independent in the prior, the multilayer deep model with $T \ge 2$ captures the correlations between them. Note that for the extreme case that the $\Phi^{(t)} = I_{K_t}$ for $t \ge 2$ are all identity matrices, which indicates that there are no correlations between the features of $\theta_j^{(t-1)}$ left to be captured, the deep structure could still provide benefits as it helps model latent counts $m_j^{(1)(2)}$ that may be highly overdispersed. For example, supposing $\Phi^{(t)} = I_{K_2}$ for all $t \ge 2$, then from (1) and (9) we have

$m_{kj}^{(1)(2)} \sim \mathrm{NB}\big(\theta_{kj}^{(2)}, p_j^{(2)}\big), \; \theta_{kj}^{(2)} \sim \mathrm{Gam}\big(\theta_{kj}^{(3)}, 1/c_j^{(3)}\big), \; \ldots, \; \theta_{kj}^{(t)} \sim \mathrm{Gam}\big(\theta_{kj}^{(t+1)}, 1/c_j^{(t+1)}\big), \; \ldots, \; \theta_{kj}^{(T)} \sim \mathrm{Gam}\big(r_k, 1/c_j^{(T+1)}\big).$

For simplicity, let us further assume $c_j^{(t)} = 1$ for all $t \ge 3$. Using the laws of total expectation and total variance, we have $E\big[\theta_{kj}^{(2)} \mid r_k\big] = r_k$ and $\mathrm{Var}\big[\theta_{kj}^{(2)} \mid r_k\big] = (T - 1) r_k$, and hence

$E\big[m_{kj}^{(1)(2)} \mid r_k\big] = \frac{r_k p_j^{(2)}}{1 - p_j^{(2)}}, \quad \mathrm{Var}\big[m_{kj}^{(1)(2)} \mid r_k\big] = r_k p_j^{(2)} \big(1 - p_j^{(2)}\big)^{-2} \big[1 + (T - 1) p_j^{(2)}\big].$

In comparison to PFA with $m_{kj}^{(1)(2)} \mid r_k \sim \mathrm{NB}\big(r_k, p_j^{(2)}\big)$, with a variance-to-mean ratio of $1/\big(1 - p_j^{(2)}\big)$, the PGBN with $T$ hidden layers, which mixes the shape of $m_{kj}^{(1)(2)} \sim \mathrm{NB}\big(\theta_{kj}^{(2)}, p_j^{(2)}\big)$ with a chain of gamma random variables, increases the variance-to-mean ratio of the latent count $m_{kj}^{(1)(2)}$ given $r_k$ by a factor of $1 + (T - 1) p_j^{(2)}$, and hence could better model highly overdispersed counts.

2.3 Upward-downward Gibbs sampling

With Lemma 1 and Corollary 2 and the width of the first layer being bounded by $K_{1\max}$, we develop an upward-downward Gibbs sampler for the PGBN, each iteration of which proceeds as follows:

Sample $x_{vjk}^{(t)}$. We can sample $x_{vjk}^{(t)}$ for all layers using (10). But for the first hidden layer, we may treat each observed count $x_{vj}^{(1)}$ as a sequence of word tokens at the $v$th term (in a vocabulary of size $V := K_0$) in the $j$th document, and assign the $x_{\cdot j}^{(1)}$ words $\{v_{ji}\}_{i=1,x_{\cdot j}^{(1)}}$ one after another to the latent factors (topics), with both the topics $\Phi^{(1)}$ and topic weights $\theta_j^{(1)}$ marginalized out, as

$P(z_{ji} = k \mid -) \propto \frac{\eta^{(1)} + x_{v_{ji} \cdot k}^{(1)-ji}}{V \eta^{(1)} + x_{\cdot \cdot k}^{(1)-ji}} \big(x_{\cdot jk}^{(1)-ji} + \phi_{k:}^{(2)} \theta_{kj}^{(2)}\big), \quad k \in \{1, \ldots, K_{1\max}\}, \quad (12)$

where $z_{ji}$ is the topic index for $v_{ji}$ and $x_{vjk}^{(1)} := \sum_i \delta(v_{ji} = v, z_{ji} = k)$ counts the number of times that term $v$ appears in document $j$ and is assigned to topic $k$; we use the $\cdot$ symbol to represent summing over the corresponding index, e.g., $x_{\cdot jk}^{(t)} := \sum_v x_{vjk}^{(t)}$, and use $x^{-ji}$ to denote the count $x$ calculated without considering word $i$ in document $j$. The collapsed Gibbs sampling update equation shown above is related to the one developed in [21] for latent Dirichlet allocation, and to the one developed in [22] for PFA using the beta-negative binomial process. When $T = 1$, we would replace the terms $\phi_{k:}^{(2)} \theta_j^{(2)}$ with $r_k$ for PFA built on the gamma-negative binomial process [13] (or with $\alpha \pi_k$ for the hierarchical Dirichlet process latent Dirichlet allocation; see [23] and [22] for details), and add an additional term to account for the possibility of creating an additional topic [22].
For simplicity, in this paper, we truncate the nonparametric Bayesian model with $K_{1\max}$ factors and let $r_k \sim \mathrm{Gam}(\gamma_0/K_{1\max}, 1/c_0)$ if $T = 1$.

Sample $\phi_k^{(t)}$. Given these latent counts, we sample the factors/topics $\phi_k^{(t)}$ as

$\big(\phi_k^{(t)} \mid -\big) \sim \mathrm{Dir}\big(\eta^{(t)} + x_{1 \cdot k}^{(t)}, \ldots, \eta^{(t)} + x_{K_{t-1} \cdot k}^{(t)}\big). \quad (13)$

Sample $x_{vj}^{(t+1)}$. We sample $x_{vj}^{(t+1)}$ using (11), replacing $\Phi^{(T+1)} \theta_j^{(T+1)}$ with $r := (r_1, \ldots, r_{K_T})'$.

Sample $\theta_j^{(t)}$. Using (7) and the gamma-Poisson conjugacy, we sample $\theta_j^{(t)}$ as

$\big(\theta_j^{(t)} \mid -\big) \sim \mathrm{Gam}\Big(\Phi^{(t+1)} \theta_j^{(t+1)} + m_j^{(t)(t+1)}, \big[c_j^{(t+1)} - \ln\big(1 - p_j^{(t)}\big)\big]^{-1}\Big). \quad (14)$

Sample $r$. Both $\gamma_0$ and $c_0$ are sampled using related equations in [13]. We sample $r$ as

$\big(r_v \mid -\big) \sim \mathrm{Gam}\Big(\gamma_0/K_T + x_{v \cdot}^{(T+1)}, \big[c_0 - \textstyle\sum_j \ln\big(1 - p_j^{(T+1)}\big)\big]^{-1}\Big). \quad (15)$

Sample $c_j^{(t)}$. With $\theta_{\cdot j}^{(t)} := \sum_{k=1}^{K_t} \theta_{kj}^{(t)}$ for $t \le T$ and $\theta_{\cdot j}^{(T+1)} := r_\cdot$, we sample $p_j^{(2)}$ and $\{c_j^{(t)}\}_{t \ge 3}$ as

$\big(p_j^{(2)} \mid -\big) \sim \mathrm{Beta}\big(a_0 + m_{\cdot j}^{(1)(2)}, b_0 + \theta_{\cdot j}^{(2)}\big), \quad \big(c_j^{(t)} \mid -\big) \sim \mathrm{Gam}\Big(e_0 + \theta_{\cdot j}^{(t)}, \big[f_0 + \theta_{\cdot j}^{(t-1)}\big]^{-1}\Big), \quad (16)$

and calculate $c_j^{(2)}$ and $\{p_j^{(t)}\}_{t \ge 3}$ with (6).

2.4 Learning the network structure with layer-wise training

As jointly training all layers together is often difficult, existing deep networks are typically trained using a greedy layer-wise unsupervised training algorithm, such as the one proposed in [6] to train the deep belief networks. The effectiveness of this training strategy is further analyzed in [24]. By contrast, the PGBN has a simple Gibbs sampler to jointly train all its hidden layers, as described in Section 2.3, and hence does not require greedy layer-wise training. Yet, the same as commonly used deep learning algorithms, it still needs to specify the number of layers and the width of each layer. In this paper, we adopt the idea of layer-wise training for the PGBN, not because of the lack of an effective joint-training algorithm, but for the purpose of learning the width of each hidden layer in a greedy layer-wise manner, given a fixed budget on the width of the first layer. The proposed layer-wise training strategy is summarized in Algorithm 1. With a PGBN of $T - 1$ layers that has already been trained, the key idea is to use a truncated gamma-negative binomial process [13] to model the latent count matrix for the newly added top layer as $m_{kj}^{(T)(T+1)} \sim \mathrm{NB}\big(r_k, p_j^{(T+1)}\big)$, $r_k \sim \mathrm{Gam}(\gamma_0/K_{T\max}, 1/c_0)$, and rely on that stochastic process's shrinkage mechanism to prune inactive factors (connection weight vectors) of layer $T$, and hence the inferred $K_T$ would be smaller than $K_{T\max}$ if $K_{T\max}$ is sufficiently large.

Algorithm 1 The PGBN upward-downward Gibbs sampler that uses a layer-wise training strategy to train a set of networks, each of which adds an additional hidden layer on top of the previously inferred network, retrains all its layers jointly, and prunes inactive factors from the last layer. Inputs: observed counts $\{x_{vj}\}_{v,j}$, upper bound of the width of the first layer $K_{1\max}$, upper bound of the number of layers $T_{\max}$, and hyper-parameters. Outputs: A total of $T_{\max}$ jointly trained PGBNs with depths $T = 1$, $T = 2$, $\ldots$, and $T = T_{\max}$.

1: for $T = 1, 2, \ldots, T_{\max}$ do (Jointly train all the $T$ layers of the network)
2:   Set $K_{T-1}$, the inferred width of layer $T - 1$, as $K_{T\max}$, the upper bound of layer $T$'s width.
3:   for iter $= 1 : B_T + C_T$ do (Upward-downward Gibbs sampling)
4:     Sample $\{z_{ji}\}_{j,i}$ using collapsed inference; Calculate $\{x_{vjk}^{(1)}\}_{v,j,k}$; Sample $\{x_{vj}^{(2)}\}_{v,j}$;
5:     for $t = 2, 3, \ldots, T$ do
6:       Sample $\{x_{vjk}^{(t)}\}_{v,j,k}$; Sample $\{\phi_k^{(t)}\}_k$; Sample $\{x_{vj}^{(t+1)}\}_{v,j}$;
7:     end for
8:     Sample $p_j^{(2)}$ and Calculate $c_j^{(2)}$; Sample $\{c_j^{(t)}\}_{j,t}$ and Calculate $\{p_j^{(t)}\}_{j,t}$ for $t = 3, \ldots, T + 1$;
9:     for $t = T, T - 1, \ldots, 2$ do
10:      Sample $r$ if $t = T$; Sample $\{\theta_j^{(t)}\}_j$;
11:    end for
12:    if iter $= B_T$ then
13:      Prune layer $T$'s inactive factors $\{\phi_k^{(T)}\}_{k: x_{\cdot\cdot k}^{(T)} = 0}$, let $K_T = \sum_k \delta\big(x_{\cdot\cdot k}^{(T)} > 0\big)$, and update $r$;
14:    end if
15:  end for
16:  Output the posterior means (according to the last MCMC sample) of all remaining factors $\{\phi_k^{(t)}\}_{k,t}$ as the inferred network of $T$ layers, and $\{r_k\}_{k=1}^{K_T}$ as the gamma shape parameters of layer $T$'s hidden units.
17: end for
The newly added layer and the layers below it would be jointly trained, but with the structure below the newly added layer kept unchanged. Note that when $T = 1$, the PGBN would infer the number of active factors if $K_{1\max}$ is set large enough; otherwise, it would still assign the factors different weights $r_k$, but may not be able to prune any of them.

3 Experimental Results

We apply the PGBNs to topic modeling of text corpora, each document of which is represented as a term-frequency count vector. Note that the PGBN with a single hidden layer is identical to the (truncated) gamma-negative binomial process PFA of [13], which is a nonparametric Bayesian algorithm that performs similarly to the hierarchical Dirichlet process latent Dirichlet allocation [23] for text analysis, and is considered a strong baseline that outperforms a large number of topic modeling algorithms. Thus we focus our comparisons on the PGBN with a single layer, with its layer width set to be large to approximate the performance of the gamma-negative binomial process PFA. We evaluate the PGBNs' performance by examining both how well they unsupervisedly extract low-dimensional features for document classification, and how well they predict heldout word tokens. Matlab code will be available at http://mingyuanzhou.github.io/.

We use Algorithm 1 to learn, in a layer-wise manner, from the training data the weight matrices $\Phi^{(1)}, \ldots, \Phi^{(T_{\max})}$ and the top-layer hidden units' gamma shape parameters $r$: to add layer $T$ to a previously trained network with $T - 1$ layers, we use $B_T$ iterations to jointly train $\Phi^{(T)}$ and $r$ together with $\{\Phi^{(t)}\}_{1,T-1}$, prune the inactive factors of layer $T$, and continue the joint training with another $C_T$ iterations. We set the hyper-parameters as $a_0 = b_0 = 0.01$ and $e_0 = f_0 = 1$.
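The greedy layer-wise schedule of Algorithm 1 reduces to a short driver loop. The sketch below is ours, not the authors' released code: `fake_gibbs` is a toy stand-in for one joint upward-downward sweep that returns the top layer's per-factor latent count totals $x^{(T)}_{\cdot\cdot k}$, and only the budgeting and pruning logic is meant to be faithful.

```python
import numpy as np

def prune(counts):
    """Keep factors with nonzero total latent count x^(T)_..k (Algorithm 1, line 13)."""
    return int((counts > 0).sum())

def train_layerwise(T_max, K1_max, B, C, gibbs_iteration):
    """Greedy layer-wise structure learning in the spirit of Algorithm 1:
    add one layer at a time, run B[t] joint sweeps, prune the new top
    layer's inactive factors, then run C[t] more sweeps."""
    widths = []
    for T in range(1, T_max + 1):
        K_T_max = K1_max if T == 1 else widths[-1]   # budget = previously inferred width
        K_T = K_T_max
        for it in range(1, B[T - 1] + C[T - 1] + 1):
            counts = gibbs_iteration(T, K_T_max)
            if it == B[T - 1]:        # prune once after B_T burn-in sweeps
                K_T = prune(counts)
                K_T_max = K_T         # continue the joint training at the pruned width
        widths.append(K_T)
    return widths

rng = np.random.default_rng(0)
fake_gibbs = lambda T, K: rng.binomial(5, 0.6, size=K)   # toy stand-in for the sampler
widths = train_layerwise(T_max=3, K1_max=8, B=[4, 4, 4], C=[2, 2, 2],
                         gibbs_iteration=fake_gibbs)
```

Because each new layer's budget is the previously inferred width, the inferred widths can never increase with depth, matching the monotone structures reported in the experiments.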
Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burn-in iterations to estimate the posterior mean of the feature usage proportion vector $\theta_j^{(1)}/\theta_{\cdot j}^{(1)}$ at the first hidden layer, for every document in both the training and testing sets.

Feature learning for binary classification. We consider the 20newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) that consists of 18,774 documents from 20 different news groups, with a vocabulary of size $K_0$ = 61,188. It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. We first consider two binary classification tasks that distinguish between the comp.sys.ibm.pc.hardware and comp.sys.mac.hardware, and between the sci.electronics and sci.med news groups. For each binary classification task, we remove a standard list of stop words and only consider the terms that appear at least five times, and report the classification accuracies based on 12 independent random trials. With the upper bound of the first layer's width set as $K_{1\max} \in \{25, 50, 100, 200, 400, 600, 800\}$, and $B_t = C_t = 1000$ and $\eta^{(t)} = 0.01$ for all $t$, we use Algorithm 1 to train a network with $T \in \{1, 2, \ldots, 8\}$ layers. Denote $\bar{\theta}_j$ as the estimated $K_1$ dimensional feature vector for document $j$, where $K_1 \le K_{1\max}$ is the inferred number of active factors of the first layer, bounded by the pre-specified truncation level $K_{1\max}$. We use the $L_2$ regularized logistic regression provided by the LIBLINEAR package [25] to train a linear classifier on $\bar{\theta}_j$ in the training set and use it to classify $\bar{\theta}_j$ in the test set, where the regularization parameter is five-fold cross-validated on the training set from $(2^{-10}, 2^{-9}, \ldots, 2^{15})$.

Figure 1: Classification accuracy (%) as a function of the network depth $T$ for two 20newsgroups binary classification tasks, with $\eta^{(t)} = 0.01$ for all layers. (a)-(b): the boxplots of the accuracies of 12 independent runs with $K_{1\max}$ = 800. (c)-(d): the average accuracies of these 12 runs for various $K_{1\max}$ and $T$. Note that $K_{1\max}$ = 800 is large enough to cover all active first-layer topics (inferred to be around 500 for both binary classification tasks), whereas all the first-layer topics would be used if $K_{1\max}$ = 25, 50, 100, or 200.

Figure 2: Classification accuracy (%) of the PGBNs for 20newsgroups multi-class classification (a) as a function of the depth $T$ with various $K_{1\max}$ and (b) as a function of $K_{1\max}$ with various depths, with $\eta^{(t)} = 0.05$ for all layers. The widths of the hidden layers are automatically inferred, with $K_{1\max}$ = 50, 100, 200, 400, 600, or 800. Note that $K_{1\max}$ = 800 is large enough to cover all active first-layer topics, whereas all the first-layer topics would be used if $K_{1\max}$ = 50, 100, or 200.

As shown in Fig. 1, modifying the PGBN from a single-layer shallow network to a multilayer deep one clearly improves the quality of the unsupervisedly extracted feature vectors. In a random trial, with $K_{1\max}$ = 800, we infer a network structure of $(K_1, \ldots, K_8)$ = (512, 154, 75, 54, 47, 37, 34, 29) for the first binary classification task, and $(K_1, \ldots, K_8)$ = (491, 143, 74, 49, 36, 32, 28, 26) for the second one. Figs. 1(c)-(d) also show that increasing the network depth in general improves the performance, but the first-layer width clearly plays an important role in controlling the ultimate network capacity. This insight is further illustrated below.

Feature learning for multi-class classification. We test the PGBNs for multi-class classification on 20newsgroups. After removing a standard list of stopwords and the terms that appear less than five times, we obtain a vocabulary with $K_0$ = 33,420.
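In both classification tasks, the document features fed to the classifier are posterior means of normalized first-layer hidden units. A minimal sketch of that estimator (names ours), averaging $\theta_j^{(1)}/\theta_{\cdot j}^{(1)}$ over the collected Gibbs samples for one document:

```python
import numpy as np

def feature_usage_mean(theta_samples):
    """Posterior-mean feature-usage proportions for one document:
    theta_samples has shape (n_mcmc_samples, K_1); each sample is
    normalized to theta^(1)_j / theta^(1)_.j, then averaged."""
    props = theta_samples / theta_samples.sum(axis=1, keepdims=True)
    return props.mean(axis=0)

rng = np.random.default_rng(0)
samples = rng.gamma(2.0, 1.0, size=(500, 10))   # stand-in for 500 collected MCMC draws
theta_bar = feature_usage_mean(samples)
```

Each `theta_bar` sums to one, so the downstream $L_2$ regularized logistic regression sees a proper proportion vector regardless of a document's length.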
We set Ct = 500 and η(t) = 0.05 for all t. If K1max ≤ 400, we set Bt = 1000 for all t; otherwise we set B1 = 1000 and Bt = 500 for t ≥ 2. We use all 11,269 training documents to infer a set of networks with Tmax ∈ {1, . . . , 5} and K1max ∈ {50, 100, 200, 400, 600, 800}, and mimic the same testing procedure used for binary classification to extract low-dimensional feature vectors, with which each testing document is classified to one of the 20 newsgroups using the L2-regularized logistic regression. Fig. 2 shows a clear trend of improvement in classification accuracy by increasing the network depth with a limited first-layer width, or by increasing the upper bound of the width of the first layer with the depth fixed. For example, a single-layer PGBN with K1max = 100 could add one or more layers to slightly outperform a single-layer PGBN with K1max = 200, and a single-layer PGBN with K1max = 200 could add layers to clearly outperform a single-layer PGBN with K1max as large as 800. We also note that each iteration of jointly training multiple layers costs moderately more than that of training a single layer; e.g., with K1max = 400, a training iteration on a single core of an Intel Xeon 2.7 GHz CPU on average takes about 5.6, 6.7, and 7.1 seconds for the PGBN with 1, 3, and 5 layers, respectively.

[Figure 1: classification accuracy as a function of the number of layers T, with panels (a)/(c) for ibm.pc.hardware vs mac.hardware and (b)/(d) for sci.electronics vs sci.med, one curve per K1max ∈ {25, 50, 100, 200, 400, 600, 800}. Figure 2: classification accuracy on 20newsgroups, (a) as a function of the depth T for each K1max and (b) as a function of K1max for each depth T ∈ {1, . . . , 5}.]

Figure 3: (a) per-heldout-word perplexity (the lower the better) for the NIPS12 corpus (using the 2000 most frequent terms) as a function of the upper bound of the first-layer width K1max and network depth T, with 30% of the word tokens in each document used for training and η(t) = 0.05 for all t. (b) For visualization, each curve in (a) is reproduced by subtracting its values from the average perplexity of the single-layer network.

Examining the inferred network structure also reveals interesting details. For example, in a random trial with Algorithm 1, the inferred network widths (K1, . . . , K5) are (50, 50, 50, 50, 50), (200, 161, 130, 94, 63), (528, 129, 109, 98, 91), and (608, 100, 99, 96, 89), for K1max = 50, 200, 600, and 800, respectively. This indicates that for a network with an insufficient budget on its first-layer width, as the network depth increases, its inferred layer widths decay more slowly than those of a network with a sufficient or surplus budget on its first-layer width; and a network with a surplus budget on its first-layer width may only need relatively small widths for its higher hidden layers.
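The "decays more slowly" observation can be made concrete by computing the layer-to-layer retention ratios K(t+1)/K(t) for the width sequences reported above. This is a small illustrative calculation on the reported numbers, not part of the paper's algorithm:

```python
# Inferred layer widths reported in the text for two first-layer budgets.
widths_small_budget = [50, 50, 50, 50, 50]    # K1max = 50  (insufficient budget)
widths_large_budget = [608, 100, 99, 96, 89]  # K1max = 800 (surplus budget)

def retention_ratios(widths):
    """Layer-to-layer width ratios K_{t+1} / K_t; values near 1 mean slow decay."""
    return [b / a for a, b in zip(widths, widths[1:])]

r_small = retention_ratios(widths_small_budget)  # all 1.0: no decay at all
r_large = retention_ratios(widths_large_budget)  # sharp drop after the first layer
```

The small-budget network keeps every layer at full width (all ratios equal 1), while the surplus-budget network contracts sharply after the first layer and then stays nearly flat, matching the qualitative description above.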
In the Appendix, we provide comparisons of accuracies between the PGBN and other related algorithms, including those of [9] and [26], on similar multi-class document classification tasks.

Perplexities for heldout words. In addition to examining the performance of the PGBN for unsupervised feature learning, we also consider a more direct approach, in which we randomly choose 30% of the word tokens in each document as training and use the remaining ones to calculate per-heldout-word perplexity. We consider the NIPS12 (http://www.cs.nyu.edu/~roweis/data.html) corpus, limiting the vocabulary to the 2000 most frequent terms. We set η(t) = 0.05 and Ct = 500 for all t, set B1 = 1000 and Bt = 500 for t ≥ 2, and consider five random trials. Among the Bt + Ct Gibbs sampling iterations used to train layer t, we collect one sample per five iterations during the last 500 iterations, for each of which we draw the topics {φ(1)k}k and topic weights θ(1)j to compute the per-heldout-word perplexity using Equation (34) of [13]. As shown in Fig. 3, we observe a clear trend of improvement by increasing both K1max and T.

Qualitative analysis and document simulation. In addition to these quantitative experiments, we have also examined the topics learned at each layer. We use (Φ(1) · · · Φ(t−1)) φ(t)k to project topic k of layer t to a V-dimensional word probability vector. Generally speaking, the topics at lower layers are more specific, whereas those at higher layers are more general. E.g., examining the results used to produce Fig. 3, with K1max = 200 and T = 5, the PGBN infers a network with (K1, . . . , K5) = (200, 164, 106, 60, 42). The ranks (by popularity) and top five words of three example topics for layer T = 5 are "6 network units input learning training," "15 data model learning set image," and "34 network learning model input neural;" while those of five example topics of layer T = 1 are "19 likelihood em mixture parameters data," "37 bayesian posterior prior log evidence," "62 variables belief networks conditional inference," "126 boltzmann binary machine energy hinton," and "127 speech speaker acoustic vowel phonetic." We have also tried drawing θ(T) ∼ Gam(r, 1/c(T+1)) and downward passing it through the T-layer network to generate synthetic documents, which are found to be quite interpretable and to reflect various general aspects of the corpus used to train the network. We provide in the Appendix a number of synthetic documents generated from a PGBN trained on the 20newsgroups corpus, whose inferred structure is (K1, . . . , K5) = (608, 100, 99, 96, 89).

4 Conclusions

The Poisson gamma belief network is proposed to extract a multilayer deep representation for high-dimensional count vectors, with an efficient upward-downward Gibbs sampler to jointly train all its layers and a layer-wise training strategy to automatically infer the network structure. Example results clearly demonstrate the advantages of deep topic models. For big data problems, in practice one may rarely have a sufficient budget to allow the first-layer width to grow without bound; thus it is natural to consider a belief network that can use a deep representation not only to enhance its representation power, but also to better allocate its computational resources.
Our algorithm achieves a good compromise between the widths of the hidden layers and the depth of the network.

Acknowledgements. M. Zhou thanks TACC for computational support. B. Chen thanks the support of the Thousand Young Talent Program of China, NSC-China (61372132), and NCET-13-0945.

References

[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Léon Bottou, Olivier Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.
[2] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
[3] Y. Bengio, I. J. Goodfellow, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2015.
[4] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, pages 71–113, 1992.
[5] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, pages 61–76, 1996.
[6] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, pages 1527–1554, 2006.
[7] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, pages 1771–1800, 2002.
[8] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
[9] H. Larochelle and S. Lauly. A neural autoregressive topic model. In NIPS, 2012.
[10] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. IEEE Trans. Pattern Anal. Mach. Intell., pages 1958–1971, 2013.
[11] M. Welling, M. Rosen-Zvi, and G. E. Hinton. Exponential family harmoniums with an application to information retrieval.
In NIPS, pages 1481–1488, 2004.
[12] E. P. Xing, R. Yan, and A. G. Hauptmann. Mining associated text and images with dual-wing harmoniums. In UAI, 2005.
[13] M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell., 2015.
[14] M. Zhou, O. H. M. Padilla, and J. G. Scott. Priors for random count matrices derived from a family of negative binomial processes. To appear in J. Amer. Statist. Assoc., 2015.
[15] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[16] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 2003.
[17] A. Acharya, J. Ghosh, and M. Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In AISTATS, 2015.
[18] R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In AISTATS, 2015.
[19] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.
[20] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 2004.
[22] M. Zhou. Beta-negative binomial process and exchangeable random partitions for mixed-membership modeling. In NIPS, 2014.
[23] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 2006.
[24] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.
[25] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, pages 1871–1874, 2008.
[26] N. Srivastava, R. Salakhutdinov, and G. Hinton. Modeling documents with a deep Boltzmann machine.
In UAI, 2013.