{"title": "Restricting exchangeable nonparametric distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 2598, "page_last": 2606, "abstract": "Distributions over exchangeable matrices with infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution.", "full_text": "Restricting exchangeable nonparametric distributions\n\nSinead A. Williamson\n\nUniversity of Texas at Austin\n\nSteven N. MacEachern\nThe Ohio State University\n\nEric P. Xing\n\nCarnegie Mellon University\n\nAbstract\n\nDistributions over matrices with exchangeable rows and in\ufb01nitely many columns\nare useful in constructing nonparametric latent variable models. However, the dis-\ntribution implied by such models over the number of features exhibited by each\ndata point may be poorly-suited for many modeling tasks. In this paper, we pro-\npose a class of exchangeable nonparametric priors obtained by restricting the do-\nmain of existing models. Such models allow us to specify the distribution over the\nnumber of features per data point, and can achieve better performance on data sets\nwhere the number of features is not well-modeled by the original distribution.\n\n1\n\nIntroduction\n\nThe Indian buffet process [IBP, 11] is one of several distributions over matrices with exchangeable\nrows and in\ufb01nitely many columns, only a \ufb01nite (but random) number of which contain any non-zero\nentries. 
Such distributions have proved useful for constructing \ufb02exible latent feature models that do\nnot require us to specify the number of latent features a priori. In such models, each column of the\nrandom matrix corresponds to a latent feature, and each row to a data point. The non-zero elements\nof a row select the subset of features that contribute to the corresponding data point.\nHowever, distributions such as the IBP make certain assumptions about the structure of the data that\nmay be inappropriate. Speci\ufb01cally, such priors impose distributions on the number of data points that\nexhibit a given feature, and on the number of features exhibited by a given data point. For example,\nin the IBP, the number of features exhibited by a data point is marginally Poisson-distributed, and\nthe probability of a data point exhibiting a previously-observed feature is proportional to the number\nof times that feature has been seen so far.\nThese distributional assumptions may not be appropriate for many modeling tasks. For example,\nthe IBP has been used to model both text [17] and network [13] data. It is well known that word\nfrequencies in text corpora and degree distributions of networks often exhibit power-law behavior;\nit seems reasonable to suppose that this behavior would be better captured by models that assume\na heavy-tailed distribution over the number of latent features, rather than the Poisson distribution\nassumed by the IBP and related random matrices.\nIn certain cases we may instead wish to add constraints on the number of latent features exhibited\nper data point, particularly in cases where we expect, or desire, the latent features to correspond\nto interpretable features, or causes, of the data [20]. 
For example, we might believe that each data point exhibits exactly S features – corresponding perhaps to speakers in a dialog, members of a team, or alleles in a genotype – but be agnostic about the total number of features in our data set. A model that explicitly encodes this prior expectation about the number of features per data point will tend to lead to more interpretable and parsimonious results. Alternatively, we may wish to specify a minimum number of latent features. For example, the IBP has been used to select possible next states in a hidden Markov model [10]. In such a model, we do not expect to see a state that allows no transitions (including self-transitions). Nonetheless, because a data point in the IBP can have zero features with non-zero probability, this situation can occur, resulting in an invalid transition distribution.

In this paper, we propose a method for modifying the distribution over the number of non-zero elements per row in arbitrary exchangeable matrices, allowing us to control the number of features per data point in a corresponding latent feature model. We show that our construction yields exchangeable distributions, and present Monte Carlo methods for posterior inference. Our experimental evaluation shows that this approach allows us to incorporate prior beliefs about the number of features per data point into our model, yielding superior modeling performance.

2 Exchangeability

We say a finite sequence (X_1, . . . , X_N) is exchangeable [see, for example, 1] if its distribution is unchanged under any permutation σ of {1, . . . , N}. Further, we say that an infinite sequence X_1, X_2, . . . is infinitely exchangeable if all of its finite subsequences are exchangeable. Such distributions are appropriate when we do not believe the order in which we see our data is important.
In such cases, a model whose posterior distribution depends on the order in which we see our data is not justified. In addition, exchangeable models often yield efficient Gibbs samplers.

De Finetti's law tells us that a sequence is exchangeable iff the observations are i.i.d. given some latent distribution. This means that we can write the probability of any exchangeable sequence as

    P(X_1 = x_1, X_2 = x_2, . . .) = ∫_Θ ∏_i μ_θ(X_i = x_i) ν(dθ)    (1)

for some probability distribution ν over parameter space Θ, and some parametrized family {μ_θ}_{θ∈Θ} of conditional probability distributions.

Throughout this paper, we will use the notation p(x_1, x_2, . . .) = P(X_1 = x_1, X_2 = x_2, . . .) to represent the joint distribution over an exchangeable sequence x_1, x_2, . . .; p(x_{n+1} | x_1, . . . , x_n) to represent the associated predictive distribution; and p(x_1, . . . , x_n, θ) := ∏_{i=1}^n μ_θ(X_i = x_i) ν(θ) to represent the joint distribution over the observations and the parameter θ.

2.1 Distributions over exchangeable matrices

The Indian buffet process [IBP, 11] is a distribution over binary matrices with exchangeable rows and infinitely many columns. In the de Finetti representation, the mixing distribution ν is a beta process, the parameter θ is a countably infinite measure with atom sizes π_k ∈ (0, 1], and the conditional distribution μ_θ is a Bernoulli process [17]. The beta process and the Bernoulli process are both completely random measures [CRM, 12] – distributions over random measures on some space Ω that assign independent masses to disjoint subsets of Ω, and that can be written in the form Γ = ∑_{k=1}^∞ π_k δ_{φ_k}.

We can think of each atom of θ as determining the latent probability for a column of a matrix with infinitely many columns, and the Bernoulli process as sampling binary values for the entries of that column of the matrix. The resulting matrix has a finite number of non-zero entries, with the number of non-zero entries in each row distributed as Poisson(α) and the total number of non-zero columns in N rows distributed as Poisson(αH_N), where H_N is the Nth harmonic number. The number of rows with a non-zero entry in a given column exhibits a "rich get richer" property – a new row has a one in a given column with probability proportional to the number of times a one has appeared in that column in the preceding rows.

Different patterns of behavior can be obtained with different choices of CRM. A three-parameter extension to the IBP [15] replaces the beta process with a completely random measure called the stable-beta process, which includes the beta process as a special case. The resulting random matrix exhibits power-law behavior: the total number of features exhibited in a data set of size N grows as O(N^s) for some s > 0, and the number of data points exhibiting each feature also follows a power law. The number of features per data point, however, remains Poisson-distributed. The infinite gamma-Poisson process [iGaP, 18] replaces the beta process with a gamma process, and the Bernoulli process with a Poisson process, to give a distribution over non-negative integer-valued matrices with infinitely many columns and exchangeable rows. In this model, the sum of each row is distributed according to a negative binomial distribution, and the number of non-zero entries in each row is Poisson-distributed.
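As a quick numerical check of the Poisson-distributed row sums discussed above (an illustration, not part of the paper), one can draw a fresh finite-K beta approximation π_k ~ Beta(α/K, 1) for each row and verify that the row sum has mean and variance close to α; the values of K and N below are arbitrary:

```python
import random

def ibp_row_sum(alpha, K, rng):
    """Sum of one IBP row under a finite-K approximation: pi_k ~ Beta(alpha/K, 1)
    (sampled by inversion, U ** (K / alpha)), entries z_k ~ Bernoulli(pi_k)."""
    return sum(1 for _ in range(K) if rng.random() < rng.random() ** (K / alpha))

rng = random.Random(1)
alpha, K, N = 5.0, 1000, 1000
sums = [ibp_row_sum(alpha, K, rng) for _ in range(N)]
mean = sum(sums) / N
var = sum((s - mean) ** 2 for s in sums) / N
# For Poisson(alpha), mean and variance both equal alpha (up to O(1/K)
# truncation error and Monte Carlo noise).
print(round(mean, 2), round(var, 2))
```

Note that the rows of a single IBP draw share one π and so are correlated; their empirical variance estimates the conditional, not the marginal, variance. The sketch therefore draws a fresh π per row to probe the marginal row-sum distribution.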
The beta-negative binomial process [21] replaces the Bernoulli process with a negative binomial process to get an alternative distribution over non-negative integer-valued matrices.

3 Removing the Poisson assumption

While different choices of CRMs in the de Finetti construction can alter the distribution over the number of data points that exhibit a feature and (in the case of non-binary matrices) the row sums, they retain a marginally Poisson distribution over the number of distinct features exhibited by a given data point. The construction of Caron [4] extends the IBP to allow the number of features in each row to follow a mixture of Poissons, by assigning data point-specific parameters that have an effect equivalent to a monotonic transformation on the atom sizes in the underlying beta process; however, conditioned on these parameters, the sum of each row is still Poisson-distributed.

This repeatedly occurring Poisson distribution is a direct result of the construction of a binary matrix from a combination of CRMs. To elaborate on this, note that, marginally, the distribution over the value of each element z_ik of a row z_i of the IBP (or a three-parameter IBP) is given by a Bernoulli distribution. Therefore, by the law of rare events, the sum ∑_k z_ik is distributed according to a Poisson distribution.

A similar argument applies to integer-valued matrices such as the infinite gamma-Poisson process. Marginally, the distribution over whether an element z_ik is greater than zero is given by a Bernoulli distribution, hence the number of non-zero elements, ∑_k z_ik ∧ 1, is Poisson-distributed. The distribution over the row sum, ∑_k z_ik, will depend on the choice of CRMs.

It follows that, if we wish to circumvent the requirement of a Poisson (or mixture of Poisson) number of features per data point in an IBP-like model, we must remove the completely random assumption on either the de Finetti mixing distribution or the family of conditional distributions. The remainder of this section discusses how we can obtain arbitrary marginal distributions over the number of features per row by using conditional distributions that are not completely random.

3.1 Restricting the family of conditional distributions in the de Finetti representation

Recall from Section 2 that any exchangeable sequence can be represented as a mixture over some family of conditional distributions. The support of this family determines the support of the exchangeable sequence. For example, in the IBP the family of conditional distributions is the Bernoulli process, which has support in {0, 1}^∞. A sample from the IBP therefore has support in {{0, 1}^∞}^N.

We are familiar with the idea of restricting the support of a distribution to a measurable subset. For example, a truncated Gaussian is a Gaussian distribution restricted to a contiguous section of the real line. In general, we can restrict an arbitrary probability distribution μ with support Ω to a measurable subset A ⊂ Ω by defining μ|_A(·) := μ(·) I(· ∈ A) / μ(A).

Theorem 1 (Restricted exchangeable distributions). We can restrict the support of an exchangeable distribution by restricting the family of conditional distributions {μ_θ}_{θ∈Θ} introduced in Equation 1, to obtain an exchangeable distribution on the restricted space.

Proof. Consider an unrestricted exchangeable model with de Finetti representation p(x_1, . . . , x_N, θ) = ∏_{i=1}^N μ_θ(X_i = x_i) ν(θ). Let p|_A be the restriction of p such that X_i ∈ A, i = 1, 2, . . .
, obtained by restricting the family of conditional distributions {μ_θ} to {μ_θ^{|A}} as described above. Then

    p|_A(x_1, . . . , x_N, θ) = ∏_{i=1}^N μ_θ^{|A}(X_i = x_i) ν(θ) = ∏_{i=1}^N [μ_θ(X_i = x_i) I(x_i ∈ A) / μ_θ(X_i ∈ A)] ν(θ),

and

    p|_A(x_{N+1} | x_1, . . . , x_N) ∝ I(x_{N+1} ∈ A) ∫_Θ [∏_{i=1}^{N+1} μ_θ(X_i = x_i) / ∏_{i=1}^{N+1} μ_θ(X_i ∈ A)] ν(dθ)    (2)

is an exchangeable sequence by construction, according to de Finetti's law.

We give three examples of exchangeable matrices where the number of non-zero entries per row is restricted to follow a given distribution. While our focus is on exchangeability of the rows, we note that the following distributions (like their unrestricted counterparts) are invariant under reordering of the columns, and that the resulting matrices are separately exchangeable [2].

Example 1 (Restriction of the IBP to a fixed number of non-zero entries per row). The family of conditional distributions in the IBP is given by the Bernoulli process. We can restrict the support of the Bernoulli process to an arbitrary measurable subset A ⊂ {0, 1}^∞ – for example, the set of all vectors z ∈ {0, 1}^∞ such that ∑_k z_k = S for some integer S. The conditional distribution of a matrix Z = {z_1, . . . , z_N} under such a distribution is given by:

    μ_B^{|S}(Z = Z) = ∏_{i=1}^N μ_B(Z_i = z_i) I(∑_k z_ik = S) / [μ_B(∑_k Z_ik = S)]^N
                    = [∏_{k=1}^∞ π_k^{m_k} (1 − π_k)^{N−m_k} / PoiBin(S | {π_k}_{k=1}^∞)^N] ∏_{i=1}^N I(∑_{k=1}^∞ z_ik = S),    (3)

where m_k = ∑_i z_ik and PoiBin(· | {π_k}_{k=1}^∞) is the infinite limit of the Poisson-binomial distribution [6], which describes the distribution over the number of successes in a sequence of independent but non-identical Bernoulli trials. The probability of Z given in Equation 3 is the infinite limit of the conditional Bernoulli distribution [6], which describes the distribution of the locations of the successes in such a trial, conditioned on their sum.

Example 2 (Restriction of the iGaP to a fixed number of non-zero entries per row). The family of conditional distributions in the iGaP is given by the Poisson process, which has support in N^∞. Following Example 1, we can restrict this support to the set of all vectors z ∈ N^∞ such that ∑_k z_k ∧ 1 = S for some integer S – i.e. the set of all non-negative integer-valued infinite vectors with S non-zero entries. The conditional distribution of a matrix Z = {z_1, . . . , z_N} under such a distribution is given by:

    μ_G^{|S}(Z = Z) = ∏_{i=1}^N μ_G(Z_i = z_i) I(∑_k z_ik ∧ 1 = S) / [μ_G(∑_k Z_ik ∧ 1 = S)]^N
                    = [∏_{k=1}^∞ λ_k^{m_k} e^{−Nλ_k} / ∏_{i,k} z_ik!] ∏_{i=1}^N I(∑_{k=1}^∞ z_ik ∧ 1 = S) / PoiBin(S | {1 − e^{−λ_k}}_{k=1}^∞)^N.    (4)

Example 3 (Restriction of the IBP to a random number of non-zero entries per row).
Rather than specify the number of non-zero entries in each row a priori, we can allow it to be random, with some arbitrary distribution f(·) over the non-negative integers. A Bernoulli process restricted to have f-marginals can be described as

    μ_B^{|f}(Z) = ∏_{i=1}^N μ_B^{|S_i}(Z_i = z_i) f(S_i) = ∏_{i=1}^N [f(S_i) I(∑_{k=1}^∞ z_ik = S_i) / PoiBin(S_i | {π_k}_{k=1}^∞)] ∏_{k=1}^∞ π_k^{m_k} (1 − π_k)^{N−m_k},    (5)

where S_i = ∑_{k=1}^∞ z_ik. If we marginalize over B = ∑_{k=1}^∞ π_k δ_{φ_k}, the resulting distribution is exchangeable, because mixtures of i.i.d. distributions are exchangeable.

We note that, even if we choose f to be Poisson(α), we will not recover the IBP. The IBP has Poisson(α) marginals over the number of non-zero elements per row, but the conditional distribution is described by a Poisson-binomial distribution. The Poisson-restricted IBP, however, will have Poisson marginal and conditional distributions.

Figure 1 shows some examples of samples from the single-parameter IBP, with parameter α = 5, with various restrictions applied.

Figure 1: Samples from restricted IBPs (panels: unrestricted IBP; 1 per row; 5 per row; 10 per row; Uniform{1, . . . , 20}; power law, s = 2).

3.2 Direct restriction of the predictive distributions

The construction in Section 3.1 is explicitly conditioned on a draw B from the de Finetti mixing distribution ν. Since it might be cumbersome to explicitly represent the infinite-dimensional object B, it is tempting to consider constructions that directly restrict the predictive distribution p(X_{N+1} | X_1, . . . , X_N), where B has been marginalized out.

Unfortunately, the distribution over matrices obtained by this approach does not (in general – see the appendix for a counter-example) correspond to the distribution over matrices obtained by restricting the family of conditional distributions. Moreover, the resulting distribution will not in general be exchangeable. This means it is not appropriate for data sets where we have no explicit ordering of the data, and also means we cannot directly use the predictive distribution to define a Gibbs sampler (as is possible in exchangeable models).

Theorem 2 (Sequences obtained by directly restricting the predictive distribution of an exchangeable sequence are not, in general, exchangeable). Let p be the distribution of the unrestricted exchangeable model introduced in the proof of Theorem 1. Let p*|_A be the distribution obtained by directly restricting this unrestricted exchangeable model such that X_i ∈ A, i.e.

    p*|_A(x_{N+1} | x_1, . . . , x_N) ∝ I(x_{N+1} ∈ A) [∫_Θ ∏_{i=1}^{N+1} μ_θ(X_i = x_i) ν(dθ) / ∫_Θ ∏_{i=1}^{N+1} μ_θ(X_i ∈ A) ν(dθ)].    (6)

In general, this will not be equal to Equation 2, and cannot be expressed as a mixture of i.i.d. distributions.

Proof. To demonstrate that this is true, consider the counterexample given in Example 4.

Example 4 (A three-urn buffet). Consider a simple form of the Indian buffet process, with a base measure consisting of three unit-mass atoms. We can represent the predictive distribution of such a model using three indexed urns, each containing one red ball (representing a one in the resulting matrix) and one blue ball (representing a zero in the resulting matrix).
We generate a sequence of ball sequences by repeatedly picking a ball from each urn, noting the ordered sequence of colors, and returning the balls to their urns, plus one ball of each sampled color.

Proposition 1. The three-urn buffet is exchangeable.

Proof. Using the fact that a sequence is exchangeable iff, given the first N elements of the sequence, the predictive distribution of the (N+1)st and (N+2)nd entries is exchangeable [9], it is straightforward to show that this model is exchangeable and that, for example,

    p(X_{N+1} = (r, b, r), X_{N+2} = (r, r, b) | X_{1:N})
      = [(m_1 + 1)(N + 1 − m_2)(m_3 + 1) / (N + 2)^3] · [(m_1 + 2)(m_2 + 1)(N + 1 − m_3) / (N + 3)^3]
      = p(X_{N+1} = (r, r, b), X_{N+2} = (r, b, r) | X_{1:N}),    (7)

where m_i is the number of times in the first N samples that the ith ball in a sample has been red.

Proposition 2. The directly restricted three-urn scheme (and, by extension, the directly restricted IBP) is not exchangeable.

Proof. Consider the same scheme, but where the outcome is restricted such that there is one, and only one, red ball per sample. The probability of a sample in this restricted model is given by

    p*(X_{N+1} = x | X_{1:N}) = ∑_{k=1}^3 I(x_k = r) (m_k + 1)/(N + 1 − m_k) / ∑_{k=1}^3 (m_k + 1)/(N + 1 − m_k),    (8)

and, for example,

    p*(X_{N+1} = (r, b, b), X_{N+2} = (b, r, b) | X_{1:N}) ≠ p*(X_{N+1} = (b, r, b), X_{N+2} = (r, b, b) | X_{1:N}),

since the two orderings update the urn counts, and hence the history-dependent normalizing constants, differently; therefore the restricted model is not exchangeable.
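The inequality in Proposition 2 can also be verified by exact enumeration. The sketch below (not from the paper; the urn counts are illustrative) computes the two ordered probabilities for the one-red-per-sample restriction with rational arithmetic and finds them unequal:

```python
from fractions import Fraction

def restricted_step(reds, blues, j):
    """Probability that urn j yields the single red under the one-red-per-sample
    restriction, given current urn contents, plus the updated contents."""
    n = len(reds)
    weights = []
    for l in range(n):
        w = Fraction(reds[l])
        for k in range(n):
            if k != l:
                w *= blues[k]
        weights.append(w)
    prob = weights[j] / sum(weights)
    new_reds = [r + (1 if k == j else 0) for k, r in enumerate(reds)]
    new_blues = [b + (0 if k == j else 1) for k, b in enumerate(blues)]
    return prob, new_reds, new_blues

def sequence_prob(reds, blues, red_urns):
    """Probability of observing the given sequence of single-red positions."""
    p = Fraction(1)
    for j in red_urns:
        q, reds, blues = restricted_step(reds, blues, j)
        p *= q
    return p

# Urn contents after a hypothetical history (m_1, m_2, m_3) = (2, 1, 0), N = 3:
# reds_k = m_k + 1, blues_k = N + 1 - m_k.
reds, blues = [3, 2, 1], [2, 3, 4]
p_ab = sequence_prob(reds, blues, [0, 1])  # (r,b,b) then (b,r,b)
p_ba = sequence_prob(reds, blues, [1, 0])  # (b,r,b) then (r,b,b)
print(p_ab, p_ba)  # → 10/87 40/319 (unequal, so not exchangeable)
```

Swapping the order of the two outcomes changes the probability, exactly because the normalizing constant at the second step depends on the first outcome.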
By introducing a normalizing constant – corresponding to restricting over a subset of {0, 1}^3 – that depends on the previous samples, we have broken the exchangeability of the sequence.

By extension, a model obtained by directly restricting the predictive distribution of the IBP is not exchangeable.

We note that there may well be situations where a non-exchangeable model, such as that described in Proposition 2, is appropriate for our data – for example, where there is an explicit ordering on the data. It is not, however, an appropriate model if we believe our data to be exchangeable, or if we are interested in finding a single, stationary latent distribution describing our data. This exchangeable setting is the focus of this paper, so we defer exploration of the non-exchangeable matrices obtained by restriction of the predictive distribution to future work.

4 Inference

We focus on models obtained by restricting the IBP to have f-marginals over the number of non-zero elements per row, as described in Example 3. Note that when f = δ_S, this yields the setting described in Example 1. Extensions to other cases, such as the restricted iGaP model of Example 2, are straightforward. We work with a truncated model, where we approximate the countably infinite sequence {π_k}_{k=1}^∞ with a large, but finite, vector π := (π_1, . . . , π_K), where each atom π_k is distributed according to Beta(α/K, 1). An alternative approach would be to develop a slice sampler that uses a random truncation, avoiding the error introduced by the fixed truncation [14, 16].
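As an illustration of the truncated restricted model (not the authors' code; the values of α, K and S are arbitrary), a draw from the δ_S-restricted Bernoulli process of Example 1 can be sketched by rejection sampling, which is exact but only practical when PoiBin(S | π) is not too small:

```python
import random

def sample_restricted_row(pis, S, rng, max_tries=100_000):
    """Draw z from a Bernoulli process with atom weights pis, conditioned on
    sum(z) == S, by rejection: resample unrestricted rows until the sum is S.
    Exact, but slow when PoiBin(S | pis) is small."""
    for _ in range(max_tries):
        z = [1 if rng.random() < p else 0 for p in pis]
        if sum(z) == S:
            return z
    raise RuntimeError("acceptance probability too small; use a smarter sampler")

rng = random.Random(0)
alpha, K, S = 5.0, 200, 3
pis = [rng.betavariate(alpha / K, 1.0) for _ in range(K)]
rows = [sample_restricted_row(pis, S, rng) for _ in range(50)]
assert all(sum(z) == S for z in rows)
```

In practice the Gibbs and Metropolis-Hastings moves of Section 4.1 replace this rejection step, but the rejection view makes the target distribution explicit.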
We assume a likelihood function g(X | Z) = ∏_i g(x_i | z_i).

4.1 Sampling the binary matrix Z

For marginal functions f that assign probability mass to a contiguous, non-singleton subset of N, we can Gibbs sample each entry of Z according to

    p(z_ik = 1 | x_i, π, Z_¬ik, ∑_{j≠k} z_ij = a) ∝ π_k [f(a + 1) / p(∑_k z_k = a + 1 | π)] g(x_i | z_ik = 1, Z_¬ik)
    p(z_ik = 0 | x_i, π, Z_¬ik, ∑_{j≠k} z_ij = a) ∝ (1 − π_k) [f(a) / p(∑_k z_k = a | π)] g(x_i | z_ik = 0, Z_¬ik).    (9)

Where f = δ_S, this approach will fail, since any move that changes z_ik must change ∑_k z_ik. In this setting, instead, we sample the locations of the non-zero entries z_i^(j), j = 1, . . . , S, of z_i:

    p(z_i^(j) = k | x_i, π, z_i^(¬j)) ∝ π_k (1 − π_k)^{−1} g(x_i | z_i^(j) = k, z_i^(¬j)).    (10)

To improve mixing, we also include Metropolis-Hastings moves that propose an entire row of Z. We include details in the supplementary material.

4.2 Sampling the beta process atoms π

Conditioned on Z, the distribution of π is

    ν({π_k}_{k=1}^K | Z) ∝ μ^{|f}_{{π_k}}(Z = Z) ν({π_k}_{k=1}^K) ∝ ∏_{k=1}^K π_k^{m_k + α/K − 1} (1 − π_k)^{N − m_k} / ∏_{i=1}^N PoiBin(S_i | π).    (11)

The Poisson-binomial term can be calculated exactly in O(K ∑_k z_ik) using either a recursive algorithm [3, 5] or an algorithm based on the characteristic function that uses the Discrete Fourier Transform [8]. It can also be approximated using a skewed-normal approximation to the Poisson-binomial distribution [19].
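The recursive algorithm referenced above [3, 5] builds the Poisson-binomial pmf by adding one trial at a time; truncating the recursion at s_max = S_i gives the O(K ∑_k z_ik) cost quoted in the text. A minimal sketch:

```python
def poisson_binomial_pmf(pis, smax=None):
    """P(sum of independent Bernoulli(pi) variables = s) for s = 0..smax,
    via the standard O(K * smax) recursion that adds one trial at a time."""
    smax = len(pis) if smax is None else min(smax, len(pis))
    pmf = [1.0] + [0.0] * smax          # zero trials: all mass at 0
    for p in pis:
        for s in range(smax, 0, -1):    # update in place, high s first
            pmf[s] = pmf[s] * (1 - p) + pmf[s - 1] * p
        pmf[0] *= 1 - p
    return pmf

# Two fair coins: P(S = 0, 1, 2) = 0.25, 0.5, 0.25.
print(poisson_binomial_pmf([0.5, 0.5]))  # → [0.25, 0.5, 0.25]
```

For long π vectors the recursion can lose precision in floating point, which is one motivation for the DFT-based alternative [8].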
We can therefore sample from the posterior of π using Metropolis-Hastings steps. Since we believe our posterior will be close to the posterior for the unrestricted model, we use the proposal distribution q(π_k | Z) = Beta(α/K + m_k, N + 1 − m_k) to propose new values of π_k.

4.3 Evaluating the predictive distribution

In certain cases, we may wish to directly evaluate the predictive distribution p^{|f}(z_{N+1} | z_1, . . . , z_N). Unfortunately, in the case of the IBP, we are unable to perform the integral in Equation 2 analytically. We can, however, estimate the predictive distribution using importance sampling. We sample T measures π^(t) ~ ν(π | Z), where ν(π | Z) is the posterior distribution over π in the finite approximation to the IBP, and then weight them to obtain the restricted predictive distribution

    p^{|f}(z_{N+1} | z_1, . . . , z_N) ≈ ∑_{t=1}^T w_t μ^{|f}_{π^(t)}(z_{N+1}) / ∑_{t=1}^T w_t,    (12)

where w_t = μ^{|f}_{π^(t)}(z_1, . . . , z_N) / μ_{π^(t)}(z_1, . . . , z_N), and

    μ^{|f}_π(Z) ∝ ∏_{i=1}^N [f(S_i) I(∑_{k=1}^K z_ik = S_i) / PoiBin(S_i | π)] ∏_{k=1}^K π_k^{m_k} (1 − π_k)^{N − m_k}.

Figure 2: Top row: True features. Bottom row: Sample data points for S = 2.

Table 1: Structure error on synthetic data with 100 data points and S features per data point.

          S = 2            S = 5            S = 8            S = 11           S = 14
    IBP   7297.4 ± 2822.8  8982.2 ± 1981.7  7442.8 ± 3602.0  8862.1 ± 3920.2  20244 ± 6809.7
    rIBP  57.2 ± 66.4      3469.7 ± 133.7   5963.8 ± 871.4   11413 ± 1992.9   12199 ± 2593.8

5 Experimental evaluation

In this paper, we have described how distributions over exchangeable matrices, such as the IBP, can be modified to allow more flexible control over the distributions over the number of latent features. In this section, we perform experiments on both real and synthetic data. The synthetic data experiments are designed to show that appropriate restriction can yield more interpretable features. The experiments on real data are designed to show that careful choice of the distribution over the number of latent features in our models can lead to improved predictive performance.

5.1 Synthetic data

The IBP has been used to discover latent features that correspond to interpretable phenomena, such as latent causes behind patient symptoms [20]. If we have prior knowledge about the number of latent features per data point – for example, the number of players in a team, or the number of speakers in a conversation – we may expect both better predictive performance and more interpretable latent features. In this experiment, we evaluate this hypothesis on synthetic data, where the true latent features are known. We generated images by randomly selecting S of 16 binary features, shown in Figure 2, superimposing them, and adding isotropic Gaussian noise (σ^2 = 0.25). We modeled the resulting data using an uncollapsed linear Gaussian model, as described in [7], using both the IBP and the IBP restricted to have S features per row. To compare the generating matrix Z_0 and our posterior estimate Z, we looked at the structure error [20].
This is the sum of the absolute differences between the upper triangular portions of Z_0 Z_0^T and E[Z Z^T], and is a general measure of graph dissimilarity.

Table 1 shows the structure error obtained using both a standard IBP model (IBP) and an IBP restricted to have the correct number of latent features (rIBP), for varying numbers of features S. In each case, the number of data points is 100, the IBP parameter α is fixed to S, and the model is truncated to 50 features. Each experiment was repeated 10 times on independently generated data sets; we present the mean and standard deviation. All samplers were run for 5000 samples; the first 2500 were discarded as burn-in.

Where the number of features per data point is small relative to the total number of features, the restricted model does a much better job of recovering the "correct" latent structure. While the IBP may be able to explain the training data set as well as the restricted model, it will not in general recover the desired latent structure – which is important if we wish to interpret the latent structure. Once the number of features per data point increases beyond half the total number of features, the model is ill-specified – it is more parsimonious to represent features via the absence of a bar. As a result, both models perform poorly at recovering the generating structure.
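For concreteness, the structure-error comparison above can be sketched as follows (an illustrative helper, not the authors' code), approximating E[ZZ^T] by an average over posterior samples; ZZ^T is invariant to column reorderings, which is what makes the comparison well-defined despite the unidentifiable column labels:

```python
def gram(Z):
    """Z Z^T for a binary matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]

def structure_error(Z0, Z_samples):
    """Structure error in the sense of [20]: sum of |(Z0 Z0^T)_ij - E[Z Z^T]_ij|
    over the upper triangle i < j, with E[.] a Monte Carlo average."""
    G0 = gram(Z0)
    Gs = [gram(Z) for Z in Z_samples]
    T, N = len(Gs), len(Z0)
    return sum(abs(G0[i][j] - sum(G[i][j] for G in Gs) / T)
               for i in range(N) for j in range(i + 1, N))

# A posterior concentrated on the generating matrix has zero error.
Z0 = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
print(structure_error(Z0, [Z0, Z0]))  # → 0.0
```

Only the off-diagonal entries are compared, so the metric measures shared-feature structure between pairs of data points rather than per-point feature counts.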
The restricted model – and indeed the IBP – should only be expected to recover easily interpretable features where the number of such features per data point is small relative to the total number of features.

Table 2: Proportion correct at n on classifying documents from the 20newsgroup data set.

    n     1      2      3      4      5      6      7      8      9      10
    IBP   0.591  0.726  0.796  0.848  0.878  0.905  0.923  0.936  0.952  0.958
    rIBP  0.622  0.749  0.819  0.864  0.899  0.918  0.935  0.948  0.959  0.966

    n     11     12     13     14     15     16     17     18     19     20
    IBP   0.961  0.969  0.974  0.978  0.982  0.989  0.991  0.996  0.997  1.000
    rIBP  0.971  0.978  0.981  0.983  0.988  0.992  0.998  1.000  1.000  1.000

5.2 Classification of text data

The IBP and its extensions have been used to directly model text data [17, 15]. In such settings, the IBP is used to directly model the presence or absence of words, so the matrix Z is observed rather than latent, and the total number of features is given by the vocabulary size. We hypothesize that the Poisson assumption made by the IBP is not appropriate for text data, as the statistics of word use in natural language tend to follow a heavier-tailed distribution [22]. To test this hypothesis, we modeled a collection of corpora using both an IBP and an IBP restricted to have a negative binomial distribution over the number of words. Our corpora were 20 collections of newsgroup postings on various topics (for example, comp.graphics, rec.autos, rec.sport.hockey)^1. No pre-processing of the documents was performed. Since the vocabulary (and hence the feature space) is finite, we truncated both models to the vocabulary size.
Due to the very large state space, we restricted our samples such that, in a single sample, atoms with the same posterior distribution were assigned the same value. For each model, α was set to the mean number of words per document in the corresponding group, and the maximum likelihood parameters were used for the negative binomial distribution.

To evaluate the quality of the models, we classified held-out documents based on their likelihood under each of the 20 newsgroups. This experiment is designed to replicate an experiment performed by [15] to compare the original and three-parameter IBP models. For both models, we estimated the predictive distribution by generating 1000 samples from the posterior of the beta process in the IBP model. For the IBP, we used these samples directly to estimate the predictive distribution; for the restricted model, we used the importance-weighted samples obtained using Equation 12. For each model, we trained on 1000 randomly selected documents, and tested on a further 1000 documents. Table 2 shows the fraction of documents correctly classified in the first n labels – i.e. the fraction of documents for which the correct label is one of the n most likely. The restricted IBP (rIBP) performs uniformly better than the unrestricted model.

6 Discussion and future work

The framework explored in this paper allows us to relax the distributional assumptions made by existing exchangeable nonparametric processes. As future work, we intend to explore which applications and models can most benefit from this greater flexibility.

We note that the model, as posed, suffers from an identifiability issue. Let B̃ = ∑_{k=1}^∞ π̃_k δ_{φ_k} be the measure obtained by transforming B = ∑_{k=1}^∞ π_k δ_{φ_k} such that π̃_k = π_k / (1 − π_k).
Then, scaling $\tilde{B}$ by a positive scalar does not affect the likelihood of a given matrix Z. We intend to explore the consequences of this in future work.

Acknowledgments

We would like to thank Zoubin Ghahramani for valuable suggestions and discussions throughout this project. We would also like to thank Finale Doshi-Velez and Ryan Adams for pointing out the non-identifiability mentioned in Section 6. This research was supported in part by NSF grants DMS-1209194 and IIS-1111142, AFOSR grant FA95501010247, and NIH grant R01GM093156.

1 http://people.csail.mit.edu/jrennie/20Newsgroups/

References

[1] D. Aldous. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII, pages 1–198, 1985.

[2] D. J. Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598, 1981.

[3] R. E. Barlow and K. D. Heidtmann. Computing k-out-of-n system reliability. IEEE Transactions on Reliability, 33:322–323, 1984.

[4] F. Caron. Bayesian nonparametric models for bipartite graphs. In Neural Information Processing Systems, 2012.

[5] S. X. Chen, A. P. Dempster, and J. S. Liu. Weighted finite population sampling to maximize entropy. Biometrika, 81:457–469, 1994.

[6] S. X. Chen and J. S. Liu. Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Statistica Sinica, 7:875–892, 1997.

[7] F. Doshi-Velez and Z. Ghahramani. Accelerated Gibbs sampling for the Indian buffet process. In International Conference on Machine Learning, 2009.

[8] M. Fernández and S. Williams. Closed-form expression for the Poisson-binomial probability density function. IEEE Transactions on Aerospace Electronic Systems, 46:803–817, 2010.

[9] S. Fortini, L. Ladelli, and E. Regazzini.
Exchangeability, predictive distributions and parametric models. Sankhyā: The Indian Journal of Statistics, Series A, pages 86–109, 2000.

[10] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems, 2010.

[11] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems, 2005.

[12] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.

[13] K. T. Miller, T. L. Griffiths, and M. I. Jordan. Nonparametric latent feature models for link prediction. In Neural Information Processing Systems, 2009.

[14] R. M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

[15] Y. W. Teh and D. Görür. Indian buffet processes with power law behaviour. In Neural Information Processing Systems, 2009.

[16] Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, 2007.

[17] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, 2007.

[18] M. Titsias. The infinite gamma-Poisson feature model. In Neural Information Processing Systems, 2007.

[19] A. Y. Volkova. A refinement of the central limit theorem for sums of independent random indicators. Theory of Probability and its Applications, 40:791–794, 1996.

[20] F. Wood, T. L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Uncertainty in Artificial Intelligence, 2006.

[21] M. Zhou, L. A. Hannah, D. B. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis.
In Artificial Intelligence and Statistics, 2012.

[22] G. K. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Harvard University Press, 1932.