{"title": "Indian Buffet Processes with Power-law Behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 1838, "page_last": 1846, "abstract": "The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models.  In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior.  We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the \\emph{stable-beta process} and deriving the IBP corresponding to it.  We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties).  We show that our power-law IBP is a good model for word occurrences in documents with improved performance over the normal IBP.", "full_text": "Indian Buffet Processes with Power-law Behavior\n\nYee Whye Teh and Dilan G\u00a8or\u00a8ur\n\n17 Queen Square, London WC1N 3AR, United Kingdom\n\nGatsby Computational Neuroscience Unit, UCL\n{ywteh,dilan}@gatsby.ucl.ac.uk\n\nAbstract\n\nThe Indian buffet process (IBP) is an exchangeable distribution over binary ma-\ntrices used in Bayesian nonparametric featural models. In this paper we propose\na three-parameter generalization of the IBP exhibiting power-law behavior. We\nachieve this by generalizing the beta process (the de Finetti measure of the IBP) to\nthe stable-beta process and deriving the IBP corresponding to it. We \ufb01nd interest-\ning relationships between the stable-beta process and the Pitman-Yor process (an-\nother stochastic process used in Bayesian nonparametric models with interesting\npower-law properties). 
We derive a stick-breaking construction for the stable-beta process, and find that our power-law IBP is a good model for word occurrences in document corpora.\n\n1 Introduction\n\nThe Indian buffet process (IBP) is an infinitely exchangeable distribution over binary matrices with a finite number of rows and an unbounded number of columns [1, 2]. It has been proposed as a suitable prior for Bayesian nonparametric featural models, where each object (row) is modeled with a potentially unbounded number of features (columns). Applications of the IBP include Bayesian nonparametric models for ICA [3], choice modeling [4], similarity judgements modeling [5], dyadic data modeling [6] and causal inference [7].\n\nIn this paper we propose a three-parameter generalization of the IBP with power-law behavior. Using the usual analogy of customers entering an Indian buffet restaurant and sequentially choosing dishes from an infinitely long buffet counter, our generalization with parameters α > 0, c > −σ and σ ∈ [0, 1) is simply as follows:\n\n• Customer 1 tries Poisson(α) dishes.\n• Subsequently, customer n + 1:\n  – tries dish k with probability (m_k − σ)/(n + c), for each dish that has previously been tried;\n  – tries Poisson(α Γ(1+c)Γ(n+c+σ) / (Γ(n+1+c)Γ(c+σ))) new dishes.\n\nwhere m_k is the number of previous customers who tried dish k. The dishes and the customers correspond to the columns and the rows of the binary matrix respectively, with an entry of the matrix being one if the corresponding customer tried the dish (and zero otherwise). The mass parameter α controls the total number of dishes tried by the customers, the concentration parameter c controls the number of customers that will try each dish, and the stability exponent σ controls the power-law behavior of the process. 
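
The generative process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (new_dish_rate, sample_poisson, sample_stable_beta_ibp) are hypothetical, only the standard library is used, and the Poisson sampler is a simple textbook method chosen for brevity.

```python
import math
import random


def new_dish_rate(n, alpha, c, sigma):
    # Poisson mean for the new dishes of customer n + 1, given n earlier customers:
    # alpha * Gamma(1+c) Gamma(n+c+sigma) / (Gamma(n+1+c) Gamma(c+sigma)).
    # Evaluated in log space so that large n does not overflow.
    log_ratio = (math.lgamma(1.0 + c) + math.lgamma(n + c + sigma)
                 - math.lgamma(n + 1.0 + c) - math.lgamma(c + sigma))
    return alpha * math.exp(log_ratio)


def sample_poisson(rate, rng):
    # Knuth's multiplication method (an illustrative choice, fine for small rates).
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_stable_beta_ibp(num_customers, alpha, c, sigma, seed=0):
    # Returns (rows, counts): rows[i] lists the dishes tried by customer i + 1,
    # counts[k] is m_k, the number of customers who have tried dish k.
    rng = random.Random(seed)
    rows, counts = [], []
    for n in range(num_customers):  # n customers have already been served
        tried = [k for k, m_k in enumerate(counts)
                 if rng.random() < (m_k - sigma) / (n + c)]
        for k in tried:
            counts[k] += 1
        num_new = sample_poisson(new_dish_rate(n, alpha, c, sigma), rng)
        tried.extend(range(len(counts), len(counts) + num_new))
        counts.extend([1] * num_new)
        rows.append(tried)
    return rows, counts


rows, counts = sample_stable_beta_ibp(200, alpha=3.0, c=1.0, sigma=0.5)
print(len(counts), 'dishes tried by', len(rows), 'customers')
```

For n = 0 the rate reduces to alpha, which matches the first customer trying Poisson(alpha) dishes; setting sigma = 0 gives the two-parameter IBP rates.
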
When σ = 0 the process does not exhibit power-law behavior and reduces to the usual two-parameter IBP [2].\n\nMany naturally occurring phenomena exhibit power-law behavior, and it has been argued that using models that can capture this behavior can improve learning [8]. Recent examples where this has led to significant improvements include unsupervised morphology learning [8], language modeling [9] and image segmentation [10]. These examples are all based on the Pitman-Yor process [11, 12, 13], a generalization of the Dirichlet process [14] with power-law properties. Our generalization of the IBP extends the ability to model power-law behavior to featural models, and we expect it to lead to a wealth of novel applications not previously well handled by the IBP.\n\nThe approach we take in this paper is to first define the underlying de Finetti measure, then to derive the conditional distributions of Bernoulli process observations with the de Finetti measure integrated out. This automatically ensures that the resulting power-law IBP is infinitely exchangeable. We call the de Finetti measure of the power-law IBP the stable-beta process. It is a novel generalization of the beta process [15] (which is the de Finetti measure of the normal two-parameter IBP [16]) with characteristics reminiscent of the stable process [17, 11] (in turn related to the Pitman-Yor process). We will see that the stable-beta process has a number of properties similar to the Pitman-Yor process.\n\nIn the following section we first give a brief description of completely random measures, a class of random measures which includes the stable-beta and the beta processes. In Section 3 we introduce the stable-beta process, a three parameter generalization of the beta process and derive the power-law IBP based on the stable-beta process. Based on the proposed model, in Section 4 we construct a model of word occurrences in a document corpus. 
We conclude with a discussion in Section 5.\n\n2 Completely Random Measures\n\nIn this section we give a brief description of completely random measures [18]. Let Θ be a measure space with Ω its σ-algebra. A random variable whose values are measures on (Θ, Ω) is referred to as a random measure. A completely random measure (CRM) µ over (Θ, Ω) is a random measure such that µ(A) ⊥⊥ µ(B) for all disjoint measurable subsets A, B ∈ Ω. That is, the (random) masses assigned to disjoint subsets are independent. An important implication of this property is that the whole distribution over µ is determined (with usually satisfied technical assumptions) once the distributions of µ(A) are given for all A ∈ Ω.\n\nCRMs can always be decomposed into a sum of three independent parts: a (non-random) measure, an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms and masses. CRMs in this paper will only contain the second and third components. In this case we can write µ in the form,\n\n$$\\mu = \\sum_{k=1}^{N} u_k \\delta_{\\phi_k} + \\sum_{l=1}^{M} v_l \\delta_{\\psi_l}, \\tag{1}$$\n\nwhere u_k, v_l > 0 are the random masses, φ_k ∈ Θ are the fixed atoms, ψ_l ∈ Θ are the random atoms, and N, M ∈ ℕ ∪ {∞}. To describe µ fully it is sufficient to specify N and {φ_k}, and to describe the joint distribution over the random variables {u_k}, {v_l}, {ψ_l} and M. Each u_k has to be independent from everything else and has some distribution F_k. The random atoms and their weights {v_l, ψ_l} are jointly drawn from a 2D Poisson process over (0, ∞] × Θ with some nonatomic rate measure Λ called the Lévy measure. 
The rate measure Λ has to satisfy a number of technical properties; see [18, 19] for details. If ∫_Θ ∫_{(0,∞]} Λ(du × dθ) = M* < ∞ then the number of random atoms M in µ is Poisson distributed with mean M*, otherwise there are an infinite number of random atoms. If µ is described by Λ and {φ_k, F_k}_{k=1}^{N} as above, we write,\n\n$$\\mu \\sim \\mathrm{CRM}(\\Lambda, \\{\\phi_k, F_k\\}_{k=1}^{N}). \\tag{2}$$\n\n3 The Stable-beta Process\n\nIn this section we introduce a novel CRM called the stable-beta process (SBP). It has no fixed atoms while its Lévy measure is defined over (0, 1) × Θ:\n\n$$\\Lambda_0(du \\times d\\theta) = \\alpha \\frac{\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)} u^{-\\sigma-1}(1-u)^{c+\\sigma-1} \\, du \\, H(d\\theta) \\tag{3}$$\n\nwhere the parameters are: a mass parameter α > 0, a concentration parameter c > −σ, a stability exponent 0 ≤ σ < 1, and a smooth base distribution H. The mass parameter controls the overall mass of the process and the base distribution gives the distribution over the random atom locations.\n\nThe mean of the SBP can be shown to be E[µ(A)] = αH(A) for each A ∈ Ω, while var(µ(A)) = α (1 − σ)/(1 + c) H(A). Thus the concentration parameter and the stability exponent both affect the variability of the SBP around its mean. The stability exponent also governs the power-law behavior of the SBP. When σ = 0 the SBP does not have power-law behavior and reduces to a normal two-parameter beta process [15, 16]. When c = 1 − σ the stable-beta process describes the random atoms with masses < 1 in a stable process [17, 11]. The SBP is so named as it can be seen as a generalization of both the stable and the beta processes. 
Both the concentration parameter and the stability exponent can be generalized to functions over Θ though we will not deal with this generalization here.\n\n3.1 Posterior Stable-beta Process\n\nConsider the following hierarchical model:\n\n$$\\mu \\sim \\mathrm{CRM}(\\Lambda_0, \\{\\}), \\qquad Z_i \\mid \\mu \\sim \\mathrm{BernoulliP}(\\mu) \\quad \\text{iid, for } i = 1, \\dots, n. \\tag{4}$$\n\nThe random measure µ is an SBP with no fixed atoms and with Lévy measure (3), while Z_i ∼ BernoulliP(µ) is a Bernoulli process with mean µ [16]. This is also a CRM: in a small neighborhood dθ around θ ∈ Θ it has a probability µ(dθ) of having a unit mass atom in dθ; otherwise it does not have an atom in dθ. If µ has an atom at θ the probability of Z_i having an atom at θ as well is µ({θ}). If µ has a smooth component, say µ_0, Z_i will have random atoms drawn from a Poisson process with rate measure µ_0. In typical applications to featural models the atoms in Z_i give the features associated with data item i, while the weights of the atoms in µ give the prior probabilities of the corresponding features occurring in a data item.\n\nWe are interested in both the posterior of µ given Z_1, ..., Z_n, as well as the conditional distribution of Z_{n+1} | Z_1, ..., Z_n with µ marginalized out. Let θ*_1, ..., θ*_K be the K unique atoms among Z_1, ..., Z_n with atom θ*_k occurring m_k times. Theorem 3.3 of [20] shows that the posterior of µ given Z_1, ..., Z_n is still a CRM, but now including fixed atoms given by θ*_1, ..., θ*_K. Its updated Lévy measure and the distribution of the mass at each fixed atom θ*_k are,\n\n$$\\mu \\mid Z_1, \\dots, Z_n \\sim \\mathrm{CRM}(\\Lambda_n, \\{\\theta^*_k, F_{nk}\\}_{k=1}^{K}), \\tag{5}$$\n\nwhere\n\n$$\\Lambda_n(du \\times d\\theta) = \\alpha \\frac{\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)} u^{-\\sigma-1}(1-u)^{n+c+\\sigma-1} \\, du \\, H(d\\theta), \\tag{6a}$$\n\n$$F_{nk}(du) = \\frac{\\Gamma(n+c)}{\\Gamma(m_k-\\sigma)\\Gamma(n-m_k+c+\\sigma)} u^{m_k-\\sigma-1}(1-u)^{n-m_k+c+\\sigma-1} \\, du. \\tag{6b}$$\n\nIntuitively, the posterior is obtained as follows. Firstly, the posterior of µ must be a CRM since both the prior of µ and the likelihood of each Z_i | µ factorize over disjoint subsets of Θ. Secondly, µ must have fixed atoms at each θ*_k since otherwise the probability that there will be atoms among Z_1, ..., Z_n at precisely θ*_k is zero. The posterior mass at θ*_k is obtained by multiplying a Bernoulli “likelihood” u^{m_k}(1 − u)^{n−m_k} (since there are m_k occurrences of the atom θ*_k among Z_1, ..., Z_n) to the “prior” Λ_0(du × dθ*_k) in (3) and normalizing, giving us (6b). Finally, outside of these K atoms there are no other atoms among Z_1, ..., Z_n. We can think of this as n observations of 0 among n iid Bernoulli variables, so a “likelihood” of (1 − u)^n is multiplied into Λ_0 (without normalization), giving the updated Lévy measure in (6a).\n\nLet us inspect the distributions (6) of the fixed and random atoms in the posterior µ in turn. The random mass at θ*_k has a distribution F_{nk} which is simply a beta distribution with parameters (m_k − σ, n − m_k + c + σ). This differs from the usual beta process in the subtraction of σ from m_k and addition of σ to n − m_k + c. This is reminiscent of the Pitman-Yor generalization to the Dirichlet process [11, 12, 13], where a discount parameter is subtracted from the number of customers seated around each table, and added to the chance of sitting at a new table. 
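
As a small numerical illustration of the beta-distributed posterior mass at a fixed atom, the sketch below computes the parameters (m_k - sigma, n - m_k + c + sigma) and compares the closed-form mean (m_k - sigma)/(n + c) with a Monte Carlo average. The helper names are hypothetical and the parameter values are arbitrary; only the Python standard library is assumed.

```python
import random


def posterior_atom_params(m_k, n, c, sigma):
    # Beta parameters of the posterior mass for a dish tried by m_k of n customers.
    return m_k - sigma, n - m_k + c + sigma


def posterior_atom_mean(m_k, n, c, sigma):
    # Mean of Beta(a, b) is a / (a + b), which here equals (m_k - sigma) / (n + c).
    a, b = posterior_atom_params(m_k, n, c, sigma)
    return a / (a + b)


def monte_carlo_mean(m_k, n, c, sigma, num_samples=20000, seed=0):
    # Averages draws from the beta posterior as a sanity check on the closed form.
    rng = random.Random(seed)
    a, b = posterior_atom_params(m_k, n, c, sigma)
    return sum(rng.betavariate(a, b) for _ in range(num_samples)) / num_samples


print(posterior_atom_mean(3, 10, c=1.0, sigma=0.5))  # (3 - 0.5) / (10 + 1)
print(monte_carlo_mean(3, 10, c=1.0, sigma=0.5))
```

With sigma = 0 the same functions give the usual beta process posterior mean m_k / (n + c), which makes the discounting effect of sigma easy to see.
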
On the other hand, the Lévy measure of the random atoms of µ is still a Lévy measure corresponding to an SBP with updated parameters\n\n$$\\alpha' \\leftarrow \\alpha \\frac{\\Gamma(1+c)\\Gamma(n+c+\\sigma)}{\\Gamma(n+1+c)\\Gamma(c+\\sigma)}, \\quad \\sigma' \\leftarrow \\sigma, \\quad c' \\leftarrow c + n, \\quad H' \\leftarrow H. \\tag{7}$$\n\nNote that the update depends only on n, not on Z_1, ..., Z_n. In summary, the posterior of µ is simply an independent sum of an SBP with updated parameters and of fixed atoms with beta distributed masses. Observe that the posterior µ is not itself an SBP. In other words, the SBP is not conjugate to Bernoulli process observations. This is different from the beta process and again reminiscent of Pitman-Yor processes, where the posterior is also a sum of a Pitman-Yor process with updated parameters and fixed atoms with random masses, but not a Pitman-Yor process [11]. Fortunately, the non-conjugacy of the SBP does not preclude efficient inference. In the next subsections we describe an Indian buffet process and a stick-breaking construction corresponding to the SBP. Efficient inference techniques based on both representations for the beta process can be straightforwardly generalized to the SBP [1, 16, 21].\n\n3.2 The Stable-beta Indian Buffet Process\n\nWe can derive an Indian buffet process (IBP) corresponding to the SBP by deriving, for each n, the distribution of Z_{n+1} conditioned on Z_1, ..., Z_n, with µ marginalized out. This derivation is straightforward and follows closely that for the beta process [16]. For each of the atoms θ*_k the posterior of µ(θ*_k) given Z_1, ..., Z_n is beta distributed with mean (m_k − σ)/(n + c). Thus\n\n$$p(Z_{n+1}(\\theta^*_k) = 1 \\mid Z_1, \\dots, Z_n) = E[\\mu(\\theta^*_k) \\mid Z_1, \\dots, Z_n] = \\frac{m_k - \\sigma}{n + c}. \\tag{8}$$\n\nMetaphorically speaking, customer n + 1 tries dish k with probability (m_k − σ)/(n + c). Now for the random atoms. Let θ ∈ Θ \\ {θ*_1, ..., θ*_K}. In a small neighborhood dθ around θ, we have:\n\n$$\\begin{aligned} p(Z_{n+1}(d\\theta) = 1 \\mid Z_1, \\dots, Z_n) &= E[\\mu(d\\theta) \\mid Z_1, \\dots, Z_n] = \\int_0^1 u \\, \\Lambda_n(du \\times d\\theta) \\\\ &= \\int_0^1 u \\, \\alpha \\frac{\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)} u^{-1-\\sigma}(1-u)^{n+c+\\sigma-1} \\, du \\, H(d\\theta) \\\\ &= \\alpha \\frac{\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)} H(d\\theta) \\int_0^1 u^{-\\sigma}(1-u)^{n+c+\\sigma-1} \\, du \\\\ &= \\alpha \\frac{\\Gamma(1+c)\\Gamma(n+c+\\sigma)}{\\Gamma(n+1+c)\\Gamma(c+\\sigma)} H(d\\theta). \\end{aligned} \\tag{9}$$\n\nSince Z_{n+1} is completely random and H is smooth, the above shows that on Θ \\ {θ*_1, ..., θ*_K} Z_{n+1} is simply a Poisson process with rate measure α Γ(1+c)Γ(n+c+σ) / (Γ(n+1+c)Γ(c+σ)) H. In particular, it will have Poisson(α Γ(1+c)Γ(n+c+σ) / (Γ(n+1+c)Γ(c+σ))) new atoms, each independently and identically distributed according to H. In the IBP metaphor, this corresponds to customer n + 1 trying new dishes, with each dish associated with a new draw from H. The resulting Indian buffet process is as described in the introduction. It is automatically infinitely exchangeable since it was derived from the conditional distributions of the hierarchical model (4).\n\nMultiplying the conditional probabilities of each Z_n given previous ones together, we get the joint probability of Z_1, ..., Z_n with µ marginalized out:\n\n$$p(Z_1, \\dots, Z_n) = \\exp\\left( -\\alpha \\sum_{i=1}^{n} \\frac{\\Gamma(1+c)\\Gamma(i-1+c+\\sigma)}{\\Gamma(i+c)\\Gamma(c+\\sigma)} \\right) \\prod_{k=1}^{K} \\frac{\\Gamma(m_k-\\sigma)\\Gamma(n-m_k+c+\\sigma)\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)\\Gamma(n+c)} \\, \\alpha h(\\theta^*_k), \\tag{10}$$\n\nwhere there are K atoms (dishes) θ*_1, ..., θ*_K among Z_1, ..., Z_n with atom k appearing m_k times, and h is the density of H. (10) is to be contrasted with (4) in [1]. The K_h! terms in [1] are absent as we have to distinguish among these K_h dishes in assigning each of them a distinct atom (this also contributes the h(θ*_k) terms). The fact that (10) is invariant to permuting the ordering among Z_1, ..., Z_n also indicates the infinite exchangeability of the stable-beta IBP.\n\n3.3 Stick-breaking constructions\n\nIn this section we describe stick-breaking constructions for the SBP generalizing those for the beta process. The first is based on the size-biased ordering of atoms induced by the IBP [16], while the second is based on the inverse Lévy measure method [22], and produces a sequence of random atoms of strictly decreasing masses [21].\n\nThe size-biased construction is straightforward: we use the IBP to generate the atoms (dishes) in the SBP; each time a dish is newly generated the atom is drawn from H and its mass from F_{nk}. This leads to the following procedure:\n\nfor n = 1, 2, ...:\n    J_n ∼ Poisson(α Γ(1+c)Γ(n−1+c+σ) / (Γ(n+c)Γ(c+σ))),\n    for k = 1, ..., J_n:\n        v_{nk} ∼ Beta(1 − σ, n − 1 + c + σ),    ψ_{nk} ∼ H,\n\n$$\\mu = \\sum_{n=1}^{\\infty} \\sum_{k=1}^{J_n} v_{nk} \\delta_{\\psi_{nk}}. \\tag{11}$$\n\nThe inverse Lévy measure method is a general way of generating samples from a Poisson process with non-uniform rate measure. It essentially transforms the Poisson process into one with uniform rate, generates a sample, and transforms the sample back. This method is more involved for the SBP because the inverse transform has no analytically tractable form. The Lévy measure Λ_0 of the SBP factorizes into a product Λ_0(du × dθ) = L(du)H(dθ) of a σ-finite measure L(du) = α Γ(1+c) / (Γ(1−σ)Γ(c+σ)) u^{−σ−1}(1−u)^{c+σ−1} du over (0, 1) and a probability measure H over Θ. This implies that we can generate a sample {v_l, ψ_l}_{l=1}^{∞} of the random atoms of µ and their masses by first sampling the masses {v_l}_{l=1}^{∞} ∼ PoissonP(L) from a Poisson process on (0, 1) with rate measure L, and associating each v_l with an iid draw ψ_l ∼ H [19]. Now consider the mapping T : (0, 1) → (0, ∞) given by\n\n$$T(u) = \\int_u^1 L(du) = \\int_u^1 \\alpha \\frac{\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)} u^{-\\sigma-1}(1-u)^{c+\\sigma-1} \\, du. \\tag{12}$$\n\nT is bijective and monotonically decreasing. The Mapping Theorem for Poisson processes [19] shows that {v_l}_{l=1}^{∞} ∼ PoissonP(L) if and only if {T(v_l)}_{l=1}^{∞} ∼ PoissonP(ℒ) where ℒ is Lebesgue measure on (0, ∞). A sample {t_l}_{l=1}^{∞} ∼ PoissonP(ℒ) can be easily drawn by letting e_l ∼ Exponential(1) and setting t_l = Σ_{i=1}^{l} e_i for all l. Transforming back with v_l = T^{−1}(t_l), we have {v_l}_{l=1}^{∞} ∼ PoissonP(L). As t_1, t_2, ... is an increasing sequence and T is decreasing, v_1, v_2, ... is a decreasing sequence of masses. Deriving the density of v_l given v_{l−1}, we get:\n\n$$p(v_l \\mid v_{l-1}) = \\left| \\frac{dt_l}{dv_l} \\right| p(t_l \\mid t_{l-1}) = \\alpha \\frac{\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)} v_l^{-\\sigma-1}(1-v_l)^{c+\\sigma-1} \\exp\\left\\{ -\\int_{v_l}^{v_{l-1}} L(du) \\right\\}. \\tag{13}$$\n\nIn general these densities do not simplify and we have to resort to solving for T^{−1}(t_l) numerically. There are two cases for which they do simplify. For c = 1, σ = 0, the density function reduces to p(v_l | v_{l−1}) = α v_l^{α−1} / v_{l−1}^{α}, leading to the stick-breaking construction of the single parameter IBP [21]. In the stable process case when c = 1 − σ and σ ≠ 0, the density of v_l simplifies to:\n\n$$\\begin{aligned} p(v_l \\mid v_{l-1}) &= \\alpha \\frac{\\Gamma(2-\\sigma)}{\\Gamma(1-\\sigma)\\Gamma(1)} v_l^{-\\sigma-1} \\times \\exp\\left\\{ -\\int_{v_l}^{v_{l-1}} \\alpha \\frac{\\Gamma(2-\\sigma)}{\\Gamma(1-\\sigma)\\Gamma(1)} u^{-\\sigma-1} \\, du \\right\\} \\\\ &= \\alpha (1-\\sigma) v_l^{-\\sigma-1} \\exp\\left\\{ -\\frac{\\alpha(1-\\sigma)}{\\sigma} \\left( v_l^{-\\sigma} - v_{l-1}^{-\\sigma} \\right) \\right\\}. \\end{aligned} \\tag{14}$$\n\nDoing a change of variables to y_l = v_l^{−σ}, we get:\n\n$$p(y_l \\mid y_{l-1}) = \\alpha \\frac{1-\\sigma}{\\sigma} \\exp\\left\\{ -\\alpha \\frac{1-\\sigma}{\\sigma} (y_l - y_{l-1}) \\right\\}. \\tag{15}$$\n\nThat is, each y_l is exponentially distributed with rate α(1 − σ)/σ and offset by y_{l−1}. For general values of the parameters we do not have an analytic stick breaking form. 
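
The stable process case (c = 1 - sigma, sigma > 0) is simple enough to sketch directly. The following minimal example draws the first few decreasing masses by accumulating exponential increments in y_l = v_l ** (-sigma), each with rate alpha * (1 - sigma) / sigma. The function name sample_stable_case_masses is hypothetical, and the starting value y_0 = 1 is an assumption of this sketch, derived from T(1) = 0 in (12), so that the first mass lies below 1.

```python
import random


def sample_stable_case_masses(num_atoms, alpha, sigma, seed=0):
    # Valid only for the stable case c = 1 - sigma with 0 < sigma < 1.
    rng = random.Random(seed)
    rate = alpha * (1.0 - sigma) / sigma
    y = 1.0  # assumed start: y_0 = v_0 ** (-sigma) with v_0 = 1, since T(1) = 0
    masses = []
    for _ in range(num_atoms):
        y += rng.expovariate(rate)          # y_l - y_{l-1} ~ Exponential(rate)
        masses.append(y ** (-1.0 / sigma))  # map back: v_l = y_l ** (-1 / sigma)
    return masses


masses = sample_stable_case_masses(10, alpha=2.0, sigma=0.5)
print(masses)
```

Because y increases with every step, the returned masses decrease, in line with the ordering produced by the inverse Lévy measure method. No numerical inversion of T is needed in this special case.
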
However note that the weights generated using this method are still going to be strictly decreasing.\n\n3.4 Power-law Properties\n\nThe SBP has a number of appealing power-law properties. In this section we shall assume σ > 0 since the case σ = 0 reduces the SBP to the usual beta process with less interesting power-law properties. Derivations are given in the appendix.\n\nFigure 1: Power-law properties of the stable-beta Indian buffet process.\n[Figure 1 plots. Left: mean number of dishes tried versus number of customers, for α = 1, c = 1 and σ = 0.8, 0.5, 0.2, 0. Right: number of dishes versus number of customers trying each dish, for α = 1, c = 1, σ = 0.5.]\n\nFirstly, the total number of dishes tried by n customers is O(n^σ). The left panel of Figure 1 shows this for varying σ. Secondly, the number of customers trying each dish follows a Zipf’s law [23]. This is shown in the right panel of Figure 1, which plots the number of dishes K_m versus the number of customers m trying each dish (that is, K_m is the number of dishes k for which m_k = m). Asymptotically we can show that the proportion of dishes tried by m customers is O(m^{−1−σ}). Note that these power-laws are similar to those observed for Pitman-Yor processes. One aspect of the SBP which is not power-law is the number of dishes each customer tries. This is simply Poisson(α) distributed. It seems difficult to obtain power-law behavior in this aspect within a CRM framework, because of the fundamental role played by the Poisson process.\n\n4 Word Occurrence Models with Stable-beta Processes\n\nIn this section we use the SBP as a model for word occurrences in document corpora. Let n be the number of documents in a corpus. Let Z_i({θ}) = 1 if word type θ occurs in document i and 0 otherwise, and let µ({θ}) be the occurrence probability of word type θ among the documents in the corpus. We use the hierarchical model (4) with an SBP prior¹ on µ and with each document modeled as a conditionally independent Bernoulli process draw. The joint distribution over the word occurrences Z_1, ..., Z_n, with µ integrated out, is given by the IBP joint probability (10).\n\nWe applied the word occurrence model to the 20newsgroups dataset. Following [16], we modeled the training documents in each of the 20 newsgroups as a separate corpus with a separate SBP. We use the popularity of each word type across all 20 newsgroups as the base distribution²: for each word type θ let n_θ be the number of documents containing θ and let H({θ}) ∝ n_θ.\n\n¹ Words are discrete objects. To get a smooth base distribution we imagine appending each word type with a U[0, 1] variate. This does not affect the modelling that follows.\n\n² The appropriate technique, as proposed by [16], would be to use a hierarchical SBP to tie the word occurrence probabilities across the newsgroups. However due to difficulties dealing with atomic base distributions we cannot define a hierarchical SBP easily (see discussion).\n\nIn the first experiment we compared the SBP to the beta process by fitting the parameters α, c and σ of both models to each newsgroup by maximum likelihood (in the beta process case σ is fixed at 0). We expect the SBP to perform better as it is better able to capture the power-law statistics of the document corpora (see Figure 2). The ML values of the parameters across classes did not vary much, taking values α = 142.6 ± 40.0, c = 4.1 ± 0.9 and σ = 0.47 ± 0.1. In comparison, the parameter values obtained by the beta process are α = 147.3 ± 41.4 and c = 25.9 ± 8.4. Note that the estimated values for c are significantly larger than for the SBP to allow the beta process to model the fact that many words occur in a small number of documents (a consequence of the power-law statistics of word occurrences; see Figure 2). We also plotted the characteristics of data simulated from the models using the estimated ML parameters. The SBP has a much better fit than the beta process to the power-law properties of the corpora.\n\nFigure 2: Power-law properties of the 20newsgroups dataset. The faint dashed lines are the distributions of words in the documents in each class, the solid curve is the mean of these lines. The dashed lines are the means of the word distributions generated by the ML parameters for the beta process (pink) and the SBP (green).\n\nTable 1: Classification performance of SBP and beta process (BP). The jth column (denoted 1:j) shows the cumulative rank j classification accuracy of the test documents. The three numbers after the models are the percentages of training, validation and test sets respectively.\n\nassigned to classes:   1            1:2          1:3          1:4          1:5\nBP - 20/20/60          78.7(±0.5)   87.4(±0.2)   91.3(±0.2)   95.1(±0.2)   96.2(±0.2)\nSBP - 20/20/60         79.9(±0.5)   87.6(±0.1)   91.5(±0.2)   93.7(±0.2)   95.1(±0.2)\nBP - 60/20/20          85.5(±0.6)   91.6(±0.3)   94.2(±0.3)   95.6(±0.4)   96.6(±0.3)\nSBP - 60/20/20         85.5(±0.4)   91.9(±0.4)   94.4(±0.2)   95.6(±0.3)   96.6(±0.3)\n\nIn the second experiment we tested the two models on categorizing test documents into one of the 20 newsgroups. Since this is a discriminative task, we optimized the parameters in both models to maximize the cumulative ranked classification performance. 
The rank j classification performance is defined to be the percentage of documents where the true label is among the top j predicted classes (as determined by the IBP conditional probabilities of the documents under each of the 20 newsgroup classes). As the cost function is not differentiable, we did a grid search over the parameter space, using 20 values of α, c and σ each, and found the parameters maximizing the objective function on a validation set separate from the test set. To see the effect of sample size on model performance we tried splitting the documents in each newsgroup into 20% training, 20% validation and 60% test sets, and into 60% training, 20% validation and 20% test sets. We repeated the experiment five times with different random splits of the dataset. The ranked classification rates are shown in Table 1. Figure 3 shows that the SBP model has generally higher classification performances than the beta process.\n\n5 Discussion\n\nWe have introduced a novel stochastic process called the stable-beta process. The stable-beta process is a generalization of the beta process, and can be used in nonparametric Bayesian featural models with an unbounded number of features. As opposed to the beta process, the stable-beta process has a number of appealing power-law properties. We developed both an Indian buffet process and a stick-breaking construction for the stable-beta process and applied it to modeling word occurrences in document corpora. We expect the stable-beta process to find uses modeling a range of natural phenomena with power-law properties.\n\n[Figure 2 plots. Left: cumulative number of words versus number of documents. Right: number of words versus number of documents per word. Legend in both panels: BP, SBP, DATA.]\n\nFigure 3: Differences between the classification rates of the SBP and the beta process. 
The performance of the SBP was consistently higher than that of the beta process for each of the five runs. [Figure 3 plot. Horizontal axis: class order (1 to 5). Vertical axis: SBP − BP, in units of 10^−3.]\n\nWe derived the stable-beta process as a completely random measure with Lévy measure (3). It would be interesting and illuminating to try to derive it as an infinite limit of finite models, however we were not able to do so in our initial attempts. A related question is whether there is a natural definition of the stable-beta process for non-smooth base distributions. Until this is resolved in the positive, we are not able to define hierarchical stable-beta processes generalizing the hierarchical beta processes [16].\n\nAnother avenue of research we are currently pursuing is in deriving better stick-breaking constructions for the stable-beta process. The current construction requires inverting the integral (12), which is expensive as it requires an iterative method which evaluates the integral numerically within each iteration.\n\nAcknowledgement\n\nWe thank the Gatsby Charitable Foundation for funding, Romain Thibaux, Peter Latham and Tom Griffiths for interesting discussions, and the anonymous reviewers for help and feedback.\n\nA Derivation of Power-law Properties\n\nWe will make large n and K assumptions here, and make use of Stirling’s approximation Γ(n+1) ≈ √(2πn) (n/e)^n, which is accurate in the larger n regime. The expected number of dishes is,\n\n$$\\begin{aligned} E[K] &= \\sum_{i=0}^{n-1} \\alpha \\frac{\\Gamma(1+c)\\Gamma(i+c+\\sigma)}{\\Gamma(i+1+c)\\Gamma(c+\\sigma)} \\in O\\left( \\sum_{i=1}^{n} \\frac{\\sqrt{2\\pi(i+c+\\sigma-1)} \\, ((i+c+\\sigma-1)/e)^{i+c+\\sigma-1}}{\\sqrt{2\\pi(i+c)} \\, ((i+c)/e)^{i+c}} \\right) \\\\ &= O\\left( \\sum_{i=1}^{n} e^{-\\sigma+1} \\left( 1 + \\frac{\\sigma-1}{i+c} \\right)^{i+c} (i+c+\\sigma-1)^{\\sigma-1} \\right) = O\\left( \\sum_{i=1}^{n} e^{-\\sigma+1} e^{\\sigma-1} i^{\\sigma-1} \\right) = O(n^{\\sigma}). \\end{aligned} \\tag{16}$$\n\nWe are interested in the joint distribution of the statistics (K_1, ..., K_n), where K_m is the number of dishes tried by exactly m customers and where there are a total of n customers in the restaurant. As there are (K! / Π_{m=1}^{n} K_m!) Π_{m=1}^{n} (n! / (m!(n−m)!))^{K_m} configurations of the IBP with the same statistics (K_1, ..., K_n), we have (ignoring constant terms and collecting terms in (10) with m_k = m),\n\n$$p(K_1, \\dots, K_n \\mid n) \\propto \\frac{K!}{\\prod_{m=1}^{n} K_m!} \\prod_{m=1}^{n} \\left( \\frac{\\Gamma(m-\\sigma)\\Gamma(n-m+c+\\sigma)\\Gamma(1+c)}{\\Gamma(1-\\sigma)\\Gamma(c+\\sigma)\\Gamma(n+c)} \\, \\frac{n!}{m!(n-m)!} \\right)^{K_m}. \\tag{17}$$\n\nConditioning on K = Σ_{m=1}^{n} K_m as well, we see that (K_1, ..., K_n) is multinomial with the probability of a dish having m customers being proportional to the term in large parentheses. For large m (and even larger n), this probability simplifies to,\n\n$$O\\left( \\frac{\\Gamma(m-\\sigma)}{\\Gamma(m+1)} \\right) = O\\left( \\frac{\\sqrt{2\\pi(m-1-\\sigma)} \\, ((m-1-\\sigma)/e)^{m-1-\\sigma}}{\\sqrt{2\\pi m} \\, (m/e)^{m}} \\right) = O\\left( m^{-1-\\sigma} \\right). \\tag{18}$$\n\nReferences\n\n[1] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18, 2006.\n\n[2] Z. Ghahramani, T. L. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models (with discussion and rejoinder). In Bayesian Statistics, volume 8, 2007.\n\n[3] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. In International Conference on Independent Component Analysis and Signal Separation, volume 7 of Lecture Notes in Computer Science. Springer, 2007.\n\n[4] D. Görür, F. Jäkel, and C. E. Rasmussen. 
A choice model with infinitely many latent features. In Proceedings of the International Conference on Machine Learning, volume 23, 2006.\n[5] D. J. Navarro and T. L. Griffiths. Latent features in similarity judgment: A nonparametric Bayesian approach. Neural Computation, in press 2008.\n[6] E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, volume 19, 2007.\n[7] F. Wood, T. L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 22, 2006.\n[8] S. Goldwater, T. L. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18, 2006.\n[9] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985\u2013992, 2006.\n[10] E. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems, volume 21, 2009.\n[11] M. Perman, J. Pitman, and M. Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21\u201339, 1992.\n[12] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855\u2013900, 1997.\n[13] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161\u2013173, 2001.\n[14] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. 
Annals of Statistics, 1(2):209\u2013230, 1973.\n[15] N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259\u20131294, 1990.\n[16] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 11, pages 564\u2013571, 2007.\n[17] M. Perman. Random Discrete Distributions Derived from Subordinators. PhD thesis, Department of Statistics, University of California at Berkeley, 1990.\n[18] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59\u201378, 1967.\n[19] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1993.\n[20] Y. Kim. Nonparametric Bayesian estimators for counting processes. Annals of Statistics, 27(2):562\u2013588, 1999.\n[21] Y. W. Teh, D. G\u00f6r\u00fcr, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.\n[22] R. L. Wolpert and K. Ickstadt. Simulations of L\u00e9vy random fields. In Practical Nonparametric and Semiparametric Bayesian Statistics, pages 227\u2013242. Springer-Verlag, 1998.\n[23] G. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Harvard University Press, Cambridge, MA, 1932.\n", "award": [], "sourceid": 464, "authors": [{"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "Dilan", "family_name": "Gorur", "institution": null}]}