{"title": "Finite-Dimensional BFRY Priors and Variational Bayesian Inference for Power Law Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3162, "page_last": 3170, "abstract": "Bayesian nonparametric methods based on the Dirichlet process (DP), gamma process and beta process, have proven effective in capturing aspects of various datasets arising in machine learning. However, it is now recognized that such processes have their limitations in terms of the ability to capture power law behavior. As such there is now considerable interest in models based on the Stable Process (SP), Generalized Gamma process (GGP) and Stable-beta process (SBP). These models present new challenges in terms of practical statistical implementation. In analogy to tractable processes such as the finite-dimensional Dirichlet process, we describe a class of random processes, we call iid finite-dimensional BFRY processes, that enables one to begin to develop efficient posterior inference algorithms such as variational Bayes that readily scale to massive datasets. For illustrative purposes, we describe a simple variational Bayes algorithm for normalized SP mixture models, and demonstrate its usefulness with experiments on synthetic and real-world datasets.", "full_text": "Finite-Dimensional BFRY Priors and Variational Bayesian Inference for Power Law Models\n\nJuho Lee\nPOSTECH, Korea\nstonecold@postech.ac.kr\n\nLancelot F. James\nHKUST, Hong Kong\nlancelot@ust.hk\n\nSeungjin Choi\nPOSTECH, Korea\nseungjin@postech.ac.kr\n\nAbstract\n\nBayesian nonparametric methods based on the Dirichlet Process (DP), gamma process\nand beta process, have proven effective in capturing aspects of various datasets\narising in machine learning. However, it is now recognized that such processes\nhave their limitations in terms of the ability to capture power law behavior. 
As such\nthere is now considerable interest in models based on the Stable Process (SP),\nGeneralized Gamma process (GGP) and Stable-Beta Process (SBP). These models\npresent new challenges in terms of practical statistical implementation. In analogy\nto tractable processes such as the finite-dimensional Dirichlet process, we describe\na class of random processes, which we call i.i.d. finite-dimensional BFRY processes, that\nenables one to begin to develop efficient posterior inference algorithms such as\nvariational Bayes that readily scale to massive datasets. For illustrative purposes,\nwe describe a simple variational Bayes algorithm for normalized SP mixture models,\nand demonstrate its usefulness with experiments on synthetic and real-world\ndatasets.\n\n1 Introduction\n\nBayesian non-parametric ideas have played a major role in various intricate applications in statistics\nand machine learning. The Dirichlet process (DP) [1], due to its remarkable properties and relative\ntractability, has been the primary choice for many applications. It has also inspired the development of\nvariants such as the HDP [2] which can be seen as an infinite-dimensional extension of latent Dirichlet\nallocation [3]. While there are many possible descriptions of a DP, a most intuitive one is its view as\nthe limit, as K \u2192 \u221e, of a finite-dimensional Dirichlet process, $P_K(A) = \\sum_{k=1}^{K} D_k I_{\\{V_k \\in A\\}}$, where\none can take (D1, . . . , DK) to be a K-variate symmetric Dirichlet vector on the (K \u2212 1)-simplex\nwith parameters (\u03b8/K, . . . , \u03b8/K), for \u03b8 > 0 and {Vk} are an arbitrary i.i.d. sequence of variables\nover a space \u2126, with law H(A) = Pr(Vk \u2208 A). Multiplying by a G\u03b8, an independent Gamma(\u03b8, 1)\nvariable, leads to a finite-dimensional Gamma process $\\Gamma_K(A) = \\sum_{k=1}^{K} G_k I_{\\{V_k \\in A\\}} := G_\\theta P_K(A)$,\nwhere {Gk} are i.i.d. Gamma(\u03b8/K, 1) variables, and one may set \u0393K(\u2126) = G\u03b8. 
It was shown\nthat $\\lim_{K \\to \\infty} (P_K, \\Gamma_K) \\overset{d}{=} (\\tilde{F}_{0,\\theta}, \\tilde{\\mu}_{0,\\theta})$, where the limits correspond to a DP and a Gamma process\n(GP) [4]. While (PK, \u0393K) are often viewed as approximations to the DP and Gamma process (GP),\nthe works of [5, 6, 7] and references therein demonstrate the general utility of these models.\nThe relationship between the GP and DP shows that the GP is a more flexible random process. This\nis borne out by its recognized applicability for a wider range of data structures. As such, it suffices\nto focus on \u0393K as a tractable instance of what we refer to as an i.i.d. finite-dimensional process. In\ngeneral, we say a random process, $\\mu_K(\\cdot) := \\sum_{k=1}^{K} J_k \\delta_{V_k}$, is an i.i.d. finite-dimensional process if\n[(i)] for each fixed K, (J1, . . . , JK) are i.i.d. random variables, and [(ii)] $\\lim_{K \\to \\infty} \\mu_K \\overset{d}{=} \\tilde{\\mu}$, where \u02dc\u00b5 is a\ncompletely random measure (CRM) [8]. In fact, from [9] (Theorem 14), it follows that if the limit\nexists \u02dc\u00b5 must be a CRM and therefore T := \u02dc\u00b5(\u2126) < \u221e is a non-negative infinitely divisible random\nvariable.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fOn the other hand, it is important to note that {Jk} and $T_K = \\mu_K(\\Omega) = \\sum_{k=1}^{K} J_k$ need\nnot be infinitely divisible. We also point out there are many constructions of \u00b5K that converge to\nthe same \u02dc\u00b5. According to [4], for every CRM \u02dc\u00b5 one can construct special cases of \u00b5K that always\nconverge as follows: Let (C1, . . . , CK) denote a disjoint partition of \u2126 such that H(Ck) = 1/K for\nk = 1, . . . , K, then one can set $J_k \\overset{d}{=} \\tilde{\\mu}(C_k)$, where the {Jk} are i.i.d. infinitely divisible variables\nand $T_K \\overset{d}{=} T$. For reference we shall call such \u00b5K finite-dimensional Kingman processes or simply\nKingman processes. 
It is clear that the \ufb01nite-dimensional gamma process satis\ufb01es such a construction\nwith Jk = Gk and TK = G\u03b8. However, the nice tractable features of this special case, do not carry\nover in general. This is due to the fact that there are many cases where the distribution of \u02dc\u00b5(Ck), is\nnot tractable either in the sense of not being easily simulated or having a relatively simple density.\nThe latter is of particular importance if one wants to consider developing inferential techniques for\nCRM models that scale readily to large or massive data sets. An example of this would be variational\nBayes type methods, which would otherwise be well suited to the i.i.d. based models [10]. As such\nwe consider a \ufb01ner class of i.i.d. \ufb01nite-dimensional processes as follows: We say \u00b5K is ideal if in\naddition to [(i)] and [(ii)] it satis\ufb01es [(iii)] the Jk are easily simulated [(iv)] the density of Jk has\nan explicit closed form suitable for application of techniques such as variational Bayes. We do not\nattempt to specify any formal structure on what we mean by ideal, except to note that one can easily\nrecognize a choice of \u00b5K that is not ideal.\nOur focus in this paper is not to explore the generalities of \ufb01nite-dimensional processes. Rather\nit is to identify speci\ufb01c ideal processes which are suitable for important cases where \u02dc\u00b5 is a Stable\nprocess (SP), or Generalized Gamma process (GGP). Furthermore by a simple transformation we\ncan construct processes that have behaviour similar to a Stable-Beta process (SBP). The SP, GGP,\nSBP, and processes constructed from them, are now regarded as important alternatives to the DP,\nGP and beta process (BP), as they, unlike the (DP, GP, BP), are better able to capture power law\nbehavior inherent in many types of datasets [11, 12, 13, 14, 15]. Unfortunately Kingman processes\nbased on SP, GGP or SBP are clearly not ideal. 
Indeed, if one considers for 0 < \u03b1 < 1, T = S\u03b1\na positive stable random variable, with density f\u03b1, then the corresponding stable process \u02dc\u00b5\u03b1,0 is\nsuch that $J_k \\overset{d}{=} \\tilde{\\mu}_{\\alpha,0}(C_k) \\overset{d}{=} K^{-1/\\alpha} S_\\alpha$. While it is fairly easy to sample S\u03b1 and hence each Jk, it is\nwell-known that the density f\u03b1 does not generally have a tractable form. Things become worse in\nthe GGP setting as the relevant density is formed by exponentially tilting the density f\u03b1. Finally, it is\nnot clear from the literature how to sample T for SBP, much less whether its\ncorresponding density has a simple form. Here we shall construct ideal processes based on various manipulations of a\nclass of \u00b5K we call finite-dimensional BFRY [16] processes. We note that BFRY random variables\nappear in machine learning contexts in recent work [17], albeit in a very different role.\nBased on finite-dimensional BFRY processes, we provide simple variational Bayes algorithms for\nmixture models based on normalized SP and GGP. We also derive collapsed variational Bayes\nalgorithms where the jumps are marginalized out. We demonstrate the effectiveness of our approach\non both synthetic and real-world datasets. Our intent here is to demonstrate how these processes can\nbe used within the context of variational inference. This in turn hopefully helps to elucidate how to\nimplement such procedures, or other inference techniques that benefit from explicit densities, such as\nhybrid Monte Carlo [18] or stochastic gradient MCMC algorithms [19].\n\n2 Background\n\n2.1 Completely random measure and Laplace functionals\nLet (\u2126,F) be a measurable space. A random measure \u00b5 on \u2126 is completely random [8] if for any\ndisjoint A, B \u2208 F, \u00b5(A) and \u00b5(B) are independent. 
It is known that any CRM can be written as\nthe sum of a deterministic measure, a measure with fixed atoms, and a random measure represented\nas a linear functional of the Poisson process [8]. In this paper, we focus on CRMs with only the\nthird component. Let \u03a0 be a Poisson process on R+ \u00d7 \u2126 with mean intensity decomposed as\n\u03bd(ds, d\u03c9) = \u03c1(ds)H(d\u03c9). A realization of \u03a0 and corresponding CRM is written as\n\n$$\\Pi = \\sum_{k=1}^{\\Pi(\\mathbb{R}_+, \\Omega)} \\delta_{(s_k, \\omega_k)}, \\qquad \\mu = \\int_0^\\infty s \\, \\Pi(ds, d\\omega) = \\sum_{k=1}^{\\Pi(\\mathbb{R}_+, \\Omega)} s_k \\delta_{\\omega_k}. \\tag{1}$$\n\nWe refer to \u03c1 as the L\u00e9vy measure of \u00b5 and H as the base measure, and write \u00b5 \u223c CRM(\u03c1, H).\nExamples of CRMs include the gamma process GP(\u03b8, H) with L\u00e9vy measure $\\rho(ds) = \\theta s^{-1} e^{-s} ds$\nor the beta process BP(c, \u03b8, H) with L\u00e9vy measure $\\rho(du) = \\theta c u^{-1} (1 - u)^{c-1} I_{\\{0 \\le u \\le 1\\}} du$. Stable,\ngeneralized gamma, and stable beta are also CRMs, and we will discuss them later.\nA CRM is identified by its Laplace functional, just as a random variable is identified by its characteristic\nfunction [20]. For a random measure \u00b5 and a measurable function f, the Laplace functional\nL\u00b5(f ) is defined as\n\n$$L_\\mu(f) := E[e^{-\\mu(f)}], \\qquad \\mu(f) := \\int_\\Omega f(\\omega) \\mu(d\\omega). \\tag{2}$$\n\nWhen \u00b5 \u223c CRM(\u03c1, H), the Laplace functional can be computed using the following theorem.\nTheorem 1. 
(L\u00e9vy-Khintchine Formula [21]) For \u00b5 \u223c CRM(\u03c1, H) and measurable f on \u2126,\n\n$$L_\\mu(f) = \\exp\\left\\{ -\\int_\\Omega \\int_0^\\infty (1 - e^{-s f(\\omega)}) \\rho(ds) H(d\\omega) \\right\\}. \\tag{3}$$\n\n2.2 Stable and related processes\n\nA Stable Process SP(\u03b8, \u03b1, H) is a CRM with L\u00e9vy measure\n\n$$\\rho(ds) = \\frac{\\theta}{\\Gamma(1 - \\alpha)} s^{-\\alpha-1} ds, \\tag{4}$$\n\nand a Generalized Gamma Process GGP(\u03b8, \u03b1, \u03c4, H) is a CRM with L\u00e9vy measure\n\n$$\\rho(ds) = \\frac{\\theta}{\\Gamma(1 - \\alpha)} s^{-\\alpha-1} e^{-\\tau s} ds, \\tag{5}$$\n\nwhere \u03b8 > 0, 0 < \u03b1 < 1, and \u03c4 > 0.\nGGP is general in the sense that we can get many other processes from it. For example, by letting\n\u03b1 \u2192 0 we get GP, and by setting \u03c4 = 0 we get SP. Furthermore, while it is well-known that the\nPitman-Yor process (see [22] and [23]) can be derived from SP, there is also a construction based\non GGP as follows. In particular as a consequence of ([23], Proposition 21), if we randomize \u03b8 =\nGamma(\u03b8\u2032/\u03b1, 1) in SP and normalize the jumps, then we get the Pitman-Yor process PYP(\u03b8\u2032, \u03b1)\nfor \u03b8\u2032 > 0. The jumps of SP and GGP are known to be heavy-tailed, and this results in power-law\nbehaviour of data drawn from models having those processes as priors.\nThe stable beta process SBP(\u03b8, \u03b1, c, H) is a CRM with L\u00e9vy measure\n\n$$\\rho(du) = \\frac{\\theta \\Gamma(1 + c)}{\\Gamma(1 - \\alpha) \\Gamma(c + \\alpha)} u^{-\\alpha-1} (1 - u)^{c+\\alpha-1} I_{\\{0 \\le u \\le 1\\}} du, \\tag{6}$$\n\nwhere \u03b8 > 0, 0 < \u03b1 < 1, and c > \u2212\u03b1. SBP can be viewed as a heavy-tailed extension of BP, and the\nspecial case of c = 0 can be obtained by applying the transformation u = s/(s + 1) in SP.\n\n2.3 BFRY distributions\n\nThe BFRY distribution with parameter 0 < \u03b1 < 1, written as BFRY(\u03b1), is the distribution with\ndensity\n\n$$g_\\alpha(s) = \\frac{\\alpha}{\\Gamma(1 - \\alpha)} s^{-\\alpha-1} (1 - e^{-s}). \\tag{7}$$\n\nWe can simulate S \u223c BFRY(\u03b1) with $S \\overset{d}{=} G/B$, where G \u223c Gamma(1\u2212\u03b1, 1) and B \u223c Beta(\u03b1, 1).\nOne can easily verify this by computing the density of the ratio distribution.\nThe name BFRY was coined in [16] after the work of Bertoin, Fujita, Roynette, and Yor [24]\nwho obtained explicit descriptions of the infinitely divisible random variable and subordinator\ncorresponding to the density. However, the density arises much earlier, and can be found in a variety\nof contexts, for instance in [23] (Proposition 12, Corollary 13 and see also Eq.(124)) and [25]. See\n[17] for the use of BFRY distributions to induce the closed form Indian buffet process type generative\nprocesses that have a type III power law behaviour.\nWe also explain some variations of BFRY distributions needed for the construction of finite-dimensional\nBFRY processes for SP and GGP. First, we can scale the BFRY random variables\nby some scale c > 0. In that case, we write S \u223c BFRY(c, \u03b1), and the density is given as\n\n$$g_{c,\\alpha}(s) = \\frac{c}{\\Gamma(1 - \\alpha)} s^{-\\alpha-1} (1 - e^{-(\\alpha/c)^{1/\\alpha} s}). \\tag{8}$$\n\nWe can easily sample S \u223c BFRY(c, \u03b1) as $S \\overset{d}{=} (\\alpha/c)^{-1/\\alpha} T$ where T \u223c BFRY(\u03b1). We can also\nexponentially tilt the scaled BFRY random variable, with a parameter \u03c4 > 0. 
For that we write\nS \u223c BFRY(c, \u03c4, \u03b1), and the density is given as\n\n$$g_{c,\\tau,\\alpha}(s) = \\frac{\\alpha s^{-\\alpha-1} e^{-\\tau s} (1 - e^{-(\\alpha/c)^{1/\\alpha} s})}{\\Gamma(1 - \\alpha) \\{ (\\tau + (\\alpha/c)^{1/\\alpha})^\\alpha - \\tau^\\alpha \\}}. \\tag{9}$$\n\nWe can simulate S \u223c BFRY(c, \u03c4, \u03b1) as $S \\overset{d}{=} GT$ where G \u223c Gamma(1 \u2212 \u03b1, 1) and T is a random\nvariable with density\n\n$$h(t) = \\frac{\\alpha t^{-\\alpha-1}}{(\\tau + (\\alpha/c)^{1/\\alpha})^\\alpha - \\tau^\\alpha} I_{\\{(\\tau + (\\alpha/c)^{1/\\alpha})^{-1} \\le t \\le \\tau^{-1}\\}}, \\tag{10}$$\n\nwhich can easily be sampled using inverse transform sampling.\n\n3 Main Contributions\n\n3.1 A Motivating example\n\nBefore we jump into our method, we first revisit an example of ideal finite-dimensional processes.\nInspired by constructions of DP and GP, the Indian buffet process (IBP, [26]) was developed as a\nmodel for feature selection, by considering the limit K \u2192 \u221e of an M \u00d7 K binary matrix whose entries\n{Zm,k} are conditionally independent Bern(Uk) variables where {Uk} are i.i.d. Beta(\u03b8/K, 1)\nvariables. Although not explicitly described as such, this leads to the notion of a finite-dimensional\nbeta process $\\mu_K = \\sum_{k=1}^{K} U_k \\delta_{V_k}$. In [26], IBP was obtained as the limit of the marginal distribution\nwhere \u00b5K was marginalized out, and this result coupled with [27] shows indirectly that\n\u00b5K \u2192 \u00b5 \u223c BP(\u03b8, H) as K \u2192 \u221e. Here, we show another proof of this convergence, by inspecting the\nLaplace functional of \u00b5K. The Laplace functional of \u00b5K is computed as follows:\n\n$$L_{\\mu_K}(f) = E[e^{-\\mu_K(f)}] = \\left[ \\int_\\Omega \\int_0^1 \\frac{\\theta}{K} u^{\\frac{\\theta}{K} - 1} e^{-u f(\\omega)} du H(d\\omega) \\right]^K = \\left[ 1 - \\frac{1}{K} \\int_\\Omega \\int_0^1 \\theta u^{\\frac{\\theta}{K} - 1} (1 - e^{-u f(\\omega)}) du H(d\\omega) \\right]^K. \\tag{11}$$\n\nSince $u^{\\theta/K}$ is bounded by 1, the bounded convergence theorem implies\n\n$$\\lim_{K \\to \\infty} L_{\\mu_K}(f) = \\exp\\left\\{ -\\int_\\Omega \\int_0^\\infty (1 - e^{-u f(\\omega)}) \\theta u^{-1} I_{\\{0 \\le u \\le 1\\}} du H(d\\omega) \\right\\}, \\tag{12}$$\n\nwhich exactly matches the Laplace functional of \u00b5 computed by Eq. (3). In contrast to the marginal\nlikelihood arguments, in our proof, we illustrate the direct relationship between the random measures\nand suggest a blueprint that can be applied to other CRMs. Note that the finite-dimensional beta\nprocess is not a Kingman process, since the beta variables are not infinitely divisible and the total\nmass T is a Dickman variable. We can also apply our argument to the case of the finite-dimensional\ngamma process, the proof of which is given in our supplementary material.\n\n3.2 Finite-dimensional BFRY processes\n\nInspired by the finite-dimensional beta and gamma process examples, we propose finite-dimensional\nBFRY processes, which converge to SP, GGP, and SBP as K \u2192 \u221e.\n\nTheorem 2. (Finite-dimensional BFRY processes)\n\n(i) Let \u00b5 \u223c SP(\u03b8, \u03b1, H). Construct \u00b5K as follows:\n\n$$J_1, \\ldots, J_K \\overset{\\text{i.i.d.}}{\\sim} \\mathrm{BFRY}(\\theta/K, \\alpha), \\quad V_1, \\ldots, V_K \\overset{\\text{i.i.d.}}{\\sim} H, \\quad \\mu_K = \\sum_{k=1}^{K} J_k \\delta_{V_k}. \\tag{13}$$\n\n(ii) Let \u00b5 \u223c GGP(\u03b8, \u03b1, \u03c4, H). Construct \u00b5K as follows:\n\n$$J_1, \\ldots, J_K \\overset{\\text{i.i.d.}}{\\sim} \\mathrm{BFRY}(\\theta/K, \\tau, \\alpha), \\quad V_1, \\ldots, V_K \\overset{\\text{i.i.d.}}{\\sim} H, \\quad \\mu_K = \\sum_{k=1}^{K} J_k \\delta_{V_k}. \\tag{14}$$\n\n(iii) Let \u00b5 \u223c SBP(\u03b8, \u03b1, 0, H). Construct \u00b5K as follows:\n\n$$S_1, \\ldots, S_K \\overset{\\text{i.i.d.}}{\\sim} \\mathrm{BFRY}(\\theta/K, \\alpha), \\quad J_k = \\frac{S_k}{S_k + 1} \\text{ for } k = 1, \\ldots, K, \\quad V_1, \\ldots, V_K \\overset{\\text{i.i.d.}}{\\sim} H, \\quad \\mu_K = \\sum_{k=1}^{K} J_k \\delta_{V_k}. \\tag{15}$$\n\nFor all three cases, $\\lim_{K \\to \\infty} L_{\\mu_K}(f) = L_\\mu(f)$ for an arbitrary measurable f.\n\nProof. We first provide a proof for SP case (i), and the proof for GGP (ii) is almost identical. The\nLaplace functional of \u00b5K is written as\n\n$$L_{\\mu_K}(f) = \\left[ \\int_\\Omega \\int_0^\\infty e^{-s f(\\omega)} \\frac{\\theta}{K \\Gamma(1 - \\alpha)} s^{-\\alpha-1} (1 - e^{-(\\alpha K/\\theta)^{1/\\alpha} s}) ds H(d\\omega) \\right]^K = \\left[ 1 - \\frac{1}{K} \\int_\\Omega \\int_0^\\infty \\frac{\\theta}{\\Gamma(1 - \\alpha)} (1 - e^{-s f(\\omega)}) s^{-\\alpha-1} (1 - e^{-(\\alpha K/\\theta)^{1/\\alpha} s}) ds H(d\\omega) \\right]^K.$$\n\nSince $1 - e^{-(\\alpha K/\\theta)^{1/\\alpha} s}$ is bounded by 1, the bounded convergence theorem implies\n\n$$\\lim_{K \\to \\infty} L_{\\mu_K}(f) = \\exp\\left\\{ -\\int_\\Omega \\int_0^\\infty (1 - e^{-s f(\\omega)}) \\frac{\\theta}{\\Gamma(1 - \\alpha)} s^{-\\alpha-1} ds H(d\\omega) \\right\\},$$\n\nwhich exactly matches the Laplace functional of SP. The proof of (iii) is trivial from (i) and the\nrelationship between SP and SBP.\nCorollary 1. Let \u03c4 = 1 and \u03b1 \u2192 0 in (14). Then \u00b5K will converge to \u00b5 \u223c GP(\u03b8, H).\nProof. The result is trivial by letting \u03b1 \u2192 0 in $L_{\\mu_K}(f)$.\n\nFinite-dimensional BFRY processes are certainly ideal processes, since\nwe can easily sample the jumps {Jk}, and we have explicit closed form\ndensities written as (8) and (9). 
Hence, based on those processes, we\ncan develop efficient inference algorithms such as variational Bayes for\npower-law models related to SP, GGP, and SBP that require explicit\ndensities of jumps. Figure 1 illustrates the log of average jump sizes\nof 100 normalized SPs drawn using finite-dimensional BFRY processes,\nwith \u03b8 = 1, K = 1000, and varying \u03b1. As expected, the jumps generated\nwith bigger \u03b1 are more heavy-tailed.\n\nFigure 1: Log of average jump sizes of NSPs. Axes: atom indices (sorted) versus log (jumps); curves for alpha=0.8, alpha=0.4 and alpha=0.2.\n\n3.3 Finite-dimensional normalized random measure mixture models\n\nA normalized random measure (NRM) is obtained by normalizing a CRM by its total mass. An NRM\nmixture model (NRMM) is then defined as a mixture model with NRM prior, and its generative\nprocess is written as follows:\n\n$$\\mu \\sim \\mathrm{CRM}(\\rho, H), \\quad \\phi_1, \\ldots, \\phi_N \\overset{\\text{i.i.d.}}{\\sim} \\mu/\\mu(\\Omega), \\quad X_n \\mid \\phi_n \\sim L(\\phi_n), \\tag{16}$$\n\nwhere L is a likelihood distribution. One can easily do posterior inferences by marginalizing out \u00b5,\nwith an auxiliary variable. Once \u00b5 is marginalized out we can develop a Gibbs sampler [28]. However,\nthis scales poorly as mentioned earlier. On the other hand, one may replace \u00b5 with \u00b5K, yielding\nthe finite-dimensional NRMM (FNRMM), for which efficient variational Bayes can be developed\nprovided that the finite-dimensional process is ideal.\n\n3.4 Variational Bayes for finite-dimensional mixture models\n\nWe first introduce a variational Bayes algorithm for finite-dimensional normalized SP mixture\n(FNSPM). 
The joint likelihood of the model is written as\n\n$$\\Pr(\\{X_n \\in dx_n, z_n\\}, \\{J_k \\in ds_k, V_k \\in d\\omega_k\\}) = s_\\cdot^{-N} \\prod_{k=1}^{K} s_k^{N_k} g_{\\theta/K,\\alpha}(ds_k) \\prod_{z_n = k} L(dx_n \\mid \\omega_k) H(d\\omega_k), \\tag{17}$$\n\nwhere $s_\\cdot := \\sum_k s_k$, and zn is an indicator variable such that zn = k if \u03c6n = \u03c9k. We found it\nconvenient to introduce an auxiliary variable U \u223c Gamma(N, s\u00b7) as in [20] to remove $s_\\cdot^{-N}$:\n\n$$\\Pr(\\{X_n \\in dx_n, z_n\\}, \\{J_k \\in ds_k, V_k \\in d\\omega_k\\}, U \\in du) \\propto u^{N-1} du \\prod_{k=1}^{K} s_k^{N_k} e^{-u s_k} g_{\\theta/K,\\alpha}(s_k) ds_k \\prod_{z_n = k} L(dx_n \\mid \\omega_k) H(d\\omega_k). \\tag{18}$$\n\nNow we introduce variational distributions for {z, s, \u03c9, u} and optimize the Evidence Lower BOund\n(ELBO) with respect to the parameters of the variational distributions. The posterior statistics can be\nsimulated using the optimized variational distributions. We can also optimize the hyperparameters \u03b8\nand \u03b1 with ELBO. The detailed optimization procedure is described in the supplementary material.\n\n3.5 Collapsed Gibbs sampler for finite-dimensional mixture models\nAs with the NRMM, we can also marginalize out the jumps {Jk} to get the collapsed model.\nMarginalizing out s in (18) gives\n\n$$\\Pr(\\{X_n \\in dx_n, z_n\\}, \\{V_k \\in d\\omega_k\\}, U \\in du) \\propto u^{N-1} du \\prod_{k=1}^{K} \\left[ \\frac{\\theta (1 - \\xi^{N_k - \\alpha})}{u^{N_k - \\alpha}} \\frac{\\Gamma(N_k - \\alpha)}{\\Gamma(1 - \\alpha)} \\right]^{I_{\\{N_k > 0\\}}} \\left[ \\frac{\\theta u^\\alpha}{\\alpha} (\\xi^{-\\alpha} - 1) \\right]^{I_{\\{N_k = 0\\}}} \\prod_{z_n = k} L(dx_n \\mid \\omega_k) H(d\\omega_k), \\tag{19}$$\n\nwhere $\\xi := \\frac{u}{u + (\\alpha K/\\theta)^{1/\\alpha}}$. Based on this, we can derive the collapsed Gibbs sampler for FNSPM,\nand the detailed equations are in the supplementary material.\n\n3.6 Collapsed variational Bayes for finite-dimensional mixture models\n\nBased on the marginalized log likelihood (19), we can develop a collapsed variational Bayes algorithm\nfor FNSPM, following the collapsed variational inference algorithm for DPM [29]. We introduce\nvariational distributions for {u, z, \u03c9}, and then the update equation for q(z) is computed using the\nconditional posterior p(z|x). The hyperparameters can also be optimized, the detailed procedures for\nwhich are explained in the supplementary material.\n\n4 Experiments\n\n4.1 Experiments on synthetic datasets\n\n4.1.1 Data generation\n\nWe generated 10 datasets from PYP mixture models. Each dataset was generated as follows. We\nfirst generated cluster labels for 2,000 data points from PYP(\u03b8, \u03b1) with \u03b8 = 1 and \u03b1 = 0.7. Given\nthe cluster labels, we generated data points from Mult(M, \u03c9), where the number of trials M was\nchosen uniformly from [1, 50] and \u03c9 was sampled from $\\mathrm{Dir}(0.05 \\cdot \\mathbf{1}_{200})$. We also generated another\n10 datasets from CRP mixture models CRP(\u03b8) with \u03b8 = 1, to see if FNSPM adapts to the change of\nthe underlying random measure. For each dataset, we used 80% of data points for training and the\nremaining 20% for testing.\n\nFigure 2: (Left) Comparison between the infinite-dimensional algorithm and the finite-dimensional algorithms.\n(Middle) Average times per iteration of the infinite- and the finite-dimensional algorithms. (Right) Average\nnumber of iterations needed to converge for variational algorithms.\n\nTable 1: Comparison between six finite-dimensional algorithms on synthetic PYP, synthetic CRP, AP corpus and\nNIPS corpus. Average test log-likelihood values and \u03b1 estimates are shown with standard deviations in parentheses.\n\nAlgorithm | PYP loglikel | PYP \u03b1 | CRP loglikel | CRP \u03b1 | AP loglikel | AP \u03b1 | NIPS loglikel | NIPS \u03b1\nCG/D | -33.2078 (1.5557) | - | -25.4076 (1.9081) | - | -157.2228 (0.0189) | - | -352.8909 (0.0070) | -\nVB/D | -33.4480 (1.6495) | - | -25.4148 (1.9120) | - | -157.2379 (0.0304) | - | -352.9104 (0.0172) | -\nCVB/D | -33.4278 (1.6525) | - | -25.4150 (1.9115) | - | -157.2302 (0.0280) | - | -352.8692 (0.0321) | -\nCG/S | -33.1039 (1.5676) | 0.6940 (0.0235) | -25.4079 (1.9077) | 0.2867 (0.0762) | -157.1920 (0.0036) | 0.5261 (0.0032) | -352.7487 (0.0037) | 0.5857 (0.0032)\nVB/S | -33.1861 (1.5873) | 0.4640 (0.0085) | -25.5076 (1.9122) | 0.4770 (0.0041) | -157.1391 (0.1154) | 0.4748 (0.0434) | -352.6078 (0.2599) | 0.4945 (0.0324)\nCVB/S | -33.2031 (1.5858) | 0.7041 (0.0322) | -25.4080 (1.9085) | 0.2925 (0.0608) | -157.2182 (0.0282) | 0.5327 (0.0060) | -352.7544 (0.0088) | 0.5899 (0.0070)\n\n4.1.2 Algorithm settings and performance measure\n\nWe compared six algorithms: Collapsed Gibbs (CG) for FDPM (CG/D), Variational Bayes (VB) for\nFDPM (VB/D), Collapsed Variational Bayes (CVB) for FDPM (CVB/D), CG for FNSPM (CG/S),\nVB for FNSPM (VB/S) and CVB for FNSPM (CVB/S). All the algorithms were initialized with a\nsingle run of sequential collapsed Gibbs sampling starting from zero clusters, and afterwards ran for\n100 iterations. The variational algorithms were terminated if the improvements of the ELBO were\nsmaller than a threshold. The hyperparameters \u03b8 and \u03b1 were initialized as \u03b8 = 1 and \u03b1 = 0.5 for all\nalgorithms. The performances were measured by average test log-likelihood,\n\n$$\\frac{1}{N_{\\text{test}}} \\sum_{n=1}^{N_{\\text{test}}} \\log p(x_n \\mid x_{\\text{train}}). \\tag{20}$$\n\nFor CG, we computed the average of samples collected every 10 iterations. 
For VB and CVB, we\ncomputed the log-likelihood using the expectations of the variational distributions.\n\n4.1.3 Effect of K on predictive performance and running time\n\nTo see the effect of K on predictive performance, we first compared the finite-dimensional algorithms\n(CG for FNSPM, VB for FNSPM and CVB for FNSPM) to the infinite-dimensional algorithm\n(CG for NSPM [28]). We tested the four algorithms on 10 synthetic datasets generated from PYP\nmixtures, with K \u2208 {200, 400, 600, 800, 1000} for finite algorithms, and measured the difference\nof average test log likelihood compared to the infinite-dimensional algorithm. We also measured\nthe average running time per iteration of the four algorithms, and the average number of iterations\nto converge of the variational algorithms. Figure 2 shows the results. As expected, the difference\nbetween finite-dimensional algorithms and the infinite-dimensional algorithm decreases as K grows.\nThe finite-dimensional algorithms have O(NK) time complexity per iteration, and the infinite-dimensional\nalgorithm has O(N \u02dcK) where \u02dcK is the maximum number of clusters created during\nclustering. However, in practice, variational algorithms can be implemented with efficient matrix\nmultiplications, and this makes them much faster than sampling algorithms. Moreover, as shown in\nFigure 2, variational algorithms usually converge in 50 iterations.\n\n[Figure 2 panels, x-axis: dimension K (200 to 1000). Left: test log likel diff (CGNSPM - CGFNSPM, CGNSPM - VBFNSPM, CGNSPM - CVBFNSPM). Middle: time / iter [sec] (CGNSPM, CGFNSPM, VBFNSPM, CVBFNSPM). Right: iter to converge (VBFNSPM, CVBFNSPM).]\n\n4.1.4 Comparing finite-dimensional algorithms on PYP and CRP datasets\n\nWe compared six algorithms for finite mixture models (CG/D, VB/D, CVB/D, CG/S, VB/S and\nCVB/S) on PYP mixture datasets and CRP mixture datasets, with K = 1000. The results are summarized\nin Table 1. 
On PYP datasets, in general, FNSPM outperformed FDPM and CG outperformed\nVB and CVB. CG/S consistently outperformed CG/D, and the same relationship held between VB/S\nand VB/D and between CVB/S and CVB/D. Even though VB/S and CVB/S were variational algorithms, the\nperformance gap between them and CG/S was not significant. Table 1 shows the estimated \u03b1 values\nfor CG/S, VB/S and CVB/S. CG/S and CVB/S seemed to recover the true value \u03b1 = 0.7, but VB/S\ndid not. We found that VB/S tends to control the other parameter \u03b8 while holding \u03b1 near its initial\nvalue 0.5. On CRP datasets, all the algorithms showed similar performances except for VB/S, which\nwas consistently worse than other algorithms. This is probably due to the bad estimates of \u03b1.\n\n4.2 Experiments on real-world documents\n\nWe compared the six algorithms on a real-world document clustering task by clustering the AP corpus (footnote 1) and\nthe NIPS corpus (footnote 2). We preprocessed the corpora using latent Dirichlet allocation (LDA) [3]. We ran LDA\nwith 300 topics, and then gave each document a bag-of-words representation of topic assignments\nto those 300 topics. We assumed that those representations were generated from the multinomial-Dirichlet\nmodel, and clustered them using FDPM and FNSPM. We used 80% of documents for\ntraining and the remaining 20% for computing average test log-likelihood. We set K = 2,000 and ran\neach algorithm for 200 iterations. We repeated this 10 times to measure the average performance. The\nresults are summarized in Table 1. In general, the algorithms based on FNSPM showed better\nperformance than the FDPM-based ones, implying that FNSPM-based algorithms capture\nthe heavy-tailed cluster distributions of the corpora well. VB/S performed the best, even though\nit sometimes converged to poor values.\n\n5 Conclusion\n\nIn this paper, we proposed finite-dimensional BFRY processes that converge to SP, GGP and SBP. 
The\njumps of the finite-dimensional BFRY processes have nice closed-form densities, and this leads to\nefficient posterior inference algorithms. With finite-dimensional BFRY processes, we developed\nvariational Bayes and collapsed variational Bayes for finite-dimensional normalized SP mixture\nmodels, and demonstrated their performance both on synthetic and real-world datasets. As mentioned\nearlier, with finite-dimensional BFRY processes one can develop variational Bayes or other posterior\ninference algorithms for a variety of models with SP, GGP and SBP priors. This fact, along with\nmore theoretical properties of finite-dimensional processes, presents interesting avenues for future\nresearch.\nAcknowledgements: This work was supported by IITP (No. B0101-16-0307, Basic Software\nResearch in Human-Level Lifelong Machine Learning (Machine Learning Center)) and by National\nResearch Foundation (NRF) of Korea (NRF-2013R1A2A2A01067464), and supported in part by the\ngrant RGC-HKUST 601712 of the HKSAR.\n\nFootnotes:\n1 http://www.cs.princeton.edu/~blei/lda-c/\n2 https://archive.ics.uci.edu/ml/datasets/Bag+of+Words\n\nReferences\n[1] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209\u2013230, 1973.\n\n[2] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581, 2006.\n\n[3] D. M. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n\n[4] J. F. C. Kingman. Random discrete distributions. Journal of the Royal Statistical Society. Series B (Methodological), 37(1):1\u201322, 1975.\n\n[5] H. Ishwaran and L. F. James. Computational methods for multiplicative intensity models using weighted gamma processes: proportional hazards, marked point processes, and panel count data. 
Journal of the American Statistical Association, 99:175\u2013190, 2004.\n\n[6] H. Ishwaran, L. F. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96:1316\u20131332, 2001.\n\n[7] H. Ishwaran and M. Zarepour. Dirichlet prior sieves in finite normal mixtures. Statistica Sinica, 12(3):941\u2013963, 2002.\n\n[8] J. F. C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59\u201378, 1967.\n\n[9] J. Pitman and N. M. Tran. Size-biased permutation of a finite sequence with independent and identically distributed terms. Bernoulli, 21(4):2484\u20132512, 2015.\n\n[10] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. 2016. arXiv:1601.00670.\n\n[11] C. Chen, N. Ding, and W. Buntine. Dependent hierarchical normalized random measures for dynamic topic modeling. ICML, 2012.\n\n[12] Y. W. Teh and D. G\u00f6r\u00fcr. Indian buffet processes with power-law behavior. NIPS, 2009.\n\n[13] F. Caron and E. B. Fox. Sparse graphs using exchangeable random measures. 2014. arXiv:1401.1137.\n\n[14] F. Caron. Bayesian nonparametric models for bipartite graphs. NIPS, 2012.\n\n[15] V. Veitch and D. M. Roy. The class of random graphs arising from exchangeable random measures. 2015. arXiv:1512.03099.\n\n[16] L. Devroye and L. F. James. On simulation and properties of the stable law. Statistical Methods and Applications, 23(3):307\u2013343, 2014.\n\n[17] L. F. James, P. Orbanz, and Y. W. Teh. Scaled subordinators and generalizations of the Indian buffet process. 2015. arXiv:1510.07309.\n\n[18] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216\u2013222, 1987.\n\n[19] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. ICML, 2011.\n\n[20] L. F. James. 
Bayesian Poisson process partition calculus with an application to Bayesian L\u00e9vy moving averages. The Annals of Statistics, 33(4):1771\u20131799, 2005.\n\n[21] E. \u00c7inlar. Probability and stochastics. 2010.\n\n[22] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161\u2013173, 2001.\n\n[23] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855\u2013900, 1997.\n\n[24] J. Bertoin, T. Fujita, B. Roynette, and M. Yor. On a particular class of self-decomposable random variables: the durations of Bessel excursions straddling independent exponential times. Probability and Mathematical Statistics, 26:315\u2013366, 2006.\n\n[25] M. Winkel. Electronic foreign-exchange markets and passage events of independent subordinators. Journal of Applied Probability, 42(1):138\u2013152, 2005.\n\n[26] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. NIPS, 2006.\n\n[27] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. AISTATS, 2007.\n\n[28] S. Favaro and Y. W. Teh. MCMC for normalized random measure mixture models. Statistical Science, 28(3):335\u2013359, 2013.\n\n[29] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. IJCAI, 2007.", "award": [], "sourceid": 1574, "authors": [{"given_name": "Juho", "family_name": "Lee", "institution": "POSTECH"}, {"given_name": "Lancelot", "family_name": "James", "institution": "HKUST"}, {"given_name": "Seungjin", "family_name": "Choi", "institution": "POSTECH"}]}