{"title": "Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1891, "page_last": 1899, "abstract": "Normalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. The Dirichlet process is a well-known example of an NRM. Most posterior inference methods for NRM mixture models rely on MCMC methods, since they are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) and incremental Bayesian hierarchical clustering (IBHC) have been developed for DP and NRM mixture (NRMM) models, respectively. Although IBHC is a promising method for posterior inference in NRMM models due to its efficiency and applicability to online inference, its convergence is not guaranteed, since it uses a heuristic that simply selects the best solution after multiple trials. In this paper, we present a hybrid inference algorithm for NRMM models that combines the merits of both MCMC and IBHC. Trees built by IBHC outline partitions of the data, which guide the Metropolis-Hastings procedure to employ appropriate proposals. Inheriting the nature of MCMC, our tree-guided MCMC (tgMCMC) is guaranteed to converge, and enjoys fast convergence thanks to the effective proposals guided by the trees. 
Experiments on both synthetic and real-world datasets demonstrate the benefit of our method.", "full_text": "Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models\n\nJuho Lee and Seungjin Choi\nDepartment of Computer Science and Engineering\nPohang University of Science and Technology\n77 Cheongam-ro, Nam-gu, Pohang 37673, Korea\n{stonecold,seungjin}@postech.ac.kr\n\nAbstract\n\nNormalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. The Dirichlet process is a well-known example of an NRM. Most posterior inference methods for NRM mixture models rely on MCMC methods, since they are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) and incremental Bayesian hierarchical clustering (IBHC) have been developed for DP and NRM mixture (NRMM) models, respectively. Although IBHC is a promising method for posterior inference in NRMM models due to its efficiency and applicability to online inference, its convergence is not guaranteed, since it uses a heuristic that simply selects the best solution after multiple trials. In this paper, we present a hybrid inference algorithm for NRMM models that combines the merits of both MCMC and IBHC. Trees built by IBHC outline partitions of the data, which guide the Metropolis-Hastings procedure to employ appropriate proposals. Inheriting the nature of MCMC, our tree-guided MCMC (tgMCMC) is guaranteed to converge, and enjoys fast convergence thanks to the effective proposals guided by the trees. 
Experiments on both synthetic and real-world datasets demonstrate the benefit of our method.\n\n1 Introduction\n\nNormalized random measures (NRMs) form a broad class of discrete random measures, including the Dirichlet process (DP) [1], the normalized inverse Gaussian process [2], and the normalized generalized Gamma process [3, 4]. The NRM mixture (NRMM) model [5] is a representative example in which an NRM is used as a prior for mixture models. Recently, NRMs were extended to dependent NRMs (DNRMs) [6, 7] to model data where exchangeability fails. The posterior analysis for NRMM models has been developed [8, 9], yielding simple MCMC methods [10]. As in DP mixture (DPM) models [11], there are two paradigms in MCMC algorithms for NRMM models: (1) marginal samplers and (2) slice samplers. Marginal samplers simulate the posterior distribution of partitions and cluster parameters given data (or just partitions given data, provided that conjugate priors are assumed) by marginalizing out the random measures. Marginal samplers include the Gibbs sampler [10] and the split-merge sampler [12], although the latter was not formally extended to NRMM models. The slice sampler [13] maintains the random measures and explicitly samples their weights and atoms. The term “slice” comes from the auxiliary slice variables used to control the number of atoms in play. The slice sampler is known to mix faster than the marginal Gibbs sampler when applied to complicated DNRM mixture models where the evaluation of the marginal distribution is costly [7].\n\nThe main drawback of MCMC methods for NRMM models is their poor scalability, due to the nature of MCMC. Moreover, since the marginal Gibbs sampler and the slice sampler iteratively resample the cluster assignment of a single data point at a time, they easily get stuck in local optima. 
The split-merge sampler may resolve the local optima problem to some extent, but it is still problematic for large-scale datasets, since the samples proposed by split or merge procedures are rarely accepted. Recently, a deterministic alternative to MCMC algorithms for NRM (or DNRM) mixture models was proposed [14], extending Bayesian hierarchical clustering (BHC) [15], which was developed as a tree-based inference method for DP mixture models. The algorithm, referred to as incremental BHC (IBHC) [14], builds binary trees that reflect the hierarchical cluster structure of datasets by evaluating the approximate marginal likelihood of NRMM models, and is well suited for incremental inference on large-scale or streaming datasets. The key idea of IBHC is to consider only exponentially many posterior samples (which are represented as binary trees), instead of drawing an indefinite number of samples as in MCMC methods. However, IBHC depends on a heuristic that chooses the best trees after multiple trials, and thus is not guaranteed to converge to the true posterior distribution.\n\nIn this paper, we propose a novel MCMC algorithm that elegantly combines IBHC and MCMC methods for NRMM models. Our algorithm, called tree-guided MCMC, utilizes the trees built by IBHC to propose good-quality posterior samples efficiently. The trees contain useful information, such as dissimilarities between clusters, so errors in cluster assignments can be detected and corrected with less effort. Moreover, being designed as an MCMC method, our algorithm is guaranteed to converge to the true posterior, which was not possible for IBHC. We demonstrate the efficiency and accuracy of our algorithm by comparing it to existing MCMC algorithms.\n\n2 Background\n\nThroughout this paper we use the following notation. Denote by [n] = {1, ..., n} a set of indices and by X = {x_i | i ∈ [n]} a dataset. 
A partition Π_{[n]} of [n] is a set of disjoint nonempty subsets of [n] whose union is [n]. A cluster c is an element of Π_{[n]}, i.e., c ∈ Π_{[n]}. The data points in cluster c are denoted by X_c = {x_i | i ∈ c} for c ∈ Π_{[n]}. For simplicity, we often use i to represent the singleton {i} for i ∈ [n]. In this section, we briefly review NRMM models and existing posterior inference methods such as MCMC and IBHC.\n\n2.1 Normalized random measure mixture models\n\nLet μ be a homogeneous completely random measure (CRM) on a measure space (Θ, F) with Lévy intensity ρ and base measure H, written μ ~ CRM(ρH). We also assume that\n\n∫_0^∞ ρ(dw) = ∞,  ∫_0^∞ (1 − e^{−w}) ρ(dw) < ∞,  (1)\n\nso that μ has infinitely many atoms and the total mass μ(Θ) is finite: μ = ∑_{j=1}^∞ w_j δ_{θ*_j}, μ(Θ) = ∑_{j=1}^∞ w_j < ∞. An NRM is then formed by normalizing μ by its total mass μ(Θ). For each index i ∈ [n], we draw the corresponding atom from the NRM, θ_i | μ ~ μ/μ(Θ). Since μ is discrete, the set {θ_i | i ∈ [n]} naturally forms a partition of [n] with respect to the assigned atoms. We write the partition as a set of sets Π_{[n]} whose elements are non-empty, non-overlapping subsets of [n] whose union is [n]. We index the elements (clusters) of Π_{[n]} by the symbol c, and denote the unique atom assigned to c by θ_c. Summarizing the set {θ_i | i ∈ [n]} as (Π_{[n]}, {θ_c | c ∈ Π_{[n]}}), the posterior random measure is written as follows:\n\nTheorem 1. 
([9]) Let (Π_{[n]}, {θ_c | c ∈ Π_{[n]}}) be samples drawn from μ/μ(Θ), where μ ~ CRM(ρH). With an auxiliary variable u ~ Gamma(n, μ(Θ)), the posterior random measure is written as\n\nμ|u + ∑_{c∈Π_{[n]}} w_c δ_{θ_c},  (2)\n\nwhere\n\nρ_u(dw) := e^{−uw} ρ(dw),  μ|u ~ CRM(ρ_u H),  P(dw_c) ∝ w_c^{|c|} ρ_u(dw_c).  (3)\n\nMoreover, the marginal distribution is written as\n\nP(Π_{[n]}, {dθ_c | c ∈ Π_{[n]}}, du) = (u^{n−1} e^{−ψ_ρ(u)} du / Γ(n)) ∏_{c∈Π_{[n]}} κ_ρ(|c|, u) H(dθ_c),  (4)\n\nwhere\n\nψ_ρ(u) := ∫_0^∞ (1 − e^{−uw}) ρ(dw),  κ_ρ(|c|, u) := ∫_0^∞ w^{|c|} ρ_u(dw).  (5)\n\nUsing (4), the predictive distribution for a novel atom θ is written as\n\nP(dθ | {θ_i}, u) ∝ κ_ρ(1, u) H(dθ) + ∑_{c∈Π_{[n]}} (κ_ρ(|c|+1, u) / κ_ρ(|c|, u)) δ_{θ_c}(dθ).  (6)\n\nThe most general CRM that may be used is the generalized Gamma [3], with Lévy intensity ρ(dw) = (ασ / Γ(1−σ)) w^{−σ−1} e^{−w} dw. In NRMM models, the observed dataset X is assumed to be generated from a likelihood P(dx|θ) with parameters {θ_i} drawn from the NRM. 
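For the generalized Gamma intensity above, both quantities in (5) have closed forms: ψ_ρ(u) = α((1+u)^σ − 1) and κ_ρ(m, u) = ασ Γ(m−σ)/Γ(1−σ) · (1+u)^{σ−m}, so the predictive weights in (6) are cheap to evaluate. The sketch below is our own Python under that parameterization (function names are ours, not the paper's):

```python
import math

def psi_gg(u, alpha, sigma):
    """Laplace exponent psi_rho(u) = int (1 - e^{-u w}) rho(dw) for the
    generalized Gamma intensity rho(dw) = alpha*sigma/Gamma(1-sigma) w^{-sigma-1} e^{-w} dw."""
    return alpha * ((1.0 + u) ** sigma - 1.0)

def kappa_gg(m, u, alpha, sigma):
    """kappa_rho(m, u) = int w^m e^{-u w} rho(dw), in closed form."""
    return (alpha * sigma * math.gamma(m - sigma) / math.gamma(1.0 - sigma)
            * (1.0 + u) ** (sigma - m))

def predictive_weights(cluster_sizes, u, alpha, sigma):
    """Unnormalized weights of the predictive distribution (6): one entry per
    existing cluster, kappa(|c|+1,u)/kappa(|c|,u), plus kappa(1,u) for a new cluster."""
    existing = [kappa_gg(s + 1, u, alpha, sigma) / kappa_gg(s, u, alpha, sigma)
                for s in cluster_sizes]
    return existing, kappa_gg(1, u, alpha, sigma)
```

Since Γ(m+1−σ)/Γ(m−σ) = m−σ, the weight of an existing cluster simplifies to (|c|−σ)/(1+u), which makes the reinforcement behavior of the generalized Gamma prior explicit.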
We focus on the conjugate case, where H is conjugate to P(dx|θ), so that the integral P(dX_c) := ∫_Θ H(dθ) ∏_{i∈c} P(dx_i|θ) is tractable.\n\n2.2 MCMC inference for NRMM models\n\nThe goal of posterior inference for NRMM models is to compute the posterior P(Π_{[n]}, {dθ_c}, du | X) with the marginal likelihood P(dX).\n\nMarginal Gibbs sampler: the marginal Gibbs sampler is based on the predictive distribution (6). At each iteration, the cluster assignment of each data point is resampled: x_i may join an existing cluster c with probability proportional to (κ_ρ(|c|+1, u) / κ_ρ(|c|, u)) P(dx_i | X_c), or create a novel cluster with probability proportional to κ_ρ(1, u) P(dx_i).\n\nSlice sampler: instead of marginalizing out μ, the slice sampler explicitly samples the atoms and weights {w_j, θ*_j} of μ. Since maintaining infinitely many atoms is infeasible, slice variables {s_i} are introduced for each data point; atoms with masses larger than a threshold (usually set to min_{i∈[n]} s_i) are kept, and the remaining atoms are added on the fly as the threshold changes. At each iteration, x_i is assigned to the j-th atom with probability proportional to 1[s_i < w_j] P(dx_i | θ*_j).\n\nSplit-merge sampler: both the marginal Gibbs and slice samplers alter a single cluster assignment at a time, and so are prone to local optima. The split-merge sampler, originally developed for DPM models, is a marginal sampler based on (6). At each iteration, instead of changing individual cluster assignments, it splits or merges clusters to propose a new partition. The split or merged partition is proposed by a procedure called restricted Gibbs sampling, which is Gibbs sampling restricted to the clusters to be split or merged. The proposed partitions are accepted or rejected according to a Metropolis-Hastings scheme. 
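To make the marginal Gibbs step above concrete, here is a minimal collapsed sweep for the DP special case, where κ_ρ(|c|+1,u)/κ_ρ(|c|,u) reduces to |c|/(1+u) and κ_ρ(1,u) to α/(1+u), so the common factor 1/(1+u) cancels and CRP-style weights remain. The likelihood is one-dimensional Gaussian with known variance and a conjugate Gaussian base measure; this is our own illustrative sketch, and all names and hyperparameter values are assumptions, not the paper's.

```python
import math, random

def log_predictive(x, S, n, tau2=1.0, s2=100.0):
    # Posterior predictive log-density of x for a cluster with n points summing to S,
    # under a N(theta, tau2) likelihood and N(0, s2) base measure (conjugate case).
    var_post = 1.0 / (1.0 / s2 + n / tau2)
    mean_post = var_post * (S / tau2)
    v = var_post + tau2
    return -0.5 * math.log(2 * math.pi * v) - 0.5 * (x - mean_post) ** 2 / v

def gibbs_sweep(xs, z, alpha=1.0):
    # One marginal Gibbs sweep over cluster assignments z (in place).
    for i, x in enumerate(xs):
        z[i] = None  # remove x_i from its cluster
        clusters = {}
        for j, zj in enumerate(z):
            if zj is not None:
                n, S = clusters.get(zj, (0, 0.0))
                clusters[zj] = (n + 1, S + xs[j])
        labels, logw = [], []
        for lab, (n, S) in clusters.items():
            labels.append(lab)
            logw.append(math.log(n) + log_predictive(x, S, n))   # existing cluster
        labels.append(max(clusters, default=-1) + 1)             # fresh label
        logw.append(math.log(alpha) + log_predictive(x, 0.0, 0)) # new cluster
        m = max(logw)
        probs = [math.exp(lw - m) for lw in logw]
        r, acc = random.random() * sum(probs), 0.0
        for lab, p in zip(labels, probs):
            acc += p
            if r <= acc:
                z[i] = lab
                break
    return z
```

On well-separated data this sweep quickly carves out the correct clusters, but, as the text notes, single-point moves of exactly this kind are what make the sampler prone to local optima on harder datasets.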
Split-merge samplers are reported to mix better than the marginal Gibbs sampler.\n\n2.3 IBHC inference for NRMM models\n\nBayesian hierarchical clustering (BHC, [15]) is a probabilistic, model-based agglomerative clustering method in which the marginal likelihood of a DPM is evaluated to measure the dissimilarity between nodes. Like traditional agglomerative clustering algorithms, BHC repeatedly merges the pair of nodes with the smallest dissimilarity, building binary trees that embed the hierarchical cluster structure of the dataset. BHC defines a generative probability of binary trees which is maximized during tree construction, and this generative probability provides a lower bound on the marginal likelihood of the DPM. For this reason, BHC is considered a posterior inference algorithm for DPM models. Incremental BHC (IBHC, [14]) extends BHC to (dependent) NRMM models. Just as BHC is a deterministic posterior inference algorithm for DPM models, IBHC serves as a deterministic posterior inference algorithm for NRMM models. Unlike the original BHC, which greedily builds trees, IBHC sequentially inserts data points into trees, yielding a scalable algorithm that is well suited for online inference. We first explain the generative model of trees, and then the sequential algorithm of IBHC.\n\nFigure 1: (Left) in IBHC, a new data point is inserted into one of the trees, or creates a novel tree. (Middle) three possible cases in SeqInsert. (Right) after the insertion, the potential functions for the nodes on the blue bold path should be updated. If an updated d(·,·) > 1, the tree is split at that level.\n\nIBHC aims to maximize the joint probability of the data X and the auxiliary variable u:\n\nP(dX, du) = (u^{n−1} e^{−ψ_ρ(u)} du / Γ(n)) ∑_{Π_{[n]}} ∏_{c∈Π_{[n]}} κ_ρ(|c|, u) P(dX_c).  (7)\n\nLet t_c be a binary tree whose leaf nodes consist of the indices in c. 
Let l(c) and r(c) denote the left and right children of the set c in the tree, with corresponding subtrees t_{l(c)} and t_{r(c)}. The generative probability of trees is described by the potential function [14], which is an unnormalized reformulation of the original definition [15]. The potential function of the data X_c given the tree t_c is recursively defined as follows:\n\nφ(X_c|h_c) := κ_ρ(|c|, u) P(dX_c),  φ(X_c|t_c) = φ(X_c|h_c) + φ(X_{l(c)}|t_{l(c)}) φ(X_{r(c)}|t_{r(c)}).  (8)\n\nHere, h_c is the hypothesis that X_c was generated from a single cluster. The first term φ(X_c|h_c) is proportional to the probability that h_c is true, and comes from the term inside the product in (7). The second term is proportional to the probability that X_c was generated from two or more clusters embedded in the subtrees t_{l(c)} and t_{r(c)}. The posterior probability of h_c is then computed as\n\nP(h_c|X_c, t_c) = 1 / (1 + d(l(c), r(c))),  where d(l(c), r(c)) := φ(X_{l(c)}|t_{l(c)}) φ(X_{r(c)}|t_{r(c)}) / φ(X_c|h_c).  (9)\n\nd(·,·) is defined to be the dissimilarity between l(c) and r(c). In the greedy construction, the pair of nodes with the smallest d(·,·) is merged at each iteration. When the minimum dissimilarity exceeds one (i.e., P(h_c|X_c, t_c) < 0.5), h_c is concluded to be false and the construction stops. This is an important mechanism of BHC (and IBHC) that naturally selects the proper number of clusters. From the perspective of posterior inference, this stopping corresponds to selecting the MAP partition that maximizes P(Π_{[n]}|X, u). If the tree is built and the potential function is computed for the entire dataset X, a lower bound on the joint likelihood (7) is obtained [15, 14]:\n\n(u^{n−1} e^{−ψ_ρ(u)} du / Γ(n)) φ(X|t_{[n]}) ≤ P(dX, du).  (10)\n\nNow we explain the sequential tree construction of IBHC. 
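The recursion (8)-(9) is straightforward to implement once φ(X_c|h_c) is available. The sketch below is our own code, with a caller-supplied stand-in value for φ(X_c|h_c) at each node; it computes tree potentials, dissimilarities, and P(h_c|X_c, t_c):

```python
class Node:
    """Binary tree node; phi_h stands in for phi(X_c | h_c) = kappa_rho(|c|, u) P(dX_c)."""
    def __init__(self, phi_h, left=None, right=None):
        self.phi_h, self.left, self.right = phi_h, left, right

def potential(node):
    """phi(X_c|t_c) = phi(X_c|h_c) + phi(X_l|t_l) * phi(X_r|t_r), as in eq. (8)."""
    if node.left is None:  # leaf: only the single-cluster hypothesis contributes
        return node.phi_h
    return node.phi_h + potential(node.left) * potential(node.right)

def dissimilarity(node):
    """d(l(c), r(c)) = phi(X_l|t_l) phi(X_r|t_r) / phi(X_c|h_c), as in eq. (9)."""
    return potential(node.left) * potential(node.right) / node.phi_h

def posterior_h(node):
    """P(h_c | X_c, t_c) = 1 / (1 + d(l(c), r(c)))."""
    return 1.0 / (1.0 + dissimilarity(node))
```

Note the stopping rule falls out directly: d(·,·) > 1 is exactly P(h_c|X_c, t_c) < 0.5.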
IBHC constructs a tree incrementally, inserting each new data point into an appropriate position of the existing tree without computing dissimilarities between every pair of nodes. The procedure, which comprises three steps, is illustrated in Fig. 1.\n\nStep 1 (left): Given {x_1, ..., x_{i−1}}, suppose that trees have been built by IBHC, yielding a partition Π_{[i−1]}. When a new data point x_i arrives, this step assigns x_i to the tree t_ĉ with the smallest distance, i.e., ĉ = argmin_{c∈Π_{[i−1]}} d(i, c), or creates a new tree t_i if d(i, ĉ) > 1.\n\nStep 2 (middle): Suppose that the tree chosen in Step 1 is t_c. Step 2 then determines an appropriate position for x_i in the tree t_c, via the procedure SeqInsert(c, i). SeqInsert(c, i) chooses the position of i among three cases (Fig. 1). Case 1 places x_i on top of the tree t_c. Cases 2 and 3 add x_i as a sibling of the subtree t_{l(c)} or t_{r(c)}, respectively. Among these three cases, the one with the highest potential function φ(X_{c∪i}|t_{c∪i}) is selected, which can easily be done by comparing d(l(c), r(c)), d(l(c), i) and d(r(c), i) [14]. If d(l(c), r(c)) is the smallest, Case 1 is selected and the insertion terminates. Otherwise, if d(l(c), i) is the smallest, x_i is inserted into t_{l(c)} and SeqInsert(l(c), i) is recursively executed. The same procedure applies when d(r(c), i) is the smallest.\n\nFigure 2: Global moves of tgMCMC. The top row shows how a split partition Π*_{[n]} is proposed from a partition Π_{[n]}, and how Π_{[n]} is retained from Π*_{[n]}. 
The bottom row shows the same steps for the merge case.\n\nStep 3 (right): After Steps 1 and 2 are applied, the potential functions of t_{c∪i} must be recomputed, starting from the subtree of t_c into which x_i was inserted, up to the root t_{c∪i}. During this procedure, updated d(·,·) values may exceed 1. In such a case, we split the tree at the level where d(·,·) > 1, and re-insert all the split nodes.\n\nAfter all the data points in X have been inserted, the auxiliary variable u and the hyperparameters for ρ(dw) are resampled, and the tree is reconstructed. This procedure is repeated several times, and the trees with the highest potential functions are chosen as the output.\n\n3 Main results: a tree-guided MCMC procedure\n\nIBHC must reconstruct trees from scratch whenever u and the hyperparameters are resampled; this is obviously time-consuming and, more importantly, convergence is not guaranteed. Instead of completely reconstructing trees, we propose to refine parts of the existing trees with MCMC. Our algorithm, called tree-guided MCMC (tgMCMC), is a combination of deterministic tree-based inference and MCMC, where the trees constructed via IBHC guide MCMC to propose good-quality samples. tgMCMC initializes a chain with a single run of IBHC. Given a current partition Π_{[n]} and trees {t_c | c ∈ Π_{[n]}}, tgMCMC proposes a novel partition Π*_{[n]} by global and local moves. Global moves split or merge clusters to propose Π*_{[n]}, and local moves alter cluster assignments of individual data points via Gibbs sampling. We first explain the two key operations used to modify tree structures, and then explain the global and local moves. More details on the algorithm can be found in the supplementary material.\n\n3.1 Key operations\n\nSampleSub(c, p): given a tree t_c, draw a subtree t_{c′} with probability ∝ d(l(c′), r(c′)) + ε. ε is added for leaf nodes, whose d(·,·) = 0, and is set to the maximum d(·,·) among all subtrees of t_c. The drawn subtree is likely to contain errors to be corrected by splitting. The probability of drawing t_{c′} is multiplied into p, where p usually holds transition probabilities.\n\nStocInsert(S, c, p): a stochastic version of IBHC. c may be inserted into c′ ∈ S via SeqInsert(c′, c) with probability d^{−1}(c′, c) / (1 + ∑_{c′′∈S} d^{−1}(c′′, c)), or may simply be put into S (creating a new cluster in S) with probability 1 / (1 + ∑_{c′′∈S} d^{−1}(c′′, c)). If c is inserted via SeqInsert, the potential functions are updated accordingly, but the trees are not split even if the updated dissimilarities exceed 1. As in SampleSub, the probability is multiplied into p.\n\n3.2 Global moves\n\nThe global moves of tgMCMC are a tree-guided analogue of split-merge sampling. In split-merge sampling, a pair of data points is selected at random; a split partition is proposed if they belong to the same cluster, and a merged partition otherwise. Instead, tgMCMC finds the clusters that are highly likely to be split or merged using the dissimilarities between trees, as follows. First, we pick a tree t_c uniformly at random. Then, we compute d(c, c′) for each c′ ∈ Π_{[n]}\\c, and put c′ into a set M with probability (1 + d(c, c′))^{−1} (the probability of merging c and c′). The transition probability q(Π*_{[n]} | Π_{[n]}) up to this step is (1/|Π_{[n]}|) ∏_{c′∈Π_{[n]}\\c} d(c, c′)^{1[c′∉M]} / (1 + d(c, c′)). 
The set M contains the candidate clusters to merge with c. If M is empty, meaning that there are no candidates to merge with c, we propose Π*_{[n]} by splitting c. Otherwise, we propose Π*_{[n]} by merging c with the clusters in M.\n\nSplit case: we start the split by drawing a subtree t_{c⋆} via SampleSub(c, q(Π*_{[n]}|Π_{[n]}))¹. We then split c⋆ into S = {l(c⋆), r(c⋆)}, destroy all the parents of t_{c⋆}, and collect the split-off trees into a set Q (Fig. 2, top). We then reconstruct the tree by StocInsert(S, c′, q(Π*_{[n]}|Π_{[n]})) for all c′ ∈ Q. After the reconstruction, S has at least two clusters, since we split S = {l(c⋆), r(c⋆)} before the insertions. The split partition to propose is Π*_{[n]} = (Π_{[n]}\\c) ∪ S. The reverse transition probability q(Π_{[n]} | Π*_{[n]}) is computed as follows. To obtain Π_{[n]} from Π*_{[n]}, we must merge the clusters in S back into c. For this, we would pick a cluster c′ ∈ S and put the other clusters S\\c′ into M. Since any c′ may be picked first, the reverse transition probability is the sum over all these possibilities:\n\nq(Π_{[n]} | Π*_{[n]}) = (1/|Π*_{[n]}|) ∑_{c′∈S} ∏_{c′′∈Π*_{[n]}\\c′} d(c′, c′′)^{1[c′′∉S]} / (1 + d(c′, c′′)).  (11)\n\nMerge case: suppose that M = {c_1, ..., c_m}². The merged partition to propose is Π*_{[n]} = (Π_{[n]}\\({c} ∪ M)) ∪ {c_{m+1}}, where c_{m+1} = c ∪ c_1 ∪ ··· ∪ c_m. We construct the corresponding binary tree as a cascading tree, where we put c_1, ..., c_m on top of c in order (Fig. 2, bottom). To compute the reverse transition probability q(Π_{[n]} | Π*_{[n]}), we must compute the probability of splitting c_{m+1} back into c, c_1, ..., c_m. 
For this, we must first choose c_{m+1} and put nothing into the set M, to provoke a split. q(Π_{[n]} | Π*_{[n]}) up to this step is (1/|Π*_{[n]}|) ∏_{c′} d(c_{m+1}, c′) / (1 + d(c_{m+1}, c′)). Then, we must sample the parent of c (the subtree connecting c and c_1) via SampleSub(c_{m+1}, q(Π_{[n]}|Π*_{[n]})); this results in S = {c, c_1} and Q = {c_2, ..., c_m}. Finally, we insert each c_i ∈ Q into S via StocInsert(S, c_i, q(Π_{[n]}|Π*_{[n]})) for i = 2, ..., m, where each c_i is selected to create a new cluster in S. The corresponding update to q(Π_{[n]}|Π*_{[n]}) by StocInsert is\n\nq(Π_{[n]}|Π*_{[n]}) ← q(Π_{[n]}|Π*_{[n]}) · ∏_{i=2}^m 1 / (1 + d^{−1}(c, c_i) + ∑_{j=1}^{i−1} d^{−1}(c_j, c_i)).  (12)\n\nOnce we have proposed Π*_{[n]} and computed both q(Π*_{[n]}|Π_{[n]}) and q(Π_{[n]}|Π*_{[n]}), Π*_{[n]} is accepted with probability min{1, r}, where r = p(dX, du, Π*_{[n]}) q(Π_{[n]}|Π*_{[n]}) / (p(dX, du, Π_{[n]}) q(Π*_{[n]}|Π_{[n]})).\n\nErgodicity of the global moves: to show that the global moves are ergodic, it is enough to show that we can move an arbitrary point i from its current cluster c to any other cluster c′ in finitely many steps. This can easily be done by a single split move followed by a merge move, so the global moves are ergodic.\n\nTime complexity of the global moves: the time complexity of StocInsert(S, c, p) is O(|S| + h), where h is the height of the tree into which c is inserted. The total time complexity of a split proposal is mainly determined by the time to execute StocInsert(S, c, p). This procedure is usually efficient, especially when the trees are well balanced. The time complexity of proposing a merged partition is O(|Π_{[n]}| + |M|).\n\n3.3 Local moves\n\nIn local moves, we resample the cluster assignments of individual data points via Gibbs sampling. 
If a leaf node i is moved from c to c′, we detach i from t_c and run SeqInsert(c′, i)³. Here, instead of running Gibbs sampling for all data points, we run Gibbs sampling for a subset of the data points S, which is formed as follows. For each c ∈ Π_{[n]}, we draw a subtree t_{c′} by SampleSub. Then, we\n\n¹ Here, we restrict SampleSub to sample non-leaf nodes, since leaf nodes cannot be split.\n² We assume that clusters are given their own indices (such as hash values) so that they can be ordered.\n³ We do not split even if the updated dissimilarity exceeds one, as in StocInsert.\n\nStatistics of the three samplers on the toy dataset with the DP prior (mean (std) over 10 runs):\n\nsampler | max log-lik | ESS | log r | time/iter\nGibbs | -1730.4069 (64.6268) | 10.4644 (9.2663) | - | 0.0458 (0.0062)\nSM | -1602.3353 (43.6904) | 7.6180 (4.9166) | -362.4405 (271.9865) | 0.0702 (0.0141)\ntgMCMC | -1577.7484 (0.2889) | 35.6452 (22.6121) | -3.4844 (9.5959) | 0.0161 (0.0088)\n\nStatistics on the toy dataset with the NGGP prior:\n\nsampler | max log-lik | ESS | log r | time/iter\nGibbs | -1594.7382 (18.2228) | 15.8512 (7.1542) | - | 0.0523 (0.0067)\nSM | -1588.5188 (1.1158) | 36.2608 (54.6962) | -325.9407 (233.2997) | 0.0698 (0.0143)\ntgMCMC | -1587.8633 (1.0423) | 149.7197 (82.3461) | -2.6589 (8.8980) | 0.0152 (0.0088)\n\nFigure 3: Experimental results on the toy dataset. 
(Top row) scatter plot of the toy dataset; log-likelihoods of the three samplers with DP; log-likelihoods with NGGP; log-likelihoods of tgMCMC with varying G and varying D. (Bottom row) the statistics of the three samplers with DP and with NGGP.\n\nStatistics of the samplers for the 10K dataset (mean (std) over 10 runs):\n\nsampler | max log-lik | ESS | log r | time/iter\nGibbs | -74476.4666 (375.0498) | 27.0464 (7.6309) | - | 10.7325 (0.3971)\nSM | -74400.0337 (382.0060) | 24.9432 (6.7474) | -913.0825 (621.2756) | 11.1163 (0.4901)\nGibbs sub | -74618.3702 (550.1106) | 24.1818 (5.5523) | - | 0.2659 (0.0124)\nSM sub | -73880.5029 (382.6170) | 20.0981 (4.9512) | -918.2389 (623.9808) | 0.3467 (0.0739)\ntgMCMC | -73364.2985 (8.5847) | 93.5509 (23.5110) | -8.7375 (21.9222) | 0.4295 (0.0833)\n\nFigure 4: Average log-likelihood plot and the statistics of the samplers for the 10K dataset.\n\ndraw a subtree of t_{c′} again by SampleSub. We repeat this subsampling D times, and put the leaf nodes of the final subtree into S. A smaller D results in more data points to resample, so we can control the trade-off between iteration time and mixing rate.\n\nCycling: at each iteration of tgMCMC, we cycle the global and local moves, as in split-merge sampling. We first run the global moves G times, and then run a single sweep of local moves. Setting G = 20 and D = 2 was a moderate choice for all the data we tested.\n\n4 Experiments\n\nIn this section, we compare the marginal Gibbs sampler (Gibbs), the split-merge sampler (SM), and tgMCMC on synthetic and real datasets.\n\n4.1 Toy dataset\n\nWe first compared the samplers on a simple toy dataset of 1,300 two-dimensional points with 13 clusters, sampled from a mixture of Gaussians with predefined means and covariances. Since the partition found by IBHC is almost perfect for this simple data, instead of initializing with IBHC, we initialized the binary tree (and partition) as follows. 
As in IBHC, we sequentially inserted data points into the existing trees in a random order. However, instead of inserting them via SeqInsert, we simply put each data point on top of an existing tree, so that no splitting would occur. tgMCMC was initialized with the tree constructed by this procedure, and Gibbs and SM were initialized with the corresponding partition. We assumed a Gaussian likelihood and a Gaussian-Wishart base measure,\n\nH(dμ, dΛ) = N(dμ | m, (rΛ)^{−1}) W(dΛ | Ψ^{−1}, ν),  (13)\n\nwhere r = 0.1, ν = d + 6, d is the dimensionality, m is the sample mean, and Ψ = Σ/(10 · det(Σ))^{1/d} (Σ is the sample covariance). We compared the samplers using both DP and NGGP priors. For tgMCMC, we fixed the number of global moves G = 20 and the local-move parameter D = 2, except in the cases where we controlled them explicitly. All the samplers were run for 10 seconds, and each run was repeated 10 times. We compared the joint log-likelihood log p(dX, Π_{[n]}, du) of the samples and the effective sample size (ESS) of the number of clusters found. For SM and tgMCMC, we also compared the average log value of the acceptance ratio r. The results are summarized in Fig. 3. 
Statistics of the samplers for the NIPS corpus (mean (std) over 10 runs):\n\nsampler | max log-lik | ESS | log r | time/iter\nGibbs | -4247895.5166 (1527.0131) | 18.1758 (7.2028) | - | 186.2020 (2.9030)\nSM | -4246689.8072 (1656.2299) | 28.6608 (18.1896) | -3290.2988 (2617.6750) | 186.9424 (2.2014)\nGibbs sub | -4246878.3344 (1391.1707) | 13.8057 (4.5723) | - | 49.7875 (0.9400)\nSM sub | -4248034.0748 (1703.6653) | 18.5764 (18.6368) | -3488.9523 (3145.9786) | 49.9563 (0.8667)\ntgMCMC | -4243009.3500 (1101.0383) | 3.1274 (2.6610) | -256.4831 (218.8061) | 42.4176 (2.0534)\n\nFigure 5: Average log-likelihood plot and the statistics of the samplers for the NIPS corpus.\n\nAs shown in the log-likelihood trace plots, tgMCMC quickly converged to the ground-truth solution in both the DP and NGGP cases. tgMCMC also mixed better than the other two samplers in terms of ESS. Comparing the average log r values of SM and tgMCMC, we can see that the partitions proposed by tgMCMC are accepted far more often. We also controlled the parameters G and D; as expected, a higher G resulted in faster convergence. 
However, smaller D (more data points involved in local moves) did not necessarily mean faster convergence.

4.2 Large-scale synthetic dataset

We also compared the three samplers on a larger dataset containing 10,000 points, which we call the 10K dataset, generated from a six-dimensional mixture of Gaussians with labels drawn from PY(3, 0.8). We used the same base measure and initialization as for the toy datasets, and used the NGGP prior. We ran the samplers for 1,000 seconds and repeated each run 10 times. Gibbs and SM were too slow, so the number of samples they produced in 1,000 seconds was too small. Hence, we also compared Gibbs sub and SM sub, where we uniformly sampled a subset of the data points and ran the Gibbs sweep only for those sampled points. We controlled the subset size to make their running time similar to that of tgMCMC. The results are summarized in Fig. 4. Again, tgMCMC outperformed the other samplers in terms of both log-likelihood and ESS. Interestingly, SM was even worse than Gibbs, since most of the samples proposed by the split or merge proposals were rejected. Gibbs sub and SM sub were better than Gibbs and SM, but still failed to reach the best state found by tgMCMC.

4.3 NIPS corpus

We also compared the samplers on the NIPS corpus4, containing 1,500 documents with 12,419 words. We used a multinomial likelihood and a symmetric Dirichlet base measure Dir(0.1), used the NGGP prior, and initialized the samplers with normal IBHC. As for the 10K dataset, we also compared Gibbs sub and SM sub. We ran the samplers for 10,000 seconds and repeated each run 10 times. The results are summarized in Fig. 5. tgMCMC outperformed the other samplers in terms of log-likelihood; all the other samplers were trapped in local optima and failed to reach the states found by tgMCMC. However, the ESS for tgMCMC was the lowest, indicating poor mixing.
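For readers unfamiliar with the metric, the ESS of a scalar chain (here, the number of clusters per sample) discounts the raw sample count by the chain's autocorrelation. The paper does not specify which estimator it uses, so the sketch below is illustrative only: a truncated autocorrelation sum in the spirit of Geyer's initial-sequence estimators.

```python
import numpy as np

def ess(x):
    """Effective sample size of a 1-D chain: n / (1 + 2 * sum of
    autocorrelations), truncating at the first negative adjacent pair."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance at lags 0..n-1, then normalize to autocorrelation.
    acov = np.correlate(x, x, mode='full')[n - 1:] / n
    rho = acov / acov[0]
    # Sum adjacent autocorrelation pairs until a pair turns negative.
    s = 0.0
    for t in range(1, n - 1, 2):
        pair = rho[t] + rho[t + 1]
        if pair < 0:
            break
        s += pair
    return n / (1.0 + 2.0 * s)
```

For an i.i.d. chain this returns roughly n, while a strongly autocorrelated chain (e.g., an AR(1) process with coefficient 0.95) yields a much smaller value, which is the sense in which a low ESS signals poor mixing.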
We still argue that tgMCMC is a better option for this dataset, since finding higher log-likelihood states matters more here than mixing rates.

5 Conclusion

In this paper we have presented a novel inference algorithm for NRMM models. Our sampler, called tgMCMC, utilized the binary trees constructed by IBHC to propose good-quality samples. tgMCMC explored the space of partitions via global and local moves which were guided by the potential functions of trees. tgMCMC was demonstrated to outperform existing samplers on both synthetic and real-world datasets.

Acknowledgments: This work was supported by the IT R&D Program of MSIP/IITP (B0101-15-0307, Machine Learning Center), National Research Foundation (NRF) of Korea (NRF-2013R1A2A2A01067464), and the IITP-MSRA Creative ICT/SW Research Project.

4 https://archive.ics.uci.edu/ml/datasets/Bag+of+Words