{"title": "Bayesian Agglomerative Clustering with Coalescents", "book": "Advances in Neural Information Processing Systems", "page_first": 1473, "page_last": 1480, "abstract": "", "full_text": "Bayesian Agglomerative Clustering with Coalescents\n\nYee Whye Teh\nGatsby Unit\n\nUniversity College London\nywteh@gatsby.ucl.ac.uk\n\nHal Daum\u00b4e III\n\nSchool of Computing\nUniversity of Utah\nme@hal3.name\n\nDaniel Roy\n\nCSAIL\nMIT\n\ndroy@mit.edu\n\nAbstract\n\nWe introduce a new Bayesian model for hierarchical clustering based on a prior\nover trees called Kingman\u2019s coalescent. We develop novel greedy and sequential\nMonte Carlo inferences which operate in a bottom-up agglomerative fashion. We\nshow experimentally the superiority of our algorithms over the state-of-the-art,\nand demonstrate our approach in document clustering and phylolinguistics.\n\n1\n\nIntroduction\n\nHierarchically structured data abound across a wide variety of domains. It is thus not surprising that\nhierarchical clustering is a traditional mainstay of machine learning [1]. The dominant approach to\nhierarchical clustering is agglomerative: start with one cluster per datum, and greedily merge pairs\nuntil a single cluster remains. Such algorithms are ef\ufb01cient and easy to implement. Their primary\nlimitations\u2014a lack of predictive semantics and a coherent mechanism to deal with missing data\u2014\ncan be addressed by probabilistic models that handle partially observed data, quantify goodness-of-\n\ufb01t, predict on new data, and integrate within more complex models, all in a principled fashion.\nCurrently there are two main approaches to probabilistic models for hierarchical clustering. The\n\ufb01rst takes a direct Bayesian approach by de\ufb01ning a prior over trees followed by a distribution over\ndata points conditioned on a tree [2, 3, 4, 5]. MCMC sampling is then used to obtain trees from\ntheir posterior distribution given observations. 
This approach has the advantages and disadvantages of most Bayesian models: averaging over sampled trees can improve predictive capabilities, give confidence estimates for conclusions drawn from the hierarchy, and share statistical strength across the model; but it is also computationally demanding and complex to implement. As a result such models have not found widespread use. [2] has the additional advantage that the distribution induced on the data points is exchangeable, so the model can be coherently extended to new data. The second approach uses a flat mixture model as the underlying probabilistic model and structures the posterior hierarchically [6, 7]. This approach uses an agglomerative procedure to find the tree giving the best posterior approximation, mirroring traditional agglomerative clustering techniques closely and giving efficient and easy-to-implement algorithms. However, because the underlying model has no hierarchical structure, there is no sharing of information across the tree.

We propose a novel class of Bayesian hierarchical clustering models and associated inference algorithms combining the advantages of both probabilistic approaches above. 1) We define a prior and compute the posterior over trees, thus reaping the benefits of a fully Bayesian approach; 2) the distribution over data is hierarchically structured, allowing for sharing of statistical strength; 3) we have efficient and easy-to-implement inference algorithms that construct trees agglomeratively; and 4) the induced distribution over data points is exchangeable. Our model is based on an exchangeable distribution over trees called Kingman's coalescent [8, 9]. Kingman's coalescent is a standard model from population genetics for the genealogy of a set of individuals. It is obtained by tracing the genealogy backwards in time, noting when lineages coalesce together. We review Kingman's coalescent in Section 2.
Our own contribution is in using it as a prior over trees in a hierarchical clustering model (Section 3) and in developing novel inference procedures for this model (Section 4).

Figure 1: (a) Variables describing the n-coalescent. (b) Sample path from a Brownian diffusion coalescent process in 1D; circles are coalescent points. (c) Sample observed points from the same process in 2D; notice the hierarchically clustered nature of the points.

2 Kingman's coalescent

Kingman's coalescent is a standard model in population genetics describing the common genealogy (ancestral tree) of a set of individuals [8, 9]. In its full form it is a distribution over the genealogy of a countably infinite set of individuals. Like other nonparametric models (e.g. Gaussian and Dirichlet processes), Kingman's coalescent is most easily described and understood in terms of its finite dimensional marginal distributions over the genealogies of n individuals, called n-coalescents. We obtain Kingman's coalescent as n → ∞.

Consider the genealogy of n individuals alive at the present time t = 0. We can trace their ancestry backwards in time to the distant past t = −∞. Assume each individual has one parent (in genetics, haploid organisms), and therefore genealogies of [n] = {1, ..., n} form a directed forest. In general, at time t ≤ 0 there are m (1 ≤ m ≤ n) ancestors alive. Identify these ancestors with their corresponding sets ρ_1, ..., ρ_m of descendants (we will make this identification throughout the paper). Note that π(t) = {ρ_1, ..., ρ_m} forms a partition of [n], and interpret t ↦ π(t) as a function from (−∞, 0] to the set of partitions of [n]. This function is piecewise constant, left-continuous, and monotonic (s ≤ t implies that π(t) is a refinement of π(s)), with π(0) = {{1}, ..., {n}} (see Figure 1a).
Further, π completely and succinctly characterizes the genealogy; we shall henceforth refer to π as the genealogy of [n]. Kingman's n-coalescent is simply a distribution over genealogies of [n], or equivalently, over the space of partition-valued functions like π. More specifically, the n-coalescent is a continuous-time, partition-valued Markov process which starts at {{1}, ..., {n}} at present time t = 0 and evolves backwards in time, merging (coalescing) lineages until only one is left. To describe the Markov process in its entirety, it is sufficient to describe the jump process (i.e. the embedded, discrete-time Markov chain over partitions) and the distribution over coalescent times. Both are straightforward, and their simplicity is part of the appeal of Kingman's coalescent. Let ρ_li, ρ_ri be the ith pair of lineages to coalesce, let t_{n−1} < ... < t_1 < t_0 = 0 be the coalescent times, and let δ_i = t_{i−1} − t_i > 0 be the duration between adjacent events (see Figure 1a). Under the n-coalescent, every pair of lineages merges independently with exponential rate 1. Thus the first pair amongst m lineages merges with rate \binom{m}{2} = \frac{m(m-1)}{2}. Therefore δ_i ∼ Exp(\binom{n-i+1}{2}) independently, the pair ρ_li, ρ_ri is chosen uniformly from among the lineages present right after time t_i, and with probability one a random draw from the n-coalescent is a binary tree with a single root at t = −∞ and the n individuals at time t = 0. The genealogy is:

\pi(t) = \begin{cases} \{\{1\}, \ldots, \{n\}\} & \text{if } t = 0; \\ \pi_{t_{i-1}} - \rho_{li} - \rho_{ri} + (\rho_{li} \cup \rho_{ri}) & \text{if } t = t_i; \\ \pi_{t_i} & \text{if } t_{i+1} < t < t_i. \end{cases} \qquad (1)

Combining the probabilities of the durations and choices of lineages, the probability of π is simply:

p(\pi) = \prod_{i=1}^{n-1} \binom{n-i+1}{2} \exp\left(-\binom{n-i+1}{2}\delta_i\right) \Big/ \binom{n-i+1}{2} = \prod_{i=1}^{n-1} \exp\left(-\binom{n-i+1}{2}\delta_i\right) \qquad (2)

The n-coalescent has some interesting statistical properties [8, 9]. The marginal distribution over tree topologies is uniform and independent of the coalescent times. Secondly, it is infinitely exchangeable: given a genealogy drawn from an n-coalescent, the genealogy of any m contemporary individuals alive at time t ≤ 0 embedded within the genealogy is a draw from the m-coalescent. Thus, taking n → ∞, there is a distribution over genealogies of a countably infinite population for which the marginal distribution of the genealogy of any n individuals gives the n-coalescent. Kingman called this the coalescent.

3 Hierarchical clustering with coalescents

We take a Bayesian approach to hierarchical clustering, placing a coalescent prior on the latent tree and modeling the observed data with a tree-structured Markov process evolving forward in time.
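Drawing a genealogy from the n-coalescent prior of Section 2 and scoring it under equation (2) takes only a few lines. The sketch below is illustrative (not the authors' code); subtrees are represented as frozensets of leaf indices:

```python
import random

def sample_coalescent(n, rng=random.Random(0)):
    """Draw a genealogy from Kingman's n-coalescent: while m lineages
    remain, wait an Exp(C(m,2)) duration and merge a uniformly chosen pair.
    Returns (duration, left, right) triples ordered backwards in time."""
    lineages = [frozenset([i]) for i in range(n)]
    events = []
    while len(lineages) > 1:
        m = len(lineages)
        delta = rng.expovariate(m * (m - 1) / 2)  # every pair merges at rate 1
        left, right = rng.sample(lineages, 2)     # uniformly chosen pair
        lineages.remove(left)
        lineages.remove(right)
        lineages.append(left | right)
        events.append((delta, left, right))
    return events

def log_prior(events):
    """log p(pi) from equation (2): sum over i of -C(n-i+1, 2) * delta_i."""
    n = len(events) + 1
    return sum(-(n - i + 1) * (n - i) / 2 * d
               for i, (d, _, _) in enumerate(events, start=1))
```

Note that the topology produced by the uniform pair choices is independent of the sampled durations, matching the statistical property stated above.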
We will alter our terminology from genealogy to tree, from n individuals at present time to n observed data points, and from individuals on the genealogy to latent variables on the tree-structured distribution. Let x = {x_1, ..., x_n} be the n observed data points at the leaves of a tree π drawn from the n-coalescent. π has n − 1 coalescent points, the ith occurring when ρ_li and ρ_ri merge at time t_i to form ρ_i = ρ_li ∪ ρ_ri. Let t_li and t_ri be the times at which ρ_li and ρ_ri are themselves formed.

We use a continuous-time Markov process to define the distribution over the n data points x given the tree π. The Markov process starts in the distant past, evolves forward in time, splits at each coalescent point, and evolves independently down both branches until we reach time 0, when the n data points are observations of the process at the n leaves of the tree. The joint distribution described by this process respects the conditional independences implied by the structure of the directed tree π. Let y_{ρ_i} be a latent variable that takes on the value of the Markov process at ρ_i just before it splits, and let y_{{i}} = x_i at leaf i. See Figure 1a.

To complete the description of the likelihood model, let q(z) be the initial distribution of the Markov process at time t = −∞, and k_st(x, y) be the transition probability from state x at time s to state y at time t. This Markov process need be neither stationary nor ergodic. Marginalizing over paths of the Markov process, the joint probability over the latent variables and the observations is:

p(x, y, z \mid \pi) = q(z)\, k_{-\infty\, t_{n-1}}(z, y_{\rho_{n-1}}) \prod_{i=1}^{n-1} k_{t_i t_{li}}(y_{\rho_i}, y_{\rho_{li}})\, k_{t_i t_{ri}}(y_{\rho_i}, y_{\rho_{ri}}) \qquad (3)

Notice that the marginal distributions for each observation p(x_i | π) are identical and given by the Markov process at time 0. However, the observations are not independent, as they share the same sample path down the Markov process until it splits. In fact the amount of dependence between two observations is a function of the time at which the observations coalesce. A more recent coalescent time implies larger dependence. The overall distribution induced on the observations p(x) inherits the infinite exchangeability of the n-coalescent. We consider in Section 4.3 a Brownian diffusion (Figures 1(b,c)) and a simple independent-sites mutation process on multinomial vectors.

4 Agglomerative sequential Monte Carlo and greedy inference

We develop two classes of efficient and easily implementable inference algorithms for our hierarchical clustering model, based on sequential Monte Carlo (SMC) and greedy schemes respectively. In both classes, the latent variables are integrated out, and the trees are constructed in a bottom-up fashion. The full tree π can be expressed as a series of n − 1 coalescent events, ordered backwards in time. The ith coalescent event involves the merging of the two subtrees with leaves ρ_li and ρ_ri and occurs at a time δ_i before the previous coalescent event. Let θ_i = {δ_j, ρ_lj, ρ_rj for j ≤ i} denote the first i coalescent events. θ_{n−1} is equivalent to π and we shall use them interchangeably.

We assume that the form of the Markov process is such that the latent variables {y_{ρ_i}}_{i=1}^{n−1} and z can be efficiently integrated out using an upward pass of belief propagation on the tree. Let M_{ρ_i}(y) be the message passed from y_{ρ_i} to its parent; M_{{i}}(y) = δ_{x_i}(y) is a point mass at x_i for leaf i. M_{ρ_i}(y) is proportional to the likelihood of the observations at the leaves below coalescent event i, given that y_{ρ_i} = y.
Belief propagation computes the messages recursively up the tree; for i = 1, ..., n − 1:

M_{\rho_i}(y) = Z_{\rho_i}^{-1}(x, \theta_i) \prod_{b=l,r} \int k_{t_i t_{bi}}(y, y_b)\, M_{\rho_{bi}}(y_b)\, dy_b \qquad (4)

where Z_{ρ_i}(x, θ_i) is a normalization constant. The choice of Z does not affect the computed probability of x, but does impact the accuracy and efficiency of our inference algorithms. We found that choosing Z_{ρ_i}(x, θ_i) so that the normalized message satisfies ∫∫ q(z) k_{−∞ t_i}(z, y) M_{ρ_i}(y) dy dz = 1 worked well. At the root, we have:

Z_{-\infty}(x, \theta_{n-1}) = \iint q(z)\, k_{-\infty\, t_{n-1}}(z, y)\, M_{\rho_{n-1}}(y)\, dy\, dz \qquad (5)

The marginal probability p(x|π) is now given by the product of normalization constants:

p(x \mid \pi) = Z_{-\infty}(x, \theta_{n-1}) \prod_{i=1}^{n-1} Z_{\rho_i}(x, \theta_i) \qquad (6)

Multiplying in the prior (2) over π, we get the joint probability for the tree π and observations x:

p(x, \pi) = Z_{-\infty}(x, \theta_{n-1}) \prod_{i=1}^{n-1} \exp\left(-\binom{n-i+1}{2}\delta_i\right) Z_{\rho_i}(x, \theta_i) \qquad (7)

Our inference algorithms are based upon (7). The sequential Monte Carlo (SMC) algorithms approximate the posterior over the tree θ_{n−1} using a weighted sum of samples, while the greedy algorithms construct θ_{n−1} by maximizing local terms in (7). Both proceed by iterating over i = 1, ..., n − 1, choosing a duration δ_i and a pair of subtrees ρ_li, ρ_ri to coalesce at each iteration.
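The bottom-up construction shared by both inference families can be sketched as a generic loop over the factors of (7). In the sketch below, `local_loglik(l, r, delta)` and `optimal_delta(l, r, rate)` are hypothetical model-specific callables (they stand in for log Z_{ρ_i} and its maximizing duration); the loop itself follows the Greedy-MaxProb strategy described later:

```python
from itertools import combinations

def greedy_coalesce(n, local_loglik, optimal_delta):
    """At each iteration pick the duration and pair maximizing the i-th
    factor of (7): local prior exp(-C(n-i+1,2) delta) times local
    likelihood Z.  Subtrees are frozensets of leaf indices."""
    subtrees = [frozenset([i]) for i in range(n)]
    events = []
    for i in range(1, n):
        m = n - i + 1
        rate = m * (m - 1) / 2          # local prior: delta_i ~ Exp(C(m, 2))
        candidates = []
        for l, r in combinations(subtrees, 2):
            d = optimal_delta(l, r, rate)
            # log of the i-th term of (7) at the optimal duration
            candidates.append((-rate * d + local_loglik(l, r, d), d, l, r))
        _, d, l, r = max(candidates, key=lambda c: c[0])
        subtrees.remove(l)
        subtrees.remove(r)
        subtrees.append(l | r)
        events.append((d, l, r))
    return events
```

Swapping the `max` for a sampling step over the same scores recovers the structure of the SMC proposals discussed next.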
This choice is based upon the ith term in (7), interpreted as the product of a local prior exp(−\binom{n-i+1}{2}δ_i) and a local likelihood Z_{ρ_i}(x, θ_i) for choosing δ_i, ρ_li and ρ_ri given θ_{i−1}.

4.1 Sequential Monte Carlo algorithms

SMC algorithms approximate the posterior by iteratively constructing a weighted sum of point masses. At iteration i − 1, particle s consists of θ^s_{i−1} = {δ^s_j, ρ^s_lj, ρ^s_rj for j < i} and has weight w^s_{i−1}. At iteration i, particle s is extended by sampling δ^s_i, ρ^s_li and ρ^s_ri from a proposal distribution f_i(δ^s_i, ρ^s_li, ρ^s_ri | θ^s_{i−1}), and the weight is updated by:

w^s_i = w^s_{i-1} \exp\left(-\binom{n-i+1}{2}\delta^s_i\right) Z_{\rho_i}(x, \theta^s_i) \Big/ f_i(\delta^s_i, \rho^s_{li}, \rho^s_{ri} \mid \theta^s_{i-1}) \qquad (8)

After n − 1 iterations, we obtain a set of trees θ^s_{n−1} and weights w^s_{n−1}. The joint distribution is approximated by p(π, x) ≈ Σ_s w^s_{n−1} δ_{θ^s_{n−1}}(π), while the posterior is approximated with the weights normalized. An important aspect of SMC is resampling, which places more particles in high probability regions and prunes particles stuck in low probability regions. We resample as in Algorithm 5.1 of [10] when the effective sample size ratio as estimated in [11] falls below one half.

SMC-PriorPrior. The simplest proposal distribution is to sample δ^s_i, ρ^s_li and ρ^s_ri from the local prior: δ^s_i is drawn from an exponential with rate \binom{n-i+1}{2}, and ρ^s_li, ρ^s_ri are drawn uniformly from all available pairs. The weight updates (8) reduce to multiplying by Z_{ρ_i}(x, θ^s_i). This approach is computationally very efficient, but performs badly with many objects due to the uniform draws over pairs.

SMC-PriorPost. The second approach addresses the suboptimal choice of pairs to coalesce. We first draw δ^s_i from its local prior, then draw ρ^s_li and ρ^s_ri from the local posterior:

f_i(\rho^s_{li}, \rho^s_{ri} \mid \delta^s_i, \theta^s_{i-1}) \propto Z_{\rho_i}(x, \theta^s_{i-1}, \delta^s_i, \rho^s_{li}, \rho^s_{ri}); \quad w^s_i = w^s_{i-1} \sum_{\rho'_l, \rho'_r} Z_{\rho_i}(x, \theta^s_{i-1}, \delta^s_i, \rho'_l, \rho'_r) \qquad (9)

This approach is more computationally demanding since we need to evaluate the local likelihood of every pair. It also performs significantly better than SMC-PriorPrior. We have found that it works reasonably well for small data sets but fails on larger ones for which the local posterior for δ_i is highly peaked.

SMC-PostPost. The third approach is to draw all of δ^s_i, ρ^s_li and ρ^s_ri from their posterior:

f_i(\delta^s_i, \rho^s_{li}, \rho^s_{ri} \mid \theta^s_{i-1}) \propto \exp\left(-\binom{n-i+1}{2}\delta^s_i\right) Z_{\rho_i}(x, \theta^s_{i-1}, \delta^s_i, \rho^s_{li}, \rho^s_{ri});
\quad w^s_i = w^s_{i-1} \sum_{\rho'_l, \rho'_r} \int \exp\left(-\binom{n-i+1}{2}\delta'\right) Z_{\rho_i}(x, \theta^s_{i-1}, \delta', \rho'_l, \rho'_r)\, d\delta' \qquad (10)

This approach requires the fewest particles, but is the most computationally expensive due to the integral for each pair.
Fortunately, for the case of the Brownian diffusion process described below, these integrals are tractable and related to generalized inverse Gaussian distributions.

4.2 Greedy algorithms

SMC algorithms are attractive because they can produce an arbitrarily accurate approximation to the full posterior as the number of samples grows. However, in many applications a single good tree is often sufficient. We describe a few greedy algorithms to construct a good tree.

Greedy-MaxProb: the obvious greedy algorithm is to pick δ_i, ρ_li and ρ_ri maximizing the ith term in (7). We do so by computing the optimal δ_i for each pair ρ_li, ρ_ri, and then picking the pair maximizing the ith term at its optimal δ_i. Greedy-MinDuration: pick the pair to coalesce whose optimal duration is minimum. Both algorithms require recomputing the optimal duration for each pair at each iteration, since the prior rate \binom{n-i+1}{2} on the duration varies with the iteration i. The total computational cost is thus O(n^3). We can avoid this by using the alternative view of the n-coalescent as a Markov process in which each pair of lineages coalesces at rate 1. Greedy-Rate1: for each pair ρ_li and ρ_ri we determine the optimal δ_i, replacing the \binom{n-i+1}{2} prior rate with 1, and we coalesce the pair with the most recent time (as in Greedy-MinDuration). This reduces the complexity to O(n^2). We found that all three performed similarly, and use Greedy-Rate1 in our experiments as it is faster.

4.3 Examples

Brownian diffusion. Consider the case of continuous data evolving via Brownian diffusion. The transition kernel k_st(y, ·) is a Gaussian centred at y with covariance (t − s)Λ, where Λ is a symmetric positive definite covariance matrix.
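Forward simulation of this diffusion down a genealogy is what produces samples like Figures 1(b,c). The sketch below assumes Λ = I and draws the root value from a standard normal as a stand-in for q(z), since the paper's process starts at t = −∞ (an assumption of this sketch, not the authors' construction):

```python
import math
import random

def simulate_brownian(events, n, dim=2, rng=random.Random(1)):
    """Simulate Brownian diffusion (Lambda = I) forward down a genealogy
    given as (duration, left, right) triples ordered backwards in time.
    Returns one dim-dimensional observation per leaf."""
    t = 0.0
    form_time = {frozenset([i]): 0.0 for i in range(n)}  # leaves live at t = 0
    merges = []
    for d, l, r in events:
        t -= d                          # absolute coalescent time t_i
        form_time[l | r] = t
        merges.append((l | r, l, r))
    # root value: stand-in draw for the initial distribution q(z)
    value = {merges[-1][0]: [rng.gauss(0.0, 1.0) for _ in range(dim)]}
    for s, l, r in reversed(merges):    # root first, i.e. forward in time
        for child in (l, r):
            # branch of length (t_child - t_parent) adds that much variance
            sd = math.sqrt(form_time[child] - form_time[s])
            value[child] = [y + rng.gauss(0.0, sd) for y in value[s]]
    return {i: value[frozenset([i])] for i in range(n)}
```

Leaves below a recent coalescent point share most of their sample path and so land close together, which is the hierarchically clustered structure visible in Figure 1(c).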
Because the joint distribution (3) over x, y and z is Gaussian, we can express each message M_{ρ_i}(y) as a Gaussian with mean ŷ_{ρ_i} and covariance Λ v_{ρ_i}. The local likelihood is:

Z_{\rho_i}(x, \theta_i) = |2\pi\hat\Lambda_i|^{-1/2} \exp\left(-\tfrac{1}{2}\|\hat y_{\rho_{li}} - \hat y_{\rho_{ri}}\|^2_{\hat\Lambda_i}\right), \quad \hat\Lambda_i = \Lambda\,(v_{\rho_{li}} + v_{\rho_{ri}} + t_{li} + t_{ri} - 2t_i), \qquad (11)

where \|x\|^2_\Psi = x^\top \Psi^{-1} x is the squared Mahalanobis norm. The optimal duration δ_i can also be solved for:

\delta_i = \left(4\binom{n-i+1}{2}\right)^{-1}\left(\sqrt{4\binom{n-i+1}{2}\|\hat y_{\rho_{li}} - \hat y_{\rho_{ri}}\|^2_\Lambda + D^2} - D\right) - \tfrac{1}{2}(v_{\rho_{li}} + v_{\rho_{ri}} + t_{li} + t_{ri} - 2t_{i-1}), \qquad (12)

where D is the dimensionality. The message at the newly coalesced point has parameters:

v_{\rho_i} = \left((v_{\rho_{li}} + t_{li} - t_i)^{-1} + (v_{\rho_{ri}} + t_{ri} - t_i)^{-1}\right)^{-1}; \quad \hat y_{\rho_i} = v_{\rho_i}\left(\frac{\hat y_{\rho_{li}}}{v_{\rho_{li}} + t_{li} - t_i} + \frac{\hat y_{\rho_{ri}}}{v_{\rho_{ri}} + t_{ri} - t_i}\right). \qquad (13)

Multinomial vectors. Consider a Markov process acting on multinomial vectors with each entry taking one of K values and evolving independently. Entry d evolves at rate λ_d and has equilibrium distribution vector q_d. The transition rate matrix is Q_d = λ_d(1_K q_d^⊤ − I_K), where 1_K is a vector of K ones and I_K is the identity matrix of size K, while the transition probability matrix for entry d over a time interval of length t is e^{Q_d t} = e^{−λ_d t} I_K + (1 − e^{−λ_d t}) 1_K q_d^⊤.
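The closed form for e^{Q_d t} just stated can be checked numerically. The helper below is a sketch (not the authors' code): it builds the transition matrix row by row, and in testing one can verify that rows sum to one, that q_d is stationary, and that the semigroup property P(s)P(t) = P(s + t) holds:

```python
import math

def transition_matrix(q, lam, t):
    """Closed-form e^{Q t} for Q = lam (1_K q^T - I_K): over an interval of
    length t the entry keeps its state with probability e^{-lam t} and
    otherwise resamples from the equilibrium distribution q."""
    K = len(q)
    e = math.exp(-lam * t)
    return [[e * (j == k) + (1 - e) * q[k] for k in range(K)]
            for j in range(K)]
```

The rank-one structure is what makes the belief-propagation messages below so cheap: a branch either retains its state or forgets it entirely.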
Representing the message for entry d from ρ_i to its parent as a vector M^d_{ρ_i} = [M^{d1}_{ρ_i}, ..., M^{dK}_{ρ_i}]^⊤, normalized so that q_d · M^d_{ρ_i} = 1, the local likelihood terms and messages are computed as:

Z^d_{\rho_i}(x, \theta_i) = 1 - e^{\lambda_d(2t_i - t_{li} - t_{ri})}\left(1 - \sum_{k=1}^K q_{dk} M^{dk}_{\rho_{li}} M^{dk}_{\rho_{ri}}\right) \qquad (14)

M^{dk}_{\rho_i} = \left(1 - e^{\lambda_d(t_i - t_{li})}(1 - M^{dk}_{\rho_{li}})\right)\left(1 - e^{\lambda_d(t_i - t_{ri})}(1 - M^{dk}_{\rho_{ri}})\right) \Big/ Z^d_{\rho_i}(x, \theta_i) \qquad (15)

Unfortunately the optimal δ_i cannot be solved for analytically, and we use Newton steps to compute it.

4.4 Hyperparameter estimation

We perform hyperparameter estimation by iterating between estimating a tree and estimating the hyperparameters. In the Brownian case, we place an inverse Wishart prior on Λ and the MAP posterior Λ̂ is available in a standard closed form. In the multinomial case, the updates are not available analytically and are solved iteratively. Further information on hyperparameter estimation, as well as predictive densities and more experiments, is available in a longer technical report.

5 Experiments

Synthetic Data Sets. In Figure 2 we compare the various SMC algorithms and Greedy-Rate1 on a range of synthetic data sets drawn from the Brownian diffusion coalescent process itself (Λ = I_D) to investigate the effects of various parameters on the efficacy of the algorithms.¹ Generally SMC-PostPost performed best, followed by SMC-PriorPost, SMC-PriorPrior and Greedy-Rate1. With increasing D the amount of data given to the algorithms increases and all algorithms do better, especially Greedy-Rate1. This is because the posterior becomes concentrated and the Greedy-Rate1 approximation corresponds well with the posterior. As n increases, the amount of data increases as well and all algorithms perform better.
However, the posterior space also increases, and SMC-PriorPrior, which simply samples from the prior over genealogies, does not improve as much. We see this effect as well when S is small. As S increases all SMC algorithms improve. Finally, the algorithms were surprisingly robust when there is mismatch between the generated data sets' λ and the λ used by the model. We expected all models to perform worse, with SMC-PostPost best able to maintain its performance (though this is possibly due to our experimental setup).

MNIST and SPAMBASE. We compare the performance of Greedy-Rate1 to two other hierarchical clustering algorithms: average-linkage and Bayesian hierarchical clustering (BHC) [6]. In MNIST, we use 20 exemplars from each of the 10 digits in the MNIST data set, reduced via PCA to 20 dimensions, repeating the experiment 50 times.

¹Each panel was generated from independent runs. Data set variance affected all algorithms, varying overall performance across panels. However, trends in each panel are still valid, as they are based on the same data.

Figure 2: Predictive performance of algorithms as we vary (a) the number of dimensions D, (b) the number of observations n, (c) the mutation rate λ (Λ = λ I_D), and (d) the number of samples S. In each panel the other parameters are fixed to their middle values in the other panels (we used S = 50), and we report log predictive probabilities on one unobserved entry, averaged over 100 runs.

                      MNIST                                 SPAMBASE
           Coalescent   Avg-link    BHC         Coalescent   Avg-link    BHC
Purity     .412±.006    .392±.006   .363±.004   .689±.008    .711±.010   .616±.007
Subtree    .610±.005    .579±.005   .581±.005   .661±.012    .607±.011   .549±.015
LOO-acc    .773±.005    .763±.005   .755±.005   .861±.008    .846±.010   .832±.010

Table 1: Comparative results. Numbers are averages and standard errors over 50 (MNIST) and 20 (SPAMBASE) repeats.
In SPAMBASE, we use 100 examples of 57 binary attributes from each of the 2 classes, repeating 20 times. We present purity scores [6], subtree scores (#{interior nodes with all leaves of the same class}/(n − #classes)), and leave-one-out accuracies (all scores between 0 and 1, higher is better). The results are in Table 1; except for purity on SPAMBASE, ours gives the best performance. Experiments not presented here show that all greedy algorithms perform about the same and that performance improves with hyperparameter updates.

Phylolinguistics. We apply Greedy-Rate1 to a phylolinguistic problem: language evolution. Unlike previous research [12], which studies only phonological data, we use a full typological database of 139 binary features over 2150 languages: the World Atlas of Language Structures (WALS) [13]. The data is sparse: about 84% of the entries are unknown. We use the same version of the database as extracted by [14]. Based on the Indo-European subset of this data for which at most 30 features are unknown (48 languages total), we recover the coalescent tree shown in Figure 3(a). Each language is shown with its genus, allowing us to observe that the tree teases apart Germanic and Romance languages, but makes a few errors with respect to Iranian and Greek.

Next we compare predictive abilities to other algorithms. We take a subset of WALS and test on 5% of withheld entries, restoring these with various techniques: Greedy-Rate1; nearest neighbors (use the value from the nearest observed neighbor); average-linkage (nearest neighbor in the tree); and probabilistic PCA (latent dimension in {5, 10, 20, 40}, chosen optimistically). We use five subsets of the WALS database, obtained by sorting both the languages and features of the database according to sparsity and using a varying percentage (10%–50%) of the densest portion. The results are in Figure 3(b).
Our approach performed reasonably well.

Finally, we compare the trees generated by Greedy-Rate1 with trees generated by either average-linkage or BHC, using the same evaluation criteria as for MNIST and SPAMBASE and using language genus as classes. The results are in Table 2, where we can see that the coalescent significantly outperforms the other methods.

               Whole World Data                Indo-European Data
           Avg-link   BHC     Coalescent   Avg-link   BHC     Coalescent
Purity     0.162      0.160   0.269        0.510      0.491   0.813
Subtree    0.227      0.099   0.177        0.414      0.414   0.690
LOO-acc    0.080      0.248   0.369        0.538      0.590   0.769

Table 2: Comparative performance of various algorithms on phylolinguistics data.

(b) Data restoration on WALS. Y-axis is accuracy; X-axis is the percentage of the data set used in the experiments. At 10%, there are N = 215 languages, H = 14 features and p = 94% observed data; at 20%, N = 430, H = 28 and p = 80%; at 30%, N = 645, H = 42 and p = 66%; at 40%, N = 860, H = 56 and p = 53%; at 50%, N = 1075, H = 70 and p = 43%. Results are averaged over five folds with a different 5% hidden each time.
(We also tried a "mode" prediction, but its performance is in the 60% range in all cases, and is not depicted.)

(a) Coalescent for a subset of Indo-European languages from WALS.

Figure 3: Results of the phylolinguistics experiments.

LLR (t)        Top Words                                      Top Authors (# papers)
32.7 (-2.71)   bifurcation attractors hopfield network saddle Mjolsness (9) Saad (9) Ruppin (8) Coolen (7)
0.106 (-3.77)  voltage model cells neurons neuron             Koch (30) Sejnowski (22) Bower (11) Dayan (10)
83.8 (-2.02)   chip circuit voltage vlsi transistor           Koch (12) Alspector (6) Lazzaro (6) Murray (6)
140.0 (-2.43)  spike ocular cells firing stimulus             Sejnowski (22) Koch (18) Bower (11) Dayan (10)
2.48 (-3.66)   data model learning algorithm training         Jordan (17) Hinton (16) Williams (14) Tresp (13)
31.3 (-2.76)   infomax image ica images kurtosis              Hinton (12) Sejnowski (10) Amari (7) Zemel (7)
31.6 (-2.83)   data training regression learning model        Jordan (16) Tresp (13) Smola (11) Moody (10)
39.5 (-2.46)   critic policy reinforcement agent controller   Singh (15) Barto (10) Sutton (8) Sanger (7)
23.0 (-3.03)   network training units hidden input            Mozer (14) Lippmann (11) Giles (10) Bengio (9)

Table 3: Nine clusters discovered in NIPS abstracts data.

NIPS. We applied Greedy-Rate1 to all NIPS abstracts through NIPS12 (1740 in total). The data was preprocessed so that only words occurring in at least 100 abstracts were retained. The word counts were then converted to binary. We performed one iteration of hyperparameter re-estimation. In the supplemental material, we depict the top levels of the coalescent tree. Here, we use the tree to generate a flat clustering. To do so, we use the log likelihood ratio at each branch in the coalescent to determine if a split should occur. If the log likelihood ratio is greater than zero, we break the branch; otherwise, we recurse down. On the NIPS abstracts, this leads to nine clusters, depicted in Table 3.
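One natural reading of this tree-cutting rule (walk down from the root, break any branch whose LLR exceeds a threshold, and emit each unbroken subtree as one flat cluster) can be sketched in a few lines. Here `llr` is a user-supplied function and the whole interface is hypothetical, not the authors' code:

```python
def flat_clusters(node, llr, threshold=0.0):
    """Cut a binary tree into flat clusters: if the branch at `node` has
    llr(node) > threshold, split and recurse into both children; otherwise
    the whole subtree becomes a single cluster.  `node` is a leaf label or
    a (left, right) pair."""
    if not isinstance(node, tuple):          # a leaf is its own cluster
        return [[node]]
    if llr(node) > threshold:                # break the branch
        return (flat_clusters(node[0], llr, threshold) +
                flat_clusters(node[1], llr, threshold))
    return [collect_leaves(node)]            # keep subtree together

def collect_leaves(node):
    if not isinstance(node, tuple):
        return [node]
    return collect_leaves(node[0]) + collect_leaves(node[1])
```

Raising `threshold` merges near-boundary clusters, which is the knob behind the observation about clusters two and three below.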
Note that clusters two and three are quite similar; had we used a slightly higher log likelihood ratio threshold, they would have been merged (the LLR for cluster 2 was only 0.105). Note that the clustering is able to tease apart Bayesian learning (cluster 5) and non-Bayesian learning (cluster 7), both of which have Mike Jordan as their top author!

6 Discussion

We described a new model for Bayesian agglomerative clustering. We used Kingman's coalescent as our prior over trees, and derived efficient and easily implementable greedy and SMC inference algorithms for the model. We showed empirically that our model gives better performance than other agglomerative clustering algorithms, and gives good results on applications to document modeling and phylolinguistics.

Our model is most similar in spirit to the Dirichlet diffusion tree of [2]. Both use infinitely exchangeable priors over trees. While [2] uses a fragmentation process for trees, our prior uses the reverse: a coalescent process instead.
This allows us to develop simpler inference algorithms than those in [2] (we have not compared our model against the Dirichlet diffusion tree due to the complexity of implementing it). It would be interesting to consider the possibility of developing similar agglomerative-style algorithms for [2]. [3] also describes a hierarchical clustering model involving a prior over trees, but his prior is not infinitely exchangeable. [5] uses tree-consistent partitions to model relational data; it would be interesting to apply our approach to their setting. Another related work is the Bayesian hierarchical clustering of [6], which uses an agglomerative procedure returning a tree-structured approximate posterior for a Dirichlet process mixture model. In contrast to our work, [6] uses a flat mixture model and does not have a notion of distributions over trees.

There are a number of unresolved issues with our work. Firstly, our algorithms take O(n³) computation time, except for Greedy-Rate1, which takes O(n²) time. Among the greedy algorithms we see no discernible differences in quality of approximation, so we recommend Greedy-Rate1. It would be interesting to develop SMC algorithms with O(n²) runtime, and to compare these against Greedy-Rate1 on real-world problems. Secondly, there are unanswered statistical questions. For example, since our prior is infinitely exchangeable, by de Finetti's theorem there is an underlying random distribution for which our observations are i.i.d. draws. What is this underlying random distribution, and what do samples from it look like? We know the answer for at least a simple case: if the Markov process is a mutation process with mutation rate α/2 and new states are drawn i.i.d. from a base distribution H, then the induced distribution is a Dirichlet process DP(α, H) [8].
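In that special case one can get a feel for draws from the underlying random distribution by sampling from DP(α, H) via the standard Chinese restaurant process. The sketch below is an illustration of that known construction, not part of the paper's algorithms; the function name `crp_sample` and the `base_draw` callback are hypothetical, and the usage in the test takes H to be a standard normal.

```python
import random

def crp_sample(n, alpha, base_draw, rng):
    """Draw n observations that are i.i.d. given G ~ DP(alpha, H), via the
    Chinese restaurant process: point i joins an existing cluster with
    probability proportional to its size, or starts a new cluster (a fresh
    atom drawn from H) with probability proportional to alpha."""
    values, sizes, obs = [], [], []
    for i in range(n):
        r = rng.random() * (i + alpha)     # total mass: i old customers + alpha
        if r < alpha:
            values.append(base_draw(rng))  # new cluster: fresh draw from H
            sizes.append(1)
            obs.append(values[-1])
        else:
            r -= alpha
            k = 0
            while r >= sizes[k]:           # walk existing clusters by size
                r -= sizes[k]
                k += 1
            sizes[k] += 1
            obs.append(values[k])
    return obs
```

Because H is continuous, repeated values in the output mark points sharing a cluster; the number of distinct atoms grows only logarithmically in n, which is the characteristic clustering behaviour of a DP draw.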
Another issue is that of consistency: does the posterior over random distributions converge to the true distribution as the number of observations grows? Finally, it would be interesting to generalize our approach to varying mutation rates, and to non-binary trees by using generalizations of Kingman's coalescent called Λ-coalescents [15].

References

[1] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Sons, New York, 1973.
[2] R. M. Neal. Defining priors for distributions using Dirichlet diffusion trees. Technical Report 0104, Department of Statistics, University of Toronto, 2001.
[3] C. K. I. Williams. A MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems, volume 12, 2000.
[4] C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum. Semi-supervised learning with trees. In Advances in Neural Information Processing Systems, volume 16, 2004.
[5] D. M. Roy, C. Kemp, V. Mansinghka, and J. B. Tenenbaum. Learning annotated hierarchies from relational data. In Advances in Neural Information Processing Systems, volume 19, 2007.
[6] K. A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning, volume 22, 2005.
[7] N. Friedman. PCluster: Probabilistic agglomerative clustering of gene expression profiles. Technical Report 2003-80, Hebrew University, 2003.
[8] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19:27–43, 1982. Essays in Statistical Science.
[9] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235–248, 1982.
[10] P. Fearnhead. Sequential Monte Carlo Methods in Filter Theory. PhD thesis, Merton College, University of Oxford, 1998.
[11] R. M. Neal. Annealed importance sampling.
Technical Report 9805, Department of Statistics, University of Toronto, 1998.
[12] A. McMahon and R. McMahon. Language Classification by Numbers. Oxford University Press, 2005.
[13] M. Haspelmath, M. Dryer, D. Gil, and B. Comrie, editors. The World Atlas of Language Structures. Oxford University Press, 2005.
[14] H. Daumé III and L. Campbell. A Bayesian model for discovering typological implications. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2007.
[15] J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.