{"title": "Bayesian Partitioning of Large-Scale Distance Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1368, "page_last": 1376, "abstract": "A Bayesian approach to partitioning distance matrices is presented. It is inspired by the 'Translation-Invariant Wishart-Dirichlet' process (TIWD) in (Vogt et al., 2010) and shares a number of advantageous properties like the fully probabilistic nature of the inference model, automatic selection of the number of clusters and applicability in semi-supervised settings. In addition, our method (which we call 'fastTIWD') overcomes the main shortcoming of the original TIWD, namely its high computational costs. The fastTIWD reduces the workload in each iteration of a Gibbs sampler from O(n^3) in the TIWD to O(n^2). Our experiments show that this cost reduction does not compromise the quality of the inferred partitions. With this new method it is now possible to 'mine' large relational datasets with a probabilistic model, thereby automatically detecting new and potentially interesting clusters.", "full_text": "Bayesian Partitioning of Large-Scale Distance Data\n\nDavid Adametz\n\nVolker Roth\n\nDepartment of Computer Science & Mathematics\n\nUniversity of Basel\nBasel, Switzerland\n\n{david.adametz,volker.roth}@unibas.ch\n\nAbstract\n\nA Bayesian approach to partitioning distance matrices is presented. It is inspired\nby the Translation-invariant Wishart-Dirichlet process (TIWD) in [1] and shares\na number of advantageous properties like the fully probabilistic nature of the in-\nference model, automatic selection of the number of clusters and applicability in\nsemi-supervised settings. In addition, our method (which we call fastTIWD) over-\ncomes the main shortcoming of the original TIWD, namely its high computational\ncosts. The fastTIWD reduces the workload in each iteration of a Gibbs sampler\nfrom O(n3) in the TIWD to O(n2). 
Our experiments show that the cost reduction\ndoes not compromise the quality of the inferred partitions. With this new method\nit is now possible to \u2018mine\u2019 large relational datasets with a probabilistic model,\nthereby automatically detecting new and potentially interesting clusters.\n\n1 Introduction\n\nIn cluster analysis we are concerned with identifying subsets of n objects that share some similarity\nand therefore potentially belong to the same sub-population. Many practical applications leave us\nwithout direct access to vectorial representations and instead only supply pairwise distance measures\ncollected in a matrix D. This poses a serious challenge, because great parts of geometric information\nare hereby lost that could otherwise help to discover hidden structures. One approach to deal with\nthis is to encode geometric invariances in the probabilistic model, as proposed in [1]. The most\nimportant properties that distinguish this Translation-invariant Wishart-Dirichlet Process (TIWD)\nfrom other approaches working on pairwise data are its fully probabilistic model, automatic selection\nof the number of clusters, and its applicability in semi-supervised settings in which not all classes\nare known in advance. Its main drawback, however, is the high computational cost of order O(n3)\nper sweep of a Gibbs sampler, limiting its applicability to relatively small data sets.\nIn this work we present an alternative method which shares all the positive properties of the TIWD\nwhile reducing the computational workload to O(n2) per Gibbs sweep. In analogy to [1] we call this\nnew approach fastTIWD. 
The main idea is to solve the problem of missing geometric information by a normalisation procedure, which chooses one particular geometric embedding of the distance data and allows us to use a simple probabilistic model for inferring the unknown underlying partition. The construction we use is guaranteed to give the optimal such geometric embedding if the true partition were known. Of course, this is only a hypothetical precondition, but we show that even rough prior estimates of the true partition significantly outperform 'naive' embedding strategies. Using a simple hierarchical clustering model to produce such prior estimates leads to clusterings being at least of the same quality as those obtained by the original TIWD. The algorithmic contribution here is an efficient algorithm for performing this normalisation procedure in O(n^2) time, which makes the whole pipeline from distance matrix to inferred partition an O(n^2) process (assuming a constant number of Gibbs sweeps). Detailed complexity analysis shows not only a worst-case complexity reduction from O(n^3) to O(n^2), but also a drastic speed improvement. We demonstrate this performance gain for a dataset containing ≈ 350 clusters, which can now be analysed in 6 hours instead of ≈ 50 days with the original TIWD.
It should be noted that both the TIWD and our fastTIWD model expect (squared) Euclidean distances on input. While this might be seen as a severe limitation, we argue that (i) a 'zoo' of Mercer kernels has been published in the last decade, e.g. kernels on graphs, sequences, probability distributions etc. 
All these kernels allow the construction of squared Euclidean distances; (ii) efficient preprocessing methods like randomised versions of kernel PCA have been proposed, which can be used to transform an initial matrix into one of squared Euclidean type; (iii) one might even use an arbitrary distance matrix hoping that the resulting model mismatch can be tolerated.
In the next section we introduce a probabilistic model for partitioning inner product matrices, which is generalised in section 3 to distance matrices using a preprocessing step that breaks the geometric symmetry inherent in distance representations. Experiments in section 4 demonstrate the high quality of clusterings found by our method and its superior computational efficiency over the TIWD.

2 A Wishart Model for Partitioning Inner Product Matrices
Suppose there is a matrix X ∈ R^{n×d} representing n objects in R^d that belong to one of k sub-populations. For identifying the underlying cluster structure, we formulate a generative model by assuming the columns x_i ∈ R^n, i = 1 . . . d, are i.i.d. according to a normal distribution with zero mean and covariance Σ ∈ R^{n×n}, i.e. x_i ∼ N(0_n, Σ), or in matrix notation: X ∼ N(0_{n×d}, Σ ⊗ I). Then, S = (1/d)XX^t ∈ R^{n×n} is central Wishart distributed, S ∼ W_d(Σ). For convenience we define the generalised central Wishart distribution, which also allows rank-deficient S and/or Σ, as

p(S|Ψ, d) ∝ det(S)^((d−n−1)/2) det(Ψ)^(d/2) exp[−(d/2) tr(ΨS)],   (1)

where det(•) is the product of non-zero eigenvalues and Ψ denotes the (generalised) inverse of Σ. The likelihood as a function in Ψ is

L(Ψ) = det(Ψ)^(d/2) exp[−(d/2) tr(ΨS)].   (2)

Consider now the case where we observe S without direct access to X. 
Then, an orthogonal trans-\nformation X \u2190 OX cannot be retrieved anymore, but it is reasonable to assume such rotations are\nirrelevant for \ufb01nding the partition. Following the Bayesian inference principle, we complement the\nlikelihood with a prior over \u03a8. Since by assumption there is an underlying joint normal distribu-\ntion, a zero entry in \u03a8 encodes conditional independence between two objects, which means that\nblock diagonal \u03a8 matrices de\ufb01ne a suitable partitioning model in which the joint normal is decom-\nposed into independent cluster-wise normals. Note that the inverse of a block diagonal matrix is\nalso block diagonal, so we can formulate the prior in terms of \u03a3, which is easier to parametrise.\nFor this purpose we adapt the method in [2] using a Multinomial-Dirichlet process model [3, 4, 5]\nto de\ufb01ne a \ufb02exible prior distribution over block matrices without specifying the exact number of\nblocks. We only brie\ufb02y sketch this construction and refer the reader to [1, 2] for further details. Let\nBn be the set of partitions of the index set [n]. A partition B \u2208 Bn can be represented in matrix\nform as B(i, j) = 1 if y(i) = y(j) and B(i, j) = 0 otherwise, with y being a function that maps\n[n] to some label set L. Alternatively, B may be represented as a set of disjoint non-empty subsets\ncalled \u2018blocks\u2019 b. A partition process is a series of distributions Pn on the set Bn in which Pn is the\nmarginal of Pn+1. Using a multinomial model for the labels and a Dirichlet prior with rate parame-\nter \u03be on the mixing proportions, we may integrate out the latter and derive a Dirichlet-Multinomial\nprior over labels. 
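As an aside, the resulting prior over partitions (stated explicitly in (3) below) is cheap to evaluate in log-space from the block sizes alone; a minimal sketch (the function name and label-list encoding are ours, not part of the original implementation):

```python
import math
from collections import Counter

def log_partition_prior(labels, xi, k):
    """Log of the Dirichlet-Multinomial prior p(B | xi, k) over partitions,
    cf. eq. (3); `labels` assigns each of the n objects to a block."""
    n = len(labels)
    block_sizes = list(Counter(labels).values())
    kB = len(block_sizes)
    logp = (math.lgamma(k + 1) - math.lgamma(k - kB + 1)   # k!/(k - kB)!
            + math.lgamma(xi) - math.lgamma(n + xi)
            - kB * math.lgamma(xi / k))
    for nb in block_sizes:
        logp += math.lgamma(nb + xi / k)                   # one factor per block
    return logp
```

As a sanity check, for n = 2, ξ = 1, k = 2 the two possible partitions receive probabilities 3/4 (one block) and 1/4 (two singletons), which sum to one; the value depends only on the block sizes, reflecting exchangeability.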
Finally, after using a 'label forgetting' transformation, the prior over B is:

p(B|ξ, k) = [k!/(k − k_B)!] · [Γ(ξ) ∏_{b∈B} Γ(n_b + ξ/k)] / ([Γ(ξ/k)]^{k_B} Γ(n + ξ)).   (3)

In this setting, k is the number of blocks in the population (k can be infinite, which leads to the Ewens Process [6], a.k.a. Chinese Restaurant Process), n_b is the number of objects in block b and k_B ≤ k is the total number of blocks in B. The prior is exchangeable, meaning rows and columns can be (jointly) permuted arbitrarily, and therefore partition matrices can always be brought to block diagonal form. To specify the variances of the normal distributions, the models in [1, 2] use two global parameters, α, β, for the within- and between-class scatter. This model can easily be extended to include block-wise scatter parameters, but for the sake of simplicity we stay with the simple parametrisation here. The final block diagonal covariance matrix used in (2) has the form

Σ = Ψ^{−1} = α(I_n + θB), with θ := β/α.   (4)

Inference by way of Gibbs sampling. Multiplying the Wishart likelihood (2), the prior over partitions (3) and suitable priors over α, θ gives the joint posterior. Inference for B, α and θ can then be carried out via a Gibbs sampler. Each Gibbs sweep can be efficiently implemented, since both trace and determinant in (2) can be computed analytically, see [1]:

tr(ΨS) = (1/α) Σ_{b∈B} [tr(S_bb) − θ/(1 + n_b θ) · S̄_bb] = (1/α) [tr(S) − Σ_{b∈B} θ/(1 + n_b θ) · S̄_bb],   (5)

where S_bb denotes the block submatrix corresponding to the bth diagonal block in B, and S̄_bb = 1_b^t S_bb 1_b. 1_b is the indicator function mapping block b to a {0,1}^n vector, whose elements are 1 if a sample is contained in b, or 0 otherwise. For the determinant one derives

det(Ψ) = α^{−n} ∏_{b∈B} (1 + θ n_b)^{−1}.   (6)

The conditional likelihood for α is Inv-Gamma(r, s) with shape parameter r = n·d/2 − 1 and scale s = (d/2)[tr(S) − Σ_{b∈B} θ/(1 + n_b θ) S̄_bb]. Using the prior α ∼ Inv-Gamma(r_0 · d/2, s_0 · d/2), the posterior is of the same functional form, and we can integrate out α analytically:

P_n(B|•) ∝ P_n(B|ξ, k) · det(Ψ)_(α=1)^(d/2) · [(d/2)(tr(ΨS)_(α=1) + s_0)]^(−(n+r_0)d/2),   (7)

where det(Ψ)_(α=1) = ∏_{b∈B}(1 + θn_b)^{−1} and tr(ΨS)_(α=1) = tr(S) − Σ_{b∈B} θ/(1 + n_b θ) S̄_bb. Note that the (usually unknown) degree of freedom d has the formal role of an annealing parameter, and it can indeed be used to 'cool' the Markov chain by increasing d, if desired, until a partition is 'frozen'.

Complexity analysis. In one sweep of the Gibbs sampler, we have to iteratively compute the membership probability of one object indexed by i to the k_B currently existing blocks in partition B (plus one new block), given the assignments for the n − 1 remaining ones denoted by the superscript (−i) [7, 8]. In every step of this inner loop over k_B existing blocks we have to evaluate the Wishart likelihood, i.e. trace (5) and determinant (6). Given trace tr^(−i), we update S̄_bb for k_B blocks b ∈ B, which in total needs O(n) operations. Given det^(−i), the computation of all k_B updated determinants induces costs of O(k_B). 
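The closed forms (5) and (6) can be checked against a dense computation; a minimal numpy sketch (the helper and its signature are ours):

```python
import numpy as np

def wishart_trace_logdet(S, labels, alpha, theta):
    """Evaluate tr(Psi S) via eq. (5) and log det(Psi) via eq. (6) for the
    block covariance Sigma = alpha (I + theta B), without ever forming Psi."""
    n = S.shape[0]
    tr_psi_s = np.trace(S)
    logdet_psi = -n * np.log(alpha)
    for b in np.unique(labels):
        mask = labels == b
        nb = mask.sum()
        S_bar_bb = S[np.ix_(mask, mask)].sum()           # 1_b^t S_bb 1_b
        tr_psi_s -= theta / (1.0 + nb * theta) * S_bar_bb
        logdet_psi -= np.log(1.0 + nb * theta)
    return tr_psi_s / alpha, logdet_psi
```

Comparing against `np.trace(Psi @ S)` and `np.linalg.slogdet(Psi)` with a dense `Psi = inv(alpha * (I + theta * B))` reproduces both quantities; the loop touches each matrix entry at most once, in line with the complexity analysis below.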
In total, there are n objects, so a full sweep requires O(n^2 + nk_B) operations, which is equal to O(n^2), since the maximum number of blocks is n, i.e. k_B ≤ n. Following [1], we update θ on a discretised grid of values, which adds O(k_B) to the workload, thus not changing the overall complexity of O(n^2). Compared to the original TIWD, the worst-case complexity in the Dirichlet process model with an infinite number of blocks in the population, k = ∞, is reduced from O(n^3) to O(n^2).

3 The fastTIWD Model for Partitioning Distance Matrices
Consider now the case where S is not accessible, but only squared pairwise distances D ∈ R^{n×n}:

D(i, j) = S(i, i) + S(j, j) − 2 S(i, j).   (8)

Observing one specific D does not imply a unique corresponding S, since there is a surjective mapping from a set of S-matrices to D, S(D) ↦ D. Hereby, not only do we lose information about orthogonal transformations of X, but also information about the origin of the coordinate system. If S∗ is one (any) matrix that fulfills (8) for a specific D, the set S(D) is formally defined as S = {S | S = S∗ + 1v^t + v1^t, S ⪰ 0, v ∈ R^n} [9]. The Wishart distribution, however, is not invariant against the choice of S ∈ S. In fact, if S∗ ∼ W(Σ), the distribution of a general S ∈ S is non-central Wishart, which can easily be seen as follows: S is exactly the set of inner product matrices that can be constructed by varying c ∈ R^d in a modified matrix normal model X ∼ N(M, Σ ⊗ I_d) with mean matrix M = 1_n c^t. Note that now the d columns in X are still independent, but no longer identically distributed. Note further that 'shifts' c_i do not affect pairwise distances between rows in X. 
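The translation invariance just described is easy to verify numerically: adding 1v^t + v1^t to S leaves the distances (8) unchanged. A small sketch (all names are ours):

```python
import numpy as np

def squared_distances(S):
    """D(i, j) = S(i, i) + S(j, j) - 2 S(i, j), cf. eq. (8)."""
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2.0 * S

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
S = X @ X.T
v = rng.normal(size=5)
S_shifted = S + np.outer(np.ones(5), v) + np.outer(v, np.ones(5))

# Both members of S(D) induce exactly the same distance matrix.
assert np.allclose(squared_distances(S), squared_distances(S_shifted))
```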
The modified matrix normal distribution implies that S = (1/d)XX^t is non-central Wishart, S ∼ W(Σ, Θ), with non-centrality matrix Θ := Σ^{−1}MM^t. The practical use, however, is limited by its complicated form and the fundamental problem of estimating Θ based on only one single observation S. It is thus desirable to work with a simpler probabilistic model. In principle, there are two possibilities: either the likelihood is reformulated as being constant over all S ∈ S (the approach taken in [1], called the translation-invariant Wishart distribution), or one tries to find a 'good' candidate matrix S′∗ that is 'close' to the underlying S∗ and uses the much simpler central Wishart model. Both approaches have their pros and cons: encoding the translation invariance directly in the likelihood is methodologically elegant and seems to work well in a couple of experiments (cf. [1]), but it induces high computational cost. The alternative route of searching for a good candidate S′∗ close to S∗ is complicated, because S∗ is unknown and it is not immediately clear what 'close' means. The positive aspect of this approach is the heavily reduced computational cost due to the formal simplicity of the central Wishart model. It is important to discuss the 'naive' way of finding a good candidate S′∗: subtracting the empirical column means in X, thus removing the shifts c_i. This normalisation procedure can be implemented solely based on S, leading to the well-known centering procedure in kernel PCA [10]:

S_c = Q_I S Q_I^t, with projection Q_I = I − (1/n)11^t.   (9)

Contrary to the PCA setting, however, this column normalisation induced by Q_I does not work well here, because the elements of a column vector in X are not independent. 
Rather, they are coupled via the Σ component in the covariance tensor Σ ⊗ I_d. Hereby, we not only remove the shifts c_i, but also alter the distribution: the non-centrality matrix does not vanish in general and, as a result, S_c is no longer central Wishart distributed.
In the following we present a solution to the problem of finding a candidate matrix S′∗ that recasts inference based on the translation-invariant Wishart distribution as a method to reconstruct the optimal S∗. Our proposal is guided by a particular analogy between trees and partition matrices and aims at exploiting a tree structure to guarantee low computational costs. The construction has the same functional form as (9), but uses a different projection matrix Q.

The translation-invariant Wishart distribution. Let S∗ induce pairwise distances D. Assuming that S∗ ∼ W_d(Σ), the distribution of an arbitrary member S ∈ S(D) can be derived analytically as a generalised central Wishart distribution with a rank-deficient covariance, see [2]. Its likelihood in the rank-deficient inverse covariance matrix Ψ̃ is

L(Ψ̃) ∝ det(Ψ̃)^(d/2) exp[−(d/2) tr(Ψ̃S∗)] = det(Ψ̃)^(d/2) exp[(d/4) tr(Ψ̃D)],   (10)

with Ψ̃ = Ψ − (1^tΨ1)^{−1} Ψ11^tΨ. Note that although S∗ appears in the first term in (10), the density is constant on all S ∈ S(D), meaning it can be replaced by any other member of S(D). Note further that S also contains rank-deficient matrices (like, e.g., the column-normalised S_c). By multiplying (10) with the product of nonzero eigenvalues of such a matrix raised to the power of (d − n − 1)/2, a valid generalised central Wishart distribution is obtained (see (1)), which is normalised on the manifold of positive semi-definite matrices of rank r = n − 1 with r distinct positive eigenvalues [11, 12, 13]. Unfortunately, (10) has a simple form only in Ψ̃, but not in the original Ψ, which finally leads to the O(n^3) complexity of the TIWD model.

Selecting an optimal candidate S∗. Introducing the projection matrix

Q = I − (1/(1^tΨ1)) 11^tΨ,   (11)

one can rewrite Ψ̃ in (10) as ΨQ or, equivalently, as Q^tΨQ, see [2] for details. Assume now S ∼ W_d(Σ) induces distances D and consider the transformed S∗ = QSQ^t. Note that this transformation does not change the distances, i.e. S ∈ S(D) ⇔ S∗ ∈ S(D), and that QSQ^t has rank r = n − 1 (because Q is a projection with kernel 1). Plugging our specific S∗ = QSQ^t into (10), extending the likelihood to a generalised central Wishart (1) with rank-deficient inverse covariance Ψ̃, exploiting the identity QQ = Q and using the cyclic property of the trace, we arrive at

p(QSQ^t|Ψ̃, d) ∝ det(QSQ^t)^((d−n−1)/2) det(Ψ̃)^(d/2) exp[−(d/2) tr(ΨQSQ^t)].   (12)

By treating Q as a fixed matrix, this expression can also be seen as a central Wishart in the transformed matrix S∗ = QSQ^t, parametrised by the full-rank matrix Ψ, if det(Ψ̃) is substituted by the appropriate normalisation term det(Ψ). From this viewpoint, inference using the translation-invariant Wishart distribution can be interpreted as finding a (rank-deficient) representative S∗ = QSQ^t ∈ S(D) which follows a generalised central Wishart distribution with full-rank inverse covariance matrix Ψ. For inferring Ψ, the rank deficiency of S∗ is not relevant, since only the likelihood is needed. Thus S∗ can be seen as an optimal candidate inner-product matrix in the set S(D) for a central Wishart model parametrised by Ψ.

Approximating S∗ with trees. The above selection of S∗ ∈ S(D) cannot be directly used in a constructive way, since Q in (11) depends on the unknown Ψ. If, on the other hand, we had some initial estimate of Ψ, we could find a reasonable transformation Q′ and hereby a reasonable candidate S′∗. Note that even if the estimate of Ψ is far away from the true inverse covariance, the pairwise distances are at least guaranteed not to change under Q′S(Q′)^t.
One particular estimate would be to assume that every object forms a singleton cluster, which means that our estimate of Ψ is an identity matrix. After substitution into (11) it is easily seen that this assumption results in the column-normalisation projection Q_I defined in (9). However, if we assume that there is some non-trivial cluster structure in the data, this would be a very poor approximation. The main difficulty in finding a better estimate is to not specify the number of blocks. Our construction is guided by an analogy between binary trees and weighted sums of cut matrices, which are binary complements of partition matrices with two blocks. We use a binary tree with n leaves representing n objects. It encodes a path distance matrix D_tree between those n objects, and for an optimal tree D_tree = D. 
Such an optimal tree exists only if D is additive, and the task of finding an approximation is a well-studied problem. We will not discuss the various tree reconstruction algorithms, but only mention that there exist algorithms for reconstructing the closest ultrametric tree (in the ℓ∞ norm) in O(n^2) time, [14].

Figure 1: From left to right: Unknown samples X, pairwise distances collected in D, closest tree structure and an exemplary building block.

A tree metric induced by D_tree is composed of elementary cut (pseudo-)metrics. Any such metric lies in the metric space L1 and is also a member of (L2)^2, which is the metric part of the space of squared Euclidean distance matrices D. Thus, there exists a positive (semi-)definite S_tree such that (D_tree)_ij = (S_tree)_ii + (S_tree)_jj − 2(S_tree)_ij. In fact, any matrix S_tree has a canonical decomposition into a weighted sum of 2-block partition matrices, which is constructed by cutting all edges (2n − 2 for a rooted tree) and observing the resulting classification of leaf nodes. Suppose we keep track of such an assignment with indicator 1_j induced by a single cut j; then the inner product matrix is

S_tree = Σ_{j=1}^{2n−2} λ_j (1_j 1_j^t + 1̄_j 1̄_j^t),   (13)

where λ_j is the weight of edge j to be cut and 1̄_j ↦ {0,1}^n is the complementary assignment, i.e. 1_j flipped. Each term (1_j 1_j^t + 1̄_j 1̄_j^t) is a 2-block partition matrix. We demonstrate the construction of S_tree in Fig. 2 for a small dataset of n = 25 objects sampled from S ∼ W_d(Σ) with d = 25 and Σ = α(I_n + θB) as defined in (4) with α = 2 and θ = 1. B contains 3 blocks and is depicted in the first panel. The remaining panels show the single-linkage clustering tree, all 2n − 2 = 48 weighted 2-block partition matrices, and the final S_tree (= sum of all individual 2-block matrices, rescaled to full gray-value range). 
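The decomposition (13) can be assembled directly from a list of edge cuts; a toy sketch (data structures and names are ours):

```python
import numpy as np

def tree_inner_product(cuts, weights, n):
    """Assemble S_tree = sum_j lambda_j (1_j 1_j^t + 1bar_j 1bar_j^t), eq. (13).
    `cuts[j]` lists the leaves separated from the rest when edge j is cut."""
    S = np.zeros((n, n))
    for leaves, lam in zip(cuts, weights):
        ind = np.zeros(n)
        ind[list(leaves)] = 1.0
        comp = 1.0 - ind                     # the flipped indicator 1bar_j
        S += lam * (np.outer(ind, ind) + np.outer(comp, comp))
    return S
```

Plugging the result into (8) shows that the induced squared distance between two leaves is twice the summed weights of the cuts separating them, so S_tree is indeed an inner-product representation of a (scaled) tree metric.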
Note that single-linkage fails to identify the clusters in the three branches closest to the root, but still the structure of B is clearly visible in S_tree.

Figure 2: Inner product matrix of a tree. Left to right: Partition matrix B for n = 25 objects in 3 clusters, single-linkage tree, all weighted 2-block partition matrices, final S_tree.

The idea is now to have S_tree as an estimate of Σ, and use its inverse Ψ_tree to construct Q_tree in (11), which, however, naively would involve an O(n^3) Cholesky decomposition of S_tree.

Theorem 1. The n × n matrix S∗ = Q_tree S Q_tree^t can be computed in O(n^2) time.
For the proof we need the following lemma:
Lemma 1. The product of S_tree ∈ R^{n×n} and a vector y ∈ R^n can be computed in O(n) time.
Proof. (of lemma 1) Restating (13) and defining m := 2n − 2, we have

S_tree y = Σ_{j=1}^m λ_j (1_j 1_j^t + 1̄_j 1̄_j^t) y = Σ_{j=1}^m λ_j (1_j ỹ_j + 1̄_j (Σ_{l=1}^n y_l − ỹ_j)), with ỹ_j := Σ_{l=1}^n 1_{jl} y_l.   (14)

In the next step, let us focus specifically on the ith element of the resulting vector. Furthermore, assume R_i is the set of all nodes on the branch starting from node i and leading to the tree's root:

(S_tree y)_i = (Σ_{l=1}^n y_l)(Σ_{j∉R_i} λ_j) + 2 Σ_{j∈R_i} λ_j ỹ_j − Σ_{j=1}^m λ_j ỹ_j.   (15)

Note that Σ_{l=1}^n y_l, Σ_{j=1}^m λ_j and Σ_{j=1}^m λ_j ỹ_j are constants and computed in O(n) time. For each element i, we are now left to find R_i in order to determine the remaining two terms. This can be done directly on the tree structure in two separate traversals:

1. Bottom up: Starting from the leaf nodes, store the sum of both children's y values in their parent node j (see Fig. 1, rightmost), then ascend; this yields the partial sums ỹ_j. Do the same for λ_j and compute λ_j ỹ_j.
2. Top down: Starting from the root node, recursively descend into the child nodes j and sum up λ_j and λ_j ỹ_j until reaching the leaves. This implicitly determines R_i.

It is important to stress that the above two tree traversals fully describe the complete algorithm.
Proof. (of theorem 1) First, note that only the matrix-vector product a := Ψ_tree 1 is needed in

Q_tree S Q_tree^t = (I − (1/(1^tΨ_tree1)) 11^tΨ_tree) S (I − (1/(1^tΨ_tree1)) Ψ_tree11^t)
= S − (1/1^ta) 1a^tS − (1/1^ta) S a1^t + (1/1^ta)^2 (a^tSa) 11^t.   (16)

One way of computing a = Ψ_tree 1 is to employ conjugate gradients (CG) and iteratively minimise ||S_tree a − 1||^2. Theoretically, CG is guaranteed to find the true a in O(n) iterations, each evaluating one matrix-vector product S_tree y, y ∈ R^n. Due to lemma 1, a can be computed in O(n^2) time and is used in (16) to compute S∗ = Q_tree S Q_tree^t (only matrix-vector products, so O(n^2) complexity is maintained).

4 Experiments

Synthetic examples: normal clusters. In a first experiment we investigate the performance of our method on artificial datasets generated in accordance with underlying model assumptions. A partition matrix B of size n = 200 containing k = 3 blocks is sampled, from which we construct Σ_B = α(I + θB). 
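Returning to Lemma 1, the closed form (15) holds for any family of cuts and can be checked against a dense multiplication. The sketch below (edge sets and names are ours) computes the partial sums ỹ_j directly for clarity; an O(n) implementation would obtain them via the two tree traversals:

```python
import numpy as np

def tree_matvec(cuts, lam, y):
    """Multiply S_tree (eq. (13)) by a vector y using the closed form (15).
    `cuts[j]` holds the leaves below edge j, so R_i = {j : i in cuts[j]};
    `lam` is the array of edge weights lambda_j."""
    n = len(y)
    ytil = np.array([y[list(c)].sum() for c in cuts])    # y-sum below each edge
    Y, L, LY = y.sum(), lam.sum(), (lam * ytil).sum()    # the three constants
    out = np.empty(n)
    for i in range(n):
        Ri = np.array([i in c for c in cuts])            # edges on i's root path
        out[i] = Y * (L - lam[Ri].sum()) + 2 * (lam[Ri] * ytil[Ri]).sum() - LY
    return out
```

Building the dense S_tree from the same cuts and comparing `tree_matvec(cuts, lam, y)` with `S_tree @ y` reproduces the product exactly (up to floating point), which is the identity exploited by the CG solver above.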
Then, X is drawn from N(M = 40 · 1_n 1_d^t, Σ = Σ_B ⊗ I_d) with d = 300 to generate S = (1/d)XX^t and D. The covariance parameters are set to α = 2 and θ = 15/d, which defines a rather difficult clustering problem with a hardly visible structure in D, as can be seen in the left part of Fig. 3. We compared the method to three different hierarchical clustering strategies (single-linkage, complete-linkage, Ward's method), to the standard central Wishart model using two different normalisations of S ('WD C': column normalisation using S_c = Q_I S Q_I^t and 'WD R': additional row normalisation after embedding S_c using kernel PCA) and to the original TIWD model. The experiment was repeated 200 times and the quality of the inferred clusters was measured by the adjusted Rand index w.r.t. the true labels. For the hierarchical methods we report two different performance values: splitting the tree such that the 'true' number k = 3 of clusters is obtained, and computing the best value among all possible splits into [2, n] clusters ('*.best' in the boxplot). The reader should notice that both values are in favour of the hierarchical algorithms, since neither the true k nor the true labels are used for inferring the clusters in the Wishart-type methods. From the right part of Fig. 3 we conclude that (i) both 'naive' normalisation strategies WD C and WD R are clearly outperformed by TIWD and fastTIWD ('fTIWD' in the boxplot). Significance of pairwise performance differences is measured with a nonparametric Kruskal-Wallis test with a Bonferroni-corrected post-test of Dunn's type, see the rightmost panel; (ii) the hierarchical methods have severe problems with high dimensionality and low class separation, and optimising the tree cutting does not help much. 
Even Ward\u2019s method (being perfectly suited for spherical clusters) has\nproblems; (iii) there is no signi\ufb01cant difference between TIWD and fastTIWD.\n\nFigure 3: Normal distributed toy data. Left half: Partition matrix (top), distance matrix (bottom)\nand 2D-PCA embedding of a dataset drawn from the generative model. Right half: Agreement with\n\u2018true\u2019 labels measured by the adjusted Rand index (left) and outcome of a Kruskal-Wallis/Dunn test\n(right). Black squares mean two methods are different at a \u2018family\u2019 p-value \u2264 0.05.\n\nSynthetic examples: log-normal clusters.\nIn a second toy example we explicitly violate underly-\ning model assumptions. For this purpose we sample again 3 clusters in d = 300 dimensions, but now\nuse a log-normal distribution that tends to produce a high number of \u2018atypical\u2019 samples. Note that\nsuch a distribution should not induce severe problems for hierarchical methods when optimising the\nRand index over all possible tree cuttings, since the \u2018atypical\u2019 samples are likely to form singleton\nclusters while the main structure is still visible in other branches of the tree. This should be partic-\nularly true for Ward\u2019s method, since we still have spherically shaped clusters. As for the fastTIWD\nmodel, we want to test if the prior over partitions is \ufb02exible enough to introduce additional singleton\nclusters: In the experiment, it performed at least as well as Ward\u2019s method, and clearly outperformed\nsingle- and complete-linkage. We also compared it to the af\ufb01nity-propagation method (AP), which,\nhowever, has severe problems on this dataset, even when optimising the input preference parameter\nthat affects the number of clusters in the partition.\n\nFigure 4: Log-normal distributed toy data. Left: Agreement with \u2018true\u2019 labels measured by the\nadjusted Rand index. Right: Outcome of a Kruskal-Wallis/Dunn test, analogous to Fig. 
3.\n\nSemi-supervised clustering of protein sequences. As large-scale application we present a semi-\nsupervised clustering example which is an upscaled version of an experiment with protein sequences\npresented in [1]. While traditional semi-supervised classi\ufb01ers assume at least one labelled object\nper class, our model is \ufb02exible enough to allow additional new clusters that have no counterpart\nin the subset of labelled objects. We apply this idea on two different databases, one being high\nquality due to manual annotation with a stringent review process (SwissProt) while the other contains\nautomatically annotated proteins and is not reviewed (TrEMBL). The annotations in SwissProt are\nused as supervision information resulting in a set of class labels, whereas the proteins in TrEMBL\nare treated as unlabelled objects, potentially forming new clusters. In contrast to a relatively small\nset of globin sequences in [1], we extract a total number of 12,290 (manually or automatically)\nannotated proteins to have some role in oxygen transport or binding. This set contains a richer class\nincluding, for instance, hemocyanins, hemerythrins, chlorocruorins and erythrocruorins.\nThe proteins are represented as a matrix of pairwise alignment scores. A subset of 1731 annotated se-\nquences is from SwissProt, resulting in 356 protein classes. Among the 10,559 TrEMBL sequences\n\n7\n\n\fwe could identify 23 new clusters which are dissimilar to any SwissProt proteins, see Fig. 5. Most of\nthe newly identi\ufb01ed clusters contain sequences sharing some rare and speci\ufb01c properties. In accor-\ndance with the results in [1], we \ufb01nd a large new cluster containing \ufb02avohemoglobins from speci\ufb01c\nspecies of funghi and bacteria that share a certain domain architecture composed of a globin domain\nfused with ferredoxin reductase-like FAD- and NAD-binding modules. 
An additional example is a cluster of proteins with a chemotaxis methyl-accepting receptor domain from a very special class of magnetic bacteria, which use it to orient themselves along the earth's magnetic field. The domain architecture of these proteins, involving 6 domains, is unique among all sequences in our dataset. Another cluster contains iron-sulfur cluster repair di-iron proteins that build on a polymetallic system, the di-iron center, constituted by two iron ions bridged by two sulfide ions. Such di-iron centers occur only in this new cluster.

Figure 5: Partition of all 12,290 proteins into 379 clusters: 356 predefined by sequences from SwissProt and 23 new ones formed by sequences from TrEMBL (red box).

To obtain the above results, 5000 Gibbs sweeps were conducted in a total runtime of ≈ 6 hours. Section 2 highlighted the worst-case complexity of the original TIWD, but it is also important to compare both models experimentally in a real-world scenario: we ran 100 sweeps with each of fastTIWD and TIWD and observed an average speedup by a factor of 192, which would lead to an estimated runtime of 1152 hours (≈ 50 days) for the latter model. On a side note, automatic cluster identification is a nice example of the benefits of large-scale data mining: one could, in theory, also identify special sequences by digging into various protein domain databases, but without precise prior knowledge this would hardly be feasible for ≈ 12,000 proteins.

5 Conclusion

We have presented a new model for partitioning pairwise distance data, which is motivated by the great success of the TIWD model, shares all its positive properties, and additionally reduces the computational workload from O(n^3) to O(n^2) per sweep of the Gibbs sampler. Compared to vectorial representations, pairwise distances do not convey information about translations and rotations of the underlying coordinate system.
While in the TIWD model this lack of information is handled by making the likelihood invariant to such geometric transformations, here we break this symmetry by choosing one particular inner-product representation S* and thus one particular coordinate system. The advantage is that we can then use a standard (i.e. central) Wishart distribution, for which we present an efficient Gibbs sampling algorithm.

We show that our construction principle for selecting S* among all inner-product matrices corresponding to an observed distance matrix D finds an optimal candidate if the true covariance were known. Although this is a purely theoretical guarantee, we successfully exploit it via a simple hierarchical clustering method to produce an initial covariance estimate, all without specifying the number of clusters, which is one of the model's key properties. On the algorithmic side, we prove that S* can be computed in O(n^2) time using tree traversals. Assuming that the number of Gibbs sweeps necessary is independent of n (which, of course, depends on the problem), we now have a probabilistic algorithm for partitioning distance matrices that runs in O(n^2) time. Experiments on simulated data show that the quality of the partitions found is at least comparable to that of the original TIWD. It is now possible for the first time to use the Wishart-Dirichlet process model for large matrices. Our experiment with ≈ 12,000 proteins shows that fastTIWD can be successfully used to mine large relational datasets and leads to the automatic identification of protein clusters sharing rare structural properties.
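For reference, the generic way to map a distance matrix to some inner-product matrix is classical double-centering, as used in kernel PCA [10]; the paper's S* is a different member of this equivalence class, selected via tree traversals, but every member satisfies the same consistency relation. A minimal sketch of the generic construction (illustrative only, not the fastTIWD choice of S*):

```python
import numpy as np

def double_center(D):
    """Map a distance matrix D to one inner-product matrix S consistent
    with it (classical MDS / kernel-PCA centering). Any such S satisfies
    S_ii + S_jj - 2*S_ij = D_ij**2, i.e. it reproduces the distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering projection
    return -0.5 * J @ (D ** 2) @ J

D = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
S = double_center(D)
# The original distances are recovered from S via the relation above.
```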
Assuming that in most clustering problems it is acceptable to obtain a solution within a few hours, any further increase in the size of the input matrix will become more and more a problem of memory capacity rather than of computation time.

Acknowledgments

This work has been partially supported by the FP7 EU project SIMBAD.

References

[1] J. Vogt, S. Prabhakaran, T. Fuchs, and V. Roth. The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[2] P. McCullagh and J. Yang. How Many Clusters? Bayesian Analysis, 3:101–120, 2008.
[3] Y. W. Teh. Dirichlet Processes. In Encyclopedia of Machine Learning. Springer, 2010.
[4] J. Sethuraman. A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4:639–650, 1994.
[5] B. A. Frigyik, A. Kapila, and M. R. Gupta. Introduction to the Dirichlet Distribution and Related Processes. Technical report, Department of Electrical Engineering, University of Washington, 2010.
[6] W. Ewens. The Sampling Theory of Selectively Neutral Alleles. Theoretical Population Biology, 3:87–112, 1972.
[7] D. Blei and M. Jordan. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis, 1:121–144, 2005.
[8] R. Neal. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[9] P. McCullagh. Marginal Likelihood for Distance Matrices. Statistica Sinica, 19:631–649, 2009.
[10] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319, July 1998.
[11] J. A. Diaz-Garcia, J. R. Gutierrez, and K. V. Mardia. Wishart and Pseudo-Wishart Distributions and Some Applications to Shape Theory. Journal of Multivariate Analysis, 63:73–87, 1997.
[12] H. Uhlig. On Singular Wishart and Singular Multivariate Beta Distributions. Annals of Statistics, 22:395–405, 1994.
[13] M. Srivastava. Singular Wishart and Multivariate Beta Distributions. Annals of Statistics, 31(2):1537–1560, 2003.
[14] M. Farach, S. Kannan, and T. Warnow. A Robust Model for Finding Optimal Evolutionary Trees. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing, pages 137–145, 1993.