{"title": "Clustering Aggregation as Maximum-Weight Independent Set", "book": "Advances in Neural Information Processing Systems", "page_first": 782, "page_last": 790, "abstract": "We formulate clustering aggregation as a special instance of Maximum-Weight Independent Set (MWIS) problem. For a given dataset, an attributed graph is constructed from the union of the input clusterings generated by different underlying clustering algorithms with different parameters. The vertices, which represent the distinct clusters, are weighted by an internal index measuring both cohesion and separation. The edges connect the vertices whose corresponding clusters overlap. Intuitively, an optimal aggregated clustering can be obtained by selecting an optimal subset of non-overlapping clusters partitioning the dataset together. We formalize this intuition as the MWIS problem on the attributed graph, i.e., finding the heaviest subset of mutually non-adjacent vertices.  This MWIS problem exhibits a special structure. Since the clusters of each input clustering form a partition of the dataset, the vertices corresponding to each clustering form a maximal independent set (MIS) in the attributed graph. We propose a variant of simulated annealing method that takes advantage of this special structure. Our algorithm starts from each MIS, which is close to a distinct local optimum of the MWIS problem, and utilizes a local search heuristic to explore its neighborhood in order to find the MWIS. Extensive experiments on many challenging datasets show that: 1. our approach to clustering aggregation automatically decides the optimal number of clusters; 2. it does not require any parameter tuning for the underlying clustering algorithms; 3. it can combine the advantages of different underlying clustering algorithms to achieve superior performance; 4. 
it is robust against moderate or even bad input clusterings.", "full_text": "Clustering Aggregation as\n\nMaximum-Weight Independent Set\n\nNan Li\n\nLongin Jan Latecki\n\nDepartment of Computer and Information Sciences\n\nTemple University, Philadelphia, USA\n{nan.li,latecki}@temple.edu\n\nAbstract\n\nWe formulate clustering aggregation as a special instance of Maximum-Weight\nIndependent Set (MWIS) problem. For a given dataset, an attributed graph is con-\nstructed from the union of the input clusterings generated by different underlying\nclustering algorithms with different parameters. The vertices, which represent the\ndistinct clusters, are weighted by an internal index measuring both cohesion and\nseparation. The edges connect the vertices whose corresponding clusters over-\nlap. Intuitively, an optimal aggregated clustering can be obtained by selecting an\noptimal subset of non-overlapping clusters partitioning the dataset together. We\nformalize this intuition as the MWIS problem on the attributed graph, i.e., \ufb01nding\nthe heaviest subset of mutually non-adjacent vertices.\nThis MWIS problem exhibits a special structure. Since the clusters of each in-\nput clustering form a partition of the dataset, the vertices corresponding to each\nclustering form a maximal independent set (MIS) in the attributed graph. We pro-\npose a variant of simulated annealing method that takes advantage of this special\nstructure. Our algorithm starts from each MIS, which is close to a distinct local\noptimum of the MWIS problem, and utilizes a local search heuristic to explore its\nneighborhood in order to \ufb01nd the MWIS. Extensive experiments on many chal-\nlenging datasets show that: 1. our approach to clustering aggregation automati-\ncally decides the optimal number of clusters; 2. it does not require any parameter\ntuning for the underlying clustering algorithms; 3. 
it can combine the advantages\nof different underlying clustering algorithms to achieve superior performance; 4.\nit is robust against moderate or even bad input clusterings.\n\n1 Introduction\n\nClustering is a fundamental problem in data analysis, and has extensive applications in statistics, data\nmining, computer vision and even in social sciences. The goal is to partition the data objects into\na set of groups (clusters) such that objects in the same group are similar, while objects in different\ngroups are dissimilar.\nIn the past two decades, many different clustering algorithms have been developed. Some popular\nones include K-means, DBSCAN, Ward\u2019s algorithm and EM-clustering. However, each of the known\nclustering algorithms has potential shortcomings. For instance, K-means [7]\nand its variations have difficulty detecting \u201cnatural\u201d clusters that have non-spherical shapes\nor widely different sizes or densities. Furthermore, in order to achieve good performance, they\nrequire an appropriate number of clusters as the input parameter, which is usually very hard to\nspecify. DBSCAN [8], a density-based clustering algorithm, can detect clusters of arbitrary shapes\nand sizes. However, it has trouble with data which have widely varying densities. Also, DBSCAN\nrequires two input parameters specified by the user: the radius, Eps, to define the neighborhood of\neach data object, and the minimum number, minPts, of data objects required to form a cluster.\n\nConsensus clustering, also called clustering aggregation or clustering ensemble, refers to a family of\nmethods that try to find a single (consensus) superior clustering from a number of input clusterings\nobtained by different algorithms with different parameters. The basic motivation of these\nmethods is to combine the advantages of different clustering algorithms and overcome their respective\nshortcomings. 
Besides generating stable and robust clusterings, consensus clustering methods\ncan be applied in many other scenarios, such as categorical data clustering and \u201cprivacy-preserving\u201d\nclustering. Some representative methods include [1, 2, 9, 11, 12, 13, 14]. [2] formulates\nclustering ensemble as a combinatorial optimization problem in terms of shared mutual information.\nThat is, the relationship between each pair of data objects is measured based on their cluster labels\nfrom the multiple input clusterings, rather than the original features. Then a graph representation is\nconstructed according to these relationships, and finding a single consolidated clustering is reduced\nto a graph partitioning problem. Similarly, in [1], a number of deterministic approximation algorithms\nare proposed to find an \u201caggregated\u201d clustering which agrees as much as possible with the\ninput clusterings. [9] also applies a similar idea to combine multiple runs of the K-means algorithm.\n[11] proposes to capture the notion of agreement using a measure based on a 2D string encoding.\nThey derive a nonlinear optimization model to maximize the new agreement measure and transform\nit into a strict 0-1 Semidefinite Program. [12] presents three iterative EM-like algorithms for the\nconsensus clustering problem.\nA common feature of these consensus clustering methods is that they usually do not have access to the\noriginal features of the data objects. They utilize the cluster labels in different input clusterings as\nthe new features of each data object to find an optimal clustering. Consequently, the success of these\nconsensus clustering methods relies heavily on the premise that the majority of the input clusterings\nare reasonably good and consistent, which is often not the case in practice. For example, given a new\nchallenging dataset, it is probable that only a few of the chosen underlying clustering algorithms\ncan generate good clusterings. 
Many moderate or even bad input clusterings can mislead the final\n\u201cconsensus\u201d. Furthermore, even if we choose the appropriate underlying clustering algorithms, in\norder to obtain good input clusterings, we still have to specify the appropriate input parameters.\nTherefore, it is desirable to devise new consensus clustering methods which are more robust and do\nnot need the optimal input parameters to be specified.\nIn this paper, our definition of \u201cclustering aggregation\u201d is different. Informally, for each of the\nclusters in the input clusterings, we evaluate its quality with some internal indices measuring both\nthe cohesion and separation. Then we select an optimal subset of clusters, which partition the\ndataset together and have the best overall quality, as the \u201caggregated clustering\u201d. (We give a formal\nstatement of our \u201cclustering aggregation\u201d problem in Sec. 2.) In this framework, ideally, we can\nfind the optimal \u201caggregated clustering\u201d even if only a minority of the input clusterings are good\nenough. Therefore, we only need to specify an appropriate range of the input parameters, rather\nthan the optimal values, for the underlying clustering algorithms.\nWe formulate this \u201cclustering aggregation\u201d problem as a special instance of the Maximum-Weight\nIndependent Set (MWIS) problem. An attributed graph is constructed from the union of the input\nclusterings. The vertices, which represent the distinct clusters, are weighted by an internal index\nmeasuring both cohesion and separation. The edges connect the vertices whose corresponding clusters\noverlap (in practice, we may tolerate a relatively small amount of overlap for robustness). Then\nselecting an optimal subset of non-overlapping clusters partitioning the dataset together can be\nformulated as seeking the MWIS of the attributed graph, which is the heaviest subset of mutually\nnon-adjacent vertices. 
Moreover, this MWIS problem exhibits a special structure. Since the clusters\nof each input clustering form a partition of the dataset, the vertices corresponding to each clustering\nform a maximal independent set (MIS) in the attributed graph.\nThe most important source of motivation for our work is [3]. In [3], image segmentation is formulated\nas an MWIS problem. Specifically, given an image, they first segment it with different bottom-up\nsegmentation schemes to get an ensemble of distinct superpixels. Then they select a subset of the\nmost \u201cmeaningful\u201d non-overlapping superpixels to partition the image. This selection procedure is\nformulated as solving an MWIS problem. In this respect, our work is very similar to [3]. The only\ndifference is that our work applies the MWIS formulation to a more general problem, clustering\naggregation.\nThe MWIS problem is known to be NP-hard. Many heuristic approaches have been proposed to find\napproximate solutions. As we mentioned before, in the context of clustering aggregation, the formulated\nMWIS problem exhibits a special structure. That is, the vertices corresponding to each clustering\nform a maximal independent set (MIS) in the attributed graph. This special structure is valuable\nfor finding good approximations to the MWIS because, although these MISs may not be the global\noptimum of the MWIS, they are close to distinct local optima. We propose a variant of the simulated\nannealing method that takes advantage of this special structure. Our algorithm starts from each\nMIS and utilizes a local search heuristic to explore its neighborhood in order to find better\napproximations to the MWIS. The best solution found in this process is returned as the final approximate\nMWIS. 
Since the exploration for each MIS is independent, our algorithm is suitable for parallel\ncomputation.\nFinally, since the selected clusters may not be able to cover the entire dataset, our approach performs\na post-processing step to assign the missing data objects to their nearest clusters.\nExtensive experiments on many challenging datasets show that: 1. our approach to clustering\naggregation automatically decides the optimal number of clusters; 2. it does not require any parameter\ntuning for the underlying clustering algorithms; 3. it can combine the advantages of different\nunderlying clustering algorithms to achieve superior performance; 4. it is robust against moderate or even\nbad input clusterings.\nPaper Organization. In Sec. 2, we present the formal statement of the clustering aggregation\nproblem and its formulation as a special instance of the MWIS problem. In Sec. 3, we present our algorithm.\nThe experimental evaluations and conclusion are given in Sec. 4 and Sec. 5 respectively.\n\n2 MWIS Formulation of Clustering Aggregation\nConsider a set of $n$ data objects $D = \\{d_1, d_2, \\ldots, d_n\\}$. A clustering $C_i$ of $D$ is obtained by applying\nan exclusive clustering algorithm with a specific set of input parameters on $D$. The disjoint clusters\n$c_{i1}, c_{i2}, \\ldots, c_{ik}$ of $C_i$ are a partition of $D$, i.e., $\\bigcup_{j=1}^{k} c_{ij} = D$ and $c_{ip} \\cap c_{iq} = \\emptyset$ for all $p \\neq q$.\nWith different clustering algorithms and different parameters, we can obtain a set of $m$ different\nclusterings of $D$: $C_1, C_2, \\ldots, C_m$. For each cluster $c_{ij}$ in the union of these $m$ clusterings, we\nevaluate its quality with an internal index measuring both cohesion and separation.\nWe use the average silhouette coefficient of a cluster as such an internal index in this paper. The\nsilhouette coefficient is defined for an individual data object. It is a measure of how similar that data\nobject is to data objects in its own cluster compared to data objects in other clusters. 
Formally, the\nsilhouette coefficient for the $t$th data object, $S_t$, is defined as\n\n$$S_t = \\frac{b_t - a_t}{\\max(a_t, b_t)} \\tag{1}$$\n\nwhere $a_t$ is the average distance from the $t$th data object to the other data objects in the same cluster\nas $t$, and $b_t$ is the minimum average distance from the $t$th data object to data objects in a different\ncluster, minimized over clusters.\nThe silhouette coefficient ranges from -1 to +1 and a positive value is desirable. The quality of a\nparticular cluster $c_{ij}$ can be evaluated with the average of the silhouette coefficients of the data objects\nbelonging to it:\n\n$$ASC_{c_{ij}} = \\frac{\\sum_{t \\in c_{ij}} S_t}{|c_{ij}|} \\tag{2}$$\n\nwhere $S_t$ is the silhouette coefficient of the $t$th data object in cluster $c_{ij}$, and $|c_{ij}|$ is the cardinality of\ncluster $c_{ij}$.\nWe select an optimal subset of non-overlapping clusters from the union of all the clusterings, which\npartition the dataset together and have the best overall quality, as the \u201caggregated clustering\u201d. The\nselection of clusters is formulated as a special instance of the Maximum-Weight Independent Set\n(MWIS) problem.\nFormally, consider an undirected and weighted graph $G = (V, E)$, where $V = \\{1, 2, \\ldots, n\\}$ is\nthe vertex set and $E \\subseteq V \\times V$ is the edge set. For each vertex $i \\in V$, a positive weight $w_i$ is\nassociated with $i$. $A = (a_{ij})_{n \\times n}$ is the adjacency matrix of $G$, where $a_{ij} = 1$ if $(i, j) \\in E$ is an\nedge of $G$, and $a_{ij} = 0$ if $(i, j) \\notin E$. A subset of $V$ can be represented by an indicator vector\n$x = (x_i) \\in \\{0, 1\\}^n$, where $x_i = 1$ means that $i$ is in the subset, and $x_i = 0$ means that $i$ is not in the\nsubset. An independent set is a subset of $V$ whose elements are pairwise nonadjacent. Then finding\na maximum-weight independent set, denoted as $x^*$, can be posed as the following:\n\n$$x^* = \\arg\\max_x w^T x, \\quad \\text{s.t.} \\quad \\forall i \\in V: x_i \\in \\{0, 1\\}, \\quad x^T A x = 0 \\tag{3}$$\n\nThe weight $w_i$ on vertex $i$ is defined as:\n\n$$w_i = ASC_{c_i} \\times |c_i| \\tag{4}$$\n\nwhere $c_i$ is the cluster represented by vertex $i$, and $ASC_{c_i}$ and $|c_i|$ are its quality measure and cardinality\nrespectively.\nOur problem (3) is a special instance of the MWIS problem, since graph $G$ exhibits an additional\nstructure, which we will utilize in the proposed algorithm. The vertex set $V$ can be partitioned into\ndisjoint subsets $P = \\{P_1, P_2, \\ldots, P_m\\}$, where $P_i$ corresponds to the clustering $C_i$, such that each $P_i$\nis also a maximal independent set (MIS), which means it is not a subset of any other independent\nset. This follows from the fact that each clustering $C_i$ is a partition of the dataset $D$. Formally,\n\n$$\\bigcup_{i=1}^{m} P_i = V, \\quad P_i \\cap P_j = \\emptyset, \\quad i \\neq j, \\quad \\text{and } P_i \\text{ is an MIS}, \\quad \\forall i, j \\in \\{1, 2, \\ldots, m\\} \\tag{5}$$\n\n3 Our Algorithm\n\nThe basic idea of our algorithm is to explore the neighborhood of each known MIS $P_i$ independently\nwith a local search heuristic in order to find better solutions. The proposed algorithm is an instance\nof simulated annealing methods [10] with multiple initializations.\nOur algorithm starts with a particular MIS $P_i$, denoted by $x_0$. $x_{t+1}$, which is a neighbor of $x_t$, is\nobtained by replacing some lower-weight vertices in $x_t$ with higher-weight vertices under the\nconstraint of always being an independent set. Specifically, we first reduce $x_t$ to $x'_t$ by removing a proportion\n$q$ of lower-weight vertices. Here we remove a proportion, rather than a fixed number, of vertices in\norder to make the reduction adaptive with respect to the number $s$ of vertices in $x_t$. In practice, we\nuse ceil($s \\times q$) to make sure at least one vertex will be removed. Note that this step is probabilistic,\nrather than deterministic. 
The probability that a vertex $i$ will be retained is proportional to its $WD$\nvalue, which is defined as follows.\n\n$$WD_i = \\frac{w_i}{\\sum_{j \\in N_i} w_j} \\tag{6}$$\n\nwhere $N_i$ is the set of vertices which are connected with vertex $i$ in $G$.\nIntuitively, a larger $WD$ value indicates larger weight, less conflict with other vertices or both.\nTherefore, the obtained $x'_t$ is likely to contain vertices with large weights and have large potential room\nfor improvement. The proportion parameter $q$ is used to control the \u201cradius\u201d of the neighborhood\nto be explored.\nThen our algorithm iteratively improves $x'_t$ by adding compatible vertices one by one. In each\niteration, it first identifies all the vertices compatible with the existing ones in the current $x'_t$, called\ncandidates. Then a \u201clocal\u201d measure $WD'$ is calculated to evaluate each of these candidates:\n\n$$WD'_i = \\frac{w_i}{\\sum_{j \\in N'_i} w_j} \\tag{7}$$\n\nwhere $N'_i$ is the set of candidate vertices which are connected with vertex $i$.\nA large value of $WD'_i$ indicates that candidate $i$ either can bring a large improvement this time\n(numerator) or has small conflict with further improvements (denominator) or both.\nThe candidate with the largest $WD'$ value is added to $x'_t$. In the next iteration, this new $x'_t$ will be\nfurther improved. This iterative procedure continues until $x'_t$ cannot be further improved. We obtain\n$x'_t$ as a randomized neighbor of $x_t$.\n\nAlgorithm 1:\nInput: Graph $G$, weights $w$, adjacency matrix $A$, the known MISs $P = \\{P_1, P_2, \\ldots, P_m\\}$\nOutput: An approximate solution to MWIS\n 1 Calculate $WD$ for each vertex;\n 2 for each MIS $P_i$ do\n 3   Initialize $x_0$ with $P_i$;\n 4   for $t = 1, 2, \\ldots, n$ do\n 5     Reduce $x_t$ to $x'_t$ probabilistically by removing a proportion $q$ of vertices with relatively lower $WD$ values;\n 6     repeat\n 7       Identify candidate vertices compatible with current $x'_t$;\n 8       Calculate $WD'$ for each candidate;\n 9       Update $x'_t$ by adding the candidate with the largest $WD'$;\n10     until $x'_t$ cannot be further improved;\n11     Calculate $\\alpha = \\min[1, e^{(W(x'_t) - W(x_t))/\\beta^t}]$;\n12     Update $x_{t+1}$ as $x'_t$ with probability $\\alpha$, otherwise $x_{t+1} = x_t$;\n13   end\n14 end\n15 return the best solution found in the process;\n\nNow our algorithm calculates the acceptance ratio $\\alpha = e^{(W(x'_t) - W(x_t))/\\beta^t}$, where $W(x) = w^T x$;\n$0 < \\beta < 1$ is a constant which is usually picked to be close to 1. If $\\alpha \\geq 1$, then $x'_t$ is accepted as\n$x_{t+1}$. Otherwise, it is accepted with probability $\\alpha$.\nThis exploration starting from $P_i$ continues for a number of iterations, or until $x_t$ converges. The\nbest solution encountered in this process is recorded. After exploring the neighborhood for all the\nknown MISs, the best solution is returned. A formal description can be found in Algorithm 1.\nOur algorithm is essentially a variant of the simulated annealing method [10], since the maximization of\n$W(x) = w^T x$ is equivalent to the minimization of the energy function $E(x) = -W(x) = -w^T x$.\nLines 5 to 10 in Alg. 1 define a randomized \u201cmoving\u201d procedure of making a transition from $x_t$ to its\nneighbor $x'_t$. When calculating the acceptance ratio $\\alpha = e^{(W(x'_t) - W(x_t))/\\beta^t}$, suppose $T_0 = 1$ (initial\ntemperature), then it is equivalent to $\\alpha = e^{-(W(x_t) - W(x'_t))/\\beta^t} = e^{-(E(x'_t) - E(x_t))/\\beta^t}$. Hence\nAlgorithm 1 is a variant of simulated annealing. Therefore, our algorithm converges in theory.\nIn practice, the convergence of our algorithm is fast. In all the experiments presented in the next section,\nour algorithm converges in less than 100 iterations. The reason is that our algorithm takes advantage\nof the fact that the known MISs are close to distinct local maxima. Also, the local search heuristic of our\nalgorithm is effective at finding better candidates in the neighborhood.\nThe parameter $q$ controls the \u201cradius\u201d of the neighborhood to be explored in each iteration. A small\n$q$ means a small \u201cradius\u201d and results in more iterations to converge. On the other hand, using a large $q$\nwill take less advantage of the known MISs. Unstable exploration also results in more iterations to\nconverge.\nSince our algorithm explores the neighborhood of each known MIS independently, its efficiency can\nbe further improved by using parallel computation.\n\n4 Results\n\nWe evaluate the performance of our approach with three experiments. In these experiments, for the\nunderlying clustering algorithms, including K-means, single linkage, complete linkage and Ward\u2019s\nclustering, we use the implementations in MATLAB. Unless specified explicitly, the parameters\nare MATLAB\u2019s defaults. For example, when using K-means, we only specify the number K of\ndesired clusters. The default \u201cSquared Euclidean distance\u201d is used as the distance measure. When\ncalculating silhouette coefficients, we use MATLAB\u2019s function \u201csilhouette(X,clust)\u201d and the default\nmetric \u201cSquared Euclidean distance\u201d. 
For robustness in our experiments, we tolerate slight overlap\nbetween clusters. That is, for the adjacency matrix $A = (a_{ij})_{n \\times n}$, $a_{ij} = 1$ if\n$\\frac{|c_i \\cap c_j|}{\\min(|c_i|, |c_j|)} > 0.1$, and\n$a_{ij} = 0$ otherwise. In these experiments, the parameters of our local search algorithm are: q = 0.3;\n\u03b2 = 0.999; iteration number n = 100. We test different combinations of q = 0.1 : 0.1 : 0.5 and\nn = 100 : 100 : 1000. The results are almost the same.\nIn the first experiment, we evaluate our approach\u2019s ability to achieve good performance without\nspecifying the optimal input parameters for the underlying clustering algorithms. We use the dataset\nfrom [6]. This dataset consists of 4 subsets (S1, S2, S3, S4) of synthetic 2-d data points. Each subset\ncontains 5000 vectors in 15 Gaussian clusters, but with a different degree of cluster overlap. We\nchoose K-means as the underlying clustering algorithm and vary the parameter K = 5 : 1 : 25,\nwhich is the desired number of clusters. Since different runs of K-means starting from random\ninitialization of centroids typically produce different clustering results, we run K-means 5 times for\neach value of K. That is, there are a total of 21 \u00d7 5 = 105 different input clusterings. Note that,\nin order to show the performance of our approach clearly, we do not perform the post-processing of\nassigning the missing data points to their nearest clusters.\n\nFigure 1: Clustering aggregation without parameter tuning. (top row) Original data. (bottom row)\nClustering results of our approach. Best viewed in color.\n\nAs shown in Fig. 1, on each of the four subsets, the aggregated clustering obtained by our approach\nhas the correct number (15) of clusters and near-perfect structure. Only a very small portion of\ndata points is not assigned to any cluster. 
These results confirm that our approach can automatically\ndecide the optimal number of clusters without any parameter tuning for the underlying clustering\nalgorithms.\nIn the second experiment, we evaluate our approach\u2019s ability to combine the advantages of different\nunderlying clustering algorithms and cancel out the errors introduced by them. The dataset is\nfrom [1]. As shown in the fifth panel of Fig. 2, this synthetic dataset consists of 7 distinct groups of\n2-d data points, which have significantly different shapes and sizes. There are also some \u201cbridges\u201d\nbetween different groups of data points. Consequently, this dataset is very challenging for any single\nclustering algorithm. In this experiment, we use four different underlying clustering algorithms\nimplemented in MATLAB: single linkage, complete linkage, Ward\u2019s clustering and K-means. The first\ntwo are both agglomerative bottom-up algorithms. The only difference between them is that when\nmerging pairs of clusters, single linkage is based on the minimum distance, while complete linkage\nis based on the maximum distance. The third one, Ward\u2019s clustering algorithm, is also an agglomerative\nbottom-up algorithm. In each merging step, it chooses the pair of clusters which minimizes the sum\nof the squared distances from each point to the mean of the two clusters. The fourth algorithm is\nK-means.\n\n[Figure 1 panels: S1, S2, S3, S4 (top row); Our S1, Our S2, Our S3, Our S4 (bottom row).]\n\nFor each of the underlying clustering algorithms, we vary the input parameter of desired number of\nclusters as 4 : 1 : 10. 
That is, we have a total of 7 \u00d7 4 = 28 input clusterings.\nNote that, unlike [1], we do not use the average linkage clustering algorithm, because by specifying\nthe correct number of clusters, it can generate near-perfect clustering by itself. We abandon the\nbest algorithm here in order to show the performance of our approach clearly. But, in practice,\nutilizing good underlying clustering algorithms can significantly increase the chance for our\napproach to obtain superior aggregated clusterings. As in experiment 1, we do not perform the\npost-processing in this experiment.\n\nFigure 2: Clustering aggregation on four different input clusterings. Best viewed in color.\n\nIn the first four panels of Fig. 2, we show the clustering results obtained by the four underlying\nclustering algorithms with the number of clusters set to 7. Obviously, even with the optimal input\nparameters, the results of these algorithms are far from being correct. The ground truth and the result\nof our approach are shown in the fifth and sixth panels, respectively. As we can see, our aggregated\nclustering is almost perfect, except for the three green data points in the \u201cbridge\u201d between the cyan\nand green \u201cballs\u201d. These results confirm that our approach can effectively combine the advantages\nof different clustering algorithms and cancel out the errors introduced by them. Also, in contrast to\nthe other consensus clustering algorithms, such as [1], our aggregated clustering is obtained without\nspecifying the optimal input parameters for any of the underlying clustering algorithms. 
This is a\nvery desirable feature in practice.\nIn the third experiment, we compare our approach with some other popular consensus clustering\nalgorithms, including Cluster-based Similarity Partitioning Algorithm (CSPA) [2], HyperGraph\nPartitioning Algorithm (HGPA) [2], Meta-Clustering Algorithm (MCLA) [2], the Furthest (Furth)\nalgorithm [1], the Agglomerative (Agglo) [1] algorithm and the Balls (Balls) algorithm [1].\nThe performance is evaluated on three datasets: 8D5K [2], Iris [4] and Pen-Based Recognition of\nHandwritten Digits (PENDIG) [5]. 8D5K is an artificial dataset. It contains 1000 points from five\nmultivariate Gaussian distributions (200 points each) in 8D space. Iris is a real dataset. It consists\nof 150 instances of three classes (50 each). There are four numeric attributes for each instance.\nPENDIG is also a real dataset. It contains a total of 7494 + 3498 = 10992 instances in 10 classes.\nEach instance has 16 integer attributes.\n\n[Figure 2 panels: Single Linkage, Complete Linkage, Ward\u2019s clustering, K-means, Original data, Our result.]\n\nFor our approach and all those consensus clustering algorithms, we choose K-means and Ward\u2019s\nalgorithm as the underlying clustering algorithms. The multiple clusterings for each dataset are\nobtained by varying the desired number of clusters for both K-means and Ward\u2019s algorithm.\nSpecifically, for the test on 8D5K, we set the desired numbers of clusters as 3:1:7. Consequently, there\nare 5 \u00d7 2 = 10 different input clusterings. For Iris and PENDIG, the numbers are 3:1:7 and 8:1:12\nrespectively. 
So there are also 10 different input clusterings for each of them.\nIn this paper, we use the Jaccard coefficient to measure the quality of clusterings:\n\n$$\\text{Jaccard Coefficient} = \\frac{f_{11}}{f_{01} + f_{10} + f_{11}} \\tag{8}$$\n\nwhere $f_{11}$ is the number of object pairs which are in the same class and in the same cluster; $f_{01}$\nis the number of object pairs which are in different classes but in the same cluster; $f_{10}$ is the number\nof object pairs which are in the same class but in different clusters.\n\nFigure 3: Results of comparative experiments on different datasets. Best viewed in color.\n\nAs shown in Fig. 3, the performance of our approach is better than that of the other consensus\nclustering algorithms. The main reason is that, with a range of different input parameters, most\nclusterings generated by the underlying clustering algorithms are not good enough. A \u201cconsensus\u201d\nbased on these moderate or even bad input clusterings, with far fewer good ones, cannot be good.\nIn contrast, by selecting an optimal subset of the clusters, our approach can still achieve superior\nperformance as long as there are good clusters in the input clusterings. Therefore, our approach is\nmuch more robust, as confirmed by the results of this experiment.\n\n5 Conclusion\n\nThe contribution of this paper is twofold: 1. We formulate clustering aggregation as an MWIS\nproblem with a special structure. 2. We propose a novel variant of the simulated annealing method, which\ntakes advantage of the special structure, for solving this special MWIS problem. Experimental\nresults confirm that: 1. our approach to clustering aggregation automatically decides the optimal\nnumber of clusters; 2. it does not require any parameter tuning for the underlying clustering\nalgorithms; 3. it can combine the advantages of different underlying clustering algorithms to achieve\nsuperior performance; 4. 
it is robust against moderate or even bad input clusterings.\n\nAcknowledgments\n\nThis work was supported by US Department of Energy Award 71498-001-09 and by US National\nScience Foundation Grants IIS-0812118, BCS-0924164, OIA-1027897.\n\nReferences\n\n[1] Gionis, A. & Mannila, H. & Tsaparas, P. (2005) \u201cClustering aggregation\u201d. Proceedings of the 21st ICDE\n[2] Strehl, A. & Ghosh, J. (2003) \u201cCluster ensembles\u2014a knowledge reuse framework for combining multiple\npartitions\u201d. The Journal of Machine Learning Research (3):583-617.\n[3] Brendel, W. & Todorovic, S. (2010) \u201cSegmentation as maximum-weight independent set\u201d. Neural\nInformation Processing Systems\n[4] Fisher, R.A. (1936) \u201cThe use of multiple measurements in taxonomic problems\u201d. Annals of Eugenics (7) Part\nII: 179-188\n[5] Alimoglu, F. & Alpaydin, E. (1996) \u201cMethods of Combining Multiple Classifiers Based on Different\nRepresentations for Pen-based Handwriting Recognition\u201d. Proceedings of the Fifth Turkish Artificial Intelligence\nand Artificial Neural Networks Symposium (TAINN 96)\n[6] Franti, P. & Virmajoki, O. (2006) \u201cIterative shrinking method for clustering problems\u201d. Pattern Recognition\n39 (5), 761-765\n[7] Lloyd, S. P. (1982) \u201cLeast squares quantization in PCM\u201d. IEEE Transactions on Information Theory 28 (2):\n129-137\n[8] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu (1996) \u201cA density-based algorithm for\ndiscovering clusters in large spatial databases with noise\u201d. Proceedings of the Second International Conference on\nKnowledge Discovery and Data Mining (KDD-96)\n[9] Fred, A.L.N. & Jain, A.K. (2002) \u201cData clustering using evidence accumulation\u201d. Proceedings of the\nInternational Conference on Pattern Recognition (ICPR) 276-280\n[10] Kirkpatrick, S. & Gelatt, C. D. & Vecchi, M. P. (1983). \u201cOptimization by Simulated Annealing\u201d. 
Science\n220 (4598): 671\u2013680\n[11] Vikas Singh & Lopamudra Mukherjee & Jiming Peng & Jinhui Xu (2008) \u201cEnsemble Clustering using\nSemidefinite Programming\u201d. Advances in Neural Information Processing Systems 20: 1353\u20131360\n[12] Nguyen, N. & Caruana, R. (2007) \u201cConsensus clusterings\u201d. IEEE International Conference on Data\nMining ICDM 2007 607\u2013612\n[13] X. Z. Fern & C. E. Brodley (2004) \u201cSolving cluster ensemble problems by bipartite graph partitioning\u201d.\nProc. of International Conference on Machine Learning page 36\n[14] Topchy, A. & Jain, A.K. & Punch, W. (2003) \u201cCombining multiple weak clusterings\u201d. IEEE International\nConference on Data Mining, ICDM 2003 331\u2013338\n", "award": [], "sourceid": 348, "authors": [{"given_name": "Nan", "family_name": "Li", "institution": null}, {"given_name": "Longin", "family_name": "Latecki", "institution": null}]}