{"title": "Ensemble Clustering using Semidefinite Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1353, "page_last": 1360, "abstract": null, "full_text": "Ensemble Clustering using Semide\ufb01nite\n\nProgramming\n\nVikas Singh\n\nBiostatistics and Medical Informatics\nUniversity of Wisconsin \u2013 Madison\nvsingh @ biostat.wisc.edu\n\nJiming Peng\n\nIndustrial and Enterprise System Engineering\nUniversity of Illinois at Urbana-Champaign\n\nLopamudra Mukherjee\n\nComputer Science and Engineering\n\nState University of New York at Buffalo\n\nlm37 @ cse.buffalo.edu\n\nJinhui Xu\n\nComputer Science and Engineering\n\nState University of New York at Buffalo\n\npengj @ uiuc.edu\n\njinhui @ cse.buffalo.edu\n\nAbstract\n\nWe consider the ensemble clustering problem where the task is to \u2018aggregate\u2019\nmultiple clustering solutions into a single consolidated clustering that maximizes\nthe shared information among given clustering solutions. We obtain several new\nresults for this problem. First, we note that the notion of agreement under such\ncircumstances can be better captured using an agreement measure based on a 2D\nstring encoding rather than voting strategy based methods proposed in literature.\nUsing this generalization, we \ufb01rst derive a nonlinear optimization model to max-\nimize the new agreement measure. We then show that our optimization problem\ncan be transformed into a strict 0-1 Semide\ufb01nite Program (SDP) via novel con-\nvexi\ufb01cation techniques which can subsequently be relaxed to a polynomial time\nsolvable SDP. Our experiments indicate improvements not only in terms of the\nproposed agreement measure but also the existing agreement measures based on\nvoting strategies. 
We discuss evaluations on clustering and image segmentation databases.

1 Introduction

In the so-called Ensemble Clustering problem, the goal is to 'combine' multiple clustering solutions or partitions of a set into a single consolidated clustering that maximizes the information shared (or 'agreement') among all available clustering solutions. The need for this form of clustering arises in many applications, especially real-world scenarios with a high degree of uncertainty such as image segmentation with poor signal-to-noise ratio and computer-assisted disease diagnosis. It is quite common that a single clustering algorithm does not yield satisfactory results, while multiple algorithms may individually make imperfect choices, assigning some elements to the wrong clusters. By considering the results of several different clustering algorithms together, one can often mitigate degeneracies in the individual solutions and consequently obtain better solutions. The idea has been employed successfully for microarray data classification analysis [1], computer-assisted diagnosis of diseases [2], and a number of other applications [3].

Formally, given a data set D = {d1, d2, . . . , dn}, a set of clustering solutions C = {C1, C2, . . . , Cm} obtained from m different clustering algorithms is called a cluster ensemble. Each solution, Ci, is a partition of the data into at most k different clusters. The Ensemble Clustering problem requires one to use the individual solutions in C to partition D into k clusters such that the information shared ('agreement') among the solutions of C is maximized.

1.1 Previous works

The Ensemble Clustering problem was recently introduced by Strehl and Ghosh [3]; in [4], a related notion of correlation clustering was independently proposed by Bansal, Blum, and Chawla.
The problem has attracted a fair amount of attention, and a number of interesting techniques have been proposed [3, 2, 5, 6]; also see [7, 4]. Formulations primarily differ in how the objective of shared-information maximization, or agreement, is chosen; we review some of the popular techniques next.

The Instance Based Graph Formulation (IBGF) [2, 5] first constructs a fully connected graph G = (V, W) for the ensemble C = (C1, . . . , Cm), where each node represents an element of D = {d1, . . . , dn}. The edge weight wij for the pair (di, dj) is defined as the number of algorithms in C that assign the nodes di and dj to the same cluster (i.e., wij measures the togetherness frequency of di and dj). Then, standard graph partitioning techniques are used to obtain a final clustering solution.

In the Cluster Based Graph Formulation (CBGF), a given cluster ensemble is represented as C = {C11, . . . , Cmk} = {C̄1, . . . , C̄m·k}, where Cij denotes the jth cluster of the ith algorithm in C. Like IBGF, this approach constructs a graph, G = (V, W), to model the correspondence (or 'similarity') relationship among the mk clusters, where the similarity matrix W reflects the Jaccard similarity between clusters C̄i and C̄j. The graph is then partitioned so that the clusters within the same group are similar to one another.

Variants of the problem have also received considerable attention in the theoretical computer science and machine learning communities. A recent paper by Ailon, Charikar, and Newman [7] demonstrated connections to other well-known problems such as Rank Aggregation; their algorithm is simple and obtains an expected constant-factor approximation guarantee (via linear programming duality). In addition to [7], other results include [4, 8].

A commonality of existing algorithms for Ensemble Clustering [3, 2, 9] is that they employ a graph construction as a first step.
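As a concrete illustration, both edge-weight constructions above can be computed in a few lines. The following sketch (the function names and the toy ensemble are ours, not from the paper) builds the IBGF co-association weights and the CBGF Jaccard similarity:

```python
import numpy as np

def coassociation_matrix(labelings):
    """IBGF edge weights: w_ij = number of algorithms assigning items i and j
    to the same cluster (the 'togetherness frequency')."""
    labelings = np.asarray(labelings)  # shape (m, n): m algorithms, n items
    n = labelings.shape[1]
    W = np.zeros((n, n), dtype=int)
    for labels in labelings:
        # Outer comparison marks every co-clustered pair under this algorithm.
        W += (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(W, 0)  # self-similarity is not an edge
    return W

def jaccard(cluster_a, cluster_b):
    """CBGF edge weights: Jaccard similarity of two clusters,
    each given as a set of item indices."""
    a, b = set(cluster_a), set(cluster_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Toy ensemble: 3 algorithms partitioning 4 items into at most 2 clusters.
C = [[0, 0, 1, 1],
     [0, 0, 0, 1],
     [0, 1, 1, 1]]
W = coassociation_matrix(C)    # W[0, 1] == 2: two algorithms co-cluster items 0 and 1
s = jaccard({0, 1}, {0, 1, 2})  # 2/3
```

Note that W discards which algorithms produced each vote; this loss of information is exactly the issue examined in the next section.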
Element pairs (cluster pairs or item pairs) are then evaluated, and their edges are assigned a weight that reflects their similarity. A natural question is whether we can find a better representation of the available information. This is the focus of the next section.

2 Key Observations: Two is a company, is three a crowd?

Consider an example where one is 'aggregating' recommendations made by a group of family and friends for dinner-table seating assignments at a wedding. The hosts would like each 'table' to be able to find a common topic of dinner conversation. Now, consider three persons, Tom, Dick, and Harry, invited to this reception. Tom and Dick share a common interest in Shakespeare, Dick and Harry are both surfboard enthusiasts, and Harry and Tom attended college together. Because they had strong pairwise similarities, they were seated together, but they had a rather dull evening.

A simple analysis shows that the three guests had strong common interests when considered two at a time, but there was weak communion as a group. The connection of this example to the ensemble clustering problem is clear. Existing algorithms represent the similarity between elements in D as a scalar value assigned to the edge joining their corresponding nodes in the graph. This weight is essentially a 'vote' reflecting the number of algorithms that assigned those two elements to the same cluster. The mechanism seems perfect until we ask whether strong pairwise coupling necessarily implies coupling for a larger group as well. The weight metric for two elements does not retain information about which algorithms assigned them together. When expanding the group to include more elements, one cannot be sure that a common feature exists under which the larger group is similar.
It seems natural to assign a higher priority to triples or larger groups of people who were recommended to be seated together (and so must be similar under at least one feature) than to groups that were never assigned to the same table by any person in the recommendation group (clustering algorithm), notwithstanding pairwise evaluations; for an illustrative example, see [10]. While this problem may seem to be a distinctive disadvantage of only the IBGF approach, it also affects the CBGF approach. This can be seen by viewing clusters as items and the Jaccard similarity measure as the vote (weight) on the edges.

3 Main Ideas

To model the intuition above, we generalize the similarity metric to maximize similarity or 'agreement' via an appropriate encoding of the solutions obtained from the individual clustering algorithms. More precisely, in our generalization the similarity is no longer just a scalar value but a 2D string. The ensemble clustering problem thus reduces to a form of string clustering problem in which our objective is to assign similar strings to the same cluster.

The encoding into a string is done as follows. The data item set is given as D with |D| = n. Let m be the number of clustering algorithms, with each solution having no more than k clusters. We represent all input information (ensemble) as a single 3D matrix, A ∈