{"title": "Ensemble Clustering using Semidefinite Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1353, "page_last": 1360, "abstract": null, "full_text": "Ensemble Clustering using Semidefinite Programming\n\nVikas Singh\nBiostatistics and Medical Informatics\nUniversity of Wisconsin \u2013 Madison\nvsingh @ biostat.wisc.edu\n\nLopamudra Mukherjee\nComputer Science and Engineering\nState University of New York at Buffalo\nlm37 @ cse.buffalo.edu\n\nJiming Peng\nIndustrial and Enterprise System Engineering\nUniversity of Illinois at Urbana-Champaign\npengj @ uiuc.edu\n\nJinhui Xu\nComputer Science and Engineering\nState University of New York at Buffalo\njinhui @ cse.buffalo.edu\n\nAbstract\n\nWe consider the ensemble clustering problem where the task is to \u2018aggregate\u2019 multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than voting strategy based methods proposed in the literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques which can subsequently be relaxed to a polynomial time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. 
We discuss evaluations on clustering and image segmentation databases.\n\n1 Introduction\n\nIn the so-called Ensemble Clustering problem, the target is to \u2018combine\u2019 multiple clustering solutions or partitions of a set into a single consolidated clustering that maximizes the information shared (or \u2018agreement\u2019) among all available clustering solutions. The need for this form of clustering arises in many applications, especially real world scenarios with a high degree of uncertainty such as image segmentation with poor signal to noise ratio and computer assisted disease diagnosis. It is quite common that a single clustering algorithm may not yield satisfactory results, while multiple algorithms may individually make imperfect choices, assigning some elements to wrong clusters. Usually, by considering the results of several different clustering algorithms together, one may be able to mitigate degeneracies in individual solutions and consequently obtain better solutions. The idea has been employed successfully for microarray data classification analysis [1], computer assisted diagnosis of diseases [2] and in a number of other applications [3].\nFormally, given a data set $D = \\{d_1, d_2, \\dots, d_n\\}$, a set of clustering solutions $C = \\{C_1, C_2, \\dots, C_m\\}$ obtained from $m$ different clustering algorithms is called a cluster ensemble. Each solution, $C_i$, is a partition of the data into at most $k$ different clusters. The Ensemble Clustering problem requires one to use the individual solutions in $C$ to partition $D$ into $k$ clusters such that information shared (\u2018agreement\u2019) among the solutions of $C$ is maximized.\n\n1.1 Previous works\n\nThe Ensemble Clustering problem was recently introduced by Strehl and Ghosh [3]; in [4], a related notion of correlation clustering was independently proposed by Bansal, Blum, and Chawla. 
The problem has attracted a fair amount of attention and a number of interesting techniques have been proposed [3, 2, 5, 6]; also see [7, 4]. Formulations primarily differ in how the objective of shared information maximization or agreement is chosen; we review some of the popular techniques next.\nThe Instance Based Graph Formulation (IBGF) [2, 5] first constructs a fully connected graph $G = (V, W)$ for the ensemble $C = (C_1, \\dots, C_m)$, where each node represents an element of $D = \\{d_1, \\dots, d_n\\}$. The edge weight $w_{ij}$ for the pair $(d_i, d_j)$ is defined as the number of algorithms in $C$ that assign the nodes $d_i$ and $d_j$ to the same cluster (i.e., $w_{ij}$ measures the togetherness frequency of $d_i$ and $d_j$). Then, standard graph partitioning techniques are used to obtain a final clustering solution.\nIn Cluster Based Graph Formulation (CBGF), a given cluster ensemble is represented as $C = \\{C_{11}, \\dots, C_{mk}\\} = \\{\\bar{C}_1, \\dots, \\bar{C}_{m \\cdot k}\\}$ where $C_{ij}$ denotes the $i$th cluster of the $j$th algorithm in $C$. Like IBGF, this approach also constructs a graph, $G = (V, W)$, to model the correspondence (or \u2018similarity\u2019) relationship among the $mk$ clusters, where the similarity matrix $W$ reflects the Jaccard\u2019s similarity measure between clusters $\\bar{C}_i$ and $\\bar{C}_j$. The graph is then partitioned so that the clusters of the same group are similar to one another. Variants of the problem have also received considerable attention in the theoretical computer science and machine learning communities. A recent paper by Ailon, Charikar, and Newman [7] demonstrated connections to other well known problems such as Rank Aggregation; their algorithm is simple and obtains an expected constant approximation guarantee (via linear programming duality). In addition to [7], other results include [4, 8].\nA commonality of existing algorithms for Ensemble Clustering [3, 2, 9] is that they employ a graph construction as a first step. 
Element pairs (cluster pairs or item pairs) are then evaluated and their edges are assigned a weight that reflects their similarity. A natural question relates to whether we can find a better representation of the available information. This will be the focus of the next section.\n\n2 Key Observations: Two is a company, is three a crowd?\n\nConsider an example where one is \u2018aggregating\u2019 recommendations made by a group of family and friends for dinner table seating assignments at a wedding. The hosts would like each \u2018table\u2019 to be able to find a common topic of dinner conversation. Now, consider three persons, Tom, Dick, and Harry invited to this reception. Tom and Dick share a common interest in Shakespeare, Dick and Harry are both surfboard enthusiasts, and Harry and Tom attended college together. Because they had strong pairwise similarities, they were seated together but had a rather dull evening.\nA simple analysis shows that the three guests had strong common interests when considered two at a time, but there was weak communion as a group. The connection of this example to the ensemble clustering problem is clear. Existing algorithms represent the similarity between elements in $D$ as a scalar value assigned to the edge joining their corresponding nodes in the graph. This weight is essentially a \u2018vote\u2019 reflecting the number of algorithms that assigned those two elements to the same cluster. The mechanism seems perfect until we ask if strong pairwise coupling necessarily implies coupling for a larger group as well. The weight metric considering two elements does not retain information about which algorithms assigned them together. When expanding the group to include more elements, one is not sure if a common feature exists under which the larger group is similar. 
It seems natural to assign a higher priority to triples or larger groups of people that were recommended to be seated together (must be similar under at least one feature) compared to groups that were never assigned to the same table by any person in the recommendation group (clustering algorithm), notwithstanding pairwise evaluations; for an illustrative example see [10]. While this problem seems to be a distinctive disadvantage for only the IBGF approach, it also affects the CBGF approach. This can be seen by looking at clusters as items and the Jaccard\u2019s similarity measure as the vote (weight) on the edges.\n\n3 Main Ideas\n\nTo model the intuition above, we generalize the similarity metric to maximize similarity or \u2018agreement\u2019 by an appropriate encoding of the solutions obtained from individual clustering algorithms. More precisely, in our generalization the similarity is no longer just a scalar value but a 2D string. The ensemble clustering problem thus reduces to a form of string clustering problem where our objective is to assign similar strings to the same cluster.\nThe encoding into a string is done as follows. The data item set is given as $D$ with $|D| = n$. Let $m$ be the number of clustering algorithms with each solution having no more than $k$ clusters. We represent all input information (ensemble) as a single 3D matrix, $A \\in \\mathbb{R}^{n \\times m \\times k}$. For every data element $d_l \\in D$, $A_l \\in \\mathbb{R}^{m \\times k}$ is a matrix whose elements are defined by\n\n$$A_l(i, j) = \\begin{cases} 1 & \\text{if } d_l \\text{ is assigned to cluster } i \\text{ by } C_j; \\\\ 0 & \\text{otherwise.} \\end{cases} \\tag{1}$$\n\nIt is easy to see that the summation of every row of $A_l$ equals 1. We call each $A_l$ an A-string. Our goal is to cluster the elements $D = \\{d_1, d_2, \\dots, d_n\\}$ based on the similarity of their A-strings.\nWe now consider how to compute the clusters based on the similarity (or dissimilarity) of strings. 
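
The encoding in (1) can be illustrated with a minimal sketch in Python with NumPy. Neither is used in the paper, and the function name a_strings and the toy labelings are hypothetical. The sketch assumes 0-based integer cluster labels and stores each A-string with rows indexed by algorithm and columns by cluster, so that every row sums to 1.

```python
import numpy as np

def a_strings(labelings, k):
    # labelings[j][l] is the (0-based) cluster that algorithm j assigns to item l.
    labelings = np.asarray(labelings)
    m, n = labelings.shape
    A = np.zeros((n, m, k), dtype=int)
    rows = np.arange(m)
    for l in range(n):
        # One 1 per row: the cluster chosen for item l by each algorithm.
        A[l, rows, labelings[:, l]] = 1
    return A

# Hypothetical ensemble: m = 3 algorithms, n = 4 items, k = 2 clusters.
labelings = [[0, 0, 1, 1],
             [1, 1, 0, 0],
             [0, 1, 1, 1]]
A = a_strings(labelings, k=2)
```

Here items 0 and 1 receive A-strings that differ only in the row of the third algorithm, which records which algorithm separated them, something a single pairwise vote count does not retain.
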
We note that the paper [11] by Gasieniec et al. discussed the so-called Hamming radius p-clustering and Hamming diameter p-clustering problems on strings. Though their results shed considerable light on the hardness of string clustering with the selected distance measures, those techniques cannot be directly applied to the problem at hand because the objective here is fairly different from the one in [11]. Fortunately, our analysis reveals that a simpler objective is sufficient to capture the essence of similarity maximization in clusters using certain special properties of the A-strings.\nOur approach is partly inspired by the classical k-means clustering where all data points are assigned to the cluster based on the shortest distance to the cluster center. Imagine an ideal input instance for the ensemble clustering problem (all clustering algorithms behave similarly) \u2013 one with only $k$ unique members among $n$ A-strings. The partitioning simply assigns similar strings to the same partition. The representative for each cluster will then be exactly like its members, is a valid A-string, and can be viewed as a center in a geometric sense. General input instances will obviously be non-ideal and are likely to contain far more than $k$ unique members. Naturally, the centers of the clusters will vary from their members. This variation can be thought of as noise or disagreement within the clusters; our objective is to find a set of clusters (and centers) such that the noise is minimized and we move very close to the ideal case. To model this, we consider the centers to be in the same high dimensional space as the A-strings in $D$ (though they may not belong to $D$). Consider an example where a cluster $i$ in this optimal solution contains items $(d_1, d_2, \\dots, d_7)$. A certain algorithm $C_j$ in the input ensemble clusters items $(d_1, d_2, d_3, d_4)$ in cluster $s$ and $(d_5, d_6, d_7)$ in cluster $p$. 
How would $C_j$ behave if evaluating the center of cluster $i$ as a data item? The probability it assigns the center to cluster $s$ is 4/7 and the probability it assigns the center to cluster $p$ is 3/7. If we emulate this logic, we must pick the choice with the higher probability and assign the center to such a cluster. It can be verified that this choice minimizes the dissent of all items in cluster $i$ to the center. The A-string for the center of cluster $i$ will have a \u201c1\u201d at position $(j, s)$. The assignment of A-strings (items) to clusters is unknown; however, if it were somehow known, we could find the centers for all other clusters $i \\in [1, k]$ by computing the average value at every cell of the $A$ matrices corresponding to the members of the cluster and rounding the largest value in every row to 1 (rest to 0) and assigning this as the cluster center. Hence, the dissent within a cluster can be quantified simply by averaging the matrices of elements that belong to the cluster and computing the difference to the center. Our goal is to find such an assignment and group the A-strings so that the sum of the absolute differences of the averages of clusters to their centers (dissent) is minimized. In the subsequent sections, we will introduce our optimization framework for ensemble clustering based on these ideas.\n\n4 Integer Program for Model 1\n\nWe start with a discussion of an Integer Program (IP, for short) formulation for ensemble clustering. For convenience, we denote the final clustering solution by $C^* = \\{C^*_1, \\dots, C^*_k\\}$ and $C_{ij}$ denotes the cluster $i$ by the algorithm $j$. 
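
Before the IP itself, the center construction described at the end of Section 3 can be sketched numerically. This is a hedged illustration in Python with NumPy, not the authors' implementation. The helper name center_and_dissent is hypothetical, A-strings are stored with rows indexed by algorithm and columns by cluster, and ties in a row are broken by taking the first maximum, which is an assumption the paper does not address.

```python
import numpy as np

def center_and_dissent(A_members):
    # A_members: array of shape (size, m, k) stacking the A-strings of one cluster.
    avg = A_members.mean(axis=0)                   # cell-wise averages, shape (m, k)
    center = np.zeros_like(avg)
    # Round the largest value in every row to 1 and the rest to 0.
    center[np.arange(avg.shape[0]), avg.argmax(axis=1)] = 1.0
    dissent = np.abs(center - avg).sum()           # sum of absolute differences
    return center, avg, dissent

# The 7-item example from the text, restricted to one algorithm (m = 1):
# four items fall in cluster s (column 0) and three in cluster p (column 1).
A_members = np.zeros((7, 1, 2))
A_members[:4, 0, 0] = 1.0
A_members[4:, 0, 1] = 1.0
center, avg, dissent = center_and_dissent(A_members)
```

For this instance the averages are 4/7 and 3/7, the center selects cluster s, and the dissent contributed by this algorithm is |1 - 4/7| + |0 - 3/7| = 6/7.
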
The variables that constitute the IP are as follows.\n\n$$X_{li'} = \\begin{cases} 1 & \\text{if } d_l \\in C^*_{i'}; \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{2}$$\n\n$$s_{iji'} = \\begin{cases} 1 & \\text{if } C_{ij} = \\arg\\max_{i = 1, \\dots, k} \\{ |C^*_{i'} \\cap C_{ij}| \\}; \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{3}$$\n\nWe mention that the above definition implies that for a fixed index $i'$, its center, $s_{iji'}$, also provides an indicator to the cluster most similar to $C^*_{i'}$ in the set of clusters produced by the clustering algorithm $C_j$. We are now ready to introduce the following IP.\n\n$$\\min \\sum_{i=1}^{k} \\sum_{j=1}^{m} \\sum_{i'=1}^{k} \\left| s_{iji'} - \\frac{\\sum_{l=1}^{n} A_{lij} X_{li'}}{\\sum_{l=1}^{n} X_{li'}} \\right| \\tag{4}$$\n\n$$\\text{s.t.} \\quad \\sum_{i'=1}^{k} X_{li'} = 1 \\;\\; \\forall l \\in [1, n], \\qquad \\sum_{l=1}^{n} X_{li'} \\geq 1 \\;\\; \\forall i' \\in [1, k], \\tag{5}$$\n\n$$\\sum_{i=1}^{k} s_{iji'} = 1 \\;\\; \\forall j \\in [1, m], \\, i' \\in [1, k], \\qquad X_{li'} \\in \\{0, 1\\}, \\quad s_{iji'} \\in \\{0, 1\\}. \\tag{6}$$\n\nHere $A_{lij}$ denotes $A_l(i, j)$. (4) minimizes the sum of the difference between $s_{iji'}$ (the center for cluster $C^*_{i'}$) and the average of all $A_{lij}$ bits of the data elements $d_l$ assigned to cluster $C^*_{i'}$. Recall that $s_{iji'}$ will be 1 if $C_{ij}$ is the most similar cluster to $C^*_{i'}$ among all the clusters produced by algorithm $C_j$. Hence, if $s_{iji'} = 0$ and $\\frac{\\sum_{l=1}^{n} A_{lij} X_{li'}}{\\sum_{l=1}^{n} X_{li'}} \\neq 0$, the value $\\left| s_{iji'} - \\frac{\\sum_{l=1}^{n} A_{lij} X_{li'}}{\\sum_{l=1}^{n} X_{li'}} \\right|$ represents the percentage of data elements in $C^*_{i'}$ that do not consent with the majority of the other elements in the group w.r.t. the clustering solution provided by $C_j$. In other words, we are trying to minimize the dissent and maximize the consent simultaneously. The remaining constraints are relatively simple \u2013 (5) enforces the condition that a data element should belong to precisely one cluster in the final solution and that every cluster must have size at least 1; (6) ensures that $s_{iji'}$ is an appropriate A-string for every cluster center.\n\n5 0-1 Semidefinite Program for Model 1\n\nThe formulation given by (4)-(6) is a mixed integer program (MIP, for short) with a nonlinear objective function in (4). Solving this model optimally, however, is extremely challenging \u2013 (a) the constraints in (5)-(6) are discrete; (b) the objective is nonlinear and nonconvex. One possible way of attacking the problem is to \u2018relax\u2019 it to some polynomially solvable problems such as SDP (the problem of minimizing a linear function over the intersection of a polyhedron and the cone of symmetric and positive semidefinite matrices, see [12] for an introduction). Our effort would be to convert the nonlinear form in (4) into a 0-1 SDP form. By introducing artificial variables, we rewrite (4) as\n\n$$\\min \\sum_{i=1}^{k} \\sum_{j=1}^{m} \\sum_{i'=1}^{k} t_{iji'} \\tag{7}$$\n\n$$\\text{s.t.} \\quad s_{iji'} - c_{iji'} \\leq t_{iji'}, \\qquad c_{iji'} - s_{iji'} \\leq t_{iji'} \\;\\; \\forall i, i', j, \\tag{8}$$\n\nwhere the term $c_{iji'}$ represents the second term in (4) defined by\n\n$$c_{iji'} = \\frac{\\sum_{l=1}^{n} A_{lij} X_{li'}}{\\sum_{l=1}^{n} X_{li'}} \\;\\; \\forall i, i', j. \\tag{9}$$\n\nSince both $A_{lij}$ and $X_{li'}$ are binary, (9) can be rewritten as\n\n$$c_{iji'} = \\frac{\\sum_{l=1}^{n} A_{lij}^2 X_{li'}^2}{\\sum_{l=1}^{n} X_{li'}^2} \\;\\; \\forall i, i', j. \\tag{10}$$\n\nLet us introduce a vector variable $y_{i'} \\in \\mathbb{R}^n$ whose $l$th entry is defined by\n\n$$y_{i'}^{(l)} = \\frac{X_{li'}}{\\sqrt{\\sum_{l=1}^{n} X_{li'}^2}} = \\frac{X_{li'}}{\\|X_{i'}\\|_2}. \\tag{11}$$\n\nLet $A_{ij} \\in \\mathbb{R}^n$ be a vector whose $l$th element has value $A_l(i, j)$. 
This allows us to represent (10) as\n\n$$c_{iji'} = \\mathrm{tr}(B_{ij} Z_{i'}), \\qquad Z_{i'}^2 = Z_{i'}, \\qquad Z_{i'} \\succeq 0, \\tag{12}$$\n\nwhere $B_{ij} = \\mathrm{diag}(A_{ij})$ is a diagonal matrix with $(B_{ij})_{ll} = A_l(i, j)$; the second and third properties follow from $Z_{i'} = y_{i'} y_{i'}^T$ being a positive semidefinite matrix. Now, we rewrite the constraints for $X$ in terms of $Z$. (5) is automatically satisfied by the following constraints on the elements of $Z_{i'}$.\n\n$$\\sum_{l=1}^{n} Z_{i'}^{(ll)} = 1 \\;\\; \\forall i' \\in [1, k], \\qquad \\sum_{l'=1}^{n} Z_{i'}^{(ll')} \\leq 1 \\;\\; \\forall i' \\in [1, k], \\, \\forall l \\in [1, n], \\tag{13}$$\n\nwhere $Z_{i'}^{(uv)}$ refers to the $(u, v)$ entry of matrix $Z_{i'}$. Since $Z_{i'}$ is a symmetric projection matrix by construction, (7)-(13) constitute a precisely defined 0-1 SDP that can be expressed in trace form as\n\n$$\\min \\sum_{i'=1}^{k} \\mathrm{tr}(\\mathrm{diag}(T_{i'} e_k)) \\tag{14}$$\n\n$$\\text{s.t.} \\quad (S_{i'} - T_{i'} - Q_{i'}) \\leq 0, \\qquad (Q_{i'} - S_{i'} - T_{i'}) \\leq 0 \\;\\; \\forall i' \\in [1, k], \\tag{15}$$\n\n$$\\Big( \\sum_{i'=1}^{k} Z_{i'} \\Big) e_n = e_n, \\qquad \\mathrm{tr}(Z_{i'}) = 1 \\;\\; \\forall i' \\in [1, k], \\qquad \\mathrm{tr}\\Big( \\sum_{i'=1}^{k} Z_{i'} \\Big) = k, \\tag{16}$$\n\n$$S_{i'} e_k = e_m \\;\\; \\forall i' \\in [1, k], \\qquad Z_{i'} \\geq 0; \\;\\; Z_{i'}^2 = Z_{i'}; \\;\\; Z_{i'} = Z_{i'}^T; \\;\\; S_{i'} \\in \\{0, 1\\}, \\tag{17}$$\n\nwhere $Q_{i'}(i, j) = c_{iji'} = \\mathrm{tr}(B_{ij} Z_{i'})$, and $e_n \\in \\mathbb{R}^n$ is a vector of all 1s.\nThe experimental results for this model indicate that it performs very well in practice (see [10]). However, because we must solve the model while maintaining the requirement that $S_{i'}$ be binary (otherwise, the problem becomes ill-posed), a branch and bound type method is needed. Such approaches are widely used in many application areas, but their worst case complexity is exponential in the input data size. 
In the subsequent sections, we will make several changes to this framework based on additional observations in order to obtain a polynomial algorithm for the problem.\n\n6 Integer Program and 0-1 Semidefinite Program for Model 2\n\nRecall the definition of the variables $c_{iji'}$, which can be interpreted as the size of the overlap between the cluster $C^*_{i'}$ in the final solution and $C_{ij}$, and is proportional to the cardinality of $C^*_{i'}$. Let us define\n\n$$c_{i^* j i'} = \\max_{i = 1, \\dots, k} c_{iji'}.$$\n\nLet us also define vector variables $q_{ji'}$ whose $i$th element is $s_{iji'} - c_{iji'}$. In the IP model 1, we try to minimize the sum of all the $L_1$-norms of $q_{ji'}$. The main difficulty in the previous formulation stems from the fact that $c_{iji'}$ is a fractional function w.r.t. the assignment matrix $X$. Fortunately, we note that since entries of $c_{iji'}$ are fractional satisfying $\\sum_{i=1}^{k} c_{iji'} = 1$ for any fixed $j, i'$, their sum of squares is maximized when its largest entry is as high as possible. Thus, minimizing the function $1 - \\sum_{i=1}^{k} (c_{iji'})^2$ is a reasonable substitute to minimizing the sum of the $L_1$-norms in the IP model 1. The primary advantage of this observation is that we do not need to know the \u2018index\u2019 ($i^*$) of the maximal element $c_{i^* j i'}$. As before, $X$ denotes the assignment matrix. We no longer need the variable $s$, as it can be easily determined from the solution. This yields the following IP.\n\n$$\\min \\sum_{i'=1}^{k} \\sum_{j=1}^{m} \\Big( \\sum_{l=1}^{n} X_{li'} \\Big) \\Big( 1 - \\sum_{i=1}^{k} (c_{iji'})^2 \\Big) \\tag{18}$$\n\n$$\\text{s.t.} \\quad \\sum_{i'=1}^{k} X_{li'} = 1 \\;\\; \\forall l \\in [1, n], \\qquad \\sum_{l=1}^{n} X_{li'} \\geq 1 \\;\\; \\forall i' \\in [1, k], \\qquad X_{li'} \\in \\{0, 1\\}. \\tag{19}$$\n\nWe next discuss how to transform the above problem to a 0-1 SDP. 
For this, we first note that the objective function (18) of the IP above can be expressed as follows.\n\n$$\\min \\sum_{i'=1}^{k} \\sum_{j=1}^{m} \\left( \\Big( \\sum_{l=1}^{n} X_{li'} \\Big) - \\sum_{i=1}^{k} \\frac{(\\sum_{l=1}^{n} A_{lij} X_{li'})^2}{\\sum_{l=1}^{n} X_{li'}} \\right), \\tag{20}$$\n\nwhich can be equivalently stated as\n\n$$\\min \\left( nm - \\sum_{i'=1}^{k} \\sum_{j=1}^{m} \\sum_{i=1}^{k} \\frac{(\\sum_{l=1}^{n} A_{lij} X_{li'})^2}{\\sum_{l=1}^{n} X_{li'}} \\right). \\tag{21}$$\n\nThe numerator of the second term above can be rewritten as\n\n$$\\Big( \\sum_{l=1}^{n} A_{lij} X_{li'} \\Big)^2 = (A_{1ij} X_{1i'} + \\dots + A_{nij} X_{ni'})^2 = (A_{ij}^T X_{i'})^2 = X_{i'}^T A_{ij} A_{ij}^T X_{i'}, \\tag{22}$$\n\nwhere $X_{i'}$ is the $i'$th column vector of $X$. Therefore, the second term of (21) can be written as\n\n$$\\mathrm{tr}\\Big( \\sum_{i'=1}^{k} \\sum_{j=1}^{m} \\sum_{i=1}^{k} X_{i'}^T A_{ij} A_{ij}^T X_{i'} (X_{i'}^T X_{i'})^{-1} \\Big) = \\mathrm{tr}\\Big( \\sum_{i'=1}^{k} \\sum_{j=1}^{m} \\sum_{i=1}^{k} A_{ij} A_{ij}^T Z_{i'} \\Big) = \\mathrm{tr}\\Big( \\sum_{j=1}^{m} \\sum_{i=1}^{k} A_{ij} A_{ij}^T Z \\Big) = \\mathrm{tr}\\Big( \\sum_{j=1}^{m} B_j Z \\Big) = \\mathrm{tr}(BZ). \\tag{23}$$\n\nIn (23), $Z_{i'} = X_{i'} (X_{i'}^T X_{i'})^{-1} X_{i'}^T$ (same as in IP model 1), $Z = \\sum_{i'=1}^{k} Z_{i'}$, $B_j = \\sum_{i=1}^{k} A_{ij} A_{ij}^T$ and $B = \\sum_{j=1}^{m} B_j$. Since each matrix $Z_{i'}$ is a symmetric projection matrix and $X_{i'_1}$ and $X_{i'_2}$ are orthogonal to each other when $i'_1 \\neq i'_2$, $Z$ is a projection matrix of the form $X (X^T X)^{-1} X^T$. The last fact, also used in [13], is originally attributed to an anonymous referee in [14]. Finally, we derive the 0-1 SDP formulation for the problem (18)-(19) as follows.\n\n$$\\min \\; (nm - \\mathrm{tr}(BZ)) \\tag{24}$$\n\n$$\\text{s.t.} \\quad Z e_n = e_n, \\tag{25}$$\n\n$$\\mathrm{tr}(Z) = k, \\quad Z \\geq 0; \\;\\; Z^2 = Z; \\;\\; Z = Z^T. \\tag{26}$$\n\nRelaxing and Solving the 0-1 SDP: The relaxation to (24)-(26) exploits the fact that $Z$ is a projection matrix satisfying $Z^2 = Z$. 
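
The equivalence between the IP objective (18) and the trace form (24) can be checked numerically on a small instance. The following Python/NumPy sketch is only an illustration under stated assumptions: the function names, the toy ensemble and the candidate assignment X are hypothetical, A-strings are stored with rows indexed by algorithm and columns by cluster, and every cluster of X is assumed non-empty, as required by (19), so that the inverse exists.

```python
import numpy as np

def objective_direct(A, X):
    # Objective (18): sum over final clusters and algorithms of
    # (cluster size) * (1 - sum of squared overlap fractions).
    sizes = X.sum(axis=0)                        # cluster sizes, shape (k,)
    overlap = np.einsum('lji,lc->cji', A, X)     # sum over items of A_l(j, i) * X_lc
    c = overlap / sizes[:, None, None]
    m = A.shape[1]
    return float((sizes * (m - (c ** 2).sum(axis=(1, 2)))).sum())

def objective_trace(A, X):
    # Objective (24): n*m - tr(B Z), with Z = X (X^T X)^{-1} X^T.
    n, m, k = A.shape
    cols = A.reshape(n, m * k)                   # one column per vector A_ij
    B = cols @ cols.T                            # sum over j and i of A_ij A_ij^T
    Z = X @ np.linalg.inv(X.T @ X) @ X.T
    return float(n * m - np.trace(B @ Z)), Z

# Toy ensemble: m = 3 algorithms, n = 4 items, k = 2 clusters (0-based labels).
labelings = np.array([[0, 0, 1, 1],
                      [1, 1, 0, 0],
                      [0, 1, 1, 1]])
m, n = labelings.shape
k = 2
A = np.zeros((n, m, k))
for l in range(n):
    A[l, np.arange(m), labelings[:, l]] = 1.0

# Candidate consensus: items 0 and 1 in one cluster, items 2 and 3 in the other.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

direct = objective_direct(A, X)
via_trace, Z = objective_trace(A, X)
```

On this instance both expressions evaluate to 1 (only the third algorithm splits the first cluster), and Z satisfies the properties used in (25)-(26): it is idempotent, has trace k = 2, and its rows sum to 1.
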
This allows replacing the last three constraints in (26) by $I \\succeq Z \\succeq 0$. By establishing the result that any feasible solution to the second formulation of 0-1 SDP, $Z_{feas}$, is a rank $k$ matrix, we first solve the relaxed SDP using SeDuMi [15], take the rank $k$ projection of $Z^*$ and then adopt a rounding based on a variant of the winner-takes-all approach to obtain a solution in polynomial time. For the technical details and their proofs, please refer to [10].\n\n7 Experimental Results\n\nOur experiments included evaluations on several classification datasets, segmentation databases and simulations. Due to space limitations, we provide a brief summary here. Our first set of experiments illustrates an application to several datasets from the UCI Machine Learning Repository: (1) Iris dataset, (2) Soybean dataset and (3) Wine dataset; these include ground truth data, see http://www.ics.uci.edu/~mlearn/MLRepository.html. To create the ensemble, we used sets of $m \\in [4, 10]$ clustering schemes (by varying the clustering criterion and/or algorithm) from the CLUTO clustering toolkit. The multiple solutions comprised the input ensemble; our model was then used to determine an agreement maximizing solution. The ground-truth data was used at this stage to evaluate accuracy of the ensemble (and individual schemes). The results are shown in Figure 1(a)-(c). For each dataset, we can see that the ensemble clustering solution is generally at least as good as the best clustering algorithm. Observe, however, that while such results are expected for this and many other datasets (see [3]), the consensus solution may not always be superior to the \u2018best\u2019 clustering solution. For instance, in Fig. 1(c) (for $m = 7$) the best solution has a marginally lower error rate than the ensemble. 
An ensemble solution is useful because we do not know a priori which algorithm will perform the best (especially if ground truth is unavailable).\n\n[Figure 1, panels (a)\u2013(c). x-axis: number of clustering algorithms in ensemble (m); y-axis: mislabelled cases in each algorithm; legend: cluster ensemble, individual algorithms.]\n\nFigure 1: Synergy. The fraction of mislabeled cases ([0, 1]) in a consensus solution (\u2217) is compared to the number of mislabelled cases (\u2206) in individual clustering algorithms. We illustrate the ensemble effect for the Iris dataset in (a), the Soybean dataset in (b), and the Wine dataset in (c).\n\nOur second set of experiments focuses on a novel application of ensembles to the problem of image segmentation. Even sophisticated segmentation algorithms may yield \u2018different\u2019 results on the same image; when multiple segmentations are available, it seems reasonable to \u2018combine\u2019 segmentations to reduce degeneracies. Our experimental results indicate that in many cases, we can obtain a better overall segmentation that captures (more) details in the images more accurately with fewer outlying clusters. In Fig. 2, we illustrate the results on an image from the Berkeley dataset. The segmentations were generated using several powerful algorithms including (a) Normalized Cuts, (b) Energy Minimization by Graph Cuts and (c)\u2013(d) Curve Evolution. Notice that each algorithm performs well but misses out on some details. For instance, (a) and (d) do not segment the eyes; (b) does well in segmenting the shirt collar region but can only recognize one of the eyes and (c) creates an additional cut across the forehead. 
The ensemble (extreme right) is able to segment these details (eyes, shirt collar and cap) nicely by combining (a)\u2013(d). For implementation details of the algorithm including settings, preprocessing and additional evaluations, please refer to [10].\n\n[Figure 2, panels (a)\u2013(d) and ensemble.]\n\nFigure 2: A segmentation ensemble on an image from the Berkeley Segmentation dataset. (a)\u2013(d) show the individual segmentations overlaid on the input image; the right-most image shows the segmentation generated from ensemble clustering.\n\nThe final set of our experiments was performed on 500 runs of artificially generated cluster ensembles. We first constructed an initial set segmentation; this was then repeatedly permuted (up to 15%), yielding a set of clustering solutions. The solutions from our model and [3] were compared w.r.t. our objective functions and Normalized Mutual Information used in [3]. In Figure 3(a), we see that our algorithm (Model 1) outperforms [3] on all instances. In the average case, the ratio is slightly more than 1.5. We must note the time-quality trade-off because solving Model 1 requires a branch-and-bound approach. In Fig. 3(b), we compare the results of [3] with solutions from the relaxed SDP Model 2 on (24). We can see that our model performs better in \u223c 95% of the cases. Finally, Figure 3(c) shows a comparison of relaxed SDP Model 2 with [3] on the objective function optimized in [3] (using best among two techniques). We observed that our solutions achieve superior results in 80% of the cases. The results show that even empirically our objective functions model similarity rather well, and that Normalized Mutual Information may be implicitly optimized within this framework.\nRemarks. We note that the graph partitioning methods used in [3] are typically much faster than SDP solvers (e.g., SeDuMi [15] and SDPT3) for comparable problem sizes. 
However, given the increasing interest in SDP in the last few years, we may expect the development of new algorithms, and faster/more efficient software tools.\n\n8 Conclusions\n\nWe have proposed a new algorithm for ensemble clustering based on a SDP formulation. Among the important contributions of this paper is, we believe, the observation that the notion of agreement in an ensemble can be captured better using string encoding rather than a voting strategy. While a partition problem defined directly on such strings yields a non-linear optimization problem, we illustrate a transformation into a strict 0-1 SDP via novel convexification techniques. The last result of this paper is the design of a modified model of the SDP based on additional observations on the structure of the underlying matrices. We discuss extensive experimental evaluations on simulations and real datasets; in addition, we illustrate application of the algorithm for segmentation ensembles. We feel that the latter application is of independent research interest; to the best of our knowledge, this is the first algorithmic treatment of generating segmentation ensembles for improving accuracy.\n\nFigure 3: A comparison of [3] with SDP Model 1 in (a), and with SDP Model 2 on (24) in (b). The solution from [3] was used as the numerator. In (c), comparisons (difference in normalized values) between our solution and the best among IBGF and CBGF based on the Normalized Mutual Information (NMI) objective function used in [3].\n\nAcknowledgments. This work was supported in part by NSF grants CCF-0546509 and IIS-0713489. The first author was also supported by start-up funds from the Dept. of Biostatistics and Medical Informatics, UW \u2013 Madison. We thank D. 
Sivakumar for useful discussions, Johan L\u00f6fberg for a thorough explanation of the salient features of Yalmip [16], and the reviewers for suggestions regarding the presentation of the paper. One of the reviewers also pointed out a typo in the derivations in \u00a76.\n\nReferences\n\n[1] V. Filkov and S. Skiena. Integrating microarray data by consensus clustering. In Proc. of International Conference on Tools with Artificial Intelligence, page 418, 2003.\n\n[2] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proc. of International Conference on Machine Learning, page 36, 2004.\n\n[3] A. Strehl and J. Ghosh. Cluster Ensembles \u2013 A Knowledge Reuse Framework for Combining Partitionings. In Proc. of AAAI 2002, pages 93\u201398, 2002.\n\n[4] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In Proc. Symposium on Foundations of Computer Science, page 238, 2002.\n\n[5] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn., 52(1-2):91\u2013118, 2003.\n\n[6] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. In Proc. of International Conference on Data Engineering, pages 341\u2013352, 2005.\n\n[7] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In Proc. of Symposium on Theory of Computing, pages 684\u2013693, 2005.\n\n[8] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360\u2013383, 2005.\n\n[9] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of International Conference on Machine Learning, 2003.\n\n[10] V. Singh. On Several Geometric Optimization Problems in Biomedical Computation. 
PhD thesis, State University of New York at Buffalo, 2007.\n\n[11] L. Gasieniec, J. Jansson, and A. Lingas. Approximation algorithms for Hamming clustering problems. In Proc. of Symposium on Combinatorial Pattern Matching, pages 108\u2013118, 2000.\n\n[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.\n\n[13] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186\u2013205, 2007.\n\n[14] A. D. Gordon and J. T. Henderson. An algorithm for Euclidean sum of squares classification. Biometrics, 33:355\u2013362, 1977.\n\n[15] J. F. Sturm. Using SeDuMi 1.02, A Matlab Toolbox for Optimization over Symmetric Cones. Optimization Methods and Software, 11-12:625\u2013653, 1999.\n\n[16] J. L\u00f6fberg. YALMIP : A toolbox for modeling and optimization in MATLAB. In CCA/ISIC/CACSD, September 2004.\n\n[Figure 3 plots, panels (a)\u2013(c). x-axis in (a) and (b): ratios of solutions of objective functions of the two algorithms; x-axis in (c): normalized difference; y-axis: number of instances; regions labelled Better and Worse.]", "award": [], "sourceid": 140, "authors": [{"given_name": "Vikas", "family_name": "Singh", "institution": null}, {"given_name": "Lopamudra", "family_name": "Mukherjee", "institution": null}, {"given_name": "Jiming", "family_name": "Peng", "institution": null}, {"given_name": "Jinhui", "family_name": "Xu", "institution": null}]}