{"title": "Clustering via LP-based Stabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 872, "abstract": "A novel center-based clustering algorithm is proposed in this paper. We first formulate clustering as an NP-hard linear integer program and we then use linear programming and the duality theory to derive the solution of this optimization problem. This leads to an efficient and very general algorithm, which works in the dual domain, and can cluster data based on an arbitrary set of distances. Despite its generality, it is independent of initialization (unlike EM-like methods such as K-means), has guaranteed convergence, and can also provide online optimality bounds about the quality of the estimated clustering solutions. To deal with the most critical issue in a center-based clustering algorithm (selection of cluster centers), we also introduce the notion of stability of a cluster center, which is a well defined LP-based quantity that plays a key role to our algorithm's success. Furthermore, we also introduce, what we call, the margins (another key ingredient in our algorithm), which can be roughly thought of as dual counterparts to stabilities and allow us to obtain computationally efficient approximations to the latter. Promising experimental results demonstrate the potentials of our method.", "full_text": "Clustering via LP-based Stabilities\n\nNikos Komodakis\nUniversity of Crete\n\nNikos Paragios\n\nEcole Centrale de Paris\n\nGeorgios Tziritas\nUniversity of Crete\n\nkomod@csd.uoc.gr\n\nINRIA Saclay Ile-de-France\n\ntziritas@csd.uoc.gr\n\nnikos.paragios@ecp.fr\n\nAbstract\n\nA novel center-based clustering algorithm is proposed in this paper. We \ufb01rst for-\nmulate clustering as an NP-hard linear integer program and we then use linear\nprogramming and the duality theory to derive the solution of this optimization\nproblem. 
This leads to an ef\ufb01cient and very general algorithm, which works in the\ndual domain, and can cluster data based on an arbitrary set of distances. Despite\nits generality, it is independent of initialization (unlike EM-like methods such as\nK-means), has guaranteed convergence, can automatically determine the number\nof clusters, and can also provide online optimality bounds about the quality of the\nestimated clustering solutions. To deal with the most critical issue in a\ncenter-based clustering algorithm (selection of cluster centers), we also introduce the\nnotion of stability of a cluster center, which is a well de\ufb01ned LP-based quantity\nthat plays a key role in our algorithm\u2019s success. Furthermore, we also introduce,\nwhat we call, the margins (another key ingredient in our algorithm), which can be\nroughly thought of as dual counterparts to stabilities and allow us to obtain\ncomputationally ef\ufb01cient approximations to the latter. Promising experimental results\ndemonstrate the potential of our method.\n\n1 Introduction\n\nClustering is considered as one of the most fundamental unsupervised learning problems. It lies\nat the heart of many important tasks in machine learning, pattern recognition, computer vision, data\nmining, biology, marketing, just to mention a few of its application areas. Most of the clustering\nmethods are center-based, thus trying to extract a set of cluster centers that best \u2018describe\u2019 the input\ndata. Typically, this translates into an optimization problem where one seeks to assign each input\ndata point to a unique cluster center such that the total sum of the corresponding distances is\nminimized. These techniques are extremely popular and they are thus essential even to other types of\nclustering algorithms such as Spectral Clustering methods [1],[2].\n\nCurrently, most center-based clustering methods rely on EM-like schemes for optimizing their\nclustering objective function [3]. 
K-means is the most characteristic (and perhaps the most widely\nused) technique from this class. It keeps greedily re\ufb01ning a current set of cluster centers based on\na simple gradient descent scheme. As a result, it can very easily get trapped to bad local minima\nand is extremely sensitive to initialization. It is thus likely to fail in problems with, e.g., a large\nnumber of clusters. A second very important drawback of many center-based clustering methods,\nwhich severely limits their applicability, is that they either require the input data to be of vectorial\nform and/or impose strong restrictions on the type of distance functions they can handle. Ideally,\none would like to be able to cluster data based on arbitrary distances. This is an important point\nbecause, by an appropriate choice of these distances, clustering results with completely different\ncharacteristics can be achieved [4]. In addition to that, one would prefer that the number of clusters\nis automatically estimated by the algorithm (e.g., as a byproduct of the optimization process) and\nnot given as input. In contrast to that, however, many algorithms assume that this number is known\na priori.\n\n1\n\n\fTo circumvent all the issues mentioned above, a novel center-based clustering algorithm is proposed\nin this paper. Similarly to other methods, it reduces clustering to a well-de\ufb01ned (but NP-hard)\nminimization problem, where, of course, the challenge now is how to obtain solutions of minimum\nobjective value. To this end, we rely on the fact that the above problem admits a linear integer\nprogramming formulation. By making heavy use of a dual LP relaxation to that program, we then\nmanage to derive a dual based algorithm for clustering. 
As in all center-based clustering techniques,\nthe most critical component in the resulting algorithm is deciding what cluster centers to choose.\nTo this end, we introduce, what we call, the stability of a data point as a cluster center (this is an\nLP-based quantity), which we consider as another contribution of this work. Intuitively, the stability\nof a data point as a cluster center tries to measure how much we need to penalize that point (by\nappropriately modifying the objective function) such that it can no longer be chosen as a center in\nan optimal solution of the modi\ufb01ed problem. Obviously, one would like to choose as centers those\npoints having high stability. For applying this idea in practice, however, a crucial issue that one needs\nto deal with is how to ef\ufb01ciently approximate these stability measures. To this end, we introduce,\nwhat we call, the margins, another very important concept in our algorithm and a key contribution of\nour work. As we prove in this paper, margins can be considered as dual to stabilities. Furthermore,\nthey allow us to approximate the latter on the \ufb02y, i.e., as our algorithm runs. The outcome is an\nef\ufb01cient and very easily implementable optimization algorithm, which works in the dual domain\nby iteratively updating a dual solution via two very simple operations: DISTRIBUTE and PROJECT.\nIt can cluster data based on an arbitrary set of distances, which is the only input required by the\nalgorithm (as a result, it can \ufb01nd use in a wide variety of applications, even in case where non-\nvectorial data need to be used). Furthermore, an important point is that, despite its generality, it does\nnot get trapped to bad local minima. It is thus insensitive to initialization and can always compute\nclusterings of very low cost. Similarly to [5], the number of clusters does not need to be prede\ufb01ned,\nbut is decided on the \ufb02y during the optimization process. 
However, unlike [5], convergence of the\nproposed method is always guaranteed and no parameter adjustment is needed for this.\nFinally, an additional advantage of our method is that it can provide online optimality guarantees,\nwhich can be used for assessing the quality of the generated clusterings. These guarantees come in\nthe form of lower bounds on the cost of the optimal clustering and are computed (for free) by simply\nusing the cost of the dual solutions generated during the course of the algorithm.\n\n2 Clustering via stabilities based on Linear Programming\n\nGiven a set of objects V with distances d = {dpq}, clustering amounts to choosing a set of cluster\ncenters from V (say $\\{q_i\\}_{i=1}^k$) such that the sum of distances between each object and its closest\ncenter is minimized. To this end, we are going to use the following objective function E(\u00b7) (which\nwill be referred to as the primal cost hereafter):\n\n$$\\min_{k, \\{q_i\\}_{i=1}^k} E(\\{q_i\\}_{i=1}^k) = \\sum_{p \\in V} \\min_i d_{p q_i} + \\sum_i d_{q_i q_i} \\tag{1}$$\n\nNote that, in this case, we require that each cluster center is chosen from the set V. Also note that, besides\n{qi}, here we optimize over the number of cluster centers k as well. Of course, to avoid the trivial\nsolution of choosing all objects as centers, we regularize the problem by assigning a penalty dqq to\neach chosen center q. Problem (1) has an equivalent formulation as a 0\u22121 linear integer program\n[6], whose relaxation leads to the following LP (denoted by PRIMAL hereafter):\n\n$$\\text{PRIMAL} \\equiv \\min \\sum_{p,q \\in V} d_{pq} x_{pq} \\tag{2}$$\n$$\\text{s.t.} \\quad \\sum_{q \\in V} x_{pq} = 1 \\tag{3}$$\n$$x_{pq} \\le x_{qq} \\tag{4}$$\n$$x_{pq} \\ge 0 \\tag{5}$$\n\nTo get an equivalent problem to (1), we simply have to replace xpq \u2265 0 with xpq \u2208 {0, 1}. In this\ncase, each binary variable xpq with p \u2260 q indicates whether object p has been assigned to cluster\ncenter q or not, while binary variable xqq indicates whether object q has been chosen as a cluster\ncenter or not. 
Constraints (3) simply express the fact that each object must be assigned to exactly\none center, while constraints (4) require that if p has been assigned to q then object q must obviously\nbe chosen as a center.\n\n2\n\n\fObviously at the core of any clustering problem of this type lies the issue of deciding which objects\nwill be chosen as centers. To deal with that, a key idea of our approach is to rely on, what we call, the\nstability of an object. This will be a well de\ufb01ned measure which, intuitively, tries to quantitatively\nanswer the following question: \u201cHow much do we need to penalize an object in order to ensure that\nit is never selected as an optimal cluster center?\u201d For formalizing this concept, we will make use\nof the LP relaxation PRIMAL. We will thus de\ufb01ne the stability S(q) of an object q as follows:\n\nS(q) = inf{perturbation s that has to be applied to penalty dqq (i.e., dqq \u2190 dqq + s)\n\n(6)\n\nsuch that PRIMAL has no optimal solution x with xqq > 0}\n\nAn object q can be stable or unstable depending on whether it holds S(q) \u2265 0 or S(q) < 0. To\nselect a set of centers Q, we will then rely on the following observation: a stable object with high\nstability is also expected to be, with high probability, an optimal center in (1). The reason is that the\nassumption of a high S(q) \u2265 0 is essentially a very strong requirement (much stronger than simply\nrequiring q to be active in the relaxed problem PRIMAL): it further requires that q will be active for all\nproblems PRIMAL(dqq + s)1 as well (where s \u2264 S(q)). Hence, our strategy for generating Q will be\nto sequentially select a set of stable objects, trying, at each step, to select an object of approximately\nmaximum stability (as already explained, there is high chance that this object will be an optimal\ncenter in (1)). 
Furthermore, each time we insert a stable object q into Q, we reestimate stabilities for\nthe remaining objects in order to take this fact into account (e.g., an object may become unstable if\nwe know that it holds xqq = 1 for another object q). To achieve that, we will need to impose extra\nconstraints to PRIMAL (as we shall see, this will help us to obtain an accurate estimation for the\nstabilities of the remaining objects given that objects in Q are already chosen as centers). Of course,\nthis process repeats until no more stable objects can be found.\n\n2.1 Margins and dual-based clustering\n\nFor having a practical algorithm, the most critical issue is how to obtain a rough approximation to\nthe stability of an object q in a computationally ef\ufb01cient manner. As we shall see, to achieve this\nwe will need to move to the dual domain and introduce a novel concept that lies at the core of\nour approach: the margin of dual solutions. But, \ufb01rst, we need to introduce the dual to problem\nPRIMAL, which is the linear program called DUAL in (7) (see footnote 2):\n\n$$\\text{DUAL} \\equiv \\max D(h) = \\sum_{p \\in V} h_p \\tag{7}$$\n$$\\text{s.t.} \\quad h_p = \\min_{q \\in V} h_{pq}, \\quad \\forall p \\in V \\tag{8}$$\n$$\\sum_{p \\in V} h_{pq} = \\sum_{p \\in V} d_{pq}, \\quad \\forall q \\in V \\tag{9}$$\n$$h_{pq} \\ge d_{pq}, \\quad \\forall p \\ne q \\tag{10}$$\n\nDual variables hpq can be thought of as representing pseudo-distances between objects, while each\nvariable hp represents the minimum pseudo-distance from p (which is, in fact, \u2018thought\u2019 by the dual\nas an estimation of the actual distance between p and its closest active center).\n\nGiven a feasible dual solution h, we can now de\ufb01ne its margin \u2206q(h) (with respect to object q) as\nfollows:\n\n$$\\Delta_q(h) = \\sum_{p: h_{pq} = h_p} (\\hat{h}_p - h_p) - \\sum_{p \\ne q} (h_{pq} - \\max(h_p, d_{pq})) - (h_{qq} - h_q) , \\tag{11}$$\n\nwhere (for any h) $\\hat{h}_p$ hereafter denotes the next-to-minimum pseudo-distance from p.\n\nThere is a very tight connection between margins of dual solutions and stabilities of objects. The\nfollowing lemma provides a \ufb01rst indication for this fact and shows that we can actually use margins\nto decide whether an object is stable or not and also to lower bound or upper bound its stability\naccordingly (see [7] for proofs):\n\nLemma 1 ([7]). Let h be an optimal dual solution to DUAL.\n\n1. If \u2206q(h) > 0 then S(q) \u2265 \u2206q(h).\n\n2. If \u2206q(h) < 0 then S(q) \u2264 \u2206q(h).\n\nFootnote 1: PRIMAL(z) denotes a modi\ufb01ed problem PRIMAL where the penalty for q has been set equal to z.\nFootnote 2: Problem DUAL results from the standard dual to PRIMAL after applying a transformation to the dual variables.\n\nIn fact, the following fundamental theorem goes even further by proving that stabilities can be fully\ncharacterized solely in terms of margins. Hence, margins and stabilities are two concepts that can\nbe roughly considered as dual to each other:\n\nTheorem 2 ([7]). 
The following equalities hold true:\n\n$$S(q) \\ge 0 \\Rightarrow S(q) = \\sup\\{\\Delta_q(h) \\mid h \\text{ optimal solution to DUAL}\\} , \\tag{12}$$\n$$S(q) \\le 0 \\Rightarrow S(q) = \\inf\\{\\Delta_q(h) \\mid h \\text{ optimal solution to DUAL}\\} . \\tag{13}$$\n\nFurthermore, it can be shown that:\n\n$$S(q) = \\mathrm{sign}(S(q)) \\cdot \\sup\\{|\\Delta_q(h)| \\mid h \\text{ optimal solution to DUAL}\\} . \\tag{14}$$\n\nWhat the above theorem essentially tells us is that one can compute S(q) exactly, simply by\nconsidering the margins of optimal dual solutions. Based on this fact, it is therefore safe to assume that\nsolutions h with high (but not necessarily maximum) dual objective D(h) will have margins that\nare good approximations to S(q), i.e., it holds:\n\n$$S(q) \\approx \\Delta_q(h) . \\tag{15}$$\n\nThis is exactly the idea that our clustering algorithm will rely on in order to ef\ufb01ciently discover\nobjects that are stable. It thus maintains a dual solution h and a set Q containing all stable objects\nchosen as centers up to the current point (Q is empty initially). At each iteration, it increases the\ndual objective D(h) by updating solution h via an operation called DISTRIBUTE. This operation is\nrepeatedly applied until a high enough objective value D(h) is obtained such that at least one stable\nobject is revealed based on the estimated margins of h. At that point, the set Q is expanded and h is\nupdated (via an operation called PROJECT) to take account of this fact. The process is then repeated\nuntil no more stable objects can be found. A remarkable thing to note in this process is that, as we\nshall see, determining how to update h during the DISTRIBUTE operation (i.e., for increasing the\ndual objective) also relies critically on the use of margins.\n\nAnother technical point that we need to solve comes from the fact that Q gets populated with objects\nas the algorithm proceeds, which is something that we certainly need to take into account when\nestimating object stabilities. 
Fortunately, there is a very elegant solution to this problem: since all\nobjects in Q are assumed to be cluster centers (i.e., it holds xqq = 1, \u2200q \u2208 Q), instead of working\nwith problems PRIMAL and DUAL, it suf\ufb01ces that one works with the following primal-dual pair of\nLPs called PRIMALQ and DUALQ (see footnote 3):\n\n$$\\text{PRIMAL}_Q = \\min \\text{PRIMAL} \\quad \\text{s.t.} \\quad x_{qq} = 1, \\quad \\forall q \\in Q$$\n$$\\text{DUAL}_Q = \\max \\text{DUAL} \\quad \\text{s.t.} \\quad h_{pq} = d_{pq}, \\quad \\forall \\{p, q\\} \\cap Q \\ne \\emptyset$$\n\nFootnote 3: Actually, to represent the dual of PRIMALQ exactly, we need to add a constant in the objective function of\nDUALQ. Since, however, this constant does not affect maximization, it is thus omitted for clarity.\n\nThis means, e.g., that stability S(q) is now de\ufb01ned by using PRIMALQ (instead of PRIMAL) in (6).\nLikewise, Lemma 1 and Theorem 2 still continue to hold true provided that DUAL is replaced with\nDUALQ in the statement of these theorems. In addition to that, the de\ufb01nition of margin \u2206q(h) needs\nto be modi\ufb01ed as follows:\n\n$$\\Delta_q(h) = \\sum_{p \\notin Q: h_{pq} = h_p} (\\hat{h}_p - h_p) - \\sum_{p \\notin Q \\cup \\{q\\}} (h_{pq} - \\max(h_p, d_{pq})) - (h_{qq} - h_q) . \\tag{16}$$\n\nThe PROJECT operation: Given this modi\ufb01ed de\ufb01nition of margins, we can now update Q at any\niteration in the following manner:\n\n$$\\text{EXPAND: Compute } \\bar{q} = \\arg\\max_{q \\notin Q} \\Delta_q(h) \\text{ and if } \\Delta_{\\bar{q}}(h) \\ge 0 \\text{ then set } Q = Q \\cup \\{\\bar{q}\\} . \\tag{17}$$\n\nBased on the fact that margins are used as approximations to the stabilities of objects, the above\nupdate simply says that the object q\u0304 with maximum stability should be chosen as the new center at\nthe current iteration, provided of course that this object q\u0304 is stable. Furthermore, in this case, we also\nneed to update the current dual solution h in order to take account of the fact that extra constraints\nhave been added to DUALQ (these are a result of the extra constraint $x_{\\bar{q}\\bar{q}} = 1$ that has been added to\nPRIMALQ). By de\ufb01nition of DUALQ, the new constraints are $h_{\\bar{q}p} = d_{\\bar{q}p}$, $h_{p\\bar{q}} = d_{p\\bar{q}}$ for all p \u2209 Q\nand, so, one has to apply the following operation, which simply projects the current dual solution\ninto the feasible set of the updated linear program DUALQ:\n\n$$\\text{PROJECT: } h_{pp} \\mathrel{+}= h_{\\bar{q}p} - d_{\\bar{q}p}, \\quad h_{\\bar{q}p} = d_{\\bar{q}p}, \\quad h_{p\\bar{q}} = d_{p\\bar{q}}, \\quad \\forall p \\notin Q . \\tag{18}$$\n\nNote that update $h_{pp} \\mathrel{+}= h_{\\bar{q}p} - d_{\\bar{q}p}$ is needed for maintaining dual feasibility constraint (9).\nEssentially, PROJECT is a warm-start operation, that allows us to reuse existing information for computing\na solution h that has a high dual objective value D(h) and is also feasible to the updated DUALQ.\n\n1: h \u2190 d;\n2: while max_{q \u2209 Q} \u2206q(h) < 0 do\n3:     Dprev \u2190 D(h); h \u2190 DISTRIBUTE(h);\n4:     if Dprev = D(h) then exit;\n5: end\n6: q\u0304 \u2190 arg max_{q \u2209 Q} \u2206q(h); Q \u2190 Q \u222a {q\u0304}; h \u2190 PROJECT(h);\n7: goto 2;\n\nFig. 1: Pseudocode of our clustering algorithm.\n\nThe DISTRIBUTE operation: In case it holds \u2206q(h) < 0 for all q \u2209 Q, this means that we are\nunable to \ufb01nd an object with good stability at the current iteration. To counter that, we will thus\nneed to update solution h in order to increase its dual objective value (recall that, by Lemma 1, stable\nobjects will necessarily be revealed at an optimal dual solution, i.e., at a dual solution of maximum\nobjective). Intuitively, what happens is that as we increase the dual objective D(h), objects not in\nQ actually try to compete with each other for achieving a large margin. 
Interestingly enough, in\norder to increase D(h), we will again have to rely on the margins of the current dual solution. In\nparticular, it turns out that, if \u2206q(h) < 0 holds true for all q \u2209 Q, then the following very simple\nupdate of h is guaranteed to increase the dual objective:\n\n$$\\text{DISTRIBUTE: } \\forall p, q \\notin Q, \\quad h_{pq} = \\begin{cases} \\max(h_p, d_{pq}), & \\text{if } p \\ne q \\text{ AND } (p \\in L_Q \\text{ OR } h_p < d_{pq}) \\\\ h_p - \\frac{\\Delta_q(h)}{|V_q|}, & \\text{else if } h_{pq} > h_p \\\\ \\hat{h}_p - \\frac{\\Delta_q(h)}{|V_q|}, & \\text{else if } h_{pq} = h_p \\end{cases}$$\n\nIn the above update, we denote by LQ the set of objects whose minimum pseudo-distance hp is\nattained at an object from Q, i.e., $L_Q = \\{p \\notin Q \\mid h_p = \\min_{q \\in Q} h_{pq}\\}$, while |Vq| denotes the\ncardinality of the set $V_q = \\{p \\notin Q \\cup L_Q \\mid h_p \\ge d_{pq}\\} \\cup \\{q\\}$. The following theorem then holds true:\n\nTheorem 3. If max_{q \u2209 Q} \u2206q(h) < 0, then the DISTRIBUTE operation maintains feasibility and,\nunless V = Q \u222a LQ, it also strictly increases the dual objective.\n\nThe pseudocode of the resulting algorithm is shown in Fig. 1. As already explained, it is an iterative\nalgorithm, which keeps updating a dual solution h by using the DISTRIBUTE and PROJECT\noperations (the latter applied only when needed) until the dual objective can no longer increase. Note also\nthat, besides maintaining a dual solution h, the algorithm also maintains Q which provides a current\nclustering and also has a primal cost E(Q). With respect to this cost, the following theorem can be\nshown to hold true:\n\nTheorem 4. If max_{q \u2209 Q} \u2206q(h) > 0, then the EXPAND operation strictly decreases the primal cost\nE(Q).\n\nThis implies that the sequence of primal costs E(Q) generated by the algorithm is decreasing (recall\nthat we actually want to minimize E(\u00b7)).\nIt is worth noting at this point that nowhere have we\ntried to enforce this property by explicitly considering the primal cost when updating Q. 
This\nis achieved simply thanks to the requirement of always selecting objects with high stability, thus\nshowing how powerful this requirement actually is. We also note that the algorithm\u2019s convergence\nis always guaranteed: the algorithm terminates when neither the primal cost E(Q) decreases nor the\ndual objective D(h) increases during the current iteration. Finally, we note that exactly the same\nalgorithm applies to the general case where the objects in V form a graph with edges E (distance dpq\nis then de\ufb01ned only for pq \u2208 E). In this case, it is easy to verify that the cost of each iteration will be\nO(|E|). Furthermore, the algorithm converges extremely fast in practice (i.e. in very few iterations).\n\n5\n\n\f3 Related work\n\nBefore proceeding, let us brie\ufb02y mention how our method relates to some state-of-the-art exemplar-\nbased clustering techniques. Af\ufb01nity propagation [5] is a recently proposed method for clustering,\nwhich relies on minimizing exactly the same objective function (1). This is an iterative algorithm,\nwhich repeatedly updates (through messages) the so-called responsibilities and availabilities. These\ncan be considered as counterparts to our pseudo-distances hpq. Af\ufb01nity propagation also estimates\nthe so-called self-availabilities for measuring the likelihood of an object being a cluster center. On\nthe contrary, we use for the same purpose the margins that approximate the stability of an object.\nFurthermore, compared to af\ufb01nity propagation, our method offers the following signi\ufb01cant advan-\ntages: its convergence is always guaranteed, it is parameter-free (no need for adjusting parameters\nsuch as damping factors in order to ensure convergence), it is a descent method (objective func-\ntion (1) always decreases), and it can make use of the computed dual solutions for deriving online\noptimality bounds for free (these can be used for assessing that the derived solutions are almost\noptimal). 
At the same time, our method performs equally well or better in practice. Very recently,\nanother exemplar-based algorithm has been proposed as well, which relies on solving a convex\nformulation of clustering [8]. We note, however, that this method is used for solving a different and\nmuch easier problem, which is that of soft clustering. Furthermore, it relies on a convex relaxation\nwhich is known to be much less tight than the LP relaxation PRIMAL we use here (essentially [8]\nreplaces all constraints $x_{pq} \\le x_{qq}, \\forall p \\in V$ with the much looser constraint $\\sum_p x_{pq} \\le |V| \\cdot x_{qq}$).\nAs a result, generated solutions are expected to be of much lower quality. We also note that, unlike\nEM-like clustering algorithms such as K-means, our method is totally insensitive to initialization\nconditions and does not get stuck at bad local minima (thus yielding solutions of much better\nquality). Also, it is much more ef\ufb01cient than methods like [6], that require solving very large linear\nprograms.\n\n4 Experimental results\n\nTo illustrate the robustness of our algorithm to noise and its insensitivity to initialization, we start\nby showing clustering results on synthetic data. The synthetic datasets were generated using the\nfollowing procedure: 2D points were sampled from a mixture of Gaussian distributions, where the\ncenters of the Gaussians were arranged in an approximately grid-like fashion over the plane. In\naddition to that, random outliers were generated uniformly all over the grid, with their number being\nequal to half the number of the points drawn from the Gaussian distributions. One such dataset\n(consisting of 24 Gaussians) is displayed in Fig. 2, where colored crosses correspond to samples\nfrom Gaussians, while the black dots correspond to outliers. The clustering result produced by our\nalgorithm is shown in Fig. 2(a). 
As can be seen from that \ufb01gure, despite the heavy percentage of\nnoise, our method has been able to accurately detect all gaussian centers and successfully cluster\nthis 2D dataset. Note that the number of gaussians was not given as input to our algorithm. Instead,\nit was inferred based on a common penalty term dqq for all objects q, which was set roughly equal to\nthe median distance between points. On the contrary, K-means was unable to produce a good result\nfor this dataset despite the fact that it was restarted multiple times (100 runs were used in this case).\nThis is, of course, due to its well known sensitivity to initialization conditions. We repeated multiple\nexperiments by varying the number of gaussians. Contrary to our algorithm, behavior of K-means\ngets even worse as this number increases.\n\nWe have also plotted in Fig. 2(c) the primal and dual costs that were generated by our algorithm\nwhen it was applied to the example of Fig. 2(a). These correspond to the solid red and dashed blue\ncurves respectively. Note that the dual costs represent lower bounds to the optimum value of the\nobjective function E(\u00b7), while the primal costs represent obviously upper bounds. This fact allows\nus to obtain online optimality bounds with respect to how far our current primal solution Q is with\nrespect to the unknown optimum of E(\u00b7). These bounds are, of course, re\ufb01ned continuously as the\nalgorithm proceeds and can be useful for assessing its performance. For instance, in this particular\nexample, we can be sure that the primal cost of our \ufb01nal solution is within 1% of the unknown\noptimum of function E(\u00b7), i.e., an approximately optimal solution has been obtained.\n\nNext we show some results from applying our algorithm to the challenging problem of multibody 3D\nsegmentation, which has several applications in computer vision. As we shall see, a non-Euclidean\ndistance for clustering will have to be used in this case. 
According to the 3D segmentation problem,\nwe are given a set of N pixel correspondences between two images. These correspondences result\nfrom K objects undergoing K 3D rigid-body motions relative to a moving camera.\n\nFig. 2: Clustering results for synthetic data: (a) our algorithm, (b) K-means, (c) primal and dual costs. The centers of the big circles represent the points chosen as cluster\ncenters by the 2 algorithms. The primal and dual costs in (c) verify that the cost of our algorithm\u2019s solution is\nwithin 1% of the optimum cost.\n\nThe 3D-motion\nsegmentation problem is the task of clustering these N pixel pairs according to the K moving\nobjects. We consider the more general and dif\ufb01cult scenario of a fully projective camera model. In this\ncase, each pixel pair, say, pi = (yi, zi) that belongs to a moving object k should satisfy an epipolar\nconstraint:\n\n$$y_i^T F_k z_i = 0 , \\tag{19}$$\n\nwhere Fk represents the fundamental matrix associated with the k-th 3D motion. Of course, the\nmatrices Fk corresponding to different motions are unknown to us. Hence, to solve the 3D segmentation\nproblem, we need to estimate both the matrices Fk as well as the association of each pixel\npair pi = (yi, zi) to the correct fundamental matrix Fk. To this end, we sample a large set of\nfundamental matrices by using a RANSAC-based scheme (we recall that a random set of, e.g., 8 pixel\npairs pi is enough for generating a new fundamental matrix). The resulting matrices, say, {Fk} will\nthen correspond to cluster centers, whereas all the input pixel pairs {pi} will correspond to objects\nthat need to be assigned to an active cluster center. 
A clustering objective function of the form (1)\nthus results and by minimizing it we can also obtain a solution to the 3D segmentation problem. Of\ncourse, in this case, the distance function d(pi, Fk) between an object pi = (yi, zi) and a cluster\ncenter will not be Euclidean. Instead, based on (19), we can use a distance of the following form:\n\n$$d(p_i, F_k) = |y_i^T F_k z_i| . \\tag{20}$$\n\nBecause it is more robust, a normalized version of the above distance is usually preferred in practice.\nFigure 3 displays 3D motion segmentation results that were obtained by applying our algorithm to\ntwo image pairs (points with different colors correspond to different motions). These examples\nwere downloaded from a publicly available motion segmentation database [9] with ground-truth.\nThe ground-truth motion segmentation is also shown for each example and, as can be seen, it is\nalmost identical to the segmentation estimated by our algorithm.\n\nFig. 3: Two 3D motion segmentation results, (a) and (b). For each one we show (left) ground truth segmentation of feature\npoints and (right) estimated segmentation along with the input optical \ufb02ow vectors.\n\nDataset | Primal cost E(Q), Ours | Primal cost E(Q), AP | #clusters, Ours | #clusters, AP\nFaces | 13430 | 13454 | 60 | 62\nGenes | -210595 | -210539 | 1301 | 1290\nCities | 92154 | 92154 | 7 | 7\nSentences | 10234 | 10241 | 4 | 4\n\nFig. 4: (a) Comparison of our algorithm with af\ufb01nity propagation [5] on the 4 very challenging datasets \u2018Faces\u2019,\n\u2018Genes\u2019, \u2018Cities\u2019 and \u2018Sentences\u2019 from [5] (table above). Since the goal of both algorithms is to minimize objective function\nE(Q), for each dataset we report the \ufb01nal value of this function and the number of estimated clusters. We\nhave used exactly the same settings for both methods. (b) Our algorithm\u2019s clustering (\u2018Our exemplars\u2019) when applied to the\n\u2018four-clouds\u2019 dataset from [1]. The primal costs generated by AP for this dataset (shown in (c)) demonstrate that AP\nfails to converge in this case (to prevent that, a properly chosen damping factor has to be used).\n\nWe next compare our method to Af\ufb01nity Propagation (AP). Some really impressive results on 4\nvery challenging datasets have been reported for that algorithm in [5], indicating that it outperforms\nany other center-based clustering method. In particular, AP has been used for: clustering images\nof faces (using the squared error distance), detecting genes in microarray data (using a distance\nbased on exons\u2019 transcription levels), identifying representative sentences in manuscripts (using\nthe relative entropy as distance), and identifying cities that can be easily accessed by airline travel.\nIn Fig. 4(a), we compare our method to AP on these publicly available problems. Since both methods\nrely on optimizing the same objective function, we list the values obtained by the two methods for\nthe corresponding problems. Exactly the same settings have been used for both algorithms, with\nAP using the parameters proposed in [5]. Note that in all cases our algorithm manages to obtain\na solution of equal or lower value than AP. 
This holds even for the Genes dataset, for example, where\na higher number of clusters is selected by our algorithm (and thus a higher penalty for activating\nthem is paid). A further advantage of our algorithm is that, unlike AP, it is always\nguaranteed to converge (see, e.g., Figs. 4(b) and 4(c)). We note that, due to lack of space, a running time\ncomparison with AP, as well as a comparison of our algorithm to the method in [10], are included in\n[7].\n\n5 Conclusions\n\nIn this paper we have introduced a very powerful and efficient center-based clustering algorithm,\nderived from LP duality theory. The resulting algorithm has guaranteed convergence and can handle\ndata sets with arbitrary distance functions. Furthermore, despite its extreme generality, the proposed\nmethod is insensitive to initialization and computes clusterings of very low cost. As such, and\nconsidering the key role that clustering has in many problems, we believe that our method can find\nuse in a wide variety of tasks. We also consider the introduction of the notions of LP-based stabilities\nand margins to be another very important (both practical and theoretical) contribution of this work:\nthese are two quantities that, as we have proved, are dual to each other and can be used for deciding which\nobjects should be chosen as cluster centers. We strongly believe that these ideas can be of both\npractical and theoretical interest not just for designing center-based clustering algorithms, but also\nin many other contexts.\n\nReferences\n\n[1] A. Ng, M. Jordan, and Y. Weiss, \u201cOn spectral clustering: Analysis and an algorithm,\u201d in NIPS, 2001.\n\n[2] D. Verma and M. Meila, \u201cA comparison of spectral clustering algorithms,\u201d Tech. Rep., 2001.\n\n[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, \u201cClustering with Bregman divergences,\u201d J. Mach. Learn. Res., vol. 6, pp. 1705\u20131749, 2005.\n\n[4] B. Fischer, V. Roth, and J. 
Buhmann, \u201cClustering with the connectivity kernel,\u201d in NIPS, 2004.\n\n[5] B. J. Frey and D. Dueck, \u201cClustering by passing messages between data points,\u201d Science, vol. 315, 2007.\n\n[6] M. Charikar, S. Guha, \u00c9. Tardos, and D. B. Shmoys, \u201cA constant-factor approximation algorithm for the k-median problem,\u201d J. Comput. Syst. Sci., vol. 65, no. 1, pp. 129\u2013149, 2002.\n\n[7] N. Komodakis, N. Paragios, and G. Tziritas, \u201cClustering via LP-based Stabilities,\u201d Tech. Report, 2009.\n\n[8] D. Lashkari and P. Golland, \u201cConvex clustering with exemplar-based models,\u201d in NIPS, 2008.\n\n[9] R. Tron and R. Vidal, \u201cA benchmark for the comparison of 3-D motion segmentation algorithms,\u201d in CVPR, 2007.\n\n[10] M. Leone, Sumedha, and M. Weigt, \u201cClustering by soft-constraint affinity propagation: applications to gene-expression data,\u201d Bioinformatics, vol. 23, no. 20, pp. 2708\u20132715, 2007.\n", "award": [], "sourceid": 769, "authors": [{"given_name": "Nikos", "family_name": "Komodakis", "institution": null}, {"given_name": "Nikos", "family_name": "Paragios", "institution": null}, {"given_name": "Georgios", "family_name": "Tziritas", "institution": null}]}