{"title": "Consistent Minimization of Clustering Objective Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 968, "abstract": null, "full_text": "Consistent Minimization of Clustering Objective\n\nFunctions\n\nUlrike von Luxburg\n\nMax Planck Institute for Biological Cybernetics\n\nS\u00b4ebastien Bubeck\n\nINRIA Futurs Lille, France\n\nulrike.luxburg@tuebingen.mpg.de\n\nsebastien.bubeck@inria.fr\n\nStefanie Jegelka\n\nMax Planck Institute for Biological Cybernetics\n\nMichael Kaufmann\n\nUniversity of T\u00a8ubingen, Germany\n\nstefanie.jegelka@tuebingen.mpg.de\n\nmk@informatik.uni-tuebingen.de\n\nAbstract\n\nClustering is often formulated as a discrete optimization problem. The objective is\nto \ufb01nd, among all partitions of the data set, the best one according to some quality\nmeasure. However, in the statistical setting where we assume that the \ufb01nite data\nset has been sampled from some underlying space, the goal is not to \ufb01nd the best\npartition of the given sample, but to approximate the true partition of the under-\nlying space. We argue that the discrete optimization approach usually does not\nachieve this goal. As an alternative, we suggest the paradigm of \u201cnearest neighbor\nclustering\u201d. Instead of selecting the best out of all partitions of the sample, it only\nconsiders partitions in some restricted function class. Using tools from statistical\nlearning theory we prove that nearest neighbor clustering is statistically consis-\ntent. Moreover, its worst case complexity is polynomial by construction, and it\ncan be implemented with small average case complexity using branch and bound.\n\n1 Introduction\nClustering is the problem of discovering \u201cmeaningful\u201d groups in given data. 
Many algorithms try to achieve this by minimizing a certain quality function Qn, for example graph cut objective functions such as ratio cut or normalized cut, or various criteria based on some function of the within- and between-cluster similarities. The objective of clustering is then stated as a discrete optimization problem. Given a data set Xn = {X1, ..., Xn} and a clustering quality function Qn, the ideal clustering algorithm should take into account all possible partitions of the data set and output the one that minimizes Qn. The implicit understanding is that the "best" clustering can be any partition out of the set of all possible partitions of the data set. The algorithmic challenge is to construct an algorithm which is able to find this clustering. We will call this approach the "discrete optimization approach to clustering".

If we look at clustering from the perspective of statistical learning theory we assume that the finite data set has been sampled from an underlying data space X according to some probability measure. The ultimate goal in this setting is not to discover the best possible partition of the data set Xn, but to learn the "true clustering" of the underlying space. In an approach based on quality functions, this "true clustering" can be defined easily. We choose a clustering quality function Q on the set of partitions of the entire data space X, and define the true clustering f* to be the partition minimizing Q. In this setting, a very important property of a clustering algorithm is consistency. Denoting the clustering constructed on the finite sample by fn, we require that Q(fn) converges to Q(f*) when n → ∞. The most important insight of statistical learning theory is that in order to be consistent, learning algorithms have to choose their functions from some "small" function space only.
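To get a feeling for the sizes involved, here is a quick back-of-the-envelope count (an illustration of ours, not a computation from the paper): there are 2^(n−1) − 1 ways to split n points into two non-empty groups, while a class that only assigns labels to m = ⌈log n⌉ seed points, as the nearest neighbor classes introduced later, contains just K^m functions.

```python
from math import ceil, log

def num_all_bipartitions(n):
    # Number of ways to split n points into two non-empty,
    # unordered groups: 2^(n-1) - 1.
    return 2 ** (n - 1) - 1

def num_seed_labelings(n, K=2):
    # Size of a class that labels only m = ceil(log n) seed points,
    # all other points following their nearest seed: K^m.
    m = ceil(log(n))
    return K ** m

for n in (20, 100, 1000):
    print(n, num_all_bipartitions(n), num_seed_labelings(n))
```

Already for n = 100 the first count has about 30 decimal digits while the second is 32; the restricted class grows only polynomially in n.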
To measure the size of a function space F one uses the quantity NF(x1, ..., xn) which denotes the number of ways in which the points x1, ..., xn can be partitioned by functions in F. One can prove that in the standard setting of statistical learning theory, a necessary condition for consistency is that E log NF(x1, ..., xn)/n → 0 (cf. Theorem 2.3 in Vapnik, 1995, Section 12.4 of Devroye et al., 1996). Stated like this, it becomes apparent that the two viewpoints described above are not compatible with each other. While the discrete optimization approach on any given sample attempts to find the best of all (exponentially many) partitions, the statistical learning theory approach restricts the set of candidate partitions to have sub-exponential size. Hence, from the statistical learning theory perspective, an algorithm which is considered ideal in the discrete optimization setting is likely to overfit. One can construct simple examples (cf. Bubeck and von Luxburg, 2007) which show that this indeed can happen: here the partitions constructed on the finite sample do not converge to the true clustering of the data space. In practice, for most cases the discrete optimization approach cannot be performed perfectly as the corresponding optimization problem is NP hard. Instead, people resort to heuristics. One approach is to use local optimization procedures potentially ending in local minima only (this is what happens in the k-means algorithm). Another approach is to construct a relaxation of the original problem which can be solved efficiently (spectral clustering is an example for this). In both cases, one usually cannot guarantee how close the heuristic solution is to the global finite sample optimum.
This situation is clearly unsatisfactory: for most clustering algorithms, we neither have guarantees on the finite sample behavior of the algorithm, nor on its statistical consistency in the limit.

The following alternative approach looks much more promising. Instead of attempting to solve the discrete optimization problem over the set of all partitions, and then resorting to relaxations due to the NP-hardness of this problem, we turn the tables. Directly from the outset, we only consider candidate partitions in some restricted class Fn containing only polynomially many functions. Then the discrete optimization problem of minimizing Qn over Fn is no longer NP hard: it can trivially be solved in polynomially many steps by trying all candidates in Fn. From a theoretical point of view this approach has the advantage that the resulting clustering algorithm has the potential of being consistent. In addition, it also leads to practical benefits: rather than dealing with uncontrolled relaxations of the original problem, we restrict the function class to some small enough subset Fn of "reasonable" partitions. Within this subset, we then have complete control over the solution of the optimization problem and can find the global optimum. Put another way, one can also interpret this approach as some controlled way of sparsifying the NP hard optimization problem, with the positive side effect of obeying the rules of statistical learning theory.

2 Nearest neighbor clustering

In the following we assume that we are given a set of data points Xn = {X1, ..., Xn} and pairwise distances dij = d(Xi, Xj) or pairwise similarities sij = s(Xi, Xj). Let Qn be the finite sample quality function to optimize on the sample. To follow the approach outlined above we have to optimize Qn over a "small" set Fn of partitions of Xn.
Essentially, we have three requirements on Fn: First, the number of functions in Fn should be at most polynomial in n. Second, in the limit of n → ∞ the class Fn should be rich enough to approximate any measurable partition of the underlying space. Third, in order to perform the optimization we need to be able to enumerate all members of this class, that is the function class Fn should be "constructive" in some sense. A convenient choice satisfying all those properties is the class of "nearest neighbor partitions". This class contains all functions which can be generated as follows. Fix a subset of m ≪ n "seed points" Xs1, ..., Xsm among the given data points. Assign all other data points to their closest seed points, that is for all j = 1, ..., m define the set Zj as the subset of data points whose nearest seed point is Xsj. Then consider all partitions of Xn which are constant on the sets Zj. More formally, for given seeds we define the set Fn as the set of all functions f : X → {1, ..., K} which are constant on the cells of the Voronoi partition induced by the seeds. Here K denotes the number of clusters we want to construct. The function class Fn contains K^m functions, which is polynomial in n if the number m of seeds satisfies m = O(log n). Given Fn, the simplest polynomial-time optimization algorithm is then to evaluate Qn(f) for all f ∈ Fn and choose the solution fn = argmin_{f∈Fn} Qn(f). We call the resulting clustering the nearest neighbor clustering and denote it by NNC(Qn). In practice, the seeds will be chosen randomly among the given data points.

3 Consistency of nearest neighbor clustering

In this section we prove that nearest neighbor clustering is statistically consistent for many clustering quality functions. Due to the complexity of the proofs and the page restriction we can only present sketches of the proofs.
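Before turning to the proofs, the enumeration procedure of Section 2 can be sketched in a few lines (a minimal illustration of ours, not the authors' implementation; the within-cluster sum of squares helper and the skipping of degenerate labelings are our own choices):

```python
import itertools
import numpy as np

def nnc(X, quality, m, K=2, rng=None):
    """Nearest neighbor clustering: exhaustively minimize `quality`
    over all K^m labelings that are constant on the Voronoi cells
    of m randomly chosen seed points."""
    rng = np.random.default_rng(rng)
    n = len(X)
    seeds = rng.choice(n, size=m, replace=False)
    # Assign every point to its nearest seed (Euclidean distance).
    d = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
    cell = d.argmin(axis=1)  # cell[i] = index of the nearest seed of point i
    best_labels, best_q = None, np.inf
    for seed_labels in itertools.product(range(K), repeat=m):
        labels = np.array(seed_labels)[cell]
        if len(set(labels.tolist())) < K:
            continue  # skip degenerate labelings with empty clusters
        q = quality(X, labels)
        if q < best_q:
            best_labels, best_q = labels, q
    return best_labels, best_q

def wss(X, labels):
    # Example objective: within-cluster sum of squares.
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in set(labels.tolist()))
```

With m = O(log n) the loop runs over K^m = poly(n) candidates; on well-separated data any seed set that touches both groups lets the enumeration recover the natural split.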
All details can be found in von Luxburg et al. (2007). Let us start with some notation. For any clustering function f : Rd → {1, ..., K} we denote by the predicate A(f) a property of the function which can either be true or false. As an example, define A(f) to be true if all clusters have at least a certain minimal size. Moreover, we need to introduce a predicate An(f) which will be an "estimator" of A(f) based on the finite sample only. Let m := m(n) ≤ n be the number of seeds used in nearest neighbor clustering. To simplify notation we assume in this section that the seeds are the first m data points; all results remain valid for any other (even random) choice of seeds. As data space we use X = Rd. We define:

NNm(x) := NNm(n)(x) := argmin_{y ∈ {X1,...,Xm}} ||x − y||   (for x ∈ Rd)
F := {f : Rd → {1, ..., K} | f continuous P-a.e. and A(f) true}
Fn := F_{X1,...,Xn} := {f : Rd → {1, ..., K} | f satisfies f(x) = f(NNm(x)), and An(f) is true}
F̃n := ∪_{X1,...,Xn ∈ Rd} F_{X1,...,Xn}

Furthermore, let Q : F → R be the quality function we aim to minimize, and Qn : Fn → R an estimator of this quality function on a finite sample. With this notation, the true clustering f* on the underlying space and the nearest neighbor clustering fn introduced in the last section are given by

f* ∈ argmin_{f∈F} Q(f),   f*n ∈ argmin_{f∈Fn} Q(f),   and   fn ∈ argmin_{f∈Fn} Qn(f).

Later on we will also need to work with the function f̃*(x) := f*(NNm(x)).

As distance function between different clusterings f, g we will use

Ln(f, g) := P(f(X) ≠ g(X) | X1, ..., Xn)

(we need the conditioning in case f or g depend on the data, it has no effect otherwise).

Theorem 1 (Consistency of nearest neighbor clustering) Let (Xi)_{i∈N} be a sequence of points drawn i.i.d.
according to some probability measure P on Rd, and m := m(n) the number of seed points used in nearest neighbor clustering. Let Q : F → R be a clustering quality function, Qn : F̃n → R its estimator, and A(f) and An(f) some predicates. Assume that:

1. Qn(f) is a consistent estimator of Q(f) which converges sufficiently fast:
∀ε > 0,  K^m (2n)^((d+1)m²) sup_{f∈F̃n} P(|Qn(f) − Q(f)| > ε) → 0.

2. An(f) is an estimator of A(f) which is "consistent" in the following way:
P(An(f̃*) true) → 1   and   P(A(fn) true) → 1.

3. Q is uniformly continuous with respect to the distance Ln between F and Fn:
∀ε > 0 ∃δ(ε) > 0 ∀f ∈ F ∀g ∈ Fn : Ln(f, g) ≤ δ(ε) ⟹ |Q(f) − Q(g)| ≤ ε.

4. lim_{n→∞} m(n) = +∞.

Then nearest neighbor clustering as introduced in Section 2 is weakly consistent, that is Q(fn) → Q(f*) in probability.

Proof. (Sketch, for details see von Luxburg et al. (2007)). We split the term P(|Q(fn) − Q(f*)| ≥ ε) into its two sides P(Q(fn) − Q(f*) ≤ −ε) and P(Q(fn) − Q(f*) ≥ ε). It is a straightforward consequence of Condition (2) that the first term converges to 0. The main work consists in bounding the second term. As usual we consider the estimation and approximation errors

P(Q(fn) − Q(f*) ≥ ε) ≤ P(Q(fn) − Q(f*n) ≥ ε/2) + P(Q(f*n) − Q(f*) ≥ ε/2).

First we bound the estimation error.
In a few lines one can show that

P(Q(fn) − Q(f*n) ≥ ε/2) ≤ P(sup_{f∈Fn} |Qn(f) − Q(f)| ≥ ε/4).

Note that even though the right hand side resembles the standard quantities often considered in statistical learning theory, it is not straightforward to bound as we do not assume that Q(f) = E Qn(f). Moreover, note that the function class Fn is data dependent as the seed points used in the Voronoi partition are data points. To circumvent this problem, we replace the function class Fn by the larger class F̃n, which is not data dependent. Using symmetrization by a ghost sample (cf. Section 12.3 of Devroye et al., 1996), we then move the supremum out of the probability:

P(sup_{f∈F̃n} |Qn(f) − Q(f)| ≥ ε/4) ≤ 2 SK(F̃n, 2n) · [sup_{f∈F̃n} P(|Qn(f) − Q(f)| ≥ ε/16)] / [inf_{f∈F̃n} P(|Qn(f) − Q(f)| ≤ ε/8)].   (1)

Note that the unusual denominator in Eq. (1) emerges in the symmetrization step as we do not assume Q(f) = E Qn(f). The quantity SK(F̃n, 2n) denotes the shattering coefficient, that is the maximum number of ways that 2n points can be partitioned into K sets using the functions in F̃n. It is well known (e.g., Section 21.5 of Devroye et al., 1996) that the number of Voronoi partitions of n points using m cells in Rd is bounded by n^((d+1)m²), hence the number of nearest neighbor clusterings into K classes is bounded by SK(F̃n, n) ≤ K^m n^((d+1)m²). Under Condition (1) of the Theorem we now see that for fixed ε and n → ∞ the right hand side of (1) converges to 0. Thus the same holds for the estimation error.

To deal with the approximation error, observe that if An(f̃*) is true, then f̃* ∈ Fn, and by the definition of f*n we have Q(f*n) − Q(f*) ≤ Q(f̃*) − Q(f*) and thus

P(Q(f*n) − Q(f*) ≥ ε) ≤ P(An(f̃*) false) + P(f̃* ∈ Fn and Q(f̃*) − Q(f*) ≥ ε).   (2)

The first expression on the right hand side converges to 0 by Condition (2) in the theorem. Using Condition (3), we can bound the second expression in terms of the distance Ln to obtain

P(f̃* ∈ Fn, Q(f̃*) − Q(f*) ≥ ε) ≤ P(Q(f̃*) − Q(f*) ≥ ε) ≤ P(Ln(f*, f̃*) ≥ δ(ε)).

Now we use techniques from Fritz (1975) to show that if n is large enough, then the distance between a function f ∈ F evaluated at x and the same function evaluated at NNm(x) is small. Namely, for any f ∈ F and any ε > 0 there exists some b(δ(ε)) > 0 which does not depend on n and f such that

P(Ln(f, f(NNm(·))) > δ(ε)) ≤ (2/δ(ε)) e^(−m b(δ(ε))).

The quantity δ(ε) has been introduced in Condition (3). For every fixed ε, this term converges to 0 due to Condition (4), thus the approximation error vanishes. □

Now we want to apply our general theorem to particular objective functions. We start with the normalized cut. Let s : Rd × Rd → R+ be a similarity function which is upper bounded by a constant C.
For a clustering f : Rd → {1, ..., K} denote by fk(x) := 1_{f(x)=k} the indicator function of the k-th cluster. Define the empirical and true cut, volume, and normalized cut as follows:

cutn(fk) := (1/(n(n−1))) Σ_{i,j=1}^n fk(Xi)(1 − fk(Xj)) s(Xi, Xj),    cut(fk) := E_{X,Y} (fk(X)(1 − fk(Y)) s(X, Y)),
voln(fk) := (1/(n(n−1))) Σ_{i,j=1}^n fk(Xi) s(Xi, Xj),    vol(fk) := E_{X,Y} (fk(X) s(X, Y)),
Ncutn(f) := Σ_{k=1}^K cutn(fk)/voln(fk),    Ncut(f) := Σ_{k=1}^K cut(fk)/vol(fk).

Note that E Ncutn(f) ≠ Ncut(f), but E cutn(f) = cut(f) and E voln(f) = vol(f). We fix a constant a > 0, a sequence (an)_{n∈N} with an ≥ an+1 and an → a and define the predicates

A(f) is true  :⟺  vol(fk) > a  ∀k = 1, ..., K,
An(f) is true  :⟺  voln(fk) > an  ∀k = 1, ..., K.   (3)

Theorem 2 (Consistency of NNC(Ncutn)) Let (Xi)_{i∈N} be a sequence of points drawn i.i.d. according to some probability measure P on Rd and s : Rd × Rd → R+ be a similarity function which is upper bounded by a constant C. Let m := m(n) be the number of seed points used in nearest neighbor clustering, a > 0 an arbitrary constant, and (an)_{n∈N} a monotonically decreasing sequence with an → a. Then nearest neighbor clustering using Q := Ncut, Qn := Ncutn, and A and An as defined in (3) is weakly consistent if m(n) → ∞ and m² log n/(n(a − an)²) → 0.

Proof. We will check that all conditions of Theorem 1 are satisfied.
First we establish that

{|cutn(fk) − cut(fk)| ≤ aε} ∩ {|voln(fk) − vol(fk)| ≤ aε} ⊂ {|cutn(fk)/voln(fk) − cut(fk)/vol(fk)| ≤ 2ε}.

Applying the McDiarmid inequality to cutn and voln, respectively, we obtain that for all f ∈ F̃n

P(|Ncut(f) − Ncutn(f)| > ε) ≤ 4K exp(−n a² ε² / (8 C² K²)).

Together with m² log n/(n(a − an)²) → 0 this shows Condition (1) of Theorem 1. The proof of Condition (2) is rather technical, but in the end also follows by applying the McDiarmid inequality to voln(f). Condition (3) follows by establishing that for f ∈ F and g ∈ Fn we have

|Ncut(f) − Ncut(g)| ≤ (4CK/a) Ln(f, g).

In fact, Theorem 1 can be applied to a large variety of clustering objective functions. As examples, consider ratio cut, within-sum of squares, and the ratio of between- and within-cluster similarity:

RatioCut(f) := Σ_{k=1}^K cut(fk)/E fk(X),    RatioCutn(f) := Σ_{k=1}^K cutn(fk)/nk,
WSS(f) := E Σ_{k=1}^K fk(X) ||X − ck||²,    WSSn(f) := (1/n) Σ_{i=1}^n Σ_{k=1}^K fk(Xi) ||Xi − ck,n||²,
BW(f) := Σ_{k=1}^K cut(fk)/(vol(fk) − cut(fk)),    BWn(f) := Σ_{k=1}^K cutn(fk)/(voln(fk) − cutn(fk)).

Here nk := Σ_i fk(Xi)/n is the fraction of points in the k-th cluster, and ck,n := Σ_i fk(Xi) Xi/(n nk) and ck := E fk(X) X/E fk(X) are the empirical and true cluster centers.

Theorem 3 (Consistency of NNC(RatioCutn), NNC(WSSn), and NNC(BWn)) Let fn and f* be the empirical and true minimizers of nearest neighbor clustering using RatioCutn, WSSn, or BWn, respectively. Then, under conditions similar to the ones in Theorem 2, we have RatioCut(fn) → RatioCut(f*), WSS(fn) → WSS(f*), and BW(fn) → BW(f*) in probability. See von Luxburg et al. (2007) for details.

4 Implementation using branch and bound

It is an obvious question how nearest neighbor clustering can be implemented in a more efficient way than simply trying all functions in Fn. Promising candidates are branch and bound methods. They are guaranteed to achieve an optimal solution, but in most cases are much more efficient than a naive implementation. As an example we introduce a branch and bound algorithm for solving NNC(Ncut) for K = 2 clusters. For background reading see Brusco and Stahl (2005). First of all, observe that minimizing Ncutn over the nearest neighbor function set Fn is the same as minimizing Ncutm over all partitions of a contracted data set consisting of m "super-points" Z1, ..., Zm (super-point Zi contains all data points assigned to the i-th seed point), endowed with the "super-similarity" function s̄(Zs, Zt) := Σ_{Xi∈Zs, Xj∈Zt} s(Xi, Xj). Hence nearest neighbor clustering on the original data set with n points can be performed by directly optimizing Ncut on the contracted data set consisting of only m super-points. Assume we already determined the labels l1, ..., l_{i−1} ∈ {±1} of the first i−1 super-points. For those points we introduce the sets A = {Z1, ..., Z_{i−1}}, A− := {Zj | j < i, lj = −1}, A+ := {Zj | j < i, lj = +1}, for the remaining points the set B = {Zi, ..., Zm}, and the set V := A ∪ B of all points. By default we label all points in B with −1 and, in recursion level i, decide about moving Zi to cluster +1. Analogously to the notation fk of the previous section, in case K = 2 we can decompose Ncut(f) = cut(f+1) · (1/vol(f+1) + 1/vol(f−1)); we call the first term the "cut term" and the second term the "volume term".
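The empirical quantities above translate directly into code. The following sketch (ours, assuming cluster labels in {0, ..., K−1} and a symmetric similarity matrix S, with the diagonal terms of the double sum included as written in Section 3) computes Ncutn:

```python
import numpy as np

def ncut_empirical(S, labels, K=2):
    """Empirical normalized cut Ncut_n(f) = sum_k cut_n(f_k) / vol_n(f_k),
    with cut_n and vol_n both normalized by n(n-1) as in the text.
    S is a symmetric similarity matrix; labels[i] lies in {0,...,K-1}."""
    n = len(labels)
    norm = n * (n - 1)
    total = 0.0
    for k in range(K):
        fk = (labels == k).astype(float)          # indicator of cluster k
        cut_k = fk @ S @ (1.0 - fk) / norm        # mass leaving cluster k
        vol_k = fk @ S @ np.ones(n) / norm        # total mass of cluster k
        total += cut_k / vol_k
    return total
```

For K = 2 the loop reproduces the decomposition cut(f+1) · (1/vol(f+1) + 1/vol(f−1)), since cut(f+1) = cut(f−1) when S is symmetric.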
As it is standard in branch and bound, we have to investigate whether the "branch" of clusterings with the specific fixed labels on A could contain a solution which is better than all the previously considered solutions. We use two criteria for this purpose. The first one is very simple: assigning at least one point in B to +1 can only lead to an improvement if this either decreases the cut term or the volume term of Ncut. Necessary conditions for this are max_{j≥i} s̄(Zj, A+) − s̄(Zj, A−) ≥ 0 or vol(A+) ≤ vol(V)/2, respectively. If neither is satisfied, we retract. The second criterion involves a lower bound θl on the Ncut value of all solutions in the current branch.

Branch and bound algorithm for Ncut: f* = bbncut(S̄, i, f, θu) {
1. Set g := f; set A−, A+, and B as described in the text.
2. // Deal with special cases:
   - If i = m and A− = ∅ then return f.
   - If i = m and A− ≠ ∅:
     - Set gi = +1.
     - If Ncut(g) < Ncut(f) return g, else return f.
3. // Pruning:
   - If vol(A+) > vol(A ∪ B)/2 and max_{j≥i} (s̄(Zj, A+) − s̄(Zj, A−)) ≤ 0 return f.
   - Compute lower bound θl as described in the text.
   - If θl ≥ θu then return f.
4. // If no pruning possible, recursively call bbncut:
   - Set gi = +1, θ'u := min{Ncut(g), θu}, call g' := bbncut(S̄, i + 1, g, θ'u).
   - Set gi = −1, θ''u := min{Ncut(g'), θ'u}, call g'' := bbncut(S̄, i + 1, g, θ''u).
   - If Ncut(g') ≤ Ncut(g'') then return g', else return g''.
}

Figure 1: Branch and bound algorithm for NNC(Ncut) for K = 2. The algorithm is initially called with the super-similarity matrix S̄, i = 2, f = (+1, −1, ..., −1), and θu the Ncut value of f.
It compares θl to an upper bound θu on the optimal Ncut value, namely to the Ncut value of the best function we have seen so far. If θl ≥ θu then no improvement is possible by any clustering in the current branch of the tree, and we retract. To compute θl, assume we assign a non-empty set B+ ⊂ B to label +1 and the remaining set B− = B \ B+ to label −1. Using the conventions s̄(A, B) = Σ_{Zi∈A, Zj∈B} s̄ij and s̄(A, ∅) = 0, the cut term is bounded by

cut(A+ ∪ B+, A− ∪ B−) ≥ min_{j≥i} s̄(Zj, A+)   if A− = ∅,
cut(A+ ∪ B+, A− ∪ B−) ≥ s̄(A+, A−) + min_{j≥i} s̄(Zj, A−)   otherwise.   (4)

The volume term can be maximally decreased in case vol(A+) < vol(V)/2, when choosing B+ such that vol(A+ ∪ B+) = vol(A− ∪ B−) = vol(V)/2. If vol(A+) > vol(V)/2, then an increase of the volume term is unavoidable; this increase is minimal when we move one vertex only to A+:

1/vol(A+ ∪ B+) + 1/vol(A− ∪ B−) ≥ 4/vol(V)   if vol(A+) ≤ vol(V)/2,
1/vol(A+ ∪ B+) + 1/vol(A− ∪ B−) ≥ vol(V) / max_{j≥i} (vol(A+ ∪ Zj) vol(A− ∪ B \ Zj))   otherwise.   (5)

Combining both bounds we can now define the lower bound θl as the product of Eq. (4) and (5). The entire algorithm is presented in Fig. 1. On top of the basic algorithm one can apply various heuristics to improve the retraction behavior and thus the average running time of the algorithm. For example, in our experience it is of advantage to sort the super-points by decreasing degree, and from one recursion level to the next one alternate between first visiting branch gi = +1 and gi = −1.

5 Experiments

The main point about nearest neighbor clustering is its statistical consistency: for large n it reveals an approximately correct clustering. In this section we want to show that it also behaves reasonably on smaller samples.
Given an objective function Qn (such as WSS or Ncut) we compare the NNC results to heuristics designed to optimize Qn directly (such as k-means or spectral clustering). As numeric data sets we used classification benchmark data sets from different repositories (UCI repository, repository by G. Rätsch) and microarray data from Spellman et al. (1998). Moreover, we use graph data sets of the internet graph and of biological, social, and political networks: COSIN collection, collection by M. Newman, email data by Guimerà et al. (2003), electrical power network by Watts and Strogatz (1998), and protein interaction networks of Jeong et al. (2001) and Tsuda et al. (2005). Due to space constraints we focus on the case of constructing K = 2 clusters using the objective functions WSS and Ncut. We always set the number m of seed points for NNC to m = log n. In case of WSS, we compare the result of the k-means algorithm to the result of NNC using the WSS objective function and the Euclidean distance to assign data points to seed points.

Numeric data sets (each cell: training set value / test set value):

breast-c.   K-means 6.95±0.19 / 7.12±0.20    NNC(WSS) 7.04±0.21 / 7.12±0.22    SC 0.11±0.02 / 0.22±0.07    NNC(Ncut) 0.09±0.02 / 0.21±0.07
diabetis    K-means 6.62±0.22 / 6.72±0.22    NNC(WSS) 6.71±0.22 / 6.72±0.22    SC 0.03±0.02 / 0.04±0.03    NNC(Ncut) 0.03±0.02 / 0.05±0.05
german      K-means 18.26±0.27 / 18.35±0.30  NNC(WSS) 18.56±0.28 / 18.45±0.32  SC 0.02±0.02 / 0.04±0.08    NNC(Ncut) 0.02±0.02 / 0.03±0.03
heart       K-means 10.65±0.46 / 10.75±0.46  NNC(WSS) 10.77±0.47 / 10.74±0.46  SC 0.18±0.03 / 0.28±0.03    NNC(Ncut) 0.17±0.02 / 0.30±0.07
splice      K-means 68.99±0.24 / 69.03±0.24  NNC(WSS) 69.89±0.24 / 69.18±0.25  SC 0.36±0.10 / 0.58±0.09    NNC(Ncut) 0.44±0.16 / 0.66±0.18
bcw         K-means 3.97±0.26 / 3.98±0.26    NNC(WSS) 3.98±0.26 / 3.98±0.26    SC 0.02±0.01 / 0.04±0.01    NNC(Ncut) 0.02±0.01 / 0.08±0.07
ionosph.    K-means 25.72±1.63 / 25.76±1.63  NNC(WSS) 25.77±1.63 / 25.77±1.63  SC 0.06±0.03 / 0.12±0.11    NNC(Ncut) 0.04±0.01 / 0.14±0.12
pima        K-means 6.62±0.22 / 6.73±0.23    NNC(WSS) 6.73±0.23 / 6.73±0.23    SC 0.03±0.03 / 0.05±0.04    NNC(Ncut) 0.03±0.03 / 0.09±0.13
cellcycle   K-means 0.78±0.03 / 0.78±0.03    NNC(WSS) 0.78±0.03 / 0.78±0.02    SC 0.12±0.02 / 0.16±0.02    NNC(Ncut) 0.10±0.01 / 0.15±0.03

Network data:

ecoli.interact  NNC 0.06  SC 0.06
ecoli.metabol   NNC 0.03  SC 0.04
helico          NNC 0.16  SC 0.16
beta3s          NNC 0.00  SC 0.00
AS-19971108     NNC 0.02  SC 0.02
AS-19980402     NNC 0.01  SC 1.00
AS-19980703     NNC 0.02  SC 0.02
AS-19981002     NNC 0.04  SC 0.04
AS-19990114     NNC 0.08  SC 0.05
AS-19990402     NNC 0.11  SC 0.10
netscience      NNC 0.01  SC 0.01
polblogs        NNC 0.11  SC 0.11
power           NNC 0.00  SC 0.00
email           NNC 0.27  SC 0.27
yeastProtInt    NNC 0.04  SC 0.06
protNW1         NNC 0.00  SC 0.00
protNW2         NNC 0.08  SC 1.00
protNW3         NNC 0.01  SC 0.80
protNW4         NNC 0.03  SC 0.76

Table 1: Top: Numeric data. Results for the K-means algorithm and NNC(WSS) with Euclidean distance; spectral clustering (SC) and NNC(Ncut) with commute distance. The first value of each pair shows the result on the training set, the second value the result extended to the test set. Bottom: Network data. NNC(Ncut) with commute distance and spectral clustering, both trained on the entire graph.

Note that one cannot run K-means on pure network data, which does not provide coordinates. In case of Ncut, we use the Gaussian kernel as similarity function on the numeric data sets. The kernel width σ is set to the mean distance of a data point to its k-th nearest neighbor. We then build the k-nearest neighbor graph (both times using k = ln n). On the network data, we directly use the given graph.
For both types of data, we use the commute distance on the graph (e.g., Gutman and Xiao, 2004) as distance function to determine the nearest seed points for NNC.

In the first experiment we compare the values obtained by the different algorithms on the training sets. From the numeric data sets we generated z = 40 training sets by subsampling n/2 points. On each training set, we repeated all algorithms r = 50 times with different random initializations (the seeds in NNC; the centers in K-means; the centers in the K-means post-processing step in spectral clustering). Denoting the quality of an individual run of the algorithm by q, we then report the values mean_z(min_r q) ± standarddev_z(min_r q). For the network data sets we ran spectral clustering and NNC on the whole graph. Again we use r = 50 different initializations, and we report min_r q. All results can be found in Table 1. For both the numeric data sets and the network data sets we see that the training performance of NNC is comparable to the other algorithms. This is what we had hoped, and we find it remarkable as NNC is in fact a very simple clustering algorithm.

In the second experiment we try to measure the amount of overfitting induced by the different algorithms. For each of the numeric data sets we cluster n/2 points, extend the clustering to the other n/2 points, and then compute the objective function on the test set. For the extensions we proceed in a greedy way: for each test point, we add this test point to the training set and then give it the label +1 or −1 that leads to the smaller quality value on the augmented training set. We also tried several other extensions suggested in the literature, but the results did not differ much. To compute the test error, we then evaluate the quality function on the test set labeled according to the extension. For Ncut, we do this based on the k-nearest neighbor graph on the test set only.
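The greedy extension just described can be sketched as follows (our own illustration; wss_quality stands in for whichever objective Qn is being used, and labels 0/1 replace the paper's ±1):

```python
import numpy as np

def wss_quality(X, labels):
    # Example objective: within-cluster sum of squares (lower is better).
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in set(labels.tolist()))

def greedy_extend(X_train, labels_train, X_test, quality):
    """Greedy extension: each test point is temporarily added to the
    training set and receives the cluster label that yields the smaller
    quality value on the augmented training set."""
    test_labels = []
    for x in X_test:
        X_aug = np.vstack([X_train, x])
        best_label, best_q = None, np.inf
        for lab in (0, 1):
            labels_aug = np.append(labels_train, lab)
            q = quality(X_aug, labels_aug)
            if q < best_q:
                best_label, best_q = lab, q
        test_labels.append(best_label)
    return np.array(test_labels)
```

Each test point is labeled independently against the original training clustering, so the extension cost is linear in the number of test points.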
Note that this experiment does not make sense on the network data, as there is no default procedure to construct the subgraphs for training and testing. The results on the numeric data sets are reported in Table 1. We see that NNC performs roughly comparably to the other algorithms. This is not really what we wanted to obtain; our hope was that NNC obtains better test values as it is less prone to overfitting. The most likely explanation is that both K-means and spectral clustering already have reasonably good extension properties. This can be due to the fact that, like NNC, both algorithms consider only a certain subclass of all partitions: Voronoi partitions for K-means, and partitions induced by eigenvectors for spectral clustering. See below for more discussion.

6 Discussion

In this paper we investigate clustering algorithms which minimize quality functions. Our main point is that, as soon as we require statistical consistency, we have to work with "small" function classes Fn. If we even choose Fn to be polynomial, then all problems due to NP hardness of discrete optimization problems formally disappear as the remaining optimization problems become inherently polynomial. From a practical point of view, the approach of using a restricted function class Fn can be seen as a more controlled way of simplifying NP hard optimization problems than the standard approaches of local optimization or relaxation. Carefully choosing the function class Fn such that overly complex target functions are excluded, we can guarantee to pick the best out of all remaining target functions. This strategy circumvents the problem that solutions of local optimization or relaxation heuristics can be arbitrarily far away from the optimal solution.

The generic clustering algorithm we studied in this article is nearest neighbor clustering, which produces clusterings that are constant on small local neighborhoods.
We have proved that this algorithm is statistically consistent for a large variety of popular clustering objective functions. Thus, as opposed to other clustering algorithms such as the K-means algorithm or spectral clustering, nearest neighbor clustering is guaranteed to converge to a global minimizer of the objective function on the underlying space. This statement is much stronger than the results already known for K-means or spectral clustering. For K-means it has been proved that the global minimizer of the WSS (within-cluster sum of squares) objective function on the sample converges to a global minimizer on the underlying space (e.g., Pollard, 1981). However, as the standard K-means algorithm only discovers a local optimum on the discrete sample, this result does not apply to the algorithm used in practice. A related effect occurs for spectral clustering, which is a relaxation attempting to minimize Ncut (see von Luxburg (2007) for a tutorial). It has been shown that under certain conditions the solution of the relaxed problem on the finite sample converges to some limit clustering (e.g., von Luxburg et al., to appear). However, it has been conjectured that this limit clustering is not necessarily the optimizer of the Ncut objective function. So in both cases our consistency results represent an improvement: our algorithm provably converges to the true limit minimizer of WSS or Ncut, respectively. The same result also holds for a large number of alternative objective functions used for clustering.

References

M. Brusco and S. Stahl. Branch-and-Bound Applications in Combinatorial Data Analysis. Springer, 2005.
S. Bubeck and U. von Luxburg. Overfitting of clustering and how to avoid it. Preprint, 2007.
Data repository by G. Rätsch. http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.
Data repository by M. Newman. http://www-personal.umich.edu/~mejn/netdata/.
Data repository by UCI.
http://www.ics.uci.edu/~mlearn/MLRepository.html.
Data repository COSIN. http://151.100.123.37/data.html.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
J. Fritz. Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Trans. Inf. Th., 21(5):552–557, 1975.
R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Phys. Rev. E, 68(6):065103, 2003.
I. Gutman and W. Xiao. Generalized inverse of the Laplacian matrix and some applications. Bulletin de l'Académie Serbe des Sciences et des Arts (Cl. Math. Natur.), 129:15–23, 2004.
H. Jeong, S. Mason, A. Barabasi, and Z. Oltvai. Centrality and lethality of protein networks. Nature, 411:41–42, 2001.
D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.
P. Spellman, G. Sherlock, M. Zhang, V. Iyer, M. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 9(12):3273–3297, 1998.
K. Tsuda, H. Shin, and B. Schölkopf. Fast protein classification with multiple networks. Bioinformatics, 21(Supplement 1):ii59–ii65, 2005.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.
U. von Luxburg, S. Bubeck, S. Jegelka, and M. Kaufmann. Supplementary material to "Consistent minimization of clustering objective functions", 2007. http://www.tuebingen.mpg.de/~ule.
U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, to appear.
D. Watts and S. Strogatz.
Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.