{"title": "DIFFRAC: a discriminative and flexible framework for clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 49, "page_last": 56, "abstract": "We present a novel linear clustering framework (Diffrac) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.", "full_text": "DIFFRAC : a discriminative and flexible\n\nframework for clustering\n\nFrancis R. Bach\n\nINRIA - Willow Project\n\u00b4Ecole Normale Sup\u00b4erieure\n\n45, rue d\u2019Ulm, 75230 Paris, France\nfrancis.bach@mines.org\n\nZa\u00a8\u0131d Harchaoui\n\nLTCI, TELECOM ParisTech and CNRS\n\n46, rue Barrault\n\n75634 Paris cedex 13, France\n\nzaid.harchaoui@enst.fr\n\nAbstract\n\nWe present a novel linear clustering framework (DIFFRAC) which relies on a lin-\near discriminative cost function and a convex relaxation of a combinatorial op-\ntimization problem. The large convex optimization problem is solved through a\nsequence of lower dimensional singular value decompositions. 
This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance to K-means, in particular in terms of robustness to noise. (2) It can be readily extended to nonlinear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.\n\n1 Introduction\n\nMany clustering frameworks have already been proposed, with numerous applications in machine learning, exploratory data analysis, computer vision and speech processing. However, these unsupervised learning techniques have not reached the level of sophistication of supervised learning techniques: for all methods, there are still a significant number of explicit or implicit parameters to tune for successful clustering, most notably the number of clusters and the metric or similarity structure over the space of configurations.\nIn this paper, we present a discriminative and flexible framework for clustering (DIFFRAC), which is aimed at alleviating some of those practical annoyances. Our framework is based on a recent set of works [1, 2] that have used the support vector machine (SVM) cost function for linear classification as a clustering criterion, with the intuitive goal of looking for clusters which are most linearly separable. This line of work has led to promising results; however, the large convex optimization problems that have to be solved prevent application to datasets larger than a few hundred data points.1 In this paper, we consider the optimal value of regularized linear regression on indicator matrices. 
By choosing a square loss (instead of the hinge loss), we obtain a simple cost function which can be expressed in closed form and is amenable to specific efficient convex optimization algorithms that can deal with large datasets of 10,000 to 50,000 data points. Our cost function turns out to be a linear function of the \u201cequivalence matrix\u201d M, which is a square {0, 1}-matrix indexed by the data points, with value one for all pairs of data points that belong to the same cluster, and zero otherwise. In order to minimize this cost function with respect to M, we follow [1] and [2] by using convex outer approximations of the set of equivalence matrices, with a novel constraint on the minimum number of elements per cluster, which is based on the eigenvalues of M and is essential to the success of our approach.\n\n1Recent work [3] has looked at more efficient formulations.\n\nIn Section 2, we present a derivation of our cost function and of the convex relaxations. In Section 3, we show how the relaxed convex problem can be solved efficiently through a sequence of lower-dimensional singular value decompositions, while in Section 4, we show how a priori knowledge can be incorporated into our framework. Finally, in Section 5, we present simulations comparing our new set of algorithms to other competing approaches.\n\n2 Discriminative clustering framework\n\nIn this section, we first assume that we are given n points x1, . . . , xn in Rd, represented in a matrix X \u2208 Rn\u00d7d. We represent the various partitions of {1, . . . , n} into k > 1 clusters by indicator matrices y \u2208 {0, 1}n\u00d7k such that y1k = 1n, where 1k and 1n denote the constant vectors of all ones, of dimensions k and n. 
We let Ik denote the set of k-class indicator matrices.\n\n2.1 Discriminative clustering cost\n\nGiven y, we consider the regularized linear regression problem of y given X, which takes the form:\n\nmin_{w \u2208 Rd\u00d7k, b \u2208 R1\u00d7k} (1/n) \u2016y \u2212 Xw \u2212 1n b\u2016_F^2 + \u03ba tr w\u22a4w,   (1)\n\nwhere the Frobenius norm is defined for any vector or rectangular matrix as \u2016A\u2016_F^2 = tr AA\u22a4 = tr A\u22a4A. Denoting f(x) = w\u22a4x + b \u2208 Rk, this corresponds to a multi-label classification problem with square loss functions [4, 5]. The main advantage of this cost function is the possibility of (a) minimizing the regularized cost in closed form and (b) including a bias term by simply centering the data; namely, the global optimum is attained at w\u2217 = (X\u22a4\u03a0nX + n\u03baI)\u22121X\u22a4\u03a0n y and b\u2217 = (1/n) 1n\u22a4(y \u2212 Xw\u2217), where \u03a0n = In \u2212 (1/n) 1n1n\u22a4 is the usual centering projection matrix. The optimal value is then equal to\n\nJ(y, X, \u03ba) = tr yy\u22a4A(X, \u03ba),   (2)\n\nwhere the n \u00d7 n matrix A(X, \u03ba) is defined as:\n\nA(X, \u03ba) = (1/n) \u03a0n(In \u2212 X(X\u22a4\u03a0nX + n\u03baI)\u22121X\u22a4)\u03a0n.   (3)\n\nThe matrix A(X, \u03ba) is positive semidefinite, i.e., for all u \u2208 Rn, u\u22a4A(X, \u03ba)u \u2265 0, and 1n is a singular vector of A(X, \u03ba), i.e., A(X, \u03ba)1n = 0.\nFollowing [1] and [2], we are thus looking for a k-class indicator matrix y such that tr yy\u22a4A(X, \u03ba) is minimal, i.e., for a partition such that the clusters are most linearly separated, where the separability of clusters is measured through the minimum of the discriminative cost with respect to all linear classifiers. This combinatorial optimization is NP-hard in general [6], but efficient convex relaxations may be obtained, as presented in the next section.\n\n2.2 Indicator and equivalence matrices\n\nThe cost function defined in Eq. 
(2) only involves the matrix M = yy\u22a4 \u2208 Rn\u00d7n. We let Ek denote the set of \u201ck-class equivalence matrices\u201d, i.e., the set of matrices M such that there exists a k-class indicator matrix y with M = yy\u22a4.\nThere are many convex outer approximations of the discrete set Ek, based on different properties of matrices in Ek, that were used in different contexts, such as maximum cut problems [6] or correlation clustering [7]. We have the following usual properties of equivalence matrices (independent of k): if M \u2208 Ek, then (a) M is positive semidefinite (denoted as M \u227d 0), (b) M has nonnegative values (denoted as M \u2265 0), and (c) the diagonal of M is equal to 1n (denoted as diag(M) = 1n). Moreover, if M corresponds to at most k clusters, we have M \u227d (1/k) 1n1n\u22a4, which is a consequence of the convex outer approximation of [6] for the maximum k-cut problem. We thus use the following convex outer approximation:\n\nCk = {M \u2208 Rn\u00d7n : M = M\u22a4, diag(M) = 1n, M \u2265 0, M \u227d (1/k) 1n1n\u22a4} \u2283 Ek.\n\nNote that when k = 2, the constraint M \u2265 0 (pointwise nonnegativity) is implied by the other constraints.\n\n2.3 Minimum cluster sizes\n\nGiven the discriminative nature of our cost function (and in particular that A(X, \u03ba)1n = 0), the minimum value 0 is always obtained with M = 1n1n\u22a4, a matrix of rank one, equivalent to a single cluster. Given the number of desired clusters, we thus need to add some prior knowledge regarding the size of those clusters. 
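As a numerical sanity check of the closed-form cost above (Eqs. (1)-(3)), the following NumPy sketch (illustrative code written for this text, not the authors' implementation; all variable names are ours) forms A(X, \u03ba) and verifies that tr yy\u22a4A(X, \u03ba) coincides with the minimized regression cost and that 1n lies in the null space of A:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, kappa = 40, 5, 3, 0.1
X = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)
y = np.eye(k)[labels]                     # indicator matrix y in {0,1}^{n x k}

Pi = np.eye(n) - np.ones((n, n)) / n      # centering projection Pi_n
# A(X, kappa) = (1/n) Pi (I - X (X' Pi X + n kappa I)^{-1} X') Pi   (Eq. 3)
inner = np.linalg.inv(X.T @ Pi @ X + n * kappa * np.eye(d))
A = Pi @ (np.eye(n) - X @ inner @ X.T) @ Pi / n

# closed-form optimum of the regularized regression (Eq. 1)
w = inner @ X.T @ Pi @ y
b = np.ones(n) @ (y - X @ w) / n
cost = np.linalg.norm(y - X @ w - np.outer(np.ones(n), b)) ** 2 / n \
       + kappa * np.trace(w.T @ w)

assert np.allclose(cost, np.trace(y @ y.T @ A))   # Eq. (2)
assert np.allclose(A @ np.ones(n), 0)             # A(X, kappa) 1_n = 0
```

The check also makes the degeneracy discussed above concrete: since A1n = 0, the single-cluster matrix M = 1n1n\u22a4 always attains cost zero.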
Following [1], we impose a minimum size \u03bb0 for each cluster, through row sums and eigenvalues:\n\nRow sums If M \u2208 Ek, then M1n \u2265 \u03bb01n and M1n \u2264 (n \u2212 (k \u2212 1)\u03bb0)1n (each cluster size must be smaller than n \u2212 (k \u2212 1)\u03bb0 if all clusters are larger than \u03bb0); this is the same constraint as in [1].\n\nEigenvalues When M \u2208 Ek, the sizes of the clusters are exactly the k largest eigenvalues of M. Thus, for a matrix in Ek, the minimum cluster size constraint is equivalent to \u03a3_{i=1}^n 1_{\u03bbi(M) \u2265 \u03bb0} \u2265 k, where \u03bb1(M), . . . , \u03bbn(M) are the n eigenvalues of M. Functions of the form \u03a6(M) = \u03a3_{i=1}^n \u03c6(\u03bbi(M)) are referred to as spectral functions and are particularly interesting in machine learning and optimization, since \u03a6 inherits from \u03c6 many of its properties, such as differentiability and convexity [8]. The previous constraint can be seen as \u03a6(M) \u2265 k, with \u03c6(\u03bb) = 1_{\u03bb \u2265 \u03bb0}, which is not concave and thus does not lead to a convex constraint. In this paper we propose to use the concave upper envelope of this function, namely \u03c6_{\u03bb0}(\u03bb) = min{\u03bb/\u03bb0, 1}, thus leading to a novel additional constraint.\nOur final convex relaxation thus consists in minimizing tr A(X, \u03ba)M with respect to M \u2208 Ck such that \u03a6_{\u03bb0}(M) \u2265 k, M1n \u2265 \u03bb01n and M1n \u2264 (n \u2212 (k \u2212 1)\u03bb0)1n, where \u03a6_{\u03bb0}(M) = \u03a3_{i=1}^n min{\u03bbi(M)/\u03bb0, 1}. The clustering results are empirically robust to the value of \u03bb0. In all our simulations we use \u03bb0 = \u230an/2k\u230b.\n\n2.4 Comparison with K-means\n\nOur method bears some resemblance with the usual K-means algorithm. 
Indeed, in the unregularized case (\u03ba = 0), we aim to minimize\n\ntr \u03a0n(In \u2212 X(X\u22a4\u03a0nX)\u22121X\u22a4)\u03a0n yy\u22a4.\n\nResults from [9] show that K-means aims at minimizing the following criterion with respect to y:\n\nmin_{\u00b5 \u2208 Rk\u00d7d} \u2016X \u2212 y\u00b5\u2016_F^2 = tr (In \u2212 y(y\u22a4y)\u22121y\u22a4)(\u03a0nX)(\u03a0nX)\u22a4.\n\nThe main differences between the two cost functions are that (1) we require an additional parameter, namely the minimum number of elements per cluster, and (2) our cost function normalizes the data, while the K-means distortion measure normalizes the labels. This apparently small difference has a significant impact on performance, as our method is invariant under affine scaling of the data, while K-means is only invariant under translations, isometries and isotropic scaling, and is very much dependent on how the data are presented (in particular the marginal scaling of the variables). In Figure 1, we compare the two algorithms on a simple synthetic task with noisy dimensions, showing that ours is more robust to noisy features. Note that using a discriminative criterion based on the square loss may lead to the masking problem [4], which can be dealt with in the usual way by using second-order polynomials or, equivalently, a polynomial kernel.\n\n2.5 Kernels\n\nThe matrix A(X, \u03ba) in Eq. (3) can be expressed only in terms of the Gram matrix K = XX\u22a4. Indeed, using the matrix inversion lemma, we get:\n\nA(K, \u03ba) = \u03ba \u03a0n(K\u0303 + n\u03baIn)\u22121\u03a0n,   (4)\n\nwhere K\u0303 = \u03a0nK\u03a0n is the \u201ccentered Gram matrix\u201d of the points X. We can thus apply our framework with any positive definite kernel [5].\n\n2.6 Additional relaxations\n\nOur convex optimization problem can be further relaxed. 
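The equivalence between the primal form (3) and the kernel form (4) follows from the matrix inversion (push-through) lemma, and can be checked numerically; the sketch below is our own illustration with hypothetical names, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, kappa = 30, 4, 0.5
X = rng.standard_normal((n, d))
Pi = np.eye(n) - np.ones((n, n)) / n      # centering projection Pi_n

# primal form, Eq. (3)
A_primal = Pi @ (np.eye(n)
                 - X @ np.linalg.inv(X.T @ Pi @ X + n * kappa * np.eye(d)) @ X.T
                 ) @ Pi / n

# kernel form, Eq. (4), with K = X X' and centered Gram matrix K~ = Pi K Pi
K = X @ X.T
Kc = Pi @ K @ Pi
A_kernel = kappa * Pi @ np.linalg.inv(Kc + n * kappa * np.eye(n)) @ Pi

assert np.allclose(A_primal, A_kernel)    # matrix inversion lemma at work
```

Replacing K = XX\u22a4 by any positive definite kernel matrix leaves the kernel form unchanged, which is what makes the nonlinear extension immediate.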
An interesting relaxation is obtained by (1) relaxing the constraint M \u227d (1/k) 1n1n\u22a4 into M \u227d 0, (2) relaxing diag(M) = 1n into tr M = n, and (3) removing the constraint M \u2265 0 and the constraints on the row sums. A short calculation shows that this relaxation leads to an eigenvalue problem: let A = \u03a3_{i=1}^n ai ui ui\u22a4 be an eigenvalue decomposition of A, where a1 \u2264 \u00b7\u00b7\u00b7 \u2264 an are the sorted eigenvalues. The minimal value of the relaxed convex optimization problem is attained at M = \u03bb0 \u03a3_{i=1}^j ui ui\u22a4 + (n \u2212 \u03bb0 j) uj+1 uj+1\u22a4, with j = \u230an/\u03bb0\u230b. This additional relaxation into an eigenvalue problem is the basis of our efficient optimization algorithm in Section 3.\n\nFigure 1: Comparison with K-means on a two-dimensional dataset composed of two linearly separable bumps (100 data points, plotted in the left panel), with additional random independent noise dimensions (normal distributions with the same marginal variances as the 2D data). The clustering performance is plotted against the number of irrelevant dimensions, for regular K-means and our DIFFRAC approach (right panel, averaged over 50 replications, with standard deviations in dotted lines). The clustering performance is measured by a metric between partitions defined in Section 5, which is always between 0 and 1.\n\nIn the kernel formulation, since the smallest eigenvectors of A = \u03ba\u03a0n(K\u0303 + n\u03baIn)\u22121\u03a0n are the same as the largest eigenvectors of K\u0303, the relaxed problem is thus equivalent to kernel principal component analysis [10, 5] in the kernel setting, and in the linear setting to regular PCA (followed by our rounding procedure presented in Section 3.3). 
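The relaxed solution above spreads mass \u03bb0 over the j smallest eigenvectors of A and puts the remaining mass on the (j+1)-th; the sketch below (our illustration, assuming the \u03bb0-weighted form given above, with an arbitrary PSD matrix standing in for A(X, \u03ba)) builds this candidate and checks that it is feasible for the relaxed problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 25, 3
lam0 = n // (2 * k)                 # lambda0 = floor(n/2k) -> 4
j = n // lam0                       # j = floor(n/lambda0) -> 6

# any symmetric PSD "cost" matrix standing in for A(X, kappa)
B = rng.standard_normal((n, n))
A = B @ B.T / n
a, U = np.linalg.eigh(A)            # ascending eigenvalues, orthonormal eigenvectors

# candidate optimum of the fully relaxed problem:
# lambda0 on the j smallest eigenvectors, remaining mass on the (j+1)-th
M = lam0 * U[:, :j] @ U[:, :j].T + (n - lam0 * j) * np.outer(U[:, j], U[:, j])

assert np.isclose(np.trace(M), n)                        # tr M = n
assert np.all(np.linalg.eigvalsh(M) >= -1e-8)            # M is PSD
phi = np.minimum(np.linalg.eigvalsh(M) / lam0, 1).sum()
assert phi >= k                                          # spectral constraint holds
```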
In the linear setting, since PCA has no clustering effects in general2, it is clear that the constraints that were removed are essential to the clustering performance. In the kernel setting, experiments have shown that the most important constraint to keep in order to achieve the best embedding and clustering is the constraint diag(M) = 1n.\n\n3 Optimization\n\nSince \u03c6_{\u03bb0}(\u03bb) = (1/(2\u03bb0))(\u03bb + \u03bb0 \u2212 |\u03bb \u2212 \u03bb0|), and the sum of singular values can be represented as a semidefinite program (SDP), our problem is an SDP. It can thus be solved to any given accuracy in polynomial time by general purpose interior-point methods [12]. However, the number of variables is O(n^2) and thus the complexity of general purpose algorithms will be at least O(n^7); this remains much too slow for medium scale problems, where the number of data points is between 1,000 and 10,000. We now present an efficient approximate method that uses the specificity of the problem to reduce the computational load.\n\n3.1 Optimization by partial dualization\n\nWe saw earlier that by relaxing some of the constraints, we get back an eigenvalue problem. 
Eigenvalue decompositions are among the most important tools in numerical algebra; algorithms and codes are heavily optimized for them, and it is thus advantageous to rely on a sequence of eigenvalue decompositions for large scale algorithms.\n\nWe can dualize some constraints while keeping others; this leads to the following proposition:\n\n2Recent results show however that it does have an effect when clusters are spherical Gaussians [11].\n\nProposition 1 The solution of the convex optimization problem defined in Section 2.3 can be obtained by maximizing F(\u03b2) = min_{M \u227d 0, tr M = n, \u03a6_{\u03bb0}(M) \u2265 k} tr B(\u03b2)M \u2212 b(\u03b2) with respect to \u03b2, where\n\nB(\u03b2) = A + Diag(\u03b21) \u2212 (1/2)(\u03b22 \u2212 \u03b23)1\u22a4 \u2212 (1/2) 1(\u03b22 \u2212 \u03b23)\u22a4 \u2212 \u03b24 + (1/2) \u03b25\u03b25\u22a4/\u03b26,\n\nb(\u03b2) = \u03b21\u22a41 \u2212 (n \u2212 (k \u2212 1)\u03bb0) \u03b22\u22a41 + \u03bb0 \u03b23\u22a41 + k\u03b26/2 + \u03b25\u22a41,\n\nand \u03b21 \u2208 Rn, \u03b22 \u2208 Rn+, \u03b23 \u2208 Rn+, \u03b24 \u2208 Rn\u00d7n+, \u03b25 \u2208 Rn, \u03b26 \u2208 R+. The variables \u03b21, \u03b22, \u03b23, \u03b24, (\u03b25, \u03b26) correspond to the respective dualizations of the constraints diag(M) = 1n, M1n \u2264 (n \u2212 (k \u2212 1)\u03bb0)1n, M1n \u2265 \u03bb01n, M \u2265 0, and M \u227d (1/k) 1n1n\u22a4.\n\nThe function J(B) = min_{M \u227d 0, tr M = n, \u03a6_{\u03bb0}(M) \u2265 k} tr BM is a spectral convex function and may be computed in closed form through an eigenvalue decomposition. Moreover, a subgradient may be easily computed, readily leading to a numerically efficient subgradient method in fewer dimensions than n^2. 
Indeed, if we subsample the pointwise positivity constraint M \u2265 0 (so that \u03b24 has only a size smaller than n^{1/2} \u00d7 n^{1/2}), then the set of dual variables \u03b2 we are trying to maximize has linear size in n (instead of the primal variable M being quadratic in n).\nMore refined optimization schemes, based on smoothing of the spectral function J(B) into min_{M \u227d 0, tr M = n, \u03a6_{\u03bb0}(M) \u2265 k} [tr BM + \u03b5 tr M^2], are also used to speed up convergence (steepest descent on a smoothed function is generally faster than subgradient iterations) [13].\n\n3.2 Computational complexity\n\nThe running time complexity can be split into initialization procedures and per iteration complexity. The per iteration complexity depends directly on the cost of our eigenvalue problems, which are themselves linear in the cost of a matrix-vector operation with the matrix A (we only require a fixed small number of eigenvalues). In all situations, we manage to keep a complexity that is linear in the number n of data points. Note, however, that the number of descent iterations cannot be bounded a priori; in simulations we limit the number of those iterations to 200.\n\nFor linear kernels with dimension d, the complexity of initialization is O(d^2 n), while the complexity of each iteration is proportional to the cost of performing a matrix-vector operation with A, that is, O(dn). For general kernels, the complexity of initialization is O(n^3), while the complexity of each iteration is O(n^2). However, using an incomplete Cholesky decomposition [5] makes all costs linear in n.\n\n3.3 Rounding\n\nAfter the convex optimization, we obtain a low-rank matrix M \u2208 Ck which is pointwise nonnegative with unit diagonal, of the form UU\u22a4 where U \u2208 Rn\u00d7m. We need to project it back to the discrete set Ek. We have explored several possibilities, all with similar results. 
We propose the following procedure: we first project M back to the set of matrices of rank k and unit diagonal, by computing an eigendecomposition and rescaling the first k eigenvectors to unit norm, and then perform K-means, which is equivalent to performing the spectral clustering algorithm of [14] on the matrix M.\n\n4 Semi-supervised learning\n\nWorking with equivalence matrices M makes it easy to include prior knowledge on the clusters [2, 15, 16], namely \u201cmust-link\u201d constraints (also referred to as positive constraints), for which we constrain an element of M to be one, and \u201cmust-not-link\u201d constraints (also referred to as negative constraints), for which we constrain an element of M to be zero. Those two constraints are linear in M and can thus easily be included in our convex formulation.\nWe assume throughout this section that we have a set of \u201cmust-link\u201d pairs P+ and a set of \u201cmust-not-link\u201d pairs P\u2212. Moreover, we assume that the set of positive constraints is closed, i.e., that if there is a path of positive constraints between two data points, then these two data points already form a pair in P+. If the set of positive pairs does not satisfy this assumption, a larger set of pairs can be obtained by transitive closure.\n\nFigure 2: Comparison with K-means in the semi-supervised setting, with data taken from Figure 1: clustering performance (averaged over 50 replications, with standard deviations in dotted lines) vs. number of irrelevant dimensions, with 20% \u00d7 n and 40% \u00d7 n random matching pairs used for semi-supervision.\n\nPositive constraints Given our closure assumption on P+, we get a partition of {1, . . . 
, n} into p \u201cchunks\u201d of size greater than or equal to one. The singletons in this partition correspond to data points that are not involved in any positive constraints, while the other subsets correspond to chunks of data points that must occur together in the final partition. We let Cj, j = 1, . . . , p, denote those groups, and let P denote the n \u00d7 p {0, 1}-matrix defined such that each column (indexed by j) is equal to one for rows in Cj and zero otherwise. Forcing those groups is equivalent to considering M of the form M = P MP P\u22a4, where MP is an equivalence matrix of size p. Note that the positive constraint Mij = 1 is in fact turned into the equality of columns (and thus rows, by symmetry) i and j of M, which is equivalent when M \u2208 Ek, but much stronger for M \u2208 Ck.\nIn our linear clustering framework, this is in fact equivalent to (a) replacing each chunk by its mean, (b) adding a weight equal to the number of elements in the group into the discriminative cost function, and (c) modifying the regularization matrix to take into account the inner variance within each chunk. Positive constraints can be similarly included into K-means, to form a reduced weighted K-means problem, which is simpler than other approaches to dealing with positive constraints [17].\n\nIn Figure 2, we compare constrained K-means and the DIFFRAC framework under the same setting as in Figure 1, with different numbers of randomly selected positive constraints.\n\nNegative constraints After the chunks corresponding to positive constraints have been collapsed to one point, we extend the set of negative constraints to those collapsed points (if the constraints were originally consistent, the negative constraints can be uniquely extended). In our optimization framework, we simply add a penalty function of the form (1/(\u03b5|P\u2212|)) \u03a3_{(i,j)\u2208P\u2212} Mij^2. The K-means rounding procedure also has to be constrained, e.g., using the procedure of [17].\n\n5 Simulations\n\nIn this section, we apply the DIFFRAC framework to various clustering problems and situations. In all our simulations, we use the following distance between partitions B = B1 \u222a \u00b7\u00b7\u00b7 \u222a Bk and B\u2032 = B\u20321 \u222a \u00b7\u00b7\u00b7 \u222a B\u2032k\u2032 into k and k\u2032 disjoint subsets of {1, . . . , n}:\n\nd(B, B\u2032) = ( k + k\u2032 \u2212 2 \u03a3_{i,i\u2032} Card(Bi \u2229 B\u2032i\u2032)^2 / (Card(Bi) Card(B\u2032i\u2032)) )^{1/2}.\n\nd(B, B\u2032) defines a distance over the set of partitions [9] which is always between 0 and (k + k\u2032 \u2212 2)^{1/2}. When comparing partitions, we use the squared distance (1/2) d(B, B\u2032)^2, which is always between 0 and (k + k\u2032)/2 \u2212 1 (and between 0 and k \u2212 1 if the two partitions have the same number of clusters).\n\n5.1 Clustering classification datasets\n\nWe looked at the Isolet dataset (26 classes, 5,200 data points) from the UCI repository and the MNIST dataset of handwritten digits (10 classes, 5,000 data points). For each of those datasets, we compare the performances of K-means, RCA [18] and DIFFRAC, for linear and Gaussian kernels (referred to as \u201crbf\u201d), for a fixed value of the regularization parameter, with different levels of supervision. 
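The partition distance used throughout this section can be computed from a contingency table; the following sketch (our own illustrative implementation with hypothetical names, assuming labels are coded 0, 1, . . . from zero) computes the squared distance (1/2) d(B, B\u2032)^2 and checks its stated range:

```python
import numpy as np

def partition_distance_sq(labels1, labels2):
    # (1/2) d(B, B')^2 with d as defined in Section 5
    k1, k2 = len(set(labels1)), len(set(labels2))
    # contingency table: C[i, j] = Card(B_i intersect B'_j)
    C = np.zeros((k1, k2))
    for a, b in zip(labels1, labels2):
        C[a, b] += 1
    s = (C ** 2 / np.outer(C.sum(axis=1), C.sum(axis=0))).sum()
    return 0.5 * (k1 + k2 - 2 * s)

x = [0, 0, 1, 1, 2, 2]
assert partition_distance_sq(x, x) == 0.0              # identical partitions
y = [0, 0, 0, 1, 1, 1]
assert 0.0 < partition_distance_sq(x, y) <= (3 + 2) / 2 - 1
```

Because the distance depends only on the contingency counts, it is invariant to relabelling of the clusters, which is the property needed when comparing an estimated partition to ground truth.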
Results are presented in Table 1: on unsupervised problems, K-means and DIFFRAC have similar performance, while on semi-supervised problems, and in particular for nonlinear kernels, DIFFRAC outperforms both K-means and RCA. Note that all algorithms work on the same data representation (linear or kernelized) and that differences are due to the underlying clustering frameworks.\n\nDataset | K-means | RCA | DIFFRAC\nMnist-linear 0% | 5.6 \u00b1 0.1 | - | 6.0 \u00b1 0.4\nMnist-linear 20% | 4.5 \u00b1 0.3 | 3.6 \u00b1 0.3 | 3.0 \u00b1 0.2\nMnist-linear 40% | 2.9 \u00b1 0.3 | 2.2 \u00b1 0.2 | 1.8 \u00b1 0.4\nMnist-RBF 0% | 5.6 \u00b1 0.2 | - | 4.9 \u00b1 0.2\nMnist-RBF 20% | 4.6 \u00b1 0.0 | 1.8 \u00b1 0.4 | 4.1 \u00b1 0.2\nMnist-RBF 40% | 4.9 \u00b1 0.0 | 0.9 \u00b1 0.1 | 2.9 \u00b1 0.1\nIsolet-linear 0% | 12.1 \u00b1 0.6 | - | 12.3 \u00b1 0.3\nIsolet-linear 20% | 10.5 \u00b1 0.2 | 7.8 \u00b1 0.8 | 9.5 \u00b1 0.4\nIsolet-linear 40% | 9.2 \u00b1 0.5 | 3.7 \u00b1 0.2 | 7.0 \u00b1 0.4\nIsolet-RBF 0% | 11.4 \u00b1 0.4 | - | 11.0 \u00b1 0.3\nIsolet-RBF 20% | 10.6 \u00b1 0.0 | 7.5 \u00b1 0.5 | 7.8 \u00b1 0.5\nIsolet-RBF 40% | 10.0 \u00b1 0.0 | 3.7 \u00b1 1.0 | 6.9 \u00b1 0.6\n\nTable 1: Comparison of K-means, RCA and DIFFRAC, using the clustering metric defined in Section 5 (averaged over 10 replications), for linear and \u201crbf\u201d kernels and various levels of supervision (RCA requires equivalence constraints and is thus not reported at 0% supervision).\n\n5.2 Semi-supervised classification\n\nTo demonstrate the effectiveness of our method in a semi-supervised learning (SSL) context, we performed experiments on some benchmark datasets for SSL described in [19]. We considered the following datasets: COIL, BCI and Text. We carried out the experiments in a transductive setting, i.e., the test set coincides with the set of unlabelled samples. This allowed us to conduct a fair comparison with the low density separation (LDS) algorithm of [19], which is an enhanced version of the so-called transductive SVM. 
However, deriving \u201cout-of-sample\u201d extensions for our method is straightforward.\n\nA primary goal in semi-supervised learning is to take into account a large number of unlabelled points in order to dramatically reduce the number of labelled points required to achieve competitive classification accuracy. Hence, our experimental setting consists in observing how fast the classification accuracy improves as the number of labelled points increases. The fewer labelled points a method needs to achieve decent classification accuracy, the more relevant it is for semi-supervised learning tasks. As shown in Figure 3, our method yields competitive classification accuracy with very few labelled points on the three datasets. Moreover, DIFFRAC reaches unexpectedly good results on the Text dataset, where most semi-supervised learning methods usually show disappointing performance. One explanation might be that DIFFRAC acts as an \u201caugmented\u201d clustering algorithm, whereas most semi-supervised learning algorithms are built as \u201caugmented\u201d versions of traditional supervised learning algorithms, such as LDS, which is built on SVMs. Hence, for datasets exhibiting multi-class structure, such as Text, DIFFRAC is better able to utilize unlabelled points, since it is based on a multi-class clustering algorithm rather than on binary SVMs, whose multi-class extensions are currently unclear. Thus, our experiments support the view that semi-supervised learning algorithms built on clustering algorithms, augmented with labelled data acting as hints on clusters, are worthy of further investigation and research.\n\n6 Conclusion\n\nWe have presented a discriminative framework for clustering based on the square loss and penalization through spectral functions of equivalence matrices. 
Our formulation enables the easy incorporation of semi-supervised constraints, which leads to state-of-the-art performance in semi-supervised learning. Moreover, our discriminative framework should allow the use of existing methods for learning the kernel matrix from data [20]. Finally, we are currently investigating the use of DIFFRAC in semi-supervised image segmentation. In particular, early experiments on estimating the number of clusters using variation rates of our discriminative costs are very promising.\n\nFigure 3: Semi-supervised classification: test error vs. number of labelled training points for DIFFRAC and LDS (learning curves on Coil100, BCI and Text).\n\nReferences\n\n[1] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Adv. NIPS, 2004.\n[2] T. De Bie and N. Cristianini. Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. J. Mac. Learn. Res., 7:1409\u20131436, 2006.\n[3] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. In Proc. ICML, 2007.\n[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.\n[5] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Camb. Univ. Press, 2004.\n[6] A. Frieze and M. Jerrum. Improved approximation algorithms for MAX k-CUT and MAX BISECTION. In Integer Programming and Combinatorial Optimization, volume 920, pages 1\u201313. Springer, 1995.\n[7] C. Swamy. 
Correlation clustering: maximizing agreements via semidefinite programming. In ACM-SIAM Symp. Discrete Algorithms, 2004.\n[8] A. S. Lewis and H. S. Sendov. Twice differentiable spectral functions. SIAM J. Mat. Anal. App., 23(2):368\u2013386, 2002.\n[9] F. R. Bach and M. I. Jordan. Learning spectral clustering, with application to speech separation. J. Mac. Learn. Res., 7:1963\u20132001, 2006.\n[10] B. Sch\u00f6lkopf, A. J. Smola, and K.-R. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comp., 10(3):1299\u20131319, 1998.\n[11] N. Srebro, G. Shakhnarovich, and S. Roweis. An investigation of computational and informational limits in Gaussian mixture clustering. In Proc. ICML, 2006.\n[12] S. Boyd and L. Vandenberghe. Convex Optimization. Camb. Univ. Press, 2003.\n[13] J. F. Bonnans, J. C. Gilbert, C. Lemar\u00e9chal, and C. A. Sagastiz\u00e1bal. Numerical Optimization: Theoretical and Practical Aspects. Springer, 2003.\n[14] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In Adv. NIPS, 2002.\n[15] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In Proc. AAAI, 2005.\n[16] M. Heiler, J. Keuchel, and C. Schn\u00f6rr. Semidefinite clustering for image segmentation with a-priori knowledge. In Pattern Recognition, Proc. DAGM, 2005.\n[17] K. Wagstaff, C. Cardie, S. Rogers, and S. Schr\u00f6dl. Constrained K-means clustering with background knowledge. In Proc. ICML, 2001.\n[18] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In Proc. ICML, 2003.\n[19] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proc. AISTATS, 2004.\n[20] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. 
ICML, 2004.", "award": [], "sourceid": 870, "authors": [{"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Za\u00efd", "family_name": "Harchaoui", "institution": null}]}