{"title": "A Support Vector Method for Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 367, "page_last": 373, "abstract": null, "full_text": "A Support Vector Method for Clustering

Asa Ben-Hur
Faculty of IE and Management
Technion, Haifa 32000, Israel

David Horn
School of Physics and Astronomy
Tel Aviv University, Tel Aviv 69978, Israel

Hava T. Siegelmann
Lab for Inf. & Decision Systems
MIT, Cambridge, MA 02139, USA

Vladimir Vapnik
AT&T Labs Research
100 Schultz Dr., Red Bank, NJ 07701, USA

Abstract

We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant, leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.

1 Introduction

Clustering is an ill-defined problem for which there exist numerous methods [1, 2]. These can be based on parametric models or can be non-parametric. Parametric algorithms are usually limited in their expressive power, i.e. a certain cluster structure is assumed.
In this paper we propose a non-parametric clustering algorithm based on the support vector approach [3], which is usually employed for supervised learning. In [4, 5] an SV algorithm for characterizing the support of a high dimensional distribution was proposed. As a by-product of the algorithm one can compute a set of contours which enclose the data points. These contours were interpreted by us as cluster boundaries [6]. In [6] the number of clusters was predefined, and the value of the kernel parameter was not determined as part of the algorithm. In this paper we address these issues. The first stage of our Support Vector Clustering (SVC) algorithm consists of computing the sphere with minimal radius which encloses the data points when mapped to a high dimensional feature space. This sphere corresponds to a set of contours which enclose the points in input space. As the width parameter of the Gaussian kernel function that represents the map to feature space is decreased, this contour breaks into an increasing number of disconnected pieces. The points enclosed by each separate piece are interpreted as belonging to the same cluster. Since the contours characterize the support of the data, our algorithm identifies valleys in its probability distribution. When we deal with overlapping clusters we have to employ a soft margin constant, allowing for \"outliers\". In this parameter range our algorithm is similar to the scale-space clustering method [7]. The latter is based on a Parzen window estimate of the probability density, using a Gaussian kernel and identifying cluster centers with peaks of the estimator.

2 Describing Cluster Boundaries with Support Vectors

In this section we describe an algorithm for representing the support of a probability distribution by a finite data set using the formalism of support vectors [5, 4]. It forms the basis of our clustering algorithm.
Let \{x_i\} \subseteq X be a data set of N points, with X \subseteq R^d, the input space. Using a nonlinear transformation \Phi from X to some high dimensional feature space, we look for the smallest enclosing sphere of radius R, described by the constraints ||\Phi(x_i) - a||^2 \le R^2 \forall i, where || \cdot || is the Euclidean norm and a is the center of the sphere. Soft constraints are incorporated by adding slack variables \xi_j:

||\Phi(x_j) - a||^2 \le R^2 + \xi_j    (1)

with \xi_j \ge 0. To solve this problem we introduce the Lagrangian

L = R^2 - \sum_j (R^2 + \xi_j - ||\Phi(x_j) - a||^2) \beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j ,    (2)

where \beta_j \ge 0 and \mu_j \ge 0 are Lagrange multipliers, C is a constant, and C \sum_j \xi_j is a penalty term. Setting to zero the derivative of L with respect to R, a and \xi_j, respectively, leads to

\sum_j \beta_j = 1    (3)

a = \sum_j \beta_j \Phi(x_j)    (4)

\beta_j = C - \mu_j    (5)

The KKT complementarity conditions [8] result in

\xi_j \mu_j = 0    (6)

(R^2 + \xi_j - ||\Phi(x_j) - a||^2) \beta_j = 0    (7)

A point x_i with \xi_i > 0 is outside the feature-space sphere (cf. equation 1). Equation (6) states that such points x_i have \mu_i = 0, so from equation (5) \beta_i = C. A point with \xi_i = 0 is inside or on the surface of the feature-space sphere. If its \beta_i \ne 0, then equation (7) implies that the point x_i is on the surface of the feature-space sphere. In this paper any point with 0 < \beta_i < C will be referred to as a support vector or SV; points with \beta_i = C will be called bounded support vectors or bounded SVs. This is to emphasize the role of the support vectors as delineating the boundary. Note that when C \ge 1 no bounded SVs exist because of the constraint \sum_i \beta_i = 1.

Using these relations we may eliminate the variables R, a and \mu_j, turning the Lagrangian into the Wolfe dual, which is a function of the variables \beta_j alone:

W = \sum_j \Phi(x_j)^2 \beta_j - \sum_{i,j} \beta_i \beta_j \Phi(x_i) \cdot \Phi(x_j) .    (8)

Since the variables \mu_j do not appear in the Lagrangian they may be replaced with the constraints:

0 \le \beta_j \le C ,  j = 1, ..., N .    (9)

We follow the SV method and represent the dot products \Phi(x_i) \cdot \Phi(x_j) by an appropriate Mercer kernel K(x_i, x_j), so that the dual becomes

W = \sum_j K(x_j, x_j) \beta_j - \sum_{i,j} \beta_i \beta_j K(x_i, x_j) .    (10)

Throughout this paper we use the Gaussian kernel

K(x_i, x_j) = e^{-q ||x_i - x_j||^2} ,    (11)

with width parameter q. The distance of the image of a point x from the center of the sphere,

R^2(x) = ||\Phi(x) - a||^2 ,    (12)

can be expressed in terms of the kernel as

R^2(x) = K(x, x) - 2 \sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j) .    (13)

The radius of the sphere is

R = \{ R(x_i) | x_i is a support vector \} ,    (14)

and the contours that enclose the points in data space are defined by the set

\{ x | R(x) = R \} .    (15)
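As a concrete illustration, the Wolfe dual of equations (8)-(9) with a Gaussian kernel can be solved by simple pairwise coordinate updates that preserve the equality constraint; the sketch below, in Python with NumPy, uses hypothetical helper names and a naive random-pair schedule (the paper itself uses Platt's SMO [14] with minor modifications, which is far more efficient):

```python
import numpy as np

def gaussian_kernel(X, Y, q):
    """K(x, y) = exp(-q ||x - y||^2), the Gaussian kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def svdd_sphere(X, q=1.0, C=1.0, n_iter=20000, seed=0):
    """Minimize beta^T K beta subject to sum(beta) = 1, 0 <= beta_j <= C
    (equivalent to maximizing the dual W, since K(x, x) = 1 for the
    Gaussian kernel) by SMO-style updates on pairs of coordinates.
    Requires C >= 1/N for the uniform starting point to be feasible."""
    rng = np.random.default_rng(seed)
    N = len(X)
    K = gaussian_kernel(X, X, q)
    beta = np.full(N, 1.0 / N)   # feasible start on the simplex
    Kb = K @ beta                # cache K @ beta for cheap updates
    for _ in range(n_iter):
        i, j = rng.integers(0, N, size=2)
        if i == j:
            continue
        denom = K[i, i] - 2 * K[i, j] + K[j, j]
        if denom <= 1e-12:
            continue
        # optimal step along e_i - e_j, clipped to keep both betas in [0, C]
        t = (Kb[j] - Kb[i]) / denom
        t = np.clip(t, max(-beta[i], beta[j] - C), min(C - beta[i], beta[j]))
        if t == 0.0:
            continue
        beta[i] += t
        beta[j] -= t
        Kb += t * (K[:, i] - K[:, j])
    return beta, K

def sphere_radius2(x, X, beta, K, q):
    """Kernel expansion of the squared distance of Phi(x) from the sphere
    center a, using K(x, x) = 1 for the Gaussian kernel."""
    k = gaussian_kernel(x[None, :], X, q)[0]
    return 1.0 - 2.0 * beta @ k + beta @ K @ beta
```

At the optimum, all points with 0 < beta_i < C lie at the same distance from the center (the sphere radius), and with C = 1 every point lies inside or on the sphere.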
Note that since we use a Gaussian kernel, for which K(x, x) = 1, our feature space images lie on a unit sphere; thus the intersection with a sphere of radius R < 1 can also be defined as an intersection by a hyperplane, as in conventional SVM.

The shape of the enclosing contours in input space is governed by two parameters, q and C. Figure 1 demonstrates that, as q is increased, the enclosing contours form tighter fits to the data. Figure 2 describes a situation that necessitated the introduction of outliers, or bounded SVs, by allowing for C < 1. As C is decreased, not only does the number of bounded SVs increase, but their influence on the shape of the cluster contour decreases (see also [6]). The number of support vectors depends on both q and C. For fixed q, as C is decreased, the number of SVs decreases since some of them turn into bounded SVs, and the resulting shapes of the contours become smoother.

We denote by n_sv and n_bsv the number of support vectors and bounded support vectors, respectively, and note the following result:

Proposition 2.1 [4]

n_bsv + n_sv \ge 1/C ,  n_bsv < 1/C    (16)

This is an immediate consequence of the constraints (3) and (9). In fact, we have found empirically that

n_bsv(q, C) = max(0, 1/C - n_0) ,    (17)

where n_0 > 0 may be a function of q and N. This was observed for artificial and real data sets. Moreover, we have also observed that

n_sv = a/C + b ,    (18)

where a and b are functions of q and N. The linear behavior of n_bsv continues until n_bsv + n_sv = N.

3 Support Vector Clustering (SVC)

In this section we go through a set of examples demonstrating the use of SVC. We begin with a data set in which the separation into clusters can be achieved without outliers, i.e. C = 1. As seen in Figure 1, as q is increased the shape of the boundary curves in data space varies. At several q values the enclosing contour splits, forming an increasing number of connected components.
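Given a solved vector of Lagrange multipliers, the quantities n_sv and n_bsv of Proposition 2.1 can be counted directly; a minimal sketch (the function name and tolerance are ours):

```python
import numpy as np

def count_svs(beta, C, tol=1e-6):
    """Count support vectors (0 < beta_i < C) and bounded support vectors
    (beta_i = C), up to a numerical tolerance, to compare against the
    bounds of Proposition 2.1."""
    n_bsv = int(np.sum(beta >= C - tol))
    n_sv = int(np.sum((beta > tol) & (beta < C - tol)))
    return n_sv, n_bsv
```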
We regard each component as representing a single cluster. While in this example clustering looks hierarchical, this is not strictly true in general.

Figure 1: Data set contains 183 points. A Gaussian kernel was used with C = 1.0. SVs are surrounded by small circles. (a): q = 1 (b): q = 20 (c): q = 24 (d): q = 48.

In order to label data points into clusters we need to identify the connected components. We define an adjacency matrix A_{ij} between pairs of points x_i and x_j:

A_{ij} = 1 if, for all y on the line segment connecting x_i and x_j, R(y) \le R; 0 otherwise.    (19)

Clusters are then defined as the connected components of the graph induced by A. This labeling procedure is justified by the observation that nearest neighbors in data space can be connected by a line segment that is contained in the high dimensional sphere. Checking the line segment is implemented by sampling a number of points on the segment (a value of 10 was used in the numerical experiments). Note that bounded SVs are not classified by this procedure; they can be left unlabeled, or classified, e.g., according to the cluster to which they are closest. We adopt the latter approach.

The cluster description algorithm provides an estimate of the support of the underlying probability distribution [4]. Thus we distinguish between clusters according to gaps in the support of the underlying probability distribution. As q is increased the support is characterized by more detailed features, enabling the detection of smaller gaps. Too high a value of q may lead to overfitting (see Figure 2(a)), which can be handled by allowing for bounded SVs (Figure 2(b)): letting some of the data points be bounded SVs creates smoother contours, and facilitates contour splitting at low values of q.
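The labeling step above can be sketched as follows; `in_sphere` is a stand-in for the test R(y) \le R against the trained sphere, and 10 samples per segment are used as in the experiments (names are ours, not from the paper):

```python
import numpy as np
from collections import deque

def cluster_labels(X, in_sphere, n_samples=10):
    """Label points by the connected components of the adjacency graph:
    A_ij = 1 iff every sampled point y on the segment [x_i, x_j]
    satisfies the membership test in_sphere(y) (i.e. R(y) <= R)."""
    N = len(X)
    A = np.zeros((N, N), dtype=bool)
    ts = np.linspace(0.0, 1.0, n_samples)
    for i in range(N):
        A[i, i] = True
        for j in range(i + 1, N):
            segment = X[i] + np.outer(ts, X[j] - X[i])
            A[i, j] = A[j, i] = all(in_sphere(y) for y in segment)
    # connected components via breadth-first search
    labels = np.full(N, -1)
    cur = 0
    for s in range(N):
        if labels[s] != -1:
            continue
        labels[s] = cur
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(A[u] & (labels == -1)):
                labels[v] = cur
                queue.append(v)
        cur += 1
    return labels
```

Bounded SVs would be excluded from this graph and assigned afterwards to the nearest labeled cluster.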


3.1 Overlapping clusters

In many data sets clusters are strongly overlapping, and clear separating valleys as in Figures 1 and 2 are not present. Our algorithm is useful in such cases as well, but a slightly different interpretation is required. First we note that equation (15) for the enclosing contour can be expressed as \{x | \sum_i \beta_i K(x_i, x) = p\}, where p is determined by the value of this sum on the support vectors. The set of points enclosed by the contour is \{x | \sum_i \beta_i K(x_i, x) > p\}.

Figure 2: Clustering with and without outliers. The inner cluster is composed of 50 points generated by a Gaussian distribution. The two concentric rings contain 150/300 points, generated by a uniform angular distribution and radial Gaussian distribution. (a) The rings cannot be distinguished when C = 1. Shown here is q = 3.5, the lowest q value that leads to separation of the inner cluster. (b) Outliers allow easy clustering. The parameters are 1/(NC) = 0.3 and q = 1.0. SVs are surrounded by small ellipses.

In the extreme case when almost all data points are bounded SVs, the sum in this expression is approximately

p(x) = (1/N) \sum_i K(x_i, x) .    (20)

This is recognized as a Parzen window estimate of the density function (up to a normalization factor, if the kernel is not appropriately normalized). The contour will then enclose a small number of points which correspond to the maximum of the Parzen-estimated density. Thus in the high bounded SVs regime we find a dense core of the probability distribution.

In this regime our algorithm is closely related to an algorithm proposed by Roberts [7]. He defines cluster centers as maxima of the Parzen window estimate p(x). He shows that in his approach, which goes by the name of scale-space clustering, as q is increased the number of maxima increases.
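In code, the weighted kernel sum and its uniform limit of equation (20) might look like the following sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def kernel_sum(x, X, beta, q):
    """sum_i beta_i K(x_i, x): the level function whose p-level contour
    bounds the clusters; points with kernel_sum(x) > p lie inside."""
    return beta @ np.exp(-q * ((X - x) ** 2).sum(axis=1))

def parzen_estimate(x, X, q):
    """Equation (20): with uniform weights beta_i = 1/N the kernel sum
    reduces to a Parzen window density estimate, up to normalization."""
    return kernel_sum(x, X, np.full(len(X), 1.0 / len(X)), q)
```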
The Gaussian kernel plays an important role in his analysis: it is the only kernel for which the number of maxima (hence the number of clusters) is a monotonically non-decreasing function of q (see [7] and references therein).

The advantage of SVC over Roberts' method is that we find a region, rather than just a peak, and that instead of solving a problem with many local maxima, we identify the core regions by an SV method with a globally optimal solution. We have found examples where a local maximum is hard to identify by Roberts' method.

3.2 The iris data

We ran SVC on the iris data set [9], which is a standard benchmark in the pattern recognition literature. It can be obtained from the UCI repository [10]. The data set contains 150 instances, each containing four measurements of an iris flower. There are three types of flowers, represented by 50 instances each. We clustered the data in a two dimensional subspace formed by the first two principal components. One of the clusters is linearly separable from the other two at q = 0.5 with no bounded SVs. The remaining two clusters have significant overlap, and were separated at q = 4.2, 1/(NC) = 0.55, with 4 misclassifications. Clustering results for an increasing number of principal components are reported in Table 1.

Table 1: Performance of SVC on the iris data for a varying number of principal components.

Principal components    q     1/(NC)   SVs   bounded SVs   misclassified
1-2                     4.2   0.55     20    72            4
1-3                     7.0   0.70     23    94            4
1-4                     9.0   0.75     34    96            14

Note that as the number of principal components is increased from 3 to 4 there is a degradation in the performance of the algorithm: the number of misclassifications increases from 4 to 14. Also note the increase in the number of support vectors and bounded support vectors required to obtain contour splitting.
As the dimensionality of the data increases, a larger number of support vectors is required to describe the contours. Thus if the data is sparse, it is better to use SVC on a low dimensional representation, obtained, e.g., by principal component analysis [2]. For comparison we quote results obtained by other non-parametric clustering algorithms: the information theoretic approach of [11] leads to 5 misclassifications and the SPC algorithm of [12] has 15 misclassifications.

4 Varying q and C

SVC was described for fixed values of q and C, and a method for exploring parameter space is required. We can work with SVC in an agglomerative fashion, starting from a large value of q, where each point is in a different cluster, and decreasing q until there is a single cluster. Alternatively we may use the divisive approach, by starting from a small value of q and increasing it. The latter seems more efficient since meaningful clustering solutions (see below for a definition of this concept) usually have a small number of clusters.

The following is a qualitative schedule for varying the parameters. One may start with a small value of q where only one cluster occurs: q = 1 / max_{i,j} ||x_i - x_j||^2. q is then increased to look for values at which a cluster contour splits. When single point clusters start to break off, or a large number of support vectors is obtained (overfitting, as in Figure 2(a)), 1/C is increased.

An important issue in the divisive approach is the decision when to stop dividing the clusters. An algorithm for this is described in [13]. After clustering the data they partition the data into two sets with some sizable overlap, perform clustering on these smaller data sets, and compute the average overlap between the two clustering solutions for a number of partitions. Such validation can be performed here as well.
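The starting value of q in the schedule above can be computed directly from the pairwise distances; a minimal sketch (the function name is ours):

```python
import numpy as np

def initial_width(X):
    """q = 1 / max_{i,j} ||x_i - x_j||^2: at this scale the Gaussian
    kernel varies little over the data set, so a single enclosing
    contour (one cluster) is obtained."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 / d2.max()
```

A divisive run would then increase q from this value, watching for contour splits, and increase 1/C when overfitting sets in.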
However, we believe that in our SV setting it is natural to use the number of support vectors as an indication of a meaningful solution, since their (small) number is an indication of good generalization. Therefore we should stop the algorithm when the fraction of SVs exceeds some threshold. If the clustering solution is stable with respect to changes in the parameters, this is also an indication of meaningful clustering.

The quadratic programming problem of equation (2) can be solved by the SMO algorithm [14], which was recently proposed as an efficient tool for solving such problems in SVM training. Some minor modifications are required to adapt it to the problem that we solve here [4]. Benchmarks reported in [14] show that this algorithm converges in most cases in O(N^2) kernel evaluations. The complexity of the labeling part of the algorithm is O(N^2 d), so that the overall complexity is O(N^2 d). We also note that the memory requirements of the SMO algorithm are low: it can be implemented using O(1) memory at the cost of a decrease in efficiency, which makes our algorithm useful even for very large data sets.

5 Summary

The SVC algorithm finds clustering solutions together with curves representing their boundaries via a description of the support, or high density regions, of the data. As such, it separates between clusters according to gaps or low density regions in the probability distribution of the data, and makes no assumptions on cluster shapes in input space.

SVC has several other attractive features: the quadratic programming problem of the cluster description algorithm is convex and has a globally optimal solution, and, like other SV algorithms, SVC can deal with noise or outliers by a margin parameter, making it robust with respect to noise in the data.

References

[1] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.

[2] K.
Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA, 1990.

[3] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.

[4] B. Schölkopf, R.C. Williamson, A.J. Smola, and J. Shawe-Taylor. SV estimation of a distribution's support. In Neural Information Processing Systems, 2000.

[5] D.M.J. Tax and R.P.W. Duin. Support vector domain description. Pattern Recognition Letters, 20:1191-1199, 1999.

[6] A. Ben-Hur, D. Horn, H.T. Siegelmann, and V. Vapnik. A support vector clustering method. In International Conference on Pattern Recognition, 2000.

[7] S.J. Roberts. Non-parametric unsupervised cluster analysis. Pattern Recognition, 30(2):261-272, 1997.

[8] R. Fletcher. Practical Methods of Optimization. Wiley-Interscience, Chichester, 1987.

[9] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

[10] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.

[11] N. Tishby and N. Slonim. Data clustering by Markovian relaxation and the information bottleneck method. In Neural Information Processing Systems, 2000.

[12] M. Blatt, S. Wiseman, and E. Domany. Data clustering using a model granular magnet. Neural Computation, 9:1805-1842, 1997.

[13] S. Dubnov, R. El-Yaniv, Y. Gdalyahu, E. Schneidman, N. Tishby, and G. Yona. A new nonparametric pairwise clustering algorithm. Submitted to Machine Learning.

[14] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
", "award": [], "sourceid": 1823, "authors": [{"given_name": "Asa", "family_name": "Ben-Hur", "institution": null}, {"given_name": "David", "family_name": "Horn", "institution": null}, {"given_name": "Hava", "family_name": "Siegelmann", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}