{"title": "Distance Metric Learning with Application to Clustering with Side-Information", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 528, "abstract": null, "full_text": "Distance Metric Learning, with Application\n\nto Clustering with Side-Information\n\nEric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\n epxing,ang,jordan,russell\nAbstract\n\n@cs.berkeley.edu\n\nMany algorithms rely critically on being given a good metric over their\ninputs. For instance, data can often be clustered in many \u201cplausible\u201d\nways, and if a clustering algorithm such as K-means initially fails to \ufb01nd\none that is meaningful to a user, the only recourse may be for the user to\nmanually tweak the metric until suf\ufb01ciently good clusters are found. For\nthese and other applications requiring good metrics, it is desirable that\nwe provide a more systematic way for users to indicate what they con-\nsider \u201csimilar.\u201d For instance, we may ask them to provide examples. In\nthis paper, we present an algorithm that, given examples of similar (and,\n, learns a distance metric over\nif desired, dissimilar) pairs of points in\nthat respects these relationships. Our method is based on posing met-\n\u0002\u0005\u0003\nric learning as a convex optimization problem, which allows us to give\nef\ufb01cient, local-optima-free algorithms. We also demonstrate empirically\nthat the learned metrics can be used to signi\ufb01cantly improve clustering\nperformance.\n\n\u0002\u0004\u0003\n\n1 Introduction\n\nThe performance of many learning and datamining algorithms depend critically on their\nbeing given a good metric over the input space. For instance, K-means, nearest-neighbors\nclassi\ufb01ers and kernel algorithms such as SVMs all need to be given good metrics that re\ufb02ect\nreasonably well the important relationships between the data. 
This problem is particularly acute in unsupervised settings such as clustering, and is related to the perennial problem of there often being no “right” answer for clustering: if three algorithms are used to cluster a set of documents, and one clusters according to authorship, another according to topic, and a third according to writing style, who is to say which is the “right” answer? Worse, if an algorithm were to have clustered by topic, and we instead wanted it to cluster by writing style, there are relatively few systematic mechanisms for conveying this to a clustering algorithm, and we are often left tweaking distance metrics by hand.

In this paper, we are interested in the following problem: Suppose a user indicates that certain points in an input space (say, R^n) are considered by them to be “similar.” Can we automatically learn a distance metric over R^n that respects these relationships, i.e., one that assigns small distances between the similar pairs? For instance, in the documents example, we might hope that, by giving it pairs of documents judged to be written in similar styles, it would learn to recognize the critical features for determining style.

One important family of algorithms that (implicitly) learn metrics are the unsupervised ones that take an input dataset and find an embedding of it in some space. This includes algorithms such as Multidimensional Scaling (MDS) [2] and Locally Linear Embedding (LLE) [9]. One feature distinguishing our work from these is that we will learn a full metric over the input space, rather than focusing only on (finding an embedding for) the points in the training set. Our learned metric thus generalizes more easily to previously unseen data. 
More importantly, methods such as LLE and MDS also suffer from the “no right answer” problem: for example, if MDS finds an embedding that fails to capture the structure important to a user, it is unclear what systematic corrective actions would be available. (Similar comments also apply to Principal Components Analysis (PCA) [7].) As in our motivating clustering example, the methods we propose can also be used as a pre-processing step to help any of these unsupervised algorithms find better solutions.

In the supervised learning setting, for instance nearest-neighbor classification, numerous attempts have been made to define or learn either local or global metrics for classification. In these problems, a clear-cut, supervised criterion (classification error) is available and can be optimized for. (See also [11] for a different way of supervising clustering.) This literature is too wide to survey here, but some relevant examples include [10, 5, 3, 6], and [1] also gives a good overview of some of this work. While these methods often learn good metrics for classification, it is less clear whether they can be used to learn good, general metrics for other algorithms such as K-means, particularly if the information available is less structured than the traditional, homogeneous training sets expected by them.

In the context of clustering, a promising approach was recently proposed by Wagstaff et al. [12] for clustering with similarity information. If told that certain pairs are “similar” or “dissimilar,” they search for a clustering that puts the similar pairs into the same, and dissimilar pairs into different, clusters. This gives a way of using similarity side-information to find clusters that reflect a user’s notion of meaningful clusters. 
But similar to MDS and LLE, the (“instance-level”) constraints that they use do not generalize to previously unseen data whose similarity/dissimilarity to the training set is not known. We will later discuss this work in more detail, and also examine the effects of using the methods we propose in conjunction with these methods.

2 Learning Distance Metrics

Suppose we have some set of points {x_i} (i = 1, …, m) in R^n, and are given information that certain pairs of them are “similar”:

S: (x_i, x_j) ∈ S if x_i and x_j are similar.   (1)

How can we learn a distance metric d(x, y) between points x and y that respects this; specifically, so that “similar” points end up close to each other?

Consider learning a distance metric of the form

d(x, y) = d_A(x, y) = ||x − y||_A = sqrt((x − y)^T A (x − y)).   (2)

To ensure that this be a metric, satisfying non-negativity and the triangle inequality, we require that A be positive semi-definite, A ⪰ 0.[1] Setting A = I 
gives Euclidean distance; if we restrict A to be diagonal, this corresponds to learning a metric in which the different axes are given different “weights”; more generally, A parameterizes a family of Mahalanobis distances over R^n.[2] Learning such a distance metric is also equivalent to finding a rescaling of the data that replaces each point x with A^(1/2) x, and applying the standard Euclidean metric to the rescaled data; this will later be useful in visualizing the learned metrics.

[1] Technically, this also allows pseudometrics, where d_A(x, y) = 0 does not imply x = y.
[2] Note that, by putting the original dataset through a non-linear basis function φ and considering sqrt((φ(x) − φ(y))^T A (φ(x) − φ(y))), non-linear distance metrics can also be learned.

A simple way of defining a criterion for the desired metric is to demand that pairs of points (x_i, x_j) in S have, say, small squared distance between them: minimize over A the quantity Σ_((x_i,x_j)∈S) ||x_i − x_j||_A^2. This is trivially solved with A = 0, which is not useful, and we add the constraint Σ_((x_i,x_j)∈D) ||x_i − x_j||_A ≥ 1 to ensure that A does not collapse the dataset into a single point. Here, D can be a set of pairs of points known to be “dissimilar” if such information is explicitly available; otherwise, we may take it to be all pairs not in S. 
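As a small concrete illustration (our sketch, not the authors’ code; the function names are ours), the metric of Eq. (2) and its rescaling interpretation can be written as:

```python
import numpy as np

def mahalanobis(x, y, A):
    # d_A(x, y) = sqrt((x - y)^T A (x - y)) for a symmetric PSD matrix A.
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ A @ d))

def rescale(X, A):
    # Map each row x of X to A^(1/2) x; Euclidean distance on the result
    # equals d_A on the original points.
    w, V = np.linalg.eigh(A)                 # A = V diag(w) V^T
    A_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return np.asarray(X) @ A_half.T

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x, y = np.array([1.0, 0.0]), np.array([0.0, 2.0])
Xr = rescale(np.stack([x, y]), A)
print(mahalanobis(x, y, A))              # 2.0
print(np.linalg.norm(Xr[0] - Xr[1]))     # 2.0 (same value via rescaling)
```

With A = I this reduces to the ordinary Euclidean distance, matching the remark in the text.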
Putting these together gives the optimization problem:

min_A   Σ_((x_i,x_j)∈S) ||x_i − x_j||_A^2   (3)
s.t.    Σ_((x_i,x_j)∈D) ||x_i − x_j||_A ≥ 1   (4)
        A ⪰ 0.   (5)

The choice of the constant 1 in the right-hand side of (4) is arbitrary but not important, and changing it to any other positive constant c results only in A being replaced by c^2 A. Also, this problem has an objective that is linear in the parameters A, and both of the constraints are also easily verified to be convex. Thus, the optimization problem is convex, which enables us to derive efficient, local-minima-free algorithms to solve it.

We also note that, while one might consider various alternatives to (4), the constraint “Σ_((x_i,x_j)∈D) ||x_i − x_j||_A^2 ≥ 1” would not be a good choice despite its giving a simple linear constraint. It would result in A always being rank 1 (i.e., the data are always projected onto a line).[3]

2.1 The case of diagonal A

In the case that we want to learn a diagonal A = diag(A_11, …, A_nn), we can derive an efficient algorithm using the Newton-Raphson method. Define

g(A) = g(A_11, …, A_nn) = Σ_((x_i,x_j)∈S) ||x_i − x_j||_A^2 − log ( Σ_((x_i,x_j)∈D) ||x_i − x_j||_A ).

It is straightforward to show that minimizing g (subject to A ⪰ 0) is equivalent, up to a multiplication of A by a positive constant, to solving the original problem (3–5).

[3] The proof is reminiscent of the derivation of Fisher’s linear discriminant. Briefly, the resulting problem reduces to maximizing a Rayleigh-quotient-like quantity, whose solution is given by solving a generalized eigenvector problem; the optimal A is then of rank one, built from the principal eigenvector.
[4] To ensure that A ⪰ 0, which for diagonal A holds iff the diagonal elements A_jj are non-negative, we actually scale the Newton update by a step-size parameter optimized via line-search to give the largest downhill step subject to A_jj ≥ 0.
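For the diagonal case just described, a minimal numerical sketch (ours, and deliberately simplified: projected gradient descent on g rather than the Newton-Raphson update with line-search; the pair-difference arrays and all names are our own) might look like:

```python
import numpy as np

# S_diffs / D_diffs hold the difference vectors (x_i - x_j) for the similar
# and dissimilar pairs; the metric is A = diag(a) with a >= 0 elementwise.

def g_diag(a, S_diffs, D_diffs, eps=1e-12):
    # g(A) = sum_S ||x_i - x_j||_A^2 - log( sum_D ||x_i - x_j||_A )
    s = float(np.sum(S_diffs**2 @ a))
    dists = np.sqrt(np.clip(D_diffs**2 @ a, eps, None))
    return s - np.log(np.sum(dists))

def grad_g_diag(a, S_diffs, D_diffs, eps=1e-12):
    grad_s = np.sum(S_diffs**2, axis=0)
    dists = np.sqrt(np.clip(D_diffs**2 @ a, eps, None))
    grad_d = np.sum(D_diffs**2 / (2.0 * dists[:, None]), axis=0) / np.sum(dists)
    return grad_s - grad_d

def learn_diag_A(S_diffs, D_diffs, lr=0.05, iters=2000):
    a = np.ones(S_diffs.shape[1])
    for _ in range(iters):
        # Gradient step on g, then clip to keep the diagonal non-negative.
        a = np.clip(a - lr * grad_g_diag(a, S_diffs, D_diffs), 0.0, None)
    return a

# Toy example: similar pairs differ mainly along the 2nd (noise) coordinate,
# dissimilar pairs along the 1st; the learned metric should up-weight dim 0.
rng = np.random.default_rng(0)
S_diffs = np.column_stack([0.01 * rng.standard_normal(20), rng.standard_normal(20)])
D_diffs = np.column_stack([5.0 + rng.standard_normal(20), rng.standard_normal(20)])
a = learn_diag_A(S_diffs, D_diffs)
print(a)  # a[0] >> a[1]
```

The clipping plays the role of the non-negativity line-search in footnote [4]; a real implementation would use the second-order Newton-Raphson update the text describes.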
We can thus use Newton-Raphson to efficiently optimize g.[4]

2.2 The case of full A

In the case of learning a full matrix A, the constraint that A ⪰ 0 becomes slightly trickier to enforce, and Newton’s method often becomes prohibitively expensive (requiring O(n^6) time to invert the Hessian over the n^2 parameters). Using gradient ascent and the idea of iterative projections (e.g., [8]) we derive a different algorithm for this setting. We pose the equivalent problem:

max_A   g(A) = Σ_((x_i,x_j)∈D) ||x_i − x_j||_A   (6)
s.t.    f(A) = Σ_((x_i,x_j)∈S) ||x_i − x_j||_A^2 ≤ 1   (7)
        A ⪰ 0.   (8)

We will use a gradient ascent step on g(A) to optimize (6), followed by the method of iterative projections to ensure that the constraints (7) and (8) hold. Specifically, we will repeatedly take a gradient step on g, and then repeatedly project A into the sets C_1 = {A : Σ_((x_i,x_j)∈S) ||x_i − x_j||_A^2 ≤ 1} and C_2 = {A : A ⪰ 0}. This gives the algorithm shown in Figure 1.[5]

Figure 1: Gradient ascent + iterative projection algorithm:
  Iterate
    Iterate
      A := argmin over A′ of { ||A′ − A||_F : A′ ∈ C_1 }
      A := argmin over A′ of { ||A′ − A||_F : A′ ∈ C_2 }
    until A converges
    A := A + α (∇_A g(A)), projected orthogonally to ∇_A f
  until convergence
Here, ||·||_F is the Frobenius norm on matrices (||M||_F = sqrt(Σ_(i,j) M_(ij)^2)).

The motivation for the specific choice of the problem formulation (6–8) is that projecting onto C_1 or C_2 can be done inexpensively. Specifically, the first projection step, onto C_1, involves minimizing a quadratic objective subject to a single linear constraint; the solution to this is easily found by solving a sparse system of linear equations. 
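The overall loop can be sketched numerically as follows (our code, with two simplifications: the projection onto C_1 is replaced by a rescaling of A, which keeps f(A) ≤ 1 because f is linear in A, and the gradient step is a plain ascent step without the orthogonal-projection refinement; the standard projection onto the PSD cone is eigenvalue clipping):

```python
import numpy as np

def project_psd(A):
    # Projection onto the PSD cone: diagonalize and clip negative eigenvalues.
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

def f_similar(A, S_diffs):
    # f(A) = sum over similar pairs of ||x_i - x_j||_A^2 (linear in A).
    return float(np.einsum('ij,jk,ik->', S_diffs, A, S_diffs))

def to_C1(A, S_diffs):
    # Surrogate for the projection onto C1 = {A : f(A) <= 1}: since f is
    # linear in A, dividing A by f(A) restores feasibility.
    val = f_similar(A, S_diffs)
    return A / val if val > 1.0 else A

def learn_full_A(S_diffs, D_diffs, lr=0.1, iters=300):
    n = S_diffs.shape[1]
    A = np.eye(n)
    for _ in range(iters):
        # Gradient ascent step on g(A) = sum_D ||x_i - x_j||_A.
        d2 = np.einsum('ij,jk,ik->i', D_diffs, A, D_diffs)
        dists = np.sqrt(np.clip(d2, 1e-12, None))
        G = (D_diffs.T / (2.0 * dists)) @ D_diffs
        A = A + lr * G
        # Alternate the two projections until both constraints hold.
        for _ in range(20):
            A = project_psd(to_C1(A, S_diffs))
    return A

rng = np.random.default_rng(1)
S_diffs = np.column_stack([0.01 * rng.standard_normal(30), rng.standard_normal(30)])
D_diffs = np.column_stack([5.0 + rng.standard_normal(30), rng.standard_normal(30)])
A = learn_full_A(S_diffs, D_diffs)
print(A)  # weights the informative first coordinate far more than the second
```

This is only a didactic stand-in; the Frobenius-nearest projection onto C_1 and the refined gradient direction are as described in the surrounding text.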
The second projection step, onto C_2, the space of all positive semi-definite matrices, is done by first finding the diagonalization A = X Λ X^T, where Λ = diag(λ_1, …, λ_n) is a diagonal matrix of A’s eigenvalues and the columns of X contain A’s corresponding eigenvectors, and taking A′ = X Λ′ X^T, where Λ′ = diag(max(λ_1, 0), …, max(λ_n, 0)). (E.g., see [4].)

[5] The algorithm shown in the figure includes a small refinement: the gradient step is taken in the direction of the projection of ∇_A g onto the orthogonal subspace of ∇_A f, so that it will “minimally” disrupt the constraint f(A) ≤ 1. Empirically, this modification often significantly speeds up convergence.

3 Experiments and Examples

We begin by giving some examples of distance metrics learned on artificial data, and then show how our methods can be used to improve clustering performance.

3.1 Examples of learned distance metrics

Consider the data shown in Figure 2(a), which is divided into two classes (shown by the different symbols and, where available, colors). Suppose that points in each class are “similar” to each other, and we are given S reflecting this.[6] Depending on whether we learn a diagonal or a full A, we obtain:

A_diagonal = diag(1.036, 1.007, …),
A_full = [ 3.245 3.286 0.081 ; 3.286 3.327 0.082 ; 0.081 0.082 0.002 ].

[6] In the experiments with synthetic data, S was a randomly sampled 1% of all pairs of similar points.

To visualize this, we can use the fact discussed earlier that learning ||·||_A is equivalent to finding a rescaling of the data that replaces each point x with A^(1/2) x, one that hopefully “moves” the similar pairs together. Figure 2(b,c) shows the result of plotting A^(1/2) x. As we see, the algorithm has successfully brought together the similar points, while keeping dissimilar ones apart.

Figure 2: (a) Original data, with the different classes indicated by the different symbols (and colors, where available). (b) Rescaling of data corresponding to the learned diagonal A. (c) Rescaling corresponding to the full A.

Figure 3 shows a similar result for a case of three clusters whose centroids differ only in the x and y directions. As we see in Figure 3(b), the learned diagonal metric correctly ignores the z direction. Interestingly, in the case of a full A, the algorithm finds a surprising projection of the data onto a line that still maintains the separation of the clusters well.

Figure 3: (a) Original data. (b) Rescaling corresponding to the learned diagonal A. (c) Rescaling corresponding to the full A.

3.2 Application to clustering

One application of our methods is “clustering with side-information,” in which we learn a distance metric using similarity information, and cluster data using that metric. Specifically, suppose we are given S, and told that each pair (x_i, x_j) ∈ S means x_i and x_j belong to the same cluster. We will consider four algorithms for clustering:

1. K-means using the default Euclidean metric ||x_i − μ_c||^2 between points x_i and cluster centroids μ_c to define distortion (and ignoring S).
2. 
Constrained K-means: K-means but subject to points (x_i, x_j) ∈ S always being assigned to the same cluster [12].[7]
3. K-means + metric: K-means but with distortion defined using the distance metric ||x_i − μ_c||_A^2, with A learned from S.
4. Constrained K-means + metric: Constrained K-means using the distance metric learned from S.

[7] This is implemented as the usual K-means, except that if (x_i, x_j) ∈ S, then during the step in which points are assigned to cluster centroids μ_c, we assign both x_i and x_j to the cluster c minimizing ||x_i − μ_c||^2 + ||x_j − μ_c||^2. More generally, if we imagine drawing an edge between each pair of points in S, then all the points in each resulting connected component are constrained to lie in the same cluster, which we pick to be the cluster minimizing the summed distortion of the component’s points.

Let ĉ_i be the cluster to which point x_i is assigned by an automatic clustering algorithm, and let c_i be some “correct” or desired clustering of the data. Following [?], in the case of 2-cluster data, we will measure how well the ĉ_i’s match the c_i’s according to

Accuracy = Σ_(i>j) 1{ 1{c_i = c_j} = 1{ĉ_i = ĉ_j} } / ( 0.5 m (m − 1) ),

where 1{·} is the indicator function (1{true} = 1, 1{false} = 0). This is equivalent to the probability that, for two points x_i, x_j drawn randomly from the dataset, our clustering ĉ agrees with the “true” clustering c on whether x_i and x_j belong to the same or different clusters.[8]

As a simple example, consider Figure 4, which shows a clustering problem in which the “true clusters” (indicated by the different symbols/colors in the plot) are distinguished by their x-coordinate, but where the data in its original space seems to cluster much better according to the y-coordinate. As shown by the accuracy scores given in the figure, both K-means and constrained K-means failed to find good clusterings. But by first learning a distance metric and then clustering according to that metric, we easily find the correct clustering separating the true clusters from each other. Figure 5 gives another example showing similar results.

Figure 4: (a) Original dataset. (b) Data scaled according to the learned metric. (The diagonal A’s result is shown, but the full A gave visually indistinguishable results.) 1. K-means: Accuracy = 0.4975. 2. Constrained K-means: Accuracy = 0.5060. 3. K-means + metric: Accuracy = 1. 4. Constrained K-means + metric: Accuracy = 1.

We also applied our methods to 9 datasets from the UC Irvine repository. Here, the “true clustering” is given by the data’s class labels. In each, we ran one experiment using “little” side-information S, and one with “much” side-information. The results are given in Figure 6.[9]

We see that, in almost every problem, using a learned diagonal or full metric leads to significantly improved performance over naive K-means. 
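The pairwise accuracy measure above is easy to state in code (our sketch; the names are ours):

```python
import numpy as np

def pair_accuracy(c_true, c_pred):
    # Fraction of point pairs (i > j) on which the predicted clustering
    # agrees with the desired one about same-cluster vs. different-cluster
    # membership; equals the probability described in the text.
    c_true = np.asarray(c_true)
    c_pred = np.asarray(c_pred)
    same_true = c_true[:, None] == c_true[None, :]
    same_pred = c_pred[:, None] == c_pred[None, :]
    agree = same_true == same_pred
    iu = np.triu_indices(len(c_true), k=1)   # each unordered pair once
    return float(np.mean(agree[iu]))

# The measure is invariant to permuting cluster labels:
print(pair_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(pair_accuracy([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.333...
```

As footnote [8] notes, for many clusters this plain version gives inflated scores, and the paper reweights same-cluster and different-cluster pairs equally.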
In most of the problems, using a learned metric with constrained K-means (the 5th bar for the diagonal A, 6th bar for the full A) also outperforms using constrained K-means alone (4th bar), sometimes by a very large margin. Not surprisingly, we also see that having more side-information in S typically leads to metrics giving better clusterings.

[8] In the case of many (> 2) clusters, this evaluation metric tends to give inflated scores, since almost any clustering will correctly predict that most pairs are in different clusters. In this setting, we therefore modified the measure, averaging not over x_i, x_j drawn uniformly at random, but drawing pairs from the same cluster (as determined by c) with chance 0.5 and from different clusters with chance 0.5, so that “matches” and “mis-matches” are given the same weight. All results reported here used K-means with multiple restarts, and are averages over at least 20 trials (except for wine, 10 trials).
[9] S was generated by picking a random subset of all pairs of points sharing the same class c_i. In the case of “little” side-information, the size of the subset was chosen so that the resulting number of connected components K_c (see footnote 7) would be very roughly 90% of the size of the original dataset. In the case of “much” side-information, this was changed to 70%.

Figure 5: (a) Original dataset. (b) Data scaled according to the learned metric. (The diagonal A’s result is shown, but the full A gave visually indistinguishable results.) 1. K-means: Accuracy = 0.4993. 2. Constrained K-means: Accuracy = 0.5701. 3. K-means + metric: Accuracy = 1. 4. Constrained K-means + metric: Accuracy = 1.

Figure 6: Clustering accuracy on 9 UCI datasets: Boston housing (N=506, C=3, d=13), ionosphere (N=351, C=2, d=34), Iris plants (N=150, C=3, d=4), wine (N=168, C=3, d=12), balance (N=625, C=3, d=4), breast cancer (N=569, C=2, d=30), soy bean (N=47, C=4, d=35), protein (N=116, C=6, d=20), and diabetes (N=768, C=2, d=8). In each panel, the six bars on the left correspond to an experiment with “little” side-information S, and the six on the right to “much” side-information. From left to right, the six bars in each set are respectively K-means, K-means + diagonal metric, K-means + full metric, Constrained K-means (C-Kmeans), C-Kmeans + diagonal metric, and C-Kmeans + full metric. 
Also shown are N: size of dataset; C: number of classes/clusters; d: dimensionality of data; K_c: mean number of connected components (see footnotes 7, 9). 1 s.e. bars are also shown.

Figure 7: Plots of accuracy vs. amount of side-information for the protein and wine datasets. Here, the x-axis gives the fraction of all pairs of points in the same class that are randomly sampled to be included in S. Each plot compares K-means, constrained K-means, and both algorithms with learned diagonal and full metrics.

Figure 7 also shows two typical examples of how the quality of the clusterings found increases with the amount of side-information. For some problems (e.g., wine), our algorithm learns good diagonal and full metrics quickly with only a very small amount of side-information; for some others (e.g., protein), the distance metric, particularly the full metric, appears harder to learn and provides less benefit over constrained K-means.

4 Conclusions

We have presented an algorithm that, given examples of similar pairs of points in R^n, learns a distance metric that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allowed us to derive efficient, local-optima-free algorithms. We also showed examples of diagonal and full metrics learned from simple artificial examples, and demonstrated on artificial and on UCI datasets how our methods can be used to improve clustering performance.

References

[1] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 1996.
[2] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.
[3] C. Domeniconi and D. Gunopulos. Adaptive nearest neighbor classification using support vector machines. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[4] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Univ. Press, 1996.
[5] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:607–616, 1996.
[6] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1999.
[7] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1989.
[8] R. Rockafellar. Convex Analysis. Princeton Univ. Press, 1970.
[9] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[10] B. Schölkopf and A. Smola. Learning with Kernels. In press, 2001.
[11] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Allerton Conference on Communication, Control and Computing, 1999.
[12] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. In Proc. 18th International Conference on Machine Learning, 2001.
", "award": [], "sourceid": 2164, "authors": [{"given_name": "Eric", "family_name": "Xing", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Stuart", "family_name": "Russell", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}