{"title": "Proximity Graphs for Clustering and Manifold Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 232, "abstract": null, "full_text": "Proximity graphs for clustering and manifold learning\n\nMiguel Á. Carreira-Perpiñán and Richard S. Zemel\n\nDept. of Computer Science, University of Toronto\n6 King’s College Road, Toronto, ON M5S 3H5, Canada\nEmail: {miguel,zemel}@cs.toronto.edu\n\nAbstract\n\nMany machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fully-connected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods, and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph.\n\n1 Introduction\n\nGraph-based algorithms have long been popular, and have received even more attention recently, for two of the fundamental problems in machine learning: clustering [1-4] and manifold learning [5-8]. 
Relatively little attention has been paid to the properties of, and construction methods for, the graphs that these algorithms depend on.\n\nA starting point for this study is the question of what constitutes a good graph. In the applications considered here, the graphs are an intermediate form of representation, and therefore their utility to some extent depends on the algorithms that they will ultimately be used for. However, in the case of both clustering and manifold learning, the data points are assumed to lie on some small number of manifolds. Intuitively, the graph should represent these underlying manifolds well: it should avoid shortcuts that travel outside a manifold, avoid gaps that erroneously disconnect regions of a manifold, and be dense within the manifold and clusters. Also, while the algorithms differ with respect to connectedness, in that clustering wants the graph to be disconnected while for manifold learning the graph should be connected, both want at least the inside of the clusters, or dense areas of the manifold, to be enhanced relative to the between-cluster, or sparse-manifold, connections.\n\nFigure 1: Sensitivity to noise of proximity graphs (panels, left to right: dataset, MST, k-NNG, ε-ball graph, Delaunay triangulation; bottom row: perturbed dataset). Top row: several proximity graphs constructed on a noisy sample of points lying on a circle. Bottom row: the same graphs constructed on a different sample; specifically, we added to each point Gaussian noise of standard deviation equal to the length of the small segment shown in the centre of the dataset (top row), built the graph, and drew it on the original dataset. This small perturbation can result in large changes in the graphs, such as disconnections, shortcuts or changes in connection density.\n\nMany methods employ simple graph constructions. 
A fully-connected graph is used for example in spectral clustering and multidimensional scaling, while a fixed grid, with each point connecting to some small fixed number of neighbors in a pre-defined grid of locations, is generally used in image segmentation. The ε-ball graph, in which each point connects to all points within some distance ε, and the k-nearest-neighbor graph (k-NNG) are generalizations of these approaches, as they take into account distances in some feature space associated with each point instead of simply the grid locations. The ε-ball graph or k-NNG provide an improvement over the fully-connected graph or fixed grid (clustering: [3, 9]; manifold learning: [5, 7]). These traditional methods contain parameters (ε, k) that strongly depend on the data; they generally require careful, costly tuning, as typically graphs must be constructed for a range of parameter values, the clustering or dimensionality-reduction algorithm run on each, and the performance curves compared to determine the best settings. Figure 1 shows that these methods are quite sensitive to sparsity and noise in the data points, and that the parameters should ideally vary within the data set. It also shows that other traditional graphs (e.g. the Delaunay triangulation) are not good for manifolds, since they connect points nonlocally.\n\nIn this paper we propose a different method of graph construction, one based on minimum spanning trees (MSTs). Our method involves an ensemble of trees, each built on a perturbed version of the data. We first discuss the motivation for this new type of graph, and then examine its robustness properties and its utility to subsequent clustering and dimensionality reduction methods.\n\n2 Two new types of proximity graphs\n\nA minimum spanning tree is a tree subgraph that contains all the vertices and has a minimum sum of edge weights. 
As a skeleton of a data set, the MST has some good properties: it tends to avoid shortcuts between branches (typically caused by long edges, which are contrary to the shortest-length criterion) and it gives a connected graph (disconnection is usually a problem for other methods when random small groupings of points occur). In fact, the MST was an early approach to clustering [10]. However, the MST is too sparse (having only N − 1 edges for an N-point data set, and no cycles) and is sensitive to noise. One way to flesh it out and attain robustness to noise is to form an MST ensemble that combines multiple MSTs; we give two different algorithms for this.\n\n2.1 Perturbed MSTs (PMSTs)\n\nPerturbed MSTs combine a number of MSTs, each fit to a perturbed version of the data set. The perturbation is done through a local noise model that we estimate separately for each data point based on its environment: point xi is perturbed by adding to it zero-mean uniform noise of standard deviation si = r di, where di is the average distance to the k nearest neighbors of xi, and r ∈ [0, 1]. In this paper we use k = 5 throughout and study the effect of r. The locality of the noise model allows points to move more or less depending on the local data structure around them and to connect to different numbers of neighbors at different distances; in effect we achieve a variable k and ε.\n\nTo build the PMST ensemble, we generate T > 1 perturbed copies of the entire data set according to the local noise model and fit an MST to each. The PMST ensemble assigns a weight eij ∈ [0, 1] to the edge between points xi and xj, equal to the fraction of the T trees in which that edge appears. For T = 1 this gives the MST of the unperturbed data set; for T → ∞ it gives a stochastic graph where eij is the probability (in the Laplace sense) of that edge under the noise model. The PMST ensemble contains at most T(N − 1) edges (usually far fewer). 
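The construction just described can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration under our own assumptions, not the authors' code: the function name and defaults are ours, and we use SciPy's minimum_spanning_tree for the per-copy MSTs (note that a uniform density on [−a, a] has standard deviation a/√3, hence the scaling of the noise half-width).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def pmst_ensemble(X, T=20, r=0.3, k=5, rng=None):
    """PMST ensemble: average of MSTs fit to T perturbed copies of X."""
    rng = np.random.default_rng(rng)
    N = len(X)
    # Local noise scale: s_i = r * (average distance to the k nearest neighbors)
    D = squareform(pdist(X))
    s = r * np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    E = np.zeros((N, N))
    for _ in range(T):
        # Zero-mean uniform noise with per-point standard deviation s_i:
        # uniform on [-a, a] has std a/sqrt(3), so use half-width a = sqrt(3)*s_i
        half_width = np.sqrt(3.0) * s
        Xp = X + rng.uniform(-1, 1, size=X.shape) * half_width[:, None]
        W = squareform(pdist(Xp))
        mst = minimum_spanning_tree(csr_matrix(W)).toarray() > 0
        E += mst | mst.T  # symmetrize the edge indicators
    return E / T  # e_ij in [0, 1]: fraction of trees containing edge (i, j)
```

For r = 0 every perturbed copy equals the data, so the ensemble reduces to the indicator of the single MST.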
Although the algorithm is randomized, the PMST ensemble for large T is essentially deterministic, and insensitive to noise by construction. In practice a small T is enough; we use T = 20 in the experiments.\n\n2.2 Disjoint MSTs (DMSTs)\n\nHere we build a graph that is a deterministic collection of t MSTs satisfying the property that the nth tree (for n = 1, ..., t) is the MST of the data subject to not using any edge already in the previous trees 1, ..., n − 1. One possible construction algorithm is an extension of Kruskal’s algorithm for the MST in which we pick edges without replacement and restart for every new tree. Specifically, we sort the list of N(N − 1)/2 edges eij by increasing distance dij and visit each available edge in turn, removing it if it merges two clusters (or, equivalently, does not create a cycle); whenever we have removed N − 1 edges, we go back to the start of the list. We repeat the procedure t times in total.\n\nThe DMST ensemble consists of the union of all removed edges and contains t(N − 1) edges, each of weight 1. The t parameter controls the overall density of the graph, which is always connected; unlike ε or k (for the ε-ball graph or k-NNG), t is not a parameter that depends locally on the data, and again points may connect to different numbers of neighbors at different distances. We obtain the original MST for t = 1; values t = 2-4 (and often considerably larger) work very well in practice. t need not be an integer, i.e., we can fix the total number of edges instead. In any case we should use t ≪ N/2.\n\n2.3 Computational complexity\n\nFor a data set with N points, the computational complexity is approximately O(T N² log N) (PMSTs) or O(N²(log N + t)) (DMSTs). In both cases the resulting graphs are sparse (the number of edges is linear in the number of points N). If imposing an a priori sparse structure (e.g. 
an 8-connected grid in image segmentation), the edge list is much shorter, so the graph construction is faster. For the perturbed MST ensemble, the perturbation of the data set results in a partially disordered edge list, which one should be able to sort efficiently. The bottleneck in the graph construction itself is the computation of pairwise distances, or equivalently of nearest neighbors, of a set of N points (which affects the ε-ball and k-NNG graphs too): in 2D this is O(N log N) thanks to properties of planar geometry, but in higher dimensions the complexity quickly approaches O(N²).\n\nOverall, the real computational bottleneck is the graph postprocessing, typically O(N³) in spectral methods (for clustering or manifold learning). This can be sped up to O(cN²) by using sparsity (limiting a priori the edges allowed, thus approximating the true solution), but then the graph construction is likewise sped up. Thus, even if our graphs are slightly more costly to construct than the ε-ball or k-NNG, the computational savings are very large if we avoid having to run the spectral technique multiple times in search of a good ε or k.\n\n3 Experiments\n\nWe present two sets of experiments on the application of the graphs to clustering and manifold learning, respectively.\n\n3.1 Clustering\n\nIn affinity-based clustering, our data is an N × N affinity matrix W that defines a graph (where nonzeros define edge weights) and we seek a partition of the graph that optimizes a cost function, such as mincut [1] or normalized cut [2]. Typically the affinities are wij = exp(−(dij/σ)²/2) (where dij is the problem-dependent distance between points xi and xj) and depend on a scale parameter σ ∈ (0, ∞). This graph partitioning problem is generally NP-complete, so approximations are necessary, such as spectral clustering algorithms [2]. 
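As a concrete sketch of the spectral step, one can embed the points into the leading eigenvectors of the normalized affinity matrix D^(-1/2) W D^(-1/2), with the affinities masked by graph edge weights. This is our own minimal illustration (function name and the dense eigendecomposition route are our choices); clustering the rows of the returned embedding, e.g. with k-means, is left to the caller.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def spectral_embed(X, E, sigma, K):
    """Embed points into the K leading non-constant eigenvectors of the
    normalized affinity matrix N = D^{-1/2} W D^{-1/2}, where the affinities
    w_ij = e_ij * exp(-(d_ij/sigma)^2 / 2) are masked by edge weights E."""
    D2 = squareform(pdist(X))
    W = E * np.exp(-0.5 * (D2 / sigma) ** 2)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    d[d == 0] = 1.0                       # guard against isolated points
    Dinv = 1.0 / np.sqrt(d)
    Nmat = Dinv[:, None] * W * Dinv[None, :]
    vals, vecs = np.linalg.eigh(Nmat)     # eigenvalues in ascending order
    # Discard the top eigenvector (eigenvalue 1, constant direction) and
    # keep the next K, ordered from largest eigenvalue down.
    return vecs[:, -(K + 1):-1][:, ::-1]
```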
In spectral clustering we seek to cluster in the space of the leading eigenvectors of the normalized affinity matrix N = D^(-1/2) W D^(-1/2) (where D = diag(Σj wij), and discarding a constant eigenvector associated with eigenvalue 1). Spectral clustering succeeds only for a range of values of σ where N displays the natural cluster structure of the data; if σ is too small W is approximately diagonal, and if σ is too large W is approximately a matrix of ones. It is thus crucial to determine a good σ, which requires computing clusterings over a range of σ values; this is expensive, since each eigenvector computation is O(N³) (or O(cN²) under sparsity conditions).\n\nFig. 2 shows segmentation results for a grayscale image from [11] where the objective is to segment the occluder from the underlying background, a hard task given the intensity gradients. We use a standard weighted Euclidean distance on the data points (pixels) x = (pixel location, intensity). One method uses the 8-connected grid (where each pixel is connected to its 8 neighboring pixels). The other method uses the PMST or DMST ensemble (constrained to contain only edges in the 8-connected grid) under different values of the r and t parameters; the graph has between 44% and 98% of the number of edges in the 8-grid, depending on the parameter value. We define the affinity matrix as wij = eij exp(−(dij/σ)²/2) (where eij ∈ [0, 1] are the edge values). In both methods we apply the spectral clustering algorithm of [2]. The plot shows the clustering error (mismatched occluder area) for a range of scales. The 8-connected grid succeeds in segmenting the occluder for σ ∈ [0.2, 1] approximately, while the MST ensembles (for all parameter values tested) succeed for a wider range, up to σ = ∞ in many cases. 
The reason for this success even at such high σ is that the graph lacks many edges around the occluder, so those affinities are zero no matter how high the scale is. In other words, for clustering, our graphs enhance the inside of clusters with respect to the bridges between clusters, and so ease the graph partitioning.\n\n3.2 Manifold learning\n\nFor dimensionality reduction, we concentrate on applying Isomap [5], a popular and powerful algorithm. We first estimate the geodesic distances (i.e., along the manifold) ^gij between pairs of points in the data set as the shortest-path lengths in a graph learned from the data. Then we apply multidimensional scaling to these distances to obtain a collection of low-dimensional points {yi}, i = 1, ..., N, that optimally preserves the estimated geodesic distances.\n\nFigure 2: Using a proximity graph increases the scale range over which good segmentations are possible. We consider segmenting the greyscale image at the bottom (an occluder over a background) with spectral clustering, asking for K = 5 clusters. The color diagrams represent the segmentation (column 1) and the first 5 eigenvectors of the affinity matrix (except the constant eigenvector, columns 2-4) obtained with spectral clustering, using a PMST ensemble with r = 0.4 (upper row) or an 8-connectivity graph (lower row), for 3 different scales: σ1 = 0.5, σ2 = 1.6 and σ = ∞. The PMST ensemble succeeds at all scales (note how several eigenvectors are constant over the occluder), while the 8-connectivity graph progressively deteriorates as σ increases, giving a partition of equal-sized clusters at large scale. In the bottom part of the figure we show: the PMST ensemble graph in 3D (x, y, intensity) space; and the clustering error vs σ (where the right end is σ = ∞) for the 8-connectivity graph (thick blue line) and for various other PMST and DMST ensembles under various parameters (thin lines). The PMST and DMST ensembles robustly (for many settings of their parameters) give an almost perfect segmentation over a large range of scales.\n\nIn fig. 3 we show the results of applying Isomap using different graphs to two data sets (ellipse and Swiss roll) for which we know the true geodesic distances gij. In a real application, since the true geodesic distances are unknown, error and variance cannot be computed; an estimated residual variance has been proposed [5] to determine the optimal graph parameter. For the perturbed MST ensemble, we binarize the edge values by setting to 1 any eij > 0. (It is often possible to refine the graph by zeroing edges with small eij, since this removes shortcuts that may have arisen by chance, particularly if T is large; but it is difficult to estimate the right threshold reliably.) The plots show 3 curves as a function of the graph parameter: the average error E in the geodesic distances; Isomap’s estimated residual variance ^V; and the true residual variance V. 
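The two quantities just defined can be computed directly. A minimal SciPy sketch follows (function names are ours): the geodesic distances ^gij are estimated as shortest-path lengths in the weighted graph, and the residual variance is V = 1 − R²(G, DY) with R the linear correlation over all matrix entries.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap_distances(W):
    """Estimate geodesic distances as shortest-path lengths in a weighted
    graph; W is a dense matrix of edge lengths (0 = no edge)."""
    return shortest_path(csr_matrix(W), method='D', directed=False)

def residual_variance(G, DY):
    """V = 1 - R^2(G, DY): one minus the squared linear correlation between
    the entries of the geodesic-distance matrix G and the matrix DY of
    Euclidean distances in the low-dimensional embedding."""
    r = np.corrcoef(G.ravel(), DY.ravel())[0, 1]
    return 1.0 - r ** 2
```

Since R is a correlation, the residual variance is invariant to a global rescaling of either distance matrix.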
From the plots we can see that ^V correlates well with V (though it underestimates it) and also with E for the Swiss roll, but not for the ellipse; this can make the optimal graph parameter difficult to determine in a real application. Given this, the fact that our graphs work well over a larger region of their parameter space than the ε-ball or k-NNG graphs makes them particularly attractive.\n\nThe plots for the Swiss roll show that, while for the low-noise case the ε-ball or k-NNG graphs work well over a reasonable region of their parameter space, for the high-noise case this region shrinks considerably, almost vanishing for the ε-ball graph. This is because for low values of the parameter the graph is disconnected, while for high values it has multiple shortcuts; the difficulty of the task is compounded by the small number of points used, N = 500 (an unavoidable fact in high dimensions). However, for the PMSTs the region remains quite wide, and for the DMSTs the approximate region t ∈ [2, 8] gives good results. For very low r, or for t = 1, the graph is the single MST, hence the large errors.\n\nIt is also important to realize that the range of the r parameter of the PMST ensemble does not depend on the data, while the range for ε and k does. The range of the t parameter of the DMST ensemble does depend on the data, but we have found empirically that t = 2-4 gives very good results with all data sets we have tried.\n\n4 Discussion\n\nOne main contribution of this paper is to highlight the relatively understudied problem of converting a data set into a graph, which forms an intermediate representation for many clustering and manifold learning algorithms. A second contribution is novel construction algorithms, which are easy to implement, not expensive to compute, robust across many noise levels and parameter settings, and useful for clustering and manifold learning. 
In general, a careful selection of the graph construction algorithm makes the results of these machine learning methods robust, and avoids or limits the required parameter search. Finally, the combination of many graphs, formed from perturbed versions of the data, into an ensemble of graphs is a novel approach to the construction problem.\n\nOur idea of MST ensembles is an extension to graphs of the well-known technique of combining predictors by averaging (regression) or voting (classification), as is the regularizing effect of training with noise [12]. An ensemble of predictors improves the generalization to unseen data if the individual predictors are independent of each other and disagree with each other; this can be explained by the bias-variance tradeoff. Unlike regression or classification, unsupervised graph learning at present lacks an error function, so it seems difficult to apply the bias-variance framework here. However, we have conducted a wide range of empirical tests to understand the properties of the MST ensembles, and to compare them to the other graph construction methods, in terms of the error in the geodesic distances (if known a priori). 
In summary, we have found that the variance of the error for the geodesic distances decreases for the ensemble when the individual graphs are sparse (e.g. MSTs as used here, or the ε-ball graph and k-NNG with low ε or k), but not necessarily when the graphs are not sparse.\n\nFigure 3: Performance of Isomap with different graphs in 3 data sets: ellipse with N = 100 points, high noise; Swiss roll with N = 500 points, low and high noise (where high noise means Gaussian with standard deviation equal to 9% of the separation between branches). All plots show on the X axis the graph parameter (ε, k, r or t); on the left Y axis the average error in the geodesic distances (red curve, E = (1/N²) Σij |^gij − gij|); and on the right Y axis Isomap’s estimated residual variance (solid blue curve, ^V = 1 − R²(^G, DY)) and true residual variance (dashed blue curve, V = 1 − R²(G, DY)), where ^G and G are the matrices of estimated and true geodesic distances, respectively, DY is the matrix of Euclidean distances in the low-dimensional embedding, and R(A, B) is the standard linear correlation coefficient, taken over all entries of matrices A and B. Where the curves for the ε-ball graph and k-NNG are missing, the graph was disconnected.\n\nThe typical cut [9, 13] is a clustering criterion based on the probability pij that points xi and xj are in the same cluster over all possible partitions (under the Boltzmann distribution for the mincut cost function). The pij need to be estimated: [9] use Swendsen-Wang sampling, while [13] use randomized-tree sampling. However, these trees are not used to define a proximity graph, unlike in our work.\n\nAn important direction for future work concerns the noise model for PMSTs. The model we propose is isotropic, in that every direction of perturbation is equally likely. A better way is to perturb points more strongly in directions likely to lie within the manifold and less strongly in directions away from the manifold, using a method such as k nearest neighbors to estimate appropriate directions. Preliminary experiments with such a manifold-aligned model are very promising, particularly when the data is very noisy or its distribution on the manifold is not uniform. 
The noise model can also be extended to deal with non-Euclidean data by directly perturbing the similarities.\n\nAcknowledgements\n\nFunding provided by a CIHR New Emerging Teams grant.\n\nReferences\n\n[1] Zhenyu Wu and Richard Leahy. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. on Pattern Anal. and Machine Intel., 15(11):1101-1113, November 1993.\n\n[2] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Anal. and Machine Intel., 22(8):888-905, August 2000.\n\n[3] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. Int. J. Computer Vision, 59(2):167-181, September 2004.\n\n[4] Romer Rosales, Kannan Achan, and Brendan Frey. Learning to cluster using local neighborhood structure. In ICML, 2004.\n\n[5] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 22, 2000.\n\n[6] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 22, 2000.\n\n[7] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, June 2003.\n\n[8] Kilian Q. Weinberger and Lawrence K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In CVPR, 2004.\n\n[9] Marcelo Blatt, Shai Wiseman, and Eytan Domany. Data clustering using a model granular magnet. Neural Computation, 9(8):1805-1842, November 1997.\n\n[10] C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Computers, C-20(1):68-86, April 1971.\n\n[11] Chakra Chennubhotla and Allan Jepson. EigenCuts: Half-lives of EigenFlows for spectral clustering. 
In NIPS, 2003.\n\n[12] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, Oxford, 1995.\n\n[13] Yoram Gdalyahu, Daphna Weinshall, and Michael Werman. Self organization in vision: Stochastic clustering for image segmentation, perceptual grouping, and image database organization. IEEE Trans. on Pattern Anal. and Machine Intel., 23(10):1053-1074, October 2001.\n", "award": [], "sourceid": 2681, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Miguel", "family_name": "Carreira-Perpiñán", "institution": null}]}