{"title": "Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 177, "page_last": 184, "abstract": "", "full_text": "Out-of-Sample Extensions for LLE, Isomap,
MDS, Eigenmaps, and Spectral Clustering

Yoshua Bengio, Jean-François Paiement, Pascal Vincent,
Olivier Delalleau, Nicolas Le Roux and Marie Ouimet
Département d'Informatique et Recherche Opérationnelle
Université de Montréal
Montréal, Québec, Canada, H3C 3J7
{bengioy,vincentp,paiemeje,delallea,lerouxni,ouimemag}@iro.umontreal.ca

Abstract

Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data.

1 Introduction

Many unsupervised learning algorithms have been recently proposed, all using an eigendecomposition to obtain a lower-dimensional embedding of data lying on a non-linear manifold: Local Linear Embedding (LLE) (Roweis and Saul, 2000), Isomap (Tenenbaum, de Silva and Langford, 2000) and Laplacian Eigenmaps (Belkin and Niyogi, 2003).
There are also many variants of Spectral Clustering (Weiss, 1999; Ng, Jordan and Weiss, 2002), in which such an embedding is an intermediate step before obtaining a clustering of the data that can capture flat, elongated and even curved clusters. The two tasks (manifold learning and clustering) are linked because the clusters found by spectral clustering can be arbitrary curved manifolds (as long as there is enough data to locally capture their curvature).

2 Common Framework

In this paper we consider five types of unsupervised learning algorithms that can be cast in the same framework, based on the computation of an embedding for the training points obtained from the principal eigenvectors of a symmetric matrix.

Algorithm 1
1. Start from a data set $D = \{x_1, \dots, x_n\}$ with $n$ points in $R^d$. Construct an $n \times n$ "neighborhood" or similarity matrix $M$. Let us denote $K_D(\cdot,\cdot)$ (or $K$ for shorthand) the data-dependent function which produces $M$ by $M_{ij} = K_D(x_i, x_j)$.
2. Optionally transform $M$, yielding a "normalized" matrix $\tilde M$. Equivalently, this corresponds to generating $\tilde M$ from a $\tilde K_D$ by $\tilde M_{ij} = \tilde K_D(x_i, x_j)$.
3. Compute the $m$ largest positive eigenvalues $\lambda_k$ and eigenvectors $v_k$ of $\tilde M$.
4. The embedding of each example $x_i$ is the vector $y_i$ with $y_{ik}$ the $i$-th element of the $k$-th principal eigenvector $v_k$ of $\tilde M$. Alternatively (MDS and Isomap), the embedding is $e_i$, with $e_{ik} = \sqrt{\lambda_k}\, y_{ik}$. If the first $m$ eigenvalues are positive, then $e_i \cdot e_j$ is the best approximation of $\tilde M_{ij}$ using only $m$ coordinates, in the squared error sense.

In the following, we consider the specializations of Algorithm 1 for different unsupervised learning algorithms. Let $S_i$ be the $i$-th row sum of the affinity matrix $M$:

    $S_i = \sum_j M_{ij}.$    (1)

We say that two points $(a, b)$ are $k$-nearest-neighbors of each other if $a$ is among the $k$ nearest neighbors of $b$ in $D \cup \{a\}$ or vice-versa. We denote by $x_{ij}$ the $j$-th coordinate of the vector $x_i$.
2.1 Multi-Dimensional Scaling

Multi-Dimensional Scaling (MDS) starts from a notion of distance or affinity $K$ that is computed between each pair of training examples. We consider here metric MDS (Cox and Cox, 1994). For the normalization step 2 in Algorithm 1, these distances are converted to equivalent dot products using the "double-centering" formula:

    $\tilde M_{ij} = -\frac{1}{2}\left( M_{ij} - \frac{1}{n} S_i - \frac{1}{n} S_j + \frac{1}{n^2} \sum_k S_k \right).$    (2)

The embedding $e_{ik}$ of example $x_i$ is given by $\sqrt{\lambda_k}\, v_{ki}$.

2.2 Spectral Clustering

Spectral clustering (Weiss, 1999) can yield impressively good results where traditional clustering looking for "round blobs" in the data, such as K-means, would fail miserably. It is based on two main steps: first embedding the data points in a space in which clusters are more "obvious" (using the eigenvectors of a Gram matrix), and then applying a classical clustering algorithm such as K-means, e.g. as in (Ng, Jordan and Weiss, 2002). The affinity matrix $M$ is formed using a kernel such as the Gaussian kernel. Several normalization steps have been proposed. Among the most successful ones, as advocated in (Weiss, 1999; Ng, Jordan and Weiss, 2002), is the following:

    $\tilde M_{ij} = \frac{M_{ij}}{\sqrt{S_i S_j}}.$    (3)

To obtain $m$ clusters, the first $m$ principal eigenvectors of $\tilde M$ are computed and K-means is applied on the unit-norm coordinates, obtained from the embedding $y_{ik} = v_{ki}$.

2.3 Laplacian Eigenmaps

Laplacian Eigenmaps is a recently proposed dimensionality reduction procedure (Belkin and Niyogi, 2003) that has also been applied to semi-supervised learning. The authors use an approximation of the Laplacian operator such as the Gaussian kernel or the matrix whose element $(i, j)$ is 1 if $x_i$ and $x_j$ are $k$-nearest-neighbors and 0 otherwise.
Instead of solving an ordinary eigenproblem, the following generalized eigenproblem is solved:

    $(S - M)\, v_j = \lambda_j S v_j$    (4)

with eigenvalues $\lambda_j$, eigenvectors $v_j$ and $S$ the diagonal matrix with entries given by eq. (1). The smallest eigenvalue is left out and the eigenvectors corresponding to the other small eigenvalues are used for the embedding. This is the same embedding that is computed with the spectral clustering algorithm from (Shi and Malik, 1997). As noted in (Weiss, 1999) (Normalization Lemma 1), an equivalent result (up to a componentwise scaling of the embedding) can be obtained by considering the principal eigenvectors of the normalized matrix defined in eq. (3).

2.4 Isomap

Isomap (Tenenbaum, de Silva and Langford, 2000) generalizes MDS to non-linear manifolds. It is based on replacing the Euclidean distance by an approximation of the geodesic distance on the manifold. We define the geodesic distance with respect to a data set $D$, a distance $d(u, v)$ and a neighborhood $k$ as follows:

    $\tilde D(a, b) = \min_p \sum_i d(p_i, p_{i+1})$    (5)

where $p$ is a sequence of points of length $l \ge 2$ with $p_1 = a$, $p_l = b$, $p_i \in D\ \forall i \in \{2, \dots, l-1\}$ and $(p_i, p_{i+1})$ are $k$-nearest-neighbors. The length $l$ is free in the minimization. The Isomap algorithm obtains the normalized matrix $\tilde M$ from which the embedding is derived by transforming the raw pairwise distances matrix as follows: first compute the matrix $M_{ij} = \tilde D^2(x_i, x_j)$ of squared geodesic distances with respect to the data $D$, then apply to this matrix the distance-to-dot-product transformation (eq. (2)), as for MDS. As in MDS, the embedding is $e_{ik} = \sqrt{\lambda_k}\, v_{ki}$ rather than $y_{ik} = v_{ki}$.

2.5 LLE

The Local Linear Embedding (LLE) algorithm (Roweis and Saul, 2000) looks for an embedding that preserves the local geometry in the neighborhood of each data point.
First, a sparse matrix of local predictive weights $W_{ij}$ is computed, such that $\sum_j W_{ij} = 1$, $W_{ij} = 0$ if $x_j$ is not a $k$-nearest-neighbor of $x_i$, and $\|\sum_j W_{ij} x_j - x_i\|^2$ is minimized. Then the matrix

    $M = (I - W)^\top (I - W)$    (6)

is formed. The embedding is obtained from the lowest eigenvectors of $M$, except for the smallest eigenvector, which is uninteresting because it is $(1, 1, \dots, 1)$, with eigenvalue 0. Note that the lowest eigenvectors of $M$ are the largest eigenvectors of $\tilde M_\mu = \mu I - M$, which fits Algorithm 1 (the use of $\mu > 0$ will be discussed in section 4.4). The embedding is given by $y_{ik} = v_{ki}$, and is constant with respect to $\mu$.

3 From Eigenvectors to Eigenfunctions

To obtain an embedding for a new data point, we propose to use the Nyström formula (eq. (9)) (Baker, 1977), which has been used successfully to speed up kernel methods computations by focussing the heavier computations (the eigendecomposition) on a subset of examples. The use of this formula can be justified by considering the convergence of eigenvectors and eigenvalues as the number of examples increases (Baker, 1977; Williams and Seeger, 2000; Koltchinskii and Giné, 2000; Shawe-Taylor and Williams, 2003). Intuitively, the extensions to obtain the embedding for a new example require specifying a new column of the Gram matrix $\tilde M$, through a training-set dependent kernel function $\tilde K_D$, in which one of the arguments may be required to be in the training set.

If we start from a data set $D$, obtain an embedding for its elements, and add more and more data, the embedding for the points in $D$ converges (for eigenvalues that are unique). (Shawe-Taylor and Williams, 2003) give bounds on the convergence error (in the case of kernel PCA).
In the limit, we expect each eigenvector to converge to an eigenfunction for the linear operator defined below, in the sense that the $i$-th element of the $k$-th eigenvector converges to the application of the $k$-th eigenfunction to $x_i$ (up to a normalization factor).

Consider a Hilbert space $H_p$ of functions with inner product $\langle f, g \rangle_p = \int f(x) g(x) p(x)\, dx$, with a density function $p(x)$. Associate with kernel $K$ a linear operator $K_p$ in $H_p$:

    $(K_p f)(x) = \int K(x, y) f(y) p(y)\, dy.$    (7)

We don't know the true density $p$ but we can approximate the above inner product and linear operator (and its eigenfunctions) using the empirical distribution $\hat p$. An "empirical" Hilbert space $H_{\hat p}$ is thus defined using $\hat p$ instead of $p$. Note that the proposition below can be applied even if the kernel is not positive semi-definite, although the embedding algorithms we have studied are restricted to using the principal coordinates associated with positive eigenvalues. For a more rigorous mathematical analysis, see (Bengio et al., 2003).

Proposition 1
Let $\tilde K(a, b)$ be a kernel function, not necessarily positive semi-definite, that gives rise to a symmetric matrix $\tilde M$ with entries $\tilde M_{ij} = \tilde K(x_i, x_j)$ upon a dataset $D = \{x_1, \dots, x_n\}$. Let $(v_k, \lambda_k)$ be an (eigenvector, eigenvalue) pair that solves $\tilde M v_k = \lambda_k v_k$. Let $(f_k, \lambda'_k)$ be an (eigenfunction, eigenvalue) pair that solves $(\tilde K_{\hat p} f_k)(x) = \lambda'_k f_k(x)$ for any $x$, with $\hat p$ the empirical distribution over $D$. Let $e_k(x) = y_k(x) \sqrt{\lambda_k}$ or $y_k(x)$ denote the embedding associated with a new point $x$. Then

    $\lambda'_k = \frac{1}{n} \lambda_k$    (8)

    $f_k(x) = \frac{\sqrt{n}}{\lambda_k} \sum_{i=1}^n v_{ki} \tilde K(x, x_i)$    (9)

    $f_k(x_i) = \sqrt{n}\, v_{ki}$    (10)

    $y_k(x) = \frac{f_k(x)}{\sqrt{n}} = \frac{1}{\lambda_k} \sum_{i=1}^n v_{ki} \tilde K(x, x_i)$    (11)

    $y_k(x_i) = y_{ik}, \qquad e_k(x_i) = e_{ik}$    (12)

See (Bengio et al., 2003) for a proof and further justifications of the above formulae.
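Equation (11) reduces out-of-sample embedding to a weighted sum of kernel evaluations against the training set. A minimal numpy sketch (the function name and interface are ours, not the paper's):

```python
import numpy as np

def nystrom_embedding(M_tilde, k_new, dim):
    """Out-of-sample embedding via the Nystrom formula (eq. 11).

    M_tilde : (n, n) symmetric normalized Gram matrix on the training set
    k_new   : (n,) kernel values K~(x, x_i) between the new point x and training points
    dim     : number of embedding coordinates m
    Returns y(x), the m-dimensional embedding of the new point.
    """
    # eigh returns eigenvalues in ascending order; keep the m largest.
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    order = np.argsort(eigvals)[::-1][:dim]
    lam = eigvals[order]            # lambda_k
    V = eigvecs[:, order]           # columns are the eigenvectors v_k
    # y_k(x) = (1 / lambda_k) * sum_i v_ki * K~(x, x_i)
    return (V.T @ k_new) / lam
```

Feeding the $i$-th column of $\tilde M$ as the kernel values of a training point recovers $y_{ik}$, consistent with eq. (12).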
The generalized embedding for Isomap and MDS is $e_k(x) = \sqrt{\lambda_k}\, y_k(x)$ whereas the one for spectral clustering, Laplacian eigenmaps and LLE is $y_k(x)$.

Proposition 2
In addition, if the data-dependent kernel $\tilde K_D$ is positive semi-definite, then

    $f_k(x) = \sqrt{\frac{n}{\lambda_k}}\, \pi_k(x)$

where $\pi_k(x)$ is the $k$-th component of the kernel PCA projection of $x$ obtained from the kernel $\tilde K_D$ (up to centering).

This relation with kernel PCA (Schölkopf, Smola and Müller, 1998), already pointed out in (Williams and Seeger, 2000), is further discussed in (Bengio et al., 2003).

4 Extending to new Points

Using Proposition 1, one obtains a natural extension of all the unsupervised learning algorithms mapped to Algorithm 1, provided we can write down a kernel function $\tilde K$ that gives rise to the matrix $\tilde M$ on $D$, and can be used in eq. (11) to generalize the embedding. We consider each of them in turn below. In addition to the convergence properties discussed in section 3, another justification for using eq. (9) is given by the following proposition:

Proposition 3
If we define the $f_k(x_i)$ by eq. (10) and take a new point $x$, the value of $f_k(x)$ that minimizes

    $\sum_{i=1}^n \left( \tilde K(x, x_i) - \sum_{t=1}^m \lambda'_t f_t(x) f_t(x_i) \right)^2$    (13)

is given by eq. (9), for $m \ge 1$ and any $k \le m$.

The proof is a direct consequence of the orthogonality of the eigenvectors $v_k$. This proposition links equations (9) and (10). Indeed, we can obtain eq. (10) when trying to approximate $\tilde K$ at the data points by minimizing the cost

    $\sum_{i,j=1}^n \left( \tilde K(x_i, x_j) - \sum_{t=1}^m \lambda'_t f_t(x_i) f_t(x_j) \right)^2$

for $m = 1, 2, \dots$ When we add a new point $x$, it is thus natural to use the same cost to approximate the $\tilde K(x, x_i)$, which yields (13). Note that by doing so, we do not seek to approximate $\tilde K(x, x)$.
Future work should investigate embeddings which minimize the empirical reconstruction error of $\tilde K$ but ignore the diagonal contributions.

4.1 Extending MDS

For MDS, a normalized kernel can be defined as follows, using a continuous version of the double-centering eq. (2):

    $\tilde K(a, b) = -\frac{1}{2}\left( d^2(a, b) - E_x[d^2(x, b)] - E_{x'}[d^2(a, x')] + E_{x,x'}[d^2(x, x')] \right)$    (14)

where $d(a, b)$ is the original distance and the expectations are taken over the empirical data $D$. An extension of metric MDS to new points has already been proposed in (Gower, 1968), solving exactly for the embedding of $x$ to be consistent with its distances to training points, which in general requires adding a new dimension.

4.2 Extending Spectral Clustering and Laplacian Eigenmaps

Both the version of Spectral Clustering and the version of Laplacian Eigenmaps described above are based on an initial kernel $K$, such as the Gaussian or nearest-neighbor kernel. An equivalent normalized kernel is:

    $\tilde K(a, b) = \frac{1}{n} \frac{K(a, b)}{\sqrt{E_x[K(a, x)]\, E_{x'}[K(b, x')]}}$

where the expectations are taken over the empirical data $D$.

4.3 Extending Isomap

To extend Isomap, the test point is not used in computing the geodesic distance between training points, otherwise we would have to recompute all the geodesic distances. A reasonable solution is to use the definition of $\tilde D(a, b)$ in eq. (5), which only uses the training points as the intermediate points on the path from $a$ to $b$. We obtain a normalized kernel by applying the continuous double-centering of eq. (14) with $d = \tilde D$.

A formula has already been proposed (de Silva and Tenenbaum, 2003) to approximate Isomap using only a subset of the examples (the "landmark" points) to compute the eigenvectors. Using our notations, this formula is

    $e'_k(x) = \frac{1}{2\sqrt{\lambda_k}} \sum_i v_{ki} \left( E_{x'}[\tilde D^2(x', x_i)] - \tilde D^2(x_i, x) \right)$    (15)

where $E_{x'}$ is an average over the data set.
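Formula (15) is easy to implement once the squared geodesic distances are available. A numpy sketch (helper name ours; squared Euclidean distances can stand in for $\tilde D^2$ when testing):

```python
import numpy as np

def landmark_isomap_extend(D2, d2_new, dim):
    """Embed a new point with the Landmark-Isomap formula (eq. 15).

    D2     : (n, n) squared geodesic distances among training points
    d2_new : (n,) squared geodesic distances from the new point x to training points
    dim    : number of embedding coordinates
    """
    n = D2.shape[0]
    # Double-centering (eq. 2) of the squared-distance matrix gives M~.
    H = np.eye(n) - np.ones((n, n)) / n
    M_tilde = -0.5 * H @ D2 @ H
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    order = np.argsort(eigvals)[::-1][:dim]
    lam, V = eigvals[order], eigvecs[:, order]
    # e'_k(x) = (1 / (2 sqrt(lambda_k))) * sum_i v_ki (E_x'[D~^2(x', x_i)] - D~^2(x_i, x))
    return (V.T @ (D2.mean(axis=0) - d2_new)) / (2.0 * np.sqrt(lam))
```

Applied to a training point (its column of squared distances as `d2_new`), this reproduces the in-sample Isomap/MDS coordinates $e_{ik} = \sqrt{\lambda_k} v_{ki}$, as Corollary 1 below implies.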
The formula is applied to obtain an embedding for the non-landmark examples.

Corollary 1
The embedding proposed in Proposition 1 for Isomap ($e_k(x)$) is equal to formula (15) (Landmark Isomap) when $\tilde K(x, y)$ is defined as in eq. (14) with $d = \tilde D$.

Proof: the proof relies on a property of the Gram matrix for Isomap: $\sum_i M_{ij} = 0$, by construction. Therefore $(1, 1, \dots, 1)$ is an eigenvector with eigenvalue 0, and all the other eigenvectors $v_k$ have the property $\sum_i v_{ki} = 0$ because of the orthogonality with $(1, 1, \dots, 1)$. Writing $E_{x'}[\tilde D^2(x', x_i)] - \tilde D^2(x, x_i) = 2 \tilde K(x, x_i) + E_{x',x''}[\tilde D^2(x', x'')] - E_{x'}[\tilde D^2(x, x')]$ yields $e'_k(x) = \frac{2}{2\sqrt{\lambda_k}} \sum_i v_{ki} \tilde K(x, x_i) + \frac{1}{2\sqrt{\lambda_k}} \left( E_{x',x''}[\tilde D^2(x', x'')] - E_{x'}[\tilde D^2(x, x')] \right) \sum_i v_{ki} = e_k(x)$, since the last sum is 0.

4.4 Extending LLE

The extension of LLE is the most challenging one because it does not fit as well the framework of Algorithm 1: the $M$ matrix for LLE does not have a clear interpretation in terms of distance or dot product. An extension has been proposed in (Saul and Roweis, 2002), but unfortunately it cannot be cast directly into the framework of Proposition 1. Their embedding of a new point $x$ is given by

    $y_k(x) = \sum_{i=1}^n y_k(x_i)\, w(x, x_i)$    (16)

where $w(x, x_i)$ is the weight of $x_i$ in the reconstruction of $x$ by its $k$-nearest-neighbors in the training set (if $x = x_j \in D$, $w(x, x_i) = \delta_{ij}$). This is very close to eq. (11), but lacks the normalization by $\lambda_k$. However, we can see this embedding as a limit case of Proposition 1, as shown below.

We first need to define a kernel $\tilde K_\mu$ such that

    $\tilde K_\mu(x_i, x_j) = \tilde M_{\mu,ij} = (\mu - 1)\,\delta_{ij} + W_{ij} + W_{ji} - \sum_k W_{ki} W_{kj}$    (17)

for $x_i, x_j \in D$. Let us define a kernel $\tilde K'$ by

    $\tilde K'(x_i, x) = \tilde K'(x, x_i) = w(x, x_i)$

and $\tilde K'(x, y) = 0$ when neither $x$ nor $y$ is in the training set $D$.
Let $\tilde K''$ be defined by

    $\tilde K''(x_i, x_j) = W_{ij} + W_{ji} - \sum_k W_{ki} W_{kj}$

and $\tilde K''(x, y) = 0$ when either $x$ or $y$ isn't in $D$. Then, by construction, the kernel $\tilde K_\mu = (\mu - 1) \tilde K' + \tilde K''$ verifies eq. (17). Thus, we can apply eq. (11) to obtain an embedding of a new point $x$, which yields

    $y_{\mu,k}(x) = \frac{1}{\lambda_k} \sum_i y_{ik} \left( (\mu - 1) \tilde K'(x, x_i) + \tilde K''(x, x_i) \right)$

with $\lambda_k = \mu - \hat\lambda_k$, and $\hat\lambda_k$ being the $k$-th lowest eigenvalue of $M$. This rewrites into

    $y_{\mu,k}(x) = \frac{\mu - 1}{\mu - \hat\lambda_k} \sum_i y_{ik}\, w(x, x_i) + \frac{1}{\mu - \hat\lambda_k} \sum_i y_{ik}\, \tilde K''(x, x_i).$

Then when $\mu \to \infty$, $y_{\mu,k}(x) \to y_k(x)$ as defined by eq. (16). Since the choice of $\mu$ is free, we can thus consider eq. (16) as approximating the use of the kernel $\tilde K_\mu$ with a large $\mu$ in Proposition 1. This is what we have done in the experiments described in the next section. Note however that we can find smoother kernels $\tilde K_\mu$ verifying eq. (17), giving other extensions of LLE from Proposition 1. It is out of the scope of this paper to study which kernel is best for generalization, but it seems desirable to use a smooth kernel that would take into account not only the reconstruction of $x$ by its neighbors $x_i$, but also the reconstruction of the $x_i$ by their neighbors including the new point $x$.

5 Experiments

We want to evaluate whether the precision of the generalizations suggested in the previous section is comparable to the intrinsic perturbations of the embedding algorithms. The perturbation analysis will be achieved by considering splits of the data into three sets, $D = F \cup R_1 \cup R_2$, and training either with $F \cup R_1$ or $F \cup R_2$, comparing the embeddings on $F$.
For each algorithm described in section 2, we apply the following procedure:

Figure 1: Training set variability minus out-of-sample error, with respect to the proportion of training samples substituted. Top left: MDS. Top right: spectral clustering or Laplacian eigenmaps. Bottom left: Isomap. Bottom right: LLE. Error bars are 95% confidence intervals.

1. We choose $F \subset D$ with $m = |F|$ samples. The remaining $n - m$ samples in $D \setminus F$ are split into two equal size subsets $R_1$ and $R_2$. We train (obtain the eigenvectors) over $F \cup R_1$ and $F \cup R_2$. When eigenvalues are close, the estimated eigenvectors are unstable and can rotate in the subspace they span. Thus we estimate an affine alignment between the two embeddings using the points in $F$, and we calculate the Euclidean distance between the aligned embeddings obtained for each $s_i \in F$.
2. For each sample $s_i \in F$, we also train over $\{F \cup R_1\} \setminus \{s_i\}$. We apply the extension to out-of-sample points to find the predicted embedding of $s_i$ and calculate the Euclidean distance between this embedding and the one obtained when training with $F \cup R_1$, i.e. with $s_i$ in the training set.
3. We calculate the mean difference (and its standard error, shown in the figure) between the distance obtained in step 1 and the one obtained in step 2 for each sample $s_i \in F$, and we repeat this experiment for various sizes of $F$.

The results obtained for MDS, Isomap, spectral clustering and LLE are shown in figure 1 for different values of m.
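The affine alignment used in step 1 above amounts to a least-squares fit between the two embeddings of $F$. A minimal sketch (helper names ours, with synthetic stand-ins for the actual embeddings):

```python
import numpy as np

def align_affine(E1, E2):
    """Fit an affine map taking E1 onto E2 in the least-squares sense.

    E1, E2 : (m, d) embeddings of the same points F from two trainings.
    Returns the aligned version of E1.
    """
    m = E1.shape[0]
    A = np.hstack([E1, np.ones((m, 1))])      # append a bias column
    T, *_ = np.linalg.lstsq(A, E2, rcond=None)
    return A @ T

def embedding_distances(E1, E2):
    """Per-point Euclidean distances after affine alignment (step 1)."""
    aligned = align_affine(E1, E2)
    return np.linalg.norm(aligned - E2, axis=1)
```

When the two embeddings differ only by a rotation and translation (the eigenvector instability described in step 1), the aligned distances are essentially zero.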
Experiments are done over a database of 698 synthetic face images described by 4096 components that is available at http://isomap.stanford.edu. Qualitatively similar results have been obtained over other databases such as Ionosphere (http://www.ics.uci.edu/~mlearn/MLSummary.html) and swissroll (http://www.cs.toronto.edu/~roweis/lle/). Each algorithm generates a two-dimensional embedding of the images, following the experiments reported for Isomap. The number of neighbors is 10 for Isomap and LLE, and a Gaussian kernel with a standard deviation of 0.01 is used for spectral clustering / Laplacian eigenmaps. 95% confidence intervals are drawn beside each mean difference of error on the figure.

As expected, the mean difference between the two distances is almost monotonically increasing as the fraction of substituted examples grows (x-axis in the figure). In most cases, the out-of-sample error is less than or comparable to the training set embedding stability: it corresponds to substituting a fraction of between 1 and 4% of the training examples.

6 Conclusions

In this paper we have presented an extension of five unsupervised learning algorithms based on a spectral embedding of the data: MDS, spectral clustering, Laplacian eigenmaps, Isomap and LLE. This extension allows one to apply a trained model to out-of-sample points without having to recompute eigenvectors. It introduces a notion of function induction and generalization error for these algorithms. The experiments on real high-dimensional data show that the average distance between the out-of-sample and in-sample embeddings is comparable to or lower than the variation in the in-sample embedding due to replacing a few points in the training set.

References

Baker, C. (1977). The Numerical Treatment of Integral Equations. Clarendon Press, Oxford.

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396.

Bengio, Y., Vincent, P., Paiement, J., Delalleau, O., Ouimet, M., and Le Roux, N. (2003). Spectral clustering and kernel PCA are learning eigenfunctions. Technical report, Département d'informatique et recherche opérationnelle, Université de Montréal.

Cox, T. and Cox, M. (1994). Multidimensional Scaling. Chapman & Hall, London.

de Silva, V. and Tenenbaum, J. (2003). Global versus local methods in nonlinear dimensionality reduction. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems, volume 15, pages 705–712, Cambridge, MA. The MIT Press.

Gower, J. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55(3):582–585.

Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6(1):113–167.

Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: analysis and an algorithm. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, Cambridge, MA. MIT Press.

Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.

Saul, L. and Roweis, S. (2002). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319.

Shawe-Taylor, J. and Williams, C. (2003). The stability of kernel principal components analysis and its relation to the process eigenspectrum. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems, volume 15. The MIT Press.

Shi, J. and Malik, J. (1997).
Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 731–737.

Tenenbaum, J., de Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Weiss, Y. (1999). Segmentation using eigenvectors: a unifying view. In Proceedings IEEE International Conference on Computer Vision, pages 975–982.

Williams, C. and Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann.
", "award": [], "sourceid": 2461, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Jean-Fran\u00e7ois", "family_name": "Paiement", "institution": null}, {"given_name": "Pascal", "family_name": "Vincent", "institution": null}, {"given_name": "Olivier", "family_name": "Delalleau", "institution": null}, {"given_name": "Nicolas", "family_name": "Le Roux", "institution": null}, {"given_name": "Marie", "family_name": "Ouimet", "institution": null}]}