{"title": "Locality Preserving Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 160, "abstract": "", "full_text": "Locality Preserving Projections\n\nXiaofei He\n\nDepartment of Computer Science\n\nThe University of Chicago\n\nChicago, IL 60637\n\nPartha Niyogi\n\nDepartment of Computer Science\n\nThe University of Chicago\n\nChicago, IL 60637\n\nxiaofei@cs.uchicago.edu\n\nniyogi@cs.uchicago.edu\n\nAbstract\n\nMany problems in information processing involve some form of dimen-\nsionality reduction. In this paper, we introduce Locality Preserving Pro-\njections (LPP). These are linear projective maps that arise by solving a\nvariational problem that optimally preserves the neighborhood structure\nof the data set. LPP should be seen as an alternative to Principal Com-\nponent Analysis (PCA) \u2013 a classical linear technique that projects the\ndata along the directions of maximal variance. When the high dimen-\nsional data lies on a low dimensional manifold embedded in the ambient\nspace, the Locality Preserving Projections are obtained by \ufb01nding the\noptimal linear approximations to the eigenfunctions of the Laplace Bel-\ntrami operator on the manifold. As a result, LPP shares many of the\ndata representation properties of nonlinear techniques such as Laplacian\nEigenmaps or Locally Linear Embedding. Yet LPP is linear and more\ncrucially is de\ufb01ned everywhere in ambient space rather than just on the\ntraining data points. This is borne out by illustrative examples on some\nhigh dimensional data sets.\n\n1. Introduction\n\nSuppose we have a collection of data points of n-dimensional real vectors drawn from an\nunknown probability distribution. In increasingly many cases of interest in machine learn-\ning and data mining, one is confronted with the situation where n is very large. However,\nthere might be reason to suspect that the \u201cintrinsic dimensionality\u201d of the data is much\nlower. 
This leads one to consider methods of dimensionality reduction that allow one to represent the data in a lower dimensional space.\n\nIn this paper, we propose a new linear dimensionality reduction algorithm, called Locality Preserving Projections (LPP). It builds a graph incorporating neighborhood information of the data set. Using the notion of the Laplacian of the graph, we then compute a transformation matrix which maps the data points to a subspace. This linear transformation optimally preserves local neighborhood information in a certain sense. The representation map generated by the algorithm may be viewed as a linear discrete approximation to a continuous map that naturally arises from the geometry of the manifold [2]. The new algorithm is interesting from a number of perspectives.\n\n1. The maps are designed to minimize a different objective criterion from the classical linear techniques.\n\n2. The locality preserving quality of LPP is likely to be of particular use in information retrieval applications. If one wishes to retrieve audio, video, or text documents under a vector space model, then one will ultimately need to do a nearest neighbor search in the low dimensional space. Since LPP is designed to preserve local structure, it is likely that a nearest neighbor search in the low dimensional space will yield results similar to those in the high dimensional space. This makes for an indexing scheme that would allow quick retrieval.\n\n3. LPP is linear. This makes it fast and suitable for practical application. While a number of nonlinear techniques have properties (1) and (2) above, we know of no other linear projective technique with such a property.\n\n4. LPP is defined everywhere. 
Recall that nonlinear dimensionality reduction techniques like ISOMAP [6], LLE [5], and Laplacian eigenmaps [2] are defined only on the training data points, and it is unclear how to evaluate the map for new test points. In contrast, the Locality Preserving Projection may be simply applied to any new data point to locate it in the reduced representation space.\n\n5. LPP may be conducted in the original space or in the reproducing kernel Hilbert space (RKHS) into which the data points are mapped. This gives rise to kernel LPP.\n\nAs a result of all these features, we expect the LPP based techniques to be a natural alternative to PCA based techniques in exploratory data analysis, information retrieval, and pattern classification applications.\n\n2. Locality Preserving Projections\n\n2.1. The linear dimensionality reduction problem\n\nThe generic problem of linear dimensionality reduction is the following. Given a set x1, x2, ..., xm in R^n, find a transformation matrix A that maps these m points to a set of points y1, y2, ..., ym in R^l (l << n), such that yi “represents” xi, where yi = A^T xi. Our method is of particular applicability in the special case where x1, x2, ..., xm ∈ M and M is a nonlinear manifold embedded in R^n.\n\n2.2. The algorithm\n\nLocality Preserving Projection (LPP) is a linear approximation of the nonlinear Laplacian Eigenmap [2]. The algorithmic procedure is formally stated below:\n\n1. Constructing the adjacency graph: Let G denote a graph with m nodes. We put an edge between nodes i and j if xi and xj are “close”. There are two variations:\n\n(a) ε-neighborhoods. [parameter ε ∈ R] Nodes i and j are connected by an edge if ||xi − xj||^2 < ε, where the norm is the usual Euclidean norm in R^n.\n\n(b) k nearest neighbors. 
[parameter k ∈ N] Nodes i and j are connected by an edge if i is among the k nearest neighbors of j or j is among the k nearest neighbors of i.\n\nNote: The method of constructing an adjacency graph outlined above is correct if the data actually lie on a low dimensional manifold. In general, however, one might take a more utilitarian perspective and construct an adjacency graph based on any principle (for example, perceptual similarity for natural signals, hyperlink structures for web documents, etc.). Once such an adjacency graph is obtained, LPP will try to optimally preserve it in choosing projections.\n\n2. Choosing the weights: Here, as well, we have two variations for weighting the edges. W is a sparse symmetric m × m matrix with Wij holding the weight of the edge joining vertices i and j, and 0 if there is no such edge.\n\n(a) Heat kernel. [parameter t ∈ R] If nodes i and j are connected, put\n\nWij = exp(−||xi − xj||^2 / t)\n\nThe justification for this choice of weights can be traced back to [2].\n\n(b) Simple-minded. [No parameter] Wij = 1 if and only if vertices i and j are connected by an edge.\n\n3. Eigenmaps: Compute the eigenvectors and eigenvalues for the generalized eigenvector problem:\n\nXLX^T a = λ XDX^T a    (1)\n\nwhere D is a diagonal matrix whose entries are column (or row, since W is symmetric) sums of W, Dii = Σj Wji, and L = D − W is the Laplacian matrix. The ith column of matrix X is xi.\n\nLet the column vectors a0, ..., a(l−1) be the solutions of equation (1), ordered according to their eigenvalues, λ0 < ... < λ(l−1). Thus, the embedding is as follows:\n\nxi → yi = A^T xi,  A = (a0, a1, ..., a(l−1))\n\nwhere yi is an l-dimensional vector, and A is an n × l matrix.\n\n3. Justification\n\n3.1. Optimal Linear Embedding\n\nThe following section is based on standard spectral graph theory. 
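Before turning to the justification, the three-step procedure of Section 2.2 (k-nearest-neighbor graph, heat-kernel weights, generalized eigenproblem) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' code: it uses dense NumPy/SciPy matrices, symmetrizes the k-NN graph, and adds a tiny ridge so the right-hand matrix is strictly positive definite for the solver.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, k=5, t=1.0):
    """Sketch of Locality Preserving Projections.

    X : (n_features, m) matrix whose i-th COLUMN is x_i (paper's convention).
    Returns A of shape (n_features, n_components) so that y_i = A.T @ x_i.
    """
    m = X.shape[1]
    # Step 1: k-nearest-neighbor adjacency graph on squared Euclidean distance.
    D2 = cdist(X.T, X.T, "sqeuclidean")
    W = np.zeros((m, m))
    for i in range(m):
        idx = np.argsort(D2[i])[1:k + 1]        # k nearest, skipping self
        # Step 2: heat-kernel weights W_ij = exp(-||x_i - x_j||^2 / t)
        W[i, idx] = np.exp(-D2[i, idx] / t)
    W = np.maximum(W, W.T)                      # edge if i is a k-NN of j or vice versa
    # Step 3: generalized eigenproblem  X L X^T a = lambda X D X^T a
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W
    lhs = X @ L @ X.T
    rhs = X @ Dg @ X.T + 1e-9 * np.eye(X.shape[0])   # ridge: keep rhs positive definite
    vals, vecs = eigh(lhs, rhs)                 # eigenvalues in ascending order
    return vecs[:, :n_components]               # smallest-eigenvalue solutions
```

For a new test point x, the embedding is simply `A.T @ x`, which is exactly the "defined everywhere" property claimed in point 4 above.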
See [4] for a comprehensive reference and [2] for applications to data representation.\n\nRecall that given a data set we construct a weighted graph G = (V, E) with edges connecting nearby points to each other. Consider the problem of mapping the weighted graph G to a line so that connected points stay as close together as possible. Let y = (y1, y2, ..., ym)^T be such a map. A reasonable criterion for choosing a “good” map is to minimize the following objective function [2]\n\nΣij (yi − yj)^2 Wij\n\nunder appropriate constraints. The objective function with our choice of Wij incurs a heavy penalty if neighboring points xi and xj are mapped far apart. Therefore, minimizing it is an attempt to ensure that if xi and xj are “close” then yi and yj are close as well. Suppose a is a transformation vector, that is, y^T = a^T X, where the ith column vector of X is xi. By simple algebra, the objective function can be reduced to\n\n(1/2) Σij (yi − yj)^2 Wij = (1/2) Σij (a^T xi − a^T xj)^2 Wij = Σi a^T xi Dii xi^T a − Σij a^T xi Wij xj^T a = a^T X(D − W)X^T a = a^T XLX^T a\n\nwhere X = [x1, x2, ..., xm], and D is a diagonal matrix whose entries are column (or row, since W is symmetric) sums of W, Dii = Σj Wij. L = D − W is the Laplacian matrix [4]. Matrix D provides a natural measure on the data points. The bigger the value Dii (corresponding to yi) is, the more “important” is yi. 
Therefore, we impose a constraint as follows:\n\ny^T D y = 1  ⟹  a^T XDX^T a = 1\n\nFinally, the minimization problem reduces to finding:\n\narg min_{a^T XDX^T a = 1} a^T XLX^T a\n\nThe transformation vector a that minimizes the objective function is given by the minimum eigenvalue solution to the generalized eigenvalue problem:\n\nXLX^T a = λ XDX^T a\n\nIt is easy to show that the matrices XLX^T and XDX^T are symmetric and positive semidefinite. The vectors ai (i = 0, 1, ..., l − 1) that minimize the objective function are given by the minimum eigenvalue solutions to the generalized eigenvalue problem.\n\n3.2. Geometrical Justification\n\nThe Laplacian matrix L (= D − W) for a finite graph [4] is analogous to the Laplace Beltrami operator L on compact Riemannian manifolds. While the Laplace Beltrami operator for a manifold is generated by the Riemannian metric, for a graph it comes from the adjacency relation.\n\nLet M be a smooth, compact, d-dimensional Riemannian manifold. If the manifold is embedded in R^n, the Riemannian structure on the manifold is induced by the standard Riemannian structure on R^n. We are looking here for a map from the manifold to the real line such that points close together on the manifold get mapped close together on the line. Let f be such a map. Assume that f : M → R is twice differentiable. Belkin and Niyogi [2] showed that the optimal map preserving locality can be found by solving the following optimization problem on the manifold:\n\narg min_{||f||_{L2(M)} = 1} ∫_M ||∇f||^2\n\nwhich is equivalent to¹\n\narg min_{||f||_{L2(M)} = 1} ∫_M L(f) f\n\nwhere the integral is taken with respect to the standard measure on a Riemannian manifold. L is the Laplace Beltrami operator on the manifold, i.e. Lf = −div ∇(f). Thus, the optimal f has to be an eigenfunction of L. 
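The equivalence of the two minimization problems above is a standard integration-by-parts (Green's identity) argument; a one-line sketch, assuming M is closed so no boundary term appears (cf. footnote 1):

```latex
\int_{\mathcal{M}} \|\nabla f\|^{2}
  = \int_{\mathcal{M}} \langle \nabla f, \nabla f \rangle
  = -\int_{\mathcal{M}} \operatorname{div}(\nabla f)\, f
  = \int_{\mathcal{M}} \mathcal{L}(f)\, f,
\qquad \mathcal{L}f = -\operatorname{div} \nabla f .
```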
The integral ∫_M L(f) f can be discretely approximated by ⟨f(X), Lf(X)⟩ = f^T(X) L f(X) on a graph, where\n\nf(X) = [f(x1), f(x2), ..., f(xm)]^T,  f^T(X) = [f(x1), f(x2), ..., f(xm)]\n\nIf we restrict the map to be linear, i.e. f(x) = a^T x, then we have\n\nf(X) = X^T a  ⟹  ⟨f(X), Lf(X)⟩ = f^T(X) L f(X) = a^T XLX^T a\n\nThe constraint can be computed as follows,\n\n||f||^2_{L2(M)} = ∫_M |f(x)|^2 dx = ∫_M (a^T x)^2 dx = ∫_M (a^T x x^T a) dx = a^T (∫_M x x^T dx) a\n\nwhere dx is the standard measure on a Riemannian manifold. By spectral graph theory [4], the measure dx directly corresponds to the measure for the graph, which is the degree of the vertex, i.e. Dii. Thus, ||f||^2_{L2(M)} can be discretely approximated as follows,\n\n||f||^2_{L2(M)} = a^T (∫_M x x^T dx) a ≈ a^T (Σi xi xi^T Dii) a = a^T XDX^T a\n\nFinally, we conclude that the optimal linear projective map, i.e. f(x) = a^T x, can be obtained by solving the following objective function,\n\narg min_{a^T XDX^T a = 1} a^T XLX^T a\n\n¹ If M has a boundary, appropriate boundary conditions for f need to be assumed.\n\nThese projective maps are the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. Therefore, they are capable of discovering the nonlinear manifold structure.\n\n3.3. Kernel LPP\n\nSuppose that the Euclidean space R^n is mapped to a Hilbert space H through a nonlinear mapping function φ : R^n → H. Let φ(X) denote the data matrix in the Hilbert space, φ(X) = [φ(x1), φ(x2), ..., φ(xm)]. Now, the eigenvector problem in the Hilbert space can be written as follows:\n\n[φ(X) L φ^T(X)] ν = λ [φ(X) D φ^T(X)] ν    (2)\n\nTo generalize LPP to the nonlinear case, we formulate it in a way that uses dot products exclusively. 
Therefore, we consider an expression of the dot product on the Hilbert space H given by the following kernel function:\n\nK(xi, xj) = (φ(xi) · φ(xj)) = φ^T(xi) φ(xj)\n\nBecause the eigenvectors of (2) are linear combinations of φ(x1), φ(x2), ..., φ(xm), there exist coefficients αi, i = 1, 2, ..., m, such that\n\nν = Σ_{i=1}^m αi φ(xi) = φ(X) α\n\nwhere α = [α1, α2, ..., αm]^T ∈ R^m. By simple algebra, we can finally obtain the following eigenvector problem:\n\nKLK α = λ KDK α    (3)\n\nLet the column vectors α^1, α^2, ..., α^m be the solutions of equation (3). For a test point x, we compute projections onto the eigenvectors ν^k according to\n\n(ν^k · φ(x)) = Σ_{i=1}^m α_i^k (φ(x) · φ(xi)) = Σ_{i=1}^m α_i^k K(x, xi)\n\nwhere α_i^k is the ith element of the vector α^k. For the original training points, the maps can be obtained by y = Kα, where the ith element of y is the one-dimensional representation of xi. Furthermore, equation (3) can be reduced to\n\nLy = λ Dy    (4)\n\nwhich is identical to the eigenvalue problem of Laplacian Eigenmaps [2]. This shows that Kernel LPP yields the same results as Laplacian Eigenmaps on the training points.\n\n4. Experimental Results\n\nIn this section, we discuss several applications of the LPP algorithm. We begin with two simple synthetic examples to give some intuition about how LPP works.\n\n4.1. Simple Synthetic Example\n\nTwo simple synthetic examples are given in Figure 1. Both data sets correspond essentially to a one-dimensional manifold. 
Projection of the data points onto the first basis would then correspond to a one-dimensional linear manifold representation. The second basis, shown as a short line segment in the figure, would be discarded in this low-dimensional example.\n\nFigure 1: The first and third plots show the results of PCA. The second and fourth plots show the results of LPP. The line segments describe the two bases. The first basis is shown as a longer line segment, and the second basis is shown as a shorter line segment. In this example, LPP is insensitive to the outlier and has more discriminating power than PCA.\n\nFigure 2: The handwritten digits ('0'-'9') are mapped into a 2-dimensional space. The left figure is a representation of the set of all images of digits using the Laplacian eigenmaps. The middle figure shows the results of LPP. The right figure shows the results of PCA. Each color corresponds to a digit.\n\nLPP is derived by preserving local information; hence it is less sensitive to outliers than PCA. This can be clearly seen from Figure 1. LPP finds the principal direction along the data points at the bottom left corner, while PCA finds the principal direction onto which the data points at the bottom left corner collapse into a single point. Moreover, LPP has more discriminating power than PCA. As can be seen from Figure 1, the two circles completely overlap each other in the principal direction obtained by PCA, while they are well separated in the principal direction obtained by LPP.\n\n4.2. 2-D Data Visualization\n\nAn experiment was conducted with the Multiple Features Database [3]. This dataset consists of features of handwritten numerals ('0'-'9') extracted from a collection of Dutch utility maps. 200 patterns per class (for a total of 2,000 patterns) have been digitized in binary images. 
Digits are represented in terms of Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments, and morphological features. Each image is represented by a 649-dimensional vector. These data points are mapped to a 2-dimensional space using different dimensionality reduction algorithms: PCA, LPP, and Laplacian Eigenmaps. The experimental results are shown in Figure 2. As can be seen, LPP performs much better than PCA. LPPs are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps. However, LPP is computationally much more tractable.\n\n4.3. Manifold of Face Images\n\nIn this subsection, we applied LPP to images of faces. The face image data set used here is the same as that used in [5]. This dataset contains 1965 face images taken from sequential frames of a small video. The size of each image is 20 × 28, with 256 gray levels per pixel.\n\nFigure 3: A two-dimensional representation of the set of all images of faces using the Locality Preserving Projection. Representative faces are shown next to the data points in different parts of the space. As can be seen, the facial expression and the viewing point of the faces change smoothly.\n\nTable 1: Face Recognition Results on Yale Database\n\n               LPP   LDA   PCA\ndims            14    14    33\nerror rate (%)  16.0  20.0  25.3\n\nThus, each face image is represented by a point in the 560-dimensional ambient space. Figure 3 shows the mapping results. The images of faces are mapped into the 2-dimensional plane described by the first two coordinates of the Locality Preserving Projections. 
It should be emphasized that the mapping from the image space to the low-dimensional space obtained by our method is linear, rather than nonlinear as in most previous work. The linear algorithm does detect the nonlinear manifold structure of the images of faces to some extent. Some representative faces are shown next to the data points in different parts of the space. As can be seen, the images of faces are clearly divided into two parts. The left part contains the faces with closed mouths, and the right part contains the faces with open mouths. This is because, by trying to preserve the neighborhood structure in the embedding, the LPP algorithm implicitly emphasizes the natural clusters in the data. Specifically, it makes neighboring points in the ambient space nearer in the reduced representation space, and faraway points in the ambient space farther in the reduced representation space. The bottom images correspond to points along the right path (linked by a solid line), illustrating one particular mode of variability in pose.\n\n4.4. Face Recognition\n\nPCA and LDA are the two most widely used subspace learning techniques for face recognition [1][7]. These methods project the training sample faces to a low dimensional representation space where the recognition is carried out. The main supposition behind this procedure is that the face space (given by the feature vectors) has a lower dimension than the image space (given by the number of pixels in the image), and that the recognition of the faces can be performed in this reduced space. In this subsection, we consider the application of LPP to face recognition.\n\nThe database used for this experiment is the Yale face database [8]. It was constructed at the Yale Center for Computational Vision and Control. It contains 165 grayscale images of 15 individuals. 
The images demonstrate variations in lighting condition (left-light, center-light, right-light), facial expression (normal, happy, sad, sleepy, surprised, and wink), and with/without glasses. Preprocessing to locate the faces was applied. Original images were normalized (in scale and orientation) such that the two eyes were aligned at the same position. Then, the facial areas were cropped into the final images for matching. The size of each cropped image is 32 × 32 pixels, with 256 gray levels per pixel. Thus, each image can be represented by a 1024-dimensional vector.\n\nFor each individual, six images were taken with labels to form the training set. The rest of the database was considered to be the testing set. The training samples were used to learn a projection. The testing samples were then projected into the reduced space. Recognition was performed using a nearest neighbor classifier. In general, the performance of PCA, LDA, and LPP varies with the number of dimensions. We show the best results obtained by each of them. The error rates are summarized in Table 1. As can be seen, LPP outperforms both PCA and LDA.\n\n5. Conclusions\n\nIn this paper, we propose a new linear dimensionality reduction algorithm called Locality Preserving Projections. It is based on the same variational principle that gives rise to the Laplacian Eigenmap [2]. As a result it has similar locality preserving properties.\n\nOur approach also has several possible advantages over recent nonparametric techniques for global nonlinear dimensionality reduction such as [2][5][6]. It yields a map which is simple, linear, and defined everywhere (and therefore on novel test data points). The algorithm can be easily kernelized, yielding a natural nonlinear extension.\n\nPerformance improvement of this method over Principal Component Analysis is demonstrated through several experiments. 
Though our method is a linear algorithm, it is capable of discovering the nonlinear structure of the data manifold.\n\nReferences\n\n[1] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.\n\n[2] M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” Advances in Neural Information Processing Systems 14, Vancouver, British Columbia, Canada, 2002.\n\n[3] C. L. Blake and C. J. Merz, “UCI repository of machine learning databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA, University of California, Department of Information and Computer Science, 1998.\n\n[4] Fan R. K. Chung, Spectral Graph Theory, Regional Conference Series in Mathematics, number 92, 1997.\n\n[5] Sam Roweis and Lawrence K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, vol. 290, 22 December 2000.\n\n[6] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, 22 December 2000.\n\n[7] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, 3(1):71-86, 1991.\n\n[8] Yale Univ. Face Database, http://cvc.yale.edu/projects/yalefaces/yalefaces.html.\n", "award": [], "sourceid": 2359, "authors": [{"given_name": "Xiaofei", "family_name": "He", "institution": null}, {"given_name": "Partha", "family_name": "Niyogi", "institution": null}]}