{"title": "Two-Dimensional Linear Discriminant Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1569, "page_last": 1576, "abstract": null, "full_text": "Two-Dimensional Linear Discriminant Analysis\n\nJieping Ye\n\nDepartment of CSE\n\nUniversity of Minnesota\njieping@cs.umn.edu\n\nRavi Janardan\n\nDepartment of CSE\n\nUniversity of Minnesota\n\njanardan@cs.umn.edu\n\nQi Li\n\nDepartment of CIS\n\nUniversity of Delaware\nqili@cis.udel.edu\n\nAbstract\n\nLinear Discriminant Analysis (LDA) is a well-known scheme for feature\nextraction and dimension reduction. It has been used widely in many ap-\nplications involving high-dimensional data, such as face recognition and\nimage retrieval. An intrinsic limitation of classical LDA is the so-called\nsingularity problem, that is, it fails when all scatter matrices are singu-\nlar. A well-known approach to deal with the singularity problem is to\napply an intermediate dimension reduction stage using Principal Com-\nponent Analysis (PCA) before LDA. The algorithm, called PCA+LDA,\nis used widely in face recognition. However, PCA+LDA has high costs\nin time and space, due to the need for an eigen-decomposition involving\nthe scatter matrices.\nIn this paper, we propose a novel LDA algorithm, namely 2DLDA, which\nstands for 2-Dimensional Linear Discriminant Analysis. 2DLDA over-\ncomes the singularity problem implicitly, while achieving ef\ufb01ciency. The\nkey difference between 2DLDA and classical LDA lies in the model for\ndata representation. Classical LDA works with vectorized representa-\ntions of data, while the 2DLDA algorithm works with data in matrix\nrepresentation. To further reduce the dimension by 2DLDA, the combi-\nnation of 2DLDA and classical LDA, namely 2DLDA+LDA, is studied,\nwhere LDA is preceded by 2DLDA. The proposed algorithms are ap-\nplied on face recognition and compared with PCA+LDA. 
Experiments show that 2DLDA and 2DLDA+LDA achieve competitive recognition accuracy, while being much more efficient.

1 Introduction

Linear Discriminant Analysis [2, 4] is a well-known scheme for feature extraction and dimension reduction. It has been used widely in many applications such as face recognition [1], image retrieval [6], and microarray data classification [3]. Classical LDA projects the data onto a lower-dimensional vector space such that the ratio of the between-class distance to the within-class distance is maximized, thus achieving maximum discrimination. The optimal projection (transformation) can be readily computed by applying an eigen-decomposition to the scatter matrices. An intrinsic limitation of classical LDA is that its objective function requires the nonsingularity of one of the scatter matrices. For many applications, such as face recognition, all scatter matrices in question can be singular, since the data comes from a very high-dimensional space whose dimension in general exceeds the number of data points. This is known as the undersampled or singularity problem [5].

In recent years, many approaches have been brought to bear on such high-dimensional, undersampled problems, including pseudo-inverse LDA, PCA+LDA, and regularized LDA; more details can be found in [5]. Among these LDA extensions, PCA+LDA has received a lot of attention, especially for face recognition [1]. In this two-stage algorithm, an intermediate dimension-reduction stage using PCA is applied before LDA. The common aspect of previous LDA extensions is the computation of an eigen-decomposition of certain large matrices, which not only degrades efficiency but also makes it hard to scale them to large datasets.

In this paper, we present a novel approach to alleviate the expensive computation of the eigen-decomposition in previous LDA extensions. The novelty lies in a different data representation model.
Under this model, each datum is represented as a matrix, instead of as a vector, and the collection of data is represented as a collection of matrices, instead of as a single large matrix. This model has been used previously in [8, 9, 7] for generalizations of SVD and PCA. Unlike classical LDA, we consider the projection of the data onto a space that is the tensor product of two vector spaces. We formulate our dimension-reduction problem as an optimization problem in Section 3. Unlike classical LDA, there is no closed-form solution for this optimization problem; instead, we derive a heuristic, namely 2DLDA. To further reduce the dimension, which is desirable for efficient querying, we consider the combination of 2DLDA and LDA, namely 2DLDA+LDA, where the dimension of the space transformed by 2DLDA is further reduced by LDA.

We perform experiments on three well-known face datasets to evaluate the effectiveness of 2DLDA and 2DLDA+LDA, and compare with PCA+LDA, which is used widely in face recognition. Our experiments show that: (1) 2DLDA is applicable to high-dimensional undersampled data such as face images, i.e., it implicitly avoids the singularity problem encountered in classical LDA; and (2) 2DLDA and 2DLDA+LDA have distinctly lower costs in time and space than PCA+LDA, and achieve classification accuracy that is competitive with PCA+LDA.

2 An overview of LDA

In this section, we give a brief overview of classical LDA. Some of the important notations used in the rest of this paper are listed in Table 1.

Given a data matrix A ∈ IR^{N×n}, classical LDA aims to find a transformation G ∈ IR^{N×ℓ} that maps each column a_i of A, for 1 ≤ i ≤ n, in the N-dimensional space to a vector b_i in the ℓ-dimensional space. That is, G : a_i ∈ IR^N → b_i = G^T a_i ∈ IR^ℓ (ℓ < N).
Equivalently, classical LDA aims to find a vector space G spanned by {g_i}, i = 1, …, ℓ, where G = [g_1, ⋯, g_ℓ], such that each a_i is projected onto G by (g_1^T · a_i, ⋯, g_ℓ^T · a_i)^T ∈ IR^ℓ.

Assume that the original data in A is partitioned into k classes as A = {Π_1, ⋯, Π_k}, where Π_i contains n_i data points from the ith class, and Σ_{i=1..k} n_i = n. Classical LDA aims to find the optimal transformation G such that the class structure of the original high-dimensional space is preserved in the low-dimensional space.

In general, if each class is tightly grouped, but well separated from the other classes, the quality of the clustering is considered to be high. In discriminant analysis, two scatter matrices, called the within-class (S_w) and between-class (S_b) matrices, are defined to quantify the quality of the clustering, as follows [4]:

S_w = Σ_{i=1..k} Σ_{x ∈ Π_i} (x − m_i)(x − m_i)^T,  and  S_b = Σ_{i=1..k} n_i (m_i − m)(m_i − m)^T,

where m_i = (1/n_i) Σ_{x ∈ Π_i} x is the mean of the ith class, and m = (1/n) Σ_{i=1..k} Σ_{x ∈ Π_i} x is the global mean.

Table 1: Notation
n : number of images in the dataset
k : number of classes in the dataset
A_i : ith image in matrix representation
a_i : ith image in vectorized representation
r : number of rows in A_i
c : number of columns in A_i
N : dimension of a_i (N = r · c)
Π_j : jth class in the dataset
L : transformation matrix (left) by 2DLDA
R : transformation matrix (right) by 2DLDA
I : number of iterations in 2DLDA
B_i : reduced representation of A_i by 2DLDA
ℓ1 : number of rows in B_i
ℓ2 : number of columns in B_i

It is easy to verify that trace(S_w) measures the closeness of the vectors within the classes, while trace(S_b) measures the separation between classes.
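The definitions above translate directly into NumPy. The following is our own minimal sketch (not part of the paper); for convenience it stores samples as rows of a hypothetical data matrix `X` rather than as columns of A, with a hypothetical label vector `y`:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices.

    X: (n, N) data matrix, one sample per row; y: (n,) class labels.
    """
    m = X.mean(axis=0)                      # global mean m
    N = X.shape[1]
    Sw = np.zeros((N, N))
    Sb = np.zeros((N, N))
    for c in np.unique(y):
        Xc = X[y == c]                      # samples of class c
        mc = Xc.mean(axis=0)                # class mean m_i
        Sw += (Xc - mc).T @ (Xc - mc)       # sum_x (x - m_i)(x - m_i)^T
        d = (mc - m).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)           # n_i (m_i - m)(m_i - m)^T
    return Sw, Sb

# toy two-class data: trace(Sw) measures within-class closeness,
# trace(Sb) measures between-class separation
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
Sw, Sb = scatter_matrices(X, y)
```

A useful sanity check on such a sketch is the identity S_w + S_b = Σ_x (x − m)(x − m)^T, the total scatter matrix.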
In the low-dimensional space resulting from the linear transformation G (or the linear projection onto the vector space G), the within-class and between-class scatter matrices become S_w^L = G^T S_w G and S_b^L = G^T S_b G. An optimal transformation G would maximize trace(S_b^L) and minimize trace(S_w^L). Common optimizations in classical discriminant analysis include (see [4]):

max_G trace((S_w^L)^{−1} S_b^L)  and  min_G trace((S_b^L)^{−1} S_w^L).   (1)

The optimization problems in Eq. (1) are equivalent to the following generalized eigenvalue problem: S_b x = λ S_w x, for λ ≠ 0. The solution can be obtained by applying an eigen-decomposition to the matrix S_w^{−1} S_b, if S_w is nonsingular, or to S_b^{−1} S_w, if S_b is nonsingular. There are at most k − 1 eigenvectors corresponding to nonzero eigenvalues, since the rank of the matrix S_b is bounded from above by k − 1. Therefore, the reduced dimension obtained by classical LDA is at most k − 1. A stable way to compute the eigen-decomposition is to apply an SVD to the scatter matrices; details can be found in [6].

Note that a limitation of classical LDA in many applications involving undersampled data, such as text documents and images, is that at least one of the scatter matrices is required to be nonsingular. Several extensions, including pseudo-inverse LDA, regularized LDA, and PCA+LDA, were proposed in the past to deal with the singularity problem. Details can be found in [5].

3 2-Dimensional LDA

The key difference between classical LDA and the 2DLDA algorithm that we propose in this paper is in the representation of data. While classical LDA uses the vectorized representation, 2DLDA works with data in matrix representation.

We will see later in this section that the matrix representation in 2DLDA leads to an eigen-decomposition on matrices with much smaller sizes.
More specifically, 2DLDA involves the eigen-decomposition of matrices of sizes r × r and c × c, which are much smaller than the matrices in classical LDA. This dramatically reduces the time and space complexities of 2DLDA relative to LDA.

Unlike classical LDA, 2DLDA considers the following (ℓ1 × ℓ2)-dimensional space L ⊗ R, which is the tensor product of the following two spaces: L spanned by {u_i}, i = 1, …, ℓ1, and R spanned by {v_i}, i = 1, …, ℓ2. Define two matrices L = [u_1, ⋯, u_ℓ1] ∈ IR^{r×ℓ1} and R = [v_1, ⋯, v_ℓ2] ∈ IR^{c×ℓ2}. Then the projection of X ∈ IR^{r×c} onto the space L ⊗ R is L^T X R ∈ IR^{ℓ1×ℓ2}.

Let A_i ∈ IR^{r×c}, for i = 1, ⋯, n, be the n images in the dataset, clustered into classes Π_1, ⋯, Π_k, where Π_i has n_i images. Let M_i = (1/n_i) Σ_{X ∈ Π_i} X be the mean of the ith class, 1 ≤ i ≤ k, and M = (1/n) Σ_{i=1..k} Σ_{X ∈ Π_i} X be the global mean. In 2DLDA, we consider images as two-dimensional signals and aim to find two transformation matrices L ∈ IR^{r×ℓ1} and R ∈ IR^{c×ℓ2} that map each A_i ∈ IR^{r×c}, for 1 ≤ i ≤ n, to a matrix B_i ∈ IR^{ℓ1×ℓ2} such that B_i = L^T A_i R.

Like classical LDA, 2DLDA aims to find the optimal transformations (projections) L and R such that the class structure of the original high-dimensional space is preserved in the low-dimensional space.

A natural similarity metric between matrices is the Frobenius norm [8].
Under this metric, the (squared) within-class and between-class distances D_w and D_b can be computed as follows:

D_w = Σ_{i=1..k} Σ_{X ∈ Π_i} ||X − M_i||_F^2,  D_b = Σ_{i=1..k} n_i ||M_i − M||_F^2.

Using the property of the trace, that is, trace(M M^T) = ||M||_F^2 for any matrix M, we can rewrite D_w and D_b as follows:

D_w = trace( Σ_{i=1..k} Σ_{X ∈ Π_i} (X − M_i)(X − M_i)^T ),
D_b = trace( Σ_{i=1..k} n_i (M_i − M)(M_i − M)^T ).

In the low-dimensional space resulting from the linear transformations L and R, the within-class and between-class distances become

D̃_w = trace( Σ_{i=1..k} Σ_{X ∈ Π_i} L^T (X − M_i) R R^T (X − M_i)^T L ),
D̃_b = trace( Σ_{i=1..k} n_i L^T (M_i − M) R R^T (M_i − M)^T L ).

The optimal transformations L and R would maximize D̃_b and minimize D̃_w. Due to the difficulty of computing the optimal L and R simultaneously, we derive an iterative algorithm in the following. More specifically, for a fixed R, we can compute the optimal L by solving an optimization problem similar to the one in Eq. (1). With the computed L, we can then update R by solving another optimization problem of the same form. Details are given below. The procedure is repeated a certain number of times, as discussed in Section 4.

Computation of L. For a fixed R, D̃_w and D̃_b can be rewritten as

D̃_w = trace(L^T S_w^R L),  D̃_b = trace(L^T S_b^R L),

Algorithm 2DLDA(A_1, ⋯, A_n, ℓ1, ℓ2)
Input: A_1, ⋯, A_n, ℓ1, ℓ2
Output: L, R, B_1, ⋯, B_n
1. 
Compute the mean M_i of the ith class for each i as M_i = (1/n_i) Σ_{X ∈ Π_i} X;
2. Compute the global mean M = (1/n) Σ_{i=1..k} Σ_{X ∈ Π_i} X;
3. R_0 ← (I_{ℓ2}, 0)^T;
4. For j from 1 to I
5.   S_w^R ← Σ_{i=1..k} Σ_{X ∈ Π_i} (X − M_i) R_{j−1} R_{j−1}^T (X − M_i)^T,  S_b^R ← Σ_{i=1..k} n_i (M_i − M) R_{j−1} R_{j−1}^T (M_i − M)^T;
6.   Compute the first ℓ1 eigenvectors {φ_l^L}, l = 1..ℓ1, of (S_w^R)^{−1} S_b^R;
7.   L_j ← [φ_1^L, ⋯, φ_{ℓ1}^L];
8.   S_w^L ← Σ_{i=1..k} Σ_{X ∈ Π_i} (X − M_i)^T L_j L_j^T (X − M_i),  S_b^L ← Σ_{i=1..k} n_i (M_i − M)^T L_j L_j^T (M_i − M);
9.   Compute the first ℓ2 eigenvectors {φ_l^R}, l = 1..ℓ2, of (S_w^L)^{−1} S_b^L;
10.  R_j ← [φ_1^R, ⋯, φ_{ℓ2}^R];
11. EndFor
12. L ← L_I, R ← R_I;
13. B_l ← L^T A_l R, for l = 1, ⋯, n;
14. return(L, R, B_1, ⋯, B_n).

where

S_w^R = Σ_{i=1..k} Σ_{X ∈ Π_i} (X − M_i) R R^T (X − M_i)^T,  S_b^R = Σ_{i=1..k} n_i (M_i − M) R R^T (M_i − M)^T.

Similar to the optimization problem in Eq. (1), the optimal L can be computed by solving the following optimization problem: max_L trace((L^T S_w^R L)^{−1} (L^T S_b^R L)). The solution can be obtained by solving the following generalized eigenvalue problem: S_b^R x = λ S_w^R x. Since S_w^R is in general nonsingular, the optimal L can be obtained by computing an eigen-decomposition of (S_w^R)^{−1} S_b^R. Note that the size of the matrices S_w^R and S_b^R is r × r, which is much smaller than the size of the matrices S_w and S_b in classical LDA.

Computation of R. Next, consider the computation of R, for a fixed L.
A key observation is that D̃_w and D̃_b can be rewritten as

D̃_w = trace(R^T S_w^L R),  D̃_b = trace(R^T S_b^L R),

where

S_w^L = Σ_{i=1..k} Σ_{X ∈ Π_i} (X − M_i)^T L L^T (X − M_i),  S_b^L = Σ_{i=1..k} n_i (M_i − M)^T L L^T (M_i − M).

This follows from the following property of the trace, that is, trace(AB) = trace(BA), for any two matrices A and B.

Similarly, the optimal R can be computed by solving the following optimization problem: max_R trace((R^T S_w^L R)^{−1} (R^T S_b^L R)). The solution can be obtained by solving the following generalized eigenvalue problem: S_b^L x = λ S_w^L x. Since S_w^L is in general nonsingular, the optimal R can be obtained by computing an eigen-decomposition of (S_w^L)^{−1} S_b^L. Note that the size of the matrices S_w^L and S_b^L is c × c, much smaller than that of S_w and S_b.

The pseudo-code for the 2DLDA algorithm is given in Algorithm 2DLDA. It is clear that the most expensive steps in Algorithm 2DLDA are Lines 5, 8 and 13, and the total time complexity is O(n max(ℓ1, ℓ2)(r + c)^2 I), where I is the number of iterations. The 2DLDA algorithm depends on the initial choice R_0. Our experiments show that choosing R_0 = (I_{ℓ2}, 0)^T, where I_{ℓ2} is the identity matrix, produces excellent results. We use this initial R_0 in all the experiments.

Since the number of rows (r) and the number of columns (c) of an image A_i are generally comparable, i.e., r ≈ c ≈ √N, we set ℓ1 and ℓ2 to a common value d in the rest of this paper, for simplicity. However, the algorithm works in the general case. With this simplification, the time complexity of the 2DLDA algorithm becomes O(ndNI).

The space complexity of 2DLDA is O(rc) = O(N).
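For concreteness, the iteration in Algorithm 2DLDA can be sketched in NumPy. This is our own minimal, unoptimized sketch (not the authors' implementation); it assumes S_w^R and S_w^L are nonsingular, and all variable names (`images`, `labels`, `two_d_lda`) are ours:

```python
import numpy as np

def top_eigvecs(Sw, Sb, m):
    # first m eigenvectors of Sw^{-1} Sb, by decreasing eigenvalue
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return np.real(evecs[:, order[:m]])

def two_d_lda(images, labels, l1, l2, n_iter=1):
    """2DLDA sketch: images is an (n, r, c) stack, labels their class ids."""
    A = np.asarray(images, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    M = A.mean(axis=0)                                # global mean
    means = {cl: A[labels == cl].mean(axis=0) for cl in classes}
    r, c = A.shape[1], A.shape[2]
    R = np.eye(c, l2)                                 # R0 = (I_{l2}, 0)^T
    for _ in range(n_iter):
        Sw_R = np.zeros((r, r)); Sb_R = np.zeros((r, r))
        for cl in classes:
            for X in A[labels == cl]:
                D = (X - means[cl]) @ R
                Sw_R += D @ D.T                       # (X - M_i) R R^T (X - M_i)^T
            DM = (means[cl] - M) @ R
            Sb_R += np.sum(labels == cl) * (DM @ DM.T)
        L = top_eigvecs(Sw_R, Sb_R, l1)               # update L for fixed R
        Sw_L = np.zeros((c, c)); Sb_L = np.zeros((c, c))
        for cl in classes:
            for X in A[labels == cl]:
                D = (X - means[cl]).T @ L
                Sw_L += D @ D.T                       # (X - M_i)^T L L^T (X - M_i)
            DM = (means[cl] - M).T @ L
            Sb_L += np.sum(labels == cl) * (DM @ DM.T)
        R = top_eigvecs(Sw_L, Sb_L, l2)               # update R for fixed L
    B = [L.T @ X @ R for X in A]                      # reduced l1 x l2 representations
    return L, R, B
```

Per the experiments in Section 4, a single pass already gives stable accuracy, hence the default n_iter = 1 in this sketch.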
The key to the low space complexity of the algorithm is that the matrices S_w^R, S_b^R, S_w^L, and S_b^L can be formed by reading the matrices A_l incrementally.

3.1 2DLDA+LDA

As mentioned in the Introduction, PCA is commonly applied as an intermediate dimension-reduction stage before LDA to overcome the singularity problem of classical LDA. In this section, we consider the combination of 2DLDA and LDA, namely 2DLDA+LDA, where the dimension obtained by 2DLDA is further reduced by LDA, since a small reduced dimension is desirable for efficient querying. More specifically, in the first stage of 2DLDA+LDA, each image A_i ∈ IR^{r×c} is reduced to B_i ∈ IR^{d×d} by 2DLDA, with d < min(r, c). In the second stage, each B_i is first transformed to a vector b_i ∈ IR^{d²} (matrix-to-vector alignment), and then b_i is further reduced to b_i^L ∈ IR^{k−1} by LDA, with k − 1 < d², where k is the number of classes. Here, “matrix-to-vector alignment” means that the matrix is transformed to a vector by concatenating all its rows together consecutively.

The time complexity of the first stage, by 2DLDA, is O(ndNI). The second stage applies classical LDA to data in d²-dimensional space, and hence takes O(n(d²)²), assuming n > d². Hence the total time complexity of 2DLDA+LDA is O(nd(NI + d³)).

4 Experiments

In this section, we experimentally evaluate the performance of 2DLDA and 2DLDA+LDA on face images and compare with PCA+LDA, used widely in face recognition. For PCA+LDA, we use 200 principal components in the PCA stage, as this produces good overall results. All of our experiments are performed on a P4 1.80GHz Linux machine with 1GB memory.
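To make the two-stage pipeline of Section 3.1 concrete, the second stage (matrix-to-vector alignment followed by classical LDA) might look like the following. This is our own sketch, not the authors' code; `B` denotes a hypothetical list of d × d matrices produced by 2DLDA, and the pseudo-inverse is used defensively in case S_w is singular:

```python
import numpy as np

def lda_transform(X, y, dim):
    """Classical LDA: project rows of X onto the top `dim` discriminant directions."""
    m = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1])); Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc = X[y == c]; mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)           # within-class scatter
        d = (mc - m)[:, None]
        Sb += len(Xc) * (d @ d.T)               # between-class scatter
    # eigenvectors of Sw^{-1} Sb (pinv guards against a singular Sw)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    G = np.real(evecs[:, np.argsort(-evals.real)[:dim]])
    return X @ G

def second_stage(B, y, k):
    # matrix-to-vector alignment: concatenate the rows of each d x d matrix B_i
    b = np.stack([Bi.reshape(-1) for Bi in B])  # (n, d^2)
    return lda_transform(b, y, k - 1)           # reduce to k - 1 dimensions
```

After 2DLDA, the d²-dimensional within-class scatter is typically nonsingular (d² is small relative to n), which is one way to see how the pipeline sidesteps the singularity problem.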
For all the experiments, the 1-Nearest-Neighbor (1NN) algorithm is applied for classification, and ten-fold cross-validation is used for computing the classification accuracy.

Datasets: We use three face datasets in our study: PIX, ORL, and PIE, which are publicly available. PIX (available at http://peipa.essex.ac.uk/ipa/pix/faces/manchester/test-hard/) contains 300 face images of 30 persons. The image size is 512 × 512; we subsample the images down to a size of 100 × 100 = 10000. ORL (available at http://www.uk.research.att.com/facedatabase.html) contains 400 face images of 40 persons. The image size is 92 × 112. PIE is a subset of the CMU-PIE face image dataset (available at http://www.ri.cmu.edu/projects/project_418.html). It contains 6615 face images of 63 persons. The image size is 640 × 480; we subsample the images down to a size of 220 × 175 = 38500. Note that PIE is much larger than the other two datasets.

Figure 1: Effect of the number of iterations on 2DLDA and 2DLDA+LDA using the three face datasets: PIX, ORL and PIE (from left to right). [Each panel plots classification accuracy (y-axis) against the number of iterations, 2-20 (x-axis), for 2DLDA and 2DLDA+LDA.]

The impact of the number, I, of iterations: In this experiment, we study the effect of the number of iterations (I in Algorithm 2DLDA) on 2DLDA and 2DLDA+LDA. The results are shown in Figure 1, where the x-axis denotes the number of iterations, and the y-axis denotes the classification accuracy.
A value of d = 10 is used for both algorithms. It is clear that both accuracy curves are stable with respect to the number of iterations. In general, the accuracy curves of 2DLDA+LDA are slightly more stable than those of 2DLDA. The key consequence is that we need to run the “for” loop (Lines 4 to 11) in Algorithm 2DLDA only once, i.e., with I = 1, which significantly reduces the total running time of both algorithms.

The impact of the value of the reduced dimension d: In this experiment, we study the effect of the value of d on 2DLDA and 2DLDA+LDA, where the value of d determines the dimensionality of the space transformed by 2DLDA. We did extensive experiments using different values of d on the face image datasets. The results are summarized in Figure 2, where the x-axis denotes the value of d (between 1 and 15) and the y-axis denotes the classification accuracy with 1-Nearest-Neighbor as the classifier. As shown in Figure 2, the accuracy curves on all datasets stabilize around d = 4 to 6.

Comparison on classification accuracy and efficiency: In this experiment, we evaluate the effectiveness of the proposed algorithms in terms of classification accuracy and efficiency and compare with PCA+LDA. The results are summarized in Table 2. We can observe that 2DLDA+LDA has performance similar to PCA+LDA in classification, while it outperforms 2DLDA. Hence the LDA stage in 2DLDA+LDA not only reduces the dimension, but also increases the accuracy.
Another key observation from Table 2 is that 2DLDA is almost one order of magnitude faster than PCA+LDA, while the running time of 2DLDA+LDA is close to that of 2DLDA.

Hence 2DLDA+LDA is a more effective dimension-reduction algorithm than PCA+LDA: it is competitive with PCA+LDA in classification and produces the same number of reduced dimensions in the transformed space, while it has much lower time and space costs.

5 Conclusions

An efficient algorithm, namely 2DLDA, is presented for dimension reduction. 2DLDA is an extension of LDA. The key difference between 2DLDA and LDA is that 2DLDA works on the matrix representation of images directly, while LDA uses a vector representation. 2DLDA has asymptotically minimum memory requirements, and lower time complexity than LDA, which is desirable for large face datasets, while it implicitly avoids the singularity problem encountered in classical LDA. We also study the combination of 2DLDA and LDA, namely 2DLDA+LDA, where the dimension obtained by 2DLDA is further reduced by LDA.
Experiments show that 2DLDA and 2DLDA+LDA are competitive with PCA+LDA in terms of classification accuracy, while they have significantly lower time and space costs.

Figure 2: Effect of the value of the reduced dimension d on 2DLDA and 2DLDA+LDA using the three face datasets: PIX, ORL and PIE (from left to right). [Each panel plots classification accuracy (y-axis) against d, from 1 to 15 (x-axis), for 2DLDA and 2DLDA+LDA.]

Dataset | PCA+LDA Accuracy | PCA+LDA Time(Sec) | 2DLDA Accuracy | 2DLDA Time(Sec) | 2DLDA+LDA Accuracy | 2DLDA+LDA Time(Sec)
PIX | 98.00% | 7.73 | 97.33% | 1.69 | 98.50% | 1.73
ORL | 97.75% | 12.5 | 97.50% | 2.14 | 98.00% | 2.19
PIE | — | — | 99.32% | 153 | 100% | 157

Table 2: Comparison on classification accuracy and efficiency. “—” means that PCA+LDA is not applicable for PIE, due to its large size: PCA+LDA involves an eigen-decomposition of the scatter matrices, which requires the whole data matrix to reside in main memory.

Acknowledgment: Research of J. Ye and R. Janardan is sponsored, in part, by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred.

References

[1] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[2] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley, 2000.

[3] S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87, 2002.

[4] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, USA, 1990.

[5] W.J. Krzanowski, P. Jonathan, W.V. McCarthy, and M.R. Thomas. Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Applied Statistics, 44:101–115, 1995.

[6] D.L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):831–836, 1996.

[7] J. Yang, D. Zhang, A.F. Frangi, and J.Y. Yang. Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):131–137, 2004.

[8] J. Ye. Generalized low rank approximations of matrices. In ICML Conference Proceedings, pages 887–894, 2004.

[9] J. Ye, R. Janardan, and Q. Li. GPCA: An efficient dimension reduction scheme for image compression and retrieval. In ACM SIGKDD Conference Proceedings, pages 354–363, 2004.
", "award": [], "sourceid": 2547, "authors": [{"given_name": "Jieping", "family_name": "Ye", "institution": null}, {"given_name": "Ravi", "family_name": "Janardan", "institution": null}, {"given_name": "Qi", "family_name": "Li", "institution": null}]}