{"title": "Efficient and Robust Feature Extraction by Maximum Margin Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 97, "page_last": 104, "abstract": "", "full_text": "Ef\ufb01cient and Robust Feature Extraction by\n\nMaximum Margin Criterion\n\nHaifeng Li\n\nTao Jiang\n\nKeshu Zhang\n\nDepartment of Computer Science\n\nDepartment of Electrical Engineering\n\nUniversity of California\n\nRiverside, CA 92521\n\nfhli,jiangg@cs.ucr.edu\n\nUniversity of New Orleans\nNew Orleans, LA 70148\nkzhang1@uno.edu\n\nAbstract\n\nA new feature extraction criterion, maximum margin criterion (MMC),\nis proposed in this paper. This new criterion is general in the sense that,\nwhen combined with a suitable constraint, it can actually give rise to\nthe most popular feature extractor in the literature, linear discriminate\nanalysis (LDA). We derive a new feature extractor based on MMC using\na different constraint that does not depend on the nonsingularity of the\nwithin-class scatter matrix Sw. Such a dependence is a major drawback\nof LDA especially when the sample size is small. The kernelized (nonlin-\near) counterpart of this linear feature extractor is also established in this\npaper. Our preliminary experimental results on face images demonstrate\nthat the new feature extractors are ef\ufb01cient and stable.\n\n1\n\nIntroduction\n\nIn statistical pattern recognition, the high-dimensionality is a major cause of the practical\nlimitations of many pattern recognition technologies. In the past several decades, many di-\nmensionality reduction techniques have been proposed. Linear discriminant analysis (LDA,\nalso called Fisher\u2019s Linear Discriminant) [1] is one of the most popular linear dimension-\nality reduction method. 
In many applications, LDA has been proven to be very powerful. LDA is given by a linear transformation matrix W \in R^{D \times d} maximizing the so-called Fisher criterion (a kind of Rayleigh coefficient)

    J_F(W) = \frac{W^T S_b W}{W^T S_w W}    (1)

where S_b = \sum_{i=1}^c p_i (m_i - m)(m_i - m)^T and S_w = \sum_{i=1}^c p_i S_i are the between-class scatter matrix and the within-class scatter matrix, respectively; c is the number of classes; m_i and p_i are the mean vector and a priori probability of class i, respectively; m = \sum_{i=1}^c p_i m_i is the overall mean vector; S_i is the within-class scatter matrix of class i; and D and d are the dimensionalities of the data before and after the transformation, respectively. To maximize (1), the transformation matrix W must be constituted by the largest eigenvectors of S_w^{-1} S_b. The purpose of LDA is to maximize the between-class scatter while simultaneously minimizing the within-class scatter. Two-class LDA has a close connection to optimal linear Bayes classifiers: in the two-class case, the transformation matrix W is just a vector, which points in the same direction as the discriminant in the corresponding optimal Bayes classifier. However, it has been shown that LDA is suboptimal for multi-class problems [2]. A major drawback of LDA is that it cannot be applied when S_w is singular due to the small sample size problem [3]. The small sample size problem arises whenever the number of samples is smaller than the dimensionality of the samples. For example, a 64 x 64 image in a face recognition system has 4096 dimensions, which requires more than 4096 training samples to ensure that S_w is nonsingular. So LDA is not a stable method in practice when the training data are scarce.

In recent years, many researchers have noticed this problem and tried to overcome the computational difficulty with LDA. Tian et al. [4] used the pseudo-inverse matrix S_w^+ instead of the inverse matrix S_w^{-1}.
For the same purpose, Hong and Yang [5] tried to add a singular value perturbation to S_w to make it nonsingular. Neither of these methods is theoretically sound, because Fisher's criterion is not valid when S_w is singular: in that case, any positive S_b makes Fisher's criterion infinitely large. Thus, these naive attempts to calculate the (pseudo or approximate) inverse of S_w may lead to arbitrary (meaningless) results. Besides, it is also known that an eigenvector can be very sensitive to small perturbations if its corresponding eigenvalue is close to another eigenvalue of the same matrix [6].

In 1992, Liu et al. [7] modified Fisher's criterion by using the total scatter matrix S_t = S_b + S_w as the denominator instead of S_w. It has been proven that the modified criterion is exactly equivalent to Fisher's criterion. However, when S_w is singular, the modified criterion reaches its maximum value (i.e., 1) no matter what the transformation W is. Such an arbitrary transformation cannot guarantee maximum class separability unless W^T S_b W is maximized. Besides, this method still needs to calculate an inverse matrix, which is time consuming. In 2000, Chen et al. [8] proposed the LDA+PCA method. When S_w is of full rank, LDA+PCA just calculates the largest eigenvectors of S_t^{-1} S_b to form the transformation matrix. Otherwise, a two-stage procedure is employed. First, the data are transformed into the null space V_0 of S_w. Second, it tries to maximize the between-class scatter in V_0, which is accomplished by performing principal component analysis (PCA) on the between-class scatter matrix in V_0. Although this method solves the small sample size problem, it is obviously suboptimal because it maximizes the between-class scatter in the null space of S_w instead of the original input space.
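The scatter matrices underlying this discussion are simple to compute, and the small sample size problem can be seen directly. The following NumPy sketch (our illustration, not code from the paper; the function name is ours) builds S_b and S_w with empirical class priors and shows that S_w is singular when n < D:

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class scatter S_b and within-class scatter S_w.

    X: (n, D) data matrix; y: (n,) class labels.  The a priori
    probabilities p_i are estimated as empirical frequencies n_i / n.
    """
    n, D = X.shape
    m = X.mean(axis=0)                      # overall mean vector m
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        p = len(Xc) / n                     # p_i
        mc = Xc.mean(axis=0)                # class mean vector m_i
        d = (mc - m)[:, None]
        Sb += p * (d @ d.T)                 # p_i (m_i - m)(m_i - m)^T
        Sw += p * np.cov(Xc.T, bias=True)   # p_i S_i
    return Sb, Sw

# Small sample size situation: 6 samples in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))
y = np.array([0, 0, 0, 1, 1, 1])
Sb, Sw = scatter_matrices(X, y)
# rank(S_w) <= n - c = 4 < D = 10, so S_w^{-1} S_b cannot be formed.
print(np.linalg.matrix_rank(Sw))
```

Since each class contributes a scatter matrix of rank at most n_i - 1, the rank of S_w is at most n - c; with fewer samples than dimensions, Fisher's criterion is undefined.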
Besides, the performance of LDA+PCA drops significantly when n - c is close to the dimensionality D, where n is the number of samples and c is the number of classes. The reason is that the dimensionality of the null space V_0 is too small in this situation, and too much information is lost when we try to extract the discriminant vectors in V_0. LDA+PCA also needs to calculate the rank of S_w, which is an ill-defined operation due to floating-point imprecision. Finally, this method is complicated and slow because too much calculation is involved.

Kernel Fisher discriminant (KFD) [9] is a well-known nonlinear extension of LDA. The instability problem is more severe for KFD because S_w in the (nonlinear) feature space F is always singular (the rank of S_w is n - c). Similar to [5], KFD simply adds a perturbation \mu I to S_w. Of course, it has the same stability problem as [5] because eigenvectors are sensitive to small perturbations. Although the authors also argued that this perturbation acts as a kind of regularization, i.e., a capacity control in F, the real influence of this regularization is not yet fully understood. Besides, it is hard to determine an optimal \mu since there are no theoretical guidelines.

In this paper, a simpler, more efficient, and stable method is proposed to calculate the most discriminant vectors, based on a new feature extraction criterion, the maximum margin criterion (MMC). Based on MMC, new linear and nonlinear feature extractors are established. It can be shown that MMC represents class separability better than PCA. As a connection to Fisher's criterion, we may also derive LDA from MMC by incorporating a suitable constraint.
On the other hand, the new feature extractors derived here (based on MMC) do not suffer from the small sample size problem, which is known to cause serious stability problems for LDA (based on Fisher's criterion). Unlike LDA+PCA, the new feature extractors based on MMC maximize the between-class scatter in the input space instead of the null space of S_w. Hence, they have a better overall performance than LDA+PCA, as confirmed by our preliminary experimental results.

2 Maximum Margin Criterion

Suppose that we are given the empirical data

    (x_1, y_1), ..., (x_n, y_n) \in X \times \{C_1, ..., C_c\}

Here, the domain X \subseteq R^D is some nonempty set from which the patterns x_i are taken. The y_i are called labels or targets. By studying these samples, we want to predict the label y \in \{C_1, ..., C_c\} of some new pattern x \in X. In other words, we choose y such that (x, y) is in some sense similar to the training examples. For this purpose, some measure needs to be employed to assess similarity or dissimilarity. We want to keep as much of this similarity/dissimilarity information as possible after the dimensionality reduction, i.e., after transforming x from R^D to R^d, where d \ll D.

If some distance metric is used to measure dissimilarity, we would hope that a pattern is close to those in the same class but far from those in different classes. So, a good feature extractor should maximize the distances between classes after the transformation. Therefore, we may define the feature extraction criterion as

    J = \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j d(C_i, C_j)    (2)

We call (2) the maximum margin criterion (MMC). It is actually the summation of the c(c-1)/2 interclass margins. Like the weighted pairwise Fisher criteria in [2], one may also define a weighted maximum margin criterion.
Due to the page limit, we omit that discussion in this paper.

One may use the distance between mean vectors as the distance between classes, i.e.,

    d(C_i, C_j) = d(m_i, m_j)    (3)

where m_i and m_j are the mean vectors of class C_i and class C_j, respectively. However, (3) is not suitable because it neglects the scatter of the classes. Even if the distance between the mean vectors is large, it is not easy to separate two classes that have large spreads and overlap with each other. By considering the scatter of the classes, we define the interclass distance (or margin) as

    d(C_i, C_j) = d(m_i, m_j) - s(C_i) - s(C_j)    (4)

where s(C_i) is some measure of the scatter of class C_i. In statistics, we usually use the generalized variance |S_i| or the overall variance tr(S_i) to measure the scatter of data. In this paper, we use the overall variance tr(S_i) because it is easy to analyze. The weakness of the overall variance is that it ignores the covariance structure altogether. Note that, by employing the overall/generalized variance, the expression (4) measures the "average margin" between two classes, whereas the minimum margin is used in support vector machines (SVMs) [10].

With (4) and s(C_i) = tr(S_i), we may decompose (2) into two parts

    J = \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j (d(m_i, m_j) - tr(S_i) - tr(S_j))
      = \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j d(m_i, m_j) - \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j (tr(S_i) + tr(S_j))

The second part is easily simplified to tr(S_w):

    \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j (tr(S_i) + tr(S_j)) = \sum_{i=1}^c p_i tr(S_i) = tr(\sum_{i=1}^c p_i S_i)
= tr(S_w)    (5)

By employing the Euclidean distance, we may also simplify the first part to tr(S_b) as follows:

    \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j d(m_i, m_j)
      = \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j (m_i - m_j)^T (m_i - m_j)
      = \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j (m_i - m + m - m_j)^T (m_i - m + m - m_j)

After expanding, we can simplify the above to \sum_{i=1}^c p_i (m_i - m)^T (m_i - m) by using the fact that \sum_{j=1}^c p_j (m - m_j) = 0. So

    \frac{1}{2} \sum_{i=1}^c \sum_{j=1}^c p_i p_j d(m_i, m_j) = tr(\sum_{i=1}^c p_i (m_i - m)(m_i - m)^T) = tr(S_b)    (6)

Now we obtain

    J = tr(S_b - S_w)    (7)

Since tr(S_b) measures the overall variance of the class mean vectors, a large tr(S_b) implies that the class mean vectors are scattered over a large space. On the other hand, a small tr(S_w) implies that every class has a small spread. Thus, a large J indicates that patterns are close to each other within the same class but far from each other across different classes, so this criterion may represent class separability better than PCA. Recall that PCA tries to maximize the total scatter after a linear transformation. But a data set with a large within-class scatter can also have a large total scatter, even when it has a small between-class scatter, because S_t = S_b + S_w. Obviously, such data are not easy to classify. Compared with LDA+PCA, we maximize the between-class scatter in the input space rather than in the null space of S_w when S_w is singular. So, our method can keep more discriminative information than LDA+PCA does.

3 Linear Feature Extraction

When performing dimensionality reduction, we want to find a (linear or nonlinear) mapping from the measurement space M to some feature space F such that J is maximized after the transformation. In this section, we discuss how to find an optimal linear feature extractor. In the next section, we will generalize it to the nonlinear case.

Consider a linear mapping W \in R^{D \times d}.
We would like to maximize

    J(W) = tr(S_b^W - S_w^W)

where S_b^W and S_w^W are the between-class and within-class scatter matrices in the feature space F. Since W is a linear mapping, it is easy to show that S_b^W = W^T S_b W and S_w^W = W^T S_w W. So, we have

    J(W) = tr(W^T (S_b - S_w) W)    (8)

In this formulation, we have the freedom to multiply W by some nonzero constant. Thus, we additionally require that W be constituted by unit vectors, i.e., W = [w_1 w_2 ... w_d] with w_k^T w_k = 1. This means that we need to solve the following constrained optimization problem

    max \sum_{k=1}^d w_k^T (S_b - S_w) w_k    subject to  w_k^T w_k - 1 = 0,  k = 1, ..., d

Note that we may also use other constraints here. For example, we may require tr(W^T S_w W) = 1 and then maximize tr(W^T S_b W). It is easy to show that maximizing MMC with such a constraint in fact results in LDA. The only difference is that it involves a constrained optimization, whereas traditional LDA solves an unconstrained optimization. The motivation for using the constraint w_k^T w_k = 1 is that it allows us to avoid calculating the inverse of S_w and thus the potential small sample size problem.

To solve the above optimization problem, we may introduce a Lagrangian

    L(w_k, \lambda_k) = \sum_{k=1}^d w_k^T (S_b - S_w) w_k - \lambda_k (w_k^T w_k - 1)    (9)

with multipliers \lambda_k. The Lagrangian L has to be maximized with respect to \lambda_k and w_k. At the stationary point, the derivatives of L with respect to w_k must vanish,

    \frac{\partial L(w_k, \lambda_k)}{\partial w_k} = ((S_b - S_w) - \lambda_k I) w_k = 0,  k = 1, ..., d    (10)

which leads to

    (S_b - S_w) w_k = \lambda_k w_k,  k = 1, ..., d    (11)

This means that the \lambda_k are the eigenvalues of S_b - S_w and the w_k are the corresponding eigenvectors.
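Because S_b - S_w is symmetric, solving this eigenproblem is a single symmetric eigendecomposition and never forms S_w^{-1}. A minimal NumPy sketch (the function name is ours, not the authors' implementation):

```python
import numpy as np

def mmc_transform(Sb, Sw, d):
    """Return W whose columns are the d largest eigenvectors of S_b - S_w.

    S_b - S_w is symmetric, so eigh gives real eigenvalues and
    orthonormal eigenvectors; the constraint w_k^T w_k = 1 holds
    automatically, and no inverse of S_w is required.
    """
    evals, evecs = np.linalg.eigh(Sb - Sw)   # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # largest first
    return evecs[:, order[:d]], evals[order[:d]]

# Toy symmetric positive semidefinite stand-ins for S_b and S_w.
rng = np.random.default_rng(1)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
Sb, Sw = A @ A.T / 5, B @ B.T / 5
W, lam = mmc_transform(Sb, Sw, 2)
# tr(W^T (S_b - S_w) W) equals the sum of the d largest eigenvalues.
```

The trailing comment restates the result derived next in the text: the attained objective is the sum of the d largest eigenvalues.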
Thus

    J(W) = \sum_{k=1}^d w_k^T (S_b - S_w) w_k = \sum_{k=1}^d \lambda_k w_k^T w_k = \sum_{k=1}^d \lambda_k    (12)

Therefore, J(W) is maximized when W is composed of the first d largest eigenvectors of S_b - S_w. Here, we need not calculate the inverse of S_w, which allows us to avoid the small sample size problem easily. We may also require W to be orthonormal, which may help preserve the shape of the distribution.

4 Nonlinear Feature Extraction with Kernels

In this section, we follow the approach of nonlinear SVMs [10] to kernelize the above linear feature extractor. More precisely, we first reformulate the maximum margin criterion in terms of only the dot products \langle \Phi(x), \Phi(y) \rangle of input patterns. Then we replace the dot product by some positive definite kernel k(x, y), e.g., the Gaussian kernel e^{-\gamma \|x - y\|^2}.

Consider the maximum margin criterion in the feature space F

    J^\Phi(W) = \sum_{k=1}^d w_k^T (S_b^\Phi - S_w^\Phi) w_k

where S_b^\Phi and S_w^\Phi are the between-class and within-class scatter matrices in F, i.e.,

    S_b^\Phi = \sum_{i=1}^c p_i (m_i^\Phi - m^\Phi)(m_i^\Phi - m^\Phi)^T,
    S_w^\Phi = \sum_{i=1}^c p_i S_i^\Phi,
    S_i^\Phi = \frac{1}{n_i} \sum_{j=1}^{n_i} (\Phi(x_j^{(i)}) - m_i^\Phi)(\Phi(x_j^{(i)}) - m_i^\Phi)^T

with m_i^\Phi = \frac{1}{n_i} \sum_{j=1}^{n_i} \Phi(x_j^{(i)}), m^\Phi = \sum_{i=1}^c p_i m_i^\Phi, and x_j^{(i)} the j-th pattern of class C_i, which has n_i samples.

For us, an important fact is that each w_k lies in the span of \Phi(x_1), \Phi(x_2), ..., \Phi(x_n). Therefore, we can find an expansion for w_k in the form w_k = \sum_{l=1}^n \alpha_l^{(k)} \Phi(x_l).
Using this expansion and the definition of m_i^\Phi, we have

    w_k^T m_i^\Phi = \sum_{l=1}^n \alpha_l^{(k)} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} \langle \Phi(x_l), \Phi(x_j^{(i)}) \rangle \right)

Replacing the dot product by some kernel function k(x, y) and defining (\tilde{m}_i)_l = \frac{1}{n_i} \sum_{j=1}^{n_i} k(x_l, x_j^{(i)}), we get w_k^T m_i^\Phi = \alpha_k^T \tilde{m}_i with (\alpha_k)_l = \alpha_l^{(k)}. Similarly, we have

    w_k^T m^\Phi = \sum_{i=1}^c p_i w_k^T m_i^\Phi = \alpha_k^T \sum_{i=1}^c p_i \tilde{m}_i = \alpha_k^T \tilde{m}

with \tilde{m} = \sum_{i=1}^c p_i \tilde{m}_i. This means w_k^T (m_i^\Phi - m^\Phi) = \alpha_k^T (\tilde{m}_i - \tilde{m}), and

    \sum_{k=1}^d w_k^T S_b^\Phi w_k = \sum_{k=1}^d \sum_{i=1}^c p_i (w_k^T (m_i^\Phi - m^\Phi))(w_k^T (m_i^\Phi - m^\Phi))^T = \sum_{k=1}^d \alpha_k^T \tilde{S}_b \alpha_k

where \tilde{S}_b = \sum_{i=1}^c p_i (\tilde{m}_i - \tilde{m})(\tilde{m}_i - \tilde{m})^T.

Similarly, one can simplify W^T S_w^\Phi W.
First, we have w_k^T (\Phi(x_j^{(i)}) - m_i^\Phi) = \alpha_k^T (k_j^{(i)} - \tilde{m}_i), where (k_j^{(i)})_l = k(x_l, x_j^{(i)}). Considering w_k^T S_i^\Phi w_k = \frac{1}{n_i} \sum_{j=1}^{n_i} (w_k^T (\Phi(x_j^{(i)}) - m_i^\Phi))(w_k^T (\Phi(x_j^{(i)}) - m_i^\Phi))^T, we have

    w_k^T S_i^\Phi w_k = \frac{1}{n_i} \sum_{j=1}^{n_i} \alpha_k^T (k_j^{(i)} - \tilde{m}_i)(k_j^{(i)} - \tilde{m}_i)^T \alpha_k
      = \frac{1}{n_i} \sum_{j=1}^{n_i} \alpha_k^T \tilde{S}_i (e_j - \frac{1}{n_i} 1_{n_i})(e_j - \frac{1}{n_i} 1_{n_i})^T \tilde{S}_i^T \alpha_k
      = \frac{1}{n_i} \alpha_k^T \tilde{S}_i \left( \sum_{j=1}^{n_i} (e_j e_j^T - \frac{1}{n_i} e_j 1_{n_i}^T - \frac{1}{n_i} 1_{n_i} e_j^T + \frac{1}{n_i^2} 1_{n_i} 1_{n_i}^T) \right) \tilde{S}_i^T \alpha_k
      = \frac{1}{n_i} \alpha_k^T \tilde{S}_i (I_{n_i \times n_i} - \frac{1}{n_i} 1_{n_i} 1_{n_i}^T) \tilde{S}_i^T \alpha_k

where (\tilde{S}_i)_{lj} = k(x_l, x_j^{(i)}), I_{n_i \times n_i} is the n_i x n_i identity matrix, 1_{n_i} is the n_i-dimensional vector of 1's, and e_j is the j-th canonical basis vector of n_i dimensions. Thus, we obtain

    \sum_{k=1}^d w_k^T S_w^\Phi w_k = \sum_{k=1}^d \alpha_k^T \left( \sum_{i=1}^c p_i \frac{1}{n_i} \tilde{S}_i (I_{n_i} - \frac{1}{n_i} 1_{n_i} 1_{n_i}^T) \tilde{S}_i^T \right) \alpha_k = \sum_{k=1}^d \alpha_k^T \tilde{S}_w \alpha_k

where \tilde{S}_w = \sum_{i=1}^c \frac{p_i}{n_i} \tilde{S}_i (I_{n_i} - \frac{1}{n_i} 1_{n_i} 1_{n_i}^T) \tilde{S}_i^T.
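These kernelized matrices involve only the n x n_i kernel blocks, so they can be assembled without ever touching the feature space F. A sketch under the definitions above (the function name and the Gaussian-kernel demo are ours, not the paper's code):

```python
import numpy as np

def kernel_mmc_matrices(X, y, kernel):
    """Kernelized scatter matrices S~_b and S~_w (both n x n).

    (m~_i)_l = (1/n_i) sum_j kernel(x_l, x_j^(i)),  m~ = sum_i p_i m~_i,
    S~_b = sum_i p_i (m~_i - m~)(m~_i - m~)^T,
    S~_w = sum_i (p_i/n_i) S~_i (I - (1/n_i) 1 1^T) S~_i^T,
    where (S~_i)_{lj} = kernel(x_l, x_j^(i)).
    """
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])  # full kernel matrix
    Sb_t = np.zeros((n, n))
    Sw_t = np.zeros((n, n))
    blocks = []
    m_t = np.zeros(n)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        ni, p = len(idx), len(idx) / n
        Ki = K[:, idx]                       # the n x n_i block S~_i
        mi = Ki.mean(axis=1)                 # m~_i
        m_t += p * mi                        # accumulate m~
        blocks.append((p, ni, Ki, mi))
    for p, ni, Ki, mi in blocks:
        dm = (mi - m_t)[:, None]
        Sb_t += p * (dm @ dm.T)
        C = np.eye(ni) - np.ones((ni, ni)) / ni   # centering matrix
        Sw_t += (p / ni) * Ki @ C @ Ki.T
    return Sb_t, Sw_t

# Gaussian kernel with the gamma used in the paper's experiments.
gauss = lambda a, b: np.exp(-0.03125 * np.sum((a - b) ** 2))
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
Sb_t, Sw_t = kernel_mmc_matrices(X, y, gauss)
# The alpha_k are then the largest eigenvectors of Sb_t - Sw_t, and a new
# point z is projected onto sum_l (alpha_k)_l * kernel(x_l, z).
```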
So the maximum margin criterion in the feature space F is

    J(W) = \sum_{k=1}^d \alpha_k^T (\tilde{S}_b - \tilde{S}_w) \alpha_k    (13)

Similar to the observations in Section 3, the above criterion is maximized by the largest eigenvectors of \tilde{S}_b - \tilde{S}_w.

Figure 1: Experimental results obtained using a linear SVM on the original data (RAW) and on the data extracted by LDA+PCA, the linear feature extractor based on MMC (MMC), and the nonlinear feature extractor based on MMC (KMMC), which employs the Gaussian kernel with \gamma = 0.03125. (a) Comparison in terms of error rate. (b) Comparison in terms of training time.

5 Experiments

To evaluate the performance of our new methods (both linear and nonlinear feature extractors), we ran both LDA+PCA and our methods on the ORL face dataset [11]. The ORL dataset consists of 10 face images from each of 40 subjects, for a total of 400 images, with some variation in pose, facial expression, and details. The resolution of the images is 112 x 92, with 256 gray levels. First, we resized the images to 28 x 23 to save experimental time. Then, we reduced the dimensionality of each image set to c - 1, where c is the number of classes. Finally, we trained and tested a linear SVM on the dimensionality-reduced data. As a control, we also trained and tested a linear SVM on the original data before its dimensionality was reduced.

In order to demonstrate the effectiveness and efficiency of our methods, we conducted a series of experiments and compared our results with those obtained using LDA+PCA. The error rates are shown in Fig. 1(a).
When trained with 3 samples and tested with the 7 other samples per class, our method is generally better than LDA+PCA. In fact, our method is usually better than LDA+PCA for other numbers of training samples as well. To save space, we do not show all the results here. Note that our methods can even achieve lower error rates than a linear SVM on the original data (without dimensionality reduction). However, LDA+PCA does not demonstrate such a clear superiority over RAW. Fig. 1(a) also shows that the kernelized (nonlinear) feature extractor based on MMC is significantly better than the linear one, in particular when the number of classes c is large.

Besides accuracy, our methods are also much more efficient than LDA+PCA in terms of the training time required. Fig. 1(b) shows that our linear feature extractor is about 4 times faster than LDA+PCA. The same speedup was observed for other numbers of training samples. Note that our nonlinear feature extractor is also faster than LDA+PCA in this case, although it is in general very time-consuming to calculate the kernel matrix. An explanation of the speedup is that the kernel matrix size equals the number of samples, which is quite small in this case.

Furthermore, our method performs much better than LDA+PCA when n - c is close to the dimensionality D. Because the amount of training data was limited, we resized the images to 168 dimensions to create such a situation. The experimental results are shown in Fig. 2. In this situation, the performance of LDA+PCA drops significantly because the null space of S_w has a small dimensionality. When LDA+PCA tries to maximize the between-class scatter in this small null space, it loses a lot of information. On the other hand, our method tries to maximize the between-class scatter in the original input space.
Figure 2: Comparison between our new methods and LDA+PCA when n - c is close to D. (a) Each class contains three training samples. (b) Each class contains four training samples.

From Fig. 2, we can see that LDA+PCA is ineffective in this situation, since it is even worse than a random guess. But our method still produced acceptable results. Thus, the experimental results show that our method is better than LDA+PCA in terms of both accuracy and efficiency.

6 Conclusion

In this paper, we proposed both linear and nonlinear feature extractors based on the maximum margin criterion. The new methods do not suffer from the small sample size problem. The experimental results show that they are very efficient, accurate, and robust.

Acknowledgments

We thank D. Gunopulos, C. Domeniconi, and J. Peng for valuable discussions and comments. This work was partially supported by NSF grants CCR-9988353 and ACI-0085910.

References

[1] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.

[2] M. Loog, R. P. W. Duin, and R. Haeb-Umbach. Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(7):762-766, 2001.

[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 2nd edition, 1990.

[4] Q. Tian, M. Barbero, Z. Gu, and S. Lee. Image classification by the Foley-Sammon transform. Optical Engineering, 25(7):834-840, 1986.

[5] Z. Hong and J. Yang.
Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition, 24(4):317-324, 1991.

[6] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.

[7] K. Liu, Y. Cheng, and J. Yang. A generalized optimal set of discriminant vectors. Pattern Recognition, 25(7):731-739, 1992.

[8] L. Chen, H. Liao, M. Ko, J. Lin, and G. Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33(10):1713-1726, 2000.

[9] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.

[10] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

[11] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, FL, 1994.
", "award": [], "sourceid": 2442, "authors": [{"given_name": "Haifeng", "family_name": "Li", "institution": null}, {"given_name": "Tao", "family_name": "Jiang", "institution": null}, {"given_name": "Keshu", "family_name": "Zhang", "institution": null}]}