{"title": "Extended Grassmann Kernels for Subspace-Based Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": "Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases.", "full_text": "Extended Grassmann Kernels for\n\nSubspace-Based Learning\n\nJihun Hamm\n\nGRASP Laboratory\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\njhham@seas.upenn.edu\n\nDaniel D. Lee\n\nGRASP Laboratory\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nddlee@seas.upenn.edu\n\nAbstract\n\nSubspace-based learning problems involve data whose elements are linear sub-\nspaces of a vector space. To handle such data structures, Grassmann kernels have\nbeen proposed and used previously. In this paper, we analyze the relationship be-\ntween Grassmann kernels and probabilistic similarity measures. Firstly, we show\nthat the KL distance in the limit yields the Projection kernel on the Grassmann\nmanifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is\nsuboptimal for subspace-based problems. 
Secondly, based on our analysis of the\nKL distance, we propose extensions of the Projection kernel to the set of affine\nas well as scaled subspaces. We demonstrate the advantages of these extended kernels\nfor classification and recognition tasks with\nSupport Vector Machines and Kernel Discriminant Analysis using synthetic and\nreal image databases.\n\n1 Introduction\n\nIn machine learning problems the data often live in a vector space, typically a Euclidean space.\nHowever, there are many other kinds of non-Euclidean spaces suitable for data outside this\nconventional context. In this paper we focus on the domain where each data sample is a linear subspace\nof vectors, rather than a single vector, of a Euclidean space. Low-dimensional subspace structures\nare commonly encountered in computer vision problems. For example, the variation of images due\nto changes of pose, illumination, etc., is well approximated by the subspace spanned by a few\n\u201ceigenfaces\u201d. More recent examples include the dynamical system models of video sequences from\nhuman actions or time-varying textures, represented by the linear span of the observability matrices\n[1, 14, 13].\nSubspace-based learning is an approach that handles the data as a collection of subspaces instead of the\nusual vectors. The appropriate data space for subspace-based learning is the Grassmann manifold\n$G(m, D)$, which is defined as the set of $m$-dimensional linear subspaces in $\\mathbb{R}^D$. In particular, we\ncan define positive definite kernels on the Grassmann manifold, which allows us to treat the space as\nif it were a Euclidean space. 
Previously, the Binet-Cauchy kernel [17, 15] and the Projection kernel\n[16, 6] have been proposed and have shown potential for subspace-based learning problems.\nOn the other hand, the subspace-based learning problem can be approached purely probabilistically.\nSuppose the vectors in a set are i.i.d. samples from an arbitrary probability distribution. Then it is\npossible to compare two such distributions of vectors with probabilistic similarity measures, such\nas the KL distance (see footnote 1), the Chernoff distance, or the Bhattacharyya/Hellinger distance, to name a few\n[11, 7, 8, 18]. Furthermore, the Bhattacharyya affinity is indeed a positive definite kernel function\non the space of distributions and has nice closed-form expressions for the exponential family [7].\n\nFootnote 1: By distance we mean any nonnegative measure of similarity, not necessarily a metric.\n\nIn this paper, we investigate the relationship between the Grassmann kernels and the probabilistic\ndistances. The link is provided by the probabilistic generalization of subspaces with a Factor\nAnalyzer, which is a Gaussian \u2018blob\u2019 that has nonzero volume along all dimensions.\nFirstly, we show that the KL distance yields the Projection kernel on the Grassmann manifold in the\nlimit of zero noise, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal\nfor subspace-based problems. Secondly, based on our analysis of the KL distance, we propose an\nextension of the Projection kernel, which is originally confined to the set of linear subspaces, to the\nset of affine as well as scaled subspaces.\nWe will demonstrate the extended kernels with Support Vector Machines and Kernel Discriminant\nAnalysis using synthetic and real image databases. 
The proposed kernels show better\nperformance than previously used kernels such as the Binet-Cauchy and the Bhattacharyya\nkernels.\n\n2 Probabilistic subspace distances and kernels\n\nIn this section we will consider two well-known probabilistic distances, the KL distance and the\nBhattacharyya distance, and establish their relationships to the Grassmann kernels. Although these\nprobabilistic distances are not restricted to specific distributions, we will model the data distribution\nas the Mixture of Factor Analyzers (MFA) [4]. If we have $i = 1, ..., N$ sets in the data, then each set\nis considered as i.i.d. samples from the $i$-th Factor Analyzer\n\n$$x \\sim p_i(x) = \\mathcal{N}(u_i, C_i), \\quad C_i = Y_i Y_i' + \\sigma^2 I_D, \\tag{1}$$\n\nwhere $u_i \\in \\mathbb{R}^D$ is the mean, $Y_i$ is a full-rank $D \\times m$ matrix ($D > m$), and $\\sigma$ is the ambient noise\nlevel. The FA model is a practical substitute for a Gaussian distribution in case the dimensionality $D$\nof the images is greater than the number of samples $n$ in a set, since in that case the full covariance $C$\ncan neither be estimated nor inverted.\nMore importantly, we use the FA distribution to provide the link between the Grassmann manifold\nand the space of probabilistic distributions. In fact a linear subspace can be considered as the\n\u2018flattened\u2019 ($\\sigma \\to 0$) limit of a zero-mean ($u_i = 0$), homogeneous ($Y_i' Y_i = I_m$) FA distribution. We will\nlook at the limits of the KL distance and the Bhattacharyya kernel under this condition.\n\n2.1 KL distance in the limit\n\nThe (symmetrized) KL distance is defined as\n\n$$J_{KL}(p_1, p_2) = \\int [p_1(x) - p_2(x)] \\log \\frac{p_1(x)}{p_2(x)} \\, dx. \\tag{2}$$\n\nLet $C_i = \\sigma^2 I + Y_i Y_i'$ be the covariance matrix, and define $\\tilde{Y}_i = Y_i (\\sigma^2 I + Y_i' Y_i)^{-1/2}$ and\n$\\tilde{Z} = 2^{-1/2} [\\tilde{Y}_1 \\; \\tilde{Y}_2]$. In this case the KL distance is\n\n$$J_{KL}(p_1, p_2) = \\frac{1}{2} \\mathrm{tr}(-\\tilde{Y}_1' \\tilde{Y}_1 - \\tilde{Y}_2' \\tilde{Y}_2) + \\frac{\\sigma^{-2}}{2} \\mathrm{tr}(Y_1' Y_1 + Y_2' Y_2 - \\tilde{Y}_1' Y_2 Y_2' \\tilde{Y}_1 - \\tilde{Y}_2' Y_1 Y_1' \\tilde{Y}_2) + \\frac{\\sigma^{-2}}{2} (u_1 - u_2)' (2 I_D - \\tilde{Y}_1 \\tilde{Y}_1' - \\tilde{Y}_2 \\tilde{Y}_2') (u_1 - u_2). \\tag{3}$$\n\nUnder the subspace condition ($\\sigma \\to 0$, $u_i = 0$, $Y_i' Y_i = I_m$, $i = 1, ..., N$), we have $\\tilde{Y}_i = (\\sigma^2 + 1)^{-1/2} Y_i$ and the KL distance simplifies\nto\n\n$$J_{KL}(p_1, p_2) = \\frac{1}{2} \\left( -2 \\frac{m}{\\sigma^2 + 1} \\right) + \\frac{\\sigma^{-2}}{2} \\left( 2m - 2 \\frac{1}{\\sigma^2 + 1} \\mathrm{tr}(Y_1' Y_2 Y_2' Y_1) \\right) = \\frac{1}{2 \\sigma^2 (\\sigma^2 + 1)} \\left( 2m - 2 \\mathrm{tr}(Y_1' Y_2 Y_2' Y_1) \\right).$$\n\nWe can ignore the multiplying factors which do not depend on $Y_1$ or $Y_2$, and rewrite the distance as\n\n$$J_{KL}(p_1, p_2) = 2m - 2 \\mathrm{tr}(Y_1' Y_2 Y_2' Y_1).$$\n\nThis distance coincides with the definition of the squared\nProjection distance [2, 16, 6], with the corresponding Projection kernel\n\n$$k_{Proj}(Y_1, Y_2) = \\mathrm{tr}(Y_1' Y_2 Y_2' Y_1). \\tag{4}$$\n\n2.2 Bhattacharyya kernel in the limit\n\nJebara and Kondor [7, 8] proposed the Probability Product kernel\n\n$$k_{Prob}(p_1, p_2) = \\int [p_1(x) \\, p_2(x)]^{\\alpha} \\, dx \\quad (\\alpha > 0), \\tag{5}$$\n\nwhich includes the Bhattacharyya kernel as a special case.\nUnder the subspace condition ($\\sigma \\to 0$, $u_i = 0$, $Y_i' Y_i = I_m$, $i = 1, ..., N$) the kernel $k_{Prob}$ becomes\n\n$$k_{Prob}(p_1, p_2) = \\pi^{(1 - 2\\alpha) D} \\, 2^{-\\alpha D} \\, \\alpha^{-D/2} \\, \\frac{\\sigma^{2\\alpha(m - D) + D}}{(\\sigma^2 + 1)^{\\alpha m}} \\det(I_{2m} - \\tilde{Z}' \\tilde{Z})^{-1/2} \\tag{6}$$\n\n$$\\propto \\det\\left( I_m - \\frac{1}{(2\\sigma^2 + 1)^2} Y_1' Y_2 Y_2' Y_1 \\right)^{-1/2}. \\tag{7}$$\n\nSuppose the two subspaces span($Y_1$) and span($Y_2$) intersect only at the origin, that is, the singular\nvalues of $Y_1' Y_2$ are strictly less than 1. In this case $k_{Prob}$ has a finite value as $\\sigma \\to 0$ and the inversion\nof (7) is well-defined. In contrast, the diagonal terms of $k_{Prob}$ become\n\n$$k_{Prob}(Y_1, Y_1) = \\det\\left( \\left(1 - \\frac{1}{(2\\sigma^2 + 1)^2}\\right) I_m \\right)^{-1/2} = \\left( \\frac{(2\\sigma^2 + 1)^2}{4\\sigma^2 (\\sigma^2 + 1)} \\right)^{m/2}, \\tag{8}$$\n\nwhich diverges to infinity as $\\sigma \\to 0$. This implies that after normalizing the kernel by the diagonal\nterms, the resulting kernel becomes a trivial kernel\n\n$$\\tilde{k}_{Prob}(Y_i, Y_j) = \\begin{cases} 1, & \\mathrm{span}(Y_i) = \\mathrm{span}(Y_j) \\\\ 0, & \\text{otherwise} \\end{cases} \\quad \\text{as } \\sigma \\to 0. \\tag{9}$$\n\nThe derivations are detailed in the thesis [5]. As we claimed earlier, the Probability Product kernel,\nincluding the Bhattacharyya kernel, loses its discriminating power as the distributions become close\nto subspaces.\n\n3 Extended Projection Kernel\n\nBased on the analysis of the previous section, we will extend the Projection kernel (4) to more\ngeneral spaces than the Grassmann manifold in this section. We will examine two directions of\nextension: from linear to affine, and from homogeneous to scaled subspaces.\n\n3.1 Extension to affine subspaces\nAn affine subspace in $\\mathbb{R}^D$ is a linear subspace with an \u2018offset\u2019. In that sense a linear subspace is\nsimply an affine subspace with a zero offset. 
Analogously to the (linear) Grassmann manifold, we\ncan define an affine Grassmann manifold as the set of all $m$-dimensional affine subspaces in $\\mathbb{R}^D$ (see footnote 2).\nThe affine span is defined from the orthonormal basis $Y \\in \\mathbb{R}^{D \\times m}$ and an offset $u \\in \\mathbb{R}^D$ by\n\n$$\\mathrm{aspan}(Y, u) \\triangleq \\{x \\mid x = Y v + u, \\; \\forall v \\in \\mathbb{R}^m\\}. \\tag{10}$$\n\nBy definition, the representation of an affine space by $(Y, u)$ is not unique, and there is an invariance\ncondition for the equivalence of representations:\nDefinition 1 (invariance to representations).\n$\\mathrm{aspan}(Y_1, u_1) = \\mathrm{aspan}(Y_2, u_2)$ if and only if $\\mathrm{span}(Y_1) = \\mathrm{span}(Y_2)$ and $Y_1^{\\perp} (Y_1^{\\perp})' u_1 = Y_2^{\\perp} (Y_2^{\\perp})' u_2$,\nwhere $Y^{\\perp}$ is any orthonormal basis for the orthogonal complement of $\\mathrm{span}(Y)$.\n\nSimilarly to the definition of Grassmann kernels [6], we can now formally define the affine Grassmann\nkernel as follows. Let $k : (\\mathbb{R}^{D \\times m} \\times \\mathbb{R}^D) \\times (\\mathbb{R}^{D \\times m} \\times \\mathbb{R}^D) \\to \\mathbb{R}$ be a real valued symmetric\nfunction $k(Y_1, u_1, Y_2, u_2) = k(Y_2, u_2, Y_1, u_1)$.\n\nFootnote 2: The Grassmann manifold is defined as a quotient space $O(D)/O(m) \\times O(D - m)$ where $O$ is the orthogonal\ngroup. The affine Grassmann manifold is similarly defined as $E(D)/E(m) \\times O(D - m)$, where $E$ is the\nEuclidean group. For more explanations, please refer to [5].\n\nDefinition 2. 
A real valued symmetric function $k$ is an affine Grassmann kernel if it is positive\ndefinite and invariant to different representations:\n\n$$k(Y_1, u_1, Y_2, u_2) = k(Y_3, u_3, Y_4, u_4)$$\n\nfor any $Y_1, Y_2, Y_3, Y_4$ and $u_1, u_2, u_3, u_4$ such that\n$\\mathrm{aspan}(Y_1, u_1) = \\mathrm{aspan}(Y_3, u_3)$ and $\\mathrm{aspan}(Y_2, u_2) = \\mathrm{aspan}(Y_4, u_4)$.\n\nWith this definition we check if the KL distance in the limit suggests an affine Grassmann kernel.\nThe KL distance with only the homogeneity condition $Y_1' Y_1 = Y_2' Y_2 = I_m$ becomes\n\n$$J_{KL}(p_1, p_2) \\to \\frac{1}{2\\sigma^2} \\left[ 2m - 2 \\mathrm{tr}(Y_1' Y_2 Y_2' Y_1) + (u_1 - u_2)' (2 I_D - Y_1 Y_1' - Y_2 Y_2') (u_1 - u_2) \\right].$$\n\nIgnoring the multiplicative factor, the first term is the same as the original Projection kernel, which\nwe will denote as the \u2018linear\u2019 kernel to emphasize the underlying assumption:\n\n$$k_{Lin}(Y_1, Y_2) = \\mathrm{tr}(Y_1 Y_1' Y_2 Y_2'). \\tag{11}$$\n\nThe second term gives rise to a new \u2018kernel\u2019\n\n$$k_u(Y_1, u_1, Y_2, u_2) = u_1' (2 I_D - Y_1 Y_1' - Y_2 Y_2') u_2, \\tag{12}$$\n\nwhich measures the similarity of the offsets $u_1$ and $u_2$ scaled by $2I - Y_1 Y_1' - Y_2 Y_2'$. Unfortunately, this\nterm is not invariant to the choice of representation in the sense of Definition 1. We instead propose the slight\nmodification:\n\n$$k(Y_1, u_1, Y_2, u_2) = u_1' (I - Y_1 Y_1') (I - Y_2 Y_2') u_2.$$\n\nThe proof that the proposed form is invariant and positive definite is straightforward and is omitted.\nCombined with the linear term $k_{Lin}$, this defines the new \u2018affine\u2019 kernel\n\n$$k_{Aff}(Y_1, u_1, Y_2, u_2) = \\mathrm{tr}(Y_1 Y_1' Y_2 Y_2') + u_1' (I - Y_1 Y_1') (I - Y_2 Y_2') u_2. \\tag{13}$$\n\nAs we can see, the KL distance with only the homogeneity condition has two terms related to the\nsubspace $Y$ and the offset $u$. This suggests a general construction rule for affine kernels. 
If we have\ntwo separate positive definite kernels for subspaces and for offsets, we can add or multiply them together to\nconstruct new kernels [10].\n\n3.2 Extension to scaled subspaces\n\nWe have assumed homogeneous subspaces so far. However, if the subspaces are computed from\nthe PCA of real data, the eigenvalues in general will have non-homogeneous values. To incorporate\nthese scales for affine subspaces, we now allow $Y$ to be non-orthonormal and check if the\nresultant kernel is still valid.\n\nLet $Y_i$ be a full-rank $D \\times m$ matrix, and $\\hat{Y}_i = Y_i (Y_i' Y_i)^{-1/2}$ be the orthonormalization of $Y_i$.\nIgnoring the multiplicative factors, the limiting ($\\sigma \\to 0$) \u2018kernel\u2019 from (3) becomes\n\n$$k = \\frac{1}{2} \\mathrm{tr}(\\hat{Y}_1 \\hat{Y}_1' Y_2 Y_2' + Y_1 Y_1' \\hat{Y}_2 \\hat{Y}_2') + u_1' (2I - \\hat{Y}_1 \\hat{Y}_1' - \\hat{Y}_2 \\hat{Y}_2') u_2,$$\n\nwhich is again not well-defined.\nThe second term is the same as (12) in the previous subsection, and can be modified in the same way\nto $k_u = u_1' (I - \\hat{Y}_1 \\hat{Y}_1') (I - \\hat{Y}_2 \\hat{Y}_2') u_2$.\nThe first term is not positive definite, and there are several ways to remedy it. We propose to use the\nfollowing form\n\n$$k(Y_1, Y_2) = \\frac{1}{2} \\mathrm{tr}(Y_1 \\hat{Y}_1' \\hat{Y}_2 Y_2' + \\hat{Y}_1 Y_1' Y_2 \\hat{Y}_2') = \\mathrm{tr}(\\hat{Y}_1' \\hat{Y}_2 Y_2' Y_1),$$\n\namong other possibilities.\nThe sum of the two modified terms is the proposed \u2018affine scaled\u2019 kernel:\n\n$$k_{AffSc}(Y_1, u_1, Y_2, u_2) = \\mathrm{tr}(Y_1 \\hat{Y}_1' \\hat{Y}_2 Y_2') + u_1' (I - \\hat{Y}_1 \\hat{Y}_1') (I - \\hat{Y}_2 \\hat{Y}_2') u_2. \\tag{14}$$\n\nThis is a positive definite kernel, which can be shown from the definition.\n\nSummary of the extended Projection kernels\nThe proposed kernels are summarized below. 
Let $Y_i$ be a full-rank $D \\times m$ matrix, and let\n$\\hat{Y}_i = Y_i (Y_i' Y_i)^{-1/2}$ be the orthonormalization of $Y_i$ as before.\n\n$$\\begin{aligned} k_{Lin}(Y_1, Y_2) &= \\mathrm{tr}(\\hat{Y}_1' \\hat{Y}_2 \\hat{Y}_2' \\hat{Y}_1), \\\\ k_{LinSc}(Y_1, Y_2) &= \\mathrm{tr}(\\hat{Y}_1' \\hat{Y}_2 Y_2' Y_1), \\\\ k_{Aff}(Y_1, u_1, Y_2, u_2) &= \\mathrm{tr}(\\hat{Y}_1' \\hat{Y}_2 \\hat{Y}_2' \\hat{Y}_1) + u_1' (I - \\hat{Y}_1 \\hat{Y}_1') (I - \\hat{Y}_2 \\hat{Y}_2') u_2, \\\\ k_{AffSc}(Y_1, u_1, Y_2, u_2) &= \\mathrm{tr}(\\hat{Y}_1' \\hat{Y}_2 Y_2' Y_1) + u_1' (I - \\hat{Y}_1 \\hat{Y}_1') (I - \\hat{Y}_2 \\hat{Y}_2') u_2. \\end{aligned} \\tag{15}$$\n\nWe also spherize the kernels\n\n$$\\tilde{k}(Y_1, u_1, Y_2, u_2) = k(Y_1, u_1, Y_2, u_2) \\, k(Y_1, u_1, Y_1, u_1)^{-1/2} \\, k(Y_2, u_2, Y_2, u_2)^{-1/2}$$\n\nso that $\\tilde{k}(Y_1, u_1, Y_1, u_1) = 1$ for any $Y_1$ and $u_1$.\n\nThere is a caveat in implementing these kernels. Although we used the same notations $Y$ and $\\hat{Y}$ for\nboth linear and affine kernels, they are computed differently. For linear kernels $Y$ and $\\hat{Y}$ are\ncomputed from data assuming $u = 0$, whereas for affine kernels $Y$ and $\\hat{Y}$ are computed after\nremoving the estimated mean $u$ from the data.\n\n3.3 Extension to nonlinear subspaces\n\nA systematic way of extending the Projection kernel from linear/affine subspaces to nonlinear spaces\nis to use an implicit map via a kernel function, where the latter kernel is to be distinguished from the\nformer kernels. Note that the proposed kernels (15) can be computed only from the inner products\nof the column vectors of the $Y$\u2019s and $u$\u2019s, including the orthonormalization procedure. If we replace the\ninner products of those vectors $y_i' y_j$ by a positive definite function $f(y_i, y_j)$ on Euclidean spaces,\nthis implicitly defines a nonlinear feature space. This \u2018doubly kernel\u2019 approach has already been\nproposed for the Binet-Cauchy kernel [17, 8] and for probabilistic distances in general [18]. 
We\ncan adopt the trick for the extended Projection kernels as well, so that the kernels operate on\n\u2018nonlinear subspaces\u2019 (see footnote 3).\n\nFootnote 3: The preimage corresponding to the linear subspaces in the RKHS via the feature map.\n\n4 Experiments with synthetic data\n\nIn this section we demonstrate the application of the extended Projection kernels to two-class\nclassification problems with Support Vector Machines (SVMs).\n\n4.1 Synthetic data\n\nThe extended kernels are defined under different assumptions of data distribution. To test the kernels\nwe generate three types of data (\u2018easy\u2019, \u2018intermediate\u2019 and \u2018difficult\u2019) from the MFA distribution,\nwhich cover different ranges of data distribution.\nA total of $N = 100$ FA distributions are generated in $D = 10$ dimensional space. The parameters\nof each FA distribution $p_i(x) = \\mathcal{N}(u_i, C_i)$ are randomly chosen such that\n\n\u2022 \u2018Easy\u2019 data have well separated means $u_i$ and homogeneous scales $Y_i' Y_i$.\n\u2022 \u2018Intermediate\u2019 data have partially overlapping means $u_i$ and homogeneous scales $Y_i' Y_i$.\n\u2022 \u2018Difficult\u2019 data have totally overlapping means ($u_1 = ... = u_N = 0$) and randomly chosen\nscales between 0 and 1.\n\nThe class label for each distribution $p_i$ is assigned as follows. We choose a pair of distributions $p_+$\nand $p_-$ which are the farthest apart from each other among all pairs of distributions. Then the labels\nof the remaining distributions $p_i$ are determined by whether they are closer to $p_+$ or $p_-$. The\ndistances are measured by the KL distance $J_{KL}$.\n\n4.2 Algorithms and results\n\nWe compare the performance of the Euclidean SVM with linear/polynomial/RBF kernels and the\nperformance of the SVM with Grassmann kernels. To test the original SVMs, we randomly sampled\n$n = 50$ points from each FA distribution $p_i(x)$. 
We evaluate the algorithm with $N$-fold cross validation\nby holding out one set and training with the other $N - 1$ sets. The polynomial kernel used is\n$k(x_1, x_2) = (\\langle x_1, x_2 \\rangle + 1)^3$.\nTo test the Grassmann SVM, we first estimated the mean $u_i$ and the basis $Y_i$ from the $n = 50$ points of\neach FA distribution $p_i(x)$ used for the original SVM. The $Y_i$, $u_i$ and $\\sigma$ are estimated simply from\nthe probabilistic PCA [12], although they can also be estimated by the Expectation Maximization\napproach.\nSix different Grassmann kernels are compared: 1) the original and the extended Projection kernels\n(Linear, Linear Scaled, Affine, Affine Scaled), 2) the Binet-Cauchy kernel\n$k_{BC}(Y_1, Y_2) = (\\det Y_1' Y_2)^2 = \\det Y_1' Y_2 Y_2' Y_1$, and 3) the Bhattacharyya kernel\n$k_{Bhat}(p_1, p_2) = \\int [p_1(x) \\, p_2(x)]^{1/2} \\, dx$ adapted for FA distributions. We evaluate the algorithms\nwith a leave-one-out test by holding out one subspace and training with the other $N - 1$\nsubspaces.\n\nTable 1: Classification rates (%) of the Euclidean SVMs and the Grassmann SVMs. BC and Bhat\nare short for the Binet-Cauchy and Bhattacharyya kernels, respectively.\n\n             | Euclidean       | Grassmann                                  | Probabilistic\n             | Linear | Poly   | Linear | Lin Sc | Aff    | Aff Sc | BC     | Bhat\nEasy         | 84.63  | 79.85  | 55.30  | 55.10  | 92.70  | 92.30  | 54.70  | 46.10\nIntermediate | 62.40  | 61.76  | 68.10  | 67.50  | 85.20  | 83.60  | 60.90  | 59.00\nDifficult    | 52.00  | 63.74  | 81.00  | 80.10  | 80.30  | 81.20  | 68.90  | 77.30\n\nTable 1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs, averaged\nover 10 trials. The results show that the best rates are obtained from the extended kernels, and the\nEuclidean kernels lag behind for all three types of data. Interestingly the polynomial kernels often\nperform worse than the linear kernels, and the RBF kernel performed even worse, which we do not\nreport. 
For the \u2018difficult\u2019 data where the means are zero, the Euclidean linear SVMs degrade to the chance\nlevel (50%), which agrees with the intuitive picture that any decision hyperplane that passes through the\norigin will roughly halve the points from a zero-mean distribution. As expected, the linear kernel\nis inappropriate for data with nonzero offsets (\u2018easy\u2019 and \u2018intermediate\u2019), whereas the affine kernel\nperforms well regardless of the offsets. However, there is no significant difference between the\nhomogeneous and the scaled kernels. The Binet-Cauchy and the Bhattacharyya kernels mostly\nunderperformed.\nWe conclude that under certain conditions the extended kernels have clear advantages over the\noriginal linear kernels and the Euclidean kernels for the subspace-based classification problem.\n\n5 Experiments with real-world data\n\nIn this section we demonstrate the application of the extended Projection kernels to recognition\nproblems with the kernel Fisher Discriminant Analysis [10].\n\n5.1 Databases\n\nThe Yale face database and the Extended Yale face database [3] together consist of pictures of 38\nsubjects with 9 different poses and 45 different lighting conditions. The ETH-80 [9] is an object\ndatabase designed for object categorization tests under varying poses. The database consists of\npictures of 8 object categories and 10 object instances for each category, recorded under 41 different\nposes.\n\nThese databases have naturally factorized structures which make them ideal for testing subspace-based\nlearning algorithms. In the Yale Face database, a set consists of images of all illumination\nconditions of a person at a fixed pose. By treating the set as a point in the Grassmann manifold, we can\nperform illumination-invariant learning tasks with the data. For the ETH-80 database, a set consists of\nimages of all possible poses of an object from a category. 
Also by treating such a set as a point in the\nGrassmann manifold, we can perform pose-invariant learning tasks with the data.\nThere are a total of $N = 279$ and 80 sets as described above, respectively. The images are resized\nto the dimension of $D = 504$ and 1024 respectively, and subspaces of dimension up to $m = 9$\nare used to compute the kernels. The subspace parameters $Y_i$, $u_i$ and $\\sigma$ are estimated\nfrom the probabilistic PCA [12].\n\n5.2 Algorithms and results\n\nWe perform subject recognition tests with Yale Face, and categorization tests with the ETH-80 database.\nSince these databases are highly multiclass (31 and 8 classes) relative to the total number of samples,\nwe use the kernel Discriminant Analysis to reduce dimensionality and extract features, in conjunction\nwith a 1-NN classifier. Six different Grassmann kernels are compared: the extended\nProjection (Lin/LinSc/Aff/AffSc) kernels, the Binet-Cauchy kernel, and the Bhattacharyya kernel.\nThe baseline algorithm (Eucl) is the Linear Discriminant Analysis applied to the original images in\nthe data from which the subspaces are computed.\nFigure 1 summarizes the average recognition/categorization rates from 9- and 10-fold cross validation\nwith the Yale Face and ETH-80 databases respectively. The results show that the best rates are\nachieved by the extended kernels: the linear scaled kernel in Yale Face and the affine kernel in ETH-80.\nHowever, the differences among the extended kernels are small. The performance of the extended\nkernels remains relatively unaffected by the subspace dimensionality, which is a convenient property\nin practice since we do not know the true dimensionality a priori. 
However, the Binet-Cauchy and the\nBhattacharyya kernels do not perform as well, and degrade fast as the subspace dimension increases.\nThe analysis of the poor performance is given in the thesis [5].\n\n6 Conclusion\n\nIn this paper we analyzed the relationship between probabilistic distances and the geometric Grassmann\nkernels, especially the KL distance and the Projection kernel. This analysis helps us to understand\nthe limitations of the Bhattacharyya kernel for subspace-based problems, and also suggests\nextensions of the Projection kernel. With synthetic and real data we demonstrated that the extended\nkernels can outperform the original Projection kernel, as well as the previously used Bhattacharyya\nand Binet-Cauchy kernels, for subspace-based classification problems. The relationship between\nother probabilistic distances and the Grassmann kernels is yet to be fully explored, and we expect to\nsee more results from a follow-up study.\n\nReferences\n[1] Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto. Dynamic textures. Int. J.\nComput. Vision, 51(2):91\u2013109, 2003.\n\n[2] Alan Edelman, Tom\u00e1s A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality\nconstraints. SIAM J. Matrix Anal. Appl., 20(2):303\u2013353, 1999.\n\n[3] Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination\ncone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach.\nIntell., 23(6):643\u2013660, 2001.\n\n[4] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. 
Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 21 1996.\n\n[5] Jihun Hamm. Subspace-based Learning with Grassmann Manifolds. Ph.D thesis in Electrical\nand Systems Engineering, University of Pennsylvania, 2008. Available at\nhttp://www.seas.upenn.edu/ jhham/Papers/thesis-jh.pdf.\n\n[6] Jihun Hamm and Daniel Lee. Grassmann discriminant analysis: a unifying view on subspace-based\nlearning. In Int. Conf. Mach. Learning, 2008.\n\nFigure 1: Comparison of Grassmann kernels for face recognition/object categorization tasks with\nkernel discriminant analysis. The extended Projection kernels (Lin/LinSc/Aff/AffSc) outperform\nthe baseline method (Eucl) and the Binet-Cauchy (BC) and the Bhattacharyya (Bhat) kernels.\n\n[7] Tony Jebara and Risi Imre Kondor. Bhattacharyya expected likelihood kernels. In COLT, pages 57\u201371,\n2003.\n\n[8] Risi Imre Kondor and Tony Jebara. A kernel between sets of vectors. In Proc. of the 20th Int. Conf. on\nMach. Learn., pages 361\u2013368, 2003.\n\n[9] Bastian Leibe and Bernt Schiele. Analyzing appearance and contour based methods for object\ncategorization. CVPR, 02:409, 2003.\n\n[10] Bernhard Sch\u00f6lkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines,\nRegularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.\n\n[11] Gregory Shakhnarovich, John W. Fisher, and Trevor Darrell. Face recognition from long-term\nobservations. In Proc. of the 7th Euro. Conf. on Computer Vision, pages 851\u2013868, London, UK, 2002.\n\n[12] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal Of\nThe Royal Statistical Society Series B, 61(3):611\u2013622, 1999.\n\n[13] Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa. Statistical analysis on Stiefel and Grassmann\nmanifolds with applications in computer vision. In CVPR, 2008.\n\n[14] Ashok Veeraraghavan, Amit K. 
Roy-Chowdhury, and Rama Chellappa. Matching shape sequences\nin video with applications in human movement analysis. IEEE Trans. Pattern Anal. Mach. Intell.,\n27(12):1896\u20131909, 2005.\n\n[15] S.V.N. Vishwanathan and Alexander J. Smola. Binet-Cauchy kernels. In NIPS, 2004.\n[16] Liwei Wang, Xiao Wang, and Jufu Feng. Subspace distance analysis with application to adaptive bayesian\nalgorithm for face recognition. Pattern Recogn., 39(3):456\u2013464, 2006.\n\n[17] Lior Wolf and Amnon Shashua. Learning over sets using kernel principal angles. J. Mach. Learn. Res.,\n4:913\u2013931, 2003.\n\n[18] Shaohua Kevin Zhou and Rama Chellappa. From sample similarity to ensemble similarity: Probabilistic\ndistance measures in Reproducing Kernel Hilbert Space. IEEE Trans. Pattern Anal. Mach. Intell.,\n28(6):917\u2013929, 2006.\n\n[Figure 1 plots: two panels, Yale Face and ETH-80; x-axis: subspace dimension (m), 1 to 9; y-axis: rate (%), 40 to 100; curves: Eucl, Lin, LinSc, Aff, AffSc, BC, Bhat.]", "award": [], "sourceid": 716, "authors": [{"given_name": "Jihun", "family_name": "Hamm", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}