{"title": "Kernel Projection Machine: a New Tool for Pattern Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1649, "page_last": 1656, "abstract": null, "full_text": " Kernel Projection Machine: a New Tool for Pattern Recognition

Gilles Blanchard, Fraunhofer First (IDA), Kekulestr. 7, D-12489 Berlin, Germany, blanchar@first.fhg.de
Pascal Massart, Departement de Mathematiques, Universite Paris-Sud, Bat. 425, F-91405 Orsay, France, Pascal.Massart@math.u-psud.fr
Regis Vert, LRI, Universite Paris-Sud, Bat. 490, F-91405 Orsay, France; Masagroup, 24 Bd de l'Hopital, F-75005 Paris, France, Regis.Vert@lri.fr
Laurent Zwald, Departement de Mathematiques, Universite Paris-Sud, Bat. 425, F-91405 Orsay, France, Laurent.Zwald@math.u-psud.fr

Abstract

This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has previously been used as a pre-processing step before applying an SVM, but we point out that this combination is somewhat redundant from a regularization point of view, and we propose a new algorithm, called the Kernel Projection Machine, to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performance as an SVM.

1 Introduction

Let (x_i, y_i), i = 1, ..., n, be n given realizations of a random variable (X, Y) living in X × {-1, 1}. Let P denote the marginal distribution of X. The x_i's are often referred to as inputs (or patterns), and the y_i's as labels. Pattern recognition is concerned with finding a classifier, i.e. 
a function that assigns a label to any new input x in X and that makes as few prediction errors as possible.
It is often the case with real world data that the dimension of the patterns is very large, and some of the components carry more noise than information. In such cases, reducing the dimension of the data before running a classification algorithm on it sounds reasonable. One of the most famous methods for this kind of pre-processing is PCA, and its kernelized version (KPCA), introduced in the pioneering work of Schölkopf, Smola and Müller [8].

(This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.)

Now, whether the quality of a given classification algorithm can be significantly improved by using such pre-processed data still remains an open question. Some experiments have already been carried out to investigate the use of KPCA for classification purposes, and numerical results are reported in [8]. The authors considered the USPS handwritten digit database and reported the test error rates achieved by a linear SVM trained on data pre-processed with KPCA: the conclusion was that the larger the number of principal components, the better the performance. In other words, the KPCA step was useless or even counterproductive.
This conclusion might be explained by a redundancy arising in their experiments: there is actually a double regularization, the first corresponding to the dimensionality reduction achieved by KPCA, and the other to the regularization achieved by the SVM. With that in mind it does not seem so surprising that KPCA does not help in that case: whatever the dimensionality reduction, the SVM anyway achieves a (possibly strong) regularization.
Still, de-noising the data using KPCA seems relevant. The aforementioned experiments suggest that KPCA should be used together with a classification algorithm that is not regularized (e.g. 
a simple empirical risk minimizer): in that case, it should be expected that KPCA is by itself sufficient to achieve regularization, the choice of the dimension being guided by adequate model selection.
In this paper, we propose a new algorithm, called the Kernel Projection Machine (KPM), that implements this idea: an optimal dimension is sought so as to minimize the test error of the resulting classifier. A nice property is that the training labels are used to select the optimal dimension; here \"optimal\" means that the resulting D-dimensional representation of the data contains the right amount of information needed to classify the inputs. To sum up, the KPM can be seen as a dimensionality-reduction-based classification method that takes the labels into account for the dimensionality reduction step.
This paper is organized as follows: Section 2 gives some statistical background on regularized methods vs. projection methods. Its goal is to explain the motivation and the \"Gaussian intuition\" that lies behind the KPM algorithm from a statistical point of view. Section 3 explicitly gives the details of the algorithm; experiments and results, which should be considered preliminary, are reported in Section 4.

2 Motivations for the Kernel Projection Machine

2.1 The Gaussian Intuition: a Statistician's Perspective

Regularization methods have been used for quite a long time in nonparametric statistics since the pioneering works of Grace Wahba in the eighties (see [10] for a review). Even if the classification context has its own specificities and offers new challenges (especially when the explanatory variables live in a high dimensional Euclidean space), it is good to remember what is the essence of regularization in the simplest nonparametric statistical framework: the Gaussian white noise model.
So let us assume that one observes a noisy signal

    dY(x) = s(x) dx + (1/sqrt(n)) dw(x),   Y(0) = 0,

on [0, 1], where dw(x) denotes standard white noise. 
To the reader not familiar with this model, it should be considered as nothing more than an idealization of the well-known fixed design regression problem Y_i = s(i/n) + e_i for i = 1, ..., n, where e_i ~ N(0, 1), and where the goal is to recover the regression function s. (The white noise model is actually simpler to study from a mathematical point of view.) The least squares criterion is defined as

    gamma_n(f) = ||f||^2 - 2 int_0^1 f(x) dY(x)

for every f in L^2([0, 1]).
Given a Mercer kernel k on [0, 1] × [0, 1], the regularized least squares procedure proposes to minimize

    gamma_n(f) + lambda_n ||f||^2_{H_k}     (1)

where (lambda_n) is a conveniently chosen sequence and H_k denotes the RKHS induced by k. This procedure can indeed be viewed as a model selection procedure, since minimizing gamma_n(f) + lambda_n ||f||^2_{H_k} amounts to minimizing

    inf_{R > 0} [ inf_{||f||_{H_k} <= R} gamma_n(f) + lambda_n R^2 ].

In other words, regularization aims at selecting the \"best\" RKHS ball {f, ||f||_{H_k} <= R} to represent our data.
At this stage, it is interesting to realize that the balls in the RKHS can be viewed as ellipsoids in the original Hilbert space L^2([0, 1]). Indeed, let (phi_i)_{i>=1} be an orthonormal basis of eigenfunctions of the compact, self-adjoint operator

    T_k : f |-> int_0^1 k(x, y) f(x) dx.

Then, setting beta_j = int_0^1 f(x) phi_j(x) dx, one has ||f||^2_{H_k} = sum_{j>=1} beta_j^2 / lambda_j, where (lambda_j)_{j>=1} denotes the non-increasing sequence of eigenvalues corresponding to (phi_j)_{j>=1}. Hence

    {f, ||f||_{H_k} <= R} = { sum_{j>=1} beta_j phi_j ; sum_{j>=1} beta_j^2 / lambda_j <= R^2 }.

Now, due to the approximation properties of the finite dimensional spaces span{phi_j, j <= D}, D in N, with respect to these ellipsoids, one can think of penalized finite dimensional projection as an alternative to regularization. More precisely, if s_D denotes the projection estimator on span{phi_j, j <= D}, i.e. 
s_D = sum_{j=1}^D (int phi_j dY) phi_j, and one considers the penalized selection criterion

    D_hat = argmin_D [ gamma_n(s_D) + 2D/n ],

then it is proved in [1] that the selected estimator s_{D_hat} obeys the following oracle inequality:

    E[ ||s - s_{D_hat}||^2 ] <= C inf_{D>=1} E[ ||s - s_D||^2 ],

where C is some absolute constant.
The nice thing is that whenever s belongs to some ellipsoid

    E(c) = { sum_{j>=1} beta_j phi_j : sum_{j>=1} beta_j^2 / c_j^2 <= 1 },

where (c_j)_{j>=1} is a decreasing sequence tending to 0 as j -> infinity, then

    inf_{D>=1} E[ ||s - s_D||^2 ] = inf_{D>=1} [ inf_{t in S_D} ||s - t||^2 + D/n ] <= inf_{D>=1} [ c_D^2 + D/n ].

As shown in [5], inf_{D>=1} [c_D^2 + D/n] is (up to some absolute constant) of the order of magnitude of the minimax risk over E(c).
As a consequence, the estimator s_{D_hat} is simultaneously minimax over the collection of all ellipsoids E(c), which in particular includes the collection {E(R), R > 0}.
To conclude and summarize, from a statistical performance point of view, what we can expect from a regularized estimator s_tilde (i.e. a minimizer of (1)) is that a convenient choice of lambda_n ensures that s_tilde is simultaneously minimax over the collection of ellipsoids {E(R), R > 0} (at least as far as asymptotic rates of convergence are concerned). The alternative estimator s_{D_hat} actually achieves this goal and even better, since it is also adaptive over the collection of all ellipsoids, not only the family {E(R), R > 0}.

2.2 Extension to a general classification framework

In this section we go back to the classification framework as described in the introduction. 
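As a numerical illustration of the projection-estimator selection described in Section 2.1, the following sketch works in the fixed-design regression surrogate of the white noise model and selects D by minimizing the penalized criterion gamma_n(s_D) + 2D/n. The cosine basis, the toy signal and all names below are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.arange(1, n + 1) / n
s = np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)  # toy signal
y = s + rng.normal(0.0, 1.0, n)                          # Y_i = s(i/n) + e_i

# Cosine basis, approximately orthonormal for the empirical inner product.
def phi(j, x):
    return np.ones_like(x) if j == 0 else np.sqrt(2) * np.cos(np.pi * j * x)

D_max = 50
# empirical coefficients beta_j, playing the role of int phi_j dY
beta = np.array([phi(j, x) @ y / n for j in range(D_max)])

# Up to an additive constant, gamma_n(s_D) = -sum_{j<=D} beta_j^2,
# so the penalized criterion gamma_n(s_D) + 2D/n becomes:
crit = [-np.sum(beta[:D] ** 2) + 2 * D / n for D in range(1, D_max + 1)]
D_hat = int(np.argmin(crit)) + 1  # selected dimension
```

The key point mirrored here is that the penalty 2D/n alone, with no smoothness penalty on the coefficients, performs the regularization.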
First of all, it has been noted by several authors ([6],[9]) that the SVM can be seen as a regularized estimation method, where the regularizer is the squared norm of the function in H_k. Precisely, the SVM algorithm solves the following unconstrained optimization problem:

    min_{f in H_k^b} (1/n) sum_{i=1}^n (1 - y_i f(x_i))_+ + lambda ||f||^2_{H_k},     (2)

where H_k^b = {f(.) + b, f in H_k, b in R}.
The above regularization can be viewed as a model selection process over RKHS balls, similarly to the previous section. Now, the line of ideas developed there suggests that it might actually be a better idea to consider a sequence of finite-dimensional estimators. Additionally, it has been shown in [4] that the regularization term of the SVM is actually too strong. We therefore transpose the ideas of the previous Gaussian case to the classification framework. Consider a Mercer kernel k defined on X × X and let T_k denote the operator associated with the kernel k in the following way:

    T_k : f(.) in L^2(X) |-> int_X k(x, .) f(x) dP(x) in L^2(X).

Let psi_1, psi_2, ... denote the eigenfunctions of T_k, ordered by decreasing associated eigenvalues (lambda_i)_{i>=1}. For each integer D, the subspace F_D defined by F_D = span{1, psi_1, ..., psi_D} (where 1 denotes the constant function equal to 1) corresponds to a subspace of the space H_k^b associated with the kernel k, and H_k^b is the closure of the union of the F_D's. Instead of selecting the \"best\" ball in the RKHS, as the SVM does, we consider the analogue of the projection estimator s_D:

    f_hat_D = argmin_{f in F_D} sum_{i=1}^n (1 - y_i f(x_i))_+     (3)

that is, more explicitly,

    f_hat_D(.) = sum_{j=1}^D beta_hat_j psi_j(.) + b_hat

with

    (beta_hat, b_hat) = argmin_{(beta, b) in R^D x R} sum_{i=1}^n ( 1 - y_i ( sum_{j=1}^D beta_j psi_j(x_i) + b ) )_+     (4)

An appropriate D can then be chosen using an adequate model selection procedure such as penalization; we do not address this point in detail in the present work, but it is of course the next step to be taken.

Unfortunately, since the underlying probability P is unknown, neither are the eigenfunctions psi_1, psi_2, ..., and it is therefore not possible to implement this procedure directly. We thus resort to considering empirical quantities, as will be explained in more detail in Section 3. Essentially, the unknown vector space spanned by the first eigenfunctions of T_k is replaced by the space spanned by the first eigenvectors of the normalized kernel Gram matrix (1/n)(k(x_i, x_j))_{1<=i,j<=n}. At this point the relation with Kernel PCA appears. We next make this relation precise and give an interpretation of the resulting algorithm in terms of dimensionality reduction.

2.3 Link with Kernel Principal Component Analysis

Principal Component Analysis (PCA) and its non-linear variant, KPCA, are widely used algorithms in data analysis. They extract from the input data space a basis (v_i)_{i>=1} which is, in some sense, adapted to the data by looking for directions where the variance is maximized. They are often used as a pre-processing of the data in order to reduce the dimensionality or to perform de-noising.
As will be made more explicit in the next section, the Kernel Projection Machine consists in replacing the ideal projection estimator defined by (3) by

    f_bar_D = argmin_{f in S_D} (1/n) sum_{i=1}^n (1 - y_i f(X_i))_+

where S_D is the D-dimensional space spanned by the first D principal components chosen by KPCA in feature space. 
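The empirical ingredient underlying S_D, the top-D eigenvectors of the kernel Gram matrix, can be sketched as follows (centering of the Gram matrix is omitted here, in line with the algorithm of Section 3; the RBF kernel and the toy data are our own illustrative assumptions):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def top_eigenpairs(K, D):
    # eigh returns eigenvalues in increasing order; keep the D largest.
    w, U = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:D]
    # columns of V play the role of (psi_j(x_1), ..., psi_j(x_n))
    return w[idx], U[:, idx]

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
lam, V = top_eigenpairs(rbf_gram(X), D=5)
```

The columns of V are the empirical surrogates for the first eigenfunctions of T_k evaluated at the sample points, as made precise in Section 3.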
Hence, roughly speaking, in the KPM, the SVM penalization is replaced by dimensionality reduction.
Choosing D amounts to selecting the optimal D-dimensional representation of our data for the classification task; in other words, to extracting the information that is needed for this task by model selection, taking into account the relevance of the directions for the classification task.
To conclude, the KPM is a method of dimensionality reduction that takes into account the labels of the training data to choose the \"best\" dimension.

3 The Kernel Projection Machine Algorithm

In this section, the empirical (and computable) version of the KPM algorithm is derived from the previous theoretical arguments.
In practice the true eigenfunctions of the kernel operator are not computable. But since only the values of the functions psi_1, ..., psi_D at the points x_1, ..., x_n are needed for minimizing the empirical risk over F_D, the eigenvectors of the kernel matrix K = (k(x_i, x_j))_{1<=i,j<=n} will be enough for our purpose. Indeed, it is well known in numerical analysis (see [2]) that the eigenvectors of the kernel matrix approximate the eigenfunctions of the kernel operator. This result has been pointed out in [7] in a more probabilistic language. More precisely, if V_1, ..., V_D denote the first D eigenvectors of K with associated eigenvalues lambda_1 >= lambda_2 >= ... >= lambda_D, then for each i

    V_i = (V_i(1), ..., V_i(n)) proportional to (psi_i(x_1), ..., psi_i(x_n))     (5)

Hence, considering Equation (4), the empirical version of the algorithm described above will first consist of solving, for each dimension D, the following optimization problem:

    (beta_hat, b_hat) = argmin_{beta in R^D, b in R} sum_{i=1}^n ( 1 - y_i ( sum_{j=1}^D beta_j V_j(i) + b ) )_+     (6)

Then the solution should be

    f_hat_D(.) = sum_{j=1}^D beta_hat_j psi_j(.) + b_hat.     (7)

Once again the true functions psi_j are unknown. At this stage, we can expand the solution in terms of the kernel, similarly to the SVM algorithm, in the following way:

    f_hat_D(.) = sum_{i=1}^n alpha_i k(x_i, .) + b_hat     (8)

[Figure 1. Left: KPM risk (solid) and empirical risk (dashed) versus dimension D. Right: SVM risk and empirical risk versus C. Both on dataset 'flare-solar'.]

Equating both expressions (7) and (8) at the points x_1, ..., x_n leads to the following equation:

    beta_hat_1 V_1 + ... + beta_hat_D V_D = K alpha     (9)

which has a straightforward solution: alpha = sum_{j=1}^D (beta_hat_j / lambda_j) V_j (provided the first D eigenvalues are all strictly positive).
Now the KPM algorithm can be summed up as follows:

    1. Given data x_1, ..., x_n in X and a positive kernel k defined on X × X, compute the kernel matrix K and its eigenvectors V_1, ..., V_n, together with its eigenvalues in decreasing order lambda_1 >= lambda_2 >= ... >= lambda_n.

    2. For each dimension D such that lambda_D > 0, solve the linear optimization problem

        (beta_hat, b_hat) = argmin_{beta, b, xi} sum_{i=1}^n xi_i     (10)

        under the constraints: for all i = 1, ..., n, xi_i >= 0 and y_i ( sum_{j=1}^D beta_j V_j(i) + b ) >= 1 - xi_i.     (11)

        Next, compute alpha = sum_{j=1}^D (beta_hat_j / lambda_j) V_j and f_hat_D(.) = sum_{i=1}^n alpha_i k(x_i, .) + b_hat.

    3. The last step is a model selection problem: choose a dimension D_hat for which f_hat_{D_hat} performs well. We do not address this point directly here; one can think of applying cross-validation, or of penalizing the empirical loss by a penalty function depending on the dimension.

4 Experiments

The KPM was implemented in Matlab using the free library GLPK for solving the linear optimization problem. Since the algorithm involves the eigendecomposition of the kernel matrix, only small datasets have been considered for the moment.
In order to assess the performance of the KPM, we carried out experiments on benchmark datasets available on Gunnar Ratsch's web site [3]. 
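For illustration, steps 1 and 2 of the algorithm, the eigendecomposition and the linear program (10)-(11), can be sketched with an off-the-shelf LP solver. The toy data, the linear kernel and the function name kpm_fit are our own illustrative choices, not the setup used in the experiments:

```python
import numpy as np
from scipy.optimize import linprog

def kpm_fit(K, y, D):
    # Step 1: top-D eigenpairs of the kernel matrix K.
    w, U = np.linalg.eigh(K)
    order = np.argsort(w)[::-1][:D]
    lam, V = w[order], U[:, order]
    n = len(y)
    # Step 2: variables z = (beta_1..beta_D, b, xi_1..xi_n), minimize sum(xi)
    c = np.concatenate([np.zeros(D + 1), np.ones(n)])
    # y_i((V beta)_i + b) >= 1 - xi_i  <=>  -y_i V_i. beta - y_i b - xi_i <= -1
    A = np.hstack([-y[:, None] * V, -y[:, None], -np.eye(n)])
    res = linprog(c, A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (D + 1) + [(0, None)] * n)
    beta, b = res.x[:D], res.x[D]
    alpha = V @ (beta / lam)  # alpha = sum_j (beta_j / lambda_j) V_j, as in (9)
    return alpha, b, res.fun

# toy two-class problem with a linear kernel
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
K = X @ X.T
alpha, b, hinge = kpm_fit(K, y, D=2)
pred = np.sign(K @ alpha + b)  # f(x_i) = sum_j alpha_j k(x_j, x_i) + b
```

On this separable toy problem the hinge loss is driven to zero and the resulting classifier reproduces the training labels; in general, the selection of D in step 3 is what controls overfitting.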
Several state-of-the-art algorithms have already been applied to those datasets, among which the SVM. All results are reported on the web site. To get a valid comparison with the SVM, on each classification task we used the same kernel parameters as those used for the SVM, so as to work with exactly the same geometry.

Table 1: Test errors of the KPM on several benchmark datasets, compared with the SVM, using G. Ratsch's parameter selection procedure (see text). As an indication, the best of the six results presented in [3] are also reported.

                    KPM            (selected D)   SVM            Best of 6
    Banana          10.73 ± 0.42   15             11.53 ± 0.66   10.73 ± 0.43
    Breast Cancer   26.51 ± 4.75   24             26.04 ± 4.74   24.77 ± 4.63
    Diabetis        23.37 ± 1.92   11             23.53 ± 1.73   23.21 ± 1.63
    Flare Solar     32.43 ± 1.85   6              32.43 ± 1.82   32.43 ± 1.82
    German          23.59 ± 2.15   14             23.61 ± 2.07   23.61 ± 2.07
    Heart           16.89 ± 3.53   10             15.95 ± 3.26   15.95 ± 3.26

Table 2: Test errors of the KPM on several benchmark datasets, compared with the SVM, using standard 5-fold cross-validation on each realization.

                    KPM            SVM
    Banana          11.14 ± 0.73   10.69 ± 0.67
    Breast Cancer   26.55 ± 4.43   26.68 ± 5.23
    Diabetis        24.14 ± 1.86   23.79 ± 2.01
    Flare Solar     32.70 ± 1.97   32.62 ± 1.86
    German          23.82 ± 2.23   23.79 ± 2.12
    Heart           17.59 ± 3.30   16.23 ± 3.18

There is a subtle but important point arising here. In the SVM performance reported by G. Ratsch, the regularization parameter C was first determined by cross-validation on the first 5 realizations of each dataset; then the median of these values was taken as a fixed value for the other realizations. This was apparently done to save computation time, but it might lead to an over-optimistic estimation of the performance, since in some sense some extraneous information is then available to the algorithm and the variation due to the choice of C is reduced to almost zero. 
We first tried to mimic this methodology by applying it, in our case, to the choice of D itself (the median of the 5 values of D obtained by cross-validation on the first realizations was then used on the other realizations).
One might then argue that in this way we are selecting a parameter of our method rather than a meta-parameter, as for the SVM, so that the comparison is unfair. However, this distinction being loose, this is a rather moot point. To avoid this kind of debate and obtain fair results, we decided to re-run the SVM tests by systematically selecting the regularization parameter by 5-fold cross-validation on each training set, and, for our method, to apply the same procedure to select D. Note that there is still extraneous information in the choice of the kernel parameters, but at least it is the same for both algorithms.
Results relative to the first methodology are reported in Table 1, and those relative to the second one are reported in Table 2. The globally worse performances exhibited in the second table show that the first procedure may indeed be too optimistic. It is to be mentioned that the parameter C of the SVM was systematically sought on a grid of only 100 values, ranging from 0 to three times the optimal value given in [3]. Hence those experimental results are to be considered preliminary, and in no way should they be used to establish a significant difference between the performances of the KPM and the SVM. Interestingly, the graphic on the left in Figure 1 shows that our procedure is very different from the one of [8]: when D is very large, our risk increases (leading to the existence of a minimum), while the risk of [8] always decreases with D.

5 Conclusion and discussion

To summarize, one can see the KPM as an alternative to the regularization of the SVM: regularization using the RKHS norm can be replaced by finite dimensional projection. 
Moreover, this algorithm performs KPCA geared towards classification and thus offers a criterion to decide what the right order of expansion for the KPCA is.
Dimensionality reduction can thus be used for classification, but it is important to keep in mind that it behaves like a regularizer. Hence, it is clearly useless to plug it into a classification algorithm that is already regularized: the effect of the dimensionality reduction may be canceled, as noted by [8].
Our experiments explicitly show the regularizing effect of KPCA: no other smoothness control has been added in our algorithm and still, it gives performances comparable to those of the SVM, provided the dimension D is picked correctly. We only considered here the selection of D by cross-validation; other methods such as penalization will be studied in future work. Moreover, with this algorithm, we obtain a D-dimensional representation of our data which is optimal for the classification task. Thus the KPM can be seen as a de-noising method that takes the labels into account.
This version of the KPM only considers one kernel and thus one vector space per dimension. A more advanced version of this algorithm would consider several kernels and thus choose among a bigger family of spaces. This family would then contain more than one space per dimension and would allow one to directly compare the performance of different kernels on a given task, thus improving the efficiency of the dimensionality reduction while taking the labels into account.

References

[1] A. Barron, L. Birgé, P. Massart. Risk bounds for model selection via penalization. Probab. Theory Relat. Fields, 113:301-413, 1999.

[2] Baker. The numerical treatment of integral equations. Oxford: Clarendon Press, 1977.

[3] http://ida.first.gmd.de/~raetsch/data/benchmarks.htm. Benchmark repository used in several Boosting, KFD and SVM papers.

[4] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. 
Manuscript, 2004.

[5] D. L. Donoho, R. C. Liu, and B. MacGibbon. Minimax risk over hyperrectangles, and implications. Ann. Statist., 18:1416-1437, 1990.

[6] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171-203, Cambridge, MA, 2000. MIT Press.

[7] V. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 43:191-227, 1998.

[8] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[9] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.

[10] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1990.
", "award": [], "sourceid": 2580, "authors": [{"given_name": "Laurent", "family_name": "Zwald", "institution": null}, {"given_name": "Gilles", "family_name": "Blanchard", "institution": null}, {"given_name": "Pascal", "family_name": "Massart", "institution": null}, {"given_name": "R\u00e9gis", "family_name": "Vert", "institution": null}]}