Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has previously been used as a pre-processing step before applying an SVM, but we point out that this method is somewhat redundant from a regularization point of view, and we propose a new algorithm called the Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performance as an SVM.
1 Introduction
Let (x_i, y_i), i = 1, ..., n, be n given realizations of a random variable (X, Y) living in X × {-1, 1}. Let P denote the marginal distribution of X. The x_i's are often referred to as inputs (or patterns), and the y_i's as labels. Pattern recognition is concerned with finding a classifier, i.e. a function that assigns a label to any new input x ∈ X and that makes as few prediction errors as possible. It is often the case with real-world data that the dimension of the patterns is very large, and some of the components carry more noise than information. In such cases, reducing the dimension of the data before running a classification algorithm on it sounds reasonable. One of the most famous methods for this kind of pre-processing is PCA, and its kernelized version (KPCA), introduced in the pioneering work of Schölkopf, Smola and Müller [8].
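To fix ideas, here is a minimal sketch of the pre-processing setup just described: KPCA as a dimensionality reduction step, followed by a linear SVM trained on the projected data. This is an illustration only; it is not part of the paper, and scikit-learn (with its KernelPCA and SVC classes), the synthetic dataset and all parameter values are assumptions made purely for the example.

```python
# Illustrative sketch (not from the paper): KPCA pre-processing, then a linear SVM
# on the projected data, as in the experiments of [8] on the USPS digits.
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a high-dimensional dataset.
X, y = make_classification(n_samples=300, n_features=50, random_state=0)

clf = make_pipeline(
    KernelPCA(n_components=20, kernel="rbf", gamma=0.1),  # keep 20 kernel principal components
    SVC(kernel="linear", C=1.0),                          # regularized linear classifier on the projections
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the KPCA + linear SVM pipeline
```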
This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.
Now, whether the quality of a given classification algorithm can be significantly improved by using such pre-processed data still remains an open question. Some experiments have already been carried out to investigate the use of KPCA for classification purposes, and numerical results are reported in [8]. The authors considered the USPS handwritten digit database and reported the test error rates achieved by the linear SVM trained on the data pre-processed with KPCA: the conclusion was that the larger the number of principal components, the better the performance. In other words, the KPCA step was useless or even counterproductive. This conclusion might be explained by a redundancy arising in their experiments: there is actually a double regularization, the first corresponding to the dimensionality reduction achieved by KPCA, and the other to the regularization achieved by the SVM. With that in mind it does not seem so surprising that KPCA does not help in that case: whatever the dimensionality reduction, the SVM anyway achieves a (possibly strong) regularization.
Still, de-noising the data using KPCA seems relevant. The aforementioned experiments suggest that KPCA should be used together with a classification algorithm that is not regularized (e.g. a simple empirical risk minimizer): in that case, it should be expected that KPCA is by itself sufficient to achieve regularization, the choice of the dimension being guided by adequate model selection. In this paper, we propose a new algorithm, called the Kernel Projection Machine (KPM), that implements this idea: an optimal dimension is sought so as to minimize the test error of the resulting classifier. A nice property is that the training labels are used to select the optimal dimension: optimal means that the resulting D-dimensional representation of the data contains the right amount of information needed to classify the inputs. To sum up, the KPM can be seen as a dimensionality-reduction-based classification method that takes into account the labels for the dimensionality reduction step.
This paper is organized as follows: Section 2 gives some statistical background on regularized methods vs. projection methods. Its goal is to explain the motivation and the "Gaussian intuition" that lies behind the KPM algorithm from a statistical point of view. Section 3 explicitly gives the details of the algorithm; experiments and results, which should be considered preliminary, are reported in Section 4.
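To make the idea concrete before the formal description, here is a schematic sketch, not the authors' algorithm (which is given in Section 3): the data are projected with KPCA, an essentially unregularized linear classifier is fit on the projections, and the dimension D is selected from the training labels by cross-validation. scikit-learn is an assumed dependency, and the large-C linear SVM is only a stand-in for a plain empirical risk minimizer.

```python
# Schematic KPM-style pipeline (illustration only): KPCA projection + (nearly)
# unregularized classifier, with the dimension D chosen by cross-validation.
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

pipe = Pipeline([
    ("kpca", KernelPCA(kernel="rbf", gamma=0.1)),
    # A very large C makes the hinge-loss fit close to plain empirical risk minimization.
    ("erm", LinearSVC(C=1e6, max_iter=20000)),
])
grid = GridSearchCV(pipe, {"kpca__n_components": [2, 5, 10, 20, 50]}, cv=5)
grid.fit(X, y)
print(grid.best_params_["kpca__n_components"])  # dimension D selected using the labels
```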
2 Motivations for the Kernel Projection Machine
2.1 The Gaussian Intuition: a Statistician's Perspective
Regularization methods have been used for quite a long time in nonparametric statistics, since the pioneering works of Grace Wahba in the eighties (see [10] for a review). Even if the classification context has its own specificities and offers new challenges (especially when the explanatory variables live in a high-dimensional Euclidean space), it is worth recalling the essence of regularization in the simplest nonparametric statistical framework: the Gaussian white noise model. So let us assume that one observes a noisy signal
$$
dY(x) = s(x)\,dx + \frac{1}{\sqrt{n}}\,dW(x), \qquad Y(0) = 0,
$$
on [0, 1], where dW(x) denotes a standard white noise. To the reader not familiar with this model, it should be considered as nothing more than an idealization of the well-known fixed-design regression problem Y_i = s(i/n) + ε_i for i = 1, ..., n, where ε_i ~ N(0, 1), and where the goal is to recover the regression function s. (The white noise model is actually simpler to study from a mathematical point of view.)
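As a purely illustrative aside (not from the paper), the sketch below simulates the fixed-design analogue Y_i = s(i/n) + ε_i for an arbitrary choice of s, and fits a projection estimator on the span of the first D cosine basis functions, i.e. the kind of finite-dimensional projection method contrasted with regularization in what follows. The basis, D and the sample size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, D = 200, 8
s = lambda x: np.sin(2 * np.pi * x)            # arbitrary "true" regression function
x = np.arange(1, n + 1) / n                    # fixed design points i/n
y = s(x) + rng.standard_normal(n)              # Y_i = s(i/n) + eps_i, eps_i ~ N(0, 1)

# Projection estimator: least squares fit on the span of the first D cosine basis functions.
Phi = np.column_stack([np.ones(n)] + [np.sqrt(2) * np.cos(np.pi * k * x) for k in range(1, D)])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
s_hat = Phi @ coef                             # estimate of s at the design points
```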
The least squares criterion is defined as
$$
\gamma_n(f) = \|f\|^2 - 2 \int_0^1 f(x)\,dY(x)
$$
for every f ∈ L²([0, 1]). Given a Mercer kernel k on [0, 1] × [0, 1], the regularized least squares procedure proposes