{"title": "Kernel Dimensionality Reduction for Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 88, "abstract": "", "full_text": "Kernel Dimensionality Reduction for Supervised\n\nLearning\n\nKenji Fukumizu\n\nInstitute of Statistical\n\nMathematics\n\nTokyo 106-8569 Japan\nfukumizu@ism.ac.jp\n\nFrancis R. Bach\n\nCS Division\n\nUniversity of California\nBerkeley, CA 94720, USA\n\nfbach@cs.berkeley.edu\n\nMichael I. Jordan\n\nCS Division and Statistics\nUniversity of California\nBerkeley, CA 94720, USA\njordan@cs.berkeley.edu\n\nAbstract\n\nWe propose a novel method of dimensionality reduction for supervised\nlearning. Given a regression or classi\ufb01cation problem in which we wish\nto predict a variable Y from an explanatory vector X, we treat the prob-\nlem of dimensionality reduction as that of \ufb01nding a low-dimensional \u201cef-\nfective subspace\u201d of X which retains the statistical relationship between\nX and Y . We show that this problem can be formulated in terms of\nconditional independence. To turn this formulation into an optimization\nproblem, we characterize the notion of conditional independence using\ncovariance operators on reproducing kernel Hilbert spaces; this allows us\nto derive a contrast function for estimation of the effective subspace. Un-\nlike many conventional methods, the proposed method requires neither\nassumptions on the marginal distribution of X, nor a parametric model\nof the conditional distribution of Y .\n\n1 Introduction\n\nMany statistical learning problems involve some form of dimensionality reduction. The\ngoal may be one of feature selection, in which we aim to \ufb01nd linear or nonlinear combina-\ntions of the original set of variables, or one of variable selection, in which we wish to select\na subset of variables from the original set. 
Motivations for such dimensionality reduction include providing a simplified explanation and visualization for a human, suppressing noise so as to make a better prediction or decision, or reducing the computational burden.

We study dimensionality reduction for supervised learning, in which the data consist of (X, Y) pairs, where X is an m-dimensional explanatory variable and Y is an ℓ-dimensional response. The variable Y may be either continuous or discrete. We refer to these problems generically as "regression," which indicates our focus on the conditional probability density p_{Y|X}(y|x). Thus, our framework includes classification problems, where Y is discrete.

We wish to solve a problem of feature selection in which the features are linear combinations of the components of X. In particular, we assume that there is an r-dimensional subspace S ⊂ R^m such that the following equality holds for all x and y:

    p_{Y|X}(y|x) = p_{Y|Π_S X}(y|Π_S x),    (1)

where Π_S is the orthogonal projection of R^m onto S. The subspace S is called the effective subspace for regression. Based on observations of (X, Y) pairs, we wish to recover a matrix whose columns span S. We approach the problem within a semiparametric statistical framework: we make no assumptions regarding the conditional distribution p_{Y|Π_S X}(y|Π_S x) or the distribution p_X(x) of X. Having found an effective subspace, we may then proceed to build a parametric or nonparametric regression model on that subspace.
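Eq. (1) states that the conditional density depends on x only through the projection Π_S x. For intuition, the following minimal numeric sketch (names hypothetical, not from the paper) builds Π_S = B B^T from an orthonormal basis B of S and checks that any function of B^T X is unchanged when X is replaced by Π_S X:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 5, 2

# Orthonormal basis B of an r-dimensional subspace S of R^m (via QR).
B, _ = np.linalg.qr(rng.normal(size=(m, r)))
P = B @ B.T          # orthogonal projection Pi_S onto S

x = rng.normal(size=m)

# Pi_S is idempotent, as an orthogonal projection must be.
assert np.allclose(P @ P, P)

# A toy regression function depending on X only through B^T X ...
def f(x):
    u = B.T @ x
    return np.sin(u[0]) + u[1] ** 2

# ... is therefore a function of Pi_S x alone: f(x) == f(Pi_S x),
# since B^T P x = (B^T B) B^T x = B^T x.
assert np.isclose(f(x), f(P @ x))
```

This is only an illustration of the identity behind Eq. (1); the paper's problem is the converse, recovering B from data.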
Thus our approach is an explicit dimensionality reduction method for supervised learning that does not require any particular form of regression model; it can be used as a preprocessor for any supervised learner.

Most conventional approaches to dimensionality reduction make specific assumptions regarding the conditional distribution p_{Y|Π_S X}(y|Π_S x), the marginal distribution p_X(x), or both. For example, classical two-layer neural networks can be seen as attempting to estimate an effective subspace in their first layer, using a specific model for the regressor. Similar comments apply to projection pursuit regression [1] and ACE [2], which assume an additive model E[Y|X] = g_1(β_1^T X) + ··· + g_K(β_K^T X). While canonical correlation analysis (CCA) and partial least squares (PLS, [3]) can be used for dimensionality reduction in regression, they make a linearity assumption and place strong restrictions on the allowed dimensionality. The line of research that is closest to our work is sliced inverse regression (SIR, [4]) and related methods including principal Hessian directions (pHd, [5]). SIR is a semiparametric method that can find effective subspaces, but only under strong assumptions of ellipticity for the marginal distribution p_X(x); pHd also places strong restrictions on p_X(x). If these assumptions do not hold, there is no guarantee of finding the effective subspace.

In this paper we present a novel semiparametric method for dimensionality reduction that we refer to as Kernel Dimensionality Reduction (KDR). KDR is based on a particular class of operators on reproducing kernel Hilbert spaces (RKHS, [6]). In distinction to algorithms such as the support vector machine and kernel PCA [7, 8], KDR cannot be viewed as a "kernelization" of an underlying linear algorithm.
Rather, we relate dimensionality reduction to conditional independence of variables, and use RKHSs to provide characterizations of conditional independence and thereby design objective functions for optimization. This builds on the earlier work of [9], who used RKHSs to characterize marginal independence of variables. Our characterization of conditional independence is a significant extension, requiring rather different mathematical tools: the covariance operators on RKHSs that we present in Section 2.2.

2 Kernel method of dimensionality reduction for regression

2.1 Dimensionality reduction and conditional independence

The problem discussed in this paper is to find the effective subspace S defined by Eq. (1), given an i.i.d. sample {(X_i, Y_i)}_{i=1}^n, sampled from the conditional probability of Eq. (1) and a marginal probability p_X for X. The crux of the problem is that we have no a priori knowledge of the regressor, and place no assumptions on the conditional probability p_{Y|X} or the marginal probability p_X.

We do not address the problem of choosing the dimensionality r in this paper; in practical applications of KDR, any of a variety of model selection methods, such as cross-validation, can reasonably be considered. Rather, our focus is on the problem of finding the effective subspace for a given choice of dimensionality.

The notion of effective subspace can be formulated in terms of conditional independence. Let Q = (B, C) be an m × m orthogonal matrix such that the column vectors of B span the subspace S (thus B is m × r and C is m × (m − r)), and define U = B^T X and V = C^T X. Because Q is an orthogonal matrix, we have p_X(x) = p_{U,V}(u, v) and p_{X,Y}(x, y) = p_{U,V,Y}(u, v, y). Thus, Eq.
(1) is equivalent to

    p_{Y|U,V}(y|u, v) = p_{Y|U}(y|u).    (2)

[Figure 1: Graphical representation of dimensionality reduction for regression: Y ⊥⊥ V | U, where X = (U, V).]

This shows that the effective subspace S is the one which makes Y and V conditionally independent given U (see Figure 1).

Mutual information provides another viewpoint on the equivalence between conditional independence and the effective subspace. It is well known that

    I(Y, X) = I(Y, U) + E_U[ I(Y|U, V|U) ],    (3)

where I(Z, W) is the mutual information between Z and W. Because Eq. (1) implies I(Y, X) = I(Y, U), the effective subspace S is characterized as the subspace which retains the entire mutual information between X and Y, or equivalently, such that I(Y|U, V|U) = 0. This is again the conditional independence of Y and V given U.

2.2 Covariance operators on kernel Hilbert spaces and conditional independence

We use cross-covariance operators [10] on RKHSs to characterize the conditional independence of random variables. Let (H, k) be a (real) reproducing kernel Hilbert space of functions on a set Ω with a positive definite kernel k : Ω × Ω → R and an inner product ⟨·,·⟩_H. The most important aspect of an RKHS is the reproducing property:

    ⟨f, k(·, x)⟩_H = f(x)  for all x ∈ Ω and f ∈ H.    (4)

In this paper we focus on the Gaussian kernel k(x_1, x_2) = exp(−‖x_1 − x_2‖²/2σ²).

Let (H_1, k_1) and (H_2, k_2) be RKHSs over measurable spaces (Ω_1, B_1) and (Ω_2, B_2), respectively, with k_1 and k_2 measurable.
For a random vector (X, Y) on Ω_1 × Ω_2, the cross-covariance operator Σ_YX from H_1 to H_2 is defined by the relation

    ⟨g, Σ_YX f⟩_{H_2} = E_{XY}[f(X)g(Y)] − E_X[f(X)] E_Y[g(Y)]  (= Cov[f(X), g(Y)])    (5)

for all f ∈ H_1 and g ∈ H_2. Eq. (5) implies that the covariance of f(X) and g(Y) is given by the action of the linear operator Σ_YX and the inner product. Under the assumption that E_X[k_1(X, X)] and E_Y[k_2(Y, Y)] are finite, by using Riesz's representation theorem, it is not difficult to see that a bounded operator Σ_YX is uniquely defined by Eq. (5). We have Σ*_YX = Σ_XY, where A* denotes the adjoint of A. From Eq. (5), we see that Σ_YX captures all of the nonlinear correlations defined by the functions in H_1 and H_2.

Cross-covariance operators provide a useful framework for discussing conditional probability and conditional independence, as shown by the following theorem and its corollary (full proofs of all theorems can be found in [11]):

Theorem 1. Let (H_1, k_1) and (H_2, k_2) be RKHSs on measurable spaces Ω_1 and Ω_2, respectively, with k_1 and k_2 measurable, and let (X, Y) be a random vector on Ω_1 × Ω_2. Assume that E_X[k_1(X, X)] and E_Y[k_2(Y, Y)] are finite, and that for all g ∈ H_2 the conditional expectation E_{Y|X}[g(Y) | X = ·] is an element of H_1. Then, for all g ∈ H_2 we have

    Σ_XX E_{Y|X}[g(Y) | X = ·] = Σ_XY g.    (6)

Corollary 2. Let Σ̃⁻¹_XX be the right inverse of Σ_XX on (Ker Σ_XX)^⊥. Under the same assumptions as Theorem 1, we have, for all f ∈ (Ker Σ_XX)^⊥ and g ∈ H_2,

    ⟨f, Σ̃⁻¹_XX Σ_XY g⟩_{H_1} = ⟨f, E_{Y|X}[g(Y) | X = ·]⟩_{H_1}.    (7)

Sketch of the proof. Σ_XY can be decomposed as Σ_XY = Σ_XX^{1/2} V Σ_YY^{1/2} for a bounded operator V (Theorem 1, [10]).
Thus, we see that Σ̃⁻¹_XX Σ_XY is well-defined, because Range Σ_XY ⊂ Range Σ_XX = (Ker Σ_XX)^⊥. Then, Eq. (7) is a direct consequence of Theorem 1.

Given that Σ_XX is invertible, Eq. (7) implies

    E_{Y|X}[g(Y) | X = ·] = Σ⁻¹_XX Σ_XY g  for all g ∈ H_2.    (8)

This can be understood by analogy to the conditional expectation of Gaussian random variables. If X and Y are Gaussian random variables, it is well known that the conditional expectation is given by E_{Y|X}[a^T Y | X = x] = x^T Σ⁻¹_XX Σ_XY a for an arbitrary vector a, where Σ_XX and Σ_XY are the variance-covariance matrices in the ordinary sense.

Using cross-covariance operators, we derive an objective function for characterizing conditional independence. Let (H_1, k_1) and (H_2, k_2) be RKHSs on measurable spaces Ω_1 and Ω_2, respectively, with k_1 and k_2 measurable, and suppose we have random variables U on Ω_1 and Y on Ω_2. We define the conditional covariance operator Σ_{YY|U} on H_2 by

    Σ_{YY|U} := Σ_YY − Σ_YU Σ̃⁻¹_UU Σ_UY.    (9)

Corollary 2 easily yields the following result on the conditional covariance of variables:

Theorem 3. Assume that E_U[k_1(U, U)] and E_Y[k_2(Y, Y)] are finite, and that E_{Y|U}[f(Y) | U = ·] is an element of H_1 for all f ∈ H_2. Then, for all f, g ∈ H_2, we have

    ⟨g, Σ_{YY|U} f⟩_{H_2} = E_Y[f(Y)g(Y)] − E_U[ E_{Y|U}[f(Y)|U] E_{Y|U}[g(Y)|U] ]
                          = E_U[ Cov_{Y|U}[ f(Y), g(Y) | U ] ].    (10)

As in the case of Eq. (8), Eqs.
(9) and (10) can be viewed as the analogs of the well-known equality for Gaussian variables: Cov[a^T Y, b^T Y | U] = a^T (Σ_YY − Σ_YU Σ⁻¹_UU Σ_UY) b.

From Theorem 3, it is natural to use minimization of Σ_{YY|U} as the basis of a method for finding the most informative U, which gives the least Var_{Y|U}[f(Y)|U]. The following definition is needed to justify this intuition. Let (Ω, B) be a measurable space, let (H, k) be an RKHS over Ω with k measurable and bounded, and let M be the set of all probability measures on (Ω, B). The RKHS H is called probability-determining if the map

    M ∋ P ↦ (f ↦ E_{X∼P}[f(X)]) ∈ H*    (11)

is one-to-one, where H* is the dual space of H. The following theorem can be proved using an argument similar to that used in the proof of Theorem 2 in [9].

Theorem 4. For an arbitrary σ > 0, the RKHS with Gaussian kernel k(x, y) = exp(−‖x − y‖²/2σ²) on R^m is probability-determining.

Recall that for two RKHSs H_1 and H_2 on Ω_1 and Ω_2, respectively, the direct product H_1 ⊗ H_2 is the RKHS on Ω_1 × Ω_2 with the kernel k_1 k_2 [6]. The relation between conditional independence and the conditional covariance operator is given by the following theorem:

Theorem 5. Let (H_11, k_11), (H_12, k_12), and (H_2, k_2) be RKHSs on measurable spaces Ω_11, Ω_12, and Ω_2, respectively, with continuous and bounded kernels. Let (X, Y) = (U, V, Y) be a random vector on Ω_11 × Ω_12 × Ω_2, where X = (U, V), and let H_1 = H_11 ⊗ H_12 be the direct product. It is assumed that E_{Y|U}[g(Y)|U = ·] ∈ H_11 and E_{Y|X}[g(Y)|X = ·] ∈ H_1 for all g ∈ H_2.
Then, we have

    Σ_{YY|U} ≥ Σ_{YY|X},    (12)

where the inequality refers to the order of self-adjoint operators. If further H_2 is probability-determining, in particular for Gaussian kernels, we have the equivalence:

    Σ_{YY|X} = Σ_{YY|U}  ⟺  Y ⊥⊥ V | U.    (13)

Sketch of the proof. Taking the expectation of the well-known equality Var_{Y|U}[g(Y)|U] = E_{V|U}[ Var_{Y|U,V}[g(Y)|U,V] ] + Var_{V|U}[ E_{Y|U,V}[g(Y)|U,V] ] with respect to U, we obtain

    E_U[ Var_{Y|U}[g(Y)|U] ] − E_X[ Var_{Y|X}[g(Y)|X] ] = E_U[ Var_{V|U}[ E_{Y|X}[g(Y)|X] ] ] ≥ 0,

which implies Eq. (12). The equality holds iff E_{Y|X}[g(Y)|X] = E_{Y|U}[g(Y)|U] for a.e. X. Since H_2 is probability-determining, this means P_{Y|X} = P_{Y|U}, that is, Y ⊥⊥ V | U.

From Theorem 5, for probability-determining kernel spaces, the effective subspace S can be characterized in terms of the solution to the following minimization problem:

    min_S Σ_{YY|U},  subject to U = Π_S X.    (14)

2.3 Kernel generalized variance for dimensionality reduction

To derive a sample-based objective function from Eq. (14) for a finite sample, we have to estimate the conditional covariance operator with given data, and choose a specific way to evaluate the size of self-adjoint operators. Hereafter, we consider only Gaussian kernels, which are appropriate for both continuous and discrete variables.

For the estimation of the operator, we follow the procedure in [9] (see also [11] for further details), and use the centralized Gram matrices [9, 8], which are defined as:

    K̂_Y = (I_n − (1/n) 1_n 1_n^T) G_Y (I_n − (1/n) 1_n 1_n^T),
    K̂_U = (I_n − (1/n) 1_n 1_n^T) G_U (I_n − (1/n) 1_n 1_n^T),    (15)

where 1_n = (1, . . .
, 1)^T, (G_Y)_{ij} = k_2(Y_i, Y_j) is the Gram matrix of the samples of Y, and (G_U)_{ij} = k_1(U_i, U_j) is given by the projection U_i = B^T X_i. With a regularization constant ε > 0, the empirical conditional covariance matrix Σ̂_{YY|U} is then defined by

    Σ̂_{YY|U} := Σ̂_YY − Σ̂_YU Σ̂⁻¹_UU Σ̂_UY = (K̂_Y + εI_n)² − K̂_Y K̂_U (K̂_U + εI_n)⁻² K̂_U K̂_Y.    (16)

The size of Σ̂_{YY|U} in the ordered set of positive definite matrices can be evaluated by its determinant. Although there are other choices for measuring the size of Σ̂_{YY|U}, such as the trace and the largest eigenvalue, we focus on the determinant in this paper. Using the Schur decomposition, det(A − BC⁻¹B^T) = det[ A B ; B^T C ] / det C, we have

    det Σ̂_{YY|U} = det Σ̂_{[YU][YU]} / det Σ̂_UU,    (17)

where Σ̂_{[YU][YU]} is defined by

    Σ̂_{[YU][YU]} = [ Σ̂_YY  Σ̂_YU ; Σ̂_UY  Σ̂_UU ] = [ (K̂_Y + εI_n)²  K̂_Y K̂_U ; K̂_U K̂_Y  (K̂_U + εI_n)² ].

We symmetrize the objective function by dividing by the constant det Σ̂_YY, which yields

    min_{B ∈ R^{m×r}}  det Σ̂_{[YU][YU]} / ( det Σ̂_YY det Σ̂_UU ),  where U = B^T X.    (18)

We refer to this minimization problem with respect to the choice of subspace S or matrix B as Kernel Dimensionality Reduction (KDR).

Eq. (18) has been termed the "kernel generalized variance" (KGV) by Bach and Jordan [9]. They used it as a contrast function for independent component analysis, in which the goal is to minimize a mutual information.
They showed that the KGV is in fact an approximation of the mutual information among the recovered sources around the factorized distributions. In the current setting, on the other hand, our goal is to maximize the mutual information I(Y, U), and with an entirely different argument we have shown that the KGV is an appropriate objective function for the dimensionality reduction problem, and that minimizing Eq. (18) can be viewed as maximizing the mutual information I(Y, U).

Given that the numerical task that must be solved in KDR is the same as the one solved in kernel ICA, we can import all of the computational techniques developed in [9] for minimizing the KGV. In particular, the optimization routine that we use is gradient descent with a line search, where we exploit incomplete Cholesky decomposition to reduce the n × n matrices to low-rank approximations. To cope with local optima, we make use of an annealing technique, in which the scale parameter σ of the Gaussian kernel is decreased gradually during the iterations of optimization. For a larger σ, the contrast function has fewer local optima, and the search becomes more accurate as σ is decreased.

3 Experimental results

We illustrate the effectiveness of the proposed KDR method through experiments, comparing it with several conventional methods: SIR, pHd, CCA, and PLS.

The first data set is a synthetic one with 300 samples of 17-dimensional X and one-dimensional Y, generated by Y ~ 0.9 X_1 + 0.2/(1 + X_17) + Z, where Z ~ N(0, 0.01²) and X follows a uniform distribution on [0, 1]^17. The effective subspace is given by b_1 = (1, 0, . . . , 0) and b_2 = (0, . . . , 0, 1).

Table 1: Correlation coefficients. SIR(m) indicates the SIR method with m slices.

            SIR(10)  SIR(15)  SIR(20)  SIR(25)  pHd    KDR
    R(b_1)  0.987    0.993    0.988    0.990    0.110  0.999
    R(b_2)  0.421    0.705    0.480    0.526    0.859  0.984
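The accuracy measure R(b), the multiple correlation coefficient of [4] used in Table 1, has a closed form: writing β = B̂c for the estimated basis B̂ turns the maximization over the estimated subspace into a generalized Rayleigh quotient. A minimal sketch (function name hypothetical):

```python
import numpy as np

def multiple_correlation(B_est, b, Sigma):
    # R(b) = max over beta in span(B_est) of
    #   beta^T Sigma b / sqrt(beta^T Sigma beta * b^T Sigma b).
    # With beta = B_est @ c the maximum is attained in closed form:
    #   R(b)^2 = s^T M^{-1} s / (b^T Sigma b),
    # where s = B_est^T Sigma b and M = B_est^T Sigma B_est.
    s = B_est.T @ Sigma @ b
    M = B_est.T @ Sigma @ B_est
    r2 = float(s.T @ np.linalg.solve(M, s)) / float(b.T @ Sigma @ b)
    return np.sqrt(r2)
```

A true direction b lying inside the estimated span gives R(b) = 1; with Σ_XX = I, a direction orthogonal to the span gives R(b) = 0.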
We compare the KDR method with SIR and pHd only; CCA and PLS cannot find a 2-dimensional subspace, because Y is one-dimensional. To evaluate estimation accuracy, we use the multiple correlation coefficient

    R(b) = max_{β∈S} β^T Σ_XX b / (β^T Σ_XX β · b^T Σ_XX b)^{1/2},

which is used in [4]. As shown in Table 1, KDR outperforms the others in finding the weak contribution of b_2.

Next, we apply the KDR method to classification problems, for which many conventional methods of dimensionality reduction are not suitable. In particular, SIR requires the dimensionality of the effective subspace to be less than the number of classes, because SIR uses the average of X in slices along the variable Y. CCA and PLS have a similar limitation on the dimensionality of the effective subspace. Thus we compare the result of KDR only with pHd, which is applicable to general binary classification problems.

We show the visualization capability of the dimensionality reduction methods for the Wine dataset from the UCI repository, to see how the projection onto a low-dimensional space realizes an effective description of data. The Wine data consist of 178 samples with 13 variables and a label with three classes. Figure 2 shows the projection onto the 2-dimensional subspace estimated by each method. KDR separates the data into three classes most completely. We can see that the data are nonlinearly separable in the two-dimensional space.

In the third experiment, we investigate how much information on the classification is preserved in the estimated subspace. After reducing the dimensionality, we use the support vector machine (SVM) method to build a classifier in the reduced space, and compare its accuracy with an SVM trained using the full-dimensional vector X. We use three data sets from the UCI repository.
[Figure 2: Projections of Wine data onto 2-dimensional subspaces estimated by KDR, CCA, PLS, SIR, and pHd; the three marker types represent the three classes.]

Figure 3 shows the classification rates on the test set for subspaces of various dimensionality. We can see that KDR yields good classification even in low-dimensional subspaces, while pHd is much worse at small dimensionality. It is noteworthy that for the Ionosphere data set the classifier in dimensions 5, 10, and 20 outperforms the classifier in the full-dimensional space; this is caused by the suppression of noise irrelevant to explaining Y. These results show that KDR successfully finds an effective subspace which preserves the class information even when the dimensionality is reduced significantly.

4 Extension to variable selection

The KDR method can be extended to variable selection, in which a subset of the given explanatory variables {X_1, . . . , X_m} is selected. Extension of the KGV objective function to variable selection is straightforward: we have only to compare the KGV values for all the subspaces spanned by combinations of a fixed number of selected variables.
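The exhaustive version of this comparison can be sketched directly: score every size-d subset of the variables with a contrast function and keep the argmin. In the sketch below (names hypothetical), a toy linear-regression residual stands in for the full KGV computation, just to exercise the combinatorial shell into which greedy or genetic search would be substituted:

```python
from itertools import combinations
import numpy as np

def select_variables(X, Y, d, score):
    # Exhaustive variant of the variable-selection extension: evaluate
    # the contrast on every size-d subset of the m variables and keep
    # the subset with the smallest value. For large m this is
    # intractable, hence the greedy / genetic search used in the text.
    m = X.shape[1]
    return min(combinations(range(m), d),
               key=lambda idx: score(X[:, idx], Y))

# Toy stand-in score (NOT the KGV): residual sum of squares of a
# linear fit on the candidate subset.
def linear_residual(Xs, Y):
    beta, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
    return float(np.sum((Y - Xs @ beta) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Y = X[:, [1]] + 0.5 * X[:, [4]] + 0.01 * rng.normal(size=(100, 1))
print(select_variables(X, Y, 2, linear_residual))  # (1, 4)
```

Replacing linear_residual with the KGV contrast of Eq. (18), restricted to the candidate columns, gives the method actually used in the experiments below.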
We of course do not avoid the combinatorial problem of variable selection; the total number of combinations may be intractably large for a large number of explanatory variables m, and greedy or random search procedures are needed.

We first apply this kernel method to the Boston Housing data (506 samples with 13-dimensional X), which has been used as a typical example of variable selection. We select the four variables that attain the smallest KGV value among all combinations. The selected variables are exactly the same as the ones selected by ACE [2]. Next, we apply the method to the leukemia microarray data of 7129 dimensions [12]. We select 50 effective genes to classify two types of leukemia using 38 training samples. For optimization of the KGV value, we use a greedy algorithm, in which new variables are selected one by one, and subsequently a variant of a genetic algorithm is used. Half of the 50 genes accord with the 50 genes selected by [12]. With the genes selected by our method, the same classifier as that used in [12] correctly classifies 32 of the 34 test samples, whereas, with their 50 genes, Golub et al. [12] report a result of classifying 29 of the 34 samples correctly.

5 Conclusion

We have presented KDR, a novel method of dimensionality reduction for supervised learning. One of the striking properties of this method is its generality.
[Figure 3: Classification accuracy of the SVM for test data after dimensionality reduction: (a) Heart-disease, (b) Ionosphere, (c) Wisconsin Breast Cancer; the curves compare the kernel method (KDR), pHd, and an SVM using all variables.]

We do not place any strong assumptions on either the conditional or the marginal distribution, in distinction to essentially all existing methods for dimensionality reduction in regression, including SIR, pHd, CCA, and PPR. We have demonstrated promising empirical performance of KDR, showing its practical utility in data visualization and in feature selection for prediction. We have also discussed an extension of the KDR method to variable selection.

The theoretical basis of KDR lies in the nonparametric characterization of conditional independence that we have presented in this paper. Extending earlier work on the kernel-based characterization of marginal independence [9], we have shown that conditional independence can be characterized in terms of covariance operators on a kernel Hilbert space. While our focus has been on the problem of dimensionality reduction, it is also worth noting that there are many other possible applications of this result. In particular, conditional independence plays an important role in the structural definition of graphical models, and our result may have implications for model selection and inference in graphical models.

References

[1] Friedman, J.H. and Stuetzle, W.
Projection pursuit regression. J. Amer. Stat. Assoc., 76:817–823, 1981.

[2] Breiman, L. and Friedman, J.H. Estimating optimal transformations for multiple regression and correlation. J. Amer. Stat. Assoc., 80:580–598, 1985.

[3] Wold, H. Partial least squares. In S. Kotz and N.L. Johnson (Eds.), Encyclopedia of Statistical Sciences, Vol. 6, pp. 581–591. Wiley, New York, 1985.

[4] Li, K.-C. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Stat. Assoc., 86:316–342, 1991.

[5] Li, K.-C. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. J. Amer. Stat. Assoc., 87:1025–1039, 1992.

[6] Aronszajn, N. Theory of reproducing kernels. Trans. Amer. Math. Soc., 69(3):337–404, 1950.

[7] Schölkopf, B., Burges, C.J.C., and Smola, A. (Eds.) Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

[8] Schölkopf, B., Smola, A., and Müller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[9] Bach, F.R. and Jordan, M.I. Kernel independent component analysis. JMLR, 3:1–48, 2002.

[10] Baker, C.R. Joint measures and cross-covariance operators. Trans. Amer. Math. Soc., 186:273–289, 1973.

[11] Fukumizu, K., Bach, F.R., and Jordan, M.I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 5:73–99, 2004.

[12] Golub, T.R. et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
", "award": [], "sourceid": 2513, "authors": [{"given_name": "Kenji", "family_name": "Fukumizu", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}