Discriminative Direction for Kernel Classifiers

Polina Golland
Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139
polina@ai.mit.edu

Advances in Neural Information Processing Systems, pp. 745-752.

Abstract

In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to the classical problem of training a classifier that labels new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpreting the statistical model in the original feature space that allows us to reason about the model in terms of relevant changes to the input vectors. For each point in the input space, we define the discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples, and briefly discuss its use in statistical shape analysis, the application that originally motivated this work.

1 Introduction

Once a classifier is estimated from the training data, it can be used to label new examples, and in many application domains, such as character recognition and text classification, this constitutes the final goal of the learning stage.
Statistical learning algorithms are also used in scientific studies to detect and analyze differences between two classes when the "correct answer" is unknown and the information we have on the differences is represented implicitly by the training set. Example applications include morphological analysis of anatomical organs (comparing organ shape in patients vs. normal controls), molecular design (identifying complex molecules that satisfy certain requirements), etc. In such applications, interpreting the resulting classifier in terms of the original feature vectors can provide insight into the nature of the differences detected by the learning algorithm and is therefore a crucial step in the analysis. Furthermore, we would argue that studying the spatial structure of the data captured by the classification function is important in any application, as it leads to a better understanding of the data and can potentially help improve the technique.

This paper addresses the problem of translating a classifier into a different representation that allows us to visualize and study the differences between the classes. We introduce and derive a so-called discriminative direction at every point in the original feature space with respect to a given classifier. Informally speaking, the discriminative direction tells us how to change any input example so that it looks more like an example from the other class without introducing irrelevant changes that might make it more similar to other examples from its own class. It allows us to characterize the differences captured by the classifier and to express them as changes in the original input examples.

This paper is organized as follows. We start with a brief background section on kernel-based classification, stating without proof the main facts on kernel-based SVMs necessary for the derivation of the discriminative direction. We follow the notation used in [3, 8, 9].
In Section 3, we provide a formal definition of the discriminative direction and explain how it can be estimated from the classification function. We then present some special cases in which the computation can be simplified significantly due to the particular structure of the kernel. Section 4 demonstrates the discriminative direction for different kernels, followed by an example from the problem of statistical analysis of shape differences that originally motivated this work.

2 Basic Notation

Given a training set of l pairs {(x_k, y_k)}_{k=1}^{l}, where the x_k ∈ ℝⁿ are observations and the y_k ∈ {−1, 1} are the corresponding labels, and a kernel function K: ℝⁿ × ℝⁿ → ℝ (with its implied mapping function Φ_K: ℝⁿ → F), the Support Vector Machines (SVM) algorithm [8] constructs a classifier by implicitly mapping the training data into a higher dimensional space and estimating a linear classifier in that space that maximizes the margin between the classes (Fig. 1a). The normal to the resulting separating hyperplane is a linear combination of the training data:

    w = Σ_k α_k y_k Φ_K(x_k),                                            (1)

where the coefficients α_k are computed by solving a constrained quadratic optimization problem. The resulting classifier

    f_K(x) = ⟨Φ_K(x), w⟩ + b = Σ_k α_k y_k ⟨Φ_K(x), Φ_K(x_k)⟩ + b = Σ_k α_k y_k K(x, x_k) + b    (2)

defines a nonlinear separating boundary in the original feature space.

3 Discriminative Direction

Equations (1) and (2) imply that the classification function f_K(x) is directly proportional to the signed distance from the input point to the separating boundary computed in the higher dimensional space defined by the mapping Φ_K. In other words, the function output depends only on the projection of the vector Φ_K(x) onto w and completely ignores the component of Φ_K(x) that is perpendicular to w.
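As a concrete sketch, the decision function in Eq. (2) can be evaluated directly from the support vectors, their coefficients, and their labels. The function names, the choice of a Gaussian kernel, and the toy values in the test below are our own illustration, not part of the paper:

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian RBF kernel K(u, v) = exp(-||u - v||^2 / gamma) (hypothetical choice)."""
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / gamma)

def decision_function(x, sv, alpha, y, b, kernel=rbf_kernel):
    """Evaluate f_K(x) = sum_k alpha_k y_k K(x, x_k) + b over support vectors sv, per Eq. (2)."""
    return sum(a * yk * kernel(x, xk) for a, yk, xk in zip(alpha, y, sv)) + b
```

In practice the coefficients α_k, the support vectors, and the bias b would come from the quadratic optimization mentioned above; here they are free parameters of the sketch.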
This suggests that in order to create a displacement of Φ_K(x) that corresponds to the differences between the two classes, one should change the vector's projection onto w while keeping its perpendicular component the same. In the linear case, we can easily perform this operation, since we have access to the image vectors: Φ_K(x) = x. This is similar to visualization techniques typically used in linear generative modeling, where the data variation is captured using PCA and new samples are generated by changing a single principal component at a time. However, this approach is infeasible in the non-linear case, because we do not have access to the image vectors Φ_K(x). Furthermore, a desired image vector might not even have a source in the original feature space, i.e., there might be no vector in the original space ℝⁿ that maps into it in the space F. Our solution is to search for the direction around the feature vector x in the original space that minimizes the divergence of its image Φ_K(x) from the direction of the projection vector w.¹ We call it the discriminative direction, as it represents the direction that affects the output of the classifier while introducing as little irrelevant change as possible into the input vector.

Figure 1: Kernel-based classification (a) and the discriminative direction (b).

Formally, as we move from x to x + dx in ℝⁿ, the image vector in the space F changes by dz = Φ_K(x + dx) − Φ_K(x) (Fig. 1b). This displacement can be thought of as a vector sum of its projection onto w and its deviation from w:

    p = (⟨dz, w⟩ / ⟨w, w⟩) w   and   e = dz − p = dz − (⟨dz, w⟩ / ⟨w, w⟩) w.
(3)

The discriminative direction minimizes the divergence component e, leading to the following optimization problem:

    minimize   E(dx) = ‖e‖² = ⟨dz, dz⟩ − ⟨dz, w⟩² / ⟨w, w⟩               (4)
    s.t.       ‖dx‖² = ε.                                                (5)

Since the cost function depends only on dot products of vectors in the space F, it can be computed using the kernel function K:

    ⟨w, w⟩  = Σ_{k,m} α_k α_m y_k y_m K(x_k, x_m),                       (6)
    ⟨dz, w⟩ = ∇f_K(x) dx,                                                (7)
    ⟨dz, dz⟩ = dxᵀ H_K(x) dx,                                            (8)

where ∇f_K(x) is the gradient of the classifier function f_K evaluated at x and represented by a row vector, and the matrix H_K(x) is one of the (equivalent) off-diagonal quarters of the Hessian of K, evaluated at (x, x):

    H_K(x)[i, j] = ∂²K(u, v) / ∂u_i ∂v_j |_{u=x, v=x}.                   (9)

Substituting into Equation (4), we obtain

    minimize   E(dx) = dxᵀ ( H_K(x) − ‖w‖⁻² ∇f_Kᵀ(x) ∇f_K(x) ) dx        (10)
    s.t.       ‖dx‖² = ε.                                                (11)

¹ A similar complication arises in kernel-based generative modeling, e.g., kernel PCA [7]. Constructing linear combinations of vectors in the space F leads to a global search in the original space [6, 7]. Since we are interested in the direction that best approximates w, we use infinitesimal analysis that results in a different optimization problem.

The solution to this problem is the smallest eigenvector of the matrix

    Q_K(x) = H_K(x) − ‖w‖⁻² ∇f_Kᵀ(x) ∇f_K(x).                           (12)

Note that in general, the matrix Q_K(x) and its smallest eigenvector are not the same for different points in the original space and must be estimated separately for every input vector x. Furthermore, each solution defines two opposite directions in the input space, corresponding to the positive and the negative projections onto w.
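The general recipe of Eqs. (6)-(12) can be sketched numerically: approximate ∇f_K(x) and H_K(x) by finite differences and take the eigenvector of Q_K(x) with the smallest eigenvalue. This is a minimal illustration under our own conventions (function names, finite-difference step), not the paper's implementation:

```python
import numpy as np

def discriminative_direction(x, sv, alpha, y, kernel, eps=1e-5):
    """Smallest eigenvector of Q_K(x) = H_K(x) - ||w||^-2 grad^T grad (Eq. 12),
    with the gradient and H_K approximated by finite differences."""
    n = len(x)
    f = lambda z: sum(a * yk * kernel(z, xk) for a, yk, xk in zip(alpha, y, sv))
    # gradient of f_K at x, by central differences
    grad = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    # H_K(x)[i, j] = d^2 K(u, v) / du_i dv_j at u = v = x (Eq. 9)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (kernel(x + ei, x + ej) - kernel(x + ei, x - ej)
                       - kernel(x - ei, x + ej) + kernel(x - ei, x - ej)) / (4 * eps ** 2)
    # ||w||^2 = sum_{k,m} alpha_k alpha_m y_k y_m K(x_k, x_m) (Eq. 6)
    w2 = sum(ak * am * yk * ym * kernel(xk, xm)
             for ak, yk, xk in zip(alpha, y, sv)
             for am, ym, xm in zip(alpha, y, sv))
    Q = H - np.outer(grad, grad) / w2
    vals, vecs = np.linalg.eigh(Q)   # eigenvalues in ascending order
    return vecs[:, 0]                # eigenvector of the smallest eigenvalue
```

For a linear kernel this reproduces the analytical result of Section 3.1: H_K(x) = I and the recovered direction is parallel to w.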
We want to move the input example towards the opposite class, and we therefore assign the direction of increasing function values to the examples with label −1 and the direction of decreasing function values to the examples with label 1.

Obtaining a closed-form solution of this minimization problem could be desirable, or even necessary, if the dimensionality of the input space is high and computing the smallest eigenvector is computationally expensive and numerically challenging. In the next section, we demonstrate how a particular form of the matrix H_K(x) leads to an analytical solution for a large family of kernel functions.²

3.1 Analytical Solution for Discriminative Direction

It is easy to see that if H_K(x) is a multiple of the identity matrix, H_K(x) = cI, then the smallest eigenvector of the matrix Q_K(x) is equal to the largest eigenvector of the matrix ∇f_Kᵀ(x) ∇f_K(x), namely the gradient of the classifier function, ∇f_Kᵀ(x). We show in this section that both for the linear kernel and, more surprisingly, for RBF kernels, the matrix H_K(x) is of the right form to yield an analytical solution of this kind. It is well known that to achieve the fastest change in the value of a function, one should move along its gradient. In the case of the linear and the RBF kernels, the gradient also corresponds to the direction that distinguishes between the two classes while ignoring intra-class variability.

Dot product kernels, K(u, v) = k(⟨u, v⟩). For any dot product kernel,

    ∂²K(u, v) / ∂u_i ∂v_j |_{u=x, v=x} = k′(‖x‖²) δ_ij + k″(‖x‖²) x_i x_j,    (13)

and therefore H_K(x) = cI for all x if and only if k″(‖x‖²) ≡ 0, i.e., when k is a linear function. Thus the linear kernel is the only dot product kernel for which this simplification is relevant.
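Eq. (13) can be evaluated with a small helper; the function name and the quadratic-kernel example are our own illustration:

```python
import numpy as np

def dot_kernel_hessian(x, k1, k2):
    """H_K(x) for a dot product kernel K(u, v) = k(<u, v>), per Eq. (13):
    H_K(x)[i, j] = k'(||x||^2) * delta_ij + k''(||x||^2) * x_i * x_j,
    where k1 and k2 are the first and second derivatives of k."""
    t = float(np.dot(x, x))
    return k1(t) * np.eye(len(x)) + k2(t) * np.outer(x, x)
```

For a (hypothetical) quadratic kernel k(t) = t², we have k′ = 2t and k″ = 2, so H_K(x) acquires off-diagonal entries whenever x has more than one nonzero component; for the linear kernel k(t) = t, k″ ≡ 0 and H_K(x) = I, as the text states.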
In the linear case, H_K(x) = I, and the discriminative direction is

    dx* = ∇f_Kᵀ(x) = w = Σ_k α_k y_k x_k;    E(dx*) = 0.                 (14)

This is not entirely surprising, as the classifier is a linear function in the original space and we can move precisely along w.

Polynomial kernels, K(u, v) = (1 + ⟨u, v⟩)^d, are a special case of dot product kernels. For polynomials of degree d ≥ 2,

    ∂²K(u, v) / ∂u_i ∂v_j |_{u=x, v=x} = d(1 + ‖x‖²)^{d−1} δ_ij + d(d − 1)(1 + ‖x‖²)^{d−2} x_i x_j.    (15)

H_K(x) is not necessarily a multiple of the identity for all x, and we have to solve the general eigenvector problem to identify the discriminative direction.

² While the very specialized structure of H_K(x) in the next section is sufficient for simplifying the solution significantly, it is by no means necessary, and other kernel families might exist for which estimating the discriminative direction does not require solving the full eigenvector problem.

Distance kernels, K(u, v) = k(‖u − v‖²). For a distance kernel,

    ∂²K(u, v) / ∂u_i ∂v_j |_{u=x, v=x} = −2k′(0) δ_ij,                   (16)

and therefore the discriminative direction can be determined analytically:

    dx* = ∇f_Kᵀ(x);    E(dx*) = −2k′(0) − ‖w‖⁻² ‖∇f_Kᵀ(x)‖².            (17)

The Gaussian kernels, K(u, v) = e^{−‖u−v‖²/γ}, are a special case of the distance kernel family and yield a closed-form solution for the discriminative direction:

    dx* = −(2/γ) Σ_k α_k y_k e^{−‖x−x_k‖²/γ} (x − x_k);    E(dx*) = 2/γ − ‖∇f_Kᵀ(x)‖² / ‖w‖².    (18)

Unlike the linear case, we cannot achieve zero error, and the discriminative direction is only an approximation.
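A minimal sketch of the closed-form Gaussian-kernel solution in Eq. (18); the names and toy values are our own, and the sign of the returned direction still has to be chosen according to the example's label, as discussed at the start of Section 3.1:

```python
import numpy as np

def rbf_discriminative_direction(x, sv, alpha, y, gamma):
    """Closed-form discriminative direction for a Gaussian kernel
    K(u, v) = exp(-||u - v||^2 / gamma), following Eq. (18):
    dx* = -(2 / gamma) * sum_k alpha_k y_k exp(-||x - x_k||^2 / gamma) * (x - x_k)."""
    dx = np.zeros_like(x, dtype=float)
    for a, yk, xk in zip(alpha, y, sv):
        dx += a * yk * np.exp(-np.sum((x - xk) ** 2) / gamma) * (x - xk)
    return -2.0 / gamma * dx
```

This is exactly the gradient of the classifier function of Eq. (2) for the Gaussian kernel, which is why no eigenvector computation is needed in this case.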
The exact solution is unattainable in this case, as it has no corresponding direction in the original space.

3.2 Geometric Interpretation

We start by noting that the image vectors Φ_K(x) do not populate the entire space F, but rather form a manifold of lower dimensionality whose geometry is fully defined by the kernel function K (Fig. 1). We will refer to this manifold as the target manifold in this discussion. We cannot explicitly manipulate elements of the space F, but can only explore the target manifold through search in the original space. We perform the search in the original space by considering all points on an infinitesimally small sphere centered at the original input vector x. In the range space of the mapping function Φ_K, the images of the points x + dx form an ellipsoid defined by the quadratic form dzᵀdz = dxᵀ H_K(x) dx. For H_K(x) ∝ I, the ellipsoid becomes a sphere, all dz's are of the same length, and the minimum of the error in the displacement vector dz corresponds to the maximum of the projection of dz onto w. Therefore, the discriminative direction is parallel to the gradient of the classifier function. If H_K(x) is of any other form, the length of the displacement vector dz changes as we vary dx, and the minimum of the error in the displacement is not necessarily aligned with the direction that maximizes the projection.

As a side note, our sufficient condition, H_K(x) ∝ I, implies that the target manifold is locally flat, i.e., its Riemannian curvature is zero. Curvature and other properties of target manifolds have been studied extensively for different kernel functions [1, 4]. In particular, one can show that the kernel function implies a metric on the original space.
Similarly to the natural gradient [2], which maximizes the change in the function value under an arbitrary metric, we minimize the changes that do not affect the function value under the metric implied by the kernel.

3.3 Selecting Inputs

Given any input example, we can compute the discriminative direction that represents the differences between the two classes captured by the classifier in the neighborhood of the example. But how should we choose the input examples for which to compute the discriminative direction? We argue that in order to study the differences between the classes, one has to examine the input vectors that are close to the separating boundary, namely the support vectors. Note that this approach differs significantly from generative modeling, where a "typical" representative, often constructed by computing the mean of the training data, is used for analysis and visualization. In the discriminative framework, we are more interested in the examples that lie close to the opposite class, as they define the differences between the two classes and the optimal separating boundary.

Figure 2: Discriminative direction for linear (a), quadratic (b) and Gaussian RBF (c) classifiers. The background is colored using the values of the classifier function. The black solid line is the separating boundary; the dotted lines indicate the margin corridor. Support vectors are indicated using solid markers. The length of the vectors is proportional to the magnitude of the classifier gradient.

Support vectors define a margin corridor whose shape is determined by the kernel type used for training.
We can estimate the distance from any support vector to the separating boundary by examining the gradient of the classification function for that vector. A large gradient indicates that the support vector is close to the separating boundary and can therefore provide more information on the spatial structure of the boundary. This provides a natural heuristic for assigning importance weights to different support vectors in the analysis of the discriminative direction.

4 Simple Example

We first demonstrate the proposed approach on a simple example. Fig. 2 shows three different classifiers, linear, quadratic and Gaussian RBF, trained on the same example set, which was generated using two Gaussian densities with different means and covariance matrices. We show the estimated discriminative direction for all points that are close to the separating boundary, not just the support vectors. While the magnitude of the discriminative direction vector is irrelevant in our infinitesimal analysis, we scaled the vectors in the figure according to the magnitude of the classifier gradient to illustrate the importance ranking. Note that for the RBF support vectors far away from the boundary (Fig. 2c), the magnitude of the gradient is so small (a tenth of the magnitude at the boundary) that it renders the vectors too short to be visible in the figure. We can see that in the areas where there is enough evidence to estimate the boundary reliably, all three classifiers agree on the boundary and the discriminative direction (the lower cluster of arrows).

Figure 3: Right hippocampus in a schizophrenia study. The first support vector from each group is shown, four views per shape (front, medial, back, lateral). Color coding visualizes the amount and the direction of the deformation that corresponds to the discriminative direction, changing from blue (moving inwards) to green (zero deformation) to red (moving outwards).
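The weighting heuristic above can be sketched as follows for a Gaussian-kernel classifier: rank the support vectors by the norm of the classifier gradient evaluated at each one. All names are our own, and the gradient formula is the Gaussian-kernel case of Eq. (18), up to sign:

```python
import numpy as np

def rank_support_vectors(sv, alpha, y, gamma):
    """Order support vector indices by the magnitude of the classifier gradient
    at each vector: a larger gradient suggests the vector lies closer to the
    separating boundary and is more informative about its spatial structure."""
    def grad(x):
        g = np.zeros_like(x, dtype=float)
        for a, yk, xk in zip(alpha, y, sv):
            g += a * yk * np.exp(-np.sum((x - xk) ** 2) / gamma) * (x - xk)
        return -2.0 / gamma * g
    norms = [np.linalg.norm(grad(xk)) for xk in sv]
    return sorted(range(len(sv)), key=lambda k: -norms[k])
```

The resulting ordering could be used to weight the discriminative direction vectors in visualizations such as Fig. 2, where arrow lengths are scaled by the gradient magnitude.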
However, where the boundary location is reconstructed based on the regularization defined by the kernel, the classifiers suggest different answers (the upper cluster of arrows), stressing the importance of model selection for classification. The classifiers also provide an indication of the reliability of the differences represented by each arrow, as repeatedly demonstrated in other experiments we performed.

5 Morphological Studies

Morphological studies of anatomical organs motivated the analysis presented in this paper. Here, we show the results of a hippocampus study in schizophrenia. In this study, MRI scans of the brain were acquired for schizophrenia patients and a matched group of normal control subjects. The hippocampus structure was segmented (outlined) in all of the scans. Using the shape information (positions of the outline points), we trained a Gaussian RBF classifier to discriminate between schizophrenia patients and normal controls. However, the classifier in its original form does not provide the medical researchers with information on how the hippocampal shape varies between the two groups. Our goal was to translate the information captured by the classifier into anatomically meaningful terms of organ development and deformation.

In this application, the coordinates in the input space correspond to the surface point locations of any particular example shape. The discriminative direction vector corresponds to displacements of the surface points and can be conveniently represented as a deformation of the original shape, yielding an intuitive description of shape differences for visualization and further analysis. We show the deformation that corresponds to the discriminative direction, omitting the details of shape extraction (see [5] for more information). Fig.
3 displays the first support vector from each group with the discriminative direction "painted" on it. Each row shows four snapshots of the same shape from different viewpoints.³ The color at every node of the surface encodes the corresponding component of the discriminative direction. Note that the deformations represented by the two vectors are very similar in nature, but of opposite signs, as expected from the analysis in Section 3.3. We can see that the main deformation represented by this pair of vectors is localized in the bulbous "head" of the structure. The next four support vectors in each group represent a virtually identical deformation to the one shown here. Starting with such visualization, the medical researchers can explore the organ deformation and interaction caused by the disease.

³ An alternative way to visualize the same information is to generate an animation of the example shape undergoing the detected deformation.

6 Conclusions

We presented an approach to quantifying a classifier's behavior with respect to small changes in the input vectors, trying to answer the following question: what changes would make the original input look more like an example from the other class without introducing irrelevant changes? We introduced the notion of the discriminative direction, which corresponds to the maximum change in the classifier's response while minimizing irrelevant changes in the input. For kernel-based classifiers, the discriminative direction is determined by minimizing the divergence between the infinitesimal displacement vector and the normal to the separating hyperplane in the higher dimensional kernel space.
The classifier interpretation in terms of the original features in general, and the discriminative direction in particular, is an important component of data analysis in many applications where statistical learning techniques are used to discover and study structural differences in the data.

Acknowledgments. Quadratic optimization was performed using the PR LOQO optimizer written by Alex Smola. This research was supported in part by NSF grant IIS 9610249.

References

[1] S. Amari and S. Wu. Improving Support Vector Machines by Modifying Kernel Functions. Neural Networks, 783-789, 1999.
[2] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251-276, 1998.
[3] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[4] C. J. C. Burges. Geometry and Invariance in Kernel Based Methods. In Advances in Kernel Methods: Support Vector Learning, eds. Scholkopf, Burges and Smola, MIT Press, 89-116, 1999.
[5] P. Golland et al. Small Sample Size Learning for Shape Analysis of Anatomical Structures. In Proc. MICCAI 2000, LNCS 1935:72-82, 2000.
[6] B. Scholkopf et al. Input Space vs. Feature Space in Kernel-Based Methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, 1999.
[7] B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10:1299-1319, 1998.
[8] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[9] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.