{"title": "Discriminant Adaptive Nearest Neighbor Classification and Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 409, "page_last": 415, "abstract": null, "full_text": "Discriminant Adaptive Nearest Neighbor Classification and Regression\n\n
Trevor Hastie\nDepartment of Statistics\nSequoia Hall\nStanford University\nCalifornia 94305\ntrevor@playfair.stanford.edu\n\n
Robert Tibshirani\nDepartment of Statistics\nUniversity of Toronto\ntibs@utstat.toronto.edu\n\n
Abstract\n\n
Nearest neighbor classification expects the class conditional probabilities to be locally constant, and suffers from bias in high dimensions. We propose a locally adaptive form of nearest neighbor classification to try to finesse this curse of dimensionality. We use a local linear discriminant analysis to estimate an effective metric for computing neighborhoods. We determine the local decision boundaries from centroid information, and then shrink neighborhoods in directions orthogonal to these local decision boundaries, and elongate them parallel to the boundaries. Thereafter, any neighborhood-based classifier can be employed, using the modified neighborhoods. We also propose a method for global dimension reduction that combines local dimension information. We indicate how these techniques can be extended to the regression problem.\n\n
1 Introduction\n\n
We consider a discrimination problem with J classes and N training observations. The training observations consist of predictor measurements x = (x1, x2, ..., xp) on p predictors and the known class memberships. Our goal is to predict the class membership of an observation with predictor vector x0.\n\n
Nearest neighbor classification is a simple and appealing approach to this problem. 
\nWe find the set of K nearest neighbors in the training set to x0 and then classify x0 as the most frequent class among the K neighbors.\n\n
Cover & Hart (1967) show that the one nearest neighbor rule has asymptotic error rate at most twice the Bayes rate. However, in finite samples the curse of dimensionality can severely hurt the nearest neighbor rule. The relative radius of the nearest-neighbor sphere grows like r^(1/p), where p is the dimension and r the radius for p = 1, resulting in severe bias at the target point x. Figure 1 (left panel) illustrates the situation for a simple example.\n\n
Figure 1: In the left panel, the vertical strip denotes the NN region using only the horizontal coordinate to find the nearest neighbor for the target point (solid dot). The sphere shows the NN region using both coordinates, and we see in this case it has extended into the class 1 region (and found the wrong class in this instance). The middle panel shows a spherical neighborhood containing 25 points, for a two class problem with a circular decision boundary. The right panel shows the ellipsoidal neighborhood found by the DANN procedure, also containing 25 points. The latter is elongated in a direction parallel to the true decision boundary (locally constant posterior probabilities), and flattened orthogonal to it.\n\n
Nearest neighbor techniques are based on the assumption that locally the class posterior probabilities are constant. While that is clearly true in the vertical strip using only the horizontal coordinate, using both coordinates it is no longer true. Figure 1 (middle and right panels) shows how we locally adapt the metric to overcome this problem, in a situation where the decision boundary is locally linear.\n\n
2 Discriminant adaptive nearest neighbors\n\n
Consider first a standard linear discriminant (LDA) classification procedure with K classes. 
Let B and W denote the between and within sum-of-squares matrices. In LDA the data are first sphered with respect to W, then the target point is classified to the class of the closest centroid (with a correction for the class prior membership probabilities). Since only relative distances are relevant, any distances in the complement of the subspace spanned by the sphered centroids can be ignored. This complement corresponds to the null space of B.\n\n
We propose to estimate B and W locally, and use them to form a local metric that approximately behaves like the LDA metric. One such candidate is\n\n
Sigma = W^(-1) B W^(-1)\n      = W^(-1/2) (W^(-1/2) B W^(-1/2)) W^(-1/2)\n      = W^(-1/2) B* W^(-1/2),   (1)\n\n
where B* is the between sum-of-squares in the sphered space. Consider the action of Sigma as a metric for computing distances\n\n
(x - x0)^T Sigma (x - x0):   (2)\n\n
\u2022 it first spheres the space using W;\n\n
\u2022 components of distance in the null space of B* are ignored;\n\n
\u2022 other components are weighted according to the eigenvalues of B* when there are more than 2 classes: directions in which the centroids are more spread out are weighted more than those in which they are close.\n\n
Thus this metric would result in neighborhoods similar to the narrow strip in figure 1 (left panel): infinitely long in the null space of B, and then deformed appropriately in the centroid subspace according to how the centroids are placed. It is dangerous to allow neighborhoods to extend infinitely in any direction, so we need to limit this stretching. Our proposal is\n\n
Sigma = W^(-1/2) [W^(-1/2) B W^(-1/2) + epsilon I] W^(-1/2)\n      = W^(-1/2) [B* + epsilon I] W^(-1/2),   (3)\n\n
where epsilon is some small tuning parameter to be determined. 
The metric shrinks the neighborhood in directions in which the local class centroids differ, with the intention of ending up with a neighborhood in which the class centroids coincide (and hence nearest neighbor classification is appropriate). Given epsilon, we perform K-nearest neighbor classification using the metric (2).\n\n
There are several details that we briefly describe here and in more detail in Hastie & Tibshirani (1994):\n\n
\u2022 B is defined to be the covariance of the class centroids, and W the pooled estimate of the common class covariance matrix. We estimate these locally using a spherical, compactly supported kernel (Cleveland 1979), where the bandwidth is determined by the distance to the K_M-th nearest neighbor.\n\n
\u2022 K_M above has to be supplied, as does the softening parameter epsilon. We somewhat arbitrarily use K_M = max(N/5, 50); so we use many more neighbors (50 or more) to determine the metric, and then typically K = 1, ..., 5 nearest neighbors in this metric to classify. We have found that the metric is relatively insensitive to different values of 0 < epsilon < 5, and typically use epsilon = 1.\n\n
\u2022 Typically the data do not support the local calculation of W (p(p + 1)/2 entries), and it can be argued that this is not necessary. We mostly resort to using the diagonal of W instead, or else use a global estimate.\n\n
Sections 4 and 5 illustrate the effectiveness of this approach on some simulated and real examples.\n\n
3 Dimension Reduction using Local Discriminant Information\n\n
The technique described above is entirely \"memory based\", in that we locally adapt a neighborhood about a query point at the time of classification. Here we describe a method for performing a global dimension reduction, by pooling the local dimension information over all points in the training set. In a nutshell, we consider subspaces corresponding to eigenvectors of the average local between sum-of-squares matrices. 
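\n\nThe procedure of Section 2 can be summarized in code. The sketch below is a minimal illustration under the defaults discussed above (a diagonal local estimate of W, uniform rather than kernel weights over the K_M metric neighbors, and epsilon = 1); the names dann_metric and dann_classify are hypothetical, and this is not the authors' implementation:\n\n
```python
import numpy as np

def dann_metric(X, y, x0, K_M=50, eps=1.0):
    # Estimate B and W from the K_M training points nearest to x0.
    d = np.sqrt(((X - x0) ** 2).sum(axis=1))
    nbr = np.argsort(d)[:K_M]
    Xn, yn = X[nbr], y[nbr]
    p = X.shape[1]
    mean_all = Xn.mean(axis=0)
    W = np.zeros(p)                # diagonal within-class variances
    B = np.zeros((p, p))           # between-class covariance of the centroids
    for c in np.unique(yn):
        Xc = Xn[yn == c]
        pi_c = len(Xc) / len(Xn)
        W += pi_c * Xc.var(axis=0)
        dev = Xc.mean(axis=0) - mean_all
        B += pi_c * np.outer(dev, dev)
    W = np.maximum(W, 1e-8)        # guard against zero variance
    Wih = np.diag(W ** -0.5)       # W^(-1/2) for a diagonal W
    Bstar = Wih @ B @ Wih          # between matrix in the sphered space
    # Equation (3): Sigma = W^(-1/2) [B* + eps I] W^(-1/2)
    return Wih @ (Bstar + eps * np.eye(p)) @ Wih

def dann_classify(X, y, x0, K=5, K_M=50, eps=1.0):
    # Majority vote among the K nearest points under (x - x0)^T Sigma (x - x0).
    Sigma = dann_metric(X, y, x0, K_M, eps)
    diff = X - x0
    d2 = (diff @ Sigma * diff).sum(axis=1)
    vals, counts = np.unique(y[np.argsort(d2)[:K]], return_counts=True)
    return vals[np.argmax(counts)]
```
\n\nGiven the metric, classification is ordinary K-nearest-neighbor voting under the deformed distance.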
\n\nConsider first how linear discriminant analysis (LDA) works. After sphering the data, it concentrates in the space spanned by the class centroids xbar_j, or a reduced-rank space that lies close to these centroids. If xbar denotes the overall centroid, this subspace is exactly a principal component hyperplane for the data points xbar_j - xbar, weighted by the class proportions, and is given by the eigen-decomposition of the between covariance B.\n\n
Our idea is to compute the deviations xbar_j - xbar locally in a neighborhood around each of the N training points, and then do an overall principal components analysis for the N x J deviations. This amounts to an eigen-decomposition of the average between sum-of-squares matrix sum_{i=1}^{N} B^(i) / N.\n\n
Figure 2: [Left panel] Two dimensional gaussian data with two classes and correlation 0.65. The solid lines are the LDA decision boundary and its equivalent subspace for classification, computed using both the between and (crucially) the within class covariance. The dashed lines were produced by the local procedure described in this section, without knowledge of the overall within covariance matrix. [Middle panel] Each line segment represents the local between information centered at that point. [Right panel] The eigenvalues of the average between matrix for the 4D sphere in 10D problem. Using these first four dimensions followed by our DANN nearest neighbor routine, we get better performance than 5NN in the real 4D subspace.\n\n
Figure 2 (left two panels) demonstrates by a simple illustrative example that our subspace procedure can recover the correct LDA direction without making use of the within covariance matrix. 
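\n\nThe pooling step can be sketched along the same lines. This is again an illustrative simplification (unweighted local centroids, plain Euclidean neighborhoods for the local estimates), and the helper names are hypothetical:\n\n
```python
import numpy as np

def local_between(X, y, i, K_M=25):
    # Local between sum-of-squares matrix B(i): covariance of the class
    # centroids about the local mean, inside the K_M-neighborhood of point i.
    d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    nbr = np.argsort(d)[:K_M]
    Xn, yn = X[nbr], y[nbr]
    mean_all = Xn.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(yn):
        Xc = Xn[yn == c]
        dev = Xc.mean(axis=0) - mean_all
        B += (len(Xc) / len(Xn)) * np.outer(dev, dev)
    return B

def discriminant_subspace(X, y, K_M=25):
    # Eigen-decomposition of the average local between matrix sum_i B(i) / N;
    # the leading eigenvectors span the estimated global discriminant subspace.
    N = X.shape[0]
    Bbar = sum(local_between(X, y, i, K_M) for i in range(N)) / N
    evals, evecs = np.linalg.eigh(Bbar)
    order = np.argsort(evals)[::-1]    # sort eigenvalues descending
    return evals[order], evecs[:, order]
```
\n\nIn practice one inspects the sorted eigenvalues for a sharp drop, as in the right panel of Figure 2, and retains the leading dimensions.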
Figure 2 (right panel) represents a two class problem with a 4-dimensional spherical decision boundary. The data for the two classes lie in concentric spheres in 4D, the one class lying inside the other with some overlap (a 4D version of the same 2D situation in figure 1). In addition there are an extra 6 noise dimensions, and for future reference we denote such a model as the \"4D spheres in 10D\" problem. The decision boundary is a 4-dimensional sphere, although locally linear. The eigenvalues show a distinct change after 4 (the correct dimension), and using our DANN classifier in these four dimensions actually beats ordinary 5NN in the known 4D discriminant subspace.\n\n
4 Examples\n\n
Figure 3 summarizes the results of a number of simulated examples designed to test our procedures in both favorable and unfavorable situations. In all the situations DANN outperforms 5-NN. In the cases where 5NN is provided with the known lower-dimensional discriminant subspace, our subspace technique subDANN followed by DANN comes close to the optimal performance.\n\n
[Figure 3 panels: Two Gaussians with Noise; Unstructured with Noise; 4-D Sphere in 10-D; 10-D Sphere in 10-D.]\n\n
Figure 3: Boxplots of error rates over 20 simulations. The top left panel has two gaussian distributions separated in two dimensions, with 14 noise dimensions. The notation red-LDA and red-5NN refers to these procedures in the known lower dimensional space. 
iter-DANN refers to an iterated version of DANN (which appears not to help), while sub-DANN refers to our global subspace approach, followed by DANN. The top right panel has 4 classes, each of which is a mixture of 3 gaussians in 2-D; in addition there are 8 noise variables. The lower two panels are versions of our sphere example.\n\n
5 Image Classification Example\n\n
Here we consider an image classification problem. The data consist of 4 LANDSAT images in different spectral bands of a small area of the earth's surface, and the goal is to classify into soil and vegetation types. Figure 4 shows the four spectral bands, two in the visible spectrum (red and green) and two in the infrared spectrum. These data are taken from the data archive of the STATLOG project (Michie et al. 1994) [1]. The goal is to classify each pixel into one of 7 land types: red soil, cotton, vegetation stubble, mixture, grey soil, damp grey soil, very damp grey soil. We extract for each pixel its 8-neighbors, giving us (8 + 1) x 4 = 36 features (the pixel intensities) per pixel to be classified. The data come scrambled, with 4435 training pixels and 2000 test pixels, each with their 36 features and the known classification. Included in figure 4 is the true classification, as well as that produced by linear discriminant analysis. The right panel compares DANN to all the procedures used in STATLOG, and we see the results are favorable.\n\n
[1] The authors thank C. Taylor and D. Spiegelhalter for making these images and data available.\n\n
[Figure 4 panels: Spectral band 1; Spectral band 2; Spectral band 3; Spectral band 4; Land use (Actual); Land use (Predicted); STATLOG results.]\n\n
Figure 4: The first four images are the satellite images in the four spectral bands. The fifth image represents the known classification, and the final image is the classification map produced by linear discriminant analysis. The right panel shows the misclassification results of a variety of classification procedures on the satellite image test data (taken from Michie et al. (1994)). DANN is the overall winner.\n\n
6 Local Regression\n\n
Nearest neighbor techniques are used in the regression setting as well. Local polynomial regression (Cleveland 1979) is currently very popular, where, for example, locally weighted linear surfaces are fit in modest sized neighborhoods. Analogs of K-NN classification for small K are used less frequently. In this case the response variable is quantitative rather than a class label.\n\n
Duan & Li (1991) invented a technique called sliced inverse regression, a dimension reduction tool for situations where the regression function changes in a lower-dimensional space. They show that under symmetry conditions of the marginal distribution of X, the inverse regression curve E(X|Y) is concentrated in the same lower-dimensional subspace. They estimate the curve by slicing Y into intervals, and computing conditional means of X in each interval, followed by a principal component analysis. There are obvious similarities with our DANN procedure, and the following generalizations of DANN are suggested for regression:\n\n
\u2022 Locally we use the B matrix of the sliced means to form our DANN metric, and then perform local regression in the deformed neighborhoods. 
\n\n\u2022 The local B^(i) matrices can be pooled as in subDANN to extract global subspaces for regression. This has an apparent advantage over the Duan & Li (1991) approach: we only require symmetry locally, a condition that is locally encouraged by the convolution of the data with a spherical kernel. [2]\n\n
[2] We expect to be able to substantiate the claims in this section by the time of the NIPS 95 meeting.\n\n
7 Discussion\n\n
Short & Fukanaga (1980) proposed a technique close to ours for the two class problem. In our terminology they used our metric with W = I and epsilon = 0, with B determined locally in a neighborhood of size K_M. In effect this extends the neighborhood infinitely in the null space of the local between class directions, but they restrict this neighborhood to the original K_M observations. This amounts to projecting the local data onto the line joining the two local centroids. In our experiments this approach tended to perform on average 10% worse than our metric, and we did not pursue it further. Short & Fukanaga (1981) extended this to J > 2 classes, but here their approach differs even more from ours. They computed a weighted average of the deviations of the J local centroids from the overall average, and projected the data onto it, a one-dimensional projection. Myles & Hand (1990) recognized a shortfall of the Short and Fukanaga approach, since the averaging can cause cancellation, and proposed other metrics to avoid this, different from ours.\n\n
Friedman (1994) proposes a number of techniques for flexible metric nearest neighbor classification (and sparked our interest in the problem). These techniques use a recursive partitioning style strategy to adaptively shrink and shape rectangular neighborhoods around the test point. 
\n\nAcknowledgement\n\n
The authors thank Jerry Friedman, whose research on this problem was a source of inspiration, and for many discussions. Trevor Hastie was supported by NSF DMS-9504495. Robert Tibshirani was supported by a Guggenheim fellowship, and a grant from the National Research Council of Canada.\n\n
References\n\n
Cleveland, W. (1979), 'Robust locally-weighted regression and smoothing scatterplots', Journal of the American Statistical Association 74, 829-836.\n\n
Cover, T. & Hart, P. (1967), 'Nearest neighbor pattern classification', IEEE Transactions on Information Theory IT-13, 21-27.\n\n
Duan, N. & Li, K.-C. (1991), 'Slicing regression: a link-free regression method', Annals of Statistics 19, 505-530.\n\n
Friedman, J. (1994), Flexible metric nearest neighbour classification, Technical report, Stanford University.\n\n
Hastie, T. & Tibshirani, R. (1994), Discriminant adaptive nearest neighbor classification, Technical report, Statistics Department, Stanford University.\n\n
Michie, D., Spiegelhalter, D. & Taylor, C., eds (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood series in Artificial Intelligence, Ellis Horwood.\n\n
Myles, J. & Hand, D. J. (1990), 'The multi-class metric problem in nearest neighbour discrimination rules', Pattern Recognition 23, 1291-1297.\n\n
Short, R. & Fukanaga, K. (1980), A new nearest neighbor distance measure, in 'Proc. 5th IEEE Int. Conf. on Pattern Recognition', pp. 81-86.\n\n
Short, R. & Fukanaga, K. (1981), 'The optimal distance measure for nearest neighbor classification', IEEE Transactions on Information Theory IT-27, 622-627.\n", "award": [], "sourceid": 1131, "authors": [{"given_name": "Trevor", "family_name": "Hastie", "institution": null}, {"given_name": "Robert", "family_name": "Tibshirani", "institution": null}]}