{"title": "Adaptive Nearest Neighbor Classification Using Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 672, "abstract": null, "full_text": "Adaptive N earest Neighbor Classification \n\nusing Support Vector Machines \n\nCarlotta Domeniconi, Dimitrios Gunopulos \n\nDept. of Computer Science, University of California, Riverside, CA 92521 \n\n{ carlotta, dg} @cs.ucr.edu \n\nAbstract \n\nThe nearest neighbor technique is a simple and appealing method \nto address classification problems. It relies on t he assumption of \nlocally constant class conditional probabilities. This assumption \nbecomes invalid in high dimensions with a finite number of exam(cid:173)\nples due to the curse of dimensionality. We propose a technique \nthat computes a locally flexible metric by means of Support Vector \nMachines (SVMs). The maximum margin boundary found by the \nSVM is used to determine the most discriminant direction over the \nquery's neighborhood. Such direction provides a local weighting \nscheme for input features. We present experimental evidence of \nclassification performance improvement over the SVM algorithm \nalone and over a variety of adaptive learning schemes, by using \nboth simulated and real data sets. \n\n1 \n\nIntroduction \n\nIn a classification problem, we are given J classes and l training observations. The \ntraining observations consist of n feature measurements x = (Xl,'\" ,Xn)T E ~n \nand the known class labels j = 1, ... , J. The goal is to predict the class label of a \ngiven query q. \n\nThe K nearest neighbor classification method [4, 13, 16] is a simple and appealing \napproach to this problem: it finds the K nearest neighbors of q in the training \nset, and then predicts the class label of q as the most frequent one occurring in \nthe K neighbors. 
It has been shown [5, 8] that the one nearest neighbor rule has an asymptotic error rate that is at most twice the Bayes error rate, independent of the distance metric used. The nearest neighbor rule becomes less appealing with finite training samples, however. This is due to the curse of dimensionality [2]. Severe bias can be introduced in the nearest neighbor rule in a high dimensional input feature space with finite samples. As such, the choice of a distance measure becomes crucial in determining the outcome of nearest neighbor classification. The commonly used Euclidean distance implies that the input space is isotropic, which is often invalid and generally undesirable in many practical applications. \n\nSeveral techniques [9, 10, 7] have been proposed to try to minimize bias in high dimensions by using locally adaptive mechanisms. The \"lazy learning\" approach used by these methods, while appealing in many ways, requires a considerable amount of on-line computation, which makes it difficult for such techniques to scale up to large data sets. The feature weighting scheme they introduce, in fact, is query based and is applied on-line when the test point is presented to the \"lazy learner\". In this paper we propose a locally adaptive metric classification method which, although still founded on a query based weighting mechanism, computes off-line the information relevant to define local weights. \n\nOur technique uses support vector machines (SVMs) as guidance for the process of defining a locally flexible metric. SVMs have been successfully used as a classification tool in a variety of areas [11, 3, 14], and the maximum margin boundary they provide has been proved to be optimal in a structural risk minimization sense. 
The solid theoretical foundations that have inspired SVMs convey desirable computational and learning theoretic properties to the SVM's learning algorithm, and therefore SVMs are a natural choice for seeking local discriminant directions between classes. \n\nThe solution provided by SVMs allows us to determine locations in input space where class conditional probabilities are likely not to be constant, and guides the extraction of local information in such areas. This process produces highly stretched neighborhoods along boundary directions when the query is close to the boundary. As a result, the class conditional probabilities tend to be constant in the modified neighborhoods, whereby better classification performance can be achieved. The amount of elongation-constriction decays as the query moves further from the boundary vicinity. \n\n2 Feature Weighting \n\nSVMs classify patterns according to the sign of f(x), where f(x) = Σ_{i=1}^{l} α_i y_i K(x_i, x) - b, K(x, y) = φ^T(x) · φ(y) (the kernel function), and φ: R^n → R^N is a mapping of the input vectors into a higher dimensional feature space. Here we assume x_i ∈ R^n, i = 1, ..., l, and y_i ∈ {-1, 1}. Clearly, in the general case of a non-linear feature mapping φ, the SVM classifier gives a non-linear boundary f(x) = 0 in input space. The gradient vector n_d = ∇_d f, computed at any point d of the level curve f(x) = 0, gives the direction perpendicular to the decision boundary in input space at d. As such, the vector n_d identifies the orientation in input space along which the projected training data are well separated, locally over d's neighborhood. Therefore, the orientation given by n_d, and any orientation close to it, is highly informative for the classification task at hand, and we can use such information to define a local measure of feature relevance. \n\nLet q be a query point whose class label we want to predict. 
Suppose q is close to the boundary, which is where class conditional probabilities become locally non-uniform, and therefore where estimation of local feature relevance becomes crucial. Let d be the closest point to q on the boundary f(x) = 0: d = argmin_p ||q - p||, subject to the constraint f(p) = 0. Then we know that the gradient n_d identifies a direction along which data points between classes are well separated. \n\nAs a consequence, the subspace spanned by the orientation n_d, locally at q, is likely to contain points having the same class label as q. Therefore, when applying a nearest neighbor rule at q, we desire to stay close to q along the n_d direction, because that is where we are likely to find points similar to q in terms of class posterior probabilities. Distances should be constricted (large weight) along n_d and along directions close to it. The farther we move from the n_d direction, the less discriminant the corresponding orientation becomes. This means that class labels are likely not to change along those orientations, and distances should be elongated (small weight), thus including in q's neighborhood points which are likely to be similar to q in terms of the class conditional probabilities. \n\nFormally, we can measure how close a direction t is to n_d by considering the dot product n_d · t. In particular, by denoting with u_j the unit vector along input feature j, for j = 1, ..., n, we can define a measure of relevance for feature j, locally at q (and therefore at d), as R_j(q) ≡ |u_j · n_d| = |n_{d,j}|, where n_d = (n_{d,1}, ..., n_{d,n})^T. The measure of feature relevance can then be turned into weights by the following exponential weighting scheme: w_j(q) = exp(A R_j(q)) / Σ_{i=1}^{n} exp(A R_i(q)), where A is a parameter that can be chosen to maximize (minimize) the influence of R_j on w_j. When A = 0 we have w_j = 1/n, thereby ignoring any difference between the R_j's. 
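The mapping from the boundary gradient to feature weights can be sketched as follows (an illustrative Python/NumPy fragment, not the authors' code; `n_d` is assumed to be the gradient vector already evaluated at the boundary point d):

```python
import numpy as np

def feature_weights(n_d, A):
    """Exponential weighting scheme: R_j(q) = |n_{d,j}|,
    w_j(q) = exp(A * R_j) / sum_i exp(A * R_i)."""
    R = np.abs(np.asarray(n_d, dtype=float))  # relevance of each input feature
    e = np.exp(A * R)
    return e / e.sum()                        # weights are positive and sum to one
```

With A = 0 every feature gets weight 1/n; as A grows, features aligned with the gradient dominate, but no weight ever reaches exactly zero.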
On the other hand, when A is large a change in R_j will be exponentially reflected in w_j. The exponential weighting scheme conveys stability to the method by preventing neighborhoods from extending infinitely in any direction. This is achieved by avoiding zero weights, which would instead be allowed by linear or quadratic weightings. The resulting weights are associated with the features in the weighted distance computation D(x, y) = sqrt(Σ_{i=1}^{n} w_i (x_i - y_i)^2). These weights enable the neighborhood to elongate along less important feature dimensions and, at the same time, to constrict along the most influential ones. Note that the technique is query-based because the weightings depend on the query. \n\n3 Local Flexible Metric Classification based on SVMs \n\nTo estimate the orientation of local boundaries, we move from the query point along the input axes at distances proportional to a given small step (whose initial value can be arbitrarily small, and which is doubled at each iteration till the boundary is crossed). We stop as soon as the boundary is crossed along an input axis i, i.e. when a point p_i is reached that satisfies the condition sign(f(q)) × sign(f(p_i)) = -1. Given p_i, we can get arbitrarily close to the boundary by moving at (arbitrarily) small steps along the segment that joins p_i to q. \n\nLet us denote with d_i the intercepted point on the boundary along direction i. We then approximate n_d with the gradient vector n_{d_i} = ∇_{d_i} f, computed at d_i. \n\nWe desire that the parameter A in the exponential weighting scheme increases as the distance of q from the boundary decreases. 
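The boundary-intercept procedure just described can be sketched as follows (a hypothetical Python/NumPy illustration under simplifying assumptions: `f` is any callable decision function, the search runs in the positive axis direction only, and the gradient is estimated by central differences rather than by differentiating the kernel expansion analytically):

```python
import numpy as np

def boundary_point(f, q, i, step=1e-3, tol=1e-6):
    """March from q along input axis i, doubling the step until sign(f)
    flips, then bisect the bracketing segment toward the boundary f = 0."""
    s0 = np.sign(f(q))
    p = q.astype(float)               # astype copies, so q is left untouched
    while np.sign(f(p)) == s0:        # double the step until the boundary is crossed
        p[i] += step
        step *= 2.0
    lo, hi = q.astype(float), p       # f changes sign between lo and hi
    while np.linalg.norm(hi - lo) > tol:
        mid = 0.5 * (lo + hi)         # bisection: halve the bracket around f = 0
        if np.sign(f(mid)) == s0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)            # approximate intercept d_i

def gradient_at(f, d, eps=1e-5):
    """Central-difference estimate of n_d = grad f at d (an assumption here;
    for an SVM, f's gradient can instead be computed analytically)."""
    g = np.zeros_like(d, dtype=float)
    for j in range(d.size):
        e = np.zeros_like(d, dtype=float)
        e[j] = eps
        g[j] = (f(d + e) - f(d - e)) / (2.0 * eps)
    return g
```

The absolute components of the returned gradient then play the role of the relevance values R_j in the exponential weighting scheme.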
By using the knowledge that support vectors are mostly located around the boundary surface, we can estimate how close a query point q is to the boundary by computing its distance from the closest non-bounded support vector: B_q = min_{s_i} ||q - s_i||, where the minimum is taken over the non-bounded (0 < α_i < C) support vectors s_i. Following the same principle, in [1] the spatial resolution around the boundary is increased by enlarging volume elements locally in neighborhoods of support vectors. Then, we can achieve our goal by setting A = D - B_q, where D is a constant input parameter of the algorithm. In our experiments we set D equal to the approximated average distance between the training points x_k and the boundary: D = (1/l) Σ_{x_k} min_{s_i} ||x_k - s_i||. If A becomes negative it is set to zero. \n\nBy doing so the value of A nicely adapts to each query point according to its location with respect to the boundary. The closer q is to the decision boundary, the higher the effect of the R_j values will be on the distance computation. \n\nWe observe that this principled guideline for setting the parameters of our technique takes advantage of the sparseness of the solution provided by the SVM. In fact, for each query point q, in order to compute B_q we only need to consider the support vectors, whose number is typically small compared to the total number of training examples. Furthermore, the computation of D's value is carried out once and off-line. \n\nInput: Decision boundary f(x) = 0 produced by an SVM; query point q and parameter K. \n1. Compute the approximated closest point d_i to q on the boundary; \n2. Compute the gradient vector n_{d_i} = ∇_{d_i} f; \n3. Set feature relevance values R_j(q) = |n_{d_i,j}| for j = 1, ..., n; \n4. Estimate the distance of q from the boundary as: B_q = min_{s_i} ||q - s_i||; \n5. Set A = D - B_q, where D = (1/l) Σ_{x_k} min_{s_i} ||x_k - s_i||; \n6. Set w_j(q) = exp(A R_j(q)) / Σ_{i=1}^{n} exp(A R_i(q)), for j = 1, ..., n; \n7. Use the resulting w for K-nearest neighbor classification at the query point q. \n\nFigure 1: The LFM-SVM algorithm \n\nThe resulting local flexible metric technique based on SVMs (LFM-SVM) is summarized in Figure 1. The algorithm has only one adjustable tuning parameter, namely the number K of neighbors in the final nearest neighbor rule. This parameter is common to all nearest neighbor classification techniques. \n\n4 Experimental Results \n\nIn the following we compare several classification methods using both simulated and real data. We compare the following classification approaches: (1) LFM-SVM, the algorithm described in Figure 1. SVM^light [12] with radial basis kernels is used to build the SVM classifier; (2) RBF-SVM classifier with radial basis kernels. We used SVM^light [12], and set the value of γ in K(x_i, x) = e^{-γ ||x_i - x||^2} equal to the optimal one determined via cross-validation. Also the value of C for the soft-margin classifier is optimized via cross-validation. The output of this classifier is the input of LFM-SVM; (3) ADAMENN, the adaptive metric nearest neighbor technique [7]. It uses the Chi-squared distance in order to estimate to which extent each dimension can be relied on to predict class posterior probabilities; (4) Machete [9]. It is a recursive partitioning procedure, in which the input variable used for splitting at each step is the one that maximizes the estimated local relevance. Such relevance is measured in terms of the improvement in squared prediction error each feature is capable of providing; (5) Scythe [9]. It is a generalization of the machete algorithm, in which the input variables influence each split in proportion to their estimated local relevance; (6) DANN, the discriminant adaptive nearest neighbor classification method [10]. It is an adaptive nearest neighbor classification method based on linear discriminant analysis. 
It computes a distance metric as a product of properly weighted within- and between-class sum of squares matrices; (7) Simple K-NN method using the Euclidean distance measure; (8) C4.5 decision tree method [15]. \n\nIn all the experiments, the features are first normalized over the training data to have zero mean and unit variance, and the test data features are normalized using the corresponding training mean and variance. Procedural parameters (including K) for each method were determined empirically through cross-validation. \n\n4.1 Experiments on Simulated Data \n\nFor all simulated data, 10 independent training samples of size 200 were generated. For each of these, an additional independent test sample consisting of 200 observations was generated. These test data were classified by each competing method using the respective training data set. Error rates computed over all 2,000 such classifications are reported in Table 1. \n\nThe Problems. (1) Multi-Gaussians. The data set consists of n = 2 input features, l = 200 training data, and J = 2 classes. Each class contains two spherical bivariate normal subclasses, having standard deviation 1. The mean vectors for one class are (-3/4, -3) and (3/4, 3); those for the other class are (3, -3) and (-3, 3). For each class, data are evenly drawn from each of the two normal subclasses. The first column of Table 1 shows the results for this problem. The standard deviations are: 0.17, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01 and 1.50, respectively. (2) Noisy-Gaussians. The data for this problem are generated as in the previous example, but augmented with four predictors having independent standard Gaussian distributions. They serve as noise. Results are shown in the second column of Table 1. The standard deviations are: 0.18, 0.01, 0.02, 0.01, 0.01, 0.01, 0.01 and 1.60, respectively. \n\nResults. 
Table 1 shows that all methods have similar performances for the Multi-Gaussians problem, with C4.5 being the worst performer. When the noisy predictors are added to the problem (Noisy-Gaussians), we observe different levels of deterioration in performance among the eight methods. LFM-SVM shows the most robust behavior in the presence of noise. K-NN is instead the worst performer. In Figure 2 we plot the performances of LFM-SVM and RBF-SVM as a function of an increasing number of noisy features (for the same Multi-Gaussians problem). The standard deviations for RBF-SVM (in order of increasing number of noisy features) are: 0.01, 0.01, 0.03, 0.03, 0.03 and 0.03. The standard deviations for LFM-SVM are: 0.17, 0.18, 0.2, 0.3, 0.3 and 0.3. The LFM-SVM technique shows a considerable improvement over RBF-SVM as the amount of noise increases. \n\nTable 1: Average classification error rates for simulated and real data. \n\nMethod | MultiGauss | NoisyGauss \nLFM-SVM | 3.3 | 4.0 \nRBF-SVM | 3.3 | 4.0 \nADAMENN | 3.4 | 3.0 \nMachete | 3.4 | 5.0 \nScythe | 3.4 | 4.0 \nDANN | 3.7 | 6.0 \nK-NN | 3.3 | 6.0 \nC4.5 | 5.0 | 8.0 \n[The columns for the seven real data sets (Iris, Sonar, Liver, Vote, Breast, OQ, Pima) are garbled in the extracted text and are omitted here.] \n\n4.2 Experiments on Real Data \n\nIn our experiments we used seven different real data sets. They are all taken from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. For a description of the data sets see [6]. For the Iris, Sonar, Liver and Vote data we perform leave-one-out cross-validation to measure performance, since the number of available data is limited for these data sets. 
For the Breast, OQ-letter and Pima data we randomly generated five independent training sets of size 200. For each of these, an additional independent test sample consisting of 200 observations was generated. Table 1 (columns 3-9) shows the cross-validated error rates for the eight methods under consideration on the seven real data sets. The standard deviation values are as follows. Breast data: 0.2, 0.2, 0.2, 0.2, 0.2, 0.9, 0.9 and 0.9, respectively. OQ data: 0.2, 0.2, 0.2, 0.3, 0.2, 1.1, 1.5 and 2.1, respectively. Pima data: 0.4, 0.4, 0.4, 0.4, 0.4, 2.4, 2.1 and 0.7, respectively. \n\nFigure 2: Average error rates of LFM-SVM and RBF-SVM as a function of an increasing number of noisy predictors. \n\nFigure 3: Performance distributions for real data. \n\nResults. Table 1 shows that LFM-SVM achieves the best performance on 2 of the 7 real data sets; in one case it shows the second best performance, and in the remaining four its error rate is still quite close to the best one. Following Friedman [9], we capture robustness by computing the ratio b_m of the error rate e_m of method m and the smallest error rate over all methods being compared in a particular example: b_m = e_m / min_{1≤k≤8} e_k. Figure 3 plots the distribution of b_m for each method over the seven real data sets. The dark area represents the lower and upper quartiles of the distribution, which are separated by the median. 
The outer vertical lines show the entire range of values for the distribution. The spread of the error distribution for LFM-SVM is narrow and close to one. The results clearly demonstrate that LFM-SVM (and ADAMENN) obtained the most robust performance over the data sets. \n\nThe poor performance of the machete and C4.5 methods might be due to the greedy strategy they employ. Such a recursive peeling strategy removes at each step a subset of data points permanently from further consideration. As a result, changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data, thus leading to high variance predictions. The scythe algorithm, by relaxing the winner-take-all splitting strategy of the machete algorithm, mitigates the greedy nature of the approach, and thereby achieves better performance. \n\nIn [10], the authors show that the metric employed by the DANN algorithm approximates the weighted Chi-squared distance, given that class densities are Gaussian and have the same covariance matrix. As a consequence, we may expect a degradation in performance when the data do not follow Gaussian distributions and are corrupted by noise, which is likely the case in real scenarios like the ones tested here. \n\nWe observe that the sparse solution given by SVMs provides LFM-SVM with principled guidelines to efficiently set the input parameters. This is an important advantage over ADAMENN, which has six tunable input parameters. Furthermore, LFM-SVM speeds up the classification process since it applies the nearest neighbor rule only once, whereas ADAMENN applies it at each point within a region centered at the query. 
We also observe that the construction of the SVM for LFM-SVM is carried out off-line only once, and there exist algorithmic and computational results which make SVM training practical also for large-scale problems [12]. \n\nLFM-SVM offers performance improvements over the RBF-SVM algorithm alone, for both the (noisy) simulated and real data sets. The reason for such performance gain may lie in the effect of our local weighting scheme on the separability of classes, and therefore on the margin, as shown in [6]. Assigning large weights to input features close to the gradient direction, locally in neighborhoods of support vectors, corresponds to increasing the spatial resolution along those orientations, and therefore to improving the separability of classes. As a consequence, better classification results can be achieved, as demonstrated in our experiments. \n\n5 Related Work \n\nIn [1], Amari and Wu improve support vector machine classifiers by modifying kernel functions. A primary kernel is first used to obtain support vectors. The kernel is then modified in a data dependent way by using the support vectors: the factor that drives the transformation has larger values at positions close to support vectors. The modified kernel enlarges the spatial resolution around the boundary so that the separability of classes is increased. \n\nThe resulting transformation depends on the distance of data points from the support vectors, and it is therefore a local transformation, but it is independent of the boundary's orientation in input space. Likewise, our transformation metric depends, through the factor A, on the distance of the query point from the support vectors. Moreover, since we weight features, our metric is directional, and depends on the orientation of local boundaries in input space. 
This dependence is driven by our measure of feature relevance, which has the effect of increasing the spatial resolution along discriminant directions around the boundary. \n\n6 Conclusions \n\nWe have described a locally adaptive metric classification method and demonstrated its efficacy through experimental results. The proposed technique offers performance improvements over the SVM alone, and has the potential of scaling up to large data sets. It speeds up, in fact, the classification process by computing off-line the information relevant to define local weights, and by applying the nearest neighbor rule only once. \n\nAcknowledgments \n\nThis research has been supported by the National Science Foundation under grants NSF CAREER Award 9984729 and NSF IIS-9907477, by the US Department of Defense, and a research award from AT&T. \n\nReferences \n\n[1] S. Amari and S. Wu, \"Improving support vector machine classifiers by modifying kernel functions\", Neural Networks, 12, pp. 783-789, 1999. \n[2] R.E. Bellman, Adaptive Control Processes. Princeton Univ. Press, 1961. \n[3] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler, \"Knowledge-based analysis of microarray gene expression data using support vector machines\", Tech. Report, University of California in Santa Cruz, 1999. \n[4] W.S. Cleveland and S.J. Devlin, \"Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting\", J. Amer. Statist. Assoc., 83, pp. 596-610, 1988. \n[5] T.M. Cover and P.E. Hart, \"Nearest Neighbor Pattern Classification\", IEEE Trans. on Information Theory, pp. 21-27, 1967. \n[6] C. Domeniconi and D. Gunopulos, \"Adaptive Nearest Neighbor Classification using Support Vector Machines\", Tech. Report UCR-CSE-01-04, Dept. of Computer Science, University of California, Riverside, June 2001. \n[7] C. Domeniconi, J. Peng, and D. 
Gunopulos, \"An Adaptive Metric Machine for Pattern \n\nClassification\", Advances in Neural Information Processing Systems, 2000. \n\n[8] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. John Wiley & \n\nSons, Inc., 1973. \n\n[9] J.H. Friedman \"Flexible Metric Nearest Neighbor Classification\", Tech. Report, Dept. \n\nof Statistics, Stanford University, 1994. \n\n[10] T. Hastie and R. Tibshirani, \"Discriminant Adaptive Nearest Neighbor Classifica(cid:173)\n\ntion\", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 18, No.6, pp. \n607-615, 1996. \n\n[11] T. Joachims, \"Text categorization with support vector machines\", Pmc. of European \n\nConference on Machine Learning, 1998. \n\n[12] T. Joachims, \"Making large-scale SVM learning practical\" Advances in Kernel Meth(cid:173)\nods - Support Vector Learning, B. Sch6lkopf and C. Burger and A. Smola (ed.), MIT(cid:173)\nPress, 1999. http://www-ai.cs.uni-dortmund.de/thorsten/svm_light.html \n\n[13] D.G. Lowe, \"Similarity Metric Learning for a Variable-Kernel Classifier\", Neural Com(cid:173)\n\nputation 7(1):72-85, 1995. \n\n[14] E. Osuna, R. Freund, and F. Girosi, \"Training support vector machines: An applica(cid:173)\n\ntion to face detection\", Pmc. of Computer Vision and Pattern Recognition, 1997. \n\n[15] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan-Kaufmann Publishers, \n\nInc., 1993. \n\n[16] C.J. Stone, Nonparametric regression and its applications (with discussion). Ann. \n\nStatist. 5, 595, 1977. \n\n\f", "award": [], "sourceid": 2054, "authors": [{"given_name": "Carlotta", "family_name": "Domeniconi", "institution": null}, {"given_name": "Dimitrios", "family_name": "Gunopulos", "institution": null}]}