{"title": "Estimating the Bayes Risk from Sample Data", "book": "Advances in Neural Information Processing Systems", "page_first": 232, "page_last": 238, "abstract": null, "full_text": "Estimating the Bayes Risk from Sample Data \n\nRobert R. Snapp\u00b7 and Tong Xu \n\nComputer Science and Electrical Engineering Department \n\nUniversity of Vermont \nBurlington, VT 05405 \n\nAbstract \n\nA new nearest-neighbor method is described for estimating the Bayes risk \nof a multiclass pattern claSSification problem from sample data (e.g., a \nclassified training set). Although it is assumed that the classification prob(cid:173)\nlem can be accurately described by sufficiently smooth class-conditional \ndistributions, neither these distributions, nor the corresponding prior prob(cid:173)\nabilities of the classes are required. Thus this method can be applied to \npractical problems where the underlying probabilities are not known. This \nmethod is illustrated using two different pattern recognition problems. \n\n1 INTRODUCTION \n\nAn important application of artificial neural networks is to obtain accurate solutions to \npattern classification problems. In this setting, each pattern, represented as an n-dimensional \nfeature vector, is associated with a discrete pattern class, or state of nature (Duda and Hart, \n1973). Using available information, (e.g., a statistically representative set of labeled feature \nvectors {(Xi, fin, where Xi E Rn denotes a feature vector and fi E l:::: {Wl,W2, ... ,we}, \nits correct pattern class), one desires a function (e.g., a neural network claSSifier) that assigns \nnew feature vectors to pattern classes with the smallest possible misclassification cost. \nIf the classification problem is stationary, such that the patterns from each class are generated \naccording to known probability distributions, then it is possible to construct an optimal \nclasSifier that assigns each pattern to a class with minimal expected risk. 
Although our method can be generalized to problems in which different types of classification errors incur different costs, we shall simplify our discussion by assuming that all errors are equal. In this case, a Bayes classifier assigns each feature vector to a class with maximum posterior probability. The expected risk of this classifier, or Bayes risk, then reduces to the probability of error \n\n$R_B = \\int_S \\Big[ 1 - \\sup_{\\ell \\in \\mathcal{L}} P(\\ell \\mid x) \\Big] f(x)\\, dx$.   (1) \n\n* E-mail: snapp \n\n... for every $\\epsilon > 0$, it is possible to construct a k-nearest-neighbor classifier such that $|R_m - R_B| < \\epsilon$ if m and k are sufficiently large. Bayes consistency is also evident in other nonparametric pattern classifiers. \n\nSeveral methods for estimating $R_B$ from sample data have previously been proposed, e.g., (Devijver, 1985), (Fukunaga, 1985), (Fukunaga and Hummels, 1987), (Garnett and Yau, 1977), and (Loizou and Maybank, 1987). Typically, these methods involve constructing sequences of k-nearest-neighbor classifiers with increasing values of k and m. The misclassification rates are estimated using an independent test sample, from which upper and lower bounds to $R_B$ are obtained. Because these experiments are necessarily performed with finite reference samples, these bounds are often imprecise. This is especially true for problems in which $R_m$ converges to $R_\\infty(k)$ at a slow rate. In order to remedy this deficiency, it is necessary to understand the manner in which the limit in Property 1 is achieved. In the next section we describe how this information can be used to construct new estimators for the Bayes risk of sufficiently smooth classification problems. 
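The k-nearest-neighbor classifiers discussed above admit a compact sketch. The following is an illustrative implementation of the decision rule only (ours, not the authors' code), assuming a Euclidean metric and simple majority voting among the k nearest reference patterns:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_ref, y_ref, k=1):
    """Assign x to the majority class among its k nearest reference points."""
    # Euclidean distances from x to every reference feature vector.
    d = np.linalg.norm(X_ref - x, axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(d)[:k]
    # Majority vote over the neighbors' class labels.
    votes = Counter(y_ref[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

For k = 1 this is the classical nearest-neighbor rule whose infinite-sample risk satisfies the Cover and Hart bounds cited above.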
\n\n3 NEW ESTIMATORS OF THE BAYES RISK \n\nFor a subset of multiclass classification problems that can be described by probability densities with uniformly bounded partial derivatives up through order N + 1 (with $N \\ge 2$), the finite-sample risk of a k-nearest-neighbor classifier that uses a weighted $L_p$ metric can be represented by the truncated asymptotic expansion \n\n$R_m = R_\\infty(k) + \\sum_{j=2}^{N} c_j\\, m^{-j/n} + O\\big(m^{-(N+1)/n}\\big)$,   (4) \n\n(Psaltis, Snapp, and Venkatesh, 1994), and (Snapp and Venkatesh, 1995). In the above, n equals the dimensionality of the feature vectors, and $R_\\infty(k), c_2, \\ldots, c_N$ are expansion coefficients that depend upon the probability distributions that define the pattern classification problem. \n\nThis asymptotic expansion provides a parametric description of how the finite-sample risk $R_m$ converges to its infinite-sample limit $R_\\infty(k)$. Using a large sample of classified data, one can obtain statistical estimates $\\hat{R}_m$ of the finite-sample risk for different values of m. Specifically, let $\\{m_i\\}$ denote a sequence of M different sample sizes, and select fixed values for k and N. For each value of $m_i$, construct an ensemble of k-nearest-neighbor classifiers, i.e., for each classifier construct a random reference sample $X_{m_i}$ by selecting $m_i$ patterns with replacement from the original large sample. Estimate the empirical risk of each classifier in the ensemble with an independently drawn set of \"test\" vectors. Let $\\hat{R}_{m_i}$ denote the average empirical risk of the i-th ensemble. Then, using the resulting set of data points $\\{(m_i, \\hat{R}_{m_i})\\}$, find the values of the coefficients $R_\\infty(k)$ and $c_2$ through $c_N$ that minimize the sum of squares \n\n$\\sum_{i=1}^{M} \\Big( \\hat{R}_{m_i} - R_\\infty(k) - \\sum_{j=2}^{N} c_j\\, m_i^{-j/n} \\Big)^2$.   (5) \n\nSeveral inequalities can then be used to obtain approximations of $R_B$ from the estimated value of $R_\\infty(k)$. 
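Since (4) is linear in the unknowns $R_\infty(k), c_2, \ldots, c_N$, minimizing (5) reduces to an ordinary least-squares problem. A minimal sketch of this fitting step (our own illustration; the function name and inputs are hypothetical, not the authors' code):

```python
import numpy as np

def fit_risk_expansion(m, R_hat, n, N):
    """Least-squares fit of R_m = R_inf + sum_{j=2..N} c_j * m**(-j/n),
    i.e., the minimizer of the sum of squares in Eqn. (5).
    Returns the coefficient vector [R_inf, c_2, ..., c_N]."""
    m = np.asarray(m, dtype=float)
    # Design matrix: a constant column for R_inf(k), then m^{-j/n} for j = 2..N.
    A = np.column_stack([np.ones_like(m)] +
                        [m ** (-j / n) for j in range(2, N + 1)])
    coef, *_ = np.linalg.lstsq(A, np.asarray(R_hat, dtype=float), rcond=None)
    return coef
```

Each ensemble-averaged empirical risk supplies one row of the design matrix, so M must be at least N, and in practice much larger for a stable fit.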
For example, if k = 1, then Cover and Hart's inequality in Property 1 implies that \n\n$R_\\infty(1)/2 \\le R_B \\le R_\\infty(1)$. \n\nTo enable an estimate of $R_B$ with precision $\\epsilon$, choose $k > 2/\\epsilon^2$, and estimate $R_\\infty(k)$ by the above method. Then Devroye's inequality (3) implies \n\n$R_\\infty(k) - \\epsilon \\le R_\\infty(k)(1 - \\epsilon) \\le R_B \\le R_\\infty(k)$. \n\n4 EXPERIMENTAL RESULTS \n\nThe above procedure for estimating $R_B$ was applied to two pattern recognition problems. First consider the synthetic two-class problem with prior probabilities $P_1 = P_2 = 1/2$ and normally distributed class-conditional densities \n\n$f_\\ell(x) = \\frac{1}{(2\\pi)^{n/2}} \\exp\\Big\\{ -\\frac{1}{2}\\Big[ (x_1 + (-1)^\\ell)^2 + \\sum_{i=2}^{n} x_i^2 \\Big] \\Big\\}$, \n\nfor $\\ell = 1$ and 2. Pseudorandom labeled feature vectors $(x, \\ell)$ were numerically generated in accordance with the above for dimensions n = 1 and n = 5. Twelve sample sizes between 10 and 3000 were examined. For each dimension and sample size, the risks $R_m$ of many independent k-nearest-neighbor classifiers with k = 1, 7, and 63 were empirically estimated. (Because the asymptotic expansion (4) does not accurately describe the very small-sample behavior of the k-nearest-neighbor classifier, sample sizes smaller than 2k were not included in the fit.) \n\nEstimates of the coefficients in (5) for six different fits appear in the first equation of each cell in the third and fourth columns of Table 1. For reference, the second column contains the values of $R_\\infty(k)$ that were obtained by numerically evaluating an exact integral expression (Cover and Hart, 1967). Estimates of the Bayes risk appear in the second equation of each cell in the third and fourth columns. Cover and Hart's inequality (2) was used for the experiments that assumed k = 1, and Devroye's inequality (3) was used if $k \\ge 7$. For this problem, formula (1) evaluates to $R_B = \\frac{1}{2}\\,\\mathrm{erfc}(1/\\sqrt{2}) = 0.15865$. 
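The closed-form value of (1) for this problem, and the Cover and Hart interval that $R_\infty(1)$ implies for $R_B$, can be checked numerically (a sketch of our own; the value 0.2248 is the exact $R_\infty(1)$ for this problem):

```python
import math

# Bayes risk of two equiprobable unit-variance normal classes with means -1 and +1:
# R_B = (1/2) erfc(1/sqrt(2)), the normal tail probability beyond the midpoint x_1 = 0.
R_B = 0.5 * math.erfc(1.0 / math.sqrt(2.0))  # approximately 0.15866

# Cover-Hart bounds R_inf(1)/2 <= R_B <= R_inf(1), using the exact 1-NN limit risk.
R_inf_1 = 0.2248
lower, upper = R_inf_1 / 2.0, R_inf_1
```

The interval [0.1124, 0.2248] indeed contains the exact Bayes risk, as the theory requires.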
\n\nTable 1: Estimates of the model coefficients and Bayes error for a classification problem with two normal classes. \n\nk = 1, $R_\\infty(k) = 0.2248$: \n  n = 1 (N = 2): $R_m = 0.2287 + 0.6536\\, m^{-2}$;  $R_B = 0.172 \\pm 0.057$ \n  n = 5 (N = 6): $R_m = 0.2287 + 0.1121\\, m^{-2/5} + 0.2001\\, m^{-4/5} - 0.0222\\, m^{-6/5}$;  $R_B = 0.172 \\pm 0.057$ \n\nk = 7, $R_\\infty(k) = 0.1746$: \n  n = 1 (N = 2): $R_m = 0.1744 + 4.842\\, m^{-2}$;  $R_B = 0.152 \\pm 0.023$ \n  n = 5 (N = 6): $R_m = 0.1700 + 0.2218\\, m^{-2/5} - 3.782\\, m^{-4/5} + 1.005\\, m^{-6/5}$;  $R_B = 0.148 \\pm 0.022$ \n\nk = 63, $R_\\infty(k) = 0.1606$: \n  n = 1 (N = 2): $R_m = 0.1606 + 20.23\\, m^{-2}$;  $R_B = 0.157 \\pm 0.004$ \n  n = 5 (N = 6): $R_m = 0.1595 + 0.1002\\, m^{-2/5} - 1.426\\, m^{-4/5} + 10.96\\, m^{-6/5}$;  $R_B = 0.156 \\pm 0.004$ \n\nThe second pattern recognition problem uses natural data; thus the underlying probability distributions are not known. A pool of 222 classified multispectral pixels was extracted from a seven-band satellite image. Each pixel was represented by five spectral components, $x = (x_1, \\ldots, x_5)$, each in the range $0 \\le x_k \\le 255$. (Thus, n = 5.) The class label of each pixel was determined by one of the remaining spectral components, $0 \\le y \\le 255$. Two pattern classes were then defined: $\\omega_1 = \\{y < B\\}$ and $\\omega_2 = \\{y \\ge B\\}$, where B was a predetermined threshold. (This particular problem was chosen to test the feasibility of the method. In future work, we will examine more interesting pixel classification problems.) \n\nTable 2: Coefficients that minimize the squared-error fit for different N. Note that $c_3 = 0$ and $c_5 = 0$ in (4) if $n \\ge 4$ (Psaltis, Snapp, and Venkatesh, 1994). \n\n  N    $R_\\infty(1)$    $c_2$         $c_4$        $c_6$ \n  2    0.0757133     0.126214 \n  4    0.0757846     0.124007     0.0132804 \n  6    0.0766477     0.0785847    0.689242     -2.68818 \n\nWith k = 1, a large number of Bernoulli trials (e.g., 2-1000) were performed for each value of $m_i$. 
Each trial began by constructing a reference sample of $m_i$ classified pixels chosen at random from the pool. The risk of each reference sample was then estimated by classifying t pixels with the nearest-neighbor algorithm under a Euclidean metric. Here, the t pixels, with $2000 \\le t \\le 20000$, were chosen independently, with replacement, from the pool. The risk $\\hat{R}_{m_i}$ was then estimated as the average risk of each reference sample of size $m_i$. (The number of experiments performed for each value of $m_i$, and the values of t, were chosen to ensure that the variance of $\\hat{R}_{m_i}$ was sufficiently small, less than $10^{-4}$ in this case.) This process was repeated for M = 33 different values of $m_i$ in the range $100 \\le m_i \\le 15000$. Results of these experiments are displayed in Table 2 and Figure 1 for three different values of N. Note that the robustness of the fit begins to dissolve, for this data, at N = 6, either the result of overfitting or of insufficient smoothness in the underlying probability distributions. However, the estimate for $R_\\infty(1)$ appears to be stable. For this classification problem, we thus obtain $R_B = 0.0568 \\pm 0.0190$. \n\n5 CONCLUSION \n\nThe described method for estimating the Bayes risk is based on a recent asymptotic analysis of the finite-sample risk of the k-nearest-neighbor classifier (Snapp and Venkatesh, 1995). Representing the finite-sample risk as a truncated asymptotic series enables an efficient estimation of the infinite-sample risk $R_\\infty(k)$ from the classifier's finite-sample behavior. The Bayes risk can then be estimated via the Bayes consistency of the k-nearest-neighbor algorithm. Because such finite-sample analyses are difficult, and consequently rare, this new method has the potential to evolve into a useful algorithm for estimating the Bayes risk. Further improvements in efficiency may be obtained by incorporating principles of optimal experimental design, cf. (Elfving, 1952) and (Federov, 1972). 
\n\nIt is important to emphasize, however, that the validity of (4) rests on several rather strong smoothness assumptions, including a high degree of differentiability of the class-conditional probability densities. For problems that do not satisfy these conditions, other finite-sample descriptions need to be constructed before this method can be applied. Nevertheless, there is much evidence that nature favors smoothness. Thus, these restrictive assumptions may still be applicable to many important problems. \n\nAcknowledgments \n\nThe work reported here was supported in part by the National Science Foundation under Grant No. NSF OSR-9350540 and by Rome Laboratory, Air Force Materiel Command, USAF, under grant number F30602-94-1-0010. \n\nFigure 1: The best fourth-order (N = 4) fit of Eqn. (5) to 33 empirical estimates of $\\hat{R}_{m_i}$ for a pixel classification problem obtained from a multispectral Landsat image. Using $R_\\infty(1) = 0.0758$, the fourth-order fit, $R_m = 0.0758 + 0.124\\, m^{-2/5} + 0.0133\\, m^{-4/5}$, is plotted on a log-log scale to reveal the significance of the j = 2 term. \n\nReferences \n\nT. M. Cover and P. E. Hart, \"Nearest neighbor pattern classification,\" IEEE Trans. Inform. Theory, vol. IT-13, 1967, pp. 21-27. \n\nP. A. Devijver, \"A multiclass, k-NN approach to Bayes risk estimation,\" Pattern Recognition Letters, vol. 3, 1985, pp. 1-6. \n\nL. Devroye, \"On the asymptotic probability of error in nonparametric discrimination,\" Annals of Statistics, vol. 9, 1981, pp. 1320-1327. \n\nR. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York, New York: John Wiley & Sons, 1973. \n\nG. Elfving, \"Optimum allocation in linear regression theory,\" Ann. Math. Statist., vol. 23, 1952, pp. 255-262. \n\nV. V. 
Federov, Theory of Optimal Experiments. New York, New York: Academic Press, 1972. \n\nE. Fix and J. L. Hodges, \"Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties,\" Project 21-49-004, Report Number 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951, pp. 261-279. \n\nK. Fukunaga, \"The estimation of the Bayes error by the k-nearest neighbor approach,\" in L. N. Kanal and A. Rosenfeld (eds.), Progress in Pattern Recognition, vol. 2, Elsevier Science Publishers B.V. (North Holland), 1985, pp. 169-187. \n\nK. Fukunaga and D. Hummels, \"Bayes error estimation using Parzen and k-NN procedures,\" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 9, 1987, pp. 634-643. \n\nJ. M. Garnett, III and S. S. Yau, \"Nonparametric estimation of the Bayes error of feature extractors using ordered nearest neighbor sets,\" IEEE Transactions on Computers, vol. 26, 1977, pp. 46-54. \n\nG. Loizou and S. J. Maybank, \"The nearest neighbor and the Bayes error rate,\" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 9, 1987, pp. 254-262. \n\nD. Psaltis, R. R. Snapp, and S. S. Venkatesh, \"On the finite sample performance of the nearest neighbor classifier,\" IEEE Trans. Inform. Theory, vol. IT-40, 1994, pp. 820-837. \n\nR. R. Snapp and S. S. Venkatesh, \"k Nearest Neighbors in Search of a Metric,\" 1995, (submitted). \n", "award": [], "sourceid": 1064, "authors": [{"given_name": "Robert", "family_name": "Snapp", "institution": null}, {"given_name": "Tong", "family_name": "Xu", "institution": null}]}