{"title": "A Boundary Hunting Radial Basis Function Classifier which Allocates Centers Constructively", "book": "Advances in Neural Information Processing Systems", "page_first": 139, "page_last": 146, "abstract": null, "full_text": "A Boundary Hunting Radial Basis Function Classifier Which Allocates Centers Constructively\n\nEric I. Chang and Richard P. Lippmann\n\nMIT Lincoln Laboratory\n\nLexington, MA 02173-0073, USA\n\nAbstract\n\nA new boundary hunting radial basis function (BH-RBF) classifier which allocates RBF centers constructively near class boundaries is described. This classifier creates complex decision boundaries only in regions where confusions occur and corresponding RBF outputs are similar. A predicted squared error measure is used to determine how many centers to add and when to stop adding centers. Two experiments are presented which demonstrate the advantages of the BH-RBF classifier. One uses artificial data with two classes and two input features where each class contains four clusters but only one cluster is near a decision region boundary. The other uses a large seismic database with seven classes and 14 input features. In both experiments the BH-RBF classifier provides a lower error rate with fewer centers than are required by more conventional RBF, Gaussian mixture, or MLP classifiers.\n\n1 INTRODUCTION\n\nRadial basis function (RBF) classifiers have been successfully applied to many pattern classification problems (Broomhead, 1988, Ng, 1991). These classifiers have the advantages of short training times and high classification accuracy. In addition, RBF outputs estimate minimum-error Bayesian a posteriori probabilities (Richard, 1991). Performing classification with RBF outputs requires selecting the output which is highest for each input.
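This classification rule, evaluating the RBF outputs and selecting the largest, can be sketched as follows. This is a minimal illustrative implementation, not the exact classifier used in the paper: the Gaussian basis widths and the least-squares training of the output weights are assumptions made for the sketch.

```python
import numpy as np

def rbf_design_matrix(X, centers, widths):
    # Phi[i, j] = exp(-||x_i - c_j||^2 / (2 * w_j^2)): Gaussian basis responses
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths ** 2))

class RBFClassifier:
    """Sketch of an RBF classifier: basis layer plus linear output layer."""

    def __init__(self, centers, widths):
        self.centers, self.widths = centers, widths

    def fit(self, X, y, n_classes):
        # Train output weights by least squares against one-of-N targets
        Phi = rbf_design_matrix(X, self.centers, self.widths)
        T = np.eye(n_classes)[y]
        self.W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
        return self

    def outputs(self, X):
        # Network outputs, which approximate class posterior probabilities
        return rbf_design_matrix(X, self.centers, self.widths) @ self.W

    def predict(self, X):
        # Classify by selecting the highest output for each input
        return self.outputs(X).argmax(axis=1)
```

In use, the choice of centers is the key design question; the constructive allocation of centers is what the rest of the paper addresses.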
In regions where one class dominates, the Bayesian a posteriori probability for that class will be uniformly \"high\" and near 1.0. Detailed modeling of the variation of the Bayesian a posteriori probability in these regions is not necessary for classification. Only at the boundary between different classes is accurate estimation of the Bayesian a posteriori probability necessary for high classification accuracy. If the boundary between different classes can be located in the input space, RBF centers can be judiciously allocated in those regions without wasting RBF centers in regions where accurate estimation of the Bayesian a posteriori probability does not improve classification performance.\n\nIn general, having more RBF centers allows better approximation of the desired output. While training a RBF classifier, the number of RBF centers must be selected. The traditional approach has been to randomly choose patterns from the training set as centers, or to perform K-means clustering on the data and then to use these centers as the RBF centers. Frequently the correct number of centers to use is not known a priori and the number of centers has to be tuned. Also, with K-means clustering, the centers are distributed without considering their usefulness in classification. In contrast, a constructive approach to adding RBF centers based on modeling Bayesian a posteriori probabilities accurately only near class boundaries provides good performance with fewer centers than are required to separately model class PDFs.\n\nMany algorithms have been proposed for constructively building up the structure of a RBF network (Mel, 1991). However, the algorithms proposed have all been designed for training a RBF network to perform function mapping.
For mapping tasks, accuracy is important throughout the input region and the mean squared error is the criterion that is minimized. In classification tasks, only boundaries between different classes are important and the overall mean squared error is not as important as the error at class boundaries.\n\n2 ALGORITHM DESCRIPTION\n\nA block diagram of a new boundary hunting RBF (BH-RBF) classifier that adds centers constructively near class boundaries is presented in Figure 1. A simple unimodal Gaussian classifier is first formed by clustering the training patterns from a randomly selected class and assigning a center to that class. The confusion matrix generated by using this simple classifier is then examined to determine the pair of classes A and B which have the most mutual confusion. Training patterns that are close to the boundary between these two classes are determined by looking at the outputs of the RBF classifier.\n\n[Figure 1: Block Diagram of Training of BH-RBF Network. One RBF center forms the initial RBF network; new RBF centers are added for the class pair responsible for the most errors and overlap, a predicted squared error score is calculated for each intermediate RBF network, and the best network is kept as the final network.]\n\nBoundary patterns which produce similar \"high\" outputs for both classes, differing by less than a \"closecall\" threshold, are used to produce new cluster centers.\n\nFigure 2 shows RBF outputs corresponding to classes A and B as the input varies over a small range. This figure illustrates how network outputs are used to determine the \"closecall\" region between classes. Network outputs are high in regions dominated by a particular class and therefore these regions are outside the boundary between different classes.
Network outputs are close in the region where the absolute difference of the two highest network outputs is less than the closecall threshold. Training patterns which fall into this closecall region, plus all the points that are misclassified as the other class in the class pair, are considered to be points in the boundary. For example, a pattern in class A which is misclassified as class B would be considered to be in the boundary between classes A and B. On the other hand, a pattern in class A which is misclassified as class C would not be placed in the boundary between classes A and B.\n\n[Figure 2: Using the Network Output to Determine Closecall Regions. The outputs F(A) and F(B) cross between the region dominated by class A and the region dominated by class B; the closecall region is where they differ by less than the closecall threshold.]\n\nAfter the patterns which belong in the boundary are determined, clustering is performed separately on boundary patterns from different classes using K-means clustering and a number of centers ranging from zero to a preset maximum number of centers. After the centers are found, new RBF classifiers are trained using the new sets of centers plus the original set of centers. The combined set of centers that provides the best performance is saved and the cycle repeats again by finding the next class pair which accounts for the most remaining confusions. Overfitting by adding too many centers at a time is avoided by using the predicted squared error (PSE) as the criterion for choosing new centers (Barron, 1984):\n\nPSE = RMS + Cσ²/N\n\nIn this equation, RMS is the root mean squared error on the training set, σ² estimates the variance of the error, C is the total number of centers in the RBF classifier, and N is the total number of patterns in the training set.
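The closecall boundary-pattern selection and the PSE criterion described above can be sketched as follows. The function names and array layout are illustrative assumptions, not code from the paper.

```python
import numpy as np

def boundary_patterns(outputs, labels, class_a, class_b, closecall=0.75):
    """Indices of training patterns lying in the A/B boundary region.

    outputs: (N, n_classes) RBF network outputs; labels: (N,) true classes.
    A pattern is a boundary pattern if (1) its two highest outputs differ by
    less than the closecall threshold and both belong to the selected pair,
    or (2) it belongs to class A but is classified as class B, or vice versa.
    """
    preds = outputs.argmax(axis=1)
    top2 = np.sort(outputs, axis=1)[:, -2:]           # two highest outputs
    second = np.argsort(outputs, axis=1)[:, -2]        # runner-up class
    close = ((top2[:, 1] - top2[:, 0]) < closecall) \
        & np.isin(preds, [class_a, class_b]) & np.isin(second, [class_a, class_b])
    swapped = ((labels == class_a) & (preds == class_b)) \
        | ((labels == class_b) & (preds == class_a))
    return np.where(close | swapped)[0]

def predicted_squared_error(rms, n_centers, sigma2, n_patterns):
    # PSE = RMS + C * sigma^2 / N  (Barron, 1984)
    return rms + n_centers * sigma2 / n_patterns
```

The PSE penalty grows with the number of centers C, so adding centers is only accepted when the training-set error improvement outweighs the added complexity.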
The error variance σ² is selected empirically using left-out evaluation data. Different values of σ² are tried and the value which provides the best performance on the evaluation data is chosen. On each cycle, different numbers of centers are tried for each class of the selected class pair and the PSE is used to select the best subset of centers. The best PSE on each cycle is used to determine when training should be stopped to prevent overfitting. Training stops after the PSE has not decreased for five consecutive cycles.\n\n3 EXPERIMENTAL RESULTS\n\nTwo experiments were performed using the new BH-RBF classifier, a more conventional RBF classifier, a Gaussian mixture classifier (Ng, 1991), and an MLP classifier. Five regular RBF classifiers (RBF) were trained by assigning 1, 2, 3, 4, or 5 centers to each class. Similarly, five Gaussian mixture classifiers (GMIX) were trained with 1, 2, 3, 4, or 5 centers in each class. The means of each center were trained individually using K-means clustering to find the centers for patterns from each class. The diagonal covariance of each center was set using all the patterns that were assigned to a cluster during the last pass of K-means clustering. The structures of the regular RBF classifier and the Gaussian mixture classifier are identical when the number of centers is the same. The only difference between the classifiers is the method used to train parameters.\n\nMLP classifiers were trained for 10 independent trials for each data set. The number of hidden nodes was varied from 2 to 30 in increments of 2. The goal of the experiment was to explore the relationship between the complexity of the classifier and the classification accuracy of the classifier. Training was stopped using cross validation to avoid overfitting.
3.1 FOUR-CLUSTER DATABASE\n\nThe first problem is an artificial data set designed to illustrate the difference between BH-RBF and other classifiers. There are two classes; each class consists of one large Gaussian cluster with 700 random points and three smaller clusters with 100 points each. Figure 3 shows the distribution of the data and the ideal decision boundary when the actual centers and variances are used to train a Bayesian minimum error classifier. There were 2000 training patterns, 2000 evaluation patterns, and 2000 test patterns. The BH-RBF classifier was trained with the closecall threshold set to 0.75, σ² set to 0.5, and a maximum of two extra centers per class between each pair of classes. The theoretically optimal Bayesian classifier for this database provides an error rate of 1.95% on the test set. This optimal Bayesian classifier is obtained using the actual centers, variances, and a priori probabilities used to generate the data in a Gaussian mixture classifier. In a real classification task, these center parameters are not known and have to be estimated from training data.\n\nFigure 4 shows the testing error rate of the three different classifiers. The BH-RBF classifier was able to achieve a 2.35% error rate with only 5 centers, and the error rate gradually decreased to 2.15% with 15 centers. The BH-RBF classifier performed well with few centers because it allocated these centers near the boundary between the two classes. On the other hand, the performance of the RBF classifier and the Gaussian mixture classifier was worse with few centers.
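A data set with the structure described above (per class, one 700-point Gaussian cluster plus three 100-point clusters) can be generated as in the sketch below. The cluster centers and variances used here are illustrative placeholders; the actual parameters used to generate the paper's database are not given.

```python
import numpy as np

def make_four_cluster_class(rng, means, big_var, small_var):
    """One class: a 700-point Gaussian cluster plus three 100-point clusters.

    means: four (x, y) cluster centers; the first is the large cluster.
    All locations and variances are hypothetical stand-ins, not the
    parameters from the paper.
    """
    sizes = [700, 100, 100, 100]
    stds = [np.sqrt(big_var)] + [np.sqrt(small_var)] * 3
    parts = [rng.normal(m, s, size=(n, 2)) for m, s, n in zip(means, stds, sizes)]
    return np.vstack(parts)

rng = np.random.default_rng(0)
# Two classes of 1000 points each -> 2000 training patterns, as in the paper
class_a = make_four_cluster_class(rng, [(10, 40), (30, 20), (40, 50), (15, 15)], 25.0, 4.0)
class_b = make_four_cluster_class(rng, [(45, 25), (25, 45), (55, 40), (35, 10)], 25.0, 4.0)
X = np.vstack([class_a, class_b])
y = np.repeat([0, 1], 1000)
```

The same generator would be run three times with independent seeds to produce the training, evaluation, and test partitions.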
These classifiers performed worse because they allocated centers in regions that had many patterns.\n\n[Figure 3: The Artificially Generated Four-Cluster Problem. Scatter plot of the two classes in the (X, Y) input plane with the ideal decision boundary.]\n\n[Figure 4: Testing Error Rate of the BH-RBF Classifier, the Gaussian Mixture Classifier, and the Regular RBF Classifier on the Four-Cluster Problem, as a function of the number of centers.]\n\nThe training algorithm did not distinguish between patterns that are easily confusable between classes (i.e., near the class boundary) and patterns that clearly belong in a given class. Furthermore, adding more centers did not monotonically decrease the error rate. For example, the RBF classifier had 5% error using two centers, but when the number of centers was increased to four, the error rate jumped to 11%. Only when the number of centers increased above 14 did the RBF classifier's and the Gaussian mixture classifier's error rates converge. The RBF and the Gaussian mixture classifiers performed poorly with few centers because the centers were concentrated away from the decision boundary due to the high concentration of data far away from the boundary. Thus, there weren't enough centers to model the decision boundary accurately.
The BH-RBF classifier added centers near the boundary and thus was able to define an accurate boundary with fewer centers.\n\nFigure 5 presents the results from training MLP classifiers on the same data set using different numbers of hidden nodes. The learning rate was set to 0.001, the momentum term was set to 0.6, and each classifier was trained for 100 epochs. The error rate on a left-out evaluation set was checked to assure that the net had not overfitted the training data. As the number of hidden nodes increased, the MLP classifier generally performed better. However, the testing error rate did not decrease monotonically as the number of hidden nodes increased. Furthermore, the random initial conditions set by the different random seeds affected the classification error rate of each classifier. In comparison, the training algorithms used for the BH-RBF, RBF, and GMIX classifiers do not exhibit such sensitivity to initial conditions.\n\n[Figure 5: Testing Error Rate of the MLP Classifiers on the Four-Cluster Problem. Maximum and minimum error rates over the independent trials, plotted against the number of hidden nodes.]\n\n3.2 SEISMIC DATABASE\n\nThe second problem consists of data for classification of seismic events. The input consists of 14 continuous and binary measurements derived from seismic waveform signals. These features are used to classify a waveform as belonging to one of 7 classes which represent different seismic phases. There were 3038 training, 3033 evaluation, and 3034 testing patterns.\n\n[Figure 6: Error Rate Comparison Between the BH-RBF Classifier, the Regular RBF Classifier, and the Gaussian Mixture Classifier on the Seismic Problem, as a function of the number of centers.]\n\nOnce again, the number of centers per class was varied from 1 to 5 for the regular RBF classifier and the Gaussian mixture classifier, while the BH-RBF classifier was started with 1 center in the first class and then more centers were automatically assigned. The BH-RBF classifier was trained with the closecall threshold set to 0.75, σ² set to 0.5, and a maximum of one extra center per class at each boundary. The parameters were chosen according to the performance of the classifier on the left-out evaluation data. For this problem, the closecall threshold and σ² turned out to be the same as the ones used in the four-cluster problem.\n\nFigure 6 shows the error rate on the testing patterns for all three classifiers. The BH-RBF classifier clearly performed better than the regular RBF classifier and the Gaussian mixture classifier. The BH-RBF classifier added centers only at the boundary region where they improved discrimination. Also, the diagonal covariances of the added centers are more local in their influence and can improve discrimination at a particular boundary without affecting other decision region boundaries.\n\nMLP classifiers were also trained on this data set with the number of hidden nodes varying from 2 to 32 in increments of 2. The learning rate was set to 0.001, the momentum term was set to 0.6, and each classifier was trained for 100 epochs. The classification error rate on the left-out evaluation set showed that the network had not overfitted on the training data.
Once more, the MLP classifiers exhibited great sensitivity to initial conditions, especially when the number of hidden nodes was small. Also, for this high dimensionality classification task, even the best performance of the MLP classifier (15.5%) did not match the best performance of the BH-RBF classifier. This result suggests that for this high dimensionality data, the radially symmetric boundaries formed with local basis functions such as those of the RBF classifier are more appropriate than the ridge-like boundaries formed with the MLP classifier.\n\n4 CONCLUSION\n\nA new boundary-hunting RBF classifier was developed which adds RBF centers constructively near boundaries of classes which produce classification confusions. Experimental results from two problems differing in input dimension, number of classes, and difficulty show that the BH-RBF classifier performed better than traditional training algorithms used for RBF, Gaussian mixture, and MLP classifiers. Experiments have also been conducted on other problems such as Peterson and Barney's vowel database and the disjoint database used by Ng (Peterson, 1952, Ng, 1990). In all experiments, the BH-RBF constructive algorithm performed at least as well as the traditional RBF training algorithm. These results, and the experiments described above, confirm the hypothesis that better discrimination performance can be achieved by training a classifier to perform discrimination instead of probability density function estimation.\n\nAcknowledgments\n\nThis work was supported by DARPA. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. Experiments were conducted using LNKnet, a general purpose classifier program developed at Lincoln Laboratory by Richard Lippmann, Dave Nation, and Linda Kukolich.\n\nReferences\n\nG. E. Peterson and H. L. 
Barney. (1952) Control Methods Used in a Study of the Vowels. The Journal of the Acoustical Society of America 24:2, 175-184.\n\nA. Barron. (1984) Predicted squared error: a criterion for automatic model selection. In S. Farlow (Ed.), Self-Organizing Methods in Modeling. New York: Marcel Dekker.\n\nD. S. Broomhead and D. Lowe. (1988) Radial Basis Functions, multi-variable functional interpolation and adaptive networks. Technical Report RSRE Memorandum No. 4148, Royal Signals and Radar Establishment, Malvern, Worcestershire, Great Britain.\n\nB. W. Mel and S. M. Omohundro. (1991) How Receptive Field Parameters Affect Neural Learning. In R. Lippmann, J. Moody and D. Touretzky (Eds.), Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.\n\nK. Ng and R. Lippmann. (1991) A Comparative Study of the Practical Characteristics of Neural Networks and Conventional Pattern Classifiers. In R. Lippmann, J. Moody and D. Touretzky (Eds.), Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.\n\nM. D. Richard and R. P. Lippmann. (1991) Neural Network Classifier Estimates Bayesian a posteriori Probabilities. Neural Computation, Volume 3, Number 4.\n", "award": [], "sourceid": 715, "authors": [{"given_name": "Eric", "family_name": "Chang", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}