{"title": "Improving Performance in Neural Networks Using a Boosting Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 42, "page_last": 49, "abstract": null, "full_text": "Improving Performance in Neural Networks Using a Boosting Algorithm \n\nHarris Drucker \nAT&T Bell Laboratories \nHolmdel, NJ 07733 \n\nPatrice Simard \nAT&T Bell Laboratories \nHolmdel, NJ 07733 \n\nRobert Schapire \nAT&T Bell Laboratories \nMurray Hill, NJ 07974 \n\nAbstract \n\nA boosting algorithm converts a learning machine with an error rate of less than 50% to one with an arbitrarily low error rate. However, the algorithm discussed here depends on having a large supply of independent training samples. We show how to circumvent this problem and generate an ensemble of learning machines whose performance in optical character recognition problems is dramatically improved over that of a single network. We report the effect of boosting on four databases (all handwritten) consisting of 12,000 digits from segmented ZIP codes from the United States Postal Service (USPS) and the following from the National Institute of Standards and Technology (NIST): 220,000 digits, 45,000 upper case alphas, and 45,000 lower case alphas. We use two performance measures: the raw error rate (no rejects) and the reject rate required to achieve a 1% error rate on the patterns not rejected. Boosting improved performance in some cases by a factor of three. \n\n1 INTRODUCTION \n\nIn this article we summarize a study on the effects of a boosting algorithm on the performance of an ensemble of neural networks used in optical character recognition problems. Full details can be obtained elsewhere (Drucker, Schapire, and Simard, 1993). 
\nThe \"boosting by filtering\" algorithm is based on Schapire's original work (1990), which showed that it is theoretically possible to convert a learning machine with an error rate of less than 50% into an ensemble of learning machines whose error rate is arbitrarily low. The work detailed here is the first practical implementation of this boosting algorithm. \n\nAs applied to an ensemble of neural networks using supervised learning, the algorithm proceeds as follows: Assume an oracle that generates a large number of independent training examples. First, generate a set of training examples and train a first network. After the first network is trained, it may be used in combination with the oracle to produce a second training set in the following manner: Flip a fair coin. If the coin is heads, pass outputs from the oracle through the first network until the first network misclassifies a pattern, and add this pattern to the second training set. If the coin is tails, pass outputs from the oracle through the first network until the first network finds a pattern that it classifies correctly, and add that pattern to the training set. This process is repeated until enough patterns have been collected. These patterns, half of which the first machine classifies correctly and half incorrectly, constitute the training set for the second network. The second network may then be trained. \n\nThe first two networks may then be used to produce a third training set in the following manner: Pass the outputs from the oracle through the first two networks. If the networks disagree on the classification, add this pattern to the training set. Otherwise, toss out the pattern. Continue until enough patterns have been generated to form the third training set. The third network is then trained on this set. 
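The filtering procedure above can be sketched in a few lines (an illustrative sketch, not the implementation used in this work; `oracle`, `net1`, `net2`, and `net3` are hypothetical stand-ins for the sample source and the trained classifiers):

```python
import random

def filter_second_set(oracle, net1, n):
    """Build the second training set: for each slot, flip a fair coin and
    draw patterns from the oracle until one is found that net1 classifies
    incorrectly (heads) or correctly (tails)."""
    out = []
    while len(out) < n:
        want_error = random.random() < 0.5      # fair coin flip
        while True:
            x, label = oracle()                 # fresh labeled pattern
            if (net1(x) != label) == want_error:
                out.append((x, label))
                break
    return out

def filter_third_set(oracle, net1, net2, n):
    """Build the third training set from patterns on which the first two
    networks disagree; agreeing patterns are tossed out."""
    out = []
    while len(out) < n:
        x, label = oracle()
        if net1(x) != net2(x):
            out.append((x, label))
    return out

def vote(net1, net2, net3, x):
    """Schapire's voting rule: if the first two networks agree, take that
    label; otherwise defer to the third network."""
    a, b = net1(x), net2(x)
    return a if a == b else net3(x)
```

By construction, roughly half of the patterns returned by `filter_second_set` are patterns the first network misclassifies.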
\nIn the final testing phase (of Schapire's original scheme), the test patterns (never previously used for training or validation) are passed through the three networks and labels are assigned using the following voting scheme: If the first two networks agree, that is the label. Otherwise, assign the label as classified by the third network. However, we have found that if we add together the three sets of outputs from the three networks to obtain one set of ten outputs (for the digits) or one set of twenty-six outputs (for the alphas), we obtain better results. Typically, the error rate is reduced by 0.5% over straight voting. \n\nThe rationale for the better performance using addition is as follows: A voting criterion is a hard-decision rule. Each voter in the ensemble has an equal vote whether in fact the voter has high confidence (a large difference between the two largest outputs of a particular network) or low confidence (a small difference between the two largest outputs). By summing the outputs (a soft-decision rule) we incorporate the confidence of the networks into the total output. As will be seen later, this also allows us to build an ensemble with only two voters rather than the three called for in the original algorithm. \n\nConceptually, this process could be iterated in a recursive manner to produce an ensemble of nine networks, twenty-seven networks, etc. However, we have found significant improvement in going from one network to only three. The penalty paid is potentially a factor-of-three increase in evaluation time (we attribute no penalty to the increased training time). However, it can be shown that sieving procedures reduce this factor to 1.75. \n\n2 A DEFORMATION MODEL \n\nThe proof that boosting works depends on the assumption of three independent training sets. Without a very large training set, this is not possible unless the error rates are large. 
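The difference between the hard-decision vote and the soft-decision sum can be made concrete (an illustrative sketch; each network is assumed to produce a list of per-class scores):

```python
def argmax(scores):
    """Index of the largest score (the predicted class label)."""
    return max(range(len(scores)), key=scores.__getitem__)

def vote_label(o1, o2, o3):
    """Hard decision: if nets 1 and 2 agree, take that label; else net 3's."""
    a, b = argmax(o1), argmax(o2)
    return a if a == b else argmax(o3)

def sum_label(o1, o2, o3):
    """Soft decision: add the three output vectors, then take the arg max,
    so a network with a large margin between its top two outputs carries
    more weight than an unsure one."""
    total = [a + b + c for a, b, c in zip(o1, o2, o3)]
    return argmax(total)
```

For example, two networks that weakly favor one class can be outweighed, under the soft rule, by a single network that strongly favors another.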
\nAfter training the first network, unless the network has very poor performance, there are not enough remaining samples to generate the second training set. For example, suppose we had 9000 total examples, used the first 3000 to train the first network, and that network achieves a 5% error rate. We would like the next training set to consist of 1500 patterns that the first network classifies incorrectly and 1500 that the first network classifies correctly. At a 5% error rate, we need approximately 30,000 new images to pass through the first network to find 1500 patterns that the first network classifies incorrectly. This many patterns are not available. Instead we will generate additional patterns by applying small deformations to the finite training set, based on the techniques of Simard (Simard et al., 1992). \n\nThe image consists of a square pixel array (we use both 16x16 and 20x20). Let the intensity of the image at coordinate location (i,j) be F_ij(x,y), where the (x,y) denotes that F is a differentiable and hence continuous function of x and y; i and j take on the discrete values 0,1,...,15 for a 16x16 pixel array. \n\nThe change in F at location (i,j) due to small x-translation, y-translation, rotation, diagonal deformation, axis deformation, scaling and thickness deformation is given by the following respective inner-product terms: \n\nΔF_ij(x,y) = k_1 ∂F_ij/∂x + k_2 ∂F_ij/∂y + k_3 (y ∂F_ij/∂x - x ∂F_ij/∂y) + k_4 (y ∂F_ij/∂x + x ∂F_ij/∂y) + k_5 (x ∂F_ij/∂x - y ∂F_ij/∂y) + k_6 (x ∂F_ij/∂x + y ∂F_ij/∂y) + k_7 ||∇F_ij(x,y)||^2 \n\nwhere the k's are small values and x and y are referenced to the center of the image. This construction depends on obtaining the two partial derivatives. For example, if all the k's except k_1 are zero, then ΔF_ij(x,y) = k_1 ∂F_ij(x,y)/∂x is the amount by which F_ij(x,y) at coordinate location (i,j) changes due to an x-translation of value k_1. \n\nThe diagonal deformation can be conceived of as pulling on two opposite corners of the image, thereby stretching the image along the 45 degree axis (away from the center) while simultaneously shrinking the image towards the center along the -45 degree axis. If k_4 changes sign, we push towards the center along the 45 degree axis and pull away along the -45 degree axis. Axis deformation can be conceived of as pulling (or pushing) away from the center along the x-axis while pushing (or pulling) towards the center along the y-axis. If all the k's except k_7 are zero, then ΔF_ij(x,y) = k_7 ||∇F_ij(x,y)||^2 is the norm squared of the gradient of the intensity. It can be shown that this corresponds to varying the \"thickness\" of the image. \n\nTypically the original image is very coarsely quantized and not differentiable. Smoothing of the original image is done by numerically convolving the original image with a 5x5 square kernel whose elements are values from the Gaussian exp(-(x^2 + y^2)/σ^2), to give us a 16x16 or 20x20 square matrix of smoothed values. \n\nA matrix of partial derivatives (with respect to x) for each pixel location is obtained by convolving the original image with a kernel whose elements are the derivatives with respect to x of the Gaussian function. We can similarly form a matrix of partial derivatives with respect to y. A new image can then be constructed by adding together the smoothed image and a differential matrix whose elements are given by the above equation. \n\nUsing the above equation, we may simulate an oracle by cycling through a finite sized training set, picking random values (uniformly distributed in some small range) of the constants k for each new image. 
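The smoothing-and-deformation construction can be sketched as follows (an illustrative sketch using NumPy; the kernel size, the σ convention in the Gaussian, and the exact pairing of signs with the constants k_3 through k_6 are our assumptions for illustration, not values taken from the text):

```python
import numpy as np

def gaussian_kernels(sigma=1.0, size=5):
    """5x5 Gaussian kernel and its x- and y-derivative kernels."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / sigma**2)
    g /= g.sum()
    gx = -2 * x / sigma**2 * g          # d/dx of the Gaussian
    gy = -2 * y / sigma**2 * g          # d/dy of the Gaussian
    return g, gx, gy

def conv2(img, k):
    """'Same'-size 2-D convolution with zero padding (a minimal sketch)."""
    r = k.shape[0] // 2
    p = np.pad(img, r)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (p[i:i + 2*r + 1, j:j + 2*r + 1] * k[::-1, ::-1]).sum()
    return out

def deform(img, k1=0, k2=0, k3=0, k4=0, k5=0, k6=0, k7=0, sigma=1.0):
    """Smooth the image, then add the differential of Section 2:
    translations (k1, k2), rotation (k3), diagonal (k4), axis (k5),
    scaling (k6), and thickness (k7)."""
    g, gx, gy = gaussian_kernels(sigma)
    F = conv2(img, g)                    # smoothed image
    Fx, Fy = conv2(img, gx), conv2(img, gy)
    n = img.shape[0]
    c = (n - 1) / 2.0
    jj, ii = np.meshgrid(np.arange(n) - c, np.arange(n) - c)
    x, y = jj, -ii                       # x right, y up, origin at center
    dF = (k1 * Fx + k2 * Fy
          + k3 * (y * Fx - x * Fy)      # rotation
          + k4 * (y * Fx + x * Fy)      # diagonal deformation
          + k5 * (x * Fx - y * Fy)      # axis deformation
          + k6 * (x * Fx + y * Fy)      # scaling
          + k7 * (Fx**2 + Fy**2))       # thickness: ||grad F||^2
    return F + dF
```

An oracle is then simulated by calling `deform` on each stored image with small random values of the constants k.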
The choice of the range of k is somewhat critical: too small and the new image is too close to the old image for the neural network to consider it a \"new\" pattern; too large and the image is distorted and nonrepresentative of \"real\" data. We will discuss the proper choice of k later. \n\n3 NETWORK ARCHITECTURES \n\nWe use as the basic learning machine a neural network with extensive use of shared weights (LeCun et al., 1989, 1990). Typically the number of weights is much less than the number of connections. We believe this leads to a better ability to reject images (i.e., to make no decision) and thereby minimizes the number of rejects needed to obtain a given error rate on the images not rejected. However, there is conflicting evidence (Martin & Pitman, 1991) that, given enough training patterns, fully connected networks give performance similar to networks using weight sharing. For the digits there is a 16 by 16 input surrounded by a six pixel border to give a 28 by 28 input layer. The network has 4645 neurons, 2578 different weights, and 98442 connections. \n\nThe networks used for the alpha characters use a 20 by 20 input surrounded by a six pixel border to give a 32 by 32 input layer. There are larger feature maps and more layers, but essentially the same construction as for the digits. \n\n4 TRAINING ALGORITHM \n\nThe training algorithm is described in general terms: Ideally, the data set should be broken up into a training set, a validation set and a test set. The training set and validation set are smoothed (no deformations) and the first network is trained using a quasi-Newton procedure. We alternately train on the training data and test on the validation data until the error rate on the validation data reaches a minimum. Typically, there is some overtraining, in that the error rate on the training data continues to decrease after the error rate on the validation set reaches a minimum. 
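The alternating train/validate schedule above can be sketched as follows (an illustrative sketch; `train_one_epoch`, `error_rate`, and the `get_weights`/`set_weights` hooks are hypothetical stand-ins for the quasi-Newton training step and the network state):

```python
def train_with_validation(net, train_one_epoch, error_rate,
                          train_data, val_data, patience=5, max_epochs=200):
    """Train until the validation error stops improving; keep the weights
    from the epoch with the lowest validation error."""
    best_err, best_weights, since_best = float("inf"), net.get_weights(), 0
    for epoch in range(max_epochs):
        train_one_epoch(net, train_data)
        err = error_rate(net, val_data)
        if err < best_err:
            best_err, best_weights, since_best = err, net.get_weights(), 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error has bottomed out
                break
    net.set_weights(best_weights)        # roll back any overtraining
    return best_err
```

Because training error keeps falling after validation error bottoms out, the rollback step is what guards against the overtraining noted above.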
\nOnce the first network is trained, the second set of training data is generated by cycling deformed training data through the first network. After the pseudo-random tossing of a fair coin, if the coin is heads, deformed images are passed through the first network until the network makes a mistake. If tails, deformed images are passed through the network until the network produces a correct labeling. Each deformed image is generated from the original image by randomly selecting values of the constants k. It may require multiple passes through the training data to generate enough deformed images to form the second training set. \n\nRecall that the second training set will consist equally of images that the first network misclassifies and images that the first network classifies correctly. The total size of this training set is the same as that of the first. Correctly classified images are not hard to find if the error rate of the first network is low; however, we only accept these images with probability 50%. The choice of the range of the random variables k should be such that the deformed images do not look distorted. The choice of the range of the k's is good if the error rate of the first network on the deformed patterns is approximately the same as the error rate of the first network on the validation set (NOT the first training set). \n\nA second network is now trained on this new training set in the alternate train/test procedure, using the original validation set (not deformed) as the test set. Since this training data is much more difficult to learn than the first training data, typically the error rate of the second trained network on the second training set will be higher (sometimes much higher) than the error rates of the first network on either the first training set or the validation set. 
Also, the error rate on the validation set using the second network will be higher than that of the first network, because the second network is trying to generalize from difficult training data, 50% of which the first network could not recognize. \n\nThe third training set is formed by once again generating deformed images and presenting them to both the first and second networks. If the networks disagree (whether both are wrong or just one is), that image is added to the third training set. The third network is trained on this new training data and tested on the original validation set. Typically, the error rate on the validation set using the third network will be much higher than that of either of the first two networks on the same validation set. \n\nThe three networks are then tested on the third set of data, which is the smoothed test data. According to the original algorithm, we should observe the outputs of the first two networks: if the networks agree, accept that labeling; otherwise use the labeling assigned by the third network. However, we are interested in more than a low error rate. We have a second criterion, namely the percentage of patterns we have to reject (i.e., make no classification decision on) in order to achieve a 1% error rate. The rationale for this is that if an image recognizer is used to sort ZIP codes (or financial statements), it is much less expensive to hand sort some numbers than to accept all and send mail to the wrong address or credit the wrong account. From now on we shall call this latter criterion the reject rate (without appending each time the statement \"for a 1% error rate on the patterns not rejected\"). \n\nFor a single neural network, a reject criterion is to compare the two largest (of the ten or twenty-six) outputs of the network. If the difference is large, there is high confidence that the maximum output is the correct classification. 
Therefore, a critical threshold is set such that if the difference is smaller than that threshold, the image is rejected. The threshold is set so that the error rate on the patterns not rejected is 1%. \n\n5 RESULTS \n\nThe boosting algorithm was first used on a database consisting of segmented ZIP codes from the United States Postal Service (USPS), divided into 9709 training examples and 2007 validation samples. \n\nThe samples supplied to us from the USPS were machine segmented from ZIP codes and labeled, but not size normalized. The validation set consists of approximately 2% badly segmented characters (incomplete segmentations, decapitated fives, etc.). The training set was cleaned; thus the validation set is significantly more difficult than the training set. \n\nThe data was size normalized to fit inside a 16x16 array, centered, and deslanted. There is no third group of data called the \"test set\" in the sense described previously, even though the validation error rate has been commonly called the test error rate in prior work (LeCun et al., 1989, 1990). \n\nWithin the 9709 training digits are some machine printed digits, which have been found to improve performance on the validation set. This data set has an interesting history, having been around for three years with an approximate 5% error rate and 10% reject rate using our best neural network. There has been a slight improvement using double backpropagation (Drucker & LeCun, 1991), bringing the error rate down to 4.7% and the reject rate to 8.9%, but nothing dramatic. This network, which has a 4.7% error rate, was retrained on smoothed data by starting from the best set of weights. 
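The reject criterion just described (pick the smallest margin threshold at which the patterns not rejected have at most a 1% error rate) can be sketched as follows (an illustrative sketch; outputs are assumed to be lists of per-class scores, and the margin is the difference between the two largest):

```python
def reject_threshold(outputs, labels, target_error=0.01):
    """Smallest margin threshold such that accepting every pattern whose
    margin is at or above it yields an error rate <= target_error."""
    scored = []
    for out, label in zip(outputs, labels):
        order = sorted(range(len(out)), key=out.__getitem__, reverse=True)
        margin = out[order[0]] - out[order[1]]
        scored.append((margin, order[0] == label))
    scored.sort(reverse=True)            # most confident patterns first
    best = float("inf")                  # default: reject everything
    errors = 0
    for i, (margin, correct) in enumerate(scored, start=1):
        errors += not correct
        if errors / i <= target_error:   # accepting the top i patterns is OK
            best = margin
    return best

def reject_rate(outputs, threshold):
    """Fraction of patterns whose margin falls below the threshold."""
    rejected = 0
    for out in outputs:
        top2 = sorted(out, reverse=True)[:2]
        rejected += (top2[0] - top2[1]) < threshold
    return rejected / len(outputs)
```

The same two functions apply unchanged to the summed ensemble outputs, since summing simply produces one combined score vector per pattern.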
The second and third networks were trained as described previously, with the following key numbers: \n\nThe retrained first network has a training error rate of less than 1%, a test error rate of 4.9%, and a test reject rate of 11.5%. \n\nWe had to pass 153,000 deformed images (recycling the 9709 training set) through the trained first network to obtain another 9709 training images. Of these 9709 images, approximately one-half are patterns that the first network misclassifies. This means that the first network has a 3.2% error rate on the deformed images, far above the error rate on the original training images. \n\nA second network is trained and gives a 5.8% test error rate. \n\nTo generate the last training set we passed 195,000 patterns (again recycling the 9709) to give another set of 9709 training patterns. Therefore, the first two nets disagreed on 5% of the deformed patterns. \n\nThe third network is trained and gives a test error rate of 16.9%. \n\nUsing the original voting scheme for these three networks, we obtained a 4.0% error rate, a significant improvement over the 4.9% using one network. As suggested before, adding together the three outputs gives a method of rejecting images with low confidence scores (when the two highest outputs are too close). For curiosity, we also determined what would happen if we just added together the first two networks: \n\nOriginal network: 4.9% test error rate and 11.5% reject rate. \nTwo networks added: 3.9% test error rate and 7.9% reject rate. \nThree networks added: 3.6% test error rate and 6.6% reject rate. \n\nThe ensemble of three networks gives a significant improvement, especially in the reject rate. \n\nIn April of 1992, the National Institute of Standards and Technology (NIST) provided a labeled database of 220,000 digits, 45,000 lower case alphas, and 45,000 upper case alphas. 
We divided these into a training set, a validation set, and a test set. All data were resampled and size-normalized to fit into a 16x16 or 20x20 pixel array. For the digits, we deslanted and smoothed the data before retraining the first 16x16-input neural network used for the USPS data. After the second training set was generated and the second network trained, the results from adding the two networks together were so good (Table 1) that we decided not to generate the third training set. For the NIST data, the error rates reported are those on the test data. \n\nTABLE 1. Test error rate and reject rate in percent \n\nDATABASE                      USPS digits   NIST digits   NIST upper alphas   NIST lower alphas \nERROR RATE, SINGLE NET        5.0           1.4           4.0                 9.8 \nERROR RATE, USING BOOSTING    3.6           .8            2.4                 8.1 \nREJECT RATE, SINGLE NET       9.6           1.0           9.2                 29. \nREJECT RATE, USING BOOSTING   6.6           *             3.1                 21. \n\n* Reject rate is not reported if the error rate is below 1%. \n\n6 CONCLUSIONS \n\nIn all cases we have been able to boost performance above that of a single net. Although others have used ensembles to improve performance (Srihari, 1992; Benediktsson and Swain, 1992; Xu, et al., 1992), the technique used here is particularly straightforward, since the usual multi-classifier system requires a laborious development of each classifier. There is also a difference in emphasis. In the usual multi-classifier design, each classifier is trained independently and the problem is how best to combine the classifiers. In boosting, each network (after the first) has parameters that depend on the prior networks, and we know how to combine the networks (by voting or adding). \n\n7 ACKNOWLEDGEMENTS \n\nWe hereby acknowledge the United States Postal Service and the National Institute of Standards and Technology for supplying the databases. 
\nReferences \n\nJ.A. Benediktsson and P.H. Swain, \"Consensus Theoretic Classification Methods\", IEEE Trans. on Systems, Man, and Cybernetics, Vol. 22, No. 4, July/August 1992, pp. 688-704. \n\nH. Drucker, R. Schapire, and P. Simard, \"Boosting Performance in Neural Networks\", International Journal of Pattern Recognition and Artificial Intelligence, (to be published, 1993). \n\nH. Drucker and Y. LeCun, \"Improving Generalization Performance in Character Recognition\", Proceedings of the 1991 IEEE Workshop on Neural Networks for Signal Processing, IEEE Press, pp. 198-207. \n\nY. LeCun, et al., \"Backpropagation Applied to Handwritten Zip Code Recognition\", Neural Computation 1, 1989, pp. 541-551. \n\nY. LeCun, et al., \"Handwritten Digit Recognition with a Back-Propagation Network\", In D.S. Touretzky (ed), Advances in Neural Information Processing Systems 2, (1990) pp. 396-404, San Mateo, CA: Morgan Kaufmann Publishers. \n\nG. L. Martin and J. A. Pitman, \"Recognizing Hand-Printed Letters and Digits Using Backpropagation Learning\", Neural Computation, Vol. 3, 1991, pp. 258-267. \n\nR. Schapire, \"The Strength of Weak Learnability\", Machine Learning, Vol. 5, No. 2, 1990, pp. 197-227. \n\nP. Simard, et al., \"Tangent Prop - A formalism for specifying selected invariances in an adaptive network\", In J.E. Moody, S.J. Hanson, and R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, (1992) pp. 895-903, San Mateo, CA: Morgan Kaufmann Publishers. \n\nSargur Srihari, \"High-Performance Reading Machines\", Proceedings of the IEEE, Vol. 80, No. 7, July 1992, pp. 1120-1132. \n\nC.Y. Suen, et al., \"Computer Recognition of Unconstrained Handwritten Numerals\", Proceedings of the IEEE, Vol. 80, No. 7, July 1992, pp. 1162-1180. \n\nL. Xu, et al., \"Methods of Combining Multiple Classifiers\", IEEE Trans. on Systems, Man, and Cybernetics, Vol. 22, No. 3, May/June 1992, pp. 418-435. \n", "award": [], "sourceid": 593, "authors": [{"given_name": "Harris", "family_name": "Drucker", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}, {"given_name": "Patrice", "family_name": "Simard", "institution": null}]}