{"title": "Recognizing Hand-Printed Letters and Digits", "book": "Advances in Neural Information Processing Systems", "page_first": 405, "page_last": 414, "abstract": null, "full_text": "Recognizing Hand-Printed Letters and Digits \n\n405 \n\nRecognizing Hand-Printed Letters and Digits \n\nGale L. Martin \n\nJames A. Pittman \n\nMCC, Austin, Texas 78759 \n\nABSTRACT \n\nWe are developing a hand-printed character recognition system using a multi(cid:173)\nlayered neural net trained through backpropagation. We report on results of \ntraining nets with samples of hand-printed digits scanned off of bank checks \nand hand-printed letters interactively entered into a computer through a sty(cid:173)\nlus digitizer. Given a large training set, and a net with sufficient capacity to \nachieve high performance on the training set, nets typically achieved error \nrates of 4-5% at a 0% reject rate and 1-2% at a 10% reject rate. The topology \nand capacity of the system, as measured by the number of connections in the \nnet, have surprisingly little effect on generalization. For those developing \npractical pattern recognition systems, these results suggest that a large and \nrepresentative training sample may be the single, most important factor in \nachieving high recognition accuracy. From a scientific standpoint, these re(cid:173)\nsults raise doubts about the relevance to backpropagation of learning models \nthat estimate the likelihood of high generalization from estimates of capacity. \nReducing capacity does have other benefits however, especially when the re(cid:173)\nduction is accomplished by using local receptive fields with shared weights. \nIn this latter case, we find the net evolves feature detectors resembling those \nin visual cortex and Linsker's orientation-selective nodes. 
\n\nPractical interest in hand-printed character recognition is fueled by two current tech(cid:173)\nnology trends: one toward systems that interpret hand-printing on hard-copy docu(cid:173)\nments and one toward notebook-like computers that replace the keyboard with a stylus \ndigitizer. The stylus enables users to write and draw directly on a flat panel display. \nIn this paper, we report on results applying multi-layered neural nets trained through \nbackpropagation (Rumelhart, Hinton, & Williams, 1986) to both cases. \n\nDeveloping pattern recognition systems is typically a two-stage process. First, intuition \nand experimentation are used to select a set of features to represent the raw input pat(cid:173)\ntern. Then a variety of well-developed techniques are used to optimize the classifier \nsystem that assumes this featural representation. Most applications of backpropaga(cid:173)\ntion learning to character recognition use the learning capabilities only for this latter \n\n\f406 Martin and Pittman \n\nstage--developing the classifier system (Burr, 1986; Denker, Gardner, Graf, Hender(cid:173)\nson, Howard, Hubbard, Jackel, Baird, & Guyon, 1989; Mori & Yokosawa, 1989; Weide(cid:173)\nman, Manry, & Yau, 1989). However, backpropagation learning affords the opportuni(cid:173)\nty to optimize feature selection and pattern classification simultaneously. We avoid \nusing pre-determined features as input to the net in favor of using a pre- segmented, \nsize-normalized grayscale array for each character. This is a first step toward the goal \nof approximating the raw input projected onto the human retina, in that no pre-proces(cid:173)\nsing of the input is required. \n\nWe report on results for both hand-printed digits and letters. The hand-printed digits \ncome from a set of 40,000 hand-printed digits scanned from the numeric amount region \nof \"real-world\" bank checks. They were pre-segmented and size-normalized to a \n15x24 grayscale array. 
The test set consists of 4,000 samples and training sets varied from 100 to 35,200 samples. Although it is always difficult to compare recognition rates arising from different pattern sets, some appreciation for the difficulty of categorization can be gained using human performance data as a benchmark. An independent person categorizing the test set of pre-segmented, size-normalized digits achieved an error rate of 3.4%. This figure is considerably below the near-perfect performance of operators keying in numbers directly from bank checks, because the segmentation algorithm is flawed.

Working with letters, as well as digits, enables tests of the generality of results on a different pattern set having more than double the number of output categories. The hand-printed letters come from a set of 8,600 upper-case letters collected from over 110 people writing with a stylus input device on a flat panel display. The stylus collects a sequence of x-y coordinates at 200 points per second at a spatial resolution of 1000 points per inch. The temporal sequence for each character is first converted to a size-normalized bitmap array, keeping aspect ratio constant. We have found that recognition accuracy is significantly improved if these bitmaps are blurred through convolution with a gaussian distribution. Each pattern is represented as a 15x24 grayscale image. A test set of 2,368 samples was extracted by selecting samples from 18 people, so that training sets were generated by people different from those generating the test set. Training set sizes ranged from 500 to roughly 6,300 samples.

1 HIGH RECOGNITION ACCURACY

We find relatively high recognition accuracy for both pattern sets. Table 1 (see footnote 1) reports the minimal error rates achieved on the test samples for both pattern sets, at various reject rates. In the case of the hand-printed digits, the 4% error rate (0% rejects) approaches the 3.4% errors made by the human judge. This suggests that further improvements to generalization will require improving segmentation accuracy. The fact that an error rate of 5% was achieved for letters is promising. Accuracy is fairly high, even though there are a large number of categories (26). This error rate may be adequate for applications where contextual constraints can be used to significantly boost accuracy at the word-level.

Table 1: Error rates of best nets trained on largest sample sets and tested on new samples

  REJECT RATE   DIGITS   LETTERS
  0%            4%       5%
  5%            3%       3%
  10%           1%       2%
  35%           .001%    .003%

1. Effects of the number of training samples and network capacity and topology are reported in the next section. Nets were trained to error rates of 2-3%. Training began with a learning rate of .05 and a momentum value of .9. The learning rate was decreased when training accuracy began to oscillate or had stabilized for a large number of training epochs. We evaluate the output vector on a winner-take-all basis, as this consistently improves accuracy and results in network parameters having a smaller effect on performance.

2 MINIMAL NETWORK CAPACITY AND TOPOLOGY EFFECTS

The effects of network parameters on generalization have both practical and scientific significance. The practical developer of pattern recognition systems is interested in such effects to determine whether limited resources should be spent on trying to optimize network parameters or on collecting a larger, more representative training set. For the scientist, effects of capacity bear on the relevance of learning models to backpropagation.
\n\nA central premise of most general models of learning-by-example is that the size of \nthe initial search space-the capacity of the system-determines the number of train(cid:173)\ning samples needed to achieve high generalization performance. Learning is conceptu(cid:173)\nalized as a search for a function that maps all possible inputs to their correct outputs. \nLearning occurs by comparing successive samples of input-output pairs to functions \nin a search space. Functions inconsistent with training samples are rejected. Very large \ntraining sets narrow the search down to a function that closely approximates the de(cid:173)\nsired function and yields high generalization. The capacity of a learning system-the \nnumber of functions it can represent--determines generalization, since a larger initial \nsearch space requires more training samples to narrow the search sufficiently . This \nsuggests that to improve generalization, capacity should be minimized. Unfortunately, \nit is typically unclear how to minimize capacity without eliminating the desired function \nfrom the search space. A heuristic, which is often suggested, is that simple is usually \nbetter. It receives support from experience in curve fitting. Low-order polynomials typ(cid:173)\nically extrapolate and interpolate better than high-order polynomials (Duda & Hart, \n1973). \n\nExtensions of the heuristic to neural net learning propose reducing capacity by reduc(cid:173)\ning the number of connections or the number of bits used to represent each connection \n\n\f408 Martin and Pittman \n\nweight (Baum & Haussler, 1989; Denker, Schwartz, Wittner, Solla, Howard, Jackel, \n& Hopfield,1987). We manipulated the capacity of nets in a number of ways: 1) varying \nthe number of hidden nodes, 2) limiting connectivity between layers so that nodes re(cid:173)\nceived input from only local areas, and 3) sharing connection weights between hidden \nnodes. We found only negligible effects on generalization. 
\n\n2.1 NUMBER OF HIDDEN NODES \n\nFigure 1 presents generalization results as a function of training set size for nets having \none hidden layer and varying numbers of hidden nodes. The number of free parameters \n(i.e., number of connections and biases) in each case is presented in parentheses. De(cid:173)\nspite considerable variation in the number of free parameters, using nets with fewer \nhidden nodes did not improve generalization. \n\nBaum & Haussler (1989) estimate the number of training samples required to achieve \nan error rate e (where 0 < e ~ 1/8) on the generalization test, when an error rate \nof el2 has been achieved on the training set. They assume a feed-forward net with \none hidden layer and W connections. The estimates are distribution-free in the sense \nthat calculations assume an arbitrary to-be-learned function. If the number of training \nlog ~ ,where N refers to the number of nodes, then it is a near \nsamples is of order : \ncertainty that the net will achieve generalization rates of (1 - e). This estimate is the \nupper-bound on the number of training samples needed. They also provide a lower \n\nDigits \n\nLetters \n\n100 \n\n100 \n\n-\n\n~ \nt::: 75 \n0 \nu \n~ \n\n. / \n\n,';/ \n\n~ \n\nNumber of Hidden Nodes \n\n50 \n\n(18,560) \n\n(63,080) \n170 \n383 (142,103) \n\n50 \n\n100 \n\n1000 \n\n10000 100000 \n\nTraining Set Size \n\n, \n.;/ \nI \nI \n, \nI \nI \n/ Number of Hidden Nodes \n, \n\n170 \n365 \n\n(65,816) \n(141,281) \n\n1000 10000 100000 \n\n'll'aining Set Size \n\n75 \n\n50 \n\n100 \n\nFigure 1. Effect of number of hidden nodes and training set size on generalization. \n\n\fRecognizing Hand-Printed Letters and Digits \n\n409 \n\nbound estimate, on the order of W/e. Using fewer than this number of samples will, \nfor some functions, fail to achieve generalization rates of (1- e). 
The fact that we find no advantage to reducing the number of connections conflicts with Baum & Haussler's estimates and the underlying assumption that capacity plays a strong role in determining generalization.

Baum & Haussler also suggest using a constant of proportionality of 1 in their estimates. This implies that achieving error rates of 10% or less on new samples requires about 10 times as many training examples as there are connection weights in the net. For our largest nets, this implies a requirement of roughly a million training samples, which most developers would regard as prohibitively large. We found that about 5,000 samples were sufficient. Thus, a sufficiently large training sample does not imply a prohibitively large sample, at least for character recognition. We find that sample sizes of the order of thousands to tens of thousands yield performance very close to human levels. One reason for the discrepancy is that Baum & Haussler's estimates are distribution-free in the sense that they reflect worst-case scenarios across all possible functions the net might learn. Presumably, the functions underlying most natural pattern recognition tasks are not representative of the set of all possible functions. These results raise doubts about the relevance to natural pattern recognition of learning models based on worst-case analyses, because content may greatly impact generalization.

2.2 LOCAL CONNECTIVITY AND SHARED WEIGHTS

A more biologically plausible way to reduce capacity is to limit connectivity between layers to local areas and to use shared weights. For example, visual cortex contains neurons, each of which is responsive to a feature such as an oriented line appearing in a small, local region on the retina (Hubel & Wiesel, 1979).
A given oriented line-detector is essentially replicated across the visual field, so that the same feature can be detected wherever it appears. In this sense, the connections feeding into an oriented-line detector are shared across all similar line-detectors for different areas of the visual field. In an artificial neural net, local structure is achieved by limiting connectivity. A given hidden node receives input from only local areas in the input or hidden layer preceding it. Weight sharing is achieved by linking the incoming weights across a set of hidden nodes. Corresponding weights leading into these nodes are randomly initialized to the same values and forced to have equivalent updates during learning. In this way the net evolves the same local feature set that is invariant across the input array. Several demonstrations exist indicating that local connectivity and shared weights improve generalization performance in tasks where position invariance is required (le Cun, 1989; Rumelhart, Hinton, & Williams, 1986).

We examined the benefits of using local receptive fields with shared weights for hand-printed character recognition, where position invariance was not required. This does not minimize the importance of position invariance. However, it is only one of many necessary invariants underlying reliable pattern recognition. Unfortunately, most of these invariants have not been explicitly specified, so we don't know how to bias a net toward discovering them. Testing the role of local receptive fields with shared weights in situations where position invariance is not required is relevant to discovering whether these constraints have a role other than in promoting position invariance.
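The tied-weight scheme just described (identical initialization plus equivalent updates across a set of hidden nodes) can be sketched in a few lines. Averaging the per-copy gradients is one common way to enforce equivalent updates, and is an assumption here, since the text does not say whether updates are summed or averaged:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ten hidden nodes share one 5x8 receptive-field template (40 weights).
n_copies, field_size = 10, 5 * 8
template = rng.normal(scale=0.1, size=field_size)
tied = np.tile(template, (n_copies, 1))   # corresponding weights start equal

def shared_update(tied_weights, per_copy_grads, lr=0.05):
    """Average the gradient over all copies, then apply the same step
    to each copy, so the tied weights remain equal throughout learning."""
    step = lr * per_copy_grads.mean(axis=0)
    tied_weights -= step                   # broadcast: same step per row
    return tied_weights

per_copy_grads = rng.normal(size=(n_copies, field_size))
tied = shared_update(tied, per_copy_grads)
```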
\n\nAs indicated in Figure 2, we find only slightly improved generalization in moving from \nnets with global connectivity between layers to nets with local receptive fields or to nets \nwith local receptive fields and shared weights. This is true despite the fact that the \nnumber of free parameters is substantially reduced. The positive effects that do occur \nare at relatively small training set sizes. This may explain why others have reported \na greater degree of improved generalization by using local receptive fields (Honavar \n& Uhr, 1988). The data reported are for networks with two hidden layers. Global nets \nhad 150 nodes in the first layer and 50 nodes in the second. In the Local nets, first hid(cid:173)\nden layer nodes (540) received input from 5x8local and overlapping regions (offset by \n2 pixels) on the input array. Second hidden layer nodes (100) and output layer nodes \nhad global receptive fields. The Local. Shared nets had 540 nodes in the first hidden \nlayer with shared weights and, at the second hidden layer, either 102 (digits) or 180 (let(cid:173)\nters) nodes with local, overlapping, and shared receptive fields of size 4x6 on the 1st \n\n100 \n\nDigits \n\nLetters \n\n100 \n\n~ , \n\nh \n\n, ,# \n\n. ' .,f \n\n..... u \nQ) \nt::: \n0 u \n~ \n\n75 \n\n75 \n\n- - Global \nLocal \nLocal, \nShared \n\n(62.210) \n(77.250) \n\n( 4,857) \n\n1000 \n\n10000 100000 \n\nTraining Set Size \n\n50 \n\n100 \n\nGlobal (63,026) \n(78,866) \nLocal \nLocal, \n1 6 \nShared (11, 4 ) \n50 +-~:;:::==::::;::::==:::::;::::::. \n\n100 \n\n1000 10000 100000 \n\nTraining Set Size \n\nFigure 2. \n\nEffects of net capacity and topology on generalization. \n\nhidden layer. We have experimented with a large variety of different net architectures \nof this sort, varying the number of hidden nodes, the sizes and overlap of local receptive \nfields, and the use of local receptive fields with and without shared weights in one or \nboth hidden layers. 
The fact that we've found little difference in generalization for two different pattern sets across such variations in network architectures argues for the generality of the results.

2.3 DISCUSSION

Given an architecture that enables relatively high training performance, we find only small effects of network capacity and topology on generalization performance. A large training set yields relatively high recognition accuracy in a robust way across most net architectures with which we've worked. These results suggest some practical advice to those developing hand-printed character recognition systems. If optimizing generalization performance is the goal, it is probably better to devote limited resources to collecting a very large, representative training set than to extensive experimentation with different net architectures. The variations in net capacity and topology we've examined do not substantially affect generalization performance for sufficiently large training sets. Sufficiently large should be interpreted as on the order of a thousand to tens of thousands of samples for hand-printed character recognition.

From a theoretical standpoint, the negligible effects of network capacity on generalization performance contradict the central premise of machine learning that the size of the initial hypothesis space determines learning performance. This challenges the relevance, to backpropagation learning, of statistical models that estimate likelihood of high generalization performance from estimates of capacity. Due to the gradient descent nature of backpropagation learning, not all functions that can be represented will be visited during learning.
The negligible effects of capacity suggest that the number of functions visited during learning constitutes only a very small percentage of the total possible functions that can be represented.

There are a number of reasons for believing that capacity might impact generalization performance in other circumstances. We regularly train only to 2-3% error rates. This helps to avoid the possibility of overfitting the data, although we have seen no indication of this when we have trained to higher levels, as long as we use large training sets. It is also possible that the number of connections is not a good measure of capacity. For example, the amount of information that can be passed on by a given connection may be a better measure than the number of connections. At this conference, le Cun, Denker, Solla, Howard, & Jackel have also presented evidence that removing unimportant weights from a network may be a better way to reduce capacity. However, the fact that generalization rates come very close to human accuracy levels, even for nets with extremely large numbers of free parameters, suggests that general effects of net capacity and topology are, at best, small in comparison to effects of training set size. We don't deny that there are likely to be net topologies that push performance up to human accuracy levels, presumably by biasing the net toward discovering the range of invariants that underlie human pattern recognition. The problem is that only a few of these invariants have been explicitly specified (e.g., position, size, rotation), and so it is not possible to bias a net toward discovering the full range.

3 ADVANTAGES OF REDUCING CAPACITY

Although reducing gross indicators of capacity may not significantly improve generalization, there are good practical and scientific reasons for doing it.
A good reason to reduce the number of connections is to speed processing. Also, using local receptive fields with shared weights biases a net toward position invariance, and toward producing a simpler, more modular internal representation which can be replicated across a large retina. This has important implications for developing nets that combine character segmentation with recognition.

Using local receptive fields with shared weights also offers promise for increasing our understanding of how the net correctly classifies patterns, because the number of distinct receptive fields is greatly reduced. Figure 3 depicts Hinton diagrams of local receptive fields from 1st hidden layer nodes in nets with shared weights trained on digits or letters. Each of the eight large, gray rectangles corresponds to the receptive field for a hidden node. The four on the left came from a net trained on digits; those on the right from a net trained on letters. Within each of these eight, the black rectangles correspond to negative weights and the white to positive weights. The size of the black and white rectangles reflects the magnitude of the weights.

[Figure 3. Receptive fields that evolved in 1st hidden layer nodes in nets with local receptive fields having shared weights. Left panel: Digits; right panel: Letters.]

The local feature detectors that develop for both pattern sets appear to be oriented line and edge detectors. These are similar to oriented line and edge detectors found in visual cortex (Hubel & Wiesel, 1979) and to Linsker's (1986, 1988) orientation-selective nodes, which emerge from a self-adaptive net exposed to random patterns. In Linsker's case, the feature detectors develop as an emergent property of the principle that the signal transformation occurring from one layer to the next should maximize the information that output signals convey about input signals.
The fact that similar feature detectors emerge in backpropagation nets trained on "natural" patterns is interesting because there were no explicit constraints to maximize information flow between layers in the backpropagation nets and because categorization is typically viewed as an abstraction process involving considerable loss of category-irrelevant information.

References

Baum, E. and Haussler, D. (1989) What size net gives valid generalization? in D. S. Touretzky (Ed.) Advances in neural information processing systems 1, Morgan Kaufmann.

Burr, D. J. (1986) A neural network digit recognizer. Proceedings of the 1986 International Conference on Systems, Man and Cybernetics, Atlanta, Georgia. pp. 1621-1625.

Denker, J. S., Gardner, W. R., Graf, H. P., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., Baird, H. S., and Guyon, I. (1989) Neural network recognizer for hand-written zip code digits in D. S. Touretzky (Ed.) Advances in neural information processing systems 1, Morgan Kaufmann.

Denker, J. S., Schwartz, D., Wittner, B., Solla, S., Howard, R. E., Jackel, L. D., & Hopfield, J. (1987) Large automatic learning, rule extraction and generalization. Complex Systems, 1, pp. 877-933.

Duda, R. O., and Hart, P. E. (1973) Pattern classification and scene analysis. NY: John Wiley & Sons.

Honavar, V. and Uhr, L. (1988) Experimental results indicate that generation, local receptive fields and global convergence improve perceptual learning in connectionist networks. CS-TR 805. Computer Science Department, University of Wisconsin-Madison.

Hubel, D. H. and Wiesel, T. N. (1979) Brain mechanisms of vision. Scientific American, 241, pp. 150-162.

le Cun, Y. (1989) Generalization and network design strategies. Technical Report CRG-TR-89-4, Department of Computer Science, University of Toronto.
\n\nLinsker, R. (1986) From basic network principles to neural architecture; Emergence \n\nof orientation-selective cells. Proceedings of the National Academy of Sciences, \nUSA, 83, pp. 8390-8394. \n\nLinsker, R. (1988) Thwards an organizing principle for a layered perceptual network \nin D. Anderson (Ed.) Neural information processing systems. American Institute \nof Physics. \n\n\f414 Martin and Pittman \n\nMori, Y. and Yokosawa, K. (1989) Neural networks that learn to discriminate similar \n\nkanji characters. in. D. S. Touretzky (Ed.) Advances in neural information \nprocessing systems I, Morgan Kaufman. \n\nRumelhart, D. E., Hinton, O. E., & Williams, R. J. Learning internal representations \nby error propagation in D. E. Rumelhart & J. L. McClelland (Editors) Parallel \ndistributed processing: V. 1. Cambridge, Mass.: MIT Press, 1986 \n\nWeideman, W. E., Manry, M T. & Yau, H. C. (1989) A comparison of a nearest \n\nneighbor classifier and a neural network for numeric handprint character \nrecognition. IEEE International Conference on Neural Networks, Washington, \nD. c., 1989. \n\nAcknowledgements \n\nWe would like to thank the NCR corporation for loaning us the set of hand-printed \ndigits and Joyce Conner, Janet Kilgore, and Kay Bauer for their invaluable help in col(cid:173)\nlecting the set of hand-printed letters. \n\n\f", "award": [], "sourceid": 267, "authors": [{"given_name": "Gale", "family_name": "Martin", "institution": null}, {"given_name": "James", "family_name": "Pittman", "institution": null}]}