{"title": "Effective Training of a Neural Network Character Classifier for Word Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 807, "page_last": 816, "abstract": null, "full_text": "Effective Training of a Neural Network \n\nCharacter Classifier for Word Recognition \n\nLarry Yaeger \nApple Computer \n\n5540 Bittersweet Rd. \n\nMorgantown, IN 46160 \n\nlarryy@apple.com \n\nRichard Lyon \nApple Computer \n\n1 Infinite Loop, MS301-3M \n\nCupertino, CA 95014 \n\nlyon@apple.com \n\nBrandyn Webb \n\nThe Future \n\n4578 Fieldgate Rd. \n\nOceanside, CA 92056 \n\nbrandyn@brainstorm.com \n\nAbstract \n\nWe have combined an artificial neural network (ANN) character \nclassifier with context-driven search over character segmentation, word \nsegmentation, and word recognition hypotheses to provide robust \nrecognition of hand-printed English text in new models of Apple \nComputer's Newton MessagePad. We present some innovations in the \ntraining and use of ANNs al; character classifiers for word recognition, \nincluding normalized output error, frequency balancing, error emphasis, \nnegative training, and stroke warping. A recurring theme of reducing a \npriori biases emerges and is discussed. \n\n1 INTRODUCTION \nWe have been conducting research on bottom-up classification techniques ba<;ed on \ntrainable artificial neural networks (ANNs), in combination with comprehensive but \nweakly-applied language models. To focus our work on a subproblem that is tractable \nenough to le.:'ld to usable products in a reasonable time, we have restricted the domain to \nhand-printing, so that strokes are clearly delineated by pen lifts. \nIn the process of \noptimizing overall performance of the recognizer, we have discovered some useful \ntechniques for architecting and training ANNs that must participate in a larger recognition \nprocess. 
Some of these techniques, especially the normalization of output error, frequency balancing, and error emphasis, suggest a common theme: significant value is derived by reducing the effect of a priori biases in training data, to better represent low-frequency, low-probability samples, including second and third choice probabilities.

There is ample prior work in combining low-level classifiers with various search strategies to provide integrated segmentation and recognition for writing (Tappert et al 1990) and speech (Renals et al 1992). And there is a rich background in the use of ANNs as classifiers, including their use as a low-level character classifier in a higher-level word recognition system (Bengio et al 1995). But many questions remain regarding optimal strategies for deploying and combining these methods to achieve acceptable (to a real user) levels of performance. In this paper, we survey some of our experiences in exploring refinements and improvements to these techniques.

2 SYSTEM OVERVIEW

Our recognition system, the Apple-Newton Print Recognizer (ANPR), consists of three conceptual stages: Tentative Segmentation, Classification, and Context-Driven Search. The primary data upon which we operate are simple sequences of (x,y) coordinate pairs,

808    L. Yaeger, R. Lyon and B. Webb

plus pen-up/down information, thus defining stroke primitives. The Segmentation stage decides which strokes will be combined to produce segments (the tentative groupings of strokes that will be treated as possible characters) and produces a sequence of these segments together with legal transitions between them. This process builds an implicit graph which is then scored in the Classification stage and examined for a maximum likelihood interpretation in the Search stage. 
[Figure 1: block diagram with stages labeled "(x,y) Points & Pen-Lifts", "Segmentation Hypotheses", "Neural Network Classifier", "Character Class Hypotheses", and "Words".]

Figure 1: A Simplified Block Diagram of Our Hand-Print Recognizer.

3 TRAINING THE NEURAL NETWORK CLASSIFIER

Except for an integrated multiple-representations architecture (Yaeger et al 1996) and the training specifics detailed here, a fairly standard multi-layer perceptron trained with BP provides the ANN character classifier at the heart of ANPR. Training an ANN character classifier for use in a word recognition system, however, has different constraints than would training such a system for stand-alone character recognition. All of the techniques below, except for the annealing schedule, at least modestly reduce individual character recognition accuracy, yet dramatically increase word recognition accuracy.

A large body of prior work exists to indicate the general applicability of ANN technology as classifiers providing good estimates of a posteriori probabilities of each class given the input (Gish 1990, Richard and Lippmann 1991, Renals and Morgan 1992, Lippmann 1994, Morgan and Bourlard 1995, and others cited herein).

3.1 NORMALIZING OUTPUT ERROR

Despite their ability to provide good first choice a posteriori probabilities, we have found that ANN classifiers do a poor job of representing second and third choice probabilities when trained in the classic way: minimizing mean squared error for target vectors that are all 0's, except for a single 1 corresponding to the target class. This results in erratic word recognition failures as the net fails to accurately represent the legitimate ambiguity between characters. 
We speculated that reducing the "pressure towards 0" relative to the "pressure towards 1" as seen at the output units, and thus reducing the large bias towards 0 in target vectors, might permit the net to better model these inherent ambiguities.

We implemented a technique for "normalizing output error" (NormOutErr) by reducing the BP error for non-target classes relative to the target class, by a factor that normalizes the total non-target error seen at a given output unit relative to the total target error seen at that unit. Assuming a training set with equal representation of classes, this normalization should then be based on the number of non-target versus target classes in a typical training vector, or, simply, the number of output units (minus one). Hence for non-target output units, we scale the error at each unit by a constant:

    e' = A e

where e is the error at an output unit, and A is defined to be:

    A = 1 / [d (N_outputs - 1)]

where N_outputs is the number of output units, and d is a user-adjusted tuning parameter, typically ranging from 0.1 to 0.2. Error at the target output unit is unchanged. Overall, this raises the activation values at the output units, due to the reduced pressure towards zero, particularly for low-probability samples. Thus the learning algorithm no longer converges to a minimum mean-squared error (MMSE) estimate of P(class|input), but to an MMSE estimate of a nonlinear function f(P(class|input), A) depending on the factor A by which we reduced the error pressure toward zero.

Using a simple version of the technique of Bourlard and Wellekens (1990), we worked out what that resulting nonlinear function is. 
The net will attempt to converge to minimize the modified quadratic error function

    <e^2> = p (1 - y)^2 + A (1 - p) y^2

by setting its output y for a particular class to

    y = p / (A - A p + p)

where p = P(class|input), and A is as defined above. The inverse function is

    p = y A / (y A + 1 - y)

We verified the fit of this function by looking at histograms of character-level empirical percentage-correct versus y, as in Figure 2.

[Figure 2: histogram of empirical percent-correct p versus net output y, with the theoretical curve overlaid.]

Figure 2: Empirical p vs. y Histogram for a Net Trained with A=0.11 (d=0.1), with the Corresponding Theoretical Curve.

Note that the lower-probability samples have their output activations raised significantly, relative to the 45° line that A = 1 yields.

The primary benefit derived from this technique is that the net does a much better job of representing second and third choice probabilities, and low probabilities in general. Despite a small drop in top choice character accuracy when using NormOutErr, we obtain a very significant increase in word accuracy by this technique. Figure 3 shows an exaggerated example of this effect, for an atypically large value of d (0.8), which overly penalizes character accuracy; however, the 30% decrease in word error rate is normal for this technique. (Note: These data are from a multi-year-old experiment, and are not necessarily representative of current levels of performance on any absolute scale.)

[Figure 3: bar chart of percent error, comparing character error and word error at NormOutErr d = 0.0 and d = 0.8.]

Figure 3: Character and Word Error Rates for Two Different Values of NormOutErr (d). A Value of 0.0 Disables NormOutErr, Yielding Normal BP. The Unusually High Value of 0.8 (A=0.013) Produces Nearly Equal Pressures Towards 0 and 1.
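As a concrete illustration, the error scaling and the output-to-probability mapping above can be sketched in Python. This is our own illustrative code, not the original implementation; the function names and the MSE error convention (err = target - y) are assumptions.

```python
import numpy as np

def norm_out_err_delta(y, target, d=0.1):
    """Output-layer error term for backprop with normalized output error.

    y: output activation vector; target: one-hot target vector;
    d: user-adjusted tuning parameter (typically 0.1 to 0.2).
    Non-target error is scaled by A = 1 / [d (N_outputs - 1)];
    error at the target unit is unchanged.
    """
    n_outputs = y.shape[0]
    A = 1.0 / (d * (n_outputs - 1))
    err = target - y                          # standard MSE error term
    scale = np.where(target == 1.0, 1.0, A)   # shrink only non-target error
    return scale * err

def posterior_from_output(y, A):
    """Recover p = P(class|input) from the net's output y by inverting
    y = p / (A - A*p + p), i.e. p = y*A / (y*A + 1 - y)."""
    return y * A / (y * A + 1.0 - y)
```

For example, with 95 output units (our assumption about the class count) and d = 0.1, A = 1/(0.1 * 94) ≈ 0.11, consistent with Figure 2; with A = 1 the mapping reduces to p = y, the standard MMSE case.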
3.2 FREQUENCY BALANCING

Training data from natural English words and phrases exhibit very non-uniform priors for the various character classes, and ANNs readily model these priors. However, as with NormOutErr, we find that reducing the effect of these priors on the net, in a controlled way, and thus forcing the net to allocate more of its resources to low-frequency, low-probability classes, is of significant benefit to the overall word recognition process. To this end, we explicitly (partially) balance the frequencies of the classes during training. We do this by probabilistically skipping and repeating patterns, based on a precomputed repetition factor. Each presentation of a repeated pattern is "warped" uniquely, as discussed later.

To compute the repetition factor for a class i, we first compute a normalized frequency of that class:

    F_i = S_i / S_avg

where S_i is the number of samples in class i, and S_avg is the average number of samples over all classes, computed in the obvious way:

    S_avg = (1/C) sum_{i=1..C} S_i

with C being the number of classes. Our repetition factor is then defined to be:

    R_i = (a / F_i)^b

with a and b being user controls over the amount of skipping vs. repeating and the degree of prior normalization, respectively. Typical values of a range from 0.2 to 0.8, while b ranges from 0.5 to 0.9. The factor a < 1 lets us do more skipping than repeating; e.g., for a = 0.5, classes with relative frequency equal to half the average will neither skip nor repeat; more frequent classes will skip, and less frequent classes will repeat. A value of 0.0 for b would do nothing, giving R_i = 1.0 for all classes, while a value of 1.0 would provide "full" normalization. A value of b somewhat less than one seems to be the best choice, letting the net keep some bias in favor of classes with higher prior probabilities. 
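The skip/repeat scheme can be sketched as follows. This is illustrative code of ours, not the shipping implementation; in particular, realizing a non-integer R_i as a whole part plus a probabilistic fractional repetition is our assumption.

```python
import random

def repetition_factors(class_counts, a=0.5, b=0.7):
    """R_i = (a / F_i)^b with F_i = S_i / S_avg, per the formulas above."""
    C = len(class_counts)
    s_avg = sum(class_counts) / C
    return [(a / (s / s_avg)) ** b for s in class_counts]

def emit_with_balancing(patterns, factors, rng=random.random):
    """Yield each (class, pattern) roughly R_i times: R_i < 1 skips
    probabilistically, R_i > 1 repeats (each copy would be warped uniquely)."""
    for cls, pattern in patterns:
        r = factors[cls]
        reps = int(r) + (1 if rng() < r - int(r) else 0)
        for _ in range(reps):
            yield cls, pattern
```

With a = 0.5 and b = 1, a class at half the average frequency gets R = 1 and is neither skipped nor repeated, matching the behavior described in the text.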
\n\nThis explicit prior-bias reduction is related to Lippmann's (1994) and Morgan and \nBourlard's (1995) recommended method for converting from the net's estimate of \nposterior probability, p(classlinput), to the value needed in an HMM or Viterbi search, \np(inputlclass), which is to divide by p(class) priors. Besides eliminating potentially noisy \nestimates of low probability cla<;se.<; and a possible need for renormalization, our approach \nforces the net to actually 1e:'lfI1 a better model of these lower frequency classes. \n\n3.3 ERROR EMPHASIS \n\nWhile frequency balancing corrects for under-represented classes, it cannot account for \nunder-represented writing styles. We utilize a conceptually related probabilistic skipping \nof patterns, but this time for just those patterns that the net correctly classifies in it.<; \nforward/recognition pa<;s, as a form of \"error empha<;is\", to address this problem. We \ndefine a correct-train probability (0.1 to 1.0) that is used a<; a bia'led coin to determine \nwhether a particular pattern, having been correctly classified, will also be used for the \nbackward/training pass or not. This only applies to correctly segmented, or \"positive\" \npatterns, and miscla<;sified patterns are never skipped. \n\nEspecially during early stages of training, we set this parruneter fairly low (around 0.1), \nthus concentrating most of the training time and the net's learning capability on patterns \nthat are more difticult to correctly classify. This is the only way we were able to get the \nnet to learn to correctly classify unusual character variants, such a') a 3-stroke \"5\" as \nwritten by only one training writer. \n\n\fEffective Training of a NN Character Classifier for Word Recognition \n\n811 \n\nVariants of this scheme are possible in which mLliclalisified patterns would be repeated, or \ndifferent learning rates would apply to correctly and incorrectly classified patterns. 
It is also related to techniques that use a training subset, from which easily-classified patterns are replaced by randomly selected patterns from the full training set (Guyon et al 1992).

3.4 NEGATIVE TRAINING

Our recognizer's tentative segmentation stage necessarily produces a large number of invalid segments, due to inherent ambiguities in the character segmentation process. During recognition, the network must classify these invalid segments just as it would any valid segment, with no knowledge of which are valid or invalid. A significant increase in word-level recognition accuracy was obtained by performing negative training with these invalid segments. This consists of presenting invalid segments to the net during training, with all-zero target vectors. We retain control over the degree of negative training in two ways. First is a negative-training factor (0.2 to 0.5) that modulates the learning rate (equivalently, by modulating the error at the output layer) for these negative patterns. This reduces the impact of negative training on positive training, and modulates the impact on characters that specifically look like elements of multi-stroke characters (e.g., l, 1, I, o, O, 0). Secondly, we control a negative-training probability (0.05 to 0.3), which determines the probability that a particular negative sample will actually be trained on (for a given presentation). This both reduces the overall impact of negative training, and significantly reduces training time, since invalid segments are more numerous than valid segments. As with NormOutErr, this modification hurts character-level accuracy a little bit, but helps word-level accuracy a lot.

3.5 STROKE WARPING

During training (but not during recognition), we produce random variations in stroke data, consisting of small changes in skew, rotation, and x and y linear and quadratic scalings. 
This produces alternate character forms that are consistent with stylistic variations within and between writers, and induces an explicit aspect ratio and rotation invariance within the framework of standard back-propagation. The amounts of each distortion to apply were chosen through cross-validation experiments, as just the amount needed to yield optimum generalization. We also examined a number of such samples by eye to verify that they represent a natural range of variation. A small set of such variations is shown in Figure 4.

Figure 4: A Few Random Stroke Warpings of the Same Original "m" Data.

Our stroke warping scheme is somewhat related to the ideas of Tangent Dist and Tangent Prop (Simard et al 1992, 1993), in terms of the use of predetermined families of transformations, but we believe it is much easier to implement. It is also somewhat distinct in applying transformations to the original coordinate data, as opposed to using distortions of images. The voice transformation scheme of Chang and Lippmann (1995) is also related, but they use a static replication of the training set through a small number of tr