{"title": "A Neural Network Classifier for the I100 OCR Chip", "book": "Advances in Neural Information Processing Systems", "page_first": 938, "page_last": 944, "abstract": "", "full_text": "A Neural Network Classifier for \n\nthe 11000 OCR Chip \n\nJohn C. Platt and Timothy P. Allen \n\nSynaptics, Inc. \n\n2698 Orchard Parkway \n\nSan Jose, CA 95134 \n\nplatt@synaptics.com, tpa@synaptics.com \n\nAbstract \n\nThis paper describes a neural network classifier for the 11000 chip, which \noptically reads the E13B font characters at the bottom of checks. The \nfirst layer of the neural network is a hardware linear classifier which \nrecognizes the characters in this font . A second software neural layer \nis implemented on an inexpensive microprocessor to clean up the re(cid:173)\nsults of the first layer. The hardware linear classifier is mathematically \nspecified using constraints and an optimization principle. The weights \nof the classifier are found using the active set method, similar to Vap(cid:173)\nnik's separating hyperplane algorithm. In 7.5 minutes ofSPARC 2 time, \nthe method solves for 1523 Lagrange mUltipliers, which is equivalent to \ntraining on a data set of approximately 128,000 examples. The result(cid:173)\ning network performs quite well: when tested on a test set of 1500 real \nchecks, it has a 99.995% character accuracy rate. \n\n1 A BRIEF OVERVIEW OF THE 11000 CHIP \n\nAt Synaptics, we have created the 11000, an analog VLSI chip that, when combined \nwith associated software, optically reads the E13B font from the bottom of checks. \nThis E13B font is shown in figure 1. The overall architecture of the 11000 chip \nis shown in figure 2. The 11000 recognizes checks hand-swiped through a slot. A \nlens focuses the image of the bottom of the check onto the retina. The retina has \ncircuitry which locates the vertical position of the characters on the check . 
The retina then sends an image vertically centered around a possible character to the classifier. \n\nThe classifier in the I1000 has a tough job. It must be very accurate and immune to noise and ink scribbles in the input. Therefore, we decided to use an integrated segmentation and recognition approach (Martin & Pittman, 1992) (Platt, et al., 1992). When the classifier produces a strong response, we know that a character is horizontally centered in the retina. \n\nFigure 1: The E13B font, as seen by the I1000 chip \n\nFigure 2: The overall architecture of the I1000 chip (a check is hand-swiped through a slot; the retina sends an 18 by 24, vertically positioned image to the on-chip linear classifier and winner-take-all; the microprocessor receives 42 confidences and the best character hypothesis) \n\nWe decided to use analog VLSI to minimize the silicon area of the classifier. Because of the analog implementation, we decided to use a linear template classifier, with fixed weights in silicon to minimize area. The weights are encoded as lengths of transistors acting as current sources. We trained the classifier using only the specification of the font, because we did not have the real E13B data at the time of classifier design. The design of the classifier is described in the next section. \n\nAs shown in figure 2, the input to the classifier is an 18 by 24 pixel image taken from the retina at a rate of 20,000 frames per second. The templates in the classifier are 18 by 22 pixels. Each template is evaluated in three different vertical positions, to allow the retina to send a slightly vertically mis-aligned character. The output of the classifier is a set of 42 confidences, one for each of the 14 characters in the font in three different vertical positions. 
These confidences are fed to a winner-take-all circuit (Lazzaro, et al., 1989), which finds the confidence and the identity of the best character hypothesis. \n\n2 SPECIFYING THE BEHAVIOR OF THE CLASSIFIER \n\nLet us consider the training of one template corresponding to one of the characters in the font. The template takes a vector of pixels as input. For ease of analog implementation, the template is a linear neuron with no bias input: \n\nO = Σ_i w_i I_i (1) \n\nwhere O is the output of the template, w_i are the weights of the template, and I_i are the input pixels of the template. \n\nWe will now mathematically express the training of the templates as three types of constraints on the weights of the template. The input vectors used by these constraints are the ideal characters taken from the specification of the font. \n\nThe first type of constraint on the template is that the output of the template should be above 1 when the character that corresponds to the template is centered in the horizontal field. Call the vector of pixels of this centered character C_i. This constraint is stated as: \n\nΣ_i w_i C_i ≥ 1 (2) \n\nFigure 3: Examples of images from the bad set for the templates trained to detect the zero character. These images are E13B characters that have been horizontally and vertically offset from the center of the image. The black border around each of the characters shows the boundary of the input field. Notice the variety of horizontal and vertical shifts of the different characters. \n\nThe second type of constraint on the template is to have an output much lower than 1 when incorrect or offset characters are applied to the template. 
We collect these incorrect and offset characters into a set of pixel vectors B^j, which we call the \"bad set.\" The constraint that the output of the template be lower than a constant c for all of the vectors in the bad set is expressed as: \n\nΣ_i w_i B_i^j ≤ c for all j (3) \n\nTogether, constraints (2) and (3) permit use of a simple threshold to distinguish between a positive classifier response and a negative one. \n\nThe bad set contains examples of the correct character for the template that are horizontally offset by at least two pixels and vertically offset by up to one pixel. In addition, examples of all other characters are added to the bad set at every horizontal offset and with vertical offsets of up to one pixel (see figure 3). Vertically offset examples are added to make the classifier resistant to characters whose baselines are slightly mismatched. \n\nThe third type of constraint on the template requires that the output be invariant to the addition of a constant to all of the input pixels. This constraint makes the classifier immune to any changes in the background lighting level, k. This constraint is equivalent to requiring the sum of the weights to be zero: \n\nΣ_i w_i = 0 (4) \n\nFinally, an optimization principle is necessary to choose between all possible weight vectors that fulfill constraints (2), (3), and (4). We minimize the perturbation of the output of the template given uncorrelated random noise on the input. This optimization principle is similar to training on a large data set, instead of simply the ideal characters described by the specification. This optimization principle is equivalent to minimizing the sum of the squares of the weights: \n\nmin Σ_i w_i^2 (5) \n\nExpressing the training of the classifier as a combination of constraints and an optimization principle allows us to compactly define its behavior. 
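Taken together, the centering constraint (2), the bad-set constraints (3), the lighting constraint (4), and objective (5) form a small convex quadratic program. As a hedged illustration (not the paper's solver), here is a minimal sketch using scipy.optimize's SLSQP method on toy random vectors; the sizes and data are illustrative stand-ins for the real 18 by 22 templates and 1523 constraints:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 12                       # toy pixel count; the real templates are 18 x 22
good = rng.random(n)         # stand-in for the centered character C
bad = rng.random((3, n))     # stand-ins for a few bad-set vectors B^j
c = 0.25                     # bad-set ceiling chosen in the paper

cons = [
    {"type": "ineq", "fun": lambda w: w @ good - 1.0},  # (2): output >= 1 on the centered character
    {"type": "ineq", "fun": lambda w: c - bad @ w},     # (3): output <= c on every bad-set vector
    {"type": "eq",   "fun": lambda w: np.sum(w)},       # (4): weights sum to zero
]
# (5): minimize the sum of squared weights subject to (2)-(4)
res = minimize(lambda w: 0.5 * np.sum(w ** 2), np.zeros(n),
               method="SLSQP", constraints=cons)
w = res.x
```

A generic solver like this does not exploit the structure of the full 1523-constraint problem; the paper instead uses the active set method described below.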
For example, the combination of constraints (3) and (4) allows the classifier to be immune to situations when two partial characters appear in the image at the same time. The confluence of two characters in the image can be described as: \n\nI_i^overlap = k + B_i^l + B_i^r (6) \n\nwhere k is a background value and B^l and B^r are partial characters from the bad set that appear on the left side and right side of the image, respectively. The output of the template is then: \n\nO^overlap = Σ_i w_i (k + B_i^l + B_i^r) = k Σ_i w_i + Σ_i w_i B_i^l + Σ_i w_i B_i^r < 2c (7) \n\nConstraints (3) and (4) thus limit the output of the neuron to less than 2c when two partial characters appear in the input. Therefore, we want c to be less than 0.5. In order to get a 2:1 margin, we choose c = 0.25. \n\nThe classifier is trained only on individual partial characters instead of all possible combinations of partial characters. Therefore, we can specify the classifier using only 1523 constraints, instead of creating a training set of approximately 128,000 possible combinations of partial characters. Applying these constraints is therefore much faster than back-propagation on the entire data set. \n\nEquations (2), (3) and (5) describe the optimization problem solved by Vapnik (Vapnik, 1982) for constructing a hyperplane that separates two classes. Vapnik solves this optimization problem by converting it into a dual space, where the inequality constraints become much simpler. However, we add the equality constraint (4), which does not allow us to directly use Vapnik's dual space method. To overcome this limitation, we use the active set method, which can fulfill any extra linear equality or inequality constraints. The active set method is described in the next section. \n\n3 THE ACTIVE SET METHOD \n\nNotice that constraints (2), (3), and (4) are all linear in w_i. 
Therefore, minimizing (5) with these constraints is simply quadratic programming with a mixture of equality and inequality constraints. This problem can be solved using the active set method from optimization theory (Gill, et al., 1981). \n\nWhen the quadratic programming problem is solved, some of the inequality constraints and all of the equality constraints will be \"active.\" In other words, the active constraints affect the solution as equality constraints. The system has \"bumped into\" these constraints. All other constraints will be inactive; they will not affect the solution. \n\nOnce we know which constraints are active, we can easily solve the quadratic minimization problem with equality constraints via Lagrange multipliers. The solution is a saddle point of the function: \n\n(1/2) Σ_i w_i^2 + Σ_k λ_k (Σ_i A_ki w_i - c_k) (8) \n\nwhere λ_k is the Lagrange multiplier of the kth active constraint, and A_ki and c_k are the linear and constant coefficients of the kth active constraint. For example, if constraint (2) is the kth active constraint, then A_ki = C_i and c_k = 1. The saddle point can be found via the set of linear equations: \n\nw_i = -Σ_k λ_k A_ki (9) \n\nλ_j = -Σ_k (Σ_i A_ji A_ki)^-1 c_k (10) \n\nThe active set method determines which inequality constraints belong in the active set by iteratively solving equation (10) above. At every step, one inequality constraint is either made active, or inactive. A constraint can be moved to the active set if the inequality constraint is violated. A constraint can be moved off the active set if its Lagrange multiplier has changed sign (footnote 1). \n\nFigure 4: The position along the step where the constraints become violated or the Lagrange multipliers become zero can be computed analytically. The algorithm then takes the largest possible step without violating constraints or having the Lagrange multipliers become zero. (The figure shows a step in λ space toward the solution from equation (10), with the lines along which a constraint becomes violated or a multiplier reaches zero marked.) \n\nEach step of the active set method attempts to adjust the vector of Lagrange multipliers to the values provided by equation (10). Let us parameterize the step from the old to the new Lagrange multipliers via a parameter α: \n\nλ = λ^0 + α δλ (11) \n\nwhere λ^0 is the vector of Lagrange multipliers before the step, δλ is the step, and when α = 1, the step is completed. Now, the amount of constraint violation and the Lagrange multipliers are linear functions of this α. Therefore, we can analytically derive the α at which a constraint is violated or a Lagrange multiplier changes sign (see figure 4). For currently inactive constraints, the α for constraint violation is: \n\nα_k = -(c_k + Σ_j λ^0_j Σ_i A_ji A_ki) / (Σ_j δλ_j Σ_i A_ji A_ki) (12) \n\nFor a currently active constraint, the α for a Lagrange multiplier sign change is simply: \n\nα_k = -λ^0_k / δλ_k (13) \n\nWe choose the constraint that has the smallest positive α_k. If the smallest α_k is greater than 1, then the system has found the solution, and the final weights are computed from the Lagrange multipliers at the end of the step. Otherwise, if the kth constraint is active, we make it inactive, and vice versa. We then set the Lagrange multipliers to be the interpolated values from equation (11) with α = α_k. We finally re-evaluate equation (10) with the updated active set (footnote 2). \n\nWhen this optimization algorithm is applied to the E13B font, the templates that result are shown in figure 5. When applied to characters that obey the specification, the classifier is guaranteed to give a 2:1 margin between the correct peak and any false peak caused by the confluence of two partial characters. 
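The equality-constrained solve at the heart of each iteration, equations (9) and (10), amounts to one linear system in the active constraints. A minimal numpy sketch on a toy active set (the matrix below is illustrative, not taken from the font):

```python
import numpy as np

def equality_constrained_step(A, c):
    """Minimum-norm weights for the active constraints A w = c, via
    equations (9) and (10): lam = -(A A^T)^{-1} c, then w = -A^T lam."""
    lam = -np.linalg.solve(A @ A.T, c)
    w = -A.T @ lam
    return w, lam

# toy active set: one "centered character" row held at 1, plus the
# sum-to-zero lighting constraint (4) held at 0
A = np.array([[1.0, 0.8, 0.2, 0.1],
              [1.0, 1.0, 1.0, 1.0]])
c = np.array([1.0, 0.0])
w, lam = equality_constrained_step(A, c)  # A @ w == c holds exactly
```

The full active set method wraps this solve in the add/remove loop governed by equations (11)-(13).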
Each template has 1523 constraints and takes 7.5 minutes on a SPARC 2 to train. Back-propagation on the 128,000 training examples that are equivalent to the constraints would obviously require much more computation time. \n\nFootnote 1: The sign of the Lagrange multiplier indicates on which side of the inequality constraint the constrained minimum lies. \n\nFootnote 2: For more details on active set methods, such as how to recognize infeasible constraints, consult (Gill, et al., 1981). \n\nFigure 5: The weights for the fourteen E13B templates. The light pixels correspond to positive weights, while the dark pixels correspond to negative weights. \n\nFigure 6: The software second layer (the input is a spatial window of 15 frames from the history of I1000 outputs, where every vertical column of the 14 outputs contains 13 zeros; it feeds 2 hidden neurons, which feed 14 output neurons and a pinger neuron) \n\n4 THE SOFTWARE SECOND LAYER \n\nAs a test of the linear classifier, we fabricated the I1000 and tested it with E13B characters on real checks. The system worked when the printing on the check obeyed the contrast specification of the font. However, some check printing companies use very light or very dark printing. Therefore, there was no single threshold that could consistently read the lightly printed checks without hallucinating characters on the dark checks. The retina shown in figure 2 does not have automatic gain control (AGC). One solution would have been to refabricate the chip using an AGC retina. However, we opted for a simpler solution. \n\nThe output of the I1000 chip is a 2-bit confidence level and a character code that is sent to an inexpensive microprocessor every 50 microseconds. Because this output bandwidth is low, it is feasible to put a small software second layer into this microprocessor to post-process and clean up the output of the I1000. 
\nThe architecture of this software second layer is shown in figure 6. The input to \nthe second layer is a linearly time-warped history of the output of the 11000 chip. \nThe time warping makes the second layer immune to changes in the velocity of the \ncheck in the slot. There is one output neuron that is a \"pinger.\" That is, it is \ntrained to turn on when the input to the 11000 chip is centered over any character \n(Platt, et al. , 1992) (Martin & Pittman, 1992). There are fourteen other neurons \nthat each correspond to a character in the font. These neurons are trained to turn \non when the appropriate character is centered in the field, and otherwise turn off. \nThe classification output is the output of the fourteen neurons only when the pinger \nneuron is on. Thus, the pinger neuron aids in segmentation. \nConsidering the entire network spanning both the hardware first layer and software \n\n\f944 \n\nJ. C. PLATT. T. P. ALLEN \n\nsecond layer, we have constructed a non-standard TDNN (Waibel, et. al., 1989) \nwhich recognizes characters. \nWe trained the second layer using standard back-propagation, with a training set \ngathered from real checks. Because the nooo output bandwidth is quite low, col(cid:173)\nlecting the data and training the network was not onerous. The second layer was \ntrained on a data set of approximately 1000 real checks. \n\n5 OVERALL PERFORMANCE \n\nWhen the hardware first layer in the 11000 is combined with the software second \nlayer, the performance of the system on real checks is quite impressive. We gathered \na test set of 1500 real checks from across the country. This test set contained a \nvariety of light and dark checks with unusual backgrounds. We swiped this test set \nthrough one system. Out of the 1500 test checks, the system only failed to read 2, \ndue to staple holes in important locations of certain characters. As such , this test \nyielded a 99.995% character accuracy on real data. 
\n\n6 CONCLUSIONS \n\nFor the I1000 analog VLSI OCR chip, we have created an effective hardware linear classifier that recognizes the E13B font. The behavior of this classifier was specified using constrained optimization. The classifier was designed to have a predictable margin of classification, be immune to lighting variations, and be resistant to random input noise. The classifier was trained using the active set method, which is an enhancement of Vapnik's separating hyperplane algorithm. We used the active set method to find the weights of a template in 7.5 minutes of SPARC 2 time, instead of training on a data set with 128,000 examples. To make the overall system resistant to contrast variation, we separately trained a software second layer on top of this first hardware layer, thereby constructing a non-standard TDNN. \n\nThe application discussed in this paper shows the utility of using the active set method to very rapidly create either a stand-alone linear classifier or a first layer of a multi-layer network. \n\nReferences \n\nP. Gill, W. Murray, M. Wright (1981), Practical Optimization, Section 5.2, Academic Press. \n\nJ. Lazzaro, S. Ryckebusch, M. Mahowald, C. Mead (1989), \"Winner-Take-All Networks of O(N) Complexity,\" Advances in Neural Information Processing Systems, 1, D. Touretzky, ed., Morgan-Kaufmann, San Mateo, CA. \n\nG. Martin, M. Rashid (1992), \"Recognizing Overlapping Hand-Printed Characters by Centered-Object Integrated Segmentation and Recognition,\" Advances in Neural Information Processing Systems, 4, Moody, J., Hanson, S., Lippmann, R., eds., Morgan-Kaufmann, San Mateo, CA. \n\nJ. Platt, J. Decker, and J. LeMoncheck (1992), Convolutional Neural Networks for the Combined Segmentation and Recognition of Machine Printed Characters, USPS 5th Advanced Technology Conference, 2, 701-713. \n\nV. Vapnik (1982), Estimation of Dependencies Based on Empirical Data, Addendum I, Section 2, Springer-Verlag. \n\nA. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. Lang (1989), \"Phoneme Recognition Using Time-Delay Neural Networks,\" IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 328-339. \n", "award": [], "sourceid": 1170, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Timothy", "family_name": "Allen", "institution": null}]}