{"title": "Multi-Digit Recognition Using a Space Displacement Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 488, "page_last": 495, "abstract": null, "full_text": "Multi-Digit Recognition Using A Space \n\nDisplacement Neural Network \n\nOfer Matan*, Christopher J.C. Burges, \n\nYann Le Cun and John S. Denker \n\nAT&T Bell Laboratories, Holmdel, N. J. 07733 \n\nAbstract \n\nWe present a feed-forward network architecture for recognizing an uncon(cid:173)\nstrained handwritten multi-digit string. This is an extension of previous \nwork on recognizing isolated digits. In this architecture a single digit rec(cid:173)\nognizer is replicated over the input. The output layer of the network is \ncoupled to a Viterbi alignment module that chooses the best interpretation \nof the input. Training errors are propagated through the Viterbi module. \nThe novelty in this procedure is that segmentation is done on the feature \nmaps developed in the Space Displacement Neural Network (SDNN) rather \nthan the input (pixel) space. \n\n1 \n\nIntroduction \n\nIn previous work (Le Cun et al., 1990) we have demonstrated a feed-forward back(cid:173)\npropagation network that recognizes isolated handwritten digits at state-of-the-art \nperformance levels. The natural extension of this work is towards recognition of \nunconstrained strings of handwritten digits. The most straightforward solution is \nto divide the process into two: segmentation and recognition. The segmenter will \ndivide the original image into pieces (each containing an isolated digit) and pass \nit to the recognizer for scoring. This approach assumes that segmentation and \nrecognition can be decoupled. Except for very simple cases this is not true. 
\nSpeech-recognition research (Rabiner, 1989; Franzini, Lee and Waibel, 1990) has demonstrated the power of using the recognition engine to score each segment in a candidate segmentation. The segmentation that gives the best combined score is chosen. \"Recognition driven\" segmentation is usually used in conjunction with dynamic programming, which can find the optimal solution very efficiently. \n\n* Author's current address: Department of Computer Science, Stanford University, Stanford, CA 94305. \n\nThough dynamic programming algorithms save us from exploring an exponential number of segment combinations, they are still linear in the number of possible segments, requiring one call to the recognition unit per candidate segment. To solve the problem in reasonable time it is necessary either to 1) limit the number of possible segments, or 2) have a rapid recognition unit. \nWe have built a ZIP code reading system that \"prunes\" the number of candidate segments (Matan et al., 1991). The candidate segments were generated by analyzing the image's pixel projection onto the horizontal axis. The strength of this system is that the number of calls to the recognizer is small (only slightly over twice the number of real digits). The weakness is that by generating only a small number of candidates one often misses the correct segmentation. In addition, generation of this small set is based on multi-parametric heuristics, making the system difficult to tune. \nIt would be attractive to discard heuristics and generate many more candidates, but then the time spent in the recognition unit would have to be reduced considerably. Reducing the computation of the recognizer usually gives rise to a reduction in recognition rates. However, it is possible to have our segments and eat them too. 
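The pixel-projection pruning used in that system can be sketched as follows. This is a minimal illustration under assumptions, not the system's actual code: the binary-image representation and the `threshold` parameter are invented for the example.

```python
# Sketch of pixel-projection candidate-cut generation (illustrative only).
# Assumes a binary image stored as a list of rows; threshold is a made-up
# parameter, not one from the paper.

def candidate_cuts(image, threshold=0):
    """Return x-positions where the vertical ink projection is low.

    Columns whose ink count is <= threshold are candidate cut points
    between digits; a run of consecutive low columns yields one cut.
    """
    width = len(image[0])
    projection = [sum(row[x] for row in image) for x in range(width)]
    cuts, in_valley = [], False
    for x, ink in enumerate(projection):
        if ink <= threshold and not in_valley:
            cuts.append(x)
            in_valley = True
        elif ink > threshold:
            in_valley = False
    return cuts

# Toy image: two ink blobs separated by a blank column.
img = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]
print(candidate_cuts(img))  # → [2]
```

Pairs of such cuts then define the candidate segments passed to the recognizer, which is exactly where the cost of one recognizer call per candidate arises.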
We propose an architecture which can explore many more candidates without compromising the richness of the recognition engine. \n\n2 The Design \n\nLet us describe a simplified and less efficient solution that will lead us to our final design. Consider a de-skewed image such as the one shown in Figure 1. The system will separate it into candidate segments using vertical cuts. A few examples of these are shown beneath the original image in Figure 1. In the process of finding the best overall segmentation each candidate segment will be passed to the recognizer described in (Le Cun et al., 1990). The scores will be converted to probabilities (Bridle, 1989) that are inserted into nodes of a directed acyclic graph. Each path on this graph represents a candidate segmentation, where the length of each path is the product of the node values along it. The Viterbi algorithm is used to determine the longest path (which corresponds to the segmentation with the highest combined score). \nIt seems somewhat redundant to process the same pixels numerous times (as part of different, overlapping candidate segments). For this reason we propose to pass a whole size-normalized image to the recognition unit and to segment a feature map, after most of the neural network computation has been done. Since the first four layers in our recognizer are convolutional, we can easily extend the single-digit network by applying the convolution kernels to the multi-digit image. \nFigure 2 shows the example image (Figure 1) processed by the extended network. We now proceed to segment the top layer. Since the network is convolutional, segmenting this feature-map layer is similar to segmenting the input layer. (Because of overlapping receptive fields and reduced resolution, it is not exactly equivalent.) This gives a speed-up of roughly an order of magnitude. 
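The longest-path search over the candidate-segmentation graph can be sketched as a small dynamic program. This is a hedged illustration, not the paper's implementation: the cut positions, the per-segment scores, and the fixed digit count below are all invented, and the product of probabilities is computed as a sum of logs.

```python
import math

# Sketch of the best-path search over candidate segments (illustrative only).
# seg_score[(a, b)] stands for the recognizer's probability for the candidate
# segment between vertical cuts a and b; the values below are made up.

def best_segmentation(seg_score, cuts, n_digits):
    """Maximize the product of segment probabilities (as a sum of logs)
    over paths that cover the image with exactly n_digits segments."""
    # best[(k, b)]: best (log-score, path) covering cuts[0]..b with k segments
    best = {(0, cuts[0]): (0.0, [])}
    for k in range(1, n_digits + 1):
        for b in cuts:
            for a in cuts:
                if (k - 1, a) in best and (a, b) in seg_score:
                    score, path = best[(k - 1, a)]
                    cand = score + math.log(seg_score[(a, b)])
                    if (k, b) not in best or cand > best[(k, b)][0]:
                        best[(k, b)] = (cand, path + [(a, b)])
    return best[(n_digits, cuts[-1])][1]

cuts = [0, 3, 5, 8]
seg_score = {(0, 3): 0.9, (3, 5): 0.8, (5, 8): 0.7,
             (0, 5): 0.2, (3, 8): 0.1, (0, 8): 0.05}
print(best_segmentation(seg_score, cuts, 3))  # → [(0, 3), (3, 5), (5, 8)]
```

Each recognizer score enters the table once per segment, which is why the search is linear in the number of candidate segments rather than exponential in their combinations.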
\nFigure 1: A sample ZIP code image and possible segmentations. \n\nFigure 2: The example ZIP code processed by 4 layers of a convolutional feed-forward network. \n\nIn the single-digit network, we can view the output layer as a 10-unit column vector that is connected to a zone of width 5 on the last feature layer. If we replicate the single-digit network over the input in the horizontal direction, the output layer will be replicated. Each output vector will be connected to a different zone of width 5 on the feature layer. Since the width of a handwritten digit is highly variable, we construct alternate output vectors that are connected to feature segment zones of widths 4, 3 and 2. The resulting output maps for the example ZIP code are shown in Figure 3. \nThe network we have constructed is a shared-weight network reminiscent of a TDNN (Lang and Hinton, 1988). We have termed this architecture a Space Displacement Neural Network (SDNN). We rely on the fact that most digit strings lie more or less on one line; therefore, the network is replicated in the horizontal direction. For other applications it is conceivable to replicate in the vertical direction as well. \n\n3 The Recognition Procedure \n\nThe output maps are processed by a Viterbi algorithm which chooses the set of output vectors corresponding to the segmentation giving the highest combined score. We currently assume that we know the number of digits in the image; however, this procedure can be generalized to an unknown number of digits. In Figure 3 the five output vectors that combined to give the best overall score are marked by thin lines beneath them. \n\n4 The Training Procedure \n\nDuring training we follow the above procedure and repeat it under the constraint that the winning combination corresponds to the ground truth. 
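One way to realize this constraint as a training signal is to reinforce the output units picked by the constrained Viterbi and push down differing picks from the free-running Viterbi. The sketch below is an illustrative assumption-laden rendering, not the paper's code: the output-map layout, the (position, class) pick format, and the plus/minus-one targets are all made up.

```python
# Sketch of a constrained-Viterbi gradient initialization (illustrative only;
# the data layout is an assumption, not the paper's implementation).
# Each "pick" is a (position, class) pair for a chosen output vector.

def init_output_gradient(n_positions, n_classes, constrained, unconstrained):
    """Zero gradient everywhere except: reinforce constrained-Viterbi picks,
    negatively reinforce free-Viterbi picks that differ from them."""
    grad = [[0.0] * n_classes for _ in range(n_positions)]
    for pos, cls in constrained:
        grad[pos][cls] = 1.0            # push the correct unit up
    for pick in unconstrained:
        if pick not in constrained:
            pos, cls = pick
            grad[pos][cls] = -1.0       # push the wrong segmentation down
    return grad

# Hypothetical picks for a 5-digit string over 14 output positions.
constrained = [(0, 2), (3, 3), (6, 2), (9, 0), (12, 6)]
unconstrained = [(0, 2), (2, 3), (6, 2), (9, 0), (12, 6)]
g = init_output_gradient(14, 10, constrained, unconstrained)
# (3, 3) is reinforced; the differing free pick (2, 3) is pushed down.
```

All units not touched by either alignment keep a zero gradient in this variant.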
In Figure 4 the constrained-winning output vectors are marked by small circles. We perform back-propagation through both the ground truth vectors (reinforcement) and the highest scoring vectors (negative reinforcement). \nWe have trained and tested this architecture on size-normalized 5-digit ZIP codes taken from U.S. Mail. 6000 images were used for training and 3000 were used for testing. The images were cleaned, deskewed and height-normalized according to the assumed largest digit height. The data was not \"cleaned\" after the automatic preprocessing, leaving non-centered images and non-digits in both the training and test sets. \nTraining was done using stochastic back-propagation with some sweeps using Newton's method for adjusting the learning rates. We tried various methods of initializing the gradient on the last layer: \n\n\u2022 Reinforce only units picked by the constrained Viterbi (all other units have a gradient of zero). \n\n\u2022 Same as above, but set negative feedback through units chosen by the regular Viterbi that are different from those chosen by the constrained version. (Push down the incorrect segmentation if it is different from the correct answer.) This speeds up the convergence. \n\n\u2022 Reinforce units chosen by the constrained Viterbi. Set negative feedback through all other units except those that are \"similar\" to ones in the correct set. (\"Similar\" is defined as corresponding to a close center of frame in the input and responding with the correct class.) \n\nFigure 3: Recognition using the SDNN/Viterbi. The output maps of the SDNN are shown. White indicates a positive activation. The output vectors chosen by the Viterbi alignment are marked by a thin line beneath them. The input regions corresponding to these vectors are shown. One can see that the system centers on the individual digits. Each of the 4 output maps shown is connected to a different-size zone in the last feature layer (widths 5, 4, 3 and 2, top to bottom). In order to implement weight sharing between output units connected to different zone sizes, the dangling connections of the output vectors of narrower zones are connected to feature units corresponding to background in the input. \n\nFigure 4: Training using the SDNN/Viterbi. The output vectors chosen by the Viterbi algorithm are marked by a thin line beneath them. The corresponding input regions are shown in the left column. The output vectors chosen by the constrained Viterbi algorithm are marked by small circles and their corresponding input regions are shown to the right. Given the ground truth the system can learn to center on the correct digit. \n\nAs one adds more units that have a non-zero gradient, each training iteration is more similar to batch training and is more prone to oscillations. In this case more Newton sweeps are required. \n\n5 Results \n\nThe current raw recognition rates for the whole 5-digit string are 70% correct on the training set and 66% correct on the test set. Additional interesting statistics are the distribution of the number of correct digits across the whole ZIP code and the recognition rates for each digit's position within the ZIP code. These are presented in the tables shown below. \n\nTable 1: Top: Distribution of test images according to the number of correct single-digit classifications out of 5. 
Bottom: Rates of single-digit classification according to position. Digits on the edges are classified more easily since one edge is predetermined. \n\nNumber of digits correct   Percent of cases \n5                          66.3 \n4                          19.7 \n3                           7.2 \n2                           4.7 \n1                           1.4 \n0                           0.7 \n\nDigit position   Percent correct \n1st              92 \n2nd              87 \n3rd              87 \n4th              86 \n5th              90 \n\n6 Conclusions and Future Work \n\nThe SDNN combined with the Viterbi algorithm learns to recognize strings of handwritten digits by \"centering\" on the individual digits in the string. This is similar in concept to other work in speech (Haffner, Franzini and Waibel, 1991) but differs from (Keeler, Rumelhart and Leow, 1991), where no alignment procedure is used. The current recognition rates are still lower than those of our best system, which uses pixel-projection information to guide a recognition-based segmenter. The SDNN is much faster and lends itself to parallel hardware. Possible improvements to the architecture may be: \n\n\u2022 Modified constraints on the segmentation rules of the feature layer. \n\n\u2022 Applying the Viterbi algorithm in the vertical direction as well, which might overcome problems due to height variance. \n\n\u2022 Using global information, since it might be too hard to segment using local information only; one might try pixel projection or recognizing doublets or triplets. \n\nThough there is still considerable work to be done in order to reach state-of-the-art recognition levels, we believe that this type of approach is the correct direction for future image processing applications. Applying recognition-based segmentation at the line, word and character level on high-level feature maps is necessary in order to achieve fast processing while exploring a large set of possible interpretations. 
\n\nAcknowledgements \n\nSupport of this work by the Technology Resource Department of the U.S. Postal Service under Task Order 104230-90-C-2456 is gratefully acknowledged. \n\nReferences \n\nBridle, J. S. (1989). Probabilistic Interpretation of Feedforward Classification Network Outputs with Relationships to Statistical Pattern Recognition. In Fogelman-Soulie, F. and Herault, J., editors, Neuro-computing: Algorithms, Architectures and Applications. Springer-Verlag. \n\nFranzini, M., Lee, K. F., and Waibel, A. (1990). Connectionist Viterbi Training: A New Hybrid Method for Continuous Speech Recognition. In Proceedings ICASSP 90, pages 425-428. IEEE. \n\nHaffner, P., Franzini, M., and Waibel, A. (1991). Integrating Time Alignment and Neural Networks for High Performance Continuous Speech Recognition. In Proceedings ICASSP 91. IEEE. \n\nKeeler, J. D., Rumelhart, D. E., and Leow, W. (1991). Integrated Segmentation and Recognition of Handwritten-Printed Numerals. In Lippmann, Moody, and Touretzky, editors, Advances in Neural Information Processing Systems, volume 3. Morgan Kaufmann. \n\nLang, K. J. and Hinton, G. E. (1988). A Time Delay Neural Network Architecture for Speech Recognition. Technical Report CMU-CS-88-152, Carnegie Mellon University, Pittsburgh, PA. \n\nLe Cun, Y., Matan, O., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., and Baird, H. S. (1990). Handwritten ZIP Code Recognition with Multilayer Networks. In Proceedings of the 10th International Conference on Pattern Recognition. IEEE Computer Society Press. \n\nMatan, O., Bromley, J., Burges, C. J. C., Denker, J. S., Jackel, L. D., Le Cun, Y., Pednault, E. P. D., Satterfield, W. D., Stenard, C. E., and Thompson, T. J. (1991). Reading Handwritten Digits: A ZIP Code Recognition System. (To appear in COMPUTER.) \n\nRabiner, L. R. (1989). 
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77:257-286.", "award": [], "sourceid": 557, "authors": [{"given_name": "Ofer", "family_name": "Matan", "institution": null}, {"given_name": "Christopher", "family_name": "Burges", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "John", "family_name": "Denker", "institution": null}]}