{"title": "Learning to See Where and What: Training a Net to Make Saccades and Recognize Handwritten Characters", "book": "Advances in Neural Information Processing Systems", "page_first": 441, "page_last": 447, "abstract": null, "full_text": "Learning to See Where and What: Training a Net to Make Saccades and Recognize Handwritten Characters\n\nGale Martin, Mosfeq Rashid, David Chapman, and James Pittman\n\nMCC, 3500 Balcones Center Drive, Austin, Texas 78759\n\nABSTRACT\n\nThis paper describes an approach to integrated segmentation and recognition of hand-printed characters. The approach, called Saccade, integrates ballistic and corrective saccades (eye movements) with character recognition. A single backpropagation net is trained to make a classification decision on a character centered in its input window, as well as to estimate the distance of the current and next character from the center of the input window. The net learns to accurately estimate these distances regardless of variations in character width, spacing between characters, writing style and other factors. During testing, the system uses the net-extracted classification and distance information, along with a set of jumping rules, to jump from character to character.\n\nThe ability to read rests on multiple foundation skills. In learning how to read, people learn how to recognize individual characters centered in the visual field. They also learn how to move their eyes along a line of text, sequentially centering the visual field on successive characters. We believe that the key to developing optical character recognition (OCR) systems that can mimic human reading capabilities is to develop systems that can learn these and other skills in an integrated fashion. 
In this paper, we demonstrate that a backpropagation net can learn to navigate along a line of handwritten characters, as well as to recognize the characters centered in its visual field. The system, called Saccade, extends the current state of the art in OCR technology by using a single classifier to accurately and efficiently locate and recognize characters, regardless of whether they touch each other or are separate. The Saccade system was described briefly at the last NIPS conference (Martin & Rashid, 1992). In this paper, we describe it more fully and report on results demonstrating its accuracy and efficiency in recognizing handwritten digits.\n\nThe Saccade system takes a cue from the ballistic and corrective saccades (eye movements) of natural vision systems. Natural saccades make it possible to efficiently move from one informative area to another by jumping. The eye typically initiates a ballistic saccade to move the center of focus to the general area of interest, followed, if necessary, by one or more corrective saccades for fine-grained position corrections. Recognition processes are applied only at these multiple fixation points.\n\nWe have copied some of these aspects in the artificial Saccade system by training a neural network to know about the locations of characters in its input window, as well as to know about the identity of the character centered in its input window. During run-time, the Saccade system accesses this information computed by the net for successive input windows, along with a set of simple jumping rules, to yield an OCR system that jumps from character to character, classifying each character in a sequence. 
\n\n1  TRAINING DETAILS\n\nAs shown in Figure 1, the Saccade system has a wide input window, large enough to contain several characters. Prior to training, each field image of a line of characters is labeled with the horizontal center position of each character in the field, as well as with the category of each character. During training, the input window slides horizontally across a field of characters, and at each position, the contents of the input window are paired with a four-part target output vector, the values of which are computed from the labeled information. The target values answer the following four questions about the contents of the input window:\n\n1.  Is a character centered in the input window?\n2.  What character is closest to the center of the window?\n3.  How far off-center (horizontally) is the centermost character?\n4.  How far is the next character to the right from the center of the window?\n\nFigure 1. The Saccade system uses an enlarged input window and a 4-part output vector. (Diagram labels: No Centered Character; Current character; Distance to current character; Distance to next character; 4 Part Output Vector; Backprop Net.)\n\nThe first node in the output vector represents the no-centered-character state. Its target value is set high (e.g., 1.0) when the center of the input window falls between characters, and set low (e.g., 0.0) when the center of the input window falls on the center of a character. When the net is trained, the value of the no-centered-character node indicates whether the input window is centered over a character, or whether a corrective saccade is needed to better center the character.\n\nThe second part of the output vector contains a node for each character category. 
When the center of the input window falls on a character, the target value for its corresponding node is set high; otherwise it is set low. When the net is trained, the values in these nodes are used to classify the centered character. The target values for both the no-centered-character and the character-category nodes are defined continuously across the horizontal dimension as trapezoidal functions, such that there are plateaus surrounding the off and on positions, with linearly increasing and decreasing values connecting the plateaus.\n\nThe third and fourth components of the output vector represent distance values, each encoded in a distributed fashion across multiple nodes, using localized receptive fields (Moody & Darken, 1988). The first of these two parts represents the distance by which the character closest to the center of the window is off-center. The target value can be positive, indicating that the center of the window is to the left of the center of the character, or it can be negative, indicating that the center of the window has passed over the character, to the right of its center. When trained, the value of the current-character-distance set of nodes is accessed to determine the magnitude of a corrective saccade, to make a fine-grained position adjustment.\n\nThe fourth component represents the distance from the center of the window to the center of the next character to the right. The target value can only be positive. When trained, the value of this set of nodes is accessed to determine the magnitude of a ballistic saccade, to jump to the next character to the right.\n\nIt is important to note that for both distance components, the maximum target value cannot exceed half the window width. 
The net is never trained to make a distance judgment that extends beyond its field of view, since it is not given any information about what exists outside of its input window. For example, when the center of the next character to the right is positioned outside of the current input window, the distance value is set to the maximum value of half the window width. Since the distance values vary with different characters, different writers, and of course, at different positions with respect to a character, the net is forced to learn to use the visual characteristics particular to each window to estimate the distance values. In other words, the net does NOT simply learn average values for each of the two distance metrics. Moreover, as the results will show, the trained net does not seem to use simple density histogram cues to estimate the distance values. It is able to reliably estimate the distance values even when characters overlap, and hence would appear as a single clump in a density histogram.\n\n2  RUN-TIME SACCADE RULES\n\nDuring run-time, the labeled values are, of course, not available. The system uses the computed values in the character classification and distance components of the output vector, and some heuristics, to navigate horizontally along a character field, jumping from one character to the next, and occasionally making a corrective saccade to improve its ability to classify a character. When the net recognizes a character, it executes a ballistic saccade to the next character, obtaining the distance to jump by reading the next-character-distance component of the output vector. When this action fails to center a character, as indicated by a low value in the no-centered-character output node, the system executes a corrective saccade to better center the character. 
It obtains the distance and direction to jump by reading the current-character-distance component of the output vector. Multiple corrective saccades can be executed.\n\n3  TESTING ON NIST HANDWRITTEN DIGIT FIELDS\n\nWe tested the performance of the system on a set of hand-printed digits collected and distributed by the National Institute of Standards and Technology (NIST). This is a database containing 273,000 samples of handwritten numerals. Each of 2100 Census workers filled in a form with 33 fields, 28 fields of which only contain handwritten digits. The scanning resolution of the samples was 300 pixels/inch. The neural net was trained on about 80,000 characters from 20,000 fields, written by 800 different individuals. The fields varied in length from 2 characters per field to 6 characters per field. The horizontal positions of each of the characters in these training-data fields were extracted by a person. The test data contained about 20,000 digits from 5,000 fields, written by a different group of 200 individuals. The test set was chosen to be this large because use of smaller test sets (e.g., 5,000 digits, 1250 fields) yielded significant between-set variations in reported accuracy. Each field image was preprocessed to remove the box around the field of characters, and any surrounding white space. Each field image was size normalized, with respect to the vertical axis, to a height of 20 pixels. Aspect ratio was maintained. An input pattern generator was then passed over the field to create input windows for training the net. The input window size was 36 pixels wide and 20 pixels high. The input window scanned the field at 2-pixel increments during training. Subsequent experiments have shown that training can be speeded up considerably by training on the character centers and at random points between the character centers, 
without causing decreased accuracy.\n\nThe backpropagation network architecture is described more fully in Martin & Rashid (1992). It has 2 hidden layers with local, shared connections in the first hidden layer, and local connections in the second hidden layer. Shared weights are not used in the second hidden layer because early experiments showed that this retards learning, presumably because extending the position invariance to second-hidden-layer nodes inhibits the net in learning the position-specific information regarding what is centered in its input window. The learning rate of the net was initially set at .05, and then successively lowered as training reached an asymptote. The momentum term was set at .9 throughout training. All nodes in the net used logistic activation functions.\n\nTable 1 reports on the test results in terms of field-based reject rates, for field error rates of 1% and .5%. The error rates are field-based in the sense that if the net misclassifies one character in the field, the entire field is considered as misclassified. Error rates pertain to the fields remaining after rejection. Rejections are based on placing a threshold for the acceptable distance between the highest and next highest running activation total. In this way, by varying the threshold, the error rate can be traded off against the percentage of rejections. In addition, recognized fields were also rejected if the number of recognized digits differed from the expected number of digits.\n\nTable 1: Field-Based Error Rates For Saccade System (? marks a digit that is illegible in the source)\n\nField Size | Field Error Rate | Field Reject Rate\n2-digit | 1.0% | 6.?%\n2-digit | 0.5% | 9.?%\n3-digit | 1.0% | 12.?%\n3-digit | 0.5% | 19.6%\n4-digit | 1.?% | 19.5%\n4-digit | 0.5% | 35.0%\n5-digit | 1.?% | 23.2%\n5-digit | 0.5% | 28.?%\n6-digit | 1.?% | 26.?%\n6-digit | 0.5% | 35.0%\n\nFigure 2 presents some of the fields of connected characters that the system correctly recognized. Conventional OCR systems typically fail on connected characters because they employ an independent character segmentation stage, in which the character is isolated from its surround using features, such as intervening white spaces. This character segmentation stage typically fails when characters are connected. The Saccade system goes beyond conventional OCR systems by integrating segmentation and recognition, and thereby is able to recognize touching characters.\n\nFigure 2. Examples of connected and broken characters that the Saccade system correctly recognizes.\n\nThe Saccade system is also efficient in the sense that it typically jumps from one character to the next without making a corrective saccade. Corrective saccades tend to be more likely when characters are touching. In addition, there is almost always a corrective saccade for the first character in the field, since the system starts at the beginning of the field, with no knowledge of the location of the character. For fields containing two digits, the average number of passes of the net on a small test set was about 2 saccades per character. For fields containing five digits, the average number of saccades per character was 1.3.\n\n4  COMPARISONS WITH OTHER SYSTEMS\n\nThe Saccade system is an extension of a related integrated segmentation and recognition system we reported on at last year's NIPS conference (Martin & Rashid, 1992). 
That system employed an exhaustive scan technique, rather than saccades, to navigate along a line of text. Essentially, the net was convolved horizontally across a field image at a scan increment of 2 pixels. The net architecture was very similar to that of the Saccade system, except that it did not have the two distance components in the output vector. The accuracy rates of the two systems are essentially equivalent. However, the exhaustive scan version was considerably less efficient, requiring a forward pass of the network at every 2-pixel incremental scan position. On average it required about 5.5 forward passes per character, rather than the 1.3 forward passes per character required by the Saccade system.\n\nOver the past two years, an approach similar to the exhaustive scan method has been advanced by a number of researchers (Keeler & Rumelhart, 1992; Matan, Burges, Le Cun, & Denker, 1992). This approach also involves convolving a network across a field image, but uses a time-delay-neural-net (TDNN), or completely local, shared weight, architecture, a smaller input window, and no explicit position labeling of characters. The TDNN approach has algorithmic advantages over the exhaustive scan version described in the previous paragraph, because the completely shared-weight architecture enables the number of forward passes of the net to be reduced considerably.\n\n5  CONCLUSIONS AND FUTURE WORK\n\nAs stated at the beginning of this paper, we believe that the key to developing optical character recognition (OCR) systems that can mimic human reading capabilities is to develop systems that can learn the multiple foundation skills underlying human reading. This paper has reported some progress in this regard. 
We have demonstrated that a relatively simple backpropagation network can integrate its learning of position and category information, thereby enabling efficient navigation along a field of text through ballistic and corrective saccades, and accurate recognition of touching or broken characters.\n\nThere is, however, a long way to go before we can claim a system with capabilities similar to human reading. The present Saccade system only moves horizontally, in one dimension. Human reading operates in two dimensions, and in a sense, it operates in three dimensions because it automatically operates across different scales. Human vision also employs automatic contrast adjustment; the Saccade system does not. Human vision has a wider field of view and employs a foveal transform, such that objects centered in the field of vision are represented at a higher resolution than objects in the periphery. This effectively expands the field of vision beyond what would be estimated simply by the size of the receptive area on the retina. As a result, saccades provide a very effective means of scanning a large visual area. The present artificial Saccade system has only a small field of vision, and no foveal transform, so its saccades must necessarily be limited in size. The present system is also only oriented toward recognizing a single character centered in its input window at a time. Human reading typically only makes one or two saccades per word. Finally, human reading capabilities clearly integrate recognition processes with higher-level processes, to enable the redundancies of natural language to constrain the recognition decisions.\n\nReferences\n\nKeeler, J., & Rumelhart, D. E. (1992) A self-organizing integrated segmentation and recognition neural network. 
In Moody, J.E., Hanson, S.J., and Lippmann, R.P., (eds.) Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers.\nMatan, O., Burges, J. C., Le Cun, Y., and Denker, J. S. (1992) Multi-Digit Recognition Using a Space Displacement Neural Network. In Moody, J.E., Hanson, S.J., and Lippmann, R.P., (eds.) Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers. 488-495.\nMartin, G. L. & Rashid, M. (1992) Recognizing overlapping hand-printed characters by centered-object integrated segmentation and recognition. In Moody, J.E., Hanson, S.J., and Lippmann, R.P., (eds.) Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers.\nMoody, J. & Darken, C. (1988) Learning with localized receptive fields. Technical Report Yaleu/DCS/RR-649.", "award": [], "sourceid": 611, "authors": [{"given_name": "Gale", "family_name": "Martin", "institution": null}, {"given_name": "Mosfeq", "family_name": "Rashid", "institution": null}, {"given_name": "David", "family_name": "Chapman", "institution": null}, {"given_name": "James", "family_name": "Pittman", "institution": null}]}