{"title": "A Neural Network Classifier for the I1000 OCR Chip", "book": "Advances in Neural Information Processing Systems", "page_first": 938, "page_last": 944, "abstract": "", "full_text": "A Neural Network Classifier for the I1000 OCR Chip \n\nJohn C. Platt and Timothy P. Allen \nSynaptics, Inc. \n2698 Orchard Parkway \nSan Jose, CA 95134 \nplatt@synaptics.com, tpa@synaptics.com \n\nAbstract \n\nThis paper describes a neural network classifier for the I1000 chip, which optically reads the E13B font characters at the bottom of checks. The first layer of the neural network is a hardware linear classifier which recognizes the characters in this font. A second software neural layer is implemented on an inexpensive microprocessor to clean up the results of the first layer. The hardware linear classifier is mathematically specified using constraints and an optimization principle. The weights of the classifier are found using the active set method, similar to Vapnik's separating hyperplane algorithm. In 7.5 minutes of SPARC 2 time, the method solves for 1523 Lagrange multipliers, which is equivalent to training on a data set of approximately 128,000 examples. The resulting network performs quite well: when tested on a test set of 1500 real checks, it has a 99.995% character accuracy rate. \n\n1 A BRIEF OVERVIEW OF THE I1000 CHIP \n\nAt Synaptics, we have created the I1000, an analog VLSI chip that, when combined with associated software, optically reads the E13B font from the bottom of checks. This E13B font is shown in figure 1. The overall architecture of the I1000 chip is shown in figure 2. The I1000 recognizes checks hand-swiped through a slot. A lens focuses the image of the bottom of the check onto the retina. 
The retina has circuitry which locates the vertical position of the characters on the check. The retina then sends an image vertically centered around a possible character to the classifier. \nThe classifier in the I1000 has a tough job. It must be very accurate and immune to noise and ink scribbles in the input. Therefore, we decided to use an integrated segmentation and recognition approach (Martin & Pittman, 1992) (Platt, et al., 1992). When the classifier produces a strong response, we know that a character is horizontally centered in the retina. \n\nFigure 1: The E13B font, as seen by the I1000 chip \n\nFigure 2: The overall architecture of the I1000 chip. A check swiped through the slot is imaged onto the retina; an 18 by 24 image, vertically positioned, feeds the linear classifier, whose 42 confidences drive a winner-take-all circuit, yielding the best character hypothesis for the microprocessor. \n\nWe decided to use analog VLSI to minimize the silicon area of the classifier. Because of the analog implementation, we decided to use a linear template classifier, with fixed weights in silicon to minimize area. The weights are encoded as lengths of transistors acting as current sources. We trained the classifier using only the specification of the font, because we did not have the real E13B data at the time of classifier design. The design of the classifier is described in the next section. \nAs shown in figure 2, the input to the classifier is an 18 by 24 pixel image taken from the retina at a rate of 20 thousand frames per second. The templates in the classifier are 18 by 22 pixels. 
Each template is evaluated in three different vertical positions, to allow the retina to send a slightly vertically misaligned character. The output of the classifier is a set of 42 confidences, one for each of the 14 characters in the font in three different vertical positions. These confidences are fed to a winner-take-all circuit (Lazzaro, et al., 1989), which finds the confidence and the identity of the best character hypothesis. \n\n2 SPECIFYING THE BEHAVIOR OF THE CLASSIFIER \n\nLet us consider the training of one template corresponding to one of the characters in the font. The template takes a vector of pixels as input. For ease of analog implementation, the template is a linear neuron with no bias input: \n\no = Σ_i w_i I_i   (1) \n\nwhere o is the output of the template, w_i are the weights of the template, and I_i are the input pixels of the template. \nWe will now mathematically express the training of the templates as three types of constraints on the weights of the template. The input vectors used by these constraints are the ideal characters taken from the specification of the font. \nThe first type of constraint on the template is that the output of the template should be above 1 when the character that corresponds to the template is centered in the horizontal field. \n\nFigure 3: Examples of images from the bad set for the templates trained to detect the zero character. These images are E13B characters that have been horizontally and vertically offset from the center of the image. The black border around each of the characters shows the boundary of the input field. Notice the variety of horizontal and vertical shifts of the different characters. \n\nCall the vector of pixels of this centered character G_i. 
This constraint is stated as: \n\nΣ_i w_i G_i ≥ 1   (2) \n\nThe second type of constraint on the template is to have an output much lower than 1 when incorrect or offset characters are applied to the template. We collect these incorrect and offset characters into a set of pixel vectors B^j, which we call the \"bad set.\" The constraint that the output of the template be lower than a constant c for all of the vectors in the bad set is expressed as: \n\nΣ_i w_i B_i^j ≤ c for all j   (3) \n\nTogether, constraints (2) and (3) permit use of a simple threshold to distinguish between a positive classifier response and a negative one. \nThe bad set contains examples of the correct character for the template that are horizontally offset by at least two pixels and vertically offset by up to one pixel. In addition, examples of all other characters are added to the bad set at every horizontal offset and with vertical offsets of up to one pixel (see figure 3). Vertically offset examples are added to make the classifier resistant to characters whose baselines are slightly mismatched. \nThe third type of constraint on the template requires that the output be invariant to the addition of a constant to all of the input pixels. This constraint makes the classifier immune to any changes in the background lighting level, k. This constraint is equivalent to requiring the sum of the weights to be zero: \n\nΣ_i w_i = 0   (4) \n\nFinally, an optimization principle is necessary to choose between all possible weight vectors that fulfill constraints (2), (3), and (4). We minimize the perturbation of the output of the template given uncorrelated random noise on the input. This optimization principle is similar to training on a large data set, instead of simply the ideal characters described by the specification. This optimization principle is 
equivalent to minimizing the sum of the squares of the weights: \n\nmin Σ_i w_i^2   (5) \n\nExpressing the training of the classifier as a combination of constraints and an optimization principle allows us to compactly define its behavior. For example, the combination of constraints (3) and (4) makes the classifier immune to situations where two partial characters appear in the image at the same time. The confluence of two characters in the image can be described as: \n\nI_i^overlap = k + B_i^l + B_i^r   (6) \n\nwhere k is a background value and B_i^l and B_i^r are partial characters from the bad set that appear on the left side and right side of the image, respectively. The output of the template is then: \n\no^overlap = Σ_i w_i (k + B_i^l + B_i^r) = Σ_i w_i k + Σ_i w_i B_i^l + Σ_i w_i B_i^r < 2c   (7) \n\nConstraints (3) and (4) thus limit the output of the neuron to less than 2c when two partial characters appear in the input. Therefore, we want c to be less than 0.5. In order to get a 2:1 margin, we choose c = 0.25. \nThe classifier is trained only on individual partial characters instead of all possible combinations of partial characters. Therefore, we can specify the classifier using only 1523 constraints, instead of creating a training set of approximately 128,000 possible combinations of partial characters. Applying these constraints is therefore much faster than back-propagation on the entire data set. \nEquations (2), (3) and (5) describe the optimization problem solved by Vapnik (Vapnik, 1982) for constructing a hyperplane that separates two classes. Vapnik solves this optimization problem by converting it into a dual space, where the inequality constraints become much simpler. 
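The specification above is small enough to hand to a general-purpose solver. The following sketch is illustrative only: it uses a toy 1-D "image," a hypothetical bad set built from shifted copies of the character, and scipy's SLSQP solver rather than the paper's active set method, but it encodes constraints (2)-(4) and objective (5) directly.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the template QP of section 2; sizes and data are invented.
rng = np.random.default_rng(0)
n = 12                                    # toy image size (the chip uses 18 x 22 pixels)
G = (rng.random(n) > 0.5).astype(float)   # ideal centered character from the font spec
B = np.stack([np.roll(G, s) for s in (2, 3, -2, -3)])  # shifted characters: the bad set
c = 0.25                                  # bad-set ceiling, giving the paper's 2:1 margin

constraints = [
    {'type': 'ineq', 'fun': lambda w: w @ G - 1.0},  # (2): output >= 1 on the character
    {'type': 'ineq', 'fun': lambda w: c - B @ w},    # (3): output <= c on the bad set
    {'type': 'eq',   'fun': lambda w: np.sum(w)},    # (4): lighting invariance
]
res = minimize(lambda w: 0.5 * w @ w, np.zeros(n),   # (5): minimum squared weights
               constraints=constraints, method='SLSQP')
w = res.x
```

The active set method of the next section solves the same problem, but exploits the structure of the quadratic program instead of iterating a generic sequential solver.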
However, we add the equality constraint (4), which does not allow us to directly use Vapnik's dual space method. To overcome this limitation, we use the active set method, which can fulfill any extra linear equality or inequality constraints. The active set method is described in the next section. \n\n3 THE ACTIVE SET METHOD \n\nNotice that constraints (2), (3), and (4) are all linear in w_i. Therefore, minimizing (5) with these constraints is simply quadratic programming with a mixture of equality and inequality constraints. This problem can be solved using the active set method from optimization theory (Gill, et al., 1981). \nWhen the quadratic programming problem is solved, some of the inequality constraints and all of the equality constraints will be \"active.\" In other words, the active constraints affect the solution as equality constraints. The system has \"bumped into\" these constraints. All other constraints will be inactive; they will not affect the solution. \nOnce we know which constraints are active, we can easily solve the quadratic minimization problem with equality constraints via Lagrange multipliers. The solution is a saddle point of the function: \n\n(1/2) Σ_i w_i^2 + Σ_k λ_k (Σ_i A_ki w_i - c_k)   (8) \n\nwhere λ_k is the Lagrange multiplier of the kth active constraint, and A_ki and c_k are the linear and constant coefficients of the kth active constraint. For example, if constraint (2) is the kth active constraint, then A_ki = G_i and c_k = 1. The saddle point can be found via the set of linear equations: \n\nw_i = -Σ_k λ_k A_ki   (9) \n\nλ_j = -Σ_k [M^-1]_jk c_k, with M_jk = Σ_i A_ji A_ki   (10) \n\nThe active set method determines which inequality constraints belong in the active set by iteratively solving equation (10) above. 
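As a numerical sketch of equations (8)-(10), with invented numbers and only two active constraints rather than the chip's 1523: once the active set is fixed, the multipliers and the weights follow from a single linear solve.

```python
import numpy as np

# Rows of A are the active constraints' linear coefficients; ck their targets.
# Here: a hypothetical 'correct character' row (constraint (2) held at 1) and
# the all-ones row of the lighting constraint (4) held at 0.
rng = np.random.default_rng(1)
n = 10
A = np.vstack([rng.random(n), np.ones(n)])
ck = np.array([1.0, 0.0])

M = A @ A.T                          # M_jk = sum_i A_ji A_ki
lam = -np.linalg.solve(M, ck)        # equation (10): Lagrange multipliers
w = -A.T @ lam                       # equation (9): weights from the multipliers
```

The solve recovers exactly the minimum-norm weights that satisfy the active constraints with equality, which is what the saddle point of (8) demands.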
At every step, one inequality constraint is either made active or made inactive. A constraint can be moved into the active set if the inequality constraint is violated. A constraint can be moved off the active set if its Lagrange multiplier has changed sign [1]. \n\nFigure 4: The position along the step where the constraints become violated or the Lagrange multipliers become zero can be computed analytically. The algorithm then takes the largest possible step without violating constraints or having the Lagrange multipliers become zero. \n\nEach step of the active set method attempts to adjust the vector of Lagrange multipliers to the values provided by equation (10). Let us parameterize the step from the old to the new Lagrange multipliers via a parameter α: \n\nλ = λ^0 + α δλ   (11) \n\nwhere λ^0 is the vector of Lagrange multipliers before the step, δλ is the step, and when α = 1, the step is completed. Now, the amount of constraint violation and the Lagrange multipliers are linear functions of this α. Therefore, we can analytically derive the α at which a constraint is violated or a Lagrange multiplier changes sign (see figure 4). For currently inactive constraints, the α for constraint violation is: \n\nα_k = -(c_k + Σ_j λ_j^0 Σ_i A_ji A_ki) / (Σ_j δλ_j Σ_i A_ji A_ki)   (12) \n\nFor a currently active constraint, the α for a Lagrange multiplier sign change is simply: \n\nα_k = -λ_k^0 / δλ_k   (13) \n\nWe choose the constraint that has the smallest positive α_k. 
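The step-length rule of equations (11) and (13) can be sketched with made-up multiplier values (the numbers below are illustrative only, not taken from the chip's training run):

```python
import numpy as np

lam0 = np.array([0.8, 0.3, 1.2])         # multipliers before the step
lam_target = np.array([-0.4, 0.5, 0.9])  # values proposed by equation (10)
dlam = lam_target - lam0                 # the step of equation (11)

alphas = -lam0 / dlam                    # equation (13): where each multiplier hits zero
positive = alphas[alphas > 0]
alpha = min(1.0, positive.min()) if positive.size else 1.0
lam = lam0 + alpha * dlam                # interpolated multipliers, equation (11)
```

Here the first multiplier reaches zero at α = 2/3, so the full step is truncated there and that constraint leaves the active set, which is the situation drawn in figure 4.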
If the smallest α_k is greater than 1, then the system has found the solution, and the final weights are computed from the Lagrange multipliers at the end of the step. Otherwise, if the kth constraint is active, we make it inactive, and vice versa. We then set the Lagrange multipliers to be the interpolated values from equation (11) with α = α_k. We finally re-evaluate equation (10) with the updated active set [2]. \nWhen this optimization algorithm is applied to the E13B font, the templates that result are shown in figure 5. When applied to characters that obey the specification, the classifier is guaranteed to give a 2:1 margin between the correct peak and any false peak caused by the confluence of two partial characters. Each template has 1523 constraints and takes 7.5 minutes on a SPARC 2 to train. Back-propagation on the 128,000 training examples that are equivalent to the constraints would obviously require much more computation time. \n\n[1] The sign of the Lagrange multiplier indicates on which side of the inequality constraint the constrained minimum lies. \n[2] For more details on active set methods, such as how to recognize infeasible constraints, consult (Gill, et al., 1981). \n\nFigure 5: The weights for the fourteen E13B templates. The light pixels correspond to positive weights, while the dark pixels correspond to negative weights. \n\nFigure 6: The software second layer. A spatial window of 15 pixels over a time-warped history of the 14 I1000 outputs (every vertical column contains 13 zeros) feeds 2 hidden neurons, which drive 14 output neurons and a pinger neuron. \n\n4 THE SOFTWARE SECOND LAYER \n\nAs a test of the linear classifier, we fabricated the I1000 and tested it with E13B characters on real checks. 
The system worked when the printing on the check obeyed the contrast specification of the font. However, some check printing companies use very light or very dark printing. Therefore, there was no single threshold that could consistently read the lightly printed checks without hallucinating characters on the dark checks. The retina shown in figure 2 does not have automatic gain control (AGC). One solution would have been to refabricate the chip using an AGC retina. However, we opted for a simpler solution. \nThe output of the I1000 chip is a 2-bit confidence level and a character code that is sent to an inexpensive microprocessor every 50 microseconds. Because this output bandwidth is low, it is feasible to put a small software second layer into this microprocessor to post-process and clean up the output of the I1000. \nThe architecture of this software second layer is shown in figure 6. The input to the second layer is a linearly time-warped history of the output of the I1000 chip. The time warping makes the second layer immune to changes in the velocity of the check in the slot. There is one output neuron that is a \"pinger.\" That is, it is trained to turn on when the input to the I1000 chip is centered over any character (Platt, et al., 1992) (Martin & Pittman, 1992). There are fourteen other neurons that each correspond to a character in the font. These neurons are trained to turn on when the appropriate character is centered in the field, and to turn off otherwise. The classification output is the output of the fourteen neurons only when the pinger neuron is on. Thus, the pinger neuron aids in segmentation. \nConsidering the entire network spanning both the hardware first layer and the software second layer, we have constructed a non-standard TDNN (Waibel, et al., 1989) which recognizes characters. \nWe trained the second layer using standard back-propagation, with a training set gathered from real checks. Because the I1000 output bandwidth is quite low, collecting the data and training the network was not onerous. The second layer was trained on a data set of approximately 1000 real checks. \n\n5 OVERALL PERFORMANCE \n\nWhen the hardware first layer in the I1000 is combined with the software second layer, the performance of the system on real checks is quite impressive. We gathered a test set of 1500 real checks from across the country. This test set contained a variety of light and dark checks with unusual backgrounds. We swiped this test set through one system. Out of the 1500 test checks, the system failed to read only 2, due to staple holes in important locations of certain characters. As such, this test yielded a 99.995% character accuracy on real data. \n\n6 CONCLUSIONS \n\nFor the I1000 analog VLSI OCR chip, we have created an effective hardware linear classifier that recognizes the E13B font. The behavior of this classifier was specified using constrained optimization. The classifier was designed to have a predictable margin of classification, be immune to lighting variations, and be resistant to random input noise. The classifier was trained using the active set method, which is an enhancement of Vapnik's separating hyperplane algorithm. We used the active set method to find the weights of a template in 7.5 minutes of SPARC 2 time, instead of training on a data set with 128,000 examples. 
To make the overall system resistant to contrast variation, we separately trained a software second layer on top of this first hardware layer, thereby constructing a non-standard TDNN. \nThe application discussed in this paper shows the utility of using the active set method to very rapidly create either a stand-alone linear classifier or a first layer of a multi-layer network. \n\nReferences \n\nP. Gill, W. Murray, M. Wright (1981), Practical Optimization, Section 5.2, Academic Press. \nJ. Lazzaro, S. Ryckebusch, M. Mahowald, C. Mead (1989), \"Winner-Take-All Networks of O(N) Complexity,\" Advances in Neural Information Processing Systems, 1, D. Touretzky, ed., Morgan-Kaufmann, San Mateo, CA. \nG. Martin, M. Rashid (1992), \"Recognizing Overlapping Hand-Printed Characters by Centered-Object Integrated Segmentation and Recognition,\" Advances in Neural Information Processing Systems, 4, Moody, J., Hanson, S., Lippmann, R., eds., Morgan-Kaufmann, San Mateo, CA. \nJ. Platt, J. Decker, and J. LeMoncheck (1992), \"Convolutional Neural Networks for the Combined Segmentation and Recognition of Machine Printed Characters,\" USPS 5th Advanced Technology Conference, 2, 701-713. \nV. Vapnik (1982), Estimation of Dependencies Based on Empirical Data, Addendum I, Section 2, Springer-Verlag. \nA. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. Lang (1989), \"Phoneme Recognition Using Time-Delay Neural Networks,\" IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 328-339. \n", "award": [], "sourceid": 1170, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Timothy", "family_name": "Allen", "institution": null}]}