{"title": "A Rapid Graph-based Method for Arbitrary Transformation-Invariant Pattern Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 672, "abstract": null, "full_text": "A  Rapid  Graph-based Method for \nArbitrary Transformation-Invariant \n\nPattern Classification \n\nAlessandro Sperduti \n\nDipartimento di Informatica \n\nU niversita di Pisa \n\nCorso Italia 40 \n\n56125  Pisa, ITALY \nperso~di.unipi.it \n\nDavid  G.  Stork \n\nMachine Learning and Perception Group \n\nRicoh California Research Center \n\n2882  Sand Hill  Road # 115 \n\nMenlo Park,  CA USA  94025-7022 \n\nstork~crc.ricoh.com \n\nAbstract \n\nWe  present  a  graph-based  method  for  rapid,  accurate  search \nthrough prototypes for  transformation-invariant pattern classifica(cid:173)\ntion.  Our method has in theory the same recognition accuracy as \nother recent  methods  based  on  ''tangent  distance\"  [Simard et al., \n1994],  since it uses the same categorization rule.  Nevertheless ours \nis  significantly  faster  during  classification  because  far  fewer  tan(cid:173)\ngent  distances  need  be  computed.  Crucial  to  the success  of our \nsystem  are  1)  a  novel  graph  architecture in  which  transformation \nconstraints and geometric  relationships  among  prototypes are en(cid:173)\ncoded  during learning,  and 2)  an improved graph search criterion, \nused during classification.  These architectural insights are applica(cid:173)\nble to a wide range of problem domains.  Here we demonstrate that \non  a  handwriting  recognition task,  a  basic implementation of our \nsystem  requires  less  than  half the  computation  of the  Euclidean \nsorting method. \n\n1 \n\nINTRODUCTION \n\nIn recent years,  the crucial issue of incorporating invariances into networks for  pat(cid:173)\ntern recognition has received increased attention, most especially due to the work of \n\n\f666 \n\nAlessandro Sperduti,  David G. Stork \n\nSimard and his colleagues.  To a regular hierachical backpropagation network Simard \net al.  [1992]  added a  Jacobian  network,  which  insured  that directional  derivatives \nwere  also  learned.  Such  derivatives  represented  directions  in  feature  space  corre(cid:173)\nsponding  to  the invariances  of interest,  such  as  rotation,  translation,  scaling  and \neven  line  thinning.  On small training sets  for  a  function  approximation  problem, \nthis  hybrid  network showed  performance superior  to that of a  highly tuned back(cid:173)\npropagation  network  taken  alone;  however  there  was  negligible  improvement  on \nlarge  sets.  In  order  to  find  a  simpler  method  applicable  to  real-world  problems, \nSimard,  Le  Cun  &  Denker  [1993]  later  used  a  variation  of the  nearest  neighbor \nalgorithm, one incorporating  \"tangent distance\"  (T-distance or D T )  as the classifi(cid:173)\ncation metric -\nthe smallest Euclidean distance between patterns after the optimal \ntransformation.  In this way,  state-of-the-art  accuracy was  achieved on an isolated \nhandwritten character task, though at quite high computational complexity,  owing \nto the inefficient search and large number of Euclidean and tangent distances that \nhad to be calculated. 
Whereas Simard, Hastie & Saeckinger [1994] have recently sought to reduce this complexity by means of pre-clustering stored prototypes, we here take a different approach, one in which a (graph) data structure formed during learning contains information about transformations and geometrical relations among prototypes. Nevertheless, it should be noted that our method can be applied to a reduced (clustered) training set such as they formed, yielding yet faster recognition. Simard [1994] recently introduced a hierarchical structure of successively lower resolution patterns, which speeds search only if a minority of patterns are classified more accurately by using the tangent metric than by other metrics. In contrast, our method shows significant improvement even if the majority or all of the patterns are most accurately classified using the tangent distance.

Other methods seeking fast invariant classification include Wilensky and Manukian's scheme [1994]. While quite rapid during recall, it is more properly considered distortion (rather than coherent transformation) invariant. Moreover, some transformations such as line thinning cannot be naturally incorporated into their scheme. Finally, it appears as if their scheme scales poorly (compared to tangent metric methods) as the number of invariances is increased.

It seems somewhat futile to try to improve significantly upon the recognition accuracy of the tangent metric approach: for databases such as NIST isolated handwritten characters, Simard et al. [1993] reported accuracies matching that of humans! Nevertheless, there remains much that can be done to increase the computational efficiency during recall. This is the problem we address.

2  TRANSFORMATION INVARIANCE

In broad overview, during learning our method constructs a labelled graph data structure in which each node represents a stored prototype (labelled by its category) as given by a training set, linked by arcs representing the T-distance between them. Search through this graph (for classification) takes advantage of the graph structure and an improved search criterion. To understand the underlying computations, we must first consider tangent space.

Figure 1: Geometry of tangent space. Here, a three-dimensional feature space contains the "current" prototype, Pc, and the subspace consisting of all patterns obtainable by performing continuous transformations of it (shaded). Two candidate prototypes and a test pattern, T, as well as their projections onto the T-space of Pc are shown. The insert (above) shows the progression of search through the corresponding portion of the recognition graph. The goal is to rapidly find the prototype closest to T (in the T-distance sense), and our algorithm (guided by the minimum angle θj in the tangent space) finds that P2 is closer to T than are either P1 or Pc (see text).

Figure 1 illustrates the geometry of tangent space and the relationships among the fundamental entities in our trained system. A labelled ("current") trained pattern is represented by Pc, and the (shaded) surface corresponds to patterns arising under continuous transformations of Pc. Such transformations might include rotation, translation, scaling, line thinning, etc. Following Simard et al. [1993], we approximate this surface in the vicinity of Pc by a subspace, the tangent space or T-space of Pc, which is spanned by "tangent" vectors, whose directions are determined by infinitesimally transforming the prototype Pc. The figure shows an orthonormal basis {TVa, TVb}, which helps to speed search during classification, as we shall see. A test pattern T and two other (candidate) prototypes, as well as their projections onto the T-space of Pc, are shown.
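In practice, tangent vectors are often approximated by finite differences: apply a small amount of each transformation to the prototype, subtract, and divide by the step size; an orthonormal basis then follows from a QR factorization. Here is a sketch under those assumptions (the transformation functions are hypothetical placeholders, not part of the paper):

    import numpy as np

    def tangent_basis(proto, transforms, eps=1e-3):
        # proto: flattened pattern of dimension m
        # transforms: list of t functions, each applying an amount eps of one
        # transformation (rotation, translation, scaling, ...) to a pattern
        tangents = np.stack([(f(proto, eps) - proto) / eps for f in transforms])
        # Orthonormalize the t tangent vectors; the rows of the result span
        # the same subspace as the raw tangents.
        q, _ = np.linalg.qr(tangents.T)
        return q.T  # shape (t, m), orthonormal rows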
3  THE ALGORITHMS

Our overall approach includes constructing a graph (during learning) and searching it (for classification). The graph is constructed by the following algorithm:

Graph construction
Initialize  N = # patterns; k = # nearest neighbors; t = # invariant transformations
Begin Loop  For each prototype Pi (i = 1, ..., N)
  • Compute a t-dimensional orthonormal basis for the T-space of Pi
  • Compute the ("one-sided") T-distance of each of the N - 1 prototypes Pj (j ≠ i) using Pi's T-space
  • Represent Pj⊥ (the projection of Pj onto the T-space of Pi) in the tangent orthonormal frame of Pi
  • Connect Pi to each of its k T-nearest neighbors, storing their associated normalized projections Pj⊥
End Loop

During classification, our algorithm permits rapid search through prototypes. Thus in Figure 1, starting at Pc we seek to find another prototype (here, P2) that is closer to the test point T. After P2 is so chosen, it becomes the current pattern, and the search is extended using its T-space. Graph search ends when the closest prototype to T is found (i.e., closest in a T-distance sense).

We let D*_T denote the current minimum tangent distance. Our search algorithm is:

Graph search
Input  Test pattern T
Initialize
  • Choose initial candidate prototype, P0
  • Set Pc ← P0
  • Set D*_T ← D_T(Pc, T), i.e., the T-distance of T from Pc
Do
  • For each prototype Pj connected to Pc compute cos(θj) = (T⊥ · Pj⊥) / (|T⊥| |Pj⊥|)
  • Sort these prototypes by increasing values of θj and put them into a candidate list
  • Pick Pj from the top of the candidate list
  • In the T-space of Pj, compute D_T(Pj, T)
    If D_T(Pj, T) < D*_T then Pc ← Pj and D*_T ← D_T(Pj, T);
    otherwise mark Pj as a "failure" (F), and pick the next prototype from the candidate list
Until  Candidate list empty
Return  D*_T or the category label of the optimum prototype found
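The search loop above might be rendered compactly as follows. This is a sketch, not the authors' implementation: it assumes each node stores its pattern, its orthonormal tangent basis, and, for each linked neighbor, that neighbor's stored unit-norm projection in frame coordinates; it reuses the tangent_distance routine sketched in the Introduction.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Node:
        pattern: np.ndarray   # flattened prototype, shape (m,)
        basis: np.ndarray     # orthonormal tangent vectors, shape (t, m)
        neighbors: list       # (index, unit-norm projection in frame coords)

    def graph_search(T, graph, start, tangent_distance):
        # Greedy search guided by the angle criterion cos(theta_j).
        current = start
        best = tangent_distance(T, graph[current].pattern, graph[current].basis)
        visited = {current}
        improved = True
        while improved:
            improved = False
            node = graph[current]
            # Coordinates of T's projection onto the current tangent space.
            t_proj = node.basis @ (T - node.pattern)
            t_hat = t_proj / (np.linalg.norm(t_proj) + 1e-12)
            # Small angle = large cosine: rank candidates by decreasing cosine.
            candidates = sorted(node.neighbors, key=lambda nb: -float(t_hat @ nb[1]))
            for j, _ in candidates:
                if j in visited:
                    continue          # already tried (marked M in Figure 2)
                visited.add(j)
                d = tangent_distance(T, graph[j].pattern, graph[j].basis)
                if d < best:          # success: move the search to Pj
                    best, current = d, j
                    improved = True
                    break
                # otherwise Pj is a failure (F); try the next candidate
        return best, current

The loop halts when no unvisited neighbor of the current prototype improves on D*_T, mirroring the empty-candidate-list test above.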
[Figure 2: five successive prototypes visited during the search, with D_T = 4.91, 3.70, 3.61, 3.03, 2.94.]

Figure 2: The search through the "2" category graph for the T-nearest stored prototype to the test pattern is shown (N = 720 and k = 15 nearest neighbors). The number of T-distance calculations is equal to the number of nodes visited plus the number of failures (marked F); i.e., in the case shown, 5 + 26 = 31. The backward search step attempt is thwarted because the middle node has already been visited (marked M). Notice in the prototypes how the search is first a downward shift, then a counter-clockwise rotation: a mere four steps through the graph.

Figure 2 illustrates search through a network of "2" prototypes. Note how the T-distance of the test pattern decreases, and that with only four steps through the graph the optimal prototype is found.

There are several ways in which our search technique can be incorporated into a classifier. One is to store all prototypes, regardless of class, in a single large graph and perform the search; the test pattern is classified by the label of the optimal prototype found. Another is to employ separate graphs, one for each category, and search through them (possibly in parallel); the test pattern is classified by the minimum-T-distance prototype found (a sketch of this arrangement follows at the end of this section). The choice of method depends upon the hardware limitations, performance speed requirements, etc. Figure 3 illustrates such a search through a "2" category graph for the closest prototype to a test pattern "5." We report below results using a single graph per category, however.

[Figure 3: five successive "2" prototypes visited during the search, with D_T = 5.10, 5.09, 5.01, 4.93, 4.90.]

Figure 3: The search through a "2" category graph given a "5" test pattern. Note how the search first tries to find a prototype that matches the upper arc of the "5," and then one possessing skew or rotation. For this test pattern, the minimum T-distance found for the "5" category (3.62) is smaller than the one found for the "2" category shown here (4.22), and indeed for any other category. Thus the test pattern is correctly classified as a "5."

3.1  Computational complexity

If a graph contains N prototypes with k pointers (arcs) each, and if the patterns are of dimension m, then the storage requirement is O(N((t + 1) · m² + kt)). The time complexity of training depends upon details of orthonormalization, sorting, etc., and is of little interest anyway. Construction is more than an order of magnitude faster than neural network training on similar problems; for instance, construction of a graph for N = 720 prototypes and k = 100 nearest neighbors takes less than 20 minutes on a Sparc 10.

The crucial quantity of interest is the time complexity for search. This is, of course, problem related, and depends upon the number of categories, transformations and prototypes and their statistical properties (see the next section). Worst-case analyses (e.g., it is theoretically conceivable that nearly all prototypes must be visited) are irrelevant to practice.

We used a slightly non-obvious search criterion at each step, the function cos(θj), as shown in Figure 1. Not only could this criterion be calculated very efficiently in our orthonormal basis (by using simple inner products), but it actually led to a slightly more accurate search than Euclidean distance in the T-space, perhaps the most natural choice of criterion. The angle θj seems to guide the "flow" of the search along transformation directions toward the test point.
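As a concrete rendering of the separate-graphs option mentioned above, a per-category classifier might simply wrap the graph_search routine sketched earlier (again a sketch with assumed names, not the authors' code):

    import numpy as np

    def classify(T, category_graphs, tangent_distance):
        # category_graphs: list of (label, graph, start_index) triples, one
        # graph per digit category; each graph is searched independently
        # (or in parallel) and the label of the overall minimum-T-distance
        # prototype wins.
        best_label, best_d = None, np.inf
        for label, graph, start in category_graphs:
            d, _ = graph_search(T, graph, start, tangent_distance)
            if d < best_d:
                best_d, best_label = d, label
        return best_label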
4  SIMULATIONS AND RESULTS

We explored the search capabilities of our system on the binary handwritten digit database of Guyon et al. [1991]. We needed to scale all patterns by a linear factor (0.833) to ensure that rotated versions did not go outside the 16 × 16 pixel grid. As required in all T-space methods, the patterns must be continuous valued (i.e., here grayscale); this was achieved by convolution with a spatially symmetric Gaussian having σ = 0.55 pixels (a sketch of this preprocessing appears at the end of this section). We had 720 training examples in each of ten digit categories; the test set consisted of 1320 test patterns formed by transforming independent prototypes in all meaningful combinations of the t = 6 transformations (four spatial directions and two rotation senses).

We compared the Euclidean sorting method of Simard et al. [1993] to our graph-based method using the same data and transformations, over the full range of relevant computational complexities. Figure 4 summarizes our results. For our method, the computational complexity is adjusted by the number of neighbors inspected, k. For their Euclidean sorting method, it is adjusted by the percentage of Euclidean nearest neighbors that were then inspected for T-distance. We were quite careful to employ as many computational tricks and shortcuts on both methods as we could think of. Our results fairly reflect the full computational complexity, which was dominated by tangent and Euclidean distance calculations.

[Figure 4: search accuracy and average search error plotted against computational complexity (equivalent number of T-distance calculations, 0 to 400).]

Figure 4: Comparison of graph-based (heavy lines) and standard Euclidean sorting searches (thin lines). Search accuracy is the percentage of optimal prototypes found on the full test set of 1320 patterns in a single category (solid lines). The average search error is the per-pattern difference between the global optimum T-distance and the one actually found, averaged over the non-optimal prototypes found through the search (dashed lines). Note especially that for the same computational complexity, our method has the same average error, but that this average is taken over a much smaller number of (non-optimal) prototypes. For a given criterion search accuracy, our method requires significantly less computation. For instance, if 90% of the prototypes must be found for a requisite categorization accuracy (a typical value for asymptotically high recognition accuracy), our graph-based method requires less than half the computation of the Euclidean sorting method.

We note parenthetically that many of the recognition errors for both methods could be explained by the fact that we did not include the transformation of line thinning (solely because we lacked the preprocessing capabilities); the overall accuracy of both methods will increase when this invariance is also included.
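The preprocessing described at the start of this section might be sketched as follows, assuming SciPy; the scale factor and kernel width come from the text, while the resampling and re-centering details are our assumptions:

    import numpy as np
    from scipy.ndimage import gaussian_filter, zoom

    def preprocess(binary_image):
        # binary_image: 16 x 16 array of {0, 1} pixels.
        # Shrink by the linear factor 0.833 so rotated versions stay inside
        # the grid, then re-center on a 16 x 16 canvas (padding scheme is
        # our assumption, not specified in the paper).
        small = zoom(binary_image.astype(float), 0.833, order=1)
        canvas = np.zeros((16, 16))
        r0 = (16 - small.shape[0]) // 2
        c0 = (16 - small.shape[1]) // 2
        canvas[r0:r0 + small.shape[0], c0:c0 + small.shape[1]] = small
        # Convolve with a symmetric Gaussian (sigma = 0.55 pixels) so the
        # pattern is continuous-valued, as all T-space methods require.
        return gaussian_filter(canvas, sigma=0.55)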
5  CONCLUSIONS AND FUTURE WORK

We have demonstrated a graph-based method using tangent distance that permits search through prototypes significantly faster than the most popular current approach. Although not shown above, ours is also superior to other tree-based methods, such as k-d trees, which are less accurate. Since our primary concern was reducing the computational complexity of search (while matching Simard et al.'s accuracy), we have not optimized over preprocessing steps, such as the Gaussian kernel width or transformation set. We note again that our method can be applied to reduced training sets, for instance ones pruned by the method of Simard, Hastie & Saeckinger [1994]. Simard's [1994] recent method, in which low-resolution versions of training patterns are organized into a hierarchical data structure so as to reduce the number of multiply-accumulates required during search, is in some sense "orthogonal" to ours. Our graph-based method will work with his low-resolution images too, and thus these two methods can be unified into a hybrid system.

Perhaps most importantly, our work suggests a number of research avenues. We used just a single ("central") prototype P0 to start search; presumably having several candidate starting points would be faster. Our general method may admit gradient descent learning of parameters of the search criterion. For instance, we can imagine scaling the different tangent basis vectors according to their relevance in guiding correct searches, as determined using a validation set. Finally, our approach may admit elegant parallel implementations for real-world applications.

Acknowledgements

This work was begun during a visit by Dr. Sperduti to Ricoh CRC. We thank I. Guyon for the use of her database of handwritten digits and Dr. K. V. Prasad for assistance in image processing.

References

I. Guyon, P. Albrecht, Y. Le Cun, J. Denker & W. Hubbard. (1991) "Comparing different neural network architectures for classifying handwritten digits," Proc. of the Inter. Joint Conference on Neural Networks, vol. II, pp. 127-132, IEEE Press.

P. Simard. (1994) "Efficient computation of complex distance metrics using hierarchical filtering," in J. D. Cowan, G. Tesauro and J. Alspector (eds.) Advances in Neural Information Processing Systems 6, Morgan Kaufmann, pp. 168-175.

P. Simard, B. Victorri, Y. Le Cun & J. Denker. (1992) "Tangent Prop: A formalism for specifying selected invariances in an adaptive network," in J. E. Moody, S. J. Hanson and R. P. Lippmann (eds.) Advances in Neural Information Processing Systems 4, Morgan Kaufmann, pp. 895-903.

P. Y. Simard, Y. Le Cun & J. Denker. (1993) "Efficient Pattern Recognition Using a New Transformation Distance," in S. J. Hanson, J. D. Cowan and C. L. Giles (eds.) Advances in Neural Information Processing Systems 5, Morgan Kaufmann, pp. 50-58.

P. Y. Simard, T. Hastie & E. Saeckinger. (1994) "Learning Prototype Models for Tangent Distance," Neural Networks for Computing, Snowbird, UT (April 1994).
(1994)  \"Nearest Neighbor Networks:  New  Neural \nArchitectures  for  Distortion-Insensitive  Image  Recognition,\"  Neural  Networks  for \nComputing Snowbird,  UT (April,  1994). \n\n\f", "award": [], "sourceid": 1012, "authors": [{"given_name": "Alessandro", "family_name": "Sperduti", "institution": null}, {"given_name": "David", "family_name": "Stork", "institution": null}]}