{"title": "Two-Dimensional Object Localization by Coarse-to-Fine Correlation Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 985, "page_last": 992, "abstract": null, "full_text": "Two-Dimensional Object Localization by Coarse-to-Fine Correlation Matching\n\nChien-Ping Lu and Eric Mjolsness\n\nDepartment of Computer Science\nYale University\nNew Haven, CT 06520-8285\n\nAbstract\n\nWe present a Mean Field Theory method for locating two-dimensional objects that have undergone rigid transformations. The resulting algorithm is a form of coarse-to-fine correlation matching. We first consider problems of matching synthetic point data, and derive a point matching objective function. A tractable line segment matching objective function is derived by considering each line segment as a dense collection of points, and approximating it by a sum of Gaussians. The algorithm is tested on real images from which line segments are extracted and matched.\n\n1 Introduction\n\nAssume that an object in a scene can be viewed as an instance of the model placed in space by some spatial transformation, and object recognition is achieved by discovering an instance of the model in the scene. Two tightly coupled subproblems need to be solved for locating and recognizing the model: the correspondence problem (how are scene features put into correspondence with model features?), and the localization problem (what is the transformation that acceptably relates the model features to the scene features?). If the correspondence is known, the transformation can be determined easily by least squares procedures. 
Similarly, for a known transformation, the correspondence can be found by aligning the model with the scene, or the problem becomes an assignment problem if the scene feature locations are jittered by noise.\n\nSeveral approaches have been proposed to solve this problem. Some tree-pruning methods [1, 3] make hypotheses concerning the correspondence by searching over a tree in which each node represents a partial match. Each partial match is then evaluated through the pose that best fits it. In the generalized Hough transform or, equivalently, template matching approach [7, 3], optimal transformation parameters are computed for each possible pairing of a model feature and a scene feature, and these \"optimal\" parameters then \"vote\" for the closest candidate in the discretized transformation space.\n\nBy contrast with the tree-pruning methods and the generalized Hough transform, we propose to formulate the problem as an objective function and optimize it directly by using Mean Field Theory (MFT) techniques from statistical physics, adapted as necessary to produce effective algorithms in the form of analog neural networks.\n\n2 Point Matching\n\nConsider the problem of locating a two-dimensional \"model\" object that is believed to appear in the \"scene\". Assume first that the scene and the model are each represented by a set of \"points\", $\\{x_i\\}$ and $\\{y_a\\}$ respectively. The problem is to recover the actual transformation (translation and rotation) that relates the two sets of points. 
It can be solved by minimizing the following objective function\n\n$$E_{\\mathrm{match}}(M, \\theta, t) = \\sum_{ia} M_{ia} \\| x_i - R_\\theta y_a - t \\|^2, \\tag{1}$$\n\nwhere $\\{M_{ia}\\} = M$ is a 0/1-valued \"match matrix\" representing the unknown correspondence, $R_\\theta$ is a rotation matrix with rotation angle $\\theta$, and $t$ is a translation vector.\n\n2.1 Constraints on match variables\n\nWe need to enforce some constraints on the correspondence (match) variables $M_{ia}$; otherwise (1) is minimized trivially by setting all $M_{ia} = 0$. Here, we use the following constraint\n\n$$\\sum_{ia} M_{ia} = N, \\quad \\forall\\, M_{ia} \\ge 0, \\tag{2}$$\n\nimplying that there are exactly $N$ matches among all possible matches, where $N$ is the number of model features. Summing over permutation matrices obeying this constraint, the effective objective function is approximately [5]:\n\n$$F(\\theta, t, \\beta) = -\\frac{1}{\\beta} \\sum_{ia} e^{-\\beta \\| x_i - R_\\theta y_a - t \\|^2}, \\tag{3}$$\n\nwhich has the same fixed points as\n\n$$E_{\\mathrm{penalty}}(M, \\theta, t) = E_{\\mathrm{match}}(M, \\theta, t) + \\frac{1}{\\beta} \\sum_{ia} M_{ia} (\\log M_{ia} - 1), \\tag{4}$$\n\nwhere $M_{ia}$ is treated as a continuous variable and is subject to the penalty function $x(\\log x - 1)$.\n\nFigure 1: Assume that there is only translation between the model and the scene, each containing 20 points. The objective functions at different temperatures ($\\beta^{-1}$): 0.0512 (top left), 0.0128 (top right), 0.0032 (bottom left) and 0.0008 (bottom right), are plotted as energy surfaces of the x and y components of translation.\n\nNow, let $\\beta = 1/2\\sigma^2$ and write\n\n$$E_{\\mathrm{point}}(\\theta, t) = \\sum_{ia} e^{-\\frac{1}{2\\sigma^2} \\| x_i - R_\\theta y_a - t \\|^2}. \\tag{5}$$\n\nThe problem then becomes that of maximizing $E_{\\mathrm{point}}$, which in turn can be interpreted as minimizing the Euclidean distance between two Gaussian-blurred images containing the scene points $x_i$ and a transformed version of the model points $y_a$. Tracking the local maximum of the objective function from large $\\sigma$ to small $\\sigma$, as in deterministic annealing and other continuation methods, corresponds to a coarse-to-fine correlation matching. See Figure 1 for a demonstration of a simpler case in which only translation is applied to the model.\n\n2.2 The descent dynamics\n\nA gradient descent dynamics for finding the saddle point of the effective objective function $F$ is\n\n$$\\dot{t} = \\kappa \\sum_{ia} m_{ia} (x_i - R_\\theta y_a - t), \\qquad \\dot{\\theta} = \\kappa \\sum_{ia} m_{ia} (x_i - R_\\theta y_a - t)^T (R_{\\theta + \\pi/2}\\, y_a), \\tag{6}$$\n\nwhere $m_{ia} = \\langle M_{ia} \\rangle_\\beta = e^{-\\beta \\| x_i - R_\\theta y_a - t \\|^2}$ is the \"soft correspondence\" associated with $M_{ia}$. Instead of updating $t$ by descent dynamics, we can also solve for $t$ directly.\n\n3 The Vernier Network\n\nThough the effective objective is non-convex over translation at low temperatures, its dependence on rotation is non-convex even at relatively high temperatures.\n\n3.1 Hierarchical representation of variables\n\nWe propose overcoming this problem by applying Mean Field Theory (MFT) to a hierarchical representation of rotation resulting from the change of variables [4]\n\n$$\\theta = \\sum_{b=0}^{B-1} \\chi_b (\\hat{\\theta}_b + \\theta_b), \\quad \\theta_b \\in [-\\epsilon, \\epsilon], \\tag{7}$$\n\nwhere $\\epsilon = \\pi/2B$, $\\hat{\\theta}_b = (b + \\frac{1}{2}) \\frac{\\pi}{B}$ are the constant centers of the intervals, and $\\theta_b$ are fine-scale \"vernier\" variables. The $\\chi_b$'s are binary variables (so $\\chi_b \\in \\{0, 1\\}$) that satisfy the winner-take-all (WTA) constraint $\\sum_b \\chi_b = 1$. 
\nThe essential reason that this hierarchical representation of $\\theta$ has fewer spurious local minima than the conventional analog representation is that the change of variables also changes the connectivity of the network's state space: big jumps in $\\theta$ can be achieved by local variations of $\\chi$.\n\n3.2 Vernier optimization dynamics\n\n$E_{\\mathrm{point}}$ can be transformed as (see [6, 4]) $^1$\n\n$$E_{\\mathrm{point}}(\\theta, t) \\longrightarrow E\\Big( \\sum_b \\chi_b (\\hat{\\theta}_b + \\theta_b), \\sum_b \\chi_b t_b \\Big) = \\sum_b \\chi_b E(\\hat{\\theta}_b + \\theta_b, t_b) \\tag{8}$$\n\n$^1$ Notation: Coordinate descent with 2-phase clock $\\Psi_\\alpha(t)$:\n\n\u2022 $\\oplus$ for clocked sum\n\u2022 $x$ for a clamped variable\n\u2022 $x^A$ for a set of variables to be optimized analytically\n\u2022 $(v, u)^H$ for Hopfield/Grossberg dynamics\n\u2022 $E\\langle x, y \\rangle^{\\oplus}$ for coordinate descent/ascent on $x$, then $y$, iterated if necessary. Nested angle brackets correspond to nested loops.\n\nFigure 2: Shown here is an example of matching a 20-point model to a scene with 66.7% spurious outliers. The model is represented by circles. The set of square dots is an instance of the model in the scene. 
All other dots are outliers. From left to right are configurations at the annealing steps 1, 10, and 51, respectively.\n\n$$\\xrightarrow{\\mathrm{MFT}} \\Big[ \\sum_b \\chi_b E(\\hat{\\theta}_b + v_b, t_b) + \\frac{1}{\\beta} \\sum_b \\Big( u_b v_b - \\log \\frac{\\sinh(\\epsilon u_b)}{\\epsilon} \\Big) + \\mathrm{WTA}(\\chi, \\beta) \\Big]_{\\langle \\langle (v, u)^H, t^A \\rangle, \\chi^A \\rangle^{\\oplus}} \\tag{9}$$\n\nEach bin-specific rotation angle $v_b$ can be found by the following fixed point equations\n\n(10) [fixed point equations for $v_b$ not recoverable from the extracted text]\n\nThe algorithm is illustrated in Figure 2.\n\n4 Line Segment Matching\n\nIn many vision problems, representation of images by line segments has the advantage of compactness and subpixel accuracy along the direction transverse to the line. However, such a representation of an object may vary substantially from image to image due to occlusions and different illumination conditions.\n\nFigure 3: Approximating $\\Theta(t)$ by a sum of 3 Gaussians.\n\n4.1 Indexing points on line segments\n\nThe problem of matching line segments can be thought of as a point matching problem in which each line segment is treated as a dense collection of points. Assume now that the scene and the model are each represented by a set of line segments, $\\{s_i\\}$ and $\\{m_a\\}$ respectively. Both the model and the scene line segments are represented by their endpoints as $s_i = (p_i, p'_i)$ and $m_a = (q_a, q'_a)$, where $p_i, p'_i$ and $q_a, q'_a$ are the endpoints of the $i$th scene segment and the $a$th model segment, respectively. The locations of the points on each scene segment and model segment can be parameterized as\n\n$$x_i = s_i(u) = p_i + u (p'_i - p_i), \\quad u \\in [0, 1], \\tag{11}$$\n$$y_a = m_a(v) = q_a + v (q'_a - q_a), \\quad v \\in [0, 1]. \\tag{12}$$\n\nNow the model points and the scene points can be thought of as indexed by $i = (i, u)$ and $a = (a, v)$. 
Using this indexing, we have $\\sum_i \\propto \\sum_i l_i \\int_0^1 du$ and $\\sum_a \\propto \\sum_a l_a \\int_0^1 dv$, where $l_i = \\| p_i - p'_i \\|$ and $l_a = \\| q_a - q'_a \\|$. The point matching objective function (5) can be specialized to line segment matching as [5]\n\n$$E_{\\mathrm{seg}}(\\theta, t) = \\sum_{ia} l_i l_a \\int_0^1 \\int_0^1 e^{-\\frac{1}{2\\sigma^2} \\| s_i(u) - R_\\theta m_a(v) - t \\|^2} \\, du \\, dv. \\tag{13}$$\n\nAs a special case of the point matching objective function, (13) can readily be transformed to the vernier network previously developed for the point matching problem.\n\nFigure 4: The model line segments, which are transformed with the optimal parameter found by the matching algorithm, are overlaid on the scene image. The algorithm has successfully located the model object in the scene.\n\n4.2 Gaussian sum approximation\n\nNote that, as in Figure 3 and [5],\n\n$$\\Theta(t) = \\begin{cases} 1 & \\text{if } t \\in [0, 1] \\\\ 0 & \\text{otherwise} \\end{cases} \\;\\approx\\; \\sum_{k=1}^{3} A_k \\exp\\Big( -\\frac{1}{2} \\frac{(c_k - t)^2}{\\sigma_k^2} \\Big), \\tag{14}$$\n\nwhere by numerical minimization of the Euclidean distance between these two functions of $t$, the parameters may be chosen as $A_1 = A_3 = 0.800673$, $A_2 = 1.09862$, $\\sigma_1 = \\sigma_3 = 0.0929032$, $\\sigma_2 = 0.237033$, $c_1 = 1 - c_3 = 0.116807$, and $c_2 = 0.5$. Using this approximation, each finite double integral in (13) can be replaced by\n\n$$\\sum_{k,l=1}^{3} A_k A_l \\int_{-\\infty}^{+\\infty} \\int_{-\\infty}^{+\\infty} e^{-\\frac{(c_k - u)^2}{2\\sigma_k^2}} \\, e^{-\\frac{(c_l - v)^2}{2\\sigma_l^2}} \\, e^{-\\frac{1}{2\\sigma^2} \\| s_i(u) - R_\\theta m_a(v) - t \\|^2} \\, du \\, dv. \\tag{15}$$\n\nEach of these nine Gaussian integrals can be done exactly. Defining\n\n$$v_{iakl} = s_i(c_k) - R_\\theta m_a(c_l) - t, \\tag{16}$$\n$$\\hat{p}_i = p'_i - p_i, \\quad \\hat{q}_a = R_\\theta (q'_a - q_a), \\tag{17}$$\n\n(15) becomes\n\n$$\\sum_{k,l=1}^{3} A_k A_l \\frac{2\\pi \\sigma^2 \\sigma_k \\sigma_l}{\\sqrt{(\\sigma^2 + \\hat{p}_i^2 \\sigma_k^2)(\\sigma^2 + \\hat{q}_a^2 \\sigma_l^2) - \\sigma_k^2 \\sigma_l^2 (\\hat{p}_i \\cdot \\hat{q}_a)^2}} \\times \\exp\\Big( -\\frac{1}{2} \\frac{v_{iakl}^2 \\sigma^2 + (v_{iakl} \\times \\hat{p}_i)^2 \\sigma_k^2 + (v_{iakl} \\times \\hat{q}_a)^2 \\sigma_l^2}{(\\sigma^2 + \\hat{p}_i^2 \\sigma_k^2)(\\sigma^2 + \\hat{q}_a^2 \\sigma_l^2) - \\sigma_k^2 \\sigma_l^2 (\\hat{p}_i \\cdot \\hat{q}_a)^2} \\Big) \\tag{18}$$\n\nas was calculated by Garrett [2, 5]. 
From the Gaussian sum approximation, we get a closed form objective function which can be readily optimized to give a solution to the line segment matching problem.\n\n5 Results and Discussion\n\nThe line segment matching algorithm described in this paper was tested on scenes captured by a CCD camera producing 640 x 480 images, which were then processed by an edge detector. Line segments were extracted using a polygonal approximation to the edge images. The model line segments were extracted from a scene containing a canonically positioned model object (Figure 4 left). They were then matched to those extracted from a scene containing a differently positioned and partially occluded model object (Figure 4 right). The result of matching is shown in Figure 5.\n\nOur approach is based on a scale-space continuation scheme derived from an application of Mean Field Theory to the match variables. It provides a means to avoid trapping by local extrema and is more efficient than stochastic searches such as simulated annealing. The estimation of location parameters based on continuously improved \"soft correspondences\" and scale-space is often more robust than that based on crisp (but usually inaccurate) correspondences.\n\nThe vernier optimization dynamics arises from an application of Mean Field Theory to a hierarchical representation of the rotation, which turns the original unconstrained optimization problem over rotation $\\theta$ into several constrained optimization problems over smaller $\\theta$ intervals. Such a transformation results in a Hopfield-style dynamics on rotation $\\theta$, which effectively coordinates the dynamics of rotation and translation during the optimization. The algorithm tends to find a roughly correct translation first, and then tunes up the rotation.\n\nFigure 5: Shows how the model line segments (gray) and the scene segments (black) are matched. The model line segments, which are transformed with the optimal parameter found by the matching algorithm, are overlaid on the scene line segments with which they are matched. Most of the endpoints and the lengths of the line segments are different. Furthermore, one long segment frequently corresponds to several short ones. However, the matching algorithm is robust enough to uncover the underlying rigid transformation from the incomplete and ambiguous data.\n\n6 Acknowledgements\n\nThis work was supported under grant N00014-92-J-4048 from ONR/DARPA.\n\nReferences\n\n[1] H. S. Baird. Model-Based Image Matching Using Location. The MIT Press, Cambridge, Massachusetts, first edition, 84.\n\n[2] C. Garrett, 1990. Private communication to Eric Mjolsness.\n\n[3] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:469-482, 1987.\n\n[4] C.-P. Lu and E. Mjolsness. Mean field point matching by vernier network and by generalized Hough transform. In World Congress on Neural Networks, pages 674-684, 1993.\n\n[5] E. Mjolsness. Bayesian inference on visual grammars by neural nets that optimize. In SPIE Science of Artificial Neural Networks, pages 63-85, April 1992.\n\n[6] E. Mjolsness and W. L. Miranker. Greedy Lagrangians for neural networks: Three levels of optimization in relaxation dynamics. Technical Report YALEU/DCS/TR-945, Yale Computer Science Department, January 1993.\n\n[7] G. Stockman. Object recognition and localization via pose clustering. 
Computer Vision, Graphics, and Image Processing, (40), 1987.", "award": [], "sourceid": 866, "authors": [{"given_name": "Chien-Ping", "family_name": "Lu", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}]}