{"title": "Learning in Computer Vision and Image Understanding", "book": "Advances in Neural Information Processing Systems", "page_first": 1182, "page_last": 1183, "abstract": null, "full_text": "Learning in Computer Vision and Image Understanding\n\nHayit Greenspan\n\nDepartment of Electrical Engineering\n\nCalifornia Institute of Technology, 116-81\n\nPasadena, CA 91125\n\nThere is an increasing interest in the area of Learning in Computer Vision and\nImage Understanding, both from researchers in the learning community and from\nresearchers involved with the computer vision world. The field is characterized by\na shift away from the classical, purely model-based, computer vision techniques,\ntowards data-driven learning paradigms for solving real-world vision problems.\n\nUsing learning in segmentation or recognition tasks has several advantages over\nclassical model-based techniques. These include adaptivity to noise and changing\nenvironments, as well as, in many cases, a simplified system generation procedure.\nYet, learning from examples introduces a new challenge: getting a representative\ndata set of examples from which to learn. Applications of learning systems to practical\nproblems have shown that the performance of the system is often critically\ndependent on both the size and quality of the training set. Federico Girosi of\nMIT suggested the use of prior information as a general method for synthesizing\nmany training examples from few exemplars. Prototypical transformations are\nused for general 3D object recognition. Face recognition was presented as a particular\nexample. 
Dean Pomerleau of Carnegie Mellon addressed the training\ndata problem as well, within the context of ALVINN, a neural network vision system\nwhich drives an autonomous van without human intervention. Some general\nproblems emerge, such as getting sufficient training data for the more unexpected\nscenes including passing cars and intersections. Several techniques for exploiting\nprior geometric knowledge during training and testing of the neural network were\npresented. A somewhat different perspective was presented by Bartlett Mel of\nCaltech. Bartlett introduced a 3D object recognition approach based on concepts\nfrom the human visual system. Here the assumption is that a large database of examples\nexists, with varying viewing angles and distances, as is available to human\nobservers as they manipulate and inspect common objects.\n\nA different issue of interest was using learning schemes in general recognition frameworks\nwhich can handle several different vision problems. Hayit Greenspan of\nCaltech suggested combining unsupervised and supervised learning approaches\nwithin a multiresolution image representation space, for texture and shape recognition.\nIt was suggested that shifting the input pixel representation to a more robust\nrepresentation (using a pyramid filtering approach) in combination with learning\nschemes can combine the advantages of both approaches. Jonathan Marshall of\nthe University of North Carolina concentrated on unsupervised learning and proposed\nthat a common set of unsupervised learning rules might provide a basis for communication\nbetween different visual modules (such as stereopsis, motion perception,\ndepth and so forth). 
\n\nThe role of unsupervised learning in vision tasks, and its combination with supervised\nlearning, was an issue of discussion. The question arose of how much unsupervised\nlearning is actually unsupervised. Some a priori knowledge, or bias, is always\npresent (e.g., the metric chosen for the task). Eric Saund of Xerox introduced\nthe window registration problem in unsupervised learning of visual features. He\nargued that there is a strong dependence on the window placement, as slight shifts\nin the window placement can represent confounding assignments of image data to\nthe input units of the classifying network. Chris Williams of Toronto introduced\nthe use of unsupervised learning for classifying objects. Given a set of images, each\nof which contains one instance of a small but unknown set of objects imaged from\na random viewpoint, unsupervised learning is used to discover the object classes.\nData is grouped into objects via a mixture model which is trained with the EM\nalgorithm.\n\nReal-world computer vision applications in which learning can play a major role,\nand the challenges involved, were an additional theme in the workshop. Yann Le\nCun of AT&T described a handwritten word recognizer system of multiple modules,\nas an example of a large scale vision system. Yann suggested that increasing\nthe role of learning in all modules allows one to minimize the amount of hand-built\nheuristics and improves the robustness and generality of the system. Challenges\ninclude training large learning machines which are composed of multiple, heterogeneous\nmodules, and deciding what the modules should contain. Padhraic Smyth of JPL\nintroduced the challenges for vision and learning in the context of large scientific\nimage databases. 
In this domain there is often a large amount of data which typically\nhas no ground truth labeling. In addition, natural objects can be much more\ndifficult to deal with than man-made objects. Learning can be valuable here, as a\nlow-cost solution and sometimes the only solution (with model-based schemes being\nimpractical). The task of face recognition was addressed by Joachim Buhmann of\nBonn. Elastic matching was introduced for translation, rotation and scale invariant\nrecognition. Methods to combine unsupervised and supervised data clustering with\nelastic matching to learn a discriminant metric and enhance saliency of prototypes\nwere discussed. Related issues from a recent AAAI forum on Machine Learning in\nComputer Vision were presented by Rich Zemel of the Salk Institute.\n\nIn Conclusion\nThe vision world is very diverse, with each different task introducing a whole spectrum\nof challenges and open issues. Currently, many of the approaches are very\napplication dependent. It is clear that much effort still needs to be put into the\ndefinition of the underlying themes of the field as combined across the different\napplication domains. There was general agreement at the workshop that the issues\nbrought up should be pursued further and discussed at future follow-up workshops.\n\nSpecial thanks to Padhraic Smyth, Tommy Poggio, and Rama Chellappa for their\ncontribution to the organization of the workshop.\n\n", "award": [], "sourceid": 797, "authors": [{"given_name": "Hayit", "family_name": "Greenspan", "institution": null}]}