{"title": "A Simple and Fast Neural Network Approach to Stereovision", "book": "Advances in Neural Information Processing Systems", "page_first": 808, "page_last": 814, "abstract": "", "full_text": "A Simple and Fast Neural Network Approach to Stereovision\n\nRolf D. Henkel\n\nInstitute of Theoretical Physics\nUniversity of Bremen\nP.O. Box 330 440, D-28334 Bremen\nhttp://axon.physik.uni-bremen.de/-rdh\n\nAbstract\n\nA neural network approach to stereovision is presented based on aliasing effects of simple disparity estimators and a fast coherence-detection scheme. Within a single network structure, a dense disparity map with an associated validation map and, additionally, the fused cyclopean view of the scene are available. The network operations are based on simple, biologically plausible circuitry; the algorithm is fully parallel and non-iterative.\n\n1 Introduction\n\nHumans experience the three-dimensional world not as it is seen by either their left or right eye, but from the position of a virtual cyclopean eye, located midway between the two real eye positions. The different perspectives of the left and right eyes cause slight relative displacements of objects in the two retinal images (disparities), which make a simple superposition of both images without diplopia impossible. Proper fusion of the retinal images into the cyclopean view requires the registration of both images to a common coordinate system, which in turn requires calculation of disparities for all image areas which are to be fused.\n\n1.1 The Problems with Classical Approaches\n\nThe estimation of disparities turns out to be a difficult task, since various random and systematic image variations complicate it. 
Several different techniques have been proposed over time, which can be loosely grouped into feature-, area- and phase-based approaches. All these algorithms have a number of computational problems directly linked to the very assumptions inherent in these approaches.\n\nIn feature-based stereo, intensity data is first converted to a set of features assumed to be a more stable image property than the raw image intensities. Matching primitives used include zero-crossings, edges and corner points (Frisby, 1991), or higher-order primitives like topological fingerprints (see for example: Fleck, 1991). Generally, the set of feature classes is discrete, causing the two primary problems of feature-based stereo algorithms: the famous \"false-matches\" problem and the problem of missing disparity estimates.\n\nFalse matches are caused by the fact that a single feature in the left image can potentially be matched with every feature of the same class in the right image. This problem is basic to all feature-based stereo algorithms and can only be solved by the introduction of additional constraints to the solution. In conjunction with the extracted features, these constraints define a complicated error measure which can be minimized by cooperative processes (Marr, 1979) or by direct (Ohta, 1985) or stochastic search techniques (Yuille, 1991). While cooperative processes and stochastic search techniques can be realized easily on a neural basis, it is not immediately clear how to implement the more complicated algorithmic structures of direct search techniques neuronally. 
Cooperative processes and stochastic search techniques turn out to be slow, needing many iterations to converge to a local minimum of the error measure.\n\nThe requirement that features be a stable image property causes the second problem of feature-based stereo: stable features can only be detected in a fraction of the whole image area, leading to missing disparity estimates for most of the image area. For those image parts, disparity estimates can only be guessed.\n\nDense disparity maps can be obtained with area-based approaches, where a suitably chosen correlation measure is maximized between small image patches of the left and right view. However, a neuronally plausible implementation of this does not seem to be readily available. Furthermore, the maximization turns out to be a computationally expensive process, since extensive search is required in configuration space.\n\nHierarchical processing schemes can be utilized for speed-up, by using information obtained at coarse spatial scales to restrict searching at finer scales. But, for general image data, it is not guaranteed that the disparity information obtained at some coarse scale is valid. The disparity data might be wrong, might have a different value than at finer scales, or might not be present at all. Furthermore, by processing data from coarse to fine spatial scales, hierarchical processing schemes are intrinsically sequential. This creates additional algorithmic overhead which is again difficult to realize with neuronal structures.\n\nThe same comments apply to phase-based approaches, where a locally extracted Fourier-phase value is used for matching. Phase values are only defined modulo $2\\pi$, and this wrap-around makes the use of hierarchical processing essential for these types of algorithms. 
Moreover, since data is analyzed in different spatial frequency channels, it is nearly certain that some phase values will be undefined at intermediate scales, due to missing signal energy in this frequency band (Fleet, 1993). Thus, in addition to hierarchical processing, some kind of exception handling is needed with these approaches.\n\n2 Stereovision by Coherence Detection\n\nIn summary, classical approaches to stereovision seem to have difficulties with the fast calculation of dense disparity maps, at least with plausible neural circuitry. In the following, a neural network implementation will be described which solves this task by using simple disparity estimators based on motion-energy mechanisms (Adelson, 1985; Qian, 1997), closely resembling responses of complex cells in visual cortex (DeAngelis, 1991). Disparity units of this type belong to a class of disparity estimators which can be derived from optical flow methods (Barron, 1994). Clearly, disparity calculations and optical flow estimation share many similarities. The two stereo views of a (static) scene can be considered as two time-slices cut out of the space-time intensity pattern which would be recorded by an imaginary camera moving from the position of the left to the position of the right eye. However, compared to optical flow, disparity estimation is complicated by the fact that only two discrete \"time\" samples are available, namely the images of the left and right view positions.\n\n[Figure 1 labels: Left, Right 1, Right 2; correct, correct, wrong]\n\nFigure 1: The velocity of an image patch manifests itself as principal texture direction in the space-time flow field traced out by the intensity pattern in time (left). 
Sampling such flow patterns at discrete times can create aliasing effects which lead to wrong estimates. If one is using optical flow estimation techniques for disparity calculations, this problem is always present.\n\nFor an explanation consider Fig. 1. A surface patch shifting over time traces out a certain flow pattern. The principal texture direction of this flow indicates the relative velocity of the image patch (Fig. 1, left). When the flow pattern is sampled only at discrete time points, the shift between two \"time samples\" can be estimated without ambiguity provided the shift is not too large (Fig. 1, middle). However, if a certain limit is exceeded, it becomes impossible to estimate the shift correctly from the given data (Fig. 1, right). This is a simple aliasing effect in the \"time\" direction; an everyday example can be seen as motion reversal in movies.\n\nIn the case of stereovision, aliasing effects of this type are always present, and they limit the range of disparities a simple disparity unit can estimate. Sampling theory gives a relation between the maximal spatial wavevector $k^{\\varphi}_{\\max}$ (or, equivalently, the minimum spatial wavelength $\\lambda^{\\varphi}_{\\min}$) present in the data and the largest disparity which can be estimated reliably (Henkel, 1997):\n\n$$|d| < \\frac{\\pi}{k^{\\varphi}_{\\max}} = \\frac{1}{2} \\lambda^{\\varphi}_{\\min}. \\qquad (1)$$\n\nA well-known example of the size-disparity scaling expressed in equation (1) is found in the context of the spatial frequency channels assumed to exist in the visual cortex. Cortical cells respond to spatial wavelengths down to about half their peak wavelength $\\lambda_{opt}$; therefore, they can reliably estimate only disparities less than $\\frac{1}{4} \\lambda_{opt}$. This is known as Marr's quarter-cycle limit (Blake, 1991). 
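
As a small worked example of equation (1), the following sketch computes the disparity limit for one frequency channel. The peak wavelength of 16 pixels is a hypothetical value chosen for illustration and is not taken from the paper; the relation k = 2*pi/wavelength is the standard one between wavevector and wavelength.

```python
import math

def wavelength_to_wavevector(wavelength):
    # Standard relation k = 2*pi / wavelength (wavelength in pixels).
    return 2.0 * math.pi / wavelength

def max_reliable_disparity(k_max):
    # Upper bound on |d| from equation (1): pi / k_max.
    return math.pi / k_max

# Hypothetical channel: peak wavelength of 16 pixels, responding down to
# about half the peak wavelength, as stated in the text.
lambda_opt = 16.0
lambda_min = lambda_opt / 2.0

d_limit = max_reliable_disparity(wavelength_to_wavevector(lambda_min))
print(d_limit, lambda_opt / 4.0)
```

Under these assumptions the bound equals one quarter of the peak wavelength, which is the quarter-cycle limit mentioned above.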
\n\nEquation (1) immediately suggests a way to extend the limited working range of disparity estimators: a spatial smoothing of the image data before or during disparity calculation reduces $k^{\\varphi}_{\\max}$, and in turn increases the disparity range. However, spatial smoothing also reduces the spatial resolution of the resulting disparity map. Another way of modifying the usable range of disparity estimators is the application of a fixed preshift to the input data before disparity calculation. This would require prior knowledge of the correct preshift to be applied, which is a nontrivial problem. One could resort to hierarchical coarse-to-fine schemes, but the difficulties with hierarchical schemes have already been elaborated.\n\nThe aliasing effects discussed are a general feature of sampling visual space with only two eyes; instead of counteracting them, one can exploit them in a simple coherence-detection scheme, where the multi-unit activity in stacks of disparity detectors tuned to a common view direction is analyzed.\n\nAssuming that all disparity units $i$ in a stack have random preshifts or presmoothing applied to their input data, these units will have different, but slightly overlapping working ranges $D_i = [d_i^{\\min}, d_i^{\\max}]$ for valid disparity estimates. An object with true disparity $d$, seen in the common view direction of such a stack, will therefore split the stack into two disjoint classes: the class $C$ of estimators with $d \\in D_i$ for all $i \\in C$, and the rest of the stack, $\\bar{C}$, with $d \\notin D_i$. All disparity estimators in $C$ will code more or less the true disparity, $d_i \\approx d$, but the estimates of units belonging to $\\bar{C}$ will be subject to the random aliasing effects discussed, depending in a complicated way on image content and disparity range $D_i$ of the unit. 
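
The split of a stack into coherently coding units and the rest can be illustrated with a toy sketch. Everything numeric here is a hypothetical assumption for illustration only: the twelve working ranges, the fixed list standing in for aliased estimates, the small offsets of the in-range units and the clustering tolerance. The read-out by largest cluster and averaging is a simple stand-in, not the network circuitry of the paper.

```python
def simulate_stack(true_d, ranges, aliased, offsets):
    # Units whose working range contains true_d report it up to a small
    # offset; all other units report an unrelated value taken from a fixed
    # list that stands in for content-dependent aliasing.
    aliased_values = iter(aliased)
    small_offsets = iter(offsets)
    estimates = []
    for low, high in ranges:
        if low <= true_d <= high:
            estimates.append(true_d + next(small_offsets))
        else:
            estimates.append(next(aliased_values))
    return estimates

def largest_cluster(estimates, tol=0.25):
    # Largest group of estimates lying within tol of one of the estimates.
    best = []
    for centre in estimates:
        members = [e for e in estimates if abs(e - centre) <= tol]
        if len(members) > len(best):
            best = members
    return best

# Hypothetical stack: 12 units, preshifts -5.5 .. 5.5, range half-width 2.
ranges = [(i - 5.5 - 2.0, i - 5.5 + 2.0) for i in range(12)]
aliased = [-6.3, 4.9, 0.7, -2.2, -7.1, 6.4, 1.8, -4.0]
offsets = [0.05, -0.05, 0.1, -0.1]

estimates = simulate_stack(3.0, ranges, aliased, offsets)
cluster = largest_cluster(estimates)
d_hat = sum(cluster) / len(cluster)
print(len(cluster), d_hat)
```

In this toy setting only the four units whose ranges contain the true disparity of 3.0 agree with each other, so they form the largest cluster and their mean recovers the true value.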
\n\nWe will thus have $d_i \\approx d \\approx d_j$ whenever units $i$ and $j$ belong to $C$, and random relationships otherwise. A simple coherence detection within each stack, i.e. searching for all units with $d_i \\approx d_j$ and extracting the largest cluster found, will be sufficient to single out $C$. The true disparity $d$ in the view direction of the stack can be simply estimated as an average over all coherently coding units:\n\n$$d \\approx \\langle d_i \\rangle_{i \\in C}.$$\n\n3 Neural Network Implementation\n\nRepeating this coherence detection scheme in every view direction results in a fully parallel network structure for disparity calculation. Neighboring disparity stacks responding to different view directions estimate disparity values independently from each other, and within each stack, disparity units operate independently from each other. Since coherence detection is an opportunistic scheme, extensions of the basic algorithm to multiple spatial scales and combinations of different types of disparity estimators are trivial. Additional units are simply included in the appropriate coherence stacks. The coherence scheme will combine only the information from the coherently coding units and ignore the rest of the data. For this reason, the scheme also turns out to be extremely robust against single-unit failures.\n\n[Figure 2 labels: disparity data, Left eye, Right eye, Cyclopean eye]\n\nFigure 2: The network structure for a single horizontal scan-line (left). The view directions of the disparity stacks split the angle between the left and right lines of sight in the network and 3D-space in half, therefore analyzing space along the cyclopean view directions (right).\n\nIn the current implementation (Fig. 
2), disparity units at a single spatial scale are arranged into horizontal disparity layers. Left and right image data is fed into this network along diagonally running data lines. This causes every disparity layer to receive the stereo data with a certain fixed preshift applied, leading to the required, slightly different working ranges of neighboring layers. Disparity units stacked vertically above each other are collected into a single disparity stack which is then analyzed for coherent activity.\n\n4 Results\n\nThe new stereo network performs comparably to classical approaches on several standard test image sets (Fig. 3). The calculated disparity maps are similar to maps obtained by classical area-based approaches, but they display subpixel precision. Since no smoothing or regularization is performed by the coherence-based stereo algorithm, sharp disparity edges can be observed at object borders.\n\nFigure 3: Disparity maps for some standard test images (small insets), calculated by the coherence-based stereo algorithm.\n\nFigure 4: The performance of coherence-based stereo on a difficult scene with specular highlights, transparency and repetitive structures (left). The disparity map (middle) is dense and correct, except for a few structure-less image regions. These regions, as well as most object borders, are indicated in the validation map (right) with a low [dark] validation count.\n\nWithin the network, a simple validation map is available locally. A measure of local coherence can be obtained by calculating the relative number of coherently acting disparity units in each stack, i.e. by calculating the ratio $N(C)/N(C \\cup \\bar{C})$, where $N(C)$ is the number of units in class $C$. 
In most cases, this validation map clearly marks image areas where the disparity calculations failed (for various reasons, notably at occlusions caused by object borders, or in large structure-less image regions, where no reliable matching can be obtained; compare Fig. 4).\n\nClose inspection of disparity and validation maps reveals that these image maps are not aligned with the left or the right view of the scene. Instead, both maps are registered with the cyclopean view. This is caused by the structural arrangement of data lines and disparity stacks in the network. Reprojecting data lines and stacks back into 3D-space shows that the stacks analyze three-dimensional space along lines splitting the angle between the left and right view directions in half. This is the cyclopean view direction as defined by Hering (1879).\n\nIt is easy to obtain the cyclopean view of the scene itself. With $I_i^L$ and $I_i^R$ denoting the left and right input data at the position of disparity unit $i$, a summation of these inputs over all coherently coding disparity units in a stack gives the image intensity $I^C$ in the cyclopean view direction of this stack. Collecting $I^C$ from all disparity stacks gives the complete cyclopean view as the third co-registered map of the network (Fig. 5).\n\nFigure 5: A simple superposition of the left and right stereo images results in diplopia (left). By using a vergence system, the two stereo images can be aligned better (middle), but diplopia is still prominent in most areas of the visual field. The fused cyclopean view of the scene (right) was calculated by the coherence-based stereo network.\n\nAcknowledgements\n\nThanks to Helmut Schwegler and Robert P. O'Shea for interesting discussions. Image data courtesy of G. Medioni, USC Institute for Robotics & Intelligent Systems, B. Bolles, AIC, SRI International, and G. Sommer, Kiel Cognitive Systems Group, Christian-Albrechts-Universität Kiel. An internet-based implementation of the algorithm presented in this paper is available at http://axon.physik.uni-bremen.de/-rdh/online~alc/stereo/.\n\nReferences\n\nAdelson, E.H. & Bergen, J.R. (1985): Spatiotemporal Energy Models for the Perception of Motion. J. Opt. Soc. Am. A2: 284-299.\n\nBarron, J.L., Fleet, D.J. & Beauchemin, S.S. (1994): Performance of Optical Flow Techniques. Int. J. Comp. Vis. 12: 43-77.\n\nBlake, R. & Wilson, H.R. (1991): Neural Models of Stereoscopic Vision. TINS 14: 445-452.\n\nDeAngelis, G.C., Ohzawa, I. & Freeman, R.D. (1991): Depth is Encoded in the Visual Cortex by a Specialized Field Structure. Nature 11: 156-159.\n\nFleck, M.M. (1991): A Topological Stereo Matcher. Int. J. of Comp. Vis. 6: 197-226.\n\nFleet, D.J. & Jepson, A.D. (1993): Stability of Phase Information. IEEE PAMI 2: 333-340.\n\nFrisby, J.P. & Pollard, S.B. (1991): Computational Issues in Solving the Stereo Correspondence Problem. eds. M.S. Landy and J.A. Movshon, Computational Models of Visual Processing, pp. 331, MIT Press, Cambridge 1991.\n\nHenkel, R.D. (1997): Fast Stereovision by Coherence Detection, in Proc. of CAIP'97, Kiel, LNCS 1296, eds. G. Sommer, K. Daniilidis and J. Pauli, pp. 297, Springer, Heidelberg 1997.\n\nHering, E. (1879): Der Raumsinn und die Bewegung des Auges, in Handbuch der Psychologie, ed. L. Hermann, Band 3, Teil 1, Vogel, Leipzig 1879.\n\nMarr, D. & Poggio, T. (1979): A Computational Theory of Human Stereo Vision. Proc. R. Soc. Lond. B 204: 301-328.\n\nOhta, Y. & Kanade, T. 
(1985): Stereo by Intra- and Inter-scanline Search Using Dynamic Programming. IEEE PAMI 7: 139-154.\n\nQian, N. & Zhu, Y. (1997): Physiological Computation of Binocular Disparity, to appear in Vision Research.\n\nYuille, A.L., Geiger, D. & Bülthoff, H.H. (1991): Stereo Integration, Mean Field Theory and Psychophysics. Network 2: 423-442.\n", "award": [], "sourceid": 1352, "authors": [{"given_name": "Rolf", "family_name": "Henkel", "institution": null}]}