{"title": "Selective Integration: A Model for Disparity Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 866, "page_last": 872, "abstract": null, "full_text": "Selective Integration: A Model for Disparity Estimation \n\nMichael S. Gray, Alexandre Pouget, Richard S. Zemel, \nSteven J. Nowlan, Terrence J. Sejnowski \nDepartments of Biology and Cognitive Science \nUniversity of California, San Diego \nLa Jolla, CA 92093 \n\nand \n\nHoward Hughes Medical Institute \nComputational Neurobiology Lab \nThe Salk Institute, P. O. Box 85800 \nSan Diego, CA 92186-5800 \n\nEmail: michael, alex, zemel, nowlan, terry@salk.edu \n\nAbstract \n\nLocal disparity information is often sparse and noisy, which creates \ntwo conflicting demands when estimating disparity in an image region: \nthe need to average spatially to get an accurate estimate, and \nthe need to avoid averaging over discontinuities. We have developed \na network model of disparity estimation based on disparity-selective \nneurons, such as those found in the early stages of processing \nin visual cortex. The model can accurately estimate multiple \ndisparities in a region, which may be caused by transparency or occlusion, \nin real images and random-dot stereograms. The use of a \nselection mechanism to selectively integrate reliable local disparity \nestimates results in superior performance compared to standard \nback-propagation and cross-correlation approaches. In addition, \nthe representations learned with this selection mechanism are consistent \nwith recent neurophysiological results of von der Heydt, \nZhou, Friedman, and Poggio [8] for cells in cortical visual area V2. 
\nCombining multi-scale biologically-plausible image processing with \nthe power of the mixture-of-experts learning algorithm represents \na promising approach that yields both high performance and new \ninsights into visual system function. \n\n1 INTRODUCTION \n\nIn many stereo algorithms, the local correlation between images from the two eyes \nis used to estimate relative depth (Jain, Kasturi, & Schunck [5]). Local correlation \nmeasures, however, convey no information about the reliability of a particular disparity \nmeasurement. In the model presented here, we introduce a separate selection \nmechanism to determine which locations of the visual input have consistent disparity \ninformation. We focused on several challenging viewing situations in which \ndisparity estimation is not straightforward. For example, can the model estimate \nthe disparity of more than one object in a scene? Does occlusion lead to poorer \ndisparity estimation? Can the model determine the disparities of two transparent \nsurfaces? Does the model accurately estimate the disparities present in real world \nimages? Datasets corresponding to these different conditions were generated and \nused to test the model. \n\nOur goal is to develop a neurobiologically plausible model of stereopsis that accurately \nestimates disparity. Compared to traditional cross-correlation approaches \nthat try to compute a depth map for all locations in space, the mixture-of-experts \nmodel used here searches for sparse, reliable patterns or configurations of disparity \nstimuli that provide evidence for objects at different depths. This allows partial \nsegmentation of the image to obtain a more compact representation of disparities. 
\nLocal disparity estimates are sufficient in this case, as long as we selectively segment \nthose regions of the image with reliable disparity information. \n\nThe rest of the paper is organized as follows. First, we describe the architecture of \nthe mixture-of-experts model. Second, we provide a brief qualitative description of \nthe model's performance followed by quantitative results on a variety of datasets. \nThird, we compare the activity of units in the model to recent neurophysiological \ndata. Finally, we discuss these findings and consider remaining open \nquestions. \n\n2 MIXTURE-OF-EXPERTS MODEL \n\nThe model of stereopsis that we have explored is based on the filter model for motion \ndetection devised by Nowlan and Sejnowski [6]. The motion model was readily \nadapted to stereopsis by replacing the time domain of motion with the left/right \nimage domain. Our model (Figure 1) consisted of several stages \nand computed its output using only feed-forward processing, as described below \n(see also Gray, Pouget, Zemel, Nowlan, and Sejnowski [2] for more detail). The \noutput of the first stage (disparity energy filters) became the input to two different \nprimary pathways: (1) the local disparity networks, and (2) the selection networks. \nThe activation of each of the four disparity-tuned output units in the model was \nthe product of the outputs of the two primary pathways (summed across space). \nAn objective function based on the mixture-of-experts framework (Jacobs, Jordan, \nNowlan, & Hinton [4]) was used to optimize the weights from the disparity energy \nunits to the local disparity networks and to the selection networks. The weights to \nthe output units from the local disparity and selection pathways were fixed at 1.0. 
\nOnce the model was trained, we obtained a scalar disparity estimate from the model \nby computing a nonlinear least squares Gaussian fit to the four output values. The \nmean of the Gaussian was our disparity estimate. When two objects were present \nin the input, we fit the sum of two Gaussians to the four output values. \n\nFigure 1: The mixture-of-experts architecture. (Diagram labels: Left Eye Retina, Right Eye Retina; \nDisparity Energy Filters at Low, Medium, and High SF; Competition.) \n\n2.1 DISPARITY ENERGY FILTERS \n\nThe retinal layer in the model consisted of one-dimensional right eye and left eye \nimages, each 82 pixels in length. These images were the input to disparity energy \nfilters, as developed by Ohzawa, DeAngelis, and Freeman [7]. At the energy filter \nlayer, there were 51 receptive field (RF) locations which received input from partially \noverlapping regions of the retinae. At each of these RF locations, there were 30 \nenergy units corresponding to 10 phase differences at 3 spatial frequencies. These \nphase differences were proportional to disparity. An energy unit consisted of 4 \nenergy filter pairs, each of which was a Gabor filter. The outputs of the disparity \nenergy units were normalized at each RF location and within each spatial frequency \nusing a soft-max nonlinearity. \n\n2.2 LOCAL DISPARITY NETWORKS \n\nIn the local disparity pathway, there were 8 RF locations, and each received a \nweighted input from 9 disparity energy locations. Each RF location corresponded \nto a local disparity network and contained a pool of 4 disparity-tuned units. Neighboring \nlocations received input from overlapping sets of disparity energy units. 
\nWeights were shared across all RF locations for each disparity. Soft-max competition \noccurred within each local disparity network (across disparity), and ensured \nthat only one disparity was strongly activated at each RF location. \n\n2.3 SELECTION NETWORKS \n\nLike the local disparity networks, the selection networks were organized into a grid \nof 8 RF locations with a pool of 4 disparity-tuned units at each location. These 4 \nunits represented the local support for each of the different disparity hypotheses. It \nis more useful to think of the selection networks, however, as 4 separate layers, each \nof which responded to a specific disparity across all regions of the image. As in the \nlocal disparity pathway, neighboring RF locations received input from overlapping \ndisparity energy units, and weights were shared across space for each disparity. \nIn addition, the outputs of the selection network were normalized with the soft-max \noperation. This competition, however, occurred separately for each of the 4 \ndisparities in a global fashion across space. \n\n3 RESULTS \n\nFigure 2 shows the pattern of activations in the model when presented with a single \nobject at a disparity of 2.1 pixels. The visual layout of the model in this figure \nis identical to the layout in Figure 1. The stimulus appears at bottom, with the \n3 disparity energy filter banks directly above it. On the left above the disparity \nenergy filters are the local disparity networks. The selection networks are on the \nright. The summed output (across space) appears in the upper right corner of the \nfigure. Note that the selection network for a 2 pixel disparity (2nd row from the \nbottom in the selection pathway) is active for the spatial location at far left. 
The \ncorresponding location is also highly active in the local disparity pathway, and this \ncombination leads to strong activation for a 2 pixel disparity in the output of the \nmodel. \n\nThe mixture-of-experts model was optimized individually on a variety of different \ndatasets and then tested on novel stimuli from the same datasets. The model's \nability to discriminate among different disparities was quantified as the disparity \nthreshold: the disparity difference at which one can correctly see a difference in \ndepth 75% of the time. Disparity thresholds for the test stimuli were computed using \nsignal-detection theory (Green & Swets [3]). Sample stimuli and their disparity \nthresholds are shown in Table 1. The model performed best on single object stimuli \n(top row). This disparity threshold (0.23 pixels) was substantially less than the \ninput resolution of the model (1 pixel), so the model exhibited stereo hyperacuity. \nThe model also performed well when there were multiple, occluding objects (2nd \nrow). When both the stimulus and the background were generated from a uniform \nrandom distribution, the disparity threshold rose to 0.55 pixels. The model \nestimated disparity accurately in random-dot stereograms and real world images. \nBinary stereograms containing two transparent surfaces, however, were a challenging \nstimulus, and the threshold rose to 0.83 pixels. Part of the difficulty with this \nstimulus (containing two objects) was fitting the sum of 2 Gaussians to 4 data \npoints. \n\nWe have compared our mixture-of-experts model (containing both a selection pathway \nand a local disparity pathway) with standard back-propagation and cross-correlation \ntechniques (Gray et al. [2]). 
The primary difference is that the back-propagation \nand cross-correlation models have no separate selection mechanism. In \nessence, one mechanism must compute both the segmentation and the disparity \nestimation. In our tests with the back-propagation model, we found that disparity \nthresholds for single object stimuli rose by a factor of 3 (to 0.74 pixels) compared \nto the mixture-of-experts model. Disparity estimation of the cross-correlation \nmodel was similarly poor. Thresholds rose by a factor of 2 (compared to the mixture-of-experts \nmodel) for both single object stimuli and the noise stimuli (thresholds = \n0.46 and 1.28 pixels, respectively). \n\nFigure 2: The activity in the mixture-of-experts model in response to an input \nstimulus containing a single object at a disparity of 2.10 pixels. At bottom is the \ninput stimulus. The 3 regions in the middle represent the output of the disparity \nenergy filters. Above the disparity energy output are the two pathways of the model. \nThe local disparity networks appear to the left and the selection networks are to \nthe right. Both the local disparity networks and the selection networks receive \ntopographically organized input from the disparity energy filters. 
The selection and \nlocal disparity networks are displayed so that the top row represents a disparity \nof 0 pixels, the next row a 1 pixel disparity, then 2 and 3 pixel disparities in the \nremaining rows. At the top left part of the figure is the desired output for the given \ninput stimulus. In the top middle is the output for each local region of space. On \nthe top right is the actual output of the model collapsed across space. The numbers \nat the bottom left of each part of the network indicate the maximum and minimum \nactivation values within that part. White indicates maximum activation level, black \nis minimum. \n\nStimulus Type    Sample Stimulus    Threshold \nSingle           [image]            0.23 \nDouble           [image]            0.41 \nNoise            [image]            0.55 \nRandom-Dot       [image]            0.36 \nTransparent      [image]            0.83 \nReal             [image]            0.30 \n\nTable 1: Sample stimuli for each of the datasets, and corresponding disparity thresholds \n(in pixels) for the mixture-of-experts model. \n\n4 COMPARISON WITH NEUROPHYSIOLOGICAL DATA \n\nTo gain insight into the response properties of the selection units in our model, \nwe mapped their activations as a function of space and disparity. Specifically, we \nmeasured the activation of a unit as a single high-contrast edge was moved across the spatial \nextent of the receptive field. At each spatial location, we tested all possible disparities. \nAn example of this mapping is shown in Figure 3. This selection unit is sensitive to changes \nin disparity as we move across space. We refer to this property as disparity contrast. 
In other \nwords, the selection unit learned that a reliable indicator for a given disparity is a change in \ndisparity across space. This type of detector can be behaviorally significant, because disparity \ncontrast may play a role in signaling object boundaries. These selection units \ncould thus provide valuable information in the construction of a 3-D model of the \nworld. Recent neurophysiological studies by von der Heydt, Zhou, Friedman, and \nPoggio [8] are consistent with this interpretation. They found that neurons in area V2 of awake, \nbehaving monkeys responded to edges of 4\u00b0 by 4\u00b0 random-dot stereograms. \nBecause random-dot stereograms have no monocular form cues, these \nneurons must be responding to edges in depth. This sensitivity to edges in a depth \nmap corresponds directly to the response profile of the selection units. \n\nFigure 3: Selection unit activity \n\n5 DISCUSSION \n\nA major difficulty in estimating the disparities of objects in a visual scene in realistic \ncircumstances (i.e., with clutter, transparency, occlusion, noise) is knowing \nwhich cues are most reliable and should be integrated, and which regions have ambiguous \nor unreliable information. Nowlan and Sejnowski [6] found that selection \nunits learned to respond strongly to image regions that contained motion energy \nin several different directions. The role of those selection units is similar to layered \nanalysis techniques for computing support maps in the motion domain (Darrell & \nPentland [1]). The operation of the dual pathways in our model bears some similarities \nto the pathways developed in the motion model of Nowlan and Sejnowski [6]. 
\nIn the stereo domain, we have found that our selection units develop into edge \ndetectors on a disparity map. They thus respond to regions rich in disparity information, \nanalogous to the salient motion information captured by the selection units of the motion model [6]. \n\nWe have also found that the model matches psychophysical data recorded by Westheimer \nand McKee [9] on the effects of spatial frequency filtering on disparity \nthresholds (Gray et al. [2]). Westheimer and McKee found, in human psychophysical experiments, \nthat disparity thresholds increased for any kind of spatial frequency filtering of line \ntargets. In particular, disparity sensitivity was more adversely affected by high-pass \nfiltering than by low-pass filtering. \n\nIn summary, we propose that the functional division into local response and selection \nrepresents a general principle for image interpretation and analysis that may be \napplicable to many different visual cues, and also to other sensory domains. In our \napproach to this problem, we utilized a multi-scale neurophysiologically-realistic \nimplementation of binocular cells for the input, and then combined it with a neural \nnetwork model to learn reliable cues for disparity estimation. \n\nReferences \n\n[1] T. Darrell and A.P. Pentland. Cooperative robust estimation using layers of \nsupport. IEEE Transactions on Pattern Analysis and Machine Intelligence, \n17(5):474-87, 1995. \n\n[2] M.S. Gray, A. Pouget, R.S. Zemel, S.J. Nowlan, and T.J. Sejnowski. Reliable \ndisparity estimation through selective integration. INC Technical Report 9602, \nInstitute for Neural Computation, University of California, San Diego, 1996. \n\n[3] D.M. Green and J.A. Swets. Signal Detection Theory and Psychophysics. John \nWiley and Sons, New York, 1966. \n\n[4] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton. 
Adaptive mixtures of \nlocal experts. Neural Computation, 3:79-87, 1991. \n\n[5] R. Jain, R. Kasturi, and B.G. Schunck. Machine Vision. McGraw-Hill, New \nYork, 1995. \n\n[6] S.J. Nowlan and T.J. Sejnowski. Filter selection model for motion segmentation \nand velocity integration. Journal of the Optical Society of America A, \n11(12):3177-3200, 1994. \n\n[7] I. Ohzawa, G.C. DeAngelis, and R.D. Freeman. Stereoscopic depth discrimination \nin the visual cortex: Neurons ideally suited as disparity detectors. Science, \n249:1037-1041, 1990. \n\n[8] R. von der Heydt, H. Zhou, H. Friedman, and G.F. Poggio. Neurons of area V2 \nof visual cortex detect edges in random-dot stereograms. Soc. Neurosci. Abs., \n21:18, 1995. \n\n[9] G. Westheimer and S.P. McKee. Stereoscopic acuity with defocus and spatially \nfiltered retinal images. Journal of the Optical Society of America, 70:772-777, \n1980. \n", "award": [], "sourceid": 1212, "authors": [{"given_name": "Michael", "family_name": "Gray", "institution": null}, {"given_name": "Alexandre", "family_name": "Pouget", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Steven", "family_name": "Nowlan", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}