{"title": "A Hierarchical Model of Complex Cells in Visual Cortex for the Binocular Perception of Motion-in-Depth", "book": "Advances in Neural Information Processing Systems", "page_first": 1271, "page_last": 1278, "abstract": "", "full_text": "A hierarchical model of complex cells in visual cortex for the binocular perception of motion-in-depth \n\nSilvio P. Sabatini, Fabio Solari, Giulia Andreani, Chiara Bartolozzi, and Giacomo M. Bisio \n\nDepartment of Biophysical and Electronic Engineering \n\nUniversity of Genoa, I-16145 Genova, ITALY \n\nsilvio@dibe.unige.it \n\nAbstract \n\nA cortical model for motion-in-depth selectivity of complex cells in the visual cortex is proposed. The model is based on a time extension of the phase-based techniques for disparity estimation. We consider the computation of the total temporal derivative of the time-varying disparity through the combination of the responses of disparity energy units. For physiological plausibility, the model is based on combinations of binocular cells characterized by different ocular dominance indices. The resulting cortical units of the model show a sharp selectivity for motion-in-depth, which has been compared with that reported in the literature for real cortical cells. \n\n1  Introduction \n\nThe analysis of a dynamic scene implies estimates of motion parameters to infer spatio-temporal information about the visual world. In particular, the perception of motion-in-depth (MID), i.e., the capability of discriminating between forward and backward movements of objects relative to an observer, has important implications for navigation in dynamic environments. 
In general, a reliable estimate of motion-in-depth can be gained by considering the dynamic stereo correspondence problem in the stereo image signals acquired by a binocular vision system. Fig. 1 shows the relationships between an object moving in 3-D space and its geometrical projections onto the right and left retinas. To a first approximation, the positions of corresponding points are related by a 1-D horizontal shift, the disparity, along the direction of the epipolar lines. Formally, the left and right observed intensities from the two eyes, $I^L(x)$ and $I^R(x)$ respectively, are related as $I^L(x) = I^R[x + \\delta(x)]$, where $\\delta(x)$ is the horizontal binocular disparity. If an object moves from P to Q, its disparity changes and it projects different velocities $(v_L, v_R)$ onto the retinas. \n\n[Figure 1, boxed relations: $\\delta(t+\\Delta t) = (x_{QL} - x_{QR}) \\approx a(D - Z_Q)/D^2$; $\\Delta\\delta/\\Delta t = [\\delta(t+\\Delta t) - \\delta(t)]/\\Delta t = [(x_{QL} - x_{PL}) - (x_{QR} - x_{PR})]/\\Delta t \\approx v_L - v_R$; $V_Z \\approx (\\Delta\\delta/\\Delta t)\\, D^2/a$; $V_Z \\approx (v_L - v_R)\\, D^2/a$.] \n\nFigure 1: The dynamic stereo correspondence problem. A moving object in 3-D space projects different trajectories onto the left and right retinas. The differences between the two trajectories carry information about motion-in-depth. \n\nThus, the Z component of the object's motion (i.e., its motion-in-depth) $V_Z$ can be approximated in two ways [1]: (1) by the rate of change of disparity, and (2) by the difference between retinal velocities, as shown in the box in Fig. 1. The predominance of one measure over the other corresponds to different hypotheses on the architectural solutions adopted by visual cortical cells to encode dynamic 3-D visual information. 
Recently, numerous experimental and computational studies (see, e.g., [2] [3] [4] [5]) have addressed this issue by analyzing the binocular spatio-temporal properties of simple and complex cells. The fact that the resulting disparity tuning does not vary with time, and that most of the cells in the primary visual cortex have the same motion preference for the two eyes, led to the conclusion that these cells are not tuned to motion-in-depth. In this paper, we demonstrate that, within a phase-based disparity encoding scheme, such cells relay phase temporal derivative components that can be combined, at a higher level, to yield a specific motion-in-depth selectivity. The rationale for this statement relies upon analytical considerations on phase-based dynamic stereopsis, as a time extension of the well-known phase-based techniques for disparity estimation [6] [7]. The resulting model is based on the computation of the total temporal derivative of the disparity through the combination of the outputs of binocular disparity energy units [4] [5] characterized by different ocular dominance indices. Since each energy unit is just a binocular Adelson and Bergen's motion detector, this establishes a link between the information contained in the total rate of change of the binocular disparity and that held by the interocular velocity differences. \n\n2  Phase-based  dynamic  stereopsis \n\nIn the last decades, a computational approach to stereopsis that relies on the phase information contained in the spectral components of the stereo image pair has been proposed [6] [7]. 
Spatially-localized phase measures on the left and right images can be obtained by filtering operations with a complex-valued quadrature pair of Gabor filters $h(x, k_0) = e^{-x^2/\\sigma^2} e^{i k_0 x}$, where $k_0$ is the peak frequency of the filter and $\\sigma$ relates to its spatial extension. The resulting convolutions with the left and right binocular signals can be expressed as $Q(x) = \\rho(x) e^{i\\phi(x)} = C(x) + iS(x)$, where $\\rho(x) = \\sqrt{C^2(x) + S^2(x)}$ and $\\phi(x) = \\arctan(S(x)/C(x))$ denote their amplitude and phase components, respectively, and $C(x)$ and $S(x)$ are the responses of the quadrature pair of filters. Hence, binocular disparity can be predicted by $\\delta(x) = [\\phi^L(x) - \\phi^R(x)]/k(x)$, where $k(x) = [\\phi_x^L(x) + \\phi_x^R(x)]/2$, with $\\phi_x$ the spatial derivative of the phase $\\phi$, is the average instantaneous frequency of the bandpass signal, which, under a linear phase model, can be approximated by the peak frequency $k_0$ of the Gabor filter. Extending to the time domain, the disparity of a point moving with the motion field can be estimated by: \n\n$$\\delta[x(t), t] = \\frac{\\phi^L[x(t), t] - \\phi^R[x(t), t]}{k_0} \\qquad (1)$$ \n\nwhere the phase components are computed from the spatiotemporal convolutions of the stereo image pair, $Q(x, t) = C(x, t) + iS(x, t)$, with directionally tuned Gabor filters with central frequency $p = (k_0, \\omega_0)$. For spatiotemporal locations where the linear phase approximation still holds ($\\phi \\approx k_0 x + \\omega_0 t$), the phase differences in Eq. (1) provide only spatial information, useful for reliable disparity estimates. 
\n\n2.1  Motion-in-depth \n\nIf disparity is defined with respect to the spatial coordinate $x_L$, by differentiating with respect to time, its total rate of variation can be written as \n\n$$\\frac{d\\delta}{dt} = \\frac{\\partial\\delta}{\\partial t} + \\frac{v_L}{k_0}\\left(\\phi_x^L - \\phi_x^R\\right) \\qquad (2)$$ \n\nwhere $v_L$ is the horizontal component of the velocity signal on the left retina. Considering the conservation property of local phase measurements [8], image velocities can be computed from the temporal evolution of constant phase contours, and thus: \n\n$$v_L = -\\frac{\\phi_t^L}{\\phi_x^L} \\quad \\text{and} \\quad v_R = -\\frac{\\phi_t^R}{\\phi_x^R} \\qquad (3)$$ \n\nwith $\\phi_t = \\partial\\phi/\\partial t$. Combining Eq. (3) with Eq. (2) we obtain $d\\delta/dt = (v_R - v_L)\\phi_x^R/k_0$, where $(v_R - v_L)$ is the phase-based interocular velocity difference along the epipolar lines. When the spatial tuning frequency $k_0$ of the Gabor filter approaches the instantaneous spatial frequency of the left and right convolution signals, one can derive the following approximate expressions: \n\n$$\\frac{d\\delta}{dt} \\approx \\frac{\\partial\\delta}{\\partial t} = \\frac{\\phi_t^L - \\phi_t^R}{k_0} \\approx v_R - v_L \\qquad (4)$$ \n\nThe partial derivative of the disparity can be directly computed from the convolutions $(S, C)$ of the stereo image pairs and their temporal derivatives $(S_t, C_t)$: \n\n$$\\frac{\\partial\\delta}{\\partial t} = \\left[\\frac{S_t^L C^L - S^L C_t^L}{(S^L)^2 + (C^L)^2} - \\frac{S_t^R C^R - S^R C_t^R}{(S^R)^2 + (C^R)^2}\\right]\\frac{1}{k_0} \\qquad (5)$$ \n\nthus avoiding explicit calculation and differentiation of phase, and the attendant problem of phase unwrapping. Considering that, to a first approximation, $(S^L)^2 + (C^L)^2 \\simeq (S^R)^2 + (C^R)^2$ and that these terms contribute little to discriminating motion-in-depth, we can formulate the cortical model taking into account the numerator terms only. 
\n\n2.2  The  cortical model \n\nIf one prefilters the image signal to extract some temporal frequency sub-band, $S(x, t) \\simeq g * S(x, t)$ and $C(x, t) \\simeq g * C(x, t)$, and evaluates the temporal changes in that sub-band, differentiation can be attained by convolutions on the data with appropriate bandpass temporal filters: \n\n$$S'(x, t) \\simeq g' * S(x, t) \\; ; \\quad C'(x, t) \\simeq g' * C(x, t).$$ \n\n$S'$ and $C'$ approximate $S_t$ and $C_t$, respectively, if $g$ and $g'$ are a quadrature pair of temporal filters, e.g., $g(t) = e^{-t/\\tau} \\sin\\omega_0 t$ and $g'(t) = e^{-t/\\tau} \\cos\\omega_0 t$. From a modeling perspective, that approximation allows us to express derivative operations in terms of convolutions with a set of spatio-temporal filters, whose shapes resemble those of simple cell receptive fields (RFs) of the primary visual cortex. It is worth noting, however, that a direct interpretation of the computational model is not biologically plausible. Indeed, in the computational scheme (see Eq. (5)), the temporal variations of phases are obtained by processing monocular images separately, and the resulting signals are then binocularly combined to give an estimate of motion-in-depth at each spatial location. To employ binocular RFs from the beginning, as they exist for most of the cells in the visual cortex, we manipulated the numerator by rewriting it as the combination of terms characterized by a dominant contribution for the ipsilateral eye and a non-dominant contribution for the contralateral eye. These contributions are referable to binocular disparity energy units [5] built from two pairs of binocular direction selective simple cells with left and right RFs weighted by an ocular dominance index $\\alpha \\in [0,1]$. 
The \"tilted\" spatio-temporal RFs of the simple cells of the model are obtained by combining separable RFs according to Adelson and Bergen's scheme [9]. It can be demonstrated that the information about motion-in-depth can be obtained with a minimum of eight binocular simple cells, four with a left and four with a right ocular dominance, respectively (see Fig. 2): \n\n$S_1 = (1 - \\alpha)(C_t^L + S^L) - \\alpha(C^R - S_t^R)$ \n$S_2 = (1 - \\alpha)(C^L - S_t^L) + \\alpha(C_t^R + S^R)$ \n$S_3 = (1 - \\alpha)(C_t^L - S^L) - \\alpha(C^R + S_t^R)$ \n$S_4 = (1 - \\alpha)(C^L + S_t^L) + \\alpha(C_t^R - S^R)$ \n$S_5 = \\alpha(C_t^L + S^L) - (1 - \\alpha)(C^R - S_t^R)$ \n$S_6 = \\alpha(C^L - S_t^L) + (1 - \\alpha)(C_t^R + S^R)$ \n$S_7 = \\alpha(C_t^L - S^L) - (1 - \\alpha)(C^R + S_t^R)$ \n$S_8 = \\alpha(C^L + S_t^L) + (1 - \\alpha)(C_t^R - S^R)$ \n\n$C_{11} = S_1^2 + S_2^2$ ; $C_{12} = S_3^2 + S_4^2$ ; $C_{13} = S_5^2 + S_6^2$ ; $C_{14} = S_7^2 + S_8^2$ \n\n$C_{21} = C_{12} - C_{11}$ ; $C_{22} = C_{13} - C_{14}$ \n\n$C_3 = (1 - 2\\alpha)(S_t^L C^L - S^L C_t^L - S_t^R C^R + S^R C_t^R)$. \n\nThe output of the complex cell at the top of the hierarchy ($C_3$) truly encodes motion-in-depth information. It is worth noting that for a balanced ocular dominance ($\\alpha = 0.5$) the cell loses its selectivity. \n\n3  Results \n\nTo assess model performance, we derived the cells' responses to drifting sinusoidal gratings with different speeds in the left and right eye. The spatial frequency of the gratings was chosen to be central to the RF's bandwidth. For each layer, the tuning characteristics of the cells are analyzed as sensitivity maps in the $(x_L - x_R)$ and $(v_L - v_R)$ domains for the static and dynamic properties, respectively. The $(x_L - x_R)$ map represents the binocular RF [5] of a cell, showing its disparity tuning. The $(v_L - v_R)$ response represents the binocular tuning curve of the velocities along the epipolar lines. 
To better highlight motion-in-depth sensitivity, we represent as polar plots the responses of the model cells with respect to the interocular velocity ratio for 12 different motion trajectories in depth (labeled 1 to 12) [10]. The cells of the cortical model exhibit properties and typical profiles similar to those observed in the visual cortex [5] [10]. The middle two layers (see insets A and B in Fig. 2) exhibit a strong selectivity to static disparity, but no specific tuning to motion-in-depth. By contrast, the output cell $C_3$ shows a narrow tuning to the Z direction of the object's motion, while lacking disparity tuning (see inset C in Fig. 2). \n\nTo consider more biologically plausible RFs for the simple cells, we included a coefficient $\\beta$ in the scheme used to obtain tilted RFs in the space-time domain (e.g., $C + \\beta S_t$). This coefficient takes into account the simple cell response to the non-preferred direction. We analytically demonstrated (results not shown here) that the resulting effect is a constant term that multiplies the cortical model output. In this way, the model is based on more realistic simple cells without losing its functionality, provided that the basic direction selective units maintain a significant direction selectivity index. To analyze the effect of the architectural parameters on the model performance, we systematically varied the ocular dominance index $\\alpha$ and introduced a weight $\\gamma$ representing the inhibition strength of the afferent signals to the complex cells in layer 2. The resulting direction-in-depth polar plots are shown in Fig. 3. 
The $\\alpha$ parameter has a strong effect on the response profile: if $\\alpha = 0.5$ there is no direction-in-depth selectivity; depending on whether $\\alpha > 0.5$ or $\\alpha < 0.5$, cells exhibit tuning to opposite directions in depth. As $\\alpha$ approaches the boundary values 0 or 1, the binocular model turns into a monocular one. A decrease of the inhibition strength $\\gamma$ yields cells characterized by a less selective response to direction-in-depth, whereas an increase of $\\gamma$ diminishes their response amplitude. \n\nFigure 2: Functional representation of the proposed cortical architecture. Each branch groups cells belonging to an ocular dominance column. The afferent signals from left and right ocular dominance columns are combined in layer 3. The basic units are binocular simple cells tuned to motion directions ($S_1, \\ldots, S_8$). The responses of the complex cells in layers 1, 2 and 3 are obtained by linear and nonlinear combinations of the outputs of those basic units. See text. White squares denote excitatory synapses whereas black squares denote inhibitory ones. \n\nFigure 3: Effects on the direction-in-depth selectivity of the systematic variation of the model's parameters $\\alpha$ and $\\gamma$ (panels for $\\alpha = 0.3, 0.7, 0.9$ and $\\gamma = 0.5, 1.0, 2.0$). The responses are normalized to the largest amplitude value. \n\n4  Discussion and  conclusions \n\nThere are at least two binocular cues that can be used to determine the MID [1]: binocular combination of monocular velocity signals or the rate of change of retinal disparity. Assuming a phase-based disparity encoding scheme [6], we demonstrated that the information held in the interocular velocity difference is the same as that derived from the evaluation of the total derivative of the binocular disparity. The resulting computation relies upon spatio-temporal differentials of the left and right retinal phases that can be approximated by linear filtering operations with spatio-temporal RFs. Accordingly, we proposed a cortical model for the generation of binocular motion-in-depth selective cells as a hierarchical combination of binocular energy complex cells. It is worth noting that the phase response and the associated characteristic disparity of simple and complex cells in layers 1 and 2 do not change with time, but the amplitudes of their responses carry information on temporal phase derivatives, which can be related to both retinal velocities and temporal changes in disparity. Moreover, the model highlights the different roles of simple and complex cells. Simple cells provide a Gabor-like spatio-temporal transformation of the visual space, on which to base a variety of visual functions (perception of form, depth, motion). 
Complex cells, by proper combinations of the same signals provided by simple cells, actively eliminate sensitivity to a selected set of parameters, thus becoming specifically tuned to different features, such as disparity but not motion-in-depth (layers 1 and 2), or motion-in-depth but not disparity (layer 3). \n\nAcknowledgments \n\nThis work was partially supported by the UNIGE-2000 Project \"Spatio-temporal Operators for the Analysis of Motion in Depth from Binocular Images\". \n\nReferences \n\n[1] J. Harris and S.N.J. Watamaniuk. Speed discrimination of motion-in-depth using binocular cues. Vision Research, 35(7):885-896, 1995. \n\n[2] N. Qian and S. Mikaelian. Relationship between phase and energy methods for disparity computation. Neural Comp., 12(2):279-292, 2000. \n\n[3] Y. Chen, Y. Wang, and N. Qian. Modelling V1 disparity tuning to time-varying stimuli. J. Neurophysiol., pages 504-600, 2001. \n\n[4] D.J. Fleet, H. Wagner, and D.J. Heeger. Neural encoding of binocular disparity: energy models, position shift and phase shift. Vision Research, 17:345-398, 1996. \n\n[5] I. Ohzawa, G.C. DeAngelis, and R.D. Freeman. Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77:2879-2909, 1997. \n\n[6] T.D. Sanger. Stereo disparity computation using Gabor filters. Biol. Cybern., 59:405-418, 1988. \n\n[7] D.J. Fleet, A.D. Jepson, and M. Jenkin. Phase-based disparity measurements. CVGIP: Image Understanding, 53:198-210, 1991. \n\n[8] D.J. Fleet and A.D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 1:77-104, 1990. \n\n[9] E.H. Adelson and J.R. Bergen. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Amer., 2:284-321, 1985. 
\n\n[10] W. Spileers, G.A. Orban, B. Gulyas, and H. Maes. Selectivity of cat area 18 neurons for direction and speed in depth. J. Neurophysiol., 63(4):936-954, 1990. \n", "award": [], "sourceid": 1950, "authors": [{"given_name": "Silvio", "family_name": "Sabatini", "institution": null}, {"given_name": "Fabio", "family_name": "Solari", "institution": null}, {"given_name": "Giulia", "family_name": "Andreani", "institution": null}, {"given_name": "Chiara", "family_name": "Bartolozzi", "institution": null}, {"given_name": "Giacomo", "family_name": "Bisio", "institution": null}]}