{"title": "Divisive Normalization, Line Attractor Networks and Ideal Observers", "book": "Advances in Neural Information Processing Systems", "page_first": 104, "page_last": 110, "abstract": null, "full_text": "Divisive  Normalization,  Line  Attractor \n\nNetworks and Ideal  Observers \n\nSophie Denevel  Alexandre  Pougetl, and  P.E.  Latham2 \n\n1 Georgetown Institute for  Computational and  Cognitive Sciences, \n\nGeorgetown University,  Washington,  DC  20007-2197 \n\n2Dpt  of Neurobiology,  UCLA,  Los  Angeles,  CA  90095-1763,  U.S.A. \n\nAbstract \n\nGain  control  by  divisive  inhibition,  a.k.a.  divisive  normalization, \nhas  been  proposed  to  be  a  general  mechanism  throughout  the  vi(cid:173)\nsual  cortex.  We  explore  in  this  study  the  statistical  properties \nof this  normalization in  the  presence  of noise.  Using  simulations, \nwe  show  that divisive  normalization is  a  close  approximation to  a \nmaximum likelihood estimator, which, in the context of population \ncoding, is the same as an ideal observer.  We also demonstrate ana(cid:173)\nlytically that this is  a general property of a  large class of nonlinear \nrecurrent  networks  with  line  attractors.  Our  work  suggests  that \ndivisive  normalization  plays  a  critical  role  in  noise  filtering,  and \nthat every cortical layer may be an ideal observer of the activity in \nthe preceding  layer. \n\nInformation  processing  in  the  cortex  is  often  formalized  as  a  sequence  of a  linear \nstages followed  by  a  nonlinearity.  In the visual cortex,  the nonlinearity is  best de(cid:173)\nscribed by squaring combined with a divisive pooling of local activities.  The divisive \npart of the nonlinearity has  been  extensively studied by  Heeger  and colleagues  [1], \nand several authors have explored the role of this normalization in the computation \nof high  order visual features  such  as  orientation of edges or first  and  second order \nmotion[ 4].  We show in this paper that divisive normalization can also playa role in \nnoise filtering.  More specifically, we demonstrate through simulations that networks \nimplementing this normalization come close to performing maximum likelihood es(cid:173)\ntimation.  We  then demonstrate analytically that the ability  to  perform maximum \nlikelihood estimation, and thus efficiently extract information from  a  population of \nnoisy neurons,  is  a  property exhibited  by  a  large class of networks. \n\nMaximum  likelihood  estimation  is  a  framework  commonly  used  in  the  theory  of \nideal observers.  A recent example comes from the work of Itti et al.,  1998, who have \nshown  that it  is  possible  to  account  for  the behavior of human  subjects  in simple \ndiscrimination  tasks.  Their  model  comprised  two  distinct  stages:  1)  a  network \n\n\fDivisive Normalization.  Line Attractor Networks and Ideal Observers \n\n105 \n\nwhich  models  the noisy  response of neurons  with  tuning  curves to  orientation and \nspatial frequency  combined with divisive normalization, and 2)  an ideal observer (a \nmaximum likelihood estimator)  to read out the population activity of the network. \n\nOur  work  suggests  that  there  is  no  need  to  distinguish  between  these  two  stages, \nsince,  as  we  will show,  divisive normalization comes close to  providing a  maximum \nlikelihood estimation.  
More generally, we propose that there may not be any part of the cortex that acts as an ideal observer for patterns of activity in sensory areas but, instead, that each cortical layer acts as an ideal observer of the activity in the preceding layer.

1  The network

Our network is a simplified model of a cortical hypercolumn for spatial frequency and orientation. It consists of a two dimensional array of units in which each unit is indexed by its preferred orientation, θ_i, and spatial frequency, λ_j.

1.1  LGN model

Units in the cortical layer are assumed to receive direct inputs from the lateral geniculate nucleus (LGN). Here we do not model the LGN explicitly, but focus instead on the pooled LGN input onto each cortical unit. The input to each unit is denoted a_ij. We distinguish between the mean pooled LGN input, f_ij(θ, λ), as a function of orientation, θ, and spatial frequency, λ, and the noise distribution around this mean, P(a_ij | θ, λ).

In response to a stimulus of orientation, θ, spatial frequency, λ, and contrast, C, the mean LGN input onto unit ij is a circular Gaussian with a small amount of spontaneous activity, ν:

    f_{ij}(\theta, \lambda) = K C \exp\left( \frac{\cos(\lambda - \lambda_j) - 1}{\sigma_\lambda^2} + \frac{\cos(\theta - \theta_i) - 1}{\sigma_\theta^2} \right) + \nu,    (1)

where K is a constant. Note that spatial frequency is treated as a periodic variable; this was done for convenience only and should have negligible effects on our results as long as we keep λ far from 2πn, n an integer.

On any given trial the LGN input to cortical unit ij, a_ij, is sampled from a Gaussian noise distribution with variance σ_ij²:

    P(a_{ij} \mid \theta, \lambda) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left( -\frac{\left( a_{ij} - f_{ij}(\theta, \lambda) \right)^2}{2 \sigma_{ij}^2} \right).    (2)

In our simulations, the variance of the noise was either kept fixed (σ_ij² = σ²) or set to the mean activity (σ_ij² = f_ij(θ, λ)). The latter is more consistent with the noise that has been measured experimentally in the cortex. We show in figure 1-A an example of a noisy LGN pattern of activity.

[Figure 1: A- LGN input (bottom) and stable hill in the cortical network after relaxation (top). The position of the stable hill can be used to estimate orientation and spatial frequency. B- Inverse of the variance of the network estimate for orientation, using Gaussian noise with variance equal to the mean, as a function of contrast and number of iterations (0, dashed; 1, diamond; 2, circle; and 3, square). The continuous curve corresponds to the theoretical upper bound on the inverse of the variance (i.e. an ideal observer). C- Gain curve for contrast for the cortical units after 1, 2 and 3 iterations.]

1.2  Cortical Model: Divisive Normalization

Activities in the cortical layer are updated over time according to:

    o_{ij}(t+1) = \frac{\left[ \sum_{kl} w_{ij,kl}\, o_{kl}(t) \right]^2}{S + \mu \sum_{kl} \left[ \sum_{mn} w_{kl,mn}\, o_{mn}(t) \right]^2},    (3)

where {w_ij,kl} are the filtering weights, o_ij(t) is the activity of unit ij at time t, S is a constant, and μ is what we call the divisive inhibition weight.
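As an illustration, here is a minimal NumPy sketch of one update of equation 3. It assumes the filtering weights w (specified by equation 4 below) have already been assembled into a matrix; the function name and parameter values are ours, for illustration only, not the paper's.

```python
import numpy as np

def normalization_step(o, w, S=0.1, mu=0.01):
    """One iteration of divisive normalization (equation 3).

    o  -- activities of the P x P cortical units, flattened to a length-P**2 vector
    w  -- (P**2, P**2) matrix of filtering weights w_{ij,kl} (see equation 4)
    S  -- constant in the denominator (illustrative value)
    mu -- divisive inhibition weight (illustrative value)
    """
    u = w @ o                           # pool the activity through the filtering weights
    u2 = u ** 2                         # squaring nonlinearity
    return u2 / (S + mu * np.sum(u2))   # divide by the pooled squared activity

# Example: relax a random initial pattern for a few iterations
# (identity weights stand in for the Gaussian filter of equation 4).
P = 10
rng = np.random.default_rng(0)
o = rng.random(P * P)
w = np.eye(P * P)
for _ in range(3):
    o = normalization_step(o, w)
```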
The filtering weights implement a two dimensional Gaussian filter:

    w_{ij,kl} = w_{i-k,\, j-l} = K_w \exp\left( \frac{\cos[2\pi(i-k)/P] - 1}{\sigma_{w\theta}^2} + \frac{\cos[2\pi(j-l)/P] - 1}{\sigma_{w\lambda}^2} \right),    (4)

where K_w is a constant, σ_wθ and σ_wλ control the width of the filtering weights, and there are P² units.

On each iteration the activity is filtered by the weights, squared, and then normalized by the total local activity. Divisive normalization per se only involves the squaring and division by local activity. We have added the filtering weights to obtain a local pooling of activity between cells with similar preferred orientations and spatial frequencies. This pooling can easily be implemented with cortical lateral connections, and it is reasonable to think that such a pooling takes place in the cortex.

2  Simulation Results

Our simulations consist of iterating equation 3 with initial conditions determined by the presented orientation and spatial frequency. The initial conditions are chosen as follows: For a given presentation angle, θ_0, and spatial frequency, λ_0, determine the mean input, f_ij(θ_0, λ_0), via equation 1. Then generate the actual input, {a_ij}, by sampling from the distribution given in equation 2. This serves as our set of initial conditions: o_ij(t = 0) = a_ij.

Iterating equation 3 with the above initial conditions, we found that for very low contrast the activity of all cortical units decayed to zero. Above some contrast threshold, however, the activities converged to a smooth stable hill (see figure 1-A for an example with parameters σ_wθ = σ_wλ = σ_θ = σ_λ = 1/√8, K = 74, C = 1, μ = 0.01). The width of the hill is controlled by the width of the filtering weights. Its peak, on the other hand, depends on the orientation and spatial frequency of the LGN input, θ_0 and λ_0. The peak can thus be used to estimate these quantities (see figure 1-A). To compute the position of the final hill, we used a population vector estimator [3], although any unbiased estimator would work as well. In all cases we looked at, the network produced an unbiased estimate of θ_0 and λ_0.

In our simulations we adjusted σ_wθ and σ_wλ so that the stable hill had the same profile as the mean LGN input (equation 1). As a result, the tuning curves of the cortical units match the tuning curves specified by the pooled LGN input. For this case, we found that the estimate obtained from the network has a variance close to the theoretical minimum, known as the Cramer-Rao bound [3]. For Gaussian noise of fixed variance, the variance of the estimate was 16.6% above this bound, compared to 3833% for the population vector applied directly to the LGN input. In a 1D network (orientation alone), these numbers go to 12.9% for the network versus 613% for the population vector. For Gaussian noise with variance proportional to the mean, the network was 8.8% above the bound, compared to 722% for the population vector applied directly to the input. These numbers are respectively 9% and 108% for the 1D network.
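For concreteness, here is a minimal sketch of the population vector readout and of the comparison against the Cramer-Rao bound, written for the 1D (orientation-only) case with fixed-variance Gaussian noise applied directly to the input. The parameter values and names are illustrative, not the exact simulation settings of the paper.

```python
import numpy as np

# 1D (orientation-only) illustration of the readout and the Cramer-Rao comparison.
P, K, sigma_theta, sigma_noise = 100, 74.0, 1 / np.sqrt(8), 1.0
theta_pref = np.linspace(0.0, 2 * np.pi, P, endpoint=False)
theta0 = np.pi

def mean_input(theta):
    # 1D analogue of equation 1 (spontaneous activity omitted for simplicity)
    return K * np.exp((np.cos(theta - theta_pref) - 1.0) / sigma_theta**2)

def population_vector(o):
    # Population vector estimate of orientation from the activities o [3]
    return np.angle(np.sum(o * np.exp(1j * theta_pref))) % (2 * np.pi)

rng = np.random.default_rng(1)
estimates = np.array([
    population_vector(mean_input(theta0) + sigma_noise * rng.standard_normal(P))
    for _ in range(2000)            # noisy trials, equation 2 with fixed variance
])
errors = np.angle(np.exp(1j * (estimates - theta0)))   # wrap errors to (-pi, pi]
var_pv = np.mean(errors**2)

# Cramer-Rao bound for fixed-variance Gaussian noise: sigma**2 / |F'(theta0)|**2
dtheta = 1e-5
Fp = (mean_input(theta0 + dtheta) - mean_input(theta0 - dtheta)) / (2 * dtheta)
var_cr = sigma_noise**2 / np.sum(Fp**2)
print(var_pv / var_cr)   # well above 1 for the raw input; filtering through the network shrinks it
```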
The network is therefore a close approximation to a maximum likelihood estimator, i.e., it is close to being an ideal observer of the LGN activity with respect to orientation and spatial frequency.

As long as the contrast, C, was superthreshold, large variations in contrast did not affect our results (figure 1-B). However, the tuning of the network units to contrast after reaching the stable state was found to follow a step function, whereas for real neurons the curves are better described by a sigmoid [2]. Improved agreement with experiment was achieved by taking only 2-3 iterations, at which point the performance of the network is close to optimal (figure 1-B) and the tuning curves to contrast are more realistic and closer to sigmoids (figure 1-C). Therefore, reaching a stable state is not required for optimal performance, and in fact leads to contrast tuning curves that are inconsistent with experiment.

3  Mathematical Analysis

We first prove that line attractor networks with sufficiently small noise are close approximations to a maximum likelihood estimator. We then show how this result applies to our simulations with divisive normalization.

3.1  General Case: Line Attractor Networks

Let o_n be the activity vector (denoted by bold type) at discrete time, n, for a set of P interconnected units. We consider a one dimensional network, i.e., only one feature is encoded; generalization to multidimensional networks is straightforward. A generic mapping for this network may be written

    \mathbf{o}_{n+1} = H(\mathbf{o}_n),    (5)

where H is a nonlinear function. We assume that this mapping admits a line attractor, which we denote G(θ), for which G(θ) = H(G(θ)), where θ is a continuous variable [footnote: the line attractor is, in fact, an idealization; for P units the attractors associated with equation 5 consist of P isolated points; however, for P large, the attractors are spaced closely enough that they may be considered a line]. Let the initial state of the network be a function of the presentation parameter, θ_0, plus noise,

    \mathbf{o}_0 = F(\theta_0) + \mathbf{N},    (6)

where F(θ_0) is the function used to generate the data (in our simulations this would correspond to the mean LGN input, equation 1). Iterating the mapping, equation 5, leads eventually to a point on the line attractor. Consequently, as n → ∞, o_n → G(θ). The parameter θ provides an estimate of θ_0.

To determine how well the network does we need to find δθ ≡ θ - θ_0 as a function of the noise, N, then average over the noise to compute the mean and variance of δθ. Because the mapping, equation 5, is nonlinear, this cannot be done exactly. For small noise, however, we can take a perturbative approach and expand around a point on the attractor. For line attractors there is no general method for choosing which point on the attractor to expand around. Our approach will be to expand around an arbitrary point, G(θ), and choose θ by requiring that the quadratic terms be finite. Keeping terms up to quadratic order, the state after n iterations may be written

    \mathbf{o}_n = G(\theta) + \delta\mathbf{o}_n,    (7)

    \delta\mathbf{o}_n = J^n \cdot \delta\mathbf{o}_0 + \frac{1}{2} \sum_{m=0}^{n-1} (J^m \cdot \delta\mathbf{o}_0) \cdot H'' \cdot (J^m \cdot \delta\mathbf{o}_0),    (8)

where J(θ) ≡ [∂_{G(θ)} H(G(θ))]^T is the Jacobian (the superscript T means transpose), H'' is the Hessian of H evaluated at G(θ), and a "·" represents the standard dot product.
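For reference, the single-iteration step from which equation 8 is built up by recursion is simply the second-order Taylor expansion of H around the point G(θ) on the attractor:

```latex
% Second-order expansion of one iteration around G(\theta); J and H'' as defined above.
\delta\mathbf{o}_{n+1}
  \;=\; H\bigl(G(\theta) + \delta\mathbf{o}_n\bigr) - G(\theta)
  \;\approx\; J \cdot \delta\mathbf{o}_n
  \;+\; \tfrac{1}{2}\, \delta\mathbf{o}_n \cdot H'' \cdot \delta\mathbf{o}_n .
```

Iterating this from δo_0 and keeping terms up to quadratic order in the noise, the linear part compounds to J^n · δo_0 while the quadratic corrections are generated from the linearized trajectory J^m · δo_0, which gives equation 8.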
\nBecause  the  mapping,  equation  5,  admits  a  line  attractor,  J  has  one  eigenvalue \nequal  to 1  and  all  others less  than 1.  Denote  the eigenvector with  eigenvalue  1  as \ny  and its  adjoint  v t :  J  . v  =  v  and JT . v t  =  yt.  It is  not  hard  to show  that y  = \n8oG(0),  up  to  a  multiplicative  constant.  Since  J  has  an  eigenvalue  equal  to  1,  to \navoid  the quadratic term in  Eq.  8  approaching infinity as n  -+  00  we  require that \n\nlim  I n . fJoo =  O. \nn-too \n\n(9) \n\nIThe line attractor is,  in fact , an idealization; for  P  units the attractors associated with \nequation  5  consists  of P  isolated  points.  However, for  P  large,  the  attractors  are  spaced \nclosely  enough that they may be considered a  line. \n\n\fDivisive Normalization.  Line Attractor Networks and Ideal Observers \n\n109 \n\nThis  equations  has  an  important  consequence:  it  implies  that,  to  linear  order, \nlimn-too 60 n  = 0  (see  equation  8),  which  in  turn  implies  that  0 00  = G(O)  which, \n~nally, implies  that  0 = O.  Consequently  we  can  find  the  network  estimator of 00 , \n0, by  computing O.  We  now  turn to that task. \nIt is  straightforward to show that JOO  =  vvt .  Combining this expression for  J  with \nequation  9,  using equation  7 to express  600  in  terms of 00  and  G(O),  and, finally \nusing equation 6 to express  00  in  terms of the initial mean activity,  F(Oo),  and the \nnoise,  N,  we  find  that \n\nv t (0)  . [F(Oo)  - G(O) + N]  = O. \n\nUsing 00  = 0 - 60  and expanding F(Oo)  to first  order in 60  then yields \n\n60  = vt(O)  . [N + F(O)  - G(O)] \n\nvt(O)  . F'(O) \n\n. \n\n(10) \n\n(11) \n\nAs  long as v t  is  orthogonal to F(O)  - G(O),  (60)  = 0 and the estimator is  unbiased. \nThis  must  be  checked  on  a  case  by  case  basis,  but  for  the  circularly  symmetric \nnetworks  we  considered orthogonality is  satisfied. \nWe  can now  calculate the variance  of the  network  estimate,  (60)2.  Assuming  v t  . \n[F(O)  - G(O)]  = 0,  equation  11  implies that \n\n2  vt.R\u00b7vt \n(60)  =  [vt . F'F  ' \n\n(12) \n\nwhere a prime denotes a derivative with respect to 0 and R  is the covariance matrix \nof the noise,  R  = (NN).  The network  is  equivalent  to  maximum likelihood  when \nthis  variance  is  equal  to  the  Cramer-Rao  bound  [3],  (60)bR.  If the  noise,  N,  is \nGaussian with  a  covariance matrix independent  of 0,  this bound is  equal to: \n\n2 \n\n(60)CR  = F'. R - l  . F' \n\n1 \n\n(13) \n\nFor  independent  Gaussian  noise  of  fixed  variance,  (T2,  and  zero  covariance,  the \nvariance  of the  network  estimate,  equation  12,  becomes  (T2 1(IF'1 2 cos2  f-L)  where  f-L \nis  the  angle  between  v t  and  F'.  The  Cramer-Rao  bound,  on  the  other  hand,  is \nequal to (T2 IIF'1 2 .  These expressions differ  only by  cos2  J1.,  which  is  1 if F ex  v t .  In \naddition,  it is  close  to  1 for  networks  that have  identical input  and output tunin1 \ncurves,  F(O)  = G(O),  and  the  Jacobian,  J,  is  nearly  symmetric,  so  that  v  :::::  v \n(recall that v  = G').  If these last two  conditions are satisfied,  the network comes \nclose to being  a maximum likelihood estimator. \n\n3.2  Application to  Divisive Normalization \n\nDivisive normalization is  a particular example of the general case considered above. 
\nFor  simplicity,  in  our  simulations  we  chose  the  input  and  output  tuning  curves  to \nbe  equal  (F  = G  in  the  above  notation),  which  lead  to  a  value  of 0.87  for  cos2  f-L \n(evaluated  numerically).  This  predicted  a  variance  15%  above  the  Cramer-Rao \n\n\f110 \n\nS.  Deneve,  A. Pouget and P E.  Latham \n\nbound for  independent Gaussian noise with fixed  variance, consistent with the 16% \nwe  obtained  in  our  simulations.  The  network  also  handles  fairly  well  other  noise \ndistributions,  such  as  Gaussian  noise  with  variance  proportional  to  the  mean,  as \nillustrated by our simulations. \n\n4  Conclusions \n\nWe  have  recently  shown  that  a  subclass  of line  attractor networks  can be  used  as \nmaximum  likelihood  estimators[3].  This  paper  extend  this  conclusion  to  a  much \nwider class of networks,  namely,  any network that admits a  line  (or,  by straightfor(cid:173)\nward extension of the above analysis,  a  higher dimensional)  attractor.  This is  true \nin  particular  for  networks  using  divisive  normalization,  a  normalization  which  is \nthought to match quite closely the nonlinearity found  in the primary visual cortex \nand MT. \n\nAlthough our analysis relies  on  the existence of an attractor,  this is  not  a  require(cid:173)\nment  for  obtaining  near  optimal  noise  filtering.  As  we  have  seen,  2-3  iterations \nare  enough  to  achieve  asymptotic  performance  (except  at  contrasts  barely  above \nthreshold).  What  matters  most  is  that  our network  implement  a  sequence  of low \npass filtering to filter out the noise, followed by a square nonlinearity to compensate \nfor  the widening of the tuning curve due to the low  pass filter,  and a  normalization \nto  weaken  contrast  dependence.  It  is  likely  that  this  process  would  still  clean  up \nnoise efficiently in the first  2-3 iterations even if activity decayed to zero eventually, \nthat is to say, even if the hills of activity were not stable states.  This would allow us \nto  apply  our  approach to other types  of networks,  including  those lacking circular \nsymmetry and networks with  continuously clamped inputs. \n\nTo  conclude,  we  propose  that  each  cortical layer  may read  out  the  activity in  the \npreceding  layer  in  an  optimal  way  thanks  to  the  nonlinear  pooling  properties  of \ndivisive  normalization,  and,  as  a  result,  may  behave  like  an  ideal  observer.  It  is \ntherefore possible that the ability to read out neuronal codes in the sensory cortices \nin  an  optimal  way  may  not  be  confined  to  a  few  areas  like  the  parietal or frontal \ncortex,  but may instead be a  general property of every cortical layer. \n\nReferences \n\n[1]  D.  Heeger.  Normalization of cell  responses  in cat striate cortex.  Visual  Neuro(cid:173)\n\nscience,  9:181- 197,1992. \n\n[2]  L.  Itti,  C.  Koch,  and  J.  Braun.  A  quantitative  model  for  human  spatial  vi(cid:173)\n\nsion  threshold  on  the  basis  of non-linear  interactions among spatial filters.  In \nR.  Lippman,  J.  Moody,  and  D.  Touretzky,  editors,  Advances  in  Neural  Infor(cid:173)\nmation Processing  Systems,  volume  11.  Morgan-Kaufmann, San Mateo,  1998. \n\n[3]  A.  Pouget, K.  Zhang, S.  Deneve, and P. Latham. Statistically efficient estimation \n\nusing population coding.  Neural  Computation,  10:373- 401,  1998. \n\n[4]  E. Simoncelli and D.  Heeger.  A model of neuronal responses in visual area MT. 
\n\nVision  Research,  38(5):743- 761 , 1998. \n\n\f", "award": [], "sourceid": 1536, "authors": [{"given_name": "Sophie", "family_name": "Den\u00e8ve", "institution": null}, {"given_name": "Alexandre", "family_name": "Pouget", "institution": null}, {"given_name": "Peter", "family_name": "Latham", "institution": null}]}