{"title": "Optimization Principles for the Neural Code", "book": "Advances in Neural Information Processing Systems", "page_first": 281, "page_last": 287, "abstract": null, "full_text": "Optimization Principles for  the  Neural \n\nCode \n\nMichael DeWeese \n\nSloan Center,  Salk Institute \n\nLa Jolla, CA  92037 \ndeweese@salk.edu \n\nAbstract \n\nRecent  experiments  show  that  the  neural  codes  at  work  in  a  wide \nrange of creatures share some common features.  At first sight, these \nobservations seem unrelated.  However,  we show  that these  features \narise naturally in  a linear filtered  threshold crossing  (LFTC) model \nwhen we set the threshold to maximize the transmitted information. \nThis maximization process  requires  neural  adaptation  to  not only \nthe  DC  signal  level,  as  in  conventional  light  and  dark  adaptation, \nbut also to the statistical structure of the signal and noise distribu(cid:173)\ntions.  We  also  present  a  new  approach for  calculating the  mutual \ninformation between  a  neuron's  output  spike  train and any aspect \nof its  input signal  which  does not require  reconstruction  of the  in(cid:173)\nput  signal.  This formulation  is  valid  provided  the  correlations  in \nthe  spike  train are  small, and we  provide  a  procedure  for  checking \nthis  assumption.  This paper is  based  on joint work  (DeWeese  [1], \n1995).  Preliminary  results  from  the  LFTC  model  appeared  in  a \nprevious  proceedings  (DeWeese  [2],  1995),  and the  conclusions  we \nreached  at that time have been reaffirmed by further  analysis of the \nmodel. \n\n1 \n\nIntroduction \n\nMost sensory receptor cells produce analog voltages and currents which are smoothly \nrelated to analog signals in the outside world.  Before being transmitted to the brain, \nhowever,  these  signals  are  encoded  in  sequences  of identical  pulses  called  action \npotentials or spikes.  We  would like to know  if there  is  a  universal principle at work \nin the choice of these coding strategies.  The existence of such a potentially powerful \ntheoretical  tool  in  biology  is  an  appealing  notion,  but  it  may  not  turn  out  to  be \nuseful.  Perhaps  the  function  of biological  systems  is  best  seen  as  a  complicated \ncompromise  among  constraints  imposed  by  the  properties  of biological materials, \nthe  need  to  build  the  system  according  to  a  simple set  of development  rules,  and \n\n\f282 \n\nM.  DEWEESE \n\nthe fact  that current  systems must  arise from their  ancestors  by evolution  through \nrandom change  and  selection.  In  this  view,  biology  is  history,  and  the  search  for \nprinciples  (except for  evolution itself)  is likely to  be futile.  Obviously,  we  hope that \nthis view is wrong,  and that at least some of biology is understandable in terms of the \nsame sort  of universal  principles that  have emerged in  the physics  of the  inanimate \nworld. \n\nAdrian noticed in the  1920's that every  peripheral neuron  he  checked  produced  dis(cid:173)\ncrete,  identical pulses  no matter what input he  administered  (Adrian,  1928).  From \nthe  work  of Hodgkin  and  Huxley  we  know  that  these  pulses  are  stable  non-linear \nwaves  which  emerge from the non-linear dynamics describing  the electrical  proper(cid:173)\nties  of the  nerve  cell  membrane These  dynamics in  turn derive  from  the  molecular \ndynamics of specific  ion channels  in the  cell membrane.  
By analogy with other non-linear wave problems, we thus understand that these signals can propagate over a long distance (e.g., roughly one meter from touch receptors in a finger to their targets in the spinal cord) in such a way that every spike arrives with the same shape. This is an important observation, since it implies that all information carried by a spike train is encoded in the arrival times of the spikes. Since a creature's brain is connected to all of its sensory systems by such axons, all the creature knows about the outside world must be encoded in spike arrival times.

Until recently, neural codes have been studied primarily by measuring changes in the rate of spike production caused by different input signals. Recently it has become possible to characterize the codes in information-theoretic terms, and this has led to the discovery of some potentially universal features of the code (Bialek, 1996) (or see (Bialek, 1993) for a brief summary). They are:

1. Very high information rates. The record so far is 300 bits per second in a cricket mechanical sensor.

2. High coding efficiency. In cricket and frog vibration sensors, the information rate is within a factor of 2 of the entropy per unit time of the spike train.

3. Linear decoding. Despite evident non-linearities of the nervous system, spike trains can be decoded by simple linear filters. Thus we can write an estimate of the analog input signal $s(t)$ as $s_{est}(t) = \sum_i K_1(t - t_i)$, with $K_1$ chosen to minimize the mean-squared error $\chi^2$ in the estimate. Adding non-linear $K_2(t - t_i, t - t_j)$ terms does not significantly reduce $\chi^2$. (A minimal numerical sketch of such a decoder appears at the end of this section.)

4. Moderate signal-to-noise ratios (SNR). The SNR in these experiments was defined as the ratio of the power spectrum of the input signal to that of the noise referred back to the input; the power spectrum of the noise was approximated by the $\chi^2$ defined above. All these examples of high information transmission rates have SNR of order unity over a broad bandwidth, rather than high SNR in a narrow band.

We will try to tie all of these observations together by elevating the first to a principle: The neural code is chosen to maximize information transmission, where information is quantified following Shannon. We apply this principle in the context of a simple model neuron which converts analog signals into spike trains. Before we consider a specific model, we will present a procedure for expanding the information rate of any point process encoding of an analog signal about the limit where the spikes are uncorrelated. We will briefly discuss how this can be used to measure information rates in real neurons.

This work will also appear in Network.
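As promised above, here is a minimal numerical sketch of linear decoding (feature 3). Everything in it is an illustrative stand-in rather than any of the experimental systems discussed here: the stimulus statistics, the exponential Poisson encoder, and the filter support are hypothetical choices. A kernel $K_1$ is fit by least squares to minimize $\chi^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T = 0.001, 20.0                    # 1 ms bins, 20 s of data (illustrative)
n = int(T / dt)

# Toy stimulus: Gaussian white noise smoothed to a ~10 ms correlation time.
kernel = np.exp(-np.arange(200) * dt / 0.010)
s = np.convolve(rng.standard_normal(n), kernel, mode="same")
s /= s.std()

# Toy encoder: inhomogeneous Poisson spikes whose rate follows the signal.
rate = 20.0 * np.exp(0.5 * s)          # spikes/s; hypothetical nonlinearity
spikes = (rng.random(n) < rate * dt).astype(float)

# Fit K1 by least squares so that s_est(t) = sum_i K1(t - t_i) minimizes the
# mean-squared error chi^2; acausal lags are allowed, as in the reconstruction
# experiments (np.roll wraps at the edges, which is harmless for this sketch).
lags = np.arange(-50, 51)              # +/- 50 ms of filter support
X = np.stack([np.roll(spikes, lag) for lag in lags], axis=1)
K1, *_ = np.linalg.lstsq(X, s, rcond=None)
s_est = X @ K1

chi2 = np.mean((s - s_est) ** 2)
print(f"fraction of stimulus variance left unexplained: {chi2 / s.var():.2f}")
```

The residual $\chi^2/\langle s^2 \rangle$ plays the role of the noise power referred back to the input in feature 4.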
2 Information Theory

In the 1940's, Shannon proposed a quantitative definition for "information" (Shannon, 1949). He argued first that the average amount of information gained by observing some event $x$ is the entropy of the distribution from which $x$ is chosen, and then showed that this is the only definition consistent with several plausible requirements. This definition implies that the amount of information one signal can provide about some other signal is the difference between the entropy of the first signal's a priori distribution and the entropy of its conditional distribution. The average of this quantity is called the mutual (or transmitted) information. Thus, we can write the amount of information that the spike train, $\{t_i\}$, tells us about the time dependent signal, $s(t)$, as

$$ I = \left\langle \int \mathcal{D}t_i \; P[\{t_i\}|s(\cdot)] \, \log_2 \frac{P[\{t_i\}|s(\cdot)]}{P[\{t_i\}]} \right\rangle_s , \tag{1} $$

where $\int \mathcal{D}t_i$ is shorthand for integration over all arrival times $\{t_i\}$ from 0 to $T$ and summation over the total number of spikes, $N$ (we have divided the integration measure by $N!$ to prevent overcounting due to equivalent permutations of the spikes, rather than absorb this factor into the probability distribution as we did in (DeWeese [1], 1995)). $\langle \cdots \rangle_s \equiv \int \mathcal{D}s \, P[s(\cdot)] \cdots$ denotes integration over the space of functions $s(t)$ weighted by the signal's a priori distribution, $P[\{t_i\}|s(\cdot)]$ is the probability distribution for the spike train when the signal is fixed, and $P[\{t_i\}]$ is the spike train's average distribution.

3 Arbitrary Point Process Encoding of an Analog Signal

In order to derive a useful expression for the information given by Eq. (1), we need an explicit representation for the conditional distribution of the spike train. If we choose to represent each spike as a Dirac delta function, then the spike train can be defined as

$$ \rho(t) = \sum_{i=1}^{N} \delta(t - t_i) . \tag{2} $$

This is the output spike train for our cell, so it must be a functional of both the input signal, $s(t)$, and all the noise sources in the cell, which we will lump together and call $\eta(t)$. Choosing to represent the spikes as delta functions allows us to think of $\rho(t)$ as the probability of finding a spike at time $t$ when both the signal and noise are specified. In other words, if the noise were not present, $\rho$ would be the cell's firing rate, singular though it is. This implies that in the presence of noise the cell's observed firing rate, $r(t)$, is the noise average of $\rho(t)$:

$$ r(t) = \int \mathcal{D}\eta \, P[\eta(\cdot)|s(\cdot)] \, \rho(t) = \langle \rho(t) \rangle_\eta . \tag{3} $$

Notice that by averaging over the conditional distribution for the noise rather than its a priori distribution as we did in (DeWeese [1], 1995), we ensure that this expression is still valid if the noise is signal dependent, as is the case in many real neurons.

For any particular realization of the noise, the spike train is completely specified, which means that the distribution for the spike train when both the signal and noise are fixed is a modulated Poisson process with a singular firing rate, $\rho(t)$. We emphasize that this is true even though we have assumed nothing about the encoding of the signal in the spike train when the noise is not fixed. One might then assume that the conditional distribution for the spike train for fixed signal would be the noise average of the familiar formula for a modulated Poisson process:

$$ P[\{t_i\}|s(\cdot)] \approx \left\langle \left[ \prod_{i=1}^{N} \rho(t_i) \right] e^{-\int dt \, \rho(t)} \right\rangle_\eta . \tag{4} $$

However, this is only approximately true due to subtleties arising from the singular nature of $\rho(t)$. One can derive the correct expression (DeWeese [1], 1995) by carefully taking the continuum limit of an approximation to this distribution defined for discrete time.
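To make that discrete-time construction concrete, here is a minimal sketch; the rate profile is a hypothetical smooth stand-in for a single noise realization of $\rho(t)$. Each bin of width $\Delta t$ holds at most one spike, with probability $\rho(t)\,\Delta t$:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.0005, 2.0                     # bin width and duration (illustrative)
t = np.arange(0.0, T, dt)

# Hypothetical smooth stand-in for rho(t) at one fixed signal and noise.
rho = 30.0 * (1.0 + 0.8 * np.sin(2 * np.pi * 3.0 * t))   # spikes/s

# Discrete-time approximation: at most one spike per bin, with probability
# rho(t) * dt.  Because two spikes can never share a bin, terms with repeated
# factors of rho(t) at equal times never arise, and they stay absent in the
# continuum limit; this is the origin of the superscripted minus sign in the
# exact expression below.
spike_bins = rng.random(t.size) < rho * dt
spike_times = t[spike_bins]

print(f"{spike_times.size} spikes; expected about {(rho * dt).sum():.0f}")
```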
The result of this continuum limit is the same sum of noise averages over products of $\rho$'s produced by expanding the exponential in Eq. (4) in powers of $\int dt \, \rho(t)$, except that all terms containing more than one factor of $\rho(t)$ at equal times are not present. The exact answer is:

$$ P[\{t_i\}|s(\cdot)] = \left\langle \left[ \prod_{i=1}^{N} \rho(t_i) \right] e^{-\int dt \, \rho(t)} \right\rangle_\eta^{-} , \tag{5} $$

where the superscripted minus sign reminds us to remove all terms containing products of coincident $\rho$'s after expanding everything in the noise average in powers of $\rho$.

4 Expanding About the Poisson Limit

An exact solution for the mutual information between the input signal and spike train would be hopeless for all but a few coding schemes. However, the success of linear decoding coupled with the high information rates seen in the experiments suggests to us that the spikes might be transmitting roughly independent information (see (DeWeese [1], 1995) or (Bialek, 1993) for a more fleshed out argument on this point). If this is the case, then the spike train should approximate a Poisson process. We can explicitly show this relationship by performing a cluster expansion on the right hand side of Eq. (5):

$$ P[\{t_i\}|s(\cdot)] = \left[ \prod_{i=1}^{N} r(t_i) \right] e^{-\int dt \, r(t)} \left[ 1 + \sum_m C_\eta(m) \right] , \tag{6} $$

where we have defined $\Delta\rho(t) \equiv \rho(t) - \langle \rho(t) \rangle_\eta = \rho(t) - r(t)$ and introduced $C_\eta(m)$, which collects all terms containing $m$ factors of $\Delta\rho$. For example,

$$ C_\eta(2) \equiv \frac{1}{2} \sum_{i \neq j} \frac{\langle \Delta\rho_i \, \Delta\rho_j \rangle_\eta}{r_i \, r_j} - \sum_{i=1}^{N} \int dt' \, \frac{\langle \Delta\rho' \, \Delta\rho_i \rangle_\eta}{r_i} + \frac{1}{2} \int dt' \, dt'' \, \langle \Delta\rho' \, \Delta\rho'' \rangle_\eta^{-} . \tag{7} $$

Clearly, if the correlations between spikes are small in the noise distribution, then the $C_\eta$'s will be small, and the spike train will nearly approximate a modulated Poisson process when the signal is fixed.

Performing the cluster expansion on the signal average of Eq. (5) yields a similar expression for the average distribution for the spike train:

$$ P[\{t_i\}] = \bar{r}^N \, e^{-\bar{r} T} \left[ 1 + \sum_m C_{\eta,s}(m) \right] , \tag{8} $$

where $T$ is the total duration of the spike train, $\bar{r}$ is the average firing rate, and $C_{\eta,s}(m)$ is identical to $C_\eta(m)$ with these substitutions: $r(t) \to \bar{r}$, $\Delta\rho(t) \to \delta\rho(t) \equiv \rho(t) - \bar{r}$, and $\langle \cdots \rangle_\eta \to \langle\langle \cdots \rangle_\eta\rangle_s$. In this case, the distribution for a homogeneous Poisson process appears in front of the square brackets, and inside we have 1 + corrections due to correlations in the average spike train.

5 The Transmitted Information

Inserting these expressions for $P[\{t_i\}|s(\cdot)]$ and $P[\{t_i\}]$ (taken to all orders in $\Delta\rho$ and $\delta\rho$, respectively) into Eq. (1), and expanding to second non-vanishing order in $\bar{r}\tau_c$, results in a useful expression for the information (DeWeese [1], 1995):

$$ I = \int_0^T dt \, \left\langle r(t) \log_2 \frac{r(t)}{\bar{r}} \right\rangle_s + (\text{first correction}) , \tag{9} $$

where the first correction is a double integral over the pair correlations of $\Delta\rho$ and $\delta\rho$; its explicit form, with the time notation suppressed inside the double integral, is given in (DeWeese [1], 1995). If the signal and noise are stationary then we can replace the $\int_0^T dt$ in front of each of these terms by $T$, illustrating that the information does indeed grow linearly with the duration of the spike train.

The leading term, which is exact if there are no correlations between the spikes, depends only on the firing rate, and is never negative. The first correction is positive when the correlations between pairs of spikes are being used to encode the signal, and negative when individual spikes carry redundant information.
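Because the leading term depends only on the firing rate, it is straightforward to evaluate numerically once $r(t)$ is in hand. A minimal sketch, with a hypothetical rate profile standing in for the noise-averaged rate at one fixed signal:

```python
import numpy as np

dt, T = 0.001, 10.0                     # illustrative time grid
t = np.arange(0.0, T, dt)

# Hypothetical noise-averaged rate r(t) for one fixed signal; in an
# experiment this would be the PSTH over repeated presentations.
r = 15.0 * (1.0 + 0.9 * np.sin(2 * np.pi * 1.5 * t))     # spikes/s
rbar = r.mean()                         # average rate (stationary case)

# Leading (Poisson) term of Eq. (9): integral of r(t) log2[r(t)/rbar].
# It vanishes for a constant rate and is never negative.
I0 = np.sum(r * np.log2(r / rbar)) * dt                  # bits over [0, T]
print(f"leading term: {I0:.1f} bits, i.e. {I0 / (rbar * T):.2f} bits/spike")
```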
This correction term is cumbersome, but it deserves attention because it is experimentally accessible, as we now describe.

This formula can be used to measure information rates in real neurons without having to assume any method of reconstructing the signal from the spike train. In the experimental context, averages over the (conditional) noise distribution become repeated trials with the same input signal, and averages over the signal are accomplished by summing over all trials. $r(t)$, for example, is the histogram of the spike trains resulting from the same input signal, while $\bar{r}(t)$ is the histogram of all spike trains resulting from all input signals. If the signal and noise are stationary, then $\bar{r}$ will not be time dependent. $\langle \rho(t)\rho(t') \rangle_\eta$ is in general a 2-dimensional histogram which is signal dependent: it is equal to the number of spike trains resulting from some specific input signal which simultaneously contain a spike in the time bins containing $t$ and $t'$. If the noise is stationary, then this is a function of only $t - t'$, and it reduces to a 1-dimensional histogram.

In order to measure the full amount of information contained in the spike train, it is crucial to bin the data in small enough time bins to resolve all of the structure in $r(t)$, $\langle \rho(t)\rho(t') \rangle_\eta$, and so on. We have assumed nothing about the noise or signal; in fact, they can even be correlated, so that the noise averages are signal dependent, without changing the experimental procedure. The experimenter can also choose to fix only some aspects of the sensory data during the noise averaging step, thus measuring the mutual information between the spike train and only these aspects of the input. The only assumption we have made up to this point is that the spikes are roughly uncorrelated, which can be checked by comparing the leading term to the first correction, just as we do for the model we discuss in the next section.
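A minimal sketch of this estimation procedure on simulated data (the signal ensemble, the encoder, and the trial counts are all hypothetical; real data would replace the simulated trials): repeated trials with the same signal give $r(t)$, pooling over all signals gives $\bar{r}$, and the leading term follows directly.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T, n_trials, n_signals = 0.002, 2.0, 200, 20
t = np.arange(0.0, T, dt)

# Hypothetical signal ensemble and a toy noisy encoder (illustrative only).
signals = [np.sin(2 * np.pi * rng.uniform(0.5, 3.0) * t)
           for _ in range(n_signals)]

def run_trials(s):
    # Repeated presentations of the same signal = noise average in the lab.
    rate = 20.0 * np.exp(0.6 * s)                 # spikes/s
    return rng.random((n_trials, t.size)) < rate * dt

# r(t): PSTH over trials with the SAME signal; rbar: pooled over everything.
psths = [run_trials(s).mean(axis=0) / dt for s in signals]
rbar = np.mean(psths)

# Leading term of Eq. (9), averaged over the signal ensemble (in bits).
# (Bins with r = 0 contribute zero, hence the floor inside the log.)
I0 = np.mean([np.sum(r * np.log2(np.maximum(r, 1e-12) / rbar)) * dt
              for r in psths])
print(f"estimated leading term: {I0:.2f} bits per {T:.0f} s trial")

# The pair histogram <rho(t)rho(t')> needed for the first correction would
# come from spike coincidences across the same repeated trials.
```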
6 The Linear Filtered Threshold Crossing Model

As we reported in a previous proceedings (DeWeese [2], 1995) (and see (DeWeese [1], 1995) for details), the leading term in Eq. (9) can be calculated exactly in the case of a linear filtered threshold crossing (LFTC) model when the signal and noise are drawn from independent Gaussian distributions. Unlike the Integrate and Fire (IF) model, the LFTC model does not have a "renewal process" which resets the value of the filtered signal to zero each time the threshold is reached. Stevens and Zador have developed an alternative formulation for the information transmission which is better suited for studying the IF model under some circumstances (Stevens, 1996), and they give a nice discussion of the way in which these two formulations complement each other.

For the LFTC model, the leading term is a function of only three variables: 1) the threshold height; 2) the ratio of the variances of the filtered signal and the filtered noise, $\langle s^2(t) \rangle_s / \langle \eta^2(t) \rangle_\eta$, which we refer to as the SNR; and 3) the ratio of correlation times of the filtered signal and the filtered noise, $\tau_s/\tau_\eta$, where $\tau_s^2 \equiv \langle s^2(t) \rangle_s / \langle \dot{s}^2(t) \rangle_s$, and similarly for the noise. In these definitions, and in what follows, we absorb the linear filter into our definitions for the power spectra of the signal and noise. Near the Poisson limit, the linear filter can only affect the information rate through its generally weak influence on the ratios of variances and correlation times of the signal and noise, so we focus on the threshold to understand adaptation in our model cell.

When the ratio of correlation times of the signal and noise is moderate, we find a maximum for the information rate near the Poisson limit (the leading term is roughly ten times the first correction). For the interesting and physically relevant case where the noise is slightly more broadband than the signal as seen through the cell's prefiltering, we find that the maximum information rate is achieved with a threshold setting which does not correspond to the maximum average firing rate, illustrating that this optimum is non-trivial. Provided the SNR is about one or less, linear decoding does well: a lower bound on the information rate based on optimal linear reconstruction of the signal is within a factor of two of the total available information in the spike train. As the SNR grows unbounded, this lower bound asymptotes to a constant. In addition, the required timing resolution for extracting the information from the spike train is quite modest: discretizing the spike train into bins which are half as wide as the correlation time of the signal degrades the information rate by less than 10%. However, at maximum information transmission, the information per spike is low, $R_{max}/\bar{r} \approx 0.7$ bits/spike, much lower than the 3 bits/spike seen in the cricket. This low information rate drives the efficiency down to 1/3 of the experimental values despite the model's robustness to timing jitter. Aside from the low information rate, the optimized model captures all the experimental features we set out to explain.
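For readers who want to experiment, here is a minimal simulation sketch of an LFTC encoder. The correlation times, SNR, and threshold are illustrative stand-ins, and the results quoted above come from the exact analytic calculation, not from simulations like this one. Filtered Gaussian signal and noise are summed, and a spike is recorded at each upward crossing of the threshold:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, T = 0.0005, 20.0
n = int(T / dt)

def filtered_gaussian(tau):
    # Gaussian trace with correlation time tau (discrete OU recursion),
    # standing in for the linearly filtered signal or noise.
    a = np.exp(-dt / tau)
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for i in range(1, n):
        x[i] = a * x[i - 1] + np.sqrt(1.0 - a * a) * rng.standard_normal()
    return x

# Noise slightly more broadband than the signal (tau_eta < tau_s), SNR ~ 2.
s = filtered_gaussian(tau=0.020)
eta = filtered_gaussian(tau=0.010) / np.sqrt(2.0)

# Spikes at upward threshold crossings of the summed filtered input;
# theta is the quantity the optimization principle says should adapt.
theta = 1.0
v = s + eta
spikes = np.flatnonzero((v[1:] >= theta) & (v[:-1] < theta))
print(f"mean firing rate: {spikes.size / T:.1f} spikes/s at threshold {theta}")
```

Sweeping theta and evaluating the information at each setting is the optimization described above; note that nothing is reset after a crossing, which is what distinguishes the LFTC model from the IF model.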
7 Concluding Remarks

We have derived a useful expression for the transmitted information which can be used to measure information rates in real neurons provided the correlations between spikes are shorter range than the average inter-spike interval. We have described a method for checking this hypothesis experimentally. The four seemingly unrelated features that were common to several experiments on a variety of neurons are actually the natural consequences of maximizing the transmitted information. Specifically, they are all due to the relation between $\bar{r}$ and $\tau_c$ that is imposed by the optimization. We reiterate our previous prediction (DeWeese [2], 1995; Bialek, 1993): optimizing the code requires that the threshold adapt not only to cancel DC offsets, but also to the statistical structure of the signal and noise. Experimental hints of adaptation to statistical structure have recently been seen in the fly visual system (de Ruyter van Steveninck, 1994) and in the salamander retina (Warland, 1995).

8 References

M. DeWeese 1995 Optimization Principles for the Neural Code (Dissertation, Princeton University)

M. DeWeese and W. Bialek 1995 Information flow in sensory neurons Il Nuovo Cimento 17D 733-738

E. D. Adrian 1928 The Basis of Sensation (New York: W. W. Norton)

F. Rieke, D. Warland, R. de Ruyter van Steveninck, and W. Bialek 1996 Neural Coding (Boston: MIT Press)

W. Bialek, M. DeWeese, F. Rieke, and D. Warland 1993 Bits and Brains: Information Flow in the Nervous System Physica A 200 581-593

C. E. Shannon 1949 Communication in the presence of noise Proc. I.R.E. 37 10-21

C. Stevens and A. Zador 1996 Information Flow Through a Spiking Neuron in M. Hasselmo ed Advances in Neural Information Processing Systems, Vol 8 (Boston: MIT Press) (this volume)

R. R. de Ruyter van Steveninck, W. Bialek, M. Potters, and R. H. Carlson 1994 Statistical adaptation and optimal estimation in movement computation by the blowfly visual system, in IEEE International Conference on Systems, Man, and Cybernetics pp 302-307

D. Warland, M. Berry, S. Smirnakis, and M. Meister 1995 personal communication
", "award": [], "sourceid": 1120, "authors": [{"given_name": "Michael", "family_name": "DeWeese", "institution": null}]}