{"title": "Bayesian Self-Organization", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Bayesian Self-Organization \n\nAlan L. Yuille \n\nDivision of Applied Sciences \n\nHarvard University \n\nCambridge, MA 02138 \n\nStelios M. Smirnakis \n\nLyman Laboratory of Physics \n\nHarvard University \n\nCambridge, MA 02138 \n\nLei Xu * \n\nDept. of Computer Science \n\nHSH ENG BLDG, Room 1006 \n\nThe Chinese University of Hong Kong \n\nShatin, NT \nHong Kong \n\n*Lei Xu was a research scholar in the Division of Applied Sciences at Harvard University while this work was performed. \n\nAbstract \n\nRecent work by Becker and Hinton (Becker and Hinton, 1992) shows a promising mechanism, based on maximizing mutual information assuming spatial coherence, by which a system can self-organize to learn visual abilities such as binocular stereo. We introduce a more general criterion, based on Bayesian probability theory, and thereby demonstrate a connection to Bayesian theories of visual perception and to other organization principles for early vision (Atick and Redlich, 1990). Methods for implementation using variants of stochastic learning are described and, for the special case of linear filtering, we derive an analytic expression for the output. \n\n1 Introduction \n\nThe input intensity patterns received by the human visual system are typically complicated functions of the object surfaces and light sources in the world. It seems probable, however, that humans perceive the world in terms of surfaces and objects (Nakayama and Shimojo, 1992). 
Thus the visual system must be able to extract information from the input intensities that is relatively independent of the actual intensity values. Such abilities may not be present at birth and hence must be learned. It seems, for example, that binocular stereo develops at about the age of two to three months (Held, 1981). \n\nBecker and Hinton (Becker and Hinton, 1992) describe an interesting mechanism for self-organizing a system to achieve this. The basic idea is to assume spatial coherence of the structure to be extracted and to train a neural network by maximizing the mutual information between neurons with disjoint receptive fields. For binocular stereo, for example, the surface being viewed is assumed flat (see (Becker and Hinton, 1992) for generalizations of this assumption) and hence has spatially constant disparity. The intensity patterns, however, do not have any simple spatial behaviour. Adjusting the synaptic strengths of the network to maximize the mutual information between neurons with non-overlapping receptive fields, for an ensemble of images, causes the neurons to extract features that are spatially coherent, thereby obtaining the disparity [fig. 1]. \n\n[Figure 1 diagram: maximize $I(a;b)$] \n\nFigure 1: In Hinton and Becker's initial scheme (Becker and Hinton, 1992), maximization of mutual information between neurons with spatially disjoint receptive fields leads to disparity tuning, provided they train on spatially coherent patterns (i.e. those for which disparity changes slowly with spatial position). \n\nWorkers in computer vision face a similar problem of estimating the properties of objects in the world from intensity images. 
It is commonly stated that vision is ill-posed (Poggio et al, 1985) and that prior assumptions about the world are needed to obtain a unique perception. It is convenient to formulate such assumptions by the use of Bayes' theorem $P(S|D) = P(D|S)P(S)/P(D)$. This relates the probability $P(S|D)$ of the scene $S$ given the data $D$ to the prior probability of the scene $P(S)$ and the imaging model $P(D|S)$ ($P(D)$ can be interpreted as a normalization constant). Thus a vision theorist (see (Clark and Yuille, 1990), for example) determines an imaging model $P(D|S)$, picks a set of plausible prior assumptions about the world $P(S)$ (such as natural constraints (Marr, 1982)), applies Bayes' theorem, and then picks an interpretation $S^*$ from some statistical estimator of $P(S|D)$ (for example, the maximum a posteriori (MAP) estimator $S^* = \\arg\\max_S P(S|D)$). \n\nAn advantage of the Bayesian approach is that, by nature of its probabilistic formulation, it can be readily related to learning with a teacher (Kersten et al, 1987). It is unclear, however, whether such a teacher will always be available. Moreover, from Becker and Hinton's work on self-organization, it seems that a teacher is not always necessary. This paper proposes a way for generalizing the self-organization approach, by starting from a Bayesian perspective, and thereby relating it to Bayesian theories of vision. The key idea is to force the activity distribution of the outputs to be close to a pre-specified prior distribution $P_p(S)$. We argue that this approach is in the same spirit as (Becker and Hinton, 1992), because we can choose the prior distribution to enforce spatial coherence, but it is also more general since many other choices of the prior are possible. 
It also has some relation to the work performed by Atick and Redlich (Atick and Redlich, 1990) for modelling the early visual system. We will take the viewpoint that the prior $P_p(S)$ is assumed known in advance by the visual system (perhaps by being specified genetically) and will act as a self-organizing principle. Later we will discuss ways that this might be relaxed. \n\n2 Theory \n\nWe assume that the input $D$ is a function of a signal $\\Sigma$ that the system wants to determine and a distractor $N$ [fig. 2]. For example $\\Sigma$ might correspond to the disparities of a pair of binocular stereo images and $N$ to the intensity patterns. The distribution of the inputs is $P_D(D)$ and the system assumes that the signal $\\Sigma$ has distribution $P_p(\\Sigma)$. \n\nLet the output of the system be $S = G(D, \\gamma)$ where $G$ is a function of a set of parameters $\\gamma$ to be determined. For example, the function $G(D, \\gamma)$ could be represented by a multi-layer perceptron with the $\\gamma$'s being the synaptic weights. By approximation theory, it can be shown that a large variety of neural networks can approximate any input-output function arbitrarily well given enough hidden nodes (Hornik et al, 1989). \n\nThe aim of self-organizing the network is to ensure that the parameters $\\gamma$ are chosen so that the outputs $S$ are as close to the $\\Sigma$ as possible. We claim that this can be achieved by adjusting the parameters $\\gamma$ so as to make the derived distribution of the outputs $P_{DD}(S : \\gamma) = \\int \\delta(S - G(D, \\gamma)) P_D(D) [dD]$ as close as possible to $P_p(S)$. This can be seen to be a consistency condition for a Bayesian theory, as from Bayes' formula we obtain the equation: \n\n$$\\int P(S|D) P_D(D) [dD] = \\int P(D|S) P_p(S) [dD] = P_p(S).$$ 
\n\n(1) \n\nwhich is equivalent to our condition, provided we choose to identify $P(S|D)$ with $\\delta(S - G(D, \\gamma))$. \n\nTo make this more precise we must define a measure of similarity between the two distributions $P_p(S)$ and $P_{DD}(S : \\gamma)$. An attractive measure is the Kullback-Leibler distance (the entropy of $P_{DD}$ relative to $P_p$): \n\n$$KL(\\gamma) = \\int P_{DD}(S : \\gamma) \\log \\frac{P_{DD}(S : \\gamma)}{P_p(S)} [dS].$$ \n\n(2) \n\n[Figure 2 diagram: $D = F(\\Sigma, N)$, $P_p(\\Sigma)$, $S = G(D, \\gamma)$] \n\nFigure 2: The parameters $\\gamma$ are adjusted to minimize the Kullback-Leibler distance between the prior ($P_p$) distribution of the true signal ($\\Sigma$) and the derived distribution ($P_{DD}$) of the network output ($S$). \n\nThis measure can be divided into two parts: (i) $-\\int P_{DD}(S : \\gamma) \\log P_p(S) [dS]$ and (ii) $\\int P_{DD}(S : \\gamma) \\log P_{DD}(S : \\gamma) [dS]$. The second term encourages variability of the output while the first term forces similarity to the prior distribution. \n\nSuppose that $P_p(S)$ can be expressed as a Markov random field (i.e. the spatial distribution of $P_p(S)$ has a local neighbourhood structure, as is commonly assumed in Bayesian models of vision). Then, by the Hammersley-Clifford theorem, we can write $P_p(S) = e^{-\\beta E_p(S)}/Z$ where $E_p(S)$ is an energy function with local connections (for example, $E_p(S) = \\sum_i (S_i - S_{i+1})^2$), $\\beta$ is an inverse temperature and $Z$ is a normalization constant. \n\nThen the first term can be written (Yuille et al, 1992) as \n\n$$-\\int P_{DD}(S : \\gamma) \\log P_p(S) [dS] = \\beta \\langle E_p(G(D, \\gamma)) \\rangle_D + \\log Z.$$ \n\n(3) \n\nWe can ignore the $\\log Z$ term since it is a constant (independent of $\\gamma$). 
Minimizing the first term with respect to $\\gamma$ will therefore try to minimize the energy of the outputs averaged over the inputs, $\\langle E_p(G(D, \\gamma)) \\rangle_D$, which is highly desirable (since it has a close connection to the minimal energy principles in (Poggio et al, 1985, Clark and Yuille, 1990)). It is also important, however, to avoid the trivial solution $G(D, \\gamma) = 0$ as well as solutions for which $G(D, \\gamma)$ is very small for most inputs. Fortunately these solutions are discouraged by the second term, $\\int P_{DD}(S : \\gamma) \\log P_{DD}(S : \\gamma) [dS]$, which corresponds to the negative entropy of the derived distribution of the network output. Thus, its minimization with respect to $\\gamma$ is a maximum entropy principle which will encourage variability in the outputs $G(D, \\gamma)$ and hence prevent the trivial solutions. \n\n3 Reformulating for Implementation. \n\nOur theory requires us to minimize the Kullback-Leibler distance, equation 2, with respect to $\\gamma$. We now describe two ways in which this could be implemented using variants of stochastic learning. First observe that by substituting the form of the derived distribution into equation 2 and integrating out the $S$ variable we obtain: \n\n$$KL(\\gamma) = \\int P_D(D) \\log \\frac{P_{DD}(G(D, \\gamma) : \\gamma)}{P_p(G(D, \\gamma))} [dD].$$ \n\n(4) \n\nAssuming a representative sample $\\{D^\\mu : \\mu \\in \\Lambda\\}$ of inputs we can approximate $KL(\\gamma)$ by $\\sum_{\\mu \\in \\Lambda} \\log[P_{DD}(G(D^\\mu, \\gamma) : \\gamma) / P_p(G(D^\\mu, \\gamma))]$. We can now, in principle, perform stochastic learning using backpropagation: pick inputs $D^\\mu$ at random and update the weights, using $\\log[P_{DD}(G(D^\\mu, \\gamma) : \\gamma) / P_p(G(D^\\mu, \\gamma))]$ as the error function. \n\nTo do this, however, we need expressions for $P_{DD}(G(D^\\mu, \\gamma) : \\gamma)$ and its derivative with respect to $\\gamma$. 
If the function $G(D, \\gamma)$ can be restricted to being 1-1 (increasing the dimensionality of the output space if necessary) then we can obtain (Yuille et al, 1992) analytic expressions $P_{DD}(G(D, \\gamma) : \\gamma) = P_D(D)/|\\det(\\partial G/\\partial D)|$ and $\\partial \\log P_{DD}(G(D, \\gamma) : \\gamma)/\\partial \\gamma = -(\\partial G/\\partial D)^{-1}(\\partial^2 G/\\partial D \\partial \\gamma)$, where the superscript $-1$ denotes the matrix inverse. Alternatively we can perform additional sampling to estimate $P_{DD}(G(D, \\gamma) : \\gamma)$ and $\\partial \\log P_{DD}(G(D, \\gamma) : \\gamma)/\\partial \\gamma$ directly from their integral representations. (This second approach is similar to (Becker and Hinton, 1992) though they are only concerned with estimating the first and second moments of these distributions.) \n\n4 Connection to Becker and Hinton. \n\nThe Becker and Hinton method (Becker and Hinton, 1992) involves maximizing the mutual information between the output of two neuronal units $S_1, S_2$ [fig. 1]. This is given by: \n\n$$I(S_1, S_2) = -\\langle \\log P(S_1) \\rangle - \\langle \\log P(S_2) \\rangle + \\langle \\log P(S_1, S_2) \\rangle,$$ \n\nwhere the first two terms correspond to maximizing the entropies of $S_1$ and $S_2$ while the last term forces $S_1 \\approx S_2$. \n\nBy contrast, our version tries to minimize the quantity: \n\n$$KL = \\langle \\log P_{DD}(S_1, S_2) \\rangle - \\langle \\log P_p(S_1, S_2) \\rangle.$$ \n\nIf we then ensure that $P_p(S_1, S_2) = \\delta(S_1 - S_2)$ our second term will force $S_1 \\approx S_2$ and our first term will maximize the entropy of the joint distribution of $S_1, S_2$. We argue that this is effectively the same as (Becker and Hinton, 1992) since maximizing the joint entropy of $S_1, S_2$ with $S_1$ constrained to equal $S_2$ is equivalent to maximizing the individual entropies of $S_1$ and $S_2$ with the same constraint. \n\nTo be more concrete, we consider Becker and Hinton's implementation of the mutual information maximization principle in the case of units with continuous outputs. 
They assume that the outputs of units 1, 2 are Gaussian (footnote 1) and perform steepest descent to maximize the symmetrized form of the mutual information between $S_1$ and $S_2$: \n\n$$\\log \\frac{V(S_1)}{V(S_1 - S_2)} + \\log \\frac{V(S_2)}{V(S_1 - S_2)},$$ \n\nwhere $V(\\cdot)$ stands for variance over the set of inputs. They assume that the difference between the two outputs can be expressed as uncorrelated additive noise, $S_1 = S_2 + N$. We reformulate their criterion as maximizing $E_{BH}(V(S_2), V(N))$ where \n\n$$E_{BH}(V(S_2), V(N)) = \\log\\{V(S_2) + V(N)\\} + \\log V(S_2) - 2 \\log V(N).$$ \n\n(6) \n\nFor our scheme we make similar assumptions about the distributions of $S_1$ and $S_2$. We see that $\\langle \\log P_{DD}(S_1, S_2) \\rangle = -\\log\\{\\langle S_1^2 \\rangle \\langle S_2^2 \\rangle - \\langle S_1 S_2 \\rangle^2\\} = -\\log\\{V(S_2) V(N)\\}$ (since $\\langle S_1 S_2 \\rangle = \\langle (S_2 + N) S_2 \\rangle = V(S_2)$ and $\\langle S_1^2 \\rangle = V(S_2) + V(N)$). Using the prior distribution $P_p(S_1, S_2) \\propto e^{-r(S_1 - S_2)^2}$ our criterion corresponds to minimizing $E_{YSX}(V(S_2), V(N))$ where: \n\n$$E_{YSX}(V(S_2), V(N)) = -\\log V(S_2) - \\log V(N) + r V(N).$$ \n\n(7) \n\nIt is easy to see that maximizing $E_{BH}(V(S_2), V(N))$ will try to make $V(S_2)$ as large as possible and force $V(N)$ to zero (recall that, by definition, $V(N) \\geq 0$). Minimizing our energy will try to make $V(S_2)$ as large as possible and will force $V(N)$ to $1/r$ (recall that $r$ appears as the inverse of the variance of a Gaussian prior distribution for $S_1 - S_2$, so making $r$ large will force the prior distribution to approach $\\delta(S_1 - S_2)$). Thus, provided $r$ is very large, our method will have the same effect as Becker and Hinton's. \n\n5 Application to Linear Filtering. \n\nWe now describe an analysis of these ideas for the case of linear filtering. Our approach will be contrasted with the traditional Wiener filter approach. \n\nFootnote 1: We assume for simplicity that these Gaussians have zero mean. 
\n\nConsider a process of the form $D(i) = \\Sigma(i) + N(i)$ where $D(i)$ denotes the input to the system, $\\Sigma(i)$ is the true signal which we would like to predict, and $N(i)$ is the noise corrupting the signal. The resulting Wiener filter $A_w(i)$ has Fourier transform $A_w = \\Phi_{\\Sigma,\\Sigma}/(\\Phi_{\\Sigma,\\Sigma} + \\Phi_{N,N})$ where $\\Phi_{\\Sigma,\\Sigma}$ and $\\Phi_{N,N}$ are the power spectra of the signal and the noise respectively. \n\nBy contrast, let us extract a linear filter $A_b$ by applying our criterion. In the case that the noise and signal are independent zero mean Gaussian distributions this filter can be calculated explicitly (Yuille et al, 1992). It has Fourier transform with squared magnitude given by $|A_b|^2 = \\Phi_{\\Sigma,\\Sigma}/(\\Phi_{\\Sigma,\\Sigma} + \\Phi_{N,N})$. Thus our filter can be thought of as the square root of the Wiener filter. \n\nIt is important to realize that although our derivation assumed additive Gaussian noise our system would not need to make any assumptions about the noise distribution. Instead our system would merely need to assume that the filter was linear and then would automatically obtain the \"correct\" result for the additive Gaussian noise case. We conjecture that the system might detect non-Gaussian noise by finding it impossible to get zero Kullback-Leibler distance with the linear ansatz. \n\n6 Conclusion \n\nThe goal of this paper was to introduce a Bayesian approach to self-organization using prior assumptions about the signal as an organizing principle. We argued that it was a natural generalization of the criterion of maximizing mutual information assuming spatial coherence (Becker and Hinton, 1992). 
Using our principle it should be possible to self-organize Bayesian theories of vision, assuming that the priors are known, the network is capable of representing the appropriate functions and the learning algorithm converges. There will also be problems if the probability distributions of the true signal and the distractor are too similar. \n\nIf the prior is not correct then it may be possible to detect this by evaluating the goodness of the Kullback-Leibler fit after learning (footnote 2). This suggests a strategy whereby the system increases the complexity of the priors until the Kullback-Leibler fit is sufficiently good (this is somewhat similar to an idea proposed by Mumford (Mumford, 1992)). This is related to the idea of competitive priors in vision (Clark and Yuille, 1990). One way to implement this would be for the prior probability itself to have a set of adjustable parameters that would enable it to adapt to different classes of scenes. We are currently (Yuille et al, 1992) investigating this idea and exploring its relationships to Hidden Markov Models. \n\nWays to implement the theory, using variants of stochastic learning, were described. We sketched the relation to Becker and Hinton. \n\nAs an illustration of our approach we derived the filter that our criterion would give for filtering out additive Gaussian noise (possibly the only analytically tractable case). This had a very interesting relation to the standard Wiener filter. \n\nFootnote 2: This is reminiscent of Barlow's suspicious coincidence detectors (Barlow, 1993), where we might hope to determine if two variables $x$ and $y$ are independent or not by calculating the Kullback-Leibler distance between the joint distribution $P(x, y)$ and the product of the individual distributions $P(x) P(y)$. 
\n\nAcknowledgements \n\nWe would like to thank DARPA for an Air Force contract F49620-92-J-0466. Conversations with Dan Kersten and David Mumford were highly appreciated. \n\nReferences \n\nJ.J. Atick and A.N. Redlich. \"Towards a Theory of Early Visual Processing\". Neural Computation. Vol. 2, No. 3, pp 308-320. Fall. 1990. \n\nH.B. Barlow. \"What is the Computational Goal of the Neocortex?\" To appear in: Large scale neuronal theories of the brain. Ed. C. Koch. MIT Press. 1993. \n\nS. Becker and G.E. Hinton. \"Self-organizing neural network that discovers surfaces in random-dot stereograms\". Nature, Vol 355. pp 161-163. Jan. 1992. \n\nJ.J. Clark and A.L. Yuille. Data Fusion for Sensory Information Processing Systems. Kluwer Academic Press. Boston/Dordrecht/London. 1990. \n\nR. Held. \"Visual development in infants\". In The encyclopedia of neuroscience, vol. 2. Boston: Birkhauser. 1987. \n\nK. Hornik, S. Stinchcombe and H. White. \"Multilayer feed-forward networks are universal approximators\". Neural Networks 4, pp 251-257. 1991. \n\nD. Kersten, A.J. O'Toole, M.E. Sereno, D.C. Knill and J.A. Anderson. \"Associative learning of scene parameters from images\". Optical Society of America, Vol. 26, No. 23, pp 4999-5006. 1 December, 1987. \n\nD. Marr. Vision. W.H. Freeman and Company. San Francisco. 1982. \n\nD. Mumford. \"Pattern Theory: a unifying perspective\". Dept. Mathematics Preprint. Harvard University. 1992. \n\nK. Nakayama and S. Shimojo. \"Experiencing and Perceiving Visual Surfaces\". Science. Vol. 257, pp 1357-1363. 4 September. 1992. \n\nT. Poggio, V. Torre and C. Koch. \"Computational vision and regularization theory\". Nature, 317, pp 314-319. 1985. \n\nA.L. Yuille, S.M. Smirnakis and L. Xu. 
\"Bayesian Self-Organization\". Harvard Robotics Laboratory Technical Report. 1992. \n", "award": [], "sourceid": 809, "authors": [{"given_name": "Alan", "family_name": "Yuille", "institution": null}, {"given_name": "Stelios", "family_name": "Smirnakis", "institution": null}, {"given_name": "Lei", "family_name": "Xu", "institution": null}]}