{"title": "A Mean Field Algorithm for Bayes Learning in Large Feed-forward Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 231, "abstract": null, "full_text": "A mean field algorithm for Bayes learning in large feed-forward neural networks\n\nManfred Opper\nInstitut für Theoretische Physik\nJulius-Maximilians-Universität, Am Hubland\nD-97074 Würzburg, Germany\nopper@physik.Uni-Wuerzburg.de\n\nOle Winther\nCONNECT\nThe Niels Bohr Institute\nBlegdamsvej 17\n2100 Copenhagen, Denmark\nwinther@connect.nbi.dk\n\nAbstract\n\nWe present an algorithm which is expected to realise Bayes optimal predictions in large feed-forward networks. It is based on mean field methods developed within statistical mechanics of disordered systems. We give a derivation for the single layer perceptron and show that the algorithm also provides a leave-one-out cross-validation test of the predictions. Simulations show excellent agreement with theoretical results of statistical mechanics.\n\n1 INTRODUCTION\n\nBayes methods have become popular as a consistent framework for regularization and model selection in the field of neural networks (see e.g. [MacKay, 1992]). In the Bayes approach to statistical inference [Berger, 1985] one assumes that the prior uncertainty about parameters of an unknown data generating mechanism can be encoded in a probability distribution, the so-called prior. Using the prior and the likelihood of the data given the parameters, the posterior distribution of the parameters can be derived from Bayes rule. From this posterior, various estimates for functions of the parameters, like predictions about unseen data, can be calculated. 
However, in general, those predictions cannot be realised by specific parameter values, but only by an ensemble average over parameters according to the posterior probability.\n\nHence, exact implementations of the Bayes method for neural networks require averages over network parameters, which in general can be performed by time-consuming Monte Carlo procedures. There are, however, useful approximate approaches for calculating posterior averages which are based on the assumption of a Gaussian form of the posterior distribution [MacKay, 1992]. Under regularity conditions on the likelihood, this approximation becomes asymptotically exact when the number of data is large compared to the number of parameters. This Gaussian ansatz for the posterior may not be justified when the number of examples is small or comparable to the number of network weights. A second cause for its failure would be a situation where discrete classification labels are produced from a probability distribution which is a nonsmooth function of the parameters. This would include the case of a network with threshold units learning a noise-free binary classification problem.\n\nIn this contribution we present an alternative approximate realization of the Bayes method for neural networks, which is not based on asymptotic posterior normality. The posterior averages are performed using mean field techniques known from the statistical mechanics of disordered systems. Those are expected to become exact in the limit of a large number of network parameters under additional assumptions on the statistics of the input data. Our analysis follows the approach of [Thouless, Anderson & Palmer, 1977] (TAP) as adapted to the simple perceptron by [Mezard, 1989]. 
\n\nThe basic setup of the Bayes method is as follows: We have a training set consisting of $m$ input-output pairs $D_m = \\{(s^\\mu, \\sigma^\\mu),\\ \\mu = 1, \\ldots, m\\}$, where the outputs are generated independently from a conditional probability distribution $P(\\sigma^\\mu | w, s^\\mu)$. This probability is assumed to describe the output $\\sigma^\\mu$ to an input $s^\\mu$ of a neural network with weights $w$ subject to a suitable noise process. If we assume that the unknown parameters $w$ are randomly distributed with a prior distribution $p(w)$, then according to Bayes theorem our knowledge about $w$ after seeing $m$ examples is expressed through the posterior distribution\n\n$$p(w | D_m) = Z^{-1} p(w) \\prod_{\\mu=1}^{m} P(\\sigma^\\mu | w, s^\\mu) \\qquad (1)$$\n\nwhere $Z = \\int dw\\, p(w) \\prod_{\\mu=1}^{m} P(\\sigma^\\mu | w, s^\\mu)$ is called the partition function in statistical mechanics and the evidence in Bayesian terminology. Taking the average with respect to the posterior eq. (1), which in the following will be denoted by angle brackets, gives Bayes estimates for various quantities. For example, the optimal predictive probability for an output $\\sigma$ to a new input $s$ is given by $P^{\\mathrm{Bayes}}(\\sigma | s) = \\langle P(\\sigma | w, s) \\rangle$.\n\nIn section 2 exact equations for the posterior averaged weights $\\langle w \\rangle$ are derived for arbitrary networks. In section 3 we specialize these equations to a perceptron and develop a mean field ansatz in section 4. The resulting system of mean field equations is presented in section 5. In section 6 we consider Bayes optimal predictions and a leave-one-out estimator for the generalization error. We conclude in section 7 with a discussion of our results.\n\n2 A RESULT FOR POSTERIOR AVERAGES FROM GAUSSIAN PRIORS\n\nIn this section we will derive an interesting equation for the posterior mean of the weights for arbitrary networks when the prior is Gaussian. 
This average of the weights can be calculated for the distribution (1) by using the following simple and well-known result for averages over Gaussian distributions.\n\nLet $v$ be a Gaussian random variable with zero mean. Then for any function $f(v)$, we have\n\n$$\\langle v f(v) \\rangle_G = \\langle v^2 \\rangle_G \\cdot \\left\\langle \\frac{df(v)}{dv} \\right\\rangle_G . \\qquad (2)$$\n\nHere $\\langle \\ldots \\rangle_G$ denotes the average over the Gaussian distribution of $v$. The relation is easily proved from an integration by parts.\n\nIn the following we will specialize to an isotropic Gaussian prior $p(w) = (2\\pi)^{-N/2} e^{-\\frac{1}{2} w \\cdot w}$. In [Opper & Winther, 1996] anisotropic priors are treated as well. Applying (2) to each component of $w$ and the function $\\prod_{\\mu=1}^{m} P(\\sigma^\\mu | w, s^\\mu)$, we get the following equations\n\n$$\\langle w \\rangle = Z^{-1} \\int dw\\, w\\, p(w) \\prod_{\\nu=1}^{m} P(\\sigma^\\nu | w, s^\\nu) = Z^{-1} \\sum_{\\mu=1}^{m} \\int dw\\, p(w) \\prod_{\\nu \\neq \\mu} P(\\sigma^\\nu | w, s^\\nu)\\, \\nabla_w P(\\sigma^\\mu | w, s^\\mu) = \\sum_{\\mu=1}^{m} \\frac{\\langle \\nabla_w P(\\sigma^\\mu | w, s^\\mu) \\rangle_\\mu}{\\langle P(\\sigma^\\mu | w, s^\\mu) \\rangle_\\mu} . \\qquad (3)$$\n\nHere $\\langle \\ldots \\rangle_\\mu = \\frac{\\int dw\\, p(w) \\ldots \\prod_{\\nu \\neq \\mu} P(\\sigma^\\nu | w, s^\\nu)}{\\int dw\\, p(w) \\prod_{\\nu \\neq \\mu} P(\\sigma^\\nu | w, s^\\nu)}$ is a reduced average over a posterior where the $\\mu$-th example is kept out of the training set, and $\\nabla_w$ denotes the gradient with respect to $w$.\n\n3 THE PERCEPTRON\n\nIn the following, we will utilize the fact that for neural networks, the probability (1) depends only on the so-called internal fields $\\Delta = \\frac{1}{\\sqrt{N}} w \\cdot s$.\n\nA simple but nontrivial example is the perceptron with $N$-dimensional input vector $s$ and output $\\sigma(w, s) = \\mathrm{sign}(\\Delta)$. We will generalize the noise-free model by considering label noise in which the output is flipped, i.e. $\\sigma \\Delta < 0$, with a probability $(1 + e^{\\beta})^{-1}$. (For simplicity, we will assume that $\\beta$ is known such that no prior on $\\beta$ is needed.) The conditional probability may thus be written as\n\n$$P(\\sigma^\\mu \\Delta^\\mu) \\equiv P(\\sigma^\\mu | w, s^\\mu) = \\frac{e^{-\\beta \\Theta(-\\sigma^\\mu \\Delta^\\mu)}}{1 + e^{-\\beta}} , \\qquad (4)$$\n\nwhere $\\Theta(x) = 1$ for $x > 0$ and 0 otherwise. 
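
The label-noise likelihood of eq. (4) can be checked numerically. The following is a minimal sketch and not code from the paper: the function names theta and label_likelihood are hypothetical, and sigma, delta and beta stand for the output label, the internal field and the noise parameter.

```python
import math


def theta(x):
    # Step function of eq. (4): 1 for x > 0 and 0 otherwise.
    return 1.0 if x > 0 else 0.0


def label_likelihood(sigma, delta, beta):
    # exp(-beta * theta(-sigma * delta)) / (1 + exp(-beta)), as in eq. (4).
    return math.exp(-beta * theta(-sigma * delta)) / (1.0 + math.exp(-beta))


beta = 0.5
p_agree = label_likelihood(+1, 0.3, beta)  # label agrees with the sign of the field
p_flip = label_likelihood(-1, 0.3, beta)   # label is flipped relative to the field
print(p_agree, p_flip, p_agree + p_flip)
```

Under these assumptions the two probabilities sum to one, and the flip probability equals 1 / (1 + exp(beta)), in agreement with the text preceding eq. (4).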
Obviously, this is a nonsmooth function of the weights $w$, for which the posterior will not become Gaussian asymptotically.\n\nFor this case (3) reads\n\n$$\\langle w \\rangle = \\frac{1}{\\sqrt{N}} \\sum_{\\mu=1}^{m} \\frac{\\langle P'(\\sigma^\\mu \\Delta^\\mu) \\rangle_\\mu}{\\langle P(\\sigma^\\mu \\Delta^\\mu) \\rangle_\\mu}\\, \\sigma^\\mu s^\\mu = \\frac{1}{\\sqrt{N}} \\sum_{\\mu=1}^{m} \\frac{\\int d\\Delta\\, f_\\mu(\\Delta)\\, P'(\\sigma^\\mu \\Delta)}{\\int d\\Delta\\, f_\\mu(\\Delta)\\, P(\\sigma^\\mu \\Delta)}\\, \\sigma^\\mu s^\\mu . \\qquad (5)$$\n\n$f_\\mu(\\Delta)$ is the density of $\\frac{1}{\\sqrt{N}} w \\cdot s^\\mu$ when the weights $w$ are randomly drawn from a posterior where example $(s^\\mu, \\sigma^\\mu)$ was kept out of the training set. This result states that the weights are linear combinations of the input vectors. It gives an example of the ability of the Bayes method to regularize a network model: the effective number of parameters will never exceed the number of data points.\n\n4 MEAN FIELD APPROXIMATION\n\nSo far, no approximations have been made to obtain eqs. (3, 5). In general $f_\\mu(\\Delta)$ depends on the entire set of data $D_m$ and cannot be calculated easily. Hence, we look for a useful approximation to these densities.\n\nWe split the internal field into its average and fluctuating parts, i.e. we set $\\Delta^\\mu = \\langle \\Delta^\\mu \\rangle_\\mu + v^\\mu$, with $v^\\mu = \\frac{1}{\\sqrt{N}} (w - \\langle w \\rangle_\\mu) \\cdot s^\\mu$. Our mean field approximation is based on the assumption of a central limit theorem for the fluctuating part of the internal field, $v^\\mu$, which enters in the reduced average of eq. (5). This means that we assume that the non-Gaussian fluctuations of $w_i$ around $\\langle w_i \\rangle_\\mu$, when multiplied by $s_i^\\mu$, will sum up to make $v^\\mu$ a Gaussian random variable. The important point here is that for the reduced average, the $w_i$ are not correlated with the $s_i^\\mu$! 
[Footnote 1: Note that the fluctuations of the internal field with respect to the full posterior mean (which depends on the input $s_i^\\mu$) are non-Gaussian, because the different terms in the sum become slightly correlated.]\n\nWe expect that this Gaussian approximation is reasonable when $N$, the number of network weights, is sufficiently large. Following ideas of [Mezard, Parisi & Virasoro, 1987] and [Mezard, 1989], who obtained mean field equations for a variety of disordered systems in statistical mechanics, one can argue that in many cases this assumption may be exactly fulfilled in the 'thermodynamic limit' $m, N \\to \\infty$ with $\\alpha = m/N$ fixed. According to this ansatz, we get\n\n$$f_\\mu(\\Delta) \\approx \\frac{1}{\\sqrt{2\\pi \\lambda^\\mu}} \\exp\\left( -\\frac{(\\Delta - \\langle \\Delta^\\mu \\rangle_\\mu)^2}{2 \\lambda^\\mu} \\right)$$\n\nin terms of the second moment of $v^\\mu$, $\\lambda^\\mu := \\frac{1}{N} \\sum_{i,j} s_i^\\mu s_j^\\mu \\left( \\langle w_i w_j \\rangle_\\mu - \\langle w_i \\rangle_\\mu \\langle w_j \\rangle_\\mu \\right)$. To evaluate (5) we need to calculate the mean $\\langle \\Delta^\\mu \\rangle_\\mu$ and the variance $\\lambda^\\mu$. The first problem is treated easily within the Gaussian approximation.\n\n(6)\n\nIn the third line (2) has been used again for the Gaussian random variable $v^\\mu$.\n\nSo far, the calculation of the variance $\\lambda^\\mu$ for general inputs is an open problem. However, we can make a further reasonable ansatz when the distribution of the inputs is known. The following approximation for $\\lambda^\\mu$ is expected to become exact in the thermodynamic limit if the inputs of the training set are drawn independently from a distribution where all components $s_i$ are uncorrelated and normalized, i.e. $\\overline{s_i} = 0$ and $\\overline{s_i s_j} = \\delta_{ij}$. The bars denote expectation over the distribution of inputs. For the generalisation to a correlated input distribution see [Opper & Winther, 1996]. Our basic mean field assumption is that the fluctuations of the $\\lambda^\\mu$ with the data set can be neglected so that we can replace them by their averages $\\overline{\\lambda^\\mu}$. 
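
The claim that the variance of the internal field barely fluctuates with the inputs can be illustrated with a toy computation. This is only a sketch under loudly stated assumptions: the posterior covariance C_ij = diag_i * delta_ij + (c / N) * u_i * u_j is invented for illustration (it is not derived from any perceptron posterior), the inputs are independent +-1 components, and the names lambda_mu, diag, u and c are hypothetical.

```python
import random


def lambda_mu(s, diag, u, c):
    # (1/N) * sum_ij s_i * s_j * C_ij for the toy covariance
    # C_ij = diag_i * delta_ij + (c / N) * u_i * u_j (an assumption, not from the paper).
    n = len(s)
    diagonal_part = sum(si * si * di for si, di in zip(s, diag))
    overlap = sum(si * ui for si, ui in zip(s, u))
    return (diagonal_part + (c / n) * overlap ** 2) / n


rng = random.Random(0)
n = 200
diag = [rng.uniform(0.2, 0.8) for _ in range(n)]
u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
c = 0.5

# Input average of lambda: only the diagonal of C survives, giving trace(C) / N.
trace_over_n = (sum(diag) + c * sum(ui * ui for ui in u) / n) / n

samples = []
for _ in range(2000):
    s = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # uncorrelated, normalized inputs
    samples.append(lambda_mu(s, diag, u, c))

mean_lambda = sum(samples) / len(samples)
spread = max(samples) - min(samples)
print(trace_over_n, mean_lambda, spread)
```

In this toy setting the sample mean of lambda agrees with trace(C) / N, and the spread across input draws is of order 1/N, which is the sense in which the fluctuations are neglected.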
Since the reduced posterior averages are not correlated with the data $s_i^\\mu$, we obtain $\\overline{\\lambda^\\mu} \\approx \\frac{1}{N} \\sum_i \\left( \\langle w_i^2 \\rangle_\\mu - \\langle w_i \\rangle_\\mu^2 \\right)$. Finally, we replace the reduced average by the expectation over the full posterior, neglecting terms of order $1/N$. Using $\\sum_i \\langle w_i^2 \\rangle = N$, which follows from our choice of the Gaussian prior, we get $\\overline{\\lambda^\\mu} \\approx \\lambda = 1 - \\frac{1}{N} \\sum_i \\langle w_i \\rangle^2$. This depends only on known quantities.\n\n5 MEAN FIELD EQUATIONS FOR THE PERCEPTRON\n\n(5) and (6) give a self-consistent set of equations for the variable $x^\\mu$ [definition garbled in the source]. We finally get\n\n(7)\n\nwith\n\n(8)\n\n(9)\n\nThese mean field equations can be solved by iteration. It is useful to start with a small number of data and then to increase the number of data in steps of 1 to 10. Numerical work shows that the algorithm works well even for small system sizes, $N \\approx 15$.\n\n6 BAYES PREDICTIONS AND LEAVE-ONE-OUT\n\nAfter solving the mean field equations we can make optimal Bayesian classifications for new data $s$ by choosing the output label with the largest predictive probability. In case of output noise this reduces to $\\sigma^{\\mathrm{Bayes}}(s) = \\mathrm{sign}\\, \\langle \\sigma(w, s) \\rangle$. Since the posterior distribution is independent of the new input vector, we can apply the Gaussian assumption again to the internal field $\\Delta$ and obtain $\\sigma^{\\mathrm{Bayes}}(s) = \\sigma(\\langle w \\rangle, s)$, i.e. for the simple perceptron the averaged weights implement the Bayesian prediction. This will not be the case for multi-layer neural networks.\n\nWe can also get an estimate for the generalization error which occurs on the prediction of new data. The generalization error for the Bayes prediction is defined by $\\epsilon^{\\mathrm{Bayes}} = \\langle \\Theta(-\\sigma(s) \\langle \\sigma(w, s) \\rangle) \\rangle_s$, where $\\sigma(s)$ is the true output and $\\langle \\ldots \\rangle_s$ denotes the average over the input distribution. To obtain the leave-one-out estimator of $\\epsilon$ one removes the $\\mu$-th example from the training set and trains the network using only the remaining $m - 1$ examples. The $\\mu$-th example is used for testing. Repeating this procedure for all $\\mu$, an unbiased estimate for the Bayes generalization error with $m - 1$ training data is obtained as the mean value $\\epsilon^{\\mathrm{Bayes}}_{\\mathrm{loo}} = \\frac{1}{m} \\sum_\\mu \\Theta(-\\sigma^\\mu \\langle \\sigma(w, s^\\mu) \\rangle_\\mu)$, which is exactly the type of reduced average which is calculated within our approach. Figure 1 shows a result of simulations of our algorithm when the inputs are uncorrelated and the outputs are generated from a teacher perceptron with fixed noise rate $\\beta$.\n\nFigure 1: Error vs. $\\alpha = m/N$ for the simple perceptron with output noise $\\beta = 0.5$ and $N = 50$ averaged over 200 runs. The full lines are the simulation results (upper curve shows prediction error and the lower curve shows training error). The dashed line is the theoretical result for $N \\to \\infty$ obtained from statistical mechanics [Opper & Haussler, 1991]. The dotted line with larger error bars is the moving control estimate.\n\n7 CONCLUSION\n\nIn this paper we have presented a mean field algorithm which is expected to implement a Bayesian optimal classification well in the limit of large networks. We have explained the method for the single layer perceptron. An extension to a simple multilayer network, the so-called committee machine with a tree architecture, is discussed in [Opper & Winther, 1996]. The algorithm is based on a Gaussian assumption for the distribution of the internal fields, which seems reasonable for large networks. 
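
The plausibility of the Gaussian assumption for large networks can be probed with a small Monte Carlo experiment. This is a sketch under stated assumptions, not the authors' simulation: the weight fluctuations are drawn independently and uniformly from [-1, 1] (a toy non-Gaussian choice), the input pattern is a fixed +-1 vector, and the function name field_kurtosis is hypothetical. The kurtosis of a Gaussian variable is 3.

```python
import random


def field_kurtosis(n, samples, rng):
    # Fixed +-1 input pattern; the fluctuations are drawn independently of it.
    s = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    m2 = 0.0
    m4 = 0.0
    for _ in range(samples):
        # Toy non-Gaussian fluctuations with zero mean, uniform on [-1, 1].
        v = sum(si * rng.uniform(-1.0, 1.0) for si in s) / n ** 0.5
        m2 += v * v
        m4 += v ** 4
    m2 /= samples
    m4 /= samples
    # Kurtosis: fourth moment divided by the squared second moment.
    return m4 / (m2 * m2)


rng = random.Random(1)
k_small = field_kurtosis(1, 20000, rng)   # single weight: clearly non-Gaussian
k_large = field_kurtosis(50, 20000, rng)  # N = 50: close to the Gaussian value 3
print(k_small, k_large)
```

For independent uniform fluctuations the exact kurtosis of the summed field is 3 - 1.2 / N, so the deviation from the Gaussian value shrinks as 1/N. The correlated fluctuations of a real posterior are not covered by this toy check.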
The main problem so far is the restriction to ideal situations such as a known distribution of inputs, which is not a realistic assumption for real world data. However, this assumption only entered in the calculation of the variance of the Gaussian field. More theoretical work is necessary to find an approximation to the variance which is valid in more general cases. A promising approach is a derivation of the mean field equations directly from an approximation to the free energy $-\\ln(Z)$. Besides a deeper understanding, this would also give us the possibility to use the method with the so-called evidence framework, where the partition function (evidence) can be used to estimate unknown (hyper-)parameters of the model class [Berger, 1985]. It will further be important to extend the algorithm to fully connected architectures. In that case it might be necessary to make further approximations in the mean field method.\n\nACKNOWLEDGMENTS\n\nThis research is supported by a Heisenberg fellowship of the Deutsche Forschungsgemeinschaft and by the Danish Research Councils for the Natural and Technical Sciences through the Danish Computational Neural Network Center (CONNECT).\n\nREFERENCES\n\nBerger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York.\n\nMacKay, D. J. (1992) A practical Bayesian framework for backpropagation networks, Neural Comp. 4, 448.\n\nMezard, M., Parisi, G. & Virasoro, M. A. (1987) Spin Glass Theory and Beyond, Lecture Notes in Physics, 9, World Scientific.\n\nMezard, M. (1989) The space of interactions in neural networks: Gardner's calculation with the cavity method, J. Phys. A 22, 2181.\n\nOpper, M. & Haussler, D. 
(1991) in IVth Annual Workshop on Computational Learning Theory (COLT91), Morgan Kaufmann.\n\nOpper, M. & Winther, O. (1996) A mean field approach to Bayes learning in feed-forward neural networks, Phys. Rev. Lett. 76, 1964.\n\nThouless, D. J., Anderson, P. W. & Palmer, R. G. (1977) Solution of 'Solvable model of a spin glass', Phil. Mag. 35, 593.", "award": [], "sourceid": 1268, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}