{"title": "Boltzmann Machine Learning Using Mean Field Theory and Linear Response Correction", "book": "Advances in Neural Information Processing Systems", "page_first": 280, "page_last": 286, "abstract": "", "full_text": "Boltzmann Machine learning using mean field theory and linear response correction\n\nH.J. Kappen\n\nDepartment of Biophysics\n\nUniversity of Nijmegen, Geert Grooteplein 21\n\nNL 6525 EZ Nijmegen, The Netherlands\n\nF. B. Rodriguez\n\nInstituto de Ingenieria del Conocimiento & Departamento de Ingenieria Informatica\n\nUniversidad Aut\u00f3noma de Madrid, Canto Blanco, 28049 Madrid, Spain\n\nAbstract\n\nWe present a new approximate learning algorithm for Boltzmann Machines, using a systematic expansion of the Gibbs free energy to second order in the weights. The linear response correction to the correlations is given by the Hessian of the Gibbs free energy. The computational complexity of the algorithm is cubic in the number of neurons. We compare the performance of the exact BM learning algorithm with first order (Weiss) mean field theory and second order (TAP) mean field theory. The learning task consists of a fully connected Ising spin glass model on 10 neurons. We conclude that 1) the method works well for paramagnetic problems, 2) the TAP correction gives a significant improvement over the Weiss mean field theory, both for paramagnetic and spin glass problems, and 3) the inclusion of diagonal weights improves the Weiss approximation for paramagnetic problems, but not for spin glass problems.\n\n1 Introduction\n\nBoltzmann Machines (BMs) [1] are networks of binary neurons with a stochastic neuron dynamics, known as Glauber dynamics.  
Assuming symmetric connections between neurons, the probability distribution over neuron states $s$ will become stationary and is given by the Boltzmann-Gibbs distribution $P(s)$. The Boltzmann distribution is a known function of the weights and thresholds of the network. However, computation of $P(s)$ or any statistics involving $P(s)$, such as mean firing rates or correlations, requires exponential time in the number of neurons. This is due to the fact that $P(s)$ contains a normalization term $Z$, which involves a sum over all states in the network, of which there are exponentially many. This problem is particularly important for BM learning.\n\nUsing statistical sampling techniques [2], learning can be significantly improved [1]. However, the method has rather poor convergence and can only be applied to small networks.\n\nIn [3, 4], an acceleration method for learning in BMs is proposed using mean field theory by replacing $\\langle s_i s_j \\rangle$ by $m_i m_j$ in the learning rule. It can be shown [5] that such a naive mean field approximation of the learning rules does not converge in general. Furthermore, we argue that the correlations can be computed using the linear response theorem [6].\n\nIn [7, 5] the mean field approximation is derived by making use of the properties of convex functions (Jensen's inequality and tangential bounds). In this paper we present an alternative derivation which uses a Legendre transformation and a small coupling expansion [8]. It has the advantage that higher order contributions (TAP and higher) can be computed in a systematic manner and that it may be applicable to arbitrary graphical models.\n\n2 Boltzmann Machine learning\n\nThe Boltzmann Machine is defined as follows.  
The possible configurations of the network can be characterized by a vector $s = (s_1, \\ldots, s_i, \\ldots, s_n)$, where $s_i = \\pm 1$ is the state of neuron $i$, and $n$ is the total number of neurons. Neurons are updated using Glauber dynamics.\n\nLet us define the energy of a configuration $s$ as\n\n$$-E(s) = \\frac{1}{2} \\sum_{i,j} w_{ij} s_i s_j + \\sum_i s_i \\theta_i.$$\n\nAfter long times, the probability to find the network in a state $s$ becomes independent of time (thermal equilibrium) and is given by the Boltzmann distribution\n\n$$p(s) = \\frac{1}{Z} \\exp\\{-E(s)\\}. \\qquad (1)$$\n\n$Z = \\sum_s \\exp\\{-E(s)\\}$ is the partition function which normalizes the probability distribution.\n\nLearning [1] consists of adjusting the weights and thresholds in such a way that the Boltzmann distribution approximates a target distribution $q(s)$ as closely as possible.\n\nA suitable measure of the difference between the distributions $p(s)$ and $q(s)$ is the Kullback divergence [9]\n\n$$K = \\sum_s q(s) \\log \\frac{q(s)}{p(s)}. \\qquad (2)$$\n\nLearning consists of minimizing $K$ using gradient descent [1]\n\n$$\\Delta w_{ij} = \\eta \\left( \\langle s_i s_j \\rangle_c - \\langle s_i s_j \\rangle \\right), \\qquad \\Delta \\theta_i = \\eta \\left( \\langle s_i \\rangle_c - \\langle s_i \\rangle \\right).$$\n\nThe parameter $\\eta$ is the learning rate. The brackets $\\langle \\cdot \\rangle$ and $\\langle \\cdot \\rangle_c$ denote the 'free' and 'clamped' expectation values, respectively.\n\nThe computation of both the free and the clamped expectation values is intractable, because it consists of a sum over all unclamped states. As a result, the BM learning algorithm cannot be applied to practical problems.\n\n3 The mean field approximation\n\nWe derive the mean field free energy using the small $\\gamma$ expansion as introduced by Plefka [8]. The energy of the network is given by\n\n$$E(s, w, \\theta, \\gamma) = \\gamma E_{\\mathrm{int}} - \\sum_i \\theta_i s_i, \\qquad E_{\\mathrm{int}} = -\\frac{1}{2} \\sum_{ij} w_{ij} s_i s_j$$\n\nfor $\\gamma = 1$.  
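The free energy is given by\n\n$$F(w, \\theta, \\gamma) = -\\log \\mathrm{Tr}_s \\, e^{-E(s, w, \\theta, \\gamma)}$$\n\nand is a function of the independent variables $w_{ij}$, $\\theta_i$ and $\\gamma$. We perform a Legendre transformation on the variables $\\theta_i$ by introducing $m_i = -\\frac{\\partial F}{\\partial \\theta_i}$. The Gibbs free energy\n\n$$G(w, m, \\gamma) = F(w, \\theta, \\gamma) + \\sum_i \\theta_i m_i$$\n\nis now a function of the independent variables $m_i$ and $w_{ij}$, and $\\theta_i$ is implicitly given by $\\langle s_i \\rangle_\\gamma = m_i$. The expectation $\\langle \\cdot \\rangle_\\gamma$ is with respect to the full model with interaction $\\gamma$.\n\nWe expand\n\n$$G(\\gamma) = G(0) + \\gamma G'(0) + \\frac{1}{2} \\gamma^2 G''(0) + O(\\gamma^3).$$\n\nWe directly obtain from [8]\n\n$$G'(\\gamma) = \\langle E_{\\mathrm{int}} \\rangle_\\gamma,$$\n\n$$G''(\\gamma) = \\langle E_{\\mathrm{int}} \\rangle_\\gamma^2 - \\langle E_{\\mathrm{int}}^2 \\rangle_\\gamma + \\left\\langle E_{\\mathrm{int}} \\sum_i \\frac{\\partial \\theta_i}{\\partial \\gamma} (s_i - m_i) \\right\\rangle_\\gamma.$$\n\nFor $\\gamma = 0$ the expectation values $\\langle \\cdot \\rangle_\\gamma$ become the mean field expectations which we can directly compute:\n\n$$G(0) = \\frac{1}{2} \\sum_i \\left( (1 + m_i) \\log \\frac{1}{2}(1 + m_i) + (1 - m_i) \\log \\frac{1}{2}(1 - m_i) \\right),$$\n\n$$G'(0) = -\\frac{1}{2} \\sum_{ij} w_{ij} m_i m_j,$$\n\n$$G''(0) = -\\frac{1}{2} \\sum_{ij} w_{ij}^2 (1 - m_i^2)(1 - m_j^2).$$\n\nThus\n\n$$G(1) = \\frac{1}{2} \\sum_i \\left( (1 + m_i) \\log \\frac{1}{2}(1 + m_i) + (1 - m_i) \\log \\frac{1}{2}(1 - m_i) \\right) - \\frac{1}{2} \\sum_{ij} w_{ij} m_i m_j - \\frac{1}{4} \\sum_{ij} w_{ij}^2 (1 - m_i^2)(1 - m_j^2) + O(w^3 f(m)), \\qquad (3)$$\n\nwhere $f(m)$ is some unknown function of $m$.\n\nThe mean field equations are given by the inverse Legendre transformation\n\n$$\\theta_i = \\frac{\\partial G}{\\partial m_i} = \\tanh^{-1}(m_i) - \\sum_j w_{ij} m_j + \\sum_j w_{ij}^2 m_i (1 - m_j^2), \\qquad (4)$$\n\nwhich we recognize as the mean field equations.\n\nThe correlations are given by\n\n$$\\langle s_i s_j \\rangle - \\langle s_i \\rangle \\langle s_j \\rangle = -\\frac{\\partial^2 F}{\\partial \\theta_i \\partial \\theta_j} = \\frac{\\partial m_i}{\\partial \\theta_j} = \\left( \\frac{\\partial \\theta}{\\partial m} \\right)^{-1}_{ij} = \\left( \\frac{\\partial^2 G}{\\partial m^2} \\right)^{-1}_{ij}.$$\n\nWe therefore obtain from Eq. 3\n\n$$\\langle s_i s_j \\rangle - \\langle s_i \\rangle \\langle s_j \\rangle = A_{ij}$$\n\nwith\n\n$$(A^{-1})_{ij} = \\delta_{ij} \\left( \\frac{1}{1 - m_i^2} + \\sum_k w_{ik}^2 (1 - m_k^2) \\right) - w_{ij} - 2 m_i m_j w_{ij}^2. \\qquad (5)$$\n\nThus, for given $w_{ij}$ and $\\theta_i$, we obtain the approximate mean firing rates $m_i$ by solving Eqs. 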
4 and the correlations by their linear response approximations Eqs. 5. The inclusion of hidden units is straightforward. One applies the above approximations in the free and the clamped phase separately [5]. The complexity of the method is $O(n^3)$, due to the matrix inversion.\n\n4 Learning without hidden units\n\nWe will assess the accuracy of the above method for networks without hidden units. Let us define $C_{ij} = \\langle s_i s_j \\rangle_c - \\langle s_i \\rangle_c \\langle s_j \\rangle_c$, which can be directly computed from the data. The fixed point equation for $\\Delta \\theta_i$ gives\n\n$$\\Delta \\theta_i = 0 \\Leftrightarrow m_i = \\langle s_i \\rangle_c. \\qquad (6)$$\n\nThe fixed point equation for $\\Delta w_{ij}$, using Eq. 6, gives\n\n$$\\Delta w_{ij} = 0 \\Leftrightarrow A_{ij} = C_{ij}, \\quad i \\neq j. \\qquad (7)$$\n\nFrom Eq. 7 and Eq. 5 we can solve for $w_{ij}$, using a standard least squares method. In our case, we used fsolve from Matlab. Subsequently, we obtain $\\theta_i$ from Eq. 4. We refer to this method as the TAP approximation.\n\nIn order to assess the effect of the TAP term, we also computed the weights and thresholds in the same way as described above, but without the terms of order $w^2$ in Eqs. 5 and 4. Since this is the standard Weiss mean field expression, we refer to this method as the Weiss approximation.\n\nThe fixed point equations are only imposed for the off-diagonal elements of $\\Delta w_{ij}$ because the Boltzmann distribution Eq. 1 does not depend on the diagonal elements $w_{ii}$. In [5], we explored a variant of the Weiss approximation, where we included diagonal weight terms. As is discussed there, if we were to impose Eq. 7 for $i = j$ as well, we have $A = C$. If $C$ is invertible, we therefore have $A^{-1} = C^{-1}$. However, we now have more constraints than variables. Therefore, we introduce diagonal weights $w_{ii}$ by adding the term $w_{ii} m_i$ to the right-hand side of Eq. 4 in the Weiss approximation.  
Thus,\n\n$$w_{ij} = \\frac{\\delta_{ij}}{1 - m_i^2} - (C^{-1})_{ij}$$\n\nand $\\theta_i$ is given by Eq. 4 in the Weiss approximation. Clearly, this method is computationally simpler because it gives an explicit expression for the solution of the weights involving only one matrix inversion.\n\n5 Numerical results\n\nFor the target distribution $q(s)$ in Eq. 2 we chose a fully connected Ising spin glass model with equilibrium distribution\n\n$$q(s) = \\frac{1}{Z} \\exp\\left\\{ \\frac{1}{2} \\sum_{ij} J_{ij} s_i s_j \\right\\}$$\n\nwith $J_{ij}$ i.i.d. Gaussian variables with mean $\\frac{J_0}{n-1}$ and variance $\\frac{J^2}{n-1}$. This model is known as the Sherrington-Kirkpatrick (SK) model [10]. Depending on the values of $J$ and $J_0$, the model displays a paramagnetic (unordered), a ferromagnetic (ordered) and a spin-glass (frustrated) phase. For $J_0 = 0$, the paramagnetic (spin-glass) phase is obtained for $J < 1$ ($J > 1$). We will assess the effectiveness of our approximations for finite $n$, for $J_0 = 0$ and for various values of $J$. Since this is a realizable task, the optimal KL divergence is zero, which is indeed observed in our simulations.\n\nWe measure the quality of the solutions by means of the Kullback divergence. Therefore, this comparison is only feasible for small networks. The reason is that the computation of the Kullback divergence requires the computation of the Boltzmann distribution, Eq. 1, which requires exponential time due to the partition function $Z$.\n\nWe present results for a network of $n = 10$ neurons. For $J_0 = 0$, we generated 10 random weight matrices $J_{ij}$ for each value of $J$ in the range $0.1 < J < 3$. For each weight matrix, we computed $q(s)$ on all $2^n$ states. For each of the 10 problems, we applied the TAP method, the Weiss method and the Weiss method with diagonal weights.  
In addition, we applied the exact Boltzmann Machine learning algorithm using conjugate gradient descent and verified that it gives KL divergence equal to zero, as it should. We also applied a factorized model $p(s) = \\prod_i \\frac{1}{2}(1 + m_i s_i)$ with $m_i = \\langle s_i \\rangle_c$ to assess the importance of correlations in the target distribution. In Fig. 1a, we show the average KL divergence over the 10 problem instances as a function of $J$ for the TAP method, the Weiss method, the Weiss method with diagonal weights and the factorized model. We observe that the TAP method gives the best results, but that its performance deteriorates in the spin-glass phase ($J > 1$).\n\nThe behaviour of all approximate methods is highly dependent on the individual problem instance. In Fig. 1b, we show the mean value of the KL divergence of the TAP solution, together with the minimum and maximum values obtained on the 10 problem instances.\n\nDespite these large fluctuations, the quality of the TAP solution is consistently better than the Weiss solution. In Fig. 1c, we plot the difference between the TAP and Weiss solution, averaged over the 10 problem instances.\n\nIn [5] we concluded that the Weiss solution with diagonal weights is better than the standard Weiss solution when learning a finite number of randomly generated patterns. In Fig. 1d we plot the difference between the Weiss solution with and without diagonal weights. We observe again that the inclusion of diagonal weights leads to better results in the paramagnetic phase ($J < 1$), but leads to worse results in the spin-glass phase. For $J > 2$, we encountered problem instances for which either the matrix $C$ is not invertible or the KL divergence is infinite. This problem becomes more and more severe for increasing $J$.  
We therefore have not presented results for the Weiss approximation with diagonal weights for $J > 2$.\n\n[Figure 1: four panels, each with $J$ on the horizontal axis. a) Comparison mean values (curves: fact, weiss+d, weiss, tap). b) TAP (curves: mean, min, max; vertical axis: KL divergence). c) Difference WEISS and TAP. d) Difference WEISS+D and WEISS.]\n\nFigure 1: Mean field learning of paramagnetic ($J < 1$) and spin glass ($J > 1$) problems for a network of 10 neurons. a) Comparison of mean KL divergences for the factorized model (fact), the Weiss mean field approximation with and without diagonal weights (weiss+d and weiss), and the TAP approximation, as a function of $J$. The exact method yields zero KL divergence for all $J$. b) The mean, minimum and maximum KL divergence of the TAP approximation for the 10 problem instances, as a function of $J$. c) The mean difference between the KL divergence for the Weiss approximation and the TAP approximation, as a function of $J$. d) The mean difference between the KL divergence for the Weiss approximation with and without diagonal weights, as a function of $J$.\n\n6 Discussion\n\nWe have presented a derivation of mean field theory and the linear response correction based on a small coupling expansion of the Gibbs free energy.  
This expansion can in principle be computed to arbitrary order. However, one should expect that the resulting mean field and linear response equations will become more and more difficult to solve numerically. The small coupling expansion should be applicable to other network models such as the sigmoid belief network, Potts networks and higher order Boltzmann Machines.\n\nThe numerical results show that the method is applicable to paramagnetic problems. This is intuitively clear, since paramagnetic problems have a unimodal probability distribution, which can be approximated by a mean and correlations around the mean. The method performs worse for spin glass problems. However, it still gives a useful approximation of the correlations when compared to the factorized model which ignores all correlations. In this regime, the TAP approximation improves significantly on the Weiss approximation. One may therefore hope that higher order approximations may further improve the method for spin glass problems. Therefore, we cannot conclude at this point whether mean field methods are restricted to unimodal distributions. In order to further investigate this issue, one should also study the ferromagnetic case ($J_0 > 1$, $J > 1$), which is multimodal as well but less challenging than the spin glass case.\n\nIt is interesting to note that the performance of the exact method is absolutely insensitive to the value of $J$. Naively, one might have thought that for highly multi-modal target distributions, any gradient based learning method will suffer from local minima.  
Apparently, this is not the case: the exact KL divergence has just one minimum, but the mean field approximations of the gradients may have multiple solutions.\n\nAcknowledgement\n\nThis research is supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Ministry of Economic Affairs.\n\nReferences\n\n[1] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann Machines. Cognitive Science, 9:147-169, 1985.\n\n[2] C. Itzykson and J-M. Drouffe. Statistical Field Theory. Cambridge monographs on mathematical physics. Cambridge University Press, Cambridge, UK, 1989.\n\n[3] C. Peterson and J.R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987.\n\n[4] G.E. Hinton. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1:143-150, 1989.\n\n[5] H.J. Kappen and F.B. Rodriguez. Efficient learning in Boltzmann Machines using linear response theory. Neural Computation, 1997. In press.\n\n[6] G. Parisi. Statistical Field Theory. Frontiers in Physics. Addison-Wesley, 1988.\n\n[7] L.K. Saul, T. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.\n\n[8] T. Plefka. Convergence condition of the TAP equation for the infinite-range Ising spin glass model. Journal of Physics A, 15:1971-1978, 1982.\n\n[9] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.\n\n[10] D. Sherrington and S. Kirkpatrick. Solvable model of Spin-Glass. Physical Review Letters, 35:1792-1796, 1975. 
\n\n\f", "award": [], "sourceid": 1412, "authors": [{"given_name": "Hilbert", "family_name": "Kappen", "institution": null}, {"given_name": "Francisco", "family_name": "de Borja Rodr\u00edguez Ortiz", "institution": null}]}