{"title": "Fast Parameter Estimation Using Green's Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 535, "page_last": 542, "abstract": null, "full_text": "Fast  Parameter  Estimation \n\nUsing  Green's  Functions \n\nK.  Y.  Michael Wong \nDepartment of Physics \nHong Kong University \n\nof Science and Technology \n\nClear Water Bay, Hong Kong \n\nphkywong@ust.hk \n\nFuIi  Li \n\nDepartment of Applied  Physics \n\nXian  Jiaotong University \n\nXian,  China 710049 \n\nflli @xjtu. edu. en \n\nAbstract \n\nWe  propose  a  method  for  the  fast  estimation  of hyperparameters \nin large networks, based on the linear response relation in the cav(cid:173)\nity  method,  and  an  empirical  measurement  of  the  Green's  func(cid:173)\ntion.  Simulation results  show  that it is  efficient  and precise,  when \ncompared with cross-validation and other techniques which require \nmatrix inversion. \n\n1 \n\nIntroduction \n\nIt is well known that correct choices of hyperparameters in classification and regres(cid:173)\nsion  tasks  can  optimize  the  complexity  of the  data model,  and  hence  achieve  the \nbest generalization [1].  In recent years various methods have been proposed to esti(cid:173)\nmate the optimal hyperparameters in different contexts, such as neural networks [2], \nsupport  vector machines  [3,  4,  5]  and  Gaussian processes  [5].  Most of these  meth(cid:173)\nods  are  inspired  by  the  technique  of cross-validation  or  its  variant,  leave-one-out \nvalidation.  While  the  leave-one-out  procedure  gives  an  almost  unbiased  estimate \nof the generalization error, it is  nevertheless  very tedious.  Many of the mentioned \nattempts  aimed  at approximating this  tedious  procedure without  really  having to \nsweat through it.  They often rely on theoretical bounds, inverses to large matrices, \nor iterative optimizations. 
\n\nIn  this  paper,  we  propose  a  new  approach to  hyperparameter  estimation  in  large \nsystems.  It is  known that large networks are mean-field systems, so  that when one \nexample  is  removed  by  the  leave-one-out  procedure,  the  background  adjustment \ncan  be  analyzed  by  a  self-consistent  perturbation  approach.  Similar  techniques \nhave  been applied to the neural network  [6],  Bayesian learning  [7]  and the support \nvector  machine  [5].  They usually  involve  a  macroscopic  number  of unknown  vari(cid:173)\nables,  whose solution is  obtained through the inversion of a  matrix of macroscopic \nsize,  or iteration.  Here we  take a further step to replace it by a direct measurement \nof  the  Green's  function  via  a  small  number  of learning  processes.  The  proposed \nprocedure  is  fast  since  it  does  not  require  repetitive  cross-validations,  matrix  in(cid:173)\nversions,  nor  iterative optimizations for  each set  of hyperparaemters.  We  will  also \npresent simulation results which  show that it is  an excellent  approximation. \n\n\fThe  proposed  technique  is  based  on  the  cavity  method,  which  was  adapted  from \ndisordered  systems  in  many-body  physics.  The  basis  of the  cavity  method  is  a \nself-consistent argument addressing the situation of removing an example from  the \nsystem.  The change on removing an example is  described by the Green's function, \nwhich  is  an  extremely  general  technique  used  in  a  wide  range  of  quantum  and \nclassical problems in many-body physics  [8].  This provides an excellent framework \nfor  the  leave-one-out  procedure.  In  this  paper,  we  consider  two  applications  of \nthe cavity method to hyperparameter estimation, namely the optimal weight decay \nand the  optimal  learning  time  in feedforward  networks.  
In the latter application, the cavity method provides, as far as we are aware, the only estimate of the hyperparameter beyond empirical stopping criteria and brute-force cross-validation.

2  Steady-State Hyperparameter Estimation

Consider a network with adjustable parameters w. An energy function E is defined with respect to a set of p examples with inputs and outputs respectively given by ξ^μ and y^μ, μ = 1, ..., p, where ξ^μ is an N-dimensional input vector with components ξ_j^μ, j = 1, ..., N, and N ≫ 1 is macroscopic. We will first focus on the dynamics of a single-layer feedforward network and generalize the results to multilayer networks later. In single-layer networks, E has the form

E = \sum_\mu \epsilon(x^\mu, y^\mu) + R(w).    (1)

Here ε(x^μ, y^μ) represents the error function with respect to example μ. It is expressed in terms of the activation x^μ ≡ w·ξ^μ. R(w) represents a regularization term which is introduced to limit the complexity of the network and hence enhance the generalization ability. Learning is achieved by the gradient descent dynamics

\frac{dw_j(t)}{dt} = -\frac{1}{N} \frac{\partial E}{\partial w_j}.    (2)

The time-dependent Green's function G_jk(t, s) is defined as the response of the weight w_j at time t due to a unit stimulus added at time s to the gradient term with respect to weight w_k, in the limit of a vanishing magnitude of the stimulus. Hence if we compare the evolution of w_j(t) with another system w̃_j(t) subject to a continuous perturbative stimulus δh_j(t), we would have

\frac{d\tilde w_j(t)}{dt} = -\frac{1}{N} \frac{\partial E}{\partial \tilde w_j} + \delta h_j(t),    (3)

and the linear response relation

\tilde w_j(t) = w_j(t) + \sum_k \int ds\, G_{jk}(t, s)\, \delta h_k(s).    (4)

Now we consider the evolution of the network w_j^{\mu}(t) in which example μ is omitted from the training set.
For a system learning a macroscopic number of examples, the changes induced by the omission of an example are perturbative, and we can assume that the system has a linear response. Compared with the original network w_j(t), the gradient of the error of example μ now plays the role of the stimulus in (3). Hence we have

w_j^{\mu}(t) = w_j(t) + \sum_k \int ds\, G_{jk}(t, s)\, \frac{1}{N}\, \xi_k^\mu\, \frac{\partial \epsilon(x^\mu(s), y^\mu)}{\partial x^\mu(s)}.    (5)

Multiplying both sides by ξ_j^μ and summing over j, we obtain

h^\mu(t) = x^\mu(t) + \int ds \left[ \frac{1}{N} \sum_{jk} \xi_j^\mu G_{jk}(t, s)\, \xi_k^\mu \right] \frac{\partial \epsilon(x^\mu(s), y^\mu)}{\partial x^\mu(s)}.    (6)

Here h^μ(t) ≡ w^{\mu}(t)·ξ^μ is called the cavity activation of example μ. When the dynamics has reached the steady state, we arrive at

h^\mu = x^\mu + \gamma\, \frac{\partial \epsilon(x^\mu, y^\mu)}{\partial x^\mu},    (7)

where γ = lim_{t→∞} ∫ ds [Σ_{jk} ξ_j^μ G_{jk}(t, s) ξ_k^μ]/N is the susceptibility.

At time t, the generalization error is defined as the error function averaged over the distribution of inputs ξ and their corresponding outputs y, i.e.,

\epsilon_g = \langle \epsilon(x, y) \rangle,    (8)

where x ≡ w·ξ is the network activation. The leave-one-out generalization error is an estimate of ε_g given in terms of the cavity activations h^μ by ε_g = Σ_μ ε(h^μ, y^μ)/p. Hence if we can estimate the Green's function, the cavity activation in (7) provides a convenient way to estimate the leave-one-out generalization error without really having to undergo the validation process.

While self-consistent equations for the Green's function have been derived using diagrammatic methods [9], their solutions cannot be computed except for the specific case of time-translationally invariant Green's functions, such as those in Adaline learning or linear regression. However, the linear response relation (4) provides a convenient way to measure the Green's function in the general case.
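As a toy illustration of how relations (4) and (7) combine, the following numpy sketch (our own setup, not the paper's experiments; the sizes, step size and weight decay are assumptions) runs the original process alongside a twin receiving a constant stimulus δh_j = η, reads off the susceptibility γ from the node-averaged weight difference at the steady state, and then forms the cavity activations and the leave-one-out estimate without any retraining.

```python
import numpy as np

# Toy sketch (our own setup): measure the susceptibility gamma by running
# the original process (2) alongside a twin with a constant stimulus
# delta h_j = eta, then apply eq. (7) with squared error.
rng = np.random.default_rng(1)
N, p, lam, sigma = 200, 300, 0.5, 0.5
B = rng.standard_normal(N) / np.sqrt(N)
X = rng.standard_normal((p, N))              # independent, normalised inputs
y = X @ B + sigma * rng.standard_normal(p)

def dEdw(w):
    # E = sum_mu (y^mu - w.xi^mu)^2/2 + N*lam*|w|^2/2
    return -X.T @ (y - X @ w) + N * lam * w

eta, dt = 0.01, 0.1
w = np.zeros(N)                              # original process, eq. (2)
wt = np.zeros(N)                             # stimulated twin, eq. (3)
for _ in range(1500):
    w -= dt * dEdw(w) / N
    wt += dt * (-dEdw(wt) / N + eta)         # constant stimulus for all j
gamma = np.mean(wt - w) / eta                # node-averaged response

x = X @ w                                    # generic activations
h = x - gamma * (y - x)                      # cavity activations, eq. (7)
eps_loo = np.mean(0.5 * (y - h) ** 2)        # leave-one-out estimate
print(f"gamma = {gamma:.3f}, estimated error = {eps_loo:.3f}")
```

In this linear case the measured γ can be checked against the exact matrix-inversion result, which is precisely the computation the measurement avoids in the general nonlinear case.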
The basic idea is to perform two learning processes in parallel, one following the original process (2) and the other having a constant stimulus as in (3) with δh_j(t) = ηδ_jk, where δ_jk is the Kronecker delta. When the dynamics has reached the steady state, the measurement w̃_j − w_j yields the quantity η ∫ ds G_jk(t, s).

A simple averaging procedure, replacing all the pairwise measurements between the stimulation node k and the observation node j, can be applied in the limit of large N. We first consider the case in which the inputs are independent and normalized, i.e., ⟨ξ_j⟩ = 0, ⟨ξ_jξ_k⟩ = δ_jk. In this case, it has been shown that the off-diagonal Green's functions can be neglected, and the diagonal Green's functions become self-averaging, i.e., G_jk(t, s) = G(t, s)δ_jk, independent of the node labels [9], rendering γ = lim_{t→∞} ∫ ds G(t, s).

In the case that the inputs are correlated and not normalized, we can apply a standard whitening transformation to make them independent and normalized [1]. In large networks, one can use the diagrammatic analysis in [9] to show that the (unknown) distribution of inputs does not change the self-averaging property of the Green's functions after the whitening transformation. Thereafter, the measurement of Green's functions proceeds as described in the simpler case of independent and normalized inputs. Since hyperparameter estimation usually involves a series of computations of ε_g at various hyperparameters, the one-time preprocessing does not increase the computational load significantly.

Thus the susceptibility γ can be measured by comparing the evolution of two processes: one following the original process (2), and the other having a constant stimulus as in (3) with δh_j(t) = η for all j.
When the dynamics has reached the steady state, the measurement ⟨w̃_j − w_j⟩ yields the quantity ηγ.

We illustrate the extension to two-layer networks by considering the committee machine, in which the error function takes the form ε(Σ_a f(x_a), y), where a = 1, ..., n_h is the label of a hidden node, x_a ≡ w_a·ξ is the activation at hidden node a, and f represents the activation function. The generalization error is thus a function of the cavity activations of the hidden nodes, namely ε_g = Σ_μ ε(Σ_a f(h_a^μ), y^μ)/p, where h_a^μ = w_a^{\mu}·ξ^μ. When the inputs are independent and normalized, the cavity activations are related to the generic activations by

h_a^\mu = x_a^\mu + \sum_b \gamma_{ab}\, \frac{\partial \epsilon(\sum_c f(x_c^\mu), y^\mu)}{\partial x_b^\mu},    (9)

where γ_ab = lim_{t→∞} ∫ ds G_ab(t, s) is the susceptibility tensor. The Green's function G_ab(t, s) represents the response of a weight feeding hidden node a due to a stimulus applied at the gradient with respect to a weight feeding node b. It is obtained by monitoring n_h + 1 learning processes, one being the original and each of the other n_h processes having constant stimuli at the gradients with respect to one of the hidden nodes, viz.,

\frac{dw_{aj}^{(b)}(t)}{dt} = -\frac{1}{N} \frac{\partial E}{\partial w_{aj}^{(b)}} + \eta\, \delta_{ab}, \qquad b = 1, \ldots, n_h.    (10)

When the dynamics has reached the steady state, the measurement ⟨w_{aj}^{(b)} − w_{aj}⟩ yields the quantity ηγ_ab.

We will also compare the results with those obtained by extending the analysis of linear unlearning leave-one-out (LULOO) validation [6]. Consider the case where the regularization R(w) takes the form of a weight decay term, R(w) = N Σ_ab λ_ab w_a·w_b / 2.
The cavity activations will be given by

h_a^\mu = x_a^\mu + \sum_b \left( \frac{\frac{1}{N} \sum_{jk} \xi_j^\mu (A + Q)^{-1}_{aj,bk}\, \xi_k^\mu}{1 - \frac{\epsilon''_\mu}{N} \sum_{cj,dk} \xi_j^\mu f'(x_c^\mu)\, (A + Q)^{-1}_{cj,dk}\, f'(x_d^\mu)\, \xi_k^\mu} \right) \frac{\partial \epsilon(\sum_c f(x_c^\mu), y^\mu)}{\partial x_b^\mu},    (11)

where ε''_μ represents the second derivative of the error term with respect to the student output for example μ, the matrix A_{aj,bk} = λ_ab δ_jk, and Q is given by

Q_{aj,bk} = \frac{1}{N} \sum_\mu \xi_j^\mu f'(x_a^\mu) f'(x_b^\mu)\, \xi_k^\mu.    (12)

The LULOO result (11) differs from the cavity result (9) in that the susceptibility γ_ab now depends on the example label μ and needs to be computed by inverting the matrix A + Q. Note also that second derivatives of the error term have been neglected.

To verify the proposed method by simulations, we generate examples from a noisy teacher network which is a committee machine,

y^\mu = \frac{1}{n_h} \sum_{a=1}^{n_h} \mathrm{erf}\!\left( \frac{B_a \cdot \xi^\mu}{\sqrt{2}} \right) + \sigma z^\mu.    (13)

Here B_a is the teacher vector at hidden node a, and σ is the noise level. The ξ_j^μ and z^μ are Gaussian variables with zero means and unit variances. Learning is done by gradient descent of the energy function

E = \frac{1}{2} \sum_\mu \left[ \frac{1}{n_h} \sum_a f(x_a^\mu) - y^\mu \right]^2 + \frac{N\lambda}{2} \sum_a w_a \cdot w_a,    (14)

and the weight decay parameter λ is the hyperparameter to be optimized. The generalization error ε_g is given by

\epsilon_g = \left\langle \frac{1}{2} \left[ \frac{1}{n_h} \sum_a f(x_a) - y \right]^2 \right\rangle,    (15)

where the averaging is performed over the distribution of inputs ξ and noise z. It can be computed analytically in terms of the inner products Q_ab = w_a·w_b, T_ab = B_a·B_b and R_ab = B_a·w_b [10]. However, this target result is only known by the teacher, since T_ab and R_ab are not accessible to the student.

Figure 1 shows the simulation results of 4 randomly generated samples. Four results are compared: the target generalization error observed by the teacher, and those estimated by the cavity method, cross-validation and extended LULOO.
It can be seen that the cavity method yields estimates of the optimal weight decay with precision comparable to the other methods.

For a more systematic comparison, we search for the optimal weight decay in 10 samples using golden section search [11] with the same parameters as in Fig. 1. Compared with the target results, the standard deviations of the estimated optimal weight decays are 0.3, 0.25 and 0.24 for the cavity method, sevenfold cross-validation and extended LULOO respectively. In another simulation of 80 samples of the single-layer perceptron, the estimated optimal weight decays have standard deviations of 1.2, 1.5 and 1.6 for the cavity method, tenfold cross-validation and extended LULOO respectively (the parameters in the simulations are N = 500, p = 400 and σ ranging from 0.98 to 2.56).

To put these results in perspective, we mention that the computational resources needed by the cavity method are much less than those of the other estimations. For example, in the single-layer perceptrons, the CPU times needed to estimate the optimal weight decay using the golden section search by the teacher, the cavity method, tenfold cross-validation and extended LULOO are in the ratio of 1 : 1.5 : 3.0 : 4.6.

Before concluding this section, we mention that it is possible to derive an expression for the gradient dε_g/dλ of the estimated generalization error with respect to the weight decay. This provides an even more powerful tool for hyperparameter estimation. In the case of a search for one hyperparameter, the gradient enables us to use a binary search for the zero of the gradient, which converges faster than the golden section search.
In the single-layer experiment we mentioned, its precision is comparable to fivefold cross-validation, and its CPU time is only 4% more than the teacher's search. Details will be presented elsewhere. In the case of more than one hyperparameter, the gradient information will save us the need for an exhaustive search over a multidimensional hyperparameter space.

3  Dynamical Hyperparameter Estimation

The second example concerns the estimation of a dynamical hyperparameter, namely the optimal early stopping time, in cases where overtraining may plague the generalization ability at the steady state. In perceptrons, when the examples are noisy and the weight decay is weak, the generalization error decreases in the early stage of learning, reaches a minimum and then increases towards its asymptotic value [12, 9]. Since the early stopping point sets in before the system reaches the steady state, most analyses based on the equilibrium state are not applicable. Cross-validation stopping has been proposed as an empirical method to control overtraining [13]. Here we propose the cavity method as a convenient alternative.

[Figure 1 appears here: four panels (a)-(d) of generalization error versus weight decay λ, with curves for the target, cavity and LULOO estimates.]

Figure 1: (a-d) The dependence of the generalization error of the multilayer perceptron on the weight decay for N = 200, p = 700, n_h = 3, σ = 0.8 in 4 samples. The solid symbols locate the optimal weight decays estimated by the teacher (circle), the cavity method (square), extended LULOO (diamond) and sevenfold cross-validation (triangle).
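The one-dimensional search over the weight decay used above takes only a few lines. Below is a standard golden-section minimiser (our own sketch); in the paper's setting f(λ) would evaluate the cavity estimate of ε_g at weight decay λ, while here a simple unimodal stand-in curve is used so the block is self-contained.

```python
import numpy as np

def golden_section_min(f, a, b, tol=1e-4):
    # Standard golden-section search for the minimum of a unimodal f on [a, b].
    invphi = (np.sqrt(5) - 1) / 2
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                      # minimum lies in [a, d]
            c = b - invphi * (b - a)
        else:
            a, c = c, d                      # minimum lies in [c, b]
            d = a + invphi * (b - a)
    return (a + b) / 2

# Stand-in for the estimated generalisation error eps_g(lambda): any
# unimodal curve works for illustration.
f = lambda lam: (lam - 0.7) ** 2 + 0.3
lam_opt = golden_section_min(f, 0.0, 2.0)
print(f"estimated optimal weight decay: {lam_opt:.4f}")
```

Each evaluation of f is itself cheap for the cavity method (one pair of learning processes), which is what makes the overall search fast compared with cross-validation inside the same loop.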
In single-layer perceptrons, the cavity activations of the examples evolve according to (6), enabling us to estimate the dynamical evolution of the estimated generalization error as learning proceeds. The remaining issue is the measurement of the time-dependent Green's function. We propose to introduce an initial homogeneous stimulus, that is, δh_j(t) = ηδ(t) for all j. Again, assuming normalized and independent inputs with ⟨ξ_j⟩ = 0 and ⟨ξ_jξ_k⟩ = δ_jk, we can see from (4) that the measurement ⟨w̃_j(t) − w_j(t)⟩ yields the quantity ηG(t, 0).

We will first consider systems that are time-translationally invariant, i.e., G(t, s) = G(t − s, 0). Such are the cases of Adaline learning and linear regression [9], where the cavity activation can be written as

h^\mu(t) = x^\mu(t) + \int ds\, G(t - s, 0)\, \frac{\partial \epsilon(x^\mu(s), y^\mu)}{\partial x^\mu(s)}.    (16)

This allows us to estimate the generalization error ε_g(t) via ε_g(t) = Σ_μ ε(h^μ(t), y^μ)/p, whose minimum in time determines the early stopping point.

To verify the proposed method in linear regression, we randomly generate examples from a noisy teacher with y^μ = B·ξ^μ + σz^μ. Here B is the teacher vector with B² = 1. The ξ_j^μ and z^μ are independently generated with zero means and unit variances. Learning is done by gradient descent of the energy function E(t) = Σ_μ (y^μ − w(t)·ξ^μ)²/2. The generalization error ε_g(t) is the error averaged over the distribution of inputs ξ and their corresponding outputs y, i.e., ε_g(t) = ⟨(B·ξ + σz − w·ξ)²/2⟩. As far as the teacher is concerned, ε_g(t) can be computed as ε_g(t) = (1 − 2R(t) + Q(t) + σ²)/2, where R(t) = w(t)·B and Q(t) = w(t)².
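A small-scale numpy sketch of this procedure (our own sizes and step size, smaller than the paper's N = 500, p = 600): an impulse ηδ(t) kicks a twin process at t = 0, the node-averaged weight difference traces out G(t, 0), and the discretised convolution (16) turns the training activations into cavity activations whose error curve is then minimised over time.

```python
import numpy as np

# Small-scale sketch of the early-stopping estimate (our own sizes).
rng = np.random.default_rng(2)
N, p, sigma = 100, 150, 1.0
B = rng.standard_normal(N)
B /= np.linalg.norm(B)                       # teacher vector with B^2 = 1
X = rng.standard_normal((p, N))
y = X @ B + sigma * rng.standard_normal(p)

eta, dt, steps = 1.0, 0.1, 400
w = np.zeros(N)                              # original process
wt = np.full(N, eta)                         # twin kicked by eta*delta(t) at t = 0
g = np.empty(steps)                          # measured G(t, 0)
D = np.empty((steps, p))                     # d eps/d x along the trajectory
eps_hat = np.empty(steps)                    # cavity estimate of eps_g(t)
for n in range(steps):
    g[n] = np.mean(wt - w) / eta             # node-averaged response
    x = X @ w
    D[n] = x - y                             # for eps = (y - x)^2 / 2
    h = x + dt * (g[n::-1] @ D[:n + 1])      # discretised convolution, eq. (16)
    eps_hat[n] = np.mean(0.5 * (y - h) ** 2)
    w += dt * (X.T @ (y - x)) / N            # gradient descent, eq. (2)
    wt += dt * (X.T @ (y - X @ wt)) / N
t_stop = np.argmin(eps_hat) * dt
print(f"estimated early stopping time: {t_stop:.1f}")
```

With noisy examples and no weight decay, the estimated error curve first falls and later rises, so its minimum gives the early stopping time without running any validation set.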
Figure 2 shows the simulation results of 6 randomly generated samples. Three results are compared: the teacher's estimate, the cavity method and cross-validation. Since LULOO is based on the equilibrium state, it cannot be used in the present context. Again, we see that the cavity method yields estimates of the early stopping time with precision comparable to cross-validation. The ratio of the CPU time between the cavity method and fivefold cross-validation is 1 : 1.4.

For nonlinear regression and multilayer networks, the Green's functions are not time-translationally invariant. To estimate the Green's functions in this case, we have devised another scheme of stimuli. Preliminary results for the determination of the early stopping point are satisfactory and final results will be presented elsewhere.

[Figure 2 appears here: six panels (a)-(f) of generalization error versus time t.]

Figure 2: (a-f) The evolution of the generalization error of linear regression for N = 500, p = 600 and σ = 1. The solid symbols locate the early stopping points estimated by the teacher (circle), the cavity method (square) and fivefold cross-validation (diamond).

4  Conclusion

We have proposed a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, combined with an empirical method of measuring the Green's function. Its efficiency depends on the independent and identical distribution of the inputs, greatly reducing the number of networks to be monitored.
It does not require the validation process or the inversion of matrices of macroscopic size, and hence its speed compares favorably with cross-validation and other perturbative approaches such as extended LULOO. For multilayer networks, we will explore further speedup of the Green's function measurement by multiplexing the stimuli to the different hidden units into a single network, to be compared with a reference network. We will also extend the technique to other benchmark data to study its applicability.

Our initial success indicates that it is possible to generalize the method to more complicated systems in the future. The concept of Green's functions is very general, and its measurement by comparing the states of a stimulated system with a reference one can be adopted in general cases with suitable adaptation. Recently, much attention has been paid to the issue of model selection in support vector machines [3, 4, 5]. It would be interesting to consider how the proposed method can contribute to these cases.

Acknowledgements

We thank C. Campbell for interesting discussions and H. Nishimori for encouragement. This work was supported by the grant HKUST6157/99P from the Research Grant Council of Hong Kong.

References

[1] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford (1995).
[2] G. B. Orr and K.-R. Müller, eds., Neural Networks: Tricks of the Trade, Springer, Berlin (1998).
[3] O. Chapelle and V. N. Vapnik, Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen and K.-R. Müller, eds., MIT Press, Cambridge, 230 (2000).
[4] S. S. Keerthi, Technical Report CD-01-02, http://guppy.mpe.nus.edu.sg/mpessk/nparm.html (2001).
[5] M. Opper and O. Winther, Advances in Large Margin Classifiers, A. J. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, eds., MIT Press, Cambridge, 43 (1999).
[6] J. Larsen and L. K. Hansen, Advances in Computational Mathematics 5, 269 (1996).
[7] M. Opper and O. Winther, Phys. Rev. Lett. 76, 1964 (1996).
[8] A. L. Fetter and J. D. Walecka, Quantum Theory of Many-Particle Systems, McGraw-Hill, New York (1971).
[9] K. Y. M. Wong, S. Li and Y. W. Tong, Phys. Rev. E 62, 4036 (2000).
[10] D. Saad and S. A. Solla, Phys. Rev. Lett. 74, 4337 (1995).
[11] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge (1990).
[12] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992).
[13] S. Amari, N. Murata, K.-R. Müller, M. Finke and H. H. Yang, IEEE Trans. on Neural Networks 8, 985 (1997).
", "award": [], "sourceid": 2104, "authors": [{"given_name": "K.", "family_name": "Wong", "institution": null}, {"given_name": "F.", "family_name": "Li", "institution": null}]}