{"title": "Learning Curves, Model Selection and Complexity of Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 607, "page_last": 614, "abstract": null, "full_text": "Learning Curves, Model Selection and Complexity of Neural Networks\n\nNoboru Murata\nDepartment of Mathematical Engineering and Information Physics\nUniversity of Tokyo, Tokyo 113, JAPAN\nE-mail: mura@sat.t.u-tokyo.ac.jp\n\nShuji Yoshizawa\nDept. Mech. Info.\nUniversity of Tokyo\n\nShun-ichi Amari\nDept. Math. Eng. and Info. Phys.\nUniversity of Tokyo\n\nAbstract\n\nLearning curves show how a neural network is improved as the number of training examples increases and how it is related to the network complexity. The present paper clarifies the asymptotic properties of two learning curves and the relation between them, one concerning the predictive loss or generalization loss and the other the training loss. The result gives a natural definition of the complexity of a neural network. Moreover, it provides a new criterion of model selection.\n\n1 INTRODUCTION\n\nThe learning curve shows how well the behavior of a neural network is improved as the number of training examples increases and how it is related to the complexity of neural networks. This provides us with a criterion for choosing an adequate network in relation to the number of training examples. Some researchers have attacked this problem by using statistical mechanical methods (see Levin et al. [1990], Seung et al. [1991], etc.) and some by information theory and algorithmic methods (see Baum and Haussler [1989], etc.). The present 
paper elucidates asymptotic properties of the learning curve from the statistical point of view, giving a new criterion for model selection.\n\n2 STATEMENT OF THE PROBLEM\n\nLet us consider a stochastic neural network, which is parameterized by a set of $m$ weights $\\theta = (\\theta^1, \\ldots, \\theta^m)$ and whose input-output relation is specified by a conditional probability $p(y|x, \\theta)$. In other words, for an input signal $x \\in R^{n_{in}}$, the probability distribution of the output $y \\in R^{n_{out}}$ is given by $p(y|x, \\theta)$.\n\nA typical form of the stochastic neural network is as follows: let us consider a multi-layered network $f(x, \\theta)$ where $\\theta$ is a set of $m$ parameters $\\theta = (\\theta^1, \\ldots, \\theta^m)$ and its components correspond to weights and thresholds of the network. When some input $x$ is given, the network produces an output\n\n$$y = f(x, \\theta) + \\eta(x), \\tag{1}$$\n\nwhere $\\eta(x)$ is noise whose conditional distribution is given by $a(\\eta|x)$. Then the conditional distribution of the network, which specifies the input-output relation, is given by\n\n$$p(y|x, \\theta) = a(y - f(x, \\theta)|x). \\tag{2}$$\n\nWe define a training sample $\\xi = \\{(x_1, y_1), \\ldots, (x_t, y_t)\\}$ as a set of $t$ examples generated from the true conditional distribution $q(y|x)$, where $x_i$ is generated from a probability distribution $r(x)$ independently. We should note that both $r(x)$ and $q(y|x)$ are unknown and we need not assume the faithfulness of the model, that is, we do not assume that there exists a parameter $\\theta^*$ which realizes the true distribution $q(y|x)$ such that $p(y|x, \\theta^*) = q(y|x)$.\n\nOur purpose is to find an appropriate parameter $\\theta$ which realizes a good approximation $p(y|x, \\theta)$ to $q(y|x)$. 
For this purpose, we use a loss function\n\n$$L(\\theta) = D(r; q|p(\\theta)) + S(\\theta) \\tag{3}$$\n\nas a criterion to be minimized, where $D(r; q|p(\\theta))$ represents a general divergence measure between two conditional probabilities $q(y|x)$ and $p(y|x, \\theta)$ in the expectation form under the true input-output probability\n\n$$D(r; q|p(\\theta)) = \\int r(x) q(y|x) k(x, y, \\theta) \\, dx \\, dy \\tag{4}$$\n\nand $S(\\theta)$ is a regularization term to fit the smoothness condition of outputs (Moody [1992]). So the loss function is rewritten in an expectation form\n\n$$L(\\theta) = \\int r(x) q(y|x) d(x, y, \\theta) \\, dx \\, dy, \\qquad d(x, y, \\theta) = k(x, y, \\theta) + S(\\theta), \\tag{5}$$\n\nand $d(x, y, \\theta)$ is called the pointwise loss function.\n\nA typical case of the divergence $D$ of the multi-layered network $f(x, \\theta)$ with noise is the squared error\n\n$$D(r; q|p(\\theta)) = \\int r(x) q(y|x) \\|y - f(x, \\theta)\\|^2 \\, dx \\, dy. \\tag{6}$$\n\nThe error function of an ordinary multi-layered network is in this form, and the conventional Back-Propagation method is derived from this type of loss function.\n\nAnother typical case is the Kullback-Leibler divergence\n\n$$D(r; q|p(\\theta)) = \\int r(x) q(y|x) \\log \\frac{q(y|x)}{p(y|x, \\theta)} \\, dx \\, dy. \\tag{7}$$\n\nThe integration $\\int r(x) q(y|x) \\log q(y|x) \\, dx \\, dy$ is a constant called a conditional entropy, and we usually use the following abbreviated form instead of the previous divergence:\n\n$$D(r; q|p(\\theta)) = -\\int r(x) q(y|x) \\log p(y|x, \\theta) \\, dx \\, dy. \\tag{8}$$\n\nNext, we define an optimum of the parameter in the sense of the loss function that we introduced. We denote by $\\theta^*$ the optimal parameter that minimizes the loss function $L(\\theta)$, that is,\n\n$$L(\\theta^*) = \\min_{\\theta} L(\\theta), \\tag{9}$$\n\nand we regard $p(y|x, \\theta^*)$ as the best realization of the model. 
\n\nWhen a training sample $\\xi$ is given, we can also define an empirical loss function:\n\n$$\\hat{L}(\\theta) = D(\\hat{r}; \\hat{q}|p(\\theta)) + S(\\theta), \\tag{10}$$\n\nwhere $\\hat{r}$, $\\hat{q}$ are the empirical distributions given by the sample $\\xi$, that is,\n\n$$D(\\hat{r}; \\hat{q}|p(\\theta)) = \\frac{1}{t} \\sum_{i=1}^{t} k(x_i, y_i, \\theta), \\qquad (x_i, y_i) \\in \\xi. \\tag{11}$$\n\nIn practical cases, we consider the empirical loss function and search for the quasi-optimal parameter $\\hat{\\theta}$ defined by\n\n$$\\hat{L}(\\hat{\\theta}) = \\min_{\\theta} \\hat{L}(\\theta), \\tag{12}$$\n\nbecause the true distributions $r(x)$ and $q(y|x)$ are unknown and we can only use examples $(x_i, y_i)$ observed from the true distribution $r(x) q(y|x)$. We should note that the quasi-optimal parameter $\\hat{\\theta}$ is a random variable depending on the sample $\\xi$, each element of which is chosen randomly.\n\nThe following lemma guarantees that we can use the empirical loss function instead of the actual loss function when the number of examples $t$ is large.\n\nLemma 1 If the number of examples $t$ is large enough, it is shown that the quasi-optimal parameter $\\hat{\\theta}$ is normally distributed around the optimal parameter $\\theta^*$, that is,\n\n$$\\hat{\\theta} \\sim N\\left(\\theta^*, \\frac{1}{t} Q^{-1} G Q^{-1}\\right), \\tag{13}$$\n\nwhere\n\n$$G = \\int r(x) q(y|x) \\nabla d(x, y, \\theta^*) \\nabla d(x, y, \\theta^*)^T \\, dx \\, dy, \\tag{14}$$\n\n$$Q = \\int r(x) q(y|x) \\nabla\\nabla d(x, y, \\theta^*) \\, dx \\, dy, \\tag{15}$$\n\nand $\\nabla$ denotes the differential operator with respect to $\\theta$.\n\nThis lemma is proved by using the usual statistical methods.\n\n3 LEARNING PROCEDURE\n\nIn many cases, however, it is difficult to obtain the quasi-optimal parameter $\\hat{\\theta}$ by minimizing the equation (10) directly. We therefore often use a stochastic descent method to get an approximation to the quasi-optimal parameter $\\hat{\\theta}$. 
\n\nDefinition 1 (Stochastic Descent Method) In each learning step, an example is re-sampled from the given sample $\\xi$ randomly, and the following modification is applied to the parameter $\\theta_n$ at step $n$,\n\n$$\\theta_{n+1} = \\theta_n - \\varepsilon \\nabla d(x_{i(n)}, y_{i(n)}, \\theta_n), \\tag{16}$$\n\nwhere $\\varepsilon$ is a positive value called a learning coefficient and $(x_{i(n)}, y_{i(n)})$ is the re-sampled example at step $n$.\n\nThis is a sequential learning method and the operation of random sampling from $\\xi$ in each learning step is called the re-sampling plan. The parameter $\\theta_n$ at step $n$ is a random variable as a function of the re-sampled sequence $\\omega = \\{(x_{i(1)}, y_{i(1)}), \\ldots, (x_{i(n)}, y_{i(n)})\\}$. However, if the initial value of $\\theta$ is appropriate (this assumption prevents being stuck in local minima) and if the learning step $n$ is large enough, it is shown that the learned parameter $\\theta_n$ is normally distributed around the quasi-optimal parameter.\n\nLemma 2 If the learning step $n$ is large enough and the learning coefficient $\\varepsilon$ is small enough, the parameter $\\theta_n$ is normally distributed asymptotically, that is,\n\n$$\\theta_n \\sim N(\\hat{\\theta}, \\varepsilon V), \\tag{17}$$\n\nwhere $V$ satisfies the following relation\n\n$$\\hat{G} = \\hat{Q} V + V \\hat{Q}, \\tag{18}$$\n\n$$\\hat{G} = \\frac{1}{t} \\sum_{i=1}^{t} \\nabla d(x_i, y_i, \\hat{\\theta}) \\nabla d(x_i, y_i, \\hat{\\theta})^T, \\qquad \\hat{Q} = \\frac{1}{t} \\sum_{i=1}^{t} \\nabla\\nabla d(x_i, y_i, \\hat{\\theta}).$$\n\nIn the following discussion, we assume that $n$ is large enough and $\\varepsilon$ is small enough, and we denote the learned parameter by\n\n$$\\tilde{\\theta} = \\theta_n. \\tag{19}$$\n\nThe distribution of the random variable $\\tilde{\\theta}$, therefore, can be regarded as the normal distribution $N(\\hat{\\theta}, \\varepsilon V)$.\n\n4 LEARNING CURVES\n\nIt is important to evaluate the difference between two quantities $L(\\tilde{\\theta})$ and $\\hat{L}(\\tilde{\\theta})$. 
The quantity $L(\\tilde{\\theta})$ is called the predictive loss or the generalization error, which shows the average loss of the trained network when a novel example is given. On the other hand, the quantity $\\hat{L}(\\tilde{\\theta})$ is called the training loss or the training error, which shows the average loss evaluated by the examples used in training. Since these quantities depend on the sample $\\xi$ and the re-sampled sequence $\\omega$, we take the expectation $E$ and the variance $\\mathrm{Var}$ with respect to the sample $\\xi$ and the re-sampling sequence $\\omega$.\n\nFirst, let us consider the predictive loss which is the average loss of the trained network when a new example (which does not belong to the sample $\\xi$) is given. This averaging operation is replaced by averaging all over the input-output pairs, because the measure of the sample $\\xi$ is zero. Then the predictive loss is written as\n\n$$L(\\tilde{\\theta}) = \\int r(x) q(y|x) d(x, y, \\tilde{\\theta}) \\, dx \\, dy. \\tag{20}$$\n\nFrom the properties of $\\hat{\\theta}$ and $\\tilde{\\theta}$, we can prove the following important relations.\n\nTheorem 1 The predictive loss asymptotically satisfies\n\n$$E[L(\\tilde{\\theta})] = L(\\theta^*) + \\frac{1}{2t} \\mathrm{tr}\\, G Q^{-1} + \\frac{\\varepsilon}{2} \\mathrm{tr}\\, Q V, \\tag{21}$$\n\n$$\\mathrm{Var}[L(\\tilde{\\theta})] = \\frac{1}{2t^2} \\mathrm{tr}\\, G Q^{-1} G Q^{-1} + \\frac{\\varepsilon^2}{2} \\mathrm{tr}\\, Q V Q V + \\frac{\\varepsilon}{t} \\mathrm{tr}\\, G V. \\tag{22}$$\n\nRoughly speaking, there exist two random values $Y_1$ and $Y_2$, and the predictive loss can be written in the following form:\n\n$$L(\\tilde{\\theta}) = L(\\theta^*) + \\frac{1}{2t} \\mathrm{tr}\\, G Q^{-1} + \\frac{\\varepsilon}{2} \\mathrm{tr}\\, Q V + \\frac{1}{t} Y_1 + \\varepsilon Y_2 + o_p\\left(\\frac{1}{t}\\right) + o_p(\\varepsilon), \\tag{23}$$\n\nwhere $Y_1$ and $Y_2$ satisfy\n\n$$E[Y_1] = 0, \\qquad \\mathrm{Var}[Y_1] = \\frac{1}{2} \\mathrm{tr}\\, G Q^{-1} G Q^{-1},$$\n\n$$E[Y_2] = 0, \\qquad \\mathrm{Var}[Y_2] = \\frac{1}{2} \\mathrm{tr}\\, Q V Q V,$$\n\n$$\\mathrm{Cov}[Y_1, Y_2] = \\frac{1}{2} \\mathrm{tr}\\, G V,$$\n\nand $E$, $\\mathrm{Var}$ and $\\mathrm{Cov}$ denote the expectation, the variance and the covariance respectively. 
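\n\nAs a quick consistency check on the expansion (23), offered as a sketch rather than as part of the original derivation, the variance of its two fluctuation terms can be recombined. Here $t$ is the number of examples and $\\varepsilon$ the learning coefficient, and the check assumes the moments of $Y_1$ and $Y_2$ read $\\mathrm{Var}[Y_1] = \\frac{1}{2}\\mathrm{tr}\\, GQ^{-1}GQ^{-1}$, $\\mathrm{Var}[Y_2] = \\frac{1}{2}\\mathrm{tr}\\, QVQV$ and $\\mathrm{Cov}[Y_1, Y_2] = \\frac{1}{2}\\mathrm{tr}\\, GV$:\n\n$$\\mathrm{Var}\\left[\\frac{1}{t} Y_1 + \\varepsilon Y_2\\right] = \\frac{1}{t^2}\\mathrm{Var}[Y_1] + \\varepsilon^2 \\mathrm{Var}[Y_2] + \\frac{2\\varepsilon}{t}\\mathrm{Cov}[Y_1, Y_2] = \\frac{1}{2t^2}\\mathrm{tr}\\, GQ^{-1}GQ^{-1} + \\frac{\\varepsilon^2}{2}\\mathrm{tr}\\, QVQV + \\frac{\\varepsilon}{t}\\mathrm{tr}\\, GV,$$\n\nwhich agrees term by term with the variance of the predictive loss stated in Theorem 1.\n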
\nNext, we consider the training loss, i.e., the average loss evaluated by the examples used in training. Just as we did in the previous theorem, we can get the following relations.\n\nTheorem 2 The training loss asymptotically satisfies\n\n$$E[\\hat{L}(\\tilde{\\theta})] = L(\\theta^*) - \\frac{1}{2t} \\mathrm{tr}\\, G Q^{-1} + \\frac{\\varepsilon}{2} \\mathrm{tr}\\, Q V, \\tag{24}$$\n\n$$\\mathrm{Var}[\\hat{L}(\\tilde{\\theta})] = \\frac{1}{t} \\left\\{ \\int r(x) q(y|x) d(x, y, \\theta^*)^2 \\, dx \\, dy - \\left( \\int r(x) q(y|x) d(x, y, \\theta^*) \\, dx \\, dy \\right)^2 \\right\\}. \\tag{25}$$\n\nIntuitively speaking, like the predictive loss, the training loss can be expanded as\n\n$$\\hat{L}(\\tilde{\\theta}) = L(\\theta^*) - \\frac{1}{2t} \\mathrm{tr}\\, G Q^{-1} + \\frac{\\varepsilon}{2} \\mathrm{tr}\\, Q V + \\frac{1}{\\sqrt{t}} Y_3 + (\\text{higher-order terms}), \\tag{26}$$\n\nwhere $Y_3$ satisfies\n\n$$E[Y_3] = 0, \\qquad \\mathrm{Var}[Y_3] = \\int r(x) q(y|x) d(x, y, \\theta^*)^2 \\, dx \\, dy - \\left( \\int r(x) q(y|x) d(x, y, \\theta^*) \\, dx \\, dy \\right)^2.$$\n\nWhen we look at the two curves $E[L(\\tilde{\\theta})]$ and $E[\\hat{L}(\\tilde{\\theta})]$ as functions of $t$, they are called learning curves, which represent the characteristics of learning. The expectations of the predictive loss and the training loss look quite similar. They differ in the sign of the term of order $1/t$. As the learning coefficient $\\varepsilon$ increases, the expectations $E[L(\\tilde{\\theta})]$ and $E[\\hat{L}(\\tilde{\\theta})]$ increase, but as the number of examples $t$ increases, the average predictive loss $E[L(\\tilde{\\theta})]$ decreases and the average training loss $E[\\hat{L}(\\tilde{\\theta})]$ conversely increases. Moreover, their variances differ in their order with respect to $t$. The coefficients $\\mathrm{tr}\\, GQ^{-1}$, $\\mathrm{tr}\\, QV$, etc. are calculated from the matrices $G$, $Q$ and $V$, which reflect the architecture of the network and the loss criterion to be minimized. We can consider these matrices as representing the complexity of the network. In earlier work, Amari and Murata [1991] introduced an effective complexity of the network, $\\mathrm{tr}\\, GQ^{-1}$, by analogy to Akaike's Information Criterion (AIC) (see Akaike [1974]). 
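\n\nThe two theorems can be combined in one line, which may help to see where a model selection criterion comes from. The following is a sketch that assumes the expectations of Theorems 1 and 2 as stated, writing $\\tilde{\\theta}$ for the learned parameter, $L$ for the predictive loss, $\\hat{L}$ for the training loss and $\\varepsilon$ for the learning coefficient. Subtracting the expected training loss from the expected predictive loss cancels both $L(\\theta^*)$ and the $\\varepsilon$ term:\n\n$$E[L(\\tilde{\\theta})] - E[\\hat{L}(\\tilde{\\theta})] = \\left(\\frac{1}{2t} + \\frac{1}{2t}\\right) \\mathrm{tr}\\, GQ^{-1} = \\frac{1}{t} \\mathrm{tr}\\, GQ^{-1}.$$\n\nAs an illustrative special case (an assumption added here, not a setting the text requires), take the log loss $d(x, y, \\theta) = -\\log p(y|x, \\theta)$ with $S(\\theta) = 0$ and a faithful model, $q(y|x) = p(y|x, \\theta^*)$. Then $G$ and $Q$ both equal the Fisher information matrix, and if that matrix is nonsingular,\n\n$$\\mathrm{tr}\\, GQ^{-1} = \\mathrm{tr}\\, I_m = m,$$\n\nso the gap reduces to $m/t$, which corresponds to the parameter-count penalty of AIC up to the conventional factor $2t$.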
\n\n5 AN APPLICATION FOR MODEL SELECTION\n\nThese results naturally lead us to a model selection criterion, which is like the AIC criterion of statistical model selection and which is related to those proposed by some researchers (see Murata et al. [1991], Moody [1992]). From the previous relations, we can easily show the following relation\n\n$$L(\\tilde{\\theta}) = \\hat{L}(\\tilde{\\theta}) + \\frac{1}{t} \\mathrm{tr}\\, G Q^{-1} + c + (\\text{higher-order terms}), \\tag{27}$$\n\nwhere $c$ is a quantity of order $1/\\sqrt{t}$ and common to all the networks of the same architecture. We compare the abilities of two different networks, which have the same architecture and are trained by the same sample, but differ in the number of weights or neurons (see Fig.1). We can use a quantity, NIC (Network Information Criterion),\n\n$$\\mathrm{NIC} = \\hat{L}(\\tilde{\\theta}) + \\frac{1}{t} \\mathrm{tr}\\, \\hat{G} \\hat{Q}^{-1}, \\tag{28}$$\n\nwhere\n\n$$\\hat{G} = \\frac{1}{t} \\sum_{i=1}^{t} \\nabla d(x_i, y_i, \\tilde{\\theta}) \\nabla d(x_i, y_i, \\tilde{\\theta})^T, \\qquad \\hat{Q} = \\frac{1}{t} \\sum_{i=1}^{t} \\nabla\\nabla d(x_i, y_i, \\tilde{\\theta}), \\tag{29}$$\n\nfor selecting an optimal network model. Note that this quantity NIC is directly calculable, since all elements of it, $\\hat{L}(\\tilde{\\theta})$, $\\hat{G}$, $\\hat{Q}$, are given by summing over the sample $\\xi$. When we have two models $M_1$ and $M_2$, and the NIC of $M_1$ is smaller than that of $M_2$, the predictive loss of $M_1$ is expected to be smaller than that of $M_2$, so $M_1$ can be regarded as a better model in the sense of the loss function.\n\nThis criterion cannot be used when we compare two networks of different architectures, for example a multi-layered network and a radial basis expansion network. This is because the value $c$ of the order $1/\\sqrt{t}$ term is common only to two networks in which one is included in the other as a submodel. The criterion is in general valid only for such a family of networks (see Fig.2).\n\n6 CONCLUSIONS\n\nIn this paper, we show that there is a nice relation between the expectation of the predictive loss and that of the training loss. 
This result naturally leads us to a new model selection criterion.\n\nAs future work, we will consider applying this result to an algorithm that automatically changes the number of hidden units during learning.\n\nReferences\n\nH. Akaike. (1974) A new look at the statistical model identification. IEEE Trans. AC, 19(6):716-723.\n\nS. Amari. (1967) Theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299-307.\n\nS. Amari and N. Murata. (1991) Statistical theory of learning curves under entropic loss criterion. Technical Report METR 91-12, University of Tokyo, Tokyo, Japan.\n\nE. B. Baum and D. Haussler. (1989) What size net gives valid generalization? Neural Computation, 1:151-160.\n\nE. Levin, N. Tishby, and S. A. Solla. (1990) A statistical approach to learning and generalization in layered neural networks. Proc. of IEEE, 78(10):1568-1574.\n\nJ. E. Moody. (1992) The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, (eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.\n\nN. Murata. (1992) Statistical asymptotic study on learning (In Japanese). PhD thesis, University of Tokyo, Tokyo, Japan.\n\nN. Murata, S. Yoshizawa, and S. Amari. (1991) A criterion for determining the number of parameters in an artificial neural network model. In T. Kohonen et al., (eds.), Artificial Neural Networks, 9-14. Holland: Elsevier Science Publishers.\n\nH. S. Seung, H. Sompolinsky, and N. Tishby. (1991) Statistical mechanics of learning from examples II. quenched theory and unrealizable rules. Submitted to Physical Review A. 
\n\nFigure 1: Geometrical representation of hierarchical models: the solid lines between $q(y|x)$ and $\\theta_i$ show predictive losses, and the dashed lines between $\\hat{q}(y|x)$ and $\\theta_i$ show training losses. The large variance of the training loss originates in the discrepancy between $q(y|x)$ and $\\hat{q}(y|x)$. When we estimate the predictive loss from the training loss, the large variance still remains. But in the case that the model $M_1$ includes the model $M_2$, this variance is common to the two models, so we do not have to take care of it.\n\nFigure 2: Geometrical representation of non-hierarchical models: the solid lines between $q(y|x)$ and $\\theta_i$ show predictive losses, and the dashed lines between $\\hat{q}(y|x)$ and $\\theta_i$ show training losses. The discrepancy between $q(y|x)$ and $\\hat{q}(y|x)$ works differently on the two models $M_1$ and $M_2$ in estimating predictive losses.\n", "award": [], "sourceid": 601, "authors": [{"given_name": "Noboru", "family_name": "Murata", "institution": null}, {"given_name": "Shuji", "family_name": "Yoshizawa", "institution": null}, {"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}]}