{"title": "Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method", "book": "Advances in Neural Information Processing Systems", "page_first": 599, "page_last": 606, "abstract": null, "full_text": "Neural Network  Model Selection Using \n\nAsymptotic Jackknife  Estimator and \n\nCross-Validation Method \n\nYong  Liu \n\nDepartment of Physics and \n\nInstitute for  Brain and  Neural  Systems \n\nBox  1843,  Brown  University \n\nProvidence,  RI,  02912 \n\nAbstract \n\nTwo  theorems  and  a  lemma are  presented  about  the  use  of jackknife es(cid:173)\ntimator and  the  cross-validation  method for  model selection.  Theorem  1 \ngives  the asymptotic form for  the jackknife estimator.  Combined with the \nmodel selection  criterion,  this asymptotic form  can  be  used  to obtain the \nfit  of a  model.  The model selection  criterion we  used  is the negative of the \naverage predictive likehood, the choice of which is  based on the idea of the \ncross-validation  method.  Lemma 1  provides  a  formula  for  further  explo(cid:173)\nration of the asymptotics of the model selection criterion.  Theorem 2 gives \nan asymptotic form of the model selection criterion for  the regression  case, \nwhen  the  parameters optimization criterion  has a  penalty term.  Theorem \n2  also  proves  the  asymptotic equivalence  of Moody's  model selection  cri(cid:173)\nterion  (Moody,  1992)  and  the  cross-validation method,  when  the distance \nmeasure  between  response  y  and  regression  function  takes  the  form  of a \nsquared  difference. \n\n1 \n\nINTRODUCTION \n\nSelecting a  model for  a  specified  problem is  the  key  to generalization  based on  the \ntraining  data set.  In  the  context  of neural  network,  this  corresponds  to  selecting \nan  architecture.  There  has  been  a  substantial  amount  of work  in  model selection \n(Lindley,  1968;  Mallows, 1973j Akaike,  1973; Stone,  1977; Atkinson,  1978j Schwartz, \n\n599 \n\n\f600 \n\nLiu \n\n1978;  Zellner,  1984;  MacKay,  1991;  Moody,  1992;  etc.).  In  Moody's paper  (Moody, \n1992),  the  author  generalized  Akaike  Information  Criterion  (AIC)  (Akaike,  1973) \nin  the  regression  case  and  introduced  the  term  effective  number  of parameters.  It \nis  thus  of great  interest  to see  what  the  link  between  this  criterion  and  the  cross(cid:173)\nvalidation  method  (Stone,  1974)  is  and  what  we  can  gain  from  it,  given  the  fact \nthat AIC is asymptotically equivalent to the cross-validation method (Stone,  1977). \n\nIn the  method of cross-validation (Stone,  1974), a  data set,  which  has a  data point \ndeleted  from  the original training data set,  is  used  to estimate the  parameters of a \nmodel by  optimizing a  parameters optimization criterion.  The  optimal parameters \nthus obtained are called the jackknife estimator (Miller,  1974).  Then the predictive \nlikelihood  of the  deleted  data  point  is  calculated,  based  on  the  estimated  parame(cid:173)\nters.  This is  repeated  for  each data point in  the  original training data set.  The  fit \nof the  model, or the model selection criterion,  is  chosen as the  negative of the aver(cid:173)\nage  of these  predictive  likelihoods.  However,  the  computational cost  of estimating \nparameters for  different  data point deletion  is  expensive.  In  section  2,  we  obtained \nan asymptotic formula (theorem  1)  for  the jackknife estimator based on optimizing \na  parameters optimization criterion  with one  data point  deleted  from  the  training \ndata  set.  This somewhat  relieves  the  computational  cost  mentioned  above.  This \nasymptotic formula can be used  to obtain the  model selection criterion by plugging \nit  into  the  criterion.  Furthermore,  in  section  3,  we  obtained  the  asymptotic  form \nof the  model selection  criterion  for  the general  case  (Lemma 1)  and  for  the special \ncase  when  the  parameters  optimization criterion  has  a  penalty  term  (theorem  2). \nWe also proved the equivalence of Moody's model selection criterion (Moody,  1992) \nand  the cross-validation method  (theorem  2).  Only  sketchy  proofs  are  given  when \nthese  theorems  and  lemma are  introduced.  The  detail  of the  proofs  are  given  in \nsection  4. \n\n2  APPROXIMATE  JACKKNIFE  ESTIMATOR \n\nLet the parameters optimization criterion,  with data set w  = {(:Vi, yd,  i = 1,  ... ,  n} \nand  parameters  9,  be  Cw (9),  and  let  W-i  denote  the  data set  with  ith  data point \ndeleted from w.  If we denote 8 and 8_ i  as the optimal parameters for criterion Cw (9) \nand Cw _.(9),  respectively,  \\79  as the  derivative  with  respect  to 9  and  superscript  t \n~s transpose,  we  have  the following  theorem  about  the  relationship  between 8 and \n9_ i \u00b7 \n\nTheorem 1  If the  criterion function Cw (9)  is  an infinite- order  differentiable func(cid:173)\ntion  and  its  derivatives  are  bounded  around 8.  The  estimator 8 -i  (also  called jack(cid:173)\nknife  estimator (Miller,  1974))  can  be  approzimated  as \n\n8_i  - 8 ~ -(\\79\\7~Cw(8) - \\79\\7~Ci(8\u00bb-1\\79Ci(8) \n\n(1) \n\nin which Ci(9)  =  Cw(9)  - Cw_.(9). \n\nProof.  Use  the  Taylor  expansion  of equation  \\7 9Cw_.(8_d \nterms higher  than the  second  order. \n\no around  9.  Ignore \n\n\fModel  Selection  Using Asymptotic Jackknife Estimator  &  Cross-Validation Method \n\n601 \n\nExample  1:  Using  the  generalized  maximum  likelihood  method  from  Bayesian \nanalysis 1  (Berger,  1985), if 7r( 0)  is  the prior on the parameters and the observations \nare  mutually independent,  for  which  the distribution is  modeled as  ylx ,.....  f(Ylx, 0), \nthe  parameters optimization criterion is \n\nThus Ci(O)  =  logf(Yilxi, 0).  If we  ignore  the  influence  of the  deleted  data point  in \nthe dt nominator of equation  1,  we  have \n\n(3) \n\nExample  2:  In  the  special  case  of example  I,  with  noninformative prior  7r( 0)  =  1, \nthe criterion  is  the ordinary log-likelihood function,  thus \n\n9_i-9~- [  L  VeV~logf(Yj lxj,O)  j-lVelogf(Yilxi,O). \n\n(4) \n\n(5) \n\n(xi,Y.:)Ew \n\n3  CROSS-VALIDATION  METHOD  AND  MODEL \n\nSELECTION CRITERION \n\nHereafter  we  use  the  negative of the average  predictive likelihood, or, \n\nTm(w)  =  -- L  logf(Yi lXi, O-i) \n\n1 \nn \n\n(x\"y.:)Ew \n\nas  the  model selection  criterion,  in  which  n  is  the  size  of the  training  data  set  w, \nmE .. Vi  denotes  parametric probability models  f(Ylx, 0)  and .tVi  is  the set of all the \nmodels  in  consideration.  It is  well  known  that  Tm(w)  is  an  unbiased  estimator  of \nr(00,9(.)), the risk of using the model m  and estimator 0,  when the true parameters \nare 00  and the training data set  is w  (Stone,  1974;  Efron and Gong,  1983; etc.),  i.e., \n\nr(Oo, 0(.)) \n\nE{Tm(w)} \nE{ -logf(Ylx, 9(w))} \nE{  -~  L \n\nlogf(Yj IXj, O(w))  } \n\n(x] ,Y] )Ew ... \n\n(6) \n\nj  =  I, \n\nin  which  Wn  =  {( Xj , Yj ), \n...  k}  is  the  test  data  set,  9(.)  is  an  implicit \nfunction  of the  training  data set  wand  it  is  the  estimator  we  decide  to  use  after \nwe  have observed  the  training data set  w.  The  expectation  above  is  taken over  the \nrandomness  of w,  x,  Y and  W n .  The  optimal model  will  be  the  one  that  minimizes \nthis criterion.  This procedure of using 9_ t  and Tm(w)  to obtain an estimation of risk \nis  often  called the cross-validation method  (Stone,  1974;  Efron  and Gong,  1983). \nRemark:  After  we  have obtained 9 for  a  model,  we  can  use  equation  1 to calculate \n9_ i  for  each  i, and put  the resulting 9_ i  into equation  5 to get  the fit  of the model, \nthus we  will  be able  to compare different  models m  E .tVi. \n\n1 Strictly  speaking,  it is  a  method  to find  the  posterior  mode. \n\n\f602 \n\nLiu \n\nLemma 1  If the  probability  model  f(ylx, 8),  as  a function,.  of 8,  is  differentiable  up \nto  infinite  order  and  its  derivatives  are  bounded  around  8.  The  approximation  to \nthe  model  selection  criterion,  equation  5,  can  be  written as \n\nTm(w)  ~ -~  L  logf(Yi lXi, 8) - ~  L  V'~logf(Yi lXi, 8)(8_i  - 8) \n\n(7) \n\nn \n\n(Xi,Yi)Ew \n\nn \n\n(Xi,Yi)Ew \n\nProof.  Igoring  the  terms  higher  than  the  second  order  of the  Taylor expansion  of \nlogf(Yj IXj, 8_ i )  around 8 will yield  the  result. \nEzample  2 (continued):  Using equation 4,  we have, for the modei selection criterion, \n\n1  \"\" \nn \n\nL \n\n(xi,y.)Ew \n\nlogf(Yi lXi, 9)  -\n\nA \n\n1 \n\nn  2:  V'~logf(Yi lXi, 8)A -IV' /}logf(Yi lXi, 8). \n\n(8) \n\n(:Ci,y.)Ew \n\nin  which  A  =  E(X]'Y3)EW V'/}V'~logf(Yjlxj,9).  If the  model  f(Ylx,8)  is  the  true \none,  the second  term is  asymptotically equal  to p,  the number of parameters in the \nmodel.  So  the  model selection  criterion  is \n\n-\n\nlog-likelihood + number of parameters of the  model. \n\nThis is  the  well  known  Akaike's Information Criterion  (AIC)  (Akaike,  1973). \nEzample  1( continued):  Consider  the  probability model \n\nf(Ylx,8)  =  ,8exp( - 20'2 E(y, 1}/}( X))) \n\n1 \n\n(9) \n\nin which,8 is a  normalization factor,  E(y, 1}/}(x))  is a  distance measure between Y and \nregression  function 1}/} (x).  E(\u00b7)  as function of 9  is assumed differentiable.  Denoting2 \nU(8,~, w)  = E(Xi,Yi)EW  E(Yi' 1}/}(xd) - 20'2log1T(81~), we  have the following theorem, \nTheorem  2  For the  model specified in equation  9  and the parameters optimization \ncriterion specified  in  equation  2  (ezample  1),  under  regular  condition,  the  unbiased \nestimator of \n\nE{  ~  2:  E(Yi,1}e(xd)} \n\n(xi,y.)Ew .. \n\nasymptotically  equals  to \n\nE(Yi' 1}e(x~))  + \n\n1\"\" L \nn  L  V'~E(Yi,1}e(xd){V'/}V'~U(8,).,w)}-IV'/}E(Yi,1}9(xd)\u00b7 \n\n(:Ci,y .. )Ew \n\nn \n\n1 \n\n(Xi,Yi)Ew \n\n2For  example,  1r(OIA)  =  Alp(O, (72 fA),  this  corresponds  to \n\nU(O,  A, w)  =  L  \u00a3(Yi,l1s(xi))  +  A02  +  const(A, (72). \n\n(z\"Yi)Ew \n\n(10) \n\n(11) \n\n\fModel Selection Using Asymptotic Jackknife Estimator  &  Cross-Validation Method \n\n603 \n\nFor the  case  when \u00a3(Y, 179(Z))  =  (y -179(Z))2,  we  get,  for the  asymptotic  equivalency \nof the  equation  11, \n\n(12) \n\nin  wh.ich W =  {(Zi,Yi),  i  =  1,  ... ,  n}  is  the  training  data  set,  Wn  =  {(zi,yd,  ~  = \n1,  ... ,  k}  is  the  test  data  set,  and \u00a3(8,w)  =  ~ L(:z:\"y,)EW  E(Yi,179(Zi)). \n\nProof.  This result  comes  directly  from  theorem  1 and  lemma 1.  Some asymptotic \ntechnique  has to be  used. \n\nRemark:  The  result  in  equation  12  was  first  proposed  by  Moody  (Moody,  1992). \nThe  effective  number  of parameters  formulated  in  his  paper  corresponds  to  the \nsummation in  equation  12.  Since  the  result  in  this  theorem  comes  directly  from \nthe asymptotics of the cross-validation method and the jackknife estimator, it gives \nthe  equivalency  proof  between  Moody's  model  selection  criterion  and  the  cross(cid:173)\nvalidation method.  The  detailed  proof of this  theorem,  presented  in  section  4,  is \nin  spirit  the  same  as  the  one  presented  in  Stone's  paper  about  the  proof  of  the \nasymptotic equivalence  of Ale and  the cross-validation method  (Stone,  1977). \n\n4  DETAILED  PROOF  OF  LEl\\1MAS  AND  THEOREMS \n\nIn order to prove theorem 1,  lemma 1 and theorem 2,  we  will present  three auxiliary \nlemmas first. \n\nLemma 2  For  random  variable  sequence  Zn  and  Yn,  if limn-+co  Zn \nliffin-+co Yn  =  z,  then Zn  and Yn  are  asymptotically  equivalent. \n\nZ  and \n\nProof.  This  comes  from  the definition  of asymptotic equivalence.  Because  asymp(cid:173)\ntotically  the  two  random  variable will behave  the same as  random variable  z. \n\nLemma 3  Consider  the  summation  Li h(Zi' Ydg(Zi' z).  If  E(h(z, y)lz, z)  is  a \nconstant c  independent  of z,  y,  z,  then  the  summation  is  asymptotically  equivalent \nto  cLig(Zi'Z). \n\nProof.  According to the  theorem  of large  number, \n\nlim  ~ ' \"  h(Zi' Yi)g(Zt, z) \nn-+co  n  ~ \n\nt \n\nE(h(z, y)g(z, z)) \n\nE(E(h(z, y)lz, z)g(z, z)) =  cE(g(z, z)) \n\nwhich is  the same as the limit of ~ Li g(Zt' z).  Using lemma 2,  we  get  the result  of \nthis lemma. \n\nLemma 4 \nmodel Y  =  T}9 (z) + f  with f \n\nIf T}9 (.)  and  g( 8, .)  are  differentiable  up  to  the  second  order,  and  the \n,....,  ,/V (0, (]'2)  is  the  true  model,  the  second  derivative  with \n\n\f604 \n\nLiu \n\nrespect  to  8  of \n\nn \n\ni=l \n\nevaluated  at  the  minimum  of  U,  i. e.,  iJ,  is  asymptotically  independent  of random \nvariable {Yi, i  =  1, ... , n}. \n\nPro~of.  Explicit calculation of the second derivative of U with respect  to 8,  evaluated \nat 8,  gives \n\nV9V~U(iJ,'\\,W) =  2:LV977J(1;JV~179(:Z:t) \n\nn \n\ni=l \n\ni=l \n\n+ \n\nAs  n  approaches  infinite,  the  effect  of the  second  term  in  U  vanishes,  iJ  approach \nthe  mean squared  error  estimator with  infinite  amount of data  points,  or  the  true \nparameters 80  of the model (consistency  of MSE estimator (Jennrich,  1969)), E(y-\n779(z))  approaches E(Y-7790(Z))  which is O.  According to lemma 2 and lemma 3, the \nsecond  term of this second  derivative  vanishes  asymptotically.  So as  n  approaches \ninfinite,  the second  derivative of U  with  respect  to 8,  evaluated at iJ,  approaches \n\nV' 9 V~U(80), '\\, w)  =  2 L V' 97790 (zi)V~7790 (Zi) + V' 9 V~g( 80 ,  ,\\) \n\nn \n\nwhich is independent of {Yi,  i  = 1,  ... ,  n}.  According to lemma 2,  the result of this \nlemma is  readily obtained. \n\nNow  we  give  the detailed  proof of theorem  1,  lemma 1 and theorem  2. \n\nProof of Theorem  1.  The jackknife  estimator  iJ- i  satisfies,  V 9Cw_ .. (ILi) \nThe Taylor expansion  of the left  side of this equation around 8  gives \n91 2 )  = 0 \n\nVeCW_i(iJ)  + VeV~Cw_.(iJ)(iJ_i - iJ)  + O(liJ- i  -\n\nO. \n\nAccording  to  the  definition  of iJ  and  iJ_ i ,  their  difference  is  thus  a  small  quantity. \nAlso  because  of the  boundness of the  derivatives,  we  can ignore higher order  terms \nin the  Taylor expansion and get  the  approximation \n\niJ- i - iJ  ~ -(V9V~CW_i(iJ))-1V'9Cw_.(iJ) \n\nSince 9 satisfies  V' 9Cw(iJ)  =  0,  we  can  rewrite  this equation and obtain equation  1. \nProof of Lemma 1.  The  Taylor expansion of 10gf(Yi IZi' iJ-d  around iJ  is \n\n10gf(Yi IZi, iJ-d =  10gf(Yi IZi, iJ)  + V'~logf(Yi IZi, iJ)(iJ_ i  -\n\niJ) + O(liJ-i  - 91 2 ) \n\nPutting this into equation 5 and ignoring higher order  terms for  the same argument \nas that presented  in  the  proof of theorem  1,  we  readily get  equation 7. \n\nProof of Theorem  2.  Up  to an  additive  constant  dependent  only  on  ,\\  and  cr2 , \nthe optimization criterion,  or  equation  2,  can  be  rewritten  as \n\n(13) \n\n\fModel Selection  Using Asymptotic Jackknife Estimator  &  Cross-Validation Method \n\n605 \n\nNow  putting equation  9 and  13  into equation  3,  we  get, \n\nO-i - 0 ~ -{V' 9 V'~U(8, .\\, w)} -IV' 9\u00a3(Yi, 17e(:z:d) \n\nPutting equation  14  into equation  7,  we  get,  for  the  model selection  criterion, \n\nTm(w)  =  n  ~ 2u2 \u00a3(Yi, 17e(:z:d)  + \n\n' \"  \n\n1 \n\n1 \n\n(:t:\"Yi)Ew \n\n1 \nr.. \n\n' \"  \n~ 2u 2 V'9\u00a3(Yi, 17e(:z:d){V'9V'9U(O,>.,w) \n\n1 \n\nt \n\nt \n\n~ \n\n}-l \n\nV'9\u00a3(Yi,17e(:Z:i \n\n)) \n\n(:t:i,Yi)Ew \n\nRecall  the  discussion associated  with equation  6 and  now \n\nE{ -\"k  ~  10gf(Y;I:Z:j,0)} = E{\"k  ~  2u2\u00a3(Yj,17e(:Z:;))} \n\n'\" \n\n'\" \n\n1 \n\n1 \n\n1 \n\n~ \n\n(:t:\"y,)Ew\" \n\n(:t:\"Yj)Ew\" \n\n(14) \n\n(15) \n\n(16) \n\nafter  some  simple  algebra,  we  can  obtain  the  unbiased  estimator  of equation  10. \nThe result is equation 15 multiplied by  2u 2 ,  or equation  11.  Thus we  prove the first \npart of the  theorem. \nNow  consider  the case  when \n\n\u00a3(Y,179(:Z:))  = (y -179(:z:))2 \n\n(17) \n\n(18) \n\n(20) \n\nThe second  term  of equation  11  now becomes \n\n~  L  4(Yi  -17e(:z:d)2V'~17e(:Z:i){V'9V'~U(8, >',w)}-1V'917e(:Z:i) \n\n(:t:\"Yi)Ew \n\nAs  n  approaches  infinite, 0 approach  the  true  parameters 0o,  V' 917e(:Z:')  approaches \nV'9179 0 (:Z:.)  and  E((y -17e(:z:)))2  asymptotically  equals  to  u 2 \u2022  Using  lemma  4  and \nlemma 3,  we  get,  for  the asymptotic equivalency  of equation  18, \n\n.!..u2  L  2V'~17\u00a7(:z:d{V'9V'~U(0,>.,W)}-12V'917\u00a7(:z:d \n\nn \n\n(19) \n\nIf we  use  notation \u00a3(O,w) = ~ L(:t:\"Yi)EW \u00a3(Yi,179(:z:d),  with \u00a3(Y,179(:Z:))  of the form \nspecified  in  equation  17,  we  can get, \n\na -a V'9 n \u00a3(0,w) = -2V'9179(:Z:i) \n\nYi \n\nCombining this with equation 19 and equation 11, we can readily obtain equation 12. \n\n5  SUMMARY \n\nIn  this  paper,  we  used  asymptotics  to  obtain  the  jackknife  estimator,  which  can \nbe  used  to get  the  fit  of a  model by  plugging it  into the  model selection  criterion. \nBased  on  the  idea  of  the  cross-validation  method,  we  used  the  negative  of  the \naverage  predicative  likelihood  as  the  model  selection  criterion.  We  also  obtained \nthe  asymptotic  form  of  the  model  selection  criterion  and  proved  that  when  the \nparameters  optimization criterion  is  the  mean  squared  error  plus  a  penalty  term, \nthis  asymptotic  form  is  the  same  as  the  form  presented  by  (Moody,  1992).  This \nalso  served  to  prove  the  asymptotic  equivalence  of this  criterion  to  the  method  of \ncross-validation. \n\n\f606 \n\nLiu \n\nAcknowledgements \n\nThe author  thanks all  the  members of the  Institute for  Brain and  Neural  Systems, \nin  particular,  Professor  Leon  N  Cooper for  reading  the draft of this paper,  and  Dr. \nNathan Intrator,  Michael P. Perrone  and Harel Shouval for  helpful comments.  This \nresearch  was supported  by grants from  NSF,  ONR and  ARO. \n\nReferences \n\nAkaike,  H.  (1973).  Information  theory  and  an  extension  of the  maximum  likeli(cid:173)\nhood  principle.  In  Petrov  and  Czaki,  editors,  Proceedings  of the  2nd  International \nSymposium  on Information  Theory,  pages  267-281. \n\nAtkinson,  A.  C.  (1978).  Posterior  probabilities  for  choosing  a  regression  model. \nBiometrika,  65:39-48. \n\nBerger,  J.  O.  (1985).  Statistical Decision  Theory  and Bayesian Analysis.  Springer(cid:173)\nVerlag. \n\nEfron,  B.  and Gong,  G.  (1983).  A  leisurely look at the bootstrap, the jackknife and \ncross-validation.  Amer.  Stat.,  37:36-48. \n\nJennrich,  R.  (1969).  Asymptotic  properties  of nonlinear  least  squares  estimators. \nAnn.  Math.  Stat.,  40:633-643. \n\nLindley,  D.  V.  (1968).  The  choice  of variables  in  multiple regression  (with  discus(cid:173)\nsion).  J.  Roy.  Stat.  Soc.,  Ser.  B,  30:31-66. \n\nMacKay,  D.  (1991).  Bayesian  methods for  adaptive  models.  PhD  thesis,  California \nInstitute of Technology. \n\nMallows,  C.  L.  (1973).  Some comments on  Cpo  Technometrics,  15:661-675. \n\nMiller,  R.  G.  (1974).  The jackknife - a  review.  Biometrika,  61:1-15. \n\nMoody,  J.  E.  (1992).  The  effective  number  of parameters,  an  analysis  of general(cid:173)\nization and  regularization  in  nonlinear  learning system.  In  Moody,  J.  E.,  Hanson, \nS.  J.,  and  Lippmann,  R.  P.,  editors,  Advances  in  Neural  Information  Processing \nSystem 4.  Morgan  Kaufmann Publication. \nSchwartz,  G.  (1978).  Estimating the dimension of a  model.  Ann.  Stat,  6:461-464. \n\nStone,  M.  (1974).  Cross-validatory choice  and assessment  of statistical predictions \n(with discussion).  J.  Roy.  Stat.  Soc.,  Ser.  B. \n\nStone,  M.  (1977).  An asymptotic equivalence of choice of model by cross-validation \nand Akaike's criterion.  J.  Roy.  Stat.  Soc.,  Ser.  B,  39(1):44-47. \n\nZellner,  A.  (1984).  Posterior odds ratios for  regression  hypotheses:  General consid(cid:173)\neration and some specific  results.  In  Zellner,  A.,  editor,  Basic  Issues  in Economet(cid:173)\nrics,  pages  275-305.  University  of Chicago Press. \n\n\f", "award": [], "sourceid": 700, "authors": [{"given_name": "Yong", "family_name": "Liu", "institution": null}]}