{"title": "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 847, "page_last": 854, "abstract": null, "full_text": "The Effective Number of Parameters: \n\nAn Analysis of Generalization and Regularization \n\nin Nonlinear Learning Systems \n\nJohn E. Moody \n\nDepartment of Computer Science, Yale University \n\nP.O. Box 2158 Yale Station, New Haven, CT 06520-2158 \n\nInternet: moody@cs.yale.edu, Phone: (203)432-1200 \n\nAbstract \n\nWe present an analysis of how the generalization performance (expected test set error) relates to the expected training set error for nonlinear learning systems, such as multilayer perceptrons and radial basis functions. The principal result is the following relationship (computed to second order) between the expected test set and training set errors: \n\n$$\\langle \\mathcal{E}_{test}(\\lambda) \\rangle_{\\xi \\xi'} \\approx \\langle \\mathcal{E}_{train}(\\lambda) \\rangle_{\\xi} + 2 \\sigma_{eff}^2 \\frac{p_{eff}(\\lambda)}{n}$$ (1) \n\nHere, $n$ is the size of the training sample $\\xi$, $\\sigma_{eff}^2$ is the effective noise variance in the response variable(s), $\\lambda$ is a regularization or weight decay parameter, and $p_{eff}(\\lambda)$ is the effective number of parameters in the nonlinear model. The expectations $\\langle \\, \\rangle$ of training set and test set errors are taken over possible training sets $\\xi$ and training and test sets $\\xi'$ respectively. The effective number of parameters $p_{eff}(\\lambda)$ usually differs from the true number of model parameters $p$ for nonlinear or regularized models; this theoretical conclusion is supported by Monte Carlo experiments.  
In addition to the surprising result that $p_{eff}(\\lambda) \\neq p$, we propose an estimate of (1) called the generalized prediction error (GPE) which generalizes well established estimates of prediction risk such as Akaike's FPE and AIC, Mallows' $C_p$, and Barron's PSE to the nonlinear setting.1 \n\n1 GPE and $p_{eff}(\\lambda)$ were previously introduced in Moody (1991). \n\n1 Background and Motivation \n\nMany of the nonlinear learning systems of current interest for adaptive control, adaptive signal processing, and time series prediction are supervised learning systems of the regression type. Understanding the relationship between generalization performance and training error and being able to estimate the generalization performance of such systems is of crucial importance. We will take the prediction risk (expected test set error) as our measure of generalization performance. \n\n2 Learning from Examples \n\nConsider a set of $n$ real-valued input/output data pairs $\\xi(n) = \\{\\xi^i = (x^i, y^i); i = 1, \\ldots, n\\}$ drawn from a stationary density $\\Xi(\\xi)$. The observations can be viewed as being generated according to the signal plus noise model 2 \n\n$$y^i = \\mu(x^i) + \\epsilon^i$$ (2) \n\nwhere $y^i$ is the observed response (dependent variable), $x^i$ are the independent variables sampled with input probability density $\\Omega(x)$, $\\epsilon^i$ is independent, identically distributed (iid) noise sampled with density $\\Phi(\\epsilon)$ having mean 0 and variance $\\sigma^2$,3 and $\\mu(x)$ is the conditional mean, an unknown function.  
From the signal plus noise perspective, the density $\\Xi(\\xi) = \\Xi(x, y)$ can be represented as the product of two components, the conditional density $\\Psi(y|x)$ and the input density $\\Omega(x)$: \n\n$$\\Xi(x, y) = \\Psi(y|x) \\, \\Omega(x) = \\Phi(y - \\mu(x)) \\, \\Omega(x)$$ (3) \n\nThe learning problem is then to find an estimate $\\hat{\\mu}(x)$ of the conditional mean $\\mu(x)$ on the basis of the training set $\\xi(n)$. \n\nIn many real world problems, few a priori assumptions can be made about the functional form of $\\mu(x)$. Since a parametric function class is usually not known, one must resort to a nonparametric regression approach, whereby one constructs an estimate $\\hat{\\mu}(x) = f(x)$ for $\\mu(x)$ from a large class of functions $F$ known to have good approximation properties (for example, $F$ could be all possible radial basis function networks and multilayer perceptrons). The class of approximation functions is usually the union of a countable set of subclasses (specific network architectures)4 $A \\subset F$ for which the elements of each subclass $f(w, x) \\in A$ are continuously parametrized by a set of $p = p(A)$ weights $w = \\{w^\\alpha; \\alpha = 1, \\ldots, p\\}$. The task of finding the estimate $f(x)$ thus consists of two problems: choosing the best architecture $A$ and choosing the best set of weights $w$ given the architecture. Note that in the nonparametric setting, there does not typically exist a function $f(w^*, x) \\in F$ with a finite number of parameters such that $f(w^*, x) = \\mu(x)$ for arbitrary $\\mu(x)$. For this reason, the estimators $\\hat{\\mu}(x) = f(w, x)$ will be biased estimators of $\\mu(x)$.5 \n\n2 The assumption of additive noise $\\epsilon$ which is independent of $x$ is a standard assumption and is not overly restrictive. Many other conceivable signal/noise models can be transformed into this form. For example, the multiplicative model $y = \\mu(x)(1 + \\epsilon)$ becomes $y' = \\mu'(x) + \\epsilon'$ for the transformed variable $y' = \\log(y)$. \n\n3 Note that we have made only a minimal assumption about the noise $\\epsilon$, that it has finite variance $\\sigma^2$ independent of $x$. Specifically, we do not need to make the assumption that the noise density $\\Phi(\\epsilon)$ is of known form (e.g. gaussian) for the following development. \n\n4 For example, a \"fully connected two layer perceptron with five internal units\". \n\nThe first problem (finding the architecture $A$) requires a search over possible architectures (e.g. network sizes and topologies), usually starting with small architectures and then considering larger ones. By necessity, the search is not usually exhaustive and must use heuristics to reduce search complexity. (A heuristic search procedure for two layer networks is presented in Moody and Utans (1992).) \n\nThe second problem (finding a good set of weights for $f(w, x)$) is accomplished by minimizing an objective function: \n\n$$\\hat{w}_\\lambda = \\arg\\min_w U(\\lambda, w, \\xi(n)) .$$ (4) \n\nThe objective function $U$ consists of an error function plus a regularizer: \n\n$$U(\\lambda, w, \\xi(n)) = n \\mathcal{E}_{train}(w, \\xi(n)) + \\lambda S(w)$$ (5) \n\nHere, the error $\\mathcal{E}_{train}(w, \\xi(n))$ measures the \"distance\" between the target response values $y^i$ and the fitted values $f(w, x^i)$: \n\n$$\\mathcal{E}_{train}(w, \\xi(n)) = \\frac{1}{n} \\sum_{i=1}^{n} \\mathcal{E}[y^i, f(w, x^i)] ,$$ (6) \n\nand $S(w)$ is a regularization or weight-decay function which biases the solution toward functions with a priori \"desirable\" characteristics, such as smoothness. The parameter $\\lambda \\geq 0$ is the regularization or weight decay parameter and must itself be optimized. 
6 \n\nThe most familiar example of an objective function uses the squared error7 $\\mathcal{E}[y^i, f(w, x^i)] = [y^i - f(w, x^i)]^2$ and a weight decay term: \n\n$$U(\\lambda, w, \\xi(n)) = \\sum_{i=1}^{n} (y^i - f(w, x^i))^2 + \\lambda \\sum_{\\alpha=1}^{p} g(w^\\alpha)$$ (7) \n\nThe first term is the sum of squared errors (SSE) of the model $f(w, x)$ with respect to the training data, while the second term penalizes either small, medium, or large weights, depending on the form of $g(w^\\alpha)$. Two common examples of weight decay functions are the ridge regression form $g(w^\\alpha) = (w^\\alpha)^2$ (which penalizes large weights) and the Rumelhart form $g(w^\\alpha) = (w^\\alpha)^2 / [(w^0)^2 + (w^\\alpha)^2]$ (which penalizes weights of intermediate values near $w^0$). \n\n5 By biased, we mean that the mean squared bias is nonzero: $MSB = \\int p(x) (\\langle \\hat{\\mu}(x) \\rangle_{\\xi} - \\mu(x))^2 dx > 0$. Here, $p(x)$ is some positive weighting function on the input space and $\\langle \\, \\rangle_{\\xi}$ denotes an expected value taken over possible training sets $\\xi(n)$. For unbiasedness ($MSB = 0$) to occur, there must exist a set of weights $w^*$ such that $f(w^*, x) = \\mu(x)$, and the learned weights $\\hat{w}$ must be \"close to\" $w^*$. For \"near unbiasedness\", we must have $w^* = \\arg\\min_w MSB(w)$ such that $MSB(w^*) \\approx 0$ and $\\hat{w}$ \"close to\" $w^*$. \n\n6 The optimization of $\\lambda$ will be discussed in Moody (1992). \n\n7 Other error functions, such as those used in generalized linear models (see for example McCullagh and Nelder 1983) or robust statistics (see for example Huber 1981) are more appropriate than the squared error if the noise is known to be non-gaussian or the data contains many outliers. \n\nAn example of a regularizer which is not explicitly a weight decay term is: \n\n$$S(w) = \\int dx \\, \\Omega(x) \\| \\partial_{xx} f(w, x) \\|^2 .$$ 
(8) \n\nThis is a smoothing term which penalizes functional fits with high curvature. \n\n3 Prediction Risk \n\nWith $\\hat{\\mu}(x) = f(w[\\xi(n)], x)$ denoting an estimate of the true regression function $\\mu(x)$ trained on a data set $\\xi(n)$, we wish to estimate the prediction risk $P$, which is the expected error of $\\hat{\\mu}(x)$ in predicting future data. In principle, we can either define $P$ for models $\\hat{\\mu}(x)$ trained on arbitrary training sets of size $n$ sampled from the unknown density $\\Psi(y|x) \\Omega(x)$ or for training sets of size $n$ with input density equal to the empirical density defined by the single training set available: \n\n$$\\Omega'(x) = \\frac{1}{n} \\sum_{i=1}^{n} \\delta(x - x^i) .$$ (9) \n\nFor such training sets, the $n$ inputs $x^i$ are held fixed, but the response variables $y^i$ are sampled with the conditional densities $\\Psi(y|x^i)$. Since $\\Omega'(x)$ is known, but $\\Omega(x)$ is generally not known a priori, we adopt the latter approach. \n\nFor a large ensemble of such training sets, the expected training set error is8 \n\n$$\\langle \\mathcal{E}_{train}(\\lambda) \\rangle_{\\xi} = \\left\\langle \\frac{1}{n} \\sum_{i=1}^{n} \\mathcal{E}[y^i, f(w[\\xi(n)], x^i)] \\right\\rangle_{\\xi} = \\int \\frac{1}{n} \\sum_{i=1}^{n} \\mathcal{E}[y^i, f(w[\\xi(n)], x^i)] \\left\\{ \\prod_{i=1}^{n} \\Psi(y^i|x^i) dy^i \\right\\}$$ (10) \n\nFor a future exemplar $(x, z)$ sampled with density $\\Psi(z|x) \\Omega(x)$, the prediction risk $P$ is defined as: \n\n$$P = \\int \\mathcal{E}[z, f(w[\\xi(n)], x)] \\, \\Psi(z|x) \\Omega(x) \\left\\{ \\prod_{i=1}^{n} \\Psi(y^i|x^i) dy^i \\right\\} dz \\, dx$$ (11) \n\nAgain, however, we don't assume that $\\Omega(x)$ is known, so computing (11) is not possible. \n\nFollowing Akaike (1970), Barron (1984), and numerous other authors (see Eubank 1988), we can define the prediction risk $P$ as the expected test set error for test sets of size $n$, $\\xi'(n) = \\{\\xi^{i\\prime} = (x^i, z^i); i = 1, \\ldots, n\\}$, having the empirical input density $\\Omega'(x)$. This expected test set error has the form: \n\n$$\\langle \\mathcal{E}_{test}(\\lambda) \\rangle_{\\xi \\xi'} = \\left\\langle \\frac{1}{n} \\sum_{i=1}^{n} \\mathcal{E}[z^i, f(w[\\xi(n)], x^i)] \\right\\rangle_{\\xi \\xi'} = \\int \\frac{1}{n} \\sum_{i=1}^{n} \\mathcal{E}[z^i, f(w[\\xi(n)], x^i)] \\left\\{ \\prod_{i=1}^{n} \\Psi(y^i|x^i) \\Psi(z^i|x^i) dy^i dz^i \\right\\}$$ (12) \n\n8 Following the physics convention, we use angled brackets $\\langle \\, \\rangle$ to denote expected values. The subscripts denote the random variables being integrated over. \n\nWe take (12) as a proxy for the true prediction risk $P$. In order to compute (12), it will not be necessary to know the precise functional form of the noise density $\\Phi(\\epsilon)$. Knowing just the noise variance $\\sigma^2$ will enable an exact calculation for linear models trained with the SSE error and an approximate calculation correct to second order for general nonlinear models. The results of these calculations are presented in the next two sections. \n\n4 The Expected Test Set Error for Linear Models \n\nThe relationship between expected training set and expected test set errors for linear models trained using the SSE error function with no regularizer is well known in statistics (Akaike 1970, Barron 1984, Eubank 1988). The exact relation for test and training sets with density (9) is: \n\n$$\\langle \\mathcal{E}_{test} \\rangle_{\\xi \\xi'} = \\langle \\mathcal{E}_{train} \\rangle_{\\xi} + 2 \\sigma^2 \\frac{p}{n}$$ (13) \n\nAs pointed out by Barron (1984), (13) can also apply approximately to the case of a nonlinear model $f(w, x)$ trained by minimizing the sum of squared errors SSE. This approximation can be arrived at in two ways. First, the model $f(\\hat{w}, x)$ can be treated as locally linear in a neighborhood of $\\hat{w}$. This approximation ignores the hessian and higher order shape of $f(w, x)$ in parameter space. Alternatively, the model $f(w, x)$ can be assumed to be locally quadratic in parameter space $w$ and unbiased. \n\nHowever, the extension of (13) as an approximate relation for nonlinear models breaks down if any of the following situations hold: \n\nThe SSE error function is not used. 
(For example, one may use a robust error measure (Huber 1981) or log likelihood error measure instead.) \n\nA regularization term is included in the objective function. (This introduces bias.) \n\nThe locally linear approximation for $f(\\hat{w}, x)$ is not good. \n\nThe unbiasedness assumption for $f(\\hat{w}, x)$ is incorrect. \n\n5 The Expected Test Set Error for Nonlinear Models \n\nFor neural network models, which are typically nonparametric (thus biased) and highly nonlinear, a new relationship is needed to replace (13). We have derived such a result correct to second order for nonlinear models: \n\n$$\\langle \\mathcal{E}_{test}(\\lambda) \\rangle_{\\xi \\xi'} \\approx \\langle \\mathcal{E}_{train}(\\lambda) \\rangle_{\\xi} + 2 \\sigma_{eff}^2 \\frac{p_{eff}(\\lambda)}{n}$$ (14) \n\nThis result differs from the classical result (13) by the appearance of $p_{eff}(\\lambda)$ (the effective number of parameters), $\\sigma_{eff}^2$ (the effective noise variance in the response variable(s)), and a dependence on $\\lambda$ (the regularization or weight decay parameter). \n\nA full derivation of (14) will be presented in a longer paper (Moody 1992). The result is obtained by considering the noise terms $\\epsilon^i$ for both the training and test sets as perturbations to an idealized model fit to noise free data. The perturbative expansion is computed out to second order in the $\\epsilon^i$ subject to the constraint that the estimated weights $\\hat{w}$ minimize the perturbed objective function. Computing expectation values and comparing the expansions for expected test and training errors yields (14). It is important to re-emphasize that deriving (14) does not require knowing the precise form of the noise density $\\Phi(\\epsilon)$. Only a knowledge of $\\sigma^2$ is assumed. \n\nThe effective number of parameters $p_{eff}(\\lambda)$ usually differs from the true number of model parameters $p$ and depends upon the amount of model bias, model nonlinearity, and on our prior model preferences (e.g. 
smoothness) as determined by the regularization parameter $\\lambda$ and the form of our regularizer. The precise form of $p_{eff}(\\lambda)$ is \n\n$$p_{eff}(\\lambda) = \\mathrm{tr}\\, C = \\frac{1}{2} \\sum_{i \\alpha \\beta} T_{i\\alpha} U^{-1}_{\\alpha\\beta} T_{\\beta i} ,$$ (15) \n\nwhere $C$ is the generalized influence matrix which generalizes the standard influence or hat matrix of linear regression, $T_{i\\alpha}$ is the $n \\times p$ matrix of derivatives of the training error function \n\n$$T_{i\\alpha} = \\frac{\\partial}{\\partial y^i} \\frac{\\partial}{\\partial w^\\alpha} n \\mathcal{E}(w, \\xi(n)) ,$$ (16) \n\nand $U^{-1}_{\\alpha\\beta}$ is the inverse of the hessian of the total objective function \n\n$$U_{\\alpha\\beta} = \\frac{\\partial}{\\partial w^\\alpha} \\frac{\\partial}{\\partial w^\\beta} U(\\lambda, w, \\xi(n))$$ (17) \n\nIn the general case that $\\sigma^2(x)$ varies with location in the input space $x$, the effective noise variance $\\sigma_{eff}^2$ is a weighted average of the noise variances $\\sigma^2(x^i)$. For the uniform signal plus noise model we have described above, $\\sigma_{eff}^2 = \\sigma^2$. \n\n6 The Effects of Regularization \n\nIn the neural network community, the most commonly used regularizers are weight decay functions. The use of weight decay is motivated by the intuitive notion that it removes unnecessary weights from the model. An analysis of $p_{eff}(\\lambda)$ with weight decay ($\\lambda > 0$) confirms this intuitive notion. Furthermore, whenever $\\sigma^2 > 0$ and $n < \\infty$, there exists some $\\lambda_{optimal} > 0$ such that the expected test set error (12) is minimized. This is because weight decay methods yield models with lower model variance, even though they are biased. These effects will be discussed further in Moody (1992). 
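As a minimal sketch of the trace formula (15), consider the simplest special case: a linear model f(w, x) = w . x trained with squared error and quadratic weight decay. This special case, the random design matrix, and the function names below are illustrative assumptions and are not taken from the experiments of this paper. Under these assumptions the generalized influence matrix reduces to the ridge regression hat matrix, so the computed value should equal p at lambda = 0 and decrease toward 0 as lambda grows.

```python
import numpy as np


def effective_parameters(X, lam):
    '''Evaluate (1/2) * trace(T U^-1 T^T) for a linear model f(w, x) = w . x
    trained with squared error plus quadratic weight decay lam * sum(w**2).
    (Hypothetical helper name, used only for this illustration.)
    '''
    n, p = X.shape
    # T[i, a] = d/dy_i d/dw_a of sum_i (y_i - w . x_i)^2 = -2 * X[i, a]
    T = -2.0 * X
    # Hessian of the total objective SSE + lam * sum(w^2) with respect to w
    U = 2.0 * (X.T @ X + lam * np.eye(p))
    return 0.5 * np.trace(T @ np.linalg.solve(U, T.T))


def eigenvalue_form(X, lam):
    '''Same quantity written with the eigenvalues of X^T X (ridge hat matrix trace).'''
    eig = np.linalg.eigvalsh(X.T @ X)
    return float(np.sum(eig / (eig + lam)))


# Assumed toy design matrix: 50 samples, 5 weights (not the test problem of the paper)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
lams = [0.0, 1.0, 10.0, 100.0, 1e6]
p_eff = [effective_parameters(X, lam) for lam in lams]
for lam, value in zip(lams, p_eff):
    print(f'lambda = {lam:g}: p_eff = {value:.3f}')
```

For a nonlinear model the matrices T and U would instead have to be evaluated at the trained weights, which this sketch does not attempt.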
\n\nFor models trained with squared error (SSE) and quadratic weight decay $g(w^\\alpha) = (w^\\alpha)^2$, the assumptions of unbiasedness9 or local linearizability lead to the following expression for $p_{eff}(\\lambda)$ which we call the linearized effective number of parameters $p_{lin}(\\lambda)$: \n\n$$p_{lin}(\\lambda) = \\sum_{\\alpha=1}^{p} \\frac{\\kappa^\\alpha}{\\kappa^\\alpha + \\lambda}$$ (18) \n\nHere, $\\kappa^\\alpha$ is the $\\alpha$th eigenvalue of the $p \\times p$ matrix $K = T^{\\dagger} T$, with $T$ as defined in (16). \n\nThe form of $p_{eff}(\\lambda)$ can be computed easily for other weight decay functions, such as the Rumelhart form $g(w^\\alpha) = (w^\\alpha)^2 / [(w^0)^2 + (w^\\alpha)^2]$. The basic result for all weight decay or regularization functions, however, is that $p_{eff}(\\lambda)$ is a decreasing function of $\\lambda$ with $p_{eff}(0) = p$ and $p_{eff}(\\infty) = 0$, as is evident in the special case (18). If the model is nonlinear and biased, then $p_{eff}(0)$ generally differs from $p$. \n\n9 Strictly speaking, a model with quadratic weight decay is unbiased only if the \"true\" weights are 0. \n\n[Figure 1 plot, titled Implied, Linearized, and Full P-effective: curves labeled Linearized, Full, and Implied plotted against the Weight Decay Parameter (Lambda).] \n\nFigure 1: The full $p_{eff}(\\lambda)$ (15) agrees with the implied $p_{imp}(\\lambda)$ (19) to within experimental error, whereas the linearized $p_{lin}(\\lambda)$ (18) does not (except for very large $\\lambda$). These results verify the significance of (14) and (15) for nonlinear models. \n\n7 Testing the Theory \n\nTo test the result (14) in a nonlinear context, we computed the full $p_{eff}(\\lambda)$ (15), the linearized $p_{lin}(\\lambda)$ (18), and the implied number of parameters $p_{imp}(\\lambda)$ (19) for a nonlinear test problem.  
The value of $p_{imp}(\\lambda)$ is obtained by computing the expected training and test errors for an ensemble of training sets of size $n$ with known noise variance $\\sigma^2$ and solving for $p_{eff}(\\lambda)$ in equation (14): \n\n$$p_{imp}(\\lambda) = \\frac{n}{2 \\sigma^2} \\left( \\widehat{\\langle \\mathcal{E}_{test}(\\lambda) \\rangle}_{\\xi \\xi'} - \\widehat{\\langle \\mathcal{E}_{train}(\\lambda) \\rangle}_{\\xi} \\right)$$ (19) \n\nThe hats indicate Monte Carlo estimates based on computations using a finite ensemble (10 in our experiments) of training sets. The test problem was to fit training sets of size 50 generated as a sum of three sigmoids plus noise, with the noise sampled from the uniform density. The model architecture $f(w, x)$ was also a sum of three sigmoids and the weights $w$ were estimated by minimizing (7) with quadratic weight decay. See figure 1. \n\n8 GPE: An Estimate of Prediction Risk for Nonlinear Systems \n\nA number of well established, closely related criteria for estimating the prediction risk for linear or linearizable models are available. These include Akaike's FPE (1970), Akaike's AIC (1973), Mallows' $C_p$ (1973), and Barron's PSE (1984). (See also Akaike (1974) and Eubank (1988).) These estimates are all based on equation (13). \n\nThe generalized prediction error (GPE) generalizes the classical estimators FPE, AIC, $C_p$, and PSE to the nonlinear setting by estimating (14) as follows: \n\n$$GPE(\\lambda) = \\hat{P}_{GPE} = \\mathcal{E}_{train}(n) + 2 \\hat{\\sigma}_{eff}^2 \\frac{p_{eff}(\\lambda)}{n} .$$ (20) \n\nThe estimation process and the quality of the resulting GPE estimates will be described in greater detail elsewhere. \n\nAcknowledgements \n\nThe author wishes to thank Andrew Barron and Joseph Chang for helpful conversations. This research was supported by AFOSR grant 89-0478 and ONR grant N00014-89-J-1228. \n\nReferences \n\nH. Akaike. (1970). Statistical predictor identification. Ann. Inst. Stat. Math., 22:203. \n\nH. Akaike. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Intl. Symp. on Information Theory, Akademia Kiado, Budapest, 267. \n\nH. Akaike. (1974). A new look at the statistical model identification. IEEE Trans. Auto. Control, 19:716-723. \n\nA. Barron. (1984). Predicted squared error: a criterion for automatic model selection. In Self-Organizing Methods in Modeling, S. Farlow, ed., Marcel Dekker, New York. \n\nR. Eubank. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York. \n\nP. J. Huber. (1981). Robust Statistics. Wiley, New York. \n\nC. L. Mallows. (1973). Some comments on $C_p$. Technometrics 15:661-675. \n\nP. McCullagh and J.A. Nelder. (1983). Generalized Linear Models. Chapman and Hall, New York. \n\nJ. Moody. (1991). Note on Generalization, Regularization, and Architecture Selection in Nonlinear Learning Systems. In B.H. Juang, S.Y. Kung, and C.A. Kamm, editors, Neural Networks for Signal Processing, IEEE Press, Piscataway, NJ. \n\nJ. Moody. (1992). Long version of this paper, in preparation. \n\nJ. Moody and J. Utans. (1992). Principled architecture selection for neural networks: application to corporate bond rating prediction. In this volume. \n\n", "award": [], "sourceid": 530, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}]}