{"title": "Kernel Regression and Backpropagation Training With Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1039, "abstract": null, "full_text": "Kernel Regression and Backpropagation Training with Noise\n\nPetri Koistinen and Lasse Holmstr\u00f6m\n\nRolf Nevanlinna Institute, University of Helsinki\nTeollisuuskatu 23, SF-00510 Helsinki, Finland\n\nAbstract\n\nOne method proposed for improving the generalization capability of a feedforward network trained with the backpropagation algorithm is to use artificial training vectors which are obtained by adding noise to the original training vectors. We discuss the connection of such backpropagation training with noise to kernel density and kernel regression estimation. We compare by simulated examples (1) backpropagation, (2) backpropagation with noise, and (3) kernel regression in mapping estimation and pattern classification contexts.\n\n1 INTRODUCTION\n\nLet $X$ and $Y$ be random vectors taking values in $\\mathbb{R}^d$ and $\\mathbb{R}^p$, respectively. Suppose that we want to estimate $Y$ in terms of $X$ using a feedforward network whose input-output mapping we denote by $y = g(x, w)$. Here the vector $w$ includes all the weights and biases of the network. Backpropagation training using the quadratic loss (or error) function can be interpreted as an attempt to minimize the expected loss\n\n$$\\lambda(w) = E\\|g(X, w) - Y\\|^2. \\qquad (1)$$\n\nSuppose that $E\\|Y\\|^2 < \\infty$. Then the regression function\n\n$$m(x) = E[Y \\mid X = x] \\qquad (2)$$\n\nminimizes the loss $E\\|b(X) - Y\\|^2$ over all Borel measurable mappings $b$. Therefore, backpropagation training can also be viewed as an attempt to estimate $m$ with the network $g$.\n\nIn practice, one cannot minimize $\\lambda$ directly because one does not know enough about the distribution of $(X, Y)$. Instead one minimizes a sample estimate\n\n$$\\hat{\\lambda}_n(w) = \\frac{1}{n} \\sum_{i=1}^{n} \\|g(x_i, w) - y_i\\|^2 \\qquad (3)$$\n\nin the hope that weight vectors $w$ that are near optimal for $\\hat{\\lambda}_n$ are also near optimal for $\\lambda$. In fact, under rather mild conditions the minimizer of $\\hat{\\lambda}_n$ converges towards the minimizing set of weights for $\\lambda$ as $n \\to \\infty$, with probability one (White, 1989). However, if $n$ is small compared to the dimension of $w$, minimization of $\\hat{\\lambda}_n$ can easily lead to overfitting and poor generalization, i.e., weights that render $\\hat{\\lambda}_n$ small may produce a large expected error $\\lambda$.\n\nMany cures for overfitting have been suggested. One can divide the available samples into a training set and a validation set, perform iterative minimization using the training set and stop minimization when network performance over the validation set begins to deteriorate (Holmstr\u00f6m et al., 1990, Weigend et al., 1990). In another approach, the minimization objective function is modified to include a term which tries to discourage the network from becoming too complex (Weigend et al., 1990). Network pruning (see, e.g., Sietsma and Dow, 1991) has similar motivation. Here we consider the approach of generating artificial training vectors by adding noise to the original samples. We have recently analyzed such an approach and proved its asymptotic consistency under certain technical conditions (Holmstr\u00f6m and Koistinen, 1990).\n\n2 ADDITIVE NOISE AND KERNEL REGRESSION\n\nSuppose that we have $n$ original training vectors $(x_i, y_i)$ and want to generate artificial training vectors using additive noise. 
If the distributions of both $X$ and $Y$ are continuous it is natural to add noise to both the $X$ and $Y$ components of the sample. However, if the distribution of $X$ is continuous and that of $Y$ is discrete (e.g., in pattern classification), it feels more natural to add noise to the $X$ components only. In Figure 1 we present sampling procedures for both cases. In the x-only case the additive noise is generated from a random vector $S_X$ with density $K_X$ whereas in the x-and-y case the noise is generated from a random vector $S_{XY}$ with density $K_{XY}$. Notice that we control the magnitude of noise with a scalar smoothing parameter $h > 0$.\n\nProcedure 1. (Add noise to x only)\n1. Select $i \\in \\{1, \\ldots, n\\}$ with equal probability for each index.\n2. Draw a sample $s_x$ from density $K_X$ on $\\mathbb{R}^d$.\n3. Set $x_h^{(n)} = x_i + h s_x$ and $y_h^{(n)} = y_i$.\n\nProcedure 2. (Add noise to both x and y)\n1. Select $i \\in \\{1, \\ldots, n\\}$ with equal probability for each index.\n2. Draw a sample $(s_x, s_y)$ from density $K_{XY}$ on $\\mathbb{R}^{d+p}$.\n3. Set $x_h^{(n)} = x_i + h s_x$ and $y_h^{(n)} = y_i + h s_y$.\n\nFigure 1: Two Procedures for Generating Artificial Training Vectors.\n\nIn both cases the sampling procedures can be thought of as generating random samples from new random vectors $X_h^{(n)}$ and $Y_h^{(n)}$. Using the same argument as in the Introduction we see that a network trained with the artificial samples tends to approximate the regression function $E[Y_h^{(n)} \\mid X_h^{(n)}]$. Generate $I$ uniformly on $\\{1, \\ldots, n\\}$ and denote by $f$ and $f(\\cdot \\mid I = i)$ the density and conditional density of $X_h^{(n)}$. Then in the x-only case we get\n\n$$m_h^{(n)}(X_h^{(n)}) := E[Y_h^{(n)} \\mid X_h^{(n)}] = \\sum_{i=1}^{n} y_i P(I = i \\mid X_h^{(n)}) = \\sum_{i=1}^{n} y_i \\frac{f(X_h^{(n)} \\mid I = i) P(I = i)}{f(X_h^{(n)})} = \\sum_{i=1}^{n} y_i \\frac{h^{-d} K_X((X_h^{(n)} - x_i)/h) \\, n^{-1}}{\\sum_{j=1}^{n} n^{-1} h^{-d} K_X((X_h^{(n)} - x_j)/h)}.$$\n\nDenoting $K_X$ by $k$ we obtain\n\n$$m_h^{(n)}(x) = \\frac{\\sum_{i=1}^{n} k((x - x_i)/h) \\, y_i}{\\sum_{j=1}^{n} k((x - x_j)/h)}. \\qquad (4)$$\n\nWe arrive at the same expression in the x-and-y case provided that $\\int y K_{XY}(x, y) \\, dy = 0$ and that we take $k(x) = \\int K_{XY}(x, y) \\, dy$ (Watson, 1964). The expression (4) is known as the (Nadaraya-Watson) kernel regression estimator (Nadaraya, 1964, Watson, 1964, Devroye and Wagner, 1980).\n\nA common way to train a $p$-class neural network classifier is to train the network to associate a vector $x$ from class $j$ with the $j$'th unit vector $(0, \\ldots, 0, 1, 0, \\ldots, 0)$. It is easy to see that then the kernel regression estimator components estimate the class a posteriori probabilities using (Parzen-Rosenblatt) kernel density estimators for the class conditional densities. Specht (1990) argues that such a classifier can be considered a neural network. Analogously, a kernel regression estimator can be considered a neural network though such a network would need a number of units proportional to the number of training samples. Recently Specht (1991) has advocated using kernel regression and has also presented a clustering variant requiring only a fixed number of units. Notice also the resemblance of kernel regression to certain radial basis function schemes (Moody and Darken, 1989, Stokbro et al., 1990). 
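\n\nThe claim about class a posteriori probabilities can be checked with a short calculation. This is a sketch only: the symbols $c_i$, $n_j$, $\\hat{P}_j$ and $\\hat{f}_j$ are introduced here for illustration and are assumptions, not notation from the rest of the text. Assume sample $i$ carries class label $c_i$, so that $y_i$ is the $c_i$'th unit vector, and that each class $j$ has $n_j > 0$ samples. Write $\\hat{P}_j = n_j / n$ for the empirical class prior and $\\hat{f}_j(x) = (n_j h^d)^{-1} \\sum_{i : c_i = j} k((x - x_i)/h)$ for the kernel density estimate of the class-conditional density. Then the $j$'th component of (4) can be written as\n\n$$[m_h^{(n)}(x)]_j = \\frac{\\sum_{i : c_i = j} k((x - x_i)/h)}{\\sum_{r=1}^{n} k((x - x_r)/h)} = \\frac{(n h^d)^{-1} \\sum_{i : c_i = j} k((x - x_i)/h)}{(n h^d)^{-1} \\sum_{r=1}^{n} k((x - x_r)/h)} = \\frac{\\hat{P}_j \\hat{f}_j(x)}{\\sum_{l=1}^{p} \\hat{P}_l \\hat{f}_l(x)},$$\n\nsince $\\hat{P}_j \\hat{f}_j(x) = (n h^d)^{-1} \\sum_{i : c_i = j} k((x - x_i)/h)$. Under these assumptions the result has the form of Bayes' rule with the priors and class-conditional densities replaced by their estimates.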
\n\nAn often used method for choosing $h$ is to minimize the cross-validated error (H\u00e4rdle and Marron, 1985, Friedman and Silverman, 1989)\n\n$$M(h) = \\frac{1}{n} \\sum_{i=1}^{n} \\|y_i - m_{h,i}^{(n)}(x_i)\\|^2, \\qquad (5)$$\n\nwhere $m_{h,i}^{(n)}$ denotes the kernel regression estimate (4) computed with the $i$'th sample point left out. Another possibility is to use a method suggested by kernel density estimation theory (Duin, 1976, Habbema et al., 1974) whereby one chooses the $h$ that maximizes a cross-validated (pseudo) likelihood function\n\n$$L_{XY}(h) = \\prod_{i=1}^{n} f_{xy,h,i}^{n}(x_i, y_i), \\qquad L_X(h) = \\prod_{i=1}^{n} f_{x,h,i}^{n}(x_i), \\qquad (6)$$\n\nwhere $f_{xy,h,i}^{n}$ ($f_{x,h,i}^{n}$) is a kernel density estimate with kernel $K_{XY}$ ($K_X$) and smoothing parameter $h$ but with the $i$'th sample point left out.\n\n3 EXPERIMENTS\n\nIn the first experiment we try to estimate a mapping $g_0$ from noisy data $(x, y)$,\n\n$$Y = g_0(X) + N_y = a \\sin X + b + N_y, \\qquad a = 0.4, \\; b = 0.5,$$\n$$X \\sim \\mathrm{UNI}(-\\pi, \\pi), \\qquad N_y \\sim N(0, \\sigma^2), \\qquad \\sigma = 0.1.$$\n\nHere UNI and $N$ denote the uniform and the normal distribution. We experimented with backpropagation, backpropagation with noise and kernel regression. The backpropagation loss function was minimized using Marquardt's method. The network architecture was FN-1-13-1 with 40 adaptable weights (a feedforward network with one input, 13 hidden nodes, one output, and logistic activation functions in the hidden and output layers). We started the local optimizations from 3 different random initial weights and kept the weights giving the least value for $\\hat{\\lambda}_n$. Backpropagation training with noise was similar except that instead of the original $n$ vectors we used $10n$ artificial vectors generated with Procedure 2 using $S_{XY} \\sim N(0, I_2)$. The magnitude of the noise was chosen with the criterion $L_{XY}$ (which, for backpropagation, gave better results than $M$). In the kernel regression experiments $S_{XY}$ was kept the same. 
Table 1 characterizes the distribution of $J$, the expected squared distance of the estimator $g$ ($g(\\cdot, w)$ or $m_h^{(n)}$) from $g_0$,\n\n$$J = E[g(X) - g_0(X)]^2.$$\n\nTable 2 characterizes the distribution of $h$ chosen according to the criteria $L_{XY}$ and $M$ and Figure 2 shows the estimators in one instance. Notice that, on the average, kernel regression is better than backpropagation with noise which is better than plain backpropagation. The success of backpropagation with noise is partly due to the fact that $\\sigma$ and $n$ have here been picked favorably. Notice too that in kernel regression the results with the two cross-validation methods are similar although the $h$ values they suggest are clearly different.\n\nIn the second experiment we trained classifiers for a four-dimensional two-class problem with equal a priori probabilities and class-conditional densities $N(\\mu_1, C_1)$ and $N(\\mu_2, C_2)$,\n\n$$\\mu_1 = 2.32 \\, [1 \\; 0 \\; 0 \\; 0]^T, \\quad C_1 = I_4; \\qquad \\mu_2 = 0, \\quad C_2 = 4 I_4.$$\n\nAn FN-4-6-2 with 44 adaptable weights was trained to associate vectors from class 1 with $[0.9 \\; 0.1]^T$ and vectors from class 2 with $[0.1 \\; 0.9]^T$. We generated $n/2$ original vectors from each class and a total of $10n$ artificial vectors using Procedure 1 with $S_X \\sim N(0, I_4)$. We chose the smoothing parameters, $h_1$ and $h_2$, separately for the two classes using the criterion $L_X$: $h_i$ was chosen by evaluating $L_X$ on class $i$ samples only. We formed separate kernel regression estimators for each class; the $i$'th estimator was trained to output 1 for class $i$ vectors and 0 for the other sample vectors. The $M$ criterion then produces equal values for $h_1$ and $h_2$. The classification rule was to classify $x$ to class $i$ if the output corresponding to the $i$'th class was the maximum output. The error rates are given in Table 3. 
(The error rate of the Bayesian classifier is 0.116 in this task.) Table 4 summarizes the distribution of $h_1$ and $h_2$ as selected by $L_X$ and $M$.\n\nTable 1: Results for Mapping Estimation. Mean value (left) and standard deviation (right) of $J$ based on 100 repetitions are given for each method.\n\nn     BP                      BP+noise, Lxy           Kernel regression, Lxy  Kernel regression, M\n      mean        std         mean        std         mean        std         mean        std\n40    .0218       .016        .0104       .0079       .00446      .0022       .00365      .0019\n80    .00764      .0048       .00526      .0018       .00250      .00078      .00191      .00077\n\nTable 2: Values of $h$ Suggested by the Two Cross-validation Methods in the Mapping Estimation Experiment. Mean value and standard deviation based on 100 repetitions are given.\n\nn     Lxy                     M\n      mean        std         mean        std\n40    0.149       0.020       0.276       0.086\n80    0.114       0.011       0.241       0.062\n\nTable 3: Error Rates for the Different Classifiers. Mean value and standard deviation based on 25 repetitions are given for each method.\n\nn     BP                      BP+noise, Lx            Kernel regression, Lx   Kernel regression, M\n      mean        std         mean        std         mean        std         mean        std\n44    .281        .054        .189        .018        .201        .022        .207        .027\n88    .264        .028        .163        .011        .182        .010        .184        .013\n176   .210        .023        .145        .010        .164        .0089       .164        .011\n\nTable 4: Values of $h_1$ and $h_2$ Suggested by the Two Cross-validation Methods in the Classification Experiment. Mean value and standard deviation based on 25 repetitions are given.\n\nn     Lx, h1                  Lx, h2                  M, h1 = h2\n      mean        std         mean        std         mean        std\n44    .818        .078        1.61        .14         1.14        .27\n88    .738        .056        1.48        .11         1.01        .19\n176   .668        .048        1.35        .090        .868        .10\n\n4 CONCLUSIONS\n\nAdditive noise can improve the generalization capability of a feedforward network trained with the backpropagation approach. 
The magnitude of the noise cannot be selected blindly, though. Cross-validation-type procedures seem to be well suited to the selection of the noise magnitude. Kernel regression, however, seems to perform well whenever backpropagation with noise performs well. If the kernel is fixed in kernel regression, we only have to choose the smoothing parameter $h$, and the method is not overly sensitive to its selection.\n\nReferences\n\n[Devroye and Wagner, 1980] Devroye, L. and Wagner, T. (1980). Distribution-free consistency results in nonparametric discrimination and regression function estimation. The Annals of Statistics, 8(2):231-239.\n\n[Duin, 1976] Duin, R. P. W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25:1175-1179.\n\n[Friedman and Silverman, 1989] Friedman, J. and Silverman, B. (1989). Flexible parsimonious smoothing and additive modeling. Technometrics, 31(1):3-21.\n\n[Habbema et al., 1974] Habbema, J. D. F., Hermans, J., and van den Broek, K. (1974). A stepwise discriminant analysis program using density estimation. In Bruckmann, G., editor, COMPSTAT 1974, pages 101-110, Wien. Physica Verlag.\n\n[H\u00e4rdle and Marron, 1985] H\u00e4rdle, W. and Marron, J. (1985). Optimal bandwidth selection in nonparametric regression function estimation. The Annals of Statistics, 13(4):1465-1481.\n\n[Holmstr\u00f6m and Koistinen, 1990] Holmstr\u00f6m, L. and Koistinen, P. (1990). Using additive noise in back-propagation training. Research Reports A3, Rolf Nevanlinna Institute. To appear in IEEE Trans. Neural Networks.\n\n[Holmstr\u00f6m et al., 1990] Holmstr\u00f6m, L., Koistinen, P., and Ilmoniemi, R. J. (1990). Classification of unaveraged evoked cortical magnetic fields. In Proc. 
IJCNN-90-WASH DC, pages II:359-362. Lawrence Erlbaum Associates.\n\n[Moody and Darken, 1989] Moody, J. and Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294.\n\n[Nadaraya, 1964] Nadaraya, E. (1964). On estimating regression. Theor. Probability Appl., 9:141-142.\n\n[Sietsma and Dow, 1991] Sietsma, J. and Dow, R. J. F. (1991). Creating artificial neural networks that generalize. Neural Networks, 4:67-79.\n\n[Specht, 1991] Specht, D. (1991). A general regression neural network. IEEE Transactions on Neural Networks, 2(6):568-576.\n\n[Specht, 1990] Specht, D. F. (1990). Probabilistic neural networks. Neural Networks, 3(1):109-118.\n\n[Stokbro et al., 1990] Stokbro, K., Umberger, D., and Hertz, J. (1990). Exploiting neurons with localized receptive fields to learn chaos. NORDITA preprint.\n\n[Watson, 1964] Watson, G. (1964). Smooth regression analysis. Sankhy\u0101 Ser. A, 26:359-372.\n\n[Weigend et al., 1990] Weigend, A., Huberman, B., and Rumelhart, D. (1990). Predicting the future: A connectionist approach. International Journal of Neural Systems, 1(3):193-209.\n\n[White, 1989] White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1:425-464.\n\n[Figure 2 plot: two panels. Legend of the first panel: true, kernel. Legend of the second panel: BP, BP+noise.]\n\nFigure 2: Results From a Mapping Estimation Experiment. 
Shown are the $n = 40$ original vectors (o's), the artificial vectors (dots), the true function $a \\sin x + b$ and the fitting results using kernel regression, backpropagation and backpropagation with noise. Here $h = 0.16$ was chosen with $L_{XY}$. Values of $J$ are 0.0075 (kernel regression), 0.014 (backpropagation with noise) and 0.038 (backpropagation).", "award": [], "sourceid": 459, "authors": [{"given_name": "Petri", "family_name": "Koistinen", "institution": null}, {"given_name": "Lasse", "family_name": "Holmstr\u00f6m", "institution": null}]}