{"title": "Synaptic Weight Noise During MLP Learning Enhances Fault-Tolerance, Generalization and Learning Trajectory", "book": "Advances in Neural Information Processing Systems", "page_first": 491, "page_last": 498, "abstract": null, "full_text": "Synaptic Weight Noise During MLP Learning Enhances Fault-Tolerance, Generalisation and Learning Trajectory\n\nAlan F. Murray\nDept. of Electrical Engineering\nEdinburgh University\nScotland\n\nPeter J. Edwards\nDept. of Electrical Engineering\nEdinburgh University\nScotland\n\nAbstract\n\nWe analyse the effects of analog noise on the synaptic arithmetic during MultiLayer Perceptron training, by expanding the cost function to include noise-mediated penalty terms. Predictions are made in the light of these calculations which suggest that fault tolerance, generalisation ability and learning trajectory should be improved by such noise-injection. Extensive simulation experiments on two distinct classification problems substantiate the claims. The results appear to be perfectly general for all training schemes where weights are adjusted incrementally, and have wide-ranging implications for all applications, particularly those involving \"inaccurate\" analog neural VLSI.\n\n1 Introduction\n\nThis paper demonstrates both by consideration of the cost function and the learning equations, and by simulation experiments, that injection of random noise on to MLP weights during learning enhances fault-tolerance without additional supervision. We also show that the nature of the hidden node states and the learning trajectory is altered fundamentally, in a manner that improves training times and learning quality. 
The enhancement uses the mediating influence of noise to distribute information optimally across the existing weights.\n\nTaylor [Taylor, 72] has studied noisy synapses, largely in a biological context, and infers that the noise might assist learning. We have already demonstrated that noise injection both reduces the learning time and improves the network's generalisation ability [Murray, 91],[Murray, 92]. It is established [Matsuoka, 92],[Bishop, 90] that adding noise to the training data in neural (MLP) learning improves the \"quality\" of learning, as measured by the trained network's ability to generalise. Here we infer (synaptic) noise-mediated terms that sculpt the error function to favour faster learning, and that generate more robust internal representations, giving rise to better generalisation and immunity to small variations in the characteristics of the test data. Much closer to the spirit of this paper is the work of Hanson [Hanson, 90]. His stochastic version of the delta rule effectively adapts weight means and standard deviations. Also Sequin and Clay [Sequin, 91] use stuck-at faults during training, which imbues the trained network with an ability to withstand such faults. They also note, but do not pursue, an increased generalisation ability.\n\nThis paper presents an outline of the mathematical predictions and verification simulations. A full description of the work is given in [Murray, 93].\n\n2 Mathematics\n\nLet us analyse an MLP with I input, J hidden and K output nodes, with a set of P training input vectors $o_p = \\{o_{ip}\\}$, looking at the effect of noise injection into the error function itself. 
We are thus able to infer, from the additional terms introduced by noise, the characteristics of solutions that tend to reduce the error, and those which tend to increase it. The former will clearly be favoured, or at least stabilised, by the additional terms, while the latter will be de-stabilised.\n\nLet each weight $T_{ab}$ be augmented by a random noise source, such that $T_{ab} \\rightarrow T_{ab} + \\Delta_{ab} T_{ab}$, for all weights $\\{T_{ab}\\}$. Neuron thresholds are treated in precisely the same way. Note in passing, but importantly, that this synaptic noise is not the same as noise on the input data. Input noise is correlated across the synapses leaving an input node, while the synaptic noise that forms the basis of this study is not. The effect is thus quite distinct.\n\nConsidering, therefore, an error function of the form :-\n\n$$\\epsilon_{tot,p} = \\frac{1}{2} \\sum_{k=0}^{K-1} \\epsilon_{kp}^2 = \\frac{1}{2} \\sum_{k=0}^{K-1} \\left( o_{kp}(\\{T_{ab}\\}) - \\tilde{o}_{kp} \\right)^2 \\qquad (1)$$\n\nwhere $\\tilde{o}_{kp}$ is the target output. We can now perform a Taylor expansion of the output $o_{kp}$ to second order, around the noise-free weight set, $\\{T_N\\}$, and thus augment the error function :-\n\n$$o_{kp} \\rightarrow o_{kp} + \\sum_{ab} T_{ab} \\Delta_{ab} \\left( \\frac{\\partial o_{kp}}{\\partial T_{ab}} \\right) + \\frac{1}{2} \\sum_{ab,cd} T_{ab} \\Delta_{ab} T_{cd} \\Delta_{cd} \\left( \\frac{\\partial^2 o_{kp}}{\\partial T_{ab} \\partial T_{cd}} \\right) + O(\\Delta^3) \\qquad (2)$$\n\nIf we ignore terms of order $\\Delta^3$ and above, and take the time average over the learning phase, we can infer that two terms are added to the error function :-\n\n$$\\langle \\epsilon_{tot} \\rangle = \\langle \\epsilon_{tot}(\\{T_N\\}) \\rangle + \\frac{1}{2P} \\sum_{p=1}^{P} \\sum_{k=0}^{K-1} \\Delta^2 \\sum_{ab} T_{ab}^2 \\left[ \\left( \\frac{\\partial o_{kp}}{\\partial T_{ab}} \\right)^2 + \\epsilon_{kp} \\left( \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2} \\right) \\right] \\qquad (3)$$\n\nConsider also the perceptron rule update on the hidden-output layer along with the expanded error function :-\n\n$$\\langle \\delta T_{kj} \\rangle = -\\tau \\sum_{p} \\langle \\epsilon_{kp} o_{jp} o'_{kp} \\rangle - \\tau \\frac{\\Delta^2}{2} \\sum_{p} \\langle o_{jp} o'_{kp} \\rangle \\times \\sum_{ab} T_{ab}^2 \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2} \\qquad (4)$$\n\naveraged over several training epochs (which is acceptable for small values of $\\tau$, the adaption rate parameter).\n\n3 Simulations\n\nThe simulations detailed below are based on the virtual targets algorithm [Murray, 92], a variant on backpropagation, with broadly similar performance. The \"targets\" algorithm was chosen for its faster convergence properties. Two contrasting classification tasks were selected to verify the predictions made in the following section by simulation. The first, a feature location task, uses real world normalised greyscale image data. The task was to locate eyes in facial images - to classify sections of these as either \"eye\" or \"not-eye\". The network was trained on 16 x 16 preclassified sections of the images, classified as eyes and not-eyes. The not-eyes were random sections of facial images, avoiding the eyes (see Fig. 1).\n\nFigure 1: The eye/not-eye classifier (a 16 x 16 image section is classified as \"eye\" or \"not-eye\").\n\nThe second, a more artificial task, was the ubiquitous character encoder (Fig. 
2) where a 25-dimensional binary input vector describing the 26 alphabetic characters (each 5 x 5 pixels) was used to train the network with a one-out-of-26 output code.\n\nFigure 2: The character encoder task.\n\nDuring the simulations noise was added to the weights at a level proportional to the weight size and with a uniform probability density (i.e. $-\\Delta_{max} < \\Delta < \\Delta_{max}$). Levels of up to 40% were probed in detail - although it is clear that the expansion above is not quantitatively valid at this level. Above these percentages further improvements were seen in the network performance, although the dynamics of the training algorithm became chaotic. The injected noise level was reduced smoothly to a minimum value of 1% as the network approached convergence (as evidenced by the highest output bit error). As in all neural network simulations, the results depended upon the training parameters, network sizes and the random start position of the network. To overcome these factors and to achieve a meaningful result, 35 weight sets were produced for each noise level. All other characteristics of the training process were held constant. The results are therefore not simply pathological freaks.\n\n4 Prediction/Verification\n\n4.1 Fault Tolerance\n\nConsider the first derivative penalty term in the expanded cost function (3), averaged over all patterns, output nodes and weights :-\n\n$$K \\times \\Delta^2 \\left[ T_{ab}^2 \\left( \\frac{\\partial o_{kp}}{\\partial T_{ab}} \\right)^2 \\right] \\qquad (5)$$\n\nThe implications of this term are straightforward. For large values of the (weighted) average magnitude of the derivative, the overall error is increased. 
This term therefore causes solutions to be favoured where the dependence of outputs on individual weights is evenly distributed across the entire weight set. Furthermore, weight saliency should not only have a lower average value, but a smaller scatter across the weight set as the training process attempts to reconcile the competing pressures to reduce both (1) and (5). This more distributed representation should be manifest in an improved tolerance to faulty weights.\n\nFigure 3: Fault tolerance in the character encoder problem. (Horizontal axis: Synapses Removed (%); curves for noise = 0%, 10%, 20%, 30% and 40%.)\n\nSimulations were carried out on 35 weight sets produced for each of the two problems at each of 5 levels of noise injected during training. Weights were then randomly removed and the networks tested on the training data. The resulting graphs (Fig. 3, 4) show graceful degradation with an increased tolerance to faults with injected noise during training. The networks were highly constrained for these simulations to remove some of the natural redundancy of the MLP structure. Although the eye/not-eye problem contains a high proportion of redundant information, the improvement in the network's ability to withstand damage, with injected noise, is clear.\n\nFigure 4: Fault tolerance enhancement in the eye/not-eye classifier. (Horizontal axis: Synapses Removed (%); curves for noise = 0%, 10%, 20%, 30% and 40%.)\n\n4.2 Generalisation Ability\n\nConsidering the derivative in equation (5), and looking at the input-hidden weights $T_{ji}$, the term that is added to the error function, again averaged over all patterns, output nodes and weights, is :-\n\n$$K \\times \\Delta^2 \\left[ T_{ji}^2 \\left( o'_{kp} T_{kj} o'_{jp} o_{ip} \\right)^2 \\right] \\qquad (6)$$\n\nIf an output neuron has a non-zero connection from a particular hidden node ($T_{kj} \\neq 0$), and provided the input $o_{ip}$ is non-zero and is connected to the hidden node ($T_{ji} \\neq 0$), there is also a term $o'^{2}_{jp}$ that will tend to favour solutions with the hidden nodes also turned firmly ON or OFF (i.e. $o_{jp}$ = 0 or 1). Remembering, of course, that all these terms are noise-mediated, and that during the early stages of training, the \"actual\" error $\\epsilon_{kp}$, in (1), will dominate, this term will de-stabilise final solutions that balance the hidden nodes on the slope of the sigmoid. Naturally, hidden nodes $o_j$ that are firmly ON or OFF are less likely to change state as a result of small variations in the input data $\\{o_i\\}$. This should become evident in an increased tolerance to input perturbations and therefore an increased generalisation ability.\n\nSimulations were again carried out on the two problems using 35 weight sets for each level of injected synaptic noise during training. For the character encoder problem generalisation is not really an issue, but it is possible to verify the above prediction by introducing random gaussian noise into the input data and noting the degradation in performance. The results of these simulations are shown in Fig. 5, and clearly show an increased ability to withstand input perturbation, with injected noise into the synapses during training.\n\nGeneralisation ability for the eye/not-eye problem is a real issue. 
This problem therefore gives a valid test of whether the synaptic noise technique actually improves generalisation performance. The networks were therefore tested on previously unseen facial images and the results are shown in Table 1. These results show dramatically improved generalisation ability with increased levels of injected synaptic noise during training. An improvement of approximately 8% is seen - consistent with earlier results on a different \"real\" problem [Murray, 91].\n\nFigure 5: Generalisation enhancement shown through increased tolerance to input perturbation, in the character encoder problem. (Horizontal axis: Noise Level (%); curves for perturbation = 0.05, 0.10, 0.15 and 0.20.)\n\nNoise Level | 0% | 10% | 20% | 30% | 40%\nTest Patterns Correctly Classified (%) | 67.875 | 70.406 | 70.416 | 72.454 | 75.446\n\nTable 1: Generalisation enhancement shown through increased ability to classify previously unseen data, in the eye/not-eye task.\n\n4.3 Learning Trajectory\n\nConsidering now the second derivative penalty term in the expanded cost function (3). This term is complex as it involves second order derivatives, and also depends upon the sign and magnitude of the errors themselves $\\{\\epsilon_{kp}\\}$. The simplest way of looking at its effect is to look at a single exemplar term :-\n\n$$K \\Delta^2 \\epsilon_{kp} T_{ab}^2 \\left( \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2} \\right) \\qquad (7)$$\n\nThis term implies that when the combination $\\epsilon_{kp} \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2}$ is negative then the overall cost function error is reduced, and vice versa. 
The term (7) is therefore constructive as it can actually lower the error locally via noise injection, whereas (6) always increases it. (7) can therefore be viewed as a sculpting of the error surface during the early phases of training (i.e. when $\\epsilon_{kp}$ is substantial). In particular, a weight set with a higher \"raw\" error value, calculated from (1), may be favoured over one with a lower value if noise-injected terms indicate that the \"poorer\" solution is located in a promising area of weight space. This \"look-ahead\" property should lead to an enhanced learning trajectory, perhaps finding a solution more rapidly.\n\nIn the augmented weight update equation (4), the noise is acting as a medium projecting statistical information about the character of the entire weight set on to the update equation for each particular weight. So, the effect of the noise term is to account not only for the weight currently being updated, but to add in a term that estimates what the other weight changes are likely to do to the output, and adjust the size of the weight increment/decrement as appropriate.\n\nTo verify this by simulation is not as straightforward as the other predictions. It is however possible to show the mean training time for each level of injected noise. For each noise level, 1000 random start points were used to allow the underlying properties of the training process to emerge. The results are shown in Fig. 6 and clearly show that at low noise levels (≤ 30% for the case of the character encoder) a definite reduction in training time is seen. At higher levels the chaotic nature of the \"noisy learning\" takes over.\n\nFigure 6: Training time as a function of injected synaptic noise during training. (Horizontal axis: Synaptic Noise Level Used During Training.)\n\nIt is also possible to plot the combination $\\epsilon_{kp} \\frac{\\partial^2 o_{kp}}{\\partial T_{ab}^2}$. This is shown in Fig. 7, again for the character encoder problem. The term (7) is reduced more quickly with injected noise, thus effecting better weight changes via (4). At levels of noise > 7% the effect is exaggerated, and the noise-mediated improvements take place during the first 100-200 epochs of training. The level of 7% is displayed simply because it is visually clear what is happening, and is also typical.\n\nFigure 7: The second derivative x error term trajectory for injected synaptic noise levels 0% and 7%. (Horizontal axis: Learning Epochs.)\n\n5 Conclusion\n\nWe have shown both by mathematical expansion and by simulation that injecting random noise on to the synaptic weights of a MultiLayer Perceptron during the training phase enhances fault-tolerance, generalisation ability and learning trajectory. It has long been held that any inaccuracy during training is detrimental to MLP learning. This paper proves that analog inaccuracy is not. The mathematical predictions are perfectly general and the simulations relate to a non-trivial classification task and a \"real\" world problem. The results are therefore important for the designers of analog hardware and also as a non-invasive technique for producing learning enhancements in the software domain. 
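\n\nAs a rough worked check on the noise levels quoted in Section 3 (a sketch only, under two assumptions that the text does not state explicitly: that the quoted percentage is the bound $\\Delta_{max}$ of the uniform distribution, and that $\\Delta^2$ in (3) stands for the variance of the noise factor $\\Delta_{ab}$), the first two moments of the injected noise are :-\n\n$$\\langle \\Delta_{ab} \\rangle = 0, \\qquad \\langle \\Delta_{ab}^2 \\rangle = \\frac{1}{2 \\Delta_{max}} \\int_{-\\Delta_{max}}^{\\Delta_{max}} \\Delta^2 \\, d\\Delta = \\frac{\\Delta_{max}^2}{3}$$\n\nThe zero mean is why no term linear in $\\Delta$ survives the time average in (3). Under the same assumptions the 40% level gives $\\langle \\Delta_{ab}^2 \\rangle = 0.16/3 \\approx 0.053$ and the 1% floor gives $10^{-4}/3 \\approx 3.3 \\times 10^{-5}$, so the noise-mediated penalty terms would shrink by a factor of about 1600 as the injected noise is reduced towards convergence.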
\n\nAcknowledgements\n\nWe are grateful to the Science and Engineering Research Council for financial support, and to Lionel Tarassenko and Chris Bishop for encouragement and advice.\n\nReferences\n\n[Taylor, 72] J. G. Taylor, \"Spontaneous Behaviour in Neural Networks\", J. Theor. Biol., vol. 36, pp. 513-528, 1972.\n\n[Murray, 91] A. F. Murray, \"Analog Noise-Enhanced Learning in Neural Network Circuits\", Electronics Letters, vol. 27, no. 17, pp. 1546-1548, 1991.\n\n[Murray, 92] A. F. Murray, \"Multi-Layer Perceptron Learning Optimised for On-Chip Implementation - a Noise Robust System\", Neural Computation, vol. 4, no. 3, pp. 366-381, 1992.\n\n[Matsuoka, 92] K. Matsuoka, \"Noise Injection into Inputs in Back-Propagation Learning\", IEEE Trans. Systems, Man and Cybernetics, vol. 22, no. 3, pp. 436-440, 1992.\n\n[Bishop, 90] C. Bishop, \"Curvature-Driven Smoothing in Backpropagation Neural Networks\", IJCNN, vol. 2, pp. 749-752, 1990.\n\n[Hanson, 90] S. J. Hanson, \"A Stochastic Version of the Delta Rule\", Physica D, vol. 42, pp. 265-272, 1990.\n\n[Sequin, 91] C. H. Sequin, R. D. Clay, \"Fault Tolerance in Feed-Forward Artificial Neural Networks\", Neural Networks: Concepts, Applications and Implementations, vol. 4, pp. 111-141, 1991.\n\n[Murray, 93] A. F. Murray, P. J. Edwards, \"Enhanced MLP Performance and Fault Tolerance Resulting from Synaptic Weight Noise During Training\", IEEE Trans. Neural Networks, 1993, In Press.\n", "award": [], "sourceid": 682, "authors": [{"given_name": "Alan", "family_name": "Murray", "institution": null}, {"given_name": "Peter", "family_name": "Edwards", "institution": null}]}