{"title": "Removing Noise in On-Line Search using Adaptive Batch Sizes", "book": "Advances in Neural Information Processing Systems", "page_first": 232, "page_last": 238, "abstract": null, "full_text": "Removing Noise in On-Line Search using Adaptive Batch Sizes\n\nGenevieve B. Orr\nDepartment of Computer Science\nWillamette University\n900 State Street\nSalem, Oregon 97301\ngorr@willamette.edu\n\nAbstract\n\nStochastic (on-line) learning can be faster than batch learning. However, at late times, the learning rate must be annealed to remove the noise present in the stochastic weight updates. In this annealing phase, the convergence rate (in mean square) is at best proportional to 1/T, where T is the number of input presentations. An alternative is to increase the batch size to remove the noise. In this paper we explore convergence for LMS using 1) small but fixed batch sizes and 2) an adaptive batch size. We show that the best adaptive batch schedule is exponential and has a rate of convergence which is the same as for annealing, i.e., at best proportional to 1/T.\n\n1 Introduction\n\nStochastic (on-line) learning can speed up learning relative to batch training, particularly when data sets are large and contain redundant information [M\u00f8l93]. However, at late times in learning, noise present in the weight updates prevents complete convergence from taking place. To reduce the noise, the learning rate is slowly decreased (annealed) at late times. The optimal annealing schedule is asymptotically proportional to 1/t, where t is the iteration [Gol87, LO93, Orr95]. This results in a rate of convergence (in mean square) that is also proportional to 1/t.\n\nAn alternative method of reducing the noise is to simply switch to (noiseless) batch mode when the noise regime is reached. 
However, since batch mode can be slow, a better idea is to start with a batch size of 1 (pure stochastic) and slowly increase it only \"as needed\" until it reaches the training set size (pure batch). In this paper we 1) investigate the convergence behavior of LMS when using small fixed batch sizes, 2) determine the best schedule when using an adaptive batch size at each iteration, 3) analyze the convergence behavior of the adaptive batch algorithm, and 4) compare this convergence rate to the alternative method of annealing the learning rate.\n\nOther authors have approached the problem of redundant data by also proposing techniques for training on subsets of the data. For example, Plutowski and White [PW93] use active data selection to choose a concise subset for training. This subset is slowly added to over time as needed. M\u00f8ller [M\u00f8l93] proposes combining scaled conjugate gradient descent (SCG) with what he refers to as blocksize updating. His algorithm uses an iterative approach and assumes that the block size does not vary rapidly during training. In this paper, we take the simpler approach of just choosing exemplars at random at each iteration. Given this, we then analyze in detail the convergence behavior. Our results are more of theoretical than practical interest since the equations we derive are complex functions of quantities such as the Hessian that are impractical to compute.\n\n2 Fixed Batch Size\n\nIn this section we examine the convergence behavior for LMS using a fixed batch size. We assume that we are given a large but finite sized training set $\\mathcal{T} \\equiv \\{z_i \\equiv (x_i, d_i)\\}_{i=1}^{N}$, where $x_i \\in \\mathbb{R}^m$ is the ith input and $d_i \\in \\mathbb{R}$ is the corresponding target. 
\nWe further assume that the targets are generated using a signal plus noise model so that we can write\n\n$$d_j = w_*^T x_j + \\epsilon_j \\qquad (1)$$\n\nwhere $w_* \\in \\mathbb{R}^m$ is the optimal weight vector and $\\epsilon_j$ is zero mean noise. Since the training set is assumed to be large, we take the averages of $\\epsilon_j$ and $x_j \\epsilon_j$ over the training set to be approximately zero. Note that we consider only the problem of optimization of $w$ over the training set and do not address the issue of obtaining good generalization over the distribution from which the training set was drawn.\n\nAt each iteration, we assume that exactly $n$ samples are randomly drawn without replacement from $\\mathcal{T}$, where $1 \\le n \\le N$. We denote this batch of size $n$ drawn at time $t$ by $B_n(t) \\equiv \\{z_{k_i}\\}_{i=1}^{n}$. When $n = 1$ we have pure on-line training and when $n = N$ we have pure batch. We choose to sample without replacement so that as the batch size is increased, we have a smooth transition from on-line to batch.\n\nFor LMS, the squared error at iteration $t$ for a batch of size $n$ is\n\n$$\\mathcal{E}_{B_n(t)} \\equiv \\frac{1}{n} \\sum_{z_i \\in B_n(t)} \\mathcal{E}(z_i), \\qquad \\mathcal{E}(z_i) \\equiv \\frac{1}{2} (d_i - w_t^T x_i)^2 \\qquad (2)$$\n\nwhere $w_t \\in \\mathbb{R}^m$ is the current weight in the network. The weight update equation is then $w_{t+1} = w_t - \\mu \\, \\partial \\mathcal{E}_{B_n(t)} / \\partial w_t$, where $\\mu$ is the fixed learning rate. Rewriting this in terms of the weight error $v \\equiv w - w_*$ and defining $g_{B_n,t} \\equiv \\partial \\mathcal{E}_{B_n(t)} / \\partial v_t$ we obtain\n\n$$v_{t+1} = v_t - \\mu \\, g_{B_n,t}. \\qquad (3)$$\n\nConvergence (in mean square) to $w_*$ can be characterized by the rate of change of the average squared norm of the weight error $E[v^2]$, where $v^2 \\equiv v^T v$. From (3) we obtain an expression for $v_{t+1}^2$ in terms of $v_t$,\n\n$$v_{t+1}^2 = v_t^2 - 2 \\mu \\, v_t^T g_{B_n,t} + \\mu^2 g_{B_n,t}^2. \\qquad (4)$$\n\nTo compute the expected value of $v_{t+1}^2$ conditioned on $v_t$ we can average the right side of (4) over all possible ways that the batch $B_n(t)$ can be chosen from the $N$ training examples. In appendix A, we show that\n\n$$\\langle g_{B_n,t} \\rangle_B = \\langle g_{i,t} \\rangle_N \\qquad (5)$$\n\n$$\\langle g_{B_n,t}^2 \\rangle_B = \\frac{N - n}{n (N - 1)} \\langle g_{i,t}^2 \\rangle_N + \\frac{(n - 1) N}{(N - 1) n} \\langle g_{i,t} \\rangle_N^2 \\qquad (6)$$\n\nwhere $\\langle \\cdot \\rangle_N$ denotes average over all examples in the training set, $\\langle \\cdot \\rangle_B$ denotes average over the possible batches drawn at time $t$, and $g_{i,t} \\equiv \\partial \\mathcal{E}(z_i) / \\partial v_t$. The averages over the entire training set are\n\n$$\\langle g_{i,t} \\rangle_N = \\frac{1}{N} \\sum_{i=1}^{N} \\frac{\\partial \\mathcal{E}(z_i)}{\\partial v_t} = -\\frac{1}{N} \\sum_{i=1}^{N} \\left( \\epsilon_i x_i - (v_t^T x_i) x_i \\right) = R v_t \\qquad (7)$$\n\n$$\\langle g_{i,t}^2 \\rangle_N = \\frac{1}{N} \\sum_{i=1}^{N} \\left( \\epsilon_i x_i - (v_t^T x_i) x_i \\right)^T \\left( \\epsilon_i x_i - (v_t^T x_i) x_i \\right) = \\sigma_\\epsilon^2 (\\mathrm{Tr}\\, R) + v_t^T S v_t \\qquad (8)$$\n\nwhere $R \\equiv \\langle x x^T \\rangle_N$, $S \\equiv \\langle x x^T x x^T \\rangle_N$ (footnote 1), $\\sigma_\\epsilon^2 \\equiv \\langle \\epsilon^2 \\rangle$ and $(\\mathrm{Tr}\\, R)$ is the trace of $R$. These equations together with (5) and (6) in (4) give the expected value of $v_{t+1}^2$ conditioned on $v_t$:\n\n$$\\langle v_{t+1}^2 | v_t \\rangle = v_t^T \\left\\{ I - 2 \\mu R + \\mu^2 \\left( \\frac{N (n - 1)}{(N - 1) n} R^2 + \\frac{N - n}{(N - 1) n} S \\right) \\right\\} v_t + \\frac{\\mu^2 \\sigma_\\epsilon^2 (\\mathrm{Tr}\\, R) (N - n)}{n (N - 1)}. \\qquad (9)$$\n\nNote that this reduces to the standard stochastic and batch update equations when $n = 1$ and $n = N$, respectively.\n\n2.0.1 Special Cases: 1-D solution and Spherically Symmetric\n\nIn 1-dimension we can average over $v_t$ in (9) to give\n\n$$\\langle v_{t+1}^2 \\rangle = \\alpha \\langle v_t^2 \\rangle + \\beta \\qquad (10)$$\n\nwhere\n\n$$\\alpha = 1 - 2 \\mu R + \\mu^2 \\left( \\frac{N (n - 1)}{(N - 1) n} R^2 + \\frac{N - n}{(N - 1) n} S \\right), \\qquad \\beta = \\frac{\\mu^2 \\sigma_\\epsilon^2 R (N - n)}{n (N - 1)}, \\qquad (11)$$\n\nand where $R$ and $S$ simplify to $R = \\langle x^2 \\rangle_N$, $S = \\langle x^4 \\rangle_N$. This is a difference equation which can be solved exactly to give\n\n$$\\langle v_t^2 \\rangle = \\alpha^{t - t_0} \\langle v_0^2 \\rangle + \\frac{1 - \\alpha^{t - t_0}}{1 - \\alpha} \\beta \\qquad (12)$$\n\nwhere $\\langle v_0^2 \\rangle$ is the expected squared weight error at the initial time $t_0$.\n\nFigure 1a compares equation (12) with simulations of 1-D LMS with gaussian inputs for $N = 1000$ and batch sizes $n = 10$, 100, and 500. 
As can be seen, the agreement is good. Note that $\\langle v^2 \\rangle$ decreases exponentially until flattening out. The equilibrium value can be computed from (12) by setting $t = \\infty$ (assuming $|\\alpha| < 1$) to give\n\n$$\\langle v^2 \\rangle_\\infty = \\frac{\\beta}{1 - \\alpha} = \\frac{\\mu \\sigma_\\epsilon^2 R (N - n)}{2 R n (N - 1) - \\mu (N (n - 1) R^2 + (N - n) S)}. \\qquad (13)$$\n\nNote that $\\langle v^2 \\rangle_\\infty$ decreases as $n$ increases and is zero only if $n = N$.\n\nFootnote 1: For real zero mean gaussian inputs, we can apply the gaussian moment factoring theorem [Hay91], which states that $\\langle x_i x_j x_k x_l \\rangle_N = R_{ij} R_{kl} + R_{ik} R_{jl} + R_{il} R_{jk}$, where the subscripts on $x$ denote components of $x$. From this, we find that $S = (\\mathrm{Tr}\\, R) R + 2 R^2$.\n\nFigure 1: Simulations (solid) vs theoretical (dashed) predictions of the squared weight error of 1-D LMS a) as a function of t, the number of batch updates (iterations), and b) as a function of the number of input presentations, T. Training set size is $N = 1000$ and batch sizes are $n = 10$, 100, and 500. Inputs were gaussian with $R = 1$, $\\sigma_\\epsilon = 1$ and $\\mu = 0.1$. Simulations used 10000 networks.\n\nEquation (9) can also be solved exactly in multiple dimensions in the rather restrictive case where we assume that the inputs are spherically symmetric gaussians with $R = aI$, where $a$ is a constant, $I$ is the identity matrix, and $m$ is the dimension. The update equation and solution are the same as (10) and (12), respectively, but where $\\alpha$ and $\\beta$ are now\n\n$$\\alpha = 1 - 2 \\mu a + \\mu^2 a^2 \\left( \\frac{N (n - 1)}{(N - 1) n} + \\frac{N - n}{(N - 1) n} (m + 2) \\right), \\qquad \\beta = \\frac{\\mu^2 \\sigma_\\epsilon^2 m a (N - n)}{n (N - 1)}. \\qquad (14)$$\n\n3 Adaptive Batch Size\n\nThe time it takes to compute the weight update in one iteration is roughly proportional to the number of input presentations, i.e., the batch size. To make the comparison of convergence rates for different batch sizes meaningful, we must compute the change in squared weight error as a function of the number of input presentations, T, rather than iteration number t.\n\nFor fixed batch size, $T = nt$. Figure 1b displays our 1-D LMS simulations plotted as a function of T. As can be seen, training with a large batch size is slow but results in a lower equilibrium value than obtained with a small batch size. This suggests that we could obtain the fastest decrease of $\\langle v^2 \\rangle$ overall by varying the batch size at each iteration. The batch size to choose for the current $\\langle v^2 \\rangle$ would be the smallest $n$ that has yet to reach equilibrium, i.e., for which $\\langle v^2 \\rangle > \\langle v^2 \\rangle_\\infty$.\n\nTo determine the best batch size, we take the greedy approach by demanding that at each iteration the batch size is chosen so as to reduce the weight error at the next iteration by the greatest amount per input presentation. This is equivalent to asking what value of $n$ maximizes $h \\equiv (\\langle v_t^2 \\rangle - \\langle v_{t+1}^2 \\rangle) / n$. Once we determine $n$ we then express it as a function of T.\n\nWe treat the 1-D case, although the analysis would be similar for the spherically symmetric case. From (10) we have $h = \\frac{1}{n} ((1 - \\alpha) \\langle v_t^2 \\rangle - \\beta)$. Differentiating $h$ with respect to $n$ and solving yields the batch size that decreases the weight error the most to be\n\n$$n_t = \\min \\left( N, \\; \\frac{2 \\mu N ((S - R^2) \\langle v_t^2 \\rangle + \\sigma_\\epsilon^2 R)}{(2 R (N - 1) + \\mu (S - N R^2)) \\langle v_t^2 \\rangle + \\mu \\sigma_\\epsilon^2 R} \\right). \\qquad (15)$$\n\nSetting the second argument of (15) to be at least $N$ and solving for $\\langle v_t^2 \\rangle$, we find that $n_t$ is exactly equal to $N$ when the current value of $\\langle v_t^2 \\rangle$ satisfies\n\n$$\\langle v_t^2 \\rangle < I_c \\equiv \\frac{\\mu \\sigma_\\epsilon^2 R}{2 R (N - 1) - \\mu (R^2 (N - 2) + S)} \\qquad (n_t = N). \\qquad (16)$$\n\nFigure 2: 1-D LMS: a) Comparison of the simulated and theoretically predicted (equation (18)) squared weight error as a function of t with $N = 1000$, $R = 1$, $\\sigma_\\epsilon = 1$, $\\mu = 0.1$, and 10000 networks. b) Corresponding batch sizes used in the simulations.\n\nThus, after $\\langle v_t^2 \\rangle$ has decreased to $I_c$, training will proceed as pure batch. When $\\langle v_t^2 \\rangle > I_c$, we have $n_t < N$ and we can put (15) into (10) to obtain\n\n$$\\langle v_{t+1}^2 \\rangle = \\left( 1 - \\mu R + \\mu^2 \\frac{N R^2 - S}{2 (N - 1)} \\right) \\langle v_t^2 \\rangle - \\frac{\\mu^2 \\sigma_\\epsilon^2 R}{2 (N - 1)}. \\qquad (17)$$\n\nSolving (17) we get\n\n$$\\langle v_t^2 \\rangle = \\alpha_1^{t - t_0} \\langle v_0^2 \\rangle + \\frac{1 - \\alpha_1^{t - t_0}}{1 - \\alpha_1} \\beta_1 \\qquad (18)$$\n\nwhere $\\alpha_1$ and $\\beta_1$ are constants\n\n$$\\alpha_1 \\equiv 1 - \\mu R + \\mu^2 \\frac{N R^2 - S}{2 (N - 1)}, \\qquad \\beta_1 \\equiv -\\mu^2 \\frac{\\sigma_\\epsilon^2 R}{2 (N - 1)}. \\qquad (19)$$\n\nFigure 2a compares equation (18) with 1-D LMS simulations. The adaptive batch size was chosen by rounding (15) to the nearest integer. Early in training, the predicted $n_t$ is always smaller than 1 but the simulation always rounds up to 1 (the batch size cannot be 0). Figure 2b displays the batch sizes that were used in the simulations. A logarithmic scale is used to show that the batch size increases exponentially in t. We next examine the batch size as a function of T.\n\n3.1 Convergence Rate per Input Presentation\n\nWhen we use (15) to choose the batch size, the number of input presentations will vary at each iteration. Thus, T is not simply a multiple of t. Instead, we have\n\n$$T(t) = T_0 + \\sum_{i=t_0}^{t} n_i \\qquad (20)$$\n\nwhere $T_0$ is the number of inputs that have been presented by $t_0$. This can be evaluated when $N$ is very large. 
In this case, equations (18) and (15) reduce to\n\n$$\\langle v_t^2 \\rangle = \\langle v_0^2 \\rangle \\alpha_3^{t - t_0}, \\quad \\text{where} \\quad \\alpha_3 \\equiv 1 - \\mu R + \\frac{1}{2} \\mu^2 R^2 \\qquad (21)$$\n\n$$n_t = \\frac{2 \\mu ((S - R^2) \\langle v_t^2 \\rangle + \\sigma_\\epsilon^2 R)}{(2 R - \\mu R^2) \\langle v_t^2 \\rangle} = \\frac{2 \\mu (S - R^2)}{2 R - \\mu R^2} + \\frac{2 \\mu \\sigma_\\epsilon^2}{(2 - \\mu R) \\langle v_0^2 \\rangle \\alpha_3^{t - t_0}}. \\qquad (22)$$\n\nFigure 3: 1-D LMS: a) Simulations of the squared weight error as a function of T, the number of input presentations, for $N = 1000$, $R = 1$, $\\sigma_\\epsilon = 1$, $\\mu = 0.1$, and 10000 networks. Batch sizes are $n = 1$, 10, 100, and $n_t$ (see (15)). b) Comparison of simulation (dashed) and theory (see (24)) using adaptive batch size. Simulation (long dash) using an annealed learning rate with $n = 1$ and $\\mu = R^{-1}$ is also shown.\n\nPutting (22) into (20) and summing gives\n\n$$\\Delta T = \\frac{2 \\mu (S - R^2)}{(2 - \\mu R) R} \\Delta t + \\frac{2 \\mu \\sigma_\\epsilon^2}{(2 - \\mu R) \\langle v_0^2 \\rangle} \\, \\frac{\\alpha_3^{-\\Delta t} - \\alpha_3}{1 - \\alpha_3} \\qquad (23)$$\n\nwhere $\\Delta t \\equiv t - t_0$ and $\\Delta T \\equiv T - T_0$. Assuming that $|\\alpha_3| < 1$, the term with $\\alpha_3^{-\\Delta t}$ will dominate at late times. Dropping the other terms and solving for $\\alpha_3^{\\Delta t}$ gives\n\n$$\\langle v_t^2 \\rangle = \\langle v_0^2 \\rangle \\alpha_3^{\\Delta t} \\approx \\frac{4 \\sigma_\\epsilon^2}{(2 - \\mu R)^2 R (T - T_0)}. \\qquad (24)$$\n\nThus, when using an adaptive batch size, $\\langle v^2 \\rangle$ converges at late times as 1/T. Figure 3a compares simulations of $\\langle v^2 \\rangle$ with adaptive and constant batch sizes. 
As can be seen, the adaptive $n$ curve follows the $n = 1$ curve until just before the $n = 1$ curve starts to flatten out. Figure 3b compares (24) with the simulation. Curves are plotted on a log-log plot to illustrate the 1/T relationship at late times (straight line with slope of -1).\n\n4 Learning Rate Annealing vs Increasing Batch Size\n\nWith online learning ($n = 1$), we can reduce the noise at late times by annealing the learning rate using a $\\mu/t$ schedule. During this phase, $\\langle v^2 \\rangle$ decreases at a rate of 1/T if $\\mu > R^{-1}/2$ [LO93] and slower otherwise. In this paper, we have presented an alternative method for reducing the noise by increasing the batch size exponentially in t. Here, $\\langle v^2 \\rangle$ also decreases at a rate of 1/T so that, from this perspective, an adaptive batch size is equivalent to annealing the learning rate. This is confirmed in Figure 3b, which compares using an adaptive batch size with annealing.\n\nAn advantage of the adaptive batch size comes when $n$ reaches $N$. At this point $n$ remains constant so that $\\langle v^2 \\rangle$ decreases exponentially in T. However, with annealing, the convergence rate of $\\langle v^2 \\rangle$ always remains proportional to 1/T. A disadvantage, though, occurs in multiple dimensions with nonspherical $R$, where the best choice of $n_t$ would likely be different along different directions in the weight space. Though it is possible to have a different learning rate along different directions, it is not possible to have different batch sizes.\n\n5 Appendix A\n\nIn this appendix we use simple counting arguments to derive the two results in equations (5) and (6). We first note that there are $M \\equiv \\binom{N}{n}$ ways of choosing $n$ examples out of a total of $N$ examples. Thus, (5) can be rewritten as\n\n$$\\langle g_{B_n,t} \\rangle_B = \\frac{1}{M} \\sum_{i=1}^{M} g_{B_n^{(i)},t} = \\frac{1}{M} \\sum_{i=1}^{M} \\frac{1}{n} \\sum_{z_j \\in B_n^{(i)}} g_{j,t} \\qquad (25)$$\n\nwhere $B_n^{(i)}$ is the ith batch ($i = 1, \\ldots, M$), and $g_{j,t} \\equiv \\partial \\mathcal{E}(z_j) / \\partial v_t$ for $j = 1, \\ldots, N$. If we were to expand (25) we would find that there are exactly $nM$ terms. From symmetry and since there are only $N$ unique $g_{j,t}$, we conclude that each $g_{j,t}$ occurs exactly $nM/N$ times. The above expression can then be written as\n\n$$\\langle g_{B_n,t} \\rangle_B = \\frac{1}{M n} \\, \\frac{n M}{N} \\sum_{j=1}^{N} g_{j,t} = \\frac{1}{N} \\sum_{j=1}^{N} g_{j,t} = \\langle g_{i,t} \\rangle_N. \\qquad (26)$$\n\nThus, we have equation (5). The second equation (6) is\n\n$$\\langle g_{B_n,t}^2 \\rangle_B = \\frac{1}{M} \\sum_{i=1}^{M} \\Big( \\frac{1}{n} \\sum_{z_j \\in B_n^{(i)}} g_{j,t} \\Big)^2 = \\frac{1}{n^2 M} \\sum_{i=1}^{M} \\sum_{z_j \\in B_n^{(i)}} g_{j,t}^2 + \\frac{1}{n^2 M} \\sum_{i=1}^{M} \\sum_{z_j, z_k \\in B_n^{(i)}, j \\ne k} g_{j,t} \\, g_{k,t}. \\qquad (27)$$\n\nBy the same argument used to derive (5), the first term on the right is $(1/n) \\langle g_{i,t}^2 \\rangle_N$. In the second term, there are a total of $n (n - 1) M$ terms in the sum, of which only $N (N - 1)$ are unique. Thus, a given $g_{j,t} \\, g_{k,t}$ occurs exactly $n (n - 1) M / (N (N - 1))$ times so that\n\n$$\\frac{1}{n^2 M} \\sum_{i=1}^{M} \\sum_{z_j, z_k \\in B_n^{(i)}, j \\ne k} g_{j,t} \\, g_{k,t} = \\frac{1}{n^2 M} \\, \\frac{n (n - 1) M}{N (N - 1)} \\sum_{j,k=1, j \\ne k}^{N} g_{j,t} \\, g_{k,t}$$\n\n$$= \\frac{N (n - 1)}{n (N - 1)} \\left( \\frac{1}{N^2} \\Big( \\sum_{j,k=1, j \\ne k}^{N} g_{j,t} \\, g_{k,t} + \\sum_{j=1}^{N} g_{j,t}^2 \\Big) - \\frac{1}{N} \\Big( \\frac{1}{N} \\sum_{j=1}^{N} g_{j,t}^2 \\Big) \\right)$$\n\n$$= \\frac{N (n - 1)}{n (N - 1)} \\langle g_{i,t} \\rangle_N^2 - \\frac{n - 1}{n (N - 1)} \\langle g_{i,t}^2 \\rangle_N. \\qquad (28)$$\n\nPutting the simplified first term and (28) into (27), we obtain our second result in equation (6).\n\nReferences\n\n[Gol87] Larry Goldstein. Mean square optimality in the continuous time Robbins Monro procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA, 1987.\n\n[Hay91] Simon Haykin. Adaptive Filter Theory. Prentice Hall, New Jersey, 1991.\n\n[LO93] Todd K. Leen and Genevieve B. Orr. Momentum and optimal stochastic search. In Advances in Neural Information Processing Systems, vol. 6, 1993. to appear.\n\n[M\u00f8l93] Martin M\u00f8ller. Supervised learning on large redundant training sets. International Journal of Neural Systems, 4(1):15-25, 1993.\n\n[Orr95] Genevieve B. Orr. Dynamics and Algorithms for Stochastic Learning. 
PhD thesis, Oregon Graduate Institute, 1995.\n\n[PW93] Mark Plutowski and Halbert White. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4:305-318, 1993.\n", "award": [], "sourceid": 1257, "authors": [{"given_name": "Genevieve", "family_name": "Orr", "institution": null}]}