{"title": "Removing Noise in On-Line Search using Adaptive Batch Sizes", "book": "Advances in Neural Information Processing Systems", "page_first": 232, "page_last": 238, "abstract": null, "full_text": "Removing Noise in On-Line Search using \n\nAdaptive Batch Sizes \n\nGenevieve B. Orr \n\nDepartment of Computer Science \n\nWillamette University \n\n900 State Street \n\nSalem, Oregon 97301 \ngorr@willamette.ed-u \n\nAbstract \n\nStochastic (on-line) learning can be faster than batch learning. \nHowever, at late times, the learning rate must be annealed to re(cid:173)\nmove the noise present in the stochastic weight updates. In this \nannealing phase, the convergence rate (in mean square) is at best \nproportional to l/T where T is the number of input presentations. \nAn alternative is to increase the batch size to remove the noise. In \nthis paper we explore convergence for LMS using 1) small but fixed \nbatch sizes and 2) an adaptive batch size. We show that the best \nadaptive batch schedule is exponential and has a rate of conver(cid:173)\ngence which is the same as for annealing, Le., at best proportional \nto l/T. \n\n1 \n\nIntroduction \n\nStochastic (on-line) learning can speed learning over its batch training particularly \n,,,,hen data sets are large and contain redundant information [M0l93J. However, at \nlate times in learning, noise present in the weight updates prevents complete conver(cid:173)\ngence from taking place. To reduce the noise, the learning rate is slowly decreased \n(annealed{ at late times. The optimal annealing schedule is asymptotically propor(cid:173)\ntional to T where t is the iteration [GoI87, L093, Orr95J. This results in a rate of \nconvergence (in mean square) that is also proportional to t. \nAn alternative method of reducing the noise is to simply switch to (noiseless) batch \nmode when the noise regime is reached. 
However, since batch mode can be slow, a better idea is to slowly increase the batch size, starting with 1 (pure stochastic) and increasing it only \"as needed\" until it reaches the training set size (pure batch). In this paper we 1) investigate the convergence behavior of LMS when using small fixed batch sizes, 2) determine the best schedule when using an adaptive batch size at each iteration, 3) analyze the convergence behavior of the adaptive batch algorithm, and 4) compare this convergence rate to the alternative method of annealing the learning rate.

Other authors have approached the problem of redundant data by also proposing techniques for training on subsets of the data. For example, Plutowski [PW93] uses active data selection to choose a concise subset for training. This subset is slowly added to over time as needed. Møller [Møl93] proposes combining scaled conjugate gradient descent (SCG) with what he refers to as blocksize updating. His algorithm uses an iterative approach and assumes that the block size does not vary rapidly during training. In this paper, we take the simpler approach of just choosing exemplars at random at each iteration. Given this, we then analyze in detail the convergence behavior. Our results are more of theoretical than practical interest since the equations we derive are complex functions of quantities, such as the Hessian, that are impractical to compute.

2 Fixed Batch Size

In this section we examine the convergence behavior of LMS using a fixed batch size. We assume that we are given a large but finite sized training set T ≡ {z_i = (x_i, d_i)}_{i=1}^N where x_i ∈ ℝ^m is the ith input and d_i ∈ ℝ is the corresponding target. We further assume that the targets are generated using a signal plus noise model so that we can write

d_i = w_*^T x_i + ε_i    (1)

where w_* ∈ ℝ^m is the optimal weight vector and ε_i is zero mean noise. Since the training set is assumed to be large, we take the averages of ε_i and x_i ε_i over the training set to be approximately zero. Note that we consider only the problem of optimization of w over the training set and do not address the issue of obtaining good generalization over the distribution from which the training set was drawn.

At each iteration, we assume that exactly n examples are randomly drawn without replacement from T, where 1 ≤ n ≤ N. We denote this batch of size n drawn at time t by B_n(t) ≡ {z_k}_{k=1}^n. When n = 1 we have pure on-line training and when n = N we have pure batch. We choose to sample without replacement so that, as the batch size is increased, we have a smooth transition from on-line to batch.

For LMS, the squared error at iteration t for a batch of size n is

E_{B_n(t)} = (1/2n) Σ_{z_i ∈ B_n(t)} (d_i − w_t^T x_i)²    (2)

where w_t ∈ ℝ^m is the current weight in the network. The weight update equation is then w_{t+1} = w_t − μ ∂E_{B_n(t)}/∂w_t, where μ is the fixed learning rate. Rewriting this in terms of the weight error v ≡ w − w_* and defining g_{B_n,t} ≡ ∂E_{B_n(t)}/∂v_t we obtain

v_{t+1} = v_t − μ g_{B_n,t}.    (3)

Convergence (in mean square) to w_* can be characterized by the rate of change of the average squared norm of the weight error E[v²], where v² ≡ v^T v. From (3) we obtain an expression for v²_{t+1} in terms of v_t,

v²_{t+1} = v²_t − 2μ v_t^T g_{B_n,t} + μ² g²_{B_n,t}.    (4)

To compute the expected value of v²_{t+1} conditioned on v_t we can average the right side of (4) over all possible ways that the batch B_n(t) can be chosen from the N training examples. In appendix A, we show that

⟨g_{B_n,t}⟩_B = ⟨g_{i,t}⟩_N    (5)

⟨g²_{B_n,t}⟩_B = [N(n−1)/(n(N−1))] ⟨g_{i,t}⟩²_N + [(N−n)/(n(N−1))] ⟨g²_{i,t}⟩_N    (6)

where ⟨·⟩_N denotes the average over all examples in the training set, ⟨·⟩_B denotes the average over the possible batches drawn at time t, and g_{i,t} ≡ ∂E(z_i)/∂v_t. The averages over the entire training set are

⟨g_{i,t}⟩_N = (1/N) Σ_{i=1}^N ∂E(z_i)/∂v_t = −(1/N) Σ_{i=1}^N (ε_i x_i − (v_t^T x_i) x_i) = R v_t    (7)

⟨g²_{i,t}⟩_N = (1/N) Σ_{i=1}^N (ε_i x_i − (v_t^T x_i) x_i)^T (ε_i x_i − (v_t^T x_i) x_i) = σ_ε² (Tr R) + v_t^T S v_t    (8)

where R ≡ ⟨x x^T⟩_N, S ≡ ⟨x x^T x x^T⟩_N [1], σ_ε² ≡ ⟨ε²⟩, and (Tr R) is the trace of R. These equations together with (5) and (6) in (4) give the expected value of v²_{t+1} conditioned on v_t:

⟨v²_{t+1} | v_t⟩ = v_t^T { I − 2μR + μ² [ (N(n−1)/((N−1)n)) R² + ((N−n)/((N−1)n)) S ] } v_t + μ² σ_ε² (Tr R)(N−n)/(n(N−1)).    (9)

Note that this reduces to the standard stochastic and batch update equations when n = 1 and n = N, respectively.

2.0.1 Special Cases: 1-D Solution and Spherically Symmetric

In 1 dimension we can average over v_t in (9) to give

⟨v²_{t+1}⟩ = α ⟨v²_t⟩ + β    (10)

where

α = 1 − 2μR + μ² [ (N(n−1)/((N−1)n)) R² + ((N−n)/((N−1)n)) S ],   β = μ² σ_ε² R (N−n)/(n(N−1)),    (11)

and where R and S simplify to R = ⟨x²⟩_N, S = ⟨x⁴⟩_N. This is a difference equation which can be solved exactly to give

⟨v²_t⟩ = α^{t−t₀} ⟨v₀²⟩ + (1 − α^{t−t₀}) β/(1 − α)    (12)

where ⟨v₀²⟩ is the expected squared weight error at the initial time t₀.

Figure 1a compares equation (12) with simulations of 1-D LMS with gaussian inputs for N = 1000 and batch sizes n = 10, 100, and 500. As can be seen, the agreement is good. Note that ⟨v²⟩ decreases exponentially until flattening out. The equilibrium value can be computed from (12) by setting t = ∞ (assuming |α| < 1) to give

⟨v²⟩_∞ = β/(1 − α) = μ σ_ε² R (N−n) / [ 2Rn(N−1) − μ(N(n−1)R² + (N−n)S) ].    (13)

Note that ⟨v²⟩_∞ decreases as n increases and is zero only if n = N. 
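The 1-D fixed-batch dynamics are simple enough to check numerically. The following sketch (our own illustration, not from the paper; the function names are ours) encodes equations (10)-(13), assuming the Gaussian-input setting of Figure 1, where the moment factoring theorem gives S = ⟨x⁴⟩ = 3R²:

```python
# 1-D fixed-batch LMS dynamics, eqs. (10)-(13). Parameters follow Figure 1:
# N = 1000 examples, R = <x^2> = 1, sigma_eps^2 = 1, mu = 0.1. For zero-mean
# Gaussian inputs the moment factoring theorem gives S = <x^4> = 3 R^2.
N, R, SIG2, MU = 1000, 1.0, 1.0, 0.1
S = 3.0 * R**2

# Coefficients of the difference equation <v^2>_{t+1} = alpha <v^2>_t + beta (eq. 10).
def alpha_beta(n):
    alpha = 1 - 2*MU*R + MU**2 * (N*(n - 1)*R**2 + (N - n)*S) / ((N - 1)*n)
    beta = MU**2 * SIG2 * R * (N - n) / (n*(N - 1))
    return alpha, beta

# Closed-form solution, eq. (12), with <v^2> = v2_init at t = 0.
def v2_closed_form(n, t, v2_init):
    alpha, beta = alpha_beta(n)
    return alpha**t * v2_init + (1 - alpha**t) * beta / (1 - alpha)

# Equilibrium value beta/(1 - alpha), eq. (13): zero only when n = N.
def v2_equilibrium(n):
    alpha, beta = alpha_beta(n)
    return beta / (1 - alpha)
```

Iterating eq. (10) directly reproduces the closed form, and the equilibrium values decrease with increasing n, vanishing at n = N, exactly as Figure 1a shows.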
[1] For real zero mean gaussian inputs, we can apply the gaussian moment factoring theorem [Hay91], which states that ⟨x_i x_j x_k x_l⟩_N = R_{ij}R_{kl} + R_{ik}R_{jl} + R_{il}R_{jk}, where the subscripts on x denote components of x. From this, we find that S = (Tr R) R + 2R².

Figure 1: Simulations (solid) vs theoretical (dashed) predictions of the squared weight error of 1-D LMS a) as a function of t, the number of batch updates (iterations), and b) as a function of the number of input presentations, τ. Training set size is N = 1000 and batch sizes are n = 10, 100, and 500. Inputs were gaussian with R = 1, σ_ε = 1 and μ = .1. Simulations used 10000 networks.

Equation (9) can also be solved exactly in multiple dimensions in the rather restrictive case where we assume that the inputs are spherically symmetric gaussians with R = aI, where a is a constant, I is the identity matrix, and m is the dimension. The update equation and solution are the same as (10) and (12), respectively, but where α and β are now

α = 1 − 2μa + μ²a² [ N(n−1)/((N−1)n) + ((N−n)/((N−1)n))(m+2) ],   β = μ² σ_ε² m a (N−n)/(n(N−1)).    (14)

3 Adaptive Batch Size

The time it takes to compute the weight update in one iteration is roughly proportional to the number of input presentations, i.e. the batch size. To make the comparison of convergence rates for different batch sizes meaningful, we must compute the change in squared weight error as a function of the number of input presentations, τ, rather than iteration number t.

For fixed batch size, τ = nt. Figure 1b displays our 1-D LMS simulations plotted as a function of τ. 
As can be seen, training with a large batch size is slow but results in a lower equilibrium value than is obtained with a small batch size. This suggests that we could obtain the fastest overall decrease of ⟨v²⟩ by varying the batch size at each iteration. The batch size to choose for the current ⟨v²⟩ would be the smallest n that has yet to reach equilibrium, i.e. for which ⟨v²⟩ > ⟨v²⟩_∞.

To determine the best batch size, we take the greedy approach by demanding that at each iteration the batch size is chosen so as to reduce the weight error at the next iteration by the greatest amount per input presentation. This is equivalent to asking what value of n maximizes h ≡ (⟨v²_t⟩ − ⟨v²_{t+1}⟩)/n. Once we determine n we then express it as a function of τ.

We treat the 1-D case, although the analysis would be similar for the spherically symmetric case. From (10) we have h = (1/n)((1 − α)⟨v²_t⟩ − β). Differentiating h with respect to n and solving yields the batch size that decreases the weight error the most:

n_t = min( N, 2μN[(S − R²)⟨v²_t⟩ + σ_ε²R] / ([2R(N−1) + μ(S − NR²)]⟨v²_t⟩ + μσ_ε²R) ).    (15)

We have n_t exactly equal to N when the current value of ⟨v²_t⟩ satisfies

⟨v²_t⟩ < κ ≡ μσ_ε²R / (2R(N−1) − μ((N−2)R² + S))    (n_t = N).    (16)

Figure 2: 1-D LMS: a) Comparison of the simulated and theoretically predicted (equation (18)) squared weight error as a function of t with N = 1000, R = 1, σ_ε = 1, μ = .1, and 10000 networks. b) Corresponding batch sizes used in the simulations.

Thus, after ⟨v²_t⟩ has decreased to κ, training will proceed as pure batch. 
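The greedy schedule (15) and crossover point (16) can be written directly in code. This is an illustrative sketch under the 1-D assumptions above (the function names are ours); in a simulation one would round n_t to the nearest integer and never below 1, as was done for Figure 2:

```python
# Greedy batch size of eq. (15): the n that maximizes the decrease in <v^2>
# per input presentation, capped at the training-set size N.
def greedy_batch_size(v2, N, R, S, sig2, mu):
    num = 2 * mu * N * ((S - R**2) * v2 + sig2 * R)
    den = (2 * R * (N - 1) + mu * (S - N * R**2)) * v2 + mu * sig2 * R
    return min(N, num / den)

# Crossover threshold of eq. (16): once <v^2> falls below kappa, n_t = N
# and training proceeds as pure batch.
def crossover_kappa(N, R, S, sig2, mu):
    return mu * sig2 * R / (2 * R * (N - 1) - mu * ((N - 2) * R**2 + S))
```

With the Figure 1 parameters (N = 1000, R = 1, S = 3, σ_ε² = 1, μ = .1), the predicted n_t is below 1 early in training and grows as ⟨v²⟩ shrinks, hitting the cap N once ⟨v²⟩ drops below κ.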
When ⟨v²_t⟩ > κ, we have n_t < N and we can put (15) into (10) to obtain

⟨v²_{t+1}⟩ = ( 1 − μR + μ²(NR² − S)/(2(N−1)) ) ⟨v²_t⟩ − μ²σ_ε²R/(2(N−1)).    (17)

Solving (17) we get

⟨v²_t⟩ = α₁^{t−t₀} ⟨v₀²⟩ + (1 − α₁^{t−t₀}) β₁/(1 − α₁)    (18)

where α₁ and β₁ are the constants

α₁ = 1 − μR + μ²(NR² − S)/(2(N−1)),   β₁ = −μ²σ_ε²R/(2(N−1)).    (19)

Figure 2a compares equation (18) with 1-D LMS simulations. The adaptive batch size was chosen by rounding (15) to the nearest integer. Early in training, the predicted n_t is always smaller than 1 but the simulation always rounds up to 1 (we can't have n = 0). Figure 2b displays the batch sizes that were used in the simulations. A logarithmic scale is used to show that the batch size increases exponentially in t. We next examine the batch size as a function of τ.

3.1 Convergence Rate per Input Presentation

When we use (15) to choose the batch size, the number of input presentations will vary at each iteration. Thus, τ is not simply a multiple of t. Instead, we have

τ(t) = τ₀ + Σ_{i=t₀}^{t} n_i    (20)

where τ₀ is the number of inputs that have been presented by t₀. This can be evaluated when N is very large. In this case, equations (18) and (15) reduce to

⟨v²_t⟩ = ⟨v₀²⟩ α₃^{t−t₀}   where   α₃ ≡ 1 − μR + (1/2)μ²R²,    (21)

n_t = 2μ[(S − R²)⟨v²_t⟩ + σ_ε²R] / [(2R − μR²)⟨v²_t⟩] = 2μ(S − R²)/((2 − μR)R) + 2μσ_ε² / ((2 − μR)⟨v₀²⟩ α₃^{t−t₀}).    (22)

Putting (22) into (20) and summing gives

τ(t) = [2μ(S − R²)/((2 − μR)R)] Δt + [2μσ_ε²/((2 − μR)⟨v₀²⟩)] (α₃^{−Δt} − α₃)/(1 − α₃)    (23)
where Δt ≡ t − t₀ and Δτ ≡ τ − τ₀. Assuming that |α₃| < 1, the term with α₃^{−Δt} will dominate at late times. Dropping the other terms and solving for α₃^{Δt} gives

⟨v²_t⟩ = ⟨v₀²⟩ α₃^{Δt} ≈ 4σ_ε² / [ (2 − μR)² R (τ − τ₀) ].    (24)

Thus, when using an adaptive batch size, ⟨v²⟩ converges at late times as 1/τ. Figure 3a compares simulations of ⟨v²⟩ with adaptive and constant batch sizes. As can be seen, the adaptive-n curve follows the n = 1 curve until just before the n = 1 curve starts to flatten out. Figure 3b compares (24) with the simulation. Curves are plotted on a log-log plot to illustrate the 1/τ relationship at late times (a straight line with slope of −1).

Figure 3: 1-D LMS: a) Simulations of the squared weight error as a function of τ, the number of input presentations, for N = 1000, R = 1, σ_ε = 1, μ = .1, and 10000 networks. Batch sizes are n = 1, 10, 100, and n_t (see (15)). b) Comparison of simulation (dashed) and theory (see (24)) using an adaptive batch size. A simulation (long dash) using an annealed learning rate with n = 1 and μ = R⁻¹ is also shown.

4 Learning Rate Annealing vs Increasing Batch Size

With on-line learning (n = 1), we can reduce the noise at late times by annealing the learning rate using a μ/t schedule. During this phase, ⟨v²⟩ decreases at a rate of 1/τ if μ > R⁻¹/2 [LO93] and slower otherwise. In this paper, we have presented an alternative method for reducing the noise by increasing the batch size exponentially in t. Here, ⟨v²⟩ also decreases at a rate of 1/τ so that, from this perspective, an adaptive batch size is equivalent to annealing the learning rate. 
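This 1/τ behavior can be checked numerically. The sketch below (our own illustration, using the large-N limits of equations (21)-(23)) iterates the deterministic recursion for ⟨v²⟩ while accumulating τ under the greedy schedule; the product ⟨v²⟩ · τ settles at the constant 4σ_ε²/((2 − μR)²R) predicted by (24):

```python
# Large-N check of the 1/tau convergence rate, eq. (24).
R, S, SIG2, MU = 1.0, 3.0, 1.0, 0.1
ALPHA3 = 1 - MU*R + 0.5 * MU**2 * R**2            # eq. (21)

v2, tau = 1.0, 0.0
for _ in range(400):
    v2 *= ALPHA3                                  # <v^2>_t = <v^2>_0 alpha3^t
    # greedy batch size in the large-N limit, eq. (22)
    tau += 2*MU*(S - R**2)/((2 - MU*R)*R) + 2*MU*SIG2/((2 - MU*R)*v2)

limit = 4 * SIG2 / ((2 - MU*R)**2 * R)            # late-time constant of eq. (24)
```

At late times ⟨v²⟩ decays exponentially in t but only as 1/τ in input presentations, since the batch size (and hence the cost per iteration) grows exponentially.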
This is confirmed in Figure 3b, which compares using an adaptive batch size with annealing.

An advantage of the adaptive batch size comes when n reaches N. At this point n remains constant so that ⟨v²⟩ decreases exponentially in τ. However, with annealing, the convergence rate of ⟨v²⟩ always remains proportional to 1/τ. A disadvantage, though, occurs in multiple dimensions with nonspherical R, where the best choice of n_t would likely be different along different directions in the weight space. Though it is possible to have a different learning rate along different directions, it is not possible to have different batch sizes.

5 Appendix A

In this appendix we use simple counting arguments to derive the two results in equations (5) and (6). We first note that there are M ≡ (N choose n) ways of choosing n examples out of a total of N examples. Thus, (5) can be rewritten as

⟨g_{B_n,t}⟩_B = (1/M) Σ_{i=1}^{M} g_{B_n^{(i)},t} = (1/M) Σ_{i=1}^{M} (1/n) Σ_{z_j ∈ B_n^{(i)}} g_{j,t}    (25)

where B_n^{(i)} is the ith batch (i = 1, ..., M) and g_{j,t} ≡ ∂E(z_j)/∂v_t for j = 1, ..., N. If we were to expand (25) we would find that there are exactly nM terms. From symmetry, and since there are only N unique g_{j,t}, we conclude that each g_{j,t} occurs exactly nM/N times. The above expression can then be written as

⟨g_{B_n,t}⟩_B = (1/(nM)) (nM/N) Σ_{j=1}^{N} g_{j,t} = ⟨g_{i,t}⟩_N.    (26)

Thus, we have equation (5). The second equation (6) is

⟨g²_{B_n,t}⟩_B = (1/M) Σ_{i=1}^{M} (1/n²) ( Σ_{z_j ∈ B_n^{(i)}} g_{j,t} )² = (1/n)⟨g²_{i,t}⟩_N + (1/(n²M)) Σ_{i=1}^{M} Σ_{z_j,z_k ∈ B_n^{(i)}, j≠k} g_{j,t} g_{k,t}.    (27)

By the same argument used to derive (5), the first term on the right is (1/n)⟨g²_{i,t}⟩_N. In the second term, there are a total of n(n−1)M terms in the sum, of which only N(N−1) are unique. Thus, a given g_{j,t} g_{k,t} occurs exactly n(n−1)M/(N(N−1)) times, so that

(1/(n²M)) Σ_{i=1}^{M} Σ_{z_j,z_k ∈ B_n^{(i)}, j≠k} g_{j,t} g_{k,t} = ((n−1)/(nN(N−1))) [ (Σ_{j=1}^{N} g_{j,t})² − Σ_{j=1}^{N} g²_{j,t} ] = (N(n−1)/(n(N−1))) ⟨g_{i,t}⟩²_N − ((n−1)/(n(N−1))) ⟨g²_{i,t}⟩_N.    (28)
Putting the simplified first term and (28) into (27), we obtain our second result, equation (6).

References

[Gol87] Larry Goldstein. Mean square optimality in the continuous time Robbins Monro procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA, 1987.

[Hay91] Simon Haykin. Adaptive Filter Theory. Prentice Hall, New Jersey, 1991.

[LO93] Todd K. Leen and Genevieve B. Orr. Momentum and optimal stochastic search. In Advances in Neural Information Processing Systems, vol. 6, 1993. To appear.

[Møl93] Martin Møller. Supervised learning on large redundant training sets. International Journal of Neural Systems, 4(1):15-25, 1993.

[Orr95] Genevieve B. Orr. Dynamics and Algorithms for Stochastic Learning. PhD thesis, Oregon Graduate Institute, 1995.

[PW93] Mark Plutowski and Halbert White. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4:305-318, 1993.
", "award": [], "sourceid": 1257, "authors": [{"given_name": "Genevieve", "family_name": "Orr", "institution": null}]}