{"title": "Optimal Stochastic Search and Adaptive Momentum", "book": "Advances in Neural Information Processing Systems", "page_first": 477, "page_last": 484, "abstract": null, "full_text": "Optimal Stochastic Search and Adaptive Momentum \n\nTodd K. Leen and Genevieve B. Orr \n\nOregon Graduate Institute of Science and Technology \n\nDepartment of Computer Science and Engineering \n\nP.O.Box 91000, Portland, Oregon 97291-1000 \n\nAbstract \n\nStochastic optimization algorithms typically use learning rate schedules that behave asymptotically as $\\mu(t) = \\mu_0/t$. The ensemble dynamics (Leen and Moody, 1993) for such algorithms provides an easy path to results on mean squared weight error and asymptotic normality. We apply this approach to stochastic gradient algorithms with momentum. We show that at late times, learning is governed by an effective learning rate $\\mu_{eff} = \\mu_0/(1 - \\beta)$, where $\\beta$ is the momentum parameter. We describe the behavior of the asymptotic weight error and give conditions on $\\mu_{eff}$ that insure optimal convergence speed. Finally, we use the results to develop an adaptive form of momentum that achieves optimal convergence speed independent of $\\mu_0$. \n\n1 Introduction \n\nThe rate of convergence for gradient descent algorithms, both batch and stochastic, can be improved by including in the weight update a \"momentum\" term proportional to the previous weight update. Several authors (Tugay and Tanik, 1989; Shynk and Roy, 1988) give conditions for convergence of the mean and covariance of the weight vector for momentum LMS with constant learning rate. However, stochastic algorithms require that the learning rate decay over time in order to achieve true convergence of the weight (in probability, in mean square, or with probability one). 
\n\nThis paper uses our previous work on weight space probabilities (Leen and Moody, 1993; Orr and Leen, 1993) to study the convergence of stochastic gradient algorithms with annealed learning rates of the form $\\mu = \\mu_0/t$, both with and without momentum. The approach provides simple derivations of previously known results and their extension to stochastic descent with momentum. Specifically, we show that the mean squared weight misadjustment drops off at the maximal rate $\\propto 1/t$ only if the effective learning rate $\\mu_{eff} = \\mu_0/(1 - \\beta)$ is greater than a critical value which is determined by the Hessian. \n\nThese results suggest a new algorithm that automatically adjusts the momentum coefficient to achieve the optimal convergence rate. This algorithm is simpler than previous approaches that either estimate the curvature directly during the descent (Venter, 1967) or measure an auxiliary statistic not directly involved in the optimization (Darken and Moody, 1992). \n\n2 Density Evolution and Asymptotics \n\nWe consider stochastic optimization algorithms with weight $w \\in R^N$. We confine attention to a neighborhood of a local optimum $w^*$ and express the dynamics in terms of the weight error $v = w - w^*$. For simplicity we treat the continuous time algorithm [footnote 1] \n\n$$\\frac{dv(t)}{dt} = \\mu(t) H[v(t), x(t)] \\qquad (1)$$ \n\nwhere $\\mu(t)$ is the learning rate at time $t$, $H$ is the weight update function and $x(t)$ is the data fed to the algorithm at time $t$. For stochastic gradient algorithms $H = -\\nabla_v \\mathcal{E}(v, x(t))$, minus the gradient of the instantaneous cost function. Convergence (in mean square) to $w^*$ is characterized by the average squared norm of the weight error $E[|v|^2] = \\mathrm{Trace}\\; C$ where \n\n$$C \\equiv \\int d^N v \\; v v^T P(v, t) \\qquad (2)$$ \n\nis the weight error correlation matrix and $P(v, t)$ is the probability density at $v$ and time $t$. 
In (Leen and Moody, 1993) we show that the probability density evolves according to the Kramers-Moyal expansion \n\n$$\\frac{\\partial P(v, t)}{\\partial t} = \\sum_{i=1}^{\\infty} \\frac{(-1)^i}{i!} \\sum_{j_1, \\ldots, j_i = 1}^{N} \\frac{\\partial^i}{\\partial v_{j_1} \\partial v_{j_2} \\cdots \\partial v_{j_i}} \\left[ \\mu(t)^i \\langle H_{j_1} H_{j_2} \\cdots H_{j_i} \\rangle_x P(v, t) \\right] \\qquad (3)$$ \n\nwhere $H_{j_k}$ denotes the $j_k$-th component of the N-component vector $H$, and $\\langle \\cdots \\rangle_x$ denotes averaging over the density of inputs. Differentiating (2) with respect to time, using (3) and integrating by parts, we obtain the equation of motion for the weight error correlation \n\n$$\\frac{dC}{dt} = \\mu(t) \\int d^N v \\; P(v, t) \\left[ v \\langle H(v, x)^T \\rangle_x + \\langle H(v, x) \\rangle_x v^T \\right] + \\mu(t)^2 \\int d^N v \\; P(v, t) \\langle H(v, x) H(v, x)^T \\rangle_x \\qquad (4)$$ \n\n[Footnote 1] Although algorithms are executed in discrete time, continuous time formulations are often advantageous for analysis. The passage from discrete to continuous time is treated in various ways depending on the needs of the theoretical exposition. Kushner and Clark (1978) define continuous time functions that interpolate the discrete time process in order to establish an equivalence between the asymptotic behavior of the discrete time stochastic process, and solutions of an associated deterministic differential equation. Heskes et al. (1992) draw on the results of Bedeaux et al. (1971) that link (discrete time) random walk trajectories to the solution of a (continuous time) master equation. Heskes' master equation is equivalent to our Kramers-Moyal expansion (3). \n\n2.1 Asymptotics of the Weight Error Correlation \n\nConvergence of $v$ can be understood by studying the late time behavior of (4). Since the update function $H(v, x)$ is in general non-linear in $v$, the time evolution of the correlation matrix $C_{ij}$ is coupled to higher moments $E[v_i v_j v_k \\cdots]$ of the weight error. However, the learning rate is assumed to follow a schedule $\\mu(t)$ that satisfies the requirements for convergence in mean square to a local optimum. 
Thus at late times the density becomes sharply peaked about $v = 0$ [footnote 2]. This suggests that we expand $H(v, x)$ in a power series about $v = 0$ and retain the lowest order non-trivial terms in (4) leaving: \n\n$$\\frac{dC}{dt} = -\\mu(t) \\left[ (R C) + (C R^T) \\right] + \\mu(t)^2 D, \\qquad (5)$$ \n\nwhere $R$ is the Hessian of the average cost function $\\langle \\mathcal{E} \\rangle_x$, and \n\n$$D = \\langle H(0, x) H(0, x)^T \\rangle_x \\qquad (6)$$ \n\nis the diffusion matrix, both evaluated at the local optimum $w^*$. (Note that $R^T = R$.) We use (5) with the understanding that it is valid for large $t$. The solution to (5) is \n\n$$C(t) = U(t, t_0) C(t_0) U^T(t, t_0) + \\int_{t_0}^{t} d\\tau \\; \\mu(\\tau)^2 U(t, \\tau) D U^T(t, \\tau) \\qquad (7)$$ \n\nwhere the evolution operator $U(t_2, t_1)$ is \n\n$$U(t_2, t_1) = \\exp \\left[ -R \\int_{t_1}^{t_2} d\\tau \\; \\mu(\\tau) \\right] \\qquad (8)$$ \n\nWe assume, without loss of generality, that the coordinates are chosen so that $R$ is diagonal ($D$ won't be) with eigenvalues $\\lambda_i$, $i = 1 \\ldots N$. Then with $\\mu(t) = \\mu_0/t$ we obtain \n\n$$E[|v|^2] = \\mathrm{Trace}[C(t)] = \\sum_{i=1}^{N} \\left\\{ C_{ii}(t_0) \\left( \\frac{t_0}{t} \\right)^{2 \\mu_0 \\lambda_i} + \\frac{\\mu_0^2 D_{ii}}{2 \\mu_0 \\lambda_i - 1} \\left[ \\frac{1}{t} - \\frac{1}{t_0} \\left( \\frac{t_0}{t} \\right)^{2 \\mu_0 \\lambda_i} \\right] \\right\\} \\qquad (9)$$ \n\n[Footnote 2] In general the density will have nonzero components outside the basin of $w^*$. We are neglecting these, for the purpose of calculating the second moment of the local density in the vicinity of $w^*$. \n\nWe define \n\n$$\\mu_{crit} \\equiv \\frac{1}{2 \\lambda_{min}} \\qquad (10)$$ \n\nand identify two regimes for which the behavior of (9) is fundamentally different: \n\n1. $\\mu_0 > \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $1/t$. \n2. $\\mu_0 < \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $(1/t)^{2 \\mu_0 \\lambda_{min}}$, i.e. more slowly than $1/t$. \n\nFigure 1 shows results from simulations of an ensemble of 2000 networks trained by LMS, and the prediction from (9). For the simulations, input data were drawn from a gaussian with zero mean and variance $R = 1.0$. The targets were generated by a noisy teacher neuron (i.e. targets $= w^* x + \\xi$, where $\\langle \\xi \\rangle = 0$ and $\\langle \\xi^2 \\rangle = \\sigma^2$). 
The upper two curves in each plot (dotted) depict the behavior for $\\mu_0 < \\mu_{crit} = 0.5$. The remaining curves (solid) show the behavior for $\\mu_0 > \\mu_{crit}$. \n\nFig.1: LEFT - Simulation results from an ensemble of 2000 one-dimensional LMS algorithms with $R = 1.0$, $\\sigma^2 = 1.0$ and $\\mu = \\mu_0/t$. RIGHT - Theoretical predictions from equation (9). Curves correspond to (top to bottom) $\\mu_0$ = 0.2, 0.4, 0.6, 0.8, 1.0, 1.5. \n\nBy minimizing the coefficient of $1/t$ in (9), the optimal learning rate is found to be $\\mu_{opt} = 1/\\lambda_{min}$. This formalism also yields asymptotic normality rather simply (Orr and Leen, 1994). These conditions for \"optimal\" (i.e. $1/t$) convergence of the weight error correlation and the related results on asymptotic normality have been previously discussed in the stochastic approximation literature (Darken and Moody, 1992; Goldstein, 1987; White, 1989; and references therein). The present formal structure provides the results with relative ease and facilitates the extension to stochastic gradient descent with momentum. \n\n3 Stochastic Search with Constant Momentum \n\nThe discrete time algorithm for stochastic optimization with momentum is: \n\n$$v(t + 1) = v(t) + \\mu(t) H[v(t), x(t)] + \\beta \\Omega(t) \\qquad (11)$$ \n\n$$\\Omega(t + 1) \\equiv v(t + 1) - v(t) = \\Omega(t) + \\mu(t) H[v(t), x(t)] + (\\beta - 1) \\Omega(t), \\qquad (12)$$ \n\nor in continuous time, \n\n$$\\frac{dv(t)}{dt} = \\mu(t) H[v(t), x(t)] + \\beta \\Omega(t) \\qquad (13)$$ \n\n$$\\frac{d\\Omega(t)}{dt} = \\mu(t) H[v(t), x(t)] + (\\beta - 1) \\Omega(t). \\qquad (14)$$ \n\nAs before, we are interested in the late time behavior of $E[|v|^2]$. 
To this end, we define the 2N-dimensional variable $Z \\equiv (v, \\Omega)^T$ and, following the arguments of the previous sections, expand $H[v(t), x(t)]$ in a power series about $v = 0$ retaining the lowest order non-trivial terms. In this approximation the correlation matrix $\\tilde{C} \\equiv E[Z Z^T]$ evolves according to \n\n$$\\frac{d\\tilde{C}}{dt} = K \\tilde{C} + \\tilde{C} K^T + \\mu(t)^2 \\tilde{D} \\qquad (15)$$ \n\nwith \n\n$$K = \\begin{pmatrix} -\\mu(t) R & \\beta I \\\\ -\\mu(t) R & (\\beta - 1) I \\end{pmatrix}, \\qquad \\tilde{D} = \\begin{pmatrix} D & D \\\\ D & D \\end{pmatrix}. \\qquad (16)$$ \n\n$I$ is the $N \\times N$ identity matrix, and $R$ and $D$ are defined as before. The evolution operator is now \n\n$$U(t_2, t_1) = \\exp \\left[ \\int_{t_1}^{t_2} d\\tau \\; K(\\tau) \\right] \\qquad (17)$$ \n\nand the solution to (15) is \n\n$$\\tilde{C} = U(t, t_0) \\tilde{C}(t_0) U^T(t, t_0) + \\int_{t_0}^{t} d\\tau \\; \\mu^2(\\tau) U(t, \\tau) \\tilde{D} U^T(t, \\tau) \\qquad (18)$$ \n\nThe squared norm of the weight error is the sum of the first N diagonal elements of $\\tilde{C}$. In coordinates for which $R$ is diagonal and with $\\mu(t) = \\mu_0/t$, we find that for $t \\gg t_0$ \n\n$$E[|v|^2] \\approx \\sum_{i=1}^{N} \\left\\{ C_{ii}(t_0) \\left( \\frac{t_0}{t} \\right)^{\\frac{2 \\mu_0 \\lambda_i}{1 - \\beta}} + \\frac{\\mu_0^2 D_{ii}}{(1 - \\beta)(2 \\mu_0 \\lambda_i + \\beta - 1)} \\left[ \\frac{1}{t} - \\frac{1}{t_0} \\left( \\frac{t_0}{t} \\right)^{\\frac{2 \\mu_0 \\lambda_i}{1 - \\beta}} \\right] \\right\\} \\qquad (19)$$ \n\nThis reduces to (9) when $\\beta = 0$. Equation (19) defines two regimes of interest: \n\n1. $\\mu_0/(1 - \\beta) > \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $1/t$. \n2. $\\mu_0/(1 - \\beta) < \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $(1/t)^{2 \\mu_0 \\lambda_{min}/(1 - \\beta)}$, i.e. more slowly than $1/t$. \n\nThe form of (19) and the conditions following it show that the asymptotics of gradient descent with momentum are governed by the effective learning rate \n\n$$\\mu_{eff} \\equiv \\frac{\\mu_0}{1 - \\beta}.$$ \n\nFigure 2 compares simulations with the predictions of (19) for fixed $\\mu_0$ and various $\\beta$. The simulations were performed on an ensemble of 2000 networks trained by LMS as described previously but with an additional momentum term of the form given in (11). The upper three curves (dotted) show the behavior of $E[|v|^2]$ for $\\mu_{eff} < \\mu_{crit}$. The solid curves show the behavior for $\\mu_{eff} > \\mu_{crit}$. \n\nThe derivation of asymptotic normality proceeds similarly to the case without momentum. Again the reader is referred to (Orr and Leen, 1994) for details. \n\nFig.2: LEFT - Simulation results from an ensemble of 2000 one-dimensional LMS algorithms with momentum with $R = 1.0$, $\\sigma^2 = 1.0$, and $\\mu_0 = 0.2$. RIGHT - Theoretical predictions from equation (19). Curves correspond to (top to bottom) $\\beta$ = 0.0, 0.4, 0.5, 0.6, 0.7, 0.8. \n\n4 Adaptive Momentum Insures Optimal Convergence \n\nThe optimal constant momentum parameter is obtained by minimizing the coefficient of $1/t$ in (19). Imposing the restriction that this parameter is positive [footnote 3] gives \n\n$$\\beta_{opt} = \\max(0, 1 - \\mu_0 \\lambda_{min}). \\qquad (20)$$ \n\nAs with $\\mu_{opt}$, this result is not of practical use because, in general, $\\lambda_{min}$ is unknown. \n\n[Footnote 3] $E[|v|^2]$ diverges for $|\\beta| > 1$. For $-1 < \\beta < 0$, $E[|v|^2]$ appears to converge but oscillations are observed. Additional study is required to determine whether $\\beta$ in this range might be useful for improving learning. \n\nFor 1-dimensional linear networks, an alternative is to use the instantaneous estimate of $\\lambda$, $\\hat{\\lambda}(t) = x^2(t)$, where $x(t)$ is the network input at time $t$. We thus define the adaptive momentum parameter to be \n\n$$\\beta_{adapt} = \\max(0, 1 - \\mu_0 x^2) \\quad \\text{(1-dimension)}. \\qquad (21)$$ \n\nAn algorithm based on (21) insures that the late time convergence is optimally fast. An alternative route to achieving the same goal is to dispense with the momentum term and adaptively adjust the learning rate. Venter (1967) proposed an algorithm that iteratively estimates $\\lambda$ for 1-D algorithms and uses the estimate to adjust $\\mu_0$. Darken and Moody (1992) propose measuring an auxiliary statistic they call \"drift\" that is used to determine whether or not $\\mu_0 > \\mu_{crit}$. 
The adaptive momentum scheme generalizes to multiple dimensions more easily than Venter's algorithm, and, unlike Darken and Moody's scheme, does not involve calculating an auxiliary statistic not directly involved with the minimization. \n\nA natural extension to N dimensions is to define a matrix of momentum coefficients, $\\gamma = I - \\mu_0 x x^T$, where $I$ is the $N \\times N$ identity matrix. By zeroing out the negative eigenvalues of $\\gamma$, we obtain the adaptive momentum matrix \n\n$$\\beta_{adapt} = I - \\epsilon x x^T, \\quad \\text{where } \\epsilon = \\min(\\mu_0, 1/(x^T x)). \\qquad (22)$$ \n\nFig.3: Simulations of 2-D LMS with 1000 networks initialized at $v_0 = (.2, .3)$ and with $\\sigma^2 = 1$, $\\lambda_1 = .4$, $\\lambda_2 = 4$, and $\\mu_{crit} = 1.25$. LEFT - $\\beta = 0$, RIGHT - $\\beta = \\beta_{adapt}$. Dashed curves correspond to adaptive momentum. \n\nFigure 3 shows that our adaptive momentum not only achieves the optimal convergence rate independent of the learning rate parameter $\\mu_0$ but that the value of $\\log(E[|v|^2])$ at late times is nearly independent of $\\mu_0$ and smaller than when momentum is not used. The left graph displays simulation results without momentum. Here, convergence rates clearly depend on $\\mu_0$ and are optimal for $\\mu_0 > \\mu_{crit} = 1.25$. When $\\mu_0$ is large there is initially significant spreading in $v$ so that the increased convergence rate does not result in lower $\\log(E[|v|^2])$ until very late times ($t \\approx 10^5$). The graph on the right shows simulations with adaptive momentum. Initially, the spreading is even greater than with no momentum, but $\\log(E[|v|^2])$ quickly decreases to reach a much smaller value. In addition, for $t \\gtrsim 300$, the optimal convergence rate (slope = -1) is achieved for all three values of $\\mu_0$ and the curves themselves lie almost on top of one another. In other words, at late times ($t \\geq 300$), the value of $\\log(E[|v|^2])$ is independent of $\\mu_0$ when adaptive momentum is used. 
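As a rough illustration of equations (21) and (22), the following Python sketch computes the adaptive momentum coefficient and runs a single 1-D LMS trajectory with the discrete-time update (11)-(12) and mu(t) = mu0/t. The function names, the default parameter values, the random seed, and the choice to evaluate the momentum coefficient with the current input x(t) are assumptions made for illustration. They are not taken from the simulations reported above.

```python
import random


def adaptive_momentum_1d(x, mu0):
    # Equation (21): beta_adapt = max(0, 1 - mu0 * x^2).
    return max(0.0, 1.0 - mu0 * x * x)


def adaptive_momentum_matrix(x, mu0):
    # Equation (22): beta_adapt = I - eps * x x^T, eps = min(mu0, 1 / (x^T x)).
    # For x = 0 the product x x^T vanishes, so any eps gives the identity.
    n = len(x)
    xx = sum(xi * xi for xi in x)
    eps = mu0 if xx == 0.0 else min(mu0, 1.0 / xx)
    return [[(1.0 if i == j else 0.0) - eps * x[i] * x[j] for j in range(n)]
            for i in range(n)]


def lms_adaptive_momentum(mu0, steps, t0=1, sigma=1.0, v0=0.2, seed=0):
    # One 1-D LMS trajectory of the weight error v = w - w*.
    # Inputs are unit-variance gaussians (R = 1); targets are w* x + noise.
    rng = random.Random(seed)
    v, omega = v0, 0.0  # omega holds the previous update v(t) - v(t-1)
    for t in range(t0, t0 + steps):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, sigma)
        # LMS update function: minus the gradient of 0.5 * (noise - v x)^2.
        h = x * (noise - v * x)
        beta = adaptive_momentum_1d(x, mu0)  # assumption: uses current x(t)
        omega = (mu0 / t) * h + beta * omega  # equation (12)
        v = v + omega                         # equation (11)
    return v
```

Averaging the square of the returned value over many seeds would give an ensemble estimate of E[|v|^2] at the final time, comparable in spirit to the curves in Fig.3. The matrix form reduces to the scalar rule when x has a single component.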
\n\n5 Summary \n\nWe have used the dynamics of the weight space probabilities to derive the asymptotic behavior of the weight error correlation for annealed stochastic gradient algorithms with momentum. The late time behavior is governed by the effective learning rate $\\mu_{eff} = \\mu_0/(1 - \\beta)$. For learning rate schedules $\\mu_0/t$, if $\\mu_{eff} > 1/(2 \\lambda_{min})$, then the squared norm of the weight error $v \\equiv w - w^*$ falls off as $1/t$. From these results we have developed a form of momentum that adapts to obtain optimal convergence rates independent of the learning rate parameter. \n\nAcknowledgments \n\nThis work was supported by grants from the Air Force Office of Scientific Research (F49620-93-1-0253) and the Electric Power Research Institute (RP8015-2). \n\nReferences \n\nD. Bedeaux, K. Laktos-Lindenberg, and K. Shuler. (1971) On the Relation Between Master Equations and Random Walks and their Solutions. Journal of Mathematical Physics, 12:2116-2123. \n\nChristian Darken and John Moody. (1992) Towards Faster Stochastic Gradient Search. In J.E. Moody, S.J. Hanson, and R.P. Lipmann (eds.) Advances in Neural Information Processing Systems, vol. 4. Morgan Kaufmann Publishers, San Mateo, CA, 1009-1016. \n\nLarry Goldstein. (1987) Mean Square Optimality in the Continuous Time Robbins Monro Procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA. \n\nH.J. Kushner and D.S. Clark. (1978) Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York. \n\nTom M. Heskes, Eddy T.P. Slijpen, and Bert Kappen. (1992) Learning in Neural Networks with Local Minima. Physical Review A, 46(8):5221-5231. \n\nTodd K. Leen and John E. Moody. (1993) Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria. In Giles, Hanson, and Cowan (eds.), Advances in Neural Information Processing Systems, vol. 
5, Morgan Kaufmann Publishers, San Mateo, CA, 451-458. \n\nG. B. Orr and T. K. Leen. (1993) Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times. In Giles, Hanson, and Cowan (eds.), Advances in Neural Information Processing Systems, vol. 5, Morgan Kaufmann Publishers, San Mateo, CA, 507-514. \n\nG. B. Orr and T. K. Leen. (1994) Momentum and Optimal Stochastic Search. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend (eds.), Proceedings of the 1993 Connectionist Models Summer School, 351-357. \n\nJohn J. Shynk and Sumit Roy. (1988) The LMS Algorithm with Momentum Updating. Proceedings of the IEEE International Symposium on Circuits and Systems, 2651-2654. \n\nMehmet Ali Tugay and Yalçin Tanik. (1989) Properties of the Momentum LMS Algorithm. Signal Processing, 18:117-127. \n\nJ. H. Venter. (1967) An Extension of the Robbins-Monro Procedure. Annals of Mathematical Statistics, 38:181-190. \n\nHalbert White. (1989) Learning in Artificial Neural Networks: A Statistical Perspective. Neural Computation, 1:425-464. \n", "award": [], "sourceid": 772, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "Genevieve", "family_name": "Orr", "institution": null}]}