{"title": "Optimal Stochastic Search and Adaptive Momentum", "book": "Advances in Neural Information Processing Systems", "page_first": 477, "page_last": 484, "abstract": null, "full_text": "Optimal Stochastic Search and Adaptive Momentum\n\nTodd K. Leen and Genevieve B. Orr\n\nOregon Graduate Institute of Science and Technology\n\nDepartment of Computer Science and Engineering\n\nP.O. Box 91000, Portland, Oregon 97291-1000\n\nAbstract\n\nStochastic optimization algorithms typically use learning rate schedules that behave asymptotically as $\\mu(t) = \\mu_0/t$. The ensemble dynamics (Leen and Moody, 1993) for such algorithms provides an easy path to results on mean squared weight error and asymptotic normality. We apply this approach to stochastic gradient algorithms with momentum. We show that at late times, learning is governed by an effective learning rate $\\mu_{eff} = \\mu_0/(1 - \\beta)$ where $\\beta$ is the momentum parameter. We describe the behavior of the asymptotic weight error and give conditions on $\\mu_{eff}$ that insure optimal convergence speed. Finally, we use the results to develop an adaptive form of momentum that achieves optimal convergence speed independent of $\\mu_0$.\n\n1 Introduction\n\nThe rate of convergence for gradient descent algorithms, both batch and stochastic, can be improved by including in the weight update a \"momentum\" term proportional to the previous weight update. Several authors (Tugay and Tanik, 1989; Shynk and Roy, 1988) give conditions for convergence of the mean and covariance of the weight vector for momentum LMS with constant learning rate. However, stochastic algorithms require that the learning rate decay over time in order to achieve true convergence of the weight (in probability, in mean square, or with probability one). 
\n\nThis paper uses our previous work on weight space probabilities (Leen and Moody, 1993; Orr and Leen, 1993) to study the convergence of stochastic gradient algorithms with annealed learning rates of the form $\\mu = \\mu_0/t$, both with and without momentum. The approach provides simple derivations of previously known results and their extension to stochastic descent with momentum. Specifically, we show that the mean squared weight misadjustment drops off at the maximal rate $\\propto 1/t$ only if the effective learning rate $\\mu_{eff} = \\mu_0/(1 - \\beta)$ is greater than a critical value which is determined by the Hessian.\n\nThese results suggest a new algorithm that automatically adjusts the momentum coefficient to achieve the optimal convergence rate. This algorithm is simpler than previous approaches that either estimate the curvature directly during the descent (Venter, 1967) or measure an auxiliary statistic not directly involved in the optimization (Darken and Moody, 1992).\n\n2 Density Evolution and Asymptotics\n\nWe consider stochastic optimization algorithms with weight $w \\in R^N$. We confine attention to a neighborhood of a local optimum $w^*$ and express the dynamics in terms of the weight error $v = w - w^*$. For simplicity we treat the continuous time algorithm$^1$\n\n$$\\frac{dv(t)}{dt} = \\mu(t) \\, H[v(t), x(t)] \\qquad (1)$$\n\nwhere $\\mu(t)$ is the learning rate at time $t$, $H$ is the weight update function and $x(t)$ is the data fed to the algorithm at time $t$. For stochastic gradient algorithms $H = -\\nabla_v \\mathcal{E}(v, x(t))$, minus the gradient of the instantaneous cost function. 
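As a rough illustration of the discrete-time counterpart of (1), the following sketch simulates an ensemble of one-dimensional LMS learners with the annealed schedule mu(t) = mu0/t. The noisy linear teacher, the function name lms_ensemble_error, the start time t0 and the ensemble size are assumptions made for this example only; they are not taken from the analysis above.

```python
import random


def lms_ensemble_error(mu0, t0=10, t_max=2000, n_nets=200, sigma=1.0, v0=1.0, seed=0):
    '''Return the ensemble average of v**2 after annealed LMS training.

    Assumed setup (illustrative only): scalar input x ~ N(0, 1), target
    y = w_star * x + xi with noise xi ~ N(0, sigma**2), and learning rate
    mu(t) = mu0 / t. Written in terms of the weight error v = w - w_star,
    the update function is H(v, x) = (xi - v * x) * x.
    '''
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_nets):
        v = v0
        for t in range(t0, t_max):
            x = rng.gauss(0.0, 1.0)
            xi = rng.gauss(0.0, sigma)
            # Discrete-time step v(t+1) = v(t) + mu(t) * H(v(t), x(t))
            v += (mu0 / t) * (xi - v * x) * x
        total += v * v
    return total / n_nets


if __name__ == '__main__':
    for mu0 in (0.2, 1.0):
        print(mu0, lms_ensemble_error(mu0))
```

With these assumed settings the ensemble average of v squared should come out much smaller for mu0 = 1.0 than for mu0 = 0.2, because the initial error decays faster for the larger value.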
\n\n$^1$ Although algorithms are executed in discrete time, continuous time formulations are often advantageous for analysis. The passage from discrete to continuous time is treated in various ways depending on the needs of the theoretical exposition. Kushner and Clark (1978) define continuous time functions that interpolate the discrete time process in order to establish an equivalence between the asymptotic behavior of the discrete time stochastic process and solutions of an associated deterministic differential equation. Heskes et al. (1992) draw on the results of Bedeaux et al. (1971) that link (discrete time) random walk trajectories to the solution of a (continuous time) master equation. Heskes' master equation is equivalent to our Kramers-Moyal expansion (3).\n\nConvergence (in mean square) to $w^*$ is characterized by the average squared norm of the weight error $E[|v|^2] = \\mathrm{Trace}\\, C$ where\n\n$$C \\equiv \\int d^N v \\; v v^T P(v, t) \\qquad (2)$$\n\nis the weight error correlation matrix and $P(v, t)$ is the probability density at $v$ and time $t$. In (Leen and Moody, 1993) we show that the probability density evolves according to the Kramers-Moyal expansion\n\n$$\\frac{\\partial P(v, t)}{\\partial t} = \\sum_{i=1}^{\\infty} \\frac{(-1)^i}{i!} \\, \\mu(t)^i \\sum_{j_1, \\ldots, j_i = 1}^{N} \\frac{\\partial^i}{\\partial v_{j_1} \\partial v_{j_2} \\cdots \\partial v_{j_i}} \\left[ \\langle H_{j_1} H_{j_2} \\cdots H_{j_i} \\rangle_x \\, P(v, t) \\right] \\qquad (3)$$\n\nwhere $H_{j_k}$ denotes the $j_k$-th component of the $N$-component vector $H$, and $\\langle \\cdots \\rangle_x$ denotes averaging over the density of inputs. 
Differentiating (2) with respect to time, using (3) and integrating by parts, we obtain the equation of motion for the weight error correlation\n\n$$\\frac{dC}{dt} = \\mu(t) \\int d^N v \\, P(v, t) \\left[ v \\langle H(v, x)^T \\rangle_x + \\langle H(v, x) \\rangle_x v^T \\right] + \\mu(t)^2 \\int d^N v \\, P(v, t) \\, \\langle H(v, x) H(v, x)^T \\rangle_x \\qquad (4)$$\n\n2.1 Asymptotics of the Weight Error Correlation\n\nConvergence of $v$ can be understood by studying the late time behavior of (4). Since the update function $H(v, x)$ is in general non-linear in $v$, the time evolution of the correlation matrix $C_{ij}$ is coupled to higher moments $E[v_i v_j v_k \\cdots]$ of the weight error. However, the learning rate is assumed to follow a schedule $\\mu(t)$ that satisfies the requirements for convergence in mean square to a local optimum. Thus at late times the density becomes sharply peaked about $v = 0$.$^2$ This suggests that we expand $H(v, x)$ in a power series about $v = 0$ and retain the lowest order non-trivial terms in (4), leaving:\n\n$$\\frac{dC}{dt} = -\\mu(t) \\left[ (R C) + (C R^T) \\right] + \\mu(t)^2 D , \\qquad (5)$$\n\nwhere $R$ is the Hessian of the average cost function $\\langle \\mathcal{E} \\rangle_x$, and\n\n$$D \\equiv \\langle H(0, x) H(0, x)^T \\rangle_x \\qquad (6)$$\n\nis the diffusion matrix, both evaluated at the local optimum $w^*$. (Note that $R^T = R$.) We use (5) with the understanding that it is valid for large $t$. The solution to (5) is\n\n$$C(t) = U(t, t_0) \\, C(t_0) \\, U^T(t, t_0) + \\int_{t_0}^{t} d\\tau \\, \\mu(\\tau)^2 \\, U(t, \\tau) \\, D \\, U^T(t, \\tau) \\qquad (7)$$\n\nwhere the evolution operator $U(t_2, t_1)$ is\n\n$$U(t_2, t_1) = \\exp\\left[ -R \\int_{t_1}^{t_2} d\\tau \\, \\mu(\\tau) \\right] \\qquad (8)$$\n\nWe assume, without loss of generality, that the coordinates are chosen so that $R$ is diagonal ($D$ won't be) with eigenvalues $\\lambda_i$, $i = 1 \\ldots N$. Then with $\\mu(t) = \\mu_0/t$ we obtain\n\n$$E[|v|^2] = \\mathrm{Trace}[C(t)] = \\sum_{i=1}^{N} \\left\\{ C_{ii}(t_0) \\left( \\frac{t_0}{t} \\right)^{2 \\mu_0 \\lambda_i} + \\frac{\\mu_0^2 D_{ii}}{2 \\mu_0 \\lambda_i - 1} \\left[ \\frac{1}{t} - \\frac{1}{t_0} \\left( \\frac{t_0}{t} \\right)^{2 \\mu_0 \\lambda_i} \\right] \\right\\} . \\qquad (9)$$\n\nWe define\n\n$$\\mu_{crit} \\equiv \\frac{1}{2 \\lambda_{min}} \\qquad (10)$$\n\nand identify two regimes for which the behavior of (9) is fundamentally different:\n\n1. $\\mu_0 > \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $1/t$.\n2. $\\mu_0 < \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $(1/t)^{2 \\mu_0 \\lambda_{min}}$, i.e. more slowly than $1/t$.\n\nFigure 1 shows results from simulations of an ensemble of 2000 networks trained by LMS, and the prediction from (9). For the simulations, input data were drawn from a gaussian with zero mean and variance $R = 1.0$. The targets were generated by a noisy teacher neuron (i.e. targets $= w^* x + \\xi$, where $\\langle \\xi \\rangle = 0$ and $\\langle \\xi^2 \\rangle = \\sigma^2$). The upper two curves in each plot (dotted) depict the behavior for $\\mu_0 < \\mu_{crit} = 0.5$. The remaining curves (solid) show the behavior for $\\mu_0 > \\mu_{crit}$.\n\nFig. 1: LEFT - Simulation results from an ensemble of 2000 one-dimensional LMS algorithms with $R = 1.0$, $\\sigma^2 = 1.0$ and $\\mu = \\mu_0/t$. RIGHT - Theoretical predictions from equation (9). Curves correspond to (top to bottom) $\\mu_0 = 0.2, 0.4, 0.6, 0.8, 1.0, 1.5$.\n\nBy minimizing the coefficient of $1/t$ in (9), the optimal learning rate is found to be $\\mu_{opt} = 1/\\lambda_{min}$. This formalism also yields asymptotic normality rather simply (Orr and Leen, 1994). These conditions for \"optimal\" (i.e. $1/t$) convergence of the weight error correlation and the related results on asymptotic normality have been previously discussed in the stochastic approximation literature (Darken and Moody, 1992; Goldstein, 1987; White, 1989; and references therein). The present formal structure provides the results with relative ease and facilitates the extension to stochastic gradient descent with momentum.\n\n$^2$ In general the density will have nonzero components outside the basin of $w^*$. We are neglecting these for the purpose of calculating the second moment of the local density in the vicinity of $w^*$.\n\n3 Stochastic Search with Constant Momentum\n\nThe discrete time algorithm for stochastic optimization with momentum is:\n\n$$v(t+1) = v(t) + \\mu(t) \\, H[v(t), x(t)] + \\beta \\, \\Omega(t) \\qquad (11)$$\n\n$$\\Omega(t+1) \\equiv v(t+1) - v(t) = \\Omega(t) + \\mu(t) \\, H[v(t), x(t)] + (\\beta - 1) \\, \\Omega(t), \\qquad (12)$$\n\nor in continuous time,\n\n$$\\frac{dv(t)}{dt} = \\mu(t) \\, H[v(t), x(t)] + \\beta \\, \\Omega(t) \\qquad (13)$$\n\n$$\\frac{d\\Omega(t)}{dt} = \\mu(t) \\, H[v(t), x(t)] + (\\beta - 1) \\, \\Omega(t). \\qquad (14)$$\n\nAs before, we are interested in the late time behavior of $E[|v|^2]$. To this end, we define the $2N$-dimensional variable $Z \\equiv (v, \\Omega)^T$ and, following the arguments of the previous sections, expand $H[v(t), x(t)]$ in a power series about $v = 0$, retaining the lowest order non-trivial terms. In this approximation the correlation matrix $\\bar{C} \\equiv E[Z Z^T]$ evolves according to\n\n$$\\frac{d\\bar{C}}{dt} = K \\bar{C} + \\bar{C} K^T + \\mu(t)^2 \\bar{D} \\qquad (15)$$\n\nwith\n\n$$K = \\begin{pmatrix} -\\mu(t) R & \\beta I \\\\ -\\mu(t) R & (\\beta - 1) I \\end{pmatrix}, \\qquad \\bar{D} = \\begin{pmatrix} D & D \\\\ D & D \\end{pmatrix}. \\qquad (16)$$\n\n$I$ is the $N \\times N$ identity matrix, and $R$ and $D$ are defined as before. The evolution operator is now\n\n$$U(t_2, t_1) = \\exp\\left[ \\int_{t_1}^{t_2} d\\tau \\, K(\\tau) \\right] \\qquad (17)$$\n\nand the solution to (15) is\n\n$$\\bar{C} = U(t, t_0) \\, \\bar{C}(t_0) \\, U^T(t, t_0) + \\int_{t_0}^{t} d\\tau \\, \\mu^2(\\tau) \\, U(t, \\tau) \\, \\bar{D} \\, U^T(t, \\tau) \\qquad (18)$$\n\nThe squared norm of the weight error is the sum of the first $N$ diagonal elements of $\\bar{C}$. 
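The linearized moment equation (15) can be checked numerically in the scalar case N = 1. The sketch below assumes the block form of K that follows from linearizing (13)-(14) with H approximated by -R v + H(0, x), so that every entry of the noise matrix equals D; the function name integrate_moments, the Euler step dt and the initial condition are illustrative assumptions rather than part of the derivation.

```python
def integrate_moments(mu0, beta, lam=1.0, d=1.0, t0=10.0, t_end=2000.0, dt=0.05):
    '''Euler-integrate dC/dt = K C + C K^T + mu(t)**2 Dbar for N = 1.

    Z = (v, omega), K = [[-mu*lam, beta], [-mu*lam, beta - 1]] and
    Dbar = [[d, d], [d, d]] (assumed linearization, see the lead-in).
    Returns E[v**2], the first diagonal element of C, at time t_end.
    '''
    # Start with unit squared weight error and no momentum (assumed).
    c11, c12, c22 = 1.0, 0.0, 0.0
    t = t0
    while t < t_end:
        mu = mu0 / t
        a = mu * lam
        noise = mu * mu * d
        # Entries of K C + C K^T + mu**2 Dbar for the symmetric 2x2 matrix C
        dc11 = 2.0 * (-a * c11 + beta * c12) + noise
        dc12 = -a * c11 + (beta - 1.0 - a) * c12 + beta * c22 + noise
        dc22 = 2.0 * (-a * c12 + (beta - 1.0) * c22) + noise
        c11 += dt * dc11
        c12 += dt * dc12
        c22 += dt * dc22
        t += dt
    return c11


if __name__ == '__main__':
    t_end = 2000.0
    for beta in (0.0, 0.6):
        print(beta, t_end * integrate_moments(0.4, beta, t_end=t_end))
```

For mu0 = 0.4 the call with beta = 0.6 (so that mu0/(1 - beta) = 1.0) should give t * E[v**2] close to 1 at late times, while beta = 0 should stay well above that value.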
\n\nIn coordinates for which $R$ is diagonal and with $\\mu(t) = \\mu_0/t$, we find that for $t \\gg t_0$\n\n$$E[|v|^2] \\simeq \\sum_{i=1}^{N} \\left\\{ C_{ii}(t_0) \\left( \\frac{t_0}{t} \\right)^{\\frac{2 \\mu_0 \\lambda_i}{1 - \\beta}} + \\frac{\\mu_0^2 D_{ii}}{(1 - \\beta) \\, [2 \\mu_0 \\lambda_i - (1 - \\beta)]} \\left[ \\frac{1}{t} - \\frac{1}{t_0} \\left( \\frac{t_0}{t} \\right)^{\\frac{2 \\mu_0 \\lambda_i}{1 - \\beta}} \\right] \\right\\} \\qquad (19)$$\n\nThis reduces to (9) when $\\beta = 0$. Equation (19) defines two regimes of interest:\n\n1. $\\mu_0/(1 - \\beta) > \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $1/t$.\n2. $\\mu_0/(1 - \\beta) < \\mu_{crit}$: $E[|v|^2]$ drops off asymptotically as $(1/t)^{2 \\mu_0 \\lambda_{min}/(1 - \\beta)}$, i.e. more slowly than $1/t$.\n\nThe form of (19) and the conditions following it show that the asymptotics of gradient descent with momentum are governed by the effective learning rate\n\n$$\\mu_{eff} \\equiv \\frac{\\mu_0}{1 - \\beta} .$$\n\nFigure 2 compares simulations with the predictions of (19) for fixed $\\mu_0$ and various $\\beta$. The simulations were performed on an ensemble of 2000 networks trained by LMS as described previously but with an additional momentum term of the form given in (11). The upper three curves (dotted) show the behavior of $E[|v|^2]$ for $\\mu_{eff} < \\mu_{crit}$. The solid curves show the behavior for $\\mu_{eff} > \\mu_{crit}$.\n\nThe derivation of asymptotic normality proceeds similarly to the case without momentum. Again the reader is referred to (Orr and Leen, 1994) for details.\n\nFig. 2: LEFT - Simulation results from an ensemble of 2000 one-dimensional LMS algorithms with momentum with $R = 1.0$, $\\sigma^2 = 1.0$, and $\\mu_0 = 0.2$. RIGHT - Theoretical predictions from equation (19). Curves correspond to (top to bottom) $\\beta = 0.0, 0.4, 0.5, 0.6, 0.7, 0.8$. (Axes: $\\log E[|v|^2]$ versus $t$.)\n\n4 Adaptive Momentum Insures Optimal Convergence\n\nThe optimal constant momentum parameter is obtained by minimizing the coefficient of $1/t$ in (19). Imposing the restriction that this parameter is positive$^3$ gives\n\n$$\\beta_{opt} = \\max(0, 1 - \\mu_0 \\lambda_{min}). \\qquad (20)$$\n\nAs with $\\mu_{opt}$, this result is not of practical use because, in general, $\\lambda_{min}$ is unknown.\n\nFor 1-dimensional linear networks, an alternative is to use the instantaneous estimate of $\\lambda$, $\\hat{\\lambda}(t) = x^2(t)$ where $x(t)$ is the network input at time $t$. We thus define the adaptive momentum parameter to be\n\n$$\\beta_{adapt} = \\max(0, 1 - \\mu_0 x^2) \\qquad \\text{(1-dimension)}. \\qquad (21)$$\n\nAn algorithm based on (21) insures that the late time convergence is optimally fast. An alternative route to achieving the same goal is to dispense with the momentum term and adaptively adjust the learning rate. Venter (1967) proposed an algorithm that iteratively estimates $\\lambda$ for 1-D algorithms and uses the estimate to adjust $\\mu_0$. Darken and Moody (1992) propose measuring an auxiliary statistic they call \"drift\" that is used to determine whether or not $\\mu_0 > \\mu_{crit}$. The adaptive momentum scheme generalizes to multiple dimensions more easily than Venter's algorithm, and, unlike Darken and Moody's scheme, does not involve calculating an auxiliary statistic not directly involved with the minimization.\n\n$^3$ $E[|v|^2]$ diverges for $|\\beta| > 1$. For $-1 < \\beta < 0$, $E[|v|^2]$ appears to converge but oscillations are observed. Additional study is required to determine whether $\\beta$ in this range might be useful for improving learning.\n\nA natural extension to $N$ dimensions is to define a matrix of momentum coefficients, $\\gamma = I - \\mu_0 x x^T$, where $I$ is the $N \\times N$ identity matrix. 
By zeroing out the negative eigenvalues of $\\gamma$, we obtain the adaptive momentum matrix\n\n$$\\beta_{adapt} = I - \\epsilon \\, x x^T, \\quad \\text{where } \\epsilon = \\min(\\mu_0, 1/(x^T x)). \\qquad (22)$$\n\nFig. 3: Simulations of 2-D LMS with 1000 networks initialized at $v_0 = (.2, .3)$ and with $\\sigma^2 = 1$, $\\lambda_1 = .4$, $\\lambda_2 = 4$, and $\\mu_{crit} = 1.25$. LEFT - $\\beta = 0$, RIGHT - $\\beta = \\beta_{adapt}$. Dashed curves correspond to adaptive momentum. (Horizontal axis: $\\log(t)$.)\n\nFigure 3 shows that our adaptive momentum not only achieves the optimal convergence rate independent of the learning rate parameter $\\mu_0$ but that the value of $\\log(E[|v|^2])$ at late times is nearly independent of $\\mu_0$ and smaller than when momentum is not used. The left graph displays simulation results without momentum. Here, convergence rates clearly depend on $\\mu_0$ and are optimal for $\\mu_0 > \\mu_{crit} = 1.25$. When $\\mu_0$ is large there is initially significant spreading in $v$ so that the increased convergence rate does not result in lower $\\log(E[|v|^2])$ until very late times ($t \\sim 10^5$). The graph on the right shows simulations with adaptive momentum. Initially, the spreading is even greater than with no momentum, but $\\log(E[|v|^2])$ quickly decreases to reach a much smaller value. In addition, for $t \\gtrsim 300$, the optimal convergence rate (slope $= -1$) is achieved for all three values of $\\mu_0$ and the curves themselves lie almost on top of one another. In other words, at late times ($t \\gtrsim 300$), the value of $\\log(E[|v|^2])$ is independent of $\\mu_0$ when adaptive momentum is used.\n\n5 Summary\n\nWe have used the dynamics of the weight space probabilities to derive the asymptotic behavior of the weight error correlation for annealed stochastic gradient algorithms with momentum. 
The late time behavior is governed by the effective learning rate $\\mu_{eff} = \\mu_0/(1 - \\beta)$. For learning rate schedules $\\mu_0/t$, if $\\mu_{eff} > 1/(2 \\lambda_{min})$, then the squared norm of the weight error $v \\equiv w - w^*$ falls off as $1/t$. From these results we have developed a form of momentum that adapts to obtain optimal convergence rates independent of the learning rate parameter.\n\nAcknowledgments\n\nThis work was supported by grants from the Air Force Office of Scientific Research (F49620-93-1-0253) and the Electric Power Research Institute (RP8015-2).\n\nReferences\n\nD. Bedeaux, K. Lakatos-Lindenberg, and K. Shuler. (1971) On the Relation Between Master Equations and Random Walks and their Solutions. Journal of Mathematical Physics, 12:2116-2123.\n\nChristian Darken and John Moody. (1992) Towards Faster Stochastic Gradient Search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann (eds.), Advances in Neural Information Processing Systems, vol. 4. Morgan Kaufmann Publishers, San Mateo, CA, 1009-1016.\n\nLarry Goldstein. (1987) Mean Square Optimality in the Continuous Time Robbins Monro Procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA.\n\nH.J. Kushner and D.S. Clark. (1978) Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.\n\nTom M. Heskes, Eddy T.P. Slijpen, and Bert Kappen. (1992) Learning in Neural Networks with Local Minima. Physical Review A, 46(8):5221-5231.\n\nTodd K. Leen and John E. Moody. (1993) Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria. In Giles, Hanson, and Cowan (eds.), Advances in Neural Information Processing Systems, vol. 5, Morgan Kaufmann Publishers, San Mateo, CA, 451-458.\n\nG. B. Orr and T. K. Leen. 
(1993) Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times. In Giles, Hanson, and Cowan (eds.), Advances in Neural Information Processing Systems, vol. 5, Morgan Kaufmann Publishers, San Mateo, CA, 507-514.\n\nG. B. Orr and T. K. Leen. (1994) Momentum and Optimal Stochastic Search. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend (eds.), Proceedings of the 1993 Connectionist Models Summer School, 351-357.\n\nJohn J. Shynk and Sumit Roy. (1988) The LMS Algorithm with Momentum Updating. Proceedings of the IEEE International Symposium on Circuits and Systems, 2651-2654.\n\nMehmet Ali Tugay and Yal\u00e7in Tanik. (1989) Properties of the Momentum LMS Algorithm. Signal Processing, 18:117-127.\n\nJ. H. Venter. (1967) An Extension of the Robbins-Monro Procedure. Annals of Mathematical Statistics, 38:181-190.\n\nHalbert White. (1989) Learning in Artificial Neural Networks: A Statistical Perspective. Neural Computation, 1:425-464.\n\n", "award": [], "sourceid": 772, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "Genevieve", "family_name": "Orr", "institution": null}]}