{"title": "Using Curvature Information for Fast Stochastic Search", "book": "Advances in Neural Information Processing Systems", "page_first": 606, "page_last": 612, "abstract": null, "full_text": "Using Curvature Information for Fast Stochastic Search\n\nGenevieve B. Orr\nDept of Computer Science\nWillamette University\n900 State Street\nSalem, OR 97301\ngorr@willamette.edu\n\nTodd K. Leen\nDept of Computer Science and Engineering\nOregon Graduate Institute of Science and Technology\nP.O. Box 91000, Portland, Oregon 97291-1000\ntleen@cse.ogi.edu\n\nAbstract\n\nWe present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late time convergence rate. The algorithm makes effective use of curvature information, requires only $O(n)$ storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks.\n\nImproving Stochastic Search\n\nLearning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (on-line) or batch form. The stochastic version takes the form\n\n$$w_{t+1} = w_t + \\mu_t G(w_t, x_t) \\qquad (1)$$\n\nwhere $w_t$ is the current weight estimate, $\\mu_t$ is the learning rate, $G$ is minus the instantaneous gradient estimate, and $x_t$ is the input at time $t$ (footnote 1). One obtains the corresponding batch mode learning rule by taking $\\mu$ constant and averaging $G$ over all $x$.\n\nStochastic learning provides several advantages over batch learning. For large datasets the batch average is expensive to compute. Stochastic learning eliminates the averaging. 
The stochastic update can be regarded as a noisy estimate of the batch update, and this intrinsic noise can reduce the likelihood of becoming trapped in poor local optima [1, 2].\n\nFootnote 1: We assume that the inputs are i.i.d. This is achieved by random sampling with replacement from the training data.\n\nThe noise must be reduced late in the training to allow weights to converge. After settling within the basin of a local optimum $w_*$, learning rate annealing allows convergence of the weight error $v \\equiv w - w_*$. It is well-known that the expected squared weight error, $E[|v|^2]$, decays at its maximal rate $\\propto 1/t$ with the annealing schedule $\\mu_0/t$. Furthermore, to achieve this rate one must have $\\mu_0 > \\mu_{crit} = 1/(2\\lambda_{min})$, where $\\lambda_{min}$ is the smallest eigenvalue of the Hessian at $w_*$ [3, 4, 5, and references therein]. Finally, the optimal $\\mu_0$, which gives the lowest possible value of $E[|v|^2]$, is $\\mu_0 = 1/\\lambda$. In multiple dimensions the optimal learning rate matrix is $\\mu(t) = (1/t) \\mathcal{H}^{-1}$, where $\\mathcal{H}$ is the Hessian at the local optimum.\n\nIncorporating this curvature information into stochastic learning is difficult for two reasons. First, the Hessian is not available since the point of stochastic learning is not to perform averages over the training data. Second, even if the Hessian were available, optimal learning requires its inverse, which is prohibitively expensive to compute (footnote 2).\n\nThe primary result of this paper is that one can achieve an algorithm that behaves optimally, i.e. as if one had incorporated the inverse of the full Hessian, without the storage or computational burden. 
The algorithm, which requires only $O(n)$ storage and computation ($n$ = number of weights in the network), uses an adaptive momentum parameter, extending our earlier work [7] to fully non-linear problems. We demonstrate the performance on several large back-prop networks trained with large datasets.\n\nImplementations of stochastic learning typically use a constant learning rate during the early part of training (what Darken and Moody [4] call the search phase) to obtain exponential convergence towards a local optimum, and then switch to annealed learning (called the converge phase). We use Darken and Moody's adaptive search then converge (ASTC) algorithm to determine the point at which to switch to $1/t$ annealing. ASTC was originally conceived as a means to ensure $\\mu_0 > \\mu_{crit}$ during the annealed phase, and we compare its performance with adaptive momentum as well. We also provide a comparison with conjugate gradient optimization.\n\n1 Momentum in Stochastic Gradient Descent\n\nThe adaptive momentum algorithm we propose was suggested by earlier work on convergence rates for annealed learning with constant momentum. In this section we summarize the relevant results of that work.\n\nExtending (1) to include momentum gives the learning rule\n\n$$w_{t+1} = w_t + \\mu_t G(w_t, x_t) + \\beta (w_t - w_{t-1}) \\qquad (2)$$\n\nwhere $\\beta$ is the momentum parameter, constrained so that $0 < \\beta < 1$. Analysis of the dynamics of the expected squared weight error $E[|v|^2]$ with $\\mu_t = \\mu_0/t$ learning rate annealing [7, 8] shows that at late times, learning proceeds as for the algorithm without momentum, but with a scaled or effective learning rate\n\n$$\\mu_{eff} = \\frac{\\mu_0}{1 - \\beta}. \\qquad (3)$$\n\nThis result is consistent with earlier work on momentum learning with small, constant $\\mu$, where the same result holds [9, 10, 11].\n\nFootnote 2: Venter [6] proposed a 1-D algorithm for optimizing the convergence rate that estimates the Hessian by time averaging finite differences of the gradient and scaling the learning rate by the inverse. Its extension to multiple dimensions would require $O(n^2)$ storage and $O(n^3)$ time for inversion. Both are prohibitive for large models.\n\nIf we allow the effective learning rate to be a matrix, then, following our comments in the introduction, the lowest value of the misadjustment is achieved when $\\mu_{eff} = \\mathcal{H}^{-1}$ [7, 8]. Combining this result with (3) suggests that we adopt the heuristic (footnote 3)\n\n$$\\beta_{opt} = I - \\mu_0 \\mathcal{H} \\qquad (4)$$\n\nwhere $\\beta_{opt}$ is a matrix of momentum parameters, $I$ is the identity matrix, and $\\mu_0$ is a scalar.\n\nWe started with a scalar momentum parameter constrained by $0 < \\beta < 1$. The equivalent constraint for our matrix $\\beta_{opt}$ is that its eigenvalues lie between 0 and 1. Thus we require $\\mu_0 < 1/\\lambda_{max}$, where $\\lambda_{max}$ is the largest eigenvalue of $\\mathcal{H}$.\n\nA scalar annealed learning rate $\\mu_0/t$ combined with the momentum parameter $\\beta_{opt}$ ought to provide an effective learning rate asymptotically equal to the optimal learning rate $\\mathcal{H}^{-1}$. This rate 1) is achieved without ever performing a matrix inversion on $\\mathcal{H}$ and 2) is independent of the choice of $\\mu_0$, subject to the restriction in the previous paragraph.\n\nWe have dispensed with the need to invert the Hessian, and we next dispense with the need to store it. First notice that, unlike its inverse, stochastic estimates of $\\mathcal{H}$ are readily available, so we use a stochastic estimate in (4). 
Secondly, according to (2) we do not require the matrix $\\beta_{opt}$, but rather $\\beta_{opt}$ times the last weight update. For both linear and non-linear networks this dispenses with the $O(n^2)$ storage requirements. This algorithm, which we refer to as adaptive momentum, does not require explicit knowledge or inversion of the Hessian, and can be implemented very efficiently as is shown in the next section.\n\n2 Implementation\n\nThe algorithm we propose is\n\n$$w_{t+1} = w_t + \\mu_t G(w_t, x_t) + (I - \\mu_0 \\hat{\\mathcal{H}}_t) \\Delta w_t \\qquad (5)$$\n\nwhere $\\Delta w_t = w_t - w_{t-1}$ and $\\hat{\\mathcal{H}}_t$ is a stochastic estimate of the Hessian at time $t$.\n\nWe first consider a single layer feedforward linear network. Since the weights connecting the inputs to different outputs are independent of each other, we need only discuss the case for one output node. Each output node is then treated identically. For one output node and $N$ inputs, the Hessian is $\\mathcal{H} = \\langle x x^T \\rangle_x \\in \\mathbb{R}^{N \\times N}$, where $\\langle \\cdot \\rangle_x$ indicates expectation over the inputs $x$ and where $x^T$ is the transpose of $x$. The single-step estimate of the Hessian is then just $\\hat{\\mathcal{H}}_t = x_t x_t^T$. The momentum term becomes\n\n$$(I - \\mu_0 \\hat{\\mathcal{H}}_t) \\Delta w_t = (I - \\mu_0 (x_t x_t^T)) \\Delta w_t = \\Delta w_t - \\mu_0 x_t (x_t^T \\Delta w_t). \\qquad (6)$$\n\nWritten in this way, we note that there is no matrix multiplication, just the vector dot product $x_t^T \\Delta w_t$ and vector addition, which are both $O(n)$. For $M$ output nodes, the algorithm is then $O(N_w)$, where $N_w = NM$ is the total number of weights in the network.\n\nFor nonlinear networks the problem is somewhat more complicated. 
To compute $\\hat{\\mathcal{H}}_t \\Delta w_t$ we use the algorithm developed by Pearlmutter [12] for computing the product of the Hessian times an arbitrary vector (footnote 4). The equivalent of one forward-back propagation is required for this calculation. Thus, to compute the entire weight update requires two forward-backward propagations, one for the gradient calculation and one for computing $\\hat{\\mathcal{H}}_t \\Delta w_t$.\n\nThe only constraint on $\\mu_0$ is that $\\mu_0 < 1/\\lambda_{max}$. We use the on-line algorithm developed by LeCun, Simard, and Pearlmutter [13] to find the largest eigenvalue prior to the start of training.\n\nFootnote 3: We refer to (4) as a heuristic since we have no theoretical results on the dynamics of the squared weight error for learning with this matrix of momentum parameters.\n\nFootnote 4: We actually use a slight modification that calculates the linearized Hessian times a vector: $D_f \\otimes D_f \\Delta w_t$, where $D_f$ is the Jacobian of the network output (vector) with respect to the weights, and $\\otimes$ indicates a tensor product.\n\nFigure 1: 2-D LMS Simulations: Behavior of $\\log(E[|v|^2])$ versus $\\log(t)$ over an ensemble of 1000 networks with $\\lambda_1 = .4$ and $\\lambda_2 = 4$, $\\sigma_\\epsilon^2 = 1$. a) $\\mu_0 = 0.1$ with various $\\beta$. Dashed curve corresponds to adaptive momentum. b) $\\beta$ adaptive for various $\\mu_0$.\n\n3 Examples\n\nIn the following two subsections we examine the behavior of annealed learning with adaptive momentum on networks previously trained to a point close to an optimum, where the noise dominates. 
We look at very simple linear nets, large linear nets, and a large nonlinear net. In section 3.3 we couple adaptive momentum with automatic switching from constant to annealed learning.\n\n3.1 Linear Networks\n\nWe begin with a simple 2-D LMS network. Inputs $x_t$ are gaussian distributed with zero mean and the targets $d$ at each timestep $t$ are $d_t = w_*^T x_t + \\epsilon_t$, where $\\epsilon_t$ is zero mean gaussian noise and $w_*$ is the optimal weight vector. The weight error at time $t$ is just $v \\equiv w_t - w_*$.\n\nFigure 1 displays results for both constant and adaptive momentum with averages computed over an ensemble of 1000 networks. Figure 1a shows the decay of $E[|v|^2]$ for $\\mu_0 = 0.1$ and various values of $\\beta$. As momentum is increased, the convergence rate increases. The optimal scalar momentum parameter is $\\beta = (1 - \\mu_0 \\lambda_{min}) = .96$. Adaptive momentum achieves essentially the same rate of convergence without prior knowledge of the Hessian.\n\nFigure 1b shows the behavior of $E[|v|^2]$ for various $\\mu_0$ when adaptive momentum is used. One can see that after a few hundred iterations the value of $E[|v|^2]$ is independent of $\\mu_0$ (in all cases $\\mu_0 < 1/\\lambda_{max} < \\mu_{crit}$).\n\nFigure 2 shows the behavior of the misadjustment (mean squared error in excess of the optimum) for a 4-D LMS problem with a large condition number $\\rho \\equiv \\lambda_{max}/\\lambda_{min} = 10^5$. We compare 3 cases: 1) the optimal learning rate matrix $\\mu_0 = \\mathcal{H}^{-1}$ without momentum, 2) $\\mu_0 = .5$ with the optimal constant momentum matrix $\\beta = I - \\mu_0 \\mathcal{H}$, and 3) $\\mu_0 = .5$ with the adaptive momentum. All three cases show similar behavior, showing the efficacy with which the matrix momentum mocks up the optimal learning rate matrix $\\mu_0 = \\mathcal{H}^{-1}$, and lending credence to the stochastic estimate of the Hessian used in adaptive momentum.\n\nFigure 2: 4-D LMS with $\\rho = 10^5$: Plot displays misadjustment. Annealing starts at $t = 10$. For $\\beta_{adapt}$ and $\\beta = I - \\mu_0 \\mathcal{H}$, we use $\\mu_0 = .5$. Each curve is an average of 10 runs.\n\nFigure 3: Linear Prediction: $\\mu_0 = 0.26$. Curves show constant learning rate, annealing started at $t = 50$ without momentum, and with adaptive momentum.\n\nWe next consider a large linear prediction problem (128 inputs, 16 outputs and eigenvalues ranging from $1.06 \\times 10^{-5}$ to 19.98, condition number $\\rho = 1.9 \\times 10^6$) (footnote 5). Figure 3 displays the misadjustment for 1) annealed learning with $\\beta = \\beta_{adapt}$, 2) annealed learning with $\\beta = 0$, and 3) constant learning rate (for comparison purposes). As before, we have first trained (not shown completely) at constant learning rate $\\mu_0 = .026$ until the MSE and the weight error have leveled out. As can be seen, $\\beta_{adapt}$ does much better than annealing without momentum.\n\n3.2 Phoneme Classification\n\nWe next use phoneme classification as an example of a large nonlinear problem. The database consists of 9000 phoneme vectors taken from 48 50-second speech monologues. Each input vector consists of 70 PLP coefficients. There are 39 target classes. The architecture was a standard fully connected feedforward network with 71 (includes bias) input nodes, 70 hidden nodes, and 39 output nodes for a total of 7700 weights.\n\nWe first trained the network with constant learning rate until the MSE flattened out. At that point we either annealed without momentum, annealed with adaptive momentum, or used ASTC (which attempts to adjust $\\mu_0$ to be above $\\mu_{crit}$; see next section). 
When annealing was used without momentum, we found that the noise went away, but the percent of correctly classified phonemes did not improve. Both the adaptive momentum and ASTC resulted in significant increases in the percent correct; however, adaptive momentum was significantly better than ASTC. In the next section, we examine this problem in more detail.\n\n3.3 Switching on Annealing\n\nA complete algorithm must choose an appropriate point to change from constant $\\mu$ search to annealed learning. We use Moody and Darken's ASTC algorithm [4, 14] to accomplish this. ASTC measures the roughness of trajectories, switching to $1/t$ annealing when the trajectories become very rough, an indication that the noise in the updates is dominating the algorithm's behavior. In an attempt to satisfy $\\mu_0 > \\mu_{crit}$, ASTC can also switch back to constant learning when trajectories become too smooth.\n\nFootnote 5: Prediction of a $4 \\times 4$ block of image pixels from the surrounding 8 blocks.\n\nFigure 4: Phoneme Classification: Percent Correct a) ASTC without momentum (bottom curve) and adaptive momentum (top) as function of the number of input presentations. b) Conjugate Gradient Descent; one epoch equals one pass through the data, i.e. 9000 input presentations.\n\nWe return to the phoneme problem using three different training methods: 1) ASTC without momentum (with switching back and forth between annealed and constant learning), 2) adaptive momentum with annealing turned on when ASTC first suggests the transition (but no subsequent return to constant learning rate), and 3) standard conjugate gradient descent. 
\n\nFigure 4a compares ASTC (no momentum) with adaptive momentum (using ASTC to turn on annealing). After annealing is turned on, the classification accuracy improves far more quickly with adaptive momentum.\n\nFigure 4b displays the classification performance as a function of epoch using conjugate gradient descent (CGD). After 100 passes through the 9000 example dataset (900,000 presentations), the classification accuracy is 39.6%, or 7% below adaptive momentum's performance at 100,000 presentations. Note also that adaptive momentum is continuing to improve the optimization, while the ASTC and conjugate gradient descent curves have flattened out.\n\nThe cpu time used for the optimization was about the same for the CGD and adaptive momentum algorithms. It thus appears that our implementation of adaptive momentum costs about 9 times as much per pattern as CGD. We believe that the performance can be improved. Our complexity analysis [8] predicts a 3:1 cost ratio, rather than 9:1, and optimization comparable to that applied to the CGD code (footnote 6) should enhance the run-time performance of adaptive momentum.\n\nFor this problem, the performance of the two algorithms on the test set (not shown on graph) is not much different (31.7% for CGD versus 33.4% for adaptive momentum). However, we are concerned here with the efficiency of the optimization, not generalization performance. The latter depends on dataset size and regularization techniques, which can easily be combined with any optimizer.\n\n4 Summary\n\nWe have presented an efficient $O(n)$ stochastic algorithm with few adjustable parameters that achieves fast convergence during the converge phase for both linear and nonlinear problems. 
It does this by incorporating curvature information without explicit computation of the Hessian. We also combined it with a method (ASTC) for detecting when to make the transition between search and converge regimes.\n\nFootnote 6: CGD was performed using nopt written by Etienne Barnard and made available through the Center for Spoken Language Understanding at the Oregon Graduate Institute.\n\nAcknowledgments\n\nThe authors thank Yann LeCun for his helpful critique. This work was supported by EPRI under grant RPB015-2 and AFOSR under grant FF4962-93-1-0253.\n\nReferences\n\n[1] Genevieve B. Orr and Todd K. Leen. Weight space probability densities in stochastic learning: II. Transients and basin hopping times. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 1993. Morgan Kaufmann.\n\n[2] William Finnoff. Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 1993. Morgan Kaufmann.\n\n[3] Larry Goldstein. Mean square optimality in the continuous time Robbins Monro procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA, 1987.\n\n[4] Christian Darken and John Moody. Towards faster stochastic gradient search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.\n\n[5] Halbert White. Learning in artificial neural networks: A statistical perspective. Neural Computation, 1:425-464, 1989.\n\n[6] J. H. Venter. An extension of the Robbins-Monro procedure. 
Annals of Mathematical Statistics, 38:117-127, 1967.\n\n[7] Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, San Francisco, CA, 1994. Morgan Kaufmann Publishers.\n\n[8] Genevieve B. Orr. Dynamics and Algorithms for Stochastic Search. PhD thesis, Oregon Graduate Institute, 1996.\n\n[9] Mehmet Ali Tugay and Yalcin Tanik. Properties of the momentum LMS algorithm. Signal Processing, 18:117-127, 1989.\n\n[10] John J. Shynk and Sumit Roy. Analysis of the momentum LMS algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(12):2088-2098, 1990.\n\n[11] W. Wiegerinck, A. Komoda, and T. Heskes. Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A, 27:4425-4437, 1994.\n\n[12] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147-160, 1994.\n\n[13] Yann LeCun, Patrice Y. Simard, and Barak Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 1993. Morgan Kaufmann.\n\n[14] Christian Darken. Learning Rate Schedules for Stochastic Gradient Algorithms. PhD thesis, Yale University, 1993.\n\n", "award": [], "sourceid": 1227, "authors": [{"given_name": "Genevieve", "family_name": "Orr", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}