{"title": "Towards Faster Stochastic Gradient Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1009, "page_last": 1016, "abstract": null, "full_text": "Towards Faster Stochastic Gradient Search \n\nChristian Darken and John Moody \n\nYale Computer Science, P.O. Box 2158, New Haven, CT 06520 \n\nEmail: darken@cs.yale.edu \n\nAbstract \n\nStochastic gradient descent is a general algorithm which includes LMS, \non-line backpropagation, and adaptive k-means clustering as special cases. \nThe standard choices of the learning rate 1] (both adaptive and fixed func(cid:173)\ntions of time) often perform quite poorly. In contrast, our recently pro(cid:173)\nposed class of \"search then converge\" learning rate schedules (Darken and \nMoody, 1990) display the theoretically optimal asymptotic convergence rate \nand a superior ability to escape from poor local minima. However, the user \nis responsible for setting a key parameter. We propose here a new method(cid:173)\nology for creating the first completely automatic adaptive learning rates \nwhich achieve the optimal rate of convergence. \n\nIntro d uction \n\nThe stochastic gradient descent algorithm is \n\n6. Wet) = -1]\\7w E(W(t), X(t)). \n\nwhere 1] is the learning rate, t is the \"time\", and X(t) is the independent random \nexemplar chosen at time t. The purpose of the algorithm is to find a parameter \nvector W which minimizes a function G(W) which for learning algorithms has the \nform \u00a3x E(W, X), i.e. G is the average of an objective function over the exemplars, \nlabeled E and X respectively. We can rewrite 6.W(t) in terms of G as \n\n6. Wet) = -1][\\7wG(W(t)) + e(t, Wet))], \n\nwhere the e are independent zero-mean noises. Stochastic gradient descent may be \npreferable to deterministic gradient descent when the exemplar set is increasing in \nsize over time or large, making the average over exemplars expensive to compute. 
\n1009 \n\n\f1010 \n\nDarken and Moody \n\nAdditionally, the noise in the gradient can help the system escape from local minima. \nThe fundamental algorithmic issue is how to best adjust 11 as a function of \ntime and the exemplars? \n\nState of the Art Schedules \n\nThe usual non-adaptive choices of 11 (i.e. 11 depends on the time only) often yield \npoor performance. The simple expedient of taking 11 to be constant results in \npersistent residual fluctuations whose magnitude and the resulting degradation of \nsystem performance are difficult to anticipate (see fig. 3). Taking a smaller constant \n11 reduces the magnitude of the fluctuations, but seriously slows convergence and \ncauses problems with metastable local minima. Taking l1(t) = cit, the common \nchoice in the stochastic approximation literature of the last forty years, typically \nresults in slow convergence to bad solutions for small c, and parameter blow-up for \nsmall t if c is large (Darken and Moody, 1990). \n\nThe available adaptive schedules (i.e. 11 depends on the time and on previous exem(cid:173)\nplars) have problems as well. Classical methods which involve estimating the hes(cid:173)\nsian of G are often unusable because they require O(N2) storage and computation \nfor each update, which is too expensive for large N (many parameter systems(cid:173)\ne.g. large neural nets). Methods such as those of Fabian (1960) and Kesten (1958) \nrequire the user to specify an entire function and thus are not practical methods \nas they stand. The delta-bar-delta learning rule, which was developed in the con(cid:173)\ntext of deterministic gradient descent (Jacobs, 1988), is often useful in locating the \ngeneral vicinity of a solution in the stochastic setting. However it hovers about the \nsolution without converging (see fig. 4). A schedule developed by Urasiev is proven \nto converge in principle, but in practice it converges slowly if at all (see fig. 5). 
The literature is widely scattered over time and disciplines; however, to our knowledge no published O(N) technique attains the optimal convergence speed.

Search-Then-Converge Schedules

Our recently proposed solution is the "search then converge" learning rate schedule. η is chosen to be a fixed function of time such as the following:

η(t) = η₀ · (1 + (c/η₀)(t/τ)) / (1 + (c/η₀)(t/τ) + τ t²/τ²)

This function is approximately constant with value η₀ at times small compared to τ (the "search phase"). At times large compared with τ (the "converge phase"), the function decreases as c/t. See for example the η vs. time curves of figs. 6 and 7. This schedule has demonstrated a dramatic improvement in convergence speed and quality of solution as compared to the traditional fixed learning rate schedule for k-means clustering (Darken and Moody, 1990). However, these benefits apply to supervised learning as well. Compare the error curve of fig. 3 with those of figs. 6 and 7.

This schedule yields optimally fast asymptotic convergence if c > c*, where c* = 1/(2a) and a is the smallest eigenvalue of the hessian of the function G (defined above) at the pertinent minimum (Fabian, 1968; Major and Revesz, 1973; Goldstein, 1987).

Figure 1: Two contrasting parameter vector trajectories illustrating the notion of drift ("little drift" vs. "much drift").

The penalty for choosing c < c* is that the ratio of the excess error given c too small to the excess error with c large enough gets arbitrarily large as training time grows, i.e.

lim_{t→∞} E_{c<c*} / E_{c>c*} = ∞,

where E is the excess error above that at the minimum. The same holds for the ratio of the two distances to the location of the minimum in parameter space.

While the above schedule works well, its asymptotic performance depends upon the user's choice of c.
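For reference, the search-then-converge schedule can be written directly as a function of t (a minimal sketch; the parameter values in the usage lines are illustrative, not taken from the paper):

```python
def search_then_converge(t, eta0, c, tau):
    # eta(t) = eta0 * (1 + (c/eta0)(t/tau)) / (1 + (c/eta0)(t/tau) + tau * t**2 / tau**2)
    r = (c / eta0) * (t / tau)
    return eta0 * (1.0 + r) / (1.0 + r + tau * (t / tau) ** 2)

# Search phase: for t << tau the rate stays near eta0.
print(search_then_converge(t=1, eta0=0.1, c=1.0, tau=1000))      # ~0.1
# Converge phase: for t >> tau the rate decays like c/t.
print(search_then_converge(t=10**6, eta0=0.1, c=1.0, tau=1000))  # ~1e-6
```

The two printed values exhibit the two phases: the rate is essentially η₀ early on and essentially c/t late.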
Since neither η₀ nor τ affects the asymptotic behavior of the system, we will discuss their selection elsewhere. Setting c > c*, however, is vital. Can such a c be determined automatically? Directly estimating a with conventional methods (by calculating the smallest eigenvalue of the hessian at our current estimate of the minimum) is too computationally demanding: it would take at least O(N²) storage and computation time for each estimate, and would have to be done repeatedly (N is the number of parameters). We are investigating the possibility of a low-complexity direct estimation of a by performing a second optimization. Here, however, we take a more unusual approach: we shall determine whether c is large enough by observing the trajectory of the parameter (or "weight") vector.

On-line Determination of Whether c < c*

We propose that excessive correlation in the parameter change vectors (i.e. "drift") indicates that c is too small (see fig. 1). We define the drift as

D(t) = Σ_k d_k²(t),

d_k(t) = √T ⟨δ_k(t)⟩_T / ⟨(δ_k(t) - ⟨δ_k(t)⟩_T)²⟩_T^{1/2},

where δ_k(t) is the change in the kth component of the parameter vector at time t and the angled brackets denote an average over the last T parameter changes. We take T = at, where a ≪ 1. Notice that the numerator is the average parameter step while the denominator is the standard deviation of the steps.

Figure 2: (Left) An Ornstein-Uhlenbeck process in one dimension. This process is zero-mean, gaussian, and stationary (in fact strongly ergodic). It may be thought of as a random walk with a restoring force towards zero. (Right) Measurement of the drift for the runs c = 0.1c* and c = 10c* which are discussed in figs. 7 and 8 below.
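The drift statistic can be sketched as follows (our code, not the paper's; it applies the definition of D(t) to a window holding the last T parameter changes):

```python
import numpy as np

def drift(deltas):
    # deltas: array of shape (T, N) holding the last T changes of an
    # N-dimensional parameter vector (or, equivalently, noisy gradients).
    T = deltas.shape[0]
    mean = deltas.mean(axis=0)    # average parameter step (numerator)
    std = deltas.std(axis=0)      # standard deviation of the steps (denominator)
    d = np.sqrt(T) * mean / std   # d_k = sqrt(T) * <delta_k>_T / std_k
    return float(np.sum(d ** 2))  # D = sum_k d_k**2

# Uncorrelated steps: each d_k is roughly a unit-variance normal, so D stays O(N).
rng = np.random.default_rng(1)
print(drift(rng.normal(size=(1000, 2))))
# A persistent drift in one direction inflates D dramatically.
print(drift(rng.normal(size=(1000, 2)) + np.array([0.5, 0.0])))
```

The second call illustrates the intended signal: a consistent bias in the steps makes the mean large relative to the scatter, and D grows with the window size.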
As a point of reference, if the δ_k were independent normal random variables, then the d_k would be t-distributed with T degrees of freedom, i.e. approximately unit-variance normals for moderate to large T. We find that δ_k may also be taken to be the kth component of the noisy gradient, to the same effect.

Asymptotically, we will take the learning rate to go as c/t. Choosing c too small results in a slow drift of the parameter vector towards the solution in a relatively linear trajectory. When c > c*, however, the trajectory is much more jagged. Compare figs. 7 and 8. More precisely, we find that D(t) blows up like a power of t when c is too small, but remains finite otherwise. Our experiments confirm this (for an example, see fig. 2). This provides us with a signal to use in future adaptive learning rate schemes for ensuring that c is large enough.

This dichotomy implies that an arbitrarily small change in c which moves it to the opposite side of c* has dramatic consequences for the behavior of the drift. The following rough argument outlines how one might prove this statement, focusing on the source of this interesting discontinuity in behavior. We simplify the argument by taking the δ_k's to be gradient measurements, as mentioned above. We consider a one-dimensional problem, and modify d₁ to be √T ⟨δ₁⟩_T (i.e. we ignore the denominator). Then, since T = at as stated above, we approximate

d₁ = √T ⟨δ₁(t)⟩_T ≈ ⟨√t δ₁(t)⟩_T = ⟨√t [∇G(t) + e(t)]⟩_T.

Recall the definitions of G and e from the introduction above. As t → ∞, ∇G(t) → K[W(t) - W₀] for the appropriate K, by the Taylor expansion of G around W₀, the location of the local minimum. Thus

lim_{t→∞} d₁ ≈ ⟨K √t [W(t) - W₀]⟩_T + ⟨√t e(t)⟩_T.

Define X(t) = √t [W(t) - W₀].
Now, according to (Kushner, 1978), X(e^t) converges in distribution to the well-known Ornstein-Uhlenbeck process (fig. 2) when c > c*. By extending this work, one can show that X(t) converges in distribution to a deterministic power law t^p with p > 0 when c < c*. Since the e's are independent and have uniformly bounded variances for smooth objective functions, the second term converges in distribution to a finite-variance random variable. The first term converges to a finite-variance random variable if c > c*, but to a power of t if c < c*.

Constant η = 0.1

Figure 3: The constant η schedule, commonly used in training backpropagation networks, does not converge in the stochastic setting.

Qualitative Behavior of Schedules

We compare several fixed and adaptive learning rate schedules on a toy stochastic problem. Notice the difficulties that are encountered by some schedules, even on a fairly easy problem, due to noise in the gradient. The problem is learning a two-parameter adaline in the presence of independent uniformly distributed [-0.5, 0.5] noise on the exemplar labels. Exemplars were independently uniformly distributed on [-1, 1]. The objective function has a condition number of 10, corresponding to the narrow ravine indicated by the elliptical isopleths in the figures. All runs start from the same parameter (weight) vector and receive the same sequence of exemplars. The misadjustment is defined as the Euclidean distance in parameter space to the minimum. Multiples of this quantity bound the usual sum-of-squares error measure above and below, i.e. sum-of-squares error is roughly proportional to the misadjustment.
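The dichotomy in X(t) = √t [W(t) - W₀] described above can also be checked numerically. The sketch below (our construction, not the paper's experiment) minimizes the one-dimensional quadratic G(w) = w²/2 from noisy gradients with η(t) = c/t; here the hessian is a = 1, so c* = 1/(2a) = 0.5:

```python
import numpy as np

def run(c, steps=100_000, seed=3):
    # Minimize G(w) = w**2 / 2 given noisy gradients w + e(t), with eta(t) = c/t.
    rng = np.random.default_rng(seed)
    w = 1.0
    for t in range(1, steps + 1):
        g = w + rng.normal()        # noisy gradient: grad G(w) plus zero-mean noise
        w -= (c / t) * g
    return np.sqrt(steps) * abs(w)  # |X(t)| = sqrt(t) * |w(t) - w0|, with w0 = 0

# c < c*: sqrt(t) * w(t) grows like a power of t.
print(run(c=0.05))
# c > c*: sqrt(t) * w(t) remains a finite-variance, Ornstein-Uhlenbeck-like variable.
print(run(c=5.0))
```

Increasing `steps` makes the first value keep growing while the second stays of order one, in line with the argument above.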
Results are presented in figs. 3-8.

Conclusions

Our empirical tests agree with our theoretical expectations that drift can be used to determine whether the crucial parameter c is large enough. Using this statistic, it will be possible to produce the first fully automatic learning rates which converge at optimal speed. We are currently investigating candidate schedules which we expect to be useful for large-scale LMS, backpropagation, and clustering applications.

Stochastic Delta-Bar-Delta

Figure 4: Delta-bar-delta (Jacobs, 1988) was apparently developed for use with deterministic gradient descent. It is also useful for stochastic problems with little noise, which is however not the case for this test problem. In this example η increases from its initial value, and then stabilizes. We use the algorithm exactly as it appears in Jacobs' paper, with noisy gradients substituted for the true gradient (which is unavailable in the stochastic setting). Parameters used were η₀ = 0.1, θ = 0.3, κ = 0.01, and φ = 0.1.

Urasiev

Figure 5: Urasiev's technique (Urasiev, 1988) varies η erratically over several orders of magnitude. The large fluctuations apparently cause η to completely stop changing after a while, due to finite-precision effects. Parameters used were D = 0.2, R = 2, and U = 1.

Fixed Search-Then-Converge, c = c*

Figure 6: The fixed search-then-converge schedule with c = c* gives excellent performance. However, if c* is not known, one may get performance as in the next two examples.
An adaptive technique is called for.

Fixed Search-Then-Converge, c = 10c*

Figure 7: Note that taking c > c* slows convergence a bit as compared to the c = c* example in fig. 6, though it could aid escape from bad local minima in a nonlinear problem.

Fixed Search-Then-Converge, c = 0.1c*

Figure 8: This run illustrates the penalty to be paid if c < c*.

References

C. Darken and J. Moody. (1990) Note on learning rate schedules for stochastic optimization. Advances in Neural Information Processing Systems 3. 832-838.

V. Fabian. (1960) Stochastic approximation methods. Czechoslovak Math. J. 10(85):123-159.

V. Fabian. (1968) On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39(4):1327-1332.

L. Goldstein. (1987) Mean square optimality in the continuous time Robbins-Monro procedure. Technical Report DRB-306. Department of Mathematics, University of Southern California.

R. Jacobs. (1988) Increased rates of convergence through learning rate adaptation. Neural Networks. 1:295-307.

H. Kesten. (1958) Accelerated stochastic approximation. Annals of Mathematical Statistics. 29:41-59.

H. Kushner. (1978) Rates of convergence for sequential Monte Carlo optimization methods. SIAM J. Control and Optimization. 16:150-168.

P. Major and P. Revesz. (1973) A limit theorem for the Robbins-Monro approximation. Z. Wahrscheinlichkeitstheorie verw. Geb. 27:79-86.

S. Urasiev. (1988) Adaptive stochastic quasigradient procedures. In Numerical Techniques for Stochastic Optimization. Y. Ermoliev and R. Wets, Eds. Springer-Verlag.
", "award": [], "sourceid": 509, "authors": [{"given_name": "Christian", "family_name": "Darken", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}