{"title": "Note on Learning Rate Schedules for Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 832, "page_last": 838, "abstract": null, "full_text": "Note on Learning Rate Schedules for Stochastic \n\nOptimization \n\nChristian Darken and John Moody \n\nYale University \n\nP.O. Box 2158 Yale Station \n\nNew Haven, CT 06520 \n\nEmail: moody@cs.yale.edu \n\nAbstract \n\nWe present and compare learning rate schedules for stochastic gradient \ndescent, a general algorithm which includes LMS, on-line backpropaga(cid:173)\ntion and k-means clustering as special cases. We introduce \"search-then(cid:173)\nconverge\" type schedules which outperform the classical constant and \n\"running average\" (1ft) schedules both in speed of convergence and quality \nof solution. \n\nIntroduction: Stochastic Gradient Descent \n\n1 \ntion G(W). In the context of learning systems typically G(W) = \u00a3x E(W, X), i.e. \n\nThe optimization task is to find a parameter vector W which minimizes a func(cid:173)\n\nG is the average of an objective function over the exemplars, labeled E and X \nrespectively. The stochastic gradient descent algorithm is \nLl Wet) = -1](t)V'w E(W(t), X(t)). \n\nwhere t is the \"time\", and X(t) is the most recent independently-chosen random \nexemplar. For comparison, the deterministic gradient descent algorithm is \n\nLl Wet) = -1](t)V'w\u00a3x E(W(t), X). \n\n832 \n\n\fNote on Learning Rate Schedules for Stochastic Optimization \n\n833 \n\nIa' ~---_=--=------------------------\n\nFigure 1: Comparison of the shapes of the schedules. Dashed line = constant, Solid line \n= search-then-converge, Dotted line = \"running-average\" \n\nHI \n\nWhile on average the stochastic step is equal to the deterministic step, for any \nparticular exemplar X(t) the stochastic step may be in any direction, even uphill \nin \u00a3x E(W(t), X). 
Despite its noisiness, the stochastic algorithm may be preferable when the exemplar set is large, making the average over exemplars expensive to compute. \n\nThe issue addressed by this paper is: which function should one choose for \u03b7(t) (the learning rate schedule) in order to obtain fast convergence to a good local minimum? The schedules compared in this paper are the following (Fig. 1): \n\n\u2022 Constant: \u03b7(t) = \u03b70 \n\n\u2022 \"Running average\": \u03b7(t) = \u03b70/(1 + t) \n\n\u2022 Search-then-converge: \u03b7(t) = \u03b70/(1 + t/T) \n\n\"Search-then-converge\" is the name of a novel class of schedules which we introduce in this paper. The specific equation above is merely one member of this class and was chosen for comparison because it is the simplest member of that class. We find that the new schedules typically outperform the classical constant and running average schedules. Furthermore, the new schedules are capable of attaining the optimal asymptotic convergence rate for any objective function and exemplar distribution. The classical schedules cannot. \n\nAdaptive schedules are beyond the scope of this short paper (see, however, Darken and Moody, 1991). Nonetheless, all of the adaptive schedules in the literature of which we are aware are either second order, and thus too expensive to compute for large numbers of parameters, or make no claim to asymptotic optimality. \n\n2 Example Task: K-Means Clustering \n\nAs our sample gradient-descent task we choose a k-means clustering problem. Clustering is a good sample problem to study, both for its inherent usefulness and its illustrative qualities. Under the name of vector quantization, clustering is an important technique for signal compression in communications engineering. In the machine learning field, clustering has been used as a front-end for function learning and speech recognition systems. 
Clustering also has many features to recommend it as an illustrative stochastic optimization problem. The adaptive law is very simple, and there are often many local minima even for small problems. Most significantly, however, if the means live in a low-dimensional space, visualization of the parameter vector is simple: it has the interpretation of being a set of low-dimensional points which can be easily plotted and understood. \n\nThe k-means task is to locate k points (called \"means\") to minimize the expected distance between a new random exemplar and the nearest mean to that exemplar. Thus, the function being minimized in k-means is E_X ||X - M_nrst||^2, where M_nrst is the nearest mean to exemplar X. An equivalent form is \u222b dX P(X) \u03a3_{a=1}^{k} I_a(X) ||X - M_a||^2, where P(X) is the density of the exemplar distribution and I_a(X) is the indicator function of the Voronoi region corresponding to the a-th mean. The stochastic gradient descent algorithm for this function is \n\n\u0394M_nrst(t) = -\u03b7(t_nrst) [M_nrst(t) - X(t)], \n\ni.e. the nearest mean to the latest exemplar moves directly towards the exemplar a fractional distance \u03b7(t_nrst). In a slight generalization from the stochastic gradient descent algorithm above, t_nrst is the total number of exemplars (including the current one) which have been assigned to mean M_nrst. \n\nAs a specific example problem to compare the various schedules across, we take k = 9 (9 means) and X uniformly distributed over the unit square. Although this would appear to be a simple problem, it has several observed local minima. The global minimum is where the means are located at the centers of a uniform 3x3 grid over the square. Simulation results are presented in figures 2 and 3. \n\n3 Constant Schedule \n\nA constant learning rate has been the traditional choice for LMS and backpropagation. 
However, a constant rate generally does not allow the parameter vector (the \"means\" in the case of clustering) to converge. Instead, the parameters hover around a minimum at an average distance proportional to \u03b7 and to a variance which depends on the objective function and the exemplar set. Since the statistics of the exemplars are generally assumed to be unknown, this residual misadjustment cannot be predicted. The resulting degradation of other measures of system performance, mean squared classification error for instance, is still more difficult to predict. Thus the study of how to make the parameters converge is of significant practical interest. \n\nCurrent practice for backpropagation, when large misadjustment is suspected, is to restart learning with a smaller \u03b7. Shrinking \u03b7 does result in less residual misadjustment, but at the same time the speed of convergence drops. In our example clustering problem, a new phenomenon appears as \u03b7 drops: metastable local minima. Here the parameter vector hovers around a relatively poor solution for a very long time before slowly transiting to a better one. \n\n4 Running Average Schedule \n\nThe running average schedule (\u03b7(t) = \u03b70/(1 + t)) is the staple of the stochastic approximation literature (Robbins and Monro, 1951) and of k-means clustering (with \u03b70 = 1) (MacQueen, 1967). This schedule is optimal for k = 1 (1 mean), but performs very poorly for moderate to large k (like our example problem with 9 means). From the example run (Fig. 2A), it is clear that \u03b7 must decrease more slowly in order for a good solution to be reached. Still, an advantage of this schedule is that the parameter vector has been proven to converge to a local minimum (MacQueen, 1967). We would like a class of schedules which is guaranteed to converge, and yet converges as quickly as possible. 
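A minimal simulation sketch of this comparison (our own code, not the authors' implementation; the 9-means unit-square task and the schedule forms follow the text, but the function names, the choice T = 32, and the distortion estimator are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_kmeans(schedule, k=9, n_exemplars=30_000):
    """Stochastic k-means on the unit square: the nearest mean moves a
    fraction eta(t_nrst) toward each exemplar, where t_nrst counts the
    exemplars assigned so far to that particular mean (as in the text)."""
    means = rng.random((k, 2))                             # random initial means
    counts = np.zeros(k, dtype=int)
    for _ in range(n_exemplars):
        x = rng.random(2)                                  # uniform exemplar
        a = int(np.argmin(((means - x) ** 2).sum(axis=1))) # nearest mean
        counts[a] += 1
        means[a] += schedule(counts[a]) * (x - means[a])
    return means

def distortion(means, n=5_000):
    """Monte-Carlo estimate of the expected squared distance to the nearest mean."""
    xs = rng.random((n, 2))
    d2 = ((xs[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

eta0 = 1.0
schedules = {
    "constant":             lambda t: 0.01,
    "running average":      lambda t: eta0 / (1 + t),
    "search-then-converge": lambda t: eta0 / (1 + t / 32),  # search time T = 32
}
results = {name: train_kmeans(s) for name, s in schedules.items()}
```

For the uniform square, the global minimum (means on a 3x3 grid) gives a distortion of 2(1/3)^2/12, roughly 0.0185; runs that stall in local minima come out noticeably worse.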
\n\n5 Stochastic Approximation Theory \n\nIn the stochastic approximation literature, which has grown steadily since it began in 1951 with the Robbins and Monro paper, we find conditions on the learning rate which ensure convergence with optimal speed.[1] \n\nFrom (Ljung, 1977), we find that \u03b7(t) \u2192 A t^{-p} asymptotically, for any 1 \u2265 p > 0, is sufficient to guarantee convergence. Power law schedules may work quite well in practice (Darken and Moody, 1990); however, from (Goldstein, 1987) we find that in order to converge at an optimal rate, we must have \u03b7(t) \u2192 c/t asymptotically, for c greater than some threshold which depends on the objective function and exemplars.[2] When the optimal convergence rate is achieved, ||W - W*||^2 goes like 1/t. \n\nThe running average schedule goes as \u03b70/t asymptotically. Unfortunately, the convergence rate of the running average schedule often cannot be improved by enlarging \u03b70, because the resulting instability for small t can outweigh the improvements in asymptotic convergence rate. \n\n6 Search-Then-Converge Schedules \n\nWe now introduce a new class of schedules which are guaranteed to converge and, furthermore, can achieve the optimal 1/t convergence rate without stability problems. These schedules are characterized by the following features. The learning rate stays high for a \"search time\" T in which it is hoped that the parameters will find and hover about a good minimum. Then, for times greater than T, the learning rate decreases as c/t, and the parameters converge. \n\n[1] The cited theory generally does not directly apply to the full nonlinear setting of interest in much practical work. For more details on the relation of the theory to practical applications and a complete quantitative theory of asymptotic misadjustment, see (Darken and Moody, 1991). \n\n[2] This choice of asymptotic \u03b7 satisfies the necessary conditions given in (White, 1989). 
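The threshold behavior can be checked numerically with the standard expected-squared-error recursion for SGD on a one-dimensional quadratic (a sketch under assumed values, curvature a = 1 and unit gradient-noise variance; this recursion is a textbook construction, not the paper's experiments): with eta(t) = c/t, the error falls like 1/t only when 2ac > 1, and only like t^(-2ac) below the threshold.

```python
def expected_sq_error(schedule, a=1.0, sigma2=1.0, n=200_000):
    """Iterate E[(w - w*)^2] for SGD on a 1-D quadratic with curvature a
    and gradient-noise variance sigma2:
        e_{t+1} = (1 - eta_t * a)^2 * e_t + eta_t^2 * sigma2
    """
    e = 1.0
    for t in range(1, n + 1):
        eta = schedule(t)
        e = (1.0 - eta * a) ** 2 * e + eta ** 2 * sigma2
    return e

# c above the threshold 1/(2a): the optimal 1/t rate is achieved.
fast = expected_sq_error(lambda t: 2.0 / t)
# c below the threshold: convergence stalls at the much slower t^(-0.2) rate.
slow = expected_sq_error(lambda t: 0.1 / t)
```

With these numbers, `fast` comes out several orders of magnitude smaller than `slow` after the same number of steps, which is the instability-versus-rate trade-off the text describes for the running average schedule.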
\n\nWe choose the simplest of this class of schedules for study, the \"short-term linear\" schedule (\u03b7(t) = \u03b70/(1 + t/T)), so called because the learning rate decreases linearly during the search phase. This schedule has c = T\u03b70 and reduces to the running average schedule for T = 1. \n\n7 Conclusions \n\nWe have introduced the new class of \"search-then-converge\" learning rate schedules. Stochastic approximation theory indicates that for large enough T, these schedules can achieve optimally fast asymptotic convergence for any exemplar distribution and objective function. Neither constant nor \"running average\" (1/t) schedules can achieve this. Empirical measurements on k-means clustering tasks are consistent with this expectation. Furthermore, asymptotic conditions obtain surprisingly quickly. Additionally, the search-then-converge schedule improves the observed likelihood of escaping bad local minima. \n\nAs implied above, k-means clustering is merely one example of a stochastic gradient descent algorithm. LMS and on-line backpropagation are others of great interest to the learning systems community. Due to space limitations, experiments in these settings will be published elsewhere (Darken and Moody, 1991). Preliminary experiments seem to confirm the generality of the above conclusions. \n\nExtensions to this work in progress include application to algorithms more sophisticated than simple gradient descent, and adaptive search-then-converge algorithms which automatically determine the search time. \n\nAcknowledgements \n\nThe authors wish to thank Hal White for useful conversations and Jon Kauffman for developing the animator which was used to produce figure 2. This work was supported by ONR Grant N00014-89-J-1228 and AFOSR Grant 89-0478. \n\nReferences \n\nC. Darken and J. Moody. (1990) Fast Adaptive K-Means Clustering: Some Empirical Results. 
In International Joint Conference on Neural Networks 1990, 2:233-238. IEEE Neural Networks Council. \n\nC. Darken and J. Moody. (1991) Learning Rate Schedules for Stochastic Optimization. In preparation. \n\nL. Goldstein. (1987) Mean square optimality in the continuous time Robbins-Monro procedure. Technical Report DRB-306. Department of Mathematics, University of Southern California. \n\nL. Ljung. (1977) Analysis of Recursive Stochastic Algorithms. IEEE Trans. on Automatic Control. AC-22(4):551-575. \n\nJ. MacQueen. (1967) Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Stat. Prob. 3:281. \n\nH. Robbins and S. Monro. (1951) A Stochastic Approximation Method. Ann. Math. Stat. 22:400-407. \n\nH. White. (1989) Learning in Artificial Neural Networks: A Statistical Perspective. Neural Computation. 1:425-464. \n\n[Figure 2: Example runs with classical schedules on the 9-means clustering task. Exemplars are uniformly distributed over the square. Dots indicate previous locations of the means. The triangles (barely visible) are the final locations of the means. (A) \"Running average\" schedule (\u03b7 = 1/(1 + t)), 100k exemplars. Means are far from any minimum and progressing very slowly. (B) Large constant schedule (\u03b7 = 0.1), 100k exemplars. Means hover around the global minimum at large average distance. (C) Small constant schedule (\u03b7 = 0.01), 50k exemplars. Means stuck in a metastable local minimum. (D) Small constant schedule (\u03b7 = 0.01), 100k exemplars (later in the run pictured in C). Means tunnel out of the local minimum and hover around the global minimum.] 
\n\n[Figure 3: Comparison of 10 runs over the various schedules on the 9-means clustering task (as described under Fig. 1). The exemplars are the same for each schedule. Misadjustment is defined as ||W - W_best||^2. (A) Small constant schedule (\u03b7 = 0.01). Note the well-defined transitions out of metastable local minima and the large misadjustment late in the runs. (B) \"Running average\" schedule (\u03b7 = 1/(1 + t)). 6 out of 10 runs stick in a local minimum. The others slowly head for the global minimum. (C) Search-then-converge schedule (\u03b7 = 1/(1 + t/4)). All but one run head for the global minimum, but at a suboptimal rate (asymptotic slope less than -1). (D) Search-then-converge schedule (\u03b7 = 1/(1 + t/32)). All runs head for the global minimum at an optimally quick rate (asymptotic slope of -1).] \n", "award": [], "sourceid": 400, "authors": [{"given_name": "Christian", "family_name": "Darken", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}