Note on Learning Rate Schedules for Stochastic Optimization

Part of Advances in Neural Information Processing Systems 3 (NIPS 1990)

Bibtex Metadata Paper


Christian Darken, John Moody


We present and compare learning rate schedules for stochastic gradient descent, a general algorithm which includes LMS, on-line backpropaga(cid:173) tion and k-means clustering as special cases. We introduce "search-then(cid:173) converge" type schedules which outperform the classical constant and "running average" (1ft) schedules both in speed of convergence and quality of solution.

Introduction: Stochastic Gradient Descent

1 tion G(W). In the context of learning systems typically G(W) = £x E(W, X), i.e.

The optimization task is to find a parameter vector W which minimizes a func(cid:173)

G is the average of an objective function over the exemplars, labeled E and X respectively. The stochastic gradient descent algorithm is Ll Wet) = -1](t)V'w E(W(t), X(t)).

where t is the "time", and X(t) is the most recent independently-chosen random exemplar. For comparison, the deterministic gradient descent algorithm is

Ll Wet) = -1](t)V'w£x E(W(t), X).