Christian Darken, John Moody
We present and compare learning rate schedules for stochastic gradient descent, a general algorithm which includes LMS, on-line backpropaga(cid:173) tion and k-means clustering as special cases. We introduce "search-then(cid:173) converge" type schedules which outperform the classical constant and "running average" (1ft) schedules both in speed of convergence and quality of solution.
Introduction: Stochastic Gradient Descent
1 tion G(W). In the context of learning systems typically G(W) = £x E(W, X), i.e.
The optimization task is to find a parameter vector W which minimizes a func(cid:173)
G is the average of an objective function over the exemplars, labeled E and X respectively. The stochastic gradient descent algorithm is Ll Wet) = -1](t)V'w E(W(t), X(t)).
where t is the "time", and X(t) is the most recent independently-chosen random exemplar. For comparison, the deterministic gradient descent algorithm is
Ll Wet) = -1](t)V'w£x E(W(t), X).