Yann LeCun, Patrice Simard, Barak Pearlmutter
We propose a very simple and well-principled way of computing the optimal step size in gradient descent algorithms. The on-line version is computationally very efficient and is applicable to large backpropagation networks trained on large data sets. The main ingredient is a technique for estimating the principal eigenvalue(s) and eigenvector(s) of the objective function's second derivative matrix (Hessian) without ever computing the Hessian itself. Several other applications of this technique are proposed, for speeding up learning or for eliminating useless parameters.
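The core idea, estimating the Hessian's principal eigenvalue using only gradient evaluations, can be sketched as power iteration with a finite-difference Hessian-vector product, H v ≈ (∇L(w + εv) − ∇L(w)) / ε. The toy quadratic loss and all names below are illustrative assumptions, not the paper's own code, and this is a batch sketch rather than the on-line version the abstract describes:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w, whose Hessian is simply A.
# (Illustrative stand-in: the technique needs only a gradient oracle.)
A = np.diag([10.0, 3.0, 1.0])

def grad(w):
    """Gradient of the toy quadratic loss."""
    return A @ w

def principal_eigenpair(w, grad_fn, eps=1e-5, iters=100):
    """Power iteration on the Hessian at w, using gradient differences
    only -- the Hessian matrix is never formed."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    g0 = grad_fn(w)
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - g0) / eps   # finite-difference H v
        lam = np.linalg.norm(hv)                 # eigenvalue estimate
        v = hv / lam                             # renormalized eigenvector
    return lam, v

lam, v = principal_eigenpair(np.zeros(3), grad)
print(round(float(lam), 3))   # -> 10.0, the largest eigenvalue of A
```

A step size on the order of 1/λ for the largest eigenvalue λ is then a natural choice for gradient descent, since larger steps diverge along the principal curvature direction.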