Part of Advances in Neural Information Processing Systems 4 (NIPS 1991)
Jürgen Schmidhuber
Do you want your neural net algorithm to learn sequences? Do not limit yourself to conventional gradient descent (or approximations thereof). Instead, use your sequence learning algorithm (any will do) to implement the following method for history compression. No matter what your final goals are, train a network to predict its next input from the previous ones. Since only unpredictable inputs convey new information, ignore all predictable inputs but let all unexpected inputs (plus information about the time step at which they occurred) become inputs to a higher-level network of the same kind (working on a slower, self-adjusting time scale). Go on building a hierarchy of such networks. This principle reduces the descriptions of event sequences without loss of information, thus easing supervised or reinforcement learning tasks. Alternatively, you may use two recurrent networks to collapse a multi-level predictor hierarchy into a single recurrent net. Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets. Finally, you can modify the above method such that predictability is not defined in a yes-or-no fashion but in a continuous fashion.
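The sketch below illustrates the history-compression idea in miniature. It is not the paper's recurrent-network method: a toy lookup-table predictor stands in for a trained predictor network, and the function name `compress` is an illustrative assumption. It only shows the core principle, that predictable inputs are dropped while unexpected inputs, together with their time steps, become the input sequence for the next level of the hierarchy.

```python
# Minimal sketch of history compression with a toy predictor.
# Assumption: a lookup table (previous symbol -> predicted next symbol)
# replaces the trained predictor network described in the paper.

def compress(sequence):
    """Return (time step, symbol) pairs for all unpredicted inputs."""
    prediction_table = {}   # previous symbol -> predicted next symbol
    reduced = []            # only unexpected events survive compression
    previous = None         # context: the most recent input
    for t, symbol in enumerate(sequence):
        if prediction_table.get(previous) != symbol:
            # Unpredictable input: it carries new information, so keep it
            # together with the time step at which it occurred.
            reduced.append((t, symbol))
        prediction_table[previous] = symbol   # online predictor update
        previous = symbol
    return reduced


if __name__ == "__main__":
    # A highly regular sequence compresses to a few unexpected events:
    # the first symbols and the surprise around 'X'.
    seq = list("abababababXabababab")
    level1 = compress(seq)
    print(level1)
    # The reduced description can itself be fed to a higher-level
    # compressor of the same kind, yielding a multi-level hierarchy.
    level2 = compress(level1)
    print(level2)
```

In the paper's formulation each level is a predictor network trained on the reduced sequence produced by the level below, and the hierarchy can alternatively be collapsed into a single recurrent net; the table-based predictor above merely makes the reduction step concrete.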