{"title": "A Realizable Learning Task which Exhibits Overfitting", "book": "Advances in Neural Information Processing Systems", "page_first": 218, "page_last": 224, "abstract": null, "full_text": "A Realizable Learning Task which \n\nExhibits Overfitting \n\nSiegfried Bos \n\nLaboratory for Information Representation, RIKEN, \n\nHirosawa 2-1, Wako-shi, Saitama, 351-01, Japan \n\nemail: boes@zoo.riken.go.jp \n\nAbstract \n\nIn this paper we examine a perceptron learning task. The task is \nrealizable since it is provided by another perceptron with identi(cid:173)\ncal architecture. Both perceptrons have nonlinear sigmoid output \nfunctions. The gain of the output function determines the level of \nnonlinearity of the learning task. It is observed that a high level \nof nonlinearity leads to overfitting. We give an explanation for this \nrather surprising observation and develop a method to avoid the \noverfitting. This method has two possible interpretations, one is \nlearning with noise, the other cross-validated early stopping. \n\n1 Learning Rules from Examples \n\nThe property which makes feedforward neural nets interesting for many practical \napplications is their ability to approximate functions, which are given only by ex(cid:173)\namples. Feed-forward networks with at least one hidden layer of nonlinear units \nare able to approximate each continuous function on a N-dimensional hypercube \narbitrarily well. While the existence of neural function approximators is already \nestablished, there is still a lack of knowledge about their practical realizations. Also \nmajor problems, which complicate a good realization, like overfitting, need a better \nunderstanding. \n\nIn this work we study overfitting in a one-layer percept ron model. The model \nallows a good theoretical description while it exhibits already a qualitatively similar \nbehavior as the multilayer perceptron. \n\nA one-layer perceptron has N input units and one output unit. 
Between input and output it has one layer of adjustable weights W_i (i = 1, ..., N). The output z is a possibly nonlinear function of the weighted sum of the inputs x_i, i.e. \n\nz = g(h) , with h = (1/√N) Σ_{i=1}^N W_i x_i . (1) \n\nThe quality of the function approximation is measured by the difference between the correct output z* and the net's output z, averaged over all possible inputs. In the supervised learning scheme one trains the network using a set of examples x^μ (μ = 1, ..., P) for which the correct output is known. The learning task is to minimize a certain cost function, which measures the difference between the correct output z*^μ and the net's output z^μ averaged over all examples. \n\nUsing the mean squared error as a suitable measure for the difference between the outputs, we can define the training error E_T and the generalization error E_G as \n\nE_T := (1/2P) Σ_{μ=1}^P (z*^μ - z^μ)² , E_G := (1/2) ⟨ (z* - z)² ⟩ . (2) \n\nThe development of both errors as a function of the number P of trained examples is given by the learning curves. Training is conventionally done by gradient descent. \n\nFor theoretical purposes it is very useful to study learning tasks which are provided by a second network, the so-called teacher network. This concept allows a more transparent definition of the difficulty of the learning task. The monitoring of the training process also becomes clearer, since it is always possible to compare the student network and the teacher network directly. \n\nSuitable quantities for such a comparison are, in the perceptron case, the following order parameters, \n\nr := (Σ_i W*_i W_i) / (‖W*‖ ‖W‖) , q := ‖W‖ = √( Σ_{i=1}^N (W_i)² ) . (3) \n\nBoth have a very transparent interpretation: r is the normalized overlap between the weight vectors of teacher and student, and q is the norm of the student's weight vector. 
These order parameters can also be used in multilayer learning, but their number increases with the number of possible permutations of the hidden units of teacher and student. \n\n2 The Learning Task \n\nHere we concentrate on the case in which a student perceptron has to learn a mapping provided by another perceptron. We choose identical networks for teacher and student. Both have the same sigmoid output function, i.e. g*(h) = g(h) = tanh(γh). Identical network architectures of teacher and student make the task realizable: in principle the student is able to learn the task provided by the teacher exactly. Unrealizable tasks cannot be learnt exactly; a finite error always remains. \n\nIf we use uniformly distributed random inputs x and weights W, the weighted sum h in (1) can be assumed to be Gaussian distributed. Then we can express the generalization error (2) by the order parameters (3), \n\nE_G = ∫Dz1 ∫Dz2 (1/2) { tanh[γ z1] - tanh[ q ( r z1 + √(1-r²) z2 ) ] }² , (4) \n\nwith the Gaussian measure \n\nDz := (dz/√(2π)) exp(-z²/2) , (5) \n\nwhere the integration runs from -∞ to +∞. From equation (4) we can see how the student learns the gain γ of the teacher's output function: it adjusts the norm q of its weights. The gain γ plays an important role, since it tunes the function tanh(γh) between a linear function (γ ≪ 1) and a highly nonlinear function (γ ≫ 1). Now we want to determine the learning curves of this task. \n\n3 Emergence of Overfitting \n\n3.1 Explicit Expression for the Weights \n\nBelow the storage capacity of the perceptron, i.e. α = 1, the minimum of the training error E_T is zero. A zero training error implies that every example has been learnt exactly, thus \n\nh^μ = h*^μ , μ = 1, ..., P . (6) \n\nThe weights with minimal norm that fulfill this condition are given by the Pseudoinverse (see Hertz et al. 
1991), \n\nW_i = (1/√N) Σ_{μ,ν=1}^P h*^μ (C^{-1})_{μν} x_i^ν , with C_{μν} := (1/N) Σ_{i=1}^N x_i^μ x_i^ν . (7) \n\nNote that the weights are completely independent of the output function g(h) = g*(h). They are the same as in the simplest realizable case, 'linear perceptron learns linear perceptron'. \n\n3.2 Statistical Mechanics \n\nThe calculation of the order parameters can be done by a method from statistical mechanics which applies the commonly used replica approach. For details about the replica approach see Hertz et al. (1991). The solution of the continuous perceptron problem can be found in Bös et al. (1993). Since the results of the statistical mechanics calculation are exact only in the thermodynamic limit, i.e. N → ∞, the variable α is the more natural measure. It is defined as the fraction of the number of patterns P over the system size N, i.e. α := P/N. In the thermodynamic limit N and P are infinite, but α remains finite. Reasonable system sizes, such as N ≈ 100, are normally already well described by this theory. \n\nUsually one concentrates on the zero temperature limit, because there the training error E_T attains its absolute minimum for every number of presented examples P. The corresponding order parameters for the case 'linear perceptron learns linear perceptron' are \n\nq = γ√α , r = √α . (8) \n\nThe zero temperature limit can also be called exhaustive training, since the student net is trained until the absolute minimum of E_T is reached. \n\nFor small α and high gains γ, i.e. high levels of nonlinearity, exhaustive training leads to overfitting. That means the generalization error E_G(α) is not, as it should be, monotonically decreasing with α. One reason for the overfitting is that the training follows the examples too closely. The critical gain γ_c, which determines whether the generalization error E_G(α) is an increasing or a decreasing function for small values of α, can be determined by a linear approximation. 
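

This determination can be sketched numerically (an illustrative Python snippet, not part of the paper; it assumes the condition γ_c = 2H(γ_c) with H(γ) = ∫Dz z tanh(γz) stated in the text below, and uses SciPy for quadrature and root finding):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def H(gamma):
    # H(gamma) = integral Dz z tanh(gamma z), with Dz the standard Gaussian measure
    if gamma == 0.0:
        return 0.0
    integrand = lambda z: z * np.tanh(gamma * z) * np.exp(-z * z / 2) / np.sqrt(2 * np.pi)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

# For small gamma, H(gamma) ~ gamma, so 2H(gamma) > gamma; for large gamma,
# H(gamma) approaches sqrt(2/pi), so 2H(gamma) < gamma.  The critical gain is
# the root of gamma = 2H(gamma) in between.
gamma_c = brentq(lambda g: g - 2.0 * H(g), 0.5, 3.0)
print(f"gamma_c = {gamma_c:.4f}")  # close to the paper's value 1.3371
```
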
For small α, both order parameters (3) are small, and the student's tanh-function in (4) can be approximated by a linear function. This simplifies equation (4) to the following expression, \n\nE_G(α) = E_G(0) - (γα/2) [ 2H(γ) - γ ] , with H(γ) := ∫Dz tanh(γz) z . (9) \n\nSince the function H(γ) is bounded from above by √(2/π), a critical gain exists; it is given by γ_c = 2H(γ_c). The numerical solution gives γ_c = 1.3371. If γ is higher, the slope of E_G(α) is positive for small α. In the following considerations we will always use the gain γ = 5 as an example, since this is an intermediate level of nonlinearity. \n\n[Figure 1: plot of E_G versus α = P/N for gains γ = 100.0, 10.0, 5.0, 2.0, 1.0 and 0.5.] \n\nFigure 1: Learning curves E_G(α) for the problem 'tanh-perceptron learns tanh-perceptron' for different values of the gain γ. Even in this realizable case, exhaustive training can lead to overfitting if the gain γ is high enough. \n\n3.3 How to Understand the Emergence of Overfitting \n\nHere the evaluation of the generalization error as a function of the order parameters r and q is helpful. Fig. 2 shows the function E_G(r, q) for r between 0 and 1 and q between 0 and 1.2γ. \n\nExhaustive training in realizable cases always follows the line q(r) = γr, independent of the actual output function. That means training is guided only by the training error and not by the generalization error. If the gain γ is higher than γ_c, the contour line E_G = E_G(0, 0) starts with a lower slope than q(r) = γr, which results in overfitting. \n\n4 How to Avoid Overfitting \n\nFrom Fig. 2 we can already guess that q increases too fast compared to r. 
Maybe the ratio between q and r is better during the training process. So we first have to develop a description of the training process. \n\n4.1 Training Process \n\nWe found already that the order parameters at finite temperature (T > 0) of the statistical mechanics approach are a good description of the training process in an unrealizable learning task (Bös 1995). So we use the finite temperature order parameters in this task as well. These are, again taken from the task 'linear perceptron learns linear perceptron', \n\nq(α, a) = γ √( α [ (1+α)a - 2α ] / (a² - α) ) , r(α, a) = √( (α/a) (a² - α) / [ (1+α)a - 2α ] ) , (10) \n\nwith the temperature dependent variable \n\na := 1 + [ β (Q - q) ]^{-1} . (11) \n\n[Figure 2: contour plot of E_G(r, q) for r in [0, 1] and q in [0, 6].] \n\nFigure 2: Contour plot of E_G(r, q) defined by (4), the generalization error as a function of the two order parameters. Starting from the minimum E_G = 0 at (r, q) = (1, 5), the contour lines for E_G = 0.1, 0.2, ..., 0.8 are given (dotted lines). The dashed line corresponds to E_G(0, 0) = 0.42. The solid lines are parametric curves of the order parameters (r, q) for certain training strategies. The straight line illustrates exhaustive training, the lower ones the optimal training, which will be explained in Fig. 3. Here the gain γ = 5. \n\nThe zero temperature limit corresponds to a = 1. We will now show that the decrease of the temperature dependent parameter a from ∞ to 1 describes the evolution of the order parameters during the training process. 
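

As a quick numerical check of this description (a hypothetical Python sketch, not from the paper; it assumes finite-temperature order parameters of the form q(α, a) = γ √( α [(1+α)a - 2α] / (a² - α) ) and r(α, a) = √( (α/a) (a² - α) / [(1+α)a - 2α] ), a reconstruction of (10) chosen so that a = 1 recovers (8)):

```python
import numpy as np

GAMMA = 5.0  # gain used as the running example in the paper

def order_params(alpha, a):
    """Assumed closed form of the finite-temperature order parameters.

    a = 1 is the zero-temperature (exhaustive-training) limit, a -> infinity
    the start of training; alpha = P/N is the loading rate."""
    num = (1.0 + alpha) * a - 2.0 * alpha
    den = a * a - alpha
    q = GAMMA * np.sqrt(alpha * num / den)
    r = np.sqrt(alpha / a * den / num)
    return r, q

# a = 1 recovers the exhaustive result (8): r = sqrt(alpha), q = gamma*sqrt(alpha)
r1, q1 = order_params(0.25, 1.0)
print(r1, q1)  # 0.5 2.5

# during training (a > 1) the trajectory lies below the exhaustive line q = gamma*r,
# i.e. the ratio q/r is smaller than under exhaustive training
r2, q2 = order_params(0.25, 2.0)
print(q2 < GAMMA * r2)  # True
```
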
In the training process the natural parameter is the number of parallel training steps t. In each parallel training step all patterns are presented once and all weights are updated. Fig. 3 shows the evolution of the order parameters (10) as parametric curves (r, q). \n\nThe exhaustive learning curve is defined by a = 1 with the parameter α (solid line). For each α the training ends on this curve. The dotted lines illustrate the training process: a runs from infinity to 1. Simulations of the training process have shown that this theoretical curve is a good description, at least after some training steps. We will now use this description of the training process for the definition of an optimized training strategy. \n\n4.2 Optimal Temperature \n\nThe optimized training strategy chooses not a = 1, or the corresponding temperature T = 0, but the value of a (i.e. the temperature) which minimizes the generalization error E_G. For the lower solid curve, the parametric curve (r, q), the value of a which minimizes E_G is chosen for every α. The function E_G(a) has two minima between α = 0.5 and 0.7. The solid line always indicates the absolute minimum. The parametric curves corresponding to the local minima are given by the double-dashed and dash-dotted lines. Note that the optimized value of a is always related to an optimized temperature through equation (11), but the parameter a is also related to the number of training steps t. \n\n[Figure 3: parametric curves (r, q); legend: local min., abs. min., local min., simulation.] \n\nFigure 3: Training process. The order parameters (10) as parametric curves (r, q) with the parameters α and a. The straight solid line corresponds to exhaustive learning, i.e. a = 1 (marks at α = 0.1, 0.2, ..., 1.0). 
The dotted lines describe the training process for fixed α: iterative training reduces the parameter a from ∞ to 1. Examples for α = 0.1, 0.2, 0.3, 0.4, 0.9 and 0.99 are given. The lower solid line is an optimized learning curve; to achieve this curve, for each α the value of a is chosen which minimizes E_G absolutely. Between α ≈ 0.5 and 0.7 the error E_G has two minima; the double-dashed and dash-dotted lines indicate the second, local minimum of E_G. Compare with Fig. 2 to see which is the absolute and which the local minimum of E_G. A naive early stopping procedure always ends in the minimum with the smaller q, since it is the first minimum reached during the training process (see the simulation indicated by errorbars). \n\n4.3 Early Stopping \n\nFig. 3 and Fig. 2 together indicate that stopping the training process earlier can avoid the overfitting. But in order to determine the stopping point one has to know the actual generalization error during the training. Cross-validation tries to provide an approximation of the real generalization error. The cross-validation error E_CV is defined like E_T, see (2), on a set of examples which are not used during the training. Here we calculate the optimum using the real generalization error, given by r and q, to determine the optimal point for early stopping. It is a lower bound for training with finite cross-validation sets. Some preliminary tests have shown that already small cross-validation sets approximate the real E_G quite well. Training is stopped when E_G increases. The resulting curve is given by the errorbars in Fig. 3. The errorbars indicate the standard deviation of a simulation with N = 100, averaged over 50 trials. \n\nIn Fig. 4 the same results are shown as learning curves E_G(α). There one can see clearly that the early stopping strategy avoids the overfitting. \n\n5 Summary and Outlook \n\nIn this paper we have shown that overfitting can also emerge in realizable learning tasks. 
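

The cross-validated early-stopping procedure of Sec. 4.3 can be sketched as follows (an illustrative Python simulation; network size, learning rate, set sizes, and the stopping tolerance are arbitrary choices for demonstration, not the paper's simulation setup):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, P_CV, GAMMA, ETA = 100, 40, 40, 5.0, 0.5  # illustrative values

# teacher with weight norm sqrt(N), so its local field h* is standard normal
w_star = rng.normal(size=N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)

def teacher(X):
    return np.tanh(GAMMA * X @ w_star / np.sqrt(N))

def student(X, w):
    # the student absorbs the gain into the norm of its weights, cf. eq. (4)
    return np.tanh(X @ w / np.sqrt(N))

X_tr, X_cv = rng.normal(size=(P, N)), rng.normal(size=(P_CV, N))
z_tr, z_cv = teacher(X_tr), teacher(X_cv)

w = np.zeros(N)
e_cv_first = 0.5 * np.mean((student(X_cv, w) - z_cv) ** 2)
best = e_cv_first
for step in range(5000):
    z = student(X_tr, w)
    grad = ((z - z_tr) * (1.0 - z * z)) @ X_tr / (P * np.sqrt(N))  # dE_T/dw
    w -= ETA * grad
    e_cv = 0.5 * np.mean((student(X_cv, w) - z_cv) ** 2)
    if e_cv > 1.001 * best:
        # clear rise of the cross-validation error: stop early; note this naive
        # rule finds the FIRST minimum, which may only be a local one (Sec. 4.3)
        break
    best = min(best, e_cv)

print(f"stopped after {step + 1} steps, E_cv = {best:.3f} (initial {e_cv_first:.3f})")
```
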
The calculation of a critical gain and the contour lines in Fig. 2 imply that the reason for the overfitting is the nonlinearity of the problem. The network adjusts slowly to the nonlinearity of the task. We have developed a method to avoid the overfitting; it can be interpreted in two ways. \n\n[Figure 4: learning curves E_G versus P/N; legend: exh., local min., abs. min., local min., simulation.] \n\nFigure 4: Learning curves corresponding to the parametric curves in Fig. 3. The upper solid line again shows exhaustive training. The optimized finite temperature curve is the lower solid line. From α = 0.6 on, exhaustive and optimal training lead to identical results (see marks). The simulation for early stopping (errorbars) finds the first minimum of E_G. \n\nTraining at a finite temperature reduces overfitting. It can be realized if one trains with noisy examples. In the other interpretation one learns without noise but stops the training earlier. The early stopping is guided by cross-validation. It was observed that early stopping is not completely simple, since it can lead to a local minimum of the generalization error. One should be aware of this possibility before applying early stopping. \n\nSince multilayer perceptrons are built of nonlinear perceptrons, the same effects are important for multilayer learning. A study with large scale simulations (Müller et al. 1995) has shown that overfitting also occurs in realizable multilayer learning tasks. \n\nAcknowledgments \n\nI would like to thank S. Amari and M. Opper for stimulating discussions, and M. Herrmann for hints concerning the presentation. \n\nReferences \n\nS. Bös. (1995) Avoiding overfitting by finite temperature learning and cross-validation. International Conference on Artificial Neural Networks '95, Vol. 2, p. 111. \n\nS. Bös, W. Kinzel & M. Opper. 
(1993) Generalization ability of perceptrons with continuous outputs. Phys. Rev. E 47:1384-1391. \n\nJ. Hertz, A. Krogh & R. G. Palmer. (1991) Introduction to the Theory of Neural Computation. Reading, MA: Addison-Wesley. \n\nK. R. Müller, M. Finke, N. Murata, K. Schulten & S. Amari. (1995) On large scale simulations for learning curves. Neural Computation, in press. \n", "award": [], "sourceid": 1127, "authors": [{"given_name": "Siegfried", "family_name": "B\u00f6s", "institution": null}]}