{"title": "Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping", "book": "Advances in Neural Information Processing Systems", "page_first": 402, "page_last": 408, "abstract": null, "full_text": "Overfitting in Neural Nets: Backpropagation, \n\nConjugate Gradient, and Early Stopping \n\nRich Caruana \nCALD,CMU \n\n5000 Forbes Ave. \n\nPittsburgh, PA 15213 \ncaruana@cs.cmu.edu \n\nSteve Lawrence \n\nNEC Research Institute \n4 Independence Way \nPrinceton, NJ 08540 \n\nLee Giles \n\nInformation Sciences \nPenn State University \n\nUniversity Park, PA 16801 \n\nlawrence@ research. nj. nec. com \n\ngiles@ist.psu.edu \n\nAbstract \n\nThe conventional wisdom is that backprop nets with excess hidden units \ngeneralize poorly. We show that nets with excess capacity generalize \nwell when trained with backprop and early stopping. Experiments sug(cid:173)\ngest two reasons for this: 1) Overfitting can vary significantly in different \nregions of the model. Excess capacity allows better fit to regions of high \nnon-linearity, and backprop often avoids overfitting the regions of low \nnon-linearity. 2) Regardless of size, nets learn task subcomponents in \nsimilar sequence. Big nets pass through stages similar to those learned \nby smaller nets. Early stopping can stop training the large net when it \ngeneralizes comparably to a smaller net. We also show that conjugate \ngradient can yield worse generalization because it overfits regions of low \nnon-linearity when learning to fit regions of high non-linearity. \n\nIntroduction \n\n1 \nIt is commonly believed that large multi-layer perceptrons (MLPs) generalize poorly: nets \nwith too much capacity overfit the training data. Restricting net capacity prevents overfit(cid:173)\nting because the net has insufficient capacity to learn models that are too complex. This \nbelief is consistent with a VC-dimension analysis of net capacity vs. 
generalization: the more free parameters in the net, the larger the VC-dimension of the hypothesis space, and the less likely the training sample is large enough to select a (nearly) correct hypothesis [2]. \n\nOnce it became feasible to train large nets on real problems, a number of MLP users noted that the overfitting they expected from nets with excess capacity did not occur. Large nets appeared to generalize as well as smaller nets, sometimes better. The earliest report of this that we are aware of is Martin and Pittman in 1991: \"We find only marginal and inconsistent indications that constraining net capacity improves generalization\" [7]. \n\nWe present empirical results showing that MLPs with excess capacity often do not overfit. On the contrary, we observe that large nets often generalize better than small nets of sufficient capacity. Backprop appears to use excess capacity to better fit regions of high non-linearity, while still fitting regions of low non-linearity smoothly. (This desirable behavior can disappear if a fast training algorithm such as conjugate gradient is used instead of backprop.) Nets with excess capacity trained with backprop appear first to learn models similar to models learned by smaller nets. If early stopping is used, training of the large net can be halted when the large net's model is similar to models learned by smaller nets. \n\n\f[Figure 1 contains four panels: polynomial fits of order 10 and order 20 (top) and MLP fits with 10 and 50 hidden nodes (bottom). Each panel plots the approximation, the training data, and the target function without noise.] \n\nFigure 1: Top: Polynomial fit to data from y = sin(x/3) + ν. Order 20 overfits. 
Bottom: Small and large MLPs fit to the same data. The large MLP does not overfit significantly more than the small MLP. \n\n2 Overfitting \nMuch has been written about overfitting and the bias/variance tradeoff in neural nets and other machine learning models [2, 12, 4, 8, 5, 13, 6]. The top of Figure 1 illustrates polynomial overfitting. We created a training dataset by evaluating y = sin(x/3) + ν at x = 0, 1, 2, ..., 20, where ν is a uniformly distributed random variable between -0.25 and 0.25. We fit polynomial models with orders 2-20 to the data. Underfitting occurs with order 2. The fit is good with order 10. As the order (and number of parameters) increases, however, significant overfitting (poor generalization) occurs. At order 20, the polynomial fits the training data well, but interpolates poorly. \n\nThe bottom of Figure 1 shows MLPs fit to the data. We used a single hidden layer MLP, backpropagation (BP), and 100,000 stochastic updates. The learning rate was reduced linearly to zero from an initial rate of 0.5 (reducing the learning rate improves convergence, and linear reduction performs similarly to other schedules [3]). This schedule and number of updates trains the MLPs to completion. (We examine early stopping in Section 4.) As with polynomials, the smallest net with one hidden unit (HU) (4 weights) underfits the data. The fit is good with two HU (7 weights). Unlike polynomials, however, networks with 10 HU (31 weights) and 50 HU (151 weights) also yield good models. MLPs with seven times as many parameters as data points trained with BP do not significantly overfit this data. The experiments in Section 4 confirm that this bias of BP-trained MLPs towards smooth models is not limited to the simple 2-D problem used here. \n\n3 Local Overfitting \nRegularization methods such as weight decay typically assume that overfitting is a global phenomenon. 
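The polynomial half of the Section 2 experiment can be sketched in a few lines (an illustrative reconstruction, not the authors' code; the random seed, the particular noise draw, and the dense evaluation grid are our assumptions):

```python
# Sketch of the Section 2 polynomial experiment: fit polynomials of increasing
# order to y = sin(x/3) + nu sampled at x = 0..20, with nu ~ Uniform(-0.25, 0.25),
# then compare training error to interpolation error on a dense noise-free grid.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)                 # seed is an arbitrary choice
x_train = np.arange(21.0)                      # x = 0, 1, ..., 20
y_train = np.sin(x_train / 3) + rng.uniform(-0.25, 0.25, x_train.shape)

x_dense = np.linspace(0.0, 20.0, 401)          # held-out grid between the samples
y_dense = np.sin(x_dense / 3)                  # target function without noise

def fit_and_score(order):
    # Polynomial.fit rescales x internally, which keeps order 20 well conditioned.
    p = Polynomial.fit(x_train, y_train, order)
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    interp_mse = np.mean((p(x_dense) - y_dense) ** 2)
    return train_mse, interp_mse

for order in (2, 10, 20):
    train_mse, interp_mse = fit_and_score(order)
    print(f"order {order:2d}: train MSE {train_mse:.4f}, interpolation MSE {interp_mse:.4f}")
```

Order 2 underfits (large training error), while order 20 drives the training error to essentially zero yet does badly between the training points, mirroring the top row of Figure 1.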
But overfitting can vary significantly in different regions of a model. Figure 2 shows polynomial fits for data generated from the following equation: \n\ny = -cos(x) + ν for 0 ≤ x < π \ny = cos(3(x - π)) + ν for π ≤ x ≤ 2π (Equation 1) \n\nFive equally spaced points were generated in the first region, and 15 in the second region, so that the two regions have different data densities and different underlying functions. Overfitting is different in the two regions. In Figure 2 the order 6 model fits the left region well, but larger models overfit it. The order 6 model underfits the region on the right, and the order 10 model fits it better. No model performs well on both regions. \n\n\f[Figure 2 contains four panels, for polynomial orders 2, 6, 10, and 16. Each panel plots the approximation, the training data, and the target function without noise.] \n\nFigure 2: Polynomial approximation of data from Equation 1 as the order of the model is increased from 2 to 16. The overfitting behavior differs in the left and right hand regions. \n\nFigure 3 shows MLPs trained on the same data (20,000 batch updates, learning rate linearly reduced to zero starting at 0.5). Small nets underfit. Larger nets, however, fit the entire function well without significant overfitting in the left region. \n\nThe ability of MLPs to fit both regions of low and high non-linearity well (without overfitting) depends on the training algorithm. Conjugate gradient (CG) is the most popular second order method. CG results in lower training error for this problem, but overfits significantly. Figure 4 shows results for 10 trials for BP and CG. 
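The two-region dataset of Equation 1 can be generated as follows (a sketch; the noise amplitude of 0.25 is carried over from Section 2 as an assumption, since this section does not restate it, and the function name is our own):

```python
# Generate the two-region data of Equation 1: five equally spaced points on the
# smooth left region and fifteen on the highly non-linear right region.
import numpy as np

def make_two_region_data(noise=0.25, seed=0):
    rng = np.random.default_rng(seed)
    x_left = np.linspace(0.0, np.pi, 5, endpoint=False)   # 0 <= x < pi, low non-linearity
    x_right = np.linspace(np.pi, 2.0 * np.pi, 15)         # pi <= x <= 2*pi, high non-linearity
    y_left = -np.cos(x_left)
    y_right = np.cos(3.0 * (x_right - np.pi))
    x = np.concatenate([x_left, x_right])
    y = np.concatenate([y_left, y_right]) + rng.uniform(-noise, noise, x.shape)
    return x, y

x, y = make_two_region_data()
```

The differing sample densities and curvature of the two regions are what make any single global smoothness setting (polynomial order, weight decay strength) a poor fit, which is the point of this section.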
Large BP nets generalize better on this problem; even the optimal size CG net is prone to overfitting. The degree of overfitting varies in different regions. When the net is large enough to fit the region of high non-linearity, overfitting is often seen in the region of low non-linearity. \n\n4 Generalization, Network Capacity, and Early Stopping \nThe results in Sections 2 and 3 suggest that BP nets are less prone to overfitting than expected. But MLPs can and do overfit. This section examines overfitting vs. net size on seven problems: NETtalk [10], 7 and 12 bit parity, an inverse kinematic model for a robot arm (thanks to Sebastian Thrun for the simulator), Base 1 and Base 2 (two sonar modeling problems using data collected from a robot wandering hallways at CMU), and vision data used to learn to steer an autonomous car [9]. These problems exhibit a variety of characteristics. Some are Boolean. Others are continuous. Some have noise. Others are noise-free. Some have many inputs or outputs. Others have few inputs or outputs. \n\n4.1 Results \nFor each problem we used small training sets (100-1000 points, depending on the problem) so that overfitting was possible. We trained fully connected feedforward MLPs with one hidden layer whose size varied from 2 to 800 HU (about 500-100,000 parameters). All the nets were trained with BP using stochastic updates, learning rate 0.1, and momentum 0.9. \n\nWe used early stopping for regularization because it doesn't interfere with backprop's ability to control capacity locally. Early stopping combined with backprop is so effective that very large nets can be trained without significant overfitting. Section 4.2 explains why. \n\n\f[Figure 3 contains four panels, for MLPs with 1, 4, 10, and 100 hidden units.] \n\nFigure 3: MLP approximation using backpropagation (BP) training of data from Equation 1 as the number of hidden units is increased. 
No significant overfitting can be seen. \n\n[Figure 4 contains two box-whiskers plots of test error vs. number of hidden nodes (5, 10, 25, and 50), with the vertical axis running from 0.1 to 0.7.] \n\nFigure 4: Test Normalized Mean Squared Error for MLPs trained with BP (left) and CG (right). Results are shown with both box-whiskers plots and the mean plus and minus one standard deviation. \n\nFigure 5 shows generalization curves for four of the problems. Examining the results for all seven problems, we observe that on only three (Base 1, Base 2, and ALVINN) do nets that are too large yield worse generalization than smaller networks, but the loss is surprisingly small. Many trials were required before statistical tests confirmed that the differences between the optimal size net and the largest net were significant. Moreover, the results suggest that generalization is hurt more by using a net that is a little too small than by using one that is far too large, i.e., it is better to make nets too large than too small. \n\nFor most tasks and net sizes, we trained well beyond the point where generalization performance peaked. Because we had complete generalization curves, we noticed something unexpected. On some tasks, small nets overtrained considerably. The NETtalk graph in Figure 5 is a good example. Regularization (e.g., early stopping) is critical for nets of all sizes, not just ones that are too big. Nets with restricted capacity can overtrain. \n\n4.2 Why Excess Capacity Does Not Hurt \nBP nets initialized with small weights can develop large weights only after the number of updates is large. Thus BP nets consider hypotheses with small weights before hypotheses with large weights. Nets with large weights have more representational power, so simple hypotheses are explored before complex hypotheses. \n\n\f
[Figure 5 contains four panels: NETtalk, Inverse Kinematics, Base 1 (average of 10 runs), and Base 2 (average of 10 runs). Each panel plots test error vs. pattern presentations for nets with 2, 8, 32, 128, and 512 hidden units.] \n\nFigure 5: Generalization performance vs. net size for four of the seven test problems. \n\nWe analyzed what nets of different size learn while they are trained, comparing the input/output behavior of nets at different stages of learning on large samples of test patterns. We compare the input/output behavior of two nets by computing the squared error between the predictions made by the two nets. If two nets make the same predictions for all test cases, they have learned the same model (even though each model is represented differently), and the squared error between the two models is zero. If two nets make different predictions for test cases, they have learned different models, and the squared error between them is large. 
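The model-comparison measure just described can be sketched directly (illustrative; the function name is our own):

```python
# I/O similarity of two models: squared error between their predictions on a
# common test set, not their error against the true labels.
import numpy as np

def model_similarity(model_a, model_b, x_test):
    """Squared error between two models' predictions (0 means identical I/O behavior)."""
    pred_a = np.asarray(model_a(x_test), dtype=float)
    pred_b = np.asarray(model_b(x_test), dtype=float)
    return float(np.sum((pred_a - pred_b) ** 2))

# Two different representations of the same function have similarity zero:
f = lambda t: 2.0 * t + 1.0
g = lambda t: t + t + 1.0
x = np.linspace(-1.0, 1.0, 100)
print(model_similarity(f, g, x))   # -> 0.0
```

Note the measure says nothing about accuracy: two models can both be far from the true labels yet score near zero against each other, exactly as the text argues.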
This is not the error the models make predicting the true labels, but the difference between predictions made by two different models. Two models can have poor generalization (large error on true labels) but near zero error compared to each other if they are similar models. Two models with good generalization (low error on true labels), however, must have low error compared to each other. \n\nThe first graph in Figure 5 shows learning curves for nets with 10, 25, 50, 100, 200, and 400 HU trained on NETtalk. For each size, we saved the net from the epoch that generalized best on a large test set. This gives us the best model of each size found by backprop. We then trained a BP net with 800 HU, and after each epoch compared this net's model with the best models saved for nets of 10-400 HU. This lets us compare the sequence of models learned by the 800 HU net to the best models learned by smaller nets. \n\nFigure 6 shows this comparison. The horizontal axis is the number of backprop passes applied to the 800 HU net. The vertical axis is the error between the 800 HU net model and the best model for each smaller net. The 800 HU net starts off distant from the good smaller models, then becomes similar to the good models, and then diverges from them. This is expected. What is interesting is that the 800 HU net first becomes closest to the best \n\n\f[Figure 6 is a single plot titled \"Similarity of 800 HU Net During Training to Smaller Size Peak Performers\", with one curve per peak performer (10, 25, 50, 100, 200, and 400 HU), error between models on the vertical axis (0 to 1000) and pattern presentations on the horizontal axis (0 to 200,000).] 
Figure 6: I/O similarity during training between an 800 hidden unit net and smaller nets (10, 25, 50, 100, 200, and 400 hidden units) trained on NETtalk. \n\n10 HU net, then closest to the 25 HU net, then closest to the 50 HU net, etc. As it is trained, the 800 HU net learns a sequence of models similar to the models learned by smaller nets. If early stopping is used, training of the 800 HU net can be stopped when it behaves similarly to the best model that could be learned with nets of 10, 25, 50, ... HU. Large BP nets learn models similar to those learned by smaller nets. If a BP net with too much capacity would overfit, early stopping could stop training when the model was similar to a model that would have been learned by a smaller net of optimal size. \n\nThe error between models is about 200-400, yet the generalization error is about 1600. The models are much closer to each other than any of them are to the true model. With early stopping, what counts is the closest approach of each model to the target function, not where models end up late in training. With early stopping there is little disadvantage to using models that are too large because their learning trajectories are similar to those followed by smaller nets of more optimal size. \n\n5 Related Work \nOur results show that models learned by backprop are biased towards \"smooth\" solutions. As nets with excess capacity are trained, they first explore smoother models similar to the models smaller nets would have learned. Weigend [11] performed an experiment showing that BP nets learn a problem's eigenvectors in sequence, learning the 1st eigenvector first, then the 2nd, etc. 
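The early-stopping rule relied on throughout these experiments can be sketched generically (an illustrative loop, not the authors' implementation; the patience parameter and the dictionary stand-in for a net are our assumptions):

```python
# Generic early stopping: keep the model from the epoch with the best validation
# error, and stop once validation error has not improved for `patience` epochs.
import copy

def train_with_early_stopping(model, train_step, val_error, max_epochs=1000, patience=20):
    best_err, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_step(model)                      # one epoch of training (mutates model)
        err = val_error(model)
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:
                break                          # validation error stopped improving
    return best_model, best_err

# Toy check: a "net" whose validation error falls until epoch 50, then rises.
net = {"epoch": 0}
def step(m):
    m["epoch"] += 1
best, err = train_with_early_stopping(net, step, lambda m: (m["epoch"] - 50) ** 2,
                                      max_epochs=1000, patience=10)
print(best["epoch"], err)   # -> 50 0
```

Because what is returned is the model at its closest approach to the target, not the model at the end of training, an oversized net whose trajectory passes through the smaller nets' best models loses little by its eventual divergence.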
His result complements our analysis of what nets of different sizes learn: if large nets learn an eigenvector sequence similar to smaller nets, then the models learned by the large net will pass through intermediate stages similar to what is learned by small nets (but only if nets of different sizes learn the eigenvectors equally well, an assumption we do not need to make). \n\nTheoretical work by Bartlett [1] supports our results. Bartlett notes: \"the VC-bounds seem loose; neural nets often perform successfully with training sets that are considerably smaller than the number of weights.\" Bartlett shows (for classification) that the number of training samples only needs to grow according to A^(2l) (ignoring log factors) to avoid overfitting, where A is a bound on the total weight magnitudes and l is the number of layers in the network. This result suggests that a net with smaller weights will generalize better than a similar net with large weights. Examining the weights from BP and CG nets shows that BP training typically results in smaller weights. \n\n6 Summary \nNets of all sizes overfit some problems. But generalization is surprisingly insensitive to excess capacity if the net is trained with backprop. Because BP nets with excess capacity learn a sequence of models functionally similar to what smaller nets learn, early stopping can often be used to stop training large nets when they have learned models similar to those learned by smaller nets of optimal size. This means there is little loss in generalization performance for nets with excess capacity if early stopping can be used. \n\nOverfitting is not a global phenomenon, although methods for controlling it often assume that it is. Overfitting can vary significantly in different regions of the model. MLPs trained with BP use excess parameters to improve fit in regions of high non-linearity, while not significantly overfitting other regions. 
Nets trained with conjugate gradient, however, are more sensitive to net size. BP nets appear to be better than CG nets at avoiding overfitting in regions with different degrees of non-linearity, perhaps because CG is more effective at learning complex functions that overfit the training data, while BP is biased toward learning smoother functions. \n\nReferences \n\n[1] P.L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems, volume 9, page 134. The MIT Press, 1997. \n\n[2] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989. \n\n[3] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In Advances in Neural Information Processing Systems, volume 3, pages 832-838. Morgan Kaufmann, 1991. \n\n[4] S. Geman et al. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992. \n\n[5] A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, volume 4, pages 950-957. Morgan Kaufmann, 1992. \n\n[6] Y. Le Cun, J.S. Denker, and S.A. Solla. Optimal Brain Damage. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 598-605, San Mateo, 1990. (Denver 1989), Morgan Kaufmann. \n\n[7] G.L. Martin and J.A. Pittman. Recognizing hand-printed letters and digits using backpropagation learning. Neural Computation, 3:258-267, 1991. \n\n[8] J.E. Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems, volume 4, pages 847-854. Morgan Kaufmann, 1992. \n\n[9] D.A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In D.S. 
Touretzky, editor, Advances in Neural Information Processing Systems, volume 1, pages 305-313, San Mateo, 1989. (Denver 1988), Morgan Kaufmann. \n\n[10] T. Sejnowski and C. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987. \n\n[11] A. Weigend. On overfitting and the effective number of hidden units. In Proceedings of the 1993 Connectionist Models Summer School, pages 335-342. Lawrence Erlbaum Associates, 1993. \n\n[12] A.S. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, volume 3, pages 875-882. Morgan Kaufmann, 1991. \n\n[13] D. Wolpert. On bias plus variance. Neural Computation, 9(6):1211-1243, 1997. \n\n\f", "award": [], "sourceid": 1895, "authors": [{"given_name": "Rich", "family_name": "Caruana", "institution": null}, {"given_name": "Steve", "family_name": "Lawrence", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}