{"title": "Generalization and Parameter Estimation in Feedforward Nets: Some Experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 630, "page_last": 637, "abstract": "", "full_text": "630 Morgan and Bourlard \n\nGeneralization and Parameter Estimation \nin Feedforward Nets: \nSome Experiments \n\nN. Morgan \nInternational Computer Science Institute \nBerkeley, CA 94704, USA \n\nH. Bourlard* \n*Philips Research Laboratory Brussels \nB-1170 Brussels, Belgium \n\nABSTRACT \n\nWe have done an empirical study of the relation of the number of parameters (weights) in a feedforward net to generalization performance. Two experiments are reported. In one, we use simulated data sets with well-controlled parameters, such as the signal-to-noise ratio of continuous-valued data. In the second, we train the network on vector-quantized mel cepstra from real speech samples. In each case, we use back-propagation to train the feedforward net to discriminate in a multiple class pattern classification problem. We report the results of these studies, and show the application of cross-validation techniques to prevent overfitting. \n\n1 INTRODUCTION \n\nIt is well known that system models which have too many parameters (with respect to the number of measurements) do not generalize well to new measurements. For instance, an autoregressive (AR) model can be derived which will represent the training data with no error by using as many parameters as there are data points. This would generally be of no value, as it would only represent the training data. Criteria such as the Akaike Information Criterion (AIC) [Akaike, 1974, 1986] can be used to penalize both the complexity of AR models and their training error variance. In feedforward nets, we do not currently have such a measure. 
In fact, given the aim of building systems which are biologically plausible, there is a temptation to assume the usefulness of indefinitely large adaptive networks. In contrast to our best guess at Nature's tricks, man-made systems for pattern recognition seem to require nasty amounts of data for training. In short, the design of massively parallel systems is limited by the number of parameters that can be learned with available training data. It is likely that the only way truly massive systems can be built is with the help of prior information, e.g., connection topology and weights that need not be learned [Feldman et al., 1988]. \n\nLearning theory [Valiant, 1984; Pearl, 1978] has begun to establish what is possible for trained systems. Order-of-magnitude lower bounds have been established for the number of measurements required to train a feedforward net of a desired size [Baum & Haussler, 1988]. Rules of thumb suggesting the number of samples required for specific distributions could be useful for practical problems. Widrow has suggested having a training sample size that is 10 times the number of weights in a network (\"Uncle Bernie's Rule\") [Widrow, 1987]. We have begun an empirical study of the relation of the number of parameters in a feedforward net (e.g., hidden units, connections, feature dimension) to generalization performance for data sets with known discrimination complexity and signal-to-noise ratio. In the experiment reported here, we are using simulated data sets with controlled parameters, such as the number of clusters of continuous-valued data. In a related practical example, we have trained a feedforward network on vector-quantized mel cepstra from real speech samples. In each case, we are using the back-propagation algorithm [Rumelhart et al., 1986] to train the feedforward net to discriminate in a multiple class pattern classification problem. 
Our results confirm that estimating more parameters than there are training samples can degrade generalization. However, the peak in generalization performance (for the difficult pattern recognition problems tested here) can be quite broad if the networks are not trained too long, suggesting that previous guidelines for network size may have been conservative. Furthermore, cross-validation techniques, which have also proved quite useful for autoregressive model order determination, appear to improve generalization when used as a stopping criterion for iteration, thus preventing overtraining. \n\n2 RANDOM VECTOR PROBLEM \n\n2.1 METHODS \n\nStudies based on synthesized data sets will generally show behavior that is different from that seen with a real data set. Nonetheless, such studies are useful because of the ease with which variables of interest may be altered. In this case, the object was to manufacture a difficult pattern recognition problem with statistically regular variability between the training and test sets. This is actually no easy trick; if the problem is too easy, then even very small nets will be sufficient, and we would not be modeling the problem of doing hard pattern classification with small amounts of training data. If the problem is too hard, then variations in performance will be lost in the statistical variations inherent to methods like back-propagation, which use random initial weight values. \n\nRandom points in a 4-dimensional hyperrectangle (drawn from a uniform probability distribution) are classified arbitrarily into one of 16 classes. This group of points will be referred to as a cluster. This process is repeated for 1-4 nonoverlapping hyperrectangles. A total of 64 points are chosen, 4 for each class. 
All points are then randomly perturbed with noise of uniform density and range specified by a desired signal-to-noise ratio (SNR). The noise is added twice to create 2 data sets, one to be used for training, and the other for test. Intuitively, one might expect that 16-64 hidden units would be required to transform the training space for classification by the output layer. However, the variation between training and test and the relatively small amount of data (256 numbers) suggest that for large numbers of parameters (over 256) there should be a significant degrading of generalization. Another issue was how performance in such a situation would vary over large numbers of iterations. \n\nSimulations were run on this data using multi-layer perceptrons (MLPs), i.e., layered feedforward networks, with 4 continuous-valued inputs, 16 outputs, and a hidden layer of sizes ranging from 4 to 128. Nets were run for signal-to-noise ratios of 1.0 and 2.0, where the SNR is defined as the ratio of the range of the original cluster points to the range of the added random values. Error back-propagation without momentum was used, with an adaptation constant of .25. For each case, the 64 training patterns were used 10,000 times, and the resulting network was tested on the second data set every 100 iterations so that generalization could be observed during the learning. Blocks of ten scores were averaged to stabilize the generalization estimate. After this smoothing, the standard deviation of error (using the normal approximation to the binomial distribution) was roughly 1%. Therefore, differences of 3% in generalization performance are significant at a level of .001. All computation was performed on Sun4-110's using code written in C at ICSI. Roughly a trillion floating point operations were required for the study. 
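The data-generation recipe above can be sketched as follows. This is a minimal reconstruction, not the authors' actual C code; the function name, the unit hyperrectangle, and the seeding are assumptions: \n\n```python
import numpy as np

def make_random_vector_data(n_classes=16, points_per_class=4, dim=4, snr=1.0, seed=0):
    # 64 points drawn uniformly in a 4-D hyperrectangle, labeled arbitrarily
    rng = np.random.default_rng(seed)
    n = n_classes * points_per_class
    signal = rng.uniform(0.0, 1.0, size=(n, dim))               # cluster points, range 1.0
    labels = np.repeat(np.arange(n_classes), points_per_class)  # arbitrary class assignment
    noise_range = 1.0 / snr    # SNR = range of cluster points / range of added noise
    # the noise is added twice: once for the training set, once for the test set
    train = signal + rng.uniform(-noise_range / 2, noise_range / 2, size=signal.shape)
    test = signal + rng.uniform(-noise_range / 2, noise_range / 2, size=signal.shape)
    return (train, labels), (test, labels)
```
\nAt SNR = 1.0 the perturbation range equals the cluster range, so the training and test versions of the same underlying point can differ substantially, which is what makes the problem hard. \n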
\n\n2.2 RESULTS \n\nTable I shows the test performance for a single cluster and a signal-to-noise ratio of 1.0. The chart shows the variation over a range of iterations and network size (specified both as #hidden units, and as the ratio of #weights to #measurements, or \"weight ratio\"). Note that the percentages can have finer gradation than 1/64, due to the averaging, and that the performance on the training set is given in parentheses. Test performance is best for this case for 8 hidden units (24.7%), or a weight ratio of .62 (after 2000 iterations), and for 16 units (21.9%), or a weight ratio of 1.25 (after 10000 iterations). For larger networks, the performance degrades, presumably because of the added noise. At 2000 iterations, the degradation is statistically significant, even in going from 8 to 16 hidden units. There is further degradation out to the 128-unit case. The surprising thing is that, while this degradation is quite noticeable, it is quite graceful considering the order-of-magnitude range in net sizes. An even stronger effect is the loss of generalization power when the larger nets are more fully trained. All of the nets generalized better when they were trained to a relatively poor degree, especially the larger ones. 
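The \"weight ratio\" can be reproduced by counting inter-layer connections; the sketch below is a reconstruction under the assumption that bias terms are excluded and that #measurements = 64 patterns x 4 inputs = 256: \n\n```python
def weight_ratio(n_in, n_hidden, n_out, n_measurements):
    # connections input->hidden plus hidden->output, divided by the data count
    return (n_in * n_hidden + n_hidden * n_out) / n_measurements

# 4 inputs, 16 outputs, 256 measurements, as in the random vector problem
ratios = [weight_ratio(4, h, 16, 256) for h in (4, 8, 16, 32, 64, 128)]
# -> [0.3125, 0.625, 1.25, 2.5, 5.0, 10.0], matching the table column
```
\nThese values round to exactly the ratio column of Tables I and II (.31, .62, 1.25, 2.50, 5.0, 10.0), which suggests only inter-layer weights were counted. \n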
\n\nTable I - Test (and training) scores: 1 cluster, SNR = 1.0 \n\n#hidden  #weights/      %Test (Train) Correct after N Iterations \nunits    #measurements    1000        2000        5000        10000 \n4        .31              9.2(4.4)    21.7(15.6)  12.0(25.9)  15.6(34.4) \n8        .62             11.4(5.2)    24.7(17.0)  20.6(29.8)  21.4(63.9) \n16       1.25            13.6(6.9)    21.1(18.4)  18.3(37.2)  21.9(73.4) \n32       2.50            12.8(6.4)    18.4(18.3)  17.8(41.7)  13.0(80.8) \n64       5.0             13.6(7.7)    18.3(20.8)  19.7(34.4)  18.0(79.2) \n128      10.0            11.6(6.7)    17.7(19.1)  12.2(34.7)  15.6(75.6) \n\nTable II shows the results for the same 1-cluster problem, but with higher SNR data (2.0). In this case, a higher level of test performance was reached, and it was reached for a larger net with more iterations (40.8% for 64 hidden units after 5000 iterations). At this point in the iterations, no real degradation was seen for up to 10 times the number of weights as data samples. However, some signs of performance loss for the largest nets were evident after 10000 iterations. Note that after 5000 iterations, the networks were only half-trained (roughly 50% error on the training set). When they were 80-90% trained, the larger nets lost considerable ground. For instance, the 10x net (128 hidden units) lost performance from 40.5% to 28.1% during these iterations. It appears that the higher signal-to-noise of this example permitted performance gains for even higher overparametrization factors, but that the result was even more sensitive to training for too many iterations. 
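Drops like the 40.5% to 28.1% above can be judged against the roughly 1% standard deviation quoted in Section 2.1. A sketch of that normal-approximation error bar follows; the decision count of 640 (64 patterns x 10 averaged scores) and the success probability are assumptions on my part: \n\n```python
import math

def binomial_std(p_correct, n_decisions):
    # normal approximation to the binomial: std of an estimated proportion
    return math.sqrt(p_correct * (1.0 - p_correct) / n_decisions)

# ~640 decisions per smoothed score, success probability near 0.1
sigma = binomial_std(0.1, 640)   # about 0.012, i.e. roughly 1%
```
\nWith sigma near 1%, a 3% difference is about three standard deviations, consistent with the .001 significance level claimed in the text. \n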
\n\nTable II - Test (and training) scores: 1 cluster, SNR = 2.0 \n\n#hidden  #weights/      %Test (Train) Correct after N Iterations \nunits    #measurements    1000        2000        5000        10000 \n4        .31             18.1(8.4)    25.6(29.1)  32.2(29.8)  26.9(29.2) \n8        .62             22.5(12.8)   31.1(34.7)  34.5(44.5)  33.3(62.2) \n16       1.25            22.0(11.6)   33.4(32.8)  33.6(57.2)  29.4(78.3) \n32       2.50            25.6(13.3)   33.4(35.2)  39.4(51.1)  34.2(87.0) \n64       5.0             26.4(13.9)   36.1(35.0)  40.8(45.2)  33.6(86.9) \n128      10.0            26.9(12.0)   34.5(34.5)  40.5(47.2)  28.1(91.1) \n\nTable III shows the performance for a 4-cluster case, with SNR = 1.0. Small nets are omitted here, because earlier experiments showed this problem to be too hard. The best performance (21.1%) is for one of the larger nets at 2000 iterations, so that the degradation effect is not clearly visible for the undertrained case. At 10000 iterations, however, the larger nets do poorly. \n\nTable III - Test (and training) scores: 4 clusters, SNR = 1.0 \n\n#hidden  #weights/      %Test (Train) Correct after N Iterations \nunits    #measurements    1000        2000        5000        10000 \n32       2.50            13.8(12.7)   18.3(23.6)  15.8(38.8)   9.4(71.4) \n64       5.0             13.6(12.7)   18.4(23.6)  14.7(42.7)  18.8(71.6) \n96       7.5             15.3(13.0)   21.1(24.7)  15.9(45.5)  16.3(78.1) \n128      10.0            15.2(13.1)   19.1(23.8)  17.5(40.5)  10.5(70.9) \n\nFigure 1 illustrates this graphically. The \"undertrained\" case is relatively insensitive to the network size, as well as having the highest raw score. 
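The sensitivity to overtraining seen above is what motivates stopping, or slowing, training when an independent test score stops improving. A minimal sketch of such a loop, assuming a generic gradient function and validation scorer (the halving factor and stopping threshold are illustrative choices, not values from the paper): \n\n```python
def train_with_validation(w, grad_fn, val_score_fn, lr=0.25, max_iter=1000):
    # Reduce the adaptation constant whenever validation performance drops,
    # in the spirit of the cross-validation stopping rule of Section 3.
    best = val_score_fn(w)
    for _ in range(max_iter):
        w = w - lr * grad_fn(w)
        score = val_score_fn(w)
        if score < best:
            lr *= 0.5          # validation got worse: damp further adaptation
        else:
            best = score
        if lr < 1e-4:          # adaptation effectively stopped
            break
    return w

# toy usage: minimize (w - 3)^2, scoring by the negated loss
w_final = train_with_validation(0.0, lambda w: 2 * (w - 3), lambda w: -(w - 3) ** 2)
```
\nOn a real net, w would be the weight vector, grad_fn a back-propagation pass over the training set, and val_score_fn the classification rate on the held-out set. \n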
\n\n3 SPEECH RECOGNITION \n\n3.1 METHODS \n\nIn an ongoing project at ICSI and Philips, a German language data base consisting of 100 training and 100 test sentences (both from the same speaker) was used for training of a multi-layer perceptron (MLP) for recognition of phones at the frame level, as well as to estimate probabilities for use in the dynamic programming algorithm for a discrete Hidden Markov Model (HMM) [Bourlard & Wellekens, 1988; Bourlard et al., 1989]. Vector-quantized mel cepstra were used as binary input to a hidden layer. Multiple frames were used as input to provide context to the network. While the size of the output layer was kept fixed at 50 units, corresponding to the 50 phonemes to be recognized, the hidden layer was varied from 20 to 200 units, and the input context was kept fixed at 9 frames of speech. As the acoustic vectors were coded on the basis of 132 prototype vectors by a simple binary vector with only one bit 'on', the input field contained 9x132 = 1188 units, and the total number of possible inputs was thus equal to 132^9. There were 26767 training patterns and 26702 independent test patterns. Of course, this represented only a very small fraction of the possible inputs, and generalization was thus potentially difficult. Training was done by the classical \"error back-propagation\" algorithm, starting by minimizing an entropy criterion [Solla et al., 1988] and then the standard least-mean-square error (LMSE) criterion. In each iteration, the complete training set was presented, and the parameters were updated after each training pattern. \n\nTo avoid overtraining of the MLP (as was later demonstrated by the random vector experiment described above), improvement on the test set was checked after each iteration. 
If the classification rate on the test set was decreasing, the adaptation parameter of the gradient procedure was decreased; otherwise it was kept constant. In another experiment this approach was systematized by splitting the data into three parts: one for the training, one for the test, and a third one, absolutely independent of the training procedure, for validation. No significant difference was observed between classification rates for the test and validation data. \n\nOther than the obvious difference with the previous study (this used real data), it is important to note another significant point: in this case, we stopped iterating (by any one particular criterion) when that criterion was leading to no new test set performance improvement. While we had not yet done the simulations described above, we had observed the necessity for such an approach over the course of our speech research. We expected this to ameliorate the effects of overparameterization. \n\n3.2 RESULTS \n\nTable IV shows the variation in performance for 5, 20, 50, and 200 hidden units. The peak at 20 hidden units for test set performance, in contrast to the continued improvement in training set performance, can be clearly seen. However, the effect is certainly a mild one given the wide range in network size; using 10 times the number of weights as in the \"peak\" case only causes a degradation of 3.1%. Note, however, that for this experiment, the more sophisticated training procedure was used, which halted training when generalization started to degrade. \n\nFor comparison with classical approaches, results obtained with Maximum Likelihood (ML) and Bayes estimates are also given. In those cases, it is not possible to use contextual information, because the number of parameters to be learned would be 50 x 132^9 for the 9 frames of context. Therefore, the input field was restricted to a single frame. 
The number of parameters for these two last classifiers was then 50 x 132 = 6600, or a parameter/measurement ratio of .25. This restriction explains why the Bayes classifier, which is inherently optimal for a given pattern classification problem, is shown here as yielding a lower performance than the potentially suboptimal MLP. \n\nTable IV - Test Run: Phoneme Recognition on German data base \n\nhidden units   #parameters/#training numbers   training   test \n5              .23                             62.8       54.2 \n20             .93                             75.7       62.7 \n50             2.31                            73.7       60.6 \n200            9.3                             86.7       59.6 \nML             .25                             45.9       44.8 \nBayes          .25                             53.8       53.0 \n\n4 CONCLUSIONS \n\nWhile both studies show the expected effects of overparameterization (poor generalization, sensitivity to overtraining in the presence of noise), perhaps the most significant result is that it was possible to greatly reduce the sensitivity to the choice of network size by directly observing the network performance on an independent test set during the course of learning (cross-validation). If iterations are not continued past this point, fewer measurements are required. This only makes sense because of the interdependence of the learned parameters, particularly for the undertrained case. In any event, though, it is clear that adding parameters over the number required for discrimination is wasteful of resources. Networks which require many more parameters than there are measurements will certainly reach lower levels of peak performance than simpler systems. For at least the examples described here, it is clear that both the size of the MLP and the degree to which it should be trained are parameters which must be learned from experimentation with the data set. Further study might, 
perhaps, yield enough results to permit some rule of thumb dependent on properties of the data, but our current thinking is that these parameters should be determined dynamically by testing on an independent test set. \n\nReferences \n\nAkaike, H. (1974), \"A new look at the statistical model identification,\" IEEE Trans. Autom. Control, AC-19, 716-723 \n\nAkaike, H. (1986), \"Use of Statistical Models for Time Series Analysis,\" Vol. 4, Proc. IEEE Intl. Conference on Acoustics, Speech, and Signal Processing, Tokyo, 1986, pp. 3147-3155 \n\nBaum, E.B., & Haussler, D., (1988), \"What Size Net Gives Valid Generalization?\", Neural Computation, In Press \n\nBourlard, H., Morgan, N., & Wellekens, C.J., (1989), \"Statistical Inference in Multilayer Perceptrons and Hidden Markov Models, with Applications in Continuous Speech Recognition,\" NATO Advanced Research Workshop, Les Arcs, France \n\nFeldman, J.A., Fanty, M.A., and Goddard, N., (1988), \"Computing with Structured Neural Networks,\" Computer, Vol. 21, No. 3, pp. 91-104 \n\nPearl, J., (1978), \"On the Connection Between the Complexity and Credibility of Inferred Models,\" Int. J. General Systems, Vol. 4, pp. 155-164 \n\nRumelhart, D.E., Hinton, G.E., & Williams, R.J., (1986), \"Learning internal representations by error propagation,\" in Parallel Distributed Processing (D.E. Rumelhart & J.L. McClelland, Eds.), ch. 15, Cambridge, MA: MIT Press \n\nValiant, L.G., (1984), \"A theory of the learnable,\" Comm. ACM, Vol. 27, No. 11, pp. 1134-1142 \n\nWidrow, B., (1987), \"ADALINE and MADALINE,\" Plenary Speech, Vol. I, Proc. IEEE 1st Intl. Conf. on Neural Networks, San Diego, CA, 
143-158 \n\nFigure 1: Sensitivity to net size. [Plot: test % correct (5-25) vs. #hidden units (32, 64, 96, 128), after 2,000 iterations and after 10,000 iterations.] \n", "award": [], "sourceid": 275, "authors": [{"given_name": "N.", "family_name": "Morgan", "institution": null}, {"given_name": "H.", "family_name": "Bourlard", "institution": null}]}