{"title": "Can neural networks do better than the Vapnik-Chervonenkis bounds?", "book": "Advances in Neural Information Processing Systems", "page_first": 911, "page_last": 917, "abstract": null, "full_text": "Can neural networks do better than the \n\nVapnik-Chervonenkis bounds? \n\nDavid Cohn \n\nDept. of Compo Sci. & Eng. \n\nUniversity of Washington \n\nSeattle, WA 98195 \n\nGerald Tesauro \n\nIBM Watson Research Center \n\nP.O. Box 704 \n\nYorktown Heights, NY 10598 \n\nAbstract \n\n\\Ve describe a series of careful llumerical experiments which measure the \naverage generalization capability of neural networks trained on a variety of \nsimple functions. These experiments are designed to test whether average \ngeneralization performance can surpass the worst-case bounds obtained \nfrom formal learning theory using the Vapnik-Chervonenkis dimension \n(Blumer et al., 1989). We indeed find that, in some cases, the average \ngeneralization is significantly better than the VC bound: the approach to \nperfect performance is exponential in the number of examples m, rather \nthan the 11m result of the bound. In other cases, we do find the 11m \nbehavior of the VC bound, and in these cases, the numerical prefactor is \nclosely related to prefactor contained in the bound. \n\n1 \n\nINTRODUCTION \n\nProbably the most important issue in the study of supervised learning procedures is \nthe issue of generalization, i.e., how well the learning system can perform on inputs \nnot seen during training. Significant progress in the understanding of generalization \nwas made in the last few years using a concept known as the Vapnik-Chervonenkis \ndimension, or VC-dimension. The VC-dimension provides a basis for a number of \npowerful theorems which establish worst-case bounds on the ability of arbitrary \nlearning systems to generalize (Blumer et al., 1989; Haussler et al., 1988). 
These theorems state that under certain broad conditions, the generalization error ε of a learning system with VC-dimension D trained on m random examples of an arbitrary function will with high confidence be no worse than a bound roughly of order D/m. The basic requirements for the theorems to hold are that the training and testing examples are generated from the same probability distribution, and that the learning system is able to correctly classify the training examples. \n\nUnfortunately, since these theorems do not calculate the expected generalization error but instead only bound it, the question is left open whether expected error might lie significantly below the bound. Empirical results of (Ahmad and Tesauro, 1988) indicate that in at least one case, average error was in fact significantly below the VC bound: the error decreased exponentially with the number of examples, ε ~ exp(-m/m0), rather than the 1/m result of the bound. Also, recent statistical learning theories (Tishby et al., 1989; Schwartz et al., 1990), which provide an analytic means of calculating expected performance, indicate that an exponential approach to perfect performance could be obtained if the spectrum of possible network generalizations has a \"gap\" near perfect performance. \n\nWe have addressed the issue of whether average performance can surpass worst-case performance through numerical experiments which measure the average generalization of simple neural networks trained on a variety of simple functions. Our experiments extend the work of (Ahmad and Tesauro, 1988). They test both the relevance of the worst-case VC bounds to average generalization performance, and the predictions of exponential behavior due to a gap in the generalization spectrum. 
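As a rough illustration of the distinction at stake (our own sketch, not from the paper; the values of D and m0 are invented), one can tabulate how an exponential learning curve eventually falls far below a bound of order D/m:

```python
import math

# Hypothetical values for illustration only: D is an assumed VC-dimension,
# m0 an assumed exponential scale constant.
D, m0 = 100.0, 200.0

for m in [200, 400, 800, 1600]:
    bound = D / m               # rough size of the worst-case VC-type bound
    expo = math.exp(-m / m0)    # exponential approach to perfect performance
    print(f"m={m:5d}  D/m={bound:.4f}  exp(-m/m0)={expo:.5f}")
```

For any fixed D and m0, the exponential curve drops below the 1/m-type bound once m is a few multiples of m0, and the gap then widens rapidly.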
\n\n2 EXPERIMENTAL METHODOLOGY \n\nTwo pairs of N-dimensional classification tasks were examined in our experiments: two linearly separable functions (\"majority\" and \"real-valued threshold\") and two higher-order functions (\"majority-XOR\" and \"threshold-XOR\"). Majority is a Boolean predicate in which the output is 1 if and only if more than half of the inputs are 1. The real-valued threshold function is a natural extension of majority to the continuous space [0,1]^N: the output is 1 if and only if the sum of the N real-valued inputs is greater than N/2. The majority-XOR function is a Boolean function where the output is 1 if and only if the N'th input disagrees with the majority computed by the first N-1 inputs. This is a natural extension of majority which retains many of its symmetry properties, e.g., the positive and negative examples are equally numerous and somewhat uniformly distributed. Similarly, threshold-XOR is a natural extension of the real-valued threshold function which maps [0,1]^(N-1) x {0,1} -> {0,1}. Here, the output is 1 if and only if the N'th input, which is binary, disagrees with the threshold function computed by the first N-1 real-valued inputs. Networks trained on these tasks used sigmoidal units and had standard feed-forward fully-connected structures with at most a single hidden layer. The training algorithm was standard back-propagation with momentum (Rumelhart et al., 1986). \n\nA simulator run consisted of training a randomly initialized network on a training set of m examples of the target function, chosen uniformly from the input space. Networks were trained until all examples were classified within a specified margin of the correct classification. Runs that failed to converge within a cutoff time of 50,000 epochs were discarded. 
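A minimal pure-Python sketch of the four target functions and of drawing a uniform training set (our own illustrative code; the function names are ours, and no claim is made about the original simulator):

```python
import random

random.seed(0)

def majority(x):             # 1 iff more than half of the binary inputs are 1
    return int(sum(x) > len(x) / 2)

def real_threshold(x):       # inputs in [0,1]; 1 iff their sum exceeds N/2
    return int(sum(x) > len(x) / 2)

def majority_xor(x):         # 1 iff the last (binary) input disagrees with
    return int(x[-1] != majority(x[:-1]))     # the majority of the first N-1

def threshold_xor(x):        # same, with a real-valued threshold on the first N-1
    return int(x[-1] != real_threshold(x[:-1]))

# m training examples of N-input majority, drawn uniformly as in the experiments
m, N = 5, 51
X = [[random.randint(0, 1) for _ in range(N)] for _ in range(m)]
y = [majority(x) for x in X]
```

The same sampling loop serves the real-valued tasks by drawing `random.random()` values for the first N-1 coordinates instead of bits.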
The generalization error of the resulting network was then estimated by testing on a set of 2048 novel examples independently drawn from the same distribution. The average generalization error for a given value of m was typically computed by averaging the results of 10-40 simulator runs, each with a different set of training patterns, test patterns, and random initial weights. A wide range of values of m was examined in this way in each experiment. \n\n2.1 SOURCES OF ERROR \n\nOur experiments were carefully controlled for a number of potential sources of error. Random errors due to the particular choice of random training patterns, test patterns, and initial weights were reduced to low levels by performing a large number of runs and varying each of these in each run. \n\nWe have also looked for systematic errors due to the particular values of learning rate and momentum constants, initial random weight scale, frequency of weight changes, training threshold, and training cutoff time. Within wide ranges of parameter values, we find no significant dependence of the generalization performance on the particular choice of any of these parameters except k, the frequency of weight changes. (However, the parameter values can affect the rate of convergence or probability of convergence on the training set.) Variations in k appear to alter the numerical coefficients of the learning curve, but not the overall functional form. \n\nAnother potential concern is the possibility of overtraining: even though the training set error should decrease monotonically with training time, the test set error might reach a minimum and then increase with further training. We have monitored hundreds of simulations of both the linearly separable and higher-order tasks, and find no significant overtraining in either case. 
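The averaging over repeated runs can be sketched as follows (our own code; the per-run error values are invented, and we use a normal approximation for the 95% interval):

```python
import math

# Invented per-run generalization errors for a single value of m
run_errors = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.13]

n = len(run_errors)
mean = sum(run_errors) / n
var = sum((e - mean) ** 2 for e in run_errors) / (n - 1)  # sample variance
sem = math.sqrt(var / n)                                  # std. error of the mean
ci95 = 1.96 * sem   # normal approximation; a t-quantile would be more exact

print(f"mean generalization error {mean:.3f} +/- {ci95:.3f}")
```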
\n\nOther aspects of the experimental protocol which could affect measured results include order of pattern presentation, size of test set, testing threshold, and choice of input representation. We find that presenting the patterns in a random order as opposed to a fixed order improves the probability of convergence, but does not alter the average generalization of runs that do converge. Changing the criterion by which a test pattern is judged correct alters the numerical prefactor of the learning curve but not the functional form. Using test sets of 4096 patterns instead of 2048 patterns has no significant effect on measured generalization values. Finally, convergence is faster with a [-1,1] coding scheme than with a [0,1] scheme, and generalization is improved, but only by numerical constants. \n\n2.2 ANALYSIS OF DATA \n\nTo determine the functional dependence of measured generalization error ε on the number of examples m, we apply the standard curve-fitting technique of performing linear regression on the appropriately transformed data. Thus we can look for an exponential law ε = A exp(-m/m0) by plotting log(ε) vs. m and observing whether the transformed data lies on a straight line. We also look for a polynomial law of the form ε = B/(m + a) by plotting 1/ε vs. m. We have not attempted to fit to a more general polynomial law because this is less reliable, and because theory predicts a 1/m law. \n\nBy plotting each experimental curve in both forms, log(ε) vs. m and 1/ε vs. m, we can determine which model provides a better fit to the data. This can be done both visually and more quantitatively by computing the linear correlation coefficient r^2 in a linear least-squares fit. To the extent that one of the curves has a higher value of r^2 than the other one, we can say that it provides a better model of the data than the other functional form. 
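The two linearized fits can be sketched as follows (our own code on synthetic data drawn from an exact exponential law; the helper `linfit` is ours, not from the paper):

```python
import math

def linfit(xs, ys):
    """Least-squares line fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy * sxy / (sxx * syy)

# Synthetic learning curve generated from eps = A * exp(-m/m0)
ms = [200, 300, 400, 500, 600]
eps = [0.5 * math.exp(-m / 250.0) for m in ms]

_, _, r2_exp = linfit(ms, [math.log(e) for e in eps])   # log(eps) vs. m
_, _, r2_poly = linfit(ms, [1.0 / e for e in eps])      # 1/eps vs. m

print(r2_exp, r2_poly)   # the model with r^2 nearer 1 fits better
```

On data that truly follows the exponential law, the log-transformed fit is exactly linear, so its r^2 is essentially 1, while the 1/ε fit falls visibly short.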
\n\nWe have also developed the following technique to assess absolute goodness-of-fit. We generate a set of artificial data points by adding noise equivalent to the error bars on the original data points to the best-fit curve obtained from the linear regression. Regression on the artificial data set yields a value of r^2, and repeating this process many times gives a distribution of r^2 values which should approximate the distribution expected with the amount of noise in our data. By comparing the value of r^2 from our original data to this generated distribution, we can estimate the probability that our functional model would produce data like that we observed. \n\n3 EXPERIMENTS ON LINEARLY-SEPARABLE FUNCTIONS \n\nNetworks with 50 inputs and no hidden units were trained on majority and real-valued threshold functions, with training set sizes ranging from m = 40 to m = 500 in increments of 20 patterns. Twenty networks were trained for each value of m. A total of 3.8% of the binary majority and 7.7% of the real-valued threshold simulation runs failed to converge and were discarded. \n\nThe data obtained from the binary majority and real-valued threshold problems was tested for fit to the exponential and polynomial functional models, as shown in Figure 1. The binary majority data had a correlation coefficient of r^2 = 0.982 in the exponential fit; this was better than 40% of the \"artificial\" data sets described previously. However, the polynomial fit only gave a value of r^2 = 0.966, which was better than only 6% of the artificial data sets. We conclude that the binary majority data is consistent with an exponential law and not with a 1/m law. \n\nThe real-valued threshold data, however, behaved in the opposite manner. The exponential fit gave a value of r^2 = 0.943, which was better than only 14% of the artificial data sets. 
However, the polynomial fit gave a value of r^2 = 0.996, which was better than 40% of the artificial data sets. We conclude that the real-valued threshold data closely approximates a 1/m law and was not likely to have been generated by an exponential law. \n\n4 EXPERIMENTS ON HIGHER-ORDER FUNCTIONS \n\nFor the majority-XOR and threshold-XOR problems, we used N = 26 input units: 25 for the \"majority\" (or threshold) and a single \"XOR\" unit. In theory, these problems can be solved with only two hidden units, but in practice, at least three hidden units were needed for reliable convergence. Training set sizes ranging from m = 40 to m = 1000 in increments of 20 were studied for both tasks. At each value of m, 40 simulations were performed. Of the 1960 simulations, 1702 of the binary and 1840 of the real-valued runs converged. No runs in either case achieved a perfect score on the test data. \n\nWith both sets of runs, there was a visible change in the shape of the generalization curve when the training set size reached 200 samples. We are interested primarily in the asymptotic behavior of these curves, so we restricted our analysis to sample sizes 200 and above. As with the single-layer problems, we measured goodness of fit to appropriately linearized forms of the exponential and polynomial curves in question. Results are plotted in Figure 2. \n\nFigure 1: Observed generalization curves for binary majority and real-valued threshold, and their fit to the exponential and polynomial models. Error bars denote 95% confidence intervals for the mean. \n\nIt appears that the generalization curve of the threshold-XOR problem is not likely to have been generated by an exponential, but is a plausible 1/m polynomial. The correlation coefficient in the exponential fit is only r^2 = 0.959 (better than only 10% of the artificial data sets), but in the polynomial fit is r^2 = 0.997 (better than 32% of the artificial data sets). \n\nThe binary majority-XOR data, however, appears both visually and from the relative r^2 values to fit the exponential model better than the polynomial model. In the exponential fit, r^2 = 0.994, while in the polynomial fit, r^2 = 0.940. However, we are somewhat cautious because the artificial data test is inconclusive. The exponential fit is better than 40% of artificial data sets, but the polynomial fit is better than 60% of artificial data sets. Also, there appears to be a small component of the curve that is slower than a pure exponential. 
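The artificial-data goodness-of-fit test of Section 2.2 can be sketched as follows (our own illustration; the fitted line, noise level, and observed r^2 are all invented, and the r^2 helper is just the standard least-squares formula):

```python
import math
import random

def r_squared(xs, ys):
    """Squared linear correlation coefficient of a least-squares fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)

random.seed(0)
ms = list(range(200, 1001, 50))

# Invented best-fit line for log(eps) vs. m, and an invented noise level
slope, intercept, sigma = -0.004, -0.5, 0.3

# Distribution of r^2 over artificial data sets drawn from the fitted model
r2_dist = []
for _ in range(1000):
    ys = [slope * m + intercept + random.gauss(0.0, sigma) for m in ms]
    r2_dist.append(r_squared(ms, ys))

r2_observed = 0.97   # invented r^2 of the "real" data
frac = sum(r2 < r2_observed for r2 in r2_dist) / len(r2_dist)
print(f"observed fit is better than {100 * frac:.0f}% of artificial data sets")
```

A `frac` near 0 or 1 would signal that the observed r^2 is atypical of the assumed model plus noise; intermediate values indicate consistency.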
\n\n5 COMPARISON TO THEORY \n\nFigure 3 plots our data for both the first-order and higher-order tasks compared to the theoretical error bounds of (Blumer et al., 1989) and (Haussler et al., 1988). In the higher-order case we have used the total number of weights as an estimate of the VC-dimension, following (Baum and Haussler, 1989). (Even with this low estimate, the bound of (Blumer et al., 1989) lies off the scale.) All of our experimental curves fall below both bounds, and in each case the binary task does asymptotically better than the corresponding real-valued task. \n\nFigure 2: Generalization curves for 26-3-1 nets trained on majority-XOR and threshold-XOR, and their fit to the exponential and polynomial models. \n\nOne should note that the bound in 
\n\n(Haussler et al., 1988) fits the real-valued data to within a small numerical constant. \nHowever, strictly speaking it may not apply to our experiments because it is for \nBayes-optimal learning algorithms, and we do not know whether back-propagation \nis Bayes-optimal \n\n6 CONCLUSIONS \n\nWe have seen that two problems using strict binary inputs (majority and majority(cid:173)\nXOR) exhibited distinctly exponential generalization with increasing training set \nsize. This indicates that there exists a class of problems that is asymptotically \nmuch easier to learn than others of the same VC-dimension. This is not only of \ntheoretical interest, but it also hac; potential bearing on what kinds of large-scale \napplications might be tractable with network learning methods. On the other hand, \nmerely by making the inputs real instead of binary, we found average error curves \nlying close to the theoretical bounds. This indicates that the worst-cage bounds \nmay be more relevant to expected performance than has been previously realized. \n\nIt is interesting that the statistical theories of (Tishby et al., 1989; Schwartz et \nal, 1990) predict the two classes of behavior seen in our experiments. Our future \nresearch will focus on whether or not there is a \"gap\" as suggested by these theories. \nOur preliminary findings for majority suggest that there is in fact no gap, except \npossibly an \"inductive gap\" in which the learning process for some reason tends \nto avoid the near-perfect solutions. If such an inductive gap does not exist, then \neither the theory does not apply to back-propagation, or it must have some other \nmechanism to generate the exponential behavior. \n\n\fCan Neural Networks do Better Than the Vapnik-Chervonenkis Bounds? \n\n917 \n\n, ....................................................................... .. 
\n\nFigure 3: (a) The real-valued threshold problem performs roughly within a constant factor of the upper bounds predicted in (Blumer et al., 1989) and (Haussler et al., 1988), while the binary majority problem performs asymptotically better. (b) The threshold-XOR problem performs roughly within a constant factor of the predicted bound, while majority-XOR performs asymptotically better. \n\nReferences \n\nS. Ahmad and G. Tesauro. (1988) Scaling and generalization in neural networks: a case study. In D. S. Touretzky et al., eds., Proceedings of the 1988 Connectionist Models Summer School, 3-10, Morgan Kaufmann. \n\nE. B. Baum and D. Haussler. (1989) What size net gives valid generalization? Neural Computation 1(1):151-160. \n\nA. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. (1989) Learnability and the Vapnik-Chervonenkis dimension. JACM 36(4):929-965. \n\nD. Haussler, N. Littlestone, and M. Warmuth. (1990) Predicting {0,1}-functions on randomly drawn points. Tech Report UCSC-CRL-90-54, Univ. of California at Santa Cruz, CA. \n\nD. E. Rumelhart, G. E. Hinton and R. J. Williams. (1986) Learning internal representations by error propagation. 
In Parallel Distributed Processing, 1:318-362, MIT Press. \n\nD. B. Schwartz, V. K. Samalam, S. A. Solla and J. S. Denker. (1990) Exhaustive learning. Neural Computation 2:374-385. \n\nN. Tishby, E. Levin and S. A. Solla. (1989) Consistent inference of probabilities in layered networks: Predictions and generalizations. In IJCNN Proceedings, 2:403-409, IEEE. \n", "award": [], "sourceid": 401, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": null}]}