{"title": "For Valid Generalization the Size of the Weights is More Important than the Size of the Network", "book": "Advances in Neural Information Processing Systems", "page_first": 134, "page_last": 140, "abstract": null, "full_text": "For valid generalization, the size of the \nweights is more important than the size \n\nof the network \n\nPeter L. Bartlett \n\nDepartment of Systems Engineering \n\nResearch School of Information Sciences and Engineering \n\nAustralian National University \n\nCanberra, 0200 Australia \nPeter .BartlettClanu .edu.au \n\nAbstract \n\nThis paper shows that if a large neural network is used for a pattern \nclassification problem, and the learning algorithm finds a network \nwith small weights that has small squared error on the training \npatterns, then the generalization performance depends on the size \nof the weights rather than the number of weights. More specifi(cid:173)\ncally, consider an i-layer feed-forward network of sigmoid units, in \nwhich the sum of the magnitudes of the weights associated with \neach unit is bounded by A. The misclassification probability con(cid:173)\nverges to an error estimate (that is closely related to squared error \non the training set) at rate O((cA)l(l+1)/2J(log n)jm) ignoring \nlog factors, where m is the number of training patterns, n is the \ninput dimension, and c is a constant. This may explain the gen(cid:173)\neralization performance of neural networks, particularly when the \nnumber of training examples is considerably smaller than the num(cid:173)\nber of weights. It also supports heuristics (such as weight decay \nand early stopping) that attempt to keep the weights small during \ntraining. 
1 Introduction

Results from statistical learning theory give bounds on the number of training examples that are necessary for satisfactory generalization performance in classification problems, in terms of the Vapnik-Chervonenkis dimension of the class of functions used by the learning system (see, for example, [13, 5]). Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networks perform well, if the weights are not too big, the size of the weights determines the generalization performance.

In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of {-1, 1}. It is easy to see that if the total squared error of a hypothesis on m examples is no more than mε, then on no more than mε/(1-α)² of these examples can the hypothesis have either the incorrect sign or magnitude less than α.

The next section gives misclassification probability bounds for hypotheses that are "distinctly correct" in this way on most examples. These bounds are in terms of a scale-sensitive version of the VC-dimension, called the fat-shattering dimension.
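The counting argument above can be checked numerically: any example with the wrong sign, or with |h(x)| < α, contributes at least (1-α)² to the squared error, so at most mε/(1-α)² such examples can exist. A minimal sketch on synthetic data (all names here are illustrative):

```python
import random

def margin_mistakes(outputs, labels, alpha):
    """Count examples that are wrongly signed or have magnitude below alpha."""
    return sum(1 for h, y in zip(outputs, labels)
               if h * y <= 0 or abs(h) < alpha)

random.seed(0)
m, alpha = 1000, 0.5
labels = [random.choice((-1, 1)) for _ in range(m)]
# Noisy real-valued outputs that roughly track the labels.
outputs = [y + random.gauss(0, 0.6) for y in labels]

sq_err = sum((h - y) ** 2 for h, y in zip(outputs, labels))
eps = sq_err / m                       # total squared error is m * eps
bound = m * eps / (1 - alpha) ** 2     # the counting bound from the text

assert margin_mistakes(outputs, labels, alpha) <= bound
```

The assertion holds for any outputs and labels, since each counted example forces a squared error of at least (1-α)².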
Section 3 gives bounds on this dimension for feedforward sigmoid networks, which imply the main results. The proofs are sketched in Section 4. Full proofs can be found in the full version [2].

2 Notation and bounds on misclassification probability

Denote the space of input patterns by X. The space of labels is {-1, 1}. We assume that there is a probability distribution P on the product space X × {-1, 1} that reflects both the relative frequency of different input patterns and the relative frequency of an expert's classification of those patterns. The learning algorithm uses a class of real-valued functions, called the hypothesis class H. A hypothesis h is correct on an example (x, y) if sgn(h(x)) = y, where sgn(a) : R → {-1, 1} takes value 1 iff a ≥ 0, so the misclassification probability (or error) of h is defined as

  er_P(h) = P{(x, y) ∈ X × {-1, 1} : sgn(h(x)) ≠ y}.

The crucial quantity determining misclassification probability is the fat-shattering dimension of the hypothesis class H. We say that a sequence x_1, ..., x_d of d points from X is shattered by H if functions in H can give all classifications of the sequence. That is, for all b = (b_1, ..., b_d) ∈ {-1, 1}^d there is an h in H satisfying sgn(h(x_i)) = b_i. The VC-dimension of H is defined as the size of the largest shattered sequence.¹ For a given scale parameter γ > 0, we say that a sequence x_1, ..., x_d of d points from X is γ-shattered by H if there is a sequence r_1, ..., r_d of real values such that for all b = (b_1, ..., b_d) ∈ {-1, 1}^d there is an h in H satisfying (h(x_i) - r_i)b_i ≥ γ. The fat-shattering dimension of H at γ, denoted fat_H(γ), is the size of the largest γ-shattered sequence. This dimension reflects the complexity of the functions in the class H, when examined at scale γ. Notice that fat_H(γ) is a nonincreasing function of γ. The following theorem gives generalization error bounds in terms of fat_H(γ).
A related result, which applies to the case of no errors on the training set, will appear in [12].

Theorem 1  Define the input space X, hypothesis class H, and probability distribution P on X × {-1, 1} as above. Let 0 < δ < 1/2 and 0 < γ < 1. Then, with probability 1 - δ over the training sequence (x_1, y_1), ..., (x_m, y_m) of m labelled examples, every hypothesis h in H satisfies

  er_P(h) < (1/m)|{i : |h(x_i)| < γ or sgn(h(x_i)) ≠ y_i}| + ε(γ, m, δ),

where

  ε²(γ, m, δ) = (2/m)(d ln(50em/d) log₂(1250m) + ln(4/δ)),            (1)

and d = fat_H(γ/16).

¹ In fact, according to the usual definition, this is the VC-dimension of the class of thresholded versions of functions in H.

2.1 Comments

It is informative to compare this result with the standard VC-bound. In that case, the bound on misclassification probability is

  er_P(h) < (1/m)|{i : sgn(h(x_i)) ≠ y_i}| + ((c/m)(d log(m/d) + log(1/δ)))^{1/2},

where d = VCdim(H) and c is a constant. We shall see in the next section that there are function classes H for which VCdim(H) is infinite but fat_H(γ) is finite for all γ > 0; an example is the class of functions computed by any two-layer neural network with an arbitrary number of parameters but constraints on the size of the parameters. It is known that if the learning algorithm and error estimates are constrained to make use of the sample only by considering the proportion of training examples that hypotheses misclassify, there are distributions P for which the second term in the VC-bound above cannot be improved by more than log factors. Theorem 1 shows that it can be improved if the learning algorithm makes use of the sample by considering the proportion of training examples that are correctly classified and have |h(x_i)| < γ.
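The complexity term of Theorem 1 is easy to evaluate numerically. A small sketch of ε(γ, m, δ) from Equation 1, treating d = fat_H(γ/16) as given:

```python
import math

def margin_bound(d, m, delta):
    """Second term of Theorem 1:
    eps^2 = (2/m) * (d * ln(50*e*m/d) * log2(1250*m) + ln(4/delta)),
    where d = fat_H(gamma/16)."""
    eps_sq = (2.0 / m) * (d * math.log(50 * math.e * m / d)
                          * math.log2(1250 * m)
                          + math.log(4 / delta))
    return math.sqrt(eps_sq)

# The bound decays roughly like sqrt(d/m), up to log factors.
for m in (10_000, 100_000, 1_000_000):
    print(m, margin_bound(d=50, m=m, delta=0.01))
```

As expected from the sqrt(d/m) behaviour, the printed values decrease as the sample size m grows with d fixed.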
It is possible to give a lower bound (see the full paper [2]) which, for the function classes considered here, shows that Theorem 1 also cannot be improved by more than log factors.

The idea of using the magnitudes of the values of h(x_i) to give a more precise estimate of the generalization performance was first proposed by Vapnik in [13] (and was further developed by Vapnik and co-workers). There it was used only for the case of linear hypothesis classes. Results in [13] give bounds on misclassification probability for a test sample, in terms of values of h on the training and test data. This result is extended in [11] to give bounds on misclassification probability (that is, for unseen data) in terms of the values of h on the training examples. This is further extended in [12] to more general function classes, to give error bounds that are applicable when there is a hypothesis with no errors on the training examples. Lugosi and Pintér [9] have also obtained bounds on misclassification probability in terms of similar properties of the class of functions containing the true regression function (the conditional expectation of y given x). However, their results do not extend to the case when the true regression function is not in the class of real-valued functions used by the estimator.

It seems unnatural that the quantity γ is specified in advance in Theorem 1, since it depends on the examples. The full paper [2] gives a similar result in which the statement is made uniform over all values of this quantity.

3 The fat-shattering dimension of neural networks

Bounds on the VC-dimension of various neural network classes have been established (see [10] for a review), but these are all at least linear in the number of parameters. In this section, we give bounds on the fat-shattering dimension for several neural network classes.
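Before turning to networks, the definition of γ-shattering from Section 2 can be made concrete: for a small finite class of functions and a small point set, one can search over sign patterns directly. A minimal sketch; the witness values r_i below are chosen by a simple midrange heuristic rather than an exhaustive search, so this is an illustration, not a general decision procedure:

```python
from itertools import product

def gamma_shattered(points, funcs, gamma):
    """Heuristic check that `funcs` gamma-shatters `points`: with witnesses
    r_i (here the midrange of observed values at each point), every sign
    pattern b must be realized by some h with (h(x_i) - r_i) * b_i >= gamma."""
    d = len(points)
    vals = [[h(x) for x in points] for h in funcs]   # one row per function
    r = [(min(col) + max(col)) / 2 for col in zip(*vals)]
    for b in product((-1, 1), repeat=d):
        if not any(all((v[i] - r[i]) * b[i] >= gamma for i in range(d))
                   for v in vals):
            return False
    return True

# Scalar linear functions h_w(x) = w * x with |w| <= 1: a single nonzero
# point is 1/2-shattered, but two points of the same sign are not, since no
# single w can realize the sign pattern (+1, -1) at both scales.
funcs = [lambda x, w=w: w * x for w in (-1.0, -0.5, 0.0, 0.5, 1.0)]
assert gamma_shattered([1.0], funcs, 0.5)
assert not gamma_shattered([1.0, 2.0], funcs, 0.5)
```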
We assume that the input space X is some subset of R^n. Define a sigmoid unit as a function from R^k to R, parametrized by a vector of weights w ∈ R^k. The unit computes x ↦ σ(x · w), where σ is a fixed bounded function satisfying a Lipschitz condition. (For simplicity, we ignore the offset parameter. It is equivalent to including an extra input with a constant value.) A multi-layer feed-forward sigmoid network of depth ℓ is a network of sigmoid units with a single output unit, which can be arranged in a layered structure with ℓ layers, so that the output of a unit passes only to the inputs of units in later layers. We will consider networks in which the weights are bounded. The relevant norm is the ℓ1 norm: for a vector w ∈ R^k, define ||w||₁ = Σ_{i=1}^k |w_i|. The following result gives a bound on the fat-shattering dimension of a (bounded) linear combination of real-valued functions, in terms of the fat-shattering dimension of the basis function class. We can apply this result in a recursive fashion to give bounds for single output feed-forward networks.

Theorem 2  Let F be a class of functions that map from X to [-M/2, M/2], such that 0 ∈ F and, for all f in F, -f ∈ F. For A > 0, define the class H of weight-bounded linear combinations of functions from F as

  H = { Σ_{i=1}^k w_i f_i : k ∈ N, f_i ∈ F, ||w||₁ ≤ A }.

Suppose γ > 0 is such that d = fat_F(γ/(32A)) ≥ 1. Then fat_H(γ) ≤ (cM²A²d/γ²) log²(MAd/γ), for some constant c.

Gurvits and Koiran [6] have shown that the fat-shattering dimension of the class of two-layer networks with bounded output weights and linear threshold hidden units is O((A²n²/γ²) log(n/γ)) when X = R^n. As a special case, Theorem 2 improves this result.
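For intuition about why an ℓ1 weight bound tames capacity: with ||w||₁ ≤ A and ||x||_∞ ≤ 1, every unit's argument lies in [-A, A], where a sigmoid such as tanh is close to the identity, so the whole network is close to a linear map. A minimal numerical sketch of this effect for a two-layer tanh network with a linear output unit (the architecture and constants are illustrative, not from the paper); it uses the elementary bound |tanh(z) - z| ≤ |z|³/3:

```python
import math
import random

def two_layer(x, V, w):
    """Two-layer tanh network: sum_j w_j * tanh(v_j . x)."""
    return sum(wj * math.tanh(sum(vi * xi for vi, xi in zip(v, x)))
               for v, wj in zip(V, w))

def linearized(x, V, w):
    """The same network with tanh replaced by the identity."""
    return sum(wj * sum(vi * xi for vi, xi in zip(v, x))
               for v, wj in zip(V, w))

random.seed(1)
n, k, A = 5, 8, 0.1          # input dim, hidden units, l1 weight bound
V = [[random.uniform(-A / n, A / n) for _ in range(n)] for _ in range(k)]
w = [random.uniform(-A / k, A / k) for _ in range(k)]

# With ||v_j||_1 <= A and ||x||_inf <= 1, each |v_j . x| <= A, so each
# unit deviates from linearity by at most A^3/3, and the output by at
# most ||w||_1 * A^3 / 3 <= A^4 / 3.
dev = max(abs(two_layer(x, V, w) - linearized(x, V, w))
          for x in ([random.choice((-1, 1)) for _ in range(n)]
                    for _ in range(200)))
assert dev <= A ** 4 / 3 + 1e-12
```

Shrinking A drives the network toward its linearization, matching the later remark that small-weight networks operate in the "linear part" of the sigmoid.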
Notice that the fat-shattering dimension of a function class is not changed by more than a constant factor if we compose the functions with a fixed function satisfying a Lipschitz condition (like the standard sigmoid function). Also, for X = R^n and H = {x ↦ x_i} we have fat_H(γ) ≤ log n for all γ. Finally, for H = {x ↦ w · x : w ∈ R^n} we have fat_H(γ) ≤ n for all γ. These observations, together with Theorem 2, give the following corollary. The Õ(·) notation suppresses log factors. (Formally, f = Õ(g) if f = o(g^{1+α}) for all α > 0.)

Corollary 3  If X ⊆ R^n and H is the class of two-layer sigmoid networks with the weights in the output unit satisfying ||w||₁ ≤ A, then fat_H(γ) = Õ(A²n/γ²). If X = {x ∈ R^n : ||x||_∞ ≤ B} and the hidden unit weights are also bounded, then fat_H(γ) = Õ(B²A⁶(log n)/γ⁴).

Applying Theorem 2 recursively gives the following result for deeper networks. Notice that there is no constraint on the number of hidden units in any layer, only on the total magnitude of the weights associated with each processing unit.

Corollary 4  For some constant c, if X ⊆ R^n and H is the class of depth ℓ sigmoid networks in which the weight vector w associated with each unit beyond the first layer satisfies ||w||₁ ≤ A, then fat_H(γ) = Õ(n(cA)^{ℓ(ℓ-1)}/γ^{2(ℓ-1)}). If X = {x ∈ R^n : ||x||_∞ ≤ B} and the weights in the first layer units also satisfy ||w||₁ ≤ A, then fat_H(γ) = Õ(B²(cA)^{ℓ(ℓ+1)}/γ^{2ℓ} log n).

In the first part of this corollary, the network has fat-shattering dimension similar to the VC-dimension of a linear network. This formalizes the intuition that when the weights are small, the network operates in the "linear part" of the sigmoid, and so behaves like a linear network.

3.1 Comments

Consider a depth ℓ sigmoid network with bounded weights.
The last corollary and Theorem 1 imply that if the training sample size grows roughly as B²A^{ℓ²}/ε², then the misclassification probability of a network is within ε of the proportion of training examples that the network classifies as "distinctly correct."

These results give a plausible explanation for the generalization performance of neural networks. If, in applications, networks with many units have small weights and small squared error on the training examples, then the VC-dimension (and hence number of parameters) is not as important as the magnitude of the weights for generalization performance.

It is possible to give a version of Theorem 1 in which the probability bound is uniform over all values of a complexity parameter indexing the function classes (using the same technique mentioned at the end of Section 2.1). For the case of sigmoid network classes, indexed by a weight bound, minimizing the resulting bound on misclassification probability is equivalent to minimizing the sum of a sample error term and a penalty term involving the weight bound. This supports the use of two popular heuristic techniques, weight decay and early stopping (see, for example, [8]), which aim to minimize squared error while maintaining small weights.

The techniques used here also give bounds on the fat-shattering dimension, and hence generalization performance, for any function class that can be expressed as a bounded number of compositions of either bounded-weight linear combinations or scalar Lipschitz functions with functions in a class that has finite fat-shattering dimension. This includes, for example, radial basis function networks.

4 Proofs

4.1 Proof sketch of Theorem 1

For A ⊆ S, where (S, ρ) is a pseudometric space, a set T ⊆ S is an ε-cover of A if for all a in A there is a t in T with ρ(t, a) < ε. We define N(A, ε, ρ) as the size of the smallest ε-cover of A. For x = (x_1, ..., x_m) ∈ X^m, define the pseudometric d_{l∞(x)} on the set of functions defined on X by d_{l∞(x)}(f, g) = max_i |f(x_i) - g(x_i)|. For a set A of functions, denote max_{x ∈ X^m} N(A, ε, d_{l∞(x)}) by N_∞(A, ε, m). Alon et al. [1] have obtained the following bound on N_∞ in terms of the fat-shattering dimension.

Lemma 5  For a class F of functions that map from {1, ..., n} to {1, ..., b} with fat_F(1) ≤ d,

  log₂ N_∞(F, 2, n) < 1 + log₂(nb²) log₂( Σ_{i=0}^d (n choose i) b^i ),

provided that n ≥ 1 + log₂( Σ_{i=0}^d (n choose i) b^i ).

For γ > 0 define π_γ : R → [-γ, γ] as the piecewise-linear squashing function satisfying π_γ(a) = γ if a ≥ γ, π_γ(a) = -γ if a ≤ -γ, and π_γ(a) = a otherwise. For a class H of real-valued functions, define π_γ(H) as the set of compositions of π_γ with functions in H.

Lemma 6  For X, H, P, δ, and γ as in Theorem 1,

  P^m { z : ∃h ∈ H, er_P(h) ≥ ((2/m) ln(4 N_∞(π_γ(H), γ/2, 2m)/δ))^{1/2} + (1/m)|{i : |h(x_i)| < γ or sgn(h(x_i)) ≠ y_i}| } < δ.

The proof of the lemma relies on the observation that

  P^m { z : ∃h ∈ H, er_P(h) ≥ ε + (1/m)|{i : |h(x_i)| < γ or sgn(h(x_i)) ≠ y_i}| }
    ≤ P^m { z : ∃h ∈ H, P(|π_γ(h(x)) - γy| ≥ γ) ≥ ε + (1/m)|{i : π_γ(h(x_i)) ≠ γy_i}| }.

We then use a standard symmetrization argument and the permutation argument introduced by Vapnik and Chervonenkis to bound this probability by the probability under a random permutation of a double-length sample that a related property holds. For any fixed sample, we can then use Pollard's approach of approximating the hypothesis class using a (γ/2)-cover, except that in this case the appropriate cover is with respect to the l∞ pseudometric. Applying Hoeffding's inequality gives the lemma.

To prove Theorem 1, we need to bound the covering numbers in terms of the fat-shattering dimension.
It is easy to apply Lemma 5 to a quantized version of the function class to get such a bound, taking advantage of the range constraint imposed by the squashing function π_γ.

4.2 Proof sketch of Theorem 2

For x = (x_1, ..., x_m) ∈ X^m, define the pseudometric d_{l1(x)} on the class of functions defined on X by d_{l1(x)}(f, g) = (1/m) Σ_{i=1}^m |f(x_i) - g(x_i)|. Similarly, define d_{l2(x)}(f, g) = ((1/m) Σ_{i=1}^m (f(x_i) - g(x_i))²)^{1/2}. If A is a set of functions defined on X, denote max_{x ∈ X^m} N(A, γ, d_{l1(x)}) by N₁(A, γ, m), and similarly for N₂(A, γ, m).

The idea of the proof of Theorem 2 is to first derive a general upper bound on an l1 covering number of the class H, and then apply the following result (which is implicit in the proof of Theorem 2 in [3]) to give a bound on the fat-shattering dimension.

Lemma 7  For a class F of [0, 1]-valued functions on X satisfying fat_F(4γ) ≥ d, we have log₂ N₁(F, γ, d) ≥ d/32.

To derive an upper bound on N₁(H, γ, m), we start with the bound that Lemma 5 implies on the l∞ covering number N_∞(F, γ, m) for the class F of hidden unit functions. Since d_{l2}(f, g) ≤ d_{l∞}(f, g), this implies the following bound on the l2 covering number for F (provided m satisfies the condition required by Lemma 5; it turns out that the theorem is trivial otherwise):

  log₂ N₂(F, γ, m) < 1 + d log₂(4emM/(dγ)) log₂(9mM²/γ²).            (2)

Next, we use the following result on approximation in l2, which A. Barron attributes to Maurey.

Lemma 8 (Maurey)  Suppose G is a Hilbert space and F ⊆ G has ||f|| ≤ b for all f in F. Let f be an element of the convex closure of F. Then for all k ≥ 1 and all c > b² - ||f||², there are functions {f_1, ..., f_k} ⊆ F such that ||f - (1/k) Σ_{i=1}^k f_i||² ≤ c/k.

This implies that any element of H can be approximated to a particular accuracy (with respect to l2) using a fixed linear combination of a small number of elements of F.
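Maurey's lemma can be observed empirically: the proof idea is that averaging k elements drawn i.i.d. from F according to the convex weights approximates the convex combination f at rate roughly b²/k in squared l2 distance. A sketch with random unit vectors in R^m standing in for the functions (all parameters here are illustrative):

```python
import random

def sample_approximation(F, coeffs, k, rng):
    """Average k i.i.d. draws from F, sampled according to the convex
    weights `coeffs`; by Maurey's argument the expected squared l2
    distance to the convex combination is at most b^2 / k."""
    m = len(F[0])
    picks = rng.choices(F, weights=coeffs, k=k)
    return [sum(p[i] for p in picks) / k for i in range(m)]

rng = random.Random(0)
m, N, k = 50, 20, 400
# Random "functions" as vectors in R^m with norm b = 1.
F = []
for _ in range(N):
    v = [rng.gauss(0, 1) for _ in range(m)]
    norm = sum(x * x for x in v) ** 0.5
    F.append([x / norm for x in v])
coeffs = [1.0 / N] * N
f = [sum(c * F[j][i] for j, c in enumerate(coeffs)) for i in range(m)]

avg = sample_approximation(F, coeffs, k, rng)
err = sum((a - x) ** 2 for a, x in zip(avg, f))
print(err)   # on the order of b^2 / k = 1/400
```

Increasing k shrinks the squared distance roughly linearly, which is the 1/k rate the lemma guarantees for some choice of f_1, ..., f_k.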
It follows that we can construct an l2 cover of H from the l2 cover of F; using Lemma 8 and Inequality 2 shows that

  log₂ N₂(H, γ, m) < (2M²A²/γ²)(1 + d log₂(8emMA/(dγ)) log₂(36mM²A²/γ²)).

Now, Jensen's inequality implies that d_{l1(x)}(f, g) ≤ d_{l2(x)}(f, g), which gives a bound on N₁(H, γ, m). Comparing with the lower bound given by Lemma 7 and solving for m gives the result. A more refined analysis for the neural network case involves bounding N₂ for successive layers, and solving to give a bound on the fat-shattering dimension of the network.

Acknowledgements

Thanks to Andrew Barron, Jonathan Baxter, Mike Jordan, Adam Kowalczyk, Wee Sun Lee, Phil Long, John Shawe-Taylor, and Robert Slaviero for helpful discussions and comments.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the 1993 IEEE Symposium on Foundations of Computer Science. IEEE Press, 1993.

[2] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Technical report, Department of Systems Engineering, Australian National University, 1996. (Available by anonymous ftp from syseng.anu.edu.au:pub/peter/TR96d.ps.)

[3] P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. Technical report, Australian National University and Princeton University, 1996.

[4] E. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.

[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929-965, 1989.

[6] L. Gurvits and P. Koiran.
Approximation and learning of convex superpositions. In Computational Learning Theory: EUROCOLT '95, 1995.

[7] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. Comput., 100(1):78-150, 1992.

[8] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.

[9] G. Lugosi and M. Pintér. A data-dependent skeleton estimate for learning. In Proc. 9th Annu. Conference on Comput. Learning Theory. ACM Press, New York, NY, 1996.

[10] W. Maass. Vapnik-Chervonenkis dimension of neural nets. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1000-1003. MIT Press, Cambridge, 1995.

[11] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimisation. In Proc. 9th Annu. Conference on Comput. Learning Theory. ACM Press, New York, NY, 1996.

[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. Technical report, 1996.

[13] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.