{"title": "MLP Can Provably Generalize Much Better than VC-bounds Indicate", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "MLP can provably generalise much better \n\nthan VC-bounds indicate. \n\nA. Kowalczyk and H. Ferra \nTelstra Research Laboratories \n\n770 Blackburn Road, Clayton, Vic. 3168, Australia \n\n({ a.kowalczyk, h.ferra}@trl.oz.au) \n\nAbstract \n\nResults of a study of the worst case learning curves for a partic(cid:173)\nular class of probability distribution on input space to MLP with \nhard threshold hidden units are presented. It is shown in partic(cid:173)\nular, that in the thermodynamic limit for scaling by the number \nof connections to the first hidden layer, although the true learning \ncurve behaves as ~ a-I for a ~ 1, its VC-dimension based bound \nis trivial (= 1) and its VC-entropy bound is trivial for a ::; 6.2. It \nis also shown that bounds following the true learning curve can be \nderived from a formalism based on the density of error patterns. \n\n1 \n\nIntroduction \n\nThe VC-formalism and its extensions link the generalisation capabilities of a binary \nvalued neural network with its counting function l , e.g. via upper bounds implied by \nVC-dimension or VC-entropy on this function [17, 18]. For linear perceptrons the \ncounting function is constant for almost every selection of a fixed number of input \nsamples [2], and essentially equal to its upper bound determined by VC-dimension \nand Sauer's Lemma. However, in the case for multilayer perceptrons (MLP) the \ncounting function depends essentially on the selected input samples. For instance, \nit has been shown recently that for MLP with sigmoidal units although the largest \nnumber of input samples which can be shattered, Le. 
the VC-dimension, equals O(w²) [6], there is always a non-zero probability of finding a (2w + 2)-element input sample which cannot be shattered, where w is the number of weights in the network [16]. In the case of MLP using Heaviside rather than sigmoidal activations (McCulloch-Pitts neurons), a similar claim can be made: the VC-dimension is O(w₁ log₂ h₁) [13, 15], where w₁ is the number of weights to the first hidden layer of h₁ units, but there is a non-zero probability of finding a sample of size w₁ + 2 which cannot be shattered [7, 8].\n\n¹Known also as the partition function in computational learning theory.\n\nThe results on these \"hard to shatter samples\" for the two MLP types differ significantly in terms of the techniques used for their derivation. For the sigmoidal case the result is \"existential\" (based on recent advances in \"model theory\"), while in the Heaviside case the proofs are constructive, defining a class of probability distributions from which \"hard to shatter\" samples can be drawn randomly; the results in this case are also more explicit in that a form for the counting function may be given [7, 8].\n\nCan the existence of such hard to shatter samples be essential for the generalisation capabilities of MLP? Can they be an essential factor in improving theoretical models of generalisation? In this paper we show that, at least for the McCulloch-Pitts case with specific (continuous) probability distributions on the input space, the answer is \"yes\". We estimate \"directly\" the real learning curve in this case and show that its bounds based on VC-dimension or VC-entropy are loose in the low learning sample regime (for training samples having fewer than 12 × w₁ examples), even for the linear perceptron. We also show that a modification of the VC-formalism given in [9, 10] provides a significantly better bound. 
This latter part is a more rigorous and formal extension and re-interpretation of some results in [11, 12]. All the results are presented in the thermodynamic limit, i.e. for MLP with w₁ → ∞ and training sample size increasing proportionally, which simplifies their mathematical form.\n\n2 Overview of the formalism\n\nOn a sample space X we consider a class H of binary functions h : X → {0, 1} which we shall call a hypothesis space. Further we assume that we are given a probability distribution μ on X and a target concept t : X → {0, 1}. The quadruple L = (X, μ, H, t) will be called a learning system.\n\nIn the usual way, with each hypothesis h ∈ H we associate the generalisation error ε_h ≝ E_X[|t(x) − h(x)|] and the training error ε_{h,x̄} ≝ (1/m) Σ_{i=1}^m |t(x_i) − h(x_i)| for any training m-sample x̄ = (x₁, ..., x_m) ∈ X^m.\n\nGiven a learning threshold 0 ≤ λ ≤ 1, let us introduce an auxiliary random variable ε_λ^max(x̄) ≝ max{ε_h ; h ∈ H & ε_{h,x̄} ≤ λ} for x̄ ∈ X^m, giving the worst generalisation error of all hypotheses with training error ≤ λ on the m-sample x̄ ∈ X^m.² The basic objects of interest in this paper are the learning curves³ defined as\n\nε_λ^wc(m) ≝ E_{X^m}[ε_λ^max(x̄)].\n\n²In this paper max(S), where S ⊂ R, denotes the maximal element in the closure of S, or ∞ if no such element exists. Similarly we understand min(S).\n\n³Note that our learning curve is determined by the worst generalisation error of acceptable hypotheses and in this respect differs from the \"average generalisation error\" learning curves considered elsewhere, e.g. [3, 5].\n\n2.1 Thermodynamic limit\n\nNow we introduce the thermodynamic limit of the learning curve. The underlying idea of such asymptotic analysis is to capture the essential features of learning systems of very large size. 
Mathematically it turns out that in the thermodynamic limit the functional forms of learning curves simplify significantly and analytic characterisations of these are possible.\n\nWe are given a sequence of learning systems, or shortly, L_N = (X_N, μ_N, H_N, t_N), N = 1, 2, ..., and a scaling N ↦ τ_N ∈ R₊ with the property τ_N → ∞; the scaling can be thought of as a measure of the size (complexity) of a learning system, e.g. the VC-dimension of H_N. The thermodynamic limit of scaled learning curves is defined for α > 0 as follows⁴:\n\nε_{λ,∞}^wc(α) ≝ lim sup_{N→∞} ε_{λ,N}^wc(⌊α τ_N⌋).   (1)\n\nHere, and below, the additional subscript N refers to the N-th learning system.\n\n2.2 Error pattern density formalism\n\nThis subsection briefly presents a thermodynamic version of a modified VC-formalism discussed previously in [9]; more details and proofs can be found in [10]. The main innovation of this approach comes from splitting error patterns into error shells and using estimates on the size of these error shells rather than on the total number of error patterns. We shall see in the examples discussed in the following section that this improves results significantly.\n\nThe space {0, 1}^m of all binary m-vectors naturally splits into m + 1 error pattern shells E_i^m, i = 0, 1, ..., m, with the i-th shell composed of all vectors with exactly i entries equal to 1. For each h ∈ H and x̄ = (x₁, ..., x_m) ∈ X^m, let v_h(x̄) ∈ {0, 1}^m denote a vector (error pattern) having 1 in the j-th position if and only if h(x_j) ≠ t(x_j). As the i-th error shell has (m choose i) elements, the average error pattern density falling into this error shell is\n\nΔ_i^m ≝ (m choose i)⁻¹ E_{X^m}[#({v_h(x̄) ; h ∈ H} ∩ E_i^m)]   (i = 0, 1, ..., m),   (2)\n\nwhere # denotes the cardinality of a set⁵. 
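The shell decomposition can be made concrete with a small enumeration. The sketch below is our own illustration, not the paper's construction: it counts, for the class PC(d) of piecewise constant functions introduced in Section 3.2, against the constant target t ≡ 0, how many error patterns of each shell E_i^m are realisable on m ordered sample points. The "cut cost" model (one discontinuity per block edge not touching a sample endpoint) is an assumption of this sketch.

```python
from itertools import product
from math import comb

m, d = 10, 4  # toy sample size and discontinuity budget

def cuts_needed(p):
    """Discontinuities a PC function needs to be wrong exactly on the 1-entries
    of pattern p (target identically 0): each maximal block of 1s costs one cut
    per edge that does not touch an endpoint of the sample (our assumption)."""
    c, j = 0, 0
    while j < m:
        if p[j] == 1:
            k = j
            while k < m and p[k] == 1:
                k += 1
            c += (0 if j == 0 else 1) + (0 if k == m else 1)
            j = k
        else:
            j += 1
    return c

# Count realisable error patterns per shell E_i^m.
shells = [0] * (m + 1)
for p in product((0, 1), repeat=m):
    if cuts_needed(p) <= d:
        shells[sum(p)] += 1

for i, n in enumerate(shells):
    print(f"shell i={i}: {n} of {comb(m, i)} patterns realisable")
```

For small i every pattern is realisable (density 1), while the middle shells are exponentially sparse; this sparsity of the middle shells is exactly what the error pattern density formalism exploits.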
Theorem 1. Given a sequence of learning systems L_N = (X_N, μ_N, H_N, t_N), a scaling τ_N and a function φ : R₊ × (0, 1) → R₊ such that\n\nln Δ_{i,N}^m ≤ −τ_N φ(m/τ_N, i/m) + o(τ_N)   (3)\n\nfor all m, N = 1, 2, ... and 0 ≤ i ≤ m, we have\n\nε_{λ,∞}^wc(α) ≤ ε_{λ,β}(α)   (4)\n\nfor any 0 ≤ λ ≤ 1 and α, β > 0, where\n\nε_{λ,β}(α) ≝ max{ε ∈ (0, 1) ; ∃ 0 ≤ y ≤ λ, ε ≤ x ≤ 1 : α(H(y) + βH(x)) − φ(α + αβ, (y + βx)/(1 + β)) ≥ 0}\n\nand H(y) ≝ −y ln y − (1 − y) ln(1 − y) denotes the entropy function.\n\n⁴We recall that ⌊x⌋ denotes the largest integer ≤ x and that lim sup_{N→∞} x_N is defined as lim_{N→∞} of the monotonic sequence N ↦ sup{x_n ; n ≥ N}. Note that in contrast to the ordinary limit, lim sup always exists.\n\n⁵Note the difference to the concept of error shells used in [4], which are partitions of the finite hypothesis space H according to the generalisation error values. Both formalisms are related though, and the central result in [4], Theorem 4, can be derived from our Theorem 1.\n\n3 Main results: applications of the formalism\n\n3.1 VC-bounds\n\nWe consider a learning sequence L_N = (X_N, μ_N, H_N, t_N), t_N ∈ H_N (realisable case), and the scaling of this sequence by VC-dimension [17], i.e. we assume τ_N = d_VC(H_N) → ∞. For λ = 0 (consistent learning case) the following bound for the N-th learning system can be derived [1, 17]:\n\nε_{0,N}^wc(m) ≤ min(1, (2/m)(d_VC(H_N) log₂(2em/d_VC(H_N)) + 1)).   (5)\n\nIn the thermodynamic limit, i.e. as N → ∞, we get for any α > 1/e\n\nε_{0,∞}^wc(α) ≤ min(1, 2 log₂(2eα)/α).   (6)\n\nNote that this bound is independent of the probability distributions μ_N. 
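Bound (6) is easy to evaluate numerically; the sketch below (our illustration, not code from the paper) shows where the thermodynamic VC-bound stops being trivial, which is the "fewer than 12 × w₁ examples" regime mentioned in the introduction:

```python
from math import log2, e

def vc_bound(alpha):
    """Thermodynamic VC-bound (6): min(1, 2*log2(2*e*alpha)/alpha), alpha > 1/e."""
    return min(1.0, 2 * log2(2 * e * alpha) / alpha)

# The bound stays trivial (= 1) up to roughly alpha = 12 examples per unit of
# VC-dimension, and decays only like log(alpha)/alpha afterwards.
for alpha in (1, 6, 12, 13, 24, 48):
    print(f"alpha = {alpha:>2}: bound = {vc_bound(alpha):.3f}")
```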
3.2 Piecewise constant functions\n\nLet PC(d) denote the class of piecewise constant binary functions on the unit segment [0, 1) with up to d ≥ 0 discontinuities and with their values defined as 1 at all these discontinuity points. We consider here the learning sequence L_N = ([0, 1), μ_N, PC(d_N), t_N), where μ_N is any continuous probability distribution on [0, 1), d_N is a monotonic sequence of positive integers diverging to ∞, and the targets t_N ∈ PC(d_{t_N}) are such that the limit δ_t ≝ lim_{N→∞} d_{t_N}/d_N exists. (Without loss of generality we can assume that all μ_N are the uniform distribution on [0, 1).)\n\nFor this learning sequence the following can be established.\n\nClaim 1. The function defined for α > 1 and 0 ≤ x ≤ 1 as\n\nφ(α, x) ≝ −α(1 − x) H(1/(2α(1 − x))) − αx H(1/(2αx)) + α H(x)   for 2αx(1 − x) > 1,\n\nand as 0 otherwise, satisfies assumption (3) with respect to the scaling τ_N ≝ d_N.\n\nClaim 2. The following two-sided bound on the learning curve holds:\n\n((1 − δ_t)/(2α))(1 + ln(2α/(1 − δ_t))) ≤ ε_{0,∞}^wc(α) ≤ ((1 + δ_t)/(2α))(1 + ln(2α/(1 + δ_t))).   (7)\n\nWe outline the main steps of the proof of these two claims now.\n\nFor Claim 1 we start with a combinatorial argument establishing that, in the particular case of a constant target,\n\nΔ_i^{m,N} = (m choose i)⁻¹ Σ_{j=0}^{⌊d_N/2⌋} (m − i − 1 choose j − 1)(i − 1 choose j − 1)   for d_N + d_{t_N} < min(2i, 2(m − i)),\n\nand Δ_i^{m,N} = 1 otherwise. Next we observe that the asymptotics of the above sum follows from the entropy approximation of the binomial coefficients. This easily gives Claim 1 for the constant target (δ_t = 0). Now we observe that this particular case gives an upper bound for the general case (of a non-constant target) if we use the \"effective\" number of discontinuities d_N + d_{t_N} instead of d_N.\n\nFor Claim 2 we start with the estimate [12, 11]\n\nε_{0,N}^wc(m) ≃ (d_N/(2m))(1 + ln(2m/d_N)),\n\nderived from the Mauldon result [14] for the constant target t_N = const, m ≥ d_N. This implies immediately the expression\n\nε_{0,∞}^wc(α) = (1/(2α))(1 + ln(2α))   (8)\n\nfor the constant target, which extends to the estimate (7) with a straightforward lower and upper bound on the \"effective\" number of discontinuities in the case of a non-constant target.\n\n3.3 Link to multilayer perceptrons\n\nLet MLP^n(w₁) denote the class of functions from R^n to {0, 1} which can be implemented by a multilayer perceptron (feedforward neural network) with ≥ 1 hidden layers, with w₁ connections to the first hidden layer, and with the first hidden layer composed entirely of fully connected linear threshold logic units (i.e. units able to implement any mapping of the form (x₁, ..., x_n) ↦ θ(a₀ + Σ_{i=1}^n a_i x_i) for a_i ∈ R). It can be shown from the properties of the Vandermonde determinant (c.f. [7, 8]) that if γ : [0, 1) → R^n is a mapping with coordinates composed of linearly independent polynomials (the generic situation) of degree ≤ n, then the family of functions x ↦ h(γ(x)), h ∈ MLP^n(w₁), coincides with PC(w₁).   (9)\n\nThis implies immediately that all results for learning the class of PC functions in Section 3.2 are applicable (with obvious modifications) to this class of multilayer perceptrons with probability distributions concentrated on the 1-dimensional curves of the form γ([0, 1)) with γ as above.\n\nHowever, we can go a step further. We can extend such a distribution to a continuous distribution on R^n with support \"sufficiently close\" to the curve γ([0, 1)), 
\n\nVC Entropy \nEPD,0,=0.2 \nEPD,o,= O.O \nTC+, 0,=0.2 \nTCO,o,=O.O \nTC-, 0, = 0.2 \n\n0.0 \n\n10.0 \n\n20.0 \n\nscaled training sample size (a) \n\nFigure 1: Plots of different estimates for thermodynamic limit of learning curves for \nthe sequence of multilayer perceptrons as in Claim 3 for consistent learning (A = 0). \nEstimates on true learning curve from (7) are for 8t = 0 ('TCO') and 8t = 0.2 ('TC+' \nand 'TC-' for the upper and lower bound, respectively) . Two upper bounds of the \nform (4) from the modified VC-formalism for r.p as in Claim 1 and f3 = 1 are plotted \nfor 8t = 0.0 and 8t = 0.2 (marked EPD). For comparison, we plot also the bound \n\n(10) based on the VC-entropy; VC bound (5) being trivial for this scaling, = 1, c.f. \n\nCorollary 2, is not shown. \n\nwith changes to the error pattern densities /)/fN, the learning curves, etc., as small \nas desired. This observa.tion implies the follo~ing result: \nClaim 3 For any sequence of multilayer perceptrons, M LpnN (WIN), WIN ~ \nthere exists a sequence of continuous probability distributions J.1.N on \n00, \nR nN with properties as \nFor any sequence of targets tN E \nM LpnN (WltN)' both Claim 1 and Claim 2 of Section 3.2 hold for the learn-\ning sequence (RnN, J.1.N, M LpnN (WIN), tN) with scaling TN d~ nIN and 8t = \nIn particular bound (4) on the learning curve holds for r.p \nlimN-+oo WltN IWIN. \nas in Claim 1. \n\nfollows. \n\nCorollary 2 If additionally the number of units in first hidden layer 1llN ~ 00, \nthen the thermodynamic limit of VC-bound (5) with respect to the scaling TN = \nWIN is trivial, i.e. = 1 for all a > O. \n\ntt'lN \n\n-\n\nProof. The bound (5) is trivial for m ~ 12dN, where dN d~ dvc(M LpnN (WltN )). \nAs dN = O(WIN IOg2(1lIN)) [13, 15] for any continuous probability on the input \nspace, this bound is trivial for any a = ~ < 12..4H... ~ 00 if N ~ 00. 
□\n\nThere is a possibility that VC-dimension based bounds are applicable but fail to capture the true behaviour because of their independence from the distribution. One option to remedy the situation is to try a distribution-specific estimate such as the VC-entropy (i.e. the expectation of the logarithm of the counting function Π_N(x₁, ..., x_m), which is the number of dichotomies realised by the perceptron on the m-tuple of input points [18]). However, in our case, Π_N(x₁, ..., x_m) has the lower bound 2 Σ_{i=0}^{min(w_{1N}/2, m−1)} (m − 1 choose i) for x₁, ..., x_m in general position, which is virtually the expression from Sauer's lemma with the VC-dimension replaced by w_{1N}/2. Thus, using the VC-entropy instead of the VC-dimension (and Sauer's Lemma), we cannot hope for a better result than bounds of the form (5) with w_{1N}/2 replacing the VC-dimension, resulting in the bound\n\nε_{0,∞}^wc(α) ≤ min(1, log₂(4eα)/α)   (α > 1/e)   (10)\n\nin the thermodynamic limit with respect to the scaling τ_N = w_{1N}. (Note that more \"optimistic\" VC-entropy based bounds can be obtained if a prior distribution on the hypothesis space is given and taken into account [3].)\n\nThe plots of the learning curves are shown in Figure 1.\n\nAcknowledgement. The permission of the Director of Telstra Research Laboratories to publish this paper is gratefully acknowledged.\n\nReferences\n\n[1] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929-965, 1989.\n[2] T.M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers, EC-14:326-334, 1965.\n[3] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83-113, 1994.\n[4] D. Haussler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proc. COLT'94, pages 76-87, 1994.\n[5] S.B. Holden and M. Niranjan. On the practical applicability of VC dimension bounds. Neural Computation, 7:1265-1288, 1995.\n[6] P. Koiran and E.D. Sontag. Neural networks with quadratic VC-dimension. In Proc. NIPS 8, pages 197-203, The MIT Press, Cambridge, MA, 1996.\n[7] A. Kowalczyk. Counting function theorem for multi-layer networks. In Proc. NIPS 6, pages 375-382, Morgan Kaufmann Publishers, Inc., 1994.\n[8] A. Kowalczyk. Estimates of storage capacity of multilayer perceptron with threshold logic hidden units. Neural Networks, to appear.\n[9] A. Kowalczyk and H. Ferra. Generalisation in feedforward networks. In Proc. NIPS 6, pages 215-222, The MIT Press, Cambridge, MA, 1994.\n[10] A. Kowalczyk. An asymptotic version of EPD-bounds on generalisation in learning systems. Preprint, 1996.\n[11] A. Kowalczyk, J. Szymanski, and R.C. Williamson. Learning curves from a modified VC-formalism: a case study. In Proc. of ICNN'95, pages 2939-2943, IEEE, 1995.\n[12] A. Kowalczyk, J. Szymanski, P.L. Bartlett, and R.C. Williamson. Examples of learning curves from a modified VC-formalism. In Proc. NIPS 8, pages 344-350, The MIT Press, 1996.\n[13] W. Maass. Neural nets with superlinear VC-dimension. Neural Computation, 6:877-884, 1994.\n[14] J.G. Mauldon. Random division of an interval. Proc. Cambridge Phil. Soc., 41:331-336, 1951.\n[15] A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In Proc. of the 1993 World Congress on Neural Networks, 1993.\n[16] E. Sontag. Shattering all sets of k points in \"general position\" requires (k − 1)/2 parameters. Report 96-01, Rutgers Center for Systems and Control, 1996.\n[17] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.\n[18] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.\n", "award": [], "sourceid": 1219, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}, {"given_name": "Herman", "family_name": "Ferr\u00e1", "institution": null}]}