{"title": "Sample Size Requirements for Feedforward Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 327, "page_last": 334, "abstract": null, "full_text": "Sample Size Requirements For \nFeedforward Neural Networks \n\nMichael J. Turmon \n\nTerrence L. Fine \n\nCornell U niv. Electrical Engineering \n\nCornell Univ. Electrical Engineering \n\nIthaca, NY 14853 \nmjt@ee.comell.edu \n\nIthaca, NY 14853 \ntlfine@ee.comell.edu \n\nAbstract \n\nWe estimate the number of training samples required to ensure that \nthe performance of a neural network on its training data matches \nthat obtained when fresh data is applied to the network. Existing \nestimates are higher by orders of magnitude than practice indicates. \nThis work seeks to narrow the gap between theory and practice by \ntransforming the problem into determining the distribution of the \nsupremum of a random field in the space of weight vectors, which \nin turn is attacked by application of a recent technique called the \nPoisson clumping heuristic. \n\n1 \n\nINTRODUCTION AND KNOWN RESULTS \n\nWe investigate the tradeofi\"s among network complexity, training set size, and sta(cid:173)\ntistical performance of feedforward neural networks so as to allow a reasoned choice \nof network architecture in the face of limited training data. Nets are functions \n7](x; w), parameterized by their weight vector w E W ~ Rd , which take as input \npoints x E Rk. For classifiers, network output is restricted to {a, 1} while for fore(cid:173)\ncasting it may be any real number. The architecture of all nets under consideration \nis N, whose complexity may be gauged by its Vapnik-Chervonenkis (VC) dimension \nv, the size of the largest set of inputs the architecture can classify in any desired way \n('shatter'). Nets 7] EN are chosen on the basis of a training set T = {(Xi, YiHr=l. \nThese n samples are i.i.d. according to an unknown probability law P. 
Performance of a network is measured by the mean-squared error \n\nℰ(w) = E(η(x; w) - y)²  (1) \n= P(η(x; w) ≠ y)  (for classifiers)  (2) \n\nand a good (perhaps not unique) net in the architecture is w^0 = argmin_{w ∈ W} ℰ(w). To select a net using the training set we employ the empirical error \n\nν_T(w) = (1/n) Σ_{i=1}^n (η(x_i; w) - y_i)²  (3) \n\nsustained by η(·; w) on the training set T. A good choice for a classifier is then w* = argmin_{w ∈ W} ν_T(w). In these terms, the issue raised in the first sentence of the section can be restated as, \"How large must n be in order to ensure ℰ(w*) - ℰ(w^0) ≤ ε with high probability?\" \n\nFor purposes of analysis we can avoid dealing directly with the stochastically chosen network w* by noting \n\nℰ(w*) - ℰ(w^0) ≤ |ν_T(w*) - ℰ(w*)| + |ν_T(w^0) - ℰ(w^0)| ≤ 2 sup_{w ∈ W} |ν_T(w) - ℰ(w)| \n\nA bound on the last quantity is also useful in its own right. \n\nThe best-known result is in (Vapnik, 1982), introduced to the neural network community by (Baum & Haussler, 1989): \n\nP(sup_{w ∈ W} |ν_T(w) - ℰ(w)| ≥ ε) ≤ 6 ((2n)^v / v!) e^{-nε²/2}  (4) \n\nThis remarkable bound not only involves no unknown constant factors, but holds independent of the data distribution P. Analysis shows that sample sizes of about \n\nn_c = (4v/ε²) log 3/ε  (5) \n\nare enough to force the bound below unity, after which it drops exponentially to zero. Taking ε = .1, v = 50 yields n_c = 68000, which disagrees by orders of magnitude with the experience of practitioners who train such simple networks. More recently, Talagrand (1994) has obtained the bound \n\nP(sup_{w ∈ W} |ν_T(w) - ℰ(w)| ≥ ε) ≤ K_1 (K_2 n ε² / v)^v e^{-2nε²}  (6) \n\nyielding a sufficient condition of order v/ε², but the values of K_1 and K_2 are inaccessible so the result is of no practical use. 
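The classical sample-size formulas (5) and (7) are easy to evaluate directly; the short script below (our illustration, not part of the paper) reproduces the n_c = 68000 figure quoted above for ε = .1, v = 50, and contrasts it with the zero-training-error condition (7).

```python
import math

def nc_vapnik(eps: float, v: float) -> float:
    """Sample size (5): n_c = (4 v / eps^2) log(3 / eps)."""
    return (4.0 * v / eps**2) * math.log(3.0 / eps)

def nc_zero_error(eps: float, v: float) -> float:
    """Sample size (7), for nets achieving zero training error:
    n_c = (5.8 v / eps) log(12 / eps)."""
    return (5.8 * v / eps) * math.log(12.0 / eps)

if __name__ == "__main__":
    # eps = .1, v = 50: about 68000 samples, as quoted in the text.
    print(round(nc_vapnik(0.1, 50)))
    # The zero-training-error condition is far less demanding.
    print(round(nc_zero_error(0.1, 50)))
```

The gap between the two is the O(1/ε) versus O(1/ε²) dependence: at ε = .1 the zero-error condition (7) is roughly five times smaller.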
\nFormulations with finer resolution near ℰ(w) = 0 are also used. Vapnik (1982) bounds P(sup_{w ∈ W} |ν_T(w) - ℰ(w)|/ℰ(w)^{1/2} ≥ ε) (note that ℰ(w)^{1/2} ≈ Var(ν_T(w))^{1/2} when ℰ(w) ≈ 0), while Blumer et al. (1989) and Anthony and Biggs (1992) work with P(sup_{w ∈ W} |ν_T(w) - ℰ(w)| 1_{{0}}(ν_T(w)) ≥ ε). The latter obtain the sufficient condition \n\nn_c = (5.8v/ε) log 12/ε  (7) \n\nfor nets, if any, having ν_T(w) = 0. If one is guaranteed to do reasonably well on the training set, a smaller order of dependence results. \n\nResults (Turmon & Fine, 1993) for perceptrons and P a Gaussian mixture imply that at least v/280ε² samples are needed to force ℰ(w*) - ℰ(w^0) < 2ε with high probability. (Here w* is the best linear discriminant with weights estimated from the data.) Combining with Talagrand's result, we see that the general (not assuming small ν_T(w)) functional dependence is v/ε². \n\n2 APPLYING THE POISSON CLUMPING HEURISTIC \n\nWe adopt a new approach to the problem. For the moderately large values of n we anticipate, the central limit theorem informs us that √n [ν_T(w) - ℰ(w)] has nearly the distribution of a zero-mean Gaussian random variable. It is therefore reasonable¹ to suppose that \n\nP(sup_{w ∈ W} |ν_T(w) - ℰ(w)| ≥ ε) ≈ P(sup_{w ∈ W} |Z(w)| ≥ ε√n) ≈ 2 P(sup_{w ∈ W} Z(w) ≥ ε√n) \n\nwhere Z(w) is a Gaussian process with mean zero and covariance \n\nR(w, v) = E Z(w)Z(v) = Cov((y - η(x; w))², (y - η(x; v))²) \n\nThe problem about extrema of the original empirical process is equivalent to one about extrema of a corresponding Gaussian process. \n\nThe Poisson clumping heuristic (PCH), introduced in the remarkable (Aldous, 1989), provides a general tool for estimating such exceedance probabilities. Consider the excursions above level b (= ε√n ≫ 1) by a stochastic process Z(w). 
At left below, the set {w : Z(w) ≥ b} is seen as a group of \"clumps\" scattered in weight space W. The PCH says that, provided Z has no long-range dependence and the level b is large, the centers of the clumps fall according to the points of a Poisson process on W, and the clump shapes are independent. The vertical arrows (below right) illustrate two clump centers (points of the Poisson process); the clumps are the bars centered about the arrows. \n\n[Figure: clumps of {w : Z(w) ≥ b} scattered in weight space W (left); clump centers as points of a Poisson process, with clump extents (right).] \n\nIn fact, with p_b(w) = P(Z(w) ≥ b), C_b(w) the size of a clump located at w, and λ_b(w) the rate of occurrence of clump centers, the fundamental equation is \n\np_b(w) = λ_b(w) E C_b(w)  (8) \n\nThe number of clumps in W is a Poisson random variable N_b with parameter ∫_W λ_b(w) dw. The probability of a clump is P(N_b > 0) = 1 - exp(-∫_W λ_b(w) dw) ≈ ∫_W λ_b(w) dw, where the approximation holds because our goal is to operate in a regime where this probability is near zero. Letting Φ̄(b) = P(N(0, 1) > b) and σ²(w) = R(w, w), we have p_b(w) = Φ̄(b/σ(w)). The fundamental equation becomes \n\nP(sup_{w ∈ W} Z(w) ≥ b) ≈ ∫_W [Φ̄(b/σ(w)) / E C_b(w)] dw  (9) \n\nIt remains only to find the mean clump size E C_b(w) in terms of the network architecture and the statistics of (x, y). \n\n¹See ch. 7 of (Pollard, 1984) for treatment of some technical details in this limit. \n\n3 POISSON CLUMPING FOR SMOOTH PROCESSES \n\nAssume Z(w) has two mean-square derivatives in w. (If the network activation functions have two derivatives in w, for example, Z(w) will have two almost sure derivatives.) Z then has a parabolic approximation about some w_0 via its gradient G = ∇Z(w_0) and Hessian matrix H = ∇∇Z(w_0) at w_0. Provided Z_0 ≥ b, that is, that there is a clump at w_0, simple computations reveal \n\nC_b(w_0) ≈ κ_d (2(Z_0 - b) - G^T H^{-1} G)^{d/2} / |H|^{1/2}  (10) \n\nwhere κ_d is the volume of the unit ball in R^d and |·| is the determinant. 
The mean clump size is the expectation of this conditioned on Z(w_0) ≥ b. The same argument used to show that Z(w) is approximately normal shows that G and H are approximately normal too. In fact, \n\nE[H | Z(w_0) = z] = -(z/σ²(w_0)) Λ(w_0),   Λ(w_0) = -E Z(w_0)H = -∇_w ∇_w R(w_0, w)|_{w=w_0} \n\nso that, since b (and hence z) is large, the second term in the numerator of (10) may be neglected. The expectation is then easily computed, resulting in \n\nLemma 1 (Smooth process clump size) Let the network activation functions be twice continuously differentiable, and let b ≫ σ(w). Then \n\nE C_b(w) ≈ (2π)^{d/2} |Λ(w)/σ²(w)|^{-1/2} (σ(w)/b)^d \n\nSubstituting into (9) yields \n\nP(sup_{w ∈ W} Z(w) ≥ b) ≈ (2π)^{-(d+1)/2} ∫_W |Λ(w)/σ²(w)|^{1/2} (b/σ(w))^{d-1} e^{-b²/2σ²(w)} dw  (11) \n\nwhere use of the asymptotic expansion Φ̄(z) ≈ (z√(2π))^{-1} exp(-z²/2) is justified since (∀w) b ≫ σ(w) is necessary to make even the individual P(Z(w) ≥ b) low, let alone the supremum. To go farther, we need information about the variance σ²(w) of (y - η(x; w))². In general this must come from the problem at hand, but suppose for example the process has a unique variance maximum σ̄² at w̄. Then, since the level b is large, we can use Laplace's method to approximate the d-dimensional integral. \n\nLaplace's method finds asymptotic expansions for integrals \n\n∫_W g(w) exp(-f(w)²/2) dw \n\nwhen f(w) is C² with a unique positive minimum at w_0 in the interior of W ⊆ R^d, and g(w) is positive and continuous. Suppose f(w_0) ≫ 1 so that the exponential factor is decreasing much faster than the slowly varying g. Expanding f to second order about w_0, substituting into the exponential, and performing the integral shows that \n\n∫_W g(w) exp(-f(w)²/2) dw ≈ (2π)^{d/2} |f(w_0) K|^{-1/2} g(w_0) exp(-f(w_0)²/2) \n\nwhere K = ∇∇f(w)|_{w_0}, the Hessian of f. See (Wong, 1989) for a proof. 
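As a one-dimensional sanity check of Laplace's method (ours, not the paper's), take g ≡ 1 and f(w) = a + w², so that w_0 = 0, f(w_0) = a ≫ 1, and the Hessian is K = 2; direct numerical integration then agrees with the approximation to about 1%.

```python
import math

def trapezoid(f, lo, hi, n):
    """Composite trapezoidal rule on [lo, hi] with n panels."""
    h = (hi - lo) / n
    s = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return s * h

a = 6.0  # f(w) = a + w^2 has unique positive minimum f(0) = a >> 1
exact = trapezoid(lambda w: math.exp(-((a + w * w) ** 2) / 2.0),
                  -3.0, 3.0, 20000)

# Laplace approximation with d = 1:
# (2 pi)^{1/2} |f(w0) K|^{-1/2} g(w0) exp(-f(w0)^2 / 2), K = f''(0) = 2
laplace = math.sqrt(2.0 * math.pi) / math.sqrt(a * 2.0) * math.exp(-a * a / 2.0)

rel_err = abs(exact - laplace) / exact
```

The leading correction is of relative order 1/f(w_0)², so the agreement improves as a grows.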
Applying this to (11) and using the asymptotic expansion for Φ̄ in reverse yields \n\nTheorem 1 Let the network activation functions be twice continuously differentiable. Let the variance have a unique maximum σ̄² at w̄ in the interior of W, and the level b ≫ σ̄. Then the PCH estimate of exceedance probability is given by \n\nP(sup_{w ∈ W} Z(w) ≥ b) ≈ (|Λ(w̄)|^{1/2} / |Λ(w̄) - Γ(w̄)|^{1/2}) Φ̄(b/σ̄)  (12) \n\nwhere Γ(w̄) = ∇_w ∇_v R(w, v)|_{w=v=w̄}. Furthermore, Λ - Γ is positive-definite at w̄; it is -1/2 the Hessian of σ²(w). The leading constant thus strictly exceeds unity. \n\nThe above probability is just P(Z(w̄) ≥ b) multiplied by a factor accounting for the other networks in the supremum. Letting b = ε√n reveals that \n\nn_c = σ̄² log(|Λ(w̄)| / |Λ(w̄) - Γ(w̄)|) / ε²  (13) \n\nsamples force P(sup_w |ν_T(w) - ℰ(w)| ≥ ε) below unity. If the variance maximum is not unique but occurs over a d-dimensional set within W, the sample size estimate becomes proportional to σ̄² d/ε². With d playing the role of VC dimension v, this is similar to Vapnik's bound although we retain dependence on P and N. \n\nThe above probability is determined by behavior near the maximum-variance point, which for example in classification is where ℰ(w) = 1/2. Such nets are uninteresting as classifiers, and certainly it is undesirable for them to dominate the entire probability. This problem is avoided by replacing Z(w) with Z(w)/σ(w), which additionally allows a finer resolution where ℰ(w) nears zero. Indeed, for classification, if n is such that with high probability \n\nsup_{w ∈ W} |ν_T(w) - ℰ(w)|/σ(w) = sup_{w ∈ W} |ν_T(w) - ℰ(w)| / (ℰ(w)(1 - ℰ(w)))^{1/2} < ε  (14) \n\nthen ν_T(w*) = 0 implies ℰ(w*) < ε²(1 + ε²)^{-1} ≈ ε². ... P(N_b = 1) ≈ ∫_W λ_b(w) dw, and \n\nE D_b(w) ≈ ∫_W Φ̄((b/σ(w)) ζ(w, w')) dw'  (19) \n\nRemark 1. 
This integral will be used in (9) to find \n\nP(sup_{w ∈ W} Z(w) ≥ b) ≈ ∫_W [Φ̄(b/σ(w)) / ∫_W Φ̄((b/σ(w)) ζ(w, w')) dw'] dw  (20) \n\nSince b is large, the main contribution to the outer integral occurs for w near a variance maximum, i.e. for σ(w)/σ̄ ≈ 1. If the variance is constant then all w ∈ W contribute. In either case ζ is nonnegative. By lemma 1 we expect (19) to be, as a function of b, of the form (const σ/b)^p for, say, p = d. In particular, we do not anticipate the exponentially small clump sizes resulting if (∀w') ζ(w, w') ≥ M > 0. Therefore ζ should approach zero over some range of w', which happens only when p ≥ 1, that is, for w' near w. The behavior of ρ(w, w') for w' ≈ w is the key to finding the clump size. \n\nRemark 2. There is a simple interpretation of the clump size; it represents the volume of w' ∈ W for which Z(w') is highly correlated with Z(w). The exceedance probability is a sum of the point exceedance probabilities (the numerator of (20)), each weighted according to how many other points are correlated with it. In effect, the space W is partitioned into regions that tend to \"have exceedances together,\" with a large clump size E C_b(w) indicating a large region. The overall probability can be viewed as a sum over all these regions of the corresponding point exceedance probability. This has a similarity to the Vapnik argument, which lumps networks together according to their n^v/v! possible actions on n items in the training set. In this sense the mean clump size is a fundamental quantity expressing the ability of an architecture to generalize. \n\n5 EMPIRICAL ESTIMATES OF CLUMP SIZE \n\nThe clump size estimate of lemma 2 is useful in its own right if one has information about the covariance of Z. Other known techniques of finding E C_b(w) exploit special features of the process at hand (e.g. smoothness or similarity to other well-studied processes); the above expression is valid for any covariance structure. 
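Remark 1's expectation that the clump-size integral (19) scales as (const σ/b)^p can be checked numerically in the simplest setting (a toy example of ours, not the paper's): take σ = 1, d = 1, and suppose ζ(w, w') behaves locally like |w - w'|; then the integral equals 2/(b √(2π)), i.e. p = 1 = d.

```python
import math

def phi_bar(x: float) -> float:
    """Standard normal upper tail probability, via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def clump_integral(b: float, half_width: float = 10.0, n: int = 200000) -> float:
    """Trapezoidal estimate of int phi_bar(b * |t|) dt over
    [-half_width, half_width]: the clump-size integral (19)
    with sigma = 1 and the assumed local form zeta = |t|."""
    h = 2.0 * half_width / n
    total = phi_bar(b * half_width)  # two endpoint terms, each weighted 1/2
    for i in range(1, n):
        total += phi_bar(b * abs(-half_width + i * h))
    return total * h

b = 5.0
val = clump_integral(b)
pred = 2.0 / (b * math.sqrt(2.0 * math.pi))  # closed form, decaying as 1/b
ratio = clump_integral(2.0 * b) / val        # doubling b should halve it
```

By contrast, a ζ bounded away from zero would make the integral exponentially small in b, which is the degenerate case the remark rules out.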
In this section we show how one may estimate the clump size using the training set, and thus obtain probability approximations in the absence of analytical information about the unknown P and the potentially complex network architecture N. \n\nHere is a practical way to approximate the integral giving E D_b(w). For γ < 1 define a set of significant w' \n\nS_γ(w) = {w' ∈ W : ζ(w, w') ≤ γ},   V_γ(w) = vol(S_γ(w))  (21) \n\nthen monotonicity of Φ̄ yields E D_b(w) ≥ ∫_{S_γ(w)} Φ̄((b/σ(w)) ζ(w, w')) dw' ≥ V_γ(w) Φ̄((b/σ(w)) γ). This apparently crude lower bound for E D_b(w) is accurate enough near the origin to give satisfactory results in the cases we have studied. For example, we can characterize the covariance R(w, w') of the smooth process of lemma 1 and thus find its ζ function. The bound above is then easily calculated and differs by only small constant factors from the clump size in the lemma. \n\nThe lower bound for E D_b(w) yields the upper bound \n\nP(sup_{w ∈ W} Z(w) ≥ b) ≤ ∫_W [Φ̄(b/σ(w)) / (V_γ(w) Φ̄((b/σ(w)) γ))] dw  (22) \n\nWe call V_γ(w) the correlation volume, as it represents those weight vectors w' whose errors Z(w') are highly correlated with Z(w); one simple way to estimate the correlation volume is as follows. Select a weight w' and using the training set compute \n\n(y_1 - η(x_1; w))², ..., (y_n - η(x_n; w))²  and  (y_1 - η(x_1; w'))², ..., (y_n - η(x_n; w'))². \n\nIt is then easy to estimate σ², σ'², and ρ, and finally ζ(w, w'), which is compared to the chosen γ to decide if w' ∈ S_γ(w). \n\nThe difficulty is that for large d, S_γ(w) is far smaller than any approximately enclosing set. Simple Monte Carlo sampling and even importance sampling methods fail to estimate the volume of such high-dimensional convex bodies because so few hits occur in probing the space (Lovász, 1991). 
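The per-pair membership test just described can nevertheless be sketched directly. The sketch below is illustrative only: the network is a simple perceptron, and since ζ(w, w') is not written out explicitly above, the code assumes the stand-in ζ = (2(1 - ρ))^{1/2}, the canonical distance for normalized Gaussian processes; that choice is our assumption, not necessarily the paper's definition.

```python
import math
import random

def perceptron(x, w):
    """Threshold unit eta(x; w) with outputs in {0, 1}."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0 else 0.0

def squared_errors(train, w):
    return [(y - perceptron(x, w)) ** 2 for x, y in train]

def zeta_hat(train, w, w2):
    """Sample correlation rho of the two squared-error sequences,
    mapped to the (assumed) distance zeta = sqrt(2 (1 - rho))."""
    e1, e2 = squared_errors(train, w), squared_errors(train, w2)
    n = len(train)
    m1, m2 = sum(e1) / n, sum(e2) / n
    c11 = sum((a - m1) ** 2 for a in e1)
    c22 = sum((b - m2) ** 2 for b in e2)
    c12 = sum((a - m1) * (b - m2) for a, b in zip(e1, e2))
    rho = c12 / math.sqrt(c11 * c22)
    return math.sqrt(max(0.0, 2.0 * (1.0 - rho)))

def in_S_gamma(train, w, w2, gamma):
    """Membership test for the significant set S_gamma(w) of (21)."""
    return zeta_hat(train, w, w2) <= gamma

random.seed(0)
d = 5
w0 = [1.0] * d                       # reference weight vector
train = []
for _ in range(100 * d):             # training set of size 100 d, as in sec. 5
    x = [random.uniform(-1.0, 1.0) for _ in range(d)]
    y = perceptron(x, w0)
    if random.random() < 0.2:        # label noise keeps the errors nondegenerate
        y = 1.0 - y
    train.append((x, y))

w_near = [1.0, 1.0, 1.0, 1.0, 1.05]  # nearly the same classifier: small zeta
w_far = [-1.0, 0.5, -0.2, 1.0, 0.3]  # an unrelated classifier: large zeta
```

Probing many random w' and recording the fraction accepted is the naive Monte Carlo estimate of V_γ(w) whose failure in high dimension motivates the coordinate-wise search described next.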
The simplest way to concentrate the search is to let w' = w except in one coordinate and probe along each coordinate axis. The correlation volume is approximated as the product of the one-dimensional measurements. \n\nSimulation studies of the above approach have been performed for a perceptron architecture with input uniform over [-1, 1]^d. The integral (22) is computed by Monte Carlo sampling, and, based on a training set of size 100d, V_γ(w) is computed at each point via the above method. The result is that an estimated sample size of 5.4d/ε² is enough to ensure (14) with high probability. For nets, if any, having ν_T(w) = 0, sample sizes larger than 5.4d/ε will ensure reliable generalization, which compares favorably with (7). \n\n6 SUMMARY AND CONCLUSIONS \n\nTo find realistic estimates of sample size we transform the original problem into one of finding the distribution of the supremum of a derived Gaussian random field, which is defined over the weight space of the network architecture. The latter problem is amenable to solution via the Poisson clumping heuristic. In terms of the PCH the question becomes one of estimating the mean clump size, that is, the typical volume of an excursion above a given level by the random field. In the \"smooth\" case we directly find the clump volume and obtain estimates of sample size that are (correctly) of order v/ε². The leading constant, while explicit, depends on properties of the architecture and the data, which has the advantage of being tailored to the given problem but the potential disadvantage of our having to compute them. \n\nWe also obtain a useful estimate for the clump size of a general process in terms of the correlation volume V_γ(w). 
For normalized error, (22) becomes approximately \n\nP(sup_{w ∈ W} |ν_T(w) - ℰ(w)|/σ(w) ≥ ε) ≤ E[vol(W)/V_γ(w)] e^{-(1-γ²)nε²/2} \n\nwhere the expectation is taken with respect to a uniform distribution on W. The probability of reliable generalization is roughly given by an exponentially decreasing factor (the exceedance probability for a single point) times a number representing degrees of freedom. The latter is the mean size of an equivalence class of \"similarly acting\" networks. The parallel with the Vapnik approach, in which a worst-case exceedance probability is multiplied by a growth function bounding the number of classes of networks in N that can act differently on n pieces of data, is striking. In this fashion the correlation volume is an analog of the VC dimension, but one that depends on the interaction of the data and the architecture. \n\nLastly, we have proposed practical methods of estimating the correlation volume empirically from the training data. Initial simulation studies based on a perceptron with input uniform on a region in R^d show that these approximations can indeed yield informative estimates of sample complexity. \n\nReferences \n\nAldous, D. 1989. Probability Approximations via the Poisson Clumping Heuristic. Springer. \nAnthony, M., & Biggs, N. 1992. Computational Learning Theory. Cambridge Univ. \nBaum, E., & Haussler, D. 1989. What size net gives valid generalization? Pages 81-90 of: Touretzky, D. S. (ed), NIPS 1. \nBlumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. Jour. Assoc. Comp. Mach., 36, 929-965. \nLovász, L. 1991. Geometric Algorithms and Algorithmic Geometry. In: Proc. Internat. Congr. Mathematicians. The Math. Soc. of Japan. \nPollard, D. 1984. Convergence of Stochastic Processes. Springer. \nTalagrand, M. 1994. Sharper bounds for Gaussian and empirical processes. Ann. Probab., 22, 28-76. 
\nTurmon, M. J., & Fine, T. L. 1993. Sample Size Requirements of Feedforward Neural Network Classifiers. In: IEEE 1993 Intern. Sympos. Inform. Theory. \nVapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer. \nWong, R. 1989. Asymptotic Approximations of Integrals. Academic. \n", "award": [], "sourceid": 970, "authors": [{"given_name": "Michael", "family_name": "Turmon", "institution": null}, {"given_name": "Terrence", "family_name": "Fine", "institution": null}]}