{"title": "The VC-Dimension versus the Statistical Capacity of Multilayer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 928, "page_last": 935, "abstract": null, "full_text": "The VC-Dimension versus the Statistical \n\nCapacity of Multilayer Networks \n\nChuanyi Ji \"and Demetri Psaltis \nDepartment of Electrical Engineering \nCalifornia Institute of Technology \nPasadena, CA 91125 \n\nAbstract \n\nA general relationship is developed between the VC-dimension and the \nstatistical lower epsilon-capacity which shows that the VC-dimension can \nbe lower bounded (in order) by the statistical lower epsilon-capacity of a \nnetwork trained with random samples. This relationship explains quan(cid:173)\ntitatively how generalization takes place after memorization, and relates \nthe concept of generalization (consistency) with the capacity of the optimal \nclassifier over a class of classifiers with the same structure and the capacity \nof the Bayesian classifier. Furthermore, it provides a general methodology \nto evaluate a lower bound for the VC-dimension of feedforward multilayer \nneural networks. \nThis general methodology is applied to two types of networks which are \nimportant for hardware implementations: two layer (N - 2L - 1) net(cid:173)\nworks with binary weights, integer thresholds for the hidden units and \nzero threshold for the output unit, and a single neuron ((N - 1) net(cid:173)\nworks) with binary weigths and a zero threshold. Specifically, we obtain \nOC~L) ::; d2 \n::; O(W), and d1 \"\"' O(N). Here W is the total number \nof weights of the (N - 2L - 1) networks. d1 and d2 represent the VC(cid:173)\ndimensions for the (N - 1) and (N - 2L - 1) networks respectively. \n\n1 \n\nIntroduction \n\nThe information capacity and the VC-dimension are two important quantities that \ncharacterize multilayer feedforward neural networks. 
The former characterizes their memorization capability, while the latter represents the sample complexity needed for generalization. Discovering their relationship is of importance for obtaining a better understanding of the fundamental properties of multilayer networks in learning and generalization. \n\nIn this work we show that the VC-dimension of feedforward multilayer neural networks, which is a distribution- and network-parameter-independent quantity, can be lower bounded (in order) by the statistical lower epsilon-capacity C_ε^- (McEliece et al., 1987), which is a distribution- and network-dependent quantity, when the samples are drawn from two classes: Ω1(+1) and Ω2(-1). The only requirement on the distribution from which samples are drawn is that the optimal classification error achievable, the Bayes error P_be, is greater than zero. Then we will show that the VC-dimension d and the statistical lower epsilon-capacity C_ε^- are related by \n\nC_ε^- ≤ A_{ε'} d,    (1) \n\nwhere ε = P_eo - ε' for 0 < ε' ≤ P_eo, or ε = P_be - ε' for 0 < ε' ≤ P_be. Here ε is the error tolerance, and P_eo represents the optimal error rate achievable on the class of classifiers considered. It is obvious that P_eo ≥ P_be. The relation given in Equation (1) is non-trivial if P_be > 0 and P_eo > ε' or P_be > ε', so that ε is a non-negative quantity. A_{ε'} d is called the universal sample bound for generalization, where A_{ε'} is a positive constant depending only on ε'. When the sample complexity exceeds A_{ε'} d, all the networks of the same architecture, for all distributions of the samples, can generalize with probability almost 1 for d large. \n\n*Present Address: Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180.
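The transition at A_{ε'} d can be made concrete numerically. The sketch below is an illustration under assumed constants, not the authors' computation: it takes a Vapnik-style bound of the form h(2M; d, ε') = 6 (2M)^d / d! · exp(-ε'^2 M / 8) (the form used in Theorem 1; the constants 6 and 1/8 are assumptions of this sketch), finds the smallest M at which the bound falls below 1/2 for a fixed VC-dimension d, and checks that this transition point scales linearly with d.

```python
import math

def log_h(M, d, eps):
    """Log of the assumed bound h(2M; d, eps') = 6 (2M)^d / d! * exp(-eps'^2 M / 8),
    capped at 0 (i.e., h itself capped at 1), as is usual for bounds of this type."""
    if 2 * M <= d:
        return 0.0
    val = math.log(6.0) + d * math.log(2.0 * M) - math.lgamma(d + 1) - (eps ** 2) * M / 8.0
    return min(val, 0.0)

def transition_point(d, eps, eta=0.5, step=1000, m_max=2_000_000):
    """Smallest M (on a coarse grid) at which the bound drops below eta."""
    for M in range(step, m_max + 1, step):
        if log_h(M, d, eps) < math.log(eta):
            return M
    return None

m50 = transition_point(50, 0.1)
m100 = transition_point(100, 0.1)
print(m50, m100, m100 / m50)
```

Doubling d from 50 to 100 roughly doubles the transition point, which is the linear A_{ε'} d behavior of the universal sample bound described above.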
A special case of interest, in which P_be = 1/2, corresponds to random assignments of samples. Then C_ε^- represents the random storage capacity, which characterizes the memorizing capability of networks. \n\nAlthough the VC-dimension is a key parameter in generalization, there exists no systematic way of finding it. The relationship we have obtained, however, brings with it a constructive method of finding a lower bound for the VC-dimension of multilayer networks. That is, if the weights of a network are properly constructed using random samples drawn from a chosen distribution, the statistical lower epsilon-capacity can be evaluated and then utilized as a bound for the VC-dimension. In this paper we will show how this constructive approach contributes to finding lower bounds on the VC-dimension of multilayer networks with binary weights. \n\n2 A Relationship Between the VC-Dimension and the Statistical Capacity \n\n2.1 Definition of the Statistical Capacity \n\nConsider a network s whose weights are constructed from M random samples belonging to two classes. Let r(s) = Z/M, where Z is the total number of samples classified incorrectly by the network s. Then the random variable r(s) is the training error rate. Let \n\nP_f(M) = Pr(r(s) ≤ ε),    (2) \n\nwhere 0 < ε ≤ 1. Then the statistical lower epsilon-capacity (statistical capacity in short) C_ε^- is the maximum M such that P_f(M) ≥ 1 - η, where η can be arbitrarily small for sufficiently large N. \n\nRoughly speaking, the statistical lower epsilon-capacity defined here can be regarded as a sharp transition point on the curve P_f(M) shown in Fig. 1. When the number of samples used is below this sharp transition, the network can memorize them perfectly. \n\n2.2 The Universal Sample Bound for Generalization \n\nLet P_e(x|s) be the true probability of error for the network s.
Then the generalization error ΔE(s) satisfies ΔE(s) = |r(s) - P_e(x|s)|. We can show that the probability for the generalization error to exceed a given small quantity ε' satisfies the following relation. \n\nTheorem 1 \n\nPr(max_{s in S} ΔE(s) > ε') ≤ h(2M; d, ε'),    (3) \n\nwhere h(2M; d, ε') = 1 if either 2M ≤ d, or 2M > d and 6 ((2M)^d / d!) e^{-ε'^2 M / 8} ≥ 1; and h(2M; d, ε') = 6 ((2M)^d / d!) e^{-ε'^2 M / 8} otherwise. \n\nHere S is a class of networks with the same architecture. The function h(2M; d, ε') has one sharp transition, occurring at A_{ε'} d as shown in Fig. 1, where A_{ε'} is the constant satisfying the equation ln(2A) + 1 - (ε'^2/8) A = 0. \n\nThis theorem says that when the number M of samples used exceeds A_{ε'} d, generalization happens with probability 1. Since A_{ε'} d is a distribution- and network-parameter-independent quantity, we call it the universal sample bound for generalization. \n\n2.3 A Relationship between the VC-Dimension and C_ε^- \n\nRoughly speaking, since both the statistical capacity and the VC-dimension represent sharp transition points, it is natural to ask whether they are related. The relationship can actually be given through the theorem below. \n\nTheorem 2 Let samples belonging to two classes Ω1(+1) and Ω2(-1) be drawn independently from some distribution. The only requirement on the distributions considered is that the Bayes error P_be satisfies 0 < P_be ≤ 1/2. Let S be a class of feedforward multilayer networks with a fixed structure consisting of threshold elements, and let s1 be one network in S, where the weights of s1 are constructed from M (training) samples drawn from one distribution as specified above. For a given distribution, let P_eo be the optimal error rate achievable on S and P_be be the Bayes error rate.
Then \n\nPr(r(s1) < P_eo - ε') ≤ h(2M; d, ε'),    (4) \n\nand \n\nPr(r(s1) < P_be - ε') ≤ h(2M; d, ε'),    (5) \n\nwhere r(s1) is the training error rate of s1. (It is also called the resubstitution error estimator in the pattern recognition literature.) These relations are nontrivial if P_eo > ε', P_be > ε' and ε' > 0 is small. \n\nFigure 1: Two sharp transition points for the capacity and the universal sample bound for generalization. \n\nThe key idea of this result is illustrated in Fig. 1. That is, the sharp transition which stands for the lower epsilon-capacity lies below the sharp transition for the universal sample bound for generalization. \n\nTo interpret this relation, let us compare Equation (2) and Equation (5) and examine the ranges of ε and ε' respectively. Since ε', which is initially given in Inequality (3), represents a bound on the generalization error, it is usually quite small. For most practical problems, P_be is small also. If the structure of the class of networks is properly chosen so that P_eo ≈ P_be, then ε = P_eo - ε' will be a small quantity. Although the epsilon-capacity is a valid quantity depending on M for any network in the class, for M sufficiently large, the meaningful networks to be considered through this relation form only a small subset of the class, namely those whose true probability of error is close to P_eo. That is, this small subset contains only those networks which can approximate the best classifier contained in this class. \n\nFor a special case in which samples are assigned randomly to two classes with equal probability, we have the result stated in Corollary 1. \n\nCorollary 1 Let samples be drawn independently from some distribution and then assigned randomly to two classes Ω1(+1) and Ω2(-1) with equal probability.
This is equivalent to the case in which the two class-conditional distributions completely overlap with one another, that is, Pr(x | Ω1) = Pr(x | Ω2). Then the Bayes error is 1/2. Using the same notation as in the above theorem, we have \n\nC_{1/2-ε'}^- ≤ A_{ε'} d.    (6) \n\nAlthough the distributions specified here give an uninteresting case for classification purposes, we will see later that the random statistical epsilon-capacity in Inequality (6) can be used to characterize the memorizing capability of networks, and to formulate a constructive approach to finding a lower bound for the VC-dimension. \n\n3 Bounds for the VC-Dimension of Two Networks with Binary Weights \n\n3.1 A Constructive Methodology \n\nOne application of this relation is that it provides a general constructive approach to finding a lower bound for the VC-dimension of a class of networks. Specifically, using the relationship given in Inequality (6), the procedure can be described as follows. \n\n1) Select a distribution. \n\n2) Draw samples independently from the chosen distribution, and then assign them randomly to two classes. \n\n3) Evaluate the lower epsilon-capacity and then use it as a lower bound for the VC-dimension. \n\nTwo examples are given below to demonstrate how this general approach can be applied to find lower bounds for the VC-dimension. \n\n3.2 Bounds for Two-Layer Networks with Binary Weights \n\nTwo-layer (N - 2L - 1) networks with binary weights and integer thresholds are considered in this section. \n\n3.2.1 A Lower Bound \n\nThe construction of the network we consider is motivated by the one used by Baum (Baum, 1988) in finding the capacity of two-layer networks with real weights.
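The three steps of the constructive methodology can be sketched in code. The following is a minimal illustration, not the authors' experiment: it applies the procedure to the one-layer binary-weight rule analyzed in Section 3.3 (w_i = sgn(Σ_m a_m x_i^(m))), with the dimension N, the sample sizes M, and the random seed chosen arbitrarily for the demonstration.

```python
import random

random.seed(0)

def hebb_binary_weights(samples, labels):
    """w_i = sgn(sum_m a_m x_i^(m)): the binary-weight construction of Section 3.3."""
    N = len(samples[0])
    w = []
    for i in range(N):
        s = sum(a * x[i] for a, x in zip(labels, samples))
        w.append(1 if s > 0 else -1)
    return w

def training_error(w, samples, labels):
    """Fraction of samples misclassified by sgn(w . x) with zero threshold."""
    errors = 0
    for a, x in zip(labels, samples):
        dot = sum(wi * xi for wi, xi in zip(w, x))
        if (1 if dot > 0 else -1) != a:
            errors += 1
    return errors / len(samples)

def run(N, M):
    # Steps 1)-3): pick a distribution (uniform on {-1,+1}^N), draw M samples,
    # assign random labels, construct the network, and measure the training error.
    samples = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(M)]
    labels = [random.choice((-1, 1)) for _ in range(M)]
    w = hebb_binary_weights(samples, labels)
    return training_error(w, samples, labels)

err_small = run(N=201, M=11)     # well below capacity: near-perfect memorization
err_large = run(N=201, M=1601)   # far above capacity: error climbs toward 1/2
print(err_small, err_large)
```

Below capacity the random assignment is memorized almost perfectly, while far above it the training error rate approaches 1/2; locating the sample size at which the error first exceeds the tolerance estimates the lower epsilon-capacity used in step 3).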
Although this particular network would fail if the accuracy of the weights and the thresholds were reduced, the idea of using a grandmother-cell type of network is adopted to construct our network. \n\nWe consider a two-layer binary network with 2L hidden threshold units and one output threshold unit, shown in Fig. 2 a). The weights at the second layer are fixed and equal to +1 and -1 alternately. The hidden units are allowed to have integer thresholds in [-N, N], and the threshold for the output unit is zero. \n\nLet x_l^(m) = (x_{l1}^(m), ..., x_{lN}^(m)) be an N-dimensional random vector, where the x_{li}^(m) are independent random variables taking the values +1 and -1 with equal probability 1/2, 1 ≤ l ≤ L and 1 ≤ m ≤ M. Consider the l-th pair of hidden units. The weights at the first layer for this pair of hidden units are equal. Let w_{li} denote the weight from the i-th input to these two hidden units; then we have \n\nw_{li} = sgn(a_l Σ_{m=1}^{M} x_{li}^(m)),    (7) \n\nwhere sgn(x) = 1 if x > 0, and -1 otherwise. The a_l, 1 ≤ l ≤ L, which are independent random variables taking the two values +1 and -1 with equal probability, represent the random assignments of the LM samples into the two classes Ω1(+1) and Ω2(-1). \n\nThe thresholds t_{l+} and t_{l-} for these two units are different and are given in terms of a constant k, \n\n(8) \n\nwhere 0 < k < 1, and t_{l+} and t_{l-} correspond to the thresholds for the units with weights +1 and -1 at the second layer respectively. \n\nFigure 2: a) The two-layer network with binary weights. b) Illustration of how a pair of hidden units separates samples. \n\nFig. 2 b) illustrates how this network works.
Each pair of hidden units forms two parallel hyperplanes separated by the two thresholds. The pair generates a presynaptic input of +2 or -2 to the output unit only for the samples stored in this pair which fall in between the planes, according to whether a_l equals +1 or -1, and a presynaptic input of 0 for the samples falling outside. When the samples as well as the parallel hyperplanes are random, with a certain probability they will fall either between a pair of parallel hyperplanes or outside. Therefore, statistical analysis is needed to obtain the lower epsilon-capacity. \n\nTheorem 3 A lower bound for the lower epsilon-capacity C_{1/2-ε'}^- of this network is \n\nC_{1/2-ε'}^- ≥ (1-k) 2NL / ln L.    (9) \n\n3.2.2 An Upper Bound \n\nSince the total number of possible mappings of two-layer (N - 2L - 1) networks with binary weights and integer thresholds ranging in [-N, N] is bounded by 2^{W + L log 2N}, the VC-dimension d2 is upper bounded by W + L log 2N, which is of the order of W. Then d2 ≤ O(W). By combining both the upper and lower bounds, we have \n\nO(NL/ln L) ≤ d2 ≤ O(W).    (10) \n\n3.3 Bounds for One-Layer Networks with Binary Weights \n\nThe one-layer network we consider here is equivalent to one hidden unit in the above (N - 2L - 1) network. Specifically, the weight from the i-th input unit to the neuron is \n\nw_i = sgn(Σ_{m=1}^{M} a_m x_i^(m)),    (11) \n\nwhere 1 ≤ i ≤ N, and the x_i^(m) and a_m are independent, equally probable binary (±1) random variables, which represent the elements of the N-dimensional sample vectors and their random assignments to two classes respectively. \n\nTheorem 4 The lower epsilon-capacity C_{1/2-ε'}^- of this network satisfies \n\nC_{1/2-ε'}^- ~ (2/π) (N/ε'^2).    (12) \n\nThen by Corollary 1 we have O(N) ≤ O(d1), where d1 is the VC-dimension of the one-layer (N - 1) networks.
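Both upper bounds in this section rest on the same counting observation: a class that realizes at most K distinct input-output mappings cannot shatter more than floor(log2 K) points, since shattering d points requires 2^d distinct labelings. A small sketch of this arithmetic follows; the values of N and L are illustrative assumptions, and the threshold count is deliberately generous compared with the sharper exponent W + L log 2N used in the text.

```python
def vc_upper_bound_from_counting(num_mappings):
    """A class realizing at most K distinct mappings cannot shatter more than
    floor(log2 K) points; bit_length() - 1 computes floor(log2 K) exactly."""
    return num_mappings.bit_length() - 1

# One-layer (N - 1) network with binary weights and zero threshold:
# at most 2^N distinct weight vectors, hence d1 <= N.
N = 100
d1_bound = vc_upper_bound_from_counting(2 ** N)

# Two-layer (N - 2L - 1) network: 2^W binary-weight choices, and each of the
# 2L hidden units takes an integer threshold in [-N, N] (2N + 1 values).
L = 10
W = 2 * N * L + 2 * L                       # first-layer plus second-layer weights
num_mappings = (2 ** W) * (2 * N + 1) ** (2 * L)
d2_bound = vc_upper_bound_from_counting(num_mappings)

print(d1_bound, d2_bound)
```

Even with the generous threshold count, the bound stays of order W, since the threshold term contributes only O(L log N) extra bits.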
\n\nUsing the similar counting arguement, an upper bound can be obtained as d 1 ~ N . \nThen combining the lower and upper bounds, we have d1 \"\" O(N) \n\n4 Discussions \n\nThe general relationship we have drawn between the VC-dimension and the sta(cid:173)\ntistical lower epsilon-capacity provides a new view on the sample complexity for \ngeneralization. Specifically, it has two implications to learning and generalziation. \n1) For random assignments of the samples (Pbe = t), the relationship confirms that \ngeneralization occurs after memorization, since the statistical lower epsilon-capacity \n\n\fThe VC-Dimension versus the Statistical Capacity of Multilayer Networks \n\n935 \n\nfor this case is the random storage capacity which charaterizes the memorizing \ncapability of networks and it is upper bounded by the universal sample bound for \ngeneralization. \n\n2) For cases where the Bayes error is smaller than ~, the relationship indicates \nthat an appropriate choice of a network structure is very important. If a network \nstructure is properly chosen so that the optimal achievable error rate Peo is close \nto the Bayes error Peb , than the optimal network in this class is the one which has \nthe largest lower epsilon-capacity. Since a suitable structure can hardly be chosen \na priori due to the lack of knowledge about the underlying distribution, searching \nfor network structures as well as weight values becomes necessary. Similar idea \nhas been addressed by Devroye (Devroye, 1988) and by Vapnik (Vapnik, 1982) for \nstructural minimization. \n\nWe have applied this relation as a general constructive approach to obtain lower \nbounds for the VC-dimension of two-layer and one-layer networks with binary inter(cid:173)\nconnections. For the one-layer networks, the lower bound is tight and matches the \nupper bound. For the two-layer networks, the lower bound is smaller than the upper \nbound (in order) by a In factor. 
In independent work by Littlestone (Littlestone, 1988), the VC-dimension of so-called DNF expressions was obtained. Since any DNF expression can be implemented by a two-layer network of threshold units with binary weights and integer thresholds, this result is equivalent to showing that the VC-dimension of such networks is O(W). We believe that the ln factor in our lower bound is due to the limitations of the grandmother-cell type of networks used in our construction. \n\nAcknowledgement \n\nThe authors would like to thank Yaser Abu-Mostafa and David Haussler for helpful discussions. The support of AFOSR and DARPA is gratefully acknowledged. \n\nReferences \n\nE. Baum. (1988) On the Capacity of Multilayer Perceptron. J. of Complexity, 4: 193-215. \n\nL. Devroye. (1988) Automatic Pattern Recognition: A Study of the Probability of Error. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 10, No. 4: 530-543. \n\nN. Littlestone. (1988) Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm. Machine Learning, 2: 285-318. \n\nR. J. McEliece, E. C. Posner, E. R. Rodemich, S. S. Venkatesh. (1987) The Capacity of the Hopfield Associative Memory. IEEE Trans. Inform. Theory, Vol. IT-33, No. 4: 461-482. \n\nV. N. Vapnik. (1982) Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.", "award": [], "sourceid": 481, "authors": [{"given_name": "Chuanyi", "family_name": "Ji", "institution": null}, {"given_name": "Demetri", "family_name": "Psaltis", "institution": null}]}