{"title": "Neural Networks with Quadratic VC Dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 197, "page_last": 203, "abstract": null, "full_text": "Neural Networks with Quadratic VC Dimension \n\nPascal Koiran* \nLab. de l'Informatique du Parallélisme \nÉcole Normale Supérieure de Lyon - CNRS \n69364 Lyon Cedex 07, France \n\nEduardo D. Sontag† \nDepartment of Mathematics \nRutgers University \nNew Brunswick, NJ 08903, USA \n\nAbstract \n\nThis paper shows that neural networks which use continuous activation functions have VC dimension at least as large as the square of the number of weights w. This result settles a long-standing open question, namely whether the well-known O(w log w) bound, known for hard-threshold nets, also held for more general sigmoidal nets. Implications for the number of samples needed for valid generalization are discussed. \n\n1 Introduction \n\nOne of the main applications of artificial neural networks is to pattern classification tasks. A set of labeled training samples is provided, and a network must be obtained which is then expected to correctly classify previously unseen inputs. In this context, a central problem is to estimate the amount of training data needed to guarantee satisfactory learning performance. To study this question, it is necessary to first formalize the notion of learning from examples. \n\nOne such formalization is based on the paradigm of probably approximately correct (PAC) learning, due to Valiant (1984). In this framework, one starts by fitting some function f, chosen from a predetermined class F, to the given training data. The class F is often called the \"hypothesis class\", and for purposes of this discussion it will be assumed that the functions in F take binary values {0, 1} and are defined on a common domain X. 
(In neural network applications, typically F corresponds to the set of all neural networks with a given architecture and choice of activation functions. The elements of X are the inputs, possibly multidimensional.) The training data consists of labeled samples (x_i, c_i), with each x_i ∈ X and each c_i ∈ {0, 1}, and \n\n*koiran@lip.ens-lyon.fr \n†sontag@hilbert.rutgers.edu \n\n\"fitting\" by an f means that f(x_i) = c_i for each i. Given a new example x, one uses f(x) as a guess of the \"correct\" classification of x. Assuming that both training inputs and future inputs are picked according to the same probability distribution on X, one needs that the space of possible inputs be well-sampled by the training data, so that f is an accurate fit. We omit the details of the formalization of PAC learning, since there are excellent references available, both in textbook (e.g. Anthony and Biggs (1992), Natarajan (1991)) and survey paper (e.g. Maass (1994)) form, and the concept is by now very well-known. \n\nAfter the work of Vapnik (1982) in statistics and of Blumer et al. (1989) in computational learning theory, one knows that a certain combinatorial quantity, called the Vapnik-Chervonenkis (VC) dimension VC(F) of the class F of interest, completely characterizes the sample sizes needed for learnability in the PAC sense. (The appropriate definitions are reviewed below. In Valiant's formulation one is also interested in quantifying the computational effort required to actually fit a function to the given training data, but we are ignoring that aspect in the current paper.) Very roughly speaking, the number of samples needed in order to learn reliably is proportional to VC(F). Estimating VC(F) then becomes a central concern. Thus from now on, we speak exclusively of VC dimension, instead of the original PAC learning problem. 
\nThe work of Cover (1988) and Baum and Haussler (1989) dealt with the computation of VC(F) when the class F consists of networks built up from hard-threshold activations and having w weights; they showed that VC(F) = O(w log w). (Conversely, Maass (1993) showed that there is also a lower bound of this form.) It would appear that this definitely settled the VC dimension (and hence also the sample size) question. \n\nHowever, the above estimate assumes an architecture based on hard-threshold (\"Heaviside\") neurons. In contrast, the usually employed gradient descent learning algorithms (\"backpropagation\" method) rely upon continuous activations, that is, neurons with graded responses. As pointed out in Sontag (1989), the use of analog activations, which allow the passing of rich (not just binary) information among levels, may result in higher memory capacity as compared with threshold nets. This has serious potential implications in learning, essentially because more memory capacity means that a given function f may be able to \"memorize\" in a \"rote\" fashion too much data, and less generalization is therefore possible. Indeed, Sontag (1992) showed that there are conceivable (though not very practical) neural architectures with extremely high VC dimensions. Thus the problem of studying VC(F) for analog networks is an interesting and relevant issue. Two important contributions in this direction were the papers by Maass (1993) and by Goldberg and Jerrum (1995), which showed upper bounds on the VC dimension of networks that use piecewise polynomial activations. The last reference, in particular, established for that case an upper bound of O(w^2), where, as before, w is the number of weights. 
However, it was an open problem (specifically, \"open problem number 7\" in the recent survey by Maass (1993)) whether there is a matching w^2 lower bound for such networks, and more generally for arbitrary continuous-activation nets. It could have been the case that the upper bound O(w^2) is merely an artifact of the method of proof in Goldberg and Jerrum (1995), and that reliable learning with continuous-activation networks is still possible with far smaller sample sizes, proportional to O(w log w). But this is not the case, and in this paper we answer Maass' open question in the affirmative. \n\nAssume given an activation σ which has different limits at ±∞, and is such that there is at least one point where it has a derivative and the derivative is nonzero (this last condition rules out the Heaviside activation). Then there are architectures with arbitrarily large numbers of weights w and VC dimension proportional to w^2. The proof relies on first showing that networks consisting of two types of activations, Heavisides and linear, already have this power. This is a somewhat surprising result, since purely linear networks result in VC dimension proportional to w, and purely threshold nets have, as per the results quoted above, VC dimension bounded by w log w. Our construction was originally motivated by a related one, given in Goldberg and Jerrum (1995), which showed that real-number programs (in the Blum-Shub-Smale (1989) model of computation) with running time T have VC dimension O(T^2). The desired result on continuous activations is then obtained by approximating Heaviside gates by σ-nets with large weights and approximating linear gates by σ-nets with small weights. This result applies in particular to the standard sigmoid 1/(1 + e^{-x}). 
(However, in contrast with the piecewise-polynomial case, there is still in that case a large gap between our w^2 lower bound and the O(w^4) upper bound which was recently established in Karpinski and Macintyre (1995).) A number of variations, dealing with Boolean inputs or weakening the assumptions on σ, are discussed. The full version of this paper also includes some remarks on threshold networks with a constant number of linear gates, and threshold-only nets with \"shared\" weights. \n\nBasic Terminology and Definitions \n\nFormally, a (first-order, feedforward) architecture or network A is a connected directed acyclic graph together with an assignment of a function to a subset of its nodes. The nodes are of two types: those of fan-in zero are called input nodes and the remaining ones are called computation nodes or gates. An output node is a node of fan-out zero. To each gate g there is associated a function σ_g : ℝ → ℝ, called the activation or gate function associated to g. \n\nThe number of weights or parameters associated to a gate g is the integer n_g equal to the fan-in of g plus one. (This definition is motivated by the fact that each input to the gate will be multiplied by a weight, and the results are added together with a \"bias\" constant term, seen as one more weight; see below.) The (total) number of weights (or parameters) of A is by definition the sum of the numbers n_g, over all the gates g of A. The number of inputs m of A is the total number of input nodes (one also says that \"A has inputs in ℝ^m\"); it is assumed that m > 0. The number of outputs p of A is the number of output nodes (unless otherwise mentioned, we assume by default that all nets considered have one-dimensional outputs, that is, p = 1). 
\nTwo examples of gate functions that are of particular interest are the identity or linear gate, Id(x) = x for all x, and the threshold or Heaviside function, H(x) = 1 if x ≥ 0, H(x) = 0 if x < 0. \n\nLet A be an architecture. Assume that the nodes of A have been linearly ordered as π_1, ..., π_m, g_1, ..., g_l, where the π_i's are the input nodes and the g_i's the gates. For simplicity, write n_i := n_{g_i} for each i = 1, ..., l. Note that the total number of parameters is n = Σ_{i=1}^{l} n_i and the fan-in of each g_i is n_i - 1. To each architecture A (strictly speaking, an architecture together with such an ordering of nodes) we associate a function \n\nF : ℝ^m × ℝ^n → ℝ^p, \n\nwhere p is the number of outputs of A, defined by first assigning an \"output\" to each node, recursively on the distance from the input nodes. Assume given an input x ∈ ℝ^m and a vector of weights w ∈ ℝ^n. We partition w into blocks (w_1, ..., w_l) of sizes n_1, ..., n_l respectively. First the coordinates of x are assigned as the outputs of the input nodes π_1, ..., π_m respectively. For each of the other gates g_i, we proceed as follows. Assume that outputs y_1, ..., y_{n_i - 1} have already been assigned to the predecessor nodes of g_i (these are input and/or computation nodes, listed consistently with the order fixed in advance). Then the output of g_i is by definition \n\nσ_{g_i}(w_{i,0} + w_{i,1} y_1 + w_{i,2} y_2 + ... + w_{i,n_i - 1} y_{n_i - 1}), \n\nwhere we are writing w_i = (w_{i,0}, w_{i,1}, w_{i,2}, ..., w_{i,n_i - 1}). The value of F(x, w) is then by definition the vector (scalar if p = 1) obtained by listing the outputs of the output nodes (in the agreed-upon fixed ordering of nodes). We call F the function computed by the architecture A. 
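The recursive definition of F can be made concrete in code. The following is an illustrative sketch (ours, not part of the paper; all function names are invented for illustration): gates are listed in an order compatible with the graph, and the flat weight vector w is consumed in per-gate blocks of size n_i, bias first, exactly as in the partition above.

```python
# Sketch: evaluating F(x, w) for a feedforward architecture made of
# identity and Heaviside gates. Input nodes are 0..m-1; gate i is node
# m+i. Each gate contributes fan-in + 1 weights (bias first).

def heaviside(x):
    return 1.0 if x >= 0 else 0.0

def identity(x):
    return x

def evaluate(gates, x, w):
    # gates: list of (activation, predecessor-node indices), topologically
    # ordered. Returns the list of gate outputs, in order.
    m = len(x)
    outputs = list(x)      # outputs of the input nodes pi_1..pi_m
    pos = 0                # position in the flat weight vector w
    for activation, preds in gates:
        block = w[pos:pos + 1 + len(preds)]   # bias + one weight per input
        pos += 1 + len(preds)
        net = block[0] + sum(wk * outputs[p] for wk, p in zip(block[1:], preds))
        outputs.append(activation(net))
    return outputs[m:]

# Example: a single threshold gate computing AND of two inputs,
# H(-1.5 + x1 + x2); it has n_g = 3 weights.
gates = [(heaviside, [0, 1])]
print(evaluate(gates, [1.0, 1.0], [-1.5, 1.0, 1.0]))  # [1.0]
print(evaluate(gates, [1.0, 0.0], [-1.5, 1.0, 1.0]))  # [0.0]
```

Note that the total length of w consumed equals Σ n_i, matching the weight count defined above.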
For each choice of weights w ∈ ℝ^n, there is a function F_w : ℝ^m → ℝ^p defined by F_w(x) := F(x, w); by abuse of terminology we sometimes call this also the function computed by A (if the weight vector has been fixed). \n\nAssume that A is an architecture with inputs in ℝ^m and scalar outputs, and that the (unique) output gate has range {0, 1}. A subset A ⊆ ℝ^m is said to be shattered by A if for each Boolean function β : A → {0, 1} there is some weight w ∈ ℝ^n so that F_w(x) = β(x) for all x ∈ A. The Vapnik-Chervonenkis (VC) dimension of A is the maximal size of a subset A ⊆ ℝ^m that is shattered by A. If the output gate can take non-binary values, we implicitly assume that the result of the computation is the sign of the output. That is, when we say that a subset A ⊆ ℝ^m is shattered by A, we really mean that A is shattered by the architecture H(A) in which the output of A is fed to a sign gate. \n\n2 Networks Made up of Linear and Threshold Gates \n\nProposition 1 For every n ≥ 1, there is a network architecture A with inputs in ℝ^2 and O(√N) weights that can shatter a set of size N = n^2. This architecture is made only of linear and threshold gates. \n\nProof. Our architecture has n parameters w_1, ..., w_n; each of them is an element of T = {0.t_1...t_n ; t_i ∈ {0, 1}} (numbers with an n-bit binary expansion). The shattered set will be S = [n]^2 = {1, ..., n}^2. For a given choice of w = (w_1, ..., w_n), A will compute the boolean function f_w : S → {0, 1} defined as follows: f_w(x, y) is equal to the x-th bit of w_y. Clearly, for any boolean function f on S, there exists a (unique) w such that f = f_w. We first consider the obvious architecture which computes the function \n\nf^1_w(y) = w_1 + Σ_{z=2}^{n} (w_z - w_{z-1}) H(y - z + 1/2), (1) \n\nsending each point y ∈ [n] to w_y. This architecture has n - 1 threshold gates, 3(n - 1) + 1 weights, and just one linear gate. 
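The selector of equation (1) can be checked directly. A minimal sketch (ours, not the paper's): for integer y ∈ [n], the Heaviside steps telescope so that the single linear gate outputs exactly w_y.

```python
# Sketch: the selector network of equation (1). For y in {1, ..., n} the
# telescoping sum of Heaviside steps returns w_y, using n - 1 threshold
# gates and one linear gate.

def H(x):
    return 1.0 if x >= 0 else 0.0

def f1(w, y):
    # w = (w_1, ..., w_n); returns w_y for integer y in 1..n.
    n = len(w)
    out = w[0]                       # w_1
    for z in range(2, n + 1):
        out += (w[z - 1] - w[z - 2]) * H(y - z + 0.5)
    return out

w = [0.625, 0.25, 0.875, 0.5]        # arbitrary parameter values in T
print([f1(w, y) for y in range(1, 5)])  # [0.625, 0.25, 0.875, 0.5]
```

For y = 3, say, the steps with z = 2, 3 fire and the sum telescopes to w_1 + (w_2 - w_1) + (w_3 - w_2) = w_3.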
\nNext we define a second multi-output net which maps t ∈ T to its binary representation f^2(t) = (t_1, ..., t_n). Assume by induction that we have a net N^2_i that maps t to (t_1, ..., t_i, 0.t_{i+1}...t_n). Since t_{i+1} = H(0.t_{i+1}...t_n - 1/2) and 0.t_{i+2}...t_n = 2 × 0.t_{i+1}...t_n - t_{i+1}, N^2_{i+1} can be obtained by adding one threshold gate and one linear gate to N^2_i (as well as 4 weights). It follows that N^2 = N^2_n has n threshold gates, n linear gates and 4n weights. \n\nFinally, we define a net N^3 which takes as input x ∈ [n] and t = (t_1, ..., t_n) ∈ {0, 1}^n, and outputs t_x. We would like this network to be as follows: \n\nf^3(x, t) = t_1 + Σ_{z=2}^{n} t_z H(x - z + 1/2) - Σ_{z=2}^{n} t_{z-1} H(x - z + 1/2). \n\nThis is not quite possible, because the products between the t_i's (which are inputs in this context) and the Heavisides are not allowed. However, since we are dealing with binary variables one can write uv = H(u + v - 1.5). Thus N^3 has one linear gate, 4(n - 1) threshold gates and 12(n - 1) + n weights. Note that f_w(x, y) = f^3(x, f^2(f^1_w(y))). This can be realized by means of a net that has n + 2 linear gates, (n - 1) + n + 4(n - 1) = 6n - 5 threshold gates, and (3n - 2) + 4n + (12n - 11) = 19n - 13 weights. □ \n\nThe following is the main result of this section: \n\nTheorem 1 For every n ≥ 1, there is a network architecture A with inputs in ℝ and O(√N) weights that can shatter a set of size N = n^2. This architecture is made only of linear and threshold gates. \n\nProof. The shattered set will be S = {0, 1, ..., n^2 - 1}. For every u ∈ S, there are unique integers x, y ∈ {0, 1, ..., n - 1} such that u = nx + y. The idea of the construction is to compute x and y, and then feed (x + 1, y + 1) to the network constructed in Proposition 1. Note that x is the unique integer such that u - nx ∈ {0, 1, ..., n - 1}. 
It can therefore be computed by brute-force search as follows: \n\nx = Σ_{k=0}^{n-1} k H[H(u - nk) + H(n - 1 - (u - nk)) - 1.5]. \n\nThis network has 3n threshold gates, one linear gate and 8n weights. Then of course y = u - nx. □ \n\nA Boolean version is as follows. \n\nTheorem 2 For every d ≥ 1, there is a network architecture A with O(√N) weights that can shatter the N = 2^{2d} points of {0, 1}^{2d}. This architecture is made only of linear and threshold gates. \n\nProof. Given u ∈ {0, 1}^{2d}, one can compute x = 1 + Σ_{i=1}^{d} 2^{i-1} u_i and y = 1 + Σ_{i=1}^{d} 2^{i-1} u_{i+d} with two linear gates. Then (x, y) can be fed to the network of Proposition 1 (with n = 2^d). □ \n\nIn other words, there is a network architecture with O(2^d) weights that can compute all boolean functions on 2d variables. \n\n3 Arbitrary Sigmoids \n\nWe now extend the preceding VC dimension bounds to networks that use just one activation function σ (instead of both linear and threshold gates). All that is required is that the gate function have a sigmoidal shape and satisfy a very weak smoothness property: \n\n1. σ is differentiable at some point x_0 (i.e., σ(x_0 + h) = σ(x_0) + σ'(x_0) h + o(h)) where σ'(x_0) ≠ 0. \n\n2. lim_{x → -∞} σ(x) = 0 and lim_{x → +∞} σ(x) = 1 (the limits 0 and 1 can be replaced by any distinct numbers). \n\nA function satisfying these two conditions will be called sigmoidal. Given any such σ, we will show that networks using only σ gates provide quadratic VC dimension. \n\nTheorem 3 Let σ be an arbitrary sigmoidal function. There exist architectures A_1 and A_2 with O(√N) weights made only of σ gates such that: \n\n• A_1 can shatter a subset of ℝ of cardinality N = n^2; \n• A_2 can shatter the N = 2^{2d} points of {0, 1}^{2d}. \n\nThis follows directly from Theorems 1 and 2, together with the following simulation result: \n\nTheorem 4 Let σ be an arbitrary sigmoidal function. 
Let N be a network of T threshold and L linear gates, with a threshold gate at the output. Then N can be simulated on any given finite set of inputs by a network N' of T + L gates that all use the activation function σ (except the output gate, which is still a threshold). Moreover, if N has n weights then N' has O(n) weights. \n\nProof. Let S be a finite set of inputs. We can assume, by changing the thresholds of threshold gates if necessary, that the net input I_g(x) to any threshold gate g of N is different from 0 for all inputs x ∈ S. \n\nGiven ε > 0, let N_ε be the net obtained by replacing the output function of each gate by x ↦ σ(x/ε) if this output function is the sign function, and by x ↦ σ̃(x) = [σ(x_0 + εx) - σ(x_0)] / [ε σ'(x_0)] if it is the identity function. Note that for any a > 0, lim_{ε → 0+} σ(x/ε) = H(x) uniformly for x ∈ (-∞, -a] ∪ [a, +∞), and lim_{ε → 0} σ̃(x) = x uniformly for x ∈ [-1/a, 1/a]. \n\nThis implies by induction on the depth of g that for any gate g of N and any input x ∈ S, the net input I_{g,ε}(x) to g in the transformed net N_ε satisfies lim_{ε → 0} I_{g,ε}(x) = I_g(x) (here, we use the fact that the output function of every g is continuous at I_g(x)). In particular, by taking g to be the output gate of N, we see that N and N_ε compute the same function on S if ε is small enough. Such a net N_ε can be transformed into an equivalent net N' that uses only σ as gate function by a simple transformation of its weights and thresholds. The number of weights remains the same, except at most for a constant term that must be added to each net input to a gate; thus if N has n weights, N' has at most 2n weights. □ \n\n4 More General Gate Functions \n\nThe objective of this section is to establish results similar to Theorem 3, but for even more arbitrary gate functions, in particular weakening the assumption that limits exist at infinity. 
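The two approximations used in the proof of Theorem 4 are easy to check numerically. The sketch below is ours (the paper gives no code) and uses the standard sigmoid with x_0 = 0, where the derivative is 1/4 ≠ 0: σ(x/ε) approaches H(x) away from 0, and the rescaled gate approaches the identity on bounded intervals.

```python
# Sketch: the two gate substitutions from the proof of Theorem 4, with
# the standard sigmoid s(x) = 1/(1 + e^{-x}) and x0 = 0, s'(0) = 1/4.

import math

def s(x):
    # Numerically safe logistic sigmoid (avoids overflow for large -x).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def approx_heaviside(x, eps):
    # As eps -> 0+, s(x/eps) -> H(x) uniformly away from 0.
    return s(x / eps)

def approx_identity(x, eps, x0=0.0):
    # Rescaled gate (s(x0 + eps*x) - s(x0)) / (eps * s'(x0)) -> x
    # uniformly on bounded intervals.
    ds = s(x0) * (1.0 - s(x0))       # derivative of the logistic at x0
    return (s(x0 + eps * x) - s(x0)) / (eps * ds)

eps = 1e-3
print(approx_heaviside(0.5, eps))    # ~ 1.0
print(approx_heaviside(-0.5, eps))   # ~ 0.0
print(approx_identity(0.7, eps))     # ~ 0.7
```

This mirrors the proof: large weights (1/ε inside the gate) recover thresholds, small weights (ε inside the gate, undone by a fixed affine rescaling) recover linear gates.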
The main result is, roughly, that any σ which is piecewise twice (continuously) differentiable gives at least quadratic VC dimension, save for certain exceptional cases involving functions that are almost everywhere linear. \n\nA function σ : ℝ → ℝ is said to be piecewise C^2 if there is a finite sequence a_1 < a_2 < ... < a_p such that on each interval I of the form (-∞, a_1), (a_i, a_{i+1}) or (a_p, +∞), σ|_I is C^2. \n\n(Note: our results hold even if it is only assumed that the second derivative exists in each of the above intervals; we do not use the continuity of these second derivatives.) \n\nTheorem 5 Let σ be a piecewise C^2 function. For every n ≥ 1, there exists an architecture made of σ-gates, and with O(n) weights, that can shatter a subset of ℝ^2 of cardinality n^2, except perhaps in the following cases: \n\n1. σ is piecewise constant, and in this case the VC dimension of any architecture of n weights is O(n log n); \n\n2. σ is affine, and in this case the VC dimension of any architecture of n weights is at most n; \n\n3. there are constants a ≠ 0 and b such that σ(x) = ax + b except at a finite nonempty set of points. In this case, the VC dimension of any architecture of n weights is O(n^2), and there are architectures of VC dimension Ω(n log n). \n\nDue to the lack of space, the proof cannot be included in this paper. Note that the upper bound of the first special case is tight for threshold nets, and that of the second special case is tight for linear functions in ℝ^n. \n\nAcknowledgements \n\nPascal Koiran was supported by an INRIA fellowship, DIMACS, and the International Computer Science Institute. Eduardo Sontag was supported in part by US Air Force Grant AFOSR-94-0293. \n\nReferences \n\nM. ANTHONY AND N.L. BIGGS (1992) Computational Learning Theory: An Introduction, Cambridge U. Press. \n\nE.B. BAUM AND D. 
HAUSSLER (1989) What size net gives valid generalization?, Neural Computation 1, pp. 151-160. \n\nL. BLUM, M. SHUB AND S. SMALE (1989) On the theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines, Bulletin of the AMS 21, pp. 1-46. \n\nA. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, AND M. WARMUTH (1989) Learnability and the Vapnik-Chervonenkis dimension, J. of the ACM 36, pp. 929-965. \n\nT.M. COVER (1988) Capacity problems for linear machines, in: Pattern Recognition, L. Kanal ed., Thompson Book Co., pp. 283-289. \n\nP. GOLDBERG AND M. JERRUM (1995) Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers, Machine Learning 18, pp. 131-148. \n\nM. KARPINSKI AND A. MACINTYRE (1995) Polynomial bounds for VC dimension of sigmoidal neural networks, in Proc. 27th ACM Symposium on Theory of Computing, pp. 200-208. \n\nW. MAASS (1993) Bounds for the computational power and learning complexity of analog neural nets, in Proc. of the 25th ACM Symp. Theory of Computing, pp. 335-344. \n\nW. MAASS (1994) Perspectives of current research about the complexity of learning in neural nets, in Theoretical Advances in Neural Computation and Learning, V.P. Roychowdhury, K.Y. Siu, and A. Orlitsky, editors, Kluwer, Boston, pp. 295-336. \n\nB.K. NATARAJAN (1991) Machine Learning: A Theoretical Approach, M. Kaufmann Publishers, San Mateo, CA. \n\nE.D. SONTAG (1989) Sigmoids distinguish better than Heavisides, Neural Computation 1, pp. 470-472. \n\nE.D. SONTAG (1992) Feedforward nets for interpolation and classification, J. Comput. Syst. Sci. 45, pp. 20-48. \n\nL.G. VALIANT (1984) A theory of the learnable, Comm. of the ACM 27, pp. 1134-1142. \n\nV.N. VAPNIK (1982) Estimation of Dependencies Based on Empirical Data, Springer, Berlin. 
\n\n\f", "award": [], "sourceid": 1051, "authors": [{"given_name": "Pascal", "family_name": "Koiran", "institution": null}, {"given_name": "Eduardo", "family_name": "Sontag", "institution": null}]}