{"title": "Tight Bounds for the VC-Dimension of Piecewise Polynomial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 323, "page_last": 329, "abstract": null, "full_text": "Tight Bounds for the VC-Dimension of Piecewise Polynomial Networks \n\nAkito Sakurai \nSchool of Knowledge Science \nJapan Advanced Institute of Science and Technology \nNomi-gun, Ishikawa 923-1211, Japan. \nCREST, Japan Science and Technology Corporation. \nASakurai@jaist.ac.jp \n\nAbstract \n\nO(ws(s log d + log(dqh/s))) and O(ws((h/s) log q + log(dqh/s))) are upper bounds for the VC-dimension of a set of neural networks of units with piecewise polynomial activation functions, where s is the depth of the network, h is the number of hidden units, w is the number of adjustable parameters, q is the maximum number of polynomial segments of the activation functions, and d is the maximum degree of the polynomials; also Ω(ws log(dqh/s)) is a lower bound for the VC-dimension of such a network set. These bounds are tight for the cases s = Θ(h) and s constant. For the special case q = 1, the VC-dimension is Θ(ws log d). \n\n1 Introduction \n\nIn spite of its importance, we had been unable to obtain VC-dimension values for practical types of networks until fairly tight upper and lower bounds were obtained ([6], [8], [9], and [10]) for linear threshold element networks, in which every element computes a threshold function of a weighted sum of its inputs. Roughly, the lower bound for these networks is (1/2)w log h and the upper bound is w log h, where h is the number of hidden elements and w is the number of connecting weights (for the one-hidden-layer case w ~ nh, where n is the input dimension of the network). \n\nIn many applications, though, sigmoidal functions, specifically the typical sigmoid function 1/(1 + exp(-x)), or piecewise linear functions for economy of calculation, are used instead of the threshold function. 
This is mainly because the differentiability of the functions is needed to perform backpropagation or other learning algorithms. Unfortunately, the explicit bounds obtained so far for the VC-dimension of sigmoidal networks exhibit large gaps (O(w^2 h^2) ([3]), Ω(w log h) for bounded depth and Ω(wh) for unbounded depth) and are hard to improve. For the piecewise linear case, Maass obtained the result that the VC-dimension is O(w^2 log q), where q is the number of linear pieces of the function ([5]). \n\nRecently Koiran and Sontag ([4]) proved a lower bound Ω(w^2) for the piecewise polynomial case, and they claimed that this settles the open problem posed by Maass of whether there is a matching w^2 lower bound for this type of network. But something remains to be done, since they showed it only for the case w = Θ(h) with the number of hidden layers unbounded; also, the O(w^2) upper bound has room for improvement. \n\nIn this paper we improve the bounds obtained by Maass and by Koiran and Sontag, and consequently show the role of polynomials, which cannot be played by linear functions, and the role of the constant functions that can appear in the piecewise polynomial case, which cannot be played by polynomial functions. \n\nAfter submission of the draft, we found that Bartlett, Maiorov, and Meir had obtained similar results prior to ours (also in this proceedings). Our advantage is that we clarified the role played by the degree and the number of segments in both bounds. \n\n2 Terminology and Notation \n\nlog stands for the logarithm base 2 throughout the paper. \n\nThe depth of a network is the length of the longest path from its external inputs to its external output, where the length is the number of units on the path. Likewise we can assign a depth to each unit in a network as the length of the longest path from the external input to the output of the unit. 
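The depth bookkeeping just defined can be sketched in code (our own illustration, not part of the paper; the DAG encoding and the names used are assumptions):

```python
# Our own sketch (not from the paper): unit depth as the number of units on
# the longest path from the external inputs, per the definition above.
def unit_depths(preds):
    # preds maps each unit to its predecessor units; "in" marks external inputs.
    depths = {}
    def depth(u):
        if u == "in":
            return 0                  # external inputs contribute no units
        if u not in depths:
            depths[u] = 1 + max(depth(v) for v in preds[u])
        return depths[u]
    for u in preds:
        depth(u)
    return depths

# Example: a depth-3 network; units at depths 1 and 2 form the hidden layers.
preds = {"u1": ["in"], "u2": ["in"], "u3": ["u1", "u2"], "out": ["u3", "u1"]}
assert unit_depths(preds) == {"u1": 1, "u2": 1, "u3": 2, "out": 3}
```
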
A hidden layer is a set of units at the same depth other than the depth of the network. Therefore a depth-L network has L - 1 hidden layers. \n\nIn many cases w will stand for a vector composed of all the connection weights in the network (including threshold values for the threshold units), and w is the length of w. The number of units in the network, excluding \"input units,\" will be denoted by h; in other words, h is the number of hidden units plus one, or sometimes just the number of hidden units. A function whose range is {0, 1} (the set consisting of 0 and 1) is called a Boolean-valued function. \n\n3 Upper Bounds \n\nTo obtain upper bounds for the VC-dimension we use a region counting argument developed by Goldberg and Jerrum [2]. The VC-dimension of the network, that is, the VC-dimension of the function set {f_G(w; ·) | w ∈ R^w}, is upper bounded by \n\nmax { N | 2^N ≤ max_{x_1,...,x_N} N_cc(R^w - ∪_{i=1}^N N(f_G(·; x_i))) }    (3.1) \n\nwhere N_cc(·) is the number of connected components and N(f) is the set {w | f(w) = 0}. \n\nThe following two results are convenient. Refer to [11] and [7] for the first theorem; the lemma that follows is easily proven. \n\nTheorem 3.1. Let f_G(w; x_i) (1 ≤ i ≤ N) be real polynomials in w, each of degree d or less. The number of connected components of the set ∩_{i=1}^N {w | f_G(w; x_i) = 0} is bounded from above by 2(2d)^w, where w is the length of w. \n\nLemma 3.2. If m ≥ w(log C + log log C + 1), then 2^m > (mC/w)^w for C ≥ 4. \n\nFirst let us consider the polynomial activation function case. \n\nTheorem 3.3. Suppose that the activation functions are polynomials of degree at most d. O(ws log d) is an upper bound on the VC-dimension for networks of depth s. When s = Θ(h) the bound is O(wh log d). More precisely, ws(log d + log log d + 2) is an upper bound. 
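As a quick numerical sanity check of Lemma 3.2 (our own illustration, not part of the proof), the following sketch verifies the claimed inequality for m taken just above the stated threshold:

```python
# Our own numeric sanity check of Lemma 3.2 (illustration, not a proof).
import math

def check_lemma_3_2(w, C):
    # Take m one above the threshold w(log C + log log C + 1) (base-2 logs)
    # and verify 2^m > (mC/w)^w, comparing in log space to avoid overflow.
    m = math.ceil(w * (math.log2(C) + math.log2(math.log2(C)) + 1)) + 1
    return m > w * math.log2(m * C / w)

assert all(check_lemma_3_2(w, C) for w in (1, 2, 5, 10, 50) for C in (4, 8, 100, 1024))
```
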
Note that if we allow a polynomial as the input function, d_1 d_2 will replace d above, where d_1 is the maximum degree of the input functions and d_2 is that of the activation functions. \n\nThe theorem is clear from the fact that the network function (f_G in (3.1)) is a polynomial of degree at most d^s + d^{s-1} + ... + d, together with Theorem 3.1 and Lemma 3.2. \n\nFor the piecewise polynomial case, we have two types of bounds. The first one is suitable for bounded-depth cases (i.e. depth s = o(h)) and the second one for the unbounded-depth case (i.e. s = Θ(h)). \n\nTheorem 3.4. Suppose that the activation functions are piecewise polynomials with at most q segments, each a polynomial of degree at most d. O(ws(s log d + log(dqh/s))) and O(ws((h/s) log q + log(dqh/s))) are upper bounds for the VC-dimension, where s is the depth of the network. More precisely, ws((s/2) log d + log(qh)) and ws((h/s) log q + log d) are asymptotic upper bounds. Note that if we allow a polynomial as the input function then d_1 d_2 will replace d above, where d_1 is the maximum degree of the input functions and d_2 is that of the activation functions. \n\nProof. We have two different ways to calculate the bounds. First, \n\n2^N ≤ Π_{i=1}^s (8eNq h_i s (d^{i-1} + ... + d + 1) d)^{w_1+...+w_i} ≤ (8eNq d^{(s+1)/2} (h/s))^{ws} \n\nwhere h_i and w_i are the number of hidden units and the number of weights in the i-th layer, respectively. From this and Lemma 3.2 we get the asymptotic upper bound ws((s/2) log d + log(qh)) for the VC-dimension. \n\nSecondly, a similar product estimate in which the assignments of polynomial segments to the units of each layer are counted separately yields the asymptotic upper bound ws((h/s) log q + log d) for the VC-dimension. \n\nCombining these two bounds we get the result. Note that the s in log(dqh/s) is introduced to eliminate an unduly large term that emerges when s = Θ(h). □ \n\n4 Lower Bounds for Polynomial Networks \n\nTheorem 4.1. Let us consider the case where the activation functions are polynomials of degree at most d. 
Ω(ws log d) is a lower bound on the VC-dimension for networks of depth s. When s = Θ(h) the bound is Ω(wh log d). More precisely, (1/16)w(s - 6) log d is an asymptotic lower bound, where d is the degree of the activation functions and is a power of two, and h is restricted to O(n^2) for input dimension n. \n\nThe proof consists of several lemmas. The network we construct has two parts: an encoder and a decoder. We deliberately fix the N input points. The decoder part has a fixed underlying architecture and also fixed connecting weights, whereas the encoder part has variable weights, so that for any given binary outputs for the input points the decoder can output the specified value from the codes in which the output values are encoded by the encoder. \n\nFirst we consider the decoder, which has two real inputs and one real output. One of the two inputs, y, holds a code of a binary sequence b_1, b_2, ..., b_m and the other, x, holds a code of a binary sequence c_1, c_2, ..., c_m. The elements of the latter sequence are all 0's except for c_j = 1, where c_j = 1 orders the decoder to output b_j from it and consequently from the network. \n\nWe show two types of networks: one has activation functions of degree at most two and achieves VC-dimension w(s - 1), and the other has activation functions of degree d, a power of two, and achieves VC-dimension w(s - 5) log d. \n\nWe use for convenience two functions: H_0(x) = 1 if x ≥ 0 and 0 otherwise, and H_{θ,φ}(x) = 1 if x ≥ φ, 0 if x ≤ θ, and undefined otherwise. Throughout this section we will use a simple logistic function p(x) = (16/3)x(1 - x), which has the following property. \n\nLemma 4.2. For any binary sequence b_1, b_2, ..., b_m, there exists an interval [x_1, x_2] such that b_i = H_{1/4,3/4}(p^i(x)) and 0 ≤ p^i(x) ≤ 1 for any x ∈ [x_1, x_2]. \n\nThe next lemmas are easily proven. \n\nLemma 4.3. 
For any binary sequence c_1, c_2, ..., c_m which are all 0's except for c_j = 1, there exists x_0 such that c_i = H_{1/4,3/4}(p^i(x_0)). Specifically we will take x_0 = p_L^{-(j-1)}(1/4), where p_L^{-1}(x) is the inverse of p(x) on [0, 1/2]. Then p^{j-1}(x_0) = 1/4, p^j(x_0) = 1, p^i(x_0) = 0 for all i > j, and p^{j-i}(x_0) ≤ (1/4)^i for all positive i ≤ j. \n\nProof. Clear from the fact that p(x) ≥ 4x on [0, 1/4]. □ \n\nLemma 4.4. For any binary sequence b_1, b_2, ..., b_m, take y such that b_i = H_{1/4,3/4}(p^i(y)) and 0 ≤ p^i(y) ≤ 1 for all i, and x_0 = p_L^{-(j-1)}(1/4). Then H_{7/12,3/4}(Σ_{i=1}^m p^i(x_0) p^i(y)) = b_j, i.e. H_0(Σ_{i=1}^m p^i(x_0) p^i(y) - 2/3) = b_j. \n\nProof. If b_j = 0, Σ_{i=1}^m p^i(x_0) p^i(y) = Σ_{i=1}^j p^i(x_0) p^i(y) ≤ p^j(y) + Σ_{i=1}^∞ (1/4)^i < p^j(y) + 1/3 ≤ 7/12. If b_j = 1, Σ_{i=1}^m p^i(x_0) p^i(y) ≥ p^j(x_0) p^j(y) ≥ 3/4. □ \n\nBy the above lemmas, the network in Figure 1 (left) computes the following function: given a binary sequence b_1, ..., b_m and an integer j, we can present y, which depends only on b_1, ..., b_m, and x_0, which depends only on j, such that b_j is output from the decoder. \n\nNote that we use (x + y)^2 - (x - y)^2 = 4xy to realize a multiplication unit. \n\nFigure 1: Network architectures consisting of polynomials of degree two (left) and of degree a power of two (right). \n\nFor the case of degree higher than two we have to construct a somewhat more complicated network by using another simple logistic function μ(x) = (36/5)x(1 - x). We need the next lemma. \n\nLemma 4.5. Take x_0 = μ_L^{-(j-1)}(1/6), where μ_L^{-1}(x) is the inverse of μ(x) on [0, 1/2]. Then μ^{j-1}(x_0) = 1/6, μ^j(x_0) = 1, μ^i(x_0) = 0 for all i > j, and μ^{j-i}(x_0) ≤ (1/6)^i for all i with 0 < i ≤ j. \n\nProof. Clear from the fact that μ(x) ≥ 6x on [0, 1/6]. □ 
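The selector of Lemma 4.3 is easy to check numerically. The following sketch (our own illustration; the function names are ours) builds x_0 by iterating the inverse branch of p and verifies p^{j-1}(x_0) = 1/4, p^j(x_0) = 1, and p^i(x_0) = 0 for i > j:

```python
# Illustration of Lemma 4.3 (ours, not from the paper).
import math

def p(x):
    # the logistic map used by the decoder
    return (16.0 / 3.0) * x * (1.0 - x)

def p_inv(y):
    # inverse of p on [0, 1/2]: the smaller root of (16/3)x(1-x) = y
    return (1.0 - math.sqrt(1.0 - 0.75 * y)) / 2.0

def selector_x0(j):
    # x_0 = p_L^{-(j-1)}(1/4); then p^{j-1}(x_0) = 1/4 and p^j(x_0) = 1
    x0 = 0.25
    for _ in range(j - 1):
        x0 = p_inv(x0)
    return x0

m, j = 6, 3
x = selector_x0(j)
orbit = []
for _ in range(m):
    x = p(x)
    orbit.append(x)                          # orbit[i-1] = p^i(x_0)

assert abs(orbit[j - 2] - 0.25) < 1e-9       # p^{j-1}(x_0) = 1/4
assert abs(orbit[j - 1] - 1.0) < 1e-9        # p^j(x_0) = 1
assert all(abs(v) < 1e-9 for v in orbit[j:])  # p^i(x_0) = 0 for i > j
```

So only c_j = H_{1/4,3/4}(p^j(x_0)) fires, which is exactly the selection behavior the decoder relies on.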
Lemma 4.6. For any binary sequence b_1, b_2, ..., b_k, b_{k+1}, b_{k+2}, ..., b_{2k}, ..., b_{(m-1)k+1}, ..., b_{mk}, take y such that b_i = H_{1/4,3/4}(p^i(y)) and 0 ≤ p^i(y) ≤ 1 for all i. Moreover, for any 1 ≤ j ≤ m and any 1 ≤ l ≤ k, take x_1 = μ_L^{-(j-1)}(1/6) and x_0 = μ_L^{-(l-1)}(1/(6k)). Then for z = Σ_{i=1}^m p^{ik}(y) μ^{ik}(x_1), H_0(Σ_{i=0}^{k-1} p^i(z) μ^i(x_0) - 1/2) = b_{kj+l} holds. \n\nLemma 4.7. If 0 ≤ p^i(x) ≤ 1 for any 0 < i ≤ l, take an ε such that (16/3)^l ε < 1/4. Then p^l(x) - (16/3)^l ε < p^l(x + ε) < p^l(x) + (16/3)^l ε. \n\nProof. There are four cases, depending on whether p^{l-1}(x + ε) is on the uphill or downhill of p and whether x is on the uphill or downhill of p^{l-1}. The proofs are done by induction. First suppose that the two are on the uphill. Then p^l(x + ε) = p(p^{l-1}(x + ε)) < p(p^{l-1}(x) + (16/3)^{l-1} ε) < p^l(x) + (16/3)^l ε. Secondly, suppose that p^{l-1}(x + ε) is on the uphill but x is on the downhill of p^{l-1}. Then p^l(x + ε) = p(p^{l-1}(x + ε)) > p(p^{l-1}(x) - (16/3)^{l-1} ε) > p^l(x) - (16/3)^l ε. The other two cases are similar. □ \n\nProof of Lemma 4.6. We will show that the difference between p^{jk+l}(y) and Σ_{i=0}^{k-1} p^i(z) μ^i(x_0) is sufficiently small. Clearly z = Σ_{i=1}^m μ^{ik}(x_1) p^{ik}(y) = Σ_{i=1}^j μ^{ik}(x_1) p^{ik}(y) ≤ p^{jk}(y) + Σ_{i=1}^∞ (1/(6k))^i < p^{jk}(y) + 1/(6k - 1), and p^{jk}(y) < z. If z is on the uphill of p^l then, by using the above lemma, we get Σ_{i=0}^{k-1} p^i(z) μ^i(x_0) < p^l(z) + 1/(6k - 1) < p^{jk+l}(y) + (1 + (16/3)^l)(1/(6k - 1)) < p^{jk+l}(y) + 1/4 (note that l ≤ k - 1 and k ≥ 2). If z is on the downhill of p^l then, by using the above lemma, we get Σ_{i=0}^{k-1} p^i(z) μ^i(x_0) > p^l(z) > p^l(p^{jk}(y)) - (16/3)^l (1/(6k - 1)) > p^{jk+l}(y) - 1/4. □ \n\nNext we show the encoding scheme we adopted. We show only the case w = Θ(h^2), since the case w = Θ(h), or more generally w = O(h^2), is easily obtained from this. \n\nTheorem 4.8. There is a network of 2n inputs and 2h hidden units with h^2 weights w, 
and h^2 sets of input values x_1, ..., x_{h^2}, such that for any set of values y_1, ..., y_{h^2} we can choose w to satisfy y_i = f_G(w; x_i). \n\nProof. We extensively utilize the fact that the monomials obtained by choosing at most m variables from n variables with repetition allowed (say x_1^2 x_2 x_6) are all linearly independent ([1]). Note that the number of monomials thus formed is the binomial coefficient C(n+m, m). \n\nSuppose for simplicity that we have 2n inputs and 2h main hidden units (we have other hidden units too), and h = C(n+m, m). By using multiplication units (in fact each is a composite of two squaring units, whose outputs are summed as in Figure 1), we can form h = C(n+m, m) linearly independent monomials in the variables x_1, ..., x_n by using at most (m-1)h multiplication units (or h nominal units when m = 1). In the same way, we can form h linearly independent monomials in the variables x_{n+1}, ..., x_{2n}. Let us denote the monomials by u_1, ..., u_h and v_1, ..., v_h. \n\nWe form a subnetwork to calculate Σ_{j=1}^h (Σ_{i=1}^h w_{i,j} u_i) v_j by using h multiplication units. Clearly the calculated result y is the weighted sum of the monomials described above, where the weights are w_{i,j} for 1 ≤ i, j ≤ h. \n\nSince y = f_G(w; x) is a linear combination of linearly independent terms, if we choose h^2 sets of values x_1, ..., x_{h^2} for x = (x_1, ..., x_{2n}) appropriately, then for any assignment of h^2 values y_1, ..., y_{h^2} to y we have a set of weights w such that y_i = f_G(w; x_i). □ \n\nProof of Theorem 4.1. The whole network consists of the decoder and the encoder. The input points are the Cartesian product of the above x_1, ..., x_{h^2} and the set of points x_0 defined in Lemma 4.4 for 1 ≤ j ≤ s', where s' is the number of bits to be encoded. This means that we have h^2 s' points that can be shattered. \n\nLet the number of hidden layers of the decoder be s. 
The number of units used for the decoder is 4(s - 1) + 1 (for the degree-2 case, which can decode at most s bits) or 4(s - 3) + 4(k - 1) + 1 (for the degree-2^k case, which can decode at most (s - 2)k bits). The number of units used for the encoder is less than 4h; we do, though, have constraints on s (which dominates the depth of the network) and h (which dominates the number of units in the network): h ≤ C(n+m, m) and m = O(s), or roughly log h = O(s), must be satisfied. \n\nLet us choose m = 2 (m = log s is a better choice). As a result, by using 4h + 4(s - 1) + 1 (or 4h + 4(s - 3) + 4(k - 1) + 1) units in s + 2 layers, we can shatter h^2 s (or h^2 (s - 2) log d) points; or, asymptotically, by using h units in s layers we can shatter (1/16)w(s - 3) (or (1/16)w(s - 5) log d) points. □ \n\n5 Piecewise Polynomial Case \n\nTheorem 5.1. Let us consider a set of networks of units with linear input functions and piecewise polynomial (with q polynomial segments) activation functions. Ω(ws log(dqh/s)) is a lower bound on the VC-dimension, where s is the depth of the network and d is the maximum degree of the activation functions. More precisely, (1/16)w(s - 6)(log d + log(h/s) + log q) is an asymptotic lower bound. \n\nFor lack of space, we give just an outline of the proof. Our proof is based on that for the polynomial networks. We will use h units with activation functions of q ≥ 2 polynomial segments of degree at most d in place of each p^k unit in the decoder, which gives the ability to decode log dqh bits in one layer, and s log dqh bits in total, by Θ(sh) units in total. If h designates the total number of units, the number of decodable bits is represented as log(dqh/s). \n\nIn the following, for simplicity, we suppose that dqh is a power of 2. Let p^k(x) be the k-fold composition of p(x) as usual, i.e. p^k(x) = p(p^{k-1}(x)) and p^1(x) = p(x). 
Let p^{log d, l}(x) = p^{log d}(λ^l(x)), where λ(x) = 4x if x ≤ 1/2 and 4 - 4x otherwise; p^{log d, l} has, by the way, 2^l polynomial segments. \n\nNow the p^k unit in the polynomial case is replaced by the array p^{log d, log q, log h}(x) of h units that is defined as follows: \n\n(i) p^{log d, log q, 1}(x) is an array of two units; one is p^{log d, log q}(λ^+(x)), where λ^+(x) = 4x if x ≤ 1/2 and 0 otherwise, and the other is p^{log d, log q}(λ^-(x)), where λ^-(x) = 0 if x ≤ 1/2 and 4 - 4x otherwise. \n\n(ii) p^{log d, log q, m}(x) is the array of 2^m units, each with one of the functions p^{log d, log q}(λ^±(...(λ^±(x))...)), where λ^±(...(λ^±(x))...) is an m-fold composition of λ^+(x) or λ^-(x). Note that λ^±(...(λ^±(x))...) has at most three linear segments (one is linear and the others are constant 0), and that the sum of the 2^m possible combinations f(λ^±(...(λ^±(x))...)) is equal to f(λ^m(x)) for any function f such that f(0) = 0. \n\nThen lemmas similar to the ones in the polynomial case follow. \n\nReferences \n\n[1] Anthony, M.: Classification by polynomial surfaces, NeuroCOLT Technical Report Series, NC-TR-95-011 (1995). \n\n[2] Goldberg, P. and M. Jerrum: Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers, Proc. Sixth Annual ACM Conference on Computational Learning Theory, 361-369 (1993). \n\n[3] Karpinski, M. and A. Macintyre: Polynomial bounds for VC dimension of sigmoidal neural networks, Proc. 27th ACM Symposium on Theory of Computing, 200-208 (1995). \n\n[4] Koiran, P. and E. D. Sontag: Neural networks with quadratic VC dimension, Journ. Comp. Syst. Sci., 54, 190-198 (1997). \n\n[5] Maass, W. G.: Bounds for the computational power and learning complexity of analog neural nets, Proc. 25th Annual Symposium on the Theory of Computing, 335-344 (1993). \n\n[6] Maass, W. 
G.: Neural nets with superlinear VC-dimension, Neural Computation, 6, 877-884 (1994). \n\n[7] Milnor, J.: On the Betti numbers of real varieties, Proc. of the AMS, 15, 275-280 (1964). \n\n[8] Sakurai, A.: Tighter bounds of the VC-dimension of three-layer networks, Proc. WCNN'93, III, 540-543 (1993). \n\n[9] Sakurai, A.: On the VC-dimension of depth four threshold circuits and the complexity of Boolean-valued functions, Proc. ALT'93 (LNAI 744), 251-264 (1993); a refined version is in Theoretical Computer Science, 137, 109-127 (1995). \n\n[10] Sakurai, A.: On the VC-dimension of neural networks with a large number of hidden layers, Proc. NOLTA'93, IEICE, 239-242 (1993). \n\n[11] Warren, H. E.: Lower bounds for approximation by nonlinear manifolds, Trans. AMS, 133, 167-178 (1968). \n", "award": [], "sourceid": 1605, "authors": [{"given_name": "Akito", "family_name": "Sakurai", "institution": null}]}