{"title": "e-Entropy and the Complexity of Feedforward Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 946, "page_last": 952, "abstract": "", "full_text": "c-Entropy and the Complexity of \n\nFeedforward Neural Networks \n\nRobert C. Williamson \nDepartment of Systems Engineering \nResearch School of Physical Sciences and Engineering \nAustralian National University \nGPO Box 4, Canberra, 2601, Australia \n\nAbstract \n\nWe develop a. new feedforward neuralnet.work represent.ation of Lipschitz \nfunctions from [0, p]n into [0,1] ba'3ed on the level sets of the function. We \nshow that \n\n~~ + ~\u20acr + ( 1 + h) (:~) n \n\nis an upper bound on the number of nodes needed to represent f to within \nuniform error Cr, where L is the Lipschitz constant. \\Ve also show that the \nnumber of bits needed to represent the weights in the network in order to \nachieve this approximation is given by \n\no (~2;~r (:~) n) . \n\n\\Ve compare this bound with the [-entropy of the functional class under \nconsideration. \n\n1 \n\nINTRODUCTION \n\nWe are concerned with the problem of the number of nodes needed in a feedforward \nneural network in order to represent a fUllction to within a specified accuracy. \nAll results to date (e.g. [7,10,15]) have been in the form of existence theorems, \nstating that there does exist a neural network which achieves a certain accuracy of \nrepresentation, but no indication is given of the number of nodes necessary in order \nto achieve this. The two techniques we use are the notion of [-entropy (also known \n\n946 \n\n\f\u00a3-Enlropy and the Complexity of Feedforward Neural Networks \n\n947 \n\nTable 1: Hierarchy of theoretical problems to be solved. \n\nABSTRACT \n\n1. Determination of the general approximation properties of feedforward \n(Non-constructive results of the form mentioned \n\nneural networks. \nabove [15].) \n\n2. 
Explicit constructive approximation theorems for feedforward neural \nnetworks, indicating the number (or bounds on the number) of nodes \nneeded to approximate a function from a given class to within a given \naccuracy. (This is the subject of the present paper. We are unaware of \nany other work along these lines apart from [6].) \n\n3. Learning in general. That is, results on learning that are not dependent \non the particular representation chosen. The exciting new results using \nthe Vapnik-Chervonenkis dimension [4,9] fit into this category, as do \nstudies on the use of Shortest Description Length principles [2]. \n\n4. Specific results on capabilities of learning in a given architecture [11]. \n\n5. Specific algorithms for learning in a specific architecture [14]. \n\nCONCRETE \n\nas metric entropy) originally introduced by Kolmogorov [16] and a representation \nof a function in terms of its level sets, which was used by Arnold [1]. The place of \nthe current paper with respect to other works in the literature can be judged from \ntable 1. \n\nWe study the question of representing a function f in the class F_{L,C}^{(p_1,...,p_n),n}, which \nis the space of real valued functions defined on the n-dimensional closed interval \n\times_{i=1}^{n} [0, p_i] with a Lipschitz constant L and bounded in absolute value by C. If \np_i = p for i = 1, ..., n we denote the space F_{L,C}^{p,n}. The error measure we use is the \nuniform or sup metric: \n\n\sup_{x \in [0,p]^n} |\hat{f}(x) - f(x)|, \qquad (1) \n\nwhere \hat{f} is the approximation of f. \n\n2 ε-ENTROPY OF FUNCTIONAL CLASSES \n\nThe ε-entropy H_\epsilon gives an indication of the number of bits required to represent \nwith accuracy ε an arbitrary function f in some functional class. It is defined as \nthe logarithm to base 2 of the number of elements in the smallest ε-cover of the \nfunctional class. Kolmogorov [16] has proved that \n\nH_\epsilon(F_{L,C}^{p,n}) = B(n)\left(\frac{pL}{\epsilon}\right)^n (1 + o(1)) \quad (\epsilon \to 0), \qquad (2) \n\nwhere B(n) is a constant which depends only on n. 
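To give a concrete feel for the scale of H_ε, the sketch below (illustrative code, not from the paper; the helper name lipschitz_cover_bits is my own) bounds the ε-entropy of the Lipschitz class by a standard discretisation argument: grid the domain finely enough that f cannot move more than ε between consecutive grid points, quantise values to multiples of ε, and count the bits needed to write down one covering function. The resulting bit count exhibits the (pL/ε)^n growth rate of Kolmogorov's result.

```python
import math

def lipschitz_cover_bits(L, p, n, eps):
    """Crude upper bound (in bits) on the eps-entropy of the Lipschitz
    class F^{p,n}_{L,C} with values in [0, 1]; illustrative only.

    Grid [0, p]^n with spacing eps / L, so f changes by at most eps
    between consecutive grid points (visited in a snake order).  Quantise
    values to multiples of eps: consecutive quantised values then differ
    by at most one step, so a covering function is determined by its
    first quantised value plus one increment in {-1, 0, +1} per
    remaining grid point.
    """
    pts_per_axis = math.ceil(p * L / eps) + 1
    grid_points = pts_per_axis ** n
    value_levels = math.floor(1 / eps) + 1     # quantised values in [0, 1]
    return math.log2(value_levels) + (grid_points - 1) * math.log2(3)

# The bound grows like (pL/eps)^n: halving eps with n = 2 roughly
# quadruples the bit count.
for eps in (0.2, 0.1, 0.05):
    print(eps, round(lipschitz_cover_bits(1.0, 1.0, 2, eps), 1))
```

This is only an upper-bound construction; Kolmogorov's theorem says the rate (pL/ε)^n is also essentially necessary.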
\\Ve use this result as a yardstick \nfor our neural network representation. A more powerful result is [18, p.86]: \n\n(2) \n\n\f948 Williamson \n\n---- --------------. \u00abi \n------- \u00abi-1 \n--------- \u00abi-2 \n----------. \u00ab i-3 t<-.;;:>:':::::::::'::::::\" ~:::'::::: \n------------. \u00abi-4 \n\nFigure 1: Illust.ration of some level sets of a function on R2. \n\nTheorel11 1 Let p be a non-negative integer and let 0' E (0,1]. Set s = p + 0'. Let \nF:\";,C(O) denote the space of real functions f defined on [0, p]ll all of whose partial \nderivatives of order p .satisfy a Lipschitz condition with constant L and index 0', \nand are such that \n\nThen for sufficiently small c, \n\nn \n\nfor L ki ::; p. \n\n;=1 \n\n(3) \n\n(4) \n\nwhere A(s, n) and B(s, n) are positive constants depending only on sand n. \n\nWe discuss the implication of this below. \n\n3 A NEURAL NETWORK REPRESENTATION BASED \n\nON LEVEL SETS \n\nWe develop a new neura.l network architecture for representing functions from [0, p]n \nonto [0,1] (the restriction of the range to [0,1] is just a conveni~nce and can be \neasily dropped). The basic idea is to represent approximations f of the function \nf in terms of the level sets of f (see figure 1). Then neural networks a.re used to \napproximate the above sets la(f) of f, where la(f) = {x:f(x) ~ O'} = U.B~al.B(f) \nand la(f) is the O'th level set: IoU) ~ {x:f(x) = O'}. The approximations ia(f) can \n\n~ \n\n-\n\n-\n\n\f\u00a3-Enlropy and the Complexity of reedforward Neural Networks \n\n949 \n\nbe implemented using tluec layer neural nets with threshold logic neurons. These \napproximations are of t.he form \n\nl!,othetlC approximatIOn to the mth component of i o , (f) , \n\nr~ _ __ _ ___ ____ ___ ~A~ ________________ , \n\nC'ca , \n\nlo,(f) = U U n [S(hu),9~m)nS(h_Uj,_(9;m+v>;mJ)]' \n\nAm \n\n11 \n\n111=1 A\",=tj=l \n\n~~ ______________ ~v~ _______________ J \n\n(5) \n\nIl-rectangle of dimensions v>~m x ' . 
\cdots \times \psi_n^{(m)}, \n\nwhere \psi_j^{(m)} is the \"width\" in the jth dimension of the \lambda_m th rectangular part of the \nmth component (disjoint connected subset) C_m^{(i)} of the ith approximate above-set \n\hat{\bar{l}}_{\alpha_i}, C_{\alpha_i} is the number of components of the above-set \bar{l}_{\alpha_i}(f), \Lambda_m is the number of \nconvex n-rectangles (parts) that are required to form an \epsilon'-cover for C_m^{(i)}, u_j \triangleq \n(u_1^{(j)}, ..., u_n^{(j)}), u_m^{(j)} = \delta_{jm}. S(h_{w,\theta}) is the n-half-space defined by the hyperplane \nh_{w,\theta}: \n\nS(h_{w,\theta}) = \{x : h_{w,\theta}(x) \ge 0\}, \qquad (6) \n\nwhere h_{w,\theta}(x) = w \cdot x - \theta and w = (w_1, ..., w_n). \nThe function f is then approximated by \n\n\hat{f}_{sets}(x) = \frac{1}{2N} + \frac{1}{N} \sum_{i=1}^{N} 1_{\hat{\bar{l}}_{\alpha_i}(f)}(x), \qquad (7) \n\nwhere \alpha_i = i/N, i = 1, ..., N and 1_S is the indicator function of a set S. The \napproximation \hat{f}_{sets}(x) is then further approximated by implementing (5) using \nN 3-layer neural nets in parallel: \n\n\hat{f}_{NN}(x) = \frac{1}{2N} + \frac{1}{N} \sum_{i=1}^{N} NN_i(x). \qquad (8) \n\n4 COMPLEXITY OF THE REPRESENTATION \n\nWe now give bounds on the number of nodes needed in this architecture to \nrepresent f \in F_{L,C}^{p,n} with uniform error \epsilon_r, and bounds \non the number of bits needed to represent the weights in such an approximation. \n\nTheorem 2 The number of nodes needed in a neural network of the above archi­tecture in order to represent any f \in F_{L,C}^{p,n} to within \epsilon_r in the sup-metric is given \nby \n\n\frac{1}{2\epsilon_r^2} + \frac{npL}{2\epsilon_r} + \left(1 + \frac{n}{2}\right)\left(\frac{pL}{4\epsilon_r}\right)^n. \qquad (9) \n\nFigure 2: The neural network architecture we adopt (the input x has dim x = n, \nand subnetwork NN_i approximates \hat{\bar{l}}_{\alpha_i}(f)). \n\nThis theorem is proved in a straightforward manner by taking account of all the \nerrors incurred in the approximation of a worst-case function in F_{L,C}^{p,n}. 
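Equation (7) on its own already achieves a uniform error of 1/(2N) when the exact above-sets are used; the remaining error in the full construction comes from replacing each above-set by a union of n-rectangles realised with threshold units. The sketch below (illustrative code, not the paper's network construction; the name f_sets is my own) checks the 1/(2N) bound numerically in one dimension.

```python
import numpy as np

def f_sets(f, x, N):
    """Level-set approximation of eq. (7), using the exact above-sets
    {x : f(x) >= alpha_i}, alpha_i = i/N, in place of their neural-net
    n-rectangle approximations:
        hat f(x) = 1/(2N) + (1/N) * sum_i 1[f(x) >= alpha_i]."""
    alphas = np.arange(1, N + 1) / N
    indicators = f(x)[None, :] >= alphas[:, None]   # N x len(x) boolean
    return 1.0 / (2 * N) + indicators.sum(axis=0) / N

# A smooth [0,1]-valued Lipschitz example on [0,1]:
f = lambda x: 0.5 + 0.5 * np.sin(6 * x)
x = np.linspace(0.0, 1.0, 2001)

for N in (4, 16, 64):
    err = np.max(np.abs(f_sets(f, x, N) - f(x)))
    print(N, err)   # sup error never exceeds 1/(2N)
```

Since the sum counts how many thresholds i/N lie below f(x), the approximation is simply f quantised to the nearest midpoint of a 1/N grid, which is where the 1/(2N) term in the node bound originates.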
\nSince comparing the number of nodes alone is inadequate for comparing the com­plexity of neural nets (because the nodes themselves could implement quite complex \nfunctions) we have also calculated the number of bits needed to represent all of the \nweights (including zero weights which denote no connection) in order to achieve an \n\epsilon_r-approximation.^1 \n\nTheorem 3 The number of bits needed to specify the weights in a neural network \nwith the above architecture in order to represent an arbitrary function f \in F_{L,C}^{p,n} \nwith accuracy \epsilon_r in the sup-metric is bounded above by \n\nO\left(\frac{n^2 pL}{2\epsilon_r}\left(\frac{pL}{4\epsilon_r}\right)^n\right). \qquad (10) \n\nEquation 10 can be compared with (2) to see that the neural net representation is \nclose to optimal. It is suboptimal by a factor of O(n^2 pL / \epsilon_r). The (1/4)^n term is considered \nsubsumed into the B(n) term in (2). \n\n^1 The idea of using the number of bits as a measure of network complexity has also \nrecently been adopted in [5]. \n\n5 FURTHER WORK \n\nTheorem 3 shows that the complexity of representing an arbitrary f \in F_{L,C}^{p,n} is \nexponential in n. This is not so much a limitation of the neural network as an \nindication that our problem is too hard. Theorem 1 shows that if smoothness \nconstraints are imposed, then the complexity can be considerably reduced. It is an \nopen problem to determine whether the construction of the network presented in \nthis paper can be extended to make good use of smoothness constraints. \n\nOf course the most important question is whether functions can be learned using \nneural networks. Apropos of this is Stone's result on rates of convergence in non­parametric regression [17]. Although we do not have space to give details here, \nsuffice it to say that he shows that the gains suggested by theorem 1 by imposing \nsmoothness constraints in the representation problem are also achievable in the \nlearning problem. 
A more general statement of this type of result, making explicit \nthe connexion with ε-entropy, is given by Yatracos [19]: \n\nTheorem 4 Let M be an L_1-totally bounded set of measures on a probability space. \nLet the metric defined on the space be the L_1-distance between measures. Then there \nexists a uniformly consistent estimator \hat{\theta}_i for some parameter \theta from a possibly \ninfinite dimensional family of measures \Theta \subset M whose rate of convergence a_i in i \nasymptotically satisfies the equation \n\na_i = \left[\frac{H_{a_i}(\Theta)}{i}\right]^{1/2}, \qquad (11) \n\nwhere H_\epsilon(\Theta) is the ε-entropy of \Theta. \n\nSimilar results have been discussed by Ben-David et al. [3] (who have made use of \nDudley's (loose) relationships between ε-entropy and Vapnik-Chervonenkis dimen­sion [8]) and others [12,13]. There remain many open problems in this field. One of \nthe main difficulties however is the calculation of H_\epsilon for non-trivial function classes. \nOne of the most significant results would be a complete and tight determination of \nthe ε-entropy for a feedforward neural network. \n\nAcknowledgements \n\nThis work was supported in part by a grant from ATERB. I thank Andrew Paice \nfor many useful discussions. \n\nReferences \n\n[1] V. I. Arnold, Representation of Continuous Functions of Three Variables by the Su­perposition of Continuous Functions of Two Variables, Matematicheskii Sbornik \n(N.S.), 48 (1959), pp. 3-74. Translation in American Mathematical Society Trans­lations, Series 2, 28 (1959), pp. 61-147. \n\n[2] A. R. Barron, Statistical Properties of Artificial Neural Networks, in Proceedings of \nthe 28th Conference on Decision and Control, 1989, pp. 280-285. \n\n[3] S. Ben-David, A. Itai and E. Kushilevitz, Learning by Distances, in Proceedings of \nthe Third Annual Workshop on Computational Learning Theory, M. Fulk and \nJ. Case, eds., Morgan Kaufmann, San Mateo, 1990, pp. 232-245. 
\n\n[4] A. Blumer, A. Ehrenfeucht, D. Haussler and M. K. Warmuth, Learnability and \nthe Vapnik-Chervonenkis Dimension, Journal of tIle Association for Computing \nMachinery, 36 (1989), pp. 929-965. \n\n[5] \n\nJ. Bruck and J. W. Goodman, On the Power of Neural Networks for Solving Hard \n\nProblems, Journal of Complexity, 6 (1990), pp. 129-135. \n\n[6] S. M. Carroll and B. W. Dickinson, Construction of Neural Nets using the Radon \n\nTransform, in Proceedings of the International Joint Conference on Neural Net(cid:173)\nworks, 1989, pp. 607-611, (Volume I). \n\n[7] G. Cybenko, Approximation by Superpositions of a Sigmoidal Function, Mathemat(cid:173)\n\nics of Control, Signals, alld Systems, 2 (1989), pp. 303-314. \n\n[8] R. M. Dudley, A Course on Empirical Processes, in Ecole d'Ete de Probabilites \nde Saillt-Flour XII-19S2, R. M. Dudley, H. Kunitay and F. Ledrappier, eds., \nSpringer-Verlag, Berlin. 1984, pp. 1-142, Lecture Notes in Mathematics 1097. \n\n[9] A. Ehreufeucht, D. Haussler, M. Kearns and L. Valiant, A General Lower Bound on \nthe Number of Examples Needed for Learning, Information and Computation, \n82 (1989), pp. 247-261. \n\n[10] K. -I. Funahashi, On the Approximate Realization of Continuous Mappings by Neu(cid:173)\n\nral Networks, Neural Networks, 2 (1989), pp. 183-192. \n\n[11] S. I. Gallant, A Connectionist Learning Algorithm with Provable Generalization and \n\nScaling Bounds, Neural Networks, 3 (1990), pp. 191-201. \n\n[12] S. van de Geer, A New Approach to Least-Squares Estimation with Applications, \n\nThe Annals of Statistics, 15 (1987), pp. 587-602. \n\n[13] R. Hasminskii and I. Ibragimov, On Density Estimation in the View of Kolmogorov's \nIdeas in Approximation Theory, The Annals of Statistics, 18 (1990), pp. 999-\n1010. \n\n[14] R. Hecht-Nielsen, Theory of the Backpropagation Neural Network, in Proceedings \nof the Internatiollal Joint Conference on Neural Networks, 1989, pp. 593-605, \nVolume 1. \n\n[15] K. Hornik, M. 
Stinchcombe and H. White, Multilayer Feedforward Networks are \nUniversal Approximators, Neural Networks, 2 (1989), pp. 359-366. \n\n[16] A. N. Kolmogorov and V. M. Tihomirov, ε-Entropy and ε-Capacity of Sets in Func­tional Spaces, Uspekhi Mat. Nauk (N.S.), 14 (1959), pp. 3-86. Translation in American \nMathematical Society Translations, Series 2, 17 (1961), pp. 277-364. \n\n[17] C. J. Stone, Optimal Global Rates of Convergence for Nonparametric Regression, \nThe Annals of Statistics, 10 (1982), pp. 1040-1053. \n\n[18] A. G. Vitushkin, Theory of the Transmission and Processing of Information, Perg­amon Press, Oxford, 1961. Originally published as Otsenka slozhnosti zadachi \ntabulirovaniya (Estimation of the Complexity of the Tabulation Problem), Fiz­matgiz, Moscow, 1959. \n\n[19] Y. G. Yatracos, Rates of Convergence of Minimum Distance Estimators and Kol­mogorov's Entropy, The Annals of Statistics, 13 (1985), pp. 768-774. \n"}, "award": [], "sourceid": 386, "authors": [{"given_name": "Robert", "family_name": "Williamson", "institution": null}]}