{"title": "Some Estimates of Necessary Number of Connections and Hidden Units for Feed-Forward Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 639, "page_last": 646, "abstract": "", "full_text": "Some Estimates of Necessary Number of \n\nConnections and Hidden Units for \n\nFeed-Forward Networks \n\nAdam Kowalczyk \n\nTelecom Australia, Research Laboratories \n\n770 Blackburn Road, Clayton, Vic. 3168, Australia \n\n(a.kowalczyk@trl.oz.au) \n\nAbstract \n\nThe feed-forward networks with fixed hidden units (FllU-networks) \nare compared against the category of remaining feed-forward net(cid:173)\nworks with variable hidden units (VHU-networks). Two broad \nclasses of tasks on a finite domain X C R n are considered: ap(cid:173)\nproximation of every function from an open subset of functions on \nX and representation of every dichotomy of X. For the first task \nit is found that both network categories require the same minimal \nnumber of synaptic weights. For the second task and X in gen(cid:173)\neral position it is shown that VHU-networks with threshold logic \nhidden units can have approximately lin times fewer hidden units \nthan any FHU-network must have. \n\n1 \n\nIntroduction \n\nA good candidate artificial neural network for short term memory needs to be: (i) \neasy to train, (ii) able to support a broad range of tasks in a domain of interest and \n(iii) simple to implement. The class of feed-forward networks with fixed hidden \nunits (HU) and adjustable synaptic weights at the top layer only (shortly: FHU(cid:173)\nnetworks) is an obvious candidate to consider in this context. This class covers a \nwide range of networks considered in the past, including the classical perceptron, \nhigher order networks and non-linear associative mapping. Also a number of train(cid:173)\ning algorithms were specifically devoted to this category (e.g. 
perceptron, madaline or pseudoinverse) and a number of hardware solutions were investigated for their implementation (e.g. optical devices [8]). \n\nLeaving aside the non-trivial tasks of constructing the domain specific HU for a FHU-network [9] and then optimally loading specific tasks, in this paper we concentrate on assessing the ability of such structures to support a wide range of tasks in comparison to more complex feed-forward networks with multiple layers of variable HU (VHU-networks). More precisely, on a finite domain X two benchmark tests are considered: approximation of every function from an open subset of functions on X and representation of every dichotomy of X. Some necessary and sufficient estimates of the minimal necessary numbers of adaptable synaptic weights and of HU are obtained and then combined with some sufficient estimates in [10] to provide the final results. In the Appendix we present an outline of some of our recent results on the extension of the classical Function-Counting Theorem [2] to the multilayer case and discuss some of its implications for assessing network capacities. \n\n2 Statement of the main results \n\nIn this paper X will denote a subset of $R^n$ of N points. Of interest to us are multilayer feed-forward networks (shortly FF-networks) $F_w : X \to R$, depending on the k-tuple $w = (w_1, \ldots, w_k) \in R^k$ of adjustable synaptic weights to be selected on loading the desired tasks to the network. The FF-networks are split into the two categories defined above: \n\n\u2022 FHU-networks with fixed hidden units $\phi_i : X \to R$: $F_w(x) := \sum_{i=1}^{k} w_i \phi_i(x)$ for $x \in X$, (1) \n\n\u2022 VHU-networks with variable hidden units $\psi_{w'',i} : X \to R$ depending on some adjustable synaptic weights $w''$, where $w = (w', w'') \in R^{k'} \times R^{k''} = R^k$: $F_w(x) := \sum_{i=1}^{k'} w'_i \psi_{w'',i}(x)$ for $x \in X$. (2) \n\nOf special interest are situations where hidden units are built from one or more layers of artificial neurons, which, for simplicity, can be thought of as devices computing simple functions of the form $(y_1, \ldots, y_m) \in R^m \mapsto \sigma(w_{i1} y_1 + w_{i2} y_2 + \cdots + w_{im} y_m)$, where $\sigma : R \to R$ is a non-decreasing squashing function. Two particular examples of squashing functions are (i) the infinitely differentiable sigmoid function $t \mapsto (1 + \exp(-t))^{-1}$ and (ii) the step function $\theta(t)$ defined as 1 for $t \geq 0$ and 0 otherwise. In the latter case the artificial neuron is called a threshold logic neuron (ThL-neuron). \n\nIn the formulation of results below all biases are treated as synaptic weights attached to links from special constant HUs ($= 1$). \n\n2.1 Function approximation \n\nThe space $R^X$ of all real functions on X has the natural structure of a vector space isomorphic with $R^N$. We introduce the euclidean norm $\|f\| := (\sum_{x \in X} f^2(x))^{1/2}$ on $R^X$ and denote by $U \subset R^X$ an open, non-empty subset. We say that the FF-network $F_w$ can approximate a function f on X with accuracy $\epsilon > 0$ if $\|f - F_w\| < \epsilon$ for some weight vector $w \in R^k$. \n\nTheorem 1 Assume the FF-network $F_w$ is continuously differentiable with respect to the adjustable synaptic weights $w \in R^k$ and $k < N$. If it can approximate any function in U with any accuracy, then for almost every function $f \in U$: if $\lim_{i \to \infty} \|F_{w(i)} - f\| = 0$, where $w(1), w(2), \ldots \in R^k$, then $\lim_{i \to \infty} \|w(i)\| = \infty$. \n\nIn the above theorem \"almost every\" means with the exception of a subset of Lebesgue measure 0 on $R^X \cong R^N$. The proof of this theorem relies on the use of Sard's theorem from differential topology (c.f. Section 3). Note that the above theorem is applicable in particular to the popular \"back-propagation\" network, which is typically built from artificial neurons with the continuously differentiable sigmoid squashing function. \n\nThe proof of the following theorem uses a different approach, since the network does not depend differentiably on its synaptic weights to HUs. This theorem applies in particular to the classical FF-networks built from ThL-neurons. \n\nTheorem 2 A FF-network $F_w$ must have $\geq N$ HU in the top hidden layer if all units of this layer have a finite number of activation levels and the network can approximate any function in U with any accuracy. \n\nThe above theorems mean in particular that if we want to achieve an arbitrarily good approximation of any function in $U := \{f : X \to R;\ |f(x)| < A\}$, where $A > 0$, and we can use one of the VHU-networks of the above type with synaptic weights of restricted magnitude only, then we have to have at least N such weights. However, that many weights are necessary and sufficient to achieve the same with a FHU-network (1) if the functions $\phi_i$ are linearly independent on X. So variable hidden units give no advantage in this case. \n\n2.2 Implementation of dichotomies \n\nWe say that the FF-network $F_w$ can implement a dichotomy $(X_-, X_+)$ of X if there exists $w \in R^k$ such that $F_w < 0$ on $X_-$ and $F_w > 0$ on $X_+$. \n\nProposition 3 A FHU-network $F_w$ can implement every dichotomy of X if and only if it can exactly compute every function on X. In such a case it must have $\geq N$ HU in the top hidden layer. \n\nThe non-trivial part of the above proposition is the necessity in its first part, i.e. that being able to implement every dichotomy of X requires N (fixed) hidden units. In Section 3.3 we obtain this proposition from a stronger result. 
Note that the above proposition can be deduced from the classical Function-Counting Theorem [2] and also that an equivalent result is proved directly in [3, Theorem 7.2]. \n\nWe say that the points of a subdomain $X \subset R^n$ are in general position if every hyperplane in $R^n$ contains no more than n points of X. Note that the points of every finite subdomain of $R^n$ are in general position after a sufficiently small perturbation and that the property of being in general position is preserved under sufficiently small perturbations. Note also that the points of a typical N-point subdomain $X \subset R^n$ are in general position, where \"typical\" means with the exception of subdomains X corresponding to a certain subset of Lebesgue measure 0 in the space $(R^n)^N$ of all N-tuples of points from $R^n$. \n\nIt is proved in [10] that for a subdomain $X \subset R^n$ of N points in general position a VHU-network having $\lceil (N-1)/n \rceil$ (adjustable) ThL-neurons in the first (and the only) hidden layer can implement every dichotomy of X, where $\lceil t \rceil$ denotes the smallest integer $\geq t$. Furthermore, examples are given showing that the above bound is tight. (Note that this paper corrects and gives rigorous proofs of some early results in [1, Lemma 1 and Theorem 1] and also improves [6, Theorem 4].) Combining these results with Proposition 3 we get the following result. \n\nTheorem 4 Assume that all N points of $X \subset R^n$ are in general position. In the class of all FF-networks which can implement every dichotomy of X there exists a VHU-network with threshold logic HU having a fraction $1/n + O(1/N)$ of the number of the HU that any FHU-network in this class must have. There are examples of X in general position of any even cardinality $N > 0$ showing that this estimate is tight. \n\n3 Proofs \n\nBelow we identify functions $f : X \to R$ with N-tuples of their values at the N points of X (ordered in a unique manner). 
Under this identification the FF-network $F_w$ can be regarded as a transformation $w \in R^k \mapsto F_w \in R^N$ (3) with the range $R(F_w) := \{F_w;\ w \in R^k\} \subset R^N$. \n\n3.1 Proof of Theorem 1. \n\nIn this case the transformation (3) is continuously differentiable. Every value of it is singular since $k < N$; thus, according to Sard's Theorem [5], $R(F_w) \subset R^N$ has Lebesgue measure 0. It is enough to show now that if $f \in U - R(F_w)$ (4) and $\lim_{i \to \infty} \|F_{w(i)} - f\| = 0$ and $\|w(i)\| < M$ (5) for some $M > 0$, then a contradiction follows. Actually, from (5) it follows that f belongs to the topological closure $cl(R_M)$ of $R_M := \{F_w;\ w \in R^k\ \&\ \|w\| \leq M\}$. However, $R_M$ is a compact set as a continuous image of the closed ball $\{w \in R^k;\ \|w\| \leq M\}$, so $cl(R_M) = R_M$. Consequently $f \in R_M \subset R(F_w)$, which contradicts (4). Q.E.D. \n\n3.2 Proof of Theorem 2. \n\nWe consider the FF-network (2) for which there exists a finite set $V \subset R$ of s points such that $\psi_{w'',i}(x) \in V$ for every $w'' \in R^{k''}$, $1 \leq i \leq k'$ and $x \in X$. It is sufficient to show that the set $R(F_w)$ of all functions computable by $F_w$ is not dense in U if $k' < N$. Actually, we can write $R(F_w)$ as the union $R(F_w) = \bigcup_{w'' \in R^{k''}} L_{w''}$, (6) where each $L_{w''} := \{\sum_{i=1}^{k'} w'_i \psi_{w'',i};\ w'_1, \ldots, w'_{k'} \in R\} \subset R^N$ is a linear subspace of dimension $\leq k' < N$ uniquely determined by the vectors $\psi_{w'',i} \in V^N \subset R^N$, $i = 1, \ldots, k'$. However, there is a finite number ($\leq s^N$) of different vectors in $V^N$, thus there is only a finite number ($\leq s^{Nk'}$) of different linear subspaces in the family $\{L_{w''};\ w'' \in R^{k''}\}$. Hence, as $k' < N$, the union (6) is a closed nowhere dense subset of $R^N$, being a finite union of proper linear subspaces (each of which is a closed and nowhere dense subset). Q.E.D. \n\n3.3 Proof of Proposition 3. \n\nWe state first a stronger result. 
We say that a set L of functions on X is convex if for any couple of functions $\phi_1, \phi_2 \in L$ and any $\alpha > 0$, $\beta > 0$ with $\alpha + \beta = 1$, the function $\alpha \phi_1 + \beta \phi_2$ also belongs to L. \n\nProposition 5 Let L be a convex set of functions on $X = \{x_1, x_2, \ldots, x_N\}$ implementing every dichotomy of X. Then for each $i \in \{1, 2, \ldots, N\}$ there exists a function $\phi_i \in L$ such that $\phi_i(x_i) \neq 0$ and $\phi_i(x_j) = 0$ for $1 \leq i \neq j \leq N$. \n\nProof. We define a transformation $SGN : R^X \to \{-1, 0, +1\}^N$ by $SGN(\phi) := (sgn(\phi(x_1)), \ldots, sgn(\phi(x_N)))$, where $sgn(\xi) := -1$ if $\xi < 0$, $sgn(0) := 0$ and $sgn(\xi) := +1$ if $\xi > 0$. We denote by $W_k$ the subset of $\{-1, 0, +1\}^N$ of all points $q = (q_1, \ldots, q_N)$ such that $\sum_{i=1}^{N} |q_i| = k$, for $k = 0, 1, \ldots, N$. \n\nWe show first that convexity of L implies, for $k \in \{1, 2, \ldots, N\}$, the following: $W_k \subset SGN(L) \Rightarrow W_{k-1} \subset SGN(L)$. (7) For the proof assume $W_k \subset SGN(L)$ and that $q = (q_1, \ldots, q_N) \in \{-1, 0, +1\}^N$ is such that $\sum_{i=1}^{N} |q_i| = k - 1$. We need to show that there exists $\phi \in L$ such that $SGN(\phi) = q$. (8) The vector q has at least one vanishing entry, say, without loss of generality, $q_1 = 0$. Let $\phi^+$ and $\phi^-$ be two functions in L such that $SGN(\phi^+) = q^+ := (+1, q_2, \ldots, q_N)$ and $SGN(\phi^-) = q^- := (-1, q_2, \ldots, q_N)$. Such $\phi^+$ and $\phi^-$ exist since $q^+, q^- \in W_k$. The function $\phi := \alpha \phi^+ + \beta \phi^-$ with $\alpha := -\phi^-(x_1)/(\phi^+(x_1) - \phi^-(x_1))$ and $\beta := \phi^+(x_1)/(\phi^+(x_1) - \phi^-(x_1))$ belongs to L as a convex combination of two functions from L and satisfies (8). \n\nNow note that the assumptions of the proposition imply that $W_N \subset SGN(L)$. Applying (7) repeatedly we find that $W_1 \subset SGN(L)$, which means that for every index $i$, $1 \leq i \leq N$, there exists a function $\phi_i \in L$ with all entries vanishing but the i-th one. Q.E.D. \n\nNow let us see how Proposition 3 follows from the above result. Sufficiency is obvious. 
For the necessity we observe that the family $F_w$ of functions on X is convex, being a linear space in the case of a FHU-network (1). Now if this network can implement every dichotomy of X, then each function $\phi_i$ as in Proposition 5 equals $F_{w_i}$ for some $w_i \in R^k$. Thus $R(F_w) = R^N$, since those functions make a basis of $R^X \cong R^N$. Q.E.D. \n\n4 Discussion of results \n\nTheorem 1 combined with observations in [4] allows us to make the following contribution to the recent controversy on the relevance/irrelevance of Kolmogorov's theorem on representation of continuous functions $I^n \to R$, $I := [0,1]$ (c.f. [4, 7]), since $I^n$ contains subsets of any cardinality. \n\nThe FF-networks for approximations of continuous functions on $I^n$ with rising accuracy have to be complex in at least one of the following ways: \n\n\u2022 involve adjustment of a diverging number of synaptic weights and hidden units, or \n\n\u2022 require adjustment of synaptic weights of diverging magnitude, or \n\n\u2022 involve selection of \"pathological\" squashing functions. \n\nThus one can only shift complexity from one kind to another, but not eliminate it completely. Although on theoretical grounds one can easily argue the virtues and simplicity of one kind of complexity over another, for a genuine hardware implementation any of them poses an equally serious obstacle. \n\nFor the classes of FF-networks and benchmark tests considered, the networks with multiple hidden layers have no decisive superiority over the simple structures with fixed hidden units unless the dimensionality of the input space is significant. \n\n5 Appendix: Capacity and Function-Counting Theorem \n\nThe above results can be viewed as a step towards estimation of the capacity of networks to memorise dichotomies. 
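The dichotomy counts behind these capacity estimates are easy to tabulate. A minimal sketch under our own assumptions (the function name and chosen numbers are ours): it evaluates the count $2\sum_{i=0}^{d}\binom{N-1}{i}$, which appears below as (10) with $d = nh$, and checks that exactly half of all $2^N$ dichotomies are counted at $N = 2d + 2$, consistent with a capacity of roughly $2d$.

```python
from math import comb

def count_dichotomies(N: int, d: int) -> int:
    # 2 * sum_{i=0}^{d} binom(N-1, i): dichotomies of N points in general
    # position realisable with d degrees of freedom (our sketch; d = n*h below).
    return 2 * sum(comb(N - 1, i) for i in range(d + 1))

d = 6
N = 2 * d + 2
# At N = 2d + 2 exactly half of the 2^N dichotomies are counted.
assert count_dichotomies(N, d) * 2 == 2 ** N
```

For small cases the count is exhaustive: 3 points in the plane admit all $2^3 = 8$ dichotomies, and `count_dichotomies(3, 2)` returns 8.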
We intend to elaborate this subject further now and outline some of our recent results on this matter. A more detailed presentation will be available in future publications. \n\nThe capacity of a network in the sense of Cover [2] (Cover's capacity) is defined as the maximal N such that for a randomly selected subset $X \subset R^n$ of N points with probability 1 the network can implement 1/2 of all dichotomies of X. For a linear perceptron $F_w(x) := \sum_{i=1}^{n} w_i \xi_i$ for $x = (\xi_1, \ldots, \xi_n) \in X$, (9) where $w \in R^n$ is the vector of adjustable synaptic weights, the capacity is $2n$, and it is $2k$ for a FHU-network (1) with suitably chosen hidden units $\phi_1, \ldots, \phi_k$. These results are based on the so-called Function-Counting Theorem proved for the linear perceptron in the sixties (c.f. [2]). Extension of this result to the multilayer case is still an open problem (c.f. T. Cover's talk at NIPS'92). However, we have recently obtained the following partial result in this direction. \n\nTheorem 6 Given a continuous probability density on $R^n$, for a randomly selected subset $X \subset R^n$ of N points the FF-network having the first hidden layer built from h ThL-neurons can implement $C(N, nh) := 2 \sum_{i=0}^{nh} \binom{N-1}{i}$ (10) dichotomies of X with a non-zero probability. Such a network can be constructed using nh variable synaptic weights between input and hidden layer only. \n\nFor $h = 1$ this theorem reduces to its classical form, for which the phrase \"with non-zero probability\" can be strengthened to \"with probability 1\" [2]. \n\nThe proof of the theorem develops Sakurai's idea of utilising the Vandermonde determinant to show the following property of the curve $c(t) := (t, t^2, \ldots, t^n)$, $t > 0$: \n\n(*) for any subset X of N points $x_1 = c(t_1), \ldots, x_N = c(t_N)$, $t_1 < t_2 < \cdots < t_N$, any hyperplane in $R^n$ can intersect no more than n different segments $[x_i, x_{i+1}]$ of c. 
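Property (*) can be checked numerically: along the curve, the values $a \cdot c(t) + b$ of an affine functional form a polynomial of degree at most n in t, so the sign sequence along $t_1 < \cdots < t_N$ changes at most n times. A small sketch under our own assumptions (the dimension, sample sizes and random hyperplanes are illustrative, not from the paper):

```python
import numpy as np

# Numeric check of property (*) (our sketch): a hyperplane a.x + b = 0 in R^n
# separates at most n of the consecutive segments [x_i, x_{i+1}] of the curve,
# because a.c(t) + b is a polynomial of degree <= n in t, with at most n roots.
rng = np.random.default_rng(1)
n, N = 3, 50
ts = np.sort(rng.uniform(0.1, 5.0, N))
X = np.stack([ts ** (j + 1) for j in range(n)], axis=1)   # x_i = c(t_i)

for _ in range(200):
    a, b = rng.standard_normal(n), rng.standard_normal()
    signs = np.sign(X @ a + b)
    crossings = int(np.sum(signs[:-1] * signs[1:] < 0))   # segments separated
    assert crossings <= n
```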
The first step of the proof is to observe that the property (*) itself implies that the count (10) holds for such a set X. The second and crucial step consists in showing that, for a sufficiently small $\epsilon > 0$, for any selection of points $x_1, \ldots, x_N \in R^n$ with $\|x_i - x_i'\| < \epsilon$ for $i = 1, \ldots, N$, there exists a curve c passing through these points and also satisfying the property (*). \n\nTheorem 6 implies that in the class of multilayer FF-networks having the first hidden layer built from ThL-neurons only, the single hidden layer networks are the most efficient, since the higher layers have no influence on the number of implemented dichotomies (at least for the class of domains $X \subset R^n$ considered). \n\nNote that by virtue of (10) and the classical argument of Cover [2], for the class of domains X as in Theorem 6 the capacity of the network considered is $2nh$. Thus the following estimates hold. \n\nCorollary 7 In the class of FF-networks with a fixed number h of hidden units, the ratio of the maximal capacity per hidden unit achievable by a FHU-network to the maximal capacity per hidden unit achievable by VHU-networks having the ThL-neurons in the first hidden layer only is $2h/2nh = 1/n$. The analogous ratio for capacities per variable synaptic weight (in the class of FF-networks with a fixed number s of variable synaptic weights) is $\leq 2s/2s = 1$. \n\nAcknowledgement. I thank A. Sakurai of Hitachi Ltd. for helpful comments leading to the improvement of results of the paper. The permission of the Director, Telecom Australia Research Laboratories, to publish this material is gratefully acknowledged. \n\nReferences \n\n[1] E. Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4:193-215, 1988. \n\n[2] T.M. Cover. Geometrical and statistical properties of linear inequalities with applications to pattern recognition. IEEE Trans. Elec. 
Comp., EC-14:326-334, 1965. \n\n[3] R.M. Dudley. Central limit theorems for empirical measures. Ann. Probability, 6:899-929, 1978. \n\n[4] F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation, 1:465-469, 1989. \n\n[5] M. Golubitsky and V. Guillemin. Stable Mappings and Their Singularities. Springer-Verlag, New York, 1973. \n\n[6] S. Huang and Y. Huang. Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Transactions on Neural Networks, 2:47-55, 1991. \n\n[7] V. Kurkova. Kolmogorov's theorem is relevant. Neural Computation, 1, 1992. \n\n[8] D. Psaltis, C.H. Park, and J. Hong. Higher order associative memories and their optical implementations. Neural Networks, 1:149-163, 1988. \n\n[9] N. Redding, A. Kowalczyk, and T. Downs. Higher order separability and minimal hidden-unit fan-in. In T. Kohonen et al., editors, Artificial Neural Networks, volume 1, pages 25-30. Elsevier, 1991. \n\n[10] A. Sakurai. n-h-1 networks store no less than n\u00b7h+1 examples but sometimes no more. In Proceedings of IJCNN92, pages III-936-III-941. IEEE, June 1992.", "award": [], "sourceid": 695, "authors": [{"given_name": "Adam", "family_name": "Kowalczyk", "institution": null}]}