{"title": "Computing with Almost Optimal Size Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 26, "abstract": null, "full_text": "Computing with Almost Optimal Size Neural Networks \n\nKai-Yeung Siu \n\nDept. of Electrical & Comp. Engineering \n\nUniversity of California, Irvine \n\nIrvine, CA 92717 \n\nVwani Roychowdhury \n\nSchool of Electrical Engineering \n\nPurdue University \n\nWest Lafayette, IN 47907 \n\nThomas Kailath \n\nInformation Systems Laboratory \n\nStanford University \nStanford, CA 94305 \n\nAbstract \n\nArtificial neural networks are composed of an interconnected collection \nof certain nonlinear devices; examples of commonly used devices include \nlinear threshold elements, sigmoidal elements and radial-basis elements. \nWe employ results from harmonic analysis and the theory of rational \napproximation to obtain almost tight lower bounds on the size (i.e. number \nof elements) of neural networks. The class of neural networks to which \nour techniques can be applied is quite general; it includes any feedforward \nnetwork in which each element can be piecewise approximated by a low \ndegree rational function. For example, we prove that any depth-(d + 1) \nnetwork of sigmoidal units or linear threshold elements computing the \nparity function of n variables must have Ω(dn^{1/d-ε}) size, for any fixed ε > 0. \nIn addition, we prove that this lower bound is almost tight by showing \nthat the parity function can be computed with O(dn^{1/d}) sigmoidal units \nor linear threshold elements in a depth-(d + 1) network. These almost \ntight bounds are the first known complexity results on the size of neural \nnetworks with depth more than two. Our lower bound techniques yield \na unified approach to the complexity analysis of various models of neural \nnetworks with feedforward structures. 
Moreover, our results indicate that \nin the context of computing highly oscillating symmetric Boolean \nfunctions, networks of continuous-output units such as sigmoidal elements do \nnot offer significant reduction in size compared with networks of linear \nthreshold elements of binary outputs. \n\n1 Introduction \n\nRecently, artificial neural networks have found wide applications in many areas \nthat require solutions to nonlinear problems. One reason for such success is the \nexistence of good \"learning\" or \"training\" algorithms such as Backpropagation [13] \nthat provide solutions to many problems for which traditional attacks have failed. \nAt a more fundamental level, the computational power of neural networks comes \nfrom the fact that each basic processing element computes a nonlinear function \nof its inputs. Networks of these nonlinear elements can yield solutions to highly \ncomplex and nonlinear problems. On the other hand, because of the nonlinear \nfeatures, it is very difficult to study the fundamental limitations and capabilities of \nneural networks. Undoubtedly, any significant progress in the applications of neural \nnetworks must require a deeper understanding of their computational properties. \n\nWe employ classical tools such as harmonic analysis and rational approximation \nto derive new results on the computational complexity of neural networks. The \nclass of neural networks to which our techniques can be applied is quite large; it \nincludes feedforward networks of sigmoidal elements, linear threshold elements, and \nmore generally, elements that can be piecewise approximated by low degree rational \nfunctions. \n\n1.1 Background, Related Work and Definitions \n\nA widely accepted model of neural networks is the feedforward multilayer network \nin which the basic processing element is a sigmoidal element. 
A sigmoidal element \ncomputes a function f(X) of its input variables X = (x_1, ..., x_n) such that \n\nf(X) = σ(F(X)) = 2 / (1 + e^{-F(X)}) - 1 = (1 - e^{-F(X)}) / (1 + e^{-F(X)}) \n\nwhere \n\nF(X) = Σ_{i=1}^{n} w_i · x_i + w_0. \n\nThe real valued coefficients w_i are commonly referred to as the weights of the \nsigmoidal function. The case that is of most interest to us is when the inputs are \nbinary, i.e., X ∈ {1, -1}^n. We shall refer to this model as a sigmoidal network. \nAnother common feedforward multilayer model is one in which each basic processing \nunit computes a binary linear threshold function sgn(F(X)), where F(X) is the \nsame as above, and \n\nsgn(F(X)) = 1 if F(X) ≥ 0, and -1 if F(X) < 0. \n\nThis model is often called the threshold circuit in the literature and recently has \nbeen studied intensively in the field of computer science. \n\nThe size of a network/circuit is the number of elements. The depth of a \nnetwork/circuit is the length of the longest path from any input gate to the output gates. We can \narrange the gates in layers so that all gates in the same layer compute concurrently. \n(A single element can be considered as a one-layer network.) Each layer costs a \nunit delay in the computation. The depth of the network (which is the number of \nlayers) can therefore be interpreted as the time for (parallel) computation. \n\nIt has been established that the threshold circuit is a very powerful model of \ncomputation. Many functions of common interest such as multiplication, division and \nsorting can be computed in polynomial-size threshold circuits of small constant depth \n[19, 18, 21]. While many upper bound results for threshold circuits are known in \nthe literature, lower bound results have only been established for restricted cases of \nthreshold circuits. 
Most of the existing lower bound techniques [10, 17, 16] apply \nonly to depth-2 threshold circuits. In [16], novel techniques which utilized \nanalytical tools from the theory of rational approximation were developed to obtain lower \nbounds on the size of depth-2 threshold circuits that compute the parity function. \nIn [20], we generalized the methods of rational approximation and our earlier \ntechniques based on harmonic analysis to obtain the first known almost tight lower \nbounds on the size of threshold circuits with depth more than two. In this paper, \nthe techniques are further generalized to yield almost tight lower bounds on the \nsize of a more general class of neural networks in which each element computes a \ncontinuous function. \n\nThe presentation of this paper will be divided into two parts. In the first part, we \nshall focus on results concerning threshold circuits. In the second part, the lower \nbound results presented in the first part are generalized and shown to be valid even \nwhen the elements of the networks can assume continuous output values. The class \nof networks to which such techniques can be applied includes networks of sigmoidal \nelements and radial basis elements. Due to space limitations, we shall only state \nsome of the important results; further results and detailed proofs will appear in an \nextended paper. \n\nBefore we present our main results, we shall give formal definitions of the neural \nnetwork models and introduce some of the Boolean functions, which will be used \nto explore the computational power of the various networks. To present our results \nin a coherent fashion, we define throughout this paper a Boolean function as f : \n{1, -1}^n → {1, -1}, instead of using the usual {0, 1} notation. 
\n\nDefinition 1 A threshold circuit is a Boolean circuit in which every gate \ncomputes a linear threshold function with an additional property: the weights are \nintegers all bounded by a polynomial in n. □ \n\nRemark 1 The assumption that the weights in the threshold circuits are integers \nbounded by a polynomial is common in the literature. In fact, the best known lower \nbound result on depth-2 threshold circuits [10] does not apply to the case where \nexponentially large weights are allowed. On the other hand, such an assumption does \nnot pose any restriction as far as constant-depth and polynomial-size is concerned. \nIn other words, the class of constant-depth polynomial-size threshold circuits (TC^0) \nremains the same when the weights are allowed to be arbitrary. This result was \nimplicit in [4] and was improved in [18] by showing that any depth-d threshold circuit \nwith arbitrary weights can be simulated by a depth-(2d + 1) threshold circuit of \npolynomially bounded weights at the expense of a polynomial increase in size. More \nrecently, it has been shown [8] that any polynomial-size depth-d threshold circuit with \narbitrary weights can be simulated by a polynomial-size depth-(d + 1) threshold \ncircuit. □ \n\nIn addition to Boolean circuits, we shall also be interested in the computation of \nBoolean functions by networks of continuous-valued elements. To formalize this \nnotion, we adopt the following definitions [12]: \n\nDefinition 2 Let γ : R → R. A γ element with weights w_1, ..., w_m ∈ R and \nthreshold t is defined to be an element that computes the function γ(Σ_{i=1}^{m} w_i x_i - t), \nwhere (x_1, ..., x_m) is the input. A γ-network is a feedforward network of γ elements \nwith an additional property: the weights w_i are all bounded by a polynomial in n. □ \n\nFor example, when γ is the sigmoidal function σ(x), then we have a sigmoidal \nnetwork, a common model of neural network. 
In fact, a threshold circuit can also \nbe viewed as a special case of a γ network where γ is the sgn function. \n\nDefinition 3 A γ-network C is said to compute a Boolean function f : \n{1, -1}^n → {1, -1} with separation ε > 0 if there is some t_C ∈ R such that \nfor any input X = (x_1, ..., x_m) to the network C, the output element of C outputs \na value C(X) with the following property: If f(X) = 1, then C(X) ≥ t_C + ε. If \nf(X) = -1, then C(X) ≤ t_C - ε. □ \n\nRemark 2 As pointed out in [12], computing with γ networks without separation \nat the output element is less interesting because an infinitesimal change in the \noutput of any γ element may change the output bit. In this paper, we shall be \nmainly interested in computations on γ networks C_n with separation at least Ω(n^{-k}) \nfor some fixed k > 0. This together with the assumption of polynomially bounded \nweights makes the complexity class of constant-depth polynomial-size γ networks \nquite robust and more interesting to study from a theoretical point of view (see \n[12]). □ \n\nDefinition 4 The PARITY function of X = (x_1, x_2, ..., x_n) ∈ {1, -1}^n is \ndefined to be -1 if the number of -1's in the variables x_1, ..., x_n is odd and +1 otherwise. \nNote that this function can be represented as the product Π_{i=1}^{n} x_i. □ \n\nDefinition 5 The Complete Quadratic (CQ) function [3] is defined to be the \nfollowing: \n\nCQ(X) = (x_1 ∧ x_2) ⊕ (x_1 ∧ x_3) ⊕ ··· ⊕ (x_{n-1} ∧ x_n) \n\ni.e. CQ(X) is the sum modulo 2 of all AND's between the (n choose 2) pairs of distinct \nvariables. Note that it is also a symmetric function. □ \n\n2 Results for Threshold Circuits \n\nFor 
the lower bound results on threshold circuits, a central idea of our proof is the \nuse of a result from the theory of rational approximation which states the following \n[9]: the function sgn(x) can be approximated with an error of O(e^{-ck/log(1/ε)}) by \na rational function of degree k for 0 < ε < |x| < 1. (In [16], they apply an \nequivalent result [15] that gives an approximation to the function |x| instead of \nsgn(x).) This result allows us to approximate several layers of threshold gates by a \nrational function of low (i.e. logarithmic) degree when the size of the circuit is small. \nThen by upper bounding the degree of the rational function that approximates the \nPARITY function, we give a lower bound on the size of the circuit. We also give \na similar lower bound on the Complete Quadratic (CQ) function using the same \ndegree argument. By generalizing the 'telescoping' techniques in [14], we show an \nalmost matching upper bound on the size of the circuits computing the PARITY \nand the CQ functions. We also examine circuits in which additional gates other \nthan the threshold gates are allowed and generalize the lower bound results in this \nmodel. For this purpose, we introduce tools from harmonic analysis of Boolean \nfunctions [11, 3, 18, 17]. We define the class of functions called SP such that \nevery function in SP can be closely approximated by a sparse polynomial for all \ninputs. For example, it can be shown [18] that the class SP contains the functions \nAND, OR, COMPARISON and ADDITION, and more generally, functions that \nhave polynomially bounded spectral norms. \n\nThe main results on threshold circuits can be summarized by the following \ntheorems. First we present an explicit construction for implementing PARITY. This \nconstruction applies to any 'periodic' symmetric function, such as the CQ function. 
\n\nTheorem 1 For every d < log n, there exists a depth-(d + 1) threshold circuit \nwith O(dn^{1/d}) gates that computes the PARITY function. □ \n\nWe next show that any depth-(d + 1) threshold circuit computing the PARITY \nfunction or the CQ function must have size Ω(dn^{1/d-ε}) for any fixed ε > 0. This \nresult also holds for any function that has strong degree Ω(n). \n\nTheorem 2 Any depth-(d + 1) threshold circuit computing the PARITY (CQ) \nfunction must have size Ω(dn^{1/d} / log^2 n). □ \n\nWe also consider threshold circuits that approximate the PARITY and the CQ \nfunctions when we have random inputs which are uniformly distributed. We derive \nalmost tight upper and lower bounds on the size of the approximating threshold \ncircuits. \n\nWe next consider threshold circuits with additional gates and prove the following \nresult. \n\nTheorem 3 Suppose in addition to threshold gates, we have polynomially many \ngates ∈ SP in the first layer of a depth-2 threshold circuit that computes the CQ \nfunction. Then the number of threshold gates required in the circuit is Ω(n / log^2 n). □ \n\nThis result can be extended to higher depth circuits when additional gates that \nhave low degree polynomial approximations are allowed. \n\nRemark 3 Recently Beigel [2], using techniques similar to ours and the fact \nthat the PARITY function cannot be computed in polynomial-size constant-depth \ncircuits of AND, OR gates [7], has shown that any constant-depth threshold circuit \nwith 2^{n^{o(1)}} AND, OR gates but only o(log n) threshold gates cannot compute the \nPARITY function of n variables. □ \n\n3 Results for γ-Networks \n\nIn the second part of the paper, we consider the computational power of networks \nof continuous-output elements. A celebrated result in this area was obtained by \nCybenko [5]. 
It was shown in [5] that any continuous function over a compact \ndomain can be closely approximated by sigmoidal networks with two layers. More \nrecently, Barron [1] has significantly strengthened this result by showing that a \nwide class of functions can be approximated with mean squared error of O(n^{-1}) \nby two-layer sigmoidal networks of only n elements. Here we are interested in \nnetworks of continuous-output elements computing Boolean functions instead of \ncontinuous functions. See Section 1.1 for a precise definition of computation of \nBoolean functions by a γ-network. \n\nWhile quite a few techniques have been developed for deriving lower bound results \non the complexity of threshold circuits, the power and the \nlimitations of networks of continuous elements such as sigmoidal networks, especially \nas compared to threshold circuits, have not been explored. For example, we would \nlike to answer questions such as: how much added computational power does one \ngain by using sigmoidal elements or other continuous elements to compute Boolean \nfunctions? Can the size of the network be reduced by using sigmoidal elements \ninstead of threshold elements? \n\nIt was shown in [12] that when the depth of the network is restricted to be two, \nthere is a Boolean function of n variables that can be computed in a depth-2 \nsigmoidal network with a fixed number of elements, but requires a depth-2 threshold \ncircuit with size that increases at least logarithmically in n. In other words, in the \nrestricted case of depth-2 networks, one can reduce the size of the network by at least \na logarithmic factor by using continuous elements such as the sigmoidal elements \ninstead of threshold elements with binary output values. 
This result has recently \nbeen improved in [6], where it is shown that there exists an explicit function that \ncan be computed using only a constant number of sigmoidal gates, and that any \nthreshold circuit (irrespective of the depth) computing it must have size Ω(log n). \n\nThese results motivate the following question: Can we characterize a class of \nfunctions for which the threshold circuits computing the functions have sizes at most a \nlogarithmic factor larger than the sizes of the sigmoidal networks computing them? \nBecause of the monotonicity of the sigmoidal functions, we do not expect that \nthere is substantial gain in the computational power over the threshold elements \nfor computing the class of highly oscillating functions. \n\nIt is natural to extend our techniques to sigmoidal networks by approximating \nsigmoidal functions with rational functions. We derive a key lemma that yields \na single low degree rational approximation to any function that can be piecewise \napproximated by low degree rational functions. \n\nLemma 1 Let f be a continuous function over Δ = [a, b]. Let Δ_1 = [a, c] and \nΔ_2 = [c, b], a < c < b. Denote ||g||_Δ = sup_{x ∈ Δ} |g(x)|. Suppose there are rational \nfunctions r_1 and r_2 such that \n\n||f - r_i||_{Δ_i} ≤ ε', i = 1, 2, \n\nwhere ε' > 0. Then for each ε > 0 and δ > 0, there is a rational function r such that \n\n[...] \n\ndeg r ≤ 2 deg r_1 + 2 deg r_2 + C_1 log(e + (b - a)/δ) log(e + ||f||_Δ / ε) (1) \n\nwhere ω(f; δ)_Δ is the modulus of continuity of f over Δ, and C_1 is a constant. □ \n\nThe above lemma is applied to show that both sigmoidal functions and radial basis \nfunctions can be closely approximated by low degree rational functions. 
In fact \nthe above lemma can be generalized to show that if a continuous function can \nbe piecewise approximated by low degree rational functions over k = log^{O(1)} n \nconsecutive intervals, then it can be approximated by a single low degree rational \nfunction over the union of these intervals. \n\nThese generalized approximation results enable us to show that many of our lower \nbound results on threshold circuits can be carried over to sigmoidal networks. Prior \nto our work, there was no nontrivial lower bound on the size of sigmoidal networks \nwith depth more than two. In fact, we can generalize our results to neural networks \nwhose elements can be piecewise approximated by low degree rational functions. \nWe show in this paper that for symmetric Boolean functions of large strong degree \n(e.g. the parity function), any depth-d network whose elements can be piecewise \napproximated by low degree rational functions requires almost the same size as a \ndepth-d threshold circuit computing the function. \n\nIn particular, if K is the class of polynomially bounded functions that are piecewise \ncontinuous and can be piecewise approximated with low degree rational functions, \nthen we prove the following theorem. \n\nTheorem 4 Let W be any depth-(d + 1) neural network in which each element \nv_j computes a function f_j(Σ_i w_i x_i), where f_j ∈ K and Σ_i |w_i| ≤ n^{O(1)} for each \nelement. If the network W computes the PARITY function of n variables with \nseparation δ, where 0 < δ = Ω(n^{-k}) for some k > 0, then for any fixed ε > 0, W \nmust have size Ω(dn^{1/d-ε}). □ \n\nReferences \n\n[1] A. Barron. Universal Approximation Bounds for Superpositions of a Sigmoidal \nFunction. IEEE Transactions on Information Theory, to appear. \n\n[2] R. Beigel. Polylog(n) Majority or O(log log n) Symmetric Gates are Equivalent \nto One. ACM Symposium on Theory of Computing (STOC), 1992. \n\n[3] J. Bruck. Harmonic Analysis of Polynomial Threshold Functions. 
SIAM \nJournal on Discrete Mathematics, pages 168-177, May 1990. \n\n[4] A. K. Chandra, L. Stockmeyer, and U. Vishkin. Constant depth reducibility. \nSIAM J. Comput., 13:423-439, 1984. \n\n[5] G. Cybenko. Approximations by superpositions of a sigmoidal function. \nMath. Control, Signals, Systems, vol. 2, pages 303-314, 1989. \n\n[6] B. Dasgupta and G. Schnitger. Efficient Approximation with Neural Networks: \nA Comparison of Gate Functions. In 5th Annual Conference on Neural \nInformation Processing Systems - Natural and Synthetic (NIPS'92), 1992. \n\n[7] M. Furst, J. B. Saxe, and M. Sipser. Parity, Circuits and the Polynomial-Time \nHierarchy. IEEE Symp. Found. Comp. Sci., 22:260-270, 1981. \n\n[8] M. Goldmann, J. Hastad, and A. Razborov. Majority Gates vs. General \nWeighted Threshold Gates. Seventh Annual Conference on Structure in \nComplexity Theory, 1992. \n\n[9] A. A. Goncar. On the rapidity of rational approximation of continuous \nfunctions with characteristic singularities. Mat. Sbornik, 2(4):561-568, 1967. \n\n[10] A. Hajnal, W. Maass, P. Pudlak, M. Szegedy, and G. Turan. Threshold circuits \nof bounded depth. IEEE Symp. Found. Comp. Sci., 28:99-110, 1987. \n\n[11] R. J. Lechner. Harmonic analysis of switching functions. In A. Mukhopadhyay, \neditor, Recent Development in Switching Theory. Academic Press, 1971. \n\n[12] W. Maass, G. Schnitger, and E. Sontag. On the computational power of \nsigmoid versus boolean threshold circuits. IEEE Symp. Found. Comp. Sci., \nOctober 1991. \n\n[13] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. Parallel \nDistributed Processing: Explorations in the Microstructure of Cognition, vol. \n1. MIT Press, 1986. \n\n[14] R. Minnick. Linear-Input Logic. IEEE Trans. on Electronic Computers, EC \n10, 1961. \n\n[15] D. J. Newman. Rational Approximation to |x|. Michigan Math. Journal, \n11:11-14, 1964. \n\n[16] R. 
Paturi and M. Saks. On Threshold Circuits for Parity. IEEE Symp. Found. \nComp. Sci., October 1990. \n\n[17] V. P. Roychowdhury, K. Y. Siu, A. Orlitsky, and T. Kailath. A Geometric \nApproach to Threshold Circuit Complexity. Workshop on Computational \nLearning Theory (COLT'91), pp. 97-111, 1991. \n\n[18] K. Y. Siu and J. Bruck. On the Power of Threshold Circuits with Small \nWeights. SIAM J. Discrete Math, pp. 423-435, August 1991. \n\n[19] K. Y. Siu and J. Bruck. Neural Computation of Arithmetic Functions. \nProceedings of the IEEE, Special Issue on Neural Networks, pp. 1669-1675, October \n1990. \n\n[20] K. Y. Siu, V. P. Roychowdhury, and T. Kailath. Computing with Almost \nOptimal Size Threshold Circuits. IEEE International Symposium on Information \nTheory, Budapest, Hungary, June 1991. \n\n[21] K.-Y. Siu, J. Bruck, T. Kailath, and T. Hofmeister. Depth-Efficient Neural \nNetworks for Division and Related Problems. To appear in IEEE Trans. \nInformation Theory, 1993. \n\n", "award": [], "sourceid": 590, "authors": [{"given_name": "Kai-Yeung", "family_name": "Siu", "institution": null}, {"given_name": "Vwani", "family_name": "Roychowdhury", "institution": null}, {"given_name": "Thomas", "family_name": "Kailath", "institution": null}]}