{"title": "Size of Multilayer Networks for Exact Learning: Analytic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 162, "page_last": 168, "abstract": null, "full_text": "Size of multilayer networks for exact \n\nlearning: analytic approach \n\nAndre Elisseefl' \n\nD~pt Mathematiques et Informatique \n\nEcole Normale Superieure de Lyon \n\n46 allee d'Italie \n\nHelene Paugam-Moisy \n\nLIP, URA 1398 CNRS \n\nEcole Normale Superieure de Lyon \n\n46 allee d'Italie \n\nF69364 Lyon cedex 07, FRANCE \n\nF69364 Lyon cedex 07, FRANCE \n\nAbstract \n\nThis article presents a new result about the size of a multilayer \nneural network computing real outputs for exact learning of a finite \nset of real samples. The architecture of the network is feedforward, \nwith one hidden layer and several outputs. Starting from a fixed \ntraining set, we consider the network as a function of its weights. \nWe derive, for a wide family of transfer functions, a lower and an \nupper bound on the number of hidden units for exact learning, \ngiven the size of the dataset and the dimensions of the input and \noutput spaces. \n\n1 RELATED WORKS \n\nThe context of our work is rather similar to the well-known results of Baum et al. [1, \n2,3,5, 10], but we consider both real inputs and outputs, instead ofthe dichotomies \nusually addressed. We are interested in learning exactly all the examples of a fixed \ndatabase, hence our work is different from stating that multilayer networks are \nuniversal approximators [6, 8, 9]. Since we consider real outputs and not only \ndichotomies, it is not straightforward to compare our results to the recent works \nabout the VC-dimension of multilayer networks [11, 12, 13]. Our study is more \nclosely related to several works of Sontag [14, 15], but with different hypotheses on \nthe transfer functions of the units. 
Finally, our approach is based on geometrical considerations and is close to the model of Coetzee and Stonick [4]. \n\nFirst we define the model of network and the notations, and second we develop our analytic approach and prove the fundamental theorem. In the last section, we discuss our point of view and propose some practical consequences of the result. \n\n2 THE NETWORK AS A FUNCTION OF ITS WEIGHTS \n\nGeneral concepts on neural networks are presented in matrix and vector notations, in a geometrical perspective. All vectors are written in bold and considered as column vectors, whereas matrices are denoted with upper-case script. \n\n2.1 THE NETWORK ARCHITECTURE AND NOTATIONS \n\nConsider a multilayer network with NI input units, NH hidden units and NS output units. The inputs and outputs are real-valued. The hidden units compute a non-linear function f which will be specified later on. The output units are assumed to be linear. A learning set of Np examples is given and fixed. For all p in {1..Np}, the pth example is defined by its input vector d_p in R^NI and the corresponding desired output vector t_p in R^NS. The learning set can be represented as an input matrix D = [d_1, ..., d_Np]^T, whose pth row is d_p^T and whose ith column is denoted d^(i). Similarly, the target matrix is T = [t_1, ..., t_Np]^T, with independent row vectors. \n\n2.2 THE NETWORK AS A FUNCTION g OF ITS WEIGHTS \n\nFor all h in {1..NH}, w1_h = (w1_{h1}, ..., w1_{hNI})^T in R^NI is the vector of the weights between all the input units and the hth hidden unit. The input weight matrix W1 is defined as W1 = [w1_1, ..., w1_NH]. Similarly, a vector w2_s = (w2_{s1}, ..., w2_{sNH})^T in R^NH represents the weights between all the hidden units and the sth output unit, for all s in {1..NS}. Thus the output weight matrix W2 is defined as W2 = [w2_1, ..., w2_NS]. \n\nFor an input matrix D, the network computes an output matrix Z(D), where each output vector z(d_p) must be equal to the target t_p for exact learning. The network computation can be detailed as follows, for all s in {1..NS} \n\n[z(d_p)]_s = sum_{h=1..NH} w2_{sh} . f( sum_{i=1..NI} d_{pi} . w1_{hi} ) = sum_{h=1..NH} w2_{sh} . f( d_p^T . w1_h ) \n\nHence, for the whole learning set, the sth output component is \n\n[Z(D)]_s = sum_{h=1..NH} w2_{sh} . [ f(d_1^T . w1_h), ..., f(d_Np^T . w1_h) ]^T = sum_{h=1..NH} w2_{sh} . F(D . w1_h)   (1) \n\nIn equation (1), F is a vector operator which transforms an n-vector v into an n-vector F(v) according to the relation [F(v)]_i = f([v]_i), i in {1..n}. The same notation F will be used for the matrix operator. Finally, the expression of the output matrix can be deduced from equation (1) as follows \n\nZ(D) = [ F(D . w1_1), ..., F(D . w1_NH) ] . [ w2_1, ..., w2_NS ] = F(D . W1) . W2   (2) \n\nFrom equation (2), the network output matrix appears as a simple function of the input matrix and the network weights. Unlike Coetzee and Stonick, we will consider that the input matrix D is not a variable of the problem. Thus we express the network output matrix Z(D) as a function of its weights. Let g be this function \n\ng : R^(NI NH + NH NS) --> R^(Np NS), W = (W1, W2) --> F(D . W1) . W2 \n\nThe g function clearly depends on the input matrix and could have been denoted by g_D, but this index will be dropped for clarity. \n\n3 FUNDAMENTAL RESULT \n\n3.1 PROPERTY OF FUNCTION g \n\nLearning is said to be exact on D if and only if there exists a network such that its output matrix Z(D) is equal to the target matrix T. If g is a diffeomorphic function from R^(NI NH + NH NS) onto R^(Np NS), then the network can learn any target in R^(Np NS) exactly. We prove that it is sufficient for the network function g to be a local diffeomorphism. 
Suppose there exist a set of weights X, an open subset U of R^(NI NH + NH NS) including X and an open subset V of R^(Np NS) including g(X), such that g is diffeomorphic from U to V. Since V is an open neighborhood of g(X), there exist a real lambda and a point y in V such that T = lambda (y - g(X)). Since g is diffeomorphic from U to V, there exists a set of weights Y in U such that y = g(Y), hence T = lambda (g(Y) - g(X)). The output units of the network compute a linear transfer function, hence the linear combination of g(X) and g(Y) can be integrated in the output weights, and a network with twice NI NH + NH NS weights can learn (D, T) exactly (see Figure 1). \n\nFigure 1: A network for exact learning of a target T = lambda (g(Y) - g(X)) (unique output for clarity) \n\nFor g to be a local diffeomorphism, it is sufficient to find a set of weights X such that the Jacobian of g at X is non-zero, and to apply the theorem of local inversion. This analysis is developed in the next sections and requires some assumptions on the transfer function f of the hidden units. A function which verifies such an hypothesis (H) will be called an H-function and is defined below. \n\n3.2 DEFINITION AND THEOREM \n\nDefinition 1 Consider a function f : R --> R which is C1(R) (i.e. with continuous derivative) and which has finite limits in -infinity and +infinity. 
Such a function is called an H-function iff it verifies the following property \n\n(H)  (for all a in R with |a| > 1)  lim_{x --> +/- infinity} | f'(ax) / f'(x) | = 0 \n\nFrom this hypothesis on the transfer function of all the hidden units, the fundamental result can be stated as follows \n\nTheorem 1 Exact learning of a set of Np examples, in general position, from R^NI to R^NS, can be realized by a network with linear output units and a transfer function which is an H-function, if the size NH of its hidden layer verifies the following bounds \n\nLower bound: NH = ceil( Np NS / (NI + NS) ) hidden units are necessary. \nUpper bound: NH = 2 ceil( Np NS / (NI + NS) ) hidden units are sufficient. \n\nThe proof of the lower bound is straightforward, since a condition for g to be diffeomorphic from R^(NI NH + NH NS) onto R^(Np NS) is the equality of its input and output space dimensions: NI NH + NH NS = Np NS. \n\n3.3 SKETCH OF THE PROOF FOR THE UPPER BOUND \n\nThe g function is an expression of the network as a function of its weights, for a given input matrix: g(W1, W2) = F(D . W1) . W2, and g can be decomposed according to its vectorial components on the learning set (which are themselves vectors of size NS): g = (g_1, ..., g_Np). \n\nThe derivatives of g w.r.t. the input weight matrix W1 are, for all p in {1..Np}, for all i in {1..NI}, for all h in {1..NH} \n\nd g_p / d w1_{hi} = [ w2_{1h} f'(d_p^T . w1_h) d_{pi}, ..., w2_{NSh} f'(d_p^T . w1_h) d_{pi} ]^T \n\nFor the output weight matrix W2, the derivatives of g are, for all h in {1..NH}, for all s in {1..NS} \n\nd g_p / d w2_{sh} = [ 0, ..., 0, f(d_p^T . w1_h), 0, ..., 0 ]^T  (the non-zero entry in position s) \n\nThe Jacobian matrix MJ(g) of g, the size of which is NI NH + NH NS columns and NS Np rows, is thus composed of a block-diagonal part (derivatives w.r.t. W2) and several other blocks (derivatives w.r.t. W1). Hence the Jacobian J(g) can be rewritten J(g) = | J_1, J_2, ..., J_NH |, after permutations of rows and columns, and, using the Hadamard (o) and Kronecker (x) product notations, each J_h being equal to \n\nJ_h = [ F(D . w1_h) x I_NS , [ F'(D . w1_h) o d^(1), ..., F'(D . w1_h) o d^(NI) ] x [ w2_{1h}, ..., w2_{NSh} ] ]   (3) \n\nwhere I_NS is the identity matrix in dimension NS and d^(i) is the ith column of D. \n\nOur purpose is to prove that there exists a point X = (W1, W2) such that the Jacobian J(g) is non-zero at X, i.e. such that the column vectors of the Jacobian matrix MJ(g) are linearly independent at X. The proof can be divided in two steps. First we address the case of a single output unit. Afterwards, this proof can be used to extend the result to several output units. Since the complete development of both proofs requires a lot of calculations, we only present their sketches below. More details can be found in [7]. \n\n3.3.1 Case of a single output unit \n\nThe proof is based on a linear arrangement of the projections of the column vectors of J_h onto a subspace. This subspace is orthogonal to all the J_i for i < h. We build a vector w1_h and a scalar w2_{1h} such that the projected column vectors are an independent family, hence they are independent from the J_i for i < h. Such a construction is recursively applied until h = NH. We then derive vectors w1_1, ..., w1_NH and w2_1 such that J(g) is non-zero. The assumption on H-functions is essential for proving that the projected column vectors of J_h are independent. \n\n3.3.2 Case of multiple output units \n\nIn order to extend the result from a single output to NS output units, the usual idea consists in considering as many subnetworks as the number of output units. From this point of view, the bound on the hidden units would be NH = 2 ceil( Np NS / (NI + 1) ), which differs from the result stated in Theorem 1. A new direct proof can be developed (see [7]) and gets a better bound: the denominator is increased to NI + NS. 
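As a quick numeric illustration of the bounds of Theorem 1 (this sketch is not part of the original paper, and the example sizes below are hypothetical), the necessary and sufficient hidden-layer sizes can be computed directly:

```python
import math

def hidden_units_bounds(n_p, n_i, n_s):
    # Bounds of Theorem 1 on the hidden-layer size NH for exact
    # learning of n_p examples in general position, from R^n_i to R^n_s.
    necessary = math.ceil(n_p * n_s / (n_i + n_s))  # lower bound
    sufficient = 2 * necessary                      # upper bound
    return necessary, sufficient

# e.g. 1000 examples, 10 real inputs, 2 real outputs (illustrative values)
print(hidden_units_bounds(1000, 10, 2))  # -> (167, 334)
```

Note how the sufficient size grows linearly with the number of examples, which motivates the discussion of redundancy and pre-processing in the conclusion.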
\n\n4 DISCUSSION \n\nThe definition of an H-function includes both sigmoids and gaussian functions, which are commonly used for multilayer perceptrons and RBF networks, but it does not hold for threshold functions. Figure 2 shows the difference between a sigmoid, which is an H-function, and a saturation, which is not an H-function. Figures (a) and (b) represent the span of the output space by the network when the weights are varying, i.e. the image of g. For clarity, the network is reduced to 1 hidden unit, 1 input unit, 1 output unit and 2 input patterns. For an H-function, a ball can be extracted from the output space R^2, onto which the g function is a diffeomorphism. For the saturation, the image of g is reduced to two lines, hence g cannot be onto a ball of R^2. The assumption on the activation function is thus necessary to prove that the Jacobian is non-zero. \n\nOur bound on the number of hidden units is very similar to Baum's results for dichotomies and functions from real inputs to binary outputs [1]. Hence the present result can be seen as an extension of Baum's results to the case of real outputs, and for a wide family of transfer functions, different from the threshold functions addressed by Baum and Haussler in [2]. An early result on sigmoid networks has been stated by Sontag [14]: for a single output and at least two input units, the number of examples must be twice the number of hidden units. Our upper bound on the number of hidden units is strictly lower than that (as soon as the number of input units is more than two). A counterpart of considering real data is that our results bear little relation to the VC-dimension point of view. 
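Property (H) can be checked numerically for the logistic sigmoid. The following sketch (our own illustration, not from the paper) evaluates |f'(ax)/f'(x)| for a = 2 and growing x, and shows the ratio shrinking toward 0:

```python
import math

def dsigmoid(x):
    # derivative of the logistic sigmoid f(x) = 1/(1+exp(-x)),
    # computed as exp(-|x|)/(1+exp(-|x|))^2, using the symmetry
    # f'(x) = f'(-x) to avoid overflow for large negative x
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2

a = 2.0  # any |a| > 1 works in property (H)
for x in (5.0, 10.0, 20.0):
    print(x, dsigmoid(a * x) / dsigmoid(x))  # ratio shrinks toward 0
```

For a hard threshold, by contrast, the derivative is zero almost everywhere, so no such ratio argument is available, consistent with the saturation case of Figure 2.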
\n\n(a) A saturation function. (b) A sigmoid function. \n\nFigure 2: Positions of output vectors, for given data, when varying network weights \n\n5 CONCLUSION \n\nIn this paper, we show that a number of hidden units NH = 2 ceil( Np NS / (NI + NS) ) is sufficient for a network of H-functions to exactly learn a given set of Np examples in general position. We now discuss some of the practical consequences of this result. According to this formula, the size of the hidden layer required for exact learning may grow very high if the size of the learning set is large. However, without a priori knowledge on the degree of redundancy in the learning set, exact learning is not the right goal in practical cases. Exact learning usually implies overfitting, especially if the examples are very noisy. Nevertheless, a right point of view could be to first reduce the dimension and the size of the learning set by feature extraction or data analysis as pre-processing. Afterwards, our theoretical result could be a precious indication for scaling a network to perform exact learning on this representative learning set, with a good compromise between bias and variance. Our bound is more optimistic than the rule-of-thumb Np = 10w derived from the theory of PAC-learning. In our architecture, the number of weights is w = NH (NI + NS) = 2 Np NS (up to rounding). However the proof is not constructive enough to be derived as a learning algorithm, especially the existence of g(Y) in the neighborhood of g(X) where g is a local diffeomorphism (cf. Figure 1). 
From this construction we can only conclude that NH = ceil( Np NS / (NI + NS) ) is necessary and NH = 2 ceil( Np NS / (NI + NS) ) is sufficient to realize exact learning of Np examples, from R^NI to R^NS. \n\nThe opportunity of using multilayer networks as auto-associative networks and for data compression can be discussed in the light of this result. Assume that NS = NI; the expression of the number of hidden units then reduces to NH = Np, or at least NH = Np / 2. Since Np >= NI + NS, the number of hidden units must verify NH >= NI. Therefore, a \"diabolo\" network architecture seems to be precluded for exact learning of auto-associations. A consequence may be that exact retrieval from data compression is hopeless when using internal representations of a hidden layer smaller than the data dimension. \n\nAcknowledgements \n\nThis work was supported by European Esprit III Project no. 8556, NeuroCOLT Working Group. We thank C.S. Poon and J.V. Shah for fruitful discussions. \n\nReferences \n\n[1] E. B. Baum. On the capabilities of multilayer perceptrons. J. of Complexity, 4:193-215, 1988. \n[2] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1:151-160, 1989. \n[3] E. K. Blum and L. K. Li. Approximation theory and feedforward networks. Neural Networks, 4(4):511-516, 1991. \n[4] F. M. Coetzee and V. L. Stonick. Topology and geometry of single hidden layer network, least squares weight solutions. Neural Computation, 7:672-705, 1995. \n[5] M. Cosnard, P. Koiran, and H. Paugam-Moisy. Bounds on the number of units for computing arbitrary dichotomies by multilayer perceptrons. J. of Complexity, 10:57-63, 1994. \n[6] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2:303-314, October 1988. \n[7] A. Elisseeff and H. Paugam-Moisy. Size of multilayer networks for exact learning: analytic approach. Rapport de recherche 96-16, LIP, July 1996. \n[8] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, 1989. \n[9] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989. \n[10] S.-C. Huang and Y.-F. Huang. Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Trans. Neural Networks, 2:47-55, 1991. \n[11] M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal neural networks. In 27th ACM Symposium on Theory of Computing, pages 200-208, 1995. \n[12] P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. In Neural Information Processing Systems (NIPS*95), 1995. To appear. \n[13] W. Maass. Bounds for the computational power and learning complexity of analog neural networks. In 25th ACM Symposium on Theory of Computing, pages 335-344, 1993. \n[14] E. D. Sontag. Feedforward nets for interpolation and classification. J. Comput. Syst. Sci., 45:20-48, 1992. \n[15] E. D. Sontag. Shattering all sets of k points in \"general position\" requires (k-1)/2 parameters. Technical Report 96-01, Rutgers Center for Systems and Control (SYCON), February 1996. \n", "award": [], "sourceid": 1303, "authors": [{"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}, {"given_name": "H\u00e9l\u00e8ne", "family_name": "Paugam-Moisy", "institution": null}]}