{"title": "Solvable Models of Artificial Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 430, "abstract": null, "full_text": "Solvable Models of Artificial Neural \n\nNetworks \n\nSumio Watanabe \n\nInformation and Communication R&D Center \n\nRicoh Co., Ltd. \n\n3-2-3, Shin-Yokohama, Kohoku-ku, Yokohama, 222 Japan \n\nsumio@ipe.rdc.ricoh.co.jp \n\nAbstract \n\nSolvable models of nonlinear learning machines are proposed, and \nlearning in artificial neural networks is studied based on the theory \nof ordinary differential equations. A learning algorithm is con(cid:173)\nstructed, by which the optimal parameter can be found without \nany recursive procedure. The solvable models enable us to analyze \nthe reason why experimental results by the error backpropagation \noften contradict the statistical learning theory. \n\n1 \n\nINTRODUCTION \n\nRecent studies have shown that learning in artificial neural networks can be under(cid:173)\nstood as statistical parametric estimation using t.he maximum likelihood method \n, and that their generalization abilities can be estimated using the statistical \nasymptotic theory . However, as is often reported, even when the number of \nparameters is too large, the error for the test.ing sample is not so large as the theory \npredicts. The reason for such inconsistency has not yet been clarified, because it is \ndifficult for the artificial neural network t.o find the global optimal parameter. \n\nOn the other hand, in order to analyze the nonlinear phenomena, exactly solvable \nmodels have been playing a central role in mathematical physics, for example, the \nK-dV equation, the Toda lattice, and some statistical models that satisfy the Yang-\n\n423 \n\n\f424 \n\nWatanabe \n\nBaxter equation. \n\nThis paper proposes the first solvable models in the nonlinear learning problem. 
We consider simple three-layered neural networks, and show that the parameters from the inputs to the hidden units determine a function space that is characterized by a differential equation. This fact means that optimization of these parameters is equivalent to optimization of the differential equation. Based on this property, we construct a learning algorithm by which the optimal parameters can be found without any recursive procedure. Experimental results using the proposed algorithm show that the maximum likelihood estimator is not always obtained by error backpropagation, and that the conventional statistical learning theory leaves much to be improved. \n\n2 The Basic Structure of Solvable Models \n\nLet us consider a function f_{c,w}(x) given by a simple neural network with 1 input unit, H hidden units, and 1 output unit, \n\nf_{c,w}(x) = Σ_{i=1}^H c_i φ_{w_i}(x), (1) \n\nwhere both c = {c_i} and w = {w_i} are parameters to be optimized, and φ_{w_i}(x) is the output of the i-th hidden unit. We assume that {φ_i(x) = φ_{w_i}(x)} is a set of independent functions of C^H-class. The following theorem is the starting point of this paper. \n\nTheorem 1 The H-th order differential equation whose fundamental system of solutions is {φ_i(x)} and whose H-th order coefficient is 1 is uniquely given by \n\n(D_w g)(x) = (-1)^H W_{H+1}(g, φ_1, φ_2, ..., φ_H) / W_H(φ_1, φ_2, ..., φ_H) = 0, (2) \n\nwhere W_H is the H-th order Wronskian, \n\nW_H(φ_1, ..., φ_H) = det [ φ_1, ..., φ_H ; φ_1^{(1)}, ..., φ_H^{(1)} ; ... ; φ_1^{(H-1)}, ..., φ_H^{(H-1)} ]. \n\nFor a proof, see [4]. From this theorem, we have the following corollary. \n\nCorollary 1 Let g(x) be a C^H-class function. Then the following conditions for g(x) and w = {w_i} are equivalent: \n(1) There exists a set c = {c_i} such that g(x) = Σ_{i=1}^H c_i φ_{w_i}(x). \n(2) (D_w g)(x) = 0. \n\nExample 1 Let us consider the case φ_{w_i}(x) = exp(w_i x).
\n\ng(x) = Σ_{i=1}^H c_i exp(w_i x) \n\nis equivalent to {D^H + p_1 D^{H-1} + p_2 D^{H-2} + ... + p_H} g(x) = 0, where D = (d/dx) and the set {p_i} is determined from {w_i} by the relation \n\nz^H + p_1 z^{H-1} + p_2 z^{H-2} + ... + p_H = Π_{i=1}^H (z - w_i) (∀z ∈ C). \n\nExample 2 (RBF) A function g(x) is given by radial basis functions, \n\ng(x) = Σ_{i=1}^H c_i exp{-(x - w_i)^2}, \n\nif and only if e^{-x^2} {D^H + p_1 D^{H-1} + p_2 D^{H-2} + ... + p_H} (e^{x^2} g(x)) = 0, where the set {p_i} is determined from {w_i} by the relation \n\nz^H + p_1 z^{H-1} + p_2 z^{H-2} + ... + p_H = Π_{i=1}^H (z - 2 w_i) (∀z ∈ C). \n\nFigure 1 shows a learning algorithm for the solvable models. When a target function g(x) is given, let us consider the following function approximation problem, \n\ng(x) = Σ_{i=1}^H c_i φ_{w_i}(x) + ε(x). (3) \n\nLearning in the neural network means optimizing both {c_i} and {w_i} such that ε(x) is minimized for some error function. From the definition of D_w, eq. (3) is equivalent to (D_w g)(x) = (D_w ε)(x), where the term (D_w g)(x) is independent of {c_i}. Therefore, if we adopt ||D_w ε|| as the error function to be minimized, {w_i} can be optimized by minimizing ||D_w g||, independently of {c_i}, where ||f||^2 = ∫ |f(x)|^2 dx. After ||D_w g|| is minimized, we have (D_{w*} g)(x) ≈ 0, where w* is the optimized parameter. From Corollary 1, there exists a set {c_i*} such that g(x) ≈ Σ c_i* φ_{w_i*}(x), where {c_i*} can be found using the ordinary least squares method. \n\n3 Solvable Models \n\nFor a general function φ_w, the differential operator D_w does not always have as simple a form as in the above examples. In this section, we consider a linear operator L such that the differential equation of L φ_w has a simple form. \n\nDefinition A neural network Σ c_i φ_{w_i}(x) is called solvable if there exist functions a, b, and a linear operator L such that \n\n(L φ_{w_i})(x) = exp{a(w_i) x + b(w_i)}.
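The correspondence in Examples 1 and 2 between the hidden-unit parameters {w_i} and the coefficients {p_i} of the differential operator can be verified numerically. The following is a minimal sketch, not part of the paper (assuming NumPy; the values chosen for w_i and c_i are arbitrary): it builds the characteristic polynomial Π(z - w_i) and checks that D^H + p_1 D^{H-1} + ... + p_H annihilates g(x) = Σ c_i exp(w_i x), using the closed form of the derivatives of g.

```python
import numpy as np

# Arbitrary test parameters (not from the paper): H = 3 hidden units.
w = np.array([-1.0, 0.5, 2.0])   # hidden-unit parameters w_i
c = np.array([1.0, -2.0, 0.3])   # output weights c_i
H = len(w)

# Coefficients [1, p_1, ..., p_H] of the polynomial prod_i (z - w_i).
p = np.poly(w)

x = np.linspace(0.0, 1.0, 5)
# k-th derivative of g(x) = sum_i c_i exp(w_i x) is sum_i c_i w_i^k exp(w_i x).
derivs = [sum(c[i] * w[i] ** k * np.exp(w[i] * x) for i in range(H))
          for k in range(H + 1)]

# Apply D^H + p_1 D^{H-1} + ... + p_H to g; each exponential is multiplied by
# P(w_i) = prod_j (w_i - w_j) = 0, so the result vanishes identically.
residual = sum(p[j] * derivs[H - j] for j in range(H + 1))
print(np.max(np.abs(residual)))   # vanishes up to rounding error
```

Replacing the roots w_i by 2 w_i in np.poly gives the relation of Example 2 after the substitution g → e^{x^2} g.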
\n\nThe following theorem shows that the optimal parameter of a solvable model can be found using the same algorithm as in Figure 1. \n\n[Figure 1: The learning structure of the solvable models. Direct optimization of g(x) = Σ_{i=1}^H c_i φ_{w_i}(x) + ε(x) is difficult because {w_i} cannot be optimized independently of {c_i}; instead, {w_i} is optimized by minimizing ||D_w g||, after which there exists {c_i} such that g(x) ≈ Σ c_i φ_{w_i}(x).] \n\nTheorem 2 Suppose that a neural network g(x) = Σ_{i=1}^H c_i φ_{w_i}(x) is solvable. Then there exists a set {p_i} such that {D^H + p_1 D^{H-1} + ... + p_H}(Lg)(x) = 0. Moreover, for a sampling interval α > 0, we define a sequence {y_n} by y_n = (Lg)(nα). Then there exists {q_i} such that y_n + q_1 y_{n-1} + q_2 y_{n-2} + ... + q_H y_{n-H} = 0. \n\nNote that ||D_w L g||^2 is a quadratic form in {p_i}, which is easily minimized by the least squares method. Σ_n |y_n + q_1 y_{n-1} + ... + q_H y_{n-H}|^2 is also a quadratic form in {q_i}. \n\nTheorem 3 The sets {w_i}, {p_i}, and {q_i} in Theorem 2 have the following relations: \n\nz^H + p_1 z^{H-1} + p_2 z^{H-2} + ... + p_H = Π_{i=1}^H (z - a(w_i)) (∀z ∈ C), \n\nz^H + q_1 z^{H-1} + q_2 z^{H-2} + ... + q_H = Π_{i=1}^H (z - exp(a(w_i) α)) (∀z ∈ C). \n\nFor proofs of the above theorems, see [5]. These theorems show that, if {p_i} or {q_i} is optimized for a given function g(x), then {a(w_i)} can be found as the set of solutions of the corresponding algebraic equation. \n\nSuppose that a target function g(x) is given. Then, from the above theorems, the globally optimal parameter w* = {w_i*} can be found by minimizing ||D_w L g|| independently of {c_i}. Moreover, if the function a(w) is a one-to-one mapping, then w* exists uniquely (up to permutation of {w_i*}) if and only if the quadratic form ||{D^H + p_1 D^{H-1} + ... + p_H} g||^2 is not degenerate. (Remark: if it is degenerate, we can use another neural network with a smaller number of hidden units.) \n\nExample 3 A neural network without scaling, \n\nf_{b,c}(x) = Σ_{i=1}^H c_i σ(x + b_i), (4) \n\nis solvable when (Fσ)(x) ≠ 0 (a.e.), where F denotes the Fourier transform. Define a linear operator L by (Lg)(x) = (Fg)(x)/(Fσ)(x); then it follows that \n\n(L f_{b,c})(x) = Σ_{i=1}^H c_i exp(-√(-1) b_i x). (5) \n\nBy Theorem 2, the optimal {b_i} can be obtained by using the differential or the sequential equation. \n\nExample 4 (MLP) A three-layered perceptron, \n\nf_{b,c}(x) = Σ_{i=1}^H c_i tan^{-1}((x + b_i)/a_i), (6) \n\nis solvable. Define a linear operator L by (Lg)(x) = x ·
(Fg)(x); then it follows that \n\n(L f_{b,c})(x) = Σ_{i=1}^H c_i exp(-(a_i + √(-1) b_i) x + β(a_i, b_i)) (x ≥ 0), (7) \n\nwhere β(a_i, b_i) is some function of a_i and b_i. Since the function tan^{-1}(x) is monotone increasing and bounded, we can expect that a neural network given by eq. (6) has the same ability in the function approximation problem as the ordinary three-layered perceptron using the sigmoid function tanh(x). \n\nExample 5 (Finite Wavelet Decomposition) A finite wavelet decomposition, \n\nf_{b,c}(x) = Σ_{i=1}^H c_i σ((x + b_i)/a_i), (8) \n\nis solvable when σ(x) = (d/dx)^n (1/(1 + x^2)) (n ≥ 1). Define a linear operator L by (Lg)(x) = x^{-n} · (Fg)(x); then it follows that \n\n(L f_{b,c})(x) = Σ_{i=1}^H c_i exp(-(a_i + √(-1) b_i) x + β(a_i, b_i)) (x ≥ 0), (9) \n\nwhere β(a_i, b_i) is some function of a_i and b_i. Note that σ(x) is an analyzing wavelet, and this example shows a method for optimizing the parameters of the finite wavelet decomposition. \n\n4 Learning Algorithm \n\nWe construct a learning algorithm for solvable models, as shown in Figure 1: \n\n(0) A target function g(x) is given. \n(1) {y_m} is calculated by y_m = (Lg)(mα). \n(2) {q_i} is optimized by minimizing Σ_m |y_m + q_1 y_{m-1} + q_2 y_{m-2} + ... + q_H y_{m-H}|^2. \n(3) {z_i} is calculated by solving z^H + q_1 z^{H-1} + q_2 z^{H-2} + ... + q_H = 0. \n(4) {w_i} is determined by a(w_i) = (1/α) log z_i. \n(5) {c_i} is optimized by minimizing Σ_j (g(x_j) - Σ_i c_i φ_{w_i}(x_j))^2. \n\nStrictly speaking, g(x) should be given for arbitrary x. However, in practical applications, if the number of training samples is sufficiently large that (Lg)(x) can be approximated almost precisely, this algorithm is applicable. In the third procedure, the DKA method, for example, can be applied to solve the algebraic equation.
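In the simplest solvable case φ_w(x) = exp(wx), where L is the identity and a(w) = w, steps (1)-(5) amount to a Prony-style fit. The sketch below is illustrative, not the paper's implementation: numpy.linalg.lstsq performs the quadratic-form minimizations of steps (2) and (5), numpy.roots stands in for the DKA method of step (3), and the values of H, α, w_i, and c_i are arbitrary test choices.

```python
import numpy as np

def fit_solvable(y, H, alpha):
    """Steps (1)-(5) for g(x) = sum_i c_i exp(w_i x), i.e. L = id, a(w) = w.
    y[m] = g(m * alpha) are samples on a uniform grid of spacing alpha."""
    M = len(y)
    # Step (2): least squares for {q_i} in y_m + q_1 y_{m-1} + ... + q_H y_{m-H} = 0.
    A = np.column_stack([y[H - k : M - k] for k in range(1, H + 1)])
    q = np.linalg.lstsq(A, -y[H:], rcond=None)[0]
    # Step (3): roots of z^H + q_1 z^{H-1} + ... + q_H (np.roots instead of DKA).
    z = np.roots(np.concatenate(([1.0], q)))
    # Step (4): a(w_i) = (1/alpha) log z_i, which here is w_i itself.
    w = np.log(z.astype(complex)) / alpha
    # Step (5): ordinary least squares for {c_i}.
    x = alpha * np.arange(M)
    Phi = np.exp(np.outer(x, w))
    c = np.linalg.lstsq(Phi, y.astype(complex), rcond=None)[0]
    return w, c

# Hypothetical target with H = 2 hidden units.
alpha, H = 0.05, 2
w_true = np.array([-1.5, 0.8])
c_true = np.array([2.0, -1.0])
x = alpha * np.arange(40)
y = c_true[0] * np.exp(w_true[0] * x) + c_true[1] * np.exp(w_true[1] * x)
w_est, c_est = fit_solvable(y, H, alpha)
print(np.sort(w_est.real))   # close to [-1.5, 0.8]
```

With noisy samples, step (2) is still an ordinary least squares problem, so the procedure remains non-recursive; only the conditioning of the two linear systems changes.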
\n\n5 Experimental Results and Discussion \n\n5.1 The backpropagation and the proposed method \n\nFor the experiments, we used a probability density function and a regression function given by \n\nQ(y|x) = (1/√(2πσ^2)) exp(-(y - h(x))^2 / (2σ^2)), \n\nh(x) = (1/3) tan^{-1}((x - 1/3)/0.04) + (1/6) tan^{-1}((x - 2/3)/0.02), \n\nwhere σ = 0.2. One hundred input samples were set at equal intervals in [0,1), and output samples were taken from the above conditional distribution. \n\nTable 1 shows the relation between the number of hidden units, the training errors, and the regression errors. In the table, the training error for the backpropagation is the square error obtained after 100,000 training cycles, the training error for the proposed method is the square error obtained by the above algorithm, and the regression error is the square error between the true regression curve h(x) and the estimated curve. \n\nTable 1: Training errors and regression errors \n\nHidden | Backpropagation      | Proposed Method \nUnits  | Training  Regression | Training  Regression \n 2     | 4.1652    0.7698     | 4.0889    0.3301 \n 4     | 3.3464    0.4152     | 3.8755    0.2653 \n 6     | 3.3343    0.4227     | 3.5368    0.3730 \n 8     | 3.3267    0.4189     | 3.2237    0.4297 \n10     | 3.3284    0.4260     | 3.2547    0.4413 \n12     | 3.3170    0.4312     | 3.1988    0.5810 \n\nFigure 2 shows the true and estimated regression lines: (0) the true regression line and sample points, (1) the estimated regression line with 2 hidden units, by the BP (error backpropagation) after 100,000 training cycles, (2) the estimated regression line with 12 hidden units, by the BP after 100,000 training cycles, (3) the estimated line with 2 hidden units by the proposed method, and (4) the estimated line with 12 hidden units by the proposed method.
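The synthetic data set of this section can be regenerated with a short sketch (assuming NumPy; the random seed is arbitrary, and the constants of h(x) are as transcribed above):

```python
import numpy as np

sigma = 0.2

def h(x):
    # Regression function of Sec. 5.1.
    return np.arctan((x - 1/3) / 0.04) / 3 + np.arctan((x - 2/3) / 0.02) / 6

rng = np.random.default_rng(0)                   # arbitrary seed
x = np.linspace(0.0, 1.0, 100, endpoint=False)   # 100 equally spaced inputs in [0, 1)
y = h(x) + sigma * rng.normal(size=x.size)       # output samples from Q(y|x)

# The sample error sum concentrates around 100 * sigma^2 = 4
# (compare "sample error sum = 3.6874" reported with Figure 2).
print(float(np.sum((y - h(x)) ** 2)))
```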
\n\n5.2 Discussion \n\nWhen the number of hidden units was small, the training errors by the BP were smaller, but the regression errors were larger. When the number of hidden units was made larger, the training error by the BP did not decrease as much as with the proposed method, and the regression error did not increase as much as with the proposed method. \n\nBy the error backpropagation, the parameters did not reach the maximum likelihood estimator, or they fell into local minima. However, when the number of hidden units was large, the neural network without the maximum likelihood estimator attained the better generalization. It seems that the parameters in the local minima were closer to the true parameter than the maximum likelihood estimator. \n\nTheoretically, in the case of layered neural networks, the maximum likelihood estimator may not be subject to an asymptotically normal distribution, because the Fisher information matrix may be degenerate. This can be one reason why the experimental results contradict the ordinary statistical theory. In addition to this problem, the above experimental results show that the local minima cause a strange problem. In order to construct a more precise learning theory for the backpropagation neural network, and to choose the better parameter for generalization, we may need a method to analyze learning and inference with a local minimum. \n\n6 Conclusion \n\nWe have proposed solvable models of artificial neural networks and studied their learning structure. The experimental results have shown that the proposed method is useful in the analysis of the neural network generalization problem. \n\n[Figure 2: Experimental Results. (0) True curve and samples; sample error sum = 3.6874. (1) BP after 100,000 cycles: H = 2, E_train = 4.1652, E_reg = 0.7698. (2) BP after 100,000 cycles: H = 12, E_train = 3.3170, E_reg = 0.4312. (3) Proposed method: H = 2, E_train = 4.0889, E_reg = 0.3301. (4) Proposed method: H = 12, E_train = 3.1988, E_reg = 0.5810. H: the number of hidden units; E_train: the training error; E_reg: the regression error.] \n\nReferences \n\n[1] H. White. (1989) Learning in artificial neural networks: a statistical perspective. Neural Computation, 1, 425-464. \n\n[2] N. Murata, S. Yoshizawa, and S.-I. Amari. (1992) Learning curves, model selection and complexity of neural networks. Advances in Neural Information Processing Systems 5, San Mateo: Morgan Kaufmann, pp. 607-614. \n\n[3] R. J. Baxter. (1982) Exactly Solved Models in Statistical Mechanics. Academic Press. \n\n[4] E. A. Coddington. (1955) Theory of Ordinary Differential Equations. McGraw-Hill, New York. \n\n[5] S. Watanabe. (1993) Function approximation by neural networks and solution spaces of differential equations. Submitted to Neural Networks. \n", "award": [], "sourceid": 786, "authors": [{"given_name": "Sumio", "family_name": "Watanabe", "institution": null}]}