{"title": "Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 302, "page_last": 308, "abstract": null, "full_text": "Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks\n\nDavid Saad*\nDept. of Comp. Sci. & App. Math.\nAston University\nBirmingham B4 7ET, UK\n\nSara A. Solla†\nCONNECT, The Niels Bohr Institute\nBlegdamsvej 17\nCopenhagen 2100, Denmark\n\nAbstract\n\nWe consider the problem of on-line gradient descent learning for general two-layer neural networks. An analytic solution is presented and used to investigate the role of the learning rate in controlling the evolution and convergence of the learning process.\n\nLearning in layered neural networks refers to the modification of internal parameters {J} which specify the strength of the interneuron couplings, so as to bring the map f_J implemented by the network as close as possible to a desired map f̃. The degree of success is monitored through the generalization error, a measure of the dissimilarity between f_J and f̃.\n\nConsider maps from an N-dimensional input space ξ onto a scalar ζ, as arise in the formulation of classification and regression tasks. Two-layer networks with an arbitrary number of hidden units have been shown to be universal approximators [1] for such N-to-one dimensional maps. Information about the desired map f̃ is provided through independent examples (ξ^μ, ζ^μ), with ζ^μ = f̃(ξ^μ) for all μ. The examples are used to train a student network with N input units, K hidden units, and a single linear output unit; the target map f̃ is defined through a teacher network of similar architecture except for the number M of hidden units.
We investigate the emergence of generalization ability in an on-line learning scenario [2], in which the couplings are modified after the presentation of each example so as to minimize the corresponding error. The resulting changes in {J} are described as a dynamical evolution; the number of examples plays the role of time.\n\nIn this paper we limit our discussion to the case of the soft-committee machine [2], in which all the hidden units are connected to the output unit with positive couplings of unit strength, and only the input-to-hidden couplings are adaptive.\n\n*D.Saad@aston.ac.uk\n†On leave from AT&T Bell Laboratories, Holmdel, NJ 07733, USA\n\nConsider the student network: hidden unit i receives information from input unit r through the weight J_ir, and its activation under presentation of an input pattern ξ = (ξ_1, ..., ξ_N) is x_i = J_i · ξ, with J_i = (J_i1, ..., J_iN) defined as the vector of incoming weights onto the i-th hidden unit. The output of the student network is σ(J, ξ) = Σ_{i=1..K} g(J_i · ξ), where g is the activation function of the hidden units, taken here to be the error function g(x) ≡ erf(x/√2), and J ≡ {J_i}_{1≤i≤K} the set of input-to-hidden adaptive weights. The activation of teacher hidden unit n is y_n = B_n · ξ, and the teacher output is ζ = Σ_{n=1..M} g(B_n · ξ). The components of the input vectors ξ are uncorrelated random variables with zero mean and unit variance. The error made by a student with weights J on input ξ is ε(J, ξ) ≡ (1/2) [σ(J, ξ) − ζ]², and the generalization error is obtained as\n\nε_g(J) ≡ ⟨ε(J, ξ)⟩_ξ    (1)\n\nthrough an average over all possible input vectors ξ, to be performed implicitly through averages over the activations x = (x_1, ..., x_K) and y = (y_1, ..., y_M). Note that ⟨x_i⟩ = ⟨y_n⟩ = 0; second order correlations are given by the overlaps among the weight vectors associated with the various hidden units: ⟨x_i x_k⟩ = J_i · J_k ≡ Q_ik, ⟨x_i y_n⟩ = J_i · B_n ≡ R_in, and ⟨y_n y_m⟩ = B_n · B_m ≡ T_nm. Averages over x and y are performed using the resulting multivariate Gaussian probability distribution, and yield an expression for the generalization error in terms of the parameters Q_ik, R_in, and T_nm [3].
For g(x) ≡ erf(x/√2) the result is:\n\nε_g = (1/π) { Σ_ik arcsin[ Q_ik / (√(1+Q_ii) √(1+Q_kk)) ] + Σ_nm arcsin[ T_nm / (√(1+T_nn) √(1+T_mm)) ] − 2 Σ_in arcsin[ R_in / (√(1+Q_ii) √(1+T_nn)) ] }.    (2)\n\nThe parameters T_nm are characteristic of the task to be learned and remain fixed. The overlaps Q_ik and R_in, which characterize the correlations among the various student units and their degree of specialization towards the implementation of the desired task, are determined by the student weights J and evolve during training.\n\nA gradient descent rule for the update of the student weights results in J_i^{μ+1} = J_i^μ + (η/N) δ_i^μ ξ^μ, where the learning rate η has been scaled with the input size N, and\n\nδ_i^μ ≡ g'(x_i^μ) [ Σ_{n=1..M} g(y_n^μ) − Σ_{j=1..K} g(x_j^μ) ]    (3)\n\nis defined in terms of both the activation function g and its derivative g'. The time evolution of the overlaps R_in and Q_ik can be explicitly written in terms of similar difference equations. In the large N limit the normalized number of examples α = p/N can be interpreted as a continuous time variable, leading to the equations of motion\n\ndR_in/dα = η δ_i y_n ,    dQ_ik/dα = η δ_i x_k + η δ_k x_i + η² δ_i δ_k ,    (4)\n\nto be averaged over all possible ways in which an example can be chosen at a given time step. The dependence on the current input ξ is only through the activations x and y; the corresponding averages can be performed analytically for g(x) = erf(x/√2), resulting in a set of coupled first-order differential equations [3]. These dynamical equations are exact, and provide a novel tool used here to analyze the learning process for a general soft-committee machine with an arbitrary number K of hidden units, trained to implement a task defined through a teacher of similar architecture except for the number M of hidden units. In what follows we focus on uncorrelated teacher vectors of unit length, T_nm = δ_nm.
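The exact macroscopic dynamics above can be cross-checked against a direct microscopic simulation. The sketch below is not from the paper; all parameter choices (N = 300, K = M = 2, η = 0.5, number of examples) are illustrative assumptions. It trains a soft committee machine on-line with the update rule (3) on zero-mean, unit-variance Gaussian inputs against an isotropic teacher (T_nm = δ_nm), and monitors the generalization error through the overlap formula (2):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
N, K, M, eta = 300, 2, 2, 0.5   # input dim., student/teacher hidden units, learning rate (illustrative)

g = np.vectorize(lambda u: math.erf(u / math.sqrt(2.0)))     # g(x) = erf(x / sqrt(2))
def g_prime(u):                                              # g'(x) = sqrt(2/pi) exp(-x^2 / 2)
    return math.sqrt(2.0 / math.pi) * np.exp(-u * u / 2.0)

# Isotropic teacher: orthonormal vectors B_n, so that T_nm = delta_nm
B = np.linalg.qr(rng.standard_normal((N, M)))[0].T
J = rng.standard_normal((K, N)) / math.sqrt(N)               # random initial student vectors

def generalization_error(J, B):
    """Eq. (2): eps_g in terms of the overlaps Q = J J^T, R = J B^T, T = B B^T."""
    Q, R, T = J @ J.T, J @ B.T, B @ B.T
    def s(C, a, b):  # sum_ab arcsin( C_ab / sqrt((1 + a_a)(1 + b_b)) )
        return np.arcsin(C / np.sqrt(np.outer(1.0 + a, 1.0 + b))).sum()
    dQ, dT = np.diag(Q), np.diag(T)
    return (s(Q, dQ, dQ) + s(T, dT, dT) - 2.0 * s(R, dQ, dT)) / math.pi

eps_initial = generalization_error(J, B)
for _ in range(90_000):                              # alpha = p/N plays the role of time
    xi = rng.standard_normal(N)                      # zero mean, unit variance input components
    x, y = J @ xi, B @ xi                            # student and teacher activations
    delta = g_prime(x) * (g(y).sum() - g(x).sum())   # Eq. (3)
    J += (eta / N) * np.outer(delta, xi)             # J_i <- J_i + (eta/N) delta_i xi
eps_final = generalization_error(J, B)
print(eps_initial, eps_final)                        # eps_g should drop substantially
```

The measured ε_g can then be compared with the value obtained by integrating the equations of motion (4) at the same η, including the symmetric plateau that precedes specialization.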
\n\nThe time evolution of the overlaps R_in and Q_ik follows from integrating the equations of motion (4) from initial conditions determined by a random initialization of the student vectors {J_i}. [...] The learning process diverges for η > η_max; exponential convergence of R, S, C, and Q to their optimal values is guaranteed for all learning rates in the range (0, η_max). In this regime the generalization error decays exponentially to ε_g = 0, with a rate controlled by the slowest decay mode. An expansion of ε_g in terms of r = 1 − R, s, c, and q = 1 − Q reveals that, of the leading modes whose eigenvalues are shown in Fig. 2, only the mode associated with λ_1 contributes to the decay of the linear term, while the decay of the second order term is controlled by the mode associated with λ_2 and dominates the convergence if 2λ_2 < λ_1. The learning rate η_opt which guarantees the fastest asymptotic decay of the generalization error follows from λ_1(η_opt) = 2λ_2(η_opt).\n\nThe asymptotic convergence of unrealizable learning is an intrinsically more complicated process that cannot be described in closed analytic form. The asymptotic values of the order parameters and the generalization error depend on the learning rate η; convergence to an optimal solution with minimal generalization error requires η → 0 as α → ∞. Optimal values for the order parameters follow from a small-η analysis, equivalent to neglecting the components of the student vectors orthogonal to the subspace S_B spanned by the teacher vectors. The resulting expansion J_i = Σ_{n=1..M} R_in B_n, with R_ii = R, R_in = S for 1 ≤ n ≤ K, n ≠ i, and R_in = U for K+1 ≤ n ≤ M, leads to\n\nQ = R² + (K−1) S² + (M−K) U² ,    C = 2RS + (K−2) S² + (M−K) U² .    (10)\n\nThe equations of motion for the remaining parameters R, S, and U exhibit a fixed point solution which controls the asymptotic behavior.
This solution cannot be obtained analytically, but numerical results are well approximated to order (1/K³) by\n\nR* = … ,    S* = … ,    U* = … ,    (11)\n\nwhere L ≡ M − K. The corresponding fixed point values Q* and C* follow from Eq. (10). Note that R* is lower than for the realizable case, and that the correlations U* (significant) and S* (weaker) between student vectors and the teacher vectors they do not imitate are not eliminated. The asymptotic generalization error is given by Eq. (12), valid to order (1/K²); note its proportionality to the mismatch L between teacher and student architectures.\n\nLearning at fixed and sufficiently small η results in exponential convergence to an asymptotic solution whose fixed point coordinates are shifted from the values discussed above. The solution is suboptimal; the resulting increase in ε_g from its optimal value (12) is easily obtained to first order in η, and it is also proportional to L. We have investigated convergence to the optimal solution (12) for schedules of the form η(α) = η_0/(α − α_0)^z for the decay of the learning rate. A constant rate η_0 is used for α ≤ α_0; the monotonic decrease of η for α > α_0 is switched on after specialization begins. Asymptotic convergence requires 0 < z ≤ 1; the fastest decay of the generalization error is achieved for z = 1/2.\n\nSpecialization as described here and illustrated in Fig. 1 is a simultaneous process in which each student node acquires a strong correlation with a specific teacher node while its correlations to the other teacher nodes decrease. Such synchronous escape from the symmetric phase is characteristic of learning scenarios where the target task is defined through an isotropic teacher.
In the case of a graded teacher we find that specialization occurs through a sequence of escapes from the symmetric subspace, ordered according to the relevance of the corresponding teacher nodes [3].\n\nAcknowledgement: This work was supported by the EU grant CHRX-CT92-0063.\n\nReferences\n\n[1] G. Cybenko, Math. Control Signals and Systems 2, 303 (1989).\n[2] M. Biehl and H. Schwarze, J. Phys. A 28, 643 (1995).\n[3] D. Saad and S. A. Solla, Phys. Rev. E 52, 4225 (1995).\n", "award": [], "sourceid": 1072, "authors": [{"given_name": "David", "family_name": "Saad", "institution": null}, {"given_name": "Sara", "family_name": "Solla", "institution": null}]}