{"title": "Generalization Dynamics in LMS Trained Linear Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 890, "page_last": 896, "abstract": null, "full_text": "Generalization Dynamics in \n\nLMS Trained Linear Networks \n\nYves Chauvin\u00b7 \n\nPsychology Department \n\nStanford University \nStanford, CA 94305 \n\nAbstract \n\nFor a simple linear case, a mathematical analysis of the training and gener(cid:173)\nalization (validation) performance of networks trained by gradient descent \non a Least Mean Square cost function is provided as a function of the learn(cid:173)\ning parameters and of the statistics of the training data base. The analysis \npredicts that generalization error dynamics are very dependent on a pri(cid:173)\nori initial weights. In particular, the generalization error might sometimes \nweave within a computable range during extended training. In some cases, \nthe analysis provides bounds on the optimal number of training cycles for \nminimal validation error. For a speech labeling task, predicted weaving \neffects were qualitatively tested and observed by computer simulations in \nnetworks trained by the linear and non-linear back-propagation algorithm. \n\n1 \n\nINTRODUCTION \n\nRecent progress in network design demonstrates that non-linear feedforward neu(cid:173)\nral networks can perform impressive pattern classification for a variety of real-world \napplications (e.g., Le Cun et al., 1990; Waibel et al., 1989). Various simulations and \nrelationships between the neural network and machine learning theoretical litera(cid:173)\ntures also suggest that too large a number of free parameters (\"weight overfitting\") \ncould substantially reduce generalization performance. (e.g., Baum, 1989 1989). \n\nA number of solutions have recently been proposed to decrease or eliminate the \noverfitting problem in specific situations. 
They range from ad hoc heuristics to theoretical considerations (e.g., Le Cun et al., 1990; Chauvin, 1990a; Weigend et al., In Press). \n\n*Also with Thomson-CSF, Inc., 630 Hansen Way, Suite 250, Palo Alto, CA 94304. \n\nFor a phoneme labeling application, Chauvin showed that the overfitting phenomenon was actually observed only when networks were overtrained far beyond their \"optimal\" performance point (Chauvin, 1990b). Furthermore, generalization performance of networks seemed to be independent of the size of the network during early training, but the rate of decrease in performance with overtraining was indeed related to the number of weights. \n\nThe goal of this paper is to better understand training and generalization error dynamics in Least-Mean-Square trained linear networks. As we will see, gradient descent training on linear networks can actually generate surprisingly rich and insightful validation dynamics. Furthermore, in numerous applications, even non-linear networks tend to function in their linear range, as if the networks were making use of non-linearities only when necessary (Weigend et al., In Press; Chauvin, 1990a). In Section 2, I present a theoretical illustration yielding a better understanding of training and validation error dynamics. In Section 3, numerical solutions to the obtained analytical results make interesting predictions for validation dynamics under overtraining. These predictions are tested for a phonemic labeling task. The obtained simulations suggest that the results of the analysis obtained with the simple theoretical framework of Section 2 might remain qualitatively valid for non-linear complex architectures. 
\n\n2 THEORETICAL ILLUSTRATION \n\n2.1 ASSUMPTIONS \n\nLet us consider a linear network composed of n input units and n output units fully connected by an n x n weight matrix W. Let us suppose the network is trained to reproduce a noiseless output \"signal\" from a noisy input \"signal\" (the network can be seen as a linear filter). We write F for the \"signal\", N for the noise, X for the input, Y for the output, and D for the desired output. For the considered case, we have X = F + N, Y = WX and D = F. \n\nThe statistical properties of the data base are the following. The signal is zero-mean with covariance matrix C_F. We write λ_i and e_i for the eigenvalues and eigenvectors of C_F (the e_i are the so-called principal components; we will call the λ_i the \"signal power spectrum\"). The noise is assumed to be zero-mean, with covariance matrix C_N = v.I, where I is the identity matrix. We assume the noise is uncorrelated with the signal: C_FN = 0. We suppose two sets of patterns have been sampled, one for training and one for validation. We write C_F^t, C_N^t and C_FN^t for the resulting covariance matrices of the training set and C_F^v, C_N^v and C_FN^v for the corresponding matrices of the validation set. We assume C_F^t ≈ C_F^v ≈ C_F, C_FN^t ≈ C_FN^v ≈ C_FN = 0, C_N^t = v.I and C_N^v = v'.I with v' > v. (Numerous of these assumptions are made for the sake of clarity of explanation: they can be relaxed without changing the resulting implications.) \n\nThe problem considered is much simpler than typical realistic applications. However, we will see below that (i) a formal analysis becomes complex very quickly, (ii) the validation dynamics are rich, insightful and can be mapped to a number of results observed in simulations of realistic applications, and (iii) an interesting number of predictions can be obtained. 
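As a concrete instance of these assumptions, the following sketch builds a signal covariance C_F with a chosen eigenvalue spectrum, adds isotropic noise of power v, and checks that the sampled input covariance matches C_F + v.I. The dimensions, eigenvalues and sample count here are illustrative choices for this sketch, not values taken from the paper.

```python
import numpy as np

# Illustrative instantiation of the Section 2.1 data model (all numbers
# below are assumptions for this sketch, not taken from the paper).
rng = np.random.default_rng(0)
n = 4
lam = np.array([17.0, 5.0, 1.7, 0.5])   # assumed signal 'power spectrum'
v = 2.0                                 # noise power

# Orthonormal eigenvectors e_i via a QR decomposition of a random matrix.
E, _ = np.linalg.qr(rng.normal(size=(n, n)))
C_F = E @ np.diag(lam) @ E.T            # signal covariance

P = 100_000
F = rng.multivariate_normal(np.zeros(n), C_F, size=P)   # signal patterns
N = rng.normal(scale=np.sqrt(v), size=(P, n))           # uncorrelated noise
X = F + N                               # network input
D = F                                   # desired (noiseless) output

# The sample input covariance should approximate C_F + v.I.
C_X = X.T @ X / P
print(np.allclose(C_X, C_F + v * np.eye(n), atol=0.5))
```

With this many samples the training and validation covariances of independent draws are both close to the population values, which is exactly the regime the assumptions above describe.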
\n\n2.2 LEARNING \n\nThe network is trained by gradient descent on the Least Mean Square (LMS) error: dW = -η∇_W E, where η is the usual learning rate and, in the case considered, E = Σ_p (F_p - Y_p)^T (F_p - Y_p). We can write the gradient as a function of the various covariance matrices: ∇_W E = (I - W)C_F + (I - 2W)C_FN - WC_N. From the general assumptions, we get: \n\n∇_W E ≈ C_F - WC_F - WC_N \n\n(1) \n\nWe assume now that the principal components e_i are also eigenvectors of the weight matrix W at iteration k, with corresponding eigenvalues α_ik: W_k.e_i = α_ik.e_i. We can then compute the image of each eigenvector e_i at iteration k + 1: \n\nW_k+1.e_i = ηλ_i.e_i + α_ik[1 - η(λ_i + v)].e_i \n\n(2) \n\nTherefore, e_i is also an eigenvector of W_k+1 and α_i,k+1 satisfies the induction: \n\nα_i,k+1 = ηλ_i + α_ik[1 - η(λ_i + v)] \n\n(3) \n\nAssuming W_0 = 0, we can compute the alpha-dynamics of the weight matrix W: \n\nα_ik = (λ_i / (λ_i + v)) [1 - (1 - η(λ_i + v))^k] \n\n(4) \n\nAs k goes to infinity, provided η < 1/(λ_M + v) (λ_M being the largest signal eigenvalue), α_ik approaches λ_i/(λ_i + v), which corresponds to the optimal (Wiener) value of the linear filter implemented by the network. We will write the convergence rates a_i = 1 - ηλ_i - ηv. These rates depend on the signal \"power spectrum\", on the noise power and on the learning rate η. If we now assume W_0.e_i = α_i0.e_i with α_i0 ≠ 0 (this assumption can be made more general), we get: \n\nα_ik = (λ_i / (λ_i + v)) (1 - b_i a_i^k) \n\n(5) \n\nwhere b_i = 1 - α_i0 - α_i0 v/λ_i. Figure 1 represents possible alpha-dynamics for arbitrary values of λ_i with α_i0 = α_0 ≠ 0. \n\nWe can now compute the learning error dynamics by expanding the LMS error term E at time k. Using the general assumptions on the covariance matrices, we find: \n\nE_k = Σ_i E_ik = Σ_i [λ_i (1 - α_ik)^2 + v α_ik^2] \n\n(6) \n\nTherefore, the training error is a sum of error components, each of them being a quadratic function of α_i. Figure 2 represents a training error component E_i as a function of α_i. 
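The induction of Equation 3 and its fixed point can be checked numerically. The sketch below iterates the recursion and compares the limit against both the closed form of Equation 4 and the Wiener value λ_i/(λ_i + v); the particular λ, v and η values are illustrative choices for this sketch.

```python
# Iterate the alpha-recursion (Eq. 3):
#   alpha_{k+1} = eta*lam + alpha_k * (1 - eta*(lam + v))
# and compare against the closed form (Eq. 4) and the Wiener limit.
# lam, v, eta below are illustrative values, not taken from the paper.

def alpha_iterate(lam, v, eta, alpha0=0.0, steps=2000):
    a = alpha0
    for _ in range(steps):
        a = eta * lam + a * (1.0 - eta * (lam + v))
    return a

lam, v, eta, k = 17.0, 2.0, 0.01, 2000   # eta < 1/(lam + v): convergence
alpha_k = alpha_iterate(lam, v, eta, steps=k)
closed_form = lam / (lam + v) * (1.0 - (1.0 - eta * (lam + v))**k)  # Eq. 4
wiener = lam / (lam + v)                 # optimal (Wiener) filter gain

print(abs(alpha_k - closed_form) < 1e-9, abs(alpha_k - wiener) < 1e-6)
# -> True True
```

Larger λ_i converge faster (smaller rate a_i = 1 - η(λ_i + v)), which is the spread of curves shown in Figure 1.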
Knowing the alpha-dynamics, we can write these error components as a function of k: \n\nE_ik = (λ_i / (λ_i + v)) (v + λ_i b_i^2 a_i^(2k)) \n\n(7) \n\nIt is easy to see that E is a monotonic decreasing function (generated by gradient descent) which converges to the bottom of the quadratic error surface, yielding the residual asymptotic error: \n\nE_∞ = Σ_i λ_i v / (λ_i + v) \n\n(8) \n\nFigure 1: Alpha-dynamics for different values of λ_i with η = .01 and α_i0 = α_0 ≠ 0. The solid lines represent the optimal values of α_i for the training data set. The dashed lines represent the corresponding optimal values for the validation data set. \n\nFigure 2: Training and validation error dynamics as a function of α_i. The dashed curved lines represent the error dynamics for the initial conditions α_i0. Each training error component follows the gradient of a quadratic learning curve (bottom). Note the overtraining phenomenon (top curve) between α_i* (optimal for validation) and α_i∞ (optimal for training). \n\n2.3 GENERALIZATION \n\nConsidering the general assumptions on the statistics of the data base, we can compute the validation error E'. (Note that \"validation error\" strictly applies to the validation data set. \"Generalization error\" can qualify the validation data set or the whole population, depending on context.) \n\nE'_k = Σ_i E'_ik = Σ_i [λ_i (1 - α_ik)^2 + v' α_ik^2] \n\n(9) \n\nwhere the alpha-dynamics are imposed by gradient descent learning on the training data set. Again, the validation error is a sum of error components E'_i, quadratic functions of α_i. 
However, because the alpha-dynamics are adapted to the training sample, they might generate complex dynamics which strongly depend on the initial values α_i0 (Figure 1). Consequently, the resulting error components E'_i are not monotonic decreasing functions anymore. As seen in Figure 2, each of the validation error components might (i) decrease, (ii) decrease then increase (overtraining), or (iii) increase, as a function of α_i0. For each of these components, in the case of overtraining, it is possible to compute the number of cycles k_i* at which training should be stopped to get minimal validation error: \n\nk_i* = [Log(λ_i(v' - v)) - Log(λ_i + v') - Log(λ_i - α_i0(λ_i + v))] / Log(1 - ηλ_i - ηv) \n\n(10) \n\nHowever, the validation error dynamics become much more complex when we consider sums of these components. If we assume α_i0 = 0, the minimum (or minima) of E' can be found to correspond to possible intersections of hyper-ellipsoids and power curves. In general, it is possible to show that there exists at least one such minimum. It is also possible to find simple bounds on the optimal training time for minimal validation error: \n\nmin_i k_i* ≤ k_opt ≤ max_i k_i* \n\n(11) \n\nThese bounds are tight when the noise power is small compared to the signal \"power spectrum\". For α_i0 ≠ 0, a formal analysis of the validation error dynamics becomes intractable. Because some error components might increase while others decrease, it is possible to imagine multiple minima and maxima for the total validation error (see simulations below). Considering each component's dynamics, it is nonetheless possible to compute bounds within which E' might vary during training: \n\nΣ_i λ_i v' / (λ_i + v') ≤ E'_k ≤ Σ_i λ_i (v^2 + v' λ_i) / (λ_i + v)^2 
\n\n(12) \n\nBecause of the \"exponential\" nature of training (Figure 1), it is possible to imagine that this \"weaving\" effect might still be observed after a long training period, when the training error itself has become stable. Furthermore, whereas the training error will qualitatively show the same dynamics, the validation error will very much depend on α_i0: for sufficiently large initial weights, validation dynamics might be very dependent on particular simulation \"runs\". \n\nFigure 3: Training (bottom curves) and validation (top curves) error dynamics in a two-dimensional case for λ_1 = 17, λ_2 = 1.7, v = 2, v' = 10, α_10 = 0, as α_20 varies from 0 to 1.6 (bottom-up) in .2 increments. \n\n3 SIMULATIONS \n\n3.1 CASE STUDY \n\nEquations 7 and 9 were simulated for a two-dimensional case (n = 2) with λ_1 = 17, λ_2 = 1.7, v = 2, v' = 10 and α_10 = 0. The values of α_20 determined the relative dominance of the two error components during training. Figure 3 represents training and validation dynamics as a function of k for a range of values of α_20. As shown analytically, the training dynamics are basically unaffected by the initial conditions of the weight matrix W_0. However, a variety of validation dynamics can be observed as α_20 varies from 0 to 1.6. For 1.4 ≤ α_20 ≤ 1.6, the validation error is monotonically decreasing and looks like a typical \"gradient descent\" training error. For 1.0 ≤ α_20 ≤ 1.2, each error component in turn imposes a descent rate: the validation error looks like two \"connected descents\". For .6 ≤ α_20 ≤ .8, E'_2 is monotonically decreasing with a slow convergence rate, forcing the validation error to decrease long after E'_1 has become stable. This creates a minimum, followed by a maximum, followed by a minimum for E'. 
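This behavior can be reproduced directly from the closed forms. The sketch below evaluates Equations 5, 6 and 9 for the stated case-study parameters (λ_1 = 17, λ_2 = 1.7, v = 2, v' = 10, α_10 = 0); the learning rate η = .01 and the single value α_20 = .8 are assumptions made for this sketch. It verifies that the training error decreases monotonically while the validation error does not.

```python
# Training error E_k (Eq. 6) and validation error E'_k (Eq. 9) computed
# from the closed-form alpha-dynamics (Eq. 5) for the two-dimensional
# case study. eta = .01 and alpha_20 = .8 are assumptions for this sketch.

lams = [17.0, 1.7]
v, v_val, eta = 2.0, 10.0, 0.01
alpha0 = [0.0, 0.8]                      # alpha_10 = 0, alpha_20 = .8

def errors(k):
    E, E_val = 0.0, 0.0
    for lam, a0 in zip(lams, alpha0):
        rate = 1.0 - eta * (lam + v)     # convergence rate a_i
        b = 1.0 - a0 - a0 * v / lam      # b_i from Eq. 5
        alpha = lam / (lam + v) * (1.0 - b * rate**k)
        E += lam * (1.0 - alpha)**2 + v * alpha**2          # Eq. 6
        E_val += lam * (1.0 - alpha)**2 + v_val * alpha**2  # Eq. 9
    return E, E_val

train = [errors(k)[0] for k in range(2000)]
valid = [errors(k)[1] for k in range(2000)]

train_monotonic = all(e2 <= e1 + 1e-12 for e1, e2 in zip(train, train[1:]))
valid_rises = any(e2 > e1 for e1, e2 in zip(valid, valid[1:]))
print(train_monotonic, valid_rises)
# -> True True
```

Sweeping alpha_20 over 0 to 1.6 in this script reproduces the family of validation curves of Figure 3.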
Finally, for 0 ≤ α_20 ≤ .4, both error components have a single minimum during training and generate a single minimum for the total validation error E'. \n\n3.2 PHONEMIC LABELING \n\nOne of the main predictions obtained from the analytical results and from the previous case study is that validation dynamics can demonstrate multiple local minima and maxima. To my knowledge, this phenomenon has not been described in the literature. However, the theory also predicts that the phenomenon will probably appear very late in training, well after the training error has become stable, which might explain the absence of such observations. The predictions were tested for a phonemic labeling task with spectrograms as input patterns and phonemes as output patterns. Various architectures were tested (direct connections or back-propagation networks with linear or non-linear hidden layers). Due to the limited length of this article, the complete simulations will be reported elsewhere. In all cases, as predicted, multiple minima/maxima were observed for the validation dynamics, provided the networks were trained way beyond usual training times. Furthermore, these generalization dynamics were very dependent on the initial weights (provided sufficient variance on the initial weight distribution). \n\n4 DISCUSSION \n\nIt is sometimes assumed that optimal learning is obtained when the validation error starts to increase during the course of training. Although for the theoretical study presented here the first minimum of E' is probably always a global minimum, independently of α_i0, simulations of the speech labeling task show this is not always the case with more complex architectures: late validation minima can sometimes (albeit rarely) be deeper than the first \"local\" minimum. 
These observations, and a lack of theoretical understanding of statistical inference with limited data sets, raise the question of the significance of a validation data set. As a final comment, we are not really interested in minimal validation error (E') but in minimal generalization error over the whole population. Understanding the dynamics of the \"population\" error as a function of training and validation errors necessitates, at least, an evaluation of the sample statistics as a function of the number of training and validation patterns. This is beyond the scope of this paper. \n\nAcknowledgements \n\nThanks to Pierre Baldi and Julie Holmes for their helpful comments. \n\nReferences \n\nBaum, E. B. & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151-160. \n\nChauvin, Y. (1990a). Dynamic behavior of constrained back-propagation networks. In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 642-649). San Mateo, CA: Morgan Kaufmann. \n\nChauvin, Y. (1990b). Generalization performance of overtrained back-propagation networks. In L. B. Almeida & C. J. Wellekens (Eds.), Lecture Notes in Computer Science (Vol. 412) (pp. 46-55). Berlin, Germany: Springer-Verlag. \n\nLe Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 396-404). San Mateo, CA: Morgan Kaufmann. \n\nWaibel, A., Sawai, H., & Shikano, K. (1989). Modularity and scaling in large phonemic neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-37, 1888-1898. \n\nWeigend, A. S., Huberman, B. A., & Rumelhart, D. E. (In Press). Predicting the future: a connectionist approach. International Journal of Neural Systems. 
\n\n\f", "award": [], "sourceid": 348, "authors": [{"given_name": "Yves", "family_name": "Chauvin", "institution": null}]}