{"title": "Dynamics of Generalization in Linear Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 903, "abstract": null, "full_text": "Dynamics of Generalization in Linear Perceptrons \n\nAnders Krogh \n\nNiels Bohr Institute \n\nBlegdamsvej 17 \n\nJohn A. Hertz \n\nNORDITA \n\nBlegdamsvej 17 \n\nDK-2100 Copenhagen, Denmark \n\nDK-2100 Copenhagen, Denmark \n\nAbstract \n\nWe study the evolution of the generalization ability of a simple linear per(cid:173)\nceptron with N inputs which learns to imitate a \"teacher perceptron\". The \nsystem is trained on p = aN binary example inputs and the generaliza(cid:173)\ntion ability measured by testing for agreement with the teacher on all 2N \npossible binary input patterns. The dynamics may be solved analytically \nand exhibits a phase transition from imperfect to perfect generalization \nat a = 1. Except at this point the generalization ability approaches its \nasymptotic value exponentially, with critical slowing down near the tran(cid:173)\nsition; the relaxation time is ex (1 - y'a)-2. Right at the critical point, \n1 \nthe approach to perfect generalization follows a power law ex t - '2. \nIn \nthe presence of noise, the generalization ability is degraded by an amount \nex (va - 1)-1 just above a = 1. \n\n1 \n\nINTRODUCTION \n\nIt is very important in practical situations to know how well a neural network will \ngeneralize from the examples it is trained on to the entire set of possible inputs. This \nproblem is the focus of a lot of recent and current work [1-11]. All this work, how(cid:173)\never, deals with the asymptotic state of the network after training. Here we study \na very simple model which allows us to follow the evolution of the generalization \nability in time under training. It has a single linear output unit, and the weights \nobey adaline learning. 
Despite its simplicity, it exhibits nontrivial behaviour: a dynamical phase transition at a critical number of training examples, with power-law decay right at the transition point and critical slowing down as one approaches it from either side. \n\n2 THE MODEL \n\nOur simple linear neuron has an output $V = N^{-1/2} \\sum_i w_i \\xi_i$, where $\\xi_i$ is the $i$th input. It learns to imitate a teacher [1] whose weights are $u_i$ by training on $p$ examples of input-output pairs $(\\xi_i^\\mu, \\zeta^\\mu)$ with \n\n$$\\zeta^\\mu = \\frac{1}{\\sqrt{N}} \\sum_i u_i \\xi_i^\\mu \\qquad (1)$$ \n\ngenerated by the teacher. The adaline learning equation [11] is then \n\n$$\\dot{w}_i = \\frac{1}{\\sqrt{N}} \\sum_{\\mu=1}^{p} \\Big( \\zeta^\\mu - \\frac{1}{\\sqrt{N}} \\sum_j w_j \\xi_j^\\mu \\Big) \\xi_i^\\mu = \\frac{1}{N} \\sum_{\\mu=1}^{p} \\sum_j (u_j - w_j) \\xi_j^\\mu \\xi_i^\\mu. \\qquad (2)$$ \n\nBy introducing the difference between the teacher and the pupil, \n\n$$v_i = u_i - w_i, \\qquad (3)$$ \n\nand the training input correlation matrix \n\n$$A_{ij} = \\frac{1}{N} \\sum_{\\mu=1}^{p} \\xi_i^\\mu \\xi_j^\\mu, \\qquad (4)$$ \n\nthe learning equation becomes \n\n$$\\dot{v}_i = -\\sum_j A_{ij} v_j. \\qquad (5)$$ \n\nWe let the example inputs $\\xi_i^\\mu$ take the values $\\pm 1$, randomly and independently, but it is straightforward to generalize to any distribution of inputs with $\\langle \\xi_i^\\mu \\xi_j^\\nu \\rangle \\propto \\delta_{ij}\\delta^{\\mu\\nu}$. For a large number of examples ($p = O(N) \\gg 1$), the resulting generalization ability will be independent of just which $p$ of the $2^N$ possible binary input patterns we choose. All our results will then depend only on the fact that we can calculate the spectrum of the matrix $A$. \n\n3 GENERALIZATION ABILITY \n\nTo measure the generalization ability, we test whether the output of our perceptron with weights $w_i$ agrees with that of the teacher with weights $u_i$ on all possible binary inputs. Our objective function, which we call the generalization error, is just the square of the error, averaged over all these inputs: \n\n$$F = \\frac{1}{2^N} \\sum_{\\{\\sigma\\}} \\Big( \\frac{1}{\\sqrt{N}} \\sum_i v_i \\sigma_i \\Big)^2 = \\frac{1}{N} \\sum_i v_i^2. \\qquad (6)$$ \n\n(We used that $2^{-N} \\sum_{\\{\\sigma\\}} \\sigma_i \\sigma_j$ is zero unless $i = j$.) That is, $F$ is just proportional to the square of the difference between the teacher and pupil weight vectors. With the $N^{-1}$ normalization factor, $F$ will vary between 1 (tabula rasa) and 0 (perfect generalization) if we normalize $u$ to length $\\sqrt{N}$. During learning, $w_i$ and thus $v_i$ depend on time, so $F$ is a function of $t$. The complementary quantity $1 - F(t)$ could be called the generalization ability. \n\nIn the basis where $A$ is diagonal, the learning equation (5) is simply \n\n$$\\dot{v}_r = -\\epsilon_r v_r, \\qquad (7)$$ \n\nwhere $\\epsilon_r$ are the eigenvalues of $A$. This has the solution \n\n$$v_r(t) = v_r(0) e^{-\\epsilon_r t} = u_r e^{-\\epsilon_r t}, \\qquad (8)$$ \n\nwhere it is assumed that the weights are zero at time $t = 0$ (we will come back to the more general case later). Thus we find \n\n$$F(t) = \\frac{1}{N} \\sum_r v_r^2(t) = \\frac{1}{N} \\sum_r u_r^2 e^{-2\\epsilon_r t}. \\qquad (9)$$ \n\nAveraging over all possible training sets of size $p$, this can be expressed in terms of the density of eigenvalues of $A$, $\\rho(\\epsilon)$: \n\n$$F(t) = \\frac{|u|^2}{N} \\int d\\epsilon \\, \\rho(\\epsilon) \\, e^{-2\\epsilon t}. \\qquad (10)$$ \n\nIn the following it will be assumed that the length of $u$ is normalized to $\\sqrt{N}$, so the prefactor disappears. \n\nFor large $N$, the eigenvalue density is (see, e.g., [11], where it can be obtained simply from the imaginary part of the Green's function in eq. (57)) \n\n$$\\rho(\\epsilon) = \\frac{1}{2\\pi\\epsilon} \\sqrt{(\\epsilon_+ - \\epsilon)(\\epsilon - \\epsilon_-)} + (1 - \\alpha)\\,\\theta(1 - \\alpha)\\,\\delta(\\epsilon), \\qquad (11)$$ \n\nwhere \n\n$$\\epsilon_\\pm = (1 \\pm \\sqrt{\\alpha})^2 \\qquad (12)$$ \n\nand $\\theta(\\cdot)$ is the unit step function. The density has two terms: a 'deformed semicircle' between the roots $\\epsilon_-$ and $\\epsilon_+$, and for $\\alpha < 1$ a delta function at $\\epsilon = 0$ with weight $1 - \\alpha$. The delta-function term appears because no learning takes place in the subspace orthogonal to that spanned by the training patterns. For $\\alpha > 1$ the patterns span the whole space, and therefore the delta function is absent. \n\nThe results at infinite time are immediately evident. 
For $\\alpha < 1$ there is a nonzero limit, $F(\\infty) = 1 - \\alpha$, while $F(\\infty)$ vanishes for $\\alpha > 1$, indicating perfect generalization (the solid line in Figure 1). While on the one hand it may seem remarkable that perfect generalization can be obtained from a training set which forms an infinitesimal fraction of the entire set of possible examples, the meaning of the result is just that $N$ points are sufficient to determine an $(N-1)$-dimensional hyperplane in $N$ dimensions. \n\nFigure 2 shows $F(t)$ as obtained numerically from (10) and (11). The qualitative form of the approach to $F(\\infty)$ can be obtained analytically by inspection. For $\\alpha \\neq 1$, the asymptotic approach is governed by the smallest nonzero eigenvalue $\\epsilon_-$. Thus we have critical slowing down, with a divergent relaxation time \n\n$$\\tau = \\frac{1}{\\epsilon_-} = \\frac{1}{|\\sqrt{\\alpha} - 1|^2} \\qquad (13)$$ \n\nas the transition at $\\alpha = 1$ is approached. \n\nFigure 1: The asymptotic generalization error as a function of $\\alpha$. The full line corresponds to $\\lambda = 0$, the dashed line to $\\lambda = 0.2$, and the dotted line to $w_0 = 1$ and $\\lambda = 0$. \n\nRight at the critical point, the eigenvalue density diverges for small $\\epsilon$ like $\\epsilon^{-1/2}$, which leads to the power law \n\n$$F(t) \\propto \\frac{1}{\\sqrt{t}} \\qquad (14)$$ \n\nat long times. Thus, while exactly $N$ examples are sufficient to produce perfect generalization, the approach to this desirable state is rather slow. A little bit above $\\alpha = 1$, $F(t)$ will also follow this power law for times $t \\ll \\tau$, going over to (slow) exponential decay at very long times ($t > \\tau$). By increasing the training set size well above $N$, one can achieve exponentially fast generalization. 
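The critical slowing down is set entirely by the lower spectral edge $\epsilon_- = (1 - \sqrt{\alpha})^2$. As a quick numerical sanity check (our addition; function names and sizes are arbitrary), this prediction can be compared with the smallest nonzero eigenvalue of an empirical correlation matrix $A$:

```python
import numpy as np

rng = np.random.default_rng(1)

def min_nonzero_eig(N, alpha):
    # Smallest nonzero eigenvalue of A = (1/N) sum_mu xi^mu (xi^mu)^T.
    # For p <= N the nonzero eigenvalues of A coincide with those of the
    # p x p Gram matrix (1/N) xi xi^T, which is cheaper to diagonalize.
    p = int(alpha * N)
    xi = rng.choice([-1.0, 1.0], size=(p, N))
    G = (xi @ xi.T if p <= N else xi.T @ xi) / N
    return np.linalg.eigvalsh(G).min()

edges = {}
for alpha in (0.25, 4.0):
    predicted = (1.0 - np.sqrt(alpha)) ** 2  # eps_- from eq. (12)
    empirical = min_nonzero_eig(800, alpha)
    edges[alpha] = (predicted, empirical)
    print(alpha, predicted, empirical)
```

The agreement improves with $N$; via (13), these edges correspond to relaxation times $\tau \approx 4$ for $\alpha = 0.25$ and $\tau \approx 1$ for $\alpha = 4$ in these units.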
\nBelow $\\alpha = 1$, where perfect generalization is never achieved, there is at least the consolation that the approach to the generalization level the network does reach is exponential (though with the same problem of a long relaxation time just below the transition as just above it). \n\n4 EXTENSIONS \n\nIn this section we briefly discuss some extensions of the foregoing calculation. We will see what happens if the weights are nonzero at $t = 0$, discuss weight decay, and finally consider noise in the learning process. \n\nWeight decay is a simple and frequently used way to limit the growth of the weights, which might be desirable for several reasons. It is also possible to approximate the problem with binary weights using a weight decay term (the so-called spherical model, see [11]). We consider the simplest kind of weight decay, which enters as an additive term, $-\\lambda w_i = -\\lambda(u_i - v_i)$, in the learning equation (2), so the equation (5) for the difference between teacher and pupil is now \n\n$$\\dot{v}_i = -\\sum_j A_{ij} v_j + \\lambda(u_i - v_i) = -\\sum_j (A_{ij} + \\lambda\\delta_{ij}) v_j + \\lambda u_i. \\qquad (15)$$ \n\nFigure 2: The generalization error as a function of time for several different $\\alpha$ (curves labeled $\\alpha = 0.8$, $1.0$, and $1.2$). \n\nApart from the last term, this just shifts the eigenvalue spectrum by $\\lambda$. In the basis where $A$ is diagonal we can again write down the general solution to this equation: \n\n$$v_r(t) = \\frac{\\lambda u_r}{\\epsilon_r + \\lambda} \\left(1 - e^{-(\\epsilon_r + \\lambda)t}\\right) + v_r(0)\\, e^{-(\\epsilon_r + \\lambda)t}. \\qquad (16)$$ \n\nThe square of this is \n\n$$v_r^2 = u_r^2 \\left[ \\frac{\\lambda \\left(1 - e^{-(\\epsilon_r + \\lambda)t}\\right)}{\\epsilon_r + \\lambda} + e^{-(\\epsilon_r + \\lambda)t} - \\frac{w_r(0)}{u_r}\\, e^{-(\\epsilon_r + \\lambda)t} \\right]^2. \\qquad (17)$$ \n\nAs in (10), this has to be integrated over the eigenvalue spectrum to find the averaged generalization error. Assuming that the initial weights are random, so that $\\overline{w_r(0)} = 0$, and that they have a relative variance given by \n\n$$\\overline{w_r(0)^2} = w_0^2\\, u_r^2, \\qquad (18)$$ \n\nthe average of $F(t)$ over the distribution of initial conditions now becomes \n\n$$F(t) = \\int d\\epsilon \\, \\rho(\\epsilon) \\left[ \\left( \\frac{\\lambda \\left(1 - e^{-(\\epsilon + \\lambda)t}\\right)}{\\epsilon + \\lambda} + e^{-(\\epsilon + \\lambda)t} \\right)^2 + w_0^2\\, e^{-2(\\epsilon + \\lambda)t} \\right]. \\qquad (19)$$ \n\n(Again it is assumed that the length of $u$ is $\\sqrt{N}$.) \n\nFor $\\lambda = 0$ we see the result is the same as before except for a factor $1 + w_0^2$ in front of the integral. This means that the asymptotic generalization error is now \n\n$$F(\\infty) = \\begin{cases} (1 + w_0^2)(1 - \\alpha) & \\text{for } \\alpha < 1 \\\\ 0 & \\text{for } \\alpha > 1, \\end{cases} \\qquad (20)$$ \n\nwhich is shown as a dotted line in Figure 1 for $w_0 = 1$. The excess error can easily be understood as a contribution to the error from the non-relaxing part of the initial weight vector in the subspace orthogonal to the space spanned by the patterns. The relaxation times are unchanged for $\\lambda = 0$. \n\nFor $\\lambda > 0$ the relaxation times become finite even at $\\alpha = 1$, because the smallest eigenvalue is shifted by $\\lambda$, so (13) is now \n\n$$\\tau = \\frac{1}{\\epsilon_- + \\lambda} = \\frac{1}{|\\sqrt{\\alpha} - 1|^2 + \\lambda}. \\qquad (21)$$ \n\nIn this case the asymptotic error can easily be obtained numerically from (19), and is shown by the dashed line in Figure 1. It is smaller than for $\\lambda = 0$ for $w_0^2 > 1$ at sufficiently small $\\alpha$. This is simply because the weight decay makes the part of $w(0)$ orthogonal to the pattern space decay away exponentially, thereby eliminating the excess error due to large initial weight components in this subspace. \n\nThis phase transition is very sensitive to noise. Consider adding a noise term $\\eta_i(t)$ to the right-hand side of (2), with \n\n$$\\langle \\eta_i(t)\\, \\eta_j(t') \\rangle = 2T\\, \\delta_{ij}\\, \\delta(t - t'). \\qquad (22)$$ \n\nHere we restrict our attention to the case $\\lambda = 0$. Carrying the extra term through the succeeding manipulations leads, in place of (7), to \n\n$$\\dot{v}_r = -\\epsilon_r v_r + \\eta_r(t). \\qquad (23)$$ \n\nThe additional term leads to a correction (after Fourier transforming) \n\n$$\\delta v_r(\\omega) = \\frac{\\eta_r(\\omega)}{-i\\omega + \\epsilon_r} \\qquad (24)$$ \n\nand thus to an extra (time-independent) piece of the generalization error $F(t)$: \n\n$$\\delta F = \\frac{1}{N} \\sum_r \\int \\frac{d\\omega}{2\\pi} \\frac{\\langle |\\eta_r(\\omega)|^2 \\rangle}{|-i\\omega + \\epsilon_r|^2} = \\frac{T}{N} \\sum_r \\frac{1}{\\epsilon_r}. \\qquad (25)$$ \n\nFor $\\alpha > 1$, where there are no zero eigenvalues, we have \n\n$$\\delta F = T \\int_{\\epsilon_-}^{\\epsilon_+} d\\epsilon \\, \\frac{\\rho(\\epsilon)}{\\epsilon}, \\qquad (26)$$ \n\nwhich has the large-$\\alpha$ limit $T/\\alpha$, as found in equilibrium analyses (also for threshold perceptrons [2,3,5,6,7,8,9]). Equation (26) gives a generalization error which diverges as one approaches the transition at $\\alpha = 1$: \n\n$$\\delta F \\propto T\\, \\epsilon_-^{-1/2} = \\frac{T}{\\sqrt{\\alpha} - 1}. \\qquad (27)$$ \n\nEquation (25) blows up for $\\alpha < 1$, where some of the $\\epsilon_r$ are zero. This divergence just reflects the fact that in the subspace orthogonal to the training patterns, $v$ feels only the noise and so exhibits a random walk whose variance diverges as $t \\to \\infty$. Keeping more careful track of the dynamics in this subspace leads to \n\n$$\\delta F = 2T(1 - \\alpha)t + T \\int_{\\epsilon_-}^{\\epsilon_+} d\\epsilon \\, \\frac{\\rho(\\epsilon)}{\\epsilon}. \\qquad (28)$$ \n\n5 CONCLUSION \n\nGeneralization in the linear perceptron can be understood in the following picture. To get perfect generalization, the training pattern vectors have to span the whole input space - $N$ points (in general position) are enough to specify any hyperplane. This means that perfect generalization appears only for $\\alpha > 1$. As $\\alpha$ approaches 1 the relaxation time - i.e., the learning time - diverges, signaling a phase transition, as is common in physical systems. Noise has a severe effect on this transition. 
It leads to a degradation of the generalization ability which diverges as one reduces the number of training examples toward the critical number. \n\nThis model is of course much simpler than most real-life training problems. However, it does allow us to examine in detail the dynamical phase transition separating perfect from imperfect generalization. Further extensions of the model can also be solved and will be reported elsewhere. \n\nReferences \n\n[1] Gardner, E. and B. Derrida: Three Unfinished Works on the Optimal Storage Capacity of Networks. Journal of Physics A 22, 1983-1994 (1989). \n\n[2] Schwartz, D.B., V.K. Samalam, S.A. Solla, and J.S. Denker: Exhaustive Learning. Neural Computation 2, 371-382 (1990). \n\n[3] Tishby, N., E. Levin, and S.A. Solla: Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization. Proc. IJCNN Washington 1989, vol. 2, 403-410. Hillsdale: Erlbaum (1989). \n\n[4] Baum, E.B. and D. Haussler: What Size Net Gives Valid Generalization? Neural Computation 1, 151-160 (1989). \n\n[5] Gyorgyi, G. and N. Tishby: Statistical Theory of Learning a Rule. In Neural Networks and Spin Glasses, eds. W.K. Theumann and R. Koeberle. Singapore: World Scientific (1990). \n\n[6] Hansel, D. and H. Sompolinsky: Learning from Examples in a Single-Layer Neural Network. Europhysics Letters 11, 687-692 (1990). \n\n[7] Vallet, F., J. Cailton, and P. Refregier: Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions. Europhysics Letters 9, 315-320 (1989). \n\n[8] Opper, M., W. Kinzel, J. Kleinz, and R. Nehl: On the Ability of the Optimal Perceptron to Generalize. Journal of Physics A 23, L581-L586 (1990). \n\n[9] Levin, E., N. Tishby, and S.A. Solla: A Statistical Approach to Learning and Generalization in Layered Neural Networks. AT&T Bell Labs, preprint (1990). 
\n\n[10] Gyorgyi, G.: Inference of a Rule by a Neural Network with Thermal Noise. Physical Review Letters 64, 2957-2960 (1990). \n\n[11] Hertz, J.A., A. Krogh, and G.I. Thorbergsson: Phase Transitions in Simple Learning. Journal of Physics A 22, 2133-2150 (1989). \n", "award": [], "sourceid": 349, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}