{"title": "Online Learning from Finite Training Sets: An Analytical Case Study", "book": "Advances in Neural Information Processing Systems", "page_first": 274, "page_last": 280}

Online learning from finite training sets:
An analytical case study

Peter Sollich*
Department of Physics
University of Edinburgh
Edinburgh EH9 3JZ, U.K.
P.Sollich@ed.ac.uk

David Barber†
Neural Computing Research Group
Department of Applied Mathematics
Aston University
Birmingham B4 7ET, U.K.
D.Barber@aston.ac.uk

Abstract

We analyse online learning from finite training sets at non-infinitesimal learning rates η. By an extension of statistical mechanics methods, we obtain exact results for the time-dependent generalization error of a linear network with a large number of weights N. We find, for example, that for small training sets of size p ≈ N, larger learning rates can be used without compromising asymptotic generalization performance or convergence speed. Encouragingly, for optimal settings of η (and, less importantly, weight decay λ) at given final learning time, the generalization performance of online learning is essentially as good as that of offline learning.

1 INTRODUCTION

The analysis of online (gradient descent) learning, which is one of the most common approaches to supervised learning found in the neural networks community, has recently been the focus of much attention [1]. The characteristic feature of online learning is that the weights of a network ('student') are updated each time a new training example is presented, such that the error on this example is reduced. In offline learning, on the other hand, the total error on all examples in the training set is accumulated before a gradient descent weight update is made.
* Royal Society Dorothy Hodgkin Research Fellow
† Supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks

Online and offline learning are equivalent only in the limiting case where the learning rate η → 0 (see, e.g., [2]). The main quantity of interest is normally the evolution of the generalization error: How well does the student approximate the input-output mapping ('teacher') underlying the training examples after a given number of weight updates?

Most analytical treatments of online learning assume either that the size of the training set is infinite, or that the learning rate η is vanishingly small. Both of these restrictions are undesirable: in practice, most training sets are finite, and non-infinitesimal values of η are needed to ensure that the learning process converges after a reasonable number of updates. General results have been derived for the difference between online and offline learning to first order in η, which apply to training sets of any size (see, e.g., [2]). These results, however, do not directly address the question of generalization performance. The most explicit analysis of the time evolution of the generalization error for finite training sets was provided by Krogh and Hertz [3] for a scenario very similar to the one we consider below. Their η → 0 (i.e., offline) calculation will serve as a baseline for our work. For finite η, progress has been made in particular for so-called soft committee machine network architectures [4, 5], but only for the case of infinite training sets.

Our aim in this paper is to analyse a simple model system in order to assess how the combination of non-infinitesimal learning rates η and finite training sets (containing α examples per weight) affects online learning.
In particular, we will consider the dependence of the asymptotic generalization error on η and α, the effect of finite α on both the critical learning rate and the learning rate yielding optimal convergence speed, and optimal values of η and weight decay λ. We also compare the performance of online and offline learning and discuss the extent to which infinite training set analyses are applicable for finite α.

2 MODEL AND OUTLINE OF CALCULATION

We consider online training of a linear student network with input-output relation

y = w^T x / √N

Here x is an N-dimensional vector of real-valued inputs, y the single real output and w the weight vector of the network. ^T denotes the transpose of a vector and the factor 1/√N is introduced for convenience. Whenever a training example (x, y) is presented to the network, its weight vector is updated along the gradient of the squared error on this example, i.e.,

Δw = η x (y − w^T x/√N) / √N

where η is the learning rate. We are interested in online learning from finite training sets, where for each update an example is randomly chosen from a given set {(x^μ, y^μ), μ = 1…p} of p training examples. (The case of cyclical presentation of examples [6] is left for future study.) If example μ is chosen for update n, the weight vector is changed to

w_{n+1} = (1 − ηγ/N) w_n + η x^μ (y^μ − w_n^T x^μ/√N) / √N    (1)

Here we have also included a weight decay γ. We will normally parameterize the strength of the weight decay in terms of λ = γα (where α = p/N is the number of examples per weight), which plays the same role as the weight decay commonly used in offline learning [3]. For simplicity, all student weights are assumed to be initially zero, i.e., w_{n=0} = 0.

The main quantity of interest is the evolution of the generalization error of the student. We assume that the training examples are generated by a linear 'teacher', i.e., y^μ = w*^T x^μ/√N + ξ^μ, where ξ^μ is zero mean additive noise of variance σ².
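The model is simple enough to simulate directly. The following sketch (illustrative parameter values, not those used in the paper's figures) implements the online update (1) with a random example chosen at each step, and measures the generalization error as (w − w*)²/(2N):

```python
import numpy as np

def simulate_online(N=200, alpha=2.0, eta=0.5, gamma=0.0, sigma2=0.1,
                    t_max=20.0, seed=0):
    """Online learning in the linear model: update (1), teacher outputs
    y = w*.x/sqrt(N) + noise, inputs drawn from the sphere x^2 = N."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)
    w_star = rng.choice([-1.0, 1.0], size=N)            # teacher, w*^2 = N
    g = rng.standard_normal((p, N))
    X = np.sqrt(N) * g / np.linalg.norm(g, axis=1, keepdims=True)  # x^2 = N
    y = X @ w_star / np.sqrt(N) + np.sqrt(sigma2) * rng.standard_normal(p)
    w = np.zeros(N)                                     # w_{n=0} = 0
    for _ in range(int(t_max * N)):                     # n = t*N updates
        mu = rng.integers(p)                            # random example
        err = y[mu] - X[mu] @ w / np.sqrt(N)
        w = (1 - eta * gamma / N) * w + eta * X[mu] * err / np.sqrt(N)
    return np.sum((w - w_star) ** 2) / (2 * N)          # generalization error

eps_final = simulate_online()
eps_initial = 0.5   # at w = 0, eps_g = w*^2 / (2N) = 1/2
```

With these settings the error decays from its initial value 1/2 towards a small noise-dominated plateau; at N = 200 the self-averaging predicted for N → ∞ already makes run-to-run fluctuations minor.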
The teacher weight vector is taken to be normalized to w*² = N for simplicity, and the input vectors are assumed to be sampled randomly from an isotropic distribution over the hypersphere x² = N. The generalization error, defined as the average of the squared error between student and teacher outputs for random inputs, is then

ε_g = (1/2N) v_n²,   where v_n = w_n − w*.

In order to make the scenario analytically tractable, we focus on the limit N → ∞ of a large number of input components and weights, taken at constant number of examples per weight α = p/N and updates per weight ('learning time') t = n/N. In this limit, the generalization error ε_g(t) becomes self-averaging and can be calculated by averaging both over the random selection of examples from a given training set and over all training sets. Our results can be straightforwardly extended to the case of perceptron teachers with a nonlinear transfer function, as in [7].

The usual statistical mechanics approach to the online learning problem expresses the generalization error in terms of 'order parameters' like R = (1/N) w_n^T w*, whose (self-averaging) time evolution is determined from appropriately averaged update equations. This method works because for infinite training sets, the average order parameter updates can again be expressed in terms of the order parameters alone. For finite training sets, on the other hand, the updates involve new order parameters such as R₁ = (1/N) w_n^T A w*, where A is the correlation matrix of the training inputs, A = (1/N) Σ_{μ=1}^p x^μ (x^μ)^T. Their time evolution is in turn determined by order parameters involving higher powers of A, yielding an infinite hierarchy of order parameters. We solve this problem by considering instead order parameter (generating) functions [8] such as a generalized form of the generalization error ε(t; h) = (1/2N) v_n^T exp(hA) v_n.
This allows powers of A to be obtained by differentiation with respect to h, resulting in a closed system of (partial differential) equations for ε(t; h) and R(t; h) = (1/N) w_n^T exp(hA) w*.

The resulting equations and details of their solution will be given in a future publication. The final solution is most easily expressed in terms of the Laplace transform of the generalization error

ε̂_g(z) = (η/α) ∫ dt ε_g(t) e^{−z(η/α)t} = [ε₁(z) + η ε₂(z) + η² ε₃(z)] / [1 − η ε₄(z)]    (2)

The functions ε_i(z) (i = 1…4) can be expressed in closed form in terms of α, σ² and λ (and, of course, z). The Laplace transform (2) yields directly the asymptotic value of the generalization error, ε∞ = ε_g(t → ∞) = lim_{z→0} z ε̂_g(z), which can be calculated analytically. For finite learning times t, ε_g(t) is obtained by numerical inversion of the Laplace transform.

3 RESULTS AND DISCUSSION

We now discuss the consequences of our main result (2), focusing first on the asymptotic generalization error ε∞, then the convergence speed for large learning times, and finally the behaviour at small t. For numerical evaluations, we generally take σ² = 0.1, corresponding to a sizable noise-to-signal ratio of √0.1 ≈ 0.32.

Figure 1: Asymptotic generalization error ε∞ vs η and λ; α = 0.5, 1, 2 as shown, σ² = 0.1.

The asymptotic generalization error ε∞ is shown in Fig. 1 as a function of η and λ for α = 0.5, 1, 2. We observe that it is minimal for λ = σ² and η = 0, as expected from corresponding results for offline learning [3]¹. We also read off that for fixed λ, ε∞ is an increasing function of η: the larger η, the more the weight updates tend to overshoot the minimum of the (total, i.e., offline) training error.
This causes a diffusive motion of the weights around their average asymptotic values [2] which increases ε∞. In the absence of weight decay (λ = 0) and for α < 1, however, ε∞ is independent of η. In this case the training data can be fitted perfectly; every term in the total sum-of-squares training error is then zero, and online learning does not lead to weight diffusion because all individual updates vanish. In general, the relative increase ε∞(η)/ε∞(η = 0) − 1 due to nonzero η depends significantly on α. For η = 1 and α = 0.5, for example, this increase is smaller than 6% for all λ (at σ² = 0.1), and for α = 1 it is at most 13%. This means that in cases where training data is limited (p ≈ N), η can be chosen fairly large in order to optimize learning speed, without seriously affecting the asymptotic generalization error. In the large α limit, on the other hand, one finds ε∞ = (σ²/2)[1/α + η/(2 − η)]. The relative increase over the value at η = 0 therefore grows linearly with α; already for α = 2, increases of around 50% can occur for η = 1.

Fig. 1 also shows that ε∞ diverges as η approaches a critical learning rate η_c: as η → η_c, the 'overshoot' of the weight update steps becomes so large that the weights eventually diverge. From the Laplace transform (2), one finds that η_c is determined by η_c ε₄(z = 0) = 1; it is a function of α and λ only. As shown in Fig. 2b-d, η_c increases with λ. This is reasonable, as the weight decay reduces the length of the weight vector at each update, counteracting potential weight divergences. In the small and large α limits, one has η_c = 2(1 + λ) and η_c = 2(1 + λ/α), respectively. For constant λ, η_c therefore decreases² with α (Fig. 2b-d).

We now turn to the large t behaviour of the generalization error ε_g(t).
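The large-α expressions above are easy to evaluate. A small sketch (valid only in the large-α limit; at moderate α such as α = 2 the exact result gives the smaller ~50% increase quoted above):

```python
def eps_inf_large_alpha(alpha, eta, sigma2=0.1):
    """Asymptotic generalization error in the large-alpha limit:
    eps_inf = (sigma^2/2) * [1/alpha + eta/(2 - eta)]."""
    return (sigma2 / 2) * (1 / alpha + eta / (2 - eta))

def relative_increase(alpha, eta):
    """Relative increase eps_inf(eta)/eps_inf(0) - 1 in the same limit;
    equals alpha*eta/(2 - eta), i.e. linear in alpha as stated."""
    return alpha * eta / (2 - eta)
```

For η = 1 the relative increase is exactly α, making the linear growth with training set size explicit.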
For small η, the most slowly decaying contribution (or 'mode') to ε_g(t) varies as exp(−ct), its decay constant c = η[λ + (√α − 1)²]/α scaling linearly with η, the size of the weight updates, as expected (Fig. 2a). For small α, the condition ct ≫ 1 for ε_g(t) to have reached its asymptotic value ε∞ is η(1 + λ)(t/α) ≫ 1 and scales with t/α, which is the number of times each training example has been used. For large α, on the other hand, the condition becomes ηt ≫ 1: the size of the training set drops out, since convergence occurs before repetitions of training examples become significant.

¹ The optimal value of the unscaled weight decay decreases with α as γ = σ²/α, because for large training sets there is less need to counteract noise in the training data by using a large weight decay.
² Conversely, for constant γ, η_c increases with α from 2(1 + γα) to 2(1 + γ): for large α, the weight decay is applied more often between repeat presentations of a training example that would otherwise cause the weights to diverge.

For larger η, the picture changes due to a new 'slow mode' (arising from the denominator of (2)). Interestingly, this mode exists only for η above a finite threshold η_min = 2/(α^{1/2} + α^{−1/2} − 1). For finite α, it could therefore not have been predicted from a small η expansion of ε_g(t). Its decay constant c_slow decreases to zero as η → η_c, and crosses that of the normal mode at η_x(α, λ) (Fig. 2a). For η > η_x, the slow mode therefore determines the convergence speed for large t, and fastest convergence is obtained for η = η_x. However, it may still be advantageous to use lower values of η in order to lower the asymptotic generalization error (see below); values of η > η_x would deteriorate both convergence speed and asymptotic performance. Fig.
2b-d shows the dependence of η_min, η_x and η_c on α and λ. For λ not too large, η_x has a maximum at α ≈ 1 (where η_x ≈ η_c), while decaying as η_x = 1 + 2α^{−1/2} ≈ ½η_c for larger α. This is because for α ≈ 1 the (total training) error surface is very anisotropic around its minimum in weight space [9]. The steepest directions determine η_c, and convergence along them would be fastest for η = ½η_c (as in the isotropic case). However, the overall convergence speed is determined by the shallow directions, which require maximal η ≈ η_c for fastest convergence.

Consider now the small t behaviour of ε_g(t). Fig. 3 illustrates the dependence of ε_g(t) on η; comparison with simulation results for N = 50 clearly confirms our calculations and demonstrates that finite N effects are not significant even for such fairly small N. For α = 0.7 (Fig. 3a), we see that nonzero η acts as effective update noise, eliminating the minimum in ε_g(t) which corresponds to over-training [3]. ε∞ is also seen to be essentially independent of η, as predicted for the small value of λ = 10⁻⁴ chosen. For α = 5, Fig. 3b clearly shows the increase of ε∞ with η. It also illustrates how convergence first speeds up as η is increased from zero and then slows down again as η_c ≈ 2 is approached.

Above, we discussed optimal settings of η and λ for minimal asymptotic generalization error ε∞. Fig. 4 shows what happens if we minimize ε_g(t) instead for a given final learning time t, corresponding to a fixed amount of computational effort for training the network.
As t increases, the optimal η decreases towards zero, as required by the tradeoff between asymptotic performance and convergence speed.

Figure 2: Definitions of η_min, η_x and η_c, and their dependence on α (for λ = 0, 0.1, 1 as shown).

Figure 3: ε_g vs t for different η, for (a) α = 0.7 and (b) α = 5. Simulations for N = 50 are shown by symbols (standard errors less than symbol sizes). λ = 10⁻⁴, σ² = 0.1. The learning rate η increases from below (at large t) over the range (a) 0.5…1.95, (b) 0.5…1.75.

Figure 4: Optimal η and λ vs given final learning time t, and resulting ε_g. Solid/dashed lines: α = 1 / α = 2; bold/thin lines: online/offline learning. σ² = 0.1. Dotted lines in (a): fits of the form η = (a + b ln t)/t to the optimal η for online learning.

Minimizing ε_g(t) ≈ ε∞ + const · exp(−ct) ≈ c₁ + ηc₂ + c₃ exp(−c₄ηt) leads to η_opt = (a + b ln t)/t (with some constants a, b, c₁…c₄). Although derived for small η, this functional form (dotted lines in Fig. 4a) also provides a good description down to fairly small t, where η_opt becomes large. The optimal weight decay λ increases³ with t towards the limiting value σ².
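The tradeoff behind η_opt can be made concrete: setting the η-derivative of the model form c₁ + ηc₂ + c₃ exp(−c₄ηt) to zero gives η_opt = ln(c₃c₄t/c₂)/(c₄t), which is exactly of the quoted (a + b ln t)/t form. A sketch with illustrative constants (not fitted to the paper's curves), checked against a brute-force grid search:

```python
import math

# Illustrative constants (not the paper's values): asymptotic offset c1,
# linear eta-penalty c2, transient amplitude c3 and rate factor c4.
c1, c2, c3, c4 = 0.05, 0.02, 0.5, 1.0

def eps_model(eta, t):
    """Model generalization error: asymptote + eta penalty + transient."""
    return c1 + c2 * eta + c3 * math.exp(-c4 * eta * t)

def eta_opt(t):
    """Closed-form minimizer: c2 = c3*c4*t*exp(-c4*eta*t) gives
    eta_opt = ln(c3*c4*t/c2)/(c4*t), i.e. the (a + b ln t)/t form."""
    return math.log(c3 * c4 * t / c2) / (c4 * t)

def eta_opt_grid(t):
    """Brute-force check of the minimizer by grid search over eta."""
    etas = [i * 1e-3 for i in range(1, 2001)]
    return min(etas, key=lambda e: eps_model(e, t))
```

For these constants η_opt(10) ≈ 0.55 while η_opt(50) ≈ 0.14: as t grows, the transient term can be killed with ever smaller η, and the linear asymptotic penalty dominates.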
However, optimizing λ is much less important than choosing the right η: minimizing ε_g(t) for fixed λ yields almost the same generalization error as optimizing both η and λ (we omit detailed results here⁴). It is encouraging to see from Fig. 4c that after as few as t = 10 updates per weight with optimal η, the generalization error is almost indistinguishable from its optimal value for t → ∞ (this also holds if λ is kept fixed). Optimization of the learning rate should therefore be worthwhile in most practical scenarios.

In Fig. 4c, we also compare the performance of online learning to that of offline learning (calculated from the appropriate discrete time version of [3]), again with

³ One might have expected the opposite effect of having a larger λ at low t in order to 'contain' potential divergences from the larger optimal learning rates η. However, a smaller λ tends to make the asymptotic value ε∞ less sensitive to large values of η, as we saw above, and we conclude that this effect dominates.
⁴ For fixed λ < σ², where ε_g(t) has an over-training minimum (see Fig. 3a), the asymptotic behaviour of η_opt changes to η_opt