{"title": "Geometry of Early Stopping in Linear Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 365, "page_last": 371, "abstract": null, "full_text": "Geometry of Early Stopping in Linear Networks

Robert Dodier *
Dept. of Computer Science
University of Colorado
Boulder, CO 80309

* Address correspondence to: dodier@cs.colorado.edu

Abstract

A theory of early stopping as applied to linear models is presented. The backpropagation learning algorithm is modeled as gradient descent in continuous time. Given a training set and a validation set, all weight vectors found by early stopping must lie on a certain quadric surface, usually an ellipsoid. Given a training set and a candidate early stopping weight vector, all validation sets have least-squares weights lying on a certain plane. This latter fact can be exploited to estimate the probability of stopping at any given point along the trajectory from the initial weight vector to the least-squares weights derived from the training set, and to estimate the probability that training goes on indefinitely. The prospects for extending this theory to nonlinear models are discussed.

1 INTRODUCTION

'Early stopping' is the following training procedure:

Split the available data into a training set and a \"validation\" set. Start with initial weights close to zero. Apply gradient descent (backpropagation) on the training data. If the error on the validation set increases over time, stop training.

This training method, as applied to neural networks, is of relatively recent origin. The earliest references include Morgan and Bourlard [4] and Weigend et al. [7]. Finnoff et al. [2] studied early stopping empirically.
While the goal of a theory of early stopping is to analyze its application to nonlinear approximators such as sigmoidal networks, this paper will deal mainly with linear systems and only marginally with nonlinear systems. Baldi and Chauvin [1] and Wang et al. [6] have also analyzed linear systems.

The main result of this paper can be summarized as follows. It can be shown (see Sec. 5) that the most probable stopping point on a given trajectory (fixing the training set and initial weights) is the same no matter what the size of the validation set. That is, the most probable stopping point (considering all possible validation sets) for a finite validation set is the same as for an infinite validation set. (If the validation data is unlimited, then the validation error is the same as the true generalization error.) However, for finite validation sets there is a dispersion of stopping points around the best (most probable and least generalization error) stopping point, and this increases the expected generalization error. See Figure 1 for an illustration of these ideas.

2 MATHEMATICAL PRELIMINARIES

In what follows, backpropagation will be modeled as a process in continuous time. This corresponds to letting the learning rate approach zero. This continuum model simplifies the necessary algebra while preserving the important properties of early stopping. Let the inputs be denoted X = (X_ij), so that X_ij is the j'th component of the i'th observation; there are p components of each of the n observations. Likewise, let y = (y_i) be the (scalar) outputs observed when the inputs are X. Our regression model will be a linear model, y_i = w'x_i + ε_i, i = 1, ..., n. Here ε_i represents independent, identically distributed (i.i.d.) Gaussian noise, ε_i ~ N(0, σ²). Let E(w) = ½||Xw − y||² be one-half the usual sum of squared errors.

The error gradient with respect to the weights is ∇E(w) = w'X'X − y'X.
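These quantities are easy to check numerically. The following sketch (an illustration, not part of the paper; the synthetic data and all names are my own) builds a small linear regression problem and verifies the analytic gradient, written as a column vector X'Xw − X'y, against finite differences of E(w):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))          # inputs: n observations, p components each
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

def E(w):
    # one-half the usual sum of squared errors
    r = X @ w - y
    return 0.5 * r @ r

def grad_E(w):
    # analytic gradient of E, as a column vector: X'X w - X'y
    return X.T @ (X @ w - y)

w = rng.normal(size=p)
eps = 1e-6
# central finite differences, one component at a time
g_fd = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                 for e in np.eye(p)])
assert np.allclose(grad_E(w), g_fd, atol=1e-4)
```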
The backprop algorithm is modeled as ẇ = −∇E(w). The least-squares solution, at which ∇E(w) = 0, is w_LS = (X'X)^{-1}X'y. Note the appearance here of the input correlation matrix, X'X = (Σ_{k=1}^n X_ki X_kj). The properties of this matrix determine, to a large extent, the properties of the least-squares solutions we find. It turns out that as the number of observations n increases without bound, the matrix σ²(X'X)^{-1} converges with probability one to the population covariance matrix of the weights. We will find that the correlation matrix plays an important role in the analysis of early stopping.

We can rewrite the error E using a diagonalization of the correlation matrix X'X = SΛS'. Omitting a few steps of algebra,

E(w) = ½ Σ_{k=1}^p λ_k v_k² + ½ y'(y − Xw_LS)    (1)

where v = S'(w − w_LS) and Λ = diag(λ_1, ..., λ_p). In this sum we see that the magnitude of the k'th term is proportional to the corresponding characteristic value, so moving w toward w_LS in the direction corresponding to the largest characteristic value yields the greatest reduction of error. Likewise, moving in the direction corresponding to the smallest characteristic value gives the least reduction of error.

So far, we have implicitly considered only one set of data; we have assumed all data is used for training. Now let us distinguish training data, X_t and y_t, from validation data, X_v and y_v; there are n_t training and n_v validation data. Now each set of data has its own least-squares weight vector, w_t and w_v, and its own error gradient, ∇E_t(w) and ∇E_v(w). Also define M_t = X_t'X_t and M_v = X_v'X_v for convenience. The early stopping method can be analyzed in terms of these pairs of matrices, gradients, and least-squares weight vectors.

3 THE MAGIC ELLIPSOID

Consider the early stopping criterion, dE_v(w)/dt = 0. Applying the chain rule,

dE_v/dt = ∇E_v · dw/dt = ∇E_v · (−∇E_t)    (2)

where the last equality follows from the definition of gradient descent. So the early stopping criterion is the same as saying

∇E_t · ∇E_v = 0,    (3)

that is, at an early stopping point, the training and validation error gradients are perpendicular, if they are not zero.

Consider now the set of all points in the weight space such that the training and validation error gradients are perpendicular. These are the points at which early stopping may stop. It turns out that this set of points has an easily described shape. Since ∇E_t(w) = M_t(w − w_t) and ∇E_v(w) = M_v(w − w_v), the condition given by Eq. 3 is equivalent to

(w − w_t)' M_t M_v (w − w_v) = 0.    (4)

Note that all correlation matrices are symmetric, so (M_t M_v)' = M_v M_t. We see that Eq. 4 gives a quadratic form. Let us put Eq. 4 into a standard form. Toward this end, let us define some useful terms. Let

M = M_t M_v,    (5)
M̄ = ½(M + M') = ½(M_t M_v + M_v M_t),    (6)
w̄ = ½(w_t + w_v),    (7)
Δw = w_t − w_v,    (8)

and

ŵ = w̄ − ¼ M̄^{-1}(M − M')Δw.    (9)

Now an important result can be stated. The proof is omitted.

Proposition 1. ∇E_t · ∇E_v = 0 is equivalent to

(w − ŵ)'M̄(w − ŵ) = ¼ Δw'[M̄ + ¼(M' − M)M̄^{-1}(M − M')]Δw. □    (10)

The matrix M̄ of the quadratic form given by Eq. 10 is \"usually\" positive definite. As the number of observations n_t and n_v of training and validation data increase without bound, M̄ converges to a positive definite matrix. In what follows it will always be assumed that M̄ is indeed positive definite. Given this, the locus defined by ∇E_t ⊥ ∇E_v is an ellipsoid. The centroid is ŵ, the orientation is determined by the characteristic vectors of M̄, and the length of the k'th semiaxis is √(c/λ_k), where c is the constant on the righthand side of Eq. 10 and λ_k is the k'th characteristic value of M̄.
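Proposition 1 can be verified numerically, since ∇E_t · ∇E_v and the quadratic form of Eq. 10 are in fact equal as functions of w. The sketch below (my own check with assumed random data; not from the paper) draws random correlation matrices and least-squares vectors, forms M̄, ŵ, and the constant c, and confirms the identity at arbitrary points:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4

def random_corr():
    # a random input correlation matrix X'X (symmetric, almost surely positive definite)
    X = rng.normal(size=(20, p))
    return X.T @ X

Mt, Mv = random_corr(), random_corr()
wt, wv = rng.normal(size=p), rng.normal(size=p)

M = Mt @ Mv                                    # Eq. 5
Mbar = 0.5 * (M + M.T)                         # Eq. 6
wbar = 0.5 * (wt + wv)                         # Eq. 7
dw = wt - wv                                   # Eq. 8
what = wbar - 0.25 * np.linalg.solve(Mbar, (M - M.T) @ dw)   # Eq. 9
# constant on the right-hand side of Eq. 10
c = 0.25 * dw @ (Mbar @ dw
                 + 0.25 * (M.T - M) @ np.linalg.solve(Mbar, (M - M.T) @ dw))

for _ in range(5):
    w = rng.normal(size=p)
    dot = (Mt @ (w - wt)) @ (Mv @ (w - wv))    # grad Et . grad Ev at w
    quad = (w - what) @ Mbar @ (w - what) - c  # Eq. 10, moved to one side
    assert np.isclose(dot, quad)
```

Because the two sides agree identically in w, the locus ∇E_t · ∇E_v = 0 is exactly the level set (w − ŵ)'M̄(w − ŵ) = c.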
4 THE MAGIC PLANE

Given the least-squares weight vector w_t derived from the training data and a candidate early stopping weight vector w_es, any least-squares weight vector w_v from a validation set must lie on a certain plane, the 'magic plane.' The proof of this statement is omitted.

Proposition 2. The condition that w_t, w_v, and w_es all lie on the magic ellipsoid,

(w_t − ŵ)'M̄(w_t − ŵ) = (w_v − ŵ)'M̄(w_v − ŵ) = (w_es − ŵ)'M̄(w_es − ŵ) = c,    (11)

implies

(w_t − w_es)'M w_v = (w_t − w_es)'M w_es. □    (12)

This shows that w_v lies on a plane, the magic plane, with normal M'(w_t − w_es). The reader will note a certain difficulty here, namely that M = M_t M_v depends on the particular validation set used, as does w_v. However, we can make progress by considering only a fixed correlation matrix M_v and letting w_v vary. Let us suppose the inputs (x_1, x_2, ..., x_p) are i.i.d. Gaussian random variables with mean zero and some covariance Σ. (Here the inputs are random but they are observed exactly, so the error model y = w'x + ε still applies.) Then

⟨M_v⟩ = ⟨X_v'X_v⟩ = n_v Σ,

so in Eq. 12 let us replace M_v with its expected value n_v Σ. That is, we can approximate Eq. 12 with

(w_t − w_es)' M_t Σ w_v = (w_t − w_es)' M_t Σ w_es.    (13)

Now consider the probability that a particular point w(t) on the trajectory from w(0) to w_t is an early stopping point, that is, ∇E_t(w(t)) · ∇E_v(w(t)) = 0. This is exactly the probability that Eq. 12 is satisfied, and approximately the probability that Eq. 13 is satisfied. This latter approximation is easy to calculate: it is the mass of an infinitesimally-thin slab cutting through the distribution of least-squares validation weight vectors. Given the usual additive noise model y = w'x + ε with ε being i.i.d. Gaussian distributed noise with mean zero and variance σ², the least-squares weights are approximately distributed as

w_v ~ N(w*, (σ²/n_v) Σ^{-1})    (14)

when the number of data is large. Consider now the plane Π = {w : w'n̂ = k}.
The probability mass on this plane as it cuts through a Gaussian distribution N(μ, C) is then

p_Π(k, n̂) = (2π n̂'Cn̂)^{-1/2} exp(−½ (k − n̂'μ)² / (n̂'Cn̂)) ds    (15)

where ds denotes an infinitesimal arc length. (See, for example, Sec. VIII-9.3 of von Mises [3].)

Figure 1: Histogram of early stopping points along a trajectory, with bins of equal arc length. An approximation to the probability of stopping (Eq. 16) is superimposed. Altogether 1000 validation sets were generated for a certain training set; of these, 288 gave \"don't start\" solutions, 701 gave early stopping solutions (which are binned here) somewhere on the trajectory, and 11 gave \"don't stop\" solutions.

5 PROBABILITY OF STOPPING AT A GIVEN POINT

Let us apply Eq. 15 to the problem at hand. Our normal is n̂ = n_v Σ M_t (w_t − w_es) and the offset is k = n̂'w_es. A formal statement of the approximation of p_s can now be made.

Proposition 3. Assuming the validation correlation matrix X_v'X_v equals the mean correlation matrix n_v Σ, the probability of stopping at a point w_es = w(t) on the trajectory from w(0) to w_t is approximately

p_s(w_es) = (2π n̂'Cn̂)^{-1/2} exp(−½ (n̂'w_es − n̂'w*)² / (n̂'Cn̂)) ds    (16)

with

n̂ = n_v Σ M_t (w_t − w_es) and C = (σ²/n_v) Σ^{-1}. □    (17)

How useful is this approximation? Simulations were carried out in which the initial weight vector w(0) and the training data (n_t = 20) were fixed, and many validation sets of size n_v = 20 were generated (without fixing X_v'X_v). The trajectory was divided into segments of equal length and histograms of the number of early stopping weights on each segment were constructed. A typical example is shown in Figure 1. It can be seen that the empirical histogram is well-approximated by Eq. 16.

If for some w(t) on the trajectory the magic plane cuts through the true weights w*, then p_s will have a peak at t.
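The experiment behind Figure 1 can be reproduced in outline. The sketch below is an illustration only; p = 2, σ = 0.5, w(0) = 0, the particular w*, and the Gaussian input distribution are my assumptions, with n_t = n_v = 20 as in the text. It integrates the gradient flow ẇ = −∇E_t(w) in closed form and classifies each of 1000 validation sets as never-starting, early stopping, or never-stopping over a finite time horizon:

```python
import numpy as np

rng = np.random.default_rng(2)
p, nt, nv, sigma = 2, 20, 20, 0.5
w_star = np.array([1.0, -1.0])        # assumed true weights
w0 = np.zeros(p)                      # initial weights near zero

Xt = rng.normal(size=(nt, p))
yt = Xt @ w_star + sigma * rng.normal(size=nt)
Mt = Xt.T @ Xt
wt = np.linalg.solve(Mt, Xt.T @ yt)   # training least-squares weights

# closed-form gradient flow: w(t) = wt + S exp(-lambda t) S' (w(0) - wt)
lam, S = np.linalg.eigh(Mt)
ts = np.linspace(0.0, 2.0, 400)
W = wt + (np.exp(-np.outer(ts, lam)) * (S.T @ (w0 - wt))) @ S.T

never_start = early = never_stop = 0
for _ in range(1000):
    Xv = rng.normal(size=(nv, p))
    yv = Xv @ w_star + sigma * rng.normal(size=nv)
    Mv = Xv.T @ Xv
    wv = np.linalg.solve(Mv, Xv.T @ yv)
    # grad Et . grad Ev along the trajectory; stopping where it crosses zero
    d = np.sum(((W - wt) @ Mt) * ((W - wv) @ Mv), axis=1)
    if d[0] <= 0:
        never_start += 1   # validation error rises immediately
    elif np.all(d > 0):
        never_stop += 1    # validation error still falling at the horizon
    else:
        early += 1         # a sign change: an early stopping point

assert never_start + early + never_stop == 1000
```

Binning the locations of the first sign changes by arc length yields a histogram of the kind shown in Figure 1.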
As the number of validation data n_v increases, the variance of w_v decreases and the peak narrows, but the position w(t) of the peak does not move. As n_v → ∞ the peak becomes a spike at w(t). That is, the peak of p_s for a finite validation set is the same as if we had access to the true generalization error. In this sense, early stopping does the right thing.

It has been observed that when early stopping is employed, the validation error may decrease forever and never rise; thus the 'early stopping' procedure yields the least-squares weights. How common is this phenomenon? Let us consider a fixed training set and a fixed initial weight vector, so that the trajectory is fixed. Letting the validation set range over all possible realizations, let us denote by P_Π(t) = P_Π(k(t), n̂(t)) the probability that training stops at time t or later. 1 − P_Π(0) is the probability that validation error rises immediately upon beginning training, and let us agree that P_Π(∞) denotes the probability that validation error never increases. This P_Π(t) is approximately the mass that is \"behind\" the plane n̂'w_v = n̂'w_es, \"behind\" meaning the points w_v such that (w_v − w_es)'n̂ < 0. (The identification of P_Π with the mass to one side of the plane is not exact because intersections of magic planes are ignored.) As Eq. 15 has the form of a Gaussian p.d.f., it is easy to show that

P_Π(k, n̂) = G((k − n̂'μ) / (n̂'Cn̂)^{1/2})    (18)

where G denotes the standard Gaussian c.d.f., G(z) = (2π)^{-1/2} ∫_{−∞}^{z} exp(−t²/2) dt. Recall that we take the normal n̂ of the magic plane through w_es as n̂ = Σ M_t(w_t − w_es). For t = 0 there is no problem with Eq. 18 and an approximation for the \"never-starting\" probability is stated in the next proposition.

Proposition 4.
The probability that validation error increases immediately upon beginning training (\"never starting\"), assuming the validation correlation matrix X_v'X_v equals the mean correlation matrix n_v Σ, is approximately

1 − P_Π(0) = 1 − G( (√n_v / σ) (w* − w(0))'M_tΣ(w_t − w(0)) / [(w_t − w(0))'M_tΣM_t(w_t − w(0))]^{1/2} ). □    (19)

With similar arguments we can develop an approximation to the \"never-stopping\" probability.

Proposition 5. The probability that training continues indefinitely (\"never stopping\"), assuming the validation correlation matrix X_v'X_v equals the mean correlation matrix n_v Σ, is approximately

P_Π(∞) = G( (√n_v / σ) (w* − w_t)'M_tΣ(±s*) / (λ* [(s*)'Σs*]^{1/2}) ). □    (20)

Here λ* is the smallest characteristic value of M_t and s* the corresponding characteristic vector, the direction along which the trajectory finally approaches w_t. In Eq. 20 pick +s* if (w_t − w(0))'s* > 0, otherwise pick −s*.

Simulations are in good agreement with the estimates given by Propositions 4 and 5.

6 EXTENDING THE THEORY TO NONLINEAR SYSTEMS

It may be possible to extend the theory presented in this paper to nonlinear approximators. The elementary concepts carry over unchanged, although it will be more difficult to describe them algebraically. In a nonlinear early stopping problem, there will be a surface corresponding to the magic ellipsoid on which ∇E_t ⊥ ∇E_v, but this surface may be nonconvex or not simply connected. Likewise, corresponding to the magic plane there will be a surface on which least-squares validation weights must fall, but this surface need not be flat or unbounded.

It is customary in the world of statistics to apply results derived for linear systems to nonlinear systems by assuming the number of data is very large and various regularity conditions hold. If the errors ε are additive, the least-squares weights again have a Gaussian distribution.
As in the linear case, the Hessian of the total error appears as the inverse of the covariance of the least-squares weights. In this asymptotic (large data) regime, the standard results for linear regression carry over to nonlinear regression mostly unchanged. This suggests that the linear theory of early stopping will also apply to nonlinear regression models, such as sigmoidal networks, when there is much data.

However, it should be noted that the asymptotic regression theory is purely local; it describes only what happens in the neighborhood of the least-squares weights. As the outcome of early stopping depends upon the initial weights and the trajectory taken through the weight space, any local theory will not suffice to analyze early stopping. Nonlinear effects such as local minima and non-quadratic basins cannot be accounted for by a linear or asymptotically linear theory, and these may play important roles in nonlinear regression problems. This may invalidate direct extrapolations of linear results to nonlinear networks, such as that given by Wang and Venkatesh [5].

7 ACKNOWLEDGMENTS

This research was supported by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation to Michael C. Mozer.

References

[1] Baldi, P., and Y. Chauvin. \"Temporal Evolution of Generalization during Learning in Linear Networks,\" Neural Computation 3, 589-603 (Winter 1991).

[2] Finnoff, W., F. Hergert, and H. G. Zimmermann. \"Extended Regularization Methods for Nonconvergent Model Selection,\" in Advances in NIPS 5, S. Hanson, J. Cowan, and C. L. Giles, eds., pp 228-235. San Mateo, CA: Morgan Kaufmann Publishers. 1993.

[3] von Mises, R. Mathematical Theory of Probability and Statistics. New York: Academic Press. 1964.

[4] Morgan, N., and H. Bourlard.
\"Generalization and Parameter Estimation in Feedforward Nets: Some Experiments,\" in Advances in NIPS 2, D. Touretzky, ed., pp 630-637. San Mateo, CA: Morgan Kaufmann. 1990.

[5] Wang, C., and S. Venkatesh. \"Temporal Dynamics of Generalization in Neural Networks,\" in Advances in NIPS 7, G. Tesauro, D. Touretzky, and T. Leen, eds., pp 263-270. Cambridge, MA: MIT Press. 1995.

[6] Wang, C., S. Venkatesh, and J. S. Judd. \"Optimal Stopping and Effective Machine Complexity in Learning,\" in Advances in NIPS 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp 303-310. San Francisco: Morgan Kaufmann. 1994.

[7] Weigend, A., B. Huberman, and D. Rumelhart. \"Predicting the Future: A Connectionist Approach,\" Int'l J. Neural Systems 1, 193-209 (1990).
", "award": [], "sourceid": 1147, "authors": [{"given_name": "Robert", "family_name": "Dodier", "institution": null}]}