and

  J(m+1) = J(m) + (η/N) ⟨ξ G[J(m)·ξ, B·ξ]⟩_D̃                                          (2)

In on-line learning one draws at each step m a question ξ^{μ(m)} at random from the training set, so the dynamics is a stochastic process; in batch learning one iterates a deterministic map. Our key dynamical observables are the training and generalization errors, defined as

  E_t(J) = ⟨θ[−(J·ξ)(B·ξ)]⟩_D̃,    E_g(J) = ⟨θ[−(J·ξ)(B·ξ)]⟩_D                           (3)

Only if the training set D̃ is sufficiently large, and if there are no correlations between J and the training-set examples, will these two errors be identical. We now turn to macroscopic observables Ω[J] = (Ω₁[J], …, Ω_k[J]).

Dynamics of Supervised Learning with Restricted Training Sets

For N → ∞ (with finite times t = m/N and with finite k), and if our observables are of a so-called mean-field type, their associated macroscopic distribution P_t(Ω) is found to obey a Fokker-Planck type equation, with flow and diffusion terms that depend on whether on-line or batch learning is used. We now choose a specific set of observables Ω[J], tailored to the present problem:

  Q[J] = J²,    R[J] = J·B,    P[x,y;J] = ⟨δ[x−J·ξ] δ[y−B·ξ]⟩_D̃                         (4)

This choice is motivated as follows: (i) in order to incorporate the old formalism we need Q[J] and R[J]; (ii) the training error involves field statistics calculated over the training set, as given by P[x,y;J]; and (iii) for α < ∞ one cannot expect closed equations for a finite number of order parameters, so the present choice effectively represents an infinite number. We will assume the number of arguments (x,y) for which P[x,y;J] is evaluated to go to infinity only after the limit N → ∞ has been taken. This eliminates technical subtleties and allows us to show that in the Fokker-Planck equation all diffusion terms vanish as N → ∞. The latter thereby reduces to a Liouville equation, describing deterministic evolution of our macroscopic observables.
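The objects defined above can be probed directly in simulation. The following is a minimal numerical sketch (our own illustrative choices: a Hebbian-type modulation function G[x,y] = sgn(y), binary questions, small N, and a fixed seed; this is not the paper's experimental setup) of on-line learning from a restricted training set of size p = αN, measuring the order parameters Q, R and the two errors of (3):

```python
import numpy as np

rng = np.random.default_rng(0)

N, alpha, eta, steps = 200, 2.0, 0.1, 2000
p = int(alpha * N)                                   # restricted training set size

B = rng.standard_normal(N); B /= np.linalg.norm(B)   # teacher vector, |B| = 1
xi = rng.choice([-1.0, 1.0], size=(p, N))            # fixed set of p training questions
J = rng.standard_normal(N) / np.sqrt(N)              # student vector

def G(x, y):
    # illustrative Hebbian-type modulation function (our choice, not the paper's)
    return np.sign(y)

for m in range(steps):
    q = xi[rng.integers(p)]                          # draw a question from the training set
    J = J + (eta / N) * q * G(J @ q, B @ q)          # on-line update rule

Q, R = J @ J, J @ B
E_t = np.mean(np.sign(xi @ J) != np.sign(xi @ B))    # training error over the p examples
E_g = np.arccos(R / np.sqrt(Q)) / np.pi              # generalization error, cf. eq. (8)
print(Q, R, E_t, E_g)
```

For finite α = p/N the two errors computed this way generally differ, since J becomes correlated with the training-set examples; this is precisely the regime the present theory addresses.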
For on-line learning one arrives at

  d/dt Q = 2η ∫dx dy P[x,y] x G[x,y] + η² ∫dx dy P[x,y] G²[x,y]                          (5)

  d/dt R = η ∫dx dy P[x,y] y G[x,y]                                                      (6)

  d/dt P[x,y] = (1/α) [ ∫dx′ P[x′,y] δ[x−x′−ηG[x′,y]] − P[x,y] ]
                − η ∂/∂x ∫dx′dy′ G[x′,y′] A[x,y;x′,y′]
                + ½η² ∫dx′dy′ P[x′,y′] G²[x′,y′] ∂²/∂x² P[x,y]                           (7)

Expansion of these equations in powers of η, retaining only the terms linear in η, gives the corresponding equations describing batch learning. The complexity of the problem is fully concentrated in a Green's function A[x,y;x′,y′], which is defined as

  A[x,y;x′,y′] = lim_{N→∞} ⟨⟨⟨ [1−δ_{ξξ′}] δ[x−J·ξ] δ[y−B·ξ] (ξ·ξ′/N) δ[x′−J·ξ′] δ[y′−B·ξ′] ⟩_D̃ ⟩_D̃ ⟩_{Ω;t}

It involves a sub-shell average, in which p_t(J) is the weight probability density at time t:

  ⟨K[J]⟩_{Ω;t} = ∫dJ K[J] p_t(J) δ[Q−Q[J]] δ[R−R[J]] ∏_{xy} δ[P[x,y]−P[x,y;J]]
                 / ∫dJ p_t(J) δ[Q−Q[J]] δ[R−R[J]] ∏_{xy} δ[P[x,y]−P[x,y;J]]

where the sub-shells are defined with respect to the order parameters. The solution of (5,6,7) can be used to generate the errors of (3):

  E_t = ∫dx dy P[x,y] θ[−xy],    E_g = (1/π) arccos[R/√Q]                                (8)

3 CLOSURE VIA DYNAMICAL REPLICA THEORY

So far our analysis is still exact. We now close the macroscopic laws (5,6,7) by making, for N → ∞, the two key assumptions underlying dynamical replica theory [7]:

(i) Our macroscopic observables {Q, R, P} obey closed dynamic equations.
(ii) These equations are self-averaging with respect to the realisation of the training set D̃.

(i) implies that probability variations within the {Q, R, P} sub-shells are either absent or irrelevant to the evolution of {Q, R, P}. We may thus make the simplest choice for p_t(J):

  p_t(J) → p(J) ∝ δ[Q−Q[J]] δ[R−R[J]] ∏_{xy} δ[P[x,y]−P[x,y;J]]                          (9)

A. C. C. Coolen and D. Saad

p(J) depends on time implicitly, via the order parameters {Q, R, P}. The procedure (9) leads to exact laws if our observables {Q, R, P} indeed obey closed equations for N → ∞; it gives an approximation if they do not. (ii) allows us to average the macroscopic laws over all training sets; it is observed in numerical simulations, and can probably be proven using the formalism of [6]. Our assumptions result in the closure of (5,6,7), since now A[…] is expressed fully in terms of {Q, R, P}. The final ingredient of dynamical replica theory is the realization that averaging fractions is simplified with the replica identity [8]

  ⟨ ∫dJ W[J,z] G[J,z] / ∫dJ W[J,z] ⟩_z = lim_{n→0} ∫dJ¹ ⋯ dJⁿ ⟨ G[J¹,z] ∏_{a=1}^{n} W[Jᵃ,z] ⟩_z

What remains is to perform the integrations. One finds that P[x,y] = P[x|y] P[y] with P[y] = (2π)^{−1/2} e^{−y²/2}. Upon introducing the short-hands Dy = (2π)^{−1/2} e^{−y²/2} dy and ⟨f(x,y)⟩ = ∫Dy dx P[x|y] f(x,y), we can write the resulting macroscopic laws as follows:

  d/dt Q = 2ηV + η²Z,    d/dt R = ηW                                                     (10)

  ∂/∂t P[x|y] = (1/α) ∫dx′ P[x′|y] { δ[x−x′−ηG[x′,y]] − δ[x−x′] } + ½η²Z ∂²/∂x² P[x|y]
                − η ∂/∂x { P[x|y] [ U(x−Ry) + Wy + [V−RW−(Q−R²)U] Φ[x,y] ] }             (11)

with

  U = ⟨Φ[x,y] G[x,y]⟩,    V = ⟨x G[x,y]⟩,    W = ⟨y G[x,y]⟩,    Z = ⟨G²[x,y]⟩

As before, the batch equations follow upon expanding in η and retaining only the linear terms. Finding the function Φ[x,y] (in the replica-symmetric ansatz) requires solving a saddle-point problem for a scalar observable q and a function M[x|y].
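The flow (10) lends itself to direct numerical integration once the averages V, W, Z can be evaluated. As a sketch (our own illustrative setup: a Hebbian-type choice G[x,y] = sgn(y), and a Gaussian conditional distribution P[x|y] with mean Ry and variance Q−R², i.e. the α → ∞ form derived in section 4), one can Euler-step the equations for Q and R, estimating the averages by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)

def G(x, y):
    # illustrative Hebbian-type modulation function (our choice, not the paper's)
    return np.sign(y)

eta, dt, n = 0.05, 0.05, 200_000
Q, R = 1.0, 0.0                          # initial order parameters

for _ in range(400):                     # Euler steps for the flow (10)
    # sample (x, y) from the assumed Gaussian P[x, y]:
    # y ~ N(0, 1),  x | y ~ N(R*y, Q - R^2)
    y = rng.standard_normal(n)
    x = R * y + np.sqrt(max(Q - R * R, 1e-12)) * rng.standard_normal(n)
    g = G(x, y)
    V, W, Z = np.mean(x * g), np.mean(y * g), np.mean(g * g)
    Q += dt * (2 * eta * V + eta**2 * Z)
    R += dt * eta * W

E_g = np.arccos(np.clip(R / np.sqrt(Q), -1.0, 1.0)) / np.pi
print(Q, R, E_g)
```

For finite α the flow (10) must be integrated together with the partial differential equation (11) for P[x|y], with Φ[x,y] obtained from the saddle-point problem at each time step; the Gaussian assumption above removes exactly that step.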
Upon introducing

  B = √(qQ−R²) / [Q(1−q)],    ⟨f[x,y,z]⟩_* = ∫dx M[x|y] e^{Bxz} f[x,y,z] / ∫dx M[x|y] e^{Bxz}

(with ∫dx M[x|y] = 1 for all y), the saddle-point equations acquire the form

  for all X, y:    P[X|y] = ∫Dz ⟨δ[X−x]⟩_*

  ⟨(x−Ry)²⟩ + (qQ−R²)[1 − 2/α] = [Q(1+q) − 2R²] ⟨x Φ[x,y]⟩

The solution M[x|y] of the functional saddle-point equation, given a value for q in the physical range q ∈ [R²/Q, 1], is unique [9]. The function Φ[x,y] is then given by

  Φ[X,y] = { √(qQ−R²) P[X|y] }^{−1} ∫Dz z ⟨δ[X−x]⟩_*                                    (12)

4 THE LIMIT α → ∞

For consistency we show that our theory reduces to the simple (Q, R) formalism of infinite training sets in the limit α → ∞. Upon making the ansatz

  P[x|y] = [2π(Q−R²)]^{−1/2} e^{−(x−Ry)²/2(Q−R²)}

one finds that the saddle-point equations are simultaneously and uniquely solved by

  M[x|y] = P[x|y],    q = R²/Q

and Φ[x,y] reduces to

  Φ[x,y] = (x−Ry)/(Q−R²)

Insertion of our ansatz into equation (11), followed by rearranging of terms and use of the above expression for Φ[x,y], shows that this equation is indeed satisfied. Thus from our general theory we indeed recover for α → ∞ the standard theory for infinite training sets.
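A quick numerical consistency check of the α → ∞ picture: if the joint field distribution P[x,y] is Gaussian with ⟨x²⟩ = Q, ⟨xy⟩ = R, ⟨y²⟩ = 1, then the training-error integral ∫dx dy P[x,y] θ[−xy] of (8) should coincide with the generalization-error formula (1/π) arccos[R/√Q]. A short Monte Carlo sketch (the values of Q and R are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

Q, R = 1.5, 0.9                 # illustrative order-parameter values (ours)
n = 1_000_000

# joint Gaussian fields (x, y) with <x^2> = Q, <xy> = R, <y^2> = 1,
# i.e. the alpha -> infinity ansatz: x | y ~ N(R*y, Q - R^2)
y = rng.standard_normal(n)
x = R * y + np.sqrt(Q - R * R) * rng.standard_normal(n)

E_empirical = np.mean(x * y < 0)                 # integral of P[x,y] * theta[-xy]
E_formula = np.arccos(R / np.sqrt(Q)) / np.pi    # eq. (8)
print(E_empirical, E_formula)
```

The agreement reflects the standard Gaussian orthant identity P(xy < 0) = arccos(ρ)/π with correlation ρ = R/√Q, which is where the arccos form of E_g in (8) originates.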