{"title": "Discovering Hidden Features with Gaussian Processes Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 619, "abstract": null, "full_text": "Dynamics of Supervised Learning with \n\nRestricted Training Sets \n\nA.C.C. Coolen \n\nDept of Mathematics \nKing's College London \n\nStrand, London WC2R 2LS, UK \n\ntcoolen @mth.kcl.ac.uk \n\nD. Saad \n\nNeural Computing Research Group \n\nAston University \n\nBirmingham B4 7ET, UK \n\nsaadd@aston.ac.uk \n\nAbstract \n\nWe study the dynamics of supervised learning in layered neural net(cid:173)\nworks, in the regime where the size p of the training set is proportional \nto the number N of inputs. Here the local fields are no longer described \nby Gaussian distributions. We use dynamical replica theory to predict \nthe evolution of macroscopic observables, including the relevant error \nmeasures, incorporating the old formalism in the limit piN --t 00. \n\n1 \n\nINTRODUCTION \n\nMuch progress has been made in solving the dynamics of supervised learning in layered \nneural networks, using the strategy of statistical mechanics: by deriving closed laws for the \nevolution of suitably chosen macroscopic observables (order parameters) in the limit of an \ninfinite system size [1, 2, 3, 4]. For a recent review and guide to references see e.g. [5]. \nThe main successful procedure developed so far is built on the following cornerstones: \n\u2022 The task to be learned is defined by a 'teacher', which is itself a neural network. This in(cid:173)\nduces a natural set of order parameters (mutual weight vector overlaps between the teacher \nand the trained, 'student', network). \n\u2022 The number of network inputs is infinitely large. This ensures that fluctuations in the \norder parameters will vanish, and enables usage of the central limit theorem. 
• The number of 'hidden' neurons is finite, in both teacher and student, ensuring a finite number of order parameters and an insignificant cumulative impact of the fluctuations.

• The size of the training set is much larger than the number of updates. Each example presented is now different from the previous ones, so that the local fields will have Gaussian distributions, leading to closure of the dynamic equations.

In this paper we study the dynamics of learning in layered networks with restricted training sets, where the number p of examples scales linearly with the number N of inputs. Individual examples will now re-appear during the learning process as soon as the number of weight updates made is of the order of p. Correlations will develop between the weights and the training set examples, and the student's local fields (activations) will be described by non-Gaussian distributions (see e.g. Figure 1). This leads to a breakdown of the standard formalism: the field distributions are no longer characterized by a few moments, and the macroscopic laws must now be averaged over realizations of the training set.

[Figure 1 appears here: two scatter plots of the joint distribution of the fields (x, y), both at α = 0.5 and t = 50.]

Figure 1: Student and teacher fields (x, y) (see text) observed during numerical simulations of on-line learning (learning rate η = 1) in a perceptron of size N = 10,000 at t = 50, using examples from a training set of size p = N/2. Left: Hebbian learning. Right: AdaTron learning [5]. Both distributions are clearly non-Gaussian.
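The qualitative picture of Figure 1 is easy to reproduce numerically. The sketch below is our own illustration, not the authors' code: it assumes the standard on-line Hebbian rule J ← J + (η/N) T(ξ) ξ, and the system size and run length are scaled down from those in the figure. It trains a student on a restricted training set and collects the field pairs (x, y) = (J·ξ, B·ξ) over that set.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, eta, t_max = 1000, 0.5, 1.0, 20.0
p = int(alpha * N)                  # restricted training set: p = alpha * N

# Teacher weight vector B and a fixed, randomly composed training set.
B = rng.standard_normal(N)
B /= np.linalg.norm(B)
xis = rng.choice([-1.0, 1.0], size=(p, N))
T = np.sign(xis @ B)                # teacher outputs T(xi) = sgn[B.xi]

# On-line Hebbian learning, J <- J + (eta/N) T(xi) xi, for m = t_max * N random
# draws from the training set.  For this rule the final J depends only on how
# often each example was drawn, so we tally the draws instead of looping.
draws = rng.integers(p, size=int(t_max * N))
counts = np.bincount(draws, minlength=p).astype(float)
J = (eta / N) * (counts * T) @ xis

# Student and teacher fields over the training set (cf. Figure 1).  Because
# examples re-appear during training, the joint distribution of (x, y)
# becomes non-Gaussian for finite alpha.
x, y = xis @ J, xis @ B
```

Scatter-plotting x against y reproduces the clustered, non-Gaussian cloud of the left panel of Figure 1; the precise field magnitudes depend on the chosen N, t and η.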
The first rigorous study of the dynamics of learning with restricted training sets in non-linear networks, via generating functionals [6], was carried out for networks with binary weights. Here we use dynamical replica theory (see e.g. [7]) to predict the evolution of macroscopic observables for finite α, incorporating the old formalism as a special case (α = p/N → ∞). For simplicity we restrict ourselves to single-layer systems and noise-free teachers.

2 FROM MICROSCOPIC TO MACROSCOPIC LAWS

A 'student' perceptron operates a rule which is parametrised by the weight vector J ∈ ℝ^N:

    S: {−1,1}^N → {−1,1}        S(ξ) = sgn[J·ξ] ≡ sgn[x]        (1)

It tries to emulate a teacher perceptron which operates a similar rule, characterized by a (fixed) weight vector B ∈ ℝ^N. The student modifies its weight vector J iteratively, using examples of input vectors ξ which are drawn at random from a fixed (randomly composed) training set D̃ = {ξ¹, ..., ξ^p} ⊂ D = {−1,1}^N, of size p = αN with α > 0, and the corresponding values of the teacher outputs T(ξ) = sgn[B·ξ] ≡ sgn[y]. Averages over the training set D̃ and over the full set D will be denoted as ⟨Φ[ξ]⟩_D̃ and ⟨Φ[ξ]⟩_D, respectively:

    ⟨Φ[ξ]⟩_D̃ = (1/p) Σ_{ξ∈D̃} Φ[ξ]        ⟨Φ[ξ]⟩_D = 2^{−N} Σ_{ξ∈D} Φ[ξ]        (2)

In on-line learning one draws at each step m a question ξ(m) at random from the training set, so that the dynamics is a stochastic process; in batch learning one iterates a deterministic map. Our key dynamical observables are the training and generalization errors, defined as

    E_t(J) = ⟨θ[−(J·ξ)(B·ξ)]⟩_D̃        E_g(J) = ⟨θ[−(J·ξ)(B·ξ)]⟩_D        (3)

Only if the training set D̃ is sufficiently large, and if there are no correlations between J and the training set examples, will these two errors be identical. We now turn to macroscopic observables Ω[J] = (Ω₁[J], ..., Ω_k[J]). For N → ∞ (with finite times t = m/N and with finite k), and if our observables are of a so-called mean-field type, their associated macroscopic distribution P_t(Ω) is found to obey a Fokker-Planck type equation, with flow and diffusion terms that depend on whether on-line or batch learning is used. We now choose a specific set of observables Ω[J], tailored to the present problem:

    Q[J] = J²        R[J] = J·B        P[x,y;J] = ⟨δ[x−J·ξ] δ[y−B·ξ]⟩_D̃        (4)

This choice is motivated as follows: (i) in order to incorporate the old formalism we need Q[J] and R[J]; (ii) the training error involves field statistics calculated over the training set, as given by P[x,y;J]; and (iii) for α < ∞ one cannot expect closed equations for a finite number of order parameters, and the present choice effectively represents an infinite number. We will assume the number of arguments (x,y) for which P[x,y;J] is evaluated to go to infinity only after the limit N → ∞ has been taken. This eliminates technical subtleties and allows us to show that in the Fokker-Planck equation all diffusion terms vanish as N → ∞. The latter thereby reduces to a Liouville equation, describing deterministic evolution of our macroscopic observables.
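The gap between the two errors in (3) at finite α is easy to observe numerically. A minimal sketch (our own Python illustration; the Hebbian rule and all parameter values are assumptions made for the demo, not the paper's settings), estimating E_t over the training set and E_g over fresh inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, eta, t = 500, 0.5, 1.0, 10.0
p = int(alpha * N)

# Teacher and a fixed restricted training set of p = alpha*N examples.
B = rng.standard_normal(N)
B /= np.linalg.norm(B)
xis = rng.choice([-1.0, 1.0], size=(p, N))
T = np.sign(xis @ B)

# Hebbian student after t*N on-line updates (order-independent, so tally draws).
counts = np.bincount(rng.integers(p, size=int(t * N)), minlength=p).astype(float)
J = (eta / N) * (counts * T) @ xis

# E_t: fraction of training examples with (J.xi)(B.xi) < 0, as in (3).
E_t = float(np.mean((xis @ J) * (xis @ B) < 0))

# E_g: the same average, but over fresh inputs drawn uniformly from {-1,1}^N.
fresh = rng.choice([-1.0, 1.0], size=(4000, N))
E_g = float(np.mean((fresh @ J) * (fresh @ B) < 0))

# Because J has become correlated with the re-used training examples,
# E_t underestimates E_g.
print(E_t, E_g)
```

The run shows E_t well below E_g: precisely the correlation effect that forces the extended formalism developed below.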
For on-line learning one arrives at

    (d/dt) Q = 2η ∫dx dy P[x,y] x G[x;y] + η² ∫dx dy P[x,y] G²[x;y]        (5)

    (d/dt) R = η ∫dx dy P[x,y] y G[x;y]        (6)

    (d/dt) P[x,y] = (1/α) [ ∫dx′ P[x′,y] δ[x−x′−ηG[x′,y]] − P[x,y] ]
                    − η (∂/∂x) ∫dx′dy′ P[x′,y′] G[x′,y′] A[x,y;x′,y′]
                    + (η²/2) ∫dx′dy′ P[x′,y′] G²[x′,y′] (∂²/∂x²) P[x,y]        (7)

Expansion of these equations in powers of η, and retaining only the terms linear in η, gives the corresponding equations describing batch learning. The complexity of the problem is fully concentrated in a Green's function A[x,y;x′,y′], which is defined as

    A[x,y;x′,y′] = lim_{N→∞} ⟨ ⟨ ⟨ [1−δ_{ξξ′}] δ[x−J·ξ] δ[y−B·ξ] (ξ·ξ′) δ[x′−J·ξ′] δ[y′−B·ξ′] ⟩_D̃ ⟩_D̃ ⟩_{Ω;t}

It involves a sub-shell average, in which p_t(J) is the weight probability density at time t:

    ⟨K[J]⟩_{Ω;t} = ∫dJ K[J] p_t(J) δ[Q−Q[J]] δ[R−R[J]] Π_{xy} δ[P[x,y]−P[x,y;J]] / ∫dJ p_t(J) δ[Q−Q[J]] δ[R−R[J]] Π_{xy} δ[P[x,y]−P[x,y;J]]

where the sub-shells are defined with respect to the order parameters. The solution of (5,6,7) can be used to generate the errors of (3):

    E_t = ∫dx dy P[x,y] θ[−xy]        E_g = (1/π) arccos[R/√Q]        (8)

3 CLOSURE VIA DYNAMICAL REPLICA THEORY

So far our analysis is still exact. We now close the macroscopic laws (5,6,7) by making, for N → ∞, the two key assumptions underlying dynamical replica theory [7]:

(i) Our macroscopic observables {Q, R, P} obey closed dynamic equations.
(ii) These equations are self-averaging with respect to the realisation of D̃.

(i) implies that probability variations within the {Q, R, P} subshells are either absent or irrelevant to the evolution of {Q, R, P}. We may thus make the simplest choice for p_t(J):

    p_t(J) → p(J) ∼ δ[Q−Q[J]] δ[R−R[J]] Π_{xy} δ[P[x,y]−P[x,y;J]]        (9)

p(J) depends on time implicitly, via the order parameters {Q, R, P}. The procedure (9) leads to exact laws if our observables {Q, R, P} indeed obey closed equations for N → ∞; it gives an approximation if they do not. (ii) allows us to average the macroscopic laws over all training sets; it is observed in numerical simulations, and can probably be proven using the formalism of [6]. Our assumptions result in the closure of (5,6,7), since now A[...] is expressed fully in terms of {Q, R, P}. The final ingredient of dynamical replica theory is the realization that averaging fractions is simplified with the replica identity [8]:

    ⟨ ∫dJ W[J,z] G[J,z] / ∫dJ W[J,z] ⟩_z = lim_{n→0} ∫dJ¹ ⋯ dJⁿ ⟨ G[J¹,z] Π_{α=1..n} W[J^α,z] ⟩_z

What remains is to perform integrations. One finds that P[x,y] = P[x|y] P[y] with P[y] = (2π)^{−1/2} e^{−y²/2}. Upon introducing the short-hands Dy = (2π)^{−1/2} e^{−y²/2} dy and ⟨f(x,y)⟩ = ∫Dy dx P[x|y] f(x,y), we can write the resulting macroscopic laws as follows:

    (d/dt) Q = 2ηV + η²Z        (d/dt) R = ηW        (10)

    (∂/∂t) P[x|y] = (1/α) ∫dx′ P[x′|y] { δ[x−x′−ηG[x′,y]] − δ[x−x′] } + (η²/2) Z (∂²/∂x²) P[x|y]
                    − η (∂/∂x) { P[x|y] [ U(x−Ry) + Wy + [V−RW−(Q−R²)U] Φ[x,y] ] }        (11)

with

    U = ⟨Φ[x,y] G[x,y]⟩,   V = ⟨x G[x,y]⟩,   W = ⟨y G[x,y]⟩,   Z = ⟨G²[x,y]⟩

As before, the batch equations follow upon expanding in η and retaining only the linear terms. Finding the function Φ[x,y] (in replica-symmetric ansatz) requires solving a saddle-point problem for a scalar observable q and a function M[x|y].
Upon introducing

    B = √(qQ−R²) / [Q(1−q)]        ⟨f[x,y,z]⟩_* = ∫dx M[x|y] e^{Bxz} f[x,y,z] / ∫dx M[x|y] e^{Bxz}

(with ∫dx M[x|y] = 1 for all y) the saddle-point equations acquire the form

    for all X, y:   P[X|y] = ∫Dz ⟨δ[X−x]⟩_*

    ⟨(x−Ry)²⟩ + (qQ−R²)[1 − 2/α] = [Q(1+q)−2R²] ⟨x Φ[x,y]⟩

The solution M[x|y] of the functional saddle-point equation, given a value for q in the physical range q ∈ [R²/Q, 1], is unique [9]. The function Φ[x,y] is then given by

    Φ[X,y] = { √(qQ−R²) P[X|y] }^{−1} ∫Dz z ⟨δ[X−x]⟩_*        (12)

4 THE LIMIT α → ∞

For consistency we show that our theory reduces to the simple (Q,R) formalism of infinite training sets in the limit α → ∞. Upon making the ansatz

    P[x|y] = [2π(Q−R²)]^{−1/2} e^{−½[x−Ry]²/(Q−R²)}

one finds that the saddle-point equations are simultaneously and uniquely solved by

    M[x|y] = P[x|y],        q = R²/Q

and Φ[x,y] reduces to

    Φ[x,y] = (x−Ry)/(Q−R²)

Insertion of our ansatz into equation (11), followed by rearranging of terms and usage of the above expression for Φ[x,y], shows that this equation is satisfied. Thus from our general theory we indeed recover for α → ∞ the standard theory for infinite training sets.
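Two ingredients of the α → ∞ picture can be checked directly by simulation: the Gaussian ansatz for P[x|y] (mean Ry, variance Q−R²), and the arccos formula for E_g in (8). The sketch below is our own Python illustration, not part of the paper; the particular student vector J is an arbitrary assumption chosen so that Q and R are known exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000

# Fixed teacher B (unit length) and a student J with prescribed overlaps.
B = rng.standard_normal(N)
B /= np.linalg.norm(B)
noise = rng.standard_normal(N)
noise -= (noise @ B) * B            # orthogonalise the noise direction to B
noise /= np.linalg.norm(noise)
J = 1.2 * B + 0.9 * noise           # Q = J.J = 2.25 and R = J.B = 1.2 exactly

Q, R = J @ J, J @ B

# Fresh inputs only: this mimics alpha -> infinity, where no example is ever
# re-used and J stays uncorrelated with the inputs on which it is tested.
xi = rng.choice([-1.0, 1.0], size=(5000, N))
x, y = xi @ J, xi @ B

# Check 1: conditionally on y, x should be Gaussian with mean R*y and
# variance Q - R^2 (the ansatz of Section 4).
resid = x - R * y
print(float(np.mean(resid)), float(np.var(resid)), Q - R * R)

# Check 2: the disagreement probability should equal arccos(R/sqrt(Q))/pi,
# the generalization error of (8).
E_g = float(np.mean(np.sign(x) != np.sign(y)))
predicted = float(np.arccos(R / np.sqrt(Q)) / np.pi)
print(E_g, predicted)
```

Both checks agree to within sampling error, illustrating why, once examples are never re-used, the two overlaps Q and R suffice to close the dynamics.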