{"title": "Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 458, "abstract": null, "full_text": "Weight Space Probability Densities \n\nin Stochastic Learning: \n\nI. Dynamics and Equilibria \n\nTodd K. Leen and John E. Moody \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \n\n19600 N.W. von Neumann Dr. \n\nBeaverton, OR 97006-1999 \n\nAbstract \n\nThe ensemble dynamics of stochastic learning algorithms can be \nstudied using theoretical techniques from statistical physics. We \ndevelop the equations of motion for the weight space probability \ndensities for stochastic learning algorithms. We discuss equilibria \nin the diffusion approximation and provide expressions for special \ncases of the LMS algorithm. The equilibrium densities are not in \ngeneral thermal (Gibbs) distributions in the objective function be(cid:173)\ning minimized, but rather depend upon an effective potential that \nincludes diffusion effects. Finally we present an exact analytical \nexpression for the time evolution of the density for a learning algo(cid:173)\nrithm with weight updates proportional to the sign of the gradient. \n\n1 \n\nIntroduction: Theoretical Framework \n\nStochastic learning algorithms involve weight updates of the form \n\nw(n+1) = w(n) + /-l(n)H[w(n),x(n)] \n\n(1) \n\nwhere w E 7\u00a3m is the vector of m weights, /-l is the learning rate, H[.] E 7\u00a3m is the \nupdate function, and x(n) is the exemplar (input or input/target pair) presented \n\n451 \n\n\f452 \n\nLeen and Moody \n\nto the network at the nth iteration of the learning rule. Often the update function \nis based on the gradient of a cost function H(w,x) = -a\u00a3{w,x) law. We assume \nthat the exemplars are Li.d. with underlying probability density p{x). 
\nWe are interested in studying the time evolution and steady state behavior of the weight space probability density P(w, n) for ensembles of networks trained by stochastic learning. Stochastic process theory and classical statistical mechanics provide tools for doing this. As we shall see, the ensemble behavior of stochastic learning algorithms is similar to that of diffusion processes in physical systems, although significant differences do exist. \n\n1.1 Dynamics of the Weight Space Probability Density \n\nEquation (1) defines a Markov process on the weight space. Given the particular input x, the single time-step transition probability density for this process is a Dirac delta function whose argument satisfies the weight update (1): \n\nW(w' \rightarrow w | x) = \delta( w - w' - \mu H[w', x] ).  (2) \n\nFrom this conditional transition probability, we calculate the total single time-step transition probability (Leen and Orr 1992, Ritter and Schulten 1988) \n\nW(w' \rightarrow w) = \langle \delta( w - w' - \mu H[w', x] ) \rangle_x,  (3) \n\nwhere \langle \cdots \rangle_x denotes integration over the measure on the random variable x. \n\nThe time evolution of the density is given by the Kolmogorov equation \n\nP(w, n+1) = \int dw' P(w', n) W(w' \rightarrow w),  (4) \n\nwhich forms the basis for our dynamical description of the weight space probability density.[1] \n\nStationary, or equilibrium, probability distributions are eigenfunctions of the transition probability: \n\nP_s(w) = \int dw' P_s(w') W(w' \rightarrow w).  (5) \n\nIt is particularly interesting to note that for problems in which there exists an optimal weight w_* such that \n\nH(w_*, x) = 0 for all x, \n\none stationary solution is a delta function at w = w_*. An important class of such examples are noise-free mapping problems for which weight values exist that realize the desired mapping over all possible input/target pairs. 
For such problems, the ensemble can settle into a sharp distribution at the optimal weights (for examples see Leen and Orr 1992, Orr and Leen 1993). \n\nAlthough the Kolmogorov equation can be integrated numerically, we would like to make further analytic progress. Towards this end we convert the Kolmogorov equation into a differential-difference equation by expanding (3) as a power series in \mu. Since the transition probability is defined in the sense of generalized functions (i.e. distributions), the proper way to proceed is to smear (4) with a smooth test function of compact support f(w) to obtain \n\n\int dw f(w) P(w, n+1) = \int dw dw' f(w) P(w', n) W(w' \rightarrow w).  (6) \n\n[1] An alternative is to base the time evolution on a suitable master equation. Both approaches give the same results. \n\nNext we use the transition probability (3) to perform the integration over w and expand the resulting expression as a power series in \mu. Finally, we integrate by parts to take derivatives off f, dropping the surface terms. This results in a discrete time version of the classic Kramers-Moyal expansion (Risken 1989) \n\nP(w, n+1) - P(w, n) = \sum_{i=1}^{\infty} \frac{(-\mu)^i}{i!} \frac{\partial^i}{\partial w_{j_1} \cdots \partial w_{j_i}} [ \langle H_{j_1} \cdots H_{j_i} \rangle_x P(w, n) ],  (7) \n\nwhere H_{j_a} denotes the j_a-th component of the m-component vector H. \n\nIn section 3, we present an algorithm for which the Kramers-Moyal expansion can be explicitly summed. In general the full expansion is not analytically tractable, and to make further analytic progress we will truncate it at second order to obtain the Fokker-Planck equation. \n\n1.2 The Fokker-Planck (Diffusion) Approximation \n\nFor small enough |\mu H|, the Kramers-Moyal expansion (7) can be truncated to second order to obtain a Fokker-Planck equation:[2] \n\nP(w, n+1) - P(w, n) = -\mu \frac{\partial}{\partial w_i} [ A_i(w) P(w, n) ] + \frac{\mu^2}{2} \frac{\partial^2}{\partial w_i \partial w_j} [ B_{ij}(w) P(w, n) ].  (8) \n\nIn (8), and throughout the remainder of the paper, repeated indices are summed over. 
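The two coefficients retained in the truncation (8), A_i = \langle H_i \rangle_x and B_{ij} = \langle H_i H_j \rangle_x, are ordinary averages over the exemplar distribution and can be estimated by sampling. The sketch below (our illustration, not from the paper; the constants and the 1-D LMS update, analyzed in section 2.1, are chosen for concreteness) checks the Monte Carlo estimates against the closed forms -r v and 3 r^2 v^2 + \sigma^2 r that hold for Gaussian inputs:

```python
import random

# Monte Carlo estimate (a sketch, not from the paper) of the drift and
# diffusion coefficients <H>_x and <H^2>_x for 1-D LMS at a fixed weight
# error v.  For Gaussian inputs s ~ N(0, r) and noise eps ~ N(0, sigma^2),
# H = -(s*v)*s + eps*s, so <H> = -r*v and <H^2> = 3*r^2*v^2 + sigma^2*r.
random.seed(1)

r, sigma2, v = 1.0, 1.0, 0.5
N = 500_000
sum_H, sum_H2 = 0.0, 0.0
for _ in range(N):
    s = random.gauss(0.0, r ** 0.5)
    eps = random.gauss(0.0, sigma2 ** 0.5)
    H = -(s * v) * s + eps * s
    sum_H += H
    sum_H2 += H * H

A_mc = sum_H / N    # drift estimate; exact value is -r*v = -0.5
B_mc = sum_H2 / N   # diffusion estimate; exact value is 3*r^2*v^2 + sigma2*r = 1.75
print(A_mc, B_mc)
```

Note that the diffusion estimate depends on the position v, which is exactly the position dependence that prevents the equilibrium density from being a Gibbs distribution in the average cost.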
In the Fokker-Planck approximation, only two coefficients appear: A_i(w) = \langle H_i \rangle_x, called the drift vector, and B_{ij}(w) = \langle H_i H_j \rangle_x, called the diffusion matrix. The drift vector is simply the average update applied at w. Since the diffusion coefficients can be strongly dependent on the position in weight space, the equilibrium densities will, in general, not be thermal (Gibbs) distributions in the potential corresponding to \langle H(w, x) \rangle_x. This is exemplified in our discussion of equilibrium densities for the LMS algorithm in section 2.1 below.[3] \n\n[2] Radons et al. (1990) independently derived a Fokker-Planck equation for backpropagation. Earlier, Ritter and Schulten (1988) derived a Fokker-Planck equation (for Kohonen's self-ordering feature map) that is valid in the neighborhood of a local optimum. \n\n[3] See (Leen and Orr 1992, Orr and Leen 1993) for further examples. \n\n2 Equilibrium Densities in the Fokker-Planck Approximation \n\nIn equilibrium the probability density is stationary, P(w, n+1) = P(w, n) = P_s(w), so the Fokker-Planck equation (8) becomes \n\n0 = -\frac{\partial}{\partial w_i} J_i(w) \equiv -\frac{\partial}{\partial w_i} ( \mu A_i(w) P_s(w) - \frac{\mu^2}{2} \frac{\partial}{\partial w_j} [ B_{ij}(w) P_s(w) ] ).  (9) \n\nHere, we have implicitly defined the probability density current J(w). In equilibrium, its divergence is zero. \n\nIf the drift and diffusion coefficients satisfy potential conditions, then the equilibrium current itself is zero and detailed balance is obtained. The potential conditions are (Gardiner, 1990) \n\n\frac{\partial Z_k}{\partial w_l} - \frac{\partial Z_l}{\partial w_k} = 0, where Z_k(w) = B^{-1}_{kl}(w) [ \frac{\mu}{2} \frac{\partial}{\partial w_j} B_{lj}(w) - A_l(w) ].  (10) \n\nUnder these conditions the solution to (9) for the equilibrium density is \n\nP_s(w) = \frac{1}{K} e^{-2 F(w)/\mu}, F(w) = \int^w dw_k Z_k(w),  (11) \n\nwhere K is a normalization constant and F(w) is called the effective potential. 
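The quadrature (11) is easy to carry out numerically in one dimension. The sketch below (our illustration; the grid and constants are arbitrary choices, and the 1-D LMS drift A(v) = -r v and diffusion B(v) = 3 r^2 v^2 + \sigma^2 r from section 2.1 are assumed) integrates Z to get the effective potential F and verifies that exp(-2F/\mu) reproduces the closed-form equilibrium density (18) with m = 1:

```python
import math

# Numerical quadrature (a sketch, not from the paper) of the effective
# potential F(v) for 1-D LMS with Gaussian inputs, where A(v) = -r*v and
# B(v) = 3*r**2*v**2 + sigma2*r.  In 1-D the potential conditions hold
# trivially, Z(v) = (1/B) * ((mu/2) * dB/dv - A), F(v) = integral of Z,
# and the equilibrium density is proportional to exp(-2*F(v)/mu).
mu, r, sigma2 = 0.01, 1.0, 1.0

vs = [i * 0.0005 - 0.5 for i in range(2001)]   # grid on [-0.5, 0.5]

def Z(v):
    B = 3.0 * r**2 * v**2 + sigma2 * r
    dB = 6.0 * r**2 * v
    A = -r * v
    return ((mu / 2.0) * dB - A) / B

# cumulative trapezoid rule for F(v); the additive constant is absorbed
# into the normalization
F = [0.0]
for i in range(1, len(vs)):
    h = vs[i] - vs[i - 1]
    F.append(F[-1] + 0.5 * h * (Z(vs[i]) + Z(vs[i - 1])))

p_num = [math.exp(-2.0 * f / mu) for f in F]
s = sum(p_num)
p_num = [p / s for p in p_num]

# closed-form density (18) with m = 1, normalized on the same grid
expo = 1.0 / (3.0 * mu * r) + 1.0
p_cf = [(1.0 + 3.0 * r * v**2 / sigma2) ** (-expo) for v in vs]
s = sum(p_cf)
p_cf = [p / s for p in p_cf]

err = max(abs(a - b) for a, b in zip(p_num, p_cf))
print(err)
```

The agreement is limited only by the trapezoid discretization, illustrating that in one dimension the quadrature (11) and the closed form are the same object.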
\n\nIn general, the potential conditions are not satisfied for stochastic learning algo(cid:173)\nrithms in multiple dimensions. 4 \nIn this respect, stochastic learning differs from \nmost physical diffusion processes. However for LMS with inputs whose correlation \nmatrix is isotropic, the conditions are satisfied and the equilibrium density can be \nreduced to the quadrature in (11). \n\n2.1 Equilibrium Density for the LMS Algorithm \n\nThe best known on-line learning system is the LMS adaptive filter. For the LMS \nalgorithm, the training examples consist of input/target pairs x(n) = {s(n),t(n)}, \nthe model output is u(n) = W\u00b7 s(n), and the cost function is the squared error: \n\n\u00a3(w,x(n)) = 2 [t(n)-u(n)]2 = 2 [t(n)-w\u00b7s(n)]2 \n\n1 \n\n1 \n\nThe resulting update equations (for constant learning rate J-l) are \nw(n+l) = w(n) + J-l[t(n)-w.s(n)]s(n). \n\n(12) \n\n(13) \n\nWe assume that the training data are generated according to a \"signal plus noise\" \nmodel: \n\n(14) \nwhere w. is the \"true\" weight vector and \u20ac( n) is LLd. noise with mean zero and \nvariance (12. We denote the correlation matrix of the inputs s( n) by R and the \n\n, \n\nt(n) = w \u2022 . s(n) + \u20ac(n) \n\n4For one-dimensional algorithms, the potential conditions are trivially satisfied. \n\n\fWeight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria \n\n455 \n\nfourth order correlation tensor of the inputs by S. It is convenient to shift the \norigin of coordinates in weight space and define the weight error vector \n\nv = w - w \u2022. \n\nIn terms of v, the weight update is \n\nv(n+l) = v(n) - JJ[s(n).v(n)]s(n) + JJf(n)s(n). \n\nThe drift vector and diffusion matrix are given by \n\nAi=-(SiSj}s Vj = -RijVj \n\n(15) \n\nand \n\n, \n\nBij = (Si Sj Sle SI Vie VI + f2 Sj Sj ) s ( = Sijlel Vie VI + (72 Rij \n\n(16) \nrespectively. Notice that the diffusion matrix is quadratic in v. 
Thus as we move away from the global minimum at v = 0, diffusive spreading of the probability density is enhanced. Notice also that, in general, both terms of the diffusion matrix contribute an anisotropy. \n\nWe further assume that the inputs are drawn from a zero-mean Gaussian process. This assumption allows us to appeal to the Gaussian moment factoring theorem (Haykin, 1991, p. 318) to express the fourth-order correlation S in terms of R: \n\nS_{ijkl} = R_{ij} R_{kl} + R_{ik} R_{jl} + R_{il} R_{jk}. \n\nThe diffusion matrix reduces to \n\nB_{ij} = R_{ij} R_{kl} v_k v_l + 2 R_{ik} v_k R_{jl} v_l + \sigma^2 R_{ij}. \n\nTo compute the effective potential (10 and 11) the diffusion matrix is inverted using the Sherman-Morrison formula (Press et al., 1987, p. 67). As a final simplification, we assume that the input distribution is spherically symmetric. Thus \n\nR = r I,  (17) \n\nwhere I denotes the identity matrix. \n\nTogether these assumptions insure detailed balance, and we can integrate (11) in closed form. In figure 1, we compare the effective potential F(v) (for 1-D LMS) with the potential corresponding to the quadratic cost function. \n\n[Fig. 1: Effective potential (dashed curve) and cost function (solid curve) for 1-D LMS; v on the horizontal axis.] \n\nThe spatial dependence of the diffusion coefficient forces the effective potential to soften relative to the cost function for large |v|. This accentuates the tails of the distribution relative to a gaussian. \n\nThe equilibrium density is \n\nP_s(v) = \frac{1}{K} [ 1 + \frac{3r}{\sigma^2} |v|^2 ]^{ -( \frac{1}{3 \mu r} + m ) },  (18) \n\nwhere, as before, m and K denote the dimension of the weight vector and the normalization constant for the density respectively. For a 1-D filter, the equilibrium density can be found in closed form without assuming Gaussian input data. We find \n\nP_s(v) = \frac{1}{K} [ 1 + \frac{S}{\sigma^2 r} v^2 ]^{ -( \frac{r}{\mu S} + 1 ) }.  (19) \n\nWith gaussian inputs (for which S = 3 r^2), (19) properly reduces to (18) with m = 1. 
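The stationary statistics of these densities can also be probed by direct simulation. The sketch below (ours, not the paper's code; the run lengths are arbitrary) iterates the 1-D LMS weight-error recursion with the parameters used for the paper's figure 2 and compares the empirical variance against \mu \sigma^2 / 2, the variance of the Gaussian that the equilibrium density approaches for small \mu r:

```python
import random

# Direct simulation (a sketch, not the paper's code) of the 1-D LMS
# weight-error recursion v(n+1) = v(n) - mu*(s*v)*s + mu*eps*s with
# Gaussian inputs.  For small mu*r the equilibrium density is close to
# a Gaussian with variance mu*sigma2/2.
random.seed(2)

mu, sigma2, r = 0.005, 1.0, 4.0      # parameters as in the paper's figure 2
v = 0.0
burn_in, n_samples = 5_000, 200_000
acc = 0.0
for n in range(burn_in + n_samples):
    s = random.gauss(0.0, r ** 0.5)
    eps = random.gauss(0.0, sigma2 ** 0.5)
    v = v - mu * (s * v) * s + mu * eps * s
    if n >= burn_in:
        acc += v * v

var_emp = acc / n_samples
print(var_emp, mu * sigma2 / 2.0)    # empirical vs small-mu*r prediction
```

A histogram of the sampled v values exhibits the slightly heavy tails of the exact density relative to the limiting Gaussian.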
\nThe equilibrium densities (18) and (19) are clearly not gaussian; however, in the limit of very small \mu r they reduce to gaussian distributions with variance \mu \sigma^2 / 2. Figure 2 shows a comparison between the theoretical result and a histogram of 200,000 values of v generated by simulation with \mu = 0.005 and \sigma^2 = 1.0. The input data were drawn from a zero-mean Gaussian distribution with r = 4.0. \n\n[Fig. 2: Equilibrium density for 1-D LMS; theoretical curve and simulation histogram, with v on the horizontal axis from -0.2 to 0.2.] \n\n3 An Exactly Summable Model \n\nAs in the case of LMS learning above, stochastic gradient descent algorithms update weights based on an instantaneous estimate of the gradient of some average cost function \mathcal{E}(w) = \langle \mathcal{E}(w, x) \rangle_x. That is, the update is given by \n\nH_i(w, x) = -\frac{\partial}{\partial w_i} \mathcal{E}(w, x). \n\nAn alternative is to increment or decrement each weight by a fixed amount depending only on the sign of \partial \mathcal{E} / \partial w_i. We formulated this alternative update rule because it avoids a common problem for sigmoidal networks: getting stuck on \"flat spots\" or \"plateaus\". The standard gradient descent update rule yields very slow movement on plateaus, while second order methods such as Gauss-Newton can be unstable. The sign-of-gradient update rule suffers from neither of these problems.[5] \n\n[5] The use of the sign of the gradient has been suggested previously in the stochastic approximation literature by Fabian (1960) and in the neural network literature by Derthick (1984). \n\nIf at each iteration one chooses a weight at random for updating, then the Kramers-Moyal expansion can be exactly summed. Thus at each iteration we 1) choose a weight w_i and an exemplar x at random, and 2) update w_i with 
\n\nH_i(w, x) = -\mathrm{sign} ( \frac{\partial \mathcal{E}(w, x(n))}{\partial w_i} ).  (20) \n\nWith this update rule, H_i = \pm 1 or 0 and H_i H_j = \delta_{ij} (or 0). All of the coefficients \langle H_i H_j H_k \cdots \rangle_x in the Kramers-Moyal expansion (7) vanish unless i = j = k = \cdots. The remaining series can be summed by breaking it into odd and even parts. This leaves \n\nP(w, n+1) - P(w, n) = -\frac{1}{2m} \sum_{j=1}^{m} \{ P(w + \mu_j, n) A_j(w + \mu_j) - P(w - \mu_j, n) A_j(w - \mu_j) \} + \frac{1}{2m} \sum_{j=1}^{m} \{ P(w + \mu_j, n) B_{jj}(w + \mu_j) - 2 P(w, n) B_{jj}(w) + P(w - \mu_j, n) B_{jj}(w - \mu_j) \},  (21) \n\nwhere \mu_j denotes a displacement along w_j a distance \mu, A_j(w) = \langle H_j(w, x) \rangle_x, and B_{jj}(w) = \langle H_j^2(w, x) \rangle_x. Note that B_{jj}(w) = 1 unless H(w, x) = 0 for all x, in which case B_{jj}(w) = 0. Although exact, (21) curiously has the form of a second order finite difference approximation to the Fokker-Planck equation with diagonal diffusion matrix. This form is understandable, since the dynamics (20) restrict the weight values w to a hypercubic lattice with cell length \mu and generate only nearest neighbor interactions. \n\n[Fig. 3: Sequence of densities for the XOR problem at n = 0, n = 500, and n = 5000, plotted over a 1-D slice with v ranging from -0.5 to 2.5.] \n\nAs an example, figure 3 shows the cost function evaluated along a 1-D slice through the weight space for the XOR problem. Along this line are local and global minima at v = 1 and v = 0 respectively. Also shown is the probability density (vertical lines). The sequence shows the spreading of the density from its initialization at the local minimum, and its eventual collection at the global minimum. \n\n4 Discussion \n\nA theoretical approach that focuses on the dynamics of the weight space probability density, as we do here, provides powerful tools to extend understanding of stochastic search. 
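To make the sign-of-gradient dynamics of section 3 concrete, here is a small simulation sketch (ours, not from the paper; the 1-D instantaneous cost and all constants are illustrative). Each iteration moves the weight exactly \pm \mu regardless of the gradient magnitude, so the weight marches across flat or steep regions at the same speed and, as in rule (20), lives on a lattice of cell length \mu:

```python
import random

# Simulation sketch (ours, not from the paper) of the sign-of-gradient
# rule in 1-D.  Instantaneous cost E(v, x) = 0.5*(v - x)**2 with
# x ~ N(0, sigma^2), so H = -sign(v - x) and every update moves v by
# exactly +/- mu: the weight is confined to a lattice of cell length mu.
random.seed(3)

mu, sigma = 0.05, 0.5
v = 2.0                                  # start far from the minimum at 0
tail = []
for n in range(2000):
    x = random.gauss(0.0, sigma)
    g = v - x                            # dE/dv for this exemplar
    v -= mu * (1.0 if g > 0 else -1.0 if g < 0 else 0.0)
    if n >= 1000:
        tail.append(abs(v))

print(sum(tail) / len(tail))  # small: v rattles within a few mu of 0
```

The trajectory first drifts toward the minimum at a constant \mu per step, then rattles on the lattice around it, mirroring the drift and diffusion terms of the summed expansion (21).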
Both transient and equilibrium behavior can be studied using these tools. We expect that knowledge of equilibrium weight space distributions can be used in conjunction with theories of generalization (e.g. Moody, 1992) to assess the influence of stochastic search on prediction error. Characterization of transient phenomena should facilitate the design and evaluation of search strategies such as data batching and adaptive learning rate schedules. Transient phenomena are treated in greater depth in the companion paper in this volume (Orr and Leen, 1993). \n\nAcknowledgements \n\nT. Leen was supported under grants N00014-91-J-1482 and N00014-90-J-1349 from ONR. J. Moody was supported under grants 89-0478 from AFOSR, ECS-9114333 from NSF, and N00014-89-J-1228 and N00014-92-J-4062 from ONR. \n\nReferences \n\nTodd K. Leen and Genevieve B. Orr (1992), Weight-space probability densities and convergence times for stochastic learning. In International Joint Conference on Neural Networks, pages IV 158-164. IEEE, June. \n\nH. Ritter and K. Schulten (1988), Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection. Biol. Cybern., 60, 59-71. \n\nGenevieve B. Orr and Todd K. Leen (1993), Probability densities in stochastic learning: II. Transients and basin hopping times. In Giles, C.L., Hanson, S.J., and Cowan, J.D. (eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann Publishers. \n\nH. Risken (1989), The Fokker-Planck Equation. Springer-Verlag, Berlin. \n\nG. Radons, H.G. Schuster and D. Werner (1990), Fokker-Planck description of learning in backpropagation networks. In International Neural Network Conference - INNC 90, Paris, II 993-996. Kluwer Academic Publishers. \n\nC.W. Gardiner (1990), Handbook of Stochastic Methods, 2nd Ed. Springer-Verlag, Berlin. \n\nSimon Haykin (1991), Adaptive Filter Theory, 2nd edition. 
Prentice Hall, Englewood Cliffs, N.J. \n\nW.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling (1987), Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge / New York. \n\nV. Fabian (1960), Stochastic approximation methods. Czechoslovak Math. J., 10, 123-159. \n\nMark Derthick (1984), Variations on the Boltzmann machine learning algorithm. Technical Report CMU-CS-84-120, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, August. \n\nJohn E. Moody (1992), The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA. \n", "award": [], "sourceid": 634, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}, {"given_name": "John", "family_name": "Moody", "institution": null}]}