{"title": "Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 514, "abstract": null, "full_text": "Weight Space Probability Densities \n\nin Stochastic Learning: \n\nII. Transients and Basin Hopping Times \n\nGenevieve B. Orr and Todd K. Leen \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \n\n19600 N.W. von Neumann Drive \n\nBeaverton, OR 97006-1999 \n\nAbstract \n\nIn stochastic learning, weights are random variables whose time \nevolution is governed by a Markov process. At each time-step, \nn, the weights can be described by a probability density function \npew, n). We summarize the theory of the time evolution of P, and \ngive graphical examples of the time evolution that contrast the \nbehavior of stochastic learning with true gradient descent (batch \nlearning). Finally, we use the formalism to obtain predictions of the \ntime required for noise-induced hopping between basins of different \noptima. We compare the theoretical predictions with simulations \nof large ensembles of networks for simple problems in supervised \nand unsupervised learning. \n\n1 Weight-Space Probability Densities \n\nDespite the recent application of convergence theorems from stochastic approxima(cid:173)\ntion theory to neural network learning (Oja 1982, White 1989) there remain out(cid:173)\nstanding questions about the search dynamics in stochastic learning. 
For example, the convergence theorems do not tell us to which of several optima the algorithm is likely to converge.¹ Also, while it is widely recognized that the intrinsic noise in the weight update can move the system out of sub-optimal local minima (for a graphical example, see Darken and Moody 1991), there have been no theoretical predictions of the time required to escape from local optima, or of its dependence on learning rates.

In order to more fully understand the dynamics of stochastic search, we study the weight-space probability density and its time evolution. In this paper we summarize a theoretical framework that describes this time evolution. We graphically portray the motion of the density for examples that contrast stochastic and batch learning. Finally we use the theory to predict the statistical distribution of times required for escape from local optima. We compare the theoretical results with simulations for simple examples in supervised and unsupervised learning.

2 Stochastic Learning and Noisy Maps

2.1 Motion of the Probability Density

We consider stochastic learning algorithms of the form

w(n+1) = w(n) + μ H[w(n), x(n)]    (1)

where w(n) ∈ ℝ^m is the weight, x(n) is the data exemplar input to the algorithm at time-step n, μ is the learning rate, and H[...] ∈ ℝ^m is the weight update function. The exemplars x(n) can be either inputs or, in the case of supervised learning, input/target pairs. We assume that the x(n) are i.i.d. with density p(x). Angled brackets ⟨...⟩_x denote averaging over this density. In what follows, the learning rate will be held constant.

The learning algorithm (1) is a noisy map on w. The weights are thus random variables described by the probability density function P(w, n).
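As an illustration of the noisy map (1), the following sketch iterates an ensemble of networks on a hypothetical 1-D problem (online mean estimation with update H(w, x) = x − w; the problem, constants, and function names are illustrative choices, not from the paper):

```python
import random

def H(w, x):
    # Hypothetical weight update for online mean estimation: step toward x.
    return x - w

def simulate_ensemble(n_nets, n_steps, mu, sample_x, w0=0.0, seed=0):
    """Iterate the noisy map w(n+1) = w(n) + mu*H(w(n), x(n)) for an ensemble
    of networks, each seeing its own i.i.d. exemplar sequence."""
    rng = random.Random(seed)
    weights = [w0] * n_nets
    for _ in range(n_steps):
        weights = [w + mu * H(w, sample_x(rng)) for w in weights]
    return weights

# Exemplars drawn i.i.d. from {-1, +1, +2} with equal probability (mean 2/3).
final = simulate_ensemble(n_nets=2000, n_steps=500, mu=0.05,
                          sample_x=lambda rng: rng.choice((-1.0, 1.0, 2.0)))
mean_w = sum(final) / len(final)
```

A histogram of `final` is a Monte Carlo estimate of the density P(w, n) at n = 500: its mean sits at the fixed point of the average update, ⟨x⟩ = 2/3, while its nonzero width comes entirely from the intrinsic sampling noise.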
The time evolution of this density is given by the Kolmogorov equation

P(w, n+1) = ∫ dw' P(w', n) W(w' → w)    (2)

where the single time-step transition probability is given by (Leen and Orr 1992, Leen and Moody 1993)

W(w' → w) = ⟨ δ( w − w' − μH[w', x] ) ⟩_x    (3)

and δ(·) is the Dirac delta function.

The Kolmogorov equation can be recast as a differential-difference equation by expanding the transition probability (3) as a power series in μ. This gives a Kramers-Moyal expansion (Leen and Orr 1992, Leen and Moody 1993)

P(w, n+1) − P(w, n) = Σ_{i=1}^∞ [(−μ)^i / i!] Σ_{j_1,...,j_i=1}^m ∂^i / (∂w_{j_1} ··· ∂w_{j_i}) [ ⟨H_{j_1} ··· H_{j_i}⟩_x P(w, n) ]    (4)

where w_j and H_j are the j-th components of the weight and the weight update, respectively.

Truncating (4) to second order in μ leaves a Fokker-Planck equation² that is valid for small |μH|. The drift coefficient ⟨H⟩_x is simply the average update. It is important to note that the diffusion coefficients, ⟨H_{j_1} H_{j_2}⟩_x, can be strongly dependent on location in the weight-space. This spatial dependence influences both equilibria and transient phenomena. In section 3.1 we will use both the Kolmogorov equation (2) and the Fokker-Planck equation to track the time evolution of network ensemble densities.

2.2 First Passage Times

Our discussion of basin hopping will use the notion of the first passage time (Gardiner, 1990): the time required for a network initialized at w_0 to first pass into an ε-neighborhood D of a global or local optimum w* (see Figure 1). The first passage time is a random variable.

¹However, Kushner (1987) has proved convergence to global optima for stochastic approximation algorithms with added Gaussian noise subject to logarithmic annealing schedules.
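This random variable can be sampled directly by Monte Carlo: run an ensemble of networks and record, for each, the first iteration at which the weight enters D. The 1-D update below (drift toward a target optimum plus exemplar noise) is a hypothetical toy, not one of the paper's networks:

```python
import random

def first_passage_times(n_nets, n_max, mu, w0, target, eps, seed=0):
    """For each network, record the first iteration n at which
    |w(n) - target| <= eps; networks that never arrive get None."""
    rng = random.Random(seed)
    # Hypothetical per-exemplar update: drift toward the target plus noise.
    H = lambda w, x: (target - w) + x
    times = []
    for _ in range(n_nets):
        w, hit = w0, None
        for n in range(1, n_max + 1):
            w += mu * H(w, rng.choice((-1.0, 1.0)))
            if abs(w - target) <= eps:
                hit = n
                break
        times.append(hit)
    return times

times = first_passage_times(n_nets=2000, n_max=500, mu=0.1,
                            w0=1.2, target=0.0, eps=0.1)
arrived = [t for t in times if t is not None]
mean_fpt = sum(arrived) / len(arrived)
```

A histogram of `arrived` is an empirical first passage time distribution; each recorded time is one draw of the first passage time.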
Its distribution function P(n; w_0) is the probability that a network initialized at w_0 makes its first passage into D at the n-th iteration of the learning rule.

Figure 1: Sample search path.

To arrive at an expression for P(n; w_0), we first examine the probability of passing from the initial weight w_0 to the weight w after n iterations. This probability can be expressed as

P(w, n | w_0, 0) = ∫ dw' P(w, n | w', 1) W(w_0 → w').    (5)

Substituting the single time-step transition probability (3) into the above expression, integrating over w', and making use of the time-shift invariance of the system,³ we find

P(w, n | w_0, 0) = ⟨ P(w, n−1 | w_0 + μH[w_0, x], 0) ⟩_x.    (6)

Next, let G(n; w_0) denote the probability that a network initialized at w_0 has not passed into the region D by the n-th iteration. We obtain G(n; w_0) by integrating P(w, n | w_0, 0) over weights w not in D:

G(n; w_0) = ∫_{D^c} dw P(w, n | w_0, 0)    (7)

²See (Ritter and Schulten 1988) and (Radons et al. 1990) for independent derivations.

³With our assumptions of a constant learning rate μ and stationary sample density p(x), the system is time-shift invariant. Mathematically stated, P(w, n | w', m) = P(w, n−1 | w', m−1).

where D^c is the complement of D. Substituting equation (6) into (7) and integrating over w we obtain the recursion

G(n; w_0) = ⟨ G(n−1; w_0 + μH[w_0, x]) ⟩_x.    (8)

Before any learning takes place, none of the networks in the ensemble have entered D. Thus the initial condition for G is

G(0; w_0) = 1,  w_0 ∈ D^c.    (9)

Networks that have entered D are removed from the ensemble (i.e. ∂D is an absorbing boundary). Thus G satisfies the boundary condition

G(n; w_0) = 0,  w_0 ∈ D.
(10)

Finally, the probability that the network has not passed into the region D on or before iteration n−1, minus the probability that it has not passed into D on or before iteration n, is simply the probability that the network passes into D exactly at iteration n. This is just the probability of first passage into D at time-step n. Thus

P(n; w_0) = G(n−1; w_0) − G(n; w_0).    (11)

The recursion (8) for G can be expanded in a power series in μ to obtain the backward Kramers-Moyal equation

G(n; w) − G(n−1; w) = Σ_{i=1}^∞ [μ^i / i!] Σ_{j_1,...,j_i=1}^m ⟨ H_{j_1} ··· H_{j_i} ⟩_x ∂^i G(n−1; w) / (∂w_{j_1} ··· ∂w_{j_i})    (12)

Truncation to second order in μ results in the backward Fokker-Planck equation. In section 3.2 we will use both the full recursion (8) and the Fokker-Planck approximation to (12) to predict basin hopping times in stochastic learning.

3 Backpropagation and Competitive Nets

We apply the above formalism to study the time evolution of the probability density for simple backpropagation and competitive learning problems. We give graphical examples of the time evolution of the weight-space density, and calculate times for passage from local to global optima.

3.1 Densities for the XOR Problem

Feed-forward networks trained to solve the XOR problem provide an example of supervised learning with well-characterized local optima (Lisboa and Perantonis, 1991). We use a 2-input, 2-hidden, 1-output network (9 weights) trained by stochastic gradient descent on the cross-entropy error function in Lisboa and Perantonis (1991). For computational tractability, we reduce the state-space dimension by constraining the search to one- or two-dimensional subspaces of the weight space. To provide global optima at finite weight values, the output targets are set to δ and 1−δ, with δ ≪ 1.
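The exact recursion (8), with initial condition (9), boundary condition (10), and the difference (11), can be iterated numerically on a grid. A minimal 1-D sketch, with a hypothetical drift, exemplar set, and absorbing region chosen only to exercise the recursion:

```python
def first_passage_distribution(grid, in_D, mu, H, exemplars, probs, n_max):
    """Iterate G(n; w0) = <G(n-1; w0 + mu*H[w0,x])>_x with G(0) = 1 outside
    the absorbing region D and G = 0 inside D; return the lists
    P(n) = G(n-1) - G(n) for every starting grid point."""
    lo, hi, m = grid[0], grid[-1], len(grid)

    def lookup(G, w):
        # Nearest-grid-point evaluation, clipped at the grid boundaries.
        j = round((w - lo) / (hi - lo) * (m - 1))
        return G[min(max(j, 0), m - 1)]

    G = [0.0 if in_D(w) else 1.0 for w in grid]
    passage = [[] for _ in grid]
    for _ in range(n_max):
        G_new = [0.0 if in_D(w) else
                 sum(p * lookup(G, w + mu * H(w, x))
                     for x, p in zip(exemplars, probs))
                 for w in grid]
        for i in range(m):
            passage[i].append(G[i] - G_new[i])
        G = G_new
    return passage

# Toy example: drift toward w = 0 plus exemplar noise; D = [-0.1, 0.1].
grid = [i / 50.0 for i in range(-100, 101)]
P = first_passage_distribution(grid, lambda w: abs(w) <= 0.1, 0.2,
                               lambda w, x: -w + 0.5 * x,
                               [-1.0, 1.0], [0.5, 0.5], n_max=400)
i0 = grid.index(1.0)          # distribution for networks initialized at w0 = 1
total_absorbed = sum(P[i0])   # mass absorbed within n_max iterations
```

Because the discretized operator is a positive mixture, the computed G is nonincreasing in n, so each P(n) entry is a valid (nonnegative) probability.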
Figure 2a shows the cost function evaluated along a line in the weight space. This line, parameterized by v, is chosen to pass through a global optimum at v = 0 and a local optimum at v = 1.0. In this one-dimensional slice, another local optimum occurs at v = 1.24. Figure 2b shows the evolution of P(v, n) obtained by numerical integration of the Fokker-Planck equation. Figure 2c shows the evolution of P(v, n) estimated by simulation of 10,000 networks, each receiving a different random sequence of the four input/target patterns. Initially the density is peaked about the local optimum at v = 1.24. At intermediate times, there is a spike of density at the local optimum at v = 1.0. This spike is narrow since the diffusion coefficient is small there. At late times the density collects at the global optimum. We note that for the learning rate used here, the local optimum at v = 1.24 is asymptotically stable under true gradient descent, and no escape would occur.

Figure 2: a) XOR cost function. b) Predicted density. c) Simulated density.

Figure 3 shows a series of snapshots of the density superimposed on the cost function for a 2-D slice through the XOR weight space. The first frame shows the weight evolution under true gradient descent. The weights are initialized at the upper right-hand corner of the frame, travel down the gradient, and settle into a local optimum. The remaining frames show the evolution of the density calculated by direct integration of the Kolmogorov equation (2). Here one sees an early spreading of the initial density and the ultimate concentration at the global optimum.
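For a finite exemplar set, direct integration of the Kolmogorov equation (2) reduces to pushing probability mass along each per-exemplar map w → w + μH(w, x_k) with weight p_k, since the kernel (3) is then a finite mixture of delta functions. A sketch under illustrative assumptions (1-D grid, nearest-point deposit, online mean estimation update; none of these choices are from the paper):

```python
def push_density(masses, grid, mu, H, exemplars, probs):
    """One step of P(w, n+1) = integral dw' P(w', n) W(w' -> w) for a finite
    exemplar set: the mass at each grid point w' moves to w' + mu*H(w', x_k)
    with probability p_k and is deposited at the nearest grid point."""
    new = [0.0] * len(grid)
    lo, hi, m = grid[0], grid[-1], len(grid)
    for w, mass in zip(grid, masses):
        if mass == 0.0:
            continue
        for x, p in zip(exemplars, probs):
            w_new = w + mu * H(w, x)
            j = round((w_new - lo) / (hi - lo) * (m - 1))
            new[min(max(j, 0), m - 1)] += p * mass
    return new

# Toy: online mean estimation, H(w, x) = x - w, exemplars {-1, +1}.
grid = [i / 100.0 for i in range(-200, 201)]   # uniform grid on [-2, 2]
masses = [0.0] * len(grid)
masses[0] = 1.0                                # density initially peaked at w = -2
for _ in range(200):
    masses = push_density(masses, grid, 0.1, lambda w, x: x - w,
                          [-1.0, 1.0], [0.5, 0.5])
mean_w = sum(wi * mi for wi, mi in zip(grid, masses))
```

Total probability is conserved exactly (each point's mass is split with weights summing to one), and the density drifts from the initial peak toward the mean of the exemplars and spreads under the intrinsic noise.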
3.2 Basin Hopping Times

The above examples graphically illustrate the intuitive notion that the noise inherent in stochastic learning can move the system out of local optima.⁴ In this section we calculate the statistical distribution of times required to pass between basins.

⁴The reader should not infer from these examples that stochastic update necessarily converges to global optima. It is straightforward to construct examples for which stochastic learning converges to local optima with probability one.

True Gradient Descent    Time = 0    Time = 10    Time = 28    Time = 34    Time = 100

Figure 3: Weight evolution for 2-D XOR. The density is superimposed on top of the cost function. The first frame shows the density using true gradient descent for all 100 timesteps. The remaining frames show the density at selected timesteps using stochastic descent.

3.2.1 Basin Hopping in Back-propagation

For the search direction used in the example of Figure 2, we calculated the distribution of times required for networks initialized at v = 1.2 to first pass within ε = 0.1 of the global optimum at v = 0.0. For this example we numerically integrated the backward Fokker-Planck equation. We verified the theoretical predictions by obtaining first passage times from an ensemble of 10,000 networks initialized at v = 1.2. See Figure 4. For this example the agreement is good at the small learning rate (μ = 0.025) used, but degrades for larger μ as higher-order terms in the expansion (12) become significant.

Figure 4: XOR problem. Simulated (histogram) and theoretical (solid line) distributions of first passage times for the cost function of Figure 2a.
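The contrast between batch stability and noise-induced escape can be reproduced in a self-contained toy: a tilted double-well cost whose local optimum is asymptotically stable under batch gradient descent, while per-exemplar updates (whose mean is the batch step) drive escape. The cost, noise model, and all constants below are illustrative assumptions, not the XOR network:

```python
import random

def grad(w):
    # Gradient of the tilted double-well cost E(w) = w^4/4 - w^2/2 + 0.1*w,
    # with a local minimum near w = +0.95 and the global minimum near w = -1.05.
    return w ** 3 - w + 0.1

mu, n_steps = 0.1, 4000

# Batch (true gradient) descent: deterministic, settles into the local basin.
w_batch = 0.8
for _ in range(n_steps):
    w_batch -= mu * grad(w_batch)

# Stochastic descent: per-exemplar update H(w, x) = -grad(w) + x with
# x in {-1, +1}, so the average update equals the batch gradient step.
rng = random.Random(1)
ensemble = [0.8] * 300
for _ in range(n_steps):
    ensemble = [w + mu * (-grad(w) + rng.choice((-1.0, 1.0)))
                for w in ensemble]

escaped = sum(w < 0.0 for w in ensemble) / len(ensemble)
```

The batch trajectory remains pinned at the local optimum, while a substantial fraction of the stochastic ensemble hops the barrier into the global basin; the hopping times themselves are distributed roughly as in Figure 4.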
When the Fokker-Planck approximation fails, results obtained from the exact expression (8) are in excellent agreement with experimental results. One such example is shown in Figure 5. Similar to Figure 2a, we have chosen a one-dimensional subspace of the XOR weight space (but in a different direction). Here, the Fokker-Planck solution is quite poor because the steepness of the cost function results in large contributions from higher-order terms in (12). As one would expect, the exact solution obtained using (8) agrees well with the simulations.

Figure 5: Second 1-D XOR example. a) Cost function. b) Simulated (histogram) and theoretical (lines) distributions of first passage times.

3.2.2 Basin Hopping in Competitive Learning

As a final example, we consider competitive learning with two 2-D weight vectors symmetrically placed about the center of a rectangle. Inputs are uniformly distributed in a rectangle of width 1.1 and height 1. This configuration has both global and local optima.

Figure 6a shows a sample path with weights started near the local optimum (crosses) and switching to hover around the global optimum. The measured and predicted (from numerical integration of (8)) distributions of times required to first pass within a distance ε = 0.1 of the global optimum are shown in Figure 6b.

Figure 6: Competitive learning. a) Data (small dots) and sample weight path (large dots).
b) First passage times.

4 Discussion

The dynamics of the time evolution of the weight-space probability density provides a direct handle on the performance of learning algorithms. This paper has focused on transient phenomena in stochastic learning with constant learning rate. The same theoretical framework can be used to analyze the asymptotic properties of stochastic search with decreasing learning rates, and to analyze equilibrium densities. For a discussion of the latter, see the companion paper in this volume (Leen and Moody 1993).

Acknowledgements

This work was supported under grants N00014-90-J-1349 and N00014-91-J-1482 from the Office of Naval Research.

References

E. Oja (1982), A simplified neuron model as a principal component analyzer. J. Math. Biology, 15:267-273.

Halbert White (1989), Learning in artificial neural networks: A statistical perspective. Neural Computation, 1:425-464.

H.J. Kushner (1987), Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math., 47:169-185.

Christian Darken and John Moody (1991), Note on learning rate schedules for stochastic optimization. In Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.

Todd K. Leen and Genevieve B. Orr (1992), Weight-space probability densities and convergence times for stochastic learning. In International Joint Conference on Neural Networks, pages IV 158-164. IEEE.

Todd K. Leen and John Moody (1993), Probability densities in stochastic learning: Dynamics and equilibria. In Giles, C.L., Hanson, S.J., and Cowan, J.D. (eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann Publishers.

H. Ritter and K.
Schulten (1988), Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection. Biol. Cybern., 60:59-71.

G. Radons, H.G. Schuster and D. Werner (1990), Fokker-Planck description of learning in backpropagation networks. International Neural Network Conference, Paris, II 993-996, Kluwer Academic Publishers.

C.W. Gardiner (1990), Handbook of Stochastic Methods, 2nd Ed. Springer-Verlag, Berlin.

P. Lisboa and S. Perantonis (1991), Complete solution of the local minima in the XOR problem. Network: Computation in Neural Systems, 2:119.
", "award": [], "sourceid": 637, "authors": [{"given_name": "Genevieve", "family_name": "Orr", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}