{"title": "Learning Theory and Experiments with Competitive Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 846, "page_last": 852, "abstract": null, "full_text": "Learning Theory and Experiments with Competitive Networks \n\nGriff L. Bilbro \nNorth Carolina State University \nBox 7914 \nRaleigh, NC 27695-7914 \n\nDavid E. Van den Bout \nNorth Carolina State University \nBox 7914 \nRaleigh, NC 27695-7914 \n\nAbstract \n\nWe apply the theory of Tishby, Levin, and Solla (TLS) to two problems. First we analyze an elementary problem for which we find the predictions consistent with conventional statistical results. Second we numerically examine the more realistic problem of training a competitive net to learn a probability density from samples. We find TLS useful for predicting average training behavior. \n\n1 TLS APPLIED TO LEARNING DENSITIES \n\nRecently a theory of learning has been constructed which describes the learning of a relation from examples (Tishby, Levin, and Solla, 1989), (Schwartz, Samalan, Solla, and Denker, 1990). The original derivation relies on a statistical mechanics treatment of the probability of independent events in a system with a specified average value of an additive error function. \n\nThe resulting theory is not restricted to learning relations and it is not essentially statistical mechanical. The TLS theory can be derived from the principle of maximum entropy, a general inference tool which produces probabilities characterized by certain values of the averages of specified functions (Jaynes, 1979). A TLS theory can be constructed whenever the specified function is additive and associated with independent examples. In this paper we treat the problem of learning a probability density from samples. 
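The maximum-entropy derivation behind the TLS form can be made explicit in a few lines (a sketch added here for clarity; the Lagrange-multiplier argument is standard and only summarized in the original):

```latex
\[
\max_{p}\; S[p] = -\int dz\, p(z|w)\,\ln p(z|w)
\quad \text{subject to} \quad
\int dz\, p(z|w) = 1, \qquad
\int dz\, p(z|w)\,\epsilon(z|w) = \langle\epsilon\rangle .
\]
Setting the variation of the Lagrangian to zero,
\[
\frac{\delta}{\delta p}\Big( S - \lambda_0 \int dz\, p - \beta \int dz\, p\,\epsilon \Big) = 0
\;\Longrightarrow\;
p(z|w) = \frac{1}{z(\beta)}\, e^{-\beta\,\epsilon(z|w)},
\qquad z(\beta) = \int dz\, e^{-\beta\,\epsilon(z|w)} ,
\]
```

which is the Gibbs form of Equation 1, with \beta the multiplier that enforces the average-error constraint.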
\nConsider the model as some function p(z|w) of fixed form and adjustable parameters w which are to be chosen to approximate \bar{p}(z), where the overline denotes the true density. All we know about \bar{p} are the elements of a training set T which are drawn from it. Define an error \epsilon(z|w). By the principle of maximum entropy, \n\np(z|w) = \frac{1}{z(\beta)}\, e^{-\beta\,\epsilon(z|w)} (1) \n\ncan be interpreted as the unique density which contains no other information except a specified value of the average error \n\n\langle\epsilon\rangle = \int dz\, p(z|w)\, \epsilon(z|w). (2) \n\nIn Equation 1, z(\beta) is a normalization that is assumed to be independent of the value of w; the parameter \beta is called the sensitivity and is adjusted so that the average error is equal to some \epsilon_T, the specified target error on the training set. We will use the convention that an integral operates on the entire expression that follows it. \n\nThe usual Bayes rule produces a density in w from p(z|w) and from a prior density p^{(0)}(w) which reflects at best a genuine prior probability or at least a restriction to the acceptable portion of the search space. Posterior to training on m certain examples, \n\np^{(m)}(w) = \frac{1}{Z_m}\, p^{(0)}(w) \prod_{i=1}^{m} p(z_i|w), (3) \n\nwhere Z_m is a normalization that depends on the particular set of examples as well as their number. In order to remove the effect of any particular set of examples, we can average this posterior density over all possible m examples, \n\n\langle p^{(m)}(w) \rangle_{z^{(m)}} = \left\langle \frac{1}{Z_m}\, p^{(0)}(w) \prod_{i=1}^{m} p(z_i|w) \right\rangle_{z^{(m)}}. (4) \n\nThis average posterior density models the expected density of nets or w after training. This distribution in w implies the following expected posterior density for a new example z_{m+1}: \n\np^{(m)}(z_{m+1}) = \int dw\, \langle p^{(m)}(w) \rangle_{z^{(m)}}\, p(z_{m+1}|w). (5) \n\nTLS compare this probability in z_{m+1} with the true target probability to obtain the Average Prediction Probability, or APP, after training, \n\n\mathrm{APP}(m) = \langle p^{(m)}(z_{m+1}) \rangle_{z^{(m)},\, z_{m+1}}, (6) \n\nthe average over both the training set z^{(m)} and an independent test example z_{m+1}. 
\n\nThe averages of Equations 4 and 6 are inconvenient to evaluate exactly because of the Z_m term in Equation 3. TLS propose an \"annealed approximation\" to APP in which the average of the ratio in Equation 4 is replaced by the ratio of the averages. Equation 6 becomes \n\nP(m) = \frac{\int dw\, p^{(0)}(w)\, g^{m+1}(w)}{\int dw\, p^{(0)}(w)\, g^{m}(w)}, (7) \n\nwhere \n\ng(w) = \int dz\, \bar{p}(z)\, p(z|w). (8) \n\nEquation 7 is well suited for theoretical analysis and is also convenient for numerical predictions. To apply Equation 7 numerically, we will produce Monte Carlo estimates for the moments of g that involve sampling p^{(0)}(w). If the dimension of w is larger than 50, it is preferable to histogram g rather than evaluate the moments directly. \n\n1.1 ANALYSIS OF AN ELEMENTARY EXAMPLE \n\nIn this section we theoretically analyze a learning problem with the TLS theory. We will study the adjustment of the mean of a Gaussian density to represent a finite number of samples. The utility of this elementary example is that it admits an analytic solution for the APP of the previous section. All the relevant integrals can be computed with the identity \n\n\int_{-\infty}^{\infty} dz\, \exp\left(-a_1(z-b_1)^2 - a_2(z-b_2)^2\right) = \sqrt{\frac{\pi}{a_1+a_2}}\, \exp\left(-\frac{a_1 a_2}{a_1+a_2}(b_1-b_2)^2\right). (9) \n\nWe take the true density to be a Gaussian of mean \bar{w} and variance 1/2\alpha, \n\n\bar{p}(z) = \sqrt{\frac{\alpha}{\pi}}\, e^{-\alpha(z-\bar{w})^2}. (10) \n\nWe model the prior density as a Gaussian with mean w_0 and variance 1/2\gamma, \n\np^{(0)}(w) = \sqrt{\frac{\gamma}{\pi}}\, e^{-\gamma(w-w_0)^2}. (11) \n\nWe choose the simplest error function, \n\n\epsilon(z|w) = (z-w)^2, (12) \n\nthe squared error between a sample z and the Gaussian \"model\" defined by its mean w, which is to become our estimate of \bar{w}. In Equation 1, this error function leads to \n\np(z|w) = \sqrt{\frac{\beta}{\pi}}\, e^{-\beta(z-w)^2}, (13) \n\nwith z(\beta) = \sqrt{\pi/\beta}, which is independent of w as assumed. 
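The Gaussian product identity of Equation 9 is easy to verify numerically; the following stand-alone check is a sketch added here (the parameter values are arbitrary):

```python
import math

def lhs(a1, b1, a2, b2, lo=-20.0, hi=20.0, n=200_000):
    """Midpoint-rule estimate of the integral of exp(-a1(z-b1)^2 - a2(z-b2)^2)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * h
        total += math.exp(-a1 * (z - b1) ** 2 - a2 * (z - b2) ** 2)
    return total * h

def rhs(a1, b1, a2, b2):
    """Closed form: sqrt(pi/(a1+a2)) * exp(-a1*a2/(a1+a2) * (b1-b2)^2)."""
    return math.sqrt(math.pi / (a1 + a2)) * math.exp(
        -a1 * a2 / (a1 + a2) * (b1 - b2) ** 2
    )

numeric = lhs(1.0, 0.3, 2.0, -0.5)
closed = rhs(1.0, 0.3, 2.0, -0.5)
print(numeric, closed)  # the two values agree to high precision
```

The integrand decays like a Gaussian, so the truncated midpoint rule converges rapidly and the two values match to many decimal places.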
We determine \beta by solving for the error on the training set to get \beta = \frac{1}{2\epsilon_T}. \n\nThe generalization, Equation 8, can now be evaluated with Equation 9: \n\ng(w) = \sqrt{\frac{\kappa}{\pi}}\, e^{-\kappa(w-\bar{w})^2}, (14) \n\nwhere \n\n\kappa = \frac{\alpha\beta}{\alpha+\beta} (15) \n\nis less than either \alpha or \beta. The denominator of Equation 7 becomes \n\n\left(\frac{\kappa}{\pi}\right)^{m/2} \sqrt{\frac{\gamma}{m\kappa+\gamma}}\, \exp\left(-\frac{m\kappa\gamma}{m\kappa+\gamma}(\bar{w}-w_0)^2\right), (16) \n\nwith a similar expression for the numerator. \n\nThe case of many examples or little prior knowledge is interesting. Consider Equations 7 and 16 in the limit m\kappa \gg \gamma: \n\nP(m) = \sqrt{\frac{\kappa}{\pi}}\, \sqrt{\frac{m}{m+1}}, (17) \n\nwhich climbs to an asymptotic value of \sqrt{\kappa/\pi} for m \to \infty. In order to compare this with intuition, consider that the sample mean of \{z_1, z_2, \ldots, z_m\} approaches \bar{w} to within a variance of 1/2m\alpha, so that \n\n\langle p^{(m)}(w) \rangle_{z^{(m)}} \approx \sqrt{\frac{m\alpha}{\pi}}\, e^{-m\alpha(w-\bar{w})^2}, (18) \n\nwhich makes Equation 6 agree with Equation 17 for large enough \beta. In this sense, the statistical mechanical theory of learning differs from conventional Bayesian estimation only in its choice of an unconventional performance criterion, the APP. \n\n2 GENERAL NUMERICAL PROCEDURE \n\nIn this section we apply the theory to the more realistic problem of learning a continuous probability density from a finite sample set. We can estimate the moments of Equation 7 by the following Monte Carlo procedure. Given a training set T = \{z_t\}_{t=1}^{|T|} drawn from the unknown density \bar{p} on domain X with finite volume V, an error function \epsilon(z|w), a training error \epsilon_T, and a prior density p^{(0)}(w) of vectors such that each w specifies a candidate function: \n\n1. Construct two sample sets: a prior set of P functions P = \{w_p\} drawn from p^{(0)}(w) and a set of U input vectors U = \{z_u\} drawn uniformly from X. For each p in the prior set, tabulate the error \epsilon_{up} = \epsilon(z_u|w_p) for every point in U and the error \epsilon_{tp} = \epsilon(z_t|w_p) for every point in T. 
\n\n2. Determine the sensitivity \beta by solving the equation \langle\epsilon\rangle = \epsilon_T, where \n\n\langle\epsilon\rangle = \frac{\sum_u \epsilon_{up}\, e^{-\beta\epsilon_{up}}}{\sum_u e^{-\beta\epsilon_{up}}}. (19) \n\n3. Estimate the average generalization of a given w_p from Equation 8: \n\ng_p = \frac{1}{|T|} \sum_t p(z_t|w_p). (20) \n\n4. The performance after m examples is the ratio of Equation 7. By construction P is drawn from p^{(0)}(w), so that \n\nP(m) \approx \frac{\sum_p g_p^{m+1}}{\sum_p g_p^{m}}. (21) \n\nFigure 1: Predicted APP versus number of training samples for a 20-neuron competitive network trained to various target errors, where the neuron weights were initialized from (a) a uniform density, (b) an antisymmetrically skewed density. (Both panels plot APP against Training Set Size, 0 to 100.) \n\n3 COMPETITIVE LEARNING NETS \n\nWe consider competitive learning nets (CLNs) because they are familiar and useful to us (Van den Bout and Miller, 1990), because there exist two widely known training strategies for CLNs (the neurons can learn either independently or under a global interaction called conscience (DeSieno, 1988)), and because CLNs can be applied to one-dimensional problems without being too trivial. Competitive learning nets with conscience qualitatively change their behavior when they are trained on finite sample sets containing fewer examples than neurons; except for that regime we found the theory satisfactory. All experiments in this section were conducted upon the following one-dimensional training density: \n\n\bar{p}(z) = \ldots \n\nFor m > 20 the shapes of the curves are almost unchanged, even though the vertical scale is different: saturation occurs at about the same value of m. 
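The four-step procedure of Section 2 can be sketched concretely on the Gaussian example of Section 1.1, where the large-m asymptote \sqrt{\kappa/\pi} of Equation 17 is known exactly. This is a minimal sketch with assumed parameter values, not the authors' code; for simplicity the sensitivity \beta is fixed directly rather than solved for from \langle\epsilon\rangle = \epsilon_T (Step 2), and g_p is estimated from the training samples as in Step 3:

```python
import math
import random

random.seed(0)

# Assumed parameters for the Gaussian toy problem (not taken from the paper).
alpha, gamma = 0.5, 0.1   # true-density variance 1/(2*alpha); prior variance 1/(2*gamma)
w_bar, w0 = 0.0, 1.0      # true mean and prior mean
beta = alpha              # sensitivity, fixed here instead of solving <eps> = eps_T
kappa = alpha * beta / (alpha + beta)

# Step 1: sample P candidate nets from the prior and |T| training points from p-bar.
P, T = 2000, 200
priors = [random.gauss(w0, math.sqrt(1.0 / (2.0 * gamma))) for _ in range(P)]
train = [random.gauss(w_bar, math.sqrt(1.0 / (2.0 * alpha))) for _ in range(T)]

def p_model(z, w):
    """Maximum-entropy model density p(z|w) = sqrt(beta/pi) * exp(-beta (z-w)^2)."""
    return math.sqrt(beta / math.pi) * math.exp(-beta * (z - w) ** 2)

# Step 3: estimate g(w_p) = integral of pbar(z) p(z|w_p) dz by a training-sample average.
g = [sum(p_model(z, w) for z in train) / T for w in priors]

def app(m):
    """Step 4: annealed APP as the ratio of moments of g (Equations 7 and 21)."""
    return sum(gp ** (m + 1) for gp in g) / sum(gp ** m for gp in g)

asymptote = math.sqrt(kappa / math.pi)
print(app(5), app(50), asymptote)  # app(m) climbs with m toward the analytic asymptote
```

With these parameters the estimate should saturate near \sqrt{\kappa/\pi} \approx 0.28, in line with Equation 17; the ratio of moments is monotone in m, which reproduces the climbing curves of Figure 1.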
Even when the prior greatly overrepresents poor nets, their effect on the prediction rapidly diminishes with training set size. This is important because in actual training, the effect of the initial configuration is also quickly lost. For m < 20 the predictions are not valid in any case, since our simple error function does not reflect the actual probability even approximately for m < k in these nets. It is for m < 20 where the only significant differences between the two families of curves occur. We have also been able to draw the same conclusions from less structured prior densities generated by assigning positive normalized random numbers to intervals of the domain. Moreover, we generally find that TLS predicts that about twice as many samples as neurons are needed to train competitive nets of other sizes. \n\n4 CONCLUSION \n\nTLS can be applied to learning densities as well as relations. We considered the effects of varying the number of examples, the target training error, and the choice of prior density. In these experiments on learning a density, as well as others dealing with learning a binary output (Bilbro and Snyder, 1990), a ternary output (Chow, Bilbro, and Yee, 1990), and a continuous output (Bilbro and Klenin, 1990), we find that if saturation occurs for m substantially less than the total number of available samples, say m < |T|/2, then m is a good predictor of sufficient training set size. Moreover there is evidence from a reformulation of the learning theory based on the grand canonical ensemble that supports this statistical approach (Klenin, 1990). \n\nReferences \n\nG. L. Bilbro and M. Klenin. (1990) Thermodynamic Models of Learning: Applications. Unpublished. \n\nG. L. Bilbro and W. E. Snyder. (1990) Learning theory, linear separability, and noisy data. CCSP-TR-90/7, Center for Communications and Signal Processing, Box 7914, Raleigh, NC 27695-7914. \n\nM. Y. 
Chow, G. L. Bilbro and S. O. Yee. (1990) Application of Learning Theory to Single-Phase Induction Motor Incipient Fault Detection Using Artificial Neural Networks. Submitted to International Journal of Neural Systems. \n\nD. DeSieno. (1988) Adding a conscience to competitive learning. In IEEE International Conference on Neural Networks, pages I:117-I:124. \n\nE. T. Jaynes. (1979) Where Do We Stand on Maximum Entropy? In R. D. Levine and M. Tribus (Eds.), Maximum Entropy Formalism, M.I.T. Press, Cambridge, pages 17-118. \n\nM. Klenin. (1990) Learning Models and Thermostatistics: A Description of Overtraining and Generalization Capacities. NETR-90/3, Center for Communications and Signal Processing, Neural Engineering Group, Box 7914, Raleigh, NC 27695-7914. \n\nD. B. Schwartz, V. K. Samalan, S. A. Solla & J. S. Denker. (1990) Exhaustive Learning. Neural Computation. \n\nN. Tishby, E. Levin, and S. A. Solla. (1989) Consistent inference of probabilities in layered networks: Predictions and generalization. IJCNN, IEEE, New York, pages II:403-410. \n\nD. E. Van den Bout and T. K. Miller III. (1990) TInMANN: The integer Markovian artificial neural network. Accepted for publication in the Journal of Parallel and Distributed Computing. \n", "award": [], "sourceid": 357, "authors": [{"given_name": "Griff", "family_name": "Bilbro", "institution": null}, {"given_name": "David", "family_name": "van den Bout", "institution": null}]}