{"title": "Maximum Likelihood Competitive Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 574, "page_last": 582, "abstract": null, "full_text": "574 \n\nNowlan \n\nMaximum Likelihood Competitive Learning \n\nSteven J. Nowlan1 \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, Canada \n\nM5S 1A4 \n\nABSTRACT \n\nCompetitive algorithms are one popular class of unsupervised algorithms. \nIn the traditional view of competition, only one competitor, \nthe winner, adapts for any given case. I propose to view competitive \nadaptation as attempting to fit a blend of simple probability \ngenerators (such as gaussians) to a set of data-points. The maximum \nlikelihood fit of a model of this type suggests a \"softer\" form \nof competition, in which all competitors adapt in proportion to \nthe relative probability that the input came from each competitor. \nI investigate one application of the soft competitive model, placement \nof radial basis function centers for function interpolation, and \nshow that the soft model can give better performance with little \nadditional computational cost. \n\n1 INTRODUCTION \nInterest in unsupervised learning has increased recently due to the application of \nmore sophisticated mathematical tools (Linsker, 1988; Plumbley and Fallside, 1988; \nSanger, 1989) and the success of several elegant simulations of large scale self-organization \n(Linsker, 1986; Kohonen, 1982). Competitive algorithms, one popular class of unsupervised \nalgorithms, have appeared as components in a \nvariety of systems (Von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1978). \nGeneralizing the definition of Rumelhart and Zipser (1986), a competitive adaptive \nsystem consists of a collection of modules which are structurally identical except, \npossibly, for random initial parameter variation. 
A set of rules is defined which \nallow the modules to compete in some way for the right to respond to some subset \nof the inputs. Typically a module is a single unit, but this need not be the case. \nOften, parameter restrictions are used to prevent \"uninteresting\" representations in \nwhich the entire set of input patterns is represented by one module. \n\n1The author is visiting the University of Toronto while completing a PhD at Carnegie Mellon University. \n\nMost of the work on competitive systems, especially within the neural network literature, \nhas focused on a fairly extreme form of competition in which only the winner \nof the competition for a particular case is updated. Variants on this theme are \nthe schemes in which, in addition to the winner, all of the losers are updated in \nsome uniform fashion2. Within the statistical pattern recognition literature (Duda \nand Hart, 1973; McLachlan and Basford, 1988) a rather different form of competition \nis frequently encountered. In this form, which will be referred to as \"soft\" \ncompetition, all competitors are updated but the amount of update is proportional \nto how well each competitor did in the competition for the current case. Under a \nstatistical model, this \"soft\" form of competition performs exact gradient descent \nin likelihood, while the more traditional winner-take-all, or \"hard\" competition, is \nan approximation to gradient descent in likelihood. \nIn this paper I demonstrate the superiority of \"soft\" competitive learning by comparing \n\"hard\" and \"soft\" algorithms in a classification application. 
The classification \nnetwork consists of a layer of Radial Basis Functions (RBF's) followed by a \nlayer of linear units which attempt to find a least mean square (LMS) fit to the \ndesired output function (Broomhead and Lowe, 1988; Lee and Kill, 1988; Niranjan \nand Fallside, 1988). A network of this type can form a smooth approximation to \nan arbitrary function, with the RBF centers serving as control points for fitting \nthe function (Keeler and Kowalski, 1989; Poggio and Girosi, 1989). A competitive \nlearning component adjusts the centers of the RBF's in an unsupervised fashion, \nbefore the weights to the output units are adapted. Comparisons of hard and soft \nalgorithms for placing the RBF's on a hand-drawn digit recognition problem and \na subset of a speaker independent vowel recognition problem suggest that the soft \nalgorithm is superior. Comparisons are also made with more traditional classifiers \non the same problems. \n\n2 COMPETITIVE PLACEMENT OF RBF'S \nRadial Basis Function networks have been shown to be quite effective for some tasks; \nhowever, a major limitation is that a very large number of RBF's may be required \nin high dimensional spaces. One method for using RBF's places the centers of the \nRBF's at the interstices of some coarse lattice defined over the input space (Broomhead \nand Lowe, 1988). If we assume the lattice is uniform with k divisions along \neach dimension, and the dimensionality of the input space is d, a uniform lattice \nwould require $k^d$ RBF's. This exponential growth makes the use of such a uniform \nlattice impractical for any high dimensional space. 
Another choice is to center the \nRBF's on the first n training samples, but this method is subject to sampling error, \nand a very large number of samples can be required to adequately represent the \ndistribution of inputs. This is particularly true in high dimensional spaces where it \nis extremely difficult to visualize the input distribution and determine whether the \ntraining examples adequately represent this distribution. \n\n2The feature maps of Kohonen (1982) are actually a special case in which a few units are \nadapted at once, however the units which are adapted in addition to the winner are selected by a \nneighbourhood function rather than by how well they represent the current data. \n\nMoody and Darken (1988) have suggested a method in which a much smaller number \nof RBF's are used, however the centers of these RBF's are allowed to adapt to the \ninput samples, so they learn to represent only the part of the input space actually \nrepresented by the data. The adaptive strategy also allows the center of each RBF \nto be determined by a large number of training samples, greatly reducing sampling \nerror. In their method, an unsupervised algorithm (a version of k-means) is used \nto select the centers of the RBF's and some ad hoc heuristics are suggested for \nadjusting the size of the RBF's to get a smooth interpolator. The weights from the \nhidden to the output layer are adapted to minimize a Least Mean Square (LMS) \ncriterion. Moody and Darken were able to attain performance levels equivalent to a \nmulti-layer Back Propagation network on a chaotic time series prediction task and \na vowel discrimination task. Significant savings in training time were also reported. \nThe k-means algorithm used by Moody and Darken can be easily reformulated as a \nform of competitive adaptation. In the basic k-means algorithm (Duda and Hart, \n1973) the training samples are first assigned to the class of the closest mean. 
The \nmeans are then recomputed as the average of the samples in their class. This two \nstep process is repeated until the means stop changing. This is simply the \"batch\" \nversion of a competitive learning scheme in which the activity of each competing \nunit is a decreasing function of the distance between its weight vector and the current input \nvector, and the winning unit on each case adapts by adding a portion of the current \ninput to its weight vector (with appropriate normalization). \nWe will now consider a statistical formalization of a competitive process for placing \nthe centers of RBF's. Let each competing unit represent a radially symmetric \n(spherical) gaussian probability distribution, with the weight vector of the unit, $\\vec{\\mu}_j$, \nrepresenting the center or mean of the gaussian. The probability that the gaussian \nassociated with unit j generated an input vector $\\vec{x}_k$ is \n\n$$P_j(\\vec{x}_k) = \\frac{1}{K\\sigma_j} e^{-\\frac{(\\vec{x}_k - \\vec{\\mu}_j)^2}{2\\sigma_j^2}} \\qquad (1)$$ \n\nwhere K is a normalization constant, and the covariance matrix is $\\sigma_j^2 I$. \nA collection of M such units is a model of the input distribution. The parameters \nof these M gaussians can be adjusted so that the overall average likelihood of generating \nthe training examples is maximized. The likelihood of generating a set of \nobservations $\\{\\vec{x}_1, \\vec{x}_2, \\ldots, \\vec{x}_n\\}$ from the current model is \n\n$$L = \\prod_k P(\\vec{x}_k) \\qquad (2)$$ \n\nwhere $P(\\vec{x}_k)$ is the probability of generating observation $\\vec{x}_k$ under the current model. \n(For mathematical convenience we usually work with log L.) If gaussian i is selected \nwith probability $\\pi_i$ and a sample is drawn from the selected gaussian, the probability \nof observing $\\vec{x}_k$ is \n\n$$P(\\vec{x}_k) = \\sum_{i=1}^{N} \\pi_i P_i(\\vec{x}_k) \\qquad (3)$$ \n\nwhere $P_i(\\vec{x}_k)$ is the probability of observing $\\vec{x}_k$ under gaussian distribution i. 
The \nsummation in (3) is awkward to work with, and frequently one of the $P_i(\\vec{x}_k)$ is much \nlarger than any of the others. Therefore, a convenient approximation for (3) is \n\n$$P(\\vec{x}_k) \\approx \\max_i \\pi_i P_i(\\vec{x}_k) \\qquad (4)$$ \n\nThis is equivalent to assigning all of the responsibility for an observation to the \ngaussian with the highest probability of generating that observation. This approximation \nis frequently referred to as the \"winner-take-all\" assumption. It may also be \nregarded as a \"hard\" competitive decision among the gaussians. When we use (3) \ndirectly, all of the gaussians share responsibility for each observation in proportion \nto their probability of generating the observation. This sharing of responsibility can \nbe regarded as a \"soft\" competitive decision among the gaussians. \nThe maximum likelihood estimate for the mean of each gaussian in our model can \nbe found by evaluating $\\partial \\log L / \\partial \\vec{\\mu}_j = 0$. We will consider a simple model in which \nwe assume that $\\pi_j$ and $\\sigma_j$ are the same for all of the gaussians, and compare the \nhard and soft estimates for $\\vec{\\mu}_j$. \nWith the hard approximation, substituting (4) in (2), the maximum likelihood \nestimate of $\\vec{\\mu}_j$ has the simple form \n\n$$\\vec{\\mu}_j = \\frac{\\sum_{k \\in C_j} \\vec{x}_k}{N_j} \\qquad (5)$$ \n\nwhere $C_j$ is the set of cases closest to gaussian j, and $N_j$ is the size of this set. This \nis identical to the expression for $\\vec{\\mu}_j$ in the k-means algorithm. \nRather than using the approximation in (4) we can find the exact maximum likelihood \nestimates for $\\vec{\\mu}_j$ by substituting (3) in (2). The estimate for the mean is \nnow \n\n$$\\vec{\\mu}_j = \\frac{\\sum_k p(j|\\vec{x}_k) \\vec{x}_k}{\\sum_k p(j|\\vec{x}_k)} \\qquad (6)$$ \n\nwhere $p(j|\\vec{x}_k)$ is the probability, given that we have observed $\\vec{x}_k$, of gaussian j \nhaving generated $\\vec{x}_k$. For the simple model used here \n\n$$p(j|\\vec{x}_k) = \\frac{P_j(\\vec{x}_k)}{\\sum_i P_i(\\vec{x}_k)}$$ \n\nComparing (6) and (5), the hard competitive model uses the average of the cases \nunit j is closest to in recomputing its mean, while the soft competitive model uses \nthe average of all the cases weighted by $p(j|\\vec{x}_k)$. 
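The contrast between the hard and soft estimates of the means can be sketched in a few lines of Python. This is only an illustrative batch version under the simple model (equal mixing proportions and one shared width for all gaussians); the function names hard_update, soft_update and responsibilities, and the parameter sigma, are assumptions introduced here and do not come from the paper.

```python
import math

def sq_dist(x, mu):
    # Squared Euclidean distance between an input vector and a mean.
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def responsibilities(x, means, sigma):
    # p(j|x) for equal priors and a shared width sigma (an assumption):
    # each gaussian's density divided by the sum over all gaussians.
    d = [sq_dist(x, mu) for mu in means]
    d_min = min(d)  # shift by the smallest distance to avoid underflow
    w = [math.exp(-(dj - d_min) / (2.0 * sigma ** 2)) for dj in d]
    total = sum(w)
    return [wj / total for wj in w]

def hard_update(data, means):
    # Hard (winner-take-all) estimate: each mean becomes the plain
    # average of the cases it is closest to, as in batch k-means.
    dim = len(means[0])
    sums = [[0.0] * dim for _ in means]
    counts = [0] * len(means)
    for x in data:
        j = min(range(len(means)), key=lambda i: sq_dist(x, means[i]))
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += x[t]
    # A unit that wins no case keeps its old mean.
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(means[j])
            for j in range(len(means))]

def soft_update(data, means, sigma):
    # Soft estimate: each mean becomes the average of all cases,
    # weighted by the responsibility p(j|x) of that unit for each case.
    dim = len(means[0])
    sums = [[0.0] * dim for _ in means]
    weights = [0.0] * len(means)
    for x in data:
        r = responsibilities(x, means, sigma)
        for j, rj in enumerate(r):
            weights[j] += rj
            for t in range(dim):
                sums[j][t] += rj * x[t]
    return [[s / weights[j] for s in sums[j]] if weights[j] > 0.0 else list(means[j])
            for j in range(len(means))]

data = [[0.0], [1.0], [10.0], [11.0]]
means = [[0.0], [10.0]]
hard_means = hard_update(data, means)
soft_narrow = soft_update(data, means, sigma=1.0)
soft_wide = soft_update(data, means, sigma=100.0)
```

With well separated clusters and a narrow width the two updates nearly coincide, because one responsibility dominates for every case. With a very wide shared width the responsibilities become almost uniform and both soft means drift toward the overall data average, which is the regime where the winner-take-all approximation is poor.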
 \nWe can use either the approximate or exact likelihood algorithm to position the \nRBF's in an interpolation network. If $\\vec{x}_k$ is the current input, each RBF unit \ncomputes $P_j(\\vec{x}_k)$ as its output activation $a_j$. For the hard competitive model, a \nwinner-take-all operation then sets $a_j = 1$ for the most active unit and $a_i = 0$ \nfor all other units. Only the winning unit will update its mean vector, and for \nthis update we use the iterative version of (5). In the soft competitive model we \nnormalize each $a_j$ by dividing it by the sum of $a_i$ over all RBF's. In this case the \nmean vectors of all of the hidden units are updated according to the iterative version \nof (6). The computational cost difference between the winner-take-all operation in \nthe hard model and the normalization in the soft model is negligible; however, if the \nalgorithms are implemented sequentially, the soft model requires more computation \nbecause all of the means, rather than just the mean of the winner, are updated for \neach case. \nThe two models described in this section are easily extended to allow each spherical \ngaussian to have a different variance $\\sigma_j^2$. The activation of each RBF unit is \nnow a function of $(\\vec{x}_k - \\vec{\\mu}_j)/\\sigma_j$, but the expressions for the maximum likelihood \nestimates of $\\vec{\\mu}_j$ are the same. Expressions for updating $\\sigma_j^2$ can be found by solving \n$\\partial \\log L / \\partial \\sigma_j^2 = 0$. Some simulations have also been performed with a network \nin which each RBF had a diagonal covariance matrix, and each of the d variance \ncomponents was estimated separately (Nowlan, 1990). \n\n3 APPLICATION TO TWO CLASSIFICATION TASKS \nThe architecture described above was used for a digit classification and a vowel \ndiscrimination task. The networks were trained by first using the soft or hard \ncompetitive algorithm to determine the means and variances of the RBF's, and, \nonce these were learned, then training the output layer of weights. 
The weights \nfrom the RBF's to the output layer were trained using a recursive least squares \nalgorithm, allowing an exact LMS solution to be found with one pass through the \ntraining set. (A target of +1 was used for the correct output category and -1 \nfor all of the other categories.) For the hard competitive model the unnormalized \nprobabilities $P_j(\\vec{x})$ were used as the RBF unit outputs, while the soft competitive \nmodel used the normalized probabilities $p(j|\\vec{x})$. \nThe first task required the classification of a set of hand drawn digits from 12 \nsubjects. There were 480 input patterns, divided into 320 training patterns and \n160 testing patterns, with examples from all subjects in both groups. Each pattern \nwas digitized on a 16 by 16 grid. These 256 dimensional binary vectors were used \nas input to the classification network, and there were 10 output units. \nNetworks with 40 and 150 spherical gaussians were simulated. Both hard and soft \nalgorithms were used with all configurations. The performance of these networks \non the testing set is summarized in Table 1. This table also contains performance \nresults for a multi-layer back propagation network, a two layer linear network, and \na nearest neighbour classifier on the same task. The nearest neighbour classifier \nused all 320 labeled training samples and based its decision on the class of the \nnearest neighbour only3. \n\nType of Classifier          % Correct on Test Set \n40 Sph. Gauss. - Hard       87.6% \n40 Sph. Gauss. - Soft       91.8% \n150 Sph. Gauss. - Hard      90.1% \n150 Sph. Gauss. - Soft      94.0% \nLayered BP Net              94.5% \nLinear Net                  60.0% \nNearest Neighbour           83.1% \n\nTable 1: Summary of Performance for Digit Classification \n\nThe relatively poor performance of the nearest neighbour \nclassifier is one indication of the difficulty of this task. The two layer linear network \nwas trained with a recursive least squares algorithm4. 
The back propagation network \nwas developed specifically for this task (le Cun, 1987), and used a specialized \narchitecture with three layers of hidden units, localized receptive fields, and weight \nsharing to reduce the number of free parameters in the system. \nTable 1 reveals that the networks trained using the soft competitive algorithm \nto determine means and variances of the RBF's were superior in performance to \nidentical networks trained with the hard competitive algorithm. The RBF network \nusing 150 spherical gaussians was able to equal the performance level of the sophisticated \nback propagation network, and a network with 40 spherical RBF's performed \nconsiderably better than the nearest neighbour classifier. \nThe second task was a speaker independent vowel recognition task. The data consisted \nof a digitized version of the first and second formant frequencies of 10 vowels \nfor multiple male and female speakers (Peterson and Barney, 1952). Moody and \nDarken (1988) have previously applied to this data an architecture which is very \nsimilar to the one suggested here, and Huang and Lippmann (1988) have compared \nthe performance of a number of different classifiers on this same data. More recently, \nBridle (1989) has applied a supervised algorithm which uses a \"softmax\" \noutput function to this data. This softmax function is very similar to the equation \nfor $p(j|\\vec{x}_k)$ used in the soft competitive model. The results from these studies \nare included in Table 2 along with the results for RBF networks using both hard \nand soft competition to determine the RBF parameters. All of the classifiers were \ntrained on a set of 338 examples and tested on a separate set of 333 examples. \nAs with the digit classification task, the RBF networks trained using the soft adaptive \nprocedure show uniformly better performance than equivalent networks trained \nusing the hard adaptive procedure. 
The results obtained for the hard adaptive procedure \nwith 20 and 100 spherical gaussians are very close to Moody and Darken's \nresults, which is expected since the procedures are identical except for the manner \nin which the variances are obtained. Table 2 also reveals that the RBF network \nwith 100 spherical gaussians, trained with the soft adaptive procedure, performed \nbetter than any of the other classifiers that have been applied to this data. \n\nType of Classifier                       % Correct on Test Set \n20 Sph. Gauss. - Hard                    75.1% \n20 Sph. Gauss. - Soft                    82.6% \n100 Sph. Gauss. - Hard                   82.6% \n100 Sph. Gauss. - Soft                   87.1% \n20 RBF's (Moody et al)                   73.3% \n100 RBF's (Moody et al)                  82.0% \nK Nearest Neighbours (Lippmann et al)    82.0% \nGaussian Classifier (Lippmann et al)     79.7% \n2 Layer BP Net (Lippmann et al)          80.2% \nFeature Map (Lippmann et al)             77.2% \n2 Layer Softmax (Bridle)                 78.0% \n\nTable 2: Summary of Performance for Vowel Classification \n\n3Two, three, and five nearest neighbour classifiers were also tried, but they all performed worse \nthan nearest neighbour. \n\n4This network was included to show that the linear layer is not doing all of the work in the \nhybrid RBF networks. \n\n4 DISCUSSION \nThe simulations reported in the previous section provide strong evidence that the \nexact maximum likelihood (or soft) approach to determining the centers and sizes of \nRBF's leads to better classification performance than the winner-take-all approximation. \nIn both tasks, for a variety of numbers of RBF's, the exact maximum \nlikelihood approach outperformed the approximate method. Comparing (5) and \n(6) reveals that this improved performance can be obtained with little additional \ncomputational burden. 
 \nThe performance of the RBF networks on these two classification tasks also shows \nthat hybrid approaches which combine unsupervised and supervised procedures are \ncapable of competent levels of performance on difficult problems. In the digit classification \ntask the hybrid RBF network was able to equal the performance level of \na sophisticated multi-layer supervised network, while in the vowel recognition task \nthe hybrid network obtained the best performance level of any of the classification \nnetworks. One reason why the hybrid model is interesting is that since the hidden \nunit representation is independent of the classification task, it may be used \nfor many different tasks without interference between the tasks. (This is actually \ndemonstrated in the simulations described, since each category in the two tasks can \nbe regarded as a separate classification problem.) Even if we are only interested in \nusing the network for one task, there are still advantages to the hybrid approach. \nIn many domains, such as speech, unlabeled samples can be obtained much more \ncheaply than labeled samples. To avoid over-fitting, the amount of training data \nmust generally be considerably greater than the number of free parameters in the \nmodel. In the hybrid models, especially in high dimensional input spaces, most of \nthe parameters are in the unsupervised part of the model5. The unsupervised stage \nmay be trained with a large body of unlabeled samples, and a much smaller body \nof labeled samples can be used to train the output layer. \nThe performance on the digit classification task also shows that RBF networks can \ndeal effectively with tasks with high (256) dimensional input spaces and highly \nnon-gaussian input distributions. 
The competitive network was able to succeed on \nthis task with a relatively small number of RBF's because the data was actually \ndistributed over a much lower dimensional subspace of the input space. The soft \ncompetitive network automatically concentrates its representation on this subspace, \nand in this fashion performs a type of implicit dimensionality reduction. Moody \n(1989) has also mentioned this type of dimensionality reduction as a factor in the \nsuccess of some of the models he has worked with. \nThe success of the soft adaptive strategy in these interpolation networks encourages \none to extend the soft interpretation in other directions. The feature maps of \nKohonen (1982) incorporate a hard competitive process, and a soft version of the \nfeature map algorithm could be developed. In addition, there is a class of decision-directed, \nor \"bootstrap\", learning algorithms which use their own outputs to provide \na training signal. These algorithms can be regarded as hard competitive processes, \nand new algorithms which use the soft assumption may be developed from the \nbootstrap procedure (Nowlan and Hinton, 1989). Bridle (1989) has suggested a \ndifferent type of output unit for supervised networks, which incorporates the idea \nof a \"softmax\" type of competition. Finally, the maximum likelihood approach is \neasily extended to non-gaussian models, and one model of particular interest would \nbe the Boltzmann machine. \n\nAcknowledgements \n\nI would like to thank Richard Lippmann of Lincoln Laboratories and John Moody of Yale University \nfor making the vowel formant data available to me. I would also like to thank Geoff Hinton, \nand the members of the Connectionist Research Group of the University of Toronto, for many \nhelpful comments and suggestions while conducting this research and preparing this paper. \n\nReferences \nBridle, J. (1989). 
Probabilistic interpretation of feedforward classification network outputs, with \nrelationships to statistical pattern recognition. In Fougelman-Soulie, F. and Herault, J., \neditors, Neuro-computing: algorithms, architectures and applications. Springer-Verlag. \n\nBroomhead, D. and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. \nComplex Systems, 2:321-355. \n\nDuda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley and Sons. \n\nFukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological \nCybernetics, 20:121-136. \n\n5In the digit task, there are over 25 times as many parameters in the unsupervised part of the \nnetwork as there are in the supervised part. \n\nGrossberg, S. (1978). A theory of visual coding, memory, and development. In Formal theories of \nvisual perception. John Wiley and Sons, New York. \n\nHuang, W. and Lippmann, R. (1988). Neural net and traditional classifiers. In Anderson, D., \neditor, Neural Information Processing Systems. American Institute of Physics. \n\nKeeler, E. H. J. and Kowalski, J. (1989). Layered neural networks with gaussian hidden units as \nuniversal approximators. MCC Technical Report ACT-ST-272-89, MCC. \n\nKohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological \nCybernetics, 43:59-69. \n\nle Cun, Y. (1987). Modèles Connexionnistes de l'Apprentissage. PhD thesis, Université Pierre et \nMarie Curie, Paris, France. \n\nLee, S. and Kill, R. (1988). Multilayer feedforward potential function networks. In Proceedings \nIEEE Second International Conference on Neural Networks, page 1:161, San Diego, California. \n\nLinsker, R. (1986). From basic network principles to neural architecture: Emergence of spatial \nopponent cells. Proceedings of the National Academy of Sciences USA, 83:7508-7512. \n\nLinsker, R. (1988). Self-organization in a perceptual network. 
IEEE Computer Society, pages \n105-117. \n\nMcLachlan, G. and Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. \nMarcel Dekker, New York. \n\nMoody, J. (1989). Fast learning in multi-resolution hierarchies. Technical Report \nYALEU/DCS/RR-681, Yale University. \n\nMoody, J. and Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, \nG. Hinton, T. S., editor, Proceedings of the 1988 Connectionist Models Summer School, \npages 133-143. Morgan Kaufmann. \n\nNiranjan, M. and Fallside, F. (1988). Neural networks and radial basis functions in classifying static \nspeech patterns. Technical Report CUED/F-INFENG/TR22, Engineering Dept., Cambridge \nUniversity. To appear in Computer Speech and Language. \n\nNowlan, S. (1990). Maximum likelihood competition in RBF networks. Technical Report \nCRG-TR-90-2, University of Toronto Connectionist Research Group. \n\nNowlan, S. and Hinton, G. (1989). Maximum likelihood decision-directed adaptive equalization. \nTechnical Report CRG-TR-89-8, University of Toronto Connectionist Research Group. \n\nPeterson, G. and Barney, H. (1952). Control methods used in a study of vowels. The Journal of \nthe Acoustical Society of America, 24:175-184. \n\nPlumbley, M. and Fallside, F. (1988). An information theoretic approach to unsupervised connectionist \nmodels. In D. Touretzky, G. Hinton, T. S., editor, Proceedings of the 1988 Connectionist \nModels Summer School, pages 239-245. Morgan Kaufmann. \n\nPoggio, G. and Girosi, F. (1989). A theory of networks for approximation and learning. A.I. Memo \n1140, MIT. \n\nRumelhart, D. E. and Zipser, D. (1986). Feature discovery by competitive learning. In Parallel \ndistributed processing: Explorations in the microstructure of cognition, volume I. Bradford \nBooks, Cambridge, MA. \n\nSanger, T. (1989). An optimality principle for unsupervised learning. In Touretzky, D., editor, \nAdvances 
in Neural Information Processing Systems 1, pages 11-19. Morgan Kaufmann. \n\nVon der Malsburg, C. (1973). Self-organization of orientation sensitive cells in striate cortex. \nKybernetik, 14:85-100. \n\n\f", "award": [], "sourceid": 225, "authors": [{"given_name": "Steven", "family_name": "Nowlan", "institution": null}]}