{"title": "Classifying with Gaussian Mixtures and Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 681, "page_last": 688, "abstract": null, "full_text": "Classifying with Gaussian Mixtures and \n\nClusters \n\nNanda Kambhatla and Todd K. Leen \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \n\nP.O. Box 91000 Portland, OR 97291-1000 \n\nnanda@cse.ogi.edu, tleen@cse.ogi.edu \n\nAbstract \n\nIn this paper, we derive classifiers which are winner-take-all (WTA) \napproximations to a Bayes classifier with Gaussian mixtures for \nclass conditional densities. The derived classifiers include clustering \nbased algorithms like LVQ and k-Means. We propose a constrained \nrank Gaussian mixtures model and derive a WTA algorithm for it. \nOur experiments with two speech classification tasks indicate that \nthe constrained rank model and the WTA approximations improve \nthe performance over the unconstrained models. \n\n1 \n\nIntroduction \n\nA classifier assigns vectors from Rn (n dimensional feature space) to one of K \nclasses, partitioning the feature space into a set of K disjoint regions. A Bayesian \nclassifier builds the partition based on a model of the class conditional probability \ndensities of the inputs (the partition is optimal for the given model). \nIn this paper, we assume that the class conditional densities are modeled by mixtures \nof Gaussians. Based on Nowlan's work relating Gaussian mixtures and clustering \n(Nowlan 1991), we derive winner-take-all (WTA) algorithms which approximate a \nGaussian mixtures Bayes classifier. We also show the relationship of these algo(cid:173)\nrithms to non-Bayesian cluster-based techniques like LVQ and k-Means. \nThe main problem with using Gaussian mixtures (or WTA algorithms thereof) is the \nexplosion in the number of parameters with the input dimensionality. We propose \n\n\f682 \n\nNanda Kambhatla. Todd K. 
Leen \n\na constrained rank Gaussian mixtures model for classification. Constraining the \nrank of the Gaussians reduces the effective number of model parameters thereby \nregularizing the model. We present the model and derive a WTA algorithm for it. \nFinally, we compare the performance of the different mixture models discussed in \nthis paper for two speech classification tasks. \n\n2 Gaussian Mixture Bayes (GMB) classifiers \n\nLet x denote the feature vector (x E 'Rn ), and {n I , I = 1, ... , K} denote the \nclasses. Class priors are denoted p(nI) and the class-conditional densities are de(cid:173)\nnoted p(x I nI). The discriminant function for the Bayes classifier is \n\n(1) \nAn input feature vector x is assigned to class I if 6I(x) > 6J (x) 'VJ:F I . Given the \nclass conditional densities, this choice minimizes the classification error rate (Duda \nand Hart 1973). \nWe model each class conditional density by a mixture composed of QI component \nGaussians. The Bayes discriminant function (see Figure 1) becomes \n\nQI \nj'(x)=PW)L \n\nI \n\n\"j,foiTlexp[-~(X-~J{~J-\\x-~J)], \n\n(2) \n\n;=1 (21r)n/2 \n\nI~fl \n\nwhere ~J and ~f are the mean and the covariance matrix of the lh mixture com(cid:173)\nponent for nl. \n\n0.12 \n\n0.1 \n\n0.08 \n\n0.06 \n0.04 \n0.02 \n\n5 \n\n20 \n10 \n--~)o X \n\n25 \n\nFig. 1: Figure showing the decision rule of a GMB classifier for a two class problem \nwith one input feature. The horizontal axis represents the feature and the vertical axis \nrepresents the Bayes discriminant functions. In this example, the class conditional densities \nare modelled as a mixture of two Gaussians and equal priors are assumed. \n\nTo implement the Gaussian mixture Bayes classifier (GMB) we first separate the \ntraining data into the different classes. 
We then use the EM algorithm (Dempster et al. 1977, Nowlan 1991) to determine the parameters of the Gaussian mixture density for each class. \n\n3 Winner-take-all approximations to GMB classifiers \n\nIn this section, we derive winner-take-all (WTA) approximations to GMB classifiers. We also show the relationship of these algorithms to non-Bayesian cluster-based techniques like LVQ and k-Means. \n\n3.1 The WTA model for GMB \n\nThe WTA assumptions (relating hard clustering to Gaussian mixtures; see (Nowlan 1991)) are: \n\n\u2022 p(x | Omega_I) are mixtures of Gaussians as in (2). \n\u2022 The summation in (2) is dominated by the largest term. This is \"equivalent to assigning all of the responsibility for an observation to the Gaussian with the highest probability of generating that observation\" (Nowlan 1991). \n\nTo draw the relation between GMB and cluster-based classifiers, we further assume that: \n\n\u2022 The mixing proportions (alpha_j^I) are equal for a given class. \n\u2022 The number of mixture components Q_I is proportional to p(Omega_I). \n\nApplying all the above assumptions to (2), taking logs and discarding the terms that are identical for each class, we get the discriminant function \n\ndelta_I(x) = - min_{j=1}^{Q_I} [ (1/2) log |Sigma_j^I| + (1/2) (x - mu_j^I)^T (Sigma_j^I)^{-1} (x - mu_j^I) ]. (3) \n\nThe discriminant function (3) suggests an algorithm that approximates the Bayes classifier. We segregate the feature vectors by class and then train a separate vector quantizer (VQ) for each class. We then compute the means mu_j^I and the covariance matrices Sigma_j^I for each Voronoi cell of each quantizer, and use (3) for classifying new patterns. We call this algorithm VQ-Covariance. Note that this algorithm does not do a maximum likelihood estimation of its parameters based on the probability model used to derive (3). The probability model is only used to classify patterns. 
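The VQ-Covariance decision rule of (3) can be sketched in a few lines; the following is a minimal illustration (our own, not the authors' code), assuming the cluster means and covariance matrices for each class have already been estimated, with all function names hypothetical: \n\n```python
import numpy as np

def wta_discriminant(x, means, covs):
    # Eq. (3): the negated cost of the best (winning) Gaussian
    # component for one class, given that class's cluster means
    # and covariance matrices.
    scores = []
    for mu, sigma in zip(means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(sigma)          # log |Sigma_j|
        mahal = diff @ np.linalg.solve(sigma, diff)   # Mahalanobis distance
        scores.append(0.5 * logdet + 0.5 * mahal)
    return -min(scores)

def classify(x, class_params):
    # Assign x to the class with the largest WTA discriminant.
    # class_params maps class label -> (means, covs).
    return max(class_params, key=lambda c: wta_discriminant(x, *class_params[c]))
```
\nFor example, with one unit-covariance cluster per class centered at the origin and at (5, 5), a point near the origin is assigned to the first class. \n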
\n\n3.2 The relation to LVQ and k-Means \n\nFurther assume that for each class, the mixture components are spherically symmetric with covariance matrix Sigma_j^I = sigma^2 I, with sigma^2 identical for all classes. We obtain the discriminant function \n\ndelta_I(x) = - min_{j=1}^{Q_I} || x - mu_j^I ||^2. (4) \n\nThis is exactly the discriminant function used by the learning vector quantizer (LVQ; Kohonen 1989) algorithm. Though LVQ employs a discriminatory training procedure (i.e., it directly learns the class boundaries and does not explicitly build a separate model for each class), the implicit model of the class conditional densities used by LVQ corresponds to a GMB model under all the assumptions listed above. This is also the implicit model underlying any classifier which makes its classification decision based on the Euclidean distance between a feature vector and a set of prototype vectors (e.g., a k-Means clustering followed by classification based on (4)). \n\n4 Constrained rank GMB classifiers \n\nIn the preceding sections, we have presented a GMB classifier and some WTA approximations to GMB. Mixture models such as GMB generally have too many parameters for small data sets. In this section, we propose a way of regularizing the mixture densities and derive a WTA classifier for the regularized model. \n\n4.1 The constrained rank model \n\nIn section 2, we assumed that the class conditional densities of the feature vectors x are mixtures of Gaussians \n\np(x | Omega_I) = sum_{j=1}^{Q_I} [ alpha_j^I / ((2 pi)^{n/2} prod_{i=1}^{n} (lambda_{ji}^I)^{1/2}) ] exp[ -(1/2) sum_{i=1}^{n} (e_{ji}^{I T} (x - mu_j^I))^2 / lambda_{ji}^I ], (5) \n\nwhere mu_j^I and Sigma_j^I are the means and covariance matrices for the jth component Gaussian, and e_{ji}^I and lambda_{ji}^I are the orthonormal eigenvectors and eigenvalues of Sigma_j^I (ordered such that lambda_{j1}^I >= ... >= lambda_{jn}^I). In (5), we have written the Mahalanobis distance in terms of the eigenvectors. 
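The eigenvector form of the Mahalanobis distance used in (5) can be checked numerically; the following small sketch (our own illustration, with an arbitrary random covariance) confirms that the direct and eigen-decomposed forms agree: \n\n```python
import numpy as np

# An arbitrary symmetric positive-definite covariance matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
sigma = A @ A.T + 4 * np.eye(4)

x = rng.standard_normal(4)
mu = rng.standard_normal(4)
diff = x - mu

# Direct Mahalanobis distance using the inverse covariance.
d_direct = diff @ np.linalg.solve(sigma, diff)

# Eigen form: sum over i of (e_i . (x - mu))^2 / lambda_i,
# as in Eq. (5) with all n eigen-directions retained.
eigvals, eigvecs = np.linalg.eigh(sigma)
d_eigen = np.sum((eigvecs.T @ diff) ** 2 / eigvals)

assert np.allclose(d_direct, d_eigen)
```
\nThe division by the eigenvalues is what makes the distance so sensitive to the trailing (small-variance) eigen-directions, which motivates the constrained rank model below. \n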
\n\nFor a particular data point x, the Mahalanobis distance is very sensitive to changes in the squared projections onto the trailing eigen-directions, since the variances are very small in these directions. This is a potential problem with small data sets. When there are insufficient data points to estimate all the parameters of the mixture density accurately, the trailing eigen-directions and their associated eigenvalues are likely to be poorly estimated. Using the Mahalanobis distance in (5) can lead to erroneous results in such cases. \n\nWe propose a method for regularizing Gaussian mixture classifiers based on the above ideas. We assume that the trailing n - m eigen-directions of each Gaussian component are inaccurate due to overfitting to the training set. We rewrite the class conditional densities (5) retaining only the leading m (0 < m <= n) eigen-directions in the determinants and the Mahalanobis distances: \n\np(x | Omega_I) = sum_{j=1}^{Q_I} [ alpha_j^I / ((2 pi)^{m/2} prod_{i=1}^{m} (lambda_{ji}^I)^{1/2}) ] exp[ -(1/2) (x - mu_j^I)^T ( sum_{i=1}^{m} e_{ji}^I e_{ji}^{I T} / lambda_{ji}^I ) (x - mu_j^I) ]. (6) \n\nWe choose the value of m (the reduced rank) by cross-validation over a separate validation set. Thus, our model can be considered to be regularizing or constraining the class conditional mixture densities. \n\nIf we apply the above model and derive the Bayes discriminant functions (1), we get \n\ndelta_I(x) = p(Omega_I) sum_{j=1}^{Q_I} [ alpha_j^I / ((2 pi)^{m/2} prod_{i=1}^{m} (lambda_{ji}^I)^{1/2}) ] exp[ -(1/2) (x - mu_j^I)^T ( sum_{i=1}^{m} e_{ji}^I e_{ji}^{I T} / lambda_{ji}^I ) (x - mu_j^I) ]. (7) \n\nWe can implement a constrained rank Gaussian mixture Bayes (GMB-Reduced) classifier based on (7) using the EM algorithm to determine the parameters of the mixture density for each class. We segregate the data into different classes and use the EM algorithm to determine the parameters of the full mixture density (5). We then use (7) to classify patterns. 
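The reduced-rank component density of (6) amounts to truncating the eigendecomposition of each covariance matrix. A minimal sketch of the log of one such component (our own illustration; the function name is hypothetical, and with m = n it reduces to the ordinary Gaussian log density): \n\n```python
import numpy as np

def reduced_rank_log_density(x, mu, sigma, m):
    # Log of one component of Eq. (6): only the leading m
    # eigen-directions of the covariance enter the determinant
    # and the Mahalanobis distance.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    # eigh returns eigenvalues in ascending order; keep the m largest.
    lead_vals = eigvals[::-1][:m]
    lead_vecs = eigvecs[:, ::-1][:, :m]
    proj = lead_vecs.T @ (x - mu)                 # projections onto leading e_ji
    log_norm = -0.5 * m * np.log(2 * np.pi) - 0.5 * np.sum(np.log(lead_vals))
    return log_norm - 0.5 * np.sum(proj ** 2 / lead_vals)
```
\nFor instance, with sigma = diag(4, 1) and m = 1, only the leading eigen-direction (variance 4) contributes to the distance and the normalization. \n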
\n\n4.2 A constrained rank WTA algorithm \n\nWe now derive a winner-take-all (WTA) approximation for the constrained rank mixture model described above. We assume (similar to section 3.1) that \n\n\u2022 p(x | Omega_I) are constrained mixtures of Gaussians as in (6). \n\u2022 The summation in (6) is dominated by the largest term (the WTA assumption). \n\u2022 The mixing proportions (alpha_j^I) are equal for a given class and the number of components Q_I is proportional to p(Omega_I). \n\nApplying these assumptions to (7), taking logs and discarding the terms that are identical for each class, we get the discriminant function \n\ndelta_I(x) = - min_{j=1}^{Q_I} [ (1/2) sum_{i=1}^{m} log lambda_{ji}^I + (1/2) sum_{i=1}^{m} (e_{ji}^{I T} (x - mu_j^I))^2 / lambda_{ji}^I ]. (8) \n\nIt is interesting to compare (8) with (3). Our model postulates that the trailing n - m eigen-directions of each Gaussian represent overfitting to noise in the training set. The discriminant functions reflect this; (8) retains only those terms of (3) which are in the leading m eigen-directions of each Gaussian. \n\nWe can generate an algorithm based on (8) that approximates the reduced rank Bayes classifier. We separate the data based on classes and train a separate vector quantizer (VQ) for each class. We then compute the means mu_j^I and the covariance matrices Sigma_j^I for each Voronoi cell of each quantizer, and the orthonormal eigenvectors e_{ji}^I and eigenvalues lambda_{ji}^I of each covariance matrix Sigma_j^I. We use (8) for classifying new patterns. Notice that the algorithm described above is a reduced rank version of VQ-Covariance (described in section 3.1). \n\nTable 1: The test set classification accuracies for the TIMIT vowel data for different algorithms. \n\nALGORITHM | ACCURACY \nMLP (40 nodes in hidden layer) | 46.8% \nGMB (1 component; full) | 41.4% \nGMB (1 component; diagonal) | 46.3% \nGMB-Reduced (1 component; 13-D) | 51.2% \nVQ-Covariance (1 component) | 41.4% \nVQ-Covariance-Reduced (1 component; 13-D) | 51.2% \nLVQ (48 cells) | 41.4% 
We call this algorithm VQ-Covariance-Reduced. \n\n5 Experimental Results \n\nIn this section we compare the different mixture models and a multilayer perceptron (MLP) on two speech phoneme classification tasks. The measure used is the classification accuracy. \n\n5.1 TIMIT data \n\nThe first task is the classification of 12 monophthongal vowels from the TIMIT database (Fisher and Doddington 1986). Each feature vector consists of the lowest 32 DFT coefficients, time-averaged over the central third of the vowel. We partitioned the data into a training set (1200 vectors), a validation set (408 vectors) for model selection, and a test set (408 vectors). The training set contained 100 examples of each class. The values of the free parameters for the algorithms (the number of component densities, the number of hidden nodes for the MLP, etc.) were selected by maximizing the performance on the validation set. \n\nTable 1 shows the results obtained with different algorithms. The constrained rank models (GMB-Reduced and VQ-Covariance-Reduced[1]) perform much better than all the unconstrained ones and even beat an MLP for this task. This data set consists of very few data points per class, and hence is particularly susceptible to overfitting by algorithms with a large number of parameters (like GMB). It is not surprising that constraining the number of model parameters is a big win for this task. \n\n[1] Note that since the best validation set performance is obtained with only one component for each mixture density, the WTA algorithms are identical to the GMB algorithms (for these results). \n\nTable 2: The test set classification accuracies for the CENSUS data for different algorithms. 
\n\nALGORITHM \nMLP (80 nodes in hidden layer) \nGMB (1 component; full) \nGMB (8 components; diagonal) \nGMB-Reduced (2 components; 35-D) \nVQ-Covariance (3 components) \nVQ-Covariance-Reduced (4 components; 38-D) \nLVQ (55 cells) \n\nACCURACY \n\n88.2% \n77.2% \n70.9% \n82.5% \n77.5% \n84.2% \n67.3% \n\n5.2 CENSUS data \n\nThe next task we experimented with was the classification of 9 vowels (found in the \nutterances ofthe days of the week). The data was drawn from the CENSUS speech \ncorpus (Cole et alI994). Each feature vector was 70 dimensional (perceptual linear \nprediction (PLP) coefficients (Hermansky 1990) over the vowel and surrounding \ncontext}. We partitioned the data into a training set (8997 vectors), a validation \nset (1362 vectors) for model selection, and a test set (1638 vectors). The training \nset had close to a 1000 vectors per class. The values of the free parameters for the \ndifferent algorithms were selected by maximizing the validation set performance. \nTable 2 gives a summary of the classification accuracies obtained using the different \nalgorithms. This data set has a lot more data points per class than the TIMIT data \nset. The best accuracy is obtained by a MLP, though the constrained rank mixture \nmodels still greatly outperform the unconstrained ones. \n\n6 Discussion \n\nWe have derived WTA approximations to GMB classifiers and shown their relation \nto LVQ and k-Means algorithms. The main problem with Gaussian mixture models \nis the explosion in the number of model parameters with input dimensionality, re(cid:173)\nsulting in poor generalization performance. We propose constrained rank Gaussian \nmixture models for classification. This approach ignores some directions ( \"noise\") \nlocally in the input space, and thus reduces the effective number of model param(cid:173)\neters. This can be considered as a way of regularizing the mixture models. 
Our results with speech vowel classification indicate that this approach works better than using full mixture models, especially when the data set size is small. \n\nThe WTA algorithms proposed in this paper do not perform a maximum likelihood estimation of their parameters. The probability model is only used to classify data. We can potentially improve the performance of these algorithms by doing maximum likelihood training with respect to the models presented here. \n\nAcknowledgments \n\nThis work was supported by grants from the Air Force Office of Scientific Research (F49620-93-1-0253), the Electric Power Research Institute (RP8015-2) and the Office of Naval Research (N00014-91-J-1482). We would like to thank Joachim Utans, OGI, for several useful discussions and Zoubin Ghahramani, MIT, for providing MATLAB code for the EM algorithm. We also thank our colleagues in the Center for Spoken Language Understanding at OGI for providing speech data. \n\nReferences \n\nR.A. Cole, D.G. Novick, D. Burnett, B. Hansen, S. Sutton, and M. Fanty. (1994) Towards Automatic Collection of the U.S. Census. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing 1994. \nA.P. Dempster, N.M. Laird, and D.B. Rubin. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Royal Statistical Society Series B, vol. 39, pp. 1-38. \nR.O. Duda and P.E. Hart. (1973) Pattern Classification and Scene Analysis. John Wiley and Sons Inc. \nW.M. Fisher and G.R. Doddington. (1986) The DARPA speech recognition database: specification and status. In Proceedings of the DARPA Speech Recognition Workshop, pp. 93-99, Palo Alto, CA. \nH. Hermansky. (1990) Perceptual Linear Predictive (PLP) analysis of speech. J. Acoust. Soc. Am., 87(4):1738-1752. \nT. Kohonen. (1989) Self-Organization and Associative Memory (3rd edition). Berlin: Springer-Verlag. \nS.J. Nowlan. (1991) Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. CMU-CS-91-126, PhD thesis, School of Computer Science, Carnegie Mellon University. \n", "award": [], "sourceid": 907, "authors": [{"given_name": "Nanda", "family_name": "Kambhatla", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}