{"title": "Limitations of Self-organizing Maps for Vector Quantization and Multidimensional Scaling", "book": "Advances in Neural Information Processing Systems", "page_first": 445, "page_last": 451, "abstract": null, "full_text": "Limitations of self-organizing maps for \n\nvector quantization and multidimensional \n\nscaling \n\nArthur Flexer \n\nThe Austrian Research Institute for Artificial Intelligence \n\nSchottengasse 3, A-lOlO Vienna, Austria \n\nDepartment of Psychology, University of Vienna \n\nLiebiggasse 5, A-lOlO Vienna, Austria \n\nand \n\narthur~ai.univie.ac.at \n\nAbstract \n\nThe limitations of using self-organizing maps (SaM) for either \nclustering/vector quantization (VQ) or multidimensional scaling \n(MDS) are being discussed by reviewing recent empirical findings \nand the relevant theory. SaM 's remaining ability of doing both VQ \nand MDS at the same time is challenged by a new combined tech(cid:173)\nnique of online K-means clustering plus Sammon mapping of the \ncluster centroids. SaM are shown to perform significantly worse in \nterms of quantization error , in recovering the structure of the clus(cid:173)\nters and in preserving the topology in a comprehensive empirical \nstudy using a series of multivariate normal clustering problems. \n\n1 \n\nIntroduction \n\nSelf-organizing maps (SaM) introduced by [Kohonen 84] are a very popular tool \nused for visualization of high dimensional data spaces. SaM can be said to do \nclustering/vector quantization (VQ) and at the same time to preserve the spatial \nordering of the input data reflected by an ordering of the code book vectors (cluster \ncentroids) in a one or two dimensional output space, where the latter property is \nclosely related to multidimensional scaling (MDS) in statistics. 
Although the level of activity and research around the SOM algorithm is quite large (a recent overview by [Kohonen 95] contains more than 1000 citations), only little comparison among the numerous existing variants of the basic approach, and also to more traditional statistical techniques from the larger frameworks of VQ and MDS, is available. Additionally, there is only little advice in the literature about how to properly use SOM in order to get optimal results in terms of either vector quantization (VQ) or multidimensional scaling or maybe even both of them. To make the notion of SOM being a tool for \"data visualization\" more precise, the following question has to be answered: Should SOM be used for doing VQ, MDS, both at the same time or none of them?

Two recent comprehensive studies comparing SOM either to traditional VQ or MDS techniques separately seem to indicate that SOM is not competitive when used for either VQ or MDS: [Balakrishnan et al. 94] compare SOM to K-means clustering on 108 multivariate normal clustering problems with known clustering solutions and show that SOM performs significantly worse in terms of data points misclassified¹, especially with higher numbers of clusters in the data sets. [Bezdek & Nikhil 95] compare SOM to principal component analysis and the MDS technique Sammon mapping on seven artificial data sets with different numbers of points and dimensionality and different shapes of input distributions. The degree of preservation of the spatial ordering of the input data is measured via a Spearman rank correlation between the distances of points in the input space and the distances of their projections in the two dimensional output space. The traditional MDS techniques preserve the distances much more effectively than SOM, the performance of which decreases rapidly with increasing dimensionality of the input data. 
\n\nDespite these strong empirical findings that speak against the use of SOM for either \nVQ or MDS there remains the appealing ability ofSOM to do both VQ and MDS at \nthe same time. It is the aim of this work to find out, whether a combined technique \nof traditional vector quantization (clustering) plus MDS on the code book vectors \n(cluster centroids) can perform better than Kohonen's SOM on a series of multi(cid:173)\nvariate normal clustering problems in terms of quantization error (mean squared \nerror) , recovering the cluster structure (Rand index) and preserving the topology \n(Pearson correlation). All the experiments were done in a rigoruos statistical design \nusing multiple analysis of variance for evaluation of the results. \n\n2 SOM and vector quantization/clustering \n\nA vector quantizer (VQ) is a mapping, q, that assigns to each input vector x a \nreproduction (code book) vector x = q( x) drawn from a finite reproduction alphabet \nA = {Xi, i = 1, ... , N}. The quantizer q is completely described by the reproduction \nalphabet (or codebook) A together with the partition S = {Si , i = 1, .. . , N}, of the \ninput vector space into the sets Si = {x : q(x) = xd of input vectors mapping into \nthe ith reproduction vector (or code word) [Linde et al. 80J. To be compareable to \nSO M, our VQ assigns to each of the input vectors x = (xO, xl, . .. , x k- l ) a socalled \ncode book vector x = (xO, xl, ... , xk -1) of the same dimensionality k. For reasons of \ndata compression, the number of code book vectors N ~ n, where n is the number \nof input vectors. \n\nDemanded is a VQ that produces a mapping q for which the expected distortion \ncaused by reproducing the input vectors x by code book vectors q( x) is at least \nlocally minimal. 
The expected distortion is usually estimated by using the average distortion D,

D = (1/n) Σ_{j=0}^{n-1} d(x_j, q(x_j))   (1)

where the most common distortion measure is the squared-error distortion d:

d(x, x̂) = Σ_{i=0}^{k-1} |x_i − x̂_i|²   (2)

¹ Although SOM is an unsupervised technique not built for classification, the number of points misclassified to a wrong cluster center is an appropriate and commonly used performance measure for cluster procedures if the true cluster structure is known.

The classical vector quantization technique to achieve such a mapping is the LBG algorithm [Linde et al. 80], where a given quantizer is iteratively improved. Already [Linde et al. 80] noted that their proposed algorithm is very similar to the k-means approach developed in the cluster analysis literature starting from [MacQueen 67]. Closely related to SOM is online K-means clustering (oKMC), consisting of the following steps:

1. Initialization: Given N = number of code book vectors, k = dimensionality of the vectors, n = number of input vectors, a training sequence {x_j; j = 0, ..., n−1}, an initial set Â_0 of N code book vectors x̂ and a discrete-time coordinate t = 0, ..., n−1.

2. Given Â_t = {x̂_i; i = 1, ..., N}, find the minimum distortion partition P(Â_t) = {S_i; i = 1, ..., N}: compute d(x_t, x̂_i) for i = 1, ..., N; if d(x_t, x̂_i) ≤ d(x_t, x̂_l) for all l, then x_t ∈ S_i.

3. Update the code book vector with the minimum distortion

x̂(t)(S_i) = x̂(t−1)(S_i) + α[x(t) − x̂(t−1)(S_i)]   (3)

where α is a learning parameter to be defined by the user. Define Â_{t+1} = x̂(P(Â_t)) and replace t by t + 1; if t = n−1, halt, else go to step 2.

The main difference between the SOM algorithm and oKMC is the fact that the code book vectors are ordered, either on a line or on a planar grid (i.e. in a one or two dimensional output space). 
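Before turning to the SOM update, the three oKMC steps above can be sketched in a few lines of Python/NumPy (a minimal illustration, not the implementation used in the experiments; initializing the code book with randomly chosen training vectors is an assumption):

```python
import numpy as np

def okmc(data, n_codebooks, alpha=0.02, seed=0):
    """Online K-means clustering (oKMC) sketch following steps 1-3 above.

    data: (n, k) array of input vectors; alpha: the learning parameter.
    """
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick N distinct training vectors as initial code book.
    codebooks = data[rng.choice(len(data), n_codebooks, replace=False)].copy()
    for x in data:  # one pass over the training sequence x_0, ..., x_{n-1}
        # 2. Find the code book vector with minimum squared-error distortion (2).
        i = np.argmin(((codebooks - x) ** 2).sum(axis=1))
        # 3. Move only the winning code book vector towards the input, formula (3).
        codebooks[i] += alpha * (x - codebooks[i])
    return codebooks
```

With a nonzero neighbourhood, formula (4) below would additionally move the grid neighbours of the winner; with zero neighbourhood the two procedures coincide.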
The iterative procedure is the same as with oKMC, where formula (3) is replaced by

x̂(t)(S_i) = x̂(t−1)(S_i) + h[x(t) − x̂(t−1)(S_i)]   (4)

and this update is computed not only for the x̂_i that gives minimum distortion, but also for all the code book vectors which are in the neighbourhood of this x̂_i on the line or planar grid. The degree of neighbourhood, and the number of code book vectors which are updated together with the x̂_i that gives minimum distortion, is expressed by h, a function that decreases both with distance on the line or planar grid and with time, and that also includes an additional learning parameter α. If the degree of neighbourhood is decreased to zero, the SOM algorithm becomes equal to the oKMC algorithm.

Whereas local convergence is guaranteed for oKMC (at least for decreasing α, [Bottou & Bengio 95]), no general proof for the convergence of SOM with nonzero neighbourhood is known. [Kohonen 95, p.128] notes that the last steps of the SOM algorithm should be computed with zero neighbourhood in order to guarantee \"the most accurate density approximation of the input samples\".

3 SOM and multidimensional scaling

Formally, a topology preserving algorithm is a transformation Φ : R^k → R^p that either preserves similarities or just similarity orderings of the points in the input space R^k when they are mapped into the output space R^p. For most algorithms it is the case that both the number of input vectors |x ∈ R^k| and the number of output vectors |x̂ ∈ R^p| are equal to n. A transformation Φ : x̂ = Φ(x) that preserves similarities poses the strongest possible constraint, since d(x_i, x_j) = d̂(x̂_i, x̂_j) for all x_i, x_j ∈ R^k, all x̂_i, x̂_j ∈ R^p, i, j = 1, ..., n−1, with d (d̂) being a measure of distance in R^k (R^p). Such a transformation is said to produce an isometric image. 
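As a concrete illustration of how well a transformation Φ preserves distances, one can correlate all pairwise distances before and after mapping (a minimal sketch; an exactly isometric image yields a correlation of 1.0, as does any global rescaling):

```python
import numpy as np
from itertools import combinations

def distance_preservation(X, Y):
    """Pearson correlation between pairwise distances in the input space
    X (n x k) and the output space Y (n x p). A value of 1.0 indicates an
    isometric image (up to a global scaling)."""
    pairs = list(combinations(range(len(X)), 2))
    d_in = [np.linalg.norm(X[i] - X[j]) for i, j in pairs]
    d_out = [np.linalg.norm(Y[i] - Y[j]) for i, j in pairs]
    return np.corrcoef(d_in, d_out)[0, 1]
```

This is the same quantity used as the topology-preservation measure in the empirical comparison of section 5.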
\nTechniques for finding such transformations !l> are, among others, various forms of \nmultidimensional scalinl (MDS) like metric MDS [Torgerson 52], nonmetric MDS \n[Shepard 62] or Sammon mapping [Sammon 69], but also principal component anal(cid:173)\nysis (PCA) (see e.g. \nminimizing the following via steepest descent: \n\n[Jolliffe 86]) or SOM. Sammon mapping is doing MDS by \n\nSince the SOM has been designed heuristically and not to find an extremum for a \ncertain cost or energy function 3 and the theoretical connection to the other MDS \nalgorithms remains unclear. It should be noted that for SOM the number of output \nvectors I x E RP I is limited to N, the number of cluster centroids x and that the x \nare further restricted to lie on a planar grid. This restriction entails a discretization \nof the outputspace RP . \n\n4 Online [(-means clustering plus Sammon mapping of the \n\ncl uster centroids \n\nOur new combined approach consists of simply finding the set of A = {Xi, i = \n1, ... , N} code book vectors that give the minimum distortion partition P(A) = \n{8i ; i = 1, . .. , N} via the oKMC algorithm and then using the Xi as input vectors \nto Sammon mapping and thereby obtaining a two dimensional representation of the \nXi via minimizing formula (5). Contrary to SOM, this two dimensional representa(cid:173)\ntion is not restricted to any fixed form and the distances between the N mapped \nXi directly correspond to those in the original higher dimension. This combined \nalgorithm is abbreviated oKMC+ . \n\n5 Empirical comparison \n\nThe empirical comparison was done using a 3 factorial experimental design with \n3 dependent variables. The multivariate normal distributions were generated us(cid:173)\ning the procedure by [Milligan & Cooper 85], which since has been used for several \ncomparisons of cluster algorithms (see e.g. [Balakrishnan et al. 94]). 
The marginal normal distributions gave internal cohesion of the clusters by warranting that more than 99% of the data lie within 3 standard deviations (σ). External isolation was defined as having the first dimension nonoverlapping, by truncating the normal distributions in the first dimension to ±2σ and defining the cluster centroids to be 4.5σ apart. In all other dimensions the clusters were allowed to overlap by setting the distance per dimension between two centroids randomly to lie between ±6σ. The data was normalized to zero mean and unit variance in all dimensions.

² Note that for MDS not the actual coordinates of the points in the input space but only their distances, or the ordering of the latter, are needed.

³ [Erwin et al. 92] even showed that such an objective function cannot exist for SOM.

algorithm    no. clusters  dimension  msqe  Rand  corr.
SOM          4             4          0.53  1.00  0.64
SOM          4             6          1.53  0.91  0.72
SOM          4             8          1.15  0.99  0.74
SOM          9             4          0.33  0.97  0.48
SOM          9             6          0.54  0.97  0.66
SOM          9             8          0.81  0.96  0.74
mean SOM                              0.81  0.97  0.67
oKMC+        4             4          0.53  0.99  0.87
oKMC+        4             6          1.06  0.99  0.87
oKMC+        4             8          1.17  1.00  0.91
oKMC+        9             4          0.29  0.98  0.89
oKMC+        9             6          0.47  0.99  0.87
oKMC+        9             8          0.56  0.98  0.86
mean oKMC+                            0.68  0.99  0.88

Factor 1, type of algorithm: The number of code book vectors of both SOM and oKMC+ was set equal to the number of clusters known to be in the data. The SOMs were planar grids consisting of 2 x 2 (3 x 3) code book vectors. During the first phase (1000 code book updates) α was set to 0.05 and the radius of the neighbourhood to 2 (5). During the second phase (10000 code book updates) α was set to 0.02 and the radius of the neighbourhood to 0 to guarantee the most accurate vector quantization [Kohonen 95, p.128]. 
The oKMC+ algorithm had the parameter α fixed to 0.02 and was trained using each data set 20 times; the minimization of formula (5) was stopped after 100 iterations. Both SOM and oKMC+ were run 10 times on each data set and only the best solutions, in terms of mean squared error, were used for further analysis.

Factor 2, number of clusters: set to 4 and 9.

Factor 3, number of dimensions: set to 4, 6, or 8.

Dependent variable 1, mean squared error: computed using formula (1).

Dependent variable 2, Rand index (see [Hubert & Arabie 85]): a measure of agreement between the true, known partition structure and the obtained clusters. Both the numerator and the denominator of the index reflect frequency counts. The numerator is the number of times a pair of data points is either in the same or in different clusters in both the known and the obtained clusterings, for all possible comparisons of data points. Since the denominator is the total number of all possible pairwise comparisons, an index value of 1.0 indicates an exact match of the clusterings.

Dependent variable 3, correlation: a measure of the topology preserving abilities of the algorithms. The Pearson correlation between the distances d(x_i, x_j) in the input space and the distances d̂(x̂_i, x̂_j) in the output space is computed for all possible pairwise comparisons of data points. Note that for SOM the coordinates of the code book vectors on the planar grid were used to compute the d̂. An algorithm that preserves all distances in every neighbourhood would produce an isometric image and yield a value of 1.0 (see [Bezdek & Nikhil 95] for a discussion of measures of topology preservation).

For each cell in the full-factorial 2 x 2 x 3 design, 3 data sets with 25 points for each cluster were generated, resulting in a total of 36 data sets. 
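The pairwise counting behind the Rand index (dependent variable 2) can be sketched as follows (a minimal O(n²) illustration; [Hubert & Arabie 85] also discuss a chance-corrected variant not shown here):

```python
from itertools import combinations

def rand_index(true_labels, found_labels):
    """Simple Rand index: the fraction of point pairs on which the two
    partitions agree, i.e. pairs that are in the same cluster in both
    clusterings or in different clusters in both clusterings."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = 0
    for i, j in pairs:
        same_true = true_labels[i] == true_labels[j]
        same_found = found_labels[i] == found_labels[j]
        if same_true == same_found:  # pair treated alike by both partitions
            agree += 1
    return agree / len(pairs)
```

Note that the index is invariant under relabelling of the clusters, so an identical partition with permuted cluster labels still yields 1.0.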
A multiple analysis of variance (MANOVA) yielded the following significant effects at the .05 error level:

The mean squared error is lower for oKMC+ than for SOM, it is lower for the 9-cluster problem than for the 4-cluster problem, and it is higher for higher dimensional data. There is also a combined effect of the number of clusters and dimensions on the mean squared error. The Rand index is higher for oKMC+ than for SOM; there is also a combined effect of the number of clusters and dimensions. The correlation index is higher for oKMC+ than for SOM. Since the main interest of this study is the effect of the type of algorithm on the dependent variables, the mean performances for SOM and oKMC+ are given as the separate mean rows in the table. Note that the overall differences in the performances of the two algorithms are blurred by the significant effects of the other factors, and that therefore the differences of the grand means across the types of algorithm appear rather small. Only by applying a MANOVA could effects of the factor 'type of algorithm' that are masked by additional effects of the other two factors, 'number of clusters' and 'number of dimensions', still be detected.

6 Discussion and Conclusion

From the theoretical comparison of SOM to oKMC it should be clear that, in terms of quantization error, SOM can only perform as well as oKMC if SOM's neighbourhood is set to zero. Additional experiments with nonzero neighbourhood until the end of SOM training, not reported here in detail for brevity, gave even worse results, since the neighbourhood tends to pull the obtained cluster centroids away from the true ones. The Rand index is only slightly better for oKMC+. The high values indicate that both algorithms were able to recover the known cluster structure. Topology preservation is where SOM performs worst compared to oKMC+. 
This is a direct implication of the restriction to planar grids, which allows only Σ_{i=2}^{s} i (= s(s+1)/2 − 1) different distances in an s x s planar grid, instead of the N(N−1)/2 different distances for N = s x s cluster centroids mapped via Sammon mapping in the case of oKMC+. Using a nonzero neighbourhood at the end of SOM training did not yield any significant improvements.

An argument that could be brought forward against our approach towards comparing SOM and oKMC+ is that it would be unfair or not correct to set the number of SOM's code book vectors equal to the number of clusters known to be in the data. In fact, it seems to be common practice to apply SOM with numbers of code book vectors that are a multiple of the number of input vectors available for training (see e.g. [Kohonen 95, pp.113]). Two things have to be said against such an argumentation: First, if one uses more code book vectors than input vectors, or even only the same number, during vector quantization, each code book vector will become identical to one of the input vectors in the limit of learning. So every x_i is replaced with an identical x̂_i, which does not make any sense and runs counter to every notion of vector quantization. This means that SOMs employing numbers of code book vectors that are a multiple of the number of input vectors available can be used for MDS only. But even such big SOMs do MDS in a very crude way: We computed SOMs consisting of either 20 x 20 (for data sets consisting of 4 clusters and 100 points) or 30 x 30 (for data sets consisting of 9 clusters and 225 points) code book vectors for all 36 data sets, which gave an average correlation of 0.77 between the distances d_i and d̂_i. This is significantly worse at the .05 error level compared to the average correlation of 0.95 achieved by Sammon mapping applied to the input data directly.

Our data sets consisted of iid multivariate normal distributions, which therefore have spherical shape. 
All VQ algorithms using squared distances as a distortion measure, including our versions of oKMC as well as SOM, are inherently designed for such distributions. Therefore the clustering problems in this study, being also perfectly separable in one dimension, were very simple and should be solvable with little or no error by any clustering or MDS algorithm.

In this work we examined the vague concept of using SOM as a \"data visualization tool\" both from a theoretical and an empirical point of view. SOM cannot outperform traditional VQ techniques in terms of quantization error and should therefore not be used for doing VQ. From [Bezdek & Nikhil 95], as well as from our discussion of SOM's restriction to planar grids in the output space, which allows only a restricted number of different distances to be represented, it should be evident that SOM is also a rather crude way of doing MDS. Our own empirical results show that if one wants an algorithm that does both VQ and MDS at the same time, there exists a very simple combination of traditional techniques (our oKMC+) with well-known and established properties that clearly outperforms SOM.

Whether it is a good idea to combine clustering or vector quantization and multidimensional scaling at all, whether more principled approaches (see e.g. [Bishop et al. this volume], also for pointers to further related work) can yield even better results than our oKMC+, and, last but not least, what self-organizing maps should be used for in this new light, remain questions to be answered by future investigations.

Acknowledgements: Thanks are due to James Pardey, University of Oxford, for the Sammon code. The SOM_PAK, Helsinki University of Technology, was used for all computations of self-organizing maps. 
This work has been started within the framework of the BIOMED-1 concerted action ANNDEE, sponsored by the European Commission, DG XII, and the Austrian Federal Ministry of Science, Transport, and the Arts, which is also supporting the Austrian Research Institute for Artificial Intelligence. The author is supported by a doctoral grant of the Austrian Academy of Sciences.

References

[Balakrishnan et al. 94] Balakrishnan P.V., Cooper M.C., Jacob V.S., Lewis P.A.: A study of the classification capabilities of neural networks using unsupervised learning: a comparison with k-means clustering, Psychometrika, Vol. 59, No. 4, 509-525, 1994.

[Bezdek & Nikhil 95] Bezdek J.C., Nikhil R.P.: An index of topological preservation for feature extraction, Pattern Recognition, Vol. 28, No. 3, pp. 381-391, 1995.

[Bishop et al. this volume] Bishop C.M., Svensen M., Williams C.K.I.: GTM: A Principled Alternative to the Self-Organizing Map, this volume.

[Bottou & Bengio 95] Bottou L., Bengio Y.: Convergence Properties of the K-Means Algorithms, in Tesauro G. et al. (eds.), Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA, pp. 585-592, 1995.

[Erwin et al. 92] Erwin E., Obermayer K., Schulten K.: Self-organizing maps: ordering, convergence properties and energy functions, Biological Cybernetics, 67, 47-55, 1992.

[Hubert & Arabie 85] Hubert L.J., Arabie P.: Comparing partitions, J. of Classification, 2, 63-76, 1985.

[Jolliffe 86] Jolliffe I.T.: Principal Component Analysis, Springer, 1986.

[Kohonen 84] Kohonen T.: Self-Organization and Associative Memory, Springer, 1984.

[Kohonen 95] Kohonen T.: Self-organizing maps, Springer, Berlin, 1995.

[Linde et al. 80] Linde Y., Buzo A., Gray R.M.: An Algorithm for Vector Quantizer Design, IEEE Transactions on Communications, Vol. COM-28, No. 1, January 1980. 
\n[MacQueen 67] MacQueen J.: Some Methods for Classification and Analysis of Multivari(cid:173)\n\nate Observations, Proc. of the Fifth Berkeley Symposium on Math., Stat. and Prob., \nVol. 1, pp. 281-296, 1967. \n\n[Milligan & Cooper 85] Milligan G.W., Cooper M.C.: An examination of procedures for \ndetermining the number of clusters in a data set, Psychometrika 50(2), 159-179, 1985. \n[Sammon 69] Sammon J .W.: A Nonlinear Mapping for Data Structure Analysis, IEEE \n\nTransactions on Comp., Vol. C-18, No.5, p.401-409, 1969. \n\n[Shepard 62] Shepard R.N.: The analysis of proximities: multidimensional scaling with \n\nan unknown distance function . I., Psychometrika, Vol. 27, No. 2, p.125-140, 1962. \n\n[Torgerson 52] Torgerson W .S.: Multidimensional Scaling, I: theory and method, Psy(cid:173)\n\nchometrika, 17, 401-419, 1952. \n\n\f", "award": [], "sourceid": 1295, "authors": [{"given_name": "Arthur", "family_name": "Flexer", "institution": null}]}