{"title": "A Note on Learning Vector Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 220, "page_last": 227, "abstract": null, "full_text": "A Note on Learning Vector Quantization \n\nVirginia R. de Sa \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\nDana H. Ballard \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\nAbstract \n\nVector quantization is useful for data compression. Competitive Learning, which minimizes reconstruction error, is an appropriate algorithm for vector quantization of unlabelled data. Vector quantization of labelled data for classification has a different objective, to minimize the number of misclassifications, and a different algorithm is appropriate. We show that a variant of Kohonen's LVQ2.1 algorithm can be seen as a multi-class extension of an algorithm which, in a restricted two-class case, can be proven to converge to the Bayes optimal classification boundary. We compare the performance of the LVQ2.1 algorithm to that of a modified version having a decreasing window and normalized step size, on a ten-class vowel classification problem. \n\n1  Introduction \n\nVector quantization is a form of data compression that represents data vectors by a smaller set of codebook vectors. Each data vector is then represented by its nearest codebook vector. The goal of vector quantization is to represent the data with the fewest codebook vectors while losing as little information as possible. \n\nVector quantization of unlabelled data seeks to minimize the reconstruction error. This can be accomplished with Competitive Learning [Grossberg, 1976; Kohonen, 1982], an iterative learning algorithm for vector quantization that has been shown to perform gradient descent on the following energy function [Kohonen, 1991] \n\n$$\\int \\|x - w_{s^*(x)}\\|^2 \\, p(x) \\, dx.$$ 
\n\nwhere $p(x)$ is the probability distribution of the input patterns, the $w_s$ are the reference or codebook vectors, and $s^*(x)$ is defined by $\\|x - w_{s^*(x)}\\| \\le \\|x - w_i\\|$ (for all $i$). This minimizes the squared reconstruction error of unlabelled data and may work reasonably well for classification tasks if the patterns in the different classes are segregated. \n\nIn many classification tasks, however, the different member patterns may not be segregated into separate clusters for each class. In these cases it is more important that members of the same class be represented by the same codebook vector than that the reconstruction error is minimized. To do this, the quantizer can make use of the labelled data to encourage appropriate quantization. \n\n2  Previous approaches to Supervised Vector Quantization \n\nThe first use of labelled data (or a teaching signal) with Competitive Learning, by Rumelhart and Zipser [Rumelhart and Zipser, 1986], can be thought of as assigning a class to each codebook vector and only allowing patterns from the appropriate class to influence each reference vector. \n\nThis simple approach is far from optimal, though, as it fails to take into account interactions between the classes. Kohonen addressed this in his LVQ(1) algorithm [Kohonen, 1986]. He argues that the reference vectors resulting from LVQ(1) tend to approximate, for a particular class $r$, \n\n$$P(x | C_r) P(C_r) - \\sum_{s \\ne r} P(x | C_s) P(C_s),$$ \n\nwhere $P(C_i)$ is the a priori probability of class $i$ and $P(x | C_i)$ is the conditional density of class $i$. \n\nThis approach is also not optimal for classification, as it addresses optimal places to put the codebook vectors instead of optimal placement of the borders of the vector quantizer, which arise from the Voronoi tessellation induced by the codebook vectors. 
1 \n\n3  Minimizing Misclassifications \n\nIn classification tasks the goal is to minimize the number of misclassifications of the resultant quantizer. That is, we want to minimize: \n\n(1) \n\nwhere $P(Class_i)$ is the a priori probability of $Class_i$, $P(x | Class_i)$ is the conditional density of $Class_i$, and $D.R._j$ is the decision region for class $j$ (which in this case is all $x$ such that $\\|x - w_k\\| < \\|x - w_i\\|$ (for all $i$) and $w_k$ is a codebook vector for class $j$). \n\nConsider a one-dimensional problem of two classes and two codebook vectors $w_1$ and $w_2$ defining a class boundary $b = (w_1 + w_2)/2$ as shown in Figure 1. In this case Equation 1 reduces to: \n\n(2) \n\n1 Kohonen [1986] showed this by showing that the use of a \"weighted\" Voronoi tessellation (where the relative distances of the borders from the reference vectors were changed) worked better. However, no principled way to calculate the relative weights was given, and the application to real data used the unweighted tessellation. \n\nFigure 1: Codebook vectors $w_1$ and $w_2$ define a border $b$. The optimal place for the border is at $b^*$ where $P(C_1)P(x | C_1) = P(C_2)P(x | C_2)$. The extra misclassification errors incurred by placing the border at $b$ are shown by the shaded region. \n\nThe derivative of Equation 2 with respect to $b$ is \n\nThat is, the minimum number of misclassifications occurs at $b^*$ where $P(Class_1)P(b^* | Class_1) = P(Class_2)P(b^* | Class_2)$. 
\n\nIf $f(x) = P(Class_1)P(x | Class_1) - P(Class_2)P(x | Class_2)$ were a regression function, then we could use stochastic approximation [Robbins and Monro, 1951] to estimate $b^*$ iteratively as \n\n$$b(n+1) = b(n) + a(n) Z_n$$ \n\nwhere $Z_n$ is a sample of the random variable $Z$ whose expected value is $P(Class_1)P(b(n) | Class_1) - P(Class_2)P(b(n) | Class_2)$ and \n\n$$\\lim_{n \\to \\infty} a(n) = 0, \\qquad \\sum_n a(n) = \\infty, \\qquad \\sum_n a^2(n) < \\infty.$$ \n\nHowever, we do not have immediate access to an appropriate random variable $Z$, but we can express $P(Class_1)P(x | Class_1) - P(Class_2)P(x | Class_2)$ as the limit of a sequence of regression functions using the Parzen window technique. In the Parzen window technique, probability density functions are estimated as the sum of appropriately normalized pulses centred at the observed values. More formally, we can estimate $P(x | Class_i)$ as [Sklansky and Wassel, 1981] \n\n$$\\hat{P}_i^n(x) = \\frac{1}{n} \\sum_{j=1}^{n} \\Psi_n(x - X_j, c(n))$$ \n\nwhere $X_j$ is the sample data point at time $j$, and $\\Psi_n(x - z, c(n))$ is a Parzen window function centred at $z$ with width parameter $c(n)$ that satisfies the following conditions \n\n$$\\Psi_n(x - z, c(n)) \\ge 0, \\quad \\forall x, z$$ \n\n$$\\int_{-\\infty}^{\\infty} \\Psi_n(x - z, c(n)) \\, dx = 1$$ \n\n$$\\lim_{n \\to \\infty} \\frac{1}{n} \\int_{-\\infty}^{\\infty} \\Psi_n^2(x - z, c(n)) \\, dx = 0$$ \n\n$$\\lim_{n \\to \\infty} \\Psi_n(x - z, c(n)) = \\delta(x - z)$$ \n\nWe can estimate $f(x) = P(Class_1)P(x | Class_1) - P(Class_2)P(x | Class_2)$ as \n\n$$\\hat{f}^n(x) = \\frac{1}{n} \\sum_{j=1}^{n} S(X_j) \\Psi_n(x - X_j, c(n))$$ \n\nwhere $S(X_j)$ is $+1$ if $X_j$ is from $Class_1$ and $-1$ if $X_j$ is from $Class_2$. 
\n\nThen \n\n$$\\lim_{n \\to \\infty} \\hat{f}^n(x) = P(Class_1)P(x | Class_1) - P(Class_2)P(x | Class_2)$$ \n\nand \n\n$$\\lim_{n \\to \\infty} E[S(X) \\Psi_n(x - X, c(n))] = P(Class_1)P(x | Class_1) - P(Class_2)P(x | Class_2).$$ \n\nWassel and Sklansky [1972] have extended the stochastic approximation method of Robbins and Monro [1951] to find the zero of a function that is the limit of a sequence of regression functions, and show rigorously that for the above case (where the distribution of $Class_1$ is to the left of that of $Class_2$ and there is only one crossing point) the stochastic approximation procedure \n\n$$b(n+1) = b(n) + a(n) Z_n(X_n, Class(n), b(n), c(n)) \\qquad (3)$$ \n\nusing \n\n$$Z_n = \\begin{cases} 2c(n)\\Psi(X_n - b(n), c(n)) & \\text{for } X_n \\in Class_1 \\\\ -2c(n)\\Psi(X_n - b(n), c(n)) & \\text{for } X_n \\in Class_2 \\end{cases}$$ \n\nconverges to the Bayes optimal border with probability one, where $\\Psi(x - b, c)$ is a Parzen window function. The following standard conditions for stochastic approximation convergence are needed in their proof \n\n$$a(n), c(n) > 0, \\qquad \\lim_{n \\to \\infty} c(n) = 0, \\qquad \\lim_{n \\to \\infty} a(n) = 0, \\qquad \\sum_n a(n)c(n) = \\infty,$$ \n\nas well as a condition that for rectangular Parzen functions reduces to a requirement that $P(Class_1)P(x | Class_1) - P(Class_2)P(x | Class_2)$ be strictly positive to the left of $b^*$ and strictly negative to the right of $b^*$ (for full details of the proof and conditions see [Wassel and Sklansky, 1972]). \n\nThe above argument has only addressed the motion of the border. But $b$ is defined as $b = (w_1 + w_2)/2$, thus we can move the codebook vectors according to \n\n$$\\frac{dE}{dw_1} = \\frac{dE}{dw_2} = 0.5 \\, \\frac{dE}{db}.$$ \n\nWe could now write Equation 3 as \n\n$$w_j(n+1) = w_j(n) + a_2(n) \\frac{X_n - w_j(n-1)}{|X_n - w_j(n-1)|}$$ \n\nif $X_n$ lies in the window of width $2c(n)$ centred at $b(n)$, otherwise \n\n$$w_j(n+1) = w_j(n),$$ \n\nwhere we have used rectangular Parzen window functions and $X_n$ is from $Class_j$. 
This holds if $Class_1$ is to the right or left of $Class_2$ as long as $w_1$ and $w_2$ are relatively ordered appropriately. \n\nExpanding the problem to more dimensions, and more classes with more codebook vectors per class, complicates the analysis, as a change in two codebook vectors to better adjust their border affects more than just the border between the two codebook vectors. However, ignoring these effects for a first-order approximation suggests the following update procedure: \n\n$$w_i^*(n) = w_i^*(n-1) + a(n) \\frac{X_n - w_i^*(n-1)}{\\|X_n - w_i^*(n-1)\\|}$$ \n\n$$w_j^*(n) = w_j^*(n-1) - a(n) \\frac{X_n - w_j^*(n-1)}{\\|X_n - w_j^*(n-1)\\|}$$ \n\nwhere $a(n)$ obeys the constraints above, $X_n$ is from $Class_i$, $w_i^*$ and $w_j^*$ are the two nearest codebook vectors, one each from classes $i$ and $j$ ($j \\ne i$), and $X_n$ lies within $c(n)$ of the border between them. (No changes are made if any of the above conditions is not true.) As above, this algorithm assumes that the initial positions of the codebook vectors are such that they will not have to cross during the algorithm. \n\nThe above algorithm is similar to Kohonen's LVQ2.1 algorithm (which is performed after appropriate initialization of the codebook vectors) except for the normalization of the step size, the decreasing size of the window width $c(n)$, and the constraints on the learning rate $a$. \n\n4  Simulations \n\nMotivated by the theory above, we decided to modify Kohonen's LVQ2.1 algorithm to add normalization of the step size and a decreasing window. In order to allow closer comparison with LVQ2.1, all other parts of the algorithm were kept the same. Thus the learning rate $a$ decreased linearly. We used a linear decrease on the window size and defined it as in LVQ2.1 for easier parameter matching. 
For a window size of $w$, all input vectors satisfying $d_i/d_j > \\frac{1-w}{1+w}$, where $d_i$ is the distance to the closest codebook vector and $d_j$ is the distance to the next closest codebook vector, fall into the window between those two vectors (note, however, that updates only occur if the two closest codebook vectors belong to different classes). \n\nThe data used is a version of the Peterson and Barney vowel formant data 2. The dataset consists of the first and second formants for ten vowels in a /hVd/ context from 75 speakers (32 males, 28 females, 15 children) who repeated each vowel twice 3. As we were not testing generalization, the training set was used as the test set. \n\n[Plot: percent correct against window size, one curve for each initial learning rate (alpha = 0.002, 0.030, 0.080, 0.150, 0.500).] \n\nFigure 2: The effect of different window sizes on the accuracy for different values of initial $a$. \n\nWe ran three sets of experiments varying the number of codebook vectors and the number of pattern presentations. For the first set of experiments there were 20 codebook vectors and the algorithms ran for 40000 steps. Figure 2 shows the effect of varying the window size for different initial learning rates $a(1)$ in the LVQ2.1 algorithm. 
The values plotted are averaged over three runs (the order of presentation of patterns is different for the different runs). The sensitivity of the algorithm to the window size, as mentioned in [Kohonen, 1990], is evident. In general we found that as the learning rate is increased, the peak accuracy is improved at the expense of the accuracy for other window widths. After a certain value, the accuracy declines for further increases in learning rate. \n\n2 Obtained from Steven Nowlan. \n3 Three speakers were missing one vowel, and the raw data was linearly transformed to have zero mean and fall within the range [-3,3] in both components. \n\n[Plot: percent correct against window size for the original (orig) and modified (mod) algorithms.] \n\nFigure 3: The performance of LVQ2.1 with and without the modifications (normalized step size and decreasing window) for 3 different conditions. The legend gives, in order, [the algorithm type / the number of codebook vectors / the number of pattern presentations]. 
\n\nFigure 3 shows the improvement achieved with normalization and a linearly decreasing window size for three sets of experiments: (20 codebook vectors/40000 pattern presentations), (20 codebook vectors/4000 pattern presentations) and (100 codebook vectors/40000 pattern presentations). For the decreasing window algorithm, the x-axis represents the window size in the middle of the run. As above, the values plotted were averaged over three runs. The values of $a(1)$ were the same within each algorithm over all three conditions. A graph using the best $a$ found for each condition separately is almost identical. The graph shows that the modifications provide a modest but consistent improvement in accuracy across the conditions. \n\nIn summary, the preliminary experiments indicate that a decreasing window and normalized step size can be worthwhile additions to the LVQ2.1 algorithm, and further experiments on the generalization properties of the algorithm and with other data sets may be warranted. For these tests we used a linear decrease of the window size and learning rate to allow for easier comparison with the LVQ2.1 algorithm. Further modifications to the algorithm that experiment with different functions (that obey the theoretical constraints) for the learning rate and window size decrease may result in even better performance. \n\n5  Summary \n\nWe have shown that Kohonen's LVQ2.1 algorithm can be considered as a variant on a generalization of an algorithm which is optimal for a one-dimensional, two-codebook-vector problem. We added a decreasing window and normalized step size, suggested by the one-dimensional algorithm, to the LVQ2.1 algorithm and found a small but consistent improvement in accuracy. 
\n\nAcknowledgements \n\nWe would like to thank Steven Nowlan for his many helpful suggestions on an earlier draft and for making the vowel formant data available to us. We are also grateful to Leonidas Kontothanassis for his help in coding and discussion. This work was supported by a grant from the Human Frontier Science Program and a Canadian NSERC 1967 Science and Engineering Scholarship to the first author, who also received a NIPS travel grant to attend the conference. \n\nReferences \n\n[Grossberg, 1976]  Stephen Grossberg, \"Adaptive Pattern Classification and Universal Recoding: I. Parallel Development and Coding of Neural Feature Detectors,\" Biological Cybernetics, 23:121-134, 1976. \n\n[Kohonen, 1982]  Teuvo Kohonen, \"Self-Organized Formation of Topologically Correct Feature Maps,\" Biological Cybernetics, 43:59-69, 1982. \n\n[Kohonen, 1986]  Teuvo Kohonen, \"Learning Vector Quantization for Pattern Recognition,\" Technical Report TKK-F-A601, Helsinki University of Technology, Department of Technical Physics, Laboratory of Computer and Information Science, November 1986. \n\n[Kohonen, 1990]  Teuvo Kohonen, \"Statistical Pattern Recognition Revisited,\" In R. Eckmiller, editor, Advanced Neural Computers, pages 137-144. Elsevier Science Publishers, 1990. \n\n[Kohonen, 1991]  Teuvo Kohonen, \"Self-Organizing Maps: Optimization Approaches,\" In T. Kohonen, K. Makisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 981-990. Elsevier Science Publishers, 1991. \n\n[Robbins and Monro, 1951]  Herbert Robbins and Sutton Monro, \"A Stochastic Approximation Method,\" Annals of Math. Stat., 22:400-407, 1951. \n\n[Rumelhart and Zipser, 1986]  D. E. Rumelhart and D. Zipser, \"Feature Discovery by Competitive Learning,\" In David E. Rumelhart, James L. 
McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 151-193. MIT Press, 1986. \n\n[Sklansky and Wassel, 1981]  Jack Sklansky and Gustav N. Wassel, Pattern Classifiers and Trainable Machines, Springer-Verlag, 1981. \n\n[Wassel and Sklansky, 1972]  Gustav N. Wassel and Jack Sklansky, \"Training a One-Dimensional Classifier to Minimize the Probability of Error,\" IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(4):533-541, 1972. \n\n", "award": [], "sourceid": 663, "authors": [{"given_name": "Virginia", "family_name": "de", "institution": null}, {"given_name": "Dana", "family_name": "Ballard", "institution": null}]}