{"title": "Generalized Learning Vector Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 429, "abstract": null, "full_text": "Generalized Learning Vector Quantization

Atsushi Sato & Keiji Yamada
Information Technology Research Laboratories, NEC Corporation
1-1, Miyazaki 4-chome, Miyamae-ku, Kawasaki, Kanagawa 216, Japan
E-mail: {asato.yamada}@pat.cl.nec.co.jp

Abstract

We propose a new learning method, "Generalized Learning Vector Quantization (GLVQ)," in which reference vectors are updated based on the steepest descent method in order to minimize the cost function. The cost function is determined so that the obtained learning rule satisfies the convergence condition. We prove that Kohonen's rule as used in LVQ does not satisfy the convergence condition and thus degrades recognition ability. Experimental results for printed Chinese character recognition reveal that GLVQ is superior to LVQ in recognition ability.

1 INTRODUCTION

Artificial neural network models have been applied to character recognition with good results for small-set characters such as alphanumerics (Le Cun et al., 1989) (Yamada et al., 1989). However, applying these models to large-set characters such as Japanese or Chinese characters is difficult, because most of the models are based on the Multi-Layer Perceptron (MLP) with the back-propagation algorithm, which suffers from local minima as well as requiring a large amount of computation.

Classification methods based on pattern matching have commonly been used for large-set character recognition.
Learning Vector Quantization (LVQ) has been studied as a way to generate optimal reference vectors because of its simple and fast learning algorithm (Kohonen, 1989; 1995). However, one problem with LVQ is that the reference vectors diverge and thus degrade recognition ability. Much work has been done on improving LVQ (Lee & Song, 1993) (Miyahara & Yoda, 1993) (Sato & Tsukumo, 1994), but the problem remains unsolved.

Recently, a generalization of Simple Competitive Learning (SCL) has been under study (Pal et al., 1993) (Gonzalez et al., 1995), and an unsupervised learning rule has been derived based on the steepest descent method to minimize the cost function. Pal et al. call their model "Generalized Learning Vector Quantization," but it is not a generalization of Kohonen's LVQ.

In this paper, we propose a new learning method for supervised learning, in which reference vectors are updated based on the steepest descent method to minimize the cost function. This method is a generalization of Kohonen's LVQ, so we call it "Generalized Learning Vector Quantization (GLVQ)." The cost function is determined so that the obtained learning rule satisfies the convergence condition. We prove that Kohonen's rule as used in LVQ does not satisfy the convergence condition and thus degrades recognition ability. Preliminary experiments revealed that non-linearity in the cost function is very effective for improving recognition ability. Printed Chinese character recognition experiments were carried out, and we show that the recognition ability of GLVQ is very high compared with LVQ.

2 REVIEW OF LVQ

Assume that a number of reference vectors w_k are placed in the input space. Usually, several reference vectors are assigned to each class.
An input vector x is decided to belong to the same class as its nearest reference vector. Let w_k(t) represent the sequence of w_k in the discrete-time domain. Several LVQ algorithms have been proposed (Kohonen, 1995), but in this section we focus on LVQ2.1. Starting with properly defined initial values, the reference vectors are updated by the LVQ2.1 algorithm as follows:

    w_i(t+1) = w_i(t) - α(t)(x - w_i(t)),   (1)
    w_j(t+1) = w_j(t) + α(t)(x - w_j(t)),   (2)

where 0 < α(t) < 1, and α(t) may decrease monotonically with time. The two reference vectors w_i and w_j are the nearest to x; x and w_j belong to the same class, while x and w_i belong to different classes. Furthermore, x must fall into the "window," which is defined around the midplane of w_i and w_j. That is, w_i and w_j are updated only if the following condition is satisfied:

    min(d_i/d_j, d_j/d_i) > s,   (3)

where d_i = |x - w_i| and d_j = |x - w_j|. The LVQ2.1 algorithm is based on the idea of shifting the decision boundaries toward the Bayes limits with attractive and repulsive forces from x. However, no attention is given to what might happen to the location of the w_k, so the reference vectors diverge in the long run. LVQ3 has been proposed to ensure that the reference vectors continue approximating the class distributions, but it must be noted that if only one reference vector is assigned to each class, LVQ3 is the same as LVQ2.1, and the problem of reference-vector divergence remains unsolved.

3 GENERALIZED LVQ

To ensure that the reference vectors continue approximating the class distributions, we propose a new learning method based on minimizing a cost function.
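As a concrete illustration of the LVQ2.1 step reviewed in Section 2 (Eqs. (1)-(3)), the following sketch is ours, not the paper's: the function name, the NumPy layout of the reference vectors, and the parameter values are assumptions. For brevity it picks the nearest vector of each group rather than verifying that w_i and w_j are the two overall-nearest vectors.

```python
import numpy as np

def lvq21_step(x, label, protos, proto_labels, alpha=0.05, s=0.65):
    """One LVQ2.1 update. protos holds the reference vectors w_k row-wise."""
    d = np.linalg.norm(protos - x, axis=1)          # d_k = |x - w_k|
    same = proto_labels == label
    j = np.flatnonzero(same)[np.argmin(d[same])]    # nearest same-class vector w_j
    i = np.flatnonzero(~same)[np.argmin(d[~same])]  # nearest other-class vector w_i
    di, dj = d[i], d[j]
    if min(di / dj, dj / di) > s:                   # window condition, Eq. (3)
        protos[i] -= alpha * (x - protos[i])        # repel w_i, Eq. (1)
        protos[j] += alpha * (x - protos[j])        # attract w_j, Eq. (2)
    return protos
```

Note that both updates move the vectors along (x - w); only the sign differs, which is the source of the divergence discussed below.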
Let w_1 be the nearest reference vector that belongs to the same class as x, and likewise let w_2 be the nearest reference vector that belongs to a different class from x. Let us consider the relative distance difference μ(x) defined as follows:

    μ(x) = (d_1 - d_2) / (d_1 + d_2),   (4)

where d_1 and d_2 are the distances of x from w_1 and w_2, respectively. μ(x) ranges between -1 and +1, and if μ(x) is negative, x is classified correctly; otherwise, x is classified incorrectly. In order to improve the error rate, μ(x) should decrease for all input vectors. Thus, a criterion for learning is formulated as the minimization of a cost function S defined by

    S = Σ_{i=1}^{N} f(μ(x_i)),   (5)

where N is the number of input vectors for training, and f(μ) is a monotonically increasing function. To minimize S, w_1 and w_2 are updated based on the steepest descent method with a small positive constant α as follows:

    w_i ← w_i - α ∂S/∂w_i,  i = 1, 2.   (6)

If the squared Euclidean distance, d_i = |x - w_i|^2, is used, we obtain the following:

    ∂S/∂w_1 = (∂f/∂μ)(∂μ/∂d_1)(∂d_1/∂w_1) = -(∂f/∂μ) 4 d_2/(d_1 + d_2)^2 (x - w_1),   (7)
    ∂S/∂w_2 = (∂f/∂μ)(∂μ/∂d_2)(∂d_2/∂w_2) = +(∂f/∂μ) 4 d_1/(d_1 + d_2)^2 (x - w_2).   (8)

Therefore, GLVQ's learning rule can be described as follows (absorbing the constant factor 4 into α):

    w_1 ← w_1 + α (∂f/∂μ) d_2/(d_1 + d_2)^2 (x - w_1),   (9)
    w_2 ← w_2 - α (∂f/∂μ) d_1/(d_1 + d_2)^2 (x - w_2).   (10)

Let us discuss the meaning of f(μ). ∂f/∂μ is a kind of gain factor for updating, and its value depends on x. In other words, ∂f/∂μ is a weight for each x.
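The relative distance difference of Eq. (4) and the update rule of Eqs. (9)-(10) can be sketched as follows. This is our illustrative Python, not the paper's code; the gain factor ∂f/∂μ is simply passed in as a number here.

```python
import numpy as np

def mu(x, w1, w2):
    """Relative distance difference of Eq. (4), with squared Euclidean distances."""
    d1 = np.sum((x - w1) ** 2)
    d2 = np.sum((x - w2) ** 2)
    return (d1 - d2) / (d1 + d2)                    # in (-1, +1); negative = correct

def glvq_step(x, w1, w2, dfdmu, alpha=0.01):
    """One GLVQ update, Eqs. (9)-(10). w1 is the nearest same-class reference
    vector, w2 the nearest other-class one; dfdmu is df/dmu evaluated at mu(x)."""
    d1 = np.sum((x - w1) ** 2)
    d2 = np.sum((x - w2) ** 2)
    denom = (d1 + d2) ** 2
    new_w1 = w1 + alpha * dfdmu * d2 / denom * (x - w1)   # attractive force
    new_w2 = w2 - alpha * dfdmu * d1 / denom * (x - w2)   # repulsive force
    return new_w1, new_w2
```

A single step moves w_1 toward x and w_2 away from it, so μ(x) decreases, which is the per-sample descent on S.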
To decrease the error rate, it is effective to update the reference vectors mainly by input vectors around the class boundaries, so that the decision boundaries are shifted toward the Bayes limits. Accordingly, f(μ) should be a non-linear monotonically increasing function, and classification ability is considered to depend on the definition of f(μ). In this paper, ∂f/∂μ = f(μ,t){1 - f(μ,t)} was used in the experiments, where t is the learning time and f(μ,t) is the sigmoid function 1/(1 + e^(-μt)). In this case, ∂f/∂μ has a single peak at μ = 0, and the peak becomes narrower as t increases, so the input vectors that affect learning are gradually restricted to those around the decision boundaries.

Let us discuss the meaning of μ. w_1 and w_2 are updated by attractive and repulsive forces from x, respectively, as shown in Eqs. (9) and (10), and the quantities of updating, |Δw_1| and |Δw_2|, depend on the derivatives of μ. The reference vectors will converge to the equilibrium states defined by the attractive and repulsive forces, so the convergence property is considered to depend on the definition of μ.

4 DISCUSSION

First, we show that the conventional LVQ algorithms can be derived within the framework of GLVQ. If μ = d_1 for d_1 < d_2, μ = -d_2 for d_1 > d_2, and f(μ) = μ, the cost function is written as S = Σ_{d_1<d_2} d_1 - Σ_{d_1>d_2} d_2. Then we obtain the following:

    w_1 ← w_1 + α(x - w_1),  w_2 ← w_2   for d_1 < d_2,   (11)
    w_2 ← w_2 - α(x - w_2),  w_1 ← w_1   for d_1 > d_2.   (12)

This learning algorithm is the same as LVQ1. If μ = d_1 - d_2, with f(μ) = μ for |μ| < s and f(μ) = const for |μ| > s, the cost function is written as S = Σ_{|μ|<s} (d_1 - d_2) + C.
Then we obtain the following:

    w_1 ← w_1 + α(x - w_1),   (13)
    w_2 ← w_2 - α(x - w_2),   (14)

if |μ| < s (x falls into the window). In this case, w_1 and w_2 are updated simultaneously, and this learning algorithm is the same as LVQ2.1. So it can be said that GLVQ is a generalized model that includes the conventional LVQs.

Next, we discuss the convergence condition. We can obtain other learning algorithms by defining a different cost function, but it must be noted that the convergence property depends on the definition of the cost function. The main difference between GLVQ and LVQ2.1 is the definition of μ: μ = (d_1 - d_2)/(d_1 + d_2) in GLVQ, while μ = d_1 - d_2 in LVQ2.1. Why do the reference vectors diverge in LVQ2.1, while they converge in GLVQ, as shown later? In order to clarify the convergence condition, let us consider the following learning rule:

    w_1 ← w_1 + α |x - w_2|^k (x - w_1),   (15)
    w_2 ← w_2 - α |x - w_1|^k (x - w_2).   (16)

Here, |Δw_1| and |Δw_2| are the quantities of updating by the attractive and the repulsive forces, respectively. The ratio of the two is calculated as follows:

    |Δw_1| / |Δw_2| = α|x - w_2|^k |x - w_1| / (α|x - w_1|^k |x - w_2|) = |x - w_2|^(k-1) / |x - w_1|^(k-1).   (17)

If the initial values of the reference vectors are properly defined, most x's will satisfy |x - w_1| < |x - w_2|. Therefore, if k > 1, the attractive force is greater than the repulsive force, and the reference vectors will converge, because the attractive forces come from x's that belong to the same class as w_1. In GLVQ, k = 2 as shown in Eqs. (9) and (10), so the vectors will converge, while they diverge in LVQ2.1 because k = 0. According to the above discussion, we can use d_i/(d_1 + d_2), or just d_i, instead of d_i/(d_1 + d_2)^2 in Eqs. (9) and (10).
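The convergence argument of Eqs. (15)-(17) can be checked numerically. The sketch below is ours; the toy two-class data, the constants, and the seed are assumptions that loosely mimic the preliminary experiment of Section 5.1. For k = 0 the distance between the two reference vectors grows steadily, while for k = 2 it stays near its initial value.

```python
import numpy as np

def final_distance(k, steps=3000, alpha=0.001, seed=0):
    """Train one reference vector per class with the rule of Eqs. (15)-(16)
    and return the final distance between the two vectors."""
    rng = np.random.default_rng(seed)
    w = {"A": np.array([0.3, 0.5]), "B": np.array([0.7, 0.5])}
    centers = {"A": np.array([0.3, 0.5]), "B": np.array([0.7, 0.5])}
    for _ in range(steps):
        c = "A" if rng.random() < 0.5 else "B"      # draw a class at random
        o = "B" if c == "A" else "A"
        x = rng.normal(centers[c], 0.1)             # sample an input from class c
        w1, w2 = w[c], w[o]                         # same-class / other-class vectors
        w[c] = w1 + alpha * np.linalg.norm(x - w2) ** k * (x - w1)  # attract, Eq. (15)
        w[o] = w2 - alpha * np.linalg.norm(x - w1) ** k * (x - w2)  # repel,  Eq. (16)
    return np.linalg.norm(w["A"] - w["B"])
```

With k = 0 the repulsive push does not decay with distance, so the vectors drift apart at a constant rate; with k = 2 the repulsion is damped by the small same-class distance and the vectors stay put.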
Replacing the coefficient in this way does not affect the convergence condition. The essential problem in LVQ2.1 results from the drawback of Kohonen's rule with k = 0. In other words, the cost function used in LVQ is not determined so that the obtained learning rule satisfies the convergence condition.

5 EXPERIMENTS

5.1 PRELIMINARY EXPERIMENTS

The experimental results using Eqs. (15) and (16) with α = 0.001, shown in Fig. 1, support the above discussion on the convergence condition. Two-dimensional input vectors with two classes, shown in Fig. 1(a), were used in the experiments. The ideal decision boundary that minimizes the error rate is shown by the broken line. One reference vector was assigned to each class, with initial values (x, y) = (0.3, 0.5) for Class A and (x, y) = (0.7, 0.5) for Class B. Figure 1(b) shows the distance between the two reference vectors during learning. The distance remains the same for k > 1, while it increases with time for k ≤ 1; that is, the reference vectors diverge.

Figure 1: Experimental results that support the discussion on the convergence condition with one reference vector for each class. (a) Input vectors used in the experiments; the broken line shows the ideal decision boundary. (b) Distance between the two reference vectors for each value of k (0.0, 0.5, 1.0, 1.5, 2.0) during learning. The distance remains the same for k > 1, while it diverges for k ≤ 1.

Figure 2 shows the experimental results from GLVQ for linearly non-separable patterns compared with LVQ2.1. The input vectors shown in Fig. 2(a) were obtained by shifting all input vectors shown in Fig. 1(a) to the right by |y - 0.5|. The ideal decision boundary that minimizes the error rate is shown by the broken line. Two reference vectors were assigned to each class, with initial values (x, y) = (0.3, 0.4) and (0.3, 0.6) for Class A, and (x, y) = (0.7, 0.4) and (0.7, 0.6) for Class B. The gain factor α was 0.004 in GLVQ and LVQ2.1, and the window parameter s in LVQ2.1 was 0.8 in the experiments.

Figure 2(b) shows the number of error counts for all the input vectors during learning. GLVQ(NL) shows the results by GLVQ with a non-linear function, that is, ∂f/∂μ = f(μ,t){1 - f(μ,t)}. The number of error counts decreased with time to the minimum determined by the Bayes limit. GLVQ(L) shows the results by GLVQ with a linear function, that is, ∂f/∂μ = 1. The number of error counts did not decrease to the minimum. This indicates that non-linearity of the cost function is very effective for improving recognition ability. Results using LVQ2.1 show that the number of error counts decreased in the beginning, but overall increased gradually with time. The degradation in recognition ability results from the divergence of the reference vectors, as mentioned earlier.

5.2 CHARACTER RECOGNITION EXPERIMENTS

Printed Chinese character recognition experiments were carried out to examine the performance of GLVQ. Thirteen kinds of printed fonts with 500 classes were used in the experiments.
The total number of characters was 13,000; half were used as training data, and the other half as test data. As input vectors, 256-dimensional orientation features were used (Hamanaka et al., 1993). Only one reference vector was assigned to each class, and the initial values were defined by averaging the training data for each class.

Recognition results for the test data are tabulated in Table 1 and compared with other methods. TM is the template matching method using mean vectors. LVQ2 is the earlier version of LVQ2.1; its learning algorithm is the same as the LVQ2.1 described in Section 2, but d_i must be less than d_j. The gain factor α was 0.05, and the window parameter s was 0.65 in the experiments. The experimental result by LVQ3 was the same as that by LVQ2.1, because only one reference vector was assigned to each class. IVQ (Improved Vector Quantization) is our previous model based on Kohonen's rule (Sato & Tsukumo, 1994).

Figure 2: Experimental results for linearly non-separable patterns with two reference vectors for each class. (a) Input vectors used in the experiments. The broken line shows the ideal decision boundary. (b) The number of error counts during learning. GLVQ(NL) and GLVQ(L) denote the proposed method using a non-linear and a linear function in the cost function, respectively. This shows that non-linearity of the cost function is very effective for improving classification ability.

Table 1: Experimental results for printed Chinese character recognition compared with other methods.

    Methods  | Error rates (%)
    TM^1     | 0.23
    LVQ2^2   | 0.18
    LVQ2.1   | 0.11
    IVQ^3    | 0.08
    GLVQ     | 0.05

    ^1 Template matching using mean vectors.
    ^2 The earlier version of LVQ2.1.
    ^3 Our previous model (Improved Vector Quantization).

The error rate was extremely low for GLVQ, and a recognition rate of 99.95% was obtained. Ambiguous results can be rejected by thresholding the value of μ(x). If input vectors with μ(x) ≥ -0.02 were rejected, a recognition rate of 100% would be obtained, with a rejection rate of 0.08% for this experiment.

6 CONCLUSION

We proposed Generalized Learning Vector Quantization as a new learning method. We formulated the criterion for learning as the minimization of a cost function, and obtained the learning rule based on the steepest descent method. GLVQ is a generalized method that includes LVQ. We discussed the convergence condition and showed that the convergence property depends on the definition of the cost function. We proved that the essential problem of the divergence of the reference vectors in LVQ2.1 results from a drawback of Kohonen's rule, which does not satisfy the convergence condition. Preliminary experiments revealed that non-linearity in the cost function is very effective for improving recognition ability. We carried out printed Chinese character recognition experiments and obtained a recognition rate of 99.95%. The experimental results revealed that GLVQ is superior to the conventional LVQ algorithms.

Acknowledgements

We are indebted to Mr.
Jun Tsukumo and our colleagues in the Pattern Recognition Research Laboratory for their helpful cooperation.

References

Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network," Neural Information Processing Systems 2, pp. 396-404 (1989).

K. Yamada, H. Kami, J. Tsukumo, and T. Temma, "Handwritten Numeral Recognition by Multi-Layered Neural Network with Improved Learning Algorithm," Proc. of the International Joint Conference on Neural Networks 89, Vol. 2, pp. 259-266 (1989).

T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Springer-Verlag (1989).

T. Kohonen, "LVQ_PAK Version 3.1 - The Learning Vector Quantization Program Package," LVQ Programming Team of the Helsinki University of Technology (1995).

S. W. Lee and H. H. Song, "Optimal Design of Reference Models Using Simulated Annealing Combined with an Improved LVQ3," Proc. of the International Conference on Document Analysis and Recognition, pp. 244-249 (1993).

K. Miyahara and F. Yoda, "Printed Japanese Character Recognition Based on Multiple Modified LVQ Neural Network," Proc. of the International Conference on Document Analysis and Recognition, pp. 250-253 (1993).

A. Sato and J. Tsukumo, "A Criterion for Training Reference Vectors and Improved Vector Quantization," Proc. of the International Conference on Neural Networks, Vol. 1, pp. 161-166 (1994).

N. R. Pal, J. C. Bezdek, and E. C.-K. Tsao, "Generalized Clustering Networks and Kohonen's Self-organizing Scheme," IEEE Trans. on Neural Networks, Vol. 4, No. 4, pp. 549-557 (1993).

A. I. Gonzalez, M. Graña, and A. D'Anjou, "An Analysis of the GLVQ Algorithm," IEEE Trans. on Neural Networks, Vol. 6, No. 4, pp. 1012-1016 (1995).
M. Hamanaka, K. Yamada, and J. Tsukumo, "On-Line Japanese Character Recognition Experiments by an Off-Line Method Based on Normalization-Cooperated Feature Extraction," Proc. of the International Conference on Document Analysis and Recognition, pp. 204-207 (1993).
", "award": [], "sourceid": 1113, "authors": [{"given_name": "Atsushi", "family_name": "Sato", "institution": null}, {"given_name": "Keiji", "family_name": "Yamada", "institution": null}]}