{"title": "Generalized Learning Vector Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 429, "abstract": null, "full_text": "Generalized Learning Vector \n\nQuantization \n\nAtsushi Sato & Keiji Yamada \n\nInformation Technology Research Laboratories, \n\nNEC Corporation \n\n1-1, Miyazaki 4-chome, Miyamae-ku, \n\nKawasaki, Kanagawa 216, Japan \n\nE-mail: {asato.yamada}@pat.cl.nec.co.jp \n\nAbstract \n\nWe propose a new learning method, \"Generalized Learning Vec(cid:173)\ntor Quantization (GLVQ),\" in which reference vectors are updated \nbased on the steepest descent method in order to minimize the cost \nfunction . The cost function is determined so that the obtained \nlearning rule satisfies the convergence condition. We prove that \nKohonen's rule as used in LVQ does not satisfy the convergence \ncondition and thus degrades recognition ability. Experimental re(cid:173)\nsults for printed Chinese character recognition reveal that GLVQ \nis superior to LVQ in recognition ability. \n\n1 \n\nINTRODUCTION \n\nArtificial neural network models have been applied to character recognition with \ngood results for small-set characters such as alphanumerics (Le Cun et aI., 1989) \n(Yamada et al., 1989). However, applying the models to large-set characters such \nas Japanese or Chinese characters is difficult because most of the models are based \non Multi-Layer Perceptron (MLP) with the back propagation algorithm, which has \na problem in regard to local minima as well as requiring a lot of calculation. \n\nClassification methods based on pattern matching have commonly been used for \nlarge-set character recognition. Learning Vector Quantization (LVQ) has been stud(cid:173)\nied to generate optimal reference vectors because of its simple and fast learning al(cid:173)\ngorithm (Kohonen, 1989; 1995). However, one problem with LVQ is that reference \nvectors diverge and thus degrade recognition ability. 
Much work has been done on improving LVQ (Lee & Song, 1993) (Miyahara & Yoda, 1993) (Sato & Tsukumo, 1994), but the problem remains unsolved.

Recently, a generalization of Simple Competitive Learning (SCL) has been under study (Pal et al., 1993) (Gonzalez et al., 1995), and an unsupervised learning rule has been derived based on the steepest descent method to minimize a cost function. Pal et al. call their model \"Generalized Learning Vector Quantization,\" but it is not a generalization of Kohonen's LVQ.

In this paper, we propose a new supervised learning method in which reference vectors are updated based on the steepest descent method to minimize the cost function. Because it is a generalization of Kohonen's LVQ, we call it \"Generalized Learning Vector Quantization (GLVQ).\" The cost function is determined so that the obtained learning rule satisfies the convergence condition. We prove that Kohonen's rule as used in LVQ does not satisfy the convergence condition and thus degrades recognition ability. Preliminary experiments revealed that non-linearity in the cost function is very effective for improving recognition ability. Printed Chinese character recognition experiments show that the recognition ability of GLVQ is much higher than that of LVQ.

2 REVIEW OF LVQ

Assume that a number of reference vectors w_k are placed in the input space; usually, several reference vectors are assigned to each class. An input vector x is decided to belong to the class of its nearest reference vector. Let w_k(t) denote the sequence of w_k in the discrete-time domain. Several LVQ algorithms have been proposed (Kohonen, 1995), but in this section we focus on LVQ2.1. 
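A compact sketch of one LVQ2.1 step, written out before the equations below, may help make the update concrete. This is our own minimal illustration (the function and variable names are ours, not the paper's code):

```python
import numpy as np

def lvq21_step(x, w_i, w_j, alpha, s):
    """One LVQ2.1 update: w_i is the nearest reference vector of a
    wrong class, w_j the nearest one of the correct class.  Both are
    moved only if x falls into the window around their midplane."""
    d_i = np.linalg.norm(x - w_i)
    d_j = np.linalg.norm(x - w_j)
    if min(d_i / d_j, d_j / d_i) > s:   # window condition
        w_i = w_i - alpha * (x - w_i)   # repel the wrong-class vector
        w_j = w_j + alpha * (x - w_j)   # attract the correct-class vector
    return w_i, w_j
```

With a window parameter s close to 1, only samples near the midplane of the two vectors trigger an update.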
Starting with properly defined initial values, the reference vectors are updated as follows by the LVQ2.1 algorithm:

w_i(t+1) = w_i(t) - \alpha(t)(x - w_i(t)),   (1)
w_j(t+1) = w_j(t) + \alpha(t)(x - w_j(t)),   (2)

where 0 < \alpha(t) < 1, and \alpha(t) may decrease monotonically with time. The two reference vectors w_i and w_j are the nearest to x; x and w_j belong to the same class, while x and w_i belong to different classes. Furthermore, x must fall into the \"window,\" which is defined around the midplane of w_i and w_j. That is, w_i and w_j are updated only if the following condition is satisfied:

min(d_i/d_j, d_j/d_i) > s,   (3)

where d_i = |x - w_i| and d_j = |x - w_j|. The LVQ2.1 algorithm is based on the idea of shifting the decision boundaries toward the Bayes limits with attractive and repulsive forces from x. However, no attention is given to what might happen to the location of the w_k, so the reference vectors diverge in the long run. LVQ3 has been proposed to ensure that the reference vectors continue approximating the class distributions, but if only one reference vector is assigned to each class, LVQ3 is the same as LVQ2.1, and the problem of reference vector divergence remains unsolved.

3 GENERALIZED LVQ

To ensure that the reference vectors continue approximating the class distributions, we propose a new learning method based on minimizing a cost function. Let w_1 be the nearest reference vector that belongs to the same class as x, and likewise let w_2 be the nearest reference vector that belongs to a different class from x. Consider the relative distance difference \mu(x) defined by

\mu(x) = (d_1 - d_2) / (d_1 + d_2),   (4)

where d_1 and d_2 are the distances of x from w_1 and w_2, respectively. \mu(x) ranges between -1 and +1; if \mu(x) is negative, x is classified correctly, otherwise x is classified incorrectly. 
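The sign test on \mu(x) is easy to state in code. A small sketch (the helper name is ours; refs is an array of reference vectors, labels their classes):

```python
import numpy as np

def mu(x, refs, labels, x_label):
    """Relative distance difference of Eq. (4): d1 is the distance to
    the nearest reference vector of x's own class, d2 the distance to
    the nearest one of any other class.  mu < 0 means x is classified
    correctly by the nearest-reference rule."""
    d = np.linalg.norm(refs - x, axis=1)
    d1 = d[labels == x_label].min()
    d2 = d[labels != x_label].min()
    return (d1 - d2) / (d1 + d2)
```

The same quantity is used again in Section 5.2, where ambiguous inputs are rejected by thresholding \mu(x).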
In order to improve the error rate, \mu(x) should decrease for all input vectors. Thus, a criterion for learning is formulated as the minimization of a cost function S defined by

S = \sum_{i=1}^{N} f(\mu(x_i)),   (5)

where N is the number of input vectors for training and f(\mu) is a monotonically increasing function. To minimize S, w_1 and w_2 are updated based on the steepest descent method with a small positive constant \alpha as follows:

w_i \leftarrow w_i - \alpha \, \partial S/\partial w_i,   i = 1, 2.   (6)

If the squared Euclidean distance d_i = |x - w_i|^2 is used, we obtain

\partial S/\partial w_1 = (\partial S/\partial\mu)(\partial\mu/\partial d_1)(\partial d_1/\partial w_1) = -(\partial f/\partial\mu) \cdot 4 d_2/(d_1 + d_2)^2 \cdot (x - w_1),   (7)
\partial S/\partial w_2 = (\partial S/\partial\mu)(\partial\mu/\partial d_2)(\partial d_2/\partial w_2) = +(\partial f/\partial\mu) \cdot 4 d_1/(d_1 + d_2)^2 \cdot (x - w_2).   (8)

Therefore, GLVQ's learning rule can be described as follows (with the constant factor absorbed into \alpha):

w_1 \leftarrow w_1 + \alpha \, (\partial f/\partial\mu) \cdot d_2/(d_1 + d_2)^2 \cdot (x - w_1),   (9)
w_2 \leftarrow w_2 - \alpha \, (\partial f/\partial\mu) \cdot d_1/(d_1 + d_2)^2 \cdot (x - w_2).   (10)

Let us discuss the meaning of f(\mu). \partial f/\partial\mu is a kind of gain factor for updating, and its value depends on x; in other words, \partial f/\partial\mu is a weight for each x. To decrease the error rate, it is effective to update the reference vectors mainly with input vectors around the class boundaries, so that the decision boundaries are shifted toward the Bayes limits. Accordingly, f(\mu) should be a non-linear monotonically increasing function, and classification ability is considered to depend on the definition of f(\mu). In this paper, \partial f/\partial\mu = f(\mu, t){1 - f(\mu, t)} was used in the experiments, where t is the learning time and f(\mu, t) is the sigmoid function 1/(1 + e^{-\mu t}). In this case, \partial f/\partial\mu has a single peak at \mu = 0, and the peak becomes narrower as t increases, so the input vectors that affect learning are gradually restricted to those around the decision boundaries.

Let us discuss the meaning of \mu. 
w_1 and w_2 are updated by attractive and repulsive forces from x, respectively, as shown in Eqs. (9) and (10), and the amounts of updating, |\Delta w_1| and |\Delta w_2|, depend on the derivatives of \mu. The reference vectors converge to the equilibrium states defined by the attractive and repulsive forces, so the convergence property is considered to depend on the definition of \mu.

4 DISCUSSION

First, we show that the conventional LVQ algorithms can be derived within the framework of GLVQ. If \mu = d_1 for d_1 < d_2, \mu = -d_2 for d_1 > d_2, and f(\mu) = \mu, the cost function is written as S = \sum_{d_1 < d_2} d_1 - \sum_{d_1 > d_2} d_2. Then, we can obtain the following:

w_1 \leftarrow w_1 + \alpha (x - w_1),  w_2 \leftarrow w_2   for d_1 < d_2,   (11)
w_2 \leftarrow w_2 - \alpha (x - w_2),  w_1 \leftarrow w_1   for d_1 > d_2.   (12)

This learning algorithm is the same as LVQ1. If \mu = d_1 - d_2, and f(\mu) = \mu for |\mu| < s while f(\mu) = const for |\mu| > s, the cost function is written as S = \sum_{|\mu| < s} (d_1 - d_2), and we obtain

w_1 \leftarrow w_1 + \alpha (x - w_1)   for |d_1 - d_2| < s,   (13)
w_2 \leftarrow w_2 - \alpha (x - w_2)   for |d_1 - d_2| < s,   (14)

which is the same learning algorithm as LVQ2.1. Next, let us discuss the convergence condition. Generalize the denominator in Eqs. (9) and (10) and consider the learning rule

w_1 \leftarrow w_1 + \alpha \, (\partial f/\partial\mu) \cdot d_2/(d_1 + d_2)^k \cdot (x - w_1),   (15)
w_2 \leftarrow w_2 - \alpha \, (\partial f/\partial\mu) \cdot d_1/(d_1 + d_2)^k \cdot (x - w_2).   (16)

If k > 1, the attractive force is greater than the repulsive force, and the reference vectors will converge, because the attractive forces come from x's that belong to the same class as w_1. In GLVQ, k = 2, as shown in Eqs. (9) and (10), and the vectors will converge, while they will diverge in LVQ2.1 because k = 0. According to the above discussion, we can use d_i/(d_1 + d_2) or just d_i instead of d_i/(d_1 + d_2)^2 in Eqs. (9) and (10); this correction does not affect the convergence condition. The essential problem in LVQ2.1 results from the drawback in Kohonen's rule with k = 0. In other words, the cost function used in LVQ is not determined so that the obtained learning rule satisfies the convergence condition.

5 EXPERIMENTS

5.1 PRELIMINARY EXPERIMENTS

The experimental results using Eqs. (15) and (16) with \alpha = 0.001, shown in Fig. 1, support the above discussion of the convergence condition. Two-dimensional input vectors from two classes, shown in Fig. 1(a), were used in the experiments. 
The ideal decision boundary that minimizes the error rate is shown by the broken line. One reference vector was assigned to each class, with initial values (x, y) = (0.3, 0.5) for Class A and (x, y) = (0.7, 0.5) for Class B. Figure 1(b) shows the distance between the two reference vectors during learning. The distance remains the same for k > 1, while it increases with time for k ≤ 1; that is, the reference vectors diverge.

Figure 1: Experimental results that support the discussion on the convergence condition, with one reference vector for each class. (a) Input vectors used in the experiments; the broken line shows the ideal decision boundary. (b) Distance between the two reference vectors during learning for k = 0.0, 0.5, 1.0, 1.5, and 2.0. The distance remains the same for k > 1, while it diverges for k ≤ 1.

Figure 2 shows the experimental results from GLVQ for linearly non-separable patterns, compared with LVQ2.1. The input vectors shown in Fig. 2(a) were obtained by shifting all input vectors shown in Fig. 1(a) to the right by |y - 0.5|. The ideal decision boundary that minimizes the error rate is again shown by the broken line. Two reference vectors were assigned to each class, with initial values (x, y) = (0.3, 0.4) and (0.3, 0.6) for Class A, and (x, y) = (0.7, 0.4) and (0.7, 0.6) for Class B. The gain factor \alpha was 0.004 in both GLVQ and LVQ2.1, and the window parameter s in LVQ2.1 was 0.8. 
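One training step of the rule exercised in these experiments can be sketched as follows. This is our illustrative reading, with the denominator exponent k exposed so that k = 2 reproduces Eqs. (9) and (10), with the sigmoid gain of Section 3, and with constant factors absorbed into alpha (the names are ours):

```python
import numpy as np

def glvq_step(x, w1, w2, alpha, t, k=2):
    """One update of w1 (nearest same-class vector) and w2 (nearest
    other-class vector) for input x, using squared Euclidean
    distances and the sigmoid gain df/dmu = f(1 - f) with
    f = 1 / (1 + exp(-mu * t)).  k = 2 gives the GLVQ rule; the
    preliminary experiments vary k."""
    d1 = np.sum((x - w1) ** 2)
    d2 = np.sum((x - w2) ** 2)
    mu = (d1 - d2) / (d1 + d2)
    f = 1.0 / (1.0 + np.exp(-mu * t))
    gain = f * (1.0 - f)               # peaked at mu = 0, i.e. the boundary
    denom = (d1 + d2) ** k
    w1 = w1 + alpha * gain * d2 / denom * (x - w1)   # attraction
    w2 = w2 - alpha * gain * d1 / denom * (x - w2)   # repulsion
    return w1, w2
```

As t grows, the gain f(1 - f) narrows around mu = 0, so late in training only near-boundary samples move the reference vectors.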
\n\nFigure 2(b) shows the number of error counts for all the input vectors during \nlearning. GLVQ(NL) shows results by GLVQ with a non-linear function; that is, \naf lap = f(p, t){1 - f(p, t)}. The number of error counts decreased with time to \nthe minimum determined by the Bayes limit. GLVQ(L) shows results by GLVQ \nwith a linear function; that is, a flap = 1. The number of error counts did not \ndecrease to the minimum. This indicates that non-linearity of the cost function is \nvery effective for improving recognition ability. Results using LVQ2.1 show that the \nnumber of error counts decreased in the beginning, but overall increased gradually \nwith time. The degradation in the recognition ability results from the divergence \nof the reference vectors, as we have mentioned earlier. \n\n5.2 CHARACTER RECOGNITION EXPERIMENTS \n\nPrinted Chinese character recognition experiments were carried out to examine the \nperformance of GLVQ. Thirteen kinds of printed fonts with 500 classes were used \nin the experiments. The total number of characters was 13,000; half of which were \nused as training data, and the other half were used as test data. As input vectors, \n256-dimensional orientation features were used (Hamanaka et al., 1993). Only one \nreference vector was assigned to each class, and their initial values were defined by \naveraging training data for each class. \n\nRecognition results for test data are tabulated in Table 1 compared with other \nmethods. TM is the template matching method using mean vectors. LVQ2 is the \nearlier version of LVQ2.1. The learning algorithm is the same as LVQ2.1 described \nin Section 2, but di must be less than dj. The gain factor 0: was 0.05, and the window \nparameter s was 0.65 in the experiments. The experimental result by LVQ3 was \n\n\fA,SATO.K.Y~DA \n\n1OO~---r----r----r----r---, \n\n140 \n\nGLVQ(NL) --(cid:173)\nGL VQ(L) -+--(cid:173)\nLVQ2,1 \u00b713 .. \u2022 \n\n428 \n\n1,0 \n\n0,8 \n\nc: 0,6 \n,Q \n.;i \n0 a. 
the same as that by LVQ2.1, because only one reference vector was assigned to each class. IVQ (Improved Vector Quantization) is our previous model based on Kohonen's rule (Sato & Tsukumo, 1994).

Figure 2: Experimental results for linearly non-separable patterns with two reference vectors for each class. (a) Input vectors used in the experiments; the broken line shows the ideal decision boundary. (b) The number of error counts during learning. GLVQ(NL) and GLVQ(L) denote the proposed method using a non-linear and a linear function in the cost function, respectively. This shows that the non-linearity of the cost function is very effective for improving classification ability.

Table 1: Experimental results for printed Chinese character recognition compared with other methods.

    Methods    Error rates (%)
    TM (1)     0.23
    LVQ2 (2)   0.18
    LVQ2.1     0.11
    IVQ (3)    0.08
    GLVQ       0.05

    (1) Template matching using mean vectors.
    (2) The earlier version of LVQ2.1.
    (3) Our previous model (Improved Vector Quantization).

The error rate was extremely low for GLVQ, and a recognition rate of 99.95% was obtained. Ambiguous results can be rejected by thresholding the value of \mu(x): if input vectors with \mu(x) ≥ -0.02 were rejected, a recognition rate of 100% would be obtained, with a rejection rate of 0.08% in this experiment.

6 CONCLUSION

We proposed Generalized Learning Vector Quantization as a new learning method. We formulated the criterion for learning as the minimization of a cost function, and obtained the learning rule based on the steepest descent method. GLVQ is a generalized method that includes LVQ. 
We discussed the convergence condition and showed that the convergence property depends on the definition of the cost function. We proved that the essential problem of the divergence of the reference vectors in LVQ2.1 results from a drawback of Kohonen's rule, which does not satisfy the convergence condition. Preliminary experiments revealed that non-linearity in the cost function is very effective for improving recognition ability. We carried out printed Chinese character recognition experiments and obtained a recognition rate of 99.95%. The experimental results revealed that GLVQ is superior to the conventional LVQ algorithms.

Acknowledgements

We are indebted to Mr. Jun Tsukumo and our colleagues in the Pattern Recognition Research Laboratory for their helpful cooperation.

References

Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, \"Handwritten Digit Recognition with a Back-Propagation Network,\" Neural Information Processing Systems 2, pp. 396-404 (1989).

K. Yamada, H. Kami, J. Tsukumo, and T. Temma, \"Handwritten Numeral Recognition by Multi-Layered Neural Network with Improved Learning Algorithm,\" Proc. of the International Joint Conference on Neural Networks 89, Vol. 2, pp. 259-266 (1989).

T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Springer-Verlag (1989).

T. Kohonen, \"LVQ-PAK Version 3.1 - The Learning Vector Quantization Program Package,\" LVQ Programming Team of the Helsinki University of Technology (1995).

S. W. Lee and H. H. Song, \"Optimal Design of Reference Models Using Simulated Annealing Combined with an Improved LVQ3,\" Proc. of the International Conference on Document Analysis and Recognition, pp. 244-249 (1993).

K. Miyahara and F. 
Yoda, \"Printed Japanese Character Recognition Based on Multiple Modified LVQ Neural Network,\" Proc. of the International Conference on Document Analysis and Recognition, pp. 250-253 (1993).

A. Sato and J. Tsukumo, \"A Criterion for Training Reference Vectors and Improved Vector Quantization,\" Proc. of the International Conference on Neural Networks, Vol. 1, pp. 161-166 (1994).

N. R. Pal, J. C. Bezdek, and E. C.-K. Tsao, \"Generalized Clustering Networks and Kohonen's Self-organizing Scheme,\" IEEE Trans. on Neural Networks, Vol. 4, No. 4, pp. 549-557 (1993).

A. I. Gonzalez, M. Graña, and A. D'Anjou, \"An Analysis of the GLVQ Algorithm,\" IEEE Trans. on Neural Networks, Vol. 6, No. 4, pp. 1012-1016 (1995).

M. Hamanaka, K. Yamada, and J. Tsukumo, \"On-Line Japanese Character Recognition Experiments by an Off-Line Method Based on Normalization-Cooperated Feature Extraction,\" Proc. of the International Conference on Document Analysis and Recognition, pp. 204-207 (1993).
", "award": [], "sourceid": 1113, "authors": [{"given_name": "Atsushi", "family_name": "Sato", "institution": null}, {"given_name": "Keiji", "family_name": "Yamada", "institution": null}]}