{"title": "Convergence of a Neural Network Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 839, "page_last": 845, "abstract": null, "full_text": "Convergence of a Neural Network Classifier

John S. Baras
Systems Research Center
University of Maryland
College Park, Maryland 20705

Anthony LaVigna
Systems Research Center
University of Maryland
College Park, Maryland 20705

Abstract

In this paper, we prove that the vectors in the LVQ learning algorithm converge. We do this by showing that the learning algorithm performs stochastic approximation. Convergence is then obtained by identifying the appropriate conditions on the learning rate and on the underlying statistics of the classification problem. We also present a modification to the learning algorithm which, we argue, results in convergence of the LVQ error to the Bayesian optimal error as the appropriate parameters become large.

1 Introduction

Learning Vector Quantization (LVQ) originated in the neural network community and was introduced by Kohonen (Kohonen [1986]). Extensive simulation studies reported in the literature demonstrate the effectiveness of LVQ as a classifier, and it has generated considerable interest because the training times associated with LVQ are significantly less than those associated with backpropagation networks.

In this paper we analyse the convergence properties of LVQ. Using a theorem from the stochastic approximation literature, we prove that the update algorithm converges under suitable conditions. We also present a modification to the algorithm which provides for more stable learning. Finally, we discuss the decision error associated with this "modified" LVQ algorithm.

2 A Review of Learning Vector Quantization

Let \{(x_i, d_{x_i})\}_{i=1}^{N} be the training data or past observation set. 
This means that x_i is observed when pattern d_{x_i} is in effect. We assume that the x_i's are statistically independent (this assumption can be relaxed). Let \theta_i be a Voronoi vector and let \Theta = \{\theta_1, \ldots, \theta_k\} be the set of Voronoi vectors. We assume that there are many more observations than Voronoi vectors (Duda & Hart [1973]). Once the Voronoi vectors are initialized, training proceeds by taking a sample (x_j, d_{x_j}) from the training set, finding the closest Voronoi vector and adjusting its value according to equations (1) and (2). After several passes through the data, the Voronoi vectors converge and training is complete.

Suppose \theta_c is the closest vector. Adjust \theta_c as follows:

\theta_c(n+1) = \theta_c(n) + \alpha_n (x_j - \theta_c(n))  if d_{\theta_c} = d_{x_j},   (1)

\theta_c(n+1) = \theta_c(n) - \alpha_n (x_j - \theta_c(n))  if d_{\theta_c} \neq d_{x_j}.   (2)

The other Voronoi vectors are not modified. This update has the effect that if x_j and \theta_c have the same decision then \theta_c is moved closer to x_j; however, if they have different decisions then \theta_c is moved away from x_j. The constants \{\alpha_n\} are positive and decreasing, e.g., \alpha_n = 1/n. We are concerned with the convergence properties of \Theta(n) and with the resulting detection error.

For ease of notation, we assume that there are only two pattern classes. The equations for the case of more than two pattern classes are given in (LaVigna [1989]).

3 Convergence of the Learning Algorithm

The LVQ algorithm has the general form

\theta_i(n+1) = \theta_i(n) + \alpha_n \gamma(d_{x_n}, d_{\theta_i}(n), x_n, \Theta_n) (x_n - \theta_i(n)),   (3)

where x_n is the currently chosen past observation. The function \gamma determines whether there is an update and what its sign should be, and is given by

\gamma(d_{x_n}, d_{\theta_i}, x_n, \Theta) = +1 if d_{x_n} = d_{\theta_i} and x_n \in V_{\theta_i}; -1 if d_{x_n} \neq d_{\theta_i} and x_n \in V_{\theta_i}; 0 otherwise.   (4)

Here V_{\theta_i} represents the set of points closest to \theta_i and is given by

V_{\theta_i} = \{ x \in \mathbb{R}^d : \|\theta_i - x\| < \|\theta_j - x\|, \ j \neq i \},  i = 1, \ldots, k.   (5)

The update in (3) is a stochastic approximation algorithm (Benveniste, Metivier & Priouret [1987]). 
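As a concrete illustration, the per-sample update in (1)-(2) can be sketched in a few lines of Python. This is a minimal sketch; the function and variable names are ours, not the paper's:

```python
import numpy as np

def lvq_update(theta, decisions, x, d_x, alpha):
    """One LVQ step, following eqs. (1)-(2): find the Voronoi vector
    closest to x, then move it toward x if its decision matches d_x,
    and away from x otherwise. All other vectors are left unchanged."""
    c = int(np.argmin(np.linalg.norm(theta - x, axis=1)))  # closest vector
    sign = 1.0 if decisions[c] == d_x else -1.0            # gamma of eq. (4)
    theta[c] = theta[c] + sign * alpha * (x - theta[c])
    return theta
```

For example, with two scalar vectors at 0 and 1 carrying decisions 1 and 2, an observation x = 0.2 of pattern 1 pulls the first vector toward it, while an observation x = 0.9 of pattern 1 pushes the second (wrongly labeled) vector away.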
It has the form

\Theta_{n+1} = \Theta_n + \alpha_n H(\Theta_n, Z_n),   (6)

where \Theta is the vector with components \theta_i; H(\Theta, z) is the vector with components defined in the obvious manner from (3); and Z_n = (x_n, d_{x_n}) is the random pair consisting of the observation and the associated true pattern number. If the appropriate conditions are satisfied by \alpha_n, H, and Z_n, then \Theta_n approaches the solution of

\frac{d}{dt} \bar{\Theta}(t) = h(\bar{\Theta}(t))   (7)

for the appropriate choice of h(\Theta).

For the two pattern case, we let p_1(x) represent the density for pattern 1 and \pi_1 represent its prior. Likewise for p_0(x) and \pi_0. It can be shown (Kohonen [1986]) that

h_i(\Theta) = \int_{V_{\theta_i}} (x - \theta_i) \, q_i(x) \, dx,   (8)

where

q_i(x) = \pi_1 p_1(x) - \pi_0 p_0(x) if d_{\theta_i} = 1, and q_i(x) = \pi_0 p_0(x) - \pi_1 p_1(x) if d_{\theta_i} = 0.   (9)

If the following hypotheses hold, then using techniques from (Benveniste, Metivier & Priouret [1987]) or (Kushner & Clark [1978]) we can prove the convergence theorem below:

[H.1] \{\alpha_n\} is a nonincreasing sequence of positive reals such that \sum_n \alpha_n = \infty and \sum_n \alpha_n^2 < \infty.

[H.2] Given d_{x_n}, the x_n are independent and distributed according to p_{d_{x_n}}(x).

[H.3] The pattern densities p_i(x) are continuous.

Theorem 1. Assume that [H.1]-[H.3] hold. Let \Theta^* be a locally asymptotically stable equilibrium point of (7) with domain of attraction D^*. Let Q be a compact subset of D^*. If \Theta_n \in Q for infinitely many n, then

\lim_{n \to \infty} \Theta_n = \Theta^* \quad a.s.   (10)

Proof: see (LaVigna [1989]).

Hence, if the initial locations and decisions of the Voronoi vectors are close to a locally asymptotically stable equilibrium of (7), and if they do not move too much, then the vectors converge.

Given the form of (8), one might try to use Lyapunov theory to prove convergence with

L(\Theta) = \sum_{i=1}^{K} \int_{V_{\theta_i}} \|x - \theta_i\|^2 \, q_i(x) \, dx   (11)

as a candidate Lyapunov function. This function will not work, as is demonstrated by the following calculation in the one-dimensional case. 
Suppose that K = 2 and \theta_1 < \theta_2. Then

\frac{\partial}{\partial \theta_1} L(\Theta) = -h_1(\Theta) + \left\| \frac{\theta_1 - \theta_2}{2} \right\|^2 q_1\left( \frac{\theta_1 + \theta_2}{2} \right).   (12)

Likewise,

\frac{\partial}{\partial \theta_2} L(\Theta) = -h_2(\Theta) - \left\| \frac{\theta_1 - \theta_2}{2} \right\|^2 q_1\left( \frac{\theta_1 + \theta_2}{2} \right).   (18)

Therefore

\frac{d}{dt} L(\Theta) = -h_1(\Theta)^2 - h_2(\Theta)^2 + \left\| \frac{\theta_1 - \theta_2}{2} \right\|^2 q_1\left( \frac{\theta_1 + \theta_2}{2} \right) \left( h_1(\Theta) - h_2(\Theta) \right).   (19)

In order for this to be a Lyapunov function, (19) would have to be strictly nonpositive, which is not the case. The problem with this candidate occurs because the integrand q_i(x) is not strictly positive, as is the case for ordinary vector quantization and adaptive K-means.

4 Modified LVQ Algorithm

The convergence results above require that the initial conditions be close to the stable points of (7) in order for the algorithm to converge. In this section we present a modification to the LVQ algorithm which increases the number of stable equilibria of equation (7) and hence increases the chances of convergence. First we present a simple example which emphasizes a defect of LVQ and suggests an appropriate modification to the algorithm.

Let o represent an observation from pattern 2 and let \triangle represent an observation from pattern 1. We assume that the observations are scalar. Figure 1 shows a possible distribution of observations.

Figure 1: A possible distribution of observations and two Voronoi vectors.

Suppose there are two Voronoi vectors \theta_1 and \theta_2 with decisions 1 and 2, respectively, initialized as shown in Figure 1. At each update of the LVQ algorithm, a point is picked at random from the observation set and the closest Voronoi vector is modified. We see that during this update it is possible for \theta_2(n) to be pushed towards +\infty and \theta_1(n) to be pushed towards -\infty; hence the Voronoi vectors may not converge.

Recall that during the update procedure in (3), the Voronoi cells are changed by changing the location of one Voronoi vector. 
After an update, the majority vote of the observations in each new Voronoi cell may not agree with the decision previously assigned to that cell. This discrepancy can cause the divergence of the algorithm. In order to prevent this from occurring, the decisions associated with the Voronoi vectors should be updated to agree with the majority vote of the observations that fall within their Voronoi cells. Let

g_i(\Theta; N) = 1 if \frac{1}{N} \sum_{j=1}^{N} 1_{\{y_j \in V_{\theta_i}\}} 1_{\{d_{y_j} = 1\}} > \frac{1}{N} \sum_{j=1}^{N} 1_{\{y_j \in V_{\theta_i}\}} 1_{\{d_{y_j} = 2\}}, and g_i(\Theta; N) = 2 otherwise.   (20)

Then g_i represents the decision of the majority vote of the observations falling in V_{\theta_i}. With this modification, the learning rule for \theta_i becomes

\theta_i(n+1) = \theta_i(n) + \alpha_n \gamma(d_{x_n}, g_i(\Theta_n; N), x_n, \Theta_n) (x_n - \theta_i(n)).   (21)

This equation has the same form as (3), with the function \tilde{H}(\Theta, z) defined in the obvious manner from (21) replacing H(\Theta, z).

The divergence in the example above happens because the decisions of the Voronoi vectors do not agree with the majority vote of the observations closest to each vector. As a result, the Voronoi vectors are pushed away from the origin. This phenomenon occurs even though the observation data is bounded. The point here is that if the decision associated with a Voronoi vector does not agree with the majority vote of the observations closest to that vector, then it is possible for the vector to diverge. A simple solution to this problem is to correct the decisions of all the Voronoi vectors after every adjustment so that their decisions correspond to the majority vote. In practice this correction would only be done during the beginning iterations of the learning algorithm, since that is when \alpha_n is large and the Voronoi vectors are moving around significantly. With this modification it is possible to show convergence to the Bayes optimal classifier (LaVigna [1989]) as the number of Voronoi vectors becomes large. 
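The majority-vote relabeling of (20) can be sketched as follows. This is a minimal sketch for the two-class case; the names are ours, and cells containing no observations are arbitrarily assigned class 2 here (a tie also yields class 2, matching the "otherwise" branch of (20)):

```python
import numpy as np

def majority_decisions(theta, obs, labels):
    """Relabel each Voronoi vector with the majority vote (class 1 vs
    class 2) of the observations falling in its Voronoi cell, as in
    eq. (20)."""
    labels = np.asarray(labels)
    # cells[j] = index of the Voronoi vector closest to observation j
    dists = np.linalg.norm(obs[:, None, :] - theta[None, :, :], axis=2)
    cells = np.argmin(dists, axis=1)
    new_decisions = []
    for i in range(len(theta)):
        in_cell = labels[cells == i]
        ones = int(np.sum(in_cell == 1))
        twos = int(np.sum(in_cell == 2))
        new_decisions.append(1 if ones > twos else 2)
    return new_decisions
```

Running this after each adjustment (or, as suggested above, only during the early iterations while \alpha_n is large) keeps each vector's decision consistent with the data it actually represents.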
5 Decision Error

In this section we discuss the error associated with the modified LVQ algorithm. Two results are discussed here. The first is a simple comparison between LVQ and the nearest neighbor algorithm. The second is that if the number of Voronoi vectors is allowed to go to infinity at an appropriate rate as the number of observations goes to infinity, then it is possible to construct a convergent estimator of the Bayes risk. That is, the error associated with LVQ can be made to approach the optimal error. As before, we concentrate on the binary pattern case for ease of notation.

5.1 Nearest Neighbor

If a Voronoi vector is assigned to each observation, then the LVQ algorithm reduces to the nearest neighbor algorithm. For that algorithm it was shown (Cover & Hart [1967]) that its asymptotic probability of error is less than twice the Bayes optimal error. More specifically, let r^* be the Bayes optimal risk and let r be the nearest neighbor risk. It was shown that

r^* \le r \le 2 r^* (1 - r^*) < 2 r^*.   (22)

Hence, in the case of no iteration, the risk associated with LVQ is that of the nearest neighbor algorithm.

5.2 Other Choices for Number of Voronoi Vectors

We saw above that if the number of Voronoi vectors equals the number of observations, then LVQ coincides with the nearest neighbor algorithm. Let k_N represent the number of Voronoi vectors for an observation sample of size N. We are interested in determining the probability of error for LVQ when k_N satisfies (1) \lim k_N = \infty and (2) \lim (k_N / N) = 0. In this case there are more observations than vectors, and hence the Voronoi vectors represent averages of the observations. 
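One schedule satisfying both (1) and (2) is k_N = \lceil \sqrt{N} \rceil; this particular choice is our illustration, not a prescription from the paper:

```python
import math

def k_n(n):
    """An illustrative schedule for the number of Voronoi vectors:
    k_N -> infinity while k_N / N -> 0 as N grows."""
    return math.ceil(math.sqrt(n))

# k_N grows without bound while the ratio k_N / N shrinks toward zero:
for n in (100, 10_000, 1_000_000):
    print(n, k_n(n), k_n(n) / n)
```

Any schedule with these two limiting properties keeps each Voronoi vector responsible for a growing number of observations, which is what makes it an average rather than a copy of a single sample.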
It is possible to show that, with k_N satisfying (1)-(2), the decision error associated with modified LVQ can be made to approach the Bayesian optimal decision error as N becomes large (LaVigna [1989]).

6 Conclusions

We have shown convergence of the Voronoi vectors in the LVQ algorithm. We have also presented the majority vote modification of the LVQ algorithm. This modification prevents divergence of the Voronoi vectors and results in convergence for a larger set of initial conditions. In addition, with this modification it is possible to show that, as the appropriate parameters go to infinity, the decision regions associated with the modified LVQ algorithm approach the Bayesian optimal (LaVigna [1989]).

7 Acknowledgements

This work was supported by the National Science Foundation through grant CDR-8803012, Texas Instruments through a TI/SRC Fellowship, and the Office of Naval Research through an ONR Fellowship.

8 References

A. Benveniste, M. Metivier & P. Priouret [1987], Algorithmes Adaptatifs et Approximations Stochastiques, Masson, Paris.

T. M. Cover & P. E. Hart [1967], "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory IT-13, 21-27.

R. O. Duda & P. E. Hart [1973], Pattern Classification and Scene Analysis, John Wiley & Sons, New York, NY.

T. Kohonen [1986], "Learning Vector Quantization for Pattern Recognition," Technical Report TKK-F-A601, Helsinki University of Technology.

H. J. Kushner & D. S. Clark [1978], Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York-Heidelberg-Berlin.

A. LaVigna [1989], "Nonparametric Classification using Learning Vector Quantization," Ph.D. Dissertation, Department of Electrical Engineering, University of Maryland. 
\n\n\f", "award": [], "sourceid": 407, "authors": [{"given_name": "John", "family_name": "Baras", "institution": null}, {"given_name": "Anthony", "family_name": "LaVigna", "institution": null}]}