{"title": "A NEURAL NETWORK CLASSIFIER BASED ON CODING THEORY", "book": "Neural Information Processing Systems", "page_first": 174, "page_last": 183, "abstract": "", "full_text": "174 \n\nA Neural Network Classifier Based on Coding Theory \n\nTzt-Dar Chlueh and Rodney Goodman \n\neanrornla Instltute of Technology. Pasadena. eanromla 91125 \n\nABSTRACT \n\nThe new neural network classifier we propose transforms the \nclassification problem into the coding theory problem of decoding a noisy \ncodeword. An input vector in the feature space is transformed into an internal \nrepresentation which is a codeword in the code space, and then error correction \ndecoded in this space to classify the input feature vector to its class. Two classes \nof codes which give high performance are the Hadamard matrix code and the \nmaximal length sequence code. We show that the number of classes stored in an \nN-neuron system is linear in N and significantly more than that obtainable by \nusing the Hopfield type memory as a classifier. \n\nI. INTRODUCTION \n\nAssociative recall using neural networks has recently received a great deal \nof attention. Hopfield in his papers [1,2) deSCribes a mechanism which iterates \nthrough a feedback loop and stabilizes at the memory element that is nearest the \ninput, provided that not many memory vectors are stored in the machine. He has \nalso shown that the number of memories that can be stored in an N-neuron \nsystem is about O.15N for N between 30 and 100. McEliece et al. in their work (3) \nshowed that for synchronous operation of the Hopfield memory about N /(2IogN) \ndata vectors can be stored reliably when N is large. Abu-Mostafa (4) has predicted \nthat the upper bound for the number of data vectors in an N-neuron Hopfield \nmachine is N. We believe that one should be able to devise a machine with M, the \nnumber of data vectors, linear in N and larger than the O.15N achieved by the \nHopfield method. 
\n\nFeature Space = B^N = {-1, 1}^N;  Code Space = B^L = {-1, 1}^L \n\nFigure 1 (a) Classification problems versus (b) error control decoding problems \n\nIn this paper we are specifically concerned with the problem of classification as in pattern recognition. We propose a new method of building a neural network classifier, based on the well established techniques of error control coding. Consider a typical classification problem (Fig. 1(a)), in which one is given a priori a set of classes, C(α), α = 1, ..., M. Associated with each class is a feature vector which labels the class (the exemplar of the class), i.e. it is the most representative point in the class region. The input is classified into the class with the nearest exemplar to the input. Hence for each class there is a region in the N-dimensional binary feature space B^N ≡ {1, -1}^N, in which every vector will be classified to the corresponding class. \n\n© American Institute of Physics 1988 \n\nA similar problem is that of decoding a codeword in an error correcting code as shown in Fig. 1(b). In this case codewords are constructed by design and are usually at least d_min apart. The received corrupted codeword is the input to the decoder, which then finds the nearest codeword to the input. In principle then, if the distance between codewords is greater than 2t + 1, it is possible to decode (or classify) a noisy codeword (feature vector) into the correct codeword (exemplar) provided that the Hamming distance between the noisy codeword and the correct codeword is no more than t. Note that there is no guarantee that the exemplars are uniformly distributed in B^N; consequently the attraction radius (the maximum number of errors that can occur in any given feature vector such that the vector can still be correctly classified) will depend on the minimum distance between exemplars. 
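The nearest-exemplar rule just described can be written directly; a minimal Python sketch (the exemplars and the test input are illustrative, not from the paper):

```python
# Sketch: minimum Hamming distance classification over +/-1 feature vectors.
# The exemplar set here is illustrative; the paper draws exemplars from
# B^N = {-1, 1}^N.

def hamming(u, v):
    """Hamming distance between two +/-1 vectors of equal length."""
    return sum(1 for a, b in zip(u, v) if a != b)

def classify(f, exemplars):
    """Return the index of the class whose exemplar is nearest to f."""
    return min(range(len(exemplars)), key=lambda a: hamming(f, exemplars[a]))

exemplars = [(1, 1, 1, 1, 1), (-1, -1, -1, -1, -1), (1, -1, 1, -1, 1)]
f = (1, 1, -1, 1, 1)            # exemplar 0 with one bit flipped
print(classify(f, exemplars))   # -> 0
```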
\n\nMany solutions to the minimum Hamming distance classification have been proposed; the one commonly used is derived from the idea of matched filters in communication theory. Lippmann [5] proposed a two-stage neural network that solves this classification problem by first correlating the input with all exemplars and then picking the maximum by a \"winner-take-all\" circuit or a network composed of two-input comparators. In Figure 2, f1, f2, ..., fN are the N input bits, and s1, s2, ..., sM are the matching scores (similarity) of f with the M exemplars. The second block picks the maximum of s1, s2, ..., sM and produces the index of the exemplar with the largest score. The main disadvantage of such a classifier is the complexity of the maximum-picking circuit; for example a \"winner-take-all\" net needs connection weights of large dynamic range and graded-response neurons, whilst the comparator maximum net demands M-1 comparators organized in log2 M stages. \n\nFig. 2 A matched filter type classifier \n\nFig. 3 Structure of the proposed classifier (the feature vector f = d + e is mapped into the code space as a noisy codeword g, which is then decoded to give class(f)) \n\nOur main idea is thus to transform every vector in the feature space to a vector in some code space in such a way that every exemplar corresponds to a codeword in that code. The code should preferably (but not necessarily) have the property that codewords are uniformly distributed in the code space, that is, the Hamming distance between every pair of codewords is the same. With this transformation, we turn the problem of classification into the coding problem of decoding a noisy codeword. We then do error correction decoding on the vector in the code space to obtain the index of the noisy codeword and hence classify the original feature vector, as shown in Figure 3. 
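The decoding stage of Figure 3 can be sketched as a correlation decoder: for ±1 vectors, the codeword with the largest inner product with the noisy codeword g is also the nearest one in Hamming distance. The codewords below are illustrative:

```python
# Sketch: error correction decoding in the code space by maximum correlation.
# For +/-1 vectors, maximizing the correlation <g, w> is equivalent to
# minimizing Hamming distance. The codewords here are illustrative only.

def decode(g, codewords):
    """Index of the codeword with the highest correlation with g."""
    def corr(w):
        return sum(gi * wi for gi, wi in zip(g, w))
    return max(range(len(codewords)), key=lambda i: corr(codewords[i]))

codewords = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, -1, 1, -1, 1, -1, 1),
    (-1, -1, 1, 1, -1, -1, 1),
]
g = (1, 1, -1, 1, 1, 1, 1)     # codeword 0 with one symbol in error
print(decode(g, codewords))    # -> 0
```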
\n\nThis paper develops the construction of such a classification machine as follows. First we consider the problem of transforming the input vectors from the feature space to the code space. We describe two hetero-associative memories for doing this: the first method uses an outer product matrix technique similar to that of Hopfield's, and the second method generates its matrix by the pseudoinverse technique [6,7]. Given that we have transformed the problem of associative recall, or classification, into the problem of decoding a noisy codeword, we next consider suitable codes for our machine. We require the codewords in this code to have the property of orthogonality or pseudo-orthogonality, that is, the ratio of the cross-correlation to the auto-correlation of the codewords is small. We show two classes of such good codes for this particular decoding problem, i.e. the Hadamard matrix codes and the maximal length sequence codes [8]. We next formulate the complete decoding algorithm, and describe the overall structure of the classifier in terms of a two layer neural network. The first layer performs the mapping operation on the input, and the second one decodes its output to produce the index of the class to which the input belongs. \n\nThe second part of the paper is concerned with the performance of the classifier. We first analyze the performance of this new classifier by finding the relation between the maximum number of classes that can be stored and the classification error rate. We show (when using a transform based on the outer product method) that for negligible misclassification rate and large N, a not very tight lower bound on M, the number of stored classes, is 0.22N. We then present comprehensive simulation results that confirm and exceed our theoretical expectations. The simulation results compare our method with the Hopfield model for both the outer product and pseudo-inverse method, 
and for both the analog and hard limited connection matrices. In all cases our classifier exceeds the performance of the Hopfield memory in terms of the number of classes that can be reliably recovered. \n\nII. TRANSFORM TECHNIQUES \n\nOur objective is to build a machine that can discriminate among input vectors and classify each one of them into the appropriate class. Suppose d(α) ∈ B^N is the exemplar of the corresponding class C(α), α = 1, 2, ..., M. Given the input f, we want the machine to be able to identify the class whose exemplar is closest to f; that is, we want to calculate the following function: \n\nclass(f) = α  if  |f - d(α)| < |f - d(β)| for all β ≠ α \n\nwhere | | denotes Hamming distance in B^N. \n\nWe approach the problem by seeking a transform Φ that maps each exemplar d(α) in B^N to the corresponding codeword w(α) in B^L. An input feature vector f = d(γ) + e is thus mapped to a noisy codeword g = w(γ) + e', where e is the error added to the exemplar, and e' is the corresponding error pattern in the code space. We then do error correction decoding on g to get the index of the corresponding codeword. Note that e' may not have the same Hamming weight as e; that is, the transformation Φ may either generate more errors or eliminate errors that are present in the original input feature vector. We require Φ to satisfy the following equation, \n\nΦ(d(α)) = w(α),  α = 0, 1, ..., M-1 \n\nand Φ will be implemented using a single-layer feedforward network. Thus we first construct a matrix according to the sets of d(α)'s and w(α)'s, call it T, and define Φ as \n\nΦ(f) = sgn(Tf) \n\nwhere sgn is the threshold operator that maps a vector in R^L to B^L and R is the field of real numbers. \n\nLet D be an N x M matrix whose ... \n\nUsing the Chernoff bound (for some t > 0) and summing from k = 0 instead of k = ⌊L/4⌋, \n\nPe < Σ_{k=0}^{L} C(L,k) p^k (1-p)^{L-k} e^{t(k-L/4)} = e^{-Lt/4} (1 - p + p e^t)^L \n\nDifferentiating the RHS of the above equation w.r.t. t and setting it to 0, we find that the optimal t0 satisfies e^{t0} = (1-p)/(3p). The condition that t0 > 0 implies that p < 1/4, and since we are dealing with the case where p is small, it is automatically satisfied. Substituting the optimal t0, we obtain \n\nPe < ( c p^{1/4} (1-p)^{3/4} )^L \n\nFrom the expression for Pe we can estimate M, the number of classes that can be classified with negligible misclassification rate, in the following way: suppose Pe = δ where δ ≪ 1 and p ≪ 1; then \n\nδ ≈ ( c p^{1/4} )^L,  where c = 4/3^{3/4} = 1.7547654 \n\nFor small z we have g^{-1}(z) ≈ √(2 log(1/z)), and since δ is a fixed value, as L approaches infinity we have \n\nM > N/(8 log c) = N/4.5 \n\nFrom the above lower bound for M, one easily sees that this new machine is able to classify a constant times N classes, which is better than the number of memory items a Hopfield model can store, i.e. N/(2 log N). Although the analysis is done assuming N approaches infinity, the simulation results in the next section show that when N is moderately large (e.g. 63) the above lower bound applies. \n\nVI. SIMULATION RESULTS AND A CHARACTER RECOGNITION EXAMPLE \n\nWe have simulated both the Hopfield model and our new machine (using maximal length sequence codes) for L = N = 31, 63 and for the following four cases respectively: \n(i) connection matrix generated by the outer product method; \n(ii) connection matrix generated by the pseudo-inverse method; \n(iii) connection matrix generated by the outer product method, with the components of the connection matrix hard limited; \n(iv) connection matrix generated by the pseudo-inverse method, with the components of the connection matrix hard limited. \n\nFor each case and each choice of N, the program fixes M and the number of errors in the input vector, then randomly generates 50 sets of M exemplars and computes the connection matrix for each machine. 
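The maximal length sequence codes used in these simulations can be generated with a linear feedback shift register; the sketch below uses feedback taps (5, 2), an assumed standard maximal-length choice (the paper does not state its generator polynomial), to produce a length-31 m-sequence whose cyclic shifts form pseudo-orthogonal ±1 codewords.

```python
# Sketch: generate a maximal length sequence (m-sequence) code with a linear
# feedback shift register. Feedback taps (5, 2) are an assumed standard
# maximal-length choice; the paper does not specify its generator polynomial.

def m_sequence(degree=5, taps=(5, 2)):
    """One period (length 2^degree - 1) of an m-sequence, as 0/1 bits."""
    state = [1] * degree              # any nonzero initial state works
    seq = []
    for _ in range(2 ** degree - 1):
        seq.append(state[-1])         # output of the last stage
        fb = 0
        for t in taps:                # XOR of the tapped stages
            fb ^= state[t - 1]
        state = [fb] + state[:-1]     # shift the register
    return seq

def code(seq):
    """Cyclic shifts of the +/-1 sequence: the m-sequence codewords."""
    pm = [1 - 2 * b for b in seq]     # map bit 0 -> +1, bit 1 -> -1
    L = len(pm)
    return [pm[i:] + pm[:i] for i in range(L)]

seq = m_sequence()
words = code(seq)
auto = sum(a * a for a in words[0])                      # autocorrelation = L
cross = sum(a * b for a, b in zip(words[0], words[1]))   # cross-correlation = -1
```

For L = 63 the same construction applies with a degree-6 register (e.g. taps (6, 1)); the autocorrelation-to-cross-correlation ratio L/(-1) is what makes these codes pseudo-orthogonal.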
For each machine it randomly picks an exemplar and adds noise to it by randomly complementing the specified number of bits to generate 20 trial input vectors; it then simulates the machine, checks whether or not the input is classified to the nearest class, and reports the percentage of success for each machine. \n\nThe simulation results are shown in Figure 5; in each graph the horizontal axis is M and the vertical axis is the attraction radius. The data we show are obtained by collecting only those cases when the success rate is more than 98%, that is, for fixed M, the largest attraction radius (number of bits in error of the input vector) that has a success rate of more than 98%. Here we use an attraction radius of -1 to denote that for this particular M, with the input being an exemplar, the success rate is less than 98% in that machine. \n\nFigure 5 Simulation results of the Hopfield memory and the new classifier (legend: Hopfield Model, New Classifier (OP), New Classifier (PI); panels (a)-(b): N = 31 with analog and binary connection matrices; panels (c)-(d): N = 63 with binary and analog connection matrices) \n\nFigure 6 Performance of the new classifier using codes of different lengths (legend: Hopfield Model, New Classifier (OP, L = 63), New Classifier (OP, L = 31)) \n\nIn all cases our classifier exceeds the performance of the Hopfield model in terms of the number of classes that can be reliably recovered. For example, consider the case of N = 63 and a hard limited connection matrix for both the new classifier and the Hopfield model: we find that for an attraction radius of zero, that is, no error in the input vector, the Hopfield model has a classification capacity of approximately 5, while our new model can store 47. Also, for an attraction radius of 8, that is, an average of N/8 errors in the input vector, the Hopfield model can reliably store 4 classes while our new model stores 27 classes. Another simulation (Fig. 6) using a shorter code (L = 31 instead of L = 63) reveals that by shortening the code, the performance of the classifier degrades only slightly. We therefore conjecture that it is possible to use traditional error correcting codes (e.g. BCH codes) as internal representations; however, by going to a higher rate code, one is trading minimum distance of the code (error tolerance) for complexity (number of hidden units), which implies possibly poorer performance of the classifier. \n\nWe also notice that the superiority of the pseudoinverse method over the outer product method appears only when the connection matrices are hard limited. The reason for this is that the pseudoinverse method is best for decorrelating the dependency among exemplars, yet the exemplars in this simulation are generated randomly and are presumably independent; consequently one cannot see the advantage of the pseudoinverse method. 
For correlated exemplars, we expect the pseudoinverse method to be clearly better (see the next example). \n\nNext we present an example of applying this classifier to recognizing characters. Each character is represented by a 9 x 7 pixel array; the input is generated by flipping every pixel with probability 0.1 and 0.2. The input is then passed to five machines: the Hopfield memory, and the new classifier with either the pseudoinverse method or the outer product method, and with L = 7 or L = 31. Figures 7 and 8 show the results of all 5 machines for 0.1 and 0.2 pixel flipping probability respectively; a blank output means that the classifier refuses to make a decision. First note that the L = 7 case is not necessarily worse than the L = 31 case; this confirms the earlier conjecture that fewer hidden units (a shorter code) only degrade performance slightly. Also one easily sees that the pseudoinverse method is better than the outer product method because of the correlation between exemplars. Both methods outperform the Hopfield memory, since the latter mixes the exemplars that are to be remembered and produces a blend of exemplars rather than the exemplars themselves; accordingly it cannot classify the input without mistakes. \n\nFigure 7 The character recognition example with 10% pixel reverse probability: (a) input (b) correct output (c) Hopfield Model (d)-(g) new classifier: (d) OP, L = 7 (e) OP, L = 31 (f) PI, L = 7 (g) PI, L = 31 \n\nFigure 8 The character recognition example with 20% pixel reverse probability: (a) input (b) correct output (c) Hopfield Model (d)-(g) new classifier: (d) OP, L = 7 (e) OP, L = 31 (f) PI, L = 7 (g) PI, L = 31 \n\nVII. 
CONCLUSION \n\nIn this paper we have presented a new neural network classifier design based on coding theory techniques. The classifier uses codewords from an error correcting code as its internal representations. Two classes of codes which give high performance are the Hadamard matrix codes and the maximal length sequence codes. In performance terms we have shown that the new machine is significantly better than using the Hopfield model as a classifier. We should also note that when comparing the new classifier with the Hopfield model, the increased performance of the new classifier does not entail extra complexity, since it needs only L + M hard limiter neurons and L(N + M) connection weights versus N neurons and N^2 weights in a Hopfield memory. \n\nIn conclusion we believe that our model forms the basis of a fast, practical method of classification with an efficiency greater than other previous neural network techniques. \n\nREFERENCES \n\n[1] J. J. Hopfield, Proc. Nat. Acad. Sci. USA, Vol. 79, pp. 2554-2558 (1982). \n[2] J. J. Hopfield, Proc. Nat. Acad. Sci. USA, Vol. 81, pp. 3088-3092 (1984). \n[3] R. J. McEliece et al., IEEE Trans. on Information Theory, Vol. IT-33, pp. 461-482 (1987). \n[4] Y. S. Abu-Mostafa and J. St. Jacques, IEEE Trans. on Information Theory, Vol. IT-31, pp. 461-464 (1985). \n[5] R. Lippmann, IEEE ASSP Magazine, Vol. 4, No. 2, pp. 4-22 (April 1987). \n[6] T. Kohonen, Associative Memory - A System-Theoretical Approach (Springer-Verlag, Berlin Heidelberg, 1977). \n[7] S. S. Venkatesh, Linear Maps with Point Rules, Ph.D. Thesis, Caltech, 1987. \n[8] E. R. Berlekamp, Algebraic Coding Theory, Aegean Park Press, 1984. \n", "award": [], "sourceid": 48, "authors": [{"given_name": "Tzi-Dar", "family_name": "Chiueh", "institution": null}, {"given_name": "Rodney", "family_name": "Goodman", "institution": null}]}