{"title": "A High Performance k-NN Classifier Using a Binary Correlation Matrix Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 713, "page_last": 722, "abstract": null, "full_text": "A High Performance k-NN Classifier Using a \n\nBinary Correlation Matrix Memory \n\nPing Zhou \n\nJim Austin \n\nJohn Kennedy \n\nzhoup@cs.york.ac.uk \n\naustin@cs.york.ac.uk \n\njohnk@cs.york.ac.uk \n\nAdvanced Computer Architecture Group \n\nDepartment of Computer Science \n\nUniversity of York, York YOW 500, UK \n\nAbstract \n\nThis paper presents a novel and fast k-NN classifier that is based on a \nbinary CMM (Correlation Matrix Memory) neural network. A robust \nencoding method is developed to meet CMM input requirements . A \nhardware implementation of the CMM is described, which gives over 200 \ntimes the speed of a current mid-range workstation, and is scaleable to \nvery large problems. When tested on several benchmarks and compared \nwith a simple k-NN method, the CMM classifier gave less than I % lower \naccuracy and over 4 and 12 times speed-up in software and hardware \nrespectively. \n\n1 INTRODUCTION \nPattern classification is one of most fundamental and important tasks, and a k-NN rule is \napplicable to a wide range of classification problems. As this method is too slow for many \napplications with large amounts of data, a great deal of effort has been put into speeding it \nup via complex pre-processing of training data, such as reducing training data (Dasarathy \n1994) and improving computational efficiency (Grother & Candela 1997). This work \ninvestigates a novel k-NN classification method that uses a binary correlation matrix \nmemory (CMM) neural network as a pattern store and match engine. Whereas most neural \nIt \nnetworks need a long iterative training time, a CMM is simple and quick to train. 
\nrequires only a one-shot storage mechanism and simple binary operations (Willshaw & \nBuneman 1969), and it has a highly flexible and fast pattern search ability. Therefore, the \ncombination of CMM and k-NN techniques is likely to result in a generic and fast \nclassifier. For most classification problems, patterns are in the form of multi-dimensional \nreal numbers, and appropriate quantisation and encoding are needed to convert them into \nbinary inputs to a CMM. A robust quantisation and encoding method is developed to meet \nthe requirements for CMM input codes, and to overcome the common problem of identical \ndata points in many applications, e.g. the background of images or normal features in a \ndiagnostic problem. \n\nMany research projects have applied the CMM successfully to commercial problems, e.g. \nsymbolic reasoning in the AURA (Advanced Uncertain Reasoning Architecture) approach \n\n\f714 \n\nP. Zhou, J. Austin and J. Kennedy \n\n(Austin 1996), chemical structure matching and post code matching. The execution of the \nCMM has been identified as the bottleneck. Motivated by the needs of these applications \nfor further high-speed processing, the CMM has been implemented in dedicated \nhardware, i.e. the PRESENCE architecture. The primary aim is to improve the execution \nspeed over conventional workstations in a cost-effective way. \n\nThe following sections discuss the CMM for pattern classification, describe the \nPRESENCE architecture (the hardware implementation of the CMM), and present \nexperimental results on several benchmarks. \n\n2 BINARY CMM k-NN CLASSIFIER \nThe key idea (Figure 1) is to use a CMM to pre-select a small sub-set of training patterns \nfrom a large number of training data, and then to apply the k-NN rule to the sub-set. The \nCMM is fast but produces spurious errors as a side effect (Turner & Austin 1997); these \nare removed through the application of the k-NN rule. 
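The two-stage scheme just described can be sketched as follows. This is an illustrative sketch only, not the AURA implementation: `cmm_preselect` stands in for the binary CMM match engine (here a plain bit-overlap count over stored binary codes), and all names are hypothetical.

```python
import numpy as np

def cmm_preselect(stored_codes, test_code, threshold):
    """Stand-in for the CMM match engine: return indices of stored binary
    codes whose overlap with the test code reaches the threshold."""
    overlap = stored_codes @ test_code          # number of shared '1' bits
    return np.flatnonzero(overlap >= threshold)

def classify(train_x, train_y, train_codes, test_x, test_code, k, threshold):
    """Two-stage k-NN: CMM-style pre-selection, then exact k-NN on the
    small candidate set, with distances in the original input space."""
    candidates = cmm_preselect(train_codes, test_code, threshold)
    if candidates.size == 0:                    # no match: fall back to all data
        candidates = np.arange(len(train_x))
    d = np.linalg.norm(train_x[candidates] - test_x, axis=1)
    nearest = candidates[np.argsort(d)[:k]]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Because the distance computation runs only over the pre-selected candidates, the cost of the exact k-NN stage shrinks with the selectivity of the binary match.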
The architecture of the CMM \nclassifier (Figure 1) includes an encoder (detailed in Section 2.2) for quantising numerical inputs \nand generating binary codes, a CMM pattern store and match engine, and a conventional \nk-NN module, as detailed below. \n\n[Figure: training patterns stored in the CMM -> patterns pre-selected by the CMM -> k-NN classification] \n\nFigure 1: Architecture of the binary CMM k-NN classifier \n\n2.1 PATTERN MATCH AND CLASSIFICATION WITH CMM \nA correlation matrix memory is basically a single-layer network with binary weights M. In \nthe training process a unique binary vector or separator s_i is generated to label an unseen \ninput binary vector p_i; the CMM learns their association by performing the following \nlogical ORing operation: \n\nM = \\bigvee_i s_i^T p_i \\qquad (1) \n\nIn a recall process, for a given test input vector p_k, the CMM performs: \n\nv_k = M p_k^T = ( \\bigvee_i s_i^T p_i ) p_k^T \\qquad (2) \n\nfollowed by thresholding v_k and recovering individual separators. For speed, it is \nappropriate to use a fixed thresholding method, and the threshold is set to a level \nproportional to the number of '1' bits in the input pattern to allow an exact or partial \nmatch. To understand the recall properties of the CMM, consider the case where a known \npattern p_k is presented; then Equation 2 can be written as the following when two \ndifferent patterns are orthogonal to each other: \n\nv_k = s_k^T ( p_k p_k^T ) = n_p s_k^T \\qquad (3) \n\nwhere n_p is a scalar, i.e. the number of '1' bits in p_k, and p_i p_k^T = 0 for i \\neq k. Hence a \nperfect recall of s_k can be obtained by thresholding v_k at the level n_p. In practice \n'partially orthogonal' codes may be used to increase the storage capacity of the CMM, and \nthe recall noise can be removed via appropriately thresholding v_k (as p_i p_k^T < n_p for i \\neq k) \nand post-processing (e.g. 
applying the k-NN rule). Sparse codes are usually used, i.e. only a \nfew bits in s_i and p_i being set to '1', as this maximises the number of codes and \nminimises the computation time (Turner & Austin 1997). These requirements for input \ncodes are often met by an encoder as detailed below. \n\nThe CMM exhibits an interesting 'partial match' property when the data dimensionality d \nis larger than one and the input vector p_i consists of d concatenated components. If two \ndifferent patterns have some common components, v_k also contains separators for \npartially matched patterns, which can be obtained at lower threshold levels. This partial or \nnear match property is useful for pattern classification as it allows the retrieval of stored \npatterns that are close to the test pattern in Hamming distance. \n\nFrom those training patterns matched by the CMM engine, a test pattern is classified using \nthe k-NN rule. Distances are computed in the original input space to minimise the \ninformation loss due to quantisation and noise in the above match process. As the number \nof matches returned by the CMM is much smaller than the number of training data, the \ndistance computation and comparison are dramatically reduced compared with the simple \nk-NN method. Therefore, the speed of the classifier benefits from the fast training and \nmatching of the CMM, and the accuracy gains from the application of the k-NN rule for \nreducing information loss and noise in the encoding and match processes. \n\n2.2 ROBUST UNIFORM ENCODING \n\nFigure 2 shows the three stages of the encoding process: d-dimensional real numbers, x_i, are \nquantised as y_i; sparse and orthogonal binary vectors, c_i, are generated and concatenated \nto form a CMM input vector. \n\nFigure 2: Quantisation, code generation and concatenation \n\nCMM input codes should be distributed as uniformly as possible in order to avoid some \nparts of the CMM being used heavily while others are rarely used. 
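The store-and-recall mechanism of Equations 1-3 can be sketched minimally as follows, assuming dense NumPy boolean arrays rather than the sparse data structures of the actual AURA libraries:

```python
import numpy as np

def train_cmm(separators, patterns):
    """Equation 1: OR together the outer products s_i^T p_i of binary
    separator/pattern pairs to form the binary weight matrix M."""
    s = np.asarray(separators, dtype=bool)
    p = np.asarray(patterns, dtype=bool)
    m = np.zeros((s.shape[1], p.shape[1]), dtype=bool)
    for si, pi in zip(s, p):
        m |= np.outer(si, pi)          # logical OR of one outer product
    return m

def recall_cmm(m, test_pattern, theta):
    """Equation 2 plus fixed thresholding: sum the weight rows selected by
    the '1' bits of the test pattern, then threshold at theta."""
    v = m.astype(int) @ np.asarray(test_pattern, dtype=int)
    return (v >= theta).astype(int)    # recovered (possibly noisy) separator
```

With mutually orthogonal patterns and theta set to the number of '1' bits n_p in the test pattern, the recall is exact, as Equation 3 states; with partially orthogonal codes the same threshold suppresses most of the recall noise.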
The code uniformity is \nachieved at the quantisation stage. For a given set of N training samples in some dimension (or \naxis), it is required to divide the axis into N_b small intervals, called bins, such that they \ncontain uniform numbers of data points. As the data often have a non-uniform distribution, \nthe sizes of these bins should be different. It is also quite common for real-world problems \nthat many data points are identical. For instance, there are 11%-99.9% identical data in the \nbenchmarks used in this work. Our robust quantisation method described below is \ndesigned to cope with the above problems and to achieve maximal uniformity. \n\nIn our method data points are first sorted in ascending order, the N_i identical points are then \nidentified, and the number of non-identical data points in each bin is estimated as \nN_p = (N - N_i)/N_b. Bin boundaries or partitions are determined as follows. The right \nboundary of a bin is initially set to the next N_p-th data point in the ordered data sequence; \nthe number of identical points on both sides of the boundary is identified; these are either \nincluded in the current or the next bin. If the number of non-identical data points in the last \nbin is N_l and N_l >= (N_p + N_b), N_p may be increased by (N_l - N_p)/N_b and the above partition \nprocess may be repeated to increase the uniformity. The boundaries of the bins obtained become \nparameters of the encoder in Figure 2. In general it is appropriate to choose N_b such that \neach bin contains a number of samples larger than the k nearest neighbours needed for \noptimal classification. \n\n3 THE PRESENCE ARCHITECTURE \nThe pattern match and store engine of the CMM k-NN classifier has been implemented \nusing a novel hardware-based CMM architecture, i.e. the PRESENCE. 
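The bin-partitioning idea of Section 2.2 can be sketched as follows. This is a simplified, illustrative version: it builds equal-frequency boundaries that never split a run of identical values across two bins, and it omits the paper's second pass that re-balances an overfull last bin.

```python
from collections import Counter

def robust_bins(data, n_bins):
    """Simplified sketch of robust quantisation: equal-frequency bin
    boundaries over the sorted data, keeping all copies of a value in
    one bin. per_bin approximates N_p = (N - N_i)/N_b from the paper."""
    counts = Counter(data)
    n_identical = sum(c - 1 for c in counts.values())      # surplus duplicates
    per_bin = max(1, (len(data) - n_identical) // n_bins)
    boundaries, distinct = [], 0
    for x in sorted(counts):
        distinct += 1
        if distinct >= per_bin and len(boundaries) < n_bins - 1:
            boundaries.append(x)   # right boundary: every copy of x stays here
            distinct = 0
    return boundaries
```

Because duplicates are counted once when filling a bin, a heavily repeated value (e.g. an image background level) occupies a single bin boundary region instead of flooding several bins, which is the behaviour the robust method is after.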
\n\n3.1 ARCHITECTURE DESIGN \nImportant design decisions include the use of cheap memory, and not embedding both the \nweight storage and the training and testing in custom hardware (VLSI). This arises because the \napplications commonly use CMMs with over 100Mb of weight memory, which would be \ndifficult and expensive to implement in custom silicon. VME and PCI are chosen as the host \nbuses, being industry standards that allow widespread application. \n\nThe PRESENCE architecture implements the control logic and accumulators, i.e. the core \nof the CMM. As shown in Figure 3a, a binary input selects rows from the CMM that are \nadded, thresholded using L-max (Austin & Stonham 1987) or fixed global thresholding, \nand then returned to the host for further processing. The PRESENCE architecture shown \nin Figure 3b consists of a bus interface; a buffer memory, which allows interleaving of \nmemory transfer and operation of the PRESENCE system; and a SATCON and SATSUM \ncombination that accumulates and thresholds the weights. The data bus connects to a pair \nof memory spaces, each of which contains a control block, an input block and an output \nblock. Thus the PRESENCE card is a memory-mapped device that uses interrupts to \nconfirm the completion of each operation. For efficiency, two memory input/output areas \nare provided to be acted on from the external bus and used by the card. The control \nmemory input block feeds the control unit, which is an FPGA device. The input data are \nfed to the weights memory, and the memory area read is then passed to a block of accumulators. In \nour current implementation the data width of each FPGA device is 32 bits, which allows \nus to add a 32-bit row from the weights memory in one cycle per device. \n\n[Figure: data bus, input (sparse codes), weights matrix, sum v, separator output s] \n\nFigure 3: (a) correlation matrix memory, 
and (b) overall architecture of PRESENCE \n\nCurrently we have 16Mb of 25ns static memory implemented on the VME card, and 128Mb \nof dynamic (60ns) memory on the PCI card. The accumulators are implemented along \nwith the thresholding logic on another FPGA device (SATSUM). To enable the SATSUM \nprocessors to operate faster, a 5-stage pipeline architecture was used, reducing the data \naccumulation time from 175ns to 50ns. All PRESENCE operations are \nsupported by a C++ library that is used in all AURA applications. The design of the \nSATCON allows many SATSUM devices to be used in parallel in a SIMD configuration. \nThe VME implementation uses 4 devices per board, giving a 128-bit wide data path. In \naddition, the PCI version allows daisy-chaining of cards, so a 4-card set gives a 512-bit \nwide data path. The complete VME card assembly is shown in Figure 4. The SATCON \nand SATSUM devices are mounted on a daughter board for simple upgrading and \nalteration. The weights memory, buffer memory and VME interface are held on the \nmother board. \n\nFigure 4: The VME based PRESENCE card (a) motherboard, and (b) daughterboard \n\n3.2 PERFORMANCE \n\nBy an analysis of the state machines used in the SATCON device, the time complexity of \nthe approach can be calculated. Equation 4 gives the processing time T, in \nseconds, to recall the data with N index values, a separator size of S, R 32-bit SATSUM \ndevices, and a clock period of C: \n\nT = C[23 + ((S-1)/32R + 1)(N + 3S + 2R)] \\qquad (4) \n\nA comparison with a Silicon Graphics 133MHz R4600SC Indy in Table 1 shows the \nspeed-up of the matrix operation (Equation 2) for our VME implementation (128 bits \nwide) using a fixed threshold. The values for processing rate are given in millions of \nbinary weight additions per second (MW/s). The system cycle time needed to sum a row \nof weights into the counters (i.e. 
time to accumulate one line) is 50ns for the VME version \nand 100ns for the PCI version. In the PCI form, we will use 4 closely coupled cards, \nwhich results in a speed-up of 432. The build cost of the VME card was half the cost of the \nbaseline SGI Indy machine, when using 4Mb of 20ns static RAM. In the PCI version the \ncost is greatly reduced through the use of dynamic RAM devices, allowing a 128Mb \nmemory to be used for the same cost and a system only 2x slower with 32x as much \nmemory per card (note that the 4 cards used in Table 1 hold 512Mb of memory). \n\nTable 1: Relative speed-up of the PRESENCE architecture \n\nPlatform                           Processing Rate   Relative Speed \nWorkstation                        11.8 MW/s         1 \n1-card VME implementation          2,557 MW/s        216 \nFour-card PCI system (estimate)    17,114 MW/s       432 \n\nThe training and recognition speeds of the system are approximately equal. This is \nparticularly useful in on-line applications, where the system must learn to solve the \nproblem incrementally as it is presented. In particular, the use of the system for high-speed \nreasoning allows the rules in the system to be altered without the long training times of \nother systems. Furthermore, our use of the system for a k-NN classifier also allows high-speed \noperation compared with a conventional implementation of the classifier, while still \nallowing very fast training times. \n\n4 RESULTS ON BENCHMARKS \nThe performance of the robust quantisation method and the CMM classifier has been \nevaluated on four benchmarks consisting of large sets of real-world problems from the \nStatlog project (Michie & Spiegelhalter 1994), including a satellite image database, a letter \nimage recognition database, a shuttle data set and an image segmentation data set. To visualise \nthe result of quantisation, Figure 5a shows the distribution of numbers of data points of \nthe 8th feature of the image segment data for equal-size bins. The distribution represents \nthe inherent characteristics of the data. Figure 5b shows that our robust quantisation (RQ) has \nproduced the desired uniform distribution. \n\nFigure 5: Distributions of the image segment data for (a) equal bins, (b) RQ bins \n\nWe compared the CMM classifier with the simple k-NN method, multi-layer perceptron \n(MLP) and radial basis function (RBF) networks (Zhou & Austin 1998). In the \nevaluation we used the CMM software libraries developed in the AURA project at the \nUniversity of York. Between 1 and 3 '1' bits are set in input vectors and separators. \nExperiments were conducted to study the influence of the CMM's size on the classification rate \n(c-rate) on test data sets and the speed-up measured against the k-NN method (as shown in \nFigure 6). The speed-up of the CMM classifier includes the encoding, training and test \ntime. The effects of the number of bins N_b on the performance were also studied. \n\nFigure 6: Effects of the CMM size on (a) c-rate and (b) speed-up on the satellite image data \n\nChoices of the CMM size and the number of bins may be application dependent, for \ninstance, favouring either speed or accuracy. In the experiments it was required that the \nspeed-up be no less than 4 times and the c-rate no more than 1% lower than that of the k-NN \nmethod. Table 2 contains the speed-up of the MLP and RBF networks and the CMM on the four \nbenchmarks. 
It is interesting to note that the k-NN method needed no training. The recall \nof the MLP and RBF networks was much faster, but their training was much slower than that of \nthe CMM classifier. The recall speed-up of the CMM was 6-23 times, and the overall \nspeed-up (including training and recall time) was 4-15x. When using the PRESENCE, i.e. \nthe dedicated CMM hardware, the speed of the CMM was further increased over 3 times. \nThis is much less than the speed-up of 216 given in Table 1 because the recovery of \nseparators and the k-NN classification are performed in software. \n\nTable 2: Speed-up of MLP, RBF and CMM relative to the simple k-NN method \n\nMethod        Image segment     Satellite image   Letter            Shuttle \n              training  test    training  test    training  test    training  test \nMLPN          0.04      18      0.2       28.4    0.2       96.5    4.2       587.2 \nRBFN          0.09      9       0.07      20.3    0.3       66.4    1.8       469.7 \nsimple k-NN   -         1       -         1       -         1       -         1 \nCMM           18        9       15.8      5.7     24.6      6.8     43        23 \n\nThe classification rates of the four methods are given in Table 3, which shows that the CMM \nclassifier performed only 0-1% less accurately than the k-NN method. \n\nTable 3: Classification rates of four methods on four benchmarks \n\nMethod        Image segment   Satellite image   Letter   Shuttle \nMLPN          0.950           0.914             0.923    0.998 \nRBFN          0.939           0.914             0.941    0.997 \nsimple k-NN   0.956           0.906             0.954    0.999 \nCMM           0.948           0.901             0.945    0.999 \n\n5 CONCLUSIONS \nA novel classifier is presented, which uses a binary CMM for storing and matching a large \nnumber of patterns efficiently, and the k-NN rule for classification. The RU encoder \nconverts numerical inputs into binary ones with the maximally achievable uniformity to \nmeet the requirements of the CMM. 
Experimental results on the four benchmarks show that \nthe CMM classifier, compared with the simple k-NN method, gave slightly lower \nclassification accuracy (less than 1% lower) and over 4 times the speed in software and 12 \ntimes the speed in hardware. Therefore our method has resulted in a generic and fast \nclassifier. \n\nThis paper has also described a hardware implementation of an FPGA-based chip set and a \nprocessor card that supports the execution of binary CMMs. It has shown the viability \nof using a simple binary neural network to achieve high processing rates. The approach \nallows both recognition and training to be achieved at speeds well above two orders of \nmagnitude faster than conventional workstations, at a much lower cost than the \nworkstation. The system is scaleable to very large problems with very large weight arrays. \nCurrent research is aimed at showing that the system is scaleable, evaluating methods for \nthe acceleration of the pre- and post-processing tasks, and considering greater integration \nof the elements of the processor through VLSI. For more details of the AURA project and \nthe hardware described in this paper see http://www.cs.york.ac.uk/arch/nn/aura.html. \n\nAcknowledgements \n\nWe acknowledge British Aerospace and the Engineering and Physical Sciences Research \nCouncil (grant nos. GR/K 41090 and GR/L 74651) for sponsoring the research. Our thanks \nare given to R Pack, A Moulds, Z Ulanowski, R Jennison and K Lees for their support. \n\nReferences \nWillshaw, D.J., Buneman, O.P. & Longuet-Higgins, H.C. (1969) Non-holographic \nassociative memory. Nature; 222:960-962. \nAustin, J. (1996) AURA, a distributed associative memory for high speed symbolic \nreasoning. In: Ron Sun (ed), Connectionist Symbolic Integration. Kluwer. \nTurner, M. & Austin, J. (1997) Matching performance of binary correlation matrix \nmemories. Neural Networks; 10:1637-1648. \nDasarathy, B.V. 
(1994) Minimal consistent set (MCS) identification for optimal nearest \nneighbor decision system design. IEEE Trans. Systems Man Cybernet; 24:511-517. \nGrother, P.J., Candela, G.T. & Blue, J.L. (1997) Fast implementations of nearest neighbor \nclassifiers. Pattern Recognition; 30:459-465. \nAustin, J. & Stonham, T.J. (1987) An associative memory for use in image recognition and \nocclusion analysis. Image and Vision Computing; 5:251-261. \nMichie, D., Spiegelhalter, D.J. & Taylor, C.C. (1994) Machine learning, neural and \nstatistical classification (Chapter 9). New York, Ellis Horwood. \nZhou, P. & Austin, J. (1998) Learning criteria for training neural network classifiers. \nNeural Computing and Applications Forum; 7:334-342. \n", "award": [], "sourceid": 1495, "authors": [{"given_name": "Ping", "family_name": "Zhou", "institution": null}, {"given_name": "Jim", "family_name": "Austin", "institution": null}, {"given_name": "John", "family_name": "Kennedy", "institution": null}]}