{"title": "Efficient Computation of Complex Distance Metrics Using Hierarchical Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 168, "page_last": 175, "abstract": null, "full_text": "Efficient Computation of Complex\nDistance Metrics Using Hierarchical\nFiltering\n\nPatrice Y. Simard\n\nAT&T Bell Laboratories\n\nHolmdel, NJ 07733\n\nAbstract\n\nBy their very nature, memory based algorithms such as KNN or\nParzen windows require a computationally expensive search of a\nlarge database of prototypes. In this paper we optimize the searching\nprocess for tangent distance (Simard, LeCun and Denker, 1993)\nto improve speed performance. The closest prototypes are found\nby recursively searching included subsets of the database using\ndistances of increasing complexity. This is done by using a hierarchy\nof tangent distances (increasing the number of tangent vectors from\n0 to its maximum) and multiresolution (using wavelets). At each\nstage, a confidence level of the classification is computed. If the\nconfidence is high enough, the computation of more complex\ndistances is avoided. The resulting algorithm applied to character\nrecognition is close to three orders of magnitude faster than\ncomputing the full tangent distance on every prototype.\n\n1 INTRODUCTION\n\nMemory based algorithms such as KNN or Parzen windows have been extensively\nused in pattern recognition. (See (Dasarathy, 1991) for a survey.) Unfortunately,\nthese algorithms often rely on simple distances (such as Euclidean distance,\nHamming distance, etc.). As a result, they suffer from high sensitivity to simple\ntransformations of the input patterns that should leave the classification unchanged (e.g.\ntranslation or scaling for 2D images). 
To make the problem worse, these algorithms\nare further limited by extensive computational requirements due to the large number\nof distance computations. (If no optimization technique is used, the computational\ncost is given in equation 1.)\n\n$$\\text{computational cost} \\approx (\\text{number of prototypes}) \\times (\\text{distance complexity}) \\qquad (1)$$\n\nRecently, the problem of transformation sensitivity has been addressed by the\nintroduction of a locally transformation-invariant metric, the tangent distance (Simard,\nLeCun and Denker, 1993). The basic idea is that instead of measuring the distance\n$d(A, B)$ between two patterns $A$ and $B$, their respective sets of transformations $T_A$\nand $T_B$ are approximated to the first order, and the distance between these two\napproximated sets is computed. Unfortunately, the tangent distance becomes\ncomputationally more expensive as more transformations are taken into consideration,\nwhich makes the speed problem even more severe.\n\nThe good news is that memory based algorithms are well suited for optimization\nusing hierarchies of prototypes, and that this is even more true when the distance\ncomplexity is high. In this paper, we applied these ideas to tangent distance in two\nways: 1) Finding the closest prototype can be done by recursively searching included\nsubsets of the database using distances of increasing complexity. This is done by\nusing a hierarchy of tangent distances (increasing the number of tangent vectors\nfrom 0 to its maximum) and multiresolution (using wavelets). 2) A confidence level\ncan be computed for each distance. If the confidence in the classification is above a\nthreshold early on, there is no need to compute the more expensive distances. The\ntwo methods are described in the next sections. Their application to a real world\nproblem will be shown in the results section. 
\n\n2 FILTERING USING A HIERARCHY OF DISTANCES\n\nOur goal is to compute the distance from one unknown pattern to every prototype\nin a large database in order to determine which one is the closest. It is fairly obvious\nthat some patterns are so different from each other that a very crude approximation\nof our distance can tell us so. There is a wide range of variation in computation time\n(and performance) depending on the choice of the distance. For instance, computing\nthe Euclidean distance on n-pixel images costs n/k times as much as\ncomputing it on k-pixel images.\n\nSimilarly, at a given resolution, computing the tangent distance with m tangent\nvectors is $(m + 1)^2$ times as expensive as computing the Euclidean distance (m = 0\ntangent vectors).\n\nThese observations provided us with a hierarchy of about a dozen different distances\nranging in computation time from 4 multiply/adds (Euclidean distance on a 2 x 2\naveraged image) to 20,000 multiply/adds (tangent distance, 7 tangent vectors, 16 x\n16 pixel images). The resulting filtering algorithm is very straightforward and is\nexemplified in Figure 1.\n\n[Figure 1 diagram: the unknown pattern and the prototypes enter a cascade of stages (Euclidean distance at 2x2, cost 4; Euclidean distance at 4x4, cost 16; and so on up to tangent distance with 14 vectors at 16x16). Each stage is followed by a confidence test, and the cascade outputs the category.]\n\nFigure 1: Pattern recognition using a hierarchy of distances. The filter proceeds\nfrom left (starting with the whole database) to right (where only a few prototypes\nremain). At each stage, distances between prototypes and the unknown pattern are\ncomputed and sorted, and the best candidate prototypes are selected for the next stage.\nAs the complexity of the distance increases, the number of prototypes decreases,\nmaking computation feasible. At each stage a classification is attempted and a\nconfidence score is computed. If the confidence score is high enough, the remaining\nstages are skipped.\n\nThe general idea is to store the database of prototypes several times at different\nresolutions and with different tangent vectors. Each of these resolutions and groups\nof tangent vectors defines a distance $d_i$. These distances are ordered in increasing\naccuracy and complexity. The first distance $d_1$ is computed on all ($K_0$) prototypes\nof the database. The closest $K_1$ patterns are then selected and passed on to the\nnext stage. This process is repeated for each of the distances; i.e. at each stage i,\nthe distance $d_i$ is computed on each of the $K_{i-1}$ patterns selected by the previous stage.\nOf course, the idea is that as the complexity of the distance increases, the number\nof patterns on which this distance must be computed decreases. At the last stage,\nthe most complex and accurate distance is computed on all remaining patterns to\ndetermine the classification.\n\nThe only difficult part is to determine the minimum number $K_i$ of patterns selected at each\nstage for which the filtering does not decrease the overall performance. Note that\nif the last distance used is the most accurate distance, setting all $K_i$ to the number\nof patterns in the database will give optimal performance (at the most expensive\ncost). Increasing $K_i$ always improves the performance in the sense that it allows\ncloser patterns to be found for the next distance measure $d_{i+1}$. The simplest way\nto determine $K_i$ is by selecting a validation set and plotting the performance on this\nvalidation set as a function of $K_i$. The optimal $K_i$ is then determined graphically.\nAn automatic way of computing each $K_i$ is currently being developed.\n\nThis method is very useful when the performance is not degraded by choosing small\n$K_i$. 
In this case, the distance evaluation is done using distance metrics which are\nrelatively inexpensive to compute. The computation cost becomes:\n\n$$\\text{computational cost} \\approx \\sum_i (\\text{number of prototypes at stage } i) \\times (\\text{distance complexity at stage } i) \\qquad (2)$$\n\nCurves showing the performance as a function of the value of $K_i$ will be shown in\nthe results section.\n\n3 PRUNING THE SEARCH USING CONFIDENCE SCORES\n\nIf a confidence score is computed at each stage of the distance evaluation, it is\npossible for certain patterns to avoid completely computing the most expensive\ndistances. In the extreme case, if the Euclidean distance between two patterns is 0,\nthere is really no need to compute the tangent distance. A simple (and crude) way\nto compute a confidence score at a given stage i is to find the closest prototype\n(for distance $d_i$) in each of the possible classes. The distance difference between the\nclosest class and the next closest class gives an approximation of the confidence of\nthis classification. A simple algorithm is then to compare at stage i the confidence\nscore $C_{ip}$ of the current unknown pattern p to a threshold $\\theta_i$, and to stop the\nclassification process for this pattern as soon as $C_{ip} > \\theta_i$. The classification will\nthen be determined by the closest prototype at this stage. The computation time\nwill therefore be different depending on the pattern to be classified. Easy patterns\nwill be recognized very quickly while difficult patterns will need to be compared to\nsome of the prototypes using the most complex distance. The total computation\ncost is therefore:\n\n$$\\text{computational cost} \\approx \\sum_i (\\text{number of prototypes at stage } i) \\times (\\text{distance complexity at stage } i) \\times (\\text{probability to reach stage } i) \\qquad (3)$$\n\nNote that if all $\\theta_i$ are high, the performance is maximized but so is the cost. 
We\ntherefore wish to find the smallest value of $\\theta_i$ which does not degrade the performance\n(increasing $\\theta_i$ always improves the performance). As in the previous section,\nthe simplest way to determine the optimal $\\theta_i$ is graphically with a validation set.\nExamples of curves representing the performance as a function of $\\theta_i$ will be given in\nthe results section.\n\n4 CHOOSING A GOOD HIERARCHY, OPTIMIZATION\n\n4.1 k-d tree\n\nSeveral hierarchies of distance are possible for optimizing the search process. An\nincremental nearest neighbor search algorithm based on k-d trees (Broder, 1990)\nwas implemented. The k-d tree structure was interesting because it can potentially\nbe used with tangent distance. Indeed, since the separating hyperplanes have n-1\ndimensions, they can be made parallel to many tangent vectors at the same time.\nAs many as 36 images of 256 pixels, each with 7 tangent vectors, can be separated\ninto two groups of 18 images by one hyperplane which is parallel to all tangent\nvectors. The searching algorithm thus takes some knowledge of the transformation\ninvariance into account when it computes on which side of each hyperplane the\nunknown pattern is. Of course, when a leaf is reached, the full tangent distance\nmust be computed.\n\nThe problem with the k-d tree algorithm however is that in high dimensional space,\nthe distance from a point to a hyperplane is almost always smaller than the distance\nbetween any pair of points. As a result, the unknown pattern must be compared to\nmany prototypes to have a reasonable accuracy. The speed up factor was comparable\nto our multiresolution approach in the case of Euclidean distance (about 10),\nbut we have not been able to obtain both good performance and high speedup with\nthe k-d tree algorithm applied to tangent distance. This algorithm was not used in\nour final experiments. 
\n\n4.2 Wavelets\n\nOne of the main advantages of the multiresolution approach is that it is easily\nimplemented with wavelet transforms (Mallat, 1989), and that in the wavelet space,\nthe tangent distance is conserved (with orthonormal wavelet bases). Furthermore,\nthe multiresolution decomposition is completely orthogonal to the tangent distance\ndecomposition. In our experiments, the Haar transform was used.\n\n4.3 Hierarchy of tangent distance\n\nMany increasingly accurate approximations can be made for the tangent distance\nat a given resolution. For instance, the tangent distance can be computed by an\niterative process of alternating projections onto the tangent hyperplanes. A hierarchy\nof distances results, derived from the number of projections performed. This\nhierarchy is not very good because the initial projection is already fairly expensive.\nIt is more desirable to have a better efficiency in the first stages since only few\npatterns will be left for the later stages.\n\nOur most successful hierarchy consisted of adding tangent vectors one by one, on\nboth sides. Even though this implies solving a new linear system at each stage,\nthe computational cost is mainly dominated by computing dot products between\ntangent vectors. These dot products are then reused in the subsequent stages to\ncreate larger linear systems (involving more tangent vectors). This hierarchy has the\nadvantage that the first stage is only twice as expensive as, yet much more accurate\nthan, the Euclidean distance. Each subsequent stage brings a lot of accuracy at a\nreasonable cost. (The cost increases more quickly toward the later stages since the cost of solving the\nlinear system grows with the cube of the number of tangent vectors.) In addition,\nthe last stage is exactly the full tangent distance. As we will see in section 5, the\ncost in the final stages is negligible.\n\nObviously, the tangent vectors can be added in different orders. 
We did not try to\nfind the optimal order. For character recognition applications, adding translations\nfirst, followed by hyperbolic deformations, the scalings, the thickness deformations\nand the rotations yielded good performance.\n\n  i  # of T.V.  Reso  # of proto (K_i)  # of prod  Probab  # of mul/add\n  0          0     4              9709          1    1.00        40,000\n  1          0    16              3500          1    1.00        56,000\n  2          0    64               500          1    1.00        32,000\n  3          1    64               125          2    0.90        14,000\n  4          2   256                50          5    0.60        40,000\n  5          4   256                45          7    0.40        32,000\n  6          6   256                25          9    0.20        11,000\n  7          8   256                15         11    0.10         4,000\n  8         10   256                10         13    0.10         3,000\n  9         12   256                 5         15    0.05         1,000\n 10         14   256                 5         17    0.05         1,000\n\nTable 1: Summary computation for the classification of 1 pattern: The first column\nis the distance index, the second column indicates the number of tangent vectors\n(0 for the Euclidean distance), the third column indicates the resolution in\npixels, the fourth is $K_i$ or the number of prototypes on which the distance $d_i$ must\nbe computed, the fifth column indicates the number of additional dot products\nwhich must be computed to evaluate distance $d_i$, the sixth column indicates the\nprobability to not skip that stage after the confidence score has been used, and\nthe last column indicates the total average number of multiply-adds which must be\nperformed (product of columns 3 to 6) at each stage.\n\n4.4 Selecting the k closest out of N prototypes in O(N)\n\nIn the multiresolution filter, at the early stages we must select the k closest prototypes\nfrom a large number of prototypes. This is problematic because the prototypes\ncannot be sorted since O(N log N) is expensive compared to computing N distances\nat very low resolution (like 4 pixels). 
A simple solution consists in using a variation\nof \"quicksort\" or \"finding the k-th element\" (Aho, Hopcroft and Ullman, 1983),\nwhich can select the k closest out of N prototypes in O(N). The generic idea is\nto compute the mean of the distances (an approximation is actually sufficient) and\nthen to split the distances\ninto two halves (of different sizes) according to whether they are smaller or larger\nthan the mean distance. If there are more than k distances smaller than the mean,\nthe process is reiterated on the lower half, otherwise it is reiterated on the upper\nhalf. The process is recursively executed until there is only one distance in each\nhalf. (k is then reached and all the k prototypes in the lower halves are closer to\nthe unknown pattern than all the N - k prototypes in the upper halves.) Note that\nthe elements are not sorted and that only the expected time is O(N), but this is\nsufficient for our problem.\n\n5 RESULTS\n\n[Figure 2 plots: error in % as a function of K (in 1000), and error in % as a function of the % of patterns kept. Curve labels: resolution 16 pixels; resolution 64 pixels; resolution 64 pixels, 1 tangent vector.]\n\nFigure 2: Left: Raw error performance as a function of $K_1$ and $K_2$. The final\nchosen values were $K_1 = 3500$ and $K_2 = 500$. Right: Raw error as a function of\nthe percentage of patterns which have not exceeded the confidence threshold $\\theta_i$.\n100% means all the patterns were passed to the next stage.\n\nA simple task of pattern classification was used to test the filtering. The prototype\nset and the test set consisted respectively of 9709 and 2007 labeled images (16\nby 16 pixels) of handwritten digits. The prototypes were also averaged to lower\nresolutions (2 by 2, 4 by 4 and 8 by 8) and copied to separate databases. 
The 1\nby 1 resolution was not useful for anything. Therefore the fastest distance was the\nEuclidean distance on 2 by 2 images, while the slowest distance was the full tangent\ndistance with 7 tangent vectors for both the prototype and the unknown pattern\n(Simard, LeCun and Denker, 1993). Table 1 summarizes the results.\n\nSeveral observations can be made. First, simple distance metrics are very useful to\neliminate large proportions of prototypes at no cost in performance. Indeed the\nEuclidean distance computed on 2 by 2 images can remove two thirds of the prototypes.\nFigure 2, left, shows the performance as a function of $K_1$ and $K_2$ (2.5% raw error\nwas considered optimal performance). It can be noticed that for $K_i$ above a certain\nvalue, the performance is optimal and constant. The most complex distances (6 and\n7 tangent vectors on each side) need only be computed for 5% of the prototypes.\n\nThe second observation is that the use of a confidence score can greatly reduce the\nnumber of distance evaluations in later stages. For instance the dominant phases of\nthe computation would be with 2, 4 and 6 tangent vectors at resolution 256 if they\nwere not reduced to 60%, 40% and 20% respectively using the confidence scores.\nFigure 2, right, shows the raw error performance as a function of the percentage\nof rejection (confidence lower than $\\theta_i$) at stage i. It can be noticed that above a\ncertain threshold, the performance is optimal and constant. Less than 10% of the\nunknown patterns need the most complex distances (5, 6 and 7 tangent vectors on\neach side) to be computed.\n\n6 DISCUSSION\n\nEven though our method is by no means optimal (the order of the tangent vectors\ncan be changed, intermediate resolutions can be used, etc.
), the overall speed up\nwe achieved was about 3 orders of magnitude (compared with computing the full\ntangent distance on all the patterns). There was no significant decrease in performance.\nThis classification speed is comparable with neural network methods, but\nthe performance is better with tangent distance (2.5% versus 3%). Furthermore\nthe above methods require no learning period, which makes them very attractive for\napplications where the distribution of the patterns to be classified is changing rapidly.\n\nThe hierarchical filtering can also be combined with learning the prototypes using\nalgorithms such as learning vector quantization (LVQ).\n\nReferences\n\nAho, A. V., Hopcroft, J. E., and Ullman, J. D. (1983). Data Structures and Algorithms.\nAddison-Wesley.\n\nBroder, A. J. (1990). Strategies for Efficient Incremental Nearest Neighbor Search.\nPattern Recognition, 23:171-178.\n\nDasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification\nTechniques. IEEE Computer Society Press, Los Alamitos, California.\n\nMallat, S. G. (1989). A Theory for Multiresolution Signal Decomposition: The\nWavelet Representation. IEEE Transactions on Pattern Analysis and Machine\nIntelligence, 11, No. 7:674-693.\n\nSimard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient Pattern Recognition\nUsing a New Transformation Distance. In Neural Information Processing Systems,\nvolume 4, pages 50-58, San Mateo, CA.\n", "award": [], "sourceid": 875, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}]}