{"title": "Learning Prototype Models for Tangent Distance", "book": "Advances in Neural Information Processing Systems", "page_first": 999, "page_last": 1006, "abstract": null, "full_text": "Learning Prototype Models for Tangent \n\nDistance \n\nTrevor Hastie\u00b7 \n\nStatistics Department \n\nSequoia Hall \n\nStanford University \nStanford, CA 94305 \n\nemail: trevor@playfair .stanford .edu \n\nPatrice Simard \n\nAT&T Bell Laboratories \nCrawfords Corner Road \n\nHolmdel, NJ 07733 \n\nemail: patrice@neural.att.com \n\nEduard Siickinger \n\nAT &T Bell Laboratories \nCrawfords Corner Road \n\nHolmdel, NJ 07733 \n\nemail: edi@neural.att.com \n\nAbstract \n\nSimard, LeCun & Denker (1993) showed that the performance of \nnearest-neighbor classification schemes for handwritten character \nrecognition can be improved by incorporating invariance to spe(cid:173)\nthe so \ncific transformations in the underlying distance metric -\ncalled tangent distance. The resulting classifier, however, can be \nprohibitively slow and memory intensive due to the large amount of \nprototypes that need to be stored and used in the distance compar(cid:173)\nisons. In this paper we develop rich models for representing large \nsubsets of the prototypes. These models are either used singly per \nclass, or as basic building blocks in conjunction with the K-means \nclustering algorithm. \n\n*This work was performed while Trevor Hastie was a member of the Statistics and Data \n\nAnalysis Research Group, AT&T Bell Laboratories, Murray Hill, NJ 07974. \n\n\fJ 000 \n\nTrevor Hastie, Patrice Simard, Eduard Siickinger \n\n1 \n\nINTRODUCTION \n\nLocal algorithms such as K-nearest neighbor (NN) perform well in pattern recogni(cid:173)\ntion, even though they often assume the simplest distance on the pattern space. It \nhas recently been shown (Simard et al. 
1993) that the performance can be further improved by incorporating invariance to specific transformations in the underlying distance metric, the so-called tangent distance. The resulting classifier, however, can be prohibitively slow and memory intensive due to the large number of prototypes that need to be stored and used in the distance comparisons. \n\nIn this paper we address this problem for the tangent distance algorithm by developing rich models for representing large subsets of the prototypes. Our leading example of a prototype model is a low-dimensional (12) hyperplane defined by a point and a set of basis or tangent vectors. The components of these models are learned from the training set, chosen to minimize the average tangent distance from a subset of the training images; as such they are similar in flavor to the Singular Value Decomposition (SVD), which finds closest hyperplanes in Euclidean distance. These models are either used singly per class, or as basic building blocks in conjunction with K-means and LVQ. Our results show that not only are the models effective, but they also have meaningful interpretations. In handwritten character recognition, for instance, the main tangent vector learned for the digit \"2\" corresponds to addition/removal of the loop at the bottom left corner of the digit; for the \"9\", to the fatness of the circle. We can therefore think of some of these learned tangent vectors as representing additional invariances derived from the training digits themselves. Each learned prototype model therefore represents very compactly a large number of prototypes of the training set. \n\n2 OVERVIEW OF TANGENT DISTANCE \n\nWhen we look at handwritten characters, we easily allow for simple transformations such as rotations, small scalings, location shifts, and changes in character thickness when identifying the character.
Any reasonable automatic scheme should similarly be insensitive to such changes. \nSimard et al. (1993) finessed this problem by generating a parametrized 7-dimensional manifold for each image, where each parameter accounts for one such invariance. Consider a single invariance dimension: rotation. If we were to rotate the image by an angle \u03b8 prior to digitization, we would see roughly the same picture, just slightly rotated. Our images are 16 x 16 grey-scale pixelmaps, which can be thought of as points in a 256-dimensional Euclidean space. The rotation operation traces out a smooth one-dimensional curve X_i(\u03b8) with X_i(0) = X_i, the image itself. Instead of measuring the distance between two images as D(X_i, X_j) = ||X_i - X_j|| (for any norm ||\u00b7||), the idea is to use instead the rotation-invariant D_I(X_i, X_j) = min_{\u03b8_i, \u03b8_j} ||X_i(\u03b8_i) - X_j(\u03b8_j)||. Simard et al. (1993) used 7 dimensions of invariance, accounting for horizontal and vertical location and scale, rotation, shear, and character thickness. \nComputing the manifold exactly is impossible, given a digitized image, and would be impractical anyway. They approximated the manifold instead by its tangent plane at the image itself, leading to the tangent model X_i(\u03b8) = X_i + T_i\u03b8 and the tangent distance D_T(X_i, X_j) = min_{\u03b8_i, \u03b8_j} ||X_i(\u03b8_i) - X_j(\u03b8_j)||. Here we use \u03b8 for the 7-dimensional parameter, and for convenience drop the tilde. The approximation is valid locally, and thus permits local transformations. Non-local transformations are not interesting anyway (we don't want to flip 6s into 9s, or shrink all digits down to nothing). See Sackinger (1992) for further details.
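As a concrete illustration, the tangent distance just defined reduces to a small least-squares problem. Below is a minimal numpy sketch (ours, not the authors' code); the tangent matrices Ti and Tj are assumed to have been precomputed for each image:

```python
import numpy as np

def tangent_distance(xi, xj, Ti, Tj):
    # Two-sided tangent distance: minimize over (ti, tj) the norm
    # ||(xi + Ti @ ti) - (xj + Tj @ tj)||, which is a least-squares
    # regression of the response xi - xj on the predictors (-Ti : Tj).
    A = np.hstack([-Ti, Tj])
    b = xi - xj
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.linalg.norm(b - A @ coef))
```

In the setting of the paper each image would be a flattened 256-vector and each tangent matrix 256 x 7; the sketch works for any dimensions.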
If ||\u00b7|| is the Euclidean norm, computing the tangent distance is a simple least-squares problem, with solution the square root of the residual sum-of-squares in the regression with response X_i - X_j and predictors (-T_i : T_j). \n\nSimard et al. (1993) used D_T to drive a 1-NN classification rule, and achieved the best rate so far, 2.6%, on the official test set (2007 examples) of the USPS database. Unfortunately, 1-NN is expensive, especially when the distance function is non-trivial to compute; for each new image classified, one has to compute the tangent distance to each of the training images, and then classify as the class of the closest. Our goal in this paper is to reduce the training set dramatically to a small set of prototype models; classification is then performed by finding the closest prototype. \n\n3 PROTOTYPE MODELS \n\nIn this section we explore some ideas for generalizing the concept of a mean or centroid for a set of images, taking into account the tangent families. Such a centroid model can be used on its own, or else as a building block in a K-means or LVQ algorithm at a higher level. We will interchangeably refer to the images as points (in 256-space). \nThe centroid of a set of N points in d dimensions minimizes the average squared norm from the points: \n\nM = argmin_M \u2211_{i=1}^N ||X_i - M||^2   (1) \n\n3.1 TANGENT CENTROID \n\nOne could generalize this definition and ask for the point M that minimizes the average squared tangent distance: \n\nM_T = argmin_M \u2211_{i=1}^N D_T(X_i, M)^2   (2) \n\nThis appears to be a difficult optimization problem, since computation of tangent distance requires not only the image M but also its tangent basis T_M. Thus the criterion to be minimized is \n\n\u2211_{i=1}^N min_{\u03b3_i, \u03b8_i} ||M + T(M)\u03b3_i - X_i(\u03b8_i)||^2 \n\nwhere T(M) produces the tangent basis from M.
All but the location tangent vectors are nonlinear functionals of M, and even without this nonlinearity, the problem to be solved is a difficult inverse functional. Fortunately a simple iterative procedure is available, where we iteratively average the closest points (in tangent distance) to the current guess. \n\nTangent Centroid Algorithm \n\nInitialize: Set M = (1/N) \u2211_{i=1}^N X_i, let T_M = T(M) be the derived set of tangent vectors, and D = \u2211_i D_T(X_i, M). Denote the current tangent centroid (tangent family) by M(\u03b3) = M + T_M\u03b3. \n\nIterate: 1. For each i find the \u03b3_i and \u03b8_i that solve min_{\u03b3, \u03b8} ||M + T_M\u03b3 - X_i(\u03b8)||. \n2. Set M \u2190 (1/N) \u2211_{i=1}^N (X_i(\u03b8_i) - T_M\u03b3_i) and compute the new tangent subspace T_M = T(M). \n3. Compute D = \u2211_i D_T(X_i, M). \n\nUntil: D converges. \n\nNote that the first step in Iterate is available from the computations in the third step. The algorithm divides the parameters into two sets: M in the one, and T_M together with the \u03b3_i and \u03b8_i for each i in the other. It alternates between the two sets, although the computation of T_M given M is not the solution of an optimization problem. It seems very hard to say anything precise about the convergence or behavior of this algorithm, since the tangent vectors depend on each iterate in a nonlinear way. Our experience has always been that it converges fairly rapidly (< 6 iterations). A potential drawback of this algorithm is that the T_M are not learned, but are implicit in M. \n\n3.2 TANGENT SUBSPACE \n\nRather than define the model as a point and have it generate its own tangent subspace, we can include the subspace as part of the parametrization: M(\u03b3) = M + V\u03b3. Then we define this tangent subspace model as the minimizer of \n\nMS(M, V) = \u2211_{i=1}^N min_{\u03b3_i, \u03b8_i} ||M + V\u03b3_i - X_i(\u03b8_i)||^2   (3) \n\nover M and V. Note that V can have an arbitrary number 0 \u2264 r \u2264 256 of columns, although it does not make sense for r to be too large.
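The inner minimization over (gamma_i, theta_i) in both (2) and (3), i.e. finding the closest pair of points between the model family M + V gamma (with V = T_M for the tangent centroid) and an image family X_i(theta), is again a linear least-squares problem. A minimal numpy sketch, under our own naming and with the image's tangent matrix Ti assumed precomputed:

```python
import numpy as np

def match_to_model(M, V, xi, Ti):
    # Solve min over (g, t) of ||M + V @ g - (xi + Ti @ t)||:
    # least squares with response xi - M and predictors (V : -Ti).
    A = np.hstack([V, -Ti])
    sol, *_ = np.linalg.lstsq(A, xi - M, rcond=None)
    g, t = sol[:V.shape[1]], sol[V.shape[1]:]
    return g, t, float(np.linalg.norm(xi - M - A @ sol))
```

The returned residual is the tangent distance from the image family to the model; the fitted g and t are the quantities averaged in the update steps of the two algorithms.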
An iterative algorithm similar to the tangent centroid algorithm is available, which hinges on the SVD for fitting affine subspaces to a set of points. We briefly review the SVD in this context. \nLet X be the N x 256 matrix with rows the vectors X_i - X\u0304, where X\u0304 = (1/N) \u2211_{i=1}^N X_i. Then SVD(X) = U D V^T is a unique decomposition, with U (N x R) and V (256 x R) the orthonormal left and right matrices of singular vectors, and R = rank(X). D (R x R) is a diagonal matrix of decreasing positive singular values. A pertinent property of the SVD is: \n\nConsider finding the closest affine, rank-r subspace to a set of points, or \n\nmin_{M, V^(r), {\u03b3_i}} \u2211_{i=1}^N ||X_i - M - V^(r)\u03b3_i||^2 \n\nwhere V^(r) is 256 x r orthonormal. The solution is given by the SVD above, with M = X\u0304 and V^(r) the first r columns of V, and the total squared distance \u2211_{j=r+1}^R D_jj^2. \n\nThe columns of V^(r) are also the largest r principal components, or eigenvectors of the covariance matrix of the X_i. They give in sequence directions of maximum spread, and for a given digit class can be thought of as class-specific invariances. \nWe now present our tangent subspace algorithm for solving (3); for convenience we assume V is rank r for some chosen r, and drop the superscript. \n\nTangent subspace algorithm \n\nInitialize: Set M = (1/N) \u2211_{i=1}^N X_i and let V correspond to the first r right singular vectors of X. Set D = \u2211_{j=r+1}^R D_jj^2, and let the current tangent subspace model be M(\u03b3) = M + V\u03b3. \n\nIterate: 1. For each i find the \u03b3_i and \u03b8_i that solve min_{\u03b3, \u03b8} ||M(\u03b3) - X_i(\u03b8)||. \n2. Set M \u2190 (1/N) \u2211_{i=1}^N X_i(\u03b8_i) and replace the rows of X by X_i(\u03b8_i) - M. Compute the SVD of X, and replace V by the first r right singular vectors. \n3. Compute D = \u2211_{j=r+1}^R D_jj^2. \n\nUntil: D converges.
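Step 2 above is exactly the quoted SVD property, applied to the matched points X_i(theta_i). A minimal numpy sketch of that step (ours, not the authors' code):

```python
import numpy as np

def fit_affine_subspace(X, r):
    # Closest rank-r affine subspace to the rows of X in Euclidean
    # distance: center at the mean, keep the first r right singular
    # vectors.  rss is the total squared distance, i.e. the sum of
    # the squared discarded singular values.
    M = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - M, full_matrices=False)
    V = Vt[:r].T                     # d x r orthonormal basis
    rss = float((s[r:] ** 2).sum())
    return M, V, rss
```

For points lying exactly on an r-dimensional affine subspace the residual rss is zero and the projections M + (X - M) V V^T reproduce the points.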
\n\nThe algorithm alternates between i) finding for each image the closest point in its tangent family to the current tangent subspace model, and ii) computing the SVD of these closest points. Each step of the alternation decreases the criterion, which is positive and hence converges to a stationary point of the criterion. In all our examples we found that 12 complete iterations were sufficient to achieve a relative convergence ratio of 0.001. \nOne advantage of this approach is that we need not restrict ourselves to a seven-dimensional V; indeed, we have found that 12 dimensions produced the best results. The basis vectors found for each class are interesting to view as images. Figure 1 shows some examples of the basis vectors found, and what kinds of invariances in the images they account for. These are digit-specific features; for example, a prominent basis vector for the family of 2s accounts for big versus small loops. Each of the examples shown accounts for a similar digit-specific invariance. None of these changes are accounted for by the 7-dimensional tangent models, which were chosen to be digit nonspecific. \n\nFigure 1: Each column corresponds to a particular tangent subspace basis vector for the given digit. The top image is the basis vector itself, and the remaining 3 images correspond to the 0.1, 0.5 and 0.9 quantiles of the projection indices for the training data for that basis vector, showing a range of image models for that basis, keeping all the others at 0. \n\n4 SUBSPACE MODELS AND K-MEANS CLUSTERING \n\nA natural extension of these single prototype-per-class models is to use them as centroid modules in a K-means algorithm. The extension is straightforward, and space permits only a rough description. Given an initial partition of the images in a class into K sets: \n\n1.
Fit a separate prototype model to each of the subsets; \n2. Redefine the partition based on closest tangent distance to the prototypes found in step 1. \n\nIn a similar way the tangent centroid or subspace models can be used to seed LVQ algorithms (Kohonen 1989), but so far we do not have much experience with them. \n\n5 RESULTS \n\nTable 1 summarizes the results for some of these models. The first two lines correspond to an SVD model for the images fit by ordinary least squares rather than least tangent squares. The first line classifies using Euclidean distance to this model, the second using tangent distance. Line 3 fits a single 12-dimensional tangent subspace model per class, while lines 4 and 5 use 12-dimensional tangent subspaces as cluster centers within each class. \n\nTable 1: Test errors for a variety of situations. In all cases the training data were 7291 USPS handwritten digits, and the test data the \"official\" 2007 USPS test digits. Each entry describes the model used in each class, so for example in row 5 there are 5 models per class, hence 50 in all. \n\n  Prototype                 Metric     # Prototypes/Class  Error Rate \n0 1-NN                      Euclidean  ~ 700               0.053 \n1 12-dim SVD subspace       Euclidean  1                   0.055 \n2 12-dim SVD subspace       Tangent    1                   0.045 \n3 12-dim Tangent subspace   Tangent    1                   0.041 \n4 12-dim Tangent subspace   Tangent    3                   0.038 \n5 12-dim Tangent subspace   Tangent    5                   0.038 \n6 Tangent centroid          Tangent    20                  0.038 \n7 (4) \u222a (6)                 Tangent    23                  0.034 \n8 1-NN                      Tangent    ~ 700               0.026 \n\nWe tried other dimensions in a variety of settings, but 12 seemed to be generally the best. Line 6 corresponds to the tangent centroid model used as the centroid in a 20-means cluster model per class; its performance is comparable with K = 3 for the subspace model. Line 7 combines 4 and 6, and reduces the error even further.
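The two-step alternation of Section 4 can be sketched generically (our sketch, not the authors' code; fit_prototype and dist are hypothetical stand-ins for a tangent model fit and the tangent distance):

```python
import numpy as np

def kmeans_prototypes(X, K, fit_prototype, dist, n_iter=10):
    # X: N x d images of one class.  fit_prototype maps a subset of
    # rows to a prototype model; dist(x, proto) is the distance from
    # an image to a prototype.  Empty-cluster handling is omitted.
    labels = np.arange(len(X)) % K   # simple round-robin initial partition
    for _ in range(n_iter):
        # 1. fit a separate prototype model to each subset
        protos = [fit_prototype(X[labels == k]) for k in range(K)]
        # 2. redefine the partition by closest distance to a prototype
        labels = np.array([np.argmin([dist(x, p) for p in protos])
                           for x in X])
    return protos, labels
```

With fit_prototype the ordinary mean and dist the Euclidean norm this is plain K-means; plugging in the tangent subspace fit and tangent distance gives the clustering used in lines 4-7 of Table 1.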
These limited experiments suggest that the tangent subspace model is preferable, since it is more compact and the algorithm for fitting it rests on firmer theoretical grounds. \nFigure 2 shows some of the misclassified examples in the test set. Despite all the matching, it seems that Euclidean distance still fails us in the end in some of these cases. \n\n6 DISCUSSION \n\nGold, Mjolsness & Rangarajan (1994) independently had the idea of using \"domain specific\" distance measures to seed K-means clustering algorithms. Their setting was slightly different from ours, and they did not use subspace models. The idea of classifying points to the closest subspace is found in the work of Oja (1989), but of course not in the context of tangent distance. \nWe are using Euclidean distance in conjunction with tangent distance. Since neighboring pixels are correlated, one might expect that a metric that accounted for the correlation might do better. We tried several variants using Mahalanobis metrics in different ways, but with no success. We also tried to incorporate information about where the images project in the tangent subspace models into the classification rule. We thus computed two distances: 1) tangent distance to the subspace, and 2) Mahalanobis distance within the subspace to the centroid for the subspace. Again the best performance was attained by ignoring the latter distance. \nIn conclusion, learning tangent centroid and subspace models is an effective way \n\nFigure 2 panel labels: true 6, 2, 5, 2, 9, 4; pred. proj. 0, 0, 8, 0, 4, 7. \n\nFigure 2: Some of the errors for the test set corresponding to line (3) of Table 1. Each case is displayed as a column of three images.
The top is the true image, the middle the tangent projection of the true image onto the subspace model of its class, and the bottom the tangent projection of the image onto the winning class. The models are sufficiently rich to allow distortions that can fool Euclidean distance. \n\nto reduce the number of prototypes (and thus the cost in speed and memory) at a slight expense in performance. In the extreme case, as few as one 12-dimensional tangent subspace per class together with the tangent distance is enough to outperform classification using ~ 700 prototypes per class and the Euclidean distance (4.1% versus 5.3% on the test data). \n\nReferences \n\nGold, S., Mjolsness, E. & Rangarajan, A. (1994), Clustering with a domain specific distance measure, in 'Advances in Neural Information Processing Systems', Morgan Kaufmann, San Mateo, CA. \n\nKohonen, T. (1989), Self-Organization and Associative Memory (3rd edition), Springer-Verlag, Berlin. \n\nOja, E. (1989), 'Neural networks, principal components, and subspaces', International Journal of Neural Systems 1(1), 61-68. \n\nSackinger, E. (1992), Recurrent networks for elastic matching in pattern recognition, Technical report, AT&T Bell Laboratories. \n\nSimard, P. Y., LeCun, Y. & Denker, J. (1993), Efficient pattern recognition using a new transformation distance, in 'Advances in Neural Information Processing Systems', Morgan Kaufmann, San Mateo, CA, pp. 50-58. \n", "award": [], "sourceid": 939, "authors": [{"given_name": "Trevor", "family_name": "Hastie", "institution": null}, {"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "Eduard", "family_name": "S\u00e4ckinger", "institution": null}]}