{"title": "Transformation Invariant Autoassociation with Application to Handwritten Character Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 992, "page_last": 998, "abstract": null, "full_text": "Transformation Invariant Autoassociation with Application to Handwritten Character Recognition \n\nHolger Schwenk \n\nMaurice Milgram \n\nPARC \n\nUniversite Pierre et Marie Curie \n\ntour 66-56, boite 164 \n\n4, place Jussieu, 75252 Paris cedex 05, France. \n\ne-mail: schwenk@robo.jussieu.fr \n\nAbstract \n\nWhen training neural networks with the classical backpropagation algorithm, the whole problem to be learned must be expressed by a set of inputs and desired outputs. However, we often have high-level knowledge about the learning problem. In optical character recognition (OCR), for instance, we know that the classification should be invariant under a set of transformations like rotation or translation. We propose a new modular classification system based on several autoassociative multilayer perceptrons which allows the efficient incorporation of such knowledge. Results are reported on the NIST database of upper case handwritten letters and compared to other approaches to the invariance problem. \n\n1 INCORPORATION OF EXPLICIT KNOWLEDGE \n\nThe aim of supervised learning is to learn a mapping between the input and the output space from a set of example pairs (input, desired output). The classical implementation in the domain of neural networks is the backpropagation algorithm. If this learning set is sufficiently representative of the underlying data distributions, one hopes that after learning the system is able to generalize correctly to other inputs of the same distribution. \n\n\f992 \n\nHolger Schwenk, Maurice Milgram \n\nIt would be better to have more powerful techniques for incorporating knowledge into the learning process than the mere choice of a set of examples. 
The use of additional knowledge is often limited to the feature extraction module. Besides simple operations like (size) normalization, one can find more sophisticated approaches like Zernike moments in the domain of optical character recognition (OCR). In this paper we will not investigate this possibility; all discussed classifiers work directly on almost unpreprocessed data (pixels). \n\nIn the context of OCR, interest focuses on invariance of the classifier under a number of given transformations (translation, rotation, ...) of the data to classify. In general a neural network could extract those properties from a large enough learning set, but this is very hard to learn and will probably take a lot of time. In recent years two main approaches to this invariance problem have been proposed: tangent-prop and tangent distance. An indirect incorporation can also be achieved by boosting (Drucker, Schapire and Simard, 1993). \n\nIn this paper we briefly discuss these approaches and present a new classification system which allows the efficient incorporation of transformation invariances. \n\n1.1 TANGENT PROPAGATION \n\nThe principle of tangent-prop is to specify, besides the desired outputs, also the desired changes j^μ of the output vector when transforming the net input x by the transformations t_μ (Simard, Victorri, LeCun and Denker, 1992). \n\nFor this, let us define a transformation of pattern p as t(p, α) : P -> P, where P is the space of all patterns and α a parameter. Such transformations are in general highly nonlinear operations in the pixel space P and their analytical expressions are seldom known. It is therefore favorable to use a first order approximation: \n\nt(p, α) ≈ p + α t_p with t_p = ∂t(p, α)/∂α |_{α=0} (1) \n\nt_p is called the tangent vector. This definition can be generalized to c transformations, with α = (α_1, ..., α_c)^T: \n\nt(p, α) ≈ p + α_1 t_p^1 + ... + α_c t_p^c = p + T_p α (2) \n\nwhere T_p is an n x c matrix, each column corresponding to a tangent vector. \n\nLet R(x) denote the function calculated by the network. The desired behavior of the net outputs can be obtained by adding a regularization term E_r to the objective function: \n\nE_r = 1/2 Σ_μ || j^μ - ∂R(t_μ(x, α))/∂α |_{α=0} ||^2 ≈ 1/2 Σ_μ || j^μ - (∂R(x)/∂x) t_x^μ ||^2 (3) \n\nt_x^μ is the tangent vector for transformation t_μ of the input vector x, and ∂R(x)/∂x is the gradient of the network with respect to the inputs. Transformation invariance of the outputs is obtained by setting j^μ = 0, so we want ∂R(x)/∂x to be orthogonal to t_x^μ. \n\nTangent-prop improved the learning time and the generalization on small databases, but its applicability to highly constrained networks (many shared weights) trained on large databases remains unknown. \n\n1.2 TANGENT DISTANCE \n\nAnother class of classifiers are memory based learning methods which rely on distance metrics. Knowledge can be incorporated into such classifiers through a distance measure which is (locally) invariant under a set of specified transformations. \n\n(Simard, LeCun and Denker, 1993) define the tangent distance as the minimal distance between the two hyperplanes spanned by the tangent vectors T_p in point p and T_q in point q: \n\nD(p, q) = min_{α,β} || p + T_p α - q - T_q β ||^2 = || p + T_p α* - q - T_q β* ||^2 (4) \n\nThe optimality condition is that the partial derivatives ∂D/∂α and ∂D/∂β vanish at α* and β*. The values α* and β* minimizing (4) can be computed by solving these two linear systems numerically. \n\n(Simard, LeCun and Denker, 1993) obtained very good results on handwritten digits and letters using tangent distance with a 1-nearest-neighbor classifier (1-nn). 
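The minimization in the tangent distance is a small linear least-squares problem, so a direct implementation is short. The following numpy sketch is our illustration, not the authors' code; the finite-difference translation tangent stands in for the paper's transformations, and all function names are ours:

```python
import numpy as np

def translation_tangent(img, axis):
    """Central-difference approximation of the tangent vector of a
    one-pixel translation along `axis`, flattened to a vector."""
    return ((np.roll(img, 1, axis=axis) - np.roll(img, -1, axis=axis)) / 2.0).ravel()

def tangent_distance(p, q, Tp, Tq):
    """Two-sided tangent distance between flattened patterns p and q.

    Tp and Tq are n x c matrices whose columns are tangent vectors.
    Minimizing ||p + Tp a - q - Tq b||^2 over (a, b) is a linear
    least-squares problem in the stacked coefficients [a; b]."""
    A = np.hstack([Tp, -Tq])                       # [Tp -Tq] [a; b] ~ q - p
    coeffs, *_ = np.linalg.lstsq(A, q - p, rcond=None)
    residual = (p - q) + A @ coeffs
    return float(residual @ residual)
```

For a smooth pattern and its one-pixel translate, this distance is far below the squared Euclidean distance, which is exactly the local invariance the method exploits.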
A big problem of every nn-classifier, however, is that it performs no compilation of the data; it therefore needs numerous reference vectors, resulting in long classification times and high memory usage. \n\nAs reported in (Simard, 1994) and (Sperdutti and Stork, 1995), important improvements are possible, but often a trade-off between speed and memory usage must be made. \n\n2 ARCHITECTURE OF THE CLASSIFIER \n\nThe main idea of our approach is to use an autoassociative multilayer perceptron with a low dimensional hidden layer for each class to be recognized. These networks, called diabolo networks in the following, are trained only with examples of the corresponding class. This can be seen as supervised learning without counter-examples. Each network learns a hidden layer representation which optimally preserves the information of the examples of one class. The learned networks can then be used like discriminant functions: the reconstruction error is in general much lower for examples of the learned class than for the other ones. \n\nIn order to build a classifier we use a decision module which interprets the distances between the reconstructed output vectors and the presented example. In our studies we have so far used a simple minimum operator which assigns the class of the net with the smallest distance (Fig. 1). \n\nThe figure also illustrates typical classification behavior, here when presenting a \"D\" from the test set. One can clearly see that the distance of the network \"D\" is much lower than that of the two other ones. The character is therefore correctly classified. It is also interesting to analyze the outputs of the two networks with the next nearest distances: the network \"O\" tries to output a more rounded character and the network \"B\" wants to add a horizontal bar in the middle. \n\nThe basic classification architecture can be adapted in two ways to the problem to be solved. 
\nOn the one hand we can use different architectures for each diabolo network, e.g. several encoding/decoding layers which allow nonlinear dimension reduction. It is even possible to use shared weights realizing local feature detectors (see (Schwenk and Milgram, 1994) for more details). \n\nOn the other hand we can change the underlying distance measure, as long as its derivatives with respect to the weights can be calculated. This offers a powerful and efficient mechanism to introduce explicit knowledge into the learning algorithm of a neural network. In the discussed case, the recognition of characters represented as pixel images, we can use a transformation invariant distance measure between the net output o and the desired output d (which is of course identical to the net input). The networks then no longer need to learn each example separately; they can use the set of specified transformations in order to find a common nonlinear model of each class. \n\n[Figure: a \"D\" is presented to the diabolo networks; the resulting distance scores are 8.07 for network \"B\", 4.49 for network \"D\" and 8.54 for network \"O\"; pipeline: character to classify -> input vector -> diabolo networks -> output vectors -> distance measures -> decision module] \n\nFigure 1: Basic Architecture of a Diabolo Classifier \n\nThe advantage of this approach, besides a better expected generalization behavior, is a very low additional complexity. In comparison to the original k-nn approach, and presumably to any possible optimization of it, we need to calculate only one distance measure per class to be recognized, regardless of the amount of learning data. \n\nWe used two different versions of the tangent distance with increasing complexity: \n\n1. 
single-sided tangent distance: \n\nD_d(d, o) = min_α || d + T_d α - o ||^2 = || d + T_d α* - o ||^2 (5) \n\nThis is the minimal distance between the hyperplane spanned by the tangent vectors T_d in the input vector d and the untransformed output vector o. \n\n2. double-sided tangent distance: \n\nD_do(d, o) = min_{α,β} || d + T_d α - o*g - T_o β ||^2 (6) \n\nThe convolution of the net output with a Gaussian g is necessary for the computation of the tangent vectors T_o (the net input d is convolved during preprocessing). \n\nFigure 2 shows a graphical comparison of the Euclidean distance with the two tangent distances. \n\n[Figure legend:] \nd: desired output \nt_d: tangent vector in d \no: net output \nt_o: tangent vector in o \nD: Euclidean distance \nD_d: single-sided tangent distance (only d is transformed) \nD_do: double-sided tangent distance (both points are transformed) \n∇D_d: gradient of D_d \n\nFigure 2: Comparison of Euclidean Distance with the Different Tangent Distances \n\nThe major advantage of the single-sided version is that the optimal multipliers α*, and therefore the whole distance, can now be calculated easily in closed form (the double-sided version demands expensive matrix multiplications and the numerical solution of a system of linear equations). The optimality condition ∂D_d(d, o)/∂α = 0^T at α* gives: \n\nα* = (T_d^T T_d)^{-1} T_d^T (o - d) (7) \n\nThe tangent vectors T_d and the matrix (T_d^T T_d)^{-1} can be precomputed and stored in memory. Note that they are the same for all diabolo networks, regardless of their class. \n\n2.1 LEARNING ALGORITHM \n\nWhen using a tangent distance with an autoencoder we must calculate its derivatives with respect to the weights, i.e., after application of the chain rule, with respect to the output vector o. In the case of the single-sided tangent distance we get: \n\n∂D_d(d, o)/∂o = -2 (d + T_d α* - o) (8) \n\nThe resulting learning algorithm is therefore barely more complicated than with the standard Euclidean error. 
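The closed-form single-sided distance and its gradient with respect to the net output fit in a few lines of numpy. This is a sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def single_sided_tangent(d, o, Td):
    """Single-sided tangent distance and its gradient w.r.t. the net output o.

    Td is an n x c matrix of tangent vectors at the desired output d.
    The optimal multipliers are alpha* = (Td^T Td)^{-1} Td^T (o - d); at the
    optimum the alpha terms drop out of the gradient (envelope argument),
    leaving dD/do = -2 (d + Td alpha* - o)."""
    alpha = np.linalg.solve(Td.T @ Td, Td.T @ (o - d))  # closed-form multipliers
    residual = d + Td @ alpha - o                       # points to the tangent plane
    dist = float(residual @ residual)
    grad_o = -2.0 * residual
    return dist, alpha, grad_o
```

By construction the distance never exceeds the squared Euclidean distance, and the returned gradient matches a finite-difference check.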
Furthermore it has a pleasant graphical interpretation: the net output no longer approaches the desired output directly, but takes the shortest way towards the tangent hyperplane (see also fig. 2). \n\nThe derivation of the double-sided tangent distance with respect to the net output is more complicated. In particular we must differentiate the convolution of the net output with a Gaussian as well as the tangent vectors T_o. These equations will be published elsewhere. \n\nTraining of the whole system is stopped when the error on the cross-validation set reaches a minimum. Using stochastic gradient descent, convergence is typically achieved after some ten iterations. \n\n3 APPLICATION TO CHARACTER RECOGNITION \n\nIn 1992 the National Institute of Standards and Technology provided a database of handwritten digits and letters, known under the name NIST Special Database 3. This database contains about 45 000 upper case segmented characters which we have divided into a learning and cross-validation set (60%) and a test set (40%). \n\nWe applied only a very simple preprocessing: the binary characters were centered and size-normalized (the aspect ratio was kept). The net input is 16 x 16 real-valued pixels. \n\n3.1 EXPERIMENTAL RESULTS \n\nAll the following results were obtained with fully connected diabolo networks with one low dimensional hidden layer and a set of eight transformations (x- and y-translation, rotation, scaling, axis-deformation, diagonal-deformation, x- and y-thickness). Figure 3 illustrates how the networks use the transformations. 
\n\n[Figure: two screen dumps from the test set; left example (\"L\"): Euclidean distance 11.1, tangent distance 0.61; right example (\"T\"): Euclidean distance 20.0, tangent distance 0.94] \n\nFigure 3: Reconstruction Examples (test set). The left side of each screen dump depicts the input character and the right side the one reconstructed by the network. In the middle one can see the optimally transformed patterns as calculated when evaluating the double-sided tangent distance, i.e. transformed by α* and β* respectively. \n\nAlthough the \"L\" in the first example has an unusually short horizontal line, the network reconstructs a normally sized character. It is clearly visible how the input transformation lengthens and the output transformation shortens this line in order to get a small tangent distance. The right side shows a very difficult classification problem: a heavily deformed \"T\". Nevertheless we get a small tangent distance, so the character is correctly classified. In summary, we note a big difference between the Euclidean and the tangent distances; this is a good indicator that the networks really use the transformations. \n\nThe performances on the whole test set of about 18 000 characters are summarized in figure 4. For comparison we also give the results of a one-nearest-neighbor classifier on the same test set. The incorporation of knowledge dramatically improved the performance in both cases. The diabolo classifier, for instance, achieves an error rate of 4.7 % with the simple Euclidean distance, which goes down to 3.7 % with the single-sided and to only 2.6 % with the double-sided tangent distance. In order to get the same results with the 1-nn approach, the whole set of 27 000 reference vectors had to be used. 
It is worth noting the results with fewer references: when using only 18 000 reference vectors the error rates increased to 3.7% for the single-sided and to 2.8% for the double-sided version respectively. \n\n[Figure: bar chart of raw error rates in %, per distance measure (Euclidean, one-sided, two-sided); readable values: 1-nn (27 000 refs) 5.3 % Euclidean and 2.5 % two-sided; Diabolo 4.7 % Euclidean, 3.7 % one-sided, 2.6 % two-sided; LeNet 4.0 %] \n\nFigure 4: Raw Error Rate with NIST Upper Case Letters (test set) \n\nIn practical applications we are not only interested in low error rates; we also need low computational costs. An important factor is the recognition speed. The overall processing time of a diabolo classifier using the full tangent distance corresponds to the calculation of about 7 000 Euclidean distances or of less than 50 tangent distances. This should be less than for any algorithm of the k-nn family. If we assume the precalculation of all the tangent vectors and other expensive matrix multiplications, we could evaluate about 80 tangent distances in the same time, but the price would be exploding memory requirements. A diabolo classifier, on the other hand, needs only little memory: the storage of the weights corresponds to about 60 reference vectors per class. On an HP 715/50 workstation we obtained a recognition speed of 7.5 ch/s with the single-sided and of more than 2.5 ch/s with the double-sided tangent distance. We also have a method to combine both by rejection, resulting in up to 4 ch/s at the same low error rates (this corresponds to the calculation of 32 double-sided tangent distances). \n\nFigure 4 also contains the results of a large multilayer perceptron with extensive use of shared weights, known as LeNet. (Drucker, Schapire and Simard, 1993) give an error rate of 4.0% when it is used alone and of 2.4% for an ensemble of three such networks trained by boosting. 
The networks were trained on a basic set of 10 000 examples; the cross-validation and test sets consisted of 2 000 and 3 000 examples respectively (Drucker, personal communication). Due to the different numbers of examples the results are perhaps not exactly comparable, but we can nevertheless deduce that the state of the art on this database seems to be around 2.5 %. \n\n4 DISCUSSION \n\nWe have proposed a new classification architecture that allows the efficient incorporation of knowledge into the learning algorithm. The system is easy to train and only one structural parameter must be chosen by the supervisor: the size of the hidden layer. It achieved state of the art recognition rates on the NIST database of handwritten upper case letters at a very low computational complexity. \n\nFurthermore, a hardware implementation seems promising. Fully connected networks with only two layers are easy to put into standardized hardware chips. We could even propagate all diabolo networks in parallel. Speedups of several orders of magnitude should therefore be possible. \n\nAt this year's NIPS conference several authors presented related approaches. A comparable classification architecture was proposed by (Hinton, Revow and Dayan, 1995). Instead of one nonlinear global model per class, several local linear models are used, obtained by separately performing principal component analysis (PCA) on subsets of each class. Since diabolo networks with one hidden layer and linear activation functions perform PCA, this architecture can be interpreted as a hierarchical diabolo classifier with linear nets and Euclidean distance. Such a hierarchisation could also be done with our classifier, i.e. with tangent distance and sigmoidal units, and might improve the results even further. 
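The correspondence noted above between linear diabolo networks and per-class PCA is easy to demonstrate. The toy numpy sketch below (our own, hypothetical names; not the system of Hinton, Revow and Dayan) builds one linear "diabolo" per class and classifies by minimal reconstruction error:

```python
import numpy as np

class LinearDiabolo:
    """Per-class linear 'diabolo' model: reconstruction onto the span of the
    top-k principal components, which is what a linear autoencoder with a
    k-unit hidden layer converges to."""
    def __init__(self, k):
        self.k = k

    def fit(self, X):
        self.mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.W = Vt[:self.k]                      # k x n encoding matrix
        return self

    def reconstruction_error(self, x):
        z = self.W @ (x - self.mean)              # bottleneck code
        x_hat = self.mean + self.W.T @ z          # linear reconstruction
        return float(np.sum((x - x_hat) ** 2))

def classify(models, x):
    """Decision module: pick the class whose model reconstructs x best."""
    return int(np.argmin([m.reconstruction_error(x) for m in models]))
```

With two synthetic classes living in different linear subspaces, each model reconstructs its own class almost perfectly and the minimum operator recovers the correct label.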
\n\n(Hastie, Simard and Sackinger, 1995) developed an iterative algorithm that learns optimal reference vectors in the sense of the tangent distance. An extension also allows learning typical invariant transformations, i.e. tangent vectors, of each class. These two algorithms allow a drastic reduction of the number of reference vectors, but the error rates of the original approach could no longer be attained. \n\nAcknowledgements \n\nThe first author is supported by the German Academic Exchange Service under grant HSP II 516.006.512.3. The simulations were performed with the Aspirin/MIGRAINES neural network simulator developed by the MITRE Corporation. \n\nReferences \n\nH. Drucker, R. Schapire, and P. Simard (1993), \"Boosting performance in neural networks,\" Int. Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 4, pp. 705-719. \n\nT. Hastie, P. Simard, and E. Sackinger (1995), \"Learning prototype models for tangent distance,\" in NIPS 7 (G. Tesauro, D. Touretzky, and T. Leen, eds.), Morgan Kaufmann. \n\nG. Hinton, M. Revow, and P. Dayan (1995), \"Recognizing handwritten digits using mixtures of linear models,\" in NIPS 7 (G. Tesauro, D. Touretzky, and T. Leen, eds.), Morgan Kaufmann. \n\nH. Schwenk and M. Milgram (1994), \"Structured diabolo-networks for handwritten character recognition,\" in International Conference on Artificial Neural Networks, pp. 985-988, Springer-Verlag. \n\nP. Simard, B. Victorri, Y. LeCun, and J. Denker (1992), \"Tangent prop - a formalism for specifying selected invariances in an adaptive network,\" in NIPS 4 (J. Moody, S. Hanson, and R. Lippmann, eds.), pp. 895-903, Morgan Kaufmann. \n\nP. Simard, Y. LeCun, and J. Denker (1993), \"Efficient pattern recognition using a new transformation distance,\" in NIPS 5 (S. Hanson, J. Cowan, and C. Giles, eds.), pp. 50-58, Morgan Kaufmann. \n\nP. 
Simard (1994), \"Efficient computation of complex distance measures using hierarchical filtering,\" in NIPS 6 (J. D. Cowan, G. Tesauro, and J. Alspector, eds.), pp. 50-58, Morgan Kaufmann. \n\nA. Sperdutti and D. G. Stork (1995), \"A rapid graph-based method for arbitrary transformation invariant pattern classification,\" in NIPS 7 (G. Tesauro, D. Touretzky, and T. Leen, eds.), Morgan Kaufmann. \n\n\f", "award": [], "sourceid": 961, "authors": [{"given_name": "Holger", "family_name": "Schwenk", "institution": null}, {"given_name": "Maurice", "family_name": "Milgram", "institution": null}]}