{"title": "Feature Densities are Required for Computing Feature Correspondences", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 968, "abstract": null, "full_text": "Feature Densities are Required for Computing Feature Correspondences \n\nSubutai Ahmad \n\nInterval Research Corporation \n\n1801-C Page Mill Road, Palo Alto, CA 94304 \n\nE-mail: ahmad@interval.com \n\nAbstract \n\nThe feature correspondence problem is a classic hurdle in visual object recognition, concerned with determining the correct mapping between the features measured from the image and the features expected by the model. In this paper we show that determining good correspondences requires information about the joint probability density over the image features. We propose \"likelihood based correspondence matching\" as a general principle for selecting optimal correspondences. The approach is applicable to non-rigid models, allows nonlinear perspective transformations, and can optimally deal with occlusions and missing features. Experiments with rigid and non-rigid 3D hand gesture recognition support the theory. The likelihood based techniques show almost no decrease in classification performance when compared to performance with perfect correspondence knowledge. \n\n1 INTRODUCTION \n\nThe ability to deal with missing information is crucial in model-based vision systems. The feature correspondence problem is an example where the correct mapping between image features and model features is unknown at recognition time. For example, imagine a network trained to map fingertip locations to hand gestures. Given features extracted from an image, it becomes important to determine which features correspond to the thumb, to the index finger, etc. so that we know which input units to clamp with which numbers. 
Success at the correspondence matching step is vital for correct classification. There has been much previous work on this topic (Connell and Brady 1987; Segen 1989; Huttenlocher and Ullman 1990; Pope and Lowe 1993) but a general solution has eluded the vision community. In this paper we propose a novel approach based on maximizing the probability of a set of models generating the given data. We show that neural networks trained to estimate the joint density between image features can be successfully used to recover the optimal correspondence. Unlike other techniques, the likelihood based approach is applicable to non-rigid models, allows perspective 3D transformations, and includes a principled method for dealing with occlusions and missing features. \n\nFigure 1: An example 2D feature space. Shaded regions denote high probability. Given measured values of 0.2 and 0.9, the points P1 and P2 denote possible instantiations but P1 is much more likely. \n\n1.1 A SIMPLE EXAMPLE \n\nConsider the idealized example depicted in Figure 1. The distribution of features is highly non-uniform (this is typical of non-rigid objects). The classification boundary is in general completely unrelated to the feature distribution. In this case, the class (posterior) probability approaches 1 as feature X1 approaches 0, and 0 as it approaches 1. Now suppose that two feature values 0.2 and 0.9 are measured from an image. The task is to decide which value gets assigned to X1 and which value gets assigned to X2. A common strategy is to select the correspondence which gives the maximal network output (i.e. maximal posterior probability). In this example (and in general) such a strategy will pick point P2, the wrong correspondence. 
This is because the classifier output represents the probability of a class given a specific feature assignment and specific values. The correspondence problem, however, is something completely different: it deals with the probability of getting the feature assignments and values in the first place. \n\n2 LIKELIHOOD BASED CORRESPONDENCE MATCHING \n\nWe can formalize the intuitive arguments in the previous section. Let C denote the set of classes under consideration. Let X denote the list of features measured from the image with correspondences unknown. Let A be the set of assignments of the measured values to the model features. Each assignment a in A reflects a particular choice of feature correspondences. We consider two different problems: the task of choosing the best assignment a and the task of classifying the object given X. \n\nSelecting the best correspondence is equivalent to selecting the permutation that maximizes p(a|X, C). This can be re-written as: \n\np(a|X, C) = p(X|a, C) p(a|C) / p(X|C)    (1) \n\np(X|C) is a normalization factor that is constant across all a and can be ignored. Let x_a denote a specific feature vector constructed by applying permutation a to X. Then (1) is equivalent to maximizing: \n\np(x_a|C) p(a|C)    (2) \n\np(a|C) denotes our prior knowledge about possible correspondences. (For example, the knowledge that edge features cannot be matched to color features.) When no prior knowledge is available this term is constant. We denote the assignment that maximizes (2) the maximum likelihood correspondence match. Such a correspondence maximizes the probability that a set of visual models generated a given set of image features and will be the optimal correspondence in a Bayesian sense. 
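A maximum likelihood correspondence match under a uniform prior p(a|C) can be sketched as follows. This is a minimal illustration, not the trained networks of the experiments: the single-Gaussian density is a hypothetical stand-in for a learned density over the Figure 1 feature space, with means chosen so that feature X1 is expected near 0.1 and X2 near 0.8.

```python
import itertools
import math

# Hypothetical learned density p(x | C): independent Gaussians whose
# means reflect Figure 1 (feature X1 expected near 0.1, X2 near 0.8).
MEANS = (0.1, 0.8)
SIGMAS = (0.1, 0.1)

def density(x):
    """Joint density p(x | C) under the toy model."""
    p = 1.0
    for xi, mu, sd in zip(x, MEANS, SIGMAS):
        p *= math.exp(-((xi - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))
    return p

def ml_correspondence(measured):
    """Equation (2) with uniform p(a | C): evaluate every assignment of
    measured values to model features and keep the most likely one."""
    return max(itertools.permutations(measured), key=density)

print(ml_correspondence([0.9, 0.2]))  # -> (0.2, 0.9), i.e. point P1
```

This selects point P1 of Figure 1, whereas the maximal-network-output strategy described in section 1.1 would select P2.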
\n\n2.1 CLASSIFICATION \n\nIn addition to computing correspondences, we would like to classify a model from the measured image features, i.e. compute p(C_i|X, C). The maximal-output based solution is equivalent to selecting the class C_i that maximizes p(C_i|x_a, C) over all assignments a and all classes C_i. It is easy to see that the optimal strategy is actually to compute the following weighted estimate over all candidate assignments: \n\np(C_i|X, C) = [sum_a p(C_i|X, a, C) p(X|a, C) p(a|C)] / p(X|C)    (3) \n\nClassification based on (3) is equivalent to selecting the class that maximizes: \n\nsum_a p(C_i|x_a, C) p(x_a|C) p(a|C)    (4) \n\nNote that the network output based solution represents quite a degraded estimate of (4). It considers neither the input density nor a weighting over possible correspondences. A reasonable approximation is to select the maximum likelihood correspondence according to (2) and then use this feature vector in the classification network. This is suboptimal since the weighting is not done, but in our experience it yields results that are very close to those obtained with (4). \n\n3 COMPUTING CORRESPONDENCES WITH GBF NETWORKS \n\nIn order to compute (2) and (4) we consider networks of normalized Gaussian basis functions (GBF networks). The i'th output unit is computed as: \n\ny_i(x) = sum_j w_ji b_j(x) / sum_k b_k(x)    (5) \n\nwith: \n\nb_j(x) = pi_j exp(-sum_d (x_d - mu_jd)^2 / (2 sigma_jd^2)) / prod_d (2 pi sigma_jd^2)^(1/2) \n\nHere each basis function j is characterized by a mean vector mu_j and by sigma_j^2, a vector representing the diagonal covariance matrix. w_ji represents the weight from the j'th Gaussian to the i'th output. pi_j is a weight attached to each basis function. \n\nSuch networks have been popular recently and have proven to be useful in a number of applications (e.g. (Roscheisen et al. 1992; Poggio and Edelman 1990)). For our current purpose, these networks have a number of advantages. Under certain training regimes such as EM or \"soft clustering\" (Dempster et al. 
1977; Nowlan 1990) or an approximation such as K-means (Neal and Hinton 1993), the basis functions adapt to represent local probability densities. In particular p(x_a|C) ≈ sum_j b_j(x_a). If standard error gradient training is used to set the weights w_ji then y_i(x_a) ≈ p(C_i|x_a, C). Thus both (2) and (4) can be easily computed. (Ahmad and Tresp 1993) showed that such networks can effectively learn feature density information for complex visual problems. (Poggio and Edelman 1990) have also shown that similar networks (with a different training regime) can learn to approximate the complex mappings that arise in 3D recognition. \n\n3.1 OPTIMAL CORRESPONDENCE MATCHING WITH OCCLUSION \n\nAn additional advantage of GBF networks trained in this way is that it is possible to obtain closed form solutions to the optimal classifier in the presence of missing or noisy features. It is also possible to correctly compute the probability of feature vectors containing missing dimensions. The solution consists of projecting each Gaussian onto the non-missing dimensions and evaluating the resulting network. Note that it is incorrect to simply substitute zero or any other single value for the missing dimensions. (For lack of space we refer the reader to (Ahmad and Tresp 1993) for further details.) \n\nFigure 2: Classifiers were trained to recognize these 7 gestures: \"five\", \"four\", \"three\", \"two\", \"one\", \"thumbs-up\", \"pointing\". A 3D computer model of the hand is used to generate images of the hand in various poses. For each training example, we randomly choose a 3D orientation and depth, compute the 3D positions of the fingertips and project them onto 2D. There were 5 features yielding a 10D input space. \n\n
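The Gaussian-projection scheme of section 3.1 can be sketched as follows. This is a minimal illustration with hypothetical toy parameters, not the networks used in the experiments: dimensions marked as missing are marginalized out simply by omitting their factors from each diagonal-covariance Gaussian.

```python
import math

def gauss1d(x, mu, sd):
    """One-dimensional Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def mixture_density(x, means, sigmas, priors):
    """p(x | C) under a mixture of axis-aligned Gaussians (the basis
    functions b_j). Entries of x equal to None are missing dimensions;
    projecting each Gaussian onto the observed dimensions amounts to
    skipping the corresponding 1D factors."""
    total = 0.0
    for mu, sd, pi_j in zip(means, sigmas, priors):
        b = pi_j
        for xd, md, sdd in zip(x, mu, sd):
            if xd is not None:  # marginalize out missing dimensions
                b *= gauss1d(xd, md, sdd)
        total += b
    return total

# Toy two-component model (all numbers hypothetical).
means = [[0.1, 0.8], [0.7, 0.3]]
sigmas = [[0.1, 0.1], [0.1, 0.1]]
priors = [0.5, 0.5]

full = mixture_density([0.1, 0.8], means, sigmas, priors)
partial = mixture_density([0.1, None], means, sigmas, priors)  # 2nd feature occluded
```

Substituting zero for the occluded value instead would evaluate the density at an arbitrary point rather than integrating the missing dimension out, which is exactly the incorrect strategy the text warns against.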
Thus likelihood based approaches using GBF networks can simultaneously and optimally deal with occlusions and the correspondence problem. \n\n4 EXPERIMENTAL RESULTS \n\nWe have used the task of 3D gesture recognition to compare likelihood based methods to the network output based technique. (Figure 2 describes the task.) We considered both rigid and non-rigid gesture recognition tasks. We used a GBF network with 10 inputs, 1050 basis functions and 7 output units. For comparison we also trained a standard backpropagation network (BP) with 60 hidden units on the task. For this task we assume that during training all feature correspondences are known and that no feature values are noisy or missing. Classification performance with full correspondence information on an independent test set is about 92% for the GBF network and 93% for the BP network. (For other results see (Williams et al. 1993) who have also used the rigid version of this task as a benchmark.) \n\n4.1 EXPERIMENTS WITH RIGID HAND POSES \n\nTable 1 shows the ability of the various methods to select the correct correspondence. Random patterns were selected from the test set and all 5! = 120 possible combinations were tried. MLCM denotes the percentage of times the maximum likelihood method (equation (2)) selected the correct feature correspondence. GBF-M and BP-M denote how often the maximal output method chooses the correct correspondence using GBF nets and BP. \"Random\" denotes the percentage when correspondences are chosen randomly. The substantially better performance of MLCM suggests that, at least for this task, density information is crucial. It is also interesting to examine the errors made by MLCM. 
A common error is to switch the features for the pinky and the adjacent finger for gestures \"one\", \"two\", \"thumbs-up\" and \"pointing\". These two fingertips often project very close to one another in many poses; such a mistake usually does not affect subsequent classification. \n\nSelection Method / Percentage Correct \nRandom: 1.2% \nGBF-M: 8.8% \nBP-M: 10.3% \nMLCM: 62.0% \n\nTable 1: Percentage of correspondences selected correctly. \n\nClassifier / Classification Performance \nBP-Random: 28.0% \nBP-Max: 39.2% \nGBF-Max: 47.3% \nGBF-WLC: 86.2% \nGBF-Known: 91.8% \n\nTable 2: Classification without correspondence information. \n\nTable 2 shows classification performance when the correspondence is unknown. GBF-WLC denotes weighted likelihood classification using GBF networks to compute the feature densities and the posterior probabilities. Performance with the output based techniques is denoted GBF-Max and BP-Max. BP-Random denotes performance with random correspondences using the backpropagation network. GBF-Known shows the performance of the GBF network when all correspondences are known. The results are quite encouraging in that performance is only slightly degraded with WLC even though there is substantially less information present when correspondences are unknown. Although not shown, results with MLCM (i.e. not doing the weighting step but just choosing the correspondence with highest probability) are about 1% less than WLC. This supports the theory that many of the errors of MLCM in Table 1 are inconsequential. \n\n4.1.1 Missing Features and No Correspondences \n\nFigure 3 shows error as a function of the number of missing dimensions. (The missing dimensions were randomly selected from the test set.) Figure 3 plots the average number of classes that are assigned higher probability than the correct class. 
The network output method and weighted likelihood classification are compared to the case where all correspondences are known. In all cases the basis functions were projected onto the non-missing dimensions to approximate the Bayes-optimal condition. As before, the likelihood based method outperforms the output based method. Surprisingly, even with 4 of the 10 dimensions missing and with correspondences unknown, WLC assigns highest probability to the correct class on average (performance score < 1.0). \n\nFigure 3: Error with missing features (curves for GBF-M, WLC, and GBF-Known) when no correspondence information is present. The y-axis denotes the average number of classes that are assigned higher probability than the correct class; the x-axis the number of missing features. \n\n4.2 EXPERIMENTS WITH NON-RIGID HAND POSES \n\nIn the previous experiments the hand configuration for each gesture remained rigid. Correspondence selection with non-rigid gestures was also tried. As before, a training set consisting of examples of each gesture was constructed. However, in this case, for each sample, a random perturbation (within 20 degrees) was added to each finger joint. The orientation of each sample was allowed to vary randomly by 45 degrees around the x, y, and z axes. When viewed on a screen the samples give the appearance of a hand wiggling around. Surprisingly, GBF networks with 210 hidden units consistently selected the correct correspondences with a performance of 94.9%. (The performance is actually better than in the rigid case. This is because not all possible 3D orientations were allowed in this training set.) 
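The weighted likelihood classification rule of equation (4), used throughout these experiments, can be sketched as follows. The toy posterior and density functions here are hypothetical stand-ins for the trained networks, chosen to mirror the Figure 1 setup (class 1 likely when the first feature is small, density concentrated near (0.1, 0.8)), and a uniform correspondence prior p(a|C) is assumed.

```python
import itertools
import math

def toy_density(x):
    """Stand-in for p(x_a | C): independent Gaussians around (0.1, 0.8)."""
    p = 1.0
    for xi, mu in zip(x, (0.1, 0.8)):
        p *= math.exp(-((xi - mu) ** 2) / (2 * 0.1 ** 2))
    return p

def toy_posterior(c, x):
    """Stand-in for p(C_i | x_a): class 1 is likely when x[0] is small."""
    p1 = 1.0 - x[0]
    return p1 if c == 1 else 1.0 - p1

def wlc_classify(measured, classes, posterior, density):
    """Equation (4) with a uniform correspondence prior p(a | C): score
    each class by summing posterior * density over every assignment of
    the measured values to the model features."""
    def score(c):
        return sum(posterior(c, a) * density(a)
                   for a in itertools.permutations(measured))
    return max(classes, key=score)

print(wlc_classify([0.9, 0.2], [1, 2], toy_posterior, toy_density))  # -> 1
```

A pure maximal-output rule would pick class 2 here (posterior 0.9 at the highly improbable assignment (0.9, 0.2)), reproducing the failure mode of Figure 1; the density weighting suppresses that unlikely assignment.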
\n\n5 DISCUSSION \n\nWe have shown that estimates of joint feature densities can be used to successfully deal with lack of correspondence information even when some input features are missing. We have dealt mainly with the rather severe case where no prior information about correspondences is available. In this particular case, to get the optimal correspondence, all n! possibilities must be considered. However this is usually not necessary. Useful techniques exist for reducing the number of possible correspondences. For example, (Huttenlocher and Ullman 1990) have argued that three feature correspondences are enough to constrain the pose of rigid objects. In this case only O(n^3) matches need to be tested. In addition, features usually fall into incompatible sets (e.g. edge features, corner features, etc.), further reducing the number of potential matches. Finally, with image sequences one can use correspondence information from the previous frame to constrain the set of correspondences in the current frame. Whatever the situation, a likelihood based approach is a principled method for evaluating the set of available matches. \n\nAcknowledgements \n\nMuch of this research was conducted at Siemens Central Research in Munich, Germany. I would like to thank Volker Tresp at Siemens for many interesting discussions and Brigitte Wirtz for providing the hand model. \n\nReferences \n\nAhmad, S. and V. Tresp (1993). Some solutions to the missing feature problem in vision. In S. Hanson, J. Cowan, and C. Giles (Eds.), Advances in Neural Information Processing Systems 5, pp. 393-400. Morgan Kaufmann Publishers. \n\nConnell, J. and M. Brady (1987). Generating and generalizing models of visual objects. Artificial Intelligence 31, 159-183. \n\nDempster, A., N. Laird, and D. Rubin (1977). 
Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. Ser. B 39, 1-38. \n\nHuttenlocher, D. and S. Ullman (1990). Recognizing solid objects by alignment with an image. International Journal of Computer Vision 5(2), 195-212. \n\nNeal, R. and G. Hinton (1993). A new view of the EM algorithm that justifies incremental and other variants. Biometrika, submitted. \n\nNowlan, S. (1990). Maximum likelihood competitive learning. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, pp. 574-582. San Mateo, CA: Morgan Kaufmann Publishers. \n\nPoggio, T. and S. Edelman (1990). A network that learns to recognize three-dimensional objects. Nature 343(6225), 263-266. \n\nPope, A. and D. Lowe (1993, May). Learning object recognition models from images. In Fourth International Conference on Computer Vision, Berlin. IEEE Computer Society Press. \n\nRoscheisen, M., R. Hofmann, and V. Tresp (1992). Neural control for rolling mills: Incorporating domain theories to overcome data deficiency. In J. Moody, S. Hanson, and R. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \n\nSegen, J. (1989). Model learning and recognition of nonrigid objects. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA. \n\nWilliams, C. K., R. S. Zemel, and M. C. Mozer (1993). Unsupervised learning of object models. In AAAI Fall 1993 Symposium on Machine Learning in Computer Vision, pp. 20-24. Proceedings available as AAAI Tech Report FSS-93-04. \n", "award": [], "sourceid": 738, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}]}