{"title": "Some Solutions to the Missing Feature Problem in Vision", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 400, "abstract": null, "full_text": "Some Solutions to the Missing Feature Problem \n\nin Vision \n\nSubutai Ahmad \n\nSiemens AG, \n\nVolker Tresp \nSiemens AG, \n\nCentral Research and Development \nZFE ST SN61, Otto-Hahn Ring 6 \n\n8000 Miinchen 83, Gennany. \n\nahmad@icsi.berkeley.edu \n\nCentral Research and Development \nZFE ST SN41, Otto-Hahn Ring 6 \n\n8000 Miinchen 83, Gennany. \ntresp@inf21.zfe.siemens.de \n\nAbstract \n\nIn visual processing the ability to deal with missing and noisy informa(cid:173)\ntion is crucial. Occlusions and unreliable feature detectors often lead to \nsituations where little or no direct information about features is availa(cid:173)\nble. However the available information is usually sufficient to highly \nconstrain the outputs. We discuss Bayesian techniques for extracting \nclass probabilities given partial data. The optimal solution involves inte(cid:173)\ngrating over the missing dimensions weighted by the local probability \ndensities. We show how to obtain closed-form approximations to the \nBayesian solution using Gaussian basis function networks. The frame(cid:173)\nwork extends naturally to the case of noisy features. Simulations on a \ncomplex task (3D hand gesture recognition) validate the theory. When \nboth integration and weighting by input densities are used, performance \ndecreases gracefully with the number of missing or noisy features. Per(cid:173)\nformance is substantially degraded if either step is omitted. \n\n1 INTRODUCTION \n\nThe ability to deal with missing or noisy features is vital in vision. One is often faced with \nsituations in which the full set of image features is not computable. In fact, in 3D object \nrecognition, it is highly unlikely that all features will be available. 
This can be due to self-occlusion, occlusion from other objects, shadows, etc. To date the issue of missing features has not been dealt with systematically in neural networks. Instead the usual practice is to substitute a single value for the missing feature (e.g. 0, the mean value of the feature, or a pre-computed value) and use the network's output on that feature vector. \n\nFigure 1. The images show two possible situations for a 6-class classification problem. (Dark shading denotes high-probability regions.) If the value of feature x is unknown, the correct solution depends both on the classification boundaries along the missing dimension and on the distribution of exemplars. \n\nWhen the features are known to be noisy, the usual practice is to just use the measured noisy features directly. The point of this paper is to show that these approaches are not optimal and that it is possible to do much better. \n\nA simple example serves to illustrate why one needs to be careful in dealing with missing features. Consider the situation depicted in Figure 1(a). It shows a 2-d feature space with 6 possible classes. Assume a network has already been trained to correctly classify these regions. During classification of a novel exemplar, only feature y has been measured, as y_0; the value of feature x is unknown. For each class C_i, we would like to compute p(C_i | y). Since nothing is known about x, the classifier should assign equal probability to classes 1, 2, and 3, and zero probability to classes 4, 5, and 6. Note that substituting any single value will always produce the wrong result. For example, if the mean value of x is substituted, the classifier would assign a probability near 1 for class 2. 
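The failure of single-value substitution can be checked numerically. The following sketch is ours, not the paper's: the toy classifier net and the uniform density over the missing feature are assumptions chosen to mirror the Figure 1(a) setup.

```python
import numpy as np

# Toy slice through feature space at the measured y = y0: three classes
# occupy x in [0,2), [2,4), [4,6). net(x) plays the role of a trained
# classifier returning p(C_i | x, y0) for the three reachable classes.
def net(x):
    return np.array([1.0 if lo <= x < lo + 2 else 0.0 for lo in (0, 2, 4)])

xs = np.linspace(0, 6, 601)            # grid over the missing feature x
p_x = np.ones_like(xs) / len(xs)       # uniform p(x | y0) for simplicity

mean_sub = net(xs.mean())              # substitute the mean value of x
marginal = sum(p * net(x) for x, p in zip(xs, p_x))  # integrate over x

# mean_sub commits fully to the middle class, while marginal spreads
# the probability roughly evenly over all three reachable classes.
```

With a non-uniform density over x, as in Figure 1(b), the same integral would instead favor the class with the highest density along the line y = y0.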
To obtain the correct posterior probability, it is necessary to integrate the network output over all values of x. But there is one other fact to consider: the probability distribution over x may be highly constrained by the known value of feature y. With a distribution as in Figure 1(b) the classifier should assign class 1 the highest probability. Thus it is necessary to integrate over x along the line y = y_0, weighted by the joint distribution p(x, y). \n\n2 MISSING FEATURES \n\nWe first show how the intuitive arguments outlined above for missing inputs can be formalized using Bayes rule. Let x represent a complete feature vector. We assume the classifier outputs good estimates of p(C_i | x) (most reasonable classifiers do - see (Richard & Lippmann, 1991)). In a given instance, x can be split up into x_c, the vector of known (certain) features, and x_u, the unknown features. When features are missing the task is to estimate p(C_i | x_c). Computing marginal probabilities we get: \n\np(C_i | x_c) = ∫ p(C_i | x_c, x_u) p(x_c, x_u) dx_u / p(x_c)    (1) \n\nNote that p(C_i | x_c, x_u) is approximated by the network output and that in order to use (1) effectively we need estimates of the joint probabilities of the inputs. \n\n3 NOISY FEATURES \n\nThe missing feature scenario can be extended to deal with noisy inputs. (Missing features are simply noisy features in the limiting case of complete noise.) Let x_c be the vector of features measured with complete certainty, x_u the vector of measured, uncertain features, and x_tu the true values of the features in x_u. p(x_u | x_tu) denotes our knowledge of the noise (i.e. the probability of measuring the (uncertain) value x_u given that the true value is x_tu). We assume that this is independent of x_c and C_i, i.e. 
that p(x_u | x_tu, x_c, C_i) = p(x_u | x_tu). (Of course the value of x_tu is dependent on x_c and C_i.) We want to compute p(C_i | x_c, x_u). This can be expressed as: \n\np(C_i | x_c, x_u) = ∫ p(x_c, x_u, x_tu, C_i) dx_tu / p(x_c, x_u)    (2) \n\nGiven the independence assumption, this becomes: \n\np(C_i | x_c, x_u) = ∫ p(C_i | x_c, x_tu) p(x_c, x_tu) p(x_u | x_tu) dx_tu / ∫ p(x_c, x_tu) p(x_u | x_tu) dx_tu    (3) \n\nAs before, p(C_i | x_c, x_tu) is given by the classifier. (3) is almost the same as (1) except that the integral is also weighted by the noise model. Note that in the case of complete uncertainty about the features (i.e. the noise is uniform over the entire range of the features), the equations reduce to the missing feature case. \n\n4 GAUSSIAN BASIS FUNCTION NETWORKS \n\nThe above discussion shows how to optimally deal with missing and noisy inputs in a Bayesian sense. We now show how these equations can be approximated using networks of Gaussian basis functions (GBF nets). Let us consider GBF networks where the Gaussians have diagonal covariance matrices (Nowlan, 1990). Such networks have proven to be useful in a number of real-world applications (e.g. Roscheisen et al., 1992). Each hidden unit is characterized by a mean vector μ_j and by σ_j, a vector representing the diagonal of the covariance matrix. The network output is: \n\ny_i(x) = Σ_j w_ij b_j(x) / Σ_j b_j(x), with b_j(x) = π_j n(x; μ_j, σ_j²) = π_j exp[-Σ_k (x_k - μ_kj)² / (2σ_kj²)] / ((2π)^(d/2) Π_k σ_kj)    (4) \n\nw_ij is the weight from the j'th basis unit to the i'th output unit, π_j is the probability of choosing unit j, and d is the dimensionality of x. 
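As a concrete illustration (our sketch, not code from the paper; function names are ours), the normalized GBF output y_i(x) = Σ_j w_ij b_j(x) / Σ_j b_j(x) with diagonal-covariance Gaussian basis functions can be written directly in NumPy:

```python
import numpy as np

def basis(x, mus, sigmas, pis):
    # b_j(x) = pi_j * n(x; mu_j, sigma_j^2) with diagonal covariance.
    # mus, sigmas: one row per hidden unit j; pis: unit priors.
    d = x.shape[0]
    z = (x - mus) / sigmas
    norm = (2 * np.pi) ** (d / 2) * np.prod(sigmas, axis=1)
    return pis * np.exp(-0.5 * np.sum(z * z, axis=1)) / norm

def gbf_output(x, mus, sigmas, pis, W):
    # y_i(x) = sum_j w_ij b_j(x) / sum_j b_j(x): a convex combination of
    # the rows of W, so the outputs sum to one whenever each row does.
    b = basis(x, mus, sigmas, pis)
    return W.T @ b / b.sum()
```

For example, with two well-separated units and one-hot output weights, an input at the first unit's mean yields an output close to (1, 0).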
\n\n4.1 GBF NETWORKS AND MISSING FEATURES \n\nUnder certain training regimes such as Gaussian mixture modeling, EM or \"soft clustering\" (Duda & Hart, 1973; Dempster et al., 1977; Nowlan, 1990) or an approximation as in (Moody & Darken, 1988), the hidden units adapt to represent local probability densities. In particular y_i(x) ≈ p(C_i | x) and p(x) ≈ Σ_j b_j(x). This is a major advantage of this architecture and can be exploited to obtain closed form solutions to (1) and (3). Substituting into (3) we get: \n\np(C_i | x_c, x_u) ≈ ∫ (Σ_j w_ij b_j(x_c, x_tu)) p(x_u | x_tu) dx_tu / ∫ (Σ_j b_j(x_c, x_tu)) p(x_u | x_tu) dx_tu    (5) \n\nFor the case of missing features equation (5) can be computed directly. As noted before, equation (1) is simply (3) with p(x_u | x_tu) uniform. Since the infinite integral along each dimension of a multivariate normal density is equal to one we get: \n\np(C_i | x_c) ≈ Σ_j w_ij b_j(x_c) / Σ_j b_j(x_c)    (6) \n\n(Here b_j(x_c) denotes the same function as in (4) except that it is only evaluated over the known dimensions given by x_c.) Equation (6) is appealing since it gives us a simple closed form solution. Intuitively, the solution is nothing more than projecting the Gaussians onto the dimensions which are available and evaluating the resulting network. As the number of training patterns increases, (6) will approach the optimal Bayes solution. \n\n4.2 GBF NETWORKS AND NOISY FEATURES \n\nWith noisy features the situation is a little more complicated and the solution depends on the form of the noise. If the noise is known to be uniform in some region [a, b] then equation (5) becomes: \n\np(C_i | x_c, x_u) ≈ Σ_j w_ij b_j(x_c) Π_{k ∈ U} [N(b_k; μ_kj, σ_kj²) - N(a_k; μ_kj, σ_kj²)] / Σ_j b_j(x_c) Π_{k ∈ U} [N(b_k; μ_kj, σ_kj²) - N(a_k; μ_kj, σ_kj²)]    (7) \n\nHere μ_kj and σ_kj² select the k'th component of the j'th mean and variance vectors, and U ranges over the noisy feature indices. Good closed form approximations to the normal distribution function N(x; μ, σ²) are available (Press et al., 1986) so (7) is efficiently computable. \n\nWith zero-mean Gaussian noise with variance σ_u², we can also write down a closed form solution. In this case we have to integrate a product of two Gaussians and end up with: \n\np(C_i | x_c, x_u) = Σ_j w_ij b'_j(x_c, x_u) / Σ_j b'_j(x_c, x_u), with b'_j(x_c, x_u) = n(x_u; μ_ju, σ_u² + σ_ju²) b_j(x_c). \n\n5 BACKPROPAGATION NETWORKS \n\nWith a large training set, the outputs of a sufficiently large network trained with backpropagation converge to the optimal Bayes a posteriori estimates (Richard & Lippmann, 1991). If B_i(x) is the output of the i'th output unit when presented with input x, B_i(x) ≈ p(C_i | x). Unfortunately, access to the input distribution is not available with backpropagation. Without prior knowledge it is reasonable to assume a uniform input distribution, in which case the right hand side of (3) simplifies to: \n\np(C_i | x_c, x_u) ≈ ∫ p(C_i | x_c, x_tu) p(x_u | x_tu) dx_tu / ∫ p(x_u | x_tu) dx_tu    (8) \n\nThe integral can be approximated using standard Monte Carlo techniques. With uniform noise in the interval [a, b], this becomes (ignoring normalizing constants): \n\np(C_i | x_c, x_u) ∝ ∫ from a to b of B_i(x_c, x_tu) dx_tu    (9) \n\nWith missing features the integral in (9) is computed over the entire range of each feature. \n\n6 AN EXAMPLE TASK: 3D HAND GESTURE RECOGNITION \n\nA simple realistic example serves to illustrate the utility of the above techniques. 
We consider the task of recognizing a set of hand gestures from single 2D images independent of 3D orientation (Figure 2). As input, each classifier is given the 2D polar coordinates of the five fingertip positions relative to the 2D center of mass of the hand (so the input space is 10-dimensional). Each classifier is trained on a training set of 4368 examples (624 poses for each gesture) and tested on a similar independent test set. \n\nThe task forms a good benchmark for testing performance with missing and uncertain inputs. The classification task itself is non-trivial. The classifier must learn to deal with hands (which are complex non-rigid objects) and with perspective projection (which is non-linear and non-invertible). In fact it is impossible to obtain a perfect score since in certain poses some of the gestures are indistinguishable (e.g. when the hand is pointing directly at the screen). Moreover, the task is characteristic of real vision problems. \n\nFigure 2. Examples of the 7 gestures used to train the classifier (\"five\", \"four\", \"three\", \"two\", \"one\", \"thumbs_up\", \"pointing\"). A 3D computer model of the hand is used to generate images of the hand in various poses. For each training example, we choose a 3D orientation, compute the 3D positions of the fingertips and project them onto 2D. For this task we assume that the correspondence between image and model features is known, and that during training all feature values are always available. \n\nThe position of each finger is highly (but not completely) constrained by the others, resulting in a very non-uniform input distribution. Finally it is often easy to see what the classifier should output if features are uncertain. For example suppose the real gesture is \"five\" but for some reason the features from the thumb are not reliably computed. 
In this case the gestures \"four\" and \"five\" should both get a positive probability whereas the rest should get zero. In many such cases only a single class should get the highest score, e.g. if the features for the little finger are uncertain the correct class is still \"five\". \n\nWe tried three classifiers on this task: standard sigmoidal networks trained with backpropagation (BP), and two types of Gaussian networks as described in Section 4. In the first (Gauss-RBF), the Gaussians were radial and the centers were determined using k-means clustering as in (Moody & Darken, 1988). σ² was set to twice the average distance of each point to its nearest Gaussian (all Gaussians had the same width). After clustering, π_j was set to Σ_k [n(x_k; μ_j, σ_j²) / Σ_l n(x_k; μ_l, σ_l²)], normalized over the hidden units. The output weights were then determined using LMS gradient descent. In the second (Gauss-G), each Gaussian had a unique diagonal covariance matrix. The centers and variances were determined using gradient descent on all the parameters (Roscheisen et al., 1992). Note that with this type of training, even though Gaussian hidden units are used, there is no guarantee that the distribution information will be preserved. \n\nAll classifiers were able to achieve a reasonable performance level. BP with 60 hidden units managed to score 95.3% and 93.3% on the training and test sets, respectively. Gauss-G with 28 hidden units scored 94% and 92%. Gauss-RBF scored 97.7% and 91.4% and required 2000 units to achieve it. (Larger numbers of hidden units led to overfitting.) For comparison, nearest neighbor achieves a score of 82.4% on the test set. \n\n6.1 PERFORMANCE WITH MISSING FEATURES \n\nWe tested the performance of each network in the presence of missing features. For backpropagation we used a numerical approximation to equation (9). For both Gaussian basis function networks we used equation (6). 
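Concretely, equation (6) amounts to evaluating each Gaussian only over the observed dimensions, since the marginal of a diagonal Gaussian is obtained by simply dropping the missing coordinates. The following sketch is ours (the helper name and the boolean-mask convention are assumptions, not code from the paper):

```python
import numpy as np

def classify_missing(x, known, mus, sigmas, pis, W):
    # Equation (6): project each basis Gaussian onto the known
    # dimensions, evaluate it there, and normalize over the units.
    # known is a boolean mask marking the observed features of x.
    d = int(known.sum())
    mu, sg = mus[:, known], sigmas[:, known]
    z = (x[known] - mu) / sg
    b = pis * np.exp(-0.5 * np.sum(z * z, axis=1))
    b /= (2 * np.pi) ** (d / 2) * np.prod(sg, axis=1)
    return W.T @ b / b.sum()
```

When all features are observed this reduces to the full network output; with features masked out, the class posteriors still sum to one and are driven only by the observed dimensions.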
To test the networks we randomly picked samples from the test set and deleted random features. We calculated a performance score as the percentage of samples where the correct class was ranked as one of the top two classes. Figure 3 displays the results. For comparison we also tested each classifier by substituting the mean value of each missing feature and using the normal update equation. \n\nAs predicted by the theory the performance of Gauss-RBF using (6) was consistently better than the others. The fact that BP and Gauss-G performed poorly indicates that the distribution of the features must be taken into account. The fact that using the mean value is insufficient indicates that the integration step must also be carried out. Perhaps most encouraging is the result that even with 50% of the features missing, Gauss-RBF ranks the correct class among the top two 90% of the time. This clearly shows that a significant amount of information can be extracted even with a large number of missing features. \n\nFigure 3. The performance of various classifiers (Gauss-RBF, Gauss-G, BP, and the mean-substitution variants Gauss-G-MEAN, BP-MEAN, RBF-MEAN) when dealing with missing features. Each data point denotes an average over 1000 random samples from an independent test set. For each sample, random features were considered missing. Each graph plots the percentage of samples where the correct class was one of the top two classes. \n\n6.2 PERFORMANCE WITH NOISY FEATURES \n\nWe also tested the performance of each network in the presence of noisy features. 
We randomly picked samples from the test set and added uniform noise to random features. The noise interval was calculated as [x_i - 2σ_i, x_i + 2σ_i] where x_i is the feature value and σ_i is the standard deviation of that feature over the training set. For BP we used equation (9) and for the GBF networks we used equation (7). Figure 4 displays the results. For comparison we also tested each classifier by substituting the noisy value of each noisy feature and using the normal update equation (RBF-N, BP-N, and Gauss-GN). As with missing features, the performance of Gauss-RBF was significantly better than the others when a large number of features were noisy. \n\nFigure 4. As in Figure 3 except that the performance with noisy features is plotted (Gauss-RBF, Gauss-G, BP, Gauss-GN, BP-N, RBF-N). \n\n7 DISCUSSION \n\nThe results demonstrate the advantages of estimating the input distribution and integrating over the missing dimensions, at least on this task. They also show that good classification performance alone does not guarantee good missing feature performance. (Both BP and Gauss-G performed better than Gauss-RBF on the test set.) To get the best of both worlds one could use a hybrid technique utilizing separate density estimators and classifiers, although this would probably require equations (1) and (3) to be numerically integrated. One way to improve the performance of BP and Gauss-G might be to use a training set that contained missing features. Given the unusual distributions that arise in vision, in order to guarantee accuracy such a training set should include every possible combination of missing features. In addition, for each such combination, enough patterns must be included to accurately estimate the posterior density. In general this type of training is intractable since the number of combinations is exponential in the number of features. Note that if the input distribution is available (as in Gauss-RBF), then such a training scenario is unnecessary. \n\nAcknowledgements \n\nWe thank D. Goryn, C. Maggioni, S. Omohundro, A. Stokke, and R. Schuster for helpful discussions, and especially B. Wirtz for providing the computer hand model. V.T. is supported in part by a grant from the Bundesministerium für Forschung und Technologie. \n\nReferences \n\nA.P. Dempster, N.M. Laird, and D.B. Rubin. (1977) Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. Ser. B, 39:1-38. \n\nR.O. Duda and P.E. Hart. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons, New York. \n\nJ. Moody and C. Darken. (1988) Learning with localized receptive fields. In: D. Touretzky, G. Hinton, T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, CA. \n\nS. Nowlan. (1990) Maximum Likelihood Competitive Learning. In: Advances in Neural Information Processing Systems 2, pages 574-582. \n\nW.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. (1986) Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, Cambridge, UK. \n\nM.D. Richard and R.P. Lippmann. (1991) Neural Network Classifiers Estimate Bayesian a posteriori Probabilities, Neural Computation, 3:461-483. \n\nM. Roscheisen, R. Hofman, and V. Tresp. (1992) Neural Control for Rolling Mills: Incorporating Domain Theories to Overcome Data Deficiency. In: Advances in Neural Information Processing Systems 4, pages 659-666. 
\n\n\f", "award": [], "sourceid": 621, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}