{"title": "Classifying Hand Gestures with a View-Based Distributed Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 945, "page_last": 952, "abstract": null, "full_text": "Classifying Hand Gestures with a View-based Distributed Representation \n\nTrevor J. Darrell \n\nPerceptual Computing Group \n\nMIT Media Lab \n\nAlex P. Pentland \n\nPerceptual Computing Group \n\nMIT Media Lab \n\nAbstract \n\nWe present a method for learning, tracking, and recognizing human hand \ngestures recorded by a conventional CCD camera without any special \ngloves or other sensors. A view-based representation is used to model \naspects of the hand relevant to the trained gestures, and is found using an \nunsupervised clustering technique. We use normalized correlation \nnetworks, with dynamic time warping in the temporal domain, as a distance \nfunction for unsupervised clustering. Views are computed separably for \nspace and time dimensions; the distributed response of the combination \nof these units characterizes the input data with a low-dimensional \nrepresentation. A supervised classification stage uses labeled outputs of the \nspatio-temporal units as training data. Our system can correctly classify \ngestures in real time with a low-cost image processing accelerator. \n\n1 INTRODUCTION \n\nGesture recognition is an important aspect of human interaction, either interpersonally or \nin the context of man-machine interfaces. In general, there are many facets to the \"gesture \nrecognition\" problem. Gestures can be made by hands, faces, or one's entire body; they \ncan be static or dynamic, person-specific or cross-cultural. Here we focus on a subset of \nthe general task, and develop a method for interpreting dynamic hand gestures generated \nby a specific user. We pose the problem as one of spotting instances of a set of known \n(previously trained) gestures. 
In this context, a gesture can be thought of as a set of hand \nviews observed over time, or simply as a sequence of images of hands over time. These \nimages may occur at different temporal rates, and the hand may have a different spatial \noffset or gross illumination condition. We would like to achieve real- or near real-time \nperformance with our system, so that it can be used interactively by users. \n\nTo achieve this level of performance, we take advantage of the principle of using only \nas much \"representation\" as needed to perform the task. Hands are complex, 3D articulated \nstructures, whose kinematics and dynamics are difficult to fully model. Instead of \nperforming explicit model-based reconstruction, and attempting to extract these 3D model \nparameters (for example, see [4, 5, 6]), we take a simpler approach that uses a set of 2D \nviews to represent the object. Using this approach we can perform recognition on objects \nwhich are either too difficult to model or for which a model recovery method is not feasible. \nAs we shall see below, the view-based approach affords several advantages, such as the \nability to form a sparse representation that only models the poses of the hands that are \nrelevant to the desired recognition tasks, and the ability to learn the relevant model directly \nfrom the data using unsupervised clustering. \n\n2 VIEW-BASED REPRESENTATION \n\nOur task is to recognize spatio-temporal sequences of hand images. To reduce the \ndimensionality of the matching involved, we find a set of view images and a matching function \nsuch that the set of match scores of a new image with the view images is adequate for \nrecognition. The matching function we use is the normalized correlation between the image and \nthe set of learned spatial views. \n\nEach view represents a different pose of the object being tracked or recognized. 
We \nconstruct a set of views that \"spans\" the set of images seen in the training sequences, in \nthe sense that at least one view matches every frame in the sequence (given a distance \nmetric and threshold value). We can then use the view with the maximum score (minimum \ndistance) to localize the position of the object during gesture performance, and use the \nensemble response of the view units (at the location of maximal response) to characterize \nthe actual pose of the object. Each model is based on one or more example images of a \nview of an object, from which mean and variance statistics about each pixel in the view are \ncomputed. \n\nThe general idea of view-based representation has been advocated by Ullman [12] and \nPoggio [9] for representing 3-D objects by interpolating between a small set of 2-D views. \nRecognition using views was analyzed by Breuel, who established bounds on the number \nof views needed for a given error rate [3]. However, the view-based models used in these \napproaches rely on a feature-based representation of an image, in which a \"view\" is the \nlist of vertex locations of semantically relevant features. The automatic extraction of these \nfeatures is not a fully solved problem. (See [2] for a nearly automated system of finding \ncorresponding points and extracting views.) \n\nMost similar to our work is that of Murase and Nayar [8] and Turk [11], which use low-order \neigenvectors to reduce the dimensionality of the signal and perform recognition. Our \nwork differs from theirs in that we use normalized-correlation model images instead of \neigenfunctions and can thus localize the hand position more directly, and we extend into \nthe temporal domain, recognizing image sequences of gestures rather than static poses. \n\nA particular view model will have a range of parameter values of a given transformation \n(e.g., rotation, scale, articulation) for which the correlation score shows a roughly convex \n\"tuning curve\". 
If we have a set of view models which sample the transformation parameter \nfinely enough, it is possible to infer the actual transform parameters for new views by \nexamining the set of model correlation scores. For example, Figure 1a shows three views \nof an eyeball that could be used for gaze tracking: one looking 30 degrees left, one looking \ncenter-on, and one looking 30 degrees to the right. The three views span a \u00b130 degree \nsubspace of the gaze direction parameter. Figure 1(b,c,d) shows the normalized correlation \nscore for each view model when tracking a rotating eyeball. Since the tuning curves \nproduced by these models are fairly broad with respect to gaze angle, one could interpolate \nfrom their responses to obtain a good estimate of the true angle. \n\nFigure 1: (a) Three views of an eyeball: +30, 0, and -30 degrees of gaze angle. (b) Normalized \ncorrelation scores of the +30 degree view model when tracking an eyeball rotating from \napproximately -30 to +30 degrees of gaze angle. (c) Score for 0 degree view model. (d) \nScore for -30 degree model. \n\nWhen objects are non-rigid, either constructed out of flexible materials or an articulated \ncollection of rigid parts (like a hand), then the dimensionality of the space of possible \nviews becomes much larger. Full coverage of the view space in these cases is usually \nnot possible since enumerating it even with very coarse sampling would be prohibitively \nexpensive in terms of storage and search computation required. However, many parts of a \nhigh dimensional view space may never be encountered when processing real sequences, \ndue to unforeseen additional constraints. 
These may be physical (some joints may not \nbe completely independent), or behavioral (some views may never be used in the actual \ncommunication between user and machine). A major advantage of our adaptive scheme is \nthat it has no difficulty with sparse view spaces, and derives from the data which regions of \nthe space are full. \n\nFigure 2: (a) Models automatically acquired from a sequence of images of a rotating box. \n(b) Normalized correlation scores for each model as a function of image sequence frame \nnumber. \n\n3 UNSUPERVISED LEARNING OF VIEW UNITS \n\nTo derive a set of new view models, we use a simple form of unsupervised clustering \nin which the first example forms a new view, and subsequent examples that are below a \ndistance threshold are merged into the nearest existing view. A new view is created when \nan example's match score is below the threshold for all views in the current set, but is above a \nbase threshold which establishes that the object is still (roughly) being tracked. Over time, \nthis \"follow-the-leader\" algorithm results in a family of view models that sample the space \nof object poses in the training data. This method is similar to those commonly used in \nvector quantization [7]. Variance statistics are updated for each model pixel, and can be \nused to exclude unreliable points from the correlation computation. \n\nFor simple objects and transformations, this adaptive scheme can build a model which \nadequately covers the entire space of possible views. For example, for a convex rigid body \nundergoing a 1D rotation with fixed relative illumination, a relatively small number of view \nmodels can suffice to track and interpolate the position of the object at any rotation. Figure \n2 illustrates this with a simple example of a rotating box. The adaptive tracking scheme \nwas run with a camera viewing a box rotating about a fixed axis. 
Figure 2a shows the view \nmodels in use when the algorithm converged, and all possible rotations were matched with \nscore greater than \u03b81. To demonstrate the tuning properties of each model under rotation, \nFigure 2b shows the correlation scores for each model plotted as a function of input frame \nnumber of a demonstration sequence. In this sequence the box was held fixed at its initial \nposition for the first 5 frames, and then rotated continuously from 0 to 340 degrees. The \nresponses of each model are broadly tuned as a function of object angle, with a small \nnumber of models sufficing to represent/interpolate the object at all rotations (at least about \na single axis). \n\nFigure 3: Four spatial views found by unsupervised clustering method on a sequence \ncontaining two hand-waving gestures: side-to-side and up-down. \n\nFigure 4: Overview of unsupervised clustering stage to learn spatial and temporal views. An \ninput image sequence is reduced to a sequence of feature vectors which record the maximum \nvalue in a normalized correlation network corresponding to each spatial view. A similar \nprocess using temporal views reduces the spatial feature vectors to a single spatio-temporal \nfeature vector. \n\nWe ran our spatial clustering method on images of hands performing two different \"waving\" \ngestures. One gesture was a side-to-side wave, with the fingers rigid, and the other was \nan up-down wave, with the wrist held fixed and the fingers bending towards the camera \nin synchrony. 
Running instances of both through our view learning method, with a base \nthreshold of \u03b80 = 0.6 and a \"new model\" threshold of \u03b81 = 0.7, the clustering method found \nfour spatial templates to span all of the images in both sequences. Figure 3 shows the \npixel values for these four models. \n\nFigure 5: Surface plot of temporal templates found by unsupervised clustering method on \nsequences of two hand-waving gestures. Vertical axis is score, horizontal axis is time, and \ndepth axis is spatial view index. \n\n3.1 TEMPORAL VIEWS \n\nThe previous sections provide a method for finding spatial views to reduce the \ndimensionality in a tracking task. The same method can be applied in the temporal domain as \nwell, using a set of \"temporal views\". Figure 4 shows an overview of these two stages. \nWe construct temporal views using a similar method to that used for spatial views, but \nwith temporal segmentation cues provided by the user. Sequences of spatial-feature vector \noutputs (the normalized correlation scores of the spatial views) are passed as input to the \nunsupervised clustering method, yielding a set of temporal views. To find the distance \nbetween two sequences, we again use a normalized correlation metric, with the Dynamic Time \nWarping (DTW) method [1, 10]. This allows the time course of a gesture to vary, as long \nas the same series of spatial poses is present. \n\nIn this way, a set of temporal views acting on spatial views, which in turn act on image \nintensities, is created. The responses of these composite views yield a single spatio-temporal \nstimulus vector which describes spatial and temporal properties of the input signal. As an \nexample, for the \"hand-waving\" example shown above, two temporal views were found by \nthe clustering method. These are shown as surface plots in Figure 5. 
Empirically we have \nfound that the spatio-temporal units capture the salient aspects of the spatial and temporal \nvariation of the hand gestures in a low-dimensional representation, so efficient classification \nis possible. The response of these temporal view units on an input sequence containing \nthree instances of each gesture is shown in Figure 6. \n\n4 CLASSIFICATION OF GESTURES \n\nThe spatio-temporal units obtained by the unsupervised procedure described above are used \nas inputs to a supervised learning/classification stage (Figure 7(a)). We have implemented \ntwo different classification strategies, a traditional Diagonal Gaussian Classifier, and a \nmulti-layer perceptron. \n\nFigure 6: (a) Surface plot of spatial view responses on input sequence containing three \ninstances of each hand-waving gesture. (b) Final spatio-temporal view unit response: the \ntime-warped, normalized correlation score of temporal views on spatial view feature vectors. \n\nAs an experiment, we collected 42 examples of a \"hello\" gesture, 26 examples of \"good-bye\", \nand 10 examples of other gestures intended to generate false alarms in the classifier. \nAll gestures were performed by a single user under similar imaging conditions. For each \ntrial we randomly selected half of the target gestures to train the classifier, and tested on the \nremaining half. (All of the conflictor gestures were used in both training and testing sets \nsince they were few in number.) \n\nFigure 7(b) summarizes the results for the different classification strategies. The Gaussian \nclassifier (DG) achieved a hit rate of 67%, with zero false alarms. The multi-layer \nperceptron (MLP) was more powerful but less conservative, with a hit rate of 86% and a \nfalse alarm rate of 5%. 
We found the results of the MLP classifier to be quite variable; \non many of the trials the classifier was stuck in a local minimum and failed to converge on \nthe training set. Additionally there was considerable dependence on the number of units in \nthe hidden layer; empirically we found 12 gave best performance. Nonetheless, the MLP \nclassifier provided good performance. When we excluded the trials on which the classifier \nfailed to converge on the training set, the performance increased to 91% hit rate, 2% false \nalarm rate. \n\n5 CONCLUSION \n\nWe have demonstrated a system for tracking and recognition of simple hand gestures. Our \nentire recognition system, including time-warping and classification, runs in real time (over \n10Hz). This is made possible through the use of a special purpose normalized correlation \nsearch co-processor. Since the dimensionality of the feature space is low, the dynamic time \nwarping and classification steps can be implemented on conventional workstations and \nstill achieve real-time performance. Because of this real-time performance, our system is \ndirectly applicable to interactive \"glove-free\" gestural user interfaces. \n\nFigure 7: Overview of supervised classification stage and results obtained for different \ntypes of classifiers. \n\nReferences \n\n[1] Bellman, R. E., (1957) Dynamic Programming. Princeton, NJ: Princeton Univ. Press. \n\n[2] Beymer, D., Shashua, A., and Poggio, T., (1993) \"Example Based Image Analysis \nand Synthesis\", MIT AI Lab Memo No. 1431. \n\n[3] Breuel, T., (1992) \"View-based Recognition\", IAPR Workshop on Machine Vision \nApplications. \n\n[4] Cipolla, R., Okamoto, Y., and Kuno, Y., (1992) \"Qualitative visual interpretation \nof 3D hand gestures using motion parallax\", IAPR Workshop on Machine Vision \nApplications. 
\n\n[5] Fukumoto, M., Mase, K., and Suenaga, Y., (1992) \"Real-Time Detection of Pointing \nActions for a Glove-Free Interface\", IAPR Workshop on Machine Vision Applications. \n\n[6] Ishibuchi, K., Takemura, H., and Kishino, F., (1992) \"Real-Time Hand Shape Recognition \nusing Pipe-line Image Processor\", IEEE Workshop on Robot and Human \nCommunication, pp. 111-116. \n\n[7] Makhoul, J., Roucos, S., and Gish, H., (1985) \"Vector Quantization in Speech Coding\", \nProc. IEEE, Vol. 73, No. 11, pp. 1551-1587. \n\n[8] Murase, H., and Nayar, S. K., (1993) \"Learning and Recognition of 3D Objects from \nAppearance\", Proc. IEEE Qualitative Vision Workshop, New York City, pp. 39-49. \n\n[9] Poggio, T., and Edelman, S., (1990) \"A Network that Learns to Recognize Three \nDimensional Objects,\" Nature, Vol. 343, No. 6255, pp. 263-266. \n\n[10] Sakoe, H., and Chiba, S., (1980) \"Dynamic Programming optimization for spoken \nword recognition\", IEEE Trans. ASSP, Vol. 26, pp. 623-625. \n\n[11] Turk, M., and Pentland, A. P., (1991) \"Eigenfaces for Recognition\", Journal of \nCognitive Neuroscience, vol. 3, pp. 71-89. \n\n[12] Ullman, S., and Basri, R., (1991) \"Recognition by Linear Combinations of Models,\" \nIEEE PAMI, Vol. 13, No. 10, pp. 992-1007. \n", "award": [], "sourceid": 832, "authors": [{"given_name": "Trevor", "family_name": "Darrell", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}