{"title": "Graphical Models for Recognizing Human Interactions", "book": "Advances in Neural Information Processing Systems", "page_first": 924, "page_last": 930, "abstract": null, "full_text": "Graphical Models for Recognizing Human Interactions \n\nNuria M. Oliver, Barbara Rosario and Alex Pentland \n\n20 Ames Street, E15-384C, \n\nMedia Arts and Sciences Laboratory, MIT \n\nCambridge, MA 02139 \n\n{nuria, rosario, sandy}@media.mit.edu \n\nAbstract \n\nWe describe a real-time computer vision and machine learning system \nfor modeling and recognizing human actions and interactions. \nTwo different domains are explored: recognition of two-handed \nmotions in the martial art 'Tai Chi', and multiple-person interactions \nin a visual surveillance task. Our system combines top-down \nwith bottom-up information using a feedback loop, and is formulated \nwith a Bayesian framework. Two different graphical models \n(HMMs and Coupled HMMs) are used for modeling both individual \nactions and multiple-agent interactions, and CHMMs are shown to \nwork more efficiently and accurately for a given amount of training. \nFinally, to overcome the limited amounts of training data, \nwe demonstrate that 'synthetic agents' (Alife-style agents) can be \nused to develop flexible prior models of the person-to-person interactions. \n\n1 INTRODUCTION \nWe describe a real-time computer vision and machine learning system for modeling \nand recognizing human behaviors in two different scenarios: (1) complex, two-handed \naction recognition in the martial art of Tai Chi and (2) detection and \nrecognition of individual human behaviors and multiple-person interactions in a \nvisual surveillance task. In the latter case, the system is particularly concerned \nwith detecting when interactions between people occur, and classifying them. 
\nGraphical models, such as Hidden Markov Models (HMMs) [6] and Coupled Hidden \nMarkov Models (CHMMs) [3, 2], seem appropriate for modeling and classifying \nhuman behaviors because they offer dynamic time warping, a well-understood \ntraining algorithm, and a clear Bayesian semantics for both individual (HMMs) \nand interacting or coupled (CHMMs) generative processes. A major problem with \nthis data-driven statistical approach, especially when modeling rare or anomalous \nbehaviors, is the limited number of training examples. A major emphasis of our \nwork, therefore, is on efficient Bayesian integration of prior knowledge with \nevidence from data. We will show that for situations involving multiple independent \n(or partially independent) agents the Coupled HMM approach generates much \nbetter results than traditional HMM methods. \nIn addition, we have developed a synthetic agent or Alife modeling environment for \nbuilding and training flexible a priori models of various behaviors using software \nagents. Simulation with these software agents yields synthetic data that can be \nused to train prior models. These prior models can then be used recursively in a \nBayesian framework to fit real behavioral data. \n\nThis synthetic agent approach is a straightforward and flexible method for developing \nprior models, one that does not require strong analytical assumptions to be \nmade about the form of the priors (footnote 1). In addition, it has allowed us to develop robust \nmodels even when there are only a few examples of some target behaviors. In \nour experiments we have found that by combining such synthetic priors with limited \nreal data we can achieve very high accuracies in recognizing different \nhuman-to-human interactions. 
\nThe paper is structured as follows: section 2 presents an overview of the system; \nthe statistical models used for behavior modeling and recognition are described in \nsection 3. Section 4 contains experimental results in two different real situations. \nFinally, section 5 summarizes the main conclusions and our future lines of research. \n\n2 VISUAL INPUT \nWe have experimented using two different types of visual input. The first is a real-time, \nself-calibrating 3-D stereo blob tracker (used for the Tai Chi scenario) [1], and \nthe second is a real-time blob-tracking system [5] (used in the visual surveillance \ntask). In both cases an Extended Kalman filter (EKF) tracks the blobs' location, \ncoarse shape, color pattern, and velocity. This information is represented as a \nlow-dimensional, parametric probability distribution function (PDF) composed of \na mixture of Gaussians, whose parameters (sufficient statistics and mixing weights \nfor each of the components) are estimated using Expectation Maximization (EM). \nThis visual input module detects and tracks moving objects (body parts in Tai \nChi and pedestrians in the visual surveillance task) and outputs a feature vector \ndescribing their motion, heading, and spatial relationship to all nearby moving \nobjects. These output feature vectors constitute the temporally ordered stream \nof data input to our stochastic state-based behavior models. Both HMMs and \nCHMMs, with varying structures depending on the complexity of the behavior, are \nused for classifying the observed behaviors. \nBoth top-down and bottom-up flows of information are continuously managed and \ncombined for each moving object within the scene. The Bayesian graphical models \noffer a mathematical framework for combining the observations (bottom-up) with \ncomplex behavioral priors (top-down) to provide expectations that will be fed back \nto the input visual system. 
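The per-frame feature vector described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys, the use of 2-D tuples and the frame interval dt are all hypothetical, and the pairwise quantities are the ones listed for the surveillance experiment in section 4 (derivative of the relative distance, dot product of the velocity vectors, magnitude of the velocity difference).

```python
import math


def heading(velocity):
    # Heading angle in radians, measured from the x axis (assumed convention).
    vx, vy = velocity
    return math.atan2(vy, vx)


def pairwise_features(pos_a, vel_a, pos_b, vel_b, prev_distance, dt):
    # Hypothetical helper: features relating two tracked blobs in one frame.
    # pos_* and vel_* are (x, y) tuples, e.g. taken from each blob's tracker state.
    if dt <= 0:
        raise ValueError('dt must be positive')
    distance = math.hypot(pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    # Finite-difference estimate of the derivative of the relative distance.
    distance_rate = (distance - prev_distance) / dt
    # Degree of alignment: dot product of the two velocity vectors.
    alignment = vel_a[0] * vel_b[0] + vel_a[1] * vel_b[1]
    # Magnitude of the difference between the two velocity vectors.
    velocity_gap = math.hypot(vel_a[0] - vel_b[0], vel_a[1] - vel_b[1])
    return {
        'distance': distance,
        'distance_rate': distance_rate,
        'alignment': alignment,
        'velocity_gap': velocity_gap,
    }
```

Calling pairwise_features once per frame for each pair of nearby blobs, and concatenating the result with each blob's own position, velocity and heading, would give one observation vector per time step for the behavior models.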
\n\n3 VISUAL UNDERSTANDING VIA GRAPHICAL MODELS: HMMs and CHMMs \n\nStatistical directed acyclic graphs (DAGs) or probabilistic inference networks (PINs \nhereafter) can provide a computationally efficient solution to the problem of time \nseries analysis and modeling. HMMs and some of their extensions, in particular \nCHMMs, can be viewed as a particular and simple case of temporal PIN or DAG. \nGraphically, Markov models are often depicted 'rolled-out in time' as Probabilistic \nInference Networks, such as in figure 1. PINs present important advantages that are \nrelevant to our problem: they can handle incomplete data as well as uncertainty; \nthey are trainable and make it easier to avoid overfitting; they encode causality in a \nnatural way; there are algorithms for both prediction and probabilistic inference; \nthey offer a framework for combining prior knowledge and data; and finally they \nare modular and parallelizable. \n\nFigure 1: Graphical representation of an HMM and a CHMM rolled-out in time \n\nFootnote 1: Note that our priors have the same form as our posteriors, namely, they are graphical \nmodels. \n\nTraditional HMMs offer a probabilistic framework for modeling processes that have \nstructure in time. They offer clear Bayesian semantics, efficient algorithms for state \nand parameter estimation, and they automatically perform dynamic time warping. \nAn HMM is essentially a quantization of a system's configuration space into a \nsmall number of discrete states, together with probabilities for transitions between \nstates. A single finite discrete variable indexes the current state of the system. 
Any \ninformation about the history of the process needed for future inferences must be \nreflected in the current value of this state variable. \nHowever, many interesting real-life problems are composed of multiple interacting \nprocesses, and thus merit a compositional representation of two or more variables. \nThis is typically the case for systems that have structure both in time and space. \nWith a single state variable, Markov models are ill-suited to these problems. In \norder to model these interactions a more complex architecture is needed. \nExtensions to the basic Markov model generally increase the memory of the system \n(durational modeling), providing it with compositional state in time. We are \ninterested in systems that have compositional state in space, i.e., more than one \nsimultaneous state variable. It is well known that the exact solution of extensions \nof the basic HMM to 3 or more chains is intractable. In those cases approximation \ntechniques are needed ([7, 4, 8, 9]). However, it is also known that there exists an \nexact solution for the case of 2 interacting chains, as in our case [7, 2]. \nWe therefore use two Coupled Hidden Markov Models (CHMMs) for modeling two \ninteracting processes, whether they are separate body parts or individual humans. \nIn this architecture state chains are coupled via matrices of conditional probabilities \nmodeling causal (temporal) influences between their hidden state variables. The \ngraphical representation of CHMMs is shown in figure 1. From the graph it can be \nseen that for each chain, the state at time $t$ depends on the state at time $t-1$ in \nboth chains. The influence of one chain on the other is through a causal link. \nIn this paper we compare performance of HMMs and CHMMs for maximum a \nposteriori (MAP) state estimation. We compute the most likely sequence of states \n$S$ within a model given the observation sequence $O = \\{o_1, \\ldots, o_n\\}$. 
This most likely \nsequence is obtained by $S = \\arg\\max_S P(S|O)$. \nIn the case of HMMs the posterior state sequence probability $P(S|O)$ is given by \n\n$$P(S|O) = P_{s_1} p_{s_1}(o_1) \\prod_{t=2}^{T} p_{s_t}(o_t) P_{s_t|s_{t-1}} \\qquad (1)$$ \n\nwhere $S = \\{a_1, \\ldots, a_N\\}$ is the set of discrete states and $s_t \\in S$ corresponds to the \nstate at time $t$. $P_{i|j} \\equiv P_{s_t=a_i|s_{t-1}=a_j}$ is the state-to-state transition probability (i.e. \nprobability of being in state $a_i$ at time $t$ given that the system was in state $a_j$ at \ntime $t-1$). In the following we will write them as $P_{s_t|s_{t-1}}$. $P_i \\equiv P_{s_1=a_i} = P_{s_1}$ are \nthe prior probabilities for the initial state. Finally, $p_i(o_t) \\equiv p_{s_t=a_i}(o_t) = p_{s_t}(o_t)$ are \nthe output probabilities for each state (footnote 2). \nFor CHMMs we need to introduce another set of probabilities, $P_{s_t|s'_{t-1}}$, which correspond \nto the probability of state $s_t$ at time $t$ in one chain given that the other \nchain (denoted hereafter by a prime, $'$) was in state $s'_{t-1}$ at time $t-1$. These \nnew probabilities express the causal influence (coupling) of one chain on the other. \nThe posterior state probability for CHMMs is expressed as \n\n$$P(S|O) = \\frac{P_{s_1} p_{s_1}(o_1) P_{s'_1} p_{s'_1}(o'_1)}{P(O)} \\times \\prod_{t=2}^{T} P_{s_t|s_{t-1}} P_{s'_t|s'_{t-1}} P_{s'_t|s_{t-1}} P_{s_t|s'_{t-1}} p_{s_t}(o_t) p_{s'_t}(o'_t) \\qquad (2)$$ \n\nwhere $s_t, s'_t; o_t, o'_t$ denote states and observations for each of the Markov chains that \ncompose the CHMMs. \n\nFootnote 2: The output probability is the probability of observing $o_t$ given state $a_i$ at time $t$. \n\nIn [2] a deterministic approximation for maximum a posteriori (MAP) state estimation \nis introduced. It enables fast classification and parameter estimation via \nEM, and also obtains an upper bound on the cross entropy with the full (combinatoric) \nposterior which can be minimized using a subspace that is linear in the \nnumber of state variables. 
An \"N-heads\" dynamic programming algorithm samples \nfrom the $O(N)$ highest probability paths through a compacted state trellis, with \ncomplexity $O(T(CN)^2)$ for $C$ chains of $N$ states apiece observing $T$ data points. \nThe cartesian product equivalent HMM would involve a combinatoric number of \nstates, typically requiring $O(TN^{2C})$ computations. We are particularly interested \nin efficient, compact algorithms that can perform in real-time. \n\n4 EXPERIMENTAL RESULTS \nOur first experiment is with a version of Tai Chi Ch'uan (a Chinese martial and \nmeditative art) that is practiced while sitting. Using our self-calibrating, 3-D stereo \nblob tracker [1], we obtained 3-D hand tracking data for three Tai Chi gestures \ninvolving two semi-independent arm motions: the left single whip, the left cobra, and \nthe left brush knee. Figure 2 illustrates one of the gestures and the blob-tracking. \nA detailed description of this set of Tai Chi experimental results can be found in [3] \nand viewed at http://nuria.www.media.mit.edu/~nuria/chmm/taichi.html. \n\nFigure 2: Selected frames from 'left brush knee.' \n\nWe collected 52 sequences, roughly 17 of each gesture, and created a feature vector \nconsisting of the 3-D (x, y, z) centroid (mean position) of each of the blobs that \ncharacterize the hands. The resulting six-dimensional time series was used for training \nboth HMMs and CHMMs. \nWe used the best trained HMMs and CHMMs (using 10-fold cross-validation) to \nclassify the full data set of 52 gestures. The Viterbi algorithm was used to find the \nmaximum likelihood model for HMMs and CHMMs. Two-thirds of the testing data \nhad not been seen in training, including gestures performed at varying speeds and \nfrom slightly different views. It can be seen from the classification accuracies, shown \nin table 1, that the CHMMs outperform the HMMs. 
This difference is not due to \nintrinsic modeling power, however; from earlier experiments we know that when a \nlarge number of training samples is available then HMMs can reach similar accuracies. \nWe conclude thus that for data where there are two partially-independent \nprocesses (e.g., coordinated but not exactly linked), the CHMM method requires \nmuch less training to achieve a high classification accuracy. \n\nTable 1: Recognition accuracies for HMMs and CHMMs on Tai Chi gestures. The expressions \nin parentheses correspond to the number of parameters of the largest best-scoring \nmodel. \n\nRecognition Results on Tai Chi Gestures \n\n            Single HMMs          Coupled HMMs (CHMMs) \nAccuracy    69.2% (25+30+180)    100% (27+18+54) \n\nTable 1 illustrates the source of this training advantage. The numbers in \nparentheses correspond to the number of degrees of freedom in the largest best-scoring \nmodel: state-to-state probabilities + output means + output covariances. \nThe conventional HMM has a large number of covariance parameters because it \nhas a 6-D output variable, whereas the CHMM architecture has two 3-D output \nvariables. In consequence, due to their larger dimensionality HMMs need much \nmore training data than equivalent CHMMs before yielding good generalization \nresults. \nOur second experiment was with a pedestrian video surveillance task (footnote 3); the goal was \nfirst to recognize typical pedestrian behaviors in an open plaza (e.g., walk from A to \nB, run from C to D), and second to recognize interactions between the pedestrians \n(e.g., person X greets person Y). The task is to reliably and robustly detect and \ntrack the pedestrians in the scene. We use in this case 2-D blob features for modeling \neach pedestrian. 
In our system one of the main cues for clustering the pixels into \nblobs is motion, because we have a static background with moving objects. To \ndetect these moving objects we build an eigenspace that models the background. \nDepending on the dynamics of the background scene the system can adaptively \nrelearn the eigenbackground to compensate for changes such as big shadows. \nThe trajectories of each blob are computed and saved into a dynamic track memory. \nEach trajectory has an associated first-order EKF that predicts the blob's position \nand velocity in the next frame. As before, the appearance of each blob is modeled \nby a Gaussian PDF in RGB color space, allowing us to handle occlusions. \n\nFigure 3: Typical image from pedestrian plaza: background mean image, input image \nwith blob bounding boxes, and blob segmentation image \n\nThe behaviors we examine are generated by pedestrians walking in an open outdoor \nenvironment. Our goal is to develop a generic, compositional analysis of the \nobserved behaviors in terms of states and transitions between states over time in \nsuch a manner that (1) the states correspond to our common sense notions of human \nbehaviors, and (2) they are immediately applicable to a wide range of sites \nand viewing situations. Figure 3 shows a typical image for our pedestrian scenario, \nthe pedestrians found, and the final segmentation. Two people (each modeled as \nits own generative process) may interact without wholly determining each other's \nbehavior. Instead, each of them has its own internal dynamics and is influenced \n(either weakly or strongly) by others. The probabilities $P_{s_t|s'_{t-1}}$ and $P_{s'_t|s_{t-1}}$ from \nequation 2 describe this kind of interaction, and CHMMs are intended to model \nthem as efficiently as possible. \nWe would like to have a system that will accurately interpret behaviors and interactions \nwithin almost any pedestrian scene with at most minimal training. 
As we have \nalready mentioned, one critical problem is the generation of models that capture \nour prior knowledge about human behavior. To achieve this goal we have developed \na modeling environment that uses synthetic agents to mimic pedestrian behavior in \na virtual environment. The agents can be assigned different behaviors and they can \ninteract with each other as well. Currently they can generate 5 different interacting \nbehaviors and various kinds of individual behaviors (with no interaction). These \nbehaviors are: following, meet and walk together (inter1); approach, meet and go \non separately (inter2) or go on together (inter3); change direction in order to meet, \napproach, meet and continue together (inter4) or go on separately (inter5). The parameters \nof this virtual environment are modeled using data drawn from a 'generic' \nset of real scenes. \n\nFootnote 3: Further information about this system can be found at \nhttp://www.vismod.www.media.mit.edu/~nuria/humanBehavior/humanBehavior.html \n\nBy training the models of the synthetic agents to have good generalization and \ninvariance properties, we can obtain flexible prior models for use when learning the \nhuman behavior models from real scenes. Thus the synthetic prior models allow us \nto learn robust behavior models from a small number of real behavior examples. \nThis capability is of special importance in a visual surveillance task, where typically \nthe behaviors of greatest interest are also the rarest. \nTo test our behavior modeling in the pedestrian scenario, we first used the detection \nand tracking system previously described to obtain 2-D blob features for each person \nin several hours of video. More than 20 examples of following and the first two types \nof meeting behaviors were detected and processed. 
\nCHMMs were then used for modeling three different behaviors: following, meet \nand continue together, and meet and go on separately. Furthermore, an interaction \nversus no interaction detection test was also performed (HMMs performed so poorly \nat this task that their results are not reported). In addition to velocity, heading, \nand position, the feature vectors included the derivative of the relative distance \nbetween two agents, their degree of alignment (dot product of their velocity vectors) \nand the magnitude of the difference in their velocity vectors. \nWe tested on this video data using models trained with two types of data: (1) 'Prior-only \nmodels', that is, models learned entirely from our synthetic-agents environment \nand then applied directly to the real data with no additional training or tuning of \nthe parameters; and (2) 'Posterior models', or prior-plus-real data behavior models \ntrained by starting with the prior-only model and then 'tuning' the models with data \nfrom this specific site, using eight examples of each type of interaction. Recognition \naccuracies for both these 'prior' and 'posterior' CHMMs are summarized in table \n2. It is noteworthy that with only 8 training examples, the recognition accuracy \non the remaining data could be raised to 100%. This demonstrates the ability to \naccomplish extremely rapid refinement of our behavior models from the initial a \npriori models. \n\nTable 2: Accuracies on real pedestrian data: (a) only a priori models, (b) posterior \nmodels (with site-specific training) \n\nAccuracy on Real Pedestrian Data (%) \n\n                      No-inter   Inter1   Inter2   Inter3 \n(a) Prior CHMMs       90.9       93.7     100      100 \n(b) Posterior CHMMs   100        100      100      100 \n\nIn a visual surveillance system the false alarm rate is often as important as the \nclassification accuracy (footnote 4). To analyze this aspect of our system's performance, we \ncalculated the system's ROC curve. 
For accuracies of 95% the false alarm rate was \nless than 0.01. \n\nFootnote 4: In an ideal automatic surveillance system, all the targeted behaviors should be detected \nwith a close-to-zero false alarm rate, so that we can reasonably alert a human operator to \nexamine them further. \n\n5 SUMMARY, CONCLUSIONS AND FUTURE WORK \nIn this paper we have described a computer vision system and a mathematical \nmodeling framework for recognizing different human behaviors and interactions in \ntwo different real domains: human actions in the martial art of Tai Chi and human \ninteractions in a visual surveillance task. Our system combines top-down with \nbottom-up information in a closed feedback loop, with both components employing \na statistical Bayesian approach. \nTwo different state-based statistical learning architectures, namely HMMs and \nCHMMs, have been proposed and compared for modeling behaviors and interactions. \nThe superiority of the CHMM formulation has been demonstrated in terms \nof both training efficiency and classification accuracy. A synthetic agent training \nsystem has been created in order to develop flexible prior behavior models, and we \nhave demonstrated the ability to use these prior models to accurately classify real \nbehaviors with no additional training on real data. This fact is especially important, \ngiven the limited amount of training data available. 
\nFuture directions under current investigation include: extending our agent interactions \nto more than two interacting processes; developing a hierarchical system where \ncomplex behaviors are expressed in terms of simpler behaviors; automatic discovery \nand modeling of new behaviors (both structure and parameters); automatic determination \nof priors, their evaluation and interpretation; developing an attentional \nmechanism with a foveated camera along with a more detailed representation of the \nbehaviors; evaluating the adaptability of off-line learned behavior structures to different \nreal situations; and exploring a sampling approach for recognizing behaviors \nby sampling the interactions generated by our synthetic agents. \n\nAcknowledgments \nSincere thanks to Michael Jordan, Tony Jebara and Matthew Brand for their inestimable \nhelp. \n\nReferences \n1. A. Azarbayejani and A. Pentland. Real-time self-calibrating stereo person-tracker \nusing 3-D shape estimation from blob features. In Proceedings, International Conference \non Pattern Recognition, Vienna, August 1996. IEEE. \n\n2. M. Brand. Coupled hidden Markov models for modeling interacting processes. \nNovember 1996. Submitted to Neural Computation. \n\n3. M. Brand and N. Oliver. Coupled hidden Markov models for complex action recognition. \nIn Proceedings of IEEE CVPR97, 1996. \n\n4. Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. In D. S. \nTouretzky, M. C. Mozer, and M. Hasselmo, editors, NIPS, volume 8, Cambridge, MA, \n1996. MITP. \n\n5. N. Oliver, B. Rosario, and A. Pentland. Statistical modeling of human behaviors. \nTo appear in Proceedings of CVPR98, Perception of Action Workshop, 1998. \n\n6. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in \nspeech recognition. PIEEE, 77(2):257-285, 1989. \n\n7. L. K. Saul and M. I. Jordan. 
Boltzmann chains and hidden Markov models. In \nG. Tesauro, D. S. Touretzky, and T. Leen, editors, NIPS, volume 7, Cambridge, MA, \n1995. MITP. \n\n8. P. Smyth, D. Heckerman, and M. Jordan. Probabilistic independence networks for \nhidden Markov probability models. AI memo 1565, MIT, Cambridge, MA, Feb 1996. \n\n9. C. Williams and G. E. Hinton. Mean field networks that learn to discriminate temporally \ndistorted strings. In Proceedings, connectionist models summer school, pages \n18-22, San Mateo, CA, 1990. Morgan Kaufmann. \n", "award": [], "sourceid": 1560, "authors": [{"given_name": "Nuria", "family_name": "Oliver", "institution": null}, {"given_name": "Barbara", "family_name": "Rosario", "institution": null}, {"given_name": "Alex", "family_name": "Pentland", "institution": null}]}