{"title": "Learning Switching Linear Models of Human Motion", "book": "Advances in Neural Information Processing Systems", "page_first": 981, "page_last": 987, "abstract": null, "full_text": "Learning Switching Linear Models of Human Motion\n\nVladimir Pavlovic and James M. Rehg\nCompaq - Cambridge Research Lab\nCambridge, MA 02139\n{vladimir.pavlovic,jim.rehg}@compaq.com\n\nJohn MacCormick\nCompaq - Systems Research Center\nPalo Alto, CA 94301\njohn.maccormick@compaq.com\n\nAbstract\n\nThe human figure exhibits complex and rich dynamic behavior that is both nonlinear and time-varying. Effective models of human dynamics can be learned from motion capture data using switching linear dynamic system (SLDS) models. We present results for human motion synthesis, classification, and visual tracking using learned SLDS models. Since exact inference in SLDS is intractable, we present three approximate inference algorithms and compare their performance. In particular, a new variational inference algorithm is obtained by casting the SLDS model as a Dynamic Bayesian Network. Classification experiments show the superiority of SLDS over conventional HMMs for our problem domain.\n\n1 Introduction\n\nThe human figure exhibits complex and rich dynamic behavior. Dynamics are essential to the classification of human motion (e.g. gesture recognition) as well as to the synthesis of realistic figure motion for computer graphics. In visual tracking applications, dynamics can provide a powerful cue in the presence of occlusions and measurement noise. Although the use of kinematic models in figure motion analysis is now commonplace, dynamic models have received relatively little attention. The kinematics of the figure specify its degrees of freedom (e.g. joint angles and torso pose) and define a state space. 
A stochastic dynamic model imposes additional structure on the state space by specifying a probability distribution over state trajectories.\n\nWe are interested in learning dynamic models from motion capture data, which provides a training corpus of observed state space trajectories. Previous work by a number of authors has applied Hidden Markov Models (HMMs) to this problem. More recently, switching linear dynamic system (SLDS) models have been studied in [5, 12]. In SLDS models, the Markov process controls an underlying linear dynamic system, rather than a fixed Gaussian measurement model.¹ By mapping discrete hidden states to piecewise linear measurement models, the SLDS framework has potentially greater descriptive power than an HMM. Offsetting this advantage is the fact that exact inference in SLDS is intractable. Approximate inference algorithms are required, which in turn complicates SLDS learning.\n\nIn this paper we present a framework for SLDS learning and apply it to figure motion modeling. We derive three different approximate inference schemes: Viterbi [13], variational, and GPB2 [1]. We apply learned motion models to three tasks: classification, motion synthesis, and visual tracking. Our results include an empirical comparison between SLDS\n\n¹ SLDS models are sometimes referred to as jump-linear or conditional Gaussian models, and have been studied in the controls and econometrics literatures.\n\n(a) SLDS as a Bayesian net. (b) Factorization of SLDS.\n\nFigure 1: (a) SLDS model as a Dynamic Bayesian Network. s is the discrete switch state, x is the continuous state, and y is its observation. (b) Factorization of SLDS into a decoupled HMM and LDS.\n\nand HMM models on classification and one-step-ahead prediction tasks. The SLDS model class consistently outperforms standard HMMs even on fairly simple motion sequences. 
\n\nOur results suggest that SLDS models are a promising tool for figure motion analysis, and could play a key role in applications such as gesture recognition, visual surveillance, and computer animation. In addition, this paper provides a summary of approximate inference techniques which is lacking in the previous literature on SLDS. Furthermore, our variational inference algorithm is novel, and it provides another example of the benefit of interpreting classical statistical models as (mixed-state) graphical models.\n\n2 Switching Linear Dynamic System Model\n\nA switching linear dynamic system (SLDS) model describes the dynamics of a complex, nonlinear physical process by switching among a set of linear dynamic models over time. The system can be described using the following set of state-space equations,\n\nx_{t+1} = A(s_{t+1}) x_t + v_{t+1}(s_{t+1}),   y_t = C x_t + w_t,   Pr(s_{t+1} = i | s_t = j) = Π(i, j),\n\nfor the plant and the switching model. The meaning of the variables is as follows: x_t ∈ R^N denotes the hidden state of the LDS, and v_t is the state noise process. Similarly, y_t ∈ R^M is the observed measurement and w_t is the measurement noise. Parameters A and C are the typical LDS parameters: the state transition matrix and the observation matrix, respectively. We assume that the LDS models a Gauss-Markov process with i.i.d. Gaussian noise processes v_t(s_t) ~ N(0, Q(s_t)). The switching model is a discrete first-order Markov process with state variables s_t from a set of S states. The switching model is defined by the state transition matrix Π and an initial state distribution π_0. The LDS and the switching process are coupled due to the dependence of the LDS parameters A and Q on the switching state s_t: A(s_t = i) = A_i, Q(s_t = i) = Q_i.\n\nThe complex state-space representation is equivalently depicted by the DBN dependency graph in Figure 1(a). 
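To make the state-space equations above concrete, the following minimal Python sketch samples a trajectory from a two-regime SLDS. All parameter values (the A_i, Q_i, C, R, Π, and π_0 below) are illustrative placeholders, not models learned from motion capture data.

```python
import numpy as np

rng = np.random.default_rng(0)

S, N, M = 2, 2, 1                            # switch states, state dim, obs dim
A = [np.array([[1.0, 1.0], [0.0, 1.0]]),     # regime 0: constant velocity
     np.array([[1.0, 0.5], [0.0, 0.9]])]     # regime 1: damped velocity
Q = [0.01 * np.eye(N), 0.05 * np.eye(N)]     # per-regime state noise
C = np.array([[1.0, 0.0]])                   # shared observation matrix
R = 0.1 * np.eye(M)                          # measurement noise
Pi = np.array([[0.95, 0.05], [0.10, 0.90]])  # Pi[j, i] = Pr(s_{t+1}=i | s_t=j)
pi0 = np.array([1.0, 0.0])                   # initial switch distribution

def sample_slds(T):
    # Draw (s, x, y) from the SLDS generative model defined above.
    s = np.empty(T, dtype=int)
    x = np.empty((T, N))
    y = np.empty((T, M))
    s[0] = rng.choice(S, p=pi0)
    x[0] = rng.multivariate_normal(np.zeros(N), Q[s[0]])
    y[0] = C @ x[0] + rng.multivariate_normal(np.zeros(M), R)
    for t in range(1, T):
        s[t] = rng.choice(S, p=Pi[s[t - 1]])
        x[t] = A[s[t]] @ x[t - 1] + rng.multivariate_normal(np.zeros(N), Q[s[t]])
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(M), R)
    return s, x, y

s_seq, x_seq, y_seq = sample_slds(50)
```

Sampling from a learned model in this manner underlies the motion synthesis experiment of Section 5.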
The dependency graph implies that the joint distribution P(Y_T, X_T, S_T) over the variables of the SLDS can be written as\n\nPr(s_0) ∏_{t=1}^{T-1} Pr(s_t | s_{t-1}) Pr(x_0 | s_0) ∏_{t=1}^{T-1} Pr(x_t | x_{t-1}, s_t) ∏_{t=0}^{T-1} Pr(y_t | x_t),   (1)\n\nwhere Y_T, X_T, and S_T denote the sequences (of length T) of observations and hidden state variables. From the Gauss-Markov assumption on the LDS and the Markov switching assumption, we can expand Equation 1 into the parameterized joint pdf of the SLDS of duration T.\n\nLearning in complex DBNs can be cast as ML learning in general Bayesian networks. The generalized EM algorithm can then be used to find optimal values of the DBN parameters {A, C, Q, R, Π, π_0}. Inference, which is addressed in the next section, is the most complex step in SLDS learning. Given the sufficient statistics from the inference phase, the parameter update equations in the maximization (M) step are easily obtained by maximizing the expected log of Equation 1 with respect to the LDS and MC parameters (see [13]).\n\n3 Inference in SLDS\n\nThe goal of inference in complex DBNs is to estimate the posterior P(X_T, S_T | Y_T). If there were no switching dynamics, the inference would be straightforward - we could infer X_T from Y_T using LDS inference. However, the presence of switching dynamics makes exact inference exponentially hard, as the distribution of the system state at time t is a mixture of S^t Gaussians. Tractable, approximate inference algorithms are therefore required. We describe three methods: Viterbi, variational, and generalized Pseudo Bayesian.\n\n3.1 Approximate Viterbi Inference\n\nThe Viterbi approximation approach finds the most likely sequence of switching states S_T* for a given observation sequence Y_T. Namely, the desired posterior P(X_T, S_T | Y_T) is approximated by its mode Pr(X_T | S_T*, Y_T). It is well known how to apply Viterbi inference to discrete-state hidden Markov models and continuous-state Gauss-Markov models. 
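The factorized joint of Equation 1 translates directly into a sum of log-terms. The scalar sketch below scores a (switching, plant, observation) trajectory under hypothetical two-regime parameters; in EM learning, the M-step maximizes the expectation of exactly this quantity.

```python
import numpy as np

# Hypothetical scalar SLDS parameters, for illustration only.
A = np.array([1.0, 0.5]); Q = np.array([0.01, 0.02])  # per-regime plant
C, R = 1.0, 0.05                                      # shared observation model
Pi = np.array([[0.9, 0.1], [0.2, 0.8]])               # Pi[j, i] = Pr(s_t=i | s_{t-1}=j)
pi0 = np.array([0.5, 0.5])

def log_gauss(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def log_joint(s, x, y):
    # log Pr(s_0) + log Pr(x_0 | s_0), then the three products of Eq. (1).
    lp = np.log(pi0[s[0]]) + log_gauss(x[0], 0.0, Q[s[0]])
    for t in range(1, len(y)):
        lp += np.log(Pi[s[t - 1], s[t]])                    # switching chain
        lp += log_gauss(x[t], A[s[t]] * x[t - 1], Q[s[t]])  # plant dynamics
    for t in range(len(y)):
        lp += log_gauss(y[t], C * x[t], R)                  # measurements
    return lp

lp = log_joint([0, 0, 1], [0.0, 0.05, 0.02], [0.01, 0.06, 0.03])
```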
Here we review an algorithm for approximate Viterbi inference in SLDSs presented in [13].\n\nWe have shown in [13] that one can use a recursive procedure to find the best switching sequence S_T* = argmax_{S_T} Pr(S_T | Y_T). At the heart of this recursion lies the approximation of the partial probability of the switching sequence and observations up to time t,\n\nJ_{t,i} = max_{S_{t-1}} Pr(S_{t-1}, s_t = i, Y_t) ≈ max_j { Pr(y_t | s_t = i, s_{t-1} = j, S*_{t-2}(j), Y_{t-1}) Pr(s_t = i | s_{t-1} = j) J_{t-1,j} }.   (2)\n\nThe two scaling components are the likelihood associated with the transition j → i from time t - 1 to t, and the probability of the discrete SLDS switching from j to i. Together they play the role of a \"transition probability\", and we denote them jointly by J_{t|t-1,i,j}.\n\nThe likelihood term can easily be found using Kalman updates, concurrent with the recursion of Equation 2. See [13] for details. The Viterbi inference algorithm can now be written:\n\nInitialize LDS state estimates x_{0|-1,i} and Σ_{0|-1,i};\nInitialize J_{0,i};\nfor t = 1 : T-1\n  for i = 1 : S\n    for j = 1 : S\n      Predict and filter LDS state estimates x_{t|t,i,j} and Σ_{t|t,i,j};\n      Find the j → i \"transition probability\" J_{t|t-1,i,j};\n    end\n    Find the best transition ψ_{t-1,i} into state i;\n    Update sequence probabilities J_{t,i} and LDS state estimates x_{t|t,i} and Σ_{t|t,i};\n  end\nend\nFind the \"best\" final switching state s*_{T-1} and backtrace the best switching sequence S_T*;\nDo RTS smoothing for S = S_T*;\n\n3.2 Approximate Variational Inference\n\nA general structured variational inference technique for Bayesian networks is described in [8]. Namely, an η-parameterized distribution Q(η) is constructed which is \"close\" to the desired conditional distribution P but is computationally feasible. In our case we define Q by decoupling the switching and LDS portions of the SLDS as shown in Figure 1(b). 
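A compact sketch of the pseudocode above for a scalar SLDS: one Kalman filter per switching state, with the recursion of Eq. (2) keeping only the best predecessor j for each state i at every step. The two-regime parameters and the test signal are hypothetical, and the final RTS smoothing pass is omitted.

```python
import numpy as np

A = np.array([1.0, 0.5])                 # per-regime state transition (scalar LDS)
Q = np.array([0.01, 0.01])               # per-regime state noise
C, R = 1.0, 0.05                         # shared observation model
Pi = np.array([[0.9, 0.1], [0.1, 0.9]])  # Pi[j, i] = Pr(s_t = i | s_{t-1} = j)
S = 2

def viterbi_slds(y):
    T = len(y)
    logJ = np.zeros((T, S))              # log J_{t,i} of Eq. (2)
    back = np.zeros((T, S), dtype=int)   # best predecessor j for each (t, i)
    mu, P = np.zeros(S), np.ones(S)      # one filtered estimate per switch state
    for i in range(S):                   # t = 0: measurement update only
        Sv = C * P[i] * C + R
        e = y[0] - C * mu[i]
        logJ[0, i] = -0.5 * (np.log(2 * np.pi * Sv) + e * e / Sv)
        K = P[i] * C / Sv
        mu[i], P[i] = mu[i] + K * e, (1 - K * C) * P[i]
    for t in range(1, T):
        mu_new, P_new = np.zeros(S), np.zeros(S)
        for i in range(S):
            best, cand = -np.inf, None
            for j in range(S):           # Kalman predict/update from predecessor j
                mp = A[i] * mu[j]
                Pp = A[i] * P[j] * A[i] + Q[i]
                Sv = C * Pp * C + R
                e = y[t] - C * mp
                ll = -0.5 * (np.log(2 * np.pi * Sv) + e * e / Sv)
                score = ll + np.log(Pi[j, i]) + logJ[t - 1, j]
                if score > best:
                    K = Pp * C / Sv
                    best, cand = score, (j, mp + K * e, (1 - K * C) * Pp)
            back[t, i], mu_new[i], P_new[i] = cand
            logJ[t, i] = best
        mu, P = mu_new, P_new
    s = np.empty(T, dtype=int)           # backtrace the best switching sequence
    s[-1] = int(np.argmax(logJ[-1]))
    for t in range(T - 1, 0, -1):
        s[t - 1] = back[t, s[t]]
    return s

y = np.concatenate([np.ones(20), 0.5 ** np.arange(1, 21)])  # regime change at t = 20
s_hat = viterbi_slds(y)
```

On this test signal, which switches from a constant to a geometrically decaying regime at t = 20, the backtraced sequence should pick up the change.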
The original distribution is factorized into two independent distributions, a Hidden Markov Model (HMM) Q_S with variational parameters {q_0, ..., q_{T-1}} and a time-varying LDS Q_X with variational parameters {x̂_0, Â_0, ..., Â_{T-1}, Q̂_0, ..., Q̂_{T-1}}.\n\nThe optimal values of the variational parameters η are obtained by minimizing the KL divergence w.r.t. η. For example, we arrive at the following optimal variational parameters:\n\nQ̂_t^{-1} = ∑_i Pr(s_t = i) Q_i^{-1},\nÂ_t = Q̂_t ∑_i Pr(s_t = i) Q_i^{-1} A_i,\nlog q_t(i) = -(1/2) ⟨(x_t - A_i x_{t-1})' Q_i^{-1} (x_t - A_i x_{t-1})⟩ - (1/2) log |Q_i|.   (3)\n\nTo obtain the terms Pr(s_t) = Pr(s_t | q_0, ..., q_{T-1}) we use inference in the HMM with output \"probabilities\" q_t. Similarly, to obtain ⟨x_t⟩ = E[x_t | Y_T] we perform LDS inference in the decoupled time-varying LDS via RTS smoothing. Equation 3 together with the inference solutions in the decoupled models forms a set of fixed-point equations. The solution of this fixed-point set is a tractable approximation to the intractable inference of the fully coupled SLDS. The variational inference algorithm for fully coupled SLDSs can now be summarized as:\n\nerror = ∞;\nInitialize Pr(s_t);\nwhile (KL divergence > maxError)\n  Find Q̂_t, Â_t, x̂_0 from Pr(s_t) (Eq. 3);\n  Estimate ⟨x_t⟩, ⟨x_t x_t'⟩ and ⟨x_t x_{t-1}'⟩ from Y_T using time-varying LDS inference;\n  Find q_t from ⟨x_t⟩, ⟨x_t x_t'⟩ and ⟨x_t x_{t-1}'⟩ (Eq. 3);\n  Estimate Pr(s_t) from q_t using HMM inference.\nend\n\nThe variational parameters in Equation 3 have an intuitive interpretation. The LDS parameters Â_t and Q̂_t^{-1} define the best unimodal representation of the corresponding switching system and are, roughly, averages of the original parameters weighted by the best estimates of the switching states P(s_t). The HMM variational parameters log q_t, on the other hand, measure the agreement of each individual LDS with the data. 
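The first two updates of Eq. (3) are precision-weighted averages of the per-regime LDS parameters under the current switching-state estimates. A small sketch, with hypothetical two-regime parameters:

```python
import numpy as np

# Hypothetical per-regime parameters (two regimes, 2-D state).
A = [np.array([[1.0, 1.0], [0.0, 1.0]]),
     np.array([[1.0, 0.0], [0.0, 0.5]])]
Q = [0.01 * np.eye(2), 0.04 * np.eye(2)]

def variational_lds_params(p_s):
    # p_s[i] = current estimate of Pr(s_t = i).
    Qinv = sum(p * np.linalg.inv(Qi) for p, Qi in zip(p_s, Q))
    Q_hat = np.linalg.inv(Qinv)   # Q_hat_t^{-1} = sum_i Pr(s_t=i) Q_i^{-1}
    A_hat = Q_hat @ sum(p * np.linalg.inv(Qi) @ Ai
                        for p, Qi, Ai in zip(p_s, Q, A))
    return A_hat, Q_hat

A_hat, Q_hat = variational_lds_params([0.8, 0.2])
```

When Pr(s_t = i) puts all its mass on one regime, Â_t and Q̂_t reduce to that regime's A_i and Q_i, as expected.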
\n\n3.3 Approximate Generalized Pseudo Bayesian Inference\n\nThe Generalized Pseudo Bayesian [1, 9] (GPB) approximation scheme is based on the general idea of \"collapsing\" a mixture of S^t Gaussians onto a mixture of S^r Gaussians, where r < t (see [12] for a detailed review). While there are several variations on this idea, our focus is the GPB2 algorithm, which maintains a mixture of S^2 Gaussians over time and can be reformulated to include smoothing as well as filtering.\n\nGPB2 is closely related to the Viterbi approximation of Section 3.1. Instead of picking the most likely previous switching state j, we collapse the S Gaussians (one for each possible value of j) into a single Gaussian. Namely, the state at time t is obtained as x_{t|t,i} = ∑_j x_{t|t,i,j} Pr(s_{t-1} = j | s_t = i, Y_t).\n\nSmoothing in GPB2 is unfortunately a more involved process that includes several additional approximations. Details can be found in [12]. Effectively, an RTS smoother can be constructed when an assumption is made that decouples the MC model from the LDS when smoothing the MC states. Together with filtering, this results in the following GPB2 algorithm pseudocode:\n\nInitialize LDS state estimates x_{0|-1,i} and Σ_{0|-1,i};\nInitialize Pr(s_0 = i) = π_0(i);\nfor t = 1 : T-1\n  for i = 1 : S\n    for j = 1 : S\n      Predict and filter LDS state estimates x_{t|t,i,j}, Σ_{t|t,i,j};\n      Find switching state distributions Pr(s_t = i | Y_t), Pr(s_{t-1} = j | s_t = i, Y_t);\n      Collapse x_{t|t,i,j}, Σ_{t|t,i,j} to x_{t|t,i}, Σ_{t|t,i};\n    end\n    Collapse x_{t|t,i} and Σ_{t|t,i} to x_{t|t} and Σ_{t|t};\n  end\nend\nDo GPB2 smoothing;\n\nThe inference process of GPB2 is more involved than those of the Viterbi or the variational approximation. Unlike Viterbi, GPB2 provides soft estimates of the switching states at each time t. Like Viterbi, GPB2 is a local approximation scheme and as such does not guarantee the global optimality inherent in the variational approximation. 
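The collapse step at the core of GPB2 is standard moment matching: a weighted Gaussian mixture is replaced by the single Gaussian with the same mean and covariance. A minimal sketch (the mixture below is illustrative, not the full GPB2 filter):

```python
import numpy as np

def collapse(w, mu, Sigma):
    # Collapse the mixture {(w_j, mu_j, Sigma_j)} to one Gaussian by
    # matching its first two moments.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu = np.asarray(mu, dtype=float)
    m = (w[:, None] * mu).sum(axis=0)    # mixture mean
    S = np.zeros((mu.shape[1], mu.shape[1]))
    for wj, mj, Sj in zip(w, mu, Sigma):
        d = (mj - m)[:, None]
        S += wj * (Sj + d @ d.T)         # within- plus between-component covariance
    return m, S

m, S = collapse([0.5, 0.5],
                [np.array([0.0, 0.0]), np.array([2.0, 0.0])],
                [np.eye(2), np.eye(2)])
```

The collapsed covariance picks up both the within-component spread and the between-component scatter of the means.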
Some recent work (see [3]) on this type of local approximation in general DBNs has emerged that provides conditions for it to be globally optimal.\n\n4 Previous Work\n\nSLDS models and their equivalents have been studied in statistics, time-series modeling, and target tracking since the early 1970s. See [13, 12] for a review. Ghahramani [6] introduced a DBN framework for learning and approximate inference in one class of SLDS models. His underlying model differs from ours in assuming the presence of S independent, white-noise-driven LDSs whose measurements are selected by the Markov switching process. A switching framework for particle filters applied to dynamics learning is described in [2]. Manifold learning [7] is another approach to constraining the set of allowable trajectories within a high-dimensional state space. An HMM-based approach is described in [4].\n\n5 Experimental Results\n\nThe data set for our experiments is a corpus of 18 sequences of six individuals performing walking and jogging. Each sequence was approximately 50 frames in duration. All of the motion was fronto-parallel (i.e. it occurred in a plane parallel to the camera plane, as in Figure 2(c)). This simplifies data acquisition and kinematic modeling, while self-occlusions and cluttered backgrounds make the tracking problem non-trivial. Our kinematic model had eight DOFs, corresponding to rotations at the knees, hip, and neck (and ignoring the arms). The link lengths were adjusted manually for each person.\n\nThe first task we addressed was learning HMM and SLDS models for walking and jogging. Each of the two motion types was modeled as a one-, two-, or four-state HMM and SLDS model and then combined into a single complex jog-walk model. In addition, each SLDS motion model was assumed to be of either first or second order.² 
Hence, a total of three models (HMM, first-order SLDS, and second-order SLDS) were considered for each cardinality (one, two, or four) of the switching state.\n\nHMM models were initially assumed to be fully connected. Their parameters were then learned using standard EM learning, initialized by k-means clustering. The learned HMM models were used to initialize the switching state segmentations for the SLDS models. The SLDS model parameters (A, Q, R, x_0, Π, π_0) were then reestimated using EM. The inference in SLDS learning was accomplished using the three approximate methods outlined in Section 3: Viterbi, GPB2, and variational inference.\n\nResults of SLDS learning using any of the three approximate inference methods did not produce significantly different models. This can be explained by the fact that initial segmentations using the HMM and the initial SLDS parameters were all very close to a\n\n² Second-order SLDS models imply x_t = A_1(s_t) x_{t-1} + A_2(s_t) x_{t-2}.\n\n[Figure 2 plots omitted.]\n\n(a) One switching state, second-order SLDS. (b) Four switching states, second-order SLDS. (c) KF, frame 7. (d) SLDS, frame 7. (e) SLDS, frame 20. (f) Synthesized walking motion.\n\nFigure 2: (a)-(b) show examples of classification results on mixed walk-jog sequences using models of different order. (c)-(e) compare constant velocity and SLDS trackers, and (f) shows motion synthesis.\n\nlocally optimal solution, and all three inference schemes indeed converged to the same or similar posteriors. 
\n\nWe next addressed the classification of unknown motion sequences in order to test the relative performance of inference in HMM and SLDS. Test sequences of walking and jogging motion were selected randomly and spliced together using B-spline smoothing. Segmentation of the resulting sequences into \"walk\" and \"jog\" regimes was accomplished using Viterbi inference in the HMM model and approximate Viterbi, GPB2, and variational inference under the SLDS model. Estimates of the \"best\" switching states Pr(s_t) indicated which of the two models was considered to be the source of the corresponding motion segment.\n\nFigure 2(a)-(b) shows results for two representative combinations of switching state and linear model orders. In Figure 2(a), the top graph depicts the true sequence of jog-walk motions, followed by Viterbi, GPB2, variational, and HMM classifications. Each motion type (jog and walk) is modeled using one switching state and a second-order LDS. Figure 2(b) shows the result when the number of switching states is increased to four.\n\nThe accuracy of classification increases with the number of switching states and the LDS model order. More interesting, however, is that the HMM model consistently yields lower segmentation accuracy than all of the SLDS inference schemes. This is not surprising, since the HMM model does not impose continuity across time in the plant state space (x), which does indeed exist in natural figure motion (joint angles evolve continuously in time). Quantitatively, the three SLDS inference schemes produce very similar results. Qualitatively, GPB2 produces \"soft\" state estimates, while the Viterbi scheme does not. Variational is somewhere in between. In terms of computational complexity, Viterbi seems to be the clear winner.\n\nOur next experiment addressed the use of learned dynamic models in visual tracking. 
The primary difficulty in visual tracking is that joint angle measurements are not readily available from a sequence of image intensities. We use image templates for each link in the figure model, initialized from the first video frame, to track the figure through template registration [11]. A conventional extended Kalman filter using a constant velocity dynamic model performs poorly on simple walking motion, due to pixel noise and self-occlusions, and fails by frame 7 as shown in Figure 2(c). We employ approximate Viterbi inference in SLDS as a multi-hypothesis predictor that initializes multiple local template searches in the image space. From the S^2 multiple hypotheses x_{t|t-1,i,j} at each time step, we pick the best S hypotheses with the smallest switching cost, as determined by Equation 2. Figure 2(d)-(e) shows the superior performance of the SLDS tracker on the same image sequence. The tracker is well-aligned at frame 7 and only starts to drift off by frame 20. This is not terribly surprising, since the SLDS tracker effectively has S (extended) Kalman filters, but it is an encouraging result.\n\nThe final experiment simulated walking motion by sampling from a learned SLDS walking model. A stick figure animation obtained by superimposing 50 frames of walking is shown in Figure 2(f). The discrete states used to generate the motion are plotted at the bottom of the figure. The synthesized walk becomes less realistic as the simulation time progresses, due to the lack of global constraints on the trajectories.\n\n6 Conclusions\n\nDynamic models for human motion can be learned within a Switching Linear Dynamic System (SLDS) framework. We have derived three approximate inference algorithms for SLDS: Viterbi, GPB2, and variational. Our variational algorithm is novel in the SLDS domain. We show that SLDS classification performance is superior to that of HMMs. 
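The pruning step described above, keeping the best of the S^2 hypotheses for each switching state, can be sketched as follows; the scores and predictions are illustrative stand-ins for J_{t|t-1,i,j} and x_{t|t-1,i,j}.

```python
import numpy as np

S = 2
J = np.array([[-1.2, -3.5],      # J[i, j]: log-score of hypothesis (j -> i)
              [-2.8, -0.7]])
x_pred = np.array([[0.9, 1.4],   # x_{t|t-1,i,j} (scalar state for brevity)
                   [0.4, 0.6]])

best_j = np.argmax(J, axis=1)                 # best predecessor per state i
hypotheses = x_pred[np.arange(S), best_j]     # the S surviving predictions
```

Each surviving prediction then seeds one local template search in the image.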
We demonstrate that a tracker based on SLDS is more effective than a conventional Extended Kalman Filter. We show synthesis of natural walking motion by sampling. In future work we will build more complex motion models using a much larger motion capture dataset, which we are currently collecting. We will also extend the SLDS tracker to more complex measurement models and complex discrete state processes (see [10] for a recent approach).\n\nReferences\n\n[1] Y. Bar-Shalom and X.-R. Li, Estimation and Tracking: Principles, Techniques, and Software. 1998.\n[2] A. Blake, B. North, and M. Isard, \"Learning multi-class dynamics,\" in NIPS '98, 1998.\n[3] X. Boyen, N. Friedman, and D. Koller, \"Discovering the hidden structure of complex dynamic systems,\" in Proc. Uncertainty in Artificial Intelligence, 1999.\n[4] M. Brand, \"An entropic estimator for structure discovery,\" in NIPS '98, 1998.\n[5] C. Bregler, \"Learning and recognizing human dynamics in video sequences,\" in Proc. Int'l Conf. Computer Vision and Pattern Recognition (CVPR), 1997.\n[6] Z. Ghahramani and G. E. Hinton, \"Switching state-space models.\" 1998.\n[7] N. Howe, M. Leventon, and W. Freeman, \"Bayesian reconstruction of 3d human motion from single-camera video,\" in NIPS '99, 1999.\n[8] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, \"An introduction to variational methods for graphical models,\" in Learning in Graphical Models, 1998.\n[9] C.-J. Kim, \"Dynamic linear models with Markov-switching,\" J. Econometrics, vol. 60, 1994.\n[10] U. Lerner, R. Parr, D. Koller, and G. Biswas, \"Bayesian fault detection and diagnosis in dynamic systems,\" in Proc. AAAI, (Austin, TX), 2000.\n[11] D. Morris and J. Rehg, \"Singularity analysis for articulated object tracking,\" in CVPR, 1998.\n[12] K. P. Murphy, \"Learning switching Kalman-filter models,\" TR 98-10, Compaq CRL, 1998.\n[13] V. Pavlovic, J. M. Rehg, T.-J. Cham, and K. P. 
Murphy, \"A dynamic Bayesian network approach to figure tracking using learned dynamic models,\" in Proc. Intl. Conf. Computer Vision, 1999.\n", "award": [], "sourceid": 1892, "authors": [{"given_name": "Vladimir", "family_name": "Pavlovic", "institution": null}, {"given_name": "James", "family_name": "Rehg", "institution": null}, {"given_name": "John", "family_name": "MacCormick", "institution": null}]}