{"title": "Recovering Articulated Model Topology from Observed Rigid Motion", "book": "Advances in Neural Information Processing Systems", "page_first": 1335, "page_last": 1342, "abstract": null, "full_text": "Recovering Articulated Model Topology from Observed Rigid Motion\n\nLeonid Taycher, John W. Fisher III, and Trevor Darrell\n\nArtificial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA, 02139\n\n{lodrion, fisher, trevor}@ai.mit.edu\n\nAbstract\n\nAccurate representation of articulated motion is a challenging problem for machine perception. Several successful tracking algorithms have been developed that model the human body as an articulated tree. We propose a learning-based method for creating such articulated models from observations of multiple rigid motions. This paper is concerned with recovering the topology of the articulated model when the rigid motion of the constituent segments is known. Our approach is based on finding the Maximum Likelihood tree-shaped factorization of the joint probability density function (PDF) of rigid segment motions. The topology of the graphical model formed from this factorization corresponds to the topology of the underlying articulated body. We demonstrate the performance of our algorithm on both synthetic and real motion capture data.\n\n1 Introduction\n\nTracking human motion is an integral part of many proposed human-computer interfaces, surveillance and identification systems, as well as animation and virtual reality systems. A common approach to this task is to model the body as a kinematic tree, and reformulate the problem as articulated body tracking [6]. Most of the state-of-the-art systems rely on predefined kinematic models [16]. 
Some methods require manual initialization, while others use heuristics [12] or predefined protocols [10] to adapt the model to observations.\n\nWe are interested in a principled way to recover articulated models from observations. The recovered models may then be used for further tracking and/or recognition. We would like to approach model estimation as a multistage problem. In the first stage, the rigidly moving segments are tracked independently; in the second stage, the topology of the body (the connectivity between the segments) is recovered. After the topology is determined, the joint parameters may be estimated.\n\nIn this paper we concentrate on the second stage of this task, estimating the underlying topology of the observed articulated body, when the motion of the constituent rigid bodies is known. We approach this as a learning problem, in the spirit of [17]. If we assume that the body may be modeled as a kinematic tree, and the motion of a particular rigid segment is known, then the motions of the rigid segments that are connected through that segment are conditionally independent of each other. That is, we can model a probability distribution of the full body pose as a tree-structured graphical model, where each node corresponds to the pose of a rigid segment. This observation allows us to formulate the problem of recovering the topology of an articulated body as finding the tree-shaped graphical model that best (in the Maximum Likelihood sense) describes the observations.\n\n2 Prior Work\n\nWhile state-of-the-art tracking algorithms [16] do not address either model creation or model initialization, the necessity of automating these two steps has long been recognized.\n\nThe approach in [10] required a subject to follow a set of predefined movements, and recovered the descriptions of body parts and body topology from deformations of apparent contours. 
Various heuristics were used in [12] to adapt an articulated model of known topology to 3D observations. Analysis of magnetic motion capture data was used by [14] to recover limb lengths and joint locations for known topology; that work also suggested similar analysis for topology extraction. A learning-based approach for decomposing a set of observed marker positions and velocities into sets corresponding to various body parts was described in [17]. Our work builds on the latter two approaches in estimating the topology of the articulated tree model underlying the observed motion.\n\nSeveral methods have been used to recover multiple rigid motions from video, such as factorization [3, 18], RANSAC [7], and learning-based methods [9]. In this work we assume that the 3-D rigid motions have been recovered and are represented using a 2-D Scaled Prismatic Model (SPM).\n\n3 Representing Pose and Motion\n\nA 2-D Scaled Prismatic Model (SPM) was proposed by [15] and is useful for representing image motion of projections of elongated 3-D objects. It is obtained by orthographically \u201cprojecting\u201d the major axis of the object to the image plane. The SPM has four degrees of freedom: in-plane translation, rotation, and uniform scale. 3-D rigid motion of an object may be simulated by SPM transformations, using in-plane translation for rigid translation, and rotation and uniform scaling for plane-parallel and out-of-plane rotations respectively.\n\nSPM motion (or pose) may be expressed as a linear transformation in projective space as\n\n$$M = \\begin{pmatrix} a & -b & e \\\\ b & a & f \\\\ 0 & 0 & 1 \\end{pmatrix} \\qquad (1)$$\n\nFollowing [13] we have chosen to use exponential coordinates, derived from constant velocity equations, to parameterize motion.\n\nAn SPM transformation may be represented as an exponential map\n\n$$M = e^{\\hat{\\xi}}, \\qquad \\hat{\\xi} = \\theta \\begin{pmatrix} c & -\\omega & v_x \\\\ \\omega & c & v_y \\\\ 0 & 0 & 0 \\end{pmatrix}, \\qquad \\xi = \\theta \\begin{pmatrix} v_x \\\\ v_y \\\\ \\omega \\\\ c \\end{pmatrix} \\qquad (2)$$\n\nIn this representation $v_x$ is the horizontal velocity, $v_y$ the vertical velocity, $\\omega$ the angular velocity, and $c$ the rate of scale change; $\\theta$ is analogous to a time parameter. Note that there is an inherent scale ambiguity, since $\\theta$ and $(v_x, v_y, \\omega, c)^T$ may be chosen arbitrarily, as long as $e^{\\hat{\\xi}} = M$.\n\nIt can be shown ([13]) that if the SPM transformation is a combination of scaling and rotation, it may be expressed by the sum of two twists, with coincident centers $(u_x, u_y)^T$ of rotation and expansion.\n\n$$\\xi = \\omega \\begin{pmatrix} u_y \\\\ -u_x \\\\ 1 \\\\ 0 \\end{pmatrix} + c \\begin{pmatrix} -u_x \\\\ -u_y \\\\ 0 \\\\ 1 \\end{pmatrix} = \\begin{pmatrix} -c & \\omega & 0 & 0 \\\\ -\\omega & -c & 0 & 0 \\\\ 0 & 0 & 1 & 0 \\\\ 0 & 0 & 0 & 1 \\end{pmatrix} \\begin{pmatrix} u_x \\\\ u_y \\\\ \\omega \\\\ c \\end{pmatrix} \\qquad (3)$$\n\nWhile \u201cpure\u201d translation, rotation or scale have intuitive representations with twists, the combination of rotation and scale does not. We propose a scaled twist representation that preserves the intuitiveness of representation for all possible SPM motions. We want to separate the \u201cdirection\u201d of motion (the direction of translation or the relative amounts of rotation and scale) from the amount of motion.\n\nIf the transformation involves rotation and/or scale, then we choose $\\theta$ so that $\\|(\\omega, c)\\|_2 = 1$, and then use Eq. 3 to compute the center of rotation/expansion. The computation may be expressed as a linear transformation:\n\n$$\\tau = \\begin{pmatrix} \\theta \\\\ u_x \\\\ u_y \\\\ \\omega \\\\ c \\end{pmatrix} = \\begin{pmatrix} \\sqrt{\\tilde{\\omega}^2 + \\tilde{c}^2} & 0 & 0 & 0 & 0 \\\\ 0 & \\frac{-\\tilde{c}}{\\tilde{\\omega}^2 + \\tilde{c}^2} & \\frac{-\\tilde{\\omega}}{\\tilde{\\omega}^2 + \\tilde{c}^2} & 0 & 0 \\\\ 0 & \\frac{\\tilde{\\omega}}{\\tilde{\\omega}^2 + \\tilde{c}^2} & \\frac{-\\tilde{c}}{\\tilde{\\omega}^2 + \\tilde{c}^2} & 0 & 0 \\\\ 0 & 0 & 0 & \\frac{1}{\\sqrt{\\tilde{\\omega}^2 + \\tilde{c}^2}} & 0 \\\\ 0 & 0 & 0 & 0 & \\frac{1}{\\sqrt{\\tilde{\\omega}^2 + \\tilde{c}^2}} \\end{pmatrix} \\begin{pmatrix} 1 \\\\ \\tilde{v}_x \\\\ \\tilde{v}_y \\\\ \\tilde{\\omega} \\\\ \\tilde{c} \\end{pmatrix} \\qquad (4)$$\n\nwhere $\\xi = (\\tilde{v}_x, \\tilde{v}_y, \\tilde{\\omega}, \\tilde{c})^T$.\n\nPure translational motion ($\\omega = c = 0$) may be regarded as an infinitely small rotation about a point at infinity, e.g. the translation by $l$ in the direction $(u_x, u_y)$ may be represented as $\\tau = \\lim_{\\omega \\to 0} (l|\\omega|, -\\frac{u_y}{\\omega}, \\frac{u_x}{\\omega}, \\omega, 0)^T$, but we choose a direct representation\n\n$$\\tau = \\begin{pmatrix} \\theta \\\\ u_x \\\\ u_y \\\\ 0 \\\\ 0 \\end{pmatrix} = \\begin{pmatrix} \\sqrt{\\tilde{v}_x^2 + \\tilde{v}_y^2} & 0 & 0 & 0 & 0 \\\\ 0 & \\frac{1}{\\sqrt{\\tilde{v}_x^2 + \\tilde{v}_y^2}} & 0 & 0 & 0 \\\\ 0 & 0 & \\frac{1}{\\sqrt{\\tilde{v}_x^2 + \\tilde{v}_y^2}} & 0 & 0 \\\\ 0 & 0 & 0 & 1 & 0 \\\\ 0 & 0 & 0 & 0 & 1 \\end{pmatrix} \\begin{pmatrix} 1 \\\\ \\tilde{v}_x \\\\ \\tilde{v}_y \\\\ \\tilde{\\omega} \\\\ \\tilde{c} \\end{pmatrix} \\qquad (5)$$\n\nIn both cases $\\tau = A(1, \\tilde{\\xi}^T)^T$, and\n\n$$\\det(A) = \\begin{cases} \\theta^{-3} & \\omega \\neq 0 \\lor c \\neq 0 \\text{ (rotation/scaling)} \\\\ \\theta^{-1} & \\omega = 0 \\land c = 0 \\text{ (pure translation)} \\end{cases} \\qquad (6)$$\n\nNote that $\\tau_I = (0, u_x, u_y, \\omega, c)^T$ represents the identity transformation for any $u_x$, $u_y$, $\\omega$, and $c$. It is always reported as $\\tau_I = 0$.\n\n4 Learning Articulated Topology\n\nWe wish to infer the underlying topology of an articulated body from noisy observations of a set of rigid body motions. Towards that end we will adopt a statistical framework for fitting a joint probability density. As a practical matter, one must make choices regarding density models; we discuss one such choice although other choices are also suitable.\n\nWe denote the set of observed motions of $N$ rigid bodies at time $t$, $1 \\le t \\le F$, as a set $\\{M_s^t \\mid 1 \\le s \\le N\\}$. Graphical models provide a useful methodology for expressing the dependency structure of a set of random variables (cf. [8]). Variables $M_i$ with observations $\\{M_i^t \\mid 1 \\le t \\le F\\}$ are assigned to the vertices of a graph, while edges between nodes indicate dependency. We shall denote the presence or absence of an edge between two variables, $M_i$ and $M_j$, by an index variable $E_{ij}$, equal to one if an edge is present and zero otherwise. Furthermore, if the corresponding graphical model is a spanning tree, it can be expressed as a product of conditional densities (e.g. see [11])\n\n$$P_M(M_1, \\ldots, M_N) = \\prod_{M_s} P_{M_s \\mid \\mathrm{pa}(M_s)}(M_s \\mid \\mathrm{pa}(M_s)) \\qquad (7)$$\n\nwhere $\\mathrm{pa}(M_s)$ is the parent of $M_s$. 
While multiple nodes may have the same parent, each individual node has only one parent node. Furthermore, in any decomposition one node (the root node) has no parent. Any node (variable) in the model can serve as the root node [8]. Consequently, a tree model constrains $E$. Of the possible tree models (choices of $E$), we wish to choose the maximum likelihood tree, which is equivalent to the minimum entropy tree [4]. The entropy of a tree model can be written\n\n$$H(M) = \\sum_s H(M_s) - \\sum_{E_{ij}=1} I(M_i; M_j) \\qquad (8)$$\n\nwhere $H(M_s)$ is the marginal entropy of each variable and $I(M_i; M_j)$ is the mutual information between nodes $M_i$ and $M_j$, which quantifies their statistical dependence. Since the marginal entropies do not depend on $E$, the minimum entropy tree corresponds to the choice of $E$ that maximizes the sum of the pairwise mutual informations over the tree edges [1]. The tree denoted by $E$ can be found via the maximum spanning tree algorithm [2] using $I(M_i; M_j)$ for all $i, j$ as the edge weights.\nOur conjecture is that if our data are sampled from a variety of motions, the topology of the estimated density model is likely to be the same as the topology of the articulated body model. This follows from the intuition that when considering only pairwise relationships, the relative motions of physically connected bodies will be most strongly related.\n\n4.1 Estimation of Mutual Information\n\nComputing the minimum entropy spanning tree requires estimating the pairwise mutual informations between rigid motions $M_i$ and $M_j$ for all $i, j$ pairs. In order to do so we must make a choice regarding the parameterization of motion and a probability density over that parameterization; to estimate articulated topology it is sufficient to use the Scaled Prismatic Model with the twist parameterization described in Section 3.\n\n4.2 Estimating Motion Entropy\n\nWe parameterize rigid motion, $M_i^t$, by the vector of quantities $\\xi_i^t$ (cf. Eq. 2). 
In general,\n\n$$H(M_i) \\neq H(\\xi_i), \\qquad (9)$$\n\nbut since there is a one-to-one correspondence between the $M_i$\u2019s and $\\xi_i$\u2019s [4], we can estimate $I(M_i; M_j)$ by first computing $\\xi_i^t$, $\\xi_j^t$ from $M_i^t$, $M_j^t$:\n\n$$I(M_i; M_j) = I(\\xi_i; \\xi_j) = H(\\xi_j) - H(\\xi_j \\mid \\xi_i) \\qquad (10)$$\n\nFurthermore, if the relative motion $M_{j|i}$ between segments $s_i$ and $s_j$ ($M_j^t = M_i^t M_{j|i}^t$) is assumed to be independent of $M_i$, it can be shown that\n\n$$H(\\xi_j \\mid \\xi_i) = H(\\log(M_i M_{j|i}) \\mid \\log M_i) = H(\\log M_{j|i}) = H(\\xi_{j|i}). \\qquad (11)$$\n\nWe wish to use scaled twists (Section 3) to compute the entropies involved. Since the involved quantities are in the linear relationship $\\tau = A(1, \\tilde{\\xi}^T)^T$ (Eqs. 4 and 5), the entropies are related,\n\n$$H(\\xi) = H(\\tau) - E[\\log \\det(A)], \\qquad (12)$$\n\nwhere $E[\\log \\det(A)]$ may be estimated using Equation 6.\n\n4.3 Estimating the Motion Kernel\n\nIn order to estimate the entropy of motion, we need to estimate the probability density based on the available samples. Since the functional form of the underlying density is not known, we have chosen to use a kernel-based density estimator,\n\n$$\\hat{p}(\\tau) = \\alpha \\sum_i K(\\tau; \\tau_i). \\qquad (13)$$\n\nSince our task is to determine the articulated topology, we wish to concentrate on \u201cspatial\u201d features of the transformation, the center of rotation for rotational motion and the direction of translation for translational motion, that correspond to two common kinds of joints, spherical and prismatic. Thus we need to define a kernel function $K(\\tau_1, \\tau_2)$ that captures the following notion of \u201cdistance\u201d between the motions:\n\n1. If $\\tau_1$ and $\\tau_2$ do not represent pure translational motions, then they should be considered to be close if their centers of rotation are close.\n\n2. 
If $\\tau_1$ and $\\tau_2$ are pure translations, then they should be considered close if their directions are close.\n\n3. If $\\tau_1$ and $\\tau_2$ represent different types of motion (i.e. rotation/scale vs. translation), then they are arbitrarily far apart.\n\n4. The identity transformation ($\\theta = 0$) is equidistant from all possible transformations (since any $(u_x, u_y, \\omega, c)^T$ combined with $\\theta = 0$ produces identity).\n\nOne kernel that satisfies these requirements is the following:\n\n$$K(\\tau_1, \\tau_2) = \\begin{cases} K_R((u_{x1}, u_{y1}), (u_{x2}, u_{y2})) & \\text{Condition 1: } (\\omega_1 \\neq 0 \\lor c_1 \\neq 0) \\land (\\omega_2 \\neq 0 \\lor c_2 \\neq 0) \\\\ K_T((u_{x1}, u_{y1}), (u_{x2}, u_{y2})) & \\text{Condition 2: } \\omega_1 = 0 \\land c_1 = 0 \\land \\omega_2 = 0 \\land c_2 = 0 \\\\ 0 & \\text{Condition 3: } (\\omega_1 \\neq 0 \\lor c_1 \\neq 0) \\land (\\omega_2 = 0 \\land c_2 = 0) \\\\ 0 & \\text{Condition 3: } (\\omega_1 = 0 \\land c_1 = 0) \\land (\\omega_2 \\neq 0 \\lor c_2 \\neq 0) \\\\ \\delta(0) & \\text{Condition 4: } \\theta_1 = 0 \\lor \\theta_2 = 0 \\end{cases} \\qquad (14)$$\n\nwhere $K_R$ and $K_T$ are Gaussian kernels with covariances estimated using methods from [5].\n\n5 Implementation\n\nThe input to our algorithm is a set of SPM poses (Section 3) $\\{P_s^t \\mid 1 \\le s \\le S, 1 \\le t \\le F\\}$, where $S$ is the number of tracked rigid segments and $F$ is the number of frames. In order to compute the mutual information between the motions of segments $s_1$ and $s_2$, we first compute the motions of segment $s_1$ in frames $1 < t \\le F$ relative to its position in frame $t_1 = 1$,\n\n$$M_{s_1}^{t_1 t} = P_{s_1}^t (P_{s_1}^{t_1})^{-1}, \\qquad (15)$$\n\nand the transformation of $s_2$ relative to $s_1$ (with the relative pose $P_{s_2|s_1} = (P_{s_1})^{-1} P_{s_2}$),\n\n$$M_{s_2|s_1}^{t_1 t} = ((P_{s_1}^t)^{-1} P_{s_2}^t)((P_{s_1}^{t_1})^{-1} P_{s_2}^{t_1})^{-1}. \\qquad (16)$$\n\nThe parameter vectors $\\tau_{s_2}^{t_1 t}$ and $\\tau_{s_2|s_1}^{t_1 t}$ are then extracted from the transformation matrices $M_{s_2}$ and $M_{s_2|s_1}$ (cf. Section 3), and the mutual information is estimated as described in Section 4.2.\n\n6 Results\n\nWe have tested our algorithm on both synthetic and motion capture data. Two synthetic sequences were generated with the following steps. First, the rigid segments were positioned by randomly perturbing parameters of the corresponding kinematic tree structure. A set of feature points was then selected for each segment. At each time step, point positions were computed based on the corresponding segment pose, and perturbed with Gaussian noise with zero mean and standard deviation of 1 pixel. The inputs to the algorithm were the segment poses re-estimated from the feature point coordinates. In the motion capture-based experiment, the segment poses were estimated from the marker positions.\n\nThe results of the experiments are shown in Figures 6.1, 6.2 and 6.3. The first experiment involved a simple kinematic chain with 3 segments in order to demonstrate the operation of the algorithm. The system has a rotational joint between S1 and S2 and a prismatic joint between S2 and S3. Sample configurations of the articulated body are shown in the first row of Figure 6.1. The graph computed using the method from Section 4.2 and the corresponding maximum spanning tree are in Figure 6.1(d, e).\n\nThe second experiment involved a humanoid torso-like synthetic model containing 5 rigid segments. It was processed in a way similar to the first experiment. The results are shown in Figure 6.2.\n\nFor the human motion experiment, we have used motion capture data of a dance sequence (Figure 6.3(a-c)). 
The rigid segment motion was extracted from the positions of the markers tracked across 220 frames (the marker correspondence to the body locations was known). The algorithm was able to correctly recover the articulated body topology (compare Figures 6.3(e) and 6.3(a)) when provided only with the extracted segment poses. Dance is a highly structured activity, so not all degrees of freedom were explored in this sequence, and the mutual information between some unconnected segments (e.g. thighs S3 and S7) was determined to be relatively large, although this did not impact the final result.\n\n7 Conclusions\n\nWe have presented a novel general technique for recovering the underlying articulated structure from information about rigid segment motion. Our method relies on only a very weak assumption, that this structure may be represented by a tree with unknown topology. While the results presented in this paper were obtained using the Scaled Prismatic Model and a non-parametric density estimator, our methodology does not rely on either modeling assumption.\n\nReferences\n\n[1] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462\u2013467, May 1968.\n\n[2] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.\n\n[3] Joao Paulo Costeira and Takeo Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159\u2013179, 1998.\n\n[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., New York, 1991.\n\n[5] Luc Devroye. A Course in Density Estimation, volume 14 of Progress in Probability and Statistics. 
Birkh\u00e4user, Boston, 1987.\n\nFigure 6.1: Simple kinematic chain topology recovery. The first row shows 3 sample frames from a 100 frame synthetic sequence. The adjacency matrix of the mutual information graph is shown in (d), with intensities corresponding to edge weights. The vertices in the graph correspond to the rigid segments labeled in (a). (e) is the recovered articulated topology.\n\nFigure 6.2: Humanoid torso synthetic test. The sample frames from a randomly generated 150 frame sequence are shown in (a), (b), and (c). The adjacency matrix of the mutual information graph is shown in (d), with intensities corresponding to edge weights. The vertices in the graph correspond to the rigid segments labeled in (a). (e) is the recovered articulated topology.\n\nFigure 6.3: Motion Capture based test. (a), (b), and (c) are the sample frames from a 220 frame sequence. The adjacency matrix of the mutual information graph is shown in (d), with intensities corresponding to edge weights. The vertices in the graph correspond to the rigid segments labeled in (a). (e) is the recovered articulated topology.\n\n[6] David C. Hogg. Model-based vision: A program to see a walking person. Image and Vision Computing, 1(1):5\u201320, 1983.\n\n[7] Yi-Ping Hung, Cheng-Yuan Tang, Sheng-Wen Shin, Zen Chen, and Wei-Song Lin. A 3d feature-based tracker for tracking multiple moving objects with a controlled binocular head. 
Technical report, Academia Sinica Institute of Information Science, 1995.\n\n[8] Finn Jensen. An Introduction to Bayesian Networks. Springer, 1996.\n\n[9] N. Jojic and B.J. Frey. Learning flexible sprites in video layers. In Computer Vision and Pattern Recognition, pages I:199\u2013206, 2001.\n\n[10] Ioannis A. Kakadiaris and Dimitri Metaxas. 3d human body acquisition from multiple views. In Proc. Fifth International Conference on Computer Vision, pages 618\u2013623, 1995.\n\n[11] Marina Meila. Learning Mixtures of Trees. PhD thesis, MIT, 1998.\n\n[12] Ivana Mikic, Mohan Trivedi, Edward Hunter, and Pamela Cosman. Articulated body posture estimation from multi-camera voxel data. In Computer Vision and Pattern Recognition, 2001.\n\n[13] Richard M. Murray, Zexiang Li, and S. Shankar Sastry. A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.\n\n[14] J. O\u2019Brien, R. E. Bodenheimer, G. Brostow, and J. K. Hodgins. Automatic joint parameter estimation from magnetic motion capture data. In Graphics Interface\u20192000, pages 53\u201360, 2000.\n\n[15] James M. Rehg and Daniel D. Morris. Singularities in articulated object tracking with 2-d and 3-d models. Technical report, Digital Equipment Corporation, 1997.\n\n[16] Hedvig Sidenbladh, Michael J. Black, and David J. Fleet. Stochastic tracking of 3d human figures using 2d image motion. In Proc. European Conference on Computer Vision, 2000.\n\n[17] Yang Song, Luis Goncalves, Enrico Di Bernardo, and Pietro Perona. Monocular perception of biological motion - detection and labeling. In Proc. International Conference on Computer Vision, pages 805\u2013812, 1999.\n\n[18] Ying Wu, Zhengyou Zhang, Thomas S. Huang, and John Y. Lin. Multibody grouping via orthogonal subspace decomposition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2001.\n", "award": [], "sourceid": 2182, "authors": [{"given_name": "Leonid", "family_name": "Taycher", "institution": null}, {"given_name": "John", "family_name": "Iii", "institution": null}, {"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}