{"title": "Learning Non-Rigid 3D Shape from 2D Motion", "book": "Advances in Neural Information Processing Systems", "page_first": 1555, "page_last": 1562, "abstract": "", "full_text": "Learning Non-Rigid 3D Shape from 2D Motion\n\nLorenzo Torresani\nStanford University\n\nAaron Hertzmann\nUniversity of Toronto\n\nltorresa@cs.stanford.edu\n\nhertzman@dgp.toronto.edu\n\nChristoph Bregler\nNew York University\n\nchris.bregler@nyu.edu\n\nAbstract\n\nThis paper presents an algorithm for learning the time-varying shape of a\nnon-rigid 3D object from uncalibrated 2D tracking data. We model shape\nmotion as a rigid component (rotation and translation) combined with a\nnon-rigid deformation. Reconstruction is ill-posed if arbitrary deforma-\ntions are allowed. We constrain the problem by assuming that the object\nshape at each time instant is drawn from a Gaussian distribution. Based\non this assumption, the algorithm simultaneously estimates 3D shape and\nmotion for each time frame, learns the parameters of the Gaussian, and\nrobustly \ufb01lls-in missing data points. We then extend the algorithm to\nmodel temporal smoothness in object shape, thus allowing it to handle\nsevere cases of missing data.\n\n1\n\nIntroduction\n\nWe can generally think of a non-rigid object\u2019s motion as consisting of a rigid component\nplus a non-rigid deformation. For example, a person\u2019s head can move rigidly (e.g. turning\nleft or right) while deforming (due to changing facial expressions). If we view this non-rigid\nmotion from a single camera view, the shape and motion are ambiguous: for any hypotheti-\ncal rigid motion, a corresponding 3D shape can be devised that \ufb01ts the image observations.\nEven if camera calibration and rigid motion are known, a depth ambiguity remains. 
Despite this apparent ambiguity, humans interpret the shape and motion of non-rigid objects with relative ease; clearly, humans use additional assumptions about the nature of the deformations.

This paper addresses the question: how can we resolve the ambiguity with assumptions as weak as possible? We argue that, by assuming that the 3D shape is drawn from some non-uniform PDF, we can reconstruct 3D non-rigid shape from 2D motion unambiguously. Moreover, we show that this can be done without assuming that the parameters of the PDF are known in advance. The use of a proper PDF makes the technique robust to noise and overfitting. We demonstrate this approach by modeling the PDF as a Gaussian distribution (more specifically, as a factor analyzer), and describe a novel EM algorithm for simultaneously learning the 3D shapes, the rigid motion, and the parameters of the Gaussian. We also generalize this approach by modeling the shape as a Linear Dynamical System (LDS).

Our algorithm can be thought of as a structure-from-motion (SFM) algorithm with a learning component: we assume that a set of labeled point tracks has been extracted from a raw video sequence, and the goal is to estimate 3D shape, camera motion, and a deformation PDF. Our algorithm is well-suited to reconstruction in the case of missing data, such as that due to occlusions and other tracking outliers. However, we show significant improvements over previous algorithms even when all tracks are visible.

Our work may also be seen as unifying Active Shape Models [1, 2, 5] with SFM, where both are estimated jointly from an image sequence. Our methods are closely related to factor analysis, probabilistic PCA, and linear dynamical systems. Our missing-data technique can be viewed as generalizing previous algorithms for SFM with missing data (e.g. [8, 9]) to the nonrigid case. 
In work concurrent to our own, Gruber and Weiss [7] also apply EM to SFM; their work focuses on the rigid case with known noise, and applies temporal smoothing to rigid motion parameters rather than shape.

2 Deformation, Shape, and Ambiguities

We now formalize the problem of interpreting non-rigid shape and motion. We assume that a scene consists of J scene points s_{j,t}, where j is an index over scene points, and t is an index over image frames. The 2D projections p_{j,t} of these points are imaged under orthographic projection:

p_{j,t} = R_t (s_{j,t} + d_t) + n    (1)

where p_{j,t} is the 2D projection of scene point j at time t, d_t is a 3 × 1 translation vector, R_t is a 2 × 3 matrix that combines rotation with orthographic projection [12], and n is zero-mean Gaussian noise with variance σ². Collecting the projected points into a 2 × J matrix P_t = [p_{1,t}, ..., p_{J,t}] and the 3D shape into a 3 × J matrix S_t = [s_{1,t}, ..., s_{J,t}] gives the equivalent form

P_t = R_t (S_t + D_t) + N    (2)

where D_t = d_t 1^T contains J copies of the translation vector d_t. Note that rigid motion of the object and rigid motion of the camera are interchangeable. Our goal is to estimate the time-varying shape S_t and motion (R_t, D_t) from the observed projections P_t. Without any constraints on the 3D shape s_{j,t}, this problem is extremely ambiguous [11]. 
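For concreteness, the projection model of Equation 2 can be sketched in a few lines of NumPy. This is a minimal illustration of the generative model, not the authors' implementation, and it shows the noiseless case:

```python
import numpy as np

def project(R_t, S_t, d_t):
    """Noiseless orthographic projection (Equation 2): P_t = R_t (S_t + D_t).

    R_t : 2x3 matrix combining rotation with orthographic projection.
    S_t : 3xJ matrix of 3D scene points at time t.
    d_t : 3-vector translation; D_t stacks J copies of it.
    """
    D_t = np.tile(d_t[:, None], (1, S_t.shape[1]))  # D_t = d_t 1^T
    return R_t @ (S_t + D_t)                        # 2xJ image points
```

With R_t equal to the first two rows of the identity, projection simply drops the depth coordinate, which makes the depth ambiguity of the text explicit.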
For example, given a shape S_t and motion (R_t, D_t) and an arbitrary orthonormal matrix A_t, we can produce a new shape A_t S_t and motion (R_t A_t^{-1}, A_t D_t) that together give identical 2D projections as the original model, even if a different matrix A_t is applied in every frame.

A common way to model non-rigid deformations is to assume that the shape is produced by adding deformations to a shape average S̄:

S_t = S̄ + Σ_{k=1}^{K} V_k z_{k,t}    (3)

where z_{k,t} are scalar per-frame weights that indicate the contributions of the deformations to each shape; these weights are combined in a vector z_t = [z_{1,t}, ..., z_{K,t}]^T. Together, S̄ and V_k are referred to as the shape basis. Equivalently, the space of possible shapes may be described by linear combinations of basis shapes, by selecting K + 1 linearly independent points in the space. This model was first applied to non-rigid SFM by Bregler et al. [4]. However, this model contains ambiguities, since, for some 3D shape and motion, there will still be ways to combine different weights and a different rigid motion to produce the same 3D shape. Since we are performing a 2D projection, an additional depth ambiguity occurs. For example, whenever there exist weights w_k such that R_t Σ_k V_k w_k = 0 and Σ_k V_k w_k ≠ 0, these weights define a linear space of distinct 3D shapes (with weights z_{t,k} + α w_k) that give identical 2D projections. (When the number of basis shapes is small, these ambiguities are rarer and may not make a dramatic impact.) Furthermore, a least-squares fit may overfit noise, especially with many basis shapes. 
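The linear model of Equation 3 can be sketched as follows. This is an illustration with array conventions of our choosing (not the authors' code): the K basis deformations V_k are stacked into a single (K, 3, J) array.

```python
import numpy as np

def shape_from_basis(S_mean, V, z_t):
    """Equation 3: S_t = S_mean + sum_k V_k z_{k,t}.

    S_mean : 3xJ mean shape.
    V      : (K, 3, J) array of basis deformations V_k.
    z_t    : length-K vector of per-frame weights.
    """
    return S_mean + np.einsum('k,kij->ij', z_t, V)  # weighted sum of bases
```

Varying z_t over time deforms the shape while the basis stays fixed, which is exactly the source of the weight/rotation ambiguities discussed above.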
As the number of basis shapes grows, the problem is more likely to become unconstrained, eventually approaching the totally unconstrained case described above.

The ambiguity and overfitting may be resolved by introducing regularization terms that penalize large deformations, and then solving for 3D shape in a least-squares sense. Soatto and Yezzi [11] use a regularization term equivalent to Σ_t ||S_t − S̄||². However, this regularization may be too restrictive in many cases and too loose in others. For example, when tracking a face, deformations of the jaw are much more likely than deformations of the nose. Moreover, the weight for this regularization term must be specified by hand¹. Alternatively, Brand [3] proposes placing a user-specified Gaussian prior on the deformation basis and a prior on the deformations based on an initial estimate.

In order to motivate our approach, we can restate the above techniques as follows. Suppose we assume that shapes S_t are drawn from a probability distribution p(S_t | θ) with known parameters θ. The non-rigid shape and motion are estimated by maximizing

p(S, R, D | P, θ, σ²) ∝ p(P | S, R, D, θ, σ²) p(S, R, D | θ, σ²)    (4)
                      ∝ Π_t p(P_t | S_t, R_t, D_t, σ²) p(S_t | θ)    (5)

assuming uniform priors on R_t and D_t. The projection likelihood p(P_t | S_t, R_t, D_t, σ²) is a spherical Gaussian (Equation 2). The negative log-posterior −ln p(S, R, D, θ | P) corresponds to a standard least-squares formulation for SFM, plus a regularization term −ln p(S_t | θ). If we set p(S_t | θ) to be a uniform distribution, then we get the highly underconstrained case described above. If we set p(S_t | θ) to be a spherical Gaussian with a specified variance (e.g. 
p(S_t | θ) = N(S̄, σ² I)) then we obtain the simple regularization used previously — the problem is constrained, but by a weak regularization term with a user-specified weight (variance).

Our approach. Our approach is to simultaneously estimate the rigid motion and learn the shape PDF. In other words, we estimate R, D, θ, and σ² to maximize

p(R, D, θ, σ² | P) = ∫ p(R, D, θ, S, σ² | P) dS    (6)
                   ∝ ∫ p(P | R, D, S, σ²) p(S | θ) dS    (7)

The key idea is that we can estimate shape and motion while learning the parameters of the PDF p(S | θ) over shapes. (Our method marginalizes over the unknown shapes S_t, rather than solving for estimates of shape.) In effect, the regularization terms (i.e. the PDF) are learned simultaneously with the rest of SFM. This means that the regularization terms need not be set manually, and can thus be much more sophisticated and have many more parameters than previous methods. In practice, we find that this leads to significantly improved reconstructions over user-specified shape PDFs. We demonstrate the approach by modeling the shape PDF as a general Gaussian. We reduce the dimensionality of the Gaussian by representing it as a factor analyzer. In this case, the factors V_k may be interpreted as basis deformations. 
We later generalize this approach to model shape as an LDS, leading to temporal correlations in the shape PDF.

It might seem that, since the parameters of the PDF are not known a priori, the algorithm could estimate wildly varying shapes, and then learn a correspondingly spread-out PDF. However, such a spread-out PDF would assign very low likelihood to the solution and thus be suboptimal; this is a typical case of Bayesian learning naturally balancing the desire to fit the data with the desire for a "simple" model. One way to see this is to consider the terms of −ln p(R, D, θ | P) in the case of the Gaussian prior PDF: in addition to the data-fitting term and the regularization term, there is a "normalization constant" term of T ln|φ|, where T is the number of frames and φ is the covariance of the shape PDF. This term directly penalizes spread-out Gaussians. Hence, the optimal solution trades off between (a) fitting the projection data, (b) fitting the shapes S_t to the shape PDF (regularizing), and (c) minimizing the variance of the shape PDF as much as possible. The algorithm simultaneously regularizes and learns the regularization.

¹In their work, Soatto and Yezzi address a slightly simpler problem where the 3D data is observed without noise or projection, and thus there are no weights to specify in this case.

3 Learning a Gaussian shape distribution

We now describe our algorithm in detail. We model p(S_t | θ) as a factor analyzer [6]. 
In this setting, the factors of the Gaussian can be interpreted as basis deformations — shape is modeled by Equation 3 — but the weights z_t are now hidden variables, with a zero-mean, unit-variance Gaussian prior for each:

z_t ~ N(0, I)    (8)

The shape and projection model is then completely specified by Equations 2, 3, and 8. The problem of non-rigid SFM is now to solve for the maximum likelihood estimates of R_t, D_t, S̄, V, and σ², i.e. maximize p(R_t, D_t, S̄, V, σ² | P_t) ∝ Π_t p(P_t | R_t, D_t, S̄, V, σ²) = Π_t ∫ p(P_t | z_t, R_t, D_t, S̄, V, σ²) p(z_t) dz_t.

3.1 Vectorized form.

For later computations, it is useful to rewrite the model in a vectorized form. First, define f_t to be the vector of point tracks f_t = vec(P_t) = [x_{1,t}, y_{1,t}, ..., x_{J,t}, y_{J,t}]^T. Note that f_t is the same variable as P_t, but written as a vector rather than a matrix². Expanding f_t we have

f_t = vec(P_t) = vec(R_t S_t + R_t D_t + N_t)    (9)
    = Σ_{k=1}^{K} vec(R_t V_k) z_{k,t} + vec(R_t S̄) + vec(R_t D_t) + vec(N_t)    (10)
    = M_t z_t + f̄_t + T_t + vec(N_t)    (11)

where M_t = [vec(R_t V_1), ..., vec(R_t V_K)], z_t = [z_{1,t}, ..., z_{K,t}]^T, f̄_t = vec(R_t S̄), and T_t = vec(R_t D_t) = [(R_t d_t)^T, ..., (R_t d_t)^T]^T = [t_t^T, ..., t_t^T]^T. Note that the marginal distribution over shape — as well as its projection — is Gaussian:

p(f_t | ψ) = ∫ p(f_t | z_t, ψ) p(z_t | ψ) dz_t    (12)
           = N(f_t | T_t + f̄_t, M_t M_t^T + σ² I)    (13)

where ψ encapsulates the model parameters S̄, V_k, R_t, D_t, and σ².

Let H̃ = [vec(S̄), vec(V_1), ..., vec(V_K)] and z̃_t = [1, z_t^T]^T. We can also rewrite the shape equation as vec(R_t S_t) = (I ⊗ R_t) vec(S_t) = (I ⊗ R_t) H̃ z̃_t, by using the identity vec(ABC) = (C^T ⊗ A) vec(B). The symbol ⊗ denotes Kronecker product.

²The vec operator stacks the columns of a matrix into a vector, e.g. vec([a_0 a_2; a_1 a_3]) = [a_0, a_1, a_2, a_3]^T. The operator is linear: vec(A + B) = vec(A) + vec(B) and vec(αA) = α vec(A) for any matrices A and B and scalar α.

3.2 Generalized EM algorithm.

Given a set of point tracks P (equivalently, f), we can estimate the motion and deformation model using EM; the algorithm is similar to EM for factor analysis [6].

The E-step. We estimate the distribution over z_t given the current motion and shape estimates, for each frame t. Defining q(z_t) to be the distribution to be estimated in frame t, it can be computed as

q(z_t) = p(z_t | f_t, ψ)    (14)
       = N(z_t | β(f_t − f̄_t − T_t), I − βM_t)    (15)
β = M_t^T (M_t M_t^T + σ² I)^{-1}    (16)

The matrix inversion lemma may be used to accelerate the computation of β. We define the expectations μ_t ≡ E_q[z_t] and φ_t ≡ E_q[z_t z_t^T] and compute them as:

μ_t = β(f_t − f̄_t − T_t)    (17)
φ_t = I − βM_t + μ_t μ_t^T    (18)

We also define μ̃_t = E[z̃_t] = [1, μ_t^T]^T and φ̃_t = E[z̃_t z̃_t^T] = [1, μ_t^T; μ_t, φ_t].

The M-step. We estimate the motion parameters by minimizing

Q(P, ψ) = E_{q(z_1), ..., q(z_T)}[−log p(P | ψ)]    (19)
        = Σ_t E_{q(z_t)}[||f_t − vec(R_t S_t) − T_t||² / (2σ²)] + 2JT log(√(2π) σ)    (20)

This function is quadratic in the shape parameters (S̄, V_k), in the rigid motion parameters (R_t, T_t), and in the Gaussian noise variance parameter σ². To update each of these parameters, we compute the corresponding partial derivative of the expected log likelihood, set it to zero, and solve. 
The parameter update rules are:

• Shape basis:

vec(H̃) ← ( Σ_t ( φ̃_t ⊗ (I ⊗ R_t^T R_t) ) )^{-1} vec( Σ_t (I ⊗ R_t)^T (f_t − T_t) μ̃_t^T )    (21)

• Noise variance:

σ² ← (1 / (2JT)) Σ_t ( ||f_t − f̄_t − T_t||² − 2 (f_t − f̄_t − T_t)^T M_t μ_t + tr(M_t^T M_t φ_t) )    (22)

• Translation:

T_t ← (1 ⊗ I) (1/J) Σ_j ( f_{t,j} − R_t ( S̄_j + Σ_k V_{k,j} μ_{t,k} ) )    (23)

• Rotation:

R_t ← argmin_{R_t} || R_t Σ_j ( H̃_j φ̃_t H̃_j^T ) − Σ_j ( (f_{t,j} − t_t) μ̃_t^T H̃_j^T ) ||    (24)

where H̃ = [H̃_1^T, ..., H̃_J^T]^T and f_t = [f_{t,1}, ..., f_{t,J}].

Since the system of equations in Equation 21 is large and sparse, we solve it using conjugate gradient. In Equation 24, we enforce orthonormality of rotations by parameterizing R_t with exponential coordinates. We linearize the equation with respect to the exponential coordinates, and solve the resulting quadratic.

If any of the point tracks are missing, they are also filled in during the M-step. Let f_t* denote the elements of a frame of tracking data that are not observed; they are estimated as

f_t* ← f̄_t* + M_t* μ_t + T_t*    (25)

where (*) indicates rows that correspond to the missing data.

In our M-step, we apply each of these updates once, although they could also be alternated. Once EM has converged, the maximum likelihood shapes may be computed as S_t = S̄ + Σ_k V_k μ_{t,k}.

4 Learning dynamics

Many real deformations contain some temporal smoothness. We model the temporal behavior of deformations using a Linear Dynamical System (LDS). 
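For intuition, the LDS prior on the deformation weights can be simulated in a few lines. This is a sketch of the generative model only, with illustrative values of Φ and Q chosen by us; it is not the authors' code, and in the paper Φ and Q are learned rather than fixed:

```python
import numpy as np

def simulate_lds_weights(Phi, Q, T, rng=None):
    """Simulate the LDS prior on deformation weights:
    z_0 ~ N(0, I),  z_t = Phi z_{t-1} + n,  n ~ N(0, Q)."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = Phi.shape[0]
    L = np.linalg.cholesky(Q)          # sample n = L u with u ~ N(0, I)
    z = [rng.standard_normal(K)]       # z_0 ~ N(0, I)
    for _ in range(1, T):
        z.append(Phi @ z[-1] + L @ rng.standard_normal(K))
    return np.stack(z)                 # (T, K) trajectory of weights
```

With Φ close to the identity and Q small, successive weight vectors (and hence the shapes they generate through Equation 3) change slowly, which is the smoothness assumption exploited below.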
In this model, Equation 8 is replaced with

z_0 ~ N(0, I)    (26)
z_t = Φ z_{t−1} + n,  n ~ N(0, Q)    (27)

where Φ is an arbitrary unknown K × K matrix, and Q is a K × K covariance matrix. For certain estimates of Φ, this model corresponds to an assumption of continuously or slowly changing shape. Since our model is a special form of Shumway and Stoffer's algorithm for LDS learning with EM [10], it is straightforward to adapt it to our needs. In the E-step, we apply Shumway and Stoffer's E-step to estimate μ_t, φ_t, and E[z_t z_{t−1}^T], based on P_t, S̄, M_t, Φ, Q, and σ². In the M-step, we apply the same shape and motion updates as in the previous section; additionally, we update Φ and Q in the same way as in Shumway and Stoffer's algorithm. In other words, this reconstruction algorithm learns 3D shape with temporal smoothing, while learning the temporal smoothness term.

5 Experiments

We compared our algorithm with the iterative SFM algorithm presented by Torresani et al. [13], which we will refer to as ILSQ (iterative least-squares) in the following discussion³. ILSQ optimizes Equations 2 and 3 by alternating optimization of each of the unknowns (rotation, basis shapes, and coefficients). We also improved the algorithm by updating the translations as well. When some data is missing, ILSQ optimizes with respect to the available data. For both algorithms, the rigid motion is initialized by Tomasi-Kanade [12], with random initialization of the shape basis and coefficients. For the algorithm presented in Section 3, we adopted an annealing scheme that forces σ² to remain large in the initial steps of the optimization. We refer to our new algorithms as EM-Gaussian and EM-LDS.

We tested the algorithms on a synthetic animation of a deforming shark in Figure 1. 
The motion consists of rigid rotation plus deformations generated by K = 2 basis shapes. The average reconstruction errors in Z for ILSQ and EM-Gaussian are respectively 7.10% and 2.50% on this sequence after 100 parameter updates.⁴ By enforcing temporal smoothness

³In our experience, ILSQ always performs better than the algorithm of Bregler et al. [4].
⁴All errors are computed in percentage points: the average distance of the reconstructed point to the correct point divided by the size of the shape.

[Figure 1 appears here: plots of the 2D tracks and of the ILSQ, EM-Gaussian, and EM-LDS 3D reconstructions at t = 20, 50, 80, 115, 148, 175, and 200.]

Figure 1: Reconstructions of the shark sequence using the three algorithms. Each algorithm was given 2D tracks as inputs; reconstructions are shown here from a different viewpoint than the inputs to the algorithm. Ground-truth features are shown as blue dots; reconstructions are red circles. Note that, although ILSQ gets approximately the correct shape in most cases, it misses details, whereas EM gives very accurate results most of the time. 
Some of the deformation errors of EM-Gaussian\n(e.g. for t=148) are corrected by EM-LDS through temporal smoothing.\n\nEM-LDS was able to correct some of the deformation errors of EM-Gaussian. The average\nZ error for EM-LDS on the shark sequence after 100 EM iterations is 1.24%. Videos of\nthe shark reconstructions and the Matlab software used for these experiments are available\nfrom http://movement.stanford.edu/learning-nr-shape/ .\n\nIn highly-constrained cases \u2014 low-rank motion, no image noise, and no missing data \u2014\nILSQ achieved reasonably good results. However, EM-Gaussian gave better results in\nnearly every case, and dramatically better results in underconstrained cases. Figure 2(a)\nand (b) show experimental results on another set of arti\ufb01cial data consisting of random\nbasis shapes. Figure 2(a) shows the results of reconstruction with missing data; the ILSQ\nresults degrade much faster as the percentage of missing data increases. Figure 2(b) shows\nthe effect of changing the complexity of the model, while leaving the complexity of the\ndata \ufb01xed. ILSQ yields poor results when the model complexity does not closely match the\ndata complexity, but EM-Gaussian yields reasonable results regardless.\n\n6 Discussion and future work\n\nWe have described an approach to non-rigid structure-from-motion with a probabilistic\ndeformation model, and demonstrated its usefulness in the case of a Gaussian deformation\nmodel. We expect that more sophisticated distributions can be used to model more complex\nnon-rigid shapes in video. More general graphical models with other correlations (such\nas from audio data) could be built from this method. 
Our method is also applicable to\n\n\fEM\u2212Gaussian\nILSQ\n\n3\n\n2.5\n\n2\n\n1.5\n\n1\n\n0.5\n\nr\no\nr\nr\ne\n\n \nz\n \n%\n\n0\n0\n\n10\n\n20\n\n30\n\n% missing data\n\n(a)\n\nr\no\nr\nr\ne\n\n \nz\n \n%\n\n10\n\n9\n\n8\n\n7\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\n0\n1\n\n40\n\n50\n\nEM\u2212Gaussian\nILSQ\n\n2\n\n3\n\nK\n\n4\n\n5\n\n6\n\n(b)\n\nFigure 2: Error comparison between ILSQ and EM-Gaussian on random basis shapes. (a) Increasing\nmissing data. As the percentage of missing feature tracks per frame increases, ILSQ degenerates\nmuch more rapidly than EM-Gaussian. (b) ILSQ gives poor results when the model complexity does\nnot match the actual data complexity, whereas EM-Gaussian is relatively robust to this.\n\nseparating rigid from non-rigid motion in fully-observed data, as in Soatto and Yezzi\u2019s\nwork [11]. Our models could easily be generalized to perspective projection, although the\noptimization may be more dif\ufb01cult.\n\nAcknowledgements. Thanks to Hrishikesh Deshpande for assisting with an early version of this\nproject, and to Stefano Soatto for discussing deformation ambiguities. Portions of this work were\nperformed while LT was visiting New York University, AH was at University of Washington, and\nwhile CB was at Stanford University. LT and CB were supported by ONR grant N00014-01-1-0890\nunder the MURI program. AH was supported in part by UW Animation Research Labs, NSF grant\nIIS-0113007, the Connaught Fund, and an NSERC Discovery Grant.\n\nReferences\n[1] A. Blake and M. Isard. Active Contours. Springer-Verlag, 1998.\n[2] V. Blanz and T. Vetter. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of\n\nSIGGRAPH 99, Computer Graphics Proceedings, pages 187\u2013194, Aug. 1999.\n\n[3] M. Brand. Morphable 3D models from video. In Proc. CVPR 2001, 2001.\n[4] C. Bregler, A. Hertzmann, and H. Biermann. Recovering Non-Rigid 3D Shape from Image\n\nStreams. In Proc. CVPR 2000, 2000.\n\n[5] T. F. Cootes and C. J. Taylor. 
Statistical models of appearance for medical image analysis and\n\ncomputer vision. In Proc. SPIE Medical Imaging, 2001.\n\n[6] Z. Ghahramani and G. E. Hinton. The EM Algorithm for Mixtures of Factor Analyzers. Tech-\n\nnical Report CRG-TR-96-1, University of Toronto, 1996.\n\n[7] A. Gruber and Y. Weiss. Factorization with Uncertainty and Missing Data: Exploiting Temporal\n\nCoherence. In Proc. NIPS 2003, 2003. In these proceedings.\n\n[8] D. W. Jacobs. Linear Fitting with Missing Data for Structure-From-Motion. Computer Vision\n\nand Image Understanding, 82:57\u201382, 2001.\n\n[9] H. Shum, K. Ikeuchi, and R. Reddy. Principal Component Analysis with Missing Data and Its\n\nApplications to Polyhedral Object Modeling. IEEE Trans. PAMI, 17(9):854\u2013867, 1995.\n\n[10] R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using\n\nthe em algorithm. J. Time Series Analysis, 3(4):253\u2013264, 1982.\n\n[11] S. Soatto and A. J. Yezzi. Deformotion: Deforming Motion, Shape Averages, and the Joint\n\nRegistration and Segmentation of Images. In Proc. ECCV 2002, May 2002.\n\n[12] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factor-\n\nization method. Int. J. of Computer Vision, 9(2):137\u2013154, 1992.\n\n[13] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and Modeling Non-Rigid Objects\n\nwith Rank Constraints. In Proc. CVPR, 2001.\n\n\f", "award": [], "sourceid": 2509, "authors": [{"given_name": "Lorenzo", "family_name": "Torresani", "institution": null}, {"given_name": "Aaron", "family_name": "Hertzmann", "institution": null}, {"given_name": "Christoph", "family_name": "Bregler", "institution": null}]}