{"title": "From Deformations to Parts: Motion-based Segmentation of 3D Objects", "book": "Advances in Neural Information Processing Systems", "page_first": 1997, "page_last": 2005, "abstract": "We develop a method for discovering the parts of an articulated object from aligned meshes capturing various three-dimensional (3D) poses. We adapt the distance dependent Chinese restaurant process (ddCRP) to allow nonparametric discovery of a potentially unbounded number of parts, while simultaneously guaranteeing a spatially connected segmentation. To allow analysis of datasets in which object instances have varying shapes, we model part variability across poses via affine transformations. By placing a matrix normal-inverse-Wishart prior on these affine transformations, we develop a ddCRP Gibbs sampler which tractably marginalizes over transformation uncertainty. Analyzing a dataset of humans captured in dozens of poses, we infer parts which provide quantitatively better motion predictions than conventional clustering methods.", "full_text": "From Deformations to Parts:\n\nMotion-based Segmentation of 3D Objects\n\nSoumya Ghosh1, Erik B. Sudderth1, Matthew Loper2, and Michael J. Black2\n\n1Department of Computer Science, Brown University, {sghosh,sudderth}@cs.brown.edu\n\n2Perceiving Systems Department, Max Planck Institute for Intelligent Systems,\n\n{mloper,black}@tuebingen.mpg.de\n\nAbstract\n\nWe develop a method for discovering the parts of an articulated object from\naligned meshes of the object in various three-dimensional poses. We adapt the dis-\ntance dependent Chinese restaurant process (ddCRP) to allow nonparametric dis-\ncovery of a potentially unbounded number of parts, while simultaneously guaran-\nteeing a spatially connected segmentation. To allow analysis of datasets in which\nobject instances have varying 3D shapes, we model part variability across poses\nvia af\ufb01ne transformations. 
By placing a matrix normal-inverse-Wishart prior on\nthese af\ufb01ne transformations, we develop a ddCRP Gibbs sampler which tractably\nmarginalizes over transformation uncertainty. Analyzing a dataset of humans cap-\ntured in dozens of poses, we infer parts which provide quantitatively better defor-\nmation predictions than conventional clustering methods.\n\n1 Introduction\n\nMesh segmentation methods decompose a three-dimensional (3D) mesh, or a collection of aligned\nmeshes, into their constituent parts. This well-studied problem has numerous applications in com-\nputational graphics and vision, including texture mapping, skeleton extraction, morphing, and mesh\nregistration and simpli\ufb01cation. We focus in particular on the problem of segmenting an articulated\nobject, given aligned 3D meshes capturing various object poses. The meshes we consider are com-\nplete surfaces described by a set of triangular faces, and we seek a segmentation into spatially co-\nherent parts whose spatial transformations capture object articulations. Applied to various poses of\nhuman bodies as in Figure 1, our approach identi\ufb01es regions of the mesh that deform together, and\nthus provides information which could inform applications such as the design of protective clothing.\n\nMesh segmentation has been most widely studied as a static clustering problem, where a single\nmesh is segmented into \u201csemantic\u201d parts using low-level geometric cues such as distance and cur-\nvature [1, 2]. While supervised training data can sometimes lead to improved results [3], there are\nmany applications where such data is unavailable, and the proper way to partition a single mesh is\ninherently ambiguous. By searching for parts which deform consistently across many meshes, we\ncreate a better-posed problem whose solution is directly useful for modeling objects in motion.\n\nSeveral issues must be addressed to effectively segment collections of articulated meshes. 
First, the\nnumber of parts comprising an articulated object is unknown a priori, and must be inferred from\nthe observed deformations. Second, mesh faces exhibit strong spatial correlations, and the inferred\nparts must be contiguous. This spatial connectivity is needed to discover parts which correspond\nwith physical object structure, and required by target applications such as skeleton extraction. Fi-\nnally, our primary goal is to understand the structure of human bodies, and humans vary widely in\nsize and shape. People move and deform in different ways depending on age, \ufb01tness, body fat, etc.\nA segmentation of the human body should take into account this range of variability in the population.\nTo our knowledge, no previous methods for segmenting meshes combine information about\ndeformation from multiple bodies to address this corpus segmentation problem.\n\nFigure 1: Human body segmentation. Left: Reference poses for two female bodies, and those bodies captured\nin \ufb01ve other poses. Right: A manual segmentation used to align these meshes [6], and the segmentation inferred\nby our ddCRP model from 56 poses. The ddCRP segmentation discovers parts whose motion is nearly rigid,\nand includes small parts such as elbows and knees absent from the manual segmentation.\n\nIn this paper, we develop a statistical model which addresses all of these issues. We adapt the\ndistance dependent Chinese restaurant process (ddCRP) [4] to model spatial dependencies among\nmesh triangles, and enforce spatial contiguity of the inferred parts [5]. Unlike most previous mesh\nsegmentation methods, our Bayesian nonparametric approach allows data-driven inference of an ap-\npropriate number of parts, and uses an af\ufb01ne transformation-based likelihood to accommodate object\ninstances of varying shape. 
After developing our model in Section 2, Section 3 develops a Gibbs\nsampler which ef\ufb01ciently marginalizes the latent af\ufb01ne transformations de\ufb01ning part deformation.\nWe conclude in Section 4 with results examining meshes of humans and other articulated objects,\nwhere we introduce a metric for quantitative evaluation of deformation-based segmentations.\n\n2 A Part-Based Model for Mesh Deformation\n\nConsider a collection of J meshes, each with N triangles. For some input mesh j, we let yjn \u2208 R3\ndenote the 3D location of the center of triangular face n, and Yj = [yj1, . . . , yjN ] \u2208 R3\u00d7N the\nfull mesh con\ufb01guration. Each mesh j has an associated N -triangle reference mesh, indexed by bj .\nWe let xbn \u2208 R4 denote the location of triangle n in reference mesh b, expressed in homogeneous\ncoordinates (xbn(4) = 1). A full reference mesh Xb = [xb1, . . . , xbN ]. In our later experiments, Yj\nencodes the 3D mesh for a person in pose j, and Xbj is the reference pose for the same individual.\n\nWe estimate aligned correspondences between the triangular faces of the input pose meshes Yj , and\nthe reference meshes Xb, using a recently developed method [6]. This approach robustly handles\n3D data capturing varying shapes and poses, and outputs meshes which have equal numbers of faces\nin one-to-one alignment. Our segmentation model does not depend on the details of this alignment\nmethod, and could be applied to data produced by other correspondence algorithms.\n\n2.1 Nonparametric Spatial Priors for Mesh Partitions\n\nThe recently proposed distance dependent Chinese restaurant process (ddCRP) [4], a generalization\nof the CRP underlying Dirichlet process mixture models [7], has a number of attractive properties\nwhich make it particularly well suited for modeling segmentations of articulated objects. 
By placing\nprior probability mass on partitions with arbitrary numbers of parts, it allows data-driven inference\nof the true number of mostly-rigid parts underlying the observed data. In addition, by choosing an\nappropriate distance function we can encourage spatially adjacent triangles to lie in the same part,\nand guarantee that all inferred parts are spatially contiguous [5].\n\nThe Chinese restaurant process (CRP) is a distribution on all possible partitions of a set of objects (in\nour case, mesh triangles). The generative process can be described via a restaurant with an in\ufb01nite\nnumber of tables (in our case, parts). Customers (triangles) i enter the restaurant in sequence and\nselect a table z_i to join. They pick an occupied table with probability proportional to the number of\ncustomers already sitting there, or a new table with probability proportional to a scaling parameter \u03b1.\nThe \ufb01nal seating arrangement gives a partition of the data, where each occupied table corresponds\nto a part in the \ufb01nal segmentation.\n\nFigure 2: Left: A reference mesh in which links (yellow arrows) currently de\ufb01ne three parts (connected\ncomponents). Right: Each part undergoes a distinct af\ufb01ne transformation, generated as in Equation (2).\n\nAlthough described sequentially, the CRP induces an exchangeable distribution on partitions, for\nwhich the segmentation probability is invariant to the order in which triangle allocations are sampled.\nThis is inappropriate for mesh data, in which nearby triangles are far more likely to lie in the same\npart. 
The ddCRP alters the CRP by modeling customer links not to tables, but to other customers.\nThe link c_m for customer m is sampled according to the distribution\n\np(c_m = n \\mid D, f, \\alpha) \\propto \\begin{cases} f(d_{mn}) & m \\neq n, \\\\ \\alpha & m = n. \\end{cases} \\qquad (1)\n\nHere, d_{mn} is an externally speci\ufb01ed distance between data points m and n, and \u03b1 determines the\nprobability that a customer links to themselves rather than another customer. The monotonically\ndecreasing decay function f(d) mediates how the distance between two data points affects their\nprobability of connecting to each other. The overall link structure speci\ufb01es a partition: two cus-\ntomers are clustered together if and only if one can reach the other by traversing the link edges.\n\nWe de\ufb01ne the distance between two triangles as the minimal number of hops, between adjacent\nfaces, required to reach one triangle from the other. A \u201cwindow\u201d decay function of width 1, f(d) =\n1[d \u2264 1], then restricts triangles to link only to immediately adjacent faces. Note that this doesn\u2019t\nlimit the size of parts, since all pairs of faces are potentially reachable via a sequence of adjacent\nlinks. However, it does guarantee that only spatially contiguous parts have non-zero probability\nunder the prior. This constraint is preserved by our MCMC inference algorithm.\n\n2.2 Modeling Part Deformation via Af\ufb01ne Transformations\n\nArticulated object deformation is naturally described via the spatial transformations of its constituent\nparts. We expect the triangular faces within a part to deform according to a coherent part-speci\ufb01c\ntransformation, up to independent face-speci\ufb01c noise. The near-rigid motions of interest are reason-\nably modeled as af\ufb01ne transformations, a family of co-linearity preserving linear transformations.\nWe concisely denote the transformation from a reference triangle to an observed triangle via a\nmatrix A \u2208 R^{3\u00d74}. 
The fourth column of A encodes translation of the corresponding reference triangle\nvia homogeneous coordinates x_{bn}, and the other entries encode rotation, scaling, and shearing.\n\nPrevious approaches have treated such transformations as parameters to be estimated during infer-\nence [8, 9]. Here, we instead de\ufb01ne a prior distribution over af\ufb01ne transformations. Our construction\nallows transformations to be analytically marginalized when learning our part-based segmentation,\nbut retains the \ufb02exibility to later estimate transformations if desired. Explicitly modeling transforma-\ntion uncertainty makes our MCMC inference more robust and rapidly mixing [7], and also allows\ndata-driven determination of an appropriate number of parts.\n\nThe matrix of numbers encoding an af\ufb01ne transformation is naturally modeled via multivariate Gaus-\nsian distributions. We place a conjugate, matrix normal-inverse-Wishart [10, 11] prior on the af\ufb01ne\ntransformation A and residual noise covariance matrix \u03a3:\n\n\\Sigma \\sim \\mathcal{IW}(n_0, S_0), \\qquad A \\mid \\Sigma \\sim \\mathcal{MN}(M, \\Sigma, K) \\qquad (2)\n\nHere, n_0 \u2208 R and S_0 \u2208 R^{3\u00d73} control the variance and mean of the Wishart prior on \u03a3^{\u22121}. The\nmean af\ufb01ne transformation is M \u2208 R^{3\u00d74}, and K \u2208 R^{4\u00d74} and \u03a3 determine the variance of the prior\non A. Applied to mesh data, these parameters have physical interpretations and can be estimated\nfrom the data collection process. While such priors are common in Bayesian regression models, our\napplication to the modeling of geometric af\ufb01ne transformations appears novel.\n\nAllocating a different af\ufb01ne transformation for the motion of each part in each pose (Figure 2), the\noverall generative model can be summarized as follows:\n\n1. For each triangle n, sample an associated link c_n \u223c ddCRP(\u03b1, f, D). The part assignments\nz are a deterministic function of the sampled links c = [c_1, . . . , c_N].\n\n2. 
For each pose j of each part k, sample an af\ufb01ne transformation A_{jk} and residual noise\ncovariance \u03a3_{jk} from the matrix normal-inverse-Wishart prior of Equation (2).\n\n3. Given these pose-speci\ufb01c af\ufb01ne transformations and assignments of mesh faces to parts, in-\ndependently sample the observed location of each pose triangle relative to its corresponding\nreference triangle, y_{jn} \u223c N(A_{j z_n} x_{b_j n}, \u03a3_{j z_n}).\n\nNote that \u03a3_{jk} governs the degree of non-rigid deformation of part k in pose j. It also indirectly\nin\ufb02uences the number of inferred parts: a large S_0 makes large \u03a3_{jk} more probable, which allows\nmore non-rigid deformation and permits models which utilize fewer parts. The overall model is\n\np(Y, c, A, \\Sigma \\mid X, b, D, \\alpha, f, \\eta) = p(c \\mid D, f, \\alpha) \\prod_{j=1}^{J} \\left[ \\prod_{k=1}^{K(c)} p(A_{jk}, \\Sigma_{jk} \\mid \\eta) \\right] \\left[ \\prod_{n=1}^{N} \\mathcal{N}(y_{jn} \\mid A_{j z_n} x_{b_j n}, \\Sigma_{j z_n}) \\right] \\qquad (3)\n\nwhere Y = {Y_1, . . . , Y_J}, X = {X_1, . . . , X_B}, b = [b_1, . . . , b_J], the ddCRP links c de\ufb01ne assign-\nments z to K(c) parts, and \u03b7 = {n_0, S_0, M, K} are likelihood hyperparameters. There is a single\nreference mesh X_b for each object instance b, and Y_j captures a single deformed pose of X_{b_j}.\n\n2.3 Previous Work\n\nPrevious work has also sought to segment a mesh into parts based on observed articulations [8, 12,\n13, 14]. The two-stage procedure of Rosman et al. [13] \ufb01rst minimizes a variational functional\nregularized to favor piecewise constant transformations, and then clusters the transformations into\nparts. Several other segmentation procedures [12, 14] lack coherent probabilistic models, and thus\nhave dif\ufb01culty quantifying uncertainty and determining appropriate segmentation resolutions.\n\nAnguelov et al. [8] de\ufb01ne a global probabilistic model, and use the EM algorithm to jointly estimate\nparts and their transformations. 
They explicitly model spatial dependencies among mesh faces, but\ntheir Markov random \ufb01eld cannot ensure that parts are spatially connected; a separate connected\ncomponents process is required. Heuristics are used to determine an appropriate number of parts.\n\nAmbitious recent work has considered a model for joint mesh alignment and segmentation [9]. How-\never, this approach suffers from many of the issues noted above: the number of parts must be speci-\n\ufb01ed a priori, parts may not be contiguous, and their EM inference appears prone to local optima.\n\n3 Inference\n\nWe seek the constituent parts of an articulated model, given observed data (X, Y, and b). These parts\nare characterized by the posterior distribution of the customer links c. We approximate this posterior\nusing a collapsed Gibbs sampler, which iteratively draws c_n from the conditional distribution\n\np(c_n \\mid c_{-n}, X, Y, b, D, f, \\alpha, \\eta) \\propto p(c_n \\mid D, f, \\alpha) \\, p(Y \\mid z(c), X, b, \\eta). \\qquad (4)\n\nHere, z(c) is the clustering into parts de\ufb01ned by the customer links c. The ddCRP prior is given by\nEquation (1), while the likelihood term in the above equation further factorizes as\n\np(Y \\mid z(c), X, b, \\eta) = \\prod_{k=1}^{K(c)} \\prod_{j=1}^{J} p(Y_{jk} \\mid X_{b_j k}, \\eta) \\qquad (5)\n\nwhere Y_{jk} \u2208 R^{3\u00d7N_k} is the set of triangular faces in part k of pose j, and X_{b_j k} are the corresponding\nreference faces. 
Exploiting the conjugacy of the normal likelihood to the prior over af\ufb01ne transfor-\nmations in Equation (2), we marginalize the part-speci\ufb01c latent variables A_{jk} and \u03a3_{jk} to compute\nthe marginal likelihood in closed form (see the supplement for a derivation):\n\np(Y_{jk} \\mid X_{b_j k}, \\eta) = \\frac{|K|^{3/2} \\, |S_0|^{n_0/2} \\, \\Gamma_3\\left(\\frac{N_k + n_0}{2}\\right)}{\\pi^{3 N_k / 2} \\, |S_{xx}|^{3/2} \\, |S_0 + S_{y|x}|^{(N_k + n_0)/2} \\, \\Gamma_3\\left(\\frac{n_0}{2}\\right)}, \\qquad (6)\n\nS_{xx} = X_{b_j k} X_{b_j k}^T + K, \\quad S_{yx} = Y_{jk} X_{b_j k}^T + M K, \\qquad (7)\n\nS_{y|x} = Y_{jk} Y_{jk}^T + M K M^T - S_{yx} (S_{xx})^{-1} S_{yx}^T. \\qquad (8)\n\nInstead of explicitly sampling from Equation (4), a more ef\ufb01cient sampler [4] can be derived by\nobserving that different realizations of the link c_n only make a small change to the partition structure.\nFirst, note that removing a link c_n generates a partition z(c_{\u2212n}) which is either identical to the old\npartition z(c) or contains one extra part, created by splitting some existing part. Sampling new\nrealizations of c_n will give rise to new partitions z(c_{\u2212n} \u222a c_n^{(new)}), which may either be identical to\nz(c_{\u2212n}) or contain one less part, due to a merge of two existing parts. We thus sample c_n from the\nfollowing distribution which only tracks those parts which change with different realizations of c_n:\n\np(c_n \\mid c_{-n}, X, Y, b, D, f, \\alpha, \\eta) \\propto \\begin{cases} p(c_n \\mid D, f, \\alpha) \\, \\Delta(Y, X, b, z(c), \\eta) & \\text{if } c_n \\text{ links } k_1 \\text{ and } k_2; \\\\ p(c_n \\mid D, f, \\alpha) & \\text{otherwise,} \\end{cases} \\qquad (9)\n\n\\Delta(Y, X, b, z(c), \\eta) = \\frac{\\prod_{j=1}^{J} p(Y_{j k_1 \\cup k_2} \\mid X_{b_j k_1 \\cup k_2}, \\eta)}{\\prod_{j=1}^{J} p(Y_{j k_1} \\mid X_{b_j k_1}, \\eta) \\prod_{j=1}^{J} p(Y_{j k_2} \\mid X_{b_j k_2}, \\eta)}. \\qquad (10)\n\nHere, k_1 and k_2 are parts in z(c_{\u2212n}). Note that if the mesh segmentation c is the only quantity\nof interest, the analytically marginalized af\ufb01ne transformations A_{jk} need not be directly estimated.\nHowever, for some applications the transformations are of direct interest. Given a sampled segmen-\ntation, the part-speci\ufb01c parameters for pose j have the following posterior [10]:\n\np(A_{jk}, \\Sigma_{jk} \\mid Y_j^k, X^k, \\eta) \\propto \\mathcal{MN}(A_{jk} \\mid S_{yx} S_{xx}^{-1}, \\Sigma_{jk}, S_{xx}) \\, \\mathcal{IW}(\\Sigma_{jk} \\mid N_k + n_0, S_{y|x} + S_0).\n\nMarginalizing the noise covariance matrix, the distribution over transformations is then\n\np(A_{jk} \\mid Y_j^k, X^k, \\eta) = \\int \\mathcal{MN}(A_{jk} \\mid S_{yx} S_{xx}^{-1}, \\Sigma_{jk}, S_{xx}) \\, \\mathcal{IW}(\\Sigma_{jk} \\mid N_k + n_0, S_{y|x} + S_0) \\, d\\Sigma_{jk} = \\mathcal{MT}(A_{jk} \\mid N_k + n_0, S_{yx} S_{xx}^{-1}, S_{xx}, S_{y|x} + S_0) \\qquad (11)\n\nwhere MT(\u00b7) is a matrix-t distribution [11] with mean S_{yx} S_{xx}^{\u22121}, and N_k + n_0 degrees of freedom.\n\n4 Experimental Results\n\nWe now experimentally validate, both qualitatively and quantitatively, our mesh-ddcrp model. Be-\ncause \u201cground truth\u201d parts are unavailable for the real body pose datasets of primary interest, we\npropose an alternative evaluation metric based on the prediction of held-out object poses, and show\nthat the mesh-ddcrp performs favorably against competing approaches.\n\nWe primarily focus on a collection of 56 training meshes, acquired and aligned [6] from 3D scans\nof two female subjects in 27 and 29 poses. For quantitative tests, we employ 12 meshes of each of\nsix different female subjects [15] (Figure 4). For each subject, a mesh in a canonical pose is chosen\nas the reference mesh (Figure 1). These meshes contain about 20,000 faces.\n\n4.1 Hyperparameter Speci\ufb01cation and MCMC Learning\n\nThe hyperparameters that regularize our mesh-ddcrp prior have intuitive interpretations, and can be\nspeci\ufb01ed based on properties of the mesh data under consideration. As described in Section 2.1,\nthe ddCRP distances D and f are set to guarantee spatially connected parts. 
The self-connection\nparameter is set to a small value, \u03b1 = 10^{\u22128}, to encourage creation of larger parts.\n\nThe matrix normal-inverse-Wishart prior on af\ufb01ne transformations A_{jk}, and residual noise covari-\nances \u03a3_{jk}, has hyperparameters \u03b7 = {n_0, S_0, M, K}. The mean af\ufb01ne transformation M is set to\nthe identity transformation, because on average we expect mesh faces to undergo small deformations.\nFor the noise covariance prior, we set the degrees of freedom n_0 = 5, a value which makes the prior\nvariance nearly as large as possible while ensuring that the mean remains \ufb01nite. The expected part\nvariance S_0 captures the degree of non-rigidity which we expect parts to demonstrate, as well as\nnoise from the mesh alignment process. The correspondence error in our human meshes is approximately\n0.01 m; allowing for some part non-rigidity, we set \u03c3 = 0.015 m and S_0 = \u03c3^2 \u00d7 I_{3\u00d73}. K is\na precision matrix set to K = \u03c3^2 \u00d7 diag(1, 1, 1, 0.1). The Kronecker product of K^{\u22121} and S_0 governs\nthe covariance of the distribution on A. Our settings make this nearly identity for most components,\nbut the translation components of A have variance which is an order of magnitude larger, so that the\nexpected scale of the translation parameters matches that of the mesh coordinates.\n\nIn our experiments, we ran the mesh-ddcrp sampler for 200 iterations from each of \ufb01ve random\ninitializations, and selected the most probable posterior sample. The computational cost of a Gibbs\niteration scales linearly with the number of meshes; our unoptimized Matlab implementation re-\nquired around 10 hours to analyze 56 human meshes.\n\n4.2 Baseline Segmentation Methods\n\nWe compare the mesh-ddcrp model to three competing methods. The \ufb01rst is a modi\ufb01ed agglomer-\native clustering technique [16] which enforces spatial contiguity of the faces within each part. 
At\ninitialization, each face is deemed to be its own part. Adjacent parts on the mesh are then merged\nbased on the squared error in describing their motion by af\ufb01ne transformations. Only adjacent parts\nare considered in these merge steps, so that parts remain spatially connected.\n\nOur second baseline is based on a publicly available implementation of spectral clustering meth-\nods [17], a popular approach which has been previously used for mesh segmentation [18]. We com-\npare to an af\ufb01nity matrix speci\ufb01cally designed to cluster faces with similar motions [19]. The af\ufb01nity\nbetween two mesh faces u, v is de\ufb01ned as\n\nC_{uv} = \\exp\\left\\{ -\\frac{\\sigma_{uv} + \\sqrt{m_{uv}}}{S^2} \\right\\},\n\nwhere m_{uv} = \\frac{1}{J^2} \\sum_j \\delta_{uvj}, \\delta_{uvj} is the Euclidean distance between u and v in pose j,\n\\sigma_{uv} = \\sqrt{\\frac{1}{J} \\sum_j (\\delta_{uvj} - \\bar{\\delta}_{uv})^2} is the corresponding standard deviation, and\nS = \\frac{1}{M} \\sum_{u,v} (\\sigma_{uv} + \\sqrt{m_{uv}}) for all M pairs of faces u, v.\n\nFor the agglomerative and spectral clustering approaches, the number of parts must be externally\nspeci\ufb01ed; we experimented with K = 5, 10, 15, 20, 25, 30 parts. We also consider a Bayesian\nnonparametric baseline which replaces the ddCRP prior over mesh partitions with a standard CRP\nprior. The resulting mesh-crp model may estimate the number of parts, but doesn\u2019t model mesh\nstructure or enforce part contiguity. The expected number of parts under the CRP prior is roughly\n\u03b1 log N; we set \u03b1 = 2 so that the expected number of mesh-crp parts is similar to the number of\nparts discovered by the mesh-ddcrp. To exploit bilateral symmetry, for all methods we only segment\nthe right half of each mesh. The resulting segmentation is then re\ufb02ected onto the left half.\n\n4.3 Part Discovery and Motion Prediction\n\nWe \ufb01rst consider the synthetic Tosca dataset [20], and separately analyze the Centaur (six poses)\nand Horse (eight poses) meshes. 
These meshes contain about 31,000 and 38,000 triangular faces,\nrespectively. Figure 3 displays the segmentations of the Tosca meshes inferred by mesh-ddcrp. The\ninferred parts largely correspond to groups of mesh faces which undergo similar transformations.\n\nFigure 3: Segmentations produced by mesh-ddcrp on synthetic Tosca meshes [20]. The \ufb01rst mesh in each row\ndisplays the chosen reference mesh. For illustration, we have only segmented the right half of each mesh.\n\nFigure 4 displays the results produced by the ddCRP, as well as our baseline methods, on the human\nmesh data. Qualitatively, the segmentations produced by mesh-ddcrp correspond to our intuitions\nabout the body. Note that in addition to capturing the head and limbs, the segmentation successfully\nsegregates distinctly moving small regions such as knees, elbows, shoulders, biceps, and triceps. In\nall, the mesh-ddcrp detects 20 distinctly moving parts for one half of the body.\n\nWe now introduce a quantitative measure of segmentation quality: segmentations are evaluated by\ntheir ability to explain the articulations of test meshes with novel shapes and poses. Given a collec-\ntion of T test meshes Y_t with corresponding reference meshes X_{b_t}, and a candidate segmentation\ninto K parts, we compute\n\nE = \\frac{1}{T} \\sum_{t=1}^{T} \\sum_{k=1}^{K} \\| Y_{tk} - A^{*}_{tk} X_{b_t k} \\|_2. \\qquad (12)\n\nHere, A^{*}_{tk} is the least squares estimate of the single af\ufb01ne transformation responsible for mapping\nX_{b_t k} to Y_{tk}. Note that Equation (12) is trivially zero for a degenerate solution wherein each mesh\nface is assigned to its own part. However, segmentations of similar resolution may safely be com-\npared using Equation (12), with lower errors corresponding to better segmentations.\n\nOn our test set of human meshes, the mesh-ddcrp model produces an error of E = 1.39 meters, which\ncorresponds to sub-millimeter accuracy when normalized by the number of faces. 
Figure 4 displays\na plot comparing the errors achieved by the different methods. Mesh-ddcrp is signi\ufb01cantly better\nthan all other methods, including for settings of K which allocate 50% more parts to competing\napproaches, according to a Wilcoxon signed-rank test (5% signi\ufb01cance level).\n\nNext, we demonstrate the bene\ufb01ts of sharing information among differently shaped bodies. We\nselected an illustrative articulated pose for each of the two training subjects in addition to their\nrespective reference poses (Figure 4). The chosen poses either exhibit upper or lower body defor-\nmations, but not both. The meshes were then segmented both independently for the two subjects\nand jointly sharing information across subjects. Figure 5 demonstrates that the independent segmen-\ntations exhibit both undersegmented (legs in the \ufb01rst set) and oversegmented (head in the second)\nparts. However, sharing information among subjects results in parts which correspond well with\nphysical human bodies. Note that with only two articulated poses, we are able to generate mean-\ningful segmentations in about an hour of computation. This data-limited scenario also demonstrates\nthe bene\ufb01ts of the ddCRP prior: as shown in Figure 5, the parts extracted by mesh-crp are \u201cpatchy\u201d,\nspatially disconnected, and physically implausible.\n\n5 Discussion\n\nAdapting the ddCRP to collections of 3D meshes, we have developed an effective approach for\nthe discovery of an unknown number of parts underlying articulated object motion. Unlike previous\nmethods, our model guarantees that parts are spatially connected, and uses transformations to model\ninstances with potentially varying body shapes. Via a novel application of matrix normal-inverse-\nWishart priors, our sampler analytically marginalizes transformations for improved ef\ufb01ciency. 
While\nwe have modeled part motion via af\ufb01ne transformations, future work should explore more accurate\nLie algebra characterizations of deformation manifolds [21].\n\nExperiments with dozens of real human body poses provide strong quantitative evidence that our ap-\nproach produces state-of-the-art segmentations with many potential applications. We are currently\nexploring methods for using multiple samples from the ddCRP posterior to characterize part uncer-\ntainty, and scaling our Monte Carlo learning algorithms to datasets containing thousands of meshes.\n\nAcknowledgments This work was supported in part by the Of\ufb01ce of Naval Research under con-\ntract W911QY-10-C-0172. We thank Eric Rachlin, Alex Weiss, and David Hirshberg for acquiring\nand aligning the human meshes, and Aggeliki Tsoli for her helpful comments.\n\nFigure 4 panels (left to right): Spect15, Spect20, Spect25, Agglom15, Agglom20, Agglom25, mesh-crp, mesh-ddcrp.\nFigure 4 plot: error in meters (axis from 1 to 4.5) versus number of parts (5 to 30); legend: ddcrp\u2212mesh, Spectral Clustering, Agglomerative, crp\u2212mesh.\n\nFigure 4: Top two rows (left to right): Segmentations produced by spectral and agglomerative clustering with\n15, 20, and 25 clusters respectively, followed by the mesh-crp and mesh-ddcrp segmentations. Bottom row: Test\nset results. We display mesh-ddcrp segmentations for several test meshes, and quantitatively compare methods.\n\nFigure 5 panels (left to right): Ref. pose, Illust. pose, ind. mesh-crp, ind. mesh-ddcrp, mesh-crp, mesh-ddcrp.\n\nFigure 5: Impact of sharing information across bodies with varying shapes. The two rows correspond to the\ntraining subjects. 
Each row displays the reference pose, an illustrative articulated pose, mesh-crp and mesh-\nddcrp segmentations produced by independently segmenting the pair of poses of each individual, and mesh-crp\nand mesh-ddcrp segmentations produced by jointly segmenting the chosen poses from both subjects.\n\n8\n\n\fReferences\n\n[1] M. Attene, S. Katz, M. Mortara, G. Patane, M. Spagnuolo, and A. Tal. Mesh segmentation \u2014 A compar-\n\native study. In SMI, 2006.\n\n[2] Xiaobai Chen, Aleksey Golovinskiy, and Thomas Funkhouser. A benchmark for 3D mesh segmentation.\n\nACM Transactions on Graphics (Proc. SIGGRAPH), 28(3):73:1\u201373:12, 2009.\n\n[3] Evangelos Kalogerakis, Aaron Hertzmann, and Karan Singh. Learning 3D Mesh Segmentation and La-\n\nbeling. ACM Transactions on Graphics, 29(4):102:1\u2013102:12, July 2010.\n\n[4] David M. Blei and Peter I. Frazier. Distance dependent Chinese restaurant processes. J. Mach. Learn.\n\nRes., 12:2461\u20132488, November 2011.\n\n[5] S. Ghosh, A. B. Ungureanu, E. B. Sudderth, and D. Blei. Spatial distance dependent Chinese restaurant\n\nprocesses for image segmentation. In NIPS, pages 1476\u20131484, 2011.\n\n[6] D. Hirshberg, M. Loper, E. Rachlin, and M.J. Black. Coregistration: Simultaneous alignment and model-\n\ning of articulated 3D shape. In ECCV, pages 242\u2013255, 2012.\n\n[7] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. JCGS, 9(2):249\u2013265,\n\n2000.\n\n[8] D. Anguelov, D. Koller, H. Pang, P. Srinivasan, and S. Thrun. Recovering articulated object models from\n\n3d range data. In UAI, pages 18\u201326, 2004.\n\n[9] J. Franco and E. Boyer. Learning temporally consistent rigidities.\n\nIn IEEE CVPR, pages 1241\u20131248,\n\n2011.\n\n[10] E. B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. PhD thesis, Mas-\n\nsachusetts Institute of Technology, Cambridge, MA, 2009.\n\n[11] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. 
Chapman & Hall/CRC, October 2000.\n\n[12] Tong-Yee Lee, Yu-Shuen Wang, and Tai-Guang Chen. Segmenting a deforming mesh into near-rigid\n\ncomponents. The Visual Computer, 22(9):729\u2013739, September 2006.\n\n[13] Guy Rosman, Michael M. Bronstein, Alexander M. Bronstein, Alon Wolf, and Ron Kimmel. Group-\nvalued regularization framework for motion segmentation of dynamic non-rigid shapes. In SSVM\u201911,\n\npages 725\u2013736, 2012.\n\n[14] Stefanie Wuhrer and Alan Brunton. Segmenting animated objects into near-rigid components. The Visual\n\nComputer, 26:147\u2013155, 2010.\n\n[15] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel. A statistical model of human pose and\nbody shape. In Computer Graphics Forum (Proc. Eurographics 2009), volume 2, pages 337\u2013346, March\n2009.\n\n[16] R. N. Shepard. Multidimensional scaling, tree-\ufb01tting, and clustering. Science, 210:390\u2013398, October\n\n1980.\n\n[17] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang. Parallel spectral\n\nclustering in distributed systems. IEEE PAMI, 33(3):568\u2013586, 2011.\n\n[18] Rong Liu and Hao Zhang. Segmentation of 3D meshes through spectral clustering. In Paci\ufb01c Conference\n\non Computer Graphics and Applications, pages 298\u2013305, 2004.\n\n[19] Edilson de Aguiar, Christian Theobalt, Sebastian Thrun, and Hans-Peter Seidel. Automatic conversion of\n\nmesh animations into skeleton-based animations. Computer Graphics Forum, 27(2):389\u2013397, 2008.\n\n[20] Alexander Bronstein, Michael Bronstein, and Ron Kimmel. Calculus of nonrigid surfaces for geometry\n\nand texture manipulation. IEEE Tran. on Viz. and Computer Graphics, 13:902\u2013913, 2007.\n\n[21] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In\n\nEuropean Conf. on Computer Vision (ECCV), Part I, LNCS 7572, pages 1\u201314. 
Springer-Verlag, October\n2012.\n\n9\n\n\f", "award": [], "sourceid": 989, "authors": [{"given_name": "Soumya", "family_name": "Ghosh", "institution": null}, {"given_name": "Matthew", "family_name": "Loper", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "Michael", "family_name": "Black", "institution": null}]}