{"title": "Self-supervised Learning of Motion Capture", "book": "Advances in Neural Information Processing Systems", "page_first": 5236, "page_last": 5246, "abstract": "Current state-of-the-art solutions for motion capture from a single camera are optimization driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g. person segmentation, optical flow, keypoint detections etc.). Optimization models are susceptible to local minima. This has been the bottleneck that forced using clean green-screen like backgrounds at capture time, manual initialization, or switching to multiple cameras as input resource. In this work, we propose a learning based motion capture model for single camera input. Instead of optimizing mesh and skeleton parameters directly, our model optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video. Our model is trained using a combination of strong supervision from synthetic data, and self-supervision from differentiable rendering of (a) skeletal keypoints, (b) dense 3D mesh motion, and (c) human-background segmentation, in an end-to-end framework. Empirically we show our model combines the best of both worlds of supervised learning and test-time optimization: supervised learning initializes the model parameters in the right regime, ensuring good pose and surface initialization at test time, without manual effort. Self-supervision by back-propagating through differentiable rendering allows (unsupervised) adaptation of the model to the test data, and offers much tighter fit than a pretrained fixed model. 
We show that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.", "full_text": "Self-supervised Learning of Motion Capture\n\nHsiao-Yu Fish Tung 1, Hsiao-Wei Tung 2, Ersin Yumer 3, Katerina Fragkiadaki 1\n\n1 Carnegie Mellon University, Machine Learning Department\n\n2 University of Pittsburgh, Department of Electrical and Computer Engineering\n\n{htung, katef}@cs.cmu.edu, hst11@pitt.edu,yumer@adobe.com\n\n3 Adobe Research\n\nAbstract\n\nCurrent state-of-the-art solutions for motion capture from a single camera are\noptimization driven: they optimize the parameters of a 3D human model so that\nits re-projection matches measurements in the video (e.g. person segmentation,\noptical \ufb02ow, keypoint detections etc.). Optimization models are susceptible to\nlocal minima. This has been the bottleneck that forced using clean green-screen\nlike backgrounds at capture time, manual initialization, or switching to multiple\ncameras as input resource. In this work, we propose a learning based motion capture\nmodel for single camera input. Instead of optimizing mesh and skeleton parameters\ndirectly, our model optimizes neural network weights that predict 3D shape and\nskeleton con\ufb01gurations given a monocular RGB video. Our model is trained using\na combination of strong supervision from synthetic data, and self-supervision from\ndifferentiable rendering of (a) skeletal keypoints, (b) dense 3D mesh motion, and\n(c) human-background segmentation, in an end-to-end framework. Empirically\nwe show our model combines the best of both worlds of supervised learning\nand test-time optimization: supervised learning initializes the model parameters\nin the right regime, ensuring good pose and surface initialization at test time,\nwithout manual effort. 
Self-supervision by back-propagating through differentiable rendering allows (unsupervised) adaptation of the model to the test data, and offers a much tighter fit than a pretrained fixed model. We show that the proposed model improves with experience and converges to low-error solutions where previous optimization methods fail.\n\n1 Introduction\n\nDetailed understanding of the human body and its motion from \u201cin the wild\u201d monocular setups would open the path to applications such as automated gym and dance teachers, rehabilitation guidance, patient monitoring, and safer human-robot interactions. It would also impact the movie industry, where character motion capture (MOCAP) and retargeting still require tedious manual effort from artists to achieve the desired accuracy, or the use of expensive multi-camera setups and green-screen backgrounds.\n\nMost current motion capture systems are optimization driven and cannot benefit from experience. Monocular motion capture systems optimize the parameters of a 3D human model to match measurements in the video (e.g., person segmentation, optical flow). Background clutter and optimization difficulties significantly impact tracking performance, leading prior work to use green-screen-like backdrops [5] and careful initializations. Additionally, these methods cannot leverage the data generated by the laborious manual processes involved in motion capture to improve over time. This means\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Self-supervised learning of motion capture. 
Given a video sequence and a set of 2D body joint heatmaps, our network predicts the body parameters for the SMPL 3D human mesh model [25]. Neural network weights are pretrained using synthetic data and finetuned using self-supervised losses driven by differentiable keypoint, segmentation, and motion re-projection errors, against detected 2D keypoints, 2D segmentation and 2D optical flow, respectively. By finetuning its parameters at test time through self-supervised losses, the proposed model achieves a significantly higher level of 3D reconstruction accuracy than purely supervised or purely optimization based models, which either do not adapt at test time, or cannot benefit from training data, respectively.\n\nthat each time a video needs to be processed, the optimization and manual efforts need to be repeated from scratch.\n\nWe propose a neural network model for motion capture in monocular videos that learns to map an image sequence to a sequence of corresponding 3D meshes. The success of deep learning models lies in their supervision from large scale annotated datasets [14]. However, detailed 3D mesh annotations are tedious and time consuming to obtain; thus, large scale annotation of 3D human shapes in realistic video input is currently unavailable. Our work bypasses the lack of 3D mesh annotations in real videos by combining strong supervision from large scale synthetic data of rendered human models with self-supervision from 3D-to-2D differentiable rendering of 3D keypoints, motion and segmentation, and matching with corresponding detected quantities in 2D, in real monocular videos. Our self-supervision leverages recent advances in 2D body joint detection [37; 9], 2D figure-ground segmentation [22], and 2D optical flow [21], each learnt using strong supervision from real or synthetic datasets, such as MPII [3], COCO [24], and Flying Chairs [15], respectively. 
Indeed, annotating 2D body joints is easier than annotating 3D joints or 3D meshes, while optical flow models have proven to generalize easily from synthetic to real data. We show how state-of-the-art models of 2D joints, optical flow and 2D human segmentation can be used to infer dense 3D human structure in videos in the wild, which is otherwise hard to annotate manually. In contrast to previous optimization based motion capture works [8; 7], we use differentiable warping and differentiable camera projection for our optical flow and segmentation losses, which allows our model to be trained end-to-end with standard back-propagation.\n\nWe use SMPL [25] as our dense human 3D mesh model. It consists of a fixed number of vertices and triangles with fixed topology, where the global pose is controlled by relative angles between body parts \u03b8, and the local shape is controlled by mesh surface parameters \u03b2. Given the pose and surface parameters, a dense mesh can be generated in an analytical (differentiable) form, which can then be globally rotated and translated to a desired location. The task of our model is to reverse-engineer the rendering process and predict the parameters of the SMPL model (\u03b8 and \u03b2), as well as the focal length, 3D rotations and 3D translations in each input frame, given an image crop around a detected person.\n\nGiven 3D mesh predictions in two consecutive frames, we differentiably project the 3D motion vectors of the mesh vertices, and match them against estimated 2D optical flow vectors (Figure 1). Differentiable motion rendering and matching requires vertex visibility estimation, which we perform using ray casting integrated into our neural model for acceleration. 
Similarly, in each frame, 3D keypoints are projected and their distances to corresponding detected 2D keypoints are penalized. Last but not least, differentiable segmentation matching using Chamfer distances penalizes under- and over-fitting of the projected vertices against the 2D segmentation of the human foreground. Note that these re-projection errors are only on the shape rather than the texture by design, since our predicted 3D meshes are textureless.\n\nWe provide quantitative and qualitative results on dense 3D human shape tracking on the SURREAL [35] and H3.6M [22] datasets. We compare against the corresponding optimization versions, where mesh parameters are directly optimized by minimizing our self-supervised losses, as well as against supervised models that do not use self-supervision at test time. Optimization baselines easily get stuck in local minima, and are very sensitive to initialization. In contrast, our learning-based MOCAP model relies on supervised pretraining (on synthetic data) to provide reasonable pose initialization at test time. Further, self-supervised adaptation achieves lower 3D reconstruction error than the pretrained, non-adapted model. Last, our ablation highlights the complementarity of the three proposed self-supervised losses.\n\n2 Related Work\n\n3D Motion capture  3D motion capture using multiple cameras (four or more) is a well-studied problem where impressive results are achieved with existing methods [17]. However, motion capture from a single monocular camera is still an open problem, even for skeleton-only capture/tracking. Since ambiguities and occlusions can be severe in monocular motion capture, most approaches rely on prior models of pose and motion. Earlier works considered linear motion models [16; 13]. 
Non-linear\npriors such as Gaussian process dynamical models [34], as well as twin Gaussian processes [6] have\nalso been proposed, and shown to outperform their linear counterparts. Recently, Bogo et al. [7]\npresented a static image pose and 3D dense shape prediction model which works in two stages: \ufb01rst, a\n3D human skeleton is predicted from the image, and then a parametric 3D shape is \ufb01t to the predicted\nskeleton using an optimization procedure, during which the skeleton remains unchanged. Instead, our\nwork couples 3D skeleton and 3D mesh estimation in an end-to-end differentiable framework, via\ntest-time adaptation.\n\n3D human pose estimation Earlier work on 3D pose estimation considered optimization methods\nand hard-coded anthropomorphic constraints (e.g., limb symmetry) to \ufb01ght ambiguity during 2D-\nto-3D lifting [28]. Many recent works learn to regress to 3D human pose directly given an RGB\nimage [27] using deep neural networks and large supervised training sets [22]. Many have explored\n2D body pose as an intermediate representation [11; 38], or as an auxiliary task in a multi-task\nsetting [32; 38; 39], where the abundance of labelled 2D pose training examples helps feature\nlearning and complements limited 3D human pose supervision, which requires a Vicon system and\nthus is restricted to lab instrumented environments. Rogez and Schmid [29] obtain large scale\nRGB to 3D pose synthetic annotations by rendering synthetic 3D human models against realistic\nbackgrounds [29], a dataset also used in this work.\n\nDeep geometry learning Our differentiable renderer follows recent works that integrate deep\nlearning and geometric inference [33]. Differentiable warping [23; 26] and backpropable camera\nprojection [39; 38] have been used to learn 3D camera motion [40] and joint 3D camera and 3D\nobject motion [30] in an end-to-end self-supervised fashion, minimizing a photometric loss. Garg et\nal. 
[18] learn a monocular depth predictor, supervised by photometric error, given a stereo image pair with known baseline as input. The work of [19] contributed a deep learning library with many geometric operations, including a backpropable camera projection layer similar to the one used in the cameras of Yan et al. [39] and Wu et al. [38], as well as Garg et al.\u2019s depth CNN [18].\n\n3 Learning Motion Capture\n\nThe architecture of our network is shown in Figure 1. We use SMPL as the parametrized model of 3D human shape, introduced by Loper et al. [25]. SMPL consists of parameters that control the yaw, pitch and roll of body joints, and parameters that control the deformation of the body skin surface. Let \u03b8, \u03b2 denote the joint angle and surface deformation parameters, respectively. Given these parameters, a fixed number (n = 6890) of 3D mesh vertex coordinates are obtained using the following analytical expression, where Xi \u2208 R3 stands for the 3D coordinates of the ith vertex in the mesh:\n\nXi = X\u0304i + \u03a3_m \u03b2_m s_{m,i} + \u03a3_n (T_n(\u03b8) \u2212 T_n(\u03b8*)) p_{n,i},    (1)\n\nwhere X\u0304i \u2208 R3 is the nominal rest position of vertex i, \u03b2_m is the blend coefficient for the skin surface blendshapes, s_{m,i} \u2208 R3 is the element corresponding to the ith vertex of the mth skin surface blendshape, p_{n,i} \u2208 R3 is the element corresponding to the ith vertex of the nth skeletal pose blendshape, T_n(\u03b8) is a function that maps the nth pose blendshape to a vector of concatenated part relative rotation matrices, and T_n(\u03b8*) is the same for the rest pose \u03b8*. Note that the expression in Eq. 1 is differentiable.\n\nFigure 2: Differentiable rendering of body joints (left), segmentation (middle) and mesh vertex motion (right).\n\nOur model, given an image crop centered around a person detection, predicts parameters \u03b2 and \u03b8 of the SMPL 3D human mesh. 
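Eq. 1 above is a sum of the rest template and linearly weighted blendshape offsets. The following minimal NumPy sketch illustrates it; array names are hypothetical, and the pose-dependent coefficients T_n(θ) are passed in precomputed rather than derived from relative rotation matrices as in the real SMPL implementation:

```python
import numpy as np

def smpl_vertices(X_bar, S, P, beta, T_theta, T_rest):
    """Sketch of Eq. 1: rest vertices plus shape and pose blendshape offsets.

    X_bar:   (n, 3) nominal rest positions of the n mesh vertices
    S:       (M, n, 3) skin-surface blendshapes s_{m,i}
    P:       (N, n, 3) skeletal pose blendshapes p_{n,i}
    beta:    (M,) shape blend coefficients
    T_theta: (N,) pose-dependent coefficients T_n(theta)  [hypothetical input]
    T_rest:  (N,) the same coefficients at the rest pose T_n(theta*)
    """
    shape_offset = np.einsum('m,mid->id', beta, S)            # sum_m beta_m s_{m,i}
    pose_offset = np.einsum('n,nid->id', T_theta - T_rest, P)  # sum_n (T_n - T_n*) p_{n,i}
    return X_bar + shape_offset + pose_offset
```

Because every operation is a linear combination, the whole expression is differentiable in β and θ, which is what allows the re-projection losses below to update the network weights by back-propagation.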
Since annotations of 3D meshes are very tedious and time consuming to obtain, our model uses supervision from a large dataset of synthetic monocular videos, and self-supervision with a number of losses that rely on differentiable rendering of 3D keypoints, segmentation and vertex motion, and matching with their 2D equivalents. We detail the supervision of our model below.\n\nPaired supervision from synthetic data  We use the synthetic Surreal dataset [35] that contains monocular videos of human characters performing activities against 2D image backgrounds. The synthetic human characters have been generated using the SMPL model, and animated using the Human H3.6M dataset [22]. Texture is generated by directly coloring the mesh vertices, without actual 3D cloth simulation. Since values for \u03b2 and \u03b8 are directly available in this dataset, we use them to pretrain the \u03b8 and \u03b2 branches of our network using a standard supervised regression loss.\n\n3.1 Self-supervision through differentiable rendering\n\nSelf-supervision in our model is based on 3D-to-2D rendering and consistency checks against 2D estimates of keypoints, segmentation and optical flow. Self-supervision can be used at both train and test time, for adapting our model\u2019s weights to the statistics of the test set.\n\nKeypoint re-projection error  Given a static image, predictions of the 3D body joints of the depicted person should match, when projected, corresponding 2D keypoint detections. Such a keypoint re-projection error has already been used in numerous previous works [38; 39]. Our model predicts a dense 3D mesh instead of a skeleton. We leverage the linear relationship that relates our 3D mesh vertices to 3D body joints:\n\nX_kpt^T = A \u00b7 X^T,    (2)\n\nwhere X \u2208 R^{4\u00d7n} denotes the 3D coordinates of the mesh vertices in homogeneous coordinates (with a small abuse of notation, since it is clear from the context), and n is the number of vertices. For estimating the 3D-to-2D projection, our model further predicts the focal length, the rotation of the camera, and the translation of the 3D mesh off the center of the image, in case the root node of the 3D mesh is not exactly placed at the center of the image crop. We do not predict translation in the z direction (perpendicular to the image plane), as the predicted focal length accounts for the scaling of the person figure. For rotation, we predict Euler rotation angles \u03b1, \u03b2, \u03b3 so that the 3D rotation of the camera reads R = Rx(\u03b1) Ry(\u03b2) Rz(\u03b3), where Rx(\u03b8) denotes rotation around the x-axis by angle \u03b8, here in homogeneous coordinates. The re-projection equation for the kth keypoint then reads:\n\nx_kpt^k = P \u00b7 (R \u00b7 X_kpt^k + T),    (3)\n\nwhere P = diag([f f 1 0]) is the predicted camera projection matrix and T = [Tx Ty 0]^T handles small perturbations in object centering. The keypoint re-projection error then reads:\n\nL_kpt = ||x_kpt \u2212 x\u0303_kpt||_2^2,    (4)\n\nwhere x\u0303_kpt are ground-truth or detected 2D keypoints. Since the 3D mesh vertices are related to the \u03b2, \u03b8 predictions through Eq. 1, re-projection error minimization updates the neural parameters for \u03b2, \u03b8 estimation.\n\nMotion re-projection error  Given a pair of frames, 3D mesh vertex displacements from one frame to the next should match, when projected, corresponding 2D optical flow vectors computed from the corresponding RGB frames. 
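The keypoint re-projection path described above (Eqs. 2–4) composes a fixed joint-regression matrix with an Euler-angle camera rotation and a pinhole projection. A minimal NumPy sketch, with hypothetical names, working in plain (non-homogeneous) coordinates and using explicit perspective division:

```python
import numpy as np

def euler_to_R(a, b, g):
    """R = Rx(a) @ Ry(b) @ Rz(g), built from the three predicted Euler angles."""
    ca, sa, cb, sb, cg, sg = np.cos(a), np.sin(a), np.cos(b), np.sin(b), np.cos(g), np.sin(g)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def project_keypoints(X, A, R, T, f):
    """Sketch of Eqs. 2-3: 3D joints as linear combinations of mesh vertices,
    then rotation, translation, and pinhole projection.

    X: (n, 3) mesh vertices; A: (K, n) fixed joint-regression matrix;
    R: (3, 3) camera rotation; T: (3,) translation with Tz = 0;
    f: scalar focal length. Returns (K, 2) pixel coordinates.
    """
    X_kpt = A @ X                      # Eq. 2 (non-homogeneous form)
    Xc = (R @ X_kpt.T).T + T           # camera-frame joints
    return f * Xc[:, :2] / Xc[:, 2:3]  # perspective division

def keypoint_loss(x_pred, x_det):
    """Eq. 4: squared L2 error against detected 2D keypoints."""
    return np.sum((x_pred - x_det) ** 2)
```

With zero Euler angles and zero translation, a joint on the optical axis projects to the image center, which is a quick sanity check for the sign conventions.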
All Structure-from-Motion (SfM) methods exploit such a motion re-projection error in one way or another: the estimated 3D pointcloud over time, when projected, should match 2D optical flow vectors in [2], or multiframe 2D point trajectories in [31]. Though previous SfM models use the motion re-projection error to optimize 3D coordinates and camera parameters directly [2], here we use it to optimize the neural network parameters that predict such quantities instead.\n\nMotion re-projection error estimation requires the visibility of the mesh vertices in each frame. We implement visibility inference through ray casting for each example and training iteration in TensorFlow and integrate it with our neural network model, which accelerates execution time by ten times, as opposed to interfacing with ray casting in OpenGL. Vertex visibility inference does not need to be differentiable: it is used only to mask the motion re-projection loss for invisible vertices. Since we are only interested in visibility rather than complex rendering functionality, ray casting boils down to detecting the first mesh facet to intersect with the straight line from the image projected position of the center of a facet to its 3D point. If the intercepted facet is the same as the one which the ray is cast from, we denote that facet as visible, since there is no occluder between that facet and the image plane. We provide more details on the ray casting reasoning in the experiments section. Vertices that construct these visible facets are treated as visible. Let v_i \u2208 {0, 1}, i = 1\u00b7\u00b7\u00b7n, denote the visibilities of the mesh vertices.\n\nGiven two consecutive frames I1, I2, let \u03b21, \u03b81, R1, T1, \u03b22, \u03b82, R2, T2 denote the corresponding predictions from our model. We obtain corresponding 3D pointclouds X_1^i = [X_1^i  Y_1^i  Z_1^i]^T, i = 1\u00b7\u00b7\u00b7n, and X_2^i = [X_2^i  Y_2^i  Z_2^i]^T, i = 1\u00b7\u00b7\u00b7n, using Eq. 1. The 3D mesh vertices are mapped to corresponding pixel coordinates (x_1^i, y_1^i), i = 1\u00b7\u00b7\u00b7n, and (x_2^i, y_2^i), i = 1\u00b7\u00b7\u00b7n, using the camera projection equation (Eq. 3). Thus the predicted 2D body flow resulting from the 3D motion of the corresponding meshes is (u^i, v^i) = (x_2^i \u2212 x_1^i, y_2^i \u2212 y_1^i), i = 1\u00b7\u00b7\u00b7n.\n\nLet OF = (u\u0303, v\u0303) denote the 2D optical flow field estimated with an optical flow method, such as the state-of-the-art deep neural flow of [21]. Let OF(x_1^i, y_1^i) denote the optical flow at the potentially subpixel location (x_1^i, y_1^i), obtained from the pixel centered optical flow field OF through differentiable bilinear interpolation (differentiable warping) [23]. Then, the motion re-projection error reads:\n\nL_motion = (1 / (1^T v)) \u03a3_{i=1}^{n} v_i (||u^i \u2212 u\u0303(x_1^i, y_1^i)||_1 + ||v^i \u2212 v\u0303(x_1^i, y_1^i)||_1).\n\nSegmentation re-projection error  Given a static image, the predicted 3D mesh for the depicted person should match, when projected, the corresponding 2D figure-ground segmentation mask. Numerous 3D shape reconstruction methods have used such a segmentation consistency constraint [36; 2; 4], but again, in an optimization as opposed to a learning framework.\n\nLet S^I \u2208 {0, 1}^{w\u00d7h} denote the 2D figure-ground binary image segmentation, supplied by ground-truth, background subtraction, or predicted by a figure-ground neural network segmenter [20]. Our segmentation re-projection loss measures how well the projected mesh mask fits the image segmentation S^I by penalizing non-overlapping pixels by the shortest distance to the projected model segmentation S^M = {x_2d}. 
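The motion re-projection loss compares the projected per-vertex motion with optical flow sampled at subpixel frame-1 locations, masked by visibility. A minimal NumPy sketch with hypothetical names (the paper's version uses differentiable TensorFlow ops so gradients flow back to the mesh parameters):

```python
import numpy as np

def bilinear_sample(field, x, y):
    """Bilinear interpolation of an (H, W) field at a subpixel location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * field[y0, x0] + wx * (1 - wy) * field[y0, x1]
            + (1 - wx) * wy * field[y1, x0] + wx * wy * field[y1, x1])

def motion_loss(xy1, xy2, flow_u, flow_v, vis):
    """Visibility-masked L1 distance between projected vertex motion and flow.

    xy1, xy2: (n, 2) projected vertex pixels in frames 1 and 2
    flow_u, flow_v: (H, W) estimated optical flow components
    vis: (n,) 0/1 vertex visibility from ray casting
    """
    u, v = (xy2 - xy1).T  # predicted 2D body flow (u^i, v^i)
    total = 0.0
    for i in range(len(vis)):
        if vis[i]:
            ut = bilinear_sample(flow_u, xy1[i, 0], xy1[i, 1])
            vt = bilinear_sample(flow_v, xy1[i, 0], xy1[i, 1])
            total += abs(u[i] - ut) + abs(v[i] - vt)
    return total / max(vis.sum(), 1)  # normalize by the number of visible vertices
```

The visibility mask enters only multiplicatively, which is why, as the text notes, the ray casting itself need not be differentiable.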
For this purpose, Chamfer distance maps C^I for the image segmentation S^I and Chamfer distance maps C^M for the model projected segmentation S^M are calculated. The loss then reads:\n\nL_seg = S^M \u2297 C^I + S^I \u2297 C^M,\n\nwhere \u2297 denotes pointwise multiplication. Both terms are necessary to prevent under- or over-coverage of the model segmentation over the image segmentation. For the loss to be differentiable, we cannot use the distance transform for efficient computation of the Chamfer maps. Rather, we brute-force the computation by calculating the shortest distance of each pixel to the model segmentation, and the inverse. Let x_2d^i, i \u2208 1\u00b7\u00b7\u00b7n, denote the set of model projected vertex pixel coordinates and x_seg^p, p \u2208 1\u00b7\u00b7\u00b7m, denote the set of pixel centered coordinates that belong to the foreground of the 2D segmentation map S^I:\n\nL_seg-proj = \u03a3_{i=1}^{n} min_p ||x_2d^i \u2212 x_seg^p||_2^2  (prevent over-coverage)  +  \u03a3_{p=1}^{m} min_i ||x_seg^p \u2212 x_2d^i||_2^2  (prevent under-coverage).    (5)\n\nThe first term ensures the model projected segmentation is covered by the image segmentation, while the second term ensures that the model projected segmentation covers the image segmentation well. To lower the memory requirements we use half of the input image resolution.\n\n4 Experiments\n\nWe test our method on two datasets: Surreal [35] and H3.6M [22]. Surreal is currently the largest synthetic dataset for people in motion. It contains short monocular video clips depicting human characters performing daily activities. Ground-truth 3D human meshes are readily available. We split the dataset into train and test video sequences. Human3.6M (H3.6M) is the largest real video dataset with annotated 3D human skeletons. 
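The brute-force Chamfer computation behind Eq. 5 reduces to symmetric nearest-neighbour penalties between projected vertex pixels and segmentation foreground pixels. A minimal NumPy sketch with hypothetical names:

```python
import numpy as np

def seg_projection_loss(x2d, xseg):
    """Sketch of Eq. 5: symmetric nearest-point penalties between projected
    mesh vertex pixels x2d (n, 2) and segmentation foreground pixels xseg (m, 2)."""
    # all pairwise squared distances, shape (n, m)
    d2 = np.sum((x2d[:, None, :] - xseg[None, :, :]) ** 2, axis=2)
    over = d2.min(axis=1).sum()   # each projected vertex near some foreground pixel
    under = d2.min(axis=0).sum()  # each foreground pixel near some projected vertex
    return over + under
```

The first term alone would let the projected mesh shrink inside the mask; the second alone would let it balloon past it, which is why both terms appear in Eq. 5.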
It contains videos of actors performing activities and provides annotations of body joint locations in 2D and 3D at every frame, recorded through a Vicon system. It does not provide dense 3D ground-truth, though.\n\nOur model is first trained using supervised skeleton and surface parameters on the training set of the Surreal dataset. Then, it is self-supervised using differentiable rendering and re-projection error minimization on two test sets, one in the Surreal dataset and one in H3.6M. For self-supervision, we use ground-truth 2D keypoints and segmentations in both datasets, Surreal and H3.6M. The segmentation mask in Surreal is very accurate, while in H3.6M it is obtained using background subtraction and can be quite inaccurate, as seen in Figure 4. Our model refines such initially inaccurate segmentation masks. The 2D optical flows for dense motion matching are obtained using FlowNet2.0 [21] in both datasets. We do not use any 3D ground-truth supervision in H3.6M, as our goal is to demonstrate successful domain transfer of our model from SURREAL to H3.6M. We measure the quality of the predicted 3D skeletons in both datasets, and we measure the quality of the predicted dense 3D meshes in Surreal, since only there is it available.\n\nEvaluation metrics  Given predicted 3D body joint locations X_kpt^k, k = 1\u00b7\u00b7\u00b7K, of K = 32 keypoints and corresponding ground-truth 3D joint locations X\u0303_kpt^k, k = 1\u00b7\u00b7\u00b7K, we define the per-joint error of each example as (1/K) \u03a3_{k=1}^{K} ||X_kpt^k \u2212 X\u0303_kpt^k||_2, similar to previous works [41]. We also define the reconstruction error of each example as the 3D per-joint error up to a 3D translation T (the 3D rotation should still be predicted correctly): min_T (1/K) \u03a3_{k=1}^{K} ||(X_kpt^k + T) \u2212 X\u0303_kpt^k||_2. We define the surface error of each example to be the per-joint error when considering all the vertices of the 3D mesh: (1/n) \u03a3_{i=1}^{n} ||X^i \u2212 X\u0303^i||_2.\n\nWe compare our learning based model against two baselines: (1) Pretrained, a model that uses only supervised training from synthetic data, without self-supervised adaptation. This baseline is similar to the recent work of [12]. (2) Direct optimization, a model that uses our differentiable self-supervised losses, but instead of optimizing neural network weights, optimizes directly over body mesh parameters (\u03b8, \u03b2), rotation (R), translation (T), and focal length f. We use standard gradient descent as our optimization method. We experiment with varying amounts of supervision during initialization of our optimization baseline: random initialization, using ground-truth 3D translation, using ground-truth rotation, and using ground-truth theta angles (to estimate the surface parameters).\n\nTables 1 and 2 show the results of our model and baselines for the different evaluation metrics. The learning based self-supervised model outperforms both the pretrained model, which does not exploit adaptation through differentiable rendering and consistency checks, and the direct optimization baselines, which are sensitive to initialization mistakes.\n\nAblation  In Figure 3 we show the 3D keypoint reconstruction error after self-supervised finetuning using different combinations of self-supervised losses. 
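The metrics defined above reduce to mean Euclidean distances, with the reconstruction error additionally minimized over a global translation. A minimal NumPy sketch with hypothetical names; subtracting centroids is the exact minimizer for the squared version of the error and a common approximation for the unsquared one used here:

```python
import numpy as np

def per_joint_error(X, X_gt):
    """Mean Euclidean distance over joints (or over mesh vertices, for surface error)."""
    return np.mean(np.linalg.norm(X - X_gt, axis=1))

def reconstruction_error(X, X_gt):
    """Per-joint error minimized over a global 3D translation T, approximated
    by aligning the centroids of the two point sets before comparing."""
    return per_joint_error(X - X.mean(axis=0), X_gt - X_gt.mean(axis=0))
```

A prediction that is correct up to a constant 3D offset thus scores zero reconstruction error while still incurring a nonzero per-joint error.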
A model self-supervised by the keypoint re-projection error (Lkpt) alone does worse than a model using both the keypoint and segmentation re-projection errors (Lkpt + Lseg). A model trained using all three proposed losses (keypoint, segmentation and dense motion re-projection error, Lkpt + Lseg + Lmotion) outperforms both of the above. This shows the complementarity and importance of all the proposed losses.\n\nTable 1: 3D mesh prediction results in Surreal [35]. The proposed model (pretrained+self-supervised) outperforms both optimization based alternatives, as well as pretrained models using supervised regression that do not adapt to the test data. We use a superscript \u02dc\u00b7 to denote ground-truth information provided at initialization of our optimization based baseline.\n\n                        surface error (mm)  per-joint error (mm)  recon. error (mm)\nOptimization                  346.5               532.8              1320.1\nOptimization + \u02dcR             301.1               222.0               294.9\nOptimization + \u02dcR + \u02dcT        272.8               206.6               205.5\nPretrained                    119.4               101.6               351.3\nPretrained+Self-Sup            74.5                64.4               203.9\n\nTable 2: 3D skeleton prediction results on H3.6M [22]. The proposed model (pretrained+self-supervised) outperforms both an optimization based baseline, as well as a pretrained model. Self-supervised learning through differentiable rendering allows our model to adapt effectively across domains (Surreal to H3.6M), while the fixed pretrained baseline cannot. Dense 3D surface ground-truth is not available in H3.6M and thus surface error cannot be measured.\n\n                        per-joint error (mm)  recon. error (mm)\nOptimization                  562.4               883.1\nPretrained                    125.6               303.5\nPretrained+Self-Sup            98.4               145.8\n\nFigure 3: 3D reconstruction error during purely unsupervised finetuning under different self-supervised losses (Lk \u2261 Lkpt: keypoint re-projection error; LS \u2261 Lseg: segmentation re-projection error; LM \u2261 Lmotion: dense motion re-projection error). All losses contribute to 3D error reduction.\n\nDiscussion  We have shown that a combination of supervised pretraining and unsupervised adaptation is beneficial for accurate 3D mesh prediction. Learning based self-supervision combines the best of both worlds of supervised learning and test-time optimization: supervised learning initializes the learning parameters in the right regime, ensuring good pose initialization at test time, without manual effort. Self-supervision through differentiable rendering allows adaptation of the model to the test data, and thus allows a much tighter fit than a pretrained model with \u201cfrozen\u201d weights at test time. Note that overfitting in that sense is desirable: we want our predicted 3D mesh to fit our test set as tightly as possible, and to improve tracking accuracy with minimal human intervention.\n\nFigure 4: Qualitative results of 3D mesh prediction. In the top four rows, we show predictions in Surreal and in the bottom four from H3.6M. Our model handles bad segmentation input masks in H3.6M thanks to supervision from multiple rendering based losses. A byproduct of our 3D mesh model is improved 2D person segmentation (column 6). (Columns: input 1, input 2, predicted mesh, predicted 2D projection, segmentation ground-truth, predicted flow, predicted mask.)\n\nImplementation details  Our model architecture consists of 5 convolution blocks. Each block contains two convolutional layers with filter sizes 5 \u00d7 5 (stride 2) and 3 \u00d7 3 (stride 1), followed by batch normalization and leaky ReLU activation. The first block contains 64 channels, and we double the number after each block. On top of these blocks, we add 3 fully connected layers and shrink the size of the final layer to match our desired outputs. The input image to our model is 128 \u00d7 128. 
The model is trained with the gradient descent optimizer with learning rate 0.0001 and is implemented in Tensorflow v1.1.0 [1].\n\nChamfer distance: We obtain the Chamfer distance map C^I for an input image frame I using a distance transform seeded with the image figure-ground segmentation mask S^I. This assigns to every pixel in C^I the minimum distance to a pixel on the mask foreground. Next, we describe the differentiable computation for C^M used in our method. Let P = {x_2d} denote the set of pixel coordinates for the mesh\u2019s visible projected points. For each pixel location p, we compute the minimum distance between that pixel location and any pixel coordinate in P and obtain a distance map D \u2208 R^{w\u00d7h}. Next, we threshold the distance map D to get the Chamfer distance map C^M and segmentation mask S^M where, for each pixel position p:\n\nC^M(p) = max(0.5, D(p)),    (6)\nS^M(p) = min(0.5, D(p)) + \u03b4(D(p) < 0.5) \u00b7 0.5,    (7)\n\nand \u03b4(\u00b7) is an indicator function.\n\nRay casting: We implemented a standard ray casting algorithm in TensorFlow to accelerate its computation. Let r = (x, d) denote a cast ray, where x is the point the ray casts from and d is a normalized vector for the shooting direction. In our case, all rays cast from the center of the camera. For ease of explanation, we set x at (0, 0, 0). A facet f = (v0, v1, v2) is determined as \u201chit\u201d if it satisfies the following three conditions: (1) the facet is not parallel to the cast ray, (2) the facet is not behind the ray, and (3) the ray passes through the triangle region formed by the three edges of the facet. Given a facet f = (v0, v1, v2), where vi denotes the ith vertex of the facet, the first condition is satisfied if the magnitude of the inner product between the ray cast direction d and the surface normal of the facet is larger than some threshold \u03b5. Here we set \u03b5 to 1e\u22128. 
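Putting the three hit conditions together (the remaining two are spelled out next), the per-facet visibility test can be sketched as follows for a single ray from the camera center at the origin; names are hypothetical and the paper's version is batched in TensorFlow:

```python
import numpy as np

def ray_hits_facet(d, v0, v1, v2, eps=1e-8):
    """Sketch of the facet hit test for a ray from the origin with direction d.

    Returns the distance to the intersection point, or None on a miss.
    """
    n = np.cross(v1 - v0, v2 - v0)
    n = n / np.linalg.norm(n)                 # surface normal N
    denom = np.dot(n, d)
    if abs(denom) < eps:                      # condition 1: facet parallel to the ray
        return None
    if denom * np.dot(n, v0) < 0:             # condition 2: facet behind the ray
        return None
    p = d * (np.dot(n, v0) / denom)           # Eq. 8 with x = 0
    for a, b in ((v0, v1), (v1, v2), (v2, v0)):
        # condition 3: p must lie on the inner side of each edge
        if np.dot(np.cross(b - a, p - a), n) < 0:
            return None
    return np.linalg.norm(p)
```

Running this test over all facets along a ray and keeping the hit with the minimum distance yields the visible facet, matching the tie-breaking rule described in the text.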
The second condition is satisfied if the inner product between the ray direction d and the surface normal N, defined as the normalized cross product between v1 − v0 and v2 − v0, has the same sign as the inner product between v0 and N. Finally, the last condition can be split into three sub-problems: given one of the edges of the facet, determine whether the ray falls on the same side as the facet or not. First, we find the intersection point p of the cast ray and the 2D plane spanned by the facet via:

p = x + d · (<N, v0> / <N, d>),    (8)

where <·, ·> denotes the inner product. Given an edge formed by vertices vi and vj, the cast ray is determined to fall on the same side as the facet if the cross product between the edge vi − vj and the vector p − vj has the same sign as the surface normal N. We examine this condition on all three edges. If all of the above conditions are satisfied, the facet is determined to be hit by the cast ray. Among the hit facets, we choose the one with the minimum distance to the origin as the visible facet seen from the direction of the cast ray.

5 Conclusion

We have presented a learning-based model for dense human 3D body tracking, supervised by synthetic data and self-supervised by differentiable rendering of mesh motion, keypoints, and segmentation, and matching to their 2D equivalent quantities. We show that our model improves by using unlabelled video data, which is very valuable for motion capture, where dense 3D ground truth is hard to annotate. A clear direction for future work is iterative additive feedback [10] on the mesh parameters, for achieving higher 3D reconstruction accuracy and allowing learning of a residual free-form deformation on top of the parametric SMPL model, again in a self-supervised manner.
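The three-condition hit test and the plane intersection of Eq. (8) above can be sketched per facet with numpy; the function name `ray_hits_facet` is ours, the camera sits at the origin, and the batched-over-facets version is omitted:

```python
import numpy as np

EPS = 1e-8  # the threshold epsilon from the text

def ray_hits_facet(d, v0, v1, v2):
    """Per-facet hit test following the three conditions in the text.
    The ray starts at x = (0, 0, 0) with direction d. Returns (hit, distance)
    where distance is from the origin to the intersection point p."""
    d = np.asarray(d, float) / np.linalg.norm(d)
    v0, v1, v2 = (np.asarray(v, float) for v in (v0, v1, v2))
    n = np.cross(v1 - v0, v2 - v0)
    n = n / np.linalg.norm(n)
    # (1) the facet is not parallel to the cast ray
    if abs(np.dot(d, n)) <= EPS:
        return False, None
    # (2) the facet is not behind the ray: <d, N> and <v0, N> share a sign
    if np.dot(d, n) * np.dot(v0, n) <= 0:
        return False, None
    # intersection with the facet's plane, Eq. (8) with x = 0
    p = d * (np.dot(n, v0) / np.dot(n, d))
    # (3) p lies on the inner side of all three edges: the cross product of
    # each edge with p - vertex must have the same sign as the normal
    for a, b in ((v0, v1), (v1, v2), (v2, v0)):
        if np.dot(np.cross(b - a, p - a), n) < 0:
            return False, None
    return True, np.linalg.norm(p)
```

Among all facets for which this returns a hit, the one with the smallest returned distance is the visible facet along that ray.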
Extensions of our model beyond human 3D shape would allow neural agents to learn 3D structure from experience as humans do, supervised solely by video motion.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.

[2] T. Alldieck, M. Kassubeck, and M. A. Magnor. Optical flow-based 3d human motion estimation from monocular video. CoRR, abs/1703.00177, 2017.

[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, June 2014.

[4] A. Balan, L. Sigal, M. J. Black, J. Davis, and H. Haussecker. Detailed human shape and pose from images. In CVPR, pages 1-8, Minneapolis, June 2007.

[5] L. Ballan and G. M. Cortelazzo. Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes. In 3DPVT, 2008.

[6] L. Bo and C. Sminchisescu. Twin gaussian processes for structured prediction. International Journal of Computer Vision, 87(1):28-52, 2010.

[7] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In ECCV, 2016.

[8] T. Brox, B. Rosenhahn, D. Cremers, and H.-P. Seidel. High accuracy optical flow serves 3-d pose tracking: exploiting contour and flow based constraints.
In ECCV, 2006.

[9] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.

[10] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. arXiv preprint arXiv:1507.06550, 2015.

[11] C. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. CoRR, abs/1612.06524, 2016.

[12] W. Chen, H. Wang, Y. Li, H. Su, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen. Synthesizing training images for boosting human 3d pose estimation. CoRR, abs/1604.02703, 2016.

[13] K. Choo and D. J. Fleet. People tracking using hybrid monte carlo filtering. In ICCV, volume 2, pages 321-328, 2001.

[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[15] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.

[16] D. Fleet, A. Jepson, and T. El-Maraghi. Robust on-line appearance models for vision tracking. In CVPR, 2001.

[17] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In CVPR, pages 1746-1753. IEEE, 2009.

[18] R. Garg, B. V. Kumar, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, pages 740-756. Springer, 2016.

[19] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison.
gvnn: Neural network library for geometric computer vision. In ECCV 2016 Workshops, pages 67-82. Springer International Publishing, 2016.

[20] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.

[21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. CoRR, abs/1612.01925, 2016.

[22] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325-1339, July 2014.

[23] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.

[24] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.

[25] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1-248:16, Oct. 2015.

[26] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. CoRR, abs/1511.06309, 2015.

[27] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. CoRR, abs/1611.07828, 2016.

[28] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human pose from 2d image landmarks. In ECCV, pages 573-586, 2012.

[29] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In NIPS, 2016.

[30] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint, 2017.

[31] C. Tomasi and T. Kanade.
Shape and motion from image streams under orthography: A factorization method. Int. J. Comput. Vision, 9(2):137-154, Nov. 1992.

[32] D. Tomè, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. CoRR, abs/1701.00295, 2017.

[33] H. F. Tung, A. Harley, W. Seto, and K. Fragkiadaki. Adversarial inverse graphics networks: Learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In ICCV, 2017.

[34] R. Urtasun, D. Fleet, and P. Fua. Gaussian process dynamical models for 3d people tracking. In CVPR, 2006.

[35] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.

[36] S. Vicente, J. Carreira, L. Agapito, and J. Batista. Reconstructing PASCAL VOC. In CVPR, pages 41-48, 2014.

[37] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.

[38] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Single image 3D interpreter network. In ECCV, 2016.

[39] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NIPS, pages 1696-1704, 2016.

[40] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. arXiv preprint, 2017.

[41] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis. MonoCap: Monocular human motion capture using a CNN coupled with a geometric prior.
CoRR, abs/1701.02354, 2017.