{"title": "Learning Temporal Pose Estimation from Sparsely-Labeled Videos", "book": "Advances in Neural Information Processing Systems", "page_first": 3027, "page_last": 3038, "abstract": "Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames---a labeled Frame A and an unlabeled Frame B---we train our model to predict human pose in Frame A using the features from Frame B by means of deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually-labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs 39M parameters), and also more accurate (88.7% mAP vs 83.8% mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented dataset obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows us to obtain state-of-the-art pose detection results on PoseTrack2017 and PoseTrack2018 datasets.", "full_text": "Learning Temporal Pose Estimation\n\nfrom Sparsely-Labeled Videos\n\nGedas Bertasius1,2, Christoph Feichtenhofer1, Du Tran1, Jianbo Shi2, Lorenzo Torresani1\n\n1Facebook AI, 2University of Pennsylvania\n\nAbstract\n\nModern approaches for multi-person pose estimation in video require large amounts\nof dense annotations. However, labeling every frame in a video is costly and labor\nintensive. To reduce the need for dense annotations, we propose a PoseWarper\nnetwork that leverages training videos with sparse annotations (every k frames) to\nlearn to perform dense temporal pose propagation and estimation. Given a pair\nof video frames\u2014a labeled Frame A and an unlabeled Frame B\u2014we train our\nmodel to predict human pose in Frame A using the features from Frame B by\nmeans of deformable convolutions to implicitly learn the pose warping between A\nand B. We demonstrate that we can leverage our trained PoseWarper for several\napplications. First, at inference time we can reverse the application direction\nof our network in order to propagate pose information from manually annotated\nframes to unlabeled frames. This makes it possible to generate pose annotations\nfor the entire video given only a few manually-labeled frames. Compared to\nmodern label propagation methods based on optical \ufb02ow, our warping mechanism\nis much more compact (6M vs 39M parameters), and also more accurate (88.7%\nmAP vs 83.8% mAP). We also show that we can improve the accuracy of a\npose estimator by training it on an augmented dataset obtained by adding our\npropagated poses to the original manual labels. 
Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows us to obtain state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets. Code has been made available at: https://github.com/facebookresearch/PoseWarper.

1 Introduction

In recent years, visual understanding methods [1-15] have made tremendous progress, partly because of advances in deep learning [16-19], and partly due to the introduction of large-scale annotated datasets [20, 21]. In this paper we consider the problem of pose estimation, which has greatly benefitted from the recent creation of large-scale datasets [22, 23]. Most of the recent advances in this area, though, have concentrated on the task of pose estimation in still images [3, 23-27]. However, directly applying these image-level models to video is challenging due to nuisance factors such as motion blur, video defocus, and frequent pose occlusions. Additionally, the process of collecting annotated pose data in multi-person videos is costly and time consuming. A video typically contains hundreds of frames that need to be densely labeled by human annotators. As a result, datasets for video pose estimation [22] are typically smaller and less diverse compared to their image counterparts [21]. This is problematic because modern deep models require large amounts of labeled data to achieve good performance. At the same time, videos have high informational redundancy, as the content changes little from frame to frame. This raises the question of whether every single frame in a training video needs to be labeled in order to achieve good pose estimation accuracy.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To reduce the reliance on densely annotated video pose data, in this work we introduce the PoseWarper network, which operates on sparsely annotated videos, i.e., videos where pose annotations are given only every k frames. Given a pair of frames from the same video—a labeled Frame A and an unlabeled Frame B—we train our model to detect pose in Frame A using the features from Frame B. To achieve this goal, our model leverages deformable convolutions [28] across space and time. Through this mechanism, our model learns to sample features from an unlabeled Frame B to maximize pose detection accuracy in a labeled Frame A.

Our trained PoseWarper can then be used for several applications. First, we can leverage PoseWarper to propagate pose information from a few manually-labeled frames across the entire video. Compared to modern optical flow propagation methods such as FlowNet2 [29], our PoseWarper produces more accurate pose annotations (88.7% mAP vs 83.8% mAP), while also employing a much more compact warping mechanism (6M vs 39M parameters). Furthermore, we show that our propagated poses can serve as effective pseudo labels for training a more accurate pose detector. Finally, our PoseWarper can be used to aggregate temporal pose information from neighboring frames during inference. This naturally renders the approach more robust to occlusion or motion blur in individual frames, and leads to state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets [22].

2 Related Work

Multi-Person Pose Detection in Images.
The traditional approaches for pose estimation leverage the pictorial structures model [30-34], which represents the human body as a tree-structured graph with pairwise potentials between the connected body parts. These approaches have been highly successful in the past, but they tend to fail if some of the body parts are occluded. These issues have been partially addressed by models that assume a non-tree graph structure [35-38]. However, most modern approaches for single-image pose estimation are based on convolutional neural networks [3, 6, 23-27, 39-45]. The method in [3] regresses (x, y) joint coordinates directly from the images. More recent work [25] instead predicts pose heatmaps, which leads to an easier optimization problem. Several approaches [24, 26, 39, 46] propose an iterative pose estimation pipeline where the predictions are refined at different stages inside a CNN or via a recurrent network. The methods in [6, 23, 45] tackle the pose estimation problem in a top-down fashion, first detecting bounding boxes of people, and then predicting the pose heatmaps from the cropped images. The work in [24] proposes a part affinity fields module that captures pairwise relationships between different body parts. The approaches in [42, 43] leverage a bottom-up pipeline, first predicting the keypoints and then assembling them into instances. Lastly, recent work [27] proposes an architecture that preserves high-resolution feature maps, which is shown to be highly beneficial for the multi-person pose estimation task.

Multi-Person Pose Detection in Video. Due to the limited number of large-scale benchmarks for video pose detection, there have been significantly fewer methods in the video domain. Several prior methods [22, 47, 48] tackle video pose estimation as a two-stage problem, first detecting the keypoints in individual frames, and then applying temporal smoothing techniques. The method in [49] proposes a spatiotemporal CRF, which is jointly optimized for pose prediction in video. The work in [50] proposes a personalized video pose estimation framework, which is accomplished by finetuning the model on a few frames with high-confidence keypoints in each video. The approaches in [51, 52] leverage flow-based representations for aligning features temporally across multiple frames, and then use the aligned features for pose detection in individual frames.

In contrast to these prior methods, our primary objective is to learn an effective video pose detector from sparsely labeled videos. Our approach has similarities to the methods in [51, 52], which use flow representations for feature alignment. However, unlike our model, the methods in [51, 52] do not optimize their flow representations end-to-end with respect to the pose detection task. As we will show in our experiments, this is important for strong performance.

3 The PoseWarper Network

Overview. Our goal is to design a model that learns to detect pose from sparsely labeled videos. Specifically, we assume that pose annotations in training videos are available every k frames. Inspired by a recent self-supervised approach for learning facial attribute embeddings [53], we formulate the following task.
Given two video frames—a labeled Frame A and an unlabeled Frame B—our model is allowed to compare Frame A to Frame B, but it must predict Pose A (i.e., the pose in Frame A) using the features from Frame B, as illustrated in Figure 1 (top).

Figure 1: A high-level overview of our approach for using sparsely labeled videos for pose detection. Faces in the figure are artificially masked for privacy reasons. In each training video, pose annotations are available only every k frames. During training, our system considers a pair of frames, a labeled Frame A and an unlabeled Frame B, and aims to detect pose in Frame A using the features from Frame B. Our training procedure is designed to achieve two goals: 1) our model must be able to extract motion offsets relating these two frames; 2) using these motion offsets, our model must then be able to rewarp the detected pose heatmap extracted from the unlabeled Frame B in order to optimize the accuracy of pose detection in the labeled Frame A. After training, we can apply our model in the reverse order to propagate pose information across the entire video from ground-truth poses given for only a few frames.

At first glance, this task may look overly challenging: how can we predict Pose A by merely using features from Frame B? However, suppose that we had body joint correspondences between Frame A and Frame B. In such a scenario, this task would become trivial, as we would simply need to spatially "warp" the feature maps computed from Frame B according to the set of correspondences relating Frame B to Frame A. Based on this intuition, we design a learning scheme that achieves two goals: 1) by comparing Frame A and Frame B, our model must be able to extract motion offsets relating these two frames; 2) using these motion offsets, our model must be able to rewarp the pose extracted from the unlabeled Frame B in order to optimize pose detection accuracy in the labeled Frame A.

To achieve these goals, we first feed both frames through a backbone CNN that predicts pose heatmaps for each of the frames. Then, the resulting heatmaps from both frames are used to determine which points from Frame B should be sampled for detection in Frame A. Finally, the resampled pose heatmap from Frame B is used to maximize the accuracy of Pose A.

Backbone Network. Due to its high efficiency and accuracy, we use the state-of-the-art High Resolution Network (HRNet-W48) [27] as our backbone CNN. However, we note that our system can easily integrate other architectures as well. Thus, we envision that future improvements in still-image pose estimation will further improve the effectiveness of our approach.

Deformable Warping. Initially, we feed Frame A and Frame B through our backbone CNN, which outputs pose heatmaps fA and fB. Then, we compute the difference ψA,B = fA − fB. The resulting feature tensor ψA,B is provided as input to a stack of 3 × 3 simple residual blocks (as in standard ResNet-18 or ResNet-34 models), which output a feature tensor φA,B. The feature tensor φA,B is then fed into five 3 × 3 convolutional layers, each using a different dilation rate d ∈ {3, 6, 12, 18, 24},
to predict five sets of offsets o(d)(pn) at all pixel locations pn. The motivation for using different dilation rates at the offset prediction stage comes from the need to consider motion cues at different spatial scales. When the body motion is small, a smaller dilation rate may be more useful, as it captures subtle motion cues. Conversely, if the body motion is large, a larger dilation rate allows us to incorporate relevant information from further away. Next, the predicted offsets are used to spatially rewarp the pose heatmap fB. We do this for each of the five sets of offsets o(d), and then sum up all five rewarped pose heatmaps to obtain a final output gA,B, which is used to predict pose in Frame A.

We implement the warping mechanism via a deformable convolution [28], which takes 1) the offsets o(d)(pn) and 2) the pose heatmap fB as its inputs, and then outputs a newly sampled pose heatmap gA,B. The subscript (A, B) is used to indicate that even though gA,B was resampled from the tensor fB, the offsets for rewarping were computed using ψA,B, which contains information from both frames. An illustration of our architecture is presented in Figure 2.

Figure 2: An illustration of our PoseWarper architecture. Given a labeled Frame A and an unlabeled Frame B, which are separated by δ steps in time, our goal is to detect pose in the labeled Frame A using the features from the unlabeled Frame B. First, we predict pose heatmaps for both frames. Then, we compute the difference between the pose heatmaps of Frame A and Frame B and feed it through a stack of 3 × 3 residual blocks. Afterwards, we attach five 3 × 3 convolutional layers with dilation rates d ∈ {3, 6, 12, 18, 24} and predict five sets of offsets o(d)(pn) for each pixel location pn. The predicted offsets are used to rewarp pose heatmap B. All five rewarped heatmaps are then summed, and the resulting tensor is used to predict pose in Frame A.
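To make this mechanism concrete, below is a minimal PyTorch sketch of the offset prediction and deformable resampling described above, written against torchvision's deform_conv2d operator; here phi stands for φA,B produced by the residual blocks. The module name, channel sizes, and weight initialization are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class DeformableWarper(nn.Module):
        # Sketch: rewarps the pose heatmap fB using offsets predicted from
        # phi (the features computed from the difference fA - fB).
        def __init__(self, heatmap_ch=17, feat_ch=128, dilations=(3, 6, 12, 18, 24)):
            super().__init__()
            self.dilations = dilations
            # one offset head per dilation rate: 2 * 3 * 3 = 18 (x, y) offsets
            # per spatial location for a 3 x 3 deformable kernel
            self.offset_heads = nn.ModuleList(
                nn.Conv2d(feat_ch, 18, 3, padding=d, dilation=d) for d in dilations)
            # one 3 x 3 deformable-convolution kernel per dilation rate
            self.weights = nn.ParameterList(
                nn.Parameter(0.01 * torch.randn(heatmap_ch, heatmap_ch, 3, 3))
                for _ in dilations)

        def forward(self, phi, f_b):
            g = 0.0
            for head, w, d in zip(self.offset_heads, self.weights, self.dilations):
                offsets = head(phi)  # o(d)(pn), shape (N, 18, H, W)
                # resample fB at the offset locations; padding=d preserves H, W
                g = g + deform_conv2d(f_b, offsets, w, padding=d, dilation=d)
            return g  # gA,B: the sum of the five rewarped heatmaps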
Loss Function. As in [27], we use a standard pose estimation loss function, which computes the mean squared error between the predicted and the ground-truth heatmaps. The ground-truth heatmap is generated by applying a 2D Gaussian around the location of each joint.

Pose Annotation Propagation. During training, we force our model to warp the pose heatmap fB from an unlabeled Frame B such that it matches the ground-truth pose heatmap in a labeled Frame A. Afterwards, we can reverse the application direction of our network. This then allows us to propagate pose information from manually annotated frames to unlabeled frames (i.e., from a labeled Frame A to an unlabeled Frame B). Specifically, given a pose annotation in Frame A, we can generate its respective ground-truth heatmap yA by applying a 2D Gaussian around the location of each joint (identically to how it was done in [23, 27]). Then, we can predict the offsets for warping the ground-truth heatmap yA to an unlabeled Frame B from the feature difference ψB,A = fB − fA. Lastly, we use our deformable warping scheme to warp the ground-truth pose heatmap yA to Frame B, thus propagating pose annotations to unlabeled frames in the same video. See Figure 1 (bottom) for a high-level illustration of this scheme.
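As a rough illustration of this propagation step, the following sketch reuses the DeformableWarper above; gaussian_heatmaps, backbone, and rel_features are hypothetical stand-ins for the corresponding components (the sigma value is likewise an assumption). During training, targets built the same way for Frame A are compared against gA,B with the mean squared error loss.

    import torch

    def gaussian_heatmaps(joints, h, w, sigma=3.0):
        # One 2D Gaussian per (x, y) joint location, as in [23, 27].
        # The value of sigma is an assumption, not taken from the paper.
        ys = torch.arange(h, dtype=torch.float32).view(1, h, 1)
        xs = torch.arange(w, dtype=torch.float32).view(1, 1, w)
        x0, y0 = joints[:, 0].view(-1, 1, 1), joints[:, 1].view(-1, 1, 1)
        return torch.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

    def propagate_annotation(frame_a, frame_b, joints_a, backbone, rel_features, warper):
        # Reverse the application direction: offsets are predicted from
        # psi_{B,A} = fB - fA and applied to the ground-truth heatmap yA,
        # yielding a pseudo label aligned with the unlabeled Frame B.
        f_a, f_b = backbone(frame_a), backbone(frame_b)
        h, w = f_a.shape[-2], f_a.shape[-1]
        y_a = gaussian_heatmaps(joints_a, h, w).unsqueeze(0)  # (1, J, H, W)
        phi = rel_features(f_b - f_a)
        return warper(phi, y_a)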
Temporal Pose Aggregation at Inference Time. Instead of using our model to propagate pose annotations on training videos, we can also use our deformable warping mechanism to aggregate pose information from nearby frames during inference, in order to improve the accuracy of pose detection. For every frame at time t, we want to aggregate information from all frames at times t + δ, where δ ∈ {−3, −2, −1, 0, 1, 2, 3}. Such a pose aggregation procedure renders pose estimation more robust to occlusions, motion blur, and video defocus.
Figure 3: The results of the video pose propagation task for our PoseWarper and FlowNet2 [29]. The first frame in each 3-frame sequence illustrates a labeled reference frame at time t. For simplicity, we show only the "right ankle" body joint for one person, denoted by a pink circle in each of the frames (please zoom in for a clearer view). The second frame depicts our propagated "right ankle" detection from the labeled frame at time t to the unlabeled frame at time t+1. The third frame shows the propagated detection in frame t+1 produced by the FlowNet2 baseline. In contrast to our method, FlowNet2 fails to propagate poses when there is large motion, blurriness, or occlusion.

Consider a pair of frames, It and It+δ. In this case, we want to use pose information from frame It+δ to improve pose detection in frame It. To do this, we first feed both frames through our trained PoseWarper model and obtain a spatially rewarped (resampled) pose heatmap gt,t+δ, which is aligned with respect to frame It using the features from frame It+δ. We can repeat this procedure for every δ value, and then aggregate pose information from multiple frames via the summation Σδ gt,t+δ.
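In code, this aggregation step might look like the sketch below; backbone, rel_features, and warper denote the trained components from the sketches above, and the function interface is hypothetical.

    import torch

    def aggregate_pose(frames, t, backbone, rel_features, warper):
        # Inference-time aggregation: g_t = sum over delta of g_{t,t+delta},
        # with each neighbor's heatmap warped into alignment with frame t.
        f_t = backbone(frames[t])
        g = torch.zeros_like(f_t)
        for delta in range(-3, 4):
            if not 0 <= t + delta < len(frames):
                continue  # neighbor falls outside the video
            f_n = backbone(frames[t + delta])
            phi = rel_features(f_t - f_n)  # features relating the two frames
            g = g + warper(phi, f_n)       # g_{t,t+delta}, aligned to frame t
        return g  # aggregated heatmaps used for the final pose prediction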
Implementation Details. Following the framework in [27], for training, we crop a 384 × 288 bounding box around each person and use it as input to our model. During training, we use ground-truth person bounding boxes. We also employ random rotations, scaling, and horizontal flipping to augment the data. To learn the network, we use the Adam optimizer [54] with a base learning rate of 10−4, which is reduced to 10−5 and 10−6 after 10 and 15 epochs, respectively. The training is performed using 4 Tesla M40 GPUs and is terminated after 20 epochs. We initialize our model with an HRNet-W48 [27] pretrained for the COCO keypoint estimation task. To train the deformable warping module, we select Frame B with a random time-gap δ ∈ [−3, 3] relative to Frame A. To compute features relating the two frames, we use twenty 3 × 3 residual blocks, each with 128 channels. Even though this seems like many convolutional layers, due to the small number of channels in each layer, this amounts to only 5.8M parameters (compared to the 39M required to compute optical flow in [29]). To compute the offsets o(d), we use five 3 × 3 convolutional layers, each using a different dilation rate (d = 3, 6, 12, 18, 24). To resample the pose heatmap fB, we employ five 3 × 3 deformable convolutional layers, each applied to one of the five predicted offset maps o(d). The five deformable convolution layers likewise employ the different dilation rates of 3, 6, 12, 18, 24. During testing, we follow the same two-stage framework used in [27, 23]: first, we detect the bounding boxes for each person in the image using the detector in [48], and then feed the cropped images to our pose estimation model.

4 Experiments

In this section, we present our results on the PoseTrack [22] dataset. We demonstrate the effectiveness of our approach on three applications: 1) video pose propagation, 2) training a network on annotations augmented with propagated pose pseudo-labels, and 3) temporal pose aggregation during inference.

Table 1: The results of video pose propagation on the PoseTrack2017 [22] validation set (measured in mAP). We propagate pose information across the entire video from the manual annotations provided in a few frames. To study the effect of different levels of dilated convolutions in our PoseWarper architecture, we also include several ablation baselines (see the bottom half of the table).

Method                                      Head  Shoulder  Elbow  Wrist  Hip   Knee  Ankle  Mean
Pseudo-labeling w/ HRNet [27]               79.1  86.5      81.4   74.7   81.4  79.4  72.3   79.3
Optical Flow Propagation (Farneback [55])   76.5  82.3      74.3   69.2   80.8  74.8  70.1   75.5
Optical Flow Propagation (FlowNet2 [29])    82.7  91.0      83.8   78.4   89.7  83.6  78.1   83.8
PoseWarper (no dilated convs)               86.1  91.7      88.0   83.5   90.2  87.3  84.6   87.2
PoseWarper (1 dilated conv)                 85.0  91.6      88.0   83.7   89.6  87.3  84.7   87.0
PoseWarper (2 dilated convs)                85.8  92.4      88.8   84.9   91.0  88.4  86.0   88.0
PoseWarper (3 dilated convs)                86.1  92.6      89.2   85.5   91.3  88.8  86.3   88.4
PoseWarper (4 dilated convs)                86.3  92.6      89.5   85.9   91.9  88.8  86.4   88.6
PoseWarper (5 dilated convs)                86.0  92.7      89.5   86.0   91.5  89.1  86.6   88.7
4.1 Video Pose Propagation

Quantitative Results. To verify that our model learns to capture pose correspondences, we apply it to the task of video pose propagation, i.e., propagating poses across time from a few labeled frames. Initially, we train our PoseWarper in a sparsely labeled video setting according to the procedure described above. In this setting, every 7th frame of a training video is labeled, i.e., there are 6 unlabeled frames between each pair of manually labeled frames. Since each video contains on average 30 frames, we have approximately 5 annotated frames uniformly spaced out in each video. Our goal, then, is to use our learned PoseWarper to propagate pose annotations from manually-labeled frames to all unlabeled frames in the same video. Specifically, for each labeled frame in a video, we propagate its pose information to the three preceding and three subsequent frames. We train our PoseWarper on sparsely labeled videos from the training set of PoseTrack2017 [22] and then perform our evaluations on the validation set.

To evaluate the effectiveness of our approach, we compare our model to several relevant baselines. As our weakest baseline, we use our trained HRNet [27] model, which simply predicts pose for every single frame in a video. Furthermore, we also include a few propagation baselines based on warping annotations using optical flow. The first of these uses standard Farneback optical flow [55] to warp the manually-labeled pose in each labeled frame to its three preceding and three subsequent frames. We also include a more advanced optical flow propagation baseline that uses FlowNet2 optical flow [29]. Finally, we evaluate our PoseWarper model.

In Table 1, we present our quantitative results for video pose propagation. The evaluation is done using the mAP metric, as in [42]. Our best model achieves 88.7% mAP, while the optical flow propagation baseline using FlowNet2 [29] yields an accuracy of 83.8% mAP. We also note that compared to the FlowNet2 [29] propagation baseline, our PoseWarper warping mechanism is not only more accurate, but also significantly more compact (6M vs 39M parameters).

Ablation Studies on Dilated Convolution. In Table 1, we also present results investigating the effect of different levels of dilated convolutions in our PoseWarper architecture. We evaluate all these variants on the task of video pose propagation. First, we report that removing the dilated convolution blocks from the original architecture reduces the accuracy from 88.7 mAP to 87.2 mAP. We also note that a network with a single dilated convolution (using a dilation rate of 3) yields 87.0 mAP. Adding a second dilated convolution level (using dilation rates of 3, 6) improves the accuracy to 88.0. Three dilation levels (with dilation rates of 3, 6, 12) yield an mAP of 88.4, and four levels (dilation rates of 3, 6, 12, 18) give an mAP of 88.6. A network with 5 dilated convolution levels yields 88.7 mAP. Adding more dilated convolutions does not improve the performance further. Additionally, we also experimented with two networks that use dilation rates of 1, 2, 3, 4, 5 and of 4, 8, 16, 24, 32, and report that such models yield mAPs of 88.6 and 88.5, respectively, which are slightly lower.

Qualitative Comparison to FlowNet2. In Figure 3, we include an illustration of the motion encoded by PoseWarper, and compare it to the optical flow computed by FlowNet2 for the video pose propagation task. The first frame in each 3-frame sequence illustrates a labeled reference frame at time t. For a cleaner visualization, we show only the "right ankle" body joint for one person, which is marked with a pink circle in each of the frames. The second frame depicts our propagated "right ankle" detection from the labeled frame at time t to the unlabeled frame at time t+1. The third frame shows the propagated detection in frame t+1 produced by the FlowNet2 baseline. These results suggest that FlowNet2 struggles to accurately warp poses if there is 1) large motion, 2) occlusion, or 3) blurriness. In contrast, our PoseWarper handles these cases robustly, which is also indicated by our results in Table 1 (i.e., 88.7 vs 83.8 mAP w.r.t. FlowNet2).
In both\nsettings, we study pose detection performance as a function of 1) number of sparsely-labeled training\nvideos (with 1 manually-labeled frame per video), and 2) number of labeled frames per video (with\n50 sparsely-labeled videos in total). All baselines are based on retraining the standard HRNet [27]\nmodel on the different training sets. The \"GT (1x)\" baseline is trained in a standard way on sparsely\nlabeled video data. The \"GT (7x)\" baseline uses 7x more manually annotated data relative to the\n\"GT (1x)\" baseline. Our approach on the left sub\ufb01gure (\"GT (1x) + pGT (6x)\"), augments the\noriginal sparsely labeled video data with our propagated pose pseudo labels (6 nearby frames for\nevery manually-labeled frame). Lastly, in b) \"GT (1x) + T-Agg\" denotes the use of PoseWarper to\nfuse pose information from multiple neighboring frames during inference (training is done as in \"GT\n(1x)\" baseline). From the results, we observe that both application modalities of PoseWarper provide\nan effective way to achieve strong pose accuracy while reducing the number of manual annotations.\n\nankle\u201d detection from the labeled frame in time t to the unlabeled frame in time t+1. The third frame\nshows the propagated detection in frame t+1 produced by the FlowNet2 baseline. These results\nsuggest that FlowNet2 struggles to accurately warp poses if 1) there is large motion, 2) occlusions, or\n3) blurriness. In contrast, our PoseWarper handles these cases robustly, which is also indicated by our\nresults in Table 1 (i.e., 88.7 vs 83.8 mAP w.r.t. FlowNet2).\n\n4.2 Data Augmentation with PoseWarper\n\nHere we consider the task of propagating poses on sparsely labeled training videos using PoseWarper,\nand then using them as pseudo-ground truth labels (in addition to the original manual labels) to\ntrain a standard HRNet-W48 [27]. For this experiment, we study the pose detection accuracy as\na function of two variables: 1) the total number of sparsely-labeled videos, and 2) the number of\nmanually-annotated frames per video. We aim to study how much we can reduce manual labeling\nthrough our mechanism of pose propagation, while maintaining strong pose accuracy. Note, that we\n\ufb01rst train our PoseWarper on sparsely labeled videos from the training set of PoseTrack2017 [22].\nThen, we propagate pose annotations on the same set of training videos. Afterwards, we retrain the\nmodel on the joint training set comprised of sparse manual pose annotations and our propagated\nposes. Lastly, we evaluate this trained model on the validation set.\nAll results are based on a standard HRNet [27] model trained on different forms of training data. \"GT\n(1x)\" refers to a model trained on sparsely labeled videos using ground-truth annotations only. \"GT\n(7x)\" baseline employs 7x more manually-annotated poses relative to \"GT (1x)\" (the annotations\nare part of the PoseTrack2017 training set). In comparison, our approach (\"GT (1x) + pGT (6x)\"),\nis trained on a joint training set consisting of sparse manual pose annotations (same as \"GT (1x)\"\nbaseline) and our propagated poses (on the training set of PoseTrack2017), which we use as pseudo\nground truth data (pGT). As before, for every labeled frame we propagate the ground truth pose to\nthe 3 previous and the 3 subsequent frames, which allows us to expand the training set by 7 times.\nBased on the results in the left sub\ufb01gure of Figure 4, we can draw several conclusions. 
Based on the results in the left subfigure of Figure 4, we can draw several conclusions. First, we note that when there are very few labeled videos (i.e., 5), all three baselines perform poorly (leftmost figure). This indicates that in this setting there is not enough data to learn an effective pose detection model. Second, we observe that when the number of labeled videos is somewhat reasonable (e.g., 50-100), our approach significantly outperforms the "GT (1x)" baseline, and is only slightly worse relative to the "GT (7x)" baseline. As we increase the number of labeled videos, the gaps among the three methods shrink, suggesting that the model becomes saturated.

As we vary the number of labeled frames per video (second leftmost figure), we notice several interesting patterns. First, we note that for a small number of labeled frames per video (i.e., 1-2), our approach outperforms the "GT (1x)" baseline by a large margin. Second, we note that the performance of our approach and the "GT (7x)" baseline becomes very similar as we add 2 or more labeled frames per video. These findings further strengthen our previous observation that PoseWarper allows us to reduce the annotation cost without a significant loss in performance.

[Figure 4 plots: panels a) Training HRNet with Propagated Pose Pseudo Labels and b) Temporal Pose Aggregation during Inference; accuracy (mAP) plotted against the number of labeled training videos and the number of labeled frames per video for the "GT (7x)", "GT (1x) + pGT (6x)" / "GT (1x) + T-Agg.", and "GT (1x)" baselines.]

Table 2: Multi-person pose estimation results on the validation and test sets of the PoseTrack2017 and PoseTrack2018 datasets. Even though our model is designed to improve pose detection in scenarios involving sparsely-labeled videos, here we show that our temporal pose aggregation scheme during inference is also useful for models trained on densely labeled videos. We improve upon the state-of-the-art single-frame baselines [23, 27, 56].

Dataset

PoseTrack17 Val Set

PoseTrack17 Test Set

PoseTrack18 Val Set

PoseTrack18 Test Set

Method

Girdhar et al. [48]

Xiu et al. [57]
Bin et al [23]
HRNet [27]
MDPN [56]
PoseWarper

Girdhar et al. [48]

Xiu et al. [57]
Bin et al [23]
HRNet [27]
PoseWarper
AlphaPose [58]

MDPN [56]
PoseWarper

AlphaPose++ [56, 58]

MDPN [56]
PoseWarper

65.3
68.3
80.0
80.4
83.9
83.9

-

75.6
73.3
83.4
83.6
88.5
88.3

-

67.5
80.2
80.2
84.3
78.7
81.2
86.3

54.3 63.5 60.9 51.8
61.1 67.5 67.0 61.3
72.4 75.3 74.8 67.1
73.3 75.5 75.3 68.5
77.5 79.0 77.0 71.4
78.0 82.4 80.5 73.6

Head Shoulder Elbow Wrist Hip Knee Ankle Mean
64.1
72.8
66.5
66.7
81.7
76.7
82.1
77.3
80.7
85.2
81.2
81.4
59.6
63.0
74.6
74.9
77.9
71.9
75.0
79.7
67.6
76.4
78.0

59.0 62.5 62.8 57.9
71.5 72.5 72.4 65.7
72.0 73.4 72.5 67.0
75.8 77.6 76.8 70.8
71.0 73.7 73.0 69.7
74.1 72.4 73.0 69.9
77.5 79.8 78.8 73.2
65.0
66.2
74.5
69.0
76.8 75.6 77.5 71.8

-

64.9
80.1
80.1
79.5
63.9
75.4
79.9

65.0
76.9
76.9
80.1
77.4
79.0
82.4

80.9

78.9

84.4

-
-

-
-

-
-

4.3 Improved Pose Estimation via Temporal Pose Aggregation

In this subsection, we assess the ability of PoseWarper to improve the accuracy of pose estimation at test time by using our deformable warping mechanism to aggregate pose information from nearby frames.
We visualize our results in Figure 4 b), where we evaluate the effectiveness of our temporal pose aggregation during inference for models trained a) with a different number of labeled videos (second rightmost figure), and b) with a different number of manually-labeled frames per video (rightmost figure). We compare our approach ("GT (1x) + T-Agg.") to the same "GT (7x)" and "GT (1x)" baselines defined in the previous subsection. Note that our method in this case is trained exactly as the "GT (1x)" baseline; the only difference comes from the inference procedure.

When the number of training videos and/or manually labeled frames is small, our approach provides a significant accuracy boost with respect to the "GT (1x)" baseline. However, once we increase the number of labeled videos/frames, the gap between all three baselines shrinks, and the model becomes more saturated. Thus, our temporal pose aggregation scheme during inference is another effective way to maintain strong performance in a sparsely-labeled video setting.

4.4 Comparison to State-of-the-Art

We also test the effectiveness of our temporal pose aggregation scheme when the model is trained on the full PoseTrack [22] dataset. Table 2 compares our method to the most recent approaches in this area [48, 57, 23, 27]. These results suggest that although we designed our method to improve pose estimation when training videos are sparsely-labeled, our temporal pose aggregation scheme applied at inference is also useful for models trained on densely-labeled videos. Our PoseWarper obtains 81.2 mAP and 77.9 mAP on the PoseTrack2017 validation and test sets respectively, and 79.7 mAP and 78.0 mAP on the PoseTrack2018 validation and test sets respectively, thus outperforming prior single-frame baselines [48, 57, 23, 27].

Figure 5: In the first two columns, we show a pair of video frames used as input for our model. The 3rd and 4th columns depict 2 randomly selected offset channels visualized as a motion field. Different channels appear to capture the motion of different body parts. In the 5th column, we display the offset magnitudes, which highlight salient human motion. Finally, the last two columns illustrate the standard Farneback flow and the human motion predicted from our learned offsets. To predict human motion, we train a linear classifier to regress the ground-truth (x, y) displacement of each joint from the offset maps. The color wheel at the bottom right corner encodes motion direction.

[Figure 5 panels, left to right: Frame t, Frame t+5, Channel 99 (x, y), Channel 123 (x, y), Offset Magnitudes, Farneback Flow, Predicted Human Motion.]

4.5 Interpreting Learned Offsets

Understanding what information is encoded in our learned offsets is nearly as difficult as analyzing any other CNN features [59, 60]. The main challenge comes from the high dimensionality of the offsets: we are predicting c × kh × kw (x, y) displacements for every pixel, for each of the five dilation rates d, where c is the number of channels, and kh, kw are the convolutional kernel height and width, respectively.

In columns 3 and 4 of Figure 5, we visualize two randomly-selected offset channels as a motion field. Based on this figure, it appears that different offset maps encode different motions rather than all predicting the same solution (say, the optical flow between the two frames). This makes sense, as the network may decide to ignore motions of uninformative regions, and instead capture the motion of different human body parts in different offset maps (say, a hand as opposed to the head). We also note that the magnitudes of our learned offsets encode salient human motion (see Column 5 of Figure 5).

Lastly, to verify that our learned offsets encode human motion, for each point pn denoting a body joint, we extract our predicted offsets and train a linear classifier to regress the ground-truth (x, y) motion displacement of this body joint. In Column 7 of Figure 5, we visualize our predicted motion outputs for every pixel. We show Farneback's optical flow in Column 6. Note that in regions containing people, our predicted human motion matches the Farneback optical flow. Furthermore, we point out that compared to the standard Farneback optical flow, our motion fields look less noisy.
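The probe itself can be as simple as an ordinary least-squares regression from the offset values sampled at a joint location to that joint's displacement; a NumPy sketch with hypothetical array shapes is shown below.

    import numpy as np

    def fit_motion_probe(offset_maps, joint_xy, displacements):
        # offset_maps: (C, H, W) learned offset channels for one frame pair;
        # joint_xy: (J, 2) integer (x, y) joint locations;
        # displacements: (J, 2) ground-truth (x, y) joint motion.
        X = np.stack([offset_maps[:, y, x] for x, y in joint_xy])  # (J, C)
        X = np.hstack([X, np.ones((len(X), 1))])                   # bias term
        W, *_ = np.linalg.lstsq(X, displacements, rcond=None)      # (C + 1, 2)
        return W  # applying W at every pixel gives a dense motion field (Fig. 5)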
5 Conclusions

In this work, we introduced PoseWarper, a novel architecture for pose detection in sparsely labeled videos. Our PoseWarper can be effectively used for multiple applications, including video pose propagation and temporal pose aggregation. In these settings, we demonstrated that our approach reduces the need for densely labeled video data, while producing strong pose detection performance. Furthermore, our state-of-the-art results on the PoseTrack2017 and PoseTrack2018 datasets demonstrate that our PoseWarper is useful even when the training videos are densely labeled. Our future work involves improving our model's ability to propagate labels and aggregate temporal information when the input frames are far away from each other. We are also interested in exploring self-supervised learning objectives for our task, which may further reduce the need for pose annotations in video. We will release our source code and our trained models upon publication of the paper.

References

[1] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[3] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, 2014.

[4] Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, and Jianbo Shi. Convolutional random walk networks for semantic image segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision - ECCV 2014, 2014.

[6] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.

[7] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.

[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.

[9] Ross Girshick. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[10] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.
Rich feature hierarchies for accurate\nobject detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition (CVPR), 2014.\n\n[11] Saurabh Gupta, Ross Girshick, Pablo Arbelaez, and Jitendra Malik. Learning rich features from RGB-D\n\nimages for object detection and segmentation. In ECCV, 2014.\n\n[12] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and\n\nAlexander C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.\n\n[13] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional\nnetworks. In Advances in Neural Information Processing Systems 29, pages 379\u2013387. Curran Associates,\nInc., 2016.\n\n[14] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Uni\ufb01ed,\nreal-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR\n2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779\u2013788, 2016.\n\n[15] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In 2017 IEEE Conference on\nComputer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages\n6517\u20136525, 2017.\n\n[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In NIPS, 2012.\n\n[17] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision\nand Pattern Recognition (CVPR), 2015.\n\n[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In\n\nICLR, 2015.\n\n[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\n2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770\u2013778, 2016.\n\n[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image\n\nDatabase. In CVPR09, 2009.\n\n[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\nand C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on\nComputer Vision (ECCV), Z\u00fcrich, September 2014.\n\n[22] Umar Iqbal, Anton Milan, and Juergen Gall. Posetrack: Joint multi-person pose estimation and tracking.\n\nIn IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n10\n\n\f[23] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In\n\nEuropean Conference on Computer Vision (ECCV), 2018.\n\n[24] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using\n\npart af\ufb01nity \ufb01elds. In CVPR, 2017.\n\n[25] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In\nComputer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14,\n2016, Proceedings, Part VIII, pages 483\u2013499, 2016.\n\n[26] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In\n\nCVPR, pages 4724\u20134732. IEEE Computer Society, 2016.\n\n[27] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human\n\npose estimation. In CVPR, 2019.\n\n[28] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. 
Hu, and Y. Wei. Deformable convolutional networks. In 2017\n\nIEEE International Conference on Computer Vision (ICCV), volume 00, pages 764\u2013773, Oct. 2017.\n\n[29] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox.\n\nFlownet 2.0: Evolution of optical \ufb02ow estimation with deep networks. CoRR, abs/1612.01925, 2016.\n\n[30] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pictorial structures revisited: People detection and\n\narticulated pose estimation. In CVPR, pages 1014\u20131021. IEEE Computer Society, 2009.\n\n[31] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. In 2010\nIEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 623\u2013630, June\n2010.\n\n[32] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose\n\nestimation. In Proc. BMVC, pages 12.1\u201311, 2010. doi:10.5244/C.24.12.\n\n[33] Leonid Pishchulin, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. Strong appearance and\nexpressive spatial models for human pose estimation. In ICCV, pages 3487\u20133494. IEEE Computer Society,\n2013.\n\n[34] Yi Yang and D. Ramanan. Articulated pose estimation with \ufb02exible mixtures-of-parts. In Proceedings of\nthe 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR \u201911, pages 1385\u20131392,\nWashington, DC, USA, 2011. IEEE Computer Society.\n\n[35] M. Dantone, J. Gall, C. Leistner, and L. van Gool. Human pose estimation using body parts dependent\njoint regressors. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3041\u20133048,\nPortland, OR, USA, June 2013. IEEE.\n\n[36] Xiangyang Lan and D. P. Huttenlocher. Beyond trees: common-factor models for 2d human pose recovery.\nIn Tenth IEEE International Conference on Computer Vision (ICCV\u201905) Volume 1, volume 1, pages\n470\u2013477 Vol. 1, Oct 2005.\n\n[37] L. Sigal and M. J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation.\nIn Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR, volume 2, pages 2041\u20132048,\nNew York, NY, June 2006.\n\n[38] Yang Wang and Greg Mori. Multiple tree models for occlusion and spatial constraints in human pose\nestimation. In David Forsyth, Philip Torr, and Andrew Zisserman, editors, Computer Vision \u2013 ECCV 2008,\npages 710\u2013724, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.\n\n[39] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback.\nIn 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4733\u20134742, June\n2016.\n\n[40] Xianjie Chen and Alan L Yuille. Articulated pose estimation by a graphical model with image dependent\npairwise relations. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger,\neditors, Advances in Neural Information Processing Systems 27, pages 1736\u20131744. Curran Associates,\nInc., 2014.\n\n[41] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In 2014 IEEE\n\nConference on Computer Vision and Pattern Recognition, pages 2337\u20132344, June 2014.\n\n11\n\n\f[42] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schieke. Deepercut:\nA deeper, stronger, and faster multi-person pose estimation model. 
In European Conference on Computer\nVision (ECCV), 2016.\n\n[43] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter Gehler, and\nBernt Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In IEEE\nConference on Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[44] Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional\nnetwork and a graphical model for human pose estimation. In Z. Ghahramani, M. Welling, C. Cortes, N. D.\nLawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages\n1799\u20131807. Curran Associates, Inc., 2014.\n\n[45] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and\nKevin Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, pages 3711\u20133719.\nIEEE Computer Society, 2017.\n\n[46] Vasileios Belagiannis and Andrew Zisserman. Recurrent human pose estimation.\n\nConference on Automatic Face and Gesture Recognition. IEEE, 2017.\n\nIn International\n\n[47] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres,\nand Bernt Schiele. Articulated multi-person tracking in the wild. In 2017 IEEE Conference on Computer\nVision and Pattern Recognition (CVPR), pages 1293\u20131301. IEEE, July 2017. Oral.\n\n[48] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-Track:\n\nEf\ufb01cient Pose Estimation in Videos. In CVPR, 2018.\n\n[49] Jie Song, Limin Wang, Luc Van Gool, and Otmar Hilliges. Thin-slicing network: A deep structured model\n\nfor pose estimation in videos. In CVPR, pages 4420\u20134229, 2017.\n\n[50] J. Charles, T. P\ufb01ster, D. Magee, D. Hogg, and A. Zisserman. Personalizing human video pose estimation.\n\nIn IEEE Conference on Computer Vision and Pattern Recognition, 2016.\n\n[51] T. P\ufb01ster, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In IEEE\n\nInternational Conference on Computer Vision, 2015.\n\n[52] Dingwen Zhang, Guangyu Guo, Dong Huang, and Junwei Han. Pose\ufb02ow: A deep motion representation\nfor understanding human behaviors in videos. In The IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), June 2018.\n\n[53] O. Wiles, A.S. Koepke, and A. Zisserman. Self-supervised learning of a facial attribute embedding from\n\nvideo. In British Machine Vision Conference, 2018.\n\n[54] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[55] Gunnar Farneb\u00e4ck. Two-frame motion estimation based on polynomial expansion. In Proceedings of the\n13th Scandinavian Conference on Image Analysis, SCIA\u201903, pages 363\u2013370, Berlin, Heidelberg, 2003.\nSpringer-Verlag.\n\n[56] Hengkai Guo, Tang Tang, Guozhong Luo, Riwei Chen, Yongchen Lu, and Linfu Wen. Multi-domain pose\nnetwork for multi-person pose estimation and tracking. In Computer Vision - ECCV 2018 Workshops -\nMunich, Germany, September 8-14, 2018, Proceedings, Part II, pages 209\u2013216, 2018.\n\n[57] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Ef\ufb01cient online pose\n\ntracking. In BMVC, 2018.\n\n[58] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. Rmpe: Regional multi-person pose estimation. In\n\nThe IEEE International Conference on Computer Vision (ICCV), Oct 2017.\n\n[59] Matthew D. Zeiler and Rob Fergus. 
Visualizing and understanding convolutional networks. CoRR,\n\nabs/1311.2901, 2013.\n\n[60] Jason Yosinski, Jeff Clune, Anh Mai Nguyen, Thomas J. Fuchs, and Hod Lipson. Understanding neural\n\nnetworks through deep visualization. CoRR, abs/1506.06579, 2015.\n\n12\n\n\f", "award": [], "sourceid": 1728, "authors": [{"given_name": "Gedas", "family_name": "Bertasius", "institution": "Facebook Research"}, {"given_name": "Christoph", "family_name": "Feichtenhofer", "institution": "Facebook AI Research"}, {"given_name": "Du", "family_name": "Tran", "institution": "Facebook AI"}, {"given_name": "Jianbo", "family_name": "Shi", "institution": "University of Pennsylvania"}, {"given_name": "Lorenzo", "family_name": "Torresani", "institution": "Facebook AI Research"}]}