Reviews: Learning Temporal Pose Estimation from Sparsely-Labeled Videos

Reviewer 1

Here I will describe the proposed PoseWarper algorithm.

The input consists of a pair of frames, each containing multiple humans, that are up to 3 frames apart. Frame A contains 2D annotations of human body joints, and frame B is unlabeled. The algorithm aims to predict poses for frame B using only supervision from frame A. First, a base network (using an HRNet backbone [27]) predicts pose heatmaps for both frames. Then, their difference is computed and fed into a stack of residual layers that predict per-channel offsets, which are then used to warp the pose heatmaps of frame B. Finally, the loss is computed between the warped heatmaps and the ground-truth heatmaps of frame A. The warping mechanism is differentiable and is implemented in a similar fashion to Spatial Transformer Networks, using Deformable Convolutions [28]. This way the network learns a motion field that warps human body parts between neighboring frames.

The proposed method is validated on the task of pose propagation by comparing against two baselines on the PoseTrack dataset [22]. The first baseline trains a standard human pose estimator on the available frames and simply applies the trained detector to the unlabeled frames. It reaches an accuracy of 79.3 mAP. The more advanced baseline uses a trained optical flow network (FlowNet2 [29]) to propagate annotations from the labeled to the unlabeled frames and attains 83.8 mAP. The proposed PoseWarper reaches 88.7 mAP, a substantial improvement over both baselines.

In another set of experiments, the propagated annotations were used to augment manual annotations for training a pose estimation network. The method comes close to training with full supervision on all available frames, and substantially outperforms a baseline when only one frame is available.

Finally, spatio-temporal pose aggregation at inference time also improves over the naive baseline.

I find the paper overall well-written. The introduction is very clear and provides good motivation for the main part. Related work is thorough and comprehensive. Here are some questions and concerns:
- Ablations for the architecture of the offset predictors are missing. Why were 5 levels chosen? This needs to be studied in more detail.
- It is not clear how exactly poses from neighboring frames are aggregated in the Spatiotemporal Pose Aggregation. Are the warped heatmaps simply averaged?
- It remains unclear whether the offsets are used to warp feature maps or the final keypoint heatmaps. Lines 115-116 say that "Next, the predicted offsets are used to spatially rewarp the pose heatmap", which indicates that the offsets warp body joint heatmaps. However, later, in Figure 5 of Section 4.5, offsets are shown for channel 123, which is clearly outside the number of keypoints in the PoseTrack dataset (14 keypoints). So does this mean that the feature maps are warped instead?
- The presence of an analysis section (4.5) is in principle good, but I cannot make much of it. In particular, Figure 5 is not illustrative at all. I would like to see a more informative comparison between state-of-the-art optical flow (why is FlowNet2 not used there?) and the proposed method. Also, channels 123 and 99 are used, but it is not clear at all what those correspond to, as I already mentioned in the previous remark.
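To make the training signal summarized in this review concrete, here is a minimal sketch of the warp-then-compare step: a heatmap predicted for frame B is resampled with a per-pixel offset field and compared against the ground-truth heatmap of frame A. It rests on several assumptions that are not taken from the paper: plain bilinear resampling stands in for the deformable convolutions, the offsets are supplied by hand rather than predicted by residual layers, and the function names (`bilinear_sample`, `warp_heatmap`, `mse_loss`) are hypothetical.

```python
import math


def bilinear_sample(heatmap, y, x):
    """Sample a 2D heatmap (list of rows) at a real-valued (y, x); zero outside."""
    h, w = len(heatmap), len(heatmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours with bilinear weights.
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * heatmap[yi][xi]
    return value


def warp_heatmap(heatmap_b, offsets):
    """Warp one heatmap channel of frame B with a per-pixel offset field.

    offsets[y][x] = (dy, dx): output pixel (y, x) reads frame B at (y + dy, x + dx).
    Assumption: a simplified stand-in for the deformable-convolution warp.
    """
    h, w = len(heatmap_b), len(heatmap_b[0])
    return [
        [bilinear_sample(heatmap_b, y + offsets[y][x][0], x + offsets[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


def mse_loss(pred, target):
    """Mean squared error between two heatmaps of equal size."""
    h, w = len(pred), len(pred[0])
    total = sum((pred[y][x] - target[y][x]) ** 2 for y in range(h) for x in range(w))
    return total / (h * w)


# Toy example: a joint labeled at (1, 1) in frame A appears at (2, 3) in frame B.
H, W = 4, 5
gt_a = [[0.0] * W for _ in range(H)]
gt_a[1][1] = 1.0
pred_b = [[0.0] * W for _ in range(H)]
pred_b[2][3] = 1.0

# Hand-set constant motion field (dy, dx) = (1, 2); in the method under review
# this field would be predicted from the difference of the two heatmaps.
offsets = [[(1.0, 2.0)] * W for _ in range(H)]

warped = warp_heatmap(pred_b, offsets)
loss_with_warp = mse_loss(warped, gt_a)
loss_without_warp = mse_loss(pred_b, gt_a)
```

With the correct offset field the warped heatmap of frame B lines up with the annotation in frame A, so the loss drops relative to comparing the unwarped heatmap. Whether the paper applies this warp to keypoint heatmaps or to intermediate feature maps is exactly the question raised in the third concern above; the sketch assumes heatmaps.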

Reviewer 2

Pros:
- The paper is well written and easy to read.
- The proposed PoseWarper network is novel and makes sense. Even though similar ideas have been explored in the literature, e.g., for facial keypoints in [53], I still found the ideas presented in the paper novel for body pose estimation.
- The effectiveness of the proposed approach is demonstrated for three different applications on the validation set of the PoseTrack-2017 dataset.

Cons:
- For frames with multiple persons, the proposed approach requires track IDs of the persons to know which two bounding boxes should be used to learn the PoseWarper. I understand that track IDs are significantly easier to annotate than poses, but this should be clarified in the paper. Currently, I couldn't find any explanation for this.
- It's not clear how person association between frames is obtained during testing. Are the poses tracked following the approach in [23] before performing pose aggregation in Table 2?

Conclusion: Overall it is a good paper with sufficient technical contributions. Exploiting temporal information for improving pose estimation in multi-person setups has been quite a challenge, and this paper proposes a good way of doing this. However, I request the authors to kindly address my concerns above during the rebuttal.

Reviewer 3

Originality: The task being addressed is in fact quite novel and also well motivated. Acquiring dense pose annotations in video can be tedious and time-consuming. This provides a strong case for studying pose estimation from sparsely labeled videos.

Quality: The methodology is technically sound. Evaluation on PoseTrack2017 is compared with proper recent approaches. The results might be more complete if the paper added some ablation study on the proposed network architecture.

Clarity: Overall the idea and results are clearly exposed. It could be helpful to add a more rigorous description of deformable convolutions (L119-121).

Significance: The paper demonstrates three important tasks with the proposed technique (1, 2, 3 in the last block), which shows the wide applicability of the technique.

---Post Rebuttal Feedback---
The rebuttal does not change my view, which has been positive initially.
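As context for the Clarity remark on deformable convolutions, the standard definition from the deformable convolution literature can be written as follows. This is the general formulation, not a transcription of the paper's L119-121, and the symbols ($\mathcal{R}$ for the regular sampling grid, $\Delta p_n$ for the learned offsets, $G$ for the bilinear kernel) are notation assumed here.

```latex
% Regular convolution at output location p_0 over the sampling grid R
% (e.g. R = {(-1,-1), (-1,0), ..., (1,1)} for a 3x3 kernel):
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n)

% Deformable convolution: each sampling location is shifted by a learned offset \Delta p_n
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n)

% Offsets are fractional, so x is evaluated by bilinear interpolation over integer locations q:
x(p) = \sum_{q} G(q, p)\, x(q), \qquad
G(q, p) = \max(0,\, 1 - |q_x - p_x|) \cdot \max(0,\, 1 - |q_y - p_y|)
```

Because the interpolation kernel $G$ is piecewise linear in $p$, gradients flow back to the offsets, which is what makes the warping differentiable.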

Paper ID:	1728
Title:	Learning Temporal Pose Estimation from Sparsely-Labeled Videos
