NeurIPS 2020

Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation

Review 1

Summary and Contributions: This paper presents an inference stage optimization framework that adapts 3D human pose estimation to novel scenarios with enhanced generalization. With fully supervised learning and self-supervised learning co-trained on source data, the proposed ISO performs self-supervised adaptation on each inference instance to accommodate distribution shifts across scenarios. It achieves improved accuracy in the cross-scenario setup.
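For intuition, the per-instance adaptation loop described in this summary can be sketched in a few lines. This is a toy illustration only, not the paper's actual model or losses: the linear "lifter" and the cycle-style reconstruction objective below are invented stand-ins for the real network and its geometry-aware SSL tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssl_loss(W, x):
    """Cycle-style self-supervised loss: lift x with W, project back with W^T,
    and penalize the reconstruction error. Requires no 3D label for x."""
    z = W.T @ (W @ x) - x
    return 0.5 * float(z @ z)

def ssl_grad(W, x):
    """Analytic gradient of ssl_loss with respect to W."""
    y = W @ x
    z = W.T @ y - x
    return np.outer(y, z) + np.outer(W @ z, x)

def iso_predict(W_src, x, steps=30, lr=0.005):
    """Inference Stage Optimization (sketch): adapt a copy of the source
    model to this single test instance via the SSL loss, then predict."""
    W = W_src.copy()  # per-instance adaptation starts from the source weights
    for _ in range(steps):
        W -= lr * ssl_grad(W, x)
    return W @ x, W   # prediction made with the adapted model

# Toy "2D pose" input and a pretrained "lifting" matrix (both hypothetical).
x = rng.normal(size=4)
W_src = rng.normal(size=(6, 4)) * 0.3

pred, W_adapted = iso_predict(W_src, x)
```

In the paper, the same pattern is applied per test instance with the network's geometry-aware SSL objectives (adversarial reprojection or cycle consistency) in place of this toy reconstruction loss.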

Strengths: The paper presents a novel framework for practical cross-scenario 3D human pose estimation, and extensive experiments demonstrate the effectiveness of the method. The framework is orthogonal to different self-supervised learning techniques and could inspire future work on domain adaptation for human pose estimation.

Weaknesses: My major concern with this method is its inference-time performance. Even though the paper proposes a few speedup solutions, vanilla-lr is still 9x slower than the baseline (and the numbers in the other tables should be based on the faster version for a fair comparison). I wonder how the accuracy would be affected if the parameter adaptation were performed on a batch of inference images (several images together, or even a whole dataset) instead of on each image instance. How necessary is this *individual* adaptation?

Correctness: Seems good.

Clarity: yes, I find it easy to follow.

Relation to Prior Work: Good. I find this recent paper also related: "Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows"

Reproducibility: Yes

Additional Feedback:
- It is not very clear to me how the pose discriminator is trained and updated during inference. What is the training data at each stage? Could it be unbiasedly "adapted" based on a single image instance during inference?
- What is the inference accuracy with and without ISO on the Human3.6M data? It would be good to have such an experiment to understand whether ISO helps even when there is no pose distribution shift.

Review 2

Summary and Contributions: This paper proposes an inference stage optimization (ISO) method for improving the generalizability of 3D pose estimation models. The proposed method performs geometry-aware self-supervised learning to update the neural network model for each instance. Two SSL techniques, i.e., ISO-Adversary and ISO-Cycle, are adopted for the 3D pose estimation task under the ISO framework. The experiments show that the proposed method can improve the generalizability of the 3D pose estimation model.

Strengths: This paper focuses on an interesting problem: improving 3D pose model generalizability at the inference stage. It adopts two SSL methods, ISO-Adversary and ISO-Cycle, to update the model during inference. The experimental results show the effectiveness of the proposed method.

Weaknesses: The inference stage optimization seems similar to previous model-fitting methods, e.g., Kolotouros et al., "Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop," ICCV 2019. However, no comparisons with them are presented. Ablation studies in the supplementary material show that the main improvements come from the joint training: in the MPJPE metric, joint-adv improves over the baseline by -6.8, while the inference stage optimization vanilla-adv improves the model by only -1.7 (vs. joint-adv). As supervised and weakly supervised learning have been studied in previous papers [7, 9, 35], the comparison should be presented in Table 1 and Table 2.

Correctness: The proposed method is technically correct.

Clarity: The paper is well written and easy to follow. It is suggested that the implementation details and ablation study parts should be moved from the supplementary material to the main paper.

Relation to Prior Work: The difference between the proposed method and prior works is not clearly discussed. The relation to unsupervised or weakly supervised methods, as well as to model-fitting based methods, should be highlighted.

Reproducibility: Yes

Additional Feedback: The analysis in Section 4.3 seems interesting. It would be better to visualize the ground truth in Figure 4. In the robustness analysis, some images should be presented to show how the noise influences the input and the output.

Review 3

Summary and Contributions: The authors propose a neural network training scheme that finetunes a pre-trained 3D pose estimation model on unlabelled target images (a form of transductive learning) in an iterative fashion.

Strengths: Focusing on geometry rather than appearance distribution shift between domains makes sense, as good 2D detectors exist. I really liked the detailed analysis of why the proposed methods help and what their effect is, e.g., on capturing the bone length ratios in the target distribution. The online training (alongside prediction) is a neat idea and is effective in speeding up the adaptation to the target distribution. The cycle consistency loss and re-projection adversarial losses have been used elsewhere; yet, applying them to the transductive case and making this effective and efficient is nontrivial. The gained improvement is impressive across the board.

Weaknesses: It is not explained what 2D pose is used as input. Is it the GT pose, or one estimated by OpenPose or a similar off-the-shelf method? This is my most important concern; please clarify in detail. Other self-supervised methods are not evaluated in the transductive setting (seeing examples of the target dataset); would this be possible with the available codebases? It would also be good to mark in the table which methods are transductive (e.g., [A] should be added, as it is transductive too).

Correctness: The method is mostly a system of existing components that are tied together in a neat way. The individual components and the overall approach are sound.

Clarity: The paper is clearly written and easy to understand.

Relation to Prior Work: Methods for transductive learning should be discussed, particularly the following one, which applies it to human pose estimation. The intro needs to be changed, since this method is not the first to do transductive learning for human pose estimation. [A] "Neural Scene Decomposition for Multi-Person Motion Capture," Rhodin et al., CVPR 2019.

Reproducibility: Yes

Additional Feedback: It is great that additional results and code were given in the supplemental. Potential negative effects when the method is misused should be discussed in the impact section. Update: Thanks for the rebuttal; I stand by my score, but it is a bummer that all evaluation is on ground-truth 2D poses. If possible, include an experiment with 2D poses given by OpenPose or a similar detector.

Review 4

Summary and Contributions: The paper proposes Inference Stage Optimization (ISO), which incorporates self-supervised learning (SSL) for lifting 2D body pose to 3D into a lifting network trained with full supervision on 3D pose labels, in order to help the network generalize to target data with a different pose distribution. During the inference stage, the SSL objective is also optimized on each new input before making the 3D pose prediction. The experiments show substantial improvements in the metrics compared with existing work and the baseline.

Strengths: Applying a self-supervised learning based method at the inference stage transfers the trained model to data from different distributions. Substantial improvements over the baseline and existing methods in making predictions on data from different distributions. The analyses of why ISO works provide some intuition about the efficacy of the proposed method.

Weaknesses: Though the authors offer many insights into why ISO performs well, I still have questions about the Shared Feature Extractor, the SSL Head, and the FSL Head. As the SSL comes from existing work and the main contribution is the combination of SSL with FSL, answering these questions clearly is important. Which kind of feature or information is shared in the Shared Feature Extractor? How much will it deviate when trained on new target data, such that it causes the FSL head to fail? What information is kept in the FSL head, and how much of it is source-data specific? What information is kept in the SSL head? Do the SSL and FSL heads share some common information, or are the features well disentangled?

The other issue is the online prediction for newly arriving inputs. As the data comes in sequentially, how to update the network on new instances without overfitting to those that come first becomes an important question to answer. Though the authors have a vanilla ISO version that adapts the Shared Feature Extractor on each incoming instance and already shows improvements over the baseline, it is not adequate to answer this question. In the supplementary material, Fig. S2 shows that online ISO is very sensitive to the learning rate and training iterations, which agrees with my thinking. Lines 55 to 58 attribute this to "the model quickly overfit to the SSL task, thus hamper 3D pose estimation". This explanation is ambiguous, and I do not understand what "the SSL task" and "3D pose estimation" mean here; in essence, SSL also performs 3D pose estimation, and SSL and FSL should agree with each other. My explanation for this sensitivity is that the network overfits to the sequentially arriving data and cannot converge on the later data, which is supported by the fact that vanilla ISO is not sensitive to the training iterations and learning rate (Fig. S2).

Minor: a brief introduction or grouping of the methods in Table 1 would give a better idea of the improvements due to each contribution.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: ===UPDATE AFTER REBUTTAL=== I thank the authors for the rebuttal, which has addressed most of my questions. I raise my original rating to above the acceptance threshold.