NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2164
Title: Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction

Reviewer 1

I really like this paper, primarily for its novel idea of selecting the best camera views for 3D pose estimation in multi-person scenarios with occlusions. I can see a similar idea being used in many other related problems that require finding the best subset of camera views, such as real-time holoportation. Even though the technology itself is not super exciting or completely new, tailoring RL to this specific problem is novel enough for a paper. I have a few minor questions:
* What is the performance trend when the number of cameras is greater than 10? From the plots, the gain over the baselines seems to shrink as more cameras are used; can you explain why? For example, the performance is close to that of Max-Azim when the number of cameras is 10 (see Figure 2).
* I believe there is a strong reason, but I am curious why exhaustive triangulation of the 2D pose estimates is not used in practical applications. What are the benefits of using this auto-selection-based algorithm?
* The paper does not discuss any downsides of the approach. To be fair, what should one be careful about when applying it to other problems, such as multi-view 3D reconstruction?
* Other minor issues: (1) Figure 5 is missing. (2) Lines 223-224: the sentence "The corresponding results but for 2d reprojection errors onto the OpenPose 2d estimates are given in Fig. 3. Looking also at reprojection errors is relevant" is very hard to understand; please polish it.

***Rebuttal***: I have read the authors' rebuttal and the other reviewers' feedback, and I keep my original rating. I like the idea and the problem formulation; however, as the other reviewers also pointed out, the practical impact of this work on real applications is not quite clear. I am okay with accepting the paper, but I will not fight if it is rejected.

Reviewer 2

This paper focuses on the problem of 3D human pose reconstruction from multiple viewpoints in video sequences. The system assumes that a set of 2D human poses from different camera viewpoints is available (e.g., obtained with OpenPose), and the paper proposes a pipeline that simultaneously generates the 3D reconstruction and selects the next viewpoint needed to decrease the reconstruction error. An artificial agent, named ACTOR, is defined to select the camera viewpoints, following a reward-based approach. The proposed system is evaluated on the CMU Panoptic dataset, which contains 480 VGA camera views. The experimental results show that the reconstruction error obtained with the proposed agent is lower than with the baseline methods, although still worse than when selecting the cameras in a greedy fashion (the "Oracle" case).
* Originality: the originality comes from the definition of the agent that selects the most relevant cameras for the triangulation stage. The remaining components of the system do not seem to be novel.
* Quality: the technical content of the paper appears to be correct. No flaws have been found in the experimental setup.
  - From Table 1, the difference between the ACTOR cases and the Random and Max-Azim baselines does not seem large, and the baselines are probably faster. Therefore, how much time is needed to select a camera viewpoint at each time step compared to the baseline methods?
* Clarity: the paper is well written. Technical and implementation details are available. However, the description of the dataset on which the system is evaluated is minimal. In fact, it is not clearly stated whether all of the almost 500 available cameras of the Panoptic studio are used in the experiments.
* Significance: the idea of automatically selecting the most suitable cameras for triangulation is interesting. However, the practical situation in which it is validated is quite rare, i.e., it is very unusual to have such a large number of cameras available. This limits the range of applications where the method can be applied. Basically, if only a few cameras were available, there would be no need for selection; all of them could be used.

*** Post-rebuttal *** I have carefully read the authors' answer, and my overall score leans towards acceptance.

Reviewer 3

# Originality
The task is new, though there are related works on autonomous reconstruction of 3D scenes and on next-best-view selection for object pose estimation. The paper lacks a thorough discussion of these related tasks, of the special challenges posed by the 3D human pose reconstruction setting, and of how this work addresses those challenges.

# Quality
The paper is solid in general. The proposed approach is technically sound. The experiments and baselines are well designed and sufficiently demonstrate the effectiveness of the proposed approach. One concern is that the reward is based on the visibility of body joints in the selected views, while the authors claim that no annotation is needed for training. How can the true visibility be known if there is no annotation?

# Clarity
The idea and methodology are clearly presented. The paper is well organized and easy to follow.

# Significance
This is a new task. My main concern is whether this task is useful in practice. For the multi-view setting, I could not see why one would not use all views. For the active observer setting, the assumption that the human is static is impractical: pose estimation is usually used for motion capture, where people are moving.