Review for NeurIPS paper: CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations

NeurIPS 2020

Review 1

Summary and Contributions: This work presents CaSPR, a novel system for representing spatiotemporal sequences of 3D point clouds. The proposed method allows intra-class generalization to unseen geometry and poses. Unlike previous methods, CaSPR treats time explicitly rather than as another spatial dimension. CaSPR works by learning an embedding from a point cloud sequence into a normalized space (NOCS), then learning a Neural ODE over latent vectors such that the geometry generated by pushing those latent vectors through a generative model agrees with the normalized geometry in NOCS. The Neural ODE ensures continuity in time, respecting temporal unconventionality, while the generative model (CNF) ensures continuity in space.
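To make the "Neural ODE over latent vectors" idea in the summary concrete, here is a minimal sketch, not the authors' implementation. It assumes a hand-written toy dynamics function (`latent_dynamics`, hypothetical) in place of a learned network, and a fixed-step Euler solver (`integrate_latent`, also hypothetical) in place of an adaptive ODE solver. The point it illustrates is that a latent code advanced by an ODE can be queried at any continuous timestamp rather than only at observed frames.

```python
import math

def latent_dynamics(z):
    """Toy stand-in for a learned dynamics network (assumption): dz/dt = -z."""
    return [-v for v in z]

def integrate_latent(z0, t, steps=1000):
    """Advance latent z0 from time 0 to time t with fixed-step Euler integration."""
    z = list(z0)
    dt = t / steps
    for _ in range(steps):
        dz = latent_dynamics(z)
        z = [v + dt * d for v, d in zip(z, dz)]
    return z

# The same initial latent can be evaluated at arbitrary, irregular timestamps.
z0 = [1.0, 2.0]
for t in (0.25, 0.5, 1.0):
    print(t, integrate_latent(z0, t))

# For this toy dynamics the exact solution is z0 * exp(-t), which the
# Euler result approaches as the number of steps grows.
print(math.exp(-1.0))
```

In a full system of the kind the summary describes, the latent at each queried time would then condition a generative decoder; that part is omitted here.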

Strengths: The proposed formulation is elegant and well designed for the task of modelling temporal sequences of deforming objects. The paper is extremely clear, well written and easy to understand. The experiments section does a great job of highlighting the effects of individual design choices (e.g., TPointNet++ vs. other architectures for canonicalization). Furthermore, the authors demonstrate the flexibility of their model on a number of downstream reconstruction tasks, showing that it is at least competitive with specialized architectures for these tasks. I also appreciate that the authors report EMD in addition to Chamfer distance, as I find EMD a more meaningful metric, and it is left out of many point cloud reconstruction papers.
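As a small illustration of the two metrics mentioned above, the following sketch computes a symmetric Chamfer distance and a brute-force EMD on tiny point sets. The exact conventions are assumptions (squared distances for Chamfer, mean matched distance for EMD; papers differ), and the brute-force matching over all permutations is only feasible for a handful of points. The example shows one commonly cited difference: Chamfer distance can be zero when two sets cover the same locations with different densities, while EMD, which requires a one-to-one matching, is not.

```python
import itertools
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both ways."""
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def emd_bruteforce(a, b):
    """EMD for equal-sized sets: minimum mean distance over all one-to-one matchings."""
    assert len(a) == len(b), "this sketch only handles equal-sized sets"
    best = min(
        sum(math.dist(p, q) for p, q in zip(a, perm))
        for perm in itertools.permutations(b)
    )
    return best / len(a)

# Same two locations, but with different point densities at each location.
a = [(0, 0, 0), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
b = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0)]

print(chamfer_distance(a, b))  # every point has an exact nearest neighbour
print(emd_bruteforce(a, b))    # two points must still be moved a unit distance
```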

Weaknesses: As the authors mention, the method requires fairly onerous training data (NOCS supervision), which can be difficult to acquire. While this is a limitation, solving it is beyond the scope of the paper. The authors also mention that the method is object-centric; however, I believe that representing the continuous dynamics of objects is a hard problem on its own, and extending it to full scenes is beyond the scope of this work.

Correctness: Yes, as far as I can tell.

Clarity: Yes, very.

Relation to Prior Work: This paper did a great job positioning itself with respect to prior work.

Reproducibility: Yes

Additional Feedback: This is an obvious accept from me, great work! After rebuttal: Having read the rebuttal and the other reviews, I still feel that this paper is a clear accept.

Review 2

Summary and Contributions: This paper aims to learn object-level representations that aggregate and encode spatiotemporal changes in shapes observed by 3D sensors. It first proposes an encoder network to canonicalize the point cloud sequence, then exploits a latent Neural ODE and a CNF to generate novel shapes in spacetime. This work has a wide range of potential applications.

Strengths: This is solid work. The proposed problem setup is novel and useful: it essentially seeks a unified representation for describing a partially observed point cloud sequence. To solve this problem, the author(s) exploit and improve several existing methods, e.g., NOCS and Neural ODEs. The paper also demonstrates several potential applications of the proposal. The work is interesting and can benefit the community.

Weaknesses: I have quite a few concerns about this work: 1. As the authors mention, the proposal is based entirely on object-level point cloud sequences; it would be very beneficial to generalize the methodology to scene-level point clouds. 2. Training the canonicalization network requires ground truth for supervision, which in practice is difficult to obtain (as the author(s) also mention). Is it possible to train the canonicalization network with an unsupervised criterion? 3. The network must be trained separately for each category, which is not practical in the real world; it would be useful for the network to handle different object categories at the same time.

Correctness: Basically correct.

Clarity: This paper is well written.

Relation to Prior Work: Yes, this work differentiates its problem setting from existing work.

Reproducibility: Yes

Additional Feedback: 1. In Section 3, the authors split the latent representation into a static descriptor and a dynamic descriptor. Why is this split reasonable? From the perspective of the canonicalization network, there is no mechanism in the architecture that guarantees such a split is achievable. 2. How can the proposed CaSPR guarantee that the generated flow (e.g., Figure 7) is scene flow (which represents the actual movement of the particles) rather than a deformation flow (which only reshapes the point cloud to resemble the target point cloud)? 3. To generate a point cloud sequence with temporal correspondence, does the Gaussian noise need to be the same throughout time? After rebuttal: Having read the authors' responses and the other reviewers' comments, I think the idea of this paper is good but not surprisingly new. The CNF decoder (generator) deployed at the very end of the model can be replaced by other types of decoder (e.g., the AtlasNet decoder), and the latent ODE part can also be replaced by another network module that takes a timestamp t and an initial codeword/feature as input and generates a new feature. In a sense, the authors have put several existing things together to achieve their goals. Having said that, I still think this is an interesting combination and can be helpful for the NeurIPS community. Acceptance of this work is recommended.
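The question in item 3 above can be made concrete with a toy sketch. This is not the authors' model: `decode` is a hypothetical deterministic map standing in for a time-conditioned generator, and the affine form is chosen only so the result is easy to check by hand. Under that assumption, reusing the same base samples at two timestamps means index i refers to the same base sample in both frames, so per-index differences form a displacement field. Whether such a field matches physical motion (the scene flow vs. deformation flow concern in item 2) is a separate question that this sketch does not address.

```python
import random

def decode(base_point, t):
    """Toy deterministic decoder (assumption): maps a base sample to a 3D point at time t."""
    scale = 1.0 + 0.5 * t
    return tuple(scale * c + t for c in base_point)

# Draw the Gaussian base samples once and reuse them for every timestamp.
rng = random.Random(0)
base = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(5)]

t0, t1 = 0.0, 1.0
frame0 = [decode(b, t0) for b in base]
frame1 = [decode(b, t1) for b in base]

# Because both frames come from the same base samples, subtracting by index
# gives a displacement for each tracked point.
flow = [tuple(q - p for p, q in zip(a, b)) for a, b in zip(frame0, frame1)]

for b, f in zip(base, flow):
    print(b, "->", f)
```

If the base samples were redrawn independently for each timestamp, index i in the two frames would no longer refer to the same base sample, and the per-index difference would not describe a trajectory.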

Review 3

Summary and Contributions: The authors propose CaSPR, a method for learning object-centric Canonical Spatiotemporal Point Cloud Representations of dynamically moving or evolving objects. CaSPR learns representations that support spacetime continuity, are robust to variable and irregularly spacetime-sampled point clouds, and generalize to unseen object instances. Experimental results demonstrate the effectiveness of CaSPR on several applications, including shape reconstruction, camera pose estimation, continuous spatiotemporal sequence reconstruction, and correspondence estimation from irregularly or intermittently sampled observations.

Strengths: 1. The proposed CaSPR is novel in two ways: a) canonicalizing an input point cloud sequence (partial or complete) into a shared 4D container space; b) learning a continuous ST latent representation on top of the canonicalized space. 2. The proposed model can be used in many applications, including partial or full shape reconstruction, spatiotemporal sequence recovery, camera pose estimation, and correspondence estimation. Hence, the proposed method could have a strong impact on the research community.

Weaknesses: In the experiments section, the authors introduce many applications of CaSPR. However, it would be better if the authors gave more details on how CaSPR is adapted to each application.

Correctness: The method is correct to the best of my knowledge. Open for discussion.

Clarity: This paper is a completely finished work. The writing is clear.

Relation to Prior Work: The related work is reasonably good. However, it would be better if the authors could put more related work into the main paper instead of supplementary material.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper proposes CaSPR, a framework for reasoning about a spatiotemporally canonicalized object space for point clouds via neural ordinary differential equations and continuous normalizing flows. Experiments demonstrate the effectiveness of the proposed method on various tasks, such as shape reconstruction, camera pose estimation and spatiotemporal sequence reconstruction.

Strengths: Although building correspondences with NOCS and reasoning over spacetime with Neural ODEs have been studied in previous work, the idea of combining them for point cloud representation is novel. The variety of applications also indicates the promising potential of the proposed framework.

Weaknesses: The limitations section already summarizes the potential limitations of this work. I am wondering how CaSPR generalizes to objects (sequences) from unseen categories. After rebuttal: I appreciate the authors addressing my concerns.

Correctness: Yes.

Clarity: Yes. The paper is easy to follow and the illustrations are helpful for understanding the main idea.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: The idea is not surprisingly new. But the paper is well written and the extensive experiments are sufficient to demonstrate the potential of the proposed framework.