NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019 at Vancouver Convention Center
Paper ID: 2574
Title: Exploiting Local and Global Structure for Point Cloud Semantic Segmentation with Contextual Point Representations

Reviewer 1

1. Originality: The method is a combination of existing techniques. Attention has been well explored for GNNs, which solve a problem similar to point cloud analysis; indeed, a point cloud is a kind of graph data. The contextual representation is just a fusion of neighboring features with the central one, which is quite straightforward. Meanwhile, the choice of fusion operations (Eqs. 2, 5, 7) is not well explained or motivated. Local information has been explored in the point cloud community, and some related works are not cited or discussed, for example: "Dynamic Graph CNN for Learning on Point Clouds", "Relation-Shape Convolutional Neural Network for Point Cloud Analysis", "Mining Point Cloud Local Structures by Kernel Correlation and Graph Pooling", and "Pointwise Convolutional Neural Networks". These were all officially accepted by peer-reviewed journals/conferences and had papers on arXiv before the NeurIPS submission deadline.

2. Quality: The paper is technically sound. Some designs are not well supported by experiments or not well motivated, as I mentioned above. There are no results on running time complexity.

3. Clarity: The paper is well written and easy to follow.

4. Significance: The paper is incremental to previous work. While considering both global and local information is beneficial for point cloud segmentation, in my opinion the method does not show its advantages over previous work. For example, both DGCNN and PointNet++ incorporate contextual information to enrich the point feature, either in feature space (DGCNN) or in geometric space (PointNet++; if we do not use sampling to reduce the point number, the original PointNet++ is a kind of feature extractor that considers neighboring information), and the method does not show its advantage over these methods. Actually, the method only shows that combining many factors can achieve better performance; it is unclear what would happen if the proposed modules were replaced with existing techniques. This would be of more interest to readers in assessing the advance of the proposed method.
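To make the contextual representation under discussion concrete, here is a minimal sketch of the kind of neighbor fusion the review describes: each point's position is fused with the relative offsets of its k nearest neighbors. The concat-style fusion, the value of k, and the function name are illustrative assumptions for this sketch, not the paper's exact operations in Eqs. 2, 5, and 7.

```python
import numpy as np

def contextual_representation(points, k=3):
    """Sketch: augment each point with the relative offsets of its k
    nearest neighbors (a simple concat fusion; the paper's exact
    fusion operations may differ)."""
    n = points.shape[0]
    # pairwise squared distances between all points, shape (n, n)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each point itself
    idx = np.argsort(d2, axis=1)[:, :k]   # indices of k nearest neighbors
    # relative positions of neighbors w.r.t. the central point, (n, k, 3)
    offsets = points[idx] - points[:, None, :]
    # fuse: central position concatenated with flattened neighbor offsets
    return np.concatenate([points, offsets.reshape(n, -1)], axis=1)

pts = np.random.rand(8, 3).astype(np.float32)
feat = contextual_representation(pts, k=3)
print(feat.shape)  # (8, 12): 3 central coords + 3 neighbors x 3 offsets
```

Under this sketch, replacing the concatenation with other fusion operators (sum, max, learned weighting) is exactly the kind of ablation the review asks the authors to motivate.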

Reviewer 2

The authors present a novel network architecture and encoding for point clouds. Specifically, they propose to use, in addition to the plain point position, an enriched representation that includes the positions of the point's nearest neighbors. The paper is technically sound, though I have some additional questions, which I state below (see "Improvements"). The paper does not include a discussion of the failure modes of the proposed algorithm (especially the invariance properties w.r.t. rotation and scale could be interesting), but it is reasonably well evaluated. The paper is clearly written and well-structured. I find the described idea moderately interesting; my main concerns are 1) the general applicability of the method and 2) the failure modes of the proposed approach. I go into more depth on these issues in the "Improvements" section.

Additional comments after the rebuttal: I thank the authors for the additional insightful experiments and their detailed response. On the grounds of the classification scores, I tend to accept the paper. However, I still feel that there is not a good explanation for the globally optimized feature combination function.

Reviewer 3

The authors' proposal to exploit the structural relationships within point clouds from both global and local perspectives is very enlightening. As such, the complicated relationships between points can be more comprehensively exploited. Moreover, a novel contextual representation of each point is proposed, which considers its neighboring points to enrich its semantic meaning. Such contextual representations are clearly motivated, with ablation studies demonstrating the corresponding contributions. The novelties and contributions have been summarized in the "Contributions" part. My questions and detailed comments are listed in the following.

1. I wonder about the results of considering spatial-wise and channel-wise attention within each GPM. How does it perform compared with the proposed GPM?
2. For the ablation studies in Table 5, it seems that the different components, namely CR, AM, and GPM, perform differently over different categories. Please provide more explanations.
3. What about the performance when stacking different numbers of GPMs?
4. Some more qualitative results should be provided.