NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3645
Title: Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

Reviewer 1

The authors propose a novel framework for instance segmentation on 3D point clouds. The idea is new, and I haven’t seen it used in 3D before. The idea is to predict a set of 3D bounding boxes from a global feature of the 3D scene, pair the predicted bounding boxes with the ground truth ones using an optimal assignment algorithm, and then compute a carefully designed loss.

Quality: Generally, the quality of the work is good. The Computation Analysis section needs to be extended. The authors claim that their approach is approximately 10× more computationally efficient than (at least some) previous ones. Since approaches to 3D instance segmentation differ so much in nature, and different pre- and post-processing steps are used, it would be great to provide specific performance measurements for each step of the whole pipeline, for at least a couple of the solutions (e.g. those with open code). E.g., many of the methods mentioned in this paper use voxelization pre-processing, BlockMerging, clustering, and NMS, and it is true that some of these can be costly; on the other hand, 3D-BoNet requires an additional run of SCN semantic segmentation. How does the cost of that compare to these other components? It would be much better to see hard numbers on this comparison, and it should not be too hard to do (at least for the methods that have code available and are already set up to run on ScanNet).

Clarity: The paper is well written and clearly structured. The notation is good, and overall the paper is easy to read and understand. The code is going to be open-sourced, great! One of the key contributions of the paper is the ability to approximate the gradient for the Hungarian algorithm. This is a key component, and without an explicit statement of the gradient (more detailed than what is shown in the appendix) the method would be difficult to reproduce. Aside from that, it looks like it would be possible to reproduce the paper’s results.

Minor comments: Typo - “... Ai,j =1 iff the …” (page 4). SGPN [49] uses 1.5m×1.5m blocks for ScanNet training (see their C.1. Section).

Originality: The paper presents an original and new method for 3D instance segmentation. The proposed approach aligns well with the proposal-free, per-point solutions for object detection that are gaining popularity in 2D.

Significance: The results of the paper are valuable for people who research 3D instance segmentation problems (possibly 2D too). There aren’t many working approaches in this field, and since the 3D (2D) instance segmentation task is not as mature as, say, 2D object detection, every new approach (especially an effective one) is certainly welcome.
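To make the request for per-step measurements concrete, the following is a minimal sketch of a timing harness that reports wall-clock time for each stage of a pipeline. The function name `time_stages` and the three placeholder stages are assumptions made for illustration; they are not taken from 3D-BoNet or from any of the compared methods, and real stages (voxelization, network forward pass, clustering, NMS, and so on) would be substituted in.

```python
import time


def time_stages(stages, data, repeats=3):
    """Run (name, fn) stages in order and return the best wall-clock
    time in seconds observed for each stage over `repeats` runs."""
    timings = {}
    for name, fn in stages:
        best = float("inf")
        out = None
        for _ in range(repeats):
            start = time.perf_counter()
            out = fn(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
        data = out  # this stage's output is the next stage's input
    return timings


# Hypothetical placeholder stages standing in for real pipeline steps.
stages = [
    ("preprocess", lambda pts: [p for p in pts]),
    ("network", lambda pts: [sum(p) for p in pts]),
    ("postprocess", lambda scores: sorted(scores)),
]
points = [(0.1 * i, 0.2 * i, 0.3 * i) for i in range(1000)]

timings = time_stages(stages, points)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Reporting one such table per method, on the same hardware and the same scenes, would let readers see where the claimed speed-up comes from.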

Reviewer 2

Overall, this paper proposes a novel and interesting idea, the writing is good, and the numbers seem to back up most of the contributions. Below are my detailed comments. The proposed framework takes a 3D point cloud as input and outputs a fixed number of object hypotheses. For each hypothesis, the method regresses an axis-aligned bounding box, an objectness score, and a point mask. In order to set regression targets, the authors propose a novel association layer that associates each ground truth object with one object hypothesis. This is done by solving Hungarian matching on a hand-crafted pairwise cost matrix. To backprop through the non-differentiable matching process, the authors propose to estimate the gradient using Policy Gradient. As far as I know, incorporating Hungarian matching to associate hypotheses and ground truths as part of an end-to-end pipeline is novel. However, I feel that what sets this particular design apart was not emphasized enough in either the writing or the experiments. For example, regarding the one-to-one mapping, it makes sense to map every hypothesis to at most one ground truth object, but it is less clear why one ground truth may not be mapped to multiple hypotheses. My concern comes from the fact that many well-known approaches (e.g. MaskRCNN) allow such ground truth “sharing”. I am aware that this might result in duplicated predictions and that these methods often run NMS. Nonetheless, an empirical comparison between one-to-one and one-to-many mapping would help justify the importance of this particular design choice. On a similar note, it would be interesting to see an empirical comparison between solving the one-to-one mapping optimally and solving it greedily. I am also curious about what these MLPs learn. They might get mapped to different ground truths randomly after initialization. But after a while, do they learn to consistently respond to a specific part of the point cloud (or block)?
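As a toy illustration of why the optimal-versus-greedy comparison could matter, the sketch below solves a small one-to-one assignment both ways. The cost matrix is made up, the function names are hypothetical, and the exhaustive search merely stands in for a Hungarian solver (it returns the same minimum cost on small inputs, but in practice a polynomial-time routine such as SciPy's `scipy.optimize.linear_sum_assignment` would be used). Rows are ground truth objects, columns are hypotheses, and the sketch assumes there are at least as many columns as rows.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Exhaustive minimum-cost assignment of each row to a distinct column.
    Only practical for tiny matrices; stands in for a Hungarian solver."""
    n_rows, n_cols = len(cost), len(cost[0])
    best_cols, best_total = None, float("inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_cols, best_total = cols, total
    return list(best_cols), best_total


def greedy_assignment(cost):
    """Repeatedly pick the cheapest remaining (row, column) pair."""
    n_rows, n_cols = len(cost), len(cost[0])
    pairs = sorted((cost[r][c], r, c) for r in range(n_rows) for c in range(n_cols))
    assigned, used_cols = {}, set()
    for _, r, c in pairs:
        if r not in assigned and c not in used_cols:
            assigned[r] = c
            used_cols.add(c)
    cols = [assigned[r] for r in range(n_rows)]
    return cols, sum(cost[r][c] for r, c in enumerate(cols))


# Made-up costs: 2 ground truth objects (rows), 3 hypotheses (columns).
cost = [
    [1.0, 2.0, 9.0],
    [1.5, 8.0, 9.0],
]

opt_cols, opt_total = optimal_assignment(cost)
greedy_cols, greedy_total = greedy_assignment(cost)
print("optimal:", opt_cols, opt_total)      # optimal: [1, 0] 3.5
print("greedy: ", greedy_cols, greedy_total)  # greedy:  [0, 1] 9.0
```

Here the greedy strategy grabs the single cheapest pair first and is then forced into an expensive second match, so the two strategies can hand different regression targets to the same hypotheses. Whether this difference affects final accuracy is exactly the empirical question raised above.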
Because I do not understand what these MLPs learn, I wonder what would happen if we applied the same idea presented in this paper to instance segmentation on images. Specifically, we would learn a model that takes an image and spits out scores, bounding boxes, and pixel masks. Though it seems algorithmically viable, I am not sure it would work better than MaskRCNN. I wonder what the authors think about this idea.

Regarding Algorithm 1: First, I was not sure how to interpret step 1. My best guess is that it computes the squared distance of the n-th point to the min/max vertices of the i-th bounding box. If so, p_{xyz} should be a scalar value, so why do we take the minimum over p_{xyz} in step 4? In addition, it looks like the probability is biased towards larger bounding boxes. I think that, instead of using Euclidean distance, it would make more sense to normalize the distance w.r.t. each dimension.
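The per-dimension normalization suggested above could look roughly like the sketch below. This is not the paper's Algorithm 1, whose exact steps are not reproduced here; the function name `soft_inside_score`, the `sharpness` parameter, and the choice of a sigmoid are all assumptions made for illustration. For each axis, the signed margin of the point to the nearer box face is divided by the box extent along that axis, and the smallest normalized margin is mapped to a score in (0, 1).

```python
import math


def soft_inside_score(point, box_min, box_max, sharpness=10.0):
    """Soft point-in-box score using per-axis normalized margins.

    Hypothetical illustration: each axis margin is divided by the box
    extent on that axis, so the score does not depend on absolute box size.
    Assumes box_max > box_min on every axis.
    """
    margins = []
    for p, lo, hi in zip(point, box_min, box_max):
        extent = hi - lo
        # Positive inside the box, zero on a face, negative outside.
        margins.append(min(p - lo, hi - p) / extent)
    worst = min(margins)  # the tightest axis decides the score
    return 1.0 / (1.0 + math.exp(-sharpness * worst))


small_centre = soft_inside_score((0.5, 0.5, 0.5), (0, 0, 0), (1, 1, 1))
large_centre = soft_inside_score((5.0, 5.0, 5.0), (0, 0, 0), (10, 10, 10))
on_face = soft_inside_score((0.0, 0.5, 0.5), (0, 0, 0), (1, 1, 1))
outside = soft_inside_score((-0.2, 0.5, 0.5), (0, 0, 0), (1, 1, 1))
```

With this normalization, the centre of a small box and the centre of a large box receive the same score, a point on a face scores 0.5, and points outside score below 0.5. Whether such a scheme would remove the suspected bias in the authors' formulation would need to be checked against their actual definition.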

Reviewer 3

3D instance segmentation is an interesting problem with many possible applications. This paper proposes a single-stage, anchor-free, end-to-end trainable framework to address this problem. The bounding box association layer and the multi-criteria loss function are original designs, which might have a high impact on the research community. Unlike most instance segmentation works, which employ heavy post-processing steps, the proposed framework is able to give accurate predictions without any post-processing. The framework is novel and efficient, and it might inspire more interesting work in this area. Besides its originality and significance, the paper is a complete work, and the writing and illustrations are clear and concise.