NeurIPS 2020

Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Review 1

Summary and Contributions: The main contribution of the paper is a method that leverages cross-view consistency between the BEV and RV point cloud representations by using Hybrid-Cylindrical-Spherical voxelization. The performance is good, as demonstrated on the NuScenes dataset.

Strengths: 1. The idea of combining the advantages of both the BEV and RV point cloud representations is great, and the paper proposes an effective way to combine features from both views. 2. The performance on NuScenes is good. I like the ablation study, which clearly shows the contribution of each component. 3. The whole paper is well written. The motivation and the design are clearly described.

Weaknesses: 1. Experiments are conducted only on NuScenes. It would be great if the method were also evaluated on the KITTI and Waymo datasets. 2. The method is compared to a few baselines, but some popular methods such as AVOD, F-PointNet, and PointRCNN are not listed. 3. The inference runtime is not reported, which is important for practical applications.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I think the overall quality of the paper is good. The proposed method for combining two different views of the point cloud representation is elegant and reasonable. The authors also addressed my major concern by adding results on the Waymo dataset, which are pretty good compared to existing methods. Therefore I have decided to keep my rating as acceptance.

Review 2

Summary and Contributions: This paper proposes a Cross-view Consistent Network (CVCNet) to leverage the benefits of both BEV and RV. It proposes Hybrid-Cylindrical-Spherical voxelization, which enables learning from both BEV and RV in one network. It also proposes a pair of cross-view transformers that transform the feature maps of each view into the other view, and it introduces a cross-view consistency loss on them, treating this as a multi-view learning problem.

Strengths: The paper proposes a novel Cross-view Consistent Network (CVCNet) to leverage the advantages of both range view (RV) and bird’s-eye view (BEV) in 3D detection. + It introduces the concept of cross-view consistency to the 3D detection task. + It proposes a pair of Hough-Transform-like cross-view transformers that explicitly incorporate the correlation between the two views and enforce consistency on the transformed features. + It designs a new voxel representation, Hybrid-Cylindrical-Spherical (HCS) voxels, which enables feature extraction for both RV and BEV. + Experiments on the NuScenes dataset show substantial improvement over state-of-the-art methods.

Weaknesses: The main baseline for this paper is the MVF paper. Although one paragraph in the opening section details the potential advantages of the proposed method over MVF, including better memory and time efficiency and better utilization of context, the paper does not back these claims up with any experiments. It would be more convincing if the paper did a side-by-side comparison with MVF on common datasets such as KITTI and showed a significant improvement. However, the method is trained on a different dataset, not the ones MVF was tested on. This is the main reason I didn't give the paper a higher score. Clearly MVF [32] is the main reference paper and could be the paper that inspired the current work. The paper tries to propose some better approaches, which are all legitimate; however, it is very disappointing that the experiments do not include a side-by-side comparison on the two open datasets, Waymo and KITTI. Even for the NuScenes dataset, MVF [32] is not included in the table. I checked the supplementary material and it is not there either.

Correctness: seems okay

Clarity: Okay.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I suggest trying more than the one dataset [2] and doing a side-by-side comparison with MVF [32] in accuracy, memory, parameters, timing, etc. That would make a more convincing case. After rebuttal: Based on the rebuttal, I increased my original rating.

Review 3

Summary and Contributions: This paper proposes a method for detecting objects in LIDAR data. The idea is to consider both a "range view" (i.e., from the LIDAR position) and a "bird's-eye view" (i.e., a top view) of the LIDAR data as input to 2 CNNs (1 per view). A term that constrains the outputs of the 2 CNNs to be consistent is then added to the loss function used for training the 2 CNNs (introducing constraints between outputs has been shown to improve performance on other problems). This term works by computing linear combinations of output terms for one view along lines in 3D. Each linear combination should be consistent with 1 output term for the other view. The weights of the linear combinations are also optimized during learning. This is close to a recent method (MVF [32]), which is discussed in the paper (Section 2.3). I agree that the method proposed here is more elegant than [32].
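
The consistency term summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes both views share an azimuth axis, so that each "line" is one azimuth column; it fixes the combination weights rather than learning them; and it uses a mean squared difference as a stand-in for the paper's consistency measure. The names `transform_along_lines` and `consistency_loss` are hypothetical.

```python
import numpy as np


def transform_along_lines(src, weights):
    """Linearly combine source-view outputs that share an azimuth column.

    src:     (n_src, n_azimuth) output map of one view
    weights: (n_dst, n_src) combination weights (learned in the paper,
             fixed here for illustration)
    returns: (n_dst, n_azimuth) estimate of the other view's output map
    """
    return weights @ src


def consistency_loss(bev, rv, w_bev_to_rv, w_rv_to_bev):
    """Penalize disagreement between each view and the other view's estimate.

    Assumption: mean squared difference stands in for the paper's measure.
    """
    rv_from_bev = transform_along_lines(bev, w_bev_to_rv)
    bev_from_rv = transform_along_lines(rv, w_rv_to_bev)
    return float(np.mean((rv_from_bev - rv) ** 2)
                 + np.mean((bev_from_rv - bev) ** 2))


# Toy maps: 4 range bins (BEV) and 2 elevation bins (RV), 3 azimuth bins each.
bev = np.ones((4, 3))
rv = np.ones((2, 3))
w_bev_to_rv = np.full((2, 4), 0.25)  # average over range bins
w_rv_to_bev = np.full((4, 2), 0.5)   # average over elevation bins
loss_consistent = consistency_loss(bev, rv, w_bev_to_rv, w_rv_to_bev)
loss_inconsistent = consistency_loss(bev, 2.0 * rv, w_bev_to_rv, w_rv_to_bev)
```

With these toy inputs the two views agree in the first call, so the term vanishes, while doubling the range-view map in the second call makes both directions disagree and the term becomes positive.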

Strengths: Adding terms constraining different outputs has been done before (for example for constraining depth and normal predictions), but the proposed solution is designed for a different and important problem. The method is simple (which is a good thing). The experiments seem to confirm the validity of the approach.

Weaknesses: The main idea is specific to the problem considered by the paper - object detection in LIDAR data, but it is an important problem, so I think this part is fine. My main concern about the paper is the clarity of the text.

Correctness: The method is simple and well justified. The experiments are limited to a single dataset (NuScenes), but this is sufficient to validate the method with an ablation study. The method is compared to very recent papers (CVPR 20, ..), with an improvement of ~3% mAP over the best competing method.

Clarity: That's my main concern about the paper. There are many small language mistakes, mostly in the technical section (Section 3), but they are not the main problem. The proposed method is simple (which, again, is a good thing), but somehow it is difficult to understand from the text. I try to detail below what could be changed to improve the clarity of the text:
- Calling the mapping functions used in the constraint term "cross-view transformers" is confusing, as "transformer" means other things in deep learning (transformers in NLP, spatial transformers).
- Section 3.4 (about the transformers) mentions features, while in fact it is the final outputs that are "transformed".
- Section 3.4 does not say explicitly that the weights in Eq (1) are learned.
- Eqs (3) to (6) seem to use the Euclidean(?) norm, while the authors probably meant some similarity functions.
- Eq (6) is disconnected from the text.
- Figure 1 is very dense and it is difficult to understand the method from it, while it should be possible to convey the method visually in a simple way.
- Mentioning the Hough transform to explain the method did not make the presentation more intuitive for me. The method is probably more related to epipolar constraints (but this is a detail, maybe a matter of taste).

Relation to Prior Work: yes. The Related Work section is pretty clear.

Reproducibility: Yes

Additional Feedback: Update after rebuttal and discussion: I still think the paper can be written in a much better way, but since I am the only one to have a problem with this and the authors promised to improve the writing, I increased my "overall score". My advice to the authors is to make a genuine effort to improve the readability; it would help the paper have a stronger impact. I understand the link with the Hough Transform better now, but alternatively the operation in Eq (1) can also be seen as a smart pooling that takes into account the 3D geometry of the problem. --------- My main motivation for the final score is the text quality. Given the number of papers published at NeurIPS, papers should be as easy as possible to understand. In principle, the problems in the text could be fixed, but there are many of them and it is difficult to expect that the authors can fix all of them for the final version, which does not have a second reviewing stage.

Review 4

Summary and Contributions: The authors propose a novel Cross-view Consistent Network to leverage both RV and BEV for the 3D detection task, and design a new Hybrid-Cylindrical-Spherical voxel representation to extract features for both RV and BEV. Their proposed CVCNet outperforms all published approaches in mAP on the NuScenes dataset.

Strengths: The main idea is very novel and interesting. They first use cross-view features to detect 3D objects, and propose an interesting cross-view voxel representation and cross-view network architecture. Performance on the NuScenes dataset suggests improvements compared with previous methods.

Weaknesses: I found no weaknesses.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback: