Review for NeurIPS paper: Learning to Orient Surfaces by Self-supervised Spherical CNNs

NeurIPS 2020

Review 1

Summary and Contributions: A spherical CNN is used to predict canonical orientations for point clouds.

Strengths: Results seem better than recent state of the art methods.

Weaknesses: The task seems rather simpler than the one considered in 'Fully convolutional geometric features', ICCV 2019. The point of orienting surfaces seems to be to perform matching. But if state-of-the-art matching can be performed without spherical CNNs, is orienting surfaces actually a useful task? Are spherical CNNs better than non-spherical CNNs in terms of (computational) performance vs. accuracy? There is no accounting for the computational cost of the method (FLOPs or run time). Table 2: I think the more interesting comparison would be against the alternative methods trained with rotational data augmentation. If the only weakness of the other methods is that they need 3-degree-of-freedom data augmentation, the case for Compass is a lot weaker.
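For concreteness, the 3-degree-of-freedom rotational augmentation mentioned above could look like the following minimal sketch. It assumes point clouds stored as (N, 3) NumPy arrays; the function names are illustrative and are not taken from the paper or from any baseline's code.

```python
import numpy as np

def random_rotation(rng):
    """Sample a rotation uniformly from SO(3) via a random unit quaternion."""
    q = rng.normal(size=4)
    q /= np.linalg.norm(q)  # a normalized Gaussian 4-vector is uniform on the 3-sphere
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def augment(points, rng):
    """Rotate an (N, 3) point cloud by a freshly sampled SO(3) rotation."""
    R = random_rotation(rng)
    return points @ R.T, R

# Usage on a synthetic cloud (hypothetical data).
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
rotated, R = augment(pts, rng)
```

Applying `augment` to every training sample at every epoch is the usual way such augmentation is done; whether the baselines in Table 2 would close the gap under it is exactly the open question raised above.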

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: L10: "Ortoghonal" should be "Orthogonal".
Reply to rebuttal: There seem to be two main claims in the paper.
1. That their method is useful for feature matching. The results in Table 1 don't really evaluate feature matching (versus FCGF, D3Feat, and related work). They instead optimise an intermediate objective (creating LRFs), the importance of which is not very clear. In the rebuttal they refer to D3Feat to explain their motivation. However, the D3Feat paper actually says "..., as also observed in FCGF [2], we find that a fully convolutional network (e.g., KPConv [33] as used in this paper) is able to empirically achieve strong rotation invariance through low cost data augmentation, as shown in the right columns of Table 1." This seems to be the exact opposite of what they claim, namely that establishing LRFs is useful.
2. That their method is useful for object classification. The evaluation here seems unsatisfactory, as (a) they consider simple CAD models with no scanning occlusion and limited interclass variation, and (b) they compare to uncompetitive baselines they trained themselves in an obviously flawed way (without appropriate data augmentation) to make an academic point about their method's rotation invariance, not to measure real-world usefulness.

Review 2

Summary and Contributions: This paper presents a method to rotate 3D point sets to a canonical orientation in a self-supervised manner. The key insight is that using rotation-equivariant networks to predict a canonicalizing rotation, and training them in Siamese fashion, allows canonicalization. The paper realizes this using Spherical CNNs. Experiments on two tasks show improved performance on extracting features for matching and on shape classification.

Strengths: I really like the central insight of this paper. Here are the positives.
- The idea of canonical rotations (called "poses" in the paper) is really neat. There is ample evidence to show that this is a useful property to have and can be used in a variety of downstream tasks. Figure 1 is great.
- The paper is well motivated and the results are promising.
- The idea of using spherical CNNs to achieve canonicalization is great.

Weaknesses: I would like to see this paper accepted -- but I am torn because the paper feels unfinished. It falls short in theoretical/experimental rigor and in writing quality. I list my questions, comments, and suggestions below, which I hope can help address the shortcomings.
- First, the paper refers to canonical orientation as "pose", which is misleading. The term "pose" usually refers to position and orientation, i.e. SE(3) transformations, whereas what the paper achieves is equivariance to SO(3). This should be clarified.
- The description of spherical CNNs mostly made sense as it follows [6]. But I found it a bit hard to follow Section 3.2 without reading it multiple times. Part of the reason is some abuse of notation in equation (4). g^{-1}(.) in reality is an SO(3) rotation, but the way it is defined makes it look like it is a point cloud (since g(.) maps from P --> SO(3), g^{-1} should map from SO(3) --> P). It may make sense to change this notation. Moreover, the text explanations at line 135 are not very clear or succinct.
- Section 3.2 does not make it clear that equivariance enables canonicalization only because of the property of spherical CNNs that they output features in SO(3). Any neural network without this output domain will not be canonicalizing (unless it explicitly predicts rotation). To me, this is the *main insight* of the paper and it is not highlighted anywhere.
- For the loss function in (6), is it applied once for each layer or once for the whole network? Why?
- How does the method handle permutation equivariance for the points? Is a standard lexicography adopted for the sphere cells? What about the permutation-equivariant components of PointNet in the shape classification experiment?
- In the local surface patches experiment, how does the overlap between patches help? Does more overlap improve results?
- I would have loved to see visual and quantitative results of the canonical patches discovered by the network.
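To make the notation point concrete, one reading of the canonicalization argument is sketched below. It assumes the convention $T = RV$ and the equivariance property $g(RV) = R\,g(V)$, with $g(\cdot)$ the SO(3) rotation read off the network output; $V_c$ and $T_c$ are names introduced here for the canonicalized clouds and may not match the paper's symbols. Writing $g(V)^{-1}$ (inverse of the returned matrix) rather than $g^{-1}(V)$ avoids the ambiguity with the inverse mapping.

```latex
% Assumptions: T = R V and g(R V) = R g(V), with g(.) in SO(3).
\begin{aligned}
V_c &= g(V)^{-1}\, V, \\
T_c &= g(T)^{-1}\, T
     = \bigl(R\, g(V)\bigr)^{-1} R\, V
     = g(V)^{-1} R^{-1} R\, V
     = V_c .
\end{aligned}
```

Under these assumptions both copies land in the same canonical orientation, which is why an output domain of SO(3) matters: the network output itself transforms like a rotation.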

Correctness: Mostly yes. Some notation abuse in equation (4) onwards for g^{-1}(.)

Clarity: The quality of writing in the paper can be significantly improved. For instance, the Standard Flow in line 126 does not make sense because the conversion of point clouds to spherical signals is described later in line 161 onwards. Section 3 is hard to read and can benefit from re-writing.

Relation to Prior Work: Not enough. The paper is missing discussion of the following papers on canonicalization.
- C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion, Novotny et al. 2019
- Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al. 2019
- Multiview Aggregation for Learning Category-Specific Shape Reconstruction, Sridhar et al. 2019

Reproducibility: Yes

Additional Feedback: Thanks for the rebuttal. The writing can be improved -- overall I am leaning slightly on the positive side.

Review 3

Summary and Contributions: The paper presents an approach for self-supervised learning of a NN that maps identical (up to noise and occlusion) 3D point clouds into identical canonical orientations. Having such a NN is demonstrated to enable two tasks: learning rotation-invariant 3D descriptors and rotation-robust point cloud classification. The basic idea is to learn the filters of a NN $g$ defined on directional signals. First, the point cloud is converted into a direction-distance representation: the first two dimensions are angles, the third one a distance. Training then takes two copies of those patches, the original one, $V$, and a rotated one, $T$, transformed by a known rotation $R$. $g$ is trained to output an activation map that has a maximum at one bin $i,j,k$. That bin maps to a certain rotation (orientation and rotation around it). The loss then requires that the transformation between the two maximal activations on the two point cloud copies is exactly $R$. Applying this transformation, the two point clouds can be aligned, perhaps not semantically (as in a cup being upright), but consistently. This step is then a preprocess for (or learned jointly with) another task, such as classifying the point cloud or computing descriptors for matching. Results on classic benchmarks (ShapeNet etc.) show a marked increase in performance, in particular for randomly rotated objects.
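The "maximum at one bin $i,j,k$" step is where a soft argmax would come in. A minimal NumPy sketch follows, assuming the activation map is a plain 3D array; the name `soft_argmax_3d` and the `temperature` parameter are illustrative choices, not the paper's implementation. It averages raw bin indices, so it ignores the periodicity of angular axes, and the mapping from bin index to an actual rotation is left out.

```python
import numpy as np

def soft_argmax_3d(activations, temperature=1.0):
    """Differentiable stand-in for argmax: expected (i, j, k) under a softmax."""
    a = np.asarray(activations, dtype=float) / temperature
    w = np.exp(a - a.max())  # subtract the max for numerical stability
    w /= w.sum()             # softmax weights over all bins
    grids = np.meshgrid(*[np.arange(n) for n in a.shape], indexing="ij")
    # Weighted average of each index coordinate.
    return np.array([(w * g).sum() for g in grids])

# Usage: a sharply peaked map returns (almost exactly) the peak's index,
# while a flat map returns the centre of the grid.
act = np.zeros((4, 5, 6))
act[1, 3, 2] = 50.0
peak_idx = soft_argmax_3d(act)
flat_idx = soft_argmax_3d(np.zeros((4, 5, 6)))
```

Unlike softmax, which returns a distribution over bins, the soft argmax returns a (fractional) location, and unlike a hard argmax it has useful gradients with respect to the activations.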

Strengths:
-- Strong technical idea
-- Good exposition of a difficult (I thought) topic
-- Good results for a focused analysis, executed fairly
-- I have not seen soft argmax (which is easily confused with softmax at first) before and find it a useful tool

Weaknesses:
-- Not easy to understand.
-- I could not understand the relation between SO(3) and this orientation/distance binning. I could not see what it really is (it appears this is just a spherical grid) and what the relation of i,j,k is to a specific rotation. That seems to follow from the spherical CNN's properties, but it did not get across to me.
-- Also in the analysis, some simpler competitors like PCA could be compared to. PointNet already has a means to output a rotation to work in a consistent (local) frame. Why does it fail? Probably because it does not use spherical CNNs? The analysis could even consider a downgraded version of PointNet that does not use this transformer.
-- The protocol for data augmentation with respect to occlusions was surprising: I would have expected some simulated occlusions, as in BlenSor, but instead some concentric shells that drop some points are used. This does not seem very representative of true occlusions. But it could also be seen as a strength that results are good even with such a primitive occlusion model. So with a better model, which might be easy to include, results could become even better.
-- I was unsure why this needs spherical CNNs. What is wrong with a pipeline that would just voxelize the point cloud and encode it to a rotation matrix or (unit) quaternion? Or, more similarly: voxelize the point cloud and produce an SO(3) activation map, where each bin is one rotation. Which property of spherical CNNs is crucial where? Right now the choice of spherical CNNs appears about as principled as using a PointNet, voxels, or graph CNNs.
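The PCA competitor suggested above could be as simple as the following sketch. It assumes an (N, 3) point cloud with distinct principal variances and non-zero skewness along the first two axes; the sign-disambiguation rule (sign of the third moment) is one common heuristic chosen here for illustration, and the function names are hypothetical.

```python
import numpy as np

def pca_canonical_rotation(points):
    """Return a 3x3 rotation whose rows are sign-disambiguated principal axes."""
    centered = points - points.mean(axis=0)
    # eigh returns eigenvalues in ascending order; reverse to get decreasing variance.
    _, eigvecs = np.linalg.eigh(centered.T @ centered)
    axes = eigvecs[:, ::-1].T.copy()  # rows are principal axes
    # Fix the sign of the first two axes using the sign of the third moment.
    for i in range(2):
        if np.sum((centered @ axes[i]) ** 3) < 0:
            axes[i] = -axes[i]
    # The third axis completes a right-handed frame, so det = +1.
    axes[2] = np.cross(axes[0], axes[1])
    return axes

def canonicalize(points):
    """Express the centered cloud in its PCA frame."""
    centered = points - points.mean(axis=0)
    return centered @ pca_canonical_rotation(points).T

# Usage: a skewed, anisotropic synthetic cloud and a rotated copy of it.
rng = np.random.default_rng(0)
V = rng.exponential(size=(500, 3)) * np.array([3.0, 2.0, 1.0])
c, s = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
c, s = np.cos(1.1), np.sin(1.1)
Rx = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
T = V @ (Rz @ Rx).T
V_c, T_c = canonicalize(V), canonicalize(T)
```

On clean, complete data both copies end up in the same frame. The known weakness of this baseline is that noise, occlusion, or near-symmetric shapes can flip or swap axes, which is what a learned method would need to beat.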

Correctness: I did not spot any mistakes. Some smaller typos. L28 it says PointNet achieves rotation invariance by random rotations. That is correct, but not complete. There is also a step in the pipeline to rotate -- very much like what is done here-- into a canonical frame. It just does not seem to work as well as what is suggested here.

Clarity: Yes. I had difficulties following, but maybe that is not the paper's fault.
-- Some typeset matrices as \mathsf.
-- I am always confused by functions like g returning a matrix in Eq. 4. When it then says ^-1, is it the inverse of the matrix or the inverse of the mapping g? Here it is the inverse of the matrix, but maybe some brackets or re-ordering could help? Wikipedia agrees that function inversion and multiplicative inversion ARE confusing notations.
-- Fig. 2 could be linked better to Eq. 4 and 5. I see T and S, but not R, R-star, g etc. In fact, all the funny blocks of the architecture do not matter much to me, if only I could see all relevant operations applied to all relevant point clouds. In particular, what is compared to what in the end for computing a loss?
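One plausible answer to "what is compared to what", written out as a sketch: the relative rotation estimated from the two network outputs, R* = g(T) g(V)^{-1}, is compared with the known R. The geodesic angle used as the distance below is an assumption made for illustration, as are the function names; the paper's exact loss may differ.

```python
import numpy as np

def geodesic_angle(A, B):
    """Rotation angle (radians) of A @ B.T, i.e. the geodesic distance on SO(3)."""
    cos = (np.trace(A @ B.T) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against round-off

def orientation_loss(g_V, g_T, R):
    """Compare the estimated relative rotation R* = g(T) g(V)^{-1} with the known R."""
    R_star = g_T @ g_V.T  # the inverse of a rotation matrix is its transpose
    return geodesic_angle(R_star, R)

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Usage with hypothetical network outputs.
g_V = rot_z(0.3)
R = rot_x(0.5)
loss_equivariant = orientation_loss(g_V, R @ g_V, R)  # output follows the input rotation
loss_ignoring = orientation_loss(g_V, g_V, R)         # output ignores the input rotation
```

Here the matrix inverse is spelled as a transpose applied to the value returned by g, which sidesteps the g^{-1} ambiguity raised above.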

Relation to Prior Work: All I am aware of.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper presents a methodology, named Compass, that learns the canonical orientation of 3D objects and surfaces with a self-supervised spherical CNN. The basic idea is to learn a canonical SO(3) matrix with a Spherical CNN on two sets V, T, supervised by a known rotation matrix R. In terms of contribution, it provides an effective and robust orientation learning method for various 3D vision tasks.

Strengths:
1. The paper is written clearly and well motivated. The theoretical claims, mainly grounded on Spherical CNNs, are sound and clear.
2. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method, and they are explained in detail. Judging by the qualitative results, especially the transfer learning result, the robustness of Compass is promising.

Weaknesses:
1. The proposed idea is mainly grounded on Spherical CNNs (2018). It seems that the current Compass is a simple combination of a Spherical CNN and a self-consistency angular loss, plus some learning techniques like deleting points to handle occlusion. From this perspective, the theoretical novelty of this paper is unclear. I would like the authors to clarify the originality w.r.t. previous works.
2. This one is more of a confusion. In your second experiment, where you replace the T-Net with Compass followed by PointNet, I would assume that Compass can provide a canonical transformation far better than a simple T-Net. But in Table 2, Compass + PointNet cannot beat plain PointNet in the NR scenario. I would like the authors to give some insight into this.

Correctness: I do not see any obvious error in the methodology.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: As I suggest in the weaknesses section, I would like the authors to clarify the theoretical originality w.r.t. previous works.

Reproducibility: Yes

Additional Feedback: Post-Rebuttal: I've read the rebuttal. I think the central idea is interesting, but the concern about empirical evaluation is only partially addressed in the rebuttal. So I have decided to keep my initial score for the submission.