Reviews: Learning elementary structures for 3D shape generation and matching

Summary: This paper presents an interesting and novel idea: learned shape bases could outperform hand-crafted heuristic ones. However, the method and experimental setup make it difficult to draw clear conclusions, which diminishes the impact.

Post-rebuttal: The authors gave convincing responses to the questions about the AtlasNet comparison and about the number of parameters, so the final score is raised from 6 to 7.

Originality:
- As stated above, the paper has an interesting and novel high-level idea. Whether the proposed method is really learning shape bases rather than using heuristic bases is a matter of interpretation, though. The input to the method is still K unit squares or a human template, so whether the output bases are truly learned or are just fixed intermediate activations is a subtle distinction, especially given that they can have more than three dimensions and are treated only as features in the MLP case. Particularly in the multi-class case, the elementary structures don't look very much like their output deformations. Is there a more convincing demonstration that the difference is more than just extra degrees of freedom in the model?
- An important related work is missing: "Learning Category-Specific Mesh Reconstruction from Image Collections" by Kanazawa et al. That work learns a mean shape per class, as a deformation of a sphere, to do reconstruction. This is closely related to learning consistent deformations of squares per structured component to do reconstruction.

Quality: It is currently unclear whether the AtlasNet result should be treated as a true ablation or as a baseline comparison. On one hand, AtlasNet10 seems quite similar to the method, but without the extra learning module layers. On the other hand, it isn't exactly the same, and the paper calls it a baseline. It seems really important to show both results. First, two real ablations of the method: one that just removes the learning modules, and another that replaces them with extra parameters, to show that the improvement in performance is not just coming from extra degrees of freedom. Second, a real comparison to AtlasNet: the authors provide pretrained weights for 25 squares on point cloud inputs. Comparing to this would be a much more trustworthy baseline for establishing a clear performance improvement.

The claim that the elements correspond to parts or an abstraction seems questionable. The elements are not internally connected in point learning; for example, the teal airplane structure learned in Figure 3(a) contains engines, fuselage, and the tail, while the purple part is more tail, more fuselage, and part of the wing. In Figure 4(d) the teal part is all wing in one example but describes half the fuselage in another. What do the reconstructions and decompositions look like for examples with greater structural variation, e.g., fighter jets or biplanes in the plane category?

The Figure 5 result is quite interesting: the convergent behavior regardless of initialization indicates that there is some kind of useful warp to the initial template. Would this consistency hold for the ShapeNet case?

Clarity: Is the evaluation in Table 1 on the train set or the test set? This review assumes test, but it isn't very clear, since the table says 'single-category training' and 'multi-category training'. If it is the training set, please explain why the performance differences could not simply be explained by the parameter count of each model.

Significance: The paper could definitely be significant to future research if it were clear that there is improved performance attributable to learning shape building blocks. Currently the results are promising, but not conclusive enough to establish a real win. Also, the method still needs heuristic bases, such as an input human mesh or K squares, which somewhat diminishes the significance of the proposed 'learned' elementary structures.

Summary
This paper proposes a pipeline that models 3D shapes by decomposing them into learned elementary structures, which can be used for shape generation and matching. The authors build their pipeline on AtlasNet, adding elementary structure learning modules so that the elementary structures are not fixed but learned from data. Two approaches are introduced: one based on deformation, and another that directly learns a translation for each point. The authors then discuss the design of the loss function: if point correspondences across training examples are available, a simple squared error can be used; if not, the Chamfer distance can be used as the loss. Finally, the authors demonstrate the performance of the proposed model on shape reconstruction on the ShapeNet dataset and on human shape reconstruction and matching on the SURREAL dataset.

Strengths
- The idea of learning elementary structures from data is novel. By learning from data, the model is more likely to discover meaningful structures.
- The results look impressive. As shown in Figure 3, the proposed method successfully learns some meaningful structures (e.g., the tailplane in Figure 3(b)).

Weaknesses
- The readability needs improvement. For example, the notation and the names of modules are confusing. In Figure 2, t_1 to t_K are bold while d_1 to d_K are not. In line 101, p_1 to p_K are called positioning modules, while in line 144 they are called adjustment modules. Making all of these consistent would help readers understand the paper more easily.

Comments after the rebuttal
Thanks to the authors for the rebuttal. The results in Figure 2 look good, but are still not particularly impressive, so I have kept my rating.
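For readers less familiar with the two losses mentioned in the summary above, the following is a minimal sketch of the distinction, not the authors' implementation. The function names are hypothetical, and the Chamfer normalization (mean of squared nearest-neighbour distances, summed over both directions) is one common convention assumed here; the paper may use a different one.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def correspondence_loss(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared error, valid only when pred[i] is known to match target[i]."""
    if len(pred) != len(target):
        raise ValueError("point sets must have the same length")
    return sum(squared_dist(p, q) for p, q in zip(pred, target)) / len(pred)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (assumed convention: mean squared
    nearest-neighbour distance, summed over both directions)."""
    a_to_b = sum(min(squared_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(squared_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Same two points, listed in a different order.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shuffled = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
loss_with_order = correspondence_loss(cloud, shuffled)  # penalizes the ordering
loss_without_order = chamfer_distance(cloud, shuffled)  # ignores the ordering
```

The squared error depends on the point ordering, so it needs known correspondences, whereas the Chamfer distance compares the two sets as unordered point clouds. This brute-force version is quadratic in the number of points and is meant only to illustrate the definition.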

Paper ID:	4037
Title:	Learning elementary structures for 3D shape generation and matching
