Summary and Contributions: The paper proposes a network that assembles man-made shapes given the set of their parts. The network computes an SE(3) transformation for each part such that the union of all transformed parts generates the original shape. The main insight is that the parts of a shape can be represented as a graph, enabling the use of graph neural networks, but that the connectivity of the graph should correspond to the way the parts are positioned in the shape.
Strengths: - The paper presents a novel problem, and an interesting and creative solution. - Being the first on this problem, the paper also proposes interesting and non-trivial alternative approaches as baselines for comparison, hence justifying the choice of the approach.
Weaknesses: - The paper could still use some more justification of design choices. How does the number of iterations affect the results? - How well does the system generalize to different levels of detail? Different numbers of points? - Some of the parts of the system are rather naive. For example, the use of a vanilla PointNet, or the use of q for r. Many other alternatives should be considered, but I am fine with this being in future work. After reading the rebuttal and other reviews, my opinions have not changed much. I think the experiments described in the rebuttal (both those that answer my questions and those that answer the other reviewers') are interesting to the reader, and should be added to the final revision.
Correctness: yes
Clarity: yes
Relation to Prior Work: Some tree-based networks, which I find to be related, are missing. For example: "GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions"
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper tackles the problem of generative 3D part assembly, which predicts a 6-DoF pose for each input part so that the parts assemble into a single 3D shape. The paper proposes a graph learning framework that uses an iterative graph neural network to adjust the relations between parts and infer their poses.
Strengths: + The task is meaningful but it is also relatively under-investigated in the 3D community. + The method is well motivated. Predicting the spatial relationships between object parts by graph inference is a natural idea. + The experimental results look promising, although some results need to be clarified/discussed more.
Weaknesses: My biggest concern is the part aggregation module: - I'm not sure if max-pooling is the best way to aggregate the node attributes among the parts into a single node. If we want to capture the commonalities between different nodes, wouldn't average pooling be more appropriate? Some comparison is probably needed. - How should we interpret the relation between an aggregated part and the other parts? - In the ablation study, is "Our backbone w. relation reasoning" the model with the aggregation module? - It would be great if the authors could discuss the results in Figure 3 in more depth. First, the relations look inconsistent. For example, seat is strongly related to back while back is least related to seat; arm is most related to leg but leg is hardly related to arm. Second, why is leg most related to leg, given that they are not connected? Is it the effect of the aggregation module?
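To make the pooling question above concrete, here is a minimal sketch of the two aggregation options. It is an illustration only, not the authors' implementation: the function names `max_pool` and `mean_pool` and the feature values are hypothetical, and the features are plain Python lists rather than learned node attributes.

```python
from typing import List


def max_pool(features: List[List[float]]) -> List[float]:
    """Element-wise maximum over a group of node feature vectors."""
    return [max(column) for column in zip(*features)]


def mean_pool(features: List[List[float]]) -> List[float]:
    """Element-wise mean over a group of node feature vectors."""
    return [sum(column) / len(column) for column in zip(*features)]


# Hypothetical features for four geometrically equivalent parts
# (e.g. chair legs); the values are made up for illustration.
leg_features = [
    [0.9, 0.1, 0.4],
    [0.8, 0.2, 0.4],
    [0.9, 0.1, 0.5],
    [0.1, 0.9, 0.4],  # one node that disagrees with the others
]

pooled_max = max_pool(leg_features)    # keeps the largest value per channel
pooled_mean = mean_pool(leg_features)  # averages each channel over all nodes

print("max pooling: ", pooled_max)
print("mean pooling:", pooled_mean)
```

In this toy case the single disagreeing node sets the second channel of the max-pooled vector, while the mean-pooled vector stays closer to what the majority of nodes share. Whether that difference matters for assembly quality is exactly what the requested comparison would show.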
Correctness: The method looks correct, although I haven't checked the maths in detail.
Clarity: The paper is reasonably clear, but some notations could be improved. For example, in L168-169, using t and t+1 to represent odd and even numbers is informal. In Eq 6, instead of "k in V", it should be "v_k in V".
Relation to Prior Work: The paper explains the difference in detail, but it could be further clarified. In L30-32, the authors mention that some previous work "assume certain part priors, such as a known number of parts", and "we assume no semantic knowledge upon the input parts". However, in L95 it says that "we assume to know the part count in each group."
Reproducibility: Yes
Additional Feedback: I read the rebuttal and comments from other reviewers, and I am leaning towards acceptance.
Summary and Contributions: This paper aims to solve the 3D part assembly problem of a set of shape parts in a given 3D point cloud representation. The authors propose to apply an iterative graph neural network as a backbone, and predict part assembly in a coarse-to-fine way. Both the quantitative and qualitative results are promising.
Strengths: 1. The proposed dynamic graph learning framework is novel and technically sound. 2. Both the quantitative and qualitative results are convincing compared to the proposed three baselines. 3. The ablation study in Table 2 reflects the efficacy of using graph learning and relation reasoning. The additional ablation study provided in the supplemental is detailed.
Weaknesses: My main concern is the lack of clarification on the experimental details of the baseline methods. It's not clear to me whether the baseline methods are trained with the same losses and with the same termination strategy (the same number of epochs, or stopping when the best score on the validation set is reached). A minor concern is the missing ablation study of the training losses. I'm surprised that, without a connectivity loss/constraint, directly optimizing the pose of each part gives visually reasonable assembly results. It would be better to have more discussion of this. After reading the rebuttal, my concerns are mostly resolved.
Correctness: The proposed dynamic graph learning framework is technically sound.
Clarity: The paper is easy to follow.
Relation to Prior Work: Yes.
Reproducibility: No
Additional Feedback: Given my above concerns, I would rate the paper as marginally above the threshold. I will re-evaluate the scores based on the authors' feedback. Update: after reading the rebuttal, my concerns are mostly resolved. I'm inclined to accept.
Summary and Contributions: The paper presents solid work on conditioned shape generation, which is a large subfield of 3D computer vision. At the same time, the authors show that their part assembly setting differs from related methods in that it uses rigid part point clouds without any prior semantic knowledge.
Strengths: The method in the paper leverages an efficient and popular combination of dynamic graph learning and coarse-to-fine learning, which has already shown good results in predicting relationships between objects in 3D. As there is no complete shape in the input, the novelty of the work is clear.
Weaknesses: 1. It is hard to adapt existing works on 3D shape generation (like GRASS, PointEdit, SAGNet) to the current setting. 2. It could also be useful to see timings for network inference. 3. It would be very interesting to see what happens if one makes a slight modification and replaces the 6-DoF pose prediction with a 9-DoF prediction, that is, one augmented with a scale vector.
Correctness: The experimental section seems comprehensive. However, what first comes to mind is that the proposed metrics focus only on comparison with the ground truth, whereas parts can be assembled into plausible shapes in a large number of ways. In the paper we can see some examples where the baseline methods (and also the proposed one) predict shapes that differ from the ground truth but are still similar to real-world objects. No perceptual metrics are used in the paper to estimate the visual quality of the shapes, which is as important as fitting accuracy.
Clarity: The paper is organized very well, the structure is clear.
Relation to Prior Work: There are several details in the problem setting that can identify it as a new problem formulation.
Reproducibility: Yes
Additional Feedback: