Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Even though the method only needs 2D supervision, all experiments and visualizations seem to suggest that the approach requires images of single objects with perfectly segmented transparent/white backgrounds. All experiments were done with synthetic renderings of the CAD models themselves, so in practice there is no benefit over 3D supervision from CAD models. From the experiments, one has to take a big leap of faith to assume that the approach works with occlusions and imperfect masks (say, from Mask R-CNN-like systems).

I am not convinced by the ablation evaluation that the geometric regularization is helping: 3D IoU scores of 0.503 vs. 0.502 show no statistically significant improvement. I would like to see the ablation study done across multiple shape categories and with increasing amounts of training data. I wonder whether, with more data (since the proposed method only needs 2D supervision), these hand-engineered 3D regularizations have any benefit at all.

Equation 1 seems to be an approximation of the volume rendering integral. Can you explain the assumptions made there and place it with respect to the computer graphics literature on volume rendering?

Overall the paper is well written and easy to follow, but some sections need improvement. The Figure 2 illustration is not clear until one reads the "Boundary Aware Assignment" paragraph (lines 168-178); referring to that paragraph in the Figure 2 caption would help with clarity. Figure 3 is excellent and very clearly depicts the network architecture; in my opinion, it should come before Figure 2. Please improve the Table 2 headers: this is the most important quantitative study in the paper, describing the ablation of each component.

This is a fast-moving field, but I think these are highly relevant related works that the authors missed. DeepSDF (Park et al. 2019) and Occupancy Networks (Mescheder et al. 2019) both use a similar implicit representation and a shape decoder conditioned on 3D location.
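On the Equation 1 question above: for reference, the standard emission-absorption volume rendering integral from the graphics literature, against which the authors could position their formulation, is

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,c\big(\mathbf{r}(t)\big)\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma\big(\mathbf{r}(s)\big)\,ds\right),
```

where $\sigma$ is the density along the ray $\mathbf{r}(t)$ and $T(t)$ is the accumulated transmittance. A silhouette/occupancy formulation typically replaces the color term with a constant and the density with an occupancy value. (The notation here is the standard graphics-literature one, not necessarily the paper's.)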
2D supervision from masks for predicting shapes in a TSDF (implicit) representation has been done before (3D-RCNN, Kundu et al. 2018), though not in an end-to-end fashion. Why choose to have the implicit values in the range [0, 1] as occupancy probabilities instead of having a zero-level set? Isn't it beneficial to have a zero-mean prediction, as in a standard Signed Distance Function (SDF)? Also, if the implicit representation in this paper is indeed an occupancy probability, then Differentiable Ray Consistency (Tulsiani et al. 2017) has already demonstrated how to backpropagate through occupancy values collected along a ray for single-view 3D shape learning.
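To make the Differentiable Ray Consistency point concrete, here is a minimal sketch (my own illustration, not code from either paper) of how per-sample occupancy probabilities along a ray yield differentiable ray-termination probabilities, the quantity DRC-style losses backpropagate through:

```python
import numpy as np

def ray_stop_probabilities(occupancies):
    """Given per-sample occupancy probabilities o_i in [0, 1] along a ray,
    return the probability that the ray first terminates at each sample,
        p_i = o_i * prod_{j < i} (1 - o_j),
    plus the leftover probability that the ray escapes without hitting anything.
    """
    o = np.asarray(occupancies, dtype=float)
    free = np.cumprod(1.0 - o)                       # prob. of passing samples 0..i
    prior_free = np.concatenate(([1.0], free[:-1]))  # prob. of reaching sample i
    stop = o * prior_free
    escape = free[-1]
    return stop, escape

# Example: an empty sample, two half-occupied samples, then a solid one.
stop, escape = ray_stop_probabilities([0.0, 0.5, 0.5, 1.0])
```

Because every operation is a product of smooth terms, gradients flow to each occupancy value on the ray, which is what enables learning from 2D masks alone.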
This paper addresses the unsupervised 3D shape generation problem by using an implicit surface function. The paper is well written and easy to follow. My biggest concern is that, while the experiments show the implicit function can achieve better results than explicit representations, the training and inference times are not reported or compared. How the authors infer the 3D shape at evaluation time, and what resolution they adopt, should also be described in more detail. Another question is how the support region radius affects the prediction.
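On the evaluation question: the paper does not spell out its inference procedure, but the usual approach for implicit occupancy models is to evaluate the decoder on a dense grid, threshold at 0.5, and extract a mesh with marching cubes. A minimal sketch of the grid-evaluation step (names are hypothetical; `implicit_fn` stands in for the trained decoder):

```python
import numpy as np

def extract_occupancy_grid(implicit_fn, resolution=32, bound=1.0, threshold=0.5):
    """Evaluate an implicit occupancy function on a dense resolution^3 grid
    over [-bound, bound]^3 and threshold it; a mesh could then be extracted
    from the resulting boolean volume with marching cubes."""
    axis = np.linspace(-bound, bound, resolution)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    pts = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)
    occ = implicit_fn(pts).reshape(resolution, resolution, resolution)
    return occ >= threshold

# Toy implicit function for illustration: a sphere of radius 0.5.
sphere = lambda p: (np.linalg.norm(p, axis=-1) <= 0.5).astype(float)
voxels = extract_occupancy_grid(sphere, resolution=16)
```

This is also why reporting the evaluation resolution matters: inference cost grows cubically with it, which is the comparison to explicit representations the review asks for.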
The idea makes complete sense, and the paper introduces a couple of techniques to make it work. The results are far better than the state of the art in other representations. The paper is well written. Other comments: 1) Would it be better to use anisotropic kernels to capture shape structures? 2) Can the implicit representation be made hierarchical? 3) It would also be interesting to see how the results change when the viewpoint of the input image changes. 4) What does equation (6) approximate in the limit? Curvature?