NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 270
Title: DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction

Reviewer 1

Update: The F-score results and the naive baseline shown in the rebuttal are good. I agree with R3 that an experiment with more camera variations should be added, which is also promised in the rebuttal.

---

The paper presents a new method for single-image 3D reconstruction. The method generates a signed distance function (SDF), which contains the surface as its zero level set. Implicit representations have very recently become popular for this task, but the presented network architecture makes good use of this representation. The method works similarly to Occupancy Networks, IMNET, and DeepSDF: given an image of an object, the method maps a 3D position to a signed distance value. To improve the reconstruction of details, this paper uses the camera pose (estimated by another network) to project each 3D position to the image plane and look up local image features generated by a VGG encoder. Making use of this local feature look-up is what distinguishes this work. The decoding stage then combines two SDF values generated by decoders processing the local and global features separately.

Originality: The approach estimates the camera pose (i) and combines an implicit representation (ii) with an explicit look-up of local features (iii). None of these three components is new, but the combination is very sound, and the paper shows that it can improve quality and generalization. This is in contrast to approaches that miss something, e.g., neglect camera geometry or use convenient grid representations that do not scale. The paper cites recent and highly related works sufficiently. The section could be improved by explaining the (dis)similarities between the local features used in Pix2Mesh and this work.

Quality: The overall quality of this paper is good. The figures are helpful, and the qualitative examples give a good impression of the quality of the results and the outcome of specific experiments. I liked the experiments which explain the design choices in more detail (Sec. 4.3 and Fig. 8). Further, the experiments include comparisons between the GT camera pose and the estimated camera pose, which point out this approach's dependency on the camera parameters. I miss a comparison with a naive baseline that replaces the second decoder stream by adding to the SDF based on whether each point projects to a foreground pixel (add 0) or a background pixel (add inf) in the image. While this baseline cannot add to the reconstructed volume, it can carve out holes, which would add detail. The quantitative evaluation uses common metrics such as IoU and CD; this is OK but does not do a good job of highlighting this method's main advantage, the reconstruction of details. Consider using the F-score for the evaluation (see Tatarchenko et al., "What Do Single-view 3D Reconstruction Networks Learn?," CVPR 2019).

Clarity: The paper is well written and gives enough information to implement the method. A minor issue is that the statements about "infinite resolution" (l. 46) and "continuously sampling" (l. 209) could be misunderstood; they should be written more clearly to point out that this approach allows the SDF to be sampled freely.

Significance: This work succeeds in improving the visual quality of single-image reconstruction methods, and the experiment with multiple views shows how this work can be extended and its ideas used in related tasks. Quantitatively, the work is on par with other related methods, which could be a limitation of the evaluation metrics. My biggest concern is the missing naive baseline, which would give us more information about the significance of the local feature decoder. Taking everything into account, I tend towards accepting this work.

Questions: Is the signed distance approximately Euclidean for the output of both streams? What does the signed distance function look like for the local stream alone? Does it add only the details, or does it contain the whole object?

Minor mistakes: l. 40: 'A' SDF; l. 58: To the best 'of' our ...; l. 147: grou'n'dtruth
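As a rough illustration of the local feature look-up discussed above (projecting a 3D query point into the image and bilinearly sampling encoder features there), here is a minimal NumPy sketch. The function and variable names are illustrative, not the authors' code, and a simple pinhole camera is assumed:

```python
import numpy as np

def project_points(points, K, R, t):
    """Project Nx3 world-space points to Nx2 pixel coordinates.

    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation
    (world-to-camera), pinhole model assumed.
    """
    cam = points @ R.T + t            # world -> camera coordinates
    uv = cam @ K.T                    # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide

def sample_local_features(feat_map, uv):
    """Bilinearly sample an HxWxC feature map at Nx2 pixel locations."""
    H, W, _ = feat_map.shape
    u = np.clip(uv[:, 0], 0, W - 1.001)
    v = np.clip(uv[:, 1], 0, H - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    f00 = feat_map[v0, u0]            # four neighboring feature vectors
    f01 = feat_map[v0, u0 + 1]
    f10 = feat_map[v0 + 1, u0]
    f11 = feat_map[v0 + 1, u0 + 1]
    return (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
            + f10 * (1 - du) * dv + f11 * du * dv)
```

In the method under review, the features sampled this way would then be fed, together with the query point, into the local decoder stream; the sketch only covers the geometry of the look-up.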

Reviewer 2

To me, the main issue with this paper is that it tries to claim the use of an implicit function as a contribution. Since at least three CVPR19 papers have already presented a similar idea (which is problematic to me, even if formally OK and not grounds for rejection), I think this is a bad strategy for this paper. It has another contribution which I think is valid, useful, and improves results, but which is lost because of this presentation: it does not appear anywhere in the title and not clearly in the abstract. Simply looking at the title and abstract, I would not be likely to read the paper and would simply discard it as one more late paper building on the same idea. I would also have no chance of finding it again if I wanted to cite it.

Clearly not related work, but "PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization" (Shunsuke Saito*, Zeng Huang*, Ryota Natsume*, Shigeo Morishima, Angjoo Kanazawa, Hao Li) has a similar approach using local features; it might be nice to discuss the relation of the two works in the final version.

=== Post-rebuttal: Please include the new dataset and the experiments with more camera variation promised in the rebuttal.

Reviewer 3

Positives:
1) I like the overall approach and the insight that, when answering per-point queries, leveraging corresponding image-based features via reprojection should also help. Similar insights have been exploited in learning-based multi-view reconstruction works, but this approach is novel in the context of single-view 3D reconstruction.
2) The paper is generally well written, easy to follow, and reports the desired ablations (though these do not necessarily support the claims, see below).

Concerns/Comments/Questions:
1) My primary concern (and the main reason for leaning towards rejection) is that the central contribution of using 'local features' does not help empirically. I am judging this in the setting with 'estimated pose' (and not known pose, as this is additional information that is hard to acquire at inference). Judging by Table 3, the 'global network' is slightly better than the proposed approach ('Two stream, est') in IoU, similar in CD, and slightly worse in EMD. This shows that using the additional stream with local features (if the camera pose is predicted) does not necessarily help. Similarly, the improvement of 'Ours_cam' over OccNet is only marginal.
2) I am concerned about certain aspects of camera prediction: a) How are symmetric objects handled? E.g., if a table is square, how can the network predict the true camera pose? b) What is the variation in camera poses? From what I recall, the data from Choy et al. always has a camera pointing towards the origin, with a fixed elevation and cyclo-rotation, effectively leaving only one degree of freedom. If this is indeed the case, it should make camera prediction simple, and I feel that this approach would degrade more in settings with larger camera variation (e.g., actual 6-DoF freedom) compared to methods that do not use explicit camera prediction. It would really help the paper if experiments in settings with larger camera variation were shown.
3) Some additional comments (not central to the rating): a) I am currently evaluating the paper only in the context of results in settings without a known camera; previous approaches do tackle reconstruction with a known camera, e.g., Kar et al., "Learnt Stereo Machines", with a similar ideology of propagating image features to voxels, and this paper would then need to compare to these. I would also strongly recommend renaming 'Ours' to 'Ours + gt cam' and 'Ours_cam' to 'Ours', because the method currently denoted as 'Ours' uses extra information that the baselines do not, and is not really tackling the 'single-view 3D reconstruction' task as normally defined. b) I am curious why 'One stream' with 'ground-truth' is worse than with 'estimated'?

-----

Overall, while the paper has an interesting central idea, the empirical results (in the setting with predicted pose) do not convince the reader that it adds significant additional value in terms of improving performance, as the results (Table 3) are mixed, or at best indicate marginal improvement.

----------

Updates after rebuttal: I think the primary argument made in the rebuttal is that (in the setup with a predicted camera), even though the quantitative results are only marginally better, the qualitative results are more impressive, and I think that is true. Additionally, the reported F-score metric shows slightly larger gains, and I'd overall be happy to increase my rating to marginal accept. That said, I really do hope the experiment with more camera variation will be added to the main paper as the authors promised in the rebuttal, as the current rebuttal experiment only shows that the reprojection error is smaller, not that it helps in the downstream task.
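For reference, the F-score that R1 suggested and that the rebuttal reports (Tatarchenko et al., CVPR 2019) is the harmonic mean of precision and recall between predicted and ground-truth point clouds at a distance threshold. A minimal NumPy sketch follows; the brute-force pairwise distances and the default threshold value are illustrative (a KD-tree would be used at scale):

```python
import numpy as np

def fscore(pred, gt, tau=0.01):
    """F-score between two point clouds (Nx3 and Mx3) at threshold tau.

    precision: fraction of predicted points within tau of some GT point;
    recall: fraction of GT points within tau of some predicted point.
    """
    # brute-force pairwise distances; fine for small clouds
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Unlike IoU or CD, this metric directly counts how many surface points are reproduced within a tight tolerance, which is why it better reflects the reconstruction of fine details.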