NeurIPS 2020

Canonical 3D Deformer Maps: Unifying parametric and non-parametric methods for dense weakly-supervised category reconstruction

Meta Review

The initial scores for this paper were divergent: 6 (marginally above the acceptance threshold), 5 (marginally below the acceptance threshold), 7 (a good submission; accept), 4 (an okay submission, but not good enough; a reject).

The positive points praised by the reviewers were:
- The work addresses a hard problem with an interesting and timely approach, putting together existing components and including technical details that improve results.
- Compelling quantitative and qualitative improvement over existing work.
- Reasonable ablations.

The negative points:
- The contributions of the work are not convincingly evaluated.
- Other missing experiments testing important scenarios.
- Rather ad-hoc losses and unclear motivation for parts of the model.
- Some missing citations.
- Missing discussion of limitations.
- Several unclear points in the writing.

The authors provided a rebuttal, which addresses some of the weak points. This paper had an extensive and healthy post-rebuttal discussion among the positive and negative reviewers; the key points of that discussion are summarized at the end of this meta-review. As an outcome of that discussion, R2 raised their score from 4 to 6. The rest of the reviewers kept their original ratings. The final scores are: 6, 5, 7, 6.

However, all reviewers agree that the paper should address the following three points before publication. This was discussed with the AC, and the AC has also discussed the issue with the SAC. As there is no "conditional accept" decision at NeurIPS, the AC suggests (after reading the paper and consulting with the reviewers and SAC) to Accept the paper, trusting the authors to address these three points (listed as A.-C. below) in the final version.

Summary of requested experiments for the final version:
A. Replace only the representation (mesh vs. implicit) while keeping everything else fixed.
B. Do the experiment where only one MLP network is used, initialized with the rigid-SfM output.
C.
Add qualitative results on PASCAL 3D+ in the paper.

######################################################

Below is the (anonymized) summary of the key exchanges among the reviewers in the post-rebuttal discussion.

1. Concerns over the lack of contribution: is this just CMR/CSM plus an AtlasNet-like implicit representation? The authors respond that being able to learn 3D shape through the texture loss is a big part of why C3DM is more than CMR/CSM, as CMR/CSM did not do this. This is true -- CMR did not backprop on the texture loss. However, a CVPR'20 work from Henderson et al. shows that you can. This paper may not have been known to the authors (CVPR happened around the NeurIPS deadline), so I'm fine if they correct and discuss this point in the main paper. To me, these are the main differences between CMR, CSM, and the proposed approach: (i) CMR is akin to a direct method -- backpropagation through the texture results in a photometric-like loss (it's not quite a photometric loss, since a perceptual loss is used instead, but it's close enough); (ii) CSM learns to establish correspondences from image pixels to a fixed shape template that does not adapt to the depicted shape (their articulated-CSM follow-up, CVPR 2020, allows the template to deform, but the shape deforms based on a semi-manually defined skeleton, which does not have the capacity to capture surface details); (iii) the proposed approach learns to establish correspondences from image pixels to the parameterized surface of a (C3DPO) shape basis that then deforms to the depicted shape. In the classical debate of direct versus correspondence methods, I view the proposed method as belonging to the latter camp. My hypothesis is that, similar to how correspondence methods played out in the late 90s and 2000s, the proposed approach may be less susceptible to local minima than direct methods during shape-fitting optimization.
But I think there is room to investigate this issue more fully, which may be outside the scope of this paper. Although I think (iii) is still a hybrid of CMR and CSM (but still with known keypoints), I'm changing my mind on this: I find this combination a reasonable idea.

2. Rebuttal: I'm a bit confused about the last two sentences in the first paragraph of the rebuttal. The first sentence indicates that one can learn the shape (due to reprojection) regardless of the texture quality, contradicting the earlier claim that C3DM can learn shape through texture. I'm also confused about "note that without appearance cues CMR fails to reconstruct faces". CMR on faces doesn't look good in the paper already; is it worse without the texture loss? That would contradict their earlier point that in CMR texture does not affect shape/pose. I also found the last two sentences in the first paragraph of the rebuttal to be unclear. We could ask that the authors clarify the point they were trying to make in the final version.

3. Mesh vs. implicit: There is a new experiment in the author response, but it seems that for training with a mesh, they replaced the min-k perceptual and reprojection losses with CSM's cycle consistency, and that gets worse results. Why remove the min-k perceptual loss as well? The right thing to do is to keep all losses equal and just use the mesh representation during training (please correct me if this is not possible and I'm not seeing it). It may point out that the min-k perceptual loss is the most important loss. This relates to one of my original concerns: as pointed out by other reviewers, this paper proposes new losses, and the experimental protocol does not really identify what produces the good results (is it the representation? the new losses?). The provided ablation study does not answer this, as the representation is fixed there. The new results in the response unfortunately also elude this question.
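For context on the loss under discussion: a min-k aggregation typically averages only the k smallest per-hypothesis errors, so the model is scored only on its best few hypotheses and poor hypotheses are not penalized. A minimal sketch of one common form (illustrative names; not the authors' implementation):

```python
import numpy as np

def min_k_loss(per_hypothesis_losses, k):
    """Average of the k smallest per-hypothesis losses.

    per_hypothesis_losses: shape (H,), one scalar (e.g. perceptual)
    loss per shape/texture hypothesis. Only the best k hypotheses
    contribute to the training signal.
    """
    losses = np.sort(np.asarray(per_hypothesis_losses, dtype=float))
    k = min(k, losses.size)
    return float(losses[:k].mean())

# e.g. with hypothesis losses [3.0, 1.0, 2.0, 5.0] and k=2,
# only the two best (1.0 and 2.0) are averaged, giving 1.5
```

Under this reading, the loss is representation-agnostic, which is why keeping it fixed while swapping mesh for implicit surfaces seems feasible.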
I agree with your point that they should have also tried the explicit representation with the min-k loss (I also don't see why they couldn't use it). This apparently important min-k loss is not much of a focal point in the paper, and from the ablation study in Table 1 it does not seem to be especially important. More critically, that ablation does not offer the experiment of a straightforward CSM+CMR versus their model. The number reported in the rebuttal touches on this, but it still includes the basis -- so maybe that is the most important part? (Table 1 in the paper seems to indicate that this is so.) I feel this is key information to add so that others can build on the work and learn what made it work well.

4. The explicit basis is not clearly motivated. My question "However, in principle can't the non-linear MLP basis learn this all?" is not addressed. An AtlasNet sphere should also be able to capture multiple deformations with an implicit non-linear basis function. What I wanted is an ablation where only one AtlasNet sphere is used instead of multiple bases. Getting an AtlasNet sphere to work out of the box may be tricky due to surface self-intersection, which causes discontinuities during optimization. Why do you need multiple of these implicit AtlasNet-sphere-like representations? Just use one implicit AtlasNet-sphere MLP instead of a set of MLPs; it is non-linear, so it should technically be able to do this with one (DeepSDF, for example, does not use multiple MLPs for a single category). Table 1 removes the "basis", so it seems like this is the experiment. However, there is no real discussion of that ablation other than the caption, making it hard to tell what exactly it means. If you look closely, the text says that only the first term of eq. (4) is used, so it is only the basis-matching loss that is removed; the representation still uses multiple MLPs in all experiments. 5.
PASCAL3D+: I am quite concerned about introducing a benchmark on PASCAL3D+ categories (Chair and Bus) for which the authors do not show any qualitative results.
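Returning to point 4: to make the single-network alternative concrete, an AtlasNet-sphere-style parameterization maps points sampled on a unit sphere through one MLP to points on the reconstructed surface. A minimal numpy sketch with untrained, illustrative weights (a single chart and a single network, in the spirit of the suggested ablation; not the authors' model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(n):
    # draw Gaussian vectors and normalize: uniform points on the unit sphere
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

class SphereMLP:
    """One small MLP deforming the unit sphere into a surface,
    standing in for a single AtlasNet-sphere chart."""

    def __init__(self, hidden=16):
        self.W1 = rng.normal(scale=0.1, size=(3, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(hidden, 3))
        self.b2 = np.zeros(3)

    def __call__(self, pts):
        h = np.tanh(pts @ self.W1 + self.b1)
        # residual connection: the output starts near the sphere, then deforms
        return pts + h @ self.W2 + self.b2

pts = sample_sphere(1024)      # (1024, 3) points on the unit sphere
surface = SphereMLP()(pts)     # (1024, 3) deformed surface points
```

Since the network is non-linear, one such map can in principle represent a family of deformations; the ablation requested in point 4 would test whether the multiple explicit basis MLPs add anything beyond this.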