NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5098
Title:Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer

Reviewer 1

The paper is clear and well written; together with the supplement, it should be possible to reimplement the renderer and the experiments. The qualitative experiments on synthetic data are impressive, and the quantitative evaluation shows the performance of the proposed system relative to others. Developing differentiable renderers is important to enable the inference of 3D quantities, such as geometry and lighting effects, from 2D image observations. Since rasterization is still one of the most used rendering techniques in various areas of research (excepting graphics), a differentiable model for this renderer is important to advance the state of the art in the community. The authors show that the proposed renderer can be used to take analytic gradients with respect to all commonly used image formation parameters. This is an important contribution.

Some questions:
1) The renderer is called DIB, but it seems the proper acronym would be IBD (since it seems to stand for Interpolation-based Differentiable). I would make that consistent.
2) Regarding z-buffering: it seems that gradient updates to the vertex locations might change the z-buffering result (i.e. which primitive is the closest one). Do you recompute the z-buffering at each gradient step, or is it fixed based on the initial view when optimizing vertex locations?
3) In Fig. 2 f) you seem to be optimizing over the pose of the teapot. Is it simply a matter of taking gradients all the way down to the pose of the teapot and updating it by gradient descent?
4) It would be useful to get an intuition for the relative timing of the proposed renderer and others, to see that tradeoff, and some more reasoning as to why one should use the rasterization-based renderer versus the ray-tracing one from [15]. Is it simplicity? Computational efficiency? Speed?
5) l. 126: I am not sure what you mean by the alpha channel prediction. Do you simply mean that you store the value of A_i' in the alpha channel at pixel p_i'?
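Question 2 can be made concrete with a minimal sketch of a hard z-buffer test at a single pixel. The function name and the depth values are hypothetical and are not taken from the paper; the sketch only illustrates why the winning face may need to be recomputed after a vertex update.

```python
def closest_face(depths):
    """Hard z-buffer test: index of the face with the smallest depth at a pixel."""
    return min(range(len(depths)), key=lambda i: depths[i])

# Depths of two hypothetical faces that both cover the same pixel.
depths_before = [1.0, 1.2]
# Assumed state after a gradient step has pushed face 0 away from the camera.
depths_after = [1.5, 1.2]

winner_before = closest_face(depths_before)  # face 0 is in front
winner_after = closest_face(depths_after)    # face 1 is now in front
```

If the assignment were frozen at the initial view, the pixel in this toy case would keep reading its attributes from face 0 even though face 1 has become visible.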

Reviewer 2

This work introduces a differentiable renderer based on rasterization. Unlike previous differentiable renderers, DIB-Renderer also supports texture maps, which enables existing 3D object reconstruction methods to predict the texture of the object. DIB-Render is based on rasterization. It interpolates the vertex attributes for each foreground pixel. It also softly assigns vertices to background pixels so that gradients can be back-propagated to the mesh attributes. The renderer supports three different lighting models: Phong, Lambertian and Spherical Harmonics. The authors also show different applications of the renderer, such as single-image 3D reconstruction and the estimation of geometry, texture and lighting conditions. Extensive experiments show that the proposed renderer is superior to existing differentiable renderers and achieves plausible results on the applications the authors propose. The paper is well written and clearly structured. The proposed renderer can be put to use in future 3D reconstruction tasks. Current 3D reconstruction work puts more emphasis on geometry than on texture, partially due to the lack of a differentiable renderer that supports texture maps. This work could result in more realistic 3D reconstruction.
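As a reminder of the simplest of the three lighting models mentioned above, Lambertian shading scales the surface colour by the clamped cosine of the angle between the surface normal and the light direction. The following is a minimal sketch of that textbook model; the function names are hypothetical and this is not the paper's implementation.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def lambertian(albedo, normal, light_dir):
    """Diffuse colour: albedo * max(0, n . l), with n and l normalised."""
    n = normalize(normal)
    l = normalize(light_dir)
    cos_theta = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(c * cos_theta for c in albedo)

# Light arriving head-on leaves the albedo unchanged.
head_on = lambertian((1.0, 0.5, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
# Light arriving from behind the surface contributes nothing.
from_behind = lambertian((1.0, 0.5, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0))
```

Because the output is a smooth function of `light_dir` wherever the cosine is positive, a loss on the rendered colour can in principle be differentiated with respect to the light direction.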

Reviewer 3

The paper describes a differentiable renderer for triangle meshes that provides gradients with respect to geometry, texture, and illumination. As its key innovation, the paper claims the formulation of rasterization as interpolation, saying that "In contrast to standard rendering, where a pixel’s value is assigned from the closest face that covers the pixel, we treat rasterization as an interpolation of vertex attributes". However, the formulation in Equation 1 using barycentric interpolation is textbook material in computer graphics. I do not think that using this standard approach in a differentiable renderer can be claimed as a significant novel contribution. The experimental results on single-image 3D reconstruction and textured shape generation using only 2D supervision are of good quality, but in my opinion they do not provide a significant improvement over previous work on these problems. I consider the fact that this renderer also supports gradients with respect to illumination and texture, which some of the other public implementations do not (or they focus on specific cases, like spherical harmonics illumination), more of an engineering detail. The state of the art here is the work by Li et al. on differentiable ray tracing, which even supports gradients due to indirect illumination. The paper should also discuss "Pix2Vex: Image-to-Geometry Reconstruction using a Smooth Differentiable Renderer" by Petersen et al., and "Unsupervised 3D Shape Learning from Image Collections in the Wild" by Szabo and Favaro. Both use differentiable rendering for single-view shape reconstruction or shape generation using 2D supervision only, as in this submission. In summary, the technical contribution of the differentiable renderer does not seem significant enough to me for a NeurIPS paper. Experiments are performed on standard problem statements (single-image reconstruction, shape generation from 2D supervision) with good results, but previous work achieves quite similar quality.
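For reference, the textbook barycentric interpolation referred to above can be sketched as follows. This is the standard 2D form with hypothetical function names; it is not necessarily the exact notation of Equation 1 in the paper.

```python
def barycentric_weights(p, v0, v1, v2):
    """Barycentric weights of 2D point p with respect to triangle (v0, v1, v2)."""
    (px, py), (x0, y0), (x1, y1), (x2, y2) = p, v0, v1, v2
    # Twice the signed area of the full triangle.
    area = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if area == 0:
        raise ValueError("degenerate triangle")
    # Each weight is the signed area of the sub-triangle opposite a vertex,
    # divided by the area of the full triangle.
    w0 = ((x1 - px) * (y2 - py) - (x2 - px) * (y1 - py)) / area
    w1 = ((x2 - px) * (y0 - py) - (x0 - px) * (y2 - py)) / area
    w2 = 1.0 - w0 - w1
    return w0, w1, w2

def interpolate(weights, attrs):
    """Weighted sum of per-vertex attribute tuples (e.g. RGB colours)."""
    return tuple(
        sum(w * a[k] for w, a in zip(weights, attrs))
        for k in range(len(attrs[0]))
    )

# Unit right triangle with pure red, green and blue at its vertices.
weights = barycentric_weights((0.25, 0.25), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
colour = interpolate(weights, [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
```

The weights are ratios of polynomials in the vertex coordinates, so the interpolated value is differentiable with respect to both the vertex positions and the vertex attributes wherever the triangle is non-degenerate.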

Reviewer 4

-- The model --
The paper presents a differentiable renderer (DIB-Render) that can render a coloured 3D mesh onto a 2D image. Having such a renderer makes it possible, for example, to train a neural network that reconstructs the 3D shape of an object from a single image and renders the shape onto a number of 2D views using different camera configurations. The learning can then be supervised by computing a reconstruction error between the computed rendering of a 3D shape and an actual image (using an L1 loss for the coloured image or Intersection over Union (IoU) for the binary silhouettes).

The renderer is largely based on the soft rasterizer (Soft-Ras) proposed in [18, 19]. Unlike traditional non-differentiable rasterizers, which assign a binary score indicating whether a pixel in the image plane is covered by a triangle or not, Soft-Ras computes a soft score based on the distance of a pixel to the triangle (with an exponential or a sigmoid function of the distance). This makes it possible to compute gradients of image pixels with respect to vertex attributes such as coordinates, colours and so on.

One noteworthy difference between the proposed renderer and Soft-Ras is that the former uses only the closest triangle ("For every such pixel we perform a z-buffering test [6], and assign it to the closest covering face"), whereas the latter uses a soft z-buffering formulation in which the triangles behind the closest one can still receive gradients through a SoftMin function. Since [19] argued that this is an advantage, I would like to ask the authors why they did not use soft z-buffering. Another minor difference is that DIB-Render uses an exponential function (Equation (5)), whereas Soft-Ras uses a sigmoid (Equation (1) in [19]). Now on to what separates the proposed renderer from Soft-Ras.
First, DIB-Render can sample pixel colours not only from the vertices but also from textures in the UV coordinate space, and differentiate with respect to both the texture coordinates and the textures themselves. This makes it possible to train a texture reconstruction network (a UNet-type architecture) alongside the mesh reconstruction network. Secondly, DIB-Render supports different lighting models, including Phong, Lambertian and Spherical Harmonics. Importantly, it can compute derivatives with respect to the lighting direction and thus train a network that predicts lighting. Thirdly, an adversarial loss is applied to the generated 2D renderings to distinguish them from actual images, and is also applied directly in the space of UV texture maps. It helps to improve the crispness of the reconstructed textures. The fourth contribution is a 3D GAN that, at training time, reconstructs a 3D mesh and a UV texture from a noise vector and, since no paired projections are available, is trained only with an adversarial loss. In theory, this lifts the requirement of having multi-view images for training such models.

-- Experiments --
- In the first experiment, DIB-Render is compared with Soft-Ras [18] and Neural Mesh 3D Renderer (N3MR) [12] on the task of single-image 3D shape reconstruction (Tab. 2). DIB-Render outperforms the baselines on both the IoU metric and the F-score between the predicted and ground-truth meshes. However, I have a concern about this experiment. As lighting is not taken into account in this experiment, DIB-Render differs very little from Soft-Ras, and thus I would expect them to perform about the same. It is important to understand what makes DIB-Render better. Is it the difference in z-buffering or something else?
- In the second experiment, DIB-Render is evaluated on the task of texture reconstruction and light direction prediction.
DIB-Render outperforms N3MR both on texture reconstruction accuracy and on the angular difference between the predicted and the actual lighting direction.
- In the third experiment, the use of an adversarial loss is evaluated for shape reconstruction under the Phong lighting model and Spherical Harmonics. Using the adversarial loss yields crisper textures.
- In the fourth experiment, the 3D GAN is evaluated. I was confused, because the model section (Sec. 4.2) states that the "GAN is able to recover accurate shapes, but it fails to produce meaningful textures." (l.223). However, in the experiments (Sec. 5.4) I see the opposite: "this figure demonstrates the high quality of shape and texture generations" (l.299). As far as I can see, at least, the generated textures are not very realistic, which could be explained by the absence of a reconstruction loss (L1), with only the adversarial loss being used. This is consistent with the behavior of the Pix2Pix architecture, which requires an L1 loss to stabilise training and does not work very well without it.

-- Some comments and criticism --
- It is worth citing and discussing the differences from [a], which also uses a smooth differentiable mesh renderer and applies a variety of adversarial losses to the rendered shapes. Another paper, [b], which uses a smooth differentiable renderer of point clouds, is also worth mentioning in the related work.
- The figures with qualitative results are very difficult to read and make sense of, because row and column titles are missing from the figures themselves (Fig. 3, 5-9, Fig. 5 of the Appendix and so on). One has to jump repeatedly between the caption and the figure to understand what each column corresponds to. The figures need to be improved.
- l.166: I believe this should be "light, normal, reflectance and eye, respectively", i.e. "R" is the reflectance direction and "V" is the viewer.

[a] Pix2Vex: Image-to-Geometry Reconstruction using a Smooth Differentiable Renderer. Petersen et al. ArXiv:1903.11149, 2019.
[b] Unsupervised Learning of Shape and Pose with Differentiable Point Clouds. Insafutdinov and Dosovitskiy, NeurIPS, 2018.
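The distinction between an exponential and a sigmoid soft coverage score, raised in the model discussion of this review, can be illustrated with a minimal sketch. The functional forms, the sign convention and the bandwidth parameters `delta` and `sigma` below are assumptions made for illustration; the exact expressions in the paper (Equation (5)) and in [19] (Equation (1)) may differ.

```python
import math

def exp_score(distance, delta=1.0):
    """Exponential falloff of a non-negative pixel-to-triangle distance:
    1 at distance 0, decaying towards 0 as the distance grows."""
    return math.exp(-distance / delta)

def sigmoid_score(signed_distance, sigma=1.0):
    """Sigmoid of a signed distance, assumed positive inside the triangle
    and negative outside: 0.5 on the edge, approaching 0 far outside."""
    return 1.0 / (1.0 + math.exp(-signed_distance / sigma))

# A background pixel one unit outside a triangle still gets a non-zero score
# under both forms, unlike a hard 0/1 coverage test.
exp_outside = exp_score(1.0)
sigmoid_outside = sigmoid_score(-1.0)
```

In both cases the score varies smoothly with the distance, which is what lets a silhouette loss at a background pixel pull nearby vertices towards it.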