{"title": "Learning to Infer Implicit Surfaces without 3D Supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 8295, "page_last": 8306, "abstract": "Recent advances in 3D deep learning have shown that it is possible to train highly effective deep models for 3D shape generation directly from 2D images. This is particularly interesting since the availability of 3D models is still limited compared to the massive amount of accessible 2D images, which are invaluable for training. The representation of 3D surfaces itself is a key factor for the quality and resolution of the 3D output. While explicit representations, such as point clouds and voxels, can span a wide range of shape variations, their resolutions are often limited. Mesh-based representations are more efficient but are limited in their ability to handle varying topologies. Implicit surfaces, however, can robustly handle complex shapes and topologies while also providing flexible resolution control. We address the fundamental problem of learning implicit surfaces for shape inference without the need for 3D supervision. Despite their advantages, it remains nontrivial to (1) formulate a differentiable connection between implicit surfaces and their 2D renderings, which is needed for image-based supervision; and (2) ensure precise geometric properties and control, such as local smoothness. In particular, densely sampling an implicit surface is known to be a computationally demanding and slow operation. To this end, we propose a novel ray-based field probing technique for efficient image-to-field supervision, as well as a general geometric regularizer for implicit surfaces, which provides natural shape priors in unconstrained regions.
We demonstrate the effectiveness of our framework on the task of single-view image-based 3D shape digitization and show that we outperform state-of-the-art techniques both quantitatively and qualitatively.", "full_text": "Learning to Infer Implicit Surfaces without 3D Supervision

Shichen Liu†,§, Shunsuke Saito†,§, Weikai Chen (✉)†, and Hao Li†,§,‡
†USC Institute for Creative Technologies
§University of Southern California
‡Pinscreen
{liushichen95, shunsuke.saito16, chenwk891}@gmail.com, hao@hao-li.com

Abstract

Recent advances in 3D deep learning have shown that it is possible to train highly effective deep models for 3D shape generation directly from 2D images. This is particularly interesting since the availability of 3D models is still limited compared to the massive amount of accessible 2D images, which are invaluable for training. The representation of 3D surfaces itself is a key factor for the quality and resolution of the 3D output. While explicit representations, such as point clouds and voxels, can span a wide range of shape variations, their resolutions are often limited. Mesh-based representations are more efficient but are limited in their ability to handle varying topologies. Implicit surfaces, however, can robustly handle complex shapes and topologies while also providing flexible resolution control. We address the fundamental problem of learning implicit surfaces for shape inference without the need for 3D supervision. Despite their advantages, it remains nontrivial to (1) formulate a differentiable connection between implicit surfaces and their 2D renderings, which is needed for image-based supervision; and (2) ensure precise geometric properties and control, such as local smoothness. In particular, densely sampling an implicit surface is known to be a computationally demanding and slow operation.
To this end, we propose a novel ray-based field probing technique for efficient image-to-field supervision, as well as a general geometric regularizer for implicit surfaces, which provides natural shape priors in unconstrained regions. We demonstrate the effectiveness of our framework on the task of single-view image-based 3D shape digitization and show that we outperform state-of-the-art techniques both quantitatively and qualitatively.

1 Introduction

The efficient learning of 3D deep generative models is key to achieving high-quality shape reconstruction and inference algorithms. While supervised learning with direct 3D supervision has shown promising results, its modeling capabilities are constrained by the quantity and variation of available 3D datasets. In contrast, far more 2D photographs are taken and shared over the Internet than could ever be viewed. To exploit the abundance of image datasets, various differentiable rendering techniques [1, 2, 3, 4] were recently introduced to learn 3D generative models directly from massive amounts of 2D pictures. While several types of shape representations have been adopted, most techniques are based on explicit surfaces, which often leads to poor visual quality due to limited resolution (e.g., point clouds, voxels) or failure to handle arbitrary topologies (e.g., polygonal meshes). Implicit surfaces, on the other hand, describe a 3D shape using an iso-surface of an implicit field and can therefore handle arbitrary topologies, as well as support multi-resolution control to ensure high-fidelity modeling.
As demonstrated by several recent 3D supervised learning methods [5, 6, 7, 8], implicit representations are particularly advantageous over explicit ones, and naturally encode a 3D surface at infinite resolution with a minimal memory footprint.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: While explicit shape representations may suffer from poor visual quality due to limited resolutions or fail to handle arbitrary topologies (a), implicit surfaces handle arbitrary topologies with high resolutions in a memory-efficient manner (b). However, in contrast to explicit representations, it is not feasible to directly project an implicit field onto a 2D domain via perspective transformation. Thus, we introduce a field probing approach based on efficient ray sampling that enables unsupervised learning of implicit surfaces from image-based supervision.

Despite these benefits, it remains challenging to achieve unsupervised learning of implicit surfaces from 2D images alone. First, it is non-trivial to relate changes of the implicit surface to those of the observed images. An explicit surface, on the other hand, can be easily projected and shaded onto an image plane (Figure 1). By inverting this process, one can obtain gradient flows that supervise the generation of the 3D shape. However, it is infeasible to directly project an implicit field onto a 2D domain via transformation. Instead, rendering an implicit surface relies on ray sampling techniques to densely evaluate the field, which may incur very high computational cost, especially for objects with thin structures. Second, it is challenging to ensure precise geometric properties such as local smoothness of an implicit surface. This is critical to generating plausible shapes in unconstrained regions, especially when only image-based supervision is available.
Unlike mesh-based surface representations, it is not straightforward to obtain geometric properties, e.g., normal and curvature, for an implicit surface, as the shape is implicitly encoded as the level set of a scalar field.

We address the above challenges and propose the first framework for learning implicit surfaces with only 2D supervision. In contrast to 3D supervised learning, where a signed distance field can be computed from the 3D training data, 2D images can only provide supervision on the binary occupancy of the field. Hence, we formulate the unsupervised learning of implicit fields as a classification problem such that the occupancy probability at an arbitrary 3D point can be predicted. The key to our approach is a novel field probing approach based on efficient ray sampling that achieves image-to-field supervision. Unlike conventional sampling methods [9], which excessively cast rays through all image pixels and apply binary search along each ray to detect the surface boundary, we propose a much more efficient approach that leverages sparse sets of 3D anchor points and rays. In particular, the anchor points probe the field by evaluating the occupancy probability at their locations, while each ray aggregates information from the anchor points it intersects. We assign a spherical supporting region to each anchor point to enable ray-point intersection. To further improve boundary modeling accuracy, we apply importance sampling in both 2D and 3D space to allocate more rays and anchor points around the image and surface boundaries, respectively.

While geometric regularization for implicit fields is largely unexplored, we propose a new method for constraining geometric properties of an implicit surface using approximated derivatives of the field computed with a finite difference method.
Since we only care about the decision boundary of the field, regularizing the entire 3D space would leave the region of interest under-constrained. Hence, we further propose an importance weighting technique to draw more attention to the surface region. We validate our approach on the task of single-view surface reconstruction. Experimental results demonstrate the superiority of our method over state-of-the-art unsupervised 3D deep learning techniques based on alternative shape representations, in terms of both quantitative and qualitative measures. Comprehensive ablation studies also verify the efficacy of the proposed probing-based sampling technique and the implicit geometric regularization.

Our contributions can be summarized as follows: (1) the first framework that enables learning of implicit surfaces for shape modeling without 3D supervision; (2) a novel field probing approach based on anchor points and probing rays that efficiently correlates the implicit field and the observed images; (3) an efficient point and ray sampling method for implicit surface generation from image-based supervision; (4) a general formulation of geometric regularization that can constrain the geometric properties of a continuous implicit surface.

2 Related Work

Geometric Representation for 3D Deep Learning. A 3D surface can be represented either explicitly or implicitly. Explicit representations mainly fall into three categories: voxel-, point- and mesh-based.
Due to their uniform spatial structure, voxel-based representations [10, 11, 12, 13] have been extensively explored to replicate the success of 2D convolutional networks on the 3D regular domain. Such volumetric representations generalize easily across shape topologies, but are often restricted to low resolutions due to large memory requirements. Progress has also been made in reconstructing point clouds from single images using point feature learning [14, 15, 16, 17, 3]. While able to describe arbitrary topologies, point-based representations are also restricted in resolution since dense samples are needed. Mesh representations can be more efficient since they naturally describe mesh connectivity and are hence suitable for representing 2-manifolds. Recent advances have focused on reconstructing mesh geometry from point clouds [18] or even a single image [19]. AtlasNet [18] learns an implicit representation that maps and assembles 2D squares into 3D surface patches. Despite the compactness of mesh representations, it remains challenging to modify vertex connectivity, making them unsuitable for modeling shapes with arbitrary topology. Unlike explicit surfaces, implicit surface representations [20, 21] depict a 3D shape by extracting an iso-surface from a continuous field. With implicit surfaces, a generative model has more flexibility and expressiveness for capturing complex topologies. Furthermore, multi-resolution representation and control enable them to capture fine geometric details at arbitrary resolution while reducing the memory footprint during training. Recent works [22, 5, 6, 7, 8, 23] have shown promising results on supervised learning for 3D shape inference based on implicit representations. Our approach pushes the envelope further by achieving 3D-unsupervised learning of implicit generative shape modeling solely from 2D images.

Learning Shapes from 2D Supervision.
Training a generative model for 3D shapes typically requires direct 3D supervision from a large corpus of shape collections [10]. However, 3D model databases are still limited compared to the massive availability of 2D photos, especially since acquiring clean, high-fidelity ground-truth 3D models still requires a tedious 3D capture process [24, 25]. A number of techniques have been introduced to exploit 2D training data to overcome this limitation, using key points [26], silhouettes [4, 1, 2, 27], and shading cues [28] for supervision. In particular, Yan et al. [4] obtain shape supervision by measuring the loss between perspectively transformed volumes and ground-truth silhouettes. To achieve even denser 2D supervision, differentiable rendering (DR) techniques have been proposed to relate changes in the observed pixels to those of the 3D models. One line of DR research focuses on differentiating rasterization-based rendering. Loper and Black [29] introduce an approximate differentiable renderer that generates rendering derivatives. Kato et al. [1] achieve single-view mesh reconstruction using a hand-crafted function to approximate the gradient of mesh rendering. Liu et al. [2] instead propose to render meshes with differentiable functions to obtain the gradient. In addition to polygon meshes, Insafutdinov et al. [3] propose differentiable point clouds to learn shapes and poses in an unsupervised manner. Another direction of DR work aims to differentiate the ray tracing procedure during rendering. Li et al. [30] introduce a differentiable ray tracer based on edge sampling. Aside from silhouettes, shading and appearance in image space also provide supervision cues for learning fine-grained shape representations in category-specific domains such as 3D face reconstruction [31, 32, 33, 34, 35] and material inference [36, 37, 38].
Whereas existing methods focus on learning shapes from 2D supervision using explicit shape representations (i.e., voxels, point clouds, and meshes), we present the first framework for unsupervised learning of implicit surface representations by differentiating the implicit field rendering. With our framework, one can reconstruct shapes with arbitrary topology at arbitrary resolution from a single image without requiring any 3D supervision.

Figure 2: Ray-based field probing technique. (a) A sparse set of 3D anchor points is distributed to sense the field by sampling the occupancy value at each location. (b) Each anchor is assigned a spherical supporting region to enable ray-point intersection. Anchor points with a higher probability of lying inside the object surface are marked in deeper blue. (c) Rays are cast through the sampling points {x_i} on the 2D silhouette under the camera views {π_k} (blue indicates object interior and white otherwise). (d) By aggregating the information from the intersected anchor points via max pooling, one obtains the prediction for each ray. (e) The silhouette loss is obtained by comparing the prediction with the ground-truth label in image space.

3 Unsupervised Learning of Implicit Surfaces

Overview. Our goal is to learn a generative model for implicit surfaces that infers 3D shapes solely from 2D images. Unlike direct supervision with 3D ground truth, which supports the computation of a continuous signed distance field with respect to the surface, 2D observations can only provide guidance on the occupancy of the implicit field. Hence, we formulate the unsupervised learning of implicit surfaces as a classification problem. Given {I_k}_{k=1}^{N_K} images of an object O from different views {π_k}_{k=1}^{N_K} as supervision signals, we train a neural network that takes a single image I_k and produces a continuous occupancy probability field whose iso-surface at 0.5 depicts the shape of O. Our pipeline is based on a novel ray-based field probing technique, as illustrated in Figure 2. Instead of excessively casting rays to detect the surface boundary, we probe the field using a sparse set of 3D anchor points and rays. The anchor points sense the field by sampling the occupancy probability at their locations, and are assigned a spherical supporting region to ease the computation of ray-point intersections. We then correlate the field and the observed images by casting probing rays, which originate from the viewpoint and pass through the sampling points of the images. The ray that passes through image pixel x_i, given the camera parameter π_k, obtains its prediction ψ(π_k, x_i) by aggregating the occupancy values from the anchor points whose supporting regions intersect with it. By comparing ψ(π_k, x_i) with the ground-truth label of x_i, we obtain error signals that supervise the generation of implicit fields. Note that when detecting ray-point intersections, we apply a boundary-aware assignment to remove ambiguity, which is detailed in Section 3.1.

Network Architecture. We illustrate our network architecture in Figure 3. Following recent advances in unsupervised shape learning [4, 1], we use 2D silhouettes of the objects as supervision for network training. Our framework consists of two components: (1) an image encoder g that maps the input image I to a latent feature z; and (2) an implicit decoder f that consumes z and a 3D query point p_j and infers its occupancy probability φ(p_j).
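As a concrete illustration of this encoder-decoder design, the following is a minimal numpy sketch of an implicit decoder f that maps a latent code z and a query point p_j to an occupancy probability in (0, 1). The latent dimension and layer widths here are illustrative assumptions, not the configuration used in the paper (see Section 4.1 for the actual architecture).

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_decoder(z, p, weights):
    """Toy implicit decoder f: concatenate the latent code z with a 3D query
    point p, pass the result through a small MLP, and squash the final logit
    with a sigmoid so the output is an occupancy probability phi(p)."""
    h = np.concatenate([z, p])
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)       # ReLU hidden layers
    W, b = weights[-1]
    return 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid -> (0, 1)

# Hypothetical dimensions: an 8-D latent code and two 16-unit hidden layers.
dims = [8 + 3, 16, 16, 1]
weights = [(rng.normal(size=(dims[i + 1], dims[i])) * 0.1, np.zeros(dims[i + 1]))
           for i in range(len(dims) - 1)]

z = rng.normal(size=8)                       # stand-in for the encoder output g(I)
phi = implicit_decoder(z, np.array([0.1, 0.2, 0.3]), weights)[0]
```

Because of the sigmoid, the decoder's prediction is always a valid probability, so the 0.5 iso-surface mentioned above is always well defined.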
Note that the implicit decoder generates a continuous prediction ranging from 0 to 1, and the estimated surface can be extracted at the decision boundary of 0.5 (Figure 3, right).

3.1 Sampling-Based 2D Supervision

To compute the prediction loss of the implicit decoder, a key step is to properly aggregate the information collected throughout the field probing process for each ray. Given a continuous occupancy field and a set of anchor points along a ray r, the probability that r hits the object interior can be expressed as an aggregation function:

ψ(π_k, x_i) = G({φ(c + r(π_k, x_i) · t_j)}_{j=1}^{N_p}),   (1)

where r(π_k, x_i) denotes the direction of the ray that intersects the image pixel x_i under viewing direction π_k; c is the camera location; N_p is the number of 3D anchor points; t_j indicates the sampled location along the ray for each anchor point; φ(·) is the occupancy function that returns the occupancy probability of the input point; and ψ denotes the predicted occupancy for ray r(π_k, x_i). Since whether the ray r hits the object interior is determined by the maximum occupancy value detected along the ray, we adopt max pooling as G, due to its computational efficiency and the effectiveness demonstrated in [4].
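The aggregation in Eq. (1) with G as max pooling can be sketched as follows. The soft spherical occupancy field and the sampling parameters are assumptions made purely for illustration, not part of the paper's pipeline.

```python
import numpy as np

def ray_occupancy(phi, c, ray_dir, t_samples):
    """Eq. (1) with G = max pooling: the predicted occupancy psi of a ray is
    the maximum field value phi(c + ray_dir * t_j) over sampled locations t_j."""
    return max(phi(c + t * ray_dir) for t in t_samples)

# Toy occupancy field (an illustrative assumption): a soft unit sphere at the
# origin -- close to 1 inside, close to 0 outside.
phi = lambda p: 1.0 / (1.0 + np.exp(20.0 * (np.linalg.norm(p) - 1.0)))

c = np.array([0.0, 0.0, -3.0])             # camera center
t_samples = np.linspace(0.0, 6.0, 50)      # sample locations t_j along the ray
hit = ray_occupancy(phi, c, np.array([0.0, 0.0, 1.0]), t_samples)
miss = ray_occupancy(phi, c, np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0), t_samples)
```

A ray through the sphere yields a prediction near 1 (`hit`), while a ray that misses it yields a prediction near 0 (`miss`), which is exactly the binary occupancy signal the silhouette supervision needs.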
By considering the l2 differences between the predictions and the ground-truth silhouette, we obtain the silhouette loss L_sil:

L_sil = (1/N_r) Σ_{k=1}^{N_K} Σ_{i=1}^{N_r} ||ψ(π_k, x_i) − S_k(x_i)||_2,   (2)

where S_k(x_i) is the bilinearly interpolated silhouette at x_i under the k-th viewpoint, and N_r and N_K denote the number of 2D sampling points and camera views, respectively.

Boundary-Aware Assignment. To facilitate the computation of ray-point intersections, we model each anchor point as a sphere with a non-zero radius. While this strategy works well in most cases, erroneous labeling may occur in the vicinity of the decision boundary. For instance, a ray that does not intersect the target object may still hit the supporting region of an anchor point whose center lies inside the object. Since we use max pooling as the aggregation function, such a ray may be wrongly labeled as intersecting. To resolve this issue, we use the 2D silhouettes as an additional prior to filter out anchor points on the wrong side. In particular, if a ray passes through a pixel inside/outside the silhouette, anchor points lying outside/inside the 3D object are ignored when detecting intersections (Figure 2 (c)). This boundary-aware assignment significantly improves the quality and reconstructed details, as demonstrated in the ablation study in Section 4.

Importance Sampling. A naive way to distribute anchor points and probing rays is random sampling. However, as the occupancy of the target object may be highly sparse over 3D space, random sampling can be extremely inefficient. We propose an importance sampling approach based on shape cues obtained from the 2D images for efficient sampling of rays and anchor points.
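One way to realize such contour-focused sampling in 2D is sketched below: contour weights come from a discrete Laplacian of the binary silhouette, mixture components are picked in proportion to those weights, and each sample is jittered by a Gaussian kernel. The grid size, bandwidth, and the use of wrap-around `np.roll` neighbors are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def contour_weights(silhouette):
    """Magnitude of a discrete Laplacian of the silhouette: nonzero only
    near the object contour, which is where we want most of the rays."""
    lap = (-4.0 * silhouette
           + np.roll(silhouette, 1, 0) + np.roll(silhouette, -1, 0)
           + np.roll(silhouette, 1, 1) + np.roll(silhouette, -1, 1))
    return np.abs(lap)

def sample_ray_pixels(silhouette, n, sigma=0.5):
    """Draw pixel locations for probing rays: choose Gaussian-mixture
    components in proportion to the contour weights, then jitter each
    sample with bandwidth sigma (a simplified stand-in for the mixture
    density described in the text)."""
    w = contour_weights(silhouette).ravel()
    idx = rng.choice(w.size, size=n, p=w / w.sum())
    ys, xs = np.unravel_index(idx, silhouette.shape)
    pts = np.stack([ys, xs], axis=1).astype(float)
    return pts + rng.normal(scale=sigma, size=pts.shape)

# Toy 32x32 silhouette: a filled square occupying rows/cols 8..23.
sil = np.zeros((32, 32))
sil[8:24, 8:24] = 1.0
pixels = sample_ray_pixels(sil, 200)
```

All samples land near the square's contour rather than in its interior or the empty background, so the probing rays concentrate where the occupancy decision is actually made.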
The main idea is to draw more samples around the surface boundary, which corresponds to the 2D contour of the object in image space. For ray sampling, we first obtain the contour map W_r(x) by applying a Laplacian operator to the input silhouette. We then generate a Gaussian mixture distribution by positioning an individual kernel at each pixel of W_r(x) and setting the kernel height to the pixel intensity at that location. The rays are then generated by sampling from the resulting distribution. Similarly, to generate the 3D contour map W_p(p), we apply mean filtering to the 3D visual hull computed from the multi-view silhouettes. The anchor points are then sampled from a 3D Gaussian mixture distribution created analogously to the 2D case. This yields the probability density functions of the sampling:

P_r(x) = ∫_{x'} κ(x', x; σ) W_r(x') dx',
P_p(p) = ∫_{p'} κ(p', p; σ) W_p(p') dp',   (3)

where x' is a pixel in the image domain and p' is a point in 3D space; κ(·, ·; σ) denotes the Gaussian kernel with bandwidth σ; and P_r(x) and P_p(p) denote the probability density at pixel x and point p, respectively.

3.2 Geometric Regularization on Implicit Surfaces

Regularizing geometric surface properties is critical to achieving desirable shapes, especially in unconstrained regions. While such constraints can be easily realized with explicit shape representations, controlled regularization of an implicit surface is not straightforward, since the surface is implicitly encoded as the level set of a scalar field. Here, we introduce a general formulation of geometric regularization for implicit surfaces using a new importance weighting scheme. Since computing geometric properties of a surface, e.g.
normal and curvature, requires access to the derivatives of the field, we propose an approach based on the finite difference method.

Figure 3: Network architecture for unsupervised learning of implicit surfaces. The input image I is first mapped to a latent feature z by an image encoder g, while the implicit decoder f consumes both the latent code z and a query point p_j and predicts its occupancy probability φ(p_j). With a trained network, one can generate an implicit field whose iso-surface at 0.5 depicts the inferred geometry.

In particular, we compute the n-th order derivative of the implicit field at point p_j with a central difference approximation:

δ^n φ / δp_j^n = (1/Δd^n) Σ_{l=0}^{n} (−1)^l (n choose l) φ(p_j + (n/2 − l) Δd),   (4)

where Δd is the spacing between p_j and its adjacent sample points (Figure 4). When n equals 1, the surface normal n(p_j) at p_j is obtained via n(p_j) = (δφ/δp_j) / |δφ/δp_j|.

Figure 4: 2D illustration of importance-weighted geometric regularization.

Importance weighting. As we focus on the geometric properties on the surface, applying the regularizer over the entire 3D space would lead to overly loose constraints in the regions of interest. Hence, we propose an importance weighting approach that assigns more attention to sampling points closer to the surface. Here, we leverage the prior learned by our network: surface points should have an occupancy probability close to the decision boundary, which is 0.5 in our implementation.
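The n = 1 case of this finite difference scheme can be sketched as follows: approximate the field gradient with central differences along each axis and normalize it to obtain the surface normal. The soft spherical occupancy field is an illustrative assumption; note that with occupancy near 1 inside the object, the gradient (and hence this normal) points toward the interior.

```python
import numpy as np

def field_normal(phi, p, dd=1e-3):
    """Surface normal via first-order central differences (Eq. (4), n = 1):
    estimate the gradient of the occupancy field phi along each axis with
    spacing dd, then normalize to unit length."""
    grad = np.zeros(3)
    for axis in range(3):
        e = np.zeros(3)
        e[axis] = dd / 2.0
        grad[axis] = (phi(p + e) - phi(p - e)) / dd
    return grad / np.linalg.norm(grad)

# Toy occupancy field of a soft unit sphere (an illustrative assumption).
phi = lambda p: 1.0 / (1.0 + np.exp(10.0 * (np.linalg.norm(p) - 1.0)))

# At the surface point (1, 0, 0) the normal is radial along the x axis
# (pointing inward here, because occupancy decreases outward).
n = field_normal(phi, np.array([1.0, 0.0, 0.0]))
```

This requires only six extra field evaluations per point, which is what makes the regularizer below cheap enough to apply during training.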
Therefore, we propose a weighting function W(x) = 1(|x − 0.5| < ε) and formulate the geometric regularization loss as follows:

L_geo = (1/N_p) Σ_{j=1}^{N_p} W(φ(s_j)) · [Σ_{l=1}^{6} W(φ(q_j^l)) ||n(s_j) − n(q_j^l)||_p^p] / [Σ_{l=1}^{6} W(φ(q_j^l))].   (5)

In particular, as shown in Figure 4, for each anchor point s_j, we uniformly sample two neighboring samples {q_j^l} with spacing Δd along each of the x, y, and z axes. We feed the weighting function W(·) with the predicted occupancy probability φ(s_j), such that anchor points closer to the surface (with φ(s_j) closer to 0.5) receive higher weights, and vice versa. By minimizing L_geo, we encourage the normals at the 3D anchors to stay close to those of their adjacent points. Note that we use the l_p norm rather than the more common l_2 for generality. We show that various geometric properties can be achieved by treating p as a hyperparameter (see Section 4.3).

The total loss for network training is a weighted sum of the silhouette loss L_sil and the geometric regularization loss L_geo with a trade-off factor λ:

L = L_sil + λ L_geo.   (6)

4 Experiments

4.1 Experimental Setup

Datasets. We evaluate our method on the ShapeNet [10] dataset. We focus on 6 commonly used categories with complex topologies: plane, bench, table, car, chair and boat. We use the same train/validate/test split as in [4, 1, 2] and the rendered images (64 × 64 resolution) provided by [1], which consist of 24 views for each object.

Figure 5: Qualitative results of single-view reconstruction using different surface representations.
For the point cloud representation, we also visualize the meshes reconstructed from the output point cloud.

Implementation details. We adopt a pre-trained ResNet18 as the encoder, which outputs a latent code of 128 dimensions. The decoder is realized using 6 fully-connected layers (with output channels 2048, 1024, 512, 256, 128, and 1, respectively) followed by a sigmoid activation function. We sample N_p = 16,000 anchor points in 3D space and N_r = 4096 rays for each view. The sampling bandwidth σ is set to 7 × 10^-3. The radius τ of the supporting region is set to 3 × 10^-2. For the regularizer, we set Δd = 3 × 10^-2, λ = 1 × 10^-2, and norm p = 0.8. We train the network using the Adam optimizer with a learning rate of 1 × 10^-4 and a batch size of 8 on a single 1080Ti GPU.

4.2 Comparisons

We validate the effectiveness of our framework on the task of unsupervised shape digitization from a single image. Figure 5 and Table 1 compare the performance of our approach with state-of-the-art unsupervised methods based on explicit surface representations, including voxels [4], point clouds [3], and triangle meshes [1, 2]. We provide both qualitative and quantitative measures. Note that all methods are trained on the same training data for fair comparison. While the explicit surface representations either suffer from visually unpleasant reconstructions due to limited resolution and expressiveness (voxels, point clouds) or fail to capture complex topology from a single template (meshes), our approach produces visually appealing reconstructions for complex shapes with arbitrary topologies. Compared to mesh-based representations, we achieve higher-resolution output, reflected in even sharper local geometric details, e.g., the engine of the plane (first row) and the wheels of the vehicle (fourth row).
The performance of our method is also demonstrated in the quantitative comparisons, where we achieve state-of-the-art reconstruction accuracy in terms of 3D IoU by a large margin.

Table 1: Comparison of 3D IoU with other unsupervised reconstruction methods.

Method       | Airplane | Bench  | Table  | Car    | Chair  | Boat   | Mean
PTN [4]      | 0.5564   | 0.4875 | 0.4938 | 0.7123 | 0.4494 | 0.5507 | 0.5417
NMR [1]      | 0.6172   | 0.4998 | 0.4829 | 0.7095 | 0.4990 | 0.5953 | 0.5673
SoftRas [2]  | 0.6419   | 0.5080 | 0.4487 | 0.7697 | 0.5270 | 0.6145 | 0.5850
Ours         | 0.6510   | 0.5360 | 0.5150 | 0.7820 | 0.5480 | 0.6080 | 0.6067

In Figure 6, we further illustrate the importance of supporting arbitrary topologies, compared to existing mesh-based reconstruction techniques [2]. Since real-world objects can exhibit a wide range of topologies even within a single object category (e.g., chairs), mesh-based approaches often produce deteriorated results. In contrast, our approach faithfully infers complex shapes and arbitrary topologies from very limited visual cues, e.g., the chair and the table in the third row, thanks to the flexibility of the implicit representation and the strong shape prior enabled by the geometric regularizer.

Figure 6: Qualitative comparison with a mesh-based approach [2] in terms of the ability to capture varying topologies.

4.3 Ablation Analysis

We provide a comprehensive ablation study to assess the effectiveness of each algorithmic component. For all experiments, we use the same data and parameters as before unless otherwise noted.

Geometric Regularization. In Table 2 and Figure 7, we demonstrate that the proposed geometric regularization enables flexible control over various geometric properties by varying the value of the norm p.
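A simplified numpy sketch of the importance-weighted regularizer in the spirit of Eq. (5), with the norm p exposed as a hyperparameter, is shown below. The tensor shapes, the indicator threshold `eps`, and the toy inputs are assumptions for illustration.

```python
import numpy as np

def geo_regularizer(phi_s, normals_s, normals_q, phi_q, p=0.8, eps=0.2):
    """Importance-weighted geometric regularizer (cf. Eq. (5)): penalize the
    l_p difference between the normal at each anchor s_j and the normals at
    its 6 axis-aligned neighbors q_j^l, with W(x) = 1(|x - 0.5| < eps) so
    that only near-surface anchors and neighbors contribute."""
    W = lambda x: (np.abs(x - 0.5) < eps).astype(float)
    w_s = W(phi_s)                                    # (N,) anchor weights
    w_q = W(phi_q)                                    # (N, 6) neighbor weights
    diff = np.abs(normals_s[:, None, :] - normals_q) ** p  # |n_s - n_q|^p
    per_anchor = (w_q * diff.sum(-1)).sum(-1) / np.maximum(w_q.sum(-1), 1e-8)
    return (w_s * per_anchor).mean()

# Toy check: 4 anchors exactly on the decision boundary (phi = 0.5), each
# with 6 neighbors. Identical normals give zero loss; perturbed ones do not.
phi_s = np.full(4, 0.5)
phi_q = np.full((4, 6), 0.5)
n_s = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))
n_q = np.repeat(n_s[:, None, :], 6, axis=1)
loss_same = geo_regularizer(phi_s, n_s, n_q, phi_q)
loss_pert = geo_regularizer(phi_s, n_s, n_q + 0.1, phi_q)
```

Because the component-wise penalty is |Δ|^p, choosing p = 2 spreads the penalty smoothly (least squares), while p < 1 tolerates a few large normal changes, encouraging the piecewise-flat behavior discussed in this section.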
To validate its effectiveness, we train the same network under different configurations: 1) without any geometric regularizer; and 2) with the proposed geometric regularization using norm p equal to 0.8, 1.0, and 2.0, respectively. As the results show, the lack of a geometric regularizer leads to ambiguity in the reconstructed geometry (e.g., the first row in Figure 7), since an unexpected shape, with an accordingly optimized texture map, can appear identical to the ground truth, which makes generating flat surfaces difficult. The proposed regularizer effectively enhances the regularity of the reconstructed objects, especially man-made ones, while providing flexible control. In particular, when p = 2.0, the surface normal difference is minimized in a least-squares manner, leading to smooth reconstructions. When p → 0, sparsity is enforced in the surface normal consistency, which encourages the reconstructed surface to be piecewise linear, often desirable for man-made objects. We also study the effect of the sampling step Δd of the regularizer, as shown in Table 3 and Figure 8. We observe that a larger Δd leads to flatter surfaces at the cost of fewer fine details.

Table 2: Quantitative evaluation of our approach on the chair category using different regularizer configurations.

Configuration  | 3D IoU
norm p = 2.0   | 0.502
norm p = 1.0   | 0.524
norm p = 0.8   | 0.548
-Regularizer   | 0.503

Figure 7: Qualitative evaluation of geometric regularization under different configurations.

Importance Sampling. To fully explore the effect of importance sampling, we compare two different sampling configurations: 1) "-Imp.
sampling”: drawing both 3D anchor points and rays from a normal distribution with mean 0 and standard deviation 0.4; and 2) “Full model”: using our importance sampling approach for both anchor points and rays, with the bandwidth set to 0.007. We show sampled rays and results in Table 4 and Figure 9. In terms of visual quality, the importance-sampling-based approach achieves much more detailed reconstructions than its counterpart. The quantitative measurements lead to a consistent observation: our proposed importance sampling outperforms normal sampling by a large margin.

Sampling step     3D IoU
Δd = 1 × 10⁻²     0.482
Δd = 3 × 10⁻²     0.515
Δd = 1 × 10⁻¹     0.507

Table 3: Quantitative evaluation on the table category with different Δd.

Configuration     3D IoU
Full model        0.548
-Imp. sampling    0.482
-Boundary aware   0.524

Table 4: Quantitative measurements for the ablation analysis of importance sampling and boundary-aware assignment on the chair category, as shown in Figure 9.

Figure 8: Qualitative results of reconstruction using our approach with different regularizer sampling steps Δd.

Figure 9: Qualitative analysis of importance sampling and boundary-aware assignment for single-view reconstruction.

Boundary-Aware Assignment. We also compare the performance with and without boundary-aware assignment in Table 4 and Figure 9. When boundary-aware assignment is disabled, sampling rays around the decision boundary may be assigned incorrect labels.
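As a minimal 2D illustration of why this matters, the sketch below labels rays by a silhouette lookup at their projected pixel, and a boundary-aware variant marks pixels within a small band `tau` of the silhouette edge as unreliable; the mask handling and helper names here are our own illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def label_rays(mask, pixels):
    # Naive assignment: each ray inherits the binary silhouette value at its
    # pixel (pixels are (x, y) integer coordinates).
    return mask[pixels[:, 1], pixels[:, 0]].astype(float)

def boundary_aware_labels(mask, pixels, tau=1):
    # Boundary-aware variant: pixels within tau of a silhouette edge are
    # marked unreliable (label -1) so they can be ignored or down-weighted.
    inside = mask.astype(bool)
    # A pixel is near the edge if any neighbor within tau disagrees with it.
    # np.roll wraps around the image border, which is harmless as long as the
    # object does not touch the border.
    near_edge = np.zeros_like(inside)
    for dy in range(-tau, tau + 1):
        for dx in range(-tau, tau + 1):
            shifted = np.roll(np.roll(inside, dy, axis=0), dx, axis=1)
            near_edge |= shifted != inside
    labels = label_rays(mask, pixels)
    labels[near_edge[pixels[:, 1], pixels[:, 0]]] = -1.0
    return labels
```

Without the boundary test, every ray inherits the hard 0/1 value of its nearest pixel, so rays that graze the silhouette edge can easily receive the wrong label.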
As a result, the reconstructions lack sufficient accuracy, especially around thin surface regions, and may fail to capture holes and thin structures, as demonstrated in the rightmost examples in Figure 9.

5 Discussion

We introduced a learning framework for implicit surface modeling of general objects without 3D supervision. An occupancy field is learned from a set of 2D silhouettes using an efficient field probing algorithm, and the desired local smoothness of the implicit field is achieved using a novel geometric regularizer based on finite differences. Our experiments show that high-fidelity implicit surface modeling is possible from 2D images alone, even for unconstrained regions. Our approach produces more visually pleasing and higher-resolution results than both voxels and point clouds. In addition, unlike mesh representations, our approach can handle arbitrary topologies spanning various object categories. We believe that the use of implicit surfaces and our proposed algorithms opens up new frontiers for learning limitless shape variations from in-the-wild images. Future work includes unsupervised learning of textured geometries, which has recently been addressed with an explicit mesh representation [2], and eliminating the need for silhouette segmentations to further increase the scalability of image-based learning. It would also be interesting to investigate the use of anisotropic kernels for shape modeling and hierarchical implicit representations with advanced data structures, e.g., octrees, to further improve the modeling efficiency.
Furthermore, we would like to consider learning from texture cues in addition to binary masks.

Acknowledgements  This research was conducted at USC and was funded in part by the ONR YIP grant N00014-17-S-FO14, the CONIX Research Center, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005, Adobe, and Sony.

References

[1] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.

[2] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. In IEEE International Conference on Computer Vision (ICCV), 2019.

[3] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems, pages 2802–2812, 2018.

[4] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.

[5] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[6] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[7] Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. Deep level sets: Implicit surface representations for 3D shape inference. arXiv preprint arXiv:1901.06802, 2019.

[8] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[9] Christian Sigg. Representation and rendering of implicit surfaces. PhD thesis, ETH Zurich, 2006.

[10] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[11] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.

[12] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[13] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2634, 2017.

[14] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–215, 2018.

[15] Haoqiang Fan, Hao Su, and Leonidas J Guibas.
A point set generation network for 3D object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.

[16] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[17] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3D point clouds. arXiv preprint arXiv:1707.02392, 2017.

[18] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A Papier-Mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[19] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.

[20] Jonathan C Carr, Richard K Beatson, Jon B Cherrie, Tim J Mitchell, W Richard Fright, Bruce C McCallum, and Tim R Evans. Reconstruction and representation of 3D objects with radial basis functions. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 67–76. ACM, 2001.

[21] Chen Shen, James F O'Brien, and Jonathan R Shewchuk. Interpolating and approximating implicit surfaces from polygon soup. In ACM SIGGRAPH 2005 Courses, page 204. ACM, 2005.

[22] Yiyi Liao, Simon Donné, and Andreas Geiger. Deep marching cubes: Learning explicit surface representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2916–2925, 2018.

[23] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo, Chongyang Ma, and Hao Li. Deep volumetric video from very sparse multi-view performance capture. In European Conference on Computer Vision, pages 351–369.
Springer, 2018.

[24] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2010.

[25] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127–136, 2011.

[26] Shubham Tulsiani, Abhishek Kar, Joao Carreira, and Jitendra Malik. Learning category-specific deformable 3D models for object reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):719–731, 2017.

[27] Abhijit Kundu, Yin Li, and James M Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3559–3568, 2018.

[28] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6296–6305, 2018.

[29] Matthew M Loper and Michael J Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014.

[30] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 37(6):222:1–222:11, 2018.

[31] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruction from a single image. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5553–5562.
IEEE, 2017.

[32] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In The IEEE International Conference on Computer Vision (ICCV), volume 2, page 5, 2017.

[33] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2549–2559, 2018.

[34] Luan Tran and Xiaoming Liu. Nonlinear 3D face morphable model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7346–7355, 2018.

[35] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T Freeman. Unsupervised training for 3D morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8377–8386, 2018.

[36] Dejan Azinović, Tzu-Mao Li, Anton Kaplanyan, and Matthias Nießner. Inverse path tracing for joint material and lighting estimation. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2019.

[37] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. Material editing using a physically based rendering network. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2280–2288. IEEE, 2017.

[38] Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. Single-image SVBRDF capture with a rendering-aware deep network.
ACM Transactions on Graphics (TOG), 37(4):128, 2018.