{"title": "Multiview Aggregation for Learning Category-Specific Shape Reconstruction", "book": "Advances in Neural Information Processing Systems", "page_first": 2351, "page_last": 2362, "abstract": "We investigate the problem of learning category-specific 3D shape reconstruction from a variable number of RGB views of previously unobserved object instances. Most approaches for multiview shape reconstruction operate on sparse shape representations, or assume a fixed number of views. We present a method that can estimate dense 3D shape, and aggregate shape across multiple and varying number of input views. Given a single input view of an object instance, we propose a representation that encodes the dense shape of the visible object surface as well as the surface behind line of sight occluded by the visible surface. When multiple input views are available, the shape representation is designed to be aggregated into a single 3D shape using an inexpensive union operation. We train a 2D CNN to learn to predict this representation from a variable number of views (1 or more). We further aggregate multiview information by using permutation equivariant layers that promote order-agnostic view information exchange at the feature level. Experiments show that our approach is able to produce dense 3D reconstructions of objects that improve in quality as more views are added.", "full_text": "Multiview Aggregation for Learning\n\nCategory-Speci\ufb01c Shape Reconstruction\n\nSrinath Sridhar1 Davis Rempe1 Julien Valentin2 So\ufb01en Bouaziz2 Leonidas J. 
Guibas1,3

ssrinath@cs.stanford.edu

geometry.stanford.edu/projects/xnocs

1Stanford University 2Google Inc. 3Facebook AI Research

Abstract

We investigate the problem of learning category-specific 3D shape reconstruction from a variable number of RGB views of previously unobserved object instances. Most approaches for multiview shape reconstruction operate on sparse shape representations, or assume a fixed number of views. We present a method that can estimate dense 3D shape, and aggregate shape across multiple and varying number of input views. Given a single input view of an object instance, we propose a representation that encodes the dense shape of the visible object surface as well as the surface behind line of sight occluded by the visible surface. When multiple input views are available, the shape representation is designed to be aggregated into a single 3D shape using an inexpensive union operation. We train a 2D CNN to learn to predict this representation from a variable number of views (1 or more). We further aggregate multiview information by using permutation equivariant layers that promote order-agnostic view information exchange at the feature level. Experiments show that our approach is able to produce dense 3D reconstructions of objects that improve in quality as more views are added.

1 Introduction

Learning to estimate the 3D shape of objects observed from one or more views is an important problem in 3D computer vision with applications in robotics, 3D scene understanding, and augmented reality. Humans and many animals perform well at this task, especially for known object categories, even when observed object instances have never been encountered before [27]. We are able to infer the 3D surface shape of both object parts that are directly visible, and of parts that are occluded by the visible surface. 
When provided with more views of the instance, our con\ufb01dence about its\nshape increases. Endowing machines with this ability would allow us to operate and reason in\nnew environments and enable a wide range of applications. We study this problem of learning\ncategory-speci\ufb01c 3D surface shape reconstruction given a variable number of RGB views (1 or more)\nof an object instance.\nThere are several challenges in developing a learning-based solution for this problem. First, we need\na representation that can encode the 3D geometry of both the visible and occluded parts of an object\nwhile still being able to aggregate shape information across multiple views. Second, for a given\nobject category, we need to learn to predict the shape of new instances from a variable number of\nviews at test time. We address these challenges by introducing a new representation for encoding\ncategory-speci\ufb01c 3D surface shape, and a method for learning to predict shape from a variable number\nof views in an order-agnostic manner.\nRepresentations such as voxel grids [6], point clouds [9, 17], and meshes [11, 40] have previously\nbeen used for learning 3D shape. These representations can be computationally expensive to operate\non, often produce only sparse or smoothed-out reconstructions, or decouple 3D shape from 2D\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: An input RGB view of a previously unseen object instance (a). Humans are capable of\ninferring the shape of the visible object surface (original colors in (b)) as well as the parts that are\noutside the line of sight (separated by red line in (b)). We propose an extended version of the NOCS\nmap representation [39] to encode both the visible surface (c) and the occluded surface furthest from\nthe current view, the X-NOCS map (d). 
Note that (c) and (d) are in exact pixel correspondence to\n(a), and their point set union yields the complete 3D shape of the object. RGB colors denote the XYZ\nposition within NOCS. We learn category-speci\ufb01c 3D reconstruction from one or more views.\n\nprojection losing 2D\u20133D correspondence. To overcome these issues, we build upon the normalized\nobject coordinate space maps (NOCS maps) representation [39]\u2014a 2D projection of a shared\ncategory-level 3D object shape space that can encode intra-category shape variation (see Figure 1). A\nNOCS map can be interpreted as a 3D surface reconstruction in a canonical space of object pixels\ndirectly visible in an image. NOCS maps retain the advantages of point clouds and are implicitly\ngrounded to the image since they provide a strong pixel\u2013shape correspondence\u2014a feature that allows\nus to copy object texture from the input image. However, a NOCS map only encodes the surface\nshape of object parts directly in the line of sight. We extend it to also encode the 3D shape of object\nparts that are occluded by the visible surface by predicting the shape of the object surface furthest\nand hidden from the view\u2014called X-NOCS maps (see Figure 1). Given a single RGB view of an\nobject instance, we aim to reconstruct the NOCS maps corresponding to the visible surface and the\nX-NOCS map of the occluded surface. Given multiple views, we aggregate the predicted NOCS and\nX-NOCS maps from each view into a single 3D shape using an inexpensive union operation.\nTo learn to predict these visible and occluded NOCS maps for one or more views, we use an encoder-\ndecoder architecture based on SegNet [3]. We show that a network can learn to predict shape\nindependently for each view. However, independent learning does not exploit multiview overlap\ninformation. 
We therefore propose to aggregate multiview information in a view order-agnostic manner by using permutation equivariant layers [43] that promote information exchange among the views at the feature level. Thus, our approach aggregates multiview information both at the shape level and at the feature level, enabling better reconstructions. Our approach is trained on a variable number of input views and can be used on a different number of views at test time. Extensive experiments show that our approach outperforms other state-of-the-art approaches, is able to reconstruct object shape with fine details, and accurately captures dense shape while improving reconstruction as more views are added, both during training and testing.

2 Related Work

Extensive work exists on recognizing and reconstructing the 3D shape of objects from images. This review focuses on learning-based approaches, which have dominated the recent state of the art, but we briefly summarize below the literature on techniques that rely purely on geometry and constraints.

Non-Learning 3D Reconstruction Methods: The method presented in [22] requires user input to estimate both camera intrinsics and multiple levels of reconstruction detail using primitives, which allow complete 3D reconstruction. The approach of [38] also requires user input but is more data-driven and targets class-based 3D reconstruction of the objects in the Pascal VOC dataset. [4] is another notable approach for class-based 3D reconstruction, in which a parametric 3D model and the corresponding parameters for each object instance are predicted with minimal user intervention. We now focus on learning-based methods for single and multiview reconstruction.

Single-View Reconstruction: Single-view 3D reconstruction of objects is a severely under-constrained problem. Probabilistic or generative techniques have been used to impose constraints on the solution space. 
For instance, [21] uses structure from motion to estimate camera parameters and learns category-specific generative models. The approach of [9] learns a generative model of unordered point clouds. The method of [15] also argues for learning generative models that can predict 3D shape, pose, and lighting from a single image. Most techniques implicitly or explicitly learn class-specific generative models, but there are some, e.g., [36], that take a radically different approach and use multiple views of the same object to impose a geometric loss during training. The approach of [42] predicts 2.5D sketches in the form of depth, surface normals, and silhouette images of the object. It then infers the 3D object shape using a voxel representation. In [13], the authors present a technique that uses silhouette constraints. That loss is not well suited for non-convex objects, and hence the authors propose to use another set of constraints coming from a generative model trained to generate 3D models. Finally, [44] proposes an approach that first predicts depth from a 2D image, which is then projected onto a spherical map. This map is inpainted to fill holes and backprojected into a 3D shape.

Multiview Reconstruction: Multiple views of an object add more constraints to the reconstructed 3D shape. Some of the most popular constraints in computer vision are multiview photometric consistency, depth error, and silhouette constraints [18, 41]. In [20], the authors assume that the pose of the camera is given and extract image features that are unprojected into 3D and iteratively fused with the information from other views into a voxel grid. Similarly, [16] uses structure from motion to extract camera calibration and pose. [23] proposes an approach to differentiable point-cloud rendering that effectively deals with the problem of visibility. 
Some approaches jointly perform the tasks of\nestimating the camera parameters as well as reconstructing the object in 3D [17, 45].\nPermutation Invariance and Equivariance: One of the requirements of supporting a variable\nnumber of input views is that the network must be agnostic to the order of the inputs. This is not\nthe case with [6] since their RNN is sensitive to input view order. In this work, we use ideas of\npermutation invariance and equivariance from DeepSets [29, 43]. Permutation invariance has been\nused in computer vision in problems such as burst image deblurring [2], shape recognition [35], and\n3D vision [28]. Permutation equivariance is not as widely used in vision but is common in other\nareas [29, 30]. Other forms of approximate equivariance have been used in multiview networks [7].\nA detailed theoretical analysis is provided by [25].\nShape Representations: There are two dominant families of shape representations used in literature:\nvolumetric and surface representations, each with their trade-offs in terms of memory, closeness to\nthe actual surface and ease of use in neural networks. We offer a brief review and refer the reader to\n[1, 34] for a more extensive study.\nThe voxel representation is the most common volumetric representation because of its regular grid\nstructure, making convolutional operators easy to implement. As illustrated in [6] which performs\nsingle and multiview reconstructions, voxels can be used as an occupancy grid, usually resulting\nin coarse surfaces. [26] demonstrates high quality reconstruction and geometry completion results.\nHowever, voxels have high memory cost, especially when combined with 3D convolutions. This\nhas been noted by several authors, including [31] who propose to \ufb01rst predict a series of 6 depth\nmaps observed from each face of a cube containing the object to reconstruct. 
Each series of 6 depth maps represents a different surface, allowing the method to efficiently capture both the outside and the inside (occluded) parts of objects. These series of depth maps are coined shape layers and are combined in an occupancy grid to obtain the final reconstructions.

Surface representations have advantages such as compactness, and are amenable to differentiable operators that can be applied on them. They are gaining popularity in learning 3D reconstruction with works like [19], where the authors present a technique for predicting category-specific mesh (and texture) reconstructions from single images, or explorations like [9], which introduces a technique for reconstructing the surface of objects using point clouds. Another interesting representation is scene coordinates, which associates each pixel in the image with a 3D position on the surface of the object or scene being observed. This representation has been successfully used for several problems including camera pose estimation [37] and face reconstruction [10]. However, it requires a scene- or instance-specific scan to be available. Finally, geometry images [12] have been proposed to encode 3D shape in images. However, they lack correspondence between input RGB pixels and the 3D shape.

In this work, we propose a category-level surface representation that has the advantages of point clouds but encodes strong pixel–3D shape correspondence, which allows multiview shape aggregation without explicit correspondences.

Figure 2: Given canonically aligned and scaled instances from an object category [5], the NOCS representation [39] can be used to encode intra-category shape variation. For a single view (a), a NOCS map encodes the shape of the visible parts of the object (b). We extend this representation to also encode the occluded parts, called an X-NOCS map (c). 
Multiple (X-)NOCS maps can be trivially combined using a set union operation (∪) into a single dense shape (rightmost). We can also efficiently represent the texture of object surfaces that are not directly observable (d). Inputs to our method are shown in green boxes, predictions are in red, and optional predictions are in orange.

3 Background and Overview

In this section, we provide a description of our shape representation, relevant background, and a general overview of our method.

Shape Representation: Our goal is to design a shape representation that can capture dense shapes of both the visible and occluded surfaces of objects observed from any given viewpoint. We would like a representation that can support computationally efficient signal processing (e.g., 2D convolution) while also having the advantages of 3D point clouds. This requires a strong coupling between image pixels and 3D shapes. We build upon the NOCS map [39] representation, which we describe below.

The Normalized Object Coordinate Space (NOCS) can be described as the 3D space contained within a unit cube as shown in Figure 2. Given a collection of shapes from a category which are consistently oriented and scaled, we build a shape space where the XYZ coordinates within NOCS represent the shape of an instance. A NOCS map is a 2D projection of the 3D NOCS points of an instance as seen from a particular viewpoint. Each pixel in the NOCS map denotes the 3D position of that object point in NOCS (color coded in Figure 2). NOCS maps are dense shape representations that scale with the size of the object in the view: objects that are closer to the camera, with more image pixels, are denser than objects further away. They can readily be converted to a point cloud by reading out the pixel values, but still retain 3D shape–pixel correspondence. 
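Since every foreground pixel stores a canonical XYZ value, this readout is straightforward in practice. A minimal NumPy sketch, where the function name and the white-background convention for empty pixels are our own assumptions:

```python
import numpy as np

def nocs_map_to_points(nocs_map, background=1.0):
    """Read a NOCS map out into a 3D point cloud.

    nocs_map: (H, W, 3) array where each foreground pixel stores an XYZ
    position inside the unit NOCS cube. Pixels whose channels all equal
    `background` are treated as empty (an assumption for illustration).
    Returns the (N, 3) point set and the (H, W) foreground mask.
    """
    foreground = ~np.all(nocs_map == background, axis=-1)
    return nocs_map[foreground], foreground
```

Because each returned point corresponds to a known source pixel (via the mask), texture can be copied onto the point cloud directly from the input image.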
Because of this correspondence we can obtain the camera pose in the canonical NOCS space using the direct linear transform algorithm [14]. However, NOCS maps only encode the shape of the visible surface of the object.

Depth Peeling: To overcome this limitation and encode the shape of the occluded object surface, we build upon the idea of depth peeling [8] and layered depth images [33]. Depth peeling is a technique used to generate more accurate order-independent transparency effects when blending transparent objects. As shown in Figure 3, this process refers to the extraction of object depth or, alternatively, NOCS coordinates corresponding to the kth intersection of a ray passing through a given image pixel. By peeling a sufficiently large number of layers (e.g., k = 10), we can accurately encode the interior and exterior shape of an object. However, using many layers can be unnecessarily expensive, especially if the goal is to estimate only the external object surface. We therefore propose to use 2 layers to approximate the external surfaces corresponding to the first and last ray intersections. These intersections faithfully capture the visible and occluded parts of most common convex objects. We refer to the maps corresponding to the occluded surface (i.e., the last ray intersection) as X-NOCS maps, similar to X-ray images.

Figure 3: We use depth peeling to extract X-NOCS maps corresponding to different ray intersections. The top row shows 4 intersections. The bottom row shows our representation, which uses the first and last intersections.

Both NOCS and X-NOCS maps support multiview shape aggregation into a single 3D shape using an inexpensive point set union operation. This is because NOCS is a canonical and normalized space where multiple views correspond to the same 3D space. Since these maps preserve pixel–shape correspondence, they also support estimation of object or camera pose in the canonical NOCS space [39]. 
We can use the direct linear transform [14] to estimate camera pose, up to an unknown scale factor (see supplementary document). Furthermore, we can support the prediction of the texture of the occluded parts of the object by hallucinating a peeled color image (see Figure 2 (d)).

Learning Shape Reconstruction: Given the X-NOCS map representation that encodes the 3D shape of both visible and occluded object surfaces, our goal is to learn to predict both maps from a variable number of input views and aggregate the multiview predictions. We adopt a supervised approach for this problem. We generated a large corpus of training data with synthetic objects from 3 popular categories: cars, chairs, and airplanes. For each object we render multiple viewpoints, as well as the corresponding ground-truth X-NOCS maps. Our network learns to predict the (X-)NOCS maps corresponding to each view using a SegNet-based [3] encoder-decoder architecture. Learning independently on each view does not exploit the available multiview overlap information. We therefore aggregate multiview information at the feature level by using permutation equivariant layers that combine input view information in an order-agnostic manner. The multiview aggregation that we perform at the NOCS shape and feature levels allows us to reconstruct dense shape with details, as we show in Section 5.

4 Method

Our goal is to learn to predict both the NOCS and X-NOCS maps corresponding to a variable number of input RGB views of previously unobserved object instances. We adopt a supervised learning approach and restrict ourselves to specific object categories. 
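The direct-linear-transform pose recovery mentioned above can be sketched from the pixel–NOCS correspondences alone. The following is a generic textbook DLT, not the paper's exact implementation (which is described in their supplementary document):

```python
import numpy as np

def dlt_projection(pts3d, pts2d):
    """Estimate a 3x4 projection matrix from n >= 6 correspondences
    between 3D NOCS coordinates and 2D pixel coordinates using the
    direct linear transform. The result is defined only up to scale,
    matching the unknown-scale caveat noted in the text."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # The solution is the right singular vector with the smallest
    # singular value, i.e. the (approximate) null space of A.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 4)
```

Because each (X-)NOCS pixel already pairs an image location with a canonical 3D point, no separate correspondence search is needed before running this step.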
We first describe our general approach to this problem and then discuss how we aggregate multiview information.

4.1 Single-View (X-)NOCS Map Prediction

The goal of this task is to predict the NOCS maps for the visible (Nv) and X-NOCS maps for the occluded parts (No) of the object given a single RGB view I. We assume that no other multiview inputs are available at train or test time. For this pixel-level prediction task we use an encoder-decoder architecture similar to SegNet [3] (see Figure 4). Our architecture takes a 3-channel RGB image as input and predicts 6 output channels corresponding to the NOCS and X-NOCS maps (N = {Nv, No}), and optionally also predicts a peeled color map (Cp) encoding the texture of the occluded object surface (see Figure 2 (d)). We include skip connections between the encoder and decoder to promote information sharing and consistency. To obtain the 3D shape of object instances, the output (X-)NOCS maps are combined into a single 3D point cloud as P = R(Nv) ∪ R(No), where R denotes a readout operation that converts each map to a 3D point set.

(X-)NOCS Map Aggregation: While single-view (X-)NOCS map prediction is trained independently on each view, it can still be used for multiview shape aggregation. Given multiple input views {I_0, . . . , I_n}, we predict the (X-)NOCS maps {N^0, . . . , N^n} for each view independently. NOCS represents a canonical and normalized space, and thus (X-)NOCS maps can also be interpreted as dense correspondences between pixels and the 3D NOCS space. Therefore any set of (X-)NOCS maps will map into the same space; multiview consistency is implicit in the representation. Given multiple independent (X-)NOCS maps, we can combine them into a single 3D point cloud as P_n = R(N^0) ∪ · · · ∪ R(N^n).

Loss Functions: We experimented with several loss functions for (X-)NOCS map prediction, including a pixel-level L2 loss and a combined pixel-level mask and L2 loss. The L2 loss is defined as
The L2 loss is de\ufb01ned\nas\n\n(cid:88)||y \u2212 \u02c6y||2, \u2200y \u2208 Nv, No,\u2200\u02c6y \u2208 \u02c6Nv, \u02c6No,\n\n(1)\nwhere y, \u02c6y \u2208 R3 denote the ground truth and predicted 3D NOCS value, \u02c6Nv, \u02c6No are the predicted\nNOCS and X-NOCS maps, and n is the total number of pixels in the X-NOCS maps. However, this\nfunction computes the loss for all pixels, even those that do not belong to the object thus wasting\n\nLe(y, \u02c6y) =\n\n1\nn\n\n5\n\n\fFigure 4: We use an encoder-decoder architecture based on SegNet [3] to predict NOCS and X-NOCS\nmaps from an input RGB view independently. To better exploit multiview information, we propose to\nuse the same architecture but with added permutation equivariant layers (bottom right) to combine\nmultiview information at the feature level. Our network can operate on a variable number of input\nviews in an order-agnostic manner. The features extracted for each view during upsampling and\ndownsampling operations are combined using permutation equivariant layers (orange bars).\n\nnetwork capacity. We therefore use object masks to restrict the loss computation only to the object\npixels in the image. We predict 2 masks corresponding to the NOCS and X-NOCS maps\u20148 channels\nin total. We predict 2 independent masks since they could be different for thin structures like airplane\ntail \ufb01ns. The combined mask loss is de\ufb01ned as Lm = Lv + Lo, where the loss for the visible NOCS\nmap and mask is de\ufb01ned as\n\n(cid:88)||y \u2212 \u02c6y||2, \u2200y \u2208 Mv,\u2200\u02c6y \u2208 \u02c6Mv,\n\n(2)\n\nLv(y, \u02c6y) = wm M(Mv, \u02c6Mv) + wl\n\n1\nm\n\nwhere \u02c6Mv is the predicted mask corresponding to the visible NOCS map, Mv is the ground truth\nmask, M is the binary cross entropy loss on the mask, and m is the number of masked pixels. Lo\nis identical to Lv but for the X-NOCS map. We empirically set the weights wm and wl to be 0.7\nand 0.3 respectively. 
Experimentally, we observe that the combined pixel-level mask and L2 loss outperforms the L2 loss alone, since more network capacity can be utilized for shape prediction.

4.2 Multiview (X-)NOCS Map Prediction

The above approach predicts (X-)NOCS maps independently and aggregates them to produce a 3D shape. However, multiview images of an object carry strong inter-view overlap information which we have not made use of. To promote information exchange between views both during training and testing, and to support a variable number of input views, we propose to use permutation equivariant layers [43] that are agnostic to the order of the views.

Feature Level Multiview Aggregation: Our multiview aggregation network is illustrated in Figure 4. The network is identical to the single-view network except for the addition of several permutation equivariant layers (orange bars). A network layer is said to be permutation equivariant if and only if the off-diagonal elements of the learned weight matrix are equal, as are the diagonal elements [43]. In practice, this can be achieved by passing each feature map through a pool-subtract operation followed by a non-linear function. The pool-subtract operation pools features extracted from different viewpoints and subtracts the pooled feature from the individual features (see Figure 4). We use multiple permutation equivariant layers after each downsampling and upsampling operation in the encoder-decoder architecture (vertical orange bars in Figure 4). Both average pooling and max pooling can be used, but experimentally average pooling worked best. 
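The pool-subtract operation itself takes only a couple of lines. A NumPy sketch over a stack of per-view feature vectors, where the trailing ReLU stands in for the non-linearity that, in the network, comes from the following layer:

```python
import numpy as np

def pool_subtract(features):
    """Permutation equivariant pool-subtract over the view axis:
    average-pool the per-view features across views, subtract the
    pooled feature from each view's own feature, then apply a
    non-linearity. features: (num_views, feature_dim)."""
    pooled = features.mean(axis=0, keepdims=True)
    return np.maximum(features - pooled, 0.0)  # ReLU, for illustration
```

Because the average is invariant to view order, permuting the input views simply permutes the output rows, which is exactly the equivariance property needed for order-agnostic aggregation.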
Our permutation equivariant layers consist of an average-subtraction operation and the non-linearity from the next convolutional layer.

Hallucinating Occluded Object Texture: As an additional feature, we train both our single and multiview networks to also predict the texture of the occluded surface of the object (see Figure 2 (d)). This is predicted as 3 additional output channels with the same loss as Lv. This optional prediction can be used to hallucinate the texture of hidden object surfaces.

5 Experiments

Dataset: We generated our own dataset, called ShapeNetCOCO, consisting of object instances from 3 categories commonly used in related work: chairs, cars, and airplanes. We use thousands of instances from the ShapeNet [5] repository, render 20 different views for each instance, and additionally augment backgrounds with randomly chosen COCO images [24]. This dataset is harder than previously proposed datasets because of the random backgrounds and widely varying camera distances. To facilitate comparisons with previous work [6, 17], we also generated a simpler dataset, called ShapeNetPlain, with white backgrounds and 5 views per object following the camera placement procedure of [17]. Except for the comparisons and Table 3, we report results on the more complex dataset. We follow the train/test protocol of [36]. Unless otherwise specified, we use a batch size of 1 (multiview) or 2 (single-view), a learning rate of 0.0001, and the Adam optimizer.

Metrics: For all experiments, we evaluate point cloud reconstruction using the 2-way Chamfer distance multiplied by 100. 
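This metric, defined formally below, can be sketched in a brute-force NumPy implementation (before the factor of 100); the O(|S1|·|S2|) pairwise computation is fine for evaluation-sized point sets:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric 2-way Chamfer distance between (N, 3) and (M, 3)
    point sets: squared nearest-neighbour distances averaged in
    both directions."""
    # (N, M) matrix of pairwise squared Euclidean distances.
    d2 = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```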
Given two point sets S1 and S2, the Chamfer distance is defined as

d(S1, S2) = (1/|S1|) Σ_{x∈S1} min_{y∈S2} ||x − y||_2^2 + (1/|S2|) Σ_{y∈S2} min_{x∈S1} ||x − y||_2^2.   (3)

5.1 Design Choices

We first justify our loss function choice and network outputs. As described, we experiment with two loss functions: L2 losses with and without a mask. Further, there are several outputs that we predict in addition to the NOCS and X-NOCS maps, i.e., mask and peeled color. In Table 1, we summarize the average Chamfer distance for all variants trained independently on single views (ShapeNetCOCO dataset). Using the loss function which jointly accounts for the NOCS map, X-NOCS map, and mask output clearly outperforms a vanilla L2 loss on the NOCS and X-NOCS maps. We also observe that predicting peeled color along with the (X-)NOCS maps gives better performance on all categories.

Table 1: Single-view reconstruction performance using various losses and outputs. For each category, the Chamfer distance is shown. Using the joint loss with L2 and the mask significantly outperforms just L2. Predicting peeled color further improves reconstruction.

| Loss | Output | Airplanes | Chairs | Cars |
| L2 | (X-)NOCS+Peel | 7.9072 | 4.4716 | 3.6573 |
| L2+Mask | (X-)NOCS+Mask | 0.4401 | 0.3037 | 0.5093 |
| L2+Mask | (X-)NOCS+Mask+Peel | 0.2659 | 0.4288 | 0.3714 |

5.2 Multiview Aggregation

Table 2: Comparison of different forms of multiview aggregation. 
Aggregating multiple views using set union improves performance, with further improvements from feature space aggregation.

| Category | Model | 2 views | 3 views | 5 views |
| Airplanes | Single-View | 0.4206 | 0.3974 | 0.3692 |
| Airplanes | Multiview | 0.3789 | 0.3537 | 0.2731 |
| Cars | Single-View | 0.1760 | 0.1677 | 0.1619 |
| Cars | Multiview | 0.2387 | 0.1782 | 0.1277 |
| Chairs | Single-View | 0.4249 | 0.3813 | 0.3600 |
| Chairs | Multiview | 0.3649 | 0.2860 | 0.2457 |

Next we show that our multiview aggregation approach is capable of estimating better reconstructions when more views are available (ShapeNetCOCO dataset). Table 2 shows that the reconstruction from the single-view network improves as we aggregate more views into NOCS space (using set union) without any feature space aggregation. When we train with feature space aggregation from 5 views using the permutation equivariant layers, we see further improvements as more views are added. Table 3 shows variations of our multiview model: one trained on a fixed number of views, one trained on a variable number of views up to a maximum of 5, and one trained on a variable number up to 10 views. All these models are trained on the ShapeNetPlain dataset for 100 epochs. We see that both fixed and variable models take advantage of the additional information from more views, almost always increasing performance from left to right. Although the fixed multiview models perform best, we hypothesize that the variable view models will be able to better handle the widening gap between the number of train-time and test-time views. In Figure 5, we visualize our results in 3D, which shows the small-scale details such as airplane engines reconstructed by our method.

Figure 5: Qualitative reconstructions produced by our method. Each row shows the input RGB views, the NOCS map ground truth and prediction for the central view, and the ground truth and predicted 3D shape. 
These visualizations are produced by the variable multiview model trained on up to 5 views and evaluated on 5 views. We post-process both the NOCS and X-NOCS maps with a bilateral filter followed by a statistical outlier filter [32], and use the input RGB images to color the point cloud. Best viewed zoomed and in color.

5.3 Comparisons

Table 3: Multiview reconstruction variations. We observe that both fixed and variable models take advantage of the additional information from more views.

| Category | Model | 2 views | 3 views | 5 views |
| Cars | Fixed Multi | 0.2645 | 0.1645 | 0.1721 |
| Cars | Variable Multi (5) | 0.2896 | 0.1989 | 0.1955 |
| Cars | Variable Multi (10) | 0.2992 | 0.2447 | 0.3095 |
| Airplanes | Fixed Multi | 0.1318 | 0.1571 | 0.0604 |
| Airplanes | Variable Multi (5) | 0.1418 | 0.1006 | 0.0991 |
| Airplanes | Variable Multi (10) | 0.1847 | 0.1309 | 0.1049 |
| Chairs | Fixed Multi | 0.2967 | 0.1845 | 0.1314 |
| Chairs | Variable Multi (5) | 0.2642 | 0.2072 | 0.1695 |
| Chairs | Variable Multi (10) | 0.2643 | 0.2070 | 0.1693 |

We compare our method to two previous works. The first, called differentiable point clouds (DPC) [17], directly predicts a point cloud given a single image of an object. We train a separate single-view model for cars, airplanes, and chairs to predict the NOCS maps, X-NOCS maps, mask, and peeled color (ShapeNetPlain dataset). To evaluate the Chamfer distance for DPC outputs, we first scale the predicted output point cloud such that the bounding box diagonal is one, then we follow the alignment procedure from their paper to calculate the transformation from the network's output frame to the ground-truth point cloud frame. As seen in Table 4, the X-NOCS map representation allows our network to outperform DPC in all three categories.

We next compare our multiview permutation equivariant model to the multiview method 3D-R2N2 [6]. 
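The unit-diagonal bounding-box normalization applied to baseline outputs before Chamfer evaluation can be sketched as follows (the function name is ours):

```python
import numpy as np

def scale_to_unit_diagonal(points):
    """Scale an (N, 3) point cloud so that its axis-aligned bounding
    box has a diagonal of length one, matching the normalization
    applied to baseline outputs before Chamfer evaluation."""
    extent = points.max(axis=0) - points.min(axis=0)
    return points / np.linalg.norm(extent)
```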
In each training batch, both methods are given a random subset of 5 views of an object, so that they may be evaluated with up to 5 views at test time. Since 3D-R2N2 outputs a volumetric 32×32×32 voxel grid, we first find all surface voxels of the output and then place a point at the center of each surface voxel to obtain a 3D point cloud. This point cloud is scaled to have a unit-diagonal bounding box to match the ground truth ShapeNet objects. We limit our comparison to chairs only, since we were unable to make their method converge on the other categories.

Table 4: Single-view reconstruction comparison to DPC [17].

Method   Cars     Airplanes   Chairs
DPC      0.2932   0.4314      0.3803
Ours     0.1569   0.2549      0.1855

Table 5 shows the performance of both methods when trained on the chairs category and evaluated on 2, 3, and 5 views (ShapeNetPlain dataset). For 2 views the methods perform similarly, but when more views are combined to reconstruct the shape, our method becomes more accurate. We again see the trend of increasing performance as more views are used.

Table 5: Multiview reconstruction performance compared to 3D-R2N2 [6] on the chairs category in ShapeNetPlain.

Method    2 views   3 views   5 views
3D-R2N2   0.2511    0.2191    0.1932
Ours      0.2508    0.1952    0.1576

Figure 6: More qualitative reconstructions produced by our method. For each box, ground truth is shown leftmost. Here we show reconstructions from (a) the permutation equivariant network trained and tested on 10 views for a car, and (b, c) the permutation equivariant network trained on chairs with 5 views and tested on 5. A reconstruction with higher shape variance that fails to capture small-scale detail is shown in (d). Finally, in (e) we show a visual comparison with the reconstruction produced by [6], which lacks detail such as the armrest although it sees 5 different views.
Best viewed in color.

Limitations and Future Work: While we reconstruct dense 3D shapes, there is still some variance in our predicted shape. We can further improve the quality of our reconstructions by incorporating surface topology information. We currently use the DLT algorithm [14] to predict camera pose in our canonical NOCS space; however, we would need extra information, such as depth [39], to estimate metric pose. Jointly estimating pose and shape is a tightly coupled problem and an interesting future direction. Finally, we observed that the Chamfer distance, although widely used to evaluate shape reconstruction quality, is not the ideal metric for differentiating fine-scale detail and overall shape. We plan to explore other metrics to evaluate reconstruction quality.

6 Conclusion

In this paper we introduced X-NOCS maps, a new and efficient surface representation that is well suited to the task of 3D reconstruction of objects, even of occluded parts, from a variable number of views. We demonstrate how this representation can be used to estimate the first and the last surface points that project onto any pixel in the observed image, and also to estimate the appearance of these surface points. We then show how adding a permutation equivariant layer makes the proposed method agnostic to the number of views and their associated viewpoints, and how our aggregation network efficiently combines these observations to yield even higher quality results than those obtained with a single observation.
Finally, extensive analysis and experiments validate that our method reaches state-of-the-art results using a single observation, and significantly improves upon existing techniques.

Acknowledgments: This work was supported by the Google Daydream University Research Program, the AWS Machine Learning Awards Program, and the Toyota-Stanford Center for AI Research. We would like to thank Jiahui Lei, the anonymous reviewers, and members of the Guibas Group for useful feedback. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

References

[1] Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada, and Bjorn Ottersten. A survey on deep learning advances on different 3D data representations. arXiv preprint arXiv:1808.01462, 2018.

[2] Miika Aittala and Frédo Durand. Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 731–747, 2018.

[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[4] Thomas J Cashman and Andrew W Fitzgibbon. What shape are dolphins? Building 3D morphable models from 2D images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):232–244, 2012.

[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[6] Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese.
3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. CoRR, abs/1604.00449, 2016.

[7] Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. Equivariant multi-view networks. arXiv preprint arXiv:1904.00993, 2019.

[8] Cass Everitt. Interactive order-independent transparency. White paper, NVIDIA, 2(6):7, 2001.

[9] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.

[10] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 534–551, 2018.

[11] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. AtlasNet: A papier-mâché approach to learning 3D surface generation. CoRR, abs/1802.05384, 2018.

[12] Xianfeng Gu, Steven J Gortler, and Hugues Hoppe. Geometry images. In ACM Transactions on Graphics (TOG), volume 21, pages 355–361. ACM, 2002.

[13] JunYoung Gwak, Christopher B Choy, Manmohan Chandraker, Animesh Garg, and Silvio Savarese. Weakly supervised 3D reconstruction with adversarial constraint. In 2017 International Conference on 3D Vision (3DV), pages 263–272. IEEE, 2017.

[14] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[15] Paul Henderson and Vittorio Ferrari. Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. arXiv preprint arXiv:1901.06447, 2019.

[16] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018.

[17] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems, pages 2807–2817, 2018.

[18] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In Proceedings of the IEEE International Conference on Computer Vision, pages 2307–2315, 2017.

[19] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pages 371–386, 2018.

[20] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, pages 365–376, 2017.

[21] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966–1974, 2015.

[22] Akash M Kushal, Gaurav Chanda, Kanishka Srivastava, Mohit Gupta, Subhajit Sanyal, TVN Sriram, Prem Kalra, and Subhashis Banerjee. Multilevel modelling and rendering of architectural scenes. In Proc. EuroGraphics, 2003.

[23] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI Conference on Artificial Intelligence (AAAI), 2018.

[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[25] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.

[26] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.

[27] Alex Pentland. Shape information from shading: A theory about human perception. In Second International Conference on Computer Vision, pages 404–413. IEEE, 1988.

[28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30, pages 5099–5108, 2017.

[29] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.

[30] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, pages 2892–2901. JMLR.org, 2017.

[31] Stephan R. Richter and Stefan Roth. Matryoshka networks: Predicting 3D geometry via nested shape layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1936–1944, 2018.

[32] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), 2011.

[33] Jonathan Shade, Steven Gortler, Li-wei He, and Richard Szeliski. Layered depth images. 1998.

[34] Daeyun Shin, Charless C Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3069, 2018.

[35] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[36] Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2897–2905, 2018.

[37] Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip HS Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4400–4408, 2015.

[38] Sara Vicente, Joao Carreira, Lourdes Agapito, and Jorge Batista. Reconstructing PASCAL VOC. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 41–48, 2014.

[39] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. arXiv preprint arXiv:1901.02970, 2019.

[40] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.

[41] Olivia Wiles and Andrew Zisserman. SilNet: Single- and multi-view reconstruction by learning from silhouettes. arXiv preprint arXiv:1711.07888, 2017.

[42] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems, pages 540–550, 2017.

[43] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.

[44] Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Josh Tenenbaum, Bill Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen classes. In Advances in Neural Information Processing Systems, pages 2263–2274, 2018.

[45] Rui Zhu, Chaoyang Wang, Chen-Hsuan Lin, Ziyan Wang, and Simon Lucey. Object-centric photometric bundle adjustment with deep shape prior. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 894–902. IEEE, 2018.