{"title": "Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 82, "page_last": 90, "abstract": "We study the problem of 3D object generation. We propose a novel framework, namely 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets. The benefits of our model are three-fold: first, the use of an adversarial criterion, instead of traditional heuristic criteria, enables the generator to capture object structure implicitly and to synthesize high-quality 3D objects; second, the generator establishes a mapping from a low-dimensional probabilistic space to the space of 3D objects, so that we can sample objects without a reference image or CAD models, and explore the 3D object manifold; third, the adversarial discriminator provides a powerful 3D shape descriptor which, learned without supervision, has wide applications in 3D object recognition. Experiments demonstrate that our method generates high-quality 3D objects, and our unsupervisedly learned features achieve impressive performance on 3D object recognition, comparable with those of supervised learning methods.", "full_text": "Learning a Probabilistic Latent Space of Object\nShapes via 3D Generative-Adversarial Modeling\n\nJiajun Wu*\nMIT CSAIL\n\nChengkai Zhang*\n\nMIT CSAIL\n\nTianfan Xue\nMIT CSAIL\n\nWilliam T. Freeman\n\nMIT CSAIL, Google Research\n\nJoshua B. Tenenbaum\n\nMIT CSAIL\n\nAbstract\n\nWe study the problem of 3D object generation. We propose a novel framework,\nnamely 3D Generative Adversarial Network (3D-GAN), which generates 3D ob-\njects from a probabilistic space by leveraging recent advances in volumetric convo-\nlutional networks and generative adversarial nets. 
The bene\ufb01ts of our model are\nthree-fold: \ufb01rst, the use of an adversarial criterion, instead of traditional heuristic\ncriteria, enables the generator to capture object structure implicitly and to synthe-\nsize high-quality 3D objects; second, the generator establishes a mapping from\na low-dimensional probabilistic space to the space of 3D objects, so that we can\nsample objects without a reference image or CAD models, and explore the 3D\nobject manifold; third, the adversarial discriminator provides a powerful 3D shape\ndescriptor which, learned without supervision, has wide applications in 3D object\nrecognition. Experiments demonstrate that our method generates high-quality 3D\nobjects, and our unsupervisedly learned features achieve impressive performance\non 3D object recognition, comparable with those of supervised learning methods.\n\nIntroduction\n\n1\nWhat makes a 3D generative model of object shapes appealing? We believe a good generative model\nshould be able to synthesize 3D objects that are both highly varied and realistic. Speci\ufb01cally, for\n3D objects to have variations, a generative model should be able to go beyond memorizing and\nrecombining parts or pieces from a pre-de\ufb01ned repository to produce novel shapes; and for objects to\nbe realistic, there need to be \ufb01ne details in the generated examples.\nIn the past decades, researchers have made impressive progress on 3D object modeling and synthe-\nsis [Van Kaick et al., 2011, Tangelder and Veltkamp, 2008, Carlson, 1982], mostly based on meshes\nor skeletons. Many of these traditional methods synthesize new objects by borrowing parts from\nobjects in existing CAD model libraries. 
Therefore, the synthesized objects look realistic, but not conceptually novel.
Recently, with the advances in deep representation learning and the introduction of large 3D CAD datasets like ShapeNet [Chang et al., 2015, Wu et al., 2015], there have been some inspiring attempts at learning deep object representations based on voxelized objects [Girdhar et al., 2016, Su et al., 2015a, Qi et al., 2016]. Different from part-based methods, many of these generative approaches do not explicitly model the concept of parts or retrieve them from an object repository; instead, they synthesize new objects based on learned object representations. This is a challenging problem because, compared to the space of 2D images, the space of 3D shapes is more difficult to model due to its higher dimensionality. Current results are encouraging, but there often still exist artifacts (e.g., fragments or holes) in the generated objects.
In this paper, we demonstrate that modeling volumetric objects in a generative-adversarial manner could be a promising solution for generating objects that are both novel and realistic. Our approach combines
∗ indicates equal contributions. Emails: {jiajunwu, ckzhang, tfxue, billf, jbt}@mit.edu
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
the merits of both generative-adversarial modeling [Goodfellow et al., 2014, Radford et al., 2016] and volumetric convolutional networks [Maturana and Scherer, 2015, Wu et al., 2015]. Different from traditional heuristic criteria, generative-adversarial modeling introduces an adversarial discriminator to classify whether an object is synthesized or real. This could be a particularly favorable framework for 3D object modeling: as 3D objects are highly structured, a generative-adversarial criterion, rather than a voxel-wise independent heuristic one, has the potential to capture the structural difference between two 3D objects.
The use of a generative-adversarial loss may also avoid possible criterion-dependent\nover\ufb01tting (e.g., generating mean-shape-like blurred objects when minimizing a mean squared error).\nModeling 3D objects in a generative-adversarial way offers additional distinctive advantages. First, it\nbecomes possible to sample novel 3D objects from a probabilistic latent space such as a Gaussian\nor uniform distribution. Second, the discriminator in the generative-adversarial approach carries\ninformative features for 3D object recognition, as demonstrated in experiments (Section 4). From\na different perspective, instead of learning a single feature representation for both generating and\nrecognizing objects [Girdhar et al., 2016, Sharma et al., 2016], our framework learns disentangled\ngenerative and discriminative representations for 3D objects without supervision, and applies them\non generation and recognition tasks, respectively.\nWe show that our generative representation can be used to synthesize high-quality realistic objects,\nand our discriminative representation can be used for 3D object recognition, achieving comparable\nperformance with recent supervised methods [Maturana and Scherer, 2015, Shi et al., 2015], and\noutperforming other unsupervised methods by a large margin. The learned generative and discrim-\ninative representations also have wide applications. For example, we show that our network can\nbe combined with a variational autoencoder [Kingma and Welling, 2014, Larsen et al., 2016] to\ndirectly reconstruct a 3D object from a 2D input image. 
Further, we explore the space of object representations and demonstrate that both our generative and discriminative representations carry rich semantic information about 3D objects.
2 Related Work
Modeling and synthesizing 3D shapes 3D object understanding and generation is an important problem in the graphics and vision community, and the relevant literature is very rich [Carlson, 1982, Tangelder and Veltkamp, 2008, Van Kaick et al., 2011, Blanz and Vetter, 1999, Kalogerakis et al., 2012, Chaudhuri et al., 2011, Xue et al., 2012, Kar et al., 2015, Bansal et al., 2016, Wu et al., 2016]. For decades, AI and vision researchers have made inspiring attempts to design or learn 3D object representations, mostly based on meshes and skeletons. Many of these shape synthesis algorithms are nonparametric: they synthesize new objects by retrieving and combining shapes and parts from a database. Recently, Huang et al. [2015] explored generating 3D shapes with pre-trained templates, producing both object structure and surface geometry. Our framework synthesizes objects without explicitly borrowing parts from a repository, and requires no supervision during training.
Deep learning for 3D data The vision community has witnessed rapid development of deep networks for various tasks. In the field of 3D object recognition, Li et al. [2015], Su et al. [2015b], Girdhar et al. [2016] proposed to learn a joint embedding of 3D shapes and synthesized images; Su et al. [2015a], Qi et al. [2016] focused on learning discriminative representations for 3D object recognition; Wu et al. [2016], Xiang et al. [2015], Choy et al. [2016] discussed 3D object reconstruction from in-the-wild images, possibly with a recurrent network; and Girdhar et al. [2016], Sharma et al. [2016] explored autoencoder-based networks for learning voxel-based object representations. Wu et al. [2015], Rezende et al. [2016], Yan et al.
[2016] attempted to generate 3D objects with deep\nnetworks, some using 2D images during training with a 3D to 2D projection layer. Many of these\nnetworks can be used for 3D shape classi\ufb01cation [Su et al., 2015a, Sharma et al., 2016, Maturana\nand Scherer, 2015], 3D shape retrieval [Shi et al., 2015, Su et al., 2015a], and single image 3D\nreconstruction [Kar et al., 2015, Bansal et al., 2016, Girdhar et al., 2016], mostly with full supervision.\nIn comparison, our framework requires no supervision for training, is able to generate objects from a\nprobabilistic space, and comes with a rich discriminative 3D shape representation.\nLearning with an adversarial net Generative Adversarial Nets (GAN) [Goodfellow et al., 2014]\nproposed to incorporate an adversarial discriminator into the procedure of generative modeling. More\nrecently, LAPGAN [Denton et al., 2015] and DC-GAN [Radford et al., 2016] adopted GAN with\nconvolutional networks for image synthesis, and achieved impressive performance. Researchers have\nalso explored the use of GAN for other vision problems. To name a few, Wang and Gupta [2016]\ndiscussed how to model image style and structure with sequential GANs, Li and Wand [2016] and\nZhu et al. [2016] used GAN for texture synthesis and image editing, respectively, and Im et al. [2016]\n\n2\n\n\fFigure 1: The generator in 3D-GAN. The discriminator mostly mirrors the generator.\n\ndeveloped a recurrent adversarial network for image generation. While previous approaches focus on\nmodeling 2D images, we discuss the use of an adversarial component in modeling 3D objects.\n3 Models\nIn this section we introduce our model for 3D object generation. We \ufb01rst discuss how we build\nour framework, 3D Generative Adversarial Network (3D-GAN), by leveraging previous advances\non volumetric convolutional networks and generative adversarial nets. 
We then show how to train\na variational autoencoder [Kingma and Welling, 2014] simultaneously so that our framework can\ncapture a mapping from a 2D image to a 3D object.\n3.1\nAs proposed in Goodfellow et al. [2014], the Generative Adversarial Network (GAN) consists of\na generator and a discriminator, where the discriminator tries to classify real objects and objects\nsynthesized by the generator, and the generator attempts to confuse the discriminator. In our 3D\nGenerative Adversarial Network (3D-GAN), the generator G maps a 200-dimensional latent vector z,\nrandomly sampled from a probabilistic latent space, to a 64 \u00d7 64 \u00d7 64 cube, representing an object\nG(z) in 3D voxel space. The discriminator D outputs a con\ufb01dence value D(x) of whether a 3D\nobject input x is real or synthetic.\nFollowing Goodfellow et al. [2014], we use binary cross entropy as the classi\ufb01cation loss, and present\nour overall adversarial loss function as\n\n3D Generative Adversarial Network (3D-GAN)\n\nL3D-GAN = log D(x) + log(1 \u2212 D(G(z))),\n\n(1)\nwhere x is a real object in a 64 \u00d7 64 \u00d7 64 space, and z is a randomly sampled noise vector from a\ndistribution p(z). In this work, each dimension of z is an i.i.d. uniform distribution over [0, 1].\nNetwork structure\nInspired by Radford et al. [2016], we design an all-convolutional neural\nnetwork to generate 3D objects. As shown in Figure 1, the generator consists of \ufb01ve volumetric fully\nconvolutional layers of kernel sizes 4 \u00d7 4 \u00d7 4 and strides 2, with batch normalization and ReLU\nlayers added in between and a Sigmoid layer at the end. The discriminator basically mirrors the\ngenerator, except that it uses Leaky ReLU [Maas et al., 2013] instead of ReLU layers. There are no\npooling or linear layers in our network. 
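The layer arithmetic above can be sanity-checked with the standard transposed-convolution size formula. A minimal sketch follows; the padding of 1 is our assumption (chosen so that kernel-4, stride-2 layers double the resolution, matching the 4³ → 64³ progression in Figure 1), and the discriminator outputs in the loss line are illustrative values only:

```python
import math

def deconv_size(n, kernel=4, stride=2, pad=1):
    """Output size along one axis of a transposed (fractionally-strided) convolution."""
    return (n - 1) * stride - 2 * pad + kernel

# z is treated as a 1x1x1 volume; we assume the first layer uses stride 1, no padding.
sizes = [deconv_size(1, stride=1, pad=0)]   # 512 x 4 x 4 x 4
for _ in range(4):                          # 256x8^3, 128x16^3, 64x32^3, 1x64^3
    sizes.append(deconv_size(sizes[-1]))
print(sizes)                                # [4, 8, 16, 32, 64]

# Adversarial loss of Eq. 1 on toy discriminator outputs D(x) and D(G(z)).
loss = math.log(0.9) + math.log(1 - 0.2)
```

Each kernel-4/stride-2/pad-1 layer maps n to 2n, reproducing the spatial sizes shown in Figure 1.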
More details can be found in the supplementary material.\nTraining details A straightforward training procedure is to update both the generator and the\ndiscriminator in every batch. However, the discriminator usually learns much faster than the generator,\npossibly because generating objects in a 3D voxel space is more dif\ufb01cult than differentiating between\nreal and synthetic objects [Goodfellow et al., 2014, Radford et al., 2016]. It then becomes hard\nfor the generator to extract signals for improvement from a discriminator that is way ahead, as all\nexamples it generated would be correctly identi\ufb01ed as synthetic with high con\ufb01dence. Therefore,\nto keep the training of both networks in pace, we employ an adaptive training strategy: for each\nbatch, the discriminator only gets updated if its accuracy in the last batch is not higher than 80%. We\nobserve this helps to stabilize the training and to produce better results. We set the learning rate of\nG to 0.0025, D to 10\u22125, and use a batch size of 100. We use ADAM [Kingma and Ba, 2015] for\noptimization, with \u03b2 = 0.5.\n3.2 3D-VAE-GAN\nWe have discussed how to generate 3D objects by sampling a latent vector z and mapping it to the\nobject space. In practice, it would also be helpful to infer these latent vectors from observations. For\nexample, if there exists a mapping from a 2D image to the latent representation, we can then recover\nthe 3D object corresponding to that 2D image.\n\n3\n\nzG(z) in 3D Voxel Space64\u00d764\u00d764512\u00d74\u00d74\u00d74256\u00d78\u00d78\u00d78128\u00d716\u00d716\u00d71664\u00d732\u00d732\u00d732\fFollowing this idea, we introduce 3D-VAE-GAN as an extension to 3D-GAN. 
We add an additional image encoder E, which takes a 2D image as input and outputs the latent representation vector z. This is inspired by VAE-GAN [Larsen et al., 2016], which combines VAE and GAN by sharing the decoder of the VAE with the generator of the GAN.
The 3D-VAE-GAN therefore consists of three components: an image encoder E, a decoder (the generator G in 3D-GAN), and a discriminator D. The image encoder consists of five spatial convolution layers with kernel sizes {11, 5, 5, 5, 8} and strides {4, 2, 2, 2, 1}, respectively. There are batch normalization and ReLU layers in between, and a sampler at the end that samples the 200-dimensional vector used by the 3D-GAN. The structures of the generator and the discriminator are the same as those in Section 3.1.
Similar to VAE-GAN [Larsen et al., 2016], our loss function consists of three parts: an object reconstruction loss Lrecon, a cross entropy loss L3D-GAN for 3D-GAN, and a KL divergence loss LKL to restrict the distribution of the output of the encoder. Formally, these loss functions are written as

L = L3D-GAN + α1 LKL + α2 Lrecon,  (2)

where α1 and α2 are weights of the KL divergence loss and the reconstruction loss. We have

L3D-GAN = log D(x) + log(1 − D(G(z))),  (3)
LKL = DKL(q(z|y) || p(z)),  (4)
Lrecon = ||G(E(y)) − x||2,  (5)

where x is a 3D shape from the training set, y is its corresponding 2D image, and q(z|y) is the variational distribution of the latent representation z. The KL divergence pushes this variational distribution towards the prior distribution p(z), so that the generator can sample the latent representation z from the same distribution p(z). In this work, we choose p(z) to be a multivariate Gaussian distribution with zero mean and unit variance. For more details, please refer to Larsen et al. [2016].
Training 3D-VAE-GAN requires both 2D images and their corresponding 3D models.
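The combined objective in Eqs. 2–5 can be checked numerically. The sketch below uses the standard closed-form KL divergence between a diagonal Gaussian and the unit Gaussian prior; the scalar inputs are toy values, not actual training signals, and the sign bookkeeping for which player minimizes which term is omitted:

```python
import numpy as np

def kl_to_unit_gaussian(mu, logvar):
    # Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) )
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def vae_gan_loss(D_x, D_Gz, mu, logvar, recon_err, alpha1=5.0, alpha2=1e-4):
    # Eq. 2: L = L_3D-GAN + alpha1 * L_KL + alpha2 * L_recon,
    # with the alpha values taken from the training details in the text.
    l_gan = np.log(D_x) + np.log(1.0 - D_Gz)
    return l_gan + alpha1 * kl_to_unit_gaussian(mu, logvar) + alpha2 * recon_err

# An encoder output that already matches the prior incurs zero KL penalty.
mu, logvar = np.zeros(200), np.zeros(200)
total = vae_gan_loss(0.9, 0.2, mu, logvar, recon_err=1.0)
```

With μ = 0 and log σ² = 0 the KL term vanishes, so the total reduces to the GAN term plus the (down-weighted) reconstruction error.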
We render 3D shapes in front of background images (16,913 indoor images from the SUN database [Xiao et al., 2010]) in 72 views (from 24 angles and 3 elevations). We set α1 = 5, α2 = 10⁻⁴, and use a similar training strategy as in Section 3.1. See our supplementary material for more details.
4 Evaluation
In this section, we evaluate our framework from various aspects. We first show qualitative results of generated 3D objects. We then evaluate the unsupervisedly learned representation from the discriminator by using it as features for 3D object classification. We show both qualitative and quantitative results on the popular benchmark ModelNet [Wu et al., 2015]. Further, we evaluate our 3D-VAE-GAN on 3D object reconstruction from a single image, and show both qualitative and quantitative results on the IKEA dataset [Lim et al., 2013].
4.1 3D Object Generation
Figure 2 shows 3D objects generated by our 3D-GAN. For this experiment, we train one 3D-GAN for each object category. For generation, we sample 200-dimensional vectors following an i.i.d. uniform distribution over [0, 1], and render the largest connected component of each generated object. We compare 3D-GAN with Wu et al. [2015], the state of the art in 3D object synthesis from a probabilistic space, and with a volumetric autoencoder, whose variants have been employed by multiple recent methods [Girdhar et al., 2016, Sharma et al., 2016]. Because an autoencoder does not restrict the distribution of its latent representation, we compute the empirical distribution p0(z) of the latent vector z over all training examples, fit a Gaussian distribution g0 to p0, and sample from g0. Our algorithm produces 3D objects with much higher quality and more fine-grained details.
Compared with previous works, our 3D-GAN can synthesize high-resolution 3D objects with detailed geometries.
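The largest-connected-component step used when rendering samples can be sketched as a plain BFS over the binarized voxel grid. This is a minimal sketch, not the authors' implementation; the paper does not state the connectivity used, so 6-connectivity and a 0.5 binarization threshold are our assumptions:

```python
import numpy as np
from collections import deque

def largest_component(vox, thresh=0.5):
    """Keep only the largest 6-connected component of a binarized voxel grid."""
    occ = vox > thresh
    labels = np.zeros(occ.shape, dtype=int)
    best, cur = None, 0                       # best = (label, size) seen so far
    for seed in zip(*np.nonzero(occ)):
        if labels[seed]:
            continue                          # already visited
        cur += 1
        labels[seed] = cur
        queue, size = deque([seed]), 0
        while queue:                          # breadth-first flood fill
            x, y, z = queue.popleft()
            size += 1
            for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                n = (x + dx, y + dy, z + dz)
                if (all(0 <= n[i] < occ.shape[i] for i in range(3))
                        and occ[n] and not labels[n]):
                    labels[n] = cur
                    queue.append(n)
        if best is None or size > best[1]:
            best = (cur, size)
    return occ & (labels == best[0]) if best else occ
```

In practice a library routine such as `scipy.ndimage.label` would do the same labeling; the explicit BFS is shown only to make the post-processing step concrete.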
Figure 3 shows both high-resolution voxels and down-sampled low-resolution voxels for comparison. Note that it is relatively easy to synthesize a low-res object, but much harder to obtain a high-res one due to the rapid growth of the 3D space; however, object details are only revealed at high resolution.
A natural concern about our generative model is whether it is simply memorizing objects from the training data. To demonstrate that the network can generalize beyond the training set, we compare synthesized objects with their nearest neighbors in the training set. Since objects retrieved by ℓ2 distance in the voxel space are visually very different from the queries, we instead use the output of the last convolutional layer in our discriminator (with a 2× pooling) as features for retrieval. Figure 2 shows that generated objects are similar, but not identical, to the nearest examples in the training set.

Figure 2: Objects generated by 3D-GAN from vectors, without a reference image/object. We show, for the last two objects in each row, the nearest neighbor retrieved from the training set. We see that the generated objects are similar, but not identical, to examples in the training set. For comparison, we show objects generated by the previous state-of-the-art [Wu et al., 2015] (results supplied by the authors). We also show objects generated by autoencoders trained on a single object category, with latent vectors sampled from the empirical distribution. See text for details. (Panels: our results at 64 × 64 × 64 with nearest neighbors; objects by Wu et al. [2015] at 30 × 30 × 30; objects by a volumetric autoencoder at 64 × 64 × 64; categories include gun, chair, car, sofa, and table.)

Figure 3: We present each object at high resolution (64 × 64 × 64) on the left and at low resolution (down-sampled to 16 × 16 × 16) on the right. While humans can perceive object structure at a relatively low resolution, fine details and variations only appear in high-res objects.

4.2 3D Object Classification
We then evaluate the representations learned by our discriminator. A typical way of evaluating representations learned without supervision is to use them as features for classification. To obtain features for an input 3D object, we concatenate the responses of the second, third, and fourth convolution layers in the discriminator, and apply max pooling with kernel sizes {8, 4, 2}, respectively. We use a linear SVM for classification.
Data We train a single 3D-GAN on the seven major object categories (chairs, sofas, tables, boats, airplanes, rifles, and cars) of ShapeNet [Chang et al., 2015]. We use ModelNet [Wu et al., 2015] for testing, following Sharma et al. [2016], Maturana and Scherer [2015], Qi et al. [2016].∗ Specifically, we evaluate our model on both ModelNet10 and ModelNet40, two subsets of ModelNet that are often
∗For ModelNet, there are two train/test splits typically used. Qi et al. [2016], Shi et al. [2015], Maturana and Scherer [2015] used the train/test split included in the dataset, which we also follow; Wu et al.
[2015], Su et al. [2015a], Sharma et al. [2016] used 80 training points and 20 test points in each category for experiments, possibly with viewpoint augmentation.
used as benchmarks for 3D object classification. Note that the training and test categories are not identical, which also shows the out-of-category generalization power of our 3D-GAN.
Results We compare with the state-of-the-art methods [Wu et al., 2015, Girdhar et al., 2016, Sharma et al., 2016, Sedaghat et al., 2016] and show per-class accuracy in Table 1. Our representation outperforms other features learned without supervision by a large margin (83.3% vs. 75.5% on ModelNet40, and 91.0% vs. 80.5% on ModelNet10) [Girdhar et al., 2016, Sharma et al., 2016]. Further, our classification accuracy is also higher than some recent supervised methods [Shi et al., 2015], and is close to the state-of-the-art voxel-based supervised learning approaches [Maturana and Scherer, 2015, Sedaghat et al., 2016]. Multi-view CNNs [Su et al., 2015a, Qi et al., 2016] outperform us, though their methods are designed for classification, and require rendered multi-view images and an ImageNet-pretrained model.
3D-GAN also works well with limited training data. As shown in Figure 4, with roughly 25 training samples per class, 3D-GAN achieves performance on ModelNet40 comparable to other unsupervised learning methods trained with at least 80 samples per class.

Supervision       Pretraining  Method                               ModelNet40  ModelNet10
Category labels   ImageNet     MVCNN [Su et al., 2015a]             90.1%       -
                  ImageNet     MVCNN-MultiRes [Qi et al., 2016]     91.4%       -
                  None         3D ShapeNets [Wu et al., 2015]       77.3%       83.5%
                  None         DeepPano [Shi et al., 2015]          77.6%       85.5%
                  None         VoxNet [Maturana and Scherer, 2015]  83.0%       92.0%
                  None         ORION [Sedaghat et al., 2016]        -           93.8%
Unsupervised      -            SPH [Kazhdan et al., 2003]           68.2%       79.8%
                  -            LFD [Chen et al., 2003]              75.5%       79.9%
                  -            T-L Network [Girdhar et al., 2016]   74.4%       -
                  -            VConv-DAE [Sharma et al., 2016]      75.5%       80.5%
                  -            3D-GAN (ours)                        83.3%       91.0%

Table 1: Classification results on the ModelNet dataset. Our 3D-GAN outperforms other unsupervised learning methods by a large margin, and is comparable to some recent supervised learning frameworks.

Figure 4: ModelNet40 classification with limited training data.
Figure 5: The effects of individual dimensions of the object vector.
Figure 6: Intra/inter-class interpolation between object vectors.

4.3 Single Image 3D Reconstruction
As an application, we show that 3D-VAE-GAN can perform well on single image 3D reconstruction. Following previous work [Girdhar et al., 2016], we test it on the IKEA dataset [Lim et al., 2013], and show both qualitative and quantitative results.
Data The IKEA dataset consists of images with IKEA objects. We crop the images so that the objects are centered. Our test set consists of 1,039 objects cropped from 759 images (supplied by the authors). The IKEA dataset is challenging because all images are captured in the wild, often with heavy occlusions. We test on all six categories of objects: bed, bookcase, chair, desk, sofa, and table.
Results We show our results in Figure 7 and Table 2, with the performance of a single 3D-VAE-GAN jointly trained on all six categories, as well as the results of six 3D-VAE-GANs separately trained on each class. Following Girdhar et al. [2016], we evaluate results at resolution 20 × 20 × 20, use average precision as our evaluation metric, and attempt to align each prediction with the ground truth over permutations, flips, and translational alignments (up to 10%), as IKEA ground-truth objects are not in a canonical viewpoint. In all categories, our model consistently outperforms the previous state-of-the-art in voxel-level prediction and other baseline methods.†

Method                                Bed    Bookcase  Chair  Desk   Sofa   Table  Mean
AlexNet-fc8 [Girdhar et al., 2016]    29.5   17.3      20.4   19.7   38.8   16.0   23.6
AlexNet-conv4 [Girdhar et al., 2016]  38.2   26.6      31.4   26.6   69.3   19.1   35.2
T-L Network [Girdhar et al., 2016]    56.3   30.2      32.9   25.8   71.7   23.3   40.0
3D-VAE-GAN (jointly trained)          49.1   31.9      42.6   34.8   79.8   33.1   45.2
3D-VAE-GAN (separately trained)       63.2   46.3      47.2   40.7   78.8   42.3   53.1

Table 2: Average precision for voxel prediction on the IKEA dataset.†

Figure 7: Qualitative results of single image 3D reconstruction on the IKEA dataset.

5 Analyzing Learned Representations
In this section, we look deeper into the representations learned by both the generator and the discriminator of 3D-GAN. We start with the 200-dimensional object vector, from which the generator produces various objects. We then visualize neurons in the discriminator, and demonstrate that these units capture informative semantic knowledge of the objects, which justifies the good performance on object classification presented in Section 4.
5.1 The Generative Representation
We explore three methods for understanding the latent space of vectors for object generation.
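Two of these explorations, interpolation and shape arithmetic, reduce to elementwise operations on latent vectors. A minimal numpy sketch with randomly sampled stand-in vectors follows (decoding the results into shapes would require the trained generator; the "attribute" direction is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
z_a = rng.uniform(0.0, 1.0, 200)   # two stand-in 200-dimensional object vectors
z_b = rng.uniform(0.0, 1.0, 200)

# Interpolation: walk the latent space from z_a to z_b; each step would be
# decoded by the generator G to visualize the transition between objects.
steps = [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, 6)]

# Arithmetic: a hypothetical attribute direction (e.g. an "arm" vector, obtained
# by subtracting an armless chair's code from an armchair's) added to a third code.
z_attr = z_b - z_a
z_c = rng.uniform(0.0, 1.0, 200)
z_edited = z_c + z_attr
```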
We first visualize what an individual dimension of the vector represents; we then explore the possibility of interpolating between two object vectors and observe how the generated objects change; last, we present how we can apply shape arithmetic in the latent space.
Visualizing the object vector To visualize the semantic meaning of each dimension, we gradually increase its value, and observe how it affects the generated 3D object. In Figure 5, each column corresponds to one dimension of the object vector, where the red region marks the voxels affected by changing values of that dimension. We observe that some dimensions in the object vector carry semantic knowledge of the object, e.g., the thickness or width of surfaces.
Interpolation We show results of interpolating between two object vectors in Figure 6. Earlier works demonstrated interpolation between two 2D images of the same category [Dosovitskiy et al., 2015, Radford et al., 2016]. Here we show interpolations both within and across object categories. We observe that in both cases walking over the latent space gives smooth transitions between objects.
Arithmetic Another way of exploring the learned representations is to show arithmetic in the latent space. Previously, Dosovitskiy et al. [2015], Radford et al. [2016] showed that their generative nets are able to encode semantic knowledge of chair or face images in the latent space; Girdhar et al. [2016] also showed that the learned representation for 3D objects behaves similarly. We show our shape arithmetic in Figure 8. Different from Girdhar et al. [2016], all of our objects are randomly sampled, requiring no existing 3D CAD models as input.
5.2 The Discriminative Representation
We now visualize the neurons in the discriminator. Specifically, we would like to show which input objects, and which parts of them, produce the highest intensity values for each neuron. To do that,
†For methods from Girdhar et al.
[2016], the mean values in the last column are higher than the originals in their paper, because we compute per-class accuracy instead of per-instance accuracy.

Figure 8: Shape arithmetic for chairs and tables. The left images show that the obtained "arm" vector can be added to other chairs, and the right ones show that the "layer" vector can be added to other tables.

Figure 9: Objects and parts that activate specific neurons in the discriminator. For each neuron, we show five objects that activate it most strongly, with colors representing gradients of activations with respect to input voxels.

for each neuron in the second-to-last convolutional layer of the discriminator, we iterate through all training objects and exhibit the ones activating the unit most strongly. We further use guided back-propagation [Springenberg et al., 2015] to visualize the parts that produce the activation.
Figure 9 shows the results. There are two main observations: first, for a single neuron, the objects producing the strongest activations have very similar shapes, showing that the neuron is selective for the overall object shape; second, the parts that activate the neuron, shown in red, are consistent across these objects, indicating that the neuron is also learning semantic knowledge about object parts.
6 Conclusion
In this paper, we proposed 3D-GAN for 3D object generation, as well as 3D-VAE-GAN for learning an image-to-3D-model mapping. We demonstrated that our models are able to generate novel objects and to reconstruct 3D objects from images. We showed that the discriminator in GAN, learned without supervision, can be used as an informative feature representation for 3D objects, achieving impressive performance on shape classification.
We also explored the latent space of object vectors, and presented results on object interpolation, shape arithmetic, and neuron visualization.
Acknowledgement This work is supported by NSF grants #1212849 and #1447476, ONR MURI N00014-16-1-2007, the Center for Brain, Minds and Machines (NSF STC award CCF-1231216), Toyota Research Institute, Adobe, Shell, IARPA MICrONS, and a hardware donation from Nvidia.

References
Aayush Bansal, Bryan Russell, and Abhinav Gupta. Marr revisited: 2D-3D alignment via surface normal prediction. In CVPR, 2016.
Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.
Wayne E. Carlson. An algorithm and data structure for 3D object synthesis using surface patch intersections. In SIGGRAPH, 1982.
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
Siddhartha Chaudhuri, Evangelos Kalogerakis, Leonidas Guibas, and Vladlen Koltun. Probabilistic reasoning for assembly-based 3D modeling. ACM TOG, 30(4):35, 2011.
Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3D model retrieval. CGF, 2003.
Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
Emily L. Denton, Soumith Chintala, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
Haibin Huang, Evangelos Kalogerakis, and Benjamin Marlin. Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. CGF, 34(5):25–38, 2015.
Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
Evangelos Kalogerakis, Siddhartha Chaudhuri, Daphne Koller, and Vladlen Koltun. A probabilistic model for component-based shape synthesis. ACM TOG, 31(4):55, 2012.
Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In CVPR, 2015.
Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In SGP, 2003.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. arXiv preprint arXiv:1604.04382, 2016.
Yangyan Li, Hao Su, Charles Ruizhongtai Qi, Noa Fish, Daniel Cohen-Or, and Leonidas J. Guibas. Joint embeddings of shapes and images via CNN image purification. ACM TOG, 34(6):234, 2015.
Joseph J. Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA objects: Fine pose estimation. In ICCV, 2013.
Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IROS, 2015.
Charles R. Qi, Hao Su, Matthias Niessner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, 2016.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
Danilo Jimenez Rezende, S. M. Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In NIPS, 2016.
Nima Sedaghat, Mohammadreza Zolfaghari, and Thomas Brox. Orientation-boosted voxel nets for 3D object recognition. arXiv preprint arXiv:1604.03351, 2016.
Abhishek Sharma, Oliver Grau, and Mario Fritz. VConv-DAE: Deep volumetric shape learning without object labels. arXiv preprint arXiv:1604.03755, 2016.
Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. DeepPano: Deep panoramic representation for 3-D shape recognition. IEEE SPL, 22(12):2339–2343, 2015.
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In ICLR Workshop, 2015.
Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In ICCV, 2015a.
Hao Su, Charles R. Qi, Yangyan Li, and Leonidas Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV, 2015b.
Johan W. H. Tangelder and Remco C. Veltkamp. A survey of content based 3D shape retrieval methods. Multimedia Tools and Applications, 39(3):441–471, 2008.
1, 2\n\nOliver Van Kaick, Hao Zhang, Ghassan Hamarneh, and Daniel Cohen-Or. A survey on shape correspondence.\n\nXiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks.\n\nCGF, 2011. 1, 2\n\nIn ECCV, 2016. 2\n\nJiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T\n\nFreeman. Single image 3d interpreter network. In ECCV, 2016. 2\n\nZhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d\n\nshapenets: A deep representation for volumetric shapes. In CVPR, 2015. 1, 2, 4, 5, 6\n\nYu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3d voxel patterns for object category\n\nrecognition. In CVPR, 2015. 2\n\nJianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale\n\nscene recognition from abbey to zoo. In CVPR, 2010. 4\n\nTianfan Xue, Jianzhuang Liu, and Xiaoou Tang. Example-based 3d object reconstruction from line drawings. In\n\nCVPR, 2012. 2\n\nXinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning\n\nsingle-view 3d object reconstruction without 3d supervision. In NIPS, 2016. 2\n\nJun-Yan Zhu, Philipp Kr\u00e4henb\u00fchl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the\n\nnatural image manifold. In ECCV, 2016. 2\n\n9\n\n\f", "award": [], "sourceid": 62, "authors": [{"given_name": "Jiajun", "family_name": "Wu", "institution": "MIT"}, {"given_name": "Chengkai", "family_name": "Zhang", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Tianfan", "family_name": "Xue", "institution": "MIT CSAIL"}, {"given_name": "Bill", "family_name": "Freeman", "institution": "MIT/Google"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}