{"title": "Deep Convolutional Inverse Graphics Network", "book": "Advances in Neural Information Processing Systems", "page_first": 2539, "page_last": 2547, "abstract": "This paper presents the Deep Convolution Inverse Graphics Network (DC-IGN), a model that aims to learn an interpretable representation of images, disentangled with respect to three-dimensional scene structure and viewing transformations such as depth rotations and lighting variations. The DC-IGN model is composed of multiple layers of convolution and de-convolution operators and is trained using the Stochastic Gradient Variational Bayes (SGVB) algorithm. We propose a training procedure to encourage neurons in the graphics code layer to represent a specific transformation (e.g. pose or light). Given a single input image, our model can generate new images of the same object with variations in pose and lighting. We present qualitative and quantitative tests of the model's efficacy at learning a 3D rendering engine for varied object classes including faces and chairs.", "full_text": "Deep Convolutional Inverse Graphics Network\n\nTejas D. Kulkarni*1, William F. Whitney*2,\n\nPushmeet Kohli3, Joshua B. Tenenbaum4\n\n1tejask@mit.edu 2wwhitney@mit.edu 3pkohli@microsoft.com 4jbt@mit.edu\n\n* First two authors contributed equally and are listed alphabetically.\n\n1,2,4Massachusetts Institute of Technology, Cambridge, USA\n\n3Microsoft Research, Cambridge, UK\n\nAbstract\n\nThis paper presents the Deep Convolution Inverse Graphics Network (DC-\nIGN), a model that aims to learn an interpretable representation of images,\ndisentangled with respect to three-dimensional scene structure and viewing\ntransformations such as depth rotations and lighting variations. The DC-\nIGN model is composed of multiple layers of convolution and de-convolution\noperators and is trained using the Stochastic Gradient Variational Bayes\n(SGVB) algorithm [10]. 
We propose a training procedure to encourage neurons in the graphics code layer to represent a specific transformation (e.g. pose or light). Given a single input image, our model can generate new images of the same object with variations in pose and lighting. We present qualitative and quantitative tests of the model's efficacy at learning a 3D rendering engine for varied object classes including faces and chairs.\n\n1 Introduction\n\nDeep learning has led to remarkable breakthroughs in learning hierarchical representations from images. Models such as Convolutional Neural Networks (CNNs) [13], Restricted Boltzmann Machines [8, 19], and Auto-encoders [2, 23] have been successfully applied to produce multiple layers of increasingly abstract visual representations. However, there is relatively little work on characterizing the optimal representation of the data. While Cohen et al. [4] have considered this problem by proposing a theoretical framework to learn irreducible representations with both invariances and equivariances, coming up with the best representation for any given task remains an open question.\nVarious work [3, 4, 7] has been done on the theory and practice of representation learning, and from this work a consistent set of desiderata for representations has emerged: invariance, interpretability, abstraction, and disentanglement. In particular, Bengio et al. [3] propose that a disentangled representation is one for which changes in the encoded data are sparse over real-world transformations; that is, changes in only a few latents at a time should be able to represent sequences which are likely to happen in the real world.\nThe \u201cvision as inverse graphics\u201d paradigm suggests a representation for images which provides these features. 
Computer graphics consists of a function to go from compact descriptions of scenes (the graphics code) to images, and this graphics code is typically disentangled to allow for rendering scenes with fine-grained control over transformations such as object location, pose, lighting, texture, and shape. This encoding is designed to easily and interpretably represent sequences of real data so that common transformations may be compactly represented in software code; this criterion is conceptually identical to that of Bengio et al., and graphics codes conveniently align with the properties of an ideal representation.\n\nFigure 1: Model Architecture: Deep Convolutional Inverse Graphics Network (DC-IGN) has an encoder and a decoder. We follow the variational autoencoder [10] architecture with variations. The encoder consists of several layers of convolutions followed by max-pooling and the decoder has several layers of unpooling (upsampling using nearest neighbors) followed by convolution. (a) During training, data x is passed through the encoder to produce the posterior approximation Q(zi|x), where zi consists of scene latent variables such as pose, light, texture or shape. In order to learn parameters in DC-IGN, gradients are back-propagated with stochastic gradient descent using the following variational objective function: \u2212log(P(x|zi)) + KL(Q(zi|x)||P(zi)) for every zi. We can force DC-IGN to learn a disentangled representation by showing mini-batches with a set of inactive and active transformations (e.g. face rotating, light sweeping in some direction, etc.). (b) During test, data x can be passed through the encoder to get latents zi. 
Images can be re-rendered to different viewpoints, lighting conditions, shape variations, etc. by setting the appropriate graphics code group (zi), which is how one would manipulate an off-the-shelf 3D graphics engine.\n\nRecent work in inverse graphics [15, 12, 11] follows a general strategy of defining a probabilistic model with latent parameters, then using an inference algorithm to find the most appropriate set of latent parameters given the observations. Recently, Tieleman [21] moved beyond this two-stage pipeline by using a generic encoder network and a domain-specific decoder network to approximate a 2D rendering function. However, none of these approaches have been shown to automatically produce a semantically-interpretable graphics code and to learn a 3D rendering engine to reproduce images.\nIn this paper, we present an approach which attempts to learn interpretable graphics codes for complex transformations such as out-of-plane rotations and lighting variations. Given a set of images, we use a hybrid encoder-decoder model to learn a representation that is disentangled with respect to various transformations such as object out-of-plane rotations and lighting variations. We employ a deep directed graphical model with many layers of convolution and de-convolution operators that is trained using the Stochastic Gradient Variational Bayes (SGVB) algorithm [10].\nWe propose a training procedure to encourage each group of neurons in the graphics code layer to distinctly represent a specific transformation. To learn a disentangled representation, we train using data where each mini-batch has a set of active and inactive transformations, but we do not provide target values as in supervised learning; the objective function remains reconstruction quality. 
For example, a nodding face would have the 3D elevation transformation active but its shape, texture and other transformations would be inactive. We exploit this type of training data to force chosen neurons in the graphics code layer to specifically represent active transformations, thereby automatically creating a disentangled representation. Given a single face image, our model can re-generate the input image with a different pose and lighting. We present qualitative and quantitative results of the model's efficacy at learning a 3D rendering engine.\n\n[Figure 1 diagram: the encoder maps the 150 \u00d7 150 observed image through convolution + pooling layers (96, 64, 32 filters, kernel size 5) to a 200-dimensional graphics code {\u00b5200, \u03a3200} with groups for pose, light and shape; the decoder renders P(x|z) through unpooling (nearest neighbor) + convolution layers (32, 64, 96 filters, kernel size 7).]\n\nFigure 2: Structure of the representation vector. \u03c6 is the azimuth of the face, \u03b1 is the elevation of the face with respect to the camera, and \u03c6L is the azimuth of the light source.\n\n2 Related Work\n\nAs mentioned previously, a number of generative models have been proposed in the literature to obtain abstract visual representations. Unlike most RBM-based models [8, 19, 14], our approach is trained using back-propagation with an objective function consisting of data reconstruction and the variational bound.\nRelatively recently, Kingma and Welling [10] proposed the SGVB algorithm to learn generative models with continuous latent variables. In this work, a feed-forward neural network (encoder) is used to approximate the posterior distribution and a decoder network serves to enable stochastic reconstruction of observations. In order to handle fine-grained geometry of faces, we work with relatively large scale images (150 \u00d7 150 pixels). 
Our approach extends and applies the SGVB algorithm to jointly train and utilize many layers of convolution and de-convolution operators for the encoder and decoder network respectively. The decoder network is a function that transforms a compact graphics code (200 dimensions) to a 150 \u00d7 150 image. We propose using unpooling (nearest neighbor sampling) followed by convolution to handle the massive increase in dimensionality with a manageable number of parameters.\nRecently, Dosovitskiy et al. [6] proposed using CNNs to generate images given object-specific parameters in a supervised setting. As their approach requires ground-truth labels for the graphics code layer, it cannot be directly applied to image interpretation tasks. Our work is similar to Ranzato et al. [18], whose work was amongst the first to use a generic encoder-decoder architecture for feature learning. However, in comparison to our proposal their model was trained layer-wise, the intermediate representations were not disentangled like a graphics code, and their approach does not use the variational auto-encoder loss to approximate the posterior distribution. Our work is also similar in spirit to [20], but in comparison our model does not assume a Lambertian reflectance model and implicitly constructs the 3D representations. Another piece of related work is Desjardins et al. [5], who used a spike and slab prior to factorize representations in a generative deep network.\nIn comparison to existing approaches, it is important to note that our encoder network produces the interpretable and disentangled representations necessary to learn a meaningful 3D graphics engine. A number of inverse-graphics inspired methods have recently been proposed in the literature [15]. However, most such methods rely on hand-crafted rendering engines. The exception to this is work by Hinton et al. 
[9] and Tieleman [21] on transforming autoencoders, which use a domain-specific decoder to reconstruct input images. Our work is similar in spirit to these works but has some key differences: (a) it uses a very generic convolutional architecture in the encoder and decoder networks to enable efficient learning on large datasets and image sizes; (b) it can handle single static frames as opposed to the pair of images required in [9]; and (c) it is generative.\n\n3 Model\n\nAs shown in Figure 1, the basic structure of the Deep Convolutional Inverse Graphics Network (DC-IGN) consists of two parts: an encoder network which captures a distribution over graphics codes Z given data x and a decoder network which learns a conditional distribution to produce an approximation \u02c6x given Z. Z can be a disentangled representation containing a factored set of latent variables zi \u2208 Z such as pose, light and shape. This is important\n\nFigure 3: Training on a minibatch in which only \u03c6, the azimuth angle of the face, changes. During the forward step, the output from each component zi \u2260 z1 of the encoder is altered to be the same for each sample in the batch. This reflects the fact that the generating variables of the image (e.g. the identity of the face) which correspond to the desired values of these latents are unchanged throughout the batch. By holding these outputs constant throughout the batch, the single neuron z1 is forced to explain all the variance within the batch, i.e. the full range of changes to the image caused by changing \u03c6. During the backward step z1 is the only neuron which receives a gradient signal from the attempted reconstruction, and all zi \u2260 z1 receive a signal which nudges them to be closer to their respective averages over the batch. 
During the complete training process, after this batch, another batch is selected at random; it likewise contains variations of only one of \u03c6, \u03b1, \u03c6L, or the intrinsic properties; all neurons which do not correspond to the selected latent are clamped; and the training proceeds.\n\nin learning a meaningful approximation of a 3D graphics engine and helps tease apart the generalization capability of the model with respect to different types of transformations.\nLet us denote the encoder output of DC-IGN by ye = encoder(x). The encoder output is used to parametrize the variational approximation Q(zi|ye), where Q is chosen to be a multivariate normal distribution. There are two reasons for using this parametrization: (1) gradients of samples with respect to the parameters \u03b8 of Q can be easily obtained using the reparametrization trick proposed in [10], and (2) various statistical shape models trained on 3D scanner data, such as faces, have the same multivariate normal latent distribution [17]. Given model parameters We connecting ye and zi, the distribution parameters \u03b8 = (\u00b5zi, \u03a3zi) and latents Z can then be expressed as:\n\n\u00b5z = We ye,   \u03a3z = diag(exp(We ye))   (1)\n\u2200i, zi \u223c N(\u00b5zi, \u03a3zi)   (2)\n\nWe present a novel training procedure which allows networks to be trained to have disentangled and interpretable representations.\n\n3.1 Training with Specific Transformations\n\nThe main goal of this work is to learn a representation of the data which consists of disentangled and semantically interpretable latent variables. 
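For concreteness, the parametrization and sampling of Eqs. (1)-(2) can be sketched in NumPy. This is a hedged illustration, not the paper's implementation: the separate weight matrices `W_mu`/`W_sigma`, the function name `sample_latents`, and the initialization scales are our assumptions (the paper writes a single We, and the encoder is a deep convolutional network rather than one linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes matching the paper's quoted dimensions:
# a 7200-dim encoder output y_e and a 200-dim graphics code z.
n_enc, n_z = 7200, 200
W_mu = rng.normal(scale=0.01, size=(n_z, n_enc))     # assumption: two separate
W_sigma = rng.normal(scale=0.01, size=(n_z, n_enc))  # linear maps for mu and log-variance

def sample_latents(y_e):
    """Eq. (1): mu_z = W_e y_e, Sigma_z = diag(exp(W_e y_e));
    Eq. (2): z_i ~ N(mu_zi, Sigma_zi), drawn via the reparametrization
    trick of [10] so the sample is differentiable in mu and log_var."""
    mu = W_mu @ y_e
    log_var = W_sigma @ y_e            # exp(.) keeps each variance positive
    eps = rng.normal(size=n_z)         # noise sampled independently of parameters
    z = mu + np.sqrt(np.exp(log_var)) * eps
    return z, mu, log_var

z, mu, log_var = sample_latents(rng.normal(size=n_enc))
```

Writing the sample as mu plus noise scaled by the standard deviation is what makes the gradients in reason (1) easy to obtain: the randomness sits in eps, outside the parameters.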
We would like only a small subset of the latent variables to change for sequences of inputs corresponding to real-world events. One natural choice of target representation for information about scenes is that already designed for use in graphics engines. If we can deconstruct a face image by splitting it into variables for pose, light, and shape, we can trivially represent the same transformations that these variables are used for in graphics applications. Figure 2 depicts the representation which we will attempt to learn.\n\nFigure 4: Manipulating light and elevation variables: Qualitative results showing the generalization capability of the learned DC-IGN decoder to re-render a single input image under different lighting conditions and elevations. (a) We change the latent zlight smoothly, leaving all 199 other latents unchanged. (b) We change the latent zelevation smoothly, leaving all 199 other latents unchanged.\n\nWith this goal in mind, we perform a training procedure which directly targets this definition of disentanglement. We organize our data into mini-batches corresponding to changes in only a single scene variable (azimuth angle, elevation angle, azimuth angle of the light source); these are transformations which might occur in the real world. We will term these the extrinsic variables, and they are represented by the components z1,2,3 of the encoding.\nWe also generate mini-batches in which the three extrinsic scene variables are held fixed but all other properties of the face change. That is, these batches consist of many different faces under the same viewing conditions and pose. These intrinsic properties of the model, which describe identity, shape, expression, etc., are represented by the remainder of the latent variables z[4,200]. 
These mini-batches varying intrinsic properties are interspersed stochastically with those varying the extrinsic properties.\nWe train this representation using SGVB, but we make some key adjustments to the outputs of the encoder and the gradients which train it. The procedure (Figure 3) is as follows.\n\n1. Select at random a latent variable ztrain which we wish to correspond to one of {azimuth angle, elevation angle, azimuth of light source, intrinsic properties}.\n2. Select at random a mini-batch in which only that variable changes.\n3. Show the network each example in the mini-batch and capture its latent representation for that example zk.\n4. Calculate the average of those representation vectors over the entire batch.\n5. Before putting the encoder's output into the decoder, replace the values zi \u2260 ztrain with their averages over the entire batch. These outputs are \u201cclamped\u201d.\n6. Calculate reconstruction error and backpropagate as per SGVB in the decoder.\n7. Replace the gradients for the latents zi \u2260 ztrain (the clamped neurons) with their difference from the mean (see Section 3.2). The gradient at ztrain is passed through unchanged.\n8. Continue backpropagation through the encoder using the modified gradient.\n\nSince the intrinsic representation is much higher-dimensional than the extrinsic ones, it requires more training. Accordingly we select the type of batch to use in a ratio of about 1:1:1:10, azimuth : elevation : lighting : intrinsic; we arrived at this ratio after extensive testing, and it works well for both of our datasets.\nThis training procedure works to train both the encoder and decoder to represent certain properties of the data in a specific neuron. By clamping the output of all but one of the neurons, we force the decoder to recreate all the variation in that batch using only the changes in that one neuron's value. 
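The clamping in steps 3-7 above can be sketched in NumPy. The batch size, latent dimensionality, and function names are illustrative assumptions, not details from the paper:

```python
import numpy as np

def clamp_forward(z_batch, train_idx):
    """Steps 3-5: replace every latent except z_train with its batch mean,
    so only the chosen neuron can explain within-batch variation."""
    z_clamped = np.tile(z_batch.mean(axis=0), (z_batch.shape[0], 1))
    z_clamped[:, train_idx] = z_batch[:, train_idx]  # z_train passes through
    return z_clamped

def clamp_backward(grad_batch, z_batch, train_idx):
    """Step 7: clamped latents receive the invariance-targeting gradient
    (their difference from the batch mean, Section 3.2), while z_train
    keeps its reconstruction gradient unchanged."""
    grad_mod = z_batch - z_batch.mean(axis=0)
    grad_mod[:, train_idx] = grad_batch[:, train_idx]
    return grad_mod

# Toy batch: 20 samples, 5 latents, with latent 0 playing the role of z_train.
rng = np.random.default_rng(0)
z = rng.normal(size=(20, 5))
zc = clamp_forward(z, 0)
```

In the actual model the clamped forward pass feeds `zc` to the decoder, and `clamp_backward` modifies the gradient that flows back into the encoder.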
By clamping the gradients, we train the encoder to put all the information about the variations in the batch into one output neuron.\nThis training method leads to networks whose latent variables have a strong equivariance with the corresponding generating parameters, as shown in Figure 6. This allows the value of the true generating parameter (e.g. the true angle of the face) to be trivially extracted from the encoder.\n\n3.2 Invariance Targeting\n\nBy training with only one transformation at a time, we are encouraging certain neurons to contain specific information; this is equivariance. But we also wish to explicitly discourage them from having other information; that is, we want them to be invariant to other transformations. Since our mini-batches of training data consist of only one transformation per batch, this goal corresponds to having all but one of the output neurons of the encoder give the same output for every image in the batch.\nTo encourage this property of the DC-IGN, we train all the neurons which correspond to the inactive transformations with an error gradient equal to their difference from the mean. It is simplest to think about this gradient as acting on the set of subvectors zinactive from the encoder for each input in the batch. Each of these zinactive's will be pointing to a close-together but not identical point in a high-dimensional space; the invariance training signal will push them all closer together. We don't care where they are; the network can represent the face shown in this batch however it likes. We only care that the network always represents it as still being the same face, no matter which way it's facing. This regularizing force needs to be scaled to be much smaller than the true training signal, otherwise it can overwhelm the reconstruction goal. 
Empirically, a factor of 1/100 works well.\n\n4 Experiments\n\nWe trained our model on about 12,000 batches of faces generated from a 3D face model obtained from Paysan et al. [17], where each batch consists of 20 faces with random variations on face identity variables (shape/texture), pose, or lighting. We used the rmsprop [22] learning algorithm during training and set the meta learning rate equal to 0.0005, the momentum decay to 0.1 and the weight decay to 0.01.\nTo ensure that these techniques work on other types of data, we also trained networks to perform reconstruction on images of widely varied 3D chairs from many perspectives derived from the Pascal Visual Object Classes dataset as extracted by Aubry et al. [16, 1]. This task tests the ability of the DC-IGN to learn a rendering function for a dataset with high variation between the elements of the set; the chairs vary from office chairs to wicker to modern designs, and viewpoints span 360 degrees and two elevations. These networks were trained with the same methods and parameters as the ones above.\n\n4.1 3D Face Dataset\n\nFigure 5: Manipulating azimuth (pose) variables: Qualitative results showing the generalization capability of the learnt DC-IGN decoder to re-render the original static image with different azimuth (pose) directions. The latent neuron zazimuth is changed to random values but all other latents are clamped.\n\nThe decoder network learns an approximate rendering engine as shown in Figures 4 and 7. Given a static test image, the encoder network produces the latents Z depicting scene variables such as light, pose, shape etc. Similar to an off-the-shelf rendering engine, we can independently control these to generate new images with the decoder. For example, as shown in Figure 7, given the original test image, we can vary the lighting of an image by keeping all the other latents constant and varying zlight. 
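Re-rendering in this way amounts to sweeping one code dimension through the decoder while fixing the rest; a minimal sketch follows (the linear `decoder` stub stands in for the paper's trained deep decoder network, and all names here are illustrative):

```python
import numpy as np

def decoder(z):
    # Stand-in for the trained DC-IGN decoder: a fixed random linear map
    # from the 200-dim graphics code to a flattened 150x150 image.
    rng = np.random.default_rng(42)   # seeded so every call uses the same map
    W = rng.normal(scale=0.01, size=(150 * 150, z.size))
    return W @ z

def sweep_latent(z, idx, values):
    """Re-render a scene while varying one graphics-code neuron
    (e.g. z_light) and holding the other 199 latents fixed."""
    frames = []
    for v in values:
        z_mod = z.copy()
        z_mod[idx] = v
        frames.append(decoder(z_mod))
    return np.stack(frames)

z = np.zeros(200)  # a code as it might be inferred by the encoder (stub value)
frames = sweep_latent(z, idx=0, values=np.linspace(-2, 2, 5))
```

Each row of `frames` corresponds to one rendered view along the sweep, analogous to the rows of Figures 4 and 5.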
It is perhaps surprising that the fully-trained decoder network is able to function as a 3D rendering engine.\n\nFigure 6: Generalization of decoder to render images in novel viewpoints and lighting conditions: We generated several datasets by varying light, azimuth and elevation, and tested the invariance properties of DC-IGN's representation Z. We show quantitative performance on three network configurations as described in section 4.1. (a,b,c) All DC-IGN encoder networks reasonably predict transformations from static test images. Interestingly, as seen in (a), the encoder network seems to have learnt a switch node to separately process azimuth on the left and right profile sides of the face.\n\nWe also quantitatively illustrate the network's ability to represent pose and light on a smooth linear manifold as shown in Figure 6, which directly demonstrates our training algorithm's ability to disentangle complex transformations. In these plots, the inferred and ground-truth transformation values are plotted for a random subset of the test set. Interestingly, as shown in Figure 6(a), the encoder network's representation of azimuth has a discontinuity at 0\u00b0 (facing straight forward).\n\n4.1.1 Comparison with Entangled Representations\n\nTo explore how much of a difference the DC-IGN training procedure makes, we compare the novel-view reconstruction performance of networks with entangled representations (baseline) versus disentangled representations (DC-IGN). The baseline network is identical in every way to the DC-IGN, but was trained with SGVB without using our proposed training procedure. As in Figure 4, we feed each network a single input image, then attempt to use the decoder to re-render this image at different azimuth angles. To do this, we first must figure out which latent of the entangled representation most closely corresponds to the azimuth. 
This we do rather simply. First, we encode all images in an azimuth-varied batch using the baseline's encoder. Then we calculate the variance of each of the latents over this batch. The latent with the largest variance is then the one most closely associated with the azimuth of the face, and we will call it zazimuth. Once that is found, the latent zazimuth is varied for both models to render a novel view of the face given a single image of that face. Figure 7 shows that explicit disentanglement is critical for novel-view reconstruction.\n\nFigure 7: Entangled versus disentangled representations. First column: Original images. Second column: transformed image using DC-IGN. Third column: transformed image using normally-trained network.\n\n4.2 Chair Dataset\n\nWe performed a similar set of experiments on the 3D chairs dataset described above. This dataset contains still images rendered from 3D CAD models of 1357 different chairs, each model skinned with the photographic texture of the real chair. Each of these models is rendered in 60 different poses; at each of two elevations, there are 30 images taken from 360 degrees around the model. We used approximately 1200 of these chairs in the training set and the remaining 150 in the test set; as such, the networks had never seen the chairs in the test set from any angle, so the tests explore the networks' ability to generalize to arbitrary chairs. We resized the images to 150 \u00d7 150 pixels and made them grayscale to match our face dataset.\nWe trained these networks with the azimuth (flat rotation) of the chair as a disentangled variable represented by a single node z1; all other variation between images is undifferentiated and represented by z[2,200]. The DC-IGN network succeeded in achieving a mean-squared reconstruction error (MSE) of 2.7722 \u00d7 10\u22124 on the test set. Each image has grayscale values in the range [0, 1] and is 150 \u00d7 150 pixels.\nIn Figure 8 we have included examples of the network's ability to re-render previously-unseen chairs at different angles given a single image. For some chairs it is able to render fairly smooth transitions, showing the chair at many intermediate poses, while for others it seems to capture only a sort of \u201ckeyframes\u201d representation, having distinct outputs for just a few angles. Interestingly, the task of rotating a chair seen only from one angle requires speculation about unseen components; the chair might have arms, or not; a curved seat or a flat one; etc.\n\nFigure 8: Manipulating rotation: Each row was generated by encoding the input image (leftmost) with the encoder, then changing the value of a single latent and putting this modified encoding through the decoder. The network has never seen these chairs before at any orientation. (a) Some positive examples. Note that the DC-IGN is making a conjecture about any components of the chair it cannot see; in particular, it guesses that the chair in the top row has arms, because it can't see that it doesn't. (b) Examples in which the network extrapolates to new viewpoints less accurately.\n\n5 Discussion\n\nWe have shown that it is possible to train a deep convolutional inverse graphics network with a fairly disentangled, interpretable graphics code layer representation from static images. By utilizing a deep convolution and de-convolution architecture within a variational autoencoder formulation, our model can be trained end-to-end using back-propagation on the stochastic variational objective function [10]. We proposed a training procedure to force the network to learn disentangled and interpretable representations. 
Using 3D face and chair analysis as a working example, we have demonstrated the invariant and equivariant characteristics of the learned representations.\nAcknowledgements: We thank Thomas Vetter for access to the Basel face model. We are grateful for support from the MIT Center for Brains, Minds, and Machines (CBMM). We also thank Geoffrey Hinton and Ilker Yildirim for helpful feedback and discussions.\n\nReferences\n\n[1] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In CVPR, 2014.\n\n[2] Y. Bengio. Learning deep architectures for ai. Foundations and Trends in Machine Learning, 2(1):1\u2013127, 2009.\n\n[3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798\u20131828, 2013.\n\n[4] T. Cohen and M. Welling. Learning the irreducible representations of commutative lie groups. arXiv preprint arXiv:1402.4437, 2014.\n\n[5] G. Desjardins, A. Courville, and Y. Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.\n\n[6] A. Dosovitskiy, J. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. arXiv preprint arXiv:1411.5928, 2015.\n\n[7] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems, pages 646\u2013654, 2009.\n\n[8] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527\u20131554, 2006.\n\n[9] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning\u2013ICANN 2011, pages 44\u201351. Springer, 2011.\n\n[10] D. P. Kingma and M. 
Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.\n\n[11] T. D. Kulkarni, P. Kohli, J. B. Tenenbaum, and V. Mansinghka. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4390\u20134399, 2015.\n\n[12] T. D. Kulkarni, V. K. Mansinghka, P. Kohli, and J. B. Tenenbaum. Inverse graphics with probabilistic cad models. arXiv preprint arXiv:1407.1339, 2014.\n\n[13] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361, 1995.\n\n[14] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609\u2013616. ACM, 2009.\n\n[15] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum. Approximate bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems, pages 1520\u20131528, 2013.\n\n[16] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.\n\n[17] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. Genova, Italy, 2009. IEEE.\n\n[18] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR\u201907. IEEE Conference on, pages 1\u20138. IEEE, 2007.\n\n[19] R. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. 
In International Conference on Artificial Intelligence and Statistics, pages 448\u2013455, 2009.\n\n[20] Y. Tang, R. Salakhutdinov, and G. Hinton. Deep lambertian networks. arXiv preprint arXiv:1206.6445, 2012.\n\n[21] T. Tieleman. Optimizing Neural Networks that Generate Images. PhD thesis, University of Toronto, 2014.\n\n[22] T. Tieleman and G. Hinton. Lecture 6.5 - rmsprop, coursera: Neural networks for machine learning. 2012.\n\n[23] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371\u20133408, 2010.\n", "award": [], "sourceid": 1498, "authors": [{"given_name": "Tejas", "family_name": "Kulkarni", "institution": "MIT"}, {"given_name": "William", "family_name": "Whitney", "institution": "MIT"}, {"given_name": "Pushmeet", "family_name": "Kohli", "institution": "Microsoft Research"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}