{"title": "Explicit Disentanglement of Appearance and Perspective in Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1018, "page_last": 1028, "abstract": "Disentangled representation learning finds compact, independent and easy-to-interpret factors of the data. Learning such has been shown to require an inductive bias, which we explicitly encode in a generative model of images. Specifically, we propose a model with two latent spaces: one that represents spatial transformations of the input data, and another that represents the transformed data. We find that the latter naturally captures the intrinsic appearance of the data. To realize the generative model, we propose a Variationally Inferred Transformational Autoencoder (VITAE) that incorporates a spatial transformer into a variational autoencoder. We show how to perform inference in the model efficiently by carefully designing the encoders and restricting the transformation class to be diffeomorphic. Empirically, our model separates the visual style from digit type on MNIST, separates shape and pose in images of human bodies and facial features from facial shape on CelebA.", "full_text": "Explicit Disentanglement of Appearance and Perspective in Generative Models\n\nNicki S. Detlefsen ∗\nnsde@dtu.dk\n\nSøren Hauberg ∗\nsohau@dtu.dk\n\nAbstract\n\nDisentangled representation learning finds compact, independent and easy-to-interpret factors of the data. Learning such has been shown to require an inductive bias, which we explicitly encode in a generative model of images. Specifically, we propose a model with two latent spaces: one that represents spatial transformations of the input data, and another that represents the transformed data. We find that the latter naturally captures the intrinsic appearance of the data. 
To realize the generative model, we propose a Variationally Inferred Transformational Autoencoder (VITAE) that incorporates a spatial transformer into a variational autoencoder. We show how to perform inference in the model efficiently by carefully designing the encoders and restricting the transformation class to be diffeomorphic. Empirically, our model separates the visual style from digit type on MNIST, separates shape and pose in images of human bodies and facial features from facial shape on CelebA.\n\n1 Introduction\n\nDisentangled Representation Learning (DRL) is a fundamental challenge in machine learning that is currently seeing a renaissance within deep generative models. DRL approaches assume that an AI agent can benefit from separating out (disentangling) the underlying structure of data into disjoint parts of its representation. This can furthermore help the interpretability of the decisions of the AI agent and thereby make them more accountable.\nEven though there have been attempts to find a single formalized notion of disentanglement [Higgins et al., 2018], no widely accepted theory exists (yet). However, the intuition is that a disentangled representation z should separate different informative factors of variation in the data [Bengio et al., 2012]. This means that changing a single latent dimension zi should only change a single interpretable feature in the data space X.\nWithin the DRL literature, there are two main approaches. The first is to hard-wire disentanglement into the model, thereby creating an inductive bias. This is well known e.g. in convolutional neural networks, where the convolution operator creates an inductive bias towards translation in data. The second approach is to instead learn a representation that is faithful to the underlying data structure, hoping that this is sufficient to disentangle the representation. 
However, there is currently little to no agreement in the literature on how to learn such representations [Locatello et al., 2019].\nWe consider disentanglement of two explicit groups of factors, the appearance and the perspective. We here define the appearance as the factors of data that are left after transforming x by its perspective. Thus, the appearance is the form or archetype of an object and the perspective represents the specific realization of that archetype. Practically speaking, the perspective could correspond to an image rotation that is deemed irrelevant, while the appearance is a representation of the rotated image, which is then invariant to the perspective. This interpretation of the world goes back to Plato's allegory of the cave, from which we also borrow our terminology. This notion of removing\n\n∗Section for Cognitive Systems, Technical University of Denmark\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: We disentangle data into appearance and perspective factors. First, data are encoded based on their perspective (in this case images A and C are rotated in the same way), which is then removed from the original input. Hereafter, the transformed samples can be encoded in the appearance space (images A and B are both ones), which encodes the factors left in data.\n\nFigure 2: Our model, VITAE, disentangles appearance from perspective. Here we separate body pose (arm position) from body shape.\n\nperspective before looking at the appearance is well-studied within supervised learning, e.g. using spatial transformer nets (STNs) [Jaderberg et al., 2015].\nThis paper contributes an explicit model for disentanglement of appearance and perspective in images, called the variational inferred transformational autoencoder (VITAE). As the name suggests, we focus on variational autoencoders as generative models, but the idea is general (Fig. 1). 
First we encode/decode the perspective features in order to extract an appearance that is perspective-invariant. This is then encoded into a second latent space, where inputs with similar appearance are encoded similarly. This process generates an inductive bias that disentangles perspective and appearance. In practice, we develop an architecture that leverages the inference part of the model to guide the generator towards better disentanglement. We also show that this specific choice of architecture improves training stability with the right choice of parametrization of perspective factors.\nExperimentally, we demonstrate our model on four datasets: the standard disentanglement benchmark dSprites, disentanglement of style and content on MNIST, pose and shape on images of human bodies (Fig. 2) and facial features and facial shape on CelebA.\n\n2 Related work\n\nDisentangled representation learning (DRL) has long been a goal in data analysis. Early work on non-negative matrix factorization [Lee and Seung, 1999] and bilinear models [Tenenbaum and Freeman, 2000] showed how images can be composed into semantic “parts” that can be glued together to form the final image. Similarly, EigenFaces [Turk and Pentland, 1991] have often been used to factor out lighting conditions from the representation [Shakunaga and Shigenari, 2001], thereby discovering some of the physics that govern the world of which the data is a glimpse. This is central in the long-standing argument that for an AI agent to understand and reason about the world, it must disentangle the explanatory factors of variation in data [Lake et al., 2016]. As such, DRL can be seen as a poor man's approximation to discovering the underlying causal factors of the data.\nIndependent components are, perhaps, the most stringent formalization of “disentanglement”. 
The seminal independent component analysis (ICA) [Comon, 1994] factors the signal into statistically independent components. It has been shown that the independent components of natural images are edge filters [Bell and Sejnowski, 1997] that can be linked to the receptive fields in the human brain [Olshausen and Field, 1996]. Similar findings have been made for both video and audio [van Hateren and Ruderman, 1998, Lewicki, 2002]. DRL, thus, allows us to understand both the data and ourselves.\nSince independent factors are the optimal compression, ICA finds the most compact representation, implying that the predictive model can achieve maximal capacity from its parameters. This gives DRL a predictive perspective, and can be taken as a hint that a well-trained model might be disentangled. In the linear case, independent components have many successful realizations [Hyvärinen and Oja, 2000], but in the general non-linear case, the problem is not identifiable [Hyvärinen et al., 2018].\nDeep DRL was initiated by Bengio et al. [2012], who sparked the current interest in the topic. One of the current state-of-the-art methods for disentangled representation learning is the β-VAE [Higgins et al., 2017], which modifies the variational autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014] to learn a more disentangled representation. β-VAE puts more weight on the KL-divergence in the VAE loss, thereby optimizing towards latent factors that should be axis-aligned, i.e. disentangled. Newer models like β-TCVAE [Chen et al., 2018] and DIP-VAE [Kumar et al., 2017] extend β-VAE by decomposing the KL-divergence into multiple terms, and only increase the weight on terms that analytically disentangle the model. 
InfoGAN [Chen et al., 2016] extends the latent code z of the standard GAN model [Goodfellow et al., 2014] with an extra latent code c and then penalizes low mutual information between generated samples G(c, z) and c. DC-IGN [Kulkarni et al., 2015] forces the latent codes to be disentangled by only feeding in batches of data that vary in one way (e.g. pose, light) while only having small disjoint parts of the latent code active.\nShape statistics is the key inspiration for our work. The shape of an object was first formalized by Kendall [1989] as being what is left of an object when translation, rotation and scale are factored out. That is, the intrinsic shape of an object should not depend on viewpoint. This idea dates, at least, back to D'Arcy Thompson [1917], who pioneered the understanding of the development of biological forms. In Kendall's formalism, the rigid transformations (translation, rotation and scale) are viewed as group actions to be factored out of the representation, such that the remainder is shape. Higgins et al. [2018] follow the same idea by defining disentanglement as a factoring of the representation into group actions. Our work can be seen as a realization of this principle within a deep generative model. When an object is represented by a set of landmarks, e.g. in the form of discrete points along its contour, then Kendall's shape space is a Riemannian manifold that exactly captures all variability among the landmarks except translation, rotation, and scale of the object. When the object is not represented by landmarks, then similar mathematical results are not available. Our work shows how the same idea can be realized for general image data, and for a much wider range of transformations than the rigid ones. 
Learned-Miller [2006] proposed a related linear model that generates new data by transforming a prototype, which is estimated by joint alignment.\nTransformations are at the core of our method, and these leverage the architecture of spatial transformer nets (STNs) [Jaderberg et al., 2015]. While these work well within supervised learning [Lin and Lucey, 2016, Annunziata et al., 2018, Detlefsen et al., 2018], there has been limited uptake within generative models. Lin et al. [2018] combine a GAN with an STN to compose a foreground (e.g. a piece of furniture) into a background such that it looks neutral. The AIR model [Eslami et al., 2016] combines STNs with a VAE for object rendering, but does not seek disentangled representations. In supervised learning, data augmentation is often used to make a classifier partially invariant to select transformations [Baird, 1992, Hauberg et al., 2016].\n\n3 Method\n\nOur goal is to extend a variational autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014] such that it can disentangle appearance and perspective in data. A standard VAE assumes that data is generated by a set of latent variables following a standard Gaussian prior,\n\np(x) = ∫ p(x|z) p(z) dz,    (1)\n\np(z) = N(0, I_d),   p(x|z) = N(x | μ_p(z), σ²_p(z))   or   p(x|z) = B(x | μ_p(z)).    (2)\n\nData x is then generated by first sampling a latent variable z and then sampling x from the conditional p(x|z) (often called the decoder). To make the model flexible enough to capture complex data distributions, μ_p and σ²_p are modeled as deep neural nets. The marginal likelihood is then intractable and a variational approximation q to p(z|x) is needed,\n\np(z|x) ≈ q(z|x) = N(z | μ_q(x), σ²_q(x)),\n\nwhere μ_q(x) and σ²_q(x) are deep neural networks, see Fig. 3(a).\n\n(a) VAE   (b) Unconditional VITAE   (c) Conditional VITAE\n\nFigure 3: Architectures of the standard VAE and our proposed U-VITAE and C-VITAE models. Here q denotes encoders, p denotes decoders, T_γ denotes an ST-layer with transformation parameters γ. The dotted box indicates the generative model.\n\nWhen training VAEs, we therefore simultaneously train a generative model p_θ(x|z)p_θ(z) and an inference model q_φ(z|x) (often called the encoder). This is done by maximizing a variational lower bound to the likelihood p(x) called the evidence lower bound (ELBO),\n\nlog p(x) ≥ E_{q_φ(z|x)}[log (p_θ(x, z) / q_φ(z|x))] = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p_θ(z)),    (3)\n\nwhere the first term on the right is the data fitting term and the second is the regularization term. The first term measures the reconstruction error between x and p_θ(x|z) and the second measures the KL-divergence between the encoder q_φ(z|x) and the prior p(z). Eq. 3 can be optimized using the reparametrization trick [Kingma and Welling, 2013]. Several improvements to VAEs have been proposed [Burda et al., 2015, Kingma et al., 2016], but our focus is on the standard model.\n\n3.1 Incorporating an inductive bias\n\nTo incorporate an inductive bias that is able to disentangle appearance from perspective, we change the underlying generative model to rely on two latent factors zA and zP,\n\np(x) = ∫∫ p(x|zA, zP) p(zA) p(zP) dzA dzP,    (4)\n\nwhere we assume that zA and zP both follow standard Gaussian priors. Similar to a VAE, we also model the generators as deep neural networks. 
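For diagonal-Gaussian encoders and standard Gaussian priors, the KL regularizer in Eq. 3 has a well-known closed form; a minimal numpy sketch (the function name is ours, not from the paper's codebase):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) with sigma^2 = exp(log_var).

    Closed form: 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 ).
    """
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

# An encoder that exactly matches the prior incurs zero penalty:
assert gaussian_kl(np.zeros(4), np.zeros(4)) == 0.0
```

In the two-latent model above, one such term appears per latent space: one for zA and one for zP.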
To generate new data x, we combine the appearance and perspective factors using the following 3-step procedure that uses a spatial transformer (ST) layer [Jaderberg et al., 2015] (dotted box in Fig. 3(b)):\n\n1. Sample zA and zP from p(z) = N(0, I_d).\n2. Decode both samples: x̃ ∼ p(x|zA), γ ∼ p(γ|zP).\n3. Transform x̃ with parameters γ using a spatial transformer layer: x = T_γ(x̃).\n\nUnconditional VITAE inference. As the marginal likelihood (4) is intractable, we use variational inference. A natural choice is to approximate each latent group of factors zA, zP independently of the other, i.e.\n\np(zP|x) ≈ qP(zP|x)   and   p(zA|x) ≈ qA(zA|x).    (5)\n\nThe combined inference and generative model is illustrated in Fig. 3(b). For comparison, a VAE model is shown in Fig. 3(a). It can easily be shown that the ELBO for this model is merely that of a VAE with a KL-term for each latent space (see supplements).\n\nConditional VITAE inference. This inference model does not mimic the generative process of the model, which may be suboptimal. Intuitively, we expect the encoder to approximately perform the inverse operation of the decoder, i.e. z ≈ encoder(decoder(z)) ≈ decoder⁻¹(decoder(z)). Since the proposed encoder (5) does not include an ST-layer, it may be difficult to train an encoder to approximately invert the decoder. To accommodate this, we first include an ST-layer in the encoder for the appearance factors. Secondly, we explicitly enforce that the predicted transformation in the encoder T_γe is the inverse of that of the decoder T_γd, i.e. T_γe = (T_γd)⁻¹ (more on invertibility in Sec. 3.2). 
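Under simplifying assumptions, the three-step generative procedure above can be sketched as follows; this is a toy illustration, not the paper's implementation: the decoders are stand-ins supplied by the caller, the perspective is affine, and nearest-neighbour inverse warping replaces the bilinear sampling of a real ST-layer (all names are ours):

```python
import numpy as np

def affine_warp(img, A, b):
    """Warp an image by the map x -> A x + b on pixel coordinates.

    Inverse warping with nearest-neighbour sampling; a toy stand-in for
    a bilinear spatial transformer (ST) layer.
    """
    H, W = img.shape
    out = np.zeros_like(img)
    A_inv = np.linalg.inv(A)  # requires det(A) != 0, cf. Sec. 3.2
    for i in range(H):
        for j in range(W):
            # find the source pixel that the transformation maps to (i, j)
            si, sj = A_inv @ (np.array([i, j], dtype=float) - b)
            si, sj = int(round(si)), int(round(sj))
            if 0 <= si < H and 0 <= sj < W:
                out[i, j] = img[si, sj]
    return out

def generate(decode_appearance, decode_perspective, d=2, rng=np.random):
    """Steps 1-3 of the generative procedure (decoders supplied by caller)."""
    z_A, z_P = rng.randn(d), rng.randn(d)   # 1. sample both priors
    x_tilde = decode_appearance(z_A)        # 2. decode appearance...
    A, b = decode_perspective(z_P)          #    ...and perspective params
    return affine_warp(x_tilde, A, b)       # 3. transform the appearance
```

With an identity perspective (A = I, b = 0) the generated image equals the decoded appearance, which is the sense in which the appearance space is perspective-invariant.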
The inference of appearance is now dependent on the perspective factor zP, i.e.\n\np(zP|x) ≈ qP(zP|x)   and   p(zA|x) ≈ qA(zA|x, zP).    (6)\n\nThese changes to the inference architecture are illustrated in Fig. 3(c). It can easily be shown that the ELBO for this model is given by\n\nlog p(x) ≥ E_{qA,qP}[log p(x|zA, zP)] − DKL(qP(zP|x) || p(zP)) − E_{qP}[DKL(qA(zA|x, zP) || p(zA))],    (7)\n\nwhich resembles the standard ELBO with an additional term (derivation in the supplementary material), corresponding to the second latent space. We will call both models variational inferred transformational autoencoders (VITAE) and we will denote the first model (5) as unconditional/U-VITAE and the second model (6) as conditional/C-VITAE. The naming comes from Eq. 5 and 6, where zA is respectively unconditioned and conditioned on zP. Experiments will show that the conditional architecture is essential for inference (Sec. 4.2).\n\n3.2 Transformation classes\n\nUntil now, we have assumed that there exists a class of transformations T that captures the perspective factors in data. Clearly, the choice of T depends on the true factors underlying the data, but in many cases an affine transformation should suffice,\n\nT_γ(x) = Ax + b = [γ11 γ12 γ13; γ21 γ22 γ14] [x; y; 1].    (8)\n\nHowever, the C-VITAE model requires access to the inverse transformation T⁻¹. The inverse of Eq. 8 is given by T_γ⁻¹(x) = A⁻¹(x − b), which only exists if A has a non-zero determinant.\nOne, easily verified, approach to secure invertibility is to parametrize the transformation by two scale factors sx, sy, one rotation angle α, one shear parameter m and two translation parameters tx, ty:\n\nT_γ(x) = [cos(α) −sin(α); sin(α) cos(α)] [1 m; 0 1] [sx 0; 0 sy] x + [tx; ty].    (9)\n\nIn this case the inverse is trivially\n\nT⁻¹_(sx, sy, α, m, tx, ty)(x) = T_(1/sx, 1/sy, −α, −m, −tx, −ty)(x),    (10)\n\nwhere the scale factors must be strictly positive.\nAn easier and more elegant approach is to leverage the matrix exponential. That is, instead of parametrizing the transformation in Eq. 8, we instead parametrize the velocity of the transformation,\n\nT_γ(x) = expm([γ11 γ12 γ13; γ21 γ22 γ14; 0 0 0]) [x; y; 1].    (11)\n\nThe inverse² is then T_γ⁻¹ = T_−γ. Then T in Eq. 11 is a C∞-diffeomorphism (i.e. a differentiable invertible map with a differentiable inverse) [Duistermaat and Kolk, 2000]. Experiments show that diffeomorphic transformations stabilize training and yield tighter ELBOs (see supplements).\n\nFigure 4: Random deformation field of an affine transformation (top) compared to a CPAB (bottom). We clearly see that CPAB transformations offer a much more flexible and rich class of diffeomorphic transformations.\n\n²Follows from T_γ and T_−γ being commuting matrices.\n\nOften we will not have prior knowledge regarding which transformation classes are suitable for disentangling the data. 
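The inverse property T_γ⁻¹ = T_−γ and the commuting-matrix argument of the footnote can be checked numerically; a sketch with a hand-rolled truncated Taylor-series matrix exponential (a real implementation would use a library routine such as scipy.linalg.expm):

```python
import numpy as np

def expm(M, terms=30):
    """Matrix exponential via truncated Taylor series.

    Adequate for small, well-scaled velocity matrices as used here; not a
    production implementation.
    """
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

# A velocity matrix as in Eq. 11: the last row is zero, so its exponential
# is an affine transformation in homogeneous coordinates.
gamma = np.array([[0.1, -0.3,  0.5],
                  [0.2,  0.1, -0.4],
                  [0.0,  0.0,  0.0]])

T, T_inv = expm(gamma), expm(-gamma)
# gamma and -gamma commute, so expm(gamma) @ expm(-gamma) = expm(0) = I:
assert np.allclose(T @ T_inv, np.eye(3))
```

Note that invertibility holds for any velocity matrix, without constraints on the parameters; this is what makes the exponential parametrization more elegant than Eq. 9.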
A natural way forward is then to apply a highly flexible class of transformations that are treated as a “black box”. Inspired by Detlefsen et al. [2018], we also consider transformations T_γ using the highly expressive diffeomorphic transformations CPAB from Freifeld et al. [2015]. These can be viewed as an extension of Eq. 11: instead of having a single affine transformation parametrized by its velocity, the image domain is divided into smaller cells, each having its own affine velocity. The collection of local affine velocities can be efficiently parametrized and integrated, giving a fast and flexible diffeomorphic transformation; see Fig. 4 for a comparison between an affine transformation and a CPAB transformation. For details, see Freifeld et al. [2015].\nWe note that our transformer architecture is similar to the work of Lorenz et al. [2019] and Xing et al. [2019] in that they also try to achieve disentanglement through spatial transformations. However, our work differs in the choice of transformation. This is key, as the theory of Higgins et al. [2018] strongly relies on disentanglement through group actions. This places hard constraints on which spatial transformations are allowed: they have to form a smooth group. Both the thin-plate-spline transformations considered in Lorenz et al. [2019] and the displacement fields considered in Xing et al. [2019] are not invertible and hence do not correspond to proper group actions. Since diffeomorphic transformations form a smooth group, this choice is paramount to realize the theory of Higgins et al. [2018].\n\n4 Experimental results and discussion\n\nFor all experiments, we train a standard VAE, a β-VAE [Higgins et al., 2017], a β-TCVAE [Chen et al., 2018], a DIP-VAE-II [Kumar et al., 2017] and our developed VITAE model. We model the encoders and decoders as multilayer perceptron networks (MLPs). 
For a fair comparison, the number of trainable parameters is approximately the same in all models. The models were implemented in PyTorch [Paszke et al., 2017] and the code is available at https://github.com/SkafteNicki/unsuper/.\nEvaluation metric. Measuring disentanglement still seems to be an unsolved problem, but the work of Locatello et al. [2019] found that most proposed disentanglement metrics are highly correlated. We have chosen to focus on the DCI metric from Eastwood and Williams [2019], since this metric has seen some uptake in the research community. This metric measures how well the generative factors can be predicted from the latent factors. For the MNIST and SMPL datasets, the generative factors are discrete instead of continuous, so we change the standard linear regression network to a kNN-classification algorithm. We denote this metric Dscore in the results.\n\n4.1 Disentanglement on shapes\n\nWe initially test our models on the dSprites dataset [Matthey et al., 2017], a well-established benchmark for evaluating the performance of disentanglement algorithms. The results can be seen in Table 1. We find that our proposed C-VITAE model performs best, followed\n\nFigure 5: Reconstructions (left images) and manipulation of latent codes (right images) on MNIST for the three different models: VAE (a), β-VAE (b) and C-VITAE (c). The right images are generated by varying one latent dimension in all models, while keeping the rest fixed. 
For the C-VITAE model, we have shown this for both the appearance and perspective spaces.\n\nTable 1: Quantitative results on three datasets. For each dataset we report the ELBO, test set log likelihood and disentanglement score Dscore. Bold marks best results.\n\n            |       dSprite          |        MNIST           |              SMPL\n            | ELBO   log p(x) Dscore | ELBO  log p(x) Dscore  |   ELBO      log p(x)   Dscore\nVAE         | -47.05  -49.32   0.05  | -172   -169     0.579  | -8.62×10³   -8.62×10³   0.485\nβ-VAE       | -79.45  -81.38   0.18  | -152   -150     0.653  | -8.62×10³   -8.60×10³   0.525\nβ-TCVAE     | -66.48  -68.12   0.30  | -144   -141     0.679  | -8.62×10³   -8.56×10³   0.651\nDIP-VAE-II  | -46.32  -48.92   0.12  | -155   -140     0.733  | -8.62×10³   -8.54×10³   0.743\nU-VITAE     | -55.25  -57.29   0.22  | -143   -142     0.782  | -8.62×10³   -8.55×10³   0.673\nC-VITAE     | -68.26  -70.49   0.38  | -141   -139     0.884  | -8.62×10³   -8.52×10³   0.943\n\nby the β-TCVAE model in terms of disentanglement. The experiments clearly show the effect on performance of the improved inference structure of C-VITAE compared to U-VITAE. It can be shown that the conditional architecture of C-VITAE minimizes the mutual information between zA and zP, leading to better disentanglement of the two latent spaces. To get the U-VITAE architecture to work similarly would require an auxiliary loss term added to the ELBO.\n\n4.2 Disentanglement of MNIST images\n\nSecondly, we test our model on the MNIST dataset [LeCun et al., 1998]. To make the task more difficult, we artificially augment the dataset by first randomly rotating each image by an angle uniformly chosen in the interval [−20°, 20°] and secondly translating the images by t = [x, y], where x and y are uniformly chosen from the interval [−3, 3]. For VITAE, we model the perspective with an affine diffeomorphic transformation (Eq. 
11).\nThe quantitative results can be seen in Table 1. We clearly see that C-VITAE outperforms the alternatives on all measures. Overall, we observe that better disentanglement seems to give better distribution fitting. Qualitatively, Fig. 5 shows the effect of manipulating the latent codes alongside test reconstructions for VAE, β-VAE and C-VITAE. Due to space constraints, the results from β-TCVAE and DIP-VAE-II can be found in the supplementary material. The plots were generated by following the protocol from Higgins et al. [2017]: one latent factor is linearly increased from −3 to 3, while the rest are kept fixed. In the VAE (Fig. 5(a)), this changes both the appearance (going from a 7 to a 1) and the perspective (going from rotated slightly left to rotated right). We see no meaningful disentanglement of latent factors. In the β-VAE model (Fig. 5(b)), we observe some disentanglement, since only the appearance changes with the latent factor. However, this disentanglement comes at the cost of poor reconstructions. This trade-off is directly linked to the emphasized regularization in the β-VAE. We note that the value β = 4.0 proposed in the original paper [Higgins et al., 2017] is insufficient in our experiments to observe any disentanglement, and we use β = 8.0 based on qualitative evaluation of results. For β-TCVAE and DIP-VAE-II we observe nearly the same amount of qualitative disentanglement as β-VAE; however, these models achieve less blurred samples and reconstructions. This is probably due to the two models' decomposition of the KL-term, which only increases the parts that actually contribute to disentanglement. Finally, for our developed VITAE model (Fig. 
5(c)), we clearly see that when we change the latent code in the appearance space (top row), we only change the content of the generated images, while manipulating the latent code in the perspective space (bottom row) only changes the perspective, i.e. image orientation.\nInterestingly, we observe that there exists more than one prototype of a 1 in the appearance space of VITAE, going from slightly bent to straightened out. By our definition of disentanglement (everything left after transforming the image is appearance), there is nothing wrong with this. This is simply a consequence of using an affine transformation that cannot model this kind of local deformation. Choosing a more flexible transformation class could factor out this kind of perspective. The supplements contain generated samples from the different models.\n\n4.3 Disentanglement of body shape and pose\n\nWe now consider synthetic image data of human bodies generated by the Skinned Multi-Person Linear Model (SMPL) [Loper et al., 2015], which is explicitly factored into shape and pose. We generate 10,000 bodies (8,000 for training, 2,000 for testing) by first continuously sampling body shape (going from thin to thick) and then uniformly sampling a body pose from four categories ((arms up, tight), (arms up, wide), (arms down, tight), (arms down, wide)). Fig. 2 shows examples of generated images. Since a change in body shape approximately amounts to a local shape deformation, we model the perspective factors using the aforementioned \"black-box\" diffeomorphic CPAB transformations (Sec. 3.2). The remaining appearance factor should then reflect body pose.\n\nQuantitative evaluation. We again refer to Table 1, which shows ELBO, test set log-likelihood and disentanglement score for all models. As before, C-VITAE is both better at modelling the data distribution and achieves a higher disentanglement score. 
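For datasets with discrete generative factors, the kNN-based variant of the disentanglement score described in Sec. 4 can be approximated by a leave-one-out classification accuracy; a simplified sketch (the function name and the accuracy-as-score simplification are ours, not the exact metric of Eastwood and Williams):

```python
import numpy as np

def knn_factor_score(latents, factors, k=5):
    """Leave-one-out kNN accuracy of predicting a discrete generative factor
    from latent codes; a simplified stand-in for the full metric.

    latents: (n, d) float array of latent codes.
    factors: (n,) int array of discrete generative factor labels.
    """
    n = len(latents)
    correct = 0
    for i in range(n):
        dist = np.linalg.norm(latents - latents[i], axis=1)
        dist[i] = np.inf                               # leave the query out
        nearest = np.argsort(dist)[:k]
        pred = np.bincount(factors[nearest]).argmax()  # majority vote
        correct += int(pred == factors[i])
    return correct / n
```

When the latent codes cluster perfectly by factor the score is 1; a latent space carrying no information about the factor scores at chance level.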
The explanation is that for a standard VAE model (or β-VAE and its variants, for that matter) to learn a complex body shape deformation model, it requires a high-capacity network. However, the VITAE architecture gives the autoencoder a short-cut to learning these transformations that only requires optimizing a few parameters. We are not guaranteed that the model will learn anything meaningful or that it actually uses this short-cut, but experimental evidence points in that direction. A similar argument holds in the case of MNIST, where a standard MLP may struggle to learn rotation of digits, but the ST-layer in the VITAE architecture provides a short-cut. Furthermore, we found the training of VITAE to be more stable than that of the other models.\n\nFigure 6: Disentanglement of body shape and body pose on SMPL-generated bodies for three different models. The images are generated by varying one latent dimension, while keeping the rest fixed. For the C-VITAE model we have shown this for both the appearance and perspective spaces, since this is the only model where we quantitatively observe disentanglement.\n\nQualitative evaluation. Again, we manipulate the latent codes to visualize their effect (Fig. 6). This time, we show the results for β-TCVAE, DIP-VAE-II and VITAE. The results from the standard VAE and β-VAE can be found in the supplementary material. For both β-TCVAE and DIP-VAE-II we do not observe disentanglement of body pose and shape, since the decoded images change both arm position (from up to down) and body shape. We note that for β-VAE, β-TCVAE and DIP-VAE-II we did a grid search for their respective hyperparameters. For these three models, we observe that the choice of hyperparameters (scaling of the KL term) can have a detrimental impact on reconstructions and generated samples. Due to lack of space, test set reconstructions and generated samples can be found in the supplementary material. 
For VITAE we observe some disentanglement of body pose and shape, as variation in the appearance space mostly changes the position of the arms, while variation in the perspective space mostly changes body shape. The fact that we cannot achieve full disentanglement of this SMPL dataset indicates the difficulty of the task.\n\n4.4 Disentanglement on CelebA\n\nFinally, we qualitatively evaluated our proposed model on the CelebA dataset [Liu et al., 2015]. Since this is a \"real life\" dataset, we do not have access to the generative factors and can therefore only qualitatively evaluate the model. We again model the perspective factors using the aforementioned CPAB transformations, which we assume can model the facial shape. The results can be seen in Fig. 7, which shows latent traversals of both the perspective and appearance factors, and how they influence the generated images. We do observe some interpolation artifacts that are common for architectures using spatial transformers.\n\n(a) Changing zP,1 corresponds to facial size.\n\n(b) Changing zP,2 corresponds to facial displacement.\n\n(c) Changing zA,2 corresponds to hair color.\n\nFigure 7: Traversal in latent space shows that our model can disentangle complex factors such as facial size, facial position and hair color.\n\n5 Summary\n\nIn this paper, we have shown how to explicitly disentangle appearance from perspective in a variational autoencoder [Kingma and Welling, 2013, Rezende et al., 2014]. This is achieved by incorporating a spatial transformer layer [Jaderberg et al., 2015] into both the encoder and decoder in a coupled manner. 
The concepts of appearance and perspective are broad, as is evident from our experimental results on human body images, where they correspond to pose and shape, respectively. By choosing the class of transformations in accordance with prior knowledge, it becomes an effective tool for controlling the inductive bias needed for disentangled representation learning. On both MNIST and body images our method quantitatively and qualitatively outperforms general-purpose disentanglement models [Higgins et al., 2017, Chen et al., 2018, Kumar et al., 2017]. We find it unsurprising that, in situations where some prior knowledge about the generative factors is available, encoding it into the model gives better results than ignoring such information.

Our results support the hypothesis [Higgins et al., 2018] that inductive biases are necessary for learning disentangled representations, and our model is a step in the direction of fully disentangled generative models. We envision that the VITAE model could be combined with other models, by first using VITAE to separate appearance and perspective, and then training a second model only on the appearance. This would factor out one latent factor at a time, leaving a hierarchy of disentangled factors.

Acknowledgements. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no 757360). NSD and SH were supported in part by a research grant (15334) from VILLUM FONDEN. We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPU hardware used for this research.

References

R. Annunziata, C. Sagonas, and J. Calì. DeSTNet: Densely fused spatial transformer networks. CoRR, abs/1807.04050, 2018.

H. S. Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.

A. J. Bell and T. J.
Sejnowski. The "independent components" of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.

Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. arXiv e-prints, art. arXiv:1206.5538, June 2012.

Y. Burda, R. B. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.

R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud. Isolating Sources of Disentanglement in Variational Autoencoders. Feb. 2018. URL http://arxiv.org/abs/1802.04942.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.

P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994. ISSN 0165-1684.

D'Arcy Thompson. On Growth and Form. 1917.

N. S. Detlefsen, O. Freifeld, and S. Hauberg. Deep diffeomorphic transformer networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

J. Duistermaat and J. Kolk. Lie groups and Lie algebras. In Lie Groups, pages 1–92. Springer Berlin Heidelberg, 2000.

C. Eastwood and C. K. I. Williams. A Framework for the Quantitative Evaluation of Disentangled Representations. Feb. 2019. URL https://openreview.net/forum?id=By-7dz-AZ.

S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. Mar. 2016. doi: 10.1038/nature14236.

O. Freifeld, S. Hauberg, K. Batmanghelich, and J. W. F. III. Highly-expressive spaces of well-behaved transformations: Keeping it simple. In ICCV, 2015.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y.
Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

S. Hauberg, O. Freifeld, A. B. L. Larsen, J. W. F. III, and L. K. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 41, 2016.

I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner. Towards a Definition of Disentangled Representations. arXiv e-prints, art. arXiv:1812.02230, Dec. 2018.

A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.

A. Hyvärinen, H. Sasaki, and R. E. Turner. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. arXiv e-prints, art. arXiv:1805.08651, May 2018.

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.

D. G. Kendall. A survey of the statistical theory of shape. Statistical Science, pages 87–99, 1989.

D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv e-prints, Dec. 2013.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improving Variational Inference with Inverse Autoregressive Flow. arXiv e-prints, art. arXiv:1606.04934, June 2016.

T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. CoRR, abs/1503.03167, 2015.

A. Kumar, P. Sattigeri, and A. Balakrishnan.
Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. Nov. 2017. URL http://arxiv.org/abs/1711.00848.

B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. CoRR, abs/1604.00289, 2016.

E. G. Learned-Miller. Data driven image models through continuous joint alignment. IEEE Trans. Pattern Anal. Mach. Intell., 28(2):236–250, Feb. 2006. ISSN 0162-8828.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998.

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

M. S. Lewicki. Efficient coding of natural sounds. Nature Neuroscience, 5(4):356, 2002.

C. Lin and S. Lucey. Inverse compositional spatial transformer networks. CoRR, abs/1612.03897, 2016.

C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. arXiv e-prints, Mar. 2018.

Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Dec. 2015.

F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. Proceedings of the 36th International Conference on Machine Learning, 2019.

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.

D. Lorenz, L. Bereska, T. Milbich, and B. Ommer. Unsupervised Part-Based Disentangling of Object Shape and Appearance. Proceedings of Computer Vision and Pattern Recognition (CVPR), Mar. 2019.

L. Matthey, I. Higgins, D. Hassabis, and A.
Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

D. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv e-prints, Jan. 2014.

T. Shakunaga and K. Shigenari. Decomposed eigenface for face recognition under various lighting conditions. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages I–I. IEEE, 2001.

J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Comput., 12(6):1247–1283, June 2000. ISSN 0899-7667.

M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

J. H. van Hateren and D. L. Ruderman. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London B: Biological Sciences, 265(1412):2315–2320, 1998.

X. Xing, R. Gao, T. Han, S.-C. Zhu, and Y. Nian Wu. Deformable Generator Network: Unsupervised Disentanglement of Appearance and Geometry. Proceedings of Computer Vision and Pattern Recognition (CVPR), June 2019.