{"title": "Isolating Sources of Disentanglement in Variational Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 2610, "page_last": 2620, "abstract": "We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate the beta-TCVAE (Total Correlation Variational Autoencoder) algorithm, a refinement and plug-in replacement of the beta-VAE for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement, when the model is trained using our framework.", "full_text": "Isolating Sources of Disentanglement in VAEs\n\nRicky T. Q. Chen, Xuechen Li, Roger Grosse, David Duvenaud\n\nUniversity of Toronto, Vector Institute\n\nAbstract\n\nWe decompose the evidence lower bound to show the existence of a term measuring\nthe total correlation between latent variables. We use this to motivate the \u03b2-TCVAE\n(Total Correlation Variational Autoencoder) algorithm, a refinement and plug-in\nreplacement of the \u03b2-VAE for learning disentangled representations, requiring\nno additional hyperparameters during training. 
We further propose a principled\nclassi\ufb01er-free measure of disentanglement called the mutual information gap (MIG).\nWe perform extensive quantitative and qualitative experiments, in both restricted\nand non-restricted settings, and show a strong relation between total correlation\nand disentanglement, when the model is trained using our framework.\n\n1\n\nIntroduction\n\nLearning disentangled representations without supervision is a dif\ufb01cult open problem. Disentangled\nvariables are generally considered to contain interpretable semantic information and re\ufb02ect separate\nfactors of variation in the data. While the de\ufb01nition of disentanglement is open to debate, many\nbelieve a factorial representation, one with statistically independent variables, is a good starting\npoint [1, 2, 3]. Such representations distill information into a compact form which is oftentimes\nsemantically meaningful and useful for a variety of tasks [2, 4]. For instance, it is found that such\nrepresentations are more generalizable and robust against adversarial attacks [5].\nMany state-of-the-art methods for learning disentangled representations are based on re-weighting\nparts of an existing objective. For instance, it is claimed that mutual information between latent\nvariables and the observed data can encourage the latents into becoming more interpretable [6]. It\nis also argued that encouraging independence between latent variables induces disentanglement\n[7]. However, there is no strong evidence linking factorial representations to disentanglement. In\npart, this can be attributed to weak qualitative evaluation procedures. 
While traversals in the latent\nrepresentation can qualitatively illustrate disentanglement, quantitative measures of disentanglement\nare in their infancy.\nIn this paper, we:\n\n\u2022 show a decomposition of the variational lower bound that can be used to explain the success\nof the \u03b2-VAE [7] in learning disentangled representations.\n\u2022 propose a simple method based on weighted minibatches to stochastically train with arbitrary\nweights on the terms of our decomposition without any additional hyperparameters.\n\u2022 introduce the \u03b2-TCVAE, which can be used as a plug-in replacement for the \u03b2-VAE with no\nextra hyperparameters. Empirical evaluations suggest that the \u03b2-TCVAE discovers more\ninterpretable representations than existing methods, while also being fairly robust to random\ninitialization.\n\u2022 propose a new information-theoretic disentanglement metric, which is classifier-free and\ngeneralizable to arbitrarily-distributed and non-scalar latent variables.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f[Figure 1 rows: \u03b2-VAE [7] (top); \u03b2-TCVAE (Ours) (bottom)]\n\n(a) Baldness (-6, 6)\n\n(b) Face width (0, 6)\n\n(c) Gender (-6, 6)\n\n(d) Mustache (-6, 0)\n\nFigure 1: Qualitative comparisons on CelebA. Traversal ranges are shown in parentheses. 
Some\nattributes are only manifested in one direction of a latent variable, so we show a one-sided traversal.\nMost semantically similar variables from a \u03b2-VAE are shown for comparison.\n\nWhile Kim & Mnih [8] have independently proposed augmenting VAEs with an equivalent total\ncorrelation penalty to the \u03b2-TCVAE, their proposed training method differs from ours and requires\nan auxiliary discriminator network.\n\n2 Background: Learning and Evaluating Disentangled Representations\n\nWe discuss existing work that aims at either learning disentangled representations without supervision\nor evaluating such representations. The two problems are inherently related, since improvements\nto learning algorithms require evaluation metrics that are sensitive to subtle details, and stronger\nevaluation metrics reveal deficiencies in existing methods.\n\n2.1 Learning Disentangled Representations\n\nVAE and \u03b2-VAE The variational autoencoder (VAE) [9, 10] is a latent variable model that pairs a\ntop-down generator with a bottom-up inference network. Instead of directly performing maximum\nlikelihood estimation on the intractable marginal log-likelihood, training is done by optimizing the\ntractable evidence lower bound (ELBO). We would like to optimize this lower bound averaged over\nthe empirical distribution (with \u03b2 = 1):\n\nL_\u03b2 = (1/N) \u2211_{n=1}^{N} ( E_q[log p(x_n|z)] \u2212 \u03b2 KL(q(z|x_n) || p(z)) )   (1)\n\nThe \u03b2-VAE [7] is a variant of the variational autoencoder that attempts to learn a disentangled\nrepresentation by optimizing a heavily penalized objective with \u03b2 > 1. Such simple penalization\nhas been shown to be capable of obtaining models with a high degree of disentanglement in image\ndatasets. 
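As a numeric illustration of objective (1), the sketch below evaluates the per-example \u03b2-VAE bound for a diagonal-Gaussian posterior and standard-normal prior, where the KL term has a closed form. All names here are our own illustration, not the authors' code, and the reconstruction term is supplied as a number rather than computed by a decoder.

```python
import math

# Minimal numeric sketch of the per-example beta-VAE objective (Eq. 1),
# assuming a diagonal-Gaussian posterior q(z|x) = N(mu, diag(exp(logvar)))
# and a standard-normal prior p(z) = N(0, I). Names are illustrative.

def kl_to_std_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def beta_vae_objective(recon_loglik, mu, logvar, beta=1.0):
    """E_q[log p(x|z)] - beta * KL(q(z|x) || p(z)); beta = 1 gives the ELBO."""
    return recon_loglik - beta * kl_to_std_normal(mu, logvar)

# With mu = 0 and logvar = 0 the posterior equals the prior, so KL = 0
# and the objective is the reconstruction term regardless of beta.
print(beta_vae_objective(-100.0, [0.0, 0.0], [0.0, 0.0], beta=4.0))  # -> -100.0
```

Setting \u03b2 > 1 simply upweights the KL penalty, which is the heavy penalization the paragraph above describes.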
However, it is not made explicit why penalizing KL(q(z|x) || p(z)) with a factorial prior can\nlead to learning latent variables that exhibit disentangled transformations for all data samples.\n\nInfoGAN The InfoGAN [6] is a variant of the generative adversarial network (GAN) [11] that\nencourages an interpretable latent representation by maximizing the mutual information between the\nobservation and a small subset of latent variables. The approach relies on optimizing a lower bound\nof the intractable mutual information.\n\n2.2 Evaluating Disentangled Representations\n\nWhen the true underlying generative factors are known and we have reason to believe that this\nset of factors is disentangled, it is possible to create a supervised evaluation metric. Many have\nproposed classifier-based metrics for assessing the quality of disentanglement [7, 8, 12, 13, 14, 15].\n\n\fWe focus on discussing the metrics proposed in [7] and [8], as they are relatively simple in design\nand generalizable.\nThe Higgins\u2019 metric [7] is defined as the accuracy that a low VC-dimension linear classifier can\nachieve at identifying a fixed ground truth factor. Specifically, for a set of ground truth factors\n{v_k}_{k=1}^{K}, each training data point is an aggregation over L samples, (1/L) \u2211_{l=1}^{L} |z^{(1)}_l \u2212 z^{(2)}_l|, where\nrandom vectors z^{(1)}_l, z^{(2)}_l are drawn i.i.d. from q(z|v_k)1 for any fixed value of v_k, and a classification\ntarget k. A drawback of this method is the lack of axis-alignment detection. That is, we believe a truly\ndisentangled model should only contain one latent variable that is related to each factor. As a means\nto include axis-alignment detection, [8] proposes using argmin_j Var_{q(z_j|v_k)}[z_j] and a majority-vote\nclassifier.\nClassifier-based disentanglement metrics tend to be ad-hoc and sensitive to hyperparameters. The\nmetrics in [7] and [8] can be loosely interpreted as measuring the reduction in entropy of z if v\nis observed. 
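The construction of one Higgins-metric training point can be sketched as below. `sample_x` and `encode` are hypothetical stand-ins for the generative process p(x|v_k) and the encoder of q(z|x); nothing here comes from the authors' code.

```python
import random

# Hypothetical sketch of building one training point for the Higgins
# et al. [7] metric: fix a factor value v_k, draw paired latents, and
# aggregate |z(1)_l - z(2)_l| over L pairs.

def higgins_training_point(sample_x, encode, v_k, L=64):
    """Returns one feature vector for the low-capacity linear classifier;
    the classification target for this point is the factor index k."""
    z_dim = len(encode(sample_x(v_k)))
    agg = [0.0] * z_dim
    for _ in range(L):
        z1 = encode(sample_x(v_k))  # z ~ q(z|x), x ~ p(x|v_k)
        z2 = encode(sample_x(v_k))
        agg = [a + abs(u - v) for a, u, v in zip(agg, z1, z2)]
    return [a / L for a in agg]

# Toy check: if latent dimension 0 encodes v_k exactly, its paired
# differences vanish, while a noise dimension's do not.
sample_x = lambda v: (v, random.random())  # "observation" = (factor, noise)
encode = lambda x: [x[0], x[1]]            # identity encoder
feats = higgins_training_point(sample_x, encode, v_k=0.7, L=32)
assert feats[0] == 0.0 and feats[1] > 0.0
```

The toy check also illustrates the axis-alignment blind spot discussed above: any rotation of this encoder would spread the small differences across dimensions without changing classifier accuracy much.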
In section 4, we show that it is possible to directly measure the mutual information\nbetween z and v, which is a principled information-theoretic quantity that can be used for any latent\ndistributions provided that efficient estimation exists.\n\n3 Sources of Disentanglement in the ELBO\n\nIt is suggested that two quantities are especially important in learning a disentangled representation\n[6, 7]: A) Mutual information between the latent variables and the data variable, and B) Independence\nbetween the latent variables.\nA term that quantifies criterion A was illustrated by an ELBO decomposition [16]. In this section, we\nintroduce a refined decomposition showing that terms describing both criteria appear in the ELBO.\n\nELBO TC-Decomposition We identify each training example with a unique integer index and\ndefine a uniform random variable on {1, 2, ..., N} with which we relate to data points. Furthermore,\nwe define q(z|n) = q(z|x_n) and q(z, n) = q(z|n)p(n) = q(z|n)(1/N). We refer to\nq(z) = \u2211_{n=1}^{N} q(z|n)p(n) as the aggregated posterior following [17], which captures the aggregate\nstructure of the latent variables under the data distribution. With this notation, we decompose the KL\nterm in (1) assuming a factorized p(z):\n\nE_{p(n)}[KL(q(z|n) || p(z))] = KL(q(z, n) || q(z)p(n)) + KL(q(z) || \u220f_j q(z_j)) + \u2211_j KL(q(z_j) || p(z_j))   (2)\n\nwhere z_j denotes the jth dimension of the latent variable. We label the three right-hand-side terms\n(i) the index-code MI, (ii) the total correlation, and (iii) the dimension-wise KL.\n\nDecomposition Analysis In a similar decomposition [16], term (i) is referred to as the index-code\nmutual information (MI). The index-code MI is the mutual information I_q(z; n) between the data\nvariable and latent variable based on the empirical data distribution q(z, n). 
It is argued that a\nhigher mutual information can lead to better disentanglement [6], and some have even proposed to\ncompletely drop the penalty on this term during optimization [17, 18]. However, recent investigations\ninto generative modeling also claim that a penalized mutual information through the information\nbottleneck encourages compact and disentangled representations [3, 19].\n\nIn information theory, term (ii) is referred to as the total correlation (TC), one of many generalizations\nof mutual information to more than two random variables [20]. The naming is unfortunate as it is\nactually a measure of dependence between the variables. The penalty on TC forces the model to find\nstatistically independent factors in the data distribution. We claim that a heavier penalty on this term\ninduces a more disentangled representation, and that the existence of this term is the reason \u03b2-VAE\nhas been successful.\n\n1Note that q(z|v_k) is sampled by using an intermediate data sample: z \u223c q(z|x), x \u223c p(x|v_k).\n\n\fWe refer to term (iii) as the dimension-wise KL. This term mainly prevents individual latent dimensions\nfrom deviating too far from their corresponding priors. It acts as a complexity penalty on the aggregate\nposterior which reasonably follows from the minimum description length [21] formulation of the\nELBO.\nWe would like to verify the claim that TC is the most important term in this decomposition for\nlearning disentangled representations by penalizing only this term; however, it is difficult to estimate\nthe three terms in the decomposition. In the following section, we propose a simple yet general\nframework for training with the TC-decomposition using minibatches of data.\nA special case of this decomposition was given in [22], assuming that the use of a flexible prior can\neffectively ignore the dimension-wise KL term. 
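The decomposition in (2) can be verified numerically on a tiny discrete example. Everything below is our own illustrative construction (two data points, two binary latent dimensions, a uniform factorized prior), not the paper's code; it checks that the left-hand KL equals the sum of index-code MI, total correlation, and dimension-wise KL.

```python
import math
from itertools import product

# Toy numerical check of the TC-decomposition (Eq. 2) on a discrete example.
N = 2
states = list(product([0, 1], repeat=2))  # joint states of (z1, z2)

eps = 0.01  # smoothed point masses so all probabilities are positive
def q_z_given_n(z, n):
    target = (0, 0) if n == 0 else (1, 1)
    return (1 - 3 * eps) if z == target else eps

p_n = 1.0 / N    # uniform "empirical" distribution over data indices
p_z = 0.25       # factorized uniform prior; each p(z_j) is 0.5

q_z = {z: sum(q_z_given_n(z, n) * p_n for n in range(N)) for z in states}
q_z1 = {v: sum(q_z[z] for z in states if z[0] == v) for v in (0, 1)}
q_z2 = {v: sum(q_z[z] for z in states if z[1] == v) for v in (0, 1)}

# Left-hand side: E_{p(n)} KL(q(z|n) || p(z))
lhs = sum(p_n * sum(q_z_given_n(z, n) * math.log(q_z_given_n(z, n) / p_z)
                    for z in states) for n in range(N))

# (i) index-code MI: KL(q(z, n) || q(z)p(n))
mi = sum(p_n * sum(q_z_given_n(z, n) * math.log(q_z_given_n(z, n) / q_z[z])
                   for z in states) for n in range(N))
# (ii) total correlation: KL(q(z) || q(z1)q(z2))
tc = sum(q_z[z] * math.log(q_z[z] / (q_z1[z[0]] * q_z2[z[1]])) for z in states)
# (iii) dimension-wise KL, with prior marginals p(z_j) = 0.5
dwkl = (sum(q_z1[v] * math.log(q_z1[v] / 0.5) for v in (0, 1)) +
        sum(q_z2[v] * math.log(q_z2[v] / 0.5) for v in (0, 1)))

assert abs(lhs - (mi + tc + dwkl)) < 1e-10
```

Because the two posteriors concentrate on (0, 0) and (1, 1), the latent dimensions are strongly dependent under q(z), so the TC term is large even though each q(z|n) is nearly deterministic.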
In contrast, our decomposition (2) is more generally\napplicable to many applications of the ELBO.\n\n3.1 Training with Minibatch-Weighted Sampling\n\nWe describe a method to stochastically estimate the decomposition terms, allowing scalable training\nwith arbitrary weights on each decomposition term. Note that the decomposed expression (2) requires\nthe evaluation of the density q(z) = E_{p(n)}[q(z|n)], which depends on the entire dataset2. As such, it\nis undesirable to compute it exactly during training. One main advantage of our stochastic estimation\nmethod is the lack of hyperparameters or inner optimization loops, which should provide more stable\ntraining.\nA na\u00efve Monte Carlo approximation based on a minibatch of samples from p(n) is likely to\nunderestimate q(z). This can be intuitively seen by viewing q(z) as a mixture distribution where the data\nindex n indicates the mixture component. With a randomly sampled component, q(z|n) is close to 0,\nwhereas q(z|n) would be large if n is the component that z came from. So it is much better to sample\nthis component and weight the probability appropriately.\nTo this end, we propose using a weighted version for estimating the function log q(z) during training,\ninspired by importance sampling. When provided with a minibatch of samples {n_1, ..., n_M}, we can\nuse the estimator\n\nE_{q(z)}[log q(z)] \u2248 (1/M) \u2211_{i=1}^{M} log [ (1/(NM)) \u2211_{j=1}^{M} q(z(n_i)|n_j) ]   (3)\n\nwhere z(n_i) is a sample from q(z|n_i) (see derivation in Appendix C). This minibatch estimator is\nbiased, since its expectation is a lower bound3. However, computing it does not require any additional\nhyperparameters.\n\n2The same argument holds for the term \u220f_j q(z_j) and a similar estimator can be constructed.\n3This follows from Jensen\u2019s inequality E_{p(n)}[log q(z|n)] \u2264 log E_{p(n)}[q(z|n)].\n\n3.1.1 Special case: \u03b2-TCVAE\n\nWith minibatch-weighted sampling, it is easy to assign different weights (\u03b1, \u03b2, \u03b3) to the terms in (2):\n\nL_{\u03b2-TC} := E_{q(z|n)p(n)}[log p(n|z)] \u2212 \u03b1 I_q(z; n) \u2212 \u03b2 KL(q(z) || \u220f_j q(z_j)) \u2212 \u03b3 \u2211_j KL(q(z_j) || p(z_j))   (4)\n\nWhile we performed ablation experiments with different values for \u03b1 and \u03b3, we ultimately find that\ntuning \u03b2 leads to the best results. Our proposed \u03b2-TCVAE uses \u03b1 = \u03b3 = 1 and only modifies the\nhyperparameter \u03b2. While Kim & Mnih [8] have proposed an equivalent objective, they estimate TC\nusing an auxiliary discriminator network.\n\n4 Measuring Disentanglement with the Mutual Information Gap\n\nIt is difficult to compare disentangling algorithms without a proper metric. Most prior work has\nresorted to qualitative analysis by visualizing the latent representation. Another approach relies\non knowing the true generative process p(n|v) and ground truth latent factors v. Often these are\nsemantically meaningful attributes of the data. For instance, photographic portraits generally contain\ndisentangled factors such as pose (azimuth and elevation), lighting condition, and attributes of the\nface such as skin tone, gender, face width, etc. Though not all ground truth factors may be provided,\nit is still possible to evaluate disentanglement using the known factors. 
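A minimal sketch of the minibatch-weighted estimator (3), written for diagonal-Gaussian posteriors and using a log-sum-exp for numerical stability, might look like the following. The function and variable names are our own illustration, not the released code.

```python
import math, random

# Sketch of the minibatch-weighted estimator (Eq. 3) of E_q(z)[log q(z)],
# assuming diagonal-Gaussian posteriors q(z|n_i) = N(mu_i, diag(exp(logvar_i))).

def log_normal(z, mu, logvar):
    """log N(z; mu, diag(exp(logvar))), summed over dimensions."""
    return sum(-0.5 * (math.log(2 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
               for zi, m, lv in zip(z, mu, logvar))

def mws_log_qz(z_samples, mus, logvars, dataset_size):
    """Eq. 3: average over i of log[(1/(N*M)) * sum_j q(z(n_i)|n_j)]."""
    M = len(z_samples)
    total = 0.0
    for i in range(M):
        logs = [log_normal(z_samples[i], mus[j], logvars[j]) for j in range(M)]
        mx = max(logs)  # log-sum-exp for numerical stability
        total += mx + math.log(sum(math.exp(l - mx) for l in logs))
        total -= math.log(dataset_size * M)
    return total / M

# Toy minibatch: M = 8 one-dimensional posteriors from a dataset of size N = 1000.
random.seed(0)
mus = [[random.gauss(0, 1)] for _ in range(8)]
logvars = [[0.0]] * 8
zs = [[random.gauss(m[0], 1.0)] for m in mus]
est = mws_log_qz(zs, mus, logvars, dataset_size=1000)
assert math.isfinite(est)
```

The key point is the 1/(NM) weighting: the sampled component n_i appears in its own minibatch, so the mixture component that generated each z is always included in the inner sum.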
We propose a metric based on\nthe empirical mutual information between latent variables and ground truth factors.\n\n4.1 Mutual Information Gap (MIG)\n\nOur key insight is that the empirical mutual information between a latent variable z_j and a\nground truth factor v_k can be estimated using the joint distribution defined by q(z_j, v_k) =\n\u2211_{n=1}^{N} p(v_k)p(n|v_k)q(z_j|n). Assuming that the underlying factor distribution p(v_k) and the\ngenerating process p(n|v_k) are known for the empirical data samples, then\n\nI_n(z_j; v_k) = E_{q(z_j, v_k)}[ log \u2211_{n \u2208 X_{v_k}} q(z_j|n)p(n|v_k) ] + H(z_j)   (5)\n\nwhere X_{v_k} is the support of p(n|v_k). (See derivation in Appendix B.)\nA higher mutual information implies that z_j contains a lot of information about v_k, and the mutual\ninformation is maximal if there exists a deterministic, invertible relationship between z_j and v_k.\nFurthermore, for discrete v_k, 0 \u2264 I(z_j; v_k) \u2264 H(v_k), where H(v_k) = E_{p(v_k)}[\u2212 log p(v_k)] is the\nentropy of v_k. As such, we use the normalized mutual information I(z_j; v_k)/H(v_k).\nNote that a single factor can have high mutual information with multiple latent variables. We enforce\naxis-alignment by measuring the difference between the top two latent variables with highest mutual\ninformation. The full metric we call mutual information gap (MIG) is then\n\n(1/K) \u2211_{k=1}^{K} (1/H(v_k)) ( I_n(z_{j^{(k)}}; v_k) \u2212 max_{j \u2260 j^{(k)}} I_n(z_j; v_k) )   (6)\n\nwhere j^{(k)} = argmax_j I_n(z_j; v_k) and K is the number of known factors. MIG is bounded by 0 and 1.\nWe perform an entire pass through the dataset to estimate MIG.\nWhile it is possible to compute just the average maximal MI, (1/K) \u2211_{k=1}^{K} I_n(z_{k*}; v_k)/H(v_k),\nthe gap in our formulation (6) defends against two important cases. 
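Given a precomputed matrix of mutual informations and factor entropies, the MIG score in (6) reduces to a per-factor normalized gap. The small self-contained sketch below uses our own names; in the paper the MI values themselves would be estimated via (5).

```python
import math

# Illustrative computation of the MIG score (Eq. 6) from a precomputed
# matrix of mutual informations I_n(z_j; v_k) and factor entropies H(v_k).

def mig(mi, h_v):
    """mi[j][k] = I_n(z_j; v_k); h_v[k] = H(v_k). Returns a score in [0, 1]."""
    K = len(h_v)
    score = 0.0
    for k in range(K):
        col = sorted((mi[j][k] for j in range(len(mi))), reverse=True)
        score += (col[0] - col[1]) / h_v[k]  # gap between top two latents
    return score / K

# Perfectly disentangled, axis-aligned case: each factor is captured by
# exactly one latent variable, so the normalized gap is 1 for every factor.
perfect = [[math.log(3), 0.0],
           [0.0, math.log(5)]]
print(mig(perfect, [math.log(3), math.log(5)]))  # -> 1.0

# A rotated (entangled) representation spreads the same information across
# latents, so the gap collapses even though the MI totals are similar.
rotated = [[0.5 * math.log(3), 0.5 * math.log(5)],
           [0.5 * math.log(3), 0.5 * math.log(5)]]
assert mig(rotated, [math.log(3), math.log(5)]) == 0.0
```

The two toy matrices mirror the two failure cases discussed next: rotation of the factors and non-compact representations both shrink the gap without necessarily lowering the maximal MI.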
The first case is related to rotation of the factors.\nWhen a set of latent variables is not axis-aligned, each variable can contain a decent amount of\ninformation regarding two or more factors. The gap heavily penalizes unaligned variables, which is\nan indication of entanglement. The second case is related to compactness of the representation. If one\nlatent variable reliably models a ground truth factor, then it is unnecessary for other latent variables\nto also be informative about this factor.\nAs summarized in Table 1, our metric detects axis-alignment and is generally applicable and\nmeaningful for any factorized latent distribution, including vectors of multimodal, categorical, and\nother structured distributions. This is because the metric is only limited by whether the mutual\ninformation can be estimated. Efficient estimation of mutual information is an ongoing research\ntopic [23, 24], but we find that the simple estimator (5) can be computed within a reasonable\namount of time for the datasets we use. We find that MIG can better capture subtle differences in\nmodels compared to existing metrics. Systematic experiments analyzing MIG and existing metrics\nare in Appendix G.\n\nTable 1: In comparison to prior metrics, our proposed MIG detects axis-alignment, is unbiased for\nall hyperparameter settings, and can be generally applied to any latent distributions provided\nefficient estimation exists.\n\nMetric | Axis | Unbiased | General\nHiggins et al. [7] | No | No | No\nKim & Mnih [8] | Yes | No | No\nMIG (Ours) | Yes | Yes | Yes\n\n5 Related Work\n\nWe focus on discussing the learning of disentangled representations in an unsupervised manner.\nNevertheless, we note that inverting generative processes with known disentangled factors through\n\fweak supervision has been pursued by many. The goal in this case is not perfect inversion but to\ndistill a simpler representation [15, 25, 26, 27, 28]. 
Although not explicitly the main motivation, many\nunsupervised generative modeling frameworks have explored the disentanglement of their learned\nrepresentations [9, 17, 29]. Prior to \u03b2-VAE [7], some have shown successful disentanglement in\nlimited settings with few factors of variation [1, 14, 30, 31].\nAs a means to describe the properties of disentangled representations, factorial representations have\nbeen motivated by many [1, 2, 3, 22, 32, 33]. In particular, Appendix B of [22] shows the existence of\nthe total correlation in a similar objective with a \ufb02exible prior and assuming optimality q(z) = p(z).\nSimilarly, [34] arrives at the ELBO from an objective that combines informativeness and the total\ncorrelation of latent variables. In contrast, we show a more general analysis of the unmodi\ufb01ed\nevidence lower bound.\nThe existence of the index-code MI in the ELBO has been shown before [16], and as a result,\nFactorVAE, which uses an equivalent objective to the \u03b2-TCVAE, is independently proposed [8].\nThe main difference is they estimate the total correlation using the density ratio trick [35] which\nrequires an auxiliary discriminator network and an inner optimization loop. In contrast, we emphasize\nthe success of \u03b2-VAE using our re\ufb01ned decomposition, and propose a training method that allows\nassigning arbitrary weights to each term of the objective without requiring any additional networks.\nIn a similar vein, non-linear independent component analysis [36, 37, 38] studies the problem of\ninverting a generative process assuming independent latent factors. Instead of a perfect inversion, we\nonly aim for maximizing the mutual information between our learned representation and the ground\ntruth factors. Simple priors can further encourage interpretability by means of warping complex\nfactors into simpler manifolds. 
To the best of our knowledge, we are the first to show a strong\nquantifiable relation between factorial representations and disentanglement (see Section 6).\n\n6 Experiments\n\nWe perform a series of quantitative and qualitative experiments, showing that \u03b2-TCVAE can\nconsistently achieve higher MIG scores compared to prior methods \u03b2-VAE [7] and InfoGAN [6],\nand can match the performance of FactorVAE [8] whilst performing better in scenarios where the\ndensity ratio trick is difficult to train. Furthermore, we find that in models trained with our method,\ntotal correlation is strongly correlated with disentanglement.4\n\nTable 2: Summary of datasets with known ground truth factors. Parentheses contain the number of\nquantized values for each factor.\n\nDataset | Ground truth factors\ndSprites | scale (6), rotation (40), posX (32), posY (32)\n3D Faces | azimuth (21), elevation (11), lighting (11)\n\nIndependent Factors of Variation First, we analyze the performance of our proposed \u03b2-TCVAE\nand MIG metric in a restricted setting, with ground truth factors that are uniformly and independently\nsampled. To paint a clearer picture on the robustness of learning algorithms, we aggregate results\nfrom multiple experiments to visualize the effect of initialization.\nWe perform quantitative evaluations with two datasets, a dataset of 2D shapes [39] and a dataset of\nsynthetic 3D faces [40]. Their ground truth factors are summarized in Table 2. The dSprites and 3D\nfaces datasets also contain 3 types of shapes and 50 identities, respectively, which are treated as noise\nduring evaluation.\n\nELBO vs. Disentanglement Trade-off Since the \u03b2-VAE and \u03b2-TCVAE objectives are lower\nbounds on the standard ELBO, we would like to see the effect of training with this modification.\nTo see how the choice of \u03b2 affects these learning algorithms, we train using a range of values. 
The\ntrade-off between density estimation and the amount of disentanglement measured by MIG is shown\nin Figure 2.\n\n4Code is available online.\n\n\f(a) dSprites\n\n(b) 3D Faces\n\nFigure 2: Compared to \u03b2-VAE, \u03b2-TCVAE creates more disentangled representations while preserving\na better generative model of the data with increasing \u03b2. Shaded regions show the 90% confidence\nintervals. Higher is better on both metrics.\n\n(a) dSprites\n\n(b) 3D Faces\n\nFigure 3: Distribution of disentanglement score\n(MIG) for different modeling algorithms.\n\n(a) dSprites\n\n(b) 3D Faces\n\nFigure 4: Scatter plots of the average MIG and TC\nper value of \u03b2. Larger circles indicate a higher \u03b2.\n\nWe find that \u03b2-TCVAE provides a better trade-off between density estimation and disentanglement.\nNotably, with higher values of \u03b2, the mutual information penalty in \u03b2-VAE is too strong and this\nhinders the usefulness of the latent variables. However, \u03b2-TCVAE with higher values of \u03b2 consistently\nresults in models with a higher disentanglement score relative to \u03b2-VAE.\nWe also perform ablation studies on the removal of the index-code MI term by setting \u03b1 = 0\nin (4), and a model using a factorized normalizing flow as the prior distribution which is jointly\ntrained to maximize the modified objective. Neither resulted in a significant performance difference,\nsuggesting that tuning the weight of the TC term in (2) is the most useful for learning disentangled\nrepresentations.\n\nQuantitative Comparisons While a disentangled representation may be achievable by some\nlearning algorithms, the chance of obtaining such a representation is typically not clear. 
Unsupervised\nlearning of a disentangled representation can have high variance since disentangled labels are not\nprovided during training. To further understand the robustness of each algorithm, we show box\nplots depicting the quartiles of the MIG score distribution for various methods in Figure 3. We\nused \u03b2 = 4 for \u03b2-VAE and \u03b2 = 6 for \u03b2-TCVAE, based on modes in Figure 2. For InfoGAN, we\nused 5 continuous latent codes and 5 noise variables. Other settings are chosen following those\nsuggested by [6], but we also added instance noise [41] to stabilize training. FactorVAE uses an\nequivalent objective to the \u03b2-TCVAE but is trained with the density ratio trick [35], which is known\nto underestimate the TC term [8]. As a result, we tuned \u03b2 \u2208 [1, 80] and used double the number of\niterations for FactorVAE. Note that while \u03b2-VAE, FactorVAE and \u03b2-TCVAE use a fully connected\narchitecture for the dSprites dataset, InfoGAN uses a convolutional architecture for increased stability.\nWe also \ufb01nd that FactorVAE performs poorly with fully connected layers, resulting in worse results\nthan \u03b2-VAE on the dSprites dataset.\nIn general, we \ufb01nd that the median score is highest for \u03b2-TCVAE and it is close to the highest score\nachieved by all methods. Despite the best half of the \u03b2-TCVAE runs achieving relatively high scores,\nwe see that the other half can still perform poorly. Low-score outliers exist in the 3D faces dataset,\nalthough their scores are still higher than the median scores achieved by both VAE and InfoGAN.\n\nFactorial vs. Disentangled Representations While a low total correlation has been previously\nconjectured to lead to disentanglement, we provide concrete evidence that our \u03b2-TCVAE learning\nalgorithm satis\ufb01es this property. 
\f[Figure 5 panels show joint factor distributions for configurations A, B, C, and D.]\n\n(a) Different joint distributions of factors.\n\n(b) Distribution of disentanglement scores (MIG).\n\nFigure 5: The \u03b2-TCVAE has a higher chance of obtaining a disentangled representation than \u03b2-VAE,\neven in the presence of sampling bias. (a) All samples have non-zero probability in all joint\ndistributions; the most likely sample is 4 times as likely as the least likely sample.\n\nFigure 4 shows a scatter plot of total correlation and the MIG\ndisentanglement metric for varying values of \u03b2 trained on the dSprites and faces datasets, averaged\nover 40 random initializations. 
For models trained with \u03b2-TCVAE, the correlation between average\nTC and average MIG is strongly negative, while models trained with \u03b2-VAE have a weaker correlation.\nIn general, for the same degree of total correlation, \u03b2-TCVAE creates a more disentangled model.\nThis is also strong evidence for the hypothesis that large values of \u03b2 can be useful as long as the\nindex-code mutual information is not penalized.\n\n6.1 Correlated or Dependent Factors\n\nA notion of disentanglement can exist even when the underlying generative process samples factors\nnon-uniformly and dependently. Many real datasets exhibit this behavior, where some configurations\nof factors are sampled more than others, violating the statistical independence assumption.\nDisentangling the factors of variation in this case corresponds to finding the generative model where\nthe latent factors can independently act and perturb the generated result, even when there is bias in\nthe sampling procedure. In general, we find that \u03b2-TCVAE has no problem in finding the correct\nfactors of variation in a toy dataset and can find more interpretable factors of variation than those\nfound in prior work, even though the independence assumption is violated.\n\nTwo Factors We start off with a toy dataset with only two factors and test \u03b2-TCVAE using\nsampling distributions with varying degrees of correlation and dependence. We take the dataset of\nsynthetic 3D faces and fix all factors other than pose. The joint distributions over factors that we test\nwith are summarized in Figure 5a, which includes varying degrees of sampling bias. Specifically,\nconfiguration A uses uniform and independent factors; B uses factors with non-uniform marginals\nthat are uncorrelated and independent; C uses uncorrelated but dependent factors; and D uses\ncorrelated and dependent factors. 
While it is possible to train a disentangled model in all configurations, the chance of obtaining one is lower overall when sampling bias exists. Across all configurations, we see that β-TCVAE is superior to β-VAE and InfoGAN, with a large difference in median scores for most configurations.

6.1.1 Qualitative Comparisons

We show qualitatively that β-TCVAE discovers more disentangled factors than β-VAE on datasets of chairs [42] and real faces [43].

3D Chairs Figure 6 shows traversals in latent variables that depict an interpretable property in generating 3D chairs. The β-VAE [7] has been shown to be capable of learning the first four properties: azimuth, size, leg style, and backrest. However, the leg style change learned by β-VAE does not seem to be consistent across all chairs. We find that β-TCVAE can learn two additional interpretable properties: the material of the chair, and leg rotation for swivel chairs. These two properties are more subtle and likely require a higher index-code mutual information, so the lower penalization of index-code MI in β-TCVAE helps in finding these
properties.

Figure 6: Learned latent variables using β-VAE (top row) and β-TCVAE (bottom row): (a) azimuth, (b) chair size, (c) leg style, (d) backrest, (e) material, (f) swivel. Traversal range is (-2, 2).

CelebA Figure 1 shows 4 out of 15 attributes discovered by the β-TCVAE without supervision (see Appendix A.3). We traverse up to six standard deviations away from the mean to show the effect of generalizing the represented semantics of each variable. The representation learned by β-VAE is entangled with nuances, which becomes apparent when generalizing to low-probability regions. For instance, it has difficulty rendering complete baldness or narrow face width, whereas the β-TCVAE shows meaningful extrapolation. The extrapolation of the gender attribute of β-TCVAE shows that it focuses more on gender-specific facial features, whereas the β-VAE is entangled with many irrelevant attributes such as face width.
The ability to generalize beyond the first few standard deviations of the prior mean implies that the β-TCVAE model can generate rare samples such as bald or mustached females.

7 Conclusion

We present a decomposition of the ELBO with the goal of explaining why β-VAE works. In particular, we find that a TC penalty in the objective encourages the model to find statistically independent factors in the data distribution. We then designate a special case as β-TCVAE, which can be trained stochastically using a minibatch estimator with no additional hyperparameters compared to the β-VAE. The simplicity of our method allows easy integration into different frameworks [44]. To quantitatively evaluate our approach, we propose a classifier-free disentanglement metric called MIG. This metric benefits from advances in efficient computation of mutual information [23] and enforces compactness in addition to disentanglement. Unsupervised learning of disentangled representations is inherently a difficult problem due to the lack of a prior for semantic awareness, but we show some evidence, on simple datasets with uniform factors, that independence between latent variables can be strongly related to disentanglement.

Acknowledgements

We thank Alireza Makhzani, Yuxing Zhang, and Bowen Xu for initial discussions. We also thank Chatavut Viriyasuthee for pointing out an error in one of our derivations. Ricky would also like to thank Brendan Shillingford for supplying accommodation at a popular conference.

References

[1] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.

[2] Karl Ridgeway. A survey of inductive biases for factorial representation-learning. arXiv preprint arXiv:1612.05299, 2016.

[3] Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations.
arXiv preprint arXiv:1706.01350, 2017.

[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[5] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2017.

[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[7] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations, 2017.

[8] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

[9] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[10] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[12] Will Grathwohl and Aaron Wilson. Disentangling space and time in video with hierarchical variational auto-encoders. arXiv preprint arXiv:1612.04440, 2016.

[13] Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations.
International Conference on Learning Representations, 2018.

[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520, 2011.

[15] Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. International Conference on Learning Representations, 2016.

[16] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.

[17] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. ICLR 2016 Workshop, International Conference on Learning Representations, 2016.

[18] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.

[19] Christopher Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. Learning Disentangled Representations: From Perception to Control Workshop, 2017.

[20] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

[21] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pages 3–10, 1994.

[22] Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[23] Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville.
MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

[24] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.

[25] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[26] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

[27] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.

[28] Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. International Conference on Learning Representations, 2018.

[29] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[30] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Tensor analyzers. In International Conference on Machine Learning, pages 163–171, 2013.

[31] Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.

[32] Greg Ver Steeg and Aram Galstyan. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pages 1004–1012, 2015.

[33] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations.
arXiv preprint arXiv:1711.00848, 2017.

[34] Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822, 2018.

[35] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

[36] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

[37] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.

[38] Christian Jutten and Juha Karhunen. Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation, 2003.

[39] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

[40] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In Advanced Video and Signal Based Surveillance, 2009. AVSS'09. Sixth IEEE International Conference on, pages 296–301. IEEE, 2009.

[41] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. International Conference on Learning Representations, 2017.

[42] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.

[43] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

[44] Jin Xu and Yee Whye Teh. Controllable semantic image inpainting.
arXiv preprint arXiv:1806.05953, 2018.