{"title": "Leveraging the Exact Likelihood of Deep Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3855, "page_last": 3866, "abstract": "Deep latent variable models (DLVMs) combine the approximation abilities of deep neural networks and the statistical foundations of generative models. Variational methods are commonly used for inference; however, the exact likelihood of these models has been largely overlooked. The purpose of this work is to study the general properties of this quantity and to show how they can be leveraged in practice. We focus on important inferential problems that rely on the likelihood: estimation and missing data imputation. First, we investigate maximum likelihood estimation for DLVMs: in particular, we show that most unconstrained models used for continuous data have an unbounded likelihood function. This problematic behaviour is demonstrated to be a source of mode collapse. We also show how to ensure the existence of maximum likelihood estimates, and draw useful connections with nonparametric mixture models. Finally, we describe an algorithm for missing data imputation using the exact conditional likelihood of a DLVM. On several data sets, our algorithm consistently and significantly outperforms the usual imputation scheme used for DLVMs.", "full_text": "Leveraging the Exact Likelihood of Deep Latent\n\nVariable Models\n\nPierre-Alexandre Mattei\n\nDepartment of Computer Science\n\nIT University of Copenhagen\n\npima@itu.dk\n\nJes Frellsen\n\nDepartment of Computer Science\n\nIT University of Copenhagen\n\njefr@itu.dk\n\nAbstract\n\nDeep latent variable models (DLVMs) combine the approximation abilities of deep\nneural networks and the statistical foundations of generative models. Variational\nmethods are commonly used for inference; however, the exact likelihood of these\nmodels has been largely overlooked. 
The purpose of this work is to study the\ngeneral properties of this quantity and to show how they can be leveraged in\npractice. We focus on important inferential problems that rely on the likelihood:\nestimation and missing data imputation. First, we investigate maximum likelihood\nestimation for DLVMs: in particular, we show that most unconstrained models\nused for continuous data have an unbounded likelihood function. This problematic\nbehaviour is demonstrated to be a source of mode collapse. We also show how to\nensure the existence of maximum likelihood estimates, and draw useful connections\nwith nonparametric mixture models. Finally, we describe an algorithm for missing\ndata imputation using the exact conditional likelihood of a DLVM. On several data\nsets, our algorithm consistently and signi\ufb01cantly outperforms the usual imputation\nscheme used for DLVMs.\n\n1\n\nIntroduction\n\nDimension reduction aims at summarizing multivariate data using a small number of features that\nconstitute a code. Earliest attempts rested on linear projections, leading to Hotelling\u2019s (1933) principal\ncomponent analysis (PCA) that has been vastly explored and perfected over the last century (Jolliffe\nand Cadima, 2016). In recent years, the \ufb01eld has been vividly animated by the successes of latent\nvariable models that probabilistically use the low-dimensional features to de\ufb01ne powerful generative\nmodels. Usually, these latent variable models transform the random code into parameters of a simple\ndistribution. Linear mappings were initially considered, giving rise to factor analysis (Bartholomew\net al., 2011) and probabilistic principal component analysis (Tipping and Bishop, 1999). In recent\nyears, much work has been done regarding nonlinear mappings parametrised by deep neural networks,\nfollowing the seminal papers of Rezende et al. (2014) and Kingma and Welling (2014). 
These models have led to impressive empirical performance in unsupervised or semi-supervised generative modelling of images (Siddharth et al., 2017), molecular structures (Kusner et al., 2017; G\u00f3mez-Bombarelli et al., 2018), arithmetic expressions (Kusner et al., 2017), and single-cell gene expression data (Gr\u00f8nbech et al., 2018). This paper is an investigation of the statistical properties of these models, which remain essentially unknown.\n\n1.1 Deep latent variable models\n\nIn their most common form, deep latent variable models (DLVMs) assume that we are in the presence of a data matrix X = (x1, ..., xn)^T \u2208 X^n that we wish to explain using some latent variables Z = (z1, ..., zn)^T \u2208 R^{n\u00d7d}. We assume that (xi, zi)_{i\u2264n} are independent and identically distributed (i.i.d.) random variables driven by the following generative model:\n\nz \u223c p(z), p\u03b8(x|z) = \u03a6(x|f\u03b8(z)).\n\n(1)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe unobserved random vector z \u2208 R^d is called the latent variable and usually follows marginally a simple distribution p(z) called the prior distribution. The dimension d of the latent space is called the intrinsic dimension\u2014and is usually smaller than the dimensionality of the data. The collection (\u03a6(\u00b7|\u03b7))_{\u03b7\u2208H} is a parametric family of densities with respect to a dominating measure (usually the Lebesgue or the counting measure) called the observation model. The function f\u03b8 : R^d \u2192 H is called a decoder or a generative network, and is parametrised by a (deep) neural network whose weights are stored in \u03b8 \u2208 \u0398. 
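As a concrete illustration, the two-step generative process (1) can be sketched in a few lines of Python. The one-hidden-layer decoder and its random weights below are hypothetical stand-ins for a trained f\u03b8; we assume a Bernoulli observation model and a standard Gaussian prior:

```python
import math
import random

def sample_dlvm(d=2, p=5, h=8, seed=0):
    """Ancestral sampling from the generative model (1):
    z ~ N(0, I_d), then x ~ Bernoulli(f_theta(z)) coordinate-wise.
    The decoder is a one-hidden-layer MLP with hypothetical random
    weights; a trained DLVM would use learned weights instead."""
    rng = random.Random(seed)
    # hypothetical decoder weights theta = (W, V)
    W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(h)]
    V = [[rng.gauss(0, 1) for _ in range(h)] for _ in range(p)]
    # 1. draw the latent code from the standard Gaussian prior p(z)
    z = [rng.gauss(0, 1) for _ in range(d)]
    # 2. decode z into Bernoulli parameters eta = f_theta(z) in (0, 1)^p
    hidden = [math.tanh(sum(W[j][k] * z[k] for k in range(d))) for j in range(h)]
    eta = [1 / (1 + math.exp(-sum(V[i][j] * hidden[j] for j in range(h))))
           for i in range(p)]
    # 3. draw the observation x | z from the observation model Phi(.|eta)
    x = [1 if rng.random() < eta[i] else 0 for i in range(p)]
    return z, eta, x

z, eta, x = sample_dlvm()
```

Ancestral sampling of this kind is all that is needed to generate new data from a trained DLVM.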
The latent structure of these DLVMs leads to the following marginal distribution of the data:\n\np\u03b8(x) = \u222b_{R^d} p\u03b8(x|z)p(z)dz = \u222b_{R^d} \u03a6(x|f\u03b8(z))p(z)dz.\n\n(2)\n\nThis parametrisation allows us to leverage recent advances in deep architectures, such as deep residual networks (Kingma et al., 2016), recurrent networks (Bowman et al., 2016; G\u00f3mez-Bombarelli et al., 2018), or batch normalisation (S\u00f8nderby et al., 2016).\n\nSeveral observation models have been considered: in the case of discrete multivariate data, products of Bernoulli (or multinomial) distributions; multivariate Gaussian distributions for continuous data; products of Poisson distributions for multivariate count data (Gr\u00f8nbech et al., 2018). Several specific proposals for image data have been made, like the discretised logistic mixture of Salimans et al. (2017). Dirac observation models correspond to deterministic decoders, which are used e.g. within generative adversarial networks (Goodfellow et al., 2014) or non-volume preserving transformations (Dinh et al., 2017). Introduced by both Kingma and Welling (2014) and Rezende et al. (2014), the Gaussian and Bernoulli families are the most widely studied, and will be the focus of this article.\n\n1.2 Scalable learning through amortised variational inference\n\nThe log-likelihood function of a DLVM is, for all \u03b8 \u2208 \u0398,\n\n\u2113(\u03b8) = log p\u03b8(X) = \u2211_{i=1}^{n} log p\u03b8(xi),\n\n(3)\n\nwhich is an extremely challenging quantity to compute, as it involves potentially high-dimensional integrals. Estimating \u03b8 by maximum likelihood therefore appears out of reach. Consequently, following Rezende et al. (2014) and Kingma and Welling (2014), inference in DLVMs is usually performed using amortised variational inference. 
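To see why (3) is hard, note that a naive Monte Carlo estimate of log p\u03b8(x), obtained by sampling codes from the prior, is possible but suffers from high variance as the intrinsic dimension grows. A minimal sketch for a Bernoulli observation model (the decoder argument is a hypothetical stand-in for a trained network):

```python
import math
import random

def log_mean_exp(vals):
    """Numerically stable log of the mean of exp(vals)."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))

def mc_log_likelihood(x, decoder, d, n_samples=1000, seed=1):
    """Naive Monte Carlo estimate of
    log p_theta(x) = log E_{z ~ p(z)}[Phi(x | f_theta(z))]
    for a product-of-Bernoullis observation model. Illustrative only:
    the variance of this estimator grows quickly with d, which is one
    reason VAEs optimise the ELBO instead."""
    rng = random.Random(seed)
    log_ws = []
    for _ in range(n_samples):
        z = [rng.gauss(0, 1) for _ in range(d)]  # z ~ p(z)
        eta = decoder(z)                         # eta = f_theta(z)
        # log Phi(x | eta) for a product of Bernoullis
        log_ws.append(sum(math.log(e if xi else 1 - e)
                          for xi, e in zip(x, eta)))
    return log_mean_exp(log_ws)

# sanity check with a constant decoder eta = (1/2, 1/2, 1/2), for which
# log p(x) = 3 * log(1/2) exactly, whatever the draws
est = mc_log_likelihood([1, 0, 1], lambda z: [0.5, 0.5, 0.5], d=2)
```

For a non-constant decoder the estimate fluctuates around the true value, and the number of samples needed explodes with d.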
Variational inference approximately maximises the log-likelihood by maximising a lower bound known as the evidence lower bound (ELBO, see e.g. Blei et al., 2017):\n\nELBO(\u03b8, q) = E_{Z\u223cq}[log(p(X, Z)/q(Z))] = \u2113(\u03b8) \u2212 KL(q||p(\u00b7|X)) \u2264 \u2113(\u03b8),\n\n(4)\n\nwhere the variational distribution q is a distribution over the space of codes R^{n\u00d7d}. The variational distribution plays the role of a tractable approximation of the posterior distribution of the codes; when this approximation is perfectly accurate, the ELBO is equal to the log-likelihood. Amortised inference builds q using a neural network called the inference network g\u03b3 : X \u2192 K, whose weights are stored in \u03b3 \u2208 \u0393:\n\nq\u03b3,X(Z) = q\u03b3,X(z1, ..., zn) = \u220f_{i=1}^{n} \u03a8(zi|g\u03b3(xi)),\n\n(5)\n\nwhere (\u03a8(\u00b7|\u03ba))_{\u03ba\u2208K} is a parametric family of distributions over R^d\u2014such as Gaussians with diagonal covariances (Kingma and Welling, 2014). Other kinds of families\u2014built using e.g. normalising flows (Rezende and Mohamed, 2015; Kingma et al., 2016), auxiliary variables (Maal\u00f8e et al., 2016; Ranganath et al., 2016), or importance weights (Burda et al., 2016; Cremer et al., 2017)\u2014have been considered for amortised inference, but they will not be the central focus of this paper. Variational inference for DLVMs then solves the optimisation problem max_{\u03b8\u2208\u0398,\u03b3\u2208\u0393} ELBO(\u03b8, q\u03b3,X) using variants of stochastic gradient ascent (see e.g. Roeder et al., 2017, for strategies for computing gradient estimates of the ELBO).\n\nAs emphasised by Kingma and Welling (2014), the ELBO resembles the objective function of a popular deep learning model called an autoencoder (see e.g. 
Goodfellow et al., 2016, Chapter 14). This motivates the popular denomination of encoder for the inference network g\u03b3 and variational autoencoder (VAE) for the combination of a DLVM with amortised variational inference.\n\nContributions. In this work, we revisit DLVMs by asking: Is it possible to leverage the properties of p\u03b8(x) to understand and improve deep generative modelling? Our main contributions are:\n\u2022 We show that maximum likelihood is ill-posed for continuous DLVMs and well-posed for discrete ones. We link this undesirable property of continuous DLVMs to the mode collapse phenomenon, and illustrate it on a real data set.\n\u2022 We draw a connection between DLVMs and nonparametric statistics, and show that DLVMs can be seen as parsimonious submodels of nonparametric mixture models.\n\u2022 We leverage this connection to provide a way of finding an upper bound of the likelihood based on finite mixtures. Combined with the ELBO, this bound allows us to provide useful \u201csandwichings\u201d of the exact likelihood. We also prove that this bound characterises the large capacity behaviour of DLVMs.\n\u2022 When dealing with missing data, we show how a simple modification of an approximate scheme proposed by Rezende et al. (2014) allows us to draw according to the exact conditional distribution of the missing data. On several data sets and missing data scenarios, our algorithm consistently outperforms the one of Rezende et al. 
(2014), while having the same computational cost.\n\n2 Is maximum likelihood well-defined for deep latent variable models?\n\nIn this section, we investigate the properties of maximum likelihood estimation for DLVMs with Gaussian and Bernoulli observation models.\n\n2.1 On the boundedness of the likelihood of deep latent variable models\n\nDeep generative models with Gaussian observation models assume that the data space is X = R^p, and that the observation model is the family of p-variate full-rank Gaussian distributions. The conditional distribution of each data point is consequently\n\np\u03b8(x|z) = N(x|\u00b5\u03b8(z), \u03a3\u03b8(z)),\n\n(6)\n\nwhere \u00b5\u03b8 : R^d \u2192 R^p and \u03a3\u03b8 : R^d \u2192 S^{++}_p are two continuous functions parametrised by neural networks whose weights are stored in a parameter \u03b8. These two functions constitute the decoder of the model. This leads to the log-likelihood\n\n\u2113(\u03b8) = \u2211_{i=1}^{n} log( \u222b_{R^d} N(xi|\u00b5\u03b8(z), \u03a3\u03b8(z))p(z)dz ).\n\n(7)\n\nThis model can be seen as a special case of an infinite mixture of Gaussian distributions. However, it is well-known that maximum likelihood is ill-posed for finite Gaussian mixtures (see e.g. Le Cam, 1990). Here, by \u201cill-posed\u201d, we mean that, inside the parameter space, there exists no maximiser of the likelihood function, which corresponds to the first condition given by Tikhonov and Arsenin (1977, p. 7). This happens because the likelihood function is unbounded above. Moreover, the infinite maxima of the likelihood happen to be very poor generative models, whose densities collapse around some of the data points. 
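The unboundedness of the likelihood of finite Gaussian mixtures can be checked numerically: pinning one component on a data point and shrinking its variance makes the log-likelihood diverge, while a broad second component keeps every other term finite. A minimal one-dimensional sketch (the data values are illustrative):

```python
import math

def gauss_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def mixture_loglik(data, sigma):
    """Log-likelihood of the two-component mixture
    0.5 * N(data[0], sigma^2) + 0.5 * N(0, 1).
    As sigma -> 0, the degenerate component blows up at data[0] while the
    broad component keeps every other term bounded below, so the total
    log-likelihood diverges to +infinity."""
    total = 0.0
    for x in data:
        a = math.log(0.5) + gauss_logpdf(x, data[0], sigma)
        b = math.log(0.5) + gauss_logpdf(x, 0.0, 1.0)
        # stable log(e^a + e^b)
        total += max(a, b) + math.log1p(math.exp(-abs(a - b)))
    return total

data = [1.3, -0.4, 0.2, 2.1]  # illustrative sample
# the log-likelihood keeps growing as sigma shrinks by a factor 10 each time
liks = [mixture_loglik(data, 10.0 ** -k) for k in range(1, 6)]
```

Each tenfold shrinkage of sigma adds roughly log(10) to the blown-up term, so `liks` increases without bound: exactly the pathology discussed above, and the one Theorem 1 below reproduces inside a DLVM decoder.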
This problematic behaviour of a model quite similar to DLVMs motivates the question: is the likelihood function of DLVMs bounded above?\n\nIn this section, we will not make any particular parametric assumption about the prior distribution of the latent variable z. While Kingma and Welling (2014) and Rezende et al. (2014) originally proposed to use isotropic Gaussian distributions, more complex learnable priors have also been proposed (e.g. Tomczak and Welling, 2018). We simply make the natural assumptions that z is continuous and has zero mean. Many different neural architectures have been explored regarding the parametrisation of the decoder. For example, Kingma and Welling (2014) consider multilayer perceptrons (MLPs) of the form\n\n\u00b5\u03b8(z) = V tanh(Wz + a) + b, \u03a3\u03b8(z) = Diag(exp(\u03b1 tanh(Wz + a) + \u03b2)),\n\n(8)\n\nwhere \u03b8 = (W, a, V, b, \u03b1, \u03b2). The weights of the decoder are W \u2208 R^{h\u00d7d}, a \u2208 R^h, V, \u03b1 \u2208 R^{p\u00d7h}, and b, \u03b2 \u2208 R^p. The integer h \u2208 N* is the (common) number of hidden units of the MLPs. Much more complex parametrisations exist, but we will see that this one, arguably one of the most rigid, is already too flexible for maximum likelihood. Actually, we will show that an even much less flexible family of MLPs with a single hidden unit is problematic and leads the model to collapse around a data point. Let w \u2208 R^d and let (\u03b1k)_{k\u22651} be a sequence of nonnegative real numbers such that \u03b1k \u2192 +\u221e as k \u2192 +\u221e. Let us consider i* \u2208 {1, ..., n}: this arbitrary index will represent the observation around which the model will collapse. Using the parametrisation (8), we consider the sequences of parameters \u03b8_k^{(i*,w)} = (\u03b1k w^T, 0, 0_p, x_{i*}, \u03b1k 1_p, \u2212\u03b1k 1_p). 
This leads to the simplified decoders:\n\n\u00b5_{\u03b8_k^{(i*,w)}}(z) = x_{i*}, \u03a3_{\u03b8_k^{(i*,w)}}(z) = exp(\u03b1k tanh(\u03b1k w^T z) \u2212 \u03b1k) I_p.\n\n(9)\n\nAs shown by the next theorem, these sequences of decoders lead to the divergence of the log-likelihood function.\n\nTheorem 1. For all i* \u2208 {1, ..., n} and w \u2208 R^d \\ {0}, we have lim_{k\u2192+\u221e} \u2113(\u03b8_k^{(i*,w)}) = +\u221e.\n\nA detailed proof is provided in Appendix A (all appendices of this paper are available as supplementary material). Its cornerstone is the fact that the sequence of functions \u03a3_{\u03b8_k^{(i*,w)}} converges to a function that outputs both singular and nonsingular covariances, leading to the explosion of log p_{\u03b8_k^{(i*,w)}}(x_{i*}) while all other terms of the log-likelihood remain bounded below by a constant.\n\nUsing simple MLP-based parametrisations such as the one of Kingma and Welling (2014) therefore brings about an unbounded log-likelihood function. A natural question that follows is: do these infinite suprema lead to useful generative models? The answer is no. Actually, none of the functions considered in Theorem 1 are particularly useful, because of the use of a constant mean function. This is formalised in the next proposition, which exhibits a strong link between likelihood blow-up and the mode collapse phenomenon.\n\nProposition 1. For all k \u2208 N*, i* \u2208 {1, ..., n}, and w \u2208 R^d \\ {0}, the distribution p_{\u03b8_k^{(i*,w)}} is spherically symmetric and unimodal around x_{i*}.\n\nA proof is provided in Appendix B. This is a direct consequence of the constant mean function. The spherical symmetry implies that the distributions of these \u201coptimal\u201d deep generative models will lead to uncorrelated variables, and the unimodality will lead to poor sample diversity. 
This behaviour is symptomatic of mode collapse, which remains one of the most challenging drawbacks of generative modelling (Arora et al., 2018). While mode collapse has been extensively investigated for adversarial training (e.g. Arora et al., 2018; Lucas et al., 2018), this phenomenon is also known to affect VAEs (Richardson and Weiss, 2018).\n\nUnregularised gradient-based optimisation of a tight lower bound of this unbounded likelihood is therefore likely to follow these (uncountably many) paths to blow-up. This gives a theoretical foundation to the necessary regularisation of VAEs that was already noted by Rezende et al. (2014) and Kingma and Welling (2014). For example, using weight decay as in Kingma and Welling (2014) is likely to help avoid these infinite maxima. This difficulty in learning the variance was also experimentally noticed by Takahashi et al. (2018), and may explain the choice made by several authors to use a constant variance function \u03a3(z) = \u03c30 I_p, where \u03c30 can be either fixed (Zhao et al., 2017) or learned via approximate maximum likelihood (Pu et al., 2016). Dai et al. (2018) independently showed that the VAE objective is also unbounded in the case where such a constant variance function is combined with a nonparametric mean function. An interesting feature of our result is that it only involves a decoder of very low capacity.\n\nTackling the unboundedness of the likelihood. Let us go back to a parametrisation which is not necessarily MLP-based. Even in this general context, it is possible to tackle the unboundedness of the likelihood using additional constraints on \u03a3\u03b8. Specifically, for each \u03be \u2265 0, we will consider the set S^\u03be_p = {A \u2208 S^+_p | min(Sp A) \u2265 \u03be}, where, for all A \u2208 S^+_p, Sp A denotes the spectrum of A. Note that S^0_p = S^+_p. 
This simple spectral constraint allows us to end up with a bounded likelihood.\n\nProposition 2. Let \u03be > 0. If the parametrisation of the decoder is such that the image of \u03a3\u03b8 is included in S^\u03be_p for all \u03b8, then the log-likelihood function is upper bounded by \u2212np log \u221a(2\u03c0\u03be).\n\nProof. For all i \u2208 {1, ..., n}, we have p(xi|\u00b5\u03b8, \u03a3\u03b8) \u2264 (2\u03c0\u03be)^{\u2212p/2}, using the fact that the determinant of \u03a3\u03b8(z) is lower bounded by \u03be^p for all z \u2208 R^d and that the exponential of a negative number is smaller than one. Therefore, the likelihood function is bounded above by 1/(2\u03c0\u03be)^{np/2}.\n\nSimilar constraints have been proposed to solve the ill-posedness of maximum likelihood for finite Gaussian mixtures (e.g. Hathaway, 1985; Biernacki and Castellan, 2011). In practice, implementing such constraints can be easily done by adding a constant diagonal matrix to the output of the covariance decoder.\n\nWhat about other parametrisations? We chose a specific and natural parametrisation in order to obtain a constructive proof of the unboundedness of the likelihood. However, virtually any other deep parametrisation that does not include covariance constraints will be affected by our result, because of the universal approximation abilities of neural networks (see e.g. Goodfellow et al., 2016, Section 6.4.1).\n\nBernoulli DLVMs do not suffer from unbounded likelihood. When X = {0, 1}^p, Bernoulli DLVMs assume that (\u03a6(\u00b7|\u03b7))_{\u03b7\u2208H} is the family of p-variate multivariate Bernoulli distributions (i.e. the family of products of p univariate Bernoulli distributions). In this case, maximum likelihood is well-posed.\n\nProposition 3. Given any possible parametrisation, the log-likelihood function of a deep latent model with a Bernoulli observation model is everywhere negative.\n\nProof. 
This directly follows from the fact that the Bernoulli density is always smaller than one.\n\n2.2 Towards data-dependent likelihood upper bounds\n\nWe have determined under which conditions maximum likelihood estimates exist, and have computed simple upper bounds on the likelihood functions. Since they do not depend on the data, these bounds are likely to be very loose. A natural follow-up issue is to seek tighter, data-dependent upper bounds that remain easily computable. Such bounds are desirable because, combined with ELBOs, they would allow sandwiching the likelihood between two bounds.\n\nTo study this problem, let us take a step backwards and consider a more general infinite mixture model. Precisely, given any distribution G over the generic parameter space H, we define the nonparametric mixture model (see e.g. Lindsay, 1995, Chapter 1) as:\n\np_G(x) = \u222b_H \u03a6(x|\u03b7)dG(\u03b7).\n\n(10)\n\nNote that there are many ways for a mixture model to be nonparametric (e.g. having some nonparametric components, an infinite but countable number of components, or an uncountable number of components). In this case, this comes from the fact that the model parameter is the mixing distribution G, which belongs to the set P of all probability measures over H. The log-likelihood of any G \u2208 P is given by \u2113(G) = \u2211_{i=1}^{n} log p_G(xi).\n\nWhen G has a finite support of cardinal k \u2208 N*, p_G is a finite mixture model with k components. When the mixing distribution G is generatively defined by the distribution of a random variable \u03b7 such that z \u223c p(z), \u03b7 = f\u03b8(z), we exactly end up with a deep generative model with decoder f\u03b8. Therefore, the nonparametric mixture is a more general model than the DLVM. The fact that the mixing distribution of a DLVM is intrinsically low-dimensional leads us to interpret the DLVM as a parsimonious submodel of the nonparametric mixture model. 
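The finite mixtures that instantiate this general model can be fitted in practice with the EM algorithm, which is one way to approach the data-dependent likelihood bounds discussed in the text. A minimal sketch for multivariate Bernoulli components (the initialisation and toy data are illustrative, and EM is only guaranteed to reach a local optimum):

```python
import math
import random

def bernoulli_mixture_em(X, k, n_iter=50, seed=0):
    """Tiny EM for a finite mixture of k multivariate Bernoullis, i.e. a
    mixing distribution G with finite support of cardinal k. Returns the
    log-likelihood l(G) of the fitted finite mixture."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    eps = 1e-6  # clipping keeps all logs well-defined
    pi = [1.0 / k] * k  # mixing weights
    eta = [[min(max(rng.random(), eps), 1 - eps) for _ in range(p)]
           for _ in range(k)]  # Bernoulli parameters of each component
    ll = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibilities r_ij proportional to pi_j * Phi(x_i | eta_j)
        R, ll = [], 0.0
        for x in X:
            logw = [math.log(pi[j]) + sum(
                        math.log(eta[j][t] if x[t] else 1 - eta[j][t])
                        for t in range(p))
                    for j in range(k)]
            m = max(logw)
            s = sum(math.exp(w - m) for w in logw)
            ll += m + math.log(s)  # log p_G(x_i), accumulated into l(G)
            R.append([math.exp(w - m) / s for w in logw])
        # M-step: update mixing weights and Bernoulli parameters
        for j in range(k):
            nj = sum(R[i][j] for i in range(n))
            pi[j] = max(nj / n, eps)
            eta[j] = [min(max(sum(R[i][j] * X[i][t] for i in range(n)) / (nj + eps),
                              eps), 1 - eps)
                      for t in range(p)]
    return ll

# toy binary data; by Proposition 3, l(G) is always negative here
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
ll = bernoulli_mixture_em(X, k=2, n_iter=20)
```

With k no larger than n, such a fitted mixture provides a computable stand-in for the nonparametric maximum likelihood appearing in the bounds below.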
This also gives us an immediate upper bound on the likelihood of any decoder f\u03b8: \u2113(\u03b8) \u2264 max_{G\u2208P} \u2113(G).\n\nOf course, in many cases, this upper bound will be infinite (for example in the case of an unconstrained Gaussian observation model). However, under the conditions of boundedness of the likelihood of deep Gaussian models, the bound is finite and attained for a finite mixture model with no more components than data points.\n\nTheorem 2. Assume that (\u03a6(\u00b7|\u03b7))_{\u03b7\u2208H} is the family of multivariate Bernoulli distributions or the family of Gaussian distributions with the spectral constraint of Proposition 2. The likelihood of the corresponding nonparametric mixture model is maximised for a finite mixture model of k \u2264 n distributions from the family (\u03a6(\u00b7|\u03b7))_{\u03b7\u2208H}.\n\nA detailed proof is provided in Appendix C. The main tool of the proof of this rather surprising result is Lindsay's (1983) geometric analysis of the likelihood of nonparametric mixtures, based on Minkowski's theorem. Specifically, Lindsay's (1983) Theorem 3.1 ensures that, when the trace of the likelihood curve is compact, the likelihood function is maximised for a finite mixture. For the Bernoulli case, compactness of the curve is immediate; for the Gaussian case, we use a compactification argument inspired by van der Vaart and Wellner (1992).\n\nAssume now that the conditions of Theorem 2 are satisfied. Let us denote a maximum likelihood estimate of G as \u02c6G. For all \u03b8, we therefore have\n\n\u2113(\u03b8) \u2264 \u2113(\u02c6G),\n\n(11)\n\nwhich gives an upper bound on the likelihood. We call the difference \u2113(\u02c6G) \u2212 \u2113(\u03b8) the parsimony gap (see Fig. 1). By sandwiching the exact likelihood between this bound and an ELBO, we can also have guarantees on how far a posterior approximation q is from the true posterior:\n\nKL(q||p(\u00b7|X)) \u2264 \u2113(\u02c6G) \u2212 ELBO(\u03b8, q).\n\n(12)\n\nFigure 1: The parsimony gap represents the amount of likelihood lost due to the architecture of the decoder. The approximation gap expresses how far the posterior is from the variational family, and the amortisation gap appears due to the limited capacity of the encoder (Cremer et al., 2018).\n\nNote that finding upper bounds of the likelihood of latent variable models is usually harder than finding lower bounds (Grosse et al., 2015; Dieng et al., 2017). From a computational perspective, the estimate \u02c6G can be found using the expectation-maximisation algorithm for finite mixtures (Dempster et al., 1977)\u2014although it is only guaranteed to find a local optimum. Some strategies guaranteed to find a global optimum have also been developed (e.g. Lindsay, 1995, Chapter 6, or Wang, 2007).\n\nNow that computationally approachable upper bounds have been derived, the question remains whether or not these bounds can be tight. Actually, as shown by the next theorem, tightness of the parsimony gap occurs when the decoder has universal approximation abilities. In other words, the nonparametric upper bound characterises the large capacity limit of the decoder.\n\nTheorem 3 (Tightness of the parsimony gap). Assume that\n1. (\u03a6(\u00b7|\u03b7))_{\u03b7\u2208H} is the family of multivariate Bernoulli distributions or the family of Gaussian distributions with the spectral constraint of Proposition 2.\n2. The decoder has universal approximation abilities: for any compact C \u2282 R^d and continuous function f : C \u2192 H, for all \u03b5 > 0, there exists \u03b8 such that ||f \u2212 f\u03b8||_\u221e < \u03b5.\nThen, for all \u03b5 > 0, there exists \u03b8 \u2208 \u0398 such that \u2113(\u02c6G) \u2265 \u2113(\u03b8) \u2265 \u2113(\u02c6G) \u2212 \u03b5.\n\nA detailed proof is provided in Appendix D. The main idea is to split the code space into a compact set made of several parts that will represent the mixture components, and an unbounded set of very small prior mass. The universal approximation property is finally used for this compact set.\n\nThe universal approximation condition is satisfied for example by MLPs with nonpolynomial activations (Leshno et al., 1993). Combined with the work of Cremer et al. (2018), who studied the large capacity limit of the encoder, this result describes the general behaviour of a VAE in the large capacity limit (see Fig. 1). Note finally that Rezende and Viola (2018) analysed the large capacity behaviour of the VAE objective, and also found connections with finite mixtures.\n\n3 Missing data imputation using the exact conditional likelihood\n\nIn this section, we assume that a variational autoencoder has been trained, and that some data is missing at test time. The decoder/encoder couple obtained after training is denoted by f\u03b8 and g\u03b3. Let x \u2208 X be a new data point that consists of some observed features xobs and missing data xmiss. 
Since we have a probabilistic model p\u03b8 of the data, an ideal way of imputing xmiss would be to generate some data according to the conditional distribution\n\np\u03b8(xmiss|xobs) = \u222b_{R^d} p\u03b8(xmiss|xobs, z)p(z|xobs)dz.\n\n(13)\n\nAgain, this distribution appears out of reach because of the integration of the latent variable z. However, it is reasonable to assume that, for all \u03b7, it is easy to sample from the marginals of \u03a6(\u00b7|\u03b7). This is for instance the case for Gaussian observation models and factorised observation models (like products of Bernoulli or Poisson distributions). A direct consequence of this assumption is that, for all z, it is easy to sample from p\u03b8(xmiss|xobs, z). Under this simple assumption, we will see that generating data according to the conditional distribution is actually (asymptotically) possible.\n\n3.1 Pseudo-Gibbs sampling\n\nRezende et al. (2014) proposed a simple way of imputing xmiss by following a Markov chain (zt, \u02c6xmiss_t)_{t\u22651} (initialised by randomly imputing the missing data with \u02c6xmiss_0). For all t \u2265 1, the chain alternately generates zt \u223c \u03a8(z|g\u03b3(xobs, \u02c6xmiss_{t\u22121})) and \u02c6xmiss_t \u223c p\u03b8(xmiss|xobs, zt) until convergence. This scheme closely resembles Gibbs sampling (Geman and Geman, 1984), and actually exactly coincides with Gibbs sampling when the amortised variational distribution \u03a8(z|g\u03b3(xobs, \u02c6xmiss)) is equal to the true posterior distribution p\u03b8(z|xobs, \u02c6xmiss) for all possible \u02c6xmiss. Following the terminology of Heckerman et al. (2000), we will call this algorithm pseudo-Gibbs sampling. Very similar schemes have been proposed for more general autoencoder settings (Goodfellow et al., 2016, Section 20.11). Because of its flexibility, this pseudo-Gibbs approach is routinely used for missing data imputation using DLVMs (see e.g. 
Li et al., 2016; Rezende et al., 2016; Du et al., 2018). Rezende et al. (2014, Proposition F.1) proved that, when these two distributions are close in some sense, pseudo-Gibbs sampling generates points that approximately follow the conditional distribution p\u03b8(xmiss|xobs). Actually, we will see that a simple modification of this scheme allows us to generate exactly according to the conditional distribution.\n\n3.2 Metropolis-within-Gibbs sampling\n\nAt each step of the chain, rather than generating codes according to the approximate posterior distribution, we may use this approximation as a proposal within a Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), using the fact that we have access to the unnormalised posterior density of the latent codes.\n\nSpecifically, at each step, we will generate a new code \u02dczt as a proposal using the approximate posterior \u03a8(z|g\u03b3(xobs, \u02c6xmiss_{t\u22121})). This proposal is kept as a valid code with acceptance probability \u03c1t, defined in Algorithm 1. This probability corresponds to a ratio of importance ratios, and is equal to one when the posterior approximation is perfect. This code-generating scheme exactly corresponds to performing a single iteration of an independent Metropolis-Hastings algorithm. With the obtained code zt, we can now generate a new imputation using the exact conditional \u03a6(xmiss|xobs, f\u03b8(zt)). The obtained algorithm, detailed in Algorithm 1, is a particular instance of a Metropolis-within-Gibbs algorithm.\n\nAlgorithm 1 Metropolis-within-Gibbs sampler for missing data imputation using a trained VAE\nInputs: Observed data xobs, trained VAE (f\u03b8, g\u03b3), number of iterations T\nOutputs: Markov chain of imputations \u02c6xmiss_1, ..., \u02c6xmiss_T\nInitialise (z0, \u02c6xmiss_0)\nfor t = 1 to T do\n  \u02dczt \u223c \u03a8(z|g\u03b3(xobs, \u02c6xmiss_{t\u22121}))\n  \u02dc\u03c1t = [\u03a6(xobs, \u02c6xmiss_{t\u22121}|f\u03b8(\u02dczt)) p(\u02dczt) \u03a8(z_{t\u22121}|g\u03b3(xobs, \u02c6xmiss_{t\u22121}))] / [\u03a6(xobs, \u02c6xmiss_{t\u22121}|f\u03b8(z_{t\u22121})) p(z_{t\u22121}) \u03a8(\u02dczt|g\u03b3(xobs, \u02c6xmiss_{t\u22121}))]\n  \u03c1t = min{\u02dc\u03c1t, 1}\n  zt = \u02dczt with probability \u03c1t, zt = z_{t\u22121} with probability 1 \u2212 \u03c1t\n  \u02c6xmiss_t \u223c p\u03b8(xmiss|xobs, zt)\nend for\n\nActually, it exactly corresponds to the algorithm described by Gelman (1993, Section 4.4), and is ensured to asymptotically produce samples from the true conditional distribution p\u03b8(xmiss|xobs), even if the variational approximation is imperfect. Note that when the variational approximation is perfect, all proposals are accepted and the algorithm exactly reduces to Gibbs sampling.\n\nThe theoretical superiority of the Metropolis-within-Gibbs scheme compared to the pseudo-Gibbs sampler comes with almost no additional computational cost. Indeed, all the quantities needed to compute the acceptance probability also need to be computed within the pseudo-Gibbs scheme\u2014except for prior evaluations, which are assumed to be computationally negligible. However, a poor initialisation of the missing values might lead to a lot of rejections at the beginning of the chain, and to slow convergence. A good initialisation heuristic is to perform a few pseudo-Gibbs iterations at first in order to begin with a sensible imputation. Note also that, similarly to the pseudo-Gibbs sampler, our Metropolis-within-Gibbs scheme can be extended to many other variational approximations\u2014like normalising flows (Rezende and Mohamed, 2015; Kingma et al., 2016)\u2014in a straightforward manner.\n\n4 Empirical results\n\nIn this section, we investigate the empirical realisations of our theoretical findings on DLVMs. 
For architecture and implementation details, see Appendix E (in the supplementary material).

4.1 Witnessing likelihood blow-up

To investigate whether the unboundedness of the likelihood of a DLVM with a Gaussian observation model has concrete consequences for variational inference, we train two DLVMs on the Frey faces data set: one with no constraints, and one with the constraint of Proposition 2 (with ξ = 2^{-4}). The results are presented in Fig. 2. One can notice that the unconstrained DLVM finds models with very high likelihood but very poor generalisation performance. This confirms that the unboundedness of the likelihood is not a merely theoretical concern. We also display the two upper bounds of the likelihood. The nonparametric bound offers a slight but significant improvement over the naive upper bound. On this example, using the nonparametric upper bound as an early stopping criterion for the unconstrained ELBO appears to provide a good regularisation scheme, which performs better than the covariance constraints on this data set. This illustrates the potential practical usefulness of the connection that we drew between DLVMs and nonparametric mixtures.

Figure 2: Likelihood blow-up for the Frey Faces data. The unconstrained ELBO appears to diverge, while finding increasingly poor models.

4.2 Comparing the pseudo-Gibbs and Metropolis-within-Gibbs samplers

We compare the two samplers for single imputation of the test sets of three data sets: Caltech 101 Silhouettes and statically binarised versions of MNIST and OMNIGLOT. We consider two missing data scenarios: one with pixels missing uniformly at random (the fractions of missing data considered are 40%, 50%, 60%, 70%, and 80%), and one where the top or bottom half of the pixels was removed.
Both samplers use the same trained VAE and perform the same number of iterations. The imputations are made by computing the means of the chains, which estimate the conditional expected value of the missing data. Since the imputation of these high-dimensional binary data sets can be interpreted as an imbalanced binary classification problem, we use the F1 score (the harmonic mean of precision and recall) as a performance metric.

Figure 3: Single imputation results (F1 score between the true and imputed values) for the two Markov chains. Additional results for the bottom missing and the 50% and 70% MAR cases are provided as supplementary material. The more challenging the conditional distribution (high-dimensional in the MAR cases, highly multimodal in the top/bottom cases), the larger the performance gain of our Metropolis-within-Gibbs scheme.

For both schemes, we use 50 iterations of pseudo-Gibbs sampling as burn-in. In practice, convergence and mixing of the chains can be monitored using a validation set of complete data. The results are displayed in Fig. 3 and in Appendix F (see supplementary material). The chains converge much faster in the missing at random (MAR) scenario than in the top/bottom missing scenario. This is probably due to the fact that the conditional distribution of the missing half of an image is highly multimodal. The Metropolis-within-Gibbs sampler consistently outperforms the pseudo-Gibbs scheme, especially in the most challenging scenarios where the top or bottom of the image is missing. One can see that the pseudo-Gibbs sampler appears to converge quickly to a stationary distribution that gives suboptimal results.
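As a side note, the F1 metric on binarised imputations amounts to the following minimal sketch (`f1_score_binary` is an illustrative helper, not the paper's code):

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 score: harmonic mean of precision and recall for 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Here the chain means would first be binarised (e.g. thresholded at 0.5; the exact binarisation used is an assumption in this sketch) before scoring against the true pixels.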
Because of the rejections, the Metropolis-within-Gibbs algorithm converges more slowly, but to a much more accurate conditional distribution.

5 Conclusion

Although extremely difficult to compute in practice, the exact likelihood of DLVMs offers several important insights on deep generative modelling. An important research direction for future work is the design of principled regularisation schemes for maximum likelihood estimation.

The objective evaluation of deep generative models remains an open question. Missing data imputation is often used as a performance metric for DLVMs (e.g. Li et al., 2016; Du et al., 2018). Since both algorithms have essentially the same computational cost, this motivates replacing pseudo-Gibbs sampling with Metropolis-within-Gibbs when evaluating these models. Upon convergence, the samples generated by Metropolis-within-Gibbs do not depend on the inference network, and explicitly depend on the prior, which allows us to evaluate mainly the generative performance of the models.

We interpreted DLVMs as parsimonious submodels of nonparametric mixture models. While we used this connection to provide upper bounds of the likelihood, many other applications could be derived. In particular, the important body of work regarding consistency of maximum likelihood estimates for nonparametric mixtures (e.g.
Kiefer and Wolfowitz, 1956; van de Geer, 2003; Chen, 2017) could be leveraged to study the asymptotics of DLVMs.

References

S. Arora, A. Risteski, and Y. Zhang. Do GANs learn the distribution? Some theory and empirics. In International Conference on Learning Representations, 2018.

D. J. Bartholomew, M. Knott, and I. Moustaki. Latent variable models and factor analysis: A unified approach, volume 904. John Wiley & Sons, 2011.

C. Biernacki and G. Castellan. A data-driven bound on variances for avoiding degeneracy in univariate Gaussian mixtures. Pub. IRMA Lille, 71, 2011.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. Proceedings of CoNLL, 2016.

Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

J. Chen. Consistency of the MLE under mixture models. Statistical Science, 32(1):47–63, 2017.

C. Cremer, Q. Morris, and D. Duvenaud.
Reinterpreting importance-weighted autoencoders. International Conference on Learning Representations (Workshop track), 2017.

C. Cremer, X. Li, and D. Duvenaud. Inference suboptimality in variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, 2018.

B. Dai, Y. Wang, J. Aston, G. Hua, and D. Wipf. Connections with robust PCA and the role of emergent sparsity in variational autoencoder models. The Journal of Machine Learning Research, 19(1):1573–1614, 2018.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 1–38, 1977.

A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. Blei. Variational inference via chi upper bound minimization. In Advances in Neural Information Processing Systems, pages 2732–2741, 2017.

L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. International Conference on Learning Representations, 2017.

C. Du, J. Zhu, and B. Zhang. Learning deep generative models with doubly stochastic gradient MCMC. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–13, 2018.

A. Gelman. Iterative and non-iterative simulation algorithms. Computing Science and Statistics, pages 433–433, 1993.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2018.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S.
Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.

C. H. Grønbech, M. F. Vording, P. N. Timshel, C. K. Sønderby, T. H. Pers, and O. Winther. scVAE: Variational auto-encoders for single-cell gene expression data. bioRxiv, 2018. URL https://www.biorxiv.org/content/early/2018/05/16/318295.

R. Grosse, Z. Ghahramani, and R. P. Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

R. J. Hathaway. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics, 13(2):795–800, 1985.

D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1(Oct):49–75, 2000.

H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.

I. T. Jolliffe and J. Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 374(2065), 2016.

J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, pages 887–906, 1956.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling.
Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, pages 1945–1954, 2017.

L. Le Cam. Maximum likelihood: an introduction. International Statistical Review/Revue Internationale de Statistique, pages 153–171, 1990.

M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

C. Li, J. Zhu, and B. Zhang. Learning to generate with memory. In Proceedings of The 33rd International Conference on Machine Learning, pages 1177–1186, 2016.

B. Lindsay. The geometry of mixture likelihoods: a general theory. The Annals of Statistics, 11(1):86–94, 1983.

B. Lindsay. Mixture Models: Theory, Geometry and Applications, volume 5 of Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics and American Statistical Association, 1995.

T. Lucas, C. Tallec, J. Verbeek, and Y. Ollivier. Mixed batches and symmetric discriminators for GAN training. In International Conference on Machine Learning, 2018.

L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. In International Conference on Machine Learning, pages 1445–1453, 2016.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions.
In Advances in Neural Information Processing Systems, pages 2352–2360, 2016.

R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In Proceedings of the 33rd International Conference on Machine Learning, pages 324–333, 2016.

D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pages 1530–1538, 2015.

D. Rezende and F. Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

D. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. In Advances in Neural Information Processing Systems, pages 4996–5004, 2016.

E. Richardson and Y. Weiss. On GANs and GMMs. In Advances in Neural Information Processing Systems, 2018.

G. Roeder, Y. Wu, and D. Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6928–6937, 2017.

T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. Proceedings of the International Conference on Learning Representations, 2017.

N. Siddharth, B. Paige, J.-W. van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pages 5927–5937, 2017.

C. K. Sønderby, T. Raiko, L. Maaløe, S. Kaae Sønderby, and O. Winther. Ladder variational autoencoders.
In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.

H. Takahashi, T. Iwata, Y. Yamanaka, M. Yamada, and S. Yagi. Student-t variational autoencoder for robust density estimation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 2696–2702. International Joint Conferences on Artificial Intelligence Organization, 2018.

A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-posed problems. New York: Winston, 1977.

M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

J. Tomczak and M. Welling. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pages 1214–1223, 2018.

S. van de Geer. Asymptotic theory for maximum likelihood in nonparametric mixture models. Computational Statistics & Data Analysis, 41(3-4):453–464, 2003.

A. W. van der Vaart and J. A. Wellner. Existence and consistency of maximum likelihood in upgraded mixture models. Journal of Multivariate Analysis, 43(1):133–146, 1992.

Y. Wang. On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):185–198, 2007.

S. Zhao, J. Song, and S. Ermon. Learning hierarchical features from deep generative models. In Proceedings of the 34th International Conference on Machine Learning, pages 4091–4099, 2017.