{"title": "Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse", "book": "Advances in Neural Information Processing Systems", "page_first": 9408, "page_last": 9418, "abstract": "Posterior collapse in Variational Autoencoders (VAEs) with uninformative priors arises when the variational posterior distribution closely matches the prior for a subset of latent variables. This paper presents a simple and intuitive explanation for posterior collapse through the analysis of linear VAEs and their direct correspondence with Probabilistic PCA (pPCA). We explain how posterior collapse may occur in pPCA due to local maxima in the log marginal likelihood. Unexpectedly, we prove that the ELBO objective for the linear VAE does not introduce additional spurious local maxima relative to log marginal likelihood. We show further that training a linear VAE with exact variational inference recovers a uniquely identifiable global maximum corresponding to the principal component directions. Empirically, we find that our linear analysis is predictive even for high-capacity, non-linear VAEs and helps explain the relationship between the observation noise, local maxima, and posterior collapse in deep Gaussian VAEs.", "full_text": "Don\u2019t Blame the ELBO!\n\nA Linear VAE Perspective on Posterior Collapse\n\nJames Lucas\u2021\u2217, George Tucker\u2020, Roger Grosse\u2021, Mohammad Norouzi\u2020\n\n\u2021University of Toronto\n\n\u2020Google Brain\n\nAbstract\n\nPosterior collapse in Variational Autoencoders (VAEs) arises when the variational\nposterior distribution closely matches the prior for a subset of latent variables. This\npaper presents a simple and intuitive explanation for posterior collapse through\nthe analysis of linear VAEs and their direct correspondence with Probabilistic\nPCA (pPCA). We explain how posterior collapse may occur in pPCA due to\nlocal maxima in the log marginal likelihood. 
Unexpectedly, we prove that the ELBO objective for the linear VAE does not introduce additional spurious local maxima relative to log marginal likelihood. We show further that training a linear VAE with exact variational inference recovers an identifiable global maximum corresponding to the principal component directions. Empirically, we find that our linear analysis is predictive even for high-capacity, non-linear VAEs and helps explain the relationship between the observation noise, local maxima, and posterior collapse in deep Gaussian VAEs.\n\n1 Introduction\n\nThe generative process of a deep latent variable model entails drawing a number of latent factors from the prior and using a neural network to convert such factors to real data points. Maximum likelihood estimation of the parameters requires marginalizing out the latent factors, which is intractable for deep latent variable models. The influential work of Kingma and Welling [24] and Rezende et al. [35] on Variational Autoencoders (VAEs) enables optimization of a tractable lower bound on the likelihood via a reparameterization of the Evidence Lower Bound (ELBO) [21, 6]. This has led to a surge of recent interest in automatic discovery of the latent factors of variation for a data distribution based on VAEs and principled probabilistic modeling [18, 7, 10, 16].\nUnfortunately, the quality and the number of the latent factors learned are influenced by a phenomenon known as posterior collapse, where the generative model learns to ignore a subset of the latent variables. Most existing papers suggest that posterior collapse is caused by the KL-divergence term in the ELBO objective, which directly encourages the variational distribution to match the prior [7, 25, 38]. Thus, a wide range of heuristic approaches in the literature have attempted to diminish the effect of the KL term in the ELBO to alleviate posterior collapse [7, 33, 38, 20]. 
While holding the KL term responsible for posterior collapse makes intuitive sense, the mathematical mechanism of this phenomenon is not well understood. In this paper, we investigate the connection between posterior collapse and spurious local maxima in the ELBO objective through the analysis of linear VAEs. Unexpectedly, we show that spurious local maxima may arise even in the optimization of exact marginal likelihood, and such local maxima are linked with a collapsed posterior.\nWhile linear autoencoders [37] have been studied extensively [4, 26], little attention has been given to their variational counterpart from a theoretical standpoint. A well-known relationship exists between linear autoencoders and PCA – the optimal solution of a linear autoencoder has decoder weight columns that span the same subspace as the one defined by the principal components [4].\n\n∗Intern at Google Brain\nCode available at https://sites.google.com/view/dont-blame-the-elbo\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSimilarly, the maximum likelihood solution of probabilistic PCA (pPCA) [39] recovers the subspace of principal components. In this work, we show that a linear variational autoencoder can recover the solution of pPCA. In particular, by specifying a diagonal covariance structure on the variational distribution, one can recover an identifiable autoencoder, which at the global maximum of the ELBO recovers the exact principal components as the columns of the decoder's weights. Importantly, we show that the ELBO objective for a linear VAE does not introduce any local maxima beyond the log marginal likelihood.\nThe study of linear VAEs gives us new insights into the cause of posterior collapse and the difficulty of VAE optimization more generally. 
Following the analysis of Tipping and Bishop [39], we characterize the stationary points of pPCA and show that the variance of the observation model directly influences the stability of local stationary points corresponding to posterior collapse – it is only possible to escape these sub-optimal solutions by simultaneously reducing noise and learning better features. Our contributions include:\n\n• We verify that linear VAEs can recover the true posterior of pPCA. Further, we prove that the global optimum of the linear VAE recovers the principal components (not just their spanning sub-space). More importantly, we prove that using ELBO to train linear VAEs does not introduce any additional spurious local maxima relative to log marginal likelihood training.\n• While high-capacity decoders are often blamed for posterior collapse, we show that posterior collapse may occur when optimizing log marginal likelihood even without powerful decoders. Our experiments verify the analysis of the linear setting and show that these insights extend even to high-capacity non-linear VAEs. Specifically, we provide evidence that the observation noise in deep Gaussian VAEs plays a crucial role in overcoming local maxima corresponding to posterior collapse.\n\n2 Preliminaries\n\nProbabilistic PCA. The probabilistic PCA (pPCA) model is defined as follows. Suppose latent variables z ∈ R^k generate data x ∈ R^n. A standard Gaussian prior is used for z and a linear generative model with a spherical Gaussian observation model for x:\n\np(z) = N(0, I),   p(x | z) = N(Wz + µ, σ²I).   (1)\n\nThe pPCA model is a special case of factor analysis [5], which uses a spherical covariance σ²I instead of a full covariance matrix. 
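As a quick illustration, the generative process in Eq. (1) can be sampled directly; the NumPy sketch below uses arbitrary illustrative values for W, µ, and σ (not quantities from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2                      # data dimension, latent dimension
W = rng.normal(size=(n, k))      # arbitrary decoder weights (illustrative)
mu = np.zeros(n)                 # data mean
sigma = 0.1                      # observation noise standard deviation

# pPCA generative process: z ~ N(0, I), then x | z ~ N(Wz + mu, sigma^2 I)
z = rng.normal(size=k)
x = W @ z + mu + sigma * rng.normal(size=n)

# The implied marginal over x is N(mu, W W^T + sigma^2 I)
C = W @ W.T + sigma**2 * np.eye(n)
```

Because both the prior and the observation model are Gaussian and the map is linear, every distribution in this model stays Gaussian.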
As pPCA is fully Gaussian, both the marginal distribution for x and the posterior p(z | x) are Gaussian, and unlike factor analysis, the maximum likelihood estimates of W and σ² are tractable [39].\n\nVariational Autoencoders. Recently, amortized variational inference has gained popularity as a means to learn complicated latent variable models. In these models, the log marginal likelihood, log p(x), is intractable but a variational distribution, denoted q(z | x), is used to approximate the posterior p(z | x), allowing tractable approximate inference using the Evidence Lower Bound (ELBO):\n\nlog p(x) = E_q(z|x)[log p(x, z) − log q(z | x)] + D_KL(q(z | x) || p(z | x))   (2)\n≥ E_q(z|x)[log p(x, z) − log q(z | x)]   (3)\n= E_q(z|x)[log p(x | z)] − D_KL(q(z | x) || p(z))   (:= ELBO)   (4)\n\nThe ELBO [21, 6] consists of two terms, the KL divergence between the variational distribution, q(z|x), and prior, p(z), and the expected conditional log-likelihood. The KL divergence forces the variational distribution towards the prior and so has reasonably been the focus of many attempts to alleviate posterior collapse. We hypothesize that the log marginal likelihood itself often encourages posterior collapse.\nIn Variational Autoencoders (VAEs), two neural networks are used to parameterize q_φ(z|x) and p_θ(x|z), where φ and θ denote two sets of neural network weights. The encoder maps an input x to the parameters of the variational distribution, and then the decoder maps a sample from the variational distribution back to the inputs.\n\na) σ² = λ4   b) σ² = λ6   c) σ² = λ8\n\nFigure 1: Stationary points of pPCA. Two zero-columns of W are perturbed in the directions of two orthogonal principal components (µ5 and µ7) and the optimization landscape around zero-columns is shown, where the goal is to maximize log marginal likelihood. 
The stability of the stationary points depends critically on σ² (the observation noise). Left: σ² is too large to capture either principal component. Middle: σ² is too large to capture one of the principal components. Right: σ² is able to capture both principal components.\n\nPosterior collapse. A dominant issue with VAE optimization is posterior collapse, in which the learned variational distribution is close to the prior. This reduces the capacity of the generative model, making it impossible for the decoder network to make use of the information content of all of the latent dimensions. While posterior collapse is widely acknowledged, formally defining it has remained a challenge. We introduce a formal definition in Section 6.2, which we use to measure posterior collapse in trained deep neural networks.\n\n3 Related Work\n\nDai et al. [14] discuss the relationship between robust PCA methods [8] and VAEs. They show that at stationary points the VAE objective locally aligns with pPCA under certain assumptions. We study the pPCA objective explicitly and show a direct correspondence with linear VAEs. Dai et al. [14] showed that the covariance structure of the variational distribution may smooth out the loss landscape. This is an interesting result whose interaction with ours is an exciting direction for future research.\nHe et al. [17] motivate posterior collapse through an investigation of the learning dynamics of deep VAEs. They suggest that posterior collapse is caused by the inference network lagging behind the true posterior during the early stages of training. A related line of research studies issues arising from approximate inference causing a mismatch between the variational distribution and true posterior [12, 22, 19]. By contrast, we show that posterior collapse may exist even when the variational distribution matches the true posterior exactly.\nAlemi et al. 
[2] used an information theoretic framework to study the representational properties of VAEs. They show that with infinite model capacity there are solutions with equal ELBO and log marginal likelihood which span a range of representations, including posterior collapse. We find that even with weak (linear) decoders, posterior collapse may occur. Moreover, we show that in the linear case this posterior collapse is due entirely to the log marginal likelihood.\nThe most common approach for dealing with posterior collapse is to anneal a weight on the KL term during training from 0 to 1 [7, 38, 30, 18, 20]. Unfortunately, this means that during the annealing process, one is no longer optimizing a bound on the log-likelihood. Also, it is difficult to design these annealing schedules, and we have found that once regular ELBO training resumes the posterior will typically collapse again (Section 6.2).\nKingma et al. [25] propose a constraint on the KL term, termed "free-bits", where the gradient of the KL term per dimension is ignored if the KL is below a given threshold. Unfortunately, this method reportedly has some negative effects on training stability [33, 11]. Delta-VAEs [33] instead choose prior and variational distributions such that the variational distribution can never exactly recover the prior, allocating free-bits implicitly. Several other papers have studied alternative formulations of the VAE objective [34, 13, 2, 29, 41]. Dai and Wipf [13] analyzed the VAE objective to improve image fidelity under Gaussian observation models and also discuss the importance of the observation noise. Other approaches have explored changing the VAE network architecture to help alleviate posterior collapse; for example, adding skip connections [30, 15].\n\nRolinek et al. [36] observed that the diagonal covariance used in the variational distribution of VAEs encourages orthogonal representations. 
They use linearizations of deep networks to prove their results under a modified objective function that explicitly ignores latent dimensions with posterior collapse. Our formulation is distinct in focusing on linear VAEs without modifying the objective function and proving an exact correspondence between the global solution of linear VAEs and the principal components.\nKunin et al. [26] studied the optimization challenges in the linear autoencoder setting. They exposed an equivalence between pPCA and Bayesian autoencoders and point out that when σ² is too large information about the latent code is lost. A similar phenomenon is discussed in the supervised learning setting by Chechik et al. [9]. Kunin et al. [26] also showed that suitable regularization allows the linear autoencoder to recover the principal components up to rotations. We show that linear VAEs with a diagonal covariance structure recover the principal components exactly.\n\n4 Analysis of linear VAE\n\nThis section compares and analyzes the loss landscapes of both pPCA and linear variational autoencoders. We first discuss the stationary points of pPCA and then show that a simple linear VAE can recover the global optimum of pPCA. Moreover, when the data covariance eigenvalues are distinct, the linear VAE identifies the individual principal components, unlike pPCA, which recovers only the PCA subspace. Finally, we prove that ELBO does not introduce any additional spurious maxima to the loss landscape.\n\n4.1 Probabilistic PCA Revisited\n\nThe pPCA model (Eq. (1)) is a fully Gaussian linear model, thus we can compute both the marginal distribution for x and the posterior p(z | x) in closed form:\n\np(x) = N(µ, WW^T + σ²I),   (5)\np(z | x) = N(M^{-1}W^T(x − µ), σ²M^{-1}),   (6)\n\nwhere M = W^T W + σ²I. 
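These closed-form expressions are easy to check numerically. The sketch below evaluates the posterior of Eq. (6) and verifies it against the generic linear-Gaussian conditioning formula; all matrices are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 6, 3, 0.5
W = rng.normal(size=(n, k))
mu = rng.normal(size=n)
x = rng.normal(size=n)

# Posterior from Eq. (6): p(z | x) = N(M^{-1} W^T (x - mu), sigma^2 M^{-1})
M = W.T @ W + sigma2 * np.eye(k)
post_mean = np.linalg.solve(M, W.T @ (x - mu))
post_cov = sigma2 * np.linalg.inv(M)

# Cross-check against standard Gaussian conditioning:
# precision = I + W^T W / sigma^2, mean = cov @ W^T (x - mu) / sigma^2
prec = np.eye(k) + W.T @ W / sigma2
cov2 = np.linalg.inv(prec)
mean2 = cov2 @ W.T @ (x - mu) / sigma2

assert np.allclose(post_cov, cov2) and np.allclose(post_mean, mean2)
```

The agreement follows algebraically: σ²(WᵀW + σ²I)⁻¹ = (I + WᵀW/σ²)⁻¹.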
This model is particularly interesting to analyze in the setting of variational inference, as the ELBO can also be computed in closed form (see Appendix C).\n\nStationary points of pPCA. We now characterize the stationary points of pPCA, largely repeating the thorough analysis of Tipping and Bishop [39] (see Appendix A of their paper). The maximum likelihood estimate of µ is the mean of the data. We can compute W_MLE and σ²_MLE as follows:\n\nσ²_MLE = (1 / (n − k)) Σ_{j=k+1}^{n} λ_j,   (7)\nW_MLE = U_k (Λ_k − σ²_MLE I)^{1/2} R.   (8)\n\nHere U_k corresponds to the first k principal components of the data with the corresponding eigenvalues λ_1, ..., λ_k stored in the k × k diagonal matrix Λ_k. The matrix R is an arbitrary rotation matrix which accounts for weak identifiability in the model. We can interpret σ²_MLE as the average variance lost in the projection. The MLE solution is the global optimum. Other stationary points correspond to zeroing out columns of W_MLE (posterior collapse).\n\nStability of W_MLE. In this section we consider σ² to be fixed and not necessarily equal to the MLE solution. Equation 8 remains a stationary point when the general σ² is swapped in. One surprising observation is that σ² directly controls the stability of the stationary points of the log marginal likelihood (see Appendix A). In Figure 1, we illustrate one such stationary point of pPCA for different values of σ². We computed this stationary point by taking W to have three principal component columns and zeros elsewhere. 
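Both the MLE of Eqs. (7)-(8) and this kind of zero-column stationary point are straightforward to construct numerically from the eigendecomposition of the sample covariance; a minimal NumPy sketch on synthetic data, assuming R = I:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, num = 8, 3, 2000
X = rng.normal(size=(num, n)) @ rng.normal(size=(n, n))  # synthetic data

S = np.cov(X, rowvar=False)            # sample covariance
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]         # sort eigenvalues in descending order

# Eq. (7): sigma^2_MLE is the average variance discarded by the projection
sigma2_mle = lam[k:].mean()
# Eq. (8) with R = I: columns are scaled principal component directions
W_mle = U[:, :k] * np.sqrt(lam[:k] - sigma2_mle)

# A collapsed stationary point: zero out one column of W_MLE
W_collapsed = W_mle.copy()
W_collapsed[:, -1] = 0.0

# At the MLE with R = I, M = W^T W + sigma^2 I equals the diagonal matrix Lambda_k
M = W_mle.T @ W_mle + sigma2_mle * np.eye(k)
assert np.allclose(M, np.diag(lam[:k]))
```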
Each plot shows the same stationary point perturbed by two orthogonal vectors corresponding to other principal components.\nThe stability of the pPCA stationary points depends on the size of σ² — as σ² increases the stationary point tends towards a stable local maximum so that we cannot learn the additional components. Intuitively, the model prefers to explain deviations in the data with the larger observation noise. Fortunately, decreasing σ² will increase likelihood at these stationary points so that when learning σ² simultaneously these stationary points are saddle points [39]. Therefore, learning σ² is necessary for gaining a full latent representation.\n\n4.2 Linear VAEs recover pPCA\n\nWe now show that linear VAEs can recover the globally optimal solution to Probabilistic PCA. We will consider the following VAE model,\n\np(x | z) = N(Wz + µ, σ²I),   q(z | x) = N(V(x − µ), D),   (9)\n\nwhere D is a diagonal covariance matrix, used globally for all of the data points. While this is a significant restriction compared to typical VAE architectures, which define an amortized variance for each input point, this is sufficient to recover the global optimum of the probabilistic model.\nLemma 1. The global maximum of the ELBO objective (Eq. (4)) for the linear VAE (Eq. (9)) is identical to the global maximum for the log marginal likelihood of pPCA (Eq. (5)).\nProof. Note that the global optimum of pPCA is defined up to an orthogonal transformation of the columns of W, i.e., any rotation R in Eq. (8) results in a matrix W_MLE that given σ²_MLE attains maximum marginal likelihood. The linear VAE model defined in Eq. (9) is able to recover the global optimum of pPCA when R = I. Recall from Eq. (6) that p(z | x) is defined in terms of M = W^T W + σ²I. When R = I, we obtain M = W_MLE^T W_MLE + σ²_MLE I = Λ_k, which is diagonal. Thus, setting V = M^{-1} W_MLE^T and D = σ²_MLE M^{-1} = σ²_MLE Λ_k^{-1} recovers the true posterior with diagonal covariance at the global optimum. In this case, the ELBO equals the log marginal likelihood and is maximized when the decoder has weights W = W_MLE. Because the ELBO lower bounds log-likelihood, the global maximum of the ELBO for the linear VAE is the same as the global maximum of the marginal likelihood for pPCA.\n\nThe result of Lemma 1 is somewhat expected because the posterior of pPCA is Gaussian. Further details are given in Appendix C. In addition, we prove a more surprising result that suggests restricting the variational distribution to a Gaussian with a diagonal covariance structure allows one to identify the principal components at the global optimum of ELBO.\nCorollary 1. The global maximum of the ELBO objective (Eq. (4)) for the linear VAE (Eq. (9)) has the scaled principal components as the columns of the decoder network.\nProof. Follows directly from the proof of Lemma 1 and Eq. (8).\n\nWe discuss this result in Appendix B. This full identifiability is non-trivial and is not achieved even with the regularized linear autoencoder [26].\nSo far, we have shown that at its global optimum the linear VAE recovers the pPCA solution, which enforces orthogonality of the decoder weight columns. However, the VAE is trained with the ELBO rather than the log marginal likelihood — often using SGD. The majority of existing work suggests that the KL term in the ELBO objective is responsible for posterior collapse. So, we should ask whether this term introduces additional spurious local maxima. Surprisingly, for the linear VAE model the ELBO objective does not introduce any additional spurious local maxima. We provide a sketch of the proof below with full details in Appendix C.\nTheorem 1. 
The ELBO objective for a linear VAE does not introduce any additional local maxima to the pPCA model.\nProof. (Sketch) If the decoder has orthogonal columns, then the variational distribution recovers the true posterior at stationary points. Thus, the variational objective will exactly recover the log marginal likelihood. If the decoder does not have orthogonal columns then the variational distribution is no longer tight. However, the ELBO can always be increased by applying an infinitesimal rotation to the right-singular vectors of the decoder towards identity: W′ ← WR_ε (so that the decoder columns are closer to orthogonal). This works because the variational distribution can fit the posterior more closely while the log marginal likelihood is invariant to rotations of the weight columns. Thus, any additional stationary points in the ELBO objective must necessarily be saddle points.\n\nThe theoretical results presented in this section provide new intuition for posterior collapse in VAEs. In particular, the KL between the variational distribution and the prior is not entirely responsible for posterior collapse — log marginal likelihood has a role. The evidence for this is two-fold. We have shown that log marginal likelihood may have spurious local maxima but also that in the linear case the ELBO objective does not add any additional spurious local maxima. Rephrased, in the linear setting the problem lies entirely with the probabilistic model. We should then ask, to what extent do these results hold in the non-linear setting?\n\n5 Deep Gaussian VAEs\n\nThe deep Gaussian VAE consists of a decoder D_θ and an encoder E_φ. 
The ELBO objective can be expressed as\n\nL(x; θ, φ) = −KL(q_φ(z | x) || p(z)) − (1 / (2σ²)) E_{q_φ(z|x)}[ ||D_θ(z) − x||² ] − (1/2) log(2πσ²).   (10)\n\nThe role of σ² in this objective invites a natural comparison to the β-VAE objective [18], where the KL term is weighted by β ∈ R+. Alemi et al. [2] propose using small β values to force powerful decoders to utilize the latent variables, but this comes at the cost of poor ELBO. Practitioners must then use downstream task performance for model selection, thus sacrificing one of the primary benefits of likelihood-based models. However, for a given β, one can find a corresponding σ² (and a learning rate) such that the gradient updates to the network parameters are identical. Importantly, the Gaussian partition function for a Gaussian observation model (the last term on the RHS of Eq. (10)) prevents ELBO from deviating from the β-VAE's objective with a β-weighted KL term while maintaining the benefits to representation learning when σ² is small. For the Gaussian VAE, this helps connect the dots between the role of local maxima and observation noise in posterior collapse vs. heuristic approaches that attempted to alleviate posterior collapse by diminishing the effect of the KL term [7, 33, 38, 20]. In the following section, we will study the nonlinear VAE empirically and explore connections to the linear theory.\n\n6 Experiments\n\nIn this section, we present empirical evidence from studying two distinct claims. First, we verify our theoretical analysis of the linear VAE model. Second, we explore to what extent these insights apply to deep nonlinear VAEs.\n\n6.1 Linear VAEs\n\nWe ran two sets of experiments on 1000 randomly chosen MNIST images. First, we trained linear VAEs with learnable σ² for a range of hidden dimensions². 
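The objective for these linear models is Eq. (10) specialized to Eq. (9). The paper optimizes an analytic form of the ELBO (Appendix C); the sketch below instead shows the generic single-sample reparameterized estimate of Eq. (10), with random untrained weights standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 10, 4, 1.0
W = rng.normal(size=(n, k)) * 0.1   # linear decoder weights (illustrative)
V = rng.normal(size=(k, n)) * 0.1   # linear encoder weights (illustrative)
d = np.full(k, 0.5)                 # diagonal posterior variances D
mu = np.zeros(n)
x = rng.normal(size=n)

# q(z | x) = N(V(x - mu), D); reparameterized sample z = m + sqrt(D) * eps
m = V @ (x - mu)
z = m + np.sqrt(d) * rng.normal(size=k)

# KL(N(m, D) || N(0, I)) in closed form for a diagonal Gaussian
kl = 0.5 * np.sum(m**2 + d - np.log(d) - 1.0)

# Eq. (10): -KL - ||D(z) - x||^2 / (2 sigma^2) - (1/2) log(2 pi sigma^2)
recon = W @ z + mu
elbo = -kl - np.sum((recon - x) ** 2) / (2 * sigma2) \
       - 0.5 * np.log(2 * np.pi * sigma2)
```

Averaging this estimate over many samples of z recovers the analytic ELBO; the stochastic version adds gradient variance, which the experiments below compare against.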
For each model, we compared the final ELBO to the maximum likelihood of pPCA, finding them to be essentially indistinguishable (as predicted by Lemma 1 and Theorem 1). For the second set of experiments, we took the pPCA MLE solution for W for each number of hidden dimensions and computed the likelihood under the observation noise which maximizes likelihood for 50 hidden dimensions. We observed that adding additional principal components (after 50) will initially improve likelihood but eventually adding more components (after 200) actually decreases the likelihood. In other words, the collapsed solution is actually preferred if the observation noise is not set correctly — we observe this theoretically through the stability of the stationary points (e.g. Figure 1).\n\nFigure 2: The log marginal likelihood and optimal ELBO of MNIST pPCA solutions over increasing hidden dimension. Green represents the MLE solution (global maximum), the red dashed line is the optimal ELBO solution which matches the global optimum. The blue line shows the log marginal likelihood of the solutions using the full decoder weights when σ² is fixed to its MLE solution for 50 hidden dimensions.\n\n²The VAEs were trained using the analytic ELBO (Appendix C.1) and without mini-batching gradients.\n\nFigure 3: Stochastic vs analytic ELBO training: using the analytic gradient of the ELBO led to faster convergence and better final ELBO (950.7 vs. 939.3).\n\nFigure 4: VAEs with linear decoders trained on real-valued MNIST with nonlinear preprocessing [31]. 
Final average ELBOs on the training set are (ordered by legend): -1098.2, -1108.7, -1112.1, -1119.6.\n\nEffect of stochastic ELBO estimates. In general, we are unable to compute the ELBO in closed form and so instead rely on unbiased Monte Carlo estimates using the reparameterization trick. These estimates add high-variance noise and can make optimization more challenging [24]. In the linear model, we can compare the solutions obtained using the stochastic ELBO gradients versus the analytic ELBO³ (Figure 3). Additional experimental details are in Appendix E. We found that stochastic optimization had slower convergence (when compared to analytic training with the same learning rate) and, unsurprisingly, reached a worse final training ELBO value (in other words, worse steady-state risk due to the gradient variance).\n\nNonlinear Encoders. With a linear decoder and nonlinear encoder, Lemma 1 still holds, and the optimal variational distribution is unchanged, since the true posterior has not changed. However, Corollary 1 and Theorem 1 no longer hold in general. Even a deep linear encoder will not have a unique global maximum, and new stationary points (possibly maxima) may be introduced to ELBO in general. To investigate how deeper networks may impact optimization of the probabilistic model, we trained linear decoders with varying encoders using ELBO. We do not expect the linear encoder to be outperformed, and indeed the empirical results support this (Figure 4).\n\n6.2 Investigating posterior collapse in deep nonlinear VAEs\n\nWe explored how the analysis of the linear VAEs extends to deep nonlinear models. To do so, we trained VAEs with Gaussian observation models on the MNIST [27] and CelebA [28] datasets. We apply uniform dequantization as in Papamakarios et al. [31] in each case. We also adopt the nonlinear logit preprocessing transformation from Papamakarios et al. [31] to provide fair comparisons with existing work. 
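A common implementation of this dequantize-then-logit preprocessing is sketched below; the constant `lam` is an assumed illustrative value (Papamakarios et al. [31] describe the exact constants used for each dataset):

```python
import numpy as np

def preprocess(pixels, lam=1e-6, rng=None):
    """Uniformly dequantize 8-bit pixel values, then apply a logit transform.

    `lam` (the logit-squashing constant) is an assumed value here for
    illustration; see Papamakarios et al. [31] for the exact setup.
    """
    rng = rng or np.random.default_rng(0)
    x = (pixels + rng.uniform(size=pixels.shape)) / 256.0  # dequantize to [0, 1)
    x = lam + (1 - 2 * lam) * x                            # squash away from {0, 1}
    return np.log(x) - np.log1p(-x)                        # logit(x)

imgs = np.arange(0, 256).reshape(16, 16).astype(np.float64)
u = preprocess(imgs)
```

The transform maps bounded pixel intensities onto the real line, which makes a Gaussian observation model better suited to the data.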
We also report results of models trained directly in pixel space in the appendix (there is no significant difference for the hypotheses we test).\n\nMeasuring posterior collapse. In order to measure the extent of posterior collapse, we introduce the following definition. We say that latent dimension i has (ε, δ)-collapsed if P_{x∼p}[KL(q(z_i | x) || p(z_i)) < ε] ≥ 1 − δ. Note that the linear VAE can suffer (0, 0)-collapse. To estimate this practically, we compute the proportion of data samples which induce a variational distribution with KL divergence less than ε and finally report the percentage of dimensions which have (ε, δ)-collapsed. Throughout this work, we fix δ = 0.01 and vary ε.\n\nInvestigating σ². We trained MNIST VAEs with 2 hidden layers in both the decoder and encoder, ReLU activations, and 200 latent dimensions. We first evaluated training with fixed values of the observation noise, σ². This mirrors many public VAE implementations where σ² is fixed to 1 throughout training (also observed by Dai and Wipf [13]); however, our linear analysis suggests that this is suboptimal. Then, we consider the setting where the observation noise and VAE weights are learned simultaneously.\nIn Table 1 we report the final ELBO of nonlinear VAEs trained on real-valued MNIST. For fixed σ², we found that the final models could have significant differences in ELBO which were maintained even after tuning σ² to the learned representations — the converged representations are worse when σ² is too large, as predicted by the linear model. 
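For a Gaussian variational posterior N(m_i, s_i²) per dimension and a standard normal prior, the per-dimension KL entering the (ε, δ)-collapse definition above has a well-known closed form; a small sketch:

```python
import numpy as np

def kl_per_dim(mean, var):
    """KL(N(mean, var) || N(0, 1)) elementwise: one value per latent dimension."""
    return 0.5 * (mean**2 + var - np.log(var) - 1.0)

# A fully collapsed dimension (q equal to the prior) has exactly zero KL:
assert kl_per_dim(np.array([0.0]), np.array([1.0]))[0] == 0.0
```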
Additionally, we report the final ELBO values when the model is trained while learning σ² with different initial values of σ². The gap in performance across different initializations is smaller than for fixed σ² but is still significant. The linear VAE does not predict this gap, which suggests that learning σ² correctly is more challenging in the nonlinear case.\n\n³We use 1000 MNIST images, as before, to enable full-batch training so that the only source of noise is from the reparameterization trick [24].\n\nModel / Init σ² | Final σ² | ELBO | σ²-tuned ELBO | Tuned σ² | Posterior collapse (%) | KL divergence\nMNIST (fixed σ²):\n10.0 | – | −1450.3 ± 4.2 | −1098.2 ± 28.3 | 1.797 | 89.88 | 28.8 ± 1.4\n1.0 | – | −1022.1 ± 5.4 | −1018.3 ± 5.3 | 1.145 | 27.38 | 125.4 ± 4.2\n0.1 | – | −3697.3 ± 493.3 | −1190.8 ± 37.4 | 0.968 | 3.25 | 368.7 ± 94.6\n0.01 | – | −38612.5 ± 1189.8 | −2090.8 ± 975.1 | 0.877 | 0.00 | 695.9 ± 118.1\n0.001 | – | −504259.1 ± 49149.8 | −1744.7 ± 48.4 | 0.810 | 0.00 | 756.2 ± 12.6\nMNIST (learned σ²):\n10.0 | 1.320 | −1022.2 ± 4.5 | −1022.3 ± 4.6 | 1.318 | 73.75 | 73.8 ± 9.8\n1.0 | 1.183 | −1011.1 ± 2.7 | −1011.1 ± 2.8 | 1.182 | 47.88 | 106.3 ± 2.5\n0.1 | 1.194 | −1025.4 ± 8.6 | −1025.4 ± 8.6 | 1.195 | 29.25 | 116.1 ± 11.4\n0.01 | 1.194 | −1030.6 ± 3.5 | −1030.5 ± 3.5 | 1.191 | 23.00 | 121.9 ± 7.7\n0.001 | 1.208 | −1038.7 ± 5.6 | −1038.8 ± 5.6 | 1.209 | 27.00 | 124.9 ± 1.6\nCelebA64 (fixed σ²):\n10.0 | – | −73328.4 ± 0.49 | −55186.7 ± 35.1 | 0.2040 | 80.56 | 56.12 ± 0.4\n1.0 | – | −59841.8 ± 30.1 | −51294.8 ± 333.7 | 0.1020 | 2.52 | 213.4 ± 6.3\n0.1 | – | −50760.3 ± 353.4 | −50698.5 ± 393.9 | 0.0883 | 32.72 | 483.8 ± 36.2\n0.01 | – | −82478.7 ± 1823.3 | −51373.9 ± 213.3 | 0.0817 | 0.00 | 1624.2 ± 8.8\n0.001 | – | −531924.5 ± 17177.6 | −57381.5 ± 512.6 | 0.0296 | 0.00 | 2680.2 ± 41.5\nCelebA64 (learned σ²):\n10.0 | 0.0962 | −51109.5 ± 408.2 | −51109.5 ± 408.3 | 0.0963 | 53.32 | 364.5 ± 26.4\n1.0 | 0.0875 | −50631.2 ± 163.4 | −50631.0 ± 163.3 | 0.0875 | 54.76 | 462.2 ± 20.0\n0.1 | 0.0863 | −50646.9 ± 269.0 | −50645.9 ± 267.5 | 0.0869 | 28.84 | 520.9 ± 11.7\n0.01 | 0.0911 | −51285.0 ± 708.1 | −51284.8 ± 708.1 | 0.0963 | 5.64 | 557.0 ± 50.5\n0.001 | 0.1040 | −51695.1 ± 322.4 | −51694.8 ± 322.7 | 0.0974 | 0.00 | 537.5 ± 46.2\n\nTable 1: Evaluation of deep Gaussian VAEs (averaged over 5 trials) on real-valued MNIST and CelebA 64. We report the ELBO on the training set in all cases. Collapse percent gives the percentage of latent dimensions which are within 0.01 KL of the prior for at least 99% of the encoder inputs.\n\nFigure 5: Posterior collapse percentage as a function of ε-threshold for a deep VAE trained on MNIST. We measure posterior collapse for trained networks as the proportion of latent dimensions that are within ε KL divergence of the prior for at least a 1 − δ proportion of the training data points (δ = 0.01 in the plots).\n\nDespite the large volume of work studying posterior collapse, it has not been measured in a consistent way (or even consistently defined). In Figure 5 and Figure 6 we measure posterior collapse for trained networks as described above (we chose δ = 0.01). 
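Given per-example, per-dimension KL values, the collapse percentage reported in Table 1 can be computed as in this sketch (array names are illustrative):

```python
import numpy as np

def collapse_percent(kl, eps, delta=0.01):
    """Percentage of latent dimensions i with P_x[KL_i(x) < eps] >= 1 - delta.

    kl: array of shape (num_examples, num_latent_dims) holding
        KL(q(z_i | x) || p(z_i)) for each example x and dimension i.
    """
    frac_below = (kl < eps).mean(axis=0)           # per-dimension P[KL < eps]
    return 100.0 * (frac_below >= 1.0 - delta).mean()

# Toy check: one collapsed dimension (KL near 0) out of two.
kl = np.column_stack([np.full(100, 1e-4), np.full(100, 5.0)])
assert collapse_percent(kl, eps=0.01) == 50.0
```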
By considering a range of ε values we found this measure to be (moderately) robust to stochasticity in data preprocessing. We observed that for large choices of σ² initialization the variational distribution matches the prior closely. This was true even when σ² is learned, suggesting that local optima may contribute to posterior collapse in deep VAEs.

[Figure 5 panels: posterior collapse on MNIST (fixed variance); one panel of collapse % vs. ε for each σ² ∈ {30.0, 10.0, 3.0, 1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0001}.]

Figure 6: Posterior collapse percentage as a function of ε-threshold for a deep VAE with learned σ² trained on MNIST. We measure posterior collapse for trained networks as the proportion of latent dimensions that are within ε KL divergence of the prior for at least a 1 − δ proportion of the training data points (δ = 0.01 in the plots).

CelebA VAEs  We trained deep convolutional VAEs with 500 hidden dimensions on images from the CelebA dataset (resized to 64×64). We trained the CelebA VAEs with different fixed values of σ² and compared the ELBO before and after tuning σ² to the learned representations (Table 1). Further, we explored training the CelebA VAE while learning σ², over varied initializations of the observation noise. The VAE is sensitive to the initialization of the observation noise even when σ² is learned (in particular, in terms of the number of collapsed dimensions).

7 Discussion

By analyzing the correspondence between linear VAEs and pPCA, this paper makes significant progress towards understanding the causes of posterior collapse. We show that for simple linear VAEs posterior collapse is caused by ill-conditioning of the stationary points in the log marginal likelihood objective.
We demonstrate empirically that the same optimization issues play a role in deep non-linear VAEs. Finally, we find that linear VAEs are useful theoretical test-cases for evaluating existing hypotheses about VAEs, and we encourage researchers to study their hypotheses in the linear VAE setting.

8 Acknowledgements

This work was guided by many conversations with and feedback from our colleagues. In particular, we thank Durk Kingma, Alex Alemi, and Guodong Zhang for invaluable feedback on early versions of this work. RG acknowledges support from the CIFAR Canadian AI Chairs program.

[Figure 6 panels: posterior collapse on MNIST (learned variance); one panel of collapse % vs. ε for each σ² ∈ {30.0, 10.0, 3.0, 1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0001}.]

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO. arXiv preprint arXiv:1711.00464, 2017.

[3] J. Atchison and S. M. Shen. Logistic-normal distributions: Some properties and uses. Biometrika, 67(2):261–272, 1980.

[4] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima.
Neural Networks, 2(1):53–58, 1989.

[5] D. J. Bartholomew. Latent variable models and factor analysis. Oxford University Press, Inc., 1987.

[6] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.

[7] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[8] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[9] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.

[10] R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems, 2018.

[11] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[12] C. Cremer, X. Li, and D. Duvenaud. Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558, 2018.

[13] B. Dai and D. Wipf. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019.

[14] B. Dai, Y. Wang, J. Aston, G. Hua, and D. Wipf. Hidden talents of the variational autoencoder. arXiv preprint arXiv:1706.05148, 2017.

[15] A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei. Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863, 2018.

[16] R. Gomez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernandez-Lobato, B. Sanchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik.
Automatic chemical design using a data-driven continuous representation of molecules. American Chemical Society Central Science, 2018.

[17] J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations, 2019.

[18] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.

[19] D. Hjelm, R. R. Salakhutdinov, K. Cho, N. Jojic, V. Calhoun, and J. Chung. Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems, 2016.

[20] C.-W. Huang, S. Tan, A. Lacoste, and A. C. Courville. Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems, 2018.

[21] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 1999.

[22] Y. Kim, S. Wiseman, A. C. Miller, D. Sontag, and A. M. Rush. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550, 2018.

[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[25] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[26] D. Kunin, J. M. Bloom, A. Goeva, and C. Seed. Loss landscapes of regularized linear autoencoders. arXiv preprint arXiv:1901.08168, 2019.

[27] Y.
LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[28] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[29] X. Ma, C. Zhou, and E. Hovy. MAE: Mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations, 2019.

[30] L. Maaløe, M. Fraccaro, V. Liévin, and O. Winther. BIVA: A very deep hierarchy of latent variables for generative modeling. arXiv preprint arXiv:1902.02102, 2019.

[31] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2017.

[32] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Technical University of Denmark.

[33] A. Razavi, A. van den Oord, B. Poole, and O. Vinyals. Preventing posterior collapse with delta-VAEs. In International Conference on Learning Representations, 2019.

[34] D. J. Rezende and F. Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

[35] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[36] M. Rolinek, D. Zietlow, and G. Martius. Variational autoencoders pursue PCA directions (by accident). arXiv preprint arXiv:1812.06775, 2018.

[37] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, University of California San Diego, Institute for Cognitive Science, 1985.

[38] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.

[39] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

[40] J. M. Tomczak and M. Welling. VAE with a VampPrior. arXiv preprint arXiv:1705.07120, 2017.

[41] S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei. Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643, 2017.