{"title": "Ladder Variational Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 3738, "page_last": 3746, "abstract": "Variational autoencoders are powerful models for unsupervised learning. However deep models with several layers of dependent stochastic variables are difficult to train which limits the improvements obtained using these highly expressive models. We propose a new inference model, the Ladder Variational Autoencoder, that recursively corrects the generative distribution by a data dependent approximate likelihood in a process resembling the recently proposed Ladder Network. We show that this model provides state of the art predictive log-likelihood and tighter log-likelihood lower bound compared to the purely bottom-up inference in layered Variational Autoencoders and other generative models. We provide a detailed analysis of the learned hierarchical latent representation and show that our new inference model is qualitatively different and utilizes a deeper more distributed hierarchy of latent variables. Finally, we observe that batch-normalization and deterministic warm-up (gradually turning on the KL-term) are crucial for training variational models with many stochastic layers.", "full_text": "Ladder Variational Autoencoders\n\nCasper Kaae S\u00f8nderby\u21e4\ncasperkaae@gmail.com\n\nTapani Raiko\u2020\n\ntapani.raiko@aalto.fi\n\nLars Maal\u00f8e\u2021\n\nlarsma@dtu.dk\n\nS\u00f8ren Kaae S\u00f8nderby\u21e4\n\nskaaesonderby@gmail.com\n\nOle Winther\u21e4,\u2021\nolwi@dtu.dk\n\nAbstract\n\nVariational autoencoders are powerful models for unsupervised learning. 
However deep models with several layers of dependent stochastic variables are difficult to train which limits the improvements obtained using these highly expressive models. We propose a new inference model, the Ladder Variational Autoencoder, that recursively corrects the generative distribution by a data dependent approximate likelihood in a process resembling the recently proposed Ladder Network. We show that this model provides state of the art predictive log-likelihood and tighter log-likelihood lower bound compared to the purely bottom-up inference in layered Variational Autoencoders and other generative models. We provide a detailed analysis of the learned hierarchical latent representation and show that our new inference model is qualitatively different and utilizes a deeper more distributed hierarchy of latent variables. Finally, we observe that batch-normalization and deterministic warm-up (gradually turning on the KL-term) are crucial for training variational models with many stochastic layers.

1 Introduction

The recently introduced variational autoencoder (VAE) [10, 19] provides a framework for deep generative models. In this work we study how the variational inference in such models can be improved while not changing the generative model. We introduce a new inference model using the same top-down dependency structure in both the inference and generative models, achieving state-of-the-art generative performance.

VAEs, consisting of hierarchies of conditional stochastic variables, are highly expressive models retaining the computational efficiency of fully factorized models, Figure 1 a). Although highly flexible, these models are difficult to optimize for deep hierarchies due to the multiple layers of conditional stochastic variables. The VAEs considered here are trained by optimizing a variational approximate posterior lower bounding the intractable true posterior. 
Recently, inference has typically been calculated purely bottom-up, with no interaction between the inference and generative models [10, 18, 19]. We propose a new structured inference model using the same top-down dependency structure in both the inference and generative models. Here the approximate posterior distribution can be viewed as merging information from a bottom-up computed approximate likelihood term with top-down prior information from the generative distribution, see Figure 1 b). The sharing of information (and parameters) with the generative model gives the inference model knowledge of the current state of the generative model in each layer. The top-down pass then recursively corrects the generative distribution with a data-dependent approximate likelihood using a simple precision-weighted addition.

⁎Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark
†Department of Computer Science, Aalto University, Finland
‡Department of Applied Mathematics and Computer Science, Technical University of Denmark

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Inference (or encoder/recognition) and generative (or decoder) models for a) VAE and b) LVAE. Circles are stochastic variables and diamonds are deterministic variables.

Figure 2: MNIST train (full lines) and test (dashed lines) set log-likelihood using one importance sample during training. The LVAE improves performance significantly over the regular VAE.

This parameterization allows interactions between the bottom-up and top-down signals resembling the recently proposed Ladder Network [22, 17], and we therefore denote it Ladder-VAE (LVAE). 
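The top-down generative hierarchy just described, where each stochastic layer is a fully factorized Gaussian conditioned on the layer above, can be sketched by ancestral sampling. The following is a minimal NumPy sketch; the layer widths, random linear maps, and fixed unit variances are illustrative assumptions standing in for the paper's learned MLP mappings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths from the top z_L down to z_1; the paper's five-layer models
# use 4-8-16-32-64 in this direction (chosen here for illustration).
sizes = [4, 8, 16, 32, 64]

# Toy parameters: one random linear map per layer transition standing in
# for the learned mappings that output mu_p (sigma_p^2 is fixed below).
params = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

# Ancestral sampling: z_L ~ N(0, I), then z_i ~ N(mu_p(z_{i+1}), sigma_p^2(z_{i+1})).
z = rng.standard_normal(sizes[0])
for w, b in params:
    mu_p = z @ w + b
    sigma2_p = np.ones_like(mu_p)  # toy choice: unit variance at every layer
    z = mu_p + np.sqrt(sigma2_p) * rng.standard_normal(mu_p.shape)

# z now plays the role of z_1; an observation model p(x|z_1) would map it to data space.
```

Each layer's sample feeds the next conditional, which is what lets the lower latent layers be highly correlated even though each conditional is fully factorized.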
For the remainder of this paper we will refer to the VAE as both the inference and generative model seen in Figure 1 a), and similarly to the LVAE as both the inference and generative model in Figure 1 b). We stress that the VAE and LVAE have identical generative models and only differ in the inference models.

Previous work on VAEs has been restricted to shallow models with one or two layers of stochastic latent variables. The performance of such models is constrained by the restrictive mean field approximation to the intractable posterior distribution. Here we found that purely bottom-up inference optimized with gradient ascent is only able to utilize more than two layers of stochastic latent variables to a limited degree. We initially show that a warm-up period [2, 16, Section 6.2], to support stochastic units staying active in early training, and batch-normalization (BN) [7] can significantly improve the performance of VAEs. Using these VAE models as competitive baselines we show that the LVAE improves the generative performance, achieving as good or better performance than other (often complicated) methods for creating flexible variational distributions, such as the Variational Gaussian Processes [21], Normalizing Flows [18], Importance Weighted Autoencoders [3] or Auxiliary Deep Generative Models [13]. Compared to the bottom-up inference in VAEs we find that the LVAE: 1) has better generative performance, 2) provides a tighter bound on the true log-likelihood, and 3) can utilize deeper and more distributed hierarchies of stochastic variables. Lastly we study the learned latent representations and find that these differ qualitatively between the LVAE and VAE, with the LVAE capturing more high-level structure in the datasets. 
In summary our contributions are:

• A new inference model combining a Gaussian term, akin to an approximate Gaussian likelihood, with the generative model, resulting in better generative performance than the normally used bottom-up VAE inference.

• A detailed study of the learned latent distributions, showing that the LVAE learns both a deeper and more distributed representation when compared to the VAE.

• We show that a deterministic warm-up period and batch-normalization are important for training deep stochastic models.

2 Methods

VAEs and LVAEs simultaneously train a generative model p_θ(x, z) = p_θ(x|z)p_θ(z) for data x using latent variables z, and an inference model q_φ(z|x), by optimizing a variational lower bound to the likelihood p_θ(x) = ∫ p_θ(x, z)dz. In the generative model p_θ, the latent variables z are split into L layers z_i, i = 1...L, and each stochastic layer is a fully factorized Gaussian distribution conditioned on the layer above:

p_θ(z) = p_θ(z_L) ∏_{i=1}^{L-1} p_θ(z_i|z_{i+1})   (1)
p_θ(z_i|z_{i+1}) = N(z_i | µ_{p,i}(z_{i+1}), σ^2_{p,i}(z_{i+1})),  p_θ(z_L) = N(z_L | 0, I)   (2)
p_θ(x|z_1) = N(x | µ_{p,0}(z_1), σ^2_{p,0}(z_1))  or  p_θ(x|z_1) = B(x | µ_{p,0}(z_1)),   (3)

where the observation model matches either continuous-valued (Gaussian N) or binary-valued (Bernoulli B) data, respectively. We use subscripts p and q to highlight whether µ or σ^2 belongs to the generative or inference distribution, respectively. Note that while the individual conditional distributions are fully factorized, the hierarchical specification allows the lower layers of the latent variables to be highly correlated. The variational principle provides a tractable lower bound on the log-likelihood which can be used as a training criterion L:

log p(x) ≥ E_{q_φ(z|x)}[log (p_θ(x, z) / q_φ(z|x))] = L(θ, φ; x)   (4)
= -KL(q_φ(z|x)||p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)],   (5)

where KL is the Kullback-Leibler divergence. A strictly tighter bound on the likelihood may be obtained at the expense of a K-fold increase in samples by using the importance weighted bound [3]:

log p(x) ≥ E_{q_φ(z^(1)|x)} ... E_{q_φ(z^(K)|x)}[log (1/K) ∑_{k=1}^{K} p_θ(x, z^(k)) / q_φ(z^(k)|x)] = L_K(θ, φ; x) ≥ L(θ, φ; x).   (6)

The generative and inference parameters, θ and φ, are jointly trained by optimizing Eq. (5) using stochastic gradient descent, where we use the reparametrization trick for stochastic backpropagation through the Gaussian latent variables [10, 19]. The KL[q_φ|p_θ] term is calculated analytically at each layer when possible and otherwise approximated using Monte Carlo sampling.

2.1 Variational autoencoder inference model

VAE inference models are parameterized as a bottom-up process similar to [3, 9]. Conditioned on the stochastic layer below, each stochastic layer is specified as a fully factorized Gaussian distribution:

q_φ(z|x) = q_φ(z_1|x) ∏_{i=2}^{L} q_φ(z_i|z_{i-1})   (7)
q_φ(z_1|x) = N(z_1 | µ_{q,1}(x), σ^2_{q,1}(x))   (8)
q_φ(z_i|z_{i-1}) = N(z_i | µ_{q,i}(z_{i-1}), σ^2_{q,i}(z_{i-1})), i = 2...L.   (9)

In this parameterization the inference and generative distributions are computed separately, with no explicit sharing of information. In the beginning of the training procedure this might cause problems, since the inference model has to approximately match the highly variable generative distribution in order to optimize the likelihood. The functions µ(·) and σ^2(·) in the generative and VAE inference models are implemented as:

d(y) = MLP(y)   (10)
µ(y) = Linear(d(y))   (11)
σ^2(y) = Softplus(Linear(d(y))),   (12)

where MLP is a two-layered multilayer perceptron network, Linear is a single linear layer, and Softplus applies the log(1 + exp(·)) nonlinearity to each component of its argument vector, ensuring positive variances. 
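The shared-MLP parameterization of Eqs. (10)-(12) can be sketched in a few lines of NumPy. The layer sizes, random initialization, and the leaky-rectifier nonlinearity are illustrative assumptions (the actual models are trained networks), but the structure, one shared deterministic trunk with separate Linear heads for µ and σ², follows the equations:

```python
import numpy as np

rng = np.random.default_rng(1)

def softplus(a):
    # log(1 + exp(a)), computed stably; guarantees strictly positive variances
    return np.logaddexp(0.0, a)

def make_linear(n_in, n_out):
    w, b = rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)
    return lambda x: x @ w + b

def make_gaussian_head(n_in, n_hid, n_out):
    """d(y) = MLP(y) with two deterministic layers; mu and sigma^2 each get
    their own (unshared) Linear layer on top of the shared d."""
    h1, h2 = make_linear(n_in, n_hid), make_linear(n_hid, n_hid)
    mu_head, var_head = make_linear(n_hid, n_out), make_linear(n_hid, n_out)
    leaky = lambda a: np.maximum(a, 0.1 * a)          # leaky rectifier
    def forward(y):
        d = leaky(h2(leaky(h1(y))))                   # shared MLP part
        return mu_head(d), softplus(var_head(d))      # unshared Linear heads
    return forward

head = make_gaussian_head(n_in=784, n_hid=512, n_out=64)
mu, var = head(rng.random((8, 784)))   # a batch of 8 inputs
```

Because the Softplus output is strictly positive, the predicted variances are always valid, with no clipping needed.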
In our notation, each MLP(·) or Linear(·) gives a new mapping with its own parameters, so the deterministic variable d is used to mark that the MLP part is shared between µ and σ^2, whereas the last Linear layer is not shared.

2.2 Ladder variational autoencoder inference model

We propose a new inference model that recursively corrects the generative distribution with a data-dependent approximate likelihood term. First, a deterministic upward pass computes the Gaussian likelihood-like contributions:

d_n = MLP(d_{n-1})   (13)
µ̂_{q,i} = Linear(d_i), i = 1...L   (14)
σ̂^2_{q,i} = Softplus(Linear(d_i)), i = 1...L,   (15)

where d_0 = x. This is followed by a stochastic downward pass recursively computing both the approximate posterior and generative distributions:

q_φ(z|x) = q_φ(z_L|x) ∏_{i=1}^{L-1} q_φ(z_i|z_{i+1}, x)   (16)
σ^2_{q,i} = 1 / (σ̂^{-2}_{q,i} + σ^{-2}_{p,i})   (17)
µ_{q,i} = (µ̂_{q,i} σ̂^{-2}_{q,i} + µ_{p,i} σ^{-2}_{p,i}) / (σ̂^{-2}_{q,i} + σ^{-2}_{p,i})   (18)
q_φ(z_i|·) = N(z_i | µ_{q,i}, σ^2_{q,i}),   (19)

where µ_{q,L} = µ̂_{q,L} and σ^2_{q,L} = σ̂^2_{q,L}. The inference model is a precision-weighted combination of µ̂_q and σ̂^2_q, carrying bottom-up information, and µ_p and σ^2_p from the generative distribution, carrying top-down prior information. This parameterization has a probabilistic motivation: µ̂_q and σ̂^2_q can be viewed as an approximate Gaussian likelihood that is combined with a Gaussian prior µ_p and σ^2_p from the generative distribution. 

Figure 3: MNIST log-likelihood values for VAEs and the LVAE model with different numbers of latent layers, batch-normalization (BN) and warm-up (WU). a) Train log-likelihood, b) test log-likelihood and c) test log-likelihood with 5000 importance samples.
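The precision-weighted combination of Eqs. (17)-(18) is elementwise over a layer's units and can be written directly; a short NumPy sketch (the function name is ours):

```python
import numpy as np

def precision_weighted_merge(mu_hat, var_hat, mu_p, var_p):
    """Combine the bottom-up approximate likelihood N(mu_hat, var_hat) with the
    top-down prior N(mu_p, var_p), per Eqs. (17)-(18); all ops are elementwise."""
    prec_hat, prec_p = 1.0 / var_hat, 1.0 / var_p
    var_q = 1.0 / (prec_hat + prec_p)
    mu_q = (mu_hat * prec_hat + mu_p * prec_p) * var_q
    return mu_q, var_q

# A confident (low-variance) bottom-up signal dominates the merged posterior:
mu_q, var_q = precision_weighted_merge(np.array([2.0]), np.array([0.01]),
                                       np.array([0.0]), np.array([1.0]))
```

Here the merged mean lands close to the bottom-up mean 2.0 because the bottom-up precision (100) far exceeds the prior precision (1), and the merged variance is smaller than either input, which is exactly the conjugate Gaussian prior-times-likelihood behavior the text describes.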
Together these form the approximate posterior distribution q_φ(z|x), using the same top-down dependency structure in both the inference and generative model. A line of motivation, already noted in [4] (see [1] for a recent approach), is that a purely bottom-up inference process, as in VAEs, does not correspond well with real perception, where iterative interaction between bottom-up and top-down signals produces the final activity of a unit⁴. Notably, it is difficult for purely bottom-up inference networks to model the explaining-away phenomenon; see [23, Chapter 5] for a recent discussion. The LVAE model provides a framework with the wanted interaction, while not increasing the number of parameters.

2.3 Warm-up from deterministic to variational autoencoder

The variational training criterion in Eq. (5) contains the reconstruction term p_θ(x|z) and the variational regularization term. The variational regularization term causes some of the latent units to become uninformative during training [14] because the approximate posterior for unit k, q(z_{i,k}|...), is regularized towards its own prior p(z_{i,k}|...), a phenomenon also recognized in the VAE setting [3, 2]. This can be seen as a virtue of automatic relevance determination, but also as a problem when many units collapse early in training before they have learned a useful representation. 
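The warm-up scheme named above amounts to scaling the KL term by a factor β that is annealed from 0 to 1, as formalized in Eq. (20). A minimal sketch, with the linear schedule over the first N_t epochs as described in this section (the function names are ours):

```python
def warmup_beta(epoch, n_t=200):
    """Linear warm-up: beta rises from 0 to 1 over the first n_t epochs,
    then stays at 1 (n_t = 200 is the MNIST setting used in the experiments)."""
    return min(1.0, epoch / float(n_t))

def warmed_up_objective(expected_log_px_given_z, kl_q_p, epoch, n_t=200):
    # L_WU = -beta * KL(q(z|x) || p(z)) + E_q[log p(x|z)]
    return expected_log_px_given_z - warmup_beta(epoch, n_t) * kl_q_p
```

At epoch 0 the objective is the pure reconstruction term, i.e. a deterministic autoencoder; from epoch n_t onward it is the usual lower bound of Eq. (5).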
We observed that such units remain uninformative for the rest of the training, presumably trapped in a local minimum or saddle point at KL(q_{i,k}|p_{i,k}) ≈ 0, with the optimization algorithm unable to re-activate them.

⁴The idea was dismissed at the time, since it could introduce substantial theoretical complications.

We alleviate the problem by initializing training using the reconstruction error only (corresponding to training a standard deterministic auto-encoder), and then gradually introducing the variational regularization term:

L(θ, φ; x)_WU = -β KL(q_φ(z|x)||p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)],   (20)

where β is increased linearly from 0 to 1 during the first N_t epochs of training. We denote this scheme warm-up (abbreviated WU in tables and graphs) because the objective goes from having a delta-function solution (corresponding to zero temperature) towards the fully stochastic variational objective. This idea has previously been considered in [16, Section 6.2] and more recently in [2].

3 Experiments

To test our models we use the standard benchmark datasets MNIST, OMNIGLOT [11] and NORB [12]. The largest models trained used a hierarchy of five layers of stochastic latent variables of sizes 64, 32, 16, 8 and 4, going from bottom to top. We implemented all mappings using MLPs with two layers of deterministic hidden units. In all models the MLPs between x and z_1 or d_1 were of size 512. Subsequent layers were connected by MLPs of sizes 256, 128, 64 and 32 for all connections in both the VAE and LVAE. Shallower models were created by removing latent variables from the top of the hierarchy. We sometimes refer to the five-layer models as 64-32-16-8-4, the four-layer models as 64-32-16-8, and so forth. The models were trained end-to-end using the Adam [8] optimizer with a mini-batch size of 256. We report the train and test log-likelihood lower bounds, Eq. 
(5), as well as the approximated true log-likelihood calculated using 5000 importance weighted samples, Eq. (6). The models were implemented using the Theano [20], Lasagne [5] and Parmesan⁵ frameworks. The source code is available at github⁶.

For MNIST, we used a sigmoid output layer to predict the mean of a Bernoulli observation model and leaky rectifiers (max(x, 0.1x)) as nonlinearities in the MLPs. The models were trained for 2000 epochs with a learning rate of 0.001 on the complete training set. Models using warm-up used N_t = 200. Similarly to [3], we resample the binarized training values from the real-valued images using a Bernoulli distribution after each epoch, which prevents the models from over-fitting. Some of the models were fine-tuned by continuing training for 2000 epochs while multiplying the learning rate by 0.75 after every 200 epochs and increasing the number of Monte Carlo and importance weighted samples to 10, to reduce the variance in the approximation of the expectations in Eq. (4) and improve the inference model, respectively.

Models trained on the OMNIGLOT dataset⁷, consisting of 28x28 binary images, were trained similarly to the above except that the number of training epochs was 1500.

Models trained on the NORB dataset⁸, consisting of 32x32 gray-scale images with color-coding rescaled to [0, 1], used a Gaussian observation model with mean and variance predicted using a linear and a softplus output layer respectively. 
The settings were similar to the models above, except that the hyperbolic tangent was used as the nonlinearity in the MLPs and the number of training epochs was 2000.

⁵https://github.com/casperkaae/parmesan
⁶https://github.com/casperkaae/LVAE
⁷The OMNIGLOT data was partitioned and preprocessed as in [3], https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT
⁸The NORB dataset was downloaded in resized format from github.com/gwtaylor/convnet_matlab

3.1 Generative log-likelihood performance

In Figure 3 we show the train and test set log-likelihood on the MNIST dataset for a series of different models with varying numbers of stochastic layers.

Consider the L^test_1 scores, Figure 3 b): the VAE without batch-normalization and warm-up does not improve for additional stochastic layers beyond one, whereas VAEs with batch-normalization and warm-up improve performance up to three layers. The LVAE models perform better, improving performance for each additional layer and reaching L^test_1 = -85.23 with five layers, which is significantly higher than the best VAE score at -87.49 using three layers. As expected the improvement in performance is

Figure 4: log KL(q|p) for each latent unit is shown at different training epochs. Low KL (white) corresponds to an uninformative unit. The units are sorted for visualization. It is clear that the vanilla VAE cannot train the higher latent layers, while introducing batch-normalization helps. Warm-up creates more active units early in training, some of which are then gradually pruned away during training, resulting in a more distributed final representation. 
Lastly, we see that the LVAE activates the highest number of units in each layer.

                                       ≤ log p(x)
VAE, 1-layer + NF [18]                   -85.10
IWAE, 2-layer + IW=1 [3]                 -85.33
IWAE, 2-layer + IW=50 [3]                -82.90
VAE, 2-layer + VGP [21]                  -81.90
LVAE, 5-layer                            -82.12
LVAE, 5-layer + finetuning               -81.84
LVAE, 5-layer + finetuning + IW=10       -81.74

Table 1: Test set MNIST performance for the importance weighted autoencoder (IWAE), VAE with normalizing flows (NF) and VAE with variational Gaussian process (VGP). The number of importance weighted (IW) samples used for training is one unless otherwise stated.

decreasing for each additional layer, but we emphasize that the improvements are consistent even for the addition of the top-most layers. We found that batch-normalization improved performance for all models; however, especially for the LVAE we found batch-normalization to be important. Figure 3 c) shows the approximate true log-likelihood estimated using 5000 importance weighted samples. Again the LVAE models perform better than the VAE, reaching L^test_5000 = -82.12 compared to the best VAE at -82.74. These results show that the LVAE achieves both a higher approximate log-likelihood score and a significantly tighter lower bound on the log-likelihood L^test_1. The models in Figure 3 were trained using a fixed learning rate and one Monte Carlo and importance weighted sample. To improve performance we fine-tuned the best performing five-layer LVAE models by training these for a further 2000 epochs with an annealed learning rate and an increased number of IW samples, and see a slight improvement in the test set log-likelihood values, Table 1. 
We saw no signs of over-fitting for any of our models, even though the hierarchical latent representations are highly expressive, as seen in Figure 2.

Comparing the results obtained here with current state-of-the-art results on permutation invariant MNIST, Table 1, we see that the LVAE performs better than the normalizing flow VAE and importance weighted VAE, and is comparable to the variational Gaussian process VAE. However, we note that these results are not directly comparable due to differences in the training procedure.

                   VAE      VAE+BN   VAE+BN+WU   LVAE+BN+WU
OMNIGLOT
64               -111.21   -105.62    -104.51        -
64-32            -110.58   -105.51    -102.61     -102.63
64-32-16         -111.26   -106.09    -102.52     -102.18
64-32-16-8       -111.58   -105.66    -102.66     -102.21
64-32-16-8-4     -110.46   -105.45    -102.48     -102.11
NORB
64                 2741      3198       3338          -
64-32              2792      3224       3483        3272
64-32-16           2786      3235       3492        3519
64-32-16-8         2689      3201       3482        3449
64-32-16-8-4       2654      3198       3422        3455

Table 2: Test set log-likelihood scores for models trained on the OMNIGLOT and NORB datasets. The leftmost column shows the dataset and the number of latent variables in each model.

To test the models on more challenging data we used the OMNIGLOT dataset, consisting of characters from 50 different alphabets with 20 samples of each character. The log-likelihood values, Table 2, show similar trends as for MNIST, with the LVAE achieving the best performance using five layers 
The best log-likelihood results obtained here,\n102.11, is higher than the best results from [3] at 103.38, which were obtained using more latent\nvariables (100-50 vs 64-32-16-8-4) and further using 50 importance weighted samples for training.\nWe tested the models using a continuous Gaussian observation model on the NORB dataset consisting\nof gray-scale images of 5 different toy objects under different illuminations and observation angles.\nThe LVAE achieves a slightly higher score than the VAE, however none of the models see an increase\nin performance for more using more than three stochastic layers. We found the Gaussian observation\nmodels to be harder to optimize compared to the Bernoulli models, a \ufb01nding also recognized in [24],\nwhich might explain the lower utilization of the topmost latent layers in these models.\n\n3.2 Latent representations\n\nThe probabilistic generative models studied here automatically tune the model complexity to the data\nby reducing the effective dimension of the latent representation due to the regularization effect of the\npriors in Eq. (4). However, as previously identi\ufb01ed [16, 3], the latent representation is often overly\nsparse with few stochastic latent variables propagating useful information.\nTo study the importance of individual units, we split the variational training criterion L into a sum\nof terms corresponding to each unit k in each layer i. For stochastic latent units, this is the KL-\ndivergence between q(zi,k|\u00b7) and p(zi,k|zi+1). Figure 4 shows the evolution of these terms during\ntraining. This term is zero if the inference model is collapsed onto the prior carrying no information\nabout the data, making the unit uninformative. For the models without warm-up we \ufb01nd that the\nKL-divergence for each unit is stable during all training epochs with only very few new units activated\nduring training. 
For the models trained with warm-up we initially see many active units, which are then gradually pruned away as the variational regularization term is introduced. At the end of training, warm-up results in more active units, indicating a more distributed representation, and further the LVAE model produces both the deepest and most distributed latent representation.

We also study the importance of layers by splitting the training criterion layer-wise, as seen in Figure 5. This measures how much of the representation work (or innovation) is done in each layer. The VAEs use the lower layers the most, whereas the highest layers are not (or only to a limited degree) used. Contrary to this, the LVAE puts much more importance on the higher layers, which shows that it learns both a deeper and a qualitatively different hierarchical latent representation, which might explain the better performance of the model.

Figure 5: Layer-wise KL[q|p] divergence going from the lowest to the highest layers. In the VAE models the KL divergence is highest in the lowest layers, whereas it is more distributed in the LVAE model.

Figure 6: PCA plots of samples from q(z_i|z_{i-1}) for 5-layer VAE and LVAE models trained on MNIST. Color-coded according to true class label.

To qualitatively study the learned representations, PCA plots of z_i ∼ q(z_i|·) are seen in Figure 6. For the vanilla VAE, the latent representations above the second layer are completely collapsed onto a standard normal prior. Including batch-normalization and warm-up each activates one additional layer in the VAE. The LVAE utilizes all five latent layers and the latent representation shows progressively more clustering according to class, which is clearly seen in the 
These \ufb01ndings indicate that the LVAE produce a structured high-level\nlatent representations that are likely useful for semi-supervised learning.\n\n4 Conclusion and Discussion\n\nWe presented a new inference model for VAEs combining a bottom-up data-dependent approximate\nlikelihood term with prior information from the generative distribution. We showed that this param-\neterization 1) increases the approximated log-likelihood compared to VAEs, 2) provides a tighter\nbound on the log-likelihood and 3) learns a deeper and qualitatively different latent representation of\nthe data. Secondly we showed that deterministic warm-up and batch-normalization are important for\noptimizing deep VAEs and LVAEs. Especially the large bene\ufb01ts in generative performance and depth\nof learned hierarchical representations using batch-normalization were surprising given the additional\nnoise introduced. This is something that is not fully understood and deserves further investigation\nand although batch-normalization is not novel we believe that this \ufb01nding in the context of VAEs are\nimportant.\nThe inference in LVAE is computed recursively by correcting the generative distribution with a\ndata-dependent approximate likelihood contribution. Compared to purely bottom-up inference,\nthis parameterization makes the optimization easier since the inference is simply correcting the\ngenerative distribution instead of \ufb01tting the two models separately. We believe this explicit parameter\nsharing between the inference and generative distribution can generally be bene\ufb01cial in other types\nof recursive variational distributions such as DRAW [6] where the ideas presented here are directly\napplicable. 
Further, the LVAE is orthogonal to other methods for improving the inference distribution, such as normalizing flows [18], the variational Gaussian process [21] or auxiliary deep generative models [13], and combining it with these might provide further improvements.

Other directions for future work include extending these models to semi-supervised learning, which will likely benefit from the learned deep structured hierarchies of latent variables, and studying more elaborate inference schemes such as a k-step iterative inference in the LVAE [15].

References

[1] J. Bornschein, S. Shabanian, A. Fischer, and Y. Bengio. Bidirectional Helmholtz machines. arXiv preprint arXiv:1506.03877, 2015.
[2] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[3] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[4] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.
[5] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, A. van den Oord, and E. B. Lasagne: First release, Aug. 2015.
[6] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[8] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
[10] D. P. Kingma and M. Welling. 
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[11] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, 2013.
[12] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition. IEEE, 2004.
[13] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[14] D. J. MacKay. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Inference Group, Cavendish Laboratory, Cambridge, UK, 2001.
[15] T. Raiko, Y. Li, K. Cho, and Y. Bengio. Iterative neural autoregressive distribution estimator NADE-k. In Advances in Neural Information Processing Systems, 2014.
[16] T. Raiko, H. Valpola, M. Harva, and J. Karhunen. Building blocks for variational Bayesian learning of latent variable models. The Journal of Machine Learning Research, 8, 2007.
[17] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.
[18] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[19] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[20] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[21] D. Tran, R. Ranganath, and D. M. Blei. Variational Gaussian process. arXiv preprint arXiv:1511.06499, 2015.
[22] H. Valpola. 
From neural PCA to deep unsupervised learning. arXiv preprint arXiv:1411.7783, 2015.
[23] G. van den Broeke. What auto-encoders could learn from brains - generation as feedback in unsupervised deep learning and inference. MSc thesis, Aalto University, Finland, 2016.
[24] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
", "award": [], "sourceid": 1855, "authors": [{"given_name": "Casper Kaae", "family_name": "S\u00f8nderby", "institution": "University of Copenhagen"}, {"given_name": "Tapani", "family_name": "Raiko", "institution": "Apple Inc."}, {"given_name": "Lars", "family_name": "Maal\u00f8e", "institution": "Technical University of Denmark"}, {"given_name": "S\u00f8ren Kaae", "family_name": "S\u00f8nderby", "institution": "KU"}, {"given_name": "Ole", "family_name": "Winther", "institution": "Technical University of Denmark"}]}