{"title": "Learning Hierarchical Priors in VAEs", "book": "Advances in Neural Information Processing Systems", "page_first": 2870, "page_last": 2879, "abstract": "We propose to learn a hierarchical prior in the context of variational autoencoders to avoid the over-regularisation resulting from a standard normal prior distribution. To incentivise an informative latent representation of the data, we formulate the learning problem as a constrained optimisation problem by extending the Taming VAEs framework to two-level hierarchical models. We introduce a graph-based interpolation method, which shows that the topology of the learned latent representation corresponds to the topology of the data manifold---and present several examples, where desired properties of latent representation such as smoothness and simple explanatory factors are learned by the prior.", "full_text": "Learning Hierarchical Priors in VAEs\n\nAlexej Klushyn1 2 Nutan Chen1 Richard Kurle1 2 Botond Cseke1 Patrick van der Smagt1\n\n1Machine Learning Research Lab, Volkswagen Group, Germany\n\n2Department of Informatics, Technical University of Munich, Germany\n\n{alexej.klushyn, nutan.chen, richardk, botond.cseke, smagt}@argmax.ai\n\nAbstract\n\nWe propose to learn a hierarchical prior in the context of variational autoencoders\nto avoid the over-regularisation resulting from a standard normal prior distribution.\nTo incentivise an informative latent representation of the data, we formulate the\nlearning problem as a constrained optimisation problem by extending the Taming\nVAEs framework to two-level hierarchical models. 
We introduce a graph-based\ninterpolation method, which shows that the topology of the learned latent representation\ncorresponds to the topology of the data manifold, and present several\nexamples, where desired properties of latent representation such as smoothness\nand simple explanatory factors are learned by the prior.\n\n1 Introduction\n\nVariational autoencoders (VAEs) [15, 24] are a class of probabilistic latent variable models for\nunsupervised learning. The learned generative model and the corresponding (approximate) posterior\ndistribution of the latent variables provide a decoder/encoder pair that often captures semantically\nmeaningful features of the data. In this paper, we address the issue of learning informative latent\nrepresentations/encodings of the data.\nThe vanilla VAE uses a standard normal prior distribution for the latent variables. It has been shown\nthat this can lead to over-regularising the posterior distribution, resulting in latent representations\nthat do not represent the structure of the data well [1]. There are several approaches to alleviate this\nproblem: (i) defining and learning complex prior distributions that can better model the encoded data\nmanifold [10, 28]; (ii) using specialised optimisation algorithms, which try to find local/global minima\nof the training objective that correspond to informative latent representations [4, 27, 14, 25]; and\n(iii) adding mutual-information-based constraints or regularisers to incentivise a good correspondence\nbetween the data and the latent variables [1, 31, 9]. In this paper, we focus on the first two approaches.\nWe use a two-level stochastic model, where the first layer corresponds to the latent representation and\nthe second layer models a hierarchical prior (continuous mixture). 
In order to learn such hierarchical\npriors, we extend the optimisation framework introduced in [25], where the authors reformulate the\nVAE objective as the Lagrangian of a constrained optimisation problem. They impose an inequality\nconstraint on the reconstruction error and use the KL divergence between the approximate posterior\nand the standard normal prior as the optimisation objective. We substitute the standard normal prior\nwith the hierarchical one and use an importance-weighted bound [5] to approximate the resulting\nintractable marginal. Concurrently, we introduce the associated optimisation algorithm, which is\ninspired by GECO [25]; the latter does not always lead to good encodings (e.g., Sec. 4.1). Our\napproach better avoids posterior collapse and enhances interpretability compared to similar methods.\nWe adopt the manifold hypothesis [6, 26] to validate the quality of a latent representation. We do this\nby proposing a nearest-neighbour graph-based method for interpolating between different data points\nalong the learned data manifold in the latent space.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Methods\n\n2.1 VAEs as a Constrained Optimisation Problem\nVAEs model the distribution of i.i.d. data $D = \\{x_i\\}_{i=1}^N$ as the marginal\n\n$$\\prod_i p_\\theta(x_i) = \\prod_i \\int p_\\theta(x_i|z)\\, p(z)\\, dz. \\qquad (1)$$\n\nThe model parameters are learned through amortised variational EM, which requires learning an\napproximate posterior distribution $q_\\phi(z|x_i) \\approx p_\\theta(z|x_i)$. It is hoped that the learned $q_\\phi(z|x)$ and\n$p_\\theta(x|z)$ result in an informative latent representation of the data. For example, $\\{\\mathbb{E}_{q_\\phi(z|x_i)}[z]\\}_{i=1}^N$\ncluster w.r.t. some discrete features or important factors of variation in the data. In Sec. 
4.1, we show\na toy example, where the model can learn the true underlying factors of variation in D.\nAmortised variational EM in VAEs maximises the evidence lower bound (ELBO) [15, 24]:\n\n$$\\mathbb{E}_{p_D(x)}\\big[\\log p_\\theta(x)\\big] \\geq \\mathcal{F}_{\\mathrm{ELBO}}(\\theta, \\phi) \\equiv \\mathbb{E}_{p_D(x)}\\Big[\\mathbb{E}_{q_\\phi(z|x)}\\big[\\log p_\\theta(x|z)\\big] - \\mathrm{KL}\\big(q_\\phi(z|x) \\,\\|\\, p(z)\\big)\\Big], \\qquad (2)$$\n\nwhere $q_\\phi(z|x)$ and $p_\\theta(x|z)$ are typically assumed to be diagonal Gaussians with their parameters\ndefined as neural network functions of the conditioning variables. $p_D(x) = \\frac{1}{N} \\sum_{i=1}^{N} \\delta(x - x_i)$\nstands for the empirical distribution of D. The (EM) optimisation problem [e.g. 21] is formulated as\n\n$$\\min_{\\theta, \\phi} \\, -\\mathcal{F}_{\\mathrm{ELBO}}(\\theta, \\phi) \\;\\widehat{=}\\; \\min_\\theta \\min_\\phi \\, -\\mathcal{F}_{\\mathrm{ELBO}}(\\theta, \\phi). \\qquad (3)$$\n\nThe corresponding optimisation algorithm was originally introduced as a double-loop algorithm;\nhowever, in the context of VAEs (or neural inference models in general) it is common practice to\noptimise $(\\theta, \\phi)$ jointly.\nIt has been shown that local minima with high ELBO values do not necessarily result in informative\nlatent representations [1, 14]. In order to address this problem, several approaches have been\ndeveloped, which typically result in some weighting schedule for either the negative expected\nlog-likelihood or the KL term of the ELBO [4, 27]. This is because a different ratio targets different\nregions in the rate-distortion plane, either favouring better compression or reconstruction [1].\nIn [25], the authors reformulate the VAE objective as the Lagrangian of a constrained optimisation\nproblem. They choose $\\mathrm{KL}\\big(q_\\phi(z|x) \\,\\|\\, p(z)\\big)$ as the optimisation objective and impose the inequality\nconstraint $\\mathbb{E}_{p_D(x)} \\mathbb{E}_{q_\\phi(z|x)}\\big[C_\\theta(x, z)\\big] \\leq \\kappa^2$. Typically $C_\\theta(x, z)$ is defined as the reconstruction-error-related\nterm in $-\\log p_\\theta(x|z)$. Since $\\mathbb{E}_{p_D(x)} \\mathbb{E}_{q_\\phi(z|x)}\\big[C_\\theta(x, z)\\big]$ is the average reconstruction error,\nthis formulation allows for a better control of the quality of generated data. In the resulting Lagrangian\n\n$$\\mathcal{L}(\\theta, \\phi; \\lambda) \\equiv \\mathbb{E}_{p_D(x)}\\Big[\\mathrm{KL}\\big(q_\\phi(z|x) \\,\\|\\, p(z)\\big) + \\lambda \\big(\\mathbb{E}_{q_\\phi(z|x)}\\big[C_\\theta(x, z)\\big] - \\kappa^2\\big)\\Big], \\qquad (4)$$\n\nthe Lagrange multiplier $\\lambda$ can be viewed as a weighting term for $\\mathbb{E}_{p_D(x)} \\mathbb{E}_{q_\\phi(z|x)}\\big[-\\log p_\\theta(x|z)\\big]$.\nThis approach leads to a similar optimisation objective as in [14] with $\\beta = 1/\\lambda$. The authors propose\na descent-ascent algorithm (GECO) for finding the saddle point of the Lagrangian. The parameters\n$(\\theta, \\phi)$ are optimised through gradient descent and $\\lambda$ is updated as\n\n$$\\lambda_t = \\lambda_{t-1} \\cdot \\exp\\big(\\nu \\cdot (\\hat{C}_t - \\kappa^2)\\big), \\qquad (5)$$\n\ncorresponding to a quasi-gradient ascent due to $\\Delta\\lambda_t \\cdot \\partial_\\lambda \\mathcal{L} \\geq 0$; $\\nu$ is the update\u2019s learning rate. In the\ncontext of stochastic batch gradient training, $\\hat{C}_t \\approx \\mathbb{E}_{p_D(x)} \\mathbb{E}_{q_\\phi(z|x)}\\big[C_\\theta(x, z)\\big]$ is estimated as the running\naverage $\\hat{C}_t = (1 - \\alpha) \\cdot \\hat{C}_{ba} + \\alpha \\cdot \\hat{C}_{t-1}$, where $\\hat{C}_{ba}$ is the batch average $\\mathbb{E}_{p_D(x_{ba})} \\mathbb{E}_{q_\\phi(z|x)}\\big[C_\\theta(x, z)\\big]$.\nTo the best of our understanding,$^1$ the GECO algorithm solves the optimisation problem\n\n$$\\min_\\theta \\max_\\lambda \\min_\\phi \\mathcal{L}(\\theta, \\phi; \\lambda) \\quad \\text{s.t.} \\quad \\lambda \\geq 0. \\qquad (6)$$\n\nHere, $\\max_\\lambda \\min_\\phi \\mathcal{L}(\\theta, \\phi; \\lambda)$ can be viewed as corresponding to the E-step of the EM algorithm. However,\nin general this objective can only be guaranteed to be the ELBO if $\\lambda = 1$, or in case of $0 \\leq \\lambda < 1$, a\nscaled lower bound on the ELBO.\n\n$^1$ The optimisation problem is not explicitly stated in [25].\n\n2.2 Hierarchical Priors for Learning Informative Latent Representations\n\nIn this section, we propose a hierarchical prior for VAEs within the constrained optimisation setting.\nOur goal is to incentivise the learning of informative latent representations and to avoid\nover-regularising the posterior distribution (i) by increasing the complexity of the prior distribution $p(z)$,\nand (ii) by providing an optimisation method to learn such models.\nIt has been shown in [28] that the optimal empirical Bayes prior is the aggregated posterior distribution\n$p^*(z) = \\mathbb{E}_{p_D(x)}\\big[q_\\phi(z|x)\\big]$. We follow [28] to approximate this distribution in the form of a mixture\ndistribution. However, we opt for a continuous mixture/hierarchical model\n\n$$p_\\Theta(z) = \\int p_\\Theta(z|\\zeta)\\, p(\\zeta)\\, d\\zeta, \\qquad (7)$$\n\nwith a standard normal $p(\\zeta)$. This leads to a hierarchical model with two stochastic layers. As a\nresult, intuitively, our approach inherently favours the learning of continuous latent features. We refer\nto this model as the variational hierarchical prior (VHP).\nIn order to learn the parameters in Eq. (7), we propose to adapt the constrained optimisation problem\nin Sec. 2.1 to hierarchical models. For this purpose we use an importance-weighted (IW) bound\n[5], and the corresponding proposal distribution $q_\\Phi$, to introduce a sequence of upper bounds\n\n$$\\mathbb{E}_{p_D(x)} \\mathrm{KL}\\big(q_\\phi(z|x) \\,\\|\\, p(z)\\big) \\leq \\mathcal{F}(\\phi, \\Theta, \\Phi) \\equiv \\mathbb{E}_{p_D(x)} \\mathbb{E}_{q_\\phi(z|x)}\\bigg[\\log q_\\phi(z|x) - \\mathbb{E}_{\\zeta_{1:K} \\sim q_\\Phi(\\zeta|z)}\\bigg[\\log \\frac{1}{K} \\sum_{k=1}^{K} \\frac{p_\\Theta(z|\\zeta_k)\\, p(\\zeta_k)}{q_\\Phi(\\zeta_k|z)}\\bigg]\\bigg], \\qquad (8)$$\n\nwith K importance weights, resulting in an upper bound on Eq. (4):\n\n$$\\mathcal{L}(\\theta, \\phi; \\lambda) \\leq \\mathcal{F}(\\phi, \\Theta, \\Phi) + \\lambda \\big(\\mathbb{E}_{p_D(x)} \\mathbb{E}_{q_\\phi(z|x)}\\big[C_\\theta(x, z)\\big] - \\kappa^2\\big) \\equiv \\mathcal{L}_{\\mathrm{VHP}}(\\theta, \\phi, \\Theta, \\Phi; \\lambda). \\qquad (9)$$\n\nAs a result, we arrive at the optimisation problem\n\n$$\\min_{\\Theta, \\Phi} \\min_\\theta \\max_\\lambda \\min_\\phi \\mathcal{L}_{\\mathrm{VHP}}(\\theta, \\phi, \\Theta, \\Phi; \\lambda) \\quad \\text{s.t.} \\quad \\lambda \\geq 0, \\qquad (10)$$\n\nwhich we can optimise by the following double-loop algorithm: (i) in the outer loop we\nupdate the bound w.r.t. $(\\Theta, \\Phi)$; (ii) in the inner loop we solve the optimisation problem\n$\\min_\\theta \\max_\\lambda \\min_\\phi \\mathcal{L}_{\\mathrm{VHP}}(\\theta, \\phi, \\Theta, \\Phi; \\lambda)$ by applying an update scheme for $\\lambda$ and $\\beta = 1/\\lambda$, respectively.\nIn the following, we use the $\\beta$-parameterisation to be in line with [e.g. 14, 27].\nIn the GECO update scheme (Eq. (5)), $\\beta$ increases/decreases until $\\hat{C}_t = \\kappa^2$. However, provided the\nconstraint is fulfilled, we want to obtain a tight lower bound on the log-likelihood. As discussed in\nSec. 2.1, this holds when $\\beta = 1$ (ELBO); in case of $\\beta > 1$, we would optimise a scaled lower bound\non the ELBO. Therefore, we propose to replace the corresponding $\\beta$-version of Eq. (5) by\n\n$$\\beta_t = \\beta_{t-1} \\cdot \\exp\\big[\\nu \\cdot f_\\beta(\\beta_{t-1}, \\hat{C}_t - \\kappa^2; \\tau) \\cdot (\\hat{C}_t - \\kappa^2)\\big], \\qquad (11)$$\n\nwhere we define\n\n$$f_\\beta(\\beta, \\delta; \\tau) = \\big(1 - H(\\delta)\\big) \\cdot \\tanh\\big(\\tau \\cdot (\\beta - 1)\\big) - H(\\delta). \\qquad (12)$$\n\nHere, $H(\\cdot)$ is the Heaviside function and we introduce a slope parameter $\\tau$. This update can be\ninterpreted as follows. If the constraint is violated, i.e. $\\hat{C}_t > \\kappa^2$, the update scheme is equal to Eq. 
(5).\nIn case the constraint is fulfilled, the tanh term guarantees that we finish at $\\beta = 1$, to obtain/optimise\nthe ELBO at the end of the training. Thus, we impose $\\beta \\in (0, 1]$, which is reasonable since $\\beta < \\beta_{\\max}$\ndoes not violate the constraint. A visualisation of the $\\beta$-update scheme is shown in Fig. 1. Note that\nthere are alternative ways to modify Eq. (5), see App. B.1; however, Eq. (11) led to the best results.\nThe double-loop approach in Eq. (10) is often computationally inefficient. Thus, we decided to run\nthe inner loop only until the constraints are satisfied and then update the bound. That is, we optimise\nEq. (10) and skip the outer loop/bound updates when the constraints are not satisfied. It turned out\nthat the bound updates were often skipped in the initial phase, but rarely skipped later on. Hence,\nthe algorithm behaves as layer-wise pretraining [3]. For these reasons, we propose Alg. 1 (REWO)\nthat separates training into two phases: an initial phase, where we only optimise the reconstruction\nerror, and a main phase, where all parameters are updated jointly.\n\nAlgorithm 1 (REWO) Reconstruction-error-based weighting of the objective function\n\nInitialise $t = 1$\nInitialise $\\beta \\ll 1$\nInitialise INITIALPHASE = TRUE\nwhile training do\n    Read current data batch $x_{ba}$\n    Sample from variational posterior $z \\sim q_\\phi(z|x_{ba})$\n    Compute $\\hat{C}_{ba}$ (batch average)\n    $\\hat{C}_t = (1 - \\alpha) \\cdot \\hat{C}_{ba} + \\alpha \\cdot \\hat{C}_{t-1}$, ($\\hat{C}_0 = \\hat{C}_{ba}$)\n    if $\\hat{C}_t < \\kappa^2$ then\n        INITIALPHASE = FALSE\n    end if\n    if INITIALPHASE then\n        Optimise $\\mathcal{L}_{\\mathrm{VHP}}(\\theta, \\phi, \\Theta, \\Phi; \\beta)$ w.r.t. $\\theta, \\phi$\n    else\n        $\\beta \\leftarrow \\beta \\cdot \\exp\\big[\\nu \\cdot f_\\beta(\\beta_{t-1}, \\hat{C}_t - \\kappa^2; \\tau) \\cdot (\\hat{C}_t - \\kappa^2)\\big]$\n        Optimise $\\mathcal{L}_{\\mathrm{VHP}}(\\theta, \\phi, \\Theta, \\Phi; \\beta)$ w.r.t. $\\theta, \\phi, \\Theta, \\Phi$\n    end if\n    $t \\leftarrow t + 1$\nend while\n\nFigure 1: $\\beta$-update scheme: $\\Delta\\beta_t = \\beta_t - \\beta_{t-1}$ as a function of $\\beta_{t-1}$ and $\\hat{C}_t - \\kappa^2$ for $\\nu = 1$ and\n$\\tau = 3$. A comparison with the GECO update scheme can be found in App. A. (see Sec. 2.2)\n\nIn the initial phase, we start with $\\beta \\ll 1$ to enforce a reconstruction optimisation. Thus, we train the\nfirst stochastic layer for achieving a good encoding of the data through $q_\\phi(z|x)$, measured by the\nreconstruction error. To prevent $\\beta$ from becoming smaller than the initial value during the first iteration\nsteps, we start to update $\\beta$ when the condition $\\hat{C}_t < \\kappa^2$ is fulfilled. A good encoding is required\nto learn the conditionals $q_\\Phi(\\zeta|z)$ and $p_\\Theta(z|\\zeta)$ in the second stochastic layer. Otherwise, $q_\\Phi(\\zeta|z)$\nwould be strongly regularised towards $p(\\zeta)$, resulting in $\\mathrm{KL}\\big(q_\\Phi(\\zeta_i|z) \\,\\|\\, p(\\zeta_i)\\big) \\approx 0$, from which it\ntypically does not recover [27]. In the main phase, after $\\hat{C}_t < \\kappa^2$ is fulfilled, we start to optimise the\nparameters of the second stochastic layer and to update $\\beta$. This approach avoids posterior collapse in\nboth stochastic layers (see Sec. 4.1 and App. D.2), and thereby helps the prior to learn an informative\nencoding for preventing the aforementioned over-regularisation.\nThe proposed method, which is a combination of an ELBO-like Lagrangian and an IW bound, can be\ninterpreted as follows: the posterior of the first stochastic layer $q_\\phi(z|x)$ can learn an informative latent\nrepresentation due to the flexible hierarchical prior. The flexible prior, on the other hand, is achieved\nby applying an IW bound. 
Despite a diagonal Gaussian $q_\\Phi(\\zeta|z)$, the importance weighting allows\nlearning a precise conditional $p_\\Theta(z|\\zeta)$ from the standard normal distribution $p(\\zeta)$ to the aggregated\nposterior $\\mathbb{E}_{p_D(x)}[q_\\phi(z|x)]$ [11]. Alternatively, one could use, for example, a normalising flow [23].\nOtherwise, the model could compensate for a less expressive prior by regularising $q_\\phi(z|x)$, which would\nresult in a restricted latent representation (see App. B.4 for empirical evidence).\n\n2.3 Graph-Based Interpolations for Verifying Latent Representations\n\nA key reason for introducing hierarchical priors was to facilitate an informative latent representation\ndue to less over-regularisation of the posterior. To verify the quality of the latent representations, we\nbuild on the manifold hypothesis, defined in [6, 26]. The idea can be summarised by the following\nassumption: real-world data presented in high-dimensional spaces is likely to concentrate in the\nvicinity of nonlinear sub-manifolds of much lower dimensionality. Following this hypothesis, the\nquality of latent representations can be evaluated by interpolating between data points along the\nlearned data manifold in the latent space, and reconstructing this path to the observable space.\nTo implement the above idea, we propose a graph-based method [8] which summarises the continuous\nlatent space by a graph consisting of a finite number of nodes. The nodes $Z = \\{z_1, \\dots, z_N\\}$ can be\nobtained by randomly sampling N samples from the learned prior (Eq. (7)):\n\n$$z_n, \\zeta_n \\sim p_\\Theta(z|\\zeta)\\, p(\\zeta), \\quad n = 1, \\dots, N. \\qquad (13)$$\n\nThe graph is constructed by connecting each node by undirected edges to its k-nearest neighbours.\nThe edge weights are Euclidean distances in the latent space between the related node pairs. Once the\ngraph is built, interpolation between two data points $x_i$ and $x_j$ can be done as follows. We encode\nthese data points as $z_{(\\bullet)} = \\mu_\\phi(x_{(\\bullet)})$, where $\\mu_\\phi(x_{(\\bullet)})$ is the mean of $q_\\phi(z|x_{(\\bullet)})$, and add them as new\nnodes to the existing graph.\nTo find the shortest path through the graph between nodes $z_i$ and $z_j$, a classic search algorithm such\nas A* can be used. The result is a sequence $Z_{\\mathrm{path}} = (z_i, Z_{\\mathrm{sub}}, z_j)$, where $Z_{\\mathrm{sub}} \\subseteq Z$, representing the\nshortest path in the latent space along the learned latent manifold. Finally, to obtain the interpolation,\nwe reconstruct $Z_{\\mathrm{path}}$ to the observable space.\n\n3 Related Work\n\nSeveral works improve the VAE by learning more complex priors such as the stick-breaking prior\n[20], a nested Chinese Restaurant Process prior [13], Gaussian mixture priors [12], or autoregressive\npriors [10]. A closely related line of research is based on the insight that the optimal prior is the\naggregated posterior [28, 19]. The VampPrior [28] approximates the prior by a uniform mixture of\napproximate posterior distributions, evaluated at a few learned pseudo data points. In our approach,\nthe prior is approximated by using a second stochastic layer (IW bound). The authors in [19] use a\ntwo-level stochastic model with a combination of implicit and explicit distributions for the encoders\nand decoders. Inference is done through optimising a sandwich bound of the ELBO, which is specific\nto the choice of implicit distributions. 
In our work, however, we address inference using a constrained\noptimisation approach and all distributions are explicit.\nIn the context of VAEs, hierarchical latent variable models were already introduced earlier [24, 5, 27].\nCompared to our approach, these works have in common the structure of the generative model but\ndiffer in the definition of the inference models and in the optimisation procedure. In our proposed\nmethod, the VAE objective is reformulated as the Lagrangian of a constrained optimisation problem.\nThe prior of this ELBO-like Lagrangian is approximated by an IW bound. It is motivated by the fact\nthat a single stochastic layer with a flexible prior can be sufficient for modelling an informative latent\nrepresentation. The second stochastic layer is required to learn a sufficiently flexible prior.\nThe common problem of failing to use the full capacity of the model in VAEs [5] has been addressed\nby applying annealing/warm-up [4, 27]. Here, the KL divergence between the approximate posterior\nand the prior is multiplied by a weighting factor, which is linearly increased from 0 to 1 during\ntraining. However, such predefined schedules might be suboptimal. By reformulating the objective\nas a constrained optimisation problem [25], the above weighting term can be represented by a\nLagrange multiplier and updated based on the reconstruction error. Our proposed algorithm builds on\n[25], providing several modifications discussed in Sec. 2.2.\nIn [14], the authors propose a constrained optimisation framework, where the optimisation objective is\nthe expected negative log-likelihood and the constraint is imposed on the KL term; recall that in [25]\nthe roles are reversed. Instead of optimising the resulting Lagrangian, the authors choose Lagrange\nmultipliers (\u03b2 parameter) that maximise a heuristic cost for disentanglement. 
Their goal is not to\nlearn a latent representation that reflects the topology of the data but a disentangled representation,\nwhere the dimensions of the latent space correspond to various features of the data.\n\n4 Experiments\n\nTo validate our approach, we consider the following experiments. In Sec. 4.1, we demonstrate that\nour method learns to represent the degree of freedom in the data of a moving pendulum. In Sec. 4.2,\nwe generate human movements based on the learned latent representations of real-world data (CMU\nGraphics Lab Motion Capture Database). In Sec. 4.3, the marginal log-likelihood on standard datasets\nsuch as MNIST, Fashion-MNIST, and OMNIGLOT is evaluated. In Sec. 4.4, we compare our method\nwith the IWAE on the high-dimensional image datasets 3D Faces and 3D Chairs. The model architectures used in\nour experiments can be found in App. F.\n\n4.1 Artificial Pendulum Dataset\n\nWe created a dataset of 15,000 images of a moving pendulum (see Fig. 4). Each image has a size\nof 16 \u00d7 16 pixels and the joint angles are distributed uniformly in the range [0, 2\u03c0). Thus, the joint\nangle is the only degree of freedom.\n\n(a) VHP + REWO   (b) VHP + GECO\nFigure 2: (left) Latent representation of the pendulum data at different iteration steps when optimising\n$\\mathcal{L}_{\\mathrm{VHP}}(\\theta, \\phi, \\Theta, \\Phi; \\beta)$ with REWO and GECO, respectively. The top row shows the approximate posterior; the\ngreyscale encodes the variance of its standard deviation. The bottom row shows the hierarchical prior. (right) \u03b2\nas a function of the iteration steps; red lines mark the visualised iteration steps. (see Sec. 4.1)\n\n(a) VHP + REWO   (b) VHP + GECO   (c) IWAE\nFigure 3: Graph-based interpolation of the pendulum movement. The graph is based on the prior, shown in\nApp. B.5. The red curves depict the interpolations, the bluescale indicates the edge weight. (see Sec. 4.1)\n\ntop: VHP + REWO, middle: VHP + GECO, bottom: IWAE\nFigure 4: Pendulum reconstructions of the graph-based interpolation in the latent space, shown in\nFig. 3. Discontinuities are marked by blue boxes. (see Sec. 4.1)\n\nFig. 2 shows latent representations of the pendulum data learned by the VHP when applying REWO\nand GECO, respectively; the same \u03ba is used in both cases. The variance of the posterior\u2019s standard\ndeviation, expressed by the greyscale, measures whether the contribution to the ELBO is equally\ndistributed over all data points.\nTo validate whether the obtained latent representation is informative, we apply a linear regression\nafter transforming the latent space to polar coordinates. The goal is to predict the joint angle of the\npendulum. We compare REWO with GECO, and additionally with warm-up (WU) [27], a linear\nannealing schedule of \u03b2. In the latter, we use a VAE objective, defined as an ELBO/IW bound\ncombination similar to Eq. (9); the related plots are in App. B.2. Tab. 1 shows the absolute errors\n(OLS regression) for the different optimisation procedures; details on the regression can be found in\nApp. B.3. REWO leads to the most precise prediction of the ground truth.\n\nTable 1: OLS regression on the learned latent representations of the pendulum data.\n\nMETHOD                      ABSOLUTE ERROR\nVHP + REWO                  0.054\nVHP + GECO                  0.53\nVHP*                        0.49\nVHP* + WU (20 EPOCHS)       0.20\nVHP* + WU (200 EPOCHS)      0.31\n\n* VAE OBJECTIVE\n\nFurthermore, we demonstrate in App. B.4 that a less expressive posterior $q_\\Phi(\\zeta|z)$ in the second\nstochastic layer leads to poor latent representations, since the model compensates for it by restricting\n$q_\\phi(z|x)$, as discussed in Sec. 2.2.\n\n(a) VHP + REWO   (b) VampPrior   (c) IWAE\nFigure 5: Graph-based interpolation of human motions. The graphs are based on the (learned) prior distributions,\ndepicted in App. 
C.1. The bluescale indicates the edge weight. The coloured lines represent four interpolated\nmovements, which can be found in Fig. 6 and App. C. (see Sec. 4.2)\n\ntop: VHP + REWO, middle: VampPrior, bottom: IWAE\nFigure 6: Human-movement reconstructions of the graph-based interpolations in Fig. 5 (red curve).\nReconstruction of the remaining interpolations can be found in App. C.2. Discontinuities are marked by blue\nboxes. (see Sec. 4.2)\n\nFigure 7: Smoothness measure of the human-movement interpolations. For each joint, the mean\nand standard deviation of the smoothness factor are displayed. Smaller values correspond to smoother\nmovements. (see Sec. 4.2)\n\nFinally, we compare the latent representations, learned by the VHP and the IWAE, using the graph-based\ninterpolation method. The graphs, shown in Fig. 3, are built (see Sec. 2.3) based on 1000\nsamples from the prior of the respective model. The red curves depict the interpolation along the resulting\ndata manifold, between pendulum images with joint angles of 0 and 180 degrees, respectively. The\nreconstructions of the interpolations are shown in Fig. 4. The top row (VHP + REWO) shows a\nsmooth change of the joint angles, whereas the middle (VHP + GECO) and bottom row (IWAE)\ncontain discontinuities resulting in an unrealistic interpolation.\n\n4.2 Human Motion Capture Database\n\nThe CMU Graphics Lab Motion Capture Database (http://mocap.cs.cmu.edu) consists of several\nhuman motion recordings. Our experiments are based on five different motions. We preprocess the data as\nin [7], such that each frame is represented by a 50-dimensional feature vector.\nWe compare our method with the VampPrior and the IWAE. The prior and approximate posterior\nof the three methods are depicted in App. C.1. We generate four interpolations (Fig. 5) using our\ngraph-based approach: between two frames within the same motion (black line) and of different\nmotions (orange, red, and maroon); the reconstructions are shown in Fig. 6 and App. C.2. In contrast\nto the IWAE, the VampPrior and the VHP enable smooth interpolations.\nFig. 7 depicts the movement smoothness factor, which we define as the RMS of the second order\nfinite difference along the interpolated path. Thus, smaller values correspond to smoother movements.\nFor each of the three methods, it is averaged across 10 graphs, each with 100 interpolations. The\nstarting and ending points are randomly selected. As a result, the latent representation learned by the\nVHP leads to smoother movement interpolations than in case of the VampPrior and the IWAE.\n\n4.3 Evaluation on MNIST, Fashion-MNIST, and OMNIGLOT\n\nWe compare our method quantitatively with the VampPrior and the IWAE on MNIST [18, 17],\nFashion-MNIST [29], and OMNIGLOT [16]. For this purpose, we report the marginal log-likelihood\n(LL) on the respective test set. Following the test protocol of previous work [28], we evaluate the LL\nusing importance sampling with 5,000 samples [5]. The results are reported in Tab. 2.\n\n[Figure 7 plot: x-axis: joint index (left leg, right leg, head and torso, left arm, right arm); y-axis: smoothness factor; legend: VHP + REWO, VampPrior, IWAE]\n\n(a) VHP + REWO   (b) IWAE   (c) VHP + REWO   (d) IWAE\nFigure 8: Faces & Chairs: graph-based interpolations, between data points from the test set, along the learned\n32-dimensional latent manifold. The graph is based on prior samples. (see Sec. 4.4)\n\nVHP + REWO performs as well as or better than the state of the art on these datasets. The same \u03ba was\nused for training VHP with REWO and GECO. 
The hierarchical IWAE with two stochastic layers does not\nperform better than VHP + REWO, supporting our claim that a flexible prior in the first stochastic\nlayer and a flexible approximate posterior in the second stochastic layer are sufficient. Additionally,\nwe show that REWO leads to a similar number of active units as WU (see App. D.2).\n\nTable 2: Negative test log-likelihood estimated with 5,000 importance samples.\n\n              DYNAMIC MNIST   STATIC MNIST   FASHION-MNIST   OMNIGLOT\nVHP + REWO    78.88           82.74          225.37          101.78\nVHP + GECO    95.01           96.32          234.73          108.97\nVAMPPRIOR     80.42           84.02          232.78          101.97\nIWAE (L=1)    81.36           84.46          226.83          101.57\nIWAE (L=2)    80.66           82.83          225.39          101.83\n\n4.4 Qualitative Results: 3D Chairs and 3D Faces\n\nWe generated 3D Faces [22] based on images of 2000 faces with 37 views each. 3D Chairs [2]\nconsists of 1393 chair images with 62 views each. The images have a size of 64 \u00d7 64 pixels.\nHere, our approach is compared with the IWAE using a 32-dimensional latent space. The learned\nencodings are evaluated qualitatively by using the graph-based interpolation method. Fig. 8(a) and\n8(c) show interpolations along the latent manifold learned by the VHP + REWO. Compared to the\nIWAE (Fig. 8(b) and 8(d)), they are less blurry and smoother. Further results can be found in App. E.\n\n5 Conclusion\n\nIn this paper, we have proposed a hierarchical prior in the context of variational autoencoders and\nextended the constrained optimisation framework in Taming VAEs to hierarchical models by using\nan importance-weighted bound on the marginal of the hierarchical prior. Concurrently, we have\nintroduced the associated optimisation algorithm to facilitate good encodings.\nWe have shown that the learned hierarchical prior is indeed non-trivial; moreover, it is well-adapted to\nthe latent representation, reflecting the topology of the encoded data manifold. 
Our method provides\ninformative latent representations and performs particularly well on data where the relevant features\nchange continuously. In case of the pendulum (Sec. 4.1), the prior has learned to represent the degrees\nof freedom in the data\u2014allowing us to predict the pendulum\u2019s angle by a simple OLS regression.\nThe experiments on the CMU human motion data (Sec. 4.2) and on the high-dimensional Faces and\nChairs datasets (Sec. 4.4) have demonstrated that the learned hierarchical prior leads to smoother and\nmore realistic interpolations than a standard normal prior (or the VampPrior). Moreover, we have\nobtained test log-likelihoods (Sec. 4.3) comparable to state-of-the-art on standard datasets.\n\n8\n\n\fAcknowledgements\n\nWe would like to thank Maximilian Soelch for valuable suggestions and discussions.\n\nReferences\n[1] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a broken ELBO.\n\nICML, 2018.\n\n[2] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic. Seeing 3D chairs: exemplar part-based\n\n2D-3D alignment using a large dataset of CAD models. CVPR, 2014.\n\n[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks.\n\nNeurIPS, 2007.\n\n[4] S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a\n\ncontinuous space. CoNLL, 2016.\n\n[5] Y. Burda, R. B. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. ICLR, 2016.\n\n[6] L. Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12, 2005.\n\n[7] N. Chen, J. Bayer, S. Urban, and P. Van Der Smagt. Ef\ufb01cient movement representation by embedding\n\ndynamic movement primitives in deep autoencoders. HUMANOIDS, 2015.\n\n[8] N. Chen, F. Ferroni, A. Klushyn, A. Paraschos, J. Bayer, and P. van der Smagt. Fast approximate geodesics\n\nfor deep generative models. ICANN, 2019.\n\n[9] X. Chen, Y. Duan, R. 
Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NeurIPS, 2016.\n\n[10] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. ICLR, 2017.\n\n[11] C. Cremer, Q. Morris, and D. Duvenaud. Reinterpreting importance-weighted autoencoders. arXiv:1704.02916, 2017.\n\n[12] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. CoRR, 2016.\n\n[13] P. Goyal, Z. Hu, X. Liang, C. Wang, and E. P. Xing. Nonparametric variational auto-encoders for hierarchical representation learning. ICCV, 2017.\n\n[14] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.\n\n[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2014.\n\n[16] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350, 2015.\n\n[17] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. AISTATS, 2011.\n\n[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.\n\n[19] D. Molchanov, V. Kharitonov, A. Sobolev, and D. Vetrov. Doubly semi-implicit variational inference. AISTATS, 2019.\n\n[20] E. T. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. ICLR, 2017.\n\n[21] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models. 1998.\n\n[22] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. 
A 3D face model for pose and illumination invariant face recognition. AVSS, 2009.\n\n[23] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. ICML, 2015.\n\n[24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.\n\n[25] D. J. Rezende and F. Viola. Taming VAEs. CoRR, 2018.\n\n[26] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. NeurIPS, 2011.\n\n[27] C. K. S\u00f8nderby, T. Raiko, L. Maal\u00f8e, S. K. S\u00f8nderby, and O. Winther. Ladder variational autoencoders. NeurIPS, 2016.\n\n[28] J. Tomczak and M. Welling. VAE with a VampPrior. AISTATS, 2018.\n\n[29] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, 2017.\n\n[30] S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei. Tackling over-pruning in variational autoencoders. CoRR, 2017.\n\n[31] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information maximizing variational autoencoders. CoRR, 2017.", "award": [], "sourceid": 1661, "authors": [{"given_name": "Alexej", "family_name": "Klushyn", "institution": "Volkswagen Group"}, {"given_name": "Nutan", "family_name": "Chen", "institution": "Volkswagen Group"}, {"given_name": "Richard", "family_name": "Kurle", "institution": "Volkswagen Group"}, {"given_name": "Botond", "family_name": "Cseke", "institution": "Volkswagen Group"}, {"given_name": "Patrick", "family_name": "van der Smagt", "institution": "Volkswagen Group"}]}