{"title": "Continuous Hierarchical Representations with Poincar\u00e9 Variational Auto-Encoders", "book": "Advances in Neural Information Processing Systems", "page_first": 12565, "page_last": 12576, "abstract": "The Variational Auto-Encoder (VAE)  is a popular method for learning a generative model and embeddings of the data. Many real datasets are hierarchically structured. However, traditional VAEs map data in a Euclidean latent space which cannot efficiently embed tree-like structures. Hyperbolic spaces with negative curvature can. We therefore endow VAEs with a Poincar\u00e9 ball model of hyperbolic geometry as a latent space and rigorously derive the necessary methods to work with two main Gaussian generalisations on that space. We empirically show better generalisation to unseen data than the Euclidean counterpart, and can qualitatively and quantitatively better recover hierarchical structures.", "full_text": "Continuous Hierarchical Representations with\n\nPoincar\u00e9 Variational Auto-Encoders\n\nEmile Mathieu\u2020\n\nemile.mathieu@stats.ox.ac.uk\n\nCharline Le Lan\u2020\n\ncharline.lelan@stats.ox.ac.uk\n\nChris J. Maddison\u2020\u21e4\n\ncmaddis@stats.ox.ac.uk\n\nRyota Tomioka\u2021\n\nryoto@microsoft.com\n\nYee Whye Teh\u2020\u21e4\n\ny.w.teh@stats.ox.ac.uk\n\n\u2020 Department of Statistics, University of Oxford, United Kingdom\n\n\u21e4 DeepMind, London, United Kingdom\n\n\u2021 Microsoft Research, Cambridge, United Kingdom\n\nAbstract\n\nThe variational auto-encoder (VAE) is a popular method for learning a generative\nmodel and embeddings of the data. Many real datasets are hierarchically structured.\nHowever, traditional VAEs map data in a Euclidean latent space which cannot\nef\ufb01ciently embed tree-like structures. Hyperbolic spaces with negative curvature\ncan. We therefore endow VAEs with a Poincar\u00e9 ball model of hyperbolic geometry\nas a latent space and rigorously derive the necessary methods to work with two\nmain Gaussian generalisations on that space. We empirically show better gener-\nalisation to unseen data than the Euclidean counterpart, and can qualitatively and\nquantitatively better recover hierarchical structures.\n\n1 Introduction\nLearning useful representations from unlabelled\nraw sensory observations, which are often high-\ndimensional, is a problem of signi\ufb01cant importance in\nmachine learning. Variational auto-encoders (VAEs)\n(Kingma and Welling, 2014; Rezende et al., 2014)\nare a popular approach to this: they are probabilistic\ngenerative models composed of an encoder stochas-\ntically embedding observations in a low dimensional\nlatent space Z, and a decoder generating observations\nx 2X from encodings z 2Z . After training, the en-\ncodings constitute a low-dimensional representation\nof the original raw observations, which can be used\nas features for a downstream task (e.g. Huang and\nLeCun, 2006; Coates et al., 2011) or be interpretable\nfor their own sake. VAEs are therefore of interest for\nrepresentation learning (Bengio et al., 2013), a \ufb01eld\nwhich aims to learn good representations, e.g. inter-\npretable representations, ones yielding better gener-\nalisation, or ones useful for downstream tasks.\n\nFigure 1: A regular tree isometrically embed-\nded in the Poincar\u00e9 disc. Red curves are same\nlength geodesics, i.e. \"straight lines\".\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIt can be argued that in many domains data should be represented hierarchically. 
For example, in cognitive science, it is widely accepted that human beings use a hierarchy to organise object categories (e.g. Roy et al., 2006; Collins and Quillian, 1969; Keil, 1979). In biology, the theory of evolution (Darwin, 1859) implies that features of living organisms are related in a hierarchical manner given by the evolutionary tree. Explicitly incorporating hierarchical structure in probabilistic models has, unsurprisingly, been a long-running research topic (e.g. Duda et al., 2000; Heller and Ghahramani, 2005).

Earlier work in this direction tended to use trees as data structures to represent hierarchies. Recently, hyperbolic spaces have been proposed as an alternative continuous approach to learning hierarchical representations from textual and graph-structured data (Nickel and Kiela, 2017; Tifrea et al., 2019). Hyperbolic spaces can be thought of as continuous versions of trees, and vice versa, as illustrated in Figure 1. Trees can be embedded with arbitrarily low error into the Poincaré disc model of hyperbolic geometry (Sarkar, 2012). The exponential growth of surface area in the Poincaré disc with respect to its radius is analogous to the exponential growth of the number of leaves in a tree with respect to its depth. Further, these spaces are smooth, enabling the use of deep learning approaches which rely on differentiability.

We show that replacing the latent space components of a VAE, which traditionally assume a Euclidean metric over the latent space, by their hyperbolic generalisations helps to represent and discover hierarchies. Our goals are twofold: (a) learn a latent representation that is interpretable in terms of hierarchical relationships among the observations; (b) learn a more efficient representation which generalises better to unseen data that is hierarchically structured. Our main contributions are as follows:

1. We propose efficient and reparametrisable sampling schemes, and calculate the probability density functions, for two canonical Gaussian generalisations defined on the Poincaré ball, namely the maximum-entropy and wrapped normal distributions. These are the ingredients required to train our VAEs.

2. We introduce a decoder architecture that explicitly takes into account the hyperbolic geometry, which we empirically show to be crucial.

3. We empirically demonstrate that endowing a VAE with a Poincaré ball latent space can be beneficial in terms of model generalisation and can yield more interpretable representations.

Our work fits well with a surge of interest in combining hyperbolic geometry and VAEs. It relates most strongly to the concurrent works of Ovinnikov (2018), Grattarola et al. (2019) and Nagano et al. (2019). In contrast to these approaches, we introduce a decoder that takes into account the geometry of the hyperbolic latent space. Along with the wrapped normal generalisation used in the latter two articles, we give a thorough treatment of the maximum-entropy normal generalisation and a rigorous analysis of the difference between the two. Additionally, we train our model by maximising a lower bound on the marginal likelihood, as opposed to Ovinnikov (2018) and Grattarola et al. (2019), which consider a Wasserstein and an adversarial auto-encoder setting, respectively. We discuss these works in more detail in Section 4.
2 The Poincaré ball model of hyperbolic geometry

2.1 Review of Riemannian geometry

Throughout the paper we denote the Euclidean norm and inner product by $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ respectively. A real, smooth manifold $\mathcal{M}$ is a set of points $z$ which is "locally similar" to a linear space. To every point $z$ of the manifold $\mathcal{M}$ is attached a real vector space of the same dimensionality as $\mathcal{M}$, called the tangent space $T_z\mathcal{M}$. Intuitively, it contains all the possible directions in which one can tangentially pass through $z$. For each point $z$ of the manifold, the metric tensor $g(z) = \langle\cdot,\cdot\rangle_z : T_z\mathcal{M} \times T_z\mathcal{M} \to \mathbb{R}$ defines an inner product on the associated tangent space. The matrix representation of the Riemannian metric, $G(z)$, is defined such that $\forall\, u, v \in T_z\mathcal{M}$, $\langle u, v \rangle_z = g(z)(u, v) = u^T G(z)\, v$. A Riemannian manifold is then defined as a tuple $(\mathcal{M}, g)$ (Petersen, 2006). The metric tensor gives a local notion of angle, length of curves, surface area and volume, from which global quantities can be derived by integrating local contributions. A norm is induced by the inner product on $T_z\mathcal{M}$: $\|\cdot\|_z = \sqrt{\langle\cdot,\cdot\rangle_z}$. An infinitesimal volume element is induced on each tangent space $T_z\mathcal{M}$, and thus a measure $d\mathcal{M}(z) = \sqrt{|G(z)|}\, dz$ on the manifold, with $dz$ the Lebesgue measure.

The length of a curve $\gamma : t \mapsto \gamma(t) \in \mathcal{M}$ is given by $L(\gamma) = \int_0^1 \|\gamma'(t)\|_{\gamma(t)}\, dt$. The concept of straight lines can then be generalised to geodesics, which are constant-speed curves giving the shortest path between pairs of points $z, y$ of the manifold: $\gamma^\ast = \arg\min_\gamma L(\gamma)$ with $\gamma(0) = z$, $\gamma(1) = y$ and $\|\gamma'(t)\|_{\gamma(t)} = 1$. A global distance is thus induced on $\mathcal{M}$, given by $d_\mathcal{M}(z, y) = \inf_\gamma L(\gamma)$. Endowing $\mathcal{M}$ with that distance consequently defines a metric space $(\mathcal{M}, d_\mathcal{M})$. The concept of moving along a "straight" curve with constant velocity is given by the exponential map. In particular, there is a unique unit-speed geodesic $\gamma$ satisfying $\gamma(0) = z$ with initial tangent vector $\gamma'(0) = v$. The corresponding exponential map is then defined by $\exp_z(v) = \gamma(1)$, as illustrated in Figure 2. The logarithm map is its inverse, $\log_z = \exp_z^{-1} : \mathcal{M} \to T_z\mathcal{M}$. For geodesically complete manifolds, such as the Poincaré ball, $\exp_z$ is well-defined on the full tangent space $T_z\mathcal{M}$ for all $z \in \mathcal{M}$.

2.2 The Poincaré ball model of hyperbolic geometry

A $d$-dimensional hyperbolic space, denoted $\mathbb{H}^d$, is a complete, simply connected, $d$-dimensional Riemannian manifold with constant negative curvature $-c$. In contrast with the Euclidean space $\mathbb{R}^d$, $\mathbb{H}^d$ can be constructed using various isomorphic models (none of which is prevalent), including the hyperboloid model, the Beltrami–Klein model, the Poincaré half-plane model and the Poincaré ball $\mathbb{B}^d_c$ (Beltrami, 1868). The Poincaré ball model is formally defined as the Riemannian manifold $\mathbb{B}^d_c = (\mathbb{B}^d_c, g^c_p)$, where $\mathbb{B}^d_c$ is the open ball of radius $1/\sqrt{c}$ and $g^c_p$ its metric tensor, which, along with its induced distance, is given by

$$g^c_p(z) = (\lambda^c_z)^2\, g^e(z), \qquad d^c_p(z, y) = \frac{1}{\sqrt{c}} \cosh^{-1}\left(1 + 2c\, \frac{\|z - y\|^2}{(1 - c\|z\|^2)(1 - c\|y\|^2)}\right),$$

where $\lambda^c_z = \frac{2}{1 - c\|z\|^2}$ and $g^e$ denotes the Euclidean metric tensor, i.e. the usual dot product.

[Figure 2: Geodesics and exponential maps in the Poincaré disc.]

The Möbius addition (Ungar, 2008) of $z$ and $y$ in $\mathbb{B}^d_c$ is defined as

$$z \oplus_c y = \frac{(1 + 2c\langle z, y\rangle + c\|y\|^2)\, z + (1 - c\|z\|^2)\, y}{1 + 2c\langle z, y\rangle + c^2\|z\|^2\|y\|^2}.$$

One recovers the Euclidean addition of two vectors in $\mathbb{R}^d$ as $c \to 0$. Building on that framework, Ganea et al. (2018) derived closed-form formulations for the exponential map (illustrated in Figure 2),

$$\exp^c_z(v) = z \oplus_c \left(\tanh\left(\sqrt{c}\, \frac{\lambda^c_z \|v\|}{2}\right) \frac{v}{\sqrt{c}\,\|v\|}\right),$$

and its inverse, the logarithm map,

$$\log^c_z(y) = \frac{2}{\sqrt{c}\,\lambda^c_z} \tanh^{-1}\left(\sqrt{c}\,\|{-z} \oplus_c y\|\right) \frac{-z \oplus_c y}{\|{-z} \oplus_c y\|}.$$
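These closed-form operations are straightforward to implement. The following is a minimal NumPy sketch of them (our own illustration, not the authors' released code); vectors are 1-D arrays, and the numerical safeguards a practical implementation needs, such as clamping points away from the boundary of the ball, are omitted.

```python
import numpy as np

def mobius_add(z, y, c):
    """Mobius addition z (+)_c y on the Poincare ball of curvature -c."""
    zy, z2, y2 = np.dot(z, y), np.dot(z, z), np.dot(y, y)
    num = (1 + 2 * c * zy + c * y2) * z + (1 - c * z2) * y
    return num / (1 + 2 * c * zy + c ** 2 * z2 * y2)

def lam(z, c):
    """Conformal factor lambda_z^c = 2 / (1 - c ||z||^2)."""
    return 2.0 / (1.0 - c * np.dot(z, z))

def exp_map(z, v, c):
    """Exponential map exp_z^c(v), for a tangent vector v at z."""
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return z
    t = np.tanh(np.sqrt(c) * lam(z, c) * nv / 2.0)
    return mobius_add(z, t * v / (np.sqrt(c) * nv), c)

def log_map(z, y, c):
    """Logarithm map log_z^c(y), the inverse of exp_map."""
    w = mobius_add(-z, y, c)
    nw = np.linalg.norm(w)
    return (2.0 / (np.sqrt(c) * lam(z, c))) * np.arctanh(np.sqrt(c) * nw) * w / nw

def dist(z, y, c):
    """Geodesic distance d_p^c(z, y)."""
    frac = np.sum((z - y) ** 2) / ((1 - c * np.dot(z, z)) * (1 - c * np.dot(y, y)))
    return np.arccosh(1 + 2 * c * frac) / np.sqrt(c)
```

As a sanity check, exp_map(z, log_map(z, y, c), c) should return y up to floating-point error, and all of these operations reduce to their Euclidean counterparts as $c \to 0$.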
3 The Poincaré VAE

We consider the problem of mapping an empirical distribution of observations to a lower-dimensional Poincaré ball $\mathbb{B}^d_c$, as well as learning a map from this latent space $\mathcal{Z} = \mathbb{B}^d_c$ to the observation space $\mathcal{X}$. Building on the VAE framework, this Poincaré VAE model, or $\mathcal{P}^c$-VAE for short, differs by the choice of prior and posterior distributions, which are defined on $\mathbb{B}^d_c$, and by the encoder $g_\phi$ and decoder $f_\theta$ maps, which take into account the latent space geometry. Their parameters $\{\theta, \phi\}$ are learned by maximising the evidence lower bound (ELBO). Our model can be seen as a generalisation of a classical Euclidean VAE (Kingma and Welling, 2014; Rezende et al., 2014), which we denote by $\mathcal{N}$-VAE: $\mathcal{P}^c$-VAE $\to \mathcal{N}$-VAE as $c \to 0$.

3.1 Prior and variational posterior distributions

In order to parametrise distributions on the Poincaré ball, we consider two canonical generalisations of normal distributions on that space. A more detailed review of Gaussian generalisations on manifolds can be found in Appendix B.1.

Riemannian normal. One generalisation is the distribution maximising entropy given an expectation and variance (Said et al., 2014; Pennec, 2006; Hauberg, 2018), often called the Riemannian normal distribution, which has a density w.r.t. the metric-induced measure $d\mathcal{M}$ given by

$$\mathcal{N}^R_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2) = \frac{d\nu^R(z \mid \mu, \sigma^2)}{d\mathcal{M}(z)} = \frac{1}{Z^R} \exp\left(-\frac{d^c_p(\mu, z)^2}{2\sigma^2}\right), \quad (1)$$

where $\sigma > 0$ is a dispersion parameter, $\mu \in \mathbb{B}^d_c$ is the Fréchet mean, and $Z^R$ is the normalising constant derived in Appendix B.4.3.

Wrapped normal. An alternative is to consider the push-forward measure obtained by mapping a normal distribution along the exponential map $\exp^c_\mu$. That probability measure is often referred to as the wrapped normal distribution, and has been used in auto-encoder frameworks with other manifolds (Grattarola et al., 2019; Nagano et al., 2019; Falorsi et al., 2018). Samples $z \in \mathbb{B}^d_c$ are obtained as $z = \exp^c_\mu(v / \lambda^c_\mu)$ with $v \sim \mathcal{N}(\cdot \mid 0, \Sigma)$, and its density is given by (details in Appendix B.3)

$$\mathcal{N}^W_{\mathbb{B}^d_c}(z \mid \mu, \Sigma) = \frac{d\nu^W(z \mid \mu, \Sigma)}{d\mathcal{M}(z)} = \mathcal{N}\left(\lambda^c_\mu \log^c_\mu(z) \mid 0, \Sigma\right) \left(\frac{\sqrt{c}\; d^c_p(\mu, z)}{\sinh\left(\sqrt{c}\; d^c_p(\mu, z)\right)}\right)^{d-1}. \quad (2)$$

The (usual) normal distribution is recovered for both generalisations as $c \to 0$. We discuss the benefits and drawbacks of those two distributions in Appendix B.1. We refer to both as hyperbolic normal distributions, with pdf $\mathcal{N}_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2)$. Figure 8 shows several probability densities for both distributions.

[Figure 3: Hyperbolic normal probability densities (Riemannian and wrapped) for different Fréchet means, $\sqrt{c}\|\mu\| \in \{0.0, 0.4, 0.8\}$, the same standard deviation and $c = 10$. The Riemannian hyperbolic radius has a slightly larger mode.]
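Equation 2 translates directly into code. Below is a NumPy/SciPy sketch of the wrapped normal sampler and log-density, reusing the geometry helpers (exp_map, log_map, lam, dist) from the Section 2.2 sketch; the function names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_wrapped(mu, Sigma, c, rng):
    """z = exp_mu^c(v / lambda_mu^c) with v ~ N(0, Sigma), as below Eq. (2)."""
    v = rng.multivariate_normal(np.zeros(mu.shape[0]), Sigma)
    return exp_map(mu, v / lam(mu, c), c)

def log_wrapped(z, mu, Sigma, c):
    """log N^W(z | mu, Sigma) w.r.t. the metric-induced measure dM, via Eq. (2)."""
    d = mu.shape[0]
    u = lam(mu, c) * log_map(mu, z, c)   # pull z back to the tangent space at mu
    r = dist(mu, z, c)                   # r = d_p^c(mu, z)
    log_jac = (d - 1) * np.log(np.sqrt(c) * r / np.sinh(np.sqrt(c) * r))
    return multivariate_normal.logpdf(u, mean=np.zeros(d), cov=Sigma) + log_jac
```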
The prior distribution defined on $\mathcal{Z}$ is chosen to be a hyperbolic normal distribution with mean zero, $p(z) = \mathcal{N}_{\mathbb{B}^d_c}(\cdot \mid 0, \sigma_0^2)$, and the variational family is chosen to be parametrised as $Q = \{\mathcal{N}_{\mathbb{B}^d_c}(\cdot \mid \mu, \sigma^2) \mid \mu \in \mathbb{B}^d_c,\ \sigma \in \mathbb{R}^+_*\}$.

3.2 Encoder and decoder

We make use of two neural networks, a decoder $f_\theta$ and an encoder $g_\phi$, to parametrise the likelihood $p(\cdot \mid f_\theta(z))$ and the variational posterior $q(\cdot \mid g_\phi(x))$ respectively. The input of $f_\theta$ and the output of $g_\phi$ need to respect the hyperbolic geometry of $\mathcal{Z}$. In the following we describe appropriate choices for the first layer of the decoder and the last layer of the encoder.

Decoder. In the Euclidean case, an affine transformation can be written in the form $f_{a,p}(z) = \langle a, z - p \rangle$, with orientation and offset parameters $a, p \in \mathbb{R}^d$. This can be rewritten in the form

$$f_{a,p}(z) = \mathrm{sign}(\langle a, z - p \rangle)\, \|a\|\, d_E(z, H_{a,p}),$$

where $H_{a,p} = \{z \in \mathbb{R}^d \mid \langle a, z - p \rangle = 0\} = p + \{a\}^\perp$ is the decision hyperplane. The third factor is the distance between $z$ and the decision hyperplane $H_{a,p}$, and the first factor refers to the side of $H_{a,p}$ on which $z$ lies. Ganea et al. (2018) analogously introduced an operator $f^c_{a,p} : \mathbb{B}^d_c \to \mathbb{R}$ on the Poincaré ball,

$$f^c_{a,p}(z) = \mathrm{sign}\left(\langle a, \log^c_p(z) \rangle_p\right) \|a\|_p\, d^c_p(z, H^c_{a,p}),$$

with $H^c_{a,p} = \{z \in \mathbb{B}^d_c \mid \langle a, \log^c_p(z) \rangle = 0\} = \exp^c_p(\{a\}^\perp)$. A closed-form expression for the distance $d^c_p(z, H^c_{a,p})$ was also derived,

$$d^c_p(z, H^c_{a,p}) = \frac{1}{\sqrt{c}} \sinh^{-1}\left(\frac{2\sqrt{c}\, |\langle {-p} \oplus_c z,\ a \rangle|}{(1 - c\,\|{-p} \oplus_c z\|^2)\, \|a\|}\right).$$

The hyperplane decision boundary $H^c_{a,p}$ is called a gyroplane, and is a semi-hypersphere orthogonal to the Poincaré ball's boundary, as illustrated in Figure 4.

[Figure 4: Illustration of an orthogonal projection on a hyperplane in a Poincaré disc (left) and a Euclidean plane (right).]

The decoder's first layer, called the gyroplane layer, is chosen to be a concatenation of such operators, which is then composed with a standard feed-forward neural network; a sketch of such a layer is given below.
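The following is a PyTorch sketch of a gyroplane layer; class and helper names are ours, not the released code's. The signed output folds the sign and the distance together, since $\mathrm{sign}(\langle {-p} \oplus_c z, a \rangle)$ applied to the argument of $\sinh^{-1}$ gives the signed distance directly, and, following the optimisation trick of Section 3.3 below, the manifold offsets $p$ are parametrised through $\exp^c_0$ of a free Euclidean parameter.

```python
import torch
import torch.nn as nn

def mobius_add(x, y, c):
    """Batched Mobius addition x (+)_c y; the last dimension holds coordinates."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def expmap0(v, c):
    """exp^c_0: tangent space at the origin -> ball (keeps offsets on the manifold)."""
    sc = c ** 0.5
    nv = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.tanh(sc * nv) * v / (sc * nv)

class GyroplaneLayer(nn.Module):
    """Concatenation of K operators f^c_{a,p}: signed, ||a||_p-scaled gyroplane distances."""
    def __init__(self, dim, n_planes, c):
        super().__init__()
        self.c = c
        self.a = nn.Parameter(1e-2 * torch.randn(n_planes, dim))   # orientations a_k
        self.p_tilde = nn.Parameter(torch.zeros(n_planes, dim))    # offsets, via exp^c_0

    def forward(self, z):                                          # z: (batch, dim)
        c, sc = self.c, self.c ** 0.5
        p = expmap0(self.p_tilde, c)                               # offsets p_k on the ball
        w = mobius_add(-p.unsqueeze(0), z.unsqueeze(1), c)         # -p (+)_c z: (batch, K, dim)
        w2 = (w * w).sum(-1)
        wa = (w * self.a.unsqueeze(0)).sum(-1)                     # <-p (+)_c z, a>
        na = self.a.norm(dim=-1).clamp_min(1e-8)
        signed_dist = torch.asinh(2 * sc * wa / ((1 - c * w2) * na)) / sc
        lam_p = 2 / (1 - c * (p * p).sum(-1))                      # conformal factor at p
        return lam_p * na * signed_dist                            # ||a||_p = lam_p ||a||, (batch, K)
```

The layer outputs one feature per gyroplane, which is then fed to an ordinary feed-forward network.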
Encoder. The encoder $g_\phi$ outputs a Fréchet mean $\mu \in \mathbb{B}^d_c$ and a distortion $\sigma \in \mathbb{R}^+_*$, which parametrise the hyperbolic variational posterior. The Fréchet mean $\mu$ is obtained as the image of the exponential map $\exp^c_0$, and the distortion $\sigma$ through a softplus function.

3.3 Training

We follow a standard variational approach by deriving a lower bound on the marginal likelihood. The ELBO is optimised via an unbiased Monte Carlo (MC) estimator, thanks to the reparametrisable sampling schemes that we introduce for both hyperbolic normal distributions.

Objective. The evidence lower bound (ELBO) can readily be extended to Riemannian latent spaces by applying Jensen's inequality w.r.t. $d\mathcal{M}$ (see Appendix A):

$$\log p(x) \geq \mathcal{L}_\mathcal{M}(x; \theta, \phi) \triangleq \int_\mathcal{M} \ln\left(\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right) q_\phi(z \mid x)\, d\mathcal{M}(z).$$

The densities appearing in this bound have been introduced in Equations 1 and 2.

Reparametrisation. In the Euclidean setting, by working in polar coordinates, an isotropic normal distribution centred at $\mu$ can be described by a directional vector $\alpha$ uniformly distributed on the hypersphere and a univariate radius $r = d_E(\mu, z)$ following a $\chi$-distribution. In the Poincaré ball we can rely on a similar representation, through a hyperbolic polar change of coordinates, given by

$$z = \exp^c_\mu\left(G(\mu)^{-1/2}\, v\right) = \exp^c_\mu\left(\frac{r}{\lambda^c_\mu}\, \alpha\right) \quad (3)$$

with $v = r\alpha$ and $r = d^c_p(\mu, z)$. The direction $\alpha$ is still uniformly distributed on the hypersphere. For the wrapped normal the radius $r$ is still $\chi$-distributed, while for the Riemannian normal its density $\rho^R(r)$ is given by (derived in Appendix B.4.1)

$$\rho^W(r) \propto 1_{\mathbb{R}^+}(r)\, e^{-\frac{r^2}{2\sigma^2}}\, r^{d-1}, \qquad \rho^R(r) \propto 1_{\mathbb{R}^+}(r)\, e^{-\frac{r^2}{2\sigma^2}} \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1}.$$

The latter density $\rho^R(r)$ can efficiently be sampled via rejection sampling with a piecewise-exponential proposal distribution; this makes use of its log-concavity. The Riemannian normal sampling scheme is not directly affected by dimensionality, since the radius is a one-dimensional random variable. Full sampling schemes are described in Algorithm 1, in Appendices B.4.1 and B.4.2, and in the code sketch at the end of this section.

Algorithm 1: Hyperbolic normal sampling scheme
  Require: $\mu$, $\sigma^2$, dimension $d$, curvature $c$
  if wrapped normal then sample $v \sim \mathcal{N}(0_d, \sigma^2)$
  else if Riemannian normal then
    let $g$ be a piecewise-exponential proposal
    while sample $r$ not accepted do
      propose $r \sim g(\cdot)$ and $u \sim \mathcal{U}([0, 1])$
      if $u < \rho^R(r)/g(r)$ then accept sample $r$
    sample direction $\alpha \sim \mathcal{U}(S^{d-1})$ and set $v \leftarrow r\alpha$
  return $z = \exp^c_\mu(v / \lambda^c_\mu)$

Gradients. Gradients $\nabla_\mu z$ can straightforwardly be computed thanks to the exponential map reparametrisation (Eq. 3), and gradients w.r.t. the dispersion, $\nabla_\sigma z$, are readily available for the wrapped normal. For the Riemannian normal, we additionally rely on an implicit reparametrisation (Figurnov et al., 2018) of $\rho^R$ via its cdf $F^R(r; \sigma)$.

Optimisation. Parameters of the model living in the Poincaré ball are parametrised via the exponential mapping, $\zeta_i = \exp^c_0(\zeta'_i)$ with $\zeta'_i \in \mathbb{R}^m$, so that we can make use of usual optimisation schemes. Alternatively, one could directly optimise such manifold parameters with manifold gradient descent schemes (Bonnabel, 2013).
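The NumPy sketch below mirrors Algorithm 1, reusing exp_map and lam from the Section 2.2 sketch. One caveat: instead of the paper's adaptive piecewise-exponential envelope, we swap in a simpler (still exact) rejection step based on the bound $\sinh(x) \leq e^x/2$, which yields a Gaussian proposal $\mathcal{N}((d-1)\sqrt{c}\,\sigma^2, \sigma^2)$ truncated to $r > 0$ with acceptance probability $(1 - e^{-2\sqrt{c}\, r})^{d-1}$.

```python
import numpy as np

def sample_riemannian_radius(sigma, d, c, rng):
    """Rejection-sample r ~ rho^R(r) with a shifted-Gaussian envelope (see lead-in)."""
    m = (d - 1) * np.sqrt(c) * sigma ** 2          # mean of the Gaussian envelope
    while True:
        r = m + sigma * rng.standard_normal()
        if r <= 0:
            continue                                # envelope truncated to r > 0
        if rng.uniform() < (1 - np.exp(-2 * np.sqrt(c) * r)) ** (d - 1):
            return r

def sample_hyperbolic_normal(mu, sigma, c, kind, rng):
    """One reparametrised draw from the wrapped or Riemannian hyperbolic normal."""
    d = mu.shape[0]
    if kind == "wrapped":
        v = sigma * rng.standard_normal(d)          # v ~ N(0_d, sigma^2 I)
    else:
        r = sample_riemannian_radius(sigma, d, c, rng)
        alpha = rng.standard_normal(d)
        alpha /= np.linalg.norm(alpha)              # alpha ~ U(S^{d-1})
        v = r * alpha
    return exp_map(mu, v / lam(mu, c), c)           # push v / lambda_mu^c through exp_mu^c
```

The correctness of the swapped-in envelope follows from $e^{-(r-m)^2/2\sigma^2} (1 - e^{-2\sqrt{c} r})^{d-1} \propto e^{-r^2/2\sigma^2} \sinh(\sqrt{c}\, r)^{d-1}$.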
4 Related work

Hierarchical models. The Bayesian nonparametric literature has a rich history of explicitly modelling the hierarchical structure of data (Teh et al., 2008; Heller and Ghahramani, 2005; Griffiths et al., 2004; Ghahramani et al., 2010; Larsen et al., 2001; Salakhutdinov et al., 2011). The discrete nature of the trees used in such models makes learning difficult, whereas performing optimisation in a continuous hyperbolic space is an attractive alternative. Such an approach has been empirically and theoretically shown to be useful for graphs and word embeddings (Nickel and Kiela, 2017, 2018; Chamberlain et al., 2017; Sala et al., 2018; Tifrea et al., 2019).

Distributions on manifolds. Probability measures defined on manifolds are of interest for modelling the uncertainty of data living (either intrinsically or by assumption) on such spaces, e.g. in directional statistics (Ley and Verdebout, 2017; Mardia and Jupp, 2009). Pennec (2006) introduced a maximum-entropy generalisation of the normal distribution, often referred to as the Riemannian normal, which has been used for maximum likelihood estimation in the Poincaré half-plane (Said et al., 2014) and on the hypersphere (Hauberg, 2018). Another class of manifold probability measures are wrapped distributions, i.e. push-forwards of distributions defined on a tangent space, often along the exponential map. They have recently been used in auto-encoder frameworks on the hyperboloid model (of hyperbolic geometry) (Grattarola et al., 2019; Nagano et al., 2019) and on Lie groups (Falorsi et al., 2018). Rey et al. (2019) and Li et al. (2019) proposed to parametrise a variational family through a Brownian motion on manifolds such as spheres, tori, projective spaces and SO(3).

VAEs with Riemannian latent manifolds. VAEs with non-Euclidean latent spaces have recently been introduced, with Davidson et al. (2018) making use of hyperspherical geometry and Falorsi et al. (2018) endowing the latent space with a SO(3) group structure. Concurrent work considers endowing auto-encoders (AEs) with a hyperbolic latent space. Grattarola et al. (2019) introduce a constant-curvature manifold (CCM) (i.e. hyperspherical, Euclidean or hyperboloid) latent space within an adversarial auto-encoder framework; however, their encoder and decoder are not designed to explicitly take into account the latent space geometry. Ovinnikov (2018) recently proposed to endow a VAE latent space with a Poincaré ball model. They choose a Wasserstein auto-encoder framework (Tolstikhin et al., 2018) because they could not derive a closed-form solution of the ELBO's entropy term; we instead rely on an MC estimate of the ELBO by introducing a novel reparametrisation of the Riemannian normal. They discuss the Riemannian normal distribution, yet they make a number of heuristic approximations for sampling and reparametrisation. Nagano et al. (2019) propose using a wrapped normal distribution to model uncertainty on the hyperboloid model of hyperbolic space. They derive its density and a reparametrisable sampling scheme, allowing such a distribution to be used in a variational learning framework, and apply it to stochastically embed graphs and to parametrise the variational family in VAEs. Both Ovinnikov (2018) and Nagano et al. (2019) rely on a standard feed-forward decoder architecture, which does not take into account the hyperbolic geometry.

5 Experiments

We implemented our model and ran our experiments within the automatic differentiation framework PyTorch (Paszke et al., 2017). We open-source our code for reproducibility and to benefit the community (https://github.com/emilemathieu/pvae). Experimental details are fully described in Appendix C.

5.1 Branching diffusion process

We assess our modelling assumption on data generated from a branching diffusion process, which explicitly incorporates hierarchical structure. Nodes $y_i \in \mathbb{R}^n$ are normally distributed with mean given by their parent and with unit variance. Models are trained on noisy vector representations $(x_1, \ldots, x_N)$, and hence do not have access to the true hierarchical representation; a sketch of this generative process is given below.
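For concreteness, here is one way such a dataset can be generated. The exact hyperparameters (tree depth, branching factor, number of noisy observations per node, noise scale, placement of the root) are our assumptions; the paper's settings are in Appendix C.

```python
import numpy as np

def branching_diffusion(depth, n_children, n_dim, n_noisy, noise_std, rng):
    """Sample a tree by diffusing node features from parent to child
    (child ~ N(parent, I)), then emit noisy observations around each node."""
    nodes = [rng.standard_normal(n_dim)]            # root placement is our assumption
    parent = [-1]
    frontier = [0]
    for _ in range(depth):
        nxt = []
        for i in frontier:
            for _ in range(n_children):
                nodes.append(nodes[i] + rng.standard_normal(n_dim))
                parent.append(i)
                nxt.append(len(nodes) - 1)
        frontier = nxt
    # noisy vector representations x_j, the only data the models actually see
    X = np.stack([y + noise_std * rng.standard_normal(n_dim)
                  for y in nodes for _ in range(n_noisy)])
    return np.stack(nodes), np.array(parent), X

# e.g. tree, parents, X = branching_diffusion(3, 2, 50, 5, 1.0, np.random.default_rng(0))
```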
We train several $\mathcal{P}^c$-VAEs with increasing curvatures, along with a vanilla $\mathcal{N}$-VAE as a baseline. Table 1 shows that the $\mathcal{P}^c$-VAE outperforms its Euclidean counterpart in terms of test marginal likelihood. As expected, we observe that the performance of the $\mathcal{N}$-VAE is recovered as the curvature $c$ tends to zero. Also, we notice that increasing the prior distribution distortion $\sigma_0$ helps embeddings lie closer to the border, and as a consequence improves generalisation performance. Figure 5 shows latent embeddings for the $\mathcal{P}^1$-VAE and the $\mathcal{N}$-VAE, along with two embedding baselines: principal component analysis (PCA) and a Gaussian process latent variable model (GPLVM). A hierarchical structure is somewhat learned by all models, yet the $\mathcal{P}^c$-VAE's latent representation is the least distorted.

[Figure 5: Latent representations learned by the $\mathcal{P}^1$-VAE (leftmost), $\mathcal{N}$-VAE (centre-left), PCA (centre-right) and GPLVM (rightmost), trained on the synthetic dataset. Embeddings are represented by black crosses, colour dots are posterior samples, and blue lines represent the true hierarchy.]

Table 1: Negative test marginal likelihood estimates $\mathcal{L}_{\mathrm{IWAE}}$ (Burda et al., 2015) (computed with 5000 samples) on the synthetic dataset. 95% confidence intervals are computed over 20 trainings.

  sigma_0 | N-VAE      | P^0.1-VAE  | P^0.3-VAE  | P^0.8-VAE  | P^1.0-VAE  | P^1.2-VAE
  1.0     | 57.1 ± 0.2 | 56.6 ± 0.2 | 56.8 ± 0.2 | 56.9 ± 0.2 | 57.2 ± 0.2 | 56.7 ± 0.2
  1.7     | 57.0 ± 0.2 | 57.1 ± 0.2 | 55.6 ± 0.2 | 55.9 ± 0.2 | 56.6 ± 0.2 | 55.7 ± 0.2

5.2 MNIST digits

The MNIST dataset (LeCun and Cortes, 2010) has been used in the literature for hierarchical modelling (Salakhutdinov et al., 2011; Saha et al., 2018). One can view the natural clustering in MNIST images as a hierarchy, with each of the 10 classes being an internal node of the hierarchy. We empirically assess whether our model can take advantage of such a simple underlying hierarchical structure, first by measuring its generalisation capacity via the test marginal log-likelihood. Table 2 shows that our model outperforms its Euclidean counterpart, especially for low latent dimension. This can be interpreted from an information-bottleneck perspective: as the latent dimensionality increases, the pressure on the quality of the embeddings decreases, hence the gain from the hyperbolic geometry is reduced (as observed by Nickel and Kiela (2017)). Also, by using the Riemannian normal distribution, we achieve slightly better results than with the wrapped normal.

Table 2: Negative test marginal likelihood estimates computed with 5000 samples. 95% confidence intervals are computed over 10 runs. * indicates numerically unstable settings.

  Model               | c   | d = 2       | d = 5       | d = 10      | d = 20
  N-VAE               | (0) | 144.5 ± 0.4 | 114.7 ± 0.1 | 100.2 ± 0.1 | 97.6 ± 0.1
  P-VAE (Wrapped)     | 0.1 | 143.9 ± 0.5 | 115.5 ± 0.3 | 100.2 ± 0.1 | 97.2 ± 0.1
                      | 0.2 | 144.2 ± 0.5 | 115.3 ± 0.3 | 100.0 ± 0.1 | 97.1 ± 0.1
                      | 0.7 | 143.8 ± 0.6 | 115.1 ± 0.3 | 100.2 ± 0.1 | 97.5 ± 0.1
                      | 1.4 | 144.0 ± 0.6 | 114.7 ± 0.1 | 100.7 ± 0.1 | 98.0 ± 0.1
  P-VAE (Riemannian)  | 0.1 | 143.7 ± 0.6 | 115.2 ± 0.2 | 99.9 ± 0.1  | 97.0 ± 0.1
                      | 0.2 | 143.8 ± 0.4 | 114.7 ± 0.3 | 99.7 ± 0.1  | 97.4 ± 0.1
                      | 0.7 | 143.1 ± 0.4 | 114.1 ± 0.2 | 101.2 ± 0.2 | *
                      | 1.4 | 142.5 ± 0.4 | 115.5 ± 0.3 | *           | *

We conduct an ablation study to assess the usefulness of the gyroplane layer introduced in Section 3.2. To do so, we estimate the test marginal log-likelihood for different choices of decoder. We select a multi-layer perceptron (MLP) as the baseline decoder. We additionally compare to an MLP pre-composed with $\log^c_0$, which can be seen as a linearisation of the space around the centre of the ball. Figure 6 shows the relative performance improvement of decoders over the MLP baseline w.r.t. the latent space dimension. We observe that linearising the input of an MLP through the logarithm map slightly improves generalisation, and that using a gyroplane layer as the first layer of the decoder improves generalisation further. Yet these performance gains appear to decrease as the latent dimensionality increases. A sketch of the compared decoder variants is given after Figure 6.

[Figure 6: Decoder ablation study on MNIST with a wrapped normal $\mathcal{P}^1$-VAE: relative marginal log-likelihood improvement (%, from 0.0 to 2.0) of the $\log_0 \circ$ MLP and gyroplane decoders over the MLP baseline, as a function of the latent space dimension (5 to 20).]
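A sketch of the three decoder front-ends compared in this ablation, reusing GyroplaneLayer and the curvature conventions from the Section 3.2 sketch; the module names and layer sizes are ours.

```python
import torch
import torch.nn as nn

class Logmap0(nn.Module):
    """log^c_0: maps ball points to the tangent space at the origin."""
    def __init__(self, c):
        super().__init__()
        self.c = c

    def forward(self, z):
        sc = self.c ** 0.5
        nz = z.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        return torch.atanh((sc * nz).clamp(max=1 - 1e-6)) * z / (sc * nz)

def make_decoder(kind, d_latent, d_hidden, d_out, c):
    """'mlp' feeds ball coordinates straight into a linear layer, 'log0'
    linearises them around the origin first, 'gyroplane' uses the layer
    from Section 3.2."""
    if kind == "mlp":
        first = nn.Linear(d_latent, d_hidden)
    elif kind == "log0":
        first = nn.Sequential(Logmap0(c), nn.Linear(d_latent, d_hidden))
    else:  # "gyroplane"
        first = GyroplaneLayer(d_latent, d_hidden, c)
    return nn.Sequential(first, nn.ReLU(),
                         nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))
```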
Second, we explore the learned latent representations of the trained $\mathcal{P}$-VAE and $\mathcal{N}$-VAE models, shown in Figure 7. Qualitatively, our $\mathcal{P}$-VAE produces a clearer partitioning of the digits, in groupings of {4, 7, 9}, {0, 6}, {2, 3, 5, 8} and {1}, with right-slanting {5, 8} being placed separately from the non-slanting ones. Recall that distances increase towards the edge of the Poincaré ball. We quantitatively assess the quality of the embeddings by training a classifier to predict labels. Table 3 shows that the embeddings learned by our $\mathcal{P}$-VAE model yield on average a 2% increase in accuracy over the digits. The full confusion matrices are shown in Figure 12 in the appendix.

Table 3: Per-digit accuracy of a classifier trained on the learned latent 2-d embeddings. Results are averaged over 10 sets of embeddings and 5 classifier trainings.

  Digit     | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | Avg
  N-VAE     | 89 | 97 | 81 | 75 | 59 | 43 | 89 | 78 | 68 | 57 | 73.6
  P^1.4-VAE | 94 | 97 | 82 | 79 | 69 | 47 | 90 | 77 | 68 | 53 | 75.6

[Figure 7: MNIST posterior means (left), a sub-sample of digit images associated with the posterior means (middle) and model samples (right), for the $\mathcal{P}^{1.4}$-VAE (top) and the $\mathcal{N}$-VAE (bottom).]

5.3 Graph embeddings

We evaluate the performance of a variational graph auto-encoder (VGAE) (Kipf and Welling, 2016) with a Poincaré ball latent space for link prediction in networks. Edges in complex networks can typically be explained by a latent hierarchy over the nodes (Clauset et al., 2008), so we expect the Poincaré ball latent space to help in terms of generalisation. We demonstrate these capabilities on three network datasets: a graph of Ph.D. advisor–advisee relationships (Nooy et al., 2011), a phylogenetic tree expressing genetic heritage (Hofbauer et al., 2016; Sanderson et al., 1994) and a biological set representing disease relationships (Goh et al., 2007; Rossi and Ahmed, 2015). We follow the VGAE model, which maps the adjacency matrix $A$ to node embeddings $Z$ through a graph convolutional network (GCN), and reconstructs $A$ by predicting edge probabilities from the node embeddings. In order to take into account the latent space geometry, we parametrise the probability of an edge by $p(A_{ij} = 1 \mid z_i, z_j) = 1 - \tanh(d_\mathcal{M}(z_i, z_j)) \in (0, 1]$, with $d_\mathcal{M}$ the latent geodesic metric; a code sketch of this edge decoder is given below. We use a wrapped normal prior and variational posterior for the $\mathcal{P}^1$-VAE.
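In code, the geometry-aware edge decoder is a one-liner on top of the distance helper from the Section 2.2 sketch; the Bernoulli reconstruction term below reflects our reading of the VGAE objective, and the released code may organise it differently.

```python
import numpy as np

def edge_prob(zi, zj, c):
    """p(A_ij = 1 | z_i, z_j) = 1 - tanh(d_p^c(z_i, z_j))."""
    return 1.0 - np.tanh(dist(zi, zj, c))

def adjacency_log_likelihood(A, Z, c):
    """Bernoulli reconstruction term, summed over node pairs i < j."""
    n = A.shape[0]
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = np.clip(edge_prob(Z[i], Z[j], c), 1e-7, 1 - 1e-7)
            ll += A[i, j] * np.log(p) + (1 - A[i, j]) * np.log(1 - p)
    return ll
```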
We set the latent dimension to 5 and follow the training and evaluation procedures introduced in Kipf and Welling (2016). Models are trained on an incomplete adjacency matrix from which some of the edges have been removed at random. A test set is formed from the previously removed edges and an equal number of randomly sampled pairs of unconnected nodes. We report in Table 4 the area under the ROC curve (AUC) and the average precision (AP) evaluated on the test set. The $\mathcal{P}$-VAE performs better than its Euclidean counterpart in terms of generalisation to unseen edges.

Table 4: Results on network link prediction. 95% confidence intervals are computed over 40 runs.

  Model  | Phylogenetic AUC | Phylogenetic AP | CS PhDs AUC | CS PhDs AP | Diseases AUC | Diseases AP
  N-VAE  | 54.2 ± 2.2       | 54.0 ± 2.1      | 56.5 ± 1.1  | 56.4 ± 1.1 | 89.8 ± 0.7   | 91.8 ± 0.7
  P-VAE  | 59.0 ± 1.9       | 55.5 ± 1.6      | 59.8 ± 1.2  | 56.7 ± 1.2 | 92.3 ± 0.7   | 93.6 ± 0.5

6 Conclusion

In this paper we have explored VAEs with a Poincaré ball latent space. We gave a thorough treatment of two canonical normal generalisations on that space, the wrapped and maximum-entropy normals, and a rigorous analysis of the difference between the two. We derived the necessary ingredients for training such VAEs, namely efficient and reparametrisable sampling schemes, along with probability density functions for these two distributions. We introduced a decoder architecture that explicitly takes into account the hyperbolic geometry, and empirically showed that it is crucial for the hyperbolic latent space to be useful. We empirically demonstrated that endowing a VAE with a Poincaré ball latent space can be beneficial in terms of model generalisation and can yield more interpretable representations if the data has hierarchical structure.

There are a number of interesting future directions. There are many models of hyperbolic geometry, and several have been considered in a gradient-based setting; yet it is still unclear which models should be preferred and which of their properties matter. Also, it would be useful to consider principled ways of assessing whether a given dataset has an underlying hierarchical structure, in the same way that topological data analysis (Pascucci et al., 2011) attempts to discover the topologies that underlie datasets.

Acknowledgments. We are extremely grateful to Adam Foster, Phillipe Gagnon and Emmanuel Chevallier for their help. EM and YWT's research leading to these results received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement no. 617071. They acknowledge Microsoft Research and EPSRC for funding EM's studentship, and EPSRC grant agreement no. EP/N509711/1 for funding CL's studentship.

References

Beltrami, E. (1868). Teoria fondamentale degli spazii di curvatura costante: memoria. F. Zanetti.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Bonnabel, S. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229.

Box, G. E. P. and Muller, M. E. (1958). A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29(2):610–611.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint.

Chamberlain, B. P., Clough, J., and Deisenroth, M. P. (2017). Neural embeddings of graphs in hyperbolic space. arXiv preprint.
Chevallier, E., Barbaresco, F., and Angulo, J. (2015). Probability density estimation on the hyperbolic space applied to radar processing. In Nielsen, F. and Barbaresco, F., editors, Geometric Science of Information, pages 753–761, Cham. Springer International Publishing.

Clauset, A., Moore, C., and Newman, M. E. J. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453:98–101.

Coates, A., Lee, H., and Ng, A. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215–223.

Collins, A. and Quillian, M. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8:240–248.

Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favored Races in the Struggle for Life. Murray, London.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelligence (UAI).

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification (2nd edition). Wiley-Interscience, New York, NY, USA.

Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forré, P., and Cohen, T. S. (2018). Explorations in homeomorphic variational auto-encoding. arXiv e-prints, arXiv:1807.04689.

Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit reparameterization gradients. In International Conference on Neural Information Processing Systems, pages 439–450.

Ganea, O.-E., Bécigneul, G., and Hofmann, T. (2018). Hyperbolic neural networks. In International Conference on Neural Information Processing Systems, pages 5350–5360.

Ghahramani, Z., Jordan, M. I., and Adams, R. P. (2010). Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems 23, pages 19–27. Curran Associates, Inc.

Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41:337–348.

Goh, K.-I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., and Barabási, A.-L. (2007). The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685–8690.

Grattarola, D., Livi, L., and Alippi, C. (2019). Adversarial autoencoders with constant-curvature latent manifolds. Applied Soft Computing, 81.

Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., and Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16, pages 17–24. MIT Press.

Hauberg, S. (2018). Directional statistics with the spherical normal distribution. In Proceedings of the 21st International Conference on Information Fusion (FUSION 2018), pages 704–711. IEEE.

Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In International Conference on Machine Learning (ICML '05), pages 297–304, New York, NY, USA. ACM.

Hofbauer, W., Forrest, L., Hollingsworth, P. M., and Hart, M. (2016). Preliminary insights from DNA barcoding into the diversity of mosses colonising modern building surfaces. Bryophyte Diversity and Evolution, 38:1.
Hsu, E. P. (2008). A brief introduction to Brownian motion on a Riemannian manifold.

Huang, F. J. and LeCun, Y. (2006). Large-scale learning with SVM and convolutional nets for generic object categorization. In CVPR (1), pages 284–291. IEEE Computer Society.

Keil, F. (1979). Semantic and Conceptual Development: An Ontological Perspective. Cognitive Science Series. Harvard University Press.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

Kipf, T. N. and Welling, M. (2016). Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning.

Larsen, J., Szymkowiak, A., and Hansen, L. K. (2001). Probabilistic hierarchical clustering with labeled and unlabeled data.

LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database.

Ley, C. and Verdebout, T. (2017). Modern Directional Statistics. Chapman and Hall/CRC.

Li, H., Lindenbaum, O., Cheng, X., and Cloninger, A. (2019). Variational random walk autoencoders. arXiv preprint.

Mardia, K. and Jupp, P. (2009). Directional Statistics. Wiley Series in Probability and Statistics. Wiley.

Nagano, Y., Yamaguchi, S., Fujita, Y., and Koyama, M. (2019). A differentiable Gaussian-like distribution on hyperbolic space for gradient-based learning. In International Conference on Machine Learning (ICML).

Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6341–6350.

Nickel, M. and Kiela, D. (2018). Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning (ICML).

Nooy, W. D., Mrvar, A., and Batagelj, V. (2011). Exploratory Social Network Analysis with Pajek. Cambridge University Press, New York, NY, USA.

Ovinnikov, I. (2018). Poincaré Wasserstein autoencoder. NeurIPS Workshop on Bayesian Deep Learning, pages 1–8.

Paeng, S.-H. (2011). Brownian motion on manifolds with time-dependent metrics and stochastic completeness. Journal of Geometry and Physics, 61(5):940–946.

Pascucci, V., Tricoche, X., Hagen, H., and Tierny, J. (2011). Topological Methods in Data Analysis and Visualization: Theory, Algorithms, and Applications. Springer, 1st edition.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Pennec, X. (2006). Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127.

Petersen, P. (2006). Riemannian Geometry. Springer-Verlag New York.

Rey, L. A. P., Menkovski, V., and Portegies, J. W. (2019). Diffusion variational autoencoders. arXiv preprint.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML).
Rossi, R. A. and Ahmed, N. K. (2015). The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI '15), pages 4292–4293. AAAI Press.

Roy, D. M., Kemp, C., Mansinghka, V. K., and Tenenbaum, J. B. (2006). Learning annotated hierarchies from relational data. In NIPS, pages 1185–1192. MIT Press.

Saha, S., Varma, G., and Jawahar, C. V. (2018). Class2Str: End to end latent hierarchy learning. In International Conference on Pattern Recognition (ICPR), pages 1000–1005.

Said, S., Bombrun, L., and Berthoumieu, Y. (2014). New Riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031.

Sala, F., De Sa, C., Gu, A., and Ré, C. (2018). Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4460–4469. PMLR.

Salakhutdinov, R. R., Tenenbaum, J. B., and Torralba, A. (2011). One-shot learning with a hierarchical nonparametric Bayesian model. In ICML Workshop on Unsupervised and Transfer Learning.

Sanderson, M. J., Donoghue, M. J., Piel, W., and Eriksson, T. (1994). TreeBASE: a prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. American Journal of Botany.

Sarkar, R. (2012). Low distortion Delaunay embedding of trees in hyperbolic plane. In Graph Drawing, pages 355–366, Berlin, Heidelberg. Springer.

Teh, Y. W., Daumé III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems 20, pages 1473–1480. Curran Associates, Inc.

Tifrea, A., Becigneul, G., and Ganea, O.-E. (2019). Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations (ICLR).

Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. (2018). Wasserstein auto-encoders. In International Conference on Learning Representations.

Ungar, A. A. (2008). A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194.
", "award": [], "sourceid": 6826, "authors": [{"given_name": "Emile", "family_name": "Mathieu", "institution": "University of Oxford"}, {"given_name": "Charline", "family_name": "Le Lan", "institution": "University of Oxford"}, {"given_name": "Chris", "family_name": "Maddison", "institution": "Institute for Advanced Study, Princeton"}, {"given_name": "Ryota", "family_name": "Tomioka", "institution": "Microsoft Research Cambridge"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford, DeepMind"}]}