{"title": "Hyperbolic Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5345, "page_last": 5355, "abstract": "Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeliness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, firstly because of the absence of corresponding hyperbolic neural network layers. Here, we bridge this gap in a principled manner by combining the formalism of M\u00f6bius gyrovector spaces with the Riemannian geometry of the Poincar\u00e9 model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks. This allows to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.", "full_text": "Hyperbolic Neural Networks\n\nOctavian-Eugen Ganea\u2217\nDept. of Computer Science\n\nETH Z\u00fcrich\n\nZurich, Switzerland\n\nDept. of Computer Science\n\nDept. of Computer Science\n\nThomas Hofmann\n\nETH Z\u00fcrich\n\nZurich, Switzerland\n\nGary B\u00e9cigneul\u2217\n\nETH Z\u00fcrich\n\nZurich, Switzerland\n\nAbstract\n\nHyperbolic spaces have recently gained momentum in the context of machine\nlearning due to their high capacity and tree-likeliness properties. However, the\nrepresentational power of hyperbolic geometry is not yet on par with Euclidean\ngeometry, mostly because of the absence of corresponding hyperbolic neural\nnetwork layers. This makes it hard to use hyperbolic embeddings in downstream\ntasks. 
Here, we bridge this gap in a principled manner by combining the formalism of M\u00f6bius gyrovector spaces with the Riemannian geometry of the Poincar\u00e9 model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks such as gated recurrent units. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.

1 Introduction

It is common in machine learning to represent data as being embedded in the Euclidean space R^n. The main reason for such a choice is simply convenience, as this space has a vectorial structure, closed-form formulas for distance and inner product, and is the natural generalization of our intuition-friendly, visual three-dimensional space. Moreover, embedding entities in such a continuous space allows us to feed them as input to neural networks, which has led to unprecedented performance on a broad range of problems, including sentiment detection [15], machine translation [3], textual entailment [22] or knowledge base link prediction [20, 6].

Despite the success of Euclidean embeddings, recent research has shown that many types of complex data (e.g. graph data) from a multitude of fields (e.g. Biology, Network Science, Computer Graphics or Computer Vision) exhibit a highly non-Euclidean latent anatomy [8]. In such cases, the Euclidean space does not provide the most powerful or meaningful geometrical representations. For example, [10] shows that arbitrary tree structures cannot be embedded with arbitrarily low distortion (i.e.
almost preserving their metric) in the Euclidean space with an unbounded number of dimensions, but this task becomes surprisingly easy in the hyperbolic space with only 2 dimensions, where the exponential growth of distances matches the exponential growth of nodes with the tree depth.

The adoption of neural networks and deep learning in these non-Euclidean settings has been rather limited until very recently, the main reason being the non-trivial or impossible principled generalizations of basic operations (e.g. vector addition, matrix-vector multiplication, vector translation, vector inner product) as well as, in more complex geometries, the lack of closed-form expressions for basic objects (e.g. distances, geodesics, parallel transport). Thus, classic tools such as multinomial logistic regression (MLR), feed-forward (FFNN) or recurrent neural networks (RNN) did not have a correspondence in these geometries.

\u2217 Equal contribution, correspondence at {octavian.ganea,gary.becigneul}@inf.ethz.ch

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.

How should one generalize deep neural models to non-Euclidean domains? In this paper we address this question for one of the simplest, yet useful, non-Euclidean domains: spaces of constant negative curvature, i.e. hyperbolic spaces. Their tree-likeness properties have been extensively studied [12, 13, 26] and used to visualize large taxonomies [18] or to embed heterogeneous complex networks [17]. In machine learning, hyperbolic representations have recently greatly outperformed Euclidean embeddings for hierarchical, taxonomic or entailment data [21, 10, 11]. Disjoint subtrees from the latent hierarchical structure surprisingly disentangle and cluster in the embedding space as a simple reflection of the space\u2019s negative curvature. However, appropriate deep learning tools are needed to embed feature data in this space and use it in downstream tasks.
For example, implicitly hierarchical sequence data (e.g. textual entailment data, phylogenetic trees of DNA sequences or hierarchical captions of images) would benefit from suitable hyperbolic RNNs.

The main contribution of this paper is to bridge the gap between hyperbolic and Euclidean geometry in the context of neural networks and deep learning by generalizing in a principled manner both the basic operations and multinomial logistic regression (MLR), feed-forward (FFNN), simple and gated (GRU) recurrent neural networks (RNN) to the Poincar\u00e9 model of hyperbolic geometry. We do so by connecting the theory of gyrovector spaces and generalized M\u00f6bius transformations introduced by [2, 26] with the Riemannian geometry properties of the manifold. We smoothly parametrize basic operations and objects in all spaces of constant negative curvature using a unified framework that depends only on the curvature value. Thus, we show how Euclidean and hyperbolic spaces can be continuously deformed into each other. On a series of experiments and datasets we showcase the effectiveness of our hyperbolic neural network layers compared to their \"classic\" Euclidean variants on textual entailment and noisy-prefix recognition tasks. We hope that this paper will open exciting future directions in the nascent field of Geometric Deep Learning.

2 The Geometry of the Poincar\u00e9 Ball

Basics of differential geometry are presented in Appendix A.

2.1 Hyperbolic space: the Poincar\u00e9 ball

The hyperbolic space has five isometric models that one can work with [9]. As in [21] and [11], we choose to work in the Poincar\u00e9 ball.
The Poincar\u00e9 ball model (D^n, g^D) is defined by the manifold D^n = {x \u2208 R^n : \u2016x\u2016 < 1} equipped with the following Riemannian metric:

g^D_x = \u03bb_x^2 g^E, where \u03bb_x := 2/(1 - \u2016x\u2016^2), (1)

g^E = I_n being the Euclidean metric tensor. Note that the hyperbolic metric tensor is conformal to the Euclidean one. The induced distance between two points x, y \u2208 D^n is known to be given by

d_D(x, y) = cosh^{-1}( 1 + 2\u2016x - y\u2016^2 / ((1 - \u2016x\u2016^2)(1 - \u2016y\u2016^2)) ). (2)

Since the Poincar\u00e9 ball is conformal to Euclidean space, the angle between two vectors u, v \u2208 T_x D^n \\ {0} is given by

cos(\u2220(u, v)) = g^D_x(u, v) / (\u221a(g^D_x(u, u)) \u221a(g^D_x(v, v))) = \u27e8u, v\u27e9 / (\u2016u\u2016\u2016v\u2016). (3)

2.2 Gyrovector spaces

In Euclidean space, natural operations inherited from the vectorial structure, such as vector addition, subtraction and scalar multiplication, are often useful. The framework of gyrovector spaces provides an elegant non-associative algebraic formalism for hyperbolic geometry, just as vector spaces provide the algebraic setting for Euclidean geometry [2, 25, 26]. In particular, these operations are used in special relativity, allowing one to add speed vectors belonging to the Poincar\u00e9 ball of radius c (the celerity, i.e. the speed of light) so that they remain in the ball, hence not exceeding the speed of light.

We will make extensive use of these operations in our definitions of hyperbolic neural networks. For c \u2265 0, denote^2 by D^n_c := {x \u2208 R^n : c\u2016x\u2016^2 < 1}. Note that if c = 0, then D^n_c = R^n; if c > 0, then D^n_c is the open ball of radius 1/\u221ac. If c = 1, we recover the usual ball D^n. Note that for c, c' > 0, D^n_c and D^n_{c'} are isometric.

M\u00f6bius addition. The M\u00f6bius addition of x and y in D^n_c is defined as

x \u2295c y := ((1 + 2c\u27e8x, y\u27e9 + c\u2016y\u2016^2) x + (1 - c\u2016x\u2016^2) y) / (1 + 2c\u27e8x, y\u27e9 + c^2 \u2016x\u2016^2 \u2016y\u2016^2). (4)

In particular, when c = 0, one recovers the Euclidean addition of two vectors in R^n. Note that, without loss of generality, the case c > 0 can be reduced to c = 1. Unless stated otherwise, we will write \u2295 for \u22951 to simplify notations. For general c > 0, this operation is neither commutative nor associative. However, it satisfies x \u2295c 0 = 0 \u2295c x = x. Moreover, for any x, y \u2208 D^n_c, we have (-x) \u2295c x = x \u2295c (-x) = 0 and (-x) \u2295c (x \u2295c y) = y (left-cancellation law). The M\u00f6bius subtraction is then defined by the notation x \u2296c y := x \u2295c (-y). See [29, Section 2.1] for a geometric interpretation of the M\u00f6bius addition.

M\u00f6bius scalar multiplication. For c > 0, the M\u00f6bius scalar multiplication of x \u2208 D^n_c \\ {0} by r \u2208 R is defined as

r \u2297c x := (1/\u221ac) tanh(r tanh^{-1}(\u221ac \u2016x\u2016)) x/\u2016x\u2016, (5)

and r \u2297c 0 := 0. Note that, similarly as for the M\u00f6bius addition, one recovers the Euclidean scalar multiplication when c goes to zero: lim_{c\u21920} r \u2297c x = rx. This operation satisfies desirable properties such as n \u2297c x = x \u2295c \u00b7\u00b7\u00b7 \u2295c x (n additions), (r + r') \u2297c x = r \u2297c x \u2295c r' \u2297c x (scalar distributivity^3), (rr') \u2297c x = r \u2297c (r' \u2297c x) (scalar associativity) and |r| \u2297c x/\u2016r \u2297c x\u2016 = x/\u2016x\u2016 (scaling property).

Distance. If one defines the generalized hyperbolic metric tensor g^c as the metric conformal to the Euclidean one, with conformal factor \u03bb^c_x := 2/(1 - c\u2016x\u2016^2), then the induced distance function on (D^n_c, g^c) is given by^4

d_c(x, y) = (2/\u221ac) tanh^{-1}(\u221ac \u2016-x \u2295c y\u2016). (6)

Again, observe that lim_{c\u21920} d_c(x, y) = 2\u2016x - y\u2016, i.e. we recover Euclidean geometry in the limit^5. Moreover, for c = 1 we recover d_D of Eq. (2).

Hyperbolic trigonometry. Similarly as in the Euclidean space, one can define the notions of hyperbolic angles or gyroangles (when using \u2295c), as well as a hyperbolic law of sines in the generalized Poincar\u00e9 ball (D^n_c, g^c). We make use of these notions in our proofs.
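To make these operations concrete, here is a minimal NumPy sketch of Eqs. (4)-(6); the function names (mobius_add, mobius_scalar_mul, dist_c) are ours, chosen for illustration, and not taken from any released implementation:

```python
import numpy as np

def mobius_add(x, y, c):
    # Mobius addition of Eq. (4)
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def mobius_scalar_mul(r, x, c):
    # Mobius scalar multiplication of Eq. (5)
    nx = np.linalg.norm(x)
    if nx == 0:
        return x
    return np.tanh(r * np.arctanh(np.sqrt(c) * nx)) * x / (np.sqrt(c) * nx)

def dist_c(x, y, c):
    # induced distance of Eq. (6)
    d = mobius_add(-x, y, c)
    return (2 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(d))

x = np.array([0.1, -0.2, 0.3])
y = np.array([-0.4, 0.1, 0.2])

# left-cancellation law: (-x) (+)_c (x (+)_c y) = y
assert np.allclose(mobius_add(-x, mobius_add(x, y, 1.0), 1.0), y)

# scalar distributivity: (r + r') (*)_c x = r (*)_c x (+)_c r' (*)_c x
lhs = mobius_scalar_mul(0.8 + 0.5, x, 1.0)
rhs = mobius_add(mobius_scalar_mul(0.8, x, 1.0), mobius_scalar_mul(0.5, x, 1.0), 1.0)
assert np.allclose(lhs, rhs)

# c -> 0 recovers Euclidean addition and twice the Euclidean distance
assert np.allclose(mobius_add(x, y, 1e-12), x + y)
assert np.isclose(dist_c(x, y, 1e-9), 2 * np.linalg.norm(x - y), rtol=1e-4)

# for c = 1, Eq. (6) agrees with the cosh^-1 formula of Eq. (2)
d2 = np.arccosh(1 + 2 * np.dot(x - y, x - y) / ((1 - np.dot(x, x)) * (1 - np.dot(y, y))))
assert np.isclose(dist_c(x, y, 1.0), d2)
```

The assertions numerically confirm the left-cancellation law, scalar distributivity, the Euclidean limit c -> 0, and the agreement of Eq. (6) with Eq. (2) at c = 1.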
See Appendix B.

2.3 Connecting gyrovector spaces and the Riemannian geometry of the Poincar\u00e9 ball

In this subsection, we present how geodesics in the Poincar\u00e9 ball model are usually described with M\u00f6bius operations, and push one step further the existing connection between gyrovector spaces and the Poincar\u00e9 ball by finding new identities involving the exponential map and parallel transport. In particular, these findings provide us with a simpler formulation of M\u00f6bius scalar multiplication, yielding a natural definition of matrix-vector multiplication in the Poincar\u00e9 ball.

Riemannian gyroline element. The Riemannian gyroline element is defined for an infinitesimal dx as ds := (x + dx) \u2296c x, and its size is given by [26, Section 3.7]:

\u2016ds\u2016 = \u2016(x + dx) \u2296c x\u2016 = \u2016dx\u2016/(1 - c\u2016x\u2016^2). (7)

What is remarkable is that it turns out to be identical, up to a scaling factor of 2, to the usual line element 2\u2016dx\u2016/(1 - c\u2016x\u2016^2) of the Riemannian manifold (D^n_c, g^c).

^2 We take different notations than [25], where the author uses s = 1/\u221ac.
^3 \u2297c has priority over \u2295c, in the sense that a \u2297c b \u2295c c := (a \u2297c b) \u2295c c and a \u2295c b \u2297c c := a \u2295c (b \u2297c c).
^4 The notation -x \u2295c y should always be read as (-x) \u2295c y and not -(x \u2295c y).
^5 The factor 2 comes from the conformal factor \u03bb_x = 2/(1 - \u2016x\u2016^2), which is a convention setting the curvature to -1.

Geodesics. The geodesic connecting points x, y \u2208 D^n_c is shown in [2, 26] to be given by:

\u03b3_{x\u2192y}(t) := x \u2295c (-x \u2295c y) \u2297c t, with \u03b3_{x\u2192y} : R \u2192 D^n_c s.t.
\u03b3_{x\u2192y}(0) = x and \u03b3_{x\u2192y}(1) = y. (8)

Note that when c goes to 0, geodesics become straight lines, recovering Euclidean geometry. In the remainder of this subsection, we connect the gyrospace framework with Riemannian geometry.

Lemma 1. For any x \u2208 D^n_c and v \u2208 T_x D^n_c s.t. g^c_x(v, v) = 1, the unit-speed geodesic starting from x with direction v is given by:

\u03b3_{x,v}(t) = x \u2295c (tanh(\u221ac t/2) v/(\u221ac \u2016v\u2016)), where \u03b3_{x,v} : R \u2192 D^n_c s.t. \u03b3_{x,v}(0) = x and \u02d9\u03b3_{x,v}(0) = v. (9)

Proof. One can use Eq. (8) and reparametrize it to unit-speed using Eq. (6). Alternatively, direct computation and identification with the formula in [11, Thm. 1] gives the same result. Using Eq. (6) and Eq. (9), one can sanity-check that d_c(\u03b3(0), \u03b3(t)) = t, \u2200t \u2208 [0, 1].

Exponential and logarithmic maps. The following lemma gives the closed-form derivation of the exponential and logarithmic maps.

Lemma 2. For any point x \u2208 D^n_c, the exponential map exp^c_x : T_x D^n_c \u2192 D^n_c and the logarithmic map log^c_x : D^n_c \u2192 T_x D^n_c are given for v \u2260 0 and y \u2260 x by:

exp^c_x(v) = x \u2295c (tanh(\u221ac \u03bb^c_x \u2016v\u2016/2) v/(\u221ac \u2016v\u2016)), log^c_x(y) = (2/(\u221ac \u03bb^c_x)) tanh^{-1}(\u221ac \u2016-x \u2295c y\u2016) (-x \u2295c y)/\u2016-x \u2295c y\u2016. (10)

Proof. Following the proof of [11, Cor. 1.1], one gets exp^c_x(v) = \u03b3_{x, v/(\u03bb^c_x \u2016v\u2016)}(\u03bb^c_x \u2016v\u2016). Using Eq. (9) gives the formula for exp^c_x. An algebraic check of the identity log^c_x(exp^c_x(v)) = v concludes.

The above maps have more appealing forms when x = 0, namely for v \u2208 T_0 D^n_c \\ {0}, y \u2208 D^n_c \\ {0}:

exp^c_0(v) = tanh(\u221ac \u2016v\u2016) v/(\u221ac \u2016v\u2016), log^c_0(y) = tanh^{-1}(\u221ac \u2016y\u2016) y/(\u221ac \u2016y\u2016). (11)

Moreover, we still recover Euclidean geometry in the limit c \u2192 0, as lim_{c\u21920} exp^c_x(v) = x + v is the Euclidean exponential map, and lim_{c\u21920} log^c_x(y) = y - x is the Euclidean logarithmic map.

M\u00f6bius scalar multiplication using exponential and logarithmic maps. We studied the exponential and logarithmic maps in order to gain a better understanding of the M\u00f6bius scalar multiplication (Eq. (5)). We found the following:

Lemma 3. The quantity r \u2297c x can be obtained by projecting x into the tangent space at 0 with the logarithmic map, multiplying this projection by the scalar r in T_0 D^n_c, and then projecting it back on the manifold with the exponential map:

r \u2297c x = exp^c_0(r log^c_0(x)), \u2200r \u2208 R, x \u2208 D^n_c. (12)

In addition, we recover the well-known relation between geodesics connecting two points and the exponential map:

\u03b3_{x\u2192y}(t) = x \u2295c (-x \u2295c y) \u2297c t = exp^c_x(t log^c_x(y)), t \u2208 [0, 1]. (13)

This last result enables us to generalize scalar multiplication in order to define matrix-vector multiplication between Poincar\u00e9 balls, one of the essential building blocks of hyperbolic neural networks.

Parallel transport. Finally, we connect parallel transport along the unique geodesic from 0 to x to gyrovector spaces with the following theorem, which we prove in Appendix C.

Theorem 4. In the manifold (D^n_c, g^c), the parallel transport w.r.t.
the Levi-Civita connection of a vector v \u2208 T_0 D^n_c to another tangent space T_x D^n_c is given by the following isometry:

P^c_{0\u2192x}(v) = log^c_x(x \u2295c exp^c_0(v)) = (\u03bb^c_0/\u03bb^c_x) v. (14)

As we will see later, this result is crucial in order to define and optimize parameters shared between different tangent spaces, such as biases in hyperbolic neural layers or the parameters of hyperbolic MLR.

3 Hyperbolic Neural Networks

Neural networks can be seen as compositions of basic operations, such as linear maps, bias translations, pointwise non-linearities and a final sigmoid or softmax layer. We first explain how to construct a softmax layer for logits lying in a Poincar\u00e9 ball. Then, we explain how to transform a mapping between two Euclidean spaces into one between Poincar\u00e9 balls, yielding matrix-vector multiplication and pointwise non-linearities in the Poincar\u00e9 ball. Finally, we present possible adaptations of various recurrent neural networks to the hyperbolic domain.

3.1 Hyperbolic multiclass logistic regression

In order to perform multi-class classification on the Poincar\u00e9 ball, one needs to generalize multinomial logistic regression (MLR) \u2212 also called softmax regression \u2212 to the Poincar\u00e9 ball.

Reformulating Euclidean MLR. Let us first reformulate Euclidean MLR from the perspective of distances to margin hyperplanes, as in [19, Section 5].
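As a quick numerical sanity check of Eq. (11), Lemma 3 and Theorem 4 above, the following NumPy sketch (function names are ours, for illustration only) verifies that log at 0 inverts exp at 0, that Mobius scalar multiplication factors through the tangent space at 0, and that parallel transport from the origin is the claimed conformal scaling:

```python
import numpy as np

def mobius_add(x, y, c):
    # Mobius addition of Eq. (4)
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y) / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def lam(x, c):
    # conformal factor lambda^c_x = 2 / (1 - c ||x||^2)
    return 2 / (1 - c * np.dot(x, x))

def expmap0(v, c):
    # exponential map at the origin, Eq. (11)
    nv = np.linalg.norm(v)
    return v if nv == 0 else np.tanh(np.sqrt(c) * nv) * v / (np.sqrt(c) * nv)

def logmap0(y, c):
    # logarithmic map at the origin, Eq. (11)
    ny = np.linalg.norm(y)
    return y if ny == 0 else np.arctanh(np.sqrt(c) * ny) * y / (np.sqrt(c) * ny)

def logmap(x, y, c):
    # logarithmic map at x, Eq. (10)
    d = mobius_add(-x, y, c)
    nd = np.linalg.norm(d)
    return (2 / (np.sqrt(c) * lam(x, c))) * np.arctanh(np.sqrt(c) * nd) * d / nd

c = 1.0
x = np.array([0.3, -0.1, 0.2])
v = np.array([0.5, 0.4, -0.2])

# log_0 inverts exp_0
assert np.allclose(logmap0(expmap0(v, c), c), v)

# Lemma 3: r (*)_c x = exp_0(r log_0(x)), compared against Eq. (5) directly
r, nx = 1.7, np.linalg.norm(x)
direct = np.tanh(r * np.arctanh(np.sqrt(c) * nx)) * x / (np.sqrt(c) * nx)
assert np.allclose(expmap0(r * logmap0(x, c), c), direct)

# Theorem 4: P_{0->x}(v) = log_x(x (+)_c exp_0(v)) = (lam_0 / lam_x) v
pt = logmap(x, mobius_add(x, expmap0(v, c), c), c)
assert np.allclose(pt, (lam(np.zeros(3), c) / lam(x, c)) * v)
```

These identities are what justify, later in the paper, optimizing tangent-space parameters at the origin and transporting them where they are needed.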
This will allow us to easily generalize it. Given K classes, one learns a margin hyperplane for each class and uses softmax probabilities:

\u2200k \u2208 {1, ..., K}, p(y = k|x) \u221d exp(\u27e8a_k, x\u27e9 - b_k), where b_k \u2208 R, x, a_k \u2208 R^n. (15)

Note that any affine hyperplane in R^n can be written with a normal vector a and a scalar shift b:

H_{a,b} = {x \u2208 R^n : \u27e8a, x\u27e9 - b = 0}, where a \u2208 R^n \\ {0} and b \u2208 R. (16)

As in [19, Section 5], we note that \u27e8a, x\u27e9 - b = sign(\u27e8a, x\u27e9 - b) \u2016a\u2016 d(x, H_{a,b}). Using Eq. (15):

p(y = k|x) \u221d exp(sign(\u27e8a_k, x\u27e9 - b_k) \u2016a_k\u2016 d(x, H_{a_k,b_k})), b_k \u2208 R, x, a_k \u2208 R^n. (17)

As it is not immediately obvious how to generalize the Euclidean hyperplane of Eq. (16) to other spaces such as the Poincar\u00e9 ball, we reformulate it as follows:

\u02dcH_{a,p} = {x \u2208 R^n : \u27e8-p + x, a\u27e9 = 0} = p + {a}^\u22a5, where p \u2208 R^n, a \u2208 R^n \\ {0}. (18)

This new definition relates to the previous one as \u02dcH_{a,p} = H_{a,\u27e8a,p\u27e9}. Rewriting Eq. (17) with b = \u27e8a, p\u27e9:

p(y = k|x) \u221d exp(sign(\u27e8-p_k + x, a_k\u27e9) \u2016a_k\u2016 d(x, \u02dcH_{a_k,p_k})), with p_k, x, a_k \u2208 R^n. (19)

It is now natural to adapt the previous definition to the hyperbolic setting by replacing + with \u2295c:

Definition 3.1 (Poincar\u00e9 hyperplanes). For p \u2208 D^n_c and a \u2208 T_p D^n_c \\ {0}, let {a}^\u22a5 := {z \u2208 T_p D^n_c : g^c_p(z, a) = 0} = {z \u2208 T_p D^n_c : \u27e8z, a\u27e9 = 0}. Then, we define^6 Poincar\u00e9 hyperplanes as

\u02dcH^c_{a,p} := {x \u2208 D^n_c : \u27e8log^c_p(x), a\u27e9_p = 0} = exp^c_p({a}^\u22a5) = {x \u2208 D^n_c : \u27e8-p \u2295c x, a\u27e9 = 0}. (20)

The last equality is shown in Appendix D. \u02dcH^c_{a,p} can also be described as the union of images of all geodesics in D^n_c orthogonal to a and containing p. Notice that our definition matches that of hypergyroplanes, see [27, Definition 5.8]. A 3D hyperplane example is depicted in Fig. 1.

Next, we need the following theorem, proved in Appendix E:

Theorem 5.

d_c(x, \u02dcH^c_{a,p}) := inf_{w \u2208 \u02dcH^c_{a,p}} d_c(x, w) = (1/\u221ac) sinh^{-1}( 2\u221ac |\u27e8-p \u2295c x, a\u27e9| / ((1 - c\u2016-p \u2295c x\u2016^2) \u2016a\u2016) ). (21)

Final formula for MLR in the Poincar\u00e9 ball. Putting together Eq. (19) and Thm. 5, we get the hyperbolic MLR formulation. Given K classes and k \u2208 {1, ..., K}, p_k \u2208 D^n_c, a_k \u2208 T_{p_k} D^n_c \\ {0}:

p(y = k|x) \u221d exp(sign(\u27e8-p_k \u2295c x, a_k\u27e9) \u221a(g^c_{p_k}(a_k, a_k)) d_c(x, \u02dcH^c_{a_k,p_k})), \u2200x \u2208 D^n_c, (22)

or, equivalently,

p(y = k|x) \u221d exp( (\u03bb^c_{p_k} \u2016a_k\u2016/\u221ac) sinh^{-1}( 2\u221ac \u27e8-p_k \u2295c x, a_k\u27e9 / ((1 - c\u2016-p_k \u2295c x\u2016^2) \u2016a_k\u2016) ) ), \u2200x \u2208 D^n_c. (23)

Notice that when c goes to zero, this goes to p(y = k|x) \u221d exp(4\u27e8-p_k + x, a_k\u27e9) = exp((\u03bb^0_{p_k})^2 \u27e8-p_k + x, a_k\u27e9) = exp(\u27e8-p_k + x, a_k\u27e9_0), recovering the usual Euclidean softmax. However, at this point it is unclear how to perform optimization over a_k, since it lives in T_{p_k} D^n_c and hence depends on p_k. The solution is to write a_k = P^c_{0\u2192p_k}(a'_k) = (\u03bb^c_0/\u03bb^c_{p_k}) a'_k, where a'_k \u2208 T_0 D^n_c = R^n, and to optimize a'_k as a Euclidean parameter.

^6 Here \u27e8\u00b7,\u00b7\u27e9 denotes the (Euclidean) inner product of the ambient space.

3.2 Hyperbolic feed-forward layers

In order to define hyperbolic neural networks, it is crucial to define a canonically simple parametric family of transformations, playing the role of linear mappings in usual Euclidean neural networks, and to know how to apply pointwise non-linearities. Inspired by our reformulation of M\u00f6bius scalar multiplication in Eq. (12), we define:

Definition 3.2 (M\u00f6bius version).
For f : R^n \u2192 R^m, we define the M\u00f6bius version of f as the map from D^n_c to D^m_c given by

f^{\u2297c}(x) := exp^c_0(f(log^c_0(x))), (24)

where exp^c_0 : T_0 D^m_c \u2192 D^m_c and log^c_0 : D^n_c \u2192 T_0 D^n_c.

[Figure 1: An example of a hyperbolic hyperplane in D^3_1, plotted using sampling. The red point is p. The shown normal axis to the hyperplane through p is parallel to a.]

Note that, similarly as for other M\u00f6bius operations, we recover the Euclidean mapping in the limit c \u2192 0 if f is continuous, as lim_{c\u21920} f^{\u2297c}(x) = f(x). This definition satisfies a few desirable properties too, such as (f \u25e6 g)^{\u2297c} = f^{\u2297c} \u25e6 g^{\u2297c} for f : R^m \u2192 R^l and g : R^n \u2192 R^m (morphism property), and f^{\u2297c}(x)/\u2016f^{\u2297c}(x)\u2016 = f(x)/\u2016f(x)\u2016 for f(x) \u2260 0 (direction preserving). It is then straightforward to prove the following result:

Lemma 6 (M\u00f6bius matrix-vector multiplication). If M : R^n \u2192 R^m is a linear map, which we identify with its matrix representation, then \u2200x \u2208 D^n_c, if Mx \u2260 0 we have

M^{\u2297c}(x) = (1/\u221ac) tanh((\u2016Mx\u2016/\u2016x\u2016) tanh^{-1}(\u221ac \u2016x\u2016)) Mx/\u2016Mx\u2016, (25)

and M^{\u2297c}(x) = 0 if Mx = 0. Moreover, if we define the M\u00f6bius matrix-vector multiplication of M \u2208 M_{m,n}(R) and x \u2208 D^n_c by M \u2297c x := M^{\u2297c}(x), then we have (MM') \u2297c x = M \u2297c (M' \u2297c x) for M \u2208 M_{l,m}(R) and M' \u2208 M_{m,n}(R) (matrix associativity), (rM) \u2297c x = r \u2297c (M \u2297c x) for r \u2208 R and M \u2208 M_{m,n}(R) (scalar-matrix associativity) and M \u2297c x = Mx for all M \u2208 O_n(R) (rotations are preserved).

Pointwise non-linearity. If \u03d5 : R^n \u2192 R^n is a pointwise non-linearity, then its M\u00f6bius version \u03d5^{\u2297c} can be applied to elements of the Poincar\u00e9 ball.

Bias translation. The generalization of a translation in the Poincar\u00e9 ball is naturally given by moving along geodesics. But should we use the M\u00f6bius sum x \u2295c b with a hyperbolic bias b, or the exponential map exp^c_x(b') with a Euclidean bias b'? These views are unified with parallel transport (see Thm. 4). The M\u00f6bius translation of a point x \u2208 D^n_c by a bias b \u2208 D^n_c is given by

x \u2190 x \u2295c b = exp^c_x(P^c_{0\u2192x}(log^c_0(b))) = exp^c_x((\u03bb^c_0/\u03bb^c_x) log^c_0(b)). (26)

We recover Euclidean translations in the limit c \u2192 0. Note that bias translations play a particular role in this model. Indeed, consider multiple layers of the form f_k(x) = \u03d5_k(M_k x), each of which has M\u00f6bius version f_k^{\u2297c}(x) = \u03d5_k^{\u2297c}(M_k \u2297c x). Then their composition can be re-written f_k^{\u2297c} \u25e6 \u00b7\u00b7\u00b7 \u25e6 f_1^{\u2297c} = exp^c_0 \u25e6 f_k \u25e6 \u00b7\u00b7\u00b7 \u25e6 f_1 \u25e6 log^c_0. This means that these operations can essentially be performed in Euclidean space. Therefore, it is the interposition between those maps of the bias translation of Eq. (26) which differentiates this model from its Euclidean counterpart.

Concatenation of multiple input vectors. If a vector x \u2208 R^{n+p} is the (vertical) concatenation of two vectors x_1 \u2208 R^n, x_2 \u2208 R^p, and M \u2208 M_{m,n+p}(R) can be written as the (horizontal) concatenation of two matrices M_1 \u2208 M_{m,n}(R) and M_2 \u2208 M_{m,p}(R), then Mx = M_1 x_1 + M_2 x_2. We generalize this to hyperbolic spaces: if we are given x_1 \u2208 D^n_c, x_2 \u2208 D^p_c, x = (x_1 x_2)^T \u2208 D^n_c \u00d7 D^p_c, and M, M_1, M_2 as before, then we define M \u2297c x := M_1 \u2297c x_1 \u2295c M_2 \u2297c x_2. Note that when c goes to zero, we recover the Euclidean formulation, as lim_{c\u21920} M \u2297c x = lim_{c\u21920} M_1 \u2297c x_1 \u2295c M_2 \u2297c x_2 = M_1 x_1 + M_2 x_2 = Mx. Moreover, hyperbolic vectors x \u2208 D^n_c can also be \"concatenated\" with real features y \u2208 R by computing M \u2297c x \u2295c y \u2297c b, with learnable b \u2208 D^m_c and M \u2208 M_{m,n}(R).

3.3 Hyperbolic RNN

Naive RNN. A simple RNN can be defined by h_{t+1} = \u03d5(W h_t + U x_t + b), where \u03d5 is a pointwise non-linearity, typically tanh, sigmoid or ReLU. This formula can be naturally generalized to the hyperbolic space as follows. For parameters W \u2208 M_{m,n}(R), U \u2208 M_{m,d}(R), b \u2208 D^m_c, we define:

h_{t+1} = \u03d5^{\u2297c}(W \u2297c h_t \u2295c U \u2297c x_t \u2295c b), h_t \u2208 D^n_c, x_t \u2208 D^d_c. (27)

Note that if the inputs x_t are Euclidean, one can write \u02dcx_t := exp^c_0(x_t) and use the above formula, since exp^c_{W \u2297c h_t}(P^c_{0\u2192W \u2297c h_t}(U x_t)) = W \u2297c h_t \u2295c exp^c_0(U x_t) = W \u2297c h_t \u2295c U \u2297c \u02dcx_t.

GRU architecture.
One can also adapt the GRU architecture:

r_t = \u03c3(W^r h_{t-1} + U^r x_t + b^r), z_t = \u03c3(W^z h_{t-1} + U^z x_t + b^z),
\u02dch_t = \u03d5(W(r_t \u2299 h_{t-1}) + U x_t + b), h_t = (1 - z_t) \u2299 h_{t-1} + z_t \u2299 \u02dch_t, (28)

where \u2299 denotes the pointwise product. First, how should we adapt the pointwise multiplication by a scaling gate? Note that the definition of the M\u00f6bius version (see Eq. (24)) can be naturally extended to maps f : R^n \u00d7 R^p \u2192 R^m as f^{\u2297c} : (h, h') \u2208 D^n_c \u00d7 D^p_c \u21a6 exp^c_0(f(log^c_0(h), log^c_0(h'))). In particular, choosing f(h, h') := \u03c3(h) \u2299 h' yields^7 f^{\u2297c}(h, h') = exp^c_0(\u03c3(log^c_0(h)) \u2299 log^c_0(h')) = diag(\u03c3(log^c_0(h))) \u2297c h'. Hence we adapt r_t \u2299 h_{t-1} to diag(r_t) \u2297c h_{t-1}, and the reset gate r_t to:

r_t = \u03c3(log^c_0(W^r \u2297c h_{t-1} \u2295c U^r \u2297c x_t \u2295c b^r)), (29)

and similarly for the update gate z_t. Note that, as the argument of \u03c3 in the above is unbounded, r_t and z_t can a priori take values in the full range (0, 1). Now the intermediate hidden state becomes:

\u02dch_t = \u03d5^{\u2297c}((W diag(r_t)) \u2297c h_{t-1} \u2295c U \u2297c x_t \u2295c b), (30)

where M\u00f6bius matrix associativity simplifies W \u2297c (diag(r_t) \u2297c h_{t-1}) into (W diag(r_t)) \u2297c h_{t-1}. Finally, we propose to adapt the update-gate equation as

h_t = h_{t-1} \u2295c diag(z_t) \u2297c (-h_{t-1} \u2295c \u02dch_t). (31)

Note that when c goes to zero, one recovers the usual GRU. Moreover, if z_t = 0 or z_t = 1, then h_t becomes h_{t-1} or \u02dch_t respectively, similarly as in the usual GRU. This adaptation was obtained by adapting [24]: in that work, the authors re-derive the update-gate mechanism from a first principle called time-warping invariance. We adapted their derivation to the hyperbolic setting by using the notion of gyroderivative [4] and proving a gyro-chain-rule (see Appendix F).

^7 If x has n coordinates, then diag(x) denotes the diagonal matrix of size n with the x_i's on its diagonal.

4 Experiments

SNLI task and dataset. We evaluate our method on two tasks. The first is natural language inference, or textual entailment. Given two sentences, a premise (e.g. \"Little kids A. and B. are playing soccer.\") and a hypothesis (e.g. \"Two children are playing outdoors.\"), the binary classification task is to predict whether the second sentence can be inferred from the first one. This defines a partial order in the sentence space. We test hyperbolic networks on the biggest real dataset for this task, SNLI [7]. It consists of 570K training, 10K validation and 10K test sentence pairs. Following [28], we merge the \"contradiction\" and \"neutral\" classes into a single class of negative sentence pairs, while the \"entailment\" class gives the positive pairs.

Table 1: Test accuracies for various models and four datasets. \"Eucl\" denotes Euclidean, \"Hyp\" denotes hyperbolic. All word and sentence embeddings have dimension 5. We highlight in bold the best baseline (or baselines, if the difference is less than 0.5%).

MODEL | SNLI | PREFIX-10% | PREFIX-30% | PREFIX-50%
Fully Euclidean RNN | 79.34% | 89.62% | 81.71% | 72.10%
Hyp RNN+FFNN, Eucl MLR | 79.18% | 96.36% | 87.83% | 76.50%
Fully hyperbolic RNN | 78.21% | 96.91% | 87.25% | 62.94%
Fully Euclidean GRU | 81.52% | 95.96% | 86.47% | 75.04%
Hyp GRU+FFNN, Eucl MLR | 79.76% | 97.36% | 88.47% | 76.87%
Fully hyperbolic GRU | 81.19% | 97.14% | 88.26% | 76.44%

PREFIX task and datasets. We conjecture that the improvements of hyperbolic neural networks are more significant when the underlying data structure is closer to a tree.
To test this, we design a proof-of-concept task of noisy-prefix detection: given two sentences, one has to decide whether the second sentence is a noisy prefix of the first or a random sentence. We build synthetic datasets PREFIX-Z% (for Z being 10, 30 or 50) as follows: we sample a random first sentence of length at most 20 and a random prefix of it; a positive second sentence is obtained by randomly replacing Z% of the words of that prefix, and a negative second sentence of the same length is sampled at random. The word vocabulary size is 100, and we generate 500K training, 10K validation and 10K test pairs. Experimental details are presented in appendix G.

Model architecture. Our neural network layers can be used in a plug-and-play manner, exactly like standard Euclidean layers, and can also be combined with Euclidean layers. However, optimization w.r.t. hyperbolic parameters is different (see below): it is based on Riemannian gradients, which are just rescaled Euclidean gradients when working in the conformal Poincaré model [21]. Thus, back-propagation can be applied in the standard way.
In our setting, we embed the two sentences using two distinct hyperbolic RNNs or GRUs. The sentence embeddings are then fed, together with their squared distance (hyperbolic or Euclidean, depending on their geometry), to a FFNN (Euclidean or hyperbolic, see Sec. 3.2), which is further fed to an MLR (Euclidean or hyperbolic, see Sec. 3.1) that outputs probabilities of the two classes (entailment vs non-entailment). We use a cross-entropy loss on top. Note that hyperbolic and Euclidean layers can be mixed, e.g. the full network can be hyperbolic and only the last layer Euclidean, in which case one has to use the log0 and exp0 maps to move between the two manifolds in a correct manner, as explained for Eq. 24. For the results shown in Tab.
1, we run each model (baseline or ours) exactly 3 times and report the test result corresponding to the best validation result over these 3 runs. We do this because the highly non-convex optimization landscape of hyperbolic neural networks sometimes results in convergence to poor local minima, suggesting that initialization is very important.

Results. Results are shown in Tab. 1. Note that the fully Euclidean baseline models might have an advantage over the hyperbolic ones, because more sophisticated optimization algorithms such as Adam do not have a hyperbolic analogue at the moment. We first observe that all GRU models outperform their RNN variants. Hyperbolic RNNs and GRUs show the most significant improvement over their Euclidean variants when the underlying data structure is more tree-like: for PREFIX-10%, for which the tree relation between sentences and their prefixes is most prominent, we reduce the error by a factor of 3.35 for hyperbolic vs Euclidean RNN, and by a factor of 1.5 for hyperbolic vs Euclidean GRU. As the underlying structure diverges from a tree, the accuracy gap decreases; for example, on PREFIX-50% the noise heavily affects the representational power of hyperbolic networks. Also, note that on SNLI our methods perform on par with their Euclidean variants. Moreover, hyperbolic and Euclidean MLR are on par when used in conjunction with hyperbolic sentence embeddings, suggesting that further empirical investigation is needed in this direction (see below).

MLR classification experiments. For the sentence entailment classification task we do not see a clear advantage of hyperbolic MLR over its Euclidean variant. A possible reason is that, when trained end-to-end, the model might place positive and negative embeddings in a manner that is already well separated by a classic MLR.
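The architecture described above feeds the classifier the two sentence embeddings together with their squared distance; for hyperbolic embeddings this is the Poincaré distance d_c(x, y) = (2/√c) tanh⁻¹(√c ‖−x ⊕_c y‖) introduced earlier in the paper. A minimal single-example sketch, with function names of our choosing:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    # Mobius addition x (+)_c y on the Poincare ball of radius 1/sqrt(c)
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def poincare_dist(x, y, c=1.0):
    # d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)
    arg = min(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)), 1 - 1e-7)
    return 2.0 / np.sqrt(c) * np.arctanh(arg)

def pair_features(u, v, c=1.0):
    # concatenate the two sentence embeddings with their squared distance,
    # as fed to the FFNN + MLR head described in the text
    return np.concatenate([u, v, [poincare_dist(u, v, c) ** 2]])
```

The artanh factor makes distances blow up near the ball's boundary, which is what gives hyperbolic embeddings their extra capacity for tree-like data.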
As a consequence, we further investigate MLR on the task of subtree classification. Using an open-source implementation8 of [21], we pre-trained Poincaré embeddings of the WordNet noun hierarchy (82,115 nodes).

WORDNET SUBTREE   MODEL   D = 2           D = 3           D = 5           D = 10
ANIMAL.N.01       HYP     47.43 ± 1.07    91.92 ± 0.61    98.07 ± 0.55    99.26 ± 0.59
3218 / 798        EUCL    41.69 ± 0.19    68.43 ± 3.90    95.59 ± 1.18    99.36 ± 0.18
                  log0    38.89 ± 0.01    62.57 ± 0.61    89.21 ± 1.34    98.27 ± 0.70
GROUP.N.01        HYP     81.72 ± 0.17    89.87 ± 2.73    87.89 ± 0.80    91.91 ± 3.07
6649 / 1727       EUCL    61.13 ± 0.42    63.56 ± 1.22    67.82 ± 0.81    91.38 ± 1.19
                  log0    60.75 ± 0.24    61.98 ± 0.57    67.92 ± 0.74    91.41 ± 0.18
WORKER.N.01       HYP     12.68 ± 0.82    24.09 ± 1.49    55.46 ± 5.49    66.83 ± 11.38
861 / 254         EUCL    10.86 ± 0.01    22.39 ± 0.04    35.23 ± 3.16    47.29 ± 3.93
                  log0     9.04 ± 0.06    22.57 ± 0.20    26.47 ± 0.78    36.66 ± 2.74
MAMMAL.N.01       HYP     32.01 ± 17.14   87.54 ± 4.55    88.73 ± 3.22    91.37 ± 6.09
953 / 228         EUCL    15.58 ± 0.04    44.68 ± 1.87    59.35 ± 1.31    77.76 ± 5.08
                  log0    13.10 ± 0.13    44.89 ± 1.18    52.51 ± 0.85    56.11 ± 2.21

Table 2: Test F1 classification scores (%) for four different subtrees of the WordNet noun tree (below each subtree name: its number of positive training / test nodes). 95% confidence intervals over 3 different runs are shown for each method and each dimension. "Hyp" denotes our hyperbolic MLR, "Eucl" denotes directly applying Euclidean MLR to hyperbolic embeddings in their Euclidean parametrization, and log0 denotes applying Euclidean MLR in the tangent space at 0, after projecting all hyperbolic embeddings there with log0.
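The two Euclidean baselines in Table 2 differ only in the features handed to a standard Euclidean MLR: the raw ball coordinates ("Eucl") versus their image under the log0 map ("log0"). A minimal sketch of that projection step, with toy stand-ins for the pre-trained embeddings (the classifier itself can then be any off-the-shelf logistic regression):

```python
import numpy as np

EPS = 1e-7

def log0(y, c=1.0):
    # logarithmic map at 0: pulls a ball point back to the tangent space,
    # stretching its norm r to artanh(sqrt(c) r) / sqrt(c) while keeping its direction
    n = max(np.linalg.norm(y), EPS)
    return np.arctanh(min(np.sqrt(c) * n, 1 - EPS)) * y / (np.sqrt(c) * n)

# toy stand-ins for pre-trained Poincare embeddings (random, for illustration only)
rng = np.random.default_rng(0)
emb = rng.uniform(-0.6, 0.6, size=(100, 2))

eucl_features = emb                                # "Eucl": raw ball coordinates
log0_features = np.array([log0(e) for e in emb])   # "log0": tangent space at 0
```

The map is radial: directions are preserved and only norms are expanded, so the two baselines differ exactly in how points near the ball's boundary get spread out.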
We then choose one node in this tree (see Table 2) and classify all other nodes (solely based on their embeddings) as being part of the subtree rooted at this node. All nodes in such a subtree are divided into positive training nodes (80%) and positive test nodes (20%). The same splitting procedure is applied to the remaining WordNet nodes, which are divided into a negative training and a negative test set respectively. Three variants of MLR are then trained on top of the pre-trained Poincaré embeddings [21] to solve this binary classification task: hyperbolic MLR, Euclidean MLR applied directly to the hyperbolic embeddings (even if, mathematically, this does not respect the hyperbolic geometry) and Euclidean MLR applied after mapping all embeddings to the tangent space at 0 using the log0 map. More experimental details are given in appendix G.2. Quantitative results are presented in Table 2. We can see that the hyperbolic MLR outperforms its Euclidean variants in almost all settings, sometimes by a large margin. Moreover, to provide further understanding, we plot the 2-dimensional embeddings and the trained separation hyperplanes (geodesics in this case) in Figure 2.

Figure 2: Hyperbolic (left) vs direct Euclidean (right) binary MLR used to classify nodes as being part of the GROUP.N.01 subtree of the WordNet noun hierarchy, solely based on their Poincaré embeddings. The positive points (from the subtree) are in red, the negative points (the rest) are in yellow, and the trained positive separation hyperplane is depicted in green.

5 Conclusion

We showed how classic Euclidean deep learning tools such as MLR, FFNNs, RNNs or GRUs can be generalized in a principled manner to all spaces of constant negative curvature by combining Riemannian geometry with the elegant theory of gyrovector spaces.
Empirically, we found that our models outperform or are on par with corresponding Euclidean architectures on sequential data with implicit hierarchical structure. We hope to trigger exciting future research related to a better understanding of the non-convex optimization landscape of hyperbolic networks and to the development of other non-Euclidean deep learning methods. Our data and TensorFlow [1] code are publicly available9.

8 https://github.com/dalab/hyperbolic_cones
9 https://github.com/dalab/hyperbolic_nn

Acknowledgements

We thank Igor Petrovski for useful pointers regarding the implementation.
This research is funded by the Swiss National Science Foundation (SNSF) under grant agreement number 167176. Gary Bécigneul is also funded by the Max Planck ETH Center for Learning Systems.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.

[2] Abraham Albert Ungar. Analytic hyperbolic geometry and Albert Einstein's special theory of relativity. World Scientific, 2008.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[4] Graciela S Birman and Abraham A Ungar. The hyperbolic derivative in the Poincaré ball model of hyperbolic geometry. Journal of Mathematical Analysis and Applications, 254(1):321–333, 2001.

[5] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, Sept 2013.

[6] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data.
In Advances in Neural Information Processing Systems (NIPS), pages 2787–2795, 2013.

[7] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642. Association for Computational Linguistics, 2015.

[8] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[9] James W Cannon, William J Floyd, Richard Kenyon, Walter R Parry, et al. Hyperbolic geometry. Flavors of Geometry, 31:59–115, 1997.

[10] Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings. arXiv preprint arXiv:1804.03329, 2018.

[11] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the Thirty-Fifth International Conference on Machine Learning (ICML), 2018.

[12] Mikhael Gromov. Hyperbolic groups. In Essays in Group Theory, pages 75–263. Springer, 1987.

[13] Matthias Hamann. On the tree-likeness of hyperbolic spaces. Mathematical Proceedings of the Cambridge Philosophical Society, pages 1–17, 2017.

[14] Christopher Hopper and Ben Andrews. The Ricci flow in Riemannian geometry. Springer, 2010.

[15] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics, 2014.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[17] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010.

[18] John Lamping, Ramana Rao, and Peter Pirolli. A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 401–408. ACM Press/Addison-Wesley Publishing Co., 1995.

[19] Guy Lebanon and John Lafferty. Hyperplane margin classifiers on the multinomial manifold. In Proceedings of the International Conference on Machine Learning (ICML), page 66. ACM, 2004.

[20] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the International Conference on Machine Learning (ICML), volume 11, pages 809–816, 2011.

[21] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems (NIPS), pages 6341–6350, 2017.

[22] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[23] Michael Spivak. A comprehensive introduction to differential geometry. Publish or Perish, 1979.

[24] Corentin Tallec and Yann Ollivier. Can recurrent neural networks warp time? In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[25] Abraham A Ungar. Hyperbolic trigonometry and its application in the Poincaré ball model of hyperbolic geometry.
Computers & Mathematics with Applications, 41(1-2):135–147, 2001.

[26] Abraham Albert Ungar. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194, 2008.

[27] Abraham Albert Ungar. Analytic hyperbolic geometry in n dimensions: An introduction. CRC Press, 2014.

[28] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[29] J Vermeer. A geometric interpretation of Ungar's addition and of gyration in the hyperbolic plane. Topology and its Applications, 152(3):226–242, 2005.