{"title": "Tree-Sliced Variants of Wasserstein Distances", "book": "Advances in Neural Information Processing Systems", "page_first": 12304, "page_last": 12315, "abstract": "Optimal transport (OT) theory defines a powerful set of tools to compare probability distributions. OT suffers, however, from a few drawbacks, computational and statistical, which have encouraged the proposal of several regularized variants of OT in the recent literature, one of the most notable being the sliced formulation, which exploits the closed-form formula between univariate distributions by projecting high-dimensional measures onto random lines. We consider in this work a more general family of ground metrics, namely tree metrics, which also yield fast closed-form computations and are negative definite, and of which the sliced-Wasserstein distance is a particular case (the tree is a chain). We propose the tree-sliced Wasserstein distance, computed by averaging the Wasserstein distance between these measures using random tree metrics, built adaptively in either low or high-dimensional spaces. Exploiting the negative definiteness of that distance, we also propose a positive definite kernel, and test it against other baselines on a few benchmark tasks.", "full_text": "Tree-Sliced Variants of Wasserstein Distances\n\nTam Le\nRIKEN AIP, Japan\ntam.le@riken.jp\n\nKenji Fukumizu\nISM, Japan & RIKEN AIP, Japan\nfukumizu@ism.ac.jp\n\nMakoto Yamada\nKyoto University & RIKEN AIP, Japan\nmakoto.yamada@riken.jp\n\nMarco Cuturi\nGoogle Brain, Paris & CREST - ENSAE\ncuturi@google.com\n\nAbstract\n\nOptimal transport (OT) theory defines a powerful set of tools to compare probability distributions. 
OT suffers, however, from a few drawbacks, computational and statistical, which have encouraged the proposal of several regularized variants of OT in the recent literature, one of the most notable being the sliced formulation, which exploits the closed-form formula between univariate distributions by projecting high-dimensional measures onto random lines. We consider in this work a more general family of ground metrics, namely tree metrics, which also yield fast closed-form computations and are negative definite, and of which the sliced-Wasserstein distance is a particular case (the tree is a chain). We propose the tree-sliced Wasserstein distance, computed by averaging the Wasserstein distance between these measures using random tree metrics, built adaptively in either low or high-dimensional spaces. Exploiting the negative definiteness of that distance, we also propose a positive definite kernel, and test it against other baselines on a few benchmark tasks.\n\n1 Introduction\n\nMany tasks in machine learning involve the comparison of two probability distributions, or histograms. Several geometries in the statistics and machine learning literature are used for that purpose, such as the Kullback-Leibler divergence, the Fisher information metric, the χ2 distance, or the Hellinger distance, to name a few. Among them, the optimal transport (OT) geometry, also known as the Wasserstein [65], Monge-Kantorovich [34], or Earth Mover's [54] distance, has gained traction in the machine learning community [26, 39, 43], statistics [18, 50], and computer graphics [41, 61].\nThe naive computation of OT between two discrete measures involves solving a network flow problem whose computation typically scales cubically in the size of the measures [10]. There are two notable lines of work to reduce the time complexity of OT. (i) The first direction exploits the fact that simple ground costs can lead to faster computations. 
For instance, if one uses the binary metric d(x, z) = 1{x≠z} between two points x, z, the OT distance is equivalent to the total variation distance [64, p.7]. When measures are supported on the real line R and the cost c is a nonnegative convex function g applied to the difference z − x between two points, namely c(x, z) = g(z − x) for x, z ∈ R, then the OT distance is equal to the integral of g evaluated on the difference between the generalized quantile functions of these two probability distributions [57, §2]. Other simplifications include thresholding the ground cost distance [51] or considering for a ground cost the shortest-path metric on a graph [52, §6]. (ii) The second one is to use regularization to approximate solutions of OT problems, notably entropy [14], which results in a problem that can be solved using Sinkhorn iterations. Genevay et al. [26] extended this approach to the semi-discrete and continuous OT problems using stochastic optimization. Different variants of the Sinkhorn algorithm have been proposed recently [4, 17], and speed-ups are obtained when the ground cost is the quadratic Euclidean distance [2, 3], or more generally the heat kernel on geometric domains [61]. The convergence of the Sinkhorn algorithm has been considered in [4, 25].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: An illustration of a tree with root r, where x1, x2 are at depth level 1, and x6, x7 are at depth level 3. Path P(x3, x6) contains e3, e4, e6 (the green-dot path), Γ(x4) = {x4, x6, x7} (the yellow-dot subtree), ve4 = x4, and ue4 = x1.\n\nIn this work, we follow the first direction to provide a fast computation for OT. To do so, we consider tree metrics as ground costs for OT, which results in the so-called tree-Wasserstein (TW) distance [15, 21, 46]. 
We consider two practical procedures to sample tree metrics based on spatial information for both low-dimensional and high-dimensional spaces of supports. Using these random tree metrics, we propose tree-sliced-Wasserstein distances, obtained by averaging over several TW distances with various ground tree metrics. The TW distance, as well as its average over several trees, can be shown to be negative definite1. As a consequence, we propose a positive definite tree-(sliced-)Wasserstein kernel that generalizes the sliced-Wasserstein kernel [11, 36].\nThe paper is organized as follows: we give reminders on OT and tree metrics in Section 2, introduce the TW distance and its properties in Section 3, describe tree-sliced-Wasserstein variants with practical families of tree metrics and propose the tree-(sliced-)Wasserstein kernel in Section 4, provide connections of TW with other work in Section 5, and follow with experimental results on many benchmark datasets in word embedding-based document classification and topological data analysis in Section 6, before concluding in Section 7. We have released code for these tools2.\n\n2 Reminders on Optimal Transport and Tree Metrics\n\nIn this section, we briefly review definitions of optimal transport (OT) and tree metrics. Let Ω be a measurable space endowed with a metric d. For any x ∈ Ω, we write δx for the Dirac unit mass on x.\nOptimal transport. Let µ, ν be two Borel probability distributions on Ω, and let R(µ, ν) be the set of probability distributions π on the product space Ω × Ω such that π(A × Ω) = µ(A) and π(Ω × B) = ν(B) for all Borel sets A, B. 
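As an aside not in the original paper, the univariate closed form recalled earlier (the workhorse of the sliced formulation) can be sketched in a few lines; this illustrative sketch assumes two uniform empirical measures with the same number of support points and cost g(u) = |u|:

```python
import numpy as np

def wasserstein_1d(x, z):
    """1-Wasserstein distance between two empirical measures on R with the
    same number of uniformly weighted support points: with cost |z - x|,
    it reduces to the L1 distance between sorted samples, i.e. between the
    generalized quantile functions."""
    x, z = np.sort(np.asarray(x, float)), np.sort(np.asarray(z, float))
    return float(np.mean(np.abs(x - z)))
```

Sorting both samples is all the "transport" needed in one dimension, which is why projecting onto random lines makes sliced distances cheap.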
The 1-Wasserstein distance Wd [64, p.2] between µ, ν is defined as:\n\nWd(µ, ν) = inf { ∫_{Ω×Ω} d(x, z) π(dx, dz) | π ∈ R(µ, ν) } .   (1)\n\nLet Fd be the set of Lipschitz functions w.r.t. d, i.e. functions f : Ω → R such that |f(x) − f(z)| ≤ d(x, z), ∀x, z ∈ Ω. The dual of (1) simplifies to the following problem [64, Theorem 1.3, p.19]:\n\nWd(µ, ν) = sup { ∫_Ω f(x) µ(dx) − ∫_Ω f(z) ν(dz) | f ∈ Fd } .   (2)\n\nTree metrics. A metric d : Ω × Ω → R is called a tree metric on Ω if there exists a tree T with non-negative edge lengths such that all elements of Ω are contained in its nodes and such that, for every x, z ∈ Ω, d(x, z) equals the length of the (unique) path between x and z [58, §7, p.145–182]. We write dT for the tree metric corresponding to that tree.\n\n1In general, Wasserstein spaces are not Hilbertian [52, 
§8.3].\n2https://github.com/lttam/TreeWasserstein.\n\n3 Tree-Wasserstein Distances: Optimal Transport with Tree Metrics\n\nLozupone and co-authors [44, 45] first noticed, when proposing the UniFrac method in the metagenomics community, that the Wasserstein distance between two measures supported on the nodes of the same tree admits a closed form when the ground metric between the supports of the two measures is a tree metric. That method was used to compare microbial communities by measuring the phylogenetic distance between sets of taxa in a phylogenetic tree as the fraction of the branch length of the tree that leads to descendants from either one environment or the other, but not both [44]. In this section, we follow [15, 21, 46] to leverage the geometric structure of tree metrics, and recall their main result.\nLet T be a tree rooted at r with non-negative edge lengths, and let dT be the tree metric on T . 
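To make the tree-metric definition concrete, here is a minimal illustrative sketch (ours, not the authors' code; the node indices and edge lengths are made up): a rooted tree encoded by parent pointers, with d_T evaluated as the length of the unique path between two nodes.

```python
# Toy rooted tree encoded by parent pointers; w[i] is the non-negative
# length of the edge from node i to its parent, and the root has parent -1.
parent = [-1, 0, 0, 1, 1]            # root r = node 0
w      = [0.0, 1.0, 2.0, 0.5, 1.5]   # illustrative edge lengths

def path_to_root(x):
    # Nodes on the path from x up to (and including) the root.
    nodes = []
    while x != -1:
        nodes.append(x)
        x = parent[x]
    return nodes

def tree_distance(x, z):
    """d_T(x, z): length of the unique path between x and z, i.e. the sum
    of edge lengths over the symmetric difference of the two root paths."""
    px, pz = set(path_to_root(x)), set(path_to_root(z))
    return sum(w[i] for i in (px ^ pz))
```

For instance, nodes 3 and 4 share the parent node 1, so their distance is just the sum of their two edge lengths.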
For nodes x, z ∈ T , let P(x, z) be the (unique) path between x and z in T , and let λ be the unique Borel measure (i.e., the length measure) on T such that dT (x, z) = λ(P(x, z)). We also write Γ(x) for the set of nodes in the subtree of T rooted at x, defined as Γ(x) = {z ∈ T | x ∈ P(r, z)}. For each edge e in T , let ve be the deeper-level node of edge e (farther from the root), ue the other node, and we = dT (ue, ve) the non-negative length of that edge, as illustrated in Figure 1. Then, TW not only has a closed form, but is negative definite.\nProposition 1. Given two measures µ, ν supported on T , and setting the ground metric to be dT , then\n\nWdT (µ, ν) = ∑_{e∈T} we |µ(Γ(ve)) − ν(Γ(ve))| .   (3)\n\nProof. Following [21], for any f ∈ FdT such that f (r) = 0, there is a λ-a.e. unique Borel function g : T → [−1, 1] such that f (x) = ∫_T 1_{z∈P(r,x)} g(z) λ(dz). Intuitively, f (x) models a flow along the (unique) path between the root r and node x, where g(z) controls the probability amount, received or provided by f (x), on dz. Noting that 1_{z∈P(r,x)} = 1_{x∈Γ(z)}, we have:\n\n∫_T f (x) µ(dx) = ∫_T ∫_T 1_{z∈P(r,x)} g(z) λ(dz) µ(dx) = ∫_T g(z) µ(Γ(z)) λ(dz).\n\nPlugging this identity into Equation (2), we have:\n\nWdT (µ, ν) = sup { ∫_T (µ(Γ(z)) − ν(Γ(z))) g(z) λ(dz) | g : T → [−1, 1] } = ∫_T |µ(Γ(z)) − ν(Γ(z))| λ(dz),\n\nsince the optimal function corresponds to g(z) = 1 if µ(Γ(z)) ≥ ν(Γ(z)), and g(z) = −1 otherwise. Moreover, we have µ(Γ(r)) = ν(Γ(r)) = 1, and λ(P(ue, ve)) = dT (ue, ve) = we. 
Therefore,\n\nWdT (µ, ν) = ∑_{e∈T} we |µ(Γ(ve)) − ν(Γ(ve))| ,\n\nsince the total mass flowing through edge e is equal to the total mass in the subtree Γ(ve).\nProposition 2. The tree-Wasserstein distance WdT is negative definite.\nProof. Let m be the number of edges in tree T . From Equation (3), the vector (µ(Γ(ve)))_{e∈T} can be considered as a feature map of the probability distribution µ into R^m_+. Consequently, the TW distance is equivalent to a weighted ℓ1 distance, with non-negative weights we, between these feature maps. Therefore, the tree-Wasserstein distance is negative definite3.\n\n4 Tree-Sliced Wasserstein by Sampling Tree Metrics\n\nMuch as in sliced-Wasserstein (SW) distances, computing TW distances requires choosing or sampling tree metrics. Unlike SW distances, however, the space of possible tree metrics is far too large in practical cases to expect that purely random trees can lead to meaningful results. We consider in this section two adaptive methods to define tree metrics based on spatial information in both low- and high-dimensional cases, using partitioning or clustering. We further average the TW distances corresponding to these ground tree metrics. This has the benefit of reducing quantization effects, or cluster-sensitivity problems, in which data points may be partitioned or clustered into adjacent but different hypercubes [32] or clusters, respectively. We then define the tree-sliced Wasserstein kernel, which is the direct generalization of those considered by [11, 36].\n\n3We follow here [9, p. 66–67] to define negative-definiteness; see the review about kernels in the supplementary.\n\nDefinition 1. 
For two measures µ, ν supported on a set on which tree metrics {dTi | 1 ≤ i ≤ n} can be defined, the tree-sliced-Wasserstein (TSW) distance is defined as:\n\nTSW(µ, ν) = (1/n) ∑_{i=1}^{n} WdTi (µ, ν).   (4)\n\nNote that an average of negative definite functions is trivially negative definite. Thus, following Definition 1 and Proposition 2, the TSW distance is also negative definite. Positive definite kernels can therefore be derived following [9, Theorem 3.2.2, p.74]: given t > 0 and measures µ, ν, we define the following tree-sliced-Wasserstein kernel,\n\nkTSW(µ, ν) = exp(−t TSW(µ, ν)) .   (5)\n\nVery much like the Gaussian kernel, one can tune, if needed, the bandwidth parameter t according to the learning task that is targeted, using e.g. cross validation.\n\nAdaptive methods to define tree metrics for the space of support data. We consider sampling mechanisms to select the tree metrics used in Definition 1.\nOne possibility is to sample tree metrics following the general idea that these tree metrics should approximate the original distance [7, 8, 13, 22, 30]. This was the original motivation for previous work focusing on approximating the OT distance with the Euclidean ground metric (a.k.a. the W2 metric) by an ℓ1 metric for fast nearest neighbor search [16, 32]. Our goal is rather to sample tree metrics for the space of supports, and use those random tree metrics as ground metrics. Much like 1-dimensional projections do not offer interesting properties from a distortion perspective but remain useful for the sliced-Wasserstein (SW) distance, we believe that trees with large distortions can be useful. 
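The closed form of Equation (3) and the averaging of Equation (4) can be sketched in a few lines; the following illustrative Python sketch (ours, with a made-up parent-pointer tree encoding) assumes discrete measures given as probability vectors over the tree's nodes, ordered so that each parent precedes its children:

```python
import numpy as np

def tree_wasserstein(parent, w, mu, nu):
    """Closed form of Eq. (3): sum_e w_e |mu(Gamma(v_e)) - nu(Gamma(v_e))|.

    parent[i] is the parent of node i (root has parent -1), w[i] is the
    length of the edge above node i, and mu, nu are probability vectors on
    the nodes. Nodes are assumed ordered so that parent[i] < i."""
    n = len(parent)
    mass = np.asarray(mu, float) - np.asarray(nu, float)
    total = 0.0
    # Bottom-up pass: when node i is processed, mass[i] already holds
    # mu(Gamma(i)) - nu(Gamma(i)), the signed mass of its whole subtree.
    for i in range(n - 1, 0, -1):
        total += w[i] * abs(mass[i])
        mass[parent[i]] += mass[i]
    return float(total)

def tree_sliced_wasserstein(trees, mu, nu):
    # Eq. (4): average the closed-form TW distance over sampled tree metrics,
    # each tree given as a (parent, w) pair.
    return float(np.mean([tree_wasserstein(p, w, mu, nu) for p, w in trees]))
```

Each TW evaluation is a single linear pass over the edges, which is what makes averaging over many sampled trees affordable.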
Our view follows the recent realization that solving the OT problem exactly leads to overfitting [52, §8.4]; excessive efforts to approximate the ground metric using trees would therefore be self-defeating, since they would lead to overfitting within the computation of the Wasserstein metric itself.\n• Partition-based tree metrics. For low-dimensional spaces of supports, one can construct a partition-based tree metric with a tree structure T as follows.\n\nAlgorithm 1 Partition_Tree_Metric(s, X, x̃s, h, HT )\nInput: s: a side-ℓ hypercube; X: a set of m data points of R^d in s; x̃s: a parent node; h: the current depth level; HT : the predefined deepest level of tree T .\n1: if m > 0 then\n2:  if h > 0 then\n3:   Node x̃c ← a point center of s.\n4:   Length of edge (x̃s, x̃c) ← distance (x̃s, x̃c).\n5:  else\n6:   Node x̃c ← x̃s.\n7:  end if\n8:  if m > 1 and h < HT then\n9:   Partition s into 2^d side-(ℓ/2) hypercubes.\n10:   for each side-(ℓ/2) hypercube sc do\n11:    X̃c ← data points of X in sc.\n12:    Partition_Tree_Metric(sc, X̃c, x̃c, h + 1, HT ).\n13:   end for\n14:  end if\n15: end if\n\nAssume that data points are in a side-(β/2) hypercube of R^d. We then randomly expand it into a hypercube s0 with side at most β. Inspired by the series of grids in [32], we set the center of s0 as the root of T , and use the following recursive procedure to partition s0. For each side-ℓ hypercube s, there are 3 partitioning cases: (i) if s does not contain any data points, we discard it; (ii) if s contains 1 data point, we use the center of s (or the data point) as a node in T ; and (iii) if s contains more than 1 data point, we represent s by its center as a node x in T , and equally partition s into 2^d side-(ℓ/2) hypercubes for potential child nodes of x. 
We then apply the recursive partition procedure to those child hypercubes. One can use any metric in R^d to obtain lengths for edges in T . Additionally, one can use a predefined deepest level of T as a stopping condition for the procedure. We summarize the recursive tree construction procedure in Algorithm 1. As desired, the random expansion of the original hypercube into s0 creates variety in the partitions of the data space. Note that Algorithm 1 for constructing tree T is also known as the classical Quadtree algorithm [56] for 2-dimensional data (later extended to high-dimensional data in [6, 30, 31, 32]).\n• Clustering-based tree metrics. As in Algorithm 1, the number of partitioned hypercubes grows exponentially with respect to the dimension d. To overcome this problem for high-dimensional spaces, we directly leverage the distribution of support data points to adaptively partition data spaces via clustering, inspired by the clustering-based approach for space subdivision in the Improved Fast Gauss Transform [48, 66]. We derive a recursive procedure similar to that of the partition-based tree metrics, but apply farthest-point clustering [27] to partition support data points, and replace the centers of hypercubes by cluster centroids as nodes in T . In practice, we fix the same number of clusters κ when performing the farthest-point clustering (replacing the partition in line 9 of Algorithm 1); κ is typically chosen via cross-validation. In general, one can apply any favored clustering method; we use farthest-point clustering due to its fast computation. In particular, the complexity of farthest-point clustering of n data points into κ clusters is O(n log κ) using the algorithm in [23]. Using different random initializations for the farthest-point clustering, we recover a simple sampling mechanism to obtain random tree metrics.\n\n5 Relations to Other Work\n\nOT with ground ultrametrics. 
An ultrametric is also known as a non-Archimedean or isosceles metric [59]. Ultrametrics strengthen the triangle inequality into a strong inequality (i.e., for any x, y, z in an ultrametric space, d(x, z) ≤ max(d(x, y), d(y, z))). Note that binary metrics are a special case of ultrametrics, since binary metrics satisfy the strong inequality. Following [33, §1, p.245–247], an ultrametric implies a tree structure, which can be constructed by hierarchical clustering schemes; therefore, an ultrametric is a tree metric. Furthermore, we note that ultrametrics are similar in spirit to the strong kernels and hierarchy-induced kernels that are key components in forming valid optimal assignment kernels for graph classification applications [37].\nConnection with OT with Euclidean ground metric W2(·, ·). Let dHT be a partition-based tree metric, where H is the depth level of the corresponding tree T at which all support data points are separated into different hypercubes (i.e., Algorithm 1 stops at depth level H). Edges in T are computed by Euclidean distance. Let β be the side of the randomly expanded hypercube. Given two d-dimensional point clouds µ̃, ν̃ with the same cardinality (i.e., discrete uniform measures), and denoting TW with dHT as WdHT , then\n\nW2(µ̃, ν̃) ≤ WdHT (µ̃, ν̃)/2 + β√d/2^H .\n\nThe proof is given in the supplementary material. Moreover, we also investigate the empirical relation between the TSW distance and the W2 distance in the supplementary material; the empirical results indicate that the TSW distance agrees more with W2 as the number of tree-slices used to define the TSW distance is increased.\n\nConnection with embedding the W2 metric into the ℓ1 metric for fast nearest neighbor search. 
As discussed earlier, our goal is neither to approximate the OT distance using trees as in [7, 8, 13, 22, 30], nor to embed the W2 metric into the ℓ1 metric as in [16, 32], but rather to sample tree metrics to define an extended variant of the sliced-Wasserstein distance. When the Quadtree algorithm (as in Algorithm 1) is used to sample tree metrics for the TSW distance, the resulting TSW distance is in the same spirit as the embedding approach in [32], where the authors embedded the W2 metric into the ℓ1 metric by using a series of grids.\n\nOT with tree metrics. A few works relate to our considered class of OT with tree metrics [35, 62]. In particular, Kloeckner [35] studied geometric properties of the OT space for measures on an ultrametric space, and Sommerfeld and Munk [62] focused on statistical inference for empirical OT on finite spaces, including tree metrics.\n\n6 Experimental Results\n\nIn this section, we evaluate the proposed TSW kernel kTSW (Equation (5)) for comparing empirical measures in word embedding-based document classification and topological data analysis.\n\n6.1 Word Embedding-based Document Classification\n\nKusner et al. [39] proposed Word Mover's distances for document classification. Each document is regarded as an empirical measure where each word and its frequency are considered as a support point and a corresponding weight, respectively. Kusner et al. [39] used a word embedding such as word2vec to map each word to a vector data point. Equivalently, Word Mover's distances are OT metrics between empirical measures (i.e., documents) whose ground cost is a metric on the word embedding space.\n\nSetup. We evaluated kTSW on four datasets: TWITTER, RECIPE, CLASSIC and AMAZON, following the approach of Word Mover's distances [39], for document classification with SVM. Statistical characteristics of those datasets are summarized in Figure 2b. 
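The view of a document as an empirical measure can be sketched as follows (an illustrative sketch, not the paper's pipeline; `embed` stands for any word-to-vector map such as a pre-trained word2vec dictionary, and out-of-vocabulary words are simply dropped):

```python
import numpy as np
from collections import Counter

def document_as_measure(tokens, embed):
    """Represent a document as an empirical measure: supports are word
    vectors, weights are normalized word frequencies, as in Word Mover's
    distances. Words missing from `embed` are dropped."""
    counts = Counter(t for t in tokens if t in embed)
    total = sum(counts.values())
    supports = np.array([embed[word] for word in counts])
    weights = np.array([c / total for c in counts.values()])
    return supports, weights
```

A TSW (or any OT-based) distance between two documents then compares these weighted point clouds in the embedding space.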
We used the word2vec word embedding [47], pre-trained on Google News4, containing about 3 million words/phrases. word2vec maps these words/phrases into vectors in R^300. Following [39], for all datasets, we removed all SMART stop words [55], and further dropped words in documents if they were not available in the pre-trained word2vec. We used two baseline kernels of the form exp(−td), where d is a document distance and t > 0, for two corresponding baseline document distances based on Word Mover's: (i) OT with Euclidean ground metric [39], and (ii) sliced-Wasserstein, denoted kOT and kSW respectively. For the TSW distance in kTSW, we consider ns randomized clustering-based tree metrics, built with a predefined deepest level HT of tree T as a stopping condition. We also regularized kernel kOT matrices, due to their indefiniteness, by adding a sufficiently large diagonal term as in [14]. For SVM, we randomly split each dataset into 70%/30% for training and test with 100 repeats, choose hyper-parameters through cross validation, choose 1/t from {1, q10, q20, q50} where qs is the s% quantile of a subset of corresponding distances observed on a training set, use a one-vs-one strategy with Libsvm [12] for multi-class classification, and choose the SVM regularization parameter from {10^{−2:1:2}}. We ran experiments with an Intel Xeon CPU E7-8891v3 (2.80GHz) and 256GB RAM.\n\n6.2 Topological Data Analysis (TDA)\n\nTDA has recently gained interest within the machine learning community [11, 38, 42, 53]. TDA is a powerful tool for statistical analysis of geometrically structured data such as linked twist maps, or material data. TDA employs algebraic topology methods, such as persistent homology, to extract robust topological features (i.e., connected components, rings, cavities) and outputs 2-dimensional point multisets, known as persistence diagrams (PD) [19]. 
Each 2-dimensional point in a PD summarizes the lifespan, with birth and death times as its coordinates, of a particular topological feature.\n\nSetup. We evaluated kTSW for orbit recognition and object shape classification with support vector machines (SVM), as well as change point detection for material data analysis with the kernel Fisher discriminant ratio (KFDR) [28]. Generally, we followed the same setting as in [42] for these TDA experiments. We considered five baseline kernels for PD: (i) persistence scale space (kPSS) [53], (ii) persistence weighted Gaussian (kPWG) [38], (iii) sliced-Wasserstein (kSW) [11], (iv) persistence Fisher (kPF) [42], and (v) optimal transport5, defined as kOT = exp(−t dOT) for t > 0, whose kernel matrices we further regularized by adding a sufficiently large diagonal term, due to its indefiniteness, as in §6.1. For the TSW distance in kTSW, we considered ns randomized partition-based tree metrics, built with a predefined deepest level HT of tree T as a stopping condition.\n\n4https://code.google.com/p/word2vec\n5We used a fast OT implementation (e.g. on the MPEG7 dataset, it took 7.98 seconds while the popular mex-file with Rubner's implementation required 28.72 seconds).\n\nLet Dgi = (x1, x2, . . . , xn) and Dgj = (z1, z2, . . . , zm) be two PDs where xi |1≤i≤n, zj |1≤j≤m ∈ R^2, and let Θ = {(a, a) | a ∈ R} be the diagonal set. Denote DgiΘ = {ΠΘ(x) | x ∈ Dgi}, where ΠΘ(x) is the projection of x on Θ. As in the SW distance between Dgi and Dgj [11], we use transportation plans between (Dgi ∪ DgjΘ) and (Dgj ∪ DgiΘ) for the TW (in Equation (4) of TSW) and OT distances. We typically used cross validation to choose hyper-parameters, and followed the corresponding authors of those baseline kernels to form the sets of candidates. For kTSW and kOT, we chose 1/t from {1, q10, q20, q50}. 
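The kernel construction of Equation (5) and the quantile-based bandwidth candidates described above can be sketched as follows (our illustrative sketch; `D` stands for any precomputed symmetric matrix of pairwise TSW distances):

```python
import numpy as np

def distance_quantile_bandwidths(D, quantiles=(10, 20, 50)):
    """Candidate 1/t values: 1 plus the s% quantiles of the observed
    pairwise distances, as used for cross validation in the setup above."""
    tri = D[np.triu_indices_from(D, k=1)]  # off-diagonal distances
    return [1.0] + [float(np.percentile(tri, q)) for q in quantiles]

def tsw_kernel_matrix(D, inv_t):
    # k(mu, nu) = exp(-t * TSW(mu, nu)) from Eq. (5), with bandwidth 1/t.
    return np.exp(-D / inv_t)
```

Since the TSW distance is negative definite, the resulting Gram matrix is positive definite and can be fed directly to an SVM, without the diagonal regularization needed for kOT.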
As in §6.1, for SVM we used the one-vs-one strategy with Libsvm for multi-class classification, {10^{-2}, 10^{-1}, 1, 10, 10^{2}} as the set of regularization candidates, and a random 70%/30% split for training and test with 100 repeats; we used the DIPHA toolbox6 to extract PD.

Orbit recognition. Adams et al. [1, §6.4.1] proposed a synthesized dataset for linked twist maps, discrete dynamical systems used to model flows in DNA microarrays [29]. There are 5 classes of orbits. As in [42], we generated 1000 orbits for each class, where each orbit contains 1000 points. We considered 1-dimensional topological features for PD, extracted with the Vietoris-Rips complex filtration [19].

Object shape classification. We evaluated object shape classification on a 10-class subset of the MPEG7 dataset [40], containing 20 samples per class, as in [42]. For simplicity, we used the same procedure as in [42] to extract 1-dimensional topological features for PD with the Vietoris-Rips complex filtration7 [19].

Change point detection for material data analysis. We considered the granular packing system [24] and SiO2 [49] datasets for the change point detection problem, with KFDR as a statistical score. As in [38, 42], we extracted 2-dimensional topological features for PD in the granular packing system dataset and 1-dimensional topological features for PD in the SiO2 dataset, both with the ball model filtration, and set the KFDR regularization parameter to 10^{-3}. KFDR graphs for these datasets are shown in Figure 2c. For the granular packing system dataset, all kernel approaches detect the change point at the 23rd index, which supports an observation (corresponding to id = 23) in [5]. For the SiO2 dataset, the results of all kernel methods are within the range (35 ≤ id ≤ 50) supported by a traditional physical approach [20].
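For intuition on why the TW distance used throughout these experiments is fast to evaluate, recall that its closed form sums, over the edges of the tree, the edge weight times the absolute difference of the two measures' masses in the subtree below that edge. A minimal sketch, assuming a tree encoded as a parent array with edge weights (an encoding chosen for illustration, not the authors' implementation):

```python
import numpy as np

def tree_wasserstein(parent, weight, mu, nu):
    """Closed-form Wasserstein-1 distance between two measures supported
    on the nodes of a tree.

    parent[v] -- parent index of node v (the root has parent -1)
    weight[v] -- weight of the edge (v, parent[v]); ignored at the root
    mu, nu    -- nonnegative mass vectors over the nodes, each summing to 1

    TW(mu, nu) = sum over edges e of w_e * |mu(subtree below e) - nu(subtree below e)|
    """
    n = len(parent)
    diff = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    # compute depths so that children can be processed before their parents
    depth = np.zeros(n, dtype=int)
    for v in range(n):
        u = v
        while parent[u] != -1:
            u = parent[u]
            depth[v] += 1
    sub = diff.copy()   # sub[v] accumulates the net mass in the subtree rooted at v
    dist = 0.0
    for v in np.argsort(-depth):    # deepest nodes first
        if parent[v] != -1:
            dist += weight[v] * abs(sub[v])
            sub[parent[v]] += sub[v]
    return dist
```

On a chain (path) tree this reduces to the univariate Wasserstein distance, which is the sense in which the sliced-Wasserstein distance is the special case where the tree is a line.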
The KFDR results of kTSW compare favorably with those of the other baseline kernels. As shown in Figure 2b, kTSW is also faster than the other baseline kernels. We omit the baseline kernel kOT for this application, since computing the OT distance ran out of memory.

6.3 Results of SVM, Time Consumption and Discussion

The SVM results and the time consumption for computing kernel matrices in TDA and in word-embedding-based document classification are illustrated in Figure 2a and Figure 2b respectively. The performances of kTSW compare favorably with those of the other baseline kernels. Moreover, the computational time of kTSW is much lower than that of kOT; in particular, on the CLASSIC dataset, kTSW took less than 3 hours while kOT required more than 8 days. Note that kTSW and kSW are positive definite while kOT is not. The indefiniteness of kOT may affect its performance in some applications: kOT performs worse in the TDA applications, but works well in the word-embedding document applications. The fact that SW only considers 1-dimensional projections may limit its ability to capture high-dimensional structure in data distributions [60]. The TSW distance remedies this problem by using clustering-based tree metrics, which directly leverage the distribution of support data points. Furthermore, we illustrate the trade-off between performance and computational time for different parameters of the tree-sliced-Wasserstein distance in kTSW on the TWITTER dataset in Figure 2d. Performance usually improves with more slices (ns), at the cost of more computational time; in these applications, we observed that about 10 slices gives a good trade-off. Further results can be found in the supplementary material.

7 Conclusion

In this work, we proposed a positive definite tree-(sliced)-Wasserstein kernel on OT geometry by considering a particular class of ground metrics, namely tree metrics.
Much like the univariate Wasserstein distance, the tree-(sliced)-Wasserstein distance has a closed form, and is also negative definite. We also provide two sampling schemes to generate tree metrics in both high-dimensional and low-dimensional spaces. Leveraging random tree metrics, we have proposed a new generalization of sliced-Wasserstein metrics with more flexibility and degrees of freedom, obtained by choosing a tree rather than a line, especially in high-dimensional spaces. The questions of how to efficiently sample tree metrics from data points for the tree-sliced-Wasserstein distance, and how to use them for more involved parametric inference, are left for future work.

6 https://github.com/DIPHA/dipha
7 Turner et al. [63] proposed a more complicated and advanced filtration for this task.

Figure 2: Experimental results for document classification and TDA. (a) SVM results for TDA and document classification. (b) Corresponding time consumption of kernel matrices for TDA and document classification. (c) The KFDR graphs on the granular packing system and SiO2 datasets. (d) SVM results and time consumption of kernel matrices of kTSW with different (ns, HT, κ), and of kSW with different ns, on the TWITTER dataset. In (a), for a trade-off between time consumption and performance, TDA results are reported for kTSW with (ns = 6, HT = 6) and (ns = 12, HT = 5) on the MPEG7 and Orbit datasets respectively; for document classification, results are reported for kSW with ns = 20 and for kTSW with (ns = 10, HT = 6, κ = 4). In (b), the numbers in parentheses are: for TDA (first row), the number of PDs and the maximum number of points in a PD; for document classification (second row), the number of classes, the number of documents, and the maximum number of unique words per document. In (c), for kTSW, TSW distances are computed with (ns = 12, HT = 6).

Acknowledgments

We thank the anonymous reviewers for their comments. TL acknowledges the support of JSPS KAKENHI Grant number 17K12745. MY was supported by the JST PRESTO program JPMJPR165A.

References

[1] Henry Adams, Tegan Emerson, Michael Kirby, Rachel Neville, Chris Peterson, Patrick Shipman, Sofya Chepushtanova, Eric Hanson, Francis Motta, and Lori Ziegelmeier. Persistence images: A stable vector representation of persistent homology. The Journal of Machine Learning Research, 18(1):218–252, 2017.

[2] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Weed. Approximating the quadratic transportation metric in near-linear time. arXiv preprint arXiv:1810.10046, 2018.

[3] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Weed. Massively scalable Sinkhorn distances via the Nystrom method. arXiv preprint arXiv:1812.05189, 2018.

[4] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.

[5] Anonymous. What is random packing? Nature, 239:488–489, 1972.

[6] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In International Conference on Machine Learning, pages 405–413, 2019.

[7] Yair Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In Proceedings of 37th Conference on Foundations of Computer Science, pages 184–193, 1996.

[8] Yair Bartal. On approximating arbitrary metrices by tree metrics. In STOC, volume 98, pages 161–168, 1998.

[9] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel.
Harmonic analysis on semigroups. Springer-Verlag, 1984.

[10] Rainer E Burkard and Eranda Cela. Linear assignment problems and extensions. In Handbook of combinatorial optimization, pages 75–149. Springer, 1999.

[11] Mathieu Carriere, Marco Cuturi, and Steve Oudot. Sliced Wasserstein kernel for persistence diagrams. In International Conference on Machine Learning, volume 70, pages 664–673, 2017.

[12] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[13] Moses Charikar, Chandra Chekuri, Ashish Goel, Sudipto Guha, and Serge Plotkin. Approximating a finite metric by a small number of tree metrics. In Proceedings 39th Annual Symposium on Foundations of Computer Science, pages 379–388, 1998.

[14] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[15] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for Earth Mover's distance. Theory of Computing Systems, 48(2):428–442, 2011.
Contemporary Mathematics, 453:257–282, 2008.

[20] Stephen Richard Elliott. Physics of amorphous materials. Longman Group, 1983.

[21] Steven N Evans and Frederick A Matsen. The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):569–592, 2012.

[22] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences, 69(3):485–497, 2004.

[23] Tomas Feder and Daniel Greene. Optimal algorithms for approximate clustering. In Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 434–444. ACM, 1988.

[24] Nicolas Francois, Mohammad Saadatfar, R Cruikshank, and A Sheppard. Geometrical frustration in amorphous and partially crystallized packings of spheres. Physical Review Letters, 111(14):148001, 2013.

[25] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.

[26] Aude Genevay, Marco Cuturi, Gabriel Peyre, and Francis Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448, 2016.

[27] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.

[28] Zaid Harchaoui, Eric Moulines, and Francis R Bach. Kernel change-point analysis. In Advances in Neural Information Processing Systems, pages 609–616, 2009.

[29] Jan-Martin Hertzsch, Rob Sturman, and Stephen Wiggins. DNA microarrays: design principles for maximizing ergodic, chaotic mixing. Small, 3(2):202–218, 2007.

[30] Piotr Indyk. Algorithmic applications of low-distortion geometric embeddings.
In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 10–33, 2001.

[31] Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Practical data-dependent metric compression with provable guarantees. In Advances in Neural Information Processing Systems, pages 2617–2626, 2017.

[32] Piotr Indyk and Nitin Thaper. Fast image retrieval via embeddings. International Workshop on Statistical and Computational Theories of Vision, 2003.

[33] Stephen C Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.

[34] Leonid V Kantorovich. On the transfer of masses. In Dokl. Akad. Nauk. SSSR, volume 37, pages 227–229, 1942.

[35] Benoit R Kloeckner. A geometric study of Wasserstein spaces: ultrametrics. Mathematika, 61(1):162–178, 2015.

[36] Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5258–5267, 2016.

[37] Nils M Kriege, Pierre-Louis Giscard, and Richard Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems, pages 1623–1631, 2016.

[38] Genki Kusano, Kenji Fukumizu, and Yasuaki Hiraoka. Kernel method for persistence diagrams via kernel embedding and weight factor. Journal of Machine Learning Research, 18(189):1–41, 2018.

[39] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[40] Longin Jan Latecki, Rolf Lakamper, and T Eckhardt. Shape descriptors for non-rigid shapes with a single closed contour.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 424–429, 2000.

[41] Hugo Lavenant, Sebastian Claici, Edward Chien, and Justin Solomon. Dynamical optimal transport on discrete surfaces. In SIGGRAPH Asia 2018 Technical Papers, page 250. ACM, 2018.

[42] Tam Le and Makoto Yamada. Persistence Fisher kernel: A Riemannian manifold kernel for persistence diagrams. In Advances in Neural Information Processing Systems, pages 10028–10039, 2018.

[43] Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, pages 2692–2701, 2018.

[44] Catherine Lozupone and Rob Knight. Unifrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12):8228–8235, 2005.

[45] Catherine A Lozupone, Micah Hamady, Scott T Kelley, and Rob Knight. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5):1576–1585, 2007.

[46] Andrew McGregor and Daniel Stubbs. Sketching Earth-Mover distance on graph metrics. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 274–286. Springer, 2013.

[47] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[48] Vlad I Morariu, Balaji V Srinivasan, Vikas C Raykar, Ramani Duraiswami, and Larry S Davis. Automatic online tuning for fast Gaussian summation. In Advances in Neural Information Processing Systems, pages 1113–1120, 2009.

[49] Takenobu Nakamura, Yasuaki Hiraoka, Akihiko Hirata, Emerson G Escolar, and Yasumasa Nishiura.
Persistent homology and many-body atomic structure for medium-range order in the glass. Nanotechnology, 26(30):304001, 2015.

[50] Victor M Panaretos, Yoav Zemel, et al. Amplitude and phase variation of point processes. The Annals of Statistics, 44(2):771–812, 2016.

[51] Ofir Pele and Michael Werman. Fast and robust Earth Mover's distances. In International Conference on Computer Vision, pages 460–467, 2009.

[52] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[53] Jan Reininghaus, Stefan Huber, Ulrich Bauer, and Roland Kwitt. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4741–4748, 2015.

[54] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The Earth Mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[55] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[56] Hanan Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187–260, 1984.

[57] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhauser, 2015.

[58] Charles Semple and Mike Steel. Phylogenetics. Oxford Lecture Series in Mathematics and its Applications, 2003.

[59] S. A. Shkarin. Isometric embedding of finite ultrametric spaces in Banach spaces. Topology and its Applications, 142(1-3):13–17, 2004.

[60] Umut Şimşekli, Antoine Liutkus, Szymon Majewski, and Alain Durmus. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions.
arXiv preprint arXiv:1806.08141, 2018.

[61] Justin Solomon, Fernando De Goes, Gabriel Peyre, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[62] Max Sommerfeld and Axel Munk. Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):219–238, 2018.

[63] Katharine Turner, Sayan Mukherjee, and Doug M Boyer. Persistent homology transform for modeling shapes and surfaces. Information and Inference: A Journal of the IMA, 3(4):310–344, 2014.

[64] Cedric Villani. Topics in optimal transportation. American Mathematical Soc., 2003.

[65] Cedric Villani. Optimal transport: old and new, volume 338. Springer Science and Business Media, 2008.

[66] Changjiang Yang, Ramani Duraiswami, and Larry S Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems, pages 1561–1568, 2005.