{"title": "Generalized Sliced Wasserstein Distances", "book": "Advances in Neural Information Processing Systems", "page_first": 261, "page_last": 272, "abstract": "The Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW) distance, have recently drawn attention from the machine learning community. The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervised learning. In this paper, we first clarify the mathematical connection between the SW distance and the Radon transform. We then utilize the generalized Radon transform to define a new family of distances for probability measures, which we call generalized sliced-Wasserstein (GSW) distances. We further show that, similar to the SW distance, the GSW distance can be extended to a maximum GSW (max-GSW) distance. We then provide the conditions under which GSW and max-GSW distances are indeed proper metrics. Finally, we compare the numerical performance of the proposed distances on the generative modeling task of SW flows and report favorable results.", "full_text": "Generalized Sliced Wasserstein Distances\n\nSoheil Kolouri1\u2217, Kimia Nadjahi2\u2217, Umut \u015eim\u015fekli2,3, Roland Badeau2, Gustavo K. Rohde4\n\n1: HRL Laboratories, LLC., Malibu, CA, USA, 90265\n\n2: LTCI, T\u00e9l\u00e9com Paris, Institut Polytechnique de Paris, France\n\n3: Department of Statistics, University of Oxford, UK\n\n4: University of Virginia, Charlottesville, VA, USA, 22904\n\nskolouri@hrl.com, gustavo@virginia.edu\n\n{kimia.nadjahi, umut.simsekli, roland.badeau}@telecom-paris.fr\n\nAbstract\n\nThe Wasserstein distance and its variations, e.g., the sliced-Wasserstein (SW)\ndistance, have recently drawn attention from the machine learning community. 
The SW distance, specifically, was shown to have similar properties to the Wasserstein distance, while being much simpler to compute, and is therefore used in various applications including generative modeling and general supervised/unsupervised learning. In this paper, we first clarify the mathematical connection between the SW distance and the Radon transform. We then utilize the generalized Radon transform to define a new family of distances for probability measures, which we call generalized sliced-Wasserstein (GSW) distances. We further show that, similar to the SW distance, the GSW distance can be extended to a maximum GSW (max-GSW) distance. We then provide the conditions under which GSW and max-GSW distances are indeed proper metrics. Finally, we compare the numerical performance of the proposed distances on the generative modeling task of SW flows and report favorable results.

1 Introduction

The Wasserstein distance has its roots in optimal transport (OT) theory [1] and forms a metric between two probability measures. It has attracted considerable attention in data science and machine learning due to its convenient theoretical properties and its applications in many domains [2, 3, 4, 5, 6, 7, 8], especially in implicit generative modeling such as OT-based generative adversarial networks (GANs) and variational auto-encoders [9, 10, 11, 12].

While OT brings new perspectives and principled ways to formalize problems, OT-based methods usually suffer from high computational complexity. The Wasserstein distance is often the computational bottleneck: evaluating it between multi-dimensional measures is numerically intractable in general. This computational burden is a major limiting factor in the application of OT distances to large-scale data analysis. Recently, several numerical methods have been proposed to speed up the evaluation of the Wasserstein distance. 
For instance, entropic regularization techniques [13, 14, 15] provide a fast approximation to the Wasserstein distance by regularizing the original OT problem with an entropy term. The linear OT approach [16, 17] further simplifies this computation for a given dataset by linearly approximating pairwise distances with a functional defined on distances to a reference measure. Other notable contributions towards computational methods for OT include multi-scale and sparse approximation approaches [18, 19], and Newton-based schemes for semi-discrete OT [20, 21].

∗ Denotes equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

There are some special favorable cases where solving the OT problem is easy and reasonably cheap. In particular, the Wasserstein distance for one-dimensional probability densities has a closed-form formula that can be efficiently approximated. This nice property motivates the use of the sliced-Wasserstein distance [22], an alternative OT distance obtained by computing infinitely many linear projections of the high-dimensional distribution onto one-dimensional distributions and then averaging the Wasserstein distances between these one-dimensional representations. While having similar theoretical properties [23], the sliced-Wasserstein distance has significantly lower computational requirements than the classical Wasserstein distance. Therefore, it has recently attracted ample attention and has successfully been applied to a variety of practical tasks [22, 24, 25, 26, 27, 28, 29, 30, 31].

As we will detail in the next sections, the linear projection process used in the sliced-Wasserstein distance is closely related to the Radon transform, which is widely used in tomography [32, 33]. In other words, the sliced-Wasserstein distance is calculated via linear slicing of the probability distributions. 
However, the linear nature of these projections does not guarantee an efficient evaluation of the sliced-Wasserstein distance: in very high-dimensional settings, the data often lives on a thin manifold, and the number of randomly chosen linear projections required to capture the structure of the data distribution grows very quickly [27]. Reducing the number of required projections would thus result in a significant performance improvement in sliced-Wasserstein computations.

Several attempts have very recently been made to address the inefficiencies caused by the linear projections. In [34], Rowland et al. combined linear projections with orthogonal coupling in Monte Carlo estimation to increase computational efficiency and estimation quality. In [35], Deshpande et al. extended the sliced-Wasserstein distance to the 'max-sliced-Wasserstein' distance, where they aimed at finding a single linear projection that maximizes the distance in the projected space. In another study [36], Paty and Cuturi extended this idea to projections onto linear subspaces, where they aimed at finding the optimal subspace for the projections by replacing the projections along a vector with projections onto the nullspace of a matrix. While these methods reduce the computational cost induced by the projection operations by choosing a single vector or an orthogonal matrix, they incur an additional cost since they need to solve a non-convex optimization problem over manifolds.

In this paper, we address the aforementioned computational issues of the sliced-Wasserstein distance by taking an alternative route. In particular, we extend the linear slicing to non-linear slicing of probability measures. 
Our main contributions are summarized as follows:

• Using the theory of the generalized Radon transform [37], we extend the definition of the sliced-Wasserstein distance to an entire class of distances, which we call the generalized sliced-Wasserstein (GSW) distance. We prove that replacing the linear projections with non-linear projections can still yield a valid distance metric, and we then identify general conditions under which the GSW distance is a proper metric function. To the best of our knowledge, this is the first study to generalize SW to non-linear projections.

• Similar to [35], we then show that, instead of using infinitely many projections as required by the GSW distance, we can still define a valid distance metric by using a single projection, as long as the projection gives the maximal distance in the projected space. We aptly call this distance the max-GSW distance.

• As instances of non-linear projections, we first investigate projections with polynomial kernels, which fulfill all the conditions that we identify. However, we observe that the memory complexity of such projections grows combinatorially with the dimension of the problem, which restricts their application to modern problems. This motivates us to consider a neural-network-based projection scheme, where we observe that fully connected or convolutional networks with leaky ReLU activations fulfill all the crucial conditions so that their resulting GSW becomes a pseudo-metric for probability measures. 
In addition to its practical advantages, this scheme also brings an interesting perspective on adversarial generative modeling, showing that such algorithms contain an implicit stage for learning projections, with cost functions different from ours.

• Due to their inherent non-linearity, the GSW and max-GSW distances are expected to capture the complex structure of high-dimensional distributions with far fewer projections, which reduces the iteration complexity by a significant amount. We verify this fact in our experiments, where we illustrate the superior performance of the proposed distances in both synthetic and real-data settings.

2 Background

We review in this section the preliminary concepts and formulations needed to develop our framework, namely the p-Wasserstein distance, the Radon transform, the sliced p-Wasserstein distance and the maximum sliced p-Wasserstein distance. In what follows, we denote by P_p(Ω) the set of Borel probability measures with finite p-th moment defined on a given metric space (Ω, d), and by µ ∈ P_p(X) and ν ∈ P_p(Y) probability measures defined on X, Y ⊆ Ω with corresponding probability density functions I_µ and I_ν, i.e. dµ(x) = I_µ(x)dx and dν(y) = I_ν(y)dy.

Wasserstein Distance. 
The p-Wasserstein distance, p ∈ [1, ∞), between µ and ν is defined as the solution of the optimal mass transportation problem [1]:

W_p(µ, ν) = ( inf_{γ ∈ Γ(µ,ν)} ∫_{X×Y} d^p(x, y) dγ(x, y) )^{1/p},   (1)

where d^p(·,·) is the cost function, and Γ(µ, ν) is the set of all transportation plans γ such that:

γ(A × Y) = µ(A) for any Borel A ⊆ X,
γ(X × B) = ν(B) for any Borel B ⊆ Y.

Due to Brenier's theorem [38], for absolutely continuous probability measures µ and ν (with respect to the Lebesgue measure), the p-Wasserstein distance can be equivalently obtained from

W_p(µ, ν) = ( inf_{f ∈ MP(µ,ν)} ∫_X d^p(x, f(x)) dµ(x) )^{1/p},   (2)

where MP(µ, ν) = {f : X → Y | f#µ = ν} and f#µ represents the pushforward of measure µ, characterized as

∫_A d(f#µ)(y) = ∫_{f^{-1}(A)} dµ(x) for any Borel subset A ⊆ Y.

Note that in most engineering and computer science applications, Ω is a compact subset of R^d and d(x, y) = |x − y| is the Euclidean distance. By abuse of notation, we will use W_p(µ, ν) and W_p(I_µ, I_ν) interchangeably.

One-dimensional distributions: The case of one-dimensional continuous probability measures is specifically interesting, as the p-Wasserstein distance has a closed-form solution. More precisely, for one-dimensional probability measures, there exists a unique monotonically increasing transport map that pushes one measure to another. Let F_µ(x) = µ((−∞, x]) = ∫_{−∞}^{x} I_µ(τ)dτ be the cumulative distribution function (CDF) of I_µ, and define F_ν to be the CDF of I_ν. The optimal transport map is then uniquely defined as f(x) = F_ν^{-1}(F_µ(x)) and, consequently, the p-Wasserstein distance has an analytical form given as follows:

W_p(µ, ν) = ( ∫_X d^p(x, F_ν^{-1}(F_µ(x))) dµ(x) )^{1/p} = ( ∫_0^1 d^p(F_µ^{-1}(z), F_ν^{-1}(z)) dz )^{1/p},   (3)

where Eq. (3) results from the change of variable F_µ(x) = z. Note that for empirical distributions, Eq. (3) is calculated by simply sorting the samples from the two distributions and calculating the average d^p(·,·) between the sorted samples. This requires only O(M) operations at best and O(M log M) at worst, where M is the number of samples drawn from each distribution (see [30] for more details).

The closed-form solution of the p-Wasserstein distance for one-dimensional distributions is an attractive property that gives rise to the sliced-Wasserstein (SW) distance. Next, we review the Radon transform, which enables the definition of the SW distance. We also formulate an alternative OT distance called the maximum sliced-Wasserstein distance.

Radon Transform. 
The standard Radon transform, denoted by R, maps a function I ∈ L^1(R^d), where

L^1(R^d) = {I : R^d → R | ∫_{R^d} |I(x)| dx < ∞},

to the infinite set of its integrals over the hyperplanes of R^d, and is defined as

RI(t, θ) = ∫_{R^d} I(x) δ(t − ⟨x, θ⟩) dx,   (4)

for (t, θ) ∈ R × S^{d−1}, where S^{d−1} ⊂ R^d stands for the d-dimensional unit sphere, δ(·) is the one-dimensional Dirac delta function, and ⟨·,·⟩ is the Euclidean inner product. Note that R : L^1(R^d) → L^1(R × S^{d−1}). Each hyperplane can be written as:

H(t, θ) = {x ∈ R^d | ⟨x, θ⟩ = t},   (5)

which alternatively can be interpreted as a level set of the function g : R^d × S^{d−1} → R defined as g(x, θ) = ⟨x, θ⟩. For a fixed θ, the integrals over all hyperplanes orthogonal to θ define a continuous function RI(·, θ) : R → R, which is a projection (or a slice) of I.

The Radon transform is a linear bijection [39, 33] and its inverse R^{-1} is defined as:

I(x) = R^{-1}(RI(t, θ)) = ∫_{S^{d−1}} (RI(·, θ) ∗ η)(⟨x, θ⟩) dθ,   (6)

where η(·) is a one-dimensional high-pass filter with corresponding Fourier transform F_η(ω) = c|ω|^{d−1}, which appears due to the Fourier slice theorem [33], and '∗' is the convolution operator. The above definition of the inverse Radon transform is also known as the filtered back-projection method, which is extensively used in image reconstruction in the biomedical imaging community. Intuitively, each one-dimensional projection (or slice) RI(·, θ) is first filtered via a high-pass filter and then smeared back into R^d along H(·, θ) to approximate I. The summation of all smeared approximations then reconstructs I. Note that in practice, acquiring an infinite number of projections is not feasible; therefore, the integration in the filtered back-projection formulation is replaced with a finite summation over projections (i.e., a Monte Carlo approximation).

Sliced-Wasserstein and Maximum Sliced-Wasserstein Distances. The idea behind the sliced p-Wasserstein distance is to first obtain a family of one-dimensional representations for a higher-dimensional probability distribution through linear projections (via the Radon transform), and then calculate the distance between two input distributions as a functional of the p-Wasserstein distances between their one-dimensional representations (i.e., the one-dimensional marginals). The sliced p-Wasserstein distance between I_µ and I_ν is then formally defined as:

SW_p(I_µ, I_ν) = ( ∫_{S^{d−1}} W_p^p(RI_µ(·, θ), RI_ν(·, θ)) dθ )^{1/p}.   (7)

This is indeed a distance function, as it satisfies positive-definiteness, symmetry and the triangle inequality [23, 24].

The computation of the SW distance requires an integration over the unit sphere in R^d. In practice, this integration is approximated by using a simple Monte Carlo scheme that draws samples {θ_l} from the uniform distribution on S^{d−1} and replaces the integral with a finite-sample average:

SW_p(I_µ, I_ν) ≈ ( (1/L) Σ_{l=1}^{L} W_p^p(RI_µ(·, θ_l), RI_ν(·, θ_l)) )^{1/p}.   (8)

In higher dimensions, the random nature of slices could lead to underestimating the distance between the two probability measures. 
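For concreteness, the Monte Carlo estimator in Equation (8), combined with the sorting-based one-dimensional W_p of Equation (3), can be sketched in a few lines of NumPy. This is a minimal illustration with our own function names, and it assumes the two sample sets have equal size:

```python
import numpy as np

def wasserstein_1d(u, v, p=2):
    """Closed-form 1-D p-Wasserstein (Eq. (3)): sort both sample sets and
    average |.|^p between the order statistics."""
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, L=50, p=2, rng=None):
    """Monte Carlo estimate of SW_p (Eq. (8)): average W_p^p over L random
    directions theta drawn uniformly on the unit sphere S^{d-1}."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    theta = rng.normal(size=(L, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform on S^{d-1}
    # project the samples onto each theta (the Radon slices of the empirical measures)
    proj_X, proj_Y = X @ theta.T, Y @ theta.T              # shape (N, L)
    proj_X, proj_Y = np.sort(proj_X, axis=0), np.sort(proj_Y, axis=0)
    # mean over both axes = (1/L) sum_l (1/N) sum_n |.|^p
    return np.mean(np.abs(proj_X - proj_Y) ** p) ** (1.0 / p)
```

As the text notes, each slice costs only a sort, which is what makes SW so much cheaper than the d-dimensional Wasserstein distance.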
To further clarify this, let I_µ = N(0, I_d) and I_ν = N(x_0, I_d), x_0 ∈ R^d, be two multivariate Gaussian densities with the identity matrix as the covariance matrix. Their projected representations are one-dimensional Gaussian distributions of the form RI_µ(·, θ) = N(0, 1) and RI_ν(·, θ) = N(⟨θ, x_0⟩, 1). It is therefore clear that W_2(RI_µ(·, θ), RI_ν(·, θ)) achieves its maximum value when θ = x_0/‖x_0‖_2 and is zero for θ's that are orthogonal to x_0. On the other hand, we know that vectors randomly picked from the unit sphere are more likely to be nearly orthogonal in high dimension. More rigorously, the following inequality holds: Pr(|⟨θ, x_0/‖x_0‖_2⟩| < ε) > 1 − e^{−dε²}, which implies that for a high dimension d, the majority of sampled θ's would be nearly orthogonal to x_0 and, therefore, W_2(RI_µ(·, θ), RI_ν(·, θ)) ≈ 0 with high probability.

To remedy this issue, one can avoid uniform sampling of the unit sphere, and instead pick samples θ that contain discriminant information between I_µ and I_ν. This idea was used, for instance, in [28, 35, 36]. Deshpande et al. 
[28] first calculate a linear discriminant subspace and then measure the empirical SW distance by setting the θ's to be the discriminant components of the subspace.

A similarly flavored but less heuristic approach is to use the maximum sliced p-Wasserstein (max-SW) distance, which is an alternative OT metric defined as [35]:

max-SW_p(I_µ, I_ν) = max_{θ ∈ S^{d−1}} W_p(RI_µ(·, θ), RI_ν(·, θ)).   (9)

Given that W_p is a distance, it is straightforward to show that max-SW_p is also a distance: we will prove in Section 3.2 that the metric axioms also hold for the maximum generalized sliced-Wasserstein distance, which contains the max-SW distance as a special case.

3 Generalized Sliced-Wasserstein Distances

We propose in this paper to extend the definition of the sliced-Wasserstein distance to formulate a new optimal transport metric, which we call the generalized sliced-Wasserstein (GSW) distance. The GSW distance is obtained using the same procedure as for the SW distance, except that here, the one-dimensional representations are acquired through nonlinear projections. In this section, we first review the generalized Radon transform, which is used to project the high-dimensional distributions, and we then formally define the class of GSW distances. We also extend the concept of max-SW distance to the class of maximum generalized sliced-Wasserstein (max-GSW) distances.

Figure 1: Visualizing the slicing process for classical and generalized Radon transforms for the Half Moons distribution. The slices GI(t, θ) follow Equation (10).

3.1 Generalized Radon Transform

The generalized Radon transform (GRT) extends the original idea of the classical Radon transform introduced by [32] from integration over hyperplanes of R^d to integration over hypersurfaces, i.e. (d − 1)-dimensional manifolds [37, 40, 41, 42, 43, 44]. The GRT has various applications, including Thermoacoustic Tomography, where the hypersurfaces are spheres, and Electrical Impedance Tomography, which requires integration over hyperbolic surfaces.

To formally define the GRT, we introduce a function g defined on X × (R^n\{0}) with X ⊂ R^d. We say that g is a defining function when it satisfies the four conditions below:

H1. g is a real-valued C^∞ function on X × (R^n\{0}).

H2. g(x, θ) is homogeneous of degree one in θ, i.e., ∀λ ∈ R, g(x, λθ) = λg(x, θ).

H3. g is non-degenerate in the sense that ∀(x, θ) ∈ X × (R^n\{0}), ∂g/∂x(x, θ) ≠ 0.

H4. The mixed Hessian of g is strictly positive, i.e. det( (∂²g/∂x_i∂θ_j)_{i,j} ) > 0.

Then, the GRT of I ∈ L^1(R^d) is the integration of I over the hypersurfaces characterized by the level sets of g, namely H_{t,θ} = {x ∈ X | g(x, θ) = t}. Let g be a defining function. The generalized Radon transform of I, denoted by GI, is then formally defined as:

GI(t, θ) = ∫_{R^d} I(x) δ(t − g(x, θ)) dx.   (10)

Note that the standard Radon transform is a special case of the GRT with g(x, θ) = ⟨x, θ⟩. 
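To make the slicing concrete, here is a small sketch of three candidate defining functions discussed in the text: the linear one, an odd homogeneous polynomial in d = 2, and the circular one. Applied to samples, g(x_i, θ) directly gives the generalized slice of an empirical measure; the function names are ours:

```python
import numpy as np

def g_linear(X, theta):
    # classical Radon slicing: g(x, theta) = <x, theta>
    return X @ theta

def g_poly3(X, theta):
    # homogeneous degree-3 polynomial in d = 2: the four monomials x^alpha with
    # |alpha| = 3, weighted by theta on the sphere S^{d_alpha - 1} (d_alpha = 4)
    feats = np.stack([X[:, 0] ** 3,
                      X[:, 0] ** 2 * X[:, 1],
                      X[:, 0] * X[:, 1] ** 2,
                      X[:, 1] ** 3], axis=1)
    return feats @ theta

def g_circular(X, theta, r=2.0):
    # circular defining function g(x, theta) = ||x - r * theta||_2 (see Section 3.3)
    return np.linalg.norm(X - r * theta[None, :], axis=1)
```

Each of these maps an (N, d) sample array to N scalar "projections"; H2 (degree-one homogeneity in θ) can be checked numerically for the first two, since both are linear in θ.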
Figure 1 illustrates the slicing process for standard and generalized Radon transforms for the Half Moons dataset as input.

3.2 Generalized Sliced-Wasserstein and Max-Generalized Sliced-Wasserstein Distances

Following the definition of the SW distance in Equation (7), we define the generalized sliced p-Wasserstein distance using the generalized Radon transform as:

GSW_p(I_µ, I_ν) = ( ∫_{Ω_θ} W_p^p(GI_µ(·, θ), GI_ν(·, θ)) dθ )^{1/p},   (11)

where Ω_θ is a compact set of feasible parameters for g(·, θ) (e.g., Ω_θ = S^{d−1} for g(·, θ) = ⟨·, θ⟩). The GSW distance can also suffer from the projection complexity issue described before; that is why we formulate the maximum generalized sliced p-Wasserstein distance, which generalizes the max-SW distance as defined in (9):

max-GSW_p(I_µ, I_ν) = max_{θ ∈ Ω_θ} W_p(GI_µ(·, θ), GI_ν(·, θ)).   (12)

Proposition 1. The generalized sliced p-Wasserstein distance and the maximum generalized sliced p-Wasserstein distance are, indeed, distances over P_p(Ω) if and only if the generalized Radon transform is injective.

The proof is given in the supplementary document.

Remark 1. If the chosen generalized Radon transform is not injective, then we can only say that the GSW and max-GSW distances are pseudo-metrics: they still satisfy non-negativity, symmetry, the triangle inequality, and GSW_p(I_µ, I_µ) = 0 and max-GSW_p(I_µ, I_µ) = 0.

Remark 2. Proposition 1 shows that the injectivity of the GRT is sufficient and necessary for GSW to be a metric. 
In this respect, our result brings a different perspective on the results of [23] by showing that SW is indeed a distance, since the standard Radon transform is injective.

3.3 Injectivity of the Generalized Radon Transform

We have shown that the injectivity of the GRT is crucial for the GSW and max-GSW distances to be, indeed, distances between probability measures. Here, we enumerate some of the known defining functions that lead to injective GRTs.

The investigation of the sufficient and necessary conditions for the injectivity of GRTs is a long-standing topic [37, 44, 45, 41]. The circular defining function, g(x, θ) = ‖x − r·θ‖_2 with r ∈ R^+ and Ω_θ = S^{d−1}, was shown to provide an injective GRT [43]. More interestingly, homogeneous polynomials with an odd degree also yield an injective GRT [46], i.e. g(x, θ) = Σ_{|α|=m} θ_α x^α, where we use the multi-index notation α = (α_1, ..., α_d) ∈ N^d, |α| = Σ_{i=1}^{d} α_i, and x^α = Π_{i=1}^{d} x_i^{α_i}. Here, the summation iterates over all possible multi-indices α such that |α| = m, where m denotes the degree of the polynomial and θ_α ∈ R. Denoting by d_α the number of such multi-indices, the parameter set for homogeneous polynomials is set to Ω_θ = S^{d_α−1}. We can observe that choosing m = 1 reduces to the linear case ⟨x, θ⟩, since the set of multi-indices with |α| = 1 becomes {(α_1, ..., α_d) ; α_i = 1 for a single i, and α_j = 0 ∀j ≠ i}, which contains d elements.

While polynomial projections form an interesting alternative to linear projections, their memory complexity d_α grows combinatorially with the dimension of the data and the degree of the polynomial, which deteriorates their potential in modern machine learning problems. 
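This combinatorial growth of d_α can be checked directly: the monomials of a homogeneous degree-m polynomial in d variables number C(d + m − 1, m). A small sketch (the function name is ours):

```python
from itertools import combinations_with_replacement
from math import comb

def multi_indices(d, m):
    """Enumerate all multi-indices alpha with |alpha| = m in dimension d,
    i.e. one entry per monomial x^alpha of a homogeneous degree-m polynomial."""
    idx = []
    for combo in combinations_with_replacement(range(d), m):
        alpha = [0] * d
        for i in combo:
            alpha[i] += 1   # alpha_i counts how many times coordinate i appears
        idx.append(tuple(alpha))
    return idx

# d_alpha = C(d + m - 1, m); e.g. for MNIST-sized inputs (d = 784), a degree-3
# polynomial already needs C(786, 3) = 80,622,640 coefficients per projection.
```

For m = 1 the enumeration returns exactly d multi-indices, recovering the linear case described above.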
As a remedy, given the current success of neural networks in various application domains, a natural task in our context is to come up with a neural network that yields a valid GSW or max-GSW when used as the defining function in the GRT. As a neural-network-based defining function, we propose a multi-layer fully connected network with 'leaky ReLU' activations. Under this specific network architecture, one can easily show that the corresponding defining function satisfies H1 to H4 on (X\{0}) × (R^n\{0}). On the other hand, it is highly non-trivial to show the injectivity of the associated GRT; therefore, the GSW associated with this particular defining function is a pseudo-metric, as we discussed in Remark 1. However, as illustrated later on in Section 5, this neural-network-based defining function still performs well in practice, and specifically, the non-differentiability of the leaky ReLU function at 0 does not seem to be a big issue in practice.

Remark 3. With a neural network as the defining function, minimizing max-GSW between two distributions is analogous to adversarial learning, where the adversary network's goal is to distinguish the two distributions. In the max-GSW case, the adversary network (i.e. the defining function) seeks optimal parameters that maximize the GSW distance between the input distributions.

4 Numerical Implementation

4.1 Generalized Radon Transforms of Empirical PDFs

In most machine learning applications, we do not have access to the distribution I_µ but to a set of samples {x_i}_{i=1}^{N} drawn from I_µ, for which the empirical density is I_µ(x) ≈ (1/N) Σ_{i=1}^{N} δ(x − x_i). The GRT of this empirical density is then given by GI_µ(t, θ) ≈ (1/N) Σ_{i=1}^{N} δ(t − g(x_i, θ)). Moreover, for high-dimensional problems, estimating I_µ in R^d requires a large number of samples. 
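For illustration, the empirical GRT amounts to evaluating the defining function at the samples. The sketch below uses a hypothetical bias-free two-layer leaky-ReLU network as g; this is our own minimal stand-in for a neural defining function, not the paper's exact architecture:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0.0, z, slope * z)

def g_nn(X, theta, d, h):
    # hypothetical neural defining function: g(x, theta) = w2^T LeakyReLU(W1 x),
    # with theta = (W1, w2) flattened; bias-free, so g is positively
    # homogeneous in x (this illustrates the role of leaky ReLU, not H1-H4)
    W1 = theta[: d * h].reshape(h, d)
    w2 = theta[d * h:]
    return leaky_relu(X @ W1.T) @ w2

def empirical_grt_slice(X, theta, d, h):
    # G I_mu(., theta) for an empirical measure is supported on {g(x_i, theta)};
    # returning the sorted values is exactly what the W_p computation needs
    return np.sort(g_nn(X, theta, d, h))
```

Note how cheap the slice is: one forward pass and one sort, regardless of the ambient dimension d.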
However, the projections of I_µ, GI_µ(·, θ), are one-dimensional, and it may not be critical to have a large number of samples to estimate these one-dimensional densities.

4.2 Numerical Implementation of GSW Distances

Let {x_i}_{i=1}^{N} and {y_j}_{j=1}^{N} be samples respectively drawn from I_µ and I_ν, and let g(·, θ) be a defining function. Following the work of [30], the Wasserstein distance between the one-dimensional distributions GI_µ(·, θ) and GI_ν(·, θ) can be calculated by sorting their samples and calculating the L_p distance between the sorted samples. In other words, the GSW distance between I_µ and I_ν can be approximated from their samples as follows:

GSW_p(I_µ, I_ν) ≈ ( (1/L) Σ_{l=1}^{L} (1/N) Σ_{n=1}^{N} |g(x_{i[n]}, θ_l) − g(y_{j[n]}, θ_l)|^p )^{1/p},

where i[n] and j[n] are the indices of the sorted {g(x_i, θ_l)}_{i=1}^{N} and {g(y_j, θ_l)}_{j=1}^{N}. The procedure to approximate the GSW distance is summarized in the supplementary document.

4.3 Numerical Implementation of max-GSW Distances

To compute the max-GSW distance, we perform an EM-like optimization scheme: (a) for a fixed θ, the g(x_i, θ) and g(y_j, θ) are sorted to compute W_p; (b) θ is updated with a Projected Gradient Descent (PGD) step:

θ = Proj_{Ω_θ}( Optim( ∇_θ( (1/N) Σ_{n=1}^{N} |g(x_{i[n]}, θ) − g(y_{j[n]}, θ)|^p ), θ ) ),

where Optim(·) refers to the preferred optimizer, for instance Gradient Descent (GD) or ADAM [47], and Proj_{Ω_θ}(·) is the operator projecting θ onto Ω_θ. For instance, when θ ∈ S^{n−1}, Proj_{Ω_θ}(θ) = θ/‖θ‖.

Remark 4. 
Here, we find the optimal θ by optimizing the actual W_p, as opposed to the heuristic approaches proposed in [28] and [30], where the pseudo-optimal slice is found via perceptrons or penalized linear discriminant analysis [48].

Finally, once convergence is reached, the max-GSW distance is approximated with:

max-GSW_p(I_µ, I_ν) ≈ ( (1/N) Σ_{n=1}^{N} |g(x_{i[n]}, θ*) − g(y_{j[n]}, θ*)|^p )^{1/p}.

The whole procedure is summarized as pseudocode in the supplementary document.

5 Experiments

In this section, we conduct experiments on generalized sliced-Wasserstein flows. We also implemented GSW-based auto-encoders, whose results are reported in the supplementary document due to space limitations. We provide the source code to reproduce the experiments of this paper.²

Our goal is to demonstrate the effects of the choice of the GSW distance in its purest form by considering the following problem: min_µ GSW_p(µ, ν), where ν is a target distribution and µ is the source distribution, which is initialized to be the normal distribution. The optimization is then solved iteratively via: ∂_t µ_t = −∇GSW_p(µ_t, ν), µ_0 = N(0, 1).

² See https://github.com/kimiandj/gsw.

Figure 2: Log 2-Wasserstein distance between the source and target distributions as a function of the number of iterations for 4 classical target distributions.

Figure 3: 2-Wasserstein distance between source and target distributions for the MNIST dataset.

We started by using 4 well-known distributions as the target, namely the 25-Gaussians, 8-Gaussians, Swiss Roll, and Circle distributions. We compare GSW and max-GSW for optimizing the flow with linear (i.e., SW distance), homogeneous polynomials of degree 3 and 5, and neural networks with 1, 2, and 3 hidden layers as defining functions. 
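Before turning to the results, the EM-like scheme of Section 4.3 that underlies these max-(G)SW flows can be sketched for the simplest, linear defining function (i.e., max-SW). The name `max_sw_pgd` is ours, and plain gradient ascent stands in for Optim(·):

```python
import numpy as np

def max_sw_pgd(X, Y, iters=200, lr=0.1, rng=None):
    """EM-like scheme of Section 4.3 for max-(G)SW, specialized to the linear
    defining function g(x, theta) = <x, theta>: (a) sort the projections for
    the current theta, (b) take a gradient step on theta (ascent, since theta
    maximizes W_p), (c) project theta back onto Omega_theta = S^{d-1}."""
    rng = np.random.default_rng(rng)
    theta = rng.normal(size=X.shape[1])
    theta /= np.linalg.norm(theta)
    for _ in range(iters):
        # (a) optimal 1-D coupling: match order statistics of the projections
        xs = X[np.argsort(X @ theta)]
        ys = Y[np.argsort(Y @ theta)]
        delta = xs - ys
        # (b) ascent step on W_2^2 = (1/N) sum_n <theta, delta_n>^2
        grad = 2.0 * (delta * (delta @ theta)[:, None]).mean(axis=0)
        theta = theta + lr * grad
        # (c) projection onto the unit sphere
        theta /= np.linalg.norm(theta)
    proj = np.sort(X @ theta) - np.sort(Y @ theta)
    return np.sqrt(np.mean(proj ** 2)), theta
```

On the Gaussian-shift example of Section 2, this recovers θ ≈ ±x_0/‖x_0‖_2 and a distance close to ‖x_0‖_2, whereas a single random θ would almost surely underestimate it.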
We used the exact same optimization scheme for all methods, kept only L = 1 projection, and calculated the 2-Wasserstein distance between μ_t and ν at each iteration of the optimization (by solving a linear program at each step). We repeated each experiment 100 times and report the mean of the 2-Wasserstein distance for all target datasets in Figure 2. We also show a snapshot of μ_t and ν at t = 100 iterations for all datasets. We observe that (i) max-GSW outperforms GSW, of course at the cost of an additional optimization, and (ii) while the choice of the defining function g(·, θ) is data-dependent, the homogeneous polynomials are often among the top performers for all datasets. Specifically, SW is always outperformed by GSW with polynomial projections ('Poly 3' and 'Poly 5' in Figure 2, left) and by all the variants of max-GSW. Besides, max-linear-SW is consistently outperformed by max-GSW-NN. The only variant of GSW that is outperformed by SW is GSW with a neural network-based defining function, which was expected because of the inherent difficulty of approximating the integral over a very large domain (11) with a simple Monte Carlo average. To circumvent this issue, max-GSW replaces sampling with optimization.

Figure 4: Flow minimization comparison between max-SW and max-GSW on the CelebA dataset.

To move to more realistic datasets, we considered GSW flows for the hand-written digit recognition dataset, MNIST, where we initialize 100 random images, optimize the flow via max-SW and max-GSW, and measure the 2-Wasserstein distance between μ_t (the 100 images) and ν (the training set of MNIST). See the supplementary material for videos. Given the high-dimensional nature of the problem (i.e., 784 dimensions),
we cannot use the homogeneous polynomials due to memory constraints caused by the combinatorial growth of the number of coefficients. Therefore, we chose a 3-layer neural network as our defining function. Figure 3 shows the 2-Wasserstein distance between the source and target distributions as a function of the number of training epochs. We observe that with the proposed approach the error decreases significantly faster than with the linear projections. We also observe this in the quality of the generated images, where we obtain crisper results.

Finally, we applied our methodology to a larger dataset, namely CelebA [49]. We performed flow optimization in a 256-dimensional latent space of a pre-trained auto-encoder, and compared max-SW with max-GSW using a 3-layer neural network. We then measured the 2-Wasserstein distance between the real and optimized distributions in the 256-dimensional latent space. Figure 4 shows the results of this experiment. As can be seen, max-GSW finds a better solution than max-SW in fewer iterations, and the quality of the generated images is slightly better.

6 Conclusion

We introduced a new family of optimal transport metrics for probability measures that generalizes the sliced-Wasserstein distance: while the latter is based on linear slicing of distributions, we propose to perform nonlinear slicing. We provided theoretical conditions under which the generalized sliced-Wasserstein distance is indeed a distance function, and we empirically demonstrated the superior performance of the GSW and max-GSW distances over the classical sliced-Wasserstein distance in various generative modeling applications.

Acknowledgements

This work was partially supported by the United States Air Force and DARPA under Contract No. FA8750-18-C-0103.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA. This work is also partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014) and by the industrial chair Machine Learning for Big Data from Télécom ParisTech.

References

[1] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[2] Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation for semi-supervised learning. In International Conference on Machine Learning, pages 306–314, 2014.

[3] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.

[4] Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726, 2016.

[5] Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.

[6] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.

[7] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.

[8] Morgan A Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi, Gabriel Peyré, and Jean-Luc Starck.
Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643–678, 2018.

[9] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[10] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.

[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[12] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.

[13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[14] Marco Cuturi and Gabriel Peyré. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, December 2015.

[15] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[16] Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2):254–269, 2013.

[17] Soheil Kolouri, Akif B Tosun, John A Ozolek, and Gustavo K Rohde. A continuous linear optimal transport approach for pattern analysis in image datasets.
Pattern Recognition, 51:453–462, 2016.

[18] Adam M Oberman and Yuanlong Ruan. An efficient linear programming method for optimal transportation. arXiv preprint arXiv:1509.03668, 2015.

[19] Bernhard Schmitzer. A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision, 56(2):238–259, Oct 2016.

[20] Bruno Lévy. A numerical algorithm for L2 semi-discrete optimal transport in 3D. ESAIM Math. Model. Numer. Anal., 49(6):1693–1715, 2015.

[21] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semi-discrete optimal transport. arXiv preprint arXiv:1603.05579, 2016.

[22] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

[23] Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Université Paris 11, France, 2013.

[24] Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced-Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2016.

[25] Mathieu Carriere, Marco Cuturi, and Steve Oudot. Sliced Wasserstein kernel for persistence diagrams. In ICML 2017 - Thirty-fourth International Conference on Machine Learning, pages 1–10, 2017.

[26] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[27] Antoine Liutkus, Umut Şimşekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stoter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions.
In International Conference on Machine Learning, 2019.

[28] Ishan Deshpande, Ziyu Zhang, and Alexander Schwing. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.

[29] Soheil Kolouri, Gustavo K. Rohde, and Heiko Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[30] Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019.

[31] K. Nadjahi, A. Durmus, U. Şimşekli, and R. Badeau. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In Advances in Neural Information Processing Systems, 2019.

[32] Johann Radon. Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Berichte Sächsische Akad. Wissenschaft. Math. Phys., Klass, 69:262, 1917.

[33] Sigurdur Helgason. The Radon transform on Rn. In Integral Geometry and Radon Transforms, pages 1–62. Springer, 2011.

[34] Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and Adrian Weller. Orthogonal estimation of Wasserstein distances. In Artificial Intelligence and Statistics (AISTATS), volume 89, pages 186–195, 16–18 Apr 2019.

[35] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander Schwing. Max-sliced Wasserstein distance and its use for GANs. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[36] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, 2019.

[37] Gregory Beylkin.
The inversion problem and applications of the generalized Radon transform. Communications on Pure and Applied Mathematics, 37(5):579–599, 1984.

[38] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.

[39] Frank Natterer. The mathematics of computerized tomography, volume 32. SIAM, 1986.

[40] AS Denisyuk. Inversion of the generalized Radon transform. Translations of the American Mathematical Society-Series 2, 162:19–32, 1994.

[41] Leon Ehrenpreis. The universality of the Radon transform. Oxford University Press on Demand, 2003.

[42] Israel M Gel'fand, Mark Iosifovich Graev, and Z Ya Shapiro. Differential forms and integral geometry. Functional Analysis and its Applications, 3(2):101–114, 1969.

[43] Peter Kuchment. Generalized transforms of Radon type and their applications. In Proceedings of Symposia in Applied Mathematics, volume 63, page 67, 2006.

[44] Andrew Homan and Hanming Zhou. Injectivity and stability for a generic class of generalized Radon transforms. The Journal of Geometric Analysis, 27(2):1515–1529, 2017.

[45] Gunther Uhlmann. Inside out: inverse problems and applications, volume 47. Cambridge University Press, 2003.

[46] François Rouvière. Nonlinear Radon and Fourier transforms. https://math.unice.fr/~frou/recherche/Nonlinear%20RadonW.pdf, 2015.

[47] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[48] Wei Wang, Yilin Mo, John A Ozolek, and Gustavo K Rohde. Penalized Fisher discriminant analysis and its application to image-based morphometry. Pattern Recognition Letters, 32(15):2128–2135, 2011.

[49] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild.
In Proceedings of International Conference on Computer Vision (ICCV), December 2015.