{"title": "Principal Geodesic Analysis for Probability Measures under the Optimal Transport Metric", "book": "Advances in Neural Information Processing Systems", "page_first": 3312, "page_last": 3320, "abstract": "We consider in this work the space of probability measures $P(X)$ on a Hilbert space $X$ endowed with the 2-Wasserstein metric. Given a finite family of probability measures in $P(X)$, we propose an iterative approach to compute geodesic principal components that summarize efficiently that dataset. The 2-Wasserstein metric provides $P(X)$ with a Riemannian structure and associated concepts (Fr\\'echet mean, geodesics, tangent vectors) which prove crucial to follow the intuitive approach laid out by standard principal component analysis. To make our approach feasible, we propose to use an alternative parameterization of geodesics proposed by \\citet[\\S 9.2]{ambrosio2006gradient}. These \\textit{generalized} geodesics are parameterized with two velocity fields defined on the support of the Wasserstein mean of the data, each pointing towards an ending point of the generalized geodesic. The resulting optimization problem of finding principal components is solved by adapting a projected gradient descend method. Experiment results show the ability of the computed principal components to capture axes of variability on histograms and probability measures data.", "full_text": "Principal Geodesic Analysis for Probability Measures\n\nunder the Optimal Transport Metric\n\nvivien.seguy@iip.ist.i.kyoto-u.ac.jp\n\nmcuturi@i.kyoto-u.ac.jp\n\nMarco Cuturi\n\nGraduate School of Informatics\n\nKyoto University\n\nVivien Seguy\n\nGraduate School of Informatics\n\nKyoto University\n\nAbstract\n\nGiven a family of probability measures in P (X ), the space of probability mea-\nsures on a Hilbert space X , our goal in this paper is to highlight one ore more\ncurves in P (X ) that summarize ef\ufb01ciently that family. 
We propose to study this problem under the optimal transport (Wasserstein) geometry, using curves that are restricted to be geodesic segments under that metric. We show that concepts that play a key role in Euclidean PCA, such as data centering or orthogonality of principal directions, find a natural equivalent in the optimal transport geometry, using Wasserstein means and differential geometry. The implementation of these ideas is, however, computationally challenging. To achieve scalable algorithms that can handle thousands of measures, we propose to use a relaxed definition for geodesics and regularized optimal transport distances. The interest of our approach is demonstrated on images seen either as shapes or color histograms.

1 Introduction

Optimal transport distances (Villani, 2008), a.k.a. Wasserstein or earth mover's distances, define a powerful geometry to compare probability measures supported on a metric space X. The Wasserstein space P(X)—the space of probability measures on X endowed with the Wasserstein distance—is a metric space which has received ample interest from a theoretical perspective. Given the prominent role played by probability measures and feature histograms in machine learning, the properties of P(X) can also have practical implications in data science. This was shown by Agueh and Carlier (2011), who first described Wasserstein means of probability measures. Wasserstein means have been recently used in Bayesian inference (Srivastava et al., 2015), clustering (Cuturi and Doucet, 2014), graphics (Solomon et al., 2015) or brain imaging (Gramfort et al., 2015). When X is not just metric but also a Hilbert space, P(X) is an infinite-dimensional Riemannian manifold (Ambrosio et al. 2006, Chap. 8; Villani 2008, Part II). Three recent contributions by Boissard et al. (2015, §5.2), Bigot et al. (2015) and Wang et al.
(2013) exploit this structure, directly or indirectly, to extend Principal Component Analysis (PCA) to P(X). These important seminal papers are, however, limited in their applicability and/or the type of curves they output. Our goal in this paper is to propose more general and scalable algorithms to carry out Wasserstein principal geodesic analysis on probability measures, and not simply dimensionality reduction as explained below.

Principal Geodesics in P(X) vs. Dimensionality Reduction on P(X) We provide in Fig. 1 a simple example that illustrates the motivation of this paper, and which also shows how our approach differentiates itself from existing dimensionality reduction algorithms (linear and non-linear) that draw inspiration from PCA. As shown in Fig. 1, linear PCA cannot produce components that remain in P(X). Even more advanced tools, such as those proposed by Hastie and Stuetzle (1989), fall slightly short of that goal. On the other hand, Wasserstein geodesic analysis yields geodesic components in P(X) that are easy to interpret and which can also be used to reduce dimensionality.

Figure 1: (top-left) Dataset: 60 × 60 images of a single Chinese character randomly translated, scaled and slightly rotated (36 images displayed out of 300 used). Each image is handled as a normalized histogram of 3,600 non-negative intensities. (middle-left) Dataset schematically drawn on P(X). The Wasserstein principal geodesics of this dataset are depicted in red, its Euclidean components in blue, and its principal curve (Verbeek et al., 2002) in yellow. (right) Actual curves (blue colors depict negative intensities, green intensities ≥ 1).
Neither the Euclidean components nor the principal curve belong to P(X), nor can they be interpreted as meaningful axes of variation.

Foundations of PCA and Riemannian Extensions Carrying out PCA on a family (x_1, . . . , x_N) of points taken in a space X can be described in abstract terms as: (i) define a mean element x̄ for that dataset; (ii) define a family of components in X, typically geodesic curves, that contain x̄; (iii) fit a component by making it follow the x_i's as closely as possible, in the sense that the sum of the distances of each point x_i to that component is minimized; (iv) fit additional components by iterating step (iii) several times, with the added constraint that each new component is different (orthogonal) enough to the previous components. When X is Euclidean and the x_i's are vectors in R^d, the (n + 1)-th component v_{n+1} can be computed iteratively by solving:

v_{n+1} ∈ argmin_{v ∈ V_n^⊥, ‖v‖_2 = 1} ∑_{i=1}^{N} min_{t ∈ R} ‖x_i − (x̄ + t v)‖_2^2,  where V_0 := ∅ and V_n := span{v_1, · · · , v_n}.  (1)

Since PCA is known to boil down to a simple eigen-decomposition when X is Euclidean or Hilbertian (Schölkopf et al., 1997), Eq. (1) looks artificially complicated. This formulation is, however, extremely useful to generalize PCA to Riemannian manifolds (Fletcher et al., 2004). This generalization proceeds first by replacing vector means, lines and orthogonality conditions using respectively Fréchet means (1948), geodesics, and orthogonality in tangent spaces. Riemannian PCA builds then upon the knowledge of the exponential map at each point x of the manifold X. Each exponential map exp_x is locally bijective between the tangent space T_x of x and X.
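As a quick numerical sanity check of the variational formulation in Eq. (1) (our illustration, not code from the paper): in the Euclidean case the first component is the top right singular vector of the centered data, and it should attain a lower Eq. (1) objective than any other unit direction.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic dataset in R^5 with one dominant direction of variance
X = rng.normal(size=(200, 5)) * np.array([3.0, 1.0, 0.5, 0.2, 0.1])
xbar = X.mean(axis=0)
C = X - xbar  # centered data

def pca_objective(v):
    """Objective of Eq. (1): sum_i min_t ||x_i - (xbar + t v)||^2 for unit v.
    The inner minimum over t has the closed form ||c_i||^2 - <c_i, v>^2."""
    v = v / np.linalg.norm(v)
    return float(np.sum(C ** 2) - np.sum((C @ v) ** 2))

# first principal component = top right singular vector of the centered data
v1 = np.linalg.svd(C, full_matrices=False)[2][0]

# v1 attains a lower Eq. (1) objective than random unit directions
for _ in range(100):
    assert pca_objective(v1) <= pca_objective(rng.normal(size=5)) + 1e-9
```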
After computing the Fréchet mean x̄ of the dataset, the logarithmic map log_x̄ at x̄ (the inverse of exp_x̄) is used to map all data points x_i onto T_x̄. Because T_x̄ is a Euclidean space by definition of Riemannian manifolds, the dataset (log_x̄ x_i)_i can be studied using Euclidean PCA. Principal geodesics in X can then be recovered by applying the exponential map to a principal component v⋆, {exp_x̄(t v⋆), |t| < ε}.

From Riemannian PCA to Wasserstein PCA: Related Work As remarked by Bigot et al. (2015), Fletcher et al.'s approach cannot be used as it is to define Wasserstein geodesic PCA, because P(X) is infinite dimensional and because there are no known ways to define exponential maps which are locally bijective between Wasserstein tangent spaces and the manifold of probability measures. To circumvent this problem, Boissard et al. (2015) and Bigot et al. (2015) have proposed to formulate the geodesic PCA problem directly as an optimization problem over curves in P(X).

Boissard et al. and Bigot et al. study the Wasserstein PCA problem in restricted scenarios: Bigot et al. focus their attention on measures supported on X = R, which considerably simplifies their analysis since it is known in that case that the Wasserstein space P(R) can be embedded isometrically in L1(R); Boissard et al. assume that each input measure has been generated from a single template density (the mean measure) which has been transformed according to one "admissible deformation" taken in a parameterized family of deformation maps. Their approach to Wasserstein PCA boils down to a functional PCA on such maps. Wang et al. proposed a more general approach: given a family of input empirical measures (μ_1, . . . , μ_N), they propose to compute first a "template measure" μ̃ using k-means clustering on ∑_i μ_i. They consider next all optimal transport plans π_i between that template μ̃ and each of the measures μ_i, and propose to compute the barycentric projection (see Eq. 8) of each optimal transport plan π_i to recover Monge maps T_i, on which standard PCA can be used. This approach is computationally attractive since it requires the computation of only one optimal transport per input measure. Its weakness lies, however, in the fact that the curves in P(X) obtained by displacing μ̃ along each of these PCA directions are not geodesics in general.

Contributions and Outline We propose a new algorithm to compute Wasserstein Principal Geodesics (WPG) in P(X) for arbitrary Hilbert spaces X. We use several approximations—both of the optimal transport metric and of its geodesics—to obtain tractable algorithms that can scale to thousands of measures. We provide first in §2 a review of the key concepts used in this paper, namely Wasserstein distances and means, geodesics and tangent spaces in the Wasserstein space. We propose in §3 to parameterize a Wasserstein principal component (PC) using two velocity fields defined on the support of the Wasserstein mean of all measures, and formulate the WPG problem as that of optimizing these velocity fields so that the average distance of all measures to that PC is minimal. This problem is non-convex and non-smooth. We propose to optimize smooth upper-bounds of that objective using entropy regularized optimal transport in §4. The practical interest of our approach is demonstrated in §5 on toy samples, datasets of shapes and histograms of colors.

Notations We write ⟨A, B⟩ for the Frobenius dot-product of matrices A and B. D(u) is the diagonal matrix of vector u.
For a mapping f : Y → Y, we say that f acts on a measure μ ∈ P(Y) through the pushforward operator # to define a new measure f#μ ∈ P(Y). This measure is characterized by the identity (f#μ)(B) = μ(f^{−1}(B)) for any Borel set B ⊂ Y. We write p_1 and p_2 for the canonical projection operators X^2 → X, defined as p_1(x_1, x_2) = x_1 and p_2(x_1, x_2) = x_2.

2 Background on Optimal Transport

Wasserstein Distances We start this section with the main mathematical object of this paper:

Definition 1. (Villani, 2008, Def. 6.1) Let P(X) be the space of probability measures on a Hilbert space X. Let Π(ν, η) be the set of probability measures on X^2 with marginals ν and η, i.e. p_1#π = ν and p_2#π = η. The squared 2-Wasserstein distance between ν and η in P(X) is defined as:

W_2^2(ν, η) = inf_{π ∈ Π(ν,η)} ∫_{X^2} ‖x − y‖_X^2 dπ(x, y).  (2)

Wasserstein Barycenters Given a family of N probability measures (μ_1, · · · , μ_N) in P(X) and weights λ ∈ R_+^N, Agueh and Carlier (2011) define μ̄, the Wasserstein barycenter of these measures:

μ̄ ∈ argmin_{ν ∈ P(X)} ∑_{i=1}^{N} λ_i W_2^2(μ_i, ν).

Our paper relies on several algorithms which have been recently proposed (Benamou et al., 2015; Bonneel et al., 2015; Carlier et al., 2015; Cuturi and Doucet, 2014) to compute such barycenters.

Wasserstein Geodesics Given two measures ν and η, let Π⋆(ν, η) be the set of optimal couplings for Eq. (2).
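For finitely supported measures, the infimum in Eq. (2) is a linear program over couplings with fixed marginals. A minimal sketch (our illustration; the use of scipy.optimize.linprog is an assumption, not a solver from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def w2_squared(x, a, y, b):
    """Squared 2-Wasserstein distance (Eq. 2) between the discrete measures
    sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j}, with x, y arrays whose
    rows are atoms in R^d, solved as an LP over the transportation polytope."""
    M = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared costs
    p, q = len(a), len(b)
    A_eq = np.zeros((p + q, p * q))
    for i in range(p):            # row marginals: P 1 = a
        A_eq[i, i * q:(i + 1) * q] = 1.0
    for j in range(q):            # column marginals: P^T 1 = b
        A_eq[p + j, j::q] = 1.0
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# uniform measures on {0, 1} and {2, 3} on the real line: the monotone
# coupling is optimal, with cost 0.5*(2-0)^2 + 0.5*(3-1)^2 = 4
x, y = np.array([[0.0], [1.0]]), np.array([[2.0], [3.0]])
a = b = np.array([0.5, 0.5])
assert abs(w2_squared(x, a, y, b) - 4.0) < 1e-8
```

This brute-force LP is only practical for small supports; the scalable alternatives discussed later in the paper regularize the problem.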
Informally speaking, it is well known that if either ν or η are absolutely continuous measures, then any optimal coupling π⋆ ∈ Π⋆(ν, η) is degenerate in the sense that, assuming for instance that ν is absolutely continuous, for all x in the support of ν only one point y ∈ X is such that dπ⋆(x, y) > 0. In that case, the optimal transport is said to have no mass splitting, and there exists an optimal mapping T : X → X such that π⋆ can be written, using a pushforward, as π⋆ = (id × T)#ν. When there is no mass splitting to transport ν to η, McCann's interpolant (1997):

g_t = ((1 − t) id + t T)#ν,  t ∈ [0, 1],  (3)

defines a geodesic curve in the Wasserstein space, i.e. (g_t)_t is locally the shortest path between any two measures located on the geodesic, with respect to W_2. In the more general case, where no optimal map T exists and mass splitting occurs (for some locations x one may have dπ⋆(x, y) > 0 for several y), then a geodesic can still be defined, but it relies on the optimal plan π⋆ instead: g_t = ((1 − t) p_1 + t p_2)#π⋆, t ∈ [0, 1], (Ambrosio et al., 2006, §7.2). Both cases are shown in Fig. 2.

Figure 2: Both plots display geodesic curves between two empirical measures ν and η on R^2. An optimal map exists in the left plot (no mass splitting occurs), whereas some of the mass of ν needs to be split to be transported onto η on the right plot.

Tangent Space and Tangent Vectors We briefly describe in this section the tangent spaces of P(X), and refer to (Ambrosio et al., 2006, Chap. 8) for more details. Let μ : I ⊂ R → P(X) be a curve in P(X). For a given time t, the tangent space of P(X) at μ_t is a subset of L^2(μ_t, X), the space of square-integrable velocity fields supported on Supp(μ_t). At any t, there exist tangent vectors v_t in L^2(μ_t, X) such that lim_{h→0} W_2(μ_{t+h}, (id + h v_t)#μ_t)/|h| = 0. Given a geodesic curve in P(X) parameterized as Eq. (3), its corresponding tangent vector at time zero is v = T − id.

3 Wasserstein Principal Geodesics

Geodesic Parameterization The goal of principal geodesic analysis is to define geodesic curves in P(X) that go through the mean μ̄ and which pass close enough to all target measures μ_i. To that end, geodesic curves can be parameterized with two end points ν and η. However, to avoid dealing with the constraint that a principal geodesic needs to go through μ̄, one can start instead from μ̄, and consider a velocity field v ∈ L^2(μ̄, X) which displaces all of the mass of μ̄ in both directions:

g_t(v) := (id + t v)#μ̄,  t ∈ [−1, 1].  (4)

Lemma 7.2.1 of Ambrosio et al. (2006) implies that any geodesic going through μ̄ can be written as Eq. (4). Hence, we do not lose any generality using this parameterization. However, given an arbitrary vector field v, the curve (g_t(v))_t is not necessarily a geodesic. Indeed, the maps id ± v are not necessarily in the set C_μ̄ := {r ∈ L^2(μ̄, X) | (id × r)#μ̄ ∈ Π⋆(μ̄, r#μ̄)} of maps that are optimal when moving mass away from μ̄. Ensuring thus, at each step of our algorithm, that v is still such that (g_t(v))_t is a geodesic curve is particularly challenging. To relax this strong assumption, we propose to use a generalized formulation of geodesics, which builds upon not one but two velocity fields, as introduced by Ambrosio et al.
(2006, §9.2):

Definition 2. (adapted from (Ambrosio et al., 2006, §9.2)) Let σ, ν, η ∈ P(X), and assume there is an optimal mapping T^{(σ,ν)} from σ to ν and an optimal mapping T^{(σ,η)} from σ to η. A generalized geodesic, illustrated in Fig. 3, between ν and η with base σ is defined by

g_t = ((1 − t) T^{(σ,ν)} + t T^{(σ,η)})#σ,  t ∈ [0, 1].

Choosing μ̄ as the base measure in Definition 2, and two fields v_1, v_2 such that id − v_1, id + v_2 are optimal mappings (in C_μ̄), we can define the following generalized geodesic g_t(v_1, v_2):

g_t(v_1, v_2) := (id − v_1 + t(v_1 + v_2))#μ̄,  for t ∈ [0, 1].  (5)

Generalized geodesics become true geodesics when v_1 and v_2 are positively proportional. We can thus consider a regularizer that controls the deviation from that property by defining Ω(v_1, v_2) = (⟨v_1, v_2⟩_{L^2(μ̄,X)} − ‖v_1‖_{L^2(μ̄,X)} ‖v_2‖_{L^2(μ̄,X)})^2, which is minimal when v_1 and v_2 are indeed positively proportional. We can now formulate the WPG problem as computing, for n ≥ 0, the (n + 1)th principal (generalized) geodesic component of a family of measures (μ_i)_i by solving, with λ > 0:

min_{v_1, v_2 ∈ L^2(μ̄,X)} λ Ω(v_1, v_2) + ∑_{i=1}^{N} min_{t ∈ [0,1]} W_2^2(g_t(v_1, v_2), μ_i),  s.t.  id − v_1, id + v_2 ∈ C_μ̄,  v_1 + v_2 ∈ span({v_1^{(i)} + v_2^{(i)}}_{i ≤ n})^⊥.  (6)

This problem is not convex in v_1, v_2.
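For intuition, the support of a generalized geodesic with a discrete base measure moves linearly in t. A small numerical sketch of Eq. (5) (our illustration, assuming the two fields already yield optimal maps id − v_1 and id + v_2):

```python
import numpy as np

def generalized_geodesic_support(Y, V1, V2, t):
    """Atom locations of g_t(v1, v2) = (id - v1 + t(v1 + v2)) # mu_bar
    (Eq. 5) for a discrete base measure whose atoms are the columns of Y;
    the weights of mu_bar are transported unchanged along the curve."""
    return Y - V1 + t * (V1 + V2)

# toy base measure with three atoms in R^2 and two velocity fields,
# assumed (as Eq. 5 requires) to induce optimal maps id - v1 and id + v2
Y = np.array([[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]])
V1 = np.array([[0.5, 0.5, 0.5], [0.0, 0.0, 0.0]])
V2 = np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])

# the endpoints of the curve are (id - v1)#mu_bar and (id + v2)#mu_bar
assert np.allclose(generalized_geodesic_support(Y, V1, V2, 0.0), Y - V1)
assert np.allclose(generalized_geodesic_support(Y, V1, V2, 1.0), Y + V2)
```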
We propose to find an approximation of that minimum by a projected gradient descent, with a projection that is to be understood in terms of an alternative metric on the space of vector fields L^2(μ̄, X). To preserve the optimality of the mappings id − v_1 and id + v_2 between iterations, we introduce in the next paragraph a suitable projection operator on L^2(μ̄, X).

Remark 1. A trivial way to ensure that (g_t(v))_t is geodesic is to impose that the vector field v is a translation, namely that v is uniformly equal to a vector τ on all of Supp(μ̄). One can show in that case that the WPG problem described in Eq. (6) outputs an optimal vector τ which is the Euclidean principal component of the family formed by the means of each measure μ_i.

Figure 3: Generalized geodesic interpolation between two empirical measures ν and η using the base measure σ, all defined on X = R^2.

Projection on the Optimal Mapping Set We use a projected gradient descent method to solve Eq. (6) approximately. We will compute the gradient of a local upper-bound of the objective of Eq. (6) and update v_1 and v_2 accordingly. We then need to ensure that v_1 and v_2 are such that id − v_1 and id + v_2 belong to the set of optimal mappings C_μ̄. To do so, we would ideally want to compute the projection r_2 of id + v_2 on C_μ̄,

r_2 = argmin_{r ∈ C_μ̄} ‖(id + v_2) − r‖^2_{L^2(μ̄,X)},  (7)

to update v_2 ← r_2 − id. Westdickenberg (2010) has shown that the set of optimal mappings C_μ̄ is a convex closed cone in L^2(μ̄, X), leading to the existence and the uniqueness of the solution of Eq. (7). However, there is to our knowledge no known method to compute the projection r_2 of id + v_2. There is nevertheless a well known and efficient approach to find a mapping r_2 in C_μ̄ which is close to id + v_2. That approach, known as the barycentric projection, requires computing first an optimal coupling π⋆ between μ̄ and (id + v_2)#μ̄, to define then a (conditional expectation) map

T_{π⋆}(x) := ∫_X y dπ⋆(y|x).  (8)

Ambrosio et al. (2006, Theorem 12.4.4) or Reich (2013, Lemma 3.1) have shown that T_{π⋆} is indeed an optimal mapping between μ̄ and T_{π⋆}#μ̄. We can thus set the velocity field as v_2 ← T_{π⋆} − id to carry out an approximate projection. We show in the supplementary material that this operator can be in fact interpreted as a projection under a pseudo-metric GW_μ̄ on L^2(μ̄, X).

4 Computing Principal Generalized Geodesics in Practice

We show in this section that when X = R^d, the steps outlined above can be implemented efficiently.

Input Measures and Their Barycenter Each input measure in the family (μ_1, · · · , μ_N) is a finite weighted sum of Diracs, described by n_i points contained in a matrix X_i of size d × n_i, and a (non-negative) weight vector a_i of dimension n_i summing to 1. The Wasserstein mean of these measures is given and equal to μ̄ = ∑_{k=1}^{p} b_k δ_{y_k}, where the nonnegative vector b = (b_1, · · · , b_p) sums to one, and Y = [y_1, · · · , y_p] ∈ R^{d×p} is the matrix containing the locations of μ̄.

Generalized Geodesic Two velocity vectors for each of the p points in μ̄ are needed to parameterize a generalized geodesic. These velocity fields will be represented by two matrices V_1 = [v_1^1, · · · , v_1^p] and V_2 = [v_2^1, · · · , v_2^p] in R^{d×p}. Assuming that these velocity fields yield optimal mappings, the points at time t of that generalized geodesic are the measures parameterized by t,

g_t(V_1, V_2) = ∑_{k=1}^{p} b_k δ_{z_k^t},  with locations Z_t = [z_1^t, . . . , z_p^t] := Y − V_1 + t(V_1 + V_2).

The squared 2-Wasserstein distance between datum μ_i and a point g_t(V_1, V_2) on the geodesic is:

W_2^2(g_t(V_1, V_2), μ_i) = min_{P ∈ U(b, a_i)} ⟨P, M_{Z_t X_i}⟩,  (9)

where U(b, a_i) is the transportation polytope {P ∈ R_+^{p×n_i}, P 1_{n_i} = b, P^T 1_p = a_i}, and M_{Z_t X_i} stands for the p × n_i matrix of squared-Euclidean distances between the p and n_i column vectors of Z_t and X_i respectively. Writing z_t = D(Z_t^T Z_t) and x_i = D(X_i^T X_i) (where, with a slight abuse of notation, D(·) here extracts the vector of diagonal entries), we have that

M_{Z_t X_i} = z_t 1_{n_i}^T + 1_p x_i^T − 2 Z_t^T X_i ∈ R^{p×n_i},

which, by taking into account the marginal conditions on P ∈ U(b, a_i), leads to

⟨P, M_{Z_t X_i}⟩ = b^T z_t + a_i^T x_i − 2⟨P, Z_t^T X_i⟩.  (10)

1. Majorization of the Distance of each μ_i to the Principal Geodesic Using Eq. (10), the distance between each μ_i and the PC (g_t(V_1, V_2))_t can be cast as a function f_i of (V_1, V_2):

f_i(V_1, V_2) := min_{t ∈ [0,1]} ( b^T z_t + a_i^T x_i + min_{P ∈ U(b, a_i)} −2⟨P, (Y − V_1 + t(V_1 + V_2))^T X_i⟩ ),  (11)

where we have replaced Z_t above by its explicit form in t to highlight that the objective above is quadratic convex plus piecewise linear concave as a function of t, and thus neither convex nor concave. Assume that we are given P^♯ and t^♯ that are approximate arg-minima for f_i(V_1, V_2). For any A, B in R^{d×p}, we thus have that each distance f_i(V_1, V_2) appearing in Eq. (6) is such that

f_i(A, B) ≤ m_i^{V_1 V_2}(A, B) := ⟨P^♯, M_{Z_{t^♯} X_i}⟩.  (12)

We can thus use a majorization-minimization procedure (Hunter and Lange, 2000) to minimize the sum of terms f_i by iteratively creating majorization functions m_i^{V_1 V_2} at each iterate (V_1, V_2). All functions m_i^{V_1 V_2} are quadratic convex. Given that we need to ensure that these velocity fields yield optimal mappings, and that they may also need to satisfy orthogonality constraints with respect to lower-order principal components, we use gradient steps to update V_1, V_2, which can be recovered using (Cuturi and Doucet, 2014, §4.3) and the chain rule as:

∇_1 m_i^{V_1 V_2} = 2(t^♯ − 1)(Z_{t^♯} − X_i P^{♯T} D(b^{−1})),  ∇_2 m_i^{V_1 V_2} = 2 t^♯ (Z_{t^♯} − X_i P^{♯T} D(b^{−1})).  (13)

2. Efficient Approximation of P^♯ and t^♯ As discussed above, gradients for majorization functions m_i^{V_1 V_2} can be obtained using approximate minima P^♯ and t^♯ for each function f_i. Because the objective of Eq. (11) is not convex w.r.t. t, we propose to do an exhaustive 1-d grid search with K values in [0, 1]. This approach would still require, in theory, to solve K optimal transport problems to solve Eq. (11) for each of the N input measures. To carry out this step efficiently, we propose to use entropy regularized transport (Cuturi, 2013), which allows for much faster computations and efficient parallelizations to recover approximately optimal transports P^♯.

3. Projected Gradient Update Velocity fields are updated with a gradient stepsize β > 0,

V_1 ← V_1 − β ( ∑_{i=1}^{N} ∇_1 m_i^{V_1 V_2} + λ ∇_1 Ω ),  V_2 ← V_2 − β ( ∑_{i=1}^{N} ∇_2 m_i^{V_1 V_2} + λ ∇_2 Ω ),

followed by a projection step to enforce that V_1 and V_2 lie in span(V_1^{(1)} + V_2^{(1)}, · · · , V_1^{(n)} + V_2^{(n)})^⊥ in the L^2(μ̄, X) sense when computing the (n + 1)th PC.
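For discrete measures, that span(·)^⊥ projection is a plain orthogonal projection under the b-weighted inner product of L^2(μ̄, X). The helper below is our illustrative sketch (not the authors' code); sequential deflation is exact here because previous components are mutually orthogonal by construction of the constraint:

```python
import numpy as np

def project_span_orthogonal(V, previous, b):
    """Project a d x p velocity field V onto the orthogonal complement of
    span(previous) for the L2(mu_bar) inner product
    <U, W> = sum_k b_k <U[:, k], W[:, k]>, where b holds the weights of
    the discrete Wasserstein mean and `previous` lists earlier components."""
    def dot(U, W):
        return float(np.sum(b * np.sum(U * W, axis=0)))
    for B in previous:
        V = V - (dot(V, B) / dot(B, B)) * B
    return V

b = np.array([0.25, 0.25, 0.5])                    # weights of mu_bar
B1 = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])  # a previous component
V = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
Vp = project_span_orthogonal(V, [B1], b)
# the projected field is orthogonal to B1 in the weighted inner product
assert abs(np.sum(b * np.sum(Vp * B1, axis=0))) < 1e-12
```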
We finally apply the barycentric projection operator defined at the end of §3. We first need to compute two optimal transport plans,

P_1^⋆ ∈ argmin_{P ∈ U(b,b)} ⟨P, M_{Y (Y − V_1)}⟩,  P_2^⋆ ∈ argmin_{P ∈ U(b,b)} ⟨P, M_{Y (Y + V_2)}⟩,  (14)

to form the barycentric projections, which then yield updated velocity vectors:

V_1 ← −((Y − V_1) P_1^{⋆T} D(b^{−1}) − Y),  V_2 ← (Y + V_2) P_2^{⋆T} D(b^{−1}) − Y.  (15)

We repeat steps 1, 2, 3 until convergence. Pseudo-code is given in the supplementary material.

5 Experiments

Figure 4: Wasserstein mean μ̄ and first PC computed on a dataset of four (left) and three (right) empirical measures. The second PC is also displayed in the right figure.

Toy samples: We first run our algorithm on two simple synthetic examples. We consider respectively 4 and 3 empirical measures supported on a small number of locations in X = R^2, so that we can compute their exact Wasserstein means, using the multi-marginal linear programming formulation given in (Agueh and Carlier, 2011, §4). These measures and their mean (red squares) are shown in Fig. 4. The first principal component on the left example is able to capture both the variability of average measure locations, from left to right, and also the variability in the spread of the measure locations. On the right example, the first principal component captures the overall elliptic shape of the supports of all considered measures. The second principal component reflects the variability in the parameters of each ellipse on which measures are located.
The variability in the weights of each location is also captured through the Wasserstein mean, since each single line of a generalized geodesic has a corresponding location and weight in the Wasserstein mean.

MNIST: For each of the digits ranging from 0 to 9, we sample 1,000 images in the MNIST database representing that digit. Each image, originally a 28×28 grayscale image, is converted into a probability distribution on that grid by normalizing each intensity by the total intensity in the image. We compute the Wasserstein mean for each digit using the approach of Benamou et al. (2015). We then follow our approach to compute the first three principal geodesics for each digit. Geodesics for four of these digits are displayed in Fig. 5 by showing intermediary (rasterized) measures on the curves. While some deformations in these curves can be attributed to relatively simple rotations around the digit center, more interesting deformations appear in some of the curves, such as the loop on the bottom left of digit 2. Our results are easy to interpret, unlike those obtained with Wang et al.'s approach (2013) on these datasets, see supplementary material. Fig. 6 displays the first PC obtained on a subset of MNIST composed of 2,000 images of 2 and 4 in equal proportions.

Figure 5: 1000 images for each of the digits 1, 2, 3, 4 were sampled from the MNIST database. We display above the first three PCs sampled at times t_k = k/4, k = 0, . . . , 4 for each of these digits.

Color histograms: We consider a subset of the Caltech-256 Dataset composed of three image categories: waterfalls, tomatoes and tennis balls, resulting in a set of 295 color images.
The pixels contained in each image can be seen as a point-cloud in the RGB color space [0, 1]^3. We use k-means quantization to reduce the size of these uniform point-clouds into a set of k = 128 weighted points, using cluster assignments to define the weights of each of the k cluster centroids. Each image can thus be regarded as a discrete probability measure of 128 atoms in the tridimensional RGB space. We then compute the Wasserstein barycenter of these measures supported on p = 256 locations using (Cuturi and Doucet, 2014, Alg. 2). Principal components are then computed as described in §4. The computation for a single PC is performed within 15 minutes on an iMac (3.4GHz Intel Core i7). Fig. 7 displays color palettes sampled along each of the first three PCs. The first PC suggests that the main source of color variability in the dataset is the illumination, each pixel going from dark to light. Second and third PCs display the variation of colors induced by the typical images' dominant colors (blue, red, yellow). Fig. 8 displays the second PC, along with three images projected on that curve. The projection of a given image on a PC is obtained by finding first the optimal time t^⋆ such that the distance of that image to the PC at t^⋆ is minimum, and then by computing an optimal color transfer (Pitié et al., 2007) between the original image and the histogram at time t^⋆.

Figure 6: First PC on a subset of MNIST composed of one thousand 2s and one thousand 4s.

Figure 7: Each row represents a PC displayed at regular time intervals from t = 0 (left) to t = 1 (right), from the first PC (top) to the third PC (bottom).

Figure 8: Color palettes from the second PC (t = 0 on the left, t = 1 on the right) displayed at times t = 0, 1/3, 2/3, 1. Images displayed in the top row are original; their projection on the PC is displayed below, using a color transfer with the palette in the PC to which they are the closest.

Conclusion We have proposed an approximate projected gradient descent method to compute generalized geodesic principal components for probability measures. Our experiments suggest that these principal geodesics may be useful to analyze shapes and distributions, and that they do not require any parameterization of shapes or deformations to be used in practice.

Acknowledgements MC acknowledges the support of JSPS young researcher A grant 26700002.

References

Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer, 2006.

Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré B: Probability and Statistics, 2015.

Emmanuel Boissard, Thibaut Le Gouic, Jean-Michel Loubes, et al. Distributions template estimate with Wasserstein metrics. Bernoulli, 21(2):740–759, 2015.

Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

Guillaume Carlier, Adam Oberman, and Edouard Oudet. Numerical methods for matching for teams and Wasserstein barycenters.
ESAIM: Mathematical Modelling and Numerical Analysis, 2015, to appear.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 685–693, 2014.

P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. Medical Imaging, IEEE Transactions on, 23(8):995–1005, 2004.

Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l'Institut Henri Poincaré, volume 10, pages 215–310. Presses Universitaires de France, 1948.

Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In Information Processing in Medical Imaging (IPMI). Springer, 2015.

Trevor Hastie and Werner Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.

David R. Hunter and Kenneth Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60–77, 2000.

Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153–179, 1997.

François Pitié, Anil C. Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.

Sebastian Reich. A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, 35(4):A2013–A2024, 2013.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis.
In Artificial Neural Networks, ICANN'97, pages 583–588. Springer, 1997.

Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (Proc. SIGGRAPH 2015), 34(4), 2015.

Sanvesh Srivastava, Volkan Cevher, Quoc Tran-Dinh, and David B. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 912–920, 2015.

Jakob J. Verbeek, Nikos Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition Letters, 23(8):1009–1017, 2002.

Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2008.

Wei Wang, Dejan Slepčev, Saurav Basu, John A. Ozolek, and Gustavo K. Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2):254–269, 2013.

Michael Westdickenberg. Projections onto the cone of optimal transport maps and compressible fluid flows. Journal of Hyperbolic Differential Equations, 7(04):605–649, 2010.