{"title": "Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 6917, "page_last": 6928, "abstract": "Computing optimal transport (OT) between measures in high dimensions is doomed by the curse of dimensionality. A popular approach to avoid this curse is to project input measures on lower-dimensional subspaces (1D lines in the case of sliced Wasserstein distances), solve the OT problem between these reduced measures, and settle for the Wasserstein distance between these reductions, rather than that between the original measures. This approach is however difficult to extend to the case in which one wants to compute an OT map (a Monge map) between the original measures. Since computations are carried out on lower-dimensional projections, classical map estimation techniques can only produce maps operating in these reduced dimensions. We propose in this work two methods to extrapolate, from an transport map that is optimal on a subspace, one that is nearly optimal in the entire space. We prove that the best optimal transport plan that takes such \"subspace detours\" is a generalization of the Knothe-Rosenblatt transport. We show that these plans can be explicitly formulated when comparing Gaussian measures (between which the Wasserstein distance is commonly referred to as the Bures or Fr\u00e9chet distance). We provide an algorithm to select optimal subspaces given pairs of Gaussian measures, and study scenarios in which that mediating subspace can be selected using prior information. 
We consider applications to semantic mediation between elliptic word embeddings and domain adaptation with Gaussian mixture models.", "full_text": "Subspace Detours: Building Transport Plans that are\n\nOptimal on Subspace Projections\n\nBoris Muzellec\nCREST, ENSAE\n\nboris.muzellec@ensae.fr\n\nMarco Cuturi\n\nGoogle Brain and CREST, ENSAE\n\ncuturi@google.com\n\nAbstract\n\nComputing optimal transport (OT) between measures in high dimensions is doomed\nby the curse of dimensionality. A popular approach to avoid this curse is to project\ninput measures on lower-dimensional subspaces (1D lines in the case of sliced\nWasserstein distances), solve the OT problem between these reduced measures,\nand settle for the Wasserstein distance between these reductions, rather than that\nbetween the original measures. This approach is however dif\ufb01cult to extend to\nthe case in which one wants to compute an OT map (a Monge map) between\nthe original measures. Since computations are carried out on lower-dimensional\nprojections, classical map estimation techniques can only produce maps operating\nin these reduced dimensions. We propose in this work two methods to extrapolate,\nfrom an transport map that is optimal on a subspace, one that is nearly optimal\nin the entire space. We prove that the best optimal transport plan that takes such\n\u201csubspace detours\u201d is a generalization of the Knothe-Rosenblatt transport. We show\nthat these plans can be explicitly formulated when comparing Gaussian measures\n(between which the Wasserstein distance is commonly referred to as the Bures or\nFr\u00e9chet distance). We provide an algorithm to select optimal subspaces given pairs\nof Gaussian measures, and study scenarios in which that mediating subspace can be\nselected using prior information. 
We consider applications to semantic mediation\nbetween elliptic word embeddings and domain adaptation with Gaussian mixture\nmodels.\n\n1\n\nIntroduction\n\nMinimizing the transport cost between two probability distributions [32] results in two useful\nquantities: the minimum cost itself, often cast as a loss or a metric (the Wasserstein distance), and\nthe minimizing solution, a function known as the Monge [20] map that pushes forward the \ufb01rst\nmeasure onto the second with least expected cost. While the former has long attracted the attention\nof the machine learning community, the latter is playing an increasingly important role in data\nsciences. Indeed, important problems such as domain adaptation [8], generative modelling [17, 2, 16],\nreconstruction of cell trajectories in biology [28] and auto-encoders [19, 30] among others can be\nrecast as the problem of \ufb01nding a map, preferably optimal, which transforms a reference distribution\ninto another. However, accurately estimating an OT map from data samples is a dif\ufb01cult problem,\nplagued by the well documented instability of OT in high-dimensional spaces [11, 13] and its high\ncomputational cost.\nOptimal Transport on Subspaces. Several approaches, both in theory and in practice, aim at\nbridging this gap. Theory [33] supports the idea that sample complexity can be improved when\nthe measures are supported on lower-dimensional manifolds of high-dimensional spaces. Practical\ninsights [9] supported by theory [15] advocate using regularizations to improve both computational\nand sample complexity. Some regularity in OT maps can also be encoded by looking at speci\ufb01c\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\ffamilies of maps [29, 23]. Another trend relies on lower-dimensional projections of measures before\ncomputing OT. 
In particular, sliced Wasserstein (SW) distances [4] leverage the simplicity of OT\nbetween 1D measures to de\ufb01ne distances and barycentres, by averaging the optimal transport between\nprojections onto several random directions. This approach has been applied to alleviate training\ncomplexity in the GAN/VAE literature [10, 34] and was generalized very recently in [22] who\nconsidered projections on k-dimensional subspaces that are adversarially selected. However, these\nsubspace approaches only carry out half of the goal of OT: by design, they do result in more robust\nmeasures of OT costs, but they can only provide maps in subspaces that are optimal (or nearly so)\nbetween the projected measures, not transportation maps in the original, high-dimensional space in\nwhich the original measures live. For instance, the closest thing to a map one can obtain from using\nseveral SW univariate projections is an average of several permutations, which is not a map but a\ntransport plan or coupling [26][25, p.6].\nOur approach. Whereas the approaches cited above focus on OT maps and plans in projection\nsubspaces only, we consider here plans and maps on the original space that are constrained to be\noptimal when projected on a given subspace E. This results in the de\ufb01nition of a class of transportation\nplans that \ufb01guratively need to make an optimal \u201cdetour\u201d in E. We propose two constructions to\nrecover such maps corresponding respectively (i) to the independent product between conditioned\nmeasures, and (ii) to the optimal conditioned map.\nPaper Structure. After recalling background material on OT in section 2, we introduce in section 3\nthe class of subspace-optimal plans that satisfy projection constraints on a given subspace E. 
We\ncharacterize the degrees of freedom of E-optimal plans using their disintegrations on E and introduce\ntwo extremal instances: Monge-Independent plans, which assume independence of the conditionals,\nand Monge-Knothe maps, in which the conditionals are optimally coupled. We give closed forms\nfor the transport between Gaussian distributions in section 4, respectively as a degenerate Gaussian\ndistribution, and a linear map with block-triangular matrix representation. We provide guidelines\nand a minimizing algorithm for selecting a subspace E when it is not prescribed a priori in section 5.\nFinally, in section 6 we showcase the behavior of MK and MI transports on (noisy) synthetic data,\nshow how using a mediating subspace can be applied to selecting meanings for polysemous elliptical\nword embeddings, and experiment using MK maps with the minimizing algorithm on a domain\nadaptation task with Gaussian mixture models.\nNotations. For E a linear subspace of Rd, E? is its orthogonal complement, VE 2 Rd\u21e5k (resp.\nVE? 2 Rd\u21e5dk) the matrix of orthonormal basis vectors of E (resp E?). pE : x ! V>Ex is the\northogonal projection operator onto E. P2(Rd) is the space of probability distributions over Rd with\n\ufb01nite second moments. B(Rd) is the Borel algebra over Rd. * denotes the weak convergence of\nmeasures. \u2326 is the product of measures, and is used in measure disintegration by abuse of notation.\n2 Optimal Transport: Plans, Maps and Disintegration of Measure\nKantorovitch Plans. For two probability measures \u00b5, \u232b 2P 2(Rd), we refer to the set of couplings\n\n\u21e7(\u00b5, \u232b) def={ 2P (Rd \u21e5 Rd) : 8A, B 2B (Rd), (A \u21e5 Rd) = \u00b5(A), (Rd \u21e5 B) = \u232b(B)}\n\nas the set of transportation plans between \u00b5, \u232b. 
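In the discrete case with uniform weights, an optimal plan in the set of couplings above is supported on a permutation, so it can be recovered with the Hungarian algorithm; a minimal numpy/scipy sketch of this standard fact (the sample data is illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))  # samples of mu, uniform weights 1/n
Y = rng.normal(size=(n, d))  # samples of nu, uniform weights 1/n

# pairwise squared Euclidean costs
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# for uniform discrete measures, an optimal plan is a permutation,
# recovered here by the Hungarian algorithm
row, col = linear_sum_assignment(C)
W2_sq = C[row, col].mean()  # empirical squared 2-Wasserstein distance

# the induced plan gamma puts mass 1/n on each matched pair
gamma = np.zeros((n, n))
gamma[row, col] = 1.0 / n
# gamma satisfies both marginal constraints defining Pi(mu, nu)
assert np.allclose(gamma.sum(axis=1), 1 / n)
assert np.allclose(gamma.sum(axis=0), 1 / n)
```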
The 2-Wasserstein distance between \u00b5 and \u232b is\nde\ufb01ned as\n\nW 2\n\n2 (\u00b5, \u232b) def= min\n\n2\u21e7(\u00b5,\u232b)\n\nE(X,Y )\u21e0\u21e5kX Y k2\u21e4 .\n\nConveniently, transportation problems with quadratic cost can be reduced to transportation between\ncentered measures. Indeed, let m\u00b5 (resp. m\u232b) denote \ufb01rst moment of \u00b5 (resp. \u232b). Then, 8 2\n\u21e7(\u00b5, \u232b), E(X,Y )\u21e0[kX Y k2] = km\u00b5 m\u232bk2 +E(X,Y )\u21e0[k(X m\u00b5) (Y m\u232b)k2]. Therefore,\nin the following all probability measures are assumed to be centered, unless stated otherwise.\nMonge Maps. For a Borel-measurable map T , the push-forward of \u00b5 by T is de\ufb01ned as the measure\nT]\u00b5 satisfying for all A 2B (Rd), T]\u00b5(A) = \u00b5(T 1(A)). A map such that T]\u00b5 = \u232b is called a\ntransportation map from \u00b5 to \u232b. When a transportation map exists, the Wasserstein distance can be\nwritten in the form of the Monge problem\n\nW 2\n\n2 (\u00b5, \u232b) = min\n\nT :T]\u00b5=\u232b\n\nEX\u21e0\u00b5[kX T (X)k2].\n\n2\n\n\fWhen it exists, the optimal transportation map T ? in the Monge problem is called the Monge map\nfrom \u00b5 to \u232b. It is then related to the optimal transportation plan ? by the relation ? = (Id, T ?)]\u00b5.\nWhen \u00b5 and \u232b are absolutely continuous (a.c.), a Monge map always exists ([27], Theorem 1.22).\nGlobal Maps or Plans that are Locally Optimal. Considering the projection operator on E, pE,\nwe write \u00b5E = (pE)]\u00b5 for the marginal distribution of \u00b5 on E. Suppose that we are given a Monge\nmap S between the two projected measures \u00b5E and \u232bE. One of the contributions of this paper is to\npropose extensions of this map S as a transportation plan (resp. a new map T ) whose projection\nE = (pE, pE)] on that subspace E coincides with the optimal transportation plan (IdE, S)]\u00b5E\n(resp. pE T = S pE). 
Formally, the transports introduced in section 3 only require that S be a\ntransport map from \u00b5E to \u232bE, but optimality is required in the closed forms given in section 4 for\nGaussian distributions. In either case, this constraint implies that is built \u201cassuming that\u201d it is equal\nto (IdE, S)]\u00b5E on E. This is rigorously de\ufb01ned using the notion of measure disintegration.\nDisintegration of Measures. The disintegration of \u00b5 on a subspace E is the collection of measures\n(\u00b5xE )xE2E supported on the \ufb01bers {xE}\u21e5 E? such that any test function can be integrated\nagainst \u00b5 asRRd d\u00b5 =RERE? (y)d\u00b5xE (y) d\u00b5E(xE). In particular, if X \u21e0 \u00b5, then the law of\nX given xE is \u00b5xE. By abuse of the measure product notation \u2326, measure disintegration is denoted\nas \u00b5 = \u00b5xE \u2326 \u00b5E. A more general description of disintegration can be found in [1], Ch. 5.5.\n3 Lifting Transport from Subspace to Full Space\n\nGiven two distributions \u00b5, \u232b 2P 2(Rd), it is often easier to compute a Monge map S between their\nmarginals \u00b5E,\u232b E on a k-dimensional subspace E rather than in the whole space Rd. When k = 1,\nthis fact is at the heart of sliced wasserstein approaches [4], which have recently sparked interest\nin the GAN/VAE literature [10, 34]. However, when k < d, there is in general no straightforward\nway of extending S to a transportation map or plan between \u00b5 and \u232b. In this section, we prove the\nexistence of such extensions and characterize them.\nSubspace-Optimal Plans. A transportation plan between \u00b5E and \u232bE is a coupling living in P(E\u21e5E).\nIn general, it cannot be cast directly as a transportation plan between \u00b5 and \u232b taking values in\nP(Rd \u21e5 Rd). However, the existence of such a \u201clifted\u201d plan is given by the following result, which is\nused in OT theory to prove that Wp is a metric:\nLemma 1 (The Gluing Lemma, [32]). 
Let \u00b51, \u00b52, \u00b53 2P (Rd). If 12 is a coupling of (\u00b51, \u00b52) and\n23 is a coupling of (\u00b52, \u00b53), then one can construct a triple of random variables (Z1, Z2, Z3) such\nthat (Z1, Z2) \u21e0 12 and (Z2, Z3) \u21e0 23.\nBy extension of the lemma, if we de\ufb01ne (i) a coupling between \u00b5 and \u00b5E, (ii) a coupling between\n\u232b and \u232bE, and (iii) the optimal coupling between \u00b5E and \u232bE, (Id, S)]\u00b5E (where S stands for the\nMonge map from \u00b5E to \u232bE), we get the existence of four random variables (with laws \u00b5, \u00b5E,\u232b and\n\u232bE) which follow the desired joint laws. However, the lemma does not imply the uniqueness of those\nrandom variables, nor does it give a closed form for the corresponding coupling between \u00b5 and \u232b.\nDe\ufb01nition 1 (Subspace-Optimal Plans). Let \u00b5, \u232b 2P 2(Rd) and E be a k-dimensional subspace of\nRd. Let S be a Monge map from \u00b5E to \u232bE. We de\ufb01ne the set of E-optimal plans between \u00b5 and \u232b as\n\u21e7E(\u00b5, \u232b) def={ 2 \u21e7(\u00b5, \u232b) : E = (IdE, S)]\u00b5E}.\nDegrees of freedom in \u21e7E(\u00b5, \u232b). When k < d, there can be in\ufb01nitely many E-optimal plans.\nHowever, we can further characterize the degrees of freedom available to de\ufb01ne plans in \u21e7E(\u00b5, \u232b).\nIndeed, let 2 \u21e7E(\u00b5, \u232b). Then, disintegrating on E \u21e5 E, we get = (xE ,yE ) \u2326 E, i.e. plans\nin \u21e7E(\u00b5, \u232b) only differ on their disintegrations on E \u21e5 E. Further, since E stems from a transport\n(Monge) map S, it is supported on the graph of S on E, G(S) = {(xE, S(xE)) : xE 2 E}\u21e2 E \u21e5 E.\nThis implies that puts zero mass when yE 6= S(xE) and thus that is fully characterized by\n(xE ,S(xE )), xE 2 E, i.e. by the couplings between \u00b5xE and \u232bS(xE ) for xE 2 E. This is illustrated\nin Figure 1. 
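For k = 1 and empirical measures, the subspace Monge map S between the projected measures is simply the monotone rearrangement obtained by sorting the projections; a small sketch (the direction v and the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 4
X = rng.normal(size=(n, d))  # samples of mu
Y = rng.normal(size=(n, d))  # samples of nu
v = np.eye(d)[0]             # orthonormal basis of a 1-D subspace E

xE, yE = X @ v, Y @ v        # projected measures mu_E, nu_E
sx, sy = np.argsort(xE), np.argsort(yE)

# monotone rearrangement: the i-th smallest projection of X is sent
# to the i-th smallest projection of Y; this is the Monge map S on E
S = np.empty(n, dtype=int)
S[sx] = sy

# any E-optimal plan must send the fiber over xE[i] onto the fiber over
# yE[S[i]]; the pairing is monotone, hence optimal for the quadratic cost on E
assert np.all(np.diff(yE[S[sx]]) >= 0)
```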
Two such couplings are presented: the \ufb01rst, MI (De\ufb01nition 2) corresponds to independent\ncouplings between the conditionals, while the second (MK, De\ufb01nition 3) corresponds to optimal\ncouplings between the conditionals.\nDe\ufb01nition 2 (Monge-Independent Plans). \u21e1MI\n\ndef= (\u00b5xE \u2326 \u232bS(xE )) \u2326 (IdE, S)]\u00b5E.\n\n3\n\n\fMonge-Independent transport only re-\nquires that there exists a Monge map\nS between \u00b5E and \u232bE (and not on the\nwhole space), but extends S as a trans-\nportation plan and not a map. Since\nit couples disintegrations with the in-\ndependent law, it is particularly suited\nto settings where all the information\nis contained in E, as shown in section\n6.\nWhen there exists a Monge map be-\ntween disintegrations \u00b5xE to \u232bS(xE )\nfor all xE 2 E (e.g. when \u00b5 and \u232b\nare a.c.), it is possible to extend S as\na transportation map between \u00b5 and\n\u232b using those maps. Indeed, for all\nxE 2 E, let \u02c6T (xE;\u00b7) : E? ! E?\ndenote the Monge map from \u00b5xE to \u232bS(xE ). The Monge-Knothe transport corresponds to the E-\noptimal plan with optimal couplings between the disintegrations:\nDe\ufb01nition 3 (Monge-Knothe Transport). TMK(xE, xE?) def= (S(xE), \u02c6T (xE; xE?)) 2 E E?.\nThe proof that TMK de\ufb01nes a transport map from \u00b5 to \u232b is a direct adaptation of the proof for the\nKnothe-Rosenblatt transport ([27], Section 2.3). When it is not possible to de\ufb01ne a Monge map\nbetween the disintegrations, one can still consider the optimal couplings \u21e1OT(\u00b5xE ,\u232b S(xE )) and de\ufb01ne\n\u21e1MK = \u21e1OT(\u00b5xE ,\u232b S(xE ))\u2326 (IdE, S)]\u00b5E, which we still call Monge-Knothe plan by abuse. In either\ncase, \u21e1MK is the E-optimal plan with lowest global cost:\nProposition 1. The Monge-Knothe plan is optimal in \u21e7E(\u00b5, \u232b), namely\n\nFigure 1: A d = 2, k = 1 illustration. 
Any 2 \u21e7E(\u00b5, \u232b)\nbeing supported on G(S) \u21e5 (E?)2, all the mass from x\nis transported on the \ufb01ber {S(xE)}\u21e5 E?. Different \u2019s\nin \u21e7E(\u00b5, \u232b) correspond to different couplings between the\n\ufb01bers {xE}\u21e5 E? and {S(xE)}\u21e5 E?.\n\n\u21e1MK 2 arg min\n2\u21e7E (\u00b5,\u232b)\n\nE(X,Y )\u21e0[kX Y k2].\n\nProof. E-optimal plans only differ in the couplings they induce between \u00b5xE and \u232bS(xE ) for xE 2 E.\nSince \u21e1MK corresponds to the case when these couplings are optimal, disintegrating over E \u21e5 E in\nRRd\u21e5Rd kx yk2d(x, y) shows that = \u21e1MK has the lowest cost. \u2305\nRelation with the Knothe-Rosenblatt (KR) transport. These de\ufb01nitions are related to the KR\ntransport ([27], section 2.3), which consists in de\ufb01ning a transport map between two a.c. measures\nby recursively (i) computing the Monge map T1 between the \ufb01rst two one-dimensional marginals of\n\u00b5 and \u232b and (ii) repeating the process between the disintegrated measures \u00b5x1 and \u232bT1(x1). MI and\nMK marginalize on the k 1 dimensional subspace E, and respectively de\ufb01ne the transport between\ndisintegrations \u00b5xE and \u232bS(xE ) as the product measure and the optimal transport instead of recursing.\nMK as a limit of optimal transport with re-weighted quadratic costs. Similarly to KR [5], MK\ntransport maps can intuitively be obtained as the limit of optimal transport maps, when the costs on\nE? become negligible compared to the costs on E.\nProposition 2. Let Rd = EE?, (VE VE?) an orthonormal basis of EE? and \u00b5, \u232b 2P 2(Rd)\nbe two a.c. probability measures. De\ufb01ne\n\ndef= VEV>E + \"VE?V>E?,\nLet T\" be the optimal transport map for the cost d2\n\n8\"> 0, P\"\n\nP\"(x, y) def= (x y)>P\"(x y).\nd2\n\nP\". Then T\" ! TMK in L2(\u00b5).\n\nProof in the supplementary material.\nMI as a limit of the discrete case. 
When \u00b5 and \u232b are a.c., for n 2 N let \u00b5n,\u232b n denote the uniform\ndistribution over n i.i.d. samples from \u00b5 and \u232b respectively, and let \u21e1n be an optimal transportation\nplan between (pE)]\u00b5n and (pE)]\u232bn given by a Monge map (which is possible assuming uniform\nweights and non-overlapping projections). We have that \u00b5n *\u00b5 and \u232bn *\u232b. From [27], Th. 1.50, 1.51, we have that \u21e1n 2P 2(E \u21e5 E) converges weakly, up to subsequences, to a coupling\n\u21e1 2P 2(E \u21e5 E) that is optimal for \u00b5E and \u232bE. On the other hand, up to points having the same\nprojections, the discrete plans \u21e1n can also be seen as plans in P(Rd \u21e5 Rd). A natural question is\nthen whether the sequence \u21e1n 2P (Rd \u21e5 Rd) has a limit in P(Rd \u21e5 Rd).\nProposition 3. Let \u00b5, \u232b 2P 2(Rd) be a.c. and compactly supported, \u00b5n,\u232b n, n \u2265 0 be uniform\ndistributions over n i.i.d. samples, and \u21e1n 2 \u21e7E(\u00b5n,\u232b n), n \u2265 0. Then \u21e1n * \u21e1MI(\u00b5, \u232b).\nProof in the supplementary material. We conjecture that under additional assumptions, the compactness hypothesis can be relaxed. In particular, we empirically observe convergence for Gaussians.\n\n4 Explicit Formulas for Subspace Detours in the Bures Metric\n\nMultivariate Gaussian measures are a speci\ufb01c case of continuous distributions for which Wasserstein\ndistances and Monge maps are available in closed form. We \ufb01rst recall basic facts about optimal\ntransport between Gaussian measures, and then show that the E-optimal transports MI and MK\nintroduced in section 3 are also in closed form. For two Gaussians \u00b5, \u232b, one has W 2\n2 (\u00b5, \u232b) =\nkm\u00b5 \u2212 m\u232bk2 + B2(var \u00b5, var \u232b) where B is the Bures metric [3] between PSD matrices [14]:\nB2(A, B) def= TrA + TrB \u2212 2Tr(A1/2BA1/2)1/2. 
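Both the Bures metric and the Gaussian Monge map recalled in this section are easy to evaluate numerically; a self-contained numpy sketch (the helper names psd_sqrt, bures_sq, gaussian_monge and the sample matrices are ours, not from the paper):

```python
import numpy as np

def psd_sqrt(M):
    # symmetric PSD square root via eigendecomposition
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def bures_sq(A, B):
    # B^2(A, B) = Tr A + Tr B - 2 Tr (A^{1/2} B A^{1/2})^{1/2}
    rA = psd_sqrt(A)
    return np.trace(A) + np.trace(B) - 2 * np.trace(psd_sqrt(rA @ B @ rA))

def gaussian_monge(A, B):
    # matrix of the Monge map from N(0, A) to N(0, B), A nonsingular:
    # T = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}
    rA = psd_sqrt(A)
    rA_inv = np.linalg.inv(rA)
    return rA_inv @ psd_sqrt(rA @ B @ rA) @ rA_inv

rng = np.random.default_rng(2)
G1, G2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
A, B = G1 @ G1.T + np.eye(3), G2 @ G2.T + np.eye(3)
T = gaussian_monge(A, B)
assert np.allclose(T @ A @ T.T, B, atol=1e-6)  # T pushes N(0, A) onto N(0, B)
assert abs(bures_sq(A, A)) < 1e-8              # B(A, A) = 0
```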
The Monge map from a centered Gaussian\ndistribution \u00b5 with covariance matrix A to one \u232b with covariance matrix B is linear and is represented\nby the matrix TAB def= A1/2(A1/2BA1/2)1/2A1/2. For any linear transport map, T]\u00b5 has\ncovariance TAT>, and the transportation cost from \u00b5 to \u232b is EX\u21e0\u00b5[kX TXk2] = TrA +\nTrB Tr(TA + AT>). In the following, \u00b5 (resp. \u232b) will denote the centered Gaussian distribution\nEE? AE? \u2318 when A is represented in an\nwith covariance matrix A (resp. B). We write A =\u21e3 AE AEE?\northonormal basis (VE VE?) of E E?.\nMonge-Independent Transport for Gaussians. The MI transport between Gaussian measures is\ngiven by a degenerate Gaussian, i.e. a measure with Gaussian density over the image of its covariance\nmatrix \u2303 (we refer to the supplementary material for the proof).\nProposition 4 (Monge-Independent (MI) Transport for Gaussians). Let\n\nA>\n\nC def=VEAE + VE?A>EE? TAE BEVE> + (BE)1BEE?V>E? and \u2303 def= A C\nC> B.\n\nThen \u21e1MI(\u00b5, \u232b) = N (02d, \u2303) 2P (Rd \u21e5 Rd).\nKnothe-Rosenblatt and Monge-Knothe for Gaus-\nsians. Before giving the closed-form MK map for\nGaussian measures, we derive the KR map ([27],\nsection 2.3) with successive marginalization1 on\nx1, x2, ..., xd. When d = 2 and the basis is orthonor-\nmal for E E?, those two notions coincide.\nProposition 5 (Knothe-Rosenblatt (KR) Transport\nbetween Gaussians). Let LA (resp. LB) be the\nCholesky factor of A (resp. B). The KR transport\nfrom \u00b5 to \u232b is a linear map whose matrix is given by\nKR = LB(LA)1. Its cost is the squared Frobe-\nTAB\nnius distance between the Cholesky factors LA and\nLB:\n\nEX\u21e0\u00b5[kX T AB\n\nKR Xk2] = kLA LBk2.\n\nProof. The KR transport with successive marginaliza-\ntion on x1, x2, ..., xd between two a.c. distributions\nhas a lower triangular Jacobian with positive entries\non the diagonal. 
Further, since the one-dimensional disintegrations of Gaussians are Gaussians themselves, and since Monge maps between Gaussians are linear, the KR transport between two centered Gaussians is a linear map, hence its matrix representation equals its Jacobian and is lower triangular.\nLet T = LB(LA)\u22121. We have TAT> = LB(LA)\u22121LAL>A(L>A)\u22121L>B = LBL>B = B, i.e. T]\u00b5 = \u232b.\nFurther, since TLA is the Cholesky factor for B, and since A is supposed non-singular, by unicity of the Cholesky decomposition T is the only lower triangular matrix satisfying T]\u00b5 = \u232b. Hence, it is the KR transport map from \u00b5 to \u232b.\nFinally, we have that EX\u21e0\u00b5[kX \u2212 TKRXk2] = Tr(A + B \u2212 (A(TKR)> + TKRA)) = Tr(LAL>A + LBL>B \u2212 (LAL>B + LBL>A)) = kLA \u2212 LBk2 \u2305\n\nFigure 2: MI transport from a 2D Gaussian (red) to a 1D Gaussian (blue), projected on the x-axis. The two 1D distributions represent the projections of both Gaussians on the x-axis, the blue one being already originally supported on the x-axis. The oblique hyperplane is the support of \u21e1MI, onto which its density is represented.\n\n1Note that compared to [27], this is the reversed marginalization order, which is why the KR map here has lower triangular Jacobian.\n\nCorollary 1. The (square root) cost of the Knothe-Rosenblatt transport (EX\u21e0\u00b5[kX \u2212 TKRXk2])1/2 between centered Gaussians de\ufb01nes a distance (i.e. it satis\ufb01es all three metric axioms).\nProof. This comes from the fact that EX\u21e0\u00b5[kX \u2212 TKRXk2]1/2 = kLA \u2212 LBk. \u2305\n\nAs can be expected from the fact that MK can be seen as a generalization of KR, the MK transportation map is linear and has a block-triangular structure. The next proposition shows that the MK transport map can be expressed as a function of the Schur complements A/AE def= AE? \u2212 A>EE?(AE)\u22121AEE? of A w.r.t. AE, and B/BE (de\ufb01ned analogously) w.r.t. BE, which are the covariance matrices of \u00b5 (resp. \u232b) conditioned on E.\n\nFigure 3: (a) Usual Monge interpolation of Gaussians (Wasserstein-Bures geodesic) and (b) Monge-Knothe interpolation through E = {(x, y) : x = y} from \u00b50 to \u00b51, at times t = 0, 0.25, 0.5, 0.75, 1.\n\nProposition 6 (Monge-Knothe (MK) Transport for Gaussians). Let A, B be represented in an orthonormal basis for E \u2295 E?. The MK transport map on E between \u00b5 and \u232b is given by the block matrix\nTMK = ( TAE BE , 0k\u21e5(d\u2212k) ; [B>EE?(TAE BE )\u22121 \u2212 T(A/AE )(B/BE )A>EE?](AE)\u22121 , T(A/AE )(B/BE ) ),\nwhere the \ufb01rst row of blocks acts on E and the second on E?.\nProof. As can be seen from the structure of the MK transport map in De\ufb01nition 3, TMK has a lower block-triangular Jacobian (with block sizes k and d \u2212 k), with PSD matrices on the diagonal (corresponding to the Jacobians of the Monge maps (i) between marginals and (ii) between conditionals). Further, since \u00b5 and \u232b are Gaussian, their disintegrations are Gaussian as well. Hence, all Monge maps from the disintegrations of \u00b5 to that of \u232b are linear, and therefore the matrix representing T is equal to its Jacobian. One can check that the map T in the proposition veri\ufb01es TAT> = B and is of the right form. One can also check that it is the unique such matrix, hence it is the MK transport map. \u2305\n\n5 Selecting the Supporting Subspace\n\nBoth MI and MK transports are highly dependent on the chosen subspace E. Depending on applications, E can either be prescribed (e.g. if one has access to a transport map between the marginals in a given subspace) or has to be selected. In the latter case, we give guidelines on how prior knowledge can be used, and alternatively propose an algorithm for minimizing the MK distance.\n\nAlgorithm 1 MK Subspace Selection\nInput: A, B \u2208 PSD, k \u2208 [[1, d]], step size \u2318\nOutput: E = Span{v1, .., vk}\nV \u2190 Polar(AB)\nwhile not converged do\nL \u2190 MK(V>AV, V>BV; k)\nV \u2190 V \u2212 \u2318\u2207V L\nV \u2190 Polar(V)\nend while\n\nSubspace Selection Using Prior Knowledge. When prior knowledge is available, one can choose a mediating subspace E to enforce speci\ufb01c criteria when comparing two distributions. Indeed, if the directions in E are known to correspond to given properties of the data, then MK or MI transport privileges those properties when matching distributions over those not encoded by E. In particular, if one has access to features X from a reference dataset, one can use principal component analysis (PCA) and select the \ufb01rst k principal directions to compare datasets X1 and X2. MK and MI then allow comparing X1 and X2 using the most signi\ufb01cant features from the reference X with higher priority. In section 6, we experiment with this method on word embeddings.\nMinimal Monge-Knothe Subspace. Alternatively, in the absence of prior knowledge, it is natural to aim at \ufb01nding the subspace which minimizes MK. Unfortunately, optimization on the Grassmann manifold is quite hard in general, which makes direct optimization of MK w.r.t. E impractical. Optimizing with respect to an orthonormal matrix V of basis vectors of Rd is a more practical parameterization, which allows performing projected gradient descent (Algorithm 1). The projection step consists in computing a polar decomposition, as the projection of a matrix V onto the set of unitary matrices is the unitary matrix in the polar decomposition of V. The proposed initialization is V = Polar(AB), as this is the optimal solution when A, B are co-diagonalizable. Note that since the function being minimized is non-convex, Algorithm 1 is only guaranteed to converge to a local minimum. 
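A compact sketch of this projected-gradient loop, under two simplifications that are ours rather than the paper's: the MK objective is replaced by a Bures cost between the covariances projected on the first k columns of V, and the gradient is taken by finite differences (the names select_subspace and subspace_loss are hypothetical):

```python
import numpy as np

def psd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def bures_sq(A, B):
    rA = psd_sqrt(A)
    return np.trace(A) + np.trace(B) - 2 * np.trace(psd_sqrt(rA @ B @ rA))

def polar(M):
    # orthogonal factor of the polar decomposition:
    # the projection of M onto the set of unitary matrices
    U, _, Wt = np.linalg.svd(M)
    return U @ Wt

def subspace_loss(V, A, B, k):
    # stand-in objective: Bures cost between the covariances projected on
    # E = span of the first k columns of V (the MK cost could be plugged in)
    Vk = V[:, :k]
    return bures_sq(Vk.T @ A @ Vk, Vk.T @ B @ Vk)

def select_subspace(A, B, k, eta=0.05, iters=50, eps=1e-5):
    d = A.shape[0]
    V = polar(A @ B)  # initialization, as in Algorithm 1
    for _ in range(iters):
        # finite-difference gradient w.r.t. the entries of V (small d only)
        G = np.zeros_like(V)
        f0 = subspace_loss(V, A, B, k)
        for i in range(d):
            for j in range(d):
                Vp = V.copy()
                Vp[i, j] += eps
                G[i, j] = (subspace_loss(Vp, A, B, k) - f0) / eps
        V = polar(V - eta * G)  # gradient step followed by polar projection
    return V

rng = np.random.default_rng(5)
G1, G2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A, B = G1 @ G1.T + np.eye(4), G2 @ G2.T + np.eye(4)
V = select_subspace(A, B, k=2)
assert np.allclose(V.T @ V, np.eye(4), atol=1e-6)  # V stays orthonormal
```

As in the paper, only a local minimum is guaranteed, and the result depends on the initialization.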
In section 6, experimental evaluation of Algorithm 1 is carried out on noise-contaminated\nsynthetic data (Figure 6) and on a domain adaptation task with Gaussian mixture models on the Of\ufb01ce\nHome dataset [31] with inception features (Figure 7).\n\n6 Experiments\n\nColor Transfer. Given a source and a target\nimage, the goal of color transfer is to map the\ncolor palette of the source image (represented\nby its RGB histogram) into that of the target\nimage. A natural toolbox for such a task is\nFigure 4: OT color transfer between gray projections.\noptimal transport, see e.g. [4, 12, 24]. First, a\nk-means quantization of both images is computed. Then, the colors of the pixels within each source\ncluster are modi\ufb01ed according to the optimal transport map between both color distributions. In\nFigure 5, we illustrate discrete MK transport maps for color transfer. In this setting, we project\nimages on the 1D space of grayscale images, relying on the 1D OT sorting-based algorithm (Figure\n4). Then, we solve small 2D OT problems on the corresponding disintegrations. We compare this\napproach with classic full OT maps and a sliced OT approach (with 100 random projections). As\ncan be seen in Figure 5, MK results are visually very similar to that of full OT, with a x50 speedup\nallowed by the fast 1D OT sorting-based algorithm that is comparable to sliced OT.\n\n(a) Gray Source\n\n(b) Gray OT\n\n(c) Gray Target\n\n(a) Source\n\n(b) Full OT (2.67s) (c) Gray MK (0.052s)\n\n(d) Sliced (0.057s)\n\n(e) Target\n\nFigure 5: Color transfer, after quantization using 3000 k-means clusters, with corresponding runtimes.\n\nSynthetic Data. We test the behavior of MK and MI in a noisy environment, where the signal is\nsupported in a subspace of small dimension. We represent the signal using two normalized PSD\nmatrices A, B 2 Rd1\u21e5d1 and sample noise \u23031, \u23032 2 Rd2\u21e5d2, d2 d1 from a Wishart distribution\nwith parameter I. 
We then build the noisy covariance A\" = ( A 0\n0 0 ) + \"\u23031 2 Rd2\u21e5d2 (and likewise\nB\") for different noise levels \" and compute MI and MK distances along the \ufb01rst k directions,\nk = 1, ..., d2. As can be seen in Figure 6, both MI and MK curves exhibit a local minimum or\nan \u201celbow\u201d when k = d1, i.e. when E corresponds to the subspace where the signal is located.\nHowever, important differences in the behaviors of MI and MK can be noticed. Indeed, MI has a\nsteep decreasing curve from 1 to d1 and then a slower decreasing curve. This is explained by the fact\nthat MI transport computes the OT map along the k directions of E only, and treats the conditionals\nas being independent. Therefore, if k \u2265 d1, all the signal has been \ufb01tted and for increasing values of\nk MI starts \ufb01tting the noise as well. On the other hand, MK transport computes the optimal transport\non both E and the corresponding (d2 \u2212 k)-dimensional conditionals. Therefore, if k \u2260 d1, either or\nboth maps \ufb01t a mixture of signal and noise. Local maxima correspond to cases where the signal is the\nmost contaminated by noise, and minima k = d1, k = d2 to cases where either the marginals or the\nconditionals are unaffected by noise. Using Algorithm 1 instead of the principal directions makes it\npossible to \ufb01nd better subspaces than the \ufb01rst k directions when k \u2264 d1, and then behaves similarly (up to\nthe gradient being stuck in local minima and thus being occasionally less competitive). 
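The noisy-covariance construction used in this experiment can be reproduced in a few lines; the sizes d1 = 4, d2 = 8 and the noise level eps = 0.1 below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, eps = 4, 8, 0.1

G = rng.normal(size=(d1, d1))
A = G @ G.T
A /= np.trace(A)  # normalized PSD signal block

W = rng.normal(size=(d2, d2))
Sigma1 = W @ W.T  # Wishart(I, d2) noise sample

A_eps = np.zeros((d2, d2))
A_eps[:d1, :d1] = A        # embed the signal in the top-left block
A_eps += eps * Sigma1      # contaminate every direction with noise

# sum of PSD matrices, hence still PSD
assert np.all(np.linalg.eigvalsh(A_eps) >= -1e-10)
```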
Overall, the\n\n7\n\n\fdifferences in behavior of MI and MK show that MI is more adapted to noisy environments, and MK\nto applications where all directions are meaningful, but where one wishes to prioritize \ufb01tting on a\nsubset of those directions, as shown in the next experiment.\n\n(a) Monge-Independent\n\n(b) Monge-Knothe\n\n(c) Bures\n\nFigure 6: (a)-(b): Difference between (a) MI and Bures and (b) MK and Bures metrics for different\nnoise levels \" and subspace dimensions k. (c): Corresponding Bures values. For each \u270f, 100 different\nnoise matrices are sampled. Points show mean values, and shaded areas the 25%-75% and 10%-90%\npercentiles. Top row: d1 = 4, d2 = 8. Bottom row: d1 = 4, d2 = 16.\nSemantic Mediation. We experiment using reference features for comparing distributions with\nelliptical word embeddings [21], which represent each word from a given corpus using a mean vector\nand a covariance matrix. For a given embedding, we expect the principal directions of its covariance\nmatrix to be linked to its semantic content. Therefore, the comparison of two words w1, w2 based on\nthe principal eigenvectors of a context word c should be impacted by the semantic relations of w1\nand w2 with respect to c, e.g. if w1 is polysemous and c is related to a speci\ufb01c meaning. To test this\nintuition, we compute the nearest neighbors of a given word w according to the MK distance with E\ntaken as the subspace spanned by the principal directions of two different contexts c1 and c2. We\nexclude means and compute MK based on covariances only, and look at the symmetric difference of\nthe returned sets of words (i.e. words in KNN(w|c1) but not in KNN(w|c2), and inversely). Table 1\nshows that speci\ufb01c contexts affect the nearest neighbors of ambiguous words.\n\nTable 1: Symmetric differences of the 20-NN sets of w given c1 minus w given c2 using MK.\nEmbeddings are 12 \u21e5 12 pretrained normalized covariance matrices from [21]. 
$E$ is spanned by the 4 principal directions of the contexts. Words are printed in increasing distance order.

Word       | Context 1 | Context 2 | Difference
instrument | monitor   | oboe      | cathode, monitor, sampler, rca, watts, instrumentation, telescope, synthesizer, ambient
instrument | oboe      | monitor   | tuned, trombone, guitar, harmonic, octave, baritone, clarinet, saxophone, virtuoso
windows    | pc        | door      | netscape, installer, doubleclick, burner, installs, adapter, router, cpus
windows    | door      | pc        | screwed, recessed, rails, ceilings, tiling, upvc, profiled, roofs
fox        | media     | hedgehog  | Penny, quiz, Whitman, outraged, Tinker, ads, Keating, Palin, show
fox        | hedgehog  | media     | panther, reintroduced, kangaroo, Harriet, fair, hedgehog, bush, paw, bunny

MK Domain Adaptation with Gaussian Mixture Models. Given a source dataset of labeled data, domain adaptation (DA) aims at finding labels for a target dataset by transferring knowledge from the source. Such a problem has been successfully tackled using OT-based techniques [8]. We illustrate using MK Gaussian maps on a domain adaptation task where both source and target distributions are modeled by a Gaussian mixture model (GMM). We use the Office Home dataset [31], which comprises 15000 images from 65 different classes across 4 domains: Art, Clipart, Product and Real World. For each image, we consider 2048-dimensional features taken from the coding layer of an inception model, as with Fréchet inception distances [18]. For each source/target pair, we represent the source as a GMM by fitting one Gaussian per source class and defining mixture weights proportional to class frequencies, and we fit a GMM with the same number of components on the target. Since label information is not available for the target dataset, data from different classes may be assigned to the same component.
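Both the covariance-only comparisons above and the component-wise Gaussian costs in this experiment rest on the closed-form 2-Wasserstein distance between centered Gaussians, i.e. the Bures distance between their covariance matrices. A minimal numpy/scipy sketch (the function name is ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def squared_bures(A, B):
    """Squared Bures distance between PSD matrices A and B, i.e. the squared
    2-Wasserstein distance between N(0, A) and N(0, B):
    tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2})."""
    sA = sqrtm(A)
    cross = sqrtm(sA @ B @ sA)
    # sqrtm may return a complex array with negligible imaginary part
    return np.trace(A) + np.trace(B) - 2.0 * np.real(np.trace(cross))
```

For instance, `squared_bures(np.eye(2), 4 * np.eye(2))` evaluates the cost of rescaling an isotropic Gaussian; restricting both matrices to their blocks on $E$ gives the projected costs used below.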
We then compute pairwise MK distances between all source and target components, and solve for the discrete OT plan $P$ using those distances as costs and mixture weights as marginals (as in [6] with Bures distances). Finally, we map the source distribution on the target by computing the $P$-barycentric projection of the component-wise MK maps, $\sum_{ij} P_{ij} T^{MK}_{ij}$, and assign target labels using 1-NN prediction over the mapped source data. The same procedure is applied using Bures distances between the projections on $E$. We use Algorithm 1 between the empirical covariance matrices of the source and target datasets to select the supporting subspace $E$, for different values of the supporting dimension $k$ (Figure 7).

Figure 7: Domain Adaptation: 1-NN accuracy scores on the Office Home dataset vs. dimension $k$. We compare the $k$-dimensional projected Bures maps with the $E$-MK maps and the 2048-D Bures baseline. $E$ is selected using Algorithm 1 between the source and target covariance matrices for $k = 32, 64, 128, 256, 512, 1024$. Rows: sources, Columns: targets.

Several facts can be observed from Figure 7. First, using the full 2048-dimensional Bures maps is regularly sub-optimal compared to Bures (resp. MK) maps on a lower-dimensional subspace, even though this is dependent on the source/target combination. This shows the interest of not using all available features equally in transport problems. Secondly, when $E$ is chosen using Algorithm 1, in most cases MK maps yield equivalent or better classification accuracy than the corresponding Bures maps on the projections, even though they have the same projections on $E$. However, as can be expected, this does not hold for an arbitrary choice of $E$ (not shown in the figure).
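The plan-then-map steps above can be sketched as follows, with an entropy-regularized Sinkhorn solver [9] standing in for the exact discrete OT solver, and component-wise maps passed as callables; all names are ours:

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.05, n_iter=1000):
    """Approximate discrete OT plan between mixture weights a and b under
    cost matrix C, via Sinkhorn iterations (a stand-in for an exact solver)."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # scale columns to match target weights b
        u = a / (K @ v)    # scale rows to match source weights a
    return u[:, None] * K * v[None, :]

def barycentric_map(x, i, P, maps):
    """Map a point x drawn from source component i through the P-barycentric
    combination of the component-wise maps: sum_j (P_ij / sum_j' P_ij') T_ij(x)."""
    w = P[i] / P[i].sum()
    return sum(w_j * T(x) for w_j, T in zip(w, maps[i]))
```

With the plan `P` in hand, each source sample is pushed through `barycentric_map` before the 1-NN label assignment; in the experiment the callables `T` would be the closed-form MK Gaussian maps.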
Due to the relative simplicity of this DA method (which models the domains as GMMs), we do not aim at comparing with state-of-the-art OT DA methods [8, 7] (which compute transportation plans between the discrete distributions directly). The goal is rather to illustrate how MK maps can be used to compute maps that put higher priority on the most meaningful feature dimensions. Note also that the mapping between source and target distributions used here is piecewise linear, and is therefore more regular.

Conclusion and Future Work. We have proposed in this paper a new class of transport plans and maps that are built using optimality constraints on a subspace, but defined over the whole space. We have presented two particular instances, MI and MK, with different properties, and derived closed formulations for Gaussian distributions. Future work includes exploring other applications of OT to machine learning relying on low-dimensional projections, from which subspace-optimal transport could be used to recover full-dimensional plans or maps.

References

[1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser, 2005.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214-223, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.

[3] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.

[4] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures.
Journal of Mathematical Imaging and Vision, 51(1):22-45, 2015.

[5] Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From Knothe's transport to Brenier's map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 2009.

[6] Y. Chen, T. T. Georgiou, and A. Tannenbaum. Optimal transport for Gaussian mixture models. IEEE Access, 7:6269-6278, 2019.

[7] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3730-3739. Curran Associates, Inc., 2017.

[8] Nicolas Courty, Rémi Flamary, Alain Rakotomamonjy, and Devis Tuia. Optimal transport for domain adaptation. In NIPS, Workshop on Optimal Transport and Machine Learning, Montréal, Canada, December 2014.

[9] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013.

[10] Ishan Deshpande, Ziyu Zhang, and Alexander G. Schwing. Generative modeling using the sliced Wasserstein distance. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3483-3491, 2018.

[11] R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40-50, 1969.

[12] Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853-1882, 2014.

[13] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707-738, 2015.

[14] Matthias Gelbrich.
On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185-203, 1990.

[15] Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 1574-1583. PMLR, 16-18 Apr 2019.

[16] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1608-1617, Playa Blanca, Lanzarote, Canary Islands, 09-11 Apr 2018. PMLR.

[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672-2680. Curran Associates, Inc., 2014.

[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626-6637. Curran Associates, Inc., 2017.

[19] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.

[20] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais.
Histoire de l'Académie Royale des Sciences, pages 666-704, 1781.

[21] Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Advances in Neural Information Processing Systems 31, pages 10237-10248. Curran Associates, Inc., 2018.

[22] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5072-5081, Long Beach, California, USA, 09-15 Jun 2019. PMLR.

[23] François-Pierre Paty, Alexandre d'Aspremont, and Marco Cuturi. Regularity as regularization: Smooth and strongly convex Brenier potentials in optimal transport. arXiv preprint arXiv:1905.10812, 2019.

[24] Julien Rabin, Sira Ferradans, and Nicolas Papadakis. Adaptive color transfer with relaxed optimal transport. In 2014 IEEE International Conference on Image Processing (ICIP), pages 4852-4856. IEEE, 2014.

[25] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435-446. Springer, 2011.

[26] Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamas Sarlos, and Adrian Weller. Orthogonal estimation of Wasserstein distances. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 186-195. PMLR, 16-18 Apr 2019.

[27] Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.

[28] Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, et al.
Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928-943, 2019.

[29] Vivien Seguy, Bharath Bhushan Damodaran, Rémi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large-scale optimal transport and mapping estimation. In Proceedings of the International Conference on Learning Representations, 2018.

[30] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, ICLR, 2018.

[31] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[32] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[33] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

[34] Jiqing Wu, Zhiwu Huang, Dinesh Acharya, Wen Li, Janine Thoma, Danda Pani Paudel, and Luc Van Gool. Sliced Wasserstein generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.