{"title": "Alleviating Label Switching with Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 13634, "page_last": 13644, "abstract": "Label switching is a phenomenon arising in mixture model posterior inference that prevents one from meaningfully assessing posterior statistics using standard Monte Carlo procedures. This issue arises due to invariance of the posterior under actions of a group; for example, permuting the ordering of mixture components has no effect on the likelihood. We propose a resolution to label switching that leverages machinery from optimal transport. Our algorithm efficiently computes posterior statistics in the quotient space of the symmetry group. We give conditions under which there is a meaningful solution to label switching and demonstrate advantages over alternative approaches on simulated and real data.", "full_text": "Alleviating Label Switching with Optimal Transport\n\nPierre Monteiller\n\nENS Ulm\n\npierre.monteiller@ens.fr\n\nSebastian Claici\n\nMIT CSAIL & MIT-IBM Watson AI Lab\n\nsclaici@mit.edu\n\nEdward Chien\n\nFarzaneh Mirzazadeh\n\nMIT CSAIL & MIT-IBM Watson AI Lab\n\nIBM Research & MIT-IBM Watson AI Lab\n\nedchien@mit.edu\n\nfarzaneh@ibm.com\n\nJustin Solomon\n\nMIT CSAIL & MIT-IBM Watson AI Lab\n\njsolomon@mit.edu\n\nMikhail Yurochkin\n\nIBM Research & MIT-IBM Watson AI Lab\n\nmikhail.yurochkin@ibm.com\n\nAbstract\n\nLabel switching is a phenomenon arising in mixture model posterior inference that\nprevents one from meaningfully assessing posterior statistics using standard Monte\nCarlo procedures. This issue arises due to invariance of the posterior under actions\nof a group; for example, permuting the ordering of mixture components has no\neffect on the likelihood. We propose a resolution to label switching that leverages\nmachinery from optimal transport. Our algorithm ef\ufb01ciently computes posterior\nstatistics in the quotient space of the symmetry group. 
We give conditions under which there is a meaningful solution to label switching and demonstrate advantages over alternative approaches on simulated and real data.\n\n1 Introduction\n\nMixture models are powerful tools for understanding multimodal data. In the Bayesian setting, to fit a mixture model to such data, we typically assume a prior number of components and optimize or sample from the posterior distribution over the component parameters. If prior components are exchangeable, this leads to an identifiability issue known as label switching. In particular, permuting the ordering of mixture components does not change the likelihood, since it produces the same model. The underlying problem is that a group acts on the parameters of the mixture model; posterior probabilities are invariant under the action of the group.\nTo formalize this intuition, suppose our input is a data set X and a parameter K denoting the number of mixture components. In the most common application, we want to fit a mixture of K Gaussians to the data; our parameter set is \u0398 = {\u03b81, . . . , \u03b8K}, where \u03b8k = {\u00b5k, \u03a3k, \u03c0k} gives the parameters of each component. The likelihood of x \u2208 X conditioned on \u0398 is p(x|\u0398) = \u2211_{k=1}^{K} \u03c0k f(x; \u00b5k, \u03a3k), where f(x; \u00b5k, \u03a3k) is the density function of N(\u00b5k, \u03a3k). Any permutation of the labels k = 1, . . . , K yields the same likelihood. The prior is also permutation invariant. When we compute statistics of the posterior p(\u0398|x), however, this permutation invariance leads to K! symmetric regions in the posterior landscape. Sampling and inference algorithms behave poorly as the number of modes increases, and this problem is only exacerbated in this context, since increasing the number of components in the mixture model leads to a super-exponential increase in the number of modes of the posterior. Previous methods, such as the invariant losses of Celeux et al. (2000) and the pivot alignments of Marin et al. (2005), do not identify modes in a principled manner.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fTo combat this issue, we leverage the theory of optimal transport. In particular, one way to avoid the multimodal nature of the posterior distribution is to replace each sample with its orbit under the action of the symmetry group, seen as a distribution over K! points. While this symmetrized distribution is invariant to group actions, we cannot average several such distributions using standard Euclidean metrics. We use the notion of a Wasserstein barycenter to calculate a mean in this space, which we can project to a mean in the parameter space via the quotient map. We show conditions under which our optimization can be performed efficiently on the quotient space, thus circumventing the need to store and manipulate orbit distributions with large support.\n\nContributions. We give a practical and simple algorithm to solve the label switching problem. To justify our algorithm, we demonstrate that a group-invariant Wasserstein barycenter exists when the distributions being averaged are group-invariant. We give conditions under which the Wasserstein barycenter can be written as the orbit of a single point, and we explain how failure modes of our algorithm correspond to ill-posed problems. We show that the problem can be cast as computing the expected value of the quotient distribution, and we give an SGD algorithm to solve it.\n\n2 Related work\n\nMixture models. Gaussian mixture models are powerful for modeling a wide range of phenomena (McLachlan et al., 2019). These models assume that a sample is drawn from one of the latent states (or components), but that the particular component assigned to any given sample is unknown. 
In\na Bayesian setup, Markov Chain Monte Carlo can sample from the posterior distribution over the\nparameters of the mixture model. Hamiltonian Monte Carlo (HMC) has proven particularly successful\nfor this task. Introduced for lattice quantum chromodynamics (Duane et al., 1987), HMC has become\na popular option for statistical applications (Neal et al., 2011). Recent high-performance software\noffers practitioners easy access to HMC and other sampling algorithms (Carpenter et al., 2017).\nLabel switching. Label switching arises when we take a Bayesian approach to parameter estimation\nin mixture models (Diebolt & Robert, 1994). Jasra et al. (2005) and Papastamoulis (2015) overview\nthe problem. Label switching can happen even when samplers do not explore all K! possible modes,\ne.g., for Gibbs sampling. Documentation for modern sampling tools mentions that it arises in\npractice.1 Label switching can also occur when using parallel HMC, since tools like Stan run\nmultiple chains at once. While a single chain may only explore one mode, several chains are likely to\nyield different label permutations.\nJasra et al. (2005, \u00a76) mention a few loss functions invariant to the different labelings. Most relevant\nis the loss proposed by Celeux et al. (2000, \u00a75). Beyond our novel theoretical connections to optimal\ntransport, in contrast to their method, our algorithm uses optimal rather than greedy matching to\nresolve elements of the symmetric group, applies to general groups and quotient manifolds, and uses\nstochastic gradient descent instead of simulated annealing. Somewhat ad-hoc but also related is the\npivotal reordering algorithm (Marin et al., 2005), which uses a sample drawn from the distribution\nas a pivot point to break the symmetry; as we will see in our experiments, a poorly-chosen pivot\nseriously degrades the performance.\nOptimal transport. 
Optimal transport (OT) has seen a surge of interest in learning, from applications in generative models (Arjovsky et al., 2017; Genevay et al., 2018), Bayesian inference (Srivastava et al., 2015), and natural language (Kusner et al., 2015; Alvarez-Melis & Jaakkola, 2018) to technical underpinnings for optimization methods (Chizat & Bach, 2018). See Solomon (2018); Peyr\u00e9 & Cuturi (2018) for discussion of computational OT and Santambrogio (2015); Villani (2009) for theory.\nThe Wasserstein distance from optimal transport (\u00a73.1) induces a metric on the space of probability distributions from the geometry of the underlying domain. This leads to a notion of a Wasserstein barycenter of several probability distributions (Agueh & Carlier, 2011). Scalable algorithms have been proposed for barycenter computation, including methods that exploit entropic regularization (Cuturi & Doucet, 2014), use parallel computing (Staib et al., 2017), apply stochastic optimization (Claici et al., 2018), and distribute the computation across several machines (Uribe et al., 2018).\n\n1https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html\n\n\f3 Optimal Transport under Group Actions\n\nBefore delving into technical details, we will illustrate our approach with a simple example. Assume we have some data to which we wish to fit a Gaussian mixture model with K components. We can now draw samples from the posterior distribution, and we would like to obtain a point estimate of the mean of the posterior. We draw two samples \u0398^1 = (\u03b8^1_1, . . . , \u03b8^1_K) and \u0398^2 = (\u03b8^2_1, . . . , \u03b8^2_K). We cannot average them due to the ambiguity of label switching; see Figure 1(a) and \u00a71.3 of the supplementary for a simple example. However, we can explicitly encode this multimodality as a uniform distribution over all K! states:\n\n(1/K!) \u2211_{\u03c3\u2208SK} \u03b4_{\u03c3\u00b7\u0398^1} and (1/K!) \u2211_{\u03c3\u2208SK} \u03b4_{\u03c3\u00b7\u0398^2},\n\nwhere SK is the symmetry group on K points that acts by permuting the elements of \u0398^1 and \u0398^2. These distributions are now invariant to permutations, so we can ask if there exists an average in this space. In this section, we prove that this is possible through the machinery of optimal transport.\nWe provide theoretical results relevant to optimal transport between measures supported on the quotient space under actions of some group G. This theory is fairly general and requires only basic assumptions about the underlying space X and the action of G. For each theoretical result, we will use italics to highlight key assumptions, since they vary somewhat from proposition to proposition.\n\n3.1 Preliminaries: Optimal transport\n\nLet (X, d) be a complete and separable metric space. We define the p-Wasserstein distance on the space P(X) of probability distributions over X as a minimization over matchings between \u00b5 and \u03bd:\n\nW_p^p(\u00b5, \u03bd) = inf_{\u03c0\u2208\u03a0(\u00b5,\u03bd)} \u222b_{X\u00d7X} d(x, y)^p d\u03c0(x, y).\n\nHere \u03a0(\u00b5, \u03bd) is the set of couplings between measures \u00b5 and \u03bd, defined as \u03a0(\u00b5, \u03bd) = {\u03c0 \u2208 P(X \u00d7 X) | \u03c0(x \u00d7 X) = \u00b5(x), \u03c0(X \u00d7 y) = \u03bd(y)}.\nWp induces a metric on the set Pp(X) of measures with finite p-th moments (Villani, 2009). We will focus on P2(X), endowed with the metric W2. This metric structure allows us to define meaningful statistics for sets of distributions. In particular, a Fr\u00e9chet mean (or Wasserstein barycenter) of a set of distributions \u03bd1, . . . , \u03bdn \u2208 P2(X) is defined as a minimizer\n\n\u00b5\u2217 = arg min_{\u00b5\u2208P2(X)} (1/n) \u2211_{i=1}^{n} W_2^2(\u00b5, \u03bdi).    (1)\n\nWe follow Kim & Pass (2017) and generalize this notion slightly, by placing a measure itself on the space P2(X). We will use P2(P2(X)) to denote the space of probability measures on P2(X) that have finite second moments and let \u2126 be a member of this set. Then the following functional will be finite, which generalizes (1) from finite sums to infinite sets of measures:\n\nB(\u00b5) = \u222b_{P2(X)} W_2^2(\u00b5, \u03bd) d\u2126(\u03bd) = E_{\u03bd\u223c\u2126}[W_2^2(\u00b5, \u03bd)].    (2)\n\nIn analog to (1), a natural task is to search for a minimizer of the map \u00b5 \u21a6 B(\u00b5). For existence of such a minimizer, we simply require that supp(\u2126) is tight.\nDefinition 1 (Tightness of measures). A collection C of measures on X is called tight if for any \u03b5 > 0 there exists a compact set K \u2282 X such that for all \u00b5 \u2208 C, we have \u00b5(K) > 1 \u2212 \u03b5.\nHere are three examples of tight collections: P2(X) if X is compact, the set of all Gaussian distributions with means supported on a compact space and of bounded variance, or any set of measures with a uniform bound on second moments (argued in supplementary). This assumption is fairly mild and covers many application scenarios.\nProkhorov\u2019s theorem (deferred to the supplementary) implies the existence of a barycenter:\nTheorem 1 (Existence of minimizers). B(\u00b5) has at least one minimizer in P2(X) if supp(\u2126) is tight.\n\n3.2 Optimal transport with group invariances\n\nLet G be a finite group that acts by isometries on X. 
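To make the symmetrized construction above concrete, the orbit distribution of a single posterior sample can be built explicitly and checked for permutation invariance. A minimal Python sketch (the parameter values and K = 3 are arbitrary illustrations, not taken from the paper's experiments):

```python
import itertools

def orbit_measure(theta):
    """Uniform measure over the S_K-orbit of a K-tuple of component parameters:
    each of the K! permuted copies of theta is an atom with mass 1/K!."""
    K = len(theta)
    atoms = [tuple(theta[j] for j in perm)
             for perm in itertools.permutations(range(K))]
    weight = 1.0 / len(atoms)
    return atoms, weight

# A sample with K = 3 one-dimensional component means (arbitrary values)
theta = (0.0, 1.0, 5.0)
atoms, weight = orbit_measure(theta)
assert len(atoms) == 6 and abs(weight - 1 / 6) < 1e-12

# Relabeling the components leaves the orbit measure unchanged (same atom set),
# which is the group invariance g#mu = mu used throughout this section.
relabeled = (5.0, 0.0, 1.0)
assert set(atoms) == set(orbit_measure(relabeled)[0])
```

The exponential growth of the atom count with K is exactly why the algorithm in \u00a74 avoids materializing these orbits.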
We define the set of measures invariant under group action P2(X)^G = {\u00b5 \u2208 P2(X) | g#\u00b5 = \u00b5, \u2200g \u2208 G}, where the pushforward of \u00b5 by g is defined as g#\u00b5(B) = \u00b5(g^{\u22121}(B)) for B a measurable set. We are interested in the relation between the space P2(X)^G and the space of measures on the quotient space P2(X/G). If all of the measures in the support of \u2126 in (2) are invariant under group action, we can show that there exists a barycenter with the same property:\nLemma 1. If \u2126 \u2208 P2(P2(X)^G) is supported on the set of group-invariant measures on X and supp(\u2126) is tight, then there exists a minimizer of B(\u00b5) in P2(X) that is invariant under group action.\nProof. Let \u00b5 \u2208 P2(X) denote the minimizer from Theorem 1. Define a new distribution \u00b5_G = (1/|G|) \u2211_{g\u2208G} g#\u00b5. We verify that \u00b5_G has the same cost as \u00b5:\n\nE_{\u03bd\u223c\u2126}[W_2^2((1/|G|) \u2211_{g\u2208G} g#\u00b5, \u03bd)] \u2264 E_{\u03bd\u223c\u2126}[(1/|G|) \u2211_{g\u2208G} W_2^2(g#\u00b5, \u03bd)] by convexity of \u00b5 \u21a6 W_2^2(\u00b5, \u03bd)\n= E_{\u03bd\u223c\u2126}[(1/|G|) \u2211_{g\u2208G} W_2^2(\u00b5, (g^{\u22121})#\u03bd)] since g acts by isometry\n= (1/|G|) \u2211_{g\u2208G} E_{\u03bd\u223c\u2126}[W_2^2(\u00b5, \u03bd)] = E_{\u03bd\u223c\u2126}[W_2^2(\u00b5, \u03bd)] by linearity of expectation and group invariance of \u03bd.\n\nBut \u00b5 is a minimizer, so the inequality in the first line must be an equality.\n\nRemark: If X is a compact Riemannian manifold and \u2126 gives positive weight to the set of absolutely continuous measures, then Theorem 3.1 of Kim & Pass (2017) provides uniqueness (and this may be extended to other non-compact cases with suitable decay conditions). 
However, in our setting, \u2126 is supported on samples, measures consisting of delta functions. In this case, a simple counterexample is presented in the supplementary (\u00a71.4), which arises in the case where X consists of two points in R^2 and S2 acts to swap the points (SK is the group of permutations of a finite set of K points). This is accompanied by a study of the case of K points in R^d (see supplementary \u00a71.3), relevant to mixture models where components are evenly weighted and identical with a single mean parameter. Via this study we see that counterexamples seem to require a high degree of symmetry, which is unlikely to happen in applied scenarios and does not arise empirically in our experiments.\nAn analogous proof technique can be used to show the following lemma needed later:\nLemma 2. If \u03bd1 and \u03bd2 are two measures invariant under group action, then there exists an optimal transport plan \u03c0 \u2208 \u03a0(\u03bd1, \u03bd2) that is invariant under the group action g \u00b7 \u03c0(x, y) = \u03c0(g \u00b7 x, g \u00b7 y).\nThe above suggests that we might instead search for barycenters in the quotient space. Consider:\nLemma 3 (Lott & Villani 2009, Lemma 5.36). Let p : X \u2192 X/G be the quotient map. The map p\u2217 : P2(X) \u2192 P2(X/G) restricts to an isometric isomorphism between the set P2(X)^G of G-invariant elements in P2(X) and P2(X/G).\nWe now introduce additional structure relevant to label switching. Assume that all measures \u03bd \u223c \u2126 are the orbits of individual delta distributions, as they are samples of parameter values, i.e., \u03bd = (1/|G|) \u2211_{g\u2208G} \u03b4_{g\u00b7x} for some x \u2208 X. In the simple example of a mixture of two Gaussians from 1D data with means at \u00b51, \u00b52 \u2208 R, \u03bd is of the following form: \u03bd = (1/2) \u03b4_{(\u00b51,\u00b52)} + (1/2) \u03b4_{(\u00b52,\u00b51)}.\nUnder this assumption and by Lemmas 1 and 3, minimization of B(\u00b5) is equivalent to finding the Wasserstein barycenter of delta distributions on X/G. Letting \u2126\u2217 := p\u2217#\u2126, we aim to find:\n\narg min_{\u00b5\u2208P2(X/G)} E_{\u03b4x\u223c\u2126\u2217}[W_2^2(\u00b5, \u03b4x)].    (3)\n\nFrom properties of Wasserstein barycenters (Carlier et al. 2015, Equation (2.9)), the support of \u00b5 lies in the set of solutions to\n\nmin_{z\u2208X/G} E_{\u03b4x\u223c\u2126\u2217}[d(x, z)^2],    (4)\n\nwhere d is the metric on the quotient space X/G (see e.g. Santambrogio 2015, \u00a75.5.5). As \u2126 has finite second moments, so does \u2126\u2217, giving us existence of the expectation. The existence of minimizers of z \u21a6 E_{\u03b4x\u223c\u2126\u2217}[d(x, z)^2] is established in \u00a72.1 of the supplementary, giving the following lemma:\nLemma 4. The map z \u21a6 E_{\u03b4x\u223c\u2126\u2217}[d(x, z)^2] has a minimizer.\nUniqueness of minimizers is not guaranteed (see \u00a71.4 of supplementary), but we can rewrite (3) as:\n\narg min_{\u00b5\u2208P2(X/G)} E_{\u03b4x\u223c\u2126\u2217}[W_2^2(\u00b5, \u03b4x)] = arg min_{\u00b5\u2208P2(X/G)} \u222b_{X/G} \u222b_{X/G} d(x, y)^2 d\u00b5(y) d\u2126\u2217(\u03b4x) = arg min_{\u00b5\u2208P2(X/G)} \u222b_{X/G} (\u222b_{X/G} d(x, y)^2 d\u2126\u2217(\u03b4x)) d\u00b5(y).\n\nBy Lemma 4, the map y \u21a6 \u222b_{X/G} d(x, y)^2 d\u2126\u2217(\u03b4x) has a (potentially non-unique) minimizer. Call this function b(y). 
We are left with\n\narg min_{\u00b5\u2208P2(X/G)} \u222b_{X/G} b(y) d\u00b5(y).\n\nAny minimizer y\u2217 of b leads to a minimizing distribution \u00b5 = \u03b4_{y\u2217}, and we can conclude:\nTheorem 2 (Single Orbit Barycenters). There is a barycenter solution of (2) that can be written as \u00b5 = (1/|G|) \u2211_{g\u2208G} \u03b4_{g\u00b7z\u2217}.\nReturning to our example of a Gaussian mixture model, we see that this theorem implies there is a barycenter (a mean in distribution space) that has the same form as the symmetrized sample distributions. Any point in the support of the barycenter is an estimate for the mean of the posterior distribution.\nAs an aside, we mention that our proofs do not require finite groups. In fact, we prove Theorem 2 for compact groups G endowed with a Haar measure in the supplement.\nTo summarize: label switching leads to issues when computing posterior statistics because we work in the full space X, when we ought to work in the quotient space X/G. Theorem 2 relates means in X/G to barycenters of measures on X, which gives us a principled method for computing statistics backed by a convex problem in the space of measures: take a quotient, find a mean in X/G, and then pull the result back to X. We will see below in concrete detail that we do not need to explicitly construct and average in X/G, but may leverage group invariance of the transport to perform these steps in X.\nThe crux of this theory is that the Wasserstein barycenter in the setting of Lemma 1 is a point estimate for the mean of the symmetrized posterior distribution. 
The results leading to Theorem 2 should then be understood as a reduction of the problem of finding an estimate of the mean to that of minimizing a distance function on the quotient space; this latter minimization problem can then be solved via Riemannian gradient descent.\n\n4 Algorithms\n\nLabel switching usually occurs due to symmetries of certain Bayesian models. Posteriors with label switching make it difficult to compute meaningful summary statistics, e.g., posterior expectations for the parameters of interest.\nAny attempt to compute posterior statistics in this regime must account for the orbits of samples under the symmetry group. Continuing in the case of expectations, based on the previous section we can extract a meaningful notion of averaging by taking the image of each posterior sample under the symmetry group and computing a barycenter with respect to the Wasserstein metric. This resolves the ambiguity regarding which points in orbits should match, without symmetry-breaking heuristics like pivoting (Marin et al., 2005).\n\nTable 1: Notation for our algorithm.\nM: Riemannian manifold\ngp: Inner product at p \u2208 M\nd(p, q): Geodesic distance between p, q \u2208 M\nMK: K-fold product manifold with product metric\nc(p, q): Transport cost, c(p, q) = (1/2) d(p, q)^2\nexpp, logp: Exponential, logarithm maps at p \u2208 M\nSK: Symmetric group on K symbols\nCK: Cyclic group on K symbols\nM/G: Quotient space of equivalence classes [p] = {g \u00b7 p | g \u2208 G}\n\n\fFigure 1: (a) Ambiguity. (b) Orbit empirical distribution. (c) Quotient update. (a) Suppose we wish to update our estimate of the average (blue) given a new sample (red) from \u2126; due to label switching, other points (light shade) have equal likelihood to our sample, causing ambiguity. 
(b) Theorem 2 suggests an unambiguous update by constructing |G|-point orbits as empirical distributions and doing gradient descent with respect to the Wasserstein metric. (c) This algorithm is equivalent to moving one point, with a careful choice of update functions. This schematic arises for a mean-only model with three means in R (\u00a71.3 of supplementary); G = S3, whose action is generated by reflection over the dashed lines.\n\nIn this section, we provide an algorithm for computing the W2 barycenters above, extracting a symmetry-invariant notion of expectation for distributions with label switching. As input, we are given a sampler from a distribution \u2126 over a space M subject to label switching, as well as its (finite) symmetry group G. Our goal is to output a barycenter of the form (1/|G|) \u2211_{g\u2208G} \u03b4_{g\u00b7x} for some x \u2208 M, using stochastic gradient descent on (2). Our approach can be interpreted two ways, echoing the derivation of Theorem 2:\n\u2022 The most direct interpretation, shown in Figure 1(b), is that we push forward \u2126 to a distribution over empirical distributions of the form (1/|G|) \u2211_{g\u2208G} \u03b4_{g\u00b7x}, where x \u223c \u2126, and then compute the barycenter as a |G|-point empirical distribution whose support points move according to stochastic gradient descent, similar to the method by Claici et al. (2018).\n\u2022 Since |G| can grow extremely quickly, we argue that this algorithm is equivalent to one that moves a single representative x, so long as the gradient with respect to x accounts for the objective function; this is illustrated in Figure 1(c).\n\nAlgorithm 1 Riemannian Barycenter of \u2126.\nInput: Distribution \u2126, exp and log maps on M\nOutput: Estimate of the barycenter of \u2126\n1: Initialize the barycenter p \u223c \u2126.\n2: for t = 1, . . . do\n3:   Draw q \u223c \u2126\n4:   \u2212Dp c(p, q) := logp(q)\n5:   p \u2190 expp(\u2212(1/t) Dp c(p, q))\n6: end for\n\nAlthough our final algorithm has cosmetic similarity to pivoting and other algorithms that compute a single representative point, the details of our approach show an equivalence to a well-posed transport problem. Moreover, our stochastic gradient algorithm invokes a sampler from \u2126 in every iteration, rather than precomputing a finite sample; i.e., our algorithm deals with samples as they come in, rather than collecting multiple samples and then trying to cluster or break the symmetry a posteriori.\nTable 1 gives a reference for the notation used in this section. Note the Riemannian gradient of c(p, q) has a particularly simple form: \u2212Dp c(p, q) = logp(q) (Kim & Pass, 2017).\nGradient descent on the quotient space. For simplicity of exposition, we introduce a few additional assumptions on our problem; our algorithm can generalize to other cases, but these assumptions are the most relevant to the experiments and applications in \u00a75. In particular, we assume we are trying to infer a mixture model with K components. The parameters of our model are tuples (p1, . . . , pK), where pi \u2208 M for all i and some Riemannian manifold M. We can think of the space of parameters as the product MK. Typically it is undesirable when two components match exactly in a mixture model, so we additionally excise any tuple (p1, . . . , pK) with any matching elements (together a set of measure zero). 
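In the Euclidean case M = R^d with G = SK, the single-representative update described above reduces to a label alignment (a linear assignment) followed by a step toward the aligned draw. A minimal sketch, assuming scipy is available; the toy two-component sampler below is an illustrative stand-in for a real posterior sampler:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(p, q):
    """Permute q's components to minimize sum_i ||p_i - q_sigma(i)||^2."""
    cost = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)  # cost[i, j] = d(p_i, q_j)^2
    _, sigma = linear_sum_assignment(cost)
    return q[sigma]

def quotient_barycenter(sampler, n_steps=2000, seed=0):
    """Stochastic gradient descent on the quotient: align, then step with rate 1/t.
    In R^d the exponential and logarithm maps reduce to ordinary +/-."""
    rng = np.random.default_rng(seed)
    p = sampler(rng)                 # initialize at a draw from Omega
    for t in range(1, n_steps + 1):
        q = align(p, sampler(rng))   # resolve label switching before stepping
        p = p + (q - p) / t
    return p

# Toy Omega: two well-separated 1D components with randomly switched labels
truth = np.array([[0.0], [10.0]])
def sampler(rng):
    noisy = truth + 0.1 * rng.standard_normal(truth.shape)
    return noisy[rng.permutation(2)]

est = np.sort(quotient_barycenter(sampler).ravel())
assert np.allclose(est, [0.0, 10.0], atol=0.05)
```

With the 1/t rate, the iterate is a running average of the aligned draws, which is exactly the quotient-space mean sought in (4).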
Since parameters in a mixture model can be represented through a point process, it is natural to work with the Kth ordered configuration space of M considered in physics and algebraic topology (R. Fadell & Husseini, 2001):\n\nConf_K(M) := M^K \u2216 {(p1, . . . , pK) | pi = pj for some i \u2260 j} \u2282 M^K.\n\n\fLet \u2126 \u2208 P(Conf_K(M)) be the Bayesian posterior distribution restricted to Conf_K(M) (assuming the posterior P(MK) is absolutely continuous with respect to the volume measure, this restriction does essentially nothing). If K = 1, we can compute the expected value of \u2126 using classical stochastic gradient descent (Algorithm 1). If K > 1, however, label switching may occur: there may be a group G acting on {1, 2, . . . , K} that reindices the elements of the product Conf_K(M) without affecting likelihood. This invalidates the expectation computed by Algorithm 1.\nIn this case, we need to work in the quotient Conf_K(M)/G. Two key examples for G will be the symmetric group SK of permutations and the cyclic group CK of cyclic permutations. When G = SK we simply recover the Kth unordered configuration space, typically denoted UConf_K(M).\nUConf_K(M) is a Riemannian manifold with structure inherited from the product metric on Conf_K(M) and has the property:\n\nd_{UConf_K(M)}([(p1, . . . , pK)], [(q1, . . . , qK)]) = min_{\u03c3\u2208SK} d_{MK}((p1, . . . , pK), (q\u03c3(1), . . . , q\u03c3(K))).    (5)\n\nThe analogous fact holds for Conf_K(M)/G for other finite G via standard arguments (see e.g. Kobayashi (1995)). Thus, we may step in the gradient direction on the quotient by solving a suitable optimal transport matching problem.\nSince G is finite, the map \u03c3 minimizing (5) is computable algorithmically. When G = CK, we simply enumerate all K cyclic permutations of (q1, . . . , qK) and choose the one closest to p. When G = SK, we can recover \u03c3 by solving a linear assignment problem with cost cij = d(pi, qj)^2.\nThese properties suggest an adjustment of Algorithm 1 to account for G. Given a barycenter estimate p = (p1, . . . , pK) and a draw q = (q1, . . . , qK) \u223c \u2126: (1) align p and q by minimizing the right-hand side of (5); (2) compute component-wise Riemannian gradients from pi to q\u03c3(i); and (3) step p toward q using the exponential map.\n\nAlgorithm 2 Barycenter of \u2126 on quotient space\nInput: Distribution \u2126, exp and log maps on M\nOutput: Barycenter [(p1, . . . , pK)]\n1: Initialize the barycenter (p1, . . . , pK) \u223c \u2126.\n2: for t = 1, . . . do\n3:   Draw (q1, . . . , qK) \u223c \u2126\n4:   Compute \u03c3 in (5)\n5:   for i = 1, . . . , K do\n6:     \u2212Dpi c(pi, q\u03c3(i)) := logpi(q\u03c3(i))\n7:     pi \u2190 exppi(\u2212(1/t) Dpi c(pi, q\u03c3(i)))\n8:   end for\n9: end for\n\nAlgorithm 2 summarizes our approach. It can be understood as stochastic gradient descent for z in (4), working in the space Conf_K(M) rather than the quotient Conf_K(M)/G. Theorem 2, however, gives an alternative interpretation. Construct a |G|-point empirical distribution \u00b5 = (1/|G|) \u2211_{\u03c3\u2208G} \u03b4_{\u03c3\u00b7p} from the iterate p. After drawing q \u223c \u2126, we do the same to obtain \u03bd \u2208 P2(Conf_K(M)). Then, our update can be understood as a stochastic Wasserstein gradient descent step of \u00b5 toward \u03bd for problem (2). While this equivalent formulation would require O(|G|) rather than O(1) memory, it imparts the theoretical perspective in \u00a73, in particular a connection to the (convex) problem of Wasserstein barycenter computation.\nIn the supplementary, we prove the following theorem:\nTheorem 3 (Ordering Recovery). 
If M = R, with the standard metric, then\n\nUConf_K(M) \u2245 {(u1, . . . , uK) \u2208 Conf_K(R) | u1 < . . . < uK} \u2282 R^K.\n\nAdditionally, the single-orbit barycenter of Theorem 2 is unique and our algorithm provably converges.\n\nThis setting occurs when one\u2019s mixture model consists of evenly weighted components with only a single mean parameter for each in R. The result relates our method to the classical approach of ordering these means for correspondence and shows that it is well-justified. The convergence of our algorithm leverages the convexity of UConf_K(M). The supplementary contains additional discussion (\u00a72.3) about such \u201cmean-only\u201d models in R^d for d > 1. They lack the niceness of the d = 1 case, due to positive curvature. This curvature is problematic for convergence arguments (as it leads to potential non-uniqueness of barycenters), but we empirically find that our algorithm converges to reasonable results.\nMixtures of Gaussians. One particularly useful example involves estimating the parameters of a Gaussian mixture over R^d. For simplicity, assume that all the mixture weights are equal. The manifold M is the set of all (\u00b5, \u03a3) pairs: M \u2245 R^d \u00d7 P^d, with P^d the set of positive definite symmetric matrices. This space can be endowed with the W2 metric:\n\nd((\u00b51, \u03a31), (\u00b52, \u03a32))^2 = W_2^2(N(\u00b51, \u03a31), N(\u00b52, \u03a32)) = \u2016\u00b51 \u2212 \u00b52\u2016_2^2 + B^2(\u03a31, \u03a32),    (6)\n\nwhere B^2 is the squared Bures metric B^2(\u03a31, \u03a32) = Tr[\u03a31 + \u03a32 \u2212 2(\u03a31^{1/2} \u03a32 \u03a31^{1/2})^{1/2}].\nAs the mean components inherit the structure of Euclidean space, we only need to compute Riemannian gradients and exponential maps for the Bures metric. Muzellec & Cuturi (2018) leverage the Cholesky decomposition to parameterize \u03a3i = Li Li^T. The gradient of the Bures metric then becomes:\n\n\u2207_{L1} B(\u03a31, \u03a32) = (I \u2212 T^{\u03a31\u03a32}) L1, with T^{\u03a31\u03a32} = \u03a31^{\u22121/2} (\u03a31^{1/2} \u03a32 \u03a31^{1/2})^{1/2} \u03a31^{\u22121/2}.\n\nThe 2-Wasserstein exponential map for SPD matrices is exp_\u03a3(\u03be) = (I + L_\u03a3(\u03be)) \u03a3 (I + L_\u03a3(\u03be)), where L_\u03a3(\u03be) is the solution of the Lyapunov equation L_\u03a3(\u03be)\u03a3 + \u03a3L_\u03a3(\u03be) = \u03be.\n\nAlgorithm 3 Barycenter for Gaussian Mixtures\nInput: Distribution \u2126\nOutput: Barycenter p = ((\u00b5\u2217_1, \u03a3\u2217_1), . . . , (\u00b5\u2217_K, \u03a3\u2217_K))\n1: Initialize the barycenter p \u223c \u2126.\n2: for t = 1, . . . do\n3:   Draw ((\u00b51, \u03a31), . . . , (\u00b5K, \u03a3K)) \u223c \u2126\n4:   Compute \u03c3 in (5)\n5:   for i = 1, . . . , K do\n6:     \u00b5\u2217_i = \u00b5\u2217_i \u2212 \u03b7(\u00b5\u2217_i \u2212 \u00b5\u03c3(i))\n7:     L\u2217_i = L\u2217_i \u2212 \u03b7(I \u2212 T^{\u03a3\u2217_i \u03a3\u03c3(i)}) L\u2217_i\n8:   end for\n9: end for\n\n\f5 Results\n\nFigure 2: True covariances in blue, covariances from SGD in green, and pivot in red.\n\nIn \u00a74, we gave a symmetry-invariant, simple, and efficient algorithm for computing a Wasserstein barycenter to summarize a distribution subject to label switching. To verify empirically that our algorithm can efficiently address label switching, we test on two natural examples: estimating the parameters of a Gaussian mixture model and a Bayesian instance of multi-reference alignment.\nEstimating components of a Gaussian mixture. Our first scenario is estimating the parameters of a Gaussian mixture with K > 1 components. We use Hamiltonian Monte Carlo (HMC) to sample from the posterior distribution of a Gaussian mixture model. 
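Each posterior sample in this experiment is a tuple of (mean, covariance) pairs compared with the metric (6). The Bures distance and the transport map T from \u00a74 can be sketched numerically as follows; this is an illustration for intuition, not the paper's implementation, and it assumes scipy's sqrtm for matrix square roots:

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_sq(S1, S2):
    """Squared Bures metric: Tr[S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}]."""
    r = sqrtm(S1)
    return np.trace(S1 + S2 - 2 * sqrtm(r @ S2 @ r)).real

def transport_map(S1, S2):
    """Optimal-transport map between centered Gaussians:
    T = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}, so that T S1 T = S2."""
    r = sqrtm(S1)
    r_inv = np.linalg.inv(r)
    return (r_inv @ sqrtm(r @ S2 @ r) @ r_inv).real

# Arbitrary SPD matrices for illustration
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 3.0]])
T = transport_map(S1, S2)
assert np.allclose(T @ S1 @ T, S2, atol=1e-8)  # T pushes N(0, S1) onto N(0, S2)
assert abs(bures_sq(S1, S1)) < 1e-8            # the metric vanishes on the diagonal
```

The matrix I \u2212 T appearing in the covariance update of Algorithm 3 is exactly the (negative) direction from S1 toward S2 under this map.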
Naïve averaging does not yield a meaningful barycenter estimate, since the samples are not guaranteed to have the same label ordering. To resolve this ambiguity, we apply our method and two baselines: the pivotal reordering method (Marin et al., 2005) and Stephens' method (Stephens, 2000). Both baselines relabel samples. Stephens' method minimizes the Kullback–Leibler divergence between the average classification distribution and the classification distribution of each MCMC sample. The pivot method aligns every sample to a prespecified sample (i.e., the pivot) by solving a series of linear sum assignment problems. It requires pre-selecting a single sample for alignment: a poor choice of the pivot sample leads to bad estimation quality, while making a "good" pivot choice may be highly non-trivial in practice. The default pivot choice is the MAP. Stephens' method is more accurate, but it is computationally expensive and has large memory requirements.
To illustrate why pivoting fails, consider samples drawn from a mixture of five Gaussians with mean 0 and covariances R_θ M R_θ^⊤, where M = diag(1, 0.1) and R_θ is a rotation of angle θ ∈ {−π/12, −π/24, 0, π/12, π/24} (Figure 2). The resulting pivot is uninformative for certain components. The underlying issue is that the pivot is chosen to maximize the posterior distribution. If this sample lies on the boundary of Conf_K(M)/S_K, the pivot cannot be effectively used to realign samples. Quantitative results for this test case are in Table 2.

Table 2: Absolute error & timings

              Pivot   Stephens   SGD
Error (abs)   1.47    1.4        1.26
Time (s)      1.65    7.5        54

To get a better handle on the performance/accuracy trade-off for the three methods, we run an additional experiment. We draw samples from a mixture of five Gaussians over R^5 with means 0.5 e_i, where e_i ∈ R^5 is the i-th standard basis vector with i ∈ {1, . . . , 5}, and covariances 0.4 I_{5×5}. We implement an HMC sampler using Stan (Carpenter et al., 2017), with four chains, discarding 500 burn-in samples and keeping 500 per chain. We then compare the three methods, increasing the number of samples to which they have access, and measure relative error as a function of wall-clock time and number of samples (Figure 3). The resulting plots align with our intuition: pivoting obtains a suboptimal solution quickly, but if a more accurate solution is desired, it is better to run our SGD algorithm.

Figure 3: Relative error as a function of (a) number of samples and (b) time.

Multi-reference alignment. A different problem to which we can apply our methods is multi-reference alignment (Zwart et al., 2003; Bandeira et al., 2014). We wish to reconstruct a template signal x ∈ R^K given noisy and cyclically shifted samples y ∼ g · x + N(0, σ^2 I), where g ∈ C_K acts by cyclic permutation. These observations correspond to a mixture model with K components N(g · x, σ^2 I) for g ∈ C_K (Perry et al., 2017). We simulated draws from this distribution using Markov chain Monte Carlo (MCMC), where each draw applies a random cyclic permutation and adds Gaussian noise (Figure 4a); the sampler was a Gibbs sampler (Casella & George, 1992). To reconstruct the signal, we first compute a barycenter using Algorithm 2, giving a reference point to which we can align the noisy signals; we then average the aligned samples. Reconstructed signals for different σ's are in Figure 4b.
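This align-and-average step can be illustrated with a toy sketch (hypothetical signal and parameters; for simplicity we align against the true template rather than a computed barycenter):

```python
import numpy as np

# Toy sketch of align-and-average for multi-reference alignment: given a
# reference signal, undo each observation's cyclic shift by picking the
# roll that best matches the reference, then average the aligned copies.
rng = np.random.default_rng(1)
K, sigma = 32, 0.1
t = np.arange(K)
x = np.sin(2 * np.pi * t / K) + (t > K // 2)  # arbitrary template signal

# Noisy, cyclically shifted observations y ~ g.x + N(0, sigma^2 I).
obs = [np.roll(x, rng.integers(K)) + sigma * rng.standard_normal(K)
       for _ in range(500)]

def align(y, ref):
    # Choose the cyclic shift of y closest to the reference in L2.
    shifts = [np.roll(y, s) for s in range(K)]
    return shifts[int(np.argmin([np.sum((z - ref) ** 2) for z in shifts]))]

# NOTE: we cheat and use the true template as the reference; the paper's
# pipeline instead uses the Wasserstein barycenter as the reference point.
recon = np.mean([align(y, x) for y in obs], axis=0)
print(np.max(np.abs(recon - x)))  # small reconstruction error
```

Averaging the aligned observations drives the noise down at the usual 1/sqrt(n) rate once the shifts are resolved correctly.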
To evaluate quantitatively, we compute the relative error of the reconstruction as a function of the signal-to-noise ratio SNR = ‖x‖^2/(Kσ^2) (Figure 4c).

Figure 4: Reconstruction of a signal from shifted and noisy observations. (a) The true signal is plotted in blue against a random shifted and noisy draw from the MCMC chain. (b) Reconstructed signals at varying values of noise. (c) Relative error as a function of SNR.

6 Discussion and Conclusion

The issue underlying label switching is the existence of a group acting on the space of parameters. This group-theoretic abstraction allows us to relate a widely-recognized problem in Bayesian inference to Wasserstein barycenters from optimal transport. Beyond theoretical interest, this connection suggests a well-posed and easily-solved optimization method for alleviating label switching in practice.
The new structure we have revealed in the label switching problem opens several avenues for further inquiry. Most importantly, (4) yields a simple algorithm, but this algorithm is only well-understood when the Fréchet mean is unique. This leads to two questions: When can we prove uniqueness of the mean? More generally, are there efficient algorithms for computing barycenters in P_2(X)^G?
Finding faster algorithms for computing barycenters under the constraints of Lemma 1 provides an unexplored and highly-structured instance of the barycenter problem. Current approaches, such as those by Cuturi & Doucet (2014) and Claici et al. (2018), are too slow and not tailored to the demands of our application, since each measure is supported on K! points and the barycenter may not share support with the input measures. Moreover, after incorporating an HMC sampler or similar piece of machinery, our task likely requires taking the barycenter of an infinitely large set of distributions.
The key to this problem is to exploit the symmetry of the support of the input measures and the barycenter.

Acknowledgements. J. Solomon acknowledges the generous support of Army Research Office grant W911NF1710068, Air Force Office of Scientific Research award FA9550-19-1-031, and National Science Foundation grant IIS-1838071, as well as support from an Amazon Research Award, the MIT-IBM Watson AI Laboratory, the Toyota-CSAIL Joint Research Center, the QCRI–CSAIL Computer Science Research Program, and a gift from Adobe Systems. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these organizations.

References

Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. SIAM J. Math. Anal., 43(2):904–924, January 2011. ISSN 0036-1410. doi: 10.1137/100805741.

Alvarez-Melis, D. and Jaakkola, T. S. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1881–1890, 2018.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 214–223, 2017.

Bandeira, A. S., Charikar, M., Singer, A., and Zhu, A. Multireference alignment using semidefinite programming. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pp. 459–470. ACM, 2014.

Carlier, G., Oberman, A., and Oudet, E. Numerical methods for matching for teams and Wasserstein barycenters.
ESAIM: M2AN, 49(6):1621–1642, November 2015. ISSN 0764-583X, 1290-3841. doi: 10.1051/m2an/2015033.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Casella, G. and George, E. I. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.

Celeux, G., Hurn, M., and Robert, C. P. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95(451):957–970, 2000.

Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 3040–3050, 2018.

Claici, S., Chien, E., and Solomon, J. Stochastic Wasserstein barycenters. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 998–1007, 2018.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 685–693, 2014.

Diebolt, J. and Robert, C. P. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological), 56(2):363–375, 1994.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp.
1608–1617, 2018.

Jasra, A., Holmes, C. C., and Stephens, D. A. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, pp. 50–67, 2005.

Kim, Y.-H. and Pass, B. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640–683, February 2017. ISSN 0001-8708. doi: 10.1016/j.aim.2016.11.026.

Kobayashi, S. Isometries of Riemannian Manifolds, pp. 39–76. Springer Berlin Heidelberg, Berlin, Heidelberg, 1995. ISBN 978-3-642-61981-6. doi: 10.1007/978-3-642-61981-6_2.

Kusner, M. J., Sun, Y., Kolkin, N. I., and Weinberger, K. Q. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 957–966, 2015.

Lott, J. and Villani, C. Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics, pp. 903–991, 2009.

Marin, J.-M., Mengersen, K., and Robert, C. P. Bayesian modelling and inference on mixtures of distributions. Handbook of Statistics, 25:459–507, 2005.

McLachlan, G. J., Lee, S. X., and Rathnayake, S. I. Finite mixture models. Annual Review of Statistics and its Application, 6:355–378, 2019.

Muzellec, B. and Cuturi, M. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Advances in Neural Information Processing Systems 31, pp. 10237–10248. Curran Associates, Inc., 2018.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11), 2011.

Papastamoulis, P. label.switching: An R package for dealing with the label switching problem in MCMC outputs. arXiv preprint arXiv:1503.02271, 2015.

Perry, A., Weed, J., Bandeira, A. S., Rigollet, P., and Singer, A. The sample complexity of multi-reference alignment. CoRR, abs/1707.00943, 2017.

Peyré, G. and Cuturi, M.
Computational Optimal Transport. Submitted, 2018.

Fadell, E. R. and Husseini, S. Geometry and Topology of Configuration Spaces. Springer, 2001. doi: 10.1007/978-3-642-56446-8.

Santambrogio, F. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015. ISBN 978-3-319-20827-5 978-3-319-20828-2. doi: 10.1007/978-3-319-20828-2.

Solomon, J. Optimal Transport on Discrete Domains. AMS Short Course on Discrete Differential Geometry, 2018.

Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. WASP: Scalable Bayes via barycenters of subset posteriors. In Lebanon, G. and Vishwanathan, S. V. N. (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 912–920, San Diego, California, USA, 09–12 May 2015. PMLR.

Staib, M., Claici, S., Solomon, J. M., and Jegelka, S. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, NIPS 2017, pp. 2644–2655, 2017.

Stephens, M. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809, 2000.

Uribe, C. A., Dvinskikh, D., Dvurechensky, P., Gasnikov, A., and Nedic, A. Distributed computation of Wasserstein barycenters over networks. In 57th IEEE Conference on Decision and Control, CDC 2018, Miami, FL, USA, December 17-19, 2018, pp. 6544–6549, 2018. doi: 10.1109/CDC.2018.8619160.

Villani, C. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3. OCLC: ocn244421231.

Zwart, J. P., van der Heiden, R., Gelsema, S., and Groen, F. Fast translation invariant classification of HRR range profiles in a zero phase representation.
IEE Proceedings-Radar, Sonar and Navigation, 150(6):411–418, 2003.