NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7574 Alleviating Label Switching with Optimal Transport

### Reviewer 1

The authors tackle label ambiguity in the Bayesian mixture model setting by considering quotients under group actions on the labels that preserve posterior likelihoods. They then devise a numerical algorithm for computing barycenters in the given latent space, by viewing the computation of a quotient metric as an optimal transport task. The paper is well written and the exposition is clear. Notably, I enjoyed the general theoretical framework provided, which is then specialized to the finite mixture model setting. Although the experimental section is quite limited, I think the framework itself makes up for it as a contribution worthy of publication. However, I encourage the authors to provide more convincing experiments. The authors acknowledge the problem of |G| getting large quickly. Perhaps a simple approach the authors could try is to randomly sample elements of G for the barycentric problem and carry out stochastic gradient descent at this level too?

Specific remarks:

- Line 125: I might have missed it, but is the notation S_K defined prior to this part?
- Line 131: The relation looks more like equivariance than invariance. Perhaps \pi(g x, g y) can be shown to be \pi(x,y) based on the group invariance of \nu_1 and \nu_2?
- Line 140: I believe the push-forward notation of \Omega with respect to p_* should be used.
- Line 229: Are you referring to the map \sigma that minimizes (5) here?
- Line 230: Bold face p notation is not defined until line 234.
- Line 266, exponential map: The 2-Wasserstein exponential map for SPD matrices is given by Exp_K(v) = (I + v)K(I + v), and the logarithm is given by T^{\Sigma_1\Sigma_2} - I. The exponential map here looks more closely related to the exponential map under the affine-invariant metric for SPD matrices. Additionally, the gradient is written as being taken with respect to \Sigma_1, although it should be with respect to L_1, as written in Muzellec & Cuturi (2018). If the gradient is with respect to \Sigma_1, then the expression should be (I - T^{\Sigma_1\Sigma_2}) without the L_1.
- Table 2: Missing a dot.
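The remark on line 266 can be checked numerically. Below is a minimal sketch, assuming the standard closed form of the optimal transport map between centered Gaussians, T^{\Sigma_1\Sigma_2} = \Sigma_1^{-1/2} (\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2} \Sigma_1^{-1/2}. The helper names (`spd_sqrt`, `transport_map`, `wasserstein_log`, `wasserstein_exp`) are illustrative and are not taken from the paper or from Muzellec & Cuturi (2018). The sketch only verifies that the exponential and logarithm stated in the remark are mutually consistent, i.e. that Exp_{\Sigma_1}(Log_{\Sigma_1}(\Sigma_2)) recovers \Sigma_2.

```python
import numpy as np

def spd_sqrt(A):
    """Symmetric square root of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(w)) @ V.T

def transport_map(S1, S2):
    """Assumed closed form: T = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}."""
    r = spd_sqrt(S1)
    r_inv = np.linalg.inv(r)
    return r_inv @ spd_sqrt(r @ S2 @ r) @ r_inv

def wasserstein_log(S1, S2):
    """Logarithm at S1 as stated in the remark: T^{S1 S2} - I."""
    return transport_map(S1, S2) - np.eye(S1.shape[0])

def wasserstein_exp(S1, v):
    """Exponential at S1 as stated in the remark: (I + v) S1 (I + v)."""
    I = np.eye(S1.shape[0])
    return (I + v) @ S1 @ (I + v)

def random_spd(rng, d):
    """Hypothetical test input: a well-conditioned random SPD matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

rng = np.random.default_rng(0)
S1, S2 = random_spd(rng, 3), random_spd(rng, 3)

v = wasserstein_log(S1, S2)        # tangent vector at S1 pointing to S2
recovered = wasserstein_exp(S1, v)  # should reproduce S2 up to round-off
max_err = np.max(np.abs(recovered - S2))
print("max abs error:", max_err)
```

The identity holds because T \Sigma_1 T = \Sigma_2 for the map above, so any formula for the exponential map that does not reduce to (I + v)\Sigma_1(I + v) would fail this round-trip check.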

### Reviewer 2

After feedback: I thank the authors for their responses but remain unconvinced that the paper should be published. I restate my main concerns below.

1) The method is a post-processing step for MCMC samples approximating the posterior in a mixture model. The choice of MCMC algorithm is somewhat orthogonal to the contribution of the paper, as argued by the authors; however, what is given as input to the proposed method must surely have an impact on the output. The posterior distribution in mixture models is known to be multimodal, and not only because of label switching. There is genuine multimodality (in the terminology of Jasra, Holmes & Stephens 2005). If the MCMC algorithm employed gets stuck in a local mode, then I don't see how the proposed method will give satisfactory results. Therefore, one should start with an MCMC algorithm reasonably suited to the task of sampling from multimodal target distributions. Thus HMC seems to be a very poor choice here. Even a simple random walk MH with heavy-tailed proposals would seem a better choice.

2) As far as post-processing methods for MCMC samples in the context of mixture models go (i.e. as an alternative to Stephens' method), the method does have some potential, although it is hard to assess without an in-depth comparison with Stephens' method. I remain unconvinced that the output of the algorithm can be called a "summary", a term that is typically reserved for numbers or low-dimensional vectors that summarize high-dimensional or infinite-dimensional objects, e.g. the mean and the standard deviation are summaries of a probability distribution. The proposed object being a Wasserstein barycenter, it is itself a probability distribution, if I understand correctly. If I am not missing something, none of the experiments show that distribution visually. My point here is that as a reader I was expecting to see concretely what the "posterior statistics" announced in the abstract looked like, and I was therefore disappointed.

I believe that the changes proposed by the authors would improve the paper, so I change my rating from 4 to 5, but I remain overall negative about its acceptance at NeurIPS.

Pre-feedback:

The article introduces Wasserstein barycenters on the quotient space resulting from a group of isometries. The main motivation is the label switching phenomenon in mixture models, whereby the posterior distribution is invariant under permutation of the label indices. The part on the definition of Wasserstein barycenters on quotient spaces seems novel and might be of independent interest. However, I have concerns about the motivation from label switching and whether the article "solves" the label switching problem as claimed on page 2.

The description of label switching itself is not very clear: for instance, on page 5, "Posteriors with label switching make it difficult to compute meaningful summary statistics like expectations". If the goal is to obtain a meaningful summary, I am not sure the proposed Wasserstein barycenter is a helpful contribution. Indeed, it is a probability distribution, just like the posterior distribution, but it is defined on a more abstract space. So, is it really a useful summary compared to the original posterior distribution? What can be learned from the proposed barycenter that couldn't be learned directly from the posterior distribution?

The method assumes that one starts with a sampler from the posterior distribution. The authors use HMC in their experiments. But a key issue with label switching is that sampling from the posterior is hard, modes might be missed, and standard MCMC methods such as HMC are likely to struggle. State-of-the-art samplers for these settings are specifically designed to handle multimodality, e.g. parallel tempering, Wang-Landau, sequential Monte Carlo, free energy MCMC, etc. In fact, there is an article specifically on the fact that Hamiltonian Monte Carlo does not help in tackling multimodality: https://arxiv.org/abs/1808.03230. HMC is notoriously successful in high-dimensional, log-concave settings, which is the opposite of the low-dimensional multimodal settings considered in the article. In the "multi-reference alignment" example, the authors do not specify which MCMC algorithm was used.

Overall, I am not convinced that the proposed manuscript solves the label switching problem. The contribution (Wasserstein barycenter on quotient spaces, gradient descent on the quotient space, etc.) might be helpful for tackling other problems.
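To make the label switching phenomenon discussed above concrete, here is a toy sketch of why expectations become uninformative when the posterior is invariant under permutation of the labels. Everything in it is an assumption made for illustration: the "posterior draws" are synthetic Gaussian perturbations of two hypothetical component means, the labels are swapped at random rather than by an actual sampler, and the alignment step is a plain brute-force relabeling toward a fixed reference. It is neither the barycenter algorithm of the paper nor Stephens' method.

```python
import itertools
import numpy as np

def align_to_reference(sample, reference):
    """Return the relabeling of `sample` closest to `reference`
    in squared Euclidean distance, by brute force over S_K."""
    K = len(sample)
    best = min(
        itertools.permutations(range(K)),
        key=lambda p: np.sum((sample[list(p)] - reference) ** 2),
    )
    return sample[list(best)]

rng = np.random.default_rng(0)
true_means = np.array([-2.0, 3.0])  # hypothetical component means, K = 2
n = 2000

# Synthetic stand-in for posterior draws of the component means.
draws = true_means + 0.1 * rng.standard_normal((n, 2))

# Simulate label switching: swap the two labels in roughly half of the draws.
flip = rng.random(n) < 0.5
draws[flip] = draws[flip][:, ::-1]

# Naive expectation: both coordinates collapse toward the same value.
naive_mean = draws.mean(axis=0)

# Relabel every draw toward one fixed representative, then average.
reference = np.sort(draws[0])
aligned = np.array([align_to_reference(d, reference) for d in draws])
aligned_mean = aligned.mean(axis=0)

print("naive mean:  ", naive_mean)
print("aligned mean:", aligned_mean)
```

The naive mean lands near the midpoint of the two components in both coordinates, while the relabeled mean separates them again. The brute-force search over permutations costs K! evaluations per draw, which is the growth of |G| that Reviewer 1 points to.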