Paper ID: | 4979 |
---|---|

Title: | Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm |

1. Overall, I like this work. I checked the proofs and believed that the proposed FW algorithm is solid. It would be better if authors can compare the proposed method with existing Wasserstein/Sinkhorn barycenter learning methods, e.g., [10, 12, 18, 48] and Ye, Jianbo, et al. "Fast discrete distribution clustering using Wasserstein barycenter with sparse support." IEEE Transactions on Signal Processing 65.9 (2017): 2317-2332. 2. In each iteration, the FW algorithm needs to solve a MINIMIZE problem in continuous or discrete domain and obtain x_{k+1} accordingly. Could authors provide more details about how to derive x_{k+1}? Is it based on finding a stationary point making the gradient of objective function zero? Or sampling some candidate points and finding that corresponding to the smallest objective value?

Overall, the paper is very clearly written with a unusually high level of mathematical sophistication and quality (e.g. compared to the usual NeurIPS submissions). I don't feel that the actual method is too novel, as Frank-Wolfe on measures have been considered previously. The strength and significance of the work is rather the particularly rigorous treatment, which puts such ideas on more firm theoretical grounds. **after rebuttal** My concerns mentioned in the "Improvements" sections were clearly addressed by the rebuttal, and I still vote for accepting the paper.

Pros: - This is a nice algorithm, as it doesn't require the support of the barycenter to be fixed beforehand and provides a simple way to control the number of support points. - The theorems proved might be useful in the other settings where optimal transport is used. The proofs in the appendix seem fine, although I should be honest that I have essentially just glanced over them. - The paper is well organized and is mostly easy to read. Cons: - What is the regularisation constant (epsilon) used for the experiments? It seems that the convergence essentially slows down exponentially as epsilon is decreased. - The rate of convergence seems to be independent of the number (m) of input distributions which you are averaging. Is this the case? Or probably it is hidden in the constants? - Besides the convergence analysis, it would be useful to compare the runtime of your algorithm with existing algorithms. Right now, nothing is said about the practical efficiency and how much it can be scaled. - What is the initialization strategy used for the barycenter in the experimental section? ---- UPDATE ----- Thanks for the rebuttal and doing the additional experiments. It addresses most of the concerns except for scalability, but I think that can hopefully be worked around in the future.