Review for NeurIPS paper: Statistical Optimal Transport posed as Learning Kernel Embedding

NeurIPS 2020

Statistical Optimal Transport posed as Learning Kernel Embedding

Review 1

Summary and Contributions: The paper is concerned with estimating an optimal transport map given samples from the source and target distributions. The idea is to represent source and target marginals, as well as the (bivariate) cost function, as empirical kernel mean embeddings. Actually, a sort of minimum norm like embedding is used for the latter (cost function). Given these assumptions it is shown that the optimal transport map enjoys a representer theorem, meaning it lies in the span of the data. They prove this representer theorem and also an intriguing result that their estimator converges independently of the source and target dimensionality. After the rebuttal: I thank the authors for their response. I maintain my positive appraisal.

Strengths: I really enjoyed the creative aspect here. It's a fun idea and seems to work really well in practice. The authors combine this creative idea with a nice theoretical guarantee.

Weaknesses: The paper really needs a bit of work from an English and overall presentation standpoint including formatting of the figures. This is nothing deal breaking, just that it gives a slightly weak impression on a superficial level. It would be nice if the paper came with a primer on kernel mean embeddings which helped the reader to undewrstand the notation in section 3.

Correctness: I have not checked the detailed proofs. The reviewer load is high these days, I apologise.

Clarity: It's moderately well written. See the weaknesses section.

Relation to Prior Work: Yes, the links are clearly drawn out.

Reproducibility: Yes

Additional Feedback: See the weaknesses section. Also, I find line 151 slightly odd, in the definition of rho star. This is not a minimum norm interpolant. Moreover you are defining rho in terms of the entire function c. Can you expand on how you compute rho star?

Review 2

Summary and Contributions: Rebuttal read. It could be good indeed to put the complexity matters in the main paper. =============================================== The proposed methods consists in re-writing the original optimal transport formulation in terms of kernel embedding, such that the cost function is exactly represented (or arbitrarily close to the true one). The obtained problem then benefits from kernel framework. To estimate the marginal kernel mean embedding, the author propose a regularization based on MMD distances. Eventually the authors obtain a (nice) fully regularized and kernelized problem to represent the original one. Known algorithms are proposed to solve the problems, it is not a major contribution of the paper but it shows that one can find solutions quite efficiently. The experimental part shows that the proposed method can outperform estimates based on discrete OT, using cases for which the analytical solutions are known. It also illustrates the capacity to provide estimation on out-of-sample points. It also proposes a successful application to domain adaptation.

Strengths: The main strength of the paper can be listed as follows: - representer theorem - can be applied to all variants of OT: continuous, semi-discrete, and discrete - the proposed method allows for out-of-sample extension

Weaknesses: I can't see many weaknesses, apart from the complexity matters : I would be curious to have some insights on this subject.

Correctness: I could not find flows in the process of kernelization.

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I found that some parts of the paper really "compressed", I guess due to space constraints. For instance, may be because I'm not much familiar with the task, I found hard to understand the domain adaptation experimental setup. Sentences like "OOS estimation is especially attractive..." (l293) could leave space to some more informative ones?

Review 3

Summary and Contributions: This paper proposed to utilize the kernel embedding method to reduce the dimensionality of the statistical optimal transport (OT) problem. By reformulating the OT problem as a kernel embedding problem, the paper claims that the sample complexity of the proposed estimator is dimension-free under mild conditions.

Strengths: Dimension reduction is an important issue in statistical OT problems. Existing methods mainly focused on regularized approaches and random projection methods. The kernel embedding method, proposed in this paper, can be an interesting complement to existing literature.

Weaknesses: Some implementation details of the proposed method are not clearly specified. The validity of the theoretical claims is hard to check due to the poorly organized proofs. The numerical experiments are not comprehensive. No comparison with popular competitors.

Correctness: (a) The proposed kernel embedding formulation of OT is based on the cross-covariance operator which ignores higher moments distributions. (b) The dimensionality of canonical feature maps can be infinity. This paper did not carefully discuss this issue. Does the proposed method need to make a finite-dimensional approximation? If so, how to selec the number of terms? Does it depend on the dimensionality of the data? (c) The objective functions in (3) and (4) are based on the expectation of functions defined in line 104 which are not empirically available. The algorithm discussed in Section 3.4 is based on a couple of simplification conditions. It will be helpful to discuss the approximation error as well as the computational complexity of the algorithm. (d) The experiment only simulated a multivariate Gaussian OT problem which is in favor of the second-order moment construction as pointed out in (a). It may not be sufficient to demonstrate the effectiveness of the proposed method under general cases.

Clarity: The paper reads smoothly with a couple of grammar errors. The proof in the supplemental material, though not mandatory, are poorly organized and have several gaps. It impedes the reviewer (at least for me) to check the validity of the theory within the time-imited review process. Some details in the experiments are also missing. For example, how to construct \Sigma_1 and \Sigma_2 in the multivariate Gaussian example?

Relation to Prior Work: The paper did not make a comparison with popular dimension reduction methods for statistical OT problems.

Reproducibility: No

Additional Feedback: ######## EDIT after author's response ############# The responses have addressed some of my confusing points. I would like to raise my score to 6. I have not checked the correctness of the theory due to the proofs are not well organized in the appendix.