NeurIPS 2020

Linear Time Sinkhorn Divergences using Positive Features


Review 1

Summary and Contributions: This paper proposes the combination of Sinkhorn's algorithm with cost functions of the form $-\log \langle \phi(x), \phi(y) \rangle$, where $\phi$ is an embedding into nonnegative vectors. The motivation is that when the embedding vectors $\phi(x)$ live in $R^r$ and $r$ is small, the Sinkhorn iteration can be computed very quickly (because the relevant matrix has rank $r$). The paper argues that some natural cost functions (e.g. squared Euclidean distances) can be approximated by such $\phi$, which lets us estimate the OT cost more quickly. A related idea from previous work (e.g. [3]) was to use a low-rank approximation as in the Nyström method; an advantage of the approach proposed in this paper is that the needed nonnegativity property is guaranteed by construction.
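
For concreteness, here is a minimal NumPy sketch (not the authors' code; the function name and shapes are placeholders) of the kind of iteration being described: with a factored kernel $K = \Phi_x \Phi_y^\top$ of rank $r$, each Sinkhorn update reduces to two thin matrix-vector products.

```python
import numpy as np

def sinkhorn_factored(Phi_x, Phi_y, a, b, n_iters=500):
    """Sinkhorn iterations with a factored kernel K = Phi_x @ Phi_y.T.

    Phi_x : (n, r) nonnegative features of the source points
    Phi_y : (m, r) nonnegative features of the target points
    a, b  : marginal weights of shape (n,) and (m,)

    Each update applies K (or K.T) to a vector as two thin matrix-vector
    products, so the n x m kernel is never formed and one iteration costs
    O((n + m) r) instead of O(n m).
    """
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (Phi_x @ (Phi_y.T @ v))   # K v without building K
        v = b / (Phi_y @ (Phi_x.T @ u))   # K.T u without building K
    return u, v   # scalings defining the transport plan diag(u) K diag(v)
```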

Strengths: The authors propose a new way to estimate regularized OT distances that seems reasonably practical, prove that in principle it can recover the regularized OT distance under natural cost functions, note that it has some interesting properties (e.g. differentiability), and give experiments showing the application of their method in practice.

Weaknesses: The actual quantitative bounds for approximating a cost function using nonnegative features are difficult to parse and interpret, and it is unclear how good they are in practice. For example, the bounds in Lemma 1 depend exponentially on the dimension; what should we take from this?

Correctness: As far as I could tell, the theorems are correct and the experiments were conducted properly.

Clarity: Mostly yes. One complaint: in the experiment of Figure 1, which is supposed to compare this method with the Nyström approach, the setup is not clear enough. "Two normal distributions" - what is the difference between the two normal distributions, what dimension are they in, etc.? What is the true cost function, and what feature maps are being used to approximate it? Without this information it is hard to say what this figure means.

Relation to Prior Work: Yes, it seemed to me that other relevant methods, such as the Nyström approach and other variants of Sinkhorn, have been discussed appropriately.

Reproducibility: Yes

Additional Feedback: --- Post-rebuttal: Thanks for answering the question about the experiment. Overall, my opinion is unchanged.


Review 2

Summary and Contributions: In this paper, the authors propose to use random positive feature maps to approximate the kernel matrix in the Sinkhorn algorithm. By using $r$ random features, the per-iteration cost of the Sinkhorn iterations is reduced from $O(n^2)$ to $O(nr)$. The authors show that for kernels of the form (9), the approximation errors of the kernel matrix and of the computed Sinkhorn distance can be bounded to a desired accuracy with a proper choice of $r$, under certain regularity conditions on the kernel function. They also show that the commonly used arc-cosine kernels and Gaussian kernels satisfy the regularity condition, and provide a constructive approach to designing positive features.
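
To illustrate what such positive features can look like in the Gaussian case, here is a hedged NumPy sketch based on the standard Gaussian convolution identity with importance sampling; the constants, the anchor distribution $N(0, \sigma^2 I_d)$, and the function name are illustrative choices, not necessarily the paper's exact construction.

```python
import numpy as np

def gaussian_positive_features(X, U, eps, sigma):
    """Positive features with phi(x) . phi(y) ~= exp(-||x - y||^2 / eps).

    Based on the identity
      exp(-||x-y||^2/eps)
        = (4/(pi*eps))^{d/2} * int exp(-2||x-u||^2/eps) exp(-2||y-u||^2/eps) du,
    estimated by importance sampling with anchors u_j ~ N(0, sigma^2 I_d).
    X : (n, d) points, U : (r, d) anchors. All feature entries are > 0.
    """
    d = X.shape[1]
    r = U.shape[0]
    sq_dists = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)          # (n, r)
    log_psi = 0.25 * d * np.log(4.0 / (np.pi * eps)) - 2.0 * sq_dists / eps
    # importance weight: divide by sqrt of the proposal density q(u_j)
    log_q = -0.5 * (U ** 2).sum(-1) / sigma ** 2 - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.exp(log_psi - 0.5 * log_q[None, :]) / np.sqrt(r)

# quick sanity check against the exact Gaussian kernel on tiny random data
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(6, 2))
eps, sigma, r = 1.0, 2.0, 20000
U = rng.normal(scale=sigma, size=(r, 2))
Phi_x = gaussian_positive_features(X, U, eps, sigma)
Phi_y = gaussian_positive_features(Y, U, eps, sigma)
K_exact = np.exp(-((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / eps)
print(np.abs(Phi_x @ Phi_y.T - K_exact).max())   # should be small for large r
```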

Strengths: This paper provides a practical way to reduce the per-iteration computational cost of the Sinkhorn method. The theoretical analysis is sound and comprehensive. As Wasserstein-distance-based methods have been widely used in machine learning, this work would have a wide audience in the NeurIPS community.

Weaknesses: My major concerns are listed as follows. Random feature maps have been widely adopted for approximating kernel matrices, since any kernel function can be approximated by eigenfunctions according to Mercer's theorem. Here, the authors restrict the feature maps to be positive, since $K = \exp(-C/\epsilon) > 0$, and require that the kernel function satisfies certain regularity conditions. Thus, this method is not suitable for all cost functions $c$. Moreover, $r$ is not guaranteed to be small, even though its dependency on $n$ is $O(\log(n))$. For example, for Gaussian kernels, $\phi$ and $V$ are $O(2^d)$, which indicates that $r = \Omega(\phi^2)$ is also $O(2^d)$. Thus, this method would need a large $r$ even for moderate $d$. The empirical studies are somewhat weak, as the authors only compare different methods on computing the Sinkhorn distance between two normal distributions (Exp. 1). In the second experiment, the authors show that this method can learn the adversarial kernel function. However, the results in Table 1 are a little confusing: is the reported quantity $k_\theta(f_\gamma(x), f_\gamma(y))$ or $c_\theta(f_\gamma(x), f_\gamma(y))$? Since $k = \exp(-c/\epsilon)$, $k$ should be smaller than 1. If it is $c$, why is the distance between $z$ and $z$ much bigger than that between $x$ and $z$? =======Update after author rebuttal======== My major concern is that $r$, the number of features, depends exponentially on the problem dimension $d$ for the most important Gaussian case. The authors acknowledge this drawback. Also, I think approximating a kernel with random features according to Mercer's theorem is a very standard technique. The novelty of this paper is hence limited.

Correctness: The claims are correct and the empirical methodology is correct.

Clarity: This paper is well organized and well written.

Relation to Prior Work: The relation to previous contributions is clearly discussed.

Reproducibility: Yes

Additional Feedback: What is the performance of the Sinkhorn algorithm with other direct rank-$r$ approximations of the matrix $K$, such as a rank-$r$ SVD?
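
To make the question concrete, here is an illustrative NumPy sketch (not from the paper) of the rank-$r$ SVD baseline being asked about; note that, unlike features that are nonnegative by construction, these SVD factors can contain negative entries, which can break the elementwise divisions in Sinkhorn's updates.

```python
import numpy as np

def rank_r_svd_factors(K, r):
    """Rank-r truncated SVD of the kernel matrix: K ~= A @ B.T.

    Unlike features built to be nonnegative, A and B (and hence entries of
    A @ B.T) can be negative or zero, so the resulting low-rank matrix may
    not be a valid positive kernel.
    """
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :r] * np.sqrt(s[:r])
    B = Vt[:r].T * np.sqrt(s[:r])
    return A, B
```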


Review 3

Summary and Contributions: The paper proposes a method for approximating the optimal transport and Sinkhorn divergence costs by exploiting a feature representation of the kernel matrix. This approach leads to faster Sinkhorn iterations and is still fully differentiable.

Strengths: The approach is interesting theoretically and well validated empirically. There exist previous approaches that try to do the same thing, but they are either slower, non-differentiable, or both. The paper is clearly written, the contribution is easy to understand, and the proofs are easy to follow.

Weaknesses: The section on generative adversarial networks needs more detail. As it stands, it reads like a last-minute addition. The authors mention an alternative approach via batching the input and training a standard W-GAN, but there is no comparison with this approach, either quantitative or qualitative. It would also be great to see how the approach compares to Sinkhorn/Nyström on different manifolds and metrics; Remark 1 mentions transport on a sphere, for example.

Correctness: The statements present in the paper are correct to my understanding.

Clarity: The paper is clearly written and understandable. The theorems and proofs are easy to follow.

Relation to Prior Work: All relevant prior work is cited.

Reproducibility: Yes

Additional Feedback: Post-rebuttal: ----------------- My opinion of the paper has not changed. The authors promise more detail and experiments. If these are provided, I am strongly in favor of accepting this paper.