Reviews: Wasserstein Dependency Measure for Representation Learning

Reviewer 1

In this paper the authors discuss the limitations of current representation learning approaches that aim to approximate mutual information (MI), and propose their own objective, a lower bound on what they call the Wasserstein dependency measure. They compare their method to a similar approach, CPC; the main difference is that the function class being optimized over is constrained to be 1-Lipschitz. Their results are promising and show increased robustness and performance on small, relatively simple datasets with respect to dataset size, minibatch size, and the mutual information of the dataset. Overall, I think this is a nice paper: it's a simple modification to an existing approach, but the performance gains are significant in the settings considered. I'd like to see more discussion of the limitations of this method in more complicated settings. This is alluded to in the paper, but I think careful analysis of the limitations of an approach is valuable in addition to analysis of its strengths.
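
For reference, the contrast between the KL-based quantity and the Wasserstein one can be written out as below. This is a sketch in assumed notation, not a transcription of the paper's equations: $f$ denotes the critic, $\mathcal{L}$ the class of 1-Lipschitz functions, and the dual form assumes the Wasserstein-1 distance, for which Kantorovich-Rubinstein duality applies.

```latex
% Mutual information as a KL divergence between the joint and the product of marginals
I(x; y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)

% Wasserstein dependency measure in its Kantorovich-Rubinstein dual form
% (\mathcal{L} = 1-Lipschitz functions; assumed notation)
I_{\mathcal{W}}(x; y) = \mathcal{W}\big(p(x, y),\, p(x)\,p(y)\big)
  = \sup_{f \in \mathcal{L}} \; \mathbb{E}_{p(x, y)}\big[f(x, y)\big]
    - \mathbb{E}_{p(x)\,p(y)}\big[f(x, y)\big]
```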

Reviewer 2

I like the idea of wassersteinizing mutual information, and the reasoning that enforcing a Lipschitz constraint in the CPC bound prevents the encoder from exaggerating small differences. However, I am not fully convinced that the underlying problem with exponential sample size is resolved by their lower bound. Besides, there are some questions on clarity and experiments which I believe the authors need to address.

>>>>> Questions on experiments:

- What would happen if you directly used the proposed Wasserstein dependency measure instead of considering any contrastive learning/negative samples? You wouldn't have a lower bound, and the proposed Wasserstein mutual information would be exact.
- "illustrating relatively simple problems where fully reconstructive models can easily learn complete representations, while self-supervised learning methods struggle." In which experiments do you show that fully reconstructive models can easily learn...?
- In the CPC paper, they report top-1 and top-5 unsupervised classification results on ImageNet. Are there any reasons that prevent experiments with WPC in this setting?

>>>>> Quality and Clarity:

- How is the base metric (typically called the ground metric in the OT literature) in the Wasserstein distance data-independent? For instance, if the distribution is over words, it makes more sense to use angular/cosine distance between word embeddings. I agree with what you say in lines 118-120, and maybe it is just the particular wording of line 116 that needs to be refined.
- From my understanding, the summation over j in eq (3) is over the set (say Y) containing the target y and the other negative samples. So in the second term of eq (3), the expectation is over p(x)p(y_j) for some j, while the sum inside is over all j indexing the set Y. Maybe I am missing something, but it doesn't seem that you can take the expectation over some j out like that. (I understand what you want to do here, but I think the equation is not precise.)
- Also, it would be useful to provide some background on CPC in your paper. It would make it more complete.
- Besides the two main differences mentioned, aren't equations (2) and (3) also different in the sense that (2) doesn't have the second term (expectation of log-sum-exp) for negative samples? If yes, I think it would be good to clarify this in the text.
- How is the reference section organized? It is ordered neither by serial number, nor by last or first name, nor by publication date. It is really hard to check!

>>>>> Typos:

- Line 209, I(x; y) =
- Some of the inline references are not properly typeset, for example: "ideally via linear probes Alain and Bengio [2016]."
- Line 222: "predictice" -> "predictive"
- Line 244: "mininatches" -> "minibatches"
- (Some nitpicking) Line 135: \mathcal{M} is not defined.

----- UPDATE -----

Thanks for the rebuttal; it addresses most of the concerns I had. As mentioned in the rebuttal, I think it is critical to clarify that your method's efficacy is shown empirically and not theoretically. Please make the suggested revisions in the final version and *at all costs* improve the reference section.
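
The questions above about the sum over j and about sample size can be made concrete with a small sketch. It assumes a standard CPC-style (InfoNCE) estimator in which each row of a score matrix holds the critic value for one positive pair and K-1 negatives; the function name `infonce_estimate` and the toy scores are illustrative assumptions, not the authors' implementation. The point it illustrates is that such an estimate cannot exceed log K, however well the critic separates positives from negatives.

```python
import math
import random


def infonce_estimate(scores):
    """Average of log( exp(s_ii) / ((1/K) * sum_j exp(s_ij)) ) over rows i.

    scores[i][j] stands for a critic value f(x_i, y_j) in a square K x K
    matrix; the diagonal holds the positive pairs and the other entries in
    a row act as negatives (an assumed layout for this sketch).
    """
    k = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)
        # Numerically stable log-sum-exp over the positive and all negatives.
        lse = m + math.log(sum(math.exp(s - m) for s in row))
        total += row[i] - lse
    return total / k + math.log(k)


random.seed(0)
K = 8
# Toy scores: a large diagonal mimics a critic that separates positives
# from negatives almost perfectly.
scores = [
    [random.gauss(0.0, 1.0) + (50.0 if i == j else 0.0) for j in range(K)]
    for i in range(K)
]
estimate = infonce_estimate(scores)
print(estimate, math.log(K))
```

Each row's term `row[i] - lse` is never positive, so the estimate saturates at log K. Under this estimator, certifying a larger value requires more negatives per positive, which is one way to read the sample-size concern raised above.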

Reviewer 3

Originality: I'm not very familiar with the area, but the experimental tasks looked new to me. Moreover, the proposed approach is a novel combination of previous works.

Quality: All arguments in the paper are well supported verbally. In the experiments section, the authors also support their claims empirically.

Clarity: Overall it is good. I had difficulty following the 'Choice of Metric Space' subsection.

Significance: I think other researchers won't use this approach as is for solving problems, but they will try to build upon it. I think this work can lead to other high-quality publications. Methodologically, it follows a similar approach: it borrows different components from previous related work and builds upon them. Although the research direction looks very promising, the experimental results looked very specific to the datasets and experimental setup to me.

Paper ID:	9051