Review for NeurIPS paper: Multi-label Contrastive Predictive Coding

NeurIPS 2020

Multi-label Contrastive Predictive Coding

Review 1

Summary and Contributions: The authors propose a multi-label version of contrastive predictive coding (CPC), which essentially transforms the CPC loss from one that is defined on each positive pair and its set of negative pairs for i=1...n versus a version where all positive pairs (and all possible negatives amongst them) make up the softmax distribution, which can be thought of a 'multi-label' classification task where the network is trained to ensure the top n predictions from that distribution are the n positive examples. The motivation behind this technique is that: - (1) The regular CPC loss (which is a lower bound on the mutual information between X and Y) is upper bounded by log(m), where m is the total number of pairs used (i.e. 1 positive pair + m-1 negative pairs). If log(m) is much lower than I(X;Y) then we underestimate mutual information. (In practice, m will be constrained by how much GPU memory / processing power is available.) - (2) alpha-CPC, the reweighted version of this loss (with weighting coefficient alpha) can upper bound I(X;Y) by log(m/alpha), potentially reducing bias, but also at the cost of increased variance. Its disadvantage however is that for some range of alpha the loss will no longer be a lower bound to I(X;Y). In essence, the authors show that for their proposed method (alpha-ML-CPC), one can derive the range of alphas that lower bound I(X;Y) as a function of m and n, and these ranges are quite large even for modest values of m and n. This means that all one has to do is find the right alpha within this range to balance bias and variance.

Strengths: - Technique is well justified, theoretically, with strong empirical results to support the theory. - The significance of the approach is that one can estimate tighter lower bounds to I(X;Y) without necessarily needing more computational resources (i.e. increasing m). This is significant because contrastive-based techniques generally require *many* negative pairs in order to achieve good performance.

Weaknesses: - While the paper certainly stands on its own merits even without the knowledge distillation results, I noticed that you are reporting test accuracy in both Table 1 and Table 2. It is not clear to me whether you (a) performed hyperparameter tuning on the validation set for each model enumeration (e.g. KD, CRD, L_1.0, L_0.1, J_1.0, etc.) and *then* got the best model for each and evaluated it on the test set; or (b) whether you just used the test set as your validation set. The latter approach (although seemingly common in papers nowadays) is not very well principled and can lead to optimistic results. It would be good if you can elaborate on this. - Table 1 and Table 2 should also report variances.

Correctness: I have not verified the proofs in the appendix, though the theoretical claims do appear to be well-supported empirically in the main text.

Clarity: Generally, yes. Some suggestions however: - The expectation symbol in your equations is redundant, since you already have the summation over n, so remove one and keep the other (I'd prefer keeping the summation). - In Equation (6), it would be much clearer if the summation outside the fraction was moved to the numerator. While either form is functionally equivalent, putting the summation in the numerator makes it more clear that your loss L(g) is one term computed over the entire batch, as opposed to Equation (2) where L(g) is defined on a per example-basis for i=1...n and averaged. Also, as mentioned earlier, this also erroneously has the expectation symbol. - Typo: "MC-CPC", should be "ML-CPC" - Typo: "we illustrates" should be "we illustrate"

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: - "two types of critic, named joint and separable" --> what is the difference between these two? - A very minor point related to the pdf itself: sometimes Figure 3 takes a while to render because of all the individual points in the plot (which is 20,000 per experiment, according to the x-axis). Perhaps these plots could be re-rendered by subsampling points or simply rasterising the plot to png (I would prefer the former, since this results in a much higher quality figure). ------ REVIEW UPDATE: I have read the other reviews and the rebuttal. I am satisfied with the response and will keep my score as-is.

Review 2

Summary and Contributions: Recent successful approaches to mutual information estimation and lower bounds use contrastive approaches, like contrastive predictive coding (CPC). This paper proposes an elegant modification of the CPC estimator that decreases bias with negligible computational cost. The paper demonstrates the value of the improved estimator for downstream tasks.

Strengths: There has been a lot of recent interest in the NeurIPS community in different information estimators. While CPC stands out as one of the more successful and widely used estimators, the fact that it saturates at log m, for m negative samples, is a persistent problem that highlights the bias in this estimator. This paper suggests a simple modification that improves bias and reduces the problem of saturating the bound with a small number of contrastive examples. The theoretical development is very clear, and includes conditions for the bound to hold. The empirical evaluation compares to CPC and shows both MI estimation as well as demonstrations of how it helps in downstream tasks. For the community of people interested in information bounds and estimators, this is certainly a welcome contribution.

Weaknesses: The scope of the paper is somewhat narrowly focused - only comparing directly with CPC at the expense of a menagerie of other variational estimators. However, I find it hard to fault this since CPC is the most widely used and is the natural comparison for this method.

Correctness: Yes.

Clarity: The paper was very clear and easy to follow.

Relation to Prior Work: This paper did a good job of summarizing recent work in this area.

Reproducibility: Yes

Additional Feedback: The plots are rather small and hard to read, maybe you can improve that in the final version with more space. Is the variance always strictly bigger? It's nice that you give an estimator with an adjustable range of bias-variance trade-off, I'm just wondering if there's a way to generalize even more so that Eq. 7 can somehow smoothly reduce to the standard CPC. Is it easy to understand why the variance goes up? I guess generically we expect that if bias is reduced it may increase variance, but the denominator in Eq. 7 adds together more R.V.'s, so I didn't expect the variance to be higher than CPC from that point of view. EDIT: I have read the author rebuttal.

Review 3

Summary and Contributions: This paper introduces a novel estimator based on a multi-label classification problem, where the critic needs to jointly identify multiple positive samples simultaneously. This is called multi-label contrastive predictive coding. With the same amount of negative samples, multi-label CPC can provably exceed the log(m)bound, while still being a valid lower bound of mutual information. Experiments in MI estimation, knowledge distillation, self-supervised learning demonstrate the validity of proposed methods. -------------------------------------- Updates: The rebuttal addresses my concerns and thus I recommend to accept this paper.

Strengths: + A well-written paper + Extensive experiments + Nice theoretical justification + The idea of generalized CPC and ML-CPC is interesting.

Weaknesses: - The experimental improvement comparing to CPC is marginal. However, since the ML-CPC has similar computation complexity, ML-CPC is still valuable. - Lacking results on ImageNet. I think the results on ImagetNet is very important. For most knowledge distillation and self-supervised learning works, ImageNet is a standard benchmark. Thus to make this work more convincing, I suggest the authors to report results on ImageNet, especially considering the marginal improvement on CIFAR. - Typo: Line 133: xxx negative pairs "share" a same element xxx - The multi-label CPC is provably lower than MI. However, in Fig. 3, the curves of ML-CPC (\alpha=0.001, 0.0001) without smoothing surpasses the MI. Can you provide some explanation?

Correctness: The method and claims are correct.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: If the results of ImageNet are presented, I am happy to raise scores.

Review 4

Summary and Contributions: Contrastive Predictive Coding (CPC) estimates are upper bounded by log of negative samples per positive samples, while computational overhead limits the increasement of the number of negative samples. Aim at this problem, this paper proposes ML-CPC for increasing the bound without additional computational costs and decreasing bias. **The authors have carefully addressed my questions in the rebuttal. It's very rewarding to reading this paper.**

Strengths: 1. This paper explores a practical problem and solves it by proposing three generalizations to CPC (re-weighted CPC, ML-CPC and re-weighted ML-CPC). These methods are easy and step-by-step. 2. This paper proves the learning objective of ML-CPC is still guaranteed to be a variational lower bound to mutual information and has similar computational costs to CPC. 3. The paper provides a thorough experimental validation of the proposed methods, and the results align with their claims.

Weaknesses: I have a few minor concerns: 1. In Sec. 2, the authors say "Therefore, one can train g and h to maximize L(g), resulting in higher lower bounds to I(X;Y)...", should it be "minimize L(g)"? 2. In Sec. 3.1 and Sec. 3.3, I don't very understand why L_{\algph}(g) is upper bounded by log(m/\alpha). It may need an explanation.

Correctness: The problem is well-motivated and the technical content seems to be sound.

Clarity: The paper is clearly written and fairly readable.

Relation to Prior Work: The authors review the previous works and naturally raise their motivation. The main contribution of this paper is to propose a method decreasing bias and being a lower bound to mutual information. In my knowledge, this is a novel work.

Reproducibility: Yes

Additional Feedback: