Review for NeurIPS paper: Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

NeurIPS 2020

Review 1

Summary and Contributions: The paper proposes a counterfactual contrastive learning paradigm for weakly-supervised vision-language grounding (WSVLG), which can be regarded as an effective improvement over traditional MIL-based or reconstruction-based WSVLG solutions. Three counterfactual transformation strategies are designed, at the feature, interaction, and relation levels. Experimental results on five grounding datasets demonstrate the effectiveness of the proposed method.

Strengths: The proposed contrastive learning paradigm is ingenious and effective for WSVLG. Extensive ablation studies have demonstrated the effectiveness of the proposed method.

Weaknesses:
(1) The idea of counterfactual contrastive learning is similar to adversarial erasing in object mining, which has been widely used as an effective strategy in weakly supervised detection [a] and semantic segmentation [b], and has also been introduced to vision-language grounding in [c]. However, the authors do not discuss the relation between the two and do not cite the related papers.
(2) The name “Relation Module” is not very appropriate, because the module covers both relational modeling and score inference.
(3) What is the motivation for using gradient-based selection in MIL-based pretraining? What are its advantages over directly selecting the proposals with the highest scores as the critical proposals?
(4) In Line 203, what kind of component counts as a mature component? Verifying the performance of the algorithm only with respect to a simple framework is unconvincing.
(5) For temporal language grounding, the authors should cite and compare with [d], which is closely related and is the state-of-the-art method in this field.
[a] K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. ICCV 2017.
[b] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. CVPR 2017.
[c] X. Liu et al. Improving referring expression grounding with cross-modal attention-guided erasing. CVPR 2019.
[d] J. Wu et al. Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video. AAAI 2020.

Correctness: The claims and method are correct. It would be more convincing to integrate CCL into existing SOTA grounding frameworks (e.g., [37][38][44]) and demonstrate its effectiveness there.

Clarity: This paper is well-written and easy to follow.

Relation to Prior Work: The idea of counterfactual contrastive learning is similar to adversarial erasing in object mining, which has been widely used as an effective strategy in weakly supervised detection [a] and semantic segmentation [b], and has also been introduced to vision-language grounding in [c]. However, the authors do not discuss the relation between the two and do not cite the related papers.
[a] K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. ICCV 2017.
[b] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. CVPR 2017.
[c] X. Liu et al. Improving referring expression grounding with cross-modal attention-guided erasing. CVPR 2019.

Reproducibility: Yes

Additional Feedback: The authors' explanation of the major difference between the proposed method and adversarial erasing mostly addressed my concerns.

Review 2

Summary and Contributions: The paper proposes Counterfactual Contrastive Learning (CCL) for weakly-supervised vision-language grounding. CCL conducts contrastive learning by constructing counterfactual positive and negative samples and produces a meaningful alignment score for each proposal, which distinguishes it from previous MIL-based and reconstruction-based methods. Three types of counterfactual transformation are proposed to facilitate the contrastive learning. Experiments on several vision-language grounding benchmarks demonstrate the effectiveness of CCL.

Strengths:
- The method is novel.
- The proposed CCL helps the model localize the parts of a video or image relevant to a given query under weak supervision signals.
- Experimental results show that CCL is an effective weakly-supervised vision-language grounding method and that it outperforms the SOTA methods.
- The ablation study is complete.

Weaknesses: The influence of the memory bank size B and of the memory update strategy should be discussed.
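To make the two knobs in this request concrete, here is a minimal sketch of a fixed-size memory bank. It is an illustration only: the class name `FifoMemoryBank`, the first-in-first-out update rule, and the use of plain Python lists as stand-ins for feature vectors are all assumptions, since the paper's actual update strategy is not described in this review.

```python
from collections import deque


class FifoMemoryBank:
    """Hypothetical memory bank holding at most `size_b` stored features.

    Assumption: the oldest entries are evicted first (FIFO). Other update
    strategies (e.g. momentum-updated entries) would replace `update`.
    """

    def __init__(self, size_b):
        if size_b <= 0:
            raise ValueError("size_b must be positive")
        # deque with maxlen drops items from the left once it is full.
        self._items = deque(maxlen=size_b)

    def update(self, features):
        """Append the features of the current batch, evicting the oldest."""
        self._items.extend(features)

    def negatives(self):
        """Return the stored features, oldest first."""
        return list(self._items)

    def __len__(self):
        return len(self._items)


bank = FifoMemoryBank(size_b=3)
bank.update(["f1", "f2"])
bank.update(["f3", "f4"])  # "f1" is evicted because the bank holds only 3
```

An ablation of the kind requested would vary `size_b` and swap the eviction rule while keeping everything else fixed.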

Correctness: Correct

Clarity: Easy to follow

Relation to Prior Work: Yes. The authors discuss in the paper how the proposed Counterfactual Contrastive Learning differs from previous MIL-based and reconstruction-based weakly-supervised vision-language grounding methods.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper addresses weakly-supervised vision-language grounding, covering both video grounding and image grounding. The authors propose counterfactual contrastive learning (CCL), which performs contrastive training between generated counterfactual positive and negative results. Experiments on five vision-language grounding datasets demonstrate the effectiveness of the proposed CCL.

Strengths:
+ The idea of counterfactual contrastive learning (CCL) is novel and reasonable for weakly-supervised vision-language grounding.
+ CCL achieves new state-of-the-art performance on five vision-language grounding datasets, which demonstrates its effectiveness.
+ The ablation study is meaningful and demonstrates the effectiveness of the counterfactual transformations and the contrastive loss.

Weaknesses:
+ Counterfactual transformation. The generation of counterfactual negative results is reasonable; however, the generation of counterfactual positive results from the inessential proposal set is confusing. How can the authors guarantee that, under the proposed counterfactual transformation, the positive results have higher alignment scores with the original results than the negative results do?
+ Distribution of the original, counterfactual negative, and counterfactual positive results. It is unclear what counterfactual positive and negative results are actually generated during the counterfactual transformation. Could you provide some visualization or analysis of the distribution of the original, counterfactual negative, and counterfactual positive results?

Correctness: Yes

Clarity: Well written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The authors answered my questions and agreed to include a visualization of the distribution of the original, counterfactual negative, and counterfactual positive results in the revision. I will keep my original score of 6.

Review 4

Summary and Contributions: In this paper, the authors address the lack of contrast in weakly supervised vision-language grounding by proposing counterfactual contrastive learning (CCL). CCL generates samples through counterfactual transformations applied at the feature, interaction, or relation level. Experiments are conducted on weakly supervised image and video grounding datasets.

Strengths:
1. The proposed method shows good results on both image and video grounding datasets. The ablation studies are sufficient to support the effectiveness of the proposed CCL.
2. To my knowledge, this is the first paper that applies the recent advances in contrastive learning [6, 15] to the weakly supervised vision-language grounding task.
3. The paper is well written and clear.

Weaknesses:
1. The baseline VGN method, trained with a conventional MIL loss, shows good results that already outperform the previous MIL-based SOTA, CTF [7]. The feature and proposal details in Section 4.2 should be presented more fully to help readers understand why the baseline performs so well.
2. Beyond the approach proposed in this paper, many previous MIL studies also try to address the problem that “samples are often easy to distinguish,” mainly from a negative sample mining perspective. The experiments could be strengthened by comparing against stronger designed baselines, such as selecting samples directly from the proposal set or applying naive hard negative mining.
3. The image grounding part is not evaluated on the more commonly used ReferItGame [1] and Flickr30K Entities [2] datasets.
[1] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. EMNLP 2014.
[2] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV 2016.
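For reference, the "naive hard negative mining" baseline mentioned in point 2 could look roughly like the sketch below. This is an assumed, generic formulation (a margin ranking loss that keeps only the highest-scoring negative), not the authors' implementation; the function name `hinge_ranking_loss`, the margin value, and the toy scores are all hypothetical.

```python
def hinge_ranking_loss(pos_score, neg_scores, margin=0.2, hard_only=True):
    """Margin ranking loss for one matched (visual, query) pair.

    pos_score:  alignment score of the matched pair.
    neg_scores: scores of the same visual input paired with mismatched queries.
    hard_only:  if True, keep only the hardest (highest-loss) negative;
                if False, average the hinge loss over all negatives.
    """
    if not neg_scores:
        raise ValueError("need at least one negative score")
    # One hinge term per negative: zero once the positive wins by `margin`.
    losses = [max(0.0, margin - pos_score + s) for s in neg_scores]
    if hard_only:
        return max(losses)
    return sum(losses) / len(losses)


# Toy scores: only the second negative violates the margin.
hard = hinge_ranking_loss(0.8, [0.1, 0.7, 0.3], hard_only=True)
mean = hinge_ranking_loss(0.8, [0.1, 0.7, 0.3], hard_only=False)
```

With these toy scores, averaging over all negatives dilutes the single informative one, which is the usual argument for mining hard negatives and the reason it is a natural baseline for a method that constructs its own hard samples.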

Correctness: The manuscript looks correct.

Clarity: The paper is well written and clear.

Relation to Prior Work: The references are clear.

Reproducibility: Yes

Additional Feedback: Final rating: Thank you for your feedback. The additional comparisons to "hard negative mining" and "direct proposal masking" address my previous concern about the comparison to existing hard negative mining approaches. Overall, I would like to keep my rating of 7.