NeurIPS 2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Review 1

Summary and Contributions: This paper studies the problem of video-text representation learning with applications to video-text and text-video retrieval. The proposed method falls into the "late fusion" scheme, where an individual stream is adopted for each modality and the resulting embeddings are then aligned in a joint embedding space. The main contributions include a cross-modal cycle-consistency loss that enhances video-text fusion, along with impressive empirical results and practices.

Strengths: The technical design in this paper is sound. The main framework is a transformer-equipped version of [21], which already gives a >10% relative improvement on ANet-Captions over the original method in [21]. Transformers and attention modules are used extensively for local, global, and local+global contextualized encoding of video and text. Experimental results are solid and ablation studies are convincing.

Weaknesses: Overall, the paper has made significant progress towards better video-text retrieval, in terms of both efficiency (compared to "early fusion") and empirical results. However, methodology-wise, the paper brings limited excitement, as Transformers and attention modules are already frequently used for video and text encoding. Some claims are not well supported. For instance, i) [CLS] also uses attention for temporal aggregation, at an even denser level than the proposed method; ii) the attention-aware feature aggregation is not exactly "new", as it can be regarded as an MLP with GELU for self-attention. But we do note that using a standard architecture does not diminish the paper's contribution in boosting model performance (53.7 -> 61.3 on R@1 on ANet-Captions). Also, the idea of applying a cycle-consistency loss between video clips and sentences is interesting and effective (57.6 -> 61.3 in the same experiment). The clarity of the paper could be improved; see the "Clarity" section for details. Future work on video-text tasks other than retrieval should be considered, as the paper claims to learn a general video-text representation.
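For readers unfamiliar with the R@1 numbers quoted above, the following is a minimal sketch of how recall at k is commonly computed for retrieval. It assumes a square similarity matrix in which query i matches candidate i; the function name `recall_at_k` and the toy values are illustrative and are not taken from the paper.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j and that
    candidate i is the ground-truth match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scored strictly higher than the match.
        rank = 1 + sum(1 for score in row if score > row[i])
        if rank <= k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up values, not results from the paper).
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0: its match ranks first
    [0.8, 0.3, 0.1],  # query 1: its match ranks second
    [0.1, 0.2, 0.7],  # query 2: its match ranks first
]
print(recall_at_k(toy_similarity, 1))  # 2 of 3 queries are hits
print(recall_at_k(toy_similarity, 2))  # all 3 queries are hits
```

Under this convention, tied scores are resolved in favour of the ground-truth item; other tie-breaking rules are possible.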

Correctness: The paper has paid sufficient attention to experiment fairness and its improvements are significant.

Clarity: Overall, the paper is well written. Here are some clarification questions. i) Line 77, what does (1, 1) mean when it comes to Eq. 1? Which positive sample is it referring to? ii) In Fig. 2, where does the Global Context caption come from? Or is it just a proof of concept? If so, please note that in the text. iii) Line 212, this paragraph is quite abrupt. Consider moving it to the supplementary. iv) What's the main difference between HSE in Tab. 1 and HSE in Tab. 2? The numbers do not match (45.6 vs. 44.4). v) In Tab. 1, does replacing CoT mean substituting an average pooling layer?

Relation to Prior Work: Yes, a comprehensive review on related work is conducted. Recent/relevant existing works are acknowledged and compared against in the experiments.

Reproducibility: Yes

Additional Feedback: Final rating: After reading the authors' response, the reviewers agree that the novelties of AF and CoT are limited. CMC is considered novel in the context of video-language retrieval and captioning. However, the overall empirical improvements from the proposed method are rather significant; that is, the somewhat incremental designs are effective across multiple major benchmarks and tasks. To summarize, weighing the novelty of CMC and the model performance, my final rating is leaning towards Accept. Besides, the correct reference for [55] should be: Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI, pp. 13041-13049. 2020.

Review 2

Summary and Contributions: The authors propose a hierarchical transformer model for text-video retrieval and a cycle-consistency loss to better optimize the model, which is closer to optimizing the evaluation metric than metric-learning-based losses are.

Strengths: 1. The proposed approach is straightforward and intuitive. 2. The experiment results on the multimodal retrieval task are shown to be very effective.

Weaknesses: 1. In the ablation studies, CMC has a much larger benefit when AF or CoT is available. I would like to invite the authors to provide some insights on why these components strengthen each other's benefit. 2. Extensive hyper-parameter optimization was done. More details should be provided on how each hyper-parameter choice (e.g., activation function, optimization algorithm) affects the final performance. This will provide a comprehensive understanding of the contribution of this paper. 3. Currently all the experiments are done with the train set and test set only. Therefore, the model may be over-fitting the validation set. Some cross-dataset evaluation or test set evaluation, e.g., ActivityNet val2, should be provided.

Correctness: The experiment setting has some flaws as mentioned above.

Clarity: Yes, it is easy to follow.

Relation to Prior Work: Multi-modal Transformer for Video Retrieval, ECCV 2020

Reproducibility: Yes

Additional Feedback: Final rating: I did not find a clear indication that the results are on ActivityNet val2. Please make this clearer, unless I missed something, and please also provide more comprehensive ablation results on hyper-parameters. The idea of cycle-consistency is interesting to me, and this is my main motivation for maintaining my rating as Marginally above the acceptance threshold. Thanks for the work!

Review 3

Summary and Contributions: This paper introduces a Cooperative Hierarchical Transformer for video-text modeling. The introduced transformer incorporates low-level and high-level semantics. A cross-modal cycle-consistency loss is leveraged to build the connection between video and text. The video-paragraph retrieval results on ActivityNet-Captions demonstrate the effectiveness of the introduced model.
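To make the cycle-consistency idea concrete, here is a minimal sketch of one common way to formulate such a loss with soft nearest neighbours: each sentence embedding is mapped to a soft nearest clip, that soft clip is mapped back to the sentences, and the loss penalizes the gap between the starting index and the soft index reached after the round trip. The reviews do not spell out the paper's exact formulation, so the function names, the squared-Euclidean similarity, and the index-based penalty are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates, weighted by closeness to the query."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    return [sum(w * c[d] for w, c in zip(weights, candidates))
            for d in range(len(query))]


def cycle_consistency_loss(sentences, clips):
    """Mean squared gap between each sentence index and its round-trip index.

    Illustrative formulation (an assumption, not the paper's verified loss):
    sentence -> soft nearest clip -> soft location among the sentences.
    """
    total = 0.0
    for i, sentence in enumerate(sentences):
        soft_clip = soft_nearest_neighbor(sentence, clips)
        back = softmax([-sq_dist(soft_clip, s) for s in sentences])
        soft_index = sum(w * k for k, w in enumerate(back))
        total += (i - soft_index) ** 2
    return total / len(sentences)


# Toy embeddings (made-up values): well-separated, aligned sentence/clip pairs.
sentences = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
print(cycle_consistency_loss(sentences, clips))  # close to zero
```

In this sketch the loss is near zero when every sentence returns to itself after the round trip, and it grows when the clip embeddings collapse so that several sentences cycle back to the same location.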

Strengths: This paper is well motivated. The inter-level cooperation introduces two levels of cues for video-text modeling. The introduced transformer is simple and could possibly be extended to other applications. The attention-aware feature aggregation method is technically sound and effective. The cross-modality cycle-consistency is innovative and interesting.

Weaknesses: 1. The authors claim to be the first to introduce an attention-aware feature aggregation module for video-text transformers. However, ViLBERT introduced a co-attentional transformer layer for image-text modeling, and ActBERT introduced the Tangled Transformer for video-text modeling. The authors are encouraged to add more discussion. 2. What is the value of $\lambda$ on different architectures and datasets? 3. It seems some of the losses have been studied in Zhang et al. [21]. Can the authors summarize the differences between this paper and [21]? 4. In Table 2, the results significantly outperform the previous state of the art. However, when compared to HSE [21] in Table 1, COOT without AF, CMC, and CoT outperforms HSE by a clear margin. How do the authors obtain these improvements?

Correctness: The method is technically sound.

Clarity: This paper is well written and clear.

Relation to Prior Work: The discussion is sufficient.

Reproducibility: Yes

Additional Feedback: Final rating: I would like to thank the authors for responding to my comments. Based on the discussion with the AC and peer reviewers, I would like to keep my rating unchanged. The main concern with this paper is the limited novelty of AF and CoT. But I do like the idea of CMC.

Review 4

Summary and Contributions: In this paper, the authors propose a hierarchical transformer-based architecture for video-text retrieval. There are two major contributions: 1) the design of a hierarchical transformer network and 2) the proposal of a cycle loss based on sentence-clip retrieval. The proposed approach shows good performance on video-paragraph and clip-sentence retrieval tasks. Ablation studies support the effectiveness of the proposed attention-aware feature aggregation layer, contextual transformer, and the cycle loss.

Strengths: 1. The results of the video retrieval task are good. The ablation studies are sufficient to show the effectiveness of the proposed modules in the paper (i.e., attention-aware feature aggregation layer, contextual transformer, and the cycle loss).

Weaknesses: 1. Although the experiments support the effectiveness of the proposed attention-aware feature aggregation layer and contextual transformer well, it might be necessary to better discuss their superiority over other alternatives. Attention-FA looks like conventional attention to me. Could you please expand the discussion in Lines 103-107 and explain why these two differences lead to superior performance? Besides, I wish to clarify whether the CLS in Table 4 refers to the [CLS] output from the T-transformer. As an alternative, would it not be fairer to compare Attention-FA against conventional attention, or against an extra T-transformer layer whose [CLS] output is taken? Other alternatives to the contextual transformer with local-global fusion could be designed to further validate its effectiveness. 2. The proposed framework is only evaluated on video retrieval tasks, which might limit its technical impact. The proposed framework has the potential to work on other video-language tasks such as video grounding, and the extended experiments can better support the method’s effectiveness.
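As a point of reference for the alternatives discussed above, the following sketch contrasts plain average pooling with a generic single-query attention pooling over a sequence of feature vectors. It is not the paper's Attention-FA module; the function names, the scaled dot-product scoring, and the toy values are assumptions chosen only to illustrate what "conventional attention" aggregation could look like as a baseline.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(features):
    """Average the feature vectors with equal weight."""
    n = len(features)
    return [sum(f[d] for f in features) / n for d in range(len(features[0]))]


def attention_pool(features, query):
    """Aggregate features with softmax weights from scaled dot-product scores.

    The query would be a learned vector in practice; here it is passed in
    directly (an assumption made to keep the sketch dependency-free).
    """
    scale = math.sqrt(len(query))
    scores = [sum(x * q for x, q in zip(f, query)) / scale for f in features]
    weights = softmax(scores)
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(len(query))]


# Toy frame features (made-up values).
features = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pool(features))
print(attention_pool(features, [0.0, 0.0]))   # uniform weights, same as the mean
print(attention_pool(features, [20.0, 0.0]))  # weight concentrates on the first feature
```

With a zero query the attention weights are uniform, so attention pooling reduces to average pooling; a query aligned with one feature shifts almost all weight onto it.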

Correctness: The manuscript looks correct.

Clarity: Is it correct that both T-transformer and contextual transformer contain a single transformer layer?

Relation to Prior Work: The references are clear.

Reproducibility: Yes

Additional Feedback: Final rating: Thank you for your feedback. The additional experiments on other tasks address the concern of "limited impact." My concern is still about the source of the improvement and the ablation study design. Overall, I would like to keep the rating unchanged.