NeurIPS 2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Meta Review

The reviews are generally positive: one accept (score 7), two borderline accepts (score 6), and one borderline reject (score 5). All reviewers acknowledged strong empirical performance as a major strength of this paper. They also acknowledged that the proposed CMC (cross-modal cycle-consistency) is novel in the context of video-language retrieval, although the idea of cycle consistency itself is already well explored. However, the reviewers also pointed out that the technical novelty of this work, especially AF (attention-aware feature aggregation) and CoT (Contextual Transformer), is somewhat limited, since Transformers and attention are already widely used for similar tasks. I read the paper carefully one more time. The novelty of AF, CoT, and CMC does indeed seem limited, but they deliver good empirical performance improvements that are well supported by convincing experiments and ablation studies. I believe these constitute a solid contribution to the vision/language community. Therefore, I agree with the reviewers' final recommendation of acceptance.