NeurIPS 2020

Cycle-Contrast for Self-Supervised Video Representation Learning


Review 1

Summary and Contributions: The paper presents a self-supervised method to train video encoders using a combination of temporal cycle-consistency and instance discrimination losses. The proposed method results in improved downstream task performance on a number of action recognition benchmarks.

Strengths: 1. The presented method is quite general and combines two self-supervised strategies (temporal cycle-consistency and instance discrimination). The formulation is quite elegant, as they convert the problem of aligning two sequences of embeddings into aligning the video and frame embedding spaces to train an encoder for video tasks. 2. Good ablation studies showing the importance of the different losses. 3. Experiments showing downstream performance on multiple datasets.

Weaknesses: 1. It would be nice to look at training and validation curves of the different losses. Is the model able to bring all three losses down simultaneously? Are there any tricks related to weighing these losses? 2. Originally TCC aimed to focus on temporally fine-grained actions. It would be interesting to investigate the efficacy of the learned features on tasks beyond action recognition. Recognizing action segments on Epic Kitchens [1] or Breakfast dataset [2] would make a stronger case for this paper. Measuring phase classification accuracy on the Pouring dataset [3] is also an option. [1] https://epic-kitchens.github.io/2019 [2] https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/ [3] Temporal Cycle-Consistency Learning. Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

Correctness: The methods presented seem correct.

Clarity: The paper is well written but some parts need clarification: 1. Use of the word "transformer" in Figure 1 is confusing, as the word might suggest the Transformer architecture, while in the implementation it refers to temporal convolutions. 2. The authors claim they train a 3D ResNet, but except for the last layer the kernel size along the time axis is 1; that is, most of the modules are effectively 2D convolutions. Calling it a 3D ResNet is confusing, as a reader expects the entire network to use 3D convolutions, not just the last layer. 3. How are the 8 frames chosen for training? How are the frames and embeddings chosen for evaluation? What is the fps during train and test?

Relation to Prior Work: Related work is covered well.

Reproducibility: Yes

Additional Feedback: **** POST REBUTTAL **** I thank the authors for the rebuttal and for answering my questions. Overall I like the paper as it pursues a new technique to learn features. They have experiments on a number of video datasets. It would be a stronger paper if they included comparisons to newer papers such as SpeedNet. However, in the rebuttal they state the reasons why it is not fair to compare directly with these papers (SpeedNet uses a different architecture from theirs, and CBT is trained with language supervision). But even without these particular comparisons, this paper has an interesting contribution, namely using cycle-consistency losses to learn video- and frame-level features. As this framework is quite general and easily applicable to different domains (time-series data or data with sets), I hope the community would benefit from learning about this training methodology. For this reason, I retain my previous recommendation of accept.


Review 2

Summary and Contributions: This paper proposes to utilize a cycle-consistency of semantics between video clips and the frames constituting them. Specifically, they enforce two learning objectives: 1) a forward objective to make sure that the representation of a clip is close to the representation of its constituent frames; 2) a backward objective to ensure that the nearest neighbor, in the clip space, of such a frame is the actual clip that contains this particular frame. The authors conducted experiments on UCF101, HMDB, Kinetics400 and MMAct with the tasks of nearest neighbor retrieval and action recognition.
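To make the two objectives concrete, below is a minimal sketch of such a forward/backward cycle with a soft nearest neighbor, written as an in-batch contrastive step. It simplifies to one frame embedding per clip for brevity, and all names, shapes and the temperature are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def cycle_contrast_step(clip_emb, frame_emb, tau=0.07):
    """Hypothetical forward/backward cycle with a soft nearest neighbor.

    clip_emb:  (N, D) clip (video) embeddings, one per video in the batch
    frame_emb: (N, D) frame embeddings, frame_emb[i] taken from video i
    """
    clip_emb = F.normalize(clip_emb, dim=1)
    frame_emb = F.normalize(frame_emb, dim=1)

    # Forward: soft nearest neighbor of each clip over all frames in the batch.
    weights = F.softmax(clip_emb @ frame_emb.t() / tau, dim=1)   # (N, N)
    soft_nn = weights @ frame_emb                                # (N, D)

    # Backward: the retrieved soft neighbor should point back to its own clip,
    # with all other clips in the batch acting as negatives.
    logits = F.normalize(soft_nn, dim=1) @ clip_emb.t() / tau    # (N, N)
    labels = torch.arange(clip_emb.size(0), device=clip_emb.device)
    return F.cross_entropy(logits, labels)
```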

Strengths: + The topic of self-supervised video representation learning is important and very relevant to the NeurIPS community. + The proposed cycle-constraint between frames and clips is interesting.

Weaknesses: The experiments presented in this paper are insufficient in several ways. I will elaborate below. - I am not sure what the authors try to convey through Table 2. It seems they are comparing MSE and contrastive losses, and it has already been widely shown by previous work that the contrastive loss is superior to MSE for this task. Instead, what I think the authors should actually compare to are baselines that can prove the effectiveness of the proposed cycle constraints. For example, I think one important baseline would be directly applying instance discrimination (for which MoCo is a particularly effective variant) on frames; a minimal sketch of such a baseline is given after this list. Another important baseline would be temporal cycle consistency (TCC), so that we can see if the proposed cycle constraint is indeed more effective than cycles that only happen between frames (as is done in TCC). - L216-7, "Clips from the test set are used to query the clips from the training set.": why not query within the test set? - Table 3: more recent and stronger baselines like DPC, CBT ("Learning Video Representations using Contrastive Bidirectional Transformer") and AVSlowFast ("Audiovisual SlowFast Networks for Video Recognition") are missing. - The same applies to Table 4: the entries are outdated and many stronger baselines are missing (see the examples in my previous comment).
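For concreteness, here is a minimal sketch of the frame-level instance discrimination baseline referred to above, written as a plain in-batch InfoNCE loss (MoCo would add a momentum encoder and a negative queue on top of this). The names, shapes and temperature are assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def frame_instance_discrimination(q, k, tau=0.07):
    """Hypothetical frame-level instance-discrimination baseline.

    q, k: (B, D) embeddings of two augmented views of the same B frames.
    Each frame is its own class; all other frames in the batch are negatives.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                          # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device) # positives on the diagonal
    return F.cross_entropy(logits, labels)            # InfoNCE over in-batch negatives
```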

Correctness: The method is mostly correct; however, I do have the following doubts: - It is not clear to me why a soft nearest neighbor needs to be retrieved. For example, for Eq. 2, instead of using \hat{Z} (obtained from Eq. 1) as the query, what is wrong with directly using s_i, the clip representation, as the query? The same question applies to the backward cycle-contrast section as well. - Eq. 6: is there no weight factor balancing each term? Can I assume they are all 1?

Clarity: There are some typos and sentences that are hard to understand; I list some examples below: - L96-100, "We use the output from the last 2D conv ... average pooling as the video representation": these two sentences are hard to understand. - L115-8, "However, the representations ... to the frames of the other videos.": this sentence is hard to understand. - L122, "to be disagree" -> "to be disagreeing"

Relation to Prior Work: Some important related works are missing. I list some examples below: - Though in a different way, the idea of utilizing the frame-clip hierarchy has been explored before in "Self-Supervised Learning of Video-Induced Visual Invariances". This should at least be discussed. - The "Contrastive learning" section of Related Work misses several representative papers like "Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination" and "A Simple Framework for Contrastive Learning of Visual Representations", which gives an incomplete picture of the evolution of this topic and the current state of the art. - The "Self-supervised learning on video understanding" section of Related Work misses several representative papers like "Learning to See by Moving", "Unsupervised Learning of Visual Representations using Videos", "Slow and steady feature analysis: higher order temporal coherence in video", "Learning Video Representations using Contrastive Bidirectional Transformer", and "Audiovisual SlowFast Networks for Video Recognition".

Reproducibility: Yes

Additional Feedback: POST REBUTTAL After reading the authors' rebuttal, I don't think my concerns in the initial review are properly addressed. For example, the authors did not provide any results on the instance discrimination baseline that I suggested in my review. In fact, the authors wrote in the rebuttal that "DPC is a state-of-the-art approach by utilizing the idea of instance discrimination on frame level.", which is not true -- DPC learns self-supervised features through predicting the future. Also, the rebuttal did not address the lack of comparisons to some of the recent video SSL methods (e.g. CBT). Because of all this, I will keep my reject rating from the initial review. -------------------------------------------------------- Given the many weaknesses of the experiment section (i.e., the lack of comparisons to recent video self-supervised learning methods and of a proper ablation study to demonstrate the effectiveness of the proposed cycle constraints), I don't think this paper is ready for publication.


Review 3

Summary and Contributions: This paper learns self-supervised video representations using cycle consistency between the full video representation and individual frame representations. It formulates the problem in a contrastive learning framework, and evaluates the performance of learned representations on downstream video classification benchmarks (e.g. UCF101, HMDB51, etc.) and video/frame retrieval.

Strengths: + The idea of applying cycle consistency between the frames and the full video representation is quite interesting and novel to the best of my knowledge. Video representation learning is relevant to the NeurIPS community. + The empirical results show modest improvement over some recent video representation learning approaches such as DPC [7].

Weaknesses: - The presentation quality of the paper can be improved both in terms of technical formulation and overall flow. Please see the clarity section for details. - From the presented results, it is hard to judge whether retrieval performance is evaluated fairly. For instance, many of the methods in Table 4 are not evaluated on retrieval. The ones that are evaluated in Table 3 either use AlexNet, or use R3D but pre-trained on UCF101 (not on Kinetics, as in the submission). - The penalisation term in Section 3.5 is not adequately motivated. Some more explanation and insight would help here. Also, there is no empirical ablation of this component. Does it bring any major performance increase?

Correctness: There are many errors in the way the method is presented. I'm assuming that they are all typos, mainly judging by the results. Most of the components they use (cycle-consistency and contrastive losses) have established implementations available online, and I would assume that they used these implementations in their method.

Clarity: Clarity is the major issue in this paper. - In Section 3.1 there is confusion between "l" (the number of frames) and "l" as the variable iterating over the latent frame representations {z_li}. I would suggest using a capital L for the count (similar to the others such as M, N, etc.) and "l" for iterating over samples. This would clarify the section a lot. Also, $n \in N$ in line 104 is unnecessary and incorrect, as N is not a set. - The exp(dis(.,.)) in equations (1), (2), (3) and (4) should be exp(-dis(.,.)), as these exp(.) terms require a similarity inside, not a distance. - The indicator function in equations (2) and (4) is not needed, as positive samples should also be in the denominator; please see InfoNCE [21] for details (a generic statement of the form I mean is given below). - Fig. 1 uses the term "Transformer" for a 2-layer MLP. The term "Transformer" already has an established meaning; I would avoid using it this way. - Section 3.6 may not be necessary as it doesn't bring much new information on top of the relevant part of the related work. - The technical presentation of the paper needs a comprehensive check. - Proofreading the paper one more time for language would also be helpful.
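For reference, a generic InfoNCE-style objective written with a distance, so both points above are explicit: the exponent needs a similarity (hence the minus sign in front of dis), and the positive key appears in the denominator sum, so no indicator function is required. This is a standard formulation with assumed notation (q_i, k_j, tau), not the paper's exact equations.

```latex
\mathcal{L}_i \;=\; -\log
\frac{\exp\!\left(-\operatorname{dis}(q_i, k_i^{+})/\tau\right)}
     {\sum_{j=1}^{K}\exp\!\left(-\operatorname{dis}(q_i, k_j)/\tau\right)},
\qquad \text{with the positive } k_i^{+} \text{ included among the } k_j .
```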

Relation to Prior Work: The relevant related works are discussed adequately.

Reproducibility: Yes

Additional Feedback: Update after the rebuttal: I read the other reviews and the rebuttal. The authors address many of the points that I raised in my review. Overall, the results may not be significantly better than the state of the art, but they are mostly competitive. Besides, the core idea in the paper is interesting and mostly novel as far as I know, and they have a decent execution of the idea. I was a bit concerned about the clarity of the equations, but they appear to resolve all these confusions. I think the proposed idea and its execution deserve some attention from the community, although it may not clearly outperform the state of the art. Hence I'll update my review to marginally above accept.


Review 4

Summary and Contributions: This paper proposes a way of self-supervised video representation learning, termed cycle-contrastive learning (CCL). The method utilizes the cycle from videos to frames, then from frames back to videos as a constraint to learn effective video representations. The authors demonstrated the performance on three video datasets.

Strengths: The proposed cycle contrastive learning is new for video representation learning.

Weaknesses: 1. The paper writing needs improvement. For example, Figure 1 needs a more detailed caption to make it self-contained. I read it multiple times to understand what the cycle is and what the pipeline is. Readers may have questions like: which transformer architecture was used? How are the anchor/positive/negative samples formed? What do z11, z12, z13 mean? 2. What is the linear probe accuracy on Kinetics400? That is, if you fix the backbone and only finetune the last fc layer, what accuracy will you get on the Kinetics400 dataset (a sketch of the protocol I mean is given below)? This is an important metric to evaluate the quality of the learned representations. Right now, all the experiments initialize the model with pretrained weights and finetune end-to-end, so it is hard to know whether the pre-training stage learns good features. 3. Performance on several datasets is not comparable to state-of-the-art methods. For example, the recent CVPR 2020 paper SpeedNet, https://arxiv.org/pdf/2004.06130.pdf, reports much higher performance on both the UCF101 and HMDB51 datasets than the proposed method. Take UCF101 as an example: SpeedNet achieves 81.1% accuracy while this submission only obtains 68.3%. I'm not saying you need to beat the state of the art; it is just that the performance gap is quite large. I expect some discussion on this.
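To be explicit about the protocol referred to in point 2, here is a minimal sketch of a linear probe: freeze the pretrained backbone and train only a new fully-connected classifier on Kinetics400 features. The backbone, data loader and hyperparameters are placeholders for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, feat_dim, num_classes, device="cuda"):
    """Hypothetical linear-probe setup: only the new fc layer is trained."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False                    # freeze pretrained weights

    classifier = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for clips, labels in train_loader:             # one epoch shown for brevity
        clips, labels = clips.to(device), labels.to(device)
        with torch.no_grad():
            feats = backbone(clips)                # fixed features from the backbone
        loss = criterion(classifier(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return classifier
```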

Correctness: Yes

Clarity: Mostly yes. As I mentioned in the weaknesses, the method description is not clear enough, and the figures and tables are not self-contained.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Post-rebuttal: I have read the authors' rebuttal. I think it partially addressed my concerns. However, there are two important issues that are not addressed by the rebuttal: one is the lack of comparisons to recent video self-supervised learning approaches (like CBT and SpeedNet), and the other is the lack of experimental results, such as linear evaluation on the Kinetics400 dataset, which I would say is very important for evaluating the quality of the learned representations. Hence, given the paper's current state, I will keep my score as reject.