This paper presents an approach to learning video representations via a contrastive learning framework. All the reviewers like the proposed approach, calling it intuitive and a step in the right direction. They also raised several concerns: (a) comparison to CMC; (b) relation to prior work; (c) reproducibility; (d) UberNCE being an upper bound. The authors submitted a strong rebuttal. They provided comparisons to CMC and promised to improve the discussion of related work and to release code. The UberNCE argument remains a concern, but this is a simple change. The AC agrees with the reviewers and recommends acceptance. Please make all the changes suggested by the reviewers in the camera-ready version.