This paper received mixed reviews: R1 recommends clear accept (score 8), R3 recommends weak accept (score 6), and R2 & R4 recommend weak reject (score 5). All four reviewers agreed that the tri-modal learning formulation (vision, audio, and language) is interesting and that the FAC (fine and coarse) approach is conceptually novel. R1 & R3 acknowledged that the experiments are extensive and convincing, and R4 noted that the ablation studies are well conducted.

EVALUATION -- R2 & R4 shared a concern that some of the comparisons with prior work are problematic because of different experimental protocols (backbone architectures and datasets used for self-supervised pretraining). The rebuttal addressed this by reporting new experimental results obtained with the same backbone/dataset as the baselines. During the discussion phase, R2 acknowledged that the new results resolved this concern. R4 acknowledged the same, but seemed to remain lukewarm because of one potentially inaccurate statement (see R4's review); it would be great if the authors could clarify this carefully in the paper and discuss its implications.

NOVELTY -- R2 raised a concern about technical novelty: although FAC is conceptually novel, the actual implementation builds on existing networks and loss formulations, so the proposed approach does not realize the FAC motivation in a novel way. The authors agreed with this point but argued that their claimed novelty lies in the design of the FAC embedding process. R2 did not seem convinced by this argument despite the rebuttal. I think R2 made a valid criticism here and agree that the concern was not properly addressed. I would also add that the paper does not provide convincing evidence of the superiority of FAC over the alternatives (Shared and Disjoint). This can be seen in Table 1(b): FAC performs better on UCF/HMDB/MSRVTT, but Shared performs better on ESC50 and Disjoint performs better on YouCook2 (and all the differences are marginal, within about 2%). Furthermore, the experimental setup makes it difficult to draw meaningful conclusions: the results are inconsistent across downstream tasks, including action classification (UCF/HMDB), sound classification (ESC50), and video/language understanding (MSRVTT/YouCook2). What conclusions can we draw from such results? When should we use FAC instead of the Shared/Disjoint approaches? Should we choose FAC only for action recognition, or for all scenarios? When should we learn from all three modalities? Is learning from all three modalities always a better idea, or should we be more judicious about the choice of modalities depending on the likely downstream scenarios? Why would bringing in sound & text help solve action recognition, which is inherently vision-centric? (R1 asked a similar question; the rebuttal tried to answer it, but unfortunately the text was cut due to the space limit.) These are hard questions to answer thoroughly, but I think the authors could have run controlled experiments to provide better insights and more generalizable conclusions. That said, despite my added criticisms above, I am convinced about the general direction of multimodal representation learning.
This paper provides good empirical performance on several challenging benchmarks; I especially liked the results on MSRVTT/YouCook2, which have not been used frequently in the self-supervised video representation literature (I personally think we are reaching the limits of UCF/HMDB as testbeds for self-supervised learning). The scale of the experiments is also impressive (HowTo100M+AudioSet pretraining; evaluation on UCF/HMDB/ESC50/MSRVTT/YouCook2). As R4 mentioned, the various ablation studies are well conducted as well. These will set the bar higher for future research in this direction. Given this, I think this paper is worth acceptance. It is imperative that the authors include the new results presented in the rebuttal and discuss the various references pointed out by the reviewers.