Summary and Contributions: This paper introduces a method to learn representations from videos (here composed of visual and audio streams) without manually defined labels. The method alternates between clustering the data using the Sinkhorn clustering algorithm and optimizing the parameters of a neural network to predict these clusters. Clustering is done so that the visual and audio streams of the same video fall in the same cluster. In addition, the authors propose a way to impose different priors on the cluster distribution, which allows for more realistic cluster-size distributions. Equipped with this method, they demonstrate empirically that they recover clusterings of real-world datasets that match manually defined labels better than other state-of-the-art self-supervised video methods.
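To make the clustering step in this summary concrete, here is a minimal pure-Python sketch of Sinkhorn scaling with a non-uniform cluster marginal. It is an illustration under stated assumptions, not the authors' implementation: the function names `sinkhorn` and `zipf_marginal`, the sample marginal `c`, the default value of `lam`, and the toy probabilities are all hypothetical. It assumes the usual entropy-regularised form, in which the plan Q is obtained by rescaling the rows and columns of the kernel exp(lam * log P) until they match the marginals r and c.

```python
import math


def sinkhorn(log_p, r, c, lam=5.0, n_iters=200):
    """Entropy-regularised plan Q with row sums r and column sums c.

    log_p : K x N nested list, log-probability of cluster k for sample n
    r     : length-K target cluster marginal (sums to 1)
    c     : length-N sample marginal (sums to 1)
    lam   : kernel sharpness; the default is an arbitrary illustrative value
    """
    K, N = len(log_p), len(log_p[0])
    # Kernel M = exp(lam * log_p). Subtracting each column's max is absorbed
    # by the column scaling v, so Q is unchanged but overflow is avoided.
    col_max = [max(log_p[k][n] for k in range(K)) for n in range(N)]
    M = [[math.exp(lam * (log_p[k][n] - col_max[n])) for n in range(N)]
         for k in range(K)]
    u, v = [1.0] * K, [1.0] * N
    for _ in range(n_iters):
        # Alternately rescale rows to match r and columns to match c.
        u = [r[k] / sum(M[k][n] * v[n] for n in range(N)) for k in range(K)]
        v = [c[n] / sum(M[k][n] * u[k] for k in range(K)) for n in range(N)]
    return [[u[k] * M[k][n] * v[n] for n in range(N)] for k in range(K)]


def zipf_marginal(K, s=1.0):
    """Hypothetical Zipf-like cluster prior: r_k proportional to 1 / k**s."""
    w = [1.0 / (k ** s) for k in range(1, K + 1)]
    total = sum(w)
    return [x / total for x in w]


# Toy example: 3 clusters, 6 samples, made-up network probabilities.
probs = [[0.7, 0.6, 0.2, 0.1, 0.3, 0.5],
         [0.2, 0.3, 0.7, 0.2, 0.3, 0.3],
         [0.1, 0.1, 0.1, 0.7, 0.4, 0.2]]
log_p = [[math.log(p) for p in row] for row in probs]
r = zipf_marginal(3)   # non-uniform cluster sizes
c = [1.0 / 6] * 6      # each sample carries equal mass
Q = sinkhorn(log_p, r, c, lam=1.0)
# One possible rounding: assign each sample to its highest-mass cluster.
labels = [max(range(3), key=lambda k: Q[k][n]) for n in range(6)]
```

The `argmax` rounding in the last line is only one option; whether and how the paper rounds Q is something this review asks the authors to clarify.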
Strengths: The paper presents the following strengths:
- The paper is clearly written.
- The model description is technically sound. The model extends SeLa in two non-trivial ways: (a) it allows more general prior distributions over clusters and (b) it extends the formalism to the multi-modal case.
- The choice of experiments (although some clarifications are needed, see Weaknesses) convincingly supports the claims of the method.
- The resulting clustering of the video data is better than previous approaches on 3 out of 4 datasets considered. Non-parametric evaluation of representations is also interesting in general as an alternative to linear classification or fine-tuning evaluation.
Weaknesses: Required clarifications: some parts of the work require clarification, see below:
* The description of the exact algorithm is not completely clear to me in the paper (and the appendix). I understand that code is provided, but the algorithm should be clarified in the paper. In particular, is it a purely alternating approach?
  - If yes, how many reclustering steps are done during the 200 epochs of training (in other words, what is the frequency of reclustering)? How many examples are sampled for the clustering stage (is N equal to the number of examples in the dataset)? If I understand correctly, thanks to the probabilistic formulation, there is no need to reinitialize the last linear layer once the data is reclustered; is that correct?
  - If no, it is unclear to me how to apply the algorithm in an online fashion (see later for a related question).
* Decorrelated heads (DH) for clustering: A.2 states that "we found no significant difference in performance between the heads". This seems to contradict the L198 paragraph, "there is no single correct way of clustering a dataset [...]", since A.2 indicates that all resulting labelings perform equally well. What is the intuition for why multiple heads help during training but do not necessarily lead to diverse clusterings at the end (given that DH seems to give a 0.8 NMI improvement)?
* Table 2: what exactly does Alignment (A) stand for? Can the authors confirm that it is the technique of paragraph 182? (I found the term slightly confusing, since alignment between audio and video will happen anyway because the labels are shared.)
* Regarding the modified marginal for r: why not directly use a Zipf distribution rather than a Gaussian prior?
* Table 3: the setup for retrieval is not clear at all. Is only video used here? How many nearest neighbors are employed (the legend says various numbers of nearest neighbors)? What does R@1 mean in that context?
* Is any rounding applied to the matrix Q to obtain integer solutions? (As far as I understand, the lambda term means that the solution might lie in the interior of the polytope.)
Empirical evaluation:
- The NMI metric is not very standard for video classification, so it is hard to judge the significance of +1.0 point there for the ablation studies (e.g. Table 2, which evaluates two important contributions of the work, namely Alignment and G). It would be good to have another, more standard metric (e.g. linear classification on UCF/HMDB) so that readers can better understand the significance of the proposed improvements. This would also help to encourage people to use the NMI metric in the future.
Comment about the practicality of this type of clustering approach at scale:
- Can the authors comment on how well the method scales to arbitrarily large datasets such as HowTo100M or AudioSet? Can the Sinkhorn algorithm scale to such large datasets? If not, what would be the alternative? In particular, have the authors thought about a way to do the clustering in an online manner to avoid the cumbersome clustering step (which also requires maintaining a mapping between examples and labels, which might not always be practical as a full epoch is needed to obtain these labels)? This question is particularly important since the idea of using different marginals for the clustering distribution might matter more when moving to larger uncurated datasets.
Significance and novelty of the contribution:
- At a high level, the method is quite similar to the XDC approach (alternating between clustering the modalities and training the networks to predict the clusters). The difference seems to be the replacement of k-means with the Sinkhorn clustering algorithm and a slightly different merging strategy (where both audio and video are mapped to the same cluster).
Do the authors have a good intuition for why their method is much better at predicting the right cluster structure? (Even when removing the Alignment technique and the Zipf-style distribution, the method still outperforms XDC by a large margin, but conceptually I fail to see the difference.)
- I understand that one of the main goals of the paper is to find a clustering of the data, as opposed to learning representations. But fundamentally, what is the advantage of clustering over directly learning the representations? I would be keen to hear the authors' thoughts on this, as I was not fully convinced why one should try to learn this clustering (in particular, the paper does not really demonstrate any practical advantage but instead observes that the quality of the clusters is better when compared to manually defined labels).
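Since several of the points above hinge on how NMI behaves, a small self-contained sketch of the metric may help readers calibrate it. This is an illustration, not the paper's evaluation code: the helper names are hypothetical, and normalising by the arithmetic mean of the two entropies is an assumption, since other normalisations exist and the review does not state which one the paper uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a labelling."""
    n = len(labels)
    return -sum((cnt / n) * math.log(cnt / n)
                for cnt in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labellings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
               for (x, y), n_xy in joint.items())


def nmi(a, b):
    """MI normalised by the arithmetic mean of the entropies (an assumption)."""
    h = 0.5 * (entropy(a) + entropy(b))
    # Both labellings are constant when h == 0; treat them as identical.
    return mutual_information(a, b) / h if h > 0 else 1.0


# Cluster ids need not match class names: NMI only compares the partitions.
pred = [0, 0, 1, 1, 2, 2]
true = ["a", "a", "b", "b", "c", "c"]
score = nmi(pred, true)
```

Because NMI is invariant to renaming clusters, it can compare discovered clusters with ground-truth classes without an explicit matching step, but its scale is less familiar than accuracy, which is the concern raised under "Empirical evaluation".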
Correctness: Mostly yes (see two above sections for details).
Clarity: Yes (see Strengths), the paper is clearly written.
Relation to Prior Work: To the best of my knowledge, the authors present a representative comparison of their work to previous contributions.
Additional Feedback: Summary of feedback: Overall, the paper tackles an interesting problem with a valid approach. Some clarifications are required so that I can make a final informed decision. Given that, I will give a score of 6 for now.
=== POST REBUTTAL COMMENT ===
The authors have adequately answered my comments in the rebuttal. I encourage them to incorporate the changes and clarify the parts of the paper as suggested in my review. For this reason I am willing to keep my positive score. I have one last request for the final version of the paper, see below (hence keeping my score of 6). There is one thing that I realized after reading "First, as shown in Tab. 1(b), simply using SK (24.7% NMI) instead of k-means (18.1% NMI) improves performance, likely for the same reasons that SeLa outperforms DeepCluster." in the rebuttal. If SK is much better than k-means as a clustering algorithm, why use k-means to cluster the features from the other self-supervised works (L91 in the paper) instead of also using SK there? It would be great if the authors could add this comparison to the paper.
Summary and Contributions: The paper proposes a method for clustering unlabeled videos by correlating visual and audio modalities. It extends SeLa by relaxing the cluster sizes to a non-uniform distribution. It proposes modality splicing transformations for performing modality-agnostic clustering and further improves the results by de-correlating the clustering heads for different modalities. It creates several video clustering baselines from state-of-the-art video representation learning methods, and shows that the proposed method generates state-of-the-art clustering results.
Strengths:
- The paper demonstrates that unsupervised labeling of videos requires joint optimization of the representation and clustering, and that naively clustering pre-learned representations is suboptimal.
- Relaxing the uniform-distribution constraint of SeLa allows the proposed model to match any marginal distribution, which is helpful for real-world data, which tends to be unbalanced.
- Its idea of modality alignment for capturing the natural correspondences between different modalities for unsupervised clustering is novel.
- It demonstrates clear quantitative and qualitative state-of-the-art results on 4 datasets, showing the effectiveness of utilizing multiple modalities. It also shows the usefulness of the different components via ablation studies, and that the learned representations can be useful for other downstream tasks.
Weaknesses: To demonstrate the effectiveness of the proposed (1) audio-visual alignment and (2) clustering with arbitrary prior distributions, an ablation experiment with both disabled would be helpful (and can be included in Table 2). Since the paper shows the baseline method of SeLa with its original implementation, for fairness, it could create another baseline that replaces the image feature encoder of SeLa with an audio-visual encoder.
Relation to Prior Work: Yes.
Additional Feedback: The paper proposes a novel approach for multi-modal unsupervised clustering of videos, and demonstrates state-of-the-art results. It would be helpful if the authors could address the ablation experiments mentioned under "weaknesses". ===== Post-rebuttal comments: The rebuttal adequately addresses my questions. My rating remains positive.
Summary and Contributions: This paper enables pseudo-labeling of a video dataset without any human annotations. The aim is to leverage the natural correspondence between the audio and visual modalities. This paper establishes video clustering benchmark results on four datasets. This paper also clusters multi-modal data, which is interesting.
Strengths: This paper is well written and well organized. The motivation is clear, and the method has wide real-world applicability. The idea of clustering under skewed distributions is technically sound. The results are sufficient and well presented, and they significantly outperform the state of the art.
Weaknesses:
1. Section 3.1 mainly summarizes the technical details described in . The main contribution is in Section 3.2. I suggest that the authors reorganize these sections to highlight the contribution, which would also help readers understand it.
2. In "Decorrelated clustering heads", the authors generate two augmented examples. I don't understand why doubling the number of examples increases the cost of the algorithm by only a small amount.
3. In Table 2, the authors compare against SeLa. How does a SeLa baseline that uses both visual and audio features perform?
Correctness: The method is technically sound.
Clarity: This paper is very well written.
Relation to Prior Work: The references and discussions are sufficient.
Additional Feedback: Final rating ================= I am satisfied with the authors' feedback. I would keep my rating unchanged.
Summary and Contributions: This paper presents a way to learn from unlabeled videos by using clustering across multiple modalities. The biggest contribution is in the formulation of a clustering method that takes multiple modalities to a single cluster label.
Strengths: The paper is well written, understandable, and makes a nice contribution. The experiments are thorough and show the benefit of the approach.
Weaknesses:
- The claim "In this paper, we are thus interested in developing methods to cluster video datasets without manual supervision, substantially reducing the cost of labelling video data." is a bit misleading, given that the method is evaluated on existing datasets by ignoring the labels. As a result, the method is still trained on human-annotated data: the videos in these datasets (e.g., Kinetics) have had significant annotation in the selection of their temporal intervals and contents, and they already belong to a set of actions. To support this claim, unlabeled and unannotated videos should be used instead, since the additional cost of labeling an action is quite small compared to the temporal selection of an action in a clip.
- Some of the ideas seem quite similar to those introduced in Evolving Losses for Unsupervised Video Representation Learning . It would be good to better clarify the conceptual differences between these works. Specifically, "As part of this, we wish to account for the fact that semantic classes are not all equally probable, and tend instead to follow a Zipf distribution [1, 39]. We then evaluate the quality of the discovered labels by matching them to the ones provided by human annotators, using datasets where ground-truth labels are known" seems very similar to the ideas presented in .
- It would also be good to compare against  in Table A.5.
Relation to Prior Work: Mostly, see weaknesses.
Additional Feedback: For the broader impacts section, specifically "Few-label harmful content detection", I think it would be good to discuss potential biases in the few-label setting. For example, if the few-label finetuning data contains only one race, will the method work accurately on other races? Overall, the broader impacts section seems to focus only on the positive side of the paper and does not really discuss any possible negative impacts, which were asked for. The main negative discussed essentially blames the user, "Perhaps the most direct risk is that a user of the algorithm might overestimate its capabilities.", which does not seem to fit the goal of the broader impacts section of discussing the implications of the work. Overall, the paper is quite good and interesting. I would like to see the comments and concerns addressed to further strengthen the paper.