NeurIPS 2020

The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes


Review 1

Summary and Contributions: The paper proposes a challenge in detecting hate speech in multimodal memes (text + image; the problem is posed as binary classification). The dataset is constructed such that it requires models to perform "multimodal reasoning" to succeed. The authors perform baseline evaluations with a set of uni- and multimodal models and show that these models fall short of human performance by a considerable margin.

Strengths: - An interesting task with both research (multimodal reasoning) and practical (social media moderation) applications. - Careful construction of the challenge dataset that makes it difficult for systems to "cheat" by exploiting solely a single modality. - Detailed framing of the work in related literature. - There are plans to run the challenge as a public competition with an "unseen" test set. - An interesting result indicating that there is room to grow in terms of multimodal pretraining, as the difference between unimodally and multimodally pretrained models is relatively small (some unimodally pretrained models perform better than the multimodal ViLBERT CC model).

Weaknesses: - Main weakness: the results section contains very little analysis. At the very least, it'd be useful to indicate how accuracy differs across the different classes of memes in the test dataset (multimodal vs. unimodal hate, benign image/text, other random non-hateful). - Additional analyses could be performed on the dev set, using the annotations from the appendix Table B.1. - The dataset is not large (10k items), and it is therefore unclear whether considerable gains in performance could be attained by constructing a few thousand additional memes. An additional evaluation (at least of the top-performing model(s)) using subsets of the training set of different sizes could shed some light on this; see the sketch after this list. - Generalization to real-world memes may be limited by the fact that a single tool was used to generate all the memes in the dataset (discussed in Section 2.2).
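For concreteness, a minimal sketch of the suggested training-size ablation; the train_fn/eval_fn callables are hypothetical stand-ins for the authors' own training and evaluation code:

    import random

    def learning_curve(train_set, dev_set, train_fn, eval_fn,
                       fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
        """Retrain on nested subsets of the training data and report the
        dev metric per subset size."""
        rng = random.Random(seed)
        shuffled = list(train_set)
        rng.shuffle(shuffled)
        results = {}
        for frac in fractions:
            subset = shuffled[: int(frac * len(shuffled))]
            model = train_fn(subset)                   # hypothetical hook
            results[frac] = eval_fn(model, dev_set)    # e.g. accuracy or AUROC
        return results

A curve that flattens out before the full training set would suggest that constructing a few thousand additional memes is unlikely to close the gap to human performance.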

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: - It'd be helpful to indicate in Table 1 which rows correspond to early/middle/late fusion to ease reading the results. - Line 236 is missing a closing parenthesis. - It's not clear how exactly the unimodal versions of ViLBERT and VisualBERT were trained. - Should the appendix Table B.1 be labelled Table C.1 (as it relates to Appendix C)? - Line 175: "further filtering to remove low-quality examples" - can you detail what kind of filtering was performed here? --After author response-- The additional analyses described in the response (varying training dataset sizes, model failure modes) will make the paper and the remaining challenges clearer.


Review 2

Summary and Contributions: The paper introduces a novel dataset (the Hateful Memes Challenge) with 10k annotated image+text memes. In the paper, the authors detail the data collection and annotation procedures. Each meme in the dataset comes with its associated text in plain-text form, as well as the proper licence for use. Finally, the authors evaluate a number of unimodal and multimodal models on their dataset and show that multimodality is necessary to solve the task correctly.

Strengths: - The dataset contains counterfactual examples, which enable machine learning practitioners to improve their models through better language-image modelling. - The dataset is well curated, the meme annotation seems rigorous, and the final labels are of high quality. - The topic of detecting hateful content on the internet is highly relevant to our times. Improving systems that detect hateful speech will improve the quality of public debate as well as reduce attacks on the internet. I think the topic is relevant and the contribution significant. - The model performance reported by the authors seems to indicate that current models are still far from human performance. This means there is still a lot of work to do on the topic, and the dataset will provide the community with a good reference for advancing it.

Weaknesses: - In my opinion, the main weakness of the paper is the reduced size of the dataset. The authors propose a dataset with around 10k examples, of which 10% is used for testing. This means that the test performance that will drive the field will be computed over roughly 1,000 samples, which in my opinion is very limited (see the back-of-the-envelope calculation after this list). From the paper description, I understand the annotation is very costly, but I wonder whether the cleanliness of the labels compensates for the small amount of overall data. I believe the authors should analyse in depth the effect of dataset size on the models. - Did the authors check whether there is an added bias from using images from Getty? As the images do not come from the original memes, the distribution could differ between the newly defined memes and the original ones. - It would be interesting to look at the failure modes of the models. Are the models consistent in their failures? Are there particularly hard categories in the task? - How are the authors dealing with non-standard text (acronyms, non-existent words, etc.)? As the memes are collected in the wild, I would assume some of the text is non-standard. Are the authors correcting the text when that happens?
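To make the test-set-size concern concrete, here is a back-of-the-envelope estimate of how noisy a binary accuracy measured on 1,000 samples is, assuming simple binomial sampling with worst-case variance at p = 0.5:

    import math

    n = 1000
    p = 0.5  # worst-case variance for a binary metric
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"95% CI half-width: +/- {half_width:.3f}")  # ~0.031

A roughly +/- 3-point 95% confidence interval means that small accuracy differences between models on such a test set may not be meaningful.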

Correctness: Yes. The authors describe the methodology very well in the paper, and all the claims and methodology look correct.

Clarity: Yes, the paper is very clear and well written. The authors clearly describe the different steps taken to build the dataset as well as the baselines used.

Relation to Prior Work: Yes, the authors highlight in the related work section the differences between previously published work on hate speech and their dataset and approach.

Reproducibility: Yes

Additional Feedback: - Line 236 has a typo: a parenthesis is not closed.


Review 3

Summary and Contributions: This paper proposes a new dataset for multimodal hate speech detection. The dataset is carefully designed to be difficult for unimodal prediction. Several baseline models, both unimodal and multimodal, are provided in the experiments section. The paper also finds that state-of-the-art methods perform poorly compared to humans.

Strengths: + This paper collects a new dataset specifically for multimodal hate speech detection, and the dataset may be beneficial to the relevant research community. + This paper presents the annotation process in detail, which provides a reference method for future dataset collection. + Both unimodal and multimodal baseline models are discussed in the paper.

Weaknesses: - This paper is not well organized. A proper dataset paper should emphasize dataset analysis and comparison with previous datasets. However, this paper focuses instead on the tedious annotation process, which is not that important and ought to be put in the supplementary materials. By and large, this paper reads more like a technical report than a standard, qualified NeurIPS paper. - A detailed dataset analysis can't be found in the paper. How large is the proposed dataset? What is its superiority over previous datasets? As the definition of hatefulness in Section 2.1 illustrates, hate speech naturally divides into several categories. Therefore, the proposed dataset should be highly structured. What is the structure of the dataset? For example, the authors could provide several samples concerning ethnicity and religion. - As mentioned in the introduction, the determination of multimodal memes is often subtle. Different people may have different opinions about the same meme. Must the opposite of hateful speech be harmless speech? Hate speech detection is NOT a strictly binary classification problem. Hence, the modelling of the task in the paper is inaccurate. Could the authors present a candidate solution for this? - The baseline models provided in the experiments section are too simple. More advanced multimodal information fusion methods, like gated fusion, should be further explored (a minimal sketch follows this list). Gated-fusion paper: DeepDualMapper: A Gated Fusion Network for Automatic Map Extraction using Aerial Images and Trajectories. AAAI 2020.
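For reference, a minimal gated-fusion module in PyTorch, illustrating the kind of mechanism meant above; this is a hypothetical sketch, not the implementation from the cited AAAI 2020 paper:

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        """Fuse image and text features with a learned sigmoid gate."""
        def __init__(self, img_dim, txt_dim, hid_dim, n_classes=2):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, hid_dim)
            self.txt_proj = nn.Linear(txt_dim, hid_dim)
            # The gate decides, per hidden dimension, how much of each
            # modality to keep in the fused representation.
            self.gate = nn.Linear(2 * hid_dim, hid_dim)
            self.classifier = nn.Linear(hid_dim, n_classes)

        def forward(self, img_feat, txt_feat):
            v = torch.tanh(self.img_proj(img_feat))
            t = torch.tanh(self.txt_proj(txt_feat))
            g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))
            fused = g * v + (1.0 - g) * t
            return self.classifier(fused)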

Correctness: The claims are correct, but the methodology has a problem: the authors should present a structured dataset, not a randomly organized one.

Clarity: No, the structure of the paper has a big problem, as discussed in the weaknesses section.

Relation to Prior Work: No. A comprehensive comparison with the datasets in [25] and [78] is NOT presented at all. Only a clear comparison with previous datasets can make the proposed dataset meaningful and stand out.

Reproducibility: Yes

Additional Feedback: ====== Post Rebuttal ====== More analysis and comparison with previous datasets should be provided. In the end, I decide to give a 5.


Review 4

Summary and Contributions: This paper introduces a newly created dataset, Hateful Memes. The dataset is intended for evaluating multimodal understanding models on the hateful memes detection task, in which a model takes a meme (image + text) and predicts whether it is hateful. The dataset was created by experienced and trained annotators from a third-party company, which ensures data quality. Experiments on a large variety of baseline models showed that even the best model largely underperforms humans, implying that the dataset is challenging.

Strengths: 1. The hateful memes detection task is of great practical importance. For example, as mentioned in the paper, it can help with controlling malicious content on social media. 2. The dataset was created carefully. The concept of "hateful" is rigorously defined and followed throughout the dataset collection process. Annotators are from a third-party company instead of crowdsourcing platforms. Each annotator was trained for 4 hours with feedback to improve their performance. When annotators disagree, expert annotators make the final decision. Also, given this rigorousness, the dataset is reasonably large (10K examples). 3. Adding the "benign confounders" makes the dataset not easily solvable by unimodal models, thus requiring "real" multimodal understanding. This makes the dataset potentially very helpful for the multimodality area, since (as mentioned in the paper) complex unimodal models already achieve very high performance on many current benchmarks.

Weaknesses: If I understand correctly, the way the "benign confounders" are added might introduce a slight bias into the dataset: for each multimodal hateful meme, the image or text is likely to appear more than once in the dataset. For example, when a model sees an image it has seen before, and the previous meme was non-hateful, this one is likely to be hateful; conversely, if an image has never been seen before, it is more likely not hateful. I don't know if this will be a problem; a quick check is sketched below.
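A simple sanity check for this reuse bias would compare the hateful rate among memes whose image appears more than once in a split against memes with a unique image; the img_id and label field names here are hypothetical:

    from collections import Counter

    def reuse_label_rates(examples):
        """examples: list of dicts with an 'img_id' key and a binary
        'label' key (1 = hateful). Returns the hateful rate among memes
        with a reused image vs. a unique image."""
        counts = Counter(ex["img_id"] for ex in examples)
        reused = [ex["label"] for ex in examples if counts[ex["img_id"]] > 1]
        unique = [ex["label"] for ex in examples if counts[ex["img_id"]] == 1]
        rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
        return rate(reused), rate(unique)

A large gap between the two rates would mean a model could exploit image reuse frequency instead of doing multimodal reasoning.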

Correctness: The task is rigorously defined and the dataset collection is carefully designed and executed to ensure the quality of the dataset. The experiment design and results analysis are logically sound.

Clarity: The paper is clearly written and logically consistent.

Relation to Prior Work: The authors relate this work to hate speech research, which is the practical field of this work, and to vision-and-language tasks, which is the technical field. Another line of related work is multimodal models. Many relevant models are mentioned in the Models section; however, it may still be helpful to discuss them in Related Work to cover their progress and impact in practice.

Reproducibility: Yes

Additional Feedback: Typos: Line 236: closing parenthesis missing. ====== After author response ====== I have read the author response. I admit that the possible skew in the dataset might not be an essential problem - I just wanted to raise this point. Indeed, it would be great if some analysis of this could be added to the paper. I agree with the other reviewers' suggestions to add more analysis of model performance on different types of samples, as well as a more detailed comparison with previous datasets (like [25]). These would be helpful too.