NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1959 Predicting the Politics of an Image Using Webly Supervised Data

### Reviewer 1

Originality: This paper proposes a brand new dataset that is unique in that it contains images paired with text, along with bias labels (both noisy labels from the source and human labels). The methods described are similar to existing distant supervision techniques, though the authors use them for new analysis in this domain.

Quality: The experiments seem sound to me. The authors test on two different sets of labels and achieve consistent results that are reasonable (e.g. OCR performs better with logos). I think there is room to add more analysis of the distant supervision technique, and possibly an ablation of the approach to verify how much of the performance gain comes from each component. Did you verify how much performance differs with and without stage 2 of the training? Similarly, is there a “sweet spot” where the amount of text data used helps vs. hurts during training?

Clarity: Overall, the paper was well structured and easy to read, with only a few points that I found confusing. I would like clarification on the Ours-GT model described briefly on line 242, because the wording was somewhat ambiguous. What do you mean by “ground truth text embeddings”?

Significance: The dataset itself is a potentially high-impact contribution. The techniques and analysis are also interesting to read and indicate possible avenues for future research.

### Reviewer 2

1. Problem setting: The problem of predicting political affiliation from news media articles is relevant and important. I am not convinced that the assumption of not having text at test time is a necessary one, or even a good one. This assumption is not well motivated by the authors, and it strongly influences and limits the approach. I view this paper as tackling a real-world problem (fairly applied) but unfortunately making strong and unnecessary assumptions to solve it, which result in both poor performance (Table 1 shows a gap of 9 points because of this assumption) and an unnecessary two-stage approach (Figure 2). A real-world application should not throw away information or entire modalities without good reason.
2. Dataset: The dataset collected in this work is original, and I do not know of another large dataset containing news media articles and affiliations. From the few qualitative examples of images in the paper, the dataset seems to have a lot of visual variety.
3. Approach: I have stated my reservations about the problem setting above. I think the assumption of not having text at test time strongly influences the approach. In the first stage, a model is trained using paired images and text. This uses a ResNet to extract image features and a Doc2Vec model to extract text features. The two features undergo late fusion and are then input to a classifier. In the second stage, a linear classifier is trained only on the fixed ResNet features. The first-stage training approach seems to be a standard late fusion method. The second stage seems unnecessary to me. Also, training a linear classifier on top of fixed ConvNet features is not uncommon.

Questions

1. In L253, the authors say that the JOO method is trained on closeups of politicians, and thus performs weakest on the 'broader dataset' collected by the authors. In Table 2, however, the JOO method seems to perform best on "No people". These two statements do not seem to agree with one another. To add to the confusion, Table 2 also shows that the JOO method's performance on "Closeup" is the method's worst performance (compared to symbols, text, etc.).
2. What is the performance of the first-stage model? Is it the one denoted by Ours (GT) in Table 1? The results for the Ours (GT) model have not been reported in Table 2.
3. If the assumption of not having text at test time is necessary, the authors should show why their particular style of modeling (two-stage) is necessary. How about an approach like DeViSE (Frome et al.)? The ConvNet takes the image as input and has two heads, one to predict the Doc2Vec embedding and the other to predict the political affiliation. Or simply take the top N words in the corpus and then ask the ConvNet to predict those words (along with the political affiliation).
4. What is the "fusion" used in Figure 2? It is never mentioned in the paper. Do you concatenate the features?
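To make the two-head alternative in question 3 concrete, here is a minimal sketch of the per-example training loss such a model might use. It is only one possible reading of the suggestion, not the authors' method. The function names, the weighting parameter `alpha`, and the use of a squared-error loss on the embedding head are all assumptions made for illustration; in particular, the squared-error term is a simplification chosen for brevity, not the loss from Frome et al. The image feature vector stands in for the output of a ConvNet trunk.

```python
import math

def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def two_head_loss(img_feat, text_emb, label, text_head, cls_head, alpha=0.5):
    """Joint loss for one training example (hypothetical formulation).

    img_feat  : image feature vector (stand-in for a ConvNet trunk output)
    text_emb  : embedding of the paired article text (available in training only)
    label     : class index of the political affiliation (0 or 1)
    text_head : (weights, bias) mapping img_feat -> predicted text embedding
    cls_head  : (weights, bias) mapping img_feat -> class logits
    alpha     : assumed weight on the embedding-regression term
    """
    # Head 1: regress the text embedding from the image feature.
    pred_emb = linear(*text_head, img_feat)
    emb_loss = sum((p - t) ** 2 for p, t in zip(pred_emb, text_emb)) / len(text_emb)

    # Head 2: classify the affiliation from the same image feature.
    probs = softmax(linear(*cls_head, img_feat))
    cls_loss = -math.log(probs[label])

    return cls_loss + alpha * emb_loss

def predict(img_feat, cls_head):
    """Test-time prediction uses the image alone; the text head is not needed."""
    logits = linear(*cls_head, img_feat)
    return max(range(len(logits)), key=logits.__getitem__)
```

In this formulation the text supervises the shared image features during training, while `predict` needs only the image, so the no-text-at-test-time assumption is respected in a single training stage.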

### Reviewer 3

Originality: There have been other works that look into visual bias for things like advertisements, as well as other papers that consider political bias in natural language. The paper does a good job of outlining these works and showing where its model is different. Understanding bias from images seems unique and interesting.

Quality: The paper is well cited and put into proper context. There do not appear to be any technical errors in terms of how the model is presented/trained. When splitting the dataset into train/test splits, are individual sources placed into either train or test? E.g., are Breitbart images found in both train and test? If not, could there be a source-specific bias which is learned (and not helpful for understanding political bias)?

Clarity: The prose is clear and the paper is quite enjoyable to read. There are some important details missing, though. In particular, I did not understand how the fusion layer was implemented. I also do not understand what Ours (GT) is in Table 1. It would also be helpful to understand what kinds of errors the bias model is making; the best model in Table 2 is at 62% (chance would be at 50%). What kinds of things does the current model not understand? Results in Table 1 of the supplemental are somewhat helpful, but it would be helpful to know if things like better pose understanding or sentiment analysis would improve results.

I see the data collected as a primary contribution of the paper. A discussion of how this data could be used by others in the community would also be helpful. Is the intention to have a benchmark task on bias prediction, or are there other aspects of the dataset that would be useful to researchers? It seems like an interesting set of data, but it would be helpful to have this outlined a bit more explicitly. It could also be helpful to consider datasheets for datasets for this dataset (Gebru et al. Datasheets for Datasets. arXiv).

It is unclear if the data is biased in such a way that the model is not learning about useful/interesting visual bias; in particular, if more right-leaning articles discuss gun laws, perhaps the model can learn that any image related to gun laws is right-leaning. This is somewhat shown in Figure 3 of the supplemental, where (for example) a picture of a person on a red carpet is considered "left". Finally, it would be interesting to also consider politically neutral images. It seems that when collecting the dataset these images were just thrown out. Is this mainly because finding truly neutral images is hard? Understanding if an image is neutral seems important as well.

Significance: The significance of this paper comes from the question being asked (can we learn political bias from images?). The collected dataset could be useful for other people interested in studying bias in images. Additionally, the authors include a variety of annotation types (e.g., explanations from humans about decisions) which could be helpful for different types of analysis. The experiments are plentiful. The authors not only consider bias prediction but also image editing to make an image more "left" or "right", accuracy breakdown across different kinds of images, image-text alignment, etc. It would be interesting to see if any sort of interpretability methods could be used to shed light on the results. E.g., when making a prediction, what is the most important part of the image for the model to consider? There are no strong claims about their model (is it not novel in comparison to other methods for learning with privileged information?). The paper could be more significant if the model was run on another similar task with good results. I lean towards accept because I think the kinds of questions being asked in this paper are important, and I would like to encourage more work like this at ML conferences.

UPDATE: After reading other reviews, I increased my score to 7.
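The source-level split raised under Quality can be sketched as follows. This is an illustrative assumption about how one might check for source-specific bias, not a description of how the authors split their data. The function name `split_by_source`, its parameters, and the `outlet_*` placeholder names in the usage example are hypothetical.

```python
import random
from collections import defaultdict

def split_by_source(examples, test_fraction=0.2, seed=0):
    """Split (source, item) pairs so that no source appears in both splits.

    examples      : iterable of (source_name, item) tuples
    test_fraction : approximate fraction of *sources* held out for testing
    seed          : seed for the shuffle, for reproducibility
    """
    by_source = defaultdict(list)
    for source, item in examples:
        by_source[source].append(item)

    sources = sorted(by_source)  # sort first so the seeded shuffle is deterministic
    random.Random(seed).shuffle(sources)

    # Hold out at least one source; with a single source, train will be empty.
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])

    train = [(s, x) for s in sources if s not in test_sources for x in by_source[s]]
    test = [(s, x) for s in sources if s in test_sources for x in by_source[s]]
    return train, test

# Usage with placeholder outlet names (hypothetical data).
examples = [
    ("outlet_a", "img1"), ("outlet_a", "img2"), ("outlet_b", "img3"),
    ("outlet_c", "img4"), ("outlet_d", "img5"), ("outlet_e", "img6"),
]
train, test = split_by_source(examples, test_fraction=0.2)
```

Comparing accuracy under such a split with accuracy under a random image-level split would give one indication of how much the model relies on source-specific cues. Balancing left and right labels across the two splits is omitted here for brevity.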