Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposes a new method, Mutual Iterative Attention (MIA), for improving the representations used by common visual-question-answering (VQA) and image-captioning models. MIA works by repeatedly executing 'mutual attention', a computation similar to the self-attention operation in the Transformer, but where the lookup ('query') representation is conditioned on information from the other modality. Importantly, the two modalities involved in the MIA operation are not vision and language: they are vision and 'textual concepts' (which the authors also call 'textual words' and 'visual words' at various points in the paper). These are actual words referring to objects that can be found in the image. The model that predicts textual concepts (the 'visual words' extractor) is trained on the MS-COCO dataset in a separate optimization from the captioning model.

Applying MIA to a range of models before attempting VQA or captioning tasks improves their scores, in some cases above the state of the art. It is a strength of this paper that the authors apply their method to a wide range of existing models and observe consistent improvements. This indicates the promise of the work, and it seems to me that for this reason the reviewers have recommended that the paper be accepted.

However, the reviewers have also raised some concerns, which I share. Like each of the other reviewers, I find that the overall method is not clearly explained in the paper. It took me many readings to reach some understanding of what textual concepts were (partly because the authors give them different names in different parts of the paper). The authors apply MIA to many different models, each of which works in quite a different way, and it is (still) not clear to me exactly how it interacts with each of these existing models. As a concrete example, despite helpful discussions with the reviewers, we are all still somewhat confused by Tables 1-3.
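To make concrete my reading of the mutual-attention step, the following is a minimal sketch: standard scaled dot-product attention in which the queries come from one modality while the keys and values come from the other, with the two modalities alternating roles across iterations. All names, shapes, and the number of rounds here are my own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mutual_attention(queries_src, keys_vals_src, d_k=64):
    """One cross-modal attention step (illustrative, not the paper's code).

    queries_src   : (n_q, d_k) features of one modality (e.g. visual regions)
    keys_vals_src : (n_k, d_k) features of the other (e.g. textual concepts)
    """
    # Scaled dot-product scores between the two modalities
    scores = queries_src @ keys_vals_src.T / np.sqrt(d_k)       # (n_q, n_k)
    # Row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each query is refined as a mixture of the other modality's features
    return weights @ keys_vals_src                              # (n_q, d_k)

rng = np.random.default_rng(0)
visual = rng.standard_normal((36, 64))   # e.g. 36 region features (assumed)
textual = rng.standard_normal((10, 64))  # e.g. 10 predicted textual concepts

# "Iterative" refinement: alternate which modality supplies the queries
for _ in range(3):                       # number of rounds is an assumption
    visual = mutual_attention(visual, textual)
    textual = mutual_attention(textual, visual)
```

If this reading is correct, each modality's representation is repeatedly re-expressed in terms of the other, which would explain why the method can be bolted onto existing models that consume either representation; a clearer statement to this effect in the paper would help.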
Is it really possible to train a captioning model without access to any visual features (i.e. based solely on 'textual concepts')? This may be the case, but if so it must be explained much more clearly for a reader who is not familiar with applying textual concepts to captioning without visual features.

Another concern is that using textual concepts typically requires more training data (the data needed to train the extractor). As the authors point out in their rebuttal, for the MS-COCO task this is not the case, since the extractor was trained on precisely the MS-COCO training data. However, for the other tasks, it is fair to say that more data is being used to train a model that uses textual concepts. I think this is in fact an interesting application of transfer learning, but, as mentioned above, it is not made clear in the paper. It took me a long time to work out that this is what was going on, and in my opinion this *must* be discussed more explicitly and openly for the work to meet the standards of transparency and clarity expected at NeurIPS.

In short, I agree with the reviewers that this is a promising method that can improve image captioning and VQA systems (and potentially any model that mixes vision and language). However, if the reviewers' recommendations are followed and the paper is accepted, then it is my recommendation that the authors comprehensively rewrite parts of the paper to give a clear explanation of: a) what textual concepts are; b) how a captioning model can be trained directly from them (without access to the underlying image); c) how much data is required to train a textual concept extractor; and d) how exactly MIA is applied to the range of existing models considered in the paper.