Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper describes a method for integrating visual and textual features within a self-attention-like architecture. Overall I find this to be a good paper presenting an interesting method, with comprehensive experiments demonstrating the method's capacity to improve a wide range of models in image captioning as well as VQA. The analysis is informative, and the supplementary materials add further comprehensiveness. My main complaint is that the paper could be clearer about the current state of the art in these tasks and how the paper's contribution relates to it. The paper apparently presents a new state of the art on the COCO image captioning dataset by integrating the proposed method with the Transformer model. It doesn't, however, report what happens if the method is integrated with the prior state-of-the-art model SGAE -- was this tried and shown not to yield improvement? I also found it odd that the identity of the current state-of-the-art system, and the authors' surpassing of it, is mentioned only in the caption of one of the tables and not in the text of the Experiments section (though the new state of the art is mentioned in the intro and conclusion). Making all of this more explicit in the text of the Experiments section would help contextualize the results. Also, because most of the reported results show improvements over systems that are not state-of-the-art, it would be good to be clear about the importance of these results and what we should take away from them. Are these still strong baselines? Is it simply useful to show that the refined representations are useful across these different categories of systems?
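For concreteness, the mechanism as I understand it from the paper (a later comment quotes line 107: unlike self-attention, the query matrix comes from the other modality) amounts to cross-modal scaled dot-product attention. Below is a minimal NumPy sketch of that idea; all names, dimensions, and the random projections are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, other_feats, d_k=64, seed=0):
    """Scaled dot-product attention where queries come from one modality
    and keys/values from the other (hypothetical sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # Stand-ins for learned projection matrices, drawn at random here.
    W_q = rng.standard_normal((query_feats.shape[-1], d_k))
    W_k = rng.standard_normal((other_feats.shape[-1], d_k))
    W_v = rng.standard_normal((other_feats.shape[-1], d_k))
    Q = query_feats @ W_q        # (n_q, d_k)   queries from modality A
    K = other_feats @ W_k        # (n_kv, d_k)  keys from modality B
    V = other_feats @ W_v        # (n_kv, d_k)  values from modality B
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V   # (n_q, d_k): A's features refined by B

# Example: 5 textual-concept embeddings attend over 10 visual region features.
text_feats = np.random.default_rng(1).standard_normal((5, 300))
visual_feats = np.random.default_rng(2).standard_normal((10, 2048))
refined_text = cross_modal_attention(text_feats, visual_feats)
```

Iterating this in both directions (text queries over visual keys/values, then vice versa) would give the mutual refinement the paper seems to describe.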
Similarly, though the VQA section does identify a state-of-the-art model in the text and report results surpassing that model, the results are on the validation set, so we are left without knowing how this method contributes to performance on the test set relative to the state of the art. The supplementary materials mention that this has to do with difficulties in running on the test set, but this is a bit unsatisfying.

Minor:
line 188: overpasses --> overtakes/surpasses
line 207: that most --> that are most
line 236: tailored made --> tailored (or tailor-made)
line 242: absent --> absence
The overall idea is interesting, and experiments show that it consistently improves performance on two major integrated tasks, image captioning and visual question answering. The presentation of MIA is clearly described, and using the method for pretraining integrated features is understandable. However, when it is applied to image captioning, where the inputs are only images, it is a bit difficult to understand how those features are used, because Figure 3 and its explanation are too brief. It would be preferable for the architecture of LSTM-A3 to be explained or illustrated in some detail.
-- What is the neural structure of the component that processes textual concepts? Bag-of-words, bag-of-embeddings, or an RNN encoder without order?
-- How is the textual concept extraction carried out? Is it basically the extractor proposed by Fang et al. (2015) or Wu et al. (2016)? What datasets are used for training the textual concepts?
-- Line 107. Typo: "... from the self-attention in that in the self-attention, the query matrix" --> "... from the self-attention where the query matrix"

Citations:
Lu, Jiasen, et al. "Hierarchical question-image co-attention for visual question answering." Advances in Neural Information Processing Systems. 2016.
Xiong, Caiming, Victor Zhong, and Richard Socher. "Dynamic coattention networks for question answering." arXiv preprint arXiv:1611.01604 (2016).