{"title": "Visual Question Answering with Question Representation Update (QRU)", "book": "Advances in Neural Information Processing Systems", "page_first": 4655, "page_last": 4663, "abstract": "Our method aims at reasoning over natural language questions and visual images. Given a natural language question about an image, our model updates the question representation iteratively by selecting image regions relevant to the query and learns to give the correct answer. Our model contains several reasoning layers, exploiting complex visual relations in the visual question answering (VQA) task. The proposed network is end-to-end trainable through back-propagation, where its weights are initialized using a pre-trained convolutional neural network (CNN) and a gated recurrent unit (GRU). Our method is evaluated on the challenging COCO-QA and VQA datasets and yields state-of-the-art performance.", "full_text": "Visual Question Answering with\n\nQuestion Representation Update (QRU)\n\nRuiyu Li\n\nJiaya Jia\n\nThe Chinese University of Hong Kong\n{ryli,leojia}@cse.cuhk.edu.hk\n\nAbstract\n\nOur method aims at reasoning over natural language questions and visual images. Given a natural language question about an image, our model updates the question representation iteratively by selecting image regions relevant to the query and learns to give the correct answer. Our model contains several reasoning layers, exploiting complex visual relations in the visual question answering (VQA) task. The proposed network is end-to-end trainable through back-propagation, where its weights are initialized using a pre-trained convolutional neural network (CNN) and a gated recurrent unit (GRU). Our method is evaluated on the challenging COCO-QA [19] and VQA [2] datasets and yields state-of-the-art performance.\n\n1 Introduction\n\nVisual question answering (VQA) is a new research direction at the intersection of computer vision and natural language processing. 
Developing stable systems for VQA attracts increasing interest in multiple communities. Possible applications include bidirectional image-sentence retrieval, human-computer interaction, assistance for blind people, etc. It remains a difficult problem due to the many challenges in visual object recognition and grounding, natural language representation, and common sense reasoning.\nMost recently proposed VQA models build on image captioning [10, 24, 28]. These methods have been advanced by the great success of deep learning in building language models [23], image classification [12] and visual object detection [6]. Compared with image captioning, where a plausible description is produced for a given image, VQA requires algorithms to give the correct answer to a specific human-raised question regarding the content of a given image. It is a more complex research problem since the method is required to answer different types of questions. An example related to image content is \u201cWhat is the color of the dog?\u201d. There are also questions requiring extra knowledge or commonsense reasoning, such as \u201cDoes it appear to be rainy?\u201d.\nProperly modeling questions is essential for solving the VQA problem. A commonly employed strategy is to use a CNN or an RNN to extract semantic vectors. The general issue is that the resulting question representation lacks detailed information from the given image, which however is vital for understanding visual content. We take the question and image in Figure 1 as an example. To answer the original question \u201cWhat is sitting amongst things have been abandoned?\u201d, one needs to know the target object location. Thus the question can be made more specific, as in \u201cWhat is discarded on the side of a building near an old book shelf?\u201d.\nIn this paper, we propose a neural network based reasoning model that is able to update the question representation iteratively by inferring image information. 
With this new system, it is now possible to make questions more specific than the original ones by automatically focusing on important image information. Our approach is based on the neural reasoner [18], which has recently shown remarkable success in text question answering tasks.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nQuestion: What is sitting amongst things have been abandoned?\nAnswer: Toilet.\nBefore: What sits in the room that appears to be partially abandoned?\nUpdated: What is discarded on the side of a building near an old book shelf?\n\n(a) (b)\n\nFigure 1: The questions asked by humans can be ambiguous given an image containing various objects. The Before and Updated questions are the most similar ones based on the cosine similarity to the original Question before and after applying our algorithm to update the representation. (b) shows the attention masks generated by our model.\n\nThe neural reasoner updates the question by interacting it with supporting facts through multiple reasoning layers. We note that applying this model to VQA is nontrivial since the facts are in the form of an image. Thus image region information is extracted in our model. To determine the relevance between the question and each image region, we employ the attention mechanism to generate the attention distribution over regions of the image. Our contributions are as follows.\n\n\u2022 We present a reasoning network to iteratively update the question representation after each time the question interacts with image content.\n\n\u2022 Our model utilizes object proposals to obtain candidate image regions and has the ability to focus on image regions relevant to the question.\n\nWe evaluate and compare the performance of our model on two challenging VQA datasets \u2013 i.e., COCO-QA [19] and VQA [2]. 
Experiments demonstrate the ability of our model to infer image regions relevant to the question.\n\n2 Related Work\n\nResearch on visual question answering is mostly driven by text question answering and image captioning methods. In natural language processing, question answering is a well-studied problem. In [22], an end-to-end memory network was used with a recurrent attention model over a large external memory. Compared with the original memory network, it requires less supervision and shows comparable results on the QA task. The neural reasoning system proposed in [18], named neural reasoner, can utilize multiple supporting facts and find an answer. Decent performance was achieved on positional reasoning and path-finding QA tasks.\nVQA is closely related to image captioning [10, 24, 28, 5]. In [5], a set of likely words is detected in several regions of the image and combined using a language model to generate the image description. In [10], a structured max-margin objective was used for deep neural networks. It learns to embed both visual and language data into a common multi-modal space. Vinyals et al. [24] extracted high-level image feature vectors from a CNN and took them as the first input to the recurrent network to generate the caption. Xu et al. [28] integrated visual attention in the recurrent network. The proposed algorithm predicts one word at a time by looking at local image regions relevant to the currently generated word.\nMalinowski et al. [15] first introduced a solution addressing the VQA problem. It combines natural language processing with semantic segmentation in a Bayesian framework for automatic question answering. Since then, several neural network based models [16, 19, 2] have been proposed to solve the VQA problem. These models use CNNs to extract image features and recurrent neural networks to embed questions. 
The embedded image and question features are then fused by concatenation [16] or element-wise addition [29] to predict answers.\n\nFigure 2: The overall architecture of our model with a single reasoning layer for VQA.\n\nRecently several models integrated the attention mechanism [29, 27, 3, 20] and showed the ability of their networks to focus on image regions related to the question.\nThere also exist other approaches for VQA. For example, Xiong et al. [26] proposed an improved dynamic memory network to fuse the question and image region representations using a bi-directional GRU. The algorithm of [1] learns to compose a network from a collection of composable modules. Ma et al. [14] proposed a model with three CNNs to capture information of the image, the question and the multi-modal representation.\n\n3 Our Model\n\nThe overall architecture of our model is illustrated in Figure 2. The model is derived from the neural reasoner [18], which is able to update the question representation recursively by inferring over multiple supporting facts. Our model nonetheless contains a few inherently different components. Since VQA involves only one question and one image each time instead of a set of facts, we use object proposals to obtain candidate image regions serving as the facts in our model. Moreover, in the pooling step, we employ an attention mechanism to determine the relevance between the original and updated question representations. Our network consists of four major components \u2013 i.e., the image understanding, question encoding, reasoning and answering layers.\n\n3.1 Image Understanding Layer\n\nThe image understanding layer is designed to model image content as semantic vectors. We build this layer upon the VGG model with 19 weight layers [21]. 
It is pre-trained on ImageNet [4]. The network has sixteen convolutional layers and five max-pooling layers of kernel size 2 \u00d7 2 with stride 2, followed by two fully-connected layers with 4,096 neurons.\nUsing a global representation of the image may fail to capture all necessary information for answering questions involving multiple objects and spatial configuration. Moreover, since most of the questions are related to objects [19, 2], we utilize an object proposal generator to produce a set of candidate regions that are most likely to contain an object. For each image, we choose candidate regions by extracting the top 19 detected edge boxes [31]. We choose an intersection over union (IoU) value of 0.3 when performing non-maximum suppression, which is a common setting in object detection. Additionally, the whole image region is added to capture the global information in the image understanding layer, resulting in 20 candidate regions per image. We extract features from each candidate region through the above mentioned CNN, yielding 4,096-dimensional image region features. The extracted features, however, lack spatial information on object location. To remedy this issue, we follow the method of [8] to include an 8D representation\n\n[xmin, ymin, xmax, ymax, xcenter, ycenter, wbox, hbox],\n\nwhere wbox and hbox are the width and height of the image region. We set the image center as the origin. The coordinates are normalized to range from \u22121 to 1. Then each image region is represented as a 4104D feature denoted as fi where i \u2208 [1, 20]. 
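The 8D spatial representation described above is a few lines of arithmetic; here is a minimal sketch (pure Python; the function name and the pixel-box convention `(x0, y0, x1, y1)` are our own illustrative assumptions, not from the paper):

```python
def spatial_feature(box, img_w, img_h):
    """Map a pixel box (x0, y0, x1, y1) to the 8D representation
    [xmin, ymin, xmax, ymax, xcenter, ycenter, wbox, hbox],
    with coordinates normalized to [-1, 1] and the image center as origin."""
    x0, y0, x1, y1 = box
    xmin, xmax = 2.0 * x0 / img_w - 1.0, 2.0 * x1 / img_w - 1.0
    ymin, ymax = 2.0 * y0 / img_h - 1.0, 2.0 * y1 / img_h - 1.0
    xc, yc = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    return [xmin, ymin, xmax, ymax, xc, yc, xmax - xmin, ymax - ymin]

# A box covering the whole image spans [-1, 1] on both axes.
print(spatial_feature((0, 0, 100, 100), 100, 100))
# -> [-1.0, -1.0, 1.0, 1.0, 0.0, 0.0, 2.0, 2.0]
```

Concatenating this 8-vector to the 4,096-dimensional CNN feature gives the 4104D region feature fi.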
For modeling convenience, we use a single layer perceptron to transform the image representation into a common latent space shared with the question feature:\n\nvi = \u03c6(Wvf \u2217 fi + bvf ),   (1)\n\nwhere \u03c6 is the rectified activation function \u03c6(x) = max(0, x).\n\n3.2 Question Encoding Layer\n\nTo encode the natural language question, we resort to the recurrent neural network, which has demonstrated great success in sentence embedding. The question encoding layer is composed of a word embedding layer and GRU cells. Given a question w = [w1, ..., wT ], where wt is the tth word in the question and T is the length of the question, we first embed each word wt into a vector xt with an embedding matrix We, i.e., xt = We wt. Then at each time step, we feed xt into the GRU sequentially. At each step, the GRU takes one input vector xt, and updates and outputs a hidden state ht. The final hidden state hT is considered as the question representation. We also embed it into the same common latent space as the image embedding through a single layer perceptron:\n\nq = \u03c6(Wqh \u2217 hT + bqh).   (2)\n\nWe initialize our question encoding layer with the pre-trained skip-thought vectors model [11], designed for general sentence embedding, as done in [17]. Note that the skip-thought vectors model is trained in an unsupervised manner on a large language corpus. By fine-tuning the GRU, we transfer knowledge from the natural language corpus to the VQA problem.\n\n3.3 Reasoning Layer\n\nThe reasoning layer includes question-image interaction and weighted pooling.\n\nQuestion-Image Interaction Given that a multilayer perceptron (MLP) has the ability to determine the relationship between two input sentences under supervision [7, 18], we examine image region features and the question representation to acquire a good understanding of the question. 
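The two single-layer perceptrons of Eqs. (1) and (2) amount to one matrix multiply plus a rectifier each; a NumPy toy sketch follows (random weights and the variable names are our own illustrative assumptions; only the dimensions 4104 and 1024 come from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # the rectified activation phi(x) = max(0, x)

rng = np.random.default_rng(0)
D_IMG, D_TXT, D = 4104, 2400, 1024   # region feature, GRU state (assumed), latent dims
W_vf = rng.standard_normal((D, D_IMG)) * 0.01
b_vf = np.zeros(D)
W_qh = rng.standard_normal((D, D_TXT)) * 0.01
b_qh = np.zeros(D)

f_i = rng.standard_normal(D_IMG)     # CNN + spatial feature of one region (Eq. 1 input)
h_T = rng.standard_normal(D_TXT)     # final GRU hidden state (Eq. 2 input)

v_i = relu(W_vf @ f_i + b_vf)        # Eq. (1): region embedding
q0 = relu(W_qh @ h_T + b_qh)         # Eq. (2): question embedding
assert v_i.shape == q0.shape == (D,) # both live in the shared latent space
```

The point of the shared 1,024-dimensional space is that the two modalities become directly comparable in the reasoning layer.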
In a memory network [22], these image region features are akin to the input memory representation, which can be retrieved multiple times according to the question.\nThere are a total of L reasoning layers. In the lth reasoning layer, the ith interaction happens between ql\u22121 and vi through an MLP, resulting in the updated question representation ql_i as\n\nql_i = MLPl(ql\u22121, vi; \u03b8l),   (3)\n\nwith \u03b8l being the model parameter of the interaction at the lth reasoning layer. In the simplest case with one single layer in MLPl, the updating process is given by\n\nql_i = \u03c6(Wl \u2217 (ql\u22121 \u2297 vi) + bl),   (4)\n\nwhere \u2297 indicates element-wise multiplication, which performs better in our experiments than other strategies, e.g., concatenation and element-wise addition. Generally speaking, ql_i contains the update of network focus towards answering the question after its interaction with image feature vi. This property is important for the reasoning process [18].\n\nWeighted Pooling Pooling aims to fuse components of the question after its interaction with all image features to update the representation. Two common strategies for pooling are max and mean pooling. However, when answering a specific question, it is often the case that the correct answer is only related to particular image regions. Therefore, using max pooling may lead to unsatisfying results since questions may involve interaction between human and object, while mean pooling may also cause inferior performance due to noise introduced by regions irrelevant to the question.\nTo determine the relevance between the question and each image region, we resort to the attention mechanism used in [28] to generate the attention distribution over image regions. Each updated question ql_i after interaction with the ith image region is chosen to be close to the original question representation ql\u22121. 
Hence, the attention weights take the following forms:\n\nCi = tanh(WA \u2217 ql_i \u2295 (WB \u2217 ql\u22121 + bB)),\nP = softmax(WP \u2217 C + bP ),   (5)\n\nwhere C is a matrix whose ith column is Ci, and P \u2208 RM is an M-dimensional vector representing the attention weights. M is the number of image regions, set to 20. Based on the attention distribution, we calculate the weighted average of ql_i, resulting in the updated question representation ql as\n\nql = \u2211i Pi ql_i.   (6)\n\nThe updated question representation ql after weighted pooling serves as the question input to the next reasoning or answering layer.\n\n3.4 Answering Layer\n\nFollowing [19, 2], we model VQA as a classification problem with pre-defined classes. Given the updated question representation qL at the last reasoning layer, a softmax layer is employed to classify qL into one of the possible answers as\n\npans = softmax(Wans \u2217 qL + bans).   (7)\n\nNote that instead of the softmax layer for predicting the correct answer, it is also possible to utilize an LSTM or GRU decoder, taking qL as input, to generate free-form answers.\n\n4 Experiments\n\n4.1 Datasets and Evaluation Metrics\n\nWe conduct experiments on COCO-QA [19] and VQA [2]. The COCO-QA dataset is based on the Microsoft COCO image data [13]. There are 78,736 training questions and 38,948 test ones, based on a total of 123,287 images. Four types of questions are provided, including Object, Number, Color and Location, taking 70%, 7%, 17% and 6% of the whole dataset respectively.\nIn the VQA dataset, each image from the COCO data is annotated via Amazon Mechanical Turk (AMT) with three questions. It is the largest VQA benchmark so far. There are 248,349, 121,512 and 244,302 questions for training, validation and testing, respectively. 
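Looking back at the reasoning layer of Section 3.3, Eqs. (4)-(6) reduce to a handful of matrix operations. A minimal NumPy sketch of one layer follows (random weights, a toy latent dimension, and our own variable names; it treats \u2295 in Eq. (5) as element-wise addition, matching the text):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
M, D = 20, 64                       # 20 candidate regions; small toy latent dim
V = rng.standard_normal((M, D))     # rows are region embeddings v_i
q = rng.standard_normal(D)          # incoming question representation q^{l-1}

W_l = rng.standard_normal((D, D)) * 0.1; b_l = np.zeros(D)
W_A = rng.standard_normal((D, D)) * 0.1
W_B = rng.standard_normal((D, D)) * 0.1; b_B = np.zeros(D)
w_P = rng.standard_normal(D) * 0.1; b_P = 0.0

Q = relu((q * V) @ W_l.T + b_l)            # Eq. (4): one updated q^l_i per row
C = np.tanh(Q @ W_A.T + (W_B @ q + b_B))   # Eq. (5), comparing q^l_i with q^{l-1}
P = softmax(C @ w_P + b_P)                 # attention over the M regions
q_new = P @ Q                              # Eq. (6): weighted pooling -> q^l

assert np.isclose(P.sum(), 1.0) and q_new.shape == (D,)
```

Stacking this block L times, with q_new fed back in as q, gives the multi-layer reasoning described above.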
For each question, ten answers are provided to capture the consensus of annotators. Following [2], we choose the top 1,000 most frequent answers as candidate outputs, which constitutes 82.67% of the train+val answers.\nSince we formulate VQA as a classification problem, mean classification accuracy is used to evaluate the model on the COCO-QA dataset. Besides, the Wu-Palmer similarity (WUPS) [25] measure is also reported on COCO-QA. WUPS calculates the similarity between two words based on their longest common subsequence in the taxonomy tree. Following [19], we use thresholds 0.9 and 0.0 in our evaluation. The VQA dataset provides a different kind of evaluation metric. Since ten ground truth answers are given, a predicted answer is considered to be correct when three or more ground truth answers match it; otherwise, a partial score is given.\n\n4.2 Implementation Details\n\nWe implement our network using the public Torch computing framework. Before training, all question sentences are normalized to lower case and question marks are removed. These words are fed into the GRU one by one. A whole answer with one or more words is regarded as a separate class. For extracting image features, each candidate region is cropped and resized to 224 \u00d7 224 before being fed into the CNN.\nFor the COCO-QA dataset, we set the dimension of the common latent space to 1,024. Since the VQA dataset is larger than COCO-QA, we double the dimension of the common latent space to adapt to the data and classes. On each reasoning layer, we use one single layer in the MLP. We test up to two reasoning layers. 
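The VQA scoring rule of Section 4.1 (full credit once at least three of the ten annotators agree with the prediction, partial credit below that) can be sketched as follows; the function name is our own, and the official evaluation additionally normalizes answer strings, which this sketch skips:

```python
def vqa_accuracy(prediction, ground_truth_answers):
    """Score one prediction against the ten annotator answers:
    min(#annotators giving this answer / 3, 1), so three or more
    matches count as fully correct and fewer earn partial credit."""
    matches = sum(a == prediction for a in ground_truth_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 4 + ["no"] * 6))  # 4 matches -> 1.0
print(vqa_accuracy("no", ["yes"] * 8 + ["no"] * 2))   # 2 matches -> partial credit
```

Per-question scores are then averaged over the test split to obtain the numbers reported in Table 3.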
No further improvement is observed when using three or more layers.\n\nMethods | ACC. | Object | Number | Color | Location\nMean Pooling | 58.15 | 60.61 | 45.34 | 55.37 | 52.74\nMax Pooling | 59.37 | 62.11 | 45.70 | 55.91 | 53.63\nW/O Global | 60.87 | 63.32 | 46.68 | 58.66 | 55.49\nW/O Coord | 61.33 | 63.76 | 46.24 | 59.35 | 56.66\nFull Model | 61.99 | 64.53 | 46.68 | 59.81 | 56.82\n\nTable 1: Comparison of ablation models. Models are trained and tested on COCO-QA [19] with one reasoning layer.\n\nMethods | ACC. | Object | Number | Color | Location | WUPS 0.9 | WUPS 0.0\nIMG+BOW [19] | 55.92 | 58.66 | 44.10 | 51.96 | 49.39 | 66.78 | 88.99\n2VIS+BLSTM [19] | 55.09 | 58.17 | 44.79 | 49.53 | 47.34 | 65.34 | 88.64\nEnsemble [19] | 57.84 | 61.08 | 47.66 | 51.48 | 50.28 | 67.90 | 89.52\nABC-CNN [3] | 58.10 | 62.46 | 45.70 | 46.81 | 53.67 | 68.44 | 89.85\nDPPnet [17] | 61.19 | - | - | - | - | 70.84 | 90.61\nSAN [29] | 61.60 | 64.50 | 48.60 | 57.90 | 54.00 | 71.60 | 90.90\nQRU (1) | 61.99 | 64.53 | 46.68 | 59.81 | 56.82 | 71.83 | 91.11\nQRU (2) | 62.50 | 65.06 | 46.90 | 60.50 | 56.99 | 72.58 | 91.62\n\nTable 2: Evaluation results on the COCO-QA dataset [19]. \u201cQRU (1)\u201d and \u201cQRU (2)\u201d refer to 1 and 2 reasoning layers incorporated in the system.\n\nThe network is trained in an end-to-end fashion using stochastic gradient descent with mini-batches of 100 samples and momentum 0.9. The learning rate starts from 10\u22123 and decreases by a factor of 10 when the validation accuracy stops improving. We use dropout and gradient clipping to regularize the training process. Our model is denoted as QRU in the following experiments.\n\n4.3 Ablation Results\n\nWe conduct experiments to examine the usefulness of each component in our model. Specifically, we compare different question representation pooling mechanisms, i.e., mean pooling and max pooling. We also train two controlled models devoid of the global image feature and spatial coordinates, denoted as W/O Global and W/O Coord. 
Table 1 shows the results.\nThe performance of the mean and max pooling models is substantially worse than that of the full model, which uses weighted pooling. This indicates that our model benefits from the attention mechanism by looking at several image regions rather than only one or all of them. A drop of 1.12% in accuracy is observed if the global image feature is not modeled, confirming that inclusion of the whole image is important for capturing global information. Leaving out the spatial coordinates also leads to a drop in accuracy. Notably, the greatest deterioration is on the Object question type. This is because the Object type seeks information around the object, as in \u201cWhat is next to the stop sign?\u201d. Spatial coordinates help our model reason about spatial relationships among objects.\n\n4.4 Comparison with State-of-the-art\n\nWe compare performance in Tables 2 and 3 with experimental results on COCO-QA and VQA respectively. Table 2 shows that our model with only one reasoning layer already outperforms the state-of-the-art two-layer stacked attention network (SAN) [29]. Two reasoning layers give the best performance. We also report the per-category accuracy in Table 2 to show the strengths and weaknesses of our model. Our best model outperforms SAN by 2.6% and 2.99% on the Color and Location question types respectively, and by 0.56% on Object.\nOur analysis is that the SAN model puts its attention on coarser regions obtained from the activation of the last convolutional layer, which may include cluttered and noisy background. In contrast, our model only deals with selected object proposal regions, which have a good chance of being objects. When answering questions involving objects, our model gives reasonable results. For the Number question type, since an object proposal may contain several objects, our counting ability is weakened. 
In fact, the counting task is a complete computer vision problem in its own right.\n\nMethods | Open-Ended: All | Y/N | Num | Other | All (test-std) | Multiple-Choice: All | Y/N | Num | Other | All (test-std)\nBOWIMG [2] | 52.64 | 75.77 | 33.67 | 37.37 | - | 58.97 | 75.59 | 34.35 | 50.33 | -\nLSTMIMG [2] | 53.74 | 78.94 | 35.24 | 36.42 | 54.06 | 57.17 | 78.95 | 35.80 | 43.41 | 57.57\niBOWIMG [30] | 55.72 | 76.55 | 35.03 | 42.62 | 55.89 | 61.68 | 76.68 | 37.05 | 54.44 | 61.97\nDPPnet [17] | 57.22 | 80.71 | 37.24 | 41.71 | 57.36 | 62.48 | 80.79 | 38.94 | 52.16 | 62.69\nSAN [29] | 58.70 | 79.30 | 36.60 | 46.10 | 58.90 | - | - | - | - | -\nWR Sel [20] | - | - | - | - | - | 62.44 | 77.62 | 34.28 | 55.84 | 62.43\nFDA [9] | 59.24 | 81.14 | 36.16 | 45.77 | 59.54 | 64.01 | 81.50 | 39.00 | 54.72 | 64.18\nDMN+ [26] | 60.37 | 80.75 | 37.00 | 48.25 | 60.36 | - | - | - | - | -\nQRU (1) | 59.26 | 80.98 | 35.93 | 45.99 | 59.44 | 63.96 | 81.00 | 37.08 | 55.48 | 64.13\nQRU (2) | 60.72 | 82.29 | 37.02 | 47.67 | 60.76 | 65.43 | 82.24 | 38.69 | 57.12 | 65.43\n\nTable 3: Evaluation results on the VQA dataset [2]. 
\u201cQRU (1)\u201d and \u201cQRU (2)\u201d refer to 1 and 2 reasoning layers incorporated in the system.\n\nOriginal: What next to two other open laptops?\nBefore updating: What next to each other dipicting smartphones? / What next to two boys? / What hooked up to two computers? / What next to each other with visible piping? / What next to two pair of shoes?\nAfter updating with one reasoning layer: What are there laying down with two remotes? / What next to each other depicting smartphones? / What hooked up to two computers? / What next to each other with monitors? / What cubicle with four differnet types of computers?\nAfter updating with two reasoning layers: What plugged with wires? / What next to each other with monitors? / What are open at the table with cell phones? / What is next to the monitor? / What sits on the desk along with 2 monitors?\n\nFigure 3: Retrieved questions before and after update from the COCO-QA dataset [19].\n\nTable 3 shows that our model yields prominent improvement on the Other type when compared with other models [2, 30, 17] that use a global representation of the image. Object proposals in our model are useful since the Other type contains questions such as \u201cWhat color \u00b7\u00b7\u00b7\u201d, \u201cWhat kind \u00b7\u00b7\u00b7\u201d, \u201cWhere is \u00b7\u00b7\u00b7\u201d, etc. Our model outperforms that of [20] by 3%, where the latter also exploits object proposals. Compared with [20], we use fewer object proposals, demonstrating the effectiveness of our approach. The table also reveals that our model with two reasoning layers achieves state-of-the-art results for both the open-ended and multiple-choice tasks.\n\n4.5 Qualitative Analysis\n\nTo understand the ability of our model to update the question representation, we show an image and several questions in Figure 3. 
The retrieved questions from the test set are based on the cosine similarities to the original question before and after our model updates the representation. It is notable that before the update, 4 out of the top 5 similar questions begin with \u201cWhat next\u201d. This is because the GRU acts as a language model, making the retrieved questions share a similar language structure. After we update the question representation, the resulting questions are more related to image content regarding the objects computers and monitors, while the originally retrieved questions contain irrelevant words like boys and shoes. The retrieved questions become even more informative when using two reasoning layers.\nWe visualize a few attention masks generated by our model in Figure 4. The visualization is created by soft masking the image with a mask obtained by summing the weights of each region. The mask is normalized to a maximum value of 1, followed by a small Gaussian blur. Our model is capable of putting attention on important regions closely relevant to the question. To answer the question \u201cWhat is the color of the snowboard?\u201d, the proposed model finds the snowboard. For the other question \u201cThe man holding what on top of a snow covered hill?\u201d, it is required to infer the relation among the person, the snow covered hill, and the snowboard. With these attention masks, it is possible to predict correct answers since irrelevant image regions are ruled out. More examples are shown in Figure 5.\n\n(a) (b) (c)\nQ: What is the color of the snowboard? A: Yellow.\nQ: The man holding what on top of a snow covered hill? A: Snowboard.\n\nFigure 4: Visualization of attention masks. 
Our model learns to attend to particular image regions that are relevant to the question.\n\nQ: What is sitting on top of table in a workshop? A: Boat\nQ: What are hogging a bed by themselfs? A: Dogs\nQ: What is the man in stadium style seats using? A: Phone\nQ: What next to a large building? A: Clock\nQ: What is the color of the sunflower? A: Yellow\n\nFigure 5: Visualization of more attention masks.\n\n5 Conclusion\n\nWe have proposed an end-to-end trainable neural network for VQA. Our model learns to answer questions by updating the question representation and inferring over a set of image regions with a multilayer perceptron. Visualization of attention masks demonstrates the ability of our model to focus on image regions highly related to questions. Experimental results are satisfying on the two challenging VQA datasets. Future work includes improving the object counting ability and word-region relations.\n\nAcknowledgements\n\nThis work is supported by a grant from the Research Grants Council of the Hong Kong SAR (project No. 2150760) and by the National Science Foundation China, under Grant 61133009. We thank NVIDIA for providing Ruiyu Li a Tesla K40 GPU accelerator for this work.\n\nReferences\n\n[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705, 2016.\n\n[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, pages 2425\u20132433, 2015.\n\n[3] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.\n\n[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248\u2013255, 2009.\n\n[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. 
Deng, P. Doll\u00e1r, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, pages 1473\u20131482, 2015.\n\n[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580\u2013587, 2014.\n\n[7] B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching natural language sentences. In NIPS, pages 2042\u20132050, 2014.\n\n[8] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. arXiv preprint arXiv:1511.04164, 2015.\n\n[9] I. Ilija, Y. Shuicheng, and F. Jiashi. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485, 2016.\n\n[10] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128\u20133137, 2015.\n\n[11] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, pages 3276\u20133284, 2015.\n\n[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097\u20131105, 2012.\n\n[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740\u2013755, 2014.\n\n[14] L. Ma, Z. Lu, and H. Li. Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333, 2015.\n\n[15] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, pages 1682\u20131690, 2014.\n\n[16] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, pages 1\u20139, 2015.\n\n[17] H. Noh, P. H. Seo, and B. Han. 
Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756, 2015.\n\n[18] B. Peng, Z. Lu, H. Li, and K.-F. Wong. Towards neural network-based reasoning. arXiv preprint arXiv:1508.05508, 2015.\n\n[19] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question answering. In NIPS, pages 2935\u20132943, 2015.\n\n[20] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. arXiv preprint arXiv:1511.07394, 2015.\n\n[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[22] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Weakly supervised memory networks. arXiv preprint arXiv:1503.08895, 2015.\n\n[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.\n\n[24] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156\u20133164, 2015.\n\n[25] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In ACL, pages 133\u2013138, 1994.\n\n[26] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. arXiv preprint arXiv:1603.01417, 2016.\n\n[27] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234, 2015.\n\n[28] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.\n\n[29] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.\n\n[30] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.\n\n[31] C. L. Zitnick and P. Doll\u00e1r. Edge boxes: Locating object proposals from edges. In ECCV, pages 391\u2013405, 2014.\n", "award": [], "sourceid": 2328, "authors": [{"given_name": "Ruiyu", "family_name": "Li", "institution": "CUHK"}, {"given_name": "Jiaya", "family_name": "Jia", "institution": "CUHK"}]}