{"title": "Exploring Models and Data for Image Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 2953, "page_last": 2961, "abstract": "This work aims to address the problem of image-based question-answering (QA) with new models and datasets. In our work, we propose to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images. Our model performs 1.8 times better than the only published results on an existing image QA dataset. We also present a question generation algorithm that converts image descriptions, which are widely available, into QA form. We used this algorithm to produce an order-of-magnitude larger dataset, with more evenly distributed answers. A suite of baseline results on this new dataset are also presented.", "full_text": "Exploring Models and Data for Image Question\n\nAnswering\n\nMengye Ren1, Ryan Kiros1, Richard S. Zemel1,2\n\nUniversity of Toronto1\n\nCanadian Institute for Advanced Research2\n\n{mren, rkiros, zemel}@cs.toronto.edu\n\nAbstract\n\nThis work aims to address the problem of image-based question-answering (QA)\nwith new models and datasets. In our work, we propose to use neural networks\nand visual semantic embeddings, without intermediate stages such as object de-\ntection and image segmentation, to predict answers to simple questions about im-\nages. Our model performs 1.8 times better than the only published results on an\nexisting image QA dataset. We also present a question generation algorithm that\nconverts image descriptions, which are widely available, into QA form. We used\nthis algorithm to produce an order-of-magnitude larger dataset, with more evenly\ndistributed answers. 
A suite of baseline results on this new dataset are also pre-\nsented.\n\n1 Introduction\n\nCombining image understanding and natural language interaction is one of the grand dreams of\narti\ufb01cial intelligence. We are interested in the problem of jointly learning image and text through a\nquestion-answering task. Recently, researchers studying image caption generation [1, 2, 3, 4, 5, 6,\n7, 8, 9, 10] have developed powerful methods of jointly learning from image and text inputs to form\nhigher level representations from models such as convolutional neural networks (CNNs) trained on\nobject recognition, and word embeddings trained on large scale text corpora. Image QA involves\nan extra layer of interaction between humans and computers. Here the model needs to pay attention\nto details of the image instead of describing it in a vague sense. The problem also combines many\ncomputer vision sub-problems such as image labeling and object detection.\nIn this paper we present our contributions to the problem: a generic end-to-end QA model using\nvisual semantic embeddings to connect a CNN and a recurrent neural net (RNN), as well as compar-\nisons to a suite of other models; an automatic question generation algorithm that converts description\nsentences into questions; and a new QA dataset (COCO-QA) that was generated using the algorithm,\nand a number of baseline results on this new dataset.\nIn this work we assume that the answers consist of only a single word, which allows us to treat the\nproblem as a classi\ufb01cation problem. This also makes the evaluation of the models easier and more\nrobust, avoiding the thorny evaluation issues that plague multi-word generation problems.\n\n2 Related Work\n\nMalinowski and Fritz [11] released a dataset with images and question-answer pairs, the DAtaset\nfor QUestion Answering on Real-world images (DAQUAR). All images are from the NYU depth v2\ndataset [12], and are taken from indoor scenes. 
Human segmentation, image depth values, and object\nlabeling are available in the dataset. The QA data has two sets of con\ufb01gurations, which differ by the\nnumber of object classes appearing in the questions (37-class and 894-class). There are mainly three\ntypes of questions in this dataset: object type, object color, and number of objects. Some questions\nare easy but many questions are very hard to answer even for humans. Since DAQUAR is the only\npublicly available image-based QA dataset, it is one of our benchmarks to evaluate our models.\n\nDAQUAR 1553\nWhat is there in front of the sofa?\nGround truth: table\nIMG+BOW: table (0.74)\n2-VIS+BLSTM: table (0.88)\nLSTM: chair (0.47)\n\nCOCOQA 5078\nHow many leftover donuts is the red bicycle holding?\nGround truth: three\nIMG+BOW: two (0.51)\n2-VIS+BLSTM: three (0.27)\nBOW: one (0.29)\n\nCOCOQA 1238\nWhat is the color of the tee-shirt?\nGround truth: blue\nIMG+BOW: blue (0.31)\n2-VIS+BLSTM: orange (0.43)\nBOW: green (0.38)\n\nCOCOQA 26088\nWhere is the gray cat sitting?\nGround truth: window\nIMG+BOW: window (0.78)\n2-VIS+BLSTM: window (0.68)\nBOW: suitcase (0.31)\n\nFigure 1: Sample questions and responses of a variety of models. Correct answers are in green and\nincorrect in red. The numbers in parentheses are the probabilities assigned to the top-ranked answer\nby the given model. The leftmost example is from the DAQUAR dataset, and the others are from\nour new COCO-QA dataset.\n\nTogether with the release of the DAQUAR dataset, Malinowski and Fritz presented an approach\nwhich combines semantic parsing and image segmentation. Their approach is notable as one of the\n\ufb01rst attempts at image QA, but it has a number of limitations. First, the human-de\ufb01ned set of\npossible predicates is very dataset-speci\ufb01c. To obtain the predicates, their algorithm also depends on the\naccuracy of the image segmentation algorithm and image depth information. 
Second, their model\nneeds to compute all possible spatial relations in the training images. Even though the model limits\nthis to the nearest neighbors of the test images, it could still be an expensive operation in larger\ndatasets. Lastly, the accuracy of their model is not very strong. We show below that some simple\nbaselines perform better.\nVery recently there have been a number of parallel efforts on both creating datasets and proposing\nnew models [13, 14, 15, 16]. Both Antol et al. [13] and Gao et al. [15] used MS-COCO [17] images\nand created an open domain dataset with human generated questions and answers. In Antol et al.\u2019s\nwork, the authors also included cartoon pictures besides real images. Some questions require logical\nreasoning in order to answer correctly. Both Malinowski et al. [14] and Gao et al. [15] use recurrent\nnetworks to encode the sentence and output the answer. Whereas Malinowski et al. use a single\nnetwork to handle both encoding and decoding, Gao et al. used two networks, a separate encoder\nand decoder. Lastly, bilingual (Chinese and English) versions of the QA dataset are available in Gao\net al.\u2019s work. Ma et al. [16] use CNNs to both extract image features and sentence features, and fuse\nthe features together with another multi-modal CNN.\nOur approach is developed independently from the work above. Similar to the work of Malinowski\net al. and Gao et al., we also experimented with recurrent networks to consume the sequential\nquestion input. Unlike Gao et al., we formulate the task as a classi\ufb01cation problem, as there is no\nsingle well-accepted metric to evaluate sentence-form answer accuracy [18]. Thus, we place more\nfocus on a limited domain of questions that can be answered with one word. 
We also formulate and\nevaluate a range of other algorithms, that utilize various representations drawn from the question\nand image, on these datasets.\n\n3 Proposed Methodology\n\nThe methodology presented here is two-fold. On the model side we develop and apply various forms\nof neural networks and visual-semantic embeddings on this task, and on the dataset side we propose\nnew ways of synthesizing QA pairs from currently available image description datasets.\n\nFigure 2: VIS+LSTM Model (diagram components: Image, CNN, Linear, Word Embedding, LSTM, Softmax; example input \u201cHow many books\u201d at t = 1, 2, ..., T; example output probabilities over the answers One, Two, ..., Red, Bird)\n\n3.1 Models\n\nIn recent years, recurrent neural networks (RNNs) have enjoyed some successes in the \ufb01eld of nat-\nural language processing (NLP). Long short-term memory (LSTM) [19] is a form of RNN which\nis easier to train than standard RNNs because of its linear error propagation and multiplicative gat-\nings. Our model builds directly on top of the LSTM sentence model and is called the \u201cVIS+LSTM\u201d\nmodel. It treats the image as one word of the question. We borrowed this idea of treating the image\nas a word from caption generation work done by Vinyals et al. [1]. We compare this newly proposed\nmodel with a suite of simpler models in the Experimental Results section.\n\n1. We use the last hidden layer of the 19-layer Oxford VGG Conv Net [20] trained on Ima-\ngeNet 2014 Challenge [21] as our visual embeddings. The CNN part of our model is kept\nfrozen during training.\n\n2. We experimented with several different word embedding models: randomly initialized em-\nbedding, dataset-speci\ufb01c skip-gram embedding and general-purpose skip-gram embedding\nmodel [22]. The word embeddings are trained with the rest of the model.\n\n3. We then treat the image as if it is the \ufb01rst word of the sentence. 
Similar to DeViSE [23],\nwe use a linear or af\ufb01ne transformation to map 4096-dimensional image feature vectors to a\n300 or 500 dimensional vector that matches the dimension of the word embeddings.\n\n4. We can optionally treat the image as the last word of the question as well through a different\nweight matrix and optionally add a reverse LSTM, which gets the same content but operates\nin a backward sequential fashion.\n\n5. The outputs of the LSTM(s) are fed into a softmax layer at the last timestep to generate answers.\n\n3.2 Question-Answer Generation\n\nThe currently available DAQUAR dataset contains approximately 1500 images and 7000 questions\non 37 common object classes, which might not be enough for training large complex models. An-\nother problem with the current dataset is that simply guessing the modes can yield very good accu-\nracy.\nWe aim to create another dataset, to produce a much larger number of QA pairs and a more even\ndistribution of answers. While collecting human generated QA pairs is one possible approach, and\nanother is to synthesize questions based on image labeling, we instead propose to automatically\nconvert descriptions into QA form. In general, objects mentioned in image descriptions are easier to\ndetect than the ones in DAQUAR\u2019s human generated questions, and than the ones in synthetic QAs\nbased on ground truth labeling. This allows the model to rely more on rough image understanding\nwithout any logical reasoning. 
Lastly, the conversion process preserves the language variability in\nthe original description, and results in more human-like questions than questions generated from\nimage labeling.\nAs a starting point we used the MS-COCO dataset [17], but the same method can be applied to any\nother image description dataset, such as Flickr [24], SBU [25], or even the internet.\n\n3.2.1 Pre-Processing & Common Strategies\n\nWe used the Stanford parser [26] to obtain the syntactic structure of the original image description.\nWe also utilized these strategies for forming the questions.\n\n1. Compound sentences to simple sentences\n\nHere we only consider a simple case, where two sentences are joined together with a conjunctive\nword. We split the original sentences into two independent sentences.\n\n2. Inde\ufb01nite determiners \u201ca(n)\u201d to de\ufb01nite determiners \u201cthe\u201d.\n3. Wh-movement constraints\n\nIn English, questions tend to start with interrogative words such as \u201cwhat\u201d. The algorithm needs\nto move the verb as well as the \u201cwh-\u201d constituent to the front of the sentence. For example:\n\u201cA man is riding a horse\u201d becomes \u201cWhat is the man riding?\u201d In this work we consider the\nfollowing two simple constraints: (1) A-over-A principle which restricts the movement of a wh-\nword inside a noun phrase (NP) [27]; (2) Our algorithm does not move any wh-word that is\ncontained in a clause constituent.\n\n3.2.2 Question Generation\n\nQuestion generation is still an open-ended topic. Overall, we adopt a conservative approach to\ngenerating questions in an attempt to create high-quality questions. We consider generating four\ntypes of questions below:\n\n1. Object Questions: First, we consider asking about an object using \u201cwhat\u201d. 
This involves replac-\ning the actual object with a \u201cwhat\u201d in the sentence, and then transforming the sentence structure\nso that the \u201cwhat\u201d appears in the front of the sentence. The entire algorithm has the follow-\ning stages: (1) Split long sentences into simple sentences; (2) Change inde\ufb01nite determiners\nto de\ufb01nite determiners; (3) Traverse the sentence and identify potential answers and replace\nwith \u201cwhat\u201d. During the traversal of object-type question generation, we currently ignore all the\nprepositional phrase (PP) constituents; (4) Perform wh-movement. In order to identify a possible\nanswer word, we used WordNet [28] and the NLTK software package [29] to get noun categories.\n2. Number Questions: We follow a similar procedure as the previous algorithm, except for a dif-\nferent way to identify potential answers: we extract numbers from original sentences. Splitting\ncompound sentences, changing determiners, and wh-movement parts remain the same.\n\n3. Color Questions: Color questions are much easier to generate. This only requires locating the\ncolor adjective and the noun to which the adjective attaches. Then it simply forms a sentence\n\u201cWhat is the color of the [object]\u201d with the \u201cobject\u201d replaced by the actual noun.\n\n4. Location Questions: These are similar to generating object questions, except that now the answer\ntraversal will only search within PP constituents that start with the preposition \u201cin\u201d. We also\nadded rules to \ufb01lter out clothing so that the answers will mostly be places, scenes, or large objects\nthat contain smaller objects.\n\n3.2.3 Post-Processing\n\nWe rejected the answers that appear too rarely or too often in our generated dataset. 
After this QA\nrejection process, the frequency of the most common answer words was reduced from 24.98% down\nto 7.30% in the test set of COCO-QA.\n\n4 Experimental Results\n\n4.1 Datasets\n\nTable 1 summarizes the statistics of COCO-QA. It should be noted that since we applied the QA\npair rejection process, mode-guessing performs very poorly on COCO-QA. However, COCO-QA\nquestions are actually easier to answer than DAQUAR from a human point of view. This encour-\nages the model to exploit salient object relations instead of exhaustively searching all possible re-\nlations. The COCO-QA dataset can be downloaded at http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa\n\nTable 1: COCO-QA question type break-down\n\nCATEGORY  TRAIN  %        TEST   %\nOBJECT    54992  69.84%   27206  69.85%\nNUMBER    5885   7.47%    2755   7.07%\nCOLOR     13059  16.59%   6509   16.71%\nLOCATION  4800   6.10%    2478   6.36%\nTOTAL     78736  100.00%  38948  100.00%\n\nHere we provide some brief statistics of the new dataset. The maximum question length is 55, and\naverage is 9.65. The most common answers are \u201ctwo\u201d (3116, 2.65%), \u201cwhite\u201d (2851, 2.42%), and\n\u201cred\u201d (2443, 2.08%). The least common are \u201ceagle\u201d (25, 0.02%), \u201ctram\u201d (25, 0.02%), and \u201csofa\u201d\n(25, 0.02%). The median answer is \u201cbed\u201d (867, 0.737%). Across the entire test set (38,948 QAs),\n9072 (23.29%) overlap in training questions, and 7284 (18.70%) overlap in training question-answer\npairs.\n\n4.2 Model Details\n\n1. VIS+LSTM: The \ufb01rst model is the CNN and LSTM with a dimensionality-reduction weight\nmatrix in the middle; we call this \u201cVIS+LSTM\u201d in our tables and \ufb01gures.\n\n2. 2-VIS+BLSTM: The second model has two image feature inputs, at the start and the end of the\nsentence, with different learned linear transformations, and also has LSTMs going in both the\nforward and backward directions. 
Both LSTMs output to the softmax layer at the last timestep.\nWe call the second model \u201c2-VIS+BLSTM\u201d.\n\n3. IMG+BOW: This simple model performs multinomial logistic regression based on the image\nfeatures without dimensionality reduction (4096 dimension), and a bag-of-word (BOW) vector\nobtained by summing all the learned word vectors of the question.\n\n4. FULL: Lastly, the \u201cFULL\u201d model is a simple average of the three models above.\n\nWe release the complete details of the models at https://github.com/renmengye/\nimageqa-public.\n\n4.3 Baselines\n\nTo evaluate the effectiveness of our models, we designed a few baselines.\n\n1. GUESS: One very simple baseline is to predict the mode based on the question type. For ex-\nample, if the question contains \u201chow many\u201d then the model will output \u201ctwo.\u201d In DAQUAR, the\nmodes are \u201ctable\u201d, \u201ctwo\u201d, and \u201cwhite\u201d and in COCO-QA, the modes are \u201ccat\u201d, \u201ctwo\u201d, \u201cwhite\u201d,\nand \u201croom\u201d.\n\n2. BOW: We designed a set of \u201cblind\u201d models which are given only the questions without the\nimages. One of the simplest blind models performs logistic regression on the BOW vector to\nclassify answers.\n\n3. LSTM: Another \u201cblind\u201d model we experimented with simply inputs the question words into the\n\nLSTM alone.\n\n4. IMG: We also trained a counterpart \u201cdeaf\u201d model. For each type of question, we train a separate\nCNN classi\ufb01cation layer (with all lower layers frozen during training). Note that this model\nknows the type of question, in order to make its performance somewhat comparable to models\nthat can take into account the words to narrow down the answer space. However the model does\nnot know anything about the question except the type.\n\n5. IMG+PRIOR: This baseline combines the prior knowledge of an object and the image under-\nstanding from the \u201cdeaf model\u201d. 
For example, a question asking the color of a white bird \ufb02ying\nin the blue sky may output white rather than blue simply because the prior probability of the bird\nbeing blue is lower. We denote c as the color, o as the class of the object of interest, and x as the\nimage. Assuming o and x are conditionally independent given the color,\n\np(c|o, x) = p(c, o|x) / \u2211_{c\u2208C} p(c, o|x)\n          = p(o|c, x) p(c|x) / \u2211_{c\u2208C} p(o|c, x) p(c|x)\n          = p(o|c) p(c|x) / \u2211_{c\u2208C} p(o|c) p(c|x)    (1)\n\nThis can be computed if p(c|x) is the output of a logistic regression given the CNN features alone,\nand we simply estimate p(o|c) empirically: p\u0302(o|c) = count(o, c) / count(c). We use Laplace smoothing on\nthis empirical distribution.\n\n6. K-NN: In the task of image caption generation, Devlin et al. [30] showed that a nearest neighbors\nbaseline approach actually performs very well. To see whether our model memorizes the training\ndata for answering new questions, we include a K-NN baseline in the results. Unlike image\ncaption generation, here the similarity measure includes both image and text. We use the bag-of-\nwords representation learned from IMG+BOW, and append it to the CNN image features. We use\nEuclidean distance as the similarity metric; it is possible to improve the nearest neighbor result\nby learning a similarity metric.\n\n4.4 Performance Metrics\n\nTo evaluate model performance, we used the plain answer accuracy as well as the Wu-Palmer simi-\nlarity (WUPS) measure [31, 32]. The WUPS calculates the similarity between two words based on\ntheir longest common subsequence in the taxonomy tree. If the similarity between two words is less\nthan a threshold then a score of zero will be given to the candidate answer. Following Malinowski\nand Fritz [32], we measure all models in terms of accuracy, WUPS 0.9, and WUPS 0.0.\n\n4.5 Results and Analysis\n\nTable 2 summarizes the learning results on DAQUAR and COCO-QA. 
For DAQUAR we compare\nour results with [32] and [14]. It should be noted that our DAQUAR results are for the portion of the\ndataset (98.3%) with single-word answers. After the release of our paper, Ma et al. [16] claimed to\nachieve better results on both datasets.\n\nTable 2: DAQUAR and COCO-QA results\n\n                  DAQUAR                      COCO-QA\n                  ACC.    WUPS 0.9  WUPS 0.0  ACC.    WUPS 0.9  WUPS 0.0\nMULTI-WORLD [32]  0.1273  0.1810    0.5147    -       -         -\nGUESS             0.1824  0.2965    0.7759    0.0730  0.1837    0.7413\nBOW               0.3267  0.4319    0.8130    0.3752  0.4854    0.8278\nLSTM              0.3273  0.4350    0.8162    0.3676  0.4758    0.8234\nIMG               -       -         -         0.4302  0.5864    0.8585\nIMG+PRIOR         -       -         -         0.4466  0.6020    0.8624\nK-NN (K=31, 13)   0.3185  0.4242    0.8063    0.4496  0.5698    0.8557\nIMG+BOW           0.3417  0.4499    0.8148    0.5592  0.6678    0.8899\nVIS+LSTM          0.3441  0.4605    0.8223    0.5331  0.6391    0.8825\nASK-NEURON [14]   0.3468  0.4076    0.7954    -       -         -\n2-VIS+BLSTM       0.3578  0.4683    0.8215    0.5509  0.6534    0.8864\nFULL              0.3694  0.4815    0.8268    0.5784  0.6790    0.8952\nHUMAN             0.6027  0.6104    0.7896    -       -         -\n\nFrom the above results we observe that our model outperforms the baselines and the existing ap-\nproach in terms of answer accuracy and WUPS. Our VIS+LSTM and Malinowski et al.\u2019s recurrent\nneural network model [14] achieved somewhat similar performance on DAQUAR. A simple average\nof all three models further boosts the performance by 1-2%, outperforming other models.\nIt is surprising to see that the IMG+BOW model is very strong on both datasets. One limitation of\nour VIS+LSTM model is that we are not able to consume image features as large as 4096 dimensions\nat one time step, so the dimensionality reduction may lose some useful information. We tried to give\nIMG+BOW a 500 dim. 
image vector, and it does worse than VIS+LSTM (\u224848%).\n\nTable 3: COCO-QA accuracy per category\n\n             OBJECT  NUMBER  COLOR   LOCATION\nGUESS        0.0239  0.3606  0.1457  0.0908\nBOW          0.3727  0.4356  0.3475  0.4084\nLSTM         0.3587  0.4534  0.3626  0.3842\nIMG          0.4073  0.2926  0.4268  0.4419\nIMG+PRIOR    -       0.3739  0.4899  0.4451\nK-NN         0.4799  0.3699  0.3723  0.4080\nIMG+BOW      0.5866  0.4410  0.5196  0.4939\nVIS+LSTM     0.5653  0.4610  0.4587  0.4552\n2-VIS+BLSTM  0.5817  0.4479  0.4953  0.4734\nFULL         0.6108  0.4766  0.5148  0.5028\n\nBy comparing the blind versions of the BOW and LSTM models, we hypothesize that in Image QA\ntasks, and in particular on the simple questions studied here, sequential word interaction may not be\nas important as in other natural language tasks.\nIt is also interesting that the blind model does not lose much on the DAQUAR dataset. We speculate\nthat it is likely that the ImageNet images are very different from the indoor scene images, which\nare mostly composed of furniture. However, the non-blind models outperform the blind models\nby a large margin on COCO-QA. There are three possible reasons: (1) the objects in MS-COCO\nresemble the ones in ImageNet more; (2) MS-COCO images have fewer objects whereas the indoor\nscenes have considerable clutter; and (3) COCO-QA has more data to train complex models.\nThere are many interesting examples but due to space limitations we can only show a few in Fig-\nure 1 and Figure 3; full results are available at http://www.cs.toronto.edu/~mren/imageqa/results. For some of the images, we added some extra questions (the ones that have\nan \u201ca\u201d in the question ID); these provide more insight into a model\u2019s representation of the image and\nquestion information, and help elucidate questions that our models may accidentally get correct. 
The\nparentheses in the \ufb01gures represent the con\ufb01dence score given by the softmax layer of the respective\nmodel.\nModel Selection: We did not \ufb01nd that using different word embedding has a signi\ufb01cant impact on\nthe \ufb01nal classi\ufb01cation results. We observed that \ufb01ne-tuning the word embedding results in better\nperformance and normalizing the CNN hidden image features into zero-mean and unit-variance\nhelps achieve faster training time. The bidirectional LSTM model can further boost the result by a\nlittle.\nObject Questions: As the original CNN was trained for the ImageNet challenge, the IMG+BOW\nbene\ufb01ted signi\ufb01cantly from its single object recognition ability. However, the challenging part is\nto consider spatial relations between multiple objects and to focus on details of the image. Our\nmodels only did a moderately acceptable job on this; see for instance the \ufb01rst picture of Figure 1 and\nthe fourth picture of Figure 3. Sometimes a model fails to make a correct decision but outputs the\nmost salient object, while sometimes the blind model can equally guess the most probable objects\nbased on the question alone (e.g., chairs should be around the dining table). Nonetheless, the FULL\nmodel improves accuracy by 50% compared to IMG model, which shows the difference between\npure object classi\ufb01cation and image question answering.\nCounting:\nIn DAQUAR, we could not observe any advantage in the counting ability of the\nIMG+BOW and the VIS+LSTM model compared to the blind baselines. In COCO-QA there is\nsome observable counting ability in very clean images with a single object type. The models can\nsometimes count up to \ufb01ve or six. However, as shown in the second picture of Figure 3, the ability\nis fairly weak as they do not count correctly when different object types are present. 
There is a lot\nof room for improvement in the counting task, and in fact this could be a separate computer vision\nproblem on its own.\n\nCOCOQA 33827\nWhat is the color of the cat?\nGround truth: black\nIMG+BOW: black (0.55)\n2-VIS+LSTM: black (0.73)\nBOW: gray (0.40)\n\nCOCOQA 33827a\nWhat is the color of the couch?\nGround truth: red\nIMG+BOW: red (0.65)\n2-VIS+LSTM: black (0.44)\nBOW: red (0.39)\n\nDAQUAR 1522\nHow many chairs are there?\nGround truth: two\nIMG+BOW: four (0.24)\n2-VIS+BLSTM: one (0.29)\nLSTM: four (0.19)\n\nDAQUAR 1520\nHow many shelves are there?\nGround truth: three\nIMG+BOW: three (0.25)\n2-VIS+BLSTM: two (0.48)\nLSTM: two (0.21)\n\nCOCOQA 14855\nWhere are the ripe bananas sitting?\nGround truth: basket\nIMG+BOW: basket (0.97)\n2-VIS+BLSTM: basket (0.58)\nBOW: bowl (0.48)\n\nCOCOQA 14855a\nWhat are in the basket?\nGround truth: bananas\nIMG+BOW: bananas (0.98)\n2-VIS+BLSTM: bananas (0.68)\nBOW: bananas (0.14)\n\nDAQUAR 585\nWhat is the object on the chair?\nGround truth: pillow\nIMG+BOW: clothes (0.37)\n2-VIS+BLSTM: pillow (0.65)\nLSTM: clothes (0.40)\n\nDAQUAR 585a\nWhere is the pillow found?\nGround truth: chair\nIMG+BOW: bed (0.13)\n2-VIS+BLSTM: chair (0.17)\nLSTM: cabinet (0.79)\n\nFigure 3: Sample questions and responses of our system\n\nColor: In COCO-QA there is a signi\ufb01cant win for the IMG+BOW and the VIS+LSTM against\nthe blind ones on color-type questions. We further discovered that these models are not only able\nto recognize the dominant color of the image but sometimes associate different colors to different\nobjects, as shown in the \ufb01rst picture of Figure 3. However, they still fail on a number of easy\nexamples. Adding prior knowledge provides an immediate gain on the IMG model in terms of\naccuracy on Color and Number questions. 
The gap between the IMG+PRIOR and IMG+BOW\nshows some localized color association ability in the CNN image representation.\n\n5 Conclusion and Current Directions\n\nIn this paper, we consider the image QA problem and present our end-to-end neural network models.\nOur model shows a reasonable understanding of the question and some coarse image understand-\ning, but it is still very na\u00efve in many situations. While recurrent networks are becoming a popular\nchoice for learning image and text, we showed that a simple bag-of-words can perform equally well\ncompared to a recurrent network that is borrowed from an image caption generation framework [1].\nWe proposed a more complete set of baselines which can provide potential insight for developing\nmore sophisticated end-to-end image question answering systems. As the currently available dataset\nis not large enough, we developed an algorithm that helps us collect a large-scale image QA dataset\nfrom image descriptions. Our question generation algorithm is extensible to many image description\ndatasets and can be automated without requiring extensive human effort. We hope that the release\nof the new dataset will encourage more data-driven approaches to this problem in the future.\nImage question answering is a fairly new research topic, and the approach we present here has a\nnumber of limitations. First, our models are just answer classi\ufb01ers. Ideally we would like to permit\nlonger answers which will involve some sophisticated text generation model or structured output.\nBut this will require an automatic free-form answer evaluation metric. Second, we are only focusing\non a limited domain of questions. However, this limited range of questions allows us to study the\nresults more in depth. Lastly, it is also hard to interpret why the models output a certain answer.\nBy comparing our models with some baselines we can roughly infer whether they understood the\nimage. 
Visual attention is another future direction, which could both improve the results (based on\nrecent successes in image captioning [8]) as well as help explain the model prediction by examining\nthe attention output at every timestep.\n\nAcknowledgments\n\nWe would like to thank Nitish Srivastava for the support of Toronto Conv Net, from which we\nextracted the CNN image features. We would also like to thank anonymous reviewers for their\nvaluable and helpful comments.\n\nReferences\n[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, \u201cShow and tell: A neural image caption generator,\u201d in\n\nCVPR, 2015.\n\n[2] R. Kiros, R. Salakhutdinov, and R. S. Zemel, \u201cUnifying visual-semantic embeddings with multimodal\n\nneural language models,\u201d TACL, 2015.\n\n[3] A. Karpathy, A. Joulin, and L. Fei-Fei, \u201cDeep fragment embeddings for bidirectional image sentence\n\nmapping,\u201d in NIPS, 2013.\n\n[4] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, \u201cExplain images with multimodal recurrent neural\n\nnetworks,\u201d NIPS Deep Learning Workshop, 2014.\n\n[5] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell,\n\n\u201cLong-term recurrent convolutional networks for visual recognition and description,\u201d in CVPR, 2014.\n\n[6] X. Chen and C. L. Zitnick, \u201cLearning a recurrent visual representation for image caption generation,\u201d\n\nCoRR, vol. abs/1411.5654, 2014.\n\n[7] H. Fang, S. Gupta, F. N. Iandola, R. Srivastava, L. Deng, P. Doll\u00e1r, J. Gao, X. He, M. Mitchell, J. C. Platt,\n\nC. L. Zitnick, and G. Zweig, \u201cFrom captions to visual concepts and back,\u201d in CVPR, 2015.\n\n[8] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, \u201cShow,\n\nattend and tell: Neural image caption generation with visual attention,\u201d in ICML, 2015.\n\n[9] R. Lebret, P. O. Pinheiro, and R. 
Collobert, \u201cPhrase-based image captioning,\u201d in ICML, 2015.\n[10] B. Klein, G. Lev, G. Lev, and L. Wolf, \u201cFisher vectors derived from hybrid Gaussian-Laplacian mixture\n\nmodels for image annotations,\u201d in CVPR, 2015.\n\n[11] M. Malinowski and M. Fritz, \u201cTowards a visual Turing challenge,\u201d in NIPS Workshop on Learning Se-\n\nmantics, 2014.\n\n[12] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, \u201cIndoor segmentation and support inference from\n\nRGBD images,\u201d in ECCV, 2012.\n\n[13] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, \u201cVQA: Visual Question\n\nAnswering,\u201d CoRR, vol. abs/1505.00468, 2015.\n\n[14] M. Malinowski, M. Rohrbach, and M. Fritz, \u201cAsk Your Neurons: A Neural-based Approach to Answering\n\nQuestions about Images,\u201d CoRR, vol. abs/1505.01121, 2015.\n\n[15] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, \u201cAre you talking to a machine? dataset and\n\nmethods for multilingual image question answering,\u201d CoRR, vol. abs/1505.05612, 2015.\n\n[16] L. Ma, Z. Lu, and H. Li, \u201cLearning to answer questions from image using convolutional neural network,\u201d\n\nCoRR, vol. abs/1506.00333, 2015.\n\n[17] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick, \u201cMicrosoft\n\nCOCO: Common Objects in Context,\u201d in ECCV, 2014.\n\n[18] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Doll\u00e1r, and C. L. Zitnick, \u201cMicrosoft COCO\n\ncaptions: Data collection and evaluation server,\u201d CoRR, vol. abs/1504.00325, 2015.\n\n[19] S. Hochreiter and J. Schmidhuber, \u201cLong short-term memory,\u201d Neural Computation, vol. 9, no. 8, pp.\n\n1735\u20131780, 1997.\n\n[20] K. Simonyan and A. Zisserman, \u201cVery deep convolutional networks for large-scale image recognition,\u201d\n\nin ICLR, 2015.\n\n[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. 
S.\n\nBernstein, A. C. Berg, and L. Fei-Fei, \u201cImagenet large scale visual recognition challenge,\u201d IJCV, 2015.\n\n[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean, \u201cEf\ufb01cient estimation of word representations in vector\n\nspace,\u201d in ICLR, 2013.\n\n[23] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, \u201cDeViSE: A deep\n\nvisual-semantic embedding model,\u201d in NIPS, 2013.\n\n[24] M. Hodosh, P. Young, and J. Hockenmaier, \u201cFraming image description as a ranking task: Data, models\n\nand evaluation metrics,\u201d J. Artif. Intell. Res. (JAIR), vol. 47, pp. 853\u2013899, 2013.\n\n[25] V. Ordonez, G. Kulkarni, and T. L. Berg, \u201cIm2text: Describing images using 1 million captioned pho-\n\ntographs,\u201d in NIPS, 2011.\n\n[26] D. Klein and C. D. Manning, \u201cAccurate unlexicalized parsing,\u201d in ACL, 2003.\n[27] N. Chomsky, Conditions on Transformations. New York: Academic Press, 1973.\n[28] C. Fellbaum, Ed., WordNet An Electronic Lexical Database. Cambridge, MA; London: The MIT Press,\n\nMay 1998.\n\n[29] S. Bird, \u201cNLTK: the natural language toolkit,\u201d in ACL, 2006.\n[30] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, \u201cExploring nearest neighbor approaches\n\nfor image captioning,\u201d CoRR, vol. abs/1505.04467, 2015.\n\n[31] Z. Wu and M. Palmer, \u201cVerb semantics and lexical selection,\u201d in ACL, 1994.\n[32] M. Malinowski and M. Fritz, \u201cA multi-world approach to question answering about real-world scenes\n\nbased on uncertain input,\u201d in NIPS, 2014.", "award": [], "sourceid": 1676, "authors": [{"given_name": "Mengye", "family_name": "Ren", "institution": "University of Toronto"}, {"given_name": "Ryan", "family_name": "Kiros", "institution": "U. Toronto"}, {"given_name": "Richard", "family_name": "Zemel", "institution": "University of Toronto"}]}