{"title": "Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question", "book": "Advances in Neural Information Processing Systems", "page_first": 2296, "page_last": 2304, "abstract": "In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase or a single word. Our model contains four components: a Long Short-Term Memory (LSTM) to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. It contains over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations. The quality of the generated answers of our mQA model on this dataset is evaluated by human judges through a Turing Test. Specifically, we mix the answers provided by humans and our model. The human judges need to distinguish our model from the human. They will also provide a score (i.e. 0, 1, 2, the larger the better) indicating the quality of the answer. We propose strategies to monitor the quality of this evaluation process. The experiments show that in 64.7% of cases, the human judges cannot distinguish our model from humans. The average score is 1.454 (1.918 for human). 
The details of this work, including the FM-IQA dataset, can be found on the project page: \\url{http://idl.baidu.com/FM-IQA.html}.", "full_text": "Are You Talking to a Machine?\n\nDataset and Methods for Multilingual Image Question Answering\n\nHaoyuan Gao1\n\nJunhua Mao2\n1Baidu Research\n\nJie Zhou1 Zhiheng Huang1 Lei Wang1 Wei Xu1\n2University of California, Los Angeles\n\ngaohaoyuan@baidu.com, mjhustc@ucla.edu, {zhoujie01,huangzhiheng,wanglei22,wei.xu}@baidu.com\n\nAbstract\n\nIn this paper, we present the mQA model, which is able to answer questions about\nthe content of an image. The answer can be a sentence, a phrase or a single word.\nOur model contains four components: a Long Short-Term Memory (LSTM) to\nextract the question representation, a Convolutional Neural Network (CNN) to\nextract the visual representation, an LSTM for storing the linguistic context in an\nanswer, and a fusing component to combine the information from the \ufb01rst three\ncomponents and generate the answer. We construct a Freestyle Multilingual Im-\nage Question Answering (FM-IQA) dataset to train and evaluate our mQA model.\nIt contains over 150,000 images and 310,000 freestyle Chinese question-answer\npairs and their English translations. The quality of the generated answers of our\nmQA model on this dataset is evaluated by human judges through a Turing Test.\nSpeci\ufb01cally, we mix the answers provided by humans and our model. The human\njudges need to distinguish our model from the human. They will also provide\na score (i.e. 0, 1, 2, the larger the better) indicating the quality of the answer.\nWe propose strategies to monitor the quality of this evaluation process. The ex-\nperiments show that in 64.7% of cases, the human judges cannot distinguish our\nmodel from humans. The average score is 1.454 (1.918 for human). 
The details\nof this work, including the FM-IQA dataset, can be found on the project page:\nhttp://idl.baidu.com/FM-IQA.html.\n\nIntroduction\n\n1\nRecently, there is increasing interest in the \ufb01eld of multimodal learning for both natural language\nand vision. In particular, many studies have made rapid progress on the task of image captioning\n[26, 15, 14, 40, 6, 8, 4, 19, 16, 42]. Most of them are built based on deep neural networks (e.g.\ndeep Convolutional Neural Networks (CNN [17]), Recurrent Neural Network (RNN [7]) or Long\nShort-Term Memory (LSTM [12])). The large-scale image datasets with sentence annotations (e.g.,\n[21, 43, 11]) play a crucial role in this progress. Despite the success of these methods, there are still\nmany issues to be discussed and explored. In particular, the task of image captioning only requires\ngeneric sentence descriptions of an image. But in many cases, we only care about a particular part\nor object of an image. The image captioning task lacks the interaction between the computer and\nthe user (as we cannot input our preference and interest).\nIn this paper, we focus on the task of visual question answering. In this task, the method needs\nto provide an answer to a freestyle question about the content of an image. We propose the mQA\nmodel to address this task. The inputs of the model are an image and a question. This model has four\ncomponents (see Figure 2). The \ufb01rst component is an LSTM network that encodes a natural language\nsentence into a dense vector representation. The second component is a deep Convolutional Neural\nNetwork [36] that extracted the image representation. This component was pre-trained on ImageNet\nClassi\ufb01cation Task [33] and is \ufb01xed during the training. The third component is another LSTM\nnetwork that encodes the information of the current word and previous words in the answer into\ndense representations. 
The fourth component fuses the information from the first three components\nto predict the next word in the answer. We jointly train the first, third and fourth components by\nmaximizing the probability of the groundtruth answers in the training set using a log-likelihood loss\nfunction. To lower the risk of overfitting, we share the weights of the word embedding\nlayer between the LSTMs in the first and third components. We also adopt the transposed weight\nsharing scheme proposed in [25], which shares weights between the word embedding layer\nand the fully connected Softmax layer.\n\nFigure 1: Sample answers to the visual question generated by our model on the newly proposed\nFreestyle Multilingual Image Question Answering (FM-IQA) dataset.\n\nTo train our method, we construct a large-scale Freestyle Multilingual Image Question Answering\ndataset (FM-IQA, see footnote 1 and the details in Section 4) based on the MS COCO dataset [21]. The current\nversion of the dataset contains 158,392 images with 316,193 Chinese question-answer pairs and\ntheir corresponding English translations (see footnote 2). To diversify the annotations, the annotators are allowed\nto raise any question related to the content of the image. We propose strategies to monitor the\nquality of the annotations. This dataset contains a wide range of AI related questions, such as action\nrecognition (e.g., \u201cIs the man trying to buy vegetables?\u201d), object recognition (e.g., \u201cWhat is there in\nyellow?\u201d), positions and interactions among objects in the image (e.g. \u201cWhere is the kitty?\u201d) and\nreasoning based on commonsense and visual content (e.g. \u201cWhy does the bus park there?\u201d, see last\ncolumn of Figure 3).\nBecause of the variability of the freestyle question-answer pairs, it is hard to accurately evaluate\nthe method with automatic metrics. 
We conduct a Visual Turing Test [38] using human judges.\nSpeci\ufb01cally, we mix the question-answer pairs generated by our model with the same set of question-\nanswer pairs labeled by annotators. The human judges need to determine whether the answer is\ngiven by a model or a human. In addition, we also ask them to give a score of 0 (i.e. wrong), 1 (i.e.\npartially correct), or 2 (i.e. correct). The results show that our mQA model passes 64.7% of this\ntest (treated as answers of a human) and the average score is 1.454. In the discussion, we analyze\nthe failure cases of our model and show that combined with the m-RNN [24] model, our model can\nautomatically ask a question about an image and answer that question.\n\n2 Related Work\nRecent work has made signi\ufb01cant progress using deep neural network models in both the \ufb01elds of\ncomputer vision and natural language. For computer vision, methods based on Convolutional Neural\nNetwork (CNN [20]) achieve the state-of-the-art performance in various tasks, such as object clas-\nsi\ufb01cation [17, 34, 17], detection [10, 44] and segmentation [3]. For natural language, the Recurrent\nNeural Network (RNN [7, 27]) and the Long Short-Term Memory network (LSTM [12]) are also\nwidely used in machine translation [13, 5, 35] and speech recognition [28].\nThe structure of our mQA model is inspired by the m-RNN model [24] for the image captioning and\nimage-sentence retrieval tasks. It adopts a deep CNN for vision and a RNN for language. We extend\nthe model to handle the input of question and image pairs, and generate answers. In the experiments,\nwe \ufb01nd that we can learn how to ask a good question about an image using the m-RNN model and\nthis question can be answered by our mQA model.\nThere has been recent effort on the visual question answering task [9, 2, 22, 37]. However, most of\nthem use a pre-de\ufb01ned and restricted set of questions. Some of these questions are generated from a\ntemplate. 
In addition, our FM-IQA dataset is much larger than theirs (e.g., there are only 2591 and\n1449 images for [9] and [22] respectively).\n\n1 We are actively developing and expanding the dataset; please find the latest information on the project page: http://idl.baidu.com/FM-IQA.html\n\n2 The results reported in this paper are obtained from a model trained on the first version of the dataset (a subset of the current version), which contains 120,360 images and 250,569 question-answer pairs.\n\nFigure 1 samples (question and answer, with English translations):\nQuestion: \u516c\u5171\u6c7d\u8f66\u662f\u4ec0\u4e48\u989c\u8272\u7684\uff1f (What is the color of the bus?) Answer: \u516c\u5171\u6c7d\u8f66\u662f\u7ea2\u8272\u7684\u3002 (The bus is red.)\nQuestion: \u8349\u5730\u4e0a\u9664\u4e86\u4eba\u4ee5\u5916\u8fd8\u6709\u4ec0\u4e48\u52a8\u7269\uff1f (What is there on the grass, except the person?) Answer: \u7f8a\u3002 (Sheep.)\nQuestion: \u89c2\u5bdf\u4e00\u4e0b\u8bf4\u51fa\u98df\u7269\u91cc\u4efb\u610f\u4e00\u79cd\u852c\u83dc\u7684\u540d\u5b57\uff1f (Please look carefully and tell me what is the name of the vegetables in the plate?) Answer: \u897f\u5170\u82b1\u3002 (Broccoli.)\nQuestion: \u732b\u54aa\u5728\u54ea\u91cc\uff1f (Where is the kitty?) Answer: \u5728\u6905\u5b50\u4e0a\u3002 (On the chair.)\nQuestion: \u9ec4\u8272\u7684\u662f\u4ec0\u4e48\uff1f (What is there in yellow?) Answer: \u9999\u8549\u3002 (Bananas.)\n\nFigure 2: Illustration of the mQA model architecture. We input an image and a question about the\nimage (e.g. \u201cWhat is the cat doing?\u201d) to the model. The model is trained to generate the answer to\nthe question (e.g. \u201cSitting on the umbrella\u201d). The weight matrix in the word embedding layers of\nthe two LSTMs (one for the question and one for the answer) is shared. In addition, as in [25], this\nweight matrix is also shared, in a transposed manner, with the weight matrix in the Softmax layer.\nDifferent colors in the figure represent different components of the model. (Best viewed in color.)\n\nThere are some concurrent and independent works on this topic: [1, 23, 32]. 
[1] propose a large-\nscale dataset also based on MS COCO. They also provide some simple baseline methods on this\ndataset. Compared to them, we propose a stronger model for this task and evaluate our method using\nhuman judges. Our dataset also contains two different kinds of language, which can be useful for\nother tasks, such as machine translation. Because we use a different set of annotators and different\nrequirements of the annotation, our dataset and the [1] can be complementary to each other, and lead\nto some interesting topics, such as dataset transferring for visual question answering.\nBoth [23] and [32] use a model containing a single LSTM and a CNN. They concatenate the question\nand the answer (for [32], the answer is a single word. [23] also prefer a single word as the answer),\nand then feed them to the LSTM. Different from them, we use two separate LSTMs for questions\nand answers respectively in consideration of the different properties (e.g. grammar) of questions and\nanswers, while allow the sharing of the word-embeddings. For the dataset, [23] adopt the dataset\nproposed in [22], which is much smaller than our FM-IQA dataset. [32] utilize the annotations in\nMS COCO and synthesize a dataset with four pre-de\ufb01ned types of questions (i.e. object, number,\ncolor, and location). They also synthesize the answer with a single word. Their dataset can also be\ncomplementary to ours.\n\n3 The Multimodal QA (mQA) Model\nWe show the architecture of our mQA model in Figure 2. The model has four components: (I). a\nLong Short-Term Memory (LSTM [12]) for extracting semantic representation of a question, (II). a\ndeep Convolutional Neural Network (CNN) for extracting the image representation, (III). an LSTM\nto extract representation of the current word in the answer and its linguistic context, and (IV). a\nfusing component that incorporates the information from the \ufb01rst three parts together and generates\nthe next word in the answer. 
These four components can be jointly trained together (see footnote 3). The details\nof the four model components are described in Section 3.1. The effectiveness of the important\ncomponents and strategies is analyzed in Section 5.3.\nThe inputs of the model are a question and the reference image. The model is trained to generate\nthe answer. The words in the question and answer are represented by one-hot vectors (i.e. binary\nvectors whose length is the dictionary size N and which have only one non-zero element, indicating the word's\nindex in the word dictionary). We add a <BOA> sign and an <EOA> sign, as two special words in\nthe word dictionary, at the beginning and the end of the training answers respectively. They will be\nused for generating the answer to the question in the testing stage.\nIn the testing stage, we first input an image and a question about the image into the model. To\ngenerate the answer, we start with the start sign <BOA> and use the model to calculate the probability\ndistribution of the next word. We then use a beam search scheme that keeps the best K candidates\nwith the maximum probabilities according to the Softmax layer. We repeat the process until the\nmodel generates the end sign of the answer <EOA>.\n\n3 In practice, we fix the CNN part because the gradient returned from the LSTM is very noisy. Finetuning the CNN takes a much longer time than just fixing it, and does not improve the performance significantly.\n\n3.1 The Four Components of the mQA Model\n(I). The semantic meaning of the question is extracted by the first component of the model. It\ncontains a 512 dimensional word embedding layer and an LSTM layer with 400 memory cells. 
The\nfunction of the word embedding layer is to map the one-hot vector of the word into a dense semantic\nspace. We feed this dense word representation into the LSTM layer.\nLSTM [12] is a Recurrent Neural Network [7] that is designed to address the exploding or\nvanishing gradient problem. The LSTM layer stores the context information in its memory cells and serves\nas the bridge among the words in a sequence (e.g. a question). To model the long term dependency\nin the data more effectively, the LSTM adds three gate nodes to the traditional RNN structure: the input\ngate, the output gate and the forget gate. The input gate and output gate regulate the read and write\naccess to the LSTM memory cells. The forget gate resets the memory cells when their contents\nare out of date. Different from [23, 32], the image representation is not fed into the LSTM in\nthis component. We believe this is reasonable because questions are just another input source for\nthe model, so we should not add images as the supervision for them. The information stored in the\nLSTM memory cells of the last word in the question (i.e. the question mark) will be treated as the\nrepresentation of the sentence.\n(II). The second component is a deep Convolutional Neural Network (CNN) that generates the\nrepresentation of an image. In this paper, we use GoogLeNet [36]. Note that other CNN models,\nsuch as AlexNet [17] and VGGNet [34], can also be used as the component in our model. We remove\nthe final Softmax layer of the deep CNN and connect the remaining top layer to our model.\n(III). The third component also contains a word embedding layer and an LSTM. The structure is\nsimilar to the first component. The activation of the memory cells for the words in the answer, as\nwell as the word embeddings, will be fed into the fusing component to generate the next words in\nthe answer.\nIn [23, 32], they concatenate the training question and answer, and use a single LSTM. 
Because of\nthe different properties (e.g. grammar) of questions and answers, in this paper we use two separate\nLSTMs for questions and answers respectively. We denote the LSTMs for the question and the\nanswer as LSTM(Q) and LSTM(A) respectively in the rest of the paper. The weight matrices of\nLSTM(Q) in the first component are not shared with those of LSTM(A) in the third component. Note that the semantic meaning\nof single words should be the same for questions and answers, so we share the parameters of the\nword-embedding layer between the first and third components.\n(IV). Finally, the fourth component fuses the information from the first three components. Specifically,\nthe activation of the fusing layer f(t) for the t-th word in the answer is calculated as follows:\n\nf(t) = g(V_{r_Q} r_Q + V_I I + V_{r_A} r_A(t) + V_w w(t))    (1)\n\nwhere \u201c+\u201d denotes element-wise addition, r_Q stands for the activation of the LSTM(Q) memory\ncells of the last word in the question, I denotes the image representation, and r_A(t) and w(t) denote\nthe activation of the LSTM(A) memory cells and the word embedding of the t-th word in the answer\nrespectively. V_{r_Q}, V_I, V_{r_A}, and V_w are the weight matrices that need to be learned. g(.) is an\nelement-wise non-linear function.\nAfter the fusing layer, we build an intermediate layer that maps the dense multimodal representation\nin the fusing layer back to the dense word representation. We then build a fully connected Softmax\nlayer to predict the probability distribution of the next word in the answer. This strategy allows the\nweight sharing between the word embedding layer and the fully connected Softmax layer as introduced\nin [25] (see details in Section 3.2).\nSimilar to [25], we use the sigmoid function as the activation function of the three gates and adopt\nReLU [30] as the non-linear function for the LSTM memory cells. 
The non-linear activation function\nfor the word embedding layer, the fusing layer and the intermediate layer is the scaled hyperbolic\ntangent function [20]: g(x) = 1.7159 \u00b7 tanh(2x/3).\n3.2 The Weight Sharing Strategy\nAs mentioned in Section 2, our model adopts different LSTMs for the question and the answer\nbecause of the different grammar properties of questions and answers. However, the meaning of\nsingle words in both questions and answers should be the same. Therefore, we share the weight\nmatrix between the word-embedding layers of the first component and the third component.\nIn addition, this weight matrix for the word-embedding layers is shared with the weight matrix in\nthe fully connected Softmax layer in a transposed manner. Intuitively, the function of the weight\nmatrix in the word-embedding layer is to encode the one-hot word representation into a dense word\nrepresentation. The function of the weight matrix in the Softmax layer is to decode the dense word\nrepresentation into a pseudo one-word representation, which is the inverse operation of the\nword-embedding. This strategy removes nearly half of the parameters in the model and is shown to\nhave better performance in image captioning and novel visual concept learning tasks [25].\n3.3 Training Details\nThe CNN we used is pre-trained on the ImageNet classification task [33]. This component is fixed\nduring the QA training. We adopt a log-likelihood loss defined on the word sequence of the answer.\nMinimizing this loss function is equivalent to maximizing the probability of the model to generate\nthe groundtruth answers in the training set. We jointly train the first, third and fourth\ncomponents using the stochastic gradient descent method. The initial learning rate is 1 and we decrease it\nby a factor of 10 for every epoch of the data. We stop the training when the loss on the validation\nset does not decrease within three epochs. 
The hyperparameters of the model are selected by\ncross-validation.\nFor the Chinese question answering task, we segment the sentences into several word phrases. These\nphrases can be treated equivalently to the English words.\n4 The Freestyle Multilingual Image Question Answering (FM-IQA) Dataset\nOur method is trained and evaluated on a large-scale multilingual visual question answering dataset.\nIn Section 4.1, we will describe the process to collect the data, and the method to monitor the quality\nof annotations. Some statistics and examples of the dataset will be given in Section 4.2. The latest\ndataset is available on the project page: http://idl.baidu.com/FM-IQA.html\n4.1 The Data Collection\nWe start with the 158,392 images from the newly released MS COCO [21] training, validation and\ntesting set as the initial image set. The annotations are collected using Baidu\u2019s online crowdsourc-\ning server4. To make the labeled question-answer pairs diversi\ufb01ed, the annotators are free to give\nany type of questions, as long as these questions are related to the content of the image. The ques-\ntion should be answered by the visual content and commonsense (e.g., we are not expecting to get\nquestions such as \u201cWhat is the name of the person in the image?\u201d). The annotators need to give an\nanswer to the question themselves.\nOn the one hand, the freedom we give to the annotators is bene\ufb01cial in order to get a freestyle,\ninteresting and diversi\ufb01ed set of questions. On the other hand, it makes it harder to control the\nquality of the annotation compared to a more detailed instruction. To monitor the annotation quality,\nwe conduct an initial quality \ufb01ltering stage. Speci\ufb01cally, we randomly sampled 1,000 images as\na quality monitoring dataset from the MS COCO dataset as an initial set for the annotators (they\ndo not know this is a test). 
We then sample some annotations and rate their quality after each\nannotator finishes some labeling on this quality monitoring dataset (about 20 question-answer pairs\nper annotator). We only select a small number of annotators (195 individuals) whose annotations are\nsatisfactory (i.e. the questions are related to the content of the image and the answers are correct).\nWe also give preference to the annotators who provide interesting questions that require high level\nreasoning to give the answer. Only the selected annotators are permitted to label the rest of the\nimages. We pick a set of good and bad examples of the annotated question-answer pairs from the\nquality monitoring dataset, and show them to the selected annotators as references. We also provide\nreasons for selecting these examples. After the annotation of all the images is finished, we further\nrefine the dataset and remove a small portion of the images with badly labeled questions and answers.\n\n4 http://test.baidu.com\n\n4.2 The Statistics of the Dataset\nCurrently there are 158,392 images with 316,193 Chinese question-answer pairs and their English\ntranslations. Each image has at least two question-answer pairs as annotations. The average lengths\nof the questions and answers are 7.38 and 3.82 respectively, measured in Chinese words. Some\nsample images are shown in Figure 3. We randomly sampled 1,000 question-answer pairs and their\ncorresponding images as the test set.\n\nFigure 3: Sample images in the FM-IQA dataset. This dataset contains 316,193 Chinese\nquestion-answer pairs with corresponding English translations.\n\nThe questions in this dataset are diversified, which requires a vast set of AI capabilities in order\nto answer them. 
They contain some relatively simple image understanding questions of, e.g., the\nactions of objects (e.g., \u201cWhat is the boy in green cap doing?\u201d), the object class (e.g., \u201cIs there any\nperson in the image?\u201d), the relative positions and interactions among objects (e.g., \u201cIs the computer\non the right or left side of the gentleman?\u201d), and the attributes of the objects (e.g., \u201cWhat is the color\nof the frisbee?\u201d). In addition, the dataset contains some questions that need a high-level reasoning\nwith clues from vision, language and commonsense. For example, to answer the question of \u201cWhy\ndoes the bus park there?\u201d, we should know that this question is about the parked bus in the image\nwith two men holding tools at the back. Based on our commonsense, we can guess that there might\nbe some problems with the bus and the two men in the image are trying to repair it. These questions\nare hard to answer but we believe they are actually the most interesting part of the questions in the\ndataset. We categorize the questions into 8 types and show the statistics of them on the project page.\nThe answers are also diversi\ufb01ed. The annotators are allowed to give a single phrase or a single word\nas the answer (e.g. \u201cYellow\u201d) or, they can give a complete sentence (e.g. \u201cThe frisbee is yellow\u201d).\n\n5 Experiments\n\nFor the very recent works for visual question answering ([32, 23]), they test their method on the\ndatasets where the answer of the question is a single word or a short phrase. Under this setting,\nit is plausible to use automatic evaluation metrics that measure the single word similarity, such\nas Wu-Palmer similarity measure (WUPS) [41]. However, for our newly proposed dataset, the\nanswers in the dataset are freestyle and can be complete sentences. For most of the cases, there are\nnumerous choices of answers that are all correct. 
The possible alternatives are BLEU score [31],\nMETEOR [18], CIDEr [39] or other metrics that are widely used in the image captioning task [24].\nThe problem of these metrics is that there are only a few words in an answer that are semantically\ncritical. These metrics tend to give equal weights (e.g. BLEU and METEOR) or different weights\naccording to the tf-idf frequency term (e.g. CIDEr) of the words in a sentence, hence cannot fully\nshow the importance of the keywords. The evaluation of the image captioning task suffers from the\nsame problem (not as severe as question answering because it only needs a general description).\nTo avoid these problems, we conduct a real Visual Turing Test using human judges for our model,\nwhich will be described in detail in Section 5.1. In addition, we rate each generated sentence\nwith a score (the larger the better) in Section 5.2, which gives a more fine-grained evaluation of our\nmethod. In Section 5.3, we provide the performance comparisons of different variants of our mQA\nmodel on the validation set.\n\nFigure 3 samples (ground-truth question and answer, with English translations):\nQuestion: \u6234\u5e3d\u5b50\u7684\u7537\u5b69\u5728\u5e72\u4ec0\u4e48\uff1f (What is the boy in green cap doing?) Answer: \u4ed6\u5728\u73a9\u6ed1\u677f\u3002 (He is playing skateboard.)\nQuestion: \u56fe\u7247\u4e2d\u6709\u4eba\u4e48\uff1f (Is there any person in the image?) Answer: \u6709\u3002 (Yes.)\nQuestion: \u7535\u8111\u5728\u8001\u4eba\u7684\u5de6\u9762\u8fd8\u662f\u53f3\u9762\uff1f (Is the computer on the right hand or left hand side of the gentleman?) Answer: \u53f3\u624b\u4fa7\u3002 (On the right hand side.)\nQuestion: \u98de\u76d8\u662f\u4ec0\u4e48\u989c\u8272\uff1f (What is the color of the frisbee?) Answer: \u9ec4\u8272\u3002 (Yellow.)\nQuestion: \u516c\u4ea4\u8f66\u505c\u5728\u90a3\u5e72\u5417\uff1f (Why does the bus park there?) Answer: \u51c6\u5907\u7ef4\u4fee\u3002 (Preparing for repair.)\nQuestion: \u623f\u95f4\u91cc\u7684\u6c99\u53d1\u662f\u4ec0\u4e48\u8d28\u5730\u7684\uff1f (What is the texture of the sofa in the room?) Answer: \u5e03\u827a\u3002 (Cloth.) 
Question: \u8fd9\u4e2a\u4eba\u5728\u6311\u83dc\u4e48\uff1f (Is the man trying to buy vegetables?) Answer: \u662f\u7684\u3002 (Yes.)\nQuestion: \u8fd9\u4e2a\u86cb\u7cd5\u662f\u51e0\u5c42\u7684\uff1f (How many layers are there for the cake?) Answer: \u516d\u5c42\u3002 (Six.)\nQuestion: \u8fd9\u4e9b\u4eba\u5728\u505a\u4ec0\u4e48\uff1f (What are the people doing?) Answer: \u6253\u96e8\u4f1e\u6b65\u884c\u3002 (Walking with umbrellas.)\nQuestion: \u624b\u673a\uff0c\u9f20\u6807\uff0c\u7535\u8111\u6df7\u653e\u8868\u793a\u4ec0\u4e48\uff1f (What does it indicate when the phone, mouse and laptop are placed together?) Answer: \u4e3b\u4eba\u56f0\u4e86\uff0c\u7761\u7740\u4e86 (Their owner is tired and sleeping.)\n\nTable 1: The results of our mQA model for our FM-IQA dataset. Pass, Fail and Pass Rate (%) are from the Visual Turing Test; the columns 2, 1, 0 and Avg. Score are the human rated scores.\n\nModel | Pass | Fail | Pass Rate (%) | 2 | 1 | 0 | Avg. Score\nHuman | 948 | 52 | 94.8 | 927 | 64 | 9 | 1.918\nblind-QA | 340 | 660 | 34.0 | - | - | - | -\nmQA | 647 | 353 | 64.7 | 628 | 198 | 174 | 1.454\n\n5.1 The Visual Turing Test\nIn this Visual Turing Test, a human judge will be presented with an image, a question and the answer\nto the question generated by the testing model or by human annotators. The judge needs to determine,\nbased on the answer, whether the answer is given by a human (i.e. pass the test) or a machine (i.e.\nfail the test).\nIn practice, we use the images and questions from the test set of our FM-IQA dataset. We use our\nmQA model to generate the answer for each question. We also implement a baseline model of\nquestion answering without visual information. The structure of this baseline model is similar to\nmQA, except that we do not feed the image information extracted by the CNN into the fusing layer.\nWe denote it as blind-QA. The answers generated by our mQA model, the blind-QA model and\nthe groundtruth answers are mixed together. This leads to 3,000 question-answer pairs with the\ncorresponding images, which are randomly assigned to 12 human judges.\nThe results are shown in Table 1. 
It shows that 64.7% of the answers generated by our mQA model\nare treated as answers provided by a human. The blind-QA model performs very badly in this task, but\nsome of its generated answers pass the test. Because some of the questions are actually multi-choice\nquestions, it is possible to get a correct answer by random guess based on pure linguistic clues.\nTo study the variance of the VTT evaluation across different sets of human judges, we conduct\ntwo additional evaluations with different groups of judges under the same setting. The standard\ndeviations of the passing rate are 0.013, 0.019 and 0.024 for human, the blind-QA model and\nthe mQA model respectively. It shows that VTT is a stable and reliable evaluation metric for this task.\n\n5.2 The Score of the Generated Answer\nThe Visual Turing Test only gives a rough evaluation of the generated answers. We also conduct a\nfine-grained evaluation with scores of \u201c0\u201d, \u201c1\u201d, or \u201c2\u201d. \u201c0\u201d and \u201c2\u201d mean that the answer is totally\nwrong and perfectly correct respectively. \u201c1\u201d means that the answer is only partially correct (e.g.,\nthe general categories are right but the sub-categories are wrong) and makes sense to the human\njudges. The human judges for this task are not necessarily the same people as for the Visual Turing\nTest. After collecting the results, we find that some human judges also rate an answer with \u201c1\u201d if the\nquestion is very hard to answer so that even a human, without carefully looking at the image, will\npossibly make mistakes. We show randomly sampled images whose scores are \u201c1\u201d in Figure 4.\nThe results are shown in Table 1. We show that among the answers that are not perfectly correct (i.e.\nscores are not 2), over half of them are partially correct. Similar to the VTT evaluation process, we\nalso conduct two additional groups of this scoring evaluation. 
The standard deviations of human\nand our mQA model are 0.020 and 0.041 respectively. In addition, for 88.3% and 83.9% of the\ncases, the three groups give the same score for human and our mQA model respectively.\n\nFigure 4: Random examples of the answers generated by the mQA model with score \u201c1\u201d given by\nthe human judges.\nImage Question Answer \u76d8\u5b50\u91cc\u6709\u4ec0\u4e48\uff1f What is in the plate? \u98df\u7269\u3002 food. \u72d7\u5728\u5e72\u561b\uff1f What is the dog doing? \u5728\u51b2\u6d6a\u3002 Surfing in the sea. \u5c0f\u732b\u5728\u54ea\u91cc\uff1f Where is the cat? \u5e8a\u4e0a\u3002 On the bed. \u8fd9\u662f\u4ec0\u4e48\u8f66\uff1f What is the type of the vehicle? \u706b\u8f66 Train. \u8fd9\u662f\u4ec0\u4e48\uff1f What is there in the image? \u8fd9\u662f\u949f\u8868\u3002 There is a clock.\n\nFigure 5: The sample generated questions by our model and their answers.\n\n5.3 Performance Comparisons of the Different mQA Variants\n\nTable 2: Performance comparisons of the different mQA variants.\n\nVariant | Word Error | Loss\nmQA-avg-question | 0.442 | 2.17\nmQA-same-LSTMs | 0.439 | 2.09\nmQA-noTWS | 0.438 | 2.14\nmQA-complete | 0.393 | 1.91\n\nIn order to show the effectiveness of the different components and strategies of our mQA model, we\nimplement three variants of the mQA in Figure 2. For the first variant (i.e. \u201cmQA-avg-question\u201d), we\nreplace the first LSTM component of the model (i.e. the LSTM to extract the question embedding)\nwith the average embedding of the words in the question using word2vec [29]. It is used to show the\neffectiveness of the LSTM as a question embedding learner and extractor. For the second variant\n(i.e. \u201cmQA-same-LSTMs\u201d), we use two shared-weights LSTMs to model question and answer. It is used\nto show the effectiveness of the decoupling strategy of the weights of\nthe LSTM(Q) and the LSTM(A) in our model. For the\nthird variant (i.e. 
\u201cmQA-noTWS\u201d), we do not adopt the Transposed Weight Sharing (TWS) strategy.\nIt is used to show the effectiveness of TWS.\nThe word error rates and losses of the three variants and the complete mQA model (i.e. mQA-complete)\nare shown in Table 2. All three variants perform worse than the complete mQA model in both word error and loss.\n\n6 Discussion\nIn this paper, we present the mQA model, which is able to give a sentence or a phrase as the answer\nto a freestyle question for an image. To validate the effectiveness of the method, we construct\na Freestyle Multilingual Image Question Answering (FM-IQA) dataset containing over 310,000\nquestion-answer pairs. We evaluate our method using human judges through a real Turing Test. The test\nshows that 64.7% of the answers given by our mQA model are treated as answers provided by a\nhuman. The FM-IQA dataset can be used for other tasks, such as visual machine translation, where\nthe visual information can serve as context information that helps to remove ambiguity of the words\nin a sentence.\nWe also modified the LSTM in the first component to the multimodal LSTM shown in [25]. This\nmodification allows us to generate a freestyle question about the content of an image, and provide an\nanswer to this question. We show some sample results in Figure 5.\nWe show some failure cases of our model in Figure 6. The model sometimes makes mistakes when\nthe commonsense reasoning through background scenes is incorrect (e.g., for the image in the first\ncolumn, our method says that the man is surfing but the small yellow frisbee in the image indicates\nthat he is actually trying to catch the frisbee). It also makes mistakes when the target object that the\nquestion focuses on is too small or looks very similar to other objects (e.g. images in the second and\nfourth columns). Another interesting example is the image and question in the fifth column of Figure\n6. 
Answering this question is very hard since it needs high level reasoning based on the experience\nfrom everyday life. Our model outputs a <OOV> sign, which is a special word we use when the\nmodel meets a word which it has not seen before (i.e. does not appear in its word dictionary).\nIn future work, we will try to address these issues by incorporating more visual and linguistic\ninformation (e.g. using object detection or using attention models).\n\nImage Generated Question Answer \u8fd9\u662f\u5728\u4ec0\u4e48\u5730\u65b9\uff1f Where is this? \u8fd9\u662f\u5728\u53a8\u623f\u3002 This is the kitchen room. \u8fd9\u662f\u4ec0\u4e48\u98df\u7269\uff1f What kind of food is this? \u62ab\u8428\u3002 Pizza. \u7535\u8111\u5728\u54ea\u91cc\uff1f Where is the computer? \u5728\u684c\u5b50\u4e0a\u3002 On the desk. \u8fd9\u4e2a\u4eba\u5728\u6253\u7f51\u7403\u4e48\uff1f Is this guy playing tennis? \u662f\u7684\u3002 Yes.\n\nFigure 6: Failure cases of our mQA model on the FM-IQA dataset.\nImage Question GT Answer \u5e05\u54e5\u5728\u5e72\u4ec0\u4e48\uff1f What is the handsome boy doing? \u5728\u6293\u98de\u76d8\u3002 Trying to catch the frisbee. \u8fd9\u662f\u4ec0\u4e48\uff1f What is there in the image? \u8fd9\u662f\u725b\u3002 They are buffalos. \u8fd9\u662f\u4ec0\u4e48\u8f66\uff1f What is the type of the vehicle? \u706b\u8f66\u3002 Train. \u76d8\u5b50\u91cc\u6709\u4ec0\u4e48\u6c34\u679c \uff1f Which fruit is there in the plate? \u82f9\u679c\u548c\u6a59\u5b50 \u3002 Apples and oranges. mQA Answer \u51b2\u6d6a \u3002 Surfing. \u8349\u539f\u4e0a\u7684\u9a6c\u7fa4 \u3002 Horses on the grassland. \u9999\u8549\u548c\u6a59\u5b50 \u3002 Bananas and oranges. \u516c\u4ea4\u6c7d\u8f66\u3002 Bus. \u516c\u4ea4\u8f66\u505c\u5728\u90a3\u5e72\u5417\uff1f Why does the bus park there? \u51c6\u5907\u7ef4\u4fee\u3002 Preparing for repair. <OOV>\u3002 <OOV> (I do not know.)\n\nReferences\n[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. 
Vqa: Visual question answering. arXiv preprint arXiv:1505.00468, 2015.\n[2] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. Vizwiz: nearly real-time answers to visual questions. In ACM symposium on User interface software and technology, pages 333\u2013342, 2010.\n[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. ICLR, 2015.\n[4] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. In CVPR, 2015.\n[5] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.\n[6] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.\n[7] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179\u2013211, 1990.\n[8] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Doll\u00e1r, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.\n[9] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. PNAS, 112(12):3618\u20133623, 2015.\n[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.\n[11] M. Grubinger, P. Clough, H. M\u00fcller, and T. Deselaers. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, pages 13\u201323, 2006.\n[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780, 1997.\n[13] N. 
Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700\u20131709, 2013.\n[14] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.\n[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.\n[16] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399, 2014.\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.\n[18] A. Lavie and A. Agarwal. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgements. In Workshop on Statistical Machine Translation, pages 228\u2013231. Association for Computational Linguistics, 2007.\n[19] R. Lebret, P. O. Pinheiro, and R. Collobert. Simple image description generator via a linear phrase-based approach. arXiv preprint arXiv:1412.8419, 2014.\n[20] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. M\u00fcller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9\u201348. 2012.\n[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.\n[22] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682\u20131690, 2014.\n[23] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. arXiv preprint arXiv:1505.01121, 2015.\n[24] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). 
In ICLR, 2015.\n[25] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. arXiv preprint arXiv:1504.06692, 2015.\n[26] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. NIPS DeepLearning Workshop, 2014.\n[27] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.\n[28] T. Mikolov, M. Karafi\u00e1t, L. Burget, J. \u010cernock\u00fd, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045\u20131048, 2010.\n[29] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111\u20133119, 2013.\n[30] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807\u2013814, 2010.\n[31] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311\u2013318, 2002.\n[32] M. Ren, R. Kiros, and R. Zemel. Image question answering: A visual semantic embedding model and a new dataset. arXiv preprint arXiv:1505.02074, 2015.\n[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.\n[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[35] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104\u20133112, 2014.\n[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. 
arXiv preprint arXiv:1409.4842, 2014.\n[37] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu. Joint video and text parsing for understanding events and answering queries. MultiMedia, IEEE, 21(2):42\u201370, 2014.\n[38] A. M. Turing. Computing machinery and intelligence. Mind, pages 433\u2013460, 1950.\n[39] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.\n[40] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.\n[41] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In ACL, pages 133\u2013138, 1994.\n[42] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.\n[43] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, pages 479\u2013488, 2014.\n[44] J. Zhu, J. Mao, and A. L. Yuille. Learning from weakly supervised data by the expectation loss svm (e-svm) algorithm. In NIPS, pages 1125\u20131133, 2014.\n", "award": [], "sourceid": 1360, "authors": [{"given_name": "Haoyuan", "family_name": "Gao", "institution": "Baidu"}, {"given_name": "Junhua", "family_name": "Mao", "institution": "UCLA"}, {"given_name": "Jie", "family_name": "Zhou", "institution": "Baidu"}, {"given_name": "Zhiheng", "family_name": "Huang", "institution": "Baidu"}, {"given_name": "Lei", "family_name": "Wang", "institution": "Baidu"}, {"given_name": "Wei", "family_name": "Xu", "institution": "Baidu"}]}