{"title": "Chain of Reasoning for Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 275, "page_last": 285, "abstract": "Reasoning plays an essential role in Visual Question Answering (VQA). Multi-step and dynamic reasoning is often necessary for answering complex questions. For example, a question \"What is placed next to the bus on the right of the picture?\" talks about a compound object \"bus on the right,\" which is generated by the relation <bus, on the right of, picture>. Furthermore, a new relation including this compound object <sign, next to, bus on the right> is then required to infer the answer. However, previous methods support either one-step or static reasoning, without updating relations or generating compound objects. This paper proposes a novel reasoning model for addressing these problems. A chain of reasoning (CoR) is constructed for supporting multi-step and dynamic reasoning on changed relations and objects. In detail, iteratively, the relational reasoning operations form new relations between objects, and the object refining operations generate new compound objects from relations. We achieve new state-of-the-art results on four publicly available datasets. The visualization of the chain of reasoning illustrates the progress that the CoR generates new compound objects that lead to the answer of the question step by step.", "full_text": "Chain of Reasoning for Visual Question Answering\n\nChenfei Wu\u2217, Jinlai Liu\u2217, Xiaojie Wang, Xuan Dong\n\n{wuchenfei,liujinlai, xjwang, dongxuan8811}@bupt.edu.cn\n\nCenter for Intelligence Science and Technology\n\nBeijing University of Posts and Telecommunications\n\nAbstract\n\nReasoning plays an essential role in Visual Question Answering (VQA). Multi-step\nand dynamic reasoning is often necessary for answering complex questions. 
For example, a question \u201cWhat is placed next to the bus on the right of the picture?\u201d talks about a compound object \u201cbus on the right,\u201d which is generated by the relation <bus, on the right of, picture>. Furthermore, a new relation including this compound object <sign, next to, bus on the right> is then required to infer the answer. However, previous methods support either one-step or static reasoning, without updating relations or generating compound objects. This paper proposes a novel reasoning model for addressing these problems. A chain of reasoning (CoR) is constructed for supporting multi-step and dynamic reasoning on changed relations and objects. In detail, iteratively, the relational reasoning operations form new relations between objects, and the object refining operations generate new compound objects from relations. We achieve new state-of-the-art results on four publicly available datasets. The visualization of the chain of reasoning illustrates the progress that the CoR generates new compound objects that lead to the answer of the question step by step.\n\n1 Introduction\n\n\u201cThe technical issues of acquiring knowledge, representing it, and using it appropriately to construct and explain lines-of-reasoning, are important problems in the design of knowledge-based systems, which illuminates the art of Artificial Intelligence\u201d [1]. Advances in image and language processing have produced powerful tools for knowledge representation, such as long short-term memory (LSTM) [2] and convolutional neural network (CNN) [3]. However, it is still a challenge to construct \u201clines-of-reasoning\u201d with these representations for different tasks. This paper addresses this challenge in visual question answering, a typical field of Artificial Intelligence.\n\nVisual question answering (VQA) aims to select an answer given an image and a related question. The left part of Fig. 1 gives an example of the image and the question. Much work has been done on this task in recent years, and reasoning, under various names, plays a critical role in it. Most existing VQA models that enable reasoning can be divided into three categories. Firstly, the relation-based method [4] views the reasoning procedure as relational reasoning. It calculates the relations between image regions to infer the answer in one step. However, one-step relational reasoning can only construct pairwise relations between initial objects, which is not always sufficient for complex questions. Extending one-step reasoning to multiple steps is not trivial because the computational complexity increases exponentially. Secondly, attention-based methods [5, 6] view the reasoning procedure as updating the attention distribution on objects, such as image regions or bounding boxes, so as to gradually infer the answer. However, no matter how many times the attention distributions are updated, the objects still come from the original input, and the entire reasoning procedure does not produce compound objects, such as \u201csign next to the bus on the right\u201d, which many questions talk about. Thirdly, module-based methods [7, 8, 9] view the reasoning procedure as a layout generated from manually pre-defined modules, which is used to instantiate modular networks. However, because the modules are pre-defined, the reasoning procedure does not produce new modules or relations. As a result, it is difficult to meet the requirements of diversity of relations in dynamic and multi-step reasoning.\n\n\u2217The first two authors contributed equally.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Chain of Reasoning for VQA. The alternate updating of objects and relations forms a chain of reasoning. The relational reasoning operation forms new relations between objects. The object refining operation generates new compound objects from relations.\n\nThis paper tries to construct a chain of reasoning (CoR) for addressing these problems. Both the iteratively updated relations and the compound objects are used as nodes in the chain. Updated relations push reasoning to involve more compound objects; compound objects maintain the intermediate conclusions of reasoning and make the next-step relational reasoning possible by lowering the computational complexity. An example of the CoR is shown in Fig. 1. Initial objects in the image are first recognized, such as two buses and a sign in the original image. All pairwise relations between these objects are then calculated, and a combination of the relations is used to generate compound objects, such as \u201cbus on the right.\u201d More complex relations are further calculated between the compound objects and initial objects to generate more complex compound objects, such as \u201csign next to the bus on the right,\u201d which brings us the answer.\n\nIn summary, our contributions are as follows:\n\n\u2022 We introduce a new VQA model that performs a chain of reasoning, which generates new relations and compound objects dynamically to infer the answer.\n\n\u2022 We achieve new state-of-the-art results on four publicly available datasets. We conduct a detailed ablation study to show that our proposed chain structure is superior to stack structure and parallel structure.\n\n\u2022 We visualize the chain of reasoning, which shows how the CoR dynamically generates new compound objects that lead to the answer of the question step by step.\n\n2 Related Work\n\nReasoning plays a crucial role in VQA. Recent studies modeled the reasoning procedure from different perspectives. In this section, we briefly review three types of existing work that enable reasoning. 
We also highlight differences between previous models and ours.\n\nRelation-based methods The relation-based method performs one-step relational reasoning to infer the answer. [4] proposed a plug-and-play module called \u201cRelation Networks\u201d (RN). RN considers all pairwise combinations of the objects in the image and applies multi-layer perceptrons (MLPs) to calculate all the relations. Then, the relations are summed and passed through other MLPs to infer the final answer. Modeling pairwise relationships already brings $O(m^2)$ computational complexity, which makes multi-step reasoning impractical. By object refining, our model lowers the computational complexity and makes multi-step reasoning possible.\n\nFigure 2: The overall structure of the proposed model for solving the VQA task. It consists of Data Embedding, Chain of Reasoning, and Decision Making, each marked with dashed lines.\n\nAttention-based methods Usually, attention-based methods enable reasoning by locating relevant objects in original input features, such as bounding boxes or image regions. Initially, [10] proposed one-step attention to locate relevant objects of images. Furthermore, [5, 6] proposed multi-step attention to update relevant objects of images and infer the answer progressively. Additionally, [11, 12] proposed multi-modal attention, which finds the relevant parts not only of images but also of questions or answers. Recently, [13, 14, 15, 16] used bilinear fusion in the attention mechanism to find more accurate objects of input features. 
Attention distributions in the above work are always on original input features. In contrast, our model attends not only to objects in the original input features but also to new compound objects generated dynamically during reasoning.\n\nModule-based methods Module-based methods try to define relations as modules in advance, and the reasoning procedure is determined by a layout generated from these modules. [7] proposed the neural module network, which uses fixed layouts generated from dependency parses. Later, [8] proposed the dynamic neural module network, which learns to optimize the layout structure by predicting a list of layout candidates. However, the layout candidates are still generated by dependency parses. To solve this problem, [9] proposed an end-to-end module network, which learns to optimize over the full network space and requires no parser at evaluation time. Our model forms new relations dynamically in the reasoning procedure, instead of choosing from a set of manually pre-defined modules.\n\n3 Chain of Reasoning based model for VQA\n\nThe overall structure of our model for VQA is illustrated in Fig. 2. It consists of three parts: Data Embedding, Chain of Reasoning, and Decision Making. Data Embedding pre-processes the image and question. Chain of Reasoning is the core part of the model. Starting from outputs of Data Embedding, relational reasoning on initial objects forms new relations, and object refining generates new compound objects based on the new relations. Iteratively, these two operations on updated relations and objects build the chain of reasoning, which outputs a series of results. Decision Making makes use of all the results to select the final answer of the question. 
We give the details of the three parts in Section 3.1\u223c3.3 respectively.\n\n3.1 Data Embedding\n\nFaster-RCNN [17] is used to encode images with the static features provided by bottom-up-attention [18], GRU [19] is used to encode text with the parameters initialized with skip-thoughts [20], as denoted in Eq. (1).\n\n$$V = \\mathrm{RCNN}(\\text{image}), \\quad Q = \\mathrm{GRU}(\\text{question}), \\tag{1}$$\n\nwhere $V \\in \\mathbb{R}^{m \\times d_v}$ denotes the visual features of the top-ranked $m$ detection boxes and $Q \\in \\mathbb{R}^{d_q}$ denotes the question embedding. Here, $V$ is viewed as a set of $m$ initial objects, i.e. $V = \\{v_1, v_2, \\ldots, v_m\\}$. From the perspective of reasoning, $V$ can also be viewed as $m$ initial premises.\n\n3.2 Chain of Reasoning\n\nStarting from initial objects $O^{(1)} = V$ defined in Eq. (1), a chain of reasoning consists of a series of sub-chains and an output at each time, which is explained in Fig. 3.\n\nFigure 3: Sub-chains and their outputs in Chain of Reasoning.\n\nIn Fig. 3, $O^{(t)} \\in \\mathbb{R}^{m \\times d_v}$ is the set of initial objects at time $t = 1$ or compound objects at time $t > 1$. $\\widetilde{O}^{(t)} \\in \\mathbb{R}^{d_v}$ is the output of the chain at time $t$. $R^{(t)} \\in \\mathbb{R}^{m \\times m \\times d_v}$ is the set of updated relations at time $t$. $O^{(t+1)} \\in \\mathbb{R}^{m \\times d_v}$ is the set of new compound objects at time $t + 1$. From the perspective of reasoning, $O^{(t)}$ can also be viewed as intermediate conclusions when $t > 1$. We first give the details on the output at time $t$, and then describe how the sub-chain is formed.\n\nThe output at time $t$ is designed to capture information provided by $O^{(t)}$ under the guidance of question. An attention-based method is used as in Eq. 
(2)\u223c(5).\n\n$$P^{(t)} = \\mathrm{relu}(O^{(t)} W_o^{(t)}), \\quad S^{(t)} = \\mathrm{relu}(Q W_q^{(t)}), \\tag{2}$$\n\n$$F^{(t)} = \\sum_{k=1}^{K} (P^{(t)} W_{p,k}^{(t)}) \\odot (S^{(t)} W_{s,k}^{(t)}), \\tag{3}$$\n\n$$\\alpha^{(t)} = \\mathrm{softmax}(F^{(t)} W_f^{(t)}), \\tag{4}$$\n\n$$\\widetilde{O}^{(t)} = (\\alpha^{(t)})^{T} O^{(t)}, \\tag{5}$$\n\nwhere Eq. (2) maps the objects at time $t$ to $P^{(t)} \\in \\mathbb{R}^{m \\times d_p}$ and maps the question feature to $S^{(t)} \\in \\mathbb{R}^{d_s}$ at time $t$. Eq. (3) uses the Mutan fusion mechanism proposed by [16]; $K$ is a hyperparameter, and $F^{(t)} \\in \\mathbb{R}^{m \\times d_f}$ is the fusion embedding at time $t$. $\\alpha^{(t)} \\in \\mathbb{R}^{m}$ in Eq. (4) is the attention distribution over the $m$ compound objects at time $t$. $\\widetilde{O}^{(t)} \\in \\mathbb{R}^{d_v}$ in Eq. (5) is the result of attention at time $t$, which is also the output of the chain of reasoning at time $t$. The output at each time $t$ will be used for final decision making. For simplicity, we omit the bias $b$.\n\nThe sub-chain $O^{(t)} \\rightarrow R^{(t)} \\rightarrow O^{(t+1)}$ is performed in two operations. The first operation, from $O^{(t)}$ to $R^{(t)}$, is called relational reasoning and forms new relations between objects. The second operation, from $R^{(t)}$ to $O^{(t+1)}$, is called object refining and generates new compound objects to start a new sub-chain. We introduce them in turn as follows.\n\nRelational reasoning from $O^{(t)}$ to $R^{(t)}$ The $m$ objects in $O^{(t)}$ interact with the $m$ initial objects in $O^{(1)}$ under the guidance of the question $Q$, as denoted in Eq. (6)\u223c(7).\n\n$$G_l = \\sigma(\\mathrm{relu}(Q W_{l1}) W_{l2}), \\quad G_r = \\sigma(\\mathrm{relu}(Q W_{r1}) W_{r2}), \\tag{6}$$\n\n$$R_{ij}^{(t)} = (O_i^{(t)} \\odot G_l) \\oplus (O_j^{(1)} \\odot G_r), \\tag{7}$$\n\nwhere Eq. (6) maps the question feature to the same dimension as the object feature by two two-layer MLPs with different weights. $\\sigma$ is the sigmoid function. $G_l, G_r \\in \\mathbb{R}^{d_v}$ are the guidances. Eq. (7) is the sum of the guided $i$th compound object at time $t$ and the guided $j$th initial object. $\\odot$ denotes the element-wise multiplication and $\\oplus$ denotes the element-wise summation.\n\nNotice that $G_l$ and $G_r$ are different guidances with different weights, but the weights in $G_l$ and in $G_r$ are each shared among all sub-chains. As a result, two sets of weights are trained: the weights in $G_l$ make the question focus on the compound objects, and the weights in $G_r$ make the question focus on the initial objects. This is in line with the reasoning procedure: the question decides what the model should do with the intermediate conclusions it has already obtained and with the initial premises. Besides, using the initial objects $O_j^{(1)}$ at each time allows the model to capture the initial premises through the whole reasoning procedure.\n\nObject refining from $R^{(t)}$ to $O^{(t+1)}$ The previous relational reasoning operation produces $m \\times m$ relations between $m$ compound objects and $m$ initial objects. Since modeling the pairwise relations increases the complexity of reasoning from $O(m)$ to $O(m^2)$, $n$-step reasoning will face the complexity of $O(m^n)$. In order to avoid the exponential complexity of multi-step reasoning, we refine these relations to $m$ new compound objects, each denoted in Eq. (8):\n\n$$O_j^{(t+1)} = \\sum_{i=1}^{m} \\alpha_i^{(t)} R_{ij}^{(t)}, \\tag{8}$$\n\nwhere $O_j^{(t+1)}$ is the $j$th compound object at time $t + 1$. In Eq. (8), the attention weights of the compound objects $\\alpha^{(t)}$ are used to refine the relations $R^{(t)}$ formed by the compound objects and the initial objects. This has two advantages. Firstly, Eq. (8) is more in line with the reasoning procedure. The $j$th compound object at time $t + 1$ is determined by all the compound objects at time $t$ and the $j$th initial object. This means that any conclusion generated by the next reasoning step will use all the intermediate conclusions in the previous step. 
At the same time, if an intermediate conclusion in the previous step is important, then its information is more likely to be used in the next step. Secondly, Eq. (8) makes it mathematically simple and computationally feasible to begin a next turn of reasoning. Mathematically, we can use a single set of equations to describe the whole chain. Computationally, we can keep the complexity at $O(nm^2)$ when we perform $n$ sub-chains of reasoning.\n\n3.3 Decision Making\n\nThe decision maker at time $T$ gives an answer to the question by making use of all the outputs $\\widetilde{O}^{(t)}$ $(t = 1, 2, \\ldots, T)$. A concatenation is employed for integrating the $T$ outputs in Eq. (9).\n\n$$O^{*} = [\\mathrm{relu}(\\widetilde{O}^{(1)} W^{(1)}); \\mathrm{relu}(\\widetilde{O}^{(2)} W^{(2)}); \\ldots; \\mathrm{relu}(\\widetilde{O}^{(T)} W^{(T)})], \\tag{9}$$\n\nwhere $O^{*} \\in \\mathbb{R}^{d_{*}}$ is the joint feature of outputs. We further fuse the joint feature and the question by Eq. (10).\n\n$$H = \\sum_{k=1}^{K} (O^{*} W_{o^{*},k}) \\odot (Q W_{q',k}), \\tag{10}$$\n\nwhere $K \\in \\mathbb{R}^{+}$ is the hyperparameter and $H \\in \\mathbb{R}^{d_h}$ is the joint embedding. Finally, a linear layer with a softmax activation function is used to predict the candidate answer distribution as shown in Eq. (11).\n\n$$\\hat{a} = \\mathrm{softmax}(H W_h). \\tag{11}$$\n\n3.4 Training\n\nWe first calculate the ground-truth answer distribution in Eq. (12):\n\n$$a_i = \\frac{\\sum_{j=1}^{N} \\mathbb{1}\\{u_j = i\\}}{N - \\sum_{j=1}^{N} \\mathbb{1}\\{u_j \\notin D\\}}, \\tag{12}$$\n\nwhere $a \\in \\mathbb{R}^{|D|}$ is the ground-truth answer distribution, $u_j$ is the answer given by the $j$th annotator, and $N$ is the number of annotators. In detail, $N$ is 10 in the VQA 1.0 and VQA 2.0 datasets; $N$ is 1 in the COCO-QA dataset and the TDIUC dataset.\n\nFinally, we use the KL-divergence as the loss function between $a$ and $\\hat{a}$ in Eq. (13):\n\n$$L(\\hat{a}, a) = \\sum_{i=1}^{|D|} a_i \\log\\left(\\frac{a_i}{\\hat{a}_i}\\right). \\tag{13}$$\n\nTable 1: Comparison with the state of the art on the VQA 1.0 dataset. Dev = Test-dev, Std = Test-std, OE = Open-Ended. Row groups in the original, in order: single image feature, multi image feature, single image feature (ours).\n\nMethod | Dev OE All | Dev OE Y/N | Dev OE Num. | Dev OE Other | Dev MC All | Std OE All | Std OE Y/N | Std OE Num. | Std OE Other | Std MC All\nHighOrderAtt [12] | - | - | - | - | 69.4 | - | - | - | - | 69.3\nMLB(7) [14] | 66.77 | 84.54 | 39.21 | 57.81 | - | 66.89 | 84.61 | 39.07 | 57.79 | -\nMutan(5) [16] | 67.42 | 85.14 | 39.81 | 58.52 | - | 67.36 | 84.91 | 39.79 | 58.35 | -\nDualMFA [21] | 66.01 | 83.59 | 40.18 | 56.84 | 70.04 | 66.09 | 83.37 | 40.39 | 56.89 | 69.97\nReasonNet [22] | - | - | - | - | - | 67.9 | 84.0 | 38.7 | 60.4 | -\nCoR-2 (36 boxes) (ours) | 68.16 | 85.57 | 43.76 | 58.80 | 72.60 | 68.19 | 85.61 | 43.10 | 58.75 | 72.61\nCoR-3 (36 boxes) (ours) | 68.37 | 85.69 | 44.06 | 59.08 | 72.84 | 68.54 | 85.83 | 43.93 | 59.11 | 72.93\n\nTable 2: Comparison with the state of the art on the VQA 2.0 dataset.\n\nMethod | Test-dev All | Test-dev Y/N | Test-dev Num. | Test-dev Other | Test-std All | Test-std Y/N | Test-std Num. | Test-std Other\nMF-SIG-VG [23] | 64.73 | 81.29 | 42.99 | 55.55 | - | - | - | -\nUp-Down (36 boxes) [24] | 65.32 | 81.82 | 44.21 | 56.05 | 65.67 | 82.20 | 43.90 | 56.26\nLC_Baseline (100 boxes) [25] | 67.50 | 82.98 | 46.88 | 58.99 | 67.78 | 83.21 | 46.60 | 59.20\nLC_Counting (100 boxes) [25] | 68.09 | 83.14 | 51.62 | 58.97 | 68.41 | 83.56 | 51.39 | 59.11\nCoR-2 (36 boxes) (ours) | 67.96 | 84.7 | 47.1 | 58.42 | 68.15 | 84.82 | 46.8 | 58.52\nCoR-3 (36 boxes) (ours) | 68.19 | 84.98 | 47.19 | 58.64 | 68.59 | 85.16 | 47.19 | 59.07\nCoR-3 (100 boxes) (ours) | 68.62 | 85.22 | 47.95 | 59.15 | 69.14 | 85.76 | 48.4 | 59.43\n\nTable 3: Comparison with the state of the art on the COCO-QA dataset.\n\nMethod | All | Obj. | Num. | Color | Loc. | WUPS0.9 | WUPS0.0\nQRU [26] | 62.50 | 65.06 | 46.90 | 60.50 | 56.99 | 72.58 | 91.62\nHieCoAtt [11] | 65.4 | 68.0 | 51.0 | 62.9 | 58.8 | 75.1 | 92.0\nDual-MFA [21] | 66.49 | 68.86 | 51.32 | 65.89 | 58.92 | 76.15 | 92.29\nCoR-2 (36 boxes) (ours) | 68.67 | 69.76 | 55.14 | 73.36 | 59.52 | 77.47 | 92.68\nCoR-3 (36 boxes) (ours) | 69.38 | 70.42 | 55.83 | 74.13 | 60.57 | 78.10 | 92.86\n\nTable 4: Comparison with the state of the art on the TDIUC dataset.\n\nQuestion Type | MCB-A [13] | RAU [27] | CATL-QTAW [28] | CoR-2 (ours) | CoR-3 (ours)\nScene Recognition | 93.06 | 93.96 | 93.80 | 94.48 | 94.68\nSport Recognition | 92.77 | 93.47 | 95.55 | 95.94 | 95.90\nColor Attributes | 68.54 | 66.86 | 60.16 | 73.59 | 74.47\nOther Attributes | 56.72 | 56.49 | 54.36 | 59.59 | 60.02\nActivity Recognition | 52.35 | 51.60 | 60.10 | 60.29 | 62.19\nPositional Reasoning | 35.40 | 35.26 | 34.71 | 39.34 | 40.92\nSub. Object Recognition | 85.54 | 86.11 | 86.98 | 88.38 | 88.83\nAbsurd | 84.82 | 96.08 | 100.00 | 95.17 | 94.70\nUtility and Affordances | 35.09 | 31.58 | 31.48 | 40.35 | 37.43\nObject Presence | 93.64 | 94.38 | 94.55 | 95.40 | 95.75\nCounting | 51.01 | 48.43 | 53.25 | 57.72 | 58.83\nSentiment Understanding | 66.25 | 60.09 | 64.38 | 66.72 | 67.19\nOverall (Arithmetic MPT) | 67.90 | 67.81 | 69.11 | 72.25 | 72.58\nOverall (Harmonic MPT) | 60.47 | 59.00 | 60.08 | 65.65 | 65.77\nOverall Accuracy | 81.86 | 84.26 | 85.03 | 86.58 | 86.91\n\n4 Experiments\n\n4.1 Datasets and evaluation metrics\n\nWe evaluate our model on four public datasets: the VQA 1.0 dataset [29], the VQA 2.0 dataset [30], the COCO-QA dataset [31] and the TDIUC dataset [27]. VQA 1.0 contains 614,163 samples, including 204,721 images from COCO [32]. VQA 2.0 is a more balanced version and contains 1,105,904 samples. COCO-QA is a smaller dataset that contains 78,736 samples. TDIUC is a larger dataset that contains 1,654,167 samples and 12 question types. For VQA 1.0 and VQA 2.0, we use the evaluation tool proposed in [29] to evaluate the model. For COCO-QA and TDIUC, we calculate the simple accuracy for each question type. 
Besides, WUPS [33] is additionally calculated for COCO-QA, and Arithmetic/Harmonic mean-per-type (MPT) [27] is additionally calculated for TDIUC.\n\n4.2 Implementation details\n\nDuring the data-embedding phase, the image features are mapped to the size of $36 \\times 2048$ and the text features are mapped to the size of 2400. In the chain of reasoning phase, the number of hidden units in Mutan is 510 and the hyperparameter $K$ is 5. The number of attention hidden units is 620. In the decision making phase, the joint feature embedding size is set to 510. All the nonlinear layers of the model use the relu activation function and dropout [34] to prevent overfitting. All settings are commonly used in previous work. We implement the model using PyTorch. We use Adam [35] to train the model with a learning rate of $10^{-4}$ and a batch size of 64. More details, including source code, will be published in the near future.\n\n4.3 Comparison with the state-of-the-art\n\nIn this section, we compare our single CoR-T model with the state-of-the-art models on four datasets. CoR-T means that the model consists of T sub-chains. Firstly, Tab. 1 shows the results on the VQA 1.0 dataset. Using a single image feature, CoR-3 not only outperforms all the models that use a single image feature but also outperforms the state-of-the-art ReasonNet [22] model, which uses six input image features including face analysis, object classification, scene classification and so on. Secondly, Tab. 2 shows the results on the VQA 2.0 dataset. Compared with Up-Down (36 boxes) [24], which is the winning model in the VQA challenge 2017, CoR-3 (36 boxes) achieves 2.92% higher accuracy on the test-std set. Compared with the most recent state-of-the-art model LC_Counting (100 boxes) [25], our single CoR-3 (100 boxes) model achieves a new state-of-the-art result of 69.14% on the test-std set. Thirdly, Tab. 3 shows the results on the COCO-QA dataset. 
CoR-3 improves the overall accuracy of the state-of-the-art Dual-MFA from 66.49% to 69.38%. In particular, there is an improvement of 4.51% in \u201cNum.\u201d and 8.24% in \u201cColor\u201d. Fourthly, Tab. 4 shows the results on the TDIUC dataset. CoR-3 improves the overall accuracy of the state-of-the-art CATL-QTAW [28] from 85.03% to 86.91%. There is also an improvement of 5.58% in \u201cCounting\u201d and 5.93% in \u201cColor Attributes\u201d. In summary, CoR consistently achieves the best performance on all four datasets.\n\n4.4 Ablation study\n\nIn this section, we conduct some ablation experiments. For a fair comparison, all the models reported in this section are trained on the VQA 2.0 training set and tested on the VQA 2.0 validation set. All the models use the exact same bottom-up-attention feature (36 boxes) extracted from faster-rcnn.\n\nTab. 5 shows the effectiveness of the chain structure. We implement MLB [14], Mutan [16] and their stack and parallel structures. The stack structure is proposed by SAN [5], which stacks 2 or 3 attention layers. The parallel structure is similar to Multi-Head Attention [36], which consists of 2 or 3 attention layers running in parallel. As shown in Tab. 5, the chain structure not only significantly improves the performance of attention models but is also superior to their stack or parallel structures. For example, compared with Mutan, Mutan-Stack-3 is only 0.29% higher while CoR-3 is 1.53% higher. Furthermore, the chain structure is insensitive to the attention model. CoR-2 and CoR-3 can achieve high performance whether using Mutan or MLB.\n\nTable 5: Effectiveness of the chain structure on the VQA 2.0 validation.\n\nMethod | MLB [14] | MLB-Stack-2 | MLB-Stack-3 | MLB-Parallel-2 | MLB-Parallel-3 | CoR-2 with MLB | CoR-3 with MLB\nVal | 62.91 | 63.28 | 63.55 | 63.20 | 63.28 | 64.90 | 64.96\n\nMethod | Mutan [16] | Mutan-Stack-2 | Mutan-Stack-3 | Mutan-Parallel-2 | Mutan-Parallel-3 | CoR-2 | CoR-3\nVal | 63.61 | 63.78 | 63.90 | 63.66 | 63.80 | 64.96 | 65.14\n\nTab. 6 shows the effectiveness of the relational reasoning operation. Firstly, we implement CoR-2 with $[O_i^{(t)}; O_j^{(1)}; G] W_1$, which is proposed by RN [4]. We find it lowers the performance (64.96%\u219262.46%). This is because the purpose of relational reasoning here is to prepare for generating compound objects, and the element-wise sum in Eq. (7) is more fine-grained. Secondly, we implement CoR-2 with $(O_i^{(t)} \\oplus O_j^{(1)}) \\odot G$, which uses a single question guidance and also lowers performance (64.96%\u219264.73%). This shows that different guidances for compound objects and initial objects are beneficial to improve the performance. Thirdly, we implement CoR-2 with $(O_i^{(t)} \\odot G_l) \\oplus (O_j^{(t)} \\odot G_r)$, which calculates the relations by the compound objects themselves without the initial object $O_j^{(1)}$. We find it still lowers the performance (64.96%\u219264.24%). This shows that using the initial premises $O_j^{(1)}$ at each step is crucial and may avoid \u201cover-reasoning\u201d by modeling very complex relations between compound objects.\n\nTable 6: Effectiveness of relational reasoning operation on the VQA 2.0 validation.\n\nMethod | Val\nCoR-2 with $[O_i^{(t)}; O_j^{(1)}; G] W_1$ | 62.46\nCoR-2 with $(O_i^{(t)} \\oplus O_j^{(1)}) \\odot G$ | 64.73\nCoR-2 with $(O_i^{(t)} \\odot G_l) \\oplus (O_j^{(t)} \\odot G_r)$ | 64.24\nCoR-2 | 64.96\n\nTab. 7 shows the effectiveness of the object refining operation. We implement a similar operation $\\sum_{i=1}^{m} \\alpha_i^{(t)} R_{ji}^{(t)}$. Although the formula is similar to $\\sum_{i=1}^{m} \\alpha_i^{(t)} R_{ij}^{(t)}$ in Eq. (8), the meaning is totally different. $\\sum_{i=1}^{m} \\alpha_i^{(t)} R_{ij}^{(t)}$ generates the $j$th compound object by a weighted sum of the relations between each compound object and the $j$th initial object, while $\\sum_{i=1}^{m} \\alpha_i^{(t)} R_{ji}^{(t)}$ generates it by a weighted sum of the relations between each initial object and the $j$th compound object. The former focuses on using the previous reasoning conclusions while the latter focuses on the initial premises. CoR-2 has better results and is more in line with the reasoning procedure: it focuses more on previous intermediate conclusions to push the next step of reasoning.\n\nTable 7: Effectiveness of object refining operation on the VQA 2.0 validation.\n\nMethod | Val\nCoR-2 with $\\sum_{i=1}^{m} \\alpha_i^{(t)} R_{ji}^{(t)}$ | 64.42\nCoR-2 | 64.96\n\nTab. 8 shows the effectiveness of the model on different question types. We conduct experiments on the state description matrix version of the CLEVR dataset [37]. CoR-2 reaches an overall accuracy of 98.7%, which outperforms MLB and Mutan on the same setup. Furthermore, CoR-2 achieves 99.9% on the \u201cQuery Attribute\u201d question type and 99.7% on the \u201cCompare Attribute\u201d question type. It is worth mentioning that there is still room for improvement in \u201cCompare Numbers\u201d questions.\n\nTable 8: Effectiveness of the model on different question types on the CLEVR dataset.\n\nMethod | Overall | Count | Exist | Compare Numbers | Query Attribute | Compare Attribute\nMLB | 85.0 | 90.0 | 76.7 | 78.8 | 91.1 | 82.7\nMutan | 86.3 | 92.5 | 80.2 | 81.7 | 91.2 | 84.5\nRN | 96.4 | - | - | - | - | -\nCoR-2 | 98.7 | 98.8 | 97.7 | 92.3 | 99.9 | 99.7\n\n4.5 Qualitative evaluation\n\nFigure 4: Visualization of the reasoning procedure of CoR-3. Example 1: \u201cWhat object is on the upper right side of the picture?\u201d (GT: fire hydrant; Pred: fire hydrant, correct). Example 2: \u201cHow many people can be seen in the picture?\u201d (GT: 3; Pred: 3, correct). Example 3: \u201cWhat color is illuminated on the traffic light?\u201d (GT: green; Pred: green, correct). Example 4: \u201cWhat object is to the right of the dog in this image?\u201d (GT: legs; Pred: dog, incorrect).\n\nIn Figure 4, we visualize the compound objects generated by CoR-3 and their attention weights. Four examples are given, including three success cases and one failure case. Each example contains three steps. The red box and the blue box in each step represent the objects with the highest and second-highest attention weights respectively. The initial objects in the first step are part of the original image and easy to visualize by the bounding box, but the compound objects in the second and third steps are difficult to visualize directly. Therefore, we search from $1105904 \\times 36$ boxes (1105904 is the number of samples and each sample has 36 boxes) and find the box with the most similar feature by cosine similarity to represent the compound object. The upper left corner of each box contains a tuple of the form $(w, s)$, where $w$ is the attention weight and $s$ is the similarity between the searched box and the real compound object.\n\nIn Example 1, the left image shows a pillar (red box) and ground (blue box). Their values of $w$ are 0.22 and 0.19 respectively. Since they are the initial top two rcnn objects in $O^{(1)}$, the values of $s$ are 1. The attention is dispersed over several \u201cobjects\u201d, which can also be seen in the attention distribution histogram below. The middle image shows the top two compound objects in the second step. The red box focuses on \u201cobjects on the upper\u201d. The attention weight of the red box increases slightly to 0.25. The similarity between the red box and the real compound object is 0.92. The right image shows the top two more complex compound objects in the third step. The red box now focuses on \u201cobjects on the upper right\u201d. 
Interestingly, the w of the red box increases to 0.81, which means that in the third step\nCoR-3 is very confident that the box containing \u201chydrant\u201d is exactly the final answer. Statistics show\nthat 96.76% of the success cases exhibit this shift from dispersed to concentrated attention. Examples\n2 and 3 show two more success cases. In Example 4, the model already gets the intermediate\nresult \u201cdog in the image\u201d in the third step but fails to further find \u201cleg on the right of the dog in the\nimage\u201d, which suggests that three-step reasoning is insufficient here.\n\n5 Conclusion\n\nIn this paper, we propose a novel chain of reasoning model for the VQA task. The reasoning procedure\nis viewed as the alternate updating of objects and relations. Experimental results on four publicly\navailable datasets show that CoR outperforms state-of-the-art approaches. The ablation study shows that\nthe proposed chain structure is superior to the stack structure and the parallel structure. The visualization of the\nchain of reasoning illustrates how CoR generates new compound objects that lead\nto the answer of the question step by step. 
In the future, we plan to apply CoR to other tasks that\nrequire reasoning, such as reading comprehension question answering or video question answering.\n\nAcknowledgments\n\nWe would like to thank the anonymous reviewers for their valuable comments. This paper is supported\nby NSFC (No. 61273365), NSSFC (2016ZDA055), 111 Project (No. B08004), Beijing Advanced\nInnovation Center for Imaging Technology, Engineering Research Center of Information Networks\nof MOE, China.\n\nReferences\n\n[1] Edward A. Feigenbaum. The art of artificial intelligence. 1. Themes and case studies of\nknowledge engineering. Technical report, Stanford Univ CA Dept of Computer Science, 1977.\n\n[2] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation,\n9(8):1735\u20131780, 1997.\n\n[3] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne\nHubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition.\nNeural computation, 1(4):541\u2013551, 1989.\n\n[4] Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter\nBattaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In\nNIPS, 2017.\n\n[5] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks\nfor image question answering. 
In CVPR, 2016.\n\n[6] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial\nattention for visual question answering. In ECCV, pages 451\u2013466, 2016.\n\n[7] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In\nCVPR, pages 39\u201348, 2016.\n\n[8] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural\nnetworks for question answering. In NAACL, pages 1545\u20131554, 2016.\n\n[9] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to\nreason: End-to-end module networks for visual question answering. In ICCV, 2017.\n\n[10] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia.\nABC-CNN: An attention based convolutional neural network for visual question answering.\narXiv:1511.05960, 2015.\n\n[11] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical Question-Image\nCo-Attention for Visual Question Answering. In NIPS, 2016.\n\n[12] Idan Schwartz, Alexander G. Schwing, and Tamir Hazan. High-Order Attention Models for\nVisual Question Answering. In NIPS, 2017.\n\n[13] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus\nRohrbach. Multimodal compact bilinear pooling for visual question answering and visual\ngrounding. In EMNLP, pages 457\u2013468, 2016.\n\n[14] Jin-Hwa Kim, Kyoung-Woon On, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang.\nHadamard product for low-rank bilinear pooling. In ICLR, 2017.\n\n[15] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal Factorized Bilinear Pooling\nwith Co-Attention Learning for Visual Question Answering. In ICCV, 2017.\n\n[16] Hedi Ben-younes, R\u00e9mi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal\nTucker Fusion for Visual Question Answering. In ICCV, pages 2631\u20132639, 2017.\n\n[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 
Faster R-CNN: Towards real-time\nobject detection with region proposal networks. In NIPS, pages 91\u201399, 2015.\n\n[18] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould,\nand Lei Zhang. Bottom-up and top-down attention for image captioning and visual question\nanswering. In CVPR, volume 3, page 6, 2018.\n\n[19] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the\nProperties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv preprint\narXiv:1409.1259, 2014.\n\n[20] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio\nTorralba, and Sanja Fidler. Skip-thought vectors. In NIPS, pages 3294\u20133302, 2015.\n\n[21] Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, and Xiaogang Wang. Co-attending\nFree-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual\nQuestion Answering. In AAAI, 2018.\n\n[22] Ilija Ilievski and Jiashi Feng. Multimodal Learning and Reasoning for Visual Question\nAnswering. In NIPS, pages 551\u2013562, 2017.\n\n[23] Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured Attentions for\nVisual Question Answering. In ICCV, 2017.\n\n[24] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and Tricks for\nVisual Question Answering: Learnings from the 2017 Challenge. In CVPR, 2018.\n\n[25] Yan Zhang, Jonathon Hare, and Adam Pr\u00fcgel-Bennett. Learning to Count Objects in Natural\nImages for Visual Question Answering. In ICLR, 2018.\n\n[26] Ruiyu Li and Jiaya Jia. Visual question answering with question representation update (QRU). In\nNIPS, pages 4655\u20134663, 2016.\n\n[27] Kushal Kafle and Christopher Kanan. An Analysis of Visual Question Answering Algorithms.\nIn ICCV, 2017.\n\n[28] Yang Shi, Tommaso Furlanello, Sheng Zha, and Animashree Anandkumar. Question Type\nGuided Attention in Visual Question Answering. 
In ECCV, 2018.\n\n[29] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,\nC. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages\n2425\u20132433, 2015.\n\n[30] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V\nin VQA matter: Elevating the role of image understanding in Visual Question Answering. In\nCVPR, volume 1, page 9, 2017.\n\n[31] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question\nanswering. In NIPS, pages 2953\u20132961, 2015.\n\n[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,\nPiotr Doll\u00e1r, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV,\npages 740\u2013755, 2014.\n\n[33] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about\nreal-world scenes based on uncertain input. In NIPS, pages 1682\u20131690, 2014.\n\n[34] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.\nDropout: A simple way to prevent neural networks from overfitting. The Journal of Machine\nLearning Research, 15(1):1929\u20131958, 2014.\n\n[35] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,\n\u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998\u20136008,\n2017.\n\n[37] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A\nDiagnostic Dataset for Compositional Language and Elementary Visual Reasoning. 
In CVPR,\npages 1988\u20131997, July 2017.\n", "award": [], "sourceid": 187, "authors": [{"given_name": "Chenfei", "family_name": "Wu", "institution": "Beijing University of Posts and Telecommunications"}, {"given_name": "Jinlai", "family_name": "Liu", "institution": "Beijing University of Posts and Telecommunications"}, {"given_name": "Xiaojie", "family_name": "Wang", "institution": "Beijing University of Posts and Telecommunications"}, {"given_name": "Xuan", "family_name": "Dong", "institution": "Beijing University of Posts and Telecommunications"}]}