{"title": "Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding", "book": "Advances in Neural Information Processing Systems", "page_first": 1031, "page_last": 1042, "abstract": "We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning on a small number of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.", "full_text": "Neural-Symbolic VQA: Disentangling Reasoning\n\nfrom Vision and Language Understanding\n\nKexin Yi\u2217\n\nHarvard University\n\nJiajun Wu\u2217\nMIT CSAIL\n\nChuang Gan\n\nMIT-IBM Watson AI Lab\n\nAntonio Torralba\n\nMIT CSAIL\n\nPushmeet Kohli\n\nDeepMind\n\nJoshua B. Tenenbaum\n\nMIT CSAIL\n\nAbstract\n\nWe marry two powerful ideas: deep representation learning for visual recognition\nand language understanding, and symbolic program execution for reasoning. Our\nneural-symbolic visual question answering (NS-VQA) system \ufb01rst recovers a\nstructural scene representation from the image and a program trace from the\nquestion. 
It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs in a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after training on a small amount of data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.

1 Introduction

Looking at the images and questions in Figure 1, we instantly recognize objects and their attributes, parse complicated questions, and leverage such knowledge to reason and answer the questions. We can also clearly explain how we reason to obtain the answer. Now imagine that you are standing in front of the scene, eyes closed, only able to build your scene representation through touch. Not surprisingly, reasoning without vision remains effortless. For humans, reasoning is fully interpretable, and not necessarily interwoven with visual perception.

The advances in deep representation learning and the development of large-scale datasets [Malinowski and Fritz, 2014, Antol et al., 2015] have inspired a number of pioneering approaches in visual question answering (VQA), most trained in an end-to-end fashion [Yang et al., 2016]. Though innovative, pure neural net-based approaches often perform less well on challenging reasoning tasks. 
In particular,\na recent study [Johnson et al., 2017a] designed a new VQA dataset, CLEVR, in which each image\ncomes with intricate, compositional questions generated by programs, and showed that state-of-the-art\nVQA models did not perform well.\nLater, Johnson et al. [2017b] demonstrated that machines can learn to reason by wiring in prior\nknowledge of human language as programs. Speci\ufb01cally, their model integrates a program generator\nthat infers the underlying program from a question, and a learned, attention-based executor that runs\nthe program on the input image. Such a combination achieves very good performance on the CLEVR\n\n\u2217 indicates equal contributions. Project page: http://nsvqa.csail.mit.edu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Human reasoning is interpretable and disentangled: we \ufb01rst draw abstract knowledge of\nthe scene via visual perception and then perform logic reasoning on it. This enables compositional,\naccurate, and generalizable reasoning in rich visual contexts.\n\ndataset, and generalizes reasonably well to CLEVR-Humans, a dataset that contains the same images\nas CLEVR but now paired with human-generated questions. However, their model still suffers from\ntwo limitations: \ufb01rst, training the program generator requires many annotated examples; second, the\nbehaviors of the attention-based neural executor are hard to explain. In contrast, we humans can\nreason on CLEVR and CLEVR-Humans even with a few labeled instances, and we can also clearly\nexplain how we do it.\nIn this paper, we move one step further along the spectrum of learning vs. modeling, proposing a\nneural-symbolic approach for visual question answering (NS-VQA) that fully disentangles vision\nand language understanding from reasoning. 
We use neural networks as powerful tools for parsing: inferring structural, object-based scene representations from images, and generating programs from questions. We then incorporate a symbolic program executor that, complementary to the neural parser, runs the program on the scene representation to obtain an answer.

The combination of deep recognition modules and a symbolic program executor offers three unique advantages. First, the use of symbolic representations offers robustness to long, complex program traces. It also reduces the need for training data. On the CLEVR dataset, our method is trained on questions with 270 program annotations plus 4K images, and is able to achieve a near-perfect accuracy of 99.8%.

Second, both our reasoning module and our visual scene representation are lightweight, requiring minimal computational and memory cost. In particular, our compact structural image representation requires much less storage during reasoning, reducing the memory cost by 99% compared with other state-of-the-art algorithms.

Third, the use of symbolic scene representations and program traces forces the model to accurately recover underlying programs from questions. Together with the fully transparent and interpretable nature of symbolic representations, the reasoning process can be analyzed and diagnosed step by step.

2 Related Work

Structural scene representation. Our work is closely related to research on learning an interpretable, disentangled representation with a neural network [Kulkarni et al., 2015, Yang et al., 2015, Wu et al., 2017]. For example, Kulkarni et al. [2015] proposed convolutional inverse graphics networks that learn to infer the pose and lighting of a face; Yang et al. [2015] explored learning disentangled representations of pose and content from chair images. 
There has also been work on learning disentangled representations without direct supervision [Higgins et al., 2018, Siddharth et al., 2017, Vedantam et al., 2018], some with sequential generative models [Eslami et al., 2016, Ba et al., 2015]. In a broader view, our model also relates to the field of "vision as inverse graphics" [Yuille and Kersten, 2006]. Our NS-VQA model builds upon the structural scene representation [Wu et al., 2017] and explores how it can be used for visual reasoning.

Program induction from language. Recent papers have explored using program search and neural networks to recover programs from a domain-specific language [Balog et al., 2017, Neelakantan et al., 2016, Parisotto et al., 2017]. For sentences, semantic parsing methods map them to logical forms via a knowledge base or a program [Berant et al., 2013, Liang et al., 2013, Vinyals et al., 2015, Guu et al., 2017]. In particular, Andreas et al. [2016] attempted to use the latent structure in language to help question answering and reasoning, Rothe et al. [2017] studied the use of formal programs in modeling human questions, and Goldman et al. [2018] used abstract examples to build weakly-supervised semantic parsers.

[Figure 1 example questions: "How many blocks are on the right of the three-level tower?" "Will the block tower fall if the top block is removed?" "What is the shape of the object closest to the large cylinder?" "Are there more trees than animals?"]

Figure 2: Our model has three components: first, a scene parser (de-renderer) that segments an input image (a-b) and recovers a structural scene representation (c); second, a question parser (program generator) that converts a question in natural language (d) into a program (e); third, a program executor that runs the program on the structural scene representation to obtain the answer.

Visual question answering. 
Visual question answering (VQA) [Malinowski and Fritz, 2014, Antol et al., 2015] is a versatile and challenging test bed for AI systems. Compared with the well-studied text-based question answering, VQA stands out by requiring both semantic and visual understanding. There have been numerous papers on VQA, among which some explicitly use structural knowledge to aid reasoning [Wang et al., 2017]. Current leading approaches are based on neural attention [Yang et al., 2016, Lu et al., 2016], which draws inspiration from human perception and learns to attend to the visual components that serve as informative evidence for the question. Nonetheless, Jabri et al. [2016] recently proposed a remarkably simple yet effective classification baseline. Their system directly extracts visual and text features from whole images and questions, concatenates them, and trains multi-class classifiers to select answers. This paper, among others [Goyal et al., 2017], reveals a potential caveat in proposed VQA systems: models are overfitting dataset biases.

Visual reasoning. Johnson et al. [2017a] built a new VQA dataset, named CLEVR, carefully controlling for potential bias and benchmarking how well models reason. Their subsequent model achieved good results on CLEVR by combining a recurrent program generator with an attentive execution engine [Johnson et al., 2017b]. There have been other end-to-end neural models that have achieved strong performance on the dataset, exploiting various attention structures and accounting for object relations [Hudson and Manning, 2018, Santoro et al., 2017, Hu et al., 2017, Perez et al., 2018, Zhu et al., 2017]. 
More recently, several papers have proposed to directly incorporate the syntactic and logical structure of the reasoning task into the attentive module network's architecture. These structures include the underlying functional programs [Mascharka et al., 2018, Suarez et al., 2018] and dependency trees [Cao et al., 2018] of the input question. However, training these models relies heavily on such extra signals.

From a broader perspective, Misra et al. [2018] explored learning to reason by asking questions, and Bisk et al. [2018] studied spatial reasoning in a 3D blocks world. Recently, Aditya et al. [2018] incorporated probabilistic soft logic into a neural attention module and obtained some interpretability of the model, and Gan et al. [2017] learned to associate image segments with questions. Our model moves further along this direction by encoding the entire scene into an object-based, structural representation, and integrating it with a fully transparent and interpretable symbolic program executor.

[Figure 2, continued (residue condensed): for the question "How many cubes that are behind the cylinder are large?", the generated program is 1. filter_shape(scene, cylinder); 2. relate(behind); 3. filter_shape(scene, cube); 4. filter_size(scene, large); 5. count(scene), and the answer is 3.]

3 Approach

Our NS-VQA model has three components: a scene parser (de-renderer), a question parser (program generator), and a program executor. Given an image-question pair, the scene parser de-renders the image to obtain a structural scene representation (Figure 2-I), the question parser generates a hierarchical program from the question (Figure 2-II), and the executor runs the program on the structural representation to obtain an answer (Figure 2-III).

Our scene parser recovers a structural and disentangled representation of the scene in the image (Figure 2a), based on which we can perform fully interpretable symbolic reasoning. The parser takes a two-step, segment-based approach to de-rendering: it first generates a number of segment proposals (Figure 2b), and for each segment classifies the object and its attributes. The final, structural scene representation is disentangled, compact, and rich (Figure 2c).

The question parser maps an input question in natural language (Figure 2d) to a latent program (Figure 2e). The program has a hierarchy of functional modules, each fulfilling an independent operation on the scene representation. Using a hierarchical program as our reasoning backbone naturally supplies compositionality and generalization power.

The program executor takes the output sequence from the question parser, applies these functional modules to the abstract scene representation of the input image, and generates the final answer (Figure 2-III). The executable program performs purely symbolic operations on its input throughout the entire execution process, and is fully deterministic, disentangled, and interpretable with respect to the program sequence.

3.1 Model Details

Scene parser. For each image, we use Mask R-CNN [He et al., 2017] to generate segment proposals of all objects. 
Along with the segmentation mask, the network also predicts the categorical labels of discrete intrinsic attributes such as color, material, size, and shape. Proposals with a bounding-box score below 0.9 are dropped. The segment for each object is then paired with the original image, resized to 224 by 224, and sent to a ResNet-34 [He et al., 2015] to extract spatial attributes such as pose and 3D coordinates. Here the inclusion of the original full image enables the use of contextual information.

Question parser. Our question parser is an attention-based sequence-to-sequence (seq2seq) model with an encoder-decoder structure similar to that in Luong et al. [2015] and Bahdanau et al. [2015]. The encoder is a bidirectional LSTM [Hochreiter and Schmidhuber, 1997] that takes as input a question of variable length and outputs an encoded vector e_i at time step i as

e_i = [e_i^F, e_i^B],  where  (e_i^F, h_i^F) = LSTM(\Phi_E(x_i), h_{i-1}^F),  (e_i^B, h_i^B) = LSTM(\Phi_E(x_i), h_{i+1}^B).   (1)

Here \Phi_E is the jointly trained encoder word embedding; (e_i^F, h_i^F) and (e_i^B, h_i^B) are the outputs and hidden vectors of the forward and backward networks at time step i. The decoder is a similar LSTM that generates a vector q_t from the previous token y_{t-1} of the output sequence. q_t is then fed to an attention layer to obtain a context vector c_t as a weighted sum of the encoded states via

q_t = LSTM(\Phi_D(y_{t-1})),   \alpha_{ti} \propto \exp(q_t^T W_A e_i),   c_t = \sum_i \alpha_{ti} e_i.   (2)

\Phi_D is the decoder word embedding. For simplicity, we set the dimensions of q_t and e_i to be the same and let the attention weight matrix W_A be an identity matrix. Finally, the context vector, together with the decoder output, is passed to a fully connected layer with softmax activation to obtain the distribution of the predicted token, y_t ~ softmax(W_O [q_t, c_t]). 
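As a concrete numerical illustration of the attention step in Eq. (2), here is a minimal plain-Python sketch with toy vectors, assuming W_A = I as in the text (this is an illustrative simplification, not the authors' PyTorch implementation):

```python
import math

def attend(q, encoder_states):
    """One decoding step of Eq. (2): with W_A = I, the score for each
    encoder state e_i is the dot product q . e_i; the weights alpha_ti are a
    softmax over the scores, and the context c_t is their weighted sum."""
    scores = [sum(qj * ej for qj, ej in zip(q, e)) for e in encoder_states]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]   # alpha_ti; sums to 1
    dim = len(q)
    context = [sum(a * e[d] for a, e in zip(alphas, encoder_states))
               for d in range(dim)]      # c_t = sum_i alpha_ti * e_i
    return alphas, context

# Toy example: the decoder state is aligned with the first encoder state,
# so that state receives the larger attention weight.
alphas, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights are a softmax, they always sum to one; with these orthonormal toy states the context vector simply reproduces the attention weights.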
Both the encoder and decoder have two hidden layers with a 256-dim hidden vector. We set the dimensions of both the encoder and decoder word vectors to be 300.

Program executor. We implement the program executor as a collection of deterministic, generic functional modules in Python, designed to host all logic operations behind the questions in the dataset. Each functional module is in one-to-one correspondence with tokens from the input program sequence, which has the same representation as in Johnson et al. [2017b]. The modules share the same input/output interface, and can therefore be arranged in any length and order. A typical program sequence begins with a scene token, which signals the input of the original scene representation. Each functional module then executes on the output of the previous one. The last module outputs the final answer to the question. When a type mismatch occurs between the input and output of adjacent modules, an error flag is raised, in which case the model randomly samples an answer from all possible outputs of the final module. Figure 3 shows two examples.

Methods                                 Count  Exist  Compare  Compare    Query      Overall
                                                      Number   Attribute  Attribute
Humans [Johnson et al., 2017b]          86.7   96.6   86.4     96.0       95.0       92.6
CNN+LSTM+SAN [Johnson et al., 2017b]    59.7   77.9   75.1     70.8       80.9       73.2
N2NMN* [Hu et al., 2017]                68.5   85.7   84.9     88.7       90.0       83.7
Dependency Tree [Cao et al., 2018]      81.4   94.2   81.6     97.1       90.5       89.3
CNN+LSTM+RN [Santoro et al., 2017]      90.1   97.8   93.6     97.1       97.9       95.5
IEP* [Johnson et al., 2017b]            92.7   97.1   98.7     98.9       98.1       96.9
CNN+GRU+FiLM [Perez et al., 2018]       94.5   99.2   93.8     99.0       99.2       97.6
DDRprog* [Suarez et al., 2018]          96.5   98.8   98.4     99.0       99.1       98.3
MAC [Hudson and Manning, 2018]          97.1   99.5   99.1     99.5       99.5       98.9
TbD+reg+hres* [Mascharka et al., 2018]  97.6   99.2   99.4     99.6       99.5       99.1
NS-VQA (ours, 90 programs)              64.5   87.4   53.7     77.4       79.7       74.4
NS-VQA (ours, 180 programs)             85.0   92.9   83.4     90.6       92.2       89.5
NS-VQA (ours, 270 programs)             99.7   99.9   99.9     99.8       99.8       99.8

Table 1: Our model (NS-VQA) outperforms current state-of-the-art methods on CLEVR and achieves near-perfect question answering accuracy. The question-program pairs used for pretraining our model are uniformly drawn from the 90 question families of the dataset: 90, 180, and 270 programs correspond to 1, 2, and 3 samples from each family, respectively. (*): trained on all program annotations (700K).

3.2 Training Paradigm

Scene parsing. Our implementation of the object proposal network (Mask R-CNN) is based on "Detectron" [Girshick et al., 2018]. We use ResNet-50 FPN [Lin et al., 2017] as the backbone and train the model for 30,000 iterations with eight images per batch. Please refer to He et al. [2017] and Girshick et al. [2018] for more details. Our feature extraction network outputs the values of continuous attributes. We train the network on the proposed object segments computed from the training data, using mean squared error as the loss function, for 30,000 iterations with learning rate 0.002 and batch size 50. Both networks of our scene parser are trained on 4,000 generated CLEVR images.

Reasoning. We adopt the following two-step procedure to train the question parser to learn the mapping from a question to a program. First, we select a small number of ground-truth question-program pairs from the training set to pretrain the model with direct supervision. 
Then, we pair it with our deterministic program executor and use REINFORCE [Williams, 1992] to fine-tune the parser on a larger set of question-answer pairs, using only the correctness of the execution result as the reward signal.

During supervised pretraining, we train with learning rate 7 × 10^-4 for 20,000 iterations. For REINFORCE, we set the learning rate to 10^-5 and run at most 2M iterations with early stopping. The reward is maximized over a constant baseline with a decay weight of 0.9 to reduce variance. Batch size is fixed at 64 for both training stages. All our models are implemented in PyTorch.

4 Evaluations

We demonstrate the following advantages of our disentangled structural scene representation and symbolic execution engine. First, our model can learn from a small amount of training data and outperform the current state-of-the-art methods while precisely recovering the latent programs (Section 4.1). Second, our model generalizes well to other question styles (Section 4.3), attribute combinations (Section 4.2), and visual contexts (Section 4.4). Code of our model is available at https://github.com/kexinyi/ns-vqa.

Figure 3: Qualitative results on CLEVR. Blue indicates correct program modules and answers; red indicates wrong ones. Our model robustly recovers the correct programs compared with the IEP baseline.

4.1 Data-Efficient, Interpretable Reasoning

Setup. We evaluate our NS-VQA on CLEVR [Johnson et al., 2017a]. 
The dataset includes synthetic images of 3D primitives with multiple attributes: shape, color, material, size, and 3D coordinates. Each image comes with a set of questions, each associated with a program (a set of symbolic modules) generated by machines from 90 logic templates.

Our structural scene representation for a CLEVR image characterizes the objects in it, each labeled with its shape, size, color, material, and 3D coordinates (see Figure 2c). We evaluate our model's performance on the validation set under various supervision signals for training, including different numbers of ground-truth programs used for pretraining and question-answer pairs used for REINFORCE. Results are compared with other state-of-the-art methods, including the IEP baseline [Johnson et al., 2017b]. We assess not only the correctness of the answer obtained by our model, but also how well it recovers the underlying program; an interpretable model should be able to output the correct program in addition to the correct answer.

Results. Quantitative results on the CLEVR dataset are summarized in Table 1. Our NS-VQA achieves near-perfect accuracy and outperforms other methods on all five question types. We first pretrain the question parser on 270 annotated programs sampled across the 90 question templates (3 questions per template), a number below the weakly supervised limit suggested by Johnson et al. [2017b] (9K), and then run REINFORCE on all the question-answer pairs. Repeated experiments starting from different sets of programs show a standard deviation of less than 0.1 percent for 270 pretraining programs (and beyond). The variances are larger when we train our model with fewer programs (90 and 180). The reported numbers are the mean of three runs.

We further investigate the data-efficiency of our method with respect to both the number of programs used for pretraining and the overall number of question-answer pairs used in REINFORCE. 
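The structural scene representation above is exactly what the deterministic program modules of Section 3.1 operate on. As a minimal illustrative sketch of such symbolic execution (module names mirror the program tokens; this toy is an assumption-laden simplification, not the released implementation):

```python
# Toy structural scene representation: one attribute dict per object.
scene = [
    {"shape": "cube",     "size": "large", "color": "green"},
    {"shape": "cube",     "size": "small", "color": "purple"},
    {"shape": "cylinder", "size": "large", "color": "red"},
]

# Modules share one interface (object list in, value out), so they can be
# chained in any length and order, as described in the text.
def filter_shape(objs, shape):
    return [o for o in objs if o["shape"] == shape]

def filter_size(objs, size):
    return [o for o in objs if o["size"] == size]

def count(objs):
    return len(objs)

def execute(program, scene):
    """Run a program, given as a list of (token, argument) pairs."""
    modules = {"filter_shape": filter_shape, "filter_size": filter_size}
    out = scene  # the implicit 'scene' token supplies the full representation
    for token, arg in program:
        if token == "count":
            return count(out)
        out = modules[token](out, arg)
    return out

# "How many large cubes are there?"
answer = execute([("filter_shape", "cube"),
                  ("filter_size", "large"),
                  ("count", None)], scene)
# answer == 1
```

Every intermediate output here is an explicit object list, which is what makes each execution step inspectable and diagnosable.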
Figure 4a shows the result when we vary the number of pretraining programs. NS-VQA outperforms the IEP baseline under various conditions, even with weaker supervision during REINFORCE (2K and 9K question-answer pairs). The number of question-answer pairs can be further reduced by pretraining the model on a larger set of annotated programs. For example, our model achieves the same near-perfect accuracy of 99.8% with 9K question-answer pairs when annotated programs are used for both pretraining and REINFORCE.

Figure 4b compares how well our NS-VQA recovers the underlying programs relative to the IEP model. IEP starts to capture the true programs only when trained with over 1K programs, and recovers only half of the programs with 9K programs. Qualitative examples in Figure 3 demonstrate that IEP tends to fake a long wrong program that leads to the correct answer. In contrast, our model achieves 88% program accuracy with 500 annotations, and performs almost perfectly on both question answering and program recovery with 9K programs.

[Figure 3 examples (residue condensed): e.g., with 500 programs, for "Is there a big cylinder made of the same material as the blue object?", ours executes scene, filter_blue, unique, same_material, filter_large, filter_cylinder, exist and answers "no" (correct), while IEP emits a 25-module program and answers "yes"; three further comparisons at 500 and 1K programs are shown.]

Figure 4: Our model exhibits high data efficiency while achieving state-of-the-art performance and preserving interpretability. (a) QA accuracy vs. number of programs used for pretraining; different curves indicate different numbers of question-answer pairs used in the REINFORCE stage. (b) Program accuracy vs. number of annotated programs. (c) QA accuracy vs. total number of training question-answer pairs; our model is pretrained on 270 programs.

Figure 4c shows QA accuracy vs. the number of questions and answers used for training, where our NS-VQA has the highest performance under all conditions. Among the baseline methods we compare with, MAC [Hudson and Manning, 2018] obtains high accuracy with zero program annotations; in comparison, our method needs to be pretrained on 270 program annotations, but requires fewer question-answer pairs to reach similar performance.

Our model also requires minimal memory for offline question answering: the structural representation of each image occupies less than 100 bytes; in comparison, attention-based methods like IEP require storing either the original image or its feature maps, taking at least 20K bytes per image.

4.2 Generalizing to Unseen Attribute Combinations

Recent neural reasoning models have achieved impressive performance on the original CLEVR QA task [Johnson et al., 2017b, Mascharka et al., 2018, Perez et al., 2018], but they generalize less well across biased dataset splits. This is revealed on the CLEVR-CoGenT dataset [Johnson et al., 2017a], a benchmark designed specifically for testing models' generalization to novel attribute compositions.

Setup. 
The CLEVR-CoGenT dataset is derived from CLEVR and separated into two biased splits:\nsplit A only contains cubes that are either gray, blue, brown or yellow, and cylinders that are red,\ngreen, purple or cyan; split B has the opposite color-shape pairs for cubes and cylinders. Both splits\ncontain spheres of any color. Split A has 70K images and 700K questions for training and both\nsplits have 15K images and 150K questions for evaluation and testing. The desired behavior of a\ngeneralizable model is to perform equally well on both splits while only trained on split A.\n\nResults. Table 2a shows the generalization results with a few interesting \ufb01ndings. The vanilla\nNS-VQA trained purely on split A and \ufb01ne-tuned purely on split B (1000 images) does not generalize\nas well as the state-of-the-art. We observe that this is because of the bias in the attribute recognition\nnetwork of the scene parser, which learns to classify object shape based on color. NS-VQA works well\nafter we \ufb01ne-tune it on data from both splits (4000 A, 1000 B). Here, we only \ufb01ne-tune the attribute\nrecognition network with annotated images from split B, but no questions or programs; thanks to\nthe disentangled pipeline and symbolic scene representation, our question parser and executor are\nnot over\ufb01tting to particular splits. To validate this, we train a separate shape recognition network\nthat takes gray-scale but not color images as input (NS-VQA+Gray). The augmented model works\nwell on both splits without seeing any data from split B. Further, with an image parser trained on the\noriginal condition (i.e. 
the same as in CLEVR), our question parser and executor also generalize well across splits (NS-VQA+Ori).

Methods              Not fine-tuned    Fine-tuned on   Fine-tuned
                     A       B                         A       B
CNN+LSTM+SA          80.3    68.7      B               75.7    75.8
IEP (18K programs)   96.6    73.7      B               76.1    92.7
CNN+GRU+FiLM         98.3    78.8      B               81.1    96.9
TbD+reg              98.8    75.4      B               96.9    96.3
NS-VQA (ours)        99.8    63.9      B               64.9    98.9
NS-VQA (ours)        99.8    63.9      A+B             99.6    99.0
NS-VQA+Gray (ours)   99.6    98.4      -               -       -
NS-VQA+Ori (ours)    99.8    99.7      -               -       -

(a) Generalization results on CLEVR-CoGenT.

# Programs   NS-VQA   IEP
100          60.2     38.7
200          65.2     40.1
500          67.8     49.2
1K           67.8     63.4
18K          67.0     66.6

(b) Question answering accuracy on CLEVR-Humans.

Table 2: Generalizing to unseen attribute compositions and question styles. (a) Our image parser is trained on 4,000 synthetic images from split A and fine-tuned on 1,000 images from split B. The question parser is trained only on split A, starting from 500 programs. Baseline methods are fine-tuned on 3K images plus 30K questions from split B. NS-VQA+Gray adopts a gray channel in the image parser for shape recognition, and NS-VQA+Ori uses an image parser trained on the original CLEVR images. Please see text for more details. (b) Our model outperforms IEP on CLEVR-Humans under various training conditions.

4.3 Generalizing to Questions from Humans

Our model also enables efficient generalization toward more realistic question styles over the same logic domain. We evaluate this on the CLEVR-Humans dataset, which includes human-generated questions on CLEVR images (see Johnson et al. [2017b] for details). 
The questions follow a real-life, conversational style without regular structural patterns.

Setup. We adopt a training paradigm for CLEVR-Humans similar to that for the original CLEVR dataset: we first pretrain the model with a limited number of programs from CLEVR, and then fine-tune it on CLEVR-Humans with REINFORCE. We initialize the encoder word embedding with GloVe word vectors [Pennington et al., 2014] and keep it fixed during pretraining. The REINFORCE stage lasts for at most 1M iterations; early stopping is applied.

Results. The results on CLEVR-Humans are summarized in Table 2b. Our NS-VQA outperforms IEP on CLEVR-Humans by a considerable margin when only a small number of annotated programs is available. This shows that our structural scene representation and symbolic program executor help exploit the strong exploration power of REINFORCE, and also demonstrates the model's generalizability across question styles.

4.4 Extending to New Scene Context

Structural scene representations and symbolic programs can also be extended to other visual and contextual scenarios. Here we show results on reasoning tasks in the Minecraft world.

Setup. We now consider a new dataset where objects and scenes are taken from Minecraft and therefore have drastically different scene context and visual appearance. We use the dataset generation tool provided by Wu et al. [2017] to render 10,000 Minecraft scenes, building upon the Malmo interface [Johnson et al., 2016]. Each image consists of 3 to 6 objects, and each object is sampled from a set of 12 entities. We use the same configuration details as suggested by Wu et al. [2017]. Our structural representation has the following fields for each object: category (12-dim), position in the 2D plane (2-dim, {x, z}), and the direction the object faces ({front, back, left, right}). 
Each object is thus encoded as an 18-dim vector.
We generate diverse questions and programs associated with each Minecraft image based on the objects' categorical and spatial attributes (position, direction). Each question is composed as a hierarchy of three families of basic questions: first, querying object attributes (class, location, direction); second, counting the number of objects satisfying certain constraints; third, verifying whether an object has a certain property. Our dataset differs from CLEVR primarily in two ways: Minecraft hosts a larger set of 3D objects with richer image content and visual appearance, and our questions and programs involve hierarchical attributes. For example, a "wolf" and a "pig" are both "animals", and an "animal" and a "tree" are both "creatures". We use the first 9,000 images with 88,109 questions for training and the remaining 1,000 images with 9,761 questions for testing. We follow the same recipe as described in Section 3.2 for training on Minecraft.

Figure 5: Our model also applies to Minecraft, a world with rich and hierarchical scene context and different visual appearance. (a) Sample results on the Minecraft dataset. (b) Question answering accuracy with different numbers of annotated programs:

# Programs   Accuracy
50           71.1
100          72.4
200          86.9
500          87.3

Results. Quantitative results are summarized in Figure 5b. The overall behavior is similar to that on the CLEVR dataset, except that reasoning on Minecraft generally requires a weaker initial program signal. Figure 5a shows the results on three test images: our NS-VQA finds the correct answer and recovers the correct program under the new scene context.
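To make the symbolic execution step concrete, program traces of this kind can be run by a small deterministic interpreter over the structural scene representation. The sketch below is our own toy illustration, assuming a list-of-dicts scene and simplified semantics for a handful of module tokens; it is not the released executor, and the scene values are made up.

```python
# Class hierarchy: "wolf" and "pig" are animals, animals and trees are creatures.
SUPER = {"wolf": {"animal", "creature"},
         "pig": {"animal", "creature"},
         "tree": {"creature"}}

# Toy scene: class, distance to camera, and facing direction per object.
scene = [
    {"class": "pig",  "dist": 3.0, "dir": "left"},
    {"class": "wolf", "dist": 5.0, "dir": "right"},
    {"class": "tree", "dist": 6.0, "dir": "front"},
]

def is_a(obj, name):
    return obj["class"] == name or name in SUPER.get(obj["class"], set())

def run(program):
    """Execute a program trace (list of tokens) on the scene."""
    out = None
    for op in program:
        if op == "scene":
            out = list(scene)                          # start from all objects
        elif op == "filter_farthest":
            out = [max(out, key=lambda o: o["dist"])]  # keep the farthest object
        elif op.startswith("filter_"):                 # e.g. filter_animal, filter_tree
            name = op[len("filter_"):]
            out = [o for o in out if is_a(o, name)]
        elif op == "unique":                           # assert a single referent
            assert len(out) == 1
            out = out[0]
        elif op == "relate_behind":                    # objects farther than the referent
            ref = out
            out = [o for o in scene if o["dist"] > ref["dist"]]
        elif op == "count":
            out = len(out)
        elif op == "query_direction":
            out = out["dir"]
    return out

# "How many trees are behind the farthest animal?"
print(run(["scene", "filter_animal", "filter_farthest", "unique",
           "relate_behind", "filter_tree", "count"]))  # -> 1
```

Because execution is an explicit sequence of set operations, every intermediate result can be inspected, which is what makes each reasoning step transparent and easy to diagnose.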
Most of our model's wrong answers on this dataset stem from errors in perceiving heavily occluded objects; the question parser still parses the input questions reliably.

5 Discussion

We have presented a neural-symbolic VQA approach that disentangles reasoning from visual perception and language understanding. Our model uses deep learning for inverse graphics and inverse language modeling, recognizing and characterizing objects in the scene; it then uses a symbolic program executor to reason and answer questions.
We see our research as suggesting a possible direction to unify two powerful ideas: deep representation learning and symbolic program execution. Our model connects to, but also differs from, recent pure deep learning approaches for visual reasoning. Wiring in symbolic representations as prior knowledge increases performance, significantly reduces the need for annotated data and for memory, and makes reasoning fully interpretable.
The machine learning community has often been skeptical of symbolic reasoning, as symbolic approaches can be brittle or have difficulty generalizing to natural situations. Some of these concerns are less applicable to our work, as we leverage learned abstract representations for mapping both visual and language inputs to an underlying symbolic reasoning substrate. However, building structured representations for scenes and sentence meanings, the targets of these mappings, in ways that generalize to truly novel situations remains a challenge for many approaches including ours. Recent progress on unsupervised or weakly supervised representation learning, in both language and vision, offers some promise of generalization.
Integrating this work with our neural-symbolic approach to visually grounded language is a promising future direction.

Acknowledgments

We thank Jiayuan Mao, Karthik Narasimhan, and Jon Gauthier for helpful discussions and suggestions. We also thank Drew A. Hudson for sharing experimental results for comparison. This work is in part supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds, and Machines (CBMM), IBM Research, and Facebook.

[Figure 5a, sample results on the Minecraft dataset:
Q: How many trees are behind the farthest animal?
P: scene, filter_animal, filter_farthest, unique, relate_behind, filter_tree, count
A: 1
Q: What direction is the closest creature facing?
P: scene, filter_creature, filter_closest, unique, query_direction
A: left
Q: Are there wolves farther to the camera than the animal that is facing right?
P: scene, filter_animal, filter_face_right, unique, relate_farther, filter_wolf, exist
A: yes]

References

Somak Aditya, Yezhou Yang, and Chitta Baral. Explicit reasoning over end-to-end neural architectures for visual question answering. In AAAI, 2018.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In NAACL-HLT, 2016.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.

Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. In ICLR, 2017.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 2013.
Yonatan Bisk, Kevin J Shih, Yejin Choi, and Daniel Marcu. Learning interpretable spatial operations in a rich 3D blocks world. In AAAI, 2018.

Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, and Liang Lin. Visual question reasoning on general dependency tree. In CVPR, 2018.

SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NIPS, 2016.

Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation. In ICCV, 2017.

Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.

Omer Goldman, Veronica Latcinnik, Udi Naveh, Amir Globerson, and Jonathan Berant. Weakly-supervised semantic parsing with abstract examples. In ACL, 2018.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.

Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, and Percy Liang. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In ACL, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.

Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning abstract hierarchical compositional visual concepts. In ICLR, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In CVPR, 2017.

Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In ICLR, 2018.

Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017a.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017b.

Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The Malmo platform for artificial intelligence experimentation. In IJCAI, 2016.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Joshua B Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.

Percy Liang, Michael I Jordan, and Dan Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446, 2013.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.

M Malinowski and M Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In CVPR, 2018.

Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, and Laurens van der Maaten. Learning by asking questions. In CVPR, 2018.

Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. In ICLR, 2016.

Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. In ICLR, 2017.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

Anselm Rothe, Brenden M Lake, and Todd Gureckis. Question asking as program generation. In NIPS, 2017.

Adam Santoro, David Raposo, David GT Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.

N Siddharth, T. B. Paige, J. W. Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr. Learning disentangled representations with semi-supervised deep generative models. In NIPS, 2017.

Joseph Suarez, Justin Johnson, and Fei-Fei Li. DDRProg: A CLEVR differentiable dynamic reasoning programmer. arXiv:1803.11361, 2018.

Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. In ICLR, 2018.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In NIPS, 2015.

Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. Explicit knowledge-based reasoning for visual question answering. In IJCAI, 2017.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. MLJ, 8(3-4):229–256, 1992.

Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017.

Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016.

Alan Yuille and Daniel Kersten. Vision as Bayesian inference: analysis by synthesis? Trends Cogn. Sci., 10(7):301–308, 2006.

Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured attentions for visual question answering. In ICCV, 2017.