{"title": "Overcoming Language Priors in Visual Question Answering with Adversarial Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1541, "page_last": 1551, "abstract": "Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- e.g. overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. \n\nIn this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the mutual information between the image and answer given the question, which we maximize explicitly to encourage visual grounding. Our approach is a model-agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. 
Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.", "full_text": "Overcoming Language Priors in Visual Question\n\nAnswering with Adversarial Regularization\n\nSainandan Ramakrishnan\n\nAishwarya Agrawal\n\nStefan Lee\n\nGeorgia Institute of Technology\n\n{sainandancv, aishwarya, steflee}@gatech.edu\n\nAbstract\n\nModern Visual Question Answering (VQA) models have been shown to rely\nheavily on super\ufb01cial correlations between question and answer words learned\nduring training \u2013 e.g. overwhelmingly reporting the type of room as kitchen or\nthe sport being played as tennis, irrespective of the image. Most alarmingly, this\nshortcoming is often not well re\ufb02ected during evaluation because the same strong\npriors exist in test distributions; however, a VQA system that fails to ground\nquestions in image content would likely perform poorly in real-world settings.\nIn this work, we present a novel regularization scheme for VQA that reduces\nthis effect. We introduce a question-only model that takes as input the question\nencoding from the VQA model and must leverage language biases in order to\nsucceed. We then pose training as an adversarial game between the VQA model\nand this question-only adversary \u2013 discouraging the VQA model from capturing\nlanguage biases in its question encoding. Further, we leverage this question-only\nmodel to estimate the increase in model con\ufb01dence after considering the image,\nwhich we maximize explicitly to encourage visual grounding. Our approach is a\nmodel agnostic training procedure and simple to implement. We show empirically\nthat it can improve performance signi\ufb01cantly on a bias-sensitive split of the VQA\ndataset for multiple base models \u2013 achieving state-of-the-art on this task. 
Further,\non standard VQA tasks, our approach shows signi\ufb01cantly less drop in accuracy\ncompared to existing bias-reducing VQA models.\n\n1 Introduction\n\nThe task of answering questions about visual content \u2013 called Visual Question Answering (VQA) \u2013\npresents a rich set of arti\ufb01cial intelligence challenges spanning computer vision and natural language\nprocessing. Successful VQA models must understand the question posed in natural language, identify\nrelevant entities, objects, and relationships in the image, and perform grounded reasoning to deduce the\ncorrect answer. In response to these challenges, there has been extensive work on VQA in recent years\nboth in terms of dataset curation [6, 12, 2, 17, 13, 32, 3] and modeling [2, 5, 28, 14, 16, 4, 22, 20].\nThis widespread interest in VQA has resulted in increasingly sophisticated models achieving higher\nand higher performance on increasingly large benchmark datasets; however, recent studies have\ndemonstrated that many models tend to have poor image grounding, instead heavily leveraging\nsuper\ufb01cial correlations between questions and answers in the training dataset to answer questions\n[1, 30, 15, 12]. As a result, these models often exhibit undesirable behaviors \u2013 blindly outputting an\nanswer based on the \ufb01rst few words in the question (e.g. re\ufb02exively answering \u2018What sport ...\u2019 questions\nwith \u2018tennis\u2019) or failing to generalize to novel attribute-noun combinations (e.g. being unable to\nidentify a \u2018green hydrant\u2019 despite seeing both hydrants and green objects during training). 
Perhaps\nmost dissatisfying of all, standard evaluation protocols on benchmark datasets often fail to pick up on\nthese trends due to the presence of the same strong language priors in their test datasets.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Given an arbitrary base VQA model (A), we introduce two regularizers. First, we build\na question-only adversary (B) that takes the question embedding qi from the VQA model and is\ntrained to output the correct answer from this information alone. For this network to succeed, qi\nmust capture language biases from the dataset \u2013 the same biases that lead the base VQA model to\nignore visual content. To reduce these biases, we set the base VQA model and the question-only\nadversary against each other, with the base VQA network modifying its question embedding to\nreduce question-only performance (shown here as gradient negation of the question-only model loss).\nFurther, the question-only model allows estimation of the change in answer con\ufb01dence given the image\n(C), which we maximize explicitly.\n\nRecently, Agrawal et al. [2] introduced the VQA-CP (Visual Question Answering under Changing\nPriors) diagnostic split of the VQA [6, 12] dataset to measure the effect of language bias. VQA-CP\nis constructed such that for each question type (e.g. \u2018What color ...\u2019, \u2018How many ...\u2019, etc.), the\nanswer distributions vary dramatically between training and test. 
Consequently, models with poor\ngrounding and an over-reliance on language priors from the training set fare poorly on this new split.\nDespite the image still containing all the necessary information to answer the questions, multiple\nexisting VQA models evaluated on this split achieve dramatically deteriorated performance.\nOne intuitive measure of the strength of language priors in VQA is the performance of a \u2018blind\u2019\nmodel that produces answers given only the question and not the associated image. In fact, this\nquestion-only model has become a standard and powerful baseline presented alongside VQA datasets\n[6, 12, 9, 17]. In this work, we codify this intuition, introducing a novel regularization scheme that\nsets a base VQA model against a question-only adversary to reduce the impact of language biases.\nMore concretely, we consider unwanted language bias in VQA to be overly-speci\ufb01c relationships\nbetween questions and their likely answers learned from the training dataset \u2013 i.e. those that could\nenable a question-only model to achieve relatively high performance without ever seeing an image \u2013\nand we explicitly optimize the question representation within a base VQA model to be uninformative\nto a question-only adversary model. In this adversarial regime, the question-only model is trained to\nanswer as accurately as possible given the question encoding provided by the base VQA model; and\nsimultaneously, the base VQA model is trained to adjust its question encoder (often implemented as a\nrecurrent language model) to minimize the performance of the question-only model while maintaining\nits own VQA accuracy. Moreover, we leverage the question-only model to provide a differentiable\nnotion of image grounding \u2013 the change in model con\ufb01dence after considering the image \u2013 which we\nmaximize explicitly for the VQA model. 
Thus, our objective consists of a question-only adversarial\nterm and a difference of entropies term.\nOur approach is largely model agnostic, end-to-end trainable, and simple to implement, consisting of\na small, additional classi\ufb01cation network built on the question representation of the base VQA model.\nWe experiment on the VQA-CP dataset [2] with multiple base VQA models, and \ufb01nd \u2013 1) our approach\nprovides consistent improvements over all baseline VQA models, 2) our approach outperforms the\nexisting state-of-the-art grounded-by-design approach [2] signi\ufb01cantly, 3) both the question-only adversary\nand the difference of entropies components improve performance and their combination pushes this\neven further. On standard benchmarks [6, 12] where strong priors from training can be exploited\non the test set, our approach shows signi\ufb01cantly smaller drops in accuracy compared to existing bias-\nreducing VQA models [2], with some settings facing only insigni\ufb01cant changes.\n\n2 Reducing Language Bias Through Adversarial Regularization\n\nSetting aside architectural speci\ufb01cs, the vast majority of VQA models operate on a set of similar\ndesign principles \u2013 \ufb01rst producing vector representations for the image and question and then\ncombining them to predict the answer (often through complex attention mechanisms). However,\nwhen language biases are quite strong, the question feature may already be suf\ufb01ciently discriminative\nand the model can learn to ignore the visual signal without facing signi\ufb01cant losses during training\n(e.g. \u201cWhat color is the sky?\u201d always mapping to \u201cblue\u201d). Such a model which fails to ground its\nanswers in the image might be passable for benchmark datasets that carry similar biases; however, in\nthe real-world, where brown grass and gray skies abound, its usefulness would be severely limited. 
In\nthis section, we address this problem by explicitly reducing the discriminative power of the question\nfeature \u2013 introducing a pair of adversarial regularizers that penalize the ability of a separate adversary\nnetwork to con\ufb01dently predict the answer from the question encoding alone.\nPreliminaries. Given a dataset $D = \\{I_i, Q_i, a_i\\}_{i=1}^{N}$ consisting of triplets of images $I_i \\in \\mathcal{I}$, questions\n$Q_i \\in \\mathcal{Q}$ and answers $a_i \\in \\mathcal{A}$, the VQA task is to learn a mapping $F : \\mathcal{Q} \\times \\mathcal{I} \\to [0, 1]^{|\\mathcal{A}|}$ which produces\nan accurate distribution over the answer space given an input question-image pair.\nWithout loss of generality, we consider differentiable mappings that can be decomposed as an\noperation f over question and image encodings $g: \\mathcal{Q} \\to \\mathbb{R}^d$ and $h: \\mathcal{I} \\to \\mathbb{R}^k$ (as shown in Figure 1A).\nWe write the prediction for instance i for this class of models as\n\n$$v_i = h(I_i), \\quad q_i = g(Q_i), \\quad P(A \\mid Q_i, I_i) = f(v_i, q_i) \\tag{1}$$\n\nwhere we denote the image and question embeddings as vi and qi respectively.\nNearly all existing VQA models follow this pattern. The image encoder h(\u00b7) is typically a \ufb01xed CNN\npretrained on either classi\ufb01cation or detection and the question encoder g(\u00b7) is usually some form\nof word or character level RNN learned during training. Typically these models are trained with\nstandard cross-entropy, optimizing parameters to minimize (2) over the ground truth data.\n\n$$\\mathcal{L}_{VQA}(f, g, h) = \\mathbb{E}_{I,Q,A}\\left[-\\log P_f(a_i \\mid Q_i, I_i)\\right] \\approx -\\frac{1}{N} \\sum_{i=1}^{N} \\log f(v_i, q_i)[a_i] \\tag{2}$$\n\nQuestion-Only Model. One intuitive measure of the power of language priors in VQA is the ability\nof a model to make low-error answer predictions from the question alone \u2013 in fact, some form of\nthis \u2018blind\u2019 model has been frequently presented alongside VQA datasets for exactly this purpose\n[6, 12, 9, 17]. We formalize this question-only model as a mapping fQ. 
As above, we assume fQ is\ndifferentiable and operates on learned question encodings such that fQ makes predictions\n\n$$P_{f_Q}(A \\mid Q_i) = f_Q(q_i), \\quad q_i = g(Q_i). \\tag{3}$$\n\nWe parameterize this model as a simple two-layer neural network but note that arbitrary choices can\nbe made in this regard. As above, this model can be trained with cross-entropy, minimizing\n\n$$\\mathcal{L}_{QA}(f_Q, g) = \\mathbb{E}_{Q,A}\\left[-\\log P_{f_Q}(a_i \\mid Q_i)\\right] \\approx -\\frac{1}{N} \\sum_{i=1}^{N} \\log f_Q(q_i)[a_i]. \\tag{4}$$\n\n2.1 Adversarial Regularization with a Question-Only Adversary\n\nFor any model of the form presented in (1), we can now introduce a simple adversarial regularizer\nthat explicitly reduces the effect of language biases by modifying the question encoder to minimize\nthe performance of this question-only adversary. Speci\ufb01cally, given a VQA model decomposed as\nf, g, h, we splice on the question-only model fQ such that fQ takes as input the encodings produced\nby g(\u00b7) (as in Figure 1), and establish opposing losses for the two networks which we detail below.\nLearning the Question-Only Adversary. The question-only model fQ is trained to minimize the\ncross-entropy loss LQA in (4); however, parameters in g(\u00b7) are not updated with respect to this loss \u2013\nin effect, this forces fQ to perform as well as possible given the question encodings produced by the\nquestion encoder g(\u00b7) from the base VQA model.\n\nAdversarial Regularization for VQA. As performance of the question-only model acts as a proxy\nfor the language biases represented in the question encodings qi = g(Qi), one approach to reduce\nbias representation is to adjust g(\u00b7) such that the question-only model does poorly. 
As such, we can\nwrite this adversarial relationship between the question-only (fQ) and base VQA models (f, g, h) as\n\n$$\\min_{f,g,h} \\max_{f_Q} \\; \\mathcal{L}_{VQA}(f, g, h) - \\lambda_Q \\mathcal{L}_{QA}(f_Q, g) \\tag{5}$$\n\nWe note that in practice, training with this adversarial regularizer can be realized with a simple\ngradient negation of the question-only adversary\u2019s loss as shown in Figure 1. Speci\ufb01cally, we back-\npropagate the negative of the gradient of LQA(fQ, g) accumulated at qi through the question encoder \u2013\nupdating the question encoder in a way that maximizes LQA(fQ, g).\nThe regularization coef\ufb01cient \u03bbQ \u2265 0 in (5) controls the trade-off between VQA performance\nand language bias reduction. For low values of \u03bbQ, little regularization occurs and the base model\ncontinues to learn language priors. On the other hand, large values of \u03bbQ force the model to remove all\ndiscriminative language biases, resulting in poor VQA performance for both the base VQA model and\nthe question-only adversary \u2013 essentially stripping the question encoding of even basic question-type\ninformation (e.g. failing to learn that \u201cWhat color ... ?\u201d questions require color answers).\n\n2.2 An Adversarial Difference of Entropies Regularizer\n\nAs the effect of this over-regularization for high-values of \u03bbQ highlights, the question-only adversary\ndoes not capture the full nuance of language bias in VQA. Given the question \u201cWhat color is the sky?\u201d\nit is reasonable to have a prior that the answer may be \u201cblue\u201d, but critically this belief should update\ndepending on observations \u2013 i.e. the answer distribution should sharpen after viewing the image.\nTo capture this intuition, we add an additional term that aims to maximize the information gained\nabout the answer from looking at the image. 
Speci\ufb01cally, we introduce another adversarial regularizer\ncorresponding to the difference in entropies between the base model prediction given the image and\nthe question-only model which we write as\n\n$$\\mathcal{L}_H(f, g, h, f_Q) = \\mathbb{E}_{I,Q}\\left[H(A \\mid Q) - H(A \\mid I, Q)\\right] \\tag{6}$$\n$$= \\mathbb{E}_{q \\sim P(Q)}\\left[H(A \\mid q)\\right] - \\mathbb{E}_{q,v \\sim P(Q,I)}\\left[H(A \\mid q, v)\\right] \\tag{7}$$\n$$\\approx \\frac{1}{N} \\sum_{i=1}^{N} \\left( H(f_Q(q_i)) - H(f(v_i, q_i)) \\right) \\tag{8}$$\n\nWe note that this regularizer resembles the conditional mutual information (CMI) between the answer\nand image given the question I(A; I|Q); however, fQ(q) is not constrained to be the marginal of\nf (v, q) such that estimating the CMI in this way is ill-de\ufb01ned.\nWe can then update the adversarial relationship between f and fQ from (5) with LH, writing\n\n$$\\min_{f,g,h} \\max_{f_Q} \\; \\mathcal{L}_{VQA}(f, g, h) - \\lambda_Q \\mathcal{L}_{QA}(f_Q, g) - \\lambda_H \\mathcal{L}_H(f, g, h, f_Q) \\tag{9}$$\n\nwhere \u03bbH \u2265 0 controls the strength of the difference of entropies regularizer. Note that while LH is\na function of f, g, h, and fQ, we only update the parameters of the question encoding g based on this\nloss. Otherwise, fQ could learn to produce sharp output distributions from arbitrary question features\nto minimize LH. Likewise, f or h can easily adjust to produce arbitrarily peaky outputs, which we\nobserve can lead to signi\ufb01cant over-\ufb01tting.\nAs before, the question-only adversary fQ in this setting must still perform as well as possible given\nthe question embedding from g(\u00b7), but this embedding is now additionally adjusted to maximize the\nentropy of fQ\u2019s output, while minimizing that of the VQA model. In the experiments that follow, we\nshow that both of these adversarial regularizers improve performance on a language bias sensitive\ntask. 
Further, we note that their bene\ufb01ts compound, with models combining both terms performing\nbetter across a wider range of regularization coef\ufb01cients.\n\n3 Related Work\n\nEssentially all real world datasets have some form of bias either due to their collection process (e.g.\nreporting biases [11]) or those re\ufb02ecting real-world human biases (e.g. capturing stereotypical gender\nroles). These biases are often less than subtle, with human annotators easily identifying from which\ndataset speci\ufb01c instances originate [26] on sight. In this section, we discuss related work on bias in\nVQA, how to reduce it, and on adversarial training regimes related to our approach.\nLanguage Bias in VQA. In VQA, a signi\ufb01cant source of bias is the strong association between\nquestion words and answers (e.g. \u2018Is there a ...\u2019 questions predominantly being answered with\n\u2018Yes\u2019 in VQA v1 [6]). Building off [6], Goyal et al. [12] introduced the VQA v2 dataset which\nsigni\ufb01cantly weakened language priors in the VQA v1 dataset. For each VQA v1 question, VQA v2\nwas constructed to also contain an image which is similar to the VQA v1 image, but has a different\nanswer to the same question \u2013 effectively reducing the sharpness of question-only priors. However,\neven with this additional \u2018balancing\u2019 there exist signi\ufb01cant biases in the dataset. In these works,\nthe extent of language biases was measured through a baseline which must predict answers from\nquestions alone. Deriving inspiration from this baseline, we introduce a question-only adversary to\nexplicitly reduce the ability of the question-only baseline to predict answers from questions alone.\nRecently, Agrawal et al. [2] introduced the VQA-CP (VQA under Changing Priors) dataset, a diag-\nnostic split of the VQA datasets [6, 12] that is constructed with vastly different answer distributions\nbetween train and test. 
Consequently, models that overly rely on language biases or have poor\nvisual grounding do poorly on this split, with [2] reporting dramatic drops in performance for state-\nof-the-art VQA models. We use VQA-CP as a testbed for our adversarial regularization approach\nand show consistent improvements over base models and existing work.\nOvercoming Unwanted Biases. In addition to proposing the VQA-CP dataset, [2] designed a\nGrounded VQA model (GVQA) that includes hand-designed architectural restrictions to prevent\nthe model from exploiting language correlations in training data. Speci\ufb01cally, GVQA disentangles\nthe visual concepts in the image that need to be recognized, from the space of plausible answers \u2013\nintroducing separate visual concept and answer cluster classi\ufb01ers. While the model performs well on\nVQA-CP, it is complicated in design and requires training multiple stages in sequence. In contrast,\nour approach is implemented as a simple drop-in regularizer built on top of existing VQA models and\nenables end-to-end training without changing the underlying model architecture, unlike the design\nprinciples of GVQA which require signi\ufb01cant architecture adjustments if extended to new models.\nFurther, we \ufb01nd our approach signi\ufb01cantly outperforms the hand-crafted network structure of GVQA.\nSimilarly, neural module network style architectures [5, 14, 16] introduce an explicit structure in the\nmodel that separates the question from the reasoning on the image. These models predict the layout\nof modular computational units from the question content and these modules then operate on the\nimage to produce an answer. Despite this explicitly compositional reasoning process, these models\nalso suffer a dramatic drop in performance when evaluated on VQA-CP [2]. In contrast, our proposed\napproach performs well on VQA-CP and can be applied to any model architecture.\nIn recent work, Burns et al. 
[7] investigate the generation of gender-speci\ufb01c words in image de-\nscriptions which is often skewed in captioning models (e.g. models nearly always using male words\nto describe snowboarders). The proposed approach encourages the model to con\ufb01dently predict\ngendered words when gender information is visually present and to be unsure when it is occluded by\na mask. While effective, this model requires segmentation of the visual concept of interest.\nMore generally, Zhao et al. [31] address the issue of bias ampli\ufb01cation broadly, introducing a\ninference-time procedure to recalibrate the model. However, this approach requires computing\nthe output distribution for each element of a test set before this procedure can be performed. In\ncomparison, we present a training-time procedure that results in a less biased model. In principle,\n[31] could also be applied to a model trained under our proposed regularizers.\nAdversarial Learning. Generative Adversarial Networks (GANs) [10] have received signi\ufb01cant\nrecent interest for their ability to model complex distributions \u2013 \ufb01nding use in a variety of image and\nlanguage generation tasks [10, 23, 29, 8, 21]. Recently, other adversarial training schemes have been\nproposed to encourage various forms of invariance in intermediate model representations [18, 19, 27].\nMost related to our approach, Lample et al. [18] introduce an autoencoder framework with an\nadversarial loss for attribute-based image manipulation. Given an input image and a set of attributes\n(e.g. a photo of a person and their gender or age), the task is to manipulate the image such that it\nhas the desired attributes. Unfortunately, without multiple pairings of the same image with different\nattributes, it is challenging to learn disentangled image representations that generalize to new input-\nattribute combinations. 
An adversarial model is introduced that is trained to predict attributes from the\ninput image encoding alone. In combating this adversary, the image encoder model learns to produce\nattribute invariant image encodings. This improves generalization by forcing the attribute-augmented\ndecoder to meaningfully rely on input attributes to accurately reproduce input images.\nSimilarly, our question-only adversary encourages the VQA question encoder to remove answer-\ndiscriminative features from the question representation. However, breaking the parallels with [18],\nthe answers themselves are not added back as inputs to controllably recondition the model on these\nfeatures. Rather, the VQA model must rely on the combination of question and image features to\nrecover the answer information. In this way, the language-level answer information (e.g. that most\ngrass is green) is removed from the question and instance-speci\ufb01c information from the image must\nbe used instead. We take this notion further by leveraging the question-only adversary to estimate\nand directly maximize the change in con\ufb01dence after observing the image, which we show provides\nsubstantial bene\ufb01ts when paired with the question-only adversary.\n\n4 Experiments\n\nImplementation. Our question-only adversary model is implemented as a 2-layer multi-layer\nperceptron with 256 hidden units and a ReLU activation that takes as input the question encoding\nfrom a base VQA network. The network\u2019s output is a distribution over the candidate answers. We\ntrain the entire system (base VQA and question-only model) end-to-end with parameters initialized\nfrom scratch. We set the batch size to 150, the learning rate to 0.001, and the weight decay to 0.999, and use the\nAdam optimizer. The model takes \u223c8 hours to train on a TITAN X for SAN (Torch, \u223c60 epochs)\nand < 1 hour for UpDown (PyTorch, \u223c40 epochs). 
We use public codebases for both.\nAs discussed in Section 2, we update the parameters of the question encoding with respect to the\nVQA loss, the difference of entropies loss, and the negative of the question-only loss. The remaining\nVQA model parameters are trained with just the VQA loss. The question-only model is updated only\nby its own cross-entropy loss term despite contributing to the difference of entropies loss.\nModels. We evaluate the effect of our proposed regularization on the following base models:\n\u2013 Stacked Attention Network (SAN) [28] \u2013 SAN encodes questions with a long short-term\nmemory (LSTM) encoder and the image is encoded with a pretrained VGGNet [25]. The model\nperforms two-hop question-based image attention and the \ufb01nal joint feature is passed to a\n1000-way answer classi\ufb01er. This model is trained with standard cross-entropy.\n\n\u2013 Bottom-Up and Top-Down Attention (UpDn) [4] \u2013 Up-Down encodes questions with a gated\nrecurrent unit (GRU) encoder and represents images as a set of bounding box features extracted\nfrom Faster R-CNN [24]. Soft-attention over these regions is computed based on the question\nfeatures and the attention-pooled feature is combined with the question as input to the classi\ufb01cation\nlayer. This model is trained directly on VQA score under a multi-label binary cross-entropy loss\n(see [4] for more details). We also apply this loss for the question-only model in our experiments,\nbut compute a softmax over these outputs when computing entropies.\n\nFor both SAN1 and Up-Down2, we build on top of publicly available reimplementations. In the\nfollowing results, we indicate the addition of our question-only adversarial regularization with Q-Adv\nand the difference of entropies term as DoE.\nWe also compare to the GVQA [2] model built atop SAN and introduced alongside the VQA-CP\ndataset. 
GVQA explicitly separates perception from question answering by introducing a Visual\nConcept Classi\ufb01er (VCC) and an Answer Cluster Predictor (ACP). The VCC is a bank of pretrained\nclassi\ufb01ers for visual entities and attributes and its output is modulated by the ACP. The ACP takes a\nquestion and predicts one of a prede\ufb01ned answer clusters. The ACP masked VCC outputs are used\nto predict the answer. A separate branch handles binary questions as a visual veri\ufb01cation task. By\ndesign, this model isolates the answering module from the input question, mitigating the effect of\nlanguage biases, but at a cost of relatively low standard VQA performance and multi-stage training.\nDatasets and Evaluation. We train our models on the VQA-CP [2] train split and evaluate on the test\nset using the standard VQA evaluation metric [6]. For each model, we also report results when trained\n\n1SAN Codebase: https://github.com/abhshkdz/neural-vqa-attention\n2Up-Down Codebase: https://github.com/hengyuan-hu/bottom-up-attention-vqa\n\nTable 1: Performance on VQA-CP v2 test and VQA v2 val. 
We signi\ufb01cantly improve the accuracy\nof base models and achieve state-of-the-art performance on the VQA-CP dataset.\n\nModel | \u03bbQ | \u03bbH | VQA-CP v2 test: Overall | Yes/No | Number | Other | VQA v2 val: Overall | Yes/No | Number | Other\nGVQA [2] | - | - | 31.30 | 57.99 | 13.68 | 22.14 | 48.24 | 72.03 | 31.17 | 34.65\nSAN [28] | - | - | 24.96 | 38.35 | 11.14 | 21.74 | 52.41 | 70.06 | 39.28 | 47.84\nSAN + Q-Adv | 0.15 | - | 27.24 | 54.50 | 14.91 | 16.33 | 52.18 | 69.81 | 39.21 | 47.52\nSAN + DoE | - | 25 | 25.75 | 42.21 | 12.08 | 20.87 | 52.38 | 70.05 | 39.64 | 47.41\nSAN + Q-Adv + DoE | 0.15 | 25 | 33.29 | 56.65 | 15.22 | 26.02 | 52.31 | 69.98 | 39.33 | 47.63\nUpDn [4] | - | - | 39.74 | 42.27 | 11.93 | 46.05 | 63.48 | 81.18 | 42.14 | 55.66\nUpDn + Q-Adv | 0.005 | - | 40.08 | 42.34 | 13.02 | 46.33 | 60.53 | 77.70 | 41.00 | 52.65\nUpDn + DoE | - | 0.05 | 40.43 | 42.62 | 12.19 | 47.03 | 63.43 | 81.15 | 42.64 | 55.45\nUpDn + Q-Adv + DoE | 0.005 | 0.05 | 41.17 | 65.49 | 15.48 | 35.48 | 62.75 | 79.84 | 42.35 | 55.16\n(Rows with Q-Adv and/or DoE are marked Ours.)\n\nand evaluated on the standard VQA train and validation splits [6, 12] with the same regularization\ncoef\ufb01cients used for VQA-CP to compare with [2].\nVQA-CP does not have a validation set and generating such a split is complicated by the need for\nit to contain priors different from both the training and test sets in order to be an accurate estimate\nof generalization under changing priors \u2013 an ill-de\ufb01ned notion for binary questions. As such, we\nset initial regularizer coef\ufb01cients such that gradients at the question encoding are roughly equal in\nmagnitude for all loss terms at the beginning of training and then explore a small region around this\npoint. We report the best performing coef\ufb01cients alongside our results and provide further analysis of\nthe effect of these parameters in Section 5. Notably, we \ufb01nd these coef\ufb01cients to be highly model\ndependent but generalize well between datasets and regularizer ablations. 
All models are trained until\nconvergence as we have no validation set on which to base early-stopping.\n\n5 Results\n\nTable 1 presents our primary results on both the VQA-CP v2 and the VQA v2 datasets. Table 2 also\nshows limited results on the much more biased VQA v1 dataset [6] and its CP counterpart \u2013 VQA-CP\nv1 [2]. We make a number of observations below.\nThe proposed regularizers help, resulting in state-of-the-art performance on VQA-CP. For both\nSAN and UpDn models, adding the question-only adversary (Q-Adv) improves the performance\nof the respective base models (2.28% for SAN and 0.34% for UpDn) on the VQA-CP v2 dataset.\nSimilarly, the difference of entropies (DoE) regularizer boosts the performance of both SAN and\nUpDn models, gaining improvements of 0.79% and 0.69% respectively. The combination of the\nQ-Adv and DoE regularizers further boosts the performance, resulting in an 8.33% improvement over\nSAN and 1.43% over UpDn. Comparing our SAN + Q-Adv + DoE model to GVQA which is also\nbuilt on top of SAN, we outperform GVQA signi\ufb01cantly (1.99%). Our UpDn + Q-Adv + DoE model\nalso sets a new state-of-the-art on VQA-CP v2, improving over GVQA by 9.87% (although it is\nimportant to note the more powerful base architecture already outperforms GVQA by 8.44%).\nSimilar trends repeat for VQA-CP v1 as well, with the question-only regularizer improving SAN by\n1.14%, DoE by 0.95%, and their combination by 16.55% \u2013 outperforming GVQA by 4.2% and\nagain setting the state-of-the-art. We note that these larger gains are in part due to the increased language\nbiases present in the VQA-CP v1 dataset.\nMoreover, we \ufb01nd the question-only network performs increasingly poorly as our models perform\nbetter on VQA-CP \u2013 indicating that optimization is going well and that the intuition behind our\nregularizers seems well-founded. For quantitative results, see the supplementary.\nThe proposed regularizers do not hurt signi\ufb01cantly on VQA v2. 
When trained and tested on the VQA v2 dataset (right side of Table 1), the addition of the proposed regularizers results in an insignificant drop in performance for SAN (0.1%) and a minor drop in performance for UpDn (0.73%) compared to prior work. This is in contrast to GVQA, whose performance drops by 4.17% relative to SAN on VQA v2 (note that GVQA is built off of SAN). For completeness we further evaluate on VQA v2 test-std, finding our SAN+Q-Adv+DoE model gives 52.95% overall accuracy, 2.33% less than base SAN. Results on this split were not reported for GVQA in [2].\n\nTable 2: Performance on VQA-CP v1 test and VQA v1 val.\n\nModel | \u03bbQ | \u03bbH | VQA-CP v1 test: Overall | Yes/No | Number | Other | VQA v1 val: Overall | Yes/No | Number | Other\nGVQA [2] | - | - | 39.23 | 64.72 | 11.87 | 24.86 | 51.12 | 76.90 | 32.79 | 36.43\nSAN [28] | - | - | 26.88 | 35.34 | 11.34 | 24.70 | 55.86 | 78.54 | 33.46 | 44.51\nSAN + Q-Adv (Ours) | 0.15 | - | 28.02 | 35.70 | 11.70 | 19.99 | 52.01 | 70.68 | 32.39 | 42.91\nSAN + DoE (Ours) | - | 25 | 27.83 | 36.33 | 11.15 | 24.03 | 54.08 | 78.19 | 32.59 | 41.44\nSAN + Q-Adv + DoE (Ours) | 0.15 | 25 | 43.43 | 74.16 | 12.44 | 25.32 | 52.15 | 71.06 | 32.59 | 42.91\n\nFigure 2: Maximizing difference of entropies (DoE) along with the question-only adversarial regularization for the SAN model not only improves results on changing priors, but also stabilizes training.\n\nFigure 3: Answer distributions for SAN+Q-Adv+DoE mimic the prior less for questions with high language bias.\n\nThe more the biases, the higher the gain on VQA-CP, and the higher the loss on VQA. VQA v1 has significantly more bias than VQA v2 and consequently VQA-CP v1 has a sharper change between training and test. As such, we observe the proposed regularizers improve over the base model significantly more in VQA-CP v1 (16.55% for SAN) than in VQA-CP v2 (8.33% for SAN). 
For the same reasons, the proposed regularizers hurt a bit more on VQA v1 (3.71% for SAN compared to 0.1% on VQA v2), where strong language biases can be leveraged to boost performance. However, this drop in the performance on VQA v1 is still significantly less than that with GVQA (4.74%). We also found that the proposed approach has strengths complementary to SAN (see supplementary).\nUpDn [4] is less driven by biases than SAN. The drop in the performance of UpDn from VQA v2 to VQA-CP v2 is 23.74%, which is significantly less than that of SAN (27.45%). This suggests that UpDn may be less driven by biases than SAN, and hence the gains in UpDn (1.43%) due to the proposed regularizers are less than those in SAN (8.33%).\nOur approach results in less biased output distributions. Figure 3 shows answer frequency distributions for VQA v2 train, SAN, our SAN+Q-Adv+DoE model (marked Ours), and VQA v2 test for three questions: \u201cWhat color is the dress she/he is wearing?\u201d, \u201cWhat sport ...?\u201d, and \u201cWhat color is the fire hydrant?\u201d. It is quite clear that while neither of the SAN-based models completely matches the test distribution, the base SAN model aligns significantly more with the training distribution \u2013 even amplifying the bias for \u2018blue\u2019 in the first question despite very few answers being \u2018blue\u2019 in test.\nDifference of entropies (DoE) stabilizes training with the question-only adversary. Figure 2 shows VQA-CP v2 test performance of the SAN model for a range of question-only regularizer coefficients \u03bbQ. We can see that when the DoE term is not used (orange line), performance begins to drop once \u03bbQ exceeds approximately 0.2 and by 0.35 has deteriorated significantly. At these higher values, nearly all discriminative information in the question encoding is lost \u2013 with the VQA model sacrificing its own performance to lower that of the question-only model. 
However, we observe that for reasonable values of \u03bbH, the strength of the question-only adversary can be varied over a much wider range with less dramatic losses (blue curve in Figure 2). We observe a similar trend when keeping \u03bbQ constant and sweeping over \u03bbH, wherein a dramatic improvement is observed when moving to non-zero \u03bbH and then a slow decay for large values of \u03bbH. Unlike the question-only adversary, the DoE regularizer simultaneously seeks to sharpen the VQA model's posterior while weakening the question-only prior.\nQuestion-only performance: We study the performance of the question-only model after being trained on VQA-CP v2 using our regularizers. We compare to a question-only model trained without these regularizers, i.e. a model trained to predict the correct answer given the question encoding learned by the base VQA model. We find this Q-only(SAN) model achieves 24.84% on the VQA-CP v2 training set compared to 13.85% for our SAN+Q-Adv+DoE model, demonstrating that our approach has effectively restricted the discriminative information in the question encoding.\nProposed model shows complementary strengths with the base model: To study whether our models learn complementary strengths to the base VQA models, we experiment with ensembles of both models. First, we consider oracle ensembles where the best model output for each data point is considered for evaluation. This is an upper bound on ensemble performance that relies on knowing ground truth. We find that the Oracle(Ours, SAN) ensemble outperforms an oracle ensemble of two separately trained SAN models, Oracle(SAN, SAN), by 1.48% for VQA v1 and by 3.46% for VQA v2 \u2013 significantly lower gains than with Oracle(GVQA, SAN), which improves by 5.28%. It is notable, however, that the architecture of GVQA is significantly different from the base SAN model and hence is expected to exhibit different error patterns and a higher Oracle accuracy. 
To take a more attainable view, we also computed a standard ensemble Ensemble(Ours, SAN) and compared it to an Ensemble(SAN, SAN) model, outperforming it by 1.24% for VQA v2 but falling short by 0.15% for VQA v1. In contrast, Ensemble(GVQA, SAN) improves VQA v2 performance by only 0.54%.\nQualitative Examples: Figure 4 shows example outputs and heatmaps for the SAN model with and without our regularizers on VQA-CP v2. In addition to improving accuracy, our regularized approach often results in repositioned heatmaps (e.g. the surfer example, bottom right).\n\n6 Conclusion\n\nWe propose a novel adversarial regularization scheme for reducing the memorization of dataset biases in VQA based on a question-only adversary and the difference of model confidences after processing the image. Experiments on the VQA-CP dataset show that this technique allows existing VQA models to significantly improve performance under changing priors. Consequently, we achieve state-of-the-art performance on VQA-CP. Our approach can be implemented as a simple, drop-in module on top of existing VQA models and easily trained end-to-end from scratch.\nAcknowledgements This work was supported in part by NSF, AFRL, DARPA, Siemens, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any sponsor.\n\nFigure 4: Qualitative examples of outputs and attention maps for SAN with (Ours) and without (SAN) our proposed regularizers on VQA-CP v2.\n\nReferences\n[1] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955\u20131960, 2016.\n\n[2] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 
Don\u2019t just assume; look and answer: Overcoming priors for visual question answering, 2017.\n\n[3] Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, and Devi Parikh. C-vqa: A compositional split of the visual question answering (vqa) v1.0 dataset, 2017.\n\n[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering, 2017.\n\n[5] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks, 2015.\n\n[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.\n\n[7] Kaylee Burns, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. arXiv preprint arXiv:1803.09797, 2018.\n\n[8] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional gan. In CVPR, 2017.\n\n[9] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos\u00e9 M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.\n\n[11] Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 25\u201330. ACM, 2013.\n\n[12] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 
Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2016.\n\n[13] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. CVPR, 2018.\n\n[14] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering, 2017.\n\n[15] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1988\u20131997. IEEE, 2017.\n\n[16] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning, 2017.\n\n[17] Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In ICCV, 2017.\n\n[18] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc\u2019Aurelio Ranzato. Fader networks: Manipulating images by sliding attributes, 2017.\n\n[19] Gilles Louppe, Michael Kagan, and Kyle Cranmer. Learning to pivot with adversarial networks. In NIPS, pages 982\u2013991, 2017.\n\n[20] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering, 2016.\n\n[21] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.\n\n[22] Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, and Aaron Courville. Learning visual reasoning without strong priors, 2017.\n\n[23] Alec Radford, Luke Metz, and Soumith Chintala. 
Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.\n\n[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.\n\n[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[26] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521\u20131528. IEEE, 2011.\n\n[27] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Adversarial discriminative domain adaptation. In CVPR, 2017.\n\n[28] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering, 2015.\n\n[29] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.\n\n[30] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 5014\u20135022. IEEE, 2016.\n\n[31] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints, 2017.\n\n[32] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. 
In CVPR, 2016.", "award": [], "sourceid": 779, "authors": [{"given_name": "Sainandan", "family_name": "Ramakrishnan", "institution": "Georgia Institute of Technology"}, {"given_name": "Aishwarya", "family_name": "Agrawal", "institution": "Georgia Institute of Technology"}, {"given_name": "Stefan", "family_name": "Lee", "institution": "Georgia Institute of Technology"}]}