{"title": "RUBi: Reducing Unimodal Biases for Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 852, "abstract": "Visual Question Answering (VQA) is the task of answering questions about an\nimage.\nSome VQA models often exploit unimodal biases to provide the correct answer without using the image information.\nAs a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings.\n\nWe propose RUBi, a new learning strategy to reduce biases in any VQA model.\nIt reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image. \nIt implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer.\nWe leverage a question-only model that captures the language biases by identifying when these unwanted regularities are used.\nIt prevents the base VQA model from learning them by influencing its predictions. This leads to dynamically adjusting the loss in order to compensate for biases. \nWe validate our contributions by surpassing the current state-of-the-art results on VQA-CP v2. 
This dataset is specifically designed to assess the robustness of VQA models when exposed to different question biases at test time than what was seen during training.", "full_text": "RUBi: Reducing Unimodal Biases\nfor Visual Question Answering\n\nRemi Cadene 1\u21e4, Corentin Dancette 1\u21e4, Hedi Ben-younes 1, Matthieu Cord 1, Devi Parikh 2,3\n\n1 Sorbonne Universit\u00e9, CNRS, LIP6, 4 place Jussieu, 75005 Paris,\n\n2 Facebook AI Research, 3 Georgia Institute of Technology\n\n{remi.cadene, corentin.dancette, hedi.ben-younes, matthieu.cord}@lip6.fr,\n\nparkih@gatech.edu\n\nAbstract\n\nVisual Question Answering (VQA) is the task of answering questions about an\nimage. Some VQA models often exploit unimodal biases to provide the correct\nanswer without using the image information. As a result, they suffer from a huge\ndrop in performance when evaluated on data outside their training set distribution.\nThis critical issue makes them unsuitable for real-world settings.\nWe propose RUBi, a new learning strategy to reduce biases in any VQA model.\nIt reduces the importance of the most biased examples, i.e. examples that can be\ncorrectly classi\ufb01ed without looking at the image. It implicitly forces the VQA\nmodel to use the two input modalities instead of relying on statistical regularities\nbetween the question and the answer. We leverage a question-only model that\ncaptures the language biases by identifying when these unwanted regularities are\nused. It prevents the base VQA model from learning them by in\ufb02uencing its\npredictions. This leads to dynamically adjusting the loss in order to compensate\nfor biases. We validate our contributions by surpassing the current state-of-the-art\nresults on VQA-CP v2. 
This dataset is specifically designed to assess the robustness\nof VQA models when exposed to different question biases at test time than what\nwas seen during training.\nOur code is available: github.com/cdancette/rubi.bootstrap.pytorch\n\n\u21e4Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nThe recent Deep Learning success in computer vision [1] and natural language understanding [2]\nallowed researchers to tackle multimodal tasks that combine visual and textual modalities [3, 4, 5, 6, 7].\nAmong these tasks, Visual Question Answering (VQA) attracts increasing attention. The goal of the\nVQA task is to answer a question about an image. It requires a high-level understanding of the visual\nscene and the question, as well as grounding the textual concepts in the image and using both modalities\nadequately. Solving the VQA task could have tremendous impacts on real-world applications such as\naiding visually impaired users in understanding their physical and online surroundings, searching\nthrough large quantities of visual data via natural language interfaces, or even communicating with\nrobots using more efficient and intuitive interfaces.\nSeveral large real image VQA datasets have recently emerged [8, 9, 10, 11, 12, 13, 14]. Each one\nof them targets specific abilities that a VQA model would need to be used in real-world settings\nsuch as fine-grained recognition, object detection, counting, activity recognition, commonsense\nreasoning, etc. Current end-to-end VQA models [15, 16, 17, 18, 19, 20, 21, 22] achieve impressive\nresults on most of these benchmarks and are even able to surpass the human accuracy on a specific\nbenchmark accounting for compositional reasoning [23]. However, it has been shown that they tend\nto exploit statistical regularities between answer occurrences and certain patterns in the question\n[24, 10, 25, 23, 13]. While they are designed to merge information from both modalities, in practice\nthey often answer without considering the image modality. When most of the bananas are yellow, a\nmodel does not need to learn the correct behavior to reach a high accuracy for questions asking about\nthe color of bananas. Instead of looking at the image, detecting a banana and assessing its color, it is\nmuch easier to learn from the statistical shortcut linking the words what, color and bananas with the\nmost frequent answer yellow.\n\nFigure 1: Our RUBi approach aims at reducing the amount of unimodal biases learned by a\nVQA model during training. As depicted, current VQA models often rely on unwanted statistical\ncorrelations between the question and the answer instead of using both modalities.\n\nOne way to quantify the amount of statistical shortcuts from each modality is to train unimodal\nmodels. For instance, a question-only model trained on the widely used VQA v2 dataset [9] predicts\nthe correct answer approximately 44% of the time over the test set. VQA models are not discouraged\nfrom exploiting these statistical shortcuts from the question modality, because their training set often follows\nthe same distribution as their testing set. However, when evaluated on a test set that displays different\nstatistical regularities, they usually suffer from a significant drop in accuracy [10, 25]. Unfortunately,\nthese statistical regularities are hard to avoid when collecting real datasets. As illustrated in Figure 1,\nthere is a crucial need to develop new strategies to reduce the amount of biases coming from the\nquestion modality in order to learn better behaviors.\nWe propose RUBi, a training strategy to reduce the amount of biases learned by VQA models. 
Our\nstrategy reduces the importance of the most biased examples, i.e. examples that can be correctly\nclassified without looking at the image modality. It implicitly forces the VQA model to use the two\ninput modalities instead of relying on statistical regularities between the question and the answer.\nWe take advantage of the fact that question-only models are by design biased towards the question\nmodality. We add a question-only branch on top of a base VQA model during training only. This\nbranch influences the VQA model, dynamically adjusting the loss to compensate for biases. As\na result, the gradients backpropagated through the VQA model are reduced for the most biased\nexamples and increased for the less biased. At the end of the training, we simply remove the\nquestion-only branch.\nWe run extensive experiments on VQA-CP v2 [10] and demonstrate the ability of RUBi to surpass\ncurrent state-of-the-art results by a significant margin. This dataset has been specifically designed\nto assess the capacity of VQA models to be robust to biases from the question modality. We show\nthat our RUBi learning framework provides gains when applied to several VQA architectures such\nas Stacked Attention Networks [26] and Bottom-Up and Top-Down Attention [15]. We also show that\nRUBi is competitive on the standard VQA v2 dataset [9] when compared to approaches that reduce\nunimodal biases.\n\n2 Related work\n\nReal-world datasets display some form of inherent biases due to their collection process [27, 28, 29].\nAs a result, machine learning models tend to reflect these biases because they often capture undesirable\ncorrelations between the inputs and the ground truth annotations [30, 31, 32]. Procedures exist to\nidentify certain kinds of biases and to reduce them. 
For instance, some methods are focused on gender\nbiases [33, 34], some others on the human reporting biases [35], and also on the shift in distribution\nbetween lab-curated data and real-world data [36]. In the language and vision context, some works\nevaluate unimodal baselines [37, 38] or leverage language priors [39]. In the following, we discuss\nrelated work that assesses and reduces unimodal biases learned by VQA models.\n\nAssessing unimodal biases in datasets and models Although VQA models are designed to merge the two\ninput modalities, it has been found that they often rely on superficial correlations between\ninputs from one modality and the answers without considering the other modality [40, 32]. An\ninteresting way to quantify the amount of unimodal biases that can potentially be learned by a VQA\nmodel consists in training models using only one of the two modalities [8, 9]. The question-only\nmodel is a particularly strong baseline because of the large amount of statistical regularities that can\nbe leveraged from the question modality. With the RUBi learning strategy, we take advantage of this\nbaseline model to prevent VQA models from learning question biases.\nUnfortunately, biased models that exploit statistical shortcuts from one modality usually reach\nimpressive accuracy on most of the current benchmarks. VQA-CP v2 and VQA-CP v1 [10] were\nrecently introduced as diagnostic datasets containing different answer distributions for each question type\nbetween train and test splits. Consequently, models biased towards the question modality fail\non these benchmarks. We use the more challenging VQA-CP v2 dataset extensively in order to show\nthe ability of our approach to reduce the learning of biases coming from the question modality.\n\nBalancing datasets to avoid unimodal biases Once the unimodal biases have been identified, one\nmethod to overcome these biases is to create more balanced datasets. 
For instance, the synthetic\ndatasets for VQA [23, 13] minimize question-conditional biases via rejection sampling within families\nof related questions to avoid simple shortcuts to the correct answer.\nPerforming rejection sampling in real VQA datasets is usually not possible due to the cost of annotations.\nAnother solution is to collect complementary examples to increase the difficulty of the task. For\ninstance, VQA v2 [9] has been introduced to weaken language priors in the VQA v1 dataset [8] by\nidentifying complementary images. For a given VQA v1 question, VQA v2 also contains a similar\nimage with a different answer to the same question. However, even with this additional balancing,\nstatistical biases from the question remain and can be leveraged [10]. That is why we propose an\napproach to reduce unimodal biases during training. It is designed to learn unbiased models from\nbiased datasets. Our learning strategy dynamically modifies the loss values to reduce biases from\nthe question. By doing so, we reduce the importance of certain examples, similarly to the rejection\nsampling approach, while increasing the importance of complementary examples which are already\nin the training set.\n\nArchitectures and learning strategies to reduce unimodal biases\nIn parallel with this previous\nwork on balancing datasets, an important effort has been carried out to design VQA models that\novercome biases from datasets. [10] proposed a hand-designed architecture called Grounded VQA\nmodel (GVQA). It breaks the task of VQA down into a first step of locating and recognizing the visual\nregions needed to answer the question, and a second step of identifying the space of plausible answers\nbased on a question-only branch. This approach requires training multiple sub-models separately. In\ncontrast, our learning strategy is end-to-end. Their complex design is not straightforward to apply\nto different architectures, whereas our approach is model-agnostic. 
While we rely on a question-only\nbranch, we remove it at the end of the training.\nThe work most related to ours in terms of approach is [25]. The authors propose a learning strategy\nto overcome language priors in VQA models. They first introduce an adversarial question-only branch.\nIt takes as input the question encoding from the VQA model and produces a question-only loss. They\nuse a gradient negation of this loss to discourage the question encoder from capturing unwanted biases that\ncould be exploited by the VQA model. They also propose a loss based on the difference of entropies\nbetween the VQA model and the question-only branch output distributions. These two losses are\nonly backpropagated to the question encoder. In contrast, our learning strategy targets the full VQA\nmodel parameters to reduce the impact of unwanted biases more effectively. Instead of relying on\nthese two additional losses, we use the question-only branch to dynamically adapt the value of the\nclassification loss in order to reduce the learning of biases in the VQA model. A visual comparison\nbetween [25] and RUBi can be found in Figure 5 in the supplementary materials.\n\nFigure 2: Visual comparison between the classical learning strategy of a VQA model and our RUBi\nlearning strategy. The red highlighted modules are removed at the end of the training. The output $\\hat{a}_i$\nis used as the final prediction.\n\n3 Reducing Unimodal Biases Approach\n\nWe consider the common formulation of the Visual Question Answering (VQA) task as a multi-class\nclassification problem. Given a dataset $\\mathcal{D}$ consisting of $n$ triplets $(v_i, q_i, a_i)_{i \\in [1,n]}$ with $v_i \\in \\mathcal{V}$\nan image, $q_i \\in \\mathcal{Q}$ a question in natural language and $a_i \\in \\mathcal{A}$ an answer, one must optimize the\nparameters $\\theta$ of the function $f : \\mathcal{V} \\times \\mathcal{Q} \\to \\mathbb{R}^{|\\mathcal{A}|}$ to produce accurate predictions. For a single\nexample, VQA models use an image encoder $e_v : \\mathcal{V} \\to \\mathbb{R}^{n_v \\times d_v}$ to output a set of $n_v$ vectors of\ndimension $d_v$, a question encoder $e_q : \\mathcal{Q} \\to \\mathbb{R}^{n_q \\times d_q}$ to output a set of $n_q$ vectors of dimension $d_q$, a\nmultimodal fusion $m : \\mathbb{R}^{n_v \\times d_v} \\times \\mathbb{R}^{n_q \\times d_q} \\to \\mathbb{R}^{d_m}$, and a classifier $c : \\mathbb{R}^{d_m} \\to \\mathbb{R}^{|\\mathcal{A}|}$. These functions\nare composed as follows:\n\n$$f(v_i, q_i) = c(m(e_v(v_i), e_q(q_i))) \\qquad (1)$$\n\nEach one of them can be defined to instantiate most state-of-the-art models, such as [26, 41, 19,\n42, 17, 43, 16] to cite a few.\nClassical learning strategy and pitfall The classical learning strategy of VQA models, depicted\nin Figure 2, consists in minimizing the standard cross-entropy criterion over a dataset of size $n$.\n\n$$\\mathcal{L}(\\theta; \\mathcal{D}) = -\\frac{1}{n} \\sum_{i=1}^{n} \\log(\\mathrm{softmax}(f(v_i, q_i)))[a_i] \\qquad (2)$$\n\nVQA models are inclined to learn unimodal biases from the datasets [10]. This can be shown by\nevaluating models on datasets that have different distributions of answers for the test set, such as\nVQA-CP v2. In other words, they rely on statistical regularities from one modality to provide accurate\npredictions without having to consider the other modality. As an extreme example, models strongly biased\ntowards the question modality always output yellow to the question what color is the banana.\nThey do not learn to use the image information because there are too few examples in the dataset\nwhere the banana is not yellow. Once trained, their inability to use the two modalities adequately\nmakes them inoperable on data coming from different distributions such as real-world data. Our\ncontribution consists in modifying this cost function to avoid the learning of these biases.\n\n3.1 RUBi learning strategy\nCapturing biases with a question-only branch One way to measure the unimodal biases in VQA\ndatasets is to train a unimodal model which takes only one of the two modalities as input. 
The key\nidea of our approach, depicted in Figure 2, is to adapt a question-only model as a branch of our VQA\nmodel that will alter the main model\u2019s predictions. By doing so, the question-only branch captures\nthe question biases, allowing the VQA model to focus on the examples that cannot be answered\ncorrectly using the question modality only. The question-only branch can be formalized as a function\n$f_Q : \\mathcal{Q} \\to \\mathbb{R}^{|\\mathcal{A}|}$ parameterized by $\\theta_Q$, and composed of a question encoder $e_q : \\mathcal{Q} \\to \\mathbb{R}^{n_q \\times d_q}$ to\noutput a set of $n_q$ vectors of dimension $d_q$, a neural network $nn_q : \\mathbb{R}^{n_q \\times d_q} \\to \\mathbb{R}^{|\\mathcal{A}|}$ and a classifier\n$c_q : \\mathbb{R}^{|\\mathcal{A}|} \\to \\mathbb{R}^{|\\mathcal{A}|}$.\n\n$$f_Q(q_i) = c_q(nn_q(e_q(q_i))) \\qquad (3)$$\n\nDuring training, the branch acts as a proxy preventing any VQA model of the form presented in\nEquation (1) from learning biases. At the end of the training, we simply remove the branch and use\nthe predictions from the base VQA model.\n\nFigure 3: Detailed illustration of the RUBi impact on the learning, with panels (a) Classical learning strategy and (b) RUBi learning strategy. In the first row, we illustrate how\nRUBi reduces the loss for examples that can be correctly answered without looking at the image.\nIn the second row, we illustrate how RUBi increases the loss for examples that cannot be answered\nwithout using both modalities.\n\nPreventing biases by masking predictions Before passing the predictions of our base VQA model\nto the loss function defined in Equation (2), we merge them with a mask of length $|\\mathcal{A}|$ containing a\nscalar value between 0 and 1 for each answer. This mask is obtained by passing the output of the\nneural network $nn_q$ through a sigmoid function $\\sigma$. The goal of this mask is to dynamically alter\nthe loss by modifying the predictions of the VQA model. 
To obtain the new predictions, we simply\ncompute an element-wise product $\\odot$ between the mask and the original predictions as defined in the\nfollowing equation.\n\n$$f_{QM}(v_i, q_i) = f(v_i, q_i) \\odot \\sigma(nn_q(e_q(q_i))) \\qquad (4)$$\n\nOur method modifies the predictions in this specific way to prevent the VQA model from learning biases\nfrom the question. To better understand the impact of our approach on the learning, we examine two\nscenarios. First, we reduce the importance of the most biased examples, i.e. examples that can be\ncorrectly classified without using the image modality. To do so, the question-only branch outputs\na mask to increase the score of the correct answer while decreasing the scores of the others. As a\nresult, the loss is much lower for these biased examples. In other words, the gradients backpropagated\nthrough the VQA model are smaller, thereby reducing the importance of these examples in the\nlearning. As illustrated in the first row of Figure 3, given the question what color is the banana,\nthe mask takes a high value of 0.8 for the answer yellow which is the most likely answer for this\nquestion in the training set. On the other hand, the values for the other answers green and white are\nsmaller. We see that the mask influences the VQA model to produce new predictions where the score\nassociated with the answer yellow increases from 0.8 to 0.94. Compared to the classical learning\napproach, the loss is smaller with RUBi and decreases from 0.22 to 0.06. Secondly, we increase the\nimportance of examples that cannot be answered without using both modalities. For these examples,\nthe question-only branch outputs a mask that increases the score of the wrong answer. As a result, the\nloss is much higher and the VQA model is encouraged to learn from these examples. We illustrate\nthis behavior in the second row of Figure 3 for the same question about the color of the banana. 
When\nthe image contains a green banana, RUBi increases the loss from 0.69 to 1.20.\n\nJoint learning procedure We jointly optimize the parameters of the base VQA model and its\nquestion-only branch using the gradients computed from two losses. The main loss $\\mathcal{L}_{QM}$ refers to the\ncross-entropy loss associated with the predictions of $f_{QM}(v_i, q_i)$ from Equation (4). We backpropagate\nthis loss to optimize all the parameters $\\theta_{QM}$ which contributed to this loss. $\\theta_{QM}$ is the union of the\nparameters of the base VQA model, the encoders, and the neural network $nn_q$ of the question-only\nbranch. In our setup, we share the parameters of the question encoder $e_q$ between the VQA model\nand the question-only branch. The question-only loss $\\mathcal{L}_{QO}$ is a cross-entropy loss associated with\nthe predictions of $f_Q(q_i)$ from Equation (3). We use this loss to only optimize $\\theta_{QO}$, the union of the\nparameters of $c_q$ and $nn_q$. By doing so, we further improve the question-only branch\u2019s ability to\ncapture biases. Note that we do not backpropagate this loss to the question encoder $e_q$, preventing it\nfrom directly learning question biases. We obtain our final loss $\\mathcal{L}_{RUBi}$ by summing the two losses\ntogether in the following equation:\n\n$$\\mathcal{L}_{RUBi}(\\theta_{QM}, \\theta_{QO}; \\mathcal{D}) = \\mathcal{L}_{QM}(\\theta_{QM}; \\mathcal{D}) + \\mathcal{L}_{QO}(\\theta_{QO}; \\mathcal{D}) \\qquad (5)$$\n\n3.2 Baseline architecture\nMost VQA architectures from the state of the art are compatible with our RUBi learning strategy.\nTo test our strategy, we design a fast and simple architecture inspired by [16]. This baseline\narchitecture is detailed in the supplementary material. As is common in the state of the art, our baseline\narchitecture encodes the image as a bag of $n_v$ visual features $v_i \\in \\mathbb{R}^{d_v}$ using the pretrained Faster\nR-CNN by [15], and encodes the question as a vector $q \\in \\mathbb{R}^{d_q}$ using a GRU, pretrained on the\nskip-thought task [3]. The VQA model consists of a Bilinear BLOCK fusion [17] which merges the\nquestion representation $q$ with the features $v_i$ of each region of the image. 
The output is aggregated\nusing a max pooling on the $n_v$ regions. The resulting vector is then fed into an MLP classifier which\noutputs the final predictions. While most of our experiments are done with this fast and simple\nbaseline architecture, we experimentally demonstrate that the RUBi learning strategy is effective on\nother VQA architectures.\n\n4 Experiments\n\nExperimental setup We train and evaluate our models on VQA-CP v2 [10]. This dataset was\ndeveloped to evaluate the models\u2019 robustness to question biases. We follow the same training and\nevaluation protocol as [25], who also propose a learning strategy to reduce biases. For each model,\nwe report the standard VQA evaluation metric [8]. We also evaluate our models on the standard VQA\nv2 [9]. Further implementation details are included in the supplementary materials, as well as results\non VQA-CP v1 and grounding experiments on VQA-HAT [44].\n\n4.1 Results\n\nTable 1: State-of-the-art results on VQA-CP v2 test. All reported models use the same features\nfrom [15]. Models with * have been trained by [25]. Models with ** have been trained by [45].\n\nModel | Overall | Answer type: Yes/No | Answer type: Number | Answer type: Other\nQuestion-Only [10] | 15.95 | 35.09 | 11.63 | 7.11\nUpDn [15] ** | 38.01 | . | . | .\nRAMEN [45] | 39.21 | . | . | .\nBAN [19] ** | 39.31 | . | . | .\nMuRel [16] | 39.54 | 42.85 | 13.17 | 45.04\nUpDn [15] * | 39.74 | 42.27 | 11.93 | 46.05\nUpDn + Q-Adv + DoE [25] | 41.17 | 65.49 | 15.48 | 35.48\nBalanced Sampling | 40.38 | 57.99 | 10.07 | 39.23\nQ-type Balanced Sampling | 42.11 | 61.55 | 11.26 | 40.39\nBaseline architecture (ours) | 38.46 \u00b1 0.07 | 42.85 \u00b1 0.18 | 12.81 \u00b1 0.20 | 43.20 \u00b1 0.15\nRUBi (ours) | 47.11 \u00b1 0.51 | 68.65 \u00b1 1.16 | 20.28 \u00b1 0.90 | 43.18 \u00b1 0.43\n\nTable 2: Effectiveness of the RUBi learning strategy\nwhen used on different architectures on VQA-CP v2\ntest. Detailed results can be found in the supplementary materials.\n\nModel | Overall\nSAN Baseline [26] | 24.96\nSAN + Q-Adv + DoE [25] | 33.29\nSAN + RUBi (ours) | 37.63\nUpDn Baseline [15] | 39.74\nUpDn + Q-Adv + DoE [25] | 41.17\nUpDn + RUBi (ours) | 44.23\n\nTable 3: Overall accuracy of the\nRUBi learning strategy on VQA v2\nval and test-dev splits.\n\nModel | val | test-dev\nBaseline (ours) | 63.10 | 64.75\nRUBi (ours) | 61.16 | 63.18\n\nState-of-the-art comparison In Table 1, we compare our approach consisting of our baseline\narchitecture trained with RUBi on VQA-CP v2 against the state of the art. To be fair, we only report\napproaches that use the strong visual features from [15]. We compute the average accuracy over 5\nexperiments with different random seeds. Our RUBi approach reaches an average overall accuracy\nof 47.11% with a low standard deviation of \u00b10.51. This accuracy corresponds to a gain of +5.94\npercentage points over the current state-of-the-art UpDn + Q-Adv + DoE. It also corresponds to a\ngain of +15.88 over GVQA [10], which is a specific architecture designed for VQA-CP. RUBi reaches\na +8.65 improvement over our baseline model trained with the classical cross-entropy. In comparison,\nthe second-best approach UpDn + Q-Adv + DoE only achieves a +1.43 gain in overall accuracy over\ntheir baseline UpDn. 
In addition, our approach does not significantly reduce the accuracy relative to our\nbaseline for the answer type Other, while the second-best approach reduces it by 10.57 points.\n\nAdditional baselines We compare our results to two sampling-based training methods. In the\nBalanced Sampling method, we sample the questions such that the answer distribution is uniform. In\nthe Question-Type Balanced Sampling method, we sample the questions such that for every question\ntype, the answer distribution is uniform, but the question type distribution remains the same overall.\nBoth methods are tested with our baseline architecture. We can see that the Question-Type Balanced\nSampling improves the accuracy from 38.46 to 42.11. This accuracy is already 0.94 points higher than\nthat of the previous state-of-the-art method [25], but remains significantly lower than that of our proposed method.\n\nArchitecture agnostic RUBi can be used on existing VQA models without changing the underlying\narchitecture. In Table 2, we experimentally demonstrate the generality and effectiveness of our\nlearning scheme by showing results on two additional architectures, Stacked Attention Networks\n(SAN) [26] and Bottom-Up and Top-Down Attention (UpDn) [15]. First, we show that applying\nRUBi on these architectures leads to important gains over the baselines trained with their original\nlearning strategy. We report a gain of +11.73 accuracy points for SAN and +4.5 for UpDn. This\nlower gap in accuracy may show that UpDn is less driven by biases than SAN. This is consistent\nwith results from [25]. Secondly, we show that these architectures trained with RUBi obtain better\naccuracy than with the state-of-the-art strategy from [25]. We report a gain of +3.4 with SAN + RUBi\nover SAN + Q-Adv + DoE, and +3.06 with UpDn + RUBi over UpDn + Q-Adv + DoE. 
Full results\nsplit by question type are available in the supplementary materials.\n\nImpact on VQA v2 We report the impact of our method on the standard VQA v2 dataset in Table 3.\nVQA v2 train, val and test sets follow the same distribution, unlike VQA-CP v2 train and test\nsets. In this context, we usually observe a drop in accuracy using approaches focused on reducing\nbiases. This is due to the fact that exploiting unwanted correlations from the VQA v2 train set is not\ndiscouraged and often leads to a higher accuracy on the test set. Nevertheless, our RUBi approach\nleads to a drop comparable to what can be seen in the state of the art. We report a drop of 1.94\npercentage points with respect to our baseline, while [10] report a drop of 3.78 between GVQA\nand their SAN baseline. [25] report drops of 0.05, 0.73 and 2.95 for their three learning strategies\nwith the UpDn architecture which uses the same visual features as RUBi. As shown in this section,\nRUBi improves the accuracy on VQA-CP v2 by a large margin, while maintaining competitive\nperformance on the standard VQA v2 dataset compared to similar approaches.\n\nValidation of the masking strategy We compare different fusion techniques to combine the output\nof $nn_q$ with the output from the VQA model. We report a drop of 7.09 accuracy points on VQA-CP\nv2 by replacing the sigmoid with a ReLU on our best scoring model. Using an element-wise sum\ninstead of an element-wise product leads to a further performance drop. These results confirm the\neffectiveness of our proposed masking method which relies on a sigmoid and an element-wise product.\n\nValidation of the question-only loss\nIn Table 4, we validate the ability of the question-only loss\n$\\mathcal{L}_{QO}$ to reduce the question biases. The absence of $\\mathcal{L}_{QO}$ implies that the question-only classifier $c_q$ is\nnever used, and $nn_q$ only receives gradients from the main loss $\\mathcal{L}_{QM}$. Using $\\mathcal{L}_{QO}$ leads to consistent\ngains on all three architectures. 
We report a gain of +0.89 for our Baseline architecture, +0.22 for\nSAN, and +4.76 for UpDn.\n\nTable 4: Ablation study of the question-only loss $\\mathcal{L}_{QO}$ on VQA-CP v2.\n\nModel | $\\mathcal{L}_{QO}$ | Overall | Yes/No | Number | Other\nBaseline + RUBi | yes | 47.11 | 68.65 | 20.28 | 43.18\nBaseline + RUBi | no | 46.11 | 69.18 | 26.85 | 39.31\nSAN + RUBi | yes | 37.63 | 59.49 | 13.71 | 32.74\nSAN + RUBi | no | 36.96 | 59.78 | 12.55 | 31.69\nUpDn + RUBi | yes | 44.23 | 67.05 | 17.48 | 39.61\nUpDn + RUBi | no | 39.47 | 60.27 | 16.01 | 35.01\n\n4.2 Qualitative analysis\nTo better understand the impact of our RUBi approach, we compare in Figure 4 the answer distribution\non VQA-CP v2 for some specific question patterns. We also display interesting behaviors on some\nexamples using attention maps extracted as in [16]. In the first row, we show the ability of RUBi to\nreduce biases for the is this person skiing question pattern. Most examples in the train set have the\nanswer yes, while in the test set, they have the answer no. Nevertheless, RUBi outputs 80% of no,\nwhile the baseline almost always outputs yes. Interestingly, the best scoring region from the attention\nmap of both models is localized on the shoes. To get the answer right, RUBi seems to reason about\nthe absence of skis in this region. It seems that our baseline gets it wrong by not seeing that the skis\nare not locked under the ski boots. This unwanted behavior could be due to the question biases. In\nthe second row, similar behaviors occur for the what color are the bananas question pattern. 80% of\nthe answers from the train set are yellow, while most of them are green in the test set. We show that\nthe proportions of green and white answers from RUBi are much closer to those of the test set than\nwith our baseline. In the example, it seems that RUBi relies on the color of the banana, while our\nbaseline misses it. 
In the third row, it seems that RUBi is able to ground the textual concepts such\nas top part of the fire hydrant and color on the correct visual region, while the baseline relies on the\ncorrelations between the fire hydrant, the yellow color of its core and the answer yellow. Similarly in\nthe fourth row, RUBi grounds color, star, fire hydrant on the correct region, while our baseline relies\non correlations between color, fire hydrant, the yellow color of the top part region and the answer\nyellow. Interestingly, there is no similar question that involves the color of a star on a fire hydrant in\nthe training set. It shows the capacity of RUBi to generalize to unseen examples by composing and\ngrounding existing visual and textual concepts from other kinds of question patterns.\n\nFigure 4: Qualitative comparison between the outputs of RUBi and our baseline on VQA-CP v2\ntest. On the left, we display distributions of answers for the train set, the baseline evaluated on the\ntest set, RUBi on the test set and the ground truth answers from the test set. For each row, we filter\nquestions in a certain way. In the first row, we keep the questions that exactly match the string is\nthis person skiing. In the three other rows, we filter questions that respectively include the following\nwords: what color bananas, what color fire hydrant and what color star hydrant. On the right, we\ndisplay examples that contain the pattern from the left. For each example, we display the answer of\nour baseline and RUBi, as well as the best scoring region from their attention map.\n\n5 Conclusion\n\nWe propose RUBi to reduce unimodal biases learned by Visual Question Answering (VQA) models.\nRUBi is a simple learning strategy designed to be model agnostic. It is based on a question-only\nbranch that captures unwanted statistical regularities from the question modality. 
This branch\ninfluences the base VQA model to prevent the learning of unimodal biases from the question. We\ndemonstrate a significant gain of +5.94 percentage points in accuracy over the state-of-the-art result\non VQA-CP v2, a dataset specifically designed to account for question biases. We also show that\nRUBi is effective with different kinds of common VQA models. In future work, we would like to\nextend our approach to other multimodal tasks.\n\nAcknowledgments\n\nWe would like to thank the reviewers for valuable and constructive comments and suggestions. We\nadditionally would like to thank Abhishek Das and Aishwarya Agrawal for their help.\nThe effort from Sorbonne University was supported within the Labex SMART supported by French\nstate funds managed by the ANR within the Investissements d\u2019Avenir programme under reference\nANR-11-LABX-65.\n\nReferences\n[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional\nneural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in\nNeural Information Processing Systems 25, pages 1097\u20131105. Curran Associates, Inc., 2012.\n\n[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations\nin vector space. ICLR, 2013.\n\n[3] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba,\nand Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages\n3294\u20133302, 2015.\n\n[4] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In\nProceedings of the IEEE conference on computer vision and pattern recognition, pages 3128\u20133137, 2015.\n\n[5] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language\npriors. In European Conference on Computer Vision, pages 852\u2013869. 
Springer, 2016.\n\n[6] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos\u00e9 M.F. Moura, Devi Parikh,\nand Dhruv Batra. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR), 2017.\n\n[7] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville.\nGuessWhat?! Visual object discovery through multi-modal dialogue. In Conference on Computer Vision\nand Pattern Recognition (CVPR), 2017.\n\n[8] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick,\nand Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision\n(ICCV), 2015.\n\n[9] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA\nmatter: Elevating the role of image understanding in Visual Question Answering. In IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), 2017.\n\n[10] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don\u2019t just assume; look and\nanswer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and\nPattern Recognition (CVPR), 2018.\n\n[11] Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In The IEEE\nInternational Conference on Computer Vision (ICCV), Oct 2017.\n\n[12] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P\nBigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition, pages 3608\u20133617, 2018.\n\n[13] Drew A Hudson and Christopher D Manning. GQA: a new dataset for compositional question answering\nover real-world images. arXiv preprint arXiv:1902.09506, 2019.\n\n[14] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 
From recognition to cognition: Visual\ncommonsense reasoning. CVPR, 2019.\n\n[15] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei\nZhang. Bottom-up and top-down attention for image captioning and visual question answering. In IEEE\nConference on Computer Vision and Pattern Recognition (CVPR), June 2018.\n\n[16] Remi Cadene, Hedi Ben-Younes, Nicolas Thome, and Matthieu Cord. Murel: Multimodal Relational\nReasoning for Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2019.\n\n[17] Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord. Block: Bilinear superdiagonal fusion\nfor visual question answering and visual relationship detection. In Proceedings of the 33rd Conference on\nArtificial Intelligence (AAAI), 2019.\n\n[18] Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable neural computation via stack\nneural module networks. In ECCV, 2018.\n\n[19] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural\nInformation Processing Systems, pages 1564\u20131574, 2018.\n\n[20] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit visual reasoning over scene graphs. In\nCVPR, 2019.\n\n[21] Chenfei Wu, Jinlai Liu, Xiaojie Wang, and Xuan Dong. Chain of Reasoning for Visual Question Answering.\nIn S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances\nin Neural Information Processing Systems 31, pages 275\u2013285. Curran Associates, Inc., 2018.\n\n[22] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven Hoi, Xiaogang Wang, and\nHongsheng Li. Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question\nAnswering. In CVPR, Dec 2019.\n\n[23] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross\nGirshick. 
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[24] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering\nmodels. EMNLP, 2016.\n\n[25] Sainandan Ramakrishnan, Aishwarya Agrawal, and Stefan Lee. Overcoming language priors in visual\nquestion answering with adversarial regularization. In Advances in Neural Information Processing Systems,\npages 1541\u20131551, 2018.\n\n[26] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image\nquestion answering. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 21\u201329, 2016.\n\n[27] Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge acquisition. In Proceedings of\nthe 2013 workshop on Automated knowledge base construction, pages 25\u201330. ACM, 2013.\n\n[28] Wei-Lun Chao, Hexiang Hu, and Fei Sha. Being negative but constructively: Lessons learnt from creating\nbetter visual question answering datasets. NAACL, 2018.\n\n[29] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. CVPR, Jun 2011.\n\n[30] Pierre Stock and Moustapha Cisse. Convnets and imagenet beyond accuracy: Understanding mistakes and\nuncovering biases. In The European Conference on Computer Vision (ECCV), September 2018.\n\n[31] Sen Jia, Thomas Lansdall-Welfare, and Nello Cristianini. Right for the Right Reason: Training Agnostic\nNetworks. Lecture Notes in Computer Science, pages 164\u2013174, 2018.\n\n[32] Varun Manjunatha, Nirat Saini, and Larry S. Davis. Explicit Bias Discovery in Visual Question Answering\nModels. In CVPR, Nov 2019.\n\n[33] Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also\nsnowboard: Overcoming bias in captioning models. 
In ECCV, 2018.\n\n[34] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping:\nReducing gender bias amplification using corpus-level constraints. In Conference on Empirical Methods in\nNatural Language Processing (EMNLP), 2017.\n\n[35] Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. Seeing through the human\nreporting bias: Visual classifiers from noisy human-centric labels. In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition, pages 2930\u20132939, 2016.\n\n[36] Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in\nhomes: Improving generalization and reducing dataset bias. In Advances in Neural Information Processing\nSystems, pages 9094\u20139104, 2018.\n\n[37] Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, and Aaron Courville. Blindfold\nbaselines for embodied QA. arXiv preprint arXiv:1811.05013, 2018.\n\n[38] Jesse Thomason, Daniel Gordon, and Yonatan Bisk. Shifting the baseline: Single modality performance\non visual navigation & QA. In NAACL, 2019.\n\n[39] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination\nin image captioning. In EMNLP, 2018.\n\n[40] Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. Revisiting visual question answering baselines.\nIn European conference on computer vision, pages 727\u2013739. Springer, 2016.\n\n[41] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for\nvisual question answering. In Advances In Neural Information Processing Systems, pages 289\u2013297, 2016.\n\n[42] Hedi Ben-Younes, R\u00e9mi Cad\u00e8ne, Nicolas Thome, and Matthieu Cord. Mutan: Multimodal tucker fusion\nfor visual question answering. ICCV, 2017.\n\n[43] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 
Beyond bilinear: Generalized\nmulti-modal factorized high-order pooling for visual question answering. IEEE Transactions on Neural\nNetworks and Learning Systems, 2018.\n\n[44] Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Human Attention in\nVisual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In Conference\non Empirical Methods in Natural Language Processing (EMNLP), 2016.\n\n[45] Robik Shrestha, Kushal Kafle, and Christopher Kanan. Answer them all! toward universal visual question\nanswering models. CVPR, 2019.", "award": [], "sourceid": 462, "authors": [{"given_name": "Remi", "family_name": "Cadene", "institution": "Sorbonne University - LIP6"}, {"given_name": "Corentin", "family_name": "Dancette", "institution": "Sorbonne Universit\u00e9"}, {"given_name": "Hedi", "family_name": "Ben younes", "institution": "Universit\u00e9 Pierre & Marie Curie / Heuritech"}, {"given_name": "Matthieu", "family_name": "Cord", "institution": "Sorbonne University"}, {"given_name": "Devi", "family_name": "Parikh", "institution": "Georgia Tech / Facebook AI Research (FAIR)"}]}