{"title": "Learning to Specialize with Knowledge Distillation for Visual Question Answering", "book": "Advances in Neural Information Processing Systems", "page_first": 8081, "page_last": 8091, "abstract": "Visual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks defined by questions within a unified framework. Learning specialized models for individual types of tasks is intuitively attractive but surprisingly difficult; it is not straightforward to outperform a naive independent ensemble approach. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning (MCL) framework, where training examples are assigned dynamically to a subset of models for updating network parameters. The assigned and non-assigned models are learned to predict ground-truth answers and to imitate their own base models before specialization, respectively. Our approach alleviates the limitation of data deficiency in existing MCL frameworks, and allows each model to learn its own specialized expertise without forgetting general knowledge. The proposed framework is model-agnostic and applicable to tasks other than VQA, e.g., image classification with a large number of labels but few per-class examples, which is known to be difficult under existing MCL schemes. 
Our experimental results indeed demonstrate that our method outperforms other baselines for VQA and image classification.", "full_text": "Learning to Specialize with Knowledge Distillation for Visual Question Answering\n\nJonghwan Mun1,3\n\nKimin Lee2\n\nJinwoo Shin2\n\nBohyung Han3\n\n1Computer Vision Lab., POSTECH, Pohang, Korea\n\n2Algorithmic Intelligence Lab., KAIST, Daejeon, Korea\n\n3Computer Vision Lab., ASRI, Seoul National University, Seoul, Korea\n\n1choco1916@postech.ac.kr 2{kiminlee,jinwoos}@kaist.ac.kr 3bhhan@snu.ac.kr\n\nAbstract\n\nVisual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks defined by questions within a unified framework. Learning specialized models for individual types of tasks is intuitively attractive but surprisingly difficult; it is not straightforward to outperform a naïve independent ensemble approach. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning (MCL) framework, where training examples are assigned dynamically to a subset of models for updating network parameters. The assigned and non-assigned models are learned to predict ground-truth answers and to imitate their own base models before specialization, respectively. Our approach alleviates the limitation of data deficiency in existing MCL frameworks, and allows each model to learn its own specialized expertise without forgetting general knowledge. The proposed framework is model-agnostic and applicable to tasks other than VQA, e.g., image classification with a large number of labels but few per-class examples, which is known to be difficult under existing MCL schemes. 
Our experimental results indeed demonstrate that our method outperforms other baselines for VQA and image classification.\n\n1 Introduction\n\nVisual Question Answering (VQA) [9] is the task of finding an answer to a question about an input image. This is an extremely challenging problem because VQA models deal with various recognition tasks at the same time within a unified framework, which requires understanding the local and global context of an image as well as a question. A VQA model thus should have diverse reasoning capabilities to capture appropriate information from input images and questions. Despite such challenges, recent approaches [4, 8, 16, 27, 31, 33, 34] show impressive performance by leveraging advances in deep neural networks and the emergence of large-scale datasets [9, 14].\n\nAlthough VQA is composed of various tasks defined by questions, existing algorithms typically train a universal model generalized for all possible questions, as depicted in Figure 1(a). This is partly because designing and learning specialized models for each task is difficult by itself and it is not straightforward to develop an algorithm that assigns tasks to a subset of models in a principled way. In practice, it is challenging to show improved performance by model specialization compared to a naïve independent ensemble. This paper tackles how to associate models with individual types of tasks and how to learn the specialized models effectively, as illustrated in Figure 1(b).\n\nRecently, Multiple Choice Learning (MCL) [10, 19, 21] has been investigated as an elegant framework to learn specialized models for recognition. In MCL, examples are typically assigned to the subset of models with the highest accuracy, so each model is expected to be specialized to certain types of examples. Our intuition is that the specialized models have the potential to outperform models generalized on all tasks since the models trained by MCL achieve higher oracle accuracy; at least one of the models predicts correctly for each example. This suggests that learning specialized models is a promising direction to beat an ensemble of universal models in VQA.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\fFigure 1: Comparison between the conventional approach and ours for VQA. In our approach, we are interested in learning specialized models for certain questions that require different visual reasoning.\n\nUnfortunately, direct use of MCL turns out to be ineffective in VQA, i.e., the models trained with existing MCL schemes are typically outperformed by naïve ensembles. This is mainly because models trained by MCL inherently suffer from data deficiency due to hard assignments of tasks to particular models; each model can see only a subset of the training dataset, which results in weak generalization. In addition to the data deficiency issue, MCL loses the opportunity to learn general knowledge from the examples assigned to other models. Note that VQA models tend to learn compositional information observed across all training examples. For example, given two questions 'what is the color of umbrella?' and 'how many people are wearing glasses?' in a training set, a model specialized to only one of the two questions may have trouble answering a question like 'what is the color of glasses?' Therefore, one has to specialize models while remaining aware of general knowledge shared across diverse examples. This issue has not been addressed in the MCL approaches as they mainly focus on simple image classification tasks with at most 10 labels [10, 19, 21].\n\nContribution To overcome these challenges, we propose a novel and principled algorithm leveraging multiple choice learning and knowledge distillation. 
The main difference with respect to the standard MCL framework is two-fold: the oracle loss definition and the example assignment policy. We first learn multiple base models independently on the whole training dataset, and assign each example based on the scores from the base models. During the training procedure, we specialize a subset of models to the dynamically assigned examples given by the current configurations of models, while the rest of the models imitate the predictions of their base models for those examples. This strategy of knowledge distillation on non-assigned examples alleviates the data deficiency problem of MCL and is effective for learning rich compositional information across examples. The proposed algorithm shows meaningful performance improvement over the naïve ensemble and variants of MCL on VQA. Furthermore, due to its model-agnostic and generic property, it is straightforward to apply the proposed method to a variety of tasks.\n\nThe main contributions of this paper are summarized as follows:\n\n• We propose a novel model-agnostic ensemble learning algorithm, referred to as Multiple Choice Learning with Knowledge Distillation (MCL-KD), which learns models to be specialized to a subset of tasks. 
In particular, we introduce a new oracle loss by incorporating the concept of knowledge distillation into MCL, which makes it possible to handle the data deficiency issue in MCL effectively and to learn shared representations from the whole training data.\n\n• The proposed algorithm is applied to existing VQA models and consistently improves performance compared to independent ensembles and existing MCL-based approaches.\n\n• We show that our framework also works well in challenging image classification tasks with many labels but few per-class examples, in which other MCL variants perform poorly.\n\nRelated works The main research stream of VQA is to learn end-to-end deep neural network models that answer all types of questions in a unified framework. To this end, VQA models incorporate various techniques to concentrate only on the tasks specified by questions, including attention mechanisms [26, 27, 31, 33, 34], multimodal fusion schemes [4, 8, 16], adaptive networks [28], and modular networks [2, 3, 13]. Although these approaches show the capability to adapt models to particular tasks characterized by input images and questions, it is not straightforward to handle various heterogeneous tasks in a single model or to understand the internal operations of VQA models.\n\n[Figure 1 graphic: the same three questions (Q1: What color is the boy's shirt? Q2: How many umbrellas are open? Q3: What is the girl walking with?) are handled by a single universal VQA system in (a) the conventional approach, and by three specialized VQA systems in (b) our approach.]\n\nUse of multiple models for VQA is a natural direction because an ensemble of multiple models is a common practice to improve performance in deep learning, and proper allocation of tasks to a subset of the models may lead to better generalization by reducing problem complexity. 
Traditional independent ensemble (IE) [7], which trains models independently with random initialization, is known to be a reasonable option but is far from an approach that learns task-specific models. A more sophisticated ensemble method based on MCL [10] specializes ensemble members on a subset of data and encourages individual models to produce diverse and reasonable outputs; it minimizes the so-called oracle loss and focuses on the most accurate prediction. However, due to the overconfidence issue [19, 20] of deep neural networks (DNNs), it is not straightforward to select appropriate models from the ensemble members. Confident multiple choice learning (CMCL) [19] alleviates this issue by introducing a loss term that minimizes the Kullback-Leibler (KL) divergence between the predictive distribution of each non-specialized model and a uniform distribution. However, it suffers from the data deficiency issue for complex tasks such as VQA, as mentioned earlier. We overcome this limitation by incorporating knowledge distillation [12].\n\nKnowledge distillation is achieved by training a network to mimic the activations of intermediate layers [30], attention maps [36], or output distributions [12] of large networks (i.e., teacher networks). The knowledge distillation technique is widely used to learn compact and fast small models (i.e., student networks) in many practical applications [5, 22, 25]. Beyond learning compact models, the concept of knowledge distillation has been used in other tasks [6, 23, 29]. Chen et al. [6] propose a system that builds large deep neural network models by transferring knowledge from small networks trained beforehand, while Li et al. [23] employ the idea for continual learning to preserve knowledge from previously learned tasks. Noroozi et al. 
[29] boost performance in a self-supervised learning framework by adopting a knowledge distillation pipeline instead of fine-tuning when passing learned feature information to target tasks. In contrast to the prior works, our novelty lies in utilizing knowledge distillation for balancing generalization and specialization of ensemble models.\n\n2 Background on Multiple Choice Learning\n\nThis section introduces the main idea of multiple choice learning, and compares its two variants, original and confident multiple choice learning.\n\n2.1 Multiple Choice Learning\n\nThe objective of MCL [10] is to minimize the oracle loss, i.e., making at least one of M models predict the correct answer. Formally, denote a training dataset by D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where N is the number of training examples and x and y are input and ground-truth output, respectively. Given the predictive distributions P of the M models, the oracle loss is defined as\n\nL_MCL(D) = \sum_{i=1}^{N} \min_{m} \ell_{task}(y_i, P(y|x_i; \theta_m)),  (1)\n\nwhere \theta_m and \ell_{task}(\cdot, \cdot) denote the m-th model parameters and a task-specific loss function, respectively. The oracle loss optimizes the most accurate model for each input example x_i, driving each model to become a specialist on a subset of questions.\n\nSince the minimum is not a continuous function, the oracle loss is relaxed to the following integer programming problem:\n\nL_MCL(D) = \sum_{i=1}^{N} \sum_{m=1}^{M} v_{i,m} \ell_{task}(y_i, P(y|x_i; \theta_m))  subject to  \sum_{m=1}^{M} v_{i,m} = k,  (2)\n\nwhere v_{i,m} \in {0, 1} is an indicator variable for the assignment of x_i to the m-th model, and k (= 1, ..., M) is the number of specialized models per example. 
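As an illustration (not the authors' code), the oracle loss of Eq. (1) can be sketched in NumPy; the array shapes and the function name are our own choices, and the task loss is taken to be the negative log-likelihood:

```python
import numpy as np

def mcl_oracle_loss(probs, labels):
    """Oracle loss of Eq. (1): per example, keep only the smallest
    task loss (negative log-likelihood here) over the M models.

    probs:  (M, N, C) predictive distributions of M models
            over C answers for N examples.
    labels: (N,) ground-truth answer indices.
    """
    M, N, C = probs.shape
    # per-model NLL of the ground-truth answer, shape (M, N)
    nll = -np.log(probs[:, np.arange(N), labels] + 1e-12)
    # min over models, summed over examples
    return nll.min(axis=0).sum()
```

Only the best model per example contributes to the loss, which is exactly what drives specialization.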
Note that if k = M, MCL is identical to independent ensemble.\n\n\fFigure 2: Overall framework of our multiple choice learning with knowledge distillation.\n\n2.2 Confident Multiple Choice Learning\n\nAlthough the objective of MCL [10] is appropriate for generating diverse outputs, it suffers from an overconfidence issue that results in failure to select specialized models during inference. To address this issue, Lee et al. [19] proposed Confident Multiple Choice Learning (CMCL) based on a confident oracle loss:\n\nL_CMCL(D) = \sum_{i=1}^{N} \sum_{m=1}^{M} v_{i,m} \ell_{task}(y_i, P(y|x_i; \theta_m)) + \beta (1 - v_{i,m}) D_KL(U(y) || P(y|x_i; \theta_m))  subject to  \sum_{m=1}^{M} v_{i,m} = k,  v_{i,m} \in {0, 1},  (3)\n\nwhere \beta is a weight for the losses of non-specialized models, D_KL is the KL divergence, and U(y) denotes the uniform distribution. Compared to the oracle loss in MCL, the confident oracle loss further regularizes non-specialized models to be less confident by minimizing the KL divergence between their predictive distributions and the uniform distribution. Although the oracle loss in CMCL is well-designed to learn specialized models, its performance is not impressive particularly in complex tasks such as VQA because the loss function forces individual models to disregard unassigned training examples completely. To tackle the limitations of the existing MCL techniques, we present a more principled oracle loss in the next section.\n\n3 Multiple Choice Learning with Knowledge Distillation\n\nAlthough MCL and CMCL show potential to achieve competitive performance compared to independent ensembles, model specialization on a subset of training examples suffers from weak generalization power of each model, often resulting in degraded accuracy. 
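For later comparison with our loss, the confident oracle loss of Eq. (3) admits a similarly compact NumPy sketch (our own illustrative code; `assign` holds the indicator variables v, and the `beta` default is an arbitrary placeholder):

```python
import numpy as np

def cmcl_loss(probs, labels, assign, beta=0.75):
    """Confident oracle loss of Eq. (3).

    probs:  (M, N, C) model predictive distributions.
    labels: (N,) ground-truth indices.
    assign: (N, M) binary indicators v_{i,m}, each row summing to k.
    beta:   weight on the KL(U || P) term for non-assigned models.
    """
    M, N, C = probs.shape
    nll = -np.log(probs[:, np.arange(N), labels] + 1e-12)         # (M, N)
    # KL(U || P) = -log(C) - (1/C) * sum_y log P(y | x)
    kl_uniform = -np.log(C) - np.log(probs + 1e-12).mean(axis=2)  # (M, N)
    v = assign.T                                                  # (M, N)
    return (v * nll + beta * (1.0 - v) * kl_uniform).sum()
```

Note that the non-assigned term pulls predictions toward the uniform distribution, so those models learn nothing task-specific from unassigned examples; this is the behavior our method replaces with distillation.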
We believe that this is mainly due to data deficiency; specialization to a subset of the training data reduces the number of observed examples for each model.\n\nTo address the issue, we design a novel oracle loss that makes models specialized to a subset of training examples while encouraging them to maintain common knowledge in the training data (e.g., the concepts of attributes, objects, etc.). This objective is achieved by multiple choice learning with knowledge distillation. Our overall learning framework, depicted in Figure 2, is composed of two steps. Our algorithm first learns M base models, which are generalists trained independently on the whole training dataset. Then, we specialize a subset of models to each example while the rest of the models are trained to be at least as good as the corresponding base models on that example.\n\nAnother motivation of our work starts from the natural question of whether forcing a uniform distribution is the optimal choice for relaxing the overconfidence issue of MCL. We indeed found that CMCL is not effective in more complex tasks such as VQA. This fact motivates our approach of developing a new loss function that utilizes knowledge distillation.\n\n[Figure 2 diagram: the training data is first used for independent ensemble learning of base models B1, B2, B3; specialized models S1, S2, S3 are then trained with multiple choice learning on their assigned subsets D1, D2, D3 and with knowledge distillation from the corresponding base model outputs on the remaining examples.]\n\n\f3.1 Multiple Choice Learning with Knowledge Distillation\n\nWe propose a novel multiple choice learning framework, Multiple Choice Learning with Knowledge Distillation (MCL-KD). 
Given a training dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} and M independently trained base models with fixed model parameters \phi_m, we train M models parametrized by \theta_m in the proposed MCL-KD framework using the following oracle loss:\n\nL_MCL-KD(D) = \sum_{i=1}^{N} \sum_{m=1}^{M} v_{i,m} \ell_{task}(y_i, P(y|x_i; \theta_m)) + \beta (1 - v_{i,m}) \ell_{KD}(P(y|x_i; \phi_m, T), P(y|x_i; \theta_m, T))  subject to  \sum_{m=1}^{M} v_{i,m} = k,  v_{i,m} \in {0, 1},  (4)\n\nwhere \beta > 0 is a hyper-parameter to balance the two loss terms, and T > 0 is a temperature scaling parameter of the softmax function. Here, we employ a knowledge distillation loss \ell_{KD}(\cdot) between the m-th base model and the corresponding specialized model, which is formally given by\n\n\ell_{KD}(P(y|x_i; \phi_m, T), P(y|x_i; \theta_m, T)) = D_KL(P(y|x_i; \phi_m, T) || P(y|x_i; \theta_m, T)),  (5)\n\nwhere P(y|x; \theta, T) = exp(f_y(x; \theta)/T) / \sum_{y'} exp(f_{y'}(x; \theta)/T) is a temperature-calibrated softmax distribution and f(\cdot) denotes a logit of the deep models.\n\nIn our oracle loss, specialized models are learned to predict the ground-truth answers while non-specialized ones are trained to preserve the representations of the corresponding base models. We believe that this is a reasonable choice because knowledge distillation is known to be an effective technique for fast optimization, transfer learning, and learning without forgetting [23, 35]. Note that, contrary to CMCL, the knowledge distillation loss in our framework provides the opportunity to learn from non-assigned training examples and makes non-specialized models as competitive as the corresponding base models on those examples.\n\nThe training procedure of MCL-KD is as follows. We first learn M base models independently using the whole training data. 
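The distillation term of Eq. (5), with the temperature-calibrated softmax defined above, can be sketched as follows (illustrative NumPy code, not the authors' implementation):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-calibrated softmax P(y|x; theta, T)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=0.5):
    """Eq. (5): KL(P_teacher(T) || P_student(T)) over the answer axis."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
```

The loss is zero when the student reproduces the teacher exactly and positive otherwise; with T < 1 (e.g., the T = 0.1 used for CLEVR later) both distributions are sharpened, so the student is pushed hardest toward the teacher's top-scoring answers.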
Given the number of specialized models per example, k, a binary assignment vector v_i = (v_{i,1}, v_{i,2}, ..., v_{i,M}) indicates which models are specialized for x_i. Then, v_i and \theta_m are optimized by the following iterative procedure:\n\n1. Fix \theta_m and optimize v_i. Let A_{k,M} denote the collection of all possible assignment vectors. Then, based on the current model parameters \theta_m, the assignment vector v_i is determined to achieve the lowest L_MCL-KD, which is given by\n\nv_i = argmin_{v'_i \in A_{k,M}} L_MCL-KD(x_i, y_i, v'_i) = argmin_{v'_i \in A_{k,M}} \sum_{m=1}^{M} v'_{i,m} \ell_{task}(y_i, P_{\theta_m}(y|x_i)) + \beta (1 - v'_{i,m}) \ell_{KD}(P_{\phi_m,T}(y|x_i), P_{\theta_m,T}(y|x_i)).\n\n2. Fix v_i and optimize \theta_m. Given the assignment vector v_i, each model is trained to minimize the task loss on assigned examples and the knowledge distillation loss on non-assigned examples.\n\nThese two optimization steps are repeated until convergence. For computational efficiency, Lee et al. [21] proposed a stochastic algorithm in which the model assignment step and the model update are performed within a mini-batch; that is, examples are assigned to the models with the lowest oracle loss and the models are updated without waiting for assignment convergence. Because this simple stochastic algorithm is computationally efficient and works well in practice, we also employ it for optimization of the proposed oracle loss.\n\n\fTable 1: Classification accuracy (%) on CLEVR validation set with varying the number of specialized models (k). 
The bold-faced numbers mean the best algorithm for each k in top-1 accuracy.\n\nAnswering Network | k | Single (Top1) | IE (Top1 / Oracle) | MCL (Top1 / Oracle) | CMCL (Top1 / Oracle) | MCL-KD (Top1 / Oracle)\nMLP | 1 | 58.40 | 60.10 / 80.73 | 41.31 / 98.92 | 59.12 / 63.76 | 60.22 / 80.75\nMLP | 2 | | | 48.94 / 97.57 | 60.27 / 76.00 | 60.38 / 81.20\nMLP | 3 | | | 58.63 / 95.67 | 60.49 / 82.67 | 60.89 / 81.86\nSAN | 1 | 82.30 | 85.23 / 94.93 | 42.19 / 98.67 | 83.99 / 91.55 | 85.98 / 95.38\nSAN | 2 | | | 58.39 / 98.62 | 84.83 / 96.64 | 87.02 / 95.78\nSAN | 3 | | | 83.73 / 98.62 | 86.18 / 96.26 | 88.16 / 96.12\n(Single and IE do not depend on k.)\n\nTable 2: Classification accuracy (%) on VQA v2.0 validation set at k = 3. The bold-faced number means the best algorithm in top-1 accuracy.\n\nSingle (Top1) | IE (Top1 / Oracle) | MCL (Top1 / Oracle) | CMCL (Top1 / Oracle) | MCL-KD (Top1 / Oracle)\n63.42 | 65.27 / 76.23 | 64.94 / 78.15 | 64.99 / 73.34 | 65.67 / 76.95\n\n3.2 Application to Visual Question Answering\n\nFor VQA, given the input data of an image and a question x_i = (I_i, q_i) and its corresponding answer y_i, an answering network is trained to minimize the negative log-likelihood of its prediction:\n\n\ell_VQA(y_i, P(y|x_i; \theta)) = -log P(y_i|x_i; \theta) = -log P(y_i|I_i, q_i; \theta).  (6)\n\nIn our algorithm, specialized answering networks with knowledge distillation are trained with the loss function in Eq. (4) with \ell_task replaced by \ell_VQA. The loss functions of all other compared models, such as IE, MCL, and CMCL, are obtained in the same manner. After learning the specialized models, we feed each test example forward through every model and the final answer is obtained from the average of all prediction scores after softmax:\n\nargmax_y (1/M) \sum_{m=1}^{M} P(y|I, q; \theta_m),  (7)\n\nwhere \theta_m denotes the parameters of the m-th specialized model.\n\n4 Experimental Results\n\n4.1 Visual Question Answering\n\nDataset We employ the CLEVR and VQA v2.0 datasets to validate our algorithm. 
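The stochastic mini-batch version of the alternating procedure in Section 3.1, together with the ensemble prediction of Eq. (7), can be sketched as follows (our own illustrative NumPy code; the per-model loss matrices are assumed to be precomputed by the network):

```python
import numpy as np

def mcl_kd_step(task_loss, kd_loss, k, beta):
    """One stochastic MCL-KD assignment step on a mini-batch.

    task_loss, kd_loss: (M, N) per-model, per-example losses.
    Minimizing Eq. (4) over v picks, for each example, the k models
    with the smallest task_loss - beta * kd_loss as specialists.
    Returns the assignment v (M, N) and the total mini-batch loss.
    """
    M, N = task_loss.shape
    margin = task_loss - beta * kd_loss    # cost of specializing model m
    order = np.argsort(margin, axis=0)     # per-example model ranking
    v = np.zeros((M, N))
    v[order[:k], np.arange(N)] = 1.0       # top-k models specialize
    total = (v * task_loss + beta * (1.0 - v) * kd_loss).sum()
    return v, total

def ensemble_predict(probs):
    """Eq. (7): average the M softmax outputs and take the argmax.
    probs: (M, N, C) -> (N,) predicted answer indices."""
    return probs.mean(axis=0).argmax(axis=-1)
```

In a full training loop, `total` would be backpropagated so that assigned models receive task-loss gradients and the rest receive distillation gradients, matching the two fixed-point steps above.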
CLEVR [14] is constructed for an analysis of various aspects of visual reasoning such as attribute identification, counting, comparison, spatial relationships, and logical operations. The dataset is composed of 70,000 training images with 699,989 questions and 15,000 validation images with 149,991 questions, where each question is associated with a single unique answer. The vocabulary sizes of questions and answers are 84 and 28, respectively. VQA v2.0 [9] is a very popular dataset based on images collected from MSCOCO [24]. To handle the dataset bias issues found in v1.0, it contains two images with different answers for each question. This dataset consists of 443,757 and 214,354 questions for training and validation, respectively, where each question has 10 candidate answers.\n\nExperimental setup For the CLEVR dataset, we adopt two models as our answering networks: a simple MLP-based model with 2 hidden layers of 1,024 units after an image and question fusion layer, and the well-known stacked attention network (SAN) [15]. We extract conv4 features from input images of 224×224 using ResNet-101 [11], which results in 1024×14×14 image representations. We also apply additional residual blocks on top of the extracted conv4 features to adapt the image representations to the target dataset. All models are optimized using ADAM [17] with a fixed learning rate of 0.0005 and a batch size of 64 while the parameters of ResNet-101 are fixed. We set \beta and T in Eq. (4) to 50 and 0.1, respectively, based on our empirical observations. For the VQA v2.0 dataset, we adopt the bottom-up and top-down attention model [1], which is the winner of the 2017 VQA challenge.\n\n\fFigure 3: Visualization of the number of training examples assigned to each SAN model [15] of MCL-KD at k = 1 on the CLEVR dataset. 
The numbers on the x-axis denote indices of question families, and the numbers of assigned examples are normalized within each question family (each column).\n\nTable 3: Question examples from the question families in which the SAN models in Figure 3 are dominantly specialized. Questions in different question families ask about different semantics of images. For example, questions in #1 involve comparison by counting two kinds of objects, while those in #37 require size comparison between objects.\n\nQuestion family #1:\nAre there the same number of red balls and cyan balls?\nAre there an equal number of metallic spheres and brown cylinders?\nIs the number of yellow things the same as the number of tiny cyan blocks?\nAre there the same number of blue cylinders and big blocks?\n\nQuestion family #13:\nDoes the shiny object in front of the purple matte block have the same size as the small metallic cylinder?\nThere is a object that is behind the blue object; does it have the same size as the green cylinder?\n\nQuestion family #37:\nAre there any other things that have the same size as the brown shiny sphere?\nIs there any other thing that is the same size as the rubber cube?\nIs there any other thing of the same size as the cube?\nAre there any other things that have the same size as the cyan metallic cylinder?\n\nQuestion family #85:\nHow many tiny blue cubes are there?\nHow many matte cubes are there?\nWhat number of big red objects are there?\nWhat number of tiny yellow metal cylinders are there?\n\nWe use the publicly available implementation1 and leave all parameters unchanged except the batch size, which is changed from 512 to 256 due to memory limitation. We set \beta and T to 5,000 and 0.5, respectively.\n\nFor fair comparison, we initialize the networks of all algorithms (MCL, CMCL and MCL-KD) in the same way using the base models trained independently. According to our observation, this strategy generally achieves higher performance than learning from scratch for all methods. 
For evaluation, we measure both top-1 and oracle accuracy. The top-1 accuracy is computed as the ratio of correctly predicted examples identified by the average output distribution of the ensemble members. The oracle accuracy measures whether at least one of the models predicts the correct answer for an input image and question pair. Generally speaking, higher oracle accuracy implies that the trained models are more specialized to subsets of data.\n\nResults We compare our algorithm, denoted by MCL-KD, with three baselines: IE, MCL and CMCL. We train 5 models while varying the number of specialized models, k, for MCL, CMCL and MCL-KD. We test performance of the MCL-based models for k = 1, 2, 3.\n\nTable 1 summarizes the results on the validation set of the CLEVR dataset. It is noticeable that the top-1 accuracies of all three MCL-based methods increase with larger k. This is partly because increasing k is effective in alleviating the data deficiency issue and improving the quality of ensemble predictions. MCL is the best in terms of oracle accuracy, but its top-1 accuracy is not satisfactory due to the overconfidence issue. CMCL is substantially better than MCL, but still not sufficient to achieve a clear accuracy improvement with respect to IE. On the contrary, MCL-KD consistently outperforms IE, MCL, and CMCL regardless of k. This implies that applying the knowledge distillation loss to non-specialized models is important to balance specialization and generalization of ensemble models in visual question answering.\n\n1https://github.com/hengyuan-hu/bottom-up-attention-vqa\n\n\fFigure 4: Visualization of the prediction distributions given by the SAN models of the MCL, CMCL and MCL-KD algorithms on the CLEVR dataset when k = 1. The number at the upper-right corner of each box indicates the label predicted by each model and its probability; correct predictions are marked in blue. Red boxes denote the specialized models. 
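The two evaluation metrics described above can be written compactly (an illustrative NumPy sketch, not the authors' evaluation script):

```python
import numpy as np

def top1_and_oracle(probs, labels):
    """Top-1: accuracy of the averaged ensemble distribution.
    Oracle: fraction of examples at least one member gets right.
    probs: (M, N, C) member distributions, labels: (N,) indices."""
    ens_pred = probs.mean(axis=0).argmax(axis=-1)        # (N,)
    top1 = float((ens_pred == labels).mean())
    member_pred = probs.argmax(axis=-1)                  # (M, N)
    oracle = float((member_pred == labels[None, :]).any(axis=0).mean())
    return top1, oracle
```

Oracle accuracy upper-bounds what a perfect model-selection rule could achieve, which is why overconfident members (high oracle, low top-1) indicate a selection failure rather than a capacity failure.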
We also observe that the accuracy improvement over IE is lower in MLP than in SAN. This is probably because of the following two reasons: 1) the benefit of the ensemble is reduced because the architecture of the MLP is simpler and the learned models are more correlated, and 2) knowledge distillation is less effective since the single-model accuracy of the MLP is lower and thus there is little knowledge to be distilled compared to stronger models.\n\nTable 2 presents the results on the validation set of VQA v2.0 at k = 3. VQA v2.0 is more complex and challenging since it is a real dataset and contains more diverse question and answer pairs than CLEVR. Nevertheless, MCL-KD still achieves an accuracy gain with respect to IE, while both MCL and CMCL are not effective enough to improve accuracy by model specialization.\n\nAnalysis on CLEVR CLEVR divides its questions into 90 question families depending on the visual reasoning they require, and examples with the same kind of question belong to the same family. Using this information, we analyze the SAN models of MCL-KD (k = 1) on CLEVR and illustrate how question families are associated with individual models. Figure 3 presents the model specialization tendency of MCL-KD over the question families. It is interesting to see that 4 models (S1, S2, S3, and S4) out of 5 are specialized to unique subsets of question families, while S5 mimics its corresponding base model since only a small number of examples are assigned to S5 and its distribution of question family assignments is close to uniform. Note that only a subset of the question families (26 out of 90) is visualized in Figure 3 due to space limitation. Question examples from the question families in which S1, S2, S3 and S4 are dominantly specialized are shown in Table 3.\n\nFigure 4 shows the predictive distributions of the MCL, CMCL and MCL-KD models with k = 1 on the CLEVR dataset. 
MCL suffers from the overconfidence issue: the non-specialized models (i.e., models 2, 3, 4, and 5) predict incorrect answers with high confidence, which leads to incorrect final decisions. Since the non-specialized models in CMCL are learned to generate a uniform distribution and all models lose the opportunity to learn from a sufficient number of training examples, the specialized model fails to predict the correct answer. However, MCL-KD predicts the correct answer since some non-specialized models are capable of predicting correct answers using knowledge distilled from the corresponding base models.\n\n4.2 Image Classification\n\nDataset Although our primary objective is to learn specialized models for VQA, our algorithm is easily applicable to other tasks. Thus, we also evaluate our algorithm on image classification using the CIFAR-100 [18] dataset, and compare its performance with IE, MCL, and CMCL again.\n\n[Figure 4 example: for the question \"what number of objects are either big objects that are behind the big gray block or tiny brown rubber balls?\" with ground truth 3, the ensemble predictions are MCL: 1, CMCL: 2, MCL-KD: 3.]\n\n\fTable 4: Classification accuracy (%) on CIFAR-100 with varying the number of specialized models (k) out of 5 models. The test accuracies are represented as top1/oracle. The numbers in red and blue denote the best and second-best algorithms for each classification model over k in top-1 accuracy, respectively. 
FS means feature sharing proposed in CMCL. Single and IE are independent of k, so a single number is reported per architecture.

Model name      ResNet-20                                        VGGNet-17
                k=1             k=2             k=3              k=1             k=2             k=3
Single          53.98 / -                                        61.62 / -
IE              64.60 / 79.95                                    68.43 / 81.07
MCL             56.40 / 70.40   57.19 / 78.28   61.28 / 80.03    51.94 / 78.64   62.41 / 82.77   67.30 / 83.91
CMCL            48.67 / 62.56   55.16 / 71.69   58.13 / 75.01    57.27 / 62.16   64.37 / 74.41   67.44 / 78.90
CMCL+FS         56.30 / 70.39   61.20 / 77.24   63.78 / 79.84    60.85 / 66.05   66.34 / 75.59   67.94 / 80.36
MCL-KD          65.27 / 80.74   65.31 / 80.79   65.60 / 80.91    68.75 / 81.87   68.77 / 81.72   68.80 / 82.00
MCL-KD+FS       66.72 / 81.07   66.63 / 81.86   66.56 / 81.33    69.70 / 82.48   69.33 / 82.09   68.92 / 82.11

The CIFAR-100 [18] dataset consists of 50,000 training and 10,000 test images over 100 classes, where each image has 32 × 32 RGB pixels. This dataset has a significantly larger number of classes than the datasets tested in Lee et al. [19], CIFAR-10 and SVHN, and only a limited number of training examples per class. Therefore, accuracy improvement by model specialization is known to be challenging under existing multiple choice learning frameworks.

Experimental setup   Following the original implementation of CMCL [19]2, we preprocess the images with global contrast normalization and ZCA whitening, and do not use any data augmentation. We employ two convolutional neural networks, VGGNet-17 [32] and ResNet-20 [11]. For all experiments, we use a softmax classifier, and each model is optimized using stochastic gradient descent with Nesterov momentum. For this task, we also consider feature sharing, which stochastically shares features among ensemble members.
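For reference, the two numbers in each cell of Table 4 can be read as follows: top1 accuracy scores the averaged prediction of all ensemble members, while oracle accuracy counts an example as correct if any single member predicts it correctly. A minimal pure-Python sketch with hypothetical predictions (not the paper's implementation):

```python
def top1_accuracy(member_probs, labels):
    """Accuracy of the ensemble's averaged predictive distribution."""
    correct = 0
    for i, y in enumerate(labels):
        num_classes = len(member_probs[0][i])
        # Average the class distributions of all ensemble members.
        avg = [sum(m[i][c] for m in member_probs) / len(member_probs)
               for c in range(num_classes)]
        if max(range(num_classes), key=lambda c: avg[c]) == y:
            correct += 1
    return correct / len(labels)


def oracle_accuracy(member_probs, labels):
    """An example counts as correct if any single member predicts it."""
    correct = 0
    for i, y in enumerate(labels):
        preds = [max(range(len(m[i])), key=lambda c: m[i][c])
                 for m in member_probs]
        if y in preds:
            correct += 1
    return correct / len(labels)


# Two members, three examples, three classes (hypothetical numbers).
m1 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
m2 = [[0.6, 0.3, 0.1], [0.9, 0.05, 0.05], [0.1, 0.2, 0.7]]
labels = [0, 1, 2]
```

The gap between the two numbers indicates how much complementary expertise the ensemble members hold.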
This trick was proposed in CMCL to improve classification performance by sharing general features and mitigating the data deficiency issue. As in [19], we share the nonlinearly activated features right before the first pooling layer, i.e., the 2nd ReLU activations of VGGNet-17 and the 6th ReLU activations of ResNet-20.

Results   Table 4 presents the results of all models on the CIFAR-100 dataset. Both MCL and CMCL fail to achieve accuracy competitive with IE regardless of k; surprisingly, their oracle accuracies are often worse than those of IE. This is because CIFAR-100 has only 500 training examples per class, and the resulting data deficiency causes MCL and CMCL to fail at specialization. Note that, although sharing low-level representations helps improve classification accuracy in all cases by alleviating data deficiency, its benefit is not sufficient to make MCL and CMCL competitive on CIFAR-100. In contrast, by adopting knowledge distillation, our method consistently outperforms IE for all k's by large margins, both with and without feature sharing.

5 Conclusion

We presented a novel and principled framework to learn specialized models for visual question answering. To this end, we formulated the problem with an ensemble of models, where each model is specialized dynamically on a subset of training examples within a multiple choice learning framework. Exploiting the idea of knowledge distillation, we first learn base models on the whole dataset and then specialize models to predict ground-truth labels on assigned examples while preserving the representations of their base models on non-assigned ones. This method effectively addresses the data deficiency issue in multiple choice learning. We showed that our algorithm consistently outperforms all other methods, including IE, MCL, and CMCL, on VQA and image classification.
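The training scheme summarized above can be sketched per example: assigned models minimize a task loss on the ground truth, while non-assigned models imitate their own base models via distillation. The following minimal pure-Python sketch uses a lowest-loss assignment rule and a distillation weight beta, both of which are illustrative assumptions rather than the paper's exact formulation:

```python
import math


def cross_entropy(probs, label):
    # Negative log-likelihood of the ground-truth answer.
    return -math.log(probs[label])


def kl_divergence(base, model):
    # KL(base || model): penalizes drifting from the base model's output.
    return sum(b * math.log(b / m) for b, m in zip(base, model))


def mcl_kd_loss(model_probs, base_probs, label, k, beta=0.5):
    """Per-example loss sketch: the k models currently best at this example
    are 'assigned' and trained on the ground truth; the others are trained
    to imitate their own base models through distillation."""
    task_losses = [cross_entropy(p, label) for p in model_probs]
    assigned = set(sorted(range(len(model_probs)),
                          key=lambda m: task_losses[m])[:k])
    loss = 0.0
    for m, p in enumerate(model_probs):
        if m in assigned:
            loss += task_losses[m]  # specialize on this example
        else:
            # Keep general knowledge by matching the base model's distribution.
            loss += beta * kl_divergence(base_probs[m], p)
    return loss
```

When a model's current output equals its base model's output, its distillation term vanishes, so only the assigned models receive a nonzero gradient signal from this example.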
We believe that adaptively determining the number of specialized models for each example would be an interesting future direction.

2https://github.com/chhwang/cmcl

Acknowledgments

This work was partly supported by the ICT R&D program of the MSIP/IITP grants [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion; 2017-0-01778, Development of Explainable Human-level Deep Machine Learning Inference Framework] and by the Kakao and Kakao Brain corporations.

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In NAACL, 2016.
[3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.
[4] H. Ben-younes, R. Cadene, M. Cord, and N. Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In ICCV, 2017.
[5] G. Chen, W. Choi, X. Chen, T. X. Han, and M. K. Chandraker. Learning efficient object detection models with knowledge distillation. In NIPS, 2017.
[6] T. Chen, I. J. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In ICLR, 2016.
[7] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
[10] A. Guzman-Rivera, D. Batra, and P. Kohli.
Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, 2012.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[12] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[13] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
[14] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
[15] V. Kazemi and A. Elqursh. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162, 2017.
[16] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[18] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[19] K. Lee, C. Hwang, K. Park, and J. Shin. Confident multiple choice learning. In ICML, 2017.
[20] K. Lee, H. Lee, K. Lee, and J. Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018.
[21] S. Lee, S. P. S. Prakash, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra. Stochastic multiple choice learning for training diverse deep ensembles. In NIPS, 2016.
[22] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. In CVPR, 2017.
[23] Z. Li and D. Hoiem. Learning without forgetting. In ECCV, 2016.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[25] P. Luo, Z.
Zhu, Z. Liu, X. Wang, and X. Tang. Face model compression by distilling knowledge from neurons. In AAAI, 2016.
[26] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In CVPR, 2017.
[27] H. Noh and B. Han. Training recurrent answering units with joint loss minimization for VQA. arXiv preprint arXiv:1606.03647, 2016.
[28] H. Noh, P. H. Seo, and B. Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.
[29] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
[30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[31] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
[34] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[35] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
[36] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.
In ICLR, 2017.