{"title": "Abstract Reasoning with Distracting Features", "book": "Advances in Neural Information Processing Systems", "page_first": 5842, "page_last": 5853, "abstract": "Abstraction reasoning is a long-standing challenge in artificial intelligence. Recent studies suggest that many of the deep architectures that have triumphed over other domains failed to work well in abstract reasoning. In this paper, we first illustrate that one of the main challenges in such a reasoning task is the presence of distracting features, which requires the learning algorithm to leverage counter-evidence and to reject any of false hypothesis in order to learn the true patterns. We later show that carefully designed learning trajectory over different categories of training data can effectively boost learning performance by mitigating the impacts of distracting features. Inspired this fact, we propose feature robust abstract reasoning (FRAR) model, which consists of a reinforcement learning based teacher network to determine the sequence of training and a student network for predictions. Experimental results demonstrated strong improvements over baseline algorithms and we are able to beat the state-of-the-art models by 18.7\\% in RAVEN dataset and 13.3\\% in the PGM dataset.", "full_text": "Abstract Reasoning with Distracting Features\n\nKecheng Zheng\n\nUniversity of Science\n\nand Technology of China\n\nzkcys001@mail.ustc.edu.cn\n\nWei Wei\n\nGoogle Research\n\nwewei@google.com\n\nZheng-jun Zha\u2217\nUniversity of Science\n\nand Technology of China\nzhazj@ustc.edu.cn\n\nAbstract\n\nAbstraction reasoning is a long-standing challenge in arti\ufb01cial intelligence. Recent\nstudies suggest that many of the deep architectures that have triumphed over\nother domains failed to work well in abstract reasoning. 
In this paper, we first illustrate that one of the main challenges in such a reasoning task is the presence of distracting features, which requires the learning algorithm to leverage counter-evidence and to reject false hypotheses in order to learn the true patterns. We then show that a carefully designed learning trajectory over different categories of training data can effectively boost learning performance by mitigating the impact of distracting features. Inspired by this fact, we propose the feature robust abstract reasoning (FRAR) model, which consists of a reinforcement-learning-based teacher network that determines the sequence of training and a student network that makes predictions. Experimental results demonstrate strong improvements over baseline algorithms, and we beat the state-of-the-art models by 18.7% on the RAVEN dataset and 13.3% on the PGM dataset.

1 Introduction

A critical feature of biological intelligence is its capacity for acquiring principles of abstract reasoning from a sequence of images. Developing machines with abstract reasoning skills helps us improve our understanding of the underlying elemental cognitive processes, and it is one of the long-standing challenges of artificial intelligence research [3, 12, 31, 34]. Recently, Raven's Progressive Matrices (RPM), a visual abstract reasoning IQ test for humans, has been used to effectively estimate a model's capacity to extract and process abstract reasoning principles.

Various models have been developed to tackle the problem of abstract reasoning. Some traditional models [4, 24, 25, 26, 27, 29, 30] rely on assumptions and heuristic rules about various measurements of image similarity to perform abstract reasoning. Wang and Su [38] propose an automatic system to efficiently generate a large number of abstract reasoning problems using first-order logic.
There has also been substantial progress in both reasoning and abstract representation learning using deep neural networks [14, 15, 34, 39]. However, these deep-neural-network-based methods simply adopt existing architectures such as CNNs [22], ResNet [11] and the relational network [35] to perform abstract reasoning, but largely ignore some of the reasoning's fundamental characteristics.

One aspect that makes abstract reasoning substantially difficult is the presence of distracting features in addition to the reasoning features that are necessary to solve the problem. Learning algorithms have to leverage various counter-evidence to reject each false hypothesis before reaching the correct one. Some other methods [36, 37] design an unsupervised mapping from a high-dimensional feature space to a few explanatory factors of variation that are subsequently used by reasoning models to complete the abstract reasoning task. Although these models boost the performance of abstract

*Corresponding author.
1Full code is available at https://github.com/zkcys001/distracting_feature.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Left: Without distracting features, it is straightforward to infer the abstract reasoning principles. Samples with distracting features confuse our judgment and make it harder to characterize reasoning features. Right: The influence of distracting features. (a) Without distracting features, training on the whole dataset is better than training on the individual datasets. (b) When the divergence of distracting features is set to zero, test performance decreases as the mean number of distracting features increases. (c) When the mean number of distracting features is set to one, test performance decreases as the divergence of distracting features increases.

Table 1: Test performance of LEN trained on different trajectories.
\"\u2212>\" denotes to training order.\nThe \ufb01rst row demonstrates two datasets (i.e., 1 and 2) without distracting features while the second\nrow illustrates datasets (i.e., 3 and 4) with distracting features. FRAR demonstrates our algorithm\nwhich optimizes learning trajectory to prevent distracting features from affecting the training of\nlearning algorithms.\n\nDataset\n\n1\n\nAcc(%)\n\n74.2\n\ndataset\n\n3\n\n2\u2212>1\n\n79.5\n4\u2212>3\n\n2\n\n60.2\n\n4\n\n1\u2212>2\n\n77.5\n3\u2212>4\n\n1+2\n\n81.0\n\n3+4\n\nAcc(%)\n\n52.5\n\n58.4\n\n64.5\n\n65.9\n\n58.2\n\n1\u2212>2\n\u2212>1+2\n81.8\n3\u2212>4\n\u2212>3+4\n61.0\n\n1\u2212>2\u2212>\n1\u2212>1+2\n81.5\n\n3\u2212>4\u2212>\n1\u2212>3+4\n59.6\n\n2\u2212>1\u2212>\n2\u2212>1+2\n81.3\n\n4\u2212>3\u2212>\n4\u2212>3+4\n62.1\n\nFRAR\n\n82.1\nFRAR\n\n67.6\n\nreasoning tasks by capturing the independent factors of variation given an image, it is still dif\ufb01cult to\n\ufb01nd the reasoning logic from independent factors of variation and separate distracting features and\nreasoning features. Figure 1 shows one such example of abstract reasoning with distracting features\nwhere the true reasoning features in 1) is mingled with distracting ones in 2). Distracting features\ndisrupt the learning of statistical models and make them harder to characterize the true reasoning\npatterns. On the right panel of Figure 1, we see that when we add more distracting features into\nthe dataset (either through increasing the mean number of distracting features or through increasing\nthe divergence of such features), the learning performance decrease sharply alert no information\nthat covers the true reasoning patterns have been changed. Another observation with the distracting\nfeature is that when we divide the abstract reasoning dataset into several subsets, training the model\non the entire dataset would bene\ufb01t the model as opposed to training them separately on the individual\ndataset. 
This is not surprising, since features that do not directly benefit a subset's own reasoning logic might benefit those from other subsets. When distracting features are present, however, we see that some learning algorithms perform worse when trained on the entire dataset, suggesting that the distracting features trick the model and interfere with its performance.
To tackle the problem of abstract reasoning with distraction, we take inspiration from
human learning, in which knowledge is taught progressively in a specific order as our reasoning abilities build up. Table 1 illustrates this idea: we divide the abstract reasoning dataset into two parts, change the proportions of the datasets, and feed them progressively to the learning algorithm as learning proceeds. As we see from the results, when no distracting features are present (first row), changing the order of training has little impact on the results. When distracting features are present (second row), however, the trajectory of training data significantly affects the training outcome. The FRAR model that we propose, which optimizes the training trajectory to prevent distracting features from affecting training, achieves a significant boost of 15.1% compared to training on a single dataset. This suggests that we can achieve better training performance by changing the order in which the learning algorithm receives the training data.

The next question is whether we can design an automated algorithm to choose an optimized learning path that minimizes the adversarial impact of distracting features in abstract reasoning. Some related methods have been studied, but with slightly different motivations. Self-paced learning [21] prioritizes examples with small training loss, which are likely not noisy; hard negative mining [28] assigns priority to examples with high training loss, focusing on the minority class in order to solve the class imbalance problem. MentorNet [18] learns a data-driven curriculum that provides a sample weighting scheme for a student model to focus on samples whose labels are probably correct. These attempts are based either on task-specific heuristic rules or on the strong assumption of a pre-known oracle model. However, in many scenarios there are no heuristic rules, so it is difficult to find an appropriate predefined curriculum.
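The fixed heuristics above can be written as simple loss-based weighting rules. This is a minimal sketch, not the implementations used in the cited works; the threshold value and the hard 0/1 weighting are illustrative assumptions:

```python
def self_paced_weights(losses, threshold):
    # Self-paced learning style rule: keep examples with small
    # training loss, which are likely to be clean/easy.
    return [1.0 if l <= threshold else 0.0 for l in losses]

def hard_example_weights(losses, threshold):
    # Hard negative mining style rule: keep examples with large
    # training loss, e.g. to focus on a hard minority class.
    return [1.0 if l > threshold else 0.0 for l in losses]

losses = [0.1, 0.5, 2.0, 3.5]
print(self_paced_weights(losses, 1.0))   # -> [1.0, 1.0, 0.0, 0.0]
print(hard_example_weights(losses, 1.0)) # -> [0.0, 0.0, 1.0, 1.0]
```

Both rules are fixed functions of the current loss alone; neither adapts to the student's learning history, which is exactly the limitation discussed here.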
Thus, an adjustable curriculum that takes the student's feedback into account has greater advantages. Fan et al. [10] leverage the feedback from the student model to optimize the teacher's teaching strategies by means of reinforcement learning. In [10], however, historical trajectory information is insufficiently considered and the action is not flexible enough, making the method unsuitable for situations where the training trajectory should be taken into account.

In this paper, we propose a method to learn an adaptive logic path from data with a model named the feature robust abstract reasoning (FRAR) model. Our model consists of two intelligent agents interacting with each other. Specifically, a novel Logic Embedding Network (LEN) is proposed as the student model to disentangle abstract reasoning by explicitly enumerating a much larger space of logic reasoning. A teacher model is proposed to determine the appropriate proportion of teaching materials, based on the learning behavior of the student model, as the adaptive logic path. With the guidance of this adaptive logic path, the Logic Embedding Network is able to characterize reasoning features and distracting features, and then infer abstract reasoning rules from the reasoning features. The teacher model optimizes its teaching strategies based on the feedback from the student model by means of reinforcement learning, so as to achieve teacher-student co-evolution. Extensive experiments on the PGM and RAVEN datasets demonstrate that the proposed FRAR outperforms state-of-the-art methods.

2 Related Work

Abstract reasoning In order to develop machines with the capability to model the underlying reasoning process, computational models [4, 24, 25, 26, 27, 29, 30] have been proposed to disentangle abstract reasoning. Some simplified assumptions [4, 25, 26, 27] are made in the experiments, namely that machines are able to extract a symbolic representation of images and then infer the corresponding rules.
Various measurements of image similarity [24, 29, 30] are adopted to learn the relational structures of abstract reasoning. These methods rely on assumptions about typical abstract reasoning principles. Since Wang and Su [38] proposed an automatic system to efficiently generate a large number of abstract reasoning problems using first-order logic, there has been substantial progress in both reasoning and abstract representation learning with neural networks. A novel variant of the Relation Network [35] with a scoring structure [34] is designed to learn relational comparisons between a sequence of images and then reason about the corresponding rules. Hill et al. [14] induce analogical reasoning in neural networks by contrasting abstract relational structures. Zhang et al. [39] propose a dynamic residual tree (DRT) that jointly operates on the space of image understanding and structure reasoning.

Curriculum learning The teaching strategy of weighting each training example has been well studied in the literature [5, 6, 18, 21, 28, 32]. Self-paced learning [9, 16, 17, 21] prioritizes examples with small training loss, which are likely not noisy; hard negative mining [28] assigns a

Figure 2: Overview of the interactive process between the teacher model and the student model. Left: Training the student model under the guidance of a teacher model replaces training it in a random order. Right: We formulate the teacher model as a reinforcement learning problem. Our reinforcement learning agent (DDPG) receives the state st from the performance of the student model and outputs a proportion at of the training data at the tth time step. After training the student model, the accuracy of the student model on a held-out validation set is evaluated as a reward r, which is returned to the reinforcement learning agent.

priority to examples with high training loss, focusing on the minority class in order to solve the class imbalance problem.
MentorNet [18] learns a data-driven curriculum that provides a sample weighting scheme for a StudentNet, focusing on the samples whose labels are probably correct. These attempts are based either on task-specific heuristic rules or on the strong assumption of a pre-known oracle model. Fan et al. [10] leverage the feedback from a student model to optimize the teacher's teaching strategies by means of reinforcement learning, so as to achieve teacher-student co-evolution. The re-weighting method [32] determines the example weights by minimizing the loss on a clean, unbiased validation set.

Disentangled Feature Representations Disentangled feature representations efficiently encode high-dimensional features concerning the variation in single generative factors, isolating the variation of each sensitive factor in a few dimensions. The key idea behind disentangled representations is that real-world data are mostly generated by a few explanatory factors of variation, which can be recovered by unsupervised learning algorithms. Hence, disentangled representations that capture these explanatory factors are expected to help in generalizing systematically [8, 19]. Sampling based on disentangled representations is more efficient [13] and less sensitive to nuisance variables [33]. In terms of systematic generalization [1, 7], VASE [1] detects adaptive shifts of the data distribution based on the principle of minimum description length and allocates redundant disentangled representations to new knowledge. In other cases, however, it is not clear whether the experimental gains are actually due to disentanglement [20].
In abstract reasoning tasks, some works [36, 37] learn an unsupervised mapping from a high-dimensional feature space to a lower-dimensional and more structured latent space that is subsequently used by reasoning models to complete the reasoning task.

3 Feature Robust Abstract Reasoning

Our feature robust abstract reasoning algorithm is based on the student-teacher architecture illustrated in Figure 2. In this architecture, the teacher model adjusts the proportions of the training datasets and sends them to the student model. After these data are consumed, the student model returns its validation accuracy on the current batch, which is used as the reward for the teacher model to update itself and to take the next action. This process repeats until the two models converge.

3.1 Teacher Model

Since the rewards are generated by a non-differentiable function of the actions, we use reinforcement learning to optimize the teacher model in a black-box fashion.
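The interaction loop can be sketched as follows. This is a toy illustration only: the `Teacher` and `Student` classes and their update rules are hypothetical stand-ins, not the DDPG-based implementation described below:

```python
import random

C = 4      # number of question categories (illustrative)
BATCH = 8  # mini-batch size (illustrative)
STEPS = 5  # number of teacher-student interaction steps

class Teacher:
    """Toy teacher: keeps a preference per category and reinforces the
    current proportions by the received reward (stand-in for DDPG)."""
    def __init__(self, n_cat):
        self.pref = [1.0] * n_cat

    def action(self):
        total = sum(self.pref)
        return [p / total for p in self.pref]  # sampling proportions a_t

    def update(self, action, reward):
        # Crude update: nudge preferences toward rewarded proportions.
        self.pref = [p + reward * a for p, a in zip(self.pref, action)]

class Student:
    """Toy student whose validation accuracy grows with training volume."""
    def __init__(self):
        self.acc = 0.0

    def train(self, batch):
        self.acc = min(1.0, self.acc + 0.01 * len(batch))
        return self.acc  # reward = held-out validation accuracy

teacher, student = Teacher(C), Student()
for t in range(STEPS):
    a_t = teacher.action()
    # Draw BATCH category indices according to the proportions a_t.
    batch = random.choices(range(C), weights=a_t, k=BATCH)
    reward = student.train(batch)
    teacher.update(a_t, reward)
```

The essential structure matches the description above: the teacher emits category proportions, the student consumes a mini-batch sampled from them, and the student's validation accuracy flows back as the teacher's reward.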
Action We assume that each training sample is associated with a class label. In our dataset, this is taken to be the category of the abstract reasoning problem. The categories are logic combinations of basic types such as "shape", "type" or "position". One such example can be seen in Figure 1, where "position and" is labeled as the category of the problem. We divide the training data into C parts, D = (D1, D2, ..., DC), where each subset Dc denotes the part of the training data that belongs to category c and C is the number of categories in the dataset. The action at = <at,1, at,2, ..., at,C> is then defined as a vector of probabilities with which the categories will be presented in the training batch. Samples xi in the training batch are drawn from the dataset D according to the distribution at. B independent draws of xi form the mini-batch <x1, x2, ..., xB> that is sent to the student for training.

State The state of the teacher model tracks the progress of student learning through a collection of features.
Those features include:
1. Long-term features: a) the loss of each class over the last N time steps; b) the validation accuracy of each class over the last N time steps.
2. Near-term features: a) the mean predicted probability of each class; b) the loss of each class; c) the validation accuracy of each class; d) the average historical training loss; e) the batch number and the label category of each class; f) the action at the last time step; g) the time step.

Reward The reward rt measures the quality of the current action at. It is measured with the student model on a held-out validation set.

Implementation We use the deep deterministic policy gradient (DDPG) for continuous control of the question proportions at. As illustrated in Figure 2, the teacher agent receives a state st of the student model at each time step t and outputs a proportion of questions as the action at. The student model then uses the proportion at to generate the training data of the tth time step. We use the policy gradient to update the DDPG model in the teacher network.

3.2 Logic Embedding Network

We can choose any traditional machine learning algorithm as our student model. Here, we propose a novel Logic Embedding Network (LEN) with a relational reasoning module that is better suited to abstract reasoning questions, since it can explicitly enumerate a much larger space of logic reasoning. For N x N matrices in abstract reasoning tasks, the input of LEN consists of N^2 - 1 context panels and K multiple-choice panels, and we need to select the choice panel that best matches the context panels. In the LEN, the input images are first processed by a shallow CNN, and an MLP is adopted to obtain N^2 - 1 context embeddings and K multiple-choice embeddings. Then, we adopt the reasoning module to output scores for combinations of a given choice embedding and the N^2 - 1 context embeddings.
The output of the reasoning module is a score s_k for a given candidate multiple-choice panel with label k in [1, K]:

s_k = f_\Phi( \sum_{(x_{i1}, x_{i2}, ..., x_{iN}) \in \chi_{k1}} g_{\theta_1}(x_{i1}, x_{i2}, ..., x_{iN}, z) + \sum_{(x_{j1}, x_{j2}, ..., x_{jN}) \in \chi_{k2}} g_{\theta_2}(x_{j1}, x_{j2}, ..., x_{jN}, z) ),   (1)

where \chi_k is the set of all combinations of panels, \chi_{k1} is the set of row-wise and column-wise combinations of panels, and \chi_{k2} = \chi_k - \chi_{k1} represents the other combinations. c_k is the embedding of the kth choice panel, x_i is the embedding of the ith context panel, and z is a global representation of all 8 context panel embeddings. For example, in the case of 3 x 3 matrices (N = 3) with 8 multiple-choice panels, \chi_k = {(x_i, x_j, x_k) | x_i, x_j, x_k in S, S = {x_1, x_2, ..., x_8, c_k}, i != j, i != k, j != k}, \chi_{k1} = {(x_1, x_2, x_3), (x_4, x_5, x_6), (x_7, x_8, c_k), (x_1, x_4, x_7), (x_2, x_5, x_8), (x_3, x_6, c_k)} and \chi_{k2} = \chi_k - \chi_{k1}. f_\Phi, g_{\theta_1} and g_{\theta_2} are functions with parameters \Phi, \theta_1 and \theta_2, respectively. For our purposes, f_\Phi, g_{\theta_1} and g_{\theta_2} are MLP layers, and their parameters are learned end-to-end by differentiable training. Finally, the option with the highest score, under a softmax across all scores, is chosen as the answer.

Figure 3: The architecture of the Logic Embedding Network in the case of 3 x 3 abstract reasoning matrices with 8 multiple-choice panels. A CNN processes each context panel and each choice panel independently to produce 16 vector embeddings. We then pass all 8 context embeddings together with a choice embedding to a reasoning module, which enumerates the whole space (C(9,3) = 84) of logic reasoning triples and outputs a score for the associated answer choice panel.
There are 8 such reasoning modules in total, one for each answer choice (here we depict only 1 for clarity).

In abstract reasoning tasks, the goal is to infer the reasoning logic rules that exist among N panels. The structure of the LEN model is therefore well suited to abstract reasoning, since it adopts g_{\theta_1} and g_{\theta_2} to form representations of the relationships among N panels; in the case of 3 x 3 matrices, these are either two context panels plus a given multiple-choice candidate, or triples of context panels themselves. The function g_{\theta_1} extracts representations in row order and column order, such as an "and" relational type on the color of shapes, while g_{\theta_2} forms representations of reasoning logic rules regardless of order, such as the rule that all pictures contain a common "shape". The function f_\Phi integrates information about context-context relations and context-choice relations to provide a score for the answer. For each multiple-choice candidate, our proposed LEN model calculates a score, allowing the network to select the candidate with the highest score.

3.2.1 Two-stream Logic Embedding Network

During training, we have observed that "shape" and "line" features share few patterns in terms of logic reasoning. As a result, we construct a two-stream version of the Logic Embedding Network that processes these two types of features with separate parameters. The two streams are combined at a fusion layer before the predictions are generated.

4 Datasets

4.1 Procedurally Generated Matrices dataset (PGM)

The PGM dataset [34] consists of 8 different subdatasets, where each subdataset contains 119,552,000 images and 1,222,000 questions. We compare all models only on the neutral train/test split, which corresponds most closely to traditional supervised learning regimes.
There are 2 object types (Shape and Line), 5 rules (Progression, XOR, OR, AND, and Consistent Union) and 5 attributes (Size, Type, Colour, Position, and Number), which yields 50 rule-attribute combinations. However, after excluding some conflicting and counterintuitive combinations (e.g., Progression on Position), 29 combinations remain.

4.2 Relational and Analogical Visual rEasoNing dataset (RAVEN)

The RAVEN dataset [39] consists of 1,120,000 images and 70,000 RPM questions, equally distributed over 7 distinct figure configurations: Center, 2 x 2 Grid, 3 x 3 Grid, Left-Right, Up-Down, Out-InCenter, and Out-InGrid. There is 1 object type (Shape), 4 rules (Constant, Progression, Arithmetic, and Distribute Three) and 5 attributes (Type, Size, Color, Number, and Position), which yields 20 rule-attribute combinations. However, after excluding a conflicting combination (i.e., Arithmetic on Type), 19 combinations remain.

5 Experiments

5.1 Performance on the PGM Dataset

Baseline Models We compare against a comprehensive list of baseline models. From Table 2, we can see that CNN models fail almost completely at PGM reasoning tasks; these include LSTM, CNN+MLP, ResNet-50, and W-ResNet. The WReN model proposed by Barrett et al. [34] is also compared.
Steenbrugge et al. [36] explore the generalization characteristics of disentangled representations by leveraging a VAE module on abstract reasoning tasks, which boosts performance slightly. Our proposed Logic Embedding Network (LEN) and its two-stream variant (T-LEN) achieve much better performance than the baseline algorithms.

Teacher Model Baselines We compare several baselines to our proposed teacher model, adapting them to our LEN model. These baseline teacher-model algorithms include curriculum learning, self-paced learning, learning to teach, hard example mining, focal loss, and MentorNet-PD. The results show that these methods are not effective on the abstract reasoning task.

Table 2: Test performance of all models trained on the neutral split of the PGM dataset. "Teacher Model" denotes using the teacher model to determine the appropriate training trajectory. "Type loss" denotes adding the category label of questions into the loss function.

Model                                   Acc(%)
LSTM [34]                                33.0
CNN+MLP [34]                             35.8
ResNet-50 [34]                           42.0
W-ResNet-50 [34]                         48.0
WReN [34]                                62.8
VAE-WReN [36]                            64.2
LEN                                      68.1
T-LEN                                    70.3
LEN + Curriculum learning [2]            63.3
LEN + Self-paced learning [21]           57.2
LEN + Learning to teach [10]             64.3
LEN + Hard example mining [28]           60.7
LEN + Focal loss [23]                    66.2
LEN + MentorNet-PD [18]                  67.7
WReN + type loss [34]                    75.6
LEN + type loss                          82.3
T-LEN + type loss                        84.1
WReN + Teacher Model [34]                68.9
LEN + Teacher Model                      79.8
T-LEN + Teacher Model                    85.1
WReN + Teacher Model + type loss [34]    77.8
LEN + Teacher Model + type loss          85.8
T-LEN + Teacher Model + type loss        88.9

Use of Type Loss We have experimented with adding additional category labels into the loss function when training WReN, LEN, and T-LEN.
The improvements are consistent with what has been reported in Barrett's paper [34].

Teacher Models Finally, we show that our LEN and T-LEN augmented with a teacher model achieve testing accuracies of 79.8% and 85.1% respectively on the whole neutral split of the PGM dataset. This strongly indicates that models lacking effective guidance of the training trajectory may even be completely incapable of solving tasks that require very simple abstract reasoning rules. Training these models with an appropriate trajectory is sufficient to mitigate the impacts of distracting features and overcome this hurdle. Further experiments with an added type loss show that the teacher model can also be improved, with best performances for LEN (from 79.8% to 85.8%) and T-LEN (from 85.1% to 88.9%). WReN with the teacher network also reports improvements but remains consistently below LEN and T-LEN.

5.2 Performance on RAVEN Dataset

We compare all models on the 7 distinct figure configurations of the RAVEN dataset, and Table 3 shows the testing accuracy of each model trained on the dataset. In terms of model performance, popular models perform poorly (i.e., LSTM, WReN, CNN+MLP, and ResNet-18). These models lack the ability to disentangle abstract reasoning and cannot distinguish distracting features from reasoning features. The best performance goes to our LEN containing the reasoning module, which is designed to explicitly enumerate a much larger space of logical reasoning about the triple rules in the question. Similar to the previous dataset, we have also implemented the type loss. However, contrary to the first dataset, the type loss performs a bit worse in this case. This finding is consistent with what has been reported in [39].
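As a quick sanity check on Table 3, the overall Acc column should equal the mean of the seven per-configuration accuracies; a minimal check for the LEN row, using the values reported in Table 3:

```python
# Per-configuration accuracies for the LEN row of Table 3.
len_row = {
    "Center": 80.2, "2Grid": 57.5, "3Grid": 62.1,
    "L-R": 73.5, "U-D": 81.2, "O-IC": 84.4, "O-IG": 71.5,
}

# Mean over the seven figure configurations, rounded to one decimal.
mean_acc = round(sum(len_row.values()) / len(len_row), 1)
print(mean_acc)  # 72.9, matching the reported Acc for LEN
```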
We observe a consistent performance improvement of our LEN model after incorporating the teacher model, suggesting the effectiveness of an appropriate training trajectory for this visual reasoning problem. Other teaching strategies have little effect on model performance. Table 3 shows that our LEN and LEN with teacher model achieve state-of-the-art performance on the RAVEN dataset at 72.9% and 78.3%, exceeding the best model existing when the dataset was published by 13.3% and 18.7% respectively.

Table 3: Test performance of each model trained on different figure configurations of the RAVEN dataset. Acc denotes the mean accuracy of each model, while the other columns show model accuracy on different figure configurations. 2Grid denotes 2×2 Grid, 3Grid denotes 3×3 Grid, L-R denotes Left-Right, U-D denotes Up-Down, O-IC denotes Out-InCenter, and O-IG denotes Out-InGrid.

Model                            Acc   Center  2Grid  3Grid  L-R   U-D   O-IC  O-IG
LSTM [39]                        13.1  13.2    14.1   13.7   12.8  12.5  12.5  12.9
WReN [34]                        14.7  13.1    28.6   28.3   7.5   6.3   8.4   10.6
CNN + MLP [39]                   37.0  33.6    30.3   33.5   39.4  41.3  43.2  37.5
ResNet-18 [39]                   53.4  52.8    41.9   44.2   58.8  60.2  63.2  53.1
LEN + type loss                  59.4  71.1    45.9   40.1   63.9  62.7  67.3  65.2
LEN                              72.9  80.2    57.5   62.1   73.5  81.2  84.4  71.5
ResNet-18 + DRT [39]             59.6  58.1    46.5   50.4   65.8  67.1  69.1  60.1
LEN + Self-paced learning [21]   65.0  70.0    50.0   55.2   64.5  73.9  77.8  63.8
LEN + Learning to teach [10]     71.8  78.1    56.5   60.3   73.4  78.8  82.9  72.3
LEN + Hard example mining [28]   72.4  77.8    56.2   62.9   75.6  77.5  84.2  72.7
LEN + Focal loss [23]            75.6  80.4    55.5   63.8   85.2  83.0  86.4  75.3
LEN + Mentornet-PD [18]          74.4  80.2    56.1   62.8   81.4  80.6  85.5  74.5
LEN + Teacher Model              78.3  82.3    58.5   64.3   87.0  85.5  88.9  81.9

5.3 Teaching Trajectory Analysis

We set up two groups of experiments to examine the training trajectory generated
by the teacher model. In this setting, following the rules of [34], we generate 4 subdatasets (D1, D2, D3, D4), each of which exhibits an "and" relation instantiated on attributes of "shape". In D1, the "and" relation is instantiated on the "type" of "shape" as the reasoning attribute, with no distracting attribute. In D2, the reasoning attribute is the "size" of "shape", again with no distracting attribute. D3 is similar to D1, but "size" is set to a random value as a distracting attribute. D4 is similar to D2, but "type" is set to a random value as a distracting attribute. In summary, no distracting attributes exist in D1 and D2; for D3 and D4, "size" and "type" are the distracting attributes respectively. We conduct experiments as follows. As shown in Table 1, in D1 and D2 the accuracy of joint training is higher than that of individual training. Without distracting attributes, D1 and D2 promote each other in encoding the reasoning attributes, thus improving the accuracy of the model. Adjusting the training trajectory on the datasets without distracting attributes provides only a small performance increase. This demonstrates that a model free from the influence of distracting attributes is able to encode all the attributes into satisfactory embeddings and perform abstract reasoning. However, joint training on datasets D3 and D4, which contain distracting attributes, does not yield mutual improvement.
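The four subdataset settings can be summarized programmatically; below is a minimal sketch, where `make_subdataset` and its field names are illustrative assumptions, not the actual generator from [34]:

```python
import random

def make_subdataset(reasoning_attr, distracting_attr=None, seed=0):
    """Build a spec for one subdataset: an 'and' relation is instantiated
    on the reasoning attribute of 'shape'; if a distracting attribute is
    given, it is filled with a random (rule-free) value."""
    rng = random.Random(seed)
    spec = {
        "relation": "and",
        "object": "shape",
        "reasoning_attr": reasoning_attr,
        "distracting_attr": distracting_attr,
    }
    if distracting_attr is not None:
        # The distractor carries no rule information: a random value only.
        spec["distractor_value"] = rng.choice(range(10))
    return spec

D1 = make_subdataset("type")                           # no distractor
D2 = make_subdataset("size")                           # no distractor
D3 = make_subdataset("type", distracting_attr="size")  # "size" distracts
D4 = make_subdataset("size", distracting_attr="type")  # "type" distracts
```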
Experiments in Table 1 show that training with an appropriate trajectory can effectively guide the model to encode satisfactory attributes and improve performance. Moreover, our proposed model is able to find a more proper training trajectory and achieves an obvious improvement.

5.4 Embedding Space Visualizations

To understand the model's capacity to distinguish distracting representations from reasoning representations, we analyzed neural activity in models trained with our logic embedding network. We generated 8 types of questions covering 4 attributes: "position", "color", "type", and "size", as shown in Figure 4. The teacher model appears to encourage the student to distinguish distracting features and reasoning features more explicitly, which could in turn explain its capacity to disentangle abstract reasoning. We find that these activities cluster better with the guidance of the teacher model than without it. This demonstrates that the adaptive path from the teacher model can help the model characterize the reasoning features and distracting features, which is beneficial for abstract reasoning.

Figure 4: t-SNE analysis of the last layer's embedding of the logic embedding model. Each dot represents an (8-dimensional) state coloured according to the number of reasoning features and distracting features of the corresponding question.

6 Conclusions

In this paper we proposed a student-teacher architecture to deal with distracting features in abstract reasoning, termed feature robust abstract reasoning (FRAR). FRAR performs abstract reasoning by characterizing reasoning features and distracting features with the guidance of an adaptive logic path. A novel Logic Embedding Network (LEN) as the student model is also proposed to perform abstract reasoning by explicitly enumerating a much larger space of logic reasoning.
Additionally, a teacher model is proposed to determine the appropriate proportion of teaching materials as the adaptive logic path. The teacher model optimizes its teaching strategies based on feedback from the student model by means of reinforcement learning. Extensive experiments on the PGM and RAVEN datasets have demonstrated that the proposed FRAR outperforms the state-of-the-art methods.

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant 2017YFB1300201, the National Natural Science Foundation of China (NSFC) under Grants 61622211 and 61620106009 as well as the Fundamental Research Funds for the Central Universities under Grant WK2100100030.
References

[1] Alessandro Achille, Tom Eccles, Loic Matthey, Chris Burgess, Nicholas Watters, Alexander Lerchner, and Irina Higgins. Life-long disentangled representation learning with cross-domain latent homologies. In Advances in Neural Information Processing Systems, pages 9873–9883, 2018.

[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning, pages 41–48, 2009.

[3] Selmer Bringsjord and Bettina Schimanski. What is artificial intelligence? Psychometric AI as an answer. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 887–893, 2003.

[4] Patricia A Carpenter, Marcel A Just, and Peter Shell. What one intelligence test measures: a theoretical account of the processing in the Raven Progressive Matrices test. Psychological Review, 97(3):404, 1990.

[5] Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples.
In Advances in Neural Information Processing Systems, pages 1002–1012, 2017.

[6] Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. Fidelity-weighted learning. 2018.

[7] Cian Eastwood. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

[8] Babak Esmaeili, Wu Hao, Sarthak Jain, Siddharth Narayanaswamy, and Jan Willem Van De Meent. Hierarchical disentangled representations. 2019.

[9] Yanbo Fan, Ran He, Jian Liang, and Baogang Hu. Self-paced learning: an implicit regularization perspective. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[10] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In International Conference on Learning Representations, 2018.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] José Hernández-Orallo, Fernando Martínez-Plumed, Ute Schmid, Michael Siebers, and David L Dowe. Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence, 230:74–107, 2016.

[13] Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. Scan: Learning abstract hierarchical compositional visual concepts. 2018.

[14] Felix Hill, Adam Santoro, David Barrett, Ari Morcos, and Timothy Lillicrap. Learning to make analogies by contrasting abstract relational structure.
In International Conference on Learning Representations, 2019.

[15] Dokhyam Hoshen and Michael Werman. IQ of neural networks. arXiv preprint arXiv:1710.01692, 2017.

[16] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.

[17] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[18] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pages 2309–2318, 2018.

[19] Klaus Greff, Raphaël Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. 2018.

[20] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. 2018.

[21] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

[22] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[24] Daniel R Little, Stephan Lewandowsky, and Thomas L Griffiths.
A Bayesian model of rule induction in Raven's Progressive Matrices. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 34, 2012.

[25] Andrew Lovett and Kenneth Forbus. Modeling visual problem solving as analogical reasoning. Psychological Review, 124(1):60, 2017.

[26] Andrew Lovett, Kenneth Forbus, and Jeffrey Usher. A structure-mapping model of Raven's Progressive Matrices. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 32, 2010.

[27] Andrew Lovett, Emmett Tomai, Kenneth Forbus, and Jeffrey Usher. Solving geometric analogy problems through two-stage analogical mapping. Cognitive Science, 33(7):1192–1231, 2009.

[28] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A Efros. Ensemble of exemplar-svms for object detection and beyond. In Proceedings of the IEEE International Conference on Computer Vision, pages 89–96, 2011.

[29] Keith McGreggor and Ashok Goel. Confident reasoning on Raven's Progressive Matrices tests. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[30] Can Serif Mekik, Ron Sun, and David Yun Dai. Similarity-based reasoning, Raven's matrices, and general intelligence. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 1576–1582, 2018.

[31] John C Raven and John Hugh Court. Raven's progressive matrices and vocabulary scales. Oxford Psychologists Press, 1998.

[32] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4331–4340, 2018.

[33] Romain Lopez, Jeffrey Regier, Michael I Jordan, and Nir Yosef. Information constraints on auto-encoding variational bayes. 2018.

[34] Adam Santoro, Felix Hill, David Barrett, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks.
In International Conference on Machine Learning, pages 4477–4486, 2018.

[35] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.

[36] Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. arXiv preprint arXiv:1811.04784, 2018.

[37] Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? CoRR, abs/1905.12506, 2019.

[38] Ke Wang and Zhendong Su. Automatic generation of Raven's Progressive Matrices. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 903–909, 2015.

[39] Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.