{"title": "Hybrid Knowledge Routed Modules for Large-scale Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1552, "page_last": 1563, "abstract": "Abstract The dominant object detection approaches treat the recognition of each region separately and overlook crucial semantic correlations between objects in one scene. This paradigm leads to substantial performance drop when facing heavy long-tail problems, where very few samples are available for rare classes and plenty of confusing categories exists. We exploit diverse human commonsense knowledge for reasoning over large-scale object categories and reaching semantic coherency within one image. Particularly, we present Hybrid Knowledge Routed Modules (HKRM) that incorporates the reasoning routed by two kinds of knowledge forms: an explicit knowledge module for structured constraints that are summarized with linguistic knowledge (e.g. shared attributes, relationships) about concepts; and an implicit knowledge module that depicts some implicit constraints (e.g. common spatial layouts). By functioning over a region-to-region graph, both modules can be individualized and adapted to coordinate with visual patterns in each image, guided by specific knowledge forms. HKRM are light-weight, general-purpose and extensible by easily incorporating multiple knowledge to endow any detection networks the ability of global semantic reasoning. 
Experiments on large-scale object detection benchmarks show HKRM obtains around 34.5% improvement on VisualGenome (1000 categories) and 30.4% on ADE in terms of mAP.", "full_text": "Hybrid Knowledge Routed Modules for Large-scale\n\nObject Detection\n\nChenhan Jiang\u2217\n\nSun Yat-Sen University\njchcyan@gmail.com\n\nHang Xu\u2217\n\nHuawei Noah\u2019s Ark Lab\n\nxbjxh@live.com\n\nXiaodan Liang\u2020\n\nSchool of Intelligent Systems Engineering\n\nSun Yat-Sen University\nxdliang328@gmail.com\n\nLiang Lin\n\nSun Yat-Sen University\nlinliang@ieee.org\n\nAbstract\n\nThe dominant object detection approaches treat the recognition of each region\nseparately and overlook crucial semantic correlations between objects in one scene.\nThis paradigm leads to substantial performance drop when facing heavy long-tail\nproblems, where very few samples are available for rare classes and plenty of\nconfusing categories exists. We exploit diverse human commonsense knowledge\nfor reasoning over large-scale object categories and reaching semantic coherency\nwithin one image. Particularly, we present Hybrid Knowledge Routed Modules\n(HKRM) that incorporates the reasoning routed by two kinds of knowledge forms:\nan explicit knowledge module for structured constraints that are summarized with\nlinguistic knowledge (e.g. shared attributes, relationships) about concepts; and an\nimplicit knowledge module that depicts some implicit constraints (e.g. common\nspatial layouts). By functioning over a region-to-region graph, both modules can\nbe individualized and adapted to coordinate with visual patterns in each image,\nguided by speci\ufb01c knowledge forms. HKRM are light-weight, general-purpose\nand extensible by easily incorporating multiple knowledge to endow any detection\nnetworks the ability of global semantic reasoning. 
Experiments on large-scale\nobject detection benchmarks show HKRM obtains around 34.5% improvement on\nVisualGenome (1000 categories) and 30.4% on ADE in terms of mAP. Codes and\ntrained model can be found in https://github.com/chanyn/HKRM.\n\n1\n\nIntroduction\n\nThe most state-of-the-art object detection methods [16, 43, 8, 4] follow the region-based paradigm,\nwhich treats the classi\ufb01cation and boundingbox regression of each region proposal separately. The\ndetection performance purely relies on the discriminative capabilities of region features, which\noften depends on suf\ufb01cient training data for each category. Such paradigm thus obtains substantial\nperformance drop when dealing with large-scale detection task [49, 18] that recognizes and localizes\na large number of categories (e.g. 3000 classes in VG [23]). The long-tail problem is very common,\nwhere very few samples exist for rare classes, such as pepperoni and bagel. On the other hand,\ndetection challenges such as heavy occlusion, class ambiguities and tiny-size objects become more\nsevere due to more categories within one image. However, humans can still identity objects precisely\nunder complex circumstances because of the remarkable reasoning ability resorting to commonsense\n\n\u2217Both authors contributed equally to this work.\n\u2020Corresponding Author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: An example of how different types of commonsense knowledge can facilitate large-scale\nobject detection, especially for rare classes (e.g. the obscured mandarin). We illustrate three useful\nknowledge forms: attribute knowledge, relationship knowledge and spatial knowledge.\n\nknowledge. 
This inspires us to explore how to incorporate diverse knowledge forms into current\ndetection paradigm in a light-weight and effective way, in order to mimic human reasoning procedure.\nWhen humans watch a scene [3], each object is not identi\ufb01ed individually. Different knowledge\nobtained by a human commonsense can help to make a correct identi\ufb01cation by considering global\nsemantic coherency. An example of hybrid knowledge reasoning in Figure 1 would be to identify\nthe obscured \u201cmandarin\u201d (bottom-right). Human can recognize it is a mandarin learned from hybrid\ncommonsense: a) this round object is orange and just like the other nearby mandarins (shared\nattribute knowledge); b) this object is in the bowl (pairwise relationship knowledge); c) this object\nhas moderate size and its position is near to other fruits (spatial layout).\nRecently, some works incorporate knowledge via direct relation modeling [34, 9, 19] or iterative\nreasoning architecture [33, 5, 6]. Different from recent implicit relation networks [19, 52] that learned\ninter-region relationships in an implicit and uncontrollable way, recently an iterative reasoning\n[6] was proposed to combine both local and global reasoning. However, they take only region\npredictions of a basic detection network, rather than enhancing intermediate feature representations.\nFurthermore, they directly use statistic edge connections in a prior knowledge graph while ignoring\nthe compatibility of prior knowledge with visual evidence in each image. Given diverse object\nappearances and correlations in each image, personalized edge connections with respect to each\nknowledge form should be adaptive for different regions. 
On the contrary, our work aims to develop\nin-place knowledge modules which can not only explicitly incorporate any kinds of commonsense\nknowledge (both explicit or implicit) for better semantic reasoning but also link external knowledge\nwith visual observations in each image in an adaptive way.\nIn this paper, we propose Hybrid Knowledge Routed Modules (HKRM) to incorporate multiple\nsemantic reasoning routed by two major kinds of knowledge forms: an explicit knowledge module\nthat exploits structure constraints that are summarized with linguistic knowledge (e.g. shared\nattributes, co-occurrence and relationships), and an implicit knowledge module to encode some\nimplicit commonsense constraints over object (e.g. common spatial layouts). Instead of building\ncategory-to-category graph [26, 38, 22, 33, 7], each knowledge module in HKRM learns adaptive\ncontext connections for each pair of regions by regarding a speci\ufb01c prior knowledge graph as external\nsupervisions, rather than \ufb01xing the connections. Our HKRM is general-purposed and extensible by\neasily integrating several individualized knowledge modules instantiated with any chosen knowledge\nforms to pursue more advanced and hybrid semantic reasoning. As a showcase, we experiment\nwith three kinds of knowledge forms in this paper: the attribute knowledge (e.g. color, status),\npairwise relationship knowledge such as co-occurrence and object-verb-subject relationship, the\nspatial knowledge including layout, size and overlap. HKRM is light-weight and easily plugged into\nany detection network for endowing its ability in global reasoning.\nOur HKRM thus enables sharing visual features among certain regions with similar attributes,\npairwise relationship or spatial relationship. 
The recognition and localization of dif\ufb01cult regions\nwith heavy occlusions, class ambiguities and tiny-size problems can be thus remedied by discovering\nadaptive contexts from other regions guided by external knowledge. Another merit of HKRM lies in\nthe ability of distilling common characteristics among common/uncommon categories so that the\nproblem of crucial imbalanced categories can be alleviated.\nThe proposed HKRM outperforms the state-of-the-art Faster RCNN [43] with a large margin on\ntwo large-scale object detection benchmarks, that is, ADE [56] with 445 object classes and VG [23]\nwith 1000 or 3000 classes. Particularly, our HKRM achieves around 34.5% of mAP improvement\n\nFigure 2: Overview of our HKRM, including two kinds of general modules: an explicit knowledge\nmodule to incorporate external knowledge and an implicit knowledge module to learn knowledge\nwithout explicit de\ufb01nitions or being summarized by human, such as spatial layouts. An adaptive\nregion-to-region knowledge graph is constructed by regarding each speci\ufb01ed external knowledge as\nthe supervision of edge connections. The features of each region node are then enhanced through\nintegrating several individual knowledge modules instantiated with distinct knowledge forms. 
The\nevolved features after each module are combined to produce \ufb01nal object detection results.\n\non VG (1000 categories), 26.5% on VG (3000 categories) and 30.4% on ADE. More interestingly,\nfurther analysis shows our HKRM module can provide meaningful explanations about how different\ncommonsense knowledge can help perform reasonable visual reasoning and what each module\nactually learn with the guidance of external knowledge.\n\n2 Related Work\n\nObject Detection. Big progress has been made recent years on object detection due to the use of\nCNN such as Faster RCNN [43], R-FCN [8], SSD [30] and YOLO [41]. The backbones are some\nfeature extractors such as VGG 16 [47] and Resnet 101 [17]. However, the number of categories\nbeing considered usually is small: 20 for PASCAL VOC [10] and 80 for COCO [29]. However, those\nmethods are usually performed on each proposal individually without reasoning.\nVisual Reasoning. Visual reasoning seeks to incorporate different information or interplay between\nobjects or scenes. Several aspects such as shared attributes [11, 24, 39, 1, 2, 36], relationships among\nobjects can be considered. [13, 32, 42] relies on \ufb01nding similarity as the attributes in the linguistic\nspace. For incorporating information such as relationship, most early works use object relations as a\npost-processing step [50, 14, 12, 37]. Recent works consider a graph structure [26, 38, 22, 33, 7, 6].\nOn the other hand, there are some sequential reasoning model for relationships [5, 25, 6]. In these\nworks, a \ufb01xed graph is usually considered, while our module\u2019s graph has adaptive region-to-region\nedges which can be embedded with any kinds of external knowledge.\nFew-shot Recognition. Few-shot recognition seeks to learn a new concept with a few annotated\nexamples which share the similar problem with us. Early work focus on learning attributes embedding\nto represent categories [1, 21, 24, 44]. 
Most recent works use knowledge graphs such as WordNet [35]\nto distill information among categories [46, 9, 54, 33, 53]. [15] further defined a GNN architecture to\nlearn a knowledge graph implicitly. In contrast, our module is explicitly routed and benefits from the\nguidance of hybrid knowledge forms.\n\nFigure 3: Explicit Knowledge Module. Taking the pairwise L1 differences of the f as inputs, a\nregion-to-region graph is generated by stacked MLP. This process is supervised by the ground truth of\nthe external knowledge. The output evolved feature f\u2032 is the enhanced feature via graph propagation.\nThen f\u2032 is concatenated to the f to produce final detection results.\n\n3 The Proposed Approach\n\n3.1 Overview\n\nThe goal of this paper is to develop general modules for incorporating knowledge to facilitate large-scale\nobject detection with global reasoning. Our HKRM includes two kinds of modules to support\nany prior knowledge forms, shown in Figure 2: an explicit knowledge module to incorporate external\nknowledge and an implicit knowledge module to learn knowledge without explicit definitions or\nbeing summarized by the human. Taking an image as the input, visual features are extracted for each\nproposal region through the region proposal network. Based on the region features, each module\nbuilds an adaptive region-to-region undirected graph \u02c6G = < N, \u02c6E >, where N are region\nproposal nodes and each edge ei,j \u2208 \u02c6E defines a kind of knowledge between two nodes. Each module\nthen outputs enhanced features integrating a particular knowledge. 
Finally, outputs from several\nmodules are concatenated together and fed into the boundingbox regression layer and classi\ufb01cation\nlayer to obtain \ufb01nal detection results.\n\n3.2 Explicit Knowledge Module\n\nWe regard the human commonsense knowledge that can be clearly de\ufb01ned and summarized using\nlinguistics as explicit knowledge. The most representative explicit knowledge forms can be attribute\nknowledge (e.g. \u201capple is red.\u201d) and pairwise relationship knowledge (e.g. \u201cman rides bicycles\u201d). Our\nexplicit knowledge module aims to enhance region features with kinds of explicit knowledge forms.\nSpeci\ufb01cally, as shown in Figure 3, this module updates edge connections between each pair of region\ngraph nodes in \u02c6G, supervised by a mapping of the ground truth from a class-to-class knowledge\ngraph Q. This Q is a certain form of linguistic knowledge.\n\n3.2.1 Module De\ufb01nition\nAdaptive region-to-region graph. We \ufb01rst de\ufb01ne a region-to-region graph \u02c6G for all Nr = |N|\nregion proposals with visual features f = {fi}Nr\ni=1, fi \u2208 RD of D dimension extracted from the\nbackbone network, where N are region proposal nodes and ei,j \u2208 \u02c6E is the learned graph edge for\neach pair of region nodes. Given any external knowledge form, distinct edge connections \u02c6E can\nbe accordingly updated to characterize speci\ufb01c context information for each region proposal in the\ncontext of speci\ufb01c knowledge. Formally, given a speci\ufb01c knowledge graph Q, each edge between\ntwo regions \u02c6eij is learned by a stacked Multi-layer Perceptron (MLP) :\n\n\u02c6eij = MLPQ(\u03b1(fi, fj)),\n\n(1)\nwhere \u03b1(\u00b7) is chosen to be the pairwise L1 difference between features of each region pair (fi, fj)\nsince L1 difference is symmetric. 
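As a minimal sketch of Eq. (1), the following pure-Python snippet builds a region-to-region edge matrix from pairwise L1 differences. The layer sizes, the random initialisation and the helper names (l1_difference, make_mlp, edge_weight, region_graph) are illustrative assumptions and are not taken from the released code.

```python
import random

def l1_difference(fi, fj):
    # alpha(fi, fj): element-wise absolute difference, symmetric in (fi, fj)
    return [abs(a - b) for a, b in zip(fi, fj)]

def linear(x, weights, bias):
    # One fully connected layer; weights holds one row per output unit
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_mlp(sizes, rng):
    # Hypothetical random initialisation; sizes are illustrative only
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def edge_weight(fi, fj, mlp):
    # e_ij = MLP_Q(alpha(fi, fj)); ReLU between layers, none after the last one
    x = l1_difference(fi, fj)
    for k, (weights, bias) in enumerate(mlp):
        x = linear(x, weights, bias)
        if k < len(mlp) - 1:
            x = relu(x)
    return x[0]

def region_graph(features, mlp):
    # Dense edge matrix over all region pairs
    n = len(features)
    return [[edge_weight(features[i], features[j], mlp) for j in range(n)] for i in range(n)]

rng = random.Random(0)
features = [[rng.random() for _ in range(8)] for _ in range(4)]  # 4 regions, D = 8
mlp = make_mlp([8, 6, 4, 1], rng)
graph = region_graph(features, mlp)
```

Because alpha is symmetric, graph[i][j] equals graph[j][i] in this sketch, which is consistent with the undirected graph described above.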
Given different prior graphs Q, MLPQ would be parametrized with\nWQ distinctly to generate different region-to-region graphs \u02c6G, leading to personalized knowledge\nreasoning.\nWe learn MLPQ by directly enforcing the predicted \u02c6eij to be consistent with the edge weights of\na prior graph Q. We define Q = < C, V > as a class-to-class graph with C class graph nodes and\ntheir prior edge weights vi,j \u2208 V, such as attribute and relationship graphs. During training, as we\nknow ground-truth categories of each region, the edge \u02c6eij of two region nodes is learned towards\nthe edge weights \u02dceij of ground truth categories of region nodes in Q, that is, \u02dceij = v_{ci,cj} where ci is\nthe ground truth class of i-th region. Such explicit supervision with ground truth classes of region\nnodes would ensure the learning of a reliable graph reasoning regardless of the errors from proposal\nlocalization. MLPQ is then learned to encode explicit region-wise knowledge correlations that can\nbe applied in the testing phase. The loss function of learned edge weights {\u02c6eij} for all Nr region\nproposals is defined as:\n\nL(f, WQ, Q) = \u2211_{i=1}^{Nr} \u2211_{j=1}^{Nr} (1/2) (\u02c6eij \u2212 \u02dceij)^2.    (2)\n\nFeature evolving via graph reasoning. 
After performing row normalization over learned edges\n\u02c6E = {\u02c6eij}, we can propagate features of connected regions into enhancing each region features f\u2032 by\ndifferent weighted edges, which can be solved by matrix multiplication:\n\nf\u2032 = \u02c6E f W^e,    (3)\n\nwhere W^e \u2208 R^{D\u00d7E} is a transformation weight matrix and f\u2032 \u2208 R^E are the enhanced features with\nE dimension via graph reasoning. Those regions with heavy occlusions, class ambiguities and the\ntiny-size problem can be remedied by discovering adaptive contexts from other regions guided by\nexternal knowledge. The trainable parameters are WQ of the stacked MLP and W^e.\n\n3.2.2 Module Specification with Different Knowledge\n\nWe can specify different prior knowledge graphs Q to obtain distinct graph reasoning behaviors.\nHere, we take attribute knowledge graph and relationship knowledge graph as the examples. We refer\nreaders to find illustrations of constructing knowledge graphs in Supplementary material.\nAttribute Knowledge. Attribute knowledge graph QA as one kind of Q denotes object classes\nare connected with kinds of attributes such as colors, size, materials, and status. The explicit\nknowledge module instantiated with attribute knowledge will facilitate features of rare classes with\nmore frequent classes by transferring their shared visual attribute properties. Let us consider C\nclasses and K attributes, we obtain a C \u00d7 K frequency distribution table for each class-attribute pair,\ndetailed in experiments. Then the pairwise Jensen\u2013Shannon (JS) divergence between probability\ndistributions Pci and Pcj of two classes ci and cj can be measured as the edge weights of two classes\ne^A_{ci,cj} = JS(Pci||Pcj). We consider JS divergence to measure the similarity instead of KL divergence\nhere since we expect a symmetric undirected graph while KL(Pi||Pj) \u2260 KL(Pj||Pi). 
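To illustrate why JS divergence yields a symmetric class-to-class graph while KL divergence does not, here is a small sketch in pure Python. The class names and attribute counts are made up for illustration, and the snippet stops at the raw divergence, since any further processing of the edge weights is not described at this point.

```python
import math

def normalize(counts):
    # Turn one row of the class-attribute count table into a distribution
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    # KL(p || q) in nats; terms with p_k = 0 contribute zero
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def js(p, q):
    # JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), m being the mixture
    m = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical counts over K = 4 attributes for three classes
counts = {
    'banana': [40, 5, 1, 30],
    'orange': [4, 45, 38, 2],
    'mandarin': [3, 20, 18, 1],
}
dists = {name: normalize(row) for name, row in counts.items()}
classes = list(dists)
edges = {(a, b): js(dists[a], dists[b]) for a in classes for b in classes}
```

In this toy table, the divergence between orange and mandarin is much smaller than between orange and banana, reflecting their more similar attribute distributions.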
Finally, the\nmodule outputs an enhanced feature f\u2032_a \u2208 R^{Ea}.\nRelationship Knowledge. Relationship knowledge QR denotes the pairwise relationship between\nclasses, such as location relationship (e.g. along, on), the \u201csubject-verb-object\u201d relationship (e.g.\neat, wear) or co-occurrence relationship. The evolved features will be enhanced with high-level\nsemantic correlations between regions. Similarly, we obtain QR by calculating frequent statistics\neither from the semantic information or simply from the occurrence among all class pairs. The\nsymmetric transformation and row normalization are performed on edge weights. Let f\u2032_r \u2208 R^{Er}\ndenote the output of the explicit relationship module.\n\n3.3 Implicit Knowledge Module\n\nConsidering some commonsense knowledge without explicit definitions or being summarized by the\nhuman, we regard them as implicit knowledge and thus an implicit knowledge module is designed.\nTaking geometry priors as an example, besides those explicit pairwise locations, there also exists\nsome complicated location information, such as \u201cthe ceiling is always above all the other objects\u201d and\n\u201cthe water is always below the ships, mountains and the sky\u201d. Taking features q = {qi} as inputs that\ndepict the features of each region (e.g. geometric features), our implicit knowledge module integrates\nmultiple graph reasoning over M region-to-region graphs obtained by M stacked MLPs following (1)\nto encode these implicit priors. The analogous idea of multi-head attention can be found in [6, 19, 51].\nThis enables the module to catch multiple spatial relationships such as \u201cup and down\u201d, \u201cleft and right\u201d\nand \u201ccorner and center\u201d. Visualization of different learned graphs can be found in Supplementary\nmaterial. 
Similar to region-to-region graph used in explicit knowledge module, we learn specific\nedge weights {\u02c6e^(m)_ij} of each graph \u02c6Gm, m = 1, . . . , M for all region proposal pairs, following Eqn.\n1. We then average edge weights of all graphs {\u02c6Gm} and add them with an identity matrix I to obtain\nthe edge connections \u02c6e^I_ij \u2208 \u02c6E^I:\n\n\u02c6e^I_ij = (1/M) \u2211_{m=1}^{M} \u02c6e^(m)_ij + I.    (4)\n\nWe then adopt matrix multiplication g\u2032 = \u02c6E^I f W^g to get the evolved features g\u2032 \u2208 R^{Eg}. The\ntrainable parameters are weights of M stacked MLP for learning edge weights of knowledge graphs\n{\u02c6Gm}, and the transformation matrix W^g \u2208 R^{D\u00d7Eg} is shared for all graphs.\nModule specification with spatial layout. Here, we instantiate the implicit knowledge module by\nspatial layout inputs to capture complicated spatial constraints by using specific input information.\nThe input geometry feature qi of each region is simply object bounding box. To make qi be invariant\nto the scale transformation, a relative geometry feature is used, as (xi/\u00afw, yi/\u00afh, wi/\u00afw, hi/\u00afh, pi), where \u00afw and \u00afh\ndenote the size of the image and pi is the initial foreground probability of each region. Note that\nedge weights are implicitly learned via the back-propagation of the whole network.\n\nDataset | Method | AP | AP50 | AP75 | APS | APM | APL | AR1 | AR10 | AR100 | ARS | ARM | ARL\nVG1000 | Light-head RCNN [27] | 6.2 | 10.9 | 6.2 | 2.8 | 6.5 | 9.8 | 14.6 | 18.0 | 18.7 | 7.2 | 17.1 | 25.3\nVG1000 | FPN [28] | 5.6 | 10.1 | 5.4 | 3.4 | 5.8 | 8.0 | 13.0 | 16.5 | 16.6 | 7.5 | 15.7 | 20.6\nVG1000 | Faster RCNN [43] | 5.8 | 10.7 | 5.7 | 1.9 | 5.8 | 10.0 | 13.7 | 17.2 | 17.2 | 4.9 | 15.7 | 25.3\nVG1000 | Attribute | 7.4 | 12.9 | 7.4 | 2.4 | 7.4 | 13.7 | 17.0 | 21.4 | 21.5 | 6.0 | 19.5 | 33.0\nVG1000 | Relation | 7.4 | 12.8 | 7.5 | 3.0 | 7.5 | 13.0 | 17.0 | 21.6 | 21.7 | 7.2 | 19.8 | 31.4\nVG1000 | Spatial | 7.3 | 12.1 | 7.7 | 2.7 | 7.2 | 12.7 | 17.7 | 21.9 | 22.0 | 6.3 | 19.5 | 33.3\nVG1000 | HKRM (All) | 7.8 (+2.0) | 13.4 (+2.7) | 8.1 (+2.4) | 4.1 (+2.2) | 8.1 (+2.3) | 12.7 (+2.7) | 18.1 (+4.4) | 22.7 (+5.5) | 22.7 (+5.5) | 9.6 (+4.7) | 20.8 (+5.1) | 31.4 (+6.1)\nVG3000 | Light-head RCNN [27] | 3.0 | 5.1 | 3.2 | 1.7 | 4.0 | 5.8 | 7.3 | 9.0 | 9.0 | 4.3 | 10.3 | 15.4\nVG3000 | FPN [28] | 3.3 | 5.2 | 3.2 | 1.9 | 4.3 | 4.8 | 6.9 | 8.3 | 8.3 | 4.3 | 9.8 | 11.6\nVG3000 | Faster RCNN [43] | 3.4 | 6.0 | 3.4 | 1.6 | 4.3 | 7.3 | 8.1 | 9.8 | 9.8 | 3.8 | 10.9 | 17.0\nVG3000 | Attribute | 4.1 | 7.0 | 4.3 | 2.5 | 5.3 | 7.9 | 9.7 | 11.7 | 11.7 | 5.7 | 12.8 | 19.6\nVG3000 | Relation | 4.2 | 7.1 | 4.3 | 2.6 | 5.3 | 8.1 | 9.7 | 11.9 | 11.9 | 6.0 | 12.8 | 19.8\nVG3000 | Spatial | 4.0 | 6.7 | 4.1 | 2.3 | 5.1 | 7.6 | 9.3 | 11.2 | 11.2 | 5.3 | 12.4 | 18.7\nVG3000 | HKRM (All) | 4.3 (+0.9) | 7.2 (+1.2) | 4.4 (+1.0) | 2.6 (+1.0) | 5.5 (+1.2) | 8.4 (+1.1) | 10.1 (+2.0) | 12.2 (+2.4) | 12.2 (+2.4) | 5.9 (+2.1) | 13.0 (+2.1) | 20.5 (+2.5)\nADE | Light-head RCNN [27] | 7.0 | 11.7 | 7.3 | 2.4 | 5.1 | 11.2 | 9.6 | 13.3 | 13.4 | 4.3 | 10.4 | 20.4\nADE | FPN [28] | 6.5 | 12.1 | 6.2 | 3.3 | 6.0 | 10.5 | 9.5 | 12.9 | 13.0 | 5.3 | 11.9 | 18.6\nADE | Faster RCNN [43] | 7.9 | 14.7 | 7.5 | 2.1 | 5.8 | 13.2 | 10.6 | 14.2 | 14.4 | 4.5 | 11.9 | 22.4\nADE | Attribute | 9.6 | 16.8 | 9.7 | 3.1 | 7.0 | 15.9 | 12.7 | 16.9 | 17.1 | 6.1 | 14.1 | 26.3\nADE | Relation | 9.6 | 16.8 | 9.8 | 3.0 | 7.2 | 15.4 | 12.6 | 16.8 | 17.0 | 6.2 | 14.2 | 26.0\nADE | Spatial | 8.7 | 14.0 | 9.0 | 3.1 | 6.9 | 14.3 | 11.4 | 15.5 | 15.6 | 5.0 | 12.7 | 24.2\nADE | HKRM (All) | 10.3 (+2.4) | 18.0 (+3.0) | 10.4 (+2.9) | 4.1 (+2.0) | 7.9 (+2.1) | 16.8 (+3.6) | 13.6 (+3.0) | 18.3 (+4.1) | 18.5 (+4.1) | 7.1 (+2.6) | 15.5 (+3.6) | 28.4 (+6.0)\n\nTable 1: Main results (%) on the test sets of VG1000, VG3000 and ADE. \u201cAttribute\u201d, \u201cRelation\u201d and\n\u201cSpatial\u201d are the baseline Faster RCNN adding the corresponding knowledge module alone. HKRM\nis the model with a combination of all.\n\n4 Experiments\n\nDataset and Evaluation. We conduct experiments on large-scale object detection benchmarks with\na large number of classes: that is, Visual Genome (VG) [23] and ADE [56]. 
The task is to localize an\nobject and classify it, which is different from the experiments with given ground truth locations [6].\nFor Visual Genome, we use the latest release (v1.4), and synsets [45] instead of the raw names of\nthe categories due to inconsistent label annotations, following [20, 6]. We consider two sets of target\nclasses: 1000 most frequent classes and 3000 most frequent classes, resulting in two settings VG1000\nand VG3000. We split the remaining 92,960 images with objects on these class sets into 87,960 and\n5,000 for training and testing, respectively. In terms of ADE dataset, we use 20,197 images for training\nand 1,000 images for testing, following [6]. To validate the generalization capability of models and\nthe usefulness of transferred knowledge graph from VG, 445 classes that overlap with VG dataset\nare selected as targets. Since ADE is a segmentation dataset, we convert segmentation masks to\nbounding boxes [6] for all instances. For evaluation, we adopt the metrics from COCO detection\nevaluation criteria [29], that is, mean Average Precision (mAP) across different IoU thresholds\n(IoU = {0.5 : 0.95, 0.5, 0.75}) and scales (small, medium, big). We also use Average Recall (AR)\nwith different numbers of given detections per image ({1, 10, 100}) and different scales (small, medium,\nbig).\nAdditionally, we also evaluate on PASCAL VOC 2007 [10] and MSCOCO 2017 [29] to show prior\nknowledge can help detection for a small set of frequent classes (20/80 classes). PASCAL VOC\nconsists of about 10k trainval images (including VOC 2007 trainval and VOC 2012 trainval) and 5k\ntest images over 20 object categories. We only report mAP scores using IoU thresholds at 0.5 for\nthe purpose of comparison with other existing methods. MSCOCO 2017 contains 118k images for\ntraining, 5k for evaluation.\n\nDataset | Method | Backbone | #Parameters (M) | mAP (%)\nPASCAL VOC20 | SMN [5] | ResNet-101 | 66.7 | 67.8\nPASCAL VOC20 | Faster RCNN [43] | ResNet-101 | 57.0 | 75.1\nPASCAL VOC20 | HKRM (All) | ResNet-101 | 59.2 | 78.8\nMSCOCO80 | SMN [5] | ResNet-101 | 68.1 | 31.6\nMSCOCO80 | Relation Network [19] | ResNet-101 | 64.6 | 35.2\nMSCOCO80 | Faster RCNN [43] | ResNet-101 | 58.3 | 34.2\nMSCOCO80 | HKRM (All) | ResNet-101 | 60.3 | 37.8\n\nTable 2: Comparisons of mean Average Precision (mAP) and number of parameters on PASCAL VOC 2007\ntest set and COCO 2017 val set.\n\nKnowledge Graph Construction. We apply general knowledge graphs for both experiments on VG\nand ADE datasets. With the help of the statistics of the annotations in the VG dataset, we can\ncreate both attribute knowledge and relationship knowledge graphs. Specifically, we consider top 200 most\nfrequent attribute annotations in VG such as color, material and status of the categories (C = 3000),\nand then count their frequent statistics as the class-attribute table. For relationship knowledge, we\nuse top 200 most frequent relationship annotations in VG such as location relationship, subject-verb-object\nrelationship, and count frequent statistics of each class-relationship pair. Illustrations of\nconstructed knowledge graphs can be found in Supplementary material.\nImplementation Details. We treat the state-of-the-art Faster RCNN [43, 55] as our baseline and\nimplement all models in PyTorch [40]. We also compare with Light-head RCNN [27] and FPN [28].\nResNet-101 [17] pretrained on ImageNet [45] is used as our backbone network. The parameters\nbefore conv3 and the batch normalization are fixed, same with [27]. During training, we augment\nwith flipped images and multi-scaling (pixel size = {400, 500, 600, 700, 800}). During testing, pixel\nsize = 600 is used. Following [43], RPN is applied on the conv4 feature maps. 
The total number\nof proposed regions after NMS is 128. Features in conv5 are avg-pooled to become the input of\nthe final classifier. Unless otherwise noted, settings are same for all experiments. In terms of our\nexplicit attribute and relationship knowledge module upon region proposals, we use the final conv5\nfor 128 regions after avg-pool (D = 2048) as our module inputs. We consider 4 stacked linear layers\nas MLPQ (output channels: [256, 128, 64, 1]). ReLU is selected as the activation function between\neach linear layer. The output size: Ea = Er = 256, which is considered sufficient to contain the\nenhanced feature. In terms of implicit knowledge module, we employ M = 10 implicit graphs. For\nlearning each graph, 2 stacked linear layers are used (output channels: [5, 1]). pi is the score of the\nforeground from the RPN. The output size: Eg = 256. To avoid over-fitting, the final version of\nHKRM is the combination of three shrink modules with each output size equal to 256. f\u2032_a, f\u2032_r, g\u2032 and f\nare concatenated together and fed into the boundingbox regression layer and classification layer. We\napply stochastic gradient descent with momentum to optimize all models. The initial learning rate\nis 0.01, reduce three times (\u00d70.01) during fine-tuning; 10^{\u22124} as weight decay; 0.9 as momentum.\nFor both VG and ADE dataset, we train 28 epochs with mini-batch size of 2 for the baseline\nFaster RCNN. (Further training after 14 epochs won\u2019t increase the performance of baseline.) For our\nHKRM, we use 14 epochs of the baseline as pretrained model and train another 14 epochs with same\nsettings with baseline.\n\n4.1 Comparison with state-of-the-art\n\nWe report the result comparisons on VG1000 with 1000 categories, VG3000 with 3000 categories\nand ADE dataset in Table 1. As can be seen, all our model variants outperform the baseline Faster\nRCNN [43] on all datasets. 
Our HKRM achieves an overall AP of 7.8% compared to 5.8% by Faster\nRCNN on VG1000, 4.3% compared to 3.4% on VG3000, and 10.3% compared to 7.9% on ADE,\nrespectively. Moreover, our model achieves significantly higher performance on both classification\nand localization accuracy than the baseline on all cases (i.e. different scales and overlaps). This\nverifies the effectiveness of incorporating global reasoning guided by rich external knowledge into\nlocal region recognition. More significant performance gap by our HKRM can be observed for those\nrare categories with very few samples (about 1.5% average improvement for the top 150 infrequent\ncategories by our method in terms of mAP).\n\n[Figure 4 rows: Faster RCNN; HKRM]\nFigure 4: Qualitative result comparison on VG1000 between Faster RCNN and our HKRM. Objects\nwith occlusion, ambiguities and rare category can be detected by our modules.\n\nTo compare with the state-of-the-art knowledge-enhanced methods, we also implement HKRM on\nPASCAL VOC and MS COCO datasets with only 20/80 categories in Table 2. For PASCAL VOC,\nour HKRM performs 1.1% better than the baseline Faster RCNN, and outperforms Spatial Memory\nNetwork [5]. For MSCOCO, comparison is made between Relation Network [19] and Spatial Memory\nNetwork. The proposed HKRM boosts the mAP from 34.9% to 37.8% and outperforms all the other\nmethods. Our method can also boost the performance in the more simplified dataset benefiting from\nthe shared linguistic knowledge and spatial layout knowledge. Note that HKRM, consisting of three\nknowledge modules, increases the parameters by about 2% in total and is light-weight compared to [5, 19].\nFigure 4 shows the qualitative result comparison between our HKRM and Faster RCNN. Our HKRM\ncan detect the obscure palm trees far away in the left image. In the middle image, the multiple\noverlapped small objects such as glass and paper are recognized by our method. 
\u201cPepperoni\u201d is a rare\ncategory and is detected on the pizza in the right image.\n\n4.2 Ablation Studies\n\nThe effect of different explicit knowledge. We analyze the effect of both attribute and relationship\nknowledge on final detection performance. The attribute module alone increases overall AP by\n1.6% for VG1000, 0.6% for VG3000 and 1.7% for ADE over the baseline. The relationship module performs\nsimilarly, with a slightly higher result for VG3000. Sharing visual features according to\nattribute and relationship knowledge thus boosts object detection performance.\nThe effect of implicit knowledge. As can be seen, the implicit spatial module alone yields gains of around\n1.5% for VG1000, 0.3% for VG3000 and 0.8% for ADE. The spatial module alone is not as effective\nas the attribute and relationship modules. However, the unsupervised learning of spatial knowledge\ncan still significantly help object recognition through such undefined knowledge.\nGeneralization capability. From Table 1, the external knowledge graph from VG helps\nto improve the performance on ADE. Therefore, any datasets with overlapping categories can share the\nexisting knowledge graph. Besides, our module can be added to diverse detection systems easily.\nGlobal reasoning. The proposed HKRM achieves global reasoning over regions via one-time\npropagation over all graph edges and nodes. 
Benefiting from the learned knowledge graph for each\nimage, our HKRM is able to propagate information between nodes that are not connected in the\nprior knowledge graph. We tried higher orders of feature transformation (e.g. 2 and 3) and\ndid not observe significant improvement. In fact, over-transformation can even make the enhanced\nfeatures all identical.\n\nFigure 5: 2-D visualization of f'_a and g' by the t-SNE method [31]: the explicit module with attribute\nknowledge (top); the implicit knowledge module with spatial knowledge (bottom). The red regions\nare enlarged in the right panels. Categories that share similar attribute knowledge (top) or spatial\nrelationships (bottom) are close to each other. This verifies that our modules learn the corresponding\nknowledge.\n\nAnalysis of feature interpretability. To better understand the underlying feature representations that\nour HKRM actually learns for graph reasoning, we record the outputs f'_a and g' (E_a = E_g = 512) of\nthe explicit attribute module and the implicit spatial module, together with the corresponding real labels, for each\nregion of 8000 VG1000 images. We then average the features per label and use the t-SNE [31]\nmethod to visualize them, as shown in Figure 5. Note that if the features of some categories are\nclose to each other, the edges between those categories are more likely to be activated. From the\ntwo enlarged regions on top, we can see that features of categories that share similar attributes,\nsuch as \u201cwater\u201d, \u201csand\u201d and \u201celectronics\u201d, are close to each other. This indicates that our explicit\nknowledge module successfully incorporates the prior attribute knowledge and leads to interpretable\nfeature learning. Similarly, in the two enlarged regions at the bottom, features of objects that have a spatial\nrelationship, such as \u201con face\u201d and \u201cin kitchen counter\u201d, are close to each other. 
This validates that our\nspatial knowledge module is capable of encoding underlying spatial relationships. Benefiting from\nexplicit knowledge supervision, the feature clustering property of the explicit attribute module seems\nto be better than that of the implicit knowledge module. More gradient visualization [48] results for\nthe enhanced features are included in the supplementary materials for a better understanding of the modules.\n\n5 Conclusion\n\nWe present two novel general knowledge modules in HKRM. The first one can embed any external\nknowledge through supervision. The second one can implicitly learn knowledge without\nexplicit definitions or human summarization. Both modules can be easily applied to the\noriginal detection system to improve detection performance. The experiments and analysis\nindicate that HKRM can alleviate the problems of large-scale object detection. In future work, we\ncan use Cholesky decomposition to re-parametrize the region-to-region graph and further reduce the\nmodule parameters by half, exploiting the symmetry of our graph. We can also add experiments\nusing word embedding knowledge in the explicit module and on the latest Open Images Dataset,\nwhich consists of about 600 categories.\n\nAcknowledgments\n\nThis work was supported in part by the National Key Research and Development Program of China\nunder Grant No. 2018YFC0830103, in part by the National High Level Talents Special Support Plan (Ten\nThousand Talents Program), and in part by the National Natural Science Foundation of China (NSFC)\nunder Grant No. 61622214 and 61836012.\n\nReferences\n[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based\n\nclassification. In CVPR, 2013. 3\n\n[2] J. Almaz\u00e1n, A. Gordo, A. Forn\u00e9s, and E. Valveny. 
Word spotting and recognition with embedded\nattributes. IEEE transactions on pattern analysis and machine intelligence, 36(12):2552\u20132566,\n2014. 3\n\n[3] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging\n\nobjects undergoing relational violations. Cognitive psychology, 14(2):143\u2013177, 1982. 2\n\n[4] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR,\n\n2018. 1\n\n[5] X. Chen and A. Gupta. Spatial memory for context reasoning in object detection. In ICCV,\n\n2017. 2, 3, 7, 8\n\n[6] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In\n\nCVPR, 2018. 2, 3, 5, 6\n\n[7] B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In\n\nCVPR, 2017. 2, 3\n\n[8] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional\n\nnetworks. In NIPS, 2016. 1, 3\n\n[9] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam.\n\nLarge-scale object classi\ufb01cation using label relation graphs. In ECCV, 2014. 2, 3\n\n[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual\nobject classes (voc) challenge. International Journal of Computer Vision, 88(2):303\u2013338, June\n2010. 3, 6\n\n[11] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In\n\nCVPR, 2009. 3\n\n[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with\ndiscriminatively trained part-based models. IEEE transactions on pattern analysis and machine\nintelligence, 32(9):1627\u20131645, 2010. 3\n\n[13] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep\n\nvisual-semantic embedding model. In NIPS, 2013. 3\n\n[14] C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence,\n\nlocation and appearance. 
In CVPR, 2008. 3\n\n[15] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2018. 3\n\n[16] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In Advances\n\nin Neural Information Processing Systems 22, 2009. 1\n\n[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR,\n\n2016. 3, 7\n\n[18] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and\n\nK. Saenko. Lsda: Large scale detection through adaptation. In NIPS, 2014. 1\n\n[19] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In CVPR,\n\n2018. 2, 5, 7, 8\n\n[20] R. Hu, P. Doll\u00e1r, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In CVPR,\n\n2018. 6\n\n[21] D. Jayaraman and K. Grauman. Zero-shot recognition with unreliable attributes. In NIPS, 2014.\n\n3\n\n[22] T. N. Kipf and M. Welling. Semi-supervised classi\ufb01cation with graph convolutional networks.\n\nIn ICLR, 2017. 2, 3\n\n[23] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li,\nD. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision\nusing crowdsourced dense image annotations. International Journal of Computer Vision, 2016.\n1, 2, 6\n\n[24] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by\n\nbetween-class attribute transfer. In CVPR, 2009. 3\n\n[25] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object\n\ndetection. IEEE Transactions on Multimedia, 19(5):944\u2013954, 2017. 3\n\n[26] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In\n\nICLR, 2016. 2, 3\n\n[27] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head r-cnn: In defense of two-stage\n\nobject detector. In CVPR, 2017. 6, 7\n\n[28] T.-Y. Lin, P. Doll\u00e1r, R. Girshick, K. He, B. 
Hariharan, and S. Belongie. Feature pyramid\n\nnetworks for object detection. In CVPR, 2017. 6, 7\n\n[29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick.\n\nMicrosoft coco: Common objects in context. In ECCV, 2014. 3, 6\n\n[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single\n\nshot multibox detector. In ECCV, 2016. 3\n\n[31] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning\n\nresearch, 9(Nov):2579\u20132605, 2008. 9\n\n[32] J. Mao, X. Wei, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille. Learning like a child: Fast novel\n\nvisual concept learning from sentence descriptions of images. In ICCV, 2015. 3\n\n[33] K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for\n\nimage classi\ufb01cation. In CVPR, 2017. 2, 3\n\n[34] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image\nclassi\ufb01cation: Generalizing to new classes at near-zero cost. In Computer Vision\u2013ECCV 2012,\n2012. 2\n\n[35] G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39\u2013\n\n41, 1995. 3\n\n[36] I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In\n\nCVPR, 2017. 3\n\n[37] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The\nrole of context for object detection and semantic segmentation in the wild. In CVPR, 2014. 3\n\n[38] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In\n\nICML, pages 2014\u20132023, 2016. 2, 3\n\n[39] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011. 3\n\n[40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,\n\nL. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017. 7\n\n[41] J. Redmon, S. Divvala, R. Girshick, and A. 
Farhadi. You only look once: Uni\ufb01ed, real-time\n\nobject detection. In CVPR, 2016. 3\n\n[42] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of \ufb01ne-grained visual\n\ndescriptions. In CVPR, 2016. 3\n\n[43] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with\n\nregion proposal networks. In NIPS, 2015. 1, 2, 3, 6, 7\n\n[44] M. Rohrbach, S. Ebert, and B. Schiele. Transfer learning in a transductive setting. In NIPS,\n\n2013. 3\n\n[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,\nA. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International\nJournal of Computer Vision, 115(3):211\u2013252, 2015. 6, 7\n\n[46] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for\n\nmulticlass object detection. In CVPR, 2011. 3\n\n[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image\n\nrecognition. In ICLR, 2015. 3\n\n[48] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all\n\nconvolutional net. In ICLR Workshop, 2015. 9\n\n[49] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: ef\ufb01cient boosting procedures\n\nfor multiclass object detection. In CVPR, 2004. 1\n\n[50] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for\n\nplace and object recognition. In ICCV, 2003. 3\n\n[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and\n\nI. Polosukhin. Attention is all you need. In NIPS, 2017. 5\n\n[52] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018. 2\n\n[53] X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge\n\ngraphs. In CVPR, 2018. 3\n\n[54] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. 
Ask me anything: Free-form visual\n\nquestion answering based on knowledge from external sources. In CVPR, 2016. 3\n\n[55] J. Yang, J. Lu, D. Batra, and D. Parikh. A faster pytorch implementation of faster r-cnn.\n\nhttps://github.com/jwyang/faster-rcnn.pytorch, 2017. 7\n\n[56] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through\n\nade20k dataset. In CVPR, 2017. 2, 6", "award": [], "sourceid": 782, "authors": [{"given_name": "ChenHan", "family_name": "Jiang", "institution": "Sun Yat-sen University"}, {"given_name": "Hang", "family_name": "Xu", "institution": "Huawei Noah's Ark Lab"}, {"given_name": "Xiaodan", "family_name": "Liang", "institution": "Sun Yat-sen University"}, {"given_name": "Liang", "family_name": "Lin", "institution": "Sun Yat-Sen University"}]}