{"title": "Symbolic Graph Reasoning Meets Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 1853, "page_last": 1863, "abstract": "Beyond local convolution networks, we explore how to harness various external human knowledge for endowing the networks with the capability of semantic global reasoning. Rather than using separate graphical models (e.g. CRF) or constraints for modeling broader dependencies, we propose a new Symbolic Graph Reasoning (SGR) layer, which performs reasoning over a group of symbolic nodes whose outputs explicitly represent different properties of each semantic in a prior knowledge graph. To cooperate with local convolutions, each SGR is constituted by three modules: a) a primal local-to-semantic voting module where the features of all symbolic nodes are generated by voting from local representations; b) a graph reasoning module propagates information over knowledge graph to achieve global semantic coherency; c) a dual semantic-to-local mapping module learns new associations of the evolved symbolic nodes with local representations, and accordingly enhances local features. The SGR layer can be injected between any convolution layers and instantiated with distinct prior graphs. Extensive experiments show incorporating SGR significantly improves plain ConvNets on three semantic segmentation tasks and one image classification task. More analyses show the SGR layer learns shared symbolic representations for domains/datasets with the different label set given a universal knowledge graph, demonstrating its superior generalization capability.", "full_text": "Symbolic Graph Reasoning Meets Convolutions\n\nXiaodan Liang1, Zhiting hu2 , Hao Zhang2 , Liang Lin3 , Eric P. 
Xing4\n\n1 School of Intelligent Systems Engineering, Sun Yat-sen University\n2 Carnegie Mellon University\n3 School of Data and Computer Science, Sun Yat-sen University\n4 Petuum Inc.\n\nxdliang328@gmail.com, {zhitingh,hao,epxing}@cs.cmu.edu, linliang@ieee.org\n\nAbstract\n\nBeyond local convolution networks, we explore how to harness various external human knowledge for endowing the networks with the capability of semantic global reasoning. Rather than using separate graphical models (e.g. CRF) or constraints for modeling broader dependencies, we propose a new Symbolic Graph Reasoning (SGR) layer, which performs reasoning over a group of symbolic nodes whose outputs explicitly represent different properties of each semantic in a prior knowledge graph. To cooperate with local convolutions, each SGR is constituted by three modules: a) a primal local-to-semantic voting module where the features of all symbolic nodes are generated by voting from local representations; b) a graph reasoning module propagates information over the knowledge graph to achieve global semantic coherency; c) a dual semantic-to-local mapping module learns new associations of the evolved symbolic nodes with local representations, and accordingly enhances local features. The SGR layer can be injected between any convolution layers and instantiated with distinct prior graphs. Extensive experiments show incorporating SGR significantly improves plain ConvNets on three semantic segmentation tasks and one image classification task. 
More analyses show the SGR layer learns shared symbolic representations for domains/datasets with different label sets given a universal knowledge graph, demonstrating its superior generalization capability.\n\n1 Introduction\n\nDespite significant advances in standard recognition tasks such as image classification [12] and segmentation [6] achieved by convolution networks, the dominant paradigm lies in stacking deeper and more complicated local convolutions, in the hope that they capture everything about the relationship between inputs and targets. But such networks compromise feature interpretability and also lack the global reasoning capability that is crucial for complicated real-world tasks. Some works [51, 41, 5] thus formulated graphical models and structure constraints (e.g. CRF [22, 19]) as recurrent networks acting on final convolution predictions. However, they cannot explicitly enhance feature representations, leading to limited generalization capability. The very recent capsule network [39, 14] extends to learn the sharing of knowledge across locations to find feature clusters, but it can only exploit an implicit and uncontrollable feature hierarchy. As emphasized in [3], visual reasoning over external knowledge is crucial for human decision-making. The lack of explicit reasoning over contexts and high-level semantics would hinder the advances of convolution networks in recognizing objects in a large concept vocabulary, where exploring semantic correlations and constraints plays an important role. On the other hand, structured knowledge provides rich cues to record human observations and commonsense using symbolic words (e.g. nouns or predicates). 
It is thus desirable to bridge symbolic semantics with learned local feature representations for better graph reasoning.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nIn this paper, we explore how to incorporate rich commonsense human knowledge [33, 53] into intermediate feature representation learning beyond local convolutions, and further achieve global semantic coherency. Commonsense human knowledge can be formed as various undirected graphs consisting of rich relationships (e.g. semantic hierarchy, spatial/action interactions, attributes, concurrence) among concepts. For example, "Shetland Sheepdog" and "Husky" share one superclass "dog" due to some common characteristics; people wear a hat and play a guitar, not vice-versa; an orange is yellow in color. After associating structured knowledge with the visual domain, all these symbolic entities (e.g. dog) can be connected with visual evidence from images, and humans can thus integrate visual appearance and commonsense knowledge to aid recognition.\nWe attempt to mimic this reasoning procedure and integrate it into convolution networks, that is, first characterize representations of different symbolic nodes by voting from local features; then perform graph reasoning to enhance the visual evidence of these symbolic nodes via graph propagation and achieve semantic coherency; finally map the evolved features of symbolic nodes back to facilitate each local representation. Our work takes an important next step beyond prior approaches in that it directly incorporates reasoning over an external knowledge graph into local feature learning, in what we call the Symbolic Graph Reasoning (SGR) layer. 
Note that here we use "Symbolic" to denote nodes with explicit linguistic meaning, rather than the conventional hidden graph nodes used in graphical models or graph neural networks [40, 18].\nThe core of our SGR layer consists of three modules, as illustrated in Figure 1. First, personalized visual evidence of each symbolic node is produced by voting from all local representations, named the local-to-semantic voting module. The voting weights stand for the semantic agreement confidence of each local feature with respect to a certain node. Second, given a prior knowledge graph, the graph reasoning module is instantiated to propagate information over this graph and evolve the visual features of all symbolic nodes. Finally, a dual semantic-to-local module learns appropriate associations between the evolved symbolic nodes and local features to join the forces of local and global reasoning. It thus enables the evolved knowledge of a specific symbolic node to only drive the recognition of semantically compatible local features with the help of global reasoning.\nThe key merits of our SGR layer lie in three aspects: a) local convolutions and global reasoning facilitated with commonsense knowledge can collaborate by learning associations between image-specific observations and the prior knowledge graph; b) each local feature is enhanced by its correlated incoming local features, whereas in standard local convolutions it is only based on a comparison between its own incoming features and a learned weight vector; c) benefiting from the learned representations of universal symbolic nodes, the learned SGR layer can be easily transferred to other dataset domains with discrepant concept sets. 
Moreover, the SGR layer can be plugged between any convolution layers and personalized according to distinct knowledge graphs.\nExtensive experiments show superior performance over plain ConvNets by incorporating our SGR layer, especially on recognizing a large concept vocabulary in three semantic segmentation datasets (COCO-Stuff, ADE20K, PASCAL-Context) and an image classification dataset (CIFAR-100). We further demonstrate its promising generalization capability when transferring an SGR layer trained on one domain into other domains.\n\n2 Related Work\n\nRecent research that explored context modeling for convolution networks can be categorized into two streams. One stream exploits networks for graph-structured data with a family of graph-based CNNs [36, 40] and RNNs [25, 26] or advanced convolution filters [43] to discover more complex feature dependencies. In the context of convolutional networks, graphical models such as conditional random fields (CRF) [22, 19] can be formulated into a recurrent network functioning on the final predictions of basic convolutions [51, 41, 5]. In contrast, the proposed SGR layer can be treated as a simple feedforward layer that can be injected between any convolution layers and is general-purpose for any network targeting large-scale, semantics-related recognition. Our work differs in that local features are mapped into meaningful symbolic nodes. The global reasoning over locations is directly aligned with external knowledge rather than implicit feature clusters, which is a more effective and interpretable way to introduce structure constraints.\nAnother stream explored external knowledge bases for facilitating networks. For example, Deng et al. [9] employed a label relation graph to guide network learning while Ordonez et al. [37] learned the\n\nFigure 1: An overview of the proposed SGR layer. 
Each symbolic node receives votes from all local features via a local-to-semantic voting module (long gray arrows), and its evolved features after graph reasoning are then mapped back to each location via a semantic-to-local mapping module (long purple arrows). For simplicity, we omit more edges and symbolic nodes in the knowledge graph.\n\nmapping of common concepts to entry-level concepts. Some works regularized the output of networks by resorting to complex graphical inference [9], hierarchical losses [38] or word-embedding priors [49] on final prediction scores. However, these loss constraints can only function on the final prediction layer and only indirectly guide visual features to be hierarchy-aware, which is hard to guarantee. More recently, Marino et al. [32] used structured prior knowledge to enhance predictions of multi-label classification, while our SGR is a general neural layer that can be injected between any convolution layers and allows the neural network to leverage semantic constraints derived from various human knowledge. Chen et al. [7] leverage local region-based reasoning and global reasoning to facilitate object detection. In contrast, our SGR layer directly performs reasoning over symbolic nodes and interacts seamlessly with local convolution layers for better flexibility. Notably, the earliest efforts on reasoning in artificial intelligence date back to symbolic approaches [35], which perform reasoning over abstract symbols with the language of mathematics and logic. After grounding these symbols, statistical learning algorithms [23] are used to extract useful patterns to perform relational reasoning on knowledge bases. 
An effective reasoning procedure that would be practical enough for advanced tasks should join the forces of local visual representation learning and global semantic graph reasoning. Our reasoning layer relates to this line of research by explicitly reasoning over the visual evidence of language entities voted from local representations.\n\n3 Symbolic Graph Reasoning\n3.1 General-purpose Graph Construction\n\nThe commonsense knowledge graph is used to depict distinct correlations between entities (e.g. classes, attributes and relationships) in general, and can take any form. To support general-purpose graph reasoning, the knowledge graph is formulated as G = (N, E), where N and E denote the symbol set and edge set, respectively. Here we give three examples: a) a class hierarchy graph is constructed from a list of entity classes (e.g. person, motorcyclist), and its graph edges encode concept belongings (e.g. "is kind of" or "is part of"). Networks equipped with such hierarchy knowledge can encourage the learning of a feature hierarchy by passing the shared representations of parent classes into their child nodes; b) a class occurrence graph defines the edges as the occurrence of two classes across images, characterizing the rationality of predictions; c) as a higher-level semantic abstraction, a semantic relationship graph can extend the symbolic nodes to include more actions (e.g. "ride", "play"), layouts (e.g. "on top of") and attributes (e.g. color or shape), while graph edges are statistically collected from language descriptions. Incorporating such high-level commonsense knowledge can facilitate networks to prune spurious explanations after knowing the relationship of each entity pair, resulting in good semantic coherency.\nBased on this general formula, the graph reasoning is required to be compatible and general enough for soft graph edges (e.g. 
occurrence probabilities) and hard edges (e.g. belongings), as well as diverse symbolic nodes. Various structure constraints can thus be modeled as edge connections over symbolic nodes, just as humans use language tools. Our SGR layer is designed to achieve general graph reasoning that is applicable for encoding a wide range of knowledge graph forms. As illustrated in Figure 1, it consists of a local-to-semantic voting module, a graph reasoning module and a semantic-to-local mapping module, as presented in the following sections.\n\nFigure 2: Implementation details of one SGR layer, taking convolution feature tensors of size H^l × W^l × D^l as inputs. ⊗ denotes matrix multiplication, ⊕ denotes element-wise summation, and the circle with C denotes concatenation. The softmax operation, tensor expansion and ReLU operation are performed where noted. The green boxes denote 1 × 1 convolution or linear layers.\n\n3.2 Local-to-Semantic Voting Module\n\nGiven local feature tensors from convolution layers, our target is to leverage global graph reasoning to enhance local features with external structured knowledge. We thus first summarize the global information encoded in local features into representations of symbolic nodes, that is, local features that are correlated to a specific semantic meaning (e.g. cat) are aggregated to depict the characteristics of the corresponding symbolic node. Formally, we use the feature tensor X^l ∈ R^{H^l×W^l×D^l} after the l-th convolution layer as the module input, where H^l and W^l are the height and width of the feature maps and D^l is the channel number. 
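As a concrete illustration of the graph formulation G = (N, E) in Section 3.1, a knowledge graph with both hard and soft edges can be encoded as a symmetric adjacency matrix. A minimal sketch (the symbol names and edge weights here are illustrative, not the paper's actual graphs):

```python
import numpy as np

# Toy symbol set. Hard "is kind of" edges carry weight 1.0, while a soft
# co-occurrence edge carries a probability-like weight (e.g. 0.3).
symbols = ["animal", "dog", "cat", "person"]
idx = {s: i for i, s in enumerate(symbols)}
edges = [("animal", "dog", 1.0), ("animal", "cat", 1.0), ("person", "dog", 0.3)]

M = len(symbols)
A = np.zeros((M, M))
for a, b, w in edges:
    A[idx[a], idx[b]] = A[idx[b], idx[a]] = w  # undirected graph

assert np.allclose(A, A.T)  # symmetric, as assumed by the reasoning module
```

Any of the three graph types above (hierarchy, occurrence, semantic relationship) reduces to such a matrix; only the node vocabulary and edge weights change.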
This module aims to produce visual representations H^ps ∈ R^{M×D_c} of all M = |N| symbolic nodes from X^l, where D_c is the desired feature dimension of each node n. This is formulated as the function φ:\n\nH^ps = φ(A^ps, X^l, W^ps),    (1)\n\nwhere W^ps ∈ R^{D^l×D_c} is the trainable transformation matrix for converting each local feature x_i ∈ X^l into the dimension D_c, and A^ps ∈ R^{H^l×W^l×M} denotes the voting weights of all local features to each symbolic node. Specifically, the visual feature H^ps_n ∈ H^ps of each node n is computed by summing up all weighted transformed local features via the voting weight a_{x_i→n} ∈ A^ps, which represents the confidence of assigning the local feature x_i to node n. More specifically, the function φ is computed as:\n\nH^ps_n = Σ_{x_i} a_{x_i→n} x_i W^ps,    a_{x_i→n} = exp(W^aT_n x_i) / Σ_{n'∈N} exp(W^aT_{n'} x_i).    (2)\n\nHere W^a = {W^a_n} ∈ R^{D^l×M} is a trainable weight matrix for calculating the voting weights. A^ps is normalized with a softmax at each location. In this way, different local features can adaptively vote for the representations of distinct symbolic nodes.\n\n3.3 Graph Reasoning Module\n\nBased on the visual evidence of symbolic nodes, reasoning guided by structured knowledge is employed to leverage semantic constraints from human commonsense and evolve the global representations of symbolic nodes. Here, we incorporate both the linguistic embedding of each symbolic node and the knowledge connections (i.e. node edges) to perform graph reasoning. Formally, for each symbolic node n ∈ N, we use off-the-shelf word vectors [17] as its linguistic embedding, denoted as S = {s_n}, s_n ∈ R^K. 
The graph reasoning module performs graph propagation over the representations H^ps of all symbolic nodes via matrix multiplication, resulting in the evolved features H^g:\n\nH^g = σ(A^g B W^g),    (3)\n\nwhere B = [σ(H^ps), S] ∈ R^{M×(D_c+K)} concatenates the transformed features σ(H^ps), with σ(·) an activation function, and the linguistic embeddings S. W^g ∈ R^{(D_c+K)×D_c} is a trainable weight matrix. The node adjacency weight a_{n→n'} ∈ A^g is defined according to the edge connections (n, n') ∈ E. As discussed in Section 3.1, the edge connections can be soft weights (e.g. 0.8) or hard weights (i.e. {0, 1}) according to different knowledge graph resources. The naive multiplication with A^g would completely change the scale of the feature vectors. Inspired by graph convolutional networks [18], we can normalize A^g such that all rows sum to one to get rid of this problem, i.e. Q^{-1/2} A^g Q^{-1/2}, where Q is the diagonal node degree matrix of A^g. This symmetric normalization corresponds to taking the average of neighboring node features. This formulation arrives at the new propagation rule:\n\nH^g = σ(Q̂^{-1/2} Â^g Q̂^{-1/2} B W^g),    (4)\n\nwhere Â^g = A^g + I is the adjacency matrix of the graph G with added self-connections for considering each node's own representation, I is the identity matrix, and Q̂_ii = Σ_j Â^g_ij.\n\n3.4 Semantic-to-Local Mapping Module\n\nFinally, the evolved global representations H^g ∈ R^{M×D_c} of the symbolic nodes can be used to further boost the capability of each local feature representation. As the feature distribution of each symbolic node has changed after graph reasoning, a critical question is how to find the most appropriate mappings from the representation h^g ∈ H^g of each symbolic node to all x_i. This amounts to learning a compatibility matrix between local features and symbolic nodes. Inspired by message-passing algorithms [11], we compute the mapping weights a_{h^g→x_i} ∈ A^sp by evaluating the compatibility of each symbolic node h^g with each local feature x_i:\n\na_{h^g→x_i} = exp(W^sT [h^g, x_i]) / Σ_{x_i} exp(W^sT [h^g, x_i]),    (5)\n\nwhere W^s ∈ R^{D^l+D_c} is a trainable weight vector. The compatibility matrix A^sp ∈ R^{H^l×W^l×M} is again normalized. The evolved features X^{l+1}, which serve as inputs to the (l+1)-th convolution layer, are computed as:\n\nX^{l+1} = σ(A^sp H^g W^sp) + X^l,    (6)\n\nwhere W^sp ∈ R^{D_c×D^l} is the trainable matrix for transforming the dimension of the symbolic node representations back into D^l, and we use a residual connection [12] to further enhance the local representations with the original local feature tensor X^l. Each local feature is updated by the weighted mappings from each symbolic node that represents different characteristics of semantics.\n\n3.5 Symbolic Graph Reasoning Layer\n\nEach symbolic graph reasoning layer is constituted by the stack of a local-to-semantic voting module, a graph reasoning module, and a semantic-to-local mapping module. 
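The three-module stack of Eqs. (1)-(6) can be sketched in a few lines of numpy. This is a minimal shape-level illustration, not the paper's implementation: all sizes and weights are toy values, σ is taken to be ReLU (as in the experiments), and the softmax in Eq. (5) is read as normalizing each node's compatibilities over locations:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

# Toy sizes (hypothetical): locations, local dim, node dim, embedding dim, nodes.
HW, Dl, Dc, K, M = 16, 8, 6, 4, 5

# Small symmetric adjacency with hard (1.0) and soft (0.3) edges.
A = np.zeros((M, M))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (3, 4, 0.3)]:
    A[i, j] = A[j, i] = w

X = rng.normal(size=(HW, Dl))         # flattened local features X^l
S = rng.normal(size=(M, K))           # word embeddings of symbolic nodes
W_a = rng.normal(size=(Dl, M))        # voting weights W^a, Eq. (2)
W_ps = rng.normal(size=(Dl, Dc))      # local-to-semantic transform, Eq. (1)
W_g = rng.normal(size=(Dc + K, Dc))   # graph reasoning weights, Eq. (4)
W_s = rng.normal(size=(Dc + Dl,))     # compatibility weights, Eq. (5)
W_sp = rng.normal(size=(Dc, Dl))      # semantic-to-local transform, Eq. (6)

# (a) Local-to-semantic voting, Eqs. (1)-(2).
A_ps = softmax(X @ W_a, axis=1)       # each location votes over the M nodes
H_ps = A_ps.T @ (X @ W_ps)            # (M, Dc) visual evidence per node

# (b) Graph reasoning with symmetric normalization, Eqs. (3)-(4).
B = np.concatenate([relu(H_ps), S], axis=1)   # [sigma(H^ps), S]
A_hat = A + np.eye(M)                         # add self-connections
q = 1.0 / np.sqrt(A_hat.sum(axis=1))          # diagonal of Q_hat^{-1/2}
H_g = relu((q[:, None] * A_hat * q[None, :]) @ B @ W_g)

# (c) Semantic-to-local mapping, Eqs. (5)-(6). W^sT [h^g, x_i] splits into
# two dot products, so the compatibility matrix is an outer sum.
compat = (H_g @ W_s[:Dc])[None, :] + (X @ W_s[Dc:])[:, None]  # (HW, M)
A_sp = softmax(compat, axis=0)        # each node's weights normalized over locations
X_next = relu(A_sp @ H_g @ W_sp) + X  # residual update, Eq. (6)

assert X_next.shape == X.shape
```

Stacking this block between two convolution layers (flatten H^l × W^l into HW, apply, reshape back) is all that is required to inject an SGR layer.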
The SGR layer is instantiated by a specific knowledge graph with its own number of symbolic nodes and node connections. Combining multiple SGR layers with distinct knowledge graphs in a convolutional network can lead to hybrid graph reasoning behaviors. We implement the modules of each SGR layer via combinations of 1 × 1 convolution operations and non-linear functions, as detailed in Figure 2. Our SGR layer is flexible and general enough to be injected between any local convolutions. Nonetheless, as SGR is designed to incorporate high-level semantic reasoning, using SGR in later convolution layers is preferable, as demonstrated in our experiments.\n\n4 Experiments\n\nAs we present the proposed SGR layer as a conventional module suitable for any convolution network, we compare it with state-of-the-art methods both on the pixel-level prediction task (i.e. semantic segmentation) on Coco-Stuff [4], Pascal-Context [34] and ADE20K [52], and on the image classification task on CIFAR-100 [21]. Extensive ablation studies are conducted on the Coco-Stuff dataset [4].\n\n4.1 Semantic Segmentation\n\nDataset. We evaluate on three public benchmarks for segmenting over large-scale categories, which pose more realistic challenges than other small segmentation datasets (e.g. PASCAL-VOC) and can better validate the necessity of global symbolic reasoning. Specifically, Coco-Stuff [4] contains 10,000 images with dense annotations of 91 thing classes (e.g. book, clock) and 91 stuff classes (e.g. flower, wood), including 9,000 for training and 1,000 for testing. ADE20K [52] consists of 20,210 images for training and 2,000 for validation, annotated with 150 semantic concepts (e.g. painting, lamp). PASCAL-Context [34] includes 4,998 images for training and 5,105 for testing, annotated with 59 object categories and one background class. We use the standard evaluation metrics of pixel accuracy (pixAcc) and mean Intersection over Union (mIoU).\n\nTable 1: Comparison on Coco-Stuff test set (%). All our models are based on ResNet-101.\nMethod | Class acc. | acc. | mean IoU\nFCN [31] | 38.5 | 60.4 | 27.2\nDeepLabv2 (ResNet-101) [6] | 45.5 | 65.1 | 34.4\nDAG RNN + CRF [42] | 42.8 | 63.0 | 31.2\nOHE + DC + FCN [15] | 45.8 | 66.6 | 34.3\nDSSPN (ResNet-101) [27] | 47.0 | 68.5 | 36.2\nSGR (w/o residual) | 47.9 | 68.4 | 38.1\nSGR (concurrence graph) | 49.1 | 69.6 | 38.3\nSGR (scene graph) | 48.6 | 69.5 | 38.4\nSGR (w/o mapping) | 47.3 | 67.9 | 37.2\nSGR (ConvBlock4) | 47.6 | 68.3 | 37.5\nOur SGR (ResNet-101) | 49.3 | 69.9 | 38.7\nOur SGR (ResNet-101 2-layer) | 49.4 | 69.7 | 38.8\nOur SGR (ResNet-101 Hybrid) | 49.8 | 70.5 | 39.1\n\nTable 2: Comparison on PASCAL-Context test set (%).\nMethod | mean IoU\nFCN [31] | 37.8\nCRF-RNN [51] | 39.3\nParseNet [30] | 40.4\nBoxSup [8] | 40.5\nHO CRF [1] | 41.3\nPiecewise [29] | 43.3\nVeryDeep [44] | 44.5\nDeepLab-v2 (ResNet-101) [6] | 45.7\nRefineNet (Res152) [28] | 47.3\nOur SGR (ResNet-101) | 50.8\nOur SGR (Transfer convs) | 51.3\nOur SGR (Transfer SGR) | 52.5\n\nImplementation. We conduct all experiments using Pytorch and 2 GTX TITAN X 12GB cards on a single server. We use the ImageNet-pretrained ResNet-101 [12] as the basic ConvNet following the procedure of [6], employ output stride = 8 and incorporate the SGR layer into it. The detailed implementation of one SGR layer is shown in Figure 2. Our final SGR model first employs the Atrous Spatial Pyramid Pooling (ASPP) [6] module with pyramids of {6, 12, 18, 24} to reduce the 2,048-d features from the final ResBlock of ResNet-101 into 256-d features. 
Upon this, we stack one SGR layer to enhance the local features, followed by a final 1 × 1 convolution layer to produce the final pixel-wise predictions. D^l and D_c, the feature dimensions in both the local-to-semantic voting module and the graph reasoning module, are thus set to 256, and we use the ReLU activation function for σ(·). Word embeddings from fastText [17] are used to represent each class; fastText extracts sub-word information and generalizes well to out-of-vocabulary words, resulting in a K = 100-d vector for each node.\nWe use a universal concept hierarchy for all datasets. Following [27], starting from the label hierarchy of COCO-Stuff [4] that includes 182 concepts and 27 super-classes, we manually merge the concepts from the other two datasets by using a WordTree as in [27]. This results in 340 concepts in the final concept graph. The concept graph thus makes the symbolic graph reasoning layer identical across all three datasets, so its weights can easily be shared across datasets. We fix the moving means and variances in the batch normalization layers of ResNet-101 during finetuning. We adopt standard SGD optimization. Inspired by [6], we use the "poly" learning rate policy and set the base learning rate to 2.5e-3 for newly initialized layers and 2.5e-4 for pretrained layers. We train 64 epochs for Coco-Stuff and PASCAL-Context, and 120 epochs for ADE20K. For data augmentation, we adopt random flipping, random cropping and random resizing between 0.5 and 2 for all datasets. Due to the GPU memory limitation, the batch size is 6. The input crop size is set to 513 × 513.\n\n4.1.1 Comparison with the state of the art\n\nTables 1, 2 and 3 report comparisons with recent state-of-the-art methods on Coco-Stuff, Pascal-Context and ADE20K, respectively. 
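The "poly" learning-rate policy mentioned above follows [6] and can be sketched in one line. Note the decay power is an assumption on our part: the paper does not state it, and 0.9 is the value used by DeepLab [6]:

```python
def poly_lr(base_lr, step, max_step, power=0.9):
    """DeepLab-style "poly" decay: base_lr * (1 - step/max_step) ** power.

    The power (0.9) is the DeepLab default, assumed here since the paper
    only names the policy."""
    return base_lr * (1.0 - step / max_step) ** power

# Newly initialized layers start at 2.5e-3, pretrained layers at 2.5e-4,
# and both decay to zero over training.
assert poly_lr(2.5e-3, 0, 1000) == 2.5e-3
assert poly_lr(2.5e-4, 0, 1000) == 2.5e-4
assert poly_lr(2.5e-3, 1000, 1000) == 0.0
```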
Incorporating our SGR layer significantly outperforms existing methods on all three datasets, demonstrating the effectiveness of performing explicit graph reasoning beyond local convolutions for large-scale pixel-level recognition. Figure 3 shows a qualitative comparison with the baseline "Deeplabv2 [6]". Our SGR obtains better segmentation performance, especially for some rare classes (e.g. umbrella, teddy bear), benefiting from the joint reasoning with frequent concepts over the concept hierarchy graph. In particular, applying techniques that incorporate high-level semantic constraints designed for the classification task to pixel-wise recognition is not trivial, since associating prior knowledge with dense pixels is itself difficult. Prior works [38, 10, 49] also attempt to implicitly facilitate network learning with a hierarchical classification objective. The very recent DSSPN [27] directly designs a network layer for each parent concept. However, this method is hard to scale up to a large-scale concept set and results in redundant predictions for pixels that are unlikely to belong to a specific concept. \n\nTable 3: Comparison on the ADE20K val set [52] (%). "Conditional Softmax [38]", "Word2Vec [10]" and "Joint-Cosine [49]" use VGG as the backbone. We use "DeepLabv2 (ResNet-101) [6]" as the baseline.\nMethod | mean IoU | pixel acc.\nFCN [31] | 29.39 | 71.32\nSegNet [2] | 21.64 | 71.00\nDilatedNet [47] | 32.31 | 73.55\nCascadeNet [52] | 34.90 | 74.52\nResNet-101, 2 conv [45] | 39.40 | 79.07\nPSPNet (ResNet-101) DA, AL [50] | 41.96 | 80.64\nConditional Softmax [38] | 31.27 | 72.23\nWord2Vec [10] | 29.18 | 71.31\nJoint-Cosine [49] | 31.52 | 73.15\nDeepLabv2 (ResNet-101) [6] | 38.97 | 79.01\nDSSPN (ResNet-101) [27] | 42.03 | 81.21\nOur SGR (ResNet-101) | 44.32 | 81.43\n\n
Unlike prior methods, the proposed SGR layer achieves better results by adding only one reasoning layer while preserving both good computation and memory efficiency.\n\nTable 4: Curves of the training losses on Coco-Stuff for Deeplabv2 (Baseline) [6] and our three variants. Following [6], the loss is the summation of the losses for inputs of three scales (i.e. 1, 0.75, 0.5).\n\n4.1.2 Ablation studies\nWhich ConvBlock to add the SGR layer to? Table 1 and Table 4 compare variants that add a single SGR layer into different stages of ResNet-101. "SGR ConvBlock4" means the SGR layer is added right before the last residual block of res4, while all other variants add the SGR layer before the last residual block of res5 (the final residual block). The performance of "SGR ConvBlock4" is worse than "Our SGR (ResNet-101)", while using an SGR layer for both res4 and res5 ("Our SGR (ResNet-101 2-layer)") can slightly improve the results. Note that in order to use the pretrained weights of ResNet-101, "Our SGR (ResNet-101 2-layer)" directly fuses the prediction results from the two SGR layers after res4 and res5 via summation to get the final prediction. One possible explanation for this observation is that the final res5 encodes more semantically abstracted features, which are more suitable for conducting symbolic graph reasoning. Furthermore, we find that removing the residual connection in Eqn. 6 decreases the final performance but is still better than other baselines, by comparing "SGR (w/o residual)" with our full SGR. The reason is that the SGR layer induces smoother local features enhanced by global reasoning and thus may degrade some discriminative capability at boundaries.\nThe effect of semantic-to-local mapping. Note that our SGR learns distinct voting weights and mapping weights in the local-to-semantic voting module and semantic-to-local mapping module, respectively. 
The advantages of re-evaluating the mapping weights can be seen by comparing "Our SGR (ResNet-101)" with "SGR (w/o mapping)" in both testing performance and training convergence in Table 1 and Table 4. This justifies that estimating new semantic-to-local mapping weights lets the reasoning process better accommodate the evolved feature distributions after graph reasoning; otherwise the evolved symbolic nodes would be misaligned with the local features.\nDifferent prior knowledge graphs. As discussed in Section 3.1, our SGR layer is general for any form of knowledge graph with either soft or hard edge weights. We thus evaluate the results of leveraging distinct knowledge graphs in Table 1. First, the class concurrence graph is often used to represent the frequency of any two concepts appearing in one image, which depicts inter-class rationality in a statistical view. We calculate the class concurrence graph from all training images of Coco-Stuff and feed it as the input of the SGR layer, as "SGR (concurrence graph)". We can see that incorporating a concurrence-driven SGR layer can also boost the segmentation performance, but it is slightly inferior to that with the concept hierarchy. Second, we also sequentially stack one SGR layer with the hierarchy graph and one layer with the concurrence graph, leading to a hybrid version, "Our SGR (ResNet-101 Hybrid)". This variant achieves the best performance among all models, verifying the benefits of boosting the semantic reasoning capability with mixtures of knowledge constraints. Finally, we further explore a rich scene graph that includes concepts, attributes and relationships for encoding higher-level semantics, as the "SGR (scene graph)" variant. 
Following [24], the scene graph is constructed from the Visual Genome [20]. For simplicity, we only select the object categories, attributes, and predicates that appear at least 30 times and are associated with our targeted 182 concepts in Coco-Stuff. This leads to an undirected graph with 312 object nodes, 160 attribute nodes, and 68 predicate nodes. “SGR (scene graph)” is slightly worse than “Our SGR (ResNet-101)” but better than “SGR (concurrence graph)”. Based on all these studies, we use the concept hierarchy graph for all remaining experiments, balancing efficiency and effectiveness.

Figure 3: Qualitative comparison results on the Coco-Stuff dataset.

Table 5: Comparison of model depth, number of parameters (M), and test errors (%) on CIFAR-100. “SGR” and “SGR 2-layer” indicate the results of appending one or two SGR layers on the final dense block of the baseline network (DenseNet-100), respectively.

Method | ResNet [13] | Wide [48] | ResNeXt-29 [46] | DenseNet [16] | DenseNet-100 [16] (baseline) | SGR    | SGR 2-layer
Depth  | 1001        | 28        | 29              | 190           | 100                          | 100+1* | 100+2*
Params | 16.1M       | 36.5M     | 68.1M           | 25.6M         | 7.0M                         | 7.5M   | 8.1M
Error  | 22.71       | 20.50     | 17.31           | 17.18         | 22.19                        | 17.68  | 17.29

Transferring SGR learned from one domain to other domains. Our SGR layer naturally learns to encode explicit semantic meanings for the general symbolic nodes after voting from local features, so its weights can be easily transferred from one domain to others as long as those domains share one prior graph. Since a single hierarchy graph is used for both the Coco-Stuff and PASCAL-Context datasets, we can use the SGR model pretrained on Coco-Stuff to initialize the training on PASCAL-Context, as reported in Table 2.
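A minimal sketch of why this transfer works, using hypothetical parameter names and illustrative sizes (the keys, S, and Dc are our assumptions): every SGR parameter is sized by the shared prior graph rather than by a dataset's label set, so the layer's weights can be copied verbatim even though the two classifiers have different output dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, Dc = 148, 128  # illustrative: nodes in a shared concept hierarchy, node feature dim

# Hypothetical parameter dictionaries: SGR parameters depend only on the shared
# graph (S nodes), while each classifier depends on its dataset's label set.
coco = {"sgr.W_graph": rng.standard_normal((Dc, Dc)),
        "sgr.W_vote":  rng.standard_normal((Dc, S)),
        "classifier":  rng.standard_normal((Dc, 182))}   # 182 Coco-Stuff classes

pascal = {"sgr.W_graph": np.zeros((Dc, Dc)),
          "sgr.W_vote":  np.zeros((Dc, S)),
          "classifier":  rng.standard_normal((Dc, 59))}  # 59 PASCAL-Context classes

# "Transfer SGR": copy every SGR parameter; shapes match because both models
# share the same prior graph. The classifiers are left untouched.
for k, v in coco.items():
    if k.startswith("sgr.") and pascal[k].shape == v.shape:
        pascal[k] = v.copy()
```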
\u201cOur SGR (Transfer convs)\" denotes only the\npretrained weights of residual blocks are used while \u201cOur SGR (Transfer SGR)\" is the variant of\nfurther using the parameters of SGR layer. We can see that transferring parameters of SGR layer can\ngive more improvements than that of solely transferring convolution blocks.\n\n4.2\n\nImage classi\ufb01cation results\n\nWe further conduct studies for image classi\ufb01cation task on CIFAR-100 [21] consisting of 50K training\nimages and 10K test images in 100 classes. We explore how much SGR will improve the performance\nof a baseline network, DenseNet-100 [16]. We append SGR layers on the \ufb01nal dense block which\nproduces 342 feature maps with 8 \u00d7 8 size. We \ufb01rst use a 1 \u00d7 1 convolution layer to reduce 342-d\nfeature into 128-d, and then sequentially employ one SGR layer, global average pooling and a\nlinear layer to produce \ufb01nal classi\ufb01cation. The concept hierarchy graph with 148 symbolic nodes\nis generated by mapping 100 classes into WordTree, similar to the strategy used in segmentation\nexperiments, included in Supplementary Material. We set Dl and Dc as 128. During training, we use\na mini-batch size of 64 on two GPUs using a cosine learning rate scheduling [16] for 600 epochs.\nMore comparisons in Table 5 demonstrate that our SGR can improve the performance of the baseline\nnetwork, bene\ufb01ting from the enhanced features via global reasoning. It achieves comparable results\nwith state-of-the-art methods with considerable less model complexity.\n\n8\n\nInputGroundtruthOur\tSGRDeeplabv2(baseline)\f5 Conclusion\n\nTo endow the local convolution networks with the capability of global graph reasoning, we introduce a\nSymbolic Graph Reasoning (SGR) layer, which harnesses external human knowledge to enhance local\nfeature representation. 
The proposed SGR layer is general, lightweight, and compatible with existing convolution networks, consisting of a local-to-semantic voting module, a graph reasoning module, and a semantic-to-local mapping module. Extensive experiments on three public semantic segmentation benchmarks and one image classification dataset demonstrate its superior performance. We hope the design of our SGR can help boost research on the global reasoning properties of convolution networks and benefit various applications in the community.

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant No. 2018YFC0830103, in part by the National High Level Talents Special Support Plan (Ten Thousand Talents Program), and in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61622214 and 61836012.

References

[1] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr. Higher order conditional random fields in deep neural networks. In ECCV, pages 524–540, 2016.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. In CVPR, 2015.
[3] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 14(2):143–177, 1982.
[4] H. Caesar, J. Uijlings, and V. Ferrari. Coco-Stuff: Thing and stuff classes in context. arXiv preprint arXiv:1612.03716, 2016.
[5] S. Chandra, N. Usunier, and I. Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In ICCV, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[7] X. Chen, L.-J. Li, L.
Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In CVPR, 2018.
[8] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, pages 1635–1643, 2015.
[9] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In ECCV, pages 48–64, 2014.
[10] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
[11] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[14] G. Hinton, N. Frosst, and S. Sabour. Matrix capsules with EM routing. In ICLR, 2018.
[15] H. Hu, Z. Deng, G.-T. Zhou, F. Sha, and G. Mori. LabelBank: Revisiting global perspectives for semantic segmentation. arXiv preprint arXiv:1703.09891, 2017.
[16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[17] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
[18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[19] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pages 109–117, 2011.
[20] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K.
Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[22] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[23] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539, 2011.
[24] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In CVPR, 2017.
[25] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. Interpretable structure-evolving LSTM. In CVPR, 2017.
[26] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. In ECCV, 2016.
[27] X. Liang, H. Zhou, and E. Xing. Dynamic-structured semantic propagation network. In CVPR, 2018.
[28] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
[29] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, pages 3194–3203, 2016.
[30] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
[31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[32] K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844, 2016.
[33] T. M. Mitchell, W. W. Cohen, E.
R. Hruschka Jr, P. P. Talukdar, J. Betteridge, A. Carlson, B. D. Mishra, M. Gardner, B. Kisiel, J. Krishnamurthy, et al. Never ending learning. In AAAI, pages 2302–2310, 2015.
[34] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[35] A. Newell. Physical symbol systems. Cognitive Science, 4(2):135–183, 1980.
[36] M. Niepert, M. Ahmed, and K. Kutzkov. Learning convolutional neural networks for graphs. In ICML, pages 2014–2023, 2016.
[37] V. Ordonez, J. Deng, Y. Choi, A. C. Berg, and T. L. Berg. From large scale image categorization to entry-level categories. In ICCV, pages 2768–2775, 2013.
[38] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
[39] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.
[40] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[41] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.
[42] B. Shuai, Z. Zuo, B. Wang, and G. Wang. Scene segmentation with DAG-recurrent neural networks. TPAMI, 2017.
[43] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
[44] Z. Wu, C. Shen, and A. v. d. Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
[45] Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
[46] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
[47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[48] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[49] H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba. Open vocabulary scene parsing. In ICCV, 2017.
[50] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
[51] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, pages 1529–1537, 2015.
[52] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.
[53] Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670, 2015.