{"title": "Understanding Attention and Generalization in Graph Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4202, "page_last": 4212, "abstract": "We aim to better understand attention over nodes in graph neural networks (GNNs) and identify factors influencing its effectiveness. We particularly focus on the ability of attention GNNs to generalize to larger, more complex or noisy graphs. Motivated by insights from the work on Graph Isomorphism Networks, we design simple graph reasoning tasks that allow us to study attention in a controlled environment. We find that under typical conditions the effect of attention is negligible or even harmful, but under certain conditions it provides an exceptional gain in performance of more than 60% in some of our classification tasks. Satisfying these conditions in practice is challenging and often requires optimal initialization or supervised training of attention. We propose an alternative recipe and train attention in a weakly-supervised fashion that approaches the performance of supervised models, and, compared to unsupervised models, improves results on several synthetic as well as real datasets. Source code and datasets are available at https://github.com/bknyaz/graph_attention_pool.", "full_text": "Understanding Attention and Generalization\n\nin Graph Neural Networks\n\nBoris Knyazev\n\nUniversity of Guelph\n\nVector Institute\n\nbknyazev@uoguelph.ca\n\nGraham W. Taylor\nUniversity of Guelph\n\ngwtaylor@uoguelph.ca\n\nVector Institute, Canada CIFAR AI Chair\n\nmohamed@robust.ai\n\nMohamed R. Amer\u2217\n\nRobust.AI\n\nAbstract\n\nWe aim to better understand attention over nodes in graph neural networks (GNNs)\nand identify factors in\ufb02uencing its effectiveness. We particularly focus on the ability\nof attention GNNs to generalize to larger, more complex or noisy graphs. 
Motivated by insights from the work on Graph Isomorphism Networks, we design simple graph reasoning tasks that allow us to study attention in a controlled environment. We find that under typical conditions the effect of attention is negligible or even harmful, but under certain conditions it provides an exceptional gain in performance of more than 60% in some of our classification tasks. Satisfying these conditions in practice is challenging and often requires optimal initialization or supervised training of attention. We propose an alternative recipe and train attention in a weakly-supervised fashion that approaches the performance of supervised models and, compared to unsupervised models, improves results on several synthetic as well as real datasets. Source code and datasets are available at https://github.com/bknyaz/graph_attention_pool.

1 Attention meets pooling in graph neural networks

The practical importance of attention in deep learning is well-established, and there are many arguments in its favor [1], including interpretability [2, 3]. In graph neural networks (GNNs), attention can be defined over edges [4, 5] or over nodes [6]. In this work, we focus on the latter because, despite being equally important in certain tasks, it is not as thoroughly studied [7]. To begin our description, we first establish a connection between attention and pooling methods.
In convolutional neural networks (CNNs), pooling methods are generally based on uniformly dividing the regular grid (such as a one-dimensional temporal grid in audio) into local regions and taking a single value from each region (average, weighted average, max, stochastic, etc.), while attention in CNNs is typically a separate mechanism that weights the C-dimensional input X ∈ R^{N×C}:

Z = α ⊙ X,    (1)

where Z_i = α_i X_i is the output for unit (node in a graph) i, Σ_i α_i = 1, ⊙ denotes element-wise multiplication, and N is the number of units in the input (i.e. the number of nodes in a graph).
In GNNs, pooling methods generally follow the same pattern as in CNNs, but the pooling regions (sets of nodes) are often found based on clustering [8, 9, 10], since there is no grid that can be uniformly divided into regions in the same way across all examples (graphs) in the dataset. Recently, top-k pooling [11] was proposed, diverging from other methods: instead of clustering "similar" nodes, it propagates only part of the input, and this part is not uniformly sampled from the input. Top-k pooling can thus select some local part of the input graph, completely ignoring the rest. For this reason, at first glance, it does not appear to be a logical approach.

*Most of this work was done while the author was at SRI International.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

(a) COLORS   (b) TRIANGLES   (c) MNIST

Figure 1: Three tasks with a controlled environment that we consider in this work. The values inside the nodes are ground truth attention coefficients, α^GT_i, which we find heuristically (see Section 3.1).

However, we can notice that the pooled feature maps in [11, Eq. 2] are computed in the same way as the attention outputs Z in Eq. 1 above, if we rewrite their Eq. 
2 in the following way:

Z_i = α_i X_i, ∀i ∈ P;  Z_i = ∅ otherwise,    (2)

where P is the set of indices of pooled nodes, |P| ≤ N, and ∅ denotes that the unit is absent from the output. The only difference between Eq. 2 and Eq. 1 is that Z ∈ R^{|P|×C}, i.e. the number of units in the output is smaller or, formally, there exists a ratio r = |P|/N ≤ 1 of preserved nodes. We leverage this finding to integrate attention and pooling into a unified computational block of a GNN. In CNNs, by contrast, this is challenging to achieve, because the input is defined on a regular grid, so we need to maintain the same resolution for all examples in the dataset after each pooling layer. In GNNs, we can remove any number of nodes, so that the next layer receives a smaller graph. When applied to the input layer, this form of attention-based pooling also brings interpretability of predictions, since the network makes a decision based only on the pooled nodes.
Despite the appealing nature of attention, it is often unstable to train, and the conditions under which it fails or succeeds are unclear. Motivated by insights from the recently proposed Graph Isomorphism Networks (GIN) [12], we design two simple graph reasoning tasks that allow us to study attention in a controlled environment where we know the ground truth attention. The first task is counting colors in a graph (COLORS), where a color is a unique discrete feature. The second task is counting the number of triangles in a graph (TRIANGLES). We confirm our observations on a standard benchmark, MNIST [13] (Figure 1), and identify factors influencing the effectiveness of attention.
Our synthetic experiments also allow us to study the ability of attention GNNs to generalize to larger, more complex or noisy graphs.
Aiming to provide a recipe to train more effective, stable and robust attention GNNs, we propose a weakly-supervised scheme to train attention that does not require ground truth attention scores and, as such, is agnostic to the dataset and the choice of model. We validate the effectiveness of this scheme on our synthetic datasets, as well as on MNIST and on real graph classification benchmarks in which ground truth attention is unavailable and hard to define, namely COLLAB [14, 15], PROTEINS [16], and D&D [17].

2 Model

We study two variants of GNNs: Graph Convolutional Networks (GCN) [18] and Graph Isomorphism Networks (GIN) [12]. One of the main ideas of GIN is to replace the MEAN aggregator over nodes, such as the one in GCN, with a SUM aggregator, and to add more fully-connected layers after aggregating neighboring node features. The resulting model can distinguish a wider range of graph structures than previous models [12, Figure 3].

2.1 Thresholding by attention coefficients

To pool the nodes in a graph using the method from [11], a predefined ratio r = |P|/N (Eq. 2) must be chosen for the entire dataset. For instance, for r = 0.8 only 80% of nodes are left after each pooling layer. Intuitively, it is clear that this ratio should be different for small and large graphs. 
Therefore, we propose to choose a threshold α̃ such that only nodes with attention values α_i > α̃ are propagated:

Z_i = α_i X_i, ∀i : α_i > α̃;  Z_i = ∅ otherwise.    (3)

Note that dropping nodes from a graph is different from keeping nodes with very small, or even zero, feature values, because a bias is added to node features by the following graph convolution layer, affecting the features of neighbors. An important potential issue of dropping nodes is the change of graph structure and the emergence of isolated nodes. However, in our experiments we typically observe that the model predicts similar α for nearby nodes, so that an entire local neighborhood is pooled or dropped, as opposed to clustering-based methods, which collapse each neighborhood to a single node. We provide a quantitative and qualitative comparison in Section 3.

2.2 Attention subnetwork

To train an attention model that predicts the coefficients for nodes, we consider two approaches: (1) Linear Projection [11], where a single-layer projection p ∈ R^C is trained: α_pre = Xp; and (2) DiffPool [10], where a separate GNN is trained:

α_pre = GNN(A, X),    (4)

where A is the adjacency matrix of the graph. In all cases, we use a softmax activation [1, 2] instead of the tanh in [11], because it provides more interpretable results and encourages sparse outputs: α = softmax(α_pre). To train attention in a supervised or weakly-supervised way, we use the Kullback-Leibler divergence loss (see Section 3.3).

2.3 ChebyGIN

In some of our experiments, the performance of both GCNs and GINs is quite poor and, consequently, it is also hard for the attention subnetwork to learn. By combining GIN with ChebyNet [8], we propose a stronger model, ChebyGIN. 
ChebyNet is a multiscale extension of GCN [18]: for the first scale, K = 1, node features are left unchanged; for K = 2, features are averaged over one-hop neighbors; for K = 3, over two-hop neighbors, and so forth. To implement the SUM aggregator in ChebyGIN, we multiply features by node degrees D_i = Σ_j A_ij starting from K = 2. We also add more fully-connected layers after feature aggregation, as in GIN.

3 Experiments

We introduce the color counting task (COLORS) and the triangle counting task (TRIANGLES), in which we generate synthetic training and test graphs. We also experiment with MNIST images [13] and three molecule and social datasets. In the COLORS, TRIANGLES and MNIST tasks (Figure 1), we assume that we know ground truth attention, i.e. for each node i we heuristically define its importance in solving the task correctly, α^GT_i ∈ [0, 1], which is necessary to train (in the supervised case) and evaluate our attention models.

3.1 Datasets

COLORS. We introduce the color counting task. We generate random graphs where the features of each node are assigned one of three one-hot values (colors): [1,0,0] (red), [0,1,0] (green), [0,0,1] (blue). The task is to count the number of green nodes, N_green. This is a trivial task, but it lets us study the influence of the initialization of the attention model p ∈ R^3 on the training dynamics. In this task, graph structure is unimportant and the edges of graphs act as a medium to exchange node features. Ground truth attention is α^GT_i = 1/N_green when i corresponds to a green node and α^GT_i = 0 otherwise. We also extend this dataset to higher n-dimensional cases, p ∈ R^n, to study how model performance changes with n. In these cases, node features are still one-hot vectors and we classify the number of nodes whose second feature is one.
TRIANGLES. 
Counting the number of triangles in a graph is a well-known task which can be solved analytically by computing trace(A^3)/6, where A is the adjacency matrix. This task turned out to be hard for GNNs, so we add node degree features as one-hot vectors to all graphs, so that the model can exploit both graph structure and features. Compared to the COLORS task, here it is more challenging to study the effect of initializing p, but we can still calculate ground truth attention as α^GT_i = T_i / Σ_i T_i, where T_i is the number of triangles that include node i, so that α^GT_i = 0 for nodes that are not part of triangles.
MNIST-75SP. MNIST [13] contains 70k grayscale images of size 28×28 pixels. While each of the 784 pixels could be represented as a node, we follow [19, 20] and consider an alternative approach to highlight the ability of GNNs to work on irregular grids. In particular, each image can be represented as a small set of superpixels without losing essential class-specific information (see Figure 2). We compute SLIC [21] superpixels for each image and build a graph in which each node corresponds to a superpixel, with node features being pixel intensity values and the coordinates of their centers of masses. We extract N ≤ 75 superpixels, hence the dataset is denoted MNIST-75SP. Edges are formed based on the spatial distance between superpixel centers, as in [8, Eq. 8]. Each image depicts a handwritten digit from 0 to 9, and the task is to classify the image. Ground truth attention is taken to be α^GT_i = 1/N_nonzero for superpixels with nonzero intensity, where N_nonzero is the total number of such superpixels. The idea is that only nonzero superpixels determine the digit class.
Molecule and social datasets. 
We extend our study to more practical cases, where ground truth attention is not available, and experiment with protein datasets, PROTEINS [16] and D&D [17], and a scientific collaboration dataset, COLLAB [14, 15]. These are standard graph classification benchmarks. A standard way to evaluate models on these datasets is to perform 10-fold cross-validation and report average accuracy [22, 10]. In this work, we are concerned with a model's ability to generalize to larger and more complex or noisy graphs; therefore, we generate splits based on the number of nodes. For instance, for PROTEINS we train on graphs with N ≤ 25 nodes and test on graphs with 6 ≤ N ≤ 620 nodes (see Table 2 for details about the splits of other datasets and results).
A detailed description of tasks and model hyperparameters is provided in the Supp. Material.

3.2 Generalization to larger and noisy graphs

One of the core strengths of attention is that it makes it easier to generalize to unseen, potentially more complex and/or noisy, inputs by reducing them to better resemble certain inputs in the training set. To examine this phenomenon, for the COLORS and TRIANGLES tasks we add test graphs that can be several times larger (TEST-LARGE) than the training ones. For COLORS we further extend the test set by adding unseen colors (TEST-LARGEC) in the format [c1, c2, c3, c4], where c_i = 0 for i ≠ 2 if c2 = 1, and c_i ∈ [0, 1] for i ≠ 2 if c2 = 0, i.e. there are no new colors that have nonzero values in the green channel. 
This can be interpreted as adding mixtures of red, blue and transparency channels, with nine possible colors in total, as opposed to three in the training set (Figure 2).

Figure 2: Examples from the training and test sets. COLORS: TRAIN (N ≤ 25), TEST-ORIG (N ≤ 25), TEST-LARGE (25 < N ≤ 200), TEST-LARGEC (25 < N ≤ 200). TRIANGLES: TRAIN (N ≤ 25), TEST-ORIG (N ≤ 25), TEST-LARGE (25 < N ≤ 100). MNIST-75SP: TRAIN (N = 64), TEST-ORIG (N = 63), TEST-NOISY (N = 63), TEST-NOISYC (N = 63). For COLORS, the correct label is N_green = 4 in all cases; for TRIANGLES, N_tri = 3 and color intensities denote ground truth attention values α^GT. The range of the number of nodes, N, is shown in each case. For MNIST-75SP, we visualize graphs for digit 7 by assigning the average intensity value to all pixels within a superpixel. Even though superpixels have certain shapes and borders between each other (visible only on noisy graphs), we feed only superpixel intensities and the coordinates of their centers of masses to our GNNs.

Neural networks (NNs) have been observed to be brittle when fed test samples corrupted in a subtle way, e.g. by adding noise [23] or changing a sample adversarially [24], such that a human can still recognize them fairly well. To study this problem, the test sets of standard image benchmarks have been enlarged by adding corrupted images [25].
Graph neural networks, as a particular case of NNs, inherit this weakness. The attention mechanism, if designed and trained properly, can improve a network's robustness by attending only to the important parts (nodes) of the data and ignoring the misleading ones. In this work, we explore the ability of GNNs with and without attention to generalize to noisy graphs and unseen node features. 
This should help us understand the limits of GNNs with attention, and potentially of NNs in general, and the conditions under which attention succeeds and when it does not. To this end, we generate two additional test sets for MNIST-75SP. In the first set, TEST-NOISY, we add Gaussian noise, drawn from N(0, 0.4), to superpixel intensity features, i.e. the shapes and coordinates of superpixels are the same as in the original clean test set. In the second set, TEST-NOISYC, we colorize images by adding two more channels and add independent Gaussian noise, drawn from N(0, 0.6), to each channel (Figure 2).

3.3 Network architectures and training

We build 2-layer GNNs for COLORS and 3-layer GNNs for the other tasks, with 64 filters in each layer, except for MNIST-75SP, where we have more filters. Our baselines are GNNs with global sum or max pooling (gpool), DiffPool [10] and top-k pooling [11]. We add two layers of our pooling for TRIANGLES, each of which is a GNN with 3 layers and 32 filters (Eq. 4), whereas a single pooling layer in the form of a vector p is used in the other cases. We train all models with Adam [26], learning rate 1e-3, batch size 32, and weight decay 1e-4 (see the Supp. Material for details).
For COLORS and TRIANGLES we minimize the regression loss (MSE), and for the other tasks cross-entropy (CE), denoted as L_MSE/CE. For experiments with supervised and weakly-supervised (described below in Section 3.4) attention, we additionally minimize the Kullback-Leibler (KL) divergence loss between ground truth attention α^GT and predicted coefficients α. The KL term is weighted by a scale β, so that the total loss for some training graph with N nodes becomes:

L = L_MSE/CE + (β/N) Σ_i α^GT_i log(α^GT_i / α_i).    (5)

We repeat experiments at least 10 times and report average accuracy and standard deviation in Tables 1 and 2. For COLORS we run experiments 100 times, since we observe larger variance. 
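The total loss in Eq. 5 can be sketched in numpy as follows (our own illustrative code; the paper's released implementation may differ in details such as numerical stabilization):

```python
import numpy as np

def total_loss(task_loss, alpha_gt, alpha, beta=1.0):
    # Eq. 5: task loss (MSE or CE) plus the scaled KL divergence between
    # ground truth attention alpha_gt and predicted attention alpha
    N = alpha.shape[0]
    eps = 1e-12  # guard against log(0) at zero-attention nodes (our choice)
    kl = np.sum(alpha_gt * np.log((alpha_gt + eps) / (alpha + eps)))
    return task_loss + beta / N * kl

alpha_gt = np.array([0.5, 0.5, 0.0, 0.0])           # two "green" nodes out of four
loss_perfect = total_loss(0.1, alpha_gt, alpha_gt)  # KL term vanishes
loss_uniform = total_loss(0.1, alpha_gt, np.full(4, 0.25))
```

With perfect predicted attention the KL term is zero and only the task loss remains; spreading attention uniformly over all four nodes is penalized.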
In Table 1 we report results on all test subsets independently. In all other experiments on COLORS, TRIANGLES and MNIST-75SP, we report average accuracy on the combined test set. For COLLAB, PROTEINS and D&D, we run experiments 10 times using the splits described in Section 3.1.
The only hyperparameters that we tune in our experiments are the threshold α̃ in our method (Eq. 3), the ratio r in top-k (Eq. 2), and β in Eq. 5. For the synthetic datasets, we tune them on a validation set generated in the same way as TEST-ORIG. For MNIST-75SP, we use part of the training set. For COLLAB, PROTEINS and D&D, we tune them using 10-fold cross-validation on the training set.
Attention correctness. We evaluate attention correctness using the area under the ROC curve (AUC) as an alternative to other methods, such as [27], which can be overoptimistic in some extreme cases, e.g. when all attention is concentrated in a single node or attention is uniformly spread over all nodes. AUC allows us to evaluate the ranking of α instead of their absolute values. Compared to ranking metrics, such as rank correlation, AUC enables us to directly choose a pooling threshold α̃ from the ROC curve by finding a desired balance between false positives (pooling unimportant nodes) and false negatives (dropping important nodes).
To evaluate the attention correctness of models with global pooling, we follow an idea from convolutional neural networks [28]. After training a model, we remove node i ∈ [1, N] and compute the absolute difference from the prediction y for the original graph:

α^WS_i = |y_i − y| / Σ_{j=1}^{N} |y_j − y|,    (6)

where y_i is the model's prediction for the graph without node i. While this method shows surprisingly high AUC in some tasks, it is not built into training and thus does not help to train a better model; it only implicitly interprets a model's prediction (Figures 5 and 7). 
However, these results inspired us to design the weakly-supervised method described below.

3.4 Weakly-supervised attention supervision

Although for COLORS, TRIANGLES and MNIST-75SP we can define ground truth attention without manual labeling, in practice this is usually not the case: such annotations are hard to define and expensive to obtain, or it is even unclear how to produce them. Based on the results in Table 1, supervision of attention is necessary to reveal its power. Therefore, we propose a weakly-supervised approach, agnostic to the choice of dataset and model, that does not require ground truth attention labels but can improve a model's ability to generalize. Our approach is based on generating attention coefficients α^WS_i (Eq. 6) and using them as labels to train our attention model with the loss defined in Eq. 5. We apply this approach to COLORS, TRIANGLES and MNIST-75SP and observe performance and robustness close to those of supervised models. We also apply it to COLLAB, PROTEINS and D&D, and in all cases we are able to improve results compared to unsupervised attention.
Training weakly-supervised models. Assume we want to train model A with "weak-sup" attention on a dataset without ground truth attention. We first need to train a model B that has the same architecture as A, but does not have any attention/pooling between graph convolution layers; model B has only global pooling. After training B with the L_MSE/CE loss, we evaluate the training graphs on B in the same way as during the computation of α^WS in Eq. 6. In particular, for each training graph G with N nodes, we first make a prediction y for the entire G. Then, for each i ∈ [1, N], we remove node i from G and feed this reduced graph with N − 1 nodes to model B, recording the model's prediction y_i. We then use Eq. 6 to compute α^WS based on y and y_i. 
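The node-removal recipe of Eq. 6 can be sketched as follows. Here `model_B` is a toy stand-in for the trained global-pooling model B, and all names are ours; only the α^WS formula itself follows the paper:

```python
import numpy as np

def model_B(A, X):
    # Toy stand-in for the trained global-pooling model B: it predicts the sum
    # of the first node feature. Any trained GNN returning a prediction works.
    return X[:, 0].sum()

def weak_labels(A, X):
    # Eq. 6: alpha_WS_i = |y_i - y| / sum_j |y_j - y|, where y_i is the
    # prediction after removing node i (its adjacency row/column and features)
    N = X.shape[0]
    y = model_B(A, X)
    diffs = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        diffs[i] = abs(model_B(A[np.ix_(keep, keep)], X[keep]) - y)
    return diffs / diffs.sum()

A = np.zeros((3, 3))                  # graph structure is unused by the toy model
X = np.array([[2.0], [1.0], [0.0]])
alpha_ws = weak_labels(A, X)          # the node with feature 0 gets zero weight
```

The resulting α^WS sums to 1 and assigns the largest weight to the node whose removal changes the prediction the most, which is then used as the training target for the attention subnetwork.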
Now, we can train A and use α^WS instead of the ground truth α^GT in Eq. 5 to optimize both the MSE/CE and KL losses.

4 Analysis of results

In this work, we aim to better understand attention and generalization in graph neural networks and, based on our empirical findings, below we provide our analysis of the following questions.
How powerful is attention over nodes in GNNs? Our results on the COLORS, TRIANGLES and MNIST-75SP datasets suggest that the main strength of attention over nodes in GNNs is the ability to generalize to more complex or noisy graphs at test time. This ability essentially transforms a model that fails to generalize into a fairly robust one. Indeed, the classification accuracy gap for COLORS-LARGEC between the best model without supervised attention (GIN with global pooling) and a similar model with supervised attention (GIN, sup) is more than 60%. For TRIANGLES-LARGE this gap is 18%, and for MNIST-75SP-NOISY it is more than 12%. The gap is even larger when compared to the upper bound cases, indicating that our supervised models can be further tuned and improved. Models with supervised or weakly-supervised attention also have a narrower spread of results (Figure 3).

Table 1: Results on three tasks for different test subsets. ± denotes standard deviation, not shown in the case of small values (large values are explained in Section 4). ATTN denotes attention accuracy in terms of AUC and is computed on the combined test set. The best result in each column (ignoring upper bound results) is bolded in the original table; highlighting there additionally marks poor results (relatively low accuracy and/or high variance) and failed cases (accuracy close to random and/or extremely high variance). † For COLORS and MNIST-75SP, ChebyNets are used instead of ChebyGINs, as described in the Supp. Material.

                     |            COLORS             |       TRIANGLES       |            MNIST-75SP
                     | ORIG   LARGE  LARGEC  ATTN    | ORIG   LARGE   ATTN   | ORIG      NOISY     NOISYC    ATTN
Global pool:
 GCN                 | 97     72±15  20±3    99.6    | 46±1   23±1    79     | 78.3±2    38±4      36±4      72±2
 GIN                 | 96±10  71±22  26±11   99.2    | 50±1   22±1    77     | 87.6±3    55±11     51±12     71±5
 ChebyGIN†           | 100    93±12  15±7    99.8    | 66±1   30±1    79     | 97.4      80±12     79±11     72±3
Unsupervised:
 GIN, top-k          | 99.6   17±4   9±3     75±6    | 47±2   18±1    63±5   | 86±6      59±26     55±23     65±34
 GIN, ours           | 94±18  13±7   11±6    72±15   | 47±3   20±2    68±3   | 82.6±8    51±28     47±24     58±31
 ChebyGIN†, top-k    | 100    11±7   6±6     79±20   | 64±5   25±2    76±6   | 92.9±4    68±26     67±25     52±37
 ChebyGIN†, ours     | 80±30  16±10  11±6    67±31   | 67±3   26±2    77±4   | 94.6±3    80±23     77±22     78±31
Supervised:
 GIN, top-k          | 87±1   39±18  28±8    99.9    | 49±1   20±1    88     | 90.5±1    85.5±2    79±5      99.3
 GIN, ours           | 100    96±9   89±18   99.8    | 49±1   22±1    76±1   | 90.9±0.4  85.0±1    80±3      99.3
 ChebyGIN†, top-k    | 100    86±15  31±15   99.8    | 83±1   39±1    97     | 95.1±0.3  90.6±0.8  83±16     100
 ChebyGIN†, ours     | 100    94±8   75±17   99.8    | 88±1   48±1    96     | 95.4±0.2  92.3±0.4  86±16     100
Weak sup.:
 ChebyGIN†, ours     | 100    90±6   73±14   99.9    | 68±1   30±1    88     | 95.8±0.4  88.8±4    86±9      96.5±1
Upper bound:
 GIN                 | 100    100    100     100     | 94±1   85±2    100    | 93.6±0.4  90.8±1    90.8±1    100
 ChebyGIN†           | 100    100    100     100     | 99.8   99.4±1  100    | 96.9±0.1  94.8±0.3  95.1±0.3  100

Figure 3: Disentangling factors influencing attention and classification accuracy for COLORS (a-e) and TRIANGLES (f). Accuracies are computed over all test subsets. Notice the exponential growth of classification accuracy as a function of attention correctness (a, b); see the zoomed plots, (a)-zoomed and (b)-zoomed, for cases when attention AUC > 95%. (d) The probability of a good initialization is estimated as the proportion of cases where cosine similarity > 0.5; error bars indicate standard deviation. (c-e) show results using a higher-dimensional attention model, p ∈ R^n.

Figure 4: Influence of initialization on training dynamics for COLORS using GIN trained in the unsupervised way; panels show (a) bad initialization (cos. sim. = -0.75), (b) good initialization (cos. sim. = 0.75) and (c) optimal initialization (cos. sim. = 1.00) of unsupervised attention. For the supervised cases, see the Supp. Material. The nodes that should be pooled according to our ground truth prior must have larger attention values α. However, in the unsupervised case, only the model with an optimal initialization (c) reaches high accuracy, while the other models (a, b) are stuck in a suboptimal state and the wrong nodes are pooled, which degrades performance. 
In these experiments, we train models longer to see whether they can recover from a bad initialization.

What are the factors influencing the performance of GNNs with attention? We identify three key factors influencing the performance of GNNs with attention: the initialization of the attention model (i.e. the vector p or the GNN in Eq. 4), the strength of the main GNN model (i.e. the model that actually performs classification), and, finally, other hyperparameters of the attention and GNN models.
We highlight initialization as the critical factor. We ran 100 experiments on COLORS with random initializations of the vector p (Figure 3, (a-e)) and measured how the performance of both attention and classification is affected depending on how close (in terms of cosine similarity) the initialized p was to the optimal one, p = [0, 1, 0]. We disentangle the dependency between classification accuracy and cosine similarity into two functions to make the relationship clearer (Figure 3, (a, c)). Interestingly, we found that classification accuracy depends exponentially on attention correctness and becomes close to 100% only when attention is also close to being perfect. With slightly worse attention, even starting from 99% AUC, classification accuracy drops significantly. This is an important finding that may also hold for other, more realistic applications. In the TRIANGLES task we only partially confirm this finding, because our attention models could not achieve an AUC high enough to boost classification. However, judging by the upper bound results obtained by training with ground truth attention, we expect this boost to occur once attention becomes accurate enough.

[Figure 4 panels plot, over 1000 training epochs: attention AUC, average classification accuracy, the ratio of pooled nodes r, and the predicted attention coefficients α of "should be pooled" vs. "should be dropped" nodes.]

[Figure 5 panels: columns TEST-ORIG, TEST-NOISY, TEST-NOISYC (MNIST-75SP) and TRIANGLES TEST-LARGE; rows INPUT, DIFFPOOL, α^WS, α; example graphs with N = 93, 16, 93 and 27 nodes.]

Figure 5: Qualitative analysis. 
For MNIST-75SP (on the left) we show examples of input test images (top row), results of DiffPool [10] (second row), attention weights α_WS generated using a model with global pooling based on Eq. 6 (third row), and α predicted by our weakly-supervised model (bottom row). Both our attention-based pooling and DiffPool can be strong and interpretable depending on the task, but in our tasks DiffPool was inferior (see the Supp. Material). For TRIANGLES (on the right) we show an example of a test graph with N = 93 nodes and six triangles, together with the results of pooling based on ground truth attention weights α_GT (top row); in the bottom row we show attention weights predicted by our weakly-supervised model and the results of our threshold-based pooling (Eq. 3). Note that during training, our model encountered neither noisy images (MNIST-75SP) nor graphs with more than N = 25 nodes (TRIANGLES).

Why is the variance of some results so high? In Table 1 we report high variance of results, which is mainly due to initialization of the attention model, as explained above. This variance is also caused by initialization of other trainable parameters of a GNN, but we show that once the attention model is perfect, the other parameters can recover from a bad initialization, leading to better results. The opposite, however, is not true: we never observed recovery of a model with poorly initialized attention (Figure 4).

How does top-k compare to our threshold-based pooling method? Our method to attend to and pool nodes (Eq. 3) is based on top-k pooling [11], and we show that the proposed threshold-based pooling is superior in a principled way. When we use supervised attention, our results are better by more than 40% on COLORS-LARGEC, by 9% on TRIANGLES-LARGE and by 3% on MNIST-75SP.
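The essential difference between the two pooling schemes can be sketched in a few lines of NumPy (a minimal illustration, not the paper's implementation; the function names, the ratio and the threshold value here are hypothetical): top-k keeps a fixed fraction of nodes regardless of graph size, while threshold-based pooling keeps however many nodes exceed a fixed attention threshold, so the number of retained nodes can adapt to larger or noisier test graphs.

```python
import numpy as np

def topk_pool(x, alpha, ratio=0.8):
    """Top-k pooling: keep the ceil(ratio * N) highest-attention nodes."""
    k = max(1, int(np.ceil(ratio * len(alpha))))
    idx = np.sort(np.argsort(alpha)[-k:])   # indices of the k top-scoring nodes
    return x[idx] * alpha[idx, None]        # gate kept node features by attention

def threshold_pool(x, alpha, thresh=0.01):
    """Threshold pooling: keep every node whose attention exceeds a fixed
    threshold, so the number of kept nodes adapts to the graph."""
    idx = np.where(alpha > thresh)[0]
    return x[idx] * alpha[idx, None]

# A toy graph with 10 nodes and 4 features, where attention is confident
# that only 3 nodes matter.
x = np.random.randn(10, 4)
alpha = np.array([0.32, 0.32, 0.32] + [0.004] * 7)
print(topk_pool(x, alpha).shape)       # (8, 4): the fixed ratio keeps irrelevant nodes
print(threshold_pool(x, alpha).shape)  # (3, 4): only the relevant nodes survive
```

A ratio tuned on small training graphs forces top-k to keep the wrong number of nodes once test graphs grow, whereas a threshold on the attention values does not, which is consistent with the gaps reported above.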
In Figure 3 ((a, b)-zoomed) we show that GIN and ChebyGIN models with supervised top-k pooling never reach an average accuracy of more than 80%, as opposed to our method, which reaches 100% in many cases.

How do results change with the input dimensionality or capacity of the attention model? We performed experiments using ChebyGIN-h, a model with a higher-dimensional input to the attention model (see the Supp. Material for details). In such cases, it becomes very unlikely to initialize the attention model close to optimally (Figure 3, (c-e)), and attention accuracy is concentrated in the 60-80% region. The effect of an attention model with such low accuracy is negligible or even harmful, especially on the large and noisy graphs. We also experimented with a deeper attention model (ChebyGIN-h), i.e. a two-layer fully-connected network with 32 hidden units for COLORS and MNIST-75SP, and a deeper GNN (Eq. 4) for TRIANGLES. This has a positive effect overall, except for TRIANGLES, where our attention models were already deep GNNs.

Can we improve initialization of attention? In all our experiments, we initialize p from the Normal distribution, N(0, 1). To verify whether performance can be improved by choosing another distribution, we evaluate GIN and GCN models on a wide range of random distributions, Normal N(0, σ) and Uniform U(−σ, σ), by varying the scale σ (Figure 6). We found that for unsupervised training (Figure 6, (a)), larger initial values and the Normal distribution should be used to make it possible to converge to an optimal solution, which is still unlikely and greatly depends on cosine similarity with GT attention (Figure 6, (d, e)). For supervised and "weak-sup" attention, smaller initial weights and either the Normal or the Uniform distribution should be used (Figure 6, (b, c)).

Figure 6: Influence of distribution parameters used to initialize the attention model p in the COLORS task with n = 3 dimensional features. We show points corresponding to the commonly used initialization strategies of Xavier [29] and Kaiming [29]. (a-c) Shaded areas show range, bars show ±1 std. For n = 16 see the Supp. Material.

[Figure 6 panels: (a-c) average classification accuracy vs. the scale/std σ of initialized attention weights for GIN on COLORS (n = 3) with unsup, sup and weak-sup attention; (d, e) accuracy vs. cosine similarity with GT attention and the std of initialization for GIN and GCN.]

Table 2: Results on the social (COLLAB) and molecule (PROTEINS and D&D) datasets. We use 3-layer GCNs [18] or ChebyNets [8] (see the Supp. Material for architecture details). Dataset subscripts denote the maximum number of nodes in the training set according to our splits (Section 3.1).

What is the recipe for more powerful attention GNNs? We showed that GNNs with supervised training of attention are significantly more accurate and robust, although in the case of a bad initialization it can take a long time to reach the performance of a better initialization. However, supervised attention is often infeasible.
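Why good random initializations are so rare for higher-dimensional attention models (as in the ChebyGIN-h experiments above) can be made concrete with a small Monte-Carlo estimate. This is our own sketch, not the paper's code: we assume a one-hot optimal direction for p and count random Gaussian draws whose cosine similarity with it exceeds 0.5.

```python
import numpy as np

def prob_good_init(n, n_trials=20000, cos_thresh=0.5, seed=0):
    """Estimate the probability that p ~ N(0, I_n) has cosine similarity
    above cos_thresh with a fixed one-hot 'optimal' attention direction."""
    rng = np.random.default_rng(seed)
    p = rng.normal(size=(n_trials, n))
    # With a one-hot target, cosine similarity reduces to p_0 / ||p||.
    cos = p[:, 0] / np.linalg.norm(p, axis=1)
    return np.mean(cos > cos_thresh)

# The chance of a near-optimal draw collapses as dimensionality grows.
print(prob_good_init(3))    # roughly 0.25 for n = 3
print(prob_good_init(16))   # only a few percent at most for n = 16
```

For n = 3 the spherical-cap geometry gives a 25% chance, but by n = 16 a near-optimal draw is already a rare event, matching the observation that unsupervised attention almost never starts close enough to recover.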
We suggested an alternative approach based on weakly-supervised training and validated it on our synthetic (Table 1) and real (Table 2) datasets. In the case of COLORS, TRIANGLES and MNIST-75SP, we can compare to both unsupervised and supervised models and conclude that our approach shows performance, robustness and relatively low variance (i.e. sensitivity to initialization) similar to supervised models and much better than unsupervised models. In the case of COLLAB, PROTEINS and D&D, we can only compare to unsupervised and global pooling models, and we confirm that our method can be effectively employed for a wide diversity of graph classification tasks and attends to more relevant nodes (Figures 5 and 7). Tuning the distribution and scale σ for the initialization of attention can further improve results. For instance, on PROTEINS in the weakly-supervised case, we obtain 76.4% as opposed to 76.2%.

                        COLLAB35    PROTEINS25  D&D200      D&D300
# train / test graphs   500 / 4500  500 / 613   462 / 716   500 / 678
# nodes (N) train       32-35       4-25        30-200      30-300
# nodes (N) test        32-492      6-620       30-5748     201-5748
Global max              65.9±3.4    74.4±1.0    29.7±4.9    72.7±3.6
Unsup, ours             65.7±3.5    75.6±1.4    51.9±5.3    77.2±2.9
Weak-sup                67.0±1.7    76.2±0.7    54.3±5.0    78.4±1.1

[Figure 7 panels: rows COLLAB35, PROTEINS25, D&D200; columns GLOBAL POOL, UNSUP, UNSUP POOLED, WEAK-SUP, WEAK-SUP POOLED.]

Figure 7: Qualitative results. In COLLAB, a graph represents the ego-network of a researcher, therefore center nodes are important. In PROTEINS and D&D, a graph is a protein and nodes are amino acids, so it is important to attend to a connected chain of amino acids to distinguish an enzyme from a non-enzyme protein.
Our weakly-supervised method attends to and pools more relevant nodes compared to global and unsupervised models, leading to better classification results.

5 Conclusion

We have shown that learned attention can be extremely powerful in graph neural networks, but only if it is close to optimal. This is difficult to achieve due to sensitivity to initialization, especially in the unsupervised setting, where we do not have access to ground truth attention. Thus, we have identified initialization of attention models for high-dimensional inputs as an important open issue. We have also shown that attention can make GNNs more robust to larger and noisy graphs, and that the weakly-supervised approach proposed in our work brings advantages similar to those of supervised models, yet at the same time can be effectively applied to datasets without annotated attention.

Acknowledgments
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The authors also acknowledge support from the Canadian Institute for Advanced Research and the Canada Foundation for Innovation. We are also thankful to Angus Galloway for feedback.

References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[2] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Attentive explanations: Justifying decisions and pointing to the evidence. arXiv preprint arXiv:1612.04757, 2016.

[3] Andreea Deac, Petar Veličković, and Pietro Sormanni.
Attentive cross-modal paratope prediction. Journal of Computational Biology, 2018.

[4] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

[5] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, 2018.

[6] John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification using structural attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1666–1674. ACM, 2018.

[7] John Boaz Lee, Ryan A Rossi, Sungchul Kim, Nesreen K Ahmed, and Eunyee Koh. Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984, 2018.

[8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[9] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. In International Conference on Learning Representations (ICLR), 2018.

[10] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4805–4815, 2018.

[11] Hongyang Gao and Shuiwang Ji. Graph U-Nets. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

[12] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019.

[13] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[14] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2, 2007.

[15] Anshumali Shrivastava and Ping Li. A new space for comparing graphs. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 62–71. IEEE Press, 2014.

[16] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.

[17] Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.

[18] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[19] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.

[20] Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2018.

[21] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, Sabine Süsstrunk, et al. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[22] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels.
In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.

[23] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pages 1–7. IEEE, 2017.

[24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

[25] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR), 2019.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[27] Chenxi Liu, Junhua Mao, Fei Sha, and Alan L Yuille. Attention correctness in neural image captioning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[28] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.