{"title": "Rethinking Kernel Methods for Node Representation Learning on Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 11686, "page_last": 11697, "abstract": "Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification. However, the use of kernel methods for node classification, which is a related problem to graph representation learning, is still ill-posed and the state-of-the-art methods are heavily based on heuristics. Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. Our approach is motivated by graph kernel methodology but extended to learn the node representations capturing the structural information in a graph. We theoretically show that our formulation is as powerful as any positive semidefinite kernels. To efficiently learn the kernel, we propose a novel mechanism for node feature aggregation and a data-driven similarity metric employed during the training phase. More importantly, our framework is flexible and complementary to other graph-based deep learning models, e.g., Graph Convolutional Networks (GCNs). We empirically evaluate our approach on a number of standard node classification benchmarks, and demonstrate that our model sets the new state of the art.", "full_text": "Rethinking Kernel Methods for\n\nNode Representation Learning on Graphs\n\nYu Tian\u2217\n\nRutgers University\n\nLong Zhao\u2217\n\nRutgers University\n\nyt219@cs.rutgers.edu\n\nlz311@cs.rutgers.edu\n\nXi Peng\n\nUniversity of Delaware\n\nxipeng@udel.edu\n\nDimitris N. Metaxas\nRutgers University\n\ndnm@cs.rutgers.edu\n\nAbstract\n\nGraph kernels are kernel methods measuring graph similarity and serve as a stan-\ndard tool for graph classi\ufb01cation. 
However, the use of kernel methods for node\nclassi\ufb01cation, which is a related problem to graph representation learning, is still\nill-posed and the state-of-the-art methods are heavily based on heuristics. Here, we\npresent a novel theoretical kernel-based framework for node classi\ufb01cation that can\nbridge the gap between these two representation learning problems on graphs. Our\napproach is motivated by graph kernel methodology but extended to learn the node\nrepresentations capturing the structural information in a graph. We theoretically\nshow that our formulation is as powerful as any positive semide\ufb01nite kernels. To\nef\ufb01ciently learn the kernel, we propose a novel mechanism for node feature aggrega-\ntion and a data-driven similarity metric employed during the training phase. More\nimportantly, our framework is \ufb02exible and complementary to other graph-based\ndeep learning models, e.g., Graph Convolutional Networks (GCNs). We empirically\nevaluate our approach on a number of standard node classi\ufb01cation benchmarks,\nand demonstrate that our model sets the new state of the art. The source code is\npublicly available at https://github.com/bluer555/KernelGCN.\n\n1\n\nIntroduction\n\nGraph structured data, such as citation networks [11, 22, 30], biological models [12, 45], grid-like\ndata [36, 37, 51] and skeleton-based motion systems [6, 42, 49, 50], are abundant in the real world.\nTherefore, learning to understand graphs is a crucial problem in machine learning. Previous studies\nin the literature generally fall into two main categories: (1) graph classi\ufb01cation [8, 19, 40, 47, 48],\nwhere the whole structure of graphs is captured for similarity comparison; (2) node classi\ufb01cation [1,\n19, 38, 41, 46], where the structural identity of nodes is determined for representation learning.\nFor graph classi\ufb01cation, kernel methods, i.e., graph kernels, have become a standard tool [20]. 
Given a large collection of graphs, possibly with node and edge attributes, such algorithms aim to learn a kernel function that best captures the similarity between any two graphs. The graph kernel function can be utilized to classify graphs via standard kernel methods such as support vector machines or k-nearest neighbors. Moreover, recent studies [40, 47] also demonstrate a close connection between Graph Neural Networks (GNNs) and the Weisfeiler-Lehman graph kernel [32], and relate GNNs to the classic graph kernel methods for graph classification.\n\n\u2217indicates equal contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Overview of our kernel-based framework.\n\nNode classification, on the other hand, is still an ill-posed problem in representation learning on graphs. Although identification of node classes often leverages their features, a more challenging and important scenario is to incorporate the graph structure for classification. Recent efforts in Graph Convolutional Networks (GCNs) [19] have made great progress on node classification. In particular, these efforts broadly follow a recursive neighborhood aggregation scheme to capture structural information, where each node aggregates the feature vectors of its neighbors to compute its new features [1, 41, 46]. Empirically, these GCNs have achieved state-of-the-art performance on node classification. However, the design of new GCNs is mostly based on empirical intuition, heuristics, and experimental trial-and-error.\nIn this paper, we propose a novel theoretical framework leveraging kernel methods for node classification. Motivated by graph kernels, our key idea is to decouple the kernel function so that it can be learned driven by the node class labels on the graph. Meanwhile, its validity and expressive power are guaranteed. 
To be specific, this paper makes the following contributions:\n\u2022 We propose a learnable kernel-based framework for node classification. The kernel function is decoupled into a feature mapping function and a base kernel to ensure that it is valid as well as learnable. Then we present a data-driven similarity metric and its corresponding learning criteria for efficient kernel training. The implementation of each component is extensively discussed. An overview of our framework is shown in Fig. 1.\n\u2022 We demonstrate the validity of our learnable kernel function. More importantly, we theoretically show that our formulation is powerful enough to express any valid positive semidefinite kernel.\n\u2022 A novel feature aggregation mechanism for learning node representations is derived from the perspective of kernel smoothing. Compared with GCNs, our model captures the structural information of a node by aggregation in a single step rather than in a recursive manner, and is thus more efficient.\n\u2022 We discuss the close connection between the proposed approach and GCNs. We also show that our method is flexible and complementary to GCNs and their variants but more powerful, and can be leveraged as a general framework for future work.\n\n2 Related Work\n\nGraph Kernels. Graph kernels are kernels defined on graphs to capture graph similarity, which can be used in kernel methods for graph classification. Many graph kernels are instances of the family of convolutional kernels [15]. Some of them measure the similarity between walks or paths on graphs [4, 39]. Other popular kernels are designed based on limited-sized substructures [18, 33, 31, 32]. Most graph kernels are employed in models which have learnable components, but the kernels themselves are hand-crafted and motivated by graph theory. Some learnable graph kernels have been proposed recently, such as Deep Graph Kernels [43] and Graph Matching Networks [21]. 
Compared to these approaches, our method targets learning kernels for node representation learning.\nNode Representation Learning. Conventional methods for learning node representations largely focus on matrix factorization. They directly adopt classic techniques for dimension reduction [2, 3]. Other methods are derived from the random walk algorithm [23, 26] or sub-graph structures [13, 35, 44, 28]. Recently, Graph Convolutional Networks (GCNs) have emerged as an effective class of models for learning representations of graph structured data. GCNs were introduced in [19]; they consist of an iterative process that aggregates and transforms the representation vectors of a node\u2019s neighbors to capture structural information. Recently, several variants have been proposed, which employ self-attention mechanisms [38] or improved network architectures [41, 46] to boost performance. However, most of them are based on empirical intuition and heuristics.\n\n3 Preliminaries\n\nWe begin by summarizing some of the most important concepts about kernel methods as well as representation learning on graphs and, along the way, introduce our notations.\nKernel Concepts. A kernel K : X \u00d7 X \u21a6 R is a function of two arguments: K(x, y) for x, y \u2208 X. The kernel function K is symmetric, i.e., K(x, y) = K(y, x), which means it can be interpreted as a measure of similarity. If the Gram matrix K \u2208 R^{N\u00d7N} defined by K(i, j) = K(xi, xj) for any {xi}_{i=1}^{N} is positive semidefinite (p.s.d.), then K is a p.s.d. kernel [24]. If K(x, y) can be represented as \u27e8\u03a8(x), \u03a8(y)\u27e9, where \u03a8 : X \u21a6 R^D is a feature mapping function, then K is a valid kernel.\nGraph Kernels. 
In the graph space G, we denote a graph as G = (V, E), where V is the set of nodes and E is the edge set of G. Given two graphs Gi = (Vi, Ei) and Gj = (Vj, Ej) in G, the graph kernel KG(Gi, Gj) measures the similarity between them. According to the definition in [29], the kernel KG must be p.s.d. and symmetric. The graph kernel KG between Gi and Gj is defined as:\n\nKG(Gi, Gj) = \u2211_{vi\u2208Vi} \u2211_{vj\u2208Vj} kbase(f(vi), f(vj)),    (1)\n\nwhere kbase is the base kernel for any pair of nodes in Gi and Gj, and f : V \u21a6 \u2126 is a function to compute the feature vector associated with each node. However, deriving a new p.s.d. graph kernel is a non-trivial task. Previous methods often implement kbase and f as the dot product between hand-crafted graph heuristics [25, 31, 4]. There are few learnable parameters in these approaches.\nRepresentation Learning on Graphs. Although graph kernels have been applied to a wide range of applications, most of them depend on hand-crafted heuristics. In contrast, representation learning aims to automatically learn to encode graph structures into low-dimensional embeddings. Formally, given a graph G = (V, E), we follow [14] to define representation learning as an encoder-decoder framework, where we minimize the empirical loss L over a set of training node pairs D \u2286 V \u00d7 V:\n\nL = \u2211_{(vi,vj)\u2208D} \u2113(ENC-DEC(vi, vj), sG(vi, vj)).    (2)\n\nEquation (2) has three methodological components: ENC-DEC, sG and \u2113. Most of the previous methods on representation learning can be distinguished by how these components are defined. The detailed meaning of each component is explained as follows.\n\u2022 ENC-DEC : V \u00d7 V \u21a6 R is an encoder-decoder function. It contains an encoder which projects each node into an M-dimensional vector to generate the node embedding. 
This function contains a number of trainable parameters to be optimized during the training phase. It also includes a decoder function, which reconstructs pairwise similarity measurements from the node embeddings generated by the encoder.\n\u2022 sG is a pairwise similarity function defined over the graph G. This function is user-specified, and it is used for measuring the similarity between nodes in G.\n\u2022 \u2113 : R \u00d7 R \u21a6 R is a loss function, which is leveraged to train the model. This function evaluates the quality of the pairwise reconstruction between the estimated value ENC-DEC(vi, vj) and the true value sG(vi, vj).\n\n4 Proposed Method: Learning Kernels for Node Representation\n\nGiven a graph G, as we can see from Eq. (2), the encoder-decoder ENC-DEC aims to approximate the pairwise similarity function sG, which leads to a natural intuition: we can replace ENC-DEC with a kernel function K\u03b8 parameterized by \u03b8 to measure the similarity between nodes in G, i.e.,\n\nL = \u2211_{(vi,vj)\u2208D} \u2113(K\u03b8(vi, vj), sG(vi, vj)).    (3)\n\nHowever, there exist two technical challenges: (1) designing a valid p.s.d. kernel which captures the node features is non-trivial; (2) it is impossible to handcraft a unified kernel to handle all possible graphs with different characteristics [27]. To tackle these issues, we introduce a novel formulation to replace K\u03b8. Inspired by the graph kernel as defined in Eq. (1) and the mapping kernel framework [34], our key idea is to decouple K\u03b8 into two components: a base kernel kbase which is p.s.d. to maintain validity, and a learnable feature mapping function g\u03b8 to ensure the flexibility of the resulting kernel. Therefore, we rewrite Eq. 
(3) by K\u03b8(vi, vj) = kbase(g\u03b8(vi), g\u03b8(vj)) for vi, vj \u2208 V of the graph G to optimize the following objective:\n\nL = \u2211_{(vi,vj)\u2208D} \u2113(kbase(g\u03b8(vi), g\u03b8(vj)), sG(vi, vj)).    (4)\n\nTheorem 1 demonstrates that the proposed formulation, i.e., K\u03b8(vi, vj) = kbase(g\u03b8(vi), g\u03b8(vj)), is still a valid p.s.d. kernel for any feature mapping function g\u03b8 parameterized by \u03b8.\nTheorem 1. Let g\u03b8 : V \u21a6 R^M be a function which maps nodes (or their corresponding features) to an M-dimensional Euclidean space. Let kbase : R^M \u00d7 R^M \u21a6 R be any valid p.s.d. kernel. Then, K\u03b8(vi, vj) = kbase(g\u03b8(vi), g\u03b8(vj)) is a valid p.s.d. kernel.\n\nProof. Let \u03a6 be the corresponding feature mapping function of the p.s.d. kernel kbase. Then, we have kbase(zi, zj) = \u27e8\u03a6(zi), \u03a6(zj)\u27e9, where zi, zj \u2208 R^M. Substituting g\u03b8(vi), g\u03b8(vj) for zi, zj, we have kbase(g\u03b8(vi), g\u03b8(vj)) = \u27e8\u03a6(g\u03b8(vi)), \u03a6(g\u03b8(vj))\u27e9. Writing the new feature mapping as \u03a8(v) = \u03a6(g\u03b8(v)), we immediately have kbase(g\u03b8(vi), g\u03b8(vj)) = \u27e8\u03a8(vi), \u03a8(vj)\u27e9. Hence, kbase(g\u03b8(vi), g\u03b8(vj)) is a valid p.s.d. kernel.\n\nA natural follow-up question is whether our proposed formulation, in principle, is powerful enough to express any valid p.s.d. kernel. Our answer, in Theorem 2, is yes: if the base kernel has an invertible feature mapping function, then the resulting kernel is able to model any valid p.s.d. kernel.\nTheorem 2. Let K(vi, vj) be any valid p.s.d. kernel for node pairs (vi, vj) \u2208 V \u00d7 V. Let kbase : R^M \u00d7 R^M \u21a6 R be a p.s.d. kernel which has an invertible feature mapping function \u03a6. 
Then there exists a feature mapping function g\u03b8 : V \u21a6 R^M such that K(vi, vj) = kbase(g\u03b8(vi), g\u03b8(vj)).\n\nProof. Let \u03a8 be the corresponding feature mapping function of the p.s.d. kernel K; then we have K(vi, vj) = \u27e8\u03a8(vi), \u03a8(vj)\u27e9. Similarly, for zi, zj \u2208 R^M, we have kbase(zi, zj) = \u27e8\u03a6(zi), \u03a6(zj)\u27e9. Substituting g\u03b8(v) for z, it is easy to see that g\u03b8(v) = (\u03a6^{\u22121} \u25e6 \u03a8)(v) is the desired feature mapping function when \u03a6^{\u22121} exists.\n\n4.1 Implementation and Learning Criteria\n\nTheorems 1 and 2 have demonstrated the validity and power of the proposed formulation in Eq. (4). In this section, we discuss how to implement and learn g\u03b8, kbase, sG and \u2113, respectively.\nImplementation of the Feature Mapping Function g\u03b8. The function g\u03b8 aims to project the feature vector xv of each node v into a better space for similarity measurement. Our key idea is that in a graph, connected nodes usually share similar characteristics, and thus changes between nearby nodes in the latent space should be smooth. Inspired by the concept of kernel smoothing, we consider g\u03b8 as a feature smoother which maps xv into a smoothed latent space according to the graph structure. A kernel smoother estimates a function as the weighted average of neighboring observed data. To be specific, given a node v \u2208 V, the Nadaraya-Watson kernel-weighted average [10] defines a feature smoothing function as:\n\ng(v) = \u2211_{u\u2208V} k(u, v)p(u) / \u2211_{u\u2208V} k(u, v),    (5)\n\nwhere p is a mapping function to compute the feature vector of each node, and here we let p(v) = xv; k is a pre-defined kernel function to capture pairwise relations between nodes. Note that we omit \u03b8 for g here since there are no learnable parameters in Eq. (5). 
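As a concrete illustration, Eq. (5) can be sketched in a few lines of NumPy. The toy kernel matrix `K` and feature matrix `X` below are our own assumptions for illustration, not the paper's released implementation:

```python
import numpy as np

def kernel_smooth(K, X):
    """Nadaraya-Watson smoothing (Eq. 5): each node's feature becomes the
    kernel-weighted average g(v) = sum_u k(u,v) x_u / sum_u k(u,v)."""
    weights = K / K.sum(axis=1, keepdims=True)  # normalize kernel weights per node
    return weights @ X

# Toy 3-node path graph; kernel = adjacency with self-loops (a 1-hop smoother).
K = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
X = np.array([[0.], [3.], [6.]])  # one scalar feature per node
print(kernel_smooth(K, X))  # middle node averages all three features -> 3.0
```

Note how the smoothed features interpolate between each node's own feature and those of its neighbors, which is exactly the smoothness prior the paragraph above motivates.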
In the context of graphs, the natural choice of computing k is to follow the graph structure, i.e., the structural information within the node\u2019s h-hop neighborhood.\nTo compute g, we let A be the adjacency matrix of the given graph G and I be the identity matrix of the same size. We notice that I + D^{\u22121/2}AD^{\u22121/2} is a valid p.s.d. matrix, where D(i, i) = \u2211_j A(i, j). Thus we can employ this matrix to define the kernel function k. However, in practice, this matrix would lead to numerical instabilities and exploding or vanishing gradients when used for training deep neural networks. To alleviate this problem, we adopt the renormalization trick [19]: I + D^{\u22121/2}AD^{\u22121/2} \u2192 \u00afA = \u02dcD^{\u22121/2} \u02dcA \u02dcD^{\u22121/2}, where \u02dcA = A + I and \u02dcD(i, i) = \u2211_j \u02dcA(i, j). Then the h-hop neighborhood can be computed directly from the h-th power of \u00afA, i.e., \u00afA^h. And the kernel k for node pairs vi, vj \u2208 V is computed as k(vi, vj) = \u00afA^h(i, j). After collecting the feature vector xv of each node v \u2208 V into a matrix XV, we rewrite Eq. (5) approximately into its matrix form:\n\ng(V) \u2248 \u00afA^h XV.    (6)\n\nNext, we enhance the expressive power of Eq. (6) to model any valid p.s.d. kernels by implementing it with deep neural networks, based on the following two aspects. First, we make use of multi-layer perceptrons (MLPs) to model and learn the composite function \u03a6^{\u22121} \u25e6 \u03a8 in Theorem 2, thanks to the universal approximation theorem [16, 17]. Second, we add learnable weights to different hops of node neighbors. 
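To make the renormalization trick and the one-shot aggregation of Eq. (6) concrete, here is a small NumPy sketch; the toy adjacency matrix and the variable names (`A_bar`, `g_V`) are ours, not from the released code:

```python
import numpy as np

def renormalized_adjacency(A):
    """Renormalization trick of [19]: A_bar = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # D~(i, i) = sum_j A~(i, j)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Toy 3-node path graph and random node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.randn(3, 4)
A_bar = renormalized_adjacency(A)
h = 2
g_V = np.linalg.matrix_power(A_bar, h) @ X  # Eq. (6): single-step h-hop aggregation
```

Because `A_bar**h` is computed once up front, the h-hop aggregation is a single sparse-dense product at training time, rather than h stacked layers.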
As a result, our final feature mapping function g\u03b8 is defined as:\n\ng\u03b8(V) = (\u2211_h \u03c9h \u00b7 (\u00afA^h \u2299 M(h))) \u00b7 MLP(l)(XV),    (7)\n\nwhere \u03b8 denotes the set of parameters in g\u03b8; \u03c9h is a learnable parameter for the h-hop neighborhood of each node v; \u2299 is the Hadamard (element-wise) product; M(h) is an indicator matrix where M(h)(i, j) equals 1 if vj is an h-hop neighbor of vi and 0 otherwise. The hyperparameter l controls the number of layers in the MLP.\nEquation (7) can be interpreted as a weighted feature aggregation schema around the given node v and its neighbors, which is employed to compute the node representation. It has a close connection with Graph Neural Networks; we defer a more detailed discussion to Section 5.\nImplementation of the Base Kernel kbase. As we have shown in Theorem 2, in order to model an arbitrary p.s.d. kernel, we require that the corresponding feature mapping function \u03a6 of the base kernel kbase be invertible, i.e., that \u03a6^{\u22121} exists. An obvious choice is to let \u03a6 be the identity function; then kbase reduces to the dot product between nodes in the latent space. Since g\u03b8 maps node representations to a finite-dimensional space, the identity function makes our model directly measure node similarity in this space. Alternatively, kbase can be the RBF kernel, which additionally projects node representations to an infinite-dimensional latent space before comparison. We compare both implementations in the experiments for further evaluation.\nData-Driven Similarity Metric sG and Criteria \u2113. In node classification, each node vi \u2208 V is associated with a class label yi \u2208 Y. We aim to measure node similarity with respect to class labels rather than with hand-designed metrics. 
Naturally, we define the pairwise similarity sG as:\n\nsG(vi, vj) = 1 if yi = yj, and \u22121 otherwise.    (8)\n\nHowever, in practice, it is hard to directly minimize the loss between K\u03b8 and sG in Eq. (8). Instead, we consider a \u201csoft\u201d version of sG, where we require that the similarity of node pairs with the same label be greater than that of pairs with distinct labels by a margin. Therefore, we train the kernel K\u03b8(vi, vj) = kbase(g\u03b8(vi), g\u03b8(vj)) to minimize the following objective function on triplets:\n\nLK = \u2211_{(vi,vj,vk)\u2208T} \u2113(K\u03b8(vi, vj), K\u03b8(vi, vk)),    (9)\n\nwhere T \u2286 V \u00d7 V \u00d7 V is a set of node triplets: vi is an anchor, vj is a positive of the same class as the anchor, and vk is a negative of a different class. The loss function \u2113 is defined as:\n\n\u2113(K\u03b8(vi, vj), K\u03b8(vi, vk)) = [K\u03b8(vi, vk) \u2212 K\u03b8(vi, vj) + \u03b1]+.    (10)\n\nIt ensures that, given an anchor, a positive node of the same class and a negative node, the kernel value of the negative is smaller than that of the positive by at least the margin \u03b1. Here, we present Theorem 3 and its proof to show that minimizing Eq. (9) leads to K\u03b8 = sG.\n\nTheorem 3. If |K\u03b8(vi, vj)| \u2264 1 for any vi, vj \u2208 V, minimizing Eq. (9) with \u03b1 = 2 yields K\u03b8 = sG.\nProof. Let (vi, vj, vk) range over all triplets satisfying yi = yj, yi \u2260 yk. Suppose that for \u03b1 = 2, Eq. (10) holds for all (vi, vj, vk). This means K\u03b8(vi, vk) + 2 \u2264 K\u03b8(vi, vj) for all (vi, vj, vk). As |K\u03b8(vi, vj)| \u2264 1, we have K\u03b8(vi, vk) = \u22121 for all (vi, vk) and K\u03b8(vi, vj) = 1 for all (vi, vj). Hence, K\u03b8 = sG.\nWe note that |K\u03b8(vi, vj)| \u2264 1 can be simply achieved by letting kbase be the dot product and normalizing all g\u03b8 to the norm ball. 
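A minimal sketch of the triplet objective in Eqs. (9)-(10), assuming a dot-product base kernel and L2-normalized embeddings so that |K\u03b8| \u2264 1 as Theorem 3 requires; the embedding vectors are toy values, not learned ones:

```python
import numpy as np

def triplet_kernel_loss(z_anchor, z_pos, z_neg, alpha=2.0):
    """Hinge loss of Eq. (10) with the dot-product base kernel:
    [K(anchor, neg) - K(anchor, pos) + alpha]_+ on unit-norm embeddings."""
    k_pos = float(z_anchor @ z_pos)  # K_theta(v_i, v_j)
    k_neg = float(z_anchor @ z_neg)  # K_theta(v_i, v_k)
    return max(k_neg - k_pos + alpha, 0.0)

def normalize(z):
    return z / np.linalg.norm(z)  # keeps |K_theta| <= 1 (condition of Theorem 3)

a = normalize(np.array([1.0, 0.0]))
p = normalize(np.array([1.0, 0.1]))   # same class: nearly aligned
n = normalize(np.array([-1.0, 0.0]))  # different class: opposite direction
loss = triplet_kernel_loss(a, p, n)
# With alpha = 2, loss reaches 0 only when K(a, p) = 1 and K(a, n) = -1,
# which is exactly the target similarity sG of Eq. (8).
```

Driving this loss to zero therefore forces same-class pairs onto the same direction and different-class pairs onto opposite directions of the norm ball.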
In the following sections, the normalized K\u03b8 is denoted by \u00afK\u03b8.\n\n4.2 Inference for Node Classification\n\nOnce the kernel function K\u03b8(vi, vj) = kbase(g\u03b8(vi), g\u03b8(vj)) has learned how to measure the similarity between nodes, we can leverage the output of the feature mapping function g\u03b8 as the node representation for node classification. In this paper, we introduce the following two classifiers.\nNearest Centroid Classifier. The nearest centroid classifier extends the k-nearest neighbors algorithm by assigning to an observation the label of the class of training samples whose centroid is closest to it. It does not require additional parameters. To be specific, given a testing node u, for all nodes vi with class label yi \u2208 Y in the training set, we compute the per-class average similarity between u and vi: \u00b5y = (1/|Vy|) \u2211_{vi\u2208Vy} \u00afK\u03b8(u, vi), where Vy is the set of nodes belonging to class y \u2208 Y. Then the class assigned to the testing node u is:\n\ny\u2217 = arg max_{y\u2208Y} \u00b5y.    (11)\n\nSoftmax Classifier. The idea of the softmax classifier is to reuse the ground truth labels of nodes for training the classifier, so that it can be directly employed for inference. To do this, we add the softmax activation \u03c3 after g\u03b8(vi) to minimize the following objective:\n\nLY = \u2212 \u2211_{vi\u2208V} q(yi) log(\u03c3(g\u03b8(vi))),    (12)\n\nwhere q(yi) is the one-hot ground truth vector. Note that Eq. (12) is optimized together with Eq. (9) in an end-to-end manner. Let \u03a8 denote the corresponding feature mapping function of K\u03b8; then we have K\u03b8(vi, vj) = \u27e8\u03a8(vi), \u03a8(vj)\u27e9 = kbase(g\u03b8(vi), g\u03b8(vj)). 
In this case, we use the node feature produced by \u03a8 for classification, since \u03a8 projects node features into the dot-product space, which is a natural metric for similarity comparison. To this end, kbase is fixed to be the dot product (i.e., \u03a6 is the identity function) for the softmax classifier, so that we have \u27e8\u03a8(vi), \u03a8(vj)\u27e9 = \u27e8g\u03b8(vi), g\u03b8(vj)\u27e9 and thus \u03a8(vi) = g\u03b8(vi).\n\n5 Discussion\n\nOur feature mapping function g\u03b8 proposed in Eq. (7) has a close connection with Graph Convolutional Networks (GCNs) [19] in the way it captures node latent representations. In GCNs and most of their variants, each layer leverages the following aggregation rule:\n\nH(l+1) = \u03c1(\u00afA H(l) W(l)),    (13)\n\nwhere W(l) is a layer-specific trainable weighting matrix; \u03c1 denotes an activation function; H(l) \u2208 R^{N\u00d7D} denotes the node features in the l-th layer, and H(0) = X. Through stacking multiple layers, GCNs aggregate the features for each node from its L-hop neighbors recursively, where L is the network depth. Compared with the proposed g\u03b8, GCNs actually interleave the two basic operations of g\u03b8, feature transformation and Nadaraya-Watson kernel-weighted averaging, and repeat them recursively.\nWe contrast our approach with GCNs in terms of the following aspects. First, our aggregation function is derived from the kernel perspective, which is novel. Second, we show that aggregating features in a recursive manner is inessential: powerful h-hop node representations can be obtained by our model where aggregation is performed only once. As a result, our approach is more efficient both in storage and time when handling very large graphs, since no intermediate states of the network have to be kept. 
Third, our model is flexible and complementary to GCNs: our function g\u03b8 can be directly replaced by GCNs and other variants, which can be exploited in future work.\n\nTime and Space Complexity. We assume the number of features F is fixed for all layers and that both GCNs and our method have L \u2265 2 layers. We count matrix multiplications as in [7]. GCN\u2019s time complexity is O(L\u2016\u00afA\u2016_0 F + L|V|F^2), where \u2016\u00afA\u2016_0 is the number of nonzeros of \u00afA and |V| is the number of nodes in the graph. Ours is O(\u2016\u00afA^h\u2016_0 F + L|V|F^2), since we do not aggregate features recursively. Obviously, \u2016\u00afA^h\u2016_0 is constant in L but L\u2016\u00afA\u2016_0 is linear in L. For space complexity, GCNs have to store all the feature matrices for recursive aggregation, which needs O(L|V|F + LF^2) space, where LF^2 is for storing the trainable parameters of all layers; thus the first term is linear in L. Instead, ours is O(|V|F + LF^2), where the first term is again constant in L. Our experiments indicate that we save 20% (0.3 ms) time and 15% space on the Cora dataset [22] compared with GCNs.\n\n6 Experiments\n\nWe evaluate the proposed kernel-based approach on three benchmark datasets: Cora [22], Citeseer [11] and Pubmed [30]. They are citation networks, where the task of node classification is to classify academic papers of the network (graph) into different subjects. These datasets contain bag-of-words features for each document (node) and citation links between documents.\nWe compare our approach to five state-of-the-art methods: GCN [19], GAT [38], FastGCN [5], JK [41] and KLED [9]. KLED is a kernel-based method, while the others are based on deep neural networks. We test all methods in the supervised learning scenario, where all data in the training set are used for training. 
We evaluate the proposed method in two different experimental settings according to FastGCN [5] and JK [41], respectively. The statistics of the datasets together with their data split settings (i.e., the number of samples contained in the training, validation and testing sets, respectively) are summarized in Table 1. Note that there are more training samples in the data split of JK [41] than in that of FastGCN [5]. We report the mean and standard deviation of node classification accuracy computed over ten runs as the evaluation metric.\n\nTable 1: Overview of the three evaluation datasets under two different data split settings.\n\nDataset | Nodes | Edges | Classes | Features | Data split of FastGCN [5] | Data split of JK [41]\nCora [22] | 2,708 | 5,429 | 7 | 1,433 | 1,208 / 500 / 1,000 | 1,624 / 542 / 542\nCiteseer [11] | 3,327 | 4,732 | 6 | 3,703 | 1,827 / 500 / 1,000 | 1,997 / 665 / 665\nPubmed [30] | 19,717 | 44,338 | 3 | 500 | 18,217 / 500 / 1,000 | -\n\n6.1 Variants of the Proposed Method\n\nAs we have shown in Section 4.1, there are alternative choices to implement each component of our framework. In this section, we summarize all the variants of our method employed for evaluation.\nChoices of the Feature Mapping Function g. We implement the feature mapping function g\u03b8 according to Eq. (7). In addition, we also choose GCN and GAT as alternative implementations of g\u03b8 for comparison, and denote them by gGCN and gGAT, respectively.\nChoices of the Base Kernel kbase. The base kernel kbase has two different implementations: the dot product, which is denoted by k\u27e8\u00b7,\u00b7\u27e9, and the RBF kernel, which is denoted by kRBF. Note that when the softmax classifier is employed, we set the base kernel to be k\u27e8\u00b7,\u00b7\u27e9.\nChoices of the Loss L and Classifier C. We consider the following three combinations of the loss function and classifier. (1) LK in Eq. 
(9) is optimized, and the nearest-centroid classifier CK is employed for classification. This combination aims to evaluate the effectiveness of the learned kernel. (2) LY in Eq. (12) is optimized, and the softmax classifier CY is employed for classification. This combination is used as a baseline without kernel methods. (3) Both Eq. (9) and Eq. (12) are optimized, and we denote this loss by LK+Y. The softmax classifier CY is employed for classification. This combination aims to evaluate how the learned kernel improves the baseline method.\nIn the experiments, we use K to denote kernel-based variants and N to denote ones without the kernel function. All these variants are implemented by MLPs with two layers. Due to the space limitation, we ask the readers to refer to the supplementary material for implementation details.\n\n6.2 Results of Node Classification\n\nThe means and standard deviations of node classification accuracy (%) following the setting of FastGCN [5] are organized in Table 2. Our variant K3 sets the new state of the art on all datasets. On the Pubmed dataset, all our variants improve on previous methods by a large margin. This proves the effectiveness of employing kernel methods for node classification, especially on datasets with large graphs. Interestingly, our non-kernel baseline N1 even achieves state-of-the-art performance, which shows that our feature mapping function can capture more flexible structural information than previous GCN-based approaches. For the choice of the base kernel, we find that K2 outperforms K1 on two large datasets, Citeseer and Pubmed. We conjecture that when handling complex datasets, a non-linear kernel, e.g., the RBF kernel, is a better choice than the linear kernel.\nTo evaluate the performance of our feature mapping function, we report the results of two variants, K\u22171 and K\u22172, in Table 2. 
These variants utilize GCN and GAT as the feature mapping function, respectively. As expected, our K1 outperforms both K∗1 and K∗2 on most datasets. This demonstrates that the recursive aggregation schema of GCNs is inessential: the proposed gθ aggregates features in only a single step, yet remains powerful enough for node classification. On the other hand, both K∗1 and K∗2 outperform their original non-kernel implementations, which shows that learning with kernels yields better node representations.

Table 3 shows the results following the setting of JK [41]. Note that we do not evaluate on Pubmed in this setup since its corresponding data split for training and evaluation is not provided by [41]. As expected, our method achieves the best performance on all datasets, consistent with the results in Table 2. For Cora, the improvement of our method is less significant. We conjecture that this is because the data split used in Table 3 involves more training data, which narrows the performance gap between different methods on datasets with small graphs, such as Cora.

Table 2: Accuracy (%) of node classification following the setting of FastGCN [5].

Method                          Cora [22]      Citeseer [11]  Pubmed [30]
KLED [9]                        82.3           -              82.3
GCN [19]                        86.0           77.2           86.5
GAT [38]                        85.6           76.9           86.2
FastGCN [5]                     85.0           77.6           88.0
K1 = {k⟨·,·⟩, gθ, LK, CK}       86.68 ± 0.17   77.92 ± 0.25   89.22 ± 0.17
K2 = {kRBF, gθ, LK, CK}         86.12 ± 0.05   78.68 ± 0.38   89.36 ± 0.21
K3 = {k⟨·,·⟩, gθ, LK+Y, CY}     88.40 ± 0.24   80.28 ± 0.03   89.42 ± 0.01
N1 = {gθ, LY, CY}               87.56 ± 0.14   79.80 ± 0.03   89.24 ± 0.14
K∗1 = {k⟨·,·⟩, gGCN, LK, CK}    87.04 ± 0.09   77.12 ± 0.23   87.84 ± 0.12
K∗2 = {k⟨·,·⟩, gGAT, LK, CK}    86.10 ± 0.33   77.92 ± 0.19   -

Table 3: Accuracy (%) of node classification following the setting of JK [41].

Method                          Cora [22]      Citeseer [11]
GCN [19]                        88.20 ± 0.70   77.30 ± 1.30
GAT [38]                        87.70 ± 0.30   76.20 ± 0.80
JK-Concat [41]                  89.10 ± 1.10   78.30 ± 0.80
K3 = {k⟨·,·⟩, gθ, LK+Y, CY}     89.24 ± 0.31   80.78 ± 0.28

6.3 Ablation Study on Node Feature Aggregation Schema

In Table 4, we evaluate the proposed node feature aggregation schema with variants of K3 (2-hop and 2-layer with ωh by default). We answer the following three questions. (1) How does performance change with fewer (or more) hops? We vary the number of hops from 1 to 3: performance improves with more hops, which shows that capturing long-range structures of nodes is important. (2) How many layers of MLP are needed? We show results with the number of layers ranging from 1 to 3. The best performance is obtained with two layers, while the networks overfit the data when more layers are employed. (3) Is it necessary to have a trainable parameter ωh? We replace ωh with a fixed constant c^h, where c ∈ (0, 1], and observe that larger c improves the performance.
However, all these results are worse than learning a weighting parameter ωh, which demonstrates its importance.

Table 4: Results of accuracy (%) with different settings of the aggregation schema.

Variants of K3   Cora [22]      Citeseer [11]  Pubmed [30]
Default          88.40 ± 0.24   80.28 ± 0.03   89.42 ± 0.01
1-hop            85.56 ± 0.02   77.73 ± 0.02   88.98 ± 0.01
3-hop            88.25 ± 0.01   80.13 ± 0.01   89.53 ± 0.01
1-layer          82.60 ± 0.01   77.63 ± 0.01   85.80 ± 0.01
3-layer          86.33 ± 0.04   78.53 ± 0.20   89.46 ± 0.05
c = 0.25         69.33 ± 0.09   74.48 ± 0.03   84.68 ± 0.02
c = 0.50         76.98 ± 0.10   77.47 ± 0.04   86.45 ± 0.01
c = 0.75         84.25 ± 0.01   77.99 ± 0.01   87.45 ± 0.01
c = 1.00         87.31 ± 0.01   78.57 ± 0.01   88.68 ± 0.01

6.4 t-SNE Visualization of Node Embeddings

We visualize the node embeddings of GCN, GAT, and our method on Citeseer with t-SNE. For our method, we use the embeddings of K3, which obtains the best performance. Figure 2 illustrates the results. Compared with the other methods, ours produces a more compact clustering result. Specifically, our method clusters the "red" points tightly, while in the results of GCN and GAT they are loosely scattered into other clusters. This is because both GCN and GAT minimize only the classification loss LY, targeting accuracy alone; they tend to learn node embeddings dominated by the classes containing the majority of nodes. In contrast, K3 is trained with both LK and LY. Our kernel-based similarity loss LK encourages data within the same class to be close to each other.
As a result, the learned feature mapping function gθ produces geometrically compact clusters.

Figure 2: t-SNE visualization of node embeddings on the Citeseer dataset: (a) GCN, (b) GAT, (c) ours.

Due to space limitations, we refer readers to the supplementary material for more experimental results, such as link prediction results and visualizations on other datasets.

7 Conclusions

In this paper, we introduce a kernel-based framework for node classification. Motivated by the design of graph kernels, we learn the kernel from ground-truth labels by decoupling the kernel function into a base kernel and a learnable feature mapping function. More importantly, we show that our formulation is valid and powerful enough to express any p.s.d. kernel. We then discuss the implementation of each component of our approach in detail. From the perspective of kernel smoothing, we also derive a novel feature mapping function to aggregate features from a node's neighborhood. Furthermore, we show that our formulation is closely connected with GCNs but more powerful. Experiments on standard node classification benchmarks are conducted to evaluate our approach. The results show that our method outperforms the state of the art.

Acknowledgments

This work is funded by ARO-MURI-68985NSMUR and NSF 1763523, 1747778, 1733843, 1703883.

References

[1] S. Abu-El-Haija, A. Kapoor, B. Perozzi, and J. Lee. N-GCN: Multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888, 2018.

[2] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the International Conference on World Wide Web (WWW), pages 37–48, 2013.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering.
In Advances in Neural Information Processing Systems (NeurIPS), pages 585–591, 2002.

[4] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2005.

[5] J. Chen, T. Ma, and C. Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.

[6] Y. Chen, L. Zhao, X. Peng, J. Yuan, and D. N. Metaxas. Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. In Proceedings of the British Machine Vision Conference (BMVC), 2019.

[7] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 257–266, 2019.

[8] M. Draief, K. Kutzkov, K. Scaman, and M. Vojnovic. KONG: Kernels for ordered-neighborhood graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 4051–4060, 2018.

[9] F. Fouss, L. Yen, A. Pirotte, and M. Saerens. An experimental investigation of graph kernels on a collaborative recommendation task. In Proceedings of the International Conference on Data Mining (ICDM), pages 863–868, 2006.

[10] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer Series in Statistics, New York, 2001.

[11] C. L. Giles, K. D. Bollacker, and S. Lawrence. Citeseer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, pages 89–98, 1998.

[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning (ICML), pages 1263–1272, 2017.

[13] A. Grover and J. Leskovec.
node2vec: Scalable feature learning for networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 855–864, 2016.

[14] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.

[15] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.

[16] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[17] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[18] T. Horváth, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 158–167, 2004.

[19] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[20] N. M. Kriege, M. Neumann, C. Morris, K. Kersting, and P. Mutzel. A unifying view of explicit and implicit feature maps for structured data: systematic studies of graph kernels. arXiv preprint arXiv:1703.00676, 2017.

[21] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli. Graph matching networks for learning the similarity of graph structured objects. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[22] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.

[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pages 3111–3119, 2013.

[24] K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.

[25] M. Neuhaus and H. Bunke. Self-organizing maps for learning the edit costs in graph matching. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(3):503–514, 2005.

[26] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 701–710, 2014.

[27] J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In Proceedings of the International Workshop on Mining Graphs, Trees and Sequences, pages 65–74, 2003.

[28] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 385–394, 2017.

[29] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

[30] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.

[31] N. Shervashidze and K. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 1660–1668, 2009.

[32] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

[33] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison.
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488–495, 2009.

[34] K. Shin and T. Kuboyama. A generalization of Haussler's convolution kernel: mapping kernel. In Proceedings of the International Conference on Machine Learning (ICML), pages 944–951, 2008.

[35] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the International Conference on World Wide Web (WWW), pages 1067–1077, 2015.

[36] Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas. Quantized densely connected U-Nets for efficient landmark localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 339–354, 2018.

[37] Y. Tian, X. Peng, L. Zhao, S. Zhang, and D. N. Metaxas. CR-GAN: Learning complete representations for multi-view generation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 942–948, 2018.

[38] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[39] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

[40] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[41] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[42] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition.
In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.

[43] P. Yanardag and S. Vishwanathan. Deep graph kernels. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1365–1374, 2015.

[44] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the International Conference on Machine Learning (ICML), pages 40–48, 2016.

[45] J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems (NeurIPS), pages 6410–6421, 2018.

[46] L. Zhang, H. Song, and H. Lu. Graph node-feature convolution for representation learning. arXiv preprint arXiv:1812.00086, 2018.

[47] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.

[48] Z. Zhang, M. Wang, Y. Xiang, Y. Huang, and A. Nehorai. RetGK: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems (NeurIPS), pages 3964–3974, 2018.

[49] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. Metaxas. Learning to forecast and refine residual motion for image-to-video generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 387–403, 2018.

[50] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas. Semantic graph convolutional networks for 3D human pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3425–3435, 2019.

[51] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal.
A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.