{"title": "On Valid Optimal Assignment Kernels and Applications to Graph Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1623, "page_last": 1631, "abstract": "The success of kernel methods has initiated the design of novel positive semidefinite functions, in particular for structured data. A leading design paradigm for this is the convolution kernel, which decomposes structured objects into their parts and sums over all pairs of parts. Assignment kernels, in contrast, are obtained from an optimal bijection between parts, which can provide a more valid notion of similarity. In general however, optimal assignments yield indefinite functions, which complicates their use in kernel methods. We characterize a class of base kernels used to compare parts that guarantees positive semidefinite optimal assignment kernels. These base kernels give rise to hierarchies from which the optimal assignment kernels are computed in linear time by histogram intersection. We apply these results by developing the Weisfeiler-Lehman optimal assignment kernel for graphs. It provides high classification accuracy on widely-used benchmark data sets improving over the original Weisfeiler-Lehman kernel.", "full_text": "On Valid Optimal Assignment Kernels and\n\nApplications to Graph Classi\ufb01cation\n\nNils M. Kriege\n\nDepartment of Computer Science\n\nTU Dortmund, Germany\n\nnils.kriege@tu-dortmund.de\n\nPierre-Louis Giscard\n\npierre-louis.giscard@york.ac.uk\n\nDepartment of Computer Science\n\nUniversity of York, UK\n\nRichard C. Wilson\n\nDepartment of Computer Science\n\nUniversity of York, UK\n\nrichard.wilson@york.ac.uk\n\nAbstract\n\nThe success of kernel methods has initiated the design of novel positive semidef-\ninite functions, in particular for structured data. A leading design paradigm for\nthis is the convolution kernel, which decomposes structured objects into their parts\nand sums over all pairs of parts. 
Assignment kernels, in contrast, are obtained from an optimal bijection between parts, which can provide a more valid notion of similarity. In general, however, optimal assignments yield indefinite functions, which complicates their use in kernel methods. We characterize a class of base kernels used to compare parts that guarantees positive semidefinite optimal assignment kernels. These base kernels give rise to hierarchies from which the optimal assignment kernels are computed in linear time by histogram intersection. We apply these results by developing the Weisfeiler-Lehman optimal assignment kernel for graphs. It provides high classification accuracy on widely-used benchmark data sets, improving over the original Weisfeiler-Lehman kernel.

1 Introduction

The various existing kernel methods can conveniently be applied to any type of data for which a kernel is available that adequately measures the similarity between any two data objects. This includes structured data like images [2, 5, 11], 3D shapes [1], chemical compounds [8] and proteins [4], which are often represented by graphs. Most kernels for structured data decompose both objects and add up the pairwise similarities between their parts, following the seminal concept of convolution kernels proposed by Haussler [12]. In fact, many graph kernels can be seen as instances of convolution kernels under different decompositions [23].

A fundamentally different approach with good prospects is to assign the parts of one object to the parts of the other such that the total similarity between the assigned parts is as large as possible. Finding such a bijection is known as the assignment problem and is well studied in combinatorial optimization [6]. This approach has been successfully applied to graph comparison, e.g., in general graph matching [9, 17] as well as in kernel-based classification [8, 18, 1].
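The optimal assignment similarity just described can be computed with the Hungarian method. The following is an illustrative sketch, not the construction developed later in this paper; it assumes NumPy and SciPy are available, and the shared-prefix base kernel is a toy example:

```python
# Illustrative sketch: maximum total base-kernel similarity over all
# bijections between two equally-sized sets of parts, solved with the
# Hungarian method (scipy.optimize.linear_sum_assignment).
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment_similarity(X, Y, k):
    """Maximize sum of k(x, y) over all bijections X -> Y."""
    K = np.array([[k(x, y) for y in Y] for x in X])
    row, col = linear_sum_assignment(K, maximize=True)
    return K[row, col].sum()

# Toy base kernel on strings: length of the shared prefix.
def k(x, y):
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n

print(optimal_assignment_similarity(["ab", "ac"], ["ad", "ab"], k))  # -> 3
```

With a general cubic-time solver like this, the score is well defined, but, as discussed next, it is not necessarily positive semidefinite.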
In contrast to convolution kernels, assignments establish structural correspondences and thereby alleviate the problem of diagonal dominance at the same time. However, the similarities derived in this way are not necessarily positive semidefinite (p.s.d.) [22, 23] and hence do not give rise to valid kernels, severely limiting their use in kernel methods.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our goal in this paper is to consider a particular class of base kernels which give rise to valid assignment kernels. In the following we use the term valid to mean a kernel which is symmetric and positive semidefinite. We formalize the considered problem: Let [X]^n denote the set of all n-element subsets of a set X and B(X, Y) the set of all bijections between X, Y in [X]^n for n ∈ N. We study the optimal assignment kernel K^k_B on [X]^n defined as

K^k_B(X, Y) = max_{B ∈ B(X,Y)} W(B), where W(B) = ∑_{(x,y) ∈ B} k(x, y),   (1)

and k is a base kernel on X. For clarity of presentation we assume n to be fixed. In order to apply the kernel to sets of different cardinality, we may fill up the smaller set by new objects z with k(z, x) = 0 for all x ∈ X without changing the result.

Related work. Correspondence problems have been extensively studied in object recognition, where objects are represented by sets of features, often called a bag of words. Grauman and Darrell proposed the pyramid match kernel that seeks to approximate correspondences between points in R^d by employing a space-partitioning tree structure and counting how often points fall into the same bin [11]. An adaptive partitioning with non-uniformly shaped bins was used to improve the approximation quality in high dimensions [10].

For non-vectorial data, Fröhlich et al. 
[8] proposed kernels for graphs derived from an optimal assignment between their vertices and applied the approach to molecular graphs. However, it was shown that the resulting similarity measure is not necessarily a valid kernel [22]. Therefore, Vishwanathan et al. [23] proposed a theoretically well-founded variation of the kernel, which essentially replaces the max-function in Eq. (1) by a soft-max function. Besides introducing an additional parameter, which must be chosen carefully to avoid numerical difficulties, the approach requires the evaluation of a sum over all possible assignments instead of finding a single optimal one. This leads to an increase in running time from cubic to factorial, which is infeasible in practice. Pachauri et al. [16] considered the problem of finding optimal assignments between multiple sets. The problem is equivalent to finding a permutation of the elements of every set, such that assigning the i-th elements to each other yields an optimal result. Solving this problem allows the derivation of valid kernels between pairs of sets with a fixed ordering. This approach was referred to as the transitive assignment kernel in [18] and employed for graph classification. However, this not only leads to non-optimal assignments between individual pairs of graphs but also suffers from high computational costs. Johansson and Dubhashi [14] derived kernels from optimal assignments by first sampling a fixed set of so-called landmarks. Each data point is then represented by a feature vector, where each component is the optimal assignment similarity to a landmark.

Various general approaches to cope with indefinite kernels have been proposed, in particular for support vector machines [see 15, and references therein]. Such approaches should principally be used in applications where similarities cannot be expressed by positive semidefinite kernels.

Our contribution. 
We study optimal assignment kernels in more detail and investigate which base kernels lead to valid optimal assignment kernels. We characterize a specific class of kernels we refer to as strong and show that strong kernels are equivalent to kernels obtained from a hierarchical partition of the domain of the kernel. We show that for strong base kernels the optimal assignment (i) yields a valid kernel; and (ii) can be computed in linear time given the associated hierarchy. While the computation reduces to histogram intersection similar to the pyramid match kernel [11], our approach is in no way restricted to specific objects like points in R^d. We demonstrate the versatility of our results by deriving novel graph kernels based on optimal assignments, which are shown to improve over their convolution-based counterparts. In particular, we propose the Weisfeiler-Lehman optimal assignment kernel, which performs favourably compared to state-of-the-art graph kernels on a wide range of data sets.

2 Preliminaries

Before continuing with our contribution, we begin by introducing some key notation for kernels and trees which will be used later. A (valid) kernel on a set X is a function k : X × X → R such that there is a real Hilbert space H and a mapping φ : X → H such that k(x, y) = ⟨φ(x), φ(y)⟩ for all x, y in X, where ⟨·,·⟩ denotes the inner product of H. We call φ a feature map, and H a feature space. Equivalently, a function k : X × X → R is a kernel if and only if for every subset {x1, . . . , xn} ⊆ X the n × n matrix defined by [m]_{i,j} = k(x_i, x_j) is p.s.d. The Dirac kernel k_δ is defined by k_δ(x, y) = 1 if x = y, and 0 otherwise.

We consider simple undirected graphs G = (V, E), where V(G) = V is the set of vertices and E(G) = E the set of edges. 
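The Gram-matrix characterization above lends itself to a quick numerical sanity check. A minimal sketch, assuming NumPy; the tolerance and the test objects are illustrative choices:

```python
# A function k is a valid kernel iff every Gram matrix
# [m]_{i,j} = k(x_i, x_j) is p.s.d., i.e. has no eigenvalue
# below zero (up to numerical tolerance).
import numpy as np

def is_psd(gram, tol=1e-9):
    gram = np.asarray(gram, dtype=float)
    return bool(np.all(np.linalg.eigvalsh(gram) >= -tol))

def dirac(x, y):
    return 1.0 if x == y else 0.0

objs = ["a", "b", "b", "c"]
gram = np.array([[dirac(x, y) for y in objs] for x in objs])
print(is_psd(gram))  # the Dirac kernel's Gram matrix is p.s.d. -> True
```

`eigvalsh` is appropriate here because Gram matrices are symmetric; a single eigenvalue below the tolerance certifies that a similarity function is not a valid kernel.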
An edge {u, v} is denoted by uv or vu for short, where both refer to the same edge. A graph with a unique path between any two vertices is a tree. A rooted tree is a tree T with a distinguished vertex r ∈ V(T) called the root. The vertex following v on the path to the root r is called the parent of v and denoted by p(v), where p(r) = r. The vertices on this path are called ancestors of v, and the depth of v is the number of edges on the path. The lowest common ancestor LCA(u, v) of two vertices u and v in a rooted tree is the unique vertex with maximum depth that is an ancestor of both u and v.

3 Strong kernels and hierarchies

In this section we introduce a restricted class of kernels that will later turn out to lead to valid optimal assignment kernels when employed as base kernels. We provide two different characterizations of this class, one in terms of an inequality constraint on the kernel values, and the other by means of a hierarchy defined on the domain of the kernel. The latter will provide the basis for our algorithm to compute valid optimal assignment kernels efficiently.

We first consider similarity functions fulfilling the requirement that for any two objects there is no third object that is more similar to each of them than the two are to each other. We will see later in Section 3.1 that every such function is indeed p.s.d. and hence a valid kernel.

Definition 1 (Strong Kernel). A function k : X × X → R≥0 is called a strong kernel if k(x, y) ≥ min{k(x, z), k(z, y)} for all x, y, z ∈ X.

Note that a strong kernel requires that every object is most similar to itself, i.e., k(x, x) ≥ k(x, y) for all x, y ∈ X.

In the following we introduce a restricted class of kernels that is derived from a hierarchy on the set X. As we will see later in Theorem 1, this class of kernels is equivalent to strong kernels according to Definition 1. 
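On a small finite domain, Definition 1 can be verified by brute force. A sketch; the concrete kernel values are chosen to be consistent with the example of Fig. 2 and are assumed here for illustration:

```python
# Brute-force check of Definition 1:
# k is strong iff k(x, y) >= min(k(x, z), k(z, y)) for all x, y, z.
from itertools import product

def is_strong(objs, k):
    return all(k(x, y) >= min(k(x, z), k(z, y))
               for x, y, z in product(objs, repeat=3))

# Kernel values matching Fig. 2: k(a,a)=4, k(b,b)=5, k(c,c)=2,
# k(a,b)=3, k(a,c)=k(b,c)=1.
vals = {("a", "a"): 4, ("b", "b"): 5, ("c", "c"): 2,
        ("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 1}
k = lambda x, y: vals.get((x, y), vals.get((y, x)))

print(is_strong("abc", k))  # -> True
```

Lowering k(a, b) below min{k(a, c), k(c, b)} breaks the inequality, so the check also detects non-strong kernels.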
Such hierarchies can be systematically constructed on sets of arbitrary objects in order to derive strong kernels. We commence by fixing the concept of a hierarchy formally. Let T be a rooted tree such that the leaves of T are the elements of X. Each inner vertex v in T corresponds to a subset of X comprising all leaves of the subtree rooted at v. Therefore the tree T defines a family of nested subsets of X. Let w : V(T) → R≥0 be a weight function such that w(v) ≥ w(p(v)) for all v in T. We refer to the tuple (T, w) as a hierarchy.

Definition 2 (Hierarchy-induced Kernel). Let H = (T, w) be a hierarchy on X; then the function defined as k(x, y) = w(LCA(x, y)) for all x, y in X is the kernel on X induced by H.

We show that Definitions 1 and 2 characterize the same class of kernels.

Lemma 1. Every kernel on X that is induced by a hierarchy on X is strong.

Proof. Assume there is a hierarchy (T, w) that induces a kernel k that is not strong. Then there are x, y, z ∈ X with k(x, y) < min{k(x, z), k(z, y)} and three vertices a = LCA(x, z), b = LCA(z, y) and c = LCA(x, y) with w(c) < w(a) and w(c) < w(b). The unique path from x to the root contains a and the path from y to the root contains b; both paths contain c. Since weights decrease along paths, the assumption implies that a, b, c are pairwise distinct and c is an ancestor of a and b. Thus, there must be a path from z via a to c and another path from z via b to c. Hence, T is not a tree, contradicting the assumption.

We show constructively that the converse holds as well.

Lemma 2. For every strong kernel k on X there is a hierarchy on X that induces k.

Proof (Sketch). We incrementally construct a hierarchy on X that induces k by successive insertion of elements from X. In each step the hierarchy induces k restricted to the inserted elements and eventually induces k after insertion of all elements. 
Initially, we start with a hierarchy containing just one element x ∈ X with w(x) = k(x, x). The key to all following steps is that there is a unique way to extend the hierarchy: Let X_i ⊆ X be the first i elements in the order of insertion and let H_i = (T_i, w_i) be the hierarchy after the i-th step. A leaf representing the next element z can be grafted onto H_i to form a hierarchy H_{i+1} that induces k restricted to X_{i+1} = X_i ∪ {z}. Let B = {x ∈ X_i : k(x, z) = k_max}, where k_max = max_{y ∈ X_i} k(y, z). There is a unique vertex b such that B are the leaves of the subtree rooted at b, cf. Fig. 1. We obtain H_{i+1} by inserting a new vertex p with child z into T_i, such that p becomes the parent of b, cf. Fig. 1(b), (c). We set w_{i+1}(p) = k_max, w_{i+1}(z) = k(z, z) and w_{i+1}(x) = w_i(x) for all x ∈ V(T_i). Let k′ be the kernel induced by H_{i+1}. Clearly, k′(x, y) = k(x, y) for all x, y ∈ X_i. According to the construction, k′(z, x) = k_max = k(z, x) for all x ∈ B. For all x ∉ B we have LCA(z, x) = LCA(c, x) for any c ∈ B, see Fig. 1(b). For strong kernels, k(x, c) ≥ min{k(x, z), k(z, c)} = k(x, z) and k(x, z) ≥ min{k(x, c), k(c, z)} = k(x, c), since k(c, z) = k_max. Thus k(z, x) = k(c, x) must hold and consequently k′(z, x) = k(z, x).

Figure 1: Illustrative example for the construction of the hierarchy on i + 1 objects (b), (c) from the hierarchy on i objects (a), following the procedure used in the proof of Lemma 2; the panels show (a) H_i, (b) H_{i+1} for B = {b1, b2, b3}, and (c) H_{i+1} for |B| = 1. The inserted leaf z is highlighted in red, its parent p with weight w(p) = k_max in green and b in blue, respectively.

Note that a hierarchy inducing a specific strong kernel is not unique: Adjacent inner vertices with the same weight can be merged, and vertices with just one child can be removed without changing the induced kernel. Combining Lemmas 1 and 2 we obtain the following result.

Theorem 1. A kernel k on X is strong if and only if it is induced by a hierarchy on X.

As a consequence of the above theorem, the number of values a strong kernel on n objects may take is bounded by the number of vertices in a binary tree with n leaves, i.e., for every strong kernel k on X we have |img(k)| ≤ 2|X| − 1. The Dirac kernel is a common example of a strong kernel; in fact, every kernel k : X × X → R≥0 with |img(k)| = 2 is strong.

The definition of a strong kernel and its relation to hierarchies is reminiscent of related concepts for distances: A metric d on X is an ultrametric if d(x, y) ≤ max{d(x, z), d(z, y)} for all x, y, z ∈ X. For every ultrametric d on X there is a rooted tree T with leaves X and edge weights such that (i) d is the path length between leaves in T, and (ii) the path lengths from a leaf to the root are all equal. Indeed, every ultrametric can be embedded into a Hilbert space [13] and thus the associated inner product is a valid kernel. Moreover, it can be shown that this inner product is always a strong kernel. However, the concept of strong kernels is more general: there are strong kernels k such that the associated kernel metric d_k(x, y) = ‖φ(x) − φ(y)‖ is not an ultrametric. The distinction originates from the self-similarities, which in strong kernels can be arbitrary provided that they fulfil k(x, x) ≥ k(x, y) for all x, y in X. This degree of freedom is lost when considering distances. If we require all self-similarities of a strong kernel to be equal, then the associated kernel metric is always an ultrametric. Consequently, strong kernels correspond to a superset of ultrametrics. 
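A hierarchy-induced kernel (Definition 2) is straightforward to implement with LCA lookups. A sketch using the hierarchy of Fig. 2, whose parent pointers and weights are assumed here for illustration:

```python
# Definition 2: k(x, y) = w(LCA(x, y)) for a hierarchy given by
# parent pointers and weights w with w(v) >= w(p(v)).
parent = {"a": "v", "b": "v", "c": "r", "v": "r", "r": "r"}
w = {"a": 4, "b": 5, "c": 2, "v": 3, "r": 1}

def path_to_root(x):
    path = [x]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path

def lca(x, y):
    anc = set(path_to_root(x))
    # First ancestor of y that is also an ancestor of x.
    return next(v for v in path_to_root(y) if v in anc)

def k(x, y):
    return w[lca(x, y)]

print(k("a", "b"), k("a", "c"), k("a", "a"))  # -> 3 1 4
```

By Lemma 1 every kernel produced this way is strong, which a brute-force check over all triples confirms on this example.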
We explicitly define a feature space for general strong kernels in the following.

3.1 Feature maps of strong kernels

We use the property that every strong kernel is induced by a hierarchy to derive feature vectors for strong kernels. Let (T, w) be a hierarchy on X that induces the strong kernel k. We define the additive weight function ω : V(T) → R≥0 as ω(v) = w(v) − w(p(v)) and ω(r) = w(r) for the root r. Note that the property of a hierarchy assures that the difference is non-negative. For v ∈ V(T) let P(v) ⊆ V(T) denote the vertices in T on the path from v to the root r. We consider the mapping φ : X → R^t, where t = |V(T)| and the components indexed by v ∈ V(T) are

[φ(x)]_v = √ω(v) if v ∈ P(x), and 0 otherwise.

Figure 2: The matrix of a strong kernel on three objects (a) induced by the hierarchy (b) and the derived feature vectors (c), namely φ(a) = (√1, √2, √1, 0, 0)^⊤, φ(b) = (√1, √2, 0, √2, 0)^⊤ and φ(c) = (√1, 0, 0, 0, √1)^⊤. A vertex u in (b) is annotated by its weights w(u); ω(u).

Proposition 1. Let k be a strong kernel on X. The function φ defined as above is a feature map of k, i.e., k(x, y) = φ(x)^⊤ φ(y) for all x, y ∈ X.

Proof. Given arbitrary x, y ∈ X, let c = LCA(x, y). The dot product yields

φ(x)^⊤ φ(y) = ∑_{v ∈ V(T)} [φ(x)]_v [φ(y)]_v = ∑_{v ∈ P(c)} (√ω(v))^2 = w(c) = k(x, y),

since according to the definition the only non-zero products contributing to the sum over v ∈ V(T) are those in P(x) ∩ P(y) = P(c).

Figure 2 shows an example of a strong kernel, an associated hierarchy and the derived feature vectors. As a consequence of Theorem 1 and Proposition 1, strong kernels according to Definition 1 are indeed valid kernels.

4 Valid kernels from optimal assignments

We consider the function K^k_B on [X]^n according to Eq. (1) under the assumption that the base kernel k is strong. Let (T, w) be a hierarchy on X which induces k. For a vertex v ∈ V(T) and a set X ⊆ X, we denote by X_v the subset of X that is contained in the subtree rooted at v. We define the histogram H^k of a set X ∈ [X]^n w.r.t. the strong base kernel k as H^k(X) = ∑_{x ∈ X} φ(x) ∘ φ(x), where φ is the feature map of the strong base kernel according to Section 3.1 and ∘ denotes the element-wise product. Equivalently, [H^k(X)]_v = ω(v) · |X_v| for v ∈ V(T). The histogram intersection kernel [20] is defined as K_∩(g, h) = ∑_{i=1}^{t} min{[g]_i, [h]_i} for t ∈ N, and is known to be a valid kernel on R^t [2, 5].

Theorem 2. Let k be a strong kernel on X and the histograms H^k be defined as above; then K^k_B(X, Y) = K_∩(H^k(X), H^k(Y)) for all X, Y ∈ [X]^n.

Proof. Let (T, w) be a hierarchy inducing the strong base kernel k. We rewrite the weight of an assignment B as a sum of weights of vertices in T. 
Since k(x, y) = w(LCA(x, y)) = ∑_{v ∈ P(x) ∩ P(y)} ω(v), we have

W(B) = ∑_{(x,y) ∈ B} k(x, y) = ∑_{v ∈ V(T)} c_v · ω(v),

where c_v counts how often v appears simultaneously in P(x) and P(y) in total for all (x, y) ∈ B. For the histogram intersection kernel we obtain

K_∩(H^k(X), H^k(Y)) = ∑_{v ∈ V(T)} min{ω(v) · |X_v|, ω(v) · |Y_v|} = ∑_{v ∈ V(T)} min{|X_v|, |Y_v|} · ω(v).

Since every assignment B ∈ B(X, Y) is a bijection, each x ∈ X and y ∈ Y appears only once in B and c_v ≤ min{|X_v|, |Y_v|} follows. It remains to show that the above inequality is tight for an optimal assignment. We construct such an assignment by the following greedy approach: We perform a bottom-up traversal on the hierarchy starting with the leaves. For every vertex v in the hierarchy we arbitrarily pair the objects in X_v and Y_v that are not yet contained in the assignment. Note that no element in X_v has been assigned to an element in Y ∖ Y_v, and no element in Y_v to an element from X ∖ X_v. Hence, at every vertex v we have c_v = min{|X_v|, |Y_v|} vertices from X_v assigned to vertices in Y_v.

Figure 3: An assignment instance (a) for X, Y ∈ [X]^5 and the derived histograms (b). The set X contains three distinct vertices labelled a and the set Y two distinct vertices labelled b and c. Taking the multiplicities into account, the histograms are obtained from the hierarchy of the base kernel k depicted in Fig. 2. The optimal assignment yields a value of K^k_B(X, Y) = 15, where grey, green, brown, red and orange edges have weight 1, 2, 3, 4 and 5, respectively. The histogram intersection kernel gives K_∩(H^k(X), H^k(Y)) = min{5, 5} + min{8, 6} + min{3, 1} + min{2, 4} + min{1, 2} = 15.

Figure 3 illustrates the relation between the optimal assignment kernel employing a strong base kernel and the histogram intersection kernel. Note that a vertex v ∈ V(T) with ω(v) = 0 does not contribute to the histogram intersection kernel and can be omitted. In particular, for any two objects x1, x2 ∈ X with k(x1, y) = k(x2, y) for all y ∈ X we have ω(x1) = ω(x2) = 0. There is no need to explicitly represent such leaves in the hierarchy, yet their multiplicity must be considered to determine the number of leaves in the subtree rooted at an inner vertex, cf. Fig. 2, 3.

Corollary 1. If the base kernel k is strong, then the function K^k_B is a valid kernel.

Theorem 2 implies not only that optimal assignments give rise to valid kernels for strong base kernels, but also allows computing them by histogram intersection. Provided that the hierarchy is known, bottom-up computation of histograms and their intersection can both be performed in linear time, while the general Hungarian method would require cubic time to solve the assignment problem [6].

Corollary 2. Given a hierarchy inducing k, K^k_B(X, Y) can be computed in time O(|X| + |Y|).

5 Graph kernels from optimal assignments

The concept of optimal assignment kernels is rather general and can be applied to derive kernels on various structures. In this section we apply our results to obtain novel graph kernels, i.e., kernels of the form K : G × G → R, where G denotes the set of graphs. We assume that every vertex v is equipped with a categorical label given by τ(v). 
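Before instantiating these results on graphs, Theorem 2 can be sanity-checked numerically. An illustrative sketch, assuming NumPy and SciPy; the hierarchy of Fig. 2 and the multisets of Fig. 3 are assumed here for illustration:

```python
# Check that the optimal assignment kernel (Hungarian method) equals
# the histogram intersection kernel on the Fig. 3 instance.
import numpy as np
from scipy.optimize import linear_sum_assignment

parent = {"a": "v", "b": "v", "c": "r", "v": "r", "r": "r"}
w = {"a": 4, "b": 5, "c": 2, "v": 3, "r": 1}
# Additive weights: omega(v) = w(v) - w(p(v)), omega(root) = w(root).
omega = {u: w[u] - (w[parent[u]] if parent[u] != u else 0) for u in w}

def path(x):
    p = [x]
    while parent[p[-1]] != p[-1]:
        p.append(parent[p[-1]])
    return p

def k(x, y):  # hierarchy-induced strong base kernel
    anc = set(path(x))
    return w[next(u for u in path(y) if u in anc)]

def histogram(X):
    h = {u: 0 for u in w}
    for x in X:
        for u in path(x):
            h[u] += omega[u]
    return h

def K_histogram(X, Y):
    hX, hY = histogram(X), histogram(Y)
    return sum(min(hX[u], hY[u]) for u in w)

def K_hungarian(X, Y):
    M = np.array([[k(x, y) for y in Y] for x in X])
    r, c = linear_sum_assignment(M, maximize=True)
    return int(M[r, c].sum())

X, Y = list("aaabc"), list("abbcc")
print(K_histogram(X, Y), K_hungarian(X, Y))  # -> 15 15
```

The histogram route touches each object and its root path once, matching the linear-time claim of Corollary 2, while the Hungarian route is cubic.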
Labels typically arise from applications, e.g., in a graph representing a chemical compound the labels may indicate atom types.

5.1 Optimal assignment kernels on vertices and edges

As a baseline we propose graph kernels on vertices and edges. The vertex optimal assignment kernel (V-OA) is defined as K(G, H) = K^k_B(V(G), V(H)), where k is the Dirac kernel on vertex labels. Analogously, the edge optimal assignment kernel (E-OA) is given by K(G, H) = K^k_B(E(G), E(H)), where we define k(uv, st) = 1 if at least one of the mappings (u ↦ s, v ↦ t) and (u ↦ t, v ↦ s) maps vertices with the same label only, and 0 otherwise. Since these base kernels are Dirac kernels, they are strong and, consequently, V-OA and E-OA are valid kernels.

5.2 Weisfeiler-Lehman optimal assignment kernels

Weisfeiler-Lehman kernels are based on iterative vertex colour refinement and have been shown to provide state-of-the-art prediction performance in experimental evaluations [19]. These kernels employ the classical 1-dimensional Weisfeiler-Lehman heuristic for graph isomorphism testing and consider subtree patterns encoding the neighbourhood of each vertex up to a given distance. For a parameter h and a graph G with initial labels τ, a sequence (τ0, . . . , τh) of refined labels referred to as colours is computed, where τ0 = τ and τi is obtained from τi−1 by the following procedure: Sort the multiset of colours {τi−1(u) : vu ∈ E(G)} for every vertex v lexicographically to obtain a unique sequence of colours and add τi−1(v) as its first element. Assign a new colour τi(v) to every vertex v by employing a one-to-one mapping from sequences to new colours.

Figure 4: A graph G with uniform initial colours τ0 and refined colours τi for i ∈ {1, . . . , 3} (a), the feature vector of G for the Weisfeiler-Lehman subtree kernel (b) and the associated hierarchy (c). Note that the vertices of G are the leaves of the hierarchy, although not shown explicitly in Fig. 4(c).

Figure 4(a) illustrates the refinement process. The Weisfeiler-Lehman subtree kernel (WL) counts the vertex colours two graphs have in common in the first h refinement steps and can be computed by taking the dot product of feature vectors, where each component counts the occurrences of a colour, see Fig. 4(b).

We propose the Weisfeiler-Lehman optimal assignment kernel (WL-OA), which is defined on the vertices like V-OA, but employs the non-trivial base kernel

k(u, v) = ∑_{i=0}^{h} k_δ(τi(u), τi(v)).   (2)

This base kernel corresponds to the number of matching colours in the refinement sequence. More intuitively, the base kernel value reflects to what extent the two vertices have a similar neighbourhood. Let V be the set of all vertices of graphs in G; we show that the refinement process defines a hierarchy on V which induces the base kernel of Eq. (2). Each vertex colouring τi naturally partitions V into colour classes, i.e., sets of vertices with the same colour. Since the refinement takes the colour τi(v) of a vertex v into account when computing τi+1(v), the implication τi(u) ≠ τi(v) ⇒ τi+1(u) ≠ τi+1(v) holds for all u, v ∈ V. Hence, the colour classes induced by τi+1 are at least as fine as those induced by τi. Moreover, the sequence (τi)0≤i≤h gives rise to a family of nested subsets, which can naturally be represented by a hierarchy (T, w), see Fig. 4(c) for an illustration. When assuming ω(v) = 1 for all vertices v ∈ V(T), the hierarchy induces the kernel of Eq. (2). 
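The refinement procedure and the base kernel of Eq. (2) can be sketched in a few lines. An illustrative sketch: the toy path graph, the uniform initial labels and the integer relabelling scheme are assumptions, and a full implementation must relabel consistently across all graphs being compared:

```python
# 1-dimensional Weisfeiler-Lehman colour refinement and the
# colour-counting base kernel of Eq. (2).
def wl_refine(adj, tau0, h):
    """Return the colour sequence (tau_0, ..., tau_h) for each vertex."""
    colours = [dict(tau0)]
    for _ in range(h):
        prev = colours[-1]
        # New signature = own colour plus sorted multiset of neighbour colours.
        sig = {v: (prev[v], tuple(sorted(prev[u] for u in adj[v])))
               for v in adj}
        # One-to-one mapping from signatures to fresh integer colours.
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colours.append({v: relabel[sig[v]] for v in adj})
    return colours

def base_kernel(colours, u, v):
    """Eq. (2): number of refinement steps in which u and v share a colour."""
    return sum(1 for tau in colours if tau[u] == tau[v])

# Toy graph: a path 0-1-2-3 with uniform initial labels.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
colours = wl_refine(adj, {v: 0 for v in adj}, h=2)
print(base_kernel(colours, 0, 3), base_kernel(colours, 0, 1))  # -> 3 1
```

On the path graph, the two endpoints keep matching colours through every refinement step, while an endpoint and an inner vertex separate after the first step, matching the intuition that the base kernel measures neighbourhood similarity.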
We have shown that the base kernel is strong, and it follows from Corollary 1 that WL-OA is a valid kernel. Moreover, it can be computed from the feature vectors of the Weisfeiler-Lehman subtree kernel in linear time by histogram intersection, cf. Theorem 2.

6 Experimental evaluation

We report on the experimental evaluation of the proposed graph kernels derived from optimal assignments and compare them with state-of-the-art convolution kernels.

6.1 Method and experimental setup

We performed classification experiments using the C-SVM implementation LIBSVM [7]. We report mean prediction accuracies and standard deviations obtained by 10-fold cross-validation repeated 10 times with random fold assignment. Within each fold all necessary parameters were selected by cross-validation based on the training set. This includes the regularization parameter C, kernel parameters where applicable and whether to normalize the kernel matrix. All kernels were implemented in Java and experiments were conducted using Oracle Java v1.8.0 on an Intel Core i7-3770 CPU at 3.4 GHz (Turbo Boost disabled) with 16 GB of RAM using a single processor only.

Kernels. As a baseline we implemented the vertex kernel (V) and edge kernel (E), which are the dot products on vertex and edge label histograms, respectively, where an edge label consists of the labels of its endpoints. V-OA and E-OA are the related optimal assignment kernels as described in Sec. 5.1. 
Table 1: Classification accuracies and standard deviations on graph data sets representing small molecules, macromolecules and social networks.

Kernel   MUTAG     PTC-MR    NCI1      NCI109    PROTEINS  D&D       ENZYMES   COLLAB    REDDIT
V        85.4±0.7  57.8±0.9  64.6±0.1  63.6±0.2  71.9±0.4  78.2±0.4  23.4±1.1  56.2±0.0  75.3±0.1
V-OA     82.5±1.1  56.4±1.8  65.6±0.3  65.1±0.4  73.8±0.5  78.8±0.3  35.1±1.1  59.3±0.1  77.8±0.1
E        85.2±0.6  57.3±0.7  66.2±0.1  64.9±0.1  73.5±0.2  78.3±0.5  27.4±0.8  52.0±0.0  75.1±0.1
E-OA     81.0±1.1  56.3±1.7  68.9±0.3  68.7±0.2  74.5±0.6  79.0±0.4  37.4±1.8  68.2±0.3  79.8±0.2
WL       86.0±1.7  61.3±1.4  85.8±0.2  85.9±0.3  75.6±0.4  79.0±0.4  53.7±1.4  79.1±0.1  80.8±0.4
WL-OA    84.5±1.7  63.6±1.5  86.1±0.2  86.3±0.2  76.4±0.4  79.2±0.4  59.9±1.1  80.7±0.1  89.3±0.3
GL       85.2±0.9  54.7±2.0  70.5±0.2  69.3±0.2  72.7±0.6  79.7±0.7  30.6±1.2  64.7±0.1  60.1±0.2
SP       83.0±1.4  58.9±2.2  74.5±0.3  73.0±0.3  75.8±0.5  79.0±0.6  42.6±1.6  58.8±0.2  84.6±0.2

For the Weisfeiler-Lehman kernels WL and WL-OA, see Section 5.2, the parameter h was chosen from {0, . . . , 7}. In addition we implemented a graphlet kernel (GL) and the shortest-path kernel (SP) [3]. GL is based on connected subgraphs with three vertices taking labels into account, similar to the approach used in [19]. For SP we used the Dirac kernel to compare path lengths and computed the kernel by explicit feature maps, cf. [19]. 
Note that all kernels not identified as optimal assignment kernels by the suffix OA are convolution kernels.

Data sets. We tested on widely-used graph classification benchmarks from different domains [cf. 4, 23, 19, 24]: MUTAG, PTC-MR, NCI1 and NCI109 are graphs derived from small molecules, PROTEINS, D&D and ENZYMES represent macromolecules, and COLLAB and REDDIT are derived from social networks.1 All data sets have two class labels except ENZYMES and COLLAB, which are divided into six and three classes, respectively. The social network graphs are unlabelled and we considered all vertices uniformly labelled. All other graph data sets come with vertex labels. Edge labels, if present, were ignored since they are not supported by all graph kernels under comparison.

6.2 Results and discussion

Table 1 summarizes the classification accuracies. We observe that on most data sets the optimal assignment kernels improve over the prediction accuracy obtained by their convolution-based counterparts. The only clear exception is MUTAG. The extent of improvement on the other data sets varies, but is particularly remarkable for ENZYMES and REDDIT. This indicates that optimal assignment kernels provide a more valid notion of similarity than convolution kernels for these classification tasks. The most successful kernel is WL-OA, which almost consistently improves over WL and performs best on seven of the nine data sets. WL-OA provides the second best accuracy on D&D and ranks in the middle of the field for MUTAG. For these two data sets the difference in accuracy between the kernels is small and even the baseline kernels perform notably well.

The time to compute the quadratic kernel matrix was less than one minute for all kernels and data sets, with the exception of SP on D&D (29 min) and REDDIT (2 h) as well as GL on COLLAB (28 min). 
The running time to compute the optimal assignment kernels by histogram intersection was consistently on par with the running time required for the related convolution kernels and orders of magnitude faster than their computation by the Hungarian method.

7 Conclusions and future work

We have characterized the class of strong kernels leading to valid optimal assignment kernels and derived novel effective kernels for graphs. The reduction to histogram intersection makes efficient computation possible and known speed-up techniques for intersection kernels can directly be applied (see, e.g., [21] and references therein). We believe that our results may form the basis for the design of new kernels, which can be computed efficiently and adequately measure similarity.

1The data sets, further references and statistics are available from http://graphkernels.cs.tu-dortmund.de.

Acknowledgments
N. M. Kriege is supported by the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Data Analysis", project A6 "Resource-efficient Graph Mining". P.-L. Giscard is grateful for the financial support provided by the Royal Commission for the Exhibition of 1851.

References
[1] L. Bai, L. Rossi, Z. Zhang, and E. R. Hancock. An aligned subtree kernel for weighted graphs. In Proc. Int. Conf. Mach. Learn., ICML 2015, pages 30-39, 2015.
[2] A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. In Int. Conf. Image Proc., ICIP 2003, volume 3, pages III-513-16, Sept 2003.
[3] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proc. IEEE Int. Conf. Data Min., ICDM '05, pages 74-81, Washington, DC, USA, 2005.
[4] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel.
Protein function prediction via graph kernels. Bioinformatics, 21 Suppl 1:i47-i56, Jun 2005.
[5] S. Boughorbel, J. P. Tarel, and N. Boujemaa. Generalized histogram intersection kernel for image recognition. In Int. Conf. Image Proc., ICIP 2005, pages III-161-4, 2005.
[6] R. E. Burkard, M. Dell'Amico, and S. Martello. Assignment Problems. SIAM, 2012.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2:27:1-27:27, May 2011.
[8] H. Fröhlich, J. K. Wegner, F. Sieker, and A. Zell. Optimal assignment kernels for attributed molecular graphs. In Proc. Int. Conf. Mach. Learn., ICML '05, pages 225-232, 2005.
[9] M. Gori, M. Maggini, and L. Sarti. Exact and approximate graph matching using random walks. IEEE Trans. Pattern Anal. Mach. Intell., 27(7):1100-1111, July 2005.
[10] K. Grauman and T. Darrell. Approximate correspondences in high dimensions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Adv. Neural Inf. Process. Syst. 19, NIPS '06, pages 505-512. MIT Press, 2007.
[11] K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. J. Mach. Learn. Res., 8:725-760, May 2007.
[12] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California, Santa Cruz, CA, USA, 1999.
[13] R. S. Ismagilov. Ultrametric spaces and related Hilbert spaces. Mathematical Notes, 62(2):186-197, 1997.
[14] F. D. Johansson and D. Dubhashi. Learning with similarity functions on graphs using matchings of geometric embeddings. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD '15, pages 467-476. ACM, 2015.
[15] G. Loosli, S. Canu, and C. S. Ong. Learning SVM in Krein spaces. IEEE Trans. Pattern Anal. Mach. Intell., PP(99):1-1, 2015.
[16] D. Pachauri, R. Kondor, and V. Singh.
Solving the multi-way matching problem by permutation synchronization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Adv. Neural Inf. Process. Syst. 26, NIPS '13, pages 1860-1868. Curran Associates, Inc., 2013.
[17] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image Vis. Comp., 27(7):950-959, 2009.
[18] M. Schiavinato, A. Gasparetto, and A. Torsello. Transitive assignment kernels for structural classification. In A. Feragen, M. Pelillo, and M. Loog, editors, Int. Workshop Similarity-Based Pattern Recognit., SIMBAD '15, pages 146-159, 2015.
[19] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res., 12:2539-2561, 2011.
[20] M. J. Swain and D. H. Ballard. Color indexing. Int. J. Comp. Vis., 7(1):11-32, 1991.
[21] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):480-492, 2012.
[22] J.-P. Vert. The optimal assignment kernel is not positive definite. CoRR, abs/0801.4061, 2008.
[23] S. V. N. Vishwanathan, N. N. Schraudolph, R. I. Kondor, and K. M. Borgwardt. Graph kernels. J. Mach. Learn. Res., 11:1201-1242, 2010.
[24] P. Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD '15, pages 1365-1374. ACM, 2015.