{"title": "Fast Computation of Graph Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1449, "page_last": 1456, "abstract": "", "full_text": "Fast Computation of Graph Kernels\n\nS.V. N. Vishwanathan\n\nsvn.vishwanathan@nicta.com.au\n\nStatistical Machine Learning, National ICT Australia,\n\nLocked Bag 8001, Canberra ACT 2601, Australia\n\nResearch School of Information Sciences & Engineering\n\nAustralian National University, Canberra ACT 0200, Australia\n\nKarsten M. Borgwardt\n\nborgwardt@dbs.ifi.lmu.de\n\nInstitute for Computer Science, Ludwig-Maximilians-University Munich\n\nOettingenstr. 67, 80538 Munich, Germany\n\nNicol N. Schraudolph\n\nnic.schraudolph@nicta.com.au\n\nStatistical Machine Learning, National ICT Australia\n\nLocked Bag 8001, Canberra ACT 2601, Australia\n\nResearch School of Information Sciences & Engineering\n\nAustralian National University, Canberra ACT 0200, Australia\n\nAbstract\n\nUsing extensions of linear algebra concepts to Reproducing Kernel Hilbert Spaces (RKHS), we define a unifying framework for random walk kernels on graphs. Reduction to a Sylvester equation allows us to compute many of these kernels in O(n^3) worst-case time. This includes kernels whose previous worst-case time complexity was O(n^6), such as the geometric kernels of Gärtner et al. [1] and the marginal graph kernels of Kashima et al. [2]. Our algebra in RKHS allows us to exploit sparsity in directed and undirected graphs more effectively than previous methods, yielding sub-cubic computational complexity when combined with conjugate gradient solvers or fixed-point iterations. Experiments on graphs from bioinformatics and other application domains show that our algorithms are often more than 1000 times faster than existing approaches.\n\n1 Introduction\n\nMachine learning in domains such as bioinformatics, drug discovery, and web data mining involves the study of relationships between objects. 
Graphs are natural data structures to model such relations, with nodes representing objects and edges the relationships between them. In this context, one often encounters the question: how similar are two graphs?\nSimple ways of comparing graphs based on pairwise comparison of nodes or edges are possible in quadratic time, yet they may neglect information represented by the structure of the graph. Graph kernels, as originally proposed by Gärtner et al. [1], Kashima et al. [2], and Borgwardt et al. [3], take the structure of the graph into account. They work by counting the number of common random walks between two graphs. Even though the number of common random walks could potentially be exponential, polynomial-time algorithms exist for computing these kernels. Unfortunately for the practitioner, these kernels are still prohibitively expensive, since their computation scales as O(n^6), where n is the number of vertices in the input graphs. This severely limits their applicability to large-scale problems, as commonly found in areas such as bioinformatics.\nIn this paper, we extend common concepts from linear algebra to Reproducing Kernel Hilbert Spaces (RKHS), and use these extensions to define a unifying framework for random walk kernels. We show that computing many random walk graph kernels, including those of Gärtner et al. [1] and Kashima et al. [2], can be reduced to the problem of solving a large linear system, which can then be solved efficiently by a variety of methods that exploit the structure of the problem.\n\n2 Extending Linear Algebra to RKHS\n\nLet φ : X → H denote the feature map from an input space X to the RKHS H associated with the kernel κ(x, x′) = ⟨φ(x), φ(x′)⟩_H. Given an n × m matrix X ∈ X^{n×m} of elements X_ij ∈ X, we extend φ to matrix arguments by defining Φ : X^{n×m} → H^{n×m} via [Φ(X)]_ij := φ(X_ij). 
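As a concrete illustration of such a feature map, consider scalar labels with the degree-2 polynomial kernel — an illustrative assumption, not the paper's setting. A minimal sketch in Python/NumPy (not the authors' MATLAB):

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel kappa(x, x') = (1 + x x')^2
    return np.array([1.0, np.sqrt(2.0) * x, x * x])

def kappa(x, xp):
    return (1.0 + x * xp) ** 2

# kappa(x, x') = <phi(x), phi(x')>, so here H is isomorphic to R^3
assert np.isclose(kappa(0.7, -1.3), phi(0.7) @ phi(-1.3))

# entrywise extension Phi : X^{n x m} -> H^{n x m}, with [Phi(X)]_ij = phi(X_ij)
X = np.array([[0.5, 2.0],
              [-1.0, 3.0]])
PhiX = np.array([[phi(x) for x in row] for row in X])
assert PhiX.shape == (2, 2, 3)   # an n x m array of feature vectors
```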
We can now borrow concepts from tensor calculus to extend certain linear algebra operations to H:\n\nDefinition 1 Let A ∈ X^{n×m}, B ∈ X^{m×p}, and C ∈ R^{m×p}. The matrix products Φ(A)Φ(B) ∈ R^{n×p} and Φ(A) C ∈ H^{n×p} are\n\n[Φ(A)Φ(B)]_{ik} := Σ_j ⟨φ(A_{ij}), φ(B_{jk})⟩_H   and   [Φ(A) C]_{ik} := Σ_j φ(A_{ij}) C_{jk}.   (1)\n\nGiven A ∈ R^{n×m} and B ∈ R^{p×q}, the Kronecker product A ⊗ B ∈ R^{np×mq} and the vec operator are defined as\n\nA ⊗ B := [A_{11}B  A_{12}B  ...  A_{1m}B ; ... ; A_{n1}B  A_{n2}B  ...  A_{nm}B],   vec(A) := [A_{*1} ; ... ; A_{*m}],\n\nwhere A_{*j} denotes the j-th column of A. They are linked by the well-known property:\n\nvec(ABC) = (C^T ⊗ A) vec(B).   (2)\n\nDefinition 2 Let A ∈ X^{n×m} and B ∈ X^{p×q}. The Kronecker product Φ(A) ⊗ Φ(B) ∈ R^{np×mq} is\n\n[Φ(A) ⊗ Φ(B)]_{ip+k, jq+l} := ⟨φ(A_{ij}), φ(B_{kl})⟩_H.   (3)\n\nIt is easily shown that the above extensions to RKHS obey an analogue of (2):\n\nLemma 1 If A ∈ X^{n×m}, B ∈ R^{m×p}, and C ∈ X^{p×q}, then\n\nvec(Φ(A) B Φ(C)) = (Φ(C)^T ⊗ Φ(A)) vec(B).   (4)\n\nIf p = q = n = m, direct computation of the right-hand side of (4) requires O(n^4) kernel evaluations. For an arbitrary kernel the left-hand side also requires a similar effort. But if the RKHS H is isomorphic to R^r, in other words the feature map φ(·) ∈ R^r, the left-hand side of (4) is easily computed in O(n^3 r) operations. Our efficient computation schemes described in Section 4 will exploit this observation.\n\n3 Random Walk Kernels\n\nRandom walk kernels on graphs are based on a simple idea: given a pair of graphs, perform a random walk on both of them and count the number of matching walks [1, 2, 3]. 
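The identity (2), which underlies Lemma 1 and all of the speed-ups derived later, is easy to check numerically. A minimal Python/NumPy sketch with random matrices (column-major vec):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # A in R^{n x m}
B = rng.standard_normal((4, 5))   # B in R^{m x p}
C = rng.standard_normal((5, 2))   # C in R^{p x q}

def vec(M):
    # column-stacking vec operator, as used in (2)
    return M.reshape(-1, order="F")

# vec(ABC) = (C^T kron A) vec(B)
lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
assert np.allclose(lhs, rhs)
```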
These kernels mainly differ in the way the similarity between random walks is computed. For instance, Gärtner et al. [1] count the number of nodes in the random walk which have the same label. They also include a decay factor to ensure convergence. Kashima et al. [2] and Borgwardt et al. [3], on the other hand, use a kernel defined on nodes and edges in order to compute similarity between random walks, and define an initial probability distribution over nodes in order to ensure convergence. In this section we present a unifying framework which includes the above-mentioned kernels as special cases.\n\n3.1 Notation\n\nWe use e_i to denote the i-th standard basis vector (i.e., a vector of all zeros with the i-th entry set to one), e to denote a vector with all entries set to one, 0 to denote the vector of all zeros, and I to denote the identity matrix. When it is clear from context we will not mention the dimensions of these vectors and matrices.\nA graph G ∈ G consists of an ordered and finite set of n vertices V, denoted by {v_1, v_2, ..., v_n}, and a finite set of edges E ⊂ V × V. A vertex v_i is said to be a neighbor of another vertex v_j if they are connected by an edge. G is said to be undirected if (v_i, v_j) ∈ E ⇔ (v_j, v_i) ∈ E for all edges.\nThe unnormalized adjacency matrix of G is an n × n real matrix P with P_ij = 1 if (v_i, v_j) ∈ E, and 0 otherwise. If G is weighted then P can contain non-negative entries other than zeros and ones, i.e., P_ij ∈ (0, ∞) if (v_i, v_j) ∈ E and zero otherwise.\nLet D be an n × n diagonal matrix with entries D_ii = Σ_j P_ij. The matrix A := P D^{-1} is then called the normalized adjacency matrix, or simply adjacency matrix. A walk w on G is a sequence of indices w_1, w_2, ..., w_{t+1} where (v_{w_i}, v_{w_{i+1}}) ∈ E for all 1 ≤ i ≤ t. The length of a walk is equal to the number of edges encountered during the walk (here: t). 
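Constructing A = P D^{-1} from the definition above is straightforward. A small Python/NumPy sketch for a hypothetical undirected path graph on four vertices:

```python
import numpy as np

# unnormalized adjacency matrix of the undirected path graph v1-v2-v3-v4
# (a hypothetical example, not one of the paper's datasets)
P = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

D = np.diag(P.sum(axis=1))        # D_ii = sum_j P_ij
A = P @ np.linalg.inv(D)          # normalized adjacency matrix A = P D^{-1}

# for a symmetric P each column of A sums to one, so powers [A^t]_ij
# can be read as t-step walk probabilities
assert np.all(A >= 0)
assert np.allclose(A.sum(axis=0), 1.0)
```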
A graph is said to be connected if any two vertices can be connected by a walk; here we always work with connected graphs. A random walk is a walk where P(w_{i+1} | w_1, ..., w_i) = A_{w_i, w_{i+1}}, i.e., the probability at w_i of picking w_{i+1} next is directly proportional to the weight of the edge (v_{w_i}, v_{w_{i+1}}). The t-th power of the transition matrix A describes the probability of t-length walks. In other words, [A^t]_ij denotes the probability of a transition from vertex v_i to vertex v_j via a walk of length t. We use this intuition to define random walk kernels on graphs.\nLet X be a set of labels which includes a special label ε. Every edge-labeled graph G is associated with a label matrix L ∈ X^{n×n} such that L_ij = ε iff (v_i, v_j) ∉ E; in other words, only those edges which are present in the graph get a non-ε label. Let H be the RKHS endowed with the kernel κ : X × X → R, and let φ : X → H denote the corresponding feature map, which maps ε to the zero element of H. We use Φ(L) to denote the feature matrix of G. For ease of exposition we do not consider labels on vertices here, though our results hold for that case as well. Henceforth we use the term labeled graph to denote an edge-labeled graph.\n\n3.2 Product Graphs\n\nGiven two graphs G(V, E) and G′(V′, E′), the product graph G×(V×, E×) is a graph with nn′ vertices, each representing a pair of vertices from G and G′, respectively. An edge exists in E× iff the corresponding vertices are adjacent in both G and G′. Thus\n\nV× = {(v_i, v′_{i′}) : v_i ∈ V ∧ v′_{i′} ∈ V′},   (5)\nE× = {((v_i, v′_{i′}), (v_j, v′_{j′})) : (v_i, v_j) ∈ E ∧ (v′_{i′}, v′_{j′}) ∈ E′}.   (6)\n\nIf A and A′ are the adjacency matrices of G and G′, respectively, the adjacency matrix of the product graph G× is A× = A ⊗ A′. 
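The product graph construction amounts to a single Kronecker product. A Python/NumPy sketch with two small hypothetical graphs, verifying that an edge exists in G× iff it exists in both factors:

```python
import numpy as np

# adjacency matrices of two small hypothetical graphs
A1 = np.array([[0., 1.],
               [1., 0.]])          # a single edge
A2 = np.array([[0., 1., 0.],
               [1., 0., 1.],
               [0., 1., 0.]])      # a path on three vertices

Ax = np.kron(A1, A2)               # adjacency matrix of the product graph, A_x = A kron A'

n1, n2 = A1.shape[0], A2.shape[0]
for i in range(n1):
    for j in range(n1):
        for k in range(n2):
            for l in range(n2):
                # vertex (v_i, v'_k) has index i*n2 + k; an edge exists in G_x
                # iff the corresponding edges exist in both G and G'
                assert (Ax[i*n2 + k, j*n2 + l] != 0) == (A1[i, j] != 0 and A2[k, l] != 0)
```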
An edge exists in the product graph iff an edge exists in both G and G′; therefore performing a simultaneous random walk on G and G′ is equivalent to performing a random walk on the product graph [4].\nLet p and p′ denote initial probability distributions over the vertices of G and G′. Then the initial probability distribution p× of the product graph is p× := p ⊗ p′. Likewise, if q and q′ denote stopping probabilities (i.e., the probability that a random walk ends at a given vertex), the stopping probability q× of the product graph is q× := q ⊗ q′.\nIf G and G′ are edge-labeled, we can associate a weight matrix W× ∈ R^{nn′×nn′} with G×, using our Kronecker product in RKHS (Definition 2): W× = Φ(L) ⊗ Φ(L′). As a consequence of the definition of Φ(L) and Φ(L′), the entries of W× are non-zero only if the corresponding edge exists in the product graph. The weight matrix is closely related to the adjacency matrix: assume that H = R endowed with the usual dot product, and φ(L_ij) = 1 if (v_i, v_j) ∈ E and zero otherwise. Then Φ(L) = A and Φ(L′) = A′, and consequently W× = A×, i.e., the weight matrix is identical to the adjacency matrix of the product graph.\nTo extend the above discussion, assume that H = R^d endowed with the usual dot product, and that there are d distinct edge labels {1, 2, ..., d}. For each edge (v_i, v_j) ∈ E we have φ(L_ij) = e_l if the edge (v_i, v_j) is labeled l. All other entries of Φ(L) are set to 0. κ is therefore a delta kernel, i.e., its value between any two edges is one iff the labels on the edges match, and zero otherwise. The weight matrix W× has a non-zero entry iff an edge exists in the product graph and the corresponding edges in G and G′ have the same label. Let lA denote the adjacency matrix of the graph filtered by the label l, i.e., lA_ij = A_ij if L_ij = l and zero otherwise. 
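The claim about the non-zero pattern of W× can be verified directly from Definition 2. A Python/NumPy sketch with two hypothetical label matrices, using label 0 to stand in for the special missing-edge label:

```python
import numpy as np

# edge-label matrices of two small hypothetical graphs;
# label 0 stands in for the special label marking a missing edge
L1 = np.array([[0, 1],
               [2, 0]])
L2 = np.array([[0, 2],
               [2, 0]])
d = 2                               # number of distinct edge labels

def phi(label):
    # delta-kernel feature map: label l -> e_l, missing edge -> zero vector
    v = np.zeros(d)
    if label > 0:
        v[label - 1] = 1.0
    return v

n1, n2 = L1.shape[0], L2.shape[0]
W = np.zeros((n1 * n2, n1 * n2))
for i in range(n1):
    for j in range(n1):
        for k in range(n2):
            for l in range(n2):
                # Definition 2: [Phi(L) kron Phi(L')]_{ik,jl} = <phi(L_ij), phi(L'_kl)>
                W[i*n2 + k, j*n2 + l] = phi(L1[i, j]) @ phi(L2[k, l])

# W has a non-zero entry iff both edges exist and carry the same label
for i in range(n1):
    for j in range(n1):
        for k in range(n2):
            for l in range(n2):
                match = L1[i, j] != 0 and L1[i, j] == L2[k, l]
                assert (W[i*n2 + k, j*n2 + l] != 0) == match
```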
Some simple algebra (omitted for the sake of brevity) shows that the weight matrix of the product graph can be written as\n\nW× = Σ_{l=1}^{d} lA ⊗ lA′.   (7)\n\n3.3 Kernel Definition\n\nPerforming a random walk on the product graph G× is equivalent to performing a simultaneous random walk on the graphs G and G′ [4]. Therefore, the (in + j, i′n′ + j′)-th entry of A×^k represents the probability of simultaneous k-length random walks on G (starting from vertex v_i and ending in vertex v_j) and G′ (starting from vertex v′_{i′} and ending in vertex v′_{j′}). The entries of W× represent similarity between edges. The (in + j, i′n′ + j′)-th entry of W×^k represents the similarity between simultaneous k-length random walks on G and G′, measured via the kernel function κ.\nGiven the weight matrix W×, initial and stopping probability distributions p× and q×, and an appropriately chosen discrete measure μ, we can define a random walk kernel on G and G′ as\n\nk(G, G′) := Σ_{k=0}^{∞} μ(k) q×^T W×^k p×.   (8)\n\nIn order to show that (8) is a valid Mercer kernel we need the following technical lemma.\n\nLemma 2 ∀ k ∈ N_0 : W×^k p× = vec[Φ(L′)^k p′ (Φ(L)^k p)^T].\n\nProof By induction over k. Base case: k = 0. 
Since Φ(L′)^0 = Φ(L)^0 = I, using (2) we can write\n\nW×^0 p× = p× = (p ⊗ p′) vec(1) = vec(p′ 1 p^T) = vec[Φ(L′)^0 p′ (Φ(L)^0 p)^T].\n\nInduction from k to k + 1: using Lemma 1 we obtain\n\nW×^{k+1} p× = W× W×^k p× = (Φ(L) ⊗ Φ(L′)) vec[Φ(L′)^k p′ (Φ(L)^k p)^T] = vec[Φ(L′) Φ(L′)^k p′ (Φ(L)^k p)^T Φ(L)^T] = vec[Φ(L′)^{k+1} p′ (Φ(L)^{k+1} p)^T].\n\nLemma 3 If the measure μ(k) is such that (8) converges, then it defines a valid Mercer kernel.\n\nProof Using Lemmas 1 and 2 we can write\n\nq×^T W×^k p× = (q ⊗ q′)^T vec[Φ(L′)^k p′ (Φ(L)^k p)^T] = vec[q′^T Φ(L′)^k p′ (Φ(L)^k p)^T q] = (q^T Φ(L)^k p)^T (q′^T Φ(L′)^k p′) = ψ_k(G)^T ψ_k(G′).\n\nEach individual term of (8) thus equals ψ_k(G)^T ψ_k(G′) for some function ψ_k, and is therefore a valid kernel. The lemma follows since a convex combination of kernels is itself a valid kernel.\n\n3.4 Special Cases\n\nA popular choice to ensure convergence of (8) is to assume μ(k) = λ^k for some λ > 0. If λ is sufficiently small¹ then (8) is well defined, and we can write\n\nk(G, G′) = Σ_k λ^k q×^T W×^k p× = q×^T (I − λW×)^{-1} p×.   (9)\n\nKashima et al. [2] use marginalization and probabilities of random walks to define kernels on graphs. Given transition probability matrices P and P′ associated with graphs G and G′ respectively, their kernel can be written as (see Eq. 1.19 of [2])\n\nk(G, G′) = q×^T (I − T×)^{-1} p×,   (10)\n\nwhere T× := (vec(P) vec(P′)^T) ⊙ (Φ(L) ⊗ Φ(L′)), using ⊙ to denote element-wise (Hadamard) multiplication. The edge kernel κ̂(L_ij, L′_{i′j′}) := P_ij P′_{i′j′} κ(L_ij, L′_{i′j′}) with λ = 1 recovers (9).\nGärtner et al. [1] use the adjacency matrix of the product graph to define the so-called geometric kernel\n\nk(G, G′) = Σ_{i=1}^{n} Σ_{j=1}^{n′} Σ_{k=0}^{∞} λ^k [A×^k]_{ij}.   (11)\n\nTo recover their kernel in our framework, assume a uniform distribution over the vertices of G and G′, i.e., set p = q = e/n and p′ = q′ = e/n′. The initial as well as final probability distribution over vertices of G× is then given by p× = q× = e/(nn′). Setting Φ(L) := A, and hence Φ(L′) = A′ and W× = A×, we can rewrite (8) to obtain\n\nk(G, G′) = Σ_{k=0}^{∞} λ^k q×^T A×^k p× = (1/(n² n′²)) Σ_{i=1}^{n} Σ_{j=1}^{n′} Σ_{k=0}^{∞} λ^k [A×^k]_{ij},\n\nwhich recovers (11) up to a constant factor.\n\n¹The values of λ which ensure convergence depend on the spectrum of W×.\n\n4 Efficient Computation\n\nIn this section we show that iterative methods, including those based on Sylvester equations, conjugate gradients, and fixed-point iterations, can be used to greatly speed up the computation of (9).\n\n4.1 Sylvester Equation Methods\n\nConsider the following equation, commonly known as the Sylvester or Lyapunov equation:\n\nX = SXT + X_0.   (12)\n\nHere, S, T, X_0 ∈ R^{n×n} are given and we need to solve for X ∈ R^{n×n}. These equations can be readily solved in O(n^3) time with freely available code [5], e.g. Matlab's dlyap method. The generalized Sylvester equation\n\nX = Σ_{i=1}^{d} S_i X T_i + X_0   (13)\n\ncan also be solved efficiently, albeit at a slightly higher computational cost of O(dn^3).\nWe now show that if the weight matrix W× can be written as (7), then the problem of computing the graph kernel (9) can be reduced to the problem of solving the following generalized Sylvester equation:\n\nX = λ Σ_{i=1}^{d} iA′ X iA^T + X_0,   (14)\n\nwhere vec(X_0) = p×. 
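For a single label (d = 1) the equivalence between the Sylvester form (14) and the resolvent in (9) can be sanity-checked numerically. A Python/NumPy sketch with random stand-in matrices (the dense solve here is for verification only — it is exactly the expensive direct computation the Sylvester route avoids):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4                          # sizes of the two (hypothetical) graphs
A  = rng.random((n, n))              # stand-in for the single weight factor of G
Ap = rng.random((m, m))              # stand-in for that of G'
lam = 0.01                           # small enough for the series (9) to converge

Wx = np.kron(A, Ap)                  # W_x = A kron A'
px = rng.random(n * m)
x  = np.linalg.solve(np.eye(n * m) - lam * Wx, px)   # x = (I - lam W_x)^{-1} p_x

# reshaping x (column-major) recovers the solution of the Sylvester
# equation X = lam A' X A^T + X0 with vec(X0) = p_x
X  = x.reshape((m, n), order="F")
X0 = px.reshape((m, n), order="F")
assert np.allclose(X, lam * Ap @ X @ A.T + X0)
```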
We begin by flattening equation (14):\n\nvec(X) = λ Σ_i vec(iA′ X iA^T) + p×.   (15)\n\nUsing Lemma 1 we can rewrite (15) as\n\n(I − λ Σ_i iA ⊗ iA′) vec(X) = p×,   (16)\n\nuse (7), and solve (16) for vec(X):\n\nvec(X) = (I − λW×)^{-1} p×.   (17)\n\nMultiplying both sides of (17) by q×^T yields\n\nq×^T vec(X) = q×^T (I − λW×)^{-1} p×.   (18)\n\nThe right-hand side of (18) is the graph kernel (9). Given the solution X of the Sylvester equation (14), the graph kernel can thus be obtained as q×^T vec(X) in O(n^2) time. Since solving the generalized Sylvester equation takes O(dn^3) time, computing the graph kernel in this fashion is significantly faster than the O(n^6) time required by the direct approach.\nWhere the number of labels d is large, the computational cost may be reduced further by computing matrices S and T such that W× ≈ S ⊗ T. We then simply solve the simple Sylvester equation (12) involving these matrices. Finding the nearest Kronecker product approximating a matrix such as W× is a well-studied problem in numerical linear algebra, and efficient algorithms which exploit the sparsity of W× are readily available [6].\n\n4.2 Conjugate Gradient Methods\n\nGiven a matrix M and a vector b, conjugate gradient (CG) methods solve the system of equations Mx = b efficiently [7]. While they are designed for symmetric positive semi-definite matrices, CG solvers can also be used to solve other linear systems efficiently. They are particularly efficient if the matrix is rank deficient, or has a small effective rank, i.e., number of distinct eigenvalues. Furthermore, if computing matrix-vector products is cheap (because M is sparse, for instance), the CG solver can be sped up significantly [7]. 
Specifically, if computing Mv for an arbitrary vector v requires O(k) time, and the effective rank of the matrix is m, then a CG solver requires only O(mk) time to solve Mx = b.\nThe graph kernel (9) can be computed by a two-step procedure: first we solve the linear system\n\n(I − λW×) x = p×   (19)\n\nfor x, then we compute q×^T x. We now focus on efficient ways to solve (19) with a CG solver. Recall that if G and G′ contain n vertices each, then W× is an n² × n² matrix. Directly computing the matrix-vector product W× r requires O(n^4) time. Key to our speed-ups is the ability to exploit Lemma 1 to compute this matrix-vector product more efficiently: recall that W× = Φ(L) ⊗ Φ(L′). Letting r = vec(R), we can use Lemma 1 to write\n\nW× r = (Φ(L) ⊗ Φ(L′)) vec(R) = vec(Φ(L′) R Φ(L)^T).   (20)\n\nIf φ(·) ∈ R^r for some r, then the above matrix-vector product can be computed in O(n^3 r) time. If Φ(L) and Φ(L′) are sparse, however, then Φ(L′) R Φ(L)^T can be computed yet more efficiently: if there are O(n) non-ε entries in Φ(L) and Φ(L′), then computing (20) requires only O(n^2) time.\n\n4.3 Fixed-Point Iterations\n\nFixed-point methods begin by rewriting (19) as\n\nx = p× + λW× x.   (21)\n\nNow, solving for x is equivalent to finding a fixed point of (21) [7]. Letting x_t denote the value of x at iteration t, we set x_0 := p×, then compute\n\nx_{t+1} = p× + λW× x_t   (22)\n\nrepeatedly until ||x_{t+1} − x_t|| < ε, where ||·|| denotes the Euclidean norm and ε is some pre-defined tolerance. 
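Iteration (22), combined with the matrix-vector product trick (20), can be sketched in a few lines of Python/NumPy (random dense stand-ins for the feature matrices; this is not the authors' MATLAB implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
F1 = rng.random((n, n))              # stand-in for Phi(L)
F2 = rng.random((n, n))              # stand-in for Phi(L')
lam = 0.05                           # small enough for (22) to converge
px = rng.random(n * n)

def Wx_times(x):
    # (20): W_x vec(R) = (Phi(L) kron Phi(L')) vec(R) = vec(Phi(L') R Phi(L)^T),
    # so W_x is never formed explicitly
    R = x.reshape((n, n), order="F")
    return (F2 @ R @ F1.T).reshape(-1, order="F")

x = px.copy()                        # x_0 := p_x
for _ in range(1000):
    x_new = px + lam * Wx_times(x)   # fixed-point iteration (22)
    if np.linalg.norm(x_new - x) < 1e-12:
        x = x_new
        break
    x = x_new

# agrees with solving (I - lam W_x) x = p_x directly
direct = np.linalg.solve(np.eye(n * n) - lam * np.kron(F1, F2), px)
assert np.allclose(x, direct)
```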
This is guaranteed to converge if all eigenvalues of λW× lie inside the unit disk; this can be ensured by setting λ < 1/ξ_max, where ξ_max is the largest-magnitude eigenvalue of W×.\nThe above is closely related to the power method used to compute the largest eigenvalue of a matrix [8]; efficient preconditioners can also be used to speed up convergence [8]. Since each iteration of (22) involves computation of the matrix-vector product W× x_t, all speed-ups for computing the matrix-vector product discussed in Section 4.2 are applicable here. In particular, we exploit the fact that W× is a sum of Kronecker products to reduce the worst-case time complexity to O(n^3) in our experiments, in contrast to Kashima et al. [2], who computed the matrix-vector product explicitly.\n\n5 Experiments\n\nTo assess the practical impact of our algorithmic improvements, we compared our techniques from Section 4 with the direct approach of Gärtner et al. [1] as a baseline. All code was written in MATLAB Release 14, and experiments were run on a 2.6 GHz Intel Pentium 4 PC with 2 GB of main memory running SUSE Linux. The Matlab function dlyap was used to solve the Sylvester equation.\nBy default, we used a value of λ = 0.001, and set the tolerance for both the CG solver and the fixed-point iteration to 10^{-6} for all our experiments. We used Lemma 1 to speed up matrix-vector multiplication for both CG and fixed-point methods (cf. Section 4.2). Since all our methods are exact and produce the same kernel values (to numerical precision), we only report their runtimes below.\nWe tested the practical feasibility of the presented techniques on four real-world datasets whose size mandates fast graph kernel computation: two datasets of molecular compounds (MUTAG and PTC), and two datasets with hundreds of graphs describing protein tertiary structure (Protein and Enzyme). 
Graph kernels provide useful measures of similarity for all these graphs; please refer to the addendum for more details on these datasets and applications of graph kernels to them.\n\nFigure 1: Time (in seconds, on a log scale) to compute the 100×100 kernel matrix for unlabeled (left) and labeled (right) graphs from several datasets. Compare the conventional direct method (black) to our fast Sylvester equation, conjugate gradient (CG), and fixed-point iteration (FP) approaches.\n\n5.1 Unlabeled Graphs\n\nIn a first series of experiments, we compared graph topology only on our four datasets, i.e., without considering node and edge labels. We report the time taken to compute the full graph kernel matrix for various sizes (number of graphs) in Table 1, and show the results for computing a 100 × 100 sub-matrix in Figure 1 (left).\nOn unlabeled graphs, conjugate gradient and fixed-point iteration (sped up via our Lemma 1) are consistently about two orders of magnitude faster than the conventional direct method. The Sylvester approach is very competitive on smaller graphs (outperforming CG on MUTAG) but slows down with increasing number of nodes per graph; this is because we were unable to incorporate Lemma 1 into Matlab's black-box dlyap solver. Even so, the Sylvester approach still greatly outperforms the direct method.\n\n5.2 Labeled Graphs\n\nIn a second series of experiments, we compared graphs with node and edge labels. On our two protein datasets we employed a linear kernel to measure similarity between edge labels representing distances (in ångströms) between secondary structure elements. On our two chemical datasets we used a delta kernel to compare edge labels reflecting types of bonds in molecules. We report results in Table 2 and Figure 1 (right).\nOn labeled graphs, our three methods outperform the direct approach by about a factor of 1000 when using the linear kernel. 
In the experiments with the delta kernel, conjugate gradient and fixed-point iteration are still at least two orders of magnitude faster. Since we did not have access to a solver for the generalized Sylvester equation (13), we had to use a Kronecker product approximation [6], which dramatically slowed down the Sylvester equation approach.\n\nTable 1: Time to compute kernel matrix for given number of unlabeled graphs from various datasets.\n\ndataset     | MUTAG            | PTC              | Enzyme           | Protein\nnodes/graph | 17.7             | 26.7             | 32.6             | 38.6\nedges/node  | 2.2              | 1.9              | 3.8              | 3.7\n#graphs     | 100     | 230    | 100     | 417    | 100    | 600     | 100     | 1128\nDirect      | 18'09"  | 104'31"| 142'53" | 41h*   | 31h*   | 46.5d*  | 36d*    | 12.5y*\nSylvester   | 25.9"   | 2'16"  | 73.8"   | 19'30" | 48.3"  | 36'43"  | 69'15"  | 6.1d*\nConjugate   | 42.1"   | 4'04"  | 58.4"   | 19'27" | 44.6"  | 34'58"  | 55.3"   | 97'13"\nFixed-Point | 12.3"   | 1'09"  | 32.4"   | 5'59"  | 13.6"  | 15'23"  | 31.1"   | 40'58"\n\n*: Extrapolated; run did not finish in time available.\n\nTable 2: Time to compute kernel matrix for given number of labeled graphs from various datasets.\n\nkernel      | delta                               | linear\ndataset     | MUTAG            | PTC              | Enzyme           | Protein\n#graphs     | 100     | 230    | 100     | 417    | 100    | 600     | 100     | 1128\nDirect      | 7.2h    | 1.6d*  | 1.4d*   | 25d*   | 2.4d*  | 86d*    | 5.3d*   | 18y*\nSylvester   | 3.9d*   | 21d*   | 2.7d*   | 46d*   | 89.8"  | 53'55"  | 25'24"  | 2.3d*\nConjugate   | 2'35"   | 13'46" | 3'20"   | 53'31" | 124.4" | 71'28"  | 3'01"   | 4.1h\nFixed-Point | 1'05"   | 6'09"  | 1'31"   | 26'52" | 50.1"  | 35'24"  | 1'47"   | 1.9h\n\n*: Extrapolated; run did not finish in time available.\n\n6 Outlook and Discussion\n\nWe have shown that computing random walk graph kernels 
is essentially equivalent to solving a large linear system. We have extended a well-known identity for Kronecker products, which allows us to exploit the structure inherent in this problem. From this we have derived three efficient techniques to solve the linear system, employing either Sylvester equations, conjugate gradients, or fixed-point iterations. Experiments on real-world datasets have shown our methods to be scalable and fast, in some instances outperforming the conventional approach by more than three orders of magnitude.\nEven though the Sylvester equation method has a worst-case complexity of O(n^3), the conjugate gradient and fixed-point methods tend to be faster on all our datasets. This is because computing matrix-vector products via Lemma 1 is quite efficient when the graphs are sparse, so that the feature matrices Φ(L) and Φ(L′) contain only O(n) non-ε entries. Matlab's black-box dlyap solver is unable to exploit this sparsity; we are working on more capable alternatives. An efficient generalized Sylvester solver requires extensive use of tensor calculus and is part of ongoing work.\nAs more and more graph-structured data becomes available in areas such as biology, web data mining, etc., graph classification will gain importance over the coming years. Hence there is a pressing need to speed up the computation of similarity metrics on graphs. We have shown that sparsity, low effective rank, and Kronecker product structure can be exploited to greatly reduce the computational cost of graph kernels; taking advantage of other forms of structure in W× remains a challenge. Now that the computation of random walk graph kernels is viable for practical problem sizes, it will open the door to their application in hitherto unexplored domains. 
The algorithmic challenge now is how to integrate higher-order structures, such as spanning trees, in graph comparisons.\n\nAcknowledgments\n\nNational ICT Australia is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Center of Excellence program. This work is supported by the IST Program of the European Community, under the Pascal Network of Excellence, IST-2002-506778, and by the German Ministry for Education, Science, Research and Technology (BMBF) under grant no. 031U112F within the BFAM (Bioinformatics for the Functional Analysis of Mammalian Genomes) project, part of the German Genome Analysis Network (NGFN).\n\nReferences\n\n[1] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In B. Schölkopf and M. K. Warmuth, editors, Proc. Annual Conf. Comput. Learning Theory. Springer, 2003.\n[2] H. Kashima, K. Tsuda, and A. Inokuchi. Kernels on graphs. In K. Tsuda, B. Schölkopf, and J. Vert, editors, Kernels and Bioinformatics, Cambridge, MA, 2004. MIT Press.\n[3] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, 2005.\n[4] F. Harary. Graph Theory. Addison-Wesley, Reading, MA, 1969.\n[5] J. D. Gardiner, A. L. Laub, J. J. Amato, and C. B. Moler. Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM Transactions on Mathematical Software, 18(2):223–231, 1992.\n[6] C. F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123:85–100, 2000.\n[7] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research, 1999.\n[8] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.\n", "award": [], "sourceid": 2973, "authors": [{"given_name": "Karsten", "family_name": "Borgwardt", "institution": ""}, {"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}