{"title": "A Structural Smoothing Framework For Robust Graph Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 2134, "page_last": 2142, "abstract": "In this paper, we propose a general smoothing framework for graph kernels by taking \\textit{structural similarity} into account, and apply it to derive smoothed variants of popular graph kernels. Our framework is inspired by state-of-the-art smoothing techniques used in natural language processing (NLP). However, unlike NLP applications which primarily deal with strings, we show how one can apply smoothing to a richer class of inter-dependent sub-structures that naturally arise in graphs. Moreover, we discuss extensions of the Pitman-Yor process that can be adapted to smooth structured objects thereby leading to novel graph kernels. Our kernels are able to tackle the diagonal dominance problem, while respecting the structural similarity between sub-structures, especially under the presence of edge or label noise. Experimental evaluation shows that not only our kernels outperform the unsmoothed variants, but also achieve statistically significant improvements in classification accuracy over several other graph kernels that have been recently proposed in literature. Our kernels are competitive in terms of runtime, and offer a viable option for practitioners.", "full_text": "A Structural Smoothing Framework For Robust\n\nGraph-Comparison\n\nPinar Yanardag\n\nDepartment of Computer Science\n\nPurdue University\n\nWest Lafayette, IN, 47906, USA\n\nypinar@purdue.edu\n\nS.V.N. Vishwanathan\n\nDepartment of Computer Science\n\nUniversity of California\n\nSanta Cruz, CA, 95064, USA\n\nvishy@ucsc.edu\n\nAbstract\n\nIn this paper, we propose a general smoothing framework for graph kernels by\ntaking structural similarity into account, and apply it to derive smoothed variants\nof popular graph kernels. 
Our framework is inspired by state-of-the-art smoothing techniques used in natural language processing (NLP). However, unlike NLP applications that primarily deal with strings, we show how one can apply smoothing to a richer class of inter-dependent sub-structures that naturally arise in graphs. Moreover, we discuss extensions of the Pitman-Yor process that can be adapted to smooth structured objects, thereby leading to novel graph kernels. Our kernels are able to tackle the diagonal dominance problem while respecting the structural similarity between features. Experimental evaluation shows that not only do our kernels achieve statistically significant improvements over the unsmoothed variants, but they also outperform several other graph kernels in the literature. Our kernels are competitive in terms of runtime, and offer a viable option for practitioners.

1 Introduction

In many applications we are interested in computing similarities between structured objects such as graphs. For instance, one might aim to classify chemical compounds by predicting whether a compound is active in an anti-cancer screen or not. A kernel function, which corresponds to a dot product in a reproducing kernel Hilbert space, offers a flexible way to solve this problem [19]. R-convolution [10] is a framework for computing kernels between discrete objects where the key idea is to recursively decompose structured objects into sub-structures. Let ⟨·, ·⟩_H denote a dot product in a reproducing kernel Hilbert space, let G represent a graph, and let φ(G) represent a vector of sub-structure frequencies. The kernel between two graphs G and G′ is computed as K(G, G′) = ⟨φ(G), φ(G′)⟩_H. Many existing graph kernels can be viewed as instances of R-convolution kernels.
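As a toy sketch of this construction (the substructure names below are hypothetical; a concrete kernel would supply its own decomposition), the R-convolution kernel reduces to a dot product of substructure frequency vectors:

```python
from collections import Counter

def r_convolution_kernel(phi_a, phi_b):
    """K(G, G') = <phi(G), phi(G')>: dot product of substructure frequency vectors."""
    # Only substructures present in both graphs contribute to the dot product.
    return sum(c * phi_b.get(s, 0) for s, c in phi_a.items())

# Hypothetical decompositions: multisets of substructure identifiers.
phi_g = Counter({"path_AB": 3, "path_BC": 1, "triangle": 2})
phi_g2 = Counter({"path_AB": 1, "triangle": 4, "path_CD": 5})

print(r_convolution_kernel(phi_g, phi_g2))  # 3*1 + 2*4 = 11
```

Note that the substructures unique to one graph ("path_BC", "path_CD") contribute nothing; this exact-matching behavior is what the smoothing framework later relaxes.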
For instance, the graphlet kernel [22] decomposes a graph into graphlets, the Weisfeiler-Lehman subtree kernel (referred to as Weisfeiler-Lehman for the rest of the paper) [23] decomposes a graph into subtrees, and the shortest-path kernel [1] decomposes a graph into shortest paths. However, R-convolution based graph kernels suffer from a few drawbacks. First, the size of the feature space often grows exponentially. As the size of the space grows, the probability that two graphs will contain similar sub-structures becomes very small. Therefore, a graph becomes similar to itself but not to any other graph in the training data. This is known as the diagonal dominance problem [11], where the resulting kernel matrix is close to the identity matrix. Second, lower-order sub-structures tend to be more numerous, while a vast majority of the sub-structures occur rarely. In other words, a few sub-structures dominate the distribution. This exhibits a strong power-law behavior and results in underestimation of the true distribution. Third, the sub-structures used to define a graph kernel are often related to each other. However, an R-convolution kernel only respects exact matchings. This problem is particularly important when noise is present in the training data, and considering partial similarity between sub-structures might alleviate the noise problem.

Figure 1: Graphlets of size k ≤ 5.

Our solution: In this paper, we propose to tackle the above problems by using a general framework to smooth graph kernels that are defined using a frequency vector of decomposed structures. We use structural information by encoding relationships between lower- and higher-order sub-structures in order to derive our method. The remainder of this paper is structured as follows. In Section 2, we review three families of graph kernels for which our smoothing is applicable. In Section 3, we review smoothing methods for multinomial distributions.
In Section 4, we introduce a framework for smoothing structured objects. In Section 5, we propose a Bayesian variant of our model that extends the hierarchical Pitman-Yor process [25]. In Section 6, we discuss related work. In Section 7, we compare smoothed graph kernels to their unsmoothed variants as well as to other state-of-the-art graph kernels. We report classification accuracy on several benchmark datasets as well as on their noisy variants. Section 8 concludes the paper.

2 Graph kernels

Existing graph kernels based on R-convolution can be categorized into three major families: graph kernels based on limited-sized subgraphs [e.g. 22], graph kernels based on subtree patterns [e.g. 18, 21], and graph kernels based on walks [e.g. 27] or paths [e.g. 1].

Graph kernels based on subgraphs: A graphlet G [17] is a non-isomorphic sub-graph of size k (see Figure 1). Given two graphs G and G′, the graphlet kernel [22] is defined as K_GK(G, G′) = ⟨f_G, f_G′⟩, where f_G and f_G′ are vectors of normalized counts of graphlets, that is, the i-th component of f_G (resp. f_G′) denotes the frequency of graphlet G_i occurring as a sub-graph of G (resp. G′).

Graph kernels based on subtree patterns: Weisfeiler-Lehman [21] is a popular instance of graph kernels that decompose a graph into its subtree patterns. It iterates over each vertex in a graph, and compresses the label of the vertex and the labels of its neighbors into a multiset label. The vertex is then relabeled with the compressed label to be used in the next iteration. The algorithm concludes after running for h iterations, and the compressed labels are used for constructing a frequency vector for each graph.
Formally, given G and G′, this kernel is defined as K_WL(G, G′) = ⟨l_G, l_G′⟩, where l_G contains the frequency of each compressed label occurring in h iterations.

Graph kernels based on walks or paths: The shortest-path graph kernel [1] is a popular instance of this family. This kernel compares the sorted endpoints and the length of the shortest paths that are common between two graphs. Formally, let P_G represent the set of all shortest paths in graph G, and let p_i ∈ P_G denote a triplet (l_s, l_e, n_k), where n_k is the length of the path and l_s and l_e are the labels of the source and sink vertices, respectively. The kernel between graphs G and G′ is defined as K_SP(G, G′) = ⟨p_G, p_G′⟩, where the i-th component of p_G (resp. p_G′) contains the frequency of the i-th triplet occurring in graph G (resp. G′).

3 Smoothing multinomial distributions

In this section, we briefly review smoothing techniques for multinomial distributions. Let e_1, e_2, . . . , e_m represent a sequence of m discrete events drawn from a ground set A = {1, 2, . . . , V }.

Figure 2: Topologically sorted graphlet DAG for k ≤ 5, where nodes are colored based on degree.

Suppose we would like to estimate the probability P(e_i = a) for some a ∈ A. It is well known that the Maximum Likelihood Estimate (MLE) can be computed as P_MLE(e_i = a) = c_a / m, where c_a denotes the number of times the event a appears in the observed sequence and m = Σ_j c_j denotes the total number of observed events. However, the MLE of the multinomial distribution is spiky, since it assigns zero probability to events that did not occur in the observed sequence. In other words, an event with low probability is often estimated to have zero probability mass.
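The spikiness of the MLE can be seen in a few lines (toy event counts, purely for illustration):

```python
from collections import Counter

def mle(counts, total):
    """P_MLE(e_i = a) = c_a / m for each observed event a."""
    return {a: c / total for a, c in counts.items()}

# 10 observed events over the ground set {1..5}; events 4 and 5 never occur.
events = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
counts = Counter(events)
m = len(events)

p = mle(counts, m)
print(p[1])             # 0.5
print(p.get(4, 0.0))    # unseen event: MLE assigns probability 0
```

Any event absent from the sample gets exactly zero mass, which is precisely what smoothing corrects.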
The general idea behind smoothing is to adjust the MLE of the probabilities by pushing the high probabilities downwards and pushing low or zero probabilities upwards, in order to produce a more accurate distribution on the events [30]. Interpolated smoothing methods offer a flexible solution between the higher-order maximum likelihood model and a lower-order smoothed model (the so-called fallback model). The way the fallback model is designed is the key to defining a new smoothing method¹. Absolute discounting [15] and Interpolated Kneser-Ney [12] are two popular instances of interpolated smoothing methods:

P_A(e_i = a) = max{c_a − d, 0} / m + (m_d × d) / m · P′_A(e_i = a).    (1)

Here, d > 0 is a discount factor, m_d := |{a : c_a > d}| is the number of events whose counts are larger than d, and P′_A is the fallback distribution. Absolute discounting defines the fallback distribution as the smoothed version of the lower-order MLE, while Kneser-Ney uses an unusual estimate of the fallback distribution, based on the number of different contexts that the event follows in the lower-order model.

4 Smoothing structured objects

In this section, we first propose a new interpolated smoothing framework that is applicable to a richer set of objects, such as graphs, by using a Directed Acyclic Graph (DAG). We then discuss how to design such DAGs for various graph kernels.

4.1 Structural smoothing

The key to designing a new smoothing method is to define a fallback distribution which not only incorporates domain knowledge but is also easy to estimate recursively. Suppose we have access to a weighted DAG where every node at the k-th level represents an event from the ground set A. Moreover, let w_ij denote the weight of the edge connecting event i to event j, and let P_a (resp. C_a) denote the parents (resp. children) of event a ∈ A in the DAG.
We define our structural smoothing for events at level k as follows:

P^k_SS(e_i = a) = max{c_a − d, 0} / m + (m_d × d) / m · Σ_{j ∈ P_a} P^{k−1}_SS(j) · w_ja / Σ_{a′ ∈ C_j} w_ja′.    (2)

The way to understand the above equation is as follows: we subtract a fixed discounting factor d from every observed event, which accumulates to a total mass of m_d × d. Each event a receives some portion of this accumulated probability mass from its parents. The proportion of the mass that a parent j at level k − 1 transmits to a given child a depends on the weight w_ja between the parent and the child (normalized by the sum of the weights of the edges from j to all its children), and on the probability mass P^{k−1}_SS(j) that is assigned to node j. In other words, the portion a child event a is able to obtain from the total discounted mass depends on how authoritative its parents are, and how strong the relationship between the child and its parents is.

¹See Table 2 in [3] for a summary of various smoothing algorithms using this general framework.

4.2 Designing the DAG

In order to construct a DAG for smoothing structured objects, we first construct a vocabulary V that denotes the set of all unique sub-structures that are going to be smoothed. Each item in the vocabulary V corresponds to a node in the DAG. V can be generated statically or dynamically, based on the type of sub-structure the graph kernel exploits. For instance, it requires a one-time O(2^k) effort to generate the vocabulary of graphlets of size ≤ k for the graphlet kernel. However, one needs to build the vocabulary dynamically for the Weisfeiler-Lehman and shortest-path kernels, since the sub-structures depend on the node labels obtained from the datasets.
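Equation (2) can be sketched on a toy two-level DAG as follows (the events, counts, edge weights, and discount below are hypothetical; this is a minimal sketch of the update, not the paper's implementation):

```python
def structural_smoothing(level_events, counts, d, parents, w, p_parent):
    """One level of Eq. (2): discounted counts plus fallback mass from DAG parents."""
    m = sum(counts.values())                       # total observed events at this level
    md = sum(1 for c in counts.values() if c > d)  # number of events with count > d
    # Children of each parent, needed to normalize edge weights.
    children = {}
    for a in level_events:
        for j in parents.get(a, []):
            children.setdefault(j, []).append(a)
    p = {}
    for a in level_events:
        base = max(counts.get(a, 0) - d, 0.0) / m
        fallback = sum(
            p_parent[j] * w[(j, a)] / sum(w[(j, c)] for c in children[j])
            for j in parents.get(a, [])
        )
        p[a] = base + (md * d / m) * fallback
    return p

# Toy DAG: parents "x", "y" (level k-1) feed children "A", "B", "C" (level k).
parents = {"A": ["x"], "B": ["x", "y"], "C": ["y"]}
w = {("x", "A"): 2.0, ("x", "B"): 1.0, ("y", "B"): 1.0, ("y", "C"): 1.0}
p_parent = {"x": 0.6, "y": 0.4}   # already-smoothed lower-order model P^{k-1}_SS
counts = {"A": 6, "B": 4}         # "C" is never observed at this level

p = structural_smoothing(["A", "B", "C"], counts, 0.5, parents, w, p_parent)
print(round(sum(p.values()), 10))  # the smoothed values form a distribution
print(p["C"] > 0)                  # unseen event receives mass from its parent
```

Because the lower-order model sums to one and each parent's outgoing weights are normalized, the redistributed mass exactly equals the total discount m_d × d / m, so the result remains a proper distribution.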
After constructing the vocabulary V, the parent/child relationships between sub-structures need to be obtained. Given a sub-structure s of size k, we apply a transformation to find all possible sub-structures of size k − 1 that s can be reduced into. Each sub-structure s′ that is obtained by this transformation is assigned as a parent of s. After obtaining the parent/child relationships between sub-structures, the DAG is constructed by drawing a directed edge from each parent to its children. Since all descendants of a given sub-structure at depth k − 1 are at depth k, this results in a topological ordering of the vertices, and hence the resulting graph is indeed a DAG. Next, we discuss how to construct such DAGs for different graph kernels.

Graphlet Kernel: We construct the vocabulary V for the graphlet kernel by enumerating all canonical graphlets of size up to k.² Each canonically-labeled graphlet is a node in the DAG. We then apply a transformation to infer the parent/child relationship between graphlets as follows: we place a directed edge from graphlet G to G′ if, and only if, G can be obtained from G′ by deleting a node. In other words, all edges from a graphlet G of size k − 1 point to a graphlet G′ of size k. In order to assign weights to the edges, given a graphlet pair G and G′, we count the number of times G can be obtained from G′ by deleting a node (call this number n_GG′). Recall that G is of size k − 1 and G′ is of size k, and therefore n_GG′ can be at most k. Let C_G denote the set of children of node G in the DAG, and let n_G := Σ_{Ḡ ∈ C_G} n_GḠ. Then we define the weight w_GG′ of the edge connecting G and G′ as n_GG′ / n_G. The idea here is that the weight encodes the proportion of the different ways of extending G which result in the graphlet G′.
For instance, let us consider G15 and its parents G5, G6, G7 (see Figure 2 for the DAG of graphlets with size k ≤ 5). Even if graphlet G15 is not observed in the training data, it still gets a probability mass proportional to the edge weights from its parents, in order to overcome the sparsity problem of unseen data.

Weisfeiler-Lehman Kernel: The Weisfeiler-Lehman kernel performs an exact matching between the compressed multiset labels. For instance, given two labels ABCDE and ABCDF, it simply assigns zero value to their similarity, even though the two labels are partially similar. In order to smooth the Weisfeiler-Lehman kernel, we first run the original algorithm and obtain the multiset representation of each graph in the dataset. We then apply a transformation to infer the parent/child relationship between compressed labels as follows: in each iteration of the Weisfeiler-Lehman algorithm, and for each multiset label of size k in the vocabulary, we generate all of its subsets of size k − 1 while keeping the root node fixed. For instance, the parents of a multiset label ABCDE are {ABCD, ABCE, ABDE, ACDE}. Then, we simply construct the DAG by drawing a directed edge from parent labels to children. Notice that considering only the set of labels generated by the Weisfeiler-Lehman kernel is not sufficient for constructing a valid DAG. For instance, it might be the case that none of the possible parents of a given label exists in the vocabulary, simply due to the sparsity problem (e.g., out of all possible parents of ABCDE, we might only observe ABCE in the training data). Thus, restricting ourselves to the original vocabulary would leave such labels orphaned in the DAG. Therefore, we consider so-called pseudo parents as a part of the vocabulary when constructing the DAG.
Since the sub-structures in this kernel are data-dependent, we use a uniform weight between a parent and its children.

Shortest-Path Kernel: Similar to the other graph kernels discussed above, the shortest-path graph kernel does not take partial similarities into account. For instance, given two shortest paths ABCDE and ABCDF (compressed as AE5 and AF5, respectively), it assigns zero for their similarity, since their sink labels are different. However, one can notice that shortest-path sub-structures exhibit a strong dependency relationship. For instance, given a shortest path p_ij = ABCDE of size k, one can derive the shortest paths {ABCD, ABC, AB} of size < k as a result of the optimal sub-structure property; that is, one can show that all sub-paths of a shortest path are also shortest paths with the same source node [6]. In order to smooth the shortest-path kernel, we first build the vocabulary by computing all shortest paths for each graph. Let p_ij be a shortest path of size k, and let p_ij′ be a shortest path of size k − 1 that is obtained by removing the sink node of p_ij. Let l_ij be the compressed form of p_ij that represents the sorted labels of its endpoints i and j concatenated with its length (resp. l_ij′). Then, in order to build the DAG, we draw a directed edge from l_ij′ at depth k − 1 to l_ij at depth k if and only if p_ij′ is a sub-path of p_ij. In other words, all ancestors of l_ij consist of the compressed labels obtained from sub-paths of p_ij of size < k.

²We used Nauty [13] to obtain canonically-labeled isomorphic representations of graphlets.

Figure 3: An illustration of table assignment, adapted from [9]. In this example, the labels at the tables are given by (l1, . . . , l4) = (G44, G30, G32, G44). Black dots indicate the number of occurrences of each label in 10 draws from the Pitman-Yor process.
Similar to the Weisfeiler-Lehman kernel, we assign a uniform weight between parents and children.

5 Pitman-Yor Smoothing

Pitman-Yor processes are known to produce power-law distributions [8]. A novel interpretation of interpolated Kneser-Ney is proposed by [25] as approximate inference in a hierarchical Bayesian model consisting of Pitman-Yor processes [16]. In a similar spirit, we extend our model to adopt the Pitman-Yor process as an alternate smoothing framework. A Pitman-Yor process P on a ground set G_{k+1} of size-(k + 1) graphlets is defined via P_{k+1} ∼ PY(d_{k+1}, θ_{k+1}, P_k), where d_{k+1} is a discount parameter with 0 ≤ d_{k+1} < 1, θ_{k+1} > −d_{k+1} is a strength parameter, and P_k is a base distribution. The most intuitive way to understand draws from the Pitman-Yor process is via the Chinese restaurant process (see Figure 3). Consider a restaurant with an infinite number of tables, where customers enter the restaurant one by one. The first customer sits at the first table, and a graphlet is assigned to it by drawing a sample from the base distribution, since this table is occupied for the first time.

Algorithm 1 Insert a Customer
Input: d_{k+1}, θ_{k+1}, P_k
  t ← 0 // number of occupied tables
  c ← () // counts of customers at each table
  l ← () // labels of tables
  if t = 0 then
    t ← 1
    append 1 to c
    draw graphlet G_i ∼ P_k // insert a customer into the parent restaurant
    draw G_j ∼ w_ij
    append G_j to l
    return G_j
  else
    with probability ∝ max(0, c_j − d):
      c_j ← c_j + 1
      return l_j
    with probability ∝ θ + d·t:
      t ← t + 1
      append 1 to c
      draw graphlet G_i ∼ P_k // insert a customer into the parent restaurant
      draw G_j ∼ w_ij
      append G_j to l
      return G_j
  end if
The label of the first table is the first graphlet drawn from the Pitman-Yor process. When subsequent customers enter the restaurant, they decide to sit at an already occupied table i with probability proportional to c_i − d_{k+1}, where c_i represents the number of customers already sitting at table i. If they sit at an already occupied table, then the label of that table denotes the next graphlet drawn from the Pitman-Yor process. On the other hand, with probability proportional to θ_{k+1} + d_{k+1}·t, where t is the current number of occupied tables, a new customer might decide to occupy a new table. In this case, the base distribution is invoked to label this table with a graphlet. Intuitively, the reason this process generates power-law behavior is that popular graphlets, which are served on tables with a large number of customers, have a higher probability of attracting new customers and hence being generated again: a rich-gets-richer phenomenon. In a hierarchical Pitman-Yor process, the base distribution P_k is recursively defined via a Pitman-Yor process P_k ∼ PY(d_k, θ_k, P_{k−1}). In order to label a table, we need a draw from P_k, which is obtained by inserting a customer into the corresponding restaurant. However, adopting the traditional hierarchical Pitman-Yor process is not straightforward in our case, since the size of the context differs between levels of the hierarchy; that is, a child restaurant in the hierarchy can have more than one parent restaurant to request a label from. In other words, P_{k+1} is defined over G_{k+1} of size n_{k+1}, while P_k is defined over G_k of size n_k ≤ n_{k+1}. Therefore, one needs a transformation to relate base distributions over ground sets of different sizes. We incorporate edge weights between parent and child restaurants by using the same weighting scheme as in Section 4.2.
This changes the Chinese restaurant process as follows: when we need to label a table, we first draw a size-k graphlet G_i ∼ P_k by inserting a customer into the corresponding restaurant. Given G_i, we then draw a size-(k + 1) graphlet G_j with probability proportional to w_ij, where w_ij is obtained from the DAG. See Algorithm 1 for pseudo-code for inserting a customer. Deletion of a customer is handled similarly (see Algorithm 2).

Algorithm 2 Delete a Customer
Input: d, θ, P_0, C, L, t
  with probability ∝ c_l:
    c_l ← c_l − 1
    G_j ← l_j
    if c_l = 0 then
      delete the corresponding customer from the parent restaurant P_k (chosen ∝ 1/w_ij)
      delete c_l from c
      delete l_j from l
      t ← t − 1
    end if
  return G_j

6 Related work

A survey of the most popular graph kernel methods was already given in the previous sections. Several methods have been proposed for smoothing structured objects [4, 20]. Our framework is similar to dependency tree kernels [4], since both methods use the notion of smoothing for structured objects. However, our method addresses the problem of smoothing the counts of structured objects; thus, while smoothing is achieved by using a DAG, we discard the DAG once the counts are smoothed. Another work related to ours is propagation kernels [14], which define graph features as counts of similar node-label distributions on the respective graphs by using Locality Sensitive Hashing (LSH). Our framework not only considers node-label distributions, but also explicitly incorporates structural similarity via the DAG. Another similar work is the recently proposed framework of [29], which learns the co-occurrence relationship between sub-structures by using neural language models. However, their framework does not respect the structural similarity between sub-structures, which is an important property to consider, especially in the presence of noise in edges or labels.

7 Experiments

The aim of our experiments is threefold.
First, we want to show that smoothing graph kernels significantly improves classification accuracy. Second, we want to show that the smoothed kernels are comparable to or outperform state-of-the-art graph kernels in terms of classification accuracy, while remaining competitive in terms of computational requirements. Third, we want to show that our methods outperform the base kernels when edge or label noise is present.

Table 1: Comparison of classification accuracy (± standard deviation) of the shortest-path (SP), Weisfeiler-Lehman (WL), and graphlet (GK) kernels with their smoothed variants. Smoothed variants with statistically significant improvements over the base kernels are shown in bold, as measured by a t-test with a p-value ≤ 0.05. The Ramon & Gärtner (Ram & Gär), p-random walk, and random walk kernels are included for additional comparison, where > 72H indicates that the computation did not finish in 72 hours. Runtimes for constructing the DAG and smoothing (SMTH) the counts are also reported, where ″ indicates seconds and ′ indicates minutes.

DATASET          MUTAG        PTC          ENZYMES      NCI1         PROTEINS     NCI109
SP               85.22 ±2.43  58.24 ±2.44  40.10 ±1.50  73.00 ±0.24  75.07 ±0.54  73.00 ±0.21
SMOOTHED SP      87.94 ±2.58  60.82 ±1.84  42.27 ±1.07  73.26 ±0.24  75.85 ±0.28  73.01 ±0.31
WL               82.22 ±1.87  60.41 ±1.93  53.88 ±0.95  84.13 ±0.22  74.49 ±0.49  83.83 ±0.31
SMOOTHED WL      87.44 ±1.95  60.47 ±2.39  55.30 ±0.65  84.66 ±0.18  75.53 ±0.50  84.72 ±0.21
GK               81.33 ±1.02  55.56 ±1.46  27.32 ±0.96  62.46 ±0.19  69.69 ±0.46  62.33 ±0.14
SMOOTHED GK      83.17 ±0.64  58.44 ±1.00  30.90 ±1.51  62.48 ±0.15  69.83 ±0.46  62.48 ±0.11
PYP GK           83.11 ±1.23  57.44 ±1.44  29.63 ±1.30  62.50 ±0.20  70.00 ±0.80  62.68 ±0.18
RAM & GÄR        84.88 ±1.86  58.47 ±0.90  16.96 ±1.46  56.61 ±0.53  70.73 ±0.35  54.62 ±0.23
P-RANDOM WALK    80.05 ±1.64  59.38 ±1.66  30.01 ±1.00  > 72H        71.16 ±0.35  > 72H
RANDOM WALK      83.72 ±1.50  57.85 ±1.30  24.16 ±1.64  > 72H        74.22 ±0.42  > 72H
DAG/SMTH (GK)    1″/6″        6″/1″        6″/1″        6″/1″        6″/3″        6″/3″
DAG/SMTH (SP)    1″/3″        19″/1″       45″/1″       9′/1″        9′/17″       10′/16″
DAG/SMTH (WL)    2″/1″        17″/1″       10″/12′      7′/70′       2″/21′       21′/2″
DAG/SMTH (PYP)   6″/5″        6″/12″       6″/21″       6″/1′        6″/8′        6″/8′

Datasets: We used the following benchmark datasets commonly used for graph kernels: MUTAG, PTC, ENZYMES, PROTEINS, NCI1, and NCI109. MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds [5] with 7 discrete labels. PTC [26] is a dataset of 344 chemical compounds with 19 discrete labels. ENZYMES is a dataset of 600 protein tertiary structures obtained from [2], and has 3 discrete labels. PROTEINS is a dataset of 1113 graphs obtained from [2], having 3 discrete labels. NCI1 and NCI109 [28] are two balanced datasets of chemical compounds of size 4110 and 4127, with 37 and 38 labels, respectively.

Experimental setup: We compare our framework against representative instances of the major families of graph kernels in the literature. In addition to the base kernels, we also compare our smoothed kernels with the random walk kernel [7], the Ramon-Gärtner subtree kernel [18], and the p-step random walk kernel [24]. The random walk, p-step random walk, and Ramon-Gärtner kernels are written in Matlab and obtained from [22].
All other kernels were coded in Python, except Pitman-Yor smoothing, which is coded in C++³. We used a parallel implementation for smoothing the counts of the Weisfeiler-Lehman kernel for efficiency. All kernels are normalized to have unit length in the feature space. Moreover, we use 10-fold cross-validation with a binary C-Support Vector Machine (SVM), where the C value for each fold is independently tuned using training data from that fold. In order to exclude random effects of the fold assignments, this experiment is repeated 10 times, and the average prediction accuracy over the 10 runs with its standard deviation is reported⁴.

7.1 Results

In our first experiment, we compare the base kernels with their smoothed variants. As can be seen from Table 1, smoothing improves the classification accuracy of every base kernel on every dataset, with the majority of the improvements being statistically significant at p ≤ 0.05. We observe that even though smoothing improves the accuracy of graphlet kernels on PROTEINS and NCI1, the improvements are not statistically significant. We believe this is due to the fact that these datasets are not as sensitive to structural noise as the other datasets, and thus considering the partial similarities does not improve the results significantly.

³We modified the open-source implementation of PYP: https://github.com/redpony/cpyp.
⁴Implementations of the original and smoothed versions of the kernels, the datasets, and a detailed discussion of the parameter selection procedure with the list of parameters used in our experiments can be accessed from http://web.ics.purdue.edu/~ypinar/nips.

Figure 4: Classification accuracy vs. noise for base graph kernels (dashed lines) and their smoothed variants (non-dashed lines).
Moreover, PYP-smoothed graphlet kernels achieve statistically significant improvements on most of the datasets; however, they are outperformed by the smoothed graphlet kernels introduced in Section 3.

In our second experiment, we picked the best smoothed kernel in terms of classification accuracy for each dataset, and compared it against the performance of state-of-the-art graph kernels (see Table 1). Smoothed kernels outperform the other methods on all datasets, and the results are statistically significant on every dataset except PTC.

In our third experiment, we investigated the runtime behavior of our framework, which has two major costs. First, one has to compute a DAG using the original feature vectors. Next, the constructed DAG needs to be used to compute smoothed representations of the feature vectors. Table 1 shows the total wallclock runtime taken over all graphs for constructing the DAG and smoothing the counts for each dataset. As can be seen from the runtimes, our framework adds a constant factor to the original runtime for most of the datasets. While DAG creation in the Weisfeiler-Lehman kernel also adds a negligible overhead, the cost of smoothing becomes significant if the vocabulary size gets prohibitively large, due to the exponential growth of the kernel's feature space w.r.t. the subtree parameter h.

Finally, in our fourth experiment, we test the performance of graph kernels when edge or label noise is present. For edge noise, we randomly removed and added {10%, 20%, 30%} of the edges in each graph. For label noise, we randomly flipped {25%, 50%, 75%} of the node labels in each graph, where random labels are selected proportionally to the original label distribution of the graph. Figure 4 shows the performance of smoothed graph kernels under noise. As can be seen from the figure, smoothed kernels are able to outperform their base variants when noise is present.
An interesting observation is that even though a significant amount of edge noise is added to the PROTEINS and NCI datasets, the performance of the base kernels does not change drastically. This further supports our observation that these datasets are not as sensitive to structural noise as the other datasets.

8 Conclusion and Future Work

We presented a novel framework for smoothing graph kernels, inspired by smoothing techniques from natural language processing, and applied our method to state-of-the-art graph kernels. Our framework is rather general, and lends itself to many extensions. For instance, by defining domain-specific parent-child relationships, one can construct different DAGs with different weighting schemes. Another interesting extension of our smoothing framework would be to apply it to graphs with continuous labels. Moreover, even though we restricted ourselves to graph kernels in this paper, our framework is applicable to any R-convolution kernel that uses a frequency-vector based representation, such as string kernels.

9 Acknowledgments

We thank Hyokun Yun for his tremendous help in implementing Pitman-Yor processes. We also thank the anonymous NIPS reviewers for their constructive comments, and Jiasen Yang, Joon Hee Choi, Amani Abu Jabal, and Parameswaran Raman for reviewing early drafts of the paper. This work is supported by the National Science Foundation under grant #1219015.

References
[1] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICML, pages 74–81, 2005.
[2] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. In ISMB, Detroit, USA, 2005.
[3] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, pages 310–318, 1996.
[4] D. Croce, A. Moschitti, and R. Basili.
Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of EMNLP, pages 1034–1046. Association for Computational Linguistics, 2011.
[5] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds: correlation with molecular orbital energies and hydrophobicity. J. Med. Chem., 34:786–797, 1991.
[6] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt. Scalable kernels for graphs with continuous attributes. In NIPS, pages 216–224, 2013.
[7] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In COLT, pages 129–143, 2003.
[8] S. Goldwater, T. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In NIPS, 2006.
[9] S. Goldwater, T. L. Griffiths, and M. Johnson. Producing power-law distributions and damping word frequencies with two-stage language models. JMLR, 12:2335–2382, 2011.
[10] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.
[11] J. Kandola, T. Graepel, and J. Shawe-Taylor. Reducing kernel matrix diagonal dominance using semi-definite programming. In COLT, volume 2777 of Lecture Notes in Computer Science, pages 288–302, Washington, DC, 2003.
[12] R. Kneser and H. Ney. Improved backing-off for M-gram language modeling. In ICASSP, 1995.
[13] B. D. McKay. Nauty user's guide (version 2.4). Australian National University, 2007.
[14] M. Neumann, R. Garnett, P. Moreno, N. Patricia, and K. Kersting. Propagation kernels for partially labeled graphs. In ICML 2012 Workshop on Mining and Learning with Graphs, Edinburgh, UK, 2012.
[15] H. Ney, U. Essen, and R. Kneser.
On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, pages 1–38, 1994.
[16] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855–900, 1997.
[17] N. Przulj. Biological network comparison using graphlet degree distribution. In ECCB, 2006.
[18] J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. Technical report, First International Workshop on Mining Graphs, Trees and Sequences (held with ECML/PKDD 2003), 2003.
[19] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[20] A. Severyn and A. Moschitti. Fast support vector machines for convolution tree kernels. Data Mining and Knowledge Discovery, 25(2):325–357, 2012.
[21] N. Shervashidze and K. Borgwardt. Fast subtree kernels on graphs. In NIPS, 2010.
[22] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In AISTATS, 2009.
[23] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12:2539–2561, 2011.
[24] A. J. Smola and R. Kondor. Kernels and regularization on graphs. In COLT, pages 144–158, 2003.
[25] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL, 2006.
[26] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma. Statistical evaluation of the predictive toxicology challenge 2000–2001. Bioinformatics, 19(10):1183–1193, July 2003.
[27] S. V. N. Vishwanathan, N. N. Schraudolph, I. R. Kondor, and K. M. Borgwardt. Graph kernels. JMLR, 2010.
[28] N. Wale, I. A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
[29] P.
Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In KDD, pages 1365–1374. ACM, 2015.
[30] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, 2004.