{"title": "An Application of Boosting to Graph Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 729, "page_last": 736, "abstract": null, "full_text": " An Application of Boosting to\n Graph Classification\n\n\n\n Taku Kudo, Eisaku Maeda Yuji Matsumoto\n NTT Communication Science Laboratories. Nara Institute of Science and Technology.\n 2-4 Hikaridai, Seika-cho, Soraku, Kyoto, Japan 8916-5 Takayama-cho, Ikoma, Nara, Japan\n {taku,maeda}@cslab.kecl.ntt.co.jp matsu@is.naist.jp\n\n\n\n\n Abstract\n\n This paper presents an application of Boosting for classifying labeled\n graphs, general structures for modeling a number of real-world data, such\n as chemical compounds, natural language texts, and bio sequences. The\n proposal consists of i) decision stumps that use subgraph as features,\n and ii) a Boosting algorithm in which subgraph-based decision stumps\n are used as weak learners. We also discuss the relation between our al-\n gorithm and SVMs with convolution kernels. Two experiments using\n natural language data and chemical compounds show that our method\n achieves comparable or even better performance than SVMs with convo-\n lution kernels as well as improves the testing efficiency.\n\n\n1 Introduction\n\nMost machine learning (ML) algorithms assume that given instances are represented in\nnumerical vectors. However, much real-world data is not represented as numerical vectors,\nbut as more complicated structures, such as sequences, trees, or graphs. Examples include\nbiological sequences (e.g., DNA and RNA), chemical compounds, natural language texts,\nand semi-structured data (e.g., XML and HTML documents).\n\nKernel methods, such as support vector machines (SVMs) [11], provide an elegant solution\nto handling such structured data. 
In this approach, instances are implicitly mapped into a high-dimensional space, where only information about their similarities (inner products) is used for constructing a hyperplane for classification. Recently, a number of kernels have been proposed for such structured data, e.g., for sequences [7], trees [2, 5], and graphs [6]. Most are based on the idea of a feature vector implicitly composed of the counts of substructures (e.g., subsequences, subtrees, subpaths, or subgraphs).

Although kernel methods show remarkable performance, their implicit definition of the feature space makes it difficult to know what kind of features (substructures) are relevant, or which features are used in classification. To use ML algorithms for data mining or as knowledge discovery tools, they must output a list of relevant features (substructures). This information may be useful not only for a detailed analysis of individual data but also for the human decision-making process.

In this paper, we present a new machine learning algorithm for classifying labeled graphs that has the following characteristics: 1) It performs learning and classification using the structural information of a given graph. 2) It uses the set of all subgraphs (bag-of-subgraphs) as the feature set, without any constraints, which is essentially the same idea as a convolution kernel [4]. 3) Even though the size of the candidate feature set becomes quite large, it automatically selects a compact and relevant feature set based on Boosting.

Figure 1: Labeled connected graphs and subgraph relation

2 Classifier for Graphs

We first assume that an instance is represented as a labeled graph. The problem we address can then be formalized as a general problem called the graph classification problem. 
The graph classification problem is to induce a mapping f(x): X → {±1} from given training examples T = {⟨x_i, y_i⟩}_{i=1}^{L}, where x_i ∈ X is a labeled graph and y_i ∈ {±1} is the class label associated with x_i. We here focus on the problem of binary classification. The important characteristic is that the input example x_i is represented not as a numerical feature vector but as a labeled graph.

2.1 Preliminaries

In this paper we focus on undirected, labeled, and connected graphs, since our algorithm can easily be extended to directed or unlabeled graphs with minor modifications. Let us introduce labeled connected graphs (or simply labeled graphs), with their definitions and notations.

Definition 1 (Labeled Connected Graph)
A labeled graph is represented as a 4-tuple G = (V, E, L, l), where V is a set of vertices, E ⊆ V × V is a set of edges, L is a set of labels, and l: V ∪ E → L is a mapping that assigns labels to the vertices and edges. A labeled connected graph is a labeled graph in which there is a path between any pair of vertices.

Definition 2 (Subgraph)
Let G′ = (V′, E′, L′, l′) and G = (V, E, L, l) be labeled connected graphs. G′ matches G, or G′ is a subgraph of G (G′ ⊆ G), if the following conditions are satisfied: (1) V′ ⊆ V, (2) E′ ⊆ E, (3) L′ ⊆ L, and (4) l′ agrees with l on V′ and E′. If G′ is a subgraph of G, then G is a supergraph of G′.

Figure 1 shows an example of a labeled graph together with a subgraph and a non-subgraph.

2.2 Decision Stumps

Decision stumps are simple classifiers in which the final decision is made by a single hypothesis or feature. BoosTexter [10] uses word-based decision stumps for text classification. To classify graphs, we define subgraph-based decision stumps as follows.

Definition 3 (Decision Stumps for Graphs)
Let t and x be labeled graphs, and let y ∈ {±1} be a class label. 
A decision stump classifier for graphs is given by

  h_⟨t,y⟩(x) = y if t ⊆ x, and h_⟨t,y⟩(x) = −y otherwise.

The parameter for classification is the tuple ⟨t, y⟩, hereafter referred to as the rule of the decision stump. The decision stumps are trained to find the rule ⟨t̂, ŷ⟩ that minimizes the error rate on the given training data T = {⟨x_i, y_i⟩}_{i=1}^{L}:

  ⟨t̂, ŷ⟩ = argmin_{t∈F, y∈{±1}} (1/L) Σ_{i=1}^{L} I(y_i ≠ h_⟨t,y⟩(x_i)) = argmin_{t∈F, y∈{±1}} (1/(2L)) Σ_{i=1}^{L} (1 − y_i h_⟨t,y⟩(x_i)),   (1)

where F is the set of candidate graphs, i.e., the feature set F = ∪_{i=1}^{L} {t | t ⊆ x_i}, and I(·) is the indicator function. The gain function for a rule ⟨t, y⟩ is defined as

  gain(⟨t, y⟩) = Σ_{i=1}^{L} y_i h_⟨t,y⟩(x_i).   (2)

Using the gain, the search problem (1) becomes equivalent to the problem ⟨t̂, ŷ⟩ = argmax_{t∈F, y∈{±1}} gain(⟨t, y⟩). In this paper, we use the gain instead of the error rate for clarity.

2.3 Applying Boosting

Decision stump classifiers are too inaccurate to be applied to real applications, since the final decision relies on the existence of a single subgraph. However, their accuracy can be boosted by a Boosting algorithm [3, 10]. Boosting repeatedly calls a given weak learner and finally produces a hypothesis f that is a linear combination of the K hypotheses produced by the weak learners, i.e., f(x) = sgn(Σ_{k=1}^{K} α_k h_⟨t_k,y_k⟩(x)). A weak learner is built at each iteration k with a different distribution, or set of weights, d^(k) = (d_1^(k), …, d_L^(k)) on the training data, where Σ_{i=1}^{L} d_i^(k) = 1 and d_i^(k) ≥ 0. The weights are calculated so as to concentrate more on hard examples than on easy ones. To use decision stumps as the weak learner of Boosting, we redefine the gain function (2) as:

  gain(⟨t, y⟩) = Σ_{i=1}^{L} y_i d_i h_⟨t,y⟩(x_i).   (3)

In this paper, we use the AdaBoost algorithm, the original and best-known among the many variants of Boosting. 
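To make the subgraph-based decision stump of Definition 3 and the weighted gain (3) concrete, the following Python sketch computes the gain of candidate rules and selects the best one by exhaustive search. This is our illustration, not the paper's implementation: graphs are simplified to frozensets of labeled edges, and the subgraph test t ⊆ x is approximated by plain set containment, whereas a faithful implementation requires subgraph-isomorphism testing (handled in Section 3 via DFS codes). The exhaustive candidate enumeration is feasible only for toy data.

```python
from itertools import combinations

def stump(t, y, x):
    """Decision stump h_<t,y>(x): predict y if t is contained in x, else -y."""
    return y if t <= x else -y

def gain(t, y, examples, weights):
    """Weighted gain of rule <t, y> (Eq. 3); with uniform weights it is
    proportional to the unweighted gain of Eq. 2."""
    return sum(d * yi * stump(t, y, xi)
               for (xi, yi), d in zip(examples, weights))

def best_rule(examples, weights):
    """Exhaustively enumerate F = the union of all edge subsets of the
    training graphs (here a stand-in for all connected subgraphs) and
    return the gain-maximizing rule <t, y>."""
    candidates = set()
    for xi, _ in examples:
        for r in range(1, len(xi) + 1):
            for sub in combinations(xi, r):
                candidates.add(frozenset(sub))
    return max(((t, y) for t in candidates for y in (+1, -1)),
               key=lambda rule: gain(rule[0], rule[1], examples, weights))

# Toy data: each graph is a frozenset of (vertex_label, edge_label, vertex_label)
# triples; labels are hypothetical.
T = [
    (frozenset({('C', '-', 'H'), ('C', '-', 'Cl')}), +1),
    (frozenset({('C', '-', 'H'), ('C', '-', 'O')}), -1),
    (frozenset({('C', '-', 'Cl'), ('C', '=', 'O')}), +1),
]
w = [1/3, 1/3, 1/3]  # uniform normalized weights, as at the first iteration
t_hat, y_hat = best_rule(T, w)
# A perfect rule exists for this toy set, so the best gain reaches its
# maximum possible value, the total weight 1.0.
```

Inside Boosting, this search would be rerun at every iteration with the updated weights d^(k), which is exactly why the pruning machinery of Section 3 is needed.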
However, it is trivial to fit our decision stumps to other boosting algorithms, such as Arc-GV [1] and Boosting with soft margins [8].

3 Efficient Computation

In this section, we introduce an efficient and practical algorithm for finding the optimal rule ⟨t̂, ŷ⟩ from given training data. This problem is formally defined as follows.

Problem 1 (Find Optimal Rule)
Let T = {⟨x_1, y_1, d_1⟩, …, ⟨x_L, y_L, d_L⟩} be training data, where x_i is a labeled graph, y_i ∈ {±1} is the class label associated with x_i, and d_i (with Σ_{i=1}^{L} d_i = 1 and d_i ≥ 0) is a normalized weight assigned to x_i. Given T, find the optimal rule ⟨t̂, ŷ⟩ that maximizes the gain, i.e., ⟨t̂, ŷ⟩ = argmax_{t∈F, y∈{±1}} Σ_{i=1}^{L} d_i y_i h_⟨t,y⟩(x_i), where F = ∪_{i=1}^{L} {t | t ⊆ x_i}.

The most naive and exhaustive method, in which we first enumerate all subgraphs in F and then calculate the gain for each, is usually impractical, since the number of subgraphs is exponential in their size. We thus adopt an alternative strategy that avoids such an exhaustive enumeration. Our method for finding the optimal rule is modeled as a variant of the branch-and-bound algorithm and can be summarized by the following steps: 1) Define a canonical search space in which the whole set of subgraphs can be enumerated. 2) Find the optimal rule by traversing this search space. 3) Prune the search space using a criterion that upper-bounds the gain. We describe these steps more precisely in the next subsections.

Figure 2: Example of a DFS Code Tree for a graph

3.1 Efficient Enumeration of Graphs

Yan et al. proposed an efficient depth-first search algorithm that enumerates all subgraphs of a given graph [12]. The key idea of their algorithm is the DFS (depth-first search) code, a lexicographic order on sequences of edges. The search tree given by the DFS code is called a DFS Code Tree. 
Leaving the details to [12], the order of DFS codes is defined by the lexicographic order of labels as well as by the topology of the graphs. Figure 2 illustrates an example of a DFS Code Tree. Each node in this tree is represented as a 5-tuple [i, j, v_i, e_ij, v_j], where e_ij, v_i, and v_j are the labels of the edge between the i-th and j-th vertices, of the i-th vertex, and of the j-th vertex, respectively. By performing a pre-order search of the DFS Code Tree, we can obtain all the subgraphs of a graph in the order of their DFS codes. However, isomorphic enumerations cannot be avoided even with a pre-order traverse, since one graph can have several DFS codes in a DFS Code Tree. The canonical DFS code (minimum DFS code) of a graph is therefore defined as the first of its codes encountered in the pre-order search of the DFS Code Tree. Yan et al. show that two graphs G and G′ are isomorphic if and only if their minimum DFS codes min(G) and min(G′) are the same. We can thus ignore non-minimum DFS codes in the subgraph enumeration. In other words, during the depth-first traverse we can prune any node whose DFS code c is not minimum: the isomorphic graph represented by the minimum code has already been enumerated earlier in the traverse. For example, in Figure 2, if G1 is identical to G0, then G0 has been discovered before the node for G1 is reached. This property allows us to avoid an explicit isomorphism test between the two graphs.

3.2 Upper Bound of the Gain

The DFS Code Tree defines a canonical search space in which one can enumerate all subgraphs from a given set of graphs. We now consider an upper bound of the gain that allows pruning of subspaces in this canonical search space. 
The following lemma gives a convenient method for computing a tight upper bound on gain(⟨t′, y⟩) for any supergraph t′ of t.

Lemma 1 (Upper bound of the gain: μ(t))
For any t′ ⊇ t and y ∈ {±1}, the gain of ⟨t′, y⟩ is bounded by μ(t) (i.e., gain(⟨t′, y⟩) ≤ μ(t)), where μ(t) is given by

  μ(t) = max( 2 Σ_{i: y_i=+1, t⊆x_i} d_i − Σ_{i=1}^{L} y_i d_i,  2 Σ_{i: y_i=−1, t⊆x_i} d_i + Σ_{i=1}^{L} y_i d_i ).

Proof 1

  gain(⟨t′, y⟩) = Σ_{i=1}^{L} d_i y_i h_⟨t′,y⟩(x_i) = Σ_{i=1}^{L} d_i y_i · y · (2I(t′ ⊆ x_i) − 1),

where I(·) is the indicator function. If we focus on the case y = +1, then

  gain(⟨t′, +1⟩) = 2 Σ_{i: t′⊆x_i} y_i d_i − Σ_{i=1}^{L} y_i d_i ≤ 2 Σ_{i: y_i=+1, t′⊆x_i} d_i − Σ_{i=1}^{L} y_i d_i ≤ 2 Σ_{i: y_i=+1, t⊆x_i} d_i − Σ_{i=1}^{L} y_i d_i,

since |{i | y_i = +1, t′ ⊆ x_i}| ≤ |{i | y_i = +1, t ⊆ x_i}| for any t′ ⊇ t. Similarly,

  gain(⟨t′, −1⟩) ≤ 2 Σ_{i: y_i=−1, t⊆x_i} d_i + Σ_{i=1}^{L} y_i d_i.

Thus, for any t′ ⊇ t and y ∈ {±1}, gain(⟨t′, y⟩) ≤ μ(t). □

We can efficiently prune the DFS Code Tree using the upper bound μ(t). During the pre-order traverse of the DFS Code Tree, we always maintain the temporary suboptimal gain τ, the largest gain among all the gains calculated so far. If μ(t) < τ, the gain of any supergraph t′ ⊇ t is no greater than τ, and we can therefore safely prune the search space spanned from the subgraph t. If μ(t) ≥ τ, we cannot prune this space, since a supergraph t′ ⊇ t with gain(⟨t′, y⟩) ≥ τ might exist.

3.3 Efficient Computation in Boosting

At each Boosting iteration, the suboptimal value τ is reset to 0. However, if a tighter (larger) initial value of τ can be calculated in advance, the search space can be pruned more effectively. For this purpose, a cache is used to maintain all rules found in previous iterations. The initial suboptimal value τ is then calculated by selecting from the cache the rule that maximizes the gain under the current distribution. This idea is based on our observation that rules in the cache tend to be reused as the number of Boosting iterations increases. Furthermore, we also maintain the search space built by the DFS Code Tree as long as memory allows. 
This cache reduces duplicated constructions of the DFS Code Tree across Boosting iterations.

4 Connection to Convolution Kernels

Recent studies [1, 9, 8] have shown that both Boosting and SVMs [11] work according to similar strategies: constructing an optimal hypothesis that maximizes the smallest margin between positive and negative examples. The difference between the two algorithms lies in the metric of the margin: the margin of Boosting is measured in the l1-norm, while that of SVMs is measured in the l2-norm. We now describe how the maximum-margin properties are expressed in the two algorithms.

AdaBoost and Arc-GV asymptotically solve the following linear program [1, 9, 8]:

  max_{w∈R^J, ρ∈R+} ρ  s.t. y_i Σ_{j=1}^{J} w_j h_j(x_i) ≥ ρ, ||w||_1 = 1,   (4)

where J is the number of hypotheses. Note that in the case of decision stumps for graphs, J = |{±1} × F| = 2|F|.

SVMs, on the other hand, solve the following quadratic optimization problem [11]:1

  max_{w∈R^J, ρ∈R+} ρ  s.t. y_i (w · φ(x_i)) ≥ ρ, ||w||_2 = 1.   (5)

1For simplicity, we omit the bias term (b) and the soft-margin extension.

The function φ(x) maps the original input example x into a J-dimensional feature vector (i.e., φ(x) ∈ R^J). The l2-norm margin gives a separating hyperplane expressed by dot-products in the feature space. The feature space in SVMs is thus expressed implicitly by using a Mercer kernel function, a generalized dot-product between two objects, i.e., K(x_1, x_2) = φ(x_1) · φ(x_2).

The best-known kernels for modeling structured data are convolution kernels [4] (e.g., the string kernel [7] and tree kernels [2, 5]), in which a feature vector is implicitly composed of the counts of substructures.2 The implicit mapping defined by a convolution kernel is given as φ(x) = (#(t_1 ⊆ x), …, #(t_|F| ⊆ x)), where t_j ∈ F and #(t_j ⊆ x) is the number of occurrences of t_j in x. 
Noticing that a decision stump can be expressed as h_⟨t,y⟩(x) = y · (2I(t ⊆ x) − 1), we see that the constraints, and hence the feature space, of Boosting with substructure-based decision stumps are essentially the same as those of SVMs with a convolution kernel.3 The critical difference is the definition of the margin: Boosting uses the l1-norm, and SVMs use the l2-norm. The difference between them can be explained in terms of sparseness.

It is well known that the solution, or separating hyperplane, of SVMs is expressed as a linear combination of the training examples with coefficients α_i, i.e., w = Σ_{i=1}^{L} α_i φ(x_i) [11]. Maximizing the l2-norm margin gives a sparse solution in the example space, i.e., most of the α_i become 0. The examples with non-zero coefficients, called support vectors, form the final solution. Boosting, in contrast, performs the computation explicitly in the feature space. The concept behind Boosting is that only a few hypotheses are needed to express the final solution, and the l1-norm margin realizes exactly this property [8]. Boosting thus finds a sparse solution in the feature space. The accuracies of these two methods depend on the given training data. However, we argue that Boosting has the following practical advantages. First, sparse hypotheses allow the construction of an efficient classification algorithm. The classification cost of SVMs with the tree kernel is O(l |n_1| |n_2|), where n_1 and n_2 are trees and l is the number of support vectors; this is too heavy for real applications. Boosting, in contrast, classifies faster, since its cost depends only on a small number of decision stumps. Second, sparse hypotheses are useful in practice, as they provide "transparent" models with which we can analyze how the model performs or what kind of features are useful. 
It is difficult to perform such an analysis with kernel methods, since they define the feature space implicitly.

5 Experiments and Discussion

To evaluate our algorithm, we conducted experiments on two sets of real-world data.

(1) Cellphone review classification (REV). The goal of this task is to classify reviews of cellphones as positive or negative. 5,741 sentences were collected from a Web-BBS discussion about cellphones in which users were directed to submit positive reviews separately from negative ones. Each sentence is represented as a word-based dependency tree using the Japanese dependency parser CaboCha.4

(2) Toxicology prediction of chemical compounds (PTC). The task is to classify chemical compounds by carcinogenicity. We used the PTC data set,5 consisting of 417 compounds tested on 4 types of animals: male mouse (MM), female mouse (FM), male rat (MR), and female rat (FR). Each compound is assigned one of the following labels: {EE, IS, E, CE, SE, P, NE, N}.

2Strictly speaking, the graph kernel [6] is not a convolution kernel, because it is based not on the counts of subgraphs but on random walks in a graph.
3The difference between decision stumps and convolution kernels is that the former use a binary feature denoting the existence (or absence) of each substructure, whereas the latter use the cardinality of each substructure. This makes little difference in practice, since a given graph is often sparse and the cardinality of its substructures is well approximated by their existence.
4http://chasen.naist.jp/~taku/software/cabocha/
5http://www.predictive-toxicology.org/ptc/

Table 1: Classification F-scores of the REV and PTC tasks

                                               REV    PTC
                                                      MM    FM    MR    FR
  Boosting  BOL-based Decision Stumps          76.6   47.0  52.9  42.7  26.9
            Subgraph-based Decision Stumps     79.0   48.9  52.5  55.1  48.5
  SVMs      BOL Kernel                         77.2   40.9  39.9  43.9  21.8
            Tree/Graph Kernel                  79.4   42.3  34.1  53.2  25.9

We here assume that CE, SE, and P are "positive" and that NE and N are "negative", which is exactly the same setting as in [6]. We thus have four binary classification tasks (MM/FM/MR/FR) in this data set.

We compared the performance of our Boosting algorithm with that of support vector machines with the tree kernel [2, 5] (for REV) and the graph kernel [6] (for PTC), according to their F-scores in 5-fold cross-validation.

Table 1 summarizes the best results on the REV and PTC tasks, obtained by varying the hyperparameters of Boosting and SVMs (e.g., the maximum number of Boosting iterations, the soft-margin parameter of SVMs, and the termination probability of random walks in the graph kernel [6]). We also show results with bag-of-label (BOL) features as a baseline. In most tasks and categories, the ML algorithms with structural features outperform the baseline systems (BOL). These results support our first intuition that structural features are important for the classification of structured data, such as natural language texts and chemical compounds.

Comparing our Boosting algorithm with SVMs using the tree kernel, no significant difference can be found on the REV data set. However, on the PTC task, our method outperforms SVMs using the graph kernel on the categories MM, FM, and FR at a statistically significant level. Furthermore, the number of active features (subgraphs) used in Boosting is much smaller than in SVMs. With our method, about 1,800 and 50 features (subgraphs) are used in the REV and PTC tasks, respectively, while the potential number of features is quite large. Even when given all subgraphs as feature candidates, Boosting selects a small and highly relevant subset of features.

Figure 3 shows examples of the extracted support features (subgraphs) in the REV and PTC tasks. 
In the REV task, features reflecting the domain knowledge (cellphone reviews) are extracted: 1) "want to use" → positive, 2) "hard to use" → negative, 3) "recharging time is short" → positive, 4) "recharging time is long" → negative. These features are interesting because the correct label (positive/negative) cannot be determined from bag-of-label features such as "charging," "short," or "long" alone. In the PTC task, similar structures show different behavior. For instance, trihalomethanes (THMs), well-known carcinogenic substances (e.g., chloroform, bromodichloromethane, and chlorodibromomethane), contain the common substructure H-C-Cl (Fig. 3(a)). However, THMs do not contain the similar but different structure H-C(C)-Cl (Fig. 3(b)). Such structural information is useful for analyzing how the system classifies the input data into a category and what kind of features are used in the classification. Such an analysis cannot be carried out with kernel methods, since they define their feature space implicitly.

The reason why the graph kernel shows poor performance on the PTC data set is that it cannot identify subtle differences between two graphs, because it is based on random walks in a graph. For example, the kernel dot-product between the similar but different structures of Figs. 3(c) and 3(d) becomes quite large, although they show different behavior. To classify chemical compounds by their functions, the system must be capable of capturing subtle differences among the given graphs.

Figure 3: Support features and their weights

The testing speed of our Boosting algorithm is also much faster than that of SVMs with tree/graph kernels. In the REV task, the speeds of Boosting and SVMs are 0.135 sec./1,149 instances and 57.91 sec./1,149 instances, respectively.6 
Our method is significantly faster than SVMs with tree/graph kernels, without a discernible loss of accuracy.

6 Conclusions

In this paper, we focused on an algorithm for the classification of labeled graphs. The proposal consists of i) decision stumps that use subgraphs as features, and ii) a Boosting algorithm in which the subgraph-based decision stumps are applied as the weak learners. Two experiments were carried out to confirm the importance of subgraph features. In addition, we experimentally showed that our Boosting algorithm is accurate and efficient for classification tasks involving discrete structural features.

References

[1] Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1518, 1999.

[2] Michael Collins and Nigel Duffy. Convolution kernels for natural language. In NIPS 14, Vol. 1, pages 625-632, 2001.

[3] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[4] David Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz (UCSC-CRL-99-10), 1999.

[5] Hisashi Kashima and Teruo Koyanagi. SVM kernels for semi-structured data. In Proc. of ICML, pages 291-298, 2002.

[6] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. Marginalized kernels between labeled graphs. In Proc. of ICML, pages 321-328, 2003.

[7] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2, 2002.

[8] Gunnar Rätsch, Takashi Onoda, and Klaus-Robert Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.

[9] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proc. of ICML, pages 322-330, 1997.

[10] Robert E. 
Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.

[11] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[12] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In Proc. of ICDM, pages 721-724, 2002.

6We tested the performance on Linux with dual XEON 2.4 GHz processors.
", "award": [], "sourceid": 2739, "authors": [{"given_name": "Taku", "family_name": "Kudo", "institution": null}, {"given_name": "Eisaku", "family_name": "Maeda", "institution": null}, {"given_name": "Yuji", "family_name": "Matsumoto", "institution": null}]}