{"title": "Anytime Induction of Cost-sensitive Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 425, "page_last": 432, "abstract": null, "full_text": "Anytime Induction of Cost-sensitive Trees\n\nTechnion\u2014Israel Institute of Technology\n\nTechnion\u2014Israel Institute of Technology\n\nShaul Markovitch\n\nComputer Science Department\n\nHaifa 32000, Israel\n\nshaulm@cs.technion.ac.il\n\nSaher Esmeir\n\nComputer Science Department\n\nHaifa 32000, Israel\n\nesaher@cs.technion.ac.il\n\nAbstract\n\nMachine learning techniques are increasingly being used to produce a wide-range\nof classi\ufb01ers for complex real-world applications that involve nonuniform testing\ncosts and misclassi\ufb01cation costs. As the complexity of these applications grows,\nthe management of resources during the learning and classi\ufb01cation processes be-\ncomes a challenging task. In this work we introduce ACT (Anytime Cost-sensitive\nTrees), a novel framework for operating in such environments. ACT is an anytime\nalgorithm that allows trading computation time for lower classi\ufb01cation costs. It\nbuilds a tree top-down and exploits additional time resources to obtain better esti-\nmations for the utility of the different candidate splits. Using sampling techniques\nACT approximates for each candidate split the cost of the subtree under it and fa-\nvors the one with a minimal cost. Due to its stochastic nature ACT is expected to\nbe able to escape local minima, into which greedy methods may be trapped. Ex-\nperiments with a variety of datasets were conducted to compare the performance\nof ACT to that of the state of the art cost-sensitive tree learners. The results show\nthat for most domains ACT produces trees of signi\ufb01cantly lower costs. 
ACT is\nalso shown to exhibit good anytime behavior with diminishing returns.\n\n1 Introduction\n\nSuppose that a medical center has decided to use machine learning techniques to induce a diagnostic\ntool from records of previous patients. The center aims to obtain a comprehensible model, with low\nexpected test costs (the costs of testing attribute values) and high expected accuracy. Moreover, in\nmany cases there are costs associated with the predictive errors. In such a scenario, the task of the\ninducer is to produce a model with low expected test costs and low expected misclassi\ufb01cation costs.\n\nA good candidate for achieving the goals of comprehensibility and reduced costs is a decision\ntree model. Decision trees are easily interpretable because they mimic the way doctors think\n[13][chap. 9].\nIn the context of cost-sensitive classi\ufb01cation, decision trees are the natural form\nof representation: they ask only for the values of the features along a single path from the root to\na leaf. Indeed, cost-sensitive trees have been the subject of many research efforts. Several works\nproposed learners that consider different misclassi\ufb01cation costs [7, 18, 6, 9, 10, 14, 1]. These meth-\nods, however, do not consider test costs. Other authors designed tree learners that take into account\ntest costs, such as IDX [16], CSID3 [22], and EG2 [17]. These methods, however, do not consider\nmisclassi\ufb01cation costs. The medical center scenario exempli\ufb01es the need for considering both types\nof cost together: doctors do not perform a test before considering both its cost and its importance to\nthe diagnosis.\n\nMinimal Cost trees, a method that attempts to minimize both types of costs simultaneously has been\nproposed in [21]. A tree is built top-down. The immediate reduction in total cost each split results\nin is estimated, and a split with the maximal reduction is selected. 
Although ef\ufb01cient, the Minimal\nCost approach can be trapped in a local minimum and produce trees that are not globally optimal.\n\nFigure 1: A dif\ufb01culty for greedy learners (left). Importance of context-based evaluation (right).\n[Left panel: a tree that tests a9 at the root and a10 below it, with cost(a1-8) = $$ and cost(a9,10) = $$$$$$. Right panel: a tree that tests a1 at the root, then a7 and a9 in one subtree and a6 and a4 in the other, with cost(a1-10) = $$.]\n\nFor example, consider a problem with 10 attributes a1, ..., a10, of which only a9 and a10 are relevant.\nThe cost of a9 and a10, however, is signi\ufb01cantly higher than that of the others but lower than the cost\nof misclassi\ufb01cation. This may hide their usefulness, and mislead the learner to \ufb01t a large expensive\ntree. The problem is intensi\ufb01ed if a9 and a10 are interdependent with a low immediate information\ngain (e.g., a9 \u2295 a10), as illustrated in Figure 1 (left). In such a case, even if the costs were uniform,\nlocal measures would fail to recognize the relevance of a9 and a10 and other attributes might be\npreferred. The Minimal Cost method is appealing when resources are very limited. However, it\nrequires a \ufb01xed runtime and cannot exploit additional resources. In many real-life applications, we\nare willing to wait longer if a better tree can be induced. For example, due to the importance of the\nmodel, the medical center is ready to allocate 1 week to learn it. Algorithms that can exploit more\ntime to produce solutions of better quality are called anytime algorithms [5].\n\nOne way to exploit additional time when searching for a tree of lower costs is to widen the search\nspace. In [2] the cost-sensitive learning problem is formulated as a Markov Decision Process (MDP)\nand a systematic search is used to solve the MDP. 
Although the algorithm searches for an optimal\nstrategy, the time and memory limits prevent it from always \ufb01nding optimal solutions.\n\nThe ICET algorithm [24] was a pioneer in searching non-greedily for a tree that minimizes both\ncosts together. ICET uses genetic search to produce a new set of costs that re\ufb02ects both the original\ncosts and the contribution each attribute can make to reduce misclassi\ufb01cation costs. Then it builds\na tree using the greedy EG2 algorithm but with the evolved costs instead of the original ones. ICET\nwas shown to produce trees of lower total cost. It can use additional time resources to produce more\ngenerations and hence to widen its search in the space of costs. Nevertheless, it is limited in the\nway it can exploit extra time. Firstly, it builds the \ufb01nal tree using EG2. EG2 prefers attributes with\nhigh information gain (and low test cost). Therefore, when the concept to learn hides interdepen-\ndency between attributes, the greedy measure may underestimate the usefulness of highly relevant\nattributes, resulting in more expensive trees. Secondly, even if ICET may overcome the above prob-\nlem by reweighting the attributes, it searches the space of parameters globally, regardless of the\ncontext. This imposes a problem if an attribute is important in one subtree but useless in another. To\nillustrate the above consider the concept in Figure 1 (right). There are 10 attributes of similar costs.\nDepending on the value of a1, the target concept is a7 \u2295 a9 or a4 \u2295 a6. Due to interdependencies,\nall attributes will have a low gain. Because ICET assigns costs globally, they will have similar costs\nas well. Therefore, ICET will not be able to recognize which attribute is relevant in what context.\n\nRecently, we have introduced LSID3, a cost-insensitive algorithm, which can induce more accurate\ntrees when given more time [11]. The algorithm uses stochastic sampling techniques to evaluate\ncandidate splits. 
It is not designed, however, to minimize test and misclassi\ufb01cation costs. In this\nwork we build on LSID3 and propose ACT, an Anytime Cost-sensitive Tree learner that can exploit\nadditional time to produce trees of lower costs. Applying the sampling mechanism to the cost-\nsensitive setup, however, is not trivial and imposes several challenges which we address in Section\n2. Extensive set of experiments that compares ACT to EG2 and to ICET is reported in Section 3. The\nresults show that ACT is signi\ufb01cantly better for the majority of problems. In addition ACT is shown\nto exhibit good anytime behavior with diminishing returns. The major contributions of this paper\nare: (1) a non-greedy algorithm for learning trees of lower costs that allows handling complex cost\nstructures, (2) an anytime framework that allows learning time to be traded for reduced classi\ufb01cation\ncosts, and (3) a parameterized method for automatic assigning of costs for existing datasets.\n\nNote that costs may also be involved during example acquisition [12, 15]. In this work, however,\nwe assume that the full training examples are in hand. Moreover, we assume that during the test\nphase, all tests in the relevant path will be taken. Several test strategies that determine which values\nto query for and at what order have been recently studied [21]. These strategies are orthogonal to\nour work because they assume a given tree.\n\n2\n\n\f2 The ACT Algorithm\n\nOf\ufb02ine concept learning consists of two stages: learning from labelled examples; and using the\ninduced model to classify unlabelled instances. These two stages involve different types of cost\n[23]. Our primary goal in this work is to trade the learning time for reduced test and misclassi\ufb01cation\ncosts. 
To make the problem well de\ufb01ned, we need to specify how to: (1) represent misclassi\ufb01cation\ncosts, (2) calculate test costs, and (3) combine both types of cost.\n\nTo answer these questions, we adopt the model described by Turney [24]. In a problem with |C|\ndifferent classes, a classi\ufb01cation cost matrix M is a |C| \u00d7 |C| matrix whose Mi,j entry de\ufb01nes the\npenalty of assigning the class ci to an instance that actually belongs to the class cj. To calculate\nthe test costs of a particular case, we sum the cost of the tests along the path from the root to the\nappropriate leaf. For tests that appear several times we charge only for the \ufb01rst occurrence. The\nmodel handles two special test types, namely grouped and delayed. Grouped tests share a common\ncost that is charged only once per group. Each test also has an extra cost charged when the test is\nactually made. For example, consider a tree path with tests like cholesterol level and glucose level.\nFor both values to be measured, a blood test is needed. Clearly, once blood samples are taken to\nmeasure the cholesterol level, the cost for measuring the glucose level is lower. Delayed tests are\ntests whose outcome cannot be obtained immediately, e.g., lab test results. Such tests force us to\nwait until the outcome is available. Alternatively, we can take into account all possible outcomes\nand follow several paths in the tree simultaneously (and pay for their costs). Once the result of the\ndelayed test is available, the prediction is in hand. Note that we might be charged for tests that we\nwould not perform if the outcome of the delayed tests were available. In this work we do not handle\ndelayed costs but we do explain how to adapt our framework to scenarios that involve them.\n\nHaving measured the test costs and misclassi\ufb01cation costs, an important question is how to combine\nthem. Following [24] we assume that both types of cost are given in the same scale. 
Alternatively,\nQin et al. [19] presented a method to handle the two kinds of cost scales by setting a maximal\nbudget for one kind and minimizing the other.\n\nACT, our proposed anytime framework for induction of cost-sensitive trees, builds on the recently\nintroduced LSID3 algorithm [11]. LSID3 adopts the general top-down induction of decision trees\nscheme (TDIDT): it starts from the entire set of training examples, partitions it into subsets by testing\nthe value of an attribute, and then recursively builds subtrees. Unlike greedy inducers, LSID3 invests\nmore time resources for making better split decisions. For every candidate split, LSID3 attempts to\nestimate the size of the resulting subtree were the split to take place and, following Occam\u2019s razor\n[4], it favors the one with the smallest expected size. The estimation is based on a biased sample\nof the space of trees rooted at the evaluated attribute. The sample is obtained using a stochastic\nversion of ID3, called SID3 [11]. In SID3, rather than choosing an attribute that maximizes the\ninformation gain \u2206I (as in ID3), the splitting attribute is chosen semi-randomly. The likelihood that\nan attribute will be chosen is proportional to its information gain. LSID3 is a contract algorithm\nparameterized by r, the sample size. When r is larger, the resulting estimations are expected to be\nmore accurate, therefore improving the \ufb01nal tree. Let m = |E| be the number of examples and\nn = |A| be the number of attributes. The runtime complexity of LSID3 is O(rmn^3) [11]. LSID3\nwas shown to exhibit good anytime behavior with diminishing returns. When applied to hard\nconcepts, it produced signi\ufb01cantly better trees than ID3 and C4.5. ACT takes the same sampling\napproach as LSID3. However, three major components of LSID3 need to be replaced for the\ncost-sensitive setup: (1) sampling the space of trees, (2) evaluating a tree, and (3) pruning.\nObtaining the Sample. 
LSID3 uses SID3 to bias the samples towards small trees. In ACT, however,\nwe would like to bias our sample towards low cost trees. For this purpose, we designed a stochastic\nversion of the EG2 algorithm, which attempts to build low cost trees greedily. In EG2, a tree is built\ntop-down, and the attribute that maximizes ICF (Information Cost Function) is chosen for splitting\na node, where $\\mathrm{ICF}(a) = \\left(2^{\\Delta I(a)} - 1\\right) / \\left(\\mathrm{cost}(a) + 1\\right)^{w}$.\nIn Stochastic EG2 (SEG2), we choose splitting attributes semi-randomly, proportionally to their ICF.\nDue to the stochastic nature of SEG2 we expect to be able to escape local minima for at least some\nof the trees in the sample. To obtain a sample of size r, ACT uses EG2 once and SEG2 r \u2212 1 times.\nUnlike ICET, we give EG2 and SEG2 direct access to context-based costs, i.e., if an attribute has\nalready been tested its cost would be zero and if another attribute that belongs to the same group\nhas been tested, a group discount is applied. The parameter w controls the bias towards lower cost\nattributes. While ICET tunes this parameter using genetic search, we set w inversely proportional to\nthe misclassi\ufb01cation cost: a high misclassi\ufb01cation cost results in a smaller w, reducing the effect of\nattribute costs. One direction for future work would be to tune w a priori.\nEvaluating a Subtree. As a cost-insensitive learner, the main goal of LSID3 is to maximize the\nexpected accuracy of the learned tree. Following Occam\u2019s razor, it uses the tree size as a preference\nbias and favors splits that are expected to reduce the \ufb01nal tree size. In a cost-sensitive setup, our goal\nis to minimize the expected cost of classi\ufb01cation. Following the same lookahead strategy as LSID3,\nwe sample the space of trees under each candidate split. However, instead of choosing an attribute\nthat minimizes the size, we would like to choose one that minimizes costs. 
Therefore, given a tree,\nwe need to come up with a procedure that estimates the expected costs when classifying a future\ncase. This cost consists of two components: the test cost and misclassi\ufb01cation cost.\n\nAssuming that the distribution of future cases would be similar to that of the learning examples, we\ncan estimate the test costs using the training data. Given a tree, we calculate the average test cost\nof the training examples and use it to approximate the test cost of new cases. For a tree T and a set\nof training examples E, we denote the average cost of traversing T for an example from E (average\ntesting cost) by tst-cost(T, E). Note that group discounts and delayed cost penalties do not need a\nspecial care because they will be incorporated when calculating the average test costs.\n\nEstimating the cost of errors is not obvious. One can no longer use the tree size as a heuristic for pre-\ndictive errors. Occam\u2019s razor allows to compare two consistent trees but does not provide a mean to\nestimate accuracy. Moreover, tree size is measured in a different currency than accuracy and hence\ncannot be easily incorporated in the cost function. Instead, we propose using a different estimator:\nthe expected error [20]. For a leaf with m training examples, of which e are misclassi\ufb01ed the ex-\npected error is de\ufb01ned as the upper limit on the probability for error, i.e., EE(m, e, cf ) = Ucf (e, m)\nwhere cf is the con\ufb01dence level and U is the con\ufb01dence interval for binomial distribution. The ex-\npected error of a tree is the sum of the expected errors in its leafs. Originally, the expected error was\nused by C4.5 to predict whether a subtree performs better than a leaf. Although it lacks theoretical\nbasis, it was shown experimentally to be a good heuristic. In ACT we use the expected error to\napproximate the misclassi\ufb01cation cost. Assume a problem with |C| classes and a misclassi\ufb01cation\ncost matrix M . 
Let c be the class label in a leaf l. Let m be the total number of examples in l and\n$m_i$ be the number of examples in l that belong to class i. The expected misclassi\ufb01cation cost in l is\n(the rightmost expression assumes uniform misclassi\ufb01cation cost $M_{i,j} = \\mathit{mc}$)\n\n$$\\text{mc-cost}(l) = EE(m, m - m_c, cf) \\cdot \\frac{1}{|C| - 1} \\sum_{i \\neq c} M_{c,i} = EE(m, m - m_c, cf) \\cdot \\mathit{mc}$$\n\nThe expected error of a tree is the sum of the expected errors in its leafs. In our experiments we use\ncf = 0.25, as in C4.5. In the future, we intend to tune cf if the allocated time allows. Alternatively,\nwe also plan to estimate the error using a set-aside validation set, when the training set size allows.\nTo conclude, let E be the set of examples used to learn a tree T, and let m be the size of E. Let L\nbe the set of leafs in T. The expected total cost of T when classifying an instance is:\n\n$$\\text{tst-cost}(T, E) + \\frac{1}{m} \\cdot \\sum_{l \\in L} \\text{mc-cost}(l).$$\n\nHaving decided on the sampler and the tree utility function we are ready to formalize the tree\ngrowing phase in ACT. A tree is built top-down. The procedure for selecting a splitting test at each\nnode is listed in Figure 2 (left), and exempli\ufb01ed in Figure 2 (right). The selection procedure, as\nformalized in Figure 2 (left), needs to be slightly modi\ufb01ed when an attribute is numeric: instead\nof iterating over the values the attribute can take, we examine r cutting points, each evaluated\nwith a single invocation of EG2. This guarantees that numeric and nominal attributes get the same\nresources. The r points are chosen dynamically, according to their information gain.\nCost-sensitive Pruning. Pruning plays an important role in decision tree induction.\nIn cost-insensitive\nenvironments, the main goal of pruning is to simplify the tree in order to avoid over\ufb01tting.\nA subtree is pruned if the resulting tree is expected to yield a lower error. 
When test costs are taken\ninto account, pruning has another important role: reducing costs. It is worthwhile to keep a subtree\nonly if its expected reduction to the misclassi\ufb01cation cost is larger than the cost of its tests. If the\nmisclassi\ufb01cation cost were zero, it would make no sense to keep any split in the tree. If, on the other hand,\nthe misclassi\ufb01cation cost were very large, we would expect similar behavior to the cost-insensitive\nsetup. To handle this challenge, we propose a novel approach for cost-sensitive pruning. Similarly\nto error-based pruning [20], we scan the tree bottom-up. For each subtree, we compare its expected\ntotal cost to that of a leaf. Formally, assume that e examples in E do not belong to the default class.1\nWe prune a subtree T into a leaf if:\n\n$$\\frac{1}{m} \\cdot \\text{mc-cost}(l) \\leq \\text{tst-cost}(T, E) + \\frac{1}{m} \\cdot \\sum_{l \\in L} \\text{mc-cost}(l).$$\n\nProcedure ACT-CHOOSE-ATTRIBUTE(E, A, r)\n  If r = 0 Return EG2-CHOOSE-ATTRIBUTE(E, A)\n  Foreach a \u2208 A\n    Foreach v_i \u2208 domain(a)\n      E_i \u2190 {e \u2208 E | a(e) = v_i}\n      T \u2190 EG2(a, E_i, A \u2212 {a})\n      min_i \u2190 COST(T, E_i)\n      Repeat r \u2212 1 times\n        T \u2190 SEG2(a, E_i, A \u2212 {a})\n        min_i \u2190 min(min_i, COST(T, E_i))\n    total_a \u2190 COST(a) + \u2211_{i=1}^{|domain(a)|} min_i\n  Return a for which total_a is minimal\n\nFigure 2: Attribute selection (left) and evaluation (right) in ACT. Assume that the cost of a in the current\ncontext is 1. The estimated cost of a subtree rooted at a is therefore 1 + min(4.1, 5.1) + min(8.9, 4.9) = 10.\n[Right panel: a node a with two branches; the sampled subtrees cost 4.1 (EG2) and 5.1 (SEG2) under the first branch, and 8.9 (EG2) and 4.9 (SEG2) under the second.]\n\n3 Empirical Evaluation\n\nA variety of experiments were conducted to test the performance and behavior of ACT. First we\ndescribe and motivate our experimental methodology. 
We then present and discuss our results.\n\n3.1 Methodology\n\nWe start our experimental evaluation by comparing ACT, given a \ufb01xed resource allocation, with\nEG2 and ICET. EG2 was selected as a representative for greedy learners. We also tested the per-\nformance of CSID3 and IDX but found the results very similar to EG2, con\ufb01rming the report in\n[24]. Our second set of experiments compares the anytime behavior of ACT to that of ICET. Be-\ncause the code of EG2 and ICET is not publicly available we have reimplemented them. To verify\nthe reimplementation results, we compared them with those reported in literature. We followed the\nsame experimental setup and used the same 5 datasets. The results are indeed similar with the basic\nversion of ICET achieving an average cost of 49.9 in our reimplementation vs. 49 in Turney\u2019s paper\n[24]. One possible reason for the slight difference may be the randomization involved in the genetic\nsearch as well as in data partitioning into training, validating, and testing sets.\nDatasets. Typically, machine learning researchers use datasets from the UCI repository [3]. Only\n\ufb01ve UCI datasets, however, have assigned test costs [24]. To gain a wider perspective, we developed\nan automatic method that assigns costs to existing datasets randomly. The method is parameterized\nwith: (1) cr the cost range, (2) g the number of desired groups as a percentage of the number of\nattributes, and (3) sc the group shared cost as a percentage of the maximal marginal cost in the\ngroup. Using this method we assigned costs to 25 datasets: 21 arbitrarily chosen UCI datasets2\nand 4 datasets that represent hard concept and have been used in previous research. The online\nappendix 3 gives detailed descriptions of these datasets. Two versions of each dataset have been\ncreated, both with cost range of 1-100. In the \ufb01rst g and sc were set to 20% and in the second\nthey were set to 80%. 
These parameters were chosen arbitrarily, in an attempt to cover different types\nof costs. In total we have 55 datasets: 5 with costs assigned as in [24] and 50 with random costs.\n\n1 The default class is the one that minimizes the misclassi\ufb01cation cost in the node.\n2 The chosen UCI datasets vary in their size, type of attributes and dimension.\n3 http://www.cs.technion.ac.il/~esaher/publications/nips07\n\nTable 1: Average cost of classi\ufb01cation as a percentage of the standard cost of classi\ufb01cation. The table also lists\nfor each of ACT and ICET the number of signi\ufb01cant wins they had using t-test. The last row shows the winner,\nif any, as implied by a Wilcoxon test over all datasets with \u03b1 = 5% (- marks an empty cell).\n\n          mc = 10                mc = 100               mc = 1000              mc = 10000\n          EG2    ICET   ACT      EG2    ICET   ACT      EG2    ICET   ACT      EG2    ICET   ACT\nAVERAGE   22.37  10.23  2.21     25.93  17.15  11.86    38.69  35.28  34.38    54.22  47.47  41.62\nBETTER    -      0      34       -      0      25       -      3      11       -      10     12\nWILCOXON  -      -      \u221a        -      -      \u221a        -      -      -        -      -      \u221a\n\nFigure 3: Illustration of the differences in performance between ACT and ICET for misclassi\ufb01cation costs\n(from left to right: 10, 100, 1000, and 10000). Each point represents a dataset. The x-axis represents the cost\nof ICET while the y-axis represents that of ACT. The dashed line indicates equality. Points are below it if ACT\nperforms better and above it if ICET is better.\n\nCost-insensitive learning algorithms focus on accuracy and therefore are expected to perform well\nwhen testing costs are negligible relative to misclassi\ufb01cation costs. 
On the other hand, when testing\ncosts are signi\ufb01cant, ignoring them would result in expensive classi\ufb01ers. Therefore, to evaluate a\ncost-sensitive learner a wide spectrum of misclassi\ufb01cation costs is needed. For each problem out of\nthe 55, we created 4 instances, with uniform misclassi\ufb01cation costs mc = 10, 100, 1000, 10000.\nNormalized Cost. As pointed out by Turney [24], using the average cost is problematic because:\n(1) the differences in costs among the algorithms become small as misclassi\ufb01cation cost increases,\n(2) it is dif\ufb01cult to combine the results for the multiple datasets, and (3) it is dif\ufb01cult to combine\naverage costs for different misclassi\ufb01cation costs. To overcome these problems, Turney suggests\nnormalizing the average cost of classi\ufb01cation by dividing it by the standard cost, de\ufb01ned as\n$TC + \\min_i (1 - f_i) \\cdot \\max_{i,j} M_{i,j}$. The standard cost is an approximation for the maximal cost\nin a given problem. It consists of two components: (1) TC, the cost if we take all tests, and (2) the\nmisclassi\ufb01cation cost if the classi\ufb01er achieves only the baseline accuracy. $f_i$ denotes the frequency\nof class i in the data and hence $(1 - f_i)$ would be the error if the response were always class i.\nStatistical Signi\ufb01cance. For each problem, one 10-fold cross-validation experiment has been conducted.\nThe same partition to train-test sets was used for all compared algorithms. To test the\nstatistical signi\ufb01cance of the differences between ACT and ICET we used two tests. The \ufb01rst is\na t-test at the \u03b1 = 5% signi\ufb01cance level: for each method we counted how many times it was a signi\ufb01cant\nwinner. 
The second is the Wilcoxon test [8], which compares classi\ufb01ers over multiple datasets and\nstates whether one method is signi\ufb01cantly better than the other (\u03b1 = 5%).\n\n3.2 Fixed-time Comparison\n\nFor each of the 55 \u00d7 4 problem instances, we ran the seeded version of ICET with its default\nparameters (20 generations),4 EG2, and ACT with r = 5. We chose r = 5 so that the average runtime\nof ACT would be shorter than that of ICET for all problems. EG2 and ICET use the same post-pruning\nmechanism as in C4.5. In EG2 the default con\ufb01dence factor is used (0.25) while in ICET this value\nis tuned using the genetic search.\n\nTable 1 lists the average results, Figure 3 illustrates the differences between ICET and ACT, and\nFigure 4 (left) plots the average cost for the different values of mc. The full results are available\nin the online appendix. Similarly to the results reported in [24], ICET is clearly better than EG2,\nbecause the latter does not consider misclassi\ufb01cation costs. When mc is set to 10 and to 100, ACT\nsigni\ufb01cantly outperforms ICET for most datasets. In these cases ACT was able to produce very\nsmall trees, sometimes consisting of a single node, neglecting the accuracy of the learned model. 
For mc\nset to 1000 and 10000 there are fewer signi\ufb01cant wins, yet it is clear that ACT is dominating: the\nnumber of ACT wins is higher and the average results indicate that ACT trees are cheaper. The\nWilcoxon test states that for mc = 10, 100, 10000, ACT is signi\ufb01cantly better than ICET, and that\nfor mc = 1000 no signi\ufb01cant winner was found.\n\n4 Seeded ICET includes the true costs in the initial population and was reported to perform better [24].\n\nFigure 4: Average cost (left most) and accuracy (mid-left) as a function of misclassi\ufb01cation cost. Average cost\nas a function of time for Breast-cancer-20 (mid-right) and Multi-XOR-80 (right most).\n[Panels, left to right: average cost vs. misclassi\ufb01cation cost (EG2, ICET, ACT); average accuracy vs. misclassi\ufb01cation cost (C4.5, ICET, ACT); average cost vs. time in seconds (EG2, ICET, ACT) for each of the two datasets.]\n\nWhen misclassi\ufb01cation costs are low, an optimal algorithm would produce a very shallow tree.\nWhen misclassi\ufb01cation costs are dominant, an optimal algorithm would produce a highly accurate\ntree. Some concepts, however, are not easily learnable and even cost-insensitive algorithms fail\nto achieve perfect accuracy on them. Hence, with the increase in the importance of accuracy the\nnormalized cost increases: the predictive errors affect the cost more dramatically. To learn more\nabout the effect of accuracy, we compared the accuracy of ACT to that of C4.5 and ICET for the different mc\nvalues. Figure 4 (mid-left) shows the results. 
An important property of both ICET and ACT is their\nability to compromise on accuracy when needed. ACT\u2019s \ufb02exibility, however, is more noteworthy:\nfrom the least accurate method it becomes the most accurate one. Interestingly, when accuracy is\nextremely important both ICET and ACT achieve even better accuracy than C4.5. The reason is\ntheir non-greedy nature. ICET performs an implicit lookahead by reweighting attributes according\nto their importance. ACT performs lookahead by sampling the space of subtrees under every split.\nOf the two, the results indicate that ACT\u2019s lookahead is more ef\ufb01cient in terms of accuracy.\nWe also compared ACT to LSID3. As expected, ACT was signi\ufb01cantly better for mc \u2264 1000.\nFor mc = 10000 their performance was similar. In addition, we compared the studied methods on\nnonuniform misclassi\ufb01cation costs and found ACT\u2019s advantage to be consistent.\n\n3.3 Anytime Comparison\n\nBoth ICET and ACT are anytime algorithms that improve their performance with time. ICET is\nexpected to exploit extra time by producing more generations and hence better tuning the parameters\nfor the \ufb01nal invocation of EG2. ACT can use additional time to acquire larger samples and hence\nachieve better cost estimations. A typical anytime algorithm would produce improved results with\nthe increase in resources. The improvements diminish with time, reaching a stable performance.\n\nTo examine the anytime behavior of ICET and ACT, we ran each of them on 2 problems, namely\nBreast-cancer-20 and Multi-XOR-80, with exponentially increasing time allocation. ICET was run\nwith 2, 4, 8, ... generations and ACT with sample sizes of 1, 2, 4, and so on. Figure 4 (mid-right and right most) plots the results. The\nresults show good anytime behavior of both ICET and ACT. For both algorithms, it is worthwhile\nto allocate more time. ACT dominates ICET for both domains and is able to produce trees of lower\ncosts in shorter time. 
The Multi-XOR dataset is an example for a concept with attributes being\nimportant only in one sub-concept. As we expected, ACT outperforms ICET signi\ufb01cantly because\nthe latter cannot assign context-based costs. Allowing ICET to produce more and more generations\n(up to 128) does not result in trees comparable to those obtained by ACT.\n\n4 Conclusions\n\nMachine learning techniques are increasingly being used to produce a wide-range of classi\ufb01ers for\nreal-world applications that involve nonuniform testing costs and misclassi\ufb01cation costs. As the\ncomplexity of these applications grows, the management of resources during the learning and clas-\nsi\ufb01cation processes becomes a challenging task. In this work we introduced a novel framework for\noperating in such environments. Our framework has 4 major advantages: (1) it uses a non-greedy\napproach to build a decision tree and therefore is able to overcome local minima problems, (2) it\nevaluates entire trees and therefore can be adjusted to any cost scheme that is de\ufb01ned over trees. (3)\nit exhibits good anytime behavior and produces signi\ufb01cantly better trees when more time is avail-\nable, and (4) it can be easily parallelized and hence can bene\ufb01t from distributed computer power.\n\n7\n\n\fTo evaluate ACT we have designed an extensive set of experiments with a wide range of costs. The\nexperimental results show that ACT is superior over ICET and EG2. Signi\ufb01cance tests found the\ndifferences to be statistically strong. ACT also exhibited good anytime behavior: with the increase\nin time allocation, there was a decrease in the cost of the learned models. ACT is a contract anytime\nalgorithm that requires its sample size to be pre-determined. 
In the future we intend to convert\nACT into an interruptible anytime algorithm by adopting the IIDT general framework [11]. In\naddition, we plan to apply monitoring techniques for optimal scheduling of ACT and to examine\nother strategies for evaluating subtrees.\n\nReferences\n[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In KDD, 2004.\n[2] V. Bayer-Zubek and T. G. Dietterich. Integrating learning from examples into the search for diagnostic policies. Journal of Artificial Intelligence Research, 24:263\u2013303, 2005.\n[3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.\n[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam\u2019s Razor. Information Processing Letters, 24(6):377\u2013380, 1987.\n[5] M. Boddy and T. L. Dean. Deliberation scheduling for problem solving in time constrained environments. Artificial Intelligence, 67(2):245\u2013285, 1994.\n[6] J. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. Brodley. Pruning decision trees with misclassification costs. In ECML, 1998.\n[7] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.\n[8] J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1\u201330, 2006.\n[9] P. Domingos. Metacost: A general method for making classifiers cost-sensitive. In KDD, 1999.\n[10] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.\n[11] S. Esmeir and S. Markovitch. Anytime learning of decision trees. Journal of Machine Learning Research, 8, 2007.\n[12] R. Greiner, A. J. Grove, and D. Roth. Learning cost-sensitive active classifiers. Artificial Intelligence, 139(2):137\u2013174, 2002.\n[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 
New York: Springer-Verlag, 2001.\n[14] D. Margineantu. Active cost-sensitive learning. In IJCAI, 2005.\n[15] P. Melville, M. Saar-Tsechansky, F. Provost, and R. J. Mooney. Active feature acquisition for classifier induction. In ICDM, 2004.\n[16] S. W. Norton. Generating better decision trees. In IJCAI, 1989.\n[17] M. Nunez. The use of background knowledge in decision tree induction. Machine Learning, 6:231\u2013250, 1991.\n[18] F. Provost and B. Buchanan. Inductive policy: The pragmatics of bias selection. Machine Learning, 20(1-2):35\u201361, 1995.\n[19] Z. Qin, S. Zhang, and C. Zhang. Cost-sensitive decision trees with multiple cost scales. Lecture Notes in Computer Science, AI, Volume 3339/2004:380\u2013390, 2004.\n[20] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.\n[21] S. Sheng, C. X. Ling, A. Ni, and S. Zhang. Cost-sensitive test strategies. In AAAI, 2006.\n[22] M. Tan and J. C. Schlimmer. Cost-sensitive concept learning of sensor use in approach and recognition. In Proceedings of the 6th International Workshop on Machine Learning, 1989.\n[23] P. Turney. Types of cost in inductive concept learning. In Workshop on Cost-Sensitive Learning at ICML, 2000.\n[24] P. D. Turney. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2:369\u2013409, 1995.", "award": [], "sourceid": 228, "authors": [{"given_name": "Saher", "family_name": "Esmeir", "institution": null}, {"given_name": "Shaul", "family_name": "Markovitch", "institution": null}]}