{"title": "Rapid Quality Estimation of Neural Network Input Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 45, "page_last": 51, "abstract": null, "full_text": "Rapid Quality Estimation of Neural \n\nNetwork Input Representations \n\nKevin J. Cherkauer \n\nJude W. Shav lik \n\nComputer Sciences Department, University of Wisconsin-Madison \n\n1210 W. Dayton St., Madison, WI 53706 \n\n{cherkauer,shavlik }@cs.wisc.edu \n\nAbstract \n\nThe choice of an input representation for a neural network can have \na profound impact on its accuracy in classifying novel instances. \nHowever, neural networks are typically computationally expensive \nto train, making it difficult to test large numbers of alternative \nrepresentations. This paper introduces fast quality measures for \nneural network representations, allowing one to quickly and ac(cid:173)\ncurately estimate which of a collection of possible representations \nfor a problem is the best. We show that our measures for ranking \nrepresentations are more accurate than a previously published mea(cid:173)\nsure, based on experiments with three difficult, real-world pattern \nrecognition problems. \n\n1 \n\nIntroduction \n\nA key component of successful artificial neural network (ANN) applications is an \ninput representation that suits the problem. However, ANNs are usually costly to \ntrain, preventing one from trying many different representations. In this paper, \nwe address this problem by introducing and evaluating three new measures for \nquickly estimating ANN input representation quality. Two of these, called [DBleaves \nand Min (leaves), consistently outperform Rendell and Ragavan's (1993) blurring \nmeasure in accurately ranking different input representations for ANN learning on \nthree difficult, real-world datasets. 
\n\n2 Representation Quality \n\nChoosing good input representations for supervised learning systems has been \nthe subject of diverse research in both connectionist (Cherkauer & Shavlik, 1994; \nKambhatla & Leen, 1994) and symbolic paradigms (Almuallim & Dietterich, 1994; \n\n\f46 \n\nK. J. CHERKAUER, J. W. SHA VLIK \n\nCaruana & Freitag, 1994; John et al., 1994; Kira & Rendell, 1992). Two factors \nof representation quality are well-recognized in this work: the ability to separate \nexamples of different classes (sufficiency of the representation) and the number of \nfeatures present (representational economy). We believe there is also a third impor(cid:173)\ntant component that is often overlooked, namely the ease of learning an accurate \nconcept under a given representation, which we call transparency. We define trans(cid:173)\nparency as the density of concepts that are both accurate (generalize well) and \nsimple (of low complexity) in the space of possible concepts under a given input \nrepresentation and learning algorithm. Learning an accurate concept will be more \nlikely if the concept space is rich in accurate concepts that are also simple, because \nsimple concepts require less search to find and less data to validate. \nIn this paper, we introduce fast transparency measures for ANN input represen(cid:173)\ntations. These are orders of magnitude faster than the wrapper method (John \net al., 1994), which would evaluate ANN representations by training and testing \nthe ANN s themselves. Our measures are based on the strong assumption that, \nfor a fixed input representation, information about the density of accurate, simple \nconcepts under a (fast) decision-tree learning algorithm will transfer to the concept \nspace of an ANN learning algorithm. 
Our experiments on three real-world datasets demonstrate that our transparency measures are highly predictive of representation quality for ANNs, implying that the transfer assumption holds surprisingly well for some pattern recognition tasks even though ANNs and decision trees are believed to work best on quite different types of problems (Quinlan, 1994).1 In addition, our Exper. 1 shows that transparency does not depend on representational sufficiency. Exper. 2 verifies this conclusion and also demonstrates that transparency does not depend on representational economy. Finally, Exper. 3 examines the effects of redundant features on the transparency measures, demonstrating that the ID3leaves measure is robust in the face of such features. \n\n2.1 Model-Based Transparency Measures \n\nWe introduce three new \"model-based\" measures that estimate representational transparency by sampling instances of roughly accurate concept models from a decision-tree space and measuring their complexities. If simple, accurate models are abundant, the average complexity of the sampled models will be low. If they are sparse, we can expect a higher complexity value. \nOur first measure, avg(leaves), estimates the expected complexity of accurate concepts as the average number of leaves in n randomly constructed decision trees that correctly classify the training set: \n\navg(leaves) = (1/n) Σ_{t=1..n} leaves(t) \n\nwhere leaves(t) is the number of leaves in tree t. Random trees are built top-down; features are chosen with uniform probability from those which further partition the training examples (ignoring example class). Tree building terminates when each leaf achieves class purity (i.e., the tree correctly classifies all the training examples). High values of avg(leaves) indicate high concept complexity (i.e., low transparency). 
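As a concrete illustration, the random-tree sampling behind avg(leaves) can be sketched as follows. This is our own simplified sketch for Boolean features, not the authors' code; the toy XOR dataset and all function names are ours, and min(leaves), introduced below, is simply the minimum over the same sample:

```python
import random

def leaves_of_random_tree(examples, features, rng):
    """Grow one random decision tree top-down to class purity; return its
    number of leaves. examples: list of (feature_dict, label) pairs with
    Boolean (0/1) feature values; features: list of feature names."""
    labels = [y for _, y in examples]
    if len(set(labels)) <= 1:
        return 1  # class-pure node becomes a leaf
    # Candidate features are those that further partition the examples;
    # one is chosen uniformly at random, ignoring example class.
    candidates = [f for f in features
                  if 0 < sum(x[f] for x, _ in examples) < len(examples)]
    if not candidates:
        return 1  # cannot split further
    f = rng.choice(candidates)
    left = [(x, y) for x, y in examples if x[f] == 0]
    right = [(x, y) for x, y in examples if x[f] == 1]
    rest = [g for g in features if g != f]
    return (leaves_of_random_tree(left, rest, rng) +
            leaves_of_random_tree(right, rest, rng))

def avg_and_min_leaves(examples, features, n=100, seed=0):
    """avg(leaves) and min(leaves) over n random class-pure trees."""
    rng = random.Random(seed)
    counts = [leaves_of_random_tree(examples, features, rng) for _ in range(n)]
    return sum(counts) / n, min(counts)

# Toy sanity check: class = XOR of features a and b, plus an irrelevant c.
# The most transparent trees here have 4 leaves; the worst have 8.
data = [({'a': a, 'b': b, 'c': c}, a ^ b)
        for a in (0, 1) for b in (0, 1) for c in (0, 1)]
avg_l, min_l = avg_and_min_leaves(data, ['a', 'b', 'c'], n=50)
```

Because every sampled tree is grown to 100% training accuracy, low leaf counts signal that simple, accurate concepts are dense in the tree space.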
\nThe second measure, min(leaves), finds the minimum number of leaves over the n randomly constructed trees instead of the average, to reflect the fact that learning systems try to make intelligent, not random, model choices: \n\nmin(leaves) = min_{t=1..n} leaves(t) \n\n1 We did not preselect datasets based on whether our experiments upheld the transfer assumption. We report the results for all datasets that we have tested our transparency measures on. \n\nTable 1: Summary of datasets used. \n\nDataset | Examples | Classes | Cross-Validation Folds \nDNA | 20,000 | 6 | 4 \nNIST | 3,471 | 10 | 10 \nMagellan | 625 | 2 | 4 \n\nThe third measure, ID3leaves, simply counts the number of leaves in the tree grown by Quinlan's (1986) ID3 algorithm: \n\nID3leaves = leaves(ID3 tree) \n\nWe always use the full ID3 tree (100% correct on the training set). This measure assumes that the complexity of the concept ID3 finds depends on the density of simple, accurate models in its space and thus reflects the true transparency. \nAll these measures fix tree training-set accuracy at 100%, so simpler trees imply more accurate generalization (Fayyad, 1994) as well as easier learning. This lets us estimate transparency without the additional, multiplicative computational expense of cross-validating each tree. It also lets us use all the training data for tree building. \n\n2.2 \"Blurring\" as a Transparency Measure \n\nRendell and Ragavan (1993) address ease of learning explicitly and present a metric for quantifying it called blurring. In their framework, the less a representation requires the use of feature interactions to produce accurate concepts, the more transparent it is. Blurring heuristically estimates this by measuring the average information content of a representation's individual features. 
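Concretely, measuring average per-feature information content amounts to averaging information gains. The following minimal sketch (our own toy data and helper names, not the paper's code) implements blurring directly as the negated mean information gain of the features:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, f):
    """Information gain of Boolean feature f with respect to the class."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in (0, 1):
        sub = [y for x, y in examples if x[f] == v]
        if sub:
            gain -= (len(sub) / len(labels)) * entropy(sub)
    return gain

def blurring(examples, features):
    """Negated mean per-feature information gain; lower values
    (i.e., higher average gain) suggest a more transparent representation."""
    return -sum(info_gain(examples, f) for f in features) / len(features)

# Toy data: 'rel' equals the class label, 'noise' is uninformative.
data = [({'rel': 0, 'noise': 0}, 0), ({'rel': 0, 'noise': 1}, 0),
        ({'rel': 1, 'noise': 0}, 1), ({'rel': 1, 'noise': 1}, 1)]
```

On this toy set, 'rel' carries one full bit of information about the class while 'noise' carries none, so blurring evaluates to -0.5.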
Blurring is equivalent to the (negation of the) average information gain (Quinlan, 1986) of a representation's features with respect to a training set, as we show in Cherkauer and Shavlik (1995). \n\n3 Evaluating the Transparency Measures \n\nWe evaluate the transparency measures on three problems: DNA (predicting gene reading frames; Craven & Shavlik, 1993), NIST (recognizing handwritten digits; \"FI3\" distribution), and Magellan (detecting volcanos in radar images of the planet Venus; Burl et al., 1994).2 The datasets are summarized in Table 1. \nTo assess the different transparency measures, we follow these steps for each dataset in Exper. 1 and 2: \n\n1. Construct several different input representations for the problem. \n2. Train ANNs using each representation and test the resulting generalization accuracy via cross validation (CV). This gives us a (costly) ground-truth ranking of the relative qualities of the different representations. \n3. For each transparency measure, compute the transparency score of each representation. This gives us a (cheap) predicted ranking of the representations from each measure. \n4. For each transparency measure, compute Spearman's rank correlation coefficient between the ground-truth and predicted rankings. The higher this correlation, the better the transparency measure predicts the true ranking. \n\n2 On these problems, we have found that ANNs generalize 1-6 percentage points better than decision trees using identical input representations, motivating our desire to develop fast measures of ANN input representation quality. \n\nTable 2: User CPU seconds on a Sun SPARCstation 10/30 for the largest representation of each dataset. Parenthesized numbers are standard deviations over 10 runs. \n\nDataset | Blurring | ID3leaves | Min/Avg(leaves) | Backprop \nDNA | 1.68 (2.38) | 1,245 (3.96) | 13,444 (56.25) | 212,900 \nNIST | 2.69 (2.31) | 221 (2.75) | 1,558 (5.00) | 501,400 \nMagellan | 0.21 (0.15) | 1 (0.07) | 12 (0.13) | 6,300 \n\nIn Exper. 3 we rank only two representations at a time, so instead of computing a rank correlation in step 4, we just count the number of pairs ranked correctly. \nWe created input representations (step 1) with an algorithm we call RS (\"Representation Selector\"). RS first constructs a large pool of plausible, domain-specific Boolean features (5,460 features for DNA, 251,679 for NIST, 33,876 for Magellan). For each CV fold, RS sorts the features by information gain on the entire training set. Then it scans the list, selecting each feature that is not strongly pairwise dependent on any feature already selected, according to a standard chi-squared independence test. \nThis produces a single reasonable input representation, R1.3 To obtain the additional representations needed for the ranking experiments, we ran RS several times with successively smaller subsets of the initial feature pool, created by deleting features whose training-set information gains were above different thresholds. For each dataset, we made nine additional representations of varying qualities, labeled R2-R10, numbered from least to most \"damaged\" initial feature pool. \nTo get the ground-truth ranking (step 2), we trained feed-forward ANNs with backpropagation using each representation and one output unit per class. We tried several different numbers of hidden units in one layer and used the best CV accuracy among these (Fig. 1, left) to rank each input representation for ground truth. Each transparency measure also predicted a ranking of the representations (step 3). A CPU time comparison is in Table 2. 
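The RS scan described above can be sketched at toy scale as follows. This is a simplified illustration, not the authors' implementation: the feature pool is tiny, information gain and the 2x2 chi-squared statistic are coded directly, and 3.841 is the usual df=1 critical value at the 0.05 level (the paper does not state its exact threshold):

```python
import math
from collections import Counter

def info_gain(examples, f):
    """Information gain of Boolean feature f with respect to the class."""
    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in (0, 1):
        sub = [y for x, y in examples if x[f] == v]
        if sub:
            gain -= (len(sub) / len(labels)) * entropy(sub)
    return gain

def chi2_stat(xs, ys):
    """Chi-squared statistic for two Boolean columns (2x2 contingency table)."""
    n = len(xs)
    obs = Counter(zip(xs, ys))
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = xs.count(a) * ys.count(b) / n
            if expected > 0:
                stat += (obs[(a, b)] - expected) ** 2 / expected
    return stat

def rs_select(examples, features, critical=3.841):
    """Greedy RS-style scan: visit features in order of decreasing information
    gain, keeping each one that is not strongly pairwise dependent
    (chi-squared, df=1) on any feature already selected."""
    ranked = sorted(features, key=lambda f: info_gain(examples, f), reverse=True)
    cols = {f: [x[f] for x, _ in examples] for f in features}
    chosen = []
    for f in ranked:
        if all(chi2_stat(cols[f], cols[g]) < critical for g in chosen):
            chosen.append(f)
    return chosen

# Toy pool: f1 predicts the class, f2 duplicates f1, f3 is independent noise.
data = [({'f1': y, 'f2': y, 'f3': i % 2}, y) for i, y in enumerate([0, 0, 1, 1])]
selected = rs_select(data, ['f1', 'f2', 'f3'])
```

Here f2 duplicates f1 and is rejected by the dependence test, while the uninformative but independent f3 survives the scan.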
This table and the experiments below report min(leaves) and avg(leaves) results from sampling 100 random trees, but sampling only 10 trees (giving a factor-of-10 speedup) yields similar ranking accuracy. \nFinally, in Exper. 1 and 2 we evaluate each transparency measure (step 4) using Spearman's rank correlation coefficient, \n\nrs = 1 - (6 Σ_i di^2) / (m (m^2 - 1)), \n\nbetween the ground-truth and predicted rankings (m is the number of representations (10); di is the ground-truth rank (an integer between 1 and 10) minus the transparency rank). We evaluate the transparency measures in Exper. 3 by counting the number (out of ten) of representation pairs each measure orders the same as ground truth. \n\n4 Experiment 1: Transparency vs. Sufficiency \n\nThis experiment demonstrates that our transparency measures are good predictors of representation quality and shows that transparency does not depend on representational sufficiency (ability to separate examples). In this experiment we used transparency to rank ten representations for each dataset and compared the rankings to the ANN ground truth using the rank correlation coefficient. RS created the representations by adding features until each representation could completely separate the training data into its classes. Thus, representational sufficiency was held constant. (The number of features could vary across representations.) \n\n3 Though feature selection is not the focus of this paper, note that similar feature selection algorithms have been used by others for machine learning applications (Baim, 1988; Battiti, 1994). \n\n[Figure 1 appears here.] \n\nFigure 1: Left: Exper. 1 and 2 ANN CV test-set accuracies (y axis; error bars are 1 SD) used to rank the representations R1-R10 (x axis), one panel per dataset (DNA, NIST, Magellan). Right: Exper. 1 and 2 transparency rankings compared to ground truth; rs: rank correlation coefficient (see text). The right-hand tables are: \n\nDNA dataset (Exp1 rs / Exp2 rs): ID3leaves 0.99 / 0.95; Min(leaves) 0.94 / 0.99; Avg(leaves) 0.78 / 0.96; Blurring 0.78 / 0.81. \nNIST dataset (Exp1 rs / Exp2 rs): all four measures 1.00 / 1.00. \nMagellan dataset (Exp1 rs / Exp2 rs): ID3leaves 0.81 / 0.78; Min(leaves) 0.83 / 0.76; Avg(leaves) 0.71 / 0.71; Blurring 0.48 / 0.73. \n\nThe rank correlation results are shown in Fig. 1 (right). ID3leaves and min(leaves) outperform the less sophisticated avg(leaves) and blurring measures on datasets where there is a difference. On the NIST data, all measures produce perfect rankings. The confidence that a true correlation exists is greater than 0.95 for all measures and datasets except blurring on the Magellan data, where it is 0.85. 
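For reference, the Spearman computation used in step 4 is a direct transcription of the formula above (variable names are ours; valid for tie-free rankings):

```python
def spearman_rs(true_ranks, predicted_ranks):
    """Spearman's rank correlation rs = 1 - 6*sum(di^2) / (m*(m^2 - 1))
    for two tie-free rankings of the same m items."""
    m = len(true_ranks)
    d2 = sum((t - p) ** 2 for t, p in zip(true_ranks, predicted_ranks))
    return 1.0 - 6.0 * d2 / (m * (m ** 2 - 1))

# Ten representations ranked identically except for one adjacent swap:
truth = list(range(1, 11))
pred = [1, 2, 3, 4, 5, 6, 7, 8, 10, 9]
```

A single adjacent swap among ten representations still yields rs of about 0.988, so only substantial ranking disagreements pull rs down.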
\nThe high rank correlations we observe imply that our transparency measures capture a predictive factor of representation quality. This factor does not depend on representational sufficiency, because sufficiency was equal for all representations. \n\nTable 3: Exper. 3 results: correct rankings (out of 10) by the transparency measures of the corresponding representation pairs, Ri vs. Ri', from Exper. 1 and Exper. 2. \n\nDataset | ID3leaves | Min(leaves) | Avg(leaves) | Blurring \nDNA | 10 | 0 | 0 | 0 \nNIST | 10 | 0 | 0 | 0 \nMagellan | (all measures near chance; exact values not recoverable) \n\n5 Experiment 2: Transparency vs. Economy \n\nThis experiment shows that transparency does not depend on representational economy (number of features), and it verifies Exper. 1's conclusion that it does not depend on sufficiency. It also reaffirms the predictive power of the measures. \nIn Exper. 1, sufficiency was held constant, but economy could vary. Exper. 2 demonstrates that transparency does not depend on economy by equalizing the number of features and redoing the comparison. In Exper. 2, RS added extra features to each representation used in Exper. 1 until they all contained a fixed number of features (200 for DNA, 250 for NIST, 100 for Magellan). Each Exper. 2 representation, Ri' (i = 1, ..., 10), is thus a proper superset of the corresponding Exper. 1 representation, Ri. All representations for a given dataset in Exper. 2 have an identical number of features and allow perfect classification of the training data, so neither economy nor sufficiency can affect the transparency scores now. \nThe results (Fig. 1, right) are similar to Exper. 1's. The notable changes are that blurring is not as far behind ID3leaves and min(leaves) on the Magellan data as before, and avg(leaves) has joined the accuracy of the other two model-based measures on the DNA data. The confidence that correlations exist is above 0.95 in all cases. 
\nAgain, the high rank correlations indicate that transparency is a good predictor of representation quality. Exper. 2 shows that transparency does not depend on representational economy or sufficiency, as both were held constant here. \n\n6 Experiment 3: Redundant Features \n\nExper. 3 tests the transparency measures' predictions when the number of redundant features varies, as ANNs can often use redundant features to advantage (Sutton & Whitehead, 1993), an ability generally not attributed to decision trees. \nExper. 3 reuses the representations Ri and Ri' (i = 1, ..., 10) from Exper. 1 and 2. Recall that Ri' ⊃ Ri. The extra features in each Ri' are redundant, as they are not needed to separate the training data. We show the number of Ri vs. Ri' representation pairs each transparency measure ranks correctly for each dataset (Table 3). For DNA and NIST, the redundant representations always improved ANN generalization (Fig. 1, left; 0.05 significance). Only ID3leaves predicted this correctly, finding smaller trees with the increased flexibility afforded by the extra features. The other measures were always incorrect, because the lower-quality redundant features degraded the random trees (avg(leaves), min(leaves)) and the average information gain (blurring). For Magellan, ANN generalization was only significantly different for one representation pair, and all measures performed near chance. \n\n7 Conclusions \n\nWe introduced the notion of transparency (the prevalence of simple and accurate concepts) as an important factor of input representation quality and developed inexpensive, effective ways to measure it. Empirical tests on three real-world datasets demonstrated these measures' accuracy at ranking representations for ANN learning at much lower computational cost than training the ANNs themselves. 
Our next step will be to use transparency measures as scoring functions in algorithms that apply extensive search to find better input representations. \n\nAcknowledgments \n\nThis work was supported by ONR grant N00014-93-1-099S, NSF grant CDA-9024618 (for CM-5 use), and a NASA GSRP fellowship held by KJC. \n\nReferences \n\nAlmuallim, H. & Dietterich, T. (1994). Learning Boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2):279-305. \nBaim, P. (1988). A method for attribute selection in inductive learning systems. IEEE Transactions on Pattern Analysis & Machine Intelligence, 10(6):888-896. \nBattiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537-550. \nBurl, M., Fayyad, U., Perona, P., Smyth, P., & Burl, M. (1994). Automating the hunt for volcanoes on Venus. In IEEE Computer Society Conf on Computer Vision & Pattern Recognition: Proc, Seattle, WA. IEEE Computer Society Press. \nCaruana, R. & Freitag, D. (1994). Greedy attribute selection. In Machine Learning: Proc 11th Intl Conf, (pp. 28-36), New Brunswick, NJ. Morgan Kaufmann. \nCherkauer, K. & Shavlik, J. (1994). Selecting salient features for machine learning from large candidate pools through parallel decision-tree construction. In Kitano, H. & Hendler, J., eds., Massively Parallel Artificial Intel. MIT Press, Cambridge, MA. \nCherkauer, K. & Shavlik, J. (1995). Rapidly estimating the quality of input representations for neural networks. In Working Notes, IJCAI Workshop on Data Engineering for Inductive Learning, (pp. 99-108), Montreal, Canada. \nCraven, M. & Shavlik, J. (1993). Learning to predict reading frames in E. coli DNA sequences. In Proc 26th Hawaii Intl Conf on System Science, (pp. 773-782), Wailea, HI. IEEE Computer Society Press. \nFayyad, U. (1994). Branching on attribute values in decision tree generation. 
In Proc 12th Natl Conf on Artificial Intel, (pp. 601-606), Seattle, WA. AAAI/MIT Press. \nJohn, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Machine Learning: Proc 11th Intl Conf, (pp. 121-129), New Brunswick, NJ. Morgan Kaufmann. \nKambhatla, N. & Leen, T. (1994). Fast non-linear dimension reduction. In Advances in Neural Info Processing Sys (vol 6), (pp. 152-159), San Francisco, CA. Morgan Kaufmann. \nKira, K. & Rendell, L. (1992). The feature selection problem: Traditional methods and a new algorithm. In Proc 10th Natl Conf on Artificial Intel, (pp. 129-134), San Jose, CA. AAAI/MIT Press. \nQuinlan, J. (1986). Induction of decision trees. Machine Learning, 1:81-106. \nQuinlan, J. (1994). Comparing connectionist and symbolic learning methods. In Hanson, S., Drastal, G., & Rivest, R., eds., Computational Learning Theory & Natural Learning Systems (vol I: Constraints & Prospects). MIT Press, Cambridge, MA. \nRendell, L. & Ragavan, H. (1993). Improving the design of induction methods by analyzing algorithm functionality and data-based concept complexity. In Proc 13th Intl Joint Conf on Artificial Intel, (pp. 952-958), Chambéry, France. Morgan Kaufmann. \nSutton, R. & Whitehead, S. (1993). Online learning with random representations. In Machine Learning: Proc 10th Intl Conf, (pp. 314-321), Amherst, MA. Morgan Kaufmann. \n", "award": [], "sourceid": 1139, "authors": [{"given_name": "Kevin", "family_name": "Cherkauer", "institution": null}, {"given_name": "Jude", "family_name": "Shavlik", "institution": null}]}